In a recent engagement, there was limited time to spend on proactive failure investigations. The digital platform team acted as facilitators for nine delivery teams, introducing chaos engineering principles and helping the teams better understand the digital services they were building and operating. Because time for the exercises was limited across all the teams, we ran 2.5-hour sessions that included two experiments and a post-incident review, instead of running full chaos days with multiple ongoing scenarios.
To choose experiments under those conditions, we ran experiment selection sessions with the team leads, picking potential failures to investigate and learn about based on two factors:
the level of impact a potential failure could have on the user, team, or organisation
whether the response to that potential failure, by the service, the team, or other parts of the organisation, was known or unknown
Working together with the team leads, we prioritised and selected experiments that allowed the team to investigate potential failures combining a high level of impact with an unknown response, because this would provide the best conditions for the team to understand more about how their service worked.
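The prioritisation described above can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the sessions: the `Impact` scale, the `PotentialFailure` type, the `prioritise` function, and every candidate except the authentication provider failure are hypothetical. The ordering rule (unknown responses first, then highest impact) is also an assumption about how the two factors combine.

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    # Hypothetical three-point scale for impact on the user, team, or organisation.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class PotentialFailure:
    name: str
    impact: Impact
    response_known: bool  # is the response to this failure already understood?


def prioritise(failures):
    """Order failures so that unknown responses come first, highest impact first."""
    # False sorts before True, so unknown responses lead; negating the impact
    # puts HIGH ahead of LOW within each group.
    return sorted(failures, key=lambda f: (f.response_known, -f.impact))


# Hypothetical candidates; only the authentication example comes from the article.
candidates = [
    PotentialFailure("Database failover", Impact.HIGH, response_known=True),
    PotentialFailure("Stale cache served after deploy", Impact.LOW, response_known=False),
    PotentialFailure("Request to authentication provider fails", Impact.HIGH, response_known=False),
]

# Each 2.5-hour session had room for two experiments.
selected = prioritise(candidates)[:2]
for failure in selected:
    print(failure.name)
```

In practice the ranking came out of a conversation with the team leads rather than a script, so a sketch like this is best treated as a way to record and revisit the outcome of that conversation.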
One team had built an authenticated user journey for managing personal details and payment methods. During an experiment selection exercise, we found a great example of a high-impact, unknown-response failure in that journey: if a request to the authentication provider failed to be sent, users could be prevented from logging in.
The potential failure had a high impact on the user experience, and because the authentication provider hosted the login pages, the team was unsure what the response would be. It was not clear whether any alerts would fire or whether the team would be notified.