This section is a bare-bones version of the what and how of running a Chaos Day. The rest of the playbook expands each step, covering the key outcomes, common problems and examples of application.
If you already know what Chaos Days are, why they are beneficial and that you’re ready to run one, read on. If you’d first like to get a firmer understanding, skip over this section and read What and why.
The steps for running a Chaos Day are:
Plan who, what, when, where.
Execute the experiments.
Review execution, impact, response and chaos mechanics.
Share knowledge gained.
Start small: involve one or two teams, not the entire engineering group.
Identify a few of the most experienced engineers across those teams. They will be the agents of chaos, who will design and execute the experiments.
At least two weeks ahead of Chaos Day, facilitate a planning session with the agents of chaos. Draw the system architecture on a whiteboard (or remote equivalent), then use post-its and/or a Trello board to brainstorm possible experiments that simulate a failure that your system should tolerate. Don’t focus on failures that you have no control over, such as an outage within your cloud provider, as they have low learning value. For each experiment, consider:
Failure mode (e.g., partial connectivity loss, an instance being terminated, network slowdown).
Expected impact in both technical and business terms (e.g., dependent services fail, or in-progress customer transactions are halted).
Anticipated response (e.g., the service auto-heals, an alert is fired, or nobody notices).
If the failure remains unresolved, how would the injected fault be rolled back?
Should the experiment be run in isolation (e.g., would a degradation in monitoring limit learning from other experiments)?
Which environment will it be run on? Our experience suggests that using the same pre-production environment for all experiments makes execution easier and still provides valuable learning, without the costs and risks of production.
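The considerations above can be captured as one record per experiment, for example on the back of each card. The following is a minimal sketch, not a prescribed format: the class name, field names and example values are all hypothetical, chosen only to mirror the list above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosExperiment:
    """One candidate experiment from the planning session (hypothetical schema)."""

    name: str
    failure_mode: str               # e.g. "an instance being terminated"
    expected_technical_impact: str  # e.g. "dependent services fail"
    expected_business_impact: str   # e.g. "in-progress transactions are halted"
    anticipated_response: str       # e.g. "an alert is fired"
    rollback: str                   # how the injected fault is reverted
    run_in_isolation: bool          # True if it would limit learning from others
    environment: str = "pre-production"

    def missing_fields(self) -> list[str]:
        """Return the names of text fields that were left blank."""
        return [
            field
            for field, value in vars(self).items()
            if isinstance(value, str) and not value.strip()
        ]


# Illustrative values only.
example = ChaosExperiment(
    name="terminate-instance",
    failure_mode="an instance being terminated",
    expected_technical_impact="dependent services fail",
    expected_business_impact="in-progress customer transactions are halted",
    anticipated_response="the service auto-heals",
    rollback="start a replacement instance manually",
    run_in_isolation=False,
)
print(example.missing_fields())
```

Checking for blank fields before shortlisting is one way to spot experiments whose rollback or expected impact has not yet been thought through.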
Shortlist 4–8 experiments based on business risk and learning opportunity (e.g., what failure mode would have the greatest risk to the business and a system response that you’re uncertain about?).
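If the brainstorm produces more candidates than you can run, a simple score can help order the discussion. The sketch below assumes each candidate has been given two 1-to-5 ratings by the agents of chaos, one for business risk and one for how uncertain the system response is; the scale, the multiplication and all names are assumptions for illustration, not part of the playbook.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    business_risk: int         # 1 (low) to 5 (high), assumed scale
    response_uncertainty: int  # 1 (confident) to 5 (no idea), assumed scale


def shortlist(candidates, limit=8):
    """Rank candidates by risk x uncertainty and keep at most `limit`."""
    ranked = sorted(
        candidates,
        key=lambda c: c.business_risk * c.response_uncertainty,
        reverse=True,
    )
    return ranked[:limit]


# Illustrative ratings only.
candidates = [
    Candidate("instance terminated", 4, 2),
    Candidate("partial connectivity loss", 5, 4),
    Candidate("network slowdown", 3, 5),
    Candidate("monitoring degraded", 2, 2),
]

for candidate in shortlist(candidates, limit=3):
    print(candidate.name)
```

The score is only a conversation starter: the final shortlist is still a judgment call for the agents of chaos.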
Prepare the experiments (hence, the two-week gap between planning and the main event), keeping them secret from the participating teams to maximise the realism of unexpected failures.
Choose a date for Chaos Day that won’t affect key business events (e.g., if the target environment may be severely degraded, check that this won’t delay any production releases that need to pass through it around that date).
Schedule post-chaos review meetings for participating teams (as soon after Chaos Day as possible).
Let participating teams know when Chaos Day is, and that they should treat failures in the target environment as if “production were on fire.”
Ensure that participating teams know which communication channel(s) to use (e.g., a public #pre-prod-incidents Slack channel), to aid documenting response timelines.
Provide a physical/remote space (and plenty of snacks) for the agents of chaos. Provide a facilitator to help the team keep pace through the experiments (we’ve found using the Trello board from planning to be helpful here, with additional columns for experiments: In Progress, Resolved by Agents, Resolved by Owning Team).
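The board columns described above amount to a small state machine, which a facilitator could also track in a script or spreadsheet. This is a rough sketch under assumptions: the "Planned" column name is invented here to stand for whatever columns the planning board already had, and the allowed moves are one plausible reading of the workflow.

```python
from enum import Enum


class Column(Enum):
    PLANNED = "Planned"  # assumed name for the existing planning column
    IN_PROGRESS = "In Progress"
    RESOLVED_BY_AGENTS = "Resolved by Agents"
    RESOLVED_BY_OWNING_TEAM = "Resolved by Owning Team"


# Which columns a card may move to from each column (an assumption).
ALLOWED_MOVES = {
    Column.PLANNED: {Column.IN_PROGRESS},
    Column.IN_PROGRESS: {
        Column.RESOLVED_BY_AGENTS,
        Column.RESOLVED_BY_OWNING_TEAM,
    },
    Column.RESOLVED_BY_AGENTS: set(),
    Column.RESOLVED_BY_OWNING_TEAM: set(),
}


def move(board, experiment, target):
    """Move an experiment card to `target`, rejecting moves that skip a step."""
    current = board[experiment]
    if target not in ALLOWED_MOVES[current]:
        raise ValueError(f"cannot move {experiment!r} from {current.value} to {target.value}")
    board[experiment] = target


def still_running(board):
    """List experiments whose injected fault has not been resolved yet."""
    return [name for name, column in board.items() if column is Column.IN_PROGRESS]


# Illustrative experiment names only.
board = {"terminate-instance": Column.PLANNED, "network-slowdown": Column.PLANNED}
move(board, "terminate-instance", Column.IN_PROGRESS)
move(board, "network-slowdown", Column.IN_PROGRESS)
move(board, "terminate-instance", Column.RESOLVED_BY_OWNING_TEAM)
print(still_running(board))
```

Keeping the two "Resolved" columns separate records whether the owning team fixed the failure or the agents had to roll it back, which is useful input for the reviews.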
Monitor each experiment closely, analysing and documenting impact and team response. A private Slack channel is useful (e.g., #agents-of-chaos), as is the Trello board.
Ensure experiments are concluded and normal service restored before the end of the day.
Run a review for each experiment with a wide group of people: the agents of chaos, those who responded or helped with diagnosis and resolution, plus their colleagues.
Structure each review as a post-incident review/post-mortem: walk through the timeline, discussing and documenting what people saw, thought, did (and didn’t do). Focus on surfacing new knowledge about system behaviour rather than on generating improvement tasks. If ideas for improvement come up, note them and assign an owner to consider them later, to avoid knee-jerk resilience solutions.
Identify improvements you could make to the mechanics of running the Chaos Day itself. Document them somewhere you and others can easily return to when you run the next one (improvements to this playbook are also most welcome!).
Disseminate the review write-ups as widely as possible, so other engineers, teams, and wider stakeholders can benefit from the new-found knowledge. This might include any combination of posting them on a wiki, sharing them on Slack and presenting them at a show-and-tell.