What experiments to run on a Chaos Day

Context

It can be tempting to launch into chaos engineering intending to break things in diverse and spectacular ways and then see what happens. This approach is certainly chaotic and may generate new insights, but it is not the desired type of chaos and is likely to give a poor return on investment. To avoid this pitfall, remind yourself and the team why you’re doing chaos engineering: to improve system resilience by learning how the whole system (product, process, people) responds to injected failures.
Improved resilience comes through learning, and learning comes in many forms. For example, brainstorming experiments will help participants learn about the focal product’s architecture and characteristics, and this learning may be drawn upon in the next production incident, reducing time to recover (measured over time, the Mean Time To Recover, MTTR, is one of the four key indicators of software delivery performance). Experiments should therefore be identified, selected and designed to optimise for learning potential, rather than to maximise inflicted damage, the number of remediation items identified, or the speed of recovery.

Experiment themes

Experiments take many forms and provide different types of lessons. Depending on the team’s context, it can be useful to group the experiments around a particular theme, or just aim for diversity. In running Chaos Days at various organisations we’ve seen the following themes emerge:
  1. Build confidence in the resilience of a service and the people and processes in place to operate it - Use when a new service or major architectural change is being introduced, or ahead of an upcoming peak event (e.g. Cyber-5 in the case of an online retailer). It provides an indication to the owning and ancillary teams (e.g. Operations or other Support teams) and stakeholders that the service is resilient, appropriately understood and supported. This theme is particularly useful for teams that have recently inherited a service or are new to the You Build It, You Run It model. To find out more see our You Build It, You Run It playbook, by Steve Smith and Bethan Timmins.
  2. Share knowledge and expertise across the team - Use to help new and existing team members fire-drill incident response in a safer setting than a real production issue. This builds architectural and domain knowledge, and familiarity with incident processes, observability tools and service runbooks. It is also particularly useful for testing out recent improvements to runbooks and/or telemetry (logging, metrics, alerting).
  3. Shape a resilience backlog - Use to identify gaps in resilience mechanisms, observability, runbooks or team knowledge, and to help prioritise what to invest in next.
  4. Test out improvements to process and team knowledge - If your team has suffered one or more production incidents that went badly and remediations have been put in place, use this theme to recreate similar failures and learn what gaps still exist.
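The MTTR metric mentioned earlier is simply the average duration from an incident starting to it being resolved. As a minimal illustrative sketch (not from the playbook; the function and variable names here are hypothetical), it could be computed from incident timestamps like this:

```python
from datetime import datetime, timedelta

def mean_time_to_recover(incidents):
    """Average recovery duration across incidents.

    `incidents` is a list of (started_at, resolved_at) datetime pairs,
    e.g. exported from an incident-tracking tool.
    """
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

# Two example incidents: one took 45 minutes to recover, one took 90.
incidents = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 9, 45)),
    (datetime(2023, 5, 8, 14, 0), datetime(2023, 5, 8, 15, 30)),
]
print(mean_time_to_recover(incidents))  # 1:07:30 (67.5 minutes)
```

Tracking this average over successive months gives the trend line that the learning from Chaos Days should, over time, push downwards.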