Ready for chaos?
Running a Chaos Day takes people’s time and system resources, so it needs to be scheduled as carefully as any other piece of work. Because its immediate benefit can seem less appealing or tangible than new features, investment in a Chaos Day is frequently put off.
This challenge can be addressed by starting with a smaller investment: a time-boxed system risk assessment using an approach such as FAIR (Factor Analysis of Information Risk). The assessment is an opportunity to explore what failures could happen, how often they are likely to occur and how large their impact would be. The result is meaningful, monetary data that can help stakeholders re-evaluate prioritising features over resilience.
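To make the frequency-times-magnitude idea concrete, here is a minimal Python sketch. It is a simplified point-estimate calculation, not the full FAIR method, which typically works with ranges and simulation rather than single numbers. The scenario names, figures and the `FailureScenario` and `rank_by_exposure` names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    """One failure the team thinks could plausibly happen (hypothetical model)."""
    name: str
    events_per_year: float  # estimated frequency of the failure
    loss_per_event: float   # estimated monetary impact of one occurrence

    @property
    def annual_loss_exposure(self) -> float:
        # Simplified estimate: frequency multiplied by magnitude.
        return self.events_per_year * self.loss_per_event


def rank_by_exposure(scenarios):
    """Return scenarios ordered from largest to smallest annual exposure."""
    return sorted(scenarios, key=lambda s: s.annual_loss_exposure, reverse=True)


# Illustrative numbers only; real values come out of the assessment workshop.
scenarios = [
    FailureScenario("Payment provider outage", 2, 40_000),
    FailureScenario("Database failover fails", 0.5, 250_000),
    FailureScenario("Cache eviction storm", 6, 5_000),
]

for scenario in rank_by_exposure(scenarios):
    print(f"{scenario.name}: {scenario.annual_loss_exposure:,.0f} per year")
```

A ranked list like this gives stakeholders a rough monetary comparison between resilience work and feature work, and it can suggest which failures are worth exploring first in a Chaos Day.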
Chaos Days provide particular benefits if run weeks or months before major changes are deployed to production, or ahead of traffic peaks such as Black Friday for e-commerce sites. Ensure there is a sufficient gap between consecutive events to allow for learning to be distilled and improvements to be applied. For one client with a very large platform (1,000 microservices, processing 1 billion requests on a peak day), we found that 2–3 Chaos Days each year was a suitable frequency for their context.
Despite the many benefits of Chaos Days, if your production system is regularly “on fire”, you probably have enough ready-made chaos to contend with! In this case, your focus should be on running and improving post-incident reviews to bring about system stability. Once you’ve had a few months free of repeated production issues, try running a small Chaos Day (in pre-production) to further explore system stability.