Chaos Days are just one of many tools for improving system resilience. Others include:
AWS Game Days. AWS runs these events to teach design and diagnosis techniques for improving resilience, using a simulated, AWS-based production service. They are intense and great fun, but they don’t teach you anything about your own system.
Per-feature chaos testing. When a team builds a new feature, they run manual or automated experiments as part of its testing to explore the feature’s impact on system resilience. This can be a good way to introduce chaos-engineering principles and to help teams shift operability thinking left (i.e., consider it earlier in the engineering process, rather than when the first production issue hits).
Purple team security exercises. These exercises help identify vulnerabilities and weaknesses in a product by simulating the behaviours and techniques of malicious attackers in the most realistic way possible.
Automated failure injection. Tools such as Gremlin and Netflix’s Chaos Monkey can inject failures regularly but at random, testing the system’s response on an ongoing basis.
Production incidents. Treat production incidents as learning opportunities or, in the words of John Allspaw, “incidents are unplanned investments”. If they are managed well (see Google’s SRE book and Etsy’s debriefing guide), they yield valuable, firsthand insights, because everything about the chaos is real. Live issues can be costly to the business, so it pays to extract as much value from them as possible, in the form of a better understanding of the system and ideas for resilience improvements.
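To make the automated failure injection item above more concrete, here is a minimal Python sketch of the underlying idea: wrap a call to a dependency so that a configurable fraction of calls fail at random, then check that the caller degrades gracefully. This is not the API of Gremlin or Chaos Monkey. The names `FailureInjector`, `InjectedFailure`, `fetch_price` and `price_with_fallback` are hypothetical and exist only for illustration.

```python
import random
from typing import Any, Callable, Optional


class InjectedFailure(Exception):
    """Raised instead of making the real call when a fault is injected."""


class FailureInjector:
    """Hypothetical helper that fails a random fraction of wrapped calls."""

    def __init__(self, failure_rate: float, rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        # A seeded random.Random can be passed in to make an experiment repeatable.
        self.rng = rng if rng is not None else random.Random()
        self.calls = 0
        self.injected = 0

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        """Invoke func, or raise InjectedFailure with probability failure_rate."""
        self.calls += 1
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise InjectedFailure(f"injected failure on call {self.calls}")
        return func(*args, **kwargs)


def fetch_price(item: str) -> float:
    """Stand-in for a real downstream dependency (assumed for this sketch)."""
    return {"apple": 0.5}.get(item, 1.0)


def price_with_fallback(injector: FailureInjector, item: str, fallback: float = 0.0) -> float:
    """The behaviour under test: fall back to a default when the dependency fails."""
    try:
        return injector.call(fetch_price, item)
    except InjectedFailure:
        return fallback


if __name__ == "__main__":
    injector = FailureInjector(failure_rate=0.2, rng=random.Random(7))
    prices = [price_with_fallback(injector, "apple", fallback=-1.0) for _ in range(20)]
    print(f"{injector.injected} of {injector.calls} calls had a failure injected")
```

In a real system the injection would usually sit at the infrastructure level (terminating instances, adding network latency) rather than in application code, but the experiment has the same shape: introduce a fault at random and observe whether the system copes.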