As new online threats continually emerge, ensuring the security and reliability of digital systems is more important than ever for businesses. Many companies turn to chaos engineering, intentionally introducing failures to see how their systems respond and to help tech teams pinpoint and shore up weaknesses.
However, overlooking critical details can lead to incomplete or misleading results and few, if any, real improvements. Below, members of Forbes Technology Council detail essential elements of chaos engineering initiatives that are often missed but that significantly improve their effectiveness.
1. Cross-Functional Collaboration
Chaos engineering works best when employees comprehend and even embrace the practice. A common mistake is confining the practice to engineering or DevOps teams. Instead, involve members of other teams who can provide valuable perspectives on how it impacts their end of the business. Take the time to inform employees across departments about this strategy and its role within a technology practice. – Ricardo Madan, TEKsystems
2. Automated Failure Testing
Automating chaos engineering and making it a part of everyday life is often overlooked. It’s great if you run an experiment to see if your system can recover after a component fails. It’s even better if you automate experiments to continuously test failure and recovery. Rare, unpredictable events tend to be scary and poorly understood. Make failure boring by forcing your system to fail often. – Charity Majors, Honeycomb.io
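As an illustration of what automating an experiment might look like, here is a minimal Python sketch that repeatedly fails a randomly chosen component and checks that it recovers. Everything in it is an assumption made for illustration: the Component class stands in for a real service instance, and recover() stands in for whatever supervisor or orchestrator restarts things in your environment.

```python
import random
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical stand-in for a real service instance."""
    name: str
    healthy: bool = True

    def fail(self) -> None:
        self.healthy = False

    def recover(self) -> None:
        # Assumption: in a real system a supervisor or orchestrator does this.
        self.healthy = True


def run_experiment(components: list[Component], rng: random.Random) -> dict:
    """Fail one randomly chosen component, then check that it recovers."""
    target = rng.choice(components)
    target.fail()
    observed_failure = not target.healthy
    target.recover()
    return {
        "target": target.name,
        "observed_failure": observed_failure,
        "recovered": target.healthy,
    }


def run_schedule(components: list[Component], rounds: int, seed: int = 0) -> list[dict]:
    """Repeat the experiment so failure becomes routine rather than rare."""
    rng = random.Random(seed)  # seeded so a given run can be replayed
    return [run_experiment(components, rng) for _ in range(rounds)]


if __name__ == "__main__":
    fleet = [Component("api"), Component("cache"), Component("queue")]
    for result in run_schedule(fleet, rounds=5):
        print(result)
```

In practice, a loop like this would be triggered by a scheduler or a deployment pipeline rather than run by hand, which is what turns a one-off experiment into the continuous testing described above.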
3. Achievable Goals
Organizations need to set realistic expectations and goals. Being “chaos ready” for a single critical system is a high bar; for an organization’s entire infrastructure, it’s extremely difficult. My suggestion is to start with a small but meaningful boundary. Then, refactor, incorporate learnings and thoughtfully expand. – Elliott Cordo, Data Futures
4. A Baseline For What’s ‘Good’
Before introducing a chaos monkey into your testing and production environments, don’t overlook establishing a baseline of what “good” looks like. Assuming that the current state should be the baseline is a mistake. The motivation for introducing chaos engineering to improve robustness and recovery is usually that things are not great now. If your baseline is the current state, you won’t improve very much. – Evan J. Schwartz, AMCS Group
5. Team Buy-In
While chaos engineering focuses on breaking things, building trust within the team is often overlooked. Everyone needs to buy into the process, understanding that it’s about strengthening the system, not pointing fingers. Trust is the foundation for successful chaos experiments. – Ashok Manoharan, FocusLabs
6. A Balance Between Creativity And Realism
Maintaining a balance between creativity and realism in chaos engineering experiments is crucial. Introducing plausible failures relevant to actual operating conditions ensures meaningful results and practical system improvements. – Brian Sathianathan, Iterate.ai
7. The Human Factor
While much attention is given to systems and infrastructure in chaos engineering, preparing teams to respond effectively to unexpected failures is crucial. Regular drills and scenario planning ensure not just engineers, but all relevant teams, are ready to manage real incidents, making the organization truly resilient. – Andres Zunino, ZirconTech
8. Structured Reflection And Follow-Up
One often-overlooked aspect is the need for a solid post-experiment analysis framework. While inducing failure is crucial, the true value of chaos engineering lies in how teams analyze the results, identify systemic weaknesses and implement improvements. Without structured reflection and follow-up, the insights gained can easily be lost, undermining the entire exercise. – Miguel Llorca, Torrent Group
9. Clear Objectives And Metrics
It’s essential to define clear objectives and metrics before chaos engineering experiments. Without clear goals, it’s difficult to measure the impact or success of chaos tests. By establishing what you’re testing for—like system resilience, recovery time or failure points—teams can gather actionable insights, ensure experiments are purposeful and avoid unnecessary disruptions. – Ruchir Brahmbhatt, Ecosmob Technologies Private Limited
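One way to make objectives concrete is to write them down as thresholds before the experiment and compare observations against them afterward. The Python sketch below assumes two illustrative metrics, recovery time and error rate; the class names, field names and threshold values are hypothetical and are not drawn from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """Thresholds agreed on before the experiment runs (hypothetical fields)."""
    name: str
    max_recovery_seconds: float
    max_error_rate: float  # fraction of failed requests, 0.0 to 1.0


@dataclass(frozen=True)
class Observation:
    """What was actually measured during the experiment."""
    recovery_seconds: float
    error_rate: float


def evaluate(objective: Objective, observation: Observation) -> list[str]:
    """Return the unmet thresholds; an empty list means the objective was met."""
    misses = []
    if observation.recovery_seconds > objective.max_recovery_seconds:
        misses.append(
            f"recovery took {observation.recovery_seconds}s, "
            f"limit is {objective.max_recovery_seconds}s"
        )
    if observation.error_rate > objective.max_error_rate:
        misses.append(
            f"error rate was {observation.error_rate:.2%}, "
            f"limit is {objective.max_error_rate:.2%}"
        )
    return misses


if __name__ == "__main__":
    goal = Objective("cache node loss", max_recovery_seconds=30.0, max_error_rate=0.01)
    print(evaluate(goal, Observation(recovery_seconds=45.0, error_rate=0.02)))
```

Returning the list of misses, rather than a single pass or fail flag, gives the team something specific to act on after each experiment.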
10. A ‘Failures Are Learning Opportunities’ Mindset
One critical but often missed aspect of chaos engineering is an organization’s cultural readiness. Teams ought to be ready to accept failures as learning tools rather than as defeats. That mindset shift is big, but it’s critical if you’re going to realize real insights and achieve improvements in system resilience. – Sandro Shubladze, Datamam
11. High-Pressure Scenario Drills
Teams must be trained to handle high-pressure recovery scenarios during major infrastructure outages. When resets aren’t an option, the ability to recover depends on humans acting quickly under stress. Chaos engineering should include drills that simulate these situations, ensuring teams are prepared to manage real incidents efficiently. – Suman Sharma, Procyon Inc.
12. A Culture Of Psychological Safety
For chaos engineering to succeed, it’s critical to build a culture of psychological safety. Teams need to feel secure in experimenting, openly discussing failures and learning from them without fear of blame or repercussions. Such a culture encourages continuous improvement, making the insights gained from chaos experiments truly valuable and actionable. – Prashanthi Reddy, Wasl Group
13. Impact Zone Limits
One often-overlooked but essential aspect of chaos engineering is defining clear “blast radius” boundaries. Identifying the potential impact zone ensures experiments don’t affect critical systems or customers, containing chaos and reducing risk for more effective resilience engineering. – Sarah Choudhary, Ice Innovations
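A blast radius boundary can be enforced in code before any failure is injected. The following Python sketch assumes a simple policy invented for illustration: instances on a protected list are never touched, and an experiment may affect at most a fixed percentage of the whole fleet. The function name, parameters and instance names are hypothetical.

```python
def select_targets(instances: list[str], protected: set[str], max_percent: int) -> list[str]:
    """Pick experiment targets while respecting a blast radius limit.

    Protected instances are excluded outright, and the number of targets is
    capped at max_percent of the whole fleet (rounded down).
    """
    if not 0 <= max_percent <= 100:
        raise ValueError("max_percent must be between 0 and 100")
    limit = len(instances) * max_percent // 100  # integer math avoids float rounding
    eligible = [name for name in instances if name not in protected]
    return eligible[:limit]


if __name__ == "__main__":
    fleet = [f"web-{i}" for i in range(10)]
    print(select_targets(fleet, protected={"web-0"}, max_percent=20))
```

The cap is measured against the full fleet, not just the eligible instances, so protecting more systems never increases the share of the fleet that an experiment is allowed to disrupt.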
14. Clear And Timely Communication
An important but often-overlooked aspect of chaos engineering? Communication! You can unleash all the digital mayhem you want, but if your team isn’t in the loop, you’ve just created real chaos. Make sure everyone knows it’s a drill, or you’ll have engineers panicking like they’re in a zombie apocalypse—except the zombies are 404 errors! – Nikhil Jathar, AvanSaber Technologies
15. True Unpredictability
An aspect of chaos engineering that is often overlooked is the need to create a wide variety of random incidents that can occur in development environments. Innovation comes from true unpredictability. If engineers expect random chaotic scenarios to emerge from a limited pool, that’s not very chaotic, is it? – Syed Ahmed, Act-On Software