What is Chaos Engineering?
Chaos engineering is a methodology that helps developers attain consistent reliability by hardening distributed services against failures in production. Another way to think about chaos engineering is that it's about embracing the inherent chaos in complex systems and, through experimentation, growing confidence in your solution's ability to handle it. A common way to introduce chaos is to deliberately inject faults that cause system components to fail. The goal is to observe, monitor, respond to, and improve your system's reliability under adverse circumstances.
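To make fault injection concrete, here is a minimal Python sketch. It assumes a hypothetical `get_inventory` service call and wraps it so that a configurable fraction of calls fail on purpose. The names `inject_faults`, `InjectedFault` and `call_with_fallback` are made up for illustration and do not come from any chaos engineering tool.

```python
import random
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def inject_faults(failure_rate, rng=None):
    """Wrap a function so that a fraction of its calls fail on purpose."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Fail before calling the real function, like a dropped request.
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_fallback(func, fallback):
    """Return func(), or the fallback value when a fault was injected."""
    try:
        return func()
    except InjectedFault:
        return fallback


@inject_faults(failure_rate=0.3, rng=random.Random(42))
def get_inventory():
    # Hypothetical stand-in for a real service call.
    return {"sku-1": 10}


results = [call_with_fallback(get_inventory, fallback={}) for _ in range(1000)]
degraded = sum(1 for r in results if r == {})
print(f"{degraded} of {len(results)} calls were served from the fallback")
```

What matters here is less the failure itself than what the caller does with it: the experiment checks whether the fallback path keeps the system usable while the fault is active.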
Why Chaos Engineering?
Contrary to what the name may suggest, chaos events are not performed in a chaotic fashion. The goal of chaos engineering is to identify weaknesses in a system through controlled experiments that introduce random and unpredictable behavior. A main benefit is that organizations can find vulnerabilities before an attacker exploits them or before they cause a system failure. Put simply, a Game Day or Chaos Day is like a fire drill.
Chaos Day / Game Day: Planned Failure
About a week prior to the event, our champion sent a company-wide email on behalf of the chaos team.
For any Game Day, the exact target or targets should be specified. Without them, it’s impossible to bring in the right people to run and observe the Game Day. The target could be as simple as “Cassandra cluster” or “Inventory service”, in which case at least those who run or use the service can decide whether they can or want to attend.
Game Day Participants and Roles
Chaos General: This person owns the experiments that are going to be run.
Chaos Commander: This person is in charge of implementing the chaos engineering experiments.
Chaos Scribe: This person observes and writes down notes as the Game Day happens.
Chaos Observer: This person tests the user experience and watches the monitoring and alerting dashboards.
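One way to check that a Game Day is ready to announce is to write the plan down as data. The sketch below is only an illustration: the `GameDayPlan` class, the role keys and the people's names are all hypothetical. It flags a plan that has no target or that leaves one of the four roles unassigned.

```python
from dataclasses import dataclass, field

# The four roles described above, used as dictionary keys.
ROLES = ("general", "commander", "scribe", "observer")


@dataclass
class GameDayPlan:
    targets: list                               # e.g. ["Cassandra cluster"]
    roles: dict = field(default_factory=dict)   # role -> person

    def problems(self):
        """List what is still missing before invitations can go out."""
        issues = []
        if not self.targets:
            issues.append("no target specified")
        for role in ROLES:
            if not self.roles.get(role):
                issues.append(f"no chaos {role} assigned")
        return issues


# Hypothetical plan: the target is set, but nobody is the observer yet.
plan = GameDayPlan(
    targets=["Cassandra cluster"],
    roles={"general": "Asha", "commander": "Ben", "scribe": "Chen"},
)
print(plan.problems())  # prints ['no chaos observer assigned']
```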
What is a Chaos Schedule?
- Start in a war room or on a Zoom call with all participants
- Whiteboard the architecture to clear up assumptions and form hypotheses
- Define the test cases and scope (the blast radius)
- Execute the tests and monitor the results
Key questions to ask as tests are being conducted:
- Do we have enough information?
- Is the behavior what we expected?
- What would the customer see if this were to happen?
- What’s happening to systems upstream or downstream?
- Recap:
What happened? Was that expected? What do we do next?
After the tests are run, it’s good to take some time to wind down, then hold a follow-up recap. This should be done relatively soon after the Game Day (days, not weeks), while the experience is still fresh for everyone.
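The schedule above, from hypothesis to recap, can be captured in a small record that the scribe fills in during the session. This is a rough sketch under stated assumptions. The `Experiment` class is hypothetical, and using a maximum error rate as the measure of "expected behavior" is my own simplification; a real Game Day would pick whatever signal fits the target.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    name: str
    hypothesis: str
    blast_radius: list        # components the test is allowed to touch
    max_error_rate: float     # assumed threshold for "as expected"
    observations: list = field(default_factory=list)  # (error_rate, note)

    def observe(self, error_rate, note=""):
        """Record one reading taken while the test runs."""
        self.observations.append((error_rate, note))

    def was_expected(self):
        """True when readings exist and none exceeded the threshold."""
        rates = [rate for rate, _ in self.observations]
        return bool(rates) and max(rates) <= self.max_error_rate

    def recap(self, next_step):
        """Answer the three recap questions in one record."""
        return {
            "what_happened": [note for _, note in self.observations if note],
            "was_expected": self.was_expected(),
            "next_step": next_step,
        }


# Hypothetical experiment and readings, for illustration only.
exp = Experiment(
    name="stop one inventory node",
    hypothesis="Checkout keeps working when one inventory node is down",
    blast_radius=["Inventory service"],
    max_error_rate=0.01,
)
exp.observe(0.002, "failover completed, customers saw no errors")
exp.observe(0.030, "upstream checkout calls timed out")
print(exp.recap(next_step="add a timeout and retry to checkout, then rerun"))
```

Writing the hypothesis and blast radius down before the test starts keeps the recap honest: the team compares what it saw against what it predicted, not against what seems reasonable in hindsight.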
Summary:
While a single Game Day can provide some insight into the behavior of an application or system, running Game Days regularly will help increase security in the operation of distributed applications and components in the long run. Hopefully, this has given you some insight into how companies like Netflix, Amazon and Walmart can benefit from the concept of Game Days. In future articles we will discuss various types of attacks and initiatives for chaos engineering.
References: