The term GameDay was coined by Amazon’s “Master of Disaster,” Jesse Robbins, who created these events to increase reliability by purposefully triggering major failures on pre-planned dates. Game Days put the values of chaos engineering into practice. Chaos engineering is the disciplined practice of injecting failure into healthy systems. With modern IT services becoming increasingly sophisticated, continuously changing systems, outages are inevitable. Chaos engineering ensures that IT teams learn from the failures that can happen, within a controlled environment.
Netflix was one of the pioneers in the field when it decided to move its services to the cloud in 2010. They developed [Chaos Monkey](https://netflix.github.io/chaosmonkey/) to test system stability through random disruption of services within their architecture. Buoyed by the success of that tool, they went on to develop additional tools, known as the Simian Army, for different aspects of chaos testing.
What are the benefits of GameDays?
Modern IT ecosystems are dependent on cloud infrastructure built from microservices, which means that a disruption of service can be a costly affair, especially for up-and-coming small companies. Game Days help teams uncover unique blind spots, making them better prepared when something happens in real time.
How to start running your Game Days:
Preparation: Ensure that every member of the team is aware of the date and time beforehand. All information must be shared on team calendars, and clear communication channels have to be established. Game Days typically run between 2-4 hours and primarily involve the engineers who work closely with the targeted systems. Install chaos engineering tools and establish processes before beginning, so you have observability into everything that is breaking, along with methods of mitigation. Assign someone to document everything that happens, creating a knowledge base for new and existing members of the workforce.
Selection of experiments: With everyone present, take a good look at your system architecture on the whiteboard and ask yourselves what could go wrong. There are plenty of tools that can inject failure into your systems; you can even design internal tools to have finer control over everything. Some of the popular tests to run are latency injection, data center failure, faulty code injection, etc. Contain your blast radius initially by testing the least-depended-upon system, then scale up as you gain confidence in your systems and processes. Collect all relevant data from these experiments for the next stage.
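To make the latency-injection idea concrete, here is a minimal sketch of how a team might wrap a downstream call with an artificial delay before pointing a real tool at production-like systems. This is not any specific chaos tool's API; `inject_latency` and `fetch_profile` are hypothetical names used only for illustration.

```python
import random
import time

def inject_latency(call, delay_s=2.0, probability=0.5, seed=None):
    """Wrap a service call so it sometimes stalls, simulating network latency."""
    rng = random.Random(seed)

    def wrapped(*args, **kwargs):
        if rng.random() < probability:
            time.sleep(delay_s)  # injected fault: artificial delay
        return call(*args, **kwargs)

    return wrapped

# Hypothetical downstream call standing in for a real microservice request.
def fetch_profile(user_id):
    return {"id": user_id, "name": "demo"}

# Start with a small blast radius: low delay, one wrapped call.
slow_fetch = inject_latency(fetch_profile, delay_s=0.1, probability=1.0)

start = time.monotonic()
result = slow_fetch(7)
elapsed = time.monotonic() - start
print(result, f"took {elapsed:.2f}s")
```

During a Game Day you would watch dashboards and alerts while this kind of fault is active, checking whether timeouts, retries, and fallbacks behave as designed.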
Retrospective meeting: Recap everything that happened and ask team members to explain what they learned from the experiments. Have in-depth discussions to understand what can go wrong, the MTTR, any required architectural updates, and how well your incident response systems kicked into action.
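MTTR (mean time to recovery) is simply the average of resolution time minus detection time across incidents. A minimal sketch of computing it from a Game Day's incident log, assuming a hypothetical list of (detected, resolved) timestamp pairs:

```python
from datetime import datetime

# Hypothetical incident log recorded during a Game Day:
# each entry is (time fault was detected, time service recovered).
incidents = [
    (datetime(2023, 5, 1, 10, 0), datetime(2023, 5, 1, 10, 12)),
    (datetime(2023, 5, 1, 11, 30), datetime(2023, 5, 1, 11, 38)),
    (datetime(2023, 5, 1, 14, 5), datetime(2023, 5, 1, 14, 25)),
]

def mttr_minutes(incidents):
    """Mean time to recovery: average of (resolved - detected), in minutes."""
    total_s = sum((end - start).total_seconds() for start, end in incidents)
    return total_s / len(incidents) / 60

print(f"MTTR: {mttr_minutes(incidents):.1f} minutes")  # (12 + 8 + 20) / 3 ≈ 13.3
```

Tracking this number across successive Game Days gives the team a concrete signal of whether their incident response is actually improving.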
This is a typical workflow for a GameDay; different teams can adapt it to their own requirements. GameDays help teams stay on their toes while managing critical distributed systems. The goal is to run experiments in a way that doesn’t cause harm to users and stakeholders.