Conduct game days regularly

Game days are essential for validating the reliability of your workload. In a game day, your team simulates failure scenarios in a controlled way and checks whether the system and the people operating it respond as expected under stress. Each exercise either confirms that your operational procedures work or exposes gaps you can fix before a real incident.

Best Practices

  • Schedule Regular Game Days: Plan and conduct game days at regular intervals. This keeps the team familiar with emergency protocols and allows for continuous improvement of incident response based on previous experiences.
  • Simulate Real-World Scenarios: Design game day exercises that closely mimic actual production failures, such as the loss of a dependency or a failed deployment. This lets your team practice its response under realistic conditions, including time pressure and incomplete information.
  • Involve All Relevant Teams: Ensure that all teams involved in the operation of the system participate in game days. This enhances collective knowledge and prepares everyone for cross-functional collaboration during actual incidents.

Supporting Questions

  • Are your teams trained in the procedures developed during game days?
  • How often are game days conducted, and are they effective in achieving their objectives?
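One way to answer the second question with data rather than impressions is to keep a small record of each game day and derive cadence and objective completion from it. The following is a minimal sketch under stated assumptions: the GameDayRecord shape, the function names, and the sample dates and counts are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class GameDayRecord:
    """One completed game day (hypothetical record shape)."""
    held_on: date
    objectives_total: int
    objectives_met: int


def average_interval_days(records):
    """Mean days between consecutive game days, or None with fewer than two."""
    dates = sorted(r.held_on for r in records)
    if len(dates) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return mean(gaps)


def objective_hit_rate(records):
    """Fraction of all planned objectives that were met, or None if none were planned."""
    total = sum(r.objectives_total for r in records)
    if total == 0:
        return None
    return sum(r.objectives_met for r in records) / total


# Illustrative, made-up history of three quarterly game days.
records = [
    GameDayRecord(date(2024, 1, 15), objectives_total=4, objectives_met=3),
    GameDayRecord(date(2024, 4, 15), objectives_total=5, objectives_met=4),
    GameDayRecord(date(2024, 7, 15), objectives_total=3, objectives_met=3),
]

print(average_interval_days(records))
print(round(objective_hit_rate(records), 2))
```

A growing interval or a falling hit rate would be a signal to revisit the schedule or the scenario design, though what counts as an acceptable value is a judgment for each team.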

Roles and Responsibilities

  • Incident Response Team: This team is responsible for planning and executing game day exercises, ensuring all necessary stakeholders participate and learn from each simulation.
  • DevOps Engineers: DevOps engineers should actively engage in game days to test system changes and validate recovery procedures, thereby directly contributing to the system’s reliability.

Artifacts

  • Game Day Playbook: A comprehensive document that outlines all exercises, scenarios, and procedures for game day events. This ensures consistency and provides a reference for team members.
  • Post-Mortem Reports: A written record produced after each game day that captures what happened, the lessons learned, and participant feedback. These reports feed improvements back into the playbook and into the response to future incidents.
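The two artifacts above can be kept as structured data so that every exercise is recorded the same way. The sketch below assumes one possible shape: the Scenario and PostMortem classes, their field names, and the example values are hypothetical and not part of any AWS tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One playbook entry (hypothetical fields)."""
    name: str
    hypothesis: str          # what the team expects to happen
    injected_failure: str    # what is deliberately broken
    rollback_steps: list     # how to return to a healthy state
    participants: list       # teams that must take part


@dataclass
class PostMortem:
    """Outcome of running one scenario (hypothetical fields)."""
    scenario: Scenario
    hypothesis_held: bool
    lessons: list = field(default_factory=list)
    action_items: list = field(default_factory=list)

    def render(self) -> str:
        """Return the report as plain text, one item per line."""
        outcome = "held" if self.hypothesis_held else "did not hold"
        lines = [
            f"Game day: {self.scenario.name}",
            f"Hypothesis ({outcome}): {self.scenario.hypothesis}",
            "Lessons learned:",
            *[f"  - {item}" for item in self.lessons],
            "Action items:",
            *[f"  - {item}" for item in self.action_items],
        ]
        return "\n".join(lines)


# Illustrative values only.
failover = Scenario(
    name="Primary database failover",
    hypothesis="Traffic shifts to the standby without manual steps",
    injected_failure="Stop the primary database instance",
    rollback_steps=["Restart the primary", "Verify replication is healthy"],
    participants=["Incident Response Team", "DevOps Engineers"],
)
report = PostMortem(
    scenario=failover,
    hypothesis_held=False,
    lessons=["The failover runbook was out of date"],
    action_items=["Update the runbook and rehearse it at the next game day"],
)
print(report.render())
```

Keeping the hypothesis and the rollback steps next to the scenario makes it harder to start an exercise without knowing what success looks like or how to stop.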

Cloud Services

AWS

  • AWS CloudFormation: Define your game day infrastructure and failure conditions as AWS CloudFormation templates so that simulations can be deployed quickly and reverted cleanly.
  • AWS Lambda: Use AWS Lambda functions to automate responses to events triggered during game days, which also lets you test how your serverless components behave under failure.
  • Amazon SNS: Use Amazon Simple Notification Service (SNS) to exercise alerting and notification paths during game days, confirming that the right people are reached when an incident occurs.
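As an illustration of the Amazon SNS item, the sketch below publishes a clearly labeled simulated alert. It assumes a client object that exposes the same publish(TopicArn=..., Subject=..., Message=...) call as the boto3 SNS client; the function name, the RecordingClient stand-in used for dry runs, the severity label, and the topic ARN are hypothetical.

```python
import json


def publish_game_day_alert(sns_client, topic_arn, scenario, severity="SEV2"):
    """Publish a simulated alert and return the message ID reported by the client."""
    # SNS documents a subject length under 100 characters, so truncate to 99.
    subject = f"[GAME DAY] {severity}: {scenario}"[:99]
    # Mark the payload as simulated so responders and tooling can tell it apart
    # from a real incident.
    message = json.dumps(
        {"simulated": True, "severity": severity, "scenario": scenario}
    )
    response = sns_client.publish(
        TopicArn=topic_arn, Subject=subject, Message=message
    )
    return response["MessageId"]


class RecordingClient:
    """Hypothetical stand-in for an SNS client; records calls instead of sending."""

    def __init__(self):
        self.sent = []

    def publish(self, **kwargs):
        self.sent.append(kwargs)
        return {"MessageId": f"dry-run-{len(self.sent)}"}


# Dry run with the stand-in. Against AWS, pass boto3.client("sns") instead
# and a topic ARN that exists in your account.
client = RecordingClient()
message_id = publish_game_day_alert(
    client,
    "arn:aws:sns:us-east-1:123456789012:game-day-alerts",  # placeholder ARN
    "Primary database failover",
)
print(message_id, client.sent[0]["Subject"])
```

Passing the client in as a parameter keeps the function testable without AWS credentials, and the "[GAME DAY]" prefix reduces the risk that a simulated alert is mistaken for a real one.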

Question: How do you test reliability?
Pillar: Reliability (Code: REL)
