Conduct game days regularly

Game days validate the reliability of your workload by having your team rehearse simulated failure scenarios. They confirm that the system, and the people operating it, respond as expected under stress, which builds well-founded confidence in your operational procedures.

Best Practices

Implement Regular Game Days

  • Schedule game days on a regular basis (e.g., monthly or quarterly) to ensure that all stakeholders are familiar with the responses and procedures related to incidents and failures.
  • Use a predefined scenario for each game day, simulating real-life failure events to challenge the team and identify potential weaknesses in the response plan.
  • Involve cross-functional teams during game days, including development, operations, and support, ensuring everyone understands their role in incident response.
  • Review the outcomes and lessons learned after each game day. Create a feedback loop to update procedures and documentation based on real-world experiences during the exercises.
  • Encourage a blameless culture during game days to ensure that participants feel safe identifying and discussing mistakes, leading to more effective improvements.
  • Test in a combination of environments, including staging and production, to understand how incidents affect live applications and how to limit user impact.
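
The practices above (a predefined scenario, cross-functional participation, and a mix of staging and production runs) can be captured in a small planning record. The following is a minimal sketch, not a prescribed format: the class name, every field name, the three required team labels, and the rule that production runs need abort conditions and rollback steps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GameDayScenario:
    """One predefined failure scenario (all field names are illustrative)."""
    name: str
    environment: str          # e.g. "staging" or "production"
    fault: str                # what will be broken, in plain language
    expected_behavior: str    # the hypothesis the exercise checks
    participants: list[str] = field(default_factory=list)
    abort_conditions: list[str] = field(default_factory=list)
    rollback_steps: list[str] = field(default_factory=list)


def readiness_gaps(scenario: GameDayScenario) -> list[str]:
    """Return planning gaps to close before the scenario is run."""
    gaps = []
    # Assumed policy: development, operations, and support must all attend.
    required_teams = {"development", "operations", "support"}
    missing = required_teams - set(scenario.participants)
    if missing:
        gaps.append("missing teams: " + ", ".join(sorted(missing)))
    # Assumed policy: production runs need a way to stop and to undo the fault.
    if scenario.environment == "production":
        if not scenario.abort_conditions:
            gaps.append("production run has no abort condition")
        if not scenario.rollback_steps:
            gaps.append("production run has no rollback steps")
    return gaps


scenario = GameDayScenario(
    name="Primary database failover",
    environment="production",
    fault="Force a failover of the primary database",
    expected_behavior="Error rate stays low and service recovers within 5 minutes",
    participants=["development", "operations"],
    abort_conditions=["error rate above 5% for 2 minutes"],
)
print(readiness_gaps(scenario))
# ['missing teams: support', 'production run has no rollback steps']
```

A facilitator could run a check like this while preparing each exercise, so that gaps surface during planning and not in the middle of the game day.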

Questions to ask your team

  • How often do you conduct game days to test your reliability procedures?
  • Who is involved in the planning and execution of your game days?
  • What metrics do you collect during game days to assess their effectiveness?
  • How do you ensure the scenarios being tested closely mimic real production failures?
  • What processes do you have in place to capture and act on lessons learned from game days?
  • Are there any recent examples of improvements made as a result of past game days?
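
To make the metrics question concrete, one option is to record a few timestamps during each exercise and derive response intervals from them. This is a sketch under stated assumptions: the four event names and the three derived metrics are hypothetical choices, not a standard this guidance defines.

```python
from datetime import datetime, timedelta


def response_metrics(timeline: dict[str, datetime]) -> dict[str, timedelta]:
    """Derive response intervals from timestamps recorded during one exercise.

    Expected keys (hypothetical names): fault_injected, detected,
    mitigated, recovered.
    """
    required = ("fault_injected", "detected", "mitigated", "recovered")
    missing = [key for key in required if key not in timeline]
    if missing:
        raise ValueError("timeline is missing: " + ", ".join(missing))
    start = timeline["fault_injected"]
    # Every interval is measured from the moment the fault was injected.
    return {
        "time_to_detect": timeline["detected"] - start,
        "time_to_mitigate": timeline["mitigated"] - start,
        "time_to_recover": timeline["recovered"] - start,
    }


start = datetime(2024, 5, 14, 10, 0)
metrics = response_metrics({
    "fault_injected": start,
    "detected": start + timedelta(minutes=4),
    "mitigated": start + timedelta(minutes=11),
    "recovered": start + timedelta(minutes=19),
})
print(metrics["time_to_detect"])   # 0:04:00
print(metrics["time_to_recover"])  # 0:19:00
```

Comparing these intervals across successive game days gives one rough signal of whether the response process is improving.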

Who should be doing this?

Game Day Facilitator

  • Plan and organize game day events.
  • Ensure all necessary participants are informed and available.
  • Develop realistic scenarios for testing reliability.
  • Facilitate discussions and activities during the game day.
  • Aggregate feedback and learnings post-event.

Operations Team Member

  • Participate in game days to simulate responding to incidents.
  • Follow established procedures during testing.
  • Provide feedback on response effectiveness and areas for improvement.
  • Collaborate with other team members to refine incident response strategies.

Engineering Team Member

  • Assist in creating the scenarios that will be tested during game days.
  • Ensure that systems and applications are configured correctly for testing.
  • Evaluate system performance during testing and document findings.
  • Implement improvements based on feedback from game day exercises.

Leadership Sponsor

  • Support the game day initiative and ensure alignment with business goals.
  • Encourage team participation and accountability.
  • Review outcomes and ensure follow-up actions are taken.
  • Promote a culture of reliability and continuous improvement across teams.

What evidence shows this is happening in your organization?

  • Game Day Planning Template: A structured template to plan game days, outlining objectives, participants, scenarios to test, and communication strategies.
  • Game Day Report: A comprehensive report documenting the results of each game day, including identified issues, resolutions, and lessons learned.
  • Emergency Response Checklist: A checklist for teams to follow during simulated incidents, ensuring that every essential step is taken to mitigate critical failures.
  • Game Day Playbook: A playbook detailing specific roles, responsibilities, and protocols for participants during game day exercises.
  • Reliability Testing Dashboard: A visual dashboard that tracks game day activities, results, and metrics to monitor the effectiveness of reliability testing.
  • Post-Mortem Review Template: A template for conducting post-mortem reviews after game days to analyze the events and improve future test scenarios.
  • Game Day Strategy Guide: A guide outlining strategies for running effective game days, including best practices and common pitfalls to avoid.

Cloud Services

AWS

  • AWS Fault Injection Service (formerly AWS Fault Injection Simulator): Lets you run controlled experiments that inject faults into your workload and analyze the impact, improving an application’s resilience.
  • Amazon CloudWatch: Provides monitoring for AWS resources and applications, enabling real-time visibility and alerts for production events.
  • AWS Config: Helps assess, audit, and evaluate the configurations of AWS resources to ensure compliance and stability during game days.
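
As an illustration of how a game day fault might be defined with the AWS fault injection service, the sketch below builds the request for its CreateExperimentTemplate operation: stop one tagged EC2 instance, restart it after five minutes, and halt the experiment if a CloudWatch alarm fires. The function names, the target and action labels, the tags, and the ARNs in the usage lines are placeholders and assumptions; the template is only built here, and the `create_template` helper (which needs `boto3` and AWS credentials) is not called.

```python
def build_stop_instance_template(role_arn: str, alarm_arn: str,
                                 tag_key: str, tag_value: str) -> dict:
    """Build keyword arguments for the FIS CreateExperimentTemplate call."""
    return {
        "description": "Game day: stop one tagged EC2 instance for 5 minutes",
        "roleArn": role_arn,  # IAM role the service assumes to run the fault
        # Stop the experiment automatically if this alarm goes into ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "targets": {
            "gameDayInstances": {  # label is arbitrary (assumption)
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single instance
            }
        },
        "actions": {
            "stopInstance": {  # label is arbitrary (assumption)
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: restart the instance after 5 minutes.
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "gameDayInstances"},
            }
        },
        "tags": {"purpose": "game-day"},
    }


def create_template(template_kwargs: dict) -> str:
    """Create the template in AWS and return its ID (requires boto3)."""
    import boto3  # imported lazily so the builder works without boto3

    fis = boto3.client("fis")
    response = fis.create_experiment_template(**template_kwargs)
    return response["experimentTemplate"]["id"]


# Placeholder ARNs for illustration only.
template = build_stop_instance_template(
    role_arn="arn:aws:iam::123456789012:role/game-day-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:high-error-rate",
    tag_key="environment",
    tag_value="staging",
)
print(template["actions"]["stopInstance"]["actionId"])  # aws:ec2:stop-instances
```

Keeping the template builder separate from the API call lets the team review and version the fault definition before anyone runs it against a real account.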

Azure

  • Azure Chaos Studio: Enables you to identify weaknesses in your applications by simulating failures in production-like environments.
  • Azure Monitor: Provides a comprehensive solution for collecting, analyzing, and acting on telemetry from applications and services in Azure.
  • Azure Resource Manager: Provides the deployment and management layer for Azure resources, supporting consistent, repeatable deployment of test environments for game days.

Google Cloud Platform

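  • Chaos engineering on Google Cloud: Google Cloud has no managed service with this exact name, so fault injection is usually run with open-source or third-party tools to test your services under unexpected failures and replicate conditions that might occur in production.
  • Google Cloud Monitoring: Provides metrics, dashboards, and alerting (with logs handled by the companion Cloud Logging service) to understand the health and performance of your applications.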
  • Google Cloud Deployment Manager: Manages resources and configurations in Google Cloud, allowing you to easily deploy and manage resources for testing.