Search for Well Architected Advice
< All Topics
Print

Test resiliency using chaos engineering

Testing the reliability of your systems is critical to ensure they can withstand unexpected failures. Chaos engineering provides a systematic approach to introduce faults into your applications, allowing you to assess how they cope under adverse conditions. Through regular testing, you can validate the resilience of your architecture and make necessary adjustments.

Best Practices

  • Establish a Chaos Engineering Schedule: Integrate chaos experiments into your deployment pipeline. Regularly scheduled chaos testing helps you identify weaknesses in your systems before they affect end users. Begin with small, controlled scenarios and progressively test more complex situations.

Supporting Questions

  • What metrics do you use to measure system response during chaos experiments?
  • How frequently are chaos experiments conducted?
  • Are there clear recovery procedures established for each chaos experiment?

Roles and Responsibilities

  • DevOps Engineer: Responsible for implementing chaos experiments and monitoring system behavior during these tests, ensuring that findings are documented and analyzed for improvement.
  • Quality Assurance Specialist: Works alongside DevOps to define test scenarios and metrics that evaluate system reliability based on chaos experiment outcomes.

Artifacts

  • Chaos Engineering Experiment Plan: A document outlining the objectives, methods, and expected outcomes of each chaos experiment.
  • Incident Response Playbook: Guidelines for the team on how to respond to failures triggered by chaos experiments.

Cloud Services

AWS

  • AWS Fault Injection Simulator: A fully managed service that lets you carry out chaos engineering experiments on AWS, allowing you to improve application reliability by injecting faults and observing system behavior.
  • Amazon CloudWatch: Provides monitoring and observability for your chaos experiments, enabling you to track the performance and health metrics of your applications during and after chaos events.

Question: How do you test reliability?
Pillar: Reliability (Code: REL)

Table of Contents