Automate healing on all layers

Workloads that require high availability must handle component failures automatically. Automated healing shortens recovery time and limits disruption, which helps protect the user experience and meet service level agreements (SLAs).

Best Practices

  • Use Auto Scaling Groups: Configure Auto Scaling groups to replace instances that fail health checks. The group keeps capacity at the desired level by launching replacements and by scaling with demand, which helps maintain availability and performance during peak loads.
  • Implement Review Mechanisms: Regularly review automated healing configurations to ensure they are effective and aligned with evolving business requirements. Continuous improvement of automated processes enhances overall reliability.
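As an illustration of the Auto Scaling practice above, the following minimal sketch builds a CloudFormation template as a Python dictionary. It assumes a launch template and a load balancer target group already exist; the function name `build_asg_template`, the logical ID `WebAsg`, and every ID and ARN shown are hypothetical placeholders. Setting `HealthCheckType` to `ELB` asks the group to also replace instances that the load balancer reports as unhealthy, not only those that fail EC2 status checks.

```python
import json
from typing import Sequence


def build_asg_template(
    launch_template_id: str,
    launch_template_version: str,
    subnet_ids: Sequence[str],
    target_group_arn: str,
    min_size: int = 2,
    max_size: int = 6,
) -> dict:
    """Return a CloudFormation template (as a dict) for a self-healing group."""
    if min_size < 1 or max_size < min_size:
        raise ValueError("need 1 <= min_size <= max_size")
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "WebAsg": {  # hypothetical logical ID
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    # CloudFormation expects MinSize and MaxSize as strings.
                    "MinSize": str(min_size),
                    "MaxSize": str(max_size),
                    # Spread instances across subnets in different zones.
                    "VPCZoneIdentifier": list(subnet_ids),
                    "LaunchTemplate": {
                        "LaunchTemplateId": launch_template_id,
                        "Version": launch_template_version,
                    },
                    "TargetGroupARNs": [target_group_arn],
                    # Replace instances that fail load balancer health checks.
                    "HealthCheckType": "ELB",
                    # Seconds to wait after launch before checking health.
                    "HealthCheckGracePeriod": 300,
                },
            }
        },
    }


if __name__ == "__main__":
    template = build_asg_template(
        launch_template_id="lt-0123456789abcdef0",  # placeholder
        launch_template_version="1",
        subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],  # placeholders
        target_group_arn="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web/0123456789abcdef",  # placeholder
    )
    print(json.dumps(template, indent=2))
```

A minimum size of two or more across several subnets is a design choice rather than a requirement: it lets the group keep serving traffic while a failed instance is being replaced.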

Supporting Questions

  • Do you have monitoring and alerting configured to detect failures in real time?
  • Are your automated healing processes tested regularly to ensure they work as intended?

Roles and Responsibilities

  • DevOps Engineer: Responsible for implementing and maintaining automated healing processes within the deployment pipeline.
  • Systems Architect: Ensures the overall architecture is designed with reliability in mind and incorporates healing mechanisms effectively.

Artifacts

  • CloudFormation Templates: Infrastructure as Code (IaC) templates that automate the deployment of resources and recovery configurations.
  • Monitoring Dashboards: Visual tools that display the health and performance of workloads, enabling quick detection of failures.

Cloud Services

AWS

  • Amazon EC2 Auto Scaling: Scales EC2 capacity with demand and replaces instances that fail health checks, improving the workload’s reliability.
  • AWS Lambda: Can be used to trigger automated responses to failures, such as restarting services or notifying stakeholders.
  • Amazon CloudWatch: Monitors the health of AWS resources and applications, enabling automated actions to remediate issues promptly.
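To show how these services can fit together, here is a minimal sketch of a Lambda handler that reacts to a CloudWatch alarm state change delivered through Amazon EventBridge. It assumes the event carries `detail-type`, `detail.alarmName`, and `detail.state.value`, as CloudWatch alarm state change events do. The `make_handler` factory and the `restart` and `notify` callables are hypothetical: in a real deployment they might wrap an AWS SDK call and a notification topic, but here they are injected so the decision logic can be exercised without AWS access.

```python
from typing import Any, Callable, Dict

ALARM_EVENT = "CloudWatch Alarm State Change"


def make_handler(
    restart: Callable[[str], None],  # hypothetical: restarts the affected service
    notify: Callable[[str], None],   # hypothetical: messages stakeholders
) -> Callable[[Dict[str, Any], Any], Dict[str, str]]:
    """Build a Lambda-style handler around the supplied remediation hooks."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
        # Ignore anything that is not an alarm state change.
        if event.get("detail-type") != ALARM_EVENT:
            return {"action": "ignored"}

        detail = event.get("detail", {})
        alarm = detail.get("alarmName", "unknown")
        state = detail.get("state", {}).get("value")

        # Only act when the alarm has entered the ALARM state.
        if state != "ALARM":
            return {"action": "none", "alarm": alarm}

        try:
            restart(alarm)
        except Exception as exc:
            # Healing failed: escalate to a human instead of failing silently.
            notify(f"Automated restart for {alarm} failed: {exc}")
            return {"action": "escalated", "alarm": alarm}

        notify(f"Automated restart triggered for {alarm}")
        return {"action": "restarted", "alarm": alarm}

    return handler


if __name__ == "__main__":
    handler = make_handler(
        restart=lambda name: print(f"restarting target of {name}"),
        notify=print,
    )
    sample = {
        "detail-type": ALARM_EVENT,
        "detail": {"alarmName": "web-5xx-rate", "state": {"value": "ALARM"}},
    }
    print(handler(sample, None))
```

Escalating when the restart itself fails keeps a person in the loop for failures that automation cannot fix.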

Question: How do you design your workload to withstand component failures?
Pillar: Reliability (Code: REL)
