
Fail over to healthy resources

In cloud environments, it is essential to design workloads that can maintain availability in the face of component failures. This requires architectures that can quickly redirect traffic or workloads to healthy resources when failures occur, ensuring continuity and performance in operations.

Best Practices

  • Multi-AZ Deployment: Deploy critical applications across multiple Availability Zones (AZs) so that when one AZ has an issue, traffic can be routed to the remaining AZs, minimizing downtime and keeping the application accessible to users.
  • Cross-Region Failover: For greater resilience, consider deploying across multiple AWS Regions. If a Region fails, traffic can be redirected to another Region, provided that routing and data replication have been configured in advance.
  • Health Checks and Monitoring: Robust health checks and monitoring are needed to detect failures promptly and trigger failover. Use Amazon CloudWatch to monitor resource health and automate recovery actions.
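The interaction between health checks and failover can be illustrated with a minimal Python sketch. It is not an AWS API: the `Endpoint` and `FailoverRouter` names and the threshold of three consecutive failures are assumptions chosen for illustration. The idea is that an endpoint is treated as unhealthy only after several failed checks in a row, and traffic goes to the first healthy endpoint in priority order.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """A hypothetical target, e.g. one deployment per AZ or Region."""
    name: str
    consecutive_failures: int = 0


@dataclass
class FailoverRouter:
    """Routes to the first healthy endpoint in priority order."""
    endpoints: list              # ordered: primary first, then standbys
    failure_threshold: int = 3   # assumed value; tune per workload

    def record_check(self, name: str, passed: bool) -> None:
        """Record one health-check result for the named endpoint."""
        for ep in self.endpoints:
            if ep.name == name:
                # A passing check resets the counter; a failure increments it.
                ep.consecutive_failures = (
                    0 if passed else ep.consecutive_failures + 1
                )
                return
        raise KeyError(name)

    def is_healthy(self, ep: Endpoint) -> bool:
        """An endpoint is healthy until it reaches the failure threshold."""
        return ep.consecutive_failures < self.failure_threshold

    def active(self) -> str:
        """Return the name of the endpoint that should receive traffic."""
        for ep in self.endpoints:
            if self.is_healthy(ep):
                return ep.name
        raise RuntimeError("no healthy endpoints available")
```

Requiring several consecutive failures before failing over avoids flapping on a single dropped check, at the cost of slower detection. In this sketch a single passing check restores the primary, which may be too eager for some workloads.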

Supporting Questions

  • Have you implemented automated failover mechanisms to ensure minimal downtime?
  • Do you conduct regular testing of your failover processes to ensure readiness?

Roles and Responsibilities

  • Architect: Responsible for designing resilient architectures that can withstand component failures and ensuring that all configurations are optimized for availability.
  • DevOps Engineer: Implements and operates monitoring tools, writes automation scripts for failover, and performs regular maintenance of the infrastructure.

Artifacts

  • Disaster Recovery Plan: A documented approach detailing the procedures for maintaining and recovering business operations in the event of a disaster, including the failover strategies.
  • Architecture Diagrams: Visual representations of the system architecture that illustrate how components interact and the paths for failover, ensuring clarity and understanding among team members.

Cloud Services

AWS

  • Elastic Load Balancing (ELB): Distributes incoming application traffic across multiple targets, such as Amazon EC2 instances, in multiple AZs, so the application stays available if a target or an AZ fails.
  • Amazon Route 53: A scalable DNS service that can route end users to healthy endpoints across AWS Regions or AZs, using health checks and failover routing to redirect traffic when an endpoint fails.
  • AWS CloudFormation: Allows you to define and provision your cloud infrastructure using code, making it easier to manage your resources and implement reliable, automated failover architectures.
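As one possible illustration of DNS-level failover, the sketch below builds a change batch for a pair of Route 53 failover records (one `PRIMARY`, one `SECONDARY`). The function name `failover_change_batch`, the record name, the IP addresses (taken from documentation ranges), and the health check ID are all placeholders, not values from this page. The dictionary follows the shape accepted by the boto3 Route 53 client's `change_resource_record_sets` call, but the sketch only constructs the data and makes no AWS request.

```python
def failover_change_batch(record_name, primary_ip, secondary_ip,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 change batch with PRIMARY/SECONDARY failover A records."""

    def record(role, ip, health_check_id=None):
        rrset = {
            "Name": record_name,
            "Type": "A",
            # SetIdentifier must differ between the two failover records.
            "SetIdentifier": f"{record_name}-{role.lower()}",
            "Failover": role,          # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                # a short TTL shortens client-side failover delay
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            # Route 53 answers with the secondary when this check fails.
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Failover record pair (illustrative)",
        "Changes": [
            record("PRIMARY", primary_ip, primary_health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }


# Placeholder values for illustration only.
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10",
    "hypothetical-health-check-id",
)

# Applying it would look roughly like this (requires boto3 and credentials):
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone ID>", ChangeBatch=batch)
```

In practice the same records could instead be declared in a CloudFormation template so that the failover configuration is versioned alongside the rest of the infrastructure.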

Question: How do you design your workload to withstand component failures?
Pillar: Reliability (Code: REL)
