Rely on the data plane and not the control plane during recovery

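Architecting workloads for high availability and a low mean time to recovery (MTTR) means designing them to be resilient to component failures. AWS services are commonly divided into a control plane, which provides the administrative APIs used to create, read/describe, update, delete, and list resources, and a data plane, which provides the service's day-to-day function, such as running instances or serving traffic. Control planes are generally more complex and have more dependencies than data planes, so they are more likely to be impaired during a large-scale event. Relying on the data plane rather than the control plane when responding to failures keeps recovery working even when resource administration is unavailable.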

Best Practices

  • Prioritize Data Plane Over Control Plane: During recovery, rely on data plane actions, such as shifting traffic to capacity that is already running, rather than control plane actions such as launching, resizing, or reconfiguring resources. Implement automation that uses data plane capabilities to fail over and recover services quickly, so they can resume normal operation with minimal delay.
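As a minimal sketch of such automation, assuming standby capacity has already been provisioned (a pattern often described as static stability), the Python below fails over by changing routing state only. The `Endpoint` and `Router` classes and the `recover` function are hypothetical stand-ins, not an AWS API; in a real workload the routing state might be a health-checked DNS record or a load balancer target group.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """A pre-provisioned copy of the service (hypothetical)."""
    name: str
    healthy: bool = True


@dataclass
class Router:
    """Hypothetical data plane router with two endpoints that already exist.

    Failing over only changes which endpoint receives traffic; it never
    creates, resizes, or reconfigures resources.
    """
    primary: Endpoint
    standby: Endpoint
    active: str = "primary"

    def target(self) -> Endpoint:
        return self.primary if self.active == "primary" else self.standby


def recover(router: Router) -> str:
    """Shift traffic to a healthy pre-provisioned endpoint and return its role.

    No control plane calls are made: if neither endpoint is healthy, the
    routing state is left unchanged for operators to handle.
    """
    if router.target().healthy:
        return router.active  # current target is fine; nothing to do
    other = router.standby if router.active == "primary" else router.primary
    if other.healthy:
        router.active = "standby" if router.active == "primary" else "primary"
    return router.active


router = Router(primary=Endpoint("app-a"), standby=Endpoint("app-b"))
router.primary.healthy = False  # simulate a component failure
print(recover(router))          # prints standby
```

Because the standby already exists, recovery is a single state change that does not wait on provisioning APIs.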

Supporting Questions

  • Does your recovery plan rely primarily on data plane operations, or does it require extensive control plane interactions?

Roles and Responsibilities

  • Cloud Architect: Responsible for designing a resilient architecture that relies on data plane mechanisms for fault tolerance.

Artifacts

  • Resilience Testing Framework: A framework for tests that verify data plane recovery mechanisms work during failure scenarios, confirming that systems can recover without depending on the control plane.
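One way such a framework can check this is to run the recovery path against a stub that fails every control plane call, simulating an impaired control plane. The Python below is a minimal sketch under that assumption; `UnavailableControlPlane`, `launch_instance`, and both recovery functions are hypothetical names, not part of any AWS SDK.

```python
class ControlPlaneUnavailable(Exception):
    """Raised by the stub to simulate a control plane outage."""


class UnavailableControlPlane:
    """Hypothetical stub whose every method call fails, as if create,
    update, and describe APIs were impaired during the event."""

    def __getattr__(self, name):
        def _fail(*args, **kwargs):
            raise ControlPlaneUnavailable(name)
        return _fail


def recover_by_provisioning(control_plane, routes):
    """Anti-pattern: recovery depends on launching new capacity."""
    routes["active"] = control_plane.launch_instance()


def recover_by_rerouting(control_plane, routes):
    """Data plane only: shift traffic to capacity that already exists.
    `control_plane` is accepted but unused, to keep a uniform signature."""
    routes["active"] = routes["standby"]


def survives_control_plane_outage(recovery_fn) -> bool:
    """Run a recovery function with the control plane down and report
    whether traffic ended up on the standby."""
    routes = {"active": "primary", "standby": "standby"}
    try:
        recovery_fn(UnavailableControlPlane(), routes)
    except ControlPlaneUnavailable:
        return False
    return routes["active"] == routes["standby"]


print(survives_control_plane_outage(recover_by_rerouting))     # prints True
print(survives_control_plane_outage(recover_by_provisioning))  # prints False
```

Running every recovery procedure through a check like this makes a hidden control plane dependency fail in testing rather than during a real event.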

Cloud Services

AWS

  • Amazon Route 53: Reroutes traffic using health checks and DNS failover routing, both of which operate in the data plane, improving workload availability and resilience.
  • Elastic Load Balancing: Distributes incoming application traffic across multiple targets, such as EC2 instances. Its health checks stop routing requests to unhealthy targets without any control plane action, minimizing downtime during component failures.
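Both services decide where to send traffic from health check results rather than from resource changes. The Python below is a simplified model of that idea, not either service's actual implementation: a target changes state only after a run of consecutive check results, similar in spirit to the healthy and unhealthy threshold settings of Elastic Load Balancing health checks, and the failover answer returns the primary while it is healthy, as with Route 53 failover routing. The class and function names and the threshold values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """A health-checked target (hypothetical model)."""
    name: str
    healthy_threshold: int = 3    # consecutive passes needed to become healthy
    unhealthy_threshold: int = 2  # consecutive failures needed to become unhealthy
    healthy: bool = True
    streak: int = 0               # consecutive results that disagree with `healthy`

    def record_check(self, passed: bool) -> None:
        if passed == self.healthy:
            self.streak = 0  # result agrees with current state; reset the run
            return
        self.streak += 1
        limit = self.healthy_threshold if passed else self.unhealthy_threshold
        if self.streak >= limit:
            self.healthy = passed
            self.streak = 0


def healthy_pool(targets: list[Target]) -> list[str]:
    """Targets a load balancer would keep sending requests to."""
    return [t.name for t in targets if t.healthy]


def failover_answer(primary: Target, secondary: Target) -> str:
    """Simplified failover routing: primary while healthy, else secondary.
    This sketch returns the secondary even if it is also unhealthy; the
    real service's behavior in that case is not modeled."""
    return primary.name if primary.healthy else secondary.name


a, b = Target("a"), Target("b")
for _ in range(2):
    a.record_check(False)     # two consecutive failures
print(healthy_pool([a, b]))   # prints ['b']
print(failover_answer(a, b))  # prints b
```

Requiring several consecutive results before changing state keeps a single dropped check from triggering a failover, while the failover itself involves no resource changes.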

Question: How do you design your workload to withstand component failures?
Pillar: Reliability (Code: REL)
