
Obtain resources upon detection of impairment to a workload

Reactive scaling is critical for maintaining availability in a cloud environment. When an impairment is detected, the workload can acquire replacement or additional resources quickly, minimizing downtime and user impact.

Best Practices

Implement Auto Scaling and Health Checks

  • Use AWS Auto Scaling to automatically add or remove resources based on demand. This ensures that your application can handle increases in load without manual intervention.
  • Configure health checks on your resources (like EC2 instances or containers) to monitor their performance. If an instance fails a health check, Auto Scaling can replace it automatically, maintaining availability.
  • Utilize Amazon CloudWatch for monitoring metrics and triggering alarms that initiate scaling actions when certain thresholds are met.
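The scaling decision behind these practices can be sketched in a few lines. This is an illustrative model, not the AWS API: a target-tracking style rule, similar in spirit to what EC2 Auto Scaling applies when a CloudWatch alarm breaches a threshold. All names and values here are hypothetical.

```python
import math

def desired_capacity(current_capacity: int, metric_value: float,
                     target_value: float, min_size: int, max_size: int) -> int:
    """Scale capacity proportionally so the per-instance metric
    (e.g. average CPU utilization) returns to the target value."""
    if current_capacity == 0:
        return min_size
    # Proportional rule used by target tracking:
    #   new = ceil(current * metric / target), clamped to group bounds
    proposed = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, proposed))

# Example: 4 instances averaging 90% CPU with a 50% target -> scale to 8
print(desired_capacity(4, 90.0, 50.0, min_size=2, max_size=10))  # 8
```

The clamp to `min_size`/`max_size` mirrors the group limits you set on an Auto Scaling group, which keep a reactive policy from over- or under-provisioning.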

Set Up Load Balancing

  • Use Elastic Load Balancing (ELB) to distribute incoming application traffic across multiple targets (like EC2 instances). This improves fault tolerance and can help detect unhealthy instances, rerouting traffic to healthy ones.
  • Deploy Application Load Balancers (ALBs) for HTTP/HTTPS traffic. ALB health checks detect unhealthy application targets, and ALB request metrics (such as request count per target) can drive Auto Scaling policies as demand changes.
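The health-aware routing described above can be modeled as a minimal sketch: route requests only to targets that currently pass their health check. The class and target names are hypothetical; a real ELB performs this continuously against registered targets.

```python
class LoadBalancer:
    """Toy model of health-aware round-robin routing."""

    def __init__(self, targets):
        self.targets = list(targets)    # e.g. EC2 instance IDs
        self.healthy = set(targets)     # updated by periodic health checks

    def mark_unhealthy(self, target):
        # An ELB stops routing to a target that fails its health check
        self.healthy.discard(target)

    def route(self):
        """Pick the next healthy target, skipping impaired ones."""
        pool = [t for t in self.targets if t in self.healthy]
        if not pool:
            raise RuntimeError("no healthy targets")
        target = pool[0]
        # Rotate so subsequent calls spread load across healthy targets
        self.targets = self.targets[1:] + self.targets[:1]
        return target
```

Pairing this behavior with Auto Scaling closes the loop: the load balancer routes around an impaired instance while Auto Scaling replaces it.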

Utilize Managed Services

  • Leverage AWS managed services like Amazon RDS, Amazon DynamoDB, or AWS Lambda that offer built-in scaling features and can automatically adjust capacity based on demand without requiring manual intervention.
  • These services provide high availability and automatically handle failover, further improving the reliability of your workload.

Implement Event-Driven Architecture

  • Design your application to react to events via services like Amazon SNS or AWS Lambda. This way, you can automatically trigger additional resources when specific events occur (like a spike in traffic or application error).
  • An event-driven architecture adds agility and helps the system adapt quickly to changes in demand without pre-provisioning resources.
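A scaling trigger of this kind can be sketched as a Lambda-style handler. The event fields (`metric`, `value`) and the threshold are assumptions for illustration; a real function would receive a CloudWatch alarm or SNS payload and call the Auto Scaling API via boto3.

```python
SCALE_OUT_THRESHOLD = 1000  # requests/min; hypothetical threshold

def handler(event, context=None):
    """React to a traffic-spike event by requesting extra capacity."""
    if (event.get("metric") == "requests_per_minute"
            and event.get("value", 0) > SCALE_OUT_THRESHOLD):
        # In a real function this would call, for example,
        # boto3.client("autoscaling").set_desired_capacity(...)
        return {"action": "scale_out", "add_instances": 2}
    return {"action": "none"}
```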

Questions to ask your team

  • How does your system monitor workload performance and detect impairments?
  • What mechanisms do you have in place to ensure that resources are automatically added or removed based on demand changes?
  • Can you provide examples of past occasions when your system successfully scaled resources in response to impairments?
  • How do you test and validate your scaling processes to ensure they function as intended during periods of increased demand?
  • What alerts or notifications are set up to inform you of workload impairments, and how quickly can you respond?
  • Do you have a documented strategy for scaling back resources when demand decreases to optimize cost?

Who should be doing this?

Cloud Architect

  • Design scalable architectures that can automatically adjust resources based on demand.
  • Implement monitoring solutions to detect workload impairments.
  • Choose appropriate AWS services that offer elasticity and scalability.
  • Evaluate and optimize resource allocation to ensure efficient use of cloud resources.

DevOps Engineer

  • Set up automated scaling policies based on real-time performance metrics.
  • Manage infrastructure as code to ensure quick deployment and configuration of resources.
  • Respond to alerts on workload impairments and initiate corrective actions.
  • Conduct regular tests of scaling capabilities to validate reliability and performance.

Site Reliability Engineer (SRE)

  • Monitor application performance and workload availability continuously.
  • Analyze incidents related to resource impairments and recommend improvements.
  • Collaborate with development teams to ensure reliability practices are integrated throughout the SDLC.
  • Conduct post-mortem analyses to learn from workload impairments and improve response strategies.

Product Owner

  • Ensure that the priorities for scaling and reliability align with business goals.
  • Communicate with stakeholders regarding the impact of workload scalability on product functionality.
  • Gather and prioritize feedback on system performance and adapt the roadmap accordingly.
  • Collaborate with the Cloud Architect and DevOps team to ensure resource adaptability aligns with user needs.

What evidence shows this is happening in your organization?

  • Scalability and Reliability Playbook: A comprehensive guide outlining procedures for scaling resources based on demand fluctuations and detecting impairments. This playbook details steps to monitor workloads, identify bottlenecks, and automatically provision resources to ensure availability.
  • Reactive Scaling Strategy Document: A strategic document that describes the methodologies for scaling resources reactively when a workload’s availability is impacted. It includes methodologies for assessment, thresholds for scaling, and automated alert mechanisms.
  • Availability Monitoring Dashboard: A real-time monitoring dashboard that displays workload performance metrics, resource availability, and current demand levels. This dashboard alerts the operations team when thresholds are breached, initiating automatic scaling actions.
  • Resource Scaling Policy: A formal policy outlining the guidelines and best practices for resource scaling. It includes criteria for when to scale up or down, the tools and services used for scaling, and designated responsibilities for team members.
  • Incident Response Runbook: A runbook detailing the procedures to follow when an impairment in workload availability is detected. This includes escalation paths, communication protocols, and steps for quickly deploying additional resources to restore service.

Cloud Services

AWS

  • Amazon EC2 Auto Scaling: Automates the process of scaling EC2 instances based on demand, ensuring availability by adding or removing instances as needed.
  • Amazon CloudWatch: Monitors your AWS resources and applications in real-time, allowing you to set alarms and automatically trigger scaling actions based on metrics.
  • AWS Elastic Load Balancing: Distributes incoming application traffic across multiple targets, allowing for automatic scaling based on traffic demands.
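To tie these services together, here is a sketch of the parameter payload for a CloudWatch alarm that would invoke a scaling policy. The alarm name and the policy ARN are placeholders; with boto3 configured, this dict would be passed as `boto3.client("cloudwatch").put_metric_alarm(**alarm_params)`.

```python
alarm_params = {
    "AlarmName": "high-cpu-scale-out",          # hypothetical name
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 60,                               # seconds per datapoint
    "EvaluationPeriods": 3,                     # breach for 3 consecutive periods
    "Threshold": 70.0,                          # percent CPU
    "ComparisonOperator": "GreaterThanThreshold",
    # ARN of the scale-out policy this alarm invokes (placeholder)
    "AlarmActions": ["arn:aws:autoscaling:...:scalingPolicy/..."],
}
```

Setting `EvaluationPeriods` above 1 avoids scaling on a single noisy datapoint, trading a small delay for stability.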

Azure

  • Azure Virtual Machine Scale Sets: Allows you to deploy and manage a set of identical VMs that can automatically scale in response to workload demands.
  • Azure Monitor: Collects and analyzes telemetry data to monitor your applications, enabling you to set alerts and automate scaling actions.
  • Azure Load Balancer: Distributes network traffic across multiple servers, supporting automatic scaling to manage traffic spikes.

Google Cloud Platform

  • Compute Engine Managed Instance Groups: Deploy identical VM instances that can scale automatically based on load metrics or schedules.
  • Cloud Monitoring: Collects metrics and telemetry across your services, letting you define alerting policies that feed autoscaling decisions.
  • Cloud Load Balancing: Distributes traffic across backends and uses health checks to route around impaired instances.

Question: How do you design your workload to adapt to changes in demand?
Pillar: Reliability (Code: REL)
