Obtain resources upon detection of impairment to a workload

Posted December 20, 2024

Updated March 22, 2025

By Kevin McCaffrey

The ability to scale resources reactively is critical for maintaining availability in a cloud environment. When a workload becomes impaired, the system can respond quickly by acquiring additional or replacement resources, which minimizes downtime and user impact.

Best Practices

Implement Auto Scaling and Health Checks

Use AWS Auto Scaling to automatically add or remove resources based on demand. This ensures that your application can handle increases in load without manual intervention.
Configure health checks on your resources (like EC2 instances or containers) to monitor their health. If an instance fails a health check, Auto Scaling can terminate and replace it automatically, maintaining availability.
Utilize Amazon CloudWatch for monitoring metrics and triggering alarms that initiate scaling actions when certain thresholds are met.
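
As a rough illustration of the points above, the following Python sketch builds the request parameters for a target tracking scaling policy and for load balancer health checks on an Auto Scaling group. The parameter names follow the EC2 Auto Scaling `PutScalingPolicy` and `UpdateAutoScalingGroup` APIs as exposed by boto3; the group name `web-asg`, the policy naming scheme, the 50% CPU target, and the 300-second grace period are assumptions chosen for illustration, not recommendations.

```python
from typing import Any, Dict


def build_target_tracking_policy(group_name: str, target_cpu_percent: float) -> Dict[str, Any]:
    """Build keyword arguments for the EC2 Auto Scaling PutScalingPolicy call."""
    if not 0 < target_cpu_percent <= 100:
        raise ValueError("target_cpu_percent must be in (0, 100]")
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": f"{group_name}-cpu-target-tracking",  # hypothetical naming scheme
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": float(target_cpu_percent),
        },
    }


def build_health_check_update(group_name: str, grace_period_seconds: int = 300) -> Dict[str, Any]:
    """Build keyword arguments for UpdateAutoScalingGroup so the group also
    honors load balancer health checks, not only EC2 status checks."""
    return {
        "AutoScalingGroupName": group_name,
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": grace_period_seconds,  # assumed warm-up time
    }


def apply(policy: Dict[str, Any], health: Dict[str, Any]) -> None:
    """Send both requests to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builders work without the SDK installed

    client = boto3.client("autoscaling")
    client.put_scaling_policy(**policy)
    client.update_auto_scaling_group(**health)


policy = build_target_tracking_policy("web-asg", 50)
health = build_health_check_update("web-asg")
print(policy["PolicyType"], health["HealthCheckType"])
```

Keeping the parameter construction separate from the API call makes the policy easy to review and unit test before it is applied to a real group.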

Set Up Load Balancing

Use Elastic Load Balancing (ELB) to distribute incoming application traffic across multiple targets (like EC2 instances). This improves fault tolerance and can help detect unhealthy instances, rerouting traffic to healthy ones.
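Deploy Application Load Balancers (ALBs) for HTTP/HTTPS traffic. An ALB health-checks its targets and publishes metrics, such as request count per target, that Auto Scaling policies can use to scale dynamically with demand. The load balancer itself does not add or remove instances.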
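
To make the health check behavior concrete, here is a small, self-contained Python simulation of how a load balancer might mark a target unhealthy after consecutive failed checks and route only to healthy targets. It is a simplified model rather than ELB's actual implementation; the class name `HealthAwareBalancer` and the threshold defaults are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Target:
    name: str
    healthy: bool = True
    failures: int = 0   # consecutive failed checks
    successes: int = 0  # consecutive passed checks


class HealthAwareBalancer:
    """Round-robin router that skips targets currently marked unhealthy."""

    def __init__(self, names: List[str], unhealthy_threshold: int = 2,
                 healthy_threshold: int = 3) -> None:
        self.targets: Dict[str, Target] = {n: Target(n) for n in names}
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self._next = 0

    def record_check(self, name: str, passed: bool) -> None:
        """Update a target's state from one health check result."""
        t = self.targets[name]
        if passed:
            t.successes += 1
            t.failures = 0
            if not t.healthy and t.successes >= self.healthy_threshold:
                t.healthy = True
        else:
            t.failures += 1
            t.successes = 0
            if t.healthy and t.failures >= self.unhealthy_threshold:
                t.healthy = False

    def route(self) -> str:
        """Return the next healthy target in round-robin order."""
        healthy = [t.name for t in self.targets.values() if t.healthy]
        if not healthy:
            raise RuntimeError("no healthy targets available")
        choice = healthy[self._next % len(healthy)]
        self._next += 1
        return choice


lb = HealthAwareBalancer(["a", "b"])
lb.record_check("b", False)
lb.record_check("b", False)          # second consecutive failure: "b" is taken out
routes = [lb.route() for _ in range(3)]
print(routes)
```

Requiring several consecutive results before changing state avoids flapping when a single check times out.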

Utilize Managed Services

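Leverage AWS managed services like Amazon RDS, Amazon DynamoDB, or AWS Lambda that offer built-in scaling features. The degree of automation varies by service: Lambda scales concurrent executions automatically, DynamoDB offers on-demand capacity and auto scaling for provisioned capacity, and RDS supports storage autoscaling.
When configured for high availability (for example, RDS Multi-AZ deployments), these services handle failover automatically, further improving the reliability of your workload.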
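
As one possible example, the sketch below prepares the two Application Auto Scaling requests that enable read-capacity auto scaling on a DynamoDB table in provisioned capacity mode (tables in on-demand mode do not need this). The parameter names follow the `RegisterScalableTarget` and `PutScalingPolicy` APIs as exposed by boto3; the table name `orders`, the capacity bounds, and the 70% utilization target are illustrative assumptions.

```python
from typing import Any, Dict, Tuple


def build_dynamodb_read_scaling(table: str, min_rcu: int, max_rcu: int,
                                target_utilization: float = 70.0
                                ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (scalable_target_kwargs, scaling_policy_kwargs) for a table."""
    if min_rcu < 1 or max_rcu < min_rcu:
        raise ValueError("require 1 <= min_rcu <= max_rcu")
    common = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": f"table/{table}",
        "ScalableDimension": "dynamodb:table:ReadCapacityUnits",
    }
    target = {**common, "MinCapacity": min_rcu, "MaxCapacity": max_rcu}
    policy = {
        **common,
        "PolicyName": f"{table}-read-target-tracking",  # hypothetical naming scheme
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization",
            },
        },
    }
    return target, policy


def apply(target: Dict[str, Any], policy: Dict[str, Any]) -> None:
    """Send both requests to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder works without the SDK installed

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)


target, policy = build_dynamodb_read_scaling("orders", min_rcu=5, max_rcu=100)
print(target["ResourceId"], policy["PolicyType"])
```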

Implement Event-Driven Architecture

Design your application to react to events via services like Amazon SNS or AWS Lambda. This way, you can automatically trigger additional resources when specific events occur (like a spike in traffic or an application error).
An event-driven architecture lets the workload adapt quickly to changes in demand without pre-provisioning resources.
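
A minimal sketch of this pattern, assuming a CloudWatch alarm publishes to an SNS topic that invokes a Lambda function: the handler below parses the SNS records, inspects the alarm state, and returns the actions it would take. The `add_capacity` action is a hypothetical placeholder; a real handler would call whichever scaling API fits the workload.

```python
import json
from typing import Any, Dict, List


def extract_alarms(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the JSON alarm payload carried in each SNS record."""
    return [json.loads(record["Sns"]["Message"])
            for record in event.get("Records", [])]


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda entry point: decide what to do for each alarm notification."""
    actions = []
    for alarm in extract_alarms(event):
        # Only react when the alarm has entered the ALARM state;
        # OK and INSUFFICIENT_DATA transitions are ignored here.
        if alarm.get("NewStateValue") == "ALARM":
            actions.append({
                "alarm": alarm.get("AlarmName"),
                "action": "add_capacity",  # hypothetical placeholder action
            })
    return {"actions": actions}


# Local usage example with a hand-built event (alarm name is made up).
sample_event = {
    "Records": [
        {"Sns": {"Message": json.dumps(
            {"AlarmName": "high-5xx-rate", "NewStateValue": "ALARM"})}},
        {"Sns": {"Message": json.dumps(
            {"AlarmName": "high-5xx-rate", "NewStateValue": "OK"})}},
    ]
}
result = handler(sample_event, None)
print(result)
```

Returning the decisions instead of acting on them directly keeps the handler easy to test locally before it is wired to real resources.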

Questions to ask your team

How does your system monitor workload performance and detect impairments?
What mechanisms do you have in place to ensure that resources are automatically added or removed based on demand changes?
Can you provide examples of past incidents where your system successfully scaled resources in response to impairments?
How do you test and validate your scaling processes to ensure they function as intended during periods of increased demand?
What alerts or notifications are set up to inform you of workload impairments, and how quickly can you respond?
Do you have a documented strategy for scaling back resources when demand decreases to optimize cost?

Who should be doing this?

Cloud Architect

Design scalable architectures that can automatically adjust resources based on demand.
Implement monitoring solutions to detect workload impairments.
Choose appropriate AWS services that offer elasticity and scalability.
Evaluate and optimize resource allocation to ensure efficient use of cloud resources.

DevOps Engineer

Set up automated scaling policies based on real-time performance metrics.
Manage infrastructure as code to ensure quick deployment and configuration of resources.
Respond to alerts on workload impairments and initiate corrective actions.
Conduct regular tests of scaling capabilities to validate reliability and performance.

Site Reliability Engineer (SRE)

Monitor application performance and workload availability continuously.
Analyze incidents related to resource impairments and recommend improvements.
Collaborate with development teams to ensure reliability practices are integrated throughout the SDLC.
Conduct post-mortem analyses to learn from workload impairments and improve response strategies.

Product Owner

Ensure that the priorities for scaling and reliability align with business goals.
Communicate with stakeholders regarding the impact of workload scalability on product functionality.
Gather and prioritize feedback on system performance and adapt the roadmap accordingly.
Collaborate with the Cloud Architect and DevOps team to ensure resource adaptability aligns with user needs.

What evidence shows this is happening in your organization?

Scalability and Reliability Playbook: A comprehensive guide outlining procedures for scaling resources based on demand fluctuations and detecting impairments. This playbook details steps to monitor workloads, identify bottlenecks, and automatically provision resources to ensure availability.
Reactive Scaling Strategy Document: A strategic document that describes how resources are scaled reactively when a workload’s availability is impacted. It covers assessment methods, scaling thresholds, and automated alert mechanisms.
Availability Monitoring Dashboard: A real-time monitoring dashboard that displays workload performance metrics, resource availability, and current demand levels. This dashboard alerts the operations team when thresholds are breached, initiating automatic scaling actions.
Resource Scaling Policy: A formal policy outlining the guidelines and best practices for resource scaling. It includes criteria for when to scale up or down, the tools and services used for scaling, and designated responsibilities for team members.
Incident Response Runbook: A runbook detailing the procedures to follow when an impairment in workload availability is detected. This includes escalation paths, communication protocols, and steps for quickly deploying additional resources to restore service.
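
The scale-up and scale-down criteria in a resource scaling policy can also be written down in executable form, which makes them easier to review and test. The sketch below is one hypothetical encoding: the field names, the one-step adjustment, and the example thresholds are assumptions, not values taken from any particular policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    scale_out_above: float   # metric value above which capacity is added
    scale_in_below: float    # metric value below which capacity is removed
    min_capacity: int
    max_capacity: int
    cooldown_seconds: int    # minimum time between capacity changes


def decide_capacity(policy: ScalingPolicy, current: int, metric: float,
                    seconds_since_last_change: int) -> int:
    """Return the desired capacity for the observed metric value."""
    if seconds_since_last_change < policy.cooldown_seconds:
        return current  # still cooling down from the previous change
    if metric > policy.scale_out_above:
        return min(current + 1, policy.max_capacity)
    if metric < policy.scale_in_below:
        return max(current - 1, policy.min_capacity)
    return current


# Example: scale out above 70% CPU, scale in below 30%, between 2 and 6 instances.
policy = ScalingPolicy(scale_out_above=70, scale_in_below=30,
                       min_capacity=2, max_capacity=6, cooldown_seconds=300)
print(decide_capacity(policy, current=3, metric=85, seconds_since_last_change=600))
```

The gap between the two thresholds and the cooldown period both help prevent the system from oscillating between scaling out and scaling in.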

Cloud Services

AWS

Amazon EC2 Auto Scaling: Automates the process of scaling EC2 instances based on demand, ensuring availability by adding or removing instances as needed.
Amazon CloudWatch: Monitors your AWS resources and applications in real-time, allowing you to set alarms and automatically trigger scaling actions based on metrics.
AWS Elastic Load Balancing: Distributes incoming application traffic across multiple targets, allowing for automatic scaling based on traffic demands.
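
To show how these AWS services fit together, the following sketch builds the parameters for a CloudWatch alarm on an Auto Scaling group's average CPU utilization, with a scaling policy as the alarm action. The parameter names follow the CloudWatch `PutMetricAlarm` API as exposed by boto3; the group name, the 80% threshold, the three one-minute evaluation periods, and the placeholder policy ARN are assumptions for illustration.

```python
from typing import Any, Dict


def build_cpu_alarm(group_name: str, policy_arn: str,
                    threshold_percent: float = 80.0) -> Dict[str, Any]:
    """Build keyword arguments for the CloudWatch PutMetricAlarm call."""
    return {
        "AlarmName": f"{group_name}-high-cpu",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": group_name}],
        "Statistic": "Average",
        "Period": 60,                # seconds per data point
        "EvaluationPeriods": 3,      # consecutive periods that must breach
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [policy_arn],  # scaling policy invoked on ALARM
    }


def apply(alarm: Dict[str, Any]) -> None:
    """Create or update the alarm. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder works without the SDK installed

    boto3.client("cloudwatch").put_metric_alarm(**alarm)


# "<scaling-policy-arn>" is a placeholder; use the ARN returned when the
# scaling policy is created.
alarm = build_cpu_alarm("web-asg", "<scaling-policy-arn>")
print(alarm["AlarmName"], alarm["Threshold"])
```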

Azure

Azure Virtual Machine Scale Sets: Allows you to deploy and manage a set of identical VMs that can automatically scale in response to workload demands.
Azure Monitor: Collects and analyzes telemetry data to monitor your applications, enabling you to set alerts and automate scaling actions.
Azure Load Balancer: Distributes network traffic across multiple servers, supporting automatic scaling to manage traffic spikes.

Google Cloud Platform

Google Kubernetes Engine (GKE) cluster autoscaler: Automatically adjusts the number of nodes in a GKE cluster's node pools based on the demands of your workloads.
Google Cloud Monitoring: Provides insights into the performance of your applications, allowing for proactive scaling based on monitoring metrics.
Google Cloud Load Balancing: Distributes user traffic across multiple instances, and scales resources automatically based on traffic demands.

Question: How do you design your workload to adapt to changes in demand?
Pillar: Reliability (Code: REL)
