Well Architected Guide
Reliability
Category - Reliability
Manage service quotas and constraints
How do you ensure sufficient gap between quotas and maximum usage to accommodate failover?
How do you automate quota management?
How do you monitor and manage service quotas?
How do you accommodate fixed service quotas and constraints through architecture?
How do you manage service quotas and constraints across accounts and Regions?
How do you manage service quotas and constraints?
How do you build a program that embeds reliability into workload teams?
Plan your network topology
How do you enforce non-overlapping private IP address ranges in all private address spaces?
How do you prefer hub-and-spoke topologies over many-to-many mesh?
How do you ensure IP subnet allocation accounts for expansion and availability?
How do you provision redundant connectivity between private networks in the cloud and on-premises environments?
How do you use highly available network connectivity for workload public endpoints?
Design your workload service architecture
Provide service contracts per API
Build services focused on specific business domains and functionality
Choose how to segment your workload
Design interactions in a distributed system to prevent failures
Identify which kind of distributed system is required
Implement loosely coupled dependencies
Make all responses idempotent
Do constant work
Design interactions in a distributed system to mitigate or withstand failures
Implement emergency levers
Make services stateless where possible
Set client timeouts
Fail fast and limit queues
Control and limit retry calls
Throttle requests
Implement graceful degradation to transform applicable hard dependencies into soft dependencies
Monitor workload resources
Monitor end-to-end tracing of requests through your system
Conduct reviews regularly
Analytics
Automate responses (Real-time processing and alarming)
Send notifications (Real-time processing and alarming)
Define and calculate metrics (Aggregation)
Design your workload to adapt to changes in demand
Obtain resources upon detection that more resources are needed for a workload
Obtain resources upon detection of impairment to a workload
Load test your workload
Implement change
Use runbooks for standard activities such as deployment
Integrate functional testing as part of your deployment
Integrate resiliency testing as part of your deployment
Deploy using immutable infrastructure
Deploy changes with automation
Back up data
Identify and back up all data that needs to be backed up, or reproduce the data from sources
Secure and encrypt backups
Perform data backup automatically
Perform periodic recovery of the data to verify backup integrity and processes
Fault isolation to protect your workload
Deploy the workload to multiple locations
Select the appropriate locations for your multi-location deployment
Use bulkhead architectures to limit scope of impact
Automate recovery for components constrained to a single location
Design your workload to withstand component failures
Monitor all components of the workload to detect failures
Fail over to healthy resources
Automate healing on all layers
Rely on the data plane and not the control plane during recovery
Use static stability to prevent bimodal behavior
Send notifications when events impact availability
Architect your product to meet availability targets and uptime service level agreements (SLAs)
Test reliability
Test functional requirements
Test scaling and performance requirements
Test resiliency using chaos engineering
Conduct game days regularly
Plan for disaster recovery (DR)
Define recovery objectives for downtime and data loss
Use defined recovery strategies to meet the recovery objectives
Test disaster recovery implementation to validate the implementation
Manage configuration drift at the DR site or Region
Automate recovery