Reliability
Manage service quotas and constraints
Be aware of service quotas and constraints in cloud services
Manage service quotas across accounts and Regions
Accommodate fixed service quotas and constraints through architecture
Monitor and manage quotas
Automate quota management
Ensure sufficient gap between quotas and usage to accommodate failover
Plan your network topology
Use highly available network connectivity for your workload public endpoints
Provision redundant connectivity between private networks in the cloud and on-premises environments
Ensure IP subnet allocation accounts for expansion and availability
Prefer hub-and-spoke topologies over many-to-many mesh
Enforce non-overlapping private IP address ranges in all private address spaces where they are connected
Design your workload service architecture
Provide service contracts per API
Build services focused on specific business domains and functionality
Choose how to segment your workload
Design interactions in a distributed system to prevent failures
Identify which kind of distributed system is required
Implement loosely coupled dependencies
Make all responses idempotent
Do constant work
Design interactions in a distributed system to mitigate or withstand failures
Implement emergency levers
Make services stateless where possible
Set client timeouts
Fail fast and limit queues
Control and limit retry calls
Throttle requests
Implement graceful degradation to transform applicable hard dependencies into soft dependencies
Monitor workload resources
Monitor end-to-end tracing of requests through your system
Conduct reviews regularly
Analytics
Automate responses (Real-time processing and alarming)
Send notifications (Real-time processing and alarming)
Define and calculate metrics (Aggregation)
Design your workload to adapt to changes in demand
Obtain resources upon detection that more resources are needed for a workload
Obtain resources upon detection of impairment to a workload
Load test your workload
Implement change
Use runbooks for standard activities such as deployment
Integrate functional testing as part of your deployment
Integrate resiliency testing as part of your deployment
Deploy using immutable infrastructure
Deploy changes with automation
Back up data
Identify and back up all data that needs to be backed up, or reproduce the data from sources
Secure and encrypt backups
Perform data backup automatically
Perform periodic recovery of the data to verify backup integrity and processes
Fault isolation to protect your workload
Deploy the workload to multiple locations
Select the appropriate locations for your multi-location deployment
Use bulkhead architectures to limit scope of impact
Automate recovery for components constrained to a single location
Design your workload to withstand component failures
Monitor all components of the workload to detect failures
Fail over to healthy resources
Automate healing on all layers
Rely on the data plane and not the control plane during recovery
Use static stability to prevent bimodal behavior
Send notifications when events impact availability
Architect your product to meet availability targets and uptime service level agreements (SLAs)
Test reliability
Use playbooks to investigate failures
Test functional requirements
Test scaling and performance requirements
Test resiliency using chaos engineering
Conduct game days regularly
Plan for disaster recovery (DR)
Define recovery objectives for downtime and data loss
Use defined recovery strategies to meet the recovery objectives
Test disaster recovery implementation to validate the implementation
Manage configuration drift at the DR site or Region
Automate recovery