Well Architected Guide
Operational Excellence
Category - Operational Excellence
Determine what your priorities are
Evaluate external customer needs
Evaluate internal customer needs
Evaluate governance requirements
Evaluate compliance requirements
Evaluate threat landscape
Evaluate tradeoffs
Manage benefits and risks
Structure your organization to support your business outcomes
Resources have identified owners
Processes and procedures have identified owners
Operations activities have identified owners responsible for their performance
Team members know what they are responsible for
Mechanisms exist to identify responsibility and ownership
Mechanisms exist to request additions, changes, and exceptions
Responsibilities between teams are predefined or negotiated
Organizational culture to support your business outcomes
Executive Sponsorship
Team members are empowered to take action when outcomes are at risk
Escalation is encouraged
Communications are timely, clear, and actionable
Experimentation is encouraged
Team members are encouraged to maintain and grow their skill sets
Resource teams appropriately
Diverse opinions are encouraged and sought within and across teams
Implement observability in your workload
Identify key performance indicators
Implement application telemetry
Implement user experience telemetry
Implement dependency telemetry
Implement distributed tracing
Reduce defects, ease remediation, and improve flow into production
Use version control
Test and validate changes
Use configuration management systems
Use build and deployment management systems
Perform patch management
Implement practices to improve code quality
Share design standards
Use multiple environments
Make frequent, small, reversible changes
Fully automate integration and deployment
Mitigate deployment risks
Plan for unsuccessful changes
Test deployments
Employ safe deployment strategies
Automate testing and rollback
Be ready to support a workload
Ensure personnel capability
Ensure a consistent review of operational readiness
Use runbooks to perform procedures
Use playbooks to investigate issues
Make informed decisions to deploy systems and changes
Create support plans for production workloads
Utilize workload observability
Create actionable alerts
Analyze workload metrics
Analyze workload logs
Analyze workload traces
Create dashboards
Understand the health of your operations
Measure operations goals and KPIs with metrics
Communicate status and trends to ensure visibility into operation
Review operations metrics and prioritize improvement
Manage workload and operations events
Use a process for event, incident, and problem management
Have a process per alert
Prioritize operational events based on business impact
Define escalation paths
Define a customer communication plan for outages
Communicate status through dashboards
Automate responses to events
Evolve your operations
Have a process for continuous improvement
Perform post-incident analysis
Implement feedback loops
Perform knowledge management
Define drivers for improvement
Validate insights
Perform operations metrics reviews
Document and share lessons learned
Allocate time to make improvements