Operational Excellence

Determine what your priorities are

Evaluate external customer needs

Evaluate internal customer needs

Evaluate governance requirements

Evaluate compliance requirements

Evaluate threat landscape

Evaluate tradeoffs

Manage benefits and risks

Structure your organization to support your business outcomes

Resources have identified owners

Processes and procedures have identified owners

Operations activities have identified owners responsible for their performance

Team members know what they are responsible for

Mechanisms exist to identify responsibility and ownership

Mechanisms exist to request additions, changes, and exceptions

Responsibilities between teams are predefined or negotiated

Organizational culture to support your business outcomes

Executive sponsorship

Team members are empowered to take action when outcomes are at risk

Escalation is encouraged

Communications are timely, clear, and actionable

Experimentation is encouraged

Team members are encouraged to maintain and grow their skill sets

Resource teams appropriately

Diverse opinions are encouraged and sought within and across teams

Implement observability in your workload

Identify key performance indicators

Implement application telemetry

Implement user experience telemetry

Implement dependency telemetry

Implement distributed tracing

Reduce defects, ease remediation, and improve flow into production

Use version control

Test and validate changes

Use configuration management systems

Use build and deployment management systems

Perform patch management

Implement practices to improve code quality

Share design standards

Use multiple environments

Make frequent, small, reversible changes

Fully automate integration and deployment

Mitigate deployment risks

Plan for unsuccessful changes

Test deployments

Employ safe deployment strategies

Automate testing and rollback

Be ready to support a workload

Ensure personnel capability

Ensure a consistent review of operational readiness

Use runbooks to perform procedures

Use playbooks to investigate issues

Make informed decisions to deploy systems and changes

Create support plans for production workloads

Utilize workload observability

Create actionable alerts

Analyze workload metrics

Analyze workload logs

Analyze workload traces

Create dashboards

Understand the health of your operations

Measure operations goals and KPIs with metrics

Communicate status and trends to ensure visibility into operation

Review operations metrics and prioritize improvement

Manage workload and operations events

Use a process for event, incident, and problem management

Have a process per alert

Prioritize operational events based on business impact

Define escalation paths

Define a customer communication plan for outages

Communicate status through dashboards

Automate responses to events

Evolve your operations

Have a process for continuous improvement

Perform post-incident analysis

Implement feedback loops

Perform knowledge management

Define drivers for improvement

Validate insights

Perform operations metrics reviews

Document and share lessons learned

Allocate time to make improvements

© 2025