Ensure a consistent review of operational readiness

Posted November 7, 2024

Updated December 2, 2024

By Kevin McCaffrey

Ensuring Consistent Review of Operational Readiness
Operational Readiness Reviews (ORRs) validate that a workload can be operated safely in production. An ORR confirms that the processes, people, and systems needed to maintain and support the workload are ready, which reduces the risk of failures and unplanned downtime. By conducting ORRs consistently, teams can certify their workloads for operational stability and identify areas for improvement before deployment.

Conduct Operational Readiness Reviews (ORRs)

Use ORRs to validate that your workload is ready for production. ORRs involve a detailed review and inspection process using a checklist of requirements, covering areas such as monitoring, incident response, backup, and recovery. An ORR helps ensure that all components of the workload are properly configured, maintained, and supported.

Use ORRs for Self-Service Certification

Operational Readiness Reviews are designed to be a self-service experience that teams use to certify their workloads. Teams use a standardized checklist to assess their readiness and ensure that best practices are followed. This empowers teams to take ownership of their workloads and ensure they meet the required standards before going live.

Checklist-Based Validation

ORRs use a checklist of requirements so that readiness is evaluated consistently and completely. The checklist includes items such as monitoring capabilities, incident management protocols, security controls, and performance testing. Working through the checklist lets teams systematically review every critical aspect of the workload, identify gaps, and implement the necessary improvements.
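As an illustration of checklist-based validation, the sketch below models a checklist as plain data and derives the gaps from it. It is a minimal example under stated assumptions, not a prescribed ORR format: the names `ChecklistItem`, `Status`, `find_gaps`, and `is_ready`, as well as the sample requirements, are hypothetical and chosen only to mirror the areas named above.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_REVIEWED = "not_reviewed"


@dataclass(frozen=True)
class ChecklistItem:
    area: str          # e.g. "monitoring", "security controls"
    requirement: str   # what the team must demonstrate
    status: Status = Status.NOT_REVIEWED


def find_gaps(items):
    """Return the requirements that have not passed, grouped by area."""
    gaps = {}
    for item in items:
        if item.status is not Status.PASS:
            gaps.setdefault(item.area, []).append(item.requirement)
    return gaps


def is_ready(items):
    """Ready only if the checklist is non-empty and every item passed."""
    return bool(items) and all(item.status is Status.PASS for item in items)


# Hypothetical sample checklist; real requirements come from your own ORR.
example_checklist = [
    ChecklistItem("monitoring", "Dashboards cover key health metrics", Status.PASS),
    ChecklistItem("incident management", "On-call rotation is staffed", Status.FAIL),
    ChecklistItem("security controls", "Least-privilege roles reviewed", Status.PASS),
    ChecklistItem("performance testing", "Load test run at expected peak"),
]
```

Treating an unreviewed item as a gap, rather than as an implicit pass, keeps the review from certifying a workload simply because part of the checklist was skipped.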

Incorporate Best Practices and Lessons Learned

ORRs incorporate best practices and lessons learned from years of building and operating software at Amazon. These best practices include insights into handling incidents, scaling workloads, securing systems, and maintaining high availability. Using this knowledge, teams can improve their operational processes and be better prepared to handle potential issues.

Validate Safety and Stability

The goal of an ORR is to validate that the workload can be operated safely and reliably. This includes ensuring that all dependencies are functioning correctly, that there are adequate monitoring and alerting mechanisms in place, and that the team is prepared to respond to any incidents. By consistently conducting ORRs, teams can reduce the risk of operational issues and ensure stable workload operation.
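One way to make the "adequate monitoring and alerting" check concrete is to compare the metrics a team considers critical against the alarms that actually exist. The sketch below assumes alarm records shaped like the entries Amazon CloudWatch returns under `MetricAlarms` (with `MetricName`, `ActionsEnabled`, and `AlarmActions` fields); the function name, the list of required metrics, and the SNS topic ARN are hypothetical, and fetching the alarms from an account is left out.

```python
def metrics_without_alerting(required_metrics, alarms):
    """Return required metrics that no actionable alarm covers.

    An alarm counts as actionable only if its actions are enabled and
    it has at least one action (for example, a notification target).
    """
    covered = {
        alarm["MetricName"]
        for alarm in alarms
        if alarm.get("ActionsEnabled", False) and alarm.get("AlarmActions")
    }
    return sorted(set(required_metrics) - covered)


# Hypothetical inputs for illustration only.
required = ["5XXError", "Latency", "CPUUtilization"]
alarms = [
    {
        "MetricName": "Latency",
        "ActionsEnabled": True,
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-topic"],
    },
    # This alarm exists but notifies nobody, so it does not count as coverage.
    {"MetricName": "5XXError", "ActionsEnabled": True, "AlarmActions": []},
]

missing = metrics_without_alerting(required, alarms)
```

An empty result would suggest that every critical metric can page someone; anything else is a gap to record in the review.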

Supporting Questions

What is the process for conducting an Operational Readiness Review (ORR)?
How does an ORR help validate the safety and stability of your workload?
What best practices are included in the ORR checklist to ensure operational readiness?

Roles and Responsibilities

Operations Manager
Responsibilities:

Conduct Operational Readiness Reviews (ORRs) to validate the operational readiness of workloads.
Ensure that all checklist items are completed and that any identified gaps are addressed before deployment.

DevOps Engineer
Responsibilities:

Participate in ORRs to verify that all infrastructure components are configured and monitored correctly.
Ensure that any automation scripts, monitoring tools, or configuration management settings meet the requirements of the ORR checklist.

Service Owner
Responsibilities:

Take ownership of the workload and ensure that it meets all requirements outlined in the ORR checklist.
Implement best practices and lessons learned from previous ORRs to improve workload readiness.

Artifacts

Operational Readiness Checklist: A comprehensive checklist used during ORRs to validate readiness, covering areas such as monitoring, security, incident response, and backup.
Readiness Review Report: A report summarizing the results of the ORR, including identified gaps, action items, and timelines for remediation.
Operational Readiness Certification Document: A document certifying that the workload has passed the ORR and is ready for production deployment.
AWS Operational Readiness Reviews: AWS's published guidance on conducting Operational Readiness Reviews.
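The Readiness Review Report and the certification decision described above can be sketched as a small data structure. This is an illustrative example only: the class names, fields, report layout, and the rule that certification requires zero open action items are assumptions, not a format defined by AWS or by this article.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    gap: str        # the gap identified during the ORR
    owner: str      # who is responsible for remediation
    due: date       # remediation deadline
    done: bool = False


@dataclass
class ReadinessReport:
    workload: str
    review_date: date
    action_items: list = field(default_factory=list)

    def open_items(self):
        return [a for a in self.action_items if not a.done]

    def certified(self):
        # Assumed rule: certify only when no action items remain open.
        return not self.open_items()

    def render(self):
        lines = [
            f"Readiness Review Report: {self.workload} "
            f"({self.review_date.isoformat()})"
        ]
        # List action items by deadline, earliest first.
        for item in sorted(self.action_items, key=lambda a: a.due):
            state = "done" if item.done else "open"
            lines.append(
                f"- [{state}] {item.gap} "
                f"(owner: {item.owner}, due: {item.due.isoformat()})"
            )
        verdict = "yes" if self.certified() else "no"
        lines.append(f"Certified for production: {verdict}")
        return "\n".join(lines)


# Hypothetical workload and action items for illustration.
report = ReadinessReport(
    workload="orders-api",
    review_date=date(2024, 11, 7),
    action_items=[
        ActionItem("No restore test for backups", "service-owner", date(2024, 11, 21)),
        ActionItem("Runbook missing for failover", "devops", date(2024, 11, 14), done=True),
    ],
)
```

Calling `report.render()` produces a plain-text summary that can be attached to the certification document once every item is closed.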

Relevant AWS Tools

Review and Certification Tools

AWS Well-Architected Tool: Provides best practices and guidance for reviewing workload readiness and helps teams understand potential gaps in their workload architecture.
AWS Trusted Advisor: Checks the readiness of AWS resources and provides recommendations on security, performance, cost optimization, and fault tolerance, which can be used as part of an ORR.

Monitoring and Alerting Tools

Amazon CloudWatch: Monitors the health and performance of workloads, providing metrics and alerts to validate operational readiness.
AWS Systems Manager: Automates compliance checks and operational reviews to validate readiness as part of the ORR process.

Incident Management Tools

AWS Systems Manager Incident Manager: Supports incident response through response plans, escalation paths, and runbooks, so an ORR can confirm that the team is prepared to respond to incidents once the workload is in production.
AWS Config: Tracks configuration changes and validates that the current configurations meet readiness requirements, helping identify gaps during ORRs.
