Make frequent, small, reversible changes
Making Frequent, Small, and Reversible Changes
Implementing frequent, small, and reversible changes helps reduce the scope and impact of changes in your workload. By breaking changes into smaller pieces, you can more easily identify and address issues, making troubleshooting and remediation more effective. When integrated with change management, configuration management, and build and delivery systems, these practices support faster rollbacks and improve overall reliability.
Reduce Scope with Small Changes
Break changes into small, manageable pieces to minimize the scope of each individual change. Small changes are less complex, easier to understand, and involve fewer components. This reduces the likelihood of unintended consequences and makes it easier to identify which change may have caused an issue if a problem arises.
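One way to keep changes small is to measure them before review. The sketch below is illustrative only: it parses text in the format produced by `git diff --numstat` and compares the totals against limits. The names `MAX_FILES`, `MAX_LINES`, `change_size`, and `is_small_change` are hypothetical, and the limits are assumptions rather than recommended values.

```python
from dataclasses import dataclass

# Hypothetical limits; tune them to what your team can review comfortably.
MAX_FILES = 10
MAX_LINES = 200


@dataclass
class ChangeSize:
    files: int
    lines: int


def change_size(numstat: str) -> ChangeSize:
    """Summarize `git diff --numstat` output (added<TAB>deleted<TAB>path)."""
    files = 0
    lines = 0
    for row in numstat.splitlines():
        parts = row.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        added, deleted = parts[0], parts[1]
        files += 1
        # Binary files are reported as "-"; count the file but no lines.
        if added.isdigit() and deleted.isdigit():
            lines += int(added) + int(deleted)
    return ChangeSize(files=files, lines=lines)


def is_small_change(size: ChangeSize) -> bool:
    """True when the change stays within both hypothetical limits."""
    return size.files <= MAX_FILES and size.lines <= MAX_LINES


sample = "12\t3\tapp/handlers.py\n-\t-\tdocs/diagram.png\n40\t0\ttests/test_handlers.py\n"
size = change_size(sample)
print(size, is_small_change(size))  # ChangeSize(files=3, lines=55) True
```

A check like this could run as an advisory step in a pipeline, prompting authors to split a change instead of blocking it outright.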
Increase Deployment Frequency
Deploy changes frequently to continuously iterate and improve the workload. Frequent deployments mean that changes are smaller and more focused, which helps identify issues early in the development lifecycle. By deploying often, teams also become more comfortable with the process, making it smoother and more reliable over time.
Ensure Changes Are Reversible
Make sure that each change is reversible, allowing teams to quickly roll back changes if issues are detected. Reversible changes help minimize the impact of any defects, enabling rapid restoration of stability in the production environment. This approach reduces downtime and improves user experience by allowing for a quick response to unexpected problems.
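As a minimal sketch of reversibility, the example below keeps a snapshot of every applied settings version so that the latest change can be undone in one step. The class `ReversibleSettings` and the setting names are hypothetical; a real workload would more likely rely on versioned deployments or a feature-flag service, but the idea is the same.

```python
from __future__ import annotations

from typing import Any


class ReversibleSettings:
    """Keeps every applied version so the latest change can be undone."""

    def __init__(self, initial: dict[str, Any]) -> None:
        self._history: list[dict[str, Any]] = [dict(initial)]

    @property
    def current(self) -> dict[str, Any]:
        # Return a copy so callers cannot mutate stored history.
        return dict(self._history[-1])

    def apply(self, change: dict[str, Any]) -> None:
        # Store a new full snapshot instead of mutating the previous one.
        self._history.append({**self._history[-1], **change})

    def rollback(self) -> bool:
        """Revert the most recent change; return False if nothing to undo."""
        if len(self._history) == 1:
            return False
        self._history.pop()
        return True


settings = ReversibleSettings({"new_checkout": False, "timeout_s": 30})
settings.apply({"new_checkout": True})  # small, isolated change
settings.rollback()                     # issue detected, so revert it
print(settings.current)  # {'new_checkout': False, 'timeout_s': 30}
```

Because each change is stored as its own snapshot, a rollback never has to reconstruct the previous state by hand.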
Integrate with Change Management and CI/CD Systems
Use change management, configuration management, and CI/CD systems to implement and track frequent changes. Automated systems help validate changes, track configuration updates, and manage the build and deployment process. By integrating frequent changes into an automated pipeline, teams can maintain consistency, reduce manual errors, and gain confidence in the changes they deploy.
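The flow described above can be sketched as a tiny release function: validate before deploying, deploy, validate again, and roll back automatically if a post-deployment check fails. This is a simplified model, not how any particular CI/CD service works; `release` and the check names are hypothetical.

```python
from typing import Callable

Check = tuple[str, Callable[[], bool]]


def release(
    pre_checks: list[Check],
    deploy: Callable[[], None],
    post_checks: list[Check],
    rollback: Callable[[], None],
) -> list[str]:
    """Run checks around a deployment and roll back on a failed post-check."""
    log: list[str] = []
    for name, check in pre_checks:
        if not check():
            log.append(f"{name}: failed, nothing deployed")
            return log
        log.append(f"{name}: ok")
    deploy()
    log.append("deploy: done")
    for name, check in post_checks:
        if not check():
            log.append(f"{name}: failed")
            rollback()
            log.append("rollback: done")
            return log
        log.append(f"{name}: ok")
    return log


state = {"version": "v1"}
log = release(
    pre_checks=[("unit tests", lambda: True)],
    deploy=lambda: state.update(version="v2"),
    post_checks=[("smoke test", lambda: False)],  # simulate a bad release
    rollback=lambda: state.update(version="v1"),
)
print(log)
print(state)  # {'version': 'v1'}
```

The returned log doubles as a simple audit trail of what the pipeline did for this change.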
Improve Troubleshooting and Remediation
Frequent, small changes make troubleshooting easier because they reduce the number of variables that could be causing an issue. With smaller changes, teams can quickly isolate and remediate the root cause of a problem. If an issue cannot be fixed quickly, rolling back the small change restores a known-good state while the investigation continues.
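When changes are small and deployed in order, the culprit can be found with a binary search over the change history, similar in spirit to `git bisect`. The sketch below is illustrative and assumes that once a bad change lands, every later state stays unhealthy. The function `first_bad_change` and the `healthy_after` callback are hypothetical names.

```python
from typing import Callable, Optional, Sequence


def first_bad_change(
    changes: Sequence[str],
    healthy_after: Callable[[int], bool],
) -> Optional[str]:
    """Return the first change after which the workload is unhealthy.

    healthy_after(i) reports whether the workload is healthy with
    changes[0] through changes[i] applied. Returns None if every
    state is healthy.
    """
    lo, hi = 0, len(changes)
    while lo < hi:
        mid = (lo + hi) // 2
        if healthy_after(mid):
            lo = mid + 1  # the culprit, if any, is later
        else:
            hi = mid  # the culprit is at mid or earlier
    return changes[lo] if lo < len(changes) else None


changes = [f"c{i}" for i in range(1, 9)]  # c1 .. c8, oldest first
culprit = first_bad_change(changes, lambda i: i < 5)  # c6 introduced the fault
print(culprit)  # c6
```

The number of health checks grows roughly with the logarithm of the number of changes, so even a long history of small changes is quick to search.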
Supporting Questions
- How are changes broken down into smaller, manageable pieces?
- How does deploying frequent changes improve the reliability of the workload?
- What systems are in place to ensure that changes are reversible and easy to roll back?
Roles and Responsibilities
Developer
Responsibilities:
- Implement changes in small, manageable increments and ensure that changes are reversible.
- Validate changes in a testing environment before they are deployed to production.
DevOps Engineer
Responsibilities:
- Integrate frequent, small changes into the CI/CD pipeline to automate testing, validation, and deployment.
- Use change management systems to track changes and ensure that they can be rolled back if necessary.
Change Manager
Responsibilities:
- Review and approve small changes as part of the change management process.
- Ensure that changes are tracked and that rollbacks are available if issues arise during deployment.
Artifacts
- Change Log: A record of all changes made, including the status of each change, any incidents encountered, and details of any rollbacks performed.
- Rollback Procedure Document: A document that outlines the steps required to revert changes in the event of an issue.
- Change Management Policy: A policy defining best practices for making small, frequent, and reversible changes, including guidelines for approvals and deployment.
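A change log entry could be modeled as a small record that carries the fields listed above. This is a hypothetical structure for illustration; the field names, the `Status` values, and the `rollback_rate` helper are assumptions, not part of any AWS service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    DEPLOYED = "deployed"
    ROLLED_BACK = "rolled_back"


@dataclass
class ChangeRecord:
    change_id: str
    summary: str
    status: Status
    incidents: list[str] = field(default_factory=list)
    rollback_notes: str = ""
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def rollback_rate(log: list[ChangeRecord]) -> float:
    """Fraction of recorded changes that ended in a rollback."""
    if not log:
        return 0.0
    rolled_back = sum(1 for r in log if r.status is Status.ROLLED_BACK)
    return rolled_back / len(log)


change_log = [
    ChangeRecord("CHG-1", "Raise request timeout", Status.DEPLOYED),
    ChangeRecord("CHG-2", "Enable new checkout flow", Status.ROLLED_BACK,
                 incidents=["checkout errors"], rollback_notes="flag disabled"),
    ChangeRecord("CHG-3", "Fix checkout validation", Status.DEPLOYED),
    ChangeRecord("CHG-4", "Re-enable new checkout flow", Status.DEPLOYED),
]
print(rollback_rate(change_log))  # 0.25
```

Tracking the rollback rate over time gives a rough signal of whether changes are getting safer as they get smaller.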
Relevant AWS Tools
Change Management and Tracking Tools
- AWS Systems Manager Change Manager: Provides a framework for requesting, approving, implementing, and reporting on operational changes, giving each small change an auditable record.
- AWS Config: Records configuration changes across AWS resources, providing a history of what changed and when. This history helps teams identify which change to revert; AWS Config does not perform the rollback itself.
CI/CD and Automation Tools
- AWS CodePipeline: Automates the build, test, and deployment process, allowing for frequent and consistent deployment of small changes.
- AWS CodeDeploy: Deploys changes to applications in a controlled manner and can roll back automatically when a deployment fails or a monitored alarm is triggered.
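As a hedged illustration of CodeDeploy's automatic rollback, the sketch below builds the rollback-related arguments for a deployment group as a plain dictionary. The helper `rollback_settings` and the alarm name are hypothetical. The dictionary keys follow the `autoRollbackConfiguration` and `alarmConfiguration` parameters of the CodeDeploy API, but verify them against the current API reference before relying on them.

```python
def rollback_settings(alarm_names: list[str]) -> dict:
    """Build rollback-related arguments for a CodeDeploy deployment group."""
    events = ["DEPLOYMENT_FAILURE"]
    if alarm_names:
        # Only stop and roll back on alarms when alarms are configured.
        events.append("DEPLOYMENT_STOP_ON_ALARM")
    return {
        "autoRollbackConfiguration": {"enabled": True, "events": events},
        "alarmConfiguration": {
            "enabled": bool(alarm_names),
            "alarms": [{"name": name} for name in alarm_names],
        },
    }


# "checkout-5xx-errors" is a hypothetical CloudWatch alarm name.
config = rollback_settings(["checkout-5xx-errors"])
print(config["autoRollbackConfiguration"]["events"])

# Assumed usage with boto3 (not executed here):
#   boto3.client("codedeploy").update_deployment_group(
#       applicationName="...", currentDeploymentGroupName="...", **config)
```

Keeping these settings in code makes the rollback behavior itself a small, reviewable, and reversible change.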
Monitoring and Troubleshooting Tools
- Amazon CloudWatch: Collects metrics and logs and raises alarms, so teams can track the impact of each small change and detect regressions soon after deployment.
- AWS X-Ray: Helps trace requests through applications, making it easier to troubleshoot issues that occur after a small change is deployed.