Plan for unsuccessful changes

Posted November 6, 2024

Updated November 7, 2024

By Kevin McCaffrey

Planning for Unsuccessful Changes
Planning for unsuccessful changes is critical for maintaining system stability and minimizing the impact of deployment failures. By creating a strategy to revert to a known good state or remediate issues in production, teams can ensure that any negative outcomes from a deployment are addressed swiftly and effectively. Policies that establish rollback and recovery strategies help teams be prepared for failures and develop clear pathways to recovery.

Establish Rollback and Recovery Policies

Create policies that require every deployment to include a rollback plan or a remediation strategy. This policy ensures that teams are consistently prepared to handle unsuccessful changes and can quickly bring systems back to a stable state. Clearly defining rollback and recovery procedures helps teams respond confidently and minimizes the time it takes to recover from issues.
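One way to enforce such a policy is a gate in the release pipeline that rejects any deployment lacking both a rollback plan and a remediation strategy. The sketch below is a minimal illustration only: the `DeploymentPlan` structure, its field names, and the `policy_violations` helper are hypothetical and do not correspond to any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class DeploymentPlan:
    """Hypothetical release manifest; the field names are illustrative."""
    name: str
    rollback_steps: list[str] = field(default_factory=list)
    remediation_steps: list[str] = field(default_factory=list)


def policy_violations(plan: DeploymentPlan) -> list[str]:
    """Return the reasons a plan fails the policy; an empty list means it passes."""
    violations = []
    if not plan.rollback_steps and not plan.remediation_steps:
        violations.append(
            f"{plan.name}: no rollback plan or remediation strategy defined"
        )
    return violations


good = DeploymentPlan("checkout-v2", rollback_steps=["redeploy previous build"])
bad = DeploymentPlan("search-v5")
print(policy_violations(good))  # []
print(policy_violations(bad))
```

A pipeline could run a check like this before promotion and block the release whenever the returned list is not empty.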

Deploy with Rollback Steps

Ensure that each deployment includes specific rollback steps. Automated rollback mechanisms should be used to quickly return the system to a previous, stable version when issues are detected. Rollbacks reduce the impact on users by rapidly reverting any problematic changes without needing manual intervention.
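The automated rollback described above can be sketched as a small control loop: activate the new version, run a fixed number of health checks, and revert as soon as one fails. This is a simplified sketch under stated assumptions. The `activate` and `is_healthy` callables are placeholders for whatever your deployment tooling and monitoring provide, and the environment here is simulated.

```python
from typing import Callable


def deploy_with_rollback(
    new_version: str,
    previous_version: str,
    activate: Callable[[str], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> str:
    """Activate new_version, then revert automatically if any health check fails.

    Returns the version that is left running.
    """
    activate(new_version)
    for _ in range(checks):
        if not is_healthy():
            activate(previous_version)  # automated rollback, no operator needed
            return previous_version
    return new_version


# Simulated environment: the second health check fails.
history: list[str] = []
results = iter([True, False, True])
running = deploy_with_rollback(
    "v2", "v1", activate=history.append, is_healthy=lambda: next(results)
)
print(running, history)  # v1 ['v2', 'v1']
```

In practice the health signal would usually come from alarms or metrics rather than a fixed number of polls, but the shape of the decision is the same.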

Use Feature Flags for Incremental Changes

Incorporate feature flags to enable controlled, incremental changes to your workload. Feature flags allow teams to turn new features on or off without needing a new deployment. If an issue is identified, the feature can be disabled, thereby reducing the risk of system failure and providing a quick response to unexpected problems.
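To make the idea concrete, here is a minimal in-memory feature flag sketch. The `FeatureFlags` class, the `new_checkout` flag name, and the `checkout` function are all hypothetical; a real system would typically read flag values from a managed configuration service rather than a local dictionary.

```python
class FeatureFlags:
    """In-memory flag store; a real system would read flags from a config service."""

    def __init__(self, defaults: dict[str, bool]):
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags are treated as off, so new code paths stay disabled by default.
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def checkout(flags: FeatureFlags) -> str:
    if flags.is_enabled("new_checkout"):
        return "new checkout flow"
    return "legacy checkout flow"


flags = FeatureFlags({"new_checkout": True})
print(checkout(flags))            # new checkout flow
flags.set("new_checkout", False)  # kill switch: no redeploy required
print(checkout(flags))            # legacy checkout flow
```

Both code paths ship in the same deployment, which is what allows the flag to act as a fast off switch.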

Implement Traffic Isolation and Traffic Shifting

Use traffic isolation and traffic shifting to control the impact of changes. Deploy changes to a subset of users or isolate them to specific regions before rolling them out broadly. This approach allows teams to test changes in production with minimal risk. If an issue arises, traffic can be shifted away from the affected components, providing an opportunity to mitigate the impact and recover quickly.
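As an illustration, the sketch below combines two pieces: a stable hash that assigns each user to a bucket, so the same user always sees the same version at a given weight, and a stepwise shift that drops the canary weight to zero when the error rate crosses a threshold. The function names, the weight steps, and the 1% threshold are assumptions chosen for the example, and the error-rate metric is simulated.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send users whose bucket falls below canary_percent to the new version."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


def shift_traffic(steps, error_rate_at, max_error_rate=0.01):
    """Walk through the weight steps; return 0 if the error rate gets too high."""
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return 0  # shift all traffic back to the stable version
    return steps[-1]


# Simulated metric: errors spike once the canary takes 50% of traffic.
final = shift_traffic([5, 25, 50, 100], lambda p: 0.2 if p >= 50 else 0.001)
print(final)                    # 0
print(route("user-42", final))  # stable
```

The same pattern applies whether the weights are enforced by a load balancer, a service mesh, or DNS; only the mechanism that applies the percentage changes.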

Plan for Multi-Component Changes

When a single release involves multiple related components, ensure the deployment strategy lets the workload withstand, or recover from, the failure of any individual component. Use techniques like canary deployments, blue-green deployments, or rolling updates to validate each component separately, and make sure recovery steps exist for every part of the deployment.
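One possible way to coordinate recovery across components is to deploy them in dependency order and, if any step fails, roll back everything touched so far in reverse order. The sketch below assumes hypothetical `deploy` and `rollback` callables and made-up component names; it is an illustration of the ordering logic, not a prescribed implementation.

```python
def deploy_release(components, deploy, rollback):
    """Deploy components in order; on failure, roll back touched ones in reverse.

    Returns True if every component deployed, False if the release was reverted.
    """
    completed = []
    for component in components:
        try:
            deploy(component)
        except Exception:
            # The failed component may be partially applied, so revert it too.
            for done in reversed(completed + [component]):
                rollback(done)
            return False
        completed.append(component)
    return True


log = []


def fake_deploy(name):
    if name == "api":
        raise RuntimeError("health check failed")
    log.append(f"deploy {name}")


def fake_rollback(name):
    log.append(f"rollback {name}")


ok = deploy_release(["database", "api", "frontend"], fake_deploy, fake_rollback)
print(ok, log)
# False ['deploy database', 'rollback api', 'rollback database']
```

Note that this only works when each component's rollback is safe to run independently; changes that cannot be reversed cleanly, such as some schema migrations, need a remediation path instead.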

Remediate in Production If Necessary

In cases where rolling back is not possible, plan for remediation within the production environment. This may involve hotfixes, adjusting configuration parameters, or other in-place corrective measures. Preparing for in-production remediation ensures that the team can adapt and address issues without taking the entire system offline.

Supporting Questions

What policies are in place to plan for unsuccessful changes and ensure rapid recovery?
How are rollback steps incorporated into the deployment strategy?
What techniques are used to minimize the impact of changes that involve multiple related components?

Roles and Responsibilities

Deployment Engineer
Responsibilities:

Define and implement rollback steps for each deployment to ensure a quick return to a stable state if issues are detected.
Use traffic isolation and traffic shifting techniques to minimize the impact of changes on users.

Release Manager
Responsibilities:

Establish deployment policies that require rollback or remediation plans for every release.
Coordinate the deployment of multi-component changes, ensuring that each component can withstand or recover from failure.

DevOps Engineer
Responsibilities:

Use feature flags to incrementally roll out changes and provide a mechanism for quickly disabling problematic features.
Set up automated rollback mechanisms and monitor deployment health to ensure a quick response to any detected issues.

Artifacts

Rollback Plan Document: A document detailing rollback steps, triggers, and procedures for each deployment.
Feature Flag Configuration: A list of feature flags used for enabling or disabling features during deployment, including default states and rollback actions.
Deployment Recovery Checklist: A checklist outlining the steps required to recover from unsuccessful changes, including rollback, traffic shifting, and in-production remediation actions.

Relevant AWS Tools

Deployment and Rollback Tools

AWS CodeDeploy: Supports deployment strategies such as blue-green deployments and rolling updates, with built-in rollback capabilities to quickly recover from unsuccessful changes.
AWS Elastic Beanstalk: Provides managed deployment options with built-in rollback and recovery, allowing teams to revert to a previous, stable version if issues arise.

Traffic Control Tools

AWS App Mesh: Routes and shifts traffic between service versions, for example with weighted routes, so that changes can be validated on a small share of traffic before a broader rollout.
Amazon Route 53: Manages DNS traffic policies, enabling traffic isolation and shifting in response to deployment success or failure.

Feature Management Tools

AWS AppConfig: Allows the use of feature flags to control configuration and enable or disable features without needing a new deployment, providing an additional safety net during releases.
AWS Lambda: Can be used to automate feature flag updates and enable immediate changes in response to detected issues.

Monitoring and Alerting Tools

Amazon CloudWatch: Monitors the health and performance of deployments, providing metrics and alerts to help detect issues during or after deployments.
AWS Systems Manager Incident Manager: Helps coordinate the response to incidents, including automating rollback or remediation actions to minimize downtime and user impact.
