Plan for unsuccessful changes

Planning for Unsuccessful Changes
Planning for unsuccessful changes is critical for maintaining system stability and minimizing the impact of deployment failures. By deciding in advance how to revert to a known good state or remediate issues in production, teams can address the negative outcomes of a deployment quickly and effectively. Policies that require rollback and recovery strategies keep teams prepared for failures and give them a clear path to recovery.

Establish Rollback and Recovery Policies

Create a policy that requires every deployment to include a rollback plan or a remediation strategy. Such a policy keeps teams consistently prepared to handle unsuccessful changes and able to bring systems back to a stable state quickly. Clearly defined rollback and recovery procedures help teams respond confidently and shorten the time it takes to recover from issues.

Deploy with Rollback Steps

Ensure that each deployment includes specific rollback steps. Where possible, use automated rollback mechanisms that return the system to the previous stable version as soon as issues are detected. Automated rollbacks reduce the impact on users because problematic changes are reverted without waiting for manual intervention.
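The rollback flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name deploy_with_rollback and the activate and is_healthy callbacks are hypothetical stand-ins for whatever a real deployment platform provides.

```python
from typing import Callable


def deploy_with_rollback(
    new_version: str,
    previous_version: str,
    activate: Callable[[str], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> str:
    """Activate new_version and revert if any health check fails.

    Returns the version that is left running.
    """
    activate(new_version)
    for _ in range(checks):
        if not is_healthy():
            # Automated rollback: reactivate the last known good version.
            activate(previous_version)
            return previous_version
    return new_version


# In-memory stand-ins (assumptions) that simulate a failing deployment.
history: list[str] = []
result = deploy_with_rollback(
    "v2",
    "v1",
    activate=history.append,
    is_healthy=lambda: False,  # every health check fails
)
print(result, history)
```

In this simulated run the new version is activated, the first health check fails, and the previous version is reactivated without any manual step.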

Use Feature Flags for Incremental Changes

Incorporate feature flags to enable controlled, incremental changes to your workload. Feature flags allow teams to turn new features on or off without needing a new deployment. If an issue is identified, the feature can be disabled, thereby reducing the risk of system failure and providing a quick response to unexpected problems.
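As a rough sketch of the idea, the Python example below guards a new code path behind a flag that can be switched off at runtime. The FeatureFlags class, the flag name new_discount_engine, and the checkout_total function are all hypothetical; a real workload would typically read flag values from a configuration service rather than from memory.

```python
class FeatureFlags:
    """In-memory flag store, used here only for illustration."""

    def __init__(self, defaults: dict[str, bool]) -> None:
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so unreleased features stay disabled.
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def checkout_total(prices_in_cents: list[int], flags: FeatureFlags) -> int:
    total = sum(prices_in_cents)
    if flags.is_enabled("new_discount_engine"):
        # New behavior guarded by the flag: apply a 10% discount.
        total = total * 90 // 100
    return total


flags = FeatureFlags({"new_discount_engine": True})
with_feature = checkout_total([1000, 2000], flags)

# An issue is detected: disable the feature without a new deployment.
flags.set("new_discount_engine", False)
without_feature = checkout_total([1000, 2000], flags)
print(with_feature, without_feature)
```

The same deployed code serves both behaviors, so turning the flag off acts as the rollback for that feature.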

Implement Traffic Isolation and Traffic Shifting

Use traffic isolation and traffic shifting to control the impact of changes. Deploy changes to a subset of users or isolate them to specific regions before rolling them out broadly. This approach allows teams to test changes in production with minimal risk. If an issue arises, traffic can be shifted away from the affected components, providing an opportunity to mitigate the impact and recover quickly.
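One way to picture gradual traffic shifting is the Python sketch below. It assumes a hypothetical set_canary_weight callback that tells a router what percentage of traffic the new version receives, and a hypothetical canary_healthy callback that reports whether the new version is behaving at that percentage. Neither corresponds to a specific AWS API.

```python
from typing import Callable


def shift_traffic(
    steps: list[int],
    set_canary_weight: Callable[[int], None],
    canary_healthy: Callable[[int], bool],
) -> bool:
    """Raise the new version's traffic share one step at a time.

    Returns True if the final step was reached, or False if the rollout
    was aborted and all traffic was shifted back to the stable version.
    """
    for weight in steps:
        set_canary_weight(weight)
        if not canary_healthy(weight):
            # Shift all traffic away from the affected version.
            set_canary_weight(0)
            return False
    return True


# Simulated rollout (assumption): the new version degrades at 50% traffic.
applied: list[int] = []
completed = shift_traffic(
    steps=[5, 25, 50, 100],
    set_canary_weight=applied.append,
    canary_healthy=lambda weight: weight < 50,
)
print(completed, applied)
```

Because exposure grows in steps, a problem that appears at 50% never reaches the remaining users, and recovery is a single weight change back to zero.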

Plan for Multi-Component Changes

When a single release involves multiple related components, ensure that the deployment strategy lets the workload withstand or recover from the failure of any individual component. Use techniques like canary deployments, blue-green deployments, or rolling updates to validate each component separately, and ensure that recovery steps are available for every part of the deployment.
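The sketch below shows one possible way to coordinate a multi-component release in Python. It assumes components are deployed in dependency order and, when one fails, rolls back the failed component and everything deployed before it in reverse order. The function release_components, the component names, and the deploy and rollback callbacks are hypothetical.

```python
from typing import Callable


def release_components(
    components: list[str],
    deploy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy components in order; undo them in reverse order on failure."""
    deployed: list[str] = []
    for name in components:
        if deploy(name):
            deployed.append(name)
            continue
        # The failed component may be partially applied, so it is rolled
        # back first, followed by the earlier components in reverse order.
        for done in reversed(deployed + [name]):
            rollback(done)
        return False
    return True


# Simulated release (assumption): the last component fails to deploy.
rolled_back: list[str] = []
succeeded = release_components(
    ["database-schema", "api", "frontend"],
    deploy=lambda name: name != "frontend",
    rollback=rolled_back.append,
)
print(succeeded, rolled_back)
```

Rolling back in reverse order is a design choice that suits components with dependencies on one another. Independent components could instead be recovered individually.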

Remediate in Production If Necessary

In cases where rolling back is not possible, plan for remediation within the production environment. This may involve hotfixes, adjusting configuration parameters, or other in-place corrective measures. Preparing for in-production remediation ensures that the team can adapt and address issues without taking the entire system offline.

Supporting Questions

  • What policies are in place to plan for unsuccessful changes and ensure rapid recovery?
  • How are rollback steps incorporated into the deployment strategy?
  • What techniques are used to minimize the impact of changes that involve multiple related components?

Roles and Responsibilities

Deployment Engineer
Responsibilities:

  • Define and implement rollback steps for each deployment to ensure a quick return to a stable state if issues are detected.
  • Use traffic isolation and traffic shifting techniques to minimize the impact of changes on users.

Release Manager
Responsibilities:

  • Establish deployment policies that require rollback or remediation plans for every release.
  • Coordinate the deployment of multi-component changes, ensuring that each component can withstand or recover from failure.

DevOps Engineer
Responsibilities:

  • Use feature flags to incrementally roll out changes and provide a mechanism for quickly disabling problematic features.
  • Set up automated rollback mechanisms and monitor deployment health to ensure a quick response to any detected issues.

Artifacts

  • Rollback Plan Document: A document detailing rollback steps, triggers, and procedures for each deployment.
  • Feature Flag Configuration: A list of feature flags used for enabling or disabling features during deployment, including default states and rollback actions.
  • Deployment Recovery Checklist: A checklist outlining the steps required to recover from unsuccessful changes, including rollback, traffic shifting, and in-production remediation actions.

Relevant AWS Tools

Deployment and Rollback Tools

  • AWS CodeDeploy: Supports deployment strategies such as in-place (rolling) and blue-green deployments, and can roll back automatically when a deployment fails or a monitored alarm is triggered.
  • AWS Elastic Beanstalk: Provides managed deployment options with built-in rollback and recovery, allowing teams to revert to a previous, stable version if issues arise.

Traffic Control Tools

  • AWS App Mesh: Routes and shifts traffic between service versions, for example by weight, so that a change reaches only a limited share of traffic and teams can validate it with minimal impact.
  • Amazon Route 53: Manages DNS traffic policies, enabling traffic isolation and shifting in response to deployment success or failure.

Feature Management Tools

  • AWS AppConfig: Allows the use of feature flags to control configuration and enable or disable features without needing a new deployment, providing an additional safety net during releases.
  • AWS Lambda: Can be used to automate feature flag updates and enable immediate changes in response to detected issues.

Monitoring and Alerting Tools

  • Amazon CloudWatch: Monitors the health and performance of deployments, providing metrics and alerts to help detect issues during or after deployments.
  • AWS Systems Manager Incident Manager: Helps coordinate the response to incidents, including running automation runbooks for rollback or remediation to minimize downtime and user impact.