Integrate resiliency testing as part of your deployment

Posted December 20, 2024

Updated March 22, 2025

By Kevin McCaffrey

Integrating resiliency testing into the deployment process helps teams make controlled changes while reducing the risk of unforeseen disruptions. It lets them validate how the system behaves under potential failures before a release and maintain application availability during updates or changes.

Best Practices

Implement Automated Resiliency Testing in CI/CD Pipeline

Integrate chaos engineering principles into your automated deployment process to expose system weaknesses.
Use tools like Chaos Monkey or Gremlin to simulate failures and test system response under adverse conditions.
Run resiliency tests in a controlled pre-production environment to avoid impacting production workloads while assessing system stability.
Analyze test results to identify systemic weaknesses and improve system design and architecture.
Document findings and incorporate lessons learned into future deployment planning, ensuring a continuous improvement process.
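
The practices above can be sketched as a small pipeline gate: inject a fault, measure whether the system still meets a steady-state threshold, roll the fault back, and block promotion if any experiment fails. This is a minimal, self-contained illustration rather than a real chaos tool. `FlakyDependency`, `handle_request`, `run_experiment`, `gate`, and the 99% threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    name: str
    success_rate: float
    passed: bool


class FlakyDependency:
    """Hypothetical stand-in for a downstream service."""

    def __init__(self) -> None:
        self.fault_active = False

    def call(self) -> str:
        # While a fault is injected, every call fails.
        if self.fault_active:
            raise ConnectionError("injected fault")
        return "primary"


def handle_request(dep: FlakyDependency) -> str:
    """Hypothetical request handler that degrades gracefully."""
    try:
        return dep.call()
    except ConnectionError:
        return "fallback"


def run_experiment(
    name: str,
    handler: Callable[[FlakyDependency], str],
    dep: FlakyDependency,
    requests: int = 100,
    threshold: float = 0.99,  # assumed steady-state target
) -> ExperimentResult:
    """Inject a fault, send requests, and compare against the threshold."""
    dep.fault_active = True
    ok = 0
    try:
        for _ in range(requests):
            try:
                handler(dep)
                ok += 1
            except Exception:
                pass  # a failed request counts against the success rate
    finally:
        dep.fault_active = False  # always roll the fault back
    rate = ok / requests
    return ExperimentResult(name, rate, rate >= threshold)


def gate(results: List[ExperimentResult]) -> int:
    """Return a process exit code: 0 promotes the build, 1 blocks it."""
    return 0 if all(r.passed for r in results) else 1
```

In a pipeline, a script like this would run against a pre-production environment and pass the value of `gate(...)` to `sys.exit`, so that a failed experiment stops the deployment stage.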

Questions to ask your team

How frequently do you run resiliency tests in your deployment pipeline?
What types of chaos engineering experiments have you implemented?
How do you determine the impact of failures on your workloads during resiliency testing?
Can you provide examples of changes made as a result of insights gained from your resiliency tests?
What tools or frameworks are you using to automate your resiliency testing?
How do you ensure that your team is trained and equipped to conduct these tests effectively?
What metrics do you track to evaluate the success of your resiliency testing?
How are failures during resiliency testing communicated and addressed in your deployment process?

Who should be doing this?

DevOps Engineer

Design and implement automated deployment pipelines that include resiliency testing.
Integrate chaos engineering principles into the deployment process.
Monitor and report on the outcomes of resiliency tests during pre-production deployments.

Quality Assurance Engineer

Develop and execute test plans that incorporate resiliency testing scenarios.
Identify potential failure points in the infrastructure and application components.
Collaborate with developers to ensure code changes do not negatively impact system reliability.

Site Reliability Engineer

Ensure systems are built to handle failures gracefully.
Implement monitoring and alerting to track the impact of changes on system reliability.
Facilitate post-mortem analyses to learn from resiliency testing outcomes.

Project Manager

Coordinate the efforts of various teams involved in implementing resiliency testing.
Manage timelines and deliverables related to the deployment process.
Communicate with stakeholders regarding the goals and results of the resiliency tests.

Software Developer

Write code that adheres to best practices for reliability and resiliency.
Collaborate with DevOps and QA teams to ensure that changes are deployable and tested adequately.
Participate in troubleshooting issues identified during resiliency testing.

What evidence shows this is happening in your organization?

Resiliency Testing Playbook: A comprehensive playbook that outlines the process for integrating resiliency testing into the deployment pipeline, including guidelines on using chaos engineering techniques to validate system robustness in pre-production environments.
Automated Deployment Pipeline Diagram: A visual representation of the automated deployment pipeline, illustrating the integration points for resiliency tests, with clear steps indicating where chaos engineering principles are applied.
Change Management Policy: A formal policy document that outlines how controlled changes are managed within the organization, emphasizing the importance of resiliency testing before production releases.
Resiliency Test Checklist: A detailed checklist that teams must complete before deploying changes, ensuring all required resiliency tests using chaos engineering techniques are executed and documented.
Weekly Resiliency Testing Report: A report that summarizes the outcomes of resiliency tests performed during the week, including metrics on system performance, observed issues, and recommendations for improvement.
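
As one possible starting point for the weekly report, the sketch below aggregates individual test runs into a pass rate, a mean recovery time, and a list of failing experiments. The `ResiliencyRun` record, its fields, and the `summarize` function are assumptions made for illustration. A real report would draw on whatever your test tooling actually records.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class ResiliencyRun:
    experiment: str
    passed: bool
    # Seconds until the system returned to steady state; None if it never did.
    recovery_seconds: Optional[float]


def summarize(runs: List[ResiliencyRun]) -> Dict[str, object]:
    """Aggregate a week's runs into headline report metrics."""
    recovered = [r.recovery_seconds for r in runs if r.recovery_seconds is not None]
    return {
        "total_runs": len(runs),
        "pass_rate": sum(r.passed for r in runs) / len(runs) if runs else 0.0,
        "mean_recovery_seconds": mean(recovered) if recovered else None,
        "failed_experiments": sorted({r.experiment for r in runs if not r.passed}),
    }
```

Tracking the same few metrics every week makes trends visible, for example a recovery time that creeps upward after an architectural change.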

Cloud Services

AWS

AWS Fault Injection Service (formerly AWS Fault Injection Simulator): Allows you to carry out chaos engineering experiments to identify weaknesses in your application, helping improve the system’s resilience to unexpected failures.
AWS CloudFormation: Enables you to manage and provision your infrastructure as code, making it easy to deploy consistent environments that include resiliency testing setups.
AWS CodePipeline: Automates your continuous integration and continuous delivery pipelines, and allows integration of automated testing, including resiliency tests within your deployment process.
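
A pipeline stage could start an AWS FIS experiment from an existing experiment template and wait for it to finish, as in the hedged sketch below. It uses the boto3 `fis` client calls `start_experiment`, `get_experiment`, and `stop_experiment`. The function name `run_fis_experiment`, the polling interval, the timeout, and the set of terminal status names are assumptions to verify against the current FIS documentation.

```python
import time
import uuid

# Status values treated as terminal here (an assumption; check the FIS API reference).
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_fis_experiment(
    fis_client,
    template_id: str,
    poll_seconds: float = 15.0,
    timeout_seconds: float = 1800.0,
) -> str:
    """Start an experiment from a template and return its final status."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    deadline = time.monotonic() + timeout_seconds

    while True:
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            # Stop the experiment instead of leaving faults running.
            fis_client.stop_experiment(id=experiment_id)
            raise TimeoutError(f"experiment {experiment_id} did not finish in time")
        time.sleep(poll_seconds)


# Example usage (requires boto3 and AWS credentials; the template ID is hypothetical):
#   import boto3
#   status = run_fis_experiment(boto3.client("fis"), "EXT-hypothetical-id")
#   raise SystemExit(0 if status == "completed" else 1)
```

Passing the client in as a parameter keeps the function easy to exercise with a stub in unit tests, without touching a real AWS account.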

Azure

Azure Chaos Studio: Enables you to experiment with your applications by simulating failures to improve application reliability and resilience.
Azure Resource Manager: Provides a management layer that simplifies the process of deploying and managing Azure resources, including integration of testing and monitoring for resiliency.
Azure DevOps: Offers comprehensive tools for continuous integration and continuous delivery, including features to implement resiliency testing and manage deployments effectively.

Google Cloud Platform

Chaos engineering on Google Cloud: Tools and practices for simulating failures in systems running on Google Cloud, enabling you to test resiliency and improve the reliability of your applications.
Google Cloud Deployment Manager: Allows you to create, manage, and deploy cloud resources with templates, facilitating the incorporation of testing and resiliency practices in deployment.
Google Cloud Build: A CI/CD platform that automates builds and tests, allowing integration of resiliency tests during the deployment process.

Question: How do you implement change?
Pillar: Reliability (Code: REL)
