Perform post-incident analysis

Testing reliability validates the resilience of your workload, and post-incident analysis is part of that testing. Reviewing each customer-impacting event shows what contributed to it, improves how you respond to the next one, and produces preventive actions that reduce the likelihood of a repeat.

Best Practices

  • Document Incident Conclusions: After an incident, capture detailed findings, including the timeline and an impact assessment. Use these records to create action items, define preventive strategies, and revise procedures. Without this documentation, the lessons from an incident rarely reach your recovery processes.
  • Establish Clear Communication Channels: Develop a communication plan to share findings from the post-incident analysis. Tailor the message for different stakeholders to ensure understanding and to foster a culture of transparency. This helps in aligning teams and managing expectations effectively.
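The two practices above can be sketched as a single incident record that holds the timeline, impact, and action items, and renders a different summary per audience. This is a minimal illustration, not a prescribed format: the class names, field names, and the two audiences ("customers" and "engineering") are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ActionItem:
    description: str
    owner: str
    due: datetime
    done: bool = False


@dataclass
class IncidentRecord:
    # All field names here are hypothetical; adapt them to your own process.
    title: str
    started: datetime      # when customer impact began
    detected: datetime     # when the team became aware
    mitigated: datetime    # when customer impact ended
    impact: str            # plain-language impact assessment
    timeline: list = field(default_factory=list)      # (timestamp, note) pairs
    action_items: list = field(default_factory=list)  # ActionItem objects

    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    def time_to_mitigate(self) -> timedelta:
        return self.mitigated - self.started

    def summary_for(self, audience: str) -> str:
        """Return a short summary for 'customers', or a detailed one otherwise."""
        minutes = int(self.time_to_mitigate().total_seconds() // 60)
        if audience == "customers":
            return f"{self.title}: {self.impact} Impact lasted {minutes} minutes."
        # Engineering view: durations, ordered timeline, and open action items.
        lines = [
            f"{self.title} (detected after {self.time_to_detect()}, "
            f"mitigated after {self.time_to_mitigate()})"
        ]
        lines += [f"{ts.isoformat()} {note}" for ts, note in sorted(self.timeline)]
        lines += [
            f"TODO [{a.owner}] {a.description}"
            for a in self.action_items
            if not a.done
        ]
        return "\n".join(lines)
```

Keeping one record and deriving each message from it means the customer-facing and internal summaries cannot drift apart on basic facts such as duration.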

Supporting Questions

  • Are there documented procedures for handling customer-impacting incidents?
  • Do teams regularly review past incidents and their resolutions?

Roles and Responsibilities

  • Incident Response Team: Responsible for conducting incident analysis, developing insights, and documenting findings. This team plays a critical role in understanding failures and identifying corrective measures.
  • Operations Manager: Oversees the implementation of incident response procedures and ensures that post-incident analyses are conducted regularly and effectively.

Artifacts

  • Incident Report: A detailed report that outlines incidents, their impact, contributing factors, and action items. This document is crucial for future reference and for informing stakeholders.
  • Mitigation Plan: A structured plan that lists the risks identified during analysis and the strategy for mitigating each one, so that similar incidents are less likely to recur.
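One way to give a mitigation plan structure is to score each identified risk and order the plan by that score. The sketch below assumes a simple likelihood-times-impact score on 1 to 5 scales; that scoring scheme and all names in the code are illustrative assumptions, not something this guidance prescribes.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    likelihood: int   # 1 (rare) to 5 (frequent); hypothetical scale
    impact: int       # 1 (minor) to 5 (severe); hypothetical scale
    mitigation: str
    owner: str

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def build_mitigation_plan(risks):
    """Order risks so the highest-scoring ones are addressed first."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


def render_plan(risks) -> str:
    """Render the ordered plan as plain text, one risk per line."""
    lines = []
    for rank, risk in enumerate(build_mitigation_plan(risks), start=1):
        lines.append(
            f"{rank}. [{risk.score:>2}] {risk.description} -> "
            f"{risk.mitigation} ({risk.owner})"
        )
    return "\n".join(lines)
```

Assigning an owner to every entry ties each mitigation back to an action item from the incident report.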

Cloud Services

AWS

  • AWS CloudTrail: Records API calls and account activity within the cloud environment, so teams can review who made which change and when during incident analysis.
  • AWS Config: Records resource configuration changes over time, which helps identify what changed in the lead-up to an incident.
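As an illustration of using CloudTrail during analysis, the sketch below lists write (non-read-only) API events recorded in a window before an incident began, using the boto3 `lookup_events` paginator. The function names and the two-hour default lookback are assumptions for the example. It also assumes AWS credentials and a default region are configured, and that the events of interest are management events still within the event history that `lookup_events` can search.

```python
from datetime import datetime, timedelta, timezone


def lookup_window(incident_start, lookback=timedelta(hours=2)):
    """Build lookup_events arguments for write events before the incident."""
    return {
        "StartTime": incident_start - lookback,
        "EndTime": incident_start,
        # CloudTrail accepts a single lookup attribute per request.
        "LookupAttributes": [
            {"AttributeKey": "ReadOnly", "AttributeValue": "false"}
        ],
    }


def changes_before(incident_start, lookback=timedelta(hours=2)):
    """Yield (time, user, event name) for matching CloudTrail events."""
    import boto3  # imported here so lookup_window works without the SDK

    client = boto3.client("cloudtrail")
    paginator = client.get_paginator("lookup_events")
    for page in paginator.paginate(**lookup_window(incident_start, lookback)):
        for event in page["Events"]:
            yield (
                event["EventTime"],
                event.get("Username", "unknown"),
                event["EventName"],
            )


# Example usage (requires AWS credentials):
# start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# for when, who, what in changes_before(start):
#     print(when.isoformat(), who, what)
```

The resulting list of changes can be copied into the incident timeline and cross-checked against the configuration history in AWS Config.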

Question: How do you test reliability?
Pillar: Reliability (Code: REL)
