Send notifications when events impact availability

A reliable notification system is critical for maintaining high availability in cloud workloads. When alerts are sent as soon as predefined thresholds are breached, teams can respond quickly to potential failures. Notifications should be sent even when an issue is mitigated automatically, so that the underlying cause can still be investigated and overall resilience improves.

Best Practices

  • Define Clear Thresholds: Explicit thresholds for system performance let teams identify issues before they affect users. Configure automatic notifications through AWS services so that responsible personnel are alerted when a threshold is breached and can intervene in time.
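As an illustration of a threshold with an attached notification, the sketch below builds the arguments for CloudWatch's `put_metric_alarm` call. The helper names (`build_alarm_params`, `create_alarm`), the alarm name, the evaluation settings, and the SNS topic ARN are all assumptions chosen for the example, not values from this guidance. Sending both `AlarmActions` and `OKActions` to the same topic is one possible way to notify on breach and on recovery, including recoveries that happened automatically.

```python
from typing import Any, Dict, Optional


def build_alarm_params(
    alarm_name: str,
    namespace: str,
    metric_name: str,
    threshold: float,
    topic_arn: str,
    statistic: str = "Average",
    dimensions: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the CloudWatch put_metric_alarm call.

    All concrete values here (period, evaluation windows) are illustrative
    assumptions and should be tuned per workload.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": name, "Value": value}
            for name, value in (dimensions or {}).items()
        ],
        "Statistic": statistic,
        "Period": 60,                # evaluate the metric in 60-second windows
        "EvaluationPeriods": 3,      # look at the last 3 windows
        "DatapointsToAlarm": 2,      # alarm when 2 of those 3 breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # assumption: missing data is treated as a problem
        "AlarmActions": [topic_arn],      # notify when the threshold is breached
        "OKActions": [topic_arn],         # notify again on recovery
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the parameters can be built without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)


if __name__ == "__main__":
    # Hypothetical instance ID and topic ARN, for illustration only.
    params = build_alarm_params(
        alarm_name="web-cpu-high",
        namespace="AWS/EC2",
        metric_name="CPUUtilization",
        threshold=80.0,
        topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts",
        dimensions={"InstanceId": "i-0123456789abcdef0"},
    )
    print(params)
```

Keeping the parameter construction separate from the API call makes the thresholds easy to review and test without touching an AWS account.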

Supporting Questions

  • Are notifications set up for every critical threshold across all components?

Roles and Responsibilities

  • Operations Team: Responsible for managing alerts, ensuring that the right stakeholders are notified, and interpreting the data from notifications to facilitate quick remediation.

Artifacts

  • Notification Policies Document: A comprehensive document detailing the thresholds, notification protocols, and escalation procedures in place for real-time monitoring of workload performance.

Cloud Services

AWS

  • Amazon CloudWatch: CloudWatch enables real-time monitoring of AWS resources and applications, allowing for the creation of alarms that send notifications when metrics cross specified thresholds, enhancing the overall reliability of the workload.
  • AWS Lambda: Lambda can automate the response to events detected by CloudWatch alarms. For example, an alarm can invoke a function that notifies teams through channels such as email or SMS.
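One possible wiring, sketched below, assumes the CloudWatch alarm publishes to an Amazon SNS topic and that a Lambda function is subscribed to that topic. In that setup the alarm details arrive as a JSON string in `Records[].Sns.Message`. The function names and the message layout are illustrative assumptions. The sketch only formats the notification text; forwarding it to a chat tool, paging system, or other channel is left out.

```python
import json
from typing import Any, Dict, List


def format_alarm_message(alarm: Dict[str, Any]) -> str:
    """Turn a CloudWatch alarm notification into a one-line summary."""
    name = alarm.get("AlarmName", "unknown alarm")
    state = alarm.get("NewStateValue", "UNKNOWN")
    reason = alarm.get("NewStateReason", "no reason given")
    changed_at = alarm.get("StateChangeTime", "unknown time")
    return f"[{state}] {name} at {changed_at}: {reason}"


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, List[str]]:
    """Entry point for an SNS-triggered Lambda function."""
    notifications: List[str] = []
    for record in event.get("Records", []):
        # SNS delivers the alarm payload as a JSON-encoded string.
        alarm = json.loads(record["Sns"]["Message"])
        notifications.append(format_alarm_message(alarm))

    for line in notifications:
        # Printed output ends up in the function's CloudWatch Logs stream.
        print(line)

    return {"notifications": notifications}


if __name__ == "__main__":
    # Hand-built sample event for local experimentation.
    sample_event = {
        "Records": [
            {
                "Sns": {
                    "Message": json.dumps(
                        {
                            "AlarmName": "web-cpu-high",
                            "NewStateValue": "ALARM",
                            "NewStateReason": "Threshold Crossed",
                            "StateChangeTime": "2024-01-01T00:00:00.000+0000",
                        }
                    )
                }
            }
        ]
    }
    lambda_handler(sample_event, None)
```

Because the handler also receives `OK` state changes when the topic is used for recovery actions, the same function can report issues that resolved on their own.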

Question: How do you design your workload to withstand component failures?
Pillar: Reliability (Code: REL)
