Send notifications when events impact availability

Posted December 20, 2024

Updated March 22, 2025

By Kevin McCaffrey

A reliable notification system is critical for maintaining high availability in cloud workloads. When alerts are sent as soon as a predefined threshold is breached, teams can respond quickly to potential failures. Notifications remain useful even when an issue is mitigated automatically, because they tell the team that a mitigation occurred and that the underlying cause may still need attention.

Best Practices

Implement Monitoring and Alerting Systems

Use Amazon CloudWatch to set alarms on key availability and performance metrics.
Define clear thresholds for alerts that indicate when a service is experiencing issues.
Integrate with Amazon SNS (Simple Notification Service) so that alerts reach the appropriate teams by email, SMS, or other channels.
Regularly review and refine alarm thresholds to reduce false positives and ensure operational relevance.
Develop runbooks that outline immediate steps to be taken when alerts are triggered, enhancing response speed.
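As an illustration of the alarm and notification practices above, the following is a minimal sketch that builds the arguments for the CloudWatch `put_metric_alarm` call and attaches an SNS topic as the alarm action. The alarm name, load balancer identifier, topic ARN, and threshold values are hypothetical placeholders, and the choice of metric (target 5XX responses on an Application Load Balancer) is only an assumption about what "availability" means for a given workload.

```python
from typing import Any


def build_alarm_params(
    alarm_name: str,
    load_balancer: str,
    sns_topic_arn: str,
    threshold: float = 10.0,
) -> dict[str, Any]:
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": alarm_name,
        "AlarmDescription": "Target 5XX responses exceeded the agreed threshold.",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": 5,  # look at the last 5 datapoints
        "DatapointsToAlarm": 3,  # 3 of 5 must breach, which damps brief spikes
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not an outage here
        "AlarmActions": [sns_topic_arn],  # notify when the alarm fires
        "OKActions": [sns_topic_arn],  # notify again on recovery
    }


def create_alarm(params: dict[str, Any]) -> None:
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Hypothetical values; replace with identifiers from your own account.
example_params = build_alarm_params(
    alarm_name="example-alb-5xx",
    load_balancer="app/example-alb/0123456789abcdef",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:example-availability-alerts",
)
```

Keeping the parameters in a plain dictionary makes thresholds easy to review and adjust over time, which supports the practice of refining alarms to reduce false positives.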

Establish Notification Workflows

Create automated workflows using AWS Lambda or Step Functions to handle alerts and initiate recovery processes where applicable.
Ensure notifications include relevant context to help teams understand the impact and necessary remediation steps.
Regularly test notification workflows to ensure timely delivery of alerts, and refine based on feedback.
Incorporate a process to review alert logs to identify patterns and improve monitoring effectiveness over time.
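The workflow practices above could be sketched as an AWS Lambda function subscribed to the SNS topic that receives CloudWatch alarm notifications. The sketch below only enriches each alarm with context; where the enriched text is sent next (chat, paging tool, ticket) is left out. The field names read from the message follow the JSON that CloudWatch alarms publish to SNS, while the `RUNBOOKS` mapping and its URL are hypothetical.

```python
import json
from typing import Any

# Hypothetical mapping from alarm name to the runbook responders should open.
RUNBOOKS = {
    "example-alb-5xx": "https://wiki.example.com/runbooks/alb-5xx",
}


def format_alarm(message: dict[str, Any]) -> str:
    """Turn one CloudWatch alarm message into a human-readable notification."""
    trigger = message.get("Trigger", {})
    alarm_name = message.get("AlarmName", "unknown alarm")
    lines = [
        f"[{message.get('NewStateValue', 'UNKNOWN')}] {alarm_name}",
        f"Region: {message.get('Region', 'unknown')}",
        f"Metric: {trigger.get('Namespace', '?')}/{trigger.get('MetricName', '?')}",
        f"Reason: {message.get('NewStateReason', 'not provided')}",
        f"Changed at: {message.get('StateChangeTime', 'unknown')}",
        f"Runbook: {RUNBOOKS.get(alarm_name, 'no runbook registered')}",
    ]
    return "\n".join(lines)


def handler(event: dict[str, Any], context: Any = None) -> list[str]:
    """Lambda entry point: one formatted notification per SNS record."""
    notifications = []
    for record in event.get("Records", []):
        # SNS delivers the alarm payload as a JSON string in Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        notifications.append(format_alarm(message))
    return notifications
```

An alarm without a registered runbook is reported as such rather than dropped, so gaps in runbook coverage show up in the notifications themselves.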

Continuous Improvement of Alerting Strategies

Conduct post-mortem analyses after incidents to evaluate the effectiveness of alerts and adjust as needed.
Facilitate regular training sessions for teams on how to respond to alerts and use notification tools effectively.
Solicit feedback from team members on alert relevance and response effectiveness to improve collaboration.
Initiate regular planning sessions to assess and enhance the reliability strategy, ensuring it aligns with evolving business needs.
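To make post-incident reviews of alert effectiveness more concrete, a team might summarize its alert history per alarm. The following is a minimal sketch under the assumption that alert records can be exported with an alarm name, a flag for whether the alert required action, and the time taken to acknowledge it. The `AlertRecord` shape and the metric names are hypothetical, not the export format of any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AlertRecord:
    alarm_name: str
    actionable: bool  # did a responder actually need to act?
    minutes_to_ack: float  # time from notification to acknowledgement


def summarize(records: list[AlertRecord]) -> dict[str, dict[str, float]]:
    """Per alarm: how often it fired, how noisy it was, how fast it was acked."""
    grouped: dict[str, list[AlertRecord]] = defaultdict(list)
    for record in records:
        grouped[record.alarm_name].append(record)

    summary: dict[str, dict[str, float]] = {}
    for name, items in grouped.items():
        actionable = sum(1 for r in items if r.actionable)
        summary[name] = {
            "fired": len(items),
            "false_positive_rate": 1 - actionable / len(items),
            "mean_minutes_to_ack": sum(r.minutes_to_ack for r in items) / len(items),
        }
    return summary
```

Alarms with a high false-positive rate are candidates for threshold changes, while slow acknowledgement times may point to routing or training gaps.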

Questions to ask your team

What mechanisms are in place to monitor the health of your components?
Do you have thresholds defined for critical metrics that would trigger notifications?
How are notifications prioritized and escalated in the event of component failures?
Is there a documented process for responding to notifications about availability issues?
How do you ensure that notifications are sent to the appropriate teams or individuals?
Have you tested your notification system to ensure it works under different failure scenarios?
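For the last question, one possible way to exercise the notification path end to end is to force an alarm into the ALARM state with the CloudWatch `set_alarm_state` call and confirm that the expected people are notified. The sketch below assumes a client object exposing that method (for example, a boto3 CloudWatch client). The `RecordingClient` stand-in and the alarm name are hypothetical and exist only so the drill can be rehearsed without touching AWS.

```python
from typing import Any, Protocol


class AlarmClient(Protocol):
    """Anything with a set_alarm_state method, such as a boto3 CloudWatch client."""

    def set_alarm_state(self, **kwargs: Any) -> Any: ...


class RecordingClient:
    """Hypothetical stand-in that records requests instead of calling AWS."""

    def __init__(self) -> None:
        self.calls: list[dict[str, Any]] = []

    def set_alarm_state(self, **kwargs: Any) -> None:
        self.calls.append(kwargs)


def trigger_test_alarm(client: AlarmClient, alarm_name: str) -> dict[str, str]:
    """Force the named alarm into ALARM so its notification actions run."""
    request = {
        "AlarmName": alarm_name,
        "StateValue": "ALARM",
        # State the purpose in the reason so responders can tell it is a drill.
        "StateReason": "Scheduled notification drill, no customer impact.",
    }
    client.set_alarm_state(**request)
    return request


# Dry run against the stand-in; pass boto3.client("cloudwatch") for a real drill.
dry_run_client = RecordingClient()
sent = trigger_test_alarm(dry_run_client, "example-alb-5xx")
```

A state set this way is temporary: the alarm returns to its actual state at its next evaluation, so the drill tests delivery and routing rather than detection.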

Who should be doing this?

Cloud Architect

Design the architecture to ensure reliability and high availability.
Implement monitoring solutions to detect component failures.
Establish notification thresholds for availability events.

DevOps Engineer

Set up the automation for sending notifications upon event detection.
Configure logging and monitoring tools to capture relevant data.
Collaborate with the Cloud Architect to integrate notification systems.

Site Reliability Engineer (SRE)

Develop processes for incident response and management.
Analyze incidents to improve future detection and notification thresholds.
Ensure that the notification system is tested and operating reliably.

Operations Manager

Oversee the implementation of notification protocols.
Coordinate between teams to ensure seamless communication during incidents.
Review and update incident response plans based on notification feedback.

What evidence shows this is happening in your organization?

Incident Notification Policy: A formal policy outlining the processes and thresholds for sending notifications when availability events occur, ensuring timely communication and action.
Availability Monitoring Dashboard: A real-time dashboard that visualizes system performance and alerts the team when thresholds are breached, showcasing all active notifications and their resolution status.
Event Response Playbook: A step-by-step guide detailing how to respond to events impacting availability, including notification procedures, roles of team members, and escalation paths.
Reliability Checklists: Checklists that engineers use to confirm that every component of the workload sends appropriate notifications, covering notification configuration and the tracking of automated resolutions.
Monitoring and Notification Strategy Document: A strategic document that outlines the criteria for monitoring services, the notifications to be sent, and the roles responsible for managing these notifications.

Cloud Services

AWS

Amazon CloudWatch: Monitors your AWS resources and applications in real-time, allowing you to set alarms and receive notifications based on specific metrics.
Amazon SNS (Simple Notification Service): A publish/subscribe messaging service that delivers notifications to subscribers such as email addresses, SMS numbers, HTTP endpoints, and AWS Lambda functions.
AWS Lambda: Enables you to run code in response to events, allowing for automated responses to availability-impacting events.
AWS Config: Provides an AWS resource inventory, configuration history, and configuration change notifications to enable security and governance.

Azure

Azure Monitor: Collects and analyzes telemetry data from Azure resources, allowing you to set alerts based on metrics to maintain high availability.
Azure Logic Apps: Helps automate workflows and send notifications based on certain triggers, which can be tied to resource availability.
Azure Application Insights: A feature of Azure Monitor that provides application performance management and monitoring, helping detect and diagnose availability issues.

Google Cloud Platform

Google Cloud Monitoring: Provides visibility into the performance and uptime of your applications, allowing you to set up alerts and notifications.
Google Cloud Functions: Allows you to execute code in response to cloud events, helping automate actions based on availability-impacting events.
Google Cloud Pub/Sub: A messaging service for event-driven systems that allows you to send notifications based on availability issues in real-time.
