Send notifications

Real-time notifications and alerts are critical to keeping cloud workloads reliable. When notifications are timely, well targeted, and actionable, teams can respond quickly to potential issues, which minimizes downtime and improves system resilience.

Best Practices

Implement Real-Time Monitoring and Alerts

  • Set up Amazon CloudWatch alarms on key metrics from your workload, and use metric filters to turn important log patterns into metrics you can alarm on.
  • Use Amazon SNS (Simple Notification Service) to notify the appropriate teams when a threshold is breached or an anomaly is detected.
  • Integrate monitoring solutions with incident response tools so notifications reach the right personnel and are actionable.
  • Regularly review and adjust alert thresholds based on workload performance patterns and business requirements.
  • Test your notification system regularly so that alerts are delivered and acted on promptly during real incidents, keeping downtime low and availability high.
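
The first two practices above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive setup: the helper name `build_cpu_alarm_params`, the instance ID, the 80% threshold, and the SNS topic ARN are hypothetical placeholders. The dictionary keys are parameters of the boto3 CloudWatch `put_metric_alarm` call. The API call itself is left commented out so the block runs without AWS credentials.

```python
from typing import Any


def build_cpu_alarm_params(
    instance_id: str,
    topic_arn: str,
    threshold_percent: float = 80.0,
) -> dict[str, Any]:
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "AlarmDescription": "Average CPU utilization above threshold",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 3,   # look at the last 3 datapoints
        "DatapointsToAlarm": 2,   # alarm when 2 of those 3 breach
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        # Notify the same SNS topic on breach and on recovery.
        "AlarmActions": [topic_arn],
        "OKActions": [topic_arn],
    }


# Hypothetical identifiers, for illustration only.
params = build_cpu_alarm_params(
    instance_id="i-0123456789abcdef0",
    topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts",
)

# With credentials configured, the alarm could then be created with:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Requiring 2 of 3 datapoints rather than a single breach is one way to reduce noise from short spikes. The right values depend on the workload and should be revisited when thresholds are reviewed.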

Questions to ask your team

  • Have you set up monitoring for key metrics and logs in your workload?
  • What thresholds have you defined for alerts, and are they regularly reviewed?
  • Which personnel or systems receive the notifications when an issue is detected?
  • How quickly can your team respond to the alerts generated from the monitoring system?
  • Are there automated responses in place for certain alerts to minimize downtime?

Who should be doing this?

Cloud Operations Manager

  • Establish monitoring frameworks for workload resources.
  • Configure logs and metrics for tracking performance and health.
  • Set alert thresholds and conditions for notifications.
  • Oversee the integration of monitoring tools with existing systems.
  • Ensure timely responses to alerts and notifications.

DevOps Engineer

  • Implement monitoring solutions using AWS services (e.g., CloudWatch, X-Ray).
  • Develop and maintain scripts for automated monitoring setups.
  • Continuously analyze logs and metrics to detect anomalies.
  • Coordinate with the Cloud Operations Manager to refine monitoring strategies.

Incident Response Team Member

  • Act upon notifications received regarding performance issues.
  • Investigate and resolve production incidents quickly.
  • Provide feedback on the effectiveness of alerting protocols.
  • Maintain documentation of incident responses and outcomes.

What evidence shows this is happening in your organization?

  • Monitoring Notification Policy: A policy document outlining the protocols for sending notifications and alerts related to performance thresholds and significant events. This policy details the channels of communication, escalation procedures, and responsibilities of personnel.
  • Incident Response Playbook: A comprehensive playbook that guides the organization on how to respond to monitoring alerts, including steps for diagnosis, escalation, and resolution of issues identified by monitoring systems.
  • Monitoring Dashboard Design: A design document for a centralized dashboard that visualizes logs and metrics in real time. This dashboard includes alerts for threshold breaches and a persistent view of workload performance.
  • Alert Threshold Checklist: A checklist used to determine appropriate alert thresholds for various metrics within the environment. This checklist aids in ensuring consistent monitoring and timely alerting.
  • Performance Monitoring Guide: A guide that provides best practices for configuring workload monitoring, including metrics to track, log aggregation techniques, and setting up automatic recovery processes.
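
One item an alert threshold checklist might cover is deriving a starting threshold from historical data. The sketch below shows one possible approach: take a high percentile of recent samples and add headroom. The function names, the nearest-rank percentile method, the 10% headroom factor, and the sample values are all illustrative assumptions, not a prescribed method.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Return the nearest-rank percentile (pct in 1..100) of values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggest_threshold(
    samples: Sequence[float], pct: int = 95, headroom: float = 1.1
) -> float:
    """Suggest an alert threshold: the chosen percentile plus headroom."""
    return percentile(samples, pct) * headroom


# Hypothetical CPU utilization samples (percent) from a review period.
cpu_samples = [40, 42, 45, 47, 50, 52, 55, 58, 60, 90]
baseline = percentile(cpu_samples, 90)
threshold = suggest_threshold(cpu_samples, pct=90)
```

A value produced this way is only a starting point. It should still be checked against business requirements and adjusted as workload patterns change.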

Cloud Services

AWS

  • Amazon CloudWatch: CloudWatch allows you to monitor logs and metrics in real time, set thresholds, and trigger notifications through Amazon SNS when issues arise.
  • AWS Lambda: You can use Lambda functions to respond automatically to CloudWatch alarms, for example to recover resources or to send custom notifications.
  • Amazon Simple Notification Service (SNS): SNS provides a flexible, fully managed messaging service that enables you to send notifications and alerts based on monitoring events.
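
How these three services fit together can be sketched as a Lambda function subscribed to the SNS topic that a CloudWatch alarm publishes to. This is a minimal sketch assuming that wiring. The helper `summarize_alarms`, the alarm name, and the sample event are hypothetical, and the remediation step is left as a comment because it depends on the workload. The fields read from the message (`AlarmName`, `NewStateValue`, `NewStateReason`) are part of the alarm notification that CloudWatch sends through SNS.

```python
import json
from typing import Any


def summarize_alarms(event: dict[str, Any]) -> list[dict[str, str]]:
    """Extract the alarm name, state, and reason from each SNS record."""
    summaries = []
    for record in event.get("Records", []):
        # SNS delivers the CloudWatch alarm payload as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        summaries.append({
            "alarm": message["AlarmName"],
            "state": message["NewStateValue"],
            "reason": message.get("NewStateReason", ""),
        })
    return summaries


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Entry point: list the alarms that currently need remediation."""
    alarms = summarize_alarms(event)
    to_remediate = [a["alarm"] for a in alarms if a["state"] == "ALARM"]
    # A workload-specific recovery action (restart, scale out, page
    # on-call) would be invoked here for each alarm in to_remediate.
    return {"received": len(alarms), "remediate": to_remediate}


# Hypothetical event, shaped like an SNS-to-Lambda invocation.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "AlarmName": "high-cpu-demo",
                "NewStateValue": "ALARM",
                "NewStateReason": "Threshold crossed",
            })
        }
    }]
}
result = lambda_handler(sample_event, None)
```

Keeping the parsing in its own function makes the handler easy to exercise with a sample event, which also supports regular testing of the notification path.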

Azure

  • Azure Monitor: Azure Monitor provides comprehensive monitoring for your workloads, allowing you to collect, analyze, and act on telemetry data. It supports alerts based on custom thresholds.
  • Azure Logic Apps: Logic Apps can integrate with Azure Monitor to create workflows that respond to alert conditions, such as sending notifications when issues are detected.
  • Azure Notification Hubs: Notification Hubs enable you to send push notifications to almost any platform, helping you alert users about critical events in real time.

Google Cloud Platform

  • Google Cloud Monitoring: Google Cloud Monitoring provides visibility into performance and uptime, with the capability to set up alerts that notify you when anomalies occur.
  • Google Cloud Functions: Cloud Functions can be triggered in response to alerts from Monitoring, allowing for automated remediation workflows.
  • Google Cloud Pub/Sub: Pub/Sub is a messaging service that allows you to send real-time notifications based on events detected through monitoring.

Question: How do you monitor workload resources?
Pillar: Reliability (Code: REL)
