Send notifications
Posted March 21, 2025
Updated March 22, 2025
By Kevin McCaffrey
Real-time notifications and alerts are critical for maintaining the reliability of workloads in the cloud. Notifications that are well tuned and routed to the right people let organizations respond quickly to potential issues, which minimizes downtime and improves system resilience.
Best Practices
Implement Real-Time Monitoring and Alerts
- Set up Amazon CloudWatch alarms on key metrics from your workload, including metrics derived from logs through metric filters.
- Use Amazon SNS (Simple Notification Service) to send notifications to the appropriate teams when a threshold is breached or an anomaly is detected.
- Integrate monitoring solutions with incident response tools so notifications reach the right personnel and are actionable.
- Regularly review and adjust alert thresholds based on workload performance patterns and business requirements.
- Test your notification system regularly to ensure timely responses during actual incidents. This is important to minimize downtime and maintain high availability.
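As a minimal sketch of the first two practices, the snippet below builds the keyword arguments for a CloudWatch alarm that notifies an SNS topic. The alarm name, the EC2 CPU metric, the 80 percent threshold, and the helper names `build_cpu_alarm` and `create_alarm` are illustrative assumptions, not values taken from this article. The parameter names follow the CloudWatch `PutMetricAlarm` API.

```python
from typing import Any, Dict


def build_cpu_alarm(instance_id: str, topic_arn: str,
                    threshold: float = 80.0) -> Dict[str, Any]:
    """Return keyword arguments for the CloudWatch PutMetricAlarm API.

    The metric, period, and threshold are example values (assumptions);
    adjust them to your workload.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # seconds covered by each datapoint
        "EvaluationPeriods": 2,     # consecutive breaching periods required
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # notify the SNS topic on ALARM
        "OKActions": [topic_arn],     # notify again on recovery
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the alarm definition separate from the API call makes it possible to review and unit test thresholds without touching a live account.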
Questions to ask your team
- Have you set up monitoring for key metrics and logs in your workload?
- What thresholds have you defined for alerts, and are they regularly reviewed?
- Which personnel or systems receive the notifications when an issue is detected?
- How quickly can your team respond to the alerts generated from the monitoring system?
- Are there automated responses in place for certain alerts to minimize downtime?
Who should be doing this?
Cloud Operations Manager
- Establish monitoring frameworks for workload resources.
- Configure logs and metrics for tracking performance and health.
- Set alert thresholds and conditions for notifications.
- Oversee the integration of monitoring tools with existing systems.
- Ensure timely responses to alerts and notifications.
DevOps Engineer
- Implement monitoring solutions using AWS services (e.g., CloudWatch, X-Ray).
- Develop and maintain scripts for automated monitoring setups.
- Continuously analyze logs and metrics to detect anomalies.
- Coordinate with the Cloud Operations Manager to refine monitoring strategies.
Incident Response Team Member
- Act upon notifications received regarding performance issues.
- Investigate and resolve production incidents quickly.
- Provide feedback on the effectiveness of alerting protocols.
- Maintain documentation of incident responses and outcomes.
What evidence shows this is happening in your organization?
- Monitoring Notification Policy: A policy document outlining the protocols for sending notifications and alerts related to performance thresholds and significant events. This policy details the channels of communication, escalation procedures, and responsibilities of personnel.
- Incident Response Playbook: A comprehensive playbook that guides the organization on how to respond to monitoring alerts, including steps for diagnosis, escalation, and resolution of issues identified by monitoring systems.
- Monitoring Dashboard Design: A design document for a centralized dashboard that visualizes logs and metrics in real-time. This dashboard includes alerts for threshold breaches and a persistent view of workload performance.
- Alert Threshold Checklist: A checklist used to determine appropriate alert thresholds for various metrics within the environment. This checklist aids in ensuring consistent monitoring and timely alerting.
- Performance Monitoring Guide: A guide that provides best practices for configuring workload monitoring, including metrics to track, log aggregation techniques, and setting up automatic recovery processes.
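One item an alert threshold checklist might include is deriving a starting threshold from observed data instead of guessing. The sketch below is one possible approach, not a prescribed method: it takes a nearest-rank percentile of recent samples and adds headroom. The function name `suggest_threshold` and the default percentile and headroom are assumptions for illustration.

```python
import math
from typing import Sequence


def suggest_threshold(samples: Sequence[float],
                      percentile: float = 99.0,
                      headroom: float = 1.10) -> float:
    """Suggest an alert threshold from historical metric samples.

    Uses the nearest-rank percentile of the samples, multiplied by a
    headroom factor so that normal peaks do not trigger alerts.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank method: 1-based rank of the requested percentile.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1] * headroom
```

Rerunning a calculation like this during each threshold review gives the team a consistent baseline to compare against the thresholds currently in place.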
Cloud Services
AWS
- Amazon CloudWatch: CloudWatch allows you to monitor logs and metrics in real-time, set thresholds, and trigger notifications through Amazon SNS when issues arise.
- AWS Lambda: You can use Lambda functions to automatically respond to CloudWatch alarms, enabling automated recovery of resources or notifications.
- Amazon Simple Notification Service (SNS): SNS provides a flexible, fully managed messaging service that enables you to send notifications and alerts based on monitoring events.
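To illustrate how these three services can fit together, here is a hedged sketch of a Lambda handler subscribed to the SNS topic that receives CloudWatch alarm notifications. It assumes the standard SNS event shape, in which each record carries the alarm as a JSON string under `Sns.Message`. The function names and the print-based action are placeholders for your own paging or remediation logic.

```python
import json
from typing import Any, Dict, List


def summarize_alarm(sns_message: str) -> Dict[str, str]:
    """Extract the fields an on-call engineer needs from an alarm message."""
    alarm = json.loads(sns_message)
    return {
        "alarm": alarm["AlarmName"],
        "state": alarm["NewStateValue"],
        "reason": alarm.get("NewStateReason", ""),
    }


def handler(event: Dict[str, Any], context: Any) -> List[Dict[str, str]]:
    """Lambda entry point for an SNS subscription (hypothetical wiring)."""
    summaries = []
    for record in event.get("Records", []):
        summary = summarize_alarm(record["Sns"]["Message"])
        if summary["state"] == "ALARM":
            # Placeholder: page the on-call engineer or start remediation here.
            print(f"ALERT {summary['alarm']}: {summary['reason']}")
        summaries.append(summary)
    return summaries
```

Returning the parsed summaries keeps the handler easy to test locally with a hand-built event before it is deployed.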
Azure
- Azure Monitor: Azure Monitor provides comprehensive monitoring for your workloads, allowing you to collect, analyze, and act on telemetry data. It supports alerts based on custom thresholds.
- Azure Logic Apps: Logic Apps can integrate with Azure Monitor to create workflows that respond to alert conditions, such as sending notifications when issues are detected.
- Azure Notification Hubs: Notification Hubs enable you to send push notifications to almost any platform, helping you alert users about critical events in real-time.
Google Cloud Platform
- Google Cloud Monitoring: Google Cloud Monitoring provides visibility into performance and uptime, with the capability to set up alerts that notify you when anomalies occur.
- Google Cloud Functions: Cloud Functions can be triggered in response to alerts from Monitoring, allowing for automated remediation workflows.
- Google Cloud Pub/Sub: Pub/Sub is a messaging service that allows you to send real-time notifications based on events detected through monitoring.
Question: How do you monitor workload resources?
Pillar: Reliability (Code: REL)