Conduct reviews regularly
Posted: November 29, 2024
Updated: March 22, 2025
By: Kevin McCaffrey
Regular reviews of your workload monitoring implementation are crucial for maintaining your system’s reliability. They ensure that monitoring strategies remain aligned with current workloads, technological changes, and business requirements, which supports timely detection of issues and proactive adjustments.
Best Practices
Implement Regular Monitoring Reviews
- Establish a review schedule (e.g., quarterly or twice a year) to assess current monitoring configurations and metrics.
- Involve relevant stakeholders in the review process to ensure comprehensive coverage of workload performance and reliability.
- Use tools such as Amazon CloudWatch to aggregate logs and metrics so that they are easy to access during reviews.
- Identify key performance indicators (KPIs) and baseline metrics before the reviews to track changes and improvements over time.
- Document significant events and decisions made during reviews to ensure continuous learning and improvement.
- Utilize dashboards to visualize metrics effectively, making it easier to spot trends and areas needing attention.
- Make necessary adjustments to monitoring configurations based on findings from reviews to enhance reliability and performance.
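As a rough illustration of the schedule and baseline practices above, the following Python sketch computes when the next review is due and flags KPIs that have moved away from the baseline recorded at the previous review. It is a minimal sketch under stated assumptions, not a prescribed implementation: the `Kpi` class, the 90-day interval, the 10% tolerance, and the sample metric names and values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Kpi:
    """A KPI tracked across reviews (hypothetical structure)."""
    name: str
    baseline: float   # value recorded at the previous review
    current: float    # value observed for this review
    tolerance: float  # allowed relative change, e.g. 0.10 for 10%


def next_review(last_review: date, interval_days: int = 90) -> date:
    """Return the date the next review is due (assumed default: roughly quarterly)."""
    return last_review + timedelta(days=interval_days)


def drifted(kpis: list[Kpi]) -> list[str]:
    """Return names of KPIs whose relative change from baseline exceeds tolerance."""
    flagged = []
    for k in kpis:
        if k.baseline == 0:
            # Relative change is undefined; flag any nonzero current value.
            if k.current != 0:
                flagged.append(k.name)
            continue
        change = abs(k.current - k.baseline) / abs(k.baseline)
        if change > k.tolerance:
            flagged.append(k.name)
    return flagged


# Illustrative values only; real numbers would come from your monitoring tool.
kpis = [
    Kpi("p99_latency_ms", baseline=200.0, current=260.0, tolerance=0.10),
    Kpi("error_rate_pct", baseline=0.5, current=0.52, tolerance=0.10),
]
print(next_review(date(2025, 3, 22)))
print(drifted(kpis))
```

KPIs flagged this way are candidates for discussion in the review; whether a change calls for a new threshold, a new alarm, or no action remains a judgment for the reviewers.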
Questions to ask your team
- How often do you review the monitoring setup of your workload?
- What criteria do you use to determine if changes are needed in your monitoring strategy?
- Have you documented the significant events that triggered updates to your monitoring processes?
- Are there specific metrics or logs that you prioritize during your reviews?
- How do you ensure that the updated monitoring practices are effectively implemented?
- What tools do you use to facilitate the review process of workload monitoring?
- How do you share insights from your monitoring reviews with your team or stakeholders?
Who should be doing this?
Cloud Operations Engineer
- Set up and maintain monitoring tools for workload resources.
- Analyze logs and metrics to assess the health of workloads.
- Configure alerts for performance thresholds and significant events.
- Conduct regular reviews of monitoring configurations and effectiveness.
- Work with development teams to implement changes based on review findings.
Site Reliability Engineer (SRE)
- Oversee the reliability of production systems.
- Lead the effort to regularly review and update monitoring practices.
- Provide insights on incident management and response based on monitoring data.
- Collaborate with the Cloud Operations Engineer to ensure monitoring aligns with best practices.
- Facilitate post-incident reviews to improve monitoring strategies.
DevOps Manager
- Ensure that team members conduct regular reviews of workload monitoring.
- Champion the importance of monitoring for reliability within the organization.
- Allocate resources for monitoring tools and training.
- Establish policies and procedures for regular monitoring assessments.
- Evaluate the impact of changes in workloads on monitoring effectiveness.
What evidence shows this is happening in your organization?
- Workload Monitoring Review Checklist: A checklist to guide the team through the regular review process of workload monitoring, ensuring all metrics and logs are evaluated and updated according to recent events and performance observations.
- Monitoring and Alerting Policy: A policy document outlining the standards and procedures for monitoring workload resources, including the thresholds for alerts and the protocol for responding to significant events.
- Monthly Monitoring Review Report: A report generated monthly that summarizes workload monitoring metrics, highlighting any significant events and providing recommendations for improvements in monitoring strategies.
- Workload Performance Dashboard: An interactive dashboard displaying real-time metrics of workload performance, enabling teams to quickly identify issues and assess the need for review based on predefined performance thresholds.
- Workload Monitoring Strategy Guide: A comprehensive guide that outlines best practices for configuring logs and metrics, including methods for automatically recovering from failures and ensuring continuous reliability.
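One possible input to a review report like the one described above is a summary of alert history. The Python sketch below groups alarms into those that fired often, those that never fired, and those that fired but are missing from the documented configuration. The function name, the alarm names, and the threshold of 10 firings are hypothetical assumptions; in practice the lists would be exported from your monitoring tool.

```python
from collections import Counter


def summarize_alerts(
    configured: list[str],
    fired: list[str],
    noisy_threshold: int = 10,  # assumed cutoff for a "noisy" alarm
) -> dict[str, list[str]]:
    """Group alarms for review based on how often each fired in the period."""
    counts = Counter(fired)
    return {
        # Fired at least noisy_threshold times: threshold may be too sensitive.
        "noisy": sorted(a for a in configured if counts[a] >= noisy_threshold),
        # Never fired: confirm the alarm still works and is still relevant.
        "silent": sorted(a for a in configured if counts[a] == 0),
        # Fired but not in the documented configuration: documentation gap.
        "unknown": sorted(set(fired) - set(configured)),
    }


# Illustrative data only.
configured = ["cpu_high", "disk_full", "latency_p99"]
fired = ["cpu_high"] * 12 + ["latency_p99"] * 2 + ["old_queue_depth"]
summary = summarize_alerts(configured, fired)
print(summary)
```

Each group maps to a different review question: noisy alarms may need retuned thresholds, silent alarms may need testing or retirement, and unknown alarms indicate that the documented monitoring configuration is out of date.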
Cloud Services
AWS
- Amazon CloudWatch: A monitoring service for AWS cloud resources and the applications you run on AWS. CloudWatch collects logs and metrics, enabling you to set alarms and automate responses to changes in workload performance.
- AWS CloudTrail: Enables governance, compliance, and operational and risk auditing of your AWS account by recording user activity and API calls that change your AWS resources.
- AWS X-Ray: Helps you analyze and debug production applications, providing insights into performance and issues within your application and its resources.
Azure
- Azure Monitor: A comprehensive service that collects, analyzes, and acts on telemetry from your cloud and on-premises environments, enabling you to optimize performance and availability.
- Azure Log Analytics: Part of Azure Monitor, it helps you collect and analyze log and performance data from various sources and deploy alerts for significant events.
Google Cloud Platform
- Google Cloud Monitoring: Collects metrics from Google Cloud resources and applications, allowing you to visualize and alert on performance and health data.
- Google Cloud Logging: Allows you to store and analyze logging data from your applications and infrastructure, making it easier to detect, troubleshoot, and respond to issues.
Question: How do you monitor workload resources?
Pillar: Reliability (Code: REL)