Conduct reviews regularly

Regular reviews of your workload monitoring implementation are crucial for maintaining system reliability. They keep monitoring strategies aligned with current workloads, technological changes, and business requirements, so that issues are detected promptly and monitoring can be adjusted proactively.

Best Practices

Implement Regular Monitoring Reviews

  • Establish a review schedule (e.g., quarterly or twice a year) to assess current monitoring configurations and metrics.
  • Involve relevant stakeholders in the review process to ensure comprehensive coverage of workload performance and reliability.
  • Use tools such as Amazon CloudWatch to aggregate logs and metrics so they are easy to access during reviews.
  • Identify key performance indicators (KPIs) and baseline metrics before the reviews to track changes and improvements over time.
  • Document significant events and decisions made during reviews to ensure continuous learning and improvement.
  • Utilize dashboards to visualize metrics effectively, making it easier to spot trends and areas needing attention.
  • Make necessary adjustments to monitoring configurations based on findings from reviews to enhance reliability and performance.
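
As an illustration of tracking KPIs against a baseline between reviews, the following is a minimal sketch rather than a prescribed method. The function names, metric names, sample values, and the 10% tolerance are all hypothetical; in practice the values would come from whatever monitoring tool you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiDrift:
    name: str
    baseline: float
    current: float
    change_pct: float  # signed percentage change relative to the baseline


def find_drifted_kpis(baseline, current, tolerance_pct=10.0):
    """Return KPIs whose current value moved more than tolerance_pct from baseline."""
    drifted = []
    for name, base_value in baseline.items():
        if name not in current:
            continue  # metric is no longer reported; see find_missing_kpis
        if base_value == 0:
            continue  # percentage change is undefined; review these manually
        change = (current[name] - base_value) / abs(base_value) * 100.0
        if abs(change) > tolerance_pct:
            drifted.append(KpiDrift(name, base_value, current[name], round(change, 1)))
    return drifted


def find_missing_kpis(baseline, current):
    """Return baseline KPIs that no longer appear in the current data."""
    return sorted(set(baseline) - set(current))


# Hypothetical values recorded at the previous review and at this one.
baseline = {"p99_latency_ms": 250.0, "error_rate_pct": 0.5, "cpu_util_pct": 55.0}
current = {"p99_latency_ms": 310.0, "error_rate_pct": 0.52}

for drift in find_drifted_kpis(baseline, current):
    print(f"{drift.name}: {drift.baseline} -> {drift.current} ({drift.change_pct:+.1f}%)")
print("No longer reported:", find_missing_kpis(baseline, current))
```

A KPI that has drifted, or one that has stopped being reported, is a candidate agenda item for the review rather than an automatic verdict that something is wrong.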

Questions to ask your team

  • How often do you review the monitoring setup of your workload?
  • What criteria do you use to determine if changes are needed in your monitoring strategy?
  • Have you documented the significant events that triggered updates to your monitoring processes?
  • Are there specific metrics or logs that you prioritize during your reviews?
  • How do you ensure that the updated monitoring practices are effectively implemented?
  • What tools do you use to facilitate the review process of workload monitoring?
  • How do you share insights from your monitoring reviews with your team or stakeholders?

Who should be doing this?

Cloud Operations Engineer

  • Set up and maintain monitoring tools for workload resources.
  • Analyze logs and metrics to assess the health of workloads.
  • Configure alerts for performance thresholds and significant events.
  • Conduct regular reviews of monitoring configurations and effectiveness.
  • Work with development teams to implement changes based on review findings.

Site Reliability Engineer (SRE)

  • Oversee the reliability of production systems.
  • Lead the effort to regularly review and update monitoring practices.
  • Provide insights on incident management and response based on monitoring data.
  • Collaborate with the Cloud Operations Engineer to ensure monitoring aligns with best practices.
  • Facilitate post-incident reviews to improve monitoring strategies.

DevOps Manager

  • Ensure that team members conduct regular reviews of workload monitoring.
  • Champion the importance of monitoring for reliability within the organization.
  • Allocate resources for monitoring tools and training.
  • Establish policies and procedures for regular monitoring assessments.
  • Evaluate the impact of changes in workloads on monitoring effectiveness.

What evidence shows this is happening in your organization?

  • Workload Monitoring Review Checklist: A checklist to guide the team through the regular review process of workload monitoring, ensuring all metrics and logs are evaluated and updated according to recent events and performance observations.
  • Monitoring and Alerting Policy: A policy document outlining the standards and procedures for monitoring workload resources, including the thresholds for alerts and the protocol for responding to significant events.
  • Monthly Monitoring Review Report: A report generated monthly that summarizes workload monitoring metrics, highlighting any significant events and providing recommendations for improvements in monitoring strategies.
  • Workload Performance Dashboard: An interactive dashboard displaying real-time metrics of workload performance, enabling teams to quickly identify issues and assess the need for review based on predefined performance thresholds.
  • Workload Monitoring Strategy Guide: A comprehensive guide that outlines best practices for configuring logs and metrics, including methods for automatically recovering from failures and ensuring continuous reliability.
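
The review checklist and its schedule can also be kept in a machine-readable form so that overdue reviews and open items are easy to spot. The sketch below is one possible shape, assuming a 90-day cadence; the class names, fields, workload name, and checklist items are hypothetical and not part of any standard.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class ChecklistItem:
    description: str
    done: bool = False


@dataclass
class MonitoringReview:
    workload: str
    last_review: date
    cadence_days: int = 90  # assumed cadence (roughly quarterly)
    items: list = field(default_factory=list)

    def next_due(self):
        return self.last_review + timedelta(days=self.cadence_days)

    def is_overdue(self, today):
        return today > self.next_due()

    def open_items(self):
        return [item for item in self.items if not item.done]

    def summary(self, today):
        status = "OVERDUE" if self.is_overdue(today) else "on schedule"
        lines = [f"{self.workload}: review {status}, next due {self.next_due().isoformat()}"]
        for item in self.items:
            mark = "x" if item.done else " "
            lines.append(f"  [{mark}] {item.description}")
        return "\n".join(lines)


# Hypothetical workload and checklist entries.
review = MonitoringReview(
    workload="checkout-api",
    last_review=date(2024, 1, 15),
    items=[
        ChecklistItem("Alert thresholds match current baselines", done=True),
        ChecklistItem("Dashboards cover newly added services"),
    ],
)
print(review.summary(today=date(2024, 5, 1)))
```

Keeping the output of `summary` alongside the review report gives a simple record that the review took place and what remained open.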

Cloud Services

AWS

  • Amazon CloudWatch: A monitoring service for AWS cloud resources and the applications you run on AWS. CloudWatch collects logs and metrics, enabling you to set alarms and automate responses to changes in workload performance.
  • AWS CloudTrail: CloudTrail enables governance, compliance, and operational and risk auditing of your AWS account by recording user activity and API usage across your AWS resources.
  • AWS X-Ray: Helps you analyze and debug production applications by tracing requests, providing insight into performance bottlenecks and errors within your application and the services it depends on.
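
During a review, alarm configurations themselves are worth auditing. The sketch below assumes alarm records shaped like the `MetricAlarms` entries returned by the CloudWatch `DescribeAlarms` API (fields `AlarmName`, `AlarmActions`, `ActionsEnabled`, `StateValue`). The function name `audit_alarms`, the sample alarms, and the choice of checks are hypothetical.

```python
def audit_alarms(alarms):
    """Return (alarm_name, finding) pairs for alarms that deserve a closer look."""
    findings = []
    for alarm in alarms:
        name = alarm.get("AlarmName", "<unnamed>")
        if not alarm.get("AlarmActions"):
            findings.append((name, "no alarm actions configured"))
        elif not alarm.get("ActionsEnabled", True):
            findings.append((name, "alarm actions are disabled"))
        if alarm.get("StateValue") == "INSUFFICIENT_DATA":
            findings.append((name, "in INSUFFICIENT_DATA state; check the metric is still emitted"))
    return findings


# Hypothetical sample records; the ARN is a placeholder, not a real resource.
sample_alarms = [
    {
        "AlarmName": "checkout-api-5xx",
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:example-topic"],
        "ActionsEnabled": True,
        "StateValue": "OK",
    },
    {
        "AlarmName": "legacy-queue-depth",
        "AlarmActions": [],
        "ActionsEnabled": True,
        "StateValue": "INSUFFICIENT_DATA",
    },
]

for name, finding in audit_alarms(sample_alarms):
    print(f"{name}: {finding}")
```

With boto3 installed and credentials configured, records of this shape could be obtained from `boto3.client("cloudwatch").describe_alarms()["MetricAlarms"]`, using pagination when the account has many alarms.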

Azure

  • Azure Monitor: A comprehensive service that collects, analyzes, and acts on telemetry from your cloud and on-premises environments, enabling you to optimize performance and availability.
  • Azure Log Analytics: Part of Azure Monitor. It helps you query and analyze log and performance data collected from various sources and set up alerts for significant events.

Google Cloud Platform

  • Google Cloud Monitoring: Collects metrics, events, and metadata from Google Cloud resources, allowing you to visualize and alert on performance and health data.
  • Google Cloud Logging: Allows you to store and analyze logging data from your applications and infrastructure, making it easier to detect, troubleshoot, and respond to issues.

Question: How do you monitor workload resources?
Pillar: Reliability (Code: REL)
