Conduct reviews regularly

Posted November 29, 2024

Updated March 22, 2025

By Kevin McCaffrey

Regular reviews of your workload monitoring implementation are crucial for maintaining system reliability. They keep the monitoring strategy aligned with current workloads, technological changes, and business requirements, so that issues are detected promptly and monitoring can be adjusted before gaps cause an incident.

Best Practices

Implement Regular Monitoring Reviews

Establish a review schedule (e.g., quarterly, biannually) to assess current monitoring configurations and metrics.
Involve relevant stakeholders in the review process to ensure comprehensive coverage of workload performance and reliability.
Use tools such as Amazon CloudWatch to aggregate logs and metrics so they are easy to access during reviews.
Identify key performance indicators (KPIs) and baseline metrics before the reviews to track changes and improvements over time.
Document significant events and decisions made during reviews to ensure continuous learning and improvement.
Utilize dashboards to visualize metrics effectively, making it easier to spot trends and areas needing attention.
Make necessary adjustments to monitoring configurations based on findings from reviews to enhance reliability and performance.
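
One way to prepare for a review is to scan existing alarm configurations for common gaps before the meeting. The following is a minimal sketch, not a definitive implementation. It assumes each alarm is a dictionary whose keys are modeled on the alarm fields returned by CloudWatch's DescribeAlarms API (AlarmName, ActionsEnabled, AlarmActions, StateValue, AlarmConfigurationUpdatedTimestamp). The `Finding` class, the `audit_alarms` function, and the 90-day staleness window are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Finding:
    """A single item to discuss in the monitoring review (hypothetical type)."""
    alarm_name: str
    issue: str


def audit_alarms(alarms, now, stale_after=timedelta(days=90)):
    """Return review findings for a list of alarm dictionaries.

    The 90-day default is an assumption; match it to your review schedule.
    """
    findings = []
    for alarm in alarms:
        name = alarm["AlarmName"]
        # An alarm that notifies nobody cannot support timely detection.
        if not alarm.get("ActionsEnabled", True) or not alarm.get("AlarmActions"):
            findings.append(Finding(name, "no active alarm action"))
        # INSUFFICIENT_DATA often means the metric is no longer being emitted.
        if alarm.get("StateValue") == "INSUFFICIENT_DATA":
            findings.append(Finding(name, "insufficient data"))
        # A configuration untouched for a long time may not fit the current workload.
        updated = alarm.get("AlarmConfigurationUpdatedTimestamp")
        if updated is not None and now - updated > stale_after:
            findings.append(Finding(name, "configuration not reviewed recently"))
    return findings


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = [{
        "AlarmName": "queue-depth",  # hypothetical alarm name
        "ActionsEnabled": True,
        "AlarmActions": [],
        "StateValue": "INSUFFICIENT_DATA",
        "AlarmConfigurationUpdatedTimestamp": now - timedelta(days=200),
    }]
    for finding in audit_alarms(example, now):
        print(f"{finding.alarm_name}: {finding.issue}")
```

In practice the input list could come from whatever inventory of alarms your monitoring tool exposes. The findings are only prompts for discussion; the review itself decides which adjustments to make.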

Questions to ask your team

How often do you review the monitoring setup of your workload?
What criteria do you use to determine if changes are needed in your monitoring strategy?
Have you documented the significant events that triggered updates to your monitoring processes?
Are there specific metrics or logs that you prioritize during your reviews?
How do you ensure that the updated monitoring practices are effectively implemented?
What tools do you use to facilitate the review process of workload monitoring?
How do you share insights from your monitoring reviews with your team or stakeholders?

Who should be doing this?

Cloud Operations Engineer

Set up and maintain monitoring tools for workload resources.
Analyze logs and metrics to assess the health of workloads.
Configure alerts for performance thresholds and significant events.
Conduct regular reviews of monitoring configurations and effectiveness.
Work with development teams to implement changes based on review findings.

Site Reliability Engineer (SRE)

Oversee the reliability of production systems.
Lead the effort to regularly review and update monitoring practices.
Provide insights on incident management and response based on monitoring data.
Collaborate with the Cloud Operations Engineer to ensure monitoring aligns with best practices.
Facilitate post-incident reviews to improve monitoring strategies.

DevOps Manager

Ensure that team members conduct regular reviews of workload monitoring.
Champion the importance of monitoring for reliability within the organization.
Allocate resources for monitoring tools and training.
Establish policies and procedures for regular monitoring assessments.
Evaluate the impact of changes in workloads on monitoring effectiveness.

What evidence shows this is happening in your organization?

Workload Monitoring Review Checklist: A checklist to guide the team through the regular review process of workload monitoring, ensuring all metrics and logs are evaluated and updated according to recent events and performance observations.
Monitoring and Alerting Policy: A policy document outlining the standards and procedures for monitoring workload resources, including the thresholds for alerts and the protocol for responding to significant events.
Monthly Monitoring Review Report: A report generated monthly that summarizes workload monitoring metrics, highlighting any significant events and providing recommendations for improvements in monitoring strategies.
Workload Performance Dashboard: An interactive dashboard displaying real-time metrics of workload performance, enabling teams to quickly identify issues and assess the need for review based on predefined performance thresholds.
Workload Monitoring Strategy Guide: A comprehensive guide that outlines best practices for configuring logs and metrics, including methods for automatically recovering from failures and ensuring continuous reliability.
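
As an illustration of how a review report might compare current KPIs against the baselines recorded before the review, here is a small sketch. The function names `kpi_drift` and `format_report`, the KPI names, and the 10% tolerance are all assumptions made for this example, not part of any tool or standard.

```python
def kpi_drift(baseline, current, tolerance=0.10):
    """Return {kpi: relative_change} for KPIs that moved more than `tolerance`.

    `baseline` and `current` map KPI names to numeric values. KPIs missing
    from `current` or with a zero baseline are skipped.
    """
    drifted = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue
        change = (current[name] - base) / base
        if abs(change) > tolerance:
            drifted[name] = round(change, 3)
    return drifted


def format_report(month, drifted):
    """Render the drifted KPIs as a plain-text summary for the review."""
    lines = [f"Monitoring review report: {month}"]
    if not drifted:
        lines.append("No KPI moved beyond tolerance.")
    for name, change in sorted(drifted.items()):
        lines.append(f"- {name}: {change:+.1%} vs baseline")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical KPI values for illustration only.
    baseline = {"p99_latency_ms": 200.0, "error_rate": 0.01, "cpu_util": 0.50}
    current = {"p99_latency_ms": 260.0, "error_rate": 0.0105, "cpu_util": 0.40}
    print(format_report("2025-03", kpi_drift(baseline, current)))
```

A KPI that drifts beyond tolerance is not necessarily a problem. It is a signal that alert thresholds or dashboards tied to that KPI may need to be revisited during the review.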

Cloud Services

AWS

Amazon CloudWatch: A monitoring service for AWS cloud resources and the applications you run on AWS. CloudWatch collects logs and metrics, enabling you to set alarms and automate responses to changes in workload performance.
AWS CloudTrail: CloudTrail enables governance, compliance, and operational and risk auditing of your AWS account by tracking user activity and changes in your AWS resources.
AWS X-Ray: Helps you analyze and debug production applications, providing insights into performance and issues within your application and its resources.

Azure

Azure Monitor: A comprehensive service that collects, analyzes, and acts on telemetry from your cloud and on-premises environments, enabling you to optimize performance and availability.
Azure Log Analytics: Part of Azure Monitor, it helps you collect and analyze log and performance data from various sources and configure alerts for significant events.

Google Cloud Platform

Google Cloud Monitoring: Collects metrics, events, and metadata from Google Cloud resources, allowing you to visualize and alert on performance and health data.
Google Cloud Logging: Allows you to store and analyze logging data from your applications and infrastructure, making it easier to detect, troubleshoot, and respond to issues.

Question: How do you monitor workload resources?
Pillar: Reliability (Code: REL)
