
Conduct reviews regularly

Regular reviews of workload monitoring keep it aligned with how the workload actually runs. They identify gaps, update instrumentation as systems evolve, and absorb lessons from significant events or incidents. Because monitoring requirements shift with scaling, feature changes, and infrastructure updates, periodic reassessment is needed to maintain observability and reliability.

Establish Monitoring Review Cadence

Set a regular cadence, such as monthly or quarterly, for reviewing how workload monitoring is implemented, and trigger an additional review after significant events such as system changes or incidents. These reviews help ensure that your monitoring practices evolve alongside your system’s architecture, and that new components are properly instrumented for visibility.
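As a rough illustration of combining a fixed schedule with event-driven triggers, the sketch below decides whether a review is due. The 90-day interval, the function name `review_due`, and the idea of passing in a list of significant event dates are assumptions made for this example, not part of any AWS tooling.

```python
from datetime import date, timedelta

# Assumed cadence for this sketch: at least one review every 90 days.
REVIEW_INTERVAL = timedelta(days=90)


def review_due(last_review: date, today: date, event_dates: list[date]) -> bool:
    """Return True when a monitoring review should be scheduled.

    A review is due if a significant event (incident, major change)
    happened after the last review, or if the regular interval has elapsed.
    """
    # Event-driven trigger: any event since the last review.
    if any(last_review < d <= today for d in event_dates):
        return True
    # Schedule-driven trigger: the regular interval has passed.
    return today - last_review >= REVIEW_INTERVAL
```

A team could run a check like this from a scheduled job and open a ticket for the review champion whenever it returns True.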

Assign Monitoring Review Champions

Assign monitoring review champions within each team to oversee the review process. These champions are responsible for evaluating the effectiveness of existing monitoring practices, ensuring that monitoring metrics are relevant, and identifying areas for improvement. Champions should collaborate across teams to maintain a cohesive monitoring strategy that spans all services and components.

Incorporate Feedback from Significant Events

Significant events, such as outages, incidents, or system upgrades, provide opportunities for learning and improving monitoring practices. After such events, review the existing monitoring setup to determine whether it could have detected the event earlier or helped mitigate it. Update metrics, dashboards, and alerting thresholds to reflect lessons learned.
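One simple question for a post-event review is how long it took for any alert to fire after the problem began. The sketch below computes that detection lag; the function name `detection_lag` and the input shape (an incident start time plus a list of alert timestamps taken from your alerting history) are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def detection_lag(
    incident_start: datetime, alert_times: list[datetime]
) -> Optional[timedelta]:
    """Return the time between incident start and the first alert after it.

    Returns None when no alert fired at or after the start, which points
    to a monitoring gap rather than a slow alert.
    """
    # Ignore alerts that fired before the incident began.
    relevant = [t for t in alert_times if t >= incident_start]
    if not relevant:
        return None
    return min(relevant) - incident_start
```

A long lag suggests thresholds or evaluation periods need tightening, while a result of None suggests a metric or alarm was missing altogether.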

Provide Training on Monitoring Best Practices

Train teams on best practices for workload monitoring and how to interpret monitoring data effectively. Training should cover the use of monitoring tools, the importance of different metrics, and how to respond to alerts. Proper training ensures that all team members can contribute to effective monitoring and system reliability.

Develop Guidelines for Effective Monitoring

Create clear guidelines for monitoring your workloads effectively. These guidelines should cover best practices for metric selection, alert configuration, and the use of logging and tracing tools. Documented guidelines help ensure consistency in monitoring practices across teams and components.

Integrate Monitoring Validation into CI/CD Pipelines

Integrate monitoring validation into CI/CD pipelines to automatically verify that new deployments and changes are adequately monitored. This helps catch gaps in monitoring coverage early in the development cycle, before they reach production. Automated checks can validate that necessary metrics are in place and that alerts are properly configured.
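A pipeline check of this kind can be sketched as a small script that compares the alarms each service declares against a required set and returns a non-zero exit code when something is missing. The alarm names in `REQUIRED_ALARMS`, the function names, and the input format (a mapping of service name to declared alarm names, which a real pipeline might load from a deployment manifest) are all hypothetical.

```python
# Hypothetical policy: every deployed service must define these alarms.
REQUIRED_ALARMS = {"high_error_rate", "high_latency"}


def find_missing_alarms(services: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each service to the required alarms it does not define."""
    missing = {}
    for name, alarms in services.items():
        gap = REQUIRED_ALARMS - alarms
        if gap:
            missing[name] = gap
    return missing


def exit_code(services: dict[str, set[str]]) -> int:
    """Print any gaps and return 1 if the pipeline step should fail, else 0."""
    missing = find_missing_alarms(services)
    for name, gap in sorted(missing.items()):
        print(f"{name}: missing alarms {sorted(gap)}")
    return 1 if missing else 0
```

In a CI/CD job, the value returned by `exit_code` would be passed to the process exit so that a deployment with missing alarms fails the build before reaching production.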

Conduct Monitoring Gap Analysis

Periodically conduct a gap analysis to determine if there are areas of your system that are not being adequately monitored. Ensure that all critical components are instrumented, and that metrics align with the current workload performance and reliability goals. Address any identified gaps to improve overall system observability.
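At its core, a gap analysis compares an inventory of critical components with the metrics each one actually emits. The sketch below assumes you can export such an inventory; the `Component` type, the `gap_analysis` function, and the metric names used in the usage example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    critical: bool
    metrics: set[str] = field(default_factory=set)


def gap_analysis(
    components: list[Component], required_metrics: set[str]
) -> dict[str, list[str]]:
    """Return, per critical component, the required metrics it lacks."""
    gaps = {}
    for component in components:
        # Non-critical components are out of scope for this check.
        if not component.critical:
            continue
        missing = required_metrics - component.metrics
        if missing:
            gaps[component.name] = sorted(missing)
    return gaps


inventory = [
    Component("api", critical=True, metrics={"latency", "errors"}),
    Component("queue", critical=True, metrics={"latency"}),
    Component("batch-report", critical=False),
]
print(gap_analysis(inventory, {"latency", "errors"}))
```

The resulting mapping can be recorded directly in the gap analysis documentation, with one action item per listed component.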

Foster a Culture of Monitoring and Improvement

Encourage teams to treat monitoring as an ongoing process rather than a one-time setup. Recognize and reward proactive improvements to monitoring, and create a culture of transparency where teams openly discuss what monitoring worked well during incidents and what could be improved. This approach drives continuous improvement in monitoring practices and system reliability.

Utilize Dashboards for Monitoring Visibility

Create and maintain dashboards that provide visibility into workload health, key metrics, and system performance. Tools like Amazon CloudWatch can be used to visualize data, track trends, and detect anomalies. Dashboards let teams follow workload status in near real time and provide insights that inform decision-making.
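Keeping dashboards current is easier when they are generated from code rather than edited by hand. The sketch below builds a CloudWatch dashboard body as JSON, laying out one metric widget per metric in two columns on the 24-unit-wide grid. The function name `build_dashboard_body`, the namespace, and the metric names are placeholders chosen for this example.

```python
import json


def build_dashboard_body(namespace: str, metric_names: list[str], region: str) -> str:
    """Build a CloudWatch dashboard body with one metric widget per metric."""
    widgets = []
    for i, metric in enumerate(metric_names):
        widgets.append(
            {
                "type": "metric",
                # Two widgets per row, each 12 units wide and 6 units tall.
                "x": (i % 2) * 12,
                "y": (i // 2) * 6,
                "width": 12,
                "height": 6,
                "properties": {
                    "metrics": [[namespace, metric]],
                    "period": 300,
                    "stat": "Average",
                    "region": region,
                    "title": metric,
                },
            }
        )
    return json.dumps({"widgets": widgets})


body = build_dashboard_body("MyApp", ["Latency", "Errors", "Requests"], "us-east-1")
```

The resulting string could then be supplied as the `DashboardBody` argument to the CloudWatch `put_dashboard` API call, for example from a deployment script, so that dashboard changes go through the same review process as code.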

Supporting Questions:

  • How often are workload monitoring reviews conducted, and are they tied to specific events or schedules?
  • Are there clear guidelines for implementing and maintaining effective workload monitoring?
  • How do you incorporate lessons learned from significant events to improve monitoring practices?
  • Is there a defined process for identifying and addressing monitoring gaps?
  • How are teams trained and supported in understanding and improving monitoring?

Roles and Responsibilities:

  • Monitoring Review Champions: Lead the review process, evaluate the effectiveness of monitoring, and collaborate with other teams.
  • DevOps Engineers: Implement and update monitoring tools and practices to ensure coverage and reliability.
  • Site Reliability Engineers (SREs): Monitor workload performance, analyze trends, and assist with gap analysis.
  • Team Leads: Ensure monitoring reviews are conducted regularly and that lessons from incidents are applied.
  • Quality Assurance (QA) Team: Validate monitoring as part of the deployment process to ensure comprehensive coverage.

Artefacts:

  • Monitoring Review Reports: Documentation of the findings from regular monitoring reviews, including identified gaps and actions taken.
  • Updated Dashboards: Dashboards updated based on review findings to better reflect current workload health and performance.
  • Incident Postmortem Reports: Reports that include analysis of monitoring effectiveness during incidents and suggested improvements.
  • Monitoring Guidelines: Documentation outlining best practices for monitoring metrics, alerting thresholds, and overall observability strategy.
  • Gap Analysis Documentation: Records of identified gaps in monitoring and the actions taken to address them.

Relevant AWS Services:

  • Amazon CloudWatch: Collects workload metrics and logs and evaluates alarms, providing a centralized platform for workload observability.
  • AWS CloudTrail: Logs API calls and changes to AWS resources, providing insights into changes that might require monitoring updates.
  • AWS Config: Tracks configuration changes and evaluates resources against rules, which can help flag resources deployed without the expected monitoring configuration.
  • AWS X-Ray: Provides distributed tracing that complements workload monitoring, offering deeper insights into request flows.
  • AWS Systems Manager: Helps automate operational tasks, including those related to monitoring configuration and validation.