Define and calculate metrics

Defining and calculating metrics is essential for monitoring the performance and reliability of your workload. By logging events systematically and deriving metrics from them, you can detect performance degradation early and intervene before it affects the availability and reliability of your services.

Best Practices

Implement Comprehensive Logging Strategies

  • Ensure that all components of your workload generate logs at an appropriate level of detail, including application, system, and database logs. This visibility into operations is crucial for diagnosing issues and understanding behavior at runtime. Consider using structured logging to make data easier to query and analyze.
  • Utilize AWS services like Amazon CloudWatch Logs to centrally collect and manage your logs, which helps in correlating events across different services.
  • Regularly review and adjust log retention policies to balance information needs with cost management, ensuring that critical logs are retained while minimizing unnecessary data storage costs.
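
As an illustration of the structured logging mentioned above, here is a minimal sketch that emits one JSON object per log line using only the Python standard library. The logger name "checkout", the field names, and the convention of passing extra fields under a "fields" key are assumptions made for this example, not part of any AWS API. A log agent or query tool could then filter on keys such as "status" or "latency_ms" instead of parsing free text.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass extra={"fields": {...}} and those
        # key/value pairs are merged into the JSON payload.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to the standard error stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


# Hypothetical service name and fields, for illustration only.
logger = build_logger("checkout")
logger.info("request completed", extra={"fields": {"status": 200, "latency_ms": 42}})
```

Keeping every record as one JSON object per line makes it straightforward to apply filters on individual fields when the logs are collected centrally.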

Define Key Performance Indicators (KPIs)

  • Identify and define critical KPIs that reflect the reliability of your application, such as error rates, request latency, and system availability. Having specific metrics allows for a more focused monitoring strategy and facilitates proactive measures.
  • Use Amazon CloudWatch to create custom metrics based on the defined KPIs. This enables you to monitor the performance of your workloads effectively and detect anomalies quickly.
  • Establish baseline values for your metrics, so you can set appropriate thresholds for alerts. Regularly review these baselines to ensure they remain relevant as your workload evolves.
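
The KPIs and baselines described above can be sketched in a few lines of Python. This is an illustrative calculation only: the Request type, the choice of 5xx responses as errors, the nearest-rank percentile, and the "mean plus three standard deviations" threshold are all assumptions, and your workload may call for different definitions.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to serve the request


def error_rate(requests):
    """Fraction of requests that returned a 5xx status (assumed error definition)."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def availability(requests):
    """Fraction of requests that did not fail, i.e. 1 - error rate."""
    return 1.0 - error_rate(requests)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def baseline_threshold(history, sigmas=3.0):
    """Alert threshold: mean plus `sigmas` sample standard deviations.

    `history` needs at least two values for the standard deviation.
    """
    return statistics.mean(history) + sigmas * statistics.stdev(history)


# Illustrative sample data, not real measurements.
sample = [Request(200, 100), Request(200, 120), Request(500, 900), Request(200, 110)]
print("error rate:", error_rate(sample))
print("p95 latency (ms):", percentile([r.latency_ms for r in sample], 95))
print("threshold:", baseline_threshold([10, 12, 11, 13, 14]))
```

Recomputing the baseline from a recent window of data on a regular schedule is one way to keep thresholds relevant as the workload evolves.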

Set Up Automated Alerts and Notifications

  • Configure alerts on your defined metrics to notify your operations team immediately when thresholds are breached. Use services like Amazon Simple Notification Service (Amazon SNS) to send notifications via email or SMS.
  • Implement automated remediation actions based on alerts. For example, if a specific error rate is exceeded, you could automatically trigger a Lambda function to scale out resources or restart a service.
  • Regularly test your alerting mechanisms to ensure they are functioning correctly and that your team is familiar with the alert response procedures.
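
To make the alerting logic above concrete, the following sketch evaluates a metric against a threshold using an "M out of N datapoints" rule, which helps avoid alerting on a single spike. It is a local simulation, not a call to any AWS service: the function names, the default of 3 out of 5 datapoints, and the notify callback (a stand-in for publishing to a notification topic or invoking a remediation function) are assumptions for illustration.

```python
def is_breaching(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    """True when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints are strictly above `threshold`."""
    if evaluation_periods <= 0:
        raise ValueError("evaluation_periods must be positive")
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm


def evaluate_alarm(datapoints, threshold, notify,
                   datapoints_to_alarm=3, evaluation_periods=5):
    """Return "ALARM" or "OK"; call `notify` once with a message on ALARM.

    `notify` is a hypothetical callback standing in for a notification
    or automated remediation hook.
    """
    if is_breaching(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
        notify(
            f"{datapoints_to_alarm} of the last {evaluation_periods} "
            f"datapoints exceeded {threshold}"
        )
        return "ALARM"
    return "OK"


# Illustrative error-rate percentages per period; print() stands in for a real notifier.
state = evaluate_alarm([1, 9, 9, 2, 9], threshold=5, notify=print)
print(state)
```

Because the notifier is passed in as a parameter, the same function can be exercised in a test with a fake notifier, which is one way to check the alerting path regularly without paging anyone.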

Continuously Monitor and Improve

  • Perform regular reviews of your logging and monitoring strategy as part of your operational excellence practices. Ensure that the metrics you track remain aligned with business goals and customer expectations.
  • Adopt a culture of improvement: encourage team members to suggest changes to logging and monitoring practices based on their experiences, and integrate those improvements into your workflow.
  • Use the AWS Well-Architected Tool to record the findings from your reviews and track the resulting adjustments, ensuring continuous alignment with the framework’s best practices.

Questions to ask your team

  • What log data are you currently collecting, and how are you storing it?
  • Have you defined key metrics that are critical to assessing workload reliability?
  • How do you apply filters to your log data to focus on specific events or performance issues?
  • What thresholds have you established for these metrics, and how do you monitor them?
  • Are you using any tools or services to automate the collection and analysis of your logs and metrics?
  • How are you alerted when a metric exceeds its threshold, and who receives these alerts?
  • What steps have you taken to ensure your monitoring solution is resilient and can withstand failures?
  • How frequently do you review and update your defined metrics and thresholds based on workload changes?

Who should be doing this?

Cloud Operations Engineer

  • Define the key metrics to monitor workload resources.
  • Implement log storage solutions to ensure availability and accessibility of log data.
  • Apply appropriate filters to log data for accurate metric calculation.
  • Set up alerting mechanisms for threshold breaches in metrics.
  • Regularly review and update monitoring configurations to align with changes in the workload.
  • Collaborate with development teams to ensure that relevant log events are captured.

DevOps Engineer

  • Integrate logging and monitoring tools into CI/CD pipelines.
  • Automate the collection and visualization of metrics using dashboard tools.
  • Conduct analysis of log data to identify trends and anomalies.
  • Work with the Cloud Operations Engineer to refine metrics and thresholds based on workload performance.

System Administrator

  • Ensure the reliability and performance of log storage solutions.
  • Manage access controls to log data and monitoring tools.
  • Assist in the deployment of monitoring tools and frameworks.
  • Participate in incident response by analyzing logs during outages.

What evidence shows this is happening in your organization?

  • Reliability Monitoring Dashboard: A visual dashboard created using Amazon CloudWatch that displays key metrics related to workload resources, including error rates, latency, and resource utilization, enabling real-time monitoring and alerts.
  • Log Analysis and Metric Calculation Playbook: A comprehensive playbook that outlines the steps for analyzing log data, applying filters, and calculating relevant metrics, such as event counts and latency from timestamps, to ensure effective monitoring of workload health.
  • Monitoring Policies Document: A formal policy document detailing the organization’s approach to monitoring workload resources. It includes guidelines on logging, metrics definition, thresholds for alerts, and responsibilities for monitoring and responding to incidents.
  • Threshold Notification Strategy: A strategy document that describes how notifications are triggered when defined thresholds are crossed, including methods for configuring alarms in Amazon CloudWatch and integration with notification services like Amazon SNS.
  • Metrics Calculation Checklist: A checklist for development and operations teams to ensure all necessary metrics are defined and calculated from log sources, including which logs to monitor, filters to apply, and metrics to track for reliability.
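
The playbook and checklist items above mention applying filters to logs and calculating event counts and latency from timestamps. A minimal sketch of that calculation follows. The log format (JSON lines with "timestamp", "request_id", and "event" keys) and the event names "start", "end", and "error" are assumptions chosen for this example; real log schemas will differ.

```python
import json
from collections import Counter
from datetime import datetime

# Illustrative log lines in an assumed JSON-lines format; the last one is malformed.
LOG_LINES = [
    '{"timestamp": "2024-01-01T00:00:00+00:00", "request_id": "a", "event": "start"}',
    '{"timestamp": "2024-01-01T00:00:00.250000+00:00", "request_id": "a", "event": "end"}',
    '{"timestamp": "2024-01-01T00:00:01+00:00", "request_id": "b", "event": "start"}',
    '{"timestamp": "2024-01-01T00:00:01.500000+00:00", "request_id": "b", "event": "error"}',
    "not json",
]


def parse(lines):
    """Parse JSON log lines, skipping any line that is not valid JSON."""
    events = []
    for line in lines:
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return events


def event_counts(events):
    """Count occurrences of each event type."""
    return Counter(e["event"] for e in events)


def latencies_ms(events):
    """Latency per request: time between its "start" and "end" events.

    Requests without both events (for example, ones that errored) are omitted.
    """
    starts, result = {}, {}
    for e in events:
        ts = datetime.fromisoformat(e["timestamp"])
        if e["event"] == "start":
            starts[e["request_id"]] = ts
        elif e["event"] == "end" and e["request_id"] in starts:
            delta = ts - starts[e["request_id"]]
            result[e["request_id"]] = delta.total_seconds() * 1000
    return result


events = parse(LOG_LINES)
print(event_counts(events))
print(latencies_ms(events))
```

The filter here is simply the event type; the same pattern extends to filtering on status codes or component names before counting.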

Cloud Services

AWS

  • Amazon CloudWatch: CloudWatch allows you to collect and track metrics, collect log files, and set alarms to notify you of thresholds being crossed.
  • AWS Lambda: You can use Lambda to compute metrics and process logs in real time, enabling automatic recovery or alerts based on calculated conditions.
  • AWS X-Ray: X-Ray helps you analyze and debug applications by providing insights into request latencies and errors.

Azure

  • Azure Monitor: Azure Monitor provides comprehensive monitoring services, collecting metrics and logs to analyze the performance of applications and resources.
  • Azure Monitor Log Analytics: Log Analytics, a tool within Azure Monitor, allows you to query and analyze log data, helping to extract metrics needed for operational insights.

Google Cloud Platform

  • Google Cloud Monitoring: Google Cloud Monitoring helps you gain visibility into the performance of applications and services through metrics and log analysis.
  • Google Cloud Logging: Cloud Logging allows you to store, search, analyze, and alert on log data generated by your services and applications.

Question: How do you monitor workload resources?
Pillar: Reliability (Code: REL)
