Alert Response Runbook Example: High CPU Usage Alert

Runbook Name: High CPU Usage Alert Response
Date Created: November 7, 2024
Date Updated: November 7, 2024
Author: Kevin McCaffrey
Assigned Owner: [Team Member’s Name]


Alert Details

  • Alert Type: High CPU Usage
  • Trigger: CPU usage exceeds 85% for more than 5 minutes
  • Monitoring Tool: Amazon CloudWatch
  • Severity Level: High
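
The trigger above could be expressed as a CloudWatch alarm definition. The following is a minimal sketch, not the configuration actually in use: it assumes the alert targets an EC2 instance, that one-minute datapoints are available (which requires detailed monitoring), and that five consecutive breaching periods is an acceptable reading of "more than 5 minutes." The function name, the alarm naming scheme, and the SNS topic argument are hypothetical.

```python
def build_cpu_alarm_params(instance_id: str, sns_topic_arn: str) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricAlarm API.

    Assumption: the monitored resource is an EC2 instance reporting the
    standard AWS/EC2 CPUUtilization metric.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,             # seconds per datapoint (assumes detailed monitoring)
        "EvaluationPeriods": 5,   # 5 consecutive periods = 5 minutes
        "Threshold": 85.0,        # percent, from the trigger above
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # where the alarm notification is sent
    }


# With boto3 installed and credentials configured, the dict could be passed as:
#   boto3.client("cloudwatch").put_metric_alarm(**build_cpu_alarm_params(...))
```

Keeping the parameters in a plain dictionary makes the threshold and duration easy to review alongside this runbook before anything is applied to an account.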

Step-by-Step Response Process

1. Acknowledge the Alert

  • Action: Log into the incident management tool (e.g., AWS Systems Manager Incident Manager) and acknowledge the alert.
  • Criteria for Success: Alert status changes to “Acknowledged,” and stakeholders are notified of the incident.

2. Gather Information

  • Action: Review CloudWatch metrics to confirm sustained high CPU usage. Check the following:
    • Time the alert was triggered
    • Specific instance(s) impacted
    • Any correlated alerts or anomalies
  • Tools: Amazon CloudWatch, AWS Systems Manager
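
Confirming "sustained" usage can be done by eye on the CloudWatch graph, but the check is simple enough to sketch in code. The example below assumes datapoints shaped like the `Datapoints` entries returned by CloudWatch's GetMetricStatistics API (a `Timestamp` and an `Average`), already fetched by some other means. The function name and the choice to treat a continuous run of at least five minutes as sustained are assumptions for illustration.

```python
from datetime import datetime, timedelta


def is_sustained_high_cpu(datapoints, threshold=85.0,
                          sustained_for=timedelta(minutes=5)):
    """Return True if CPU stayed above `threshold` without interruption.

    A run counts as sustained when the time between its first and latest
    breaching datapoint is at least `sustained_for`. Any datapoint at or
    below the threshold resets the run.
    """
    run_start = None
    # CloudWatch does not guarantee datapoint order, so sort by time first.
    for point in sorted(datapoints, key=lambda p: p["Timestamp"]):
        if point["Average"] > threshold:
            if run_start is None:
                run_start = point["Timestamp"]
            if point["Timestamp"] - run_start >= sustained_for:
                return True
        else:
            run_start = None
    return False
```

A brief dip below the threshold resets the run, so this check is stricter than counting breaching datapoints; that trade-off may or may not match how the real alarm is configured.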

3. Assess Impact

  • Action: Determine if the high CPU usage is impacting user experience or critical workloads.
  • Criteria for Success: Identify the service(s) impacted and the severity of the degradation.

4. Investigate the Cause

  • Action: Check logs and recent deployments for any clues, such as:
    • Application logs: Review logs for errors or unusual activity.
    • System logs: Look for processes consuming excessive CPU or other resources.
    • Recent changes: Investigate any deployments or configuration changes that may have triggered the alert.
  • Tools: Amazon CloudWatch Logs, AWS Systems Manager, deployment logs
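
One way to narrow down "recent changes" is to list only the deployments or configuration changes that landed shortly before the alert fired. The sketch below is illustrative only: the change-record shape (`time`, `description`), the function name, and the two-hour lookback window are all assumptions, and the records would need to be exported from whatever deployment log the team actually uses.

```python
from datetime import datetime, timedelta


def changes_before_alert(changes, alert_time, lookback=timedelta(hours=2)):
    """Return changes made within `lookback` before the alert, newest first.

    `changes` is a list of dicts with a `time` (datetime) and a
    `description` (str); this record shape is hypothetical.
    """
    window_start = alert_time - lookback
    recent = [c for c in changes if window_start <= c["time"] <= alert_time]
    # Newest first: the most recent change is usually the first suspect.
    return sorted(recent, key=lambda c: c["time"], reverse=True)
```

Changes made after the alert time are excluded on purpose, since they cannot have triggered it.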

5. Take Action Based on Findings

  • Scenario A: Temporary Spike
    • Action: If the CPU usage spike appears to be temporary and caused by a scheduled job, monitor for further degradation and document the findings.
    • Criteria for Success: CPU usage normalizes without further intervention.
  • Scenario B: Long-Running Process or High Load
    • Action: Restart the offending process, or add capacity by scaling up (a larger instance size) or scaling out (more instances behind the load balancer).
    • Criteria for Success: CPU usage drops below 85%, and the service recovers.
  • Scenario C: Code or Configuration Issue
    • Action: Roll back to the previous stable deployment or apply a patch.
    • Criteria for Success: CPU usage stabilizes, and the issue is resolved.
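
The three scenarios can be read as a small decision table. The sketch below maps two findings from the investigation step to a scenario and its recommended action. The input flags, the names, and especially the precedence (a recent change is checked before a scheduled job) are assumptions made for illustration, not rules stated by this runbook.

```python
from enum import Enum


class Scenario(Enum):
    TEMPORARY_SPIKE = "A"
    LONG_RUNNING_OR_HIGH_LOAD = "B"
    CODE_OR_CONFIG_ISSUE = "C"


# Recommended actions, paraphrased from the scenarios above.
RECOMMENDED_ACTION = {
    Scenario.TEMPORARY_SPIKE: "Monitor for further degradation and document findings",
    Scenario.LONG_RUNNING_OR_HIGH_LOAD: "Restart the offending process or add capacity",
    Scenario.CODE_OR_CONFIG_ISSUE: "Roll back to the previous stable deployment or apply a patch",
}


def classify(scheduled_job_running: bool, recent_change: bool) -> Scenario:
    """Pick a scenario from two investigation findings.

    Assumed precedence: a recent deployment or configuration change is
    treated as the likeliest cause, then a scheduled job, then load.
    """
    if recent_change:
        return Scenario.CODE_OR_CONFIG_ISSUE
    if scheduled_job_running:
        return Scenario.TEMPORARY_SPIKE
    return Scenario.LONG_RUNNING_OR_HIGH_LOAD
```

For example, `RECOMMENDED_ACTION[classify(scheduled_job_running=True, recent_change=False)]` yields the Scenario A action. A real triage would weigh more evidence than two flags, so this is best treated as a prompt for the responder rather than an automated decision.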

6. Document and Communicate

  • Action: Record the root cause, actions taken, and resolution in the incident management system. Notify stakeholders of the incident status and any follow-up actions required.
  • Tools: AWS Systems Manager Incident Manager, email/Slack notifications
  • Criteria for Success: Incident report is completed, and stakeholders are informed.
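
To make "incident report is completed" checkable, the required fields can be captured in a small record. The following sketch is one possible shape, not the format of any particular incident management tool: the class name, field names, and message layout are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    title: str
    root_cause: str
    actions_taken: list
    resolution: str
    follow_ups: list = field(default_factory=list)

    def is_complete(self) -> bool:
        """True when root cause, at least one action, and resolution are filled in."""
        return bool(self.root_cause.strip() and self.actions_taken
                    and self.resolution.strip())

    def to_message(self) -> str:
        """Render a plain-text summary suitable for email or a chat channel."""
        lines = [f"Incident: {self.title}",
                 f"Root cause: {self.root_cause}",
                 "Actions taken:"]
        lines += [f"  - {action}" for action in self.actions_taken]
        lines.append(f"Resolution: {self.resolution}")
        if self.follow_ups:
            lines.append("Follow-up actions:")
            lines += [f"  - {item}" for item in self.follow_ups]
        return "\n".join(lines)
```

Follow-up actions are optional here because the step above only requires them when there are any.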

7. Escalate if Unresolved

  • Action: If the issue cannot be resolved within 30 minutes, escalate to the Operations Manager or relevant team.
  • Escalation Path: Contact Operations Manager at [Contact Information]
  • Criteria for Success: Escalation is logged, and the appropriate team takes over the incident.
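
The 30-minute escalation rule is easy to lose track of during an incident, so it can help to compute it explicitly. The sketch below assumes the clock starts when the alert is acknowledged; the runbook does not say whether it starts at acknowledgement or when the alert fired, so adjust `started_at` to match local practice. The function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

ESCALATION_WINDOW = timedelta(minutes=30)  # from the escalation step above


def should_escalate(started_at: datetime, now: datetime, resolved: bool,
                    window: timedelta = ESCALATION_WINDOW) -> bool:
    """Return True when an unresolved incident has used up its window.

    Assumption: `started_at` is the acknowledgement time.
    """
    return (not resolved) and (now - started_at >= window)
```

Passing `now` as an argument, rather than reading the system clock inside the function, keeps the check easy to test and to replay when reviewing an incident timeline.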

Criteria for Successful Resolution

  • CPU usage is back below the 85% alert threshold.
  • Services are fully operational and performing as expected.
  • Incident is documented and communicated to relevant teams.

Tools and Resources

  • Monitoring: Amazon CloudWatch, AWS Systems Manager
  • Incident Management: AWS Systems Manager Incident Manager
  • Communication: Amazon SNS, email, Slack
  • Automation: AWS Systems Manager Automation for restarting services or scaling resources

Review and Maintenance

  • Review Frequency: Monthly
  • Owner Responsible for Updates: Runbook Author

Supporting Artifacts

  • High CPU Usage Investigation Checklist
  • Incident Communication Templates
  • Resource Scaling Playbook for automated actions