Alert Response Runbook Example: High CPU Usage Alert

Posted November 11, 2024

Updated November 11, 2024

By Kevin McCaffrey

Runbook Name: High CPU Usage Alert Response
Date Created: November 7, 2024
Date Updated: November 7, 2024
Author: Kevin McCaffrey
Assigned Owner: [Team Member’s Name]

Alert Details

Alert Type: High CPU Usage
Trigger: CPU usage exceeds 85% for more than 5 minutes
Monitoring Tool: Amazon CloudWatch
Severity Level: High
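
The trigger above can be expressed as a CloudWatch alarm definition. The following is a minimal sketch, not the runbook author's configuration. It assumes the impacted resources are EC2 instances with detailed (1-minute) monitoring enabled, so that five consecutive 60-second periods approximate "more than 5 minutes". The function name and the alarm naming scheme are hypothetical. The dictionary keys are parameters of the CloudWatch PutMetricAlarm API.

```python
def build_cpu_alarm_params(instance_id, threshold_percent=85.0,
                           period_seconds=60, evaluation_periods=5):
    """Build keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",                  # assumption: EC2 instances
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,                # seconds per evaluation period
        "EvaluationPeriods": evaluation_periods, # consecutive periods required
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Placeholder instance ID for illustration only.
params = build_cpu_alarm_params("i-0123456789abcdef0")

# With boto3 installed and AWS credentials configured, the alarm could be
# created with: boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the definition in a plain dictionary makes the threshold and duration easy to review alongside the runbook before anything is applied to an account.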

Step-by-Step Response Process

1. Acknowledge the Alert

Action: Log into the incident management tool (e.g., AWS Systems Manager Incident Manager) and acknowledge the alert.
Criteria for Success: Alert status changes to “Acknowledged,” and stakeholders are notified of the incident.

2. Gather Information

Action: Review CloudWatch metrics to confirm sustained high CPU usage. Check the following:
- Time the alert was triggered
- Specific instance(s) impacted
- Any correlated alerts or anomalies
Tools: Amazon CloudWatch, AWS Systems Manager
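
Confirming that the high CPU usage is sustained, rather than a single noisy datapoint, can be sketched as a small check over exported metric samples. This is an illustration under stated assumptions: the samples are (timestamp, CPU percent) pairs already retrieved from CloudWatch and sorted by time, and the function names and sample values are hypothetical.

```python
from datetime import datetime, timedelta


def longest_breach(datapoints, threshold=85.0):
    """Return (start, end) of the longest run of consecutive samples
    above the threshold, or None if no sample breaches it.

    datapoints: list of (timestamp, cpu_percent), sorted by timestamp.
    """
    best = None
    run_start = None
    for ts, value in datapoints:
        if value > threshold:
            if run_start is None:
                run_start = ts
            if best is None or (ts - run_start) > (best[1] - best[0]):
                best = (run_start, ts)
        else:
            run_start = None  # the run is broken by a sample at or below threshold
    return best


def is_sustained(datapoints, threshold=85.0, min_duration=timedelta(minutes=5)):
    """True if some breach spans at least min_duration from first to last sample."""
    run = longest_breach(datapoints, threshold)
    return run is not None and (run[1] - run[0]) >= min_duration


# Hypothetical 1-minute samples starting at 12:00.
start = datetime(2024, 11, 7, 12, 0)
values = [40, 90, 92, 95, 91, 93, 96, 50]
samples = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]

breach_window = longest_breach(samples)  # when the breach began and ended
sustained = is_sustained(samples)
```

The returned window also answers the first checklist item, since its start is the time the breach began.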

3. Assess Impact

Action: Determine if the high CPU usage is impacting user experience or critical workloads.
Criteria for Success: Identify the service(s) impacted and the severity of the degradation.

4. Investigate the Cause

Action: Check logs and recent changes for likely causes, including:
- Application logs: Review logs for errors or unusual activity.
- System logs: Look for processes consuming excessive resources.
- Recent changes: Investigate any deployments or configuration changes that may have triggered the alert.
Tools: Amazon CloudWatch Logs, AWS Systems Manager, deployment logs
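
One way to work through the "recent changes" item is to list every deployment or configuration change that landed shortly before the alert fired. The sketch below assumes the change history has already been exported as a list of dictionaries. The two-hour lookback window, the record fields, and the sample entries are all hypothetical.

```python
from datetime import datetime, timedelta


def changes_before_alert(changes, alert_time, lookback=timedelta(hours=2)):
    """Return changes made within `lookback` before the alert, newest first."""
    window_start = alert_time - lookback
    recent = [c for c in changes if window_start <= c["time"] <= alert_time]
    return sorted(recent, key=lambda c: c["time"], reverse=True)


alert_time = datetime(2024, 11, 7, 12, 0)

# Hypothetical change history for illustration.
changes = [
    {"time": datetime(2024, 11, 7, 11, 40), "description": "application deployment"},
    {"time": datetime(2024, 11, 7, 8, 0), "description": "scheduled patching"},
    {"time": datetime(2024, 11, 7, 11, 55), "description": "configuration change"},
]

suspects = changes_before_alert(changes, alert_time)
```

Sorting newest first puts the change closest to the alert at the top of the list, which is usually the first one worth examining.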

5. Take Action Based on Findings

Scenario A: Temporary Spike
- Action: If the CPU usage spike appears to be temporary and caused by a scheduled job, monitor for further degradation and document the findings.
- Criteria for Success: CPU usage normalizes without further intervention.
Scenario B: Long-Running Process or High Load
- Action: Restart the offending process or add capacity (e.g., scale up to a larger instance size or scale out by adding instances behind the load balancer).
- Criteria for Success: CPU usage drops below 85%, and the service recovers.
Scenario C: Code or Configuration Issue
- Action: Roll back to the previous stable deployment or apply a patch.
- Criteria for Success: CPU usage stabilizes, and the issue is resolved.
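
The three scenarios can be read as a small decision table. The sketch below encodes it so that the chosen action is explicit and easy to log. The order in which findings are checked (a suspected code or configuration change first, then a scheduled job, otherwise load) is an assumption made for illustration and is not prescribed by the runbook. All names are hypothetical.

```python
from enum import Enum


class Scenario(Enum):
    TEMPORARY_SPIKE = "A"
    LONG_RUNNING_OR_HIGH_LOAD = "B"
    CODE_OR_CONFIG_ISSUE = "C"


# Recommended action per scenario, paraphrased from the runbook text.
ACTIONS = {
    Scenario.TEMPORARY_SPIKE: "Monitor for further degradation and document findings",
    Scenario.LONG_RUNNING_OR_HIGH_LOAD: "Restart the offending process or add capacity",
    Scenario.CODE_OR_CONFIG_ISSUE: "Roll back to the previous stable deployment or apply a patch",
}


def choose_scenario(caused_by_scheduled_job, recent_change_suspected):
    """Map investigation findings to a scenario (assumed priority order)."""
    if recent_change_suspected:
        return Scenario.CODE_OR_CONFIG_ISSUE
    if caused_by_scheduled_job:
        return Scenario.TEMPORARY_SPIKE
    return Scenario.LONG_RUNNING_OR_HIGH_LOAD


scenario = choose_scenario(caused_by_scheduled_job=False, recent_change_suspected=True)
action = ACTIONS[scenario]
```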

6. Document and Communicate

Action: Record the root cause, actions taken, and resolution in the incident management system. Notify stakeholders of the incident status and any follow-up actions required.
Tools: AWS Systems Manager Incident Manager, email/Slack notifications
Criteria for Success: Incident report is completed, and stakeholders are informed.
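
A structured record helps verify that the report is complete before stakeholders are notified. The following is a minimal sketch: the class, its field names, and the message layout are hypothetical and do not reflect the Incident Manager data model. The example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReport:
    title: str
    root_cause: str
    actions_taken: List[str]
    resolution: str
    follow_ups: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of required fields that are still empty."""
        required = {
            "root_cause": self.root_cause,
            "actions_taken": self.actions_taken,
            "resolution": self.resolution,
        }
        return [name for name, value in required.items() if not value]

    def to_message(self) -> str:
        """Render a plain-text summary suitable for email or chat."""
        lines = [f"Incident: {self.title}",
                 f"Root cause: {self.root_cause}",
                 "Actions taken:"]
        lines += [f"- {a}" for a in self.actions_taken]
        lines.append(f"Resolution: {self.resolution}")
        if self.follow_ups:
            lines.append("Follow-up actions:")
            lines += [f"- {f}" for f in self.follow_ups]
        return "\n".join(lines)


# Placeholder values for illustration only.
report = IncidentReport(
    title="High CPU Usage",
    root_cause="Runaway batch process",
    actions_taken=["Restarted the offending process"],
    resolution="CPU usage returned below 85%",
)
message = report.to_message()
```

Checking `missing_fields()` before sending gives a simple gate for the "incident report is completed" criterion.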

7. Escalate if Unresolved

Action: If the issue cannot be resolved within 30 minutes, escalate to the Operations Manager or relevant team.
Escalation Path: Contact Operations Manager at [Contact Information]
Criteria for Success: Escalation is logged, and the appropriate team takes over the incident.
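
The 30-minute rule is simple to automate as a timer check. The sketch below assumes the clock starts when the alert is acknowledged. The runbook does not state the starting point, so this is an assumption, and the names are hypothetical.

```python
from datetime import datetime, timedelta

ESCALATION_WINDOW = timedelta(minutes=30)


def should_escalate(acknowledged_at, now, resolved):
    """True if the incident is unresolved and the escalation window has elapsed."""
    return (not resolved) and (now - acknowledged_at) >= ESCALATION_WINDOW


acknowledged_at = datetime(2024, 11, 7, 12, 0)

# 31 minutes after acknowledgment, still unresolved.
escalate_now = should_escalate(acknowledged_at, datetime(2024, 11, 7, 12, 31), resolved=False)
```

Passing `now` in as an argument, rather than reading the system clock inside the function, keeps the check easy to test.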

Criteria for Successful Resolution

CPU usage returns to below the 85% alert threshold.
Services are fully operational and performing as expected.
Incident is documented and communicated to relevant teams.

Tools and Resources

Monitoring: Amazon CloudWatch, AWS Systems Manager
Incident Management: AWS Systems Manager Incident Manager
Communication: Amazon SNS, email, Slack
Automation: AWS Systems Manager Automation for restarting services or scaling resources

Review and Maintenance

Review Frequency: Monthly
Owner Responsible for Updates: Runbook Author

Supporting Artifacts

High CPU Usage Investigation Checklist
Incident Communication Templates
Resource Scaling Playbook for automated actions
