Alert Response Runbook Example: High CPU Usage Alert
Runbook Name: High CPU Usage Alert Response
Date Created: November 7, 2024
Date Updated: November 7, 2024
Author: Kevin McCaffrey
Assigned Owner: [Team Member’s Name]
Alert Details
- Alert Type: High CPU Usage
- Trigger: CPU usage exceeds 85% for more than 5 minutes
- Monitoring Tool: Amazon CloudWatch
- Severity Level: High
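As a minimal sketch, the trigger above could be written as parameters for a CloudWatch alarm. The alarm naming scheme, the example instance ID, and the choice of five 60-second evaluation periods to approximate "more than 5 minutes" are assumptions for illustration, not values taken from this runbook. The dictionary keys follow the CloudWatch PutMetricAlarm API.

```python
def build_cpu_alarm_params(instance_id, threshold=85.0, minutes=5):
    """Build PutMetricAlarm parameters for a sustained high-CPU alarm."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,                  # seconds per evaluation period (assumed)
        "EvaluationPeriods": minutes,  # consecutive periods that must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Placeholder instance ID, for illustration only.
params = build_cpu_alarm_params("i-0123456789abcdef0")

# With boto3 installed and credentials configured, these could be passed as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes the threshold and duration easy to review alongside the runbook before anything is applied to an account.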
Step-by-Step Response Process
1. Acknowledge the Alert
- Action: Log into the incident management tool (e.g., AWS Systems Manager Incident Manager) and acknowledge the alert.
- Criteria for Success: Alert status changes to “Acknowledged,” and stakeholders are notified of the incident.
2. Gather Information
- Action: Review CloudWatch metrics to confirm sustained high CPU usage. Check the following:
  - Time the alert was triggered
  - Specific instance(s) impacted
  - Any correlated alerts or anomalies
- Tools: Amazon CloudWatch, AWS Systems Manager
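One way to confirm that the usage is sustained rather than a single spike is to check every datapoint in the trailing window. The sketch below assumes the CPU datapoints have already been exported as (timestamp, percent) pairs at one-minute granularity; the function name and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def is_sustained_high_cpu(datapoints, threshold=85.0, window_minutes=5):
    """Return True if every datapoint in the trailing window exceeds threshold.

    datapoints: list of (timestamp, cpu_percent) tuples, in any order.
    Assumes one datapoint per minute.
    """
    if not datapoints:
        return False
    latest = max(ts for ts, _ in datapoints)
    window_start = latest - timedelta(minutes=window_minutes)
    recent = [cpu for ts, cpu in datapoints if ts > window_start]
    # Require a full window of samples so a gap in data is not read as "sustained".
    return len(recent) >= window_minutes and all(cpu > threshold for cpu in recent)


# Made-up sample data for illustration.
start = datetime(2024, 11, 7, 12, 0)
sustained = [(start + timedelta(minutes=i), 90.0 + i) for i in range(6)]
spike = [(start + timedelta(minutes=i), 95.0 if i == 3 else 40.0) for i in range(6)]

print(is_sustained_high_cpu(sustained))  # True
print(is_sustained_high_cpu(spike))      # False
```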
3. Assess Impact
- Action: Determine if the high CPU usage is impacting user experience or critical workloads.
- Criteria for Success: Identify the service(s) impacted and the severity of the degradation.
4. Investigate the Cause
- Action: Check logs and recent deployments for any clues, such as:
  - Application logs: Review logs for errors or unusual activity.
  - System logs: Look for processes consuming excessive resources.
  - Recent changes: Investigate any deployments or configuration changes that may have triggered the alert.
- Tools: Amazon CloudWatch Logs, AWS Systems Manager, deployment logs
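To illustrate the search for processes consuming excessive resources, the sketch below parses the text output of `ps -eo pid,pcpu,comm` captured from an affected Linux instance. The 50% cutoff, the function name, and the sample output are assumptions for illustration only.

```python
def top_cpu_processes(ps_output, min_cpu=50.0):
    """Parse `ps -eo pid,pcpu,comm` output and return heavy CPU consumers.

    Returns (pid, cpu_percent, command) tuples, highest CPU first.
    """
    heavy = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, pcpu, command = line.split(None, 2)
        if float(pcpu) >= min_cpu:
            heavy.append((int(pid), float(pcpu), command))
    return sorted(heavy, key=lambda proc: proc[1], reverse=True)


# Made-up sample output, not taken from a real host.
SAMPLE = """\
  PID %CPU COMMAND
    1  0.0 systemd
 2417 92.5 java
 3090 61.0 python3
 4100  3.2 nginx
"""

for pid, cpu, command in top_cpu_processes(SAMPLE):
    print(pid, cpu, command)
```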
5. Take Action Based on Findings
- Scenario A: Temporary Spike
  - Action: If the CPU usage spike appears to be temporary and caused by a scheduled job, monitor for further degradation and document the findings.
  - Criteria for Success: CPU usage normalizes without further intervention.
- Scenario B: Long-Running Process or High Load
  - Action: Restart the offending process, or scale resources up or out (e.g., move to a larger instance size or add instances behind the load balancer).
  - Criteria for Success: CPU usage drops below 85%, and the service recovers.
- Scenario C: Code or Configuration Issue
  - Action: Roll back to the previous stable deployment or apply a patch.
  - Criteria for Success: CPU usage stabilizes, and the issue is resolved.
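The three scenarios above can be sketched as a small decision helper. The two input flags and their precedence (a recent change is checked before a scheduled job) are assumptions; the runbook itself does not say which finding wins when several apply.

```python
from enum import Enum


class Scenario(Enum):
    TEMPORARY_SPIKE = "A"
    HIGH_LOAD = "B"
    CODE_OR_CONFIG = "C"


# Action text condensed from the scenarios in this step.
ACTIONS = {
    Scenario.TEMPORARY_SPIKE: "Monitor for further degradation and document findings",
    Scenario.HIGH_LOAD: "Restart the offending process or scale resources",
    Scenario.CODE_OR_CONFIG: "Roll back to the previous stable deployment or apply a patch",
}


def choose_scenario(recent_change, scheduled_job_running):
    """Map investigation findings to a response scenario (assumed precedence)."""
    if recent_change:
        return Scenario.CODE_OR_CONFIG
    if scheduled_job_running:
        return Scenario.TEMPORARY_SPIKE
    return Scenario.HIGH_LOAD


scenario = choose_scenario(recent_change=False, scheduled_job_running=True)
print(scenario.value, ACTIONS[scenario])
```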
6. Document and Communicate
- Action: Record the root cause, actions taken, and resolution in the incident management system. Notify stakeholders of the incident status and any follow-up actions required.
- Tools: AWS Systems Manager Incident Manager, email/Slack notifications
- Criteria for Success: Incident report is completed, and stakeholders are informed.
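As a rough sketch of what the incident record might capture, the structure below holds the fields named in this step and renders a plain-text summary suitable for an email or Slack message. The class name, field names, and example values are hypothetical and do not reflect any Incident Manager schema.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentRecord:
    title: str
    root_cause: str
    actions_taken: list
    resolution: str
    follow_ups: list = field(default_factory=list)

    def summary(self):
        """Render the record as plain text for stakeholder notifications."""
        lines = [
            f"Incident: {self.title}",
            f"Root cause: {self.root_cause}",
            "Actions taken:",
        ]
        lines += [f"- {action}" for action in self.actions_taken]
        lines.append(f"Resolution: {self.resolution}")
        if self.follow_ups:
            lines.append("Follow-up actions:")
            lines += [f"- {item}" for item in self.follow_ups]
        return "\n".join(lines)


# Example values are invented for illustration.
record = IncidentRecord(
    title="High CPU Usage Alert",
    root_cause="Long-running batch process",
    actions_taken=["Restarted the offending process"],
    resolution="CPU usage returned below 85%",
    follow_ups=["Review batch job schedule"],
)
print(record.summary())
```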
7. Escalate if Unresolved
- Action: If the issue cannot be resolved within 30 minutes, escalate to the Operations Manager or relevant team.
- Escalation Path: Contact Operations Manager at [Contact Information]
- Criteria for Success: Escalation is logged, and the appropriate team takes over the incident.
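The 30-minute escalation rule can be checked with a simple elapsed-time comparison. This sketch assumes the clock starts when the alert is acknowledged; the runbook does not state the starting point, so adjust it if your team measures from the alert trigger time instead.

```python
from datetime import datetime, timedelta

ESCALATION_LIMIT = timedelta(minutes=30)


def should_escalate(acknowledged_at, now, resolved):
    """Return True when an unresolved incident has reached the escalation limit."""
    return not resolved and (now - acknowledged_at) >= ESCALATION_LIMIT


# Illustrative timestamps only.
acknowledged = datetime(2024, 11, 7, 12, 0)
print(should_escalate(acknowledged, datetime(2024, 11, 7, 12, 10), resolved=False))  # False
print(should_escalate(acknowledged, datetime(2024, 11, 7, 12, 31), resolved=False))  # True
```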
Criteria for Successful Resolution
- CPU usage falls back below the 85% alert threshold.
- Services are fully operational and performing as expected.
- Incident is documented and communicated to relevant teams.
Tools and Resources
- Monitoring: Amazon CloudWatch, AWS Systems Manager
- Incident Management: AWS Systems Manager Incident Manager
- Communication: Amazon SNS, email, Slack
- Automation: AWS Systems Manager Automation for restarting services or scaling resources
Review and Maintenance
- Review Frequency: Monthly
- Owner Responsible for Updates: Runbook Author
Supporting Artifacts
- High CPU Usage Investigation Checklist
- Incident Communication Templates
- Resource Scaling Playbook for automated actions