
Define a customer communication plan for outages

Defining a Customer Communication Plan for System Outages
Defining a robust communication plan for system outages is crucial for maintaining transparency and trust with customers and stakeholders. A well-designed communication plan helps ensure that affected users are informed promptly, understand the situation, and receive timely updates until the issue is resolved. Testing the communication plan in advance helps ensure that it can be effectively deployed during actual outages.

Establish Key Elements of the Communication Plan

A communication plan should cover the following elements so that nothing is improvised during an outage:

  • Audience Identification: Identify the affected audiences, such as customers, stakeholders, partners, and internal teams. Understand their expectations, preferred communication channels, and information needs during outages.
  • Communication Channels: Select appropriate channels for communicating with customers. Common channels include:
    • Status Pages: Publicly accessible web pages to share real-time updates.
    • Email Notifications: For direct communication with users.
    • SMS Alerts: For high-priority outages where immediate notification is critical.
    • Social Media: To reach a broader audience and provide visibility.
  • Messaging Strategy: Develop a consistent messaging strategy for outages. Messaging should include:
    • Initial Notification: Inform users as soon as an outage is detected, including what services are affected, the potential impact, and acknowledgment that the team is investigating.
    • Progress Updates: Provide regular updates to keep customers informed about investigation progress, estimated resolution times, and actions being taken.
    • Resolution Notification: Notify users when the services are restored, including a summary of what happened and any next steps (such as mitigations or follow-up).
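
The three message stages above can be prepared in advance as templates. The following is a minimal sketch using only the Python standard library. The template wording, the field names (`service`, `impact`, `next_update`, and so on), and the `render_message` helper are illustrative assumptions, not a prescribed format.

```python
from string import Template

# Hypothetical templates; the wording and field names are illustrative only.
TEMPLATES = {
    "initial": Template(
        "[Investigating] $service: we are investigating an issue affecting "
        "$impact. Next update by $next_update."
    ),
    "progress": Template(
        "[Update] $service: $progress. Estimated resolution: $eta. "
        "Next update by $next_update."
    ),
    "resolution": Template(
        "[Resolved] $service: service has been restored. Summary: $summary. "
        "Next steps: $next_steps."
    ),
}


def render_message(stage: str, **fields: str) -> str:
    """Render the message for one stage of the outage."""
    if stage not in TEMPLATES:
        raise ValueError(f"unknown stage: {stage!r}")
    # substitute() raises KeyError when a placeholder has no value, so an
    # incomplete message fails loudly instead of being sent with gaps.
    return TEMPLATES[stage].substitute(**fields)


if __name__ == "__main__":
    print(render_message(
        "initial",
        service="Checkout API",
        impact="payment processing",
        next_update="14:30 UTC",
    ))
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice in this sketch: a missing field stops the message from going out half-filled.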

Plan for Different Outage Scenarios

Define and prepare communication plans for different types of outages:

  • Partial Service Disruption: Involves a subset of features or users. Notifications should target affected customers with specific information.
  • Full Outage: Affects all users. Notifications should be broader, providing transparent information to all users and stakeholders.
  • Planned Maintenance: Clearly communicate planned maintenance in advance to minimize surprises. Notify users of scheduled downtime and provide reminders closer to the maintenance window.
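
One way to make these scenarios actionable is to record, for each one, who is notified and through which channels. The sketch below is an assumption-laden illustration: the audience labels, the channel choices, and the reminder offsets are hypothetical values chosen for the example, and each organization would set its own.

```python
from dataclasses import dataclass
from enum import Enum


class Scenario(Enum):
    PARTIAL = "partial_service_disruption"
    FULL = "full_outage"
    PLANNED = "planned_maintenance"


@dataclass(frozen=True)
class Routing:
    audience: str
    channels: tuple[str, ...]
    # Hours before the maintenance window at which reminders are sent.
    # Empty for unplanned outages.
    reminder_hours_before: tuple[int, ...] = ()


# Hypothetical routing table; audiences, channels, and offsets are examples.
ROUTING = {
    Scenario.PARTIAL: Routing("affected_customers", ("status_page", "email")),
    Scenario.FULL: Routing(
        "all_users_and_stakeholders",
        ("status_page", "email", "sms", "social_media"),
    ),
    Scenario.PLANNED: Routing(
        "all_users", ("status_page", "email"), reminder_hours_before=(72, 24, 1)
    ),
}


def route(scenario: Scenario) -> Routing:
    """Look up who to notify and how for a given outage scenario."""
    return ROUTING[scenario]


if __name__ == "__main__":
    for scenario in Scenario:
        print(scenario.value, route(scenario))
```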

Establish Roles and Responsibilities

Assign roles and responsibilities for managing communication during outages. Roles may include:

  • Communication Manager: Responsible for crafting and approving customer-facing messages.
  • Incident Response Lead: Coordinates with the technical team to gather accurate information about the outage.
  • Stakeholder Liaison: Manages direct communication with business stakeholders, partners, or VIP customers.

Test the Communication Plan

Test the communication plan regularly to ensure it works effectively during actual incidents. Conduct drills and simulations (similar to technical incident response exercises) to validate whether messages are being drafted, approved, and delivered promptly. Testing helps identify gaps in communication channels, messaging, or the responsiveness of assigned roles.
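
A drill is easier to evaluate when the draft, approval, and delivery times are compared against explicit targets. The following sketch assumes hypothetical targets (10, 15, and 20 minutes after detection) and a hypothetical `evaluate_drill` helper; the numbers are placeholders, not recommendations.

```python
from datetime import datetime, timedelta

# Hypothetical targets: maximum elapsed time from detection to each step.
TARGETS = {
    "drafted": timedelta(minutes=10),
    "approved": timedelta(minutes=15),
    "delivered": timedelta(minutes=20),
}


def evaluate_drill(detected_at: datetime, steps: dict[str, datetime]) -> list[str]:
    """Return the gaps found in a drill: steps that were skipped or late."""
    gaps = []
    for step, target in TARGETS.items():
        done_at = steps.get(step)
        if done_at is None:
            gaps.append(f"{step}: not completed")
        elif done_at - detected_at > target:
            late_by = done_at - detected_at - target
            minutes = int(late_by.total_seconds() // 60)
            gaps.append(f"{step}: late by {minutes} min")
    return gaps


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 12, 0)
    drill = {
        "drafted": datetime(2024, 1, 1, 12, 8),
        "approved": datetime(2024, 1, 1, 12, 20),
        # "delivered" is missing: the message never went out in this drill.
    }
    for gap in evaluate_drill(detected, drill):
        print(gap)
```

The resulting list of gaps could feed directly into the testing and simulation report produced after each exercise.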

Communicate Both Impact and Resolution

Communicate clearly with customers both during and after the outage:

  • During Impact: Be transparent about the status of the affected services, the progress being made to resolve the issue, and estimated recovery timelines.
  • After Resolution: When services are restored, notify customers promptly. Provide a summary of the incident, explain what caused it, and outline actions taken to prevent recurrence. This builds trust and provides a complete picture of the incident’s resolution.

Follow-Up Post-Outage

After the outage is resolved, follow up with a post-incident summary that provides more details about the cause and the corrective actions taken. Consider sending a customer satisfaction survey to gather feedback on how the outage was handled, which can help improve future responses.

Supporting Questions

  • How do you ensure that customers are informed promptly when outages occur?
  • What channels are used to communicate with different audiences during an outage?
  • How is the communication plan tested to ensure it will be effective during an actual incident?

Roles and Responsibilities

Communication Manager
Responsibilities:

  • Draft initial notifications, progress updates, and resolution messages.
  • Ensure that messaging is consistent, accurate, and delivered on time.

Incident Response Lead
Responsibilities:

  • Provide up-to-date information to the communication team to ensure accurate and timely communication with customers.
  • Coordinate with technical teams to gather necessary information regarding the outage.

Stakeholder Liaison
Responsibilities:

  • Maintain direct communication with business stakeholders, partners, or key customers to ensure they are informed and reassured during outages.
  • Communicate details about the outage resolution and follow-up actions to stakeholders.

Artifacts

  • Communication Plan Document: A detailed document that outlines the communication process for different types of outages, including responsibilities, audiences, and timelines for notifications.
  • Customer Notification Templates: Predefined message templates for initial notifications, progress updates, and resolution messages to speed up communication during incidents.
  • Testing and Simulation Report: A report summarizing the outcomes of communication plan tests, including areas for improvement and identified gaps.

Relevant AWS Tools

Notification and Communication Tools

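  • Amazon SNS (Simple Notification Service): Publishes notifications to subscribed customers via email or SMS during outages. Email subscribers must confirm their subscription before they receive messages, so set up subscriptions before an incident occurs.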
  • AWS Systems Manager Incident Manager: Manages incident workflows, including notifying stakeholders and tracking communication activities during outages.
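
As an illustration of the Amazon SNS option, the sketch below wraps the SNS `publish` call in a small helper. The helper name `publish_outage_notice`, the topic ARN, and the message text are hypothetical. The client is passed in as a parameter so the function can be exercised without AWS credentials; in practice it would be created with `boto3.client("sns")`.

```python
from typing import Any

# SNS documents email subjects as needing to be under 100 characters with no
# line breaks, so this sketch normalizes whitespace and truncates to 99.
SNS_SUBJECT_MAX = 99


def publish_outage_notice(
    sns_client: Any, topic_arn: str, subject: str, message: str
) -> str:
    """Publish one notice to an SNS topic and return the SNS message ID."""
    clean_subject = " ".join(subject.split())[:SNS_SUBJECT_MAX]
    response = sns_client.publish(
        TopicArn=topic_arn,
        Subject=clean_subject,
        Message=message,
    )
    return response["MessageId"]


# Example usage (requires boto3 and AWS credentials; the ARN is a placeholder):
#
#   import boto3
#   sns = boto3.client("sns")
#   publish_outage_notice(
#       sns,
#       "arn:aws:sns:us-east-1:123456789012:outage-updates",
#       "Checkout API: investigating elevated error rates",
#       "We are investigating an issue affecting payment processing.",
#   )
```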

Monitoring and Status Tools

  • AWS Health Dashboard (formerly the AWS Service Health Dashboard): Provides updates on the health of AWS services, which can be used as a reference for customers during widespread incidents affecting AWS infrastructure.
  • External Status Page: A status page that you host yourself, independently of the affected workload, to show the health of customer workloads and provide transparency during ongoing incidents. This is not a managed AWS service.

Automation Tools

  • AWS Lambda: Automates the process of sending notifications, updating status pages, or running scripts that ensure communication consistency during outages.
  • AWS Systems Manager Automation: Automates workflows during an outage to ensure that communication processes are followed and that notifications are issued at predefined intervals.
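
To illustrate issuing updates at predefined intervals, the sketch below shows a function that a scheduled AWS Lambda invocation could run to decide whether a progress update is overdue. The interval values, the event fields (`incident_id`, `severity`, `last_update_at`), and the return shape are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical update cadence per severity level.
UPDATE_INTERVAL = {
    "critical": timedelta(minutes=15),
    "major": timedelta(minutes=30),
    "minor": timedelta(hours=1),
}


def update_is_due(severity: str, last_update_at: datetime, now: datetime) -> bool:
    """Return True when the interval for this severity has elapsed."""
    return now - last_update_at >= UPDATE_INTERVAL[severity]


def lambda_handler(event, context):
    """Entry point for a scheduled invocation; the event shape is hypothetical.

    last_update_at must be an ISO 8601 timestamp with an explicit offset,
    for example "2024-01-01T12:00:00+00:00", so it can be compared with the
    timezone-aware current time.
    """
    last_update_at = datetime.fromisoformat(event["last_update_at"])
    now = datetime.now(timezone.utc)
    return {
        "incident_id": event["incident_id"],
        "update_due": update_is_due(event["severity"], last_update_at, now),
    }
```

When `update_due` is true, a follow-on step could prompt the Communication Manager or publish the next templated update.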