
Ensure sufficient gap between quotas and usage to accommodate failover

Managing service quotas effectively is crucial to ensuring that your cloud workload can recover from failures without running into limits. A failed or inaccessible resource may still count against your quota until it is terminated, so its replacement can push usage past the limit exactly when capacity is needed most. Leave enough of a gap between quotas and usage to cover that overlap.

Best Practices

Establish Quota Monitoring and Alerts

  • Implement monitoring that tracks usage against service quotas in near real time, and review usage trends regularly so that potential breaches are identified before they impact operations. Use Amazon CloudWatch to set up alarms that notify teams when usage approaches a quota.
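
As a minimal sketch of the alerting logic described above, the following Python computes quota utilization and raises an alert only when the most recent samples all exceed a threshold, which avoids notifying on a single spike. The function names, the 80% threshold, and the three-sample window are illustrative assumptions, not values from AWS. In practice the usage samples would come from your monitoring system rather than a hard-coded list.

```python
from typing import Sequence


def utilization(usage: float, quota: float) -> float:
    """Return usage as a fraction of the quota (can exceed 1.0)."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    return usage / quota


def should_alert(samples: Sequence[float], quota: float,
                 threshold: float = 0.80, window: int = 3) -> bool:
    """Alert when the last `window` samples all exceed `threshold`.

    `threshold` and `window` are hypothetical defaults chosen for illustration.
    """
    if len(samples) < window:
        return False  # not enough data to decide
    recent = samples[-window:]
    return all(utilization(s, quota) > threshold for s in recent)


# Hypothetical usage samples for a quota of 100 (e.g. running instances).
history = [62, 70, 81, 84, 88]
print(should_alert(history, quota=100))
```

Requiring several consecutive samples over the threshold is a design choice that trades a little detection latency for fewer false alarms.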

Plan for Resource Overlap During Failures

  • Assess how resources overlap when one fails and a replacement must be provisioned. Quota planning has to account for both the active and the failover resources, because a failed resource may still count against the quota while its replacement starts. Use AWS Service Quotas to examine current limits and adjust them for your specific workloads.
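
The overlap can be made concrete with a small calculation. The sketch below assumes a workload spread across failure domains (for example, Availability Zones), that resources in a failed domain keep counting against the quota until they are terminated, and that full capacity is re-provisioned in the surviving domains. Under those assumptions the peak count is current usage plus the size of the largest domain. The zone names and numbers are hypothetical.

```python
from typing import Dict


def peak_usage_during_failover(per_domain: Dict[str, int]) -> int:
    """Worst-case count while failed resources and replacements overlap.

    Assumes the largest failure domain fails and all of its resources are
    replaced elsewhere before the failed ones stop counting against the quota.
    """
    if not per_domain:
        return 0
    return sum(per_domain.values()) + max(per_domain.values())


def failover_headroom(quota: int, per_domain: Dict[str, int]) -> int:
    """Quota left over at the worst-case peak; negative means a shortfall."""
    return quota - peak_usage_during_failover(per_domain)


# Hypothetical instance counts per Availability Zone.
usage = {"zone-a": 20, "zone-b": 20, "zone-c": 20}
print(peak_usage_during_failover(usage))   # 60 running + 20 replacements
print(failover_headroom(quota=64, per_domain=usage))
```

Here a quota of 64 looks comfortable against steady-state usage of 60 but leaves a shortfall of 16 during failover, so under these assumptions the quota would need to be at least 80.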

Test Failover Scenarios Regularly

  • Conduct regular failover testing to ensure that replacements are provisioned without exhausting service quotas during outages. This testing validates your gap calculations and helps refine your failover strategy. Simulate various failure scenarios including network, Availability Zone, and Regional failures to understand their impacts on quotas and service continuity.
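
Before running a live test, it can help to dry-run each failure scenario against your quota. The sketch below is a simplified model, not an AWS API: each scenario names the failure domains it takes out, failed resources are assumed to keep counting against the quota while replacements start, and the scenario names, zone names, and quota value are hypothetical.

```python
from typing import Dict, List


def simulate(per_zone: Dict[str, int], failed_zones: List[str], quota: int) -> dict:
    """Model one failure scenario and report whether replacements fit.

    Assumption: resources in failed zones still count against the quota
    until terminated, so peak = everything running + replacements.
    """
    to_replace = sum(per_zone[z] for z in failed_zones)
    peak = sum(per_zone.values()) + to_replace
    return {"replacements": to_replace, "peak": peak, "fits": peak <= quota}


per_zone = {"zone-a": 10, "zone-b": 10, "zone-c": 4}
scenarios = {
    "single-zone (zone-c)": ["zone-c"],
    "single-zone (zone-a)": ["zone-a"],
    "two-zone (zone-a, zone-b)": ["zone-a", "zone-b"],
}

for name, failed in scenarios.items():
    result = simulate(per_zone, failed, quota=34)
    status = "OK" if result["fits"] else "QUOTA EXCEEDED"
    print(f"{name}: peak={result['peak']} -> {status}")
```

A model like this only narrows down which scenarios deserve a live test; it does not replace one. A Regional failover would typically be checked against the quotas of the recovery Region, since many quotas apply per Region.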

Document and Review Quota Management Policies

  • Clearly document your quota management policies and procedures, outlining how gaps are calculated and the process for requesting increases when necessary. Regularly review and update these policies as workloads change, ensuring they remain aligned with your reliability goals and usage patterns. This promotes consistency in managing service constraints across teams.

Engage with AWS Support for Quota Increases

  • Establish a relationship with AWS Support to facilitate smoother requests for quota increases when necessary. Well-prepared documentation and detailed reasons for an increase can expedite the process. Engage proactively, especially before anticipated high-demand periods, to ensure quotas align with your operational needs.

Questions to ask your team

  • What strategies do you have in place to monitor your service quotas and alert you when usage approaches limits?
  • How do you assess the risk of potential resource failures in your architecture?
  • Have you tested how your system behaves when a resource limit is breached?
  • What processes are in place to request quota increases if necessary, and how quickly can this be executed?
  • Do you regularly review and adjust your quotas based on growth projections and historical usage patterns?
  • How do you ensure that your teams are informed about existing service quotas and their implications for system reliability?

Who should be doing this?

Cloud Architect

  • Design the architecture to include sufficient capacity and redundancy to handle potential failures.
  • Monitor resource usage and quotas regularly to identify discrepancies and potential overages.
  • Evaluate the impact of service limits on architecture choices and identify alternatives when necessary.
  • Document and communicate the strategies for managing service quotas and constraints within the team.

Operations Manager

  • Oversee the implementation of quota management strategies in the production environment.
  • Conduct regular reviews of service quotas to ensure they meet business needs.
  • Coordinate with the Cloud Architect to assess the impact of failures on resource availability and adjust quotas accordingly.
  • Ensure that failover mechanisms are in place and tested regularly.

DevOps Engineer

  • Implement automation scripts to monitor resource usage and alert when usage nears quota limits.
  • Work with the architecture team to deploy infrastructure that aligns with recommended quota management practices.
  • Assist in conducting failure simulations to evaluate performance against quotas.
  • Provide insights for optimizing resource usage and minimizing unnecessary resource allocation.

Technical Support Engineer

  • Respond to incidents related to resource limits and service quotas.
  • Assist in troubleshooting issues arising from quota constraints.
  • Provide feedback to the Cloud Architect on common patterns in quota-related incidents to inform updates to architecture and practices.
  • Document findings and lessons learned from incidents involving resource management.

What evidence shows this is happening in your organization?

  • Service Quota Management Plan: A comprehensive document outlining the strategies for monitoring, assessing, and managing service quotas and constraints. This includes methodologies for calculating the necessary gap between quotas and usage to accommodate failover scenarios.
  • Service Quota Dashboard: An interactive dashboard that visualizes current usage against established quotas, highlighting potential risks and failover conditions. The dashboard includes alerts for situations where resources are nearing their limits or where failover may impact service reliability.
  • Reliability Checklist: A checklist designed to help teams identify essential quota considerations and constraints. It includes evaluation steps for ensuring coverage against potential resource failures and maintaining the required gap.
  • Failover Strategy Guide: A guide detailing the strategies and best practices for implementing failover mechanisms. It encompasses the management of service quotas to ensure sufficient capacity during failover events, along with examples and scenarios.
  • Resource Utilization Matrix: A matrix that categorizes resources based on their quotas and current utilization levels, helping organizations analyze stress points and plan for redundancy. This tool provides insight into the gaps required to mitigate failover impact.
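
A resource utilization matrix can start as a simple table. The sketch below builds one from hypothetical rows of (resource, quota, usage) and assigns each a band. The band boundaries of 50% and 80% and the resource names are assumptions for illustration only.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    resource: str
    quota: int
    usage: int
    utilization_pct: float
    band: str


def band_for(pct: float) -> str:
    """Map a utilization percentage to a band (hypothetical cut-offs)."""
    if pct >= 80:
        return "high"
    if pct >= 50:
        return "medium"
    return "low"


def build_matrix(items: List[Tuple[str, int, int]]) -> List[Row]:
    """Return rows sorted by utilization, highest first."""
    rows = []
    for resource, quota, usage in items:
        pct = round(100 * usage / quota, 1)
        rows.append(Row(resource, quota, usage, pct, band_for(pct)))
    return sorted(rows, key=lambda r: r.utilization_pct, reverse=True)


# Hypothetical inventory: (resource, quota, current usage).
inventory = [("elastic-ips", 5, 4), ("instances", 100, 45), ("load-balancers", 50, 30)]
for row in build_matrix(inventory):
    print(f"{row.resource:<15} {row.usage:>4}/{row.quota:<4} {row.utilization_pct:>5}% {row.band}")
```

Sorting by utilization puts the resources with the least failover headroom at the top, which is where review time is usually best spent.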

Cloud Services

AWS

  • AWS Trusted Advisor: Provides real-time guidance to help you provision your resources optimally and alerts you to service limits that you may be approaching.
  • Amazon CloudWatch: Monitors your resources and applications in real time and lets you set alarms on usage metrics, which helps identify when you are approaching service quotas.
  • AWS Service Quotas: Allows you to view and manage your quotas for AWS services, enabling you to request increases proactively to avoid hitting service limits.
  • AWS Auto Scaling: Automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost, which helps in managing resources within quotas efficiently.

Azure

  • Azure Monitor: Collects and analyzes telemetry data from Azure resources to help detect when you are nearing quota limits.
  • Azure Advisor: Provides best practices and recommendations for Azure services, including monitoring service limits to optimize their usage.
  • Azure Resource Manager: Manages your infrastructure through a unified API, which includes tracking usage and quotas for your resources.

Google Cloud Platform

  • Google Cloud Monitoring: Provides visibility into resource usage, helping you monitor and respond to quota limits effectively.
  • Google Cloud Resource Manager: Enables you to manage and configure your resources in Google Cloud, helping to track usage and ensure compliance with service quotas.
  • Google Cloud Console: Provides insights on the current resource usage across services, enabling you to manage quotas and request increases where necessary.