Use bulkhead architectures to limit scope of impact
Posted December 20, 2024
Updated March 22, 2025
By Kevin McCaffrey
ID: REL_REL10_3
Bulkhead architectures improve workload reliability by creating fault-isolated boundaries that contain a failure within a single segment, shielding the rest of the system from disruption. This isolation limits both the likelihood and the scope of impact of outages, producing a more resilient system.
Best Practices
Implementing Bulkhead Architectures
- Design your application with physical or logical boundaries that segment components into isolated environments, such as using separate instances, containers, or microservices.
- Identify critical components that can affect overall workload performance and partition them to prevent cascading failures.
- Utilize AWS services like Amazon ECS or AWS Lambda to define cell boundaries, ensuring failures in one cell do not impact others.
- Conduct failure testing (chaos engineering) to validate that your bulkhead architecture limits failure scope as expected.
- Monitor and log the performance of each isolated component to quickly detect failures and their impact on the overall workload.
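As a minimal sketch of the first two practices, a bulkhead can be as simple as giving each downstream dependency its own bounded worker pool, so a slow or failing dependency exhausts only its own capacity instead of starving every caller. The dependency names and pool sizes below are illustrative assumptions, not tied to any particular AWS service:

```python
from concurrent.futures import ThreadPoolExecutor

class Bulkhead:
    """One bounded thread pool per dependency: pool exhaustion stays local."""
    def __init__(self, max_concurrent: int):
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)

    def submit(self, fn, *args):
        return self._pool.submit(fn, *args)

# Hypothetical dependencies; each gets isolated capacity so a stalled
# "payments" service cannot consume the workers that serve "inventory".
bulkheads = {
    "payments": Bulkhead(max_concurrent=5),
    "inventory": Bulkhead(max_concurrent=10),
}

def call_dependency(name: str, fn, *args):
    # The timeout bounds how long a caller waits on an unhealthy dependency.
    return bulkheads[name].submit(fn, *args).result(timeout=2)
```

The same partitioning idea applies at larger granularities (separate instances, containers, or cells); the bounded pool is just the smallest in-process version of it.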
Questions to ask your team
- Have you identified components of your workload that can fail independently without affecting others?
- What mechanisms are in place to ensure that the failure of one component does not cascade to others?
- How do you monitor the health of each isolated component?
- Are there processes for quickly isolating and resolving failures within a bulkhead?
- How often do you test your fault isolation strategies under simulated failure conditions?
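One lightweight way to exercise the last question is a chaos-style check: route requests deterministically to cells, inject a failure into one cell, and verify the blast radius stays inside it. The cell names and routing rule below are invented for illustration:

```python
import zlib

CELLS = ["cell-0", "cell-1", "cell-2"]  # hypothetical cell names
failed = set()

def route(customer_id: str) -> str:
    # Deterministic routing pins each customer to one cell, so a cell
    # failure can only affect that cell's customers.
    return CELLS[zlib.crc32(customer_id.encode()) % len(CELLS)]

def handle(customer_id: str) -> str:
    cell = route(customer_id)
    if cell in failed:
        raise RuntimeError(f"{cell} is down")
    return f"served by {cell}"

# Chaos experiment: fail one cell, then measure the scope of impact.
failed.add("cell-0")
customers = [f"cust-{i}" for i in range(300)]
impacted = sum(1 for cid in customers if route(cid) in failed)
```

A real experiment would run against deployed infrastructure, but the assertion is the same: only the failed cell's share of traffic sees errors.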
Who should be doing this?
Cloud Architect
- Design and implement bulkhead architectures to ensure effective fault isolation.
- Identify critical components and establish fault-isolated boundaries.
- Assess the resilience of existing workloads and recommend improvements.
DevOps Engineer
- Implement and monitor architectures that support fault isolation.
- Automate deployment processes to ensure consistent implementation of bulkhead strategies.
- Conduct regular testing of fault isolation boundaries to validate effectiveness.
Site Reliability Engineer (SRE)
- Monitor workloads for failures and their impacts within isolated boundaries.
- Analyze post-incident reports to improve fault isolation strategies.
- Collaborate with development teams to ensure fault tolerance in application design.
What evidence shows this is happening in your organization?
- Bulkhead Architecture Implementation Guide: A comprehensive guide detailing how to design and implement bulkhead architectures within cloud workloads, focusing on best practices and architectural patterns.
- Fault Isolation Strategy Document: A strategic document outlining the approach to fault isolation in workloads, including diagrams and examples of bulkhead architectures in practice.
- Reliability Checklists for Fault Isolation: A checklist to assess the reliability of your workloads through fault isolation techniques, emphasizing the use of bulkhead architectures.
- Architecture Diagrams for Bulkhead Implementations: Visual diagrams that illustrate the architecture of workloads using bulkhead strategies, showing how components interact within isolated boundaries.
- Monitoring and Reporting Dashboard: A dashboard that tracks the health and performance of isolated components, helping to visualize the impact of failures and the effectiveness of the bulkhead design.
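A dashboard like the one above could be fed by per-component health counters, tracked independently so one failing bulkhead is flagged without coloring the others. The component names and the 50% error-rate threshold below are illustrative assumptions:

```python
from collections import defaultdict

errors = defaultdict(int)
requests = defaultdict(int)

def record(component: str, ok: bool):
    # Counters are keyed per component, mirroring the isolation boundary.
    requests[component] += 1
    if not ok:
        errors[component] += 1

def error_rate(component: str) -> float:
    total = requests[component]
    return errors[component] / total if total else 0.0

def unhealthy(threshold: float = 0.5):
    # Flag only components whose own error rate crosses the threshold;
    # a red cell does not change the status of its neighbors.
    return [c for c in requests if error_rate(c) >= threshold]
```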
Cloud Services
AWS
- Amazon EC2 Auto Scaling: Automatically adjusts the number of EC2 instances in response to traffic patterns, providing fault isolation by partitioning applications across multiple instances.
- Amazon ECS (Elastic Container Service): Enables running containerized applications in isolation, allowing services to be deployed across multiple clusters to minimize the impact of failures.
- Amazon RDS (Relational Database Service): Supports multi-AZ deployments for automatic failover, isolating database workloads to enhance reliability.
Azure
- Azure Virtual Machine Scale Sets: Allows you to deploy and manage a set of identical VMs, isolating components and ensuring high availability through instance management.
- Azure Kubernetes Service (AKS): Manages containerized applications, enabling fault isolation through the orchestration of pods across multiple nodes.
- Azure SQL Database: Provides built-in high availability and geo-redundancy features, isolating database instances to maintain performance during failures.
Google Cloud Platform
- Google Kubernetes Engine (GKE): Orchestrates containers across clusters, providing fault isolation by distributing workloads and allowing for self-healing mechanisms.
- Google Compute Engine Instance Groups: Automatically manages a collection of VM instances, offering fault isolation through load balancing and auto-healing features.
- Cloud SQL: Managed database service with high availability configurations, which isolates data workloads to minimize downtime during failures.