Identify which kind of distributed system is required

Posted November 29, 2024

Updated March 22, 2025

By Kevin McCaffrey

Selecting the appropriate type of distributed system is crucial for reliability. Understanding your application’s timing and response requirements helps you define the interaction model and the resilience features needed to stay operationally stable when the network introduces delays or failures.

Best Practices

Identify the Required Distributed System Type

Assess your workload requirements to determine whether a hard real-time, soft real-time, or offline distributed system is necessary. These requirements drive design decisions and resource allocation, so getting them right is central to reliability.
For hard real-time systems, implement mechanisms such as prioritized message queues and low-latency communication protocols to ensure that responses meet stringent timing requirements. This minimizes the chance of failures due to timing violations.
In soft real-time systems, design interactions to be resilient to delays, using techniques such as timeouts and retries to handle transient network issues without impacting overall system reliability.
For offline systems, employ batch processing strategies that can tolerate delays in data processing. Design batch jobs so that a failed batch can be detected and reprocessed without degrading the rest of the system.
Regularly review and update your architecture to match the current demands of your workload, considering factors like traffic patterns and processing load, to ensure the chosen system type remains suitable.
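The assessment described above can be sketched as a small classification step. This is only an illustration: the `Workload` fields, the `classify` helper, and the one-second cut-off in `HARD_DEADLINE_SECONDS` are hypothetical and should be replaced with values that come from your own requirements.

```python
from dataclasses import dataclass
from enum import Enum


class SystemType(Enum):
    HARD_REAL_TIME = "hard real-time"
    SOFT_REAL_TIME = "soft real-time"
    OFFLINE = "offline"


@dataclass
class Workload:
    name: str
    max_response_seconds: float  # longest acceptable wait for a response
    caller_waits: bool           # True if a caller blocks until the response arrives


# Hypothetical cut-off: anything that must answer within this window
# is treated as hard real-time. Pick a value that fits your workload.
HARD_DEADLINE_SECONDS = 1.0


def classify(workload: Workload) -> SystemType:
    """Map a workload's response requirements to a distributed system type."""
    if not workload.caller_waits:
        # Nobody blocks on the result, so batch or asynchronous processing fits.
        return SystemType.OFFLINE
    if workload.max_response_seconds <= HARD_DEADLINE_SECONDS:
        return SystemType.HARD_REAL_TIME
    return SystemType.SOFT_REAL_TIME
```

For example, `classify(Workload("checkout-api", 0.2, True))` returns `SystemType.HARD_REAL_TIME`, while a nightly job that no caller waits on is classified as `SystemType.OFFLINE`. Recording the result next to each workload makes it easier to revisit the decision when traffic patterns change.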

Questions to ask your team

Have you assessed the responsiveness requirements of your workload to determine whether a hard real-time, soft real-time, or offline system is appropriate?
What measures are in place to ensure that communication delays do not affect the reliability of your components?
How do you handle data loss during component interactions, and what fallbacks do you have in place?
Are there monitoring tools in place to identify and alert on failures or performance degradation in your distributed system?
Have you conducted tests to simulate failures in your communications network, and how did your system respond?
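One way to start answering the last two questions is a small failure-injection test. The sketch below simulates a dependency that fails a fixed number of times and wraps the call in retries with exponential backoff and jitter. All names here (`TransientNetworkError`, `call_with_retries`, `FlakyService`) are hypothetical and stand in for your own client code and test doubles.

```python
import random
import time


class TransientNetworkError(Exception):
    """Stands in for a timeout or dropped connection."""


def call_with_retries(operation, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise  # retries exhausted, surface the failure to the caller
            # Exponential backoff with full jitter to avoid synchronized retries.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            sleep(delay)


class FlakyService:
    """Simulated dependency that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TransientNetworkError("simulated network failure")
        return "ok"


if __name__ == "__main__":
    service = FlakyService(failures_before_success=2)
    print(call_with_retries(service.fetch), "after", service.calls, "calls")
```

Retries like this are only safe for idempotent operations, and the attempt limit keeps a persistent outage from turning into unbounded retry traffic. Passing a no-op `sleep` keeps the test fast when it runs in a test suite.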

Who should be doing this?

System Architect

Evaluate and determine the type of distributed system required based on workload needs.
Design system interactions that minimize dependencies and ensure component communication reliability.
Implement strategies for data replication and redundancy to enhance reliability.
Define and document the performance requirements for the system, distinguishing between hard and soft real-time requirements.

DevOps Engineer

Set up and maintain monitoring systems to track component reliability and response times.
Automate deployment processes to ensure consistent and reliable system updates.
Implement and manage resilient infrastructure that can handle failures gracefully, ensuring minimal downtime.

Quality Assurance Engineer

Develop and execute test plans that simulate network failures and latency to assess system reliability.
Conduct stress testing on the distributed system to ensure it meets reliability standards under load.
Provide feedback on system interactions and identify potential failure points for improvement.

Network Engineer

Design and maintain the network architecture to support reliable communication between distributed components.
Implement techniques for fault tolerance in the network layer, such as load balancing and traffic shaping.
Monitor network performance and troubleshoot connectivity issues that may impact system reliability.

Product Owner

Define the business requirements for reliability and ensure alignment with distributed system design.
Prioritize reliability features and improvements in the product backlog.
Collaborate with stakeholders to establish acceptable reliability metrics and maintain communication regarding system performance.

What evidence shows this is happening in your organization?

Distributed System Design Template: A template to outline the design requirements for various types of distributed systems, including hard real-time, soft real-time, and offline systems. This template helps teams identify which system type is needed based on the reliability requirements of their workload.
Reliability Assessment Report: A report that evaluates existing distributed systems within an organization, focusing on their reliability, response times, and failure rates. The report provides insights into where enhancements can be made to prevent failures and improve mean time between failures (MTBF).
Distributed System Interaction Strategy Guide: A guide that outlines best practices for designing interactions in a distributed system to enhance reliability. The guide includes strategies for handling network latency and data loss while ensuring minimal impact on system components.
Reliability Metrics Dashboard: A real-time dashboard that tracks and visualizes key metrics related to the reliability of distributed systems, such as response times, failure rates, and component interdependencies. This dashboard helps stakeholders monitor system performance and make informed decisions.
Failure Recovery Playbook: A detailed playbook that provides step-by-step procedures for recovering from failures in distributed systems. It includes scenarios based on the type of distributed system in use, helping organizations effectively respond to and minimize downtime.
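The figures that feed a reliability report or dashboard can be derived from an outage log. The following is a minimal sketch, assuming outages are recorded as non-overlapping `(start, end)` pairs inside the observation window. The `reliability_summary` function and its output keys are hypothetical, not part of any particular monitoring tool.

```python
from datetime import datetime, timedelta


def reliability_summary(window_start, window_end, outages):
    """Compute MTBF, MTTR, and availability for one observation window.

    outages: list of (start, end) datetimes, assumed non-overlapping
    and fully inside the window.
    """
    total = window_end - window_start
    downtime = sum((end - start for start, end in outages), timedelta())
    failures = len(outages)
    if failures == 0:
        # No failures observed, so MTBF and MTTR are undefined for this window.
        return {"mtbf": None, "mttr": None, "availability": 1.0}
    uptime = total - downtime
    return {
        "mtbf": uptime / failures,       # mean operating time between failures
        "mttr": downtime / failures,     # mean time to recover from a failure
        "availability": uptime / total,  # fraction of the window spent up
    }


if __name__ == "__main__":
    summary = reliability_summary(
        datetime(2025, 1, 1),
        datetime(2025, 1, 11),
        [
            (datetime(2025, 1, 3, 10), datetime(2025, 1, 3, 12)),
            (datetime(2025, 1, 8, 1), datetime(2025, 1, 8, 5)),
        ],
    )
    print(summary)
```

In this example the ten-day window contains six hours of downtime across two failures, which gives an MTBF of 117 hours, an MTTR of 3 hours, and an availability of 0.975.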

Cloud Services

AWS

Amazon EC2: Amazon EC2 provides resizable compute capacity in the cloud. Combined with Amazon EC2 Auto Scaling and instance recovery, it lets you deploy distributed systems that handle failures by scaling out and replacing or recovering unhealthy instances.
Amazon SQS: Amazon Simple Queue Service (SQS) is a message queue service that helps decouple and scale microservices, distributed systems, and serverless applications, allowing for reliable message delivery even in case of network issues.
AWS Lambda: AWS Lambda allows you to run code in response to events and triggers without managing servers, helping to build resilient architectures that withstand failures.

Azure

Azure Virtual Machines: Azure VMs provide on-demand scalable computing resources, ensuring that your application can maintain reliability and recover from failures.
Azure Queue Storage: Azure Queue Storage provides reliable message queuing for communication between application components, ensuring tasks are not lost during processing failures.
Azure Functions: Azure Functions enables you to run small pieces of code (functions) without worrying about the infrastructure, promoting resilience through event-driven architecture.

Google Cloud Platform

Google Compute Engine: Google Compute Engine offers virtual machines that can be customized to provide consistent performance and reliability within a distributed system.
Google Cloud Pub/Sub: Google Cloud Pub/Sub is a messaging service for building real-time analytics and event-driven architectures, ensuring messages are delivered even during transient failures.
Google Cloud Functions: Google Cloud Functions allows you to run your code in response to events, enabling you to build applications that are resilient and scalable without the overhead of managing servers.

Question: How do you design interactions in a distributed system to prevent failures?
Pillar: Reliability (Code: REL)
