3.6 Distributed Systems and Clusters

Distributed systems and clusters connect networking with operating systems, databases and system design. In the PSC Computer Engineer exam, this topic is not only about definitions; you should be able to explain why distributed systems are built, what failures they face, how clusters improve availability/performance, and what tradeoffs appear in replication, consistency and partition tolerance.

Engineering Definitions

Distributed system

Standard definition: A collection of independent computers that appears to users as a single coherent system.

Exam meaning: धेरै machines मिलेर एउटै system जस्तो service दिने architecture।

Cluster

Standard definition: A group of interconnected computers working together as a single computing resource for performance, availability or scalability.

Exam meaning: नजिकबाट managed गरिएका multiple nodes जसले एउटै service/resource जस्तो काम गर्छन्।

Transparency

Standard definition: The property of hiding distribution details such as location, replication, migration or failure from users and applications.

Exam meaning: System distributed भए पनि user/program लाई unnecessary complexity नदेखिने गुण।

Replication

Standard definition: Maintaining copies of data or service on multiple nodes to improve availability, performance or fault tolerance.

Exam meaning: एउटै data/service का copies धेरै nodes मा राख्ने technique।

Consistency

Standard definition: The rule that defines what value a read operation may return when data is replicated and updated.

Exam meaning: Replicated data update भएपछि readers ले कुन version देख्ने भन्ने guarantee।

Fault tolerance

Standard definition: The ability of a system to continue operating correctly despite failures of components.

Exam meaning: Node/network/component fail हुँदा पनि service चलिरहने क्षमता।

RPC

Standard definition: Remote Procedure Call is a communication abstraction where a program invokes a procedure on a remote machine as if it were local.

Exam meaning: Remote server को function call लाई local function call जस्तो बनाउने abstraction।

Concept Teaching

A distributed system is built because one machine may not be enough for performance, reliability, geographic reach or data volume. But the price is complexity: partial failure, network delay, concurrency, inconsistent replicas, security boundaries and coordination problems. A good engineering answer always balances benefit and tradeoff.

Why Distributed Systems Are Needed

Distributed systems solve real engineering limits of single-machine systems.

Scalability: add more machines to handle more users, data or computation.
Availability: service continues even if one node fails.
Performance: requests can be processed in parallel or served from nearer nodes.
Resource sharing: storage, printers, databases and services can be shared across network.
Geographic distribution: users in different regions can access nearby servers.
Organizational distribution: different systems owned by different teams can interoperate through APIs.

Core Challenges in Distributed Systems

Distributed systems are hard because failures and timing are not uniform.

Partial failure: one node or link can fail while rest of system still runs.
No global clock: different nodes have clock skew, making event ordering difficult.
Network latency: remote communication is much slower and less predictable than memory access.
Concurrency: multiple clients may update shared state at the same time.
Replication conflict: copies of data can temporarily disagree.
Security: communication crosses machine and network boundaries.
Observability: debugging needs logs, metrics and traces from many nodes.

Transparency Types

Transparency is a favorite subjective-question concept. It means hiding distribution complexity where possible.

Transparency type	Meaning	Example
Access transparency	Same operations for local and remote resources	Open remote file like local file conceptually
Location transparency	Resource name does not reveal physical location	Service name instead of server IP
Migration transparency	Resource can move without affecting user	VM/container moves to another host
Replication transparency	Multiple copies appear as one resource	Read from replicated database
Concurrency transparency	Many users share resource safely	Transactions/locks hide simultaneous access
Failure transparency	System masks certain failures	Retry/failover to another node

Cluster Architecture

A cluster is a practical distributed system arrangement where nodes are usually administered together.

Cluster nodes run the same or coordinated workloads.
A load balancer or scheduler distributes requests/jobs among nodes.
Shared-nothing clusters keep independent CPU, memory and storage per node; easier to scale horizontally.
Shared-storage clusters allow multiple nodes to access common storage; useful but needs coordination.
Heartbeat messages detect node liveness.
Failover moves service from failed node to healthy node.
Cluster management needs membership, health check, configuration and deployment control.

Cluster Types

Different clusters optimize for different goals.

Cluster type	Main goal	Example use
High availability cluster	Minimize downtime using failover	Web/database service redundancy
Load balancing cluster	Distribute user requests across nodes	Public website/application servers
High performance computing cluster	Parallel computation	Scientific simulation, rendering
Storage cluster	Distributed storage capacity and durability	Object/file/block storage systems
Database cluster	Replication and distributed query/write handling	Primary-replica or sharded database

RPC and Message Passing

Distributed components communicate using messages. RPC hides message exchange behind function-call style abstraction, but network failure still exists.

Client stub marshals parameters into a network message.
Server stub unmarshals request, calls real procedure and marshals response.
Serialization converts objects/data into transferable bytes, such as JSON, Protocol Buffers or XML.
RPC may fail due to timeout, server crash, duplicate request or network partition.
Idempotent operations are safer for retries because repeating them has the same effect.
Exam trap: remote call is not the same as local call; latency, failure and partial execution matter.

Replication Models

Replication improves read performance and availability, but it creates consistency problems.

Primary-replica: one primary handles writes and replicas copy updates.
Multi-primary: multiple nodes accept writes, but conflict resolution is harder.
Synchronous replication: write completes after replicas confirm; stronger consistency but higher latency.
Asynchronous replication: primary returns before all replicas update; faster but replicas can lag.
Quorum idea: read/write operations require agreement from a subset of replicas.
Replication is not the same as backup; replication is online availability/performance, backup is recovery from loss/corruption.

Consistency Models

Consistency model tells what behavior clients can expect from replicated data.

Model	Guarantee	Tradeoff
Strong consistency	Reads see latest successful write	Higher coordination and latency
Sequential consistency	All clients observe operations in one valid order	Still may not match real-time order
Causal consistency	Causally related updates seen in order	Concurrent updates may appear differently
Eventual consistency	Replicas converge if no new updates occur	Temporary stale reads possible
Read-your-writes	User sees own updates after writing	Useful user-facing guarantee

CAP Tradeoff and Network Partition

CAP theorem is often simplified badly. The useful exam explanation is about behavior during network partition.

Consistency: every read sees the latest write or an error under the chosen model.
Availability: every request to a non-failing node receives a response.
Partition tolerance: system continues despite network split/lost messages.
In a partition, a distributed data system must choose between returning possibly stale data or refusing some operations.
CP choice favors consistency by rejecting/limiting operations during partition.
AP choice favors availability by serving requests and reconciling later.
Exam trap: CAP does not say a system can normally have only two properties in all situations; the hard choice appears when partition happens.

Fault Tolerance Mechanisms

Fault tolerance combines detection, isolation, recovery and redundancy.

Heartbeat and health checks detect failed nodes.
Timeouts detect unavailable remote calls, but choosing timeout value is difficult.
Retries handle transient failures; backoff prevents retry storms.
Circuit breaker stops repeatedly calling a failing service.
Failover redirects traffic to healthy node.
Checkpointing saves state so computation can resume after failure.
Consensus protocols coordinate agreement among nodes, but are more complex and costly.

Load Balancing and Scalability

Load balancing distributes work so no single node becomes a bottleneck.

Round robin sends requests cyclically to servers.
Least connections sends traffic to server with fewer active connections.
Weighted algorithms give stronger servers more traffic.
Consistent hashing reduces remapping when nodes join/leave, useful in distributed caches/storage.
Horizontal scaling adds more machines; vertical scaling adds resources to one machine.
Stateful applications need session management, shared storage or sticky sessions.

Engineering Mechanism

Client sends request to load balancer, service registry or known server endpoint.
Load balancer selects a healthy node using an algorithm such as round robin or least connections.
Service node may call other nodes using RPC/message passing.
Data may be read from a replica, primary database, cache or distributed storage.
If a node fails, health checks detect it and traffic is redirected.
Replication and recovery mechanisms restore redundancy.
Consistency rules decide whether reads can return stale data or must coordinate with current primary/quorum.

Diagrams / Models To Draw

Draw client -> load balancer -> multiple application nodes -> replicated database.
Draw primary-replica replication with write to primary and replication to followers.
Draw active-passive high availability cluster with heartbeat and failover.
Draw RPC flow: client stub -> network -> server stub -> procedure -> response.
Draw CAP partition: two groups of nodes separated by network failure.
Draw consistent hashing ring for distributed cache/storage.

Formulas, Fields and Algorithms

Availability approximation for independent redundant components: system availability improves when failover replicas are independent.
Quorum rule often expressed as R + W > N for strong quorum overlap, where N = replicas, R = read quorum, W = write quorum.
Horizontal scaling = adding more nodes; vertical scaling = increasing CPU/RAM/storage of one node.
Failover time roughly includes detection time + election/activation time + traffic redirection time.
Replication lag = time between primary write and replica visibility.
Consistent hashing reduces key movement from O(K) toward O(K/N) when one node changes, conceptually.

Concept	Engineering purpose	Exam distinction
Distributed system	Many independent computers provide one coherent service	Broader concept than a cluster
Cluster	Managed group of nodes for availability/performance	Usually tightly managed and similar-purpose nodes
Replication	Copies data/service for performance and availability	Creates consistency and conflict issues
Load balancing	Distributes requests/jobs	Does not by itself guarantee data consistency
Fault tolerance	Continues service despite failure	Needs redundancy plus detection/recovery
RPC	Remote function-call abstraction	Remote calls can fail and have high latency
Consensus	Agreement among nodes	Useful for leader election/replicated log but costly

Exam Point

For distributed system answers, always include benefits and challenges.
Differentiate distributed system, cluster, grid and cloud if asked.
For replication, mention consistency tradeoff and replication lag.
For cluster, mention load balancer/scheduler, heartbeat, failover and shared-nothing/shared-storage idea.
For CAP, explain the partition scenario carefully instead of writing only C, A and P full forms.
For RPC, mention stubs, marshalling, timeout and retry/idempotence.

Worked Example

Consider an online exam result website. During high traffic, a load balancer sends users to multiple web servers. Static content may be cached, application nodes read from database replicas, and writes go to a primary database. If one web server fails, health checks remove it. If the primary database fails, a replica may be promoted. The system gains availability and performance, but engineers must handle stale replica reads, failover time and consistency during network partition.

Subjective Answer Pattern

Define distributed system and cluster.
Explain why distribution is needed: scalability, availability, performance and resource sharing.
Discuss challenges: partial failure, latency, concurrency, clock synchronization and consistency.
Explain cluster architecture with nodes, load balancer/scheduler, heartbeat and failover.
Describe replication models and consistency models.
Explain fault tolerance mechanisms and CAP tradeoff.
Conclude with an example architecture and its tradeoffs.

Common Engineering Mistakes

Treating cluster and distributed system as exactly the same term.
Saying replication automatically solves backup and consistency.
Ignoring partial failure; in distributed systems, one component can fail while others continue.
Explaining CAP as “choose any two forever” without partition context.
Forgetting that RPC can timeout after the server already executed the operation.
Assuming load balancing alone makes an application highly available.
Ignoring state/session handling when scaling web servers.

MCQ Revision

What is partial failure?
Which transparency hides multiple copies from users?
What does a heartbeat detect in a cluster?
What is the difference between synchronous and asynchronous replication?
In quorum systems, what is the meaning of R + W > N?
Which CAP property refers to network split tolerance?
What does RPC marshalling do?
Which load balancing method sends traffic cyclically?

Final Summary

Distributed systems use multiple independent computers to provide one coherent service.
Clusters are managed groups of nodes commonly used for high availability, load balancing and parallel processing.
Replication improves availability and read performance but introduces consistency and lag issues.
Fault tolerance requires detection, redundancy, failover, retries and recovery.
CAP tradeoff matters during network partitions.
Engineering exam answers should describe architecture, mechanism and tradeoff together.