Distributed systems and clusters connect networking with operating systems, databases and system design. In the PSC Computer Engineer exam, this topic is not only about definitions; you should be able to explain why distributed systems are built, what failures they face, how clusters improve availability/performance, and what tradeoffs appear in replication, consistency and partition tolerance.
Engineering Definitions
Distributed system
Standard definition: A collection of independent computers that appears to users as a single coherent system.
Exam meaning: धेरै machines मिलेर एउटै system जस्तो service दिने architecture।
Cluster
Standard definition: A group of interconnected computers working together as a single computing resource for performance, availability or scalability.
Exam meaning: नजिकबाट managed गरिएका multiple nodes जसले एउटै service/resource जस्तो काम गर्छन्।
Transparency
Standard definition: The property of hiding distribution details such as location, replication, migration or failure from users and applications.
Exam meaning: System distributed भए पनि user/program लाई unnecessary complexity नदेखिने गुण।
Replication
Standard definition: Maintaining copies of data or service on multiple nodes to improve availability, performance or fault tolerance.
Exam meaning: एउटै data/service का copies धेरै nodes मा राख्ने technique।
Consistency
Standard definition: The rule that defines what value a read operation may return when data is replicated and updated.
Exam meaning: Replicated data update भएपछि readers ले कुन version देख्ने भन्ने guarantee।
Fault tolerance
Standard definition: The ability of a system to continue operating correctly despite failures of components.
Exam meaning: Node/network/component fail हुँदा पनि service चलिरहने क्षमता।
RPC
Standard definition: Remote Procedure Call is a communication abstraction where a program invokes a procedure on a remote machine as if it were local.
Exam meaning: Remote server को function call लाई local function call जस्तो बनाउने abstraction।
Concept Teaching
A distributed system is built because one machine may not be enough for performance, reliability, geographic reach or data volume. But the price is complexity: partial failure, network delay, concurrency, inconsistent replicas, security boundaries and coordination problems. A good engineering answer always balances benefit and tradeoff.
Why Distributed Systems Are Needed
Distributed systems solve real engineering limits of single-machine systems.
- Scalability: add more machines to handle more users, data or computation.
- Availability: service continues even if one node fails.
- Performance: requests can be processed in parallel or served from nearer nodes.
- Resource sharing: storage, printers, databases and services can be shared across network.
- Geographic distribution: users in different regions can access nearby servers.
- Organizational distribution: different systems owned by different teams can interoperate through APIs.
Core Challenges in Distributed Systems
Distributed systems are hard because failures and timing are not uniform.
- Partial failure: one node or link can fail while rest of system still runs.
- No global clock: different nodes have clock skew, making event ordering difficult.
- Network latency: remote communication is much slower and less predictable than memory access.
- Concurrency: multiple clients may update shared state at the same time.
- Replication conflict: copies of data can temporarily disagree.
- Security: communication crosses machine and network boundaries.
- Observability: debugging needs logs, metrics and traces from many nodes.
Transparency Types
Transparency is a favorite subjective-question concept. It means hiding distribution complexity where possible.
| Transparency type | Meaning | Example |
|---|---|---|
| Access transparency | Same operations for local and remote resources | Open remote file like local file conceptually |
| Location transparency | Resource name does not reveal physical location | Service name instead of server IP |
| Migration transparency | Resource can move without affecting user | VM/container moves to another host |
| Replication transparency | Multiple copies appear as one resource | Read from replicated database |
| Concurrency transparency | Many users share resource safely | Transactions/locks hide simultaneous access |
| Failure transparency | System masks certain failures | Retry/failover to another node |
Cluster Architecture
A cluster is a practical distributed system arrangement where nodes are usually administered together.
- Cluster nodes run the same or coordinated workloads.
- A load balancer or scheduler distributes requests/jobs among nodes.
- Shared-nothing clusters keep independent CPU, memory and storage per node; easier to scale horizontally.
- Shared-storage clusters allow multiple nodes to access common storage; useful but needs coordination.
- Heartbeat messages detect node liveness.
- Failover moves service from failed node to healthy node.
- Cluster management needs membership, health check, configuration and deployment control.
Cluster Types
Different clusters optimize for different goals.
| Cluster type | Main goal | Example use |
|---|---|---|
| High availability cluster | Minimize downtime using failover | Web/database service redundancy |
| Load balancing cluster | Distribute user requests across nodes | Public website/application servers |
| High performance computing cluster | Parallel computation | Scientific simulation, rendering |
| Storage cluster | Distributed storage capacity and durability | Object/file/block storage systems |
| Database cluster | Replication and distributed query/write handling | Primary-replica or sharded database |
RPC and Message Passing
Distributed components communicate using messages. RPC hides message exchange behind function-call style abstraction, but network failure still exists.
- Client stub marshals parameters into a network message.
- Server stub unmarshals request, calls real procedure and marshals response.
- Serialization converts objects/data into transferable bytes, such as JSON, Protocol Buffers or XML.
- RPC may fail due to timeout, server crash, duplicate request or network partition.
- Idempotent operations are safer for retries because repeating them has the same effect.
- Exam trap: remote call is not the same as local call; latency, failure and partial execution matter.
Replication Models
Replication improves read performance and availability, but it creates consistency problems.
- Primary-replica: one primary handles writes and replicas copy updates.
- Multi-primary: multiple nodes accept writes, but conflict resolution is harder.
- Synchronous replication: write completes after replicas confirm; stronger consistency but higher latency.
- Asynchronous replication: primary returns before all replicas update; faster but replicas can lag.
- Quorum idea: read/write operations require agreement from a subset of replicas.
- Replication is not the same as backup; replication is online availability/performance, backup is recovery from loss/corruption.
Consistency Models
Consistency model tells what behavior clients can expect from replicated data.
| Model | Guarantee | Tradeoff |
|---|---|---|
| Strong consistency | Reads see latest successful write | Higher coordination and latency |
| Sequential consistency | All clients observe operations in one valid order | Still may not match real-time order |
| Causal consistency | Causally related updates seen in order | Concurrent updates may appear differently |
| Eventual consistency | Replicas converge if no new updates occur | Temporary stale reads possible |
| Read-your-writes | User sees own updates after writing | Useful user-facing guarantee |
CAP Tradeoff and Network Partition
CAP theorem is often simplified badly. The useful exam explanation is about behavior during network partition.
- Consistency: every read sees the latest write or an error under the chosen model.
- Availability: every request to a non-failing node receives a response.
- Partition tolerance: system continues despite network split/lost messages.
- In a partition, a distributed data system must choose between returning possibly stale data or refusing some operations.
- CP choice favors consistency by rejecting/limiting operations during partition.
- AP choice favors availability by serving requests and reconciling later.
- Exam trap: CAP does not say a system can normally have only two properties in all situations; the hard choice appears when partition happens.
Fault Tolerance Mechanisms
Fault tolerance combines detection, isolation, recovery and redundancy.
- Heartbeat and health checks detect failed nodes.
- Timeouts detect unavailable remote calls, but choosing timeout value is difficult.
- Retries handle transient failures; backoff prevents retry storms.
- Circuit breaker stops repeatedly calling a failing service.
- Failover redirects traffic to healthy node.
- Checkpointing saves state so computation can resume after failure.
- Consensus protocols coordinate agreement among nodes, but are more complex and costly.
Load Balancing and Scalability
Load balancing distributes work so no single node becomes a bottleneck.
- Round robin sends requests cyclically to servers.
- Least connections sends traffic to server with fewer active connections.
- Weighted algorithms give stronger servers more traffic.
- Consistent hashing reduces remapping when nodes join/leave, useful in distributed caches/storage.
- Horizontal scaling adds more machines; vertical scaling adds resources to one machine.
- Stateful applications need session management, shared storage or sticky sessions.
Engineering Mechanism
- Client sends request to load balancer, service registry or known server endpoint.
- Load balancer selects a healthy node using an algorithm such as round robin or least connections.
- Service node may call other nodes using RPC/message passing.
- Data may be read from a replica, primary database, cache or distributed storage.
- If a node fails, health checks detect it and traffic is redirected.
- Replication and recovery mechanisms restore redundancy.
- Consistency rules decide whether reads can return stale data or must coordinate with current primary/quorum.
Diagrams / Models To Draw
- Draw client -> load balancer -> multiple application nodes -> replicated database.
- Draw primary-replica replication with write to primary and replication to followers.
- Draw active-passive high availability cluster with heartbeat and failover.
- Draw RPC flow: client stub -> network -> server stub -> procedure -> response.
- Draw CAP partition: two groups of nodes separated by network failure.
- Draw consistent hashing ring for distributed cache/storage.
Formulas, Fields and Algorithms
- Availability approximation for independent redundant components: system availability improves when failover replicas are independent.
- Quorum rule often expressed as R + W > N for strong quorum overlap, where N = replicas, R = read quorum, W = write quorum.
- Horizontal scaling = adding more nodes; vertical scaling = increasing CPU/RAM/storage of one node.
- Failover time roughly includes detection time + election/activation time + traffic redirection time.
- Replication lag = time between primary write and replica visibility.
- Consistent hashing reduces key movement from O(K) toward O(K/N) when one node changes, conceptually.
| Concept | Engineering purpose | Exam distinction |
|---|---|---|
| Distributed system | Many independent computers provide one coherent service | Broader concept than a cluster |
| Cluster | Managed group of nodes for availability/performance | Usually tightly managed and similar-purpose nodes |
| Replication | Copies data/service for performance and availability | Creates consistency and conflict issues |
| Load balancing | Distributes requests/jobs | Does not by itself guarantee data consistency |
| Fault tolerance | Continues service despite failure | Needs redundancy plus detection/recovery |
| RPC | Remote function-call abstraction | Remote calls can fail and have high latency |
| Consensus | Agreement among nodes | Useful for leader election/replicated log but costly |
Exam Point
- For distributed system answers, always include benefits and challenges.
- Differentiate distributed system, cluster, grid and cloud if asked.
- For replication, mention consistency tradeoff and replication lag.
- For cluster, mention load balancer/scheduler, heartbeat, failover and shared-nothing/shared-storage idea.
- For CAP, explain the partition scenario carefully instead of writing only C, A and P full forms.
- For RPC, mention stubs, marshalling, timeout and retry/idempotence.
Worked Example
Consider an online exam result website. During high traffic, a load balancer sends users to multiple web servers. Static content may be cached, application nodes read from database replicas, and writes go to a primary database. If one web server fails, health checks remove it. If the primary database fails, a replica may be promoted. The system gains availability and performance, but engineers must handle stale replica reads, failover time and consistency during network partition.
Subjective Answer Pattern
- Define distributed system and cluster.
- Explain why distribution is needed: scalability, availability, performance and resource sharing.
- Discuss challenges: partial failure, latency, concurrency, clock synchronization and consistency.
- Explain cluster architecture with nodes, load balancer/scheduler, heartbeat and failover.
- Describe replication models and consistency models.
- Explain fault tolerance mechanisms and CAP tradeoff.
- Conclude with an example architecture and its tradeoffs.
Common Engineering Mistakes
- Treating cluster and distributed system as exactly the same term.
- Saying replication automatically solves backup and consistency.
- Ignoring partial failure; in distributed systems, one component can fail while others continue.
- Explaining CAP as “choose any two forever” without partition context.
- Forgetting that RPC can timeout after the server already executed the operation.
- Assuming load balancing alone makes an application highly available.
- Ignoring state/session handling when scaling web servers.
MCQ Revision
- What is partial failure?
- Which transparency hides multiple copies from users?
- What does a heartbeat detect in a cluster?
- What is the difference between synchronous and asynchronous replication?
- In quorum systems, what is the meaning of R + W > N?
- Which CAP property refers to network split tolerance?
- What does RPC marshalling do?
- Which load balancing method sends traffic cyclically?
Final Summary
- Distributed systems use multiple independent computers to provide one coherent service.
- Clusters are managed groups of nodes commonly used for high availability, load balancing and parallel processing.
- Replication improves availability and read performance but introduces consistency and lag issues.
- Fault tolerance requires detection, redundancy, failover, retries and recovery.
- CAP tradeoff matters during network partitions.
- Engineering exam answers should describe architecture, mechanism and tradeoff together.