Distributed systems and clusters connect networking with operating systems, databases and system design. In the PSC Computer Engineer exam, this topic is not only about definitions; you should be able to explain why distributed systems are built, what failures they face, how clusters improve availability/performance, and what tradeoffs appear in replication, consistency and partition tolerance.

Engineering Definitions

Distributed system

Standard definition: A collection of independent computers that appears to users as a single coherent system.

Exam meaning: धेरै machines मिलेर एउटै system जस्तो service दिने architecture।

Cluster

Standard definition: A group of interconnected computers working together as a single computing resource for performance, availability or scalability.

Exam meaning: नजिकबाट managed गरिएका multiple nodes जसले एउटै service/resource जस्तो काम गर्छन्।

Transparency

Standard definition: The property of hiding distribution details such as location, replication, migration or failure from users and applications.

Exam meaning: System distributed भए पनि user/program लाई unnecessary complexity नदेखिने गुण।

Replication

Standard definition: Maintaining copies of data or service on multiple nodes to improve availability, performance or fault tolerance.

Exam meaning: एउटै data/service का copies धेरै nodes मा राख्ने technique।

Consistency

Standard definition: The rule that defines what value a read operation may return when data is replicated and updated.

Exam meaning: Replicated data update भएपछि readers ले कुन version देख्ने भन्ने guarantee।

Fault tolerance

Standard definition: The ability of a system to continue operating correctly despite failures of components.

Exam meaning: Node/network/component fail हुँदा पनि service चलिरहने क्षमता।

RPC

Standard definition: Remote Procedure Call is a communication abstraction where a program invokes a procedure on a remote machine as if it were local.

Exam meaning: Remote server को function call लाई local function call जस्तो बनाउने abstraction।

Concept Teaching

A distributed system is built because one machine may not be enough for performance, reliability, geographic reach or data volume. But the price is complexity: partial failure, network delay, concurrency, inconsistent replicas, security boundaries and coordination problems. A good engineering answer always balances benefit and tradeoff.

Why Distributed Systems Are Needed

Distributed systems solve real engineering limits of single-machine systems.

  • Scalability: add more machines to handle more users, data or computation.
  • Availability: service continues even if one node fails.
  • Performance: requests can be processed in parallel or served from nearer nodes.
  • Resource sharing: storage, printers, databases and services can be shared across network.
  • Geographic distribution: users in different regions can access nearby servers.
  • Organizational distribution: different systems owned by different teams can interoperate through APIs.

Core Challenges in Distributed Systems

Distributed systems are hard because failures and timing are not uniform.

  • Partial failure: one node or link can fail while rest of system still runs.
  • No global clock: different nodes have clock skew, making event ordering difficult.
  • Network latency: remote communication is much slower and less predictable than memory access.
  • Concurrency: multiple clients may update shared state at the same time.
  • Replication conflict: copies of data can temporarily disagree.
  • Security: communication crosses machine and network boundaries.
  • Observability: debugging needs logs, metrics and traces from many nodes.

Transparency Types

Transparency is a favorite subjective-question concept. It means hiding distribution complexity where possible.

Transparency type Meaning Example
Access transparency Same operations for local and remote resources Open remote file like local file conceptually
Location transparency Resource name does not reveal physical location Service name instead of server IP
Migration transparency Resource can move without affecting user VM/container moves to another host
Replication transparency Multiple copies appear as one resource Read from replicated database
Concurrency transparency Many users share resource safely Transactions/locks hide simultaneous access
Failure transparency System masks certain failures Retry/failover to another node

Cluster Architecture

A cluster is a practical distributed system arrangement where nodes are usually administered together.

  • Cluster nodes run the same or coordinated workloads.
  • A load balancer or scheduler distributes requests/jobs among nodes.
  • Shared-nothing clusters keep independent CPU, memory and storage per node; easier to scale horizontally.
  • Shared-storage clusters allow multiple nodes to access common storage; useful but needs coordination.
  • Heartbeat messages detect node liveness.
  • Failover moves service from failed node to healthy node.
  • Cluster management needs membership, health check, configuration and deployment control.

Cluster Types

Different clusters optimize for different goals.

Cluster type Main goal Example use
High availability cluster Minimize downtime using failover Web/database service redundancy
Load balancing cluster Distribute user requests across nodes Public website/application servers
High performance computing cluster Parallel computation Scientific simulation, rendering
Storage cluster Distributed storage capacity and durability Object/file/block storage systems
Database cluster Replication and distributed query/write handling Primary-replica or sharded database

RPC and Message Passing

Distributed components communicate using messages. RPC hides message exchange behind function-call style abstraction, but network failure still exists.

  • Client stub marshals parameters into a network message.
  • Server stub unmarshals request, calls real procedure and marshals response.
  • Serialization converts objects/data into transferable bytes, such as JSON, Protocol Buffers or XML.
  • RPC may fail due to timeout, server crash, duplicate request or network partition.
  • Idempotent operations are safer for retries because repeating them has the same effect.
  • Exam trap: remote call is not the same as local call; latency, failure and partial execution matter.

Replication Models

Replication improves read performance and availability, but it creates consistency problems.

  • Primary-replica: one primary handles writes and replicas copy updates.
  • Multi-primary: multiple nodes accept writes, but conflict resolution is harder.
  • Synchronous replication: write completes after replicas confirm; stronger consistency but higher latency.
  • Asynchronous replication: primary returns before all replicas update; faster but replicas can lag.
  • Quorum idea: read/write operations require agreement from a subset of replicas.
  • Replication is not the same as backup; replication is online availability/performance, backup is recovery from loss/corruption.

Consistency Models

Consistency model tells what behavior clients can expect from replicated data.

Model Guarantee Tradeoff
Strong consistency Reads see latest successful write Higher coordination and latency
Sequential consistency All clients observe operations in one valid order Still may not match real-time order
Causal consistency Causally related updates seen in order Concurrent updates may appear differently
Eventual consistency Replicas converge if no new updates occur Temporary stale reads possible
Read-your-writes User sees own updates after writing Useful user-facing guarantee

CAP Tradeoff and Network Partition

CAP theorem is often simplified badly. The useful exam explanation is about behavior during network partition.

  • Consistency: every read sees the latest write or an error under the chosen model.
  • Availability: every request to a non-failing node receives a response.
  • Partition tolerance: system continues despite network split/lost messages.
  • In a partition, a distributed data system must choose between returning possibly stale data or refusing some operations.
  • CP choice favors consistency by rejecting/limiting operations during partition.
  • AP choice favors availability by serving requests and reconciling later.
  • Exam trap: CAP does not say a system can normally have only two properties in all situations; the hard choice appears when partition happens.

Fault Tolerance Mechanisms

Fault tolerance combines detection, isolation, recovery and redundancy.

  • Heartbeat and health checks detect failed nodes.
  • Timeouts detect unavailable remote calls, but choosing timeout value is difficult.
  • Retries handle transient failures; backoff prevents retry storms.
  • Circuit breaker stops repeatedly calling a failing service.
  • Failover redirects traffic to healthy node.
  • Checkpointing saves state so computation can resume after failure.
  • Consensus protocols coordinate agreement among nodes, but are more complex and costly.

Load Balancing and Scalability

Load balancing distributes work so no single node becomes a bottleneck.

  • Round robin sends requests cyclically to servers.
  • Least connections sends traffic to server with fewer active connections.
  • Weighted algorithms give stronger servers more traffic.
  • Consistent hashing reduces remapping when nodes join/leave, useful in distributed caches/storage.
  • Horizontal scaling adds more machines; vertical scaling adds resources to one machine.
  • Stateful applications need session management, shared storage or sticky sessions.

Engineering Mechanism

  • Client sends request to load balancer, service registry or known server endpoint.
  • Load balancer selects a healthy node using an algorithm such as round robin or least connections.
  • Service node may call other nodes using RPC/message passing.
  • Data may be read from a replica, primary database, cache or distributed storage.
  • If a node fails, health checks detect it and traffic is redirected.
  • Replication and recovery mechanisms restore redundancy.
  • Consistency rules decide whether reads can return stale data or must coordinate with current primary/quorum.

Diagrams / Models To Draw

  • Draw client -> load balancer -> multiple application nodes -> replicated database.
  • Draw primary-replica replication with write to primary and replication to followers.
  • Draw active-passive high availability cluster with heartbeat and failover.
  • Draw RPC flow: client stub -> network -> server stub -> procedure -> response.
  • Draw CAP partition: two groups of nodes separated by network failure.
  • Draw consistent hashing ring for distributed cache/storage.

Formulas, Fields and Algorithms

  • Availability approximation for independent redundant components: system availability improves when failover replicas are independent.
  • Quorum rule often expressed as R + W > N for strong quorum overlap, where N = replicas, R = read quorum, W = write quorum.
  • Horizontal scaling = adding more nodes; vertical scaling = increasing CPU/RAM/storage of one node.
  • Failover time roughly includes detection time + election/activation time + traffic redirection time.
  • Replication lag = time between primary write and replica visibility.
  • Consistent hashing reduces key movement from O(K) toward O(K/N) when one node changes, conceptually.
Concept Engineering purpose Exam distinction
Distributed system Many independent computers provide one coherent service Broader concept than a cluster
Cluster Managed group of nodes for availability/performance Usually tightly managed and similar-purpose nodes
Replication Copies data/service for performance and availability Creates consistency and conflict issues
Load balancing Distributes requests/jobs Does not by itself guarantee data consistency
Fault tolerance Continues service despite failure Needs redundancy plus detection/recovery
RPC Remote function-call abstraction Remote calls can fail and have high latency
Consensus Agreement among nodes Useful for leader election/replicated log but costly

Exam Point

  • For distributed system answers, always include benefits and challenges.
  • Differentiate distributed system, cluster, grid and cloud if asked.
  • For replication, mention consistency tradeoff and replication lag.
  • For cluster, mention load balancer/scheduler, heartbeat, failover and shared-nothing/shared-storage idea.
  • For CAP, explain the partition scenario carefully instead of writing only C, A and P full forms.
  • For RPC, mention stubs, marshalling, timeout and retry/idempotence.

Worked Example

Consider an online exam result website. During high traffic, a load balancer sends users to multiple web servers. Static content may be cached, application nodes read from database replicas, and writes go to a primary database. If one web server fails, health checks remove it. If the primary database fails, a replica may be promoted. The system gains availability and performance, but engineers must handle stale replica reads, failover time and consistency during network partition.

Subjective Answer Pattern

  • Define distributed system and cluster.
  • Explain why distribution is needed: scalability, availability, performance and resource sharing.
  • Discuss challenges: partial failure, latency, concurrency, clock synchronization and consistency.
  • Explain cluster architecture with nodes, load balancer/scheduler, heartbeat and failover.
  • Describe replication models and consistency models.
  • Explain fault tolerance mechanisms and CAP tradeoff.
  • Conclude with an example architecture and its tradeoffs.

Common Engineering Mistakes

  • Treating cluster and distributed system as exactly the same term.
  • Saying replication automatically solves backup and consistency.
  • Ignoring partial failure; in distributed systems, one component can fail while others continue.
  • Explaining CAP as “choose any two forever” without partition context.
  • Forgetting that RPC can timeout after the server already executed the operation.
  • Assuming load balancing alone makes an application highly available.
  • Ignoring state/session handling when scaling web servers.

MCQ Revision

  • What is partial failure?
  • Which transparency hides multiple copies from users?
  • What does a heartbeat detect in a cluster?
  • What is the difference between synchronous and asynchronous replication?
  • In quorum systems, what is the meaning of R + W > N?
  • Which CAP property refers to network split tolerance?
  • What does RPC marshalling do?
  • Which load balancing method sends traffic cyclically?

Final Summary

  • Distributed systems use multiple independent computers to provide one coherent service.
  • Clusters are managed groups of nodes commonly used for high availability, load balancing and parallel processing.
  • Replication improves availability and read performance but introduces consistency and lag issues.
  • Fault tolerance requires detection, redundancy, failover, retries and recovery.
  • CAP tradeoff matters during network partitions.
  • Engineering exam answers should describe architecture, mechanism and tradeoff together.