Browse papers
A

Section A: Long Answer Questions

Attempt any TWO questions.

3 questions·10 marks each
1long10 marks

What is fault tolerance? Explain failure models, the use of redundancy, and the concept of agreement in faulty systems (Byzantine generals problem).

Fault Tolerance

Fault tolerance is the ability of a distributed system to continue operating correctly (providing its specified service) even when one or more of its components fail. The aim is to mask failures so that they are not visible to users, achieving high availability, reliability, safety and maintainability.

Key terms:

  • Fault – the underlying defect (e.g. a broken link).
  • Error – the manifestation of a fault in the system state.
  • Failure – when the system deviates from its specification.

Failure Models

Different kinds of failure must be anticipated:

Failure typeDescription
Crash (fail-stop)A process halts and stays halted; it had been working correctly until then.
OmissionA process or channel fails to send (send-omission) or receive (receive-omission) messages.
TimingA response is correct but arrives outside the specified time interval (relevant to synchronous systems).
ResponseThe server responds incorrectly: a wrong value or a wrong state transition.
Arbitrary (Byzantine)The most severe: a process may behave arbitrarily, send conflicting/wrong messages, or act maliciously.

Use of Redundancy

Fault tolerance is achieved by masking faults using redundancy:

  • Information redundancy – extra bits to detect/correct errors (e.g. checksums, error-correcting codes).
  • Time redundancy – repeating an action (e.g. retransmitting a message, retrying a transaction).
  • Physical (hardware/process) redundancy – replicating components or processes so that survivors mask the failure of others (e.g. Triple Modular Redundancy with voting, replicated servers).

Agreement in Faulty Systems — Byzantine Generals Problem

Non-faulty processes often must reach agreement on a value (e.g. commit/abort, a coordinator's order). With arbitrary (Byzantine) failures this is hard because faulty processes may send different values to different peers.

The Byzantine Generals Problem models this: several generals (processes) must agree on a common plan (attack/retreat) by exchanging messages, while up to m of them are traitors sending conflicting information. The goal: all loyal generals decide the same plan, and if the commander is loyal every loyal general follows the commander's order.

Key result (Lamport, Shostak, Pease): with m faulty processes, agreement is possible only if the total number of processes satisfies

n3m+1.n \ge 3m + 1.

That is, more than two-thirds of the processes must be correct. The algorithm requires m+1m+1 rounds of message exchange (oral messages). With unforgeable signed messages (written messages), agreement is possible for any n > m.

Conclusion: Fault tolerance combines well-defined failure models with redundancy; tolerating arbitrary faults requires costly agreement protocols bounded by the n3m+1n \ge 3m+1 condition.

fault-tolerance
2long10 marks

Explain clock synchronization in distributed systems. Discuss Lamport's logical clocks and vector clocks with examples, and describe how the happened-before relation orders events.

Clock Synchronization in Distributed Systems

In a distributed system each node has its own physical clock that drifts at a slightly different rate, so the clocks disagree over time (clock skew). Many tasks (ordering events, timestamps, consistency, make) need clocks that agree, so clock synchronization keeps clocks close to one another (internal) or to real time / UTC (external). Because message delays are unpredictable, perfect synchronization is impossible; instead we either synchronize physical clocks approximately (Cristian's algorithm, Berkeley, NTP) or use logical clocks to capture only the ordering of events.

Happened-Before Relation (→)

Lamport defined the happened-before relation, a partial order capturing causality:

  1. If a and b are events in the same process and a occurs before b, then aba \rightarrow b.
  2. If a is the sending of a message and b is its receipt, then aba \rightarrow b.
  3. Transitivity: if aba \rightarrow b and bcb \rightarrow c then aca \rightarrow c.

If neither aba \rightarrow b nor bab \rightarrow a, the events are concurrent (aba \parallel b).

Lamport's Logical Clocks

Each process PiP_i keeps a counter CiC_i. Rules:

  • Before each event, Ci=Ci+1C_i = C_i + 1.
  • A sender attaches its timestamp tt to every message.
  • On receipt, the receiver sets Cj=max(Cj,t)+1C_j = \max(C_j, t) + 1.

Property: if aba \rightarrow b then C(a)<C(b)C(a) < C(b). (The converse need not hold.)

Example: Process P1 sends message m to P2.

  • In P1, event send(m) gets timestamp, say, 4.
  • P2's clock is currently 1; on receiving m it sets max(1,4)+1=5\max(1,4)+1 = 5.

Thus send (4) < receive (5), preserving causality. Ties (equal timestamps in different processes) are broken using process IDs to give a total order.

Vector Clocks

Lamport clocks cannot tell whether two events are causally related or concurrent. Vector clocks fix this. Each process PiP_i keeps a vector ViV_i of size n:

  • Before an event: Vi[i]=Vi[i]+1V_i[i] = V_i[i] + 1.
  • Send: attach the whole vector ViV_i to the message.
  • Receive vector VmV_m: for all k, Vj[k]=max(Vj[k],Vm[k])V_j[k] = \max(V_j[k], V_m[k]), then increment Vj[j]V_j[j].

Comparison: ab    V(a)<V(b)a \rightarrow b \iff V(a) < V(b) (every component \le and at least one strictly <<). If neither vector dominates, the events are concurrent.

Example: Start V1=(0,0,0)V_1=(0,0,0), V2=(0,0,0)V_2=(0,0,0). P1 does an event → (1,0,0)(1,0,0) and sends. P2 (currently (0,0,0)(0,0,0)) receives, takes max and increments its own component → (1,1,0)(1,1,0). Here (1,0,0)<(1,1,0)(1,0,0) < (1,1,0), correctly showing the send happened-before the receive.

Conclusion

Logical and vector clocks order events without synchronized physical clocks; Lamport clocks give a consistent total order, while vector clocks additionally detect concurrency, both grounded in the happened-before relation.

clock-synchronizationlogical-clock
3long10 marks

What is distributed mutual exclusion? Explain the centralized, token-ring and Ricart-Agrawala (distributed) algorithms, comparing them in terms of message complexity and fault tolerance.

Distributed Mutual Exclusion

Mutual exclusion ensures that at most one process at a time enters a critical section (CS) accessing a shared resource. In a distributed system there is no shared memory or common clock, so coordination is done purely by message passing. A correct algorithm must guarantee safety (at most one process in CS), liveness (no deadlock/starvation — requests eventually granted) and ideally fairness/ordering (requests served in happened-before order).

1. Centralized Algorithm

One process is elected coordinator. To enter the CS a process sends a REQUEST to the coordinator; the coordinator replies GRANT if the CS is free, otherwise queues the request (or sends nothing/denied). On exit the process sends RELEASE, and the coordinator grants the CS to the next queued request.

  • Messages per CS entry: 3 (request, grant, release).
  • Pros: simple, fair (FIFO queue), few messages.
  • Cons: single point of failure (coordinator crash blocks everyone); coordinator is a performance bottleneck; cannot distinguish a dead coordinator from a busy CS.

2. Token-Ring Algorithm

Processes are arranged in a logical ring. A single token circulates around the ring. A process may enter the CS only when it holds the token; it keeps the token while in the CS and passes it to the next neighbour afterwards. If it does not need the CS, it forwards the token immediately.

  • Messages per CS entry: between 1 and ∞ (the token may circulate many times when no one wants the CS).
  • Pros: no starvation; safety guaranteed by single token.
  • Cons: lost token must be detected and regenerated; a crashed process breaks the ring and must be bypassed; latency to acquire CS can be high.

3. Ricart–Agrawala (Distributed) Algorithm

A fully distributed algorithm using Lamport timestamps. To enter the CS, a process sends a timestamped REQUEST to all n1n-1 other processes and waits for REPLY (OK) from every one. A receiver replies OK immediately if it is not interested or not in the CS; if it is in the CS it queues the request; if it also wants the CS, it compares timestamps and replies OK only if the requester's timestamp is smaller (ties broken by process ID), otherwise queues. On exit, the process sends OK to all queued requests.

  • Messages per CS entry: 2(n1)2(n-1) — one request and one reply per other process.
  • Pros: no single point of failure (no central coordinator); ensures ordering by timestamp; no starvation.
  • Cons: every process is a point of failure (one crash blocks all, since its reply is awaited — improved by Maekawa's quorum method); high message overhead; each process must know all others.

Comparison

AlgorithmMessages per entrySingle point of failureFault tolerance
Centralized3Yes (coordinator)Poor
Token-Ring1 → ∞Lost token / broken ringModerate (needs token regeneration)
Ricart–Agrawala2(n1)2(n-1)Any process crash blocks allPoor unless extended

Conclusion: The centralized algorithm is cheapest in messages but least robust; the token ring avoids starvation but suffers from lost tokens; Ricart–Agrawala is truly distributed and ordering-correct but the most message-expensive and sensitive to crashes.

mutual-exclusion
B

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions·5 marks each
4short5 marks

What is a distributed system? Explain the goals and characteristics of a distributed system.

Distributed System

A distributed system is a collection of independent (autonomous) computers, connected by a network, that appears to its users as a single coherent system. The computers coordinate their actions only by passing messages, having no shared memory or global clock.

Goals

  1. Resource sharing – let users share hardware, software, data and services.
  2. Transparency – hide the distribution (location, access, replication, failure, etc.) so the system looks like one machine.
  3. Openness – use standard interfaces/protocols so components can interoperate and be extended.
  4. Scalability – continue to perform well as the number of users, resources or geographic span grows.
  5. Reliability and fault tolerance – keep working despite partial failures.

Characteristics

  • Concurrency of components – many processes run and access shared resources simultaneously.
  • No global clock – coordination relies on message passing and logical ordering.
  • Independent failures – components can fail independently, and others may not immediately know.
  • Heterogeneity – varied hardware, OS, networks and languages, masked by middleware.
  • Geographical distribution of nodes communicating over a network.

In short: a distributed system shares resources across autonomous, concurrently-executing machines while presenting a single, transparent, scalable and fault-tolerant system to its users.

characteristics
5short5 marks

Explain Lamport's logical clock with an example.

Lamport's Logical Clock

Physical clocks in a distributed system cannot be perfectly synchronized, yet processes still need to order events consistently. Lamport's logical clock provides a software counter that respects causality without using real time.

Happened-before relation (→):

  1. Same process, a before baba \rightarrow b.
  2. Send of a message a and its receipt baba \rightarrow b.
  3. Transitive: ab,bcaca \rightarrow b, b \rightarrow c \Rightarrow a \rightarrow c.

Clock rules: Each process PiP_i keeps a counter CiC_i.

  • Before every event: Ci=Ci+1C_i = C_i + 1.
  • Each message carries the sender's timestamp tt.
  • On receipt: Cj=max(Cj,t)+1C_j = \max(C_j, t) + 1.

Property: abC(a)<C(b)a \rightarrow b \Rightarrow C(a) < C(b).

Example: Suppose P1's clock reaches 6 when it sends message m (timestamp t=6t=6). P2's clock is currently 2. On receiving m, P2 sets

C2=max(2,6)+1=7.C_2 = \max(2, 6) + 1 = 7.

So send = 6 < receive = 7, correctly preserving the causal order. Equal timestamps in different processes are ordered by process ID to obtain a total order.

logical-clock
6short5 marks

What is a Remote Procedure Call (RPC)? Explain its working with a diagram.

Remote Procedure Call (RPC)

A Remote Procedure Call lets a program call a procedure that executes in a different address space (typically on another machine) as if it were a normal local procedure call. RPC (Birrell & Nelson) hides the underlying message passing, providing access transparency so the programmer need not write explicit send/receive code.

Working

RPC uses client and server stubs plus the RPC runtime:

  1. The client calls the client stub like an ordinary local procedure.
  2. The client stub marshals (packs and serializes) the parameters into a message.
  3. The client's OS/RPC runtime sends the message over the network to the server.
  4. The server OS passes the message to the server stub, which unmarshals the parameters.
  5. The server stub calls the actual server procedure, which executes and returns a result.
  6. The server stub marshals the result and sends it back.
  7. The client stub unmarshals the result and returns it to the caller, which resumes as from a local call.

Diagram (described)

 Client machine                         Server machine
 +-----------+   call   +-----------+   +-----------+   call   +-----------+
 |  Client   |--------->|  Client   |   |  Server   |--------->|  Server   |
 |  program  |<---------|   stub    |   |   stub    |<---------| procedure |
 +-----------+  return  +-----------+   +-----------+  return  +-----------+
                            |  ^             ^  |
                   marshal  |  | unmarshal   |  | marshal
                            v  |             |  v
                       +-----------+     +-----------+
                       | OS / net  |====>| OS / net  |  (message over network)
                       +-----------+     +-----------+

Note: RPC must handle issues local calls do not — parameter passing (no shared memory, so pass-by-value/marshalling), heterogeneous data representation (e.g. XDR), and failure semantics (at-least-once, at-most-once, exactly-once delivery guarantees).

rpc
7short5 marks

Differentiate between centralized and distributed mutual exclusion algorithms.

Centralized vs Distributed Mutual Exclusion

AspectCentralized algorithmDistributed algorithm (e.g. Ricart–Agrawala)
CoordinationA single elected coordinator grants/denies the critical section.All processes participate equally; permission is obtained from all peers.
Messages per CS entry3 (request, grant, release).2(n1)2(n-1) (request to all, reply from all).
Decision basisCoordinator's FIFO queue.Lamport timestamps; lowest timestamp wins (ties by process ID).
Single point of failureYes — coordinator crash blocks the whole system.No central point, but any process's crash blocks all (it must reply).
BottleneckCoordinator can be a performance bottleneck.Load spread over all processes; more total traffic.
ComplexitySimple to implement.More complex; every process must know all others.
FairnessCoordinator can ensure FIFO fairness.Ordering guaranteed by timestamps; no starvation.

Summary: The centralized scheme is simple and message-efficient but depends on one coordinator (single point of failure and bottleneck). The distributed scheme removes the central coordinator and orders requests by timestamps, at the cost of far more messages and sensitivity to any node failing.

mutual-exclusion
8short5 marks

Explain distributed deadlock detection. What is a wait-for graph?

Distributed Deadlock Detection

A deadlock occurs when a set of processes are each blocked waiting for a resource held by another process in the set, so none can proceed. In a distributed system, resources and processes are spread over many sites and no single site has the global state, so detecting deadlock is harder; the system must combine local information across sites.

Approach: Since prevention/avoidance are costly, distributed systems often use deadlock detection: allow deadlocks to form, then detect cycles and resolve them (by aborting/rolling back a victim process and releasing its resources). Common organizations:

  • Centralized detection – a coordinator builds a global wait-for graph from local graphs.
  • Distributed detection – sites cooperate (e.g. edge-chasing / Chandy–Misra–Haas algorithm, which circulates probe messages along wait-for edges; if a probe returns to its initiator, a cycle—deadlock—exists).
  • Hierarchical detection – sites organized in a tree, detecting deadlocks at the lowest common ancestor.

Detection must avoid phantom (false) deadlocks caused by out-of-date or inconsistent global snapshots.

Wait-For Graph (WFG)

A wait-for graph is a directed graph used to represent blocking relationships:

  • Nodes = processes (transactions).
  • An edge PiPjP_i \rightarrow P_j means PiP_i is waiting for a resource currently held by PjP_j.

A cycle in the wait-for graph indicates a deadlock among the processes on the cycle. In a distributed system each site maintains a local WFG; a deadlock may span sites, so a global WFG (the union of local graphs plus inter-site edges) must be examined for cycles.

Example: P1P2P3P1P_1 \rightarrow P_2 \rightarrow P_3 \rightarrow P_1 forms a cycle, so P1,P2,P3P_1, P_2, P_3 are deadlocked; one is chosen as a victim and aborted to break the cycle.

deadlock
9short5 marks

Explain the Bully algorithm for electing a coordinator.

Bully Algorithm (Election)

An election algorithm chooses one process to act as coordinator when the current one fails. The Bully algorithm (Garcia-Molina) assumes each process has a unique numeric ID and that processes can communicate reliably; the process with the highest ID should become the coordinator.

Steps: When a process P notices the coordinator is not responding, it starts an election:

  1. P sends an ELECTION message to every process with a higher ID.
    • If no higher process replies (within timeout), P wins, becomes the coordinator, and sends a COORDINATOR message to all processes announcing itself.
    • If a higher process replies with an OK (alive) message, P drops out; that higher process now takes over the election.
  2. Any process that receives an ELECTION message replies OK and then starts its own election (repeating step 1) among still-higher processes.
  3. Eventually the highest-ID live process finds no one above it answers, declares victory and broadcasts COORDINATOR.
  4. A recovered process with a higher ID than the current coordinator will start a new election and "bully" the others to reclaim the role — hence the name.

Messages: the COORDINATOR and ELECTION/OK exchange; worst case O(n2)O(n^2) messages.

Example: With processes 1–7 and coordinator 7 crashing, suppose 4 detects it. Process 4 sends ELECTION to 5, 6, 7. Processes 5 and 6 reply OK and start their own elections; 6 finds 7 dead, wins, and broadcasts COORDINATOR to all. Process 6 becomes the new coordinator.

election
10short5 marks

What are the different kinds of transparency in a distributed system?

Kinds of Transparency in a Distributed System

Transparency is the property of hiding the distributed nature of the system (separation of components, replication, failures, etc.) so that users and applications perceive a single coherent system. ISO/ANSA defines several kinds:

TransparencyWhat it hides
AccessDifferences in data representation and how a resource is accessed (local vs remote access looks the same).
LocationWhere a resource is physically located (a name carries no location information).
MigrationThat a resource may move to another location; access remains unaffected.
RelocationThat a resource may move while in use.
ReplicationThat several copies of a resource exist; the user sees a single resource.
ConcurrencyThat a resource is shared by several competing users simultaneously.
FailureThe failure and recovery of a resource, so the system appears to work normally.
PersistenceWhether a (software) resource is in memory or on disk.

Summary: These transparencies (access, location, migration, relocation, replication, concurrency, failure, persistence) together let a distributed system present itself to users as one unified machine. Note that full transparency is not always desirable — e.g. hiding wide-area latency can mislead users.

transparency
11short5 marks

What is middleware? Explain its role in a distributed system.

Middleware

Middleware is a software layer that sits between the operating system / network and the distributed applications running on each machine. Its purpose is to mask the heterogeneity of the underlying hardware, operating systems, networks and programming languages, and to provide a common, higher-level programming abstraction so that the system appears as a single coherent system to applications and developers.

Role in a Distributed System

  1. Hides heterogeneity – masks differences in machines, OSes, networks and data representations.
  2. Provides high-level communication abstractions – such as RPC, Remote Method Invocation (RMI), message-oriented middleware (message queues), and publish/subscribe, instead of raw sockets.
  3. Supports transparency – access, location, replication, failure and concurrency transparency are largely implemented in middleware.
  4. Offers common services – naming/directory services, security/authentication, transactions, persistence, concurrency control and event notification.
  5. Eases development and openness – standard interfaces (e.g. CORBA, Java RMI, web services / gRPC, DCOM) let independently developed components interoperate.

In short: Middleware is the "glue" that turns a collection of heterogeneous networked computers into a usable distributed system by providing communication, transparency and common services above the OS layer.

middleware
12short5 marks

Explain Cristian's algorithm for physical clock synchronization.

Cristian's Algorithm (Physical Clock Synchronization)

Cristian's algorithm (1989) synchronizes a client's physical clock to an external time server (e.g. a UTC-based time source) — a form of external synchronization suited to systems where round-trip times are small relative to the required accuracy.

Procedure:

  1. The client records its local time T0T_0 and sends a request to the time server.
  2. The server replies with its current time TserverT_{server}.
  3. The client records the time T1T_1 when the reply arrives.

The round-trip time is RTT=T1T0RTT = T_1 - T_0. Assuming the request and reply take roughly equal time, the message took about RTT/2RTT/2 to travel back, so when the reply is received the true time is approximately

Tclient=Tserver+T1T02.T_{client} = T_{server} + \frac{T_1 - T_0}{2}.

The client sets its clock to this value.

Handling server processing time: if the server reports its interrupt-handling/processing time II, the estimate improves to

Tclient=Tserver+(T1T0)I2.T_{client} = T_{server} + \frac{(T_1 - T_0) - I}{2}.

Example: T0=100T_0 = 100, T1=120T_1 = 120 (so RTT=20RTT = 20), server replies Tserver=500T_{server} = 500. The client sets its clock to 500+20/2=510500 + 20/2 = 510.

Points:

  • Clocks should be set forward smoothly (gradually advanced), and never set backward abruptly (to avoid breaking event ordering / timestamps); slow clocks are sped up and fast clocks slowed until corrected.
  • Limitation: accuracy depends on a symmetric, short round trip; variable network delay reduces precision, and the single time server is a point of failure (addressed by the Berkeley algorithm / NTP).
clock-synchronization

Frequently asked questions

Where can I find the BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) question paper 2077?
The full BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) 2077 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Distributed System (BSc CSIT, CSC462) 2077 paper come with solutions?
Yes. Every question on this Distributed System (BSc CSIT, CSC462) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) 2077 paper?
The BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) 2077 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Distributed System (BSc CSIT, CSC462) past paper free?
Yes — reading and attempting this Distributed System (BSc CSIT, CSC462) past paper on Kekkei is completely free.