Browse papers
A

Section A: Long Answer Questions

Attempt any TWO questions.

3 questions·10 marks each
1long10 marks

Explain the architecture of a distributed file system. Discuss the design and working of the Network File System (NFS) or Andrew File System (AFS).

Distributed File System (DFS)

A distributed file system is a file system that allows files to be stored on multiple networked machines but accessed by clients as if they were on a single local file system. It provides location transparency, access transparency, and a uniform namespace.

Architecture (Client–Server model)

A DFS is typically built from three logical components:

  1. Client module – runs on the user machine, intercepts file system calls and forwards them to remote servers (often through a Virtual File System / VFS layer).
  2. File service – stores and manages the actual file data and provides operations (read, write, create, delete).
  3. Directory/name service – maps human-readable file names to file identifiers (location of the file).

Files may be replicated (for availability and performance) and cached at clients (to reduce network traffic). The design must balance consistency, performance, availability, and scalability.

Network File System (NFS) – Sun Microsystems

NFS is a stateless, RPC-based DFS.

  • Architecture: Built on top of the Virtual File System (VFS) layer. The VFS routes requests for local files to the local FS and requests for remote files to the NFS client, which uses RPC + XDR (External Data Representation) to talk to the NFS server.
  • Mount protocol: A remote directory is mounted into the client's local namespace; the server returns a file handle identifying the exported directory.
  • Stateless server: The server keeps no per-client open-file state. Every request (e.g. read(handle, offset, count)) is self-contained and idempotent, so after a crash the server simply restarts and clients retransmit — giving simple crash recovery.
  • Caching: Clients cache file blocks and attributes; consistency is weak (validated periodically), which can cause clients to see slightly stale data.

Andrew File System (AFS) – CMU

AFS is designed for large scale and is stateful.

  • Whole-file caching: When a client opens a file, the entire file is copied to the client's local disk and all reads/writes happen locally; the file is written back on close. This minimizes server load and gives excellent scalability.
  • Callbacks: The server keeps state (a callback promise). If another client modifies the file, the server sends a callback break to invalidate the cached copy, providing stronger consistency than NFS.
  • Session semantics: Changes become visible to others only after the file is closed.

NFS vs AFS

FeatureNFSAFS
Server stateStatelessStateful (callbacks)
Caching unitBlocks (in memory)Whole file (on local disk)
ConsistencyWeak (poll-based)Stronger (callback-based)
ScalabilityModerateHigh

Conclusion: NFS favors simplicity and easy recovery via statelessness, while AFS favors scalability and consistency via whole-file caching and callbacks.

distributed-fs
2long10 marks

What is fault tolerance? Explain failure models, the use of redundancy, and the concept of agreement in faulty systems (Byzantine generals problem).

Fault Tolerance

Fault tolerance is the ability of a distributed system to continue providing correct service even when one or more of its components fail. A fault is the underlying defect, an error is its manifestation, and a failure occurs when the system deviates from its specification. The key requirements are availability, reliability, safety, and maintainability.

Failure Models

Failures are classified by how a faulty component behaves:

Failure typeDescription
Crash (fail-stop)A process halts and stays halted; others can detect it.
OmissionA process/channel fails to send (send-omission) or receive (receive-omission) messages.
TimingResponse is correct but arrives too late/early (in synchronous systems).
ResponseServer returns an incorrect value or wrong state transition.
Arbitrary / ByzantineA component behaves arbitrarily — may send conflicting or malicious messages. This is the hardest to tolerate.

Use of Redundancy

Fault tolerance is achieved primarily through redundancy (masking failures by replicating):

  • Information redundancy: extra bits, e.g. error-correcting codes (Hamming, CRC).
  • Time redundancy: repeat an action (retransmit a request, retry a transaction).
  • Physical/Hardware redundancy: replicate components, e.g. multiple processors/disks (RAID), Triple Modular Redundancy (TMR with voters).
  • Software/Process redundancy: replicate processes into groups and use voting so that the failure of a minority is masked.

Agreement in Faulty Systems — Byzantine Generals Problem

Non-faulty processes must reach agreement on a value even when some processes are faulty/treacherous.

  • Statement: Several generals (processes) must agree on a common plan (attack/retreat) by exchanging messages, but some generals are traitors who send contradictory messages. Loyal generals must (i) all decide the same plan, and (ii) follow a loyal commander's order.
  • Lamport–Shostak–Pease result: With arbitrary (Byzantine) failures and oral (unsigned) messages, agreement is possible only if n3m+1n \ge 3m + 1, where nn is the total number of processes and mm is the number of faulty ones. Equivalently, more than two-thirds of the processes must be correct. It requires m+1m+1 rounds of message exchange.
  • With signed (authenticated) messages, agreement is possible for any n>mn > m.

Example: With m=1m = 1 traitor, at least n=4n = 4 generals are needed; with 3 generals one of whom is a traitor, the two loyal ones cannot agree.

Conclusion: Fault tolerance combines failure detection, redundancy for masking, and agreement protocols (like Byzantine agreement) to keep the system correct despite faults.

fault-tolerance
3long10 marks

Explain clock synchronization in distributed systems. Discuss Lamport's logical clocks and vector clocks with examples, and describe how the happened-before relation orders events.

Clock Synchronization in Distributed Systems

In a distributed system every machine has its own physical clock, and these clocks drift apart over time. Since there is no global clock, processes need a way to order events and agree on time. Physical clock synchronization keeps real-time clocks close; logical clocks only capture the ordering of events, which is often all that is needed.

The Happened-Before Relation (\rightarrow)

Lamport defined the happened-before relation to capture causality:

  1. If aa and bb are events in the same process and aa occurs before bb, then aba \rightarrow b.
  2. If aa is the sending of a message and bb is its receipt, then aba \rightarrow b.
  3. Transitivity: if aba \rightarrow b and bcb \rightarrow c, then aca \rightarrow c.

If neither aba \rightarrow b nor bab \rightarrow a, the events are concurrent (aba \parallel b).

Lamport's Logical Clocks

Each process PiP_i keeps a counter CiC_i. Rules:

  • IR1: before each event, Ci=Ci+1C_i = C_i + 1.
  • IR2: a message carries its send timestamp tt. On receipt, Cj=max(Cj,t)+1C_j = \max(C_j, t) + 1.

Property: if aba \rightarrow b then C(a)<C(b)C(a) < C(b). (The converse need not hold.)

Example: P1P_1 sends a message at local time 6; it arrives at P2P_2 whose clock reads 4. By IR2 the receive event gets max(4,6)+1=7\max(4,6)+1 = 7, so the send (6) < receive (7), preserving causal order.

Vector Clocks

Lamport clocks cannot tell whether C(a)<C(b)C(a) < C(b) implies causality. Vector clocks fix this. Each process keeps a vector Vi[1..n]V_i[1..n]:

  • VR1: before an internal/send event, Vi[i]=Vi[i]+1V_i[i] = V_i[i] + 1.
  • VR2: a message carries ViV_i; on receipt, Vj[k]=max(Vj[k],Vmsg[k])V_j[k] = \max(V_j[k], V_{msg}[k]) for all kk, then Vj[j]=Vj[j]+1V_j[j] = V_j[j] + 1.

Comparison: V(a)<V(b)V(a) < V(b) iff every component V(a)[k]V(b)[k]V(a)[k] \le V(b)[k] and at least one is strictly smaller. Then:

ab    V(a)<V(b)a \rightarrow b \iff V(a) < V(b)

If neither V(a)<V(b)V(a) < V(b) nor V(b)<V(a)V(b) < V(a), the events are concurrent.

Example: Start P1=(0,0,0)P_1=(0,0,0), P2=(0,0,0)P_2=(0,0,0). P1P_1 does an event (1,0,0)\Rightarrow (1,0,0) and sends to P2P_2. P2P_2 (at (0,0,0)(0,0,0)) receives max((0,0,0),(1,0,0))=(1,0,0)\Rightarrow \max((0,0,0),(1,0,0))=(1,0,0) then bump (1,1,0)\Rightarrow (1,1,0). Since (1,0,0)<(1,1,0)(1,0,0) < (1,1,0), the send causally precedes the receive.

Conclusion: The happened-before relation defines partial event ordering; Lamport clocks give a consistent total order but lose information, while vector clocks precisely capture causality and detect concurrency.

clock-synchronizationlogical-clock
B

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions·5 marks each
4short5 marks

What is a distributed system? Explain the goals and characteristics of a distributed system.

Distributed System

A distributed system is a collection of independent (autonomous) computers connected by a network that appears to its users as a single coherent system. The components cooperate by passing messages and share no common physical clock or memory.

Goals

  1. Resource sharing – share hardware, software and data (printers, files, databases) across the network.
  2. Transparency – hide the distribution so the system looks like one machine (access, location, replication, concurrency, failure transparency).
  3. Openness – use standard, well-defined interfaces/protocols so components from different vendors can interoperate and be extended.
  4. Scalability – grow in size, geography, or administration without major performance loss or redesign.

Characteristics

  • Concurrency: many processes execute simultaneously and may share resources, requiring coordination.
  • No global clock: coordination relies on message passing; only logical ordering of events is possible.
  • Independent failures: components can fail independently and partially while the rest of the system keeps running, so the system must tolerate faults.
  • Heterogeneity: different hardware, OS, and networks are handled, usually via middleware.

In short: a distributed system provides resource sharing with transparency, openness, scalability and fault tolerance over autonomous, concurrently-executing, clock-less machines.

characteristics
5short5 marks

Explain Lamport's logical clock with an example.

Lamport's Logical Clock

Because distributed processes have no shared physical clock, Lamport's logical clock assigns a monotonically increasing counter to events so that causally related events are correctly ordered. It implements the happened-before (\rightarrow) relation such that:

abC(a)<C(b)a \rightarrow b \Rightarrow C(a) < C(b)

Rules

Each process PiP_i keeps a counter CiC_i:

  • IR1 (internal/send): before timestamping any event, increment: Ci=Ci+1C_i = C_i + 1.
  • IR2 (receive): a message carries its send timestamp tt; the receiver sets Cj=max(Cj,t)+1C_j = \max(C_j, t) + 1.

Example

Two processes with different tick rates:

  • P1P_1 ticks: 1, 2, 3, 4, 5, 6 …
  • P1P_1 sends message m at its time 6.
  • P2P_2's clock currently reads 4 when m arrives.
  • By IR2: receive timestamp =max(4,6)+1=7= \max(4, 6) + 1 = 7.

Thus the send (6) < receive (7), so the causal order send happened-before receive is preserved. Without the rule, the receive (4) would wrongly appear earlier than the send (6).

Total ordering: ties (equal timestamps in different processes) are broken using process IDs, giving a consistent global total order used in algorithms like Lamport's mutual exclusion.

Limitation: C(a)<C(b)C(a) < C(b) does not imply aba \rightarrow b; logical clocks cannot detect concurrency (vector clocks are needed for that).

logical-clock
6short5 marks

What is a Remote Procedure Call (RPC)? Explain its working with a diagram.

Remote Procedure Call (RPC)

RPC is a communication mechanism that lets a program call a procedure located on a remote machine as if it were a local procedure call, hiding the underlying message passing from the programmer. It provides access transparency and was introduced by Birrell and Nelson.

Working (steps)

Client                                Server
  | 1. call proc(args)                  |
  v                                     |
[Client stub] --2. marshal args-->      |
  | (pack into message)                 |
[Client OS] --3. send msg over network--> [Server OS]
                                          |
                                  4. msg passed to [Server stub]
                                          | 5. unmarshal args
                                          v
                                  6. call actual procedure
                                          |
                                  7. result -> marshal
[Client OS] <--8. reply msg-------------- [Server OS]
  |
[Client stub] 9. unmarshal result
  v
10. return value to client
  1. The client calls the client stub (a local proxy procedure).
  2. The client stub marshals (packs) the parameters into a message.
  3. The client OS sends the message to the remote server.
  4. The server OS hands the message to the server stub.
  5. The server stub unmarshals the parameters.
  6. The server stub calls the actual procedure on the server. 7–10. The result is marshaled, sent back, unmarshaled by the client stub, and returned to the caller.

Key points

  • Stubs hide marshaling/network code; IDL (Interface Definition Language) generates them.
  • Parameter passing: call-by-value works directly; call-by-reference is hard (no shared address space) and is handled by copy/restore.
  • Failure semantics: because the network can lose messages, RPC offers semantics such as at-least-once or at-most-once delivery.

In short: RPC makes distributed communication look like ordinary procedure calls through client/server stubs and marshaling.

rpc
7short5 marks

Differentiate between centralized and distributed mutual exclusion algorithms.

Centralized vs Distributed Mutual Exclusion

Mutual exclusion ensures that only one process at a time enters a critical section (CS) accessing a shared resource.

Centralized Algorithm

A single coordinator grants permission. To enter the CS a process sends a request to the coordinator; the coordinator replies with a grant if the resource is free, otherwise queues the request. On exit the process sends a release and the coordinator grants the next queued request.

Distributed Algorithm (e.g. Ricart–Agrawala)

No central node: a process wanting the CS multicasts a timestamped request to all other processes and enters only after receiving OK/reply from all. Requests are ordered by Lamport timestamps to break ties.

Differences

AspectCentralizedDistributed
ControlOne coordinator decidesAll processes participate
Messages per CS entry3 (request, grant, release)2(n−1) (request + reply to/from all)
Single point of failureYes (coordinator)No single point (but more points can fail)
BottleneckCoordinator can be a bottleneckLoad is distributed
ImplementationSimple, easyMore complex, needs ordering
Fairness/StarvationFair, no starvation (FIFO queue)Fair via timestamps

Conclusion: The centralized scheme is simple and message-efficient but has a single point of failure and a bottleneck; the distributed scheme removes the single coordinator at the cost of more messages and complexity.

mutual-exclusion
8short5 marks

Explain distributed deadlock detection. What is a wait-for graph?

Distributed Deadlock Detection

A deadlock occurs when a set of processes are each waiting for a resource held by another in the set, so none can proceed. In a distributed system the resources, processes and the wait-for information are spread across many sites, making detection harder.

Wait-For Graph (WFG)

A wait-for graph is a directed graph in which:

  • each node represents a process, and
  • a directed edge PiPjP_i \rightarrow P_j means PiP_i is waiting for a resource currently held by PjP_j.

A cycle in the WFG indicates a deadlock. In a distributed system the full graph is the union of local WFGs held at different sites (a global WFG).

Approaches to Distributed Deadlock Detection

  1. Centralized: a coordinator builds a global WFG from the local WFGs and searches for cycles. Simple but a single point of failure and can report phantom (false) deadlocks due to message delays.
  2. Distributed: every site participates in detection (e.g. Chandy–Misra–Haas edge-chasing): when a process blocks, it sends a probe message along its outgoing wait-for edges; if the probe returns to its initiator, a cycle (deadlock) exists.
  3. Hierarchical: sites are organized in a tree; deadlock detection is done at the lowest common ancestor.

Recovery

Once detected, a deadlock is broken by victim selection — aborting/rolling back one process to release its resources.

In short: distributed deadlock detection finds cycles in the (distributed) wait-for graph, commonly via edge-chasing probe messages, then resolves them by aborting a victim.

deadlock
9short5 marks

Explain the Bully algorithm for electing a coordinator.

Bully Algorithm (Coordinator Election)

The Bully algorithm (Garcia-Molina) elects the process with the highest process ID as the coordinator when the current coordinator fails. It assumes the system is synchronous, each process knows the IDs of all others, and messages are reliable.

Message types

  • ELECTION – announces an election to higher-ID processes.
  • OK (ANSWER) – a reply that a higher process is alive and takes over.
  • COORDINATOR – announces the winner to all processes.

Algorithm

When a process PP notices the coordinator is not responding:

  1. PP sends an ELECTION message to all processes with a higher ID.
  2. If no one with a higher ID responds (within a timeout), PP wins: it becomes the coordinator and sends a COORDINATOR message to all lower-ID processes.
  3. If a higher-ID process replies with OK, PP drops out; that higher process now holds the election (it repeats step 1 among still-higher processes).
  4. Eventually the highest-ID live process wins and broadcasts COORDINATOR.

A recovered process (or one with the highest ID) can immediately bully the others by starting an election and taking over — hence the name.

Example

Processes 1–7, coordinator 7 crashes. Process 4 detects it and sends ELECTION to 5, 6, 7. Processes 5 and 6 reply OK; 4 stops. 5 and 6 each hold elections; 6 replies OK to 5, so 5 stops. 6 sends ELECTION to 7 (no reply), so 6 wins and broadcasts COORDINATOR to all.

Cost: worst case O(n2)O(n^2) messages.

election
10short5 marks

What are the different kinds of transparency in a distributed system?

Types of Transparency in a Distributed System

Transparency means hiding the fact that the system's resources and processes are distributed across multiple machines, so users and applications perceive a single coherent system. The ISO/ANSA reference model defines the following kinds (ISO RM-ODP):

TransparencyWhat it hides
AccessDifferences in data representation and how a resource is accessed (local vs remote access look the same).
LocationWhere a resource is physically located (name does not reveal location).
MigrationThat a resource may move to another location; names stay valid.
RelocationThat a resource may move while in use.
ReplicationThat a resource is replicated; the user sees one logical copy.
ConcurrencyThat a resource is shared by several competing users simultaneously.
FailureThe failure and recovery of a resource, so the system appears to keep working.
PersistenceWhether a resource is in memory or on disk.

In short: these transparencies (access, location, migration, relocation, replication, concurrency, failure, persistence) let the distributed system appear as one unified machine to its users.

transparency
11short5 marks

What is middleware? Explain its role in a distributed system.

Middleware

Middleware is a software layer that sits between the operating system/network and the distributed applications. It hides the heterogeneity of the underlying networks, hardware and operating systems and provides a uniform programming model and common services, so developers can build distributed applications more easily.

+-------------------------------------------------+
|              Distributed Application             |
+-------------------------------------------------+
|        MIDDLEWARE (common services, API)         |  <-- provides transparency
+-------------------------------------------------+
|   Local OS   |   Local OS   |   Local OS  | ...   |
+-------------------------------------------------+
|                   Network                        |
+-------------------------------------------------+

Role in a Distributed System

  1. Provides transparency – masks access, location, replication, and failure differences so remote resources look local.
  2. Hides heterogeneity – lets different OS, hardware, and languages interoperate via standard interfaces.
  3. Communication abstractions – offers RPC, remote method invocation (RMI), and message-oriented middleware (MOM) instead of raw sockets.
  4. Common services – naming, persistence, security/authentication, transactions, concurrency control, and replication.
  5. Eases development – higher-level API so programmers focus on application logic, not networking details.

Examples: CORBA, Java RMI, DCOM, gRPC, and message brokers such as RabbitMQ/JMS.

In short: middleware is the glue that makes a collection of heterogeneous networked machines behave as one coherent distributed system.

middleware
12short5 marks

Explain Cristian's algorithm for physical clock synchronization.

Cristian's Algorithm (Physical Clock Synchronization)

Cristian's algorithm synchronizes a client's clock with a time server (which holds accurate UTC time, e.g. from a radio/atomic source). It is suitable when round-trip times are small compared to the required accuracy.

Procedure

  1. The client sends a request to the time server at local time T0T_0.
  2. The server replies with its current time TT (UTC).
  3. The client records the reply-receipt time T1T_1. The round-trip time is RTT=T1T0RTT = T_1 - T_0.
  4. The message took roughly RTT/2RTT/2 to travel back, so the client sets its clock to:
Tclient=T+T1T02T_{client} = T + \frac{T_1 - T_0}{2}

If the server's interrupt-handling time II is known, a better estimate is:

Tclient=T+(T1T0)I2T_{client} = T + \frac{(T_1 - T_0) - I}{2}

Accuracy

The error is bounded by ±(RTT2Tmin)\pm\left(\dfrac{RTT}{2} - T_{min}\right), where TminT_{min} is the minimum one-way transmission time. Taking the average of several requests (discarding outliers with large RTT) improves accuracy.

Example

If T0=5:00:00.000T_0 = 5{:}00{:}00.000, T1=5:00:00.020T_1 = 5{:}00{:}00.020 (so RTT=20RTT = 20 ms) and the server time T=5:10:00.000T = 5{:}10{:}00.000, the client sets its clock to 5:10:00.000+10 ms=5:10:00.0105{:}10{:}00.000 + 10\text{ ms} = 5{:}10{:}00.010.

Limitations

  • Clocks must only ever move forward, so if the new time is behind the current time, the clock is slowed down gradually rather than set back.
  • Relies on a single time server (single point of failure); the Berkeley algorithm avoids needing an accurate external server.
clock-synchronization

Frequently asked questions

Where can I find the BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) question paper 2079?
The full BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) 2079 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Distributed System (BSc CSIT, CSC462) 2079 paper come with solutions?
Yes. Every question on this Distributed System (BSc CSIT, CSC462) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) 2079 paper?
The BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) 2079 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Distributed System (BSc CSIT, CSC462) past paper free?
Yes — reading and attempting this Distributed System (BSc CSIT, CSC462) past paper on Kekkei is completely free.