Browse papers
A

Section A: Long Answer Questions

Attempt any TWO questions.

3 questions·10 marks each
1long10 marks

What is data replication in distributed systems? Explain different consistency models and replica management techniques used to maintain consistency.

Data Replication in Distributed Systems

Data replication is the technique of maintaining multiple copies (replicas) of the same data on different machines (nodes) in a distributed system. The main reasons for replication are:

  • Reliability/availability – if one replica fails, others can serve requests.
  • Performance – clients access a nearby replica, reducing latency, and load is shared among replicas (scalability).

The key challenge is consistency: when one replica is updated, all copies must eventually reflect the change. This creates a trade-off between consistency, availability and performance.

Consistency Models

A consistency model is a contract between the data store and processes about how reads and writes behave.

1. Data-centric models

  • Strict consistency – any read returns the result of the most recent write (requires global time; impossible in practice).
  • Sequential consistency – the result of any execution is the same as if all operations were executed in some sequential order that respects each process's program order.
  • Causal consistency – writes that are causally related must be seen in the same order by all processes; concurrent writes may be seen in different orders.
  • FIFO (PRAM) consistency – writes from a single process are seen by others in the order issued.
  • Weak / release / entry consistency – use synchronization variables (acquire/release) so consistency is enforced only at synchronization points.

2. Client-centric models (for eventual-consistency stores):

  • Eventual consistency – if no new updates occur, all replicas eventually converge.
  • Monotonic reads, monotonic writes, read-your-writes, writes-follow-reads – guarantees for a single client across replicas.

Replica Management Techniques

Replica placement

  • Permanent replicas (e.g., mirror servers).
  • Server-initiated replicas (created dynamically near demand, e.g., CDNs).
  • Client-initiated replicas (caches).

Update propagation

  • Push (server-based) vs Pull (client-based), or a lease-based hybrid.
  • Propagate the update operation vs propagate the modified data vs invalidate.

Consistency protocols

  • Primary-based protocols – a primary copy serializes all writes (remote-write or local-write).
  • Replicated-write protocolsactive replication (write sent to all replicas, often with totally ordered multicast) or quorum-based protocols where a read quorum NRN_R and write quorum NWN_W must satisfy NR+NW>NN_R + N_W > N and NW>N/2N_W > N/2 to prevent read–write and write–write conflicts.

Conclusion

Replication improves availability and performance but requires a chosen consistency model and an appropriate replica-management/consistency protocol to keep copies coherent.

replicationconsistency
2long10 marks

Explain the architecture of a distributed file system. Discuss the design and working of the Network File System (NFS) or Andrew File System (AFS).

Distributed File System (DFS) Architecture

A distributed file system allows files stored on remote servers to be accessed by clients over a network as if they were local, providing access, location and migration transparency.

General architecture components:

  • Client module – intercepts file system calls (often via a Virtual File System / VFS layer) and forwards remote operations to servers.
  • File service / file server – stores files and serves read/write/lookup requests.
  • Directory (name) service – maps human-readable path names to file identifiers.
  • Caching – client-side caches reduce network traffic and latency; require a cache-consistency mechanism.

Services are usually built on RPC, are typically stateless or stateful, and aim for transparency, scalability and fault tolerance.

Network File System (NFS)

NFS (by Sun) is the classic DFS model.

  • Architecture: Client and server communicate via RPC. The VFS layer on the client distinguishes local from remote files; an NFS client maps VFS operations to NFS RPC calls handled by the NFS server.
  • Mounting: A remote directory is mounted into the client's local namespace; after mounting it is accessed transparently.
  • Statelessness (classic NFSv2/v3): The server keeps no per-client open-file state. Every request (e.g., read, write) is self-contained and carries a file handle plus offset. This makes crash recovery simple — after a server reboot, clients just retry. Operations are made idempotent so retries are safe.
  • Caching: Clients cache data and attributes; consistency is approximate (validated using timestamps), so NFS provides only near one-copy semantics.
  • Access model: Remote access model (operations executed on the server).

Andrew File System (AFS) — alternative

  • Design goal: scalability for thousands of clients.
  • Whole-file caching: When a file is opened, the entire file is copied to the client's local disk; reads/writes happen locally; on close, the file is written back to the server (the upload/download model).
  • Callbacks: The server keeps state and issues a callback promise; if another client modifies the file, the server sends a callback invalidating the cached copy. This gives stronger consistency than NFS while minimizing server load.
  • Session semantics: Changes become visible to others only after the file is closed.

Comparison

FeatureNFS (v3)AFS
Caching unitBlocksWhole file (on local disk)
Server stateStatelessStateful (callbacks)
ConsistencyValidation by timestampsCallback + session semantics
ScalabilityModerateHigh
distributed-fs
3long10 marks

What is fault tolerance? Explain failure models, the use of redundancy, and the concept of agreement in faulty systems (Byzantine generals problem).

Fault Tolerance

Fault tolerance is the ability of a distributed system to continue providing correct service (possibly degraded) even when some of its components fail. A system is dependable if it is available, reliable, safe and maintainable. The basic terms: a fault (defect) may cause an error (incorrect state) which may lead to a failure (system deviates from correct service).

Failure Models

Failures are classified by how a process/server can behave incorrectly:

  • Crash (fail-stop) failure – a process halts and stays halted; it was working correctly until then.
  • Omission failure – a server fails to respond (receive omission: it doesn't receive incoming messages; send omission: it doesn't send replies).
  • Timing failure – the response is correct but arrives outside the specified time interval (too late/early).
  • Response failure – the response is incorrect (value failure: wrong value; state-transition failure: wrong control flow).
  • Arbitrary / Byzantine failure – the most severe: a process may produce arbitrary, even malicious or contradictory, outputs at arbitrary times.

Redundancy

Fault tolerance is achieved primarily through redundancy (masking failures):

  • Information redundancy – extra bits, e.g., error-correcting codes, checksums.
  • Time redundancy – repeat an operation (retransmission, re-execution), useful for transient faults.
  • Physical (hardware/software) redundancy – replicate components/processes. Example: Triple Modular Redundancy (TMR) uses three units feeding voters that mask one faulty unit. To mask up to kk crash faults, k+1k+1 replicas suffice; to mask kk Byzantine faults, 2k+12k+1 replicas are needed.

Agreement in Faulty Systems: Byzantine Generals Problem

Processes must reach agreement (consensus) on a value despite some processes being faulty/traitorous. In the Byzantine Generals Problem, several generals must agree on a common plan (attack/retreat), but some generals are traitors who send conflicting messages.

Lamport's result: Agreement among NN processes tolerating mm Byzantine (arbitrary) faulty processes is possible only if

N3m+1.N \ge 3m + 1.

Thus with m=1m=1 traitor at least 44 generals are required; with 3 generals and 1 traitor, no solution exists. With reliable signed messages (authentication) this bound can be relaxed. Reaching agreement also requires enough rounds of message exchange.

Conclusion

Fault tolerance combines a clear failure model, redundancy to mask faults, and agreement protocols (subject to the 3m+13m+1 bound) to keep distributed systems dependable.

fault-tolerance
B

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions·5 marks each
4short5 marks

What is a distributed system? Explain the goals and characteristics of a distributed system.

A distributed system is a collection of independent computers (nodes) that appears to its users as a single coherent system, with the nodes coordinating their actions by passing messages over a network (no shared memory or global clock).

Goals:

  • Resource sharing – share hardware, software and data among users.
  • Transparency – hide distribution (access, location, replication, etc.) so the system looks like one machine.
  • Openness – use standard interfaces/protocols, enabling interoperability and portability.
  • Scalability – handle growth in users, data and geographic size without major redesign.

Characteristics:

  • No global clock – processes coordinate using message passing and logical clocks.
  • Concurrency – many components execute in parallel and share resources.
  • Independent failures – components can fail independently; the rest should keep working (fault tolerance).
  • Heterogeneity – different hardware, OS and networks, hidden by middleware.
  • Autonomy of nodes – each node manages its own resources.
characteristics
5short5 marks

Explain Lamport's logical clock with an example.

Lamport's Logical Clock

Because distributed systems lack a global physical clock, Lamport's logical clock assigns numeric timestamps to events so that the happened-before relation (\rightarrow) is captured, giving a consistent ordering of events.

Happened-before rules:

  1. If aa and bb are in the same process and aa occurs before bb, then aba \rightarrow b.
  2. If aa is the sending of a message and bb is its receipt, then aba \rightarrow b.
  3. Transitivity: if aba \rightarrow b and bcb \rightarrow c then aca \rightarrow c.

Clock update rules (each process keeps a counter CC):

  • Before each event, increment the local clock: C:=C+1C := C + 1.
  • A message carries the sender's timestamp tt.
  • On receipt, the receiver sets C:=max(C,t)+1C := \max(C, t) + 1.

This guarantees the clock condition: if aba \rightarrow b then C(a)<C(b)C(a) < C(b).

Example

Processes P1P_1 and P2P_2. P1P_1 sends message mm at its event with timestamp.

  • P1P_1: event aa (send mm), C1:01C_1: 0 \to 1, so C(a)=1C(a)=1.
  • P2P_2 had local clock C2=4C_2 = 4 when it does an internal event (5\to 5), then receives mm (timestamp 11): C2:=max(5,1)+1=6C_2 := \max(5, 1) + 1 = 6.

Since the send (1) happened-before the receive (6), 1<61 < 6 — the ordering is preserved.

Note: Lamport clocks give abC(a)<C(b)a \rightarrow b \Rightarrow C(a) < C(b), but not the converse; vector clocks are needed to detect concurrency.

logical-clock
6short5 marks

What is a Remote Procedure Call (RPC)? Explain its working with a diagram.

Remote Procedure Call (RPC)

RPC is a communication mechanism that allows a program to call a procedure that executes on a different (remote) machine, while making the call look like an ordinary local procedure call. It hides the underlying message passing, providing access transparency.

Working (steps)

Client machine                         Server machine
  client                                  server
     | 1. call                                |
  client stub  -- 2. marshal & send -->  server stub
     |                                       | 3. unmarshal, call
 RPC runtime  ----- network ------>   RPC runtime
     |                                       | 4. server executes
  client stub <-- 6. send reply -----  server stub <- 5. result
     | 7. unmarshal                           |
  return value
  1. The client calls the client stub as if it were a normal local procedure.
  2. The client stub marshals (packs) the parameters into a message and asks the local OS/RPC runtime to send it.
  3. The message travels over the network to the server; the server stub unmarshals the parameters.
  4. The server stub calls the actual server procedure, which executes and produces a result.
  5. The result is returned to the server stub, which marshals it into a reply message.
  6. The reply is sent back to the client machine.
  7. The client stub unmarshals the result and returns it to the client, which resumes as if the call returned locally.

Key points

  • Stubs hide marshalling/networking; generated from an IDL interface definition.
  • Parameter passing uses call by value (binding by reference is hard across address spaces).
  • Binding locates the server (e.g., via a binder/port-mapper).
  • Failure semantics: at-least-once, at-most-once, or exactly-once (ideal but hard); idempotent operations ease retries.
rpc
7short5 marks

Differentiate between centralized and distributed mutual exclusion algorithms.

Centralized vs Distributed Mutual Exclusion

Mutual exclusion ensures that only one process at a time enters a critical section (CS) accessing a shared resource.

Centralized algorithm: One process acts as a coordinator. A process sends a REQUEST to the coordinator, which grants permission (sends OK) if the CS is free, otherwise queues the request; the process sends RELEASE on exit, and the coordinator grants the next queued request.

Distributed algorithm (e.g., Ricart–Agrawala): A process wishing to enter the CS sends a timestamped REQUEST to all other processes and enters only after receiving OK/REPLY from all of them. Decisions are made collectively using logical-clock timestamps to order requests.

AspectCentralizedDistributed
Decision makingSingle coordinator decidesAll processes participate
Messages per CS entry3 (request, grant, release)2(N−1) (request + reply to/from others)
Single point of failureYes (coordinator)No single coordinator; but failure of any node can block (needs failure handling)
BottleneckCoordinator can be a bottleneckLoad distributed
ImplementationSimple, easyComplex; needs synchronized logical clocks
Fairness/orderingFIFO by coordinatorBy timestamp ordering
Number of failure points1N

Summary: The centralized scheme is simple and message-efficient but has a single point of failure and bottleneck; the distributed scheme removes the single coordinator but costs more messages and is more complex and sensitive to individual node failures.

mutual-exclusion
8short5 marks

Explain distributed deadlock detection. What is a wait-for graph?

Distributed Deadlock Detection

A deadlock occurs when a set of processes are each blocked waiting for resources held by others in the set, so none can proceed. In a distributed system, processes and resources are spread across nodes, so no single site has the complete picture, making detection harder.

Wait-For Graph (WFG)

A wait-for graph is a directed graph used to detect deadlock:

  • Nodes = processes.
  • An edge PiPjP_i \rightarrow P_j means "PiP_i is waiting for a resource currently held by PjP_j."

A deadlock exists if and only if the WFG contains a cycle (with single-instance resources). Example: P1P2P3P1P_1 \rightarrow P_2 \rightarrow P_3 \rightarrow P_1 is a cycle, so P1,P2,P3P_1, P_2, P_3 are deadlocked.

In a distributed system the global WFG is the union of local WFGs at each site connected by edges crossing sites.

Detection approaches

  • Centralized detection – a coordinator maintains the global WFG; sites report local edges. Simple but a single point of failure and can detect phantom (false) deadlocks due to message delays.
  • Distributed detection – each site participates; e.g., edge-chasing / probe-based algorithms (Chandy–Misra–Haas): a blocked process sends a probe message along outgoing wait-for edges. If a probe returns to its initiator, a cycle (deadlock) is detected.
  • Hierarchical detection – sites organized in a tree; a parent detects deadlocks among its children.

Resolution: break the cycle by aborting/rolling back a victim process and releasing its resources.

deadlock
9short5 marks

Explain the Bully algorithm for electing a coordinator.

Bully Algorithm (Coordinator Election)

The Bully algorithm (Garcia-Molina) elects the process with the highest ID as the new coordinator after the current coordinator fails. It assumes each process has a unique ID, processes know each other's IDs, and message delivery is reliable.

Algorithm

When a process PP notices the coordinator is not responding, it starts an election:

  1. PP sends an ELECTION message to all processes with a higher ID than itself.
  2. If no higher process responds, PP wins and becomes the coordinator; it sends a COORDINATOR message to all other processes announcing itself.
  3. If a higher process responds (with an OK/ALIVE message), that process takes over the election; PP steps down and waits.
  4. Each higher process that received an ELECTION then repeats the procedure (holds its own election among still-higher processes).
  5. Eventually the highest-ID alive process becomes coordinator and broadcasts the COORDINATOR message.

If a previously crashed process with a higher ID recovers, it starts a new election and "bullies" the current coordinator out — hence the name.

Example

Processes {1,2,3,4,5,6,7}\{1,2,3,4,5,6,7\}; coordinator 77 crashes. Process 44 detects this and sends ELECTION to 5,6,75,6,7. Processes 55 and 66 reply OK and start their own elections; 77 does not reply. 66 sends ELECTION to 77, gets no reply, so 66 becomes the new coordinator and broadcasts COORDINATOR to all.

Message cost: O(N2)O(N^2) in the worst case.

election
10short5 marks

What are the different kinds of transparency in a distributed system?

Kinds of Transparency in a Distributed System

Transparency means hiding the fact that resources are distributed across multiple computers, so the system appears as a single coherent system to users and applications. The main kinds (ISO/ANSA classification) are:

  • Access transparency – hides differences in data representation and in how a resource is accessed; local and remote resources are accessed using identical operations.
  • Location transparency – hides where a resource is physically located (names do not reveal location).
  • Migration transparency – hides that a resource may move to another location; its name stays the same after moving.
  • Relocation transparency – hides that a resource may be moved while in use.
  • Replication transparency – hides that multiple copies of a resource exist; the user sees a single logical resource.
  • Concurrency transparency – hides that a resource may be shared by several competing users; access appears exclusive/consistent.
  • Failure transparency – hides the failure and recovery of a resource; the user is unaware of faults.
  • Persistence transparency – hides whether a resource is in memory or on disk.

Full transparency is not always desirable (e.g., it can hurt performance or conflict with location-awareness), so it is balanced against other goals.

transparency
11short5 marks

What is middleware? Explain its role in a distributed system.

Middleware

Middleware is a software layer that sits between the operating system/network and the distributed applications. It provides a common programming abstraction across a heterogeneous collection of computers, networks and operating systems, hiding their differences from application developers.

   Application 1     Application 2     Application 3
  ---------------------------------------------------
        Middleware (common services & abstractions)
  ---------------------------------------------------
     OS/Network        OS/Network        OS/Network

Role / Functions in a Distributed System

  • Provides transparency – masks heterogeneity of hardware, OS and networks, and offers access/location/replication transparency so distributed resources look local.
  • Communication abstractions – offers high-level mechanisms such as RPC, Remote Method Invocation (RMI), message-oriented middleware (MOM) and publish/subscribe, instead of raw sockets.
  • Common services – naming/directory, security/authentication, transactions, concurrency control, persistence, and replication services.
  • Interoperability & openness – standard interfaces (e.g., CORBA, RMI, web services, gRPC) let components written in different languages/platforms work together.
  • Eases development – application programmers focus on business logic, not low-level networking and synchronization.

Examples: CORBA, Java RMI, Microsoft DCOM/.NET Remoting, gRPC, message brokers (RabbitMQ/Kafka).

middleware
12short5 marks

Explain Cristian's algorithm for physical clock synchronization.

Cristian's Algorithm (Physical Clock Synchronization)

Cristian's algorithm synchronizes a client's clock with a time server that has access to an accurate (UTC) time source. It is suitable when round-trip times are short compared to the required accuracy.

Procedure

  1. The client process PP sends a request to the time server at local time T0T_0 asking for the current time.
  2. The server, on receipt, replies with its time TserverT_{\text{server}}.
  3. The client receives the reply at local time T1T_1.
  4. The client estimates the message round-trip delay as Tround=T1T0T_{\text{round}} = T_1 - T_0 and assumes the reply took half of it, so it sets its clock to:
Tclient=Tserver+T1T02.T_{\text{client}} = T_{\text{server}} + \frac{T_1 - T_0}{2}.

If the (minimum) one-way message time II (interrupt-handling/processing time at server) is known, a better estimate is:

Tclient=Tserver+(T1T0)I2.T_{\text{client}} = T_{\text{server}} + \frac{(T_1 - T_0) - I}{2}.

Accuracy bound

The time value lies within the interval [Tserver+I,  Tserver+(T1T0)I][T_{\text{server}} + I,\; T_{\text{server}} + (T_1 - T_0) - I], so the error is at most

±(T1T02I).\pm\left(\frac{T_1 - T_0}{2} - I\right).

Example

If T0=100T_0 = 100 ms, reply received at T1=120T_1 = 120 ms, and server time =500= 500, then

Tclient=500+1201002=500+10=510.T_{\text{client}} = 500 + \frac{120 - 100}{2} = 500 + 10 = 510.

Limitation: Relies on a single time server (single point of failure) and assumes symmetric, small network delays; clocks must never be set backward (adjust gradually instead).

clock-synchronization

Frequently asked questions

Where can I find the BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) question paper 2081?
The full BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) 2081 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Distributed System (BSc CSIT, CSC462) 2081 paper come with solutions?
Yes. Every question on this Distributed System (BSc CSIT, CSC462) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) 2081 paper?
The BSc CSIT (TU) Distributed System (BSc CSIT, CSC462) 2081 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Distributed System (BSc CSIT, CSC462) past paper free?
Yes — reading and attempting this Distributed System (BSc CSIT, CSC462) past paper on Kekkei is completely free.