Browse papers
A

Section A: Long Answer Questions

Attempt any TWO questions.

3 questions·10 marks each
1long10 marks

What is a distributed database management system (DDBMS)? Explain data fragmentation, replication and allocation techniques used in distributed databases with suitable examples.

Distributed Database Management System (DDBMS)

A distributed database is a collection of logically interrelated databases distributed over a computer network. A DDBMS is the software that manages the distributed database and makes the distribution transparent to users, so the system appears as a single, centralized database even though the data is physically stored across multiple sites.

Key transparencies: location transparency, fragmentation transparency, replication transparency, and concurrency/failure transparency.

Data Fragmentation

Fragmentation divides a global relation into smaller pieces (fragments) that can be stored at different sites. Rules: completeness (every tuple/attribute appears in some fragment), reconstruction (the original relation can be rebuilt), and disjointness (for horizontal fragmentation).

  • Horizontal fragmentation — splits a relation by rows using a selection predicate.
EMP1=σDept=Sales(EMP),EMP2=σDept=HR(EMP)EMP_1 = \sigma_{Dept = 'Sales'}(EMP), \quad EMP_2 = \sigma_{Dept = 'HR'}(EMP)

Reconstruction: EMP=EMP1EMP2EMP = EMP_1 \cup EMP_2.

  • Vertical fragmentation — splits a relation by columns, each fragment keeping the primary key.
EMPA=πEno,Name(EMP),EMPB=πEno,Salary(EMP)EMP_A = \pi_{Eno, Name}(EMP), \quad EMP_B = \pi_{Eno, Salary}(EMP)

Reconstruction: EMP=EMPAEMPBEMP = EMP_A \bowtie EMP_B (join on key).

  • Mixed/Hybrid fragmentation — combination of horizontal and vertical (e.g., horizontally fragment, then vertically fragment a fragment).

Replication

Replication stores copies of a fragment/relation at multiple sites.

  • Full replication — every site has a full copy: fast reads, high availability, but expensive updates.
  • Partial replication — only selected fragments are replicated (frequently accessed data).
  • No replication — each fragment stored at exactly one site: cheap updates, but lower availability.

Replication improves read performance and reliability but increases update cost and the burden of keeping copies consistent.

Data Allocation

Allocation decides which fragment is stored at which site. Strategies are guided by an allocation model that minimizes total cost (storage + query + update + communication) based on the access frequency of each fragment at each site. A fragment is typically allocated to the site(s) where it is most frequently accessed.

Example: A bank with branches in Kathmandu and Pokhara horizontally fragments the ACCOUNT table by branch; the Kathmandu fragment is allocated to the Kathmandu server (where it is mostly used) and replicated read-only to Pokhara for occasional reporting.

distributed-database
2long10 marks

Explain the object-oriented database model. Discuss the ODMG object model along with the Object Definition Language (ODL) and Object Query Language (OQL) with examples.

Object-Oriented Database (OODB) Model

An object-oriented database stores data as objects, integrating object-oriented programming concepts with database capabilities (persistence, concurrency, recovery). Core features: object identity (OID), encapsulation (state + behavior/methods), classes, inheritance, polymorphism, and support for complex objects. It overcomes the impedance mismatch between the relational model and OO programming languages.

ODMG Object Model

The ODMG (Object Data Management Group) standard defines a common model for OODBs so that applications are portable across compliant systems. Components:

  • Object Model — defines objects (with mutable state, identified by OID) and literals (immutable values). Supports atomic, structured and collection types (Set, Bag, List, Array, Dictionary), relationships, attributes and operations.
  • ODL — Object Definition Language (schema definition).
  • OQL — Object Query Language (declarative querying).
  • A binding to OO languages (C++, Java, Smalltalk).

Object Definition Language (ODL)

ODL is a programming-language-independent language used to define the schema (classes, attributes, relationships, methods).

class Student (extent students key rollNo) {
  attribute string  rollNo;
  attribute string  name;
  attribute integer age;
  relationship Set<Course> takes inverse Course::takenBy;
  void enroll(in Course c);
};

class Course (extent courses key code) {
  attribute string code;
  attribute string title;
  relationship Set<Student> takenBy inverse Student::takes;
};

Here extent is the set of all objects of the class, key enforces uniqueness, and relationship ... inverse declares a bidirectional relationship.

Object Query Language (OQL)

OQL is an SQL-like declarative query language for the ODMG model that supports path expressions, method calls and complex/nested results.

-- Names of students older than 20
SELECT s.name
FROM   s IN students
WHERE  s.age > 20;

-- Path expression: titles of courses taken by 'Ram'
SELECT c.title
FROM   s IN students, c IN s.takes
WHERE  s.name = "Ram";

Unlike SQL, OQL can return objects, structures, and nested collections, and navigate relationships through path expressions (s.takes).

object-oriented-database
3long10 marks

What is concurrency control? Explain the two-phase locking protocol and timestamp-based ordering protocol for concurrency control with suitable examples.

Concurrency Control

Concurrency control is the activity of coordinating the concurrent (interleaved) execution of transactions in a multi-user database so that the result is equivalent to some serial execution (serializability), while preserving the ACID properties. Without it, anomalies such as lost update, dirty read, and unrepeatable read can occur.

Two-Phase Locking (2PL) Protocol

2PL is a lock-based protocol. A transaction must obtain a shared lock (S) before reading and an exclusive lock (X) before writing a data item. Locking proceeds in two phases:

  1. Growing phase — the transaction may acquire locks but cannot release any.
  2. Shrinking phase — the transaction may release locks but cannot acquire any new lock.

The point after the last lock is acquired is the lock point. 2PL guarantees conflict serializability.

Variants: Strict 2PL holds all exclusive locks until commit/abort (prevents cascading rollback); Rigorous 2PL holds all locks until commit.

Example:

T1: lock-X(A); read(A); A=A-50; write(A);
    lock-X(B); release(A); read(B); B=B+50; write(B); release(B);

All locks are acquired before any is released, so T1 obeys 2PL. A drawback is the possibility of deadlock.

Timestamp-Based Ordering Protocol

Each transaction TiT_i gets a unique timestamp TS(Ti)TS(T_i) at start. Each data item XX has two values: W-timestamp(X) (largest timestamp of a transaction that wrote X) and R-timestamp(X) (largest timestamp that read X). The protocol orders transactions by their timestamps and ensures the schedule is equivalent to executing them in timestamp order.

Read rule (Ti reads X):

  • If TS(Ti)<W-TS(X)TS(T_i) < W\text{-}TS(X)TiT_i is too late; reject and roll back TiT_i.
  • Else read is allowed; set R-TS(X)=max(R-TS(X),TS(Ti))R\text{-}TS(X) = \max(R\text{-}TS(X), TS(T_i)).

Write rule (Ti writes X):

  • If TS(Ti)<R-TS(X)TS(T_i) < R\text{-}TS(X) or TS(Ti)<W-TS(X)TS(T_i) < W\text{-}TS(X)reject and roll back TiT_i.
  • Else write is allowed; set W-TS(X)=TS(Ti)W\text{-}TS(X) = TS(T_i).

Example: Suppose TS(T1)=5TS(T_1)=5, TS(T2)=10TS(T_2)=10. If T2T_2 writes XX (so W-TS(X)=10W\text{-}TS(X)=10) and then T1T_1 (timestamp 5) tries to write XX, since TS(T1)=5<W-TS(X)=10TS(T_1)=5 < W\text{-}TS(X)=10, T1T_1 is rolled back and restarted with a new timestamp.

Timestamp ordering is deadlock-free (no waiting) but can cause repeated rollbacks/starvation.

transactionconcurrency
B

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions·5 marks each
4short5 marks

What is a deductive database? Explain the use of Datalog with an example.

Deductive Database

A deductive database is a database system that can derive new facts (the intensional database, IDB) from a set of stored base facts (the extensional database, EDB) using logical inference rules. It combines a relational database with a logic-programming style rule engine, allowing recursive queries that are hard to express in plain SQL.

Datalog

Datalog is a declarative, rule-based query language (a restricted, non-procedural subset of Prolog). A Datalog program consists of:

  • Facts (ground atoms) — the stored data.
  • Rules of the form head :- body, meaning if the body is true, infer the head.

Datalog supports recursion, making it powerful for transitive queries.

Example — ancestor relationship:

% Facts (EDB)
parent(ram, sita).
parent(sita, hari).

% Rules (IDB)
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

% Query
?- ancestor(ram, hari).   % returns true

The recursive rule derives ancestor(ram, hari) even though it is not stored directly, demonstrating inference that ordinary relational queries cannot easily express.

deductive-database
5short5 marks

Explain the two-phase locking (2PL) protocol for concurrency control.

Two-Phase Locking (2PL) Protocol

2PL is a lock-based concurrency-control protocol that guarantees conflict-serializable schedules. A transaction acquires a shared lock (S) before reading and an exclusive lock (X) before writing a data item, and all its locking activity is divided into two non-overlapping phases:

  1. Growing phase — the transaction may acquire (upgrade) locks but may not release any lock.
  2. Shrinking phase — the transaction may release (downgrade) locks but may not acquire any new lock.

The boundary between the two phases is called the lock point. Holding to this rule ensures serializability.

Variants

  • Strict 2PL — all exclusive locks are held until commit/abort, preventing cascading rollbacks.
  • Rigorous 2PL — all locks (S and X) held until commit.

Example

lock-X(A); read(A); write(A);     -- growing
lock-X(B);                        -- growing (last lock = lock point)
unlock(A); read(B); write(B);     -- shrinking
unlock(B);

Drawback: 2PL can lead to deadlocks, which must be handled by deadlock detection or prevention.

concurrency
6short5 marks

Explain the two-phase commit (2PC) protocol used in distributed transactions.

Two-Phase Commit (2PC) Protocol

2PC is an atomic commit protocol that ensures all sites participating in a distributed transaction either commit or abort together, preserving atomicity across the network. One site acts as the coordinator; the others are participants (cohorts).

Phase 1 — Voting / Prepare Phase

  1. The coordinator sends a PREPARE message to all participants.
  2. Each participant that can commit writes a <ready T> record to its log, forces the log to disk, and replies VOTE-COMMIT (Yes). If it cannot commit, it writes <abort T> and replies VOTE-ABORT (No).

Phase 2 — Decision / Commit Phase

  1. If all participants voted Yes, the coordinator writes <commit T> to its log and sends GLOBAL-COMMIT to all; otherwise it writes <abort T> and sends GLOBAL-ABORT.
  2. Each participant carries out the decision, writes the corresponding log record, and sends an ACK to the coordinator.
  3. After receiving all ACKs, the coordinator completes the transaction.

Drawback

2PC is a blocking protocol: if the coordinator fails after participants have voted Yes, the participants are stuck holding locks (in-doubt) until the coordinator recovers. Three-phase commit (3PC) addresses this blocking problem.

distributed-transaction
7short5 marks

What is indexing? Differentiate between primary, secondary and clustering indexes.

Indexing

An index is an auxiliary data structure (typically a sorted file of (search-key, pointer) entries or a B+-tree) built on one or more attributes of a relation to speed up data retrieval, avoiding a full scan. It trades extra storage and update overhead for faster query access.

Comparison of Index Types

FeaturePrimary IndexClustering IndexSecondary Index
Built onOrdering key field (unique)Ordering non-key fieldNon-ordering field (key or non-key)
File orderFile sorted on this fieldFile sorted on this fieldFile not sorted on this field
EntriesOne per block (sparse)One per distinct valueOne per record (dense)
Number per tableAt most oneAt most oneMany allowed
DensitySparseSparseDense

Key points

  • A primary index is defined on the field by which the file is physically sorted and which is the primary key; it is sparse (one entry per data block).
  • A clustering index is also on the field by which the file is sorted, but that field has duplicate values (non-key); one index entry per distinct value points to the first block containing it.
  • A secondary index is defined on a field that is not the physical ordering field; it is dense and a table may have many secondary indexes.
indexing
8short5 marks

Explain the architecture of parallel databases. What is inter-query and intra-query parallelism?

Parallel Database Architecture

A parallel database uses multiple CPUs and disks operating in parallel to improve performance (throughput and response time) through speedup and scaleup. The main hardware architectures are:

  • Shared-memory — all processors share a common main memory and disks via an interconnect. Easy to program; limited scalability due to memory-bus contention.
  • Shared-disk — each processor has its own private memory but all share the disks. Good fault tolerance; interconnect to disks can be a bottleneck.
  • Shared-nothing — each processor has its own memory and disks; nodes communicate only by message passing. Highly scalable (used in big-data systems); more complex to program.
  • Hierarchical (hybrid) — combines the above (e.g., shared-nothing cluster of shared-memory nodes).

Inter-query vs Intra-query Parallelism

  • Inter-query parallelismdifferent queries/transactions execute simultaneously on different processors. It increases throughput (transactions per second) but does not speed up a single query. Common in OLTP systems.
  • Intra-query parallelism — a single query is executed faster by running its parts in parallel. It is of two kinds:
    • Intra-operation parallelism — a single operation (e.g., sort, join) is parallelized across processors.
    • Inter-operation parallelism — different operators of one query plan run in parallel (pipelined or independent). This reduces the response time of an individual query and is useful for OLAP/decision-support workloads.
parallel-database
9short5 marks

What is an object-relational database? How does it differ from a purely relational database?

Object-Relational Database (ORDBMS)

An object-relational database extends the traditional relational model with object-oriented features while retaining SQL and the relational foundation. It is a hybrid: data is still organized in tables, but the system supports rich, user-defined and complex types. ORDBMS features (in SQL:1999+) include:

  • User-defined / abstract data types (UDTs) and structured types.
  • Type inheritance (subtypes/supertypes) and table inheritance.
  • Collection types — arrays, multisets, nested tables.
  • Methods / user-defined functions attached to types.
  • References (REF types) to row objects and object identifiers (OIDs).
  • Large objects (BLOB/CLOB).

Difference from a Purely Relational Database

AspectRelational DBObject-Relational DB
Data typesOnly atomic (1NF) built-in typesAtomic plus complex/user-defined types
StructureFlat tables, attributes are atomicTables may contain nested/complex attributes
BehaviorNo methods stored with dataMethods/functions attached to types
InheritanceNot supportedType and table inheritance supported
IdentityTuples identified by primary keyRow objects can have OIDs and REFs

In short, a purely relational database enforces the first normal form (atomic attribute values) and stores only data, whereas an ORDBMS relaxes 1NF to allow complex objects, inheritance, and behavior — bridging the relational and object-oriented worlds while remaining backward-compatible with SQL.

object-relational
10short5 marks

What is XML data model? Explain how XML is used to represent and query data in databases.

XML Data Model

XML (eXtensible Markup Language) is a self-describing, semi-structured data model that represents data as a hierarchy (tree) of nested, user-defined tags (elements) with optional attributes. Unlike the rigid tabular relational model, XML is flexible — schemas may be irregular or partially defined — which makes it ideal for data exchange between heterogeneous systems and the web.

Key constructs: elements (nested, ordered), attributes, text content, a single root element, and structure validation via DTD or XML Schema (XSD).

Example document:

<library>
  <book isbn="123">
    <title>Advanced Database</title>
    <author>Ram</author>
    <price>500</price>
  </book>
</library>

Representing and Querying XML in Databases

Storage: XML can be stored as (a) a native XML database, (b) a column of an XML data type in an object-relational DBMS, or (c) shredded into relational tables (mapping elements/attributes to columns).

Querying:

  • XPath — navigates the XML tree using path expressions to select nodes.
    /library/book[price > 400]/title
    
  • XQuery — a full query language (FLWOR: For-Let-Where-Order-Return) for transforming and combining XML.
    for $b in doc("library.xml")/library/book
    where $b/price > 400
    return $b/title
    

These let applications query hierarchical XML data much as SQL queries relational tables.

xml
11short5 marks

Explain log-based recovery techniques in a database system.

Log-Based Recovery

A log (transaction log) is a sequential file recording every update so the database can be restored to a consistent state after a failure. Each log record stores <Ti, X, old_value, new_value> for a write, plus <Ti start>, <Ti commit>, <Ti abort> markers. Recovery relies on the Write-Ahead Logging (WAL) rule: the log record must be written to stable storage before the corresponding data change reaches the database.

1. Deferred Database Modification

Updates are not applied to the database until the transaction commits — they are only recorded in the log. On recovery, transactions with both <Ti start> and <Ti commit> are redone (redo(Ti)); those without a commit are simply ignored (no undo needed, since nothing was written).

2. Immediate Database Modification

Updates are written to the database before commit. The log keeps both old and new values. On recovery:

  • redo(Ti) if the log has both <Ti start> and <Ti commit> — reapply new values.
  • undo(Ti) if the log has <Ti start> but no <Ti commit> — restore old values (in reverse order).

Checkpoints

To limit the amount of log scanned, the system periodically writes a checkpoint record after flushing buffers and the log to disk. During recovery only transactions active at or after the most recent checkpoint need to be examined, greatly speeding up recovery.

recovery
12short5 marks

Write short notes on the MapReduce programming model for big data processing.

MapReduce Programming Model

MapReduce is a programming model and framework (popularized by Google, implemented in Apache Hadoop) for processing very large data sets in parallel across a shared-nothing cluster of commodity machines. The programmer writes two functions; the framework handles parallelization, data distribution, fault tolerance and load balancing automatically.

Two Phases

  1. Map — takes an input (key, value) pair and emits a set of intermediate (key, value) pairs.
map(k1,v1)list(k2,v2)map(k_1, v_1) \rightarrow list(k_2, v_2)
  1. Shuffle & Sort — the framework groups all intermediate values by key and routes them to reducers.
  2. Reduce — takes a key and the list of all its values and produces the final output.
reduce(k2,list(v2))list(v3)reduce(k_2, list(v_2)) \rightarrow list(v_3)

Example — Word Count

Map(doc):    for each word w in doc -> emit(w, 1)
Reduce(w, counts): emit(w, sum(counts))

Input "a b a" → Map emits (a,1)(b,1)(a,1) → Shuffle groups a:[1,1], b:[1] → Reduce gives (a,2)(b,1).

Features

  • Scalability — scales horizontally to thousands of nodes.
  • Fault tolerance — failed tasks are automatically re-executed; data replicated in HDFS.
  • Data locality — computation moved to where data resides, reducing network traffic.
  • Simplicity — hides the complexity of distributed/parallel programming.

Limitation: batch-oriented and disk-heavy; less suited to iterative or real-time workloads (addressed by Spark).

mapreduce

Frequently asked questions

Where can I find the BSc CSIT (TU) Advanced Database (BSc CSIT, CSC461) question paper 2080?
The full BSc CSIT (TU) Advanced Database (BSc CSIT, CSC461) 2080 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Advanced Database (BSc CSIT, CSC461) 2080 paper come with solutions?
Yes. Every question on this Advanced Database (BSc CSIT, CSC461) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BSc CSIT (TU) Advanced Database (BSc CSIT, CSC461) 2080 paper?
The BSc CSIT (TU) Advanced Database (BSc CSIT, CSC461) 2080 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Advanced Database (BSc CSIT, CSC461) past paper free?
Yes — reading and attempting this Advanced Database (BSc CSIT, CSC461) past paper on Kekkei is completely free.