Browse papers
A

Section A: Long Answer Questions

Attempt any TWO questions.

3 questions·10 marks each
1long10 marks

What is concurrency control? Explain the two-phase locking protocol and timestamp-based ordering protocol for concurrency control with suitable examples.

Concurrency Control

Concurrency control is the activity of coordinating the simultaneous execution of multiple transactions in a multi-user database so that the result is equivalent to some serial (one-at-a-time) execution. It prevents anomalies such as lost updates, dirty reads, unrepeatable reads and phantom reads, thereby preserving the Isolation and Consistency properties of transactions.

1. Two-Phase Locking (2PL) Protocol

2PL is a lock-based concurrency-control protocol. A transaction must obtain a shared lock (S) before reading a data item and an exclusive lock (X) before writing it. Each transaction executes in two phases:

  • Growing (expanding) phase: the transaction may acquire locks but cannot release any.
  • Shrinking (contracting) phase: the transaction may release locks but cannot acquire any new lock.

The point where the transaction holds all its locks is called the lock point. 2PL guarantees conflict-serializability.

Example:

StepT1T2
1lock-X(A)
2read(A); A=A-50; write(A)
3lock-X(B) (growing)
4unlock(A) (shrinking)
5lock-X(A) (waits, then proceeds)

Variants: Strict 2PL (all exclusive locks held until commit/abort — avoids cascading rollback) and Rigorous 2PL (all locks held until commit). Drawback: 2PL can cause deadlocks.

2. Timestamp-Ordering (TO) Protocol

Each transaction TiT_i is assigned a unique timestamp TS(Ti)TS(T_i) at start (smaller = older). Each data item QQ keeps:

  • W-TS(Q)W\text{-}TS(Q): largest timestamp of any transaction that successfully wrote QQ.
  • R-TS(Q)R\text{-}TS(Q): largest timestamp of any transaction that successfully read QQ.

The protocol orders conflicting operations by timestamp:

On read(Q) by TiT_i:

  • If TS(Ti)<W-TS(Q)TS(T_i) < W\text{-}TS(Q)TiT_i needs a value already overwritten → reject and roll back TiT_i.
  • Else execute read; set R-TS(Q)=max(R-TS(Q),TS(Ti))R\text{-}TS(Q)=\max(R\text{-}TS(Q),\,TS(T_i)).

On write(Q) by TiT_i:

  • If TS(Ti)<R-TS(Q)TS(T_i) < R\text{-}TS(Q) → a younger transaction already read the old value → reject/roll back.
  • If TS(Ti)<W-TS(Q)TS(T_i) < W\text{-}TS(Q) → write is obsolete → ignore (Thomas Write Rule) or reject.
  • Else execute write; set W-TS(Q)=TS(Ti)W\text{-}TS(Q)=TS(T_i).

Example: Suppose TS(T1)=5TS(T1)=5, TS(T2)=10TS(T2)=10. If T2T2 writes AA (now W-TS(A)=10W\text{-}TS(A)=10) and later the older T1T1 tries to read(A), then TS(T1)=5<W-TS(A)=10TS(T1)=5 < W\text{-}TS(A)=10, so T1T1 is rolled back and restarted with a new timestamp.

The TO protocol ensures serializability in timestamp order and is deadlock-free (no waiting), but may cause starvation/cascading rollbacks.

transactionconcurrency
2long10 marks

What is a NoSQL database? Explain the characteristics of NoSQL systems and discuss the CAP theorem. Compare document-based, key-value, column-based and graph-based NoSQL data stores.

NoSQL Database

A NoSQL (Not Only SQL) database is a non-relational data store designed to handle large volumes of unstructured/semi-structured data with high scalability, availability and flexible schemas. NoSQL systems emerged to meet the needs of Big Data and web-scale applications where the fixed-schema relational model and ACID overhead become bottlenecks.

Characteristics of NoSQL Systems

  • Schema-less / flexible schema: records need not follow a fixed structure.
  • Horizontal scalability: scale out by adding commodity servers (sharding/partitioning).
  • High availability & fault tolerance: data is replicated across nodes.
  • BASE model (Basically Available, Soft state, Eventually consistent) instead of strict ACID.
  • Distributed architecture with no single point of failure.
  • Optimized for specific data models (documents, key-value, columns, graphs).

CAP Theorem

Proposed by Eric Brewer, the CAP theorem states that a distributed data store can simultaneously guarantee only two of the following three properties:

  • Consistency (C): every read receives the most recent write (all nodes see the same data).
  • Availability (A): every request receives a (non-error) response.
  • Partition tolerance (P): the system continues to operate despite network partitions.

Since network partitions are unavoidable in distributed systems, real systems must choose between CP (e.g., HBase, MongoDB) and AP (e.g., Cassandra, DynamoDB) during a partition.

Comparison of NoSQL Data Stores

TypeData modelExample DBsBest for
Key-ValueSimple (key → value) pairs; value is opaqueRedis, DynamoDB, RiakCaching, session stores, fast lookups
DocumentSelf-describing documents (JSON/BSON/XML) keyed by id; nested fields queryableMongoDB, CouchDBContent management, catalogs, flexible records
Column-family (wide-column)Rows grouped into column families; sparse, columnar storageCassandra, HBase, BigtableTime-series, analytics, write-heavy big data
GraphNodes + edges with properties; relationships are first-classNeo4j, JanusGraphSocial networks, recommendation, fraud detection

Summary: Key-value gives fastest simple access; document adds queryable structure; column-family scales for huge sparse datasets; graph excels where relationships dominate.

nosqlbigdata
3long10 marks

Explain the Enhanced Entity-Relationship (EER) model. Discuss specialization, generalization, categorization and aggregation with suitable diagrams and examples.

Enhanced Entity-Relationship (EER) Model

The EER model extends the basic ER model with additional semantic (object-oriented) concepts to capture complex relationships in modern applications. It adds subclasses/superclasses, specialization, generalization, categorization (union types), and aggregation, along with attribute and relationship inheritance.

1. Specialization (Top-Down)

The process of defining a set of subclasses of an entity type (the superclass) based on some distinguishing characteristic. The subclasses inherit the attributes and relationships of the superclass and add their own specific ones.

Example: EMPLOYEE is specialized into SECRETARY, ENGINEER, and TECHNICIAN. Diagram: a superclass box EMPLOYEE connected through a circle (with subset symbol ) to the three subclass boxes; JobType is the defining attribute.

2. Generalization (Bottom-Up)

The reverse of specialization: identifying common features of several entity types and abstracting them into a single generalized superclass.

Example: CAR and TRUCK share attributes (VehicleId, Price) → generalized into superclass VEHICLE. Diagram: CAR and TRUCK boxes connected upward through a circle to VEHICLE.

Constraints on specialization/generalization:

  • Disjointness: disjoint (d) — an entity belongs to at most one subclass; overlapping (o) — may belong to several.
  • Completeness: total (double line) — every superclass entity must belong to some subclass; partial (single line) — may belong to none.

3. Categorization (Union Type)

A category (union type) is a subclass whose members are a subset of the union of entities from several different superclasses. Unlike a shared subclass, each member relates to one of the superclasses.

Example: OWNER is a category of the union of PERSON, BANK, and COMPANY (an account owner can be any one of these). Diagram: a circle marked connecting PERSON, BANK, COMPANY to the OWNER category box.

4. Aggregation

Aggregation is an abstraction that treats a relationship (together with its participating entities) as a higher-level entity, allowing relationships among relationships.

Example: the relationship WORKS-ON between EMPLOYEE and PROJECT is aggregated, and this aggregate participates in a USES relationship with MACHINERY. Diagram: a dashed rectangle enclosing the EMPLOYEE—WORKS-ON—PROJECT relationship, with a line from that box to the USES diamond and on to MACHINERY.

Conclusion: These EER constructs provide richer, object-oriented semantics (inheritance, union, abstraction) that the basic ER model cannot express.

extended-erer-model
B

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions·5 marks each
4short5 marks

What is a deductive database? Explain the use of Datalog with an example.

Deductive Database

A deductive database is a database system that can derive new facts (implicit data) from the facts and rules stored in it using logical inference. It combines a relational database with a rule-based logic system, typically based on first-order logic / Datalog. It has two components:

  • Facts (extensional database, EDB): the base data, like relational tuples.
  • Rules (intensional database, IDB): logical rules used to deduce additional facts.

Datalog

Datalog is a declarative, non-procedural query language (a subset of Prolog, without function symbols) used to express facts and rules. A rule has the form:

HEAD :- BODY.   % read as: HEAD is true IF BODY is true

Example — family relationships:

% Facts (EDB)
parent(ram, sita).
parent(sita, hari).
parent(hari, gita).

% Rules (IDB)
grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
ancestor(X, Y)    :- parent(X, Y).
ancestor(X, Y)    :- parent(X, Z), ancestor(Z, Y).

Query: ?- grandparent(ram, Q). deduces Q = hari (because parent(ram,sita) and parent(sita,hari)). The recursive ancestor rule shows Datalog's key strength: expressing recursive queries that plain SQL handled poorly.

deductive-database
5short5 marks

Explain the two-phase locking (2PL) protocol for concurrency control.

Two-Phase Locking (2PL) Protocol

2PL is a lock-based concurrency-control protocol that guarantees conflict-serializability. A transaction must acquire a shared lock (S) before reading and an exclusive lock (X) before writing a data item. Every transaction follows two phases:

  1. Growing (expanding) phase — the transaction may acquire locks but may not release any lock.
  2. Shrinking (contracting) phase — the transaction may release locks but may not acquire any new lock.

Once a transaction releases its first lock, it enters the shrinking phase and can never lock again. The moment it holds the maximum number of locks is the lock point; serializing transactions by their lock points yields a serializable schedule.

Variants:

  • Strict 2PL: holds all exclusive (write) locks until commit/abort → prevents cascading rollbacks.
  • Rigorous 2PL: holds all locks (S and X) until commit.

Advantages: ensures serializability and recoverability (strict version). Disadvantages: can cause deadlocks and reduces concurrency due to waiting.

concurrency
6short5 marks

Explain the two-phase commit (2PC) protocol used in distributed transactions.

Two-Phase Commit (2PC) Protocol

2PC is an atomic commitment protocol that ensures all sites participating in a distributed transaction either all commit or all abort, preserving atomicity across the network. One site acts as the coordinator and the rest are participants (cohorts).

Phase 1 — Prepare / Voting Phase

  1. The coordinator sends a PREPARE message to all participants.
  2. Each participant executes the transaction up to commit point, force-writes a <ready T> log record, and replies VOTE-COMMIT (Yes) if it can commit, or VOTE-ABORT (No) otherwise.

Phase 2 — Commit / Decision Phase

  1. If all participants voted Yes, the coordinator writes <commit T> to its log and sends GLOBAL-COMMIT to all; otherwise it sends GLOBAL-ABORT.
  2. Each participant commits or aborts as instructed, writes the outcome to its log, and returns an ACK.
  3. After receiving all ACKs, the coordinator writes <complete T> and the transaction ends.

Properties: guarantees atomicity even with failures (timeouts trigger abort). Drawback: it is a blocking protocol — if the coordinator crashes after participants have voted Yes but before sending the decision, participants remain blocked holding locks. (Three-Phase Commit reduces this blocking.)

distributed-transaction
7short5 marks

What is indexing? Differentiate between primary, secondary and clustering indexes.

Indexing

Indexing is a data-structure technique that creates an auxiliary access structure on one or more attributes of a file to speed up data retrieval, avoiding a full scan of the data file. An index entry is typically a (search-key value, pointer) pair, and indexes are usually kept sorted so they can be searched quickly (e.g., binary search, B+-trees).

Primary, Secondary and Clustering Indexes

FeaturePrimary IndexClustering IndexSecondary Index
Defined onOrdering key field (unique, e.g., primary key)Ordering non-key field (duplicates allowed)Any non-ordering field (key or non-key)
File orderingFile is physically ordered on this fieldFile is physically ordered on this fieldFile is not ordered on this field
DensitySparse (one entry per data block, points to first/anchor record)Sparse (one entry per distinct value / block)Dense (one entry per record, or per distinct value)
Number per fileAt most oneAt most oneCan have many

Key points:

  • A file can have only one primary or clustering index (because a file is physically ordered on only one field), but several secondary indexes.
  • Primary index is built on the unique ordering key; clustering index is similar but the ordering field has duplicate values.
  • Secondary index improves access on non-ordering fields but is larger (denser) and adds storage/maintenance overhead.
indexing
8short5 marks

Explain the architecture of parallel databases. What is inter-query and intra-query parallelism?

Architecture of Parallel Databases

A parallel database uses multiple CPUs and disks operating in parallel to improve performance (throughput and response time) through query and transaction parallelism. The three classic architectures are:

  • Shared-Memory (shared everything): all processors share a common main memory and disks via a fast interconnect. Easy to program but limited scalability (memory-bus contention).
  • Shared-Disk (shared nothing but disk): each processor has its own memory but all share the disks. Good fault tolerance; suffers disk-interconnect bottlenecks.
  • Shared-Nothing: each processor has its own memory and disk; nodes communicate only via the network. Highly scalable — the model used by most large parallel/Big-Data systems.
  • (Hierarchical/hybrid) combines the above (e.g., shared-nothing clusters of shared-memory nodes).

Inter-Query Parallelism

Different queries/transactions are executed in parallel on different processors. This increases transaction throughput (more queries per second) but does not speed up an individual query. Mainly used in OLTP workloads; requires careful concurrency control (e.g., cache coherence in shared-disk systems).

Intra-Query Parallelism

A single query is executed in parallel across multiple processors to reduce its response time. It has two forms:

  • Intra-operation parallelism: parallelize one operation (e.g., a parallel sort or parallel join over partitioned data).
  • Inter-operation parallelism: execute different operators of the same query in parallel (pipelined or independent parallelism).

Intra-query parallelism is especially beneficial for large analytical (OLAP) queries.

parallel-database
9short5 marks

What is an object-relational database? How does it differ from a purely relational database?

Object-Relational Database (ORDBMS)

An object-relational database is a database management system that combines the relational model (tables, SQL, ACID transactions) with object-oriented features such as user-defined types, inheritance, and complex objects. It extends SQL (SQL:1999 and later) so that columns can hold rich, structured data rather than only atomic values. Examples: Oracle, PostgreSQL, IBM DB2.

How it Differs from a Purely Relational Database

AspectPure Relational (RDBMS)Object-Relational (ORDBMS)
Data typesOnly atomic/simple types (1NF, scalar values)User-defined types (UDTs), structured types, references
Complex objectsNot supported directlySupports arrays, nested tables, collections, multi-valued attributes
InheritanceNoneType and table inheritance (sub-types/super-types)
BehaviourData onlyMethods/functions can be attached to types (encapsulation)
Object identityIdentified by primary-key valueSupports object identifiers (OIDs) and reference types
StandardSQL-92SQL:1999+ object extensions

Summary: A pure relational DB strictly stores flat, atomic-valued tuples, whereas an ORDBMS lets you model complex, real-world objects (with structure, inheritance and behaviour) while retaining the maturity, querying power and reliability of the relational model. It thus bridges RDBMS and pure OODBMS.

object-relational
10short5 marks

What is XML data model? Explain how XML is used to represent and query data in databases.

XML Data Model

XML (eXtensible Markup Language) is a self-describing, text-based markup language for representing semi-structured, hierarchical data. Its data model is a tree (ordered, labelled) in which:

  • Elements are nodes delimited by tags, e.g. <book>...</book>, and may be nested to any depth.
  • Attributes provide name–value properties of an element, e.g. <book id="b1">.
  • Text (PCDATA) forms the leaf content.
  • A document has exactly one root element and must be well-formed; it may be valid against a schema (DTD or XML Schema/XSD).

XML is platform-independent, making it ideal for data exchange between heterogeneous systems and the web.

Example:

<library>
  <book id="b1">
    <title>Database Systems</title>
    <author>Elmasri</author>
    <price>500</price>
  </book>
</library>

Representing and Querying XML in Databases

  • Storage: XML can be stored as a CLOB/text, shredded into relational tables, or kept natively as an XML data type in XML-enabled/native XML databases (e.g., the SQL/XML XML type).
  • Querying:
    • XPath — navigates the tree with path expressions, e.g. /library/book/title selects all titles; //book[price>400] filters.
    • XQuery — a full query language with FLWOR expressions (For, Let, Where, Order by, Return):
for $b in doc("library.xml")/library/book
where $b/price > 400
return $b/title
  • SQL/XML — SQL functions (XMLELEMENT, XMLAGG, XMLQUERY) to produce/query XML from relational data.
xml
11short5 marks

Explain log-based recovery techniques in a database system.

Log-Based Recovery

Log-based recovery restores the database to a consistent state after a failure using a log (journal) — a sequential file recording every update before it reaches the database (the Write-Ahead Logging, WAL rule). Each log record typically has the form <Ti, X, old_value, new_value>, plus markers <Ti start>, <Ti commit>, <Ti abort>.

Two Update/Recovery Techniques

1. Deferred Database Modification (NO-UNDO/REDO):

  • Updates are written only to the log; the actual database is modified after the transaction commits.
  • On recovery, REDO committed transactions; ignore uncommitted ones (their changes never reached the DB). No UNDO needed.

2. Immediate Database Modification (UNDO/REDO):

  • Updates are written to the database while the transaction is still active (using WAL).
  • On recovery: UNDO transactions that have <start> but no <commit> (restore old_value), and REDO transactions that have <commit> (apply new_value).

Checkpoints

To limit the amount of log scanned, the system periodically writes a <checkpoint> record after flushing buffers to disk. During recovery only transactions active at or after the last checkpoint need to be examined — REDO from the checkpoint, UNDO active transactions — greatly reducing recovery time.

recovery
12short5 marks

Write short notes on the MapReduce programming model for big data processing.

MapReduce Programming Model

MapReduce is a programming model (introduced by Google, implemented in Hadoop) for processing very large data sets in parallel across a cluster of commodity machines. The programmer expresses the computation as two functions, and the framework handles parallelization, data distribution, fault tolerance and load balancing automatically.

Two Phases

  1. Map: takes input (key, value) pairs and emits a set of intermediate (key, value) pairs.
map(k1,v1)list(k2,v2)map(k_1, v_1) \rightarrow list(k_2, v_2)
  1. Reduce: the framework shuffles and sorts intermediate pairs so that all values with the same key go to one reducer, which aggregates them into the final output.
reduce(k2,list(v2))list(v3)reduce(k_2, list(v_2)) \rightarrow list(v_3)

Between them is the Shuffle & Sort phase that groups values by key.

Example — Word Count

map(docName, text):
    for each word w in text:
        emit(w, 1)

reduce(word, counts):
    emit(word, sum(counts))

Key Features

  • Data locality: computation is moved to where data resides (HDFS blocks).
  • Scalability & fault tolerance: failed tasks are re-executed on other nodes; works on thousands of nodes.
  • Simplicity: developers write only map/reduce; the framework manages the rest.

Limitations: batch-oriented (high latency), poor for iterative/interactive jobs — addressed by newer engines like Apache Spark.

mapreduce

Frequently asked questions

Where can I find the BSc CSIT (TU) Advanced Database (BSc CSIT, CSC461) question paper 2075?
The full BSc CSIT (TU) Advanced Database (BSc CSIT, CSC461) 2075 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Advanced Database (BSc CSIT, CSC461) 2075 paper come with solutions?
Yes. Every question on this Advanced Database (BSc CSIT, CSC461) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BSc CSIT (TU) Advanced Database (BSc CSIT, CSC461) 2075 paper?
The BSc CSIT (TU) Advanced Database (BSc CSIT, CSC461) 2075 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Advanced Database (BSc CSIT, CSC461) past paper free?
Yes — reading and attempting this Advanced Database (BSc CSIT, CSC461) past paper on Kekkei is completely free.