Browse papers
A

Section A: Long Answer Questions

Attempt any TWO questions.

3 questions·10 marks each
1long10 marks

Explain the architecture of a data warehouse. Differentiate between MOLAP, ROLAP, and HOLAP.

Architecture of a Data Warehouse

A data warehouse (DW) is a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. Its architecture is commonly described as a three-tier architecture.

Three-Tier Architecture

1. Bottom Tier — Data Warehouse Server (Database Layer)

  • A relational/RDBMS server that holds the warehouse data.
  • Data is fed from operational (OLTP) databases and external sources through ETL (Extract, Transform, Load):
    • Extract data from heterogeneous sources.
    • Transform / clean it (resolve inconsistencies, integrate, summarize).
    • Load it into the warehouse.
  • Also stores the metadata repository (data about data: source, ETL rules, schema).

2. Middle Tier — OLAP Server

  • Presents a multidimensional view of data to users.
  • Implemented as ROLAP (relational backend) or MOLAP (multidimensional cube store).

3. Top Tier — Front-End Client Layer

  • Tools for querying, reporting, analysis, and data mining (dashboards, BI tools).

Components

  • Data Sources (OLTP DBs, flat files, external data)
  • ETL / staging area and metadata
  • Data marts (department-level subsets)
  • Monitoring & administration

(Diagram in words: Sources → ETL/Staging → Data Warehouse + Metadata (bottom) → OLAP Server (middle) → Query/Report/Mining tools (top).)

MOLAP vs ROLAP vs HOLAP

FeatureMOLAPROLAPHOLAP
StorageMultidimensional cube arraysRelational tables (star/snowflake)Hybrid: detail in relational, summaries in cube
Query speedVery fast (pre-aggregated)Slower (SQL joins, runtime aggregation)Fast for summaries, scalable for detail
Scalability / data volumeLimited (cube can explode)High (handles very large data)High
Storage efficiencySparse-data overheadEfficientBalanced
Best forSmall/medium data, fast OLAPLarge detailed datasetsMix of large detail + fast aggregates
  • MOLAP stores data in optimized multidimensional cubes giving the fastest response but limited scalability.
  • ROLAP keeps data in relational tables and generates aggregations via SQL at query time, scaling to large volumes but slower.
  • HOLAP combines both: low-level detailed data stays in the relational store while summary/aggregated data is kept in a cube, balancing speed and scalability.
data-warehouseolap
2long10 marks

Explain the FP-growth algorithm for frequent pattern mining with an example. How does it improve upon Apriori?

FP-Growth Algorithm (Frequent Pattern Growth)

FP-Growth mines frequent itemsets without candidate generation by compressing the database into a compact FP-tree and then mining it recursively.

Steps

  1. First DB scan — count support of each item; discard items below min_support; sort remaining (frequent) items in descending order of support (the F-list).
  2. Second DB scan — for each transaction, keep only frequent items, order them by the F-list, and insert them into the FP-tree (shared prefixes share nodes; each node keeps a count). A header table links all nodes of the same item.
  3. Mine the FP-tree — for each item (bottom-up), build its conditional pattern base (prefix paths), construct a conditional FP-tree, and recursively grow frequent patterns.

Example

Transactions, min_support = 2:

TIDItems
T1A, B, C
T2A, C, D
T3A, B
T4B, C
T5A, B, C

Step 1 — counts: A=4, B=4, C=4, D=1. Drop D (support 1 < 2). F-list (desc): A:4, B:4, C:4.

Step 2 — ordered transactions: T1 {A,B,C}, T2 {A,C}, T3 {A,B}, T4 {B,C}, T5 {A,B,C}.

FP-tree (described): Root → A(4) → B(3) → C(2); A→C(1); plus a separate root→B(1)→C(1) branch (from T4). Header table links the A, B, C nodes.

Step 3 — mining (e.g., suffix C): conditional pattern base of C = {A,B:2}, {A:1}, {B:1}. Frequent patterns containing C: {A,C}:3, {B,C}:3, {A,B,C}:2. Similarly mining B gives {A,B}:3. Final frequent itemsets (support ≥ 2): {A},{B},{C},{A,B},{A,C},{B,C},{A,B,C}.

How FP-Growth Improves on Apriori

  • No candidate generation: Apriori generates and tests huge numbers of candidate itemsets; FP-Growth avoids this entirely.
  • Fewer DB scans: only two scans vs. Apriori's repeated scans (one per pass/level).
  • Compact structure: the FP-tree compresses shared prefixes, saving memory.
  • Faster: especially on dense datasets with long frequent patterns, often an order of magnitude faster.
  • Drawback: the FP-tree may not fit in memory for very large/sparse data.
association-rulesfp-growth
3long10 marks

What is classification? Explain the ID3 decision tree algorithm and build a decision tree for a given training dataset.

Classification

Classification is a supervised learning task that builds a model (classifier) from a labeled training set to predict the categorical class label of new, unseen data. It works in two phases: (1) learning/model construction from training data, and (2) classification/prediction with accuracy estimated on a test set.

ID3 Decision Tree Algorithm

ID3 (Iterative Dichotomiser 3) builds a tree top-down by selecting, at each node, the attribute with the highest information gain (greatest entropy reduction).

Entropy of dataset SS with classes ii:

Entropy(S)=ipilog2piEntropy(S) = -\sum_i p_i \log_2 p_i

Information Gain of attribute AA:

Gain(S,A)=Entropy(S)vValues(A)SvSEntropy(Sv)Gain(S,A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

Procedure:

  1. Compute entropy of the whole set.
  2. For each attribute compute information gain.
  3. Choose the attribute with maximum gain as the splitting node.
  4. Split data on its values; recurse on each branch.
  5. Stop when all instances in a node share one class (leaf), no attributes remain (majority class), or no data remains.

Example

Dataset (play tennis style), attribute Outlook (Sunny/Overcast/Rain), target Play (Yes/No), 14 samples with 9 Yes, 5 No:

Entropy(S)=914log2914514log25140.940Entropy(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940

For Outlook: Sunny(2Y,3N)→0.971, Overcast(4Y,0N)→0, Rain(3Y,2N)→0.971.

Gain(S,Outlook)=0.940(514(0.971)+414(0)+514(0.971))0.246Gain(S,Outlook)=0.940-\big(\tfrac{5}{14}(0.971)+\tfrac{4}{14}(0)+\tfrac{5}{14}(0.971)\big)\approx 0.246

If Outlook gives the highest gain among all attributes, it becomes the root. The Overcast branch is pure → leaf Yes. The Sunny and Rain branches are split further (e.g., Sunny on Humidity, Rain on Wind) by recomputing gains, producing the final tree.

Resulting tree (described):

Outlook
├── Sunny    → Humidity (High→No, Normal→Yes)
├── Overcast → Yes
└── Rain     → Wind (Strong→No, Weak→Yes)

New records are classified by traversing from root to a leaf.

classificationdecision-tree
B

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions·5 marks each
4short5 marks

What is a data mart? How does it differ from a data warehouse?

A data mart is a small, subject-oriented subset of a data warehouse focused on a single department or business function (e.g., Sales, Finance, HR). It contains only the data relevant to that group's analysis needs.

Differences from a data warehouse:

AspectData MartData Warehouse
ScopeSingle department/subjectEntire enterprise
SizeSmall (MB–few GB)Large (GB–TB+)
Data sourcesFewMany, integrated
Design/build timeQuick, cheaperLong, complex, costly
UsersDepartmentalOrganization-wide

Data marts can be dependent (sourced from the central warehouse) or independent (built directly from operational sources).

data-mart
5short5 marks

Explain the snowflake schema with an example.

Snowflake Schema

The snowflake schema is a refinement of the star schema in which dimension tables are normalized into multiple related tables, forming a snowflake shape around the central fact table.

Characteristics:

  • A central fact table holds measures and foreign keys.
  • Dimension tables are split into sub-dimensions to remove redundancy (normalization).
  • Saves storage but needs more joins, so queries can be slower than a star schema.

Example (Sales):

  • Fact table: Sales(product_id, store_id, time_id, units_sold, revenue)
  • Product dimension normalized into Product(product_id, name, category_id)Category(category_id, category_name).
  • Store dimension normalized into Store(store_id, city_id)City(city_id, city_name, state_id)State(state_id, state_name).

Here the Product → Category and Store → City → State chains branch out like a snowflake, unlike the star schema where each dimension is a single flat denormalized table.

schema
6short5 marks

Define support and confidence in association rule mining.

For an association rule ABA \Rightarrow B over a transaction database:

Support — the fraction of all transactions that contain both AA and BB. It measures how frequently the itemset occurs.

Support(A \Rightarrow B) = P(A \cup B) = \frac{\text{# transactions containing } A \text{ and } B}{\text{Total transactions}}

Confidence — the fraction of transactions containing AA that also contain BB. It measures the reliability/strength of the rule.

Confidence(AB)=P(BA)=Support(AB)Support(A)Confidence(A \Rightarrow B) = P(B \mid A) = \frac{Support(A \cup B)}{Support(A)}

Example: If 100 transactions, 20 contain {bread, butter} and 25 contain bread, then Support=20/100=20%Support = 20/100 = 20\% and Confidence=20/25=80%Confidence = 20/25 = 80\%. Rules are kept only if they meet a user-set minimum support and minimum confidence.

association-rules
7short5 marks

What is data cleaning? Explain methods to handle missing values.

Data Cleaning

Data cleaning is the preprocessing step that detects and corrects (or removes) errors, inconsistencies, noise, duplicates, and missing values from data so that mining produces reliable results. It includes handling missing values, smoothing noisy data, identifying outliers, and resolving inconsistencies.

Methods to Handle Missing Values

  1. Ignore the tuple — drop records with missing class label / many missing values (used when missing data is small).
  2. Fill manually — domain expert enters the value (accurate but tedious, impractical for large data).
  3. Use a global constant — replace with a constant like "Unknown" or -\infty.
  4. Use the attribute mean/median — fill with the average (or median) of that attribute.
  5. Use the class-wise mean — fill with the mean of the attribute for samples of the same class.
  6. Use the most probable value — predict it using regression, decision trees, Bayesian inference, or interpolation (most accurate).
preprocessing
8short5 marks

Explain the working of the Naive Bayes classifier.

Naive Bayes Classifier

A Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem with a strong ("naive") assumption that all attributes are conditionally independent given the class.

Bayes' theorem:

P(CX)=P(XC)P(C)P(X)P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}

where X=(x1,x2,,xn)X = (x_1, x_2, \dots, x_n) is the feature vector and CC a class.

Working:

  1. Compute the prior P(C)P(C) for each class from training data.
  2. By the independence assumption, the likelihood factorizes:
P(XC)=i=1nP(xiC)P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
  1. For a new instance, compute P(CX)P(C)iP(xiC)P(C \mid X) \propto P(C)\prod_i P(x_i \mid C) for every class. P(X)P(X) is constant and can be ignored.
  2. Assign the class with the highest posterior (MAP rule):
C^=argmaxC  P(C)iP(xiC)\hat{C} = \arg\max_C \; P(C)\prod_{i} P(x_i \mid C)

Notes: continuous attributes use a Gaussian density; Laplace smoothing avoids zero probabilities. It is simple, fast, and works well even when the independence assumption is only approximately true (e.g., spam filtering).

classificationnaive-bayes
9short5 marks

What is the difference between supervised and unsupervised learning?

Supervised learning uses labeled training data (input–output pairs) to learn a mapping that predicts the output for new inputs. Unsupervised learning uses unlabeled data and discovers hidden structure/patterns on its own.

AspectSupervisedUnsupervised
Training dataLabeled (has target/class)Unlabeled
GoalPredict label / valueFind structure / grouping
TasksClassification, regressionClustering, association, dimensionality reduction
ExamplesDecision tree, Naive Bayes, SVMk-means, hierarchical clustering, Apriori
FeedbackGuided by known answersNo feedback / ground truth

Example: classifying email as spam/not-spam (supervised) vs. grouping customers into segments without predefined labels (unsupervised).

machine-learning
10short5 marks

Explain hierarchical clustering briefly.

Hierarchical Clustering

Hierarchical clustering builds a tree (dendrogram) of nested clusters instead of a single flat partition. It does not require the number of clusters in advance. Two approaches:

1. Agglomerative (bottom-up):

  • Start with each object as its own cluster.
  • Repeatedly merge the two closest clusters.
  • Continue until all objects form one cluster.

2. Divisive (top-down):

  • Start with all objects in one cluster.
  • Repeatedly split clusters until each object is separate.

Distance (linkage) measures between clusters: single linkage (minimum/nearest pair), complete linkage (maximum/farthest pair), average linkage, and centroid linkage.

The resulting dendrogram can be "cut" at a chosen height to obtain a desired number of clusters. It is intuitive but computationally costly (O(n2)O(n^2) or worse) and merges/splits cannot be undone.

clustering
11short5 marks

What is a fact table and a dimension table?

In a star/snowflake schema of a data warehouse:

Fact Table

  • The central table that stores measurable, numeric facts (measures) of a business process — e.g., units_sold, revenue, quantity.
  • Contains foreign keys referencing all related dimension tables.
  • Typically very large (many rows), with few columns.

Dimension Table

  • Stores descriptive, textual attributes (context) used to filter/group facts — e.g., Time (day, month, year), Product (name, category), Customer, Location.
  • Has a primary key referenced by the fact table.
  • Usually smaller with many descriptive columns.

Relationship: A fact table connects to multiple dimension tables; dimensions provide the "who, what, when, where" while the fact table provides the "how much."

Example: Sales(time_id, product_id, store_id, units_sold) is the fact table; Time, Product, Store are dimension tables.

schema
12short5 marks

Write short notes on metadata in data warehousing.

Metadata in Data Warehousing

Metadata is "data about data" — it describes the structure, source, meaning, and operations of the data stored in the warehouse, and acts as a roadmap/directory for the entire DW.

Types of metadata:

  1. Business metadata — meaning of data in business terms: definitions, ownership, business rules, mappings (helps end users).
  2. Technical metadata — schema definitions, table/column structures, data types, indexes, access permissions (used by developers/DBAs).
  3. Operational (ETL) metadata — source-to-target mappings, transformation/cleaning rules, data lineage, load history, refresh schedules, and data currency.

Importance / uses:

  • Acts as a directory to locate and understand warehouse contents.
  • Guides the ETL process and ensures data lineage and consistency.
  • Helps in query building, mapping, and impact analysis.
  • Improves data quality, governance, and maintenance.

Metadata is held in a metadata repository within the bottom tier of the DW architecture.

metadata

Frequently asked questions

Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2075?
The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 paper come with solutions?
Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 paper?
The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.