Browse papers
A

Section A: Long Answer Questions

Attempt any TWO questions.

3 questions·10 marks each
1long10 marks

Define data warehousing. Explain the data warehouse design process and the components of a data warehouse with a diagram.

Data Warehousing

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process (W. H. Inmon). It consolidates data from multiple heterogeneous operational sources into a single, consistent store optimized for analytical querying (OLAP) rather than transaction processing.

Key characteristics:

  • Subject-oriented — organized around major subjects (sales, customer) rather than applications.
  • Integrated — data from different sources is cleaned and made consistent (naming, units, encoding).
  • Time-variant — keeps historical data to allow trend analysis.
  • Non-volatile — data is loaded and read but not normally updated/deleted by users.

Data Warehouse Design Process

The design typically follows these steps:

  1. Choose the business process to model (e.g., sales, shipments).
  2. Choose the grain — the level of detail of a fact-table record (e.g., one row per sale-line-item).
  3. Choose the dimensions that apply at that grain (time, item, branch, location).
  4. Choose the measures (facts) to populate each fact record (e.g., units_sold, dollars_sold).

Design approaches:

  • Top-down (Inmon): build an enterprise warehouse first, then derive data marts.
  • Bottom-up (Kimball): build dimensional data marts first, then integrate them via a bus architecture.

Components / Architecture

A three-tier architecture is commonly used:

  • Bottom tier – Warehouse database server: the relational store; data is brought in by ETL (Extract, Transform, Load) tools from operational DBs and external sources via the data staging area, plus metadata and gateways.
  • Middle tier – OLAP server: ROLAP or MOLAP server that presents a multidimensional view of data to users.
  • Top tier – Front-end client tools: query/reporting, analysis, and data-mining tools.

Diagram (described in words): At the left are operational source databases and flat/external files. Arrows feed into the ETL / data staging area, which loads the central data warehouse (with its metadata repository and data marts). The warehouse feeds the OLAP server, which in turn serves the front-end tools (query, report, analysis, mining) used by analysts.

data-warehousedesign
2long10 marks

What is cluster analysis? Explain the DBSCAN algorithm. Given a dataset with Eps = 2.5 and MinPts = 3, identify core points, border points, and outliers.

Cluster Analysis

Cluster analysis is the process of grouping a set of data objects into clusters such that objects within the same cluster are highly similar to one another and dissimilar to objects in other clusters. It is an unsupervised learning task (no predefined class labels). Similarity is usually measured with a distance metric such as Euclidean distance:

d(p,q)=i=1n(piqi)2d(p,q)=\sqrt{\sum_{i=1}^{n}(p_i-q_i)^2}

DBSCAN Algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are densely packed and marks low-density points as noise. It uses two parameters:

  • Eps (ε): radius of the neighborhood around a point.
  • MinPts: minimum number of points required within the ε-neighborhood (including the point itself) to form a dense region.

Point types:

  • Core point: has at least MinPts points within its ε-neighborhood.
  • Border point: has fewer than MinPts neighbors but lies within the ε-neighborhood of a core point.
  • Noise/outlier: neither a core nor a border point.
DBSCAN(D, Eps, MinPts):
  mark all points unvisited
  for each unvisited point p in D:
     mark p visited
     N = neighbors of p within Eps
     if |N| < MinPts: mark p as NOISE   // may later become border
     else:
        create new cluster C; add p to C
        for each point q in N (expand):
           if q unvisited:
              mark q visited
              N' = neighbors of q within Eps
              if |N'| >= MinPts: N = N + N'   // q is core
           if q not yet in any cluster: add q to C

Worked Example (Eps = 2.5, MinPts = 3)

(The original paper supplies a point table; classification is illustrated here with a representative set of 2-D points.)

PointCoordsNeighbors within Eps=2.5 (incl. self)CountType
A(1,1)A,B,C3Core
B(2,1)A,B,C3Core
C(1,2)A,B,C3Core
D(8,8)D,E2Border (in Eps of core E)
E(8,9)D,E,F3Core
F(9,9)E,F2Border
G(15,15)G1Outlier/Noise

Result: Core points = {A, B, C, E}; Border points = {D, F}; Outliers/Noise = {G}. Cluster 1 = {A,B,C}, Cluster 2 = {D,E,F}, and G is noise.

Method: For each point compute how many points fall within distance 2.5; ≥3 ⇒ core, else if reachable from a core ⇒ border, otherwise ⇒ outlier.

clusteringdbscan
3long10 marks

Explain the Apriori principle. Using the Apriori algorithm, generate frequent itemsets and association rules for a given transaction dataset.

Apriori Principle

The Apriori principle states: If an itemset is frequent, then all of its subsets must also be frequent. Equivalently (the contrapositive, used for pruning): if an itemset is infrequent, all of its supersets are also infrequent. This anti-monotone property of support lets the algorithm prune the search space — candidate kk-itemsets are generated only from frequent (k1)(k-1)-itemsets.

Key measures:

  • Support: supp(X)=transactions containing Xtotal transactions\text{supp}(X)=\dfrac{\text{transactions containing }X}{\text{total transactions}}
  • Confidence: conf(XY)=supp(XY)supp(X)\text{conf}(X\Rightarrow Y)=\dfrac{\text{supp}(X\cup Y)}{\text{supp}(X)}

Worked Example

Transactions (min_support = 2, min_confidence = 60%):

TIDItems
T1A, B, C
T2A, C
T3A, D
T4B, E, F
T5A, B, C

Step 1 – C1 / L1 (1-itemsets, count ≥ 2): A=4, B=3, C=3, D=1, E=1, F=1 → L1 = {A, B, C}.

Step 2 – C2 / L2 (2-itemsets): {A,B}=2, {A,C}=3, {B,C}=2 → L2 = {AB, AC, BC} (all ≥ 2).

Step 3 – C3 / L3 (3-itemsets): Candidate {A,B,C} (all 2-subsets are frequent). Count {A,B,C}=2 ≥ 2 → L3 = {ABC}.

Frequent itemsets: A, B, C, AB, AC, BC, ABC.

Step 4 – Association rules from {A,B,C} (support=2/5):

RuleConfidenceKeep?
A,B ⇒ C2/2 = 100%
A,C ⇒ B2/3 = 67%
B,C ⇒ A2/2 = 100%
A ⇒ B,C2/4 = 50%
B ⇒ A,C2/3 = 67%
C ⇒ A,B2/3 = 67%

Strong rules (conf ≥ 60%): {A,B}⇒C, {A,C}⇒B, {B,C}⇒A, B⇒{A,C}, C⇒{A,B}.

association-rulesapriori
B

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions·5 marks each
4short5 marks

What is data transformation? List its techniques.

Data Transformation

Data transformation is a data-preprocessing step in which data is converted or consolidated into forms appropriate for mining, so that the resulting mining process is more efficient and the patterns found are easier to understand.

Techniques:

  • Smoothing — remove noise from data (binning, regression, clustering).
  • Aggregation — summary or aggregation operations are applied (e.g., daily sales → monthly).
  • Generalization — low-level data replaced by higher-level concepts using concept hierarchies (e.g., street → city).
  • Normalization — scale attribute values to a small specified range, e.g.:
    • Min-max: v=vminmaxmin(newmaxnewmin)+newminv' = \dfrac{v - min}{max - min}(new_{max}-new_{min}) + new_{min}
    • Z-score: v=vμσv' = \dfrac{v - \mu}{\sigma}
    • Decimal scaling: v=v10jv' = \dfrac{v}{10^j}
  • Attribute/feature construction — new attributes are built from existing ones.
  • Discretization — continuous values replaced by interval or conceptual labels.
preprocessing
5short5 marks

Differentiate between ROLAP and MOLAP.

ROLAP vs MOLAP

AspectROLAP (Relational OLAP)MOLAP (Multidimensional OLAP)
StorageData stored in relational tables (star/snowflake schema)Data stored in multidimensional arrays/cubes
AccessSQL queries on RDBMSDirect array indexing
ScalabilityHandles very large data volumes wellLimited by cube size; less scalable for huge data
Query speedGenerally slower (joins computed at run time)Faster for typical queries (pre-aggregated cube)
SparsityHandles sparse data efficientlySparse cubes waste storage
Pre-computationAggregations computed on the fly or as materialized viewsAggregations pre-computed and stored in the cube
Example toolsMicrostrategy, MondrianEssbase, Cognos PowerPlay

Summary: ROLAP scales to large, sparse data using relational storage and SQL but is slower; MOLAP offers fast query response through pre-aggregated multidimensional cubes but is limited in scalability and wastes space on sparse data. HOLAP combines both.

olap
6short5 marks

Explain the bottom-up and top-down approaches of building a data warehouse.

Approaches to Building a Data Warehouse

Top-Down Approach (Inmon)

An enterprise-wide data warehouse (EDW) is built first as a single, centralized, normalized repository of all corporate data. Data marts (subject-specific subsets) are then created from this central warehouse.

  • Advantages: consistent enterprise view, single version of truth, easier to maintain dimensions.
  • Disadvantages: high initial cost, long development time, complex.

Bottom-Up Approach (Kimball)

Individual data marts are built first for specific business areas (e.g., sales, finance) and later integrated into a larger warehouse using conformed dimensions (the data warehouse bus architecture).

  • Advantages: faster delivery, lower initial cost, quick ROI, incremental.
  • Disadvantages: possible data redundancy/inconsistency if marts are not well coordinated.

In short: Top-down = warehouse first, then marts (Inmon); Bottom-up = marts first, then warehouse (Kimball).

data-warehouse
7short5 marks

What is a confusion matrix? How is it used to evaluate a classifier?

Confusion Matrix

A confusion matrix is an n×nn \times n table that summarizes the performance of a classifier by comparing predicted class labels with the actual class labels. For binary classification:

Predicted PositivePredicted Negative
Actual PositiveTP (True Positive)FN (False Negative)
Actual NegativeFP (False Positive)TN (True Negative)

Evaluating a Classifier

From these four counts, standard metrics are derived:

  • Accuracy =TP+TNTP+TN+FP+FN= \dfrac{TP+TN}{TP+TN+FP+FN} — overall fraction correctly classified.
  • Precision =TPTP+FP= \dfrac{TP}{TP+FP} — correctness of positive predictions.
  • Recall / Sensitivity =TPTP+FN= \dfrac{TP}{TP+FN} — coverage of actual positives.
  • Specificity =TNTN+FP= \dfrac{TN}{TN+FP}.
  • F1-score =2PrecisionRecallPrecision+Recall= \dfrac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}.
  • Error rate =1Accuracy= 1 - \text{Accuracy}.

The matrix also reveals which classes are confused (off-diagonal entries), which a single accuracy number hides — important for imbalanced datasets.

classificationevaluation
8short5 marks

Explain the concept of data cube.

Data Cube

A data cube is a multidimensional model that allows data to be viewed and analyzed across multiple dimensions (perspectives) and measures (numeric facts). Despite the name "cube," it can have nn dimensions, not just three.

  • Dimensions: the perspectives by which an organization keeps records, e.g., time, item, location, supplier. Each dimension may have a concept hierarchy (day → month → quarter → year).
  • Measures (facts): the numeric quantities being analyzed, e.g., dollars_sold, units_sold, stored in the central fact table.

Example: a 3-D cube for sales with axes time × item × location; each cell holds a measure such as dollars_sold for a particular (time, item, location) combination.

Apex cuboid / lattice: A data cube is actually a lattice of cuboids. The cuboid holding the lowest-level summary is the base cuboid; the 0-D cuboid holding the grand total is the apex cuboid (all). OLAP operations — roll-up, drill-down, slice, dice, pivot — navigate this cube to support interactive multidimensional analysis.

olap
9short5 marks

What is outlier analysis? Why is it important?

Outlier Analysis

An outlier is a data object that deviates significantly from the rest of the data, as if it were generated by a different mechanism. Outlier analysis (outlier/anomaly detection) is the process of identifying such objects.

Methods include:

  • Statistical (distribution-based): assume a data distribution and flag points with low probability (e.g., beyond μ±3σ\mu \pm 3\sigma).
  • Distance-based: an object is an outlier if a large fraction of objects lie farther than a distance DD.
  • Density-based (e.g., LOF): compare local density of a point with that of its neighbors.
  • Clustering-based: points that do not belong to any cluster (e.g., DBSCAN noise).

Why it is important:

  • In many applications the outliers are more interesting than the normal data: fraud detection (credit-card/insurance), network intrusion detection, fault/medical diagnosis, and detecting measurement errors.
  • Removing or treating outliers improves the quality of other mining tasks (clustering, regression) that are sensitive to extreme values.
outlier
10short5 marks

Define entropy and information gain.

Entropy

Entropy measures the impurity or uncertainty in a collection of examples. For a dataset DD with mm classes where pip_i is the proportion of class ii:

Entropy(D)=i=1mpilog2pi\text{Entropy}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i

Entropy is 0 when all objects belong to one class (pure) and maximal (=log2m\log_2 m, e.g. 1 for two equally likely classes) when classes are evenly mixed.

Information Gain

Information gain is the expected reduction in entropy achieved by partitioning the data on an attribute AA. If splitting on AA produces vv subsets DjD_j:

InfoA(D)=j=1vDjDEntropy(Dj)\text{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\text{Entropy}(D_j) Gain(A)=Entropy(D)InfoA(D)\text{Gain}(A) = \text{Entropy}(D) - \text{Info}_A(D)

The attribute with the highest information gain is chosen as the splitting attribute at each node in the ID3 / C4.5 decision-tree algorithms, because it best separates the classes.

decision-tree
11short5 marks

Explain the partitioning method of clustering.

Partitioning Method of Clustering

A partitioning method organizes nn objects into kk clusters (knk \le n), where each cluster contains at least one object and each object belongs to exactly one cluster. It uses an iterative relocation technique that moves objects between groups to improve a partition until an objective function (e.g., minimizing intra-cluster squared error) is optimized.

Objective (k-means squared error):

E=i=1kpCid(p,ci)2E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2

where cic_i is the centroid/representative of cluster CiC_i.

k-means algorithm:

1. Arbitrarily choose k objects as initial cluster centers.
2. Repeat:
   a. Assign each object to the cluster whose center is nearest.
   b. Recompute each cluster center as the mean of its objects.
3. Until the centers no longer change (convergence).
  • k-means uses the mean (centroid) as the cluster representative.
  • k-medoids (PAM) uses an actual object (medoid), making it more robust to outliers.

Limitations: must specify kk in advance, sensitive to initial seeds and outliers, and tends to find spherical clusters of similar size.

clustering
12short5 marks

Write short notes on web mining.

Web Mining

Web mining is the application of data-mining techniques to automatically discover and extract useful information, patterns, and knowledge from web data (documents, hyperlinks, server logs).

It is generally divided into three categories:

  • Web Content Mining: extracting useful information/knowledge from the content of web pages (text, images, structured records) — e.g., search-result classification, opinion mining.
  • Web Structure Mining: analyzing the hyperlink structure (the web graph) to discover authoritative pages and communities — e.g., Google's PageRank and the HITS (hubs and authorities) algorithm.
  • Web Usage Mining: mining web server logs / clickstream data to understand user navigation behavior — used for personalization, recommendation, and improving site design.

Applications: search engines, recommendation systems, e-commerce personalization, targeted advertising, and detecting web access patterns.

Challenges: the web is huge, heterogeneous, dynamic, semi-structured, and noisy, which makes mining difficult.

web-mining

Frequently asked questions

Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2077?
The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 paper come with solutions?
Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 paper?
The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.