Browse papers
A

Section A: Long Answer Questions

Attempt any TWO questions.

3 questions·10 marks each
1long10 marks

What is a data warehouse? Explain the three-tier architecture of a data warehouse in detail with a neat diagram.

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process (W. H. Inmon). It is a central repository that consolidates data from multiple heterogeneous operational sources for analysis and reporting (OLAP) rather than transaction processing.

Key characteristics:

  • Subject-oriented: organized around major subjects (customer, product, sales) rather than applications.
  • Integrated: data from different sources is cleaned and made consistent (naming, units, encoding).
  • Time-variant: stores historical data with a time dimension (years of data).
  • Non-volatile: data is loaded and read, but not updated/deleted in real time.

Three-Tier Architecture

A data warehouse is commonly built using a three-tier architecture:

  +-------------------------------------------------------+
  | Top Tier: Front-End Tools                             |
  | (Query, Reporting, OLAP, Data Mining, Dashboards)     |
  +-------------------------------------------------------+
                          ^
                          |
  +-------------------------------------------------------+
  | Middle Tier: OLAP Server                              |
  | (ROLAP / MOLAP / HOLAP engine)                        |
  +-------------------------------------------------------+
                          ^
                          |
  +-------------------------------------------------------+
  | Bottom Tier: Data Warehouse Server (RDBMS)            |
  |   Data Marts | Metadata | Monitoring & Administration |
  +-------------------------------------------------------+
     ^         ^
     | ETL     | (Extract, Transform, Load)
  +-----------------------+
  | Operational DBs, Flat |
  | files, External data  |
  +-----------------------+

1. Bottom Tier — Data Warehouse Server (Data layer):

  • A back-end relational database that stores the warehouse data.
  • Data is fed from operational databases, flat files and external sources through ETL (Extract, Transform, Load) / gateways (ODBC, JDBC, OLEDB).
  • Also holds the metadata repository (definitions of data, source mappings) and tools for monitoring and administration.

2. Middle Tier — OLAP Server: Presents the multidimensional view of data to the user. Implemented as:

  • ROLAP (Relational OLAP): maps multidimensional operations to standard relational tables (star/snowflake schema). Scales to large data.
  • MOLAP (Multidimensional OLAP): stores data in special multidimensional array structures (data cubes) for fast retrieval.
  • HOLAP (Hybrid OLAP): combines ROLAP storage for detailed data with MOLAP for aggregates.

3. Top Tier — Front-End / Client Tools:

  • Tools used by end users: query and reporting tools, analysis tools, OLAP tools (slice, dice, drill-down), and data mining tools (prediction, classification, clustering).
  • Produces reports, charts and dashboards for decision making.

Conclusion: The separation into three tiers gives modularity, scalability and the ability to optimize each layer (storage, processing, presentation) independently.

data-warehousearchitecture
2long10 marks

Explain the K-means algorithm. Cluster the given set of points into two clusters and show all iterations until convergence.

K-means Algorithm

K-means is a partitional, unsupervised clustering algorithm that divides nn data points into kk clusters, where each point belongs to the cluster with the nearest mean (centroid). It minimises the within-cluster sum of squared errors (SSE):

SSE=i=1kxCixμi2SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

Steps:

  1. Choose the number of clusters kk.
  2. Initialise kk centroids (randomly or pick kk points).
  3. Assignment: assign each point to the nearest centroid (Euclidean distance).
  4. Update: recompute each centroid as the mean of points assigned to it.
  5. Repeat steps 3–4 until centroids no longer change (convergence).

Worked Example

Let the points be A(1,1), B(2,1), C(4,3), D(5,4)A(1,1),\ B(2,1),\ C(4,3),\ D(5,4) and k=2k = 2. Take initial centroids m1=A(1,1)m_1 = A(1,1) and m2=D(5,4)m_2 = D(5,4) (Euclidean distance, d=(x1x2)2+(y1y2)2d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}).

Iteration 1 — Assignment:

Pointdist to m1(1,1)m_1(1,1)dist to m2(5,4)m_2(5,4)Cluster
A(1,1)0.005.001
B(2,1)1.004.241
C(4,3)3.611.412
D(5,4)5.000.002

Cluster 1 = {A, B}, Cluster 2 = {C, D}.

Update centroids:

  • m1=(1+22,1+12)=(1.5, 1)m_1 = \left(\frac{1+2}{2}, \frac{1+1}{2}\right) = (1.5,\ 1)
  • m2=(4+52,3+42)=(4.5, 3.5)m_2 = \left(\frac{4+5}{2}, \frac{3+4}{2}\right) = (4.5,\ 3.5)

Iteration 2 — Assignment with new centroids:

Pointdist to m1(1.5,1)m_1(1.5,1)dist to m2(4.5,3.5)m_2(4.5,3.5)Cluster
A(1,1)0.504.301
B(2,1)0.503.541
C(4,3)3.200.712
D(5,4)4.610.712

Clusters are unchanged (Cluster 1 = {A,B}, Cluster 2 = {C,D}), so the centroids would not change again.

Convergence reached.

  • Cluster 1: {A(1,1), B(2,1)}, centroid (1.5,1)(1.5, 1)
  • Cluster 2: {C(4,3), D(5,4)}, centroid (4.5,3.5)(4.5, 3.5)

(If the exam supplies different points, apply the same assignment→update→repeat procedure.)

clusteringkmeans
3long10 marks

What is classification? Explain the K-Nearest Neighbour (KNN) algorithm and classify a new instance for a given dataset.

Classification

Classification is a supervised learning (data mining) technique that builds a model from a labelled training set to predict the categorical class label of new, unseen instances. It has two phases: (1) learning/training — build the classifier from training tuples whose class is known; (2) classification/testing — use the model to assign class labels to new data. Examples: spam vs. not-spam, loan = safe/risky.

K-Nearest Neighbour (KNN) Algorithm

KNN is a lazy, instance-based classifier — it stores all training data and classifies a new instance by majority vote of its kk closest neighbours.

Steps:

  1. Choose kk (number of neighbours).
  2. Compute the distance (usually Euclidean, d=(xiyi)2d=\sqrt{\sum (x_i - y_i)^2}) from the new instance to every training point.
  3. Select the kk nearest training points.
  4. Assign the class that is most frequent (majority vote) among those kk neighbours.

Worked Example

Training data (attributes X1,X2X_1, X_2, Class):

PointX1X_1X2X_2Class
P111A
P221A
P344B
P454B

Classify new instance Q=(2,2)Q = (2, 2) with k=3k = 3.

PointDistance to Q(2,2)
P1(1,1)1+1=1.41\sqrt{1+1}=1.41
P2(2,1)0+1=1.00\sqrt{0+1}=1.00
P3(4,4)4+4=2.83\sqrt{4+4}=2.83
P4(5,4)9+4=3.61\sqrt{9+4}=3.61

The 3 nearest neighbours are P2 (A), P1 (A), P3 (B) → votes: A = 2, B = 1.

Majority class = A, so Q(2,2)Q(2,2) is classified as Class A.

Note: kk is usually odd to avoid ties; attributes should be normalised so large-scale features do not dominate the distance.

classificationknn
B

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions·5 marks each
4short5 marks

Differentiate between operational database and data warehouse.

Operational Database (OLTP) vs Data Warehouse (OLAP)

AspectOperational Database (OLTP)Data Warehouse (OLAP)
PurposeDay-to-day transaction processingAnalysis, reporting, decision support
DataCurrent, detailed, up-to-dateHistorical, summarized, integrated
OrientationApplication-orientedSubject-oriented
OperationsFrequent INSERT/UPDATE/DELETEMostly read-only, complex queries
DesignNormalized (ER model)De-normalized (star/snowflake)
SizeMB–GBGB–TB and beyond
AccessShort, simple transactionsLong, ad-hoc analytical queries
UsersClerks, operational staffAnalysts, managers, executives
TimeReal-time / current valueTime-variant (history retained)

In short, an operational database is optimized for fast, reliable transactions, while a data warehouse is optimized for query-intensive analysis over large historical, integrated data.

data-warehouse
5short5 marks

Explain the OLAP operations with examples.

OLAP Operations

OLAP (Online Analytical Processing) operations manipulate the data cube to view data at different levels of detail and from different perspectives.

1. Roll-up (drill-up): Aggregates data by climbing up a concept hierarchy or reducing dimensions.

  • Example: Sales by city → rolled up to sales by country.

2. Drill-down (roll-down): Reverse of roll-up; navigates from less detailed to more detailed data.

  • Example: Sales by quarter → drilled down to sales by month.

3. Slice: Selects a single value of one dimension, producing a sub-cube (2-D view).

  • Example: From the cube (Time, Item, Location), slice where Time = Q1.

4. Dice: Selects a sub-cube by choosing ranges of values on two or more dimensions.

  • Example: Item ∈ {Mobile, Laptop} AND Location ∈ {Nepal, India} AND Time ∈ {Q1, Q2}.

5. Pivot (rotate): Rotates the data axes to give an alternative presentation of the same data (e.g., swap rows and columns).

  • Example: Swap the Location axis with the Time axis in a report.

(Some texts add Drill-across — across two or more fact tables — and Drill-through — to the underlying detailed/operational data.)

olap
6short5 marks

What is market basket analysis?

Market Basket Analysis

Market Basket Analysis (MBA) is an association-rule mining technique that discovers co-occurrence relationships among items purchased together in customer transactions. It answers: "Which products are frequently bought together?"

Results are expressed as association rules of the form XYX \Rightarrow Y (if a customer buys XX, they are likely to buy YY), evaluated by:

  • Support: support(XY)=P(XY)support(X \Rightarrow Y) = P(X \cup Y) — fraction of transactions containing both.
  • Confidence: confidence(XY)=P(YX)=support(XY)support(X)confidence(X \Rightarrow Y) = P(Y \mid X) = \dfrac{support(X \cup Y)}{support(X)} — reliability of the rule.
  • Lift: confidence(XY)support(Y)\dfrac{confidence(X \Rightarrow Y)}{support(Y)} — strength beyond chance (>1>1 = positive correlation).

Classic example: the {Bread, Butter} ⇒ {Milk} or the well-known {Diapers} ⇒ {Beer} rule.

Applications: store shelf layout, cross-selling and product bundling, recommendation systems, promotions and catalogue design. Algorithms such as Apriori and FP-Growth are used to find the frequent itemsets that drive these rules.

association-rules
7short5 marks

Explain the candidate generation step in Apriori.

Candidate Generation in Apriori

In the Apriori algorithm, candidate kk-itemsets (CkC_k) are generated from the frequent (k1)(k{-}1)-itemsets (Lk1L_{k-1}) using two steps. This relies on the Apriori property: every subset of a frequent itemset must also be frequent (the downward-closure property).

1. Join Step: Generate candidates by self-joining Lk1Lk1L_{k-1} \bowtie L_{k-1}. Two itemsets in Lk1L_{k-1} are joined only if their first (k2)(k-2) items are identical (items kept in sorted order); the result is a kk-itemset.

2. Prune Step: Remove any candidate kk-itemset that has any (k1)(k-1)-subset not present in Lk1L_{k-1} (since it cannot be frequent). This prunes the search space before the expensive support-counting scan.

Example: Let L2={{1,2},{1,3},{2,3},{2,4}}L_2 = \{ \{1,2\}, \{1,3\}, \{2,3\}, \{2,4\} \}.

  • Join: {1,2}{1,3}{1,2,3}\{1,2\}\bowtie\{1,3\} \to \{1,2,3\} (first item shared). {2,3}{2,4}{2,3,4}\{2,3\}\bowtie\{2,4\} \to \{2,3,4\}.
  • Prune: For {1,2,3}\{1,2,3\} subsets are {1,2},{1,3},{2,3}\{1,2\},\{1,3\},\{2,3\} — all in L2L_2keep. For {2,3,4}\{2,3,4\} subset {3,4}L2\{3,4\} \notin L_2pruned.
  • So C3={{1,2,3}}C_3 = \{ \{1,2,3\} \}.

The remaining candidates are then counted against the database to form L3L_3.

apriori
8short5 marks

What is overfitting in classification? How can it be avoided?

Overfitting in Classification

Overfitting occurs when a classification model learns the training data too well — including its noise, outliers and random fluctuations — so it has very high accuracy on the training set but poor accuracy (poor generalisation) on unseen test data. The model becomes overly complex and fits idiosyncrasies that are not true patterns.

Symptom: low training error but high test/validation error.

How to Avoid / Reduce Overfitting

  • Pruning (for decision trees): pre-pruning (stop growing early) or post-pruning (build the full tree, then remove weak branches).
  • Cross-validation: use k-fold cross-validation to tune the model and detect overfitting.
  • More / cleaner training data: larger, noise-free datasets reduce reliance on spurious patterns.
  • Reduce model complexity: limit tree depth, number of features, or model parameters; feature selection.
  • Regularization: penalise large/complex models (e.g., L1/L2 penalties).
  • Use a separate validation/test set and stop training when validation error starts to rise (early stopping).
  • Ensemble methods (bagging, random forests) that average many models to reduce variance.
classification
9short5 marks

Explain the silhouette coefficient for cluster evaluation.

Silhouette Coefficient

The silhouette coefficient is an internal cluster-validity measure that evaluates how well each object lies within its cluster by combining cohesion (closeness to its own cluster) and separation (distance from other clusters).

For a data point ii:

  • a(i)a(i) = average distance from ii to all other points in its own cluster (cohesion).
  • b(i)b(i) = minimum average distance from ii to points of any other cluster (separation).

The silhouette of point ii is:

s(i)=b(i)a(i)max{a(i), b(i)}s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}

Interpretation (1s(i)1-1 \le s(i) \le 1):

  • s(i)1s(i) \approx 1 → point is well clustered (far from neighbouring clusters).
  • s(i)0s(i) \approx 0 → point lies on the border between two clusters.
  • s(i)<0s(i) < 0 → point is probably assigned to the wrong cluster.

The overall silhouette score is the mean of s(i)s(i) over all points; a higher average indicates a better clustering. It is also used to choose the best number of clusters kk by picking the kk that maximises the average silhouette.

clusteringevaluation
10short5 marks

What is data discretization?

Data Discretization

Data discretization is a data-reduction / preprocessing technique that transforms continuous (numeric) attributes into a finite number of discrete intervals or categorical labels. The original numeric values are replaced by interval/concept labels, which reduces data size and is required by algorithms that only handle categorical data (e.g., some decision-tree and association-rule methods).

Example: an Age attribute (0–100) discretized into {Child (0–12), Teen (13–19), Adult (20–59), Senior (60+)}.

Common methods:

  • Binning: partition into equal-width or equal-frequency (equal-depth) bins, then smooth by bin mean/median/boundaries — unsupervised.
  • Histogram analysis: uses histograms to find natural intervals.
  • Clustering: group similar values into clusters that become intervals.
  • Entropy / information-gain based (e.g., ChiMerge): supervised, uses class labels to choose split points.
  • Concept hierarchy generation: roll values up (e.g., city → state → country) for numeric or nominal data.

Benefits: smaller, simpler data; reduced noise; enables categorical-only algorithms; produces more interpretable rules.

preprocessing
11short5 marks

Differentiate between star and snowflake schema.

Star Schema vs Snowflake Schema

Both are multidimensional data-warehouse schemas with a central fact table surrounded by dimension tables; they differ in how the dimensions are organised.

AspectStar SchemaSnowflake Schema
StructureCentral fact table + single-level dimension tablesFact table + dimensions split into multiple related sub-tables
NormalizationDimensions are de-normalizedDimensions are normalized (hierarchies split out)
RedundancyHigher data redundancyLower redundancy
JoinsFewer joins → faster queriesMore joins → comparatively slower queries
StorageUses more spaceSaves space
Complexity / designSimple, easy to understandMore complex
MaintenanceHarder to maintain consistencyEasier (normalized)

Diagram (conceptually):

  • Star: Fact in the centre, each dimension one table radiating outward → looks like a star.
  • Snowflake: each dimension is further broken into sub-dimensions (e.g., City → State → Country) → branches resemble a snowflake.

Summary: the star schema favours query performance and simplicity; the snowflake schema favours normalization and reduced redundancy at the cost of extra joins.

schema
12short5 marks

Write short notes on spatial data mining.

Spatial Data Mining

Spatial data mining is the process of discovering interesting, non-trivial and previously unknown patterns, relationships and trends from spatial (geographic / geo-referenced) data — data that has location, shape and topological attributes (points, lines, polygons, maps, satellite/GIS data).

Key idea: it exploits spatial relationships such as distance, direction, adjacency, containment and overlap that ordinary (non-spatial) mining ignores. It is guided by the principle that "everything is related to everything else, but nearby things are more related" (spatial autocorrelation).

Major tasks/techniques:

  • Spatial classification & prediction — predict an attribute using location and neighbouring features.
  • Spatial clustering — group nearby objects (e.g., DBSCAN finds dense regions / hotspots).
  • Spatial association rules — e.g., is_a(x, school) ∧ close_to(x, park) ⇒ price_high.
  • Spatial outlier / hotspot detection — find locations that differ markedly from their neighbours.
  • Spatial trend analysis — how attributes change with distance/direction.

Applications: GIS, urban and transport planning, environmental and weather studies, epidemiology (disease hotspots), crime analysis, remote sensing and location-based services.

Challenges: large data volumes, complex spatial data types and indexing (R-trees, quad-trees), and the need to model spatial autocorrelation.

spatial-mining

Frequently asked questions

Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2079?
The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 paper come with solutions?
Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 paper?
The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.