BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2079 Nepal

Q: Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2079?

The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 question paper (regular, annual session) is available for free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.

Q: Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 paper come with solutions?

Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.

Q: How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 paper?

The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.

Q: Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?

Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.

Question 1 (Long answer, 10 marks)

What is a data warehouse? Explain the three-tier architecture of a data warehouse in detail with a neat diagram.

Answer 1

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process (W. H. Inmon). It is a central repository that consolidates data from multiple heterogeneous operational sources for analysis and reporting (OLAP) rather than transaction processing.

Key characteristics:

Subject-oriented: organized around major subjects (customer, product, sales) rather than applications.
Integrated: data from different sources is cleaned and made consistent (naming, units, encoding).
Time-variant: stores historical data with a time dimension (years of data).
Non-volatile: data is loaded and read, but not updated/deleted in real time.

Three-Tier Architecture

A data warehouse is commonly built using a three-tier architecture:

  +-------------------------------------------------------+
  | Top Tier: Front-End Tools                             |
  | (Query, Reporting, OLAP, Data Mining, Dashboards)     |
  +-------------------------------------------------------+
                          ^
                          |
  +-------------------------------------------------------+
  | Middle Tier: OLAP Server                              |
  | (ROLAP / MOLAP / HOLAP engine)                        |
  +-------------------------------------------------------+
                          ^
                          |
  +-------------------------------------------------------+
  | Bottom Tier: Data Warehouse Server (RDBMS)            |
  |   Data Marts | Metadata | Monitoring & Administration |
  +-------------------------------------------------------+
     ^         ^
     | ETL     | (Extract, Transform, Load)
  +-----------------------+
  | Operational DBs, Flat |
  | files, External data  |
  +-----------------------+

1. Bottom Tier — Data Warehouse Server (Data layer):

A back-end relational database that stores the warehouse data.
Data is fed from operational databases, flat files and external sources through ETL (Extract, Transform, Load) / gateways (ODBC, JDBC, OLEDB).
Also holds the metadata repository (definitions of data, source mappings) and tools for monitoring and administration.

2. Middle Tier — OLAP Server: Presents the multidimensional view of data to the user. Implemented as:

ROLAP (Relational OLAP): maps multidimensional operations to standard relational tables (star/snowflake schema). Scales to large data.
MOLAP (Multidimensional OLAP): stores data in special multidimensional array structures (data cubes) for fast retrieval.
HOLAP (Hybrid OLAP): combines ROLAP storage for detailed data with MOLAP for aggregates.

3. Top Tier — Front-End / Client Tools:

Tools used by end users: query and reporting tools, analysis tools, OLAP tools (slice, dice, drill-down), and data mining tools (prediction, classification, clustering).
Produces reports, charts and dashboards for decision making.

Conclusion: The separation into three tiers gives modularity, scalability and the ability to optimize each layer (storage, processing, presentation) independently.

Answer 2

K-means Algorithm

K-means is a partitional, unsupervised clustering algorithm that divides $n$ data points into $k$ clusters, where each point belongs to the cluster with the nearest mean (centroid). It minimises the within-cluster sum of squared errors (SSE):

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

Steps:

1. Choose the number of clusters $k$.
2. Initialise $k$ centroids (randomly or pick $k$ points).
3. Assignment: assign each point to the nearest centroid (Euclidean distance).
4. Update: recompute each centroid as the mean of points assigned to it.
5. Repeat steps 3–4 until centroids no longer change (convergence).

Worked Example

Let the points be $A(1,1),\ B(2,1),\ C(4,3),\ D(5,4)$ and $k = 2$. Take initial centroids $m_1 = A(1,1)$ and $m_2 = D(5,4)$, using Euclidean distance $d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$.

Iteration 1 — Assignment:

Point	dist to $m_1(1,1)$	dist to $m_2(5,4)$	Cluster
A(1,1)	0.00	5.00	1
B(2,1)	1.00	4.24	1
C(4,3)	3.61	1.41	2
D(5,4)	5.00	0.00	2

Cluster 1 = {A, B}, Cluster 2 = {C, D}.

Update centroids:

$m_1 = \left(\frac{1+2}{2}, \frac{1+1}{2}\right) = (1.5,\ 1)$
$m_2 = \left(\frac{4+5}{2}, \frac{3+4}{2}\right) = (4.5,\ 3.5)$

Iteration 2 — Assignment with new centroids:

Point	dist to $m_1(1.5,1)$	dist to $m_2(4.5,3.5)$	Cluster
A(1,1)	0.50	4.30	1
B(2,1)	0.50	3.54	1
C(4,3)	3.20	0.71	2
D(5,4)	4.61	0.71	2

Clusters are unchanged (Cluster 1 = {A,B}, Cluster 2 = {C,D}), so the centroids would not change again.

Convergence reached.

Cluster 1: {A(1,1), B(2,1)}, centroid $(1.5, 1)$
Cluster 2: {C(4,3), D(5,4)}, centroid $(4.5, 3.5)$

(If the exam supplies different points, apply the same assignment→update→repeat procedure.)
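The worked example above can be checked with a short script. This is a minimal sketch in plain Python, assuming Euclidean distance and the same initial centroids as the example; the function name `kmeans` and its parameters are illustrative, not part of the paper.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: repeat assignment and update until centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if not members:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
                continue
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        if new_centroids == centroids:  # convergence: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

points = [(1, 1), (2, 1), (4, 3), (5, 4)]  # A, B, C, D
labels, centroids = kmeans(points, [(1, 1), (5, 4)])
# Within-cluster sum of squared errors for the final clustering.
sse = sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
print(labels, centroids, round(sse, 2))
```

Running it reproduces the clusters {A, B} and {C, D} with centroids (1.5, 1) and (4.5, 3.5), and gives an SSE of 1.5 for this clustering.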

Answer 3

Classification

Classification is a supervised learning (data mining) technique that builds a model from a labelled training set to predict the categorical class label of new, unseen instances. It has two phases: (1) learning/training — build the classifier from training tuples whose class is known; (2) classification/testing — use the model to assign class labels to new data. Examples: spam vs. not-spam, loan = safe/risky.

K-Nearest Neighbour (KNN) Algorithm

KNN is a lazy, instance-based classifier — it stores all training data and classifies a new instance by majority vote of its $k$ closest neighbours.

Steps:

1. Choose $k$ (number of neighbours).
2. Compute the distance (usually Euclidean, $d=\sqrt{\sum (x_i - y_i)^2}$) from the new instance to every training point.
3. Select the $k$ nearest training points.
4. Assign the class that is most frequent (majority vote) among those $k$ neighbours.

Worked Example

Training data (attributes $X_1, X_2$, Class):

Point	$X_1$	$X_2$	Class
P1	1	1	A
P2	2	1	A
P3	4	4	B
P4	5	4	B

Classify new instance $Q = (2, 2)$ with $k = 3$.

Point	Distance to Q(2,2)
P1(1,1)	$\sqrt{1+1}=1.41$
P2(2,1)	$\sqrt{0+1}=1.00$
P3(4,4)	$\sqrt{4+4}=2.83$
P4(5,4)	$\sqrt{9+4}=3.61$

The 3 nearest neighbours are P2 (A), P1 (A), P3 (B) → votes: A = 2, B = 1.

Majority class = A, so $Q(2,2)$ is classified as Class A.

Note: $k$ is usually odd to avoid ties; attributes should be normalised so large-scale features do not dominate the distance.
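The KNN example above can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the worked example (Euclidean distance, unweighted majority vote, no normalisation); `knn_classify` is an illustrative name.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k):
    """train is a list of ((x1, x2), label); returns the majority label and the k nearest rows."""
    # Sort all training rows by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest

train = [((1, 1), "A"), ((2, 1), "A"), ((4, 4), "B"), ((5, 4), "B")]  # P1..P4
predicted, nearest = knn_classify(train, (2, 2), k=3)
print(predicted, nearest)
```

The three nearest rows come out as P2, P1 and P3, so the vote is A = 2, B = 1 and the prediction is class A, matching the hand calculation.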

Answer 4

Operational Database (OLTP) vs Data Warehouse (OLAP)

Aspect	Operational Database (OLTP)	Data Warehouse (OLAP)
Purpose	Day-to-day transaction processing	Analysis, reporting, decision support
Data	Current, detailed, up-to-date	Historical, summarized, integrated
Orientation	Application-oriented	Subject-oriented
Operations	Frequent INSERT/UPDATE/DELETE	Mostly read-only, complex queries
Design	Normalized (ER model)	De-normalized (star/snowflake)
Size	MB–GB	GB–TB and beyond
Access	Short, simple transactions	Long, ad-hoc analytical queries
Users	Clerks, operational staff	Analysts, managers, executives
Time	Real-time / current value	Time-variant (history retained)

In short, an operational database is optimized for fast, reliable transactions, while a data warehouse is optimized for query-intensive analysis over large historical, integrated data.

Answer 5

OLAP Operations

OLAP (Online Analytical Processing) operations manipulate the data cube to view data at different levels of detail and from different perspectives.

1. Roll-up (drill-up): Aggregates data by climbing up a concept hierarchy or reducing dimensions.

Example: Sales by city → rolled up to sales by country.

2. Drill-down (roll-down): Reverse of roll-up; navigates from less detailed to more detailed data.

Example: Sales by quarter → drilled down to sales by month.

3. Slice: Selects a single value of one dimension, producing a sub-cube with one fewer dimension (a 2-D view of a 3-D cube).

Example: From the cube (Time, Item, Location), slice where Time = Q1.

4. Dice: Selects a sub-cube by choosing ranges of values on two or more dimensions.

Example: Item ∈ {Mobile, Laptop} AND Location ∈ {Nepal, India} AND Time ∈ {Q1, Q2}.

5. Pivot (rotate): Rotates the data axes to give an alternative presentation of the same data (e.g., swap rows and columns).

Example: Swap the Location axis with the Time axis in a report.

(Some texts add Drill-across — across two or more fact tables — and Drill-through — to the underlying detailed/operational data.)
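The five operations can be illustrated on a tiny in-memory cube. The fact rows below are hypothetical values invented for illustration, and `roll_up` is an assumed helper name; a real OLAP server would perform these operations on stored cubes or star-schema tables.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, item, city, country, sales)
facts = [
    ("Q1", "Mobile", "Kathmandu", "Nepal", 10),
    ("Q1", "Laptop", "Pokhara",   "Nepal", 4),
    ("Q1", "Mobile", "Delhi",     "India", 20),
    ("Q2", "Laptop", "Kathmandu", "Nepal", 6),
    ("Q2", "Mobile", "Delhi",     "India", 15),
    ("Q3", "Tablet", "Pokhara",   "Nepal", 3),
]

def roll_up(rows, key):
    """Sum sales per group, where key(row) picks the level to group by."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[4]
    return dict(totals)

by_city = roll_up(facts, lambda r: r[2])      # detailed level (the drill-down view)
by_country = roll_up(facts, lambda r: r[3])   # roll-up: city -> country

slice_q1 = [r for r in facts if r[0] == "Q1"]  # slice: Time = Q1

dice = [r for r in facts                       # dice: value sets on three dimensions
        if r[1] in {"Mobile", "Laptop"}
        and r[3] in {"Nepal", "India"}
        and r[0] in {"Q1", "Q2"}]

# Pivot: present the same totals with quarter as the outer axis and country inside.
pivot = defaultdict(dict)
for (quarter, country), total in roll_up(facts, lambda r: (r[0], r[3])).items():
    pivot[quarter][country] = total

print(by_country, by_city, len(slice_q1), len(dice), dict(pivot))
```

Going from `by_city` to `by_country` is a roll-up, and reading the two in the opposite order is the corresponding drill-down.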

Answer 6

Market Basket Analysis

Market Basket Analysis (MBA) is an association-rule mining technique that discovers co-occurrence relationships among items purchased together in customer transactions. It answers: "Which products are frequently bought together?"

Results are expressed as association rules of the form $X \Rightarrow Y$ (if a customer buys $X$, they are likely to buy $Y$), evaluated by:

Support: $support(X \Rightarrow Y) = P(X \cup Y)$, the fraction of transactions that contain every item of both $X$ and $Y$ (here $X \cup Y$ is the union of the two itemsets, not "either $X$ or $Y$").
Confidence: $confidence(X \Rightarrow Y) = P(Y \mid X) = \dfrac{support(X \cup Y)}{support(X)}$, the reliability of the rule.
Lift: $\dfrac{confidence(X \Rightarrow Y)}{support(Y)}$, the strength of the rule beyond chance ($>1$ = positive correlation, $=1$ = independence, $<1$ = negative correlation).

Classic examples: {Bread, Butter} ⇒ {Milk} and the well-known {Diapers} ⇒ {Beer} rule.

Applications: store shelf layout, cross-selling and product bundling, recommendation systems, promotions and catalogue design. Algorithms such as Apriori and FP-Growth are used to find the frequent itemsets that drive these rules.
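The three measures can be computed directly from their definitions. The sketch below uses five hypothetical baskets made up for illustration; the helper names `support`, `confidence` and `lift` simply mirror the formulas above.

```python
# Hypothetical transactions, invented for illustration only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """P(Y | X) = support(X union Y) / support(X)."""
    return support(x | y) / support(x)

def lift(x, y):
    """confidence(X => Y) / support(Y); values above 1 suggest positive correlation."""
    return confidence(x, y) / support(y)

x, y = {"bread", "butter"}, {"milk"}
print(support(x | y), confidence(x, y), lift(x, y))
```

For these baskets the rule {bread, butter} ⇒ {milk} has support 0.4 and confidence of about 0.67, but its lift is below 1 because milk is already in 80% of all baskets. A rule can therefore look confident and still add nothing beyond chance.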

Answer 7

Candidate Generation in Apriori

In the Apriori algorithm, candidate $k$-itemsets ($C_k$) are generated from the frequent $(k{-}1)$-itemsets ($L_{k-1}$) using two steps. This relies on the Apriori property: every subset of a frequent itemset must also be frequent (the downward-closure property).

1. Join Step: Generate candidates by self-joining $L_{k-1} \bowtie L_{k-1}$. Two itemsets in $L_{k-1}$ are joined only if their first $(k-2)$ items are identical (items kept in sorted order); the result is a $k$-itemset.

2. Prune Step: Remove any candidate $k$-itemset that has any $(k-1)$-subset not present in $L_{k-1}$ (since it cannot be frequent). This prunes the search space before the expensive support-counting scan.

Example: Let $L_2 = \{ \{1,2\}, \{1,3\}, \{2,3\}, \{2,4\} \}$.

Join: $\{1,2\}\bowtie\{1,3\} \to \{1,2,3\}$ and $\{2,3\}\bowtie\{2,4\} \to \{2,3,4\}$ (each pair shares its first item). $\{1,2\}$ and $\{2,3\}$ are not joined because their first items differ.
Prune: For $\{1,2,3\}$ the 2-subsets are $\{1,2\},\{1,3\},\{2,3\}$, all in $L_2$ → keep. For $\{2,3,4\}$ the subset $\{3,4\} \notin L_2$ → pruned.
So $C_3 = \{ \{1,2,3\} \}$.

The remaining candidates are then counted against the database to form $L_3$.
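The join and prune steps can be sketched as a single function. This is a minimal illustration that assumes itemsets are stored as sorted tuples; the name `apriori_gen` follows common textbook usage and is not taken from the paper.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build candidate k-itemsets from frequent (k-1)-itemsets given as sorted tuples."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            # Join step: the first k-2 items must be identical.
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(sub in prev_set
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

L2 = [(1, 2), (1, 3), (2, 3), (2, 4)]
C3 = apriori_gen(L2)
print(C3)
```

With the $L_2$ from the example, the join produces (1, 2, 3) and (2, 3, 4), and the prune step drops (2, 3, 4) because (3, 4) is not frequent, leaving only (1, 2, 3).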

Answer 8

Overfitting in Classification

Overfitting occurs when a classification model learns the training data too well — including its noise, outliers and random fluctuations — so it has very high accuracy on the training set but poor accuracy (poor generalisation) on unseen test data. The model becomes overly complex and fits idiosyncrasies that are not true patterns.

Symptom: low training error but high test/validation error.

How to Avoid / Reduce Overfitting

Pruning (for decision trees): pre-pruning (stop growing early) or post-pruning (build the full tree, then remove weak branches).
Cross-validation: use k-fold cross-validation to tune the model and detect overfitting.
More / cleaner training data: larger, noise-free datasets reduce reliance on spurious patterns.
Reduce model complexity: limit tree depth, number of features, or model parameters; feature selection.
Regularization: penalise large/complex models (e.g., L1/L2 penalties).
Use a separate validation/test set and stop training when validation error starts to rise (early stopping).
Ensemble methods (bagging, random forests) that average many models to reduce variance.
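As an illustration of the cross-validation point in the list above, the sketch below shows how k-fold splitting partitions sample indices so that every sample is used for validation exactly once. It covers only the splitting logic; `k_fold_indices` is a hypothetical helper, and model training and scoring are left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k consecutive folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

In practice the data would usually be shuffled first. A model whose training score is much higher than its average validation score across the folds is likely overfitting.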

Answer 9

Silhouette Coefficient

The silhouette coefficient is an internal cluster-validity measure that evaluates how well each object lies within its cluster by combining cohesion (closeness to its own cluster) and separation (distance from other clusters).

For a data point $i$:

$a(i)$ = average distance from $i$ to all other points in its own cluster (cohesion).
$b(i)$ = minimum average distance from $i$ to points of any other cluster (separation).

The silhouette of point $i$ is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}$$

Interpretation ($-1 \le s(i) \le 1$):

$s(i) \approx 1$ → point is well clustered (far from neighbouring clusters).
$s(i) \approx 0$ → point lies on the border between two clusters.
$s(i) < 0$ → point is probably assigned to the wrong cluster.

The overall silhouette score is the mean of $s(i)$ over all points; a higher average indicates a better clustering. It is also used to choose the best number of clusters $k$ by picking the $k$ that maximises the average silhouette.
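The definitions of $a(i)$, $b(i)$ and $s(i)$ translate directly into code. The sketch below assumes Euclidean distance and that every cluster has at least two points. As an assumption made purely for illustration, it reuses four points A(1,1), B(2,1), C(4,3), D(5,4) split into clusters {A, B} and {C, D}.

```python
from math import dist

def silhouette_values(points, labels):
    """Return s(i) for every point; assumes each cluster has at least two points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        # a(i): mean distance to the other members of the point's own cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return scores

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
scores = silhouette_values(points, [0, 0, 1, 1])
mean_score = sum(scores) / len(scores)
print([round(s, 3) for s in scores], round(mean_score, 3))
```

All four values come out positive, with a mean of roughly 0.68, which indicates that each point sits closer to its own cluster than to the other one.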

Answer 10

Data Discretization

Data discretization is a data-reduction / preprocessing technique that transforms continuous (numeric) attributes into a finite number of discrete intervals or categorical labels. The original numeric values are replaced by interval/concept labels, which reduces data size and is required by algorithms that only handle categorical data (e.g., some decision-tree and association-rule methods).

Example: an Age attribute (0–100) discretized into {Child (0–12), Teen (13–19), Adult (20–59), Senior (60+)}.

Common methods:

Binning: partition into equal-width or equal-frequency (equal-depth) bins, then smooth by bin mean/median/boundaries — unsupervised.
Histogram analysis: uses histograms to find natural intervals.
Clustering: group similar values into clusters that become intervals.
Entropy / information-gain based splitting and chi-square based merging (e.g., ChiMerge): supervised, use class labels to choose the interval boundaries.
Concept hierarchy generation: roll values up (e.g., city → state → country) for numeric or nominal data.

Benefits: smaller, simpler data; reduced noise; enables categorical-only algorithms; produces more interpretable rules.
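The two unsupervised binning methods and the Age example can be sketched as follows. The list `ages` is hypothetical data, the function names are illustrative, and `equal_width_bins` assumes the values are not all identical (otherwise the bin width would be zero).

```python
from bisect import bisect_right

def equal_width_bins(values, n_bins):
    """Assign each value a bin index using intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bin indices so each bin holds (roughly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def age_label(age):
    """Concept labels from the Age example: Child 0-12, Teen 13-19, Adult 20-59, Senior 60+."""
    return ["Child", "Teen", "Adult", "Senior"][bisect_right([13, 20, 60], age)]

ages = [5, 13, 18, 22, 35, 47, 61, 64, 90]  # hypothetical values
print(equal_width_bins(ages, 3))
print(equal_frequency_bins(ages, 3))
print([age_label(a) for a in ages])
```

Equal-width binning puts four of the nine ages in the first bin because the values are skewed towards the low end, while equal-frequency binning puts three values in each bin.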

Answer 11

Star Schema vs Snowflake Schema

Both are multidimensional data-warehouse schemas with a central fact table surrounded by dimension tables; they differ in how the dimensions are organised.

Aspect	Star Schema	Snowflake Schema
Structure	Central fact table + single-level dimension tables	Fact table + dimensions split into multiple related sub-tables
Normalization	Dimensions are de-normalized	Dimensions are normalized (hierarchies split out)
Redundancy	Higher data redundancy	Lower redundancy
Joins	Fewer joins → faster queries	More joins → comparatively slower queries
Storage	Uses more space	Saves space
Complexity / design	Simple, easy to understand	More complex
Maintenance	Harder to maintain consistency	Easier (normalized)

Diagram (conceptually):

Star: Fact in the centre, each dimension one table radiating outward → looks like a star.
Snowflake: each dimension is further broken into sub-dimensions (e.g., City → State → Country) → branches resemble a snowflake.

Summary: the star schema favours query performance and simplicity; the snowflake schema favours normalization and reduced redundancy at the cost of extra joins.
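The difference in joins can be seen with a small in-memory SQLite database. The table names, column names and rows below are hypothetical, chosen only to show the same location dimension modelled both ways.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Star: one de-normalised location dimension (country repeated per city).
CREATE TABLE dim_location_star (loc_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Snowflake: the same dimension normalised into city -> country.
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_city (loc_id INTEGER PRIMARY KEY, city TEXT,
                       country_id INTEGER REFERENCES dim_country(country_id));
CREATE TABLE fact_sales (loc_id INTEGER, amount INTEGER);

INSERT INTO dim_location_star VALUES
    (1, 'Kathmandu', 'Nepal'), (2, 'Pokhara', 'Nepal'), (3, 'Delhi', 'India');
INSERT INTO dim_country VALUES (1, 'Nepal'), (2, 'India');
INSERT INTO dim_city VALUES (1, 'Kathmandu', 1), (2, 'Pokhara', 1), (3, 'Delhi', 2);
INSERT INTO fact_sales VALUES (1, 10), (2, 4), (3, 20), (1, 6);
""")

# Star schema: one join from the fact table to the dimension.
star = con.execute("""
    SELECT d.country, SUM(f.amount) FROM fact_sales f
    JOIN dim_location_star d ON d.loc_id = f.loc_id
    GROUP BY d.country ORDER BY d.country""").fetchall()

# Snowflake schema: an extra join to reach the country level.
snowflake = con.execute("""
    SELECT c.country, SUM(f.amount) FROM fact_sales f
    JOIN dim_city ci ON ci.loc_id = f.loc_id
    JOIN dim_country c ON c.country_id = ci.country_id
    GROUP BY c.country ORDER BY c.country""").fetchall()

print(star, snowflake)
```

Both queries return the same totals per country. The star version stores 'Nepal' once per city but needs one join, while the snowflake version stores it once overall and needs two.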

Answer 12

Spatial Data Mining

Spatial data mining is the process of discovering interesting, non-trivial and previously unknown patterns, relationships and trends from spatial (geographic / geo-referenced) data — data that has location, shape and topological attributes (points, lines, polygons, maps, satellite/GIS data).

Key idea: it exploits spatial relationships such as distance, direction, adjacency, containment and overlap that ordinary (non-spatial) mining ignores. It is guided by Tobler's first law of geography, "everything is related to everything else, but near things are more related than distant things" (spatial autocorrelation).

Major tasks/techniques:

Spatial classification & prediction — predict an attribute using location and neighbouring features.
Spatial clustering — group nearby objects (e.g., DBSCAN finds dense regions / hotspots).
Spatial association rules — e.g., is_a(x, school) ∧ close_to(x, park) ⇒ price_high.
Spatial outlier / hotspot detection — find locations that differ markedly from their neighbours.
Spatial trend analysis — how attributes change with distance/direction.

Applications: GIS, urban and transport planning, environmental and weather studies, epidemiology (disease hotspots), crime analysis, remote sensing and location-based services.

Challenges: large data volumes, complex spatial data types and indexing (R-trees, quad-trees), and the need to model spatial autocorrelation.

Level	BSc CSIT (TU)
Stream	Science
Subject	Data Warehousing and Data Mining (BSc CSIT, CSC410)
Year	2079 BS
Exam session	Regular (annual)
Full marks	60
Time allowed	180 minutes
Questions	12, all with step-by-step solutions
