BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2077 Nepal

Q: Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2077?

The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 (Regular (annual)) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.

Q: Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 paper come with solutions?

Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.

Q: How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 paper?

The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.

Q: Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?

Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.

Question

1Long answer10 marks

Define data warehousing. Explain the data warehouse design process and the components of a data warehouse with a diagram.

data-warehousedesign

Answer 1

Data Warehousing

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process (W. H. Inmon). It consolidates data from multiple heterogeneous operational sources into a single, consistent store optimized for analytical querying (OLAP) rather than transaction processing.

Key characteristics:

Subject-oriented — organized around major subjects (sales, customer) rather than applications.
Integrated — data from different sources is cleaned and made consistent (naming, units, encoding).
Time-variant — keeps historical data to allow trend analysis.
Non-volatile — data is loaded and read but not normally updated/deleted by users.

Data Warehouse Design Process

The design typically follows these steps:

Choose the business process to model (e.g., sales, shipments).
Choose the grain — the level of detail of a fact-table record (e.g., one row per sale-line-item).
Choose the dimensions that apply at that grain (time, item, branch, location).
Choose the measures (facts) to populate each fact record (e.g., units_sold, dollars_sold).

Design approaches:

Top-down (Inmon): build an enterprise warehouse first, then derive data marts.
Bottom-up (Kimball): build dimensional data marts first, then integrate them via a bus architecture.

Components / Architecture

A three-tier architecture is commonly used:

Bottom tier – Warehouse database server: the relational store; data is brought in by ETL (Extract, Transform, Load) tools from operational DBs and external sources via the data staging area, plus metadata and gateways.
Middle tier – OLAP server: ROLAP or MOLAP server that presents a multidimensional view of data to users.
Top tier – Front-end client tools: query/reporting, analysis, and data-mining tools.

Diagram (described in words): At the left are operational source databases and flat/external files. Arrows feed into the ETL / data staging area, which loads the central data warehouse (with its metadata repository and data marts). The warehouse feeds the OLAP server, which in turn serves the front-end tools (query, report, analysis, mining) used by analysts.

Answer 2

Cluster Analysis

Cluster analysis is the process of grouping a set of data objects into clusters such that objects within the same cluster are highly similar to one another and dissimilar to objects in other clusters. It is an unsupervised learning task (no predefined class labels). Similarity is usually measured with a distance metric such as Euclidean distance:

d(p,q)=\sqrt{\sum_{i=1}^{n}(p_i-q_i)^2}

DBSCAN Algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are densely packed and marks low-density points as noise. It uses two parameters:

Eps (ε): radius of the neighborhood around a point.
MinPts: minimum number of points required within the ε-neighborhood (including the point itself) to form a dense region.

Point types:

Core point: has at least MinPts points within its ε-neighborhood.
Border point: has fewer than MinPts neighbors but lies within the ε-neighborhood of a core point.
Noise/outlier: neither a core nor a border point.

DBSCAN(D, Eps, MinPts):
  mark all points unvisited
  for each unvisited point p in D:
     mark p visited
     N = neighbors of p within Eps
     if |N| < MinPts: mark p as NOISE   // may later become border
     else:
        create new cluster C; add p to C
        for each point q in N (expand):
           if q unvisited:
              mark q visited
              N' = neighbors of q within Eps
              if |N'| >= MinPts: N = N + N'   // q is core
           if q not yet in any cluster: add q to C

Worked Example (Eps = 2.5, MinPts = 3)

(The original paper supplies a point table; classification is illustrated here with a representative set of 2-D points.)

Point	Coords	Neighbors within Eps=2.5 (incl. self)	Count	Type
A	(1,1)	A,B,C	3	Core
B	(2,1)	A,B,C	3	Core
C	(1,2)	A,B,C	3	Core
D	(8,8)	D,E	2	Border (in Eps of core E)
E	(8,9)	D,E,F	3	Core
F	(9,9)	E,F	2	Border
G	(15,15)	G	1	Outlier/Noise

Result: Core points = {A, B, C, E}; Border points = {D, F}; Outliers/Noise = {G}. Cluster 1 = {A,B,C}, Cluster 2 = {D,E,F}, and G is noise.

Method: For each point compute how many points fall within distance 2.5; ≥3 ⇒ core, else if reachable from a core ⇒ border, otherwise ⇒ outlier.

Answer 3

Apriori Principle

The Apriori principle states: If an itemset is frequent, then all of its subsets must also be frequent. Equivalently (the contrapositive, used for pruning): if an itemset is infrequent, all of its supersets are also infrequent. This anti-monotone property of support lets the algorithm prune the search space — candidate $k$ -itemsets are generated only from frequent $(k-1)$ -itemsets.

Key measures:

Support: $\text{supp}(X)=\dfrac{\text{transactions containing }X}{\text{total transactions}}$
Confidence: $\text{conf}(X\Rightarrow Y)=\dfrac{\text{supp}(X\cup Y)}{\text{supp}(X)}$

Worked Example

Transactions (min_support = 2, min_confidence = 60%):

TID	Items
T1	A, B, C
T2	A, C
T3	A, D
T4	B, E, F
T5	A, B, C

Step 1 – C1 / L1 (1-itemsets, count ≥ 2): A=4, B=3, C=3, D=1, E=1, F=1 → L1 = {A, B, C}.

Step 2 – C2 / L2 (2-itemsets): {A,B}=2, {A,C}=3, {B,C}=2 → L2 = {AB, AC, BC} (all ≥ 2).

Step 3 – C3 / L3 (3-itemsets): Candidate {A,B,C} (all 2-subsets are frequent). Count {A,B,C}=2 ≥ 2 → L3 = {ABC}.

Frequent itemsets: A, B, C, AB, AC, BC, ABC.

Step 4 – Association rules from {A,B,C} (support=2/5):

Rule	Confidence	Keep?
A,B ⇒ C	2/2 = 100%	✓
A,C ⇒ B	2/3 = 67%	✓
B,C ⇒ A	2/2 = 100%	✓
A ⇒ B,C	2/4 = 50%	✗
B ⇒ A,C	2/3 = 67%	✓
C ⇒ A,B	2/3 = 67%	✓

Strong rules (conf ≥ 60%): {A,B}⇒C, {A,C}⇒B, {B,C}⇒A, B⇒{A,C}, C⇒{A,B}.

Answer 4

Data Transformation

Data transformation is a data-preprocessing step in which data is converted or consolidated into forms appropriate for mining, so that the resulting mining process is more efficient and the patterns found are easier to understand.

Techniques:

Smoothing — remove noise from data (binning, regression, clustering).
Aggregation — summary or aggregation operations are applied (e.g., daily sales → monthly).
Generalization — low-level data replaced by higher-level concepts using concept hierarchies (e.g., street → city).
Normalization — scale attribute values to a small specified range, e.g.:
- Min-max: $v' = \dfrac{v - min}{max - min}(new_{max}-new_{min}) + new_{min}$
- Z-score: $v' = \dfrac{v - \mu}{\sigma}$
- Decimal scaling: $v' = \dfrac{v}{10^j}$
Attribute/feature construction — new attributes are built from existing ones.
Discretization — continuous values replaced by interval or conceptual labels.

Answer 5

ROLAP vs MOLAP

Aspect	ROLAP (Relational OLAP)	MOLAP (Multidimensional OLAP)
Storage	Data stored in relational tables (star/snowflake schema)	Data stored in multidimensional arrays/cubes
Access	SQL queries on RDBMS	Direct array indexing
Scalability	Handles very large data volumes well	Limited by cube size; less scalable for huge data
Query speed	Generally slower (joins computed at run time)	Faster for typical queries (pre-aggregated cube)
Sparsity	Handles sparse data efficiently	Sparse cubes waste storage
Pre-computation	Aggregations computed on the fly or as materialized views	Aggregations pre-computed and stored in the cube
Example tools	Microstrategy, Mondrian	Essbase, Cognos PowerPlay

Summary: ROLAP scales to large, sparse data using relational storage and SQL but is slower; MOLAP offers fast query response through pre-aggregated multidimensional cubes but is limited in scalability and wastes space on sparse data. HOLAP combines both.

Answer 6

Approaches to Building a Data Warehouse

Top-Down Approach (Inmon)

An enterprise-wide data warehouse (EDW) is built first as a single, centralized, normalized repository of all corporate data. Data marts (subject-specific subsets) are then created from this central warehouse.

Advantages: consistent enterprise view, single version of truth, easier to maintain dimensions.
Disadvantages: high initial cost, long development time, complex.

Bottom-Up Approach (Kimball)

Individual data marts are built first for specific business areas (e.g., sales, finance) and later integrated into a larger warehouse using conformed dimensions (the data warehouse bus architecture).

Advantages: faster delivery, lower initial cost, quick ROI, incremental.
Disadvantages: possible data redundancy/inconsistency if marts are not well coordinated.

In short: Top-down = warehouse first, then marts (Inmon); Bottom-up = marts first, then warehouse (Kimball).

Answer 7

Confusion Matrix

A confusion matrix is an $n \times n$ table that summarizes the performance of a classifier by comparing predicted class labels with the actual class labels. For binary classification:

	Predicted Positive	Predicted Negative
Actual Positive	TP (True Positive)	FN (False Negative)
Actual Negative	FP (False Positive)	TN (True Negative)

Evaluating a Classifier

From these four counts, standard metrics are derived:

Accuracy $= \dfrac{TP+TN}{TP+TN+FP+FN}$ — overall fraction correctly classified.
Precision $= \dfrac{TP}{TP+FP}$ — correctness of positive predictions.
Recall / Sensitivity $= \dfrac{TP}{TP+FN}$ — coverage of actual positives.
Specificity $= \dfrac{TN}{TN+FP}$ .
F1-score $= \dfrac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$ .
Error rate $= 1 - \text{Accuracy}$ .

The matrix also reveals which classes are confused (off-diagonal entries), which a single accuracy number hides — important for imbalanced datasets.

Answer 8

Data Cube

A data cube is a multidimensional model that allows data to be viewed and analyzed across multiple dimensions (perspectives) and measures (numeric facts). Despite the name "cube," it can have $n$ dimensions, not just three.

Dimensions: the perspectives by which an organization keeps records, e.g., time, item, location, supplier. Each dimension may have a concept hierarchy (day → month → quarter → year).
Measures (facts): the numeric quantities being analyzed, e.g., dollars_sold, units_sold, stored in the central fact table.

Example: a 3-D cube for sales with axes time × item × location; each cell holds a measure such as dollars_sold for a particular (time, item, location) combination.

Apex cuboid / lattice: A data cube is actually a lattice of cuboids. The cuboid holding the lowest-level summary is the base cuboid; the 0-D cuboid holding the grand total is the apex cuboid (all). OLAP operations — roll-up, drill-down, slice, dice, pivot — navigate this cube to support interactive multidimensional analysis.

Answer 9

Outlier Analysis

An outlier is a data object that deviates significantly from the rest of the data, as if it were generated by a different mechanism. Outlier analysis (outlier/anomaly detection) is the process of identifying such objects.

Methods include:

Statistical (distribution-based): assume a data distribution and flag points with low probability (e.g., beyond $\mu \pm 3\sigma$ ).
Distance-based: an object is an outlier if a large fraction of objects lie farther than a distance $D$ .
Density-based (e.g., LOF): compare local density of a point with that of its neighbors.
Clustering-based: points that do not belong to any cluster (e.g., DBSCAN noise).

Why it is important:

In many applications the outliers are more interesting than the normal data: fraud detection (credit-card/insurance), network intrusion detection, fault/medical diagnosis, and detecting measurement errors.
Removing or treating outliers improves the quality of other mining tasks (clustering, regression) that are sensitive to extreme values.

Answer 10

Entropy

Entropy measures the impurity or uncertainty in a collection of examples. For a dataset $D$ with $m$ classes where $p_i$ is the proportion of class $i$ :

\text{Entropy}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i

Entropy is 0 when all objects belong to one class (pure) and maximal (= $\log_2 m$ , e.g. 1 for two equally likely classes) when classes are evenly mixed.

Information Gain

Information gain is the expected reduction in entropy achieved by partitioning the data on an attribute $A$ . If splitting on $A$ produces $v$ subsets $D_j$ :

\text{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\text{Entropy}(D_j)

\text{Gain}(A) = \text{Entropy}(D) - \text{Info}_A(D)

The attribute with the highest information gain is chosen as the splitting attribute at each node in the ID3 / C4.5 decision-tree algorithms, because it best separates the classes.

Answer 11

Partitioning Method of Clustering

A partitioning method organizes $n$ objects into $k$ clusters ( $k \le n$ ), where each cluster contains at least one object and each object belongs to exactly one cluster. It uses an iterative relocation technique that moves objects between groups to improve a partition until an objective function (e.g., minimizing intra-cluster squared error) is optimized.

Objective (k-means squared error):

E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2

where $c_i$ is the centroid/representative of cluster $C_i$ .

k-means algorithm:

1. Arbitrarily choose k objects as initial cluster centers.
2. Repeat:
   a. Assign each object to the cluster whose center is nearest.
   b. Recompute each cluster center as the mean of its objects.
3. Until the centers no longer change (convergence).

k-means uses the mean (centroid) as the cluster representative.
k-medoids (PAM) uses an actual object (medoid), making it more robust to outliers.

Limitations: must specify $k$ in advance, sensitive to initial seeds and outliers, and tends to find spherical clusters of similar size.

Answer 12

Web Mining

Web mining is the application of data-mining techniques to automatically discover and extract useful information, patterns, and knowledge from web data (documents, hyperlinks, server logs).

It is generally divided into three categories:

Web Content Mining: extracting useful information/knowledge from the content of web pages (text, images, structured records) — e.g., search-result classification, opinion mining.
Web Structure Mining: analyzing the hyperlink structure (the web graph) to discover authoritative pages and communities — e.g., Google's PageRank and the HITS (hubs and authorities) algorithm.
Web Usage Mining: mining web server logs / clickstream data to understand user navigation behavior — used for personalization, recommendation, and improving site design.

Applications: search engines, recommendation systems, e-commerce personalization, targeted advertising, and detecting web access patterns.

Challenges: the web is huge, heterogeneous, dynamic, semi-structured, and noisy, which makes mining difficult.

Level	BSc CSIT (TU)
Stream	Science
Subject	Data Warehousing and Data Mining (BSc CSIT, CSC410)
Year	2077 BS
Exam session	Regular (annual)
Full marks	60
Time allowed	180 minutes
Questions	12, all with step-by-step solutions