BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2080 Nepal

Q: Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2080?

The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 (Regular (annual)) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.

Q: Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 paper come with solutions?

Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.

Q: How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 paper?

The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.

Q: Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?

Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.

Question

1Long answer10 marks

Explain the data warehouse architecture and the ETL process in detail. Discuss the role of metadata in a data warehouse.

data-warehouseetl

Answer 1

Data Warehouse Architecture

A data warehouse (DW) is a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. Its architecture is typically organized in three tiers.

Three-Tier Architecture

Bottom Tier (Warehouse Database Server): A relational database system. Data is extracted from operational/external sources, cleaned, transformed and loaded here via back-end (ETL) tools. This tier also holds the metadata repository.
Middle Tier (OLAP Server): Presents a multidimensional view of data. Implemented as a ROLAP (relational OLAP) server that maps operations on multidimensional data to standard relational operations, or a MOLAP (multidimensional OLAP) server that directly implements multidimensional data and operations.
Top Tier (Front-End Client): Contains query, reporting, analysis and data-mining tools.

Data typically flows: Operational/external sources → Staging area → Enterprise Data Warehouse → Data Marts → OLAP/Analysis tools.

The ETL Process

ETL = Extract, Transform, Load — the back-end process that populates the warehouse.

Extract: Read data from heterogeneous sources (RDBMS, flat files, ERP, web logs). Can be full or incremental.
Transform: The cleaning and integration step performed in a staging area:
- Data cleaning — remove noise, fill missing values, correct inconsistencies.
- Integration — resolve schema/naming conflicts, merge sources.
- Transformation — normalization, aggregation, derivation of new attributes, surrogate-key generation.
Load: Write the transformed data into the warehouse fact and dimension tables, and refresh indexes/aggregates. May be done by bulk load and refreshed periodically.

Role of Metadata

Metadata is "data about data" — it describes the warehouse contents and structure. It is stored in the metadata repository and includes:

Structure/definition metadata: schema, dimensions, hierarchies, derived-data definitions, data-mart locations.
Operational metadata: data lineage (history/source of migrated data), currency (active, archived, purged), monitoring/usage statistics.
ETL/mapping metadata: source-to-target mappings, cleaning and transformation rules, refresh schedules.
Business metadata: business terms, ownership, and charging policies.

Metadata acts as the directory that helps users locate data, the guide for the ETL mapping, and the basis for query optimization and data lineage — making it central to building, using and maintaining the warehouse.

Answer 2

FP-Growth Algorithm

FP-Growth (Frequent Pattern Growth) mines frequent itemsets without candidate generation, unlike Apriori. It compresses the database into a compact FP-tree and then mines patterns recursively using a divide-and-conquer strategy. It requires only two database scans.

Steps:

Scan DB once, count item supports, discard items below min_support, and sort frequent items in descending support order (F-list).
Scan DB again; for each transaction, sort its frequent items by the F-list order and insert into the FP-tree (shared prefixes share nodes; node counts incremented).
Build a header table linking all nodes of each item.
Mine recursively: for each item (least frequent first), build its conditional pattern base → conditional FP-tree → generate frequent patterns.

Worked Example

Let min_support = 2. Transactions:

TID	Items
T1	A, B, C
T2	A, C, D
T3	A, B, D
T4	B, C, D
T5	A, B, C, D

Step 1 — Support counts: A=4, B=4, C=4, D=4. All ≥ 2. F-list (desc, ties broken alphabetically): A, B, C, D.

Step 2 — Ordered transactions:

T1: A,B,C T2: A,C,D T3: A,B,D T4: B,C,D T5: A,B,C,D

Step 3 — FP-tree (described):

root
├─ A:4
│  ├─ B:3 ── C:2 ── D:1
│  │         └─ D:1   (B-C-D path count)
│  ├─ C:1 ── D:1
│  └─ (B:3 also has D:1 child for A-B-D)
└─ B:1 ── C:1 ── D:1   (T4: B,C,D, no A)

Header links chain all nodes of A, B, C, D.

Step 4 — Mine (bottom-up from D):

D: conditional pattern base = {A,B,C:1},{A,C:1},{A,B:1},{B,C:1}. Frequent: {A,D}=3, {B,D}=3, {C,D}=3.
C: base = {A,B:2},{A:1},{B:1}. Frequent: {A,C}=3, {B,C}=3.
B: base = {A:3}. Frequent: {A,B}=3.
A: single item, support 4.

Frequent patterns (support ≥ 2): {A}:4, {B}:4, {C}:4, {D}:4, {A,B}:3, {A,C}:3, {A,D}:3, {B,C}:3, {B,D}:3, {C,D}:3.

Advantages

No candidate generation, only two scans, compact tree → far faster than Apriori on dense data.

Answer 3

Cluster Analysis

Cluster analysis is the process of grouping a set of data objects into clusters such that objects within a cluster are highly similar to one another and dissimilar to objects in other clusters. It is unsupervised (no predefined class labels). Quality depends on the similarity/distance measure (e.g., Euclidean distance) and the method used. Applications: customer segmentation, image processing, outlier detection.

Comparison of Clustering Methods

Aspect	Partitioning	Hierarchical	Density-based
Idea	Divide n objects into k partitions, iteratively relocate to optimize a criterion	Build a tree (dendrogram) of nested clusters	Grow clusters as dense regions separated by sparse regions
k known?	Yes, k given in advance	No	No
Cluster shape	Spherical/convex only	Convex	Arbitrary shapes
Outliers	Sensitive	Sensitive	Handles noise/outliers
Example algorithm	k-means, k-medoids (PAM)	AGNES (agglomerative), DIANA (divisive)	DBSCAN, OPTICS

Partitioning Methods

Construct k partitions and iteratively improve them (assign each object to nearest centroid, recompute centroids).
k-means minimizes within-cluster squared error $\sum_{i}\sum_{x\in C_i}\|x-\mu_i\|^2$ . Fast, but needs k, sensitive to outliers and finds only spherical clusters.

Hierarchical Methods

Create a dendrogram. Agglomerative (bottom-up): start with each object as a cluster, repeatedly merge the two closest clusters. Divisive (top-down): start with one cluster and split.
No need to specify k in advance; cut the dendrogram at desired level. But merges/splits are irreversible and cost is high ( $O(n^2)$ or more).

Density-Based Methods

A cluster is a maximal set of density-connected points. DBSCAN uses parameters Eps (radius) and MinPts; classifies points as core, border, or noise.
Finds clusters of arbitrary shape, is robust to noise, and does not need k, but is sensitive to the Eps/MinPts settings.

Summary: Use partitioning when k is known and clusters are compact; hierarchical when a cluster hierarchy is desired; density-based for arbitrary-shaped clusters with noise.

Answer 4

A data warehouse has four key characteristics (Inmon's definition):

Subject-oriented: Organized around major subjects/entities of the enterprise (e.g., customer, product, sales) rather than around day-to-day operations/transactions. It focuses on modelling and analysis for decision makers.
Integrated: Built by integrating multiple heterogeneous sources (RDBMS, flat files). Data-cleaning and integration techniques ensure consistency in naming conventions, encoding, units and attribute measures.
Time-variant: Data is stored to provide a historical perspective (e.g., last 5–10 years). Every key structure contains, implicitly or explicitly, a time element, so the warehouse keeps snapshots over time rather than just current values.
Non-volatile: Data is physically separated from operational data and is loaded read-only. Operational update/delete do not occur; the warehouse normally needs only two operations: initial loading of data and access (query) of data.

Answer 5

Data Cube

A data cube is a multidimensional model that allows data to be viewed and analyzed in terms of dimensions and measures (facts).

Dimensions are the perspectives/entities for analysis (e.g., time, item, location), each with a hierarchy (e.g., day → month → quarter → year).
Measures are numeric quantities stored in cells (e.g., sales amount, units sold).

A cube generalizes a 2-D table to n dimensions. The set of all cubes from the most detailed (base cuboid) to the most aggregated (apex cuboid) is the data-cube lattice (cuboids).

Aggregation

Aggregation is the summarization of measure values across one or more dimensions using functions such as SUM, COUNT, AVG, MIN, MAX.

Rolling up time from month to year aggregates sales for the whole year.
The apex cuboid (all) holds the total aggregate over every dimension.

Example: For a Sales cube with dimensions (time, item, location), the cell (2023, Laptop, Kathmandu) stores total sales for that combination. Aggregating over location gives total laptop sales in 2023 across all cities. OLAP operations like roll-up increase aggregation, while drill-down reduces it.

Answer 6

Frequent Itemset

An itemset is a set of one or more items. The support of an itemset $X$ is the fraction (or count) of transactions that contain $X$ :

\text{support}(X) = \frac{\text{No. of transactions containing } X}{\text{Total no. of transactions}}

An itemset is frequent if its support is greater than or equal to a user-specified minimum support threshold ( $min\_sup$ ). A k-itemset contains k items.

Example

TID	Items
T1	Bread, Milk
T2	Bread, Butter
T3	Bread, Milk, Butter
T4	Milk, Butter

Let $min\_sup = 50\%$ (i.e., count ≥ 2 of 4 transactions).

{Bread} = 3, {Milk} = 3, {Butter} = 3 → all frequent.
{Bread, Milk} = 2 → frequent. {Bread, Butter} = 2 → frequent. {Milk, Butter} = 2 → frequent.
{Bread, Milk, Butter} = 1 → not frequent.

So {Bread, Milk}, {Bread, Butter}, {Milk, Butter} are frequent 2-itemsets. Frequent itemsets form the basis for generating association rules.

Answer 7

Gini Index as a Splitting Criterion

The Gini index measures the impurity of a dataset $D$ and is used (e.g., in CART) to choose the best attribute to split on in a decision tree. For a dataset with $m$ classes and $p_i$ the probability of class $i$ :

Gini(D) = 1 - \sum_{i=1}^{m} p_i^{2}

$Gini = 0$ means the node is pure (all one class); higher values mean more impurity.

Splitting: When attribute $A$ splits $D$ into subsets $D_1$ and $D_2$ , the Gini index of the split is the weighted sum:

Gini_A(D) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)

The reduction in impurity is $\Delta Gini = Gini(D) - Gini_A(D)$ . The attribute (and split point) giving the maximum reduction (i.e., the minimum $Gini_A(D)$ ) is chosen.

Example: If a node has 6 positive and 4 negative samples (10 total):

Gini = 1 - (0.6^2 + 0.4^2) = 1 - (0.36 + 0.16) = 0.48

The split producing the lowest weighted Gini is selected. CART uses Gini to create binary splits.

Answer 8

Agglomerative (Bottom-Up) Hierarchical Clustering

The agglomerative approach (AGNES) is a hierarchical clustering method that works bottom-up:

Start by placing each data object in its own cluster (n clusters for n objects).
Repeatedly merge the two closest (most similar) clusters into one.
Continue until all objects are in a single cluster or a stopping condition (desired number of clusters) is met.

The sequence of merges is recorded as a dendrogram; cutting it at a chosen level yields the clustering.

Inter-cluster distance (linkage) methods used to decide which clusters are closest:

Single linkage: minimum distance between points of two clusters.
Complete linkage: maximum distance between points.
Average linkage: average pairwise distance.
Centroid linkage: distance between cluster centroids.

Characteristics: No need to specify k in advance; however, merges are irreversible (a merge cannot be undone) and the method is computationally expensive ( $O(n^2)$ or more). It is the opposite of the divisive (top-down, DIANA) approach, which starts with one cluster and splits.

Answer 9

Issues in Data Quality

Real-world data is often dirty, and poor data quality leads to unreliable mining results ("garbage in, garbage out"). Major data-quality issues:

Incompleteness (Missing data): Lacking attribute values or certain attributes of interest (e.g., empty income field). Caused by data not collected or not entered.
Noise (Errors/Outliers): Random error or variance in a measured variable (e.g., salary = -100), often from faulty instruments or transmission errors.
Inconsistency: Discrepancies in codes, names or formats (e.g., Age = 30 but DOB = 1920; "Kathmandu" vs "KTM"). Common when integrating multiple sources.
Duplication / Redundancy: The same record or attribute appearing more than once, biasing results.
Invalid / Out-of-range values: Values violating constraints or business rules.
Timeliness and Believability: Data not updated in time, or data users do not trust.
Interpretability: Data that is hard to understand.

These issues are addressed in the data preprocessing phase through data cleaning (fill missing values, smooth noise, remove inconsistencies/duplicates), integration, and transformation.

Answer 10

Eager vs. Lazy Learners

Aspect	Eager Learners	Lazy Learners
Model building	Construct a generalization (model) from training data before receiving test cases	Simply store the training data and wait until a test tuple arrives
Training time	High (most work done during training)	Low/none (no model built)
Classification (test) time	Fast — just apply the model	Slow — must compare with stored tuples at query time
Generalization	Commits to a single global model/hypothesis	Forms a local approximation per query; supports incremental learning
Storage	Stores only the model	Stores the entire training set
Examples	Decision tree (ID3, C4.5), Naïve Bayes, Backpropagation/Neural networks, SVM	k-Nearest Neighbor (k-NN), Case-based reasoning

In short: Eager learners do the work up front (slow to train, fast to classify), whereas lazy learners defer computation to classification time (fast to train, slow to classify) and adapt locally to each query.

Answer 11

Support Count and Minimum Support Threshold

Support count (frequency / absolute support): The number of transactions in the database that contain a particular itemset $X$ . Denoted $\sigma(X)$ .

\sigma(X) = |\{ t \in D : X \subseteq t \}|

The (relative) support is this count divided by the total number of transactions:

support(X) = \frac{\sigma(X)}{|D|}

Minimum support threshold ( $min\_sup$ ): A user-specified value (a count or a percentage) that an itemset's support must meet or exceed to be considered frequent. Itemsets with $support(X) \ge min\_sup$ are frequent itemsets; those below it are pruned (this is the basis of the Apriori pruning).

Example: In 5 transactions, if {Milk, Bread} appears in 3 of them, its support count is 3 and its support is $3/5 = 60\%$ . If $min\_sup = 40\%$ , the itemset is frequent; if $min\_sup = 70\%$ , it is not.

Answer 12

Temporal Data Mining

Temporal data mining is the process of discovering interesting patterns, trends, and relationships from time-stamped (time-dependent) data — data whose attributes change over time or that includes a temporal/time dimension. It extends ordinary data mining by explicitly considering the ordering and time gaps between events.

Key tasks / techniques:

Sequential pattern mining: find frequently occurring ordered subsequences (e.g., customers who buy a phone often buy a case later) — algorithms like GSP, PrefixSpan.
Time-series analysis: trend, seasonality, similarity search and forecasting on numeric series (e.g., stock prices, weather).
Temporal association rules: rules valid within specific time intervals.
Periodicity / cyclic pattern detection: events recurring at regular intervals.
Temporal clustering and classification.

Applications: stock-market and sales forecasting, weather prediction, network/IoT sensor analysis, web-clickstream and medical (patient history) analysis.

Challenges: handling large volume of time-stamped data, irregular time intervals, noise, and concept drift (patterns changing over time).

Level	BSc CSIT (TU)
Stream	Science
Subject	Data Warehousing and Data Mining (BSc CSIT, CSC410)
Year	2080 BS
Exam session	Regular (annual)
Full marks	60
Time allowed	180 minutes
Questions	12, all with step-by-step solutions