Browse papers
A

Section A: Long Answer Questions

Attempt any TWO questions.

3 questions·10 marks each
1long10 marks

Explain the data warehouse architecture and the ETL process in detail. Discuss the role of metadata in a data warehouse.

Data Warehouse Architecture

A data warehouse (DW) is a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. Its architecture is typically organized in three tiers.

Three-Tier Architecture

  1. Bottom Tier (Warehouse Database Server): A relational database system. Data is extracted from operational/external sources, cleaned, transformed and loaded here via back-end (ETL) tools. This tier also holds the metadata repository.
  2. Middle Tier (OLAP Server): Presents a multidimensional view of data. Implemented as a ROLAP (relational OLAP) server that maps operations on multidimensional data to standard relational operations, or a MOLAP (multidimensional OLAP) server that directly implements multidimensional data and operations.
  3. Top Tier (Front-End Client): Contains query, reporting, analysis and data-mining tools.

Data typically flows: Operational/external sources → Staging area → Enterprise Data Warehouse → Data Marts → OLAP/Analysis tools.

The ETL Process

ETL = Extract, Transform, Load — the back-end process that populates the warehouse.

  • Extract: Read data from heterogeneous sources (RDBMS, flat files, ERP, web logs). Can be full or incremental.
  • Transform: The cleaning and integration step performed in a staging area:
    • Data cleaning — remove noise, fill missing values, correct inconsistencies.
    • Integration — resolve schema/naming conflicts, merge sources.
    • Transformation — normalization, aggregation, derivation of new attributes, surrogate-key generation.
  • Load: Write the transformed data into the warehouse fact and dimension tables, and refresh indexes/aggregates. May be done by bulk load and refreshed periodically.

Role of Metadata

Metadata is "data about data" — it describes the warehouse contents and structure. It is stored in the metadata repository and includes:

  • Structure/definition metadata: schema, dimensions, hierarchies, derived-data definitions, data-mart locations.
  • Operational metadata: data lineage (history/source of migrated data), currency (active, archived, purged), monitoring/usage statistics.
  • ETL/mapping metadata: source-to-target mappings, cleaning and transformation rules, refresh schedules.
  • Business metadata: business terms, ownership, and charging policies.

Metadata acts as the directory that helps users locate data, the guide for the ETL mapping, and the basis for query optimization and data lineage — making it central to building, using and maintaining the warehouse.

data-warehouseetl
2long10 marks

Explain the FP-growth algorithm. Construct the FP-tree for a given transaction dataset and mine the frequent patterns.

FP-Growth Algorithm

FP-Growth (Frequent Pattern Growth) mines frequent itemsets without candidate generation, unlike Apriori. It compresses the database into a compact FP-tree and then mines patterns recursively using a divide-and-conquer strategy. It requires only two database scans.

Steps:

  1. Scan DB once, count item supports, discard items below min_support, and sort frequent items in descending support order (F-list).
  2. Scan DB again; for each transaction, sort its frequent items by the F-list order and insert into the FP-tree (shared prefixes share nodes; node counts incremented).
  3. Build a header table linking all nodes of each item.
  4. Mine recursively: for each item (least frequent first), build its conditional pattern baseconditional FP-tree → generate frequent patterns.

Worked Example

Let min_support = 2. Transactions:

TIDItems
T1A, B, C
T2A, C, D
T3A, B, D
T4B, C, D
T5A, B, C, D

Step 1 — Support counts: A=4, B=4, C=4, D=4. All ≥ 2. F-list (desc, ties broken alphabetically): A, B, C, D.

Step 2 — Ordered transactions:

  • T1: A,B,C T2: A,C,D T3: A,B,D T4: B,C,D T5: A,B,C,D

Step 3 — FP-tree (described):

root
├─ A:4
│  ├─ B:3 ── C:2 ── D:1
│  │         └─ D:1   (B-C-D path count)
│  ├─ C:1 ── D:1
│  └─ (B:3 also has D:1 child for A-B-D)
└─ B:1 ── C:1 ── D:1   (T4: B,C,D, no A)

Header links chain all nodes of A, B, C, D.

Step 4 — Mine (bottom-up from D):

  • D: conditional pattern base = {A,B,C:1},{A,C:1},{A,B:1},{B,C:1}. Frequent: {A,D}=3, {B,D}=3, {C,D}=3.
  • C: base = {A,B:2},{A:1},{B:1}. Frequent: {A,C}=3, {B,C}=3.
  • B: base = {A:3}. Frequent: {A,B}=3.
  • A: single item, support 4.

Frequent patterns (support ≥ 2): {A}:4, {B}:4, {C}:4, {D}:4, {A,B}:3, {A,C}:3, {A,D}:3, {B,C}:3, {B,D}:3, {C,D}:3.

Advantages

No candidate generation, only two scans, compact tree → far faster than Apriori on dense data.

association-rulesfp-growth
3long10 marks

What is cluster analysis? Compare partitioning, hierarchical, and density-based clustering methods with examples.

Cluster Analysis

Cluster analysis is the process of grouping a set of data objects into clusters such that objects within a cluster are highly similar to one another and dissimilar to objects in other clusters. It is unsupervised (no predefined class labels). Quality depends on the similarity/distance measure (e.g., Euclidean distance) and the method used. Applications: customer segmentation, image processing, outlier detection.

Comparison of Clustering Methods

AspectPartitioningHierarchicalDensity-based
IdeaDivide n objects into k partitions, iteratively relocate to optimize a criterionBuild a tree (dendrogram) of nested clustersGrow clusters as dense regions separated by sparse regions
k known?Yes, k given in advanceNoNo
Cluster shapeSpherical/convex onlyConvexArbitrary shapes
OutliersSensitiveSensitiveHandles noise/outliers
Example algorithmk-means, k-medoids (PAM)AGNES (agglomerative), DIANA (divisive)DBSCAN, OPTICS

Partitioning Methods

  • Construct k partitions and iteratively improve them (assign each object to nearest centroid, recompute centroids).
  • k-means minimizes within-cluster squared error ixCixμi2\sum_{i}\sum_{x\in C_i}\|x-\mu_i\|^2. Fast, but needs k, sensitive to outliers and finds only spherical clusters.

Hierarchical Methods

  • Create a dendrogram. Agglomerative (bottom-up): start with each object as a cluster, repeatedly merge the two closest clusters. Divisive (top-down): start with one cluster and split.
  • No need to specify k in advance; cut the dendrogram at desired level. But merges/splits are irreversible and cost is high (O(n2)O(n^2) or more).

Density-Based Methods

  • A cluster is a maximal set of density-connected points. DBSCAN uses parameters Eps (radius) and MinPts; classifies points as core, border, or noise.
  • Finds clusters of arbitrary shape, is robust to noise, and does not need k, but is sensitive to the Eps/MinPts settings.

Summary: Use partitioning when k is known and clusters are compact; hierarchical when a cluster hierarchy is desired; density-based for arbitrary-shaped clusters with noise.

clustering
B

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions·5 marks each
4short5 marks

What are the characteristics of a data warehouse (subject-oriented, integrated, time-variant, non-volatile)?

A data warehouse has four key characteristics (Inmon's definition):

  • Subject-oriented: Organized around major subjects/entities of the enterprise (e.g., customer, product, sales) rather than around day-to-day operations/transactions. It focuses on modelling and analysis for decision makers.
  • Integrated: Built by integrating multiple heterogeneous sources (RDBMS, flat files). Data-cleaning and integration techniques ensure consistency in naming conventions, encoding, units and attribute measures.
  • Time-variant: Data is stored to provide a historical perspective (e.g., last 5–10 years). Every key structure contains, implicitly or explicitly, a time element, so the warehouse keeps snapshots over time rather than just current values.
  • Non-volatile: Data is physically separated from operational data and is loaded read-only. Operational update/delete do not occur; the warehouse normally needs only two operations: initial loading of data and access (query) of data.
data-warehouse
5short5 marks

Explain the concept of data cube and aggregation.

Data Cube

A data cube is a multidimensional model that allows data to be viewed and analyzed in terms of dimensions and measures (facts).

  • Dimensions are the perspectives/entities for analysis (e.g., time, item, location), each with a hierarchy (e.g., day → month → quarter → year).
  • Measures are numeric quantities stored in cells (e.g., sales amount, units sold).

A cube generalizes a 2-D table to n dimensions. The set of all cubes from the most detailed (base cuboid) to the most aggregated (apex cuboid) is the data-cube lattice (cuboids).

Aggregation

Aggregation is the summarization of measure values across one or more dimensions using functions such as SUM, COUNT, AVG, MIN, MAX.

  • Rolling up time from month to year aggregates sales for the whole year.
  • The apex cuboid (all) holds the total aggregate over every dimension.

Example: For a Sales cube with dimensions (time, item, location), the cell (2023, Laptop, Kathmandu) stores total sales for that combination. Aggregating over location gives total laptop sales in 2023 across all cities. OLAP operations like roll-up increase aggregation, while drill-down reduces it.

olap
6short5 marks

What is a frequent itemset? Explain with an example.

Frequent Itemset

An itemset is a set of one or more items. The support of an itemset XX is the fraction (or count) of transactions that contain XX:

support(X)=No. of transactions containing XTotal no. of transactions\text{support}(X) = \frac{\text{No. of transactions containing } X}{\text{Total no. of transactions}}

An itemset is frequent if its support is greater than or equal to a user-specified minimum support threshold (min_supmin\_sup). A k-itemset contains k items.

Example

TIDItems
T1Bread, Milk
T2Bread, Butter
T3Bread, Milk, Butter
T4Milk, Butter

Let min_sup=50%min\_sup = 50\% (i.e., count ≥ 2 of 4 transactions).

  • {Bread} = 3, {Milk} = 3, {Butter} = 3 → all frequent.
  • {Bread, Milk} = 2 → frequent. {Bread, Butter} = 2 → frequent. {Milk, Butter} = 2 → frequent.
  • {Bread, Milk, Butter} = 1 → not frequent.

So {Bread, Milk}, {Bread, Butter}, {Milk, Butter} are frequent 2-itemsets. Frequent itemsets form the basis for generating association rules.

association-rules
7short5 marks

Explain the Gini index as a splitting criterion.

Gini Index as a Splitting Criterion

The Gini index measures the impurity of a dataset DD and is used (e.g., in CART) to choose the best attribute to split on in a decision tree. For a dataset with mm classes and pip_i the probability of class ii:

Gini(D)=1i=1mpi2Gini(D) = 1 - \sum_{i=1}^{m} p_i^{2}
  • Gini=0Gini = 0 means the node is pure (all one class); higher values mean more impurity.

Splitting: When attribute AA splits DD into subsets D1D_1 and D2D_2, the Gini index of the split is the weighted sum:

GiniA(D)=D1DGini(D1)+D2DGini(D2)Gini_A(D) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)

The reduction in impurity is ΔGini=Gini(D)GiniA(D)\Delta Gini = Gini(D) - Gini_A(D). The attribute (and split point) giving the maximum reduction (i.e., the minimum GiniA(D)Gini_A(D)) is chosen.

Example: If a node has 6 positive and 4 negative samples (10 total):

Gini=1(0.62+0.42)=1(0.36+0.16)=0.48Gini = 1 - (0.6^2 + 0.4^2) = 1 - (0.36 + 0.16) = 0.48

The split producing the lowest weighted Gini is selected. CART uses Gini to create binary splits.

decision-tree
8short5 marks

What is the agglomerative approach in hierarchical clustering?

Agglomerative (Bottom-Up) Hierarchical Clustering

The agglomerative approach (AGNES) is a hierarchical clustering method that works bottom-up:

  1. Start by placing each data object in its own cluster (n clusters for n objects).
  2. Repeatedly merge the two closest (most similar) clusters into one.
  3. Continue until all objects are in a single cluster or a stopping condition (desired number of clusters) is met.

The sequence of merges is recorded as a dendrogram; cutting it at a chosen level yields the clustering.

Inter-cluster distance (linkage) methods used to decide which clusters are closest:

  • Single linkage: minimum distance between points of two clusters.
  • Complete linkage: maximum distance between points.
  • Average linkage: average pairwise distance.
  • Centroid linkage: distance between cluster centroids.

Characteristics: No need to specify k in advance; however, merges are irreversible (a merge cannot be undone) and the method is computationally expensive (O(n2)O(n^2) or more). It is the opposite of the divisive (top-down, DIANA) approach, which starts with one cluster and splits.

clustering
9short5 marks

Explain the issues in data quality.

Issues in Data Quality

Real-world data is often dirty, and poor data quality leads to unreliable mining results ("garbage in, garbage out"). Major data-quality issues:

  • Incompleteness (Missing data): Lacking attribute values or certain attributes of interest (e.g., empty income field). Caused by data not collected or not entered.
  • Noise (Errors/Outliers): Random error or variance in a measured variable (e.g., salary = -100), often from faulty instruments or transmission errors.
  • Inconsistency: Discrepancies in codes, names or formats (e.g., Age = 30 but DOB = 1920; "Kathmandu" vs "KTM"). Common when integrating multiple sources.
  • Duplication / Redundancy: The same record or attribute appearing more than once, biasing results.
  • Invalid / Out-of-range values: Values violating constraints or business rules.
  • Timeliness and Believability: Data not updated in time, or data users do not trust.
  • Interpretability: Data that is hard to understand.

These issues are addressed in the data preprocessing phase through data cleaning (fill missing values, smooth noise, remove inconsistencies/duplicates), integration, and transformation.

preprocessing
10short5 marks

Differentiate between eager and lazy learners.

Eager vs. Lazy Learners

AspectEager LearnersLazy Learners
Model buildingConstruct a generalization (model) from training data before receiving test casesSimply store the training data and wait until a test tuple arrives
Training timeHigh (most work done during training)Low/none (no model built)
Classification (test) timeFast — just apply the modelSlow — must compare with stored tuples at query time
GeneralizationCommits to a single global model/hypothesisForms a local approximation per query; supports incremental learning
StorageStores only the modelStores the entire training set
ExamplesDecision tree (ID3, C4.5), Naïve Bayes, Backpropagation/Neural networks, SVMk-Nearest Neighbor (k-NN), Case-based reasoning

In short: Eager learners do the work up front (slow to train, fast to classify), whereas lazy learners defer computation to classification time (fast to train, slow to classify) and adapt locally to each query.

classification
11short5 marks

What is the support count and minimum support threshold?

Support Count and Minimum Support Threshold

Support count (frequency / absolute support): The number of transactions in the database that contain a particular itemset XX. Denoted σ(X)\sigma(X).

σ(X)={tD:Xt}\sigma(X) = |\{ t \in D : X \subseteq t \}|

The (relative) support is this count divided by the total number of transactions:

support(X)=σ(X)Dsupport(X) = \frac{\sigma(X)}{|D|}

Minimum support threshold (min_supmin\_sup): A user-specified value (a count or a percentage) that an itemset's support must meet or exceed to be considered frequent. Itemsets with support(X)min_supsupport(X) \ge min\_sup are frequent itemsets; those below it are pruned (this is the basis of the Apriori pruning).

Example: In 5 transactions, if {Milk, Bread} appears in 3 of them, its support count is 3 and its support is 3/5=60%3/5 = 60\%. If min_sup=40%min\_sup = 40\%, the itemset is frequent; if min_sup=70%min\_sup = 70\%, it is not.

association-rules
12short5 marks

Write short notes on temporal data mining.

Temporal Data Mining

Temporal data mining is the process of discovering interesting patterns, trends, and relationships from time-stamped (time-dependent) data — data whose attributes change over time or that includes a temporal/time dimension. It extends ordinary data mining by explicitly considering the ordering and time gaps between events.

Key tasks / techniques:

  • Sequential pattern mining: find frequently occurring ordered subsequences (e.g., customers who buy a phone often buy a case later) — algorithms like GSP, PrefixSpan.
  • Time-series analysis: trend, seasonality, similarity search and forecasting on numeric series (e.g., stock prices, weather).
  • Temporal association rules: rules valid within specific time intervals.
  • Periodicity / cyclic pattern detection: events recurring at regular intervals.
  • Temporal clustering and classification.

Applications: stock-market and sales forecasting, weather prediction, network/IoT sensor analysis, web-clickstream and medical (patient history) analysis.

Challenges: handling large volume of time-stamped data, irregular time intervals, noise, and concept drift (patterns changing over time).

temporal-mining

Frequently asked questions

Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2080?
The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 paper come with solutions?
Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 paper?
The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.