
Section A: Long Answer Questions

Attempt any TWO questions.

3 questions · 10 marks each

Question 1 · Long answer · 10 marks

What is a data warehouse? Explain its characteristics. Describe the three-tier data warehouse architecture with a suitable diagram.

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data organized to support management decision-making (W. H. Inmon). It consolidates historical data from multiple operational (transactional) sources into a single repository optimized for query and analysis rather than transaction processing.

Characteristics

  1. Subject-oriented — Organized around major subjects (customer, product, sales) rather than around day-to-day operations or applications.
  2. Integrated — Data from heterogeneous sources is cleaned and made consistent (uniform naming, encoding, units, formats).
  3. Time-variant — Stores historical data with a time dimension (months/years of data), enabling trend analysis. Every record carries an explicit or implicit time element.
  4. Non-volatile — Data is loaded and read but normally not updated or deleted by end users; it is refreshed periodically (read-only for analysis).

Three-Tier Architecture

  Top Tier:      [ OLAP Tools | Query/Reporting | Data Mining | Analysis ]   <- Front-end client tools
                                  |
  Middle Tier:           [  OLAP Server (ROLAP / MOLAP)  ]                    <- presents multidimensional view
                                  |
  Bottom Tier:   [  Data Warehouse Server (RDBMS) + Data Marts + Metadata ]   <- ETL loads from sources
                                  |
              [ Operational DBs ][ Flat Files ][ External Sources ]
  • Bottom tier (Warehouse database server): Usually a relational database. Data is extracted from operational databases and external sources, then cleaned, transformed, and loaded (ETL/back-end tools). It also holds data marts and a metadata repository.
  • Middle tier (OLAP server): Implemented as either a ROLAP (relational OLAP, maps multidimensional operations to SQL) or MOLAP (multidimensional OLAP, uses array-based storage/data cubes). It presents an abstracted, multidimensional view of the data to the top tier.
  • Top tier (Front-end client layer): Tools for querying, reporting, analysis, and data mining used by analysts and decision makers.

This layered design separates storage, multidimensional processing, and presentation, improving scalability and performance.

Question 2 · Long answer · 10 marks

What is association rule mining? Explain the Apriori algorithm with an example and discuss its limitations.

Association Rule Mining

Association rule mining discovers interesting relationships (co-occurrence patterns) among items in large transactional datasets. A rule has the form $X \Rightarrow Y$, where $X$ and $Y$ are disjoint itemsets, e.g. $\{\text{bread, butter}\} \Rightarrow \{\text{milk}\}$ ("market basket analysis").

Two key measures:

  • Support: $\text{supp}(X) = \dfrac{\text{number of transactions containing } X}{\text{total number of transactions}}$
  • Confidence: $\text{conf}(X \Rightarrow Y) = \dfrac{\text{supp}(X \cup Y)}{\text{supp}(X)}$

A rule is strong if it meets a minimum support and minimum confidence threshold.
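
The two measures above can be computed directly from a list of transactions. The following is a minimal sketch, not a reference implementation: the transaction data and the helper names `support`, `confidence`, and `is_strong` are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy transactions (not taken from the question paper).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk"}),
    frozenset({"butter", "milk"}),
    frozenset({"bread", "butter", "milk"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """conf(X => Y) = supp(X union Y) / supp(X)."""
    x, y = frozenset(x), frozenset(y)
    return support(x | y, transactions) / support(x, transactions)


def is_strong(x, y, transactions, min_support: float, min_confidence: float) -> bool:
    """A rule is strong if it meets both thresholds."""
    rule_support = support(frozenset(x) | frozenset(y), transactions)
    return (rule_support >= min_support
            and confidence(x, y, transactions) >= min_confidence)
```

With this toy data, {bread, butter} appears in 3 of 5 transactions and {bread, butter, milk} in 2 of 5, so the rule {bread, butter} ⇒ {milk} has confidence 2/3.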

Apriori Algorithm

Apriori finds all frequent itemsets using the Apriori property: every non-empty subset of a frequent itemset must also be frequent (anti-monotonicity). It works level-by-level:

1. Find all frequent 1-itemsets (L1) by counting support.
2. For k = 2,3,...:
   a. Generate candidate k-itemsets Ck by joining L(k-1) with itself.
   b. Prune candidates having any infrequent (k-1)-subset.
   c. Scan database to count support of each candidate.
   d. Lk = candidates with support >= min_support.
3. Stop when Lk is empty. Frequent itemsets = union of all Lk.
4. Generate strong rules from frequent itemsets using min_confidence.

Example

Transactions (min_support = 50% i.e. 2 of 4, min_confidence = 70%):

| TID | Items   |
|-----|---------|
| T1  | A, B, C |
| T2  | A, C    |
| T3  | A, D    |
| T4  | B, C    |

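  • C1 (support counts): A = 3, B = 2, C = 3, D = 1. D is below the minimum count of 2 and is dropped, so L1 = {A}, {B}, {C}.
  • C2: {A,B} = 1, {A,C} = 2, {B,C} = 2. {A,B} is infrequent, so L2 = {A,C}, {B,C}.
  • C3: the only possible 3-itemset is {A,B,C}, but its subset {A,B} is infrequent, so it is pruned without scanning the database. L3 is empty and the algorithm stops.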

Frequent itemsets: {A}, {B}, {C}, {A,C}, {B,C}.

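Rules from {B,C}: $C \Rightarrow B$ has confidence $2/3 \approx 66.7\%$ (rejected, below 70%); $B \Rightarrow C$ has confidence $2/2 = 100\%$ (strong rule). Rules from {A,C}: $A \Rightarrow C$ and $C \Rightarrow A$ both have confidence $2/3 \approx 66.7\%$ and are rejected.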
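
The level-wise procedure and the example above can be checked with a short sketch. It is a simplified illustration under stated assumptions: the join step takes unions of frequent (k-1)-itemsets rather than the prefix-based join of the textbook algorithm, and the names `apriori`, `strong_rules`, and `EXAMPLE` are hypothetical.

```python
from itertools import combinations

# The four transactions of the worked example.
EXAMPLE = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "C"}]


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Scan the database once to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


def strong_rules(frequent, min_confidence):
    """Return (lhs, rhs, confidence) for rules meeting min_confidence."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]  # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, conf))
    return rules
```

Calling `apriori(EXAMPLE, 2)` yields the five frequent itemsets listed above, and `strong_rules(..., 0.7)` keeps only B ⇒ C.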

Limitations

  1. Multiple database scans: one full scan per level $k$, which is expensive for large data.
  2. Huge number of candidates: candidate generation explodes when there are many frequent items (e.g. $10^4$ frequent 1-itemsets give $\binom{10^4}{2} \approx 5 \times 10^7$ candidate 2-itemsets).
  3. High memory and computation cost for low support thresholds.

Improvements such as FP-Growth (no candidate generation, only two database scans) address these issues.

Question 3 · Long answer · 10 marks

Explain the K-means clustering algorithm. Apply it on a sample dataset to form two clusters and show the iterations.

K-Means Clustering

K-means is an unsupervised, partitional clustering algorithm that divides $n$ objects into $k$ clusters so that intra-cluster similarity is high and inter-cluster similarity is low. Each cluster $C_i$ is represented by its centroid $\mu_i$ (the mean of its points). It minimizes the sum of squared errors $SSE = \sum_{i=1}^{k}\sum_{x \in C_i} \lVert x - \mu_i \rVert^2$.

Algorithm

1. Choose k (number of clusters); select k initial centroids.
2. Assignment step: assign each point to the nearest centroid
   (using Euclidean distance).
3. Update step: recompute each centroid as the mean of its members.
4. Repeat steps 2-3 until centroids no longer change (convergence).

Worked Example

Data (1-D): 2, 4, 10, 12, 3, 20, 30, 11, 25, with $k = 2$.

Initial centroids: $m_1 = 2$, $m_2 = 4$.

Iteration 1 (assign each point to the nearest centroid):

  • Cluster1 (near 2): {2, 3} → new mean $m_1 = 2.5$
  • Cluster2 (near 4): {4, 10, 12, 20, 30, 11, 25} → new mean $m_2 = 112/7 = 16$

The point 3 is equidistant from both initial centroids; the tie is broken in favour of $m_1$.

Iteration 2, with $m_1 = 2.5$, $m_2 = 16$:

  • Cluster1: {2, 3, 4} → mean $= 3$
  • Cluster2: {10, 12, 20, 30, 11, 25} → mean $= 18$

Iteration 3, with $m_1 = 3$, $m_2 = 18$:

  • Cluster1: {2, 3, 4, 10} → mean $= 4.75$
  • Cluster2: {12, 20, 30, 11, 25} → mean $= 19.6$

Iteration 4, with $m_1 = 4.75$, $m_2 = 19.6$:

  • Cluster1: {2, 3, 4, 10, 11, 12} → mean $= 7$
  • Cluster2: {20, 30, 25} → mean $= 25$

Iteration 5, with $m_1 = 7$, $m_2 = 25$:

  • Cluster1: {2, 3, 4, 10, 11, 12} → mean $= 7$
  • Cluster2: {20, 30, 25} → mean $= 25$

Assignments and centroids no longer change → converged.

Final clusters: $C_1 = \{2,3,4,10,11,12\}$ (centroid 7), $C_2 = \{20,25,30\}$ (centroid 25).
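
The iterations above can be reproduced with a few lines of code. This is a minimal sketch for 1-D data only; the function names `kmeans_1d` and `sse` are hypothetical, and ties are assumed to go to the lower-indexed centroid (which matches assigning the point 3 to the first cluster in iteration 1).

```python
DATA = [2, 4, 10, 12, 3, 20, 30, 11, 25]


def kmeans_1d(points, centroids, max_iter=100):
    """K-means on 1-D data; returns (clusters, centroids, centroid history)."""
    centroids = [float(c) for c in centroids]
    history = []
    clusters = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by absolute distance.
        # min() returns the first index on ties (an assumption of this sketch).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        history.append(new_centroids)
        if new_centroids == centroids:  # converged: centroids did not move
            break
        centroids = new_centroids
    return clusters, centroids, history


def sse(clusters, centroids):
    """Sum of squared distances of points to their cluster centroid."""
    return sum((p - m) ** 2 for c, m in zip(clusters, centroids) for p in c)
```

Running `kmeans_1d(DATA, [2, 4])` records five centroid updates, ending at centroids 7 and 25. For the final clusters the SSE is 100 + 50 = 150.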

Notes

  • Result depends on initial centroid choice and chosen $k$; sensitive to outliers; works best with spherical, well-separated clusters.

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions · 5 marks each

Question 4 · Short answer · 5 marks

Define data mining and list its major applications.

Data mining is the process of discovering interesting, previously unknown, and potentially useful patterns, correlations, and knowledge from large amounts of data. It is the core analysis step of the KDD (Knowledge Discovery in Databases) process and uses techniques from machine learning, statistics, and database systems.

Major Applications

  • Market basket analysis / retail — association rules to drive product placement and cross-selling.
  • Banking & finance — credit scoring, fraud detection, risk analysis.
  • Telecommunications — fraud detection, churn prediction.
  • Healthcare & bioinformatics — disease diagnosis, gene/protein analysis.
  • CRM — customer segmentation and targeted marketing.
  • Web mining & search engines — recommendation systems, click-stream analysis.
  • Manufacturing — quality control and fault prediction.

Question 5 · Short answer · 5 marks

Differentiate between OLTP and OLAP.

OLTP vs OLAP

| Feature          | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing)   |
|------------------|--------------------------------------|---------------------------------------|
| Purpose          | Day-to-day operational transactions  | Analysis and decision support         |
| Data             | Current, detailed, operational       | Historical, summarized, consolidated  |
| Orientation      | Application/transaction oriented     | Subject oriented                      |
| Operations       | Frequent INSERT/UPDATE/DELETE        | Mostly read-only complex queries      |
| Design           | Normalized (ER, 3NF)                 | De-normalized (star/snowflake, cubes) |
| Records accessed | Few rows per query                   | Millions of rows aggregated           |
| Users            | Clerks, operational staff (many)     | Analysts, managers (fewer)            |
| Response time    | Milliseconds                         | Seconds to minutes                    |
| Database size    | MB–GB                                | GB–TB                                 |
| Backup/recovery  | Critical, real-time                  | Periodic; reload from sources         |

Summary: OLTP runs the business (operational data), while OLAP analyzes the business (decision support over a data warehouse).

Question 6 · Short answer · 5 marks

What is a concept hierarchy? Explain with an example.

Concept Hierarchy

A concept hierarchy defines a sequence of mappings from a set of low-level (detailed) concepts to higher-level (more general) concepts. It organizes attribute values into multiple levels of abstraction, enabling data generalization and roll-up/drill-down in OLAP.

Types

  • Schema hierarchy — based on the schema/attributes, e.g. street < city < state < country for a location dimension.
  • Set-grouping hierarchy — groups values into ranges, e.g. age: {0–18 = young}, {19–40 = adult}, {41+ = senior}.

Example

For the location dimension:

            country  (Nepal)
               |
            province (Bagmati)
               |
             city    (Kathmandu)
               |
            street   (Durbar Marg)

Rolling up moves from street toward country (more general/summarized); drilling down moves the other way (more detailed). Such hierarchies let analysts view sales by street, then summarize by city, province, or country.
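
As a rough illustration, a schema hierarchy can be stored as a child-to-parent mapping and used to roll sales up to any level. The sketch below is hypothetical: only Durbar Marg, Kathmandu, Bagmati, and Nepal come from the example above; the other places, the sales figures, and the names `PARENT`, `generalize`, and `roll_up` are assumptions made for illustration.

```python
from collections import defaultdict

# Levels of the location hierarchy, from most detailed to most general.
LEVELS = ["street", "city", "province", "country"]

# Child -> parent mapping (one step up the hierarchy).
PARENT = {
    "Durbar Marg": "Kathmandu",
    "New Road": "Kathmandu",
    "Lakeside": "Pokhara",
    "Kathmandu": "Bagmati",
    "Pokhara": "Gandaki",
    "Bagmati": "Nepal",
    "Gandaki": "Nepal",
}

# Hypothetical sales figures recorded at street level.
SALES_BY_STREET = {"Durbar Marg": 120, "New Road": 80, "Lakeside": 50}


def generalize(value, steps):
    """Climb `steps` levels up the hierarchy from a street-level value."""
    for _ in range(steps):
        value = PARENT[value]
    return value


def roll_up(sales_by_street, to_level):
    """Summarize street-level sales at the requested level."""
    steps = LEVELS.index(to_level)  # "street" means no generalization
    totals = defaultdict(int)
    for street, amount in sales_by_street.items():
        totals[generalize(street, steps)] += amount
    return dict(totals)
```

Here `roll_up(SALES_BY_STREET, "city")` groups the two Kathmandu streets together, and `roll_up(SALES_BY_STREET, "country")` collapses everything into a single total for Nepal. Drilling down is not computed from the summary; it means going back to the more detailed level that was stored.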

Question 7 · Short answer · 5 marks

Explain the different OLAP operations (roll-up, drill-down, slice, dice).

OLAP Operations

OLAP operations let users navigate a multidimensional data cube at different levels of abstraction.

  1. Roll-up (drill-up): Aggregates data by climbing up a concept hierarchy or reducing a dimension. e.g. moving sales from city → province → country, producing more summarized data.

  2. Drill-down: The reverse of roll-up — moves from summarized to more detailed data by stepping down a concept hierarchy or adding a dimension. e.g. quarter → month → day, or country → city.

  3. Slice: Selects a single value for one dimension, producing a sub-cube with one fewer dimension (a 2-D slice of a 3-D cube). e.g. fix time = Q1 to see sales for all products/locations in Q1 only.

  4. Dice: Selects a sub-cube by specifying a range/condition on two or more dimensions. e.g. location ∈ {KTM, Pokhara} AND time ∈ {Q1, Q2} AND item ∈ {phone, tablet}.

(Related: Pivot/rotate reorients the cube's axes to give an alternative view.)
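
The operations above can be sketched on a tiny cube stored as a dictionary. This is an illustrative model only: the cell values, the extra location and item, and the function names `slice_cube`, `dice`, and `roll_up` are hypothetical, and roll-up is shown in its dimension-reduction form (summing one dimension away).

```python
from collections import defaultdict

DIMS = ("location", "time", "item")

# Each cell maps (location, time, item) -> sales. Values are hypothetical.
CUBE = {
    ("KTM", "Q1", "phone"): 10,
    ("KTM", "Q1", "tablet"): 4,
    ("KTM", "Q2", "phone"): 12,
    ("Pokhara", "Q1", "phone"): 6,
    ("Pokhara", "Q2", "tablet"): 3,
    ("Biratnagar", "Q3", "laptop"): 5,
}


def slice_cube(cube, dim, value):
    """Fix one dimension to a single value; the result has one fewer dimension."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice(cube, **allowed):
    """Keep cells whose coordinates lie in the allowed sets; dimensions are kept."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }


def roll_up(cube, dim):
    """Aggregate by removing one dimension and summing over it."""
    i = DIMS.index(dim)
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[k[:i] + k[i + 1:]] += v
    return dict(totals)
```

For example, `slice_cube(CUBE, "time", "Q1")` returns a location-by-item table for Q1, while `dice(CUBE, location={"KTM", "Pokhara"}, time={"Q1", "Q2"}, item={"phone", "tablet"})` keeps a three-dimensional sub-cube.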

Question 8 · Short answer · 5 marks

What is data preprocessing? Why is it necessary?

Data Preprocessing

Data preprocessing is the set of techniques applied to raw data before mining to transform it into a clean, consistent, and suitable form for analysis. Real-world data is typically incomplete (missing values), noisy (errors, outliers), and inconsistent (conflicting codes/formats), so it cannot be mined reliably as-is.

Major Tasks

  • Data cleaning — fill in missing values, smooth noise, remove outliers, resolve inconsistencies.
  • Data integration — combine data from multiple sources, resolving conflicts and redundancy.
  • Data reduction — reduce volume via dimensionality/numerosity reduction or aggregation while preserving analytical results.
  • Data transformation — normalization, smoothing, generalization, attribute construction.

Why It Is Necessary

  • Improves data quality, hence the quality and accuracy of mining results (garbage in, garbage out).
  • Many algorithms require clean, consistent, properly scaled input.
  • Reduces size and complexity, improving mining efficiency and reducing storage/computation.
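
Two of the tasks listed above, cleaning and transformation, can be sketched in a few lines. The choices here are assumptions made for illustration: missing values are filled with the mean of the observed values (one of several possible strategies), scaling uses min-max normalization, and the names `fill_missing` and `min_max` are hypothetical.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def min_max(values, new_min=0.0, new_max=1.0):
    """Data transformation: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]
```

For the hypothetical column `[20, None, 40, 60]`, `fill_missing` gives `[20, 40, 40, 60]`, and `min_max` of that result gives `[0.0, 0.5, 0.5, 1.0]`.
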

Question 9 · Short answer · 5 marks

Explain the steps in the KDD process.

KDD Process

KDD (Knowledge Discovery in Databases) is the overall, iterative process of converting raw data into useful knowledge; data mining is one core step within it.

Steps

  1. Data Cleaning — remove noise and inconsistent data, handle missing values.
  2. Data Integration — combine data from multiple heterogeneous sources.
  3. Data Selection — retrieve the data relevant to the analysis task from the database.
  4. Data Transformation — transform/consolidate data into forms appropriate for mining (e.g. normalization, aggregation).
  5. Data Mining — apply intelligent methods (classification, clustering, association, etc.) to extract patterns.
  6. Pattern Evaluation — identify the truly interesting patterns using interestingness measures.
  7. Knowledge Presentation — use visualization and reporting techniques to present mined knowledge to users.

(Steps 1–4 are often grouped as data preprocessing.) The process is iterative: feedback may loop back to earlier steps.

Question 10 · Short answer · 5 marks

What is a decision tree? How is it used for classification?

Decision Tree

A decision tree is a flowchart-like tree structure used for classification (and regression), where:

  • each internal node tests an attribute,
  • each branch represents an outcome of the test, and
  • each leaf node holds a class label.

The path from the root to a leaf forms a classification rule.

Use in Classification

  1. Building (training): The tree is induced top-down by recursively partitioning the training data. At each node, the best splitting attribute is chosen using a measure such as Information Gain / Entropy (ID3), Gain Ratio (C4.5), or Gini index (CART) — the attribute that best separates the classes.
  2. Splitting continues until nodes are pure (one class), no attributes remain, or a stopping criterion is met; pruning removes branches that overfit.
  3. Classifying a new record: Start at the root, test attributes along matching branches, and follow the path to a leaf; the leaf's class label is the predicted class.

Advantages: easy to understand and interpret, no need for normalization, handles both numeric and categorical data, and the rules are explicit.
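
The attribute-selection step used by ID3 can be sketched as an entropy and information-gain calculation. This is a minimal illustration, not a full tree builder: the four training rows and the names `entropy`, `information_gain`, and `best_attribute` are hypothetical.

```python
from collections import Counter
from math import log2

# Hypothetical training rows; "play" is the class label.
ROWS = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))
```

In this toy data, splitting on `outlook` produces two pure subsets (gain of 1 bit), while splitting on `windy` leaves both subsets evenly mixed (gain of 0), so `outlook` would be chosen as the root test.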

Question 11 · Short answer · 5 marks

Differentiate between classification and clustering.

Classification vs Clustering

| Aspect            | Classification                                  | Clustering                                        |
|-------------------|-------------------------------------------------|---------------------------------------------------|
| Learning type     | Supervised                                      | Unsupervised                                      |
| Class labels      | Predefined/known; uses labeled training data    | No predefined labels; groups discovered from data |
| Goal              | Assign new objects to known classes             | Partition data into natural groups (clusters)     |
| Training data     | Required (labeled)                              | Not required                                      |
| Basis of grouping | Learned model that maps features → label        | Similarity/distance between objects               |
| Output            | A predictive model + class prediction           | Set of clusters                                   |
| Examples          | Decision tree, Naive Bayes, SVM, KNN            | K-means, hierarchical, DBSCAN                     |
| Evaluation        | Accuracy, precision, recall (vs true labels)    | Cohesion/separation, silhouette (no true labels)  |

Summary: Classification predicts a known class label using labeled training data (supervised), whereas clustering discovers the groupings in unlabeled data based on similarity (unsupervised).

Question 12 · Short answer · 5 marks

Write short notes on the star schema.

Star Schema

The star schema is the most common multidimensional model for a data warehouse. It consists of:

  • A central fact table containing measures (numeric, additive facts such as sales_amount, quantity) and foreign keys to the dimensions.
  • A set of dimension tables (e.g. time, product, location, customer), each connected directly to the fact table.

The diagram resembles a star: the fact table at the center with dimension tables radiating outward.

        [ Time ]        [ Product ]
             \            /
              \          /
            [   FACT (Sales)   ]
              /          \
             /            \
      [ Location ]      [ Customer ]

Key points

  • Dimension tables are de-normalized (each dimension is a single table), giving fewer joins and fast query performance.
  • Simple, intuitive design, well suited to OLAP.
  • Uses more storage due to redundancy in dimensions.
  • Contrast: the snowflake schema normalizes dimensions into sub-tables (more joins, less redundancy).
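
A star schema like the one above can be sketched with Python's built-in `sqlite3` module. The table names, column names, and rows below are hypothetical, and the customer dimension is left out to keep the example short. The query joins the fact table directly to two dimension tables and aggregates a measure.

```python
import sqlite3


def build_star_schema():
    """Create an in-memory star schema with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
        CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
        -- Fact table: foreign keys to every dimension plus numeric measures.
        CREATE TABLE fact_sales (
            time_id      INTEGER REFERENCES dim_time(time_id),
            product_id   INTEGER REFERENCES dim_product(product_id),
            location_id  INTEGER REFERENCES dim_location(location_id),
            quantity     INTEGER,
            sales_amount REAL
        );
        INSERT INTO dim_time     VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
        INSERT INTO dim_product  VALUES (1, 'phone', 'electronics'), (2, 'tablet', 'electronics');
        INSERT INTO dim_location VALUES (1, 'Kathmandu', 'Nepal'), (2, 'Pokhara', 'Nepal');
        INSERT INTO fact_sales   VALUES
            (1, 1, 1, 2, 1000.0),
            (1, 2, 1, 1, 300.0),
            (2, 1, 2, 3, 1500.0),
            (2, 1, 1, 1, 500.0);
    """)
    return conn


def sales_by_city_and_quarter(conn):
    """Total sales per (city, quarter): one join per dimension used."""
    query = """
        SELECT l.city, t.quarter, SUM(f.sales_amount)
        FROM fact_sales AS f
        JOIN dim_time AS t     ON f.time_id = t.time_id
        JOIN dim_location AS l ON f.location_id = l.location_id
        GROUP BY l.city, t.quarter
        ORDER BY l.city, t.quarter
    """
    return conn.execute(query).fetchall()
```

Because each dimension is a single table, the query needs only one join per dimension it uses; a snowflake design would add further joins to reach the normalized sub-tables.
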

Frequently asked questions

Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2074?
The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 paper?
The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.