Browse papers
A

Section A: Long Answer Questions

Attempt any TWO questions.

3 questions·10 marks each
1long10 marks

What are the advantages and disadvantages of MOLAP over ROLAP? Explain the multidimensional data model with the help of a data cube.

MOLAP vs ROLAP

MOLAP (Multidimensional OLAP) stores aggregated data in optimized multidimensional array structures (data cubes). ROLAP (Relational OLAP) stores data in relational tables (star/snowflake schema) and generates SQL on the fly.

Advantages of MOLAP over ROLAP

  • Faster query performance — pre-computed and pre-aggregated cube cells give near-instant slice/dice/drill-down response.
  • Compact storage for dense cubes via array indexing (no need to store sparse keys).
  • Rich analytical functions (time-series, complex calculations) are natively supported.
  • Automatic aggregation is handled by the cube engine.

Disadvantages of MOLAP over ROLAP

  • Poor scalability — cube size explodes with many dimensions (the curse of dimensionality); cannot easily handle very large data volumes.
  • Sparse cube wastage — empty cells still consume structure overhead.
  • Limited detail — usually holds summarized data, so drilling to fine-grained records is hard.
  • Long cube build/load time and the cube must be rebuilt when source data changes.
  • ROLAP, in contrast, scales to terabytes, leverages mature RDBMS technology, and handles ad-hoc queries on detailed data.

Multidimensional Data Model and the Data Cube

The multidimensional data model views data as a set of measures (numeric facts such as sales, units_sold) organized by several dimensions (descriptive perspectives such as Time, Item, Location).

  • Dimension: a perspective for analysis, e.g. Time (day, month, quarter, year), Item (name, type, brand), Location (city, region, country).
  • Fact / Measure: the quantity analyzed, e.g. dollars sold.
  • Data Cube: an n-dimensional generalization of a 2-D spreadsheet that lets us model and view data along multiple dimensions.

Example data cube (3-D)

A cube with dimensions Time, Item, Location where each cell holds the measure sales amount:

Time \ Item (in City = Kathmandu)TVPhonePC
Q160582514
Q268095231
Q3812102330
Q4927103838

Stacking such tables for several cities gives the third dimension (Location), producing a cube. Higher-dimensional data is modeled as a series of cuboids; the lattice of cuboids forms the cube.

OLAP operations on the cube

  • Roll-up: aggregate to a higher level (city → country).
  • Drill-down: go to finer detail (quarter → month).
  • Slice: select one dimension value (Time = Q1) giving a sub-cube.
  • Dice: select a range on several dimensions.
  • Pivot (rotate): reorient the cube for presentation.

This multidimensional structure is what enables fast, intuitive analytical querying.

olapdata-model
2long10 marks

Explain the Apriori algorithm in detail. Discuss the limitations of the Apriori algorithm and how the FP-growth algorithm overcomes them.

The Apriori Algorithm

Apriori is a classic algorithm for mining frequent itemsets and generating association rules from transactional data. It is named after its use of prior knowledge of frequent-itemset properties.

Key principle (Apriori property)

Every non-empty subset of a frequent itemset must also be frequent (anti-monotonicity). Equivalently, if an itemset is infrequent, all its supersets are infrequent and can be pruned.

Algorithm (level-wise search)

Input: transaction database D, minimum support min_sup
1. L1 = {frequent 1-itemsets}                      // scan D, count supports
2. for (k = 2; L(k-1) != empty; k++) {
3.     Ck = apriori_gen(L(k-1))                      // candidate generation
4.        - JOIN: combine pairs of (k-1)-itemsets sharing first k-2 items
5.        - PRUNE: drop any candidate having an infrequent (k-1)-subset
6.     for each transaction t in D:                  // scan database
7.        increment count of candidates in Ck contained in t
8.     Lk = { c in Ck | support(c) >= min_sup }
9. }
10. return L = union of all Lk

After finding frequent itemsets, strong association rules (ABA \Rightarrow B) are generated for those with confidence=support(AB)support(A)min_conf\text{confidence} = \frac{\text{support}(A \cup B)}{\text{support}(A)} \ge min\_conf.

Limitations of Apriori

  1. Multiple database scans — it scans D once for each level kk, which is very costly for large databases.
  2. Huge number of candidate itemsets — e.g. 10410^4 frequent 1-itemsets can generate 107\sim 10^7 candidate 2-itemsets.
  3. High memory and computation to store and test candidates against every transaction.
  4. Performance degrades sharply with low support thresholds and long patterns.

FP-Growth and How It Overcomes These Limitations

FP-Growth (Frequent-Pattern Growth) mines frequent itemsets without candidate generation using a compact tree structure.

Steps

  1. Scan 1 — count item supports, discard infrequent items, sort frequent items by descending support.
  2. Scan 2 — build an FP-tree: insert each transaction's sorted frequent items as a path, sharing common prefixes; a header table links nodes of the same item.
  3. Mining — for each item (bottom-up), build its conditional pattern base and conditional FP-tree, recursively growing frequent patterns.

How it overcomes Apriori's limitations

  • Only two database scans are needed (versus many) — solves the repeated-scan problem.
  • No candidate generation — avoids the costly candidate explosion of Apriori.
  • Compact FP-tree stores the database in main memory by sharing prefixes, reducing space.
  • A divide-and-conquer strategy on conditional trees makes mining much faster, especially for dense data and long patterns.

Thus FP-Growth is typically an order of magnitude faster than Apriori on large datasets.

association-rulesapriorifp-growth
3long10 marks

Build a decision tree from a given dataset using ID3 with information gain as the attribute selection measure. Show all calculations.

Building a Decision Tree with ID3 (Information Gain)

(A standard dataset is assumed since the table was not supplied; the canonical "Play Tennis" dataset is used to demonstrate the full method and calculations.)

Dataset (14 examples)

OutlookTempHumidityWindPlayTennis
SunnyHotHighWeakNo
SunnyHotHighStrongNo
OvercastHotHighWeakYes
RainMildHighWeakYes
RainCoolNormalWeakYes
RainCoolNormalStrongNo
OvercastCoolNormalStrongYes
SunnyMildHighWeakNo
SunnyCoolNormalWeakYes
RainMildNormalWeakYes
SunnyMildNormalStrongYes
OvercastMildHighStrongYes
OvercastHotNormalWeakYes
RainMildHighStrongNo

Totals: 9 Yes, 5 No.

Formulas

Entropy(S)=ipilog2pi,Gain(S,A)=Entropy(S)vASvSEntropy(Sv)\text{Entropy}(S) = -\sum_i p_i \log_2 p_i, \qquad \text{Gain}(S,A) = \text{Entropy}(S) - \sum_{v \in A} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)

Step 1 — Entropy of the whole set

Entropy(S)=914log2914514log2514=0.940\text{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940

Step 2 — Gain of each attribute at the root

Outlook: Sunny(2Y,3N), Overcast(4Y,0N), Rain(3Y,2N)

  • E(Sunny)=0.971, E(Overcast)=0, E(Rain)=0.971
Gain(Outlook)=0.940(514(0.971)+414(0)+514(0.971))=0.9400.693=0.247\text{Gain(Outlook)} = 0.940 - \big(\tfrac{5}{14}(0.971)+\tfrac{4}{14}(0)+\tfrac{5}{14}(0.971)\big) = 0.940 - 0.693 = \mathbf{0.247}

Humidity: High(3Y,4N→0.985), Normal(6Y,1N→0.592)

Gain(Humidity)=0.940(714(0.985)+714(0.592))=0.151\text{Gain(Humidity)} = 0.940 - \big(\tfrac{7}{14}(0.985)+\tfrac{7}{14}(0.592)\big) = \mathbf{0.151}

Wind: Weak(6Y,2N→0.811), Strong(3Y,3N→1.0)

Gain(Wind)=0.940(814(0.811)+614(1.0))=0.048\text{Gain(Wind)} = 0.940 - \big(\tfrac{8}{14}(0.811)+\tfrac{6}{14}(1.0)\big) = \mathbf{0.048}

Temperature: Gain ≈ 0.029 (computed similarly).

Outlook has the highest gain (0.247) → it becomes the root.

Step 3 — Split on Outlook

  • Overcast → all 4 are Yes → leaf Yes (pure).
  • Sunny and Rain are impure → recurse.

Sunny branch (2Y, 3N, E=0.971)

  • Gain(Humidity)= 0.971 − (3/5·0 + 2/5·0) = 0.971 (High→all No, Normal→all Yes)
  • Gain(Temp)=0.571, Gain(Wind)=0.020 → Split on Humidity: High → No, Normal → Yes (both pure leaves).

Rain branch (3Y, 2N, E=0.971)

  • Gain(Wind)= 0.971 − (3/5·0 + 2/5·0) = 0.971 (Weak→all Yes, Strong→all No) → Split on Wind: Weak → Yes, Strong → No.

Final Decision Tree

Outlook?
 ├─ Sunny    → Humidity?
 │              ├─ High   → No
 │              └─ Normal → Yes
 ├─ Overcast → Yes
 └─ Rain     → Wind?
                ├─ Weak   → Yes
                └─ Strong → No

ID3 stops because every leaf is now pure (all examples same class).

classificationdecision-treeid3
B

Section B: Short Answer Questions

Attempt any EIGHT questions.

9 questions·5 marks each
4short5 marks

What is beam search? Explain.

Beam Search

Beam search is a heuristic search algorithm that explores a graph/tree by expanding only the most promising nodes at each level, keeping memory bounded.

  • At each level it generates all successors of the current candidates, scores them with a heuristic, and retains only the best ww nodes (where ww is the beam width). The rest are discarded.
  • It is a bounded-width best-first search: a compromise between greedy search (w=1w=1) and full breadth-first/best-first search (w=w=\infty).

Characteristics

  • Width ww controls the trade-off: larger ww → better solutions but more memory/time; smaller ww → faster but may miss the optimum.
  • Not complete and not optimal — pruning may discard the path to the goal.
  • Memory-efficient: only O(w)O(w) nodes per level are stored.

Uses in data mining / AI

Feature subset selection, rule induction (e.g. CN2), sequence decoding in NLP/machine translation, and large search spaces where exhaustive search is infeasible.

search
5short5 marks

Explain signed networks in the context of data mining.

Signed Networks

A signed network is a graph in which each edge carries a sign — positive (+) or negative (−) — representing the nature of the relationship between two nodes, not just its existence.

  • Positive edge: friendship, trust, agreement, similarity, "like."
  • Negative edge: enmity, distrust, disagreement, "dislike," antagonism.

Examples: social platforms (friend/foe, trust/distrust on Epinions or Slashdot), product reviews, and protein interaction networks.

Structural balance theory

Signed networks are analyzed using balance theory, governed by rules on triads:

  • "A friend of my friend is my friend" (+,+,+) — balanced.
  • "An enemy of my enemy is my friend" (−,−,+) — balanced.
  • Triads with an odd number of negative edges (e.g. +,+,−) are unbalanced and tend to be unstable.

Relevance to data mining

  • Sign / link prediction: predict whether an unknown edge is + or −.
  • Community detection: clusters have dense positive ties internally and negative ties between groups.
  • Trust propagation and recommendation: spreading trust/distrust to rank or recommend.
  • Anomaly detection: unbalanced or hostile substructures.

Thus signed networks let data mining capture polarity of relationships, enabling richer analysis than ordinary (unsigned) graphs.

networks
6short5 marks

What is a concept hierarchy? How is it generated?

Concept Hierarchy

A concept hierarchy defines a sequence of mappings from a set of low-level (detailed) concepts to higher-level (more general) concepts. It organizes the values of an attribute into multiple levels of abstraction and is used in data warehousing and OLAP for roll-up/drill-down and for attribute-oriented generalization.

Example (Location): street → city → state → country. Example (Time): day → month → quarter → year.

How concept hierarchies are generated

  1. Schema-level (specified by users/experts) — a total or partial ordering of attributes is given explicitly, e.g. street < city < state < country.
  2. Set-grouping hierarchy — explicitly defining groupings of values, e.g. price {0–100} = cheap, {100–500} = moderate, {>500} = expensive.
  3. Automatic generation for numeric data (discretization) using:
    • Binning (equal-width / equal-frequency),
    • Histogram analysis,
    • Clustering,
    • Entropy-based discretization,
    • Natural partitioning (3-4-5 rule).
  4. Automatic generation for categorical data based on the number of distinct values — the attribute with the most distinct values is placed at the lowest level and the one with the fewest at the highest level (e.g. country has fewer distinct values than street, so country is higher).

Concept hierarchies thus allow data to be viewed and mined at different levels of granularity.

concept-hierarchy
7short5 marks

Compare different classifiers (decision tree vs. Naive Bayes).

Decision Tree vs Naive Bayes Classifier

AspectDecision TreeNaive Bayes
Type / basisDiscriminative; recursive partitioning using information gain / GiniGenerative; probabilistic, based on Bayes' theorem
Core assumptionNone about feature independenceAssumes all features are conditionally independent given the class
Model outputA tree of if-then rules; interpretablePosterior probabilities P(CX)P(C)iP(xiC)P(C\mid X) \propto P(C)\prod_i P(x_i\mid C)
Handling of featuresHandles numeric & categorical; captures feature interactionsWorks well with categorical; needs distribution (e.g. Gaussian) for numeric; ignores interactions
Training speedSlower (must evaluate splits)Very fast (just count probabilities)
Data requirementNeeds reasonably large data; can overfitWorks well even with small data
OverfittingProne to overfitting (needs pruning)Less prone; high bias
InterpretabilityHighly interpretable (visual rules)Less intuitive (probabilities)
RobustnessSensitive to noisy data and small changesRobust to noise and irrelevant features; sensitive when independence is badly violated

Summary

Use a decision tree when interpretability and feature interactions matter; use Naive Bayes when features are roughly independent, data is small, training must be fast (e.g. text classification / spam filtering).

classification
8short5 marks

Explain the DBSCAN clustering algorithm.

DBSCAN Clustering Algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are densely packed, marking sparse points as noise/outliers. It can find clusters of arbitrary shape and does not require the number of clusters to be specified in advance.

Key parameters

  • Eps (ε\varepsilon): radius of the neighborhood around a point.
  • MinPts: minimum number of points required within the Eps-neighborhood to form a dense region.

Point types

  • Core point: has at least MinPts points (including itself) within its ε\varepsilon-neighborhood.
  • Border point: within ε\varepsilon of a core point but has fewer than MinPts neighbors.
  • Noise point: neither core nor border.

Key concepts

  • Directly density-reachable, density-reachable, and density-connected define how points join a cluster from core points.

Algorithm

for each unvisited point p:
    mark p visited
    N = neighbors of p within Eps
    if |N| < MinPts:
        mark p as NOISE          // may later become border
    else:
        create new cluster C; add p
        expand cluster: for each point q in N (and their neighbors),
            if q is unvisited: mark visited, get its neighbors,
               if it is a core point, add its neighbors to N
            if q not in any cluster: add q to C

Advantages

  • Finds arbitrarily shaped clusters; identifies outliers/noise; no need to predefine k.

Disadvantages

  • Sensitive to choice of Eps and MinPts; struggles with clusters of varying density and high-dimensional data.
clusteringdbscan
9short5 marks

What is data integration? Explain the issues involved.

Data Integration

Data integration is the process of combining data from multiple heterogeneous sources (databases, data cubes, flat files) into a coherent, unified store such as a data warehouse. It is a key step in data preprocessing.

Major issues involved

  1. Schema integration & entity identification problem — matching equivalent real-world entities from different sources, e.g. customer_id in one DB vs cust_number in another. Metadata helps resolve such conflicts.
  2. Redundancy — an attribute may be derivable from another, or duplicated across sources. Redundancy can be detected by correlation analysis (Pearson's rr for numeric, chi-square χ2\chi^2 test for categorical) and covariance.
  3. Tuple duplication — the same record appearing more than once must be detected and removed.
  4. Data value conflicts — the same attribute may have different values/units/representations across sources (e.g. price in NPR vs USD, metric vs imperial, different encodings or scales).
  5. Inconsistent metadata / naming — different names, formats, or meanings (semantic heterogeneity) for the same concept.

Resolving these issues produces clean, non-redundant, consistent integrated data, improving the speed and quality of subsequent mining.

preprocessing
10short5 marks

Define lift and conviction in association rules.

Lift and Conviction in Association Rules

For a rule ABA \Rightarrow B:

Lift

Lift measures how much more likely BB is given AA, compared to BB's overall likelihood — i.e. the correlation between AA and BB.

Lift(AB)=P(AB)P(A)P(B)=Confidence(AB)P(B)\text{Lift}(A \Rightarrow B) = \frac{P(A \cup B)}{P(A)\,P(B)} = \frac{\text{Confidence}(A \Rightarrow B)}{P(B)}

Interpretation:

  • Lift = 1AA and BB are independent (no relationship).
  • Lift > 1positive correlation (presence of AA increases likelihood of BB); the rule is interesting.
  • Lift < 1negative correlation (AA discourages BB).

Conviction

Conviction measures how much the rule would be wrong if AA and BB were independent — i.e. the ratio of expected frequency of AA without BB to the observed frequency of incorrect predictions.

Conviction(AB)=1P(B)1Confidence(AB)=P(A)P(¬B)P(A¬B)\text{Conviction}(A \Rightarrow B) = \frac{1 - P(B)}{1 - \text{Confidence}(A \Rightarrow B)} = \frac{P(A)\,P(\lnot B)}{P(A \cup \lnot B)}

Interpretation:

  • Conviction = 1 → independence.
  • Conviction > 1 → a meaningful (correct, directional) rule; higher value means stronger implication.
  • Conviction = ∞ when confidence = 1 (the rule never fails). Unlike lift, conviction is directional (it distinguishes ABA\Rightarrow B from BAB\Rightarrow A).
association-rules
11short5 marks

Explain data reduction techniques.

Data Reduction Techniques

Data reduction obtains a reduced representation of a dataset that is much smaller in volume yet produces (almost) the same analytical/mining results. It reduces storage and speeds up mining without significant loss of quality.

Main techniques

  1. Dimensionality reduction — reduce the number of attributes/variables:
    • Wavelet transforms and Principal Component Analysis (PCA) project data onto fewer dimensions.
    • Attribute subset selection (e.g. stepwise forward selection, backward elimination, decision-tree induction) removes irrelevant/redundant attributes.
  2. Numerosity reduction — replace data by smaller representations:
    • Parametric: fit a model (e.g. regression, log-linear models) and store only parameters.
    • Non-parametric: histograms, clustering, sampling (e.g. SRSWOR, SRSWR, stratified sampling), and data cube aggregation.
  3. Data compression — apply encoding/transformations to compress the dataset (lossless or lossy, e.g. wavelet, PCA outputs).
  4. Data cube aggregation — aggregate detailed data to a higher summary level (e.g. daily sales → yearly sales).
  5. Discretization & concept hierarchy generation — replace raw values by interval/conceptual labels (e.g. age → {young, middle-aged, senior}), reducing distinct values.

These techniques cut data size while preserving the integrity needed for accurate mining.

preprocessing
12short5 marks

Write short notes on text mining.

Short Note: Text Mining

Text mining (text data mining / knowledge discovery from text) is the process of extracting useful, non-trivial patterns and knowledge from large collections of unstructured or semi-structured text (documents, emails, web pages, social media, reviews). Since most text is unstructured, it must first be converted into a structured form before mining.

Typical process / steps

  1. Text preprocessing — tokenization, stop-word removal, stemming/lemmatization, case normalization.
  2. Feature extraction / representation — build a vector space / bag-of-words or TF-IDF model; term-document matrix; n-grams or embeddings.
  3. Dimensionality reduction — e.g. LSI/LSA, feature selection.
  4. Mining / analysis — apply data-mining techniques: classification, clustering, association, summarization.

Common applications

  • Document classification & categorization (e.g. news, email folders).
  • Spam filtering and sentiment analysis of reviews/social media.
  • Information retrieval and extraction (named-entity recognition).
  • Topic detection / document clustering and text summarization.
  • Search engines and question answering.

Challenges

High dimensionality, ambiguity of natural language (synonymy, polysemy), noise, and multilingual content.

In short, text mining bridges information retrieval, NLP, and data mining to turn raw text into actionable knowledge.

text-mining

Frequently asked questions

Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2078?
The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2078 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2078 paper come with solutions?
Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2078 paper?
The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2078 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.