BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2078 Nepal
This is the official BSc CSIT (TU) (Science stream) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper for 2078, as set in the regular annual examination. It carries 60 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) exam or solving previous years' question papers, this 2078 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt any TWO questions.
What are the advantages and disadvantages of MOLAP over ROLAP? Explain the multidimensional data model with the help of a data cube.
MOLAP vs ROLAP
MOLAP (Multidimensional OLAP) stores aggregated data in optimized multidimensional array structures (data cubes). ROLAP (Relational OLAP) stores data in relational tables (star/snowflake schema) and generates SQL on the fly.
Advantages of MOLAP over ROLAP
- Faster query performance — pre-computed and pre-aggregated cube cells give near-instant slice/dice/drill-down response.
- Compact storage for dense cubes via array indexing (no need to store sparse keys).
- Rich analytical functions (time-series, complex calculations) are natively supported.
- Automatic aggregation is handled by the cube engine.
Disadvantages of MOLAP over ROLAP
- Poor scalability — cube size explodes with many dimensions (the curse of dimensionality); cannot easily handle very large data volumes.
- Sparse cube wastage — empty cells still consume structure overhead.
- Limited detail — usually holds summarized data, so drilling to fine-grained records is hard.
- Long cube build/load time and the cube must be rebuilt when source data changes.
- ROLAP, in contrast, scales to terabytes, leverages mature RDBMS technology, and handles ad-hoc queries on detailed data.
Multidimensional Data Model and the Data Cube
The multidimensional data model views data as a set of measures (numeric facts such as sales, units_sold) organized by several dimensions (descriptive perspectives such as Time, Item, Location).
- Dimension: a perspective for analysis, e.g. Time (day, month, quarter, year), Item (name, type, brand), Location (city, region, country).
- Fact / Measure: the quantity analyzed, e.g. dollars sold.
- Data Cube: an n-dimensional generalization of a 2-D spreadsheet that lets us model and view data along multiple dimensions.
Example data cube (3-D)
A cube with dimensions Time, Item, Location where each cell holds the measure sales amount:
| Time \ Item (in City = Kathmandu) | TV | Phone | PC |
|---|---|---|---|
| Q1 | 605 | 825 | 14 |
| Q2 | 680 | 952 | 31 |
| Q3 | 812 | 1023 | 30 |
| Q4 | 927 | 1038 | 38 |
Stacking such tables for several cities gives the third dimension (Location), producing a cube. Higher-dimensional data is modeled as a series of cuboids; the lattice of cuboids forms the cube.
OLAP operations on the cube
- Roll-up: aggregate to a higher level (city → country).
- Drill-down: go to finer detail (quarter → month).
- Slice: select one dimension value (Time = Q1) giving a sub-cube.
- Dice: select a range on several dimensions.
- Pivot (rotate): reorient the cube for presentation.
This multidimensional structure is what enables fast, intuitive analytical querying.
Explain the Apriori algorithm in detail. Discuss the limitations of the Apriori algorithm and how the FP-growth algorithm overcomes them.
The Apriori Algorithm
Apriori is a classic algorithm for mining frequent itemsets and generating association rules from transactional data. It is named after its use of prior knowledge of frequent-itemset properties.
Key principle (Apriori property)
Every non-empty subset of a frequent itemset must also be frequent (anti-monotonicity). Equivalently, if an itemset is infrequent, all its supersets are infrequent and can be pruned.
Algorithm (level-wise search)
Input: transaction database D, minimum support min_sup
1. L1 = {frequent 1-itemsets} // scan D, count supports
2. for (k = 2; L(k-1) != empty; k++) {
3. Ck = apriori_gen(L(k-1)) // candidate generation
4. - JOIN: combine pairs of (k-1)-itemsets sharing first k-2 items
5. - PRUNE: drop any candidate having an infrequent (k-1)-subset
6. for each transaction t in D: // scan database
7. increment count of candidates in Ck contained in t
8. Lk = { c in Ck | support(c) >= min_sup }
9. }
10. return L = union of all Lk
After finding frequent itemsets, strong association rules () are generated for those with .
Limitations of Apriori
- Multiple database scans — it scans D once for each level , which is very costly for large databases.
- Huge number of candidate itemsets — e.g. frequent 1-itemsets can generate candidate 2-itemsets.
- High memory and computation to store and test candidates against every transaction.
- Performance degrades sharply with low support thresholds and long patterns.
FP-Growth and How It Overcomes These Limitations
FP-Growth (Frequent-Pattern Growth) mines frequent itemsets without candidate generation using a compact tree structure.
Steps
- Scan 1 — count item supports, discard infrequent items, sort frequent items by descending support.
- Scan 2 — build an FP-tree: insert each transaction's sorted frequent items as a path, sharing common prefixes; a header table links nodes of the same item.
- Mining — for each item (bottom-up), build its conditional pattern base and conditional FP-tree, recursively growing frequent patterns.
How it overcomes Apriori's limitations
- Only two database scans are needed (versus many) — solves the repeated-scan problem.
- No candidate generation — avoids the costly candidate explosion of Apriori.
- Compact FP-tree stores the database in main memory by sharing prefixes, reducing space.
- A divide-and-conquer strategy on conditional trees makes mining much faster, especially for dense data and long patterns.
Thus FP-Growth is typically an order of magnitude faster than Apriori on large datasets.
Build a decision tree from a given dataset using ID3 with information gain as the attribute selection measure. Show all calculations.
Building a Decision Tree with ID3 (Information Gain)
(A standard dataset is assumed since the table was not supplied; the canonical "Play Tennis" dataset is used to demonstrate the full method and calculations.)
Dataset (14 examples)
| Outlook | Temp | Humidity | Wind | PlayTennis |
|---|---|---|---|---|
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
| Sunny | Cool | Normal | Weak | Yes |
| Rain | Mild | Normal | Weak | Yes |
| Sunny | Mild | Normal | Strong | Yes |
| Overcast | Mild | High | Strong | Yes |
| Overcast | Hot | Normal | Weak | Yes |
| Rain | Mild | High | Strong | No |
Totals: 9 Yes, 5 No.
Formulas
Step 1 — Entropy of the whole set
Step 2 — Gain of each attribute at the root
Outlook: Sunny(2Y,3N), Overcast(4Y,0N), Rain(3Y,2N)
- E(Sunny)=0.971, E(Overcast)=0, E(Rain)=0.971
Humidity: High(3Y,4N→0.985), Normal(6Y,1N→0.592)
Wind: Weak(6Y,2N→0.811), Strong(3Y,3N→1.0)
Temperature: Gain ≈ 0.029 (computed similarly).
Outlook has the highest gain (0.247) → it becomes the root.
Step 3 — Split on Outlook
- Overcast → all 4 are Yes → leaf Yes (pure).
- Sunny and Rain are impure → recurse.
Sunny branch (2Y, 3N, E=0.971)
- Gain(Humidity)= 0.971 − (3/5·0 + 2/5·0) = 0.971 (High→all No, Normal→all Yes)
- Gain(Temp)=0.571, Gain(Wind)=0.020 → Split on Humidity: High → No, Normal → Yes (both pure leaves).
Rain branch (3Y, 2N, E=0.971)
- Gain(Wind)= 0.971 − (3/5·0 + 2/5·0) = 0.971 (Weak→all Yes, Strong→all No) → Split on Wind: Weak → Yes, Strong → No.
Final Decision Tree
Outlook?
├─ Sunny → Humidity?
│ ├─ High → No
│ └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind?
├─ Weak → Yes
└─ Strong → No
ID3 stops because every leaf is now pure (all examples same class).
Section B: Short Answer Questions
Attempt any EIGHT questions.
What is beam search? Explain.
Beam Search
Beam search is a heuristic search algorithm that explores a graph/tree by expanding only the most promising nodes at each level, keeping memory bounded.
- At each level it generates all successors of the current candidates, scores them with a heuristic, and retains only the best nodes (where is the beam width). The rest are discarded.
- It is a bounded-width best-first search: a compromise between greedy search () and full breadth-first/best-first search ().
Characteristics
- Width controls the trade-off: larger → better solutions but more memory/time; smaller → faster but may miss the optimum.
- Not complete and not optimal — pruning may discard the path to the goal.
- Memory-efficient: only nodes per level are stored.
Uses in data mining / AI
Feature subset selection, rule induction (e.g. CN2), sequence decoding in NLP/machine translation, and large search spaces where exhaustive search is infeasible.
Explain signed networks in the context of data mining.
Signed Networks
A signed network is a graph in which each edge carries a sign — positive (+) or negative (−) — representing the nature of the relationship between two nodes, not just its existence.
- Positive edge: friendship, trust, agreement, similarity, "like."
- Negative edge: enmity, distrust, disagreement, "dislike," antagonism.
Examples: social platforms (friend/foe, trust/distrust on Epinions or Slashdot), product reviews, and protein interaction networks.
Structural balance theory
Signed networks are analyzed using balance theory, governed by rules on triads:
- "A friend of my friend is my friend" (+,+,+) — balanced.
- "An enemy of my enemy is my friend" (−,−,+) — balanced.
- Triads with an odd number of negative edges (e.g. +,+,−) are unbalanced and tend to be unstable.
Relevance to data mining
- Sign / link prediction: predict whether an unknown edge is + or −.
- Community detection: clusters have dense positive ties internally and negative ties between groups.
- Trust propagation and recommendation: spreading trust/distrust to rank or recommend.
- Anomaly detection: unbalanced or hostile substructures.
Thus signed networks let data mining capture polarity of relationships, enabling richer analysis than ordinary (unsigned) graphs.
What is a concept hierarchy? How is it generated?
Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level (detailed) concepts to higher-level (more general) concepts. It organizes the values of an attribute into multiple levels of abstraction and is used in data warehousing and OLAP for roll-up/drill-down and for attribute-oriented generalization.
Example (Location): street → city → state → country.
Example (Time): day → month → quarter → year.
How concept hierarchies are generated
- Schema-level (specified by users/experts) — a total or partial ordering of attributes is given explicitly, e.g.
street < city < state < country. - Set-grouping hierarchy — explicitly defining groupings of values, e.g. price
{0–100} = cheap,{100–500} = moderate,{>500} = expensive. - Automatic generation for numeric data (discretization) using:
- Binning (equal-width / equal-frequency),
- Histogram analysis,
- Clustering,
- Entropy-based discretization,
- Natural partitioning (3-4-5 rule).
- Automatic generation for categorical data based on the number of distinct values — the attribute with the most distinct values is placed at the lowest level and the one with the fewest at the highest level (e.g. country has fewer distinct values than street, so country is higher).
Concept hierarchies thus allow data to be viewed and mined at different levels of granularity.
Compare different classifiers (decision tree vs. Naive Bayes).
Decision Tree vs Naive Bayes Classifier
| Aspect | Decision Tree | Naive Bayes |
|---|---|---|
| Type / basis | Discriminative; recursive partitioning using information gain / Gini | Generative; probabilistic, based on Bayes' theorem |
| Core assumption | None about feature independence | Assumes all features are conditionally independent given the class |
| Model output | A tree of if-then rules; interpretable | Posterior probabilities |
| Handling of features | Handles numeric & categorical; captures feature interactions | Works well with categorical; needs distribution (e.g. Gaussian) for numeric; ignores interactions |
| Training speed | Slower (must evaluate splits) | Very fast (just count probabilities) |
| Data requirement | Needs reasonably large data; can overfit | Works well even with small data |
| Overfitting | Prone to overfitting (needs pruning) | Less prone; high bias |
| Interpretability | Highly interpretable (visual rules) | Less intuitive (probabilities) |
| Robustness | Sensitive to noisy data and small changes | Robust to noise and irrelevant features; sensitive when independence is badly violated |
Summary
Use a decision tree when interpretability and feature interactions matter; use Naive Bayes when features are roughly independent, data is small, training must be fast (e.g. text classification / spam filtering).
Explain the DBSCAN clustering algorithm.
DBSCAN Clustering Algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are densely packed, marking sparse points as noise/outliers. It can find clusters of arbitrary shape and does not require the number of clusters to be specified in advance.
Key parameters
- Eps (): radius of the neighborhood around a point.
- MinPts: minimum number of points required within the Eps-neighborhood to form a dense region.
Point types
- Core point: has at least MinPts points (including itself) within its -neighborhood.
- Border point: within of a core point but has fewer than MinPts neighbors.
- Noise point: neither core nor border.
Key concepts
- Directly density-reachable, density-reachable, and density-connected define how points join a cluster from core points.
Algorithm
for each unvisited point p:
mark p visited
N = neighbors of p within Eps
if |N| < MinPts:
mark p as NOISE // may later become border
else:
create new cluster C; add p
expand cluster: for each point q in N (and their neighbors),
if q is unvisited: mark visited, get its neighbors,
if it is a core point, add its neighbors to N
if q not in any cluster: add q to C
Advantages
- Finds arbitrarily shaped clusters; identifies outliers/noise; no need to predefine k.
Disadvantages
- Sensitive to choice of Eps and MinPts; struggles with clusters of varying density and high-dimensional data.
What is data integration? Explain the issues involved.
Data Integration
Data integration is the process of combining data from multiple heterogeneous sources (databases, data cubes, flat files) into a coherent, unified store such as a data warehouse. It is a key step in data preprocessing.
Major issues involved
- Schema integration & entity identification problem — matching equivalent real-world entities from different sources, e.g.
customer_idin one DB vscust_numberin another. Metadata helps resolve such conflicts. - Redundancy — an attribute may be derivable from another, or duplicated across sources. Redundancy can be detected by correlation analysis (Pearson's for numeric, chi-square test for categorical) and covariance.
- Tuple duplication — the same record appearing more than once must be detected and removed.
- Data value conflicts — the same attribute may have different values/units/representations across sources (e.g. price in NPR vs USD, metric vs imperial, different encodings or scales).
- Inconsistent metadata / naming — different names, formats, or meanings (semantic heterogeneity) for the same concept.
Resolving these issues produces clean, non-redundant, consistent integrated data, improving the speed and quality of subsequent mining.
Define lift and conviction in association rules.
Lift and Conviction in Association Rules
For a rule :
Lift
Lift measures how much more likely is given , compared to 's overall likelihood — i.e. the correlation between and .
Interpretation:
- Lift = 1 → and are independent (no relationship).
- Lift > 1 → positive correlation (presence of increases likelihood of ); the rule is interesting.
- Lift < 1 → negative correlation ( discourages ).
Conviction
Conviction measures how much the rule would be wrong if and were independent — i.e. the ratio of expected frequency of without to the observed frequency of incorrect predictions.
Interpretation:
- Conviction = 1 → independence.
- Conviction > 1 → a meaningful (correct, directional) rule; higher value means stronger implication.
- Conviction = ∞ when confidence = 1 (the rule never fails). Unlike lift, conviction is directional (it distinguishes from ).
Explain data reduction techniques.
Data Reduction Techniques
Data reduction obtains a reduced representation of a dataset that is much smaller in volume yet produces (almost) the same analytical/mining results. It reduces storage and speeds up mining without significant loss of quality.
Main techniques
- Dimensionality reduction — reduce the number of attributes/variables:
- Wavelet transforms and Principal Component Analysis (PCA) project data onto fewer dimensions.
- Attribute subset selection (e.g. stepwise forward selection, backward elimination, decision-tree induction) removes irrelevant/redundant attributes.
- Numerosity reduction — replace data by smaller representations:
- Parametric: fit a model (e.g. regression, log-linear models) and store only parameters.
- Non-parametric: histograms, clustering, sampling (e.g. SRSWOR, SRSWR, stratified sampling), and data cube aggregation.
- Data compression — apply encoding/transformations to compress the dataset (lossless or lossy, e.g. wavelet, PCA outputs).
- Data cube aggregation — aggregate detailed data to a higher summary level (e.g. daily sales → yearly sales).
- Discretization & concept hierarchy generation — replace raw values by interval/conceptual labels (e.g. age → {young, middle-aged, senior}), reducing distinct values.
These techniques cut data size while preserving the integrity needed for accurate mining.
Write short notes on text mining.
Short Note: Text Mining
Text mining (text data mining / knowledge discovery from text) is the process of extracting useful, non-trivial patterns and knowledge from large collections of unstructured or semi-structured text (documents, emails, web pages, social media, reviews). Since most text is unstructured, it must first be converted into a structured form before mining.
Typical process / steps
- Text preprocessing — tokenization, stop-word removal, stemming/lemmatization, case normalization.
- Feature extraction / representation — build a vector space / bag-of-words or TF-IDF model; term-document matrix; n-grams or embeddings.
- Dimensionality reduction — e.g. LSI/LSA, feature selection.
- Mining / analysis — apply data-mining techniques: classification, clustering, association, summarization.
Common applications
- Document classification & categorization (e.g. news, email folders).
- Spam filtering and sentiment analysis of reviews/social media.
- Information retrieval and extraction (named-entity recognition).
- Topic detection / document clustering and text summarization.
- Search engines and question answering.
Challenges
High dimensionality, ambiguity of natural language (synonymy, polysemy), noise, and multilingual content.
In short, text mining bridges information retrieval, NLP, and data mining to turn raw text into actionable knowledge.
Frequently asked questions
- Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2078?
- The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2078 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2078 paper come with solutions?
- Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2078 paper?
- The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2078 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
- Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.