BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2081 Nepal
This is the official BSc CSIT (TU) (Science stream) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper for 2081, as set in the regular annual examination. It carries 60 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) exam or solving previous years' question papers, this 2081 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt any TWO questions.
What is data mining? Explain the knowledge discovery in databases (KDD) process with a diagram, and discuss major data mining functionalities.
Data Mining
Data mining is the process of discovering interesting, previously unknown, valid, and potentially useful patterns, correlations, and knowledge from large volumes of data stored in databases, data warehouses, or other repositories. It is also called Knowledge Discovery from Data (KDD) and combines techniques from statistics, machine learning, and database systems.
Knowledge Discovery in Databases (KDD) Process
KDD is the overall process of converting raw data into useful knowledge. Data mining is one core step within it. The major steps are:
- Data Cleaning – remove noise and inconsistent data.
- Data Integration – combine data from multiple heterogeneous sources.
- Data Selection – retrieve data relevant to the analysis task from the database.
- Data Transformation – transform/consolidate data into forms suitable for mining (e.g., normalization, aggregation).
- Data Mining – apply intelligent methods to extract patterns.
- Pattern Evaluation – identify the truly interesting patterns using interestingness measures.
- Knowledge Presentation – use visualization and knowledge-representation techniques to present mined knowledge to the user.
Diagram (described): A left-to-right pipeline of stacked boxes:
Databases / Data Warehouse → [Cleaning + Integration] → Data Warehouse → [Selection + Transformation] → Task-relevant Data → [Data Mining] → Patterns → [Evaluation + Presentation] → Knowledge
The first four steps are preprocessing (the bulk of the effort), data mining is the core step, and the last two deliver knowledge to the user.
Major Data Mining Functionalities
- Characterization & Discrimination – summarize general features of a class (characterization) or compare features of contrasting classes (discrimination).
- Association & Correlation Analysis – discover frequent patterns and association rules (e.g., market-basket analysis: bread ⇒ butter).
- Classification – build a model to assign data to predefined categorical class labels (e.g., decision tree, Naïve Bayes).
- Prediction (Regression) – predict missing or unavailable numeric values.
- Cluster Analysis – group similar objects into clusters without predefined labels (unsupervised).
- Outlier / Anomaly Analysis – detect objects that deviate from the general behaviour (e.g., fraud detection).
- Evolution Analysis – model trends for data that changes over time.
Patterns are judged using interestingness measures: validity, novelty, usefulness, and understandability.
Explain association rule mining. Apply the Apriori algorithm on a sample transaction dataset to generate strong association rules using given minimum support and confidence.
Association Rule Mining
Association rule mining discovers interesting relationships (co-occurrences) among items in large transactional datasets. A rule has the form where are itemsets and .
Two key measures:
- Support: — fraction of transactions containing both and .
- Confidence: .
A strong rule satisfies both minimum support and minimum confidence.
Apriori Algorithm
Apriori uses the downward-closure (anti-monotone) property: every subset of a frequent itemset must also be frequent. It works level-wise, generating candidate -itemsets () from frequent -itemsets and pruning those with infrequent subsets.
Worked Example
Consider 4 transactions with min_support = 50% (i.e., 2 transactions) and min_confidence = 70%:
| TID | Items |
|---|---|
| T1 | A, B, C |
| T2 | A, C |
| T3 | A, D |
| T4 | B, C |
Step 1 – → (count each item):
| Item | Count | Frequent? |
|---|---|---|
| A | 3 | ✓ |
| B | 2 | ✓ |
| C | 3 | ✓ |
| D | 1 | ✗ (pruned) |
Step 2 – → :
| Itemset | Count | Frequent? |
|---|---|---|
| {A,B} | 1 | ✗ |
| {A,C} | 2 | ✓ |
| {B,C} | 2 | ✓ |
Step 3 – : Candidate {A,B,C} requires subset {A,B} which is not frequent → pruned. No frequent 3-itemsets, so we stop.
Generating Strong Rules from frequent itemsets
From {A,C} (support = 2):
- : conf → rejected
- : conf → rejected
From {B,C} (support = 2):
- : conf → strong rule ✓
- : conf → rejected
Result: The only strong association rule is (support 50%, confidence 100%), meaning every customer who buys B also buys C.
Explain classification using a decision tree. Construct a decision tree using ID3 for the given dataset and predict the class label of a new instance.
Classification using a Decision Tree
Classification builds a model that assigns data objects to predefined categorical class labels. A decision tree is a flowchart-like tree where each internal node tests an attribute, each branch is an outcome of the test, and each leaf node holds a class label. To classify a new instance, we traverse from the root to a leaf following the attribute tests.
ID3 Algorithm
ID3 builds the tree top-down, selecting at each node the attribute with the highest information gain (greatest reduction in entropy).
Worked Example (PlayTennis-style dataset)
| Outlook | Temp | PlayTennis |
|---|---|---|
| Sunny | Hot | No |
| Sunny | Cool | No |
| Overcast | Hot | Yes |
| Overcast | Cool | Yes |
| Rain | Cool | Yes |
| Rain | Hot | No |
Class distribution: 3 Yes, 3 No → .
Gain(Outlook):
- Sunny (2 rows): both No → entropy 0
- Overcast (2 rows): both Yes → entropy 0
- Rain (2 rows): 1 Yes, 1 No → entropy 1.0
Gain(Temp): Hot {No,Yes,No}=entropy 0.918; Cool {No,Yes,Yes}=entropy 0.918 → .
Since , Outlook is the root.
Resulting Tree
Outlook
/ | \
Sunny Overcast Rain
| | |
No Yes Temp?
/ \
Cool Hot
| |
Yes No
Sunny and Overcast branches are pure (leaf = No / Yes). The Rain branch is split further on Temp (Cool→Yes, Hot→No).
Prediction for a new instance
New instance: (Outlook = Rain, Temp = Cool). Traverse: Outlook = Rain → test Temp → Temp = Cool → PlayTennis = Yes.
Thus the model predicts the class label Yes.
Section B: Short Answer Questions
Attempt any EIGHT questions.
Define data warehouse and list its key features.
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data that supports management decision-making (Bill Inmon's definition). It consolidates historical data from multiple operational sources into a central repository optimized for query, analysis, and reporting (OLAP) rather than transaction processing (OLTP).
Key Features
- Subject-oriented – organized around major subjects (e.g., customer, product, sales) rather than around applications.
- Integrated – data from heterogeneous sources is cleaned and made consistent (naming, units, encoding).
- Time-variant – stores historical data with a time dimension, enabling trend analysis over long periods.
- Non-volatile – data is loaded and read but not updated/deleted in real time; once entered it is stable.
- Supports OLAP & decision support – optimized for complex analytical queries, aggregation, and summarization.
Compare MOLAP, ROLAP, and HOLAP.
MOLAP vs ROLAP vs HOLAP
These are the three OLAP server architectures that differ in how multidimensional data is stored.
| Feature | MOLAP | ROLAP | HOLAP |
|---|---|---|---|
| Storage | Multidimensional arrays/cubes | Relational tables (star/snowflake) | Hybrid: cubes + relational |
| Data volume | Limited (sparse cube blow-up) | Very large, scalable | Large |
| Query speed | Fastest (pre-aggregated) | Slower (SQL joins/aggregation) | Fast for summary, scalable for detail |
| Pre-computation | High (precomputed aggregates) | Computed on the fly | Aggregates in cube, detail in tables |
| Storage cost | High (dense storage) | Lower | Moderate |
Summary:
- MOLAP (Multidimensional OLAP): stores data in optimized multidimensional cubes → fast queries but poor scalability for huge/sparse data.
- ROLAP (Relational OLAP): stores data in relational DBMS, generating SQL dynamically → highly scalable but slower queries.
- HOLAP (Hybrid OLAP): combines both — summary/aggregated data in a MOLAP cube for speed, detailed data in relational tables for scalability, giving a balance of performance and capacity.
Explain the slice and dice OLAP operations.
Slice and Dice OLAP Operations
OLAP operations let analysts explore a multidimensional data cube. Slice and Dice select sub-portions of the cube.
Slice
The slice operation selects a single value for one dimension, producing a sub-cube of lower dimensionality (a 2-D plane from a 3-D cube).
Example: From a cube with dimensions (Time, Item, Location), slice on Time = "Q1" returns a 2-D matrix of Item × Location sales for the first quarter only.
Dice
The dice operation selects a sub-cube by specifying a range/set of values on two or more dimensions, keeping the same dimensionality.
Example: dice for (Time ∈ {Q1, Q2}) AND (Item ∈ {Phone, Laptop}) AND (Location ∈ {Kathmandu, Pokhara}) returns a smaller cube restricted to those values.
Difference: Slice fixes one dimension to a single value (reduces dimensionality); Dice imposes conditions on multiple dimensions (keeps dimensionality but reduces size). Related operations include roll-up, drill-down, and pivot.
What are the limitations of the Apriori algorithm?
Limitations of the Apriori Algorithm
- Multiple database scans – Apriori scans the entire transaction database once for every level to count candidate supports. For long frequent patterns this means many costly I/O passes.
- Huge number of candidate itemsets – candidate generation can explode combinatorially; e.g., from frequent 1-itemsets it may generate ~ candidate 2-itemsets.
- High computational and memory cost – counting support for a massive candidate set and holding them in memory is expensive.
- Poor performance on dense/large datasets – with low minimum support, the number of frequent itemsets and candidates grows rapidly, degrading performance.
- Repeated work – it does not reuse counting effort across passes efficiently.
These limitations motivated improved methods such as the FP-Growth algorithm, which mines frequent patterns using a compact FP-tree without candidate generation and needs only two database scans.
Explain the DBSCAN algorithm and the concepts of core, border, and noise points.
DBSCAN Algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are densely packed and marks low-density points as noise. It can find arbitrarily shaped clusters and does not require the number of clusters in advance.
It uses two parameters:
- Eps () – radius of the neighbourhood around a point.
- MinPts – minimum number of points required within the -neighbourhood to form a dense region.
Point Types
- Core point: a point that has at least MinPts points (including itself) within its -neighbourhood. It lies in the interior of a cluster.
- Border point: has fewer than MinPts neighbours, but falls within the -neighbourhood of a core point. It lies on the edge of a cluster.
- Noise point (outlier): neither a core point nor a border point — it does not belong to any cluster.
Algorithm Steps
1. For each unvisited point p, find its ε-neighbourhood.
2. If |neighbourhood| ≥ MinPts → p is a core point; start a new cluster
and expand it by recursively adding all density-reachable points.
3. Otherwise mark p as noise (it may later become a border point
if found within a core point's neighbourhood).
4. Repeat until all points are visited.
Advantages: discovers clusters of arbitrary shape, robust to noise/outliers, no need to specify cluster count. Drawback: sensitive to the choice of and MinPts, and struggles with clusters of varying density.
What is normalization in data preprocessing? Explain min-max normalization.
Normalization in Data Preprocessing
Normalization is a data-transformation technique that scales attribute values into a small, specified range (commonly or ). It prevents attributes with large ranges (e.g., income) from dominating attributes with small ranges (e.g., age) in distance-based algorithms such as k-NN, k-means, and neural networks, ensuring all features contribute fairly.
Min-Max Normalization
Min-max normalization performs a linear transformation that maps the original value of attribute to a new value in the target range :
For the common target range this simplifies to .
Example: For income with , , mapping a value of to range :
Advantage: preserves the relationships among original data values. Drawback: sensitive to outliers and breaks if future data falls outside the original range.
Differentiate between classification and prediction.
Classification vs Prediction
Both are forms of supervised learning that build a model from training data, but they differ in the kind of output produced.
| Aspect | Classification | Prediction (Regression) |
|---|---|---|
| Output | Discrete / categorical class label | Continuous numeric value |
| Goal | Assign data to predefined classes | Estimate an unknown numeric quantity |
| Example | Is a loan applicant risky or safe? | Predict the amount of a customer's purchase |
| Models | Decision tree, Naïve Bayes, SVM, k-NN | Linear/non-linear regression |
| Evaluation | Accuracy, precision, recall | Error metrics (MSE, RMSE, MAE) |
Summary: Classification predicts categorical (class) labels (e.g., yes/no, high/medium/low), whereas prediction (often called regression) predicts continuous numeric values (e.g., salary, temperature). Both first build a model on labelled training data and then apply it to new, unseen data.
Explain the K-medoids clustering method.
K-Medoids Clustering
K-medoids is a partitioning clustering method that divides objects into clusters, where each cluster is represented by its medoid — the most centrally located actual data object in the cluster (unlike k-means, whose centre is a computed mean). Because the representative is a real object, k-medoids is more robust to noise and outliers than k-means.
The most common implementation is PAM (Partitioning Around Medoids).
Algorithm (PAM)
1. Arbitrarily choose k objects as the initial medoids.
2. Assign each remaining object to the cluster of the nearest medoid
(using a distance measure, e.g., Manhattan/Euclidean).
3. For each medoid m and each non-medoid o:
- tentatively swap m with o and compute the total cost
(sum of distances of all objects to their nearest medoid).
- if the swap reduces total cost, keep it; else undo it.
4. Repeat steps 2–3 until no swap improves the cost (medoids stabilize).
The objective is to minimize the sum of dissimilarities between objects and their assigned medoid:
Advantages: robust to outliers, works with arbitrary distance measures. Disadvantage: computationally costlier than k-means ( per iteration), so it does not scale well to very large datasets.
Write short notes on the applications of data mining.
Applications of Data Mining
Data mining is widely applied across industries to extract actionable knowledge from large datasets:
- Market Basket Analysis & Retail: association rule mining to discover items frequently bought together, enabling product placement, cross-selling, and promotions.
- Banking & Finance: credit scoring, loan-default prediction, and fraud detection through anomaly/outlier analysis.
- Healthcare & Bioinformatics: disease diagnosis and prediction, identifying effective treatments, and analyzing genetic/DNA sequences.
- Telecommunications: customer churn prediction, fraud detection, and network fault analysis.
- CRM & Marketing: customer segmentation, targeted advertising, and recommendation systems (e.g., e-commerce, streaming).
- Education: predicting student performance and dropout risk (educational data mining).
- Web & Social Media Mining: search engines, web-usage analysis, opinion/sentiment mining.
- Manufacturing & Science: quality control, fault prediction, and analysis of scientific/sensor data.
In short, data mining supports decision-making wherever large volumes of data must be turned into useful patterns and predictions.
Frequently asked questions
- Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2081?
- The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2081 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2081 paper come with solutions?
- Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2081 paper?
- The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2081 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
- Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.