BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2077 Nepal
This is the official BSc CSIT (TU) (Science stream) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper for 2077, as set in the regular annual examination. It carries 60 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) exam or solving previous years' question papers, this 2077 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt any TWO questions.
Define data warehousing. Explain the data warehouse design process and the components of a data warehouse with a diagram.
Data Warehousing
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process (W. H. Inmon). It consolidates data from multiple heterogeneous operational sources into a single, consistent store optimized for analytical querying (OLAP) rather than transaction processing.
Key characteristics:
- Subject-oriented — organized around major subjects (sales, customer) rather than applications.
- Integrated — data from different sources is cleaned and made consistent (naming, units, encoding).
- Time-variant — keeps historical data to allow trend analysis.
- Non-volatile — data is loaded and read but not normally updated/deleted by users.
Data Warehouse Design Process
The design typically follows these steps:
- Choose the business process to model (e.g., sales, shipments).
- Choose the grain — the level of detail of a fact-table record (e.g., one row per sale-line-item).
- Choose the dimensions that apply at that grain (time, item, branch, location).
- Choose the measures (facts) to populate each fact record (e.g., units_sold, dollars_sold).
Design approaches:
- Top-down (Inmon): build an enterprise warehouse first, then derive data marts.
- Bottom-up (Kimball): build dimensional data marts first, then integrate them via a bus architecture.
Components / Architecture
A three-tier architecture is commonly used:
- Bottom tier – Warehouse database server: the relational store; data is brought in by ETL (Extract, Transform, Load) tools from operational DBs and external sources via the data staging area, plus metadata and gateways.
- Middle tier – OLAP server: ROLAP or MOLAP server that presents a multidimensional view of data to users.
- Top tier – Front-end client tools: query/reporting, analysis, and data-mining tools.
Diagram (described in words): At the left are operational source databases and flat/external files. Arrows feed into the ETL / data staging area, which loads the central data warehouse (with its metadata repository and data marts). The warehouse feeds the OLAP server, which in turn serves the front-end tools (query, report, analysis, mining) used by analysts.
What is cluster analysis? Explain the DBSCAN algorithm. Given a dataset with Eps = 2.5 and MinPts = 3, identify core points, border points, and outliers.
Cluster Analysis
Cluster analysis is the process of grouping a set of data objects into clusters such that objects within the same cluster are highly similar to one another and dissimilar to objects in other clusters. It is an unsupervised learning task (no predefined class labels). Similarity is usually measured with a distance metric such as Euclidean distance:
DBSCAN Algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are densely packed and marks low-density points as noise. It uses two parameters:
- Eps (ε): radius of the neighborhood around a point.
- MinPts: minimum number of points required within the ε-neighborhood (including the point itself) to form a dense region.
Point types:
- Core point: has at least MinPts points within its ε-neighborhood.
- Border point: has fewer than MinPts neighbors but lies within the ε-neighborhood of a core point.
- Noise/outlier: neither a core nor a border point.
DBSCAN(D, Eps, MinPts):
mark all points unvisited
for each unvisited point p in D:
mark p visited
N = neighbors of p within Eps
if |N| < MinPts: mark p as NOISE // may later become border
else:
create new cluster C; add p to C
for each point q in N (expand):
if q unvisited:
mark q visited
N' = neighbors of q within Eps
if |N'| >= MinPts: N = N + N' // q is core
if q not yet in any cluster: add q to C
Worked Example (Eps = 2.5, MinPts = 3)
(The original paper supplies a point table; classification is illustrated here with a representative set of 2-D points.)
| Point | Coords | Neighbors within Eps=2.5 (incl. self) | Count | Type |
|---|---|---|---|---|
| A | (1,1) | A,B,C | 3 | Core |
| B | (2,1) | A,B,C | 3 | Core |
| C | (1,2) | A,B,C | 3 | Core |
| D | (8,8) | D,E | 2 | Border (in Eps of core E) |
| E | (8,9) | D,E,F | 3 | Core |
| F | (9,9) | E,F | 2 | Border |
| G | (15,15) | G | 1 | Outlier/Noise |
Result: Core points = {A, B, C, E}; Border points = {D, F}; Outliers/Noise = {G}. Cluster 1 = {A,B,C}, Cluster 2 = {D,E,F}, and G is noise.
Method: For each point compute how many points fall within distance 2.5; ≥3 ⇒ core, else if reachable from a core ⇒ border, otherwise ⇒ outlier.
Explain the Apriori principle. Using the Apriori algorithm, generate frequent itemsets and association rules for a given transaction dataset.
Apriori Principle
The Apriori principle states: If an itemset is frequent, then all of its subsets must also be frequent. Equivalently (the contrapositive, used for pruning): if an itemset is infrequent, all of its supersets are also infrequent. This anti-monotone property of support lets the algorithm prune the search space — candidate -itemsets are generated only from frequent -itemsets.
Key measures:
- Support:
- Confidence:
Worked Example
Transactions (min_support = 2, min_confidence = 60%):
| TID | Items |
|---|---|
| T1 | A, B, C |
| T2 | A, C |
| T3 | A, D |
| T4 | B, E, F |
| T5 | A, B, C |
Step 1 – C1 / L1 (1-itemsets, count ≥ 2): A=4, B=3, C=3, D=1, E=1, F=1 → L1 = {A, B, C}.
Step 2 – C2 / L2 (2-itemsets): {A,B}=2, {A,C}=3, {B,C}=2 → L2 = {AB, AC, BC} (all ≥ 2).
Step 3 – C3 / L3 (3-itemsets): Candidate {A,B,C} (all 2-subsets are frequent). Count {A,B,C}=2 ≥ 2 → L3 = {ABC}.
Frequent itemsets: A, B, C, AB, AC, BC, ABC.
Step 4 – Association rules from {A,B,C} (support=2/5):
| Rule | Confidence | Keep? |
|---|---|---|
| A,B ⇒ C | 2/2 = 100% | ✓ |
| A,C ⇒ B | 2/3 = 67% | ✓ |
| B,C ⇒ A | 2/2 = 100% | ✓ |
| A ⇒ B,C | 2/4 = 50% | ✗ |
| B ⇒ A,C | 2/3 = 67% | ✓ |
| C ⇒ A,B | 2/3 = 67% | ✓ |
Strong rules (conf ≥ 60%): {A,B}⇒C, {A,C}⇒B, {B,C}⇒A, B⇒{A,C}, C⇒{A,B}.
Section B: Short Answer Questions
Attempt any EIGHT questions.
What is data transformation? List its techniques.
Data Transformation
Data transformation is a data-preprocessing step in which data is converted or consolidated into forms appropriate for mining, so that the resulting mining process is more efficient and the patterns found are easier to understand.
Techniques:
- Smoothing — remove noise from data (binning, regression, clustering).
- Aggregation — summary or aggregation operations are applied (e.g., daily sales → monthly).
- Generalization — low-level data replaced by higher-level concepts using concept hierarchies (e.g., street → city).
- Normalization — scale attribute values to a small specified range, e.g.:
- Min-max:
- Z-score:
- Decimal scaling:
- Attribute/feature construction — new attributes are built from existing ones.
- Discretization — continuous values replaced by interval or conceptual labels.
Differentiate between ROLAP and MOLAP.
ROLAP vs MOLAP
| Aspect | ROLAP (Relational OLAP) | MOLAP (Multidimensional OLAP) |
|---|---|---|
| Storage | Data stored in relational tables (star/snowflake schema) | Data stored in multidimensional arrays/cubes |
| Access | SQL queries on RDBMS | Direct array indexing |
| Scalability | Handles very large data volumes well | Limited by cube size; less scalable for huge data |
| Query speed | Generally slower (joins computed at run time) | Faster for typical queries (pre-aggregated cube) |
| Sparsity | Handles sparse data efficiently | Sparse cubes waste storage |
| Pre-computation | Aggregations computed on the fly or as materialized views | Aggregations pre-computed and stored in the cube |
| Example tools | Microstrategy, Mondrian | Essbase, Cognos PowerPlay |
Summary: ROLAP scales to large, sparse data using relational storage and SQL but is slower; MOLAP offers fast query response through pre-aggregated multidimensional cubes but is limited in scalability and wastes space on sparse data. HOLAP combines both.
Explain the bottom-up and top-down approaches of building a data warehouse.
Approaches to Building a Data Warehouse
Top-Down Approach (Inmon)
An enterprise-wide data warehouse (EDW) is built first as a single, centralized, normalized repository of all corporate data. Data marts (subject-specific subsets) are then created from this central warehouse.
- Advantages: consistent enterprise view, single version of truth, easier to maintain dimensions.
- Disadvantages: high initial cost, long development time, complex.
Bottom-Up Approach (Kimball)
Individual data marts are built first for specific business areas (e.g., sales, finance) and later integrated into a larger warehouse using conformed dimensions (the data warehouse bus architecture).
- Advantages: faster delivery, lower initial cost, quick ROI, incremental.
- Disadvantages: possible data redundancy/inconsistency if marts are not well coordinated.
In short: Top-down = warehouse first, then marts (Inmon); Bottom-up = marts first, then warehouse (Kimball).
What is a confusion matrix? How is it used to evaluate a classifier?
Confusion Matrix
A confusion matrix is an table that summarizes the performance of a classifier by comparing predicted class labels with the actual class labels. For binary classification:
| Predicted Positive | Predicted Negative | |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Evaluating a Classifier
From these four counts, standard metrics are derived:
- Accuracy — overall fraction correctly classified.
- Precision — correctness of positive predictions.
- Recall / Sensitivity — coverage of actual positives.
- Specificity .
- F1-score .
- Error rate .
The matrix also reveals which classes are confused (off-diagonal entries), which a single accuracy number hides — important for imbalanced datasets.
Explain the concept of data cube.
Data Cube
A data cube is a multidimensional model that allows data to be viewed and analyzed across multiple dimensions (perspectives) and measures (numeric facts). Despite the name "cube," it can have dimensions, not just three.
- Dimensions: the perspectives by which an organization keeps records, e.g., time, item, location, supplier. Each dimension may have a concept hierarchy (day → month → quarter → year).
- Measures (facts): the numeric quantities being analyzed, e.g., dollars_sold, units_sold, stored in the central fact table.
Example: a 3-D cube for sales with axes time × item × location; each cell holds a measure such as dollars_sold for a particular (time, item, location) combination.
Apex cuboid / lattice: A data cube is actually a lattice of cuboids. The cuboid holding the lowest-level summary is the base cuboid; the 0-D cuboid holding the grand total is the apex cuboid (all). OLAP operations — roll-up, drill-down, slice, dice, pivot — navigate this cube to support interactive multidimensional analysis.
What is outlier analysis? Why is it important?
Outlier Analysis
An outlier is a data object that deviates significantly from the rest of the data, as if it were generated by a different mechanism. Outlier analysis (outlier/anomaly detection) is the process of identifying such objects.
Methods include:
- Statistical (distribution-based): assume a data distribution and flag points with low probability (e.g., beyond ).
- Distance-based: an object is an outlier if a large fraction of objects lie farther than a distance .
- Density-based (e.g., LOF): compare local density of a point with that of its neighbors.
- Clustering-based: points that do not belong to any cluster (e.g., DBSCAN noise).
Why it is important:
- In many applications the outliers are more interesting than the normal data: fraud detection (credit-card/insurance), network intrusion detection, fault/medical diagnosis, and detecting measurement errors.
- Removing or treating outliers improves the quality of other mining tasks (clustering, regression) that are sensitive to extreme values.
Define entropy and information gain.
Entropy
Entropy measures the impurity or uncertainty in a collection of examples. For a dataset with classes where is the proportion of class :
Entropy is 0 when all objects belong to one class (pure) and maximal (=, e.g. 1 for two equally likely classes) when classes are evenly mixed.
Information Gain
Information gain is the expected reduction in entropy achieved by partitioning the data on an attribute . If splitting on produces subsets :
The attribute with the highest information gain is chosen as the splitting attribute at each node in the ID3 / C4.5 decision-tree algorithms, because it best separates the classes.
Explain the partitioning method of clustering.
Partitioning Method of Clustering
A partitioning method organizes objects into clusters (), where each cluster contains at least one object and each object belongs to exactly one cluster. It uses an iterative relocation technique that moves objects between groups to improve a partition until an objective function (e.g., minimizing intra-cluster squared error) is optimized.
Objective (k-means squared error):
where is the centroid/representative of cluster .
k-means algorithm:
1. Arbitrarily choose k objects as initial cluster centers.
2. Repeat:
a. Assign each object to the cluster whose center is nearest.
b. Recompute each cluster center as the mean of its objects.
3. Until the centers no longer change (convergence).
- k-means uses the mean (centroid) as the cluster representative.
- k-medoids (PAM) uses an actual object (medoid), making it more robust to outliers.
Limitations: must specify in advance, sensitive to initial seeds and outliers, and tends to find spherical clusters of similar size.
Write short notes on web mining.
Web Mining
Web mining is the application of data-mining techniques to automatically discover and extract useful information, patterns, and knowledge from web data (documents, hyperlinks, server logs).
It is generally divided into three categories:
- Web Content Mining: extracting useful information/knowledge from the content of web pages (text, images, structured records) — e.g., search-result classification, opinion mining.
- Web Structure Mining: analyzing the hyperlink structure (the web graph) to discover authoritative pages and communities — e.g., Google's PageRank and the HITS (hubs and authorities) algorithm.
- Web Usage Mining: mining web server logs / clickstream data to understand user navigation behavior — used for personalization, recommendation, and improving site design.
Applications: search engines, recommendation systems, e-commerce personalization, targeted advertising, and detecting web access patterns.
Challenges: the web is huge, heterogeneous, dynamic, semi-structured, and noisy, which makes mining difficult.
Frequently asked questions
- Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2077?
- The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 paper come with solutions?
- Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 paper?
- The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2077 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
- Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.