BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2079 Nepal
This is the official BSc CSIT (TU) (Science stream) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper for 2079, as set in the regular annual examination. It carries 60 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) exam or solving previous years' question papers, this 2079 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt any TWO questions.
What is a data warehouse? Explain the three-tier architecture of a data warehouse in detail with a neat diagram.
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process (W. H. Inmon). It is a central repository that consolidates data from multiple heterogeneous operational sources for analysis and reporting (OLAP) rather than transaction processing.
Key characteristics:
- Subject-oriented: organized around major subjects (customer, product, sales) rather than applications.
- Integrated: data from different sources is cleaned and made consistent (naming, units, encoding).
- Time-variant: stores historical data with a time dimension (years of data).
- Non-volatile: data is loaded and read, but not updated/deleted in real time.
Three-Tier Architecture
A data warehouse is commonly built using a three-tier architecture:
+-------------------------------------------------------+
| Top Tier: Front-End Tools |
| (Query, Reporting, OLAP, Data Mining, Dashboards) |
+-------------------------------------------------------+
^
|
+-------------------------------------------------------+
| Middle Tier: OLAP Server |
| (ROLAP / MOLAP / HOLAP engine) |
+-------------------------------------------------------+
^
|
+-------------------------------------------------------+
| Bottom Tier: Data Warehouse Server (RDBMS) |
| Data Marts | Metadata | Monitoring & Administration |
+-------------------------------------------------------+
^ ^
| ETL | (Extract, Transform, Load)
+-----------------------+
| Operational DBs, Flat |
| files, External data |
+-----------------------+
1. Bottom Tier — Data Warehouse Server (Data layer):
- A back-end relational database that stores the warehouse data.
- Data is fed from operational databases, flat files and external sources through ETL (Extract, Transform, Load) / gateways (ODBC, JDBC, OLEDB).
- Also holds the metadata repository (definitions of data, source mappings) and tools for monitoring and administration.
2. Middle Tier — OLAP Server: Presents the multidimensional view of data to the user. Implemented as:
- ROLAP (Relational OLAP): maps multidimensional operations to standard relational tables (star/snowflake schema). Scales to large data.
- MOLAP (Multidimensional OLAP): stores data in special multidimensional array structures (data cubes) for fast retrieval.
- HOLAP (Hybrid OLAP): combines ROLAP storage for detailed data with MOLAP for aggregates.
3. Top Tier — Front-End / Client Tools:
- Tools used by end users: query and reporting tools, analysis tools, OLAP tools (slice, dice, drill-down), and data mining tools (prediction, classification, clustering).
- Produces reports, charts and dashboards for decision making.
Conclusion: The separation into three tiers gives modularity, scalability and the ability to optimize each layer (storage, processing, presentation) independently.
Explain the K-means algorithm. Cluster the given set of points into two clusters and show all iterations until convergence.
K-means Algorithm
K-means is a partitional, unsupervised clustering algorithm that divides data points into clusters, where each point belongs to the cluster with the nearest mean (centroid). It minimises the within-cluster sum of squared errors (SSE):
Steps:
- Choose the number of clusters .
- Initialise centroids (randomly or pick points).
- Assignment: assign each point to the nearest centroid (Euclidean distance).
- Update: recompute each centroid as the mean of points assigned to it.
- Repeat steps 3–4 until centroids no longer change (convergence).
Worked Example
Let the points be and . Take initial centroids and (Euclidean distance, ).
Iteration 1 — Assignment:
| Point | dist to | dist to | Cluster |
|---|---|---|---|
| A(1,1) | 0.00 | 5.00 | 1 |
| B(2,1) | 1.00 | 4.24 | 1 |
| C(4,3) | 3.61 | 1.41 | 2 |
| D(5,4) | 5.00 | 0.00 | 2 |
Cluster 1 = {A, B}, Cluster 2 = {C, D}.
Update centroids:
Iteration 2 — Assignment with new centroids:
| Point | dist to | dist to | Cluster |
|---|---|---|---|
| A(1,1) | 0.50 | 4.30 | 1 |
| B(2,1) | 0.50 | 3.54 | 1 |
| C(4,3) | 3.20 | 0.71 | 2 |
| D(5,4) | 4.61 | 0.71 | 2 |
Clusters are unchanged (Cluster 1 = {A,B}, Cluster 2 = {C,D}), so the centroids would not change again.
Convergence reached.
- Cluster 1: {A(1,1), B(2,1)}, centroid
- Cluster 2: {C(4,3), D(5,4)}, centroid
(If the exam supplies different points, apply the same assignment→update→repeat procedure.)
What is classification? Explain the K-Nearest Neighbour (KNN) algorithm and classify a new instance for a given dataset.
Classification
Classification is a supervised learning (data mining) technique that builds a model from a labelled training set to predict the categorical class label of new, unseen instances. It has two phases: (1) learning/training — build the classifier from training tuples whose class is known; (2) classification/testing — use the model to assign class labels to new data. Examples: spam vs. not-spam, loan = safe/risky.
K-Nearest Neighbour (KNN) Algorithm
KNN is a lazy, instance-based classifier — it stores all training data and classifies a new instance by majority vote of its closest neighbours.
Steps:
- Choose (number of neighbours).
- Compute the distance (usually Euclidean, ) from the new instance to every training point.
- Select the nearest training points.
- Assign the class that is most frequent (majority vote) among those neighbours.
Worked Example
Training data (attributes , Class):
| Point | Class | ||
|---|---|---|---|
| P1 | 1 | 1 | A |
| P2 | 2 | 1 | A |
| P3 | 4 | 4 | B |
| P4 | 5 | 4 | B |
Classify new instance with .
| Point | Distance to Q(2,2) |
|---|---|
| P1(1,1) | |
| P2(2,1) | |
| P3(4,4) | |
| P4(5,4) |
The 3 nearest neighbours are P2 (A), P1 (A), P3 (B) → votes: A = 2, B = 1.
Majority class = A, so is classified as Class A.
Note: is usually odd to avoid ties; attributes should be normalised so large-scale features do not dominate the distance.
Section B: Short Answer Questions
Attempt any EIGHT questions.
Differentiate between operational database and data warehouse.
Operational Database (OLTP) vs Data Warehouse (OLAP)
| Aspect | Operational Database (OLTP) | Data Warehouse (OLAP) |
|---|---|---|
| Purpose | Day-to-day transaction processing | Analysis, reporting, decision support |
| Data | Current, detailed, up-to-date | Historical, summarized, integrated |
| Orientation | Application-oriented | Subject-oriented |
| Operations | Frequent INSERT/UPDATE/DELETE | Mostly read-only, complex queries |
| Design | Normalized (ER model) | De-normalized (star/snowflake) |
| Size | MB–GB | GB–TB and beyond |
| Access | Short, simple transactions | Long, ad-hoc analytical queries |
| Users | Clerks, operational staff | Analysts, managers, executives |
| Time | Real-time / current value | Time-variant (history retained) |
In short, an operational database is optimized for fast, reliable transactions, while a data warehouse is optimized for query-intensive analysis over large historical, integrated data.
Explain the OLAP operations with examples.
OLAP Operations
OLAP (Online Analytical Processing) operations manipulate the data cube to view data at different levels of detail and from different perspectives.
1. Roll-up (drill-up): Aggregates data by climbing up a concept hierarchy or reducing dimensions.
- Example: Sales by city → rolled up to sales by country.
2. Drill-down (roll-down): Reverse of roll-up; navigates from less detailed to more detailed data.
- Example: Sales by quarter → drilled down to sales by month.
3. Slice: Selects a single value of one dimension, producing a sub-cube (2-D view).
- Example: From the cube (Time, Item, Location), slice where Time = Q1.
4. Dice: Selects a sub-cube by choosing ranges of values on two or more dimensions.
- Example: Item ∈ {Mobile, Laptop} AND Location ∈ {Nepal, India} AND Time ∈ {Q1, Q2}.
5. Pivot (rotate): Rotates the data axes to give an alternative presentation of the same data (e.g., swap rows and columns).
- Example: Swap the Location axis with the Time axis in a report.
(Some texts add Drill-across — across two or more fact tables — and Drill-through — to the underlying detailed/operational data.)
What is market basket analysis?
Market Basket Analysis
Market Basket Analysis (MBA) is an association-rule mining technique that discovers co-occurrence relationships among items purchased together in customer transactions. It answers: "Which products are frequently bought together?"
Results are expressed as association rules of the form (if a customer buys , they are likely to buy ), evaluated by:
- Support: — fraction of transactions containing both.
- Confidence: — reliability of the rule.
- Lift: — strength beyond chance ( = positive correlation).
Classic example: the {Bread, Butter} ⇒ {Milk} or the well-known {Diapers} ⇒ {Beer} rule.
Applications: store shelf layout, cross-selling and product bundling, recommendation systems, promotions and catalogue design. Algorithms such as Apriori and FP-Growth are used to find the frequent itemsets that drive these rules.
Explain the candidate generation step in Apriori.
Candidate Generation in Apriori
In the Apriori algorithm, candidate -itemsets () are generated from the frequent -itemsets () using two steps. This relies on the Apriori property: every subset of a frequent itemset must also be frequent (the downward-closure property).
1. Join Step: Generate candidates by self-joining . Two itemsets in are joined only if their first items are identical (items kept in sorted order); the result is a -itemset.
2. Prune Step: Remove any candidate -itemset that has any -subset not present in (since it cannot be frequent). This prunes the search space before the expensive support-counting scan.
Example: Let .
- Join: (first item shared). .
- Prune: For subsets are — all in → keep. For subset → pruned.
- So .
The remaining candidates are then counted against the database to form .
What is overfitting in classification? How can it be avoided?
Overfitting in Classification
Overfitting occurs when a classification model learns the training data too well — including its noise, outliers and random fluctuations — so it has very high accuracy on the training set but poor accuracy (poor generalisation) on unseen test data. The model becomes overly complex and fits idiosyncrasies that are not true patterns.
Symptom: low training error but high test/validation error.
How to Avoid / Reduce Overfitting
- Pruning (for decision trees): pre-pruning (stop growing early) or post-pruning (build the full tree, then remove weak branches).
- Cross-validation: use k-fold cross-validation to tune the model and detect overfitting.
- More / cleaner training data: larger, noise-free datasets reduce reliance on spurious patterns.
- Reduce model complexity: limit tree depth, number of features, or model parameters; feature selection.
- Regularization: penalise large/complex models (e.g., L1/L2 penalties).
- Use a separate validation/test set and stop training when validation error starts to rise (early stopping).
- Ensemble methods (bagging, random forests) that average many models to reduce variance.
Explain the silhouette coefficient for cluster evaluation.
Silhouette Coefficient
The silhouette coefficient is an internal cluster-validity measure that evaluates how well each object lies within its cluster by combining cohesion (closeness to its own cluster) and separation (distance from other clusters).
For a data point :
- = average distance from to all other points in its own cluster (cohesion).
- = minimum average distance from to points of any other cluster (separation).
The silhouette of point is:
Interpretation ():
- → point is well clustered (far from neighbouring clusters).
- → point lies on the border between two clusters.
- → point is probably assigned to the wrong cluster.
The overall silhouette score is the mean of over all points; a higher average indicates a better clustering. It is also used to choose the best number of clusters by picking the that maximises the average silhouette.
What is data discretization?
Data Discretization
Data discretization is a data-reduction / preprocessing technique that transforms continuous (numeric) attributes into a finite number of discrete intervals or categorical labels. The original numeric values are replaced by interval/concept labels, which reduces data size and is required by algorithms that only handle categorical data (e.g., some decision-tree and association-rule methods).
Example: an Age attribute (0–100) discretized into {Child (0–12), Teen (13–19), Adult (20–59), Senior (60+)}.
Common methods:
- Binning: partition into equal-width or equal-frequency (equal-depth) bins, then smooth by bin mean/median/boundaries — unsupervised.
- Histogram analysis: uses histograms to find natural intervals.
- Clustering: group similar values into clusters that become intervals.
- Entropy / information-gain based (e.g., ChiMerge): supervised, uses class labels to choose split points.
- Concept hierarchy generation: roll values up (e.g., city → state → country) for numeric or nominal data.
Benefits: smaller, simpler data; reduced noise; enables categorical-only algorithms; produces more interpretable rules.
Differentiate between star and snowflake schema.
Star Schema vs Snowflake Schema
Both are multidimensional data-warehouse schemas with a central fact table surrounded by dimension tables; they differ in how the dimensions are organised.
| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Structure | Central fact table + single-level dimension tables | Fact table + dimensions split into multiple related sub-tables |
| Normalization | Dimensions are de-normalized | Dimensions are normalized (hierarchies split out) |
| Redundancy | Higher data redundancy | Lower redundancy |
| Joins | Fewer joins → faster queries | More joins → comparatively slower queries |
| Storage | Uses more space | Saves space |
| Complexity / design | Simple, easy to understand | More complex |
| Maintenance | Harder to maintain consistency | Easier (normalized) |
Diagram (conceptually):
- Star: Fact in the centre, each dimension one table radiating outward → looks like a star.
- Snowflake: each dimension is further broken into sub-dimensions (e.g., City → State → Country) → branches resemble a snowflake.
Summary: the star schema favours query performance and simplicity; the snowflake schema favours normalization and reduced redundancy at the cost of extra joins.
Write short notes on spatial data mining.
Spatial Data Mining
Spatial data mining is the process of discovering interesting, non-trivial and previously unknown patterns, relationships and trends from spatial (geographic / geo-referenced) data — data that has location, shape and topological attributes (points, lines, polygons, maps, satellite/GIS data).
Key idea: it exploits spatial relationships such as distance, direction, adjacency, containment and overlap that ordinary (non-spatial) mining ignores. It is guided by the principle that "everything is related to everything else, but nearby things are more related" (spatial autocorrelation).
Major tasks/techniques:
- Spatial classification & prediction — predict an attribute using location and neighbouring features.
- Spatial clustering — group nearby objects (e.g., DBSCAN finds dense regions / hotspots).
- Spatial association rules — e.g., is_a(x, school) ∧ close_to(x, park) ⇒ price_high.
- Spatial outlier / hotspot detection — find locations that differ markedly from their neighbours.
- Spatial trend analysis — how attributes change with distance/direction.
Applications: GIS, urban and transport planning, environmental and weather studies, epidemiology (disease hotspots), crime analysis, remote sensing and location-based services.
Challenges: large data volumes, complex spatial data types and indexing (R-trees, quad-trees), and the need to model spatial autocorrelation.
Frequently asked questions
- Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2079?
- The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 paper come with solutions?
- Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 paper?
- The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2079 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
- Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.