BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2074 Nepal
This is the official BSc CSIT (TU) (Science stream) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper for 2074, as set in the regular annual examination. It carries 60 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) exam or solving previous years' question papers, this 2074 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt any TWO questions.
What is a data warehouse? Explain its characteristics. Describe the three-tier data warehouse architecture with a suitable diagram.
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data organized to support management decision-making (W. H. Inmon). It consolidates historical data from multiple operational (transactional) sources into a single repository optimized for query and analysis rather than transaction processing.
Characteristics
- Subject-oriented — Organized around major subjects (customer, product, sales) rather than around day-to-day operations or applications.
- Integrated — Data from heterogeneous sources is cleaned and made consistent (uniform naming, encoding, units, formats).
- Time-variant — Stores historical data with a time dimension (months/years of data), enabling trend analysis. Every record carries an explicit or implicit time element.
- Non-volatile — Data is loaded and read but normally not updated or deleted by end users; it is refreshed periodically (read-only for analysis).
Three-Tier Architecture
Top Tier: [ OLAP Tools | Query/Reporting | Data Mining | Analysis ] <- Front-end client tools
|
Middle Tier: [ OLAP Server (ROLAP / MOLAP) ] <- presents multidimensional view
|
Bottom Tier: [ Data Warehouse Server (RDBMS) + Data Marts + Metadata ] <- ETL loads from sources
|
[ Operational DBs ][ Flat Files ][ External Sources ]
- Bottom tier (Warehouse database server): Usually a relational database. Data is extracted from operational databases and external sources, then cleaned, transformed, and loaded (ETL/back-end tools). It also holds data marts and a metadata repository.
- Middle tier (OLAP server): Implemented as either a ROLAP (relational OLAP, maps multidimensional operations to SQL) or MOLAP (multidimensional OLAP, uses array-based storage/data cubes). It presents an abstracted, multidimensional view of the data to the top tier.
- Top tier (Front-end client layer): Tools for querying, reporting, analysis, and data mining used by analysts and decision makers.
This layered design separates storage, multidimensional processing, and presentation, improving scalability and performance.
What is association rule mining? Explain the Apriori algorithm with an example and discuss its limitations.
Association Rule Mining
Association rule mining discovers interesting relationships (co-occurrence patterns) among items in large transactional datasets. A rule has the form where are disjoint itemsets, e.g. ("market basket analysis").
Two key measures:
- Support:
- Confidence:
A rule is strong if it meets a minimum support and minimum confidence threshold.
Apriori Algorithm
Apriori finds all frequent itemsets using the Apriori property: every non-empty subset of a frequent itemset must also be frequent (anti-monotonicity). It works level-by-level:
1. Find all frequent 1-itemsets (L1) by counting support.
2. For k = 2,3,...:
a. Generate candidate k-itemsets Ck by joining L(k-1) with itself.
b. Prune candidates having any infrequent (k-1)-subset.
c. Scan database to count support of each candidate.
d. Lk = candidates with support >= min_support.
3. Stop when Lk is empty. Frequent itemsets = union of all Lk.
4. Generate strong rules from frequent itemsets using min_confidence.
Example
Transactions (min_support = 50% i.e. 2 of 4, min_confidence = 70%):
| TID | Items |
|---|---|
| T1 | A, B, C |
| T2 | A, C |
| T3 | A, D |
| T4 | B, C |
- L1: A(3), B(2), C(3), D(1). D is pruned (<2). L1 = {A, B, C}.
- C2: {A,B}=1, {A,C}=2, {B,C}=2. L2 = {A,C}(2), {B,C}(2).
- C3: {A,B,C} — needs subset {A,B} which is infrequent, so pruned. L3 empty.
Frequent itemsets: {A}, {B}, {C}, {A,C}, {B,C}.
Rule from {B,C}: , confidence (rejected); , confidence (strong rule).
Limitations
- Multiple database scans — one full scan per level , expensive for large data.
- Huge number of candidates — candidate generation explodes when there are many frequent items (e.g. 1-itemsets give candidate 2-itemsets).
- High memory and computation cost for low support thresholds.
- Improvements such as FP-Growth (no candidate generation, two scans) address these issues.
Explain the K-means clustering algorithm. Apply it on a sample dataset to form two clusters and show the iterations.
K-Means Clustering
K-means is an unsupervised, partitional clustering algorithm that divides objects into clusters so that intra-cluster similarity is high. Each cluster is represented by its centroid (mean of its points). It minimizes the sum of squared errors .
Algorithm
1. Choose k (number of clusters); select k initial centroids.
2. Assignment step: assign each point to the nearest centroid
(using Euclidean distance).
3. Update step: recompute each centroid as the mean of its members.
4. Repeat steps 2-3 until centroids no longer change (convergence).
Worked Example
Data (1-D): 2, 4, 10, 12, 3, 20, 30, 11, 25, with .
Initial centroids: , .
Iteration 1 — assign by nearest centroid:
- Cluster1 (near 2): {2, 3} → new mean
- Cluster2 (near 4): {4, 10, 12, 20, 30, 11, 25} → new mean
Iteration 2 — with :
- Cluster1: {2, 3, 4} → mean
- Cluster2: {10, 12, 20, 30, 11, 25} → mean
Iteration 3 — with :
- Cluster1: {2, 3, 4, 10} → mean
- Cluster2: {12, 20, 30, 11, 25} → mean
Iteration 4 — with :
- Cluster1: {2, 3, 4, 10, 11, 12} → mean
- Cluster2: {20, 30, 25} → mean
Iteration 5 — with :
- Cluster1: {2, 3, 4, 10, 11, 12} → mean
- Cluster2: {20, 30, 25} → mean
Assignments and centroids no longer change → converged.
Final clusters: (centroid 7), (centroid 25).
Notes
- Result depends on initial centroid choice and chosen ; sensitive to outliers; works best with spherical, well-separated clusters.
Section B: Short Answer Questions
Attempt any EIGHT questions.
Define data mining and list its major applications.
Data mining is the process of discovering interesting, previously unknown, and potentially useful patterns, correlations, and knowledge from large amounts of data. It is the core analysis step of the KDD (Knowledge Discovery in Databases) process and uses techniques from machine learning, statistics, and database systems.
Major Applications
- Market basket analysis / retail — association rules to drive product placement and cross-selling.
- Banking & finance — credit scoring, fraud detection, risk analysis.
- Telecommunications — fraud detection, churn prediction.
- Healthcare & bioinformatics — disease diagnosis, gene/protein analysis.
- CRM — customer segmentation and targeted marketing.
- Web mining & search engines — recommendation systems, click-stream analysis.
- Manufacturing — quality control and fault prediction.
Differentiate between OLTP and OLAP.
OLTP vs OLAP
| Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Purpose | Day-to-day operational transactions | Analysis and decision support |
| Data | Current, detailed, operational | Historical, summarized, consolidated |
| Orientation | Application/transaction oriented | Subject oriented |
| Operations | Frequent INSERT/UPDATE/DELETE | Mostly read-only complex queries |
| Design | Normalized (ER, 3NF) | De-normalized (star/snowflake, cubes) |
| Records accessed | Few rows per query | Millions of rows aggregated |
| Users | Clerks, operational staff (many) | Analysts, managers (fewer) |
| Response time | Milliseconds | Seconds to minutes |
| Database size | MB–GB | GB–TB |
| Backup/recovery | Critical, real-time | Periodic; reload from sources |
Summary: OLTP runs the business (operational data), while OLAP analyzes the business (decision support over a data warehouse).
What is a concept hierarchy? Explain with an example.
Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level (detailed) concepts to higher-level (more general) concepts. It organizes attribute values into multiple levels of abstraction, enabling data generalization and roll-up/drill-down in OLAP.
Types
- Schema hierarchy — based on the schema/attributes, e.g. street < city < state < country for a
locationdimension. - Set-grouping hierarchy — groups values into ranges, e.g. age: {0–18 = young}, {19–40 = adult}, {41+ = senior}.
Example
For the location dimension:
country (Nepal)
|
province (Bagmati)
|
city (Kathmandu)
|
street (Durbar Marg)
Rolling up moves from street toward country (more general/summarized); drilling down moves the other way (more detailed). Such hierarchies let analysts view sales by street, then summarize by city, province, or country.
Explain the different OLAP operations (roll-up, drill-down, slice, dice).
OLAP Operations
OLAP operations let users navigate a multidimensional data cube at different levels of abstraction.
-
Roll-up (drill-up): Aggregates data by climbing up a concept hierarchy or reducing a dimension. e.g. moving sales from city → province → country, producing more summarized data.
-
Drill-down: The reverse of roll-up — moves from summarized to more detailed data by stepping down a concept hierarchy or adding a dimension. e.g. quarter → month → day, or country → city.
-
Slice: Selects a single value for one dimension, producing a sub-cube (a 2-D slice). e.g. fix
time = Q1to see sales for all products/locations in Q1 only. -
Dice: Selects a sub-cube by specifying a range/condition on two or more dimensions. e.g.
location ∈ {KTM, Pokhara}ANDtime ∈ {Q1, Q2}ANDitem ∈ {phone, tablet}.
(Related: Pivot/rotate reorients the cube's axes to give an alternative view.)
What is data preprocessing? Why is it necessary?
Data Preprocessing
Data preprocessing is the set of techniques applied to raw data before mining to transform it into a clean, consistent, and suitable form for analysis. Real-world data is typically incomplete (missing values), noisy (errors, outliers), and inconsistent (conflicting codes/formats), so it cannot be mined reliably as-is.
Major Tasks
- Data cleaning — fill in missing values, smooth noise, remove outliers, resolve inconsistencies.
- Data integration — combine data from multiple sources, resolving conflicts and redundancy.
- Data reduction — reduce volume via dimensionality/numerosity reduction or aggregation while preserving analytical results.
- Data transformation — normalization, smoothing, generalization, attribute construction.
Why It Is Necessary
- Improves data quality, hence the quality and accuracy of mining results (garbage in, garbage out).
- Many algorithms require clean, consistent, properly scaled input.
- Reduces size and complexity, improving mining efficiency and reducing storage/computation.
Explain the steps in the KDD process.
KDD Process
KDD (Knowledge Discovery in Databases) is the overall, iterative process of converting raw data into useful knowledge; data mining is one core step within it.
Steps
- Data Cleaning — remove noise and inconsistent data, handle missing values.
- Data Integration — combine data from multiple heterogeneous sources.
- Data Selection — retrieve the data relevant to the analysis task from the database.
- Data Transformation — transform/consolidate data into forms appropriate for mining (e.g. normalization, aggregation).
- Data Mining — apply intelligent methods (classification, clustering, association, etc.) to extract patterns.
- Pattern Evaluation — identify the truly interesting patterns using interestingness measures.
- Knowledge Presentation — use visualization and reporting techniques to present mined knowledge to users.
(Steps 1–4 are often grouped as data preprocessing.) The process is iterative: feedback may loop back to earlier steps.
What is a decision tree? How is it used for classification?
Decision Tree
A decision tree is a flowchart-like tree structure used for classification (and regression), where:
- each internal node tests an attribute,
- each branch represents an outcome of the test, and
- each leaf node holds a class label.
The path from the root to a leaf forms a classification rule.
Use in Classification
- Building (training): The tree is induced top-down by recursively partitioning the training data. At each node, the best splitting attribute is chosen using a measure such as Information Gain / Entropy (ID3), Gain Ratio (C4.5), or Gini index (CART) — the attribute that best separates the classes.
- Splitting continues until nodes are pure (one class), no attributes remain, or a stopping criterion is met; pruning removes branches that overfit.
- Classifying a new record: Start at the root, test attributes along matching branches, and follow the path to a leaf; the leaf's class label is the predicted class.
Advantages: easy to understand and interpret, no need for normalization, handles both numeric and categorical data, and the rules are explicit.
Differentiate between classification and clustering.
Classification vs Clustering
| Aspect | Classification | Clustering |
|---|---|---|
| Learning type | Supervised | Unsupervised |
| Class labels | Predefined/known; uses labeled training data | No predefined labels; groups discovered from data |
| Goal | Assign new objects to known classes | Partition data into natural groups (clusters) |
| Training data | Required (labeled) | Not required |
| Basis of grouping | Learned model that maps features → label | Similarity/distance between objects |
| Output | A predictive model + class prediction | Set of clusters |
| Examples | Decision tree, Naive Bayes, SVM, KNN | K-means, hierarchical, DBSCAN |
| Evaluation | Accuracy, precision, recall (vs true labels) | Cohesion/separation, silhouette (no true labels) |
Summary: Classification predicts a known class label using labeled training data (supervised), whereas clustering discovers the groupings in unlabeled data based on similarity (unsupervised).
Write short notes on the star schema.
Star Schema
The star schema is the most common multidimensional model for a data warehouse. It consists of:
- A central fact table containing measures (numeric, additive facts such as
sales_amount,quantity) and foreign keys to the dimensions. - A set of dimension tables (e.g. time, product, location, customer), each connected directly to the fact table.
The diagram resembles a star: the fact table at the center with dimension tables radiating outward.
[ Time ] [ Product ]
\ /
\ /
[ FACT (Sales) ]
/ \
/ \
[ Location ] [ Customer ]
Key points
- Dimension tables are de-normalized (each dimension is a single table), giving fewer joins and fast query performance.
- Simple, intuitive design, well suited to OLAP.
- Uses more storage due to redundancy in dimensions.
- Contrast: the snowflake schema normalizes dimensions into sub-tables (more joins, less redundancy).
Frequently asked questions
- Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2074?
- The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 paper come with solutions?
- Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 paper?
- The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
- Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.