BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2075 Nepal
This is the official BSc CSIT (TU) (Science stream) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper for 2075, as set in the regular annual examination. It carries 60 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) exam or solving previous years' question papers, this 2075 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt any TWO questions.
Explain the architecture of a data warehouse. Differentiate between MOLAP, ROLAP, and HOLAP.
Architecture of a Data Warehouse
A data warehouse (DW) is a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. Its architecture is commonly described as a three-tier architecture.
Three-Tier Architecture
1. Bottom Tier — Data Warehouse Server (Database Layer)
- A relational/RDBMS server that holds the warehouse data.
- Data is fed from operational (OLTP) databases and external sources through ETL (Extract, Transform, Load):
- Extract data from heterogeneous sources.
- Transform / clean it (resolve inconsistencies, integrate, summarize).
- Load it into the warehouse.
- Also stores the metadata repository (data about data: source, ETL rules, schema).
2. Middle Tier — OLAP Server
- Presents a multidimensional view of data to users.
- Implemented as ROLAP (relational backend) or MOLAP (multidimensional cube store).
3. Top Tier — Front-End Client Layer
- Tools for querying, reporting, analysis, and data mining (dashboards, BI tools).
Components
- Data Sources (OLTP DBs, flat files, external data)
- ETL / staging area and metadata
- Data marts (department-level subsets)
- Monitoring & administration
(Diagram in words: Sources → ETL/Staging → Data Warehouse + Metadata (bottom) → OLAP Server (middle) → Query/Report/Mining tools (top).)
MOLAP vs ROLAP vs HOLAP
| Feature | MOLAP | ROLAP | HOLAP |
|---|---|---|---|
| Storage | Multidimensional cube arrays | Relational tables (star/snowflake) | Hybrid: detail in relational, summaries in cube |
| Query speed | Very fast (pre-aggregated) | Slower (SQL joins, runtime aggregation) | Fast for summaries, scalable for detail |
| Scalability / data volume | Limited (cube can explode) | High (handles very large data) | High |
| Storage efficiency | Sparse-data overhead | Efficient | Balanced |
| Best for | Small/medium data, fast OLAP | Large detailed datasets | Mix of large detail + fast aggregates |
- MOLAP stores data in optimized multidimensional cubes giving the fastest response but limited scalability.
- ROLAP keeps data in relational tables and generates aggregations via SQL at query time, scaling to large volumes but slower.
- HOLAP combines both: low-level detailed data stays in the relational store while summary/aggregated data is kept in a cube, balancing speed and scalability.
Explain the FP-growth algorithm for frequent pattern mining with an example. How does it improve upon Apriori?
FP-Growth Algorithm (Frequent Pattern Growth)
FP-Growth mines frequent itemsets without candidate generation by compressing the database into a compact FP-tree and then mining it recursively.
Steps
- First DB scan — count support of each item; discard items below
min_support; sort remaining (frequent) items in descending order of support (the F-list). - Second DB scan — for each transaction, keep only frequent items, order them by the F-list, and insert them into the FP-tree (shared prefixes share nodes; each node keeps a count). A header table links all nodes of the same item.
- Mine the FP-tree — for each item (bottom-up), build its conditional pattern base (prefix paths), construct a conditional FP-tree, and recursively grow frequent patterns.
Example
Transactions, min_support = 2:
| TID | Items |
|---|---|
| T1 | A, B, C |
| T2 | A, C, D |
| T3 | A, B |
| T4 | B, C |
| T5 | A, B, C |
Step 1 — counts: A=4, B=4, C=4, D=1. Drop D (support 1 < 2). F-list (desc): A:4, B:4, C:4.
Step 2 — ordered transactions: T1 {A,B,C}, T2 {A,C}, T3 {A,B}, T4 {B,C}, T5 {A,B,C}.
FP-tree (described): Root → A(4) → B(3) → C(2); A→C(1); plus a separate root→B(1)→C(1) branch (from T4). Header table links the A, B, C nodes.
Step 3 — mining (e.g., suffix C): conditional pattern base of C = {A,B:2}, {A:1}, {B:1}. Frequent patterns containing C: {A,C}:3, {B,C}:3, {A,B,C}:2. Similarly mining B gives {A,B}:3. Final frequent itemsets (support ≥ 2): {A},{B},{C},{A,B},{A,C},{B,C},{A,B,C}.
How FP-Growth Improves on Apriori
- No candidate generation: Apriori generates and tests huge numbers of candidate itemsets; FP-Growth avoids this entirely.
- Fewer DB scans: only two scans vs. Apriori's repeated scans (one per pass/level).
- Compact structure: the FP-tree compresses shared prefixes, saving memory.
- Faster: especially on dense datasets with long frequent patterns, often an order of magnitude faster.
- Drawback: the FP-tree may not fit in memory for very large/sparse data.
What is classification? Explain the ID3 decision tree algorithm and build a decision tree for a given training dataset.
Classification
Classification is a supervised learning task that builds a model (classifier) from a labeled training set to predict the categorical class label of new, unseen data. It works in two phases: (1) learning/model construction from training data, and (2) classification/prediction with accuracy estimated on a test set.
ID3 Decision Tree Algorithm
ID3 (Iterative Dichotomiser 3) builds a tree top-down by selecting, at each node, the attribute with the highest information gain (greatest entropy reduction).
Entropy of dataset with classes :
Information Gain of attribute :
Procedure:
- Compute entropy of the whole set.
- For each attribute compute information gain.
- Choose the attribute with maximum gain as the splitting node.
- Split data on its values; recurse on each branch.
- Stop when all instances in a node share one class (leaf), no attributes remain (majority class), or no data remains.
Example
Dataset (play tennis style), attribute Outlook (Sunny/Overcast/Rain), target Play (Yes/No), 14 samples with 9 Yes, 5 No:
For Outlook: Sunny(2Y,3N)→0.971, Overcast(4Y,0N)→0, Rain(3Y,2N)→0.971.
If Outlook gives the highest gain among all attributes, it becomes the root. The Overcast branch is pure → leaf Yes. The Sunny and Rain branches are split further (e.g., Sunny on Humidity, Rain on Wind) by recomputing gains, producing the final tree.
Resulting tree (described):
Outlook
├── Sunny → Humidity (High→No, Normal→Yes)
├── Overcast → Yes
└── Rain → Wind (Strong→No, Weak→Yes)
New records are classified by traversing from root to a leaf.
Section B: Short Answer Questions
Attempt any EIGHT questions.
What is a data mart? How does it differ from a data warehouse?
A data mart is a small, subject-oriented subset of a data warehouse focused on a single department or business function (e.g., Sales, Finance, HR). It contains only the data relevant to that group's analysis needs.
Differences from a data warehouse:
| Aspect | Data Mart | Data Warehouse |
|---|---|---|
| Scope | Single department/subject | Entire enterprise |
| Size | Small (MB–few GB) | Large (GB–TB+) |
| Data sources | Few | Many, integrated |
| Design/build time | Quick, cheaper | Long, complex, costly |
| Users | Departmental | Organization-wide |
Data marts can be dependent (sourced from the central warehouse) or independent (built directly from operational sources).
Explain the snowflake schema with an example.
Snowflake Schema
The snowflake schema is a refinement of the star schema in which dimension tables are normalized into multiple related tables, forming a snowflake shape around the central fact table.
Characteristics:
- A central fact table holds measures and foreign keys.
- Dimension tables are split into sub-dimensions to remove redundancy (normalization).
- Saves storage but needs more joins, so queries can be slower than a star schema.
Example (Sales):
- Fact table:
Sales(product_id, store_id, time_id, units_sold, revenue) - Product dimension normalized into
Product(product_id, name, category_id)→Category(category_id, category_name). - Store dimension normalized into
Store(store_id, city_id)→City(city_id, city_name, state_id)→State(state_id, state_name).
Here the Product → Category and Store → City → State chains branch out like a snowflake, unlike the star schema where each dimension is a single flat denormalized table.
Define support and confidence in association rule mining.
For an association rule over a transaction database:
Support — the fraction of all transactions that contain both and . It measures how frequently the itemset occurs.
Support(A \Rightarrow B) = P(A \cup B) = \frac{\text{# transactions containing } A \text{ and } B}{\text{Total transactions}}Confidence — the fraction of transactions containing that also contain . It measures the reliability/strength of the rule.
Example: If 100 transactions, 20 contain {bread, butter} and 25 contain bread, then and . Rules are kept only if they meet a user-set minimum support and minimum confidence.
What is data cleaning? Explain methods to handle missing values.
Data Cleaning
Data cleaning is the preprocessing step that detects and corrects (or removes) errors, inconsistencies, noise, duplicates, and missing values from data so that mining produces reliable results. It includes handling missing values, smoothing noisy data, identifying outliers, and resolving inconsistencies.
Methods to Handle Missing Values
- Ignore the tuple — drop records with missing class label / many missing values (used when missing data is small).
- Fill manually — domain expert enters the value (accurate but tedious, impractical for large data).
- Use a global constant — replace with a constant like "Unknown" or .
- Use the attribute mean/median — fill with the average (or median) of that attribute.
- Use the class-wise mean — fill with the mean of the attribute for samples of the same class.
- Use the most probable value — predict it using regression, decision trees, Bayesian inference, or interpolation (most accurate).
Explain the working of the Naive Bayes classifier.
Naive Bayes Classifier
A Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem with a strong ("naive") assumption that all attributes are conditionally independent given the class.
Bayes' theorem:
where is the feature vector and a class.
Working:
- Compute the prior for each class from training data.
- By the independence assumption, the likelihood factorizes:
- For a new instance, compute for every class. is constant and can be ignored.
- Assign the class with the highest posterior (MAP rule):
Notes: continuous attributes use a Gaussian density; Laplace smoothing avoids zero probabilities. It is simple, fast, and works well even when the independence assumption is only approximately true (e.g., spam filtering).
What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled training data (input–output pairs) to learn a mapping that predicts the output for new inputs. Unsupervised learning uses unlabeled data and discovers hidden structure/patterns on its own.
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Training data | Labeled (has target/class) | Unlabeled |
| Goal | Predict label / value | Find structure / grouping |
| Tasks | Classification, regression | Clustering, association, dimensionality reduction |
| Examples | Decision tree, Naive Bayes, SVM | k-means, hierarchical clustering, Apriori |
| Feedback | Guided by known answers | No feedback / ground truth |
Example: classifying email as spam/not-spam (supervised) vs. grouping customers into segments without predefined labels (unsupervised).
Explain hierarchical clustering briefly.
Hierarchical Clustering
Hierarchical clustering builds a tree (dendrogram) of nested clusters instead of a single flat partition. It does not require the number of clusters in advance. Two approaches:
1. Agglomerative (bottom-up):
- Start with each object as its own cluster.
- Repeatedly merge the two closest clusters.
- Continue until all objects form one cluster.
2. Divisive (top-down):
- Start with all objects in one cluster.
- Repeatedly split clusters until each object is separate.
Distance (linkage) measures between clusters: single linkage (minimum/nearest pair), complete linkage (maximum/farthest pair), average linkage, and centroid linkage.
The resulting dendrogram can be "cut" at a chosen height to obtain a desired number of clusters. It is intuitive but computationally costly ( or worse) and merges/splits cannot be undone.
What is a fact table and a dimension table?
In a star/snowflake schema of a data warehouse:
Fact Table
- The central table that stores measurable, numeric facts (measures) of a business process — e.g.,
units_sold,revenue,quantity. - Contains foreign keys referencing all related dimension tables.
- Typically very large (many rows), with few columns.
Dimension Table
- Stores descriptive, textual attributes (context) used to filter/group facts — e.g., Time (day, month, year), Product (name, category), Customer, Location.
- Has a primary key referenced by the fact table.
- Usually smaller with many descriptive columns.
Relationship: A fact table connects to multiple dimension tables; dimensions provide the "who, what, when, where" while the fact table provides the "how much."
Example: Sales(time_id, product_id, store_id, units_sold) is the fact table; Time, Product, Store are dimension tables.
Write short notes on metadata in data warehousing.
Metadata in Data Warehousing
Metadata is "data about data" — it describes the structure, source, meaning, and operations of the data stored in the warehouse, and acts as a roadmap/directory for the entire DW.
Types of metadata:
- Business metadata — meaning of data in business terms: definitions, ownership, business rules, mappings (helps end users).
- Technical metadata — schema definitions, table/column structures, data types, indexes, access permissions (used by developers/DBAs).
- Operational (ETL) metadata — source-to-target mappings, transformation/cleaning rules, data lineage, load history, refresh schedules, and data currency.
Importance / uses:
- Acts as a directory to locate and understand warehouse contents.
- Guides the ETL process and ensures data lineage and consistency.
- Helps in query building, mapping, and impact analysis.
- Improves data quality, governance, and maintenance.
Metadata is held in a metadata repository within the bottom tier of the DW architecture.
Frequently asked questions
- Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2075?
- The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 paper come with solutions?
- Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 paper?
- The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
- Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.