BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2075 Nepal

Q: Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2075?

The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 (Regular (annual)) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.

Q: Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 paper come with solutions?

Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.

Q: How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 paper?

The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2075 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.

Q: Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?

Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.

Question

1Long answer10 marks

Explain the architecture of a data warehouse. Differentiate between MOLAP, ROLAP, and HOLAP.

data-warehouseolap

Answer 1

Architecture of a Data Warehouse

A data warehouse (DW) is a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. Its architecture is commonly described as a three-tier architecture.

Three-Tier Architecture

1. Bottom Tier — Data Warehouse Server (Database Layer)

A relational/RDBMS server that holds the warehouse data.
Data is fed from operational (OLTP) databases and external sources through ETL (Extract, Transform, Load):
- Extract data from heterogeneous sources.
- Transform / clean it (resolve inconsistencies, integrate, summarize).
- Load it into the warehouse.
Also stores the metadata repository (data about data: source, ETL rules, schema).

2. Middle Tier — OLAP Server

Presents a multidimensional view of data to users.
Implemented as ROLAP (relational backend) or MOLAP (multidimensional cube store).

3. Top Tier — Front-End Client Layer

Tools for querying, reporting, analysis, and data mining (dashboards, BI tools).

Components

Data Sources (OLTP DBs, flat files, external data)
ETL / staging area and metadata
Data marts (department-level subsets)
Monitoring & administration

(Diagram in words: Sources → ETL/Staging → Data Warehouse + Metadata (bottom) → OLAP Server (middle) → Query/Report/Mining tools (top).)

MOLAP vs ROLAP vs HOLAP

Feature	MOLAP	ROLAP	HOLAP
Storage	Multidimensional cube arrays	Relational tables (star/snowflake)	Hybrid: detail in relational, summaries in cube
Query speed	Very fast (pre-aggregated)	Slower (SQL joins, runtime aggregation)	Fast for summaries, scalable for detail
Scalability / data volume	Limited (cube can explode)	High (handles very large data)	High
Storage efficiency	Sparse-data overhead	Efficient	Balanced
Best for	Small/medium data, fast OLAP	Large detailed datasets	Mix of large detail + fast aggregates

MOLAP stores data in optimized multidimensional cubes giving the fastest response but limited scalability.
ROLAP keeps data in relational tables and generates aggregations via SQL at query time, scaling to large volumes but slower.
HOLAP combines both: low-level detailed data stays in the relational store while summary/aggregated data is kept in a cube, balancing speed and scalability.

Answer 2

FP-Growth Algorithm (Frequent Pattern Growth)

FP-Growth mines frequent itemsets without candidate generation by compressing the database into a compact FP-tree and then mining it recursively.

Steps

First DB scan — count support of each item; discard items below min_support; sort remaining (frequent) items in descending order of support (the F-list).
Second DB scan — for each transaction, keep only frequent items, order them by the F-list, and insert them into the FP-tree (shared prefixes share nodes; each node keeps a count). A header table links all nodes of the same item.
Mine the FP-tree — for each item (bottom-up), build its conditional pattern base (prefix paths), construct a conditional FP-tree, and recursively grow frequent patterns.

Example

Transactions, min_support = 2:

TID	Items
T1	A, B, C
T2	A, C, D
T3	A, B
T4	B, C
T5	A, B, C

Step 1 — counts: A=4, B=4, C=4, D=1. Drop D (support 1 < 2). F-list (desc): A:4, B:4, C:4.

Step 2 — ordered transactions: T1 {A,B,C}, T2 {A,C}, T3 {A,B}, T4 {B,C}, T5 {A,B,C}.

FP-tree (described): Root → A(4) → B(3) → C(2); A→C(1); plus a separate root→B(1)→C(1) branch (from T4). Header table links the A, B, C nodes.

Step 3 — mining (e.g., suffix C): conditional pattern base of C = {A,B:2}, {A:1}, {B:1}. Frequent patterns containing C: {A,C}:3, {B,C}:3, {A,B,C}:2. Similarly mining B gives {A,B}:3. Final frequent itemsets (support ≥ 2): {A},{B},{C},{A,B},{A,C},{B,C},{A,B,C}.

How FP-Growth Improves on Apriori

No candidate generation: Apriori generates and tests huge numbers of candidate itemsets; FP-Growth avoids this entirely.
Fewer DB scans: only two scans vs. Apriori's repeated scans (one per pass/level).
Compact structure: the FP-tree compresses shared prefixes, saving memory.
Faster: especially on dense datasets with long frequent patterns, often an order of magnitude faster.
Drawback: the FP-tree may not fit in memory for very large/sparse data.

Answer 3

Classification

Classification is a supervised learning task that builds a model (classifier) from a labeled training set to predict the categorical class label of new, unseen data. It works in two phases: (1) learning/model construction from training data, and (2) classification/prediction with accuracy estimated on a test set.

ID3 Decision Tree Algorithm

ID3 (Iterative Dichotomiser 3) builds a tree top-down by selecting, at each node, the attribute with the highest information gain (greatest entropy reduction).

Entropy of dataset $S$ with classes $i$ :

Entropy(S) = -\sum_i p_i \log_2 p_i

Information Gain of attribute $A$ :

Gain(S,A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

Procedure:

Compute entropy of the whole set.
For each attribute compute information gain.
Choose the attribute with maximum gain as the splitting node.
Split data on its values; recurse on each branch.
Stop when all instances in a node share one class (leaf), no attributes remain (majority class), or no data remains.

Example

Dataset (play tennis style), attribute Outlook (Sunny/Overcast/Rain), target Play (Yes/No), 14 samples with 9 Yes, 5 No:

Entropy(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940

For Outlook: Sunny(2Y,3N)→0.971, Overcast(4Y,0N)→0, Rain(3Y,2N)→0.971.

Gain(S,Outlook)=0.940-\big(\tfrac{5}{14}(0.971)+\tfrac{4}{14}(0)+\tfrac{5}{14}(0.971)\big)\approx 0.246

If Outlook gives the highest gain among all attributes, it becomes the root. The Overcast branch is pure → leaf Yes. The Sunny and Rain branches are split further (e.g., Sunny on Humidity, Rain on Wind) by recomputing gains, producing the final tree.

Resulting tree (described):

Outlook
├── Sunny    → Humidity (High→No, Normal→Yes)
├── Overcast → Yes
└── Rain     → Wind (Strong→No, Weak→Yes)

New records are classified by traversing from root to a leaf.

Answer 4

A data mart is a small, subject-oriented subset of a data warehouse focused on a single department or business function (e.g., Sales, Finance, HR). It contains only the data relevant to that group's analysis needs.

Differences from a data warehouse:

Aspect	Data Mart	Data Warehouse
Scope	Single department/subject	Entire enterprise
Size	Small (MB–few GB)	Large (GB–TB+)
Data sources	Few	Many, integrated
Design/build time	Quick, cheaper	Long, complex, costly
Users	Departmental	Organization-wide

Data marts can be dependent (sourced from the central warehouse) or independent (built directly from operational sources).

Answer 5

Snowflake Schema

The snowflake schema is a refinement of the star schema in which dimension tables are normalized into multiple related tables, forming a snowflake shape around the central fact table.

Characteristics:

A central fact table holds measures and foreign keys.
Dimension tables are split into sub-dimensions to remove redundancy (normalization).
Saves storage but needs more joins, so queries can be slower than a star schema.

Example (Sales):

Fact table: Sales(product_id, store_id, time_id, units_sold, revenue)
Product dimension normalized into Product(product_id, name, category_id) → Category(category_id, category_name).
Store dimension normalized into Store(store_id, city_id) → City(city_id, city_name, state_id) → State(state_id, state_name).

Here the Product → Category and Store → City → State chains branch out like a snowflake, unlike the star schema where each dimension is a single flat denormalized table.

Answer 6

For an association rule $A \Rightarrow B$ over a transaction database:

Support — the fraction of all transactions that contain both $A$ and $B$ . It measures how frequently the itemset occurs.

Support(A \Rightarrow B) = P(A \cup B) = \frac{\text{# transactions containing } A \text{ and } B}{\text{Total transactions}}

Confidence — the fraction of transactions containing $A$ that also contain $B$ . It measures the reliability/strength of the rule.

Confidence(A \Rightarrow B) = P(B \mid A) = \frac{Support(A \cup B)}{Support(A)}

Example: If 100 transactions, 20 contain {bread, butter} and 25 contain bread, then $Support = 20/100 = 20\%$ and $Confidence = 20/25 = 80\%$ . Rules are kept only if they meet a user-set minimum support and minimum confidence.

Answer 7

Data Cleaning

Data cleaning is the preprocessing step that detects and corrects (or removes) errors, inconsistencies, noise, duplicates, and missing values from data so that mining produces reliable results. It includes handling missing values, smoothing noisy data, identifying outliers, and resolving inconsistencies.

Methods to Handle Missing Values

Ignore the tuple — drop records with missing class label / many missing values (used when missing data is small).
Fill manually — domain expert enters the value (accurate but tedious, impractical for large data).
Use a global constant — replace with a constant like "Unknown" or $-\infty$ .
Use the attribute mean/median — fill with the average (or median) of that attribute.
Use the class-wise mean — fill with the mean of the attribute for samples of the same class.
Use the most probable value — predict it using regression, decision trees, Bayesian inference, or interpolation (most accurate).

Answer 8

Naive Bayes Classifier

A Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem with a strong ("naive") assumption that all attributes are conditionally independent given the class.

Bayes' theorem:

P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}

where $X = (x_1, x_2, \dots, x_n)$ is the feature vector and $C$ a class.

Working:

Compute the prior $P(C)$ for each class from training data.
By the independence assumption, the likelihood factorizes:

P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)

For a new instance, compute $P(C \mid X) \propto P(C)\prod_i P(x_i \mid C)$ for every class. $P(X)$ is constant and can be ignored.
Assign the class with the highest posterior (MAP rule):

\hat{C} = \arg\max_C \; P(C)\prod_{i} P(x_i \mid C)

Notes: continuous attributes use a Gaussian density; Laplace smoothing avoids zero probabilities. It is simple, fast, and works well even when the independence assumption is only approximately true (e.g., spam filtering).

Answer 9

Supervised learning uses labeled training data (input–output pairs) to learn a mapping that predicts the output for new inputs. Unsupervised learning uses unlabeled data and discovers hidden structure/patterns on its own.

Aspect	Supervised	Unsupervised
Training data	Labeled (has target/class)	Unlabeled
Goal	Predict label / value	Find structure / grouping
Tasks	Classification, regression	Clustering, association, dimensionality reduction
Examples	Decision tree, Naive Bayes, SVM	k-means, hierarchical clustering, Apriori
Feedback	Guided by known answers	No feedback / ground truth

Example: classifying email as spam/not-spam (supervised) vs. grouping customers into segments without predefined labels (unsupervised).

Answer 10

Hierarchical Clustering

Hierarchical clustering builds a tree (dendrogram) of nested clusters instead of a single flat partition. It does not require the number of clusters in advance. Two approaches:

1. Agglomerative (bottom-up):

Start with each object as its own cluster.
Repeatedly merge the two closest clusters.
Continue until all objects form one cluster.

2. Divisive (top-down):

Start with all objects in one cluster.
Repeatedly split clusters until each object is separate.

Distance (linkage) measures between clusters: single linkage (minimum/nearest pair), complete linkage (maximum/farthest pair), average linkage, and centroid linkage.

The resulting dendrogram can be "cut" at a chosen height to obtain a desired number of clusters. It is intuitive but computationally costly ( $O(n^2)$ or worse) and merges/splits cannot be undone.

Answer 11

In a star/snowflake schema of a data warehouse:

Fact Table

The central table that stores measurable, numeric facts (measures) of a business process — e.g., units_sold, revenue, quantity.
Contains foreign keys referencing all related dimension tables.
Typically very large (many rows), with few columns.

Dimension Table

Stores descriptive, textual attributes (context) used to filter/group facts — e.g., Time (day, month, year), Product (name, category), Customer, Location.
Has a primary key referenced by the fact table.
Usually smaller with many descriptive columns.

Relationship: A fact table connects to multiple dimension tables; dimensions provide the "who, what, when, where" while the fact table provides the "how much."

Example: Sales(time_id, product_id, store_id, units_sold) is the fact table; Time, Product, Store are dimension tables.

Answer 12

Metadata in Data Warehousing

Metadata is "data about data" — it describes the structure, source, meaning, and operations of the data stored in the warehouse, and acts as a roadmap/directory for the entire DW.

Types of metadata:

Business metadata — meaning of data in business terms: definitions, ownership, business rules, mappings (helps end users).
Technical metadata — schema definitions, table/column structures, data types, indexes, access permissions (used by developers/DBAs).
Operational (ETL) metadata — source-to-target mappings, transformation/cleaning rules, data lineage, load history, refresh schedules, and data currency.

Importance / uses:

Acts as a directory to locate and understand warehouse contents.
Guides the ETL process and ensures data lineage and consistency.
Helps in query building, mapping, and impact analysis.
Improves data quality, governance, and maintenance.

Metadata is held in a metadata repository within the bottom tier of the DW architecture.

Level	BSc CSIT (TU)
Stream	Science
Subject	Data Warehousing and Data Mining (BSc CSIT, CSC410)
Year	2075 BS
Exam session	Regular (annual)
Full marks	60
Time allowed	180 minutes
Questions	12, all with step-by-step solutions

BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2075 Nepal

Section A: Long Answer Questions

Architecture of a Data Warehouse

Three-Tier Architecture

Components

MOLAP vs ROLAP vs HOLAP

FP-Growth Algorithm (Frequent Pattern Growth)

Steps

Example

How FP-Growth Improves on Apriori

Classification

ID3 Decision Tree Algorithm

Example

Section B: Short Answer Questions

Snowflake Schema

Data Cleaning

Methods to Handle Missing Values

Naive Bayes Classifier

Hierarchical Clustering

Metadata in Data Warehousing

Frequently asked questions