BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2074 Nepal

Q: Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2074?

The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 (Regular (annual)) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.

Q: Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 paper come with solutions?

Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.

Q: How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 paper?

The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2074 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.

Q: Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?

Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.

Question

1Long answer10 marks

What is a data warehouse? Explain its characteristics. Describe the three-tier data warehouse architecture with a suitable diagram.

data-warehousearchitecture

Answer 1

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data organized to support management decision-making (W. H. Inmon). It consolidates historical data from multiple operational (transactional) sources into a single repository optimized for query and analysis rather than transaction processing.

Characteristics

Subject-oriented — Organized around major subjects (customer, product, sales) rather than around day-to-day operations or applications.
Integrated — Data from heterogeneous sources is cleaned and made consistent (uniform naming, encoding, units, formats).
Time-variant — Stores historical data with a time dimension (months/years of data), enabling trend analysis. Every record carries an explicit or implicit time element.
Non-volatile — Data is loaded and read but normally not updated or deleted by end users; it is refreshed periodically (read-only for analysis).

Three-Tier Architecture

  Top Tier:      [ OLAP Tools | Query/Reporting | Data Mining | Analysis ]   <- Front-end client tools
                                  |
  Middle Tier:           [  OLAP Server (ROLAP / MOLAP)  ]                    <- presents multidimensional view
                                  |
  Bottom Tier:   [  Data Warehouse Server (RDBMS) + Data Marts + Metadata ]   <- ETL loads from sources
                                  |
              [ Operational DBs ][ Flat Files ][ External Sources ]

Bottom tier (Warehouse database server): Usually a relational database. Data is extracted from operational databases and external sources, then cleaned, transformed, and loaded (ETL/back-end tools). It also holds data marts and a metadata repository.
Middle tier (OLAP server): Implemented as either a ROLAP (relational OLAP, maps multidimensional operations to SQL) or MOLAP (multidimensional OLAP, uses array-based storage/data cubes). It presents an abstracted, multidimensional view of the data to the top tier.
Top tier (Front-end client layer): Tools for querying, reporting, analysis, and data mining used by analysts and decision makers.

This layered design separates storage, multidimensional processing, and presentation, improving scalability and performance.

Answer 2

Association Rule Mining

Association rule mining discovers interesting relationships (co-occurrence patterns) among items in large transactional datasets. A rule has the form $X \Rightarrow Y$ where $X, Y$ are disjoint itemsets, e.g. $\{\text{bread, butter}\} \Rightarrow \{\text{milk}\}$ ("market basket analysis").

Two key measures:

Support: $\text{supp}(X) = \dfrac{\text{transactions containing } X}{\text{total transactions}}$
Confidence: $\text{conf}(X \Rightarrow Y) = \dfrac{\text{supp}(X \cup Y)}{\text{supp}(X)}$

A rule is strong if it meets a minimum support and minimum confidence threshold.

Apriori Algorithm

Apriori finds all frequent itemsets using the Apriori property: every non-empty subset of a frequent itemset must also be frequent (anti-monotonicity). It works level-by-level:

1. Find all frequent 1-itemsets (L1) by counting support.
2. For k = 2,3,...:
   a. Generate candidate k-itemsets Ck by joining L(k-1) with itself.
   b. Prune candidates having any infrequent (k-1)-subset.
   c. Scan database to count support of each candidate.
   d. Lk = candidates with support >= min_support.
3. Stop when Lk is empty. Frequent itemsets = union of all Lk.
4. Generate strong rules from frequent itemsets using min_confidence.

Example

Transactions (min_support = 50% i.e. 2 of 4, min_confidence = 70%):

TID	Items
T1	A, B, C
T2	A, C
T3	A, D
T4	B, C

L1: A(3), B(2), C(3), D(1). D is pruned (<2). L1 = {A, B, C}.
C2: {A,B}=1, {A,C}=2, {B,C}=2. L2 = {A,C}(2), {B,C}(2).
C3: {A,B,C} — needs subset {A,B} which is infrequent, so pruned. L3 empty.

Frequent itemsets: {A}, {B}, {C}, {A,C}, {B,C}.

Rule from {B,C}: $C \Rightarrow B$ , confidence $= 2/3 = 66\%$ (rejected); $B \Rightarrow C$ , confidence $= 2/2 = 100\%$ (strong rule).

Limitations

Multiple database scans — one full scan per level $k$ , expensive for large data.
Huge number of candidates — candidate generation explodes when there are many frequent items (e.g. $10^4$ 1-itemsets give $\approx 10^7$ candidate 2-itemsets).
High memory and computation cost for low support thresholds.
Improvements such as FP-Growth (no candidate generation, two scans) address these issues.

Answer 3

K-Means Clustering

K-means is an unsupervised, partitional clustering algorithm that divides $n$ objects into $k$ clusters so that intra-cluster similarity is high. Each cluster is represented by its centroid (mean of its points). It minimizes the sum of squared errors $SSE = \sum_{i=1}^{k}\sum_{x \in C_i} \lVert x - \mu_i \rVert^2$ .

Algorithm

1. Choose k (number of clusters); select k initial centroids.
2. Assignment step: assign each point to the nearest centroid
   (using Euclidean distance).
3. Update step: recompute each centroid as the mean of its members.
4. Repeat steps 2-3 until centroids no longer change (convergence).

Worked Example

Data (1-D): 2, 4, 10, 12, 3, 20, 30, 11, 25, with $k = 2$ .

Initial centroids: $m_1 = 2$ , $m_2 = 4$ .

Iteration 1 — assign by nearest centroid:

Cluster1 (near 2): {2, 3} → new mean $m_1 = 2.5$
Cluster2 (near 4): {4, 10, 12, 20, 30, 11, 25} → new mean $m_2 = 16$

Iteration 2 — with $m_1=2.5, m_2=16$ :

Cluster1: {2, 3, 4} → mean $= 3$
Cluster2: {10, 12, 20, 30, 11, 25} → mean $= 18$

Iteration 3 — with $m_1=3, m_2=18$ :

Cluster1: {2, 3, 4, 10} → mean $= 4.75$
Cluster2: {12, 20, 30, 11, 25} → mean $= 19.6$

Iteration 4 — with $m_1=4.75, m_2=19.6$ :

Cluster1: {2, 3, 4, 10, 11, 12} → mean $= 7$
Cluster2: {20, 30, 25} → mean $= 25$

Iteration 5 — with $m_1=7, m_2=25$ :

Cluster1: {2, 3, 4, 10, 11, 12} → mean $= 7$
Cluster2: {20, 30, 25} → mean $= 25$

Assignments and centroids no longer change → converged.

Final clusters: $C_1 = \{2,3,4,10,11,12\}$ (centroid 7), $C_2 = \{20,25,30\}$ (centroid 25).

Notes

Result depends on initial centroid choice and chosen $k$ ; sensitive to outliers; works best with spherical, well-separated clusters.

Answer 4

Data mining is the process of discovering interesting, previously unknown, and potentially useful patterns, correlations, and knowledge from large amounts of data. It is the core analysis step of the KDD (Knowledge Discovery in Databases) process and uses techniques from machine learning, statistics, and database systems.

Major Applications

Market basket analysis / retail — association rules to drive product placement and cross-selling.
Banking & finance — credit scoring, fraud detection, risk analysis.
Telecommunications — fraud detection, churn prediction.
Healthcare & bioinformatics — disease diagnosis, gene/protein analysis.
CRM — customer segmentation and targeted marketing.
Web mining & search engines — recommendation systems, click-stream analysis.
Manufacturing — quality control and fault prediction.

Answer 5

OLTP vs OLAP

Feature	OLTP (Online Transaction Processing)	OLAP (Online Analytical Processing)
Purpose	Day-to-day operational transactions	Analysis and decision support
Data	Current, detailed, operational	Historical, summarized, consolidated
Orientation	Application/transaction oriented	Subject oriented
Operations	Frequent INSERT/UPDATE/DELETE	Mostly read-only complex queries
Design	Normalized (ER, 3NF)	De-normalized (star/snowflake, cubes)
Records accessed	Few rows per query	Millions of rows aggregated
Users	Clerks, operational staff (many)	Analysts, managers (fewer)
Response time	Milliseconds	Seconds to minutes
Database size	MB–GB	GB–TB
Backup/recovery	Critical, real-time	Periodic; reload from sources

Summary: OLTP runs the business (operational data), while OLAP analyzes the business (decision support over a data warehouse).

Answer 6

Concept Hierarchy

A concept hierarchy defines a sequence of mappings from a set of low-level (detailed) concepts to higher-level (more general) concepts. It organizes attribute values into multiple levels of abstraction, enabling data generalization and roll-up/drill-down in OLAP.

Types

Schema hierarchy — based on the schema/attributes, e.g. street < city < state < country for a location dimension.
Set-grouping hierarchy — groups values into ranges, e.g. age: {0–18 = young}, {19–40 = adult}, {41+ = senior}.

Example

For the location dimension:

            country  (Nepal)
               |
            province (Bagmati)
               |
             city    (Kathmandu)
               |
            street   (Durbar Marg)

Rolling up moves from street toward country (more general/summarized); drilling down moves the other way (more detailed). Such hierarchies let analysts view sales by street, then summarize by city, province, or country.

Answer 7

OLAP Operations

OLAP operations let users navigate a multidimensional data cube at different levels of abstraction.

Roll-up (drill-up): Aggregates data by climbing up a concept hierarchy or reducing a dimension. e.g. moving sales from city → province → country, producing more summarized data.
Drill-down: The reverse of roll-up — moves from summarized to more detailed data by stepping down a concept hierarchy or adding a dimension. e.g. quarter → month → day, or country → city.
Slice: Selects a single value for one dimension, producing a sub-cube (a 2-D slice). e.g. fix time = Q1 to see sales for all products/locations in Q1 only.
Dice: Selects a sub-cube by specifying a range/condition on two or more dimensions. e.g. location ∈ {KTM, Pokhara} AND time ∈ {Q1, Q2} AND item ∈ {phone, tablet}.

(Related: Pivot/rotate reorients the cube's axes to give an alternative view.)

Answer 8

Data Preprocessing

Data preprocessing is the set of techniques applied to raw data before mining to transform it into a clean, consistent, and suitable form for analysis. Real-world data is typically incomplete (missing values), noisy (errors, outliers), and inconsistent (conflicting codes/formats), so it cannot be mined reliably as-is.

Major Tasks

Data cleaning — fill in missing values, smooth noise, remove outliers, resolve inconsistencies.
Data integration — combine data from multiple sources, resolving conflicts and redundancy.
Data reduction — reduce volume via dimensionality/numerosity reduction or aggregation while preserving analytical results.
Data transformation — normalization, smoothing, generalization, attribute construction.

Why It Is Necessary

Improves data quality, hence the quality and accuracy of mining results (garbage in, garbage out).
Many algorithms require clean, consistent, properly scaled input.
Reduces size and complexity, improving mining efficiency and reducing storage/computation.

Answer 9

KDD Process

KDD (Knowledge Discovery in Databases) is the overall, iterative process of converting raw data into useful knowledge; data mining is one core step within it.

Steps

Data Cleaning — remove noise and inconsistent data, handle missing values.
Data Integration — combine data from multiple heterogeneous sources.
Data Selection — retrieve the data relevant to the analysis task from the database.
Data Transformation — transform/consolidate data into forms appropriate for mining (e.g. normalization, aggregation).
Data Mining — apply intelligent methods (classification, clustering, association, etc.) to extract patterns.
Pattern Evaluation — identify the truly interesting patterns using interestingness measures.
Knowledge Presentation — use visualization and reporting techniques to present mined knowledge to users.

(Steps 1–4 are often grouped as data preprocessing.) The process is iterative: feedback may loop back to earlier steps.

Answer 10

Decision Tree

A decision tree is a flowchart-like tree structure used for classification (and regression), where:

each internal node tests an attribute,
each branch represents an outcome of the test, and
each leaf node holds a class label.

The path from the root to a leaf forms a classification rule.

Use in Classification

Building (training): The tree is induced top-down by recursively partitioning the training data. At each node, the best splitting attribute is chosen using a measure such as Information Gain / Entropy (ID3), Gain Ratio (C4.5), or Gini index (CART) — the attribute that best separates the classes.
Splitting continues until nodes are pure (one class), no attributes remain, or a stopping criterion is met; pruning removes branches that overfit.
Classifying a new record: Start at the root, test attributes along matching branches, and follow the path to a leaf; the leaf's class label is the predicted class.

Advantages: easy to understand and interpret, no need for normalization, handles both numeric and categorical data, and the rules are explicit.

Answer 11

Classification vs Clustering

Aspect	Classification	Clustering
Learning type	Supervised	Unsupervised
Class labels	Predefined/known; uses labeled training data	No predefined labels; groups discovered from data
Goal	Assign new objects to known classes	Partition data into natural groups (clusters)
Training data	Required (labeled)	Not required
Basis of grouping	Learned model that maps features → label	Similarity/distance between objects
Output	A predictive model + class prediction	Set of clusters
Examples	Decision tree, Naive Bayes, SVM, KNN	K-means, hierarchical, DBSCAN
Evaluation	Accuracy, precision, recall (vs true labels)	Cohesion/separation, silhouette (no true labels)

Summary: Classification predicts a known class label using labeled training data (supervised), whereas clustering discovers the groupings in unlabeled data based on similarity (unsupervised).

Answer 12

Star Schema

The star schema is the most common multidimensional model for a data warehouse. It consists of:

A central fact table containing measures (numeric, additive facts such as sales_amount, quantity) and foreign keys to the dimensions.
A set of dimension tables (e.g. time, product, location, customer), each connected directly to the fact table.

The diagram resembles a star: the fact table at the center with dimension tables radiating outward.

        [ Time ]        [ Product ]
             \            /
              \          /
            [   FACT (Sales)   ]
              /          \
             /            \
      [ Location ]      [ Customer ]

Key points

Dimension tables are de-normalized (each dimension is a single table), giving fewer joins and fast query performance.
Simple, intuitive design, well suited to OLAP.
Uses more storage due to redundancy in dimensions.
Contrast: the snowflake schema normalizes dimensions into sub-tables (more joins, less redundancy).

Level	BSc CSIT (TU)
Stream	Science
Subject	Data Warehousing and Data Mining (BSc CSIT, CSC410)
Year	2074 BS
Exam session	Regular (annual)
Full marks	60
Time allowed	180 minutes
Questions	12, all with step-by-step solutions

BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2074 Nepal

Section A: Long Answer Questions

Data Warehouse

Characteristics

Three-Tier Architecture

Association Rule Mining

Apriori Algorithm

Example

Limitations

K-Means Clustering

Algorithm

Worked Example

Notes

Section B: Short Answer Questions

Major Applications

OLTP vs OLAP

Concept Hierarchy

Types

Example

OLAP Operations

Data Preprocessing

Major Tasks

Why It Is Necessary

KDD Process

Steps

Decision Tree

Use in Classification

Classification vs Clustering

Star Schema

Frequently asked questions