BSc CSIT (TU) Science Data Warehousing and Data Mining (BSc CSIT, CSC410) Question Paper 2080 Nepal
This is the official BSc CSIT (TU) (Science stream) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper for 2080, as set in the regular annual examination. It carries 60 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) exam or solving previous years' question papers, this 2080 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt any TWO questions.
Explain the data warehouse architecture and the ETL process in detail. Discuss the role of metadata in a data warehouse.
Data Warehouse Architecture
A data warehouse (DW) is a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. Its architecture is typically organized in three tiers.
Three-Tier Architecture
- Bottom Tier (Warehouse Database Server): A relational database system. Data is extracted from operational/external sources, cleaned, transformed and loaded here via back-end (ETL) tools. This tier also holds the metadata repository.
- Middle Tier (OLAP Server): Presents a multidimensional view of data. Implemented as a ROLAP (relational OLAP) server that maps operations on multidimensional data to standard relational operations, or a MOLAP (multidimensional OLAP) server that directly implements multidimensional data and operations.
- Top Tier (Front-End Client): Contains query, reporting, analysis and data-mining tools.
Data typically flows: Operational/external sources → Staging area → Enterprise Data Warehouse → Data Marts → OLAP/Analysis tools.
The ETL Process
ETL = Extract, Transform, Load — the back-end process that populates the warehouse.
- Extract: Read data from heterogeneous sources (RDBMS, flat files, ERP, web logs). Can be full or incremental.
- Transform: The cleaning and integration step performed in a staging area:
- Data cleaning — remove noise, fill missing values, correct inconsistencies.
- Integration — resolve schema/naming conflicts, merge sources.
- Transformation — normalization, aggregation, derivation of new attributes, surrogate-key generation.
- Load: Write the transformed data into the warehouse fact and dimension tables, and refresh indexes/aggregates. May be done by bulk load and refreshed periodically.
Role of Metadata
Metadata is "data about data" — it describes the warehouse contents and structure. It is stored in the metadata repository and includes:
- Structure/definition metadata: schema, dimensions, hierarchies, derived-data definitions, data-mart locations.
- Operational metadata: data lineage (history/source of migrated data), currency (active, archived, purged), monitoring/usage statistics.
- ETL/mapping metadata: source-to-target mappings, cleaning and transformation rules, refresh schedules.
- Business metadata: business terms, ownership, and charging policies.
Metadata acts as the directory that helps users locate data, the guide for the ETL mapping, and the basis for query optimization and data lineage — making it central to building, using and maintaining the warehouse.
Explain the FP-growth algorithm. Construct the FP-tree for a given transaction dataset and mine the frequent patterns.
FP-Growth Algorithm
FP-Growth (Frequent Pattern Growth) mines frequent itemsets without candidate generation, unlike Apriori. It compresses the database into a compact FP-tree and then mines patterns recursively using a divide-and-conquer strategy. It requires only two database scans.
Steps:
- Scan DB once, count item supports, discard items below min_support, and sort frequent items in descending support order (F-list).
- Scan DB again; for each transaction, sort its frequent items by the F-list order and insert into the FP-tree (shared prefixes share nodes; node counts incremented).
- Build a header table linking all nodes of each item.
- Mine recursively: for each item (least frequent first), build its conditional pattern base → conditional FP-tree → generate frequent patterns.
Worked Example
Let min_support = 2. Transactions:
| TID | Items |
|---|---|
| T1 | A, B, C |
| T2 | A, C, D |
| T3 | A, B, D |
| T4 | B, C, D |
| T5 | A, B, C, D |
Step 1 — Support counts: A=4, B=4, C=4, D=4. All ≥ 2. F-list (desc, ties broken alphabetically): A, B, C, D.
Step 2 — Ordered transactions:
- T1: A,B,C T2: A,C,D T3: A,B,D T4: B,C,D T5: A,B,C,D
Step 3 — FP-tree (described):
root
├─ A:4
│ ├─ B:3 ── C:2 ── D:1
│ │ └─ D:1 (B-C-D path count)
│ ├─ C:1 ── D:1
│ └─ (B:3 also has D:1 child for A-B-D)
└─ B:1 ── C:1 ── D:1 (T4: B,C,D, no A)
Header links chain all nodes of A, B, C, D.
Step 4 — Mine (bottom-up from D):
- D: conditional pattern base = {A,B,C:1},{A,C:1},{A,B:1},{B,C:1}. Frequent: {A,D}=3, {B,D}=3, {C,D}=3.
- C: base = {A,B:2},{A:1},{B:1}. Frequent: {A,C}=3, {B,C}=3.
- B: base = {A:3}. Frequent: {A,B}=3.
- A: single item, support 4.
Frequent patterns (support ≥ 2): {A}:4, {B}:4, {C}:4, {D}:4, {A,B}:3, {A,C}:3, {A,D}:3, {B,C}:3, {B,D}:3, {C,D}:3.
Advantages
No candidate generation, only two scans, compact tree → far faster than Apriori on dense data.
What is cluster analysis? Compare partitioning, hierarchical, and density-based clustering methods with examples.
Cluster Analysis
Cluster analysis is the process of grouping a set of data objects into clusters such that objects within a cluster are highly similar to one another and dissimilar to objects in other clusters. It is unsupervised (no predefined class labels). Quality depends on the similarity/distance measure (e.g., Euclidean distance) and the method used. Applications: customer segmentation, image processing, outlier detection.
Comparison of Clustering Methods
| Aspect | Partitioning | Hierarchical | Density-based |
|---|---|---|---|
| Idea | Divide n objects into k partitions, iteratively relocate to optimize a criterion | Build a tree (dendrogram) of nested clusters | Grow clusters as dense regions separated by sparse regions |
| k known? | Yes, k given in advance | No | No |
| Cluster shape | Spherical/convex only | Convex | Arbitrary shapes |
| Outliers | Sensitive | Sensitive | Handles noise/outliers |
| Example algorithm | k-means, k-medoids (PAM) | AGNES (agglomerative), DIANA (divisive) | DBSCAN, OPTICS |
Partitioning Methods
- Construct k partitions and iteratively improve them (assign each object to nearest centroid, recompute centroids).
- k-means minimizes within-cluster squared error . Fast, but needs k, sensitive to outliers and finds only spherical clusters.
Hierarchical Methods
- Create a dendrogram. Agglomerative (bottom-up): start with each object as a cluster, repeatedly merge the two closest clusters. Divisive (top-down): start with one cluster and split.
- No need to specify k in advance; cut the dendrogram at desired level. But merges/splits are irreversible and cost is high ( or more).
Density-Based Methods
- A cluster is a maximal set of density-connected points. DBSCAN uses parameters Eps (radius) and MinPts; classifies points as core, border, or noise.
- Finds clusters of arbitrary shape, is robust to noise, and does not need k, but is sensitive to the Eps/MinPts settings.
Summary: Use partitioning when k is known and clusters are compact; hierarchical when a cluster hierarchy is desired; density-based for arbitrary-shaped clusters with noise.
Section B: Short Answer Questions
Attempt any EIGHT questions.
What are the characteristics of a data warehouse (subject-oriented, integrated, time-variant, non-volatile)?
A data warehouse has four key characteristics (Inmon's definition):
- Subject-oriented: Organized around major subjects/entities of the enterprise (e.g., customer, product, sales) rather than around day-to-day operations/transactions. It focuses on modelling and analysis for decision makers.
- Integrated: Built by integrating multiple heterogeneous sources (RDBMS, flat files). Data-cleaning and integration techniques ensure consistency in naming conventions, encoding, units and attribute measures.
- Time-variant: Data is stored to provide a historical perspective (e.g., last 5–10 years). Every key structure contains, implicitly or explicitly, a time element, so the warehouse keeps snapshots over time rather than just current values.
- Non-volatile: Data is physically separated from operational data and is loaded read-only. Operational update/delete do not occur; the warehouse normally needs only two operations: initial loading of data and access (query) of data.
Explain the concept of data cube and aggregation.
Data Cube
A data cube is a multidimensional model that allows data to be viewed and analyzed in terms of dimensions and measures (facts).
- Dimensions are the perspectives/entities for analysis (e.g., time, item, location), each with a hierarchy (e.g., day → month → quarter → year).
- Measures are numeric quantities stored in cells (e.g., sales amount, units sold).
A cube generalizes a 2-D table to n dimensions. The set of all cubes from the most detailed (base cuboid) to the most aggregated (apex cuboid) is the data-cube lattice (cuboids).
Aggregation
Aggregation is the summarization of measure values across one or more dimensions using functions such as SUM, COUNT, AVG, MIN, MAX.
- Rolling up time from month to year aggregates sales for the whole year.
- The apex cuboid (all) holds the total aggregate over every dimension.
Example: For a Sales cube with dimensions (time, item, location), the cell (2023, Laptop, Kathmandu) stores total sales for that combination. Aggregating over location gives total laptop sales in 2023 across all cities. OLAP operations like roll-up increase aggregation, while drill-down reduces it.
What is a frequent itemset? Explain with an example.
Frequent Itemset
An itemset is a set of one or more items. The support of an itemset is the fraction (or count) of transactions that contain :
An itemset is frequent if its support is greater than or equal to a user-specified minimum support threshold (). A k-itemset contains k items.
Example
| TID | Items |
|---|---|
| T1 | Bread, Milk |
| T2 | Bread, Butter |
| T3 | Bread, Milk, Butter |
| T4 | Milk, Butter |
Let (i.e., count ≥ 2 of 4 transactions).
- {Bread} = 3, {Milk} = 3, {Butter} = 3 → all frequent.
- {Bread, Milk} = 2 → frequent. {Bread, Butter} = 2 → frequent. {Milk, Butter} = 2 → frequent.
- {Bread, Milk, Butter} = 1 → not frequent.
So {Bread, Milk}, {Bread, Butter}, {Milk, Butter} are frequent 2-itemsets. Frequent itemsets form the basis for generating association rules.
Explain the Gini index as a splitting criterion.
Gini Index as a Splitting Criterion
The Gini index measures the impurity of a dataset and is used (e.g., in CART) to choose the best attribute to split on in a decision tree. For a dataset with classes and the probability of class :
- means the node is pure (all one class); higher values mean more impurity.
Splitting: When attribute splits into subsets and , the Gini index of the split is the weighted sum:
The reduction in impurity is . The attribute (and split point) giving the maximum reduction (i.e., the minimum ) is chosen.
Example: If a node has 6 positive and 4 negative samples (10 total):
The split producing the lowest weighted Gini is selected. CART uses Gini to create binary splits.
What is the agglomerative approach in hierarchical clustering?
Agglomerative (Bottom-Up) Hierarchical Clustering
The agglomerative approach (AGNES) is a hierarchical clustering method that works bottom-up:
- Start by placing each data object in its own cluster (n clusters for n objects).
- Repeatedly merge the two closest (most similar) clusters into one.
- Continue until all objects are in a single cluster or a stopping condition (desired number of clusters) is met.
The sequence of merges is recorded as a dendrogram; cutting it at a chosen level yields the clustering.
Inter-cluster distance (linkage) methods used to decide which clusters are closest:
- Single linkage: minimum distance between points of two clusters.
- Complete linkage: maximum distance between points.
- Average linkage: average pairwise distance.
- Centroid linkage: distance between cluster centroids.
Characteristics: No need to specify k in advance; however, merges are irreversible (a merge cannot be undone) and the method is computationally expensive ( or more). It is the opposite of the divisive (top-down, DIANA) approach, which starts with one cluster and splits.
Explain the issues in data quality.
Issues in Data Quality
Real-world data is often dirty, and poor data quality leads to unreliable mining results ("garbage in, garbage out"). Major data-quality issues:
- Incompleteness (Missing data): Lacking attribute values or certain attributes of interest (e.g., empty income field). Caused by data not collected or not entered.
- Noise (Errors/Outliers): Random error or variance in a measured variable (e.g., salary = -100), often from faulty instruments or transmission errors.
- Inconsistency: Discrepancies in codes, names or formats (e.g., Age = 30 but DOB = 1920; "Kathmandu" vs "KTM"). Common when integrating multiple sources.
- Duplication / Redundancy: The same record or attribute appearing more than once, biasing results.
- Invalid / Out-of-range values: Values violating constraints or business rules.
- Timeliness and Believability: Data not updated in time, or data users do not trust.
- Interpretability: Data that is hard to understand.
These issues are addressed in the data preprocessing phase through data cleaning (fill missing values, smooth noise, remove inconsistencies/duplicates), integration, and transformation.
Differentiate between eager and lazy learners.
Eager vs. Lazy Learners
| Aspect | Eager Learners | Lazy Learners |
|---|---|---|
| Model building | Construct a generalization (model) from training data before receiving test cases | Simply store the training data and wait until a test tuple arrives |
| Training time | High (most work done during training) | Low/none (no model built) |
| Classification (test) time | Fast — just apply the model | Slow — must compare with stored tuples at query time |
| Generalization | Commits to a single global model/hypothesis | Forms a local approximation per query; supports incremental learning |
| Storage | Stores only the model | Stores the entire training set |
| Examples | Decision tree (ID3, C4.5), Naïve Bayes, Backpropagation/Neural networks, SVM | k-Nearest Neighbor (k-NN), Case-based reasoning |
In short: Eager learners do the work up front (slow to train, fast to classify), whereas lazy learners defer computation to classification time (fast to train, slow to classify) and adapt locally to each query.
What is the support count and minimum support threshold?
Support Count and Minimum Support Threshold
Support count (frequency / absolute support): The number of transactions in the database that contain a particular itemset . Denoted .
The (relative) support is this count divided by the total number of transactions:
Minimum support threshold (): A user-specified value (a count or a percentage) that an itemset's support must meet or exceed to be considered frequent. Itemsets with are frequent itemsets; those below it are pruned (this is the basis of the Apriori pruning).
Example: In 5 transactions, if {Milk, Bread} appears in 3 of them, its support count is 3 and its support is . If , the itemset is frequent; if , it is not.
Write short notes on temporal data mining.
Temporal Data Mining
Temporal data mining is the process of discovering interesting patterns, trends, and relationships from time-stamped (time-dependent) data — data whose attributes change over time or that includes a temporal/time dimension. It extends ordinary data mining by explicitly considering the ordering and time gaps between events.
Key tasks / techniques:
- Sequential pattern mining: find frequently occurring ordered subsequences (e.g., customers who buy a phone often buy a case later) — algorithms like GSP, PrefixSpan.
- Time-series analysis: trend, seasonality, similarity search and forecasting on numeric series (e.g., stock prices, weather).
- Temporal association rules: rules valid within specific time intervals.
- Periodicity / cyclic pattern detection: events recurring at regular intervals.
- Temporal clustering and classification.
Applications: stock-market and sales forecasting, weather prediction, network/IoT sensor analysis, web-clickstream and medical (patient history) analysis.
Challenges: handling large volume of time-stamped data, irregular time intervals, noise, and concept drift (patterns changing over time).
Frequently asked questions
- Where can I find the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) question paper 2080?
- The full BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 paper come with solutions?
- Yes. Every question on this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 paper?
- The BSc CSIT (TU) Data Warehousing and Data Mining (BSc CSIT, CSC410) 2080 paper carries 60 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper free?
- Yes — reading and attempting this Data Warehousing and Data Mining (BSc CSIT, CSC410) past paper on Kekkei is completely free.