BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) Question Paper 2079 Nepal
This is the official BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) question paper for 2079, as set in the regular annual examination. It carries 100 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Machine Learning (PU, CMP 364) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) exam or solving previous years' question papers, this 2079 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt all / any as specified.
(a) Define supervised learning and clearly distinguish it from unsupervised learning with one suitable example of each. (4)
(b) Consider a simple linear regression model . Derive the closed-form (normal equation) solution for the parameters and by minimizing the mean squared error cost function . (7)
(c) The following data is given for hours studied () and marks obtained (): . Fit the best-fit line using the result derived in part (b) and predict the marks for a student who studies for 6 hours. (3)
(a) Supervised vs Unsupervised Learning (4)
Supervised learning trains a model on a labelled dataset , where each input has a known target output . The goal is to learn a mapping that generalizes to unseen inputs.
Unsupervised learning works on unlabelled data and discovers hidden structure (groupings, density, low-dimensional representation) without any target output.
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Data | Labelled | Unlabelled |
| Goal | Predict | Find structure |
| Example | Email spam classification (spam/ham labels) | Customer segmentation via clustering |
(b) Normal Equation for Simple Linear Regression (7)
For , minimize
Set the partial derivatives to zero.
These give the normal equations:
Solving this system (using means ):
(c) Fitting the Data (3)
Data: , with .
Best-fit line: .
Prediction at : marks.
(a) Explain the working principle of a decision tree classifier. Define entropy and information gain, and explain how they are used to select the splitting attribute at each node. (6)
(b) A dataset of 14 training examples for the target PlayTennis (Yes/No) contains 9 Yes and 5 No instances. The attribute Outlook has three values: Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No), and Rain (3 Yes, 2 No). Compute the entropy of the whole dataset and the information gain of the attribute Outlook. (6)
(c) State two advantages and two limitations of decision trees compared to other classifiers. (2)
(a) Decision Tree Classifier, Entropy and Information Gain (6)
A decision tree is a tree-structured classifier where each internal node tests an attribute, each branch is an outcome of that test, and each leaf assigns a class label. Starting at the root, a record follows branches according to its attribute values until it reaches a leaf. The tree is built top-down, recursively, choosing at each node the attribute that best separates the classes.
Entropy measures the impurity of a set with class proportions :
It is for a pure node and maximal when classes are equally mixed.
Information Gain of attribute is the expected reduction in entropy after splitting on :
At each node the algorithm (ID3) selects the attribute with the highest information gain as the splitting attribute.
(b) Entropy and Information Gain of Outlook (6)
Dataset entropy (9 Yes, 5 No, total 14):
Entropy of each value of Outlook:
- Sunny (2 Yes, 3 No):
- Overcast (4 Yes, 0 No): (pure)
- Rain (3 Yes, 2 No):
Weighted entropy after split:
Information Gain:
(c) Advantages and Limitations (2)
Advantages: (1) Easy to interpret and visualize (white-box, human-readable rules); (2) Require little data preparation and handle both numeric and categorical features.
Limitations: (1) Prone to overfitting / high variance, producing unstable trees sensitive to small data changes; (2) Greedy axis-parallel splits can give suboptimal, biased trees compared to ensemble/SVM methods.
(a) Draw the architecture of a multilayer feedforward neural network with one hidden layer and explain the role of activation functions. Why is a non-linear activation function (such as sigmoid or ReLU) necessary? (6)
(b) Describe the backpropagation algorithm. Derive the weight-update rule for a single output-layer weight using gradient descent on the squared error loss. (5)
(c) Define overfitting in the context of neural networks and explain how L2 regularization (weight decay) and dropout help to reduce it. (3)
(a) Feedforward Network with One Hidden Layer (6)
Architecture (described): Three layers connected left to right —
- Input layer: nodes (one per feature).
- Hidden layer: nodes ; each receives a weighted sum of all inputs plus a bias, then applies an activation: .
- Output layer: nodes , computed similarly from the hidden activations.
Every node in one layer connects to every node in the next (fully connected); information flows forward only.
Role of activation functions: They introduce a non-linear transformation at each neuron, controlling the firing of the neuron and enabling the network to learn complex, non-linear decision boundaries.
Why non-linearity is necessary: If all activations were linear, the composition of layers would collapse into a single linear transformation — equivalent to a one-layer linear model, unable to represent non-linear functions (e.g. XOR). Non-linear activations such as sigmoid or ReLU give the network universal approximation capability.
(b) Backpropagation and Output-Weight Update (5)
Backpropagation trains the network in two passes: (1) a forward pass computes activations layer by layer and the loss; (2) a backward pass propagates the error from the output back to earlier layers using the chain rule, computing for every weight. Weights are then updated by gradient descent.
Derivation for an output-layer weight connecting hidden unit to output , with squared error , net input , and :
Define the output error term . The update rule is:
where is the learning rate.
(c) Overfitting, L2 Regularization and Dropout (3)
Overfitting occurs when a network learns the training data (including its noise) too closely, achieving low training error but high test error — it fails to generalize.
- L2 regularization (weight decay): adds a penalty to the loss, shrinking weights toward zero. Smaller weights give a smoother, simpler function that is less likely to overfit.
- Dropout: during training, randomly "drops" (sets to zero) a fraction of neurons each iteration. This prevents co-adaptation of neurons and acts like training an ensemble of sub-networks, improving generalization.
(a) Explain the k-means clustering algorithm step by step. State the objective function it tries to minimize. (6)
(b) Given the one-dimensional data points and with initial cluster centroids and , perform two iterations of k-means and report the final clusters and centroids. (6)
(c) Discuss two limitations of k-means and briefly explain how the elbow method helps in choosing the value of . (2)
(a) k-Means Algorithm (6)
k-means partitions data into clusters by iteratively minimizing within-cluster variance.
Steps:
- Choose and initialize centroids (randomly or by heuristic).
- Assignment: assign each point to the nearest centroid: .
- Update: recompute each centroid as the mean of points assigned to it: .
- Repeat steps 2–3 until assignments (or centroids) no longer change.
Objective function (distortion / WCSS):
(b) Two Iterations on the 1-D Data (6)
Data: , , , .
Iteration 1 — assign to nearest centroid:
- Closer to 2: (e.g. )
- Closer to 30:
Update centroids:
Iteration 2 — reassign with :
- Closer to 7: (e.g. → cluster 2)
- Closer to 25:
Assignments are unchanged, so centroids stay → converged.
Final result: ; .
(c) Limitations and the Elbow Method (2)
Limitations: (1) must be specified in advance and results depend on initial centroids (can converge to a local optimum); (2) assumes spherical, equally-sized clusters and is sensitive to outliers and scaling.
Elbow method: run k-means for a range of , plot the objective (WCSS) versus . decreases as grows; the "elbow" — the point where the rate of decrease sharply flattens — indicates a good value of , balancing compactness against the number of clusters.
Section B: Short Answer Questions
Attempt all / any as specified.
A binary classifier produces the following confusion matrix on a test set: True Positives = 40, False Positives = 10, False Negatives = 20, True Negatives = 30. Compute the accuracy, precision, recall, and F1-score of the classifier. Briefly comment on what the precision and recall values indicate about its performance.
Given: TP , FP , FN , TN (total ).
Comment: Precision of 80% means that when the classifier predicts positive, it is correct 80% of the time (relatively few false alarms). Recall of 66.7% means it captures only two-thirds of the actual positives, missing one-third (20 false negatives). Thus the model is more precise than complete — it is conservative in raising positives but misses a notable share of true positives.
Differentiate between regression and classification problems with one example each. Also explain why Mean Squared Error (MSE) is an appropriate loss function for regression but not directly suitable for a classification task.
Regression vs Classification
| Regression | Classification | |
|---|---|---|
| Output | Continuous numeric value | Discrete class label |
| Goal | Predict a quantity | Assign a category |
| Example | Predicting house price (Rs.) from area | Predicting whether an email is spam or not |
Why MSE suits regression but not classification:
MSE, , measures squared numeric distance between predicted and true values, which is exactly meaningful when the target is a continuous quantity — it is convex in the parameters of a linear model and penalizes large errors.
For classification the labels are categorical (e.g. 0/1) and outputs are probabilities. Using MSE here is unsuitable because: (1) class labels are not on a metric scale, so squared distance is not a natural error measure; (2) combined with a sigmoid output, the MSE surface is non-convex with flat regions, causing slow learning and poor convergence; and (3) it weakly penalizes confident wrong predictions. The cross-entropy loss is preferred as it is convex (for logistic regression), strongly penalizes confident mistakes, and matches the probabilistic interpretation of the output.
Explain the bias-variance tradeoff with the help of a diagram. Relate the concepts of high bias and high variance to underfitting and overfitting respectively.
Bias-Variance Tradeoff
Bias is the error from overly simple assumptions in the model — it fails to capture the true relationship (systematic error). Variance is the error from excessive sensitivity to the particular training set — small changes in data cause large changes in the model. Expected test error decomposes as:
Diagram (described): Plot model complexity on the x-axis and error on the y-axis.
- The bias² curve starts high and decreases as complexity increases.
- The variance curve starts low and increases with complexity.
- The total test error is U-shaped: it falls, reaches a minimum, then rises. The optimal model sits at the bottom of this U.
Error
| \ (bias^2) / (variance)
| \ /
| \ _______ /
| \____/ total \____/ <- U-shaped total error
|________________________________ Model complexity
underfit optimal overfit
Relation to underfitting / overfitting:
- High bias → underfitting: the model is too simple, giving high error on both training and test data.
- High variance → overfitting: the model is too complex, giving very low training error but high test error because it memorizes noise.
The goal is to balance the two to minimize total generalization error.
Explain the working of the k-Nearest Neighbours (k-NN) algorithm. How does the choice of affect the bias and variance of the model? What is the effect of feature scaling on k-NN?
k-Nearest Neighbours (k-NN)
Working: k-NN is a non-parametric, lazy (instance-based) algorithm. There is no explicit training phase; it stores all training examples. To classify a new point :
- Compute the distance (usually Euclidean) from to every training point.
- Select the closest training points (its nearest neighbours).
- Assign the class by majority vote of those neighbours (for regression, take their average).
Effect of on bias and variance:
- Small (e.g. ): the decision boundary is highly flexible and follows local noise → low bias, high variance (overfitting).
- Large : predictions are smoothed over many neighbours, averaging out detail → high bias, low variance (underfitting if is too large).
Thus controls the bias-variance tradeoff and is chosen (e.g. by cross-validation) to balance them.
Effect of feature scaling: Because k-NN relies on distance, features with larger numeric ranges dominate the distance computation. Without scaling, a large-magnitude feature can overwhelm others, distorting the neighbourhood. Therefore features should be normalized/standardized (e.g. min-max or z-score) so each contributes proportionately.
What is k-fold cross-validation? Explain the procedure with a diagram for and state why it gives a more reliable estimate of model performance than a single train-test split.
k-Fold Cross-Validation
Definition: k-fold cross-validation is a resampling method that partitions the dataset into equal (or nearly equal) disjoint folds. The model is trained and validated times; each time a different fold serves as the validation set and the remaining folds form the training set. The validation scores are averaged to estimate performance.
Procedure for :
- Shuffle and split the data into 5 folds: .
- Iteration 1: train on , validate on .
- Iteration 2: train on , validate on . … and so on for all 5 folds.
- Report the mean (and standard deviation) of the 5 validation scores.
Diagram (described): Five rows, one per iteration; each row shows 5 blocks where exactly one block is marked Validation and the other four Train, with the validation block shifting one position to the right each row.
Fold: 1 2 3 4 5
Iter1: [Val][Trn][Trn][Trn][Trn]
Iter2: [Trn][Val][Trn][Trn][Trn]
Iter3: [Trn][Trn][Val][Trn][Trn]
Iter4: [Trn][Trn][Trn][Val][Trn]
Iter5: [Trn][Trn][Trn][Trn][Val]
Why more reliable than a single train-test split: Every data point is used for both training and validation exactly once, so the estimate does not depend on one lucky/unlucky split. Averaging over runs reduces the variance of the performance estimate and uses the data more efficiently, giving a more robust and less biased measure of generalization.
Explain logistic regression for binary classification. Write the sigmoid (logistic) function, state how the decision boundary is obtained, and explain why the cross-entropy loss is preferred over squared error for training it.
Logistic Regression (Binary Classification)
Logistic regression models the probability that an input belongs to the positive class. A linear combination is passed through the sigmoid (logistic) function to map it to :
Decision boundary: Predict class 1 if , else class 0. Since exactly when , the decision boundary is the set where
which is a linear hyperplane in feature space.
Why cross-entropy is preferred over squared error: The cross-entropy (log) loss for one example is
Combined with the sigmoid, this loss is convex in , so gradient descent reaches the global minimum reliably. In contrast, squared error with a sigmoid output is non-convex with flat saturated regions where the gradient vanishes, causing slow learning. Cross-entropy also penalizes confident wrong predictions heavily ( of a small probability is large), producing strong, well-scaled gradients and well-calibrated probabilities.
Compare hierarchical clustering with partitional (k-means) clustering. Briefly explain agglomerative clustering and the role of a dendrogram in deciding the number of clusters.
Hierarchical vs Partitional (k-means) Clustering
| Aspect | Hierarchical | Partitional (k-means) |
|---|---|---|
| Output | A nested tree of clusters (dendrogram) | A single flat partition into clusters |
| Need in advance? | No (cut tree afterwards) | Yes |
| Approach | Merge/split based on inter-cluster distance | Iterative centroid assignment |
| Cost | Higher, typically or more | Lower, roughly |
| Reversibility | Greedy merges are not undone | Re-assigns points each iteration |
Agglomerative (bottom-up) clustering: Start with each data point as its own cluster. Repeatedly merge the two closest clusters (closeness measured by a linkage criterion — single, complete, average, or Ward linkage) until all points form one cluster. This sequence of merges builds the hierarchy.
Role of the dendrogram: A dendrogram is a tree diagram whose vertical axis shows the distance at which clusters merge. To choose the number of clusters, draw a horizontal line that cuts the dendrogram; the number of vertical lines it crosses equals the number of clusters. Cutting at a level where the merge distances jump sharply (a long vertical gap) yields well-separated, natural clusters.
Write short notes on any two of the following:
(a) Gradient descent and the role of the learning rate (b) Curse of dimensionality (c) Support Vector Machines and the concept of margin (d) Feature engineering
(Answer any two — model answers for all four are given.)
(a) Gradient descent and the learning rate. Gradient descent is an iterative optimization method that minimizes a cost by updating parameters opposite to the gradient: . The learning rate sets the step size: too small → very slow convergence; too large → overshooting, oscillation, or divergence. A well-chosen (often decaying) gives stable, fast convergence to a minimum.
(b) Curse of dimensionality. As the number of features (dimensions) grows, the volume of the feature space grows exponentially, so data become sparse. Distances between points become similar, weakening distance-based methods (k-NN, clustering), and far more data is needed to generalize. Remedies include feature selection and dimensionality reduction (e.g. PCA).
(c) Support Vector Machines and margin. An SVM finds the hyperplane that separates two classes with the maximum margin — the largest distance between the boundary and the nearest training points (the support vectors). Maximizing the margin improves generalization. With the kernel trick, SVMs handle non-linearly separable data by implicitly mapping inputs to a higher-dimensional space.
(d) Feature engineering. The process of creating, transforming, and selecting input features to improve model performance — e.g. scaling/normalization, encoding categorical variables, handling missing values, creating interaction or polynomial terms, and extracting domain-specific features. Good features often matter more than the choice of algorithm.
Frequently asked questions
- Where can I find the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) question paper 2079?
- The full BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2079 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Machine Learning (PU, CMP 364) 2079 paper come with solutions?
- Yes. Every question on this Machine Learning (PU, CMP 364) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2079 paper?
- The BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2079 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Machine Learning (PU, CMP 364) past paper free?
- Yes — reading and attempting this Machine Learning (PU, CMP 364) past paper on Kekkei is completely free.