BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) Question Paper 2078 Nepal
This is the official BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) question paper for 2078, as set in the regular annual examination. It carries 100 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Machine Learning (PU, CMP 364) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) exam or solving previous years' question papers, this 2078 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt all / any as specified.
(a) Define machine learning and distinguish clearly between supervised and unsupervised learning, giving one real-world application of each. (5 marks)
(b) Consider a simple linear regression model . Derive the least squares cost function and obtain the gradient descent update rules for and . (7 marks)
(c) Explain the effect of choosing a learning rate that is too large versus too small on the convergence of gradient descent. (3 marks)
(a) Machine Learning; Supervised vs Unsupervised (5 marks)
Machine learning is the field of study that gives computers the ability to learn patterns from data and improve their performance on a task without being explicitly programmed with hand-coded rules. Formally (Mitchell): a program learns from experience with respect to task and performance measure if its performance at , measured by , improves with .
| Aspect | Supervised learning | Unsupervised learning |
|---|---|---|
| Data | Labelled examples | Unlabelled examples only |
| Goal | Learn a mapping to predict outputs | Discover hidden structure / grouping |
| Typical tasks | Classification, regression | Clustering, dimensionality reduction |
| Application | Spam email detection (predict spam/not-spam) | Customer segmentation (group similar customers) |
(b) Least-squares cost and gradient-descent rules (7 marks)
Hypothesis: . For training examples, the least-squares cost function measures mean squared error:
The factor simplifies the derivative. Taking partial derivatives:
Gradient descent moves opposite the gradient with learning rate , updating simultaneously:
(c) Effect of the learning rate (3 marks)
- Too large : the steps overshoot the minimum; the cost may oscillate or diverge (increase) instead of decreasing — gradient descent fails to converge.
- Too small : updates are tiny, so convergence is correct but very slow, needing many iterations.
A good rate is small enough to converge but large enough to be efficient; one often tries values like
(a) A decision tree learning algorithm uses an attribute-selection measure to decide which feature to split on. Define entropy and information gain, and explain how the ID3 algorithm uses information gain to construct a tree. (6 marks)
(b) Given the following training dataset, compute the information gain of the attribute Weather with respect to the target Play and state which attribute you would select as the root. (6 marks)
| Weather | Temperature | Play |
|---|---|---|
| Sunny | Hot | No |
| Sunny | Mild | No |
| Overcast | Hot | Yes |
| Rainy | Mild | Yes |
| Rainy | Cool | Yes |
| Overcast | Cool | Yes |
(c) Decision trees are prone to overfitting. Explain two pruning strategies used to control tree complexity. (3 marks)
(a) Entropy, Information Gain and ID3 (6 marks)
Entropy measures the impurity (uncertainty) of a set with respect to the class label. For a binary target with positive proportion and negative :
It is when the set is pure and maximal ( bit) when classes are equally mixed.
Information gain of attribute is the expected reduction in entropy after splitting on :
ID3 builds the tree top-down and greedily: at each node it computes the information gain of every remaining attribute, selects the attribute with the highest gain as the split, partitions the data on its values, and recurses on each child until the subsets are pure (or attributes run out), forming leaves.
(b) Information gain of Weather (6 marks)
Target Play: 4 Yes, 2 No out of 6. Root entropy:
Split on Weather:
- Sunny (2 rows: No, No) → pure,
- Overcast (2 rows: Yes, Yes) → pure,
- Rainy (2 rows: Yes, Yes) → pure,
Weighted child entropy .
Since Weather perfectly separates the classes (each value maps to a single label), its gain equals the full root entropy — the maximum possible. Weather is selected as the root attribute.
(c) Two pruning strategies (3 marks)
- Pre-pruning (early stopping): stop growing a branch before it perfectly fits — e.g. stop if a node has too few samples, the information gain is below a threshold, or maximum depth is reached.
- Post-pruning: grow the full tree, then remove or collapse subtrees that do not improve performance on a validation set (e.g. reduced-error pruning or cost-complexity pruning). This usually generalises better than pre-pruning.
(a) Draw the architecture of a multilayer feedforward neural network with one hidden layer and explain the role of weights, bias, and activation functions. (5 marks)
(b) Derive the backpropagation weight-update equation for a single output-layer weight using the sigmoid activation function and the squared-error loss. (7 marks)
(c) Why are non-linear activation functions essential in a neural network? What happens if all activations are linear? (3 marks)
(a) Multilayer feedforward network architecture (5 marks)
Described in words (a standard 3-layer diagram): an input layer of nodes fully connected to a hidden layer of nodes, which is fully connected to an output layer. Signals flow strictly forward (input → hidden → output) with no cycles.
x1 ──┐ ┌── h1 ──┐
x2 ──┼─weights─┼── h2 ──┼─weights── y (output)
xn ──┘ └── h3 ──┘
(+bias) (+bias)
- Weights : learnable parameters scaling each connection; they store what the network has learned.
- Bias : a learnable constant added to each neuron's weighted sum, shifting the activation threshold so the unit need not pass through the origin.
- Activation function : applies a non-linear transform , enabling the network to model non-linear relationships.
(b) Backpropagation update for an output-layer weight (7 marks)
Let output neuron net input , output , target , and squared-error loss . We need via the chain rule:
Each factor:
Therefore:
Defining the output-layer error term , the gradient-descent update is:
(c) Why non-linear activations are essential (3 marks)
Non-linearities let the network approximate complex, non-linear decision boundaries (universal approximation). If every activation were linear, the whole network would be a composition of linear maps, which is itself a single linear map — so a deep network would collapse to an equivalent single-layer linear model, unable to learn anything beyond linear relationships, no matter how many layers it has.
(a) Describe the k-means clustering algorithm step by step, clearly stating its objective (distortion) function and its termination condition. (6 marks)
(b) Given the one-dimensional data points {2, 4, 10, 12, 3, 20, 30, 11, 25} and with initial centroids and , perform one full iteration of k-means and report the updated centroids. (6 marks)
(c) Explain how the elbow method helps in choosing an appropriate value of . (3 marks)
(a) k-means algorithm (6 marks)
k-means partitions points into clusters by minimising within-cluster squared distance.
- Initialise centroids (e.g. random points).
- Assignment step: assign each point to the nearest centroid: .
- Update step: recompute each centroid as the mean of the points assigned to it.
- Repeat steps 2–3.
Objective (distortion) function to minimise:
Termination condition: stop when assignments no longer change (centroids stable) or falls below a threshold / a maximum number of iterations is reached.
(b) One full iteration (6 marks)
Data: , , .
Assignment (point goes to nearer centroid; midpoint ):
- :
- :
Update centroids (means):
Updated centroids after one iteration: , .
(c) Elbow method (3 marks)
Run k-means for a range of values and plot the total distortion (within-cluster sum of squares) against . always decreases as grows, but the rate of decrease slows once the natural number of clusters is reached. The plot shows a sharp bend — the "elbow" — and the at that bend is chosen as a good trade-off between low distortion and model simplicity.
Section B: Short Answer Questions
Attempt all / any as specified.
Explain the Naive Bayes classifier. State the conditional-independence assumption it makes and discuss one situation where this assumption may fail yet the classifier still performs well.
Naive Bayes classifier. A probabilistic classifier based on Bayes' theorem. Given features , it predicts the class that maximises the posterior:
The denominator is ignored since it is constant across classes. Priors and likelihoods are estimated from training-data frequencies (often with Laplace smoothing).
Conditional-independence assumption. It assumes all features are mutually independent given the class — i.e. . This "naive" assumption is what lets the joint likelihood factor into a simple product, making training and prediction fast and needing little data.
When it fails yet still works. In text/spam classification words are clearly not independent (e.g. "New" and "York" co-occur). Despite this violated assumption, Naive Bayes usually classifies well because the decision only depends on which class gets the highest score, not on accurate probability values; even when individual probability estimates are biased, the ranking of classes is often still correct.
Differentiate between overfitting and underfitting in terms of the bias–variance trade-off. Explain how L1 (Lasso) and L2 (Ridge) regularization help reduce overfitting, and state how their effects on model parameters differ.
Overfitting vs underfitting (bias–variance).
- Underfitting (high bias, low variance): the model is too simple to capture the underlying pattern; it performs poorly on both training and test data.
- Overfitting (low bias, high variance): the model is too complex and fits noise in the training data; it does well on training but poorly on test data.
The bias–variance trade-off is the goal of choosing complexity that minimises total error .
Regularization adds a penalty on parameter size to the loss, discouraging overly complex models:
- L2 (Ridge): penalty . Shrinks weights smoothly toward (but not exactly to) zero, keeping all features with smaller coefficients.
- L1 (Lasso): penalty . Can drive some coefficients exactly to zero, performing automatic feature selection and yielding a sparse model.
Difference in effect: Ridge distributes shrinkage across all weights (good when many features are useful); Lasso produces sparsity by zeroing irrelevant features (good for selecting a few important ones). Both reduce variance/overfitting at the cost of a small increase in bias, controlled by .
For a binary classifier evaluated on a test set, the confusion matrix is given below.
| Predicted Positive | Predicted Negative | |
|---|---|---|
| Actual Positive | 40 | 10 |
| Actual Negative | 5 | 45 |
Compute the accuracy, precision, recall, and F1-score. Explain why accuracy alone can be misleading on an imbalanced dataset.
From the confusion matrix: , , , (total ).
Accuracy
Precision
Recall
F1-score
Why accuracy can mislead on imbalanced data. Accuracy counts all correct predictions equally. If one class dominates (e.g. 95% negatives), a trivial model that always predicts the majority class scores 95% accuracy while completely failing to detect the rare positive class. Precision, recall and F1 focus on the minority/positive class and expose this failure, so they are preferred for imbalanced problems.
Explain the k-Nearest Neighbours (k-NN) algorithm. Discuss how the choice of affects the bias and variance of the model, and why k-NN is called a lazy learner.
k-Nearest Neighbours (k-NN). A non-parametric, instance-based algorithm. To classify a query point , it (1) computes the distance (usually Euclidean) from to every training point, (2) selects the nearest neighbours, and (3) assigns the majority class among them (for regression, the average of their values).
Effect of on bias and variance:
- Small (e.g. ): very flexible, follows individual points → low bias, high variance; sensitive to noise/outliers and prone to overfitting.
- Large : averages over many neighbours, giving a smoother boundary → high bias, low variance; can underfit and blur class boundaries.
So controls the bias–variance trade-off; it is tuned (often via cross-validation), and an odd avoids ties in binary problems.
Why a "lazy learner": k-NN does no real training phase — it simply stores the dataset. All computation (distance calculation and voting) is deferred to query time. Because it postpones generalisation until a prediction is requested, it is called a lazy (instance-based) learner, in contrast to eager learners like decision trees that build a model up front.
Explain logistic regression for binary classification. (a) Write the sigmoid hypothesis function and explain how a decision boundary is obtained. (b) Why is the squared-error cost function not used for logistic regression?
Logistic regression is a linear model for binary classification that outputs the probability that an example belongs to class 1.
(a) Sigmoid hypothesis and decision boundary
It passes a linear combination through the sigmoid (logistic) function:
is interpreted as . We predict class 1 if , else class 0. Since , the decision boundary is the surface , which is linear in the features (a line/hyperplane separating the two classes).
(b) Why not squared-error cost
If we plug the sigmoid into the squared-error cost , the resulting is non-convex (many local minima), so gradient descent can get stuck and may not reach the global optimum. Instead, logistic regression uses the convex log-loss (cross-entropy) cost:
which is convex, derived from maximum likelihood, and penalises confident wrong predictions heavily — guaranteeing reliable convergence.
Describe k-fold cross-validation and explain how it gives a more reliable estimate of model performance than a single train/test split. What is the role of a separate validation set versus the test set?
k-fold cross-validation. The dataset is randomly split into equal-sized folds. The model is trained times; in each round one fold is held out as the validation set and the other folds are used for training. The performance scores are then averaged to estimate model performance (common choice or ).
Fold 1: [TEST ][train][train][train][train]
Fold 2: [train][TEST ][train][train][train]
... -> average the k scores
Why more reliable than a single split. A single train/test split gives a performance estimate that depends heavily on which points happened to land in the test set, so it has high variance. k-fold uses every example for both training and validation (each point is tested exactly once), so the averaged score is more stable, less biased by an unlucky split, and uses the data more efficiently — especially valuable for small datasets.
Validation set vs test set.
- Validation set: used during development to tune hyperparameters and select models. The model is repeatedly evaluated on it, so it indirectly influences the model.
- Test set: held out completely until the very end and used once to report an unbiased estimate of final generalisation performance. Keeping it untouched prevents the optimistic bias that arises if the data used for tuning is also used to report the final score.
Compare partitional (k-means) clustering with hierarchical (agglomerative) clustering. Explain what a dendrogram represents and how it is used to decide the number of clusters.
Partitional (k-means) vs hierarchical (agglomerative) clustering.
| Aspect | k-means (partitional) | Agglomerative (hierarchical) |
|---|---|---|
| Approach | Divides data into flat, non-overlapping clusters at once | Bottom-up: starts with each point as its own cluster, repeatedly merges the two closest clusters |
| Need in advance? | Yes, must be specified | No; cut the tree afterwards |
| Output | A single flat partition | A full hierarchy of nested clusters |
| Complexity | Efficient, ~ | Costlier, ~ or |
| Result | Depends on initial centroids; can change between runs | Deterministic for a given linkage |
Dendrogram. A tree diagram produced by agglomerative clustering. The leaves are individual data points; each internal node represents a merge, and the height of a merge equals the distance (dissimilarity) at which the two clusters were joined. Short horizontal links mean very similar items merged early; tall links mean dissimilar groups merged late.
Choosing the number of clusters. Draw a horizontal cut across the dendrogram; the number of vertical lines it crosses gives the number of clusters. The cut is placed at a height that crosses the longest vertical gap (the largest jump in merge distance), since cutting there separates the most well-distinguished groups.
Write short notes on any two of the following: (a) Support Vector Machine and the concept of margin; (b) The vanishing gradient problem in deep neural networks; (c) Feature scaling and its importance in gradient-based learning.
Answering any two of the three.
(a) Support Vector Machine and margin
An SVM is a supervised classifier that finds the optimal separating hyperplane between two classes. Among all hyperplanes that separate the data, it chooses the one with the maximum margin — the largest distance to the nearest training points of either class. Those closest points are the support vectors, and they alone define the boundary. A larger margin gives better generalisation. The margin width is , so SVM minimises subject to correct classification; soft margins and the kernel trick extend it to noisy and non-linearly-separable data.
(b) Vanishing gradient problem
In deep networks trained by backpropagation, gradients are products of many derivatives chained across layers. With saturating activations like sigmoid/tanh, each derivative is (sigmoid's max derivative is ), so as gradients propagate back through many layers they shrink exponentially toward zero. The early layers then receive almost no gradient and learn extremely slowly or not at all, making deep networks hard to train. Remedies: ReLU activations, careful weight initialisation (Xavier/He), batch normalisation, and residual (skip) connections.
(c) Feature scaling and its importance
Feature scaling transforms features to a comparable range, e.g. normalisation to , , or standardisation (zero mean, unit variance). It matters in gradient-based learning because if features have very different scales, the cost surface is elongated, so gradient descent zig-zags and converges slowly; scaling makes the contours more circular, speeding and stabilising convergence. It is also essential for distance-based methods (k-NN, k-means, SVM) so no single large-range feature dominates.
Frequently asked questions
- Where can I find the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) question paper 2078?
- The full BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2078 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Machine Learning (PU, CMP 364) 2078 paper come with solutions?
- Yes. Every question on this Machine Learning (PU, CMP 364) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2078 paper?
- The BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2078 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Machine Learning (PU, CMP 364) past paper free?
- Yes — reading and attempting this Machine Learning (PU, CMP 364) past paper on Kekkei is completely free.