BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) Question Paper 2078 Nepal

Q: Where can I find the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) question paper 2078?

The full BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2078 question paper (Regular, annual session) is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.

Q: Does the Machine Learning (PU, CMP 364) 2078 paper come with solutions?

Yes. Every question on this Machine Learning (PU, CMP 364) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.

Q: How many marks is the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2078 paper?

The BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2078 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.

Q: Is practising this Machine Learning (PU, CMP 364) past paper free?

Yes — reading and attempting this Machine Learning (PU, CMP 364) past paper on Kekkei is completely free.

Question

1. Long answer (15 marks)

(a) Define machine learning and distinguish clearly between supervised and unsupervised learning, giving one real-world application of each. (5 marks)

(b) Consider a simple linear regression model $h_\theta(x) = \theta_0 + \theta_1 x$ . Derive the least squares cost function $J(\theta_0, \theta_1)$ and obtain the gradient descent update rules for $\theta_0$ and $\theta_1$ . (7 marks)

(c) Explain the effect of choosing a learning rate that is too large versus too small on the convergence of gradient descent. (3 marks)

Topics: supervised learning, regression

Answer 1

(a) Machine Learning; Supervised vs Unsupervised (5 marks)

Machine learning is the field of study that gives computers the ability to learn patterns from data and improve their performance on a task without being explicitly programmed with hand-coded rules. Formally (Mitchell): a program learns from experience $E$ with respect to task $T$ and performance measure $P$ if its performance at $T$ , measured by $P$ , improves with $E$ .

Aspect	Supervised learning	Unsupervised learning
Data	Labelled examples $(x^{(i)}, y^{(i)})$	Unlabelled examples $x^{(i)}$ only
Goal	Learn a mapping $x \to y$ to predict outputs	Discover hidden structure / grouping
Typical tasks	Classification, regression	Clustering, dimensionality reduction
Application	Spam email detection (predict spam/not-spam)	Customer segmentation (group similar customers)

(b) Least-squares cost and gradient-descent rules (7 marks)

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$ . For $m$ training examples, the least-squares cost function measures mean squared error:

J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2

The factor $\tfrac{1}{2}$ simplifies the derivative. Taking partial derivatives:

\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)

\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)\,x^{(i)}

Gradient descent moves opposite the gradient with learning rate $\alpha$ , updating simultaneously:

\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)

\theta_1 := \theta_1 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)\,x^{(i)}
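
The two update rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the data set, the learning rate `alpha=0.05` and the iteration count are assumptions chosen so the example converges.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Residuals h_theta(x_i) - y_i, computed with the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))  # close to 1 and 2
```

Computing both gradients before assigning either parameter is what "updating simultaneously" means in practice.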

(c) Effect of the learning rate (3 marks)

Too large $\alpha$ : the steps overshoot the minimum; the cost may oscillate or diverge (increase) instead of decreasing — gradient descent fails to converge.
Too small $\alpha$ : updates are tiny, so convergence is correct but very slow, needing many iterations.

A good rate is small enough to converge but large enough to be efficient; one often tries values like $0.001, 0.01, 0.1, \dots$
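
The three regimes can be seen on a toy problem. The sketch below assumes a hypothetical one-parameter cost $J(\theta)=\theta^2$ (gradient $2\theta$, minimum at $0$); the specific rates are illustrative only.

```python
def run_gradient_descent(alpha, steps=50, theta=1.0):
    """Minimise the toy cost J(theta) = theta**2 starting from theta = 1."""
    for _ in range(steps):
        theta -= alpha * 2 * theta  # gradient of theta**2 is 2 * theta
    return theta


too_large = run_gradient_descent(alpha=1.1)    # each step multiplies theta by -1.2
too_small = run_gradient_descent(alpha=0.001)  # each step multiplies theta by 0.998
reasonable = run_gradient_descent(alpha=0.1)   # each step multiplies theta by 0.8

print(abs(too_large), too_small, reasonable)
```

After 50 steps the large rate has moved far away from the minimum, the small rate has barely moved, and the moderate rate is close to zero.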

Answer 2

(a) Entropy, Information Gain and ID3 (6 marks)

Entropy measures the impurity (uncertainty) of a set $S$ with respect to the class label. For a binary target with positive proportion $p_+$ and negative $p_-$ :

H(S) = -p_+\log_2 p_+ - p_-\log_2 p_-

It is $0$ when the set is pure and maximal ( $1$ bit) when classes are equally mixed.

Information gain of attribute $A$ is the expected reduction in entropy after splitting on $A$ :

IG(S,A) = H(S) - \sum_{v\in \text{values}(A)} \frac{|S_v|}{|S|}\,H(S_v)

ID3 builds the tree top-down and greedily: at each node it computes the information gain of every remaining attribute, selects the attribute with the highest gain as the split, partitions the data on its values, and recurses on each child until the subsets are pure (or attributes run out), forming leaves.

(b) Information gain of Weather (6 marks)

Target Play: 4 Yes, 2 No out of 6. Root entropy:

H(S) = -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} = 0.3900 + 0.5283 \approx 0.918

Split on Weather:

Sunny (2 rows: No, No) → pure, $H = 0$
Overcast (2 rows: Yes, Yes) → pure, $H = 0$
Rainy (2 rows: Yes, Yes) → pure, $H = 0$

Weighted child entropy $= \tfrac{2}{6}(0)+\tfrac{2}{6}(0)+\tfrac{2}{6}(0) = 0$ .

IG(S,\text{Weather}) = 0.918 - 0 = \mathbf{0.918\ \text{bits}}

Since Weather perfectly separates the classes (each value maps to a single label), its gain equals the full root entropy — the maximum possible. Weather is selected as the root attribute.
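
The same calculation can be checked with a short script. The labels per Weather value are taken from the split listed above; the row order is an assumption, since only the per-value counts matter.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Root entropy minus the size-weighted entropy of each child subset."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


weather = ["Sunny", "Sunny", "Overcast", "Overcast", "Rainy", "Rainy"]
play = ["No", "No", "Yes", "Yes", "Yes", "Yes"]

print(round(entropy(play), 3))                    # root entropy
print(round(information_gain(weather, play), 3))  # gain of Weather
```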

(c) Two pruning strategies (3 marks)

Pre-pruning (early stopping): stop growing a branch before it perfectly fits — e.g. stop if a node has too few samples, the information gain is below a threshold, or maximum depth is reached.
Post-pruning: grow the full tree, then remove or collapse subtrees that do not improve performance on a validation set (e.g. reduced-error pruning or cost-complexity pruning). This often generalises better than pre-pruning, because pruning decisions are made after seeing the whole tree, at the cost of extra computation.

Answer 3

(a) Multilayer feedforward network architecture (5 marks)

Described in words (a standard 3-layer diagram): an input layer of nodes $x_1,\dots,x_n$ fully connected to a hidden layer of nodes, which is fully connected to an output layer. Signals flow strictly forward (input → hidden → output) with no cycles.

 x1 ──┐         ┌── h1 ──┐
 x2 ──┼─weights─┼── h2 ──┼─weights── y (output)
 xn ──┘         └── h3 ──┘
          (+bias)   (+bias)

Weights $w_{ij}$ : learnable parameters scaling each connection; they store what the network has learned.
Bias $b$ : a learnable constant added to each neuron's weighted sum, shifting the activation threshold so the unit need not pass through the origin.
Activation function $g(\cdot)$ : applies a non-linear transform $a = g\!\left(\sum_i w_i x_i + b\right)$ , enabling the network to model non-linear relationships.

(b) Backpropagation update for an output-layer weight (7 marks)

Let output neuron net input $z = \sum_j w_j a_j$ , output $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$ , target $y$ , and squared-error loss $E = \tfrac{1}{2}(y-\hat{y})^2$ . We need $\partial E/\partial w_j$ via the chain rule:

\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial z}\cdot\frac{\partial z}{\partial w_j}

Each factor:

\frac{\partial E}{\partial \hat{y}} = -(y-\hat{y}),\qquad \frac{\partial \hat{y}}{\partial z} = \sigma'(z) = \hat{y}(1-\hat{y}),\qquad \frac{\partial z}{\partial w_j} = a_j

Therefore:

\frac{\partial E}{\partial w_j} = -(y-\hat{y})\,\hat{y}(1-\hat{y})\,a_j

Defining the output-layer error term $\delta = (y-\hat{y})\,\hat{y}(1-\hat{y})$ , the gradient-descent update is:

w_j := w_j + \alpha\,\delta\,a_j = w_j + \alpha\,(y-\hat{y})\,\hat{y}(1-\hat{y})\,a_j
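
One way to sanity-check the derived gradient is to compare it with a finite-difference estimate. The sketch below does this for a single sigmoid output neuron; the weights, activations, target and learning rate are hypothetical values chosen for illustration.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def loss(weights, activations, y):
    """Squared-error loss E = 0.5 * (y - y_hat)**2 for one example."""
    z = sum(w * a for w, a in zip(weights, activations))
    return 0.5 * (y - sigmoid(z)) ** 2


def analytic_gradient(weights, activations, y):
    """dE/dw_j = -(y - y_hat) * y_hat * (1 - y_hat) * a_j, as derived above."""
    z = sum(w * a for w, a in zip(weights, activations))
    y_hat = sigmoid(z)
    delta = (y - y_hat) * y_hat * (1.0 - y_hat)
    return [-delta * a for a in activations]


def numeric_gradient(weights, activations, y, eps=1e-6):
    """Central finite-difference estimate of dE/dw_j."""
    grads = []
    for j in range(len(weights)):
        up, down = list(weights), list(weights)
        up[j] += eps
        down[j] -= eps
        grads.append((loss(up, activations, y) - loss(down, activations, y)) / (2 * eps))
    return grads


weights = [0.4, -0.3]      # hypothetical output-layer weights
activations = [0.6, 0.9]   # hypothetical hidden-layer outputs a_j
y, alpha = 1.0, 0.5

grad = analytic_gradient(weights, activations, y)
new_weights = [w - alpha * g for w, g in zip(weights, grad)]  # same as w + alpha * delta * a_j
print(grad, numeric_gradient(weights, activations, y))
print(loss(weights, activations, y), loss(new_weights, activations, y))
```

The two gradients should agree to several decimal places, and the loss should be lower after the update.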

(c) Why non-linear activations are essential (3 marks)

Non-linearities let the network approximate complex, non-linear decision boundaries (universal approximation). If every activation were linear, the whole network would be a composition of linear maps, which is itself a single linear map $Wx+b$ — so a deep network would collapse to an equivalent single-layer linear model, unable to learn anything beyond linear relationships, no matter how many layers it has.
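
The collapse argument can be illustrated with scalars. In this sketch the two "layers" are hypothetical one-input, one-output linear units with made-up weights; stacking them gives exactly the same function as one linear unit.

```python
def linear_layer(w, b, x):
    """A layer with identity (linear) activation."""
    return w * x + b


w1, b1 = 2.0, 1.0    # hypothetical first-layer parameters
w2, b2 = -3.0, 0.5   # hypothetical second-layer parameters


def two_layer_network(x):
    return linear_layer(w2, b2, linear_layer(w1, b1, x))


# Equivalent single layer: w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_single, b_single = w2 * w1, w2 * b1 + b2

for x in (-2.0, 0.0, 1.5):
    print(two_layer_network(x), linear_layer(w_single, b_single, x))
```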

Answer 4

(a) k-means algorithm (6 marks)

k-means partitions $n$ points into $k$ clusters by minimising within-cluster squared distance.

1. Initialise $k$ centroids $c_1,\dots,c_k$ (e.g. random points).
2. Assignment step: assign each point $x$ to the nearest centroid: $\text{cluster}(x)=\arg\min_j \lVert x - c_j\rVert^2$ .
3. Update step: recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2–3 until the termination condition is met.

Objective (distortion) function to minimise:

J = \sum_{j=1}^{k}\sum_{x\in C_j}\lVert x - c_j\rVert^2

Termination condition: stop when assignments no longer change (centroids stable), when the decrease in $J$ between iterations falls below a threshold, or when a maximum number of iterations is reached.

(b) One full iteration (6 marks)

Data: $\{2,4,10,12,3,20,30,11,25\}$ , $c_1=4$ , $c_2=12$ .

Assignment (point goes to nearer centroid; midpoint $=8$ ):

$\le 8 \to C_1$ : $\{2, 4, 3\}$
$> 8 \to C_2$ : $\{10, 12, 20, 30, 11, 25\}$

Update centroids (means):

c_1 = \frac{2+4+3}{3} = \frac{9}{3} = \mathbf{3.0}

c_2 = \frac{10+12+20+30+11+25}{6} = \frac{108}{6} = \mathbf{18.0}

Updated centroids after one iteration: $c_1 = 3.0$ , $c_2 = 18.0$ .
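
The hand calculation can be reproduced with a small function that performs one assignment step and one update step on the same one-dimensional data. Keeping an empty cluster's centroid unchanged is an assumption of this sketch, not something the question specifies.

```python
def kmeans_step(points, centroids):
    """One k-means iteration on 1-D data: assign points, then recompute means."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
        clusters[nearest].append(p)
    # An empty cluster keeps its old centroid (assumption for this sketch).
    new_centroids = [
        sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
    ]
    return clusters, new_centroids


points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, centroids = kmeans_step(points, [4, 12])
print(clusters)   # [[2, 4, 3], [10, 12, 20, 30, 11, 25]]
print(centroids)  # [3.0, 18.0]
```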

(c) Elbow method (3 marks)

Run k-means for a range of $k$ values and plot the total distortion $J$ (within-cluster sum of squares) against $k$ . $J$ always decreases as $k$ grows, but the rate of decrease slows once the natural number of clusters is reached. The plot shows a sharp bend — the "elbow" — and the $k$ at that bend is chosen as a good trade-off between low distortion and model simplicity.

Answer 5

Naive Bayes classifier. A probabilistic classifier based on Bayes' theorem. Given features $x_1,\dots,x_n$ , it predicts the class $C$ that maximises the posterior:

\hat{C} = \arg\max_{C} P(C)\prod_{i=1}^{n} P(x_i \mid C)

The denominator $P(x)$ is ignored since it is constant across classes. Priors $P(C)$ and likelihoods $P(x_i\mid C)$ are estimated from training-data frequencies (often with Laplace smoothing).

Conditional-independence assumption. It assumes all features are mutually independent given the class — i.e. $P(x_1,\dots,x_n\mid C)=\prod_i P(x_i\mid C)$ . This "naive" assumption is what lets the joint likelihood factor into a simple product, making training and prediction fast and needing little data.

When it fails yet still works. In text/spam classification words are clearly not independent (e.g. "New" and "York" co-occur). Despite this violated assumption, Naive Bayes usually classifies well because the decision only depends on which class gets the highest score, not on accurate probability values; even when individual probability estimates are biased, the ranking of classes is often still correct.
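
A minimal sketch of a word-count Naive Bayes classifier with Laplace (add-one) smoothing is shown below. The four training messages and the label names `spam` and `ham` are made up for illustration; log-probabilities are summed instead of multiplying probabilities to avoid numerical underflow.

```python
from collections import Counter
from math import log


def train(docs):
    """docs: list of (words, label). Returns class counts, word counts, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return class_counts, word_counts, vocab


def predict(words, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = log(class_counts[label] / total_docs)  # log prior P(C)
        for w in words:
            # Laplace smoothing keeps unseen words from giving probability zero.
            score += log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)  # arg max over classes


# Hypothetical toy training set.
docs = [
    (["win", "money", "now"], "spam"),
    (["free", "money"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["project", "meeting", "now"], "ham"),
]
model = train(docs)
print(predict(["free", "money", "now"], *model))
print(predict(["project", "meeting"], *model))
```

Only the ranking of the class scores is used, which is why biased probability estimates can still give the right label.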

Answer 6

Overfitting vs underfitting (bias–variance).

Underfitting (high bias, low variance): the model is too simple to capture the underlying pattern; it performs poorly on both training and test data.
Overfitting (low bias, high variance): the model is too complex and fits noise in the training data; it does well on training but poorly on test data.

The bias–variance trade-off is the problem of choosing a model complexity that minimises the expected total error $=\text{bias}^2 + \text{variance} + \text{irreducible noise}$ : reducing bias usually raises variance, and vice versa.

Regularization adds a penalty on parameter size to the loss, discouraging overly complex models:

L2 (Ridge): penalty $\lambda\sum_j \theta_j^2$ . Shrinks weights smoothly toward (but not exactly to) zero, keeping all features with smaller coefficients.
L1 (Lasso): penalty $\lambda\sum_j |\theta_j|$ . Can drive some coefficients exactly to zero, performing automatic feature selection and yielding a sparse model.

Difference in effect: Ridge distributes shrinkage across all weights (good when many features are useful); Lasso produces sparsity by zeroing irrelevant features (good for selecting a few important ones). Both reduce variance/overfitting at the cost of a small increase in bias, controlled by $\lambda$ .
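
The difference between shrinking and zeroing is easiest to see in a simplified one-parameter setting. The sketch below assumes a cost of the form $\tfrac{1}{2}(\theta - z)^2 + \text{penalty}$, where $z$ is the unregularised estimate; this setup and the sample numbers are illustrative assumptions, not part of the question.

```python
def ridge_1d(z, lam):
    """Minimiser of 0.5 * (theta - z)**2 + lam * theta**2."""
    return z / (1.0 + 2.0 * lam)


def lasso_1d(z, lam):
    """Minimiser of 0.5 * (theta - z)**2 + lam * abs(theta) (soft-thresholding)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


lam = 0.5
for z in (2.0, 0.3, -0.1):
    print(z, ridge_1d(z, lam), lasso_1d(z, lam))
```

Ridge scales every estimate toward zero but never reaches it, while Lasso sets any estimate with $|z| \le \lambda$ exactly to zero.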

Answer 7

From the confusion matrix: $TP=40$ , $FN=10$ , $FP=5$ , $TN=45$ (total $=100$ ).

Accuracy $= \dfrac{TP+TN}{\text{total}} = \dfrac{40+45}{100} = \dfrac{85}{100} = \mathbf{0.85}$

Precision $= \dfrac{TP}{TP+FP} = \dfrac{40}{40+5} = \dfrac{40}{45} = \mathbf{0.889}$

Recall $= \dfrac{TP}{TP+FN} = \dfrac{40}{40+10} = \dfrac{40}{50} = \mathbf{0.80}$

F1-score $= 2\cdot\dfrac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}} = 2\cdot\dfrac{0.889\times 0.80}{0.889+0.80} = \dfrac{1.422}{1.689} = \mathbf{0.842}$

Why accuracy can mislead on imbalanced data. Accuracy counts all correct predictions equally. If one class dominates (e.g. 95% negatives), a trivial model that always predicts the majority class scores 95% accuracy while completely failing to detect the rare positive class. Precision, recall and F1 focus on the minority/positive class and expose this failure, so they are preferred for imbalanced problems.
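
The four metrics can be computed directly from the confusion-matrix counts. The second call below uses a hypothetical imbalanced case (95 negatives, 5 positives, a model that always predicts negative); returning `0.0` when a ratio is undefined is a convention assumed for this sketch.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


# Counts from the confusion matrix used in this answer.
print([round(v, 3) for v in classification_metrics(tp=40, fn=10, fp=5, tn=45)])
# → [0.85, 0.889, 0.8, 0.842]

# Hypothetical majority-class predictor on a 95/5 imbalanced set.
print([round(v, 3) for v in classification_metrics(tp=0, fn=5, fp=0, tn=95)])
# → [0.95, 0.0, 0.0, 0.0]
```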

Answer 8

k-Nearest Neighbours (k-NN). A non-parametric, instance-based algorithm. To classify a query point $x$ , it (1) computes the distance (usually Euclidean) from $x$ to every training point, (2) selects the $k$ nearest neighbours, and (3) assigns the majority class among them (for regression, the average of their values).

Effect of $k$ on bias and variance:

Small $k$ (e.g. $k=1$ ): very flexible, follows individual points → low bias, high variance; sensitive to noise/outliers and prone to overfitting.
Large $k$ : averages over many neighbours, giving a smoother boundary → high bias, low variance; can underfit and blur class boundaries.

So $k$ controls the bias–variance trade-off; it is tuned (often via cross-validation), and an odd $k$ avoids ties in binary problems.

Why a "lazy learner": k-NN does no real training phase — it simply stores the dataset. All computation (distance calculation and voting) is deferred to query time. Because it postpones generalisation until a prediction is requested, it is called a lazy (instance-based) learner, in contrast to eager learners like decision trees that build a model up front.
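
A minimal k-NN classifier fits in a few lines, which also shows why it is "lazy": there is no training function, only a stored list. The 2-D points below are hypothetical, with one deliberately noisy `B` point placed near the `A` cluster to show the effect of $k$.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(train, query, k):
    """train: list of (point, label). Majority vote among the k nearest points."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training set; the last point is a noisy "B" inside the "A" region.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B"),
    ((2, 2), "B"),
]
query = (2.2, 2.2)
print(knn_predict(train, query, k=1))  # follows the single noisy neighbour
print(knn_predict(train, query, k=3))  # the vote smooths the noise out
```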

Answer 9

Logistic regression is a linear model for binary classification that outputs the probability that an example belongs to class 1.

(a) Sigmoid hypothesis and decision boundary

It passes a linear combination through the sigmoid (logistic) function:

h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}, \qquad 0 < h_\theta(x) < 1

$h_\theta(x)$ is interpreted as $P(y=1\mid x)$ . We predict class 1 if $h_\theta(x) \ge 0.5$ , else class 0. Since $\sigma(z)\ge 0.5 \iff z\ge 0$ , the decision boundary is the surface $\theta^T x = 0$ , which is linear in the features (a line/hyperplane separating the two classes).

(b) Why not squared-error cost

If we plug the sigmoid $h_\theta(x)$ into the squared-error cost $\tfrac{1}{2m}\sum (h_\theta(x)-y)^2$ , the resulting $J(\theta)$ is non-convex in $\theta$ (it can have local minima and flat regions), so gradient descent can get stuck and may not reach the global optimum. Instead, logistic regression uses the convex log-loss (cross-entropy) cost:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big]

which is convex, derived from maximum likelihood, and penalises confident wrong predictions heavily. Because it is convex, any minimum that gradient descent reaches is a global one.
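
A short numerical comparison shows how differently the two costs treat a confident wrong prediction. The per-example helper functions and the probability values are illustrative assumptions.

```python
from math import exp, log


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def log_loss(h, y):
    """Per-example cross-entropy cost for prediction h in (0, 1) and label y."""
    return -(y * log(h) + (1 - y) * log(1 - h))


def squared_error(h, y):
    """Per-example squared-error cost, for comparison."""
    return 0.5 * (h - y) ** 2


# True label is 1; the model becomes increasingly confident in the wrong class.
for h in (0.5, 0.1, 0.01):
    print(h, round(log_loss(h, 1), 3), round(squared_error(h, 1), 3))
```

The squared error is bounded by $0.5$ per example, while the log-loss grows without bound as $h_\theta(x)$ approaches the wrong extreme.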

Answer 10

k-fold cross-validation. The dataset is randomly split into $k$ equal-sized folds. The model is trained $k$ times; in each round one fold is held out as the validation set and the other $k-1$ folds are used for training. The $k$ performance scores are then averaged to estimate model performance (common choice $k=5$ or $10$ ).

Round 1: [ VAL ][train][train][train][train]
Round 2: [train][ VAL ][train][train][train]
...       -> average the k scores

Why more reliable than a single split. A single train/test split gives a performance estimate that depends heavily on which points happened to land in the test set, so it has high variance. k-fold uses every example for both training and validation (each point is tested exactly once), so the averaged score is more stable, less biased by an unlucky split, and uses the data more efficiently — especially valuable for small datasets.
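
The splitting procedure can be sketched as an index generator. The function name `k_fold_indices`, the fixed shuffle seed and the sample sizes are assumptions for illustration; a real workflow would train and score a model inside the loop.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (training_indices, validation_indices) for each of the k rounds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # random split, reproducible via the seed
    # Spread any remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(n=10, k=5))
for round_number, (training, validation) in enumerate(folds, start=1):
    print(round_number, sorted(validation), len(training))
```

Every index lands in exactly one validation fold, which is the property the paragraph above relies on.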

Validation set vs test set.

Validation set: used during development to tune hyperparameters and select models. The model is repeatedly evaluated on it, so it indirectly influences the model.
Test set: held out completely until the very end and used once to report an unbiased estimate of final generalisation performance. Keeping it untouched prevents the optimistic bias that arises if the data used for tuning is also used to report the final score.

Answer 11

Partitional (k-means) vs hierarchical (agglomerative) clustering.

Aspect	k-means (partitional)	Agglomerative (hierarchical)
Approach	Divides data into $k$ flat, non-overlapping clusters at once	Bottom-up: starts with each point as its own cluster, repeatedly merges the two closest clusters
Need $k$ in advance?	Yes, $k$ must be specified	No; cut the tree afterwards
Output	A single flat partition	A full hierarchy of nested clusters
Complexity	Efficient, ~ $O(nkt)$	Costlier, ~ $O(n^2\log n)$ or $O(n^3)$
Result	Depends on initial centroids; can change between runs	Deterministic for a given linkage

Dendrogram. A tree diagram produced by agglomerative clustering. The leaves are individual data points; each internal node represents a merge, and the height of a merge equals the distance (dissimilarity) at which the two clusters were joined. Merges low in the tree (short vertical links) join very similar items early; merges high in the tree (tall vertical links) join dissimilar groups late.

Choosing the number of clusters. Draw a horizontal cut across the dendrogram; the number of vertical lines it crosses gives the number of clusters. The cut is placed at a height that crosses the longest vertical gap (the largest jump in merge distance), since cutting there separates the most well-distinguished groups.

Answer 12

Answering any two of the three.

(a) Support Vector Machine and margin

An SVM is a supervised classifier that finds the optimal separating hyperplane between two classes. Among all hyperplanes that separate the data, it chooses the one with the maximum margin — the largest distance to the nearest training points of either class. Those closest points are the support vectors, and they alone define the boundary. A larger margin gives better generalisation. The margin width is $\tfrac{2}{\lVert w\rVert}$ , so SVM minimises $\tfrac{1}{2}\lVert w\rVert^2$ subject to correct classification; soft margins and the kernel trick extend it to noisy and non-linearly-separable data.

(b) Vanishing gradient problem

In deep networks trained by backpropagation, gradients are products of many derivatives chained across layers. With saturating activations like sigmoid/tanh, each derivative is at most $1$ and usually much smaller (sigmoid's maximum derivative is $0.25$ ; tanh's is $1$ , reached only at $z=0$ ), so as gradients propagate back through many layers they shrink exponentially toward zero. The early layers then receive almost no gradient and learn extremely slowly or not at all, making deep networks hard to train. Remedies: ReLU activations, careful weight initialisation (Xavier/He), batch normalisation, and residual (skip) connections.
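
The exponential shrinkage is easy to quantify. The sketch below multiplies sigmoid derivatives across layers and deliberately ignores the weight factors that also appear in the real chain-rule product, so it is an illustration of the activation term only.

```python
from math import exp


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + exp(-z))
    return s * (1.0 - s)


def backprop_factor(depth, z=0.0):
    """Product of `depth` sigmoid derivatives, each evaluated at pre-activation z."""
    return sigmoid_derivative(z) ** depth


# z = 0 is the best case for sigmoid (derivative 0.25); saturated units are worse.
for depth in (1, 5, 10, 20):
    print(depth, backprop_factor(depth), backprop_factor(depth, z=2.0))
```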

(c) Feature scaling and its importance

Feature scaling transforms features to a comparable range, e.g. normalisation to $[0,1]$ , $x' = \frac{x-x_{\min}}{x_{\max}-x_{\min}}$ , or standardisation $x' = \frac{x-\mu}{\sigma}$ (zero mean, unit variance). It matters in gradient-based learning because if features have very different scales, the cost surface is elongated, so gradient descent zig-zags and converges slowly; scaling makes the contours more circular, speeding and stabilising convergence. It is also essential for distance-based methods (k-NN, k-means, SVM) so no single large-range feature dominates.
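
Both formulas are one-liners in Python. The feature values below are hypothetical, $\sigma$ is taken as the population standard deviation, and the sketch assumes the feature is not constant (otherwise the denominators are zero).

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalise to [0, 1]: x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Standardise to zero mean, unit variance: x' = (x - mu) / sigma."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


incomes = [20000.0, 35000.0, 50000.0, 80000.0]  # hypothetical large-range feature
normalised = min_max_scale(incomes)
standardised = standardise(incomes)
print(normalised)
print([round(v, 3) for v in standardised])
```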

Level	BE Computer Engineering (Pokhara University)
Subject	Machine Learning (PU, CMP 364)
Year	2078 BS
Exam session	Regular (annual)
Full marks	100
Time allowed	180 minutes
Questions	12, all with step-by-step solutions

Confusion matrix (Question 7)	Predicted Positive	Predicted Negative
Actual Positive	40	10
Actual Negative	5	45