BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) Question Paper 2079 Nepal

Q: Where can I find the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) question paper 2079?

The full BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2079 (Regular, annual) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.

Q: Does the Machine Learning (PU, CMP 364) 2079 paper come with solutions?

Yes. Every question on this Machine Learning (PU, CMP 364) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.

Q: How many marks is the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2079 paper?

The BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2079 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.

Q: Is practising this Machine Learning (PU, CMP 364) past paper free?

Yes — reading and attempting this Machine Learning (PU, CMP 364) past paper on Kekkei is completely free.

Question 1 (Long answer, 14 marks)

(a) Define supervised learning and clearly distinguish it from unsupervised learning with one suitable example of each. (4)

(b) Consider a simple linear regression model $h_\theta(x) = \theta_0 + \theta_1 x$ . Derive the closed-form (normal equation) solution for the parameters $\theta_0$ and $\theta_1$ by minimizing the mean squared error cost function $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$ . (7)

(c) The following data is given for hours studied ( $x$ ) and marks obtained ( $y$ ): $(1,2),(2,4),(3,5),(4,4),(5,6)$ . Fit the best-fit line using the result derived in part (b) and predict the marks for a student who studies for 6 hours. (3)


Answer 1

(a) Supervised vs Unsupervised Learning (4)

Supervised learning trains a model on a labelled dataset $\{(x^{(i)}, y^{(i)})\}$ , where each input $x^{(i)}$ has a known target output $y^{(i)}$ . The goal is to learn a mapping $h_\theta(x) \approx y$ that generalizes to unseen inputs.

Unsupervised learning works on unlabelled data $\{x^{(i)}\}$ and discovers hidden structure (groupings, density, low-dimensional representation) without any target output.

Aspect	Supervised	Unsupervised
Data	Labelled $(x, y)$	Unlabelled $(x)$
Goal	Predict $y$	Find structure
Example	Email spam classification (spam/ham labels)	Customer segmentation via clustering

(b) Normal Equation for Simple Linear Regression (7)

For $h_\theta(x) = \theta_0 + \theta_1 x$ , minimize

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2.$$

Set the partial derivatives to zero.

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^{(i)} - y^{(i)}) = 0$$

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^{(i)} - y^{(i)})\,x^{(i)} = 0$$

These give the normal equations:

$$m\theta_0 + \theta_1\sum x^{(i)} = \sum y^{(i)}$$

$$\theta_0\sum x^{(i)} + \theta_1\sum (x^{(i)})^2 = \sum x^{(i)} y^{(i)}$$

Solving this $2\times 2$ system (using means $\bar{x},\bar{y}$ ):

$$\boxed{\theta_1 = \frac{\sum (x^{(i)}-\bar{x})(y^{(i)}-\bar{y})}{\sum (x^{(i)}-\bar{x})^2} = \frac{m\sum x y - \sum x \sum y}{m\sum x^2 - (\sum x)^2}}$$

$$\boxed{\theta_0 = \bar{y} - \theta_1 \bar{x}}$$
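
For completeness, the elimination behind the boxed results can be written out as follows. This is a sketch that uses only the two normal equations above together with the means $\bar{x} = \frac{1}{m}\sum x^{(i)}$ and $\bar{y} = \frac{1}{m}\sum y^{(i)}$ (superscripts dropped for brevity):

```latex
% Divide the first normal equation by m:
\theta_0 + \theta_1\bar{x} = \bar{y}
\;\Rightarrow\; \theta_0 = \bar{y} - \theta_1\bar{x}

% Substitute into the second normal equation, using \sum x = m\bar{x}:
(\bar{y} - \theta_1\bar{x})\,m\bar{x} + \theta_1\sum x^2 = \sum xy
\;\Rightarrow\; \theta_1\left(\sum x^2 - m\bar{x}^2\right) = \sum xy - m\bar{x}\bar{y}

% Multiply both sides by m, noting m^2\bar{x}^2 = (\sum x)^2 and m^2\bar{x}\bar{y} = \sum x \sum y:
\theta_1 = \frac{m\sum xy - \sum x\sum y}{m\sum x^2 - (\sum x)^2}
```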

(c) Fitting the Data (3)

Data: $(1,2),(2,4),(3,5),(4,4),(5,6)$ , with $m=5$ .

$\sum x = 15,\ \sum y = 21,\ \sum xy = 1\cdot2+2\cdot4+3\cdot5+4\cdot4+5\cdot6 = 2+8+15+16+30 = 71$
$\sum x^2 = 1+4+9+16+25 = 55$
$\bar{x}=3,\ \bar{y}=4.2$

$$\theta_1 = \frac{5(71) - 15(21)}{5(55) - 15^2} = \frac{355 - 315}{275 - 225} = \frac{40}{50} = 0.8$$

$$\theta_0 = 4.2 - 0.8(3) = 4.2 - 2.4 = 1.8$$

Best-fit line: $\hat{y} = 1.8 + 0.8x$ .

Prediction at $x=6$ : $\hat{y} = 1.8 + 0.8(6) = 1.8 + 4.8 = \mathbf{6.6}$ marks.
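
The arithmetic in part (c) can be checked with a few lines of Python. This is a minimal sketch of the closed-form formulas from part (b); the function name `fit_line` is chosen here for illustration only.

```python
# Closed-form simple linear regression on the data from part (c).
xs = [1, 2, 3, 4, 5]   # hours studied
ys = [2, 4, 5, 4, 6]   # marks obtained

def fit_line(xs, ys):
    """Return (theta0, theta1) using the normal-equation formulas."""
    m = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    theta1 = (m * sum_xy - sum_x * sum_y) / (m * sum_x2 - sum_x ** 2)
    theta0 = sum_y / m - theta1 * sum_x / m   # y_bar - theta1 * x_bar
    return theta0, theta1

theta0, theta1 = fit_line(xs, ys)
prediction = theta0 + theta1 * 6   # predicted marks for 6 hours of study
print(round(theta0, 4), round(theta1, 4), round(prediction, 4))
```

Up to floating-point rounding, the values agree with the hand calculation: intercept 1.8, slope 0.8, prediction 6.6.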

Answer 2

(a) Decision Tree Classifier, Entropy and Information Gain (6)

A decision tree is a tree-structured classifier where each internal node tests an attribute, each branch is an outcome of that test, and each leaf assigns a class label. Starting at the root, a record follows branches according to its attribute values until it reaches a leaf. The tree is built top-down, recursively, choosing at each node the attribute that best separates the classes.

Entropy measures the impurity of a set $S$ with class proportions $p_i$ :

$$\text{Entropy}(S) = -\sum_{i} p_i \log_2 p_i.$$

It is $0$ for a pure node and maximal when classes are equally mixed.

Information Gain of attribute $A$ is the expected reduction in entropy after splitting on $A$ :

$$\text{IG}(S,A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v).$$

At each node the algorithm (ID3) selects the attribute with the highest information gain as the splitting attribute.

(b) Entropy and Information Gain of Outlook (6)

Dataset entropy (9 Yes, 5 No, total 14):

$$\text{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.4098 + 0.5305 = \mathbf{0.940}.$$

Entropy of each value of Outlook:

Sunny (2 Yes, 3 No): $-\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$
Overcast (4 Yes, 0 No): $0$ (pure)
Rain (3 Yes, 2 No): $0.971$

Weighted entropy after split:

$$\tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) = 0.347 + 0 + 0.347 = 0.694.$$

Information Gain:

$$\text{IG}(S, \text{Outlook}) = 0.940 - 0.694 = \mathbf{0.246}.$$
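
The same calculation can be reproduced in code. The sketch below uses the class counts from this answer; the helper names `entropy` and `information_gain` are illustrative. Without intermediate rounding the gain comes out near 0.2467, so the last digit can differ from the hand-rounded 0.246.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# [Yes, No] counts taken from the worked answer above.
parent = [9, 5]
outlook = [[2, 3], [4, 0], [3, 2]]   # Sunny, Overcast, Rain

h_s = entropy(parent)
ig = information_gain(parent, outlook)
print(round(h_s, 3), round(ig, 3))
```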

(c) Advantages and Limitations (2)

Advantages: (1) Easy to interpret and visualize (white-box, human-readable rules); (2) Require little data preparation and handle both numeric and categorical features.

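Limitations: (1) Prone to overfitting / high variance, producing unstable trees sensitive to small data changes; (2) Splits are chosen greedily and are axis-parallel, so the tree may be globally suboptimal and represents diagonal or smooth boundaries poorly; a single tree is often less accurate than an ensemble such as a random forest.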

Answer 3

(a) Feedforward Network with One Hidden Layer (6)

Architecture (described): Three layers connected left to right —

Input layer: nodes $x_1, x_2, \dots, x_n$ (one per feature).
Hidden layer: nodes $h_1, \dots, h_p$ ; each receives a weighted sum of all inputs plus a bias, then applies an activation: $h_j = g\!\left(\sum_i w_{ji} x_i + b_j\right)$ .
Output layer: nodes $\hat{y}_1, \dots, \hat{y}_k$ , computed similarly from the hidden activations.

Every node in one layer connects to every node in the next (fully connected); information flows forward only.

Role of activation functions: They introduce a non-linear transformation at each neuron, controlling the firing of the neuron and enabling the network to learn complex, non-linear decision boundaries.

Why non-linearity is necessary: If all activations were linear, the composition of layers would collapse into a single linear transformation $\hat{y} = W'x + b'$ — equivalent to a one-layer linear model, unable to represent non-linear functions (e.g. XOR). Non-linear activations such as sigmoid $\sigma(z)=\tfrac{1}{1+e^{-z}}$ or ReLU $\max(0,z)$ give the network universal approximation capability.
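
The collapse of stacked linear layers can be shown in one line. In this sketch the symbols $W_1, b_1$ (hidden layer) and $W_2, b_2$ (output layer) are notation assumed here for illustration, with the activation taken to be the identity:

```latex
% Hidden layer with identity activation, followed by a linear output layer:
h = W_1 x + b_1, \qquad \hat{y} = W_2 h + b_2

% Substituting h gives a single linear map:
\hat{y} = W_2(W_1 x + b_1) + b_2
        = \underbrace{(W_2 W_1)}_{W'}\,x + \underbrace{(W_2 b_1 + b_2)}_{b'}
```

However many such layers are stacked, the result is still of the form $W'x + b'$.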

(b) Backpropagation and Output-Weight Update (5)

Backpropagation trains the network in two passes: (1) a forward pass computes activations layer by layer and the loss; (2) a backward pass propagates the error from the output back to earlier layers using the chain rule, computing $\partial E/\partial w$ for every weight. Weights are then updated by gradient descent.

Derivation for an output-layer weight $w_{kj}$ connecting hidden unit $h_j$ to output $\hat{y}_k$ , with squared error $E = \tfrac{1}{2}\sum_k (y_k - \hat{y}_k)^2$ , net input $a_k = \sum_j w_{kj} h_j$ , and $\hat{y}_k = g(a_k)$ :

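$$\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial \hat{y}_k}\cdot\frac{\partial \hat{y}_k}{\partial a_k}\cdot\frac{\partial a_k}{\partial w_{kj}} = -(y_k - \hat{y}_k)\,g'(a_k)\,h_j.$$

Define the output error term $\delta_k = (y_k - \hat{y}_k)\,g'(a_k)$, so that $\partial E/\partial w_{kj} = -\delta_k h_j$. Gradient descent moves each weight against its gradient, $w_{kj} \leftarrow w_{kj} - \eta\,\partial E/\partial w_{kj}$, which gives the update rule:

$$\boxed{w_{kj} \leftarrow w_{kj} + \eta\,\delta_k\,h_j}$$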

where $\eta$ is the learning rate.
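
A quick numerical check of this derivation is sketched below for a single output unit. It assumes a sigmoid activation, so $g'(a) = g(a)(1-g(a))$; the values of `h`, `w`, `y` and `eta` are hypothetical and chosen only for illustration. The chain-rule gradient is compared against central finite differences, then one update is applied.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, h):
    """Single output unit: a = sum_j w_j h_j, y_hat = g(a)."""
    a = sum(wj * hj for wj, hj in zip(w, h))
    return a, sigmoid(a)

def loss(w, h, y):
    """Squared error E = 0.5 * (y - y_hat)^2 for one output."""
    _, y_hat = forward(w, h)
    return 0.5 * (y - y_hat) ** 2

# Hypothetical values, for illustration only.
h = [0.5, -1.0, 0.25]   # hidden activations h_j
w = [0.1, 0.4, -0.3]    # output weights w_kj for one output unit k
y, eta = 1.0, 0.5       # target and learning rate

a, y_hat = forward(w, h)
delta = (y - y_hat) * y_hat * (1.0 - y_hat)   # delta_k with sigmoid g'
analytic_grad = [-delta * hj for hj in h]     # dE/dw_kj = -delta_k * h_j

# Central finite differences as an independent check of the chain rule.
eps = 1e-6
numeric_grad = []
for j in range(len(w)):
    w_plus = w[:j] + [w[j] + eps] + w[j + 1:]
    w_minus = w[:j] + [w[j] - eps] + w[j + 1:]
    numeric_grad.append((loss(w_plus, h, y) - loss(w_minus, h, y)) / (2 * eps))

# One update step: w <- w + eta * delta * h
new_w = [wj + eta * delta * hj for wj, hj in zip(w, h)]
print(loss(w, h, y), loss(new_w, h, y))
```

The two gradients should agree to several decimal places, and the loss after the update should be lower than before.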

(c) Overfitting, L2 Regularization and Dropout (3)

Overfitting occurs when a network learns the training data (including its noise) too closely, achieving low training error but high test error — it fails to generalize.

L2 regularization (weight decay): adds a penalty $\frac{\lambda}{2}\sum w^2$ to the loss, shrinking weights toward zero. Smaller weights give a smoother, simpler function that is less likely to overfit.
Dropout: during training, randomly "drops" (sets to zero) a fraction of neurons each iteration. This prevents co-adaptation of neurons and acts like training an ensemble of sub-networks, improving generalization.

Answer 4

(a) k-Means Algorithm (6)

k-means partitions data into $k$ clusters by iteratively minimizing within-cluster variance.

Steps:

1. Choose $k$ and initialize $k$ centroids $\mu_1, \dots, \mu_k$ (randomly or by heuristic).
2. Assignment: assign each point $x^{(i)}$ to the nearest centroid: $c^{(i)} = \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$.
3. Update: recompute each centroid as the mean of points assigned to it: $\mu_j = \frac{1}{|C_j|}\sum_{x \in C_j} x$.
4. Repeat steps 2–3 until assignments (or centroids) no longer change.

Objective function (distortion / WCSS):

$$J = \sum_{j=1}^{k} \sum_{x^{(i)} \in C_j} \lVert x^{(i)} - \mu_j \rVert^2.$$

(b) Two Iterations on the 1-D Data (6)

Data: $\{2,4,10,12,3,20,30,11,25\}$ , $k=2$ , $\mu_1=2$ , $\mu_2=30$ .

Iteration 1 — assign to nearest centroid:

Closer to 2: $\{2,4,3,10,12,11\}$ (e.g. $10\!:|10-2|=8<|10-30|=20$ )
Closer to 30: $\{20,30,25\}$

Update centroids:

$$\mu_1 = \frac{2+4+3+10+12+11}{6} = \frac{42}{6} = 7, \qquad \mu_2 = \frac{20+30+25}{3} = \frac{75}{3} = 25.$$

Iteration 2 — reassign with $\mu_1=7,\ \mu_2=25$ :

Closer to 7: $\{2,4,3,10,12,11\}$ (e.g. $20\!:|20-7|=13>|20-25|=5$ → cluster 2)
Closer to 25: $\{20,30,25\}$

Assignments are unchanged, so centroids stay $\mu_1=7,\ \mu_2=25$ → converged.

Final result: $C_1=\{2,3,4,10,11,12\},\ \mu_1=7$ ; $C_2=\{20,25,30\},\ \mu_2=25$ .
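
The two iterations can be replayed in code. The following is a minimal 1-D sketch of the assignment and update steps, not a general-purpose implementation; the function name `kmeans_1d` and the empty-cluster rule are assumptions made for this example.

```python
def kmeans_1d(points, centroids, iterations):
    """Run a fixed number of k-means iterations on 1-D data."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        # (An empty cluster keeps its old centroid; a simplifying assumption.)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
centroids, clusters = kmeans_1d(data, [2, 30], iterations=2)

# Objective J (WCSS) for the final clustering.
wcss = sum((p - mu) ** 2 for mu, c in zip(centroids, clusters) for p in c)
print(centroids, [sorted(c) for c in clusters], wcss)
```

With the given initial centroids this reproduces $\mu_1 = 7$ and $\mu_2 = 25$, and the objective for the final clusters evaluates to $J = 150$.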

(c) Limitations and the Elbow Method (2)

Limitations: (1) $k$ must be specified in advance and results depend on initial centroids (can converge to a local optimum); (2) assumes spherical, equally-sized clusters and is sensitive to outliers and scaling.

Elbow method: run k-means for a range of $k$ , plot the objective $J$ (WCSS) versus $k$ . $J$ decreases as $k$ grows; the "elbow" — the point where the rate of decrease sharply flattens — indicates a good value of $k$ , balancing compactness against the number of clusters.

Answer 5

Given: TP $=40$ , FP $=10$ , FN $=20$ , TN $=30$ (total $=100$ ).

$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN} = \frac{40+30}{100} = 0.70 \;(70\%)$$

$$\text{Precision} = \frac{TP}{TP+FP} = \frac{40}{50} = 0.80 \;(80\%)$$

$$\text{Recall} = \frac{TP}{TP+FN} = \frac{40}{60} = 0.667 \;(66.7\%)$$

$$\text{F1} = \frac{2\cdot P \cdot R}{P+R} = \frac{2(0.80)(0.667)}{0.80+0.667} = \frac{1.067}{1.467} = 0.727 \;(72.7\%)$$

Comment: Precision of 80% means that when the classifier predicts positive, it is correct 80% of the time (relatively few false alarms). Recall of 66.7% means it captures only two-thirds of the actual positives, missing one-third (20 false negatives). Thus the model is more precise than complete — it is conservative in raising positives but misses a notable share of true positives.
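
The four metrics follow directly from the confusion-matrix counts, as this short sketch shows (the function name `classification_metrics` is illustrative):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

Computed without intermediate rounding, F1 equals $8/11 \approx 0.727$, in agreement with the value above.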

Answer 6

Regression vs Classification

	Regression	Classification
Output	Continuous numeric value	Discrete class label
Goal	Predict a quantity	Assign a category
Example	Predicting house price (Rs.) from area	Predicting whether an email is spam or not

Why MSE suits regression but not classification:

MSE, $\frac{1}{m}\sum (y - \hat{y})^2$ , measures squared numeric distance between predicted and true values, which is exactly meaningful when the target is a continuous quantity — it is convex in the parameters of a linear model and penalizes large errors.

For classification the labels are categorical (e.g. 0/1) and outputs are probabilities. Using MSE here is unsuitable because: (1) class labels are not on a metric scale, so squared distance is not a natural error measure; (2) combined with a sigmoid output, the MSE surface is non-convex with flat regions, causing slow learning and poor convergence; and (3) it weakly penalizes confident wrong predictions. The cross-entropy loss is preferred as it is convex (for logistic regression), strongly penalizes confident mistakes, and matches the probabilistic interpretation of the output.

Answer 7

Bias-Variance Tradeoff

Bias is the error from overly simple assumptions in the model — it fails to capture the true relationship (systematic error). Variance is the error from excessive sensitivity to the particular training set — small changes in data cause large changes in the model. Expected test error decomposes as:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}.$$

Diagram (described): Plot model complexity on the x-axis and error on the y-axis.

The bias² curve starts high and decreases as complexity increases.
The variance curve starts low and increases with complexity.
The total test error is U-shaped: it falls, reaches a minimum, then rises. The optimal model sits at the bottom of this U.

Error
 |  \  (bias^2)         /  (variance)
 |   \                 /
 |    \      _______  /
 |     \____/  total \____/   <- U-shaped total error
 |________________________________  Model complexity
      underfit     optimal   overfit

Relation to underfitting / overfitting:

High bias → underfitting: the model is too simple, giving high error on both training and test data.
High variance → overfitting: the model is too complex, giving very low training error but high test error because it memorizes noise.

The goal is to balance the two to minimize total generalization error.

Answer 8

k-Nearest Neighbours (k-NN)

Working: k-NN is a non-parametric, lazy (instance-based) algorithm. There is no explicit training phase; it stores all training examples. To classify a new point $x$ :

1. Compute the distance (usually Euclidean) from $x$ to every training point.
2. Select the $k$ closest training points (its $k$ nearest neighbours).
3. Assign the class by majority vote of those $k$ neighbours (for regression, take their average).

Effect of $k$ on bias and variance:

Small $k$ (e.g. $k=1$ ): the decision boundary is highly flexible and follows local noise → low bias, high variance (overfitting).
Large $k$ : predictions are smoothed over many neighbours, averaging out detail → high bias, low variance (underfitting if $k$ is too large).

Thus $k$ controls the bias-variance tradeoff and is chosen (e.g. by cross-validation) to balance them.

Effect of feature scaling: Because k-NN relies on distance, features with larger numeric ranges dominate the distance computation. Without scaling, a large-magnitude feature can overwhelm others, distorting the neighbourhood. Therefore features should be normalized/standardized (e.g. min-max or z-score) so each contributes proportionately.
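
The effect of scaling can be seen on a small, made-up dataset. In the sketch below the two features (age in years, income in Rs.), the labels and the query point are all hypothetical; `min_max_scale` assumes every feature takes at least two distinct values.

```python
from collections import Counter
import math

def min_max_scale(rows):
    """Rescale every feature column to [0, 1]; also return the (min, max) pairs."""
    bounds = [(min(col), max(col)) for col in zip(*rows)]
    scaled = [[(v - lo) / (hi - lo) for v, (lo, hi) in zip(row, bounds)] for row in rows]
    return scaled, bounds

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to query (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: [age, income]; the class depends on age only.
train_x = [[25, 50000], [27, 52000], [26, 51000], [55, 50500], [60, 51500], [58, 52500]]
train_y = ["no", "no", "no", "yes", "yes", "yes"]
query = [57, 50000]

# Unscaled: income differences (thousands) swamp age differences (tens).
raw_prediction = knn_predict(train_x, train_y, query, k=3)

# Scaled: the query is transformed with the training set's min and max.
scaled_x, bounds = min_max_scale(train_x)
scaled_query = [(v - lo) / (hi - lo) for v, (lo, hi) in zip(query, bounds)]
scaled_prediction = knn_predict(scaled_x, train_y, scaled_query, k=3)

print(raw_prediction, scaled_prediction)
```

On this data the unscaled vote is decided almost entirely by income and returns "no", while the scaled vote picks up the age pattern and returns "yes".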

Answer 9

k-Fold Cross-Validation

Definition: k-fold cross-validation is a resampling method that partitions the dataset into $k$ equal (or nearly equal) disjoint folds. The model is trained and validated $k$ times; each time a different fold serves as the validation set and the remaining $k-1$ folds form the training set. The $k$ validation scores are averaged to estimate performance.

Procedure for $k=5$ :

1. Shuffle and split the data into 5 folds: $F_1, F_2, F_3, F_4, F_5$.
2. Iteration 1: train on $F_2, F_3, F_4, F_5$, validate on $F_1$.
3. Iteration 2: train on $F_1, F_3, F_4, F_5$, validate on $F_2$.
4. Iterations 3 to 5: continue in the same way, validating on $F_3$, $F_4$ and $F_5$ in turn.
5. Report the mean (and standard deviation) of the 5 validation scores.

Diagram (described): Five rows, one per iteration; each row shows 5 blocks where exactly one block is marked Validation and the other four Train, with the validation block shifting one position to the right each row.

Fold:   1     2     3     4     5
Iter1: [Val][Trn][Trn][Trn][Trn]
Iter2: [Trn][Val][Trn][Trn][Trn]
Iter3: [Trn][Trn][Val][Trn][Trn]
Iter4: [Trn][Trn][Trn][Val][Trn]
Iter5: [Trn][Trn][Trn][Trn][Val]

Why more reliable than a single train-test split: Every data point is used for validation exactly once and for training in the other $k-1$ iterations, so the estimate does not depend on one lucky or unlucky split. Averaging over $k$ runs reduces the variance of the performance estimate and uses the data more efficiently, giving a more robust measure of generalization.
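
The fold bookkeeping can be sketched as an index generator. This version assumes the data has already been shuffled and uses contiguous folds; the name `k_fold_indices` is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

splits = list(k_fold_indices(n_samples=10, k=5))
for iteration, (train, val) in enumerate(splits, start=1):
    print(f"Iter{iteration}: validate on {val}, train on {train}")
```

Each index appears in exactly one validation fold and in the training set of the other four iterations.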

Answer 10

Logistic Regression (Binary Classification)

Logistic regression models the probability that an input belongs to the positive class. A linear combination $z = \theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ is passed through the sigmoid (logistic) function to map it to $(0,1)$ :

$$\hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{p} = P(y=1\mid x).$$

Decision boundary: Predict class 1 if $\hat{p} \ge 0.5$ , else class 0. Since $\sigma(z)=0.5$ exactly when $z=0$ , the decision boundary is the set where

$$\theta^T x = 0,$$

which is a linear hyperplane in feature space.

Why cross-entropy is preferred over squared error: The cross-entropy (log) loss for one example is

$$L = -\big[y\log\hat{p} + (1-y)\log(1-\hat{p})\big].$$

Combined with the sigmoid, this loss is convex in $\theta$ , so gradient descent reaches the global minimum reliably. In contrast, squared error with a sigmoid output is non-convex with flat saturated regions where the gradient vanishes, causing slow learning. Cross-entropy also penalizes confident wrong predictions heavily ( $-\log$ of a small probability is large), producing strong, well-scaled gradients and well-calibrated probabilities.
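
The vanishing-gradient point can be made concrete by comparing the gradient of each loss with respect to $z$ for a confidently wrong prediction. This is a sketch under stated assumptions: the squared error is taken as $\tfrac{1}{2}(\hat{p}-y)^2$, and the test point $z=-8$, $y=1$ is chosen purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_squared_error(z, y):
    """d/dz of 0.5 * (p - y)^2 with p = sigmoid(z): (p - y) * p * (1 - p)."""
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)

def grad_cross_entropy(z, y):
    """d/dz of -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z): p - y."""
    return sigmoid(z) - y

# True label is 1, but the model is confident in class 0 (p is about 0.0003).
z, y = -8.0, 1
g_mse = grad_squared_error(z, y)
g_ce = grad_cross_entropy(z, y)
print(g_mse, g_ce)
```

The squared-error gradient is scaled by the factor $\hat{p}(1-\hat{p})$ and is nearly zero here, while the cross-entropy gradient stays close to $-1$, so the badly wrong prediction still produces a strong correction.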

Answer 11

Hierarchical vs Partitional (k-means) Clustering

Aspect	Hierarchical	Partitional (k-means)
Output	A nested tree of clusters (dendrogram)	A single flat partition into $k$ clusters
Need $k$ in advance?	No (cut tree afterwards)	Yes
Approach	Merge/split based on inter-cluster distance	Iterative centroid assignment
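Cost	Higher: at least $O(n^2)$ time and memory for pairwise distances, up to $O(n^3)$ for naive agglomerative merging	Lower, roughly $O(nkt)$ for $n$ points, $k$ clusters and $t$ iterations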
Reversibility	Greedy merges are not undone	Re-assigns points each iteration

Agglomerative (bottom-up) clustering: Start with each data point as its own cluster. Repeatedly merge the two closest clusters (closeness measured by a linkage criterion — single, complete, average, or Ward linkage) until all points form one cluster. This sequence of merges builds the hierarchy.

Role of the dendrogram: A dendrogram is a tree diagram whose vertical axis shows the distance at which clusters merge. To choose the number of clusters, draw a horizontal line that cuts the dendrogram; the number of vertical lines it crosses equals the number of clusters. Cutting at a level where the merge distances jump sharply (a long vertical gap) yields well-separated, natural clusters.
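
The merge sequence that a dendrogram records can be sketched with a naive single-linkage implementation on 1-D data. The points below are hypothetical, and the function name `single_linkage` is illustrative; the code favours clarity over efficiency.

```python
def single_linkage(points):
    """Naive agglomerative clustering of 1-D points; returns the merge history."""
    clusters = [[p] for p in points]   # start with every point as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((d, sorted(merged)))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Hypothetical data, for illustration only.
merges = single_linkage([1, 2, 6, 7, 15])
for distance, members in merges:
    print(distance, members)
```

The merge distances are the heights in the dendrogram. Here they are 1, 1, 4 and 8, so a horizontal cut between heights 4 and 8 leaves two clusters, and a cut between 1 and 4 leaves three.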

Answer 12

(Answer any two — model answers for all four are given.)

(a) Gradient descent and the learning rate. Gradient descent is an iterative optimization method that minimizes a cost $J(\theta)$ by updating parameters opposite to the gradient: $\theta \leftarrow \theta - \eta\,\nabla_\theta J(\theta)$ . The learning rate $\eta$ sets the step size: too small → very slow convergence; too large → overshooting, oscillation, or divergence. A well-chosen (often decaying) $\eta$ gives stable, fast convergence to a minimum.
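
The three learning-rate regimes in (a) can be seen on a toy problem. The sketch below assumes the hypothetical cost $J(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$ and whose minimum is at $\theta = 3$; the step counts and rates are chosen for illustration.

```python
def gradient_descent(grad, theta, eta, steps):
    """Apply theta <- theta - eta * grad(theta) for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

def grad(theta):
    """Gradient of the hypothetical cost J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

slow = gradient_descent(grad, 0.0, eta=0.01, steps=50)      # too small: still far from 3
good = gradient_descent(grad, 0.0, eta=0.3, steps=50)       # well chosen: converges to 3
diverged = gradient_descent(grad, 0.0, eta=1.1, steps=50)   # too large: overshoots and diverges
print(slow, good, diverged)
```

For this cost each step multiplies the error $\theta - 3$ by $(1 - 2\eta)$, so the iteration converges only when $|1 - 2\eta| < 1$, that is, for $0 < \eta < 1$.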

(b) Curse of dimensionality. As the number of features (dimensions) grows, the volume of the feature space grows exponentially, so data become sparse. Distances between points become similar, weakening distance-based methods (k-NN, clustering), and far more data is needed to generalize. Remedies include feature selection and dimensionality reduction (e.g. PCA).

(c) Support Vector Machines and margin. An SVM finds the hyperplane that separates two classes with the maximum margin — the largest distance between the boundary and the nearest training points (the support vectors). Maximizing the margin improves generalization. With the kernel trick, SVMs handle non-linearly separable data by implicitly mapping inputs to a higher-dimensional space.
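
In the standard hard-margin formulation the margin in (c) can be written explicitly. The notation is assumed here and is not part of the answer above: weight vector $w$, bias $b$, and labels $y^{(i)} \in \{-1, +1\}$.

```latex
% Separating hyperplane and the scaled separation constraints:
w^T x + b = 0, \qquad y^{(i)}\,(w^T x^{(i)} + b) \ge 1 \quad \text{for all } i

% Distance from the hyperplane to the closest points, and the full margin width:
\frac{1}{\lVert w \rVert}, \qquad \text{margin} = \frac{2}{\lVert w \rVert}

% Maximizing the margin is therefore equivalent to:
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to } y^{(i)}\,(w^T x^{(i)} + b) \ge 1
```

The support vectors are the training points for which the constraint holds with equality.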

(d) Feature engineering. The process of creating, transforming, and selecting input features to improve model performance — e.g. scaling/normalization, encoding categorical variables, handling missing values, creating interaction or polynomial terms, and extracting domain-specific features. Good features often matter more than the choice of algorithm.

Level	BE Computer Engineering (Pokhara University)
Subject	Machine Learning (PU, CMP 364)
Year	2079 BS
Exam session	Regular (annual)
Full marks	100
Time allowed	180 minutes
Questions	12, all with step-by-step solutions