
Section A: Long Answer Questions

Attempt all / any as specified.

4 questions
Question 1 (long answer, 15 marks)

(a) Define machine learning and distinguish clearly between supervised and unsupervised learning, giving one real-world application of each. (5 marks)

(b) Consider a simple linear regression model $h_\theta(x) = \theta_0 + \theta_1 x$. Derive the least squares cost function $J(\theta_0, \theta_1)$ and obtain the gradient descent update rules for $\theta_0$ and $\theta_1$. (7 marks)

(c) Explain the effect of choosing a learning rate that is too large versus too small on the convergence of gradient descent. (3 marks)

(a) Machine Learning; Supervised vs Unsupervised (5 marks)

Machine learning is the field of study that gives computers the ability to learn patterns from data and improve their performance on a task without being explicitly programmed with hand-coded rules. Formally (Mitchell): a program learns from experience $E$ with respect to task $T$ and performance measure $P$ if its performance at $T$, measured by $P$, improves with $E$.

| Aspect | Supervised learning | Unsupervised learning |
|---|---|---|
| Data | Labelled examples $(x^{(i)}, y^{(i)})$ | Unlabelled examples $x^{(i)}$ only |
| Goal | Learn a mapping $x \to y$ to predict outputs | Discover hidden structure / grouping |
| Typical tasks | Classification, regression | Clustering, dimensionality reduction |
| Application | Spam email detection (predict spam/not-spam) | Customer segmentation (group similar customers) |

(b) Least-squares cost and gradient-descent rules (7 marks)

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$. For $m$ training examples, the least-squares cost function measures mean squared error:

$$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$$

The factor $\tfrac{1}{2}$ simplifies the derivative. Taking partial derivatives:

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)$$

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)\,x^{(i)}$$

Gradient descent moves opposite the gradient with learning rate $\alpha$, updating simultaneously:

$$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)$$

$$\theta_1 := \theta_1 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)\,x^{(i)}$$

(c) Effect of the learning rate (3 marks)

  • Too large $\alpha$: the steps overshoot the minimum; the cost may oscillate or diverge (increase) instead of decreasing, so gradient descent fails to converge.
  • Too small $\alpha$: updates are tiny, so convergence is correct but very slow, needing many iterations.

A good rate is small enough to converge but large enough to be efficient; one often tries values like $0.001, 0.01, 0.1, \dots$
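The update rules from part (b) and the learning-rate behaviour from part (c) can be tried out on a small example. The following is a minimal sketch, not a reference implementation: the toy data (generated from $y = 1 + 2x$), the iteration count and the three learning rates are assumptions chosen only for illustration.

```python
def cost(xs, ys, theta0, theta1):
    """Least-squares cost J(theta0, theta1) = 1/(2m) * sum of squared errors."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha, iterations):
    """Batch gradient descent for h(x) = theta0 + theta1 * x, starting from (0, 0)."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

initial_cost = cost(xs, ys, 0.0, 0.0)
results = {}
for alpha in (0.001, 0.1, 0.5):
    t0, t1 = gradient_descent(xs, ys, alpha, iterations=50)
    results[alpha] = cost(xs, ys, t0, t1)
    print(f"alpha={alpha}: cost after 50 iterations = {results[alpha]:.3g}")
```

On this data the smallest rate lowers the cost only slightly in 50 iterations, the middle rate gets close to the minimum, and the largest rate makes the cost grow far above its starting value, which matches the three cases described above.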

Question 2 (long answer, 15 marks)

(a) A decision tree learning algorithm uses an attribute-selection measure to decide which feature to split on. Define entropy and information gain, and explain how the ID3 algorithm uses information gain to construct a tree. (6 marks)

(b) Given the following training dataset, compute the information gain of the attribute Weather with respect to the target Play and state which attribute you would select as the root. (6 marks)

| Weather | Temperature | Play |
|---|---|---|
| Sunny | Hot | No |
| Sunny | Mild | No |
| Overcast | Hot | Yes |
| Rainy | Mild | Yes |
| Rainy | Cool | Yes |
| Overcast | Cool | Yes |

(c) Decision trees are prone to overfitting. Explain two pruning strategies used to control tree complexity. (3 marks)

(a) Entropy, Information Gain and ID3 (6 marks)

Entropy measures the impurity (uncertainty) of a set $S$ with respect to the class label. For a binary target with positive proportion $p_+$ and negative proportion $p_-$:

$$H(S) = -p_+\log_2 p_+ - p_-\log_2 p_-$$

It is $0$ when the set is pure and maximal ($1$ bit) when classes are equally mixed.

Information gain of attribute $A$ is the expected reduction in entropy after splitting on $A$:

$$IG(S,A) = H(S) - \sum_{v\in \text{values}(A)} \frac{|S_v|}{|S|}\,H(S_v)$$

ID3 builds the tree top-down and greedily: at each node it computes the information gain of every remaining attribute, selects the attribute with the highest gain as the split, partitions the data on its values, and recurses on each child until the subsets are pure (or attributes run out), forming leaves.

(b) Information gain of Weather (6 marks)

Target Play: 4 Yes, 2 No out of 6. Root entropy:

$$H(S) = -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} = 0.3899 + 0.5283 \approx 0.918$$

Split on Weather:

  • Sunny (2 rows: No, No) → pure, $H = 0$
  • Overcast (2 rows: Yes, Yes) → pure, $H = 0$
  • Rainy (2 rows: Yes, Yes) → pure, $H = 0$

Weighted child entropy $= \tfrac{2}{6}(0)+\tfrac{2}{6}(0)+\tfrac{2}{6}(0) = 0$.

$$IG(S,\text{Weather}) = 0.918 - 0 = \mathbf{0.918\ \text{bits}}$$

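Since Weather perfectly separates the classes (each value maps to a single label), its gain equals the full root entropy, which is the maximum possible. For comparison, splitting on Temperature leaves the Hot and Mild subsets evenly mixed (1 bit each) and only Cool pure, so its weighted child entropy is $\tfrac{2}{6}(1)+\tfrac{2}{6}(1)+\tfrac{2}{6}(0) \approx 0.667$ and its gain is only about $0.25$ bits. Weather is therefore selected as the root attribute.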
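The hand calculation can be checked with a few lines of code. This is a small sketch under the definitions of entropy and information gain given in part (a); the function names `entropy` and `information_gain` are illustrative choices, and the rows are the six training examples from the question.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Root entropy minus the size-weighted entropy of the subsets split by attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


rows = [
    {"Weather": "Sunny", "Temperature": "Hot", "Play": "No"},
    {"Weather": "Sunny", "Temperature": "Mild", "Play": "No"},
    {"Weather": "Overcast", "Temperature": "Hot", "Play": "Yes"},
    {"Weather": "Rainy", "Temperature": "Mild", "Play": "Yes"},
    {"Weather": "Rainy", "Temperature": "Cool", "Play": "Yes"},
    {"Weather": "Overcast", "Temperature": "Cool", "Play": "Yes"},
]

gains = {a: information_gain(rows, a, "Play") for a in ("Weather", "Temperature")}
root = max(gains, key=gains.get)  # ID3 picks the attribute with the highest gain
print(gains, "root:", root)
```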

(c) Two pruning strategies (3 marks)

  • Pre-pruning (early stopping): stop growing a branch before it perfectly fits — e.g. stop if a node has too few samples, the information gain is below a threshold, or maximum depth is reached.
  • Post-pruning: grow the full tree, then remove or collapse subtrees that do not improve performance on a validation set (e.g. reduced-error pruning or cost-complexity pruning). This usually generalises better than pre-pruning.

Question 3 (long answer, 15 marks)

(a) Draw the architecture of a multilayer feedforward neural network with one hidden layer and explain the role of weights, bias, and activation functions. (5 marks)

(b) Derive the backpropagation weight-update equation for a single output-layer weight using the sigmoid activation function and the squared-error loss. (7 marks)

(c) Why are non-linear activation functions essential in a neural network? What happens if all activations are linear? (3 marks)

(a) Multilayer feedforward network architecture (5 marks)

Described in words (a standard 3-layer diagram): an input layer of nodes $x_1,\dots,x_n$ fully connected to a hidden layer of nodes, which is fully connected to an output layer. Signals flow strictly forward (input → hidden → output) with no cycles.

 x1 ──┐         ┌── h1 ──┐
 x2 ──┼─weights─┼── h2 ──┼─weights── y (output)
 xn ──┘         └── h3 ──┘
          (+bias)   (+bias)
  • Weights $w_{ij}$: learnable parameters scaling each connection; they store what the network has learned.
  • Bias $b$: a learnable constant added to each neuron's weighted sum, shifting the activation threshold so the unit need not pass through the origin.
  • Activation function $g(\cdot)$: applies a non-linear transform $a = g\!\left(\sum_i w_i x_i + b\right)$, enabling the network to model non-linear relationships.

(b) Backpropagation update for an output-layer weight (7 marks)

Let the output neuron have net input $z = \sum_j w_j a_j$, output $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$, target $y$, and squared-error loss $E = \tfrac{1}{2}(y-\hat{y})^2$. We need $\partial E/\partial w_j$ via the chain rule:

$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial z}\cdot\frac{\partial z}{\partial w_j}$$

Each factor:

$$\frac{\partial E}{\partial \hat{y}} = -(y-\hat{y}),\qquad \frac{\partial \hat{y}}{\partial z} = \sigma'(z) = \hat{y}(1-\hat{y}),\qquad \frac{\partial z}{\partial w_j} = a_j$$

Therefore:

$$\frac{\partial E}{\partial w_j} = -(y-\hat{y})\,\hat{y}(1-\hat{y})\,a_j$$

Defining the output-layer error term $\delta = (y-\hat{y})\,\hat{y}(1-\hat{y})$, the gradient-descent update is:

$$w_j := w_j + \alpha\,\delta\,a_j = w_j + \alpha\,(y-\hat{y})\,\hat{y}(1-\hat{y})\,a_j$$
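A quick numerical check of this update rule is sketched below. The weights, hidden activations, target and learning rate are hypothetical values picked for illustration; the code applies the derived rule $w_j := w_j + \alpha\,\delta\,a_j$ once and compares the loss before and after.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def loss(weights, activations, y):
    """Squared-error loss E = 1/2 * (y - y_hat)^2 for a single sigmoid output neuron."""
    z = sum(w * a for w, a in zip(weights, activations))
    return 0.5 * (y - sigmoid(z)) ** 2


def output_layer_update(weights, activations, y, alpha):
    """One gradient-descent step on the output-layer weights."""
    z = sum(w * a for w, a in zip(weights, activations))
    y_hat = sigmoid(z)
    delta = (y - y_hat) * y_hat * (1.0 - y_hat)  # output-layer error term
    return [w + alpha * delta * a for w, a in zip(weights, activations)]


# Hypothetical values: two hidden activations feeding one output neuron.
weights = [0.5, -0.3]
activations = [0.8, 0.2]
y = 1.0
alpha = 0.5

new_weights = output_layer_update(weights, activations, y, alpha)
print("loss before:", loss(weights, activations, y))
print("loss after: ", loss(new_weights, activations, y))
```

For a small enough step the loss after the update should be lower than before, since the step moves each weight against its gradient.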

(c) Why non-linear activations are essential (3 marks)

Non-linearities let the network approximate complex, non-linear decision boundaries (universal approximation). If every activation were linear, the whole network would be a composition of linear maps, which is itself a single linear map $Wx+b$. A deep network would therefore collapse to an equivalent single-layer linear model, unable to learn anything beyond linear relationships, no matter how many layers it has.
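The collapse can be seen directly on a tiny example. In the sketch below, the 2-3-1 network and all of its weights are made-up values, and the activation is taken to be the identity; the two-layer network and the single combined map $W = W_2 W_1$, $b = W_2 b_1 + b_2$ give the same output for any input.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix-matrix product for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]


# Hypothetical weights for a 2-3-1 network whose activations are all linear (identity).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # hidden layer, 3x2
b1 = [0.5, 0.0, -1.0]
W2 = [[2.0, -1.0, 0.5]]                      # output layer, 1x3
b2 = [0.25]


def two_layer_linear(x):
    hidden = vec_add(matvec(W1, x), b1)
    return vec_add(matvec(W2, hidden), b2)


# Equivalent single linear layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = vec_add(matvec(W2, b1), b2)


def single_layer(x):
    return vec_add(matvec(W, x), b)


for x in ([1.0, 2.0], [-3.0, 0.5], [0.0, 0.0]):
    print(x, two_layer_linear(x), single_layer(x))
```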

Question 4 (long answer, 15 marks)

(a) Describe the k-means clustering algorithm step by step, clearly stating its objective (distortion) function and its termination condition. (6 marks)

(b) Given the one-dimensional data points {2, 4, 10, 12, 3, 20, 30, 11, 25} and $k = 2$ with initial centroids $c_1 = 4$ and $c_2 = 12$, perform one full iteration of k-means and report the updated centroids. (6 marks)

(c) Explain how the elbow method helps in choosing an appropriate value of $k$. (3 marks)

(a) k-means algorithm (6 marks)

k-means partitions $n$ points into $k$ clusters by minimising within-cluster squared distance.

  1. Initialise $k$ centroids $c_1,\dots,c_k$ (e.g. random points).
  2. Assignment step: assign each point $x$ to the nearest centroid: $\text{cluster}(x)=\arg\min_j \lVert x - c_j\rVert^2$.
  3. Update step: recompute each centroid as the mean of the points assigned to it.
  4. Repeat steps 2–3 until the termination condition is met.

Objective (distortion) function to minimise:

$$J = \sum_{j=1}^{k}\sum_{x\in C_j}\lVert x - c_j\rVert^2$$

Termination condition: stop when assignments no longer change (centroids stable), when the decrease in $J$ between iterations falls below a threshold, or when a maximum number of iterations is reached.

(b) One full iteration (6 marks)

Data: $\{2,4,10,12,3,20,30,11,25\}$, $c_1=4$, $c_2=12$.

Assignment (point goes to nearer centroid; midpoint $=8$):

  • $x \le 8 \to C_1$: $\{2, 4, 3\}$
  • $x > 8 \to C_2$: $\{10, 12, 20, 30, 11, 25\}$

Update centroids (means):

$$c_1 = \frac{2+4+3}{3} = \frac{9}{3} = \mathbf{3.0}$$

$$c_2 = \frac{10+12+20+30+11+25}{6} = \frac{108}{6} = \mathbf{18.0}$$

Updated centroids after one iteration: $c_1 = 3.0$, $c_2 = 18.0$.
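The iteration can be reproduced in code. This is a minimal one-dimensional sketch of the assignment and update steps; the helper name `kmeans_step` is an illustrative choice, and ties are broken in favour of the lower-indexed centroid, which is an assumption the question does not need for this data. The loop at the end goes beyond what the question asks and simply repeats the step until the assignments stop changing.

```python
def kmeans_step(points, centroids):
    """One k-means iteration in 1-D: assign to nearest centroid, then recompute means."""
    clusters = [[] for _ in centroids]
    for x in points:
        nearest = min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
        clusters[nearest].append(x)
    # An empty cluster keeps its old centroid (a simple convention, assumed here).
    new_centroids = [
        sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
    ]
    return clusters, new_centroids


points = [2, 4, 10, 12, 3, 20, 30, 11, 25]

# The single iteration asked for in the question.
first_clusters, first_centroids = kmeans_step(points, [4.0, 12.0])
print(first_clusters, first_centroids)

# Repeating until assignments stop changing (the termination condition from part (a)).
clusters, centroids = first_clusters, first_centroids
while True:
    next_clusters, next_centroids = kmeans_step(points, centroids)
    if next_clusters == clusters:
        break
    clusters, centroids = next_clusters, next_centroids
print(clusters, centroids)
```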

(c) Elbow method (3 marks)

Run k-means for a range of $k$ values and plot the total distortion $J$ (within-cluster sum of squares) against $k$. $J$ always decreases as $k$ grows, but the rate of decrease slows once the natural number of clusters is reached. The plot shows a sharp bend, the "elbow", and the $k$ at that bend is chosen as a good trade-off between low distortion and model simplicity.


Section B: Short Answer Questions

Attempt all / any as specified.

8 questions
Question 5 (short answer, 7 marks)

Explain the Naive Bayes classifier. State the conditional-independence assumption it makes and discuss one situation where this assumption may fail yet the classifier still performs well.

Naive Bayes classifier. A probabilistic classifier based on Bayes' theorem. Given features $x_1,\dots,x_n$, it predicts the class $C$ that maximises the posterior:

$$\hat{C} = \arg\max_{C} P(C)\prod_{i=1}^{n} P(x_i \mid C)$$

The denominator $P(x)$ is ignored since it is constant across classes. Priors $P(C)$ and likelihoods $P(x_i\mid C)$ are estimated from training-data frequencies (often with Laplace smoothing).

Conditional-independence assumption. It assumes all features are mutually independent given the class, i.e. $P(x_1,\dots,x_n\mid C)=\prod_i P(x_i\mid C)$. This "naive" assumption is what lets the joint likelihood factor into a simple product, making training and prediction fast and needing little data.

When it fails yet still works. In text/spam classification words are clearly not independent (e.g. "New" and "York" co-occur). Despite this violated assumption, Naive Bayes usually classifies well because the decision only depends on which class gets the highest score, not on accurate probability values; even when individual probability estimates are biased, the ranking of classes is often still correct.

Question 6 (short answer, 7 marks)

Differentiate between overfitting and underfitting in terms of the bias–variance trade-off. Explain how L1 (Lasso) and L2 (Ridge) regularization help reduce overfitting, and state how their effects on model parameters differ.

Overfitting vs underfitting (bias–variance).

  • Underfitting (high bias, low variance): the model is too simple to capture the underlying pattern; it performs poorly on both training and test data.
  • Overfitting (low bias, high variance): the model is too complex and fits noise in the training data; it does well on training but poorly on test data.

The bias–variance trade-off is the problem of choosing a model complexity that minimises the expected total error, $\text{bias}^2 + \text{variance} + \text{irreducible noise}$: making the model more complex lowers bias but raises variance, and vice versa.

Regularization adds a penalty on parameter size to the loss, discouraging overly complex models:

  • L2 (Ridge): penalty $\lambda\sum_j \theta_j^2$. Shrinks weights smoothly toward (but not exactly to) zero, keeping all features with smaller coefficients.
  • L1 (Lasso): penalty $\lambda\sum_j |\theta_j|$. Can drive some coefficients exactly to zero, performing automatic feature selection and yielding a sparse model.

Difference in effect: Ridge distributes shrinkage across all weights (good when many features are useful); Lasso produces sparsity by zeroing irrelevant features (good for selecting a few important ones). Both reduce variance/overfitting at the cost of a small increase in bias, controlled by $\lambda$.
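The difference is easiest to see in a simplified setting where each coefficient can be optimised on its own. The sketch below assumes a one-dimensional objective per coefficient, $\tfrac{1}{2}(\theta - z)^2$ plus the penalty, where $z$ stands for the unregularised estimate; this decoupling is an assumption made for illustration (it holds, for example, with orthonormal features), and the numbers are made up.

```python
def ridge_1d(z, lam):
    """Minimiser of 0.5*(theta - z)**2 + lam*theta**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)


def lasso_1d(z, lam):
    """Minimiser of 0.5*(theta - z)**2 + lam*abs(theta): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


# Hypothetical unregularised coefficients: two large, two small.
unregularised = [3.0, 0.4, -0.2, -2.5]
lam = 0.5

ridge = [ridge_1d(z, lam) for z in unregularised]
lasso = [lasso_1d(z, lam) for z in unregularised]
print("ridge:", ridge)  # every coefficient shrunk, none exactly zero
print("lasso:", lasso)  # small coefficients become exactly zero
```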

Question 7 (short answer, 7 marks)

For a binary classifier evaluated on a test set, the confusion matrix is given below.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 40 | 10 |
| Actual Negative | 5 | 45 |
Compute the accuracy, precision, recall, and F1-score. Explain why accuracy alone can be misleading on an imbalanced dataset.

From the confusion matrix: $TP=40$, $FN=10$, $FP=5$, $TN=45$ (total $=100$).

Accuracy $= \dfrac{TP+TN}{\text{total}} = \dfrac{40+45}{100} = \dfrac{85}{100} = \mathbf{0.85}$

Precision $= \dfrac{TP}{TP+FP} = \dfrac{40}{40+5} = \dfrac{40}{45} \approx \mathbf{0.889}$

Recall $= \dfrac{TP}{TP+FN} = \dfrac{40}{40+10} = \dfrac{40}{50} = \mathbf{0.80}$

F1-score $= 2\cdot\dfrac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}} = 2\cdot\dfrac{0.889\times 0.80}{0.889+0.80} = \dfrac{1.422}{1.689} \approx \mathbf{0.842}$

Why accuracy can mislead on imbalanced data. Accuracy counts all correct predictions equally. If one class dominates (e.g. 95% negatives), a trivial model that always predicts the majority class scores 95% accuracy while completely failing to detect the rare positive class. Precision, recall and F1 focus on the minority/positive class and expose this failure, so they are preferred for imbalanced problems.
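The four metrics, and the imbalanced-data pitfall, can be checked numerically. The sketch below uses the confusion-matrix counts from the question; the second call models the hypothetical "always predict negative" classifier on a 95%-negative test set, and returning 0 when a denominator is zero is a convention assumed here.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Convention assumed here: a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the question's confusion matrix.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print({name: round(value, 3) for name, value in metrics.items()})

# Hypothetical imbalanced case: 5 positives, 95 negatives, model always says "negative".
baseline = classification_metrics(tp=0, fn=5, fp=0, tn=95)
print({name: round(value, 3) for name, value in baseline.items()})
```

The baseline reaches 0.95 accuracy while its recall and F1 are both 0, which is the failure that accuracy alone hides.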

Question 8 (short answer, 6 marks)

Explain the k-Nearest Neighbours (k-NN) algorithm. Discuss how the choice of $k$ affects the bias and variance of the model, and why k-NN is called a lazy learner.

k-Nearest Neighbours (k-NN). A non-parametric, instance-based algorithm. To classify a query point $x$, it (1) computes the distance (usually Euclidean) from $x$ to every training point, (2) selects the $k$ nearest neighbours, and (3) assigns the majority class among them (for regression, the average of their values).

Effect of $k$ on bias and variance:

  • Small $k$ (e.g. $k=1$): very flexible, follows individual points → low bias, high variance; sensitive to noise/outliers and prone to overfitting.
  • Large $k$: averages over many neighbours, giving a smoother boundary → high bias, low variance; can underfit and blur class boundaries.

So $k$ controls the bias–variance trade-off; it is tuned (often via cross-validation), and an odd $k$ avoids ties in binary problems.

Why a "lazy learner": k-NN does no real training phase — it simply stores the dataset. All computation (distance calculation and voting) is deferred to query time. Because it postpones generalisation until a prediction is requested, it is called a lazy (instance-based) learner, in contrast to eager learners like decision trees that build a model up front.
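The three steps, and the sensitivity of small $k$ to noise, can be sketched as follows. The training points are hypothetical: two well-separated groups plus one deliberately mislabelled point near class A. Note that the "training" consists only of keeping the list; all work happens inside `knn_predict` at query time.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k):
    """Classify query by majority vote among its k nearest training points (Euclidean)."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training set: class A near (1.5, 1.5), class B near (6.5, 6.5),
# plus one noisy point labelled B that sits inside the class-A region.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
    ((2.2, 2.2), "B"),  # noisy label
]

query = (2.3, 2.3)
print("k=1:", knn_predict(train, query, k=1))  # follows the single noisy neighbour
print("k=3:", knn_predict(train, query, k=3))  # the vote smooths the noise out
```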

Question 9 (short answer, 6 marks)

Explain logistic regression for binary classification. (a) Write the sigmoid hypothesis function and explain how a decision boundary is obtained. (b) Why is the squared-error cost function not used for logistic regression?

Logistic regression is a linear model for binary classification that outputs the probability that an example belongs to class 1.

(a) Sigmoid hypothesis and decision boundary

It passes a linear combination through the sigmoid (logistic) function:

$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}, \qquad 0 < h_\theta(x) < 1$$

$h_\theta(x)$ is interpreted as $P(y=1\mid x)$. We predict class 1 if $h_\theta(x) \ge 0.5$, else class 0. Since $\sigma(z)\ge 0.5 \iff z\ge 0$, the decision boundary is the surface $\theta^T x = 0$, which is linear in the features (a line/hyperplane separating the two classes).

(b) Why not squared-error cost

If we plug the sigmoid $h_\theta(x)$ into the squared-error cost $\tfrac{1}{2m}\sum_i \big(h_\theta(x^{(i)})-y^{(i)}\big)^2$, the resulting $J(\theta)$ is non-convex in $\theta$ (it can have local minima and flat regions), so gradient descent can get stuck and may not reach the global optimum. Instead, logistic regression uses the log-loss (cross-entropy) cost:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big]$$

which is convex, is derived from maximum likelihood, and penalises confident wrong predictions heavily, so gradient descent is not trapped in spurious local minima.
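A small sketch of the hypothesis, the log-loss and gradient descent on it is given below. The data set, learning rate and step count are hypothetical choices; the gradient used, $\frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})\,x^{(i)}$, is the standard result for this cost and is not derived in the answer above. Each input carries a leading 1 so that $\theta_0$ acts as the intercept.

```python
from math import exp, log


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def predict_proba(theta, x):
    """h_theta(x) = sigmoid(theta^T x), read as P(y = 1 | x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def log_loss(theta, X, y):
    """Cross-entropy cost averaged over the m examples."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total += yi * log(h) + (1 - yi) * log(1 - h)
    return -total / len(X)


def gradient_step(theta, X, y, alpha):
    """One batch gradient-descent step on the log-loss."""
    m = len(X)
    grads = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        error = predict_proba(theta, xi) - yi
        for j, xij in enumerate(xi):
            grads[j] += error * xij / m
    return [t - alpha * g for t, g in zip(theta, grads)]


# Hypothetical 1-D data (first component is the constant 1); classes overlap slightly.
X = [(1.0, 0.5), (1.0, 1.0), (1.0, 1.5), (1.0, 2.0), (1.0, 2.5), (1.0, 3.0)]
y = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
initial_loss = log_loss(theta, X, y)
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.5)
final_loss = log_loss(theta, X, y)

print("loss:", initial_loss, "->", final_loss)
print("decision boundary at x =", -theta[0] / theta[1])  # where theta^T x = 0
```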

Question 10 (short answer, 6 marks)

Describe k-fold cross-validation and explain how it gives a more reliable estimate of model performance than a single train/test split. What is the role of a separate validation set versus the test set?

k-fold cross-validation. The dataset is randomly split into $k$ equal-sized folds. The model is trained $k$ times; in each round one fold is held out as the validation set and the other $k-1$ folds are used for training. The $k$ performance scores are then averaged to estimate model performance (common choices are $k=5$ or $k=10$).

Fold 1: [TEST ][train][train][train][train]
Fold 2: [train][TEST ][train][train][train]
...      -> average the k scores

Why more reliable than a single split. A single train/test split gives a performance estimate that depends heavily on which points happened to land in the test set, so it has high variance. k-fold uses every example for both training and validation (each point is tested exactly once), so the averaged score is more stable, less biased by an unlucky split, and uses the data more efficiently — especially valuable for small datasets.
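The splitting and averaging can be sketched in a few lines. The helper names, the fixed shuffle seed, the toy target values and the "predict the training mean" model are all assumptions made for illustration; folds are built by striding through the shuffled indices, so their sizes differ by at most one.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k rounds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate_mean_model(ys, k):
    """Average validation MSE of a trivial model that predicts the training mean."""
    scores = []
    for train, validation in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train) / len(train)
        mse = sum((ys[i] - prediction) ** 2 for i in validation) / len(validation)
        scores.append(mse)
    return sum(scores) / k


# Hypothetical target values for ten examples.
ys = [3.0, 5.0, 4.0, 6.0, 5.5, 4.5, 7.0, 3.5, 6.5, 5.0]

splits = list(k_fold_indices(n=len(ys), k=5))
for train, validation in splits:
    print("train:", sorted(train), "validation:", sorted(validation))
print("mean validation MSE:", cross_validate_mean_model(ys, k=5))
```

Each index appears in exactly one validation fold, which is the property that makes the averaged score use every example for both training and validation.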

Validation set vs test set.

  • Validation set: used during development to tune hyperparameters and select models. The model is repeatedly evaluated on it, so it indirectly influences the model.
  • Test set: held out completely until the very end and used once to report an unbiased estimate of final generalisation performance. Keeping it untouched prevents the optimistic bias that arises if the data used for tuning is also used to report the final score.

Question 11 (short answer, 6 marks)

Compare partitional (k-means) clustering with hierarchical (agglomerative) clustering. Explain what a dendrogram represents and how it is used to decide the number of clusters.

Partitional (k-means) vs hierarchical (agglomerative) clustering.

| Aspect | k-means (partitional) | Agglomerative (hierarchical) |
|---|---|---|
| Approach | Divides data into $k$ flat, non-overlapping clusters at once | Bottom-up: starts with each point as its own cluster, repeatedly merges the two closest clusters |
| Need $k$ in advance? | Yes, $k$ must be specified | No; cut the tree afterwards |
| Output | A single flat partition | A full hierarchy of nested clusters |
| Complexity | Efficient, ~$O(nkt)$ | Costlier, ~$O(n^2\log n)$ or $O(n^3)$ |
| Result | Depends on initial centroids; can change between runs | Deterministic for a given linkage |

Dendrogram. A tree diagram produced by agglomerative clustering. The leaves are individual data points; each internal node represents a merge, and the height of a merge equals the distance (dissimilarity) at which the two clusters were joined. Merges low in the tree (short vertical links) mean very similar items merged early; merges high in the tree (tall vertical links) mean dissimilar groups merged late.

Choosing the number of clusters. Draw a horizontal cut across the dendrogram; the number of vertical lines it crosses gives the number of clusters. The cut is placed at a height that crosses the longest vertical gap (the largest jump in merge distance), since cutting there separates the most well-distinguished groups.

Question 12 (short answer, 5 marks)

Write short notes on any two of the following: (a) Support Vector Machine and the concept of margin; (b) The vanishing gradient problem in deep neural networks; (c) Feature scaling and its importance in gradient-based learning.

Answering any two of the three.

(a) Support Vector Machine and margin

An SVM is a supervised classifier that finds the optimal separating hyperplane between two classes. Among all hyperplanes that separate the data, it chooses the one with the maximum margin, i.e. the largest distance to the nearest training points of either class. Those closest points are the support vectors, and they alone define the boundary. A larger margin gives better generalisation. The margin width is $\tfrac{2}{\lVert w\rVert}$, so SVM minimises $\tfrac{1}{2}\lVert w\rVert^2$ subject to correct classification; soft margins and the kernel trick extend it to noisy and non-linearly-separable data.

(b) Vanishing gradient problem

In deep networks trained by backpropagation, gradients are products of many derivatives chained across layers. With saturating activations like sigmoid/tanh, each derivative is at most $1$ and usually much smaller (sigmoid's maximum derivative is $0.25$; tanh's is $1$, reached only at zero input), so as gradients propagate back through many layers they shrink exponentially toward zero. For example, ten sigmoid layers scale the gradient by at most $0.25^{10} \approx 10^{-6}$. The early layers then receive almost no gradient and learn extremely slowly or not at all, making deep networks hard to train. Remedies: ReLU activations, careful weight initialisation (Xavier/He), batch normalisation, and residual (skip) connections.

(c) Feature scaling and its importance

Feature scaling transforms features to a comparable range, e.g. normalisation to $[0,1]$, $x' = \frac{x-x_{\min}}{x_{\max}-x_{\min}}$, or standardisation $x' = \frac{x-\mu}{\sigma}$ (zero mean, unit variance). It matters in gradient-based learning because if features have very different scales, the cost surface is elongated, so gradient descent zig-zags and converges slowly; scaling makes the contours more circular, speeding and stabilising convergence. It is also essential for distance-based methods (k-NN, k-means, SVM) so no single large-range feature dominates.
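Both transformations are short to write out. In the sketch below the income values are hypothetical, and standardisation uses the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead is an equally common choice.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalise to [0, 1]: x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Standardise to zero mean and unit variance: x' = (x - mu) / sigma."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature with a large range (e.g. annual income).
incomes = [20000.0, 35000.0, 50000.0, 80000.0]

normalised = min_max_scale(incomes)
standardised = standardise(incomes)
print("min-max:     ", normalised)
print("standardised:", standardised)
```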


Frequently asked questions

How many marks is the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2078 paper?
The BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2078 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.