
Section A: Long Answer Questions

Attempt all / any as specified.

4 questions
Question 1 (long, 14 marks)

(a) Define supervised learning and clearly distinguish it from unsupervised learning with one suitable example of each. (4)

(b) Consider a simple linear regression model $h_\theta(x) = \theta_0 + \theta_1 x$. Derive the closed-form (normal equation) solution for the parameters $\theta_0$ and $\theta_1$ by minimizing the mean squared error cost function $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$. (7)

(c) The following data is given for hours studied ($x$) and marks obtained ($y$): $(1,2),(2,4),(3,5),(4,4),(5,6)$. Fit the best-fit line using the result derived in part (b) and predict the marks for a student who studies for 6 hours. (3)

(a) Supervised vs Unsupervised Learning (4)

Supervised learning trains a model on a labelled dataset $\{(x^{(i)}, y^{(i)})\}$, where each input $x^{(i)}$ has a known target output $y^{(i)}$. The goal is to learn a mapping $h_\theta(x) \approx y$ that generalizes to unseen inputs.

Unsupervised learning works on unlabelled data $\{x^{(i)}\}$ and discovers hidden structure (groupings, density, low-dimensional representation) without any target output.

| Aspect | Supervised | Unsupervised |
|---|---|---|
| Data | Labelled $(x, y)$ | Unlabelled $(x)$ |
| Goal | Predict $y$ | Find structure |
| Example | Email spam classification (spam/ham labels) | Customer segmentation via clustering |

(b) Normal Equation for Simple Linear Regression (7)

For $h_\theta(x) = \theta_0 + \theta_1 x$, minimize

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2.$$

Set the partial derivatives to zero.

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^{(i)} - y^{(i)}) = 0$$

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^{(i)} - y^{(i)})\,x^{(i)} = 0$$

These give the normal equations:

$$m\theta_0 + \theta_1\sum x^{(i)} = \sum y^{(i)}$$

$$\theta_0\sum x^{(i)} + \theta_1\sum (x^{(i)})^2 = \sum x^{(i)} y^{(i)}$$

Solving this $2\times 2$ system (using means $\bar{x},\bar{y}$): dividing the first equation by $m$ gives $\theta_0 = \bar{y} - \theta_1\bar{x}$; substituting this into the second equation and solving for $\theta_1$ gives

$$\boxed{\theta_1 = \frac{\sum (x^{(i)}-\bar{x})(y^{(i)}-\bar{y})}{\sum (x^{(i)}-\bar{x})^2} = \frac{m\sum x y - \sum x \sum y}{m\sum x^2 - (\sum x)^2}}$$

$$\boxed{\theta_0 = \bar{y} - \theta_1 \bar{x}}$$

(c) Fitting the Data (3)

Data: $(1,2),(2,4),(3,5),(4,4),(5,6)$, with $m=5$.

  • $\sum x = 15,\ \sum y = 21,\ \sum xy = 1\cdot2+2\cdot4+3\cdot5+4\cdot4+5\cdot6 = 2+8+15+16+30 = 71$
  • $\sum x^2 = 1+4+9+16+25 = 55$
  • $\bar{x}=3,\ \bar{y}=4.2$

$$\theta_1 = \frac{5(71) - 15(21)}{5(55) - 15^2} = \frac{355 - 315}{275 - 225} = \frac{40}{50} = 0.8$$

$$\theta_0 = 4.2 - 0.8(3) = 4.2 - 2.4 = 1.8$$

Best-fit line: $\hat{y} = 1.8 + 0.8x$.

Prediction at $x=6$: $\hat{y} = 1.8 + 0.8(6) = 1.8 + 4.8 = \mathbf{6.6}$ marks.
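The arithmetic in part (c) can be checked with a short script. This is a minimal sketch, not part of the model answer: the function name `fit_line` is made up for illustration, and it simply evaluates the closed-form expressions from part (b) on the five given points.

```python
def fit_line(points):
    """Closed-form simple linear regression: returns (theta0, theta1)."""
    m = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    # Slope and intercept from the normal equations.
    theta1 = (m * sum_xy - sum_x * sum_y) / (m * sum_xx - sum_x ** 2)
    theta0 = sum_y / m - theta1 * sum_x / m
    return theta0, theta1

data = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 6)]
theta0, theta1 = fit_line(data)
prediction = theta0 + theta1 * 6  # marks predicted for 6 hours of study
print(round(theta0, 3), round(theta1, 3), round(prediction, 3))  # 1.8 0.8 6.6
```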

Tags: supervised-learning, regression
Question 2 (long, 14 marks)

(a) Explain the working principle of a decision tree classifier. Define entropy and information gain, and explain how they are used to select the splitting attribute at each node. (6)

(b) A dataset of 14 training examples for the target PlayTennis (Yes/No) contains 9 Yes and 5 No instances. The attribute Outlook has three values: Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No), and Rain (3 Yes, 2 No). Compute the entropy of the whole dataset and the information gain of the attribute Outlook. (6)

(c) State two advantages and two limitations of decision trees compared to other classifiers. (2)

(a) Decision Tree Classifier, Entropy and Information Gain (6)

A decision tree is a tree-structured classifier where each internal node tests an attribute, each branch is an outcome of that test, and each leaf assigns a class label. Starting at the root, a record follows branches according to its attribute values until it reaches a leaf. The tree is built top-down, recursively, choosing at each node the attribute that best separates the classes.

Entropy measures the impurity of a set $S$ with class proportions $p_i$:

$$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i.$$

It is $0$ for a pure node and maximal when classes are equally mixed.

Information Gain of attribute $A$ is the expected reduction in entropy after splitting on $A$:

$$\mathrm{IG}(S,A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v).$$

At each node the algorithm (ID3) selects the attribute with the highest information gain as the splitting attribute.

(b) Entropy and Information Gain of Outlook (6)

Dataset entropy (9 Yes, 5 No, total 14):

$$\mathrm{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.4098 + 0.5305 = \mathbf{0.940}.$$

Entropy of each value of Outlook:

  • Sunny (2 Yes, 3 No): $-\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$
  • Overcast (4 Yes, 0 No): $0$ (pure)
  • Rain (3 Yes, 2 No): $0.971$

Weighted entropy after split:

$$\tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) = 0.347 + 0 + 0.347 = 0.694.$$

Information Gain:

$$\mathrm{IG}(S, \text{Outlook}) = 0.940 - 0.694 = \mathbf{0.246}.$$
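As a cross-check of the hand calculation, the following sketch computes the same quantities from the class counts. The helper names `entropy` and `information_gain` are illustrative, not from the paper. Because it carries full precision through every step, the gain prints as 0.247; the 0.246 above comes from subtracting the already-rounded values 0.940 and 0.694.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node given its class counts, e.g. [yes, no]."""
    total = sum(counts)
    # Empty classes contribute nothing (0 * log 0 is taken as 0).
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

parent = [9, 5]                      # 9 Yes, 5 No
outlook = [[2, 3], [4, 0], [3, 2]]   # Sunny, Overcast, Rain
gain = information_gain(parent, outlook)
print(round(entropy(parent), 3), round(gain, 3))  # 0.94 0.247
```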

(c) Advantages and Limitations (2)

Advantages: (1) Easy to interpret and visualize (white-box, human-readable rules); (2) Require little data preparation and handle both numeric and categorical features.

Limitations: (1) Prone to overfitting / high variance, producing unstable trees sensitive to small data changes; (2) Greedy axis-parallel splits can give suboptimal, biased trees compared to ensemble/SVM methods.

Tags: decision-trees, classification
Question 3 (long, 14 marks)

(a) Draw the architecture of a multilayer feedforward neural network with one hidden layer and explain the role of activation functions. Why is a non-linear activation function (such as sigmoid or ReLU) necessary? (6)

(b) Describe the backpropagation algorithm. Derive the weight-update rule for a single output-layer weight using gradient descent on the squared error loss. (5)

(c) Define overfitting in the context of neural networks and explain how L2 regularization (weight decay) and dropout help to reduce it. (3)

(a) Feedforward Network with One Hidden Layer (6)

Architecture (described): Three layers connected left to right —

  • Input layer: nodes $x_1, x_2, \dots, x_n$ (one per feature).
  • Hidden layer: nodes $h_1, \dots, h_p$; each receives a weighted sum of all inputs plus a bias, then applies an activation: $h_j = g\!\left(\sum_i w_{ji} x_i + b_j\right)$.
  • Output layer: nodes $\hat{y}_1, \dots, \hat{y}_k$, computed similarly from the hidden activations.

Every node in one layer connects to every node in the next (fully connected); information flows forward only.

Role of activation functions: They introduce a non-linear transformation at each neuron, controlling the firing of the neuron and enabling the network to learn complex, non-linear decision boundaries.

Why non-linearity is necessary: If all activations were linear, the composition of layers would collapse into a single linear transformation $\hat{y} = W'x + b'$, which is equivalent to a one-layer linear model and cannot represent non-linear functions (e.g. XOR). Non-linear activations such as sigmoid $\sigma(z)=\tfrac{1}{1+e^{-z}}$ or ReLU $\max(0,z)$ give the network universal approximation capability, provided the hidden layer has enough units.
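The collapse argument can be illustrated numerically. The sketch below uses small made-up integer weights (`W1`, `b1`, `W2`, `b2` are arbitrary illustrative values) and shows that two layers with an identity activation give the same output as one layer with $W' = W_2 W_1$ and $b' = W_2 b_1 + b_2$.

```python
def matvec(W, x):
    """Matrix-vector product for lists of lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Arbitrary illustrative weights: 2 inputs -> 2 hidden units -> 1 output.
W1, b1 = [[1, -2], [3, 1]], [1, 0]
W2, b2 = [[2, 1]], [-1]

def two_linear_layers(x):
    h = add(matvec(W1, x), b1)        # hidden layer with identity activation
    return add(matvec(W2, h), b2)

# Equivalent single layer: W' = W2 W1, b' = W2 b1 + b2.
W_eff = matmul(W2, W1)
b_eff = add(matvec(W2, b1), b2)

def one_linear_layer(x):
    return add(matvec(W_eff, x), b_eff)

x = [4, 5]
print(two_linear_layers(x), one_linear_layer(x))  # [6] [6]
```

Replacing the identity with a non-linear activation on `h` breaks this equivalence, which is what lets the hidden layer add representational power.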

(b) Backpropagation and Output-Weight Update (5)

Backpropagation trains the network in two passes: (1) a forward pass computes activations layer by layer and the loss; (2) a backward pass propagates the error from the output back to earlier layers using the chain rule, computing $\partial E/\partial w$ for every weight. Weights are then updated by gradient descent.

Derivation for an output-layer weight $w_{kj}$ connecting hidden unit $h_j$ to output $\hat{y}_k$, with squared error $E = \tfrac{1}{2}\sum_k (y_k - \hat{y}_k)^2$, net input $a_k = \sum_j w_{kj} h_j$, and $\hat{y}_k = g(a_k)$:

$$\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial \hat{y}_k}\cdot\frac{\partial \hat{y}_k}{\partial a_k}\cdot\frac{\partial a_k}{\partial w_{kj}} = -(y_k - \hat{y}_k)\,g'(a_k)\,h_j.$$

Define the output error term $\delta_k = (y_k - \hat{y}_k)\,g'(a_k)$, so that $\partial E/\partial w_{kj} = -\delta_k\,h_j$. The gradient-descent step $w_{kj} \leftarrow w_{kj} - \eta\,\partial E/\partial w_{kj}$ then gives the update rule:

$$\boxed{w_{kj} \leftarrow w_{kj} + \eta\,\delta_k\,h_j}$$

where $\eta$ is the learning rate.
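The update rule can be sketched for one output unit. The code below assumes a sigmoid activation, so $g'(a) = g(a)(1 - g(a))$; the weights, hidden activations, target and learning rate are made-up illustrative values, and the function name `update_output_weights` is hypothetical.

```python
from math import exp

def sigmoid(a):
    return 1.0 / (1.0 + exp(-a))

def update_output_weights(w_k, h, y_k, eta):
    """One gradient-descent step on the weights feeding output unit k."""
    a_k = sum(w * h_j for w, h_j in zip(w_k, h))      # net input
    y_hat = sigmoid(a_k)                              # unit output
    delta_k = (y_k - y_hat) * y_hat * (1.0 - y_hat)   # (y - y_hat) * g'(a)
    new_w = [w + eta * delta_k * h_j for w, h_j in zip(w_k, h)]
    return new_w, y_hat

# Illustrative values only.
w = [0.1, -0.2]
h = [0.5, 0.9]
target, eta = 1.0, 0.5

errors = []
for _ in range(5):
    w, y_hat = update_output_weights(w, h, target, eta)
    errors.append(0.5 * (target - y_hat) ** 2)
print(errors)  # squared error before each update; it shrinks step by step
```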

(c) Overfitting, L2 Regularization and Dropout (3)

Overfitting occurs when a network learns the training data (including its noise) too closely, achieving low training error but high test error — it fails to generalize.

  • L2 regularization (weight decay): adds a penalty $\frac{\lambda}{2}\sum w^2$ to the loss, shrinking weights toward zero. Smaller weights give a smoother, simpler function that is less likely to overfit.
  • Dropout: during training, randomly "drops" (sets to zero) a fraction of neurons each iteration. This prevents co-adaptation of neurons and acts like training an ensemble of sub-networks, improving generalization.

Tags: neural-networks, overfitting-regularization
Question 4 (long, 14 marks)

(a) Explain the k-means clustering algorithm step by step. State the objective function it tries to minimize. (6)

(b) Given the one-dimensional data points $\{2, 4, 10, 12, 3, 20, 30, 11, 25\}$ and $k=2$ with initial cluster centroids $\mu_1 = 2$ and $\mu_2 = 30$, perform two iterations of k-means and report the final clusters and centroids. (6)

(c) Discuss two limitations of k-means and briefly explain how the elbow method helps in choosing the value of $k$. (2)

(a) k-Means Algorithm (6)

k-means partitions data into $k$ clusters by iteratively minimizing within-cluster variance.

Steps:

  1. Choose $k$ and initialize $k$ centroids $\mu_1, \dots, \mu_k$ (randomly or by heuristic).
  2. Assignment: assign each point $x^{(i)}$ to the nearest centroid: $c^{(i)} = \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$.
  3. Update: recompute each centroid as the mean of points assigned to it: $\mu_j = \frac{1}{|C_j|}\sum_{x \in C_j} x$.
  4. Repeat steps 2–3 until assignments (or centroids) no longer change.

Objective function (distortion / WCSS):

$$J = \sum_{j=1}^{k} \sum_{x^{(i)} \in C_j} \lVert x^{(i)} - \mu_j \rVert^2.$$

(b) Two Iterations on the 1-D Data (6)

Data: $\{2,4,10,12,3,20,30,11,25\}$, $k=2$, $\mu_1=2$, $\mu_2=30$.

Iteration 1: assign to nearest centroid:

  • Closer to 2: $\{2,4,3,10,12,11\}$ (e.g. for 10: $|10-2| = 8 < |10-30| = 20$)
  • Closer to 30: $\{20,30,25\}$

Update centroids:

$$\mu_1 = \frac{2+4+3+10+12+11}{6} = \frac{42}{6} = 7, \qquad \mu_2 = \frac{20+30+25}{3} = \frac{75}{3} = 25.$$

Iteration 2: reassign with $\mu_1=7,\ \mu_2=25$:

  • Closer to 7: $\{2,4,3,10,12,11\}$ (e.g. for 12: $|12-7| = 5 < |12-25| = 13$)
  • Closer to 25: $\{20,30,25\}$ (e.g. for 20: $|20-7| = 13 > |20-25| = 5$, so 20 stays in cluster 2)

Assignments are unchanged, so centroids stay $\mu_1=7,\ \mu_2=25$: the algorithm has converged.

Final result: $C_1=\{2,3,4,10,11,12\},\ \mu_1=7$; $C_2=\{20,25,30\},\ \mu_2=25$.
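The two iterations can be reproduced with a few lines of code. This is a minimal 1-D sketch under the same starting centroids; the function name `kmeans_1d` is illustrative, and an empty cluster (which does not occur here) simply keeps its old centroid, which is one of several possible conventions.

```python
def kmeans_1d(points, centroids, iterations):
    """Run a fixed number of k-means iterations on 1-D data."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        centroids = [sum(c) / len(c) if c else mu
                     for c, mu in zip(clusters, centroids)]
    return clusters, centroids

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, centroids = kmeans_1d(data, [2, 30], iterations=2)
print([sorted(c) for c in clusters], centroids)
# [[2, 3, 4, 10, 11, 12], [20, 25, 30]] [7.0, 25.0]
```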

(c) Limitations and the Elbow Method (2)

Limitations: (1) $k$ must be specified in advance and results depend on initial centroids (can converge to a local optimum); (2) assumes spherical, equally-sized clusters and is sensitive to outliers and scaling.

Elbow method: run k-means for a range of $k$ and plot the objective $J$ (WCSS) versus $k$. $J$ decreases as $k$ grows; the "elbow", the point where the rate of decrease sharply flattens, indicates a good value of $k$, balancing compactness against the number of clusters.

Tags: clustering, unsupervised-learning

Section B: Short Answer Questions

Attempt all / any as specified.

8 questions
Question 5 (short, 6 marks)

A binary classifier produces the following confusion matrix on a test set: True Positives = 40, False Positives = 10, False Negatives = 20, True Negatives = 30. Compute the accuracy, precision, recall, and F1-score of the classifier. Briefly comment on what the precision and recall values indicate about its performance.

Given: TP $=40$, FP $=10$, FN $=20$, TN $=30$ (total $=100$).

$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN} = \frac{40+30}{100} = 0.70 \;(70\%)$$

$$\text{Precision} = \frac{TP}{TP+FP} = \frac{40}{50} = 0.80 \;(80\%)$$

$$\text{Recall} = \frac{TP}{TP+FN} = \frac{40}{60} = 0.667 \;(66.7\%)$$

$$\text{F1} = \frac{2\cdot P \cdot R}{P+R} = \frac{2(0.80)(0.667)}{0.80+0.667} = \frac{1.067}{1.467} = 0.727 \;(72.7\%)$$

Comment: Precision of 80% means that when the classifier predicts positive, it is correct 80% of the time (relatively few false alarms). Recall of 66.7% means it captures only two-thirds of the actual positives, missing one-third (20 false negatives). Thus the model is more precise than complete — it is conservative in raising positives but misses a notable share of true positives.
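The four metrics follow directly from the confusion-matrix counts, as this small sketch shows. The function name `classification_metrics` is illustrative, and it assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.7 0.8 0.667 0.727
```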

Tags: classification, model-evaluation
Question 6 (short, 6 marks)

Differentiate between regression and classification problems with one example each. Also explain why Mean Squared Error (MSE) is an appropriate loss function for regression but not directly suitable for a classification task.

Regression vs Classification

| Aspect | Regression | Classification |
|---|---|---|
| Output | Continuous numeric value | Discrete class label |
| Goal | Predict a quantity | Assign a category |
| Example | Predicting house price (Rs.) from area | Predicting whether an email is spam or not |

Why MSE suits regression but not classification:

MSE, $\frac{1}{m}\sum (y - \hat{y})^2$, measures the squared numeric distance between predicted and true values, which is meaningful when the target is a continuous quantity. It is convex in the parameters of a linear model and penalizes large errors heavily.

For classification the labels are categorical (e.g. 0/1) and outputs are probabilities. Using MSE here is unsuitable because: (1) class labels are not on a metric scale, so squared distance is not a natural error measure; (2) combined with a sigmoid output, the MSE surface is non-convex with flat regions, causing slow learning and poor convergence; and (3) it weakly penalizes confident wrong predictions. The cross-entropy loss is preferred as it is convex (for logistic regression), strongly penalizes confident mistakes, and matches the probabilistic interpretation of the output.

Tags: regression, model-evaluation
Question 7 (short, 6 marks)

Explain the bias-variance tradeoff with the help of a diagram. Relate the concepts of high bias and high variance to underfitting and overfitting respectively.

Bias-Variance Tradeoff

Bias is the error from overly simple assumptions in the model — it fails to capture the true relationship (systematic error). Variance is the error from excessive sensitivity to the particular training set — small changes in data cause large changes in the model. Expected test error decomposes as:

Error=Bias2+Variance+Irreducible noise.\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}.

Diagram (described): Plot model complexity on the x-axis and error on the y-axis.

  • The bias² curve starts high and decreases as complexity increases.
  • The variance curve starts low and increases with complexity.
  • The total test error is U-shaped: it falls, reaches a minimum, then rises. The optimal model sits at the bottom of this U.
Error
 |  \  (bias^2)         /  (variance)
 |   \                 /
 |    \      _______  /
 |     \____/  total \____/   <- U-shaped total error
 |________________________________  Model complexity
      underfit     optimal   overfit

Relation to underfitting / overfitting:

  • High bias → underfitting: the model is too simple, giving high error on both training and test data.
  • High variance → overfitting: the model is too complex, giving very low training error but high test error because it memorizes noise.

The goal is to balance the two to minimize total generalization error.

Tags: overfitting-regularization, model-evaluation
Question 8 (short, 6 marks)

Explain the working of the k-Nearest Neighbours (k-NN) algorithm. How does the choice of $k$ affect the bias and variance of the model? What is the effect of feature scaling on k-NN?

k-Nearest Neighbours (k-NN)

Working: k-NN is a non-parametric, lazy (instance-based) algorithm. There is no explicit training phase; it stores all training examples. To classify a new point $x$:

  1. Compute the distance (usually Euclidean) from $x$ to every training point.
  2. Select the $k$ closest training points (its $k$ nearest neighbours).
  3. Assign the class by majority vote of those $k$ neighbours (for regression, take their average).

Effect of $k$ on bias and variance:

  • Small $k$ (e.g. $k=1$): the decision boundary is highly flexible and follows local noise → low bias, high variance (overfitting).
  • Large $k$: predictions are smoothed over many neighbours, averaging out detail → high bias, low variance (underfitting if $k$ is too large).

Thus $k$ controls the bias-variance tradeoff and is chosen (e.g. by cross-validation) to balance them.

Effect of feature scaling: Because k-NN relies on distance, features with larger numeric ranges dominate the distance computation. Without scaling, a large-magnitude feature can overwhelm others, distorting the neighbourhood. Therefore features should be normalized/standardized (e.g. min-max or z-score) so each contributes proportionately.
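The three steps and the scaling effect can be sketched together. The toy data below is made up for illustration: the first feature separates the classes but has a small range, while the second has a large range and carries little class information. The helper names `knn_predict` and `min_max_scaler` are hypothetical, and the scaler assumes every feature has a non-zero range.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to the query."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def min_max_scaler(rows):
    """Build a function that rescales each feature to [0, 1] using the ranges in rows."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    def scale(row):
        return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
    return scale

# Made-up toy data: (small-range feature, large-range feature), label.
train = [((0.1, 500), "a"), ((0.2, 520), "a"),
         ((0.9, 505), "b"), ((0.8, 515), "b")]
query = (0.15, 508)

unscaled = knn_predict(train, query, k=3)

scale = min_max_scaler([features for features, _ in train])
scaled_train = [(scale(features), label) for features, label in train]
scaled = knn_predict(scaled_train, scale(query), k=3)

print(unscaled, scaled)  # b a
```

Without scaling, the large-range feature dominates the distances and the vote goes to "b"; after min-max scaling, the first feature counts equally and the vote goes to "a".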

Tags: classification
Question 9 (short, 6 marks)

What is k-fold cross-validation? Explain the procedure with a diagram for $k=5$ and state why it gives a more reliable estimate of model performance than a single train-test split.

k-Fold Cross-Validation

Definition: k-fold cross-validation is a resampling method that partitions the dataset into $k$ equal (or nearly equal) disjoint folds. The model is trained and validated $k$ times; each time a different fold serves as the validation set and the remaining $k-1$ folds form the training set. The $k$ validation scores are averaged to estimate performance.

Procedure for $k=5$:

  1. Shuffle and split the data into 5 folds: $F_1, F_2, F_3, F_4, F_5$.
  2. Iteration 1: train on $F_2, F_3, F_4, F_5$, validate on $F_1$.
  3. Iteration 2: train on $F_1, F_3, F_4, F_5$, validate on $F_2$. … and so on for all 5 folds.
  4. Report the mean (and standard deviation) of the 5 validation scores.

Diagram (described): Five rows, one per iteration; each row shows 5 blocks where exactly one block is marked Validation and the other four Train, with the validation block shifting one position to the right each row.

Fold:   1     2     3     4     5
Iter1: [Val][Trn][Trn][Trn][Trn]
Iter2: [Trn][Val][Trn][Trn][Trn]
Iter3: [Trn][Trn][Val][Trn][Trn]
Iter4: [Trn][Trn][Trn][Val][Trn]
Iter5: [Trn][Trn][Trn][Trn][Val]

Why more reliable than a single train-test split: Every data point is used for validation exactly once and for training in the other $k-1$ iterations, so the estimate does not depend on one lucky/unlucky split. Averaging over $k$ runs reduces the variance of the performance estimate and uses the data more efficiently, giving a more robust measure of generalization.
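The fold rotation in the diagram can be sketched as an index generator. This is a minimal illustration: the name `k_fold_indices` is made up, and shuffling is omitted so the folds are contiguous blocks as drawn above (in practice the data would be shuffled first).

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # The first n % k folds get one extra sample so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, validation
        start = stop

folds = list(k_fold_indices(n=10, k=5))
for iteration, (train, validation) in enumerate(folds, start=1):
    print(iteration, "train:", train, "validate:", validation)
```

Each index appears in exactly one validation set and in the training set of the other four iterations.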

Tags: model-evaluation
Question 10 (short, 6 marks)

Explain logistic regression for binary classification. Write the sigmoid (logistic) function, state how the decision boundary is obtained, and explain why the cross-entropy loss is preferred over squared error for training it.

Logistic Regression (Binary Classification)

Logistic regression models the probability that an input belongs to the positive class. A linear combination $z = \theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ is passed through the sigmoid (logistic) function to map it to $(0,1)$:

$$\hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{p} = P(y=1\mid x).$$

Decision boundary: Predict class 1 if $\hat{p} \ge 0.5$, else class 0. Since $\sigma(z)=0.5$ exactly when $z=0$, the decision boundary is the set where

$$\theta^T x = 0,$$

which is a linear hyperplane in feature space.

Why cross-entropy is preferred over squared error: The cross-entropy (log) loss for one example is

$$L = -\big[y\log\hat{p} + (1-y)\log(1-\hat{p})\big].$$

Combined with the sigmoid, this loss is convex in $\theta$, so gradient descent reaches the global minimum reliably. In contrast, squared error with a sigmoid output is non-convex with flat saturated regions where the gradient vanishes, causing slow learning. Cross-entropy also penalizes confident wrong predictions heavily ($-\log$ of a small probability is large), producing strong, well-scaled gradients and well-calibrated probabilities.

Tags: classification, regression
Question 11 (short, 6 marks)

Compare hierarchical clustering with partitional (k-means) clustering. Briefly explain agglomerative clustering and the role of a dendrogram in deciding the number of clusters.

Hierarchical vs Partitional (k-means) Clustering

| Aspect | Hierarchical | Partitional (k-means) |
|---|---|---|
| Output | A nested tree of clusters (dendrogram) | A single flat partition into $k$ clusters |
| Need $k$ in advance? | No (cut tree afterwards) | Yes |
| Approach | Merge/split based on inter-cluster distance | Iterative centroid assignment |
| Cost | Higher, typically $O(n^2)$ or more | Lower, roughly $O(nkt)$ |
| Reversibility | Greedy merges are not undone | Re-assigns points each iteration |

Agglomerative (bottom-up) clustering: Start with each data point as its own cluster. Repeatedly merge the two closest clusters (closeness measured by a linkage criterion — single, complete, average, or Ward linkage) until all points form one cluster. This sequence of merges builds the hierarchy.

Role of the dendrogram: A dendrogram is a tree diagram whose vertical axis shows the distance at which clusters merge. To choose the number of clusters, draw a horizontal line that cuts the dendrogram; the number of vertical lines it crosses equals the number of clusters. Cutting at a level where the merge distances jump sharply (a long vertical gap) yields well-separated, natural clusters.

Tags: clustering, unsupervised-learning
Question 12 (short, 4 marks)

Write short notes on any two of the following:

(a) Gradient descent and the role of the learning rate (b) Curse of dimensionality (c) Support Vector Machines and the concept of margin (d) Feature engineering

(Answer any two — model answers for all four are given.)

(a) Gradient descent and the learning rate. Gradient descent is an iterative optimization method that minimizes a cost $J(\theta)$ by updating parameters opposite to the gradient: $\theta \leftarrow \theta - \eta\,\nabla_\theta J(\theta)$. The learning rate $\eta$ sets the step size: too small → very slow convergence; too large → overshooting, oscillation, or divergence. A well-chosen (often decaying) $\eta$ gives stable, fast convergence to a minimum.

(b) Curse of dimensionality. As the number of features (dimensions) grows, the volume of the feature space grows exponentially, so data become sparse. Distances between points become similar, weakening distance-based methods (k-NN, clustering), and far more data is needed to generalize. Remedies include feature selection and dimensionality reduction (e.g. PCA).

(c) Support Vector Machines and margin. An SVM finds the hyperplane that separates two classes with the maximum margin — the largest distance between the boundary and the nearest training points (the support vectors). Maximizing the margin improves generalization. With the kernel trick, SVMs handle non-linearly separable data by implicitly mapping inputs to a higher-dimensional space.

(d) Feature engineering. The process of creating, transforming, and selecting input features to improve model performance — e.g. scaling/normalization, encoding categorical variables, handling missing values, creating interaction or polynomial terms, and extracting domain-specific features. Good features often matter more than the choice of algorithm.
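For note (a), the effect of the learning rate can be seen on a toy cost. The sketch below assumes $J(\theta) = \theta^2$ (gradient $2\theta$), which is a made-up example rather than anything from the paper; each update multiplies $\theta$ by $(1 - 2\eta)$, so the three learning rates show slow convergence, fast convergence, and divergence.

```python
def gradient_descent(theta, eta, steps):
    """Minimize the toy cost J(theta) = theta**2 with a fixed learning rate."""
    for _ in range(steps):
        grad = 2 * theta            # dJ/dtheta for J(theta) = theta^2
        theta = theta - eta * grad  # step opposite to the gradient
    return theta

for eta in (0.01, 0.4, 1.1):
    final = gradient_descent(theta=5.0, eta=eta, steps=20)
    print(f"eta={eta}: theta after 20 steps = {final:.6g}")
```

With `eta=0.01` the parameter is still far from the minimum at 0 after 20 steps, with `eta=0.4` it is essentially at 0, and with `eta=1.1` its magnitude grows on every step.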

Tags: supervised-learning, overfitting-regularization

Frequently asked questions

Where can I find the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) question paper 2079?
The full BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2079 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Machine Learning (PU, CMP 364) 2079 paper come with solutions?
Yes. Every question on this Machine Learning (PU, CMP 364) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2079 paper?
The BE Computer Engineering (Pokhara University) Machine Learning (PU, CMP 364) 2079 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Machine Learning (PU, CMP 364) past paper free?
Yes — reading and attempting this Machine Learning (PU, CMP 364) past paper on Kekkei is completely free.