Probability Engine · MDS 503

Statistical Computing with R: the questions likely to come

122 analyzed questions from 14 past papers (4 board exams, 2078-2082), grouped by syllabus unit — each with its probability, how often it's been asked, and where to study the answer.

Papers analyzed: 14 (incl. 4 board exams · 2078-2082)

Analyzed questions: 122 (across 6 syllabus units)

Board marks from repeats: 25% (questions asked before)

Units = 80% of marks

study these first

Model answers for this subject are being written. Every question links to its original paper so you can study from the source meanwhile.

U4 · Q1/14 · 2082 · 6 marks

R Software for Supervised Learning

Do the following in R Studio using "airquality" dataset with R markdown to knit PDF output: a) Perform Shapiro-Wilk test on "Wind" variable and check normality of this variable b) Perform Bartlett test on "Wind" variable by "Month" variable and check equality of variance c) Fit 1-way ANOVA to compare "Wind" variable by "Month" variable and interpret the result carefully d) Fit the TukeyHSD post-hoc test with 95% confidence interval and interpret the result carefully

40% · Possible to appear · Appeared in 2 of the last 2 board papers

MODEL ANSWER · U4 · 6 marks

Normality, variance equality, ANOVA and post-hoc on `airquality$Wind`

data(airquality)
aq <- airquality
aq$Month <- factor(aq$Month)   # treat Month as a grouping factor

a) Shapiro-Wilk test for normality of Wind

shapiro.test(aq$Wind)
# W = 0.98575, p-value = 0.1178

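Hypotheses: H0: Wind is normally distributed; H1: it is not. Since p = 0.118 > 0.05, we fail to reject H0, so Wind can be treated as approximately normal. ANOVA's normality assumption formally applies to the model residuals, so this check on the raw variable is supporting evidence rather than a full verification.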
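The normality assumption can also be checked on the residuals of the one-way model. The following is a minimal sketch under that assumption, not part of the exam question; the object names `fit`, `res` and `sw` are illustrative.

```r
# Sketch: check normality on the ANOVA residuals, not only on raw Wind.
aq <- airquality
aq$Month <- factor(aq$Month)

fit <- aov(Wind ~ Month, data = aq)   # same model form as in part c)
res <- residuals(fit)

sw <- shapiro.test(res)               # confirmatory test on the residuals
print(sw)

# Suggestive plot: points close to the red line support normality.
qqnorm(res, main = "Q-Q plot of ANOVA residuals")
qqline(res, col = "red")
```

If the residual test also gives p > 0.05, the normality conclusion in part a) carries over to the ANOVA in part c).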

b) Bartlett test for homogeneity of variance

bartlett.test(Wind ~ Month, data = aq)
# Bartlett's K-squared = 3.74, df = 4, p-value = 0.4422

Hypotheses: H0: the variances of Wind are equal across the five months; H1: at least one variance differs. With p = 0.44 > 0.05, we fail to reject H0, so the variances can be treated as homogeneous and the equal-variance assumption of one-way ANOVA holds.
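Bartlett's test is sensitive to non-normality, so a cross-check can be useful. The sketch below prints the group variances, draws a suggestive boxplot and runs the Fligner-Killeen test from base R as a rank-based alternative; it is an optional extra beyond what the question asks, and the names `grp_var` and `fl` are illustrative.

```r
# Sketch: cross-check the equal-variance assumption for Wind by Month.
aq <- airquality
aq$Month <- factor(aq$Month)

# Per-month sample variances: similar magnitudes suggest homogeneity.
grp_var <- tapply(aq$Wind, aq$Month, var)
print(round(grp_var, 2))

# Suggestive plot: boxes of similar height indicate similar spread.
boxplot(Wind ~ Month, data = aq, main = "Wind by Month")

# Fligner-Killeen test: H0 is equal variances, as in Bartlett's test,
# but it is less sensitive to departures from normality.
fl <- fligner.test(Wind ~ Month, data = aq)
print(fl)
```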

c) One-way ANOVA: Wind by Month

model <- aov(Wind ~ Month, data = aq)
summary(model)
#              Df Sum Sq Mean Sq F value  Pr(>F)
# Month         4  164.3   41.07   3.529 0.00879 **
# Residuals   148 1722.3   11.64

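Interpretation: F(4, 148) = 3.53, p = 0.0088 < 0.05, so we reject H0 (all month means are equal): at least one month's mean Wind speed differs from the others. ANOVA alone does not say which months differ; the post-hoc test in part d) answers that.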
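A significant F says nothing about how large the month effect is. One common way to quantify it is eta-squared, the share of the total sum of squares explained by Month. The sketch below computes it from the ANOVA table; the effect-size step is an addition for illustration and the names `tab`, `ss` and `eta_sq` are assumptions.

```r
# Sketch: effect size (eta-squared) for the one-way ANOVA of Wind by Month.
aq <- airquality
aq$Month <- factor(aq$Month)

fit <- aov(Wind ~ Month, data = aq)
tab <- summary(fit)[[1]]        # ANOVA table as a data frame

ss <- tab[["Sum Sq"]]           # c(between-month SS, residual SS)
eta_sq <- ss[1] / sum(ss)       # proportion of variance explained by Month
print(round(eta_sq, 3))
```

A small eta-squared alongside a significant p-value means the month differences are real but explain only a modest part of the variation in Wind.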

d) Tukey HSD post-hoc (95% CI)

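tukey <- TukeyHSD(model, conf.level = 0.95)   # compute once, reuse below
tukey                                         # diff, lwr, upr and p adj per pair
plot(tukey)                                   # intervals crossing 0 are not significant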

Interpretation: Tukey's test compares all 10 pairs of months with family-wise 95% confidence intervals. A pair differs significantly when its adjusted p-value (p adj) is below 0.05, which is equivalent to its confidence interval excluding 0. In this data the significant contrasts are May vs July and May vs August (rows 7-5 and 8-5): May is the windiest month and July and August the calmest, and these pairs drive the overall ANOVA result. Pairs whose intervals span 0 do not differ significantly.
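Rather than reading the significant pairs off the printed table by eye, they can be filtered programmatically. A minimal sketch, assuming the same model as above; the names `tk`, `m` and `sig` are illustrative.

```r
# Sketch: list only the month pairs that differ significantly.
aq <- airquality
aq$Month <- factor(aq$Month)

fit <- aov(Wind ~ Month, data = aq)
tk  <- TukeyHSD(fit, conf.level = 0.95)

m   <- tk$Month                              # matrix: diff, lwr, upr, p adj
sig <- m[m[, "p adj"] < 0.05, , drop = FALSE]
print(round(sig, 3))

# Month means help to read the direction of each difference.
print(round(tapply(aq$Wind, aq$Month, mean), 2))
```

Each row name has the form "later month-earlier month", so a negative diff in row 7-5 means July is calmer than May.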

AI-generated answer · View in 2082 paper →

Coming Next · 6 marks

U4 · R Software for Supervised Learning

Do the followings in R Studio using "mtcars" dataset with R markdown to knit PDF output: a) Divide the data into train and test datasets with 70:30 random splits and your roll number as random seed b) Fit a supervised linear regression model and KNN regression model on train data with "mpg" as dependent variable and all other variables as independent variable c) Predict the miles per gallon variable in the test data using these models and get values for "wt=6000 lbs" d) Compare the fit indices (R-square, MSE, RMSE) of the predicted models and choose the best model

25% · Outside chance to appear
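One way to approach this question with base R only is sketched below. The seed 123 stands in for a roll number, the KNN regression is hand-written (k = 3 on standardised predictors) because no KNN package is assumed to be installed, and the prediction for a 6000 lb car sets the other predictors to their training medians; all three are assumptions made for illustration. In `mtcars`, `wt` is recorded in units of 1000 lbs, so 6000 lbs is `wt = 6`.

```r
# Sketch: 70:30 split, linear regression vs hand-rolled KNN regression on mtcars.
set.seed(123)                                   # hypothetical roll number
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Linear regression with all other variables as predictors.
lm_fit  <- lm(mpg ~ ., data = train)
lm_pred <- predict(lm_fit, newdata = test)

# KNN regression: standardise with training statistics, average k nearest mpg.
x_cols  <- setdiff(names(mtcars), "mpg")
mu      <- colMeans(train[, x_cols])
s       <- apply(train[, x_cols], 2, sd)
x_train <- scale(train[, x_cols], center = mu, scale = s)
x_test  <- scale(test[, x_cols],  center = mu, scale = s)

knn_predict <- function(x_new, k = 3) {
  d <- sqrt(rowSums(sweep(x_train, 2, x_new)^2))  # Euclidean distances
  mean(train$mpg[order(d)[1:k]])
}
knn_pred <- apply(x_test, 1, knn_predict)

# Fit indices on the test data.
fit_indices <- function(obs, pred) {
  mse <- mean((obs - pred)^2)
  c(R2 = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2),
    MSE = mse, RMSE = sqrt(mse))
}
res <- rbind(LM  = fit_indices(test$mpg, lm_pred),
             KNN = fit_indices(test$mpg, knn_pred))
print(round(res, 3))

# Prediction for wt = 6 (6000 lbs), other predictors at training medians.
new_car <- as.data.frame(t(sapply(train[, x_cols], median)))
new_car$wt <- 6
print(predict(lm_fit, newdata = new_car))
```

The model with the lower test RMSE and higher test R-square is the better predictor; with only 10 test cars the comparison is sensitive to the seed.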

Question Priority · U4 · ranked by appearance likelihood, study top-down

R Software for Supervised Learning

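★ TOP PICK · airquality normality, Bartlett, ANOVA and TukeyHSD question (analyzed above) · 6 marks · 40%

mtcars linear regression vs KNN regression question (shown under Coming Next) · 6 marks · 25%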

Do the following in R Studio using "USArrests" dataset with R script:

a) Divide the mtcars data into train and test datasets with 70:30 random splits b) Fit a supervised linear regression model and KNN regression model on train data with "Urban population – UrbanPop" as dependent variable and all other variables as independent variable c) Predict the UrbanPop variable in the test datasets using these two models and interpret results carefully d) Compare the fit indices (R-square, MSE, RMSE) of the two predicted models and choose the best model

6 marks · 23%

Do the following in R Studio with R script:

a) Create a dataset with following variables: age (18-99 years), sex (male/female), educational levels (No education/Primary/Secondary/Beyond secondary), socio-economic status (Low, Middle, High) and body mass index (14 – 38) with random 250 cases of each variable. Your exam roll number must be used to set the random seed. b) Create scatterplot of age and body mass index variable and interpret it carefully. c) Which correlation coefficient must be used based on the interpretation of the scatterplot? Why? d) Compute the best correlation coefficient identified from the scatterplot and interpret it carefully. e) Test whether this correlation coefficient is statistically valid or not and justify its value.

OR

Do the following in R Studio with R script:

a) Create a dataset with following variables: age (18-99 years), sex (male/female), educational levels (No education/Primary/Secondary/Beyond secondary), socio-economic status (Low, Middle, High) and body mass index (14 – 38) with random 250 cases of each variable. Your exam roll number must be used to set the random seed. b) Check if body mass index variable follows normal distribution using suggestive plot and confirmative tests and interpret the results carefully. c) Check if body mass index variables have equal variance for sex variable using suggestive plot and confirmatory test and interpret the results carefully. d) Which independent sample t-test must be used to compare body mass index by sex? Why? e) Perform the independent sample t-test identified above and interpret it carefully.

6 marks · 20%
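Both versions of this question start from the same simulated dataset. The sketch below builds it and runs the core step of each version. The seed 123 stands in for a roll number, the uniform sampling of age and BMI is an assumption about what "random" means here, and the choice of Spearman's coefficient assumes the scatterplot shows no linear pattern.

```r
# Sketch: simulated dataset, then correlation (first version) and t-test (second).
set.seed(123)                                   # hypothetical roll number
n <- 250
dat <- data.frame(
  age = sample(18:99, n, replace = TRUE),
  sex = factor(sample(c("male", "female"), n, replace = TRUE)),
  edu = factor(sample(c("No education", "Primary", "Secondary", "Beyond secondary"),
                      n, replace = TRUE),
               levels = c("No education", "Primary", "Secondary", "Beyond secondary")),
  ses = factor(sample(c("Low", "Middle", "High"), n, replace = TRUE),
               levels = c("Low", "Middle", "High")),
  bmi = round(runif(n, 14, 38), 1)
)

# First version: scatterplot, then a rank correlation with its significance test.
plot(dat$age, dat$bmi, xlab = "Age (years)", ylab = "BMI")
ct <- cor.test(dat$age, dat$bmi, method = "spearman", exact = FALSE)
print(ct)

# Second version: variance check decides between Student's and Welch's t-test.
boxplot(bmi ~ sex, data = dat)
vt <- var.test(bmi ~ sex, data = dat)
tt <- t.test(bmi ~ sex, data = dat, var.equal = vt$p.value > 0.05)
print(tt)
```

Because age and BMI are generated independently, a correlation near 0 with a non-significant p-value is the expected outcome.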

Do the following in R Studio using "mtcars" dataset with R script:

a) Divide the mtcars data into train and test datasets with 70:30 random splits. b) Fit a supervised logistic regression model and naïve bayes classification models on train data with transmission (am) as dependent variable and miles per gallon (mpg), displacement (disp), horse power (hp) and weight (wt) as independent variable. c) Predict the transmission (am) variable in the test data for both the models and interpret the result carefully. d) Get the confusion matrix, sensitivity, specificity of both the models using predicted transmission variable on test data and interpret them carefully. e) Which supervised classification model is the best for doing prediction? Why?

6 marks · 20%

Do the following in R Studio with R script so that it can be knitted as PDF:

a) Prepare a data with 100 random observations and two variables: miles per gallon (mpg) with random range between 10 to 50 and transmission gears (gear) as random binary variable (3=3 gear, 4=four gear and 5=five gears), do not forget to use your class roll number as random seed to replicate the result b) Perform goodness-of-fit test on miles per gallon (mpg) variable to check if it follows normal distribution or not c) Perform goodness-of-fit test on miles per gallon (mpg) variable to check if the variances of mpg are equal or not on gears variable categories d) Perform the best 1-way analysis of variance test based on goodness-of-fit results with justification e) Can you use this test for this data? Interpret the result carefully, if applicable.

6 marks · 19%
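The logic of part d) is that the assumption checks decide which one-way test to run. A minimal sketch of that decision rule, assuming a 0.05 threshold, a placeholder seed of 123 and uniform random mpg values; the branch structure is one reasonable convention, not the only accepted one.

```r
# Sketch: choose the one-way test from the goodness-of-fit results.
set.seed(123)                                   # hypothetical roll number
dat <- data.frame(
  mpg  = runif(100, min = 10, max = 50),
  gear = factor(sample(c(3, 4, 5), 100, replace = TRUE))
)

norm_p <- shapiro.test(dat$mpg)$p.value                  # normality of mpg
var_p  <- bartlett.test(mpg ~ gear, data = dat)$p.value  # equal variances by gear

if (norm_p > 0.05 && var_p > 0.05) {
  result <- summary(aov(mpg ~ gear, data = dat))         # classical one-way ANOVA
} else if (norm_p > 0.05) {
  result <- oneway.test(mpg ~ gear, data = dat,
                        var.equal = FALSE)               # Welch ANOVA
} else {
  result <- kruskal.test(mpg ~ gear, data = dat)         # non-parametric alternative
}
print(c(normality_p = norm_p, variance_p = var_p))
print(result)
```

Uniformly generated values often fail the Shapiro-Wilk test at this sample size, in which case the Kruskal-Wallis branch is the one to justify in the answer.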

Do the followings in R Studio using R script so that it can be knitted as PDF:

b) Prepare a data with 200 random observations and four variables: miles per gallon (mpg) with random range between 10 to 50; transmission (am) as random binary variable (0=automatic, 1=Manual), weight (wt) with random range of 1 to 10 and horse power (hp) with random range of 125 and 400, do not forget to use your exam roll number as random seed to replicate the result c) Divide this data into train and test datasets with 70:30 random splits with your exam roll number as random seed for replication d) Fit a supervised linear regression model for the train data e) Explain the model fit and BLUE coefficients for the fitted model f) Predict the mpg variable in the test data, get fit indices and interpret them carefully

6 marks · 19%

Do the following in R Studio with R script so that it can be knitted as PDF:

a) Prepare a data with four random variables and 300 observations: miles per gallon (mpg) with random range between 10 to 50; transmission (am) as random binary variable (0=automatic, 1=Manual), weight (wt) with random range of 1 to 10 and horse power (hp) with random range of 125 and 400, do not forget to use your exam roll number as random seed to replicate the result b) Divide this data into train and test datasets with 80:20 random splits with your exam roll number as random seed for replication c) Fit a supervised logistic regression model on train data with transmission (am) as dependent variable and miles per gallon (mpg), horse power (hp) and weight (wt) as independent variable d) Predict the transmission variable in the test data and interpret the predicted result carefully e) Get the confusion matrix, sensitivity, specificity of the predicted model and interpret them carefully

6 marks · 19%
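A sketch of the logistic regression workflow in this question, using base R only. The seed 123 stands in for a roll number, the variables are drawn uniformly (and `am` as a fair coin), and 0.5 is assumed as the classification cut-off; none of these choices is prescribed by the question.

```r
# Sketch: simulated data, 80:20 split, logistic regression, confusion matrix.
set.seed(123)                                   # hypothetical roll number
n <- 300
dat <- data.frame(
  mpg = runif(n, 10, 50),
  am  = rbinom(n, 1, 0.5),                      # 0 = automatic, 1 = manual
  wt  = runif(n, 1, 10),
  hp  = runif(n, 125, 400)
)

idx   <- sample(n, size = 0.8 * n)
train <- dat[idx, ]
test  <- dat[-idx, ]

fit  <- glm(am ~ mpg + hp + wt, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

# Classify with a 0.5 cut-off; fixed levels keep the table 2 x 2.
pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))
obs  <- factor(test$am, levels = c(0, 1))
cm   <- table(Predicted = pred, Observed = obs)
print(cm)

sens <- cm["1", "1"] / sum(cm[, "1"])           # true positive rate
spec <- cm["0", "0"] / sum(cm[, "0"])           # true negative rate
print(c(sensitivity = sens, specificity = spec))
```

Since `am` is generated independently of the predictors, sensitivity and specificity near 0.5 are expected, which is itself a point worth making in the interpretation.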

Explain the following concepts with examples: a) Decision Tree b) Support Vector Machine

3 marks · 25%

Explain the following concept with focus on R software:

a) Leverage in linear regression with example b) Multicollinearity in logistic regression with example

3 marks · 23%

Describe decision tree classification model with focus on:

a) Bagging b) Improved bagging c) Boosting

3 marks · 20%

Explain the following concepts with focus on R software:

a) Test of normality b) Parametric tests c) Residual analysis

3 marks · 20%

Explain the following concepts with examples focusing on R software:

a) Correlation b) Parametric tests c) Non-parametric tests

3 marks · 19%

Compare following model with focus on R software:

a) Naïve Bayes and Support Vector Machine b) Decision Tree and Random Forest c) Feed-forward and feed-backward neural network

3 marks · 19%

03 · The mock

Sit a probable paper

A full mock exam built from the most likely questions, mirroring the real paper's structure. Every slot is a real past question.

Most Probable Paper

Mirrors the real structure · 45 marks · based on 5 past papers

Group A

1.
Explain the following concepts with examples: a) Biplot from principal component analysis b) Biplot from classical multidimensional scaling
[3 marks]
R Software for Unsupervised Learning · Very likely · from 2082 paper →
This question has recurred in 5 of 5 years, including 3× in the board exam (2079 to 2082), and its topic (R Software for Unsupervised Learning) appears in 100% of years.
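Both biplots can be illustrated with base R. The sketch below uses the built-in `USArrests` data as an example dataset (an assumption, the question names none). For the MDS plot, variable arrows are drawn from the correlations between each variable and the two MDS axes, which is one common way to turn an MDS map into a biplot; the arrow scaling factor of 2 is arbitrary.

```r
# Sketch: PCA biplot and a classical MDS biplot on standardised USArrests.
pca <- prcomp(USArrests, scale. = TRUE)
biplot(pca, cex = 0.6, main = "PCA biplot")

# Classical (metric) MDS on Euclidean distances between standardised rows.
d   <- dist(scale(USArrests))
mds <- cmdscale(d, k = 2)

plot(mds, type = "n", xlab = "Dimension 1", ylab = "Dimension 2",
     main = "Classical MDS biplot")
text(mds, labels = rownames(USArrests), cex = 0.6)

# Variable arrows: correlation of each variable with the two MDS axes.
load <- cor(USArrests, mds)
arrows(0, 0, load[, 1] * 2, load[, 2] * 2, length = 0.08, col = "red")
text(load * 2.2, labels = rownames(load), col = "red", cex = 0.8)
```

With Euclidean distances, the classical MDS coordinates match the principal component scores up to sign, so the two maps show the same configuration of states.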
2.
Explain the following concepts with examples: a) Decision Tree b) Support Vector Machine
[3 marks]
R Software for Supervised Learning · Very likely · from 2082 paper →
Asked once (2082), including 1× in the board exam (2082), and its topic (R Software for Supervised Learning) appears in 100% of years.
3.
Explain the following concept with focus on R software:

a) Leverage in linear regression with example b) Multicollinearity in logistic regression with example
[3 marks]
R Software for Supervised Learning · Very likely · from 2081 paper →
Asked once (2081), including 1× in the board exam (2081), and its topic (R Software for Supervised Learning) appears in 100% of years.
4.
Describe decision tree classification model with focus on:

a) Bagging b) Improved bagging c) Boosting
[3 marks]
R Software for Supervised Learning · Very likely · from 2080 paper →
This question appeared 2× (same year), including 1× in the board exam (2080), and its topic (R Software for Supervised Learning) appears in 100% of years.
5.
Explain the following concepts with focus on R software:

a) Test of normality b) Parametric tests c) Residual analysis
[3 marks]
R Software for Supervised Learning · Very likely · from 2080 paper →
Asked once (2080), including 1× in the board exam (2080), and its topic (R Software for Supervised Learning) appears in 100% of years.

Group B

1.
Do the following in R Studio using "airquality" dataset with R markdown to knit PDF output: a) Perform Shapiro-Wilk test on "Wind" variable and check normality of this variable b) Perform Bartlett test on "Wind" variable by "Month" variable and check equality of variance c) Fit 1-way ANOVA to compare "Wind" variable by "Month" variable and interpret the result carefully d) Fit the TukeyHSD post-hoc test with 95% confidence interval and interpret the result carefully
[6 marks]
R Software for Supervised Learning · Very likely · from 2082 paper →
This question has recurred in 2 of 5 years, including 2× in the board exam (2081 to 2082), and its topic (R Software for Supervised Learning) appears in 100% of years.
2.
Do the following in R Studio with R script so that it can be knitted as PDF:

a) Prepare a column vector of miles per gallon (mpg) variable with random range between 10 to 50 of 500 values, do not forget to use your exam roll number as random seed to replicate the result a) Plot histogram of this "mpg" variable and interpret it carefully b) Refine the histogram by filling the bars with "blue" color and changing number of bins to 8 c) Add a vertical abline at the arithmetic mean of the mpg variable d) Plot Q-Q plot of mpg variable, add normal Q-Q line of red color on it and interpret it carefully e) Plot density plot of mpg variable without the border, fill it with yellow color and interpret it

OR

Use the "ggplot2" package and do as follow in R studio:

a) Define first layer of the ggplot object with diamond data, carat as x-axis and price as y-axis b) Add layer with geometric aesthetic as "point", statistics and position as "identity" c) Add layers with scale of y and x variables as continuous d) Add layer with coordinate system as Cartesian e) Add layer with appropriate title and interpret the resulting graph carefully
[6 marks]
R Software for Data Summary and Visualization · Very likely · from 2079 paper →
This question has recurred in 2 of 5 years, including 1× in the board exam (2079), and its topic (R Software for Data Summary and Visualization) appears in 100% of years.
3.
Do the followings in R Studio using "mtcars" dataset with R markdown to knit PDF output: a) Divide the data into train and test datasets with 70:30 random splits and your roll number as random seed b) Fit a supervised linear regression model and KNN regression model on train data with "mpg" as dependent variable and all other variables as independent variable c) Predict the miles per gallon variable in the test data using these models and get values for "wt=6000 lbs" d) Compare the fit indices (R-square, MSE, RMSE) of the predicted models and choose the best model
[6 marks]
R Software for Supervised Learning · Very likely · from 2082 paper →
Asked once (2082), including 1× in the board exam (2082), and its topic (R Software for Supervised Learning) appears in 100% of years.
4.
Do the following in R Studio using "USArrests" dataset with R script:

a) Divide the mtcars data into train and test datasets with 70:30 random splits b) Fit a supervised linear regression model and KNN regression model on train data with "Urban population – UrbanPop" as dependent variable and all other variables as independent variable c) Predict the UrbanPop variable in the test datasets using these two models and interpret results carefully d) Compare the fit indices (R-square, MSE, RMSE) of the two predicted models and choose the best model
[6 marks]
R Software for Supervised Learning · Very likely · from 2081 paper →
Asked once (2081), including 1× in the board exam (2081), and its topic (R Software for Supervised Learning) appears in 100% of years.
5.
Do the following in R Studio with R script:

a) Create a dataset with following variables: age (18-99 years), sex (male/female), educational levels (No education/Primary/Secondary/Beyond secondary), socio-economic status (Low, Middle, High) and body mass index (14 – 38) with random 250 cases of each variable. Your exam roll number must be used to set the random seed. b) Create scatterplot of age and body mass index variable and interpret it carefully. c) Which correlation coefficient must be used based on the interpretation of the scatterplot? Why? d) Compute the best correlation coefficient identified from the scatterplot and interpret it carefully. e) Test whether this correlation coefficient is statistically valid or not and justify its value.

OR

Do the following in R Studio with R script:

a) Create a dataset with following variables: age (18-99 years), sex (male/female), educational levels (No education/Primary/Secondary/Beyond secondary), socio-economic status (Low, Middle, High) and body mass index (14 – 38) with random 250 cases of each variable. Your exam roll number must be used to set the random seed. b) Check if body mass index variable follows normal distribution using suggestive plot and confirmative tests and interpret the results carefully. c) Check if body mass index variables have equal variance for sex variable using suggestive plot and confirmatory test and interpret the results carefully. d) Which independent sample t-test must be used to compare body mass index by sex? Why? e) Perform the independent sample t-test identified above and interpret it carefully.
[6 marks]
R Software for Supervised Learning · Very likely · from 2080 paper →
Asked once (2080), including 1× in the board exam (2080), and its topic (R Software for Supervised Learning) appears in 100% of years.

04 · The receipts

Behind the numbers

The raw evidence the predictions are computed from: marks per unit per year, syllabus weights, trends, and coverage.

The receipt: marks per unit, per year

Each row is a syllabus unit, each column an exam year, each cell the marks that unit earned that year.

[Heatmap: marks per unit (U4, U3, U5, U1, U2) by exam year (2078, 2079, 2080, 2081, 2082) with a Total column; shading runs from none through few to many.]

#	Syllabus unit	Probability	Appeared	Avg marks	Syllabus weight	Exam vs syllabus	Trend	Questions
1	U4 · R Software for Supervised Learning	Very likely (80%)	2079, 2080, 2081, 2082	4.8	21% (10 lecture hrs)	Over-examined (exam 30% · syllabus 21%)	Steady	1 recurring, 14 total
2	U3 · R Software for Data Summary and Visualization	Very likely (80%)	2079, 2080, 2081, 2082	4.9	21% (10 lecture hrs)	Over-examined (exam 28% · syllabus 21%)	Steady	none repeat, 11 total
3	U5 · R Software for Unsupervised Learning	Very likely (80%)	2079, 2080, 2081, 2082	5	17% (8 lecture hrs)	Balanced (exam 13% · syllabus 17%)	Rising	1 recurring, 4 total
4	U1 · R Software for Basic Programming	Very likely (80%)	2079, 2080, 2081, 2082	3	17% (8 lecture hrs)	Balanced (exam 14% · syllabus 17%)	Steady	none repeat, 5 total
5	U2 · R Software for Data Manipulation	Very likely (60%)	2079, 2080, 2081	3	12% (6 lecture hrs)	Balanced (exam 16% · syllabus 12%)	Steady	none repeat, 3 total

2078 · 3 sittings: first assessment, first reassessment, second assessment

2079 · 1 sitting: board

2080 · 4 sittings: board, fa reassessment, first assessment, second assessment

2081 · 4 sittings: board, first assessment, first reassessment, second assessment

2082 · 2 sittings: board, second assessment

Study smart, not hard

Studying the top 4 of 6 units in priority order covers ~87% of all observed marks, past the ~80% line.

Lecture time vs exam marks

Where the exam pays more than the curriculum spends: ● lectures vs ● exam marks, as a share of the whole course. A long teal-leading bar = high-yield unit.

U4 · R Software for Supervised Learning: 21% of lectures → 30% of marks · high yield

U3 · R Software for Data Summary and Visualization: 21% of lectures → 28% of marks · high yield

U2 · R Software for Data Manipulation: 12% of lectures → 16% of marks

U1 · R Software for Basic Programming: 17% of lectures → 14% of marks

U5 · R Software for Unsupervised Learning: 17% of lectures → 13% of marks

U6 · R Software for Communication: 12% of lectures → 0% of marks · low yield
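The comparison above can be reduced to a single number per unit. The sketch below divides each unit's exam share by its lecture share, using the percentages listed above; the ratio is an illustrative measure, and the page's own "high yield" and "low yield" labels may rest on a different rule.

```r
# Sketch: yield ratio = share of exam marks / share of lecture time.
units <- data.frame(
  unit          = c("U4", "U3", "U2", "U1", "U5", "U6"),
  lecture_share = c(21, 21, 12, 17, 17, 12),   # % of lectures, from the list above
  exam_share    = c(30, 28, 16, 14, 13, 0)     # % of marks, from the list above
)

units$yield <- round(units$exam_share / units$lecture_share, 2)
units <- units[order(-units$yield), ]          # highest yield first
print(units, row.names = FALSE)
```

A ratio above 1 means the unit earns more marks than its lecture time would suggest, and a ratio below 1 means the opposite.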

2082 B.S.