Statistical Data Analysis Assignment

The document provides instructions for a practical assignment for a course. It includes 3 problems to be completed by the due date of December 2nd, 2023. Each problem has several questions and the assignment makes up 30% of the final course grade.

Uploaded by

calev28828

University of Bahrain

College of Science
Postgraduate Programs

Practical Assignment

Information
The Master Program: M.Sc. in Big Data Science and Analytics
Course Title & Code: Statistical Data Analysis, BDSA 602
Academic Semester: First Semester 2023 – 2024
Instructor’s Name: Dr. name redacted

Instructions

• Please note that the due date for this practice is SATURDAY 2ND DECEMBER 2023.
• The three problems of this assignment correspond to the material of Chapters 2–4.
• This practice consists of three problems where each has several questions.
• This practical assignment will make up 30% of your final course grade.
• Write your answer for each question under the question directly.

Student’s Name: Name redacted

Student’s I.D.#: Id redacted

PROBLEM 1    PROBLEM 2    PROBLEM 3    Maximum Mark
30           20           10           60

BDSA 602 Practical Assignment 2023 – 2024 Page 1 of 17


PROBLEM 1: STATISTICAL LEARNING [ 30 POINTS ]

Question 1 [ 04 points ]

The following statements compare flexible statistical methods with non-flexible statistical methods. Indicate whether
each statement is true or false.

Statement 1: When the relationship between the predictors and response is highly non-linear, it is expected to obtain better performance for the flexible statistical methods than non-flexible statistical methods.
True / False: True

Statement 2: When the number of predictors p is extremely large and the sample size n is small, it is expected to obtain better performance for the flexible statistical methods than the non-flexible methods.
True / False: False

Statement 3: When the sample size n is extremely large and the number of predictors p is small, it is expected to obtain better performance for the flexible statistical learning methods than the non-flexible methods.
True / False: True

Statement 4: When the variance of the error terms $\sigma^2 = \mathrm{Var}(\varepsilon)$ is extremely high, it is expected to obtain better performance for the flexible statistical method than the non-flexible methods.
True / False: False

Question 2 [ 04 points ]

For each regression problem, indicate whether the interest is in inference or prediction. Then provide the sample size
n and the number of predictors p for each problem.

Statement 1: You collect a set of data on the top 800 firms in the US. For each firm you record profit, number of employees, industry, and the CEO salary. You are interested in understanding which factors affect CEO salary.
Inference or Prediction: Inference
Response: CEO salary
n: 800
p: 3

Statement 2: You are interested in predicting the change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence you collect weekly data for all of 2010 and 2011. For each week you record the change in the USD/Euro, the change in the US market and the change in the British market.
Inference or Prediction: Prediction
Response: USD/Euro exchange rate
n: 104 (1 year = 52 weeks)
p: 2



Question 3 [ 04 points ]

Consider using College data from the ISLR package. The dataset contains several variables for different universities and colleges in the US. The variables in the dataset are listed in the table below.

Variable       Definition
Private        Public/Private indicator
Apps           Number of applications received
Accept         Number of applicants accepted
Enroll         Number of new students enrolled
Top10perc      New students from top 10% of high school class
Top25perc      New students from top 25% of high school class
F.Undergrad    Number of full-time undergraduates
P.Undergrad    Number of part-time undergraduates
Outstate       Out-of-state tuition
Room.Board     Room and board costs
Books          Estimated book costs
Personal       Estimated personal spending
PhD            Percent of faculty with Ph.D.’s
Terminal       Percent of faculty with terminal degree
S.F.Ratio      Student/Faculty ratio
perc.alumni    Percent of alumni who donate
Expend         Instructional expenditure per student
Grad.Rate      Graduation rate



(A) [ 2 PTS ] Use the plot() function to produce side-by-side boxplots of Outstate versus Private. Compare the

average and the variability of the out-of-state tuition between private and public colleges.
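
One way to produce the requested plot is sketched below. It assumes the ISLR package is installed (the block is skipped otherwise); the names outstate_summary and the group summary itself are illustrative additions, not part of the assignment.

```r
# Sketch: side-by-side boxplots of Outstate by Private (College data, ISLR).
if (requireNamespace("ISLR", quietly = TRUE)) {
  data("College", package = "ISLR")

  # Private is a factor, so plot() draws one boxplot per level.
  plot(College$Private, College$Outstate,
       xlab = "Private", ylab = "Out-of-state tuition",
       main = "Outstate by Private")

  # Numeric companion to the plot: group mean and standard deviation.
  outstate_summary <- aggregate(
    Outstate ~ Private, data = College,
    FUN = function(v) c(mean = mean(v), sd = sd(v))
  )
  print(outstate_summary)
} else {
  outstate_summary <- NULL  # ISLR not installed; nothing to plot
}
```

The centre line of each box gives the median, and the box height and whisker length give a visual measure of variability, which is what the comparison in this question asks for.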



(B) [ 2 PTS ] Create scatterplots using the scatter.smooth function to investigate the relationships between the
response out-of-state tuition (Outstate) and the following six predictors: Enroll, Room.Board, Terminal,
perc.alumni, Expend, and Grad.Rate. Which of these predictors seems to have an upward linear relationship with
the response?

It seems that Outstate has a clear upward linear relationship with Room.Board, perc.alumni,
Expend, and Grad.Rate.
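
The scatterplots for this part could be drawn as in the sketch below. It assumes the ISLR package is installed; the 2-by-3 panel layout and the variable name predictors are choices made for illustration.

```r
# Sketch: one scatter.smooth() panel per predictor against Outstate.
predictors <- c("Enroll", "Room.Board", "Terminal",
                "perc.alumni", "Expend", "Grad.Rate")

if (requireNamespace("ISLR", quietly = TRUE)) {
  data("College", package = "ISLR")

  old_par <- par(mfrow = c(2, 3))  # six panels on one page
  for (p in predictors) {
    # scatter.smooth() adds a loess curve to the scatterplot.
    scatter.smooth(College[[p]], College$Outstate,
                   xlab = p, ylab = "Outstate",
                   main = paste("Outstate vs", p))
  }
  par(old_par)  # restore the previous graphics settings
}
```

A roughly straight, rising loess curve in a panel suggests an upward linear relationship; a curve that flattens or bends suggests a non-linear one.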



Question 4 [ 05 points ]

Consider using the Auto dataset, which is part of the ISLR package in R. The data contains information for 392
vehicles. Load the data and make sure that the missing values have been removed from the data.

library(ISLR)
attach(Auto)
str(Auto)
?Auto # this will open help page

Variable        Definition
mpg             Gas mileage (miles per gallon)
cylinders       Number of cylinders between 4 and 8
displacement    Engine displacement (cubic inches)
horsepower      Engine horsepower
weight          Vehicle weight (pounds)
acceleration    Time to accelerate from 0 to 60 miles per hour
year            Model year
origin          Origin of car (1. American, 2. European, 3. Japanese)
name            Vehicle name

(A) [ 3 PTS ] Find the range, mean and standard deviation (use the range(), mean() and sd() functions) of the
response variable (round your answers to 3 decimal places).

Variable    Range        Mean      Standard Deviation
mpg         9 to 46.6    23.446    7.805
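
The values in the table can be checked with the functions named in the question. This is a minimal sketch that assumes the ISLR package is installed; mpg_range and mpg_summary are illustrative names.

```r
# Sketch: range, mean and standard deviation of mpg in the Auto data.
if (requireNamespace("ISLR", quietly = TRUE)) {
  data("Auto", package = "ISLR")
  Auto <- na.omit(Auto)  # guard against any remaining missing values

  mpg_range   <- range(Auto$mpg)
  mpg_summary <- c(mean = mean(Auto$mpg), sd = sd(Auto$mpg))

  print(mpg_range)
  print(round(mpg_summary, 3))
} else {
  mpg_summary <- NULL  # ISLR not installed
}
```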

(B) [ 1 PTS ] Present a histogram for the response variable and describe its distribution.

library(ggplot2) # you might need to install the package


ggplot(data=Auto,aes(x=mpg))+
geom_histogram(color="red",fill="cyan",alpha=0.3)

The distribution is right-skewed: most cars have an mpg well below 40, and as the mpg increases, fewer cars
fall into each bin. Interestingly, the decrease in count is not uniform. About every 5 mpg, the count rises
and falls alternately.



(C) [ 1 PTS ] Specify the origin of the vehicles with gas mileage exceeding 45 miles per gallon.
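
A possible way to answer this part is to filter the data and translate the numeric origin code using the variable table in this question. The sketch assumes the ISLR package is installed; high_mpg and origin_labels are illustrative names.

```r
# Sketch: list vehicles with mpg above 45 together with their origin.
if (requireNamespace("ISLR", quietly = TRUE)) {
  data("Auto", package = "ISLR")

  high_mpg <- Auto[Auto$mpg > 45, c("name", "mpg", "origin")]

  # origin is coded 1 = American, 2 = European, 3 = Japanese.
  origin_labels <- c("American", "European", "Japanese")
  high_mpg$origin_label <- origin_labels[high_mpg$origin]

  print(high_mpg)
} else {
  high_mpg <- NULL  # ISLR not installed
}
```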



Question 5 [ 06 points ]

The table below provides training data containing six observations on three predictors and one qualitative
response variable. Suppose you wish to make a prediction for Y when the test point is $X_1 = X_2 = X_3 = 0$
using K-nearest neighbors.

Observation    X1    X2    X3    Y
1               0     3     0    Red
2               2     0     0    Red
3               0     1     3    Red
4               0     1     2    Green
5              -1     0     1    Green
6               1     1     1    Red

(A) [ 3 PTS ] Compute the Euclidean distance between each observation and the test point.

Observation    X1    X2    X3    Euclidean Distance
1               0     3     0    3
2               2     0     0    2
3               0     1     3    3.16
4               0     1     2    2.24
5              -1     0     1    1.41
6               1     1     1    1.73

(B) [ 1 PTS ] What is our prediction with K=1 ?


Green (Observation 5 is the closest)

(C) [ 1 PTS ] What is our prediction with K=3?

Red (Observations 2, 5, and 6 are the three nearest neighbors, and the majority vote is Red)
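
The distances and both predictions can be verified in base R. This is a minimal sketch using the six training observations from the question; knn_predict is a hypothetical helper written for this illustration, not a library function.

```r
# Training data from the question and the test point (0, 0, 0).
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
test_point <- c(0, 0, 0)

# Euclidean distance from every observation to the test point.
X <- as.matrix(train[, c("X1", "X2", "X3")])
distances <- sqrt(rowSums(sweep(X, 2, test_point)^2))

# Majority vote among the k nearest observations.
knn_predict <- function(k) {
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train$Y[nearest])
  names(votes)[which.max(votes)]
}

print(round(distances, 2))
print(knn_predict(1))  # K = 1
print(knn_predict(3))  # K = 3
```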

(D) [ 1 PTS ] If the corresponding Bayes Decision Boundary is highly nonlinear, then would you expect the best
value for the parameter K to be large or small?

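I would expect a small K to be best. A small K gives a more flexible classifier (lower bias, at the cost of higher variance), which is better at capturing a highly nonlinear boundary; a large K smooths the boundary toward a linear one.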



Question 6 [ 07 points ]

For a given quantitative response $Y$ and a single predictor $X \in [-3, 5]$, consider the following regression model:

$$Y = f(X) + \varepsilon$$

where $f(X) = X^3 - 3X^2 - 6X + 8$ is the true underlying curve and $\varepsilon$ is an unobservable random variable independent
of the predictor with mean zero and variance equal to $\sigma^2$. The true regression curve $f$ is unknown and hence you need
to estimate it by $\hat{f}$ using statistical learning methods. This curve estimate is then used to predict the response $Y$ at a
new value of the predictor $X = x$, and the performance of this prediction is assessed by the test mean squared error
(test MSE).

Since this is a simulation study where the true underlying regression curve is known, we can generate multiple
training data sets as well as test data. For this study, you are given the training data $\{(x_i, y_i) : i = 1, \dots, n\}$ and the test data
$\{(x_i^*, y_i^*) : i = 1, \dots, n\}$. You are requested to fit the K-nearest neighbor (KNN) regression model to the training data
and then to use the fitted model to provide predictions for the response at the given predictor values in the test data.
This procedure of model fitting and prediction has been repeated M = 100 times to assess the model performance.
This simulation study enables you to assess the performance of the fitted KNN regression models by two measures,
training MSE and test MSE, as shown in the following figures:

[Figure: two panels, training MSE and test MSE, each plotted against K]



(A) [ 3 PTS ] Write down the decomposition of the test MSE (don’t derive it). Name the three terms involved in the
decomposition of the test MSE.

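$$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)$$

The three terms are the variance of the estimate $\hat{f}$, the squared bias of $\hat{f}$, and the irreducible error $\mathrm{Var}(\varepsilon) = \sigma^2$.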

(B) [ 1 PTS ] Comment on the behaviour of training MSE.

The training MSE starts very low when K is small. When K is small, the KNN model follows the training
data very closely as it has low bias. As K increases, the fit becomes smoother and the model becomes underfit,
so the training MSE increases.

(C) [ 1 PTS ] Comment on the behaviour of test MSE.

At the smallest values of K, the test MSE is relatively high, which means the model is overfit. As K increases,
the test MSE decreases and reaches a minimum at K=3. When K increases further, the test MSE rises sharply
again, which means the model is underfit.

(D) [ 1 PTS ] Comment on the variation of MSE.

The spread of both the training MSE and the test MSE across the M = 100 repetitions widens as K increases,
which shows that the confidence around the MSE estimate becomes lower at higher values of K.

(E) [ 1 PTS ] What is the optimal value of K?

K=3 as it gives the least test MSE
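
The setup of this question can be imitated with a small base R simulation. This is only a sketch of a single replication (the assignment repeats the procedure M = 100 times), and several values are assumptions because the text does not state them: n = 100, sigma = 2, a uniform design on [-3, 5], and the grid of K values. The helper knn_reg is hypothetical and written for this illustration.

```r
set.seed(1)

# True curve from the question.
f <- function(x) x^3 - 3 * x^2 - 6 * x + 8

n     <- 100  # assumed sample size
sigma <- 2    # assumed error standard deviation

# Draw one data set: X uniform on [-3, 5], Y = f(X) + noise.
simulate <- function(n) {
  x <- runif(n, min = -3, max = 5)
  data.frame(x = x, y = f(x) + rnorm(n, sd = sigma))
}

# KNN regression: average the responses of the k nearest training points.
knn_reg <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    mean(y_train[nearest])
  })
}

train <- simulate(n)
test  <- simulate(n)

k_values <- c(1, 3, 5, 10, 25, 50)  # assumed grid of K
mse <- t(sapply(k_values, function(k) {
  fit_train <- knn_reg(train$x, train$y, train$x, k)
  fit_test  <- knn_reg(train$x, train$y, test$x, k)
  c(K = k,
    train = mean((train$y - fit_train)^2),
    test  = mean((test$y - fit_test)^2))
}))

print(round(mse, 2))
```

With K = 1 each training point is its own nearest neighbor, so the training MSE is exactly zero, while the test MSE is not. Wrapping the last steps in a loop over M replications would give the spread of MSE values discussed in part (D).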



PROBLEM 2: LINEAR REGRESSION [ 20 POINTS ]

Question 1 [ 07 points ]

Consider a regression problem for a quantitative response $y$ with $p$ predictors $x_1, \dots, x_5$. The model is given by:

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_5 X_5 + \varepsilon$$

where $(\beta_0, \beta_1, \dots, \beta_5)$ are the model parameters and $\varepsilon$ is the unobservable random variable (independent of the
predictors) with mean zero and constant variance. Suppose that you are given the training data
$\{(x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}, y_i) : i = 1, \dots, 200\}$ and you are requested to fit a multiple normal linear regression model to
this training data, that is, you consider fitting the model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \varepsilon_i \quad \text{with } \varepsilon_i \sim N(0, \sigma^2) \text{ for } i = 1, \dots, 200$$

The following print-out represents a summary of the fitted model:

Residuals:
Min 1Q Median 3Q Max
-36.735 -10.745 -1.279 9.280 35.996

Coefficients:
Estimate Standard Error t value Pr(>|t|)
(Intercept) -69.0254 29.0410 -2.377 0.018440 *
x1 0.4033 0.0681 5.921 1.43e-08 ***
x2 0.1750 0.2051 0.853 0.394730
x3 1.2435 0.4278 2.907 0.004080 **
x4 -0.0201 0.0456 -0.442 0.659340
x5 -9.3941 2.1557 -4.358 2.13e-05 ***
-------------------------------------------------------------------------------------------
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.25 on 194 degrees of freedom


Multiple R-squared: 0.2834, Adjusted R-squared: 0.2649
F-statistic: 15.34 on 5 and 194 DF, p-value: 1.058e-12



(A) [ 3 PTS ] Find the average response value when $x_1 = 290$, $x_2 = 21$, $x_3 = 65$, $x_4 = 120$ and $x_5 = 1$.

$\hat{y} = -69.0254 + (0.4033 \times 290) + (0.1750 \times 21) + (1.2435 \times 65) + (-0.0201 \times 120) + (-9.3941 \times 1)$
$\hat{y} = 120.63$
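
The arithmetic can be checked in R by multiplying the coefficient estimates from the print-out by the given predictor values. The vector names coefs and new_x are illustrative.

```r
# Coefficient estimates copied from the model summary.
coefs <- c("(Intercept)" = -69.0254, x1 = 0.4033, x2 = 0.1750,
           x3 = 1.2435, x4 = -0.0201, x5 = -9.3941)

# Predictor values, with a leading 1 for the intercept.
new_x <- c(1, 290, 21, 65, 120, 1)

y_hat <- sum(coefs * new_x)
print(round(y_hat, 2))  # 120.63
```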

(B) [ 2 PTS ] The variance inflation factor (VIF) is calculated for each predictor as shown below. Is there evidence
for the existence of multicollinearity? Justify your answer.

vif(model)

Gestation    MotherAge    MotherHeight    MotherWeight    MotherSmoker
1.067149     1.069631     1.280405        1.268033        1.048545

Usually, a VIF above 5 to 10 indicates collinearity. Since these values are all close to 1 and well below that range,
there is no evidence of multicollinearity.
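
Because the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the others, the reported values can be converted back to R² as a sanity check. The sketch below uses only the numbers printed above; the threshold of 5 is the lower end of the rule of thumb quoted in the answer.

```r
# VIF values copied from the output above.
vif_values <- c(Gestation = 1.067149, MotherAge = 1.069631,
                MotherHeight = 1.280405, MotherWeight = 1.268033,
                MotherSmoker = 1.048545)

# VIF_j = 1 / (1 - R2_j)  =>  R2_j = 1 - 1 / VIF_j
r_squared <- 1 - 1 / vif_values
print(round(r_squared, 3))

# TRUE would flag a predictor above the rule-of-thumb threshold.
print(any(vif_values > 5))
```

Every implied R² is below 0.25, so no predictor is well explained by the others.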

(C) [ 2 PTS ] Use the diagnostic plots below to assess the normality assumption of the model residuals.

In a histogram of standardized residuals, a bell-shaped histogram indicates a normal distribution. Since
we see a bell curve that is slightly skewed, the residuals are approximately normal with slight skewness.

If the residuals are normally distributed, the points in a Q-Q plot should fall approximately along the dashed line. The plot
here shows some deviations at the ends of the line, which indicates minor violations of normality.



Overall, the residuals are normally distributed with minor deviations.

Question 2 [ 13 points ]

Consider a regression problem for the response $y$ with 2 predictors $x_1$ and $x_2$. The model is given by:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

where $(\beta_0, \beta_1, \beta_2)$ are the model parameters and $\epsilon$ is the error term which is an unobservable random variable
(independent of the predictors) with mean zero and constant variance. The variables used are listed below.

Variable* Description
Weight The baby's birth weight in ounces.
Gestation The duration of pregnancy in days.
Smoking The indicator for mother’s smoking status (1=smoker, 0=non-smoker).
* This is a reduced version of the original dataset taken from Stat Labs by Nolan and Speed from the Child Health and Development Studies
conducted at the Oakland, CA, Kaiser Foundation Hospital.

Then the training dataset $\{(x_{i1}, x_{i2}, y_i) : i = 1, \dots, 200\}$ has been used to get the following least-squares estimates:
Coefficients:
Estimate Std. Error t value
(Intercept) 2.8029 18.8752 0.1485
Gestation 0.4383 0.0675 6.4901
Smoking -8.5512 2.1561 ?
---------------------------------------------------------------------------
Residual standard error: 14.51 on 197 degrees of freedom



(A) [ 3 PTS ] Write down the regression equation of the red line using the resulting R print-out above. State the
effect of mother’s smoking status on the baby’s birth weight.

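$\text{Weight} = \beta_0 + \beta_1 \, \text{Gestation} + \beta_2 \, \text{Smoking}$

$\text{Weight} = 2.8029 + 0.4383 \times \text{Gestation} - 8.5512 \times \text{Smoking}$

The coefficient for Smoking is -8.5512, which implies that, for the same gestation period, the birth weight of babies born
to mothers who smoke is on average 8.5512 ounces less than that of babies born to mothers who do not smoke.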

(B) [ 3 PTS ] Find the average birth weight in ounces for a baby who was delivered after a gestational period of
50 weeks to a mother who smokes.

$\text{Weight} = \beta_0 + \beta_1 \, \text{Gestation} + \beta_2 \, \text{Smoking}$

$\text{Weight} = 2.8029 + 0.4383 \times (50 \times 7) - 8.5512 \times 1$
$\text{Weight} = 147.6567$
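
The same calculation in R, as a quick check. Gestation is recorded in days, so the 50 weeks are converted first; the variable names are illustrative.

```r
# Least-squares estimates from the print-out.
b0       <- 2.8029
b_gest   <- 0.4383
b_smoke  <- -8.5512

gestation_days <- 50 * 7  # 50 weeks expressed in days
smoking        <- 1       # mother is a smoker

weight_hat <- b0 + b_gest * gestation_days + b_smoke * smoking
print(weight_hat)
```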

(C) [ 3 PTS ] Calculate the 99% confidence interval for the parameter $\beta_1$ and use the calculated confidence interval
to test the hypothesis $H_0 : \beta_1 = 0$ at the 1% level of significance (use 2.35 as the critical value).
Margin of error = critical value × standard error of $\hat{\beta}_1$
Margin of error = 2.35 × 0.0675
Margin of error = 0.158625
Lower bound = 0.4383 – margin of error = 0.279675
Upper bound = 0.4383 + margin of error = 0.596925
Since 0 is not between the lower and upper bounds, we reject the null hypothesis.
This implies that gestation has a significant effect on the baby’s weight.
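
The interval and the test decision can be reproduced as follows. The critical value 2.35 is the one prescribed by the question, not one computed from the t distribution; the variable names are illustrative.

```r
estimate  <- 0.4383  # coefficient of Gestation
std_error <- 0.0675
critical  <- 2.35    # critical value given in the question

margin <- critical * std_error
ci <- estimate + c(-1, 1) * margin
print(ci)

# Reject H0: beta1 = 0 when the interval does not contain 0.
reject_h0 <- ci[1] > 0 || ci[2] < 0
print(reject_h0)
```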

(D) [ 4 PTS ] Write down the value of the RSS and the missing t-value.

RSS = RSE² × degrees of freedom
RSS = 14.51² × 197
RSS = 41476.4

t-value = estimate / standard error
t-value = -8.5512 / 2.1561 = -3.97
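
Both quantities follow directly from the print-out, since RSE = sqrt(RSS / degrees of freedom). A minimal check in R, with illustrative variable names:

```r
rse <- 14.51  # residual standard error
df  <- 197    # residual degrees of freedom

# RSE = sqrt(RSS / df)  =>  RSS = RSE^2 * df
rss <- rse^2 * df
print(round(rss, 1))

# Missing t value for Smoking: estimate divided by its standard error.
t_smoking <- -8.5512 / 2.1561
print(round(t_smoking, 2))
```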



PROBLEM 3: CLASSIFICATION [ 10 POINTS ]

Question 1 [ 04 points ]

Suppose that you collected data for a group of students in a graduate statistics class with two predictors defined as
$X_1 = \text{hours studied}$, $X_2 = \text{undergrad GPA}$, and the response $Y = \text{receive an A}$. You fit a logistic regression and
produce coefficient estimates: $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, $\hat{\beta}_2 = 1$.

(A) [ 2 PTS ] Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets
an A in the graduate statistics class.

(B) [ 2 PTS ] How many hours would the student with an undergrad GPA of 3.5 need to study to have a 50% chance
of getting an A in the graduate statistics class?
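
Both parts follow from the logistic model $p = \dfrac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}}$. The sketch below evaluates it with the estimates given in the question; plogis() is the logistic function in base R, and the variable names are illustrative.

```r
b0 <- -6
b1 <- 0.05
b2 <- 1

# (A) Probability of an A with 40 hours of study and a GPA of 3.5.
prob_a <- plogis(b0 + b1 * 40 + b2 * 3.5)
print(round(prob_a, 4))

# (B) A 50% chance means the linear predictor equals 0:
#     b0 + b1 * hours + b2 * 3.5 = 0  =>  hours = -(b0 + b2 * 3.5) / b1
hours_for_half <- -(b0 + b2 * 3.5) / b1
print(hours_for_half)
```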



Question 2 [ 02 points ]

Suppose that you wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on $X$, last
year’s percent profit. You examine many companies and discover that the mean value of $X$ for companies that issued
a dividend was $\bar{X} = 10$, while the mean for those that didn’t was $\bar{X} = 0$. In addition, the variance of $X$ for these two
sets of companies was $\sigma^2 = 36$. Finally, 80% of companies issued dividends. If $X$ follows a normal distribution,
predict the probability that a company will issue a dividend this year given that its percentage profit was $X = 4$ last
year. Hint: You will need to use the Bayes’ Theorem and the formula of the normal density function.

P(Dividend_issued) = 0.8

P(Dividend_not_issued) = 0.2

$$P(\text{Dividend issued} \mid X = 4) = \frac{P(X = 4 \mid \text{Dividend issued}) \, P(\text{Dividend issued})}{P(X = 4)}$$
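
The formula can be evaluated numerically by using normal densities for the two class-conditional terms and expanding the denominator over both classes. This is a sketch with the values stated in the question; the variable names are illustrative.

```r
prior_yes <- 0.8
prior_no  <- 0.2
sigma     <- sqrt(36)  # common standard deviation

# Normal densities of X = 4 under each class.
lik_yes <- dnorm(4, mean = 10, sd = sigma)
lik_no  <- dnorm(4, mean = 0,  sd = sigma)

# Bayes' theorem: the denominator P(X = 4) sums over both classes.
posterior_yes <- prior_yes * lik_yes /
  (prior_yes * lik_yes + prior_no * lik_no)
print(round(posterior_yes, 3))  # 0.752
```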



Question 3 [ 04 points ]

Suppose that 1000 test observations are available and that the confusion matrix obtained by using the
Bayes Classifier is shown below. Calculate the metrics below if class 1 represents the positive class.

Misclassification Error
$$\frac{477 + 17}{483 + 23 + 477 + 17} = 0.494$$

Model Accuracy
$$\frac{483 + 23}{483 + 23 + 477 + 17} = 0.506$$

Specificity
$$\frac{23}{23 + 477} = 0.046$$

Sensitivity
$$\frac{483}{483 + 17} = 0.966$$

[Figure: confusion matrix of True Class (Positive Class 1, Negative Class 0) against Predicted Class (0, 1)]
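
The four metrics can be recomputed from the cell counts. The counts below are inferred from the calculations in this answer (the matrix itself is not reproduced in the text), so treat the assignment of 483, 17, 23 and 477 to cells as an assumption.

```r
# Cell counts inferred from the worked calculations (assumption).
tp <- 483  # true class 1, predicted 1
fn <- 17   # true class 1, predicted 0
tn <- 23   # true class 0, predicted 0
fp <- 477  # true class 0, predicted 1

total <- tp + fn + tn + fp

metrics <- c(
  misclassification = (fp + fn) / total,
  accuracy          = (tp + tn) / total,
  specificity       = tn / (tn + fp),
  sensitivity       = tp / (tp + fn)
)
print(round(metrics, 3))
```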

