Statistical Data Analysis Assignment
College of Science
Postgraduate Programs
Practical Assignment
Information
The Master Program: M.Sc. in Big Data Science and Analytics
Course Title & Code: Statistical Data Analysis, BDSA 602
Academic Semester: First Semester 2023 – 2024
Instructor’s Name: Dr. name redacted
Instructions
Please note that the due date for this practice is SATURDAY 2ND DECEMBER 2023.
The three problems of this assignment correspond to the material of Chapters 2–4.
This practice consists of three problems where each has several questions.
This practical assignment will make up 30% of your final course grade.
Write your answer for each question under the question directly.
60 10 20 30
Question 1 [ 04 points ]
The following statements compare flexible statistical methods with non-flexible statistical methods. Indicate whether
each statement is true or false.
When the variance of the error terms $\sigma^2 = \mathrm{Var}(\varepsilon)$ is extremely high, it is expected to obtain better
performance for the flexible statistical method than the non-flexible methods.
Answer: False
Question 2 [ 04 points ]
For each regression problem, indicate whether the interest is in inference or prediction. Then provide the sample size
n and the number of predictors p for each problem.
Statement 1: You collect a set of data on the top 800 firms in the US. For each firm you record profit, number of
employees, industry, and the CEO salary. You are interested in understanding which factors affect CEO salary.
Inference or Prediction: Inference
Response: CEO salary
n: 800
p: 3
Statement 2: You are interested in predicting the change in the USD/Euro exchange rate in relation to the weekly
changes in the world stock markets. Hence you collect weekly data for all of 2010 and 2011. For each week you record
the change in the USD/Euro, the change in the US market and the change in the British market.
Inference or Prediction: Prediction
Response: USD/Euro exchange rate
n: 104 (1 year = 52 weeks)
p: 2
average and the variability of the out-of-state tuition between private and public colleges.
It seems that Outstate has a clear upward linear relationship with Room.Board, Perc.alumni,
Expend, and Grad.Rate.
(A) [ 3 PTS ] Find the range, mean and standard deviation (use the range(), mean() and sd() functions) of the
response variable (round your answers to 3 decimal places).
(B) [ 1 PTS ] Present a histogram for the response variable and describe its distribution.
Most cars have an mpg of 40, and as the mpg increases, fewer cars fall into each category. Interestingly,
the decrease is not uniform: at every 5th mpg value, the count alternately increases and decreases.
Obs.  X1  X2  X3  Distance
1      0   3   0  3
2      2   0   0  2
3      0   1   3  3.16
4      0   1   2  2.24
5     -1   0   1  1.41
6      1   1   1  1.73
(A) [ 3 PTS ] Compute the Euclidean distance between each observation and the test point.
Red (observations 2, 5, and 6 are the three nearest neighbors, and their majority vote is Red)
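The distance column and the choice of observations 2, 5, and 6 can be checked with a short script. This is only a sketch: the test point is not stated in this part of the text, so it is assumed here to be the origin (0, 0, 0), which reproduces the listed distances. The class labels are not included because they are not given in the table.

```python
import math

# Predictor values (X1, X2, X3) for observations 1-6, copied from the table.
observations = {
    1: (0, 3, 0),
    2: (2, 0, 0),
    3: (0, 1, 3),
    4: (0, 1, 2),
    5: (-1, 0, 1),
    6: (1, 1, 1),
}

# Assumption: the test point is the origin, which matches the distance column.
test_point = (0, 0, 0)

# Euclidean distance from each observation to the test point.
distances = {obs: math.dist(x, test_point) for obs, x in observations.items()}

# The K = 3 nearest neighbors are the three observations with the smallest distances.
nearest_three = sorted(distances, key=distances.get)[:3]

for obs, d in distances.items():
    print(f"Observation {obs}: distance = {d:.2f}")
print("Three nearest observations:", sorted(nearest_three))
```

Under that assumption the three nearest observations are 2, 5, and 6, in agreement with the answer given.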
(D) [ 1 PTS ] If the corresponding Bayes Decision Boundary is highly nonlinear, then would you expect the best
value for the parameter K to be large or small?
A smaller K is preferable: a small K gives a more flexible fit (lower bias, higher variance), which is better at capturing a highly nonlinear boundary.
For a given quantitative response $Y$ and a single predictor $X \in [-3, 5]$, consider the following regression model:
$$Y = f(X) + \varepsilon$$
where $f(X) = X^3 - 3X^2 - 6X + 8$ is the true underlying curve and $\varepsilon$ is an unobservable random variable,
independent of the predictor, with mean zero and variance $\sigma^2$. The true regression curve $f$ is unknown and hence
you need to estimate it by $\hat{f}$ using statistical learning methods. This curve estimate is then used to predict the
response $Y$ at a new value of the predictor $X = x$, and the performance of this prediction is assessed by the test mean
squared error (test MSE).
Since this is a simulation study where the true underlying regression curve is known, we can generate multiple
training datasets as well as test datasets. For this study, you are given the training data $\{(x_i, y_i) : i = 1, \dots, n\}$
and the test data $\{(x_i^*, y_i^*) : i = 1, \dots, n\}$. You are requested to fit the K-nearest neighbor (KNN) regression
model to the training data and then to use the fitted model to provide predictions for the response at the given
predictor values in the test data. This procedure of model fitting and prediction has been repeated M = 100 times to
assess the model performance.
This simulation study enables you to assess the performance of the fitted KNN regression models by two measures:
training MSE and test MSE as shown in the following figures:
[Figures: training MSE and test MSE plotted against K]
The training MSE starts very low when K is small, because the KNN model then follows the training
data very closely (low bias). As K increases, the model becomes less flexible and eventually underfits.
At the smallest values of K, the test MSE is relatively high, which means the model overfits. As K increases,
the test MSE decreases and reaches a minimum at K = 3. When K increases further, the test MSE increases drastically.
The increasing width of the band around both MSE curves shows the uncertainty around the MSE estimate.
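The qualitative behavior described in these answers can be reproduced with a small simulation. The sketch below is not the study's actual code: the sample size n = 50, the noise level sigma = 3, the uniform sampling of X on [-3, 5], and all function names are assumptions made for illustration. Only the curve f and the M = 100 repetitions come from the problem statement.

```python
import random

def f(x):
    # True regression curve from the problem statement.
    return x**3 - 3 * x**2 - 6 * x + 8

def simulate(n, sigma, rng):
    # Assumption: X is uniform on [-3, 5] and the noise is Gaussian.
    xs = [rng.uniform(-3, 5) for _ in range(n)]
    ys = [f(x) + rng.gauss(0, sigma) for x in xs]
    return xs, ys

def knn_predict(x0, xs, ys, k):
    # Average the responses of the k training points closest to x0.
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def mse(xs_eval, ys_eval, xs_train, ys_train, k):
    # Mean squared error of KNN predictions at the evaluation points.
    errors = [(y - knn_predict(x, xs_train, ys_train, k)) ** 2
              for x, y in zip(xs_eval, ys_eval)]
    return sum(errors) / len(errors)

def run(ks, n=50, sigma=3.0, repetitions=100, seed=0):
    # Average training and test MSE over repeated simulated datasets.
    rng = random.Random(seed)
    train_mse = {k: 0.0 for k in ks}
    test_mse = {k: 0.0 for k in ks}
    for _ in range(repetitions):
        x_tr, y_tr = simulate(n, sigma, rng)
        x_te, y_te = simulate(n, sigma, rng)
        for k in ks:
            train_mse[k] += mse(x_tr, y_tr, x_tr, y_tr, k) / repetitions
            test_mse[k] += mse(x_te, y_te, x_tr, y_tr, k) / repetitions
    return train_mse, test_mse

train_mse, test_mse = run(ks=[1, 3, 10, 25])
for k in train_mse:
    print(f"K={k:2d}  training MSE={train_mse[k]:8.3f}  test MSE={test_mse[k]:8.3f}")
```

With K = 1 every training point is its own nearest neighbor, so the training MSE is exactly zero while the test MSE is not. The value of K that minimizes the test MSE depends on the assumed n and sigma, so it need not match the K = 3 reported above.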
Question 1 [ 07 points ]
Consider a regression problem for a quantitative response $y$ with $p = 5$ predictors $x_1, \dots, x_5$. The model is given by:
$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_5 X_5 + \varepsilon$$
where $(\beta_0, \beta_1, \dots, \beta_5)$ are the model parameters and $\varepsilon$ is the unobservable random variable
(independent of the predictors) with mean zero and constant variance. Suppose that you are given the training data
$\{(x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}, y_i) : i = 1, \dots, 200\}$ and you are requested to fit a multiple normal linear
regression model to this training data, that is, you consider fitting the model:
Residuals:
Min 1Q Median 3Q Max
-36.735 -10.745 -1.279 9.280 35.996
Coefficients:
Estimate Standard Error t value Pr(>|t|)
(Intercept) -69.0254 29.0410 -2.377 0.018440 *
x1 0.4033 0.0681 5.921 1.43e-08 ***
x2 0.1750 0.2051 0.853 0.394730
x3 1.2435 0.4278 2.907 0.004080 **
x4 -0.0201 0.0456 -0.442 0.659340
x5 -9.3941 2.1557 -4.358 2.13e-05 ***
-------------------------------------------------------------------------------------------
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Y = −69.0254 + (0.4033 × 290) + (0.1750 × 21) + (1.2435 × 65) + (−0.0201 × 120) + (−9.3941 × 1)
Y ≈ 120.63
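The arithmetic of this prediction can be checked with a few lines of code. The coefficient estimates are copied from the regression output, and the predictor values (290, 21, 65, 120, 1) are the ones used in the calculation above. The variable names are illustrative only.

```python
# Coefficient estimates copied from the regression output.
coefficients = {
    "(Intercept)": -69.0254,
    "x1": 0.4033,
    "x2": 0.1750,
    "x3": 1.2435,
    "x4": -0.0201,
    "x5": -9.3941,
}

# Predictor values as used in the worked calculation.
new_point = {"x1": 290, "x2": 21, "x3": 65, "x4": 120, "x5": 1}

# Fitted value: intercept plus the sum of coefficient * predictor value.
y_hat = coefficients["(Intercept)"] + sum(
    coefficients[name] * value for name, value in new_point.items()
)
print(round(y_hat, 2))
```

The result rounds to 120.63, matching the value stated above.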
(B) [ 2 PTS ] The variance inflation factor (VIF) is calculated for each predictor as shown below. Is there evidence
for the existence of multicollinearity? Justify your answer.
vif(model)
Usually, a VIF above 5 to 10 indicates collinearity. Since these values are close to 1 and well below that range,
there is no evidence of multicollinearity.
(C) [ 2 PTS ] Use the diagnostic plots below to assess the normality assumption of the model residuals.
In a histogram of standardized residuals, a bell-shaped histogram indicates a normal distribution. Since
we see a bell curve that is slightly skewed, the residuals are approximately normal with a very slight skew.
If the residuals are normally distributed, the points in a Q-Q plot should fall approximately along the dashed line.
The plot here shows some deviations at the ends of the line, which indicates minor violations of normality.
Question 2 [ 13 points ]
Consider a regression problem for the response $y$ with 2 predictors $x_1$ and $x_2$. The model is given by:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$
where $(\beta_0, \beta_1, \beta_2)$ are the model parameters and $\epsilon$ is the error term, which is an unobservable random variable
(independent of the predictors) with mean zero and constant variance. The variables used are listed below.
Variable* Description
Weight The baby's birth weight in ounces.
Gestation The duration of pregnancy in days.
Smoking The indicator for mother’s smoking status (1=smoker, 0=non-smoker).
* This is a reduced version of the original dataset taken from Stat Labs by Nolan and Speed from the Child Health and Development Studies
conducted at the Oakland, CA, Kaiser Foundation Hospital.
Then the training dataset $\{(x_{i1}, x_{i2}, y_i) : i = 1, \dots, 200\}$ has been used to get the following least-squares estimates:
Coefficients:
Estimate Std. Error t value
(Intercept) 2.8029 18.8752 0.1485
Gestation 0.4383 0.0675 6.4901
Smoking -8.5512 2.1561 ?
---------------------------------------------------------------------------
Residual standard error: 14.51 on 197 degrees of freedom
The coefficient for Smoking is -8.5512, which implies that, for a given gestation period, the average birth weight of
babies born to mothers who smoke is 8.5512 ounces lower than that of babies born to mothers who do not smoke.
(B) [ 3 PTS ] Find the average birth weight in ounces for a baby delivered after a gestational period of 50 weeks
to a mother who smokes.
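One way to work this out is to plug the values into the fitted model. The sketch below assumes that the 50 weeks must be converted to days (50 × 7 = 350), since Gestation is recorded in days, and that Smoking = 1 for a smoker. The variable names are illustrative.

```python
# Least-squares estimates copied from the coefficient table.
intercept = 2.8029
b_gestation = 0.4383
b_smoking = -8.5512

# Gestation is recorded in days, so 50 weeks is converted to 350 days.
gestation_days = 50 * 7
smoker = 1  # 1 = smoker, 0 = non-smoker

expected_weight = intercept + b_gestation * gestation_days + b_smoking * smoker
print(round(expected_weight, 2))
```

Under these assumptions the fitted average birth weight is about 147.66 ounces.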
(C) [ 3 PTS ] Calculate the 99% confidence interval for the parameter $\beta_1$ and use the calculated confidence interval
to test the hypothesis $H_0: \beta_1 = 0$ at the 1% level of significance (use 2.35 as the critical value).
Margin of error = critical value * std error β1
Margin of error = 2.35*0.0675
Margin of error = 0.158625
Lower bound = 0.4383 – margin of error = 0.279675
Upper bound = 0.4383 + margin of error = 0.596925
Since 0 is not within the interval between the lower and upper bounds, we reject the null hypothesis.
This implies that gestation has a significant effect on the baby’s birth weight.
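The interval arithmetic above can be verified with a short script. The estimate and standard error are those of the Gestation coefficient in the table, and 2.35 is the critical value given in the question. The variable names are illustrative.

```python
# Estimate and standard error of the Gestation coefficient (beta_1).
estimate = 0.4383
std_error = 0.0675
critical_value = 2.35  # critical value supplied in the question

margin = critical_value * std_error
lower = estimate - margin
upper = estimate + margin

# H0: beta_1 = 0 is rejected when 0 lies outside the interval.
reject_h0 = not (lower <= 0 <= upper)
print(f"99% CI: ({lower:.6f}, {upper:.6f}); reject H0: {reject_h0}")
```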
(D) [ 4 PTS ] Write down the value of the RSS and the missing t-value.
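Both quantities follow from the printed output. A minimal sketch: it uses the standard relations RSS = (residual standard error)² × (residual degrees of freedom) and t = estimate / standard error. The variable names are illustrative.

```python
# Values copied from the regression output.
residual_std_error = 14.51
residual_df = 197
smoking_estimate = -8.5512
smoking_std_error = 2.1561

# The residual standard error is sqrt(RSS / df), so RSS = RSE^2 * df.
rss = residual_std_error**2 * residual_df

# The t value is the estimate divided by its standard error.
t_smoking = smoking_estimate / smoking_std_error

print(f"RSS = {rss:.2f}, t value for Smoking = {t_smoking:.3f}")
```

This gives an RSS of about 41476.40 and a t value of about -3.966 for Smoking.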
Question 1 [ 04 points ]
Suppose that you collected data for a group of students in a graduate statistics class with two predictors defined as
$X_1$ = hours studied, $X_2$ = undergrad GPA, and the response $Y$ = receive an A. You fit a logistic regression and
produce coefficient estimates: $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, $\hat{\beta}_2 = 1$.
(A) [ 2 PTS ] Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets
an A in the graduate statistics class.
(B) [ 2 PTS ] How many hours would the student with an undergrad GPA of 3.5 need to study to have a 50% chance
of getting an A in the graduate statistics class?
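Both parts can be worked out from the logistic model p = 1 / (1 + exp(-(β0 + β1·X1 + β2·X2))). The sketch below uses the coefficient estimates given in the question. The function and variable names are illustrative.

```python
import math

# Coefficient estimates given in the question.
b0, b1, b2 = -6.0, 0.05, 1.0

def prob_a(hours, gpa):
    # Logistic model: probability of receiving an A.
    z = b0 + b1 * hours + b2 * gpa
    return 1.0 / (1.0 + math.exp(-z))

# (A) Probability for 40 hours of study and a GPA of 3.5.
p_a = prob_a(40, 3.5)

# (B) A 50% chance corresponds to a linear predictor of zero:
#     b0 + b1 * hours + b2 * 3.5 = 0, solved for hours.
hours_for_half = -(b0 + b2 * 3.5) / b1

print(f"(A) p = {p_a:.4f}")
print(f"(B) hours = {hours_for_half:.1f}")
```

This gives a probability of about 0.3775 for part (A) and 50 hours for part (B).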
Suppose that you wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on $X$, last
year’s percent profit. You examine many companies and discover that the mean value of $X$ for companies that issued
a dividend was $\bar{X} = 10$, while the mean for those that didn’t was $\bar{X} = 0$. In addition, the variance of $X$ for these two
sets of companies was $\sigma^2 = 36$. Finally, 80% of companies issued dividends. If $X$ follows a normal distribution,
predict the probability that a company will issue a dividend this year given that its percentage profit was $X = 4$ last
year. Hint: You will need to use Bayes’ Theorem and the formula of the normal density function.
P(Dividend_issued)=0.8
P(Dividend_not_issued) =0.2
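Starting from these priors, Bayes' Theorem combines each prior with the normal density of X = 4 under the corresponding class. The sketch below uses only the means, variance, and priors given in the question. The function and variable names are illustrative.

```python
import math

def normal_pdf(x, mean, variance):
    # Density of a normal distribution with the given mean and variance.
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

prior_yes, prior_no = 0.8, 0.2   # priors given in the question
mean_yes, mean_no = 10, 0        # class means of X
variance = 36                    # common variance of X
x = 4                            # last year's percent profit

# Bayes' Theorem: posterior = prior * likelihood / total evidence.
numerator = prior_yes * normal_pdf(x, mean_yes, variance)
evidence = numerator + prior_no * normal_pdf(x, mean_no, variance)
posterior_yes = numerator / evidence

print(f"P(dividend issued | X = 4) = {posterior_yes:.4f}")
```

The posterior probability of a dividend comes out at about 0.752.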
Suppose that 1000 test observations are available such that the confusion matrix obtained because of using the
Bayes Classifier is shown below. Calculate the metric below if class 1 represents the positive class.
Specificity = 23 / (23 + 477) = 0.046
Sensitivity = 483 / (483 + 17) = 0.966
[Confusion matrix: rows give the true class (positive class 1, negative class 0); columns give the predicted class (0, 1)]
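The two metrics can be recomputed from the counts used in the calculation above. The mapping of those counts to confusion-matrix cells is an assumption based on the standard definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), with class 1 as the positive class.

```python
# Counts as they appear in the worked calculation (assumed cell mapping).
true_positives = 483
false_negatives = 17
true_negatives = 23
false_positives = 477

# Sensitivity: share of actual positives that are predicted positive.
sensitivity = true_positives / (true_positives + false_negatives)

# Specificity: share of actual negatives that are predicted negative.
specificity = true_negatives / (true_negatives + false_positives)

total = true_positives + false_negatives + true_negatives + false_positives
print(f"n = {total}, sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")
```

The four counts add up to the 1000 test observations mentioned in the question.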