258 Computerstatistik
Exercise sheet
Laura Vana
For each submission, upload the Rmarkdown file (.Rmd) and a PDF to TUWEL. Each task and subtask should have a conclusion written in full sentences!
Task 1
Use the different help systems available for R (e.g., the R homepage, built-in R help functions) to find information about:
1. the command [
2. the ordered argument for factor
Task 2
Calculate the sum $s_n = \sum_{i=1}^{n} r^i$ for $r = 1.08$ and compare it to $(r^{n+1} - 1)/(r - 1) - 1$ for $n = 10, 20, 30, 40$.
Use the formula to compute the sum for all values of n between 1 and 100, and store the results in a vector.
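A minimal sketch of the vectorized computation using the closed-form formula above (the values for n = 10, 20, 30, 40 can then be compared against direct summation):

r <- 1.08
n <- 1:100
s <- (r^(n + 1) - 1) / (r - 1) - 1   # s[k] holds the sum for n = k
s[c(10, 20, 30, 40)]                 # compare with, e.g., sum(r^(1:10))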
Task 3
Create the following vectors in R:
x1: 1 3 3 6 6 6 10 10 10 10 15 15 15 15 15
x2: 0.00000000 0.07142857 0.14285714 0.21428571 0.28571429 0.35714286
0.42857143 0.50000000 0.57142857 0.64285714 0.71428571 0.78571429
0.85714286 0.92857143 1.00000000
x3: 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60
0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00
x4: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
x5: 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1
x6: 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
x7: 15 29 42 54 65 75 84 92 99 105 110 114 117 119 120
x8: 1.000000e+00 2.000000e+00 6.000000e+00 2.400000e+01 1.200000e+02
7.200000e+02 5.040000e+03 4.032000e+04 3.628800e+05 3.628800e+06
3.991680e+07 4.790016e+08 6.227021e+09 8.717829e+10 1.307674e+12
x9: 120 119 117 114 110 105 99 92 84 75 65 54 42 29 15
x10: "a" "b" "b" "b" "b" "c" "c" "c" "c" "c" "c" "c" "c" "d" "d"
x11: Define x11 as a factor which is based on x5.
x12: Define x12 as a vector of factors based on x10, but include that there could also be a level "e".
x13: Create a vector x13 which is a factor based on x2:
- values less than 0.37 belong to level "less"
- between 0.37 and 0.55 to level "sufficient"
- between 0.55 and 0.96 to the level "good"
- above 0.96 to the level "plenty"
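One hedged way to build x13 is with cut(); the exact handling of the interval boundaries (open vs. closed) is an assumption here:

x2  <- seq(0, 1, length.out = 15)
x13 <- cut(x2, breaks = c(-Inf, 0.37, 0.55, 0.96, Inf),
           labels = c("less", "sufficient", "good", "plenty"))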
Task 4
a.
b.
c.
What is the difference between the following chunks of code? Explain. Hint: look up the difference between
& and &&.
x <- -7
x > 0 & sqrt(x) < 2
x <- -7
x > 0 && sqrt(x) < 2
Task 5
set.seed(1)
DAT <- sample(LETTERS[1:7], 15, replace = TRUE)
F1 <- factor(DAT, levels = (LETTERS[1:7]))
F2 <- F1
levels(F2) <- rev(levels(F2))
F3 <- rev(factor(DAT, levels = (LETTERS[1:7])))
F4 <- factor(DAT, levels = rev(LETTERS[1:7]))
Is there a difference between F2, F3 and F4? If yes, what is the difference?
Task 6
a.
What does the function dim return when applied to an atomic vector? What happens if the dim attribute of a matrix or an array is assigned the value NULL?
b.
• give X row names r1, ..., r100 and column names c1, c2, c3. Hint: use the function paste.
• compute:
c.
What does as.matrix do to a data frame which has columns of different type?
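A quick experiment one might run for c. (with an illustrative mixed-type data frame):

df <- data.frame(num = 1:3, chr = c("x", "y", "z"))
as.matrix(df)   # all columns are coerced to a common type (here: character)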
Task 7
Unlike some other programming languages, base R does not have a dedicated data type for sets. Instead, R
treats a vector like a set by taking only its distinct elements. The set operators setdiff, intersect, union,
setequal and %in% are available in base R (see ?union). (Note: for v %in% S, only S is treated as a set, not the vector v.) There are, however, packages such as sets on CRAN which specialize in set operations.
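A quick demonstration of these operators:

x <- c(1, 2, 2, 3)
y <- c(2, 4)
union(x, y)       # 1 2 3 4 -- duplicates are dropped
intersect(x, y)   # 2
setdiff(x, y)     # 1 3
x %in% y          # FALSE TRUE TRUE FALSE -- one result per element of x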
a.
set.seed(1234)
x <- c(1, 3, 4)
y <- sample(1:100, 10)
Compute their union, intersection and the set of elements in y but not in x. Check for each element of x
whether it can be found in y.
b.
For s <- 1:200, use logical operators to find the subvector of s that is divisible by 7, but not divisible by
2. Hint: use %% as the modulo operator; which() can be used to check which positions in a logical vector
are equal to TRUE.
c.
Create a vector s7 which contains all the numbers divisible by 7 in the vector of integers between 1 and 200.
Create a vector s2 which contains all the numbers divisible by 2 in the vector of integers between 1 and
200 (Hint: you can use seq). Use the set operators mentioned above to find the subvector of 1:200 that is
divisible by 7, but not divisible by 2.
Task 8
a.
Find the smallest normalized positive IEEE 754 64-bit floating-point number ϵ for which 1 + ϵ ≠ 1 (“pen and paper”, not in R). Compare the number you get with the machine epsilon .Machine$double.eps. Also, try checking 1 + epsilon == 1 in R.
b.
Find the smallest normalized positive IEEE 754 64-bit floating-point number ϵ for which 1 − ϵ ≠ 1 (“pen and paper”, not in R). Compare the number you get with the machine epsilon .Machine$double.neg.eps. Also, try checking 1 - epsilon == 1 in R.
c.
The machine precision or machine epsilon is an upper bound on the relative approximation error due to rounding in floating point arithmetic: $fl(x) = x(1 + \xi)$, $|\xi| \le \epsilon$. Explain why the following commands deliver different results, based on the subtasks a and b above. Which basic law of arithmetic is not satisfied by floating point arithmetic in the example below?
.Machine$double.eps/2 + 1 - 1
## [1] 0
.Machine$double.eps/2 + (1 - 1)
## [1] 1.110223e-16
Task 9
a.
Write a function sum_for_sort which takes a vector of length greater than one and first sorts the vector in
ascending order and then, by using a for loop, computes the sum of the sorted vector.
Call the function for the sequence:
set.seed(1)
x <- rnorm(1e6)
Compare the result with the sum computed using the built-in function sum(x) and with the sum computed
using 80 bit precision by employing tools from the multi-precision arithmetic R package Rmpfr.
# install.packages("Rmpfr")
library("Rmpfr")
s80 <- sum(mpfr(x, 80))
b.
Write two functions which compute the binomial coefficient $\binom{n}{k}$: the first using the factorials $n! = \prod_{i=1}^{n} i$, and the second using log(), exp() and sum(). Evaluate the functions at the value n = 1000 and k = 500. Discuss any differences, if any occur, and explain what you think is happening.
c.
Write a function which approximates the exponential function $e^x$ at a value $x_0$ using the first n terms of the Taylor expansion.
• Compare the results of the function at x0 = 1 and n = 25 with the results returned by exp(x0).
• Compare the results of the function at x0 = −25 and n = 25 with the results returned by exp(x0).
Discuss any differences, if any occur, and explain what you think is happening.
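A minimal sketch of such a function, assuming “first n terms” means the terms of order 0 to n − 1:

exp_taylor <- function(x0, n) sum(x0^(0:(n - 1)) / factorial(0:(n - 1)))
exp_taylor(1, 25)    # close to exp(1)
exp_taylor(-25, 25)  # compare with exp(-25); cancellation becomes an issue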
Task 10
a.
Install the package fueleconomy using install.packages("fueleconomy") and make yourself familiar with
the dataset vehicles. Are there observations with missing values? Hint: use complete.cases().
b.
c.
How many unique values do the variables class, trans, drive and fuel have? Hint: check what the
function unique does.
d.
Compute the frequency table of drive vs fuel. Compute the total, column and row percentages.
e.
Make a data frame named FED containing the variables cyl, displ, hwy and cty. Compute the median of
all variables in FED, missing values should be removed for the computations.
f.
g.
Compute the trimmed means for the variable displ by the variables drive and fuel (use trim = 0.1 and recall the missing values).
h.
Make a factor DRIVE which takes the value 2WD when the variable drive has the values 2-Wheel Drive, Front-Wheel Drive or Rear-Wheel Drive. Otherwise the factor should get the value 4WD. Hint: check %in%.
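A hedged sketch, assuming the vehicles data from fueleconomy is loaded:

library("fueleconomy")
two_wd <- c("2-Wheel Drive", "Front-Wheel Drive", "Rear-Wheel Drive")
DRIVE  <- factor(ifelse(vehicles$drive %in% two_wd, "2WD", "4WD"))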
i.
Make a data frame REGULAR which contains the factor created above and the variables hwy and cty - but
only when the fuel type is Regular.
Task 11
a.
Explore the univariate distribution of some of the categorical variables in the data set, i.e., compute the percentage of candies in each level of the categorical variable. Do this for the categorical variables chocolate, caramel, bar. You can use the prop.table() function for this purpose.
b.
Make barplots for each variable showing the percentages computed in a. Put the name of the variable as the
plot title. Put the three plots side by side in one row.
c.
Generate a table which contains the number of candies for all factor level combinations of the variables chocolate, caramel, bar, fruity, peanutyalmondy, crispedricewafer. Order the resulting table in decreasing order by the number of candies in each group. What is the most common “taste profile” among the candies in the sample? Hint: look at aggregate() for generating the table.
d.
Take the table in c. and for each row generate a string containing the taste profile: e.g., for a row where only fruity is TRUE the string will be "fruity", and for a row where chocolate and caramel are TRUE it will be "chocolate,caramel". For the row containing only FALSE values, use "". Hint: you can use apply() and paste0(..., collapse = ",").
Make a new data frame which contains the taste profile variable and the number of observations in the
sample.
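A hedged sketch, assuming tab is the table from c. with the six logical columns plus a count column named n (these names are assumptions):

taste_cols <- c("chocolate", "caramel", "bar", "fruity",
                "peanutyalmondy", "crispedricewafer")
profile <- apply(tab[, taste_cols], 1, function(row)
  paste0(taste_cols[as.logical(row)], collapse = ","))
profiles_df <- data.frame(profile = profile, n = tab$n)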
e.
f.
Compute the correlation between the different quantitative variables. Discuss the correlation between winpercent and the other two.
g.
For the different taste profiles in c., compute the average winpercent. Which taste profile is the “best” and
“worst”, respectively?
Task 12
a.
Install the package ISwR and read the help for the data set hellung.
b.
Add to the data frame the variable GLUCOSE, the factor corresponding to glucose with labels Yes and No. Make a reasonable plot that lets you compare the values of diameter for the two glucose groups.
c.
Plot conc against diameter, give different colors and plotting symbols depending on the variable glucose.
d.
Make the same plot as in c., but now use a logarithmic scale for the x axis.
e.
Make the same plot as in d., but now use a logarithmic scale for both axes. Then also add a legend to the plot; the text should say glucose and no glucose.
f.
Make a “publication ready” figure from e. It should have at least a nice legend and proper axis labels, and unnecessary space around the figure should be cut off. Write the code so that the figure is saved as a pdf of size 7 × 7 in the working directory as YourName.pdf. Upload this .pdf together with the .Rmd and .PDF of the solution in TUWEL.
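A hedged sketch of the export step (replace YourName accordingly):

pdf("YourName.pdf", width = 7, height = 7)
par(mar = c(4, 4, 1, 1))   # trim unnecessary white space around the plot
# ... plotting code from e. goes here ...
dev.off()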
Task 13
Write a function which takes two vectors x and y of equal length and returns a scalar N .
For a collection of emails,
• The vector y records whether a certain email has been “spam” or “ham” (i.e., non-spam).
• The vector x records how many times capital letters appeared in each email.
We want to use x in order to predict whether a new email is spam, by selecting a cutoff N for the number of capital letters and classifying new emails which have more than N capital letters as spam and all others as ham. Among an extensive list of possible cutoffs, the optimal cutoff is selected such that the number of wrongly classified y’s is minimal.
Find the optimal N for the following data:
In writing the function, assume that the possible cutoff values are only the ones appearing in x (this is
restrictive of course in general), namely 0, 1, 2, 4, 5, 6, 8. Also, if there are ties in the misclassification rate,
choose the smallest cutoff as the optimal one.
Task 14
a.
Download the data set sdi.csv from TUWEL and read it into R in an object called sdi. The original data
set is available from gapminder and it contains the sustainable development index (SDI) for a collection of
countries.
b.
For each year, compute the percentage of missing values in the SDI and plot them as a line plot with the
years on the x-axis and the percentage on the y-axis. Label the axes of this plot sensibly.
c.
d.
Compute the number of missings per country and store the results in missings_per_country. (Hint: you can use rowSums.) Assign the names of the countries as the names of this vector. Return from missings_per_country only the elements with at least one missing, sorted in decreasing order (i.e., the first element of the vector corresponds to the country with the most missings, etc.).
e.
Convert the data set, which is in wide format, to a long format. Name this new object sdi_long. Convert
the year variable in long format data set to a factor.
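A hedged sketch of the wide-to-long conversion with base reshape(); the year-column naming pattern ("X" followed by the year) is an assumption:

year_cols <- grep("^X", names(sdi))
sdi_long  <- reshape(sdi, direction = "long", varying = year_cols,
                     v.names = "sdi", timevar = "year")
sdi_long$year <- factor(sdi_long$year)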
f.
Eliminate from the data set sdi_long the rows with missing values.
g.
Generate a graph which displays parallel boxplots of the SDI for each year. Briefly describe the results.
h.
For each year, compute the average and the median SDI from the sdi_long. Generate a line plot with the
years on the x-axis and the average SDI on the y-axis. Add to this plot a dashed line of the medians (Hint:
check the lty argument of the lines() or plot() function). Give appropriate labels to the plot. Add a
legend for the line types.
Task 15
In the following we will consider a sample of the latest MovieLens data sets. The original data is available at https://grouplens.org/datasets/movielens/.
• movies1.csv in TUWEL contains the movie id, the title and the genre. In this data set there is one
genre for each movie.
• movies2.csv in TUWEL contains the movie id, the title and the genre. In this data set there are
multiple genres for each movie separated by a |.
• ratings.csv in TUWEL contains the movie id, a user id, the rating for the movie and a timestamp.
a.
Combine the movies1.csv and the ratings.csv files. How many users didn’t rate any of the movies in
movies1.csv? What is the dimension of the combined data set if you keep all user ratings?
What is the dimension of the combined data set if you only keep the user ratings for the movies in
movies1.csv?
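A hedged sketch of the combination step, assuming both files share a movie id column called movieId:

movies1 <- read.csv("movies1.csv")
ratings <- read.csv("ratings.csv")
all_ratings <- merge(movies1, ratings, by = "movieId", all.y = TRUE)  # keep all user ratings
only_m1     <- merge(movies1, ratings, by = "movieId")                # inner join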
b.
Combine the two movie data sets into one. Name this data movies.
c.
Combine the data set movies and the data in file ratings.csv such that all movies in movies are contained
in the combined data set. How many movies did not receive any review?
d.
Write out the combined data set which only contains user ratings for the movies in movies1.csv as a text file .txt to your working directory; in the file, missing values should appear as 999999. The text file should not include a column for the row names.
Task 16
Use the decathlon data from package FactoMineR. Check the documentation of the data set.
a.
Make a correlation plot of the first 10 numeric variables in the data set. Briefly comment on the output.
b.
Plot a histogram of points obtained and overlay it with a kernel density estimate and a theoretical normal density using the sample mean and sample standard deviation as parameters. Give the lines different colors, the axes reasonable labels and the plot a reasonable title. Also add a legend which tells which curve is the kernel density estimate and which one the theoretical curve. Also draw qqplots for points obtained against the quantiles of the normal distribution using the qqnorm() function. Add a qqline.
c.
Andrews curves can be used to visualize high-dimensional data. Each observation $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ in the sample gets mapped to:

$$f_i(t) = \frac{x_{i1}}{\sqrt{2}} + x_{i2} \sin t + x_{i3} \cos t + x_{i4} \sin 2t + x_{i5} \cos 2t + \ldots = \frac{x_{i1}}{\sqrt{2}} + \sum_{1 \le k \le \lfloor p/2 \rfloor} x_{i,2k} \sin kt + \sum_{1 \le k < \lfloor p/2 \rfloor} x_{i,2k+1} \cos kt, \qquad -\pi \le t \le \pi$$
In R, compute the curves for a sequence tseq <- seq(-pi, pi, length = 100) for each observation based on the first p = 10 columns. Using a for loop, plot the curves in a graph using lines(), where the type of curve and the color differ between the 2004 Olympic Games and the 2004 Decastar competitions. For better visualization, normalize the data before doing the computations by subtracting the minimum value and dividing by the range (max − min) for each column.
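A hedged sketch of the normalization and of a helper that evaluates one Andrews curve over tseq (column j contributes a sin/cos term with k = j %/% 2):

normalize <- function(col) (col - min(col)) / (max(col) - min(col))
andrews_curve <- function(x, tseq) {
  f <- rep(x[1] / sqrt(2), length(tseq))
  for (j in 2:length(x)) {
    k <- j %/% 2
    f <- f + if (j %% 2 == 0) x[j] * sin(k * tseq) else x[j] * cos(k * tseq)
  }
  f
}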
Task 17
The linelist_messy_dates.rds file from TUWEL contains data simulated for the Ebola epidemic.
a.
Import the data set in R. Familiarize yourself with the data set and try to make sense of the variables in the
data set. Briefly discuss which information is contained in this data set. What are the observational units
contained in the rows? Are there missing values?
b.
c.
Make a bar plot of the number of infections for each month of the year. Each bar must be labelled with the abbreviated name of the month (e.g., Jan, Feb, ...). Pay attention to the ordering of the bars! Briefly comment on the plot. Hint: to extract the month from a date you can use the function format().
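For example (hedged), dates can be mapped to month abbreviations in calendar order like this:

d <- as.Date(c("2014-07-15", "2014-01-03"))
months <- factor(format(d, "%m"), levels = sprintf("%02d", 1:12),
                 labels = month.abb)   # Jan, Feb, ... in calendar order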
d.
For each row, compute the number of days between infection and onset and save this information in a new variable date_infection_onset. Make parallel boxplots of date_infection_onset for each month of the year. Make sure that the axes are readable. Briefly comment on the plot.
e.
Compute the daily number of hospitalizations. Hint: you can use aggregate(). Make parallel boxplots for the number of daily hospitalizations split in two groups: work days (Mo–Fr) vs. weekend (Sa–Su). Are there any notable differences? Hint: you can use weekdays().
Task 18
a.
Make a pairwise scatterplot of the data using pairs as well as a 3D-scatterplot using function
scatterplot3d() of package scatterplot3d.
b.
For each column, identify the univariate outliers using the 1.5 · IQR rule used by the boxplot. Create a new column univ_out in the data set which contains 0 if the observation has not been identified as an outlier, 1 if the observation is a univariate outlier for column 1, 2 if it is a univariate outlier for column 2 and 3 if it is a univariate outlier for column 3.
c.
d.
Multivariate outliers can be identified by using the Mahalanobis distance (MD) between each point $x_i \in \mathbb{R}^p$ (i.e., row) and the mean vector $\mu \in \mathbb{R}^p$:

$$MD(x_i) = d(x_i, \mu) = \sqrt{(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)}$$
An observation is declared a multivariate outlier if $MD^2(x_i) > Q_\alpha$, where $Q_\alpha$ is the $\alpha$ quantile of the $\chi^2$ distribution with $p$ degrees of freedom (typical $\alpha$ values are 0.95 or 0.975).
• Estimate the squared MD for each observation in the data set using the sample mean and the sample
covariance matrix of the data. Add it as a new column MD2 to the data set. Hint: In R, you can use
mahalanobis()
• For α = 0.975, make a column MD_out which contains a 1 if the observation is declared as outlier and
0 otherwise.
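A hedged sketch of the two bullets, with placeholder data standing in for the task's data set:

x   <- as.matrix(iris[, 1:3])                 # placeholder data
MD2 <- mahalanobis(x, colMeans(x), cov(x))    # returns squared distances
MD_out <- as.numeric(MD2 > qchisq(0.975, df = ncol(x)))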
e.
The mean and covariance estimators used in estimating MD are sensitive to outliers, and can therefore be
contaminated easily. Repeat the exercise in d. but now use robust estimators for the mean and the covariance
parameter. A robust estimator of location is the median. A robust estimator for the covariance is the MCD
(Minimum Covariance Determinant) estimator which can be computed using the function cov.mcd() from
package MASS.
# install.packages("MASS")
library("MASS")
robust_cov <- cov.mcd(x[,1:p])$cov
f.
Create 3 pairwise scatterplots and 3 scatterplots in 3D where you visualize the different outliers:
• one pairwise and one 3D scatterplot where you color the univariate outliers with 3 different colors (for the values 1, 2, 3).
• one pairwise and one 3D scatterplot where you color the multivariate outliers based on MD in a distinct color.
• one pairwise and one 3D scatterplot where you color the multivariate outliers based on robust MD in a distinct color.
Discuss the results. How many outliers are identified by each of the methods?
g.
Another distance which can be used for outlier detection is the Euclidean distance between the point and
the center parameter.
$$\sqrt{\left(\frac{x_{i1} - \mu_1}{\sigma_1}\right)^2 + \ldots + \left(\frac{x_{ip} - \mu_p}{\sigma_p}\right)^2} = \sqrt{(x_i - \mu)^\top \operatorname{diag}(\sigma_1^2, \ldots, \sigma_p^2)^{-1} (x_i - \mu)}$$

where $\mu = (\mu_1, \ldots, \mu_p)^\top$ and the standard deviations $\sigma = (\sigma_1, \ldots, \sigma_p)^\top$ again have to be estimated from the data. Note that, given that the variables can have different scales, one should normalize each centered column vector by dividing by the standard deviation.
Explain the difference between the Euclidean distance and the Mahalanobis distance when used for outlier detection on multivariate normally distributed data, and state which one you would prefer.
Task 19
a.
Show that $X^\top r = 0$. What does this property mean in terms of the correlation between the covariates and the residuals?
b.
Show that $\sum_{i=1}^{n} r_i = 0$, where $r_i$ is the $i$-th residual.
c.
Run the following code to obtain the vector r of residuals. The model being estimated is a linear regression
with 2 independent variables X1 and X2 and an intercept.
## 1 2 3 4 5 6
## -2.148091 -2.148091 -2.348379 1.225844 3.235770 -3.199783
Write a function named standardized_resids “by hand” which takes as arguments the residuals and the
design matrix and computes the standardized residuals (i.e., do not use any built-in functions in R). Apply
the function to the vector r and the proper design matrix for this example. Hint: Do not forget the intercept
when building the design matrix.
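A minimal sketch under the usual definitions, assuming r holds the residuals and X the design matrix including the intercept column ($h_{ii}$ is the i-th diagonal element of the hat matrix):

standardized_resids <- function(r, X) {
  n <- nrow(X); k <- ncol(X)
  H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
  sigma2_hat <- sum(r^2) / (n - k)        # residual variance estimate
  r / sqrt(sigma2_hat * (1 - diag(H)))
}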
d.
Compute the standardized residuals $\tilde{r}_i$ using the following built-in function applied to the regression object. Check whether the results are equal to your computation above.
e.
where $y_i$ is the omitted observation and $\hat{y}_{(i)}$ the prediction of $y_i$ based on a model that was fitted after excluding the $i$th observation. The term $\hat{\sigma}_{(i)}$ denotes the standard deviation of the $n - 1$ residuals obtained from the regression without observation $i$.
Using the fact that $y_i - \hat{y}_{(i)} = \frac{r_i}{1 - h_i}$ (see e.g., here) one can show that

$$\check{r}_i = \tilde{r}_i \left( \frac{n - p - 2}{n - p - 1 - \tilde{r}_i^2} \right)^{1/2}$$
where p is the number of independent variables (excluding the intercept).
Write a function named studentized_resids “by hand” which takes as arguments the original residuals
and the design matrix and computes the studentized residuals (i.e., do not use any built-in functions in R).
Apply the function to the vector r and the proper design matrix for this example. Hint: Do not forget
the intercept when building the design matrix. Note that p represents the number of independent variables
without intercept.
f.
Compute the studentized residuals $\check{r}_i$ using the following built-in function applied to the regression object. Check whether the results are equal to your computation above.
Task 20
Which residual vs. fitted plot in Figure 1 indicates the best fit? Also inspect the qqplots in Figure 2 and
comment on them. For each of the four data sets, explain which assumptions of the linear regression model
seem to be violated (if any).
Task 21
Assume that different models are fit to a data set which contains information on the sales of a product in
different countries (in hundred units) and on the advertising budgets which have been spent for promoting
the product: TV, youtube, social media (all measured in dollars). Different regression models are fit to
different subsamples of the data:
Model 4: $\widehat{sales} = 200 + 0.05 \cdot (TV - \overline{TV}) + 0.2 \cdot (youtube - \overline{youtube}) + 0.5 \cdot social$
Interpret the following quantities. Hint: pay attention to the ceteris paribus interpretation. Example interpretation: if the youtube ad budget is increased by x, we would expect an increase in sales by 2000 dollars, while keeping all other variables unchanged.
[Figure 1: Residuals vs Fitted plots for DataSet 1, DataSet 2, DataSet 3 and DataSet 4.]
[Figure 2: Normal Q–Q plots of the standardized residuals for DataSet 1, DataSet 2, DataSet 3 and DataSet 4.]
Task 22
The ToothGrowth data in R contains info on the length of odontoblasts (cells responsible for tooth growth)
in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one
of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
a. Descriptives
Transform dose into an (unordered) factor variable with the levels low, medium, high.
Inspect the data using descriptives and graphs: i.e.,
• make a table using pipes |> in order to build a table with the mean, standard deviation and number
of observations for each level of supp.
• make a table using pipes |> in order to build a table with the mean, standard deviation and number
of observations for each level of dose.
• make a table using pipes |> in order to build a table with the mean, standard deviation and number
of observations for each of the 6 combinations for supp and dose.
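A hedged sketch for the first bullet using the native pipe (requires R >= 4.1):

ToothGrowth |>
  (\(d) aggregate(len ~ supp, data = d,
                  FUN = \(v) c(mean = mean(v), sd = sd(v), n = length(v))))()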
b. Linear model
Run a linear regression model for len where you use supp and dose as independent variables. What is the
interpretation of the coefficients for Intercept, dosemedium and suppVC?
c. Sum contrasts
Fit the same model as in b., but using the sum contrasts for both dose and supp. Interpret the coefficients for Intercept, dose1 and supp1. Hint: pay attention to whether this is a balanced or unbalanced design.
d. Interactions
Fit a model with an interaction between dose and supp (using treatment contrasts). From the coefficients
compute the main effects (i.e., average length) for each of the 6 factor combinations. Compare this with the
averages in a. Hint: you can look at the model.matrix() for help.
Task 23
Load the data set child.iq.dta into R (you can download it from TUWEL). Note that this is a Stata file.
Hint: You can use the Import data set button.
The data set contains children’s test scores at age 3, mother’s education and mother’s age at the time she gave birth for a sample of 400 children.
a.
Fit a regression of child test scores on mother’s age. Display the data in a scatterplot together with the fitted regression line. Check the assumptions of the model. Interpret the slope coefficient. If you were to recommend a birth age based on this model, what would it be? (Consider scores below 90 to be problematic.)
b.
Fit the regression also including education (as a numeric variable, not a factor). Interpret both slope
coefficients. What is your recommendation for age now? Did it change?
c.
Fit the regression including mom age and education (as a factor variable). Plot the different regression lines for the different education levels in the scatterplot of test scores and mom age. Color the points according to the different education levels. Also show the projection of the points onto the corresponding regression line. (See slides for an example.)
Task 24
Data were collected on the volume of users on the Northampton Rail Trail in Florence, Massachusetts for
ninety days from April 5, 2005 to November 15, 2005. The data set is available in the mosaicData package
in R:
Variables in the data set include:
• the number of crossings on a particular day (measured by a sensor near the intersection with Chestnut
Street, volume),
• the average of the min and max temperature in degrees Fahrenheit for that day (avgtemp),
• dichotomous indicators for spring, summer and fall,
• dichotomous factor of whether the day was a weekday or a weekend/holiday (dayType),
• measure of cloud cover (cloudcover in oktas),
a. Data preparation
Remove duplicated observations if any, and remove observations with missing values if any are present in the data. Remove the columns which contain duplicated information. Make a factor variable season instead of the dummies for spring, summer and fall.
b. Descriptives
Inspect the data using descriptives and graphs: i.e., summary, pairwise scatterplots, parallel boxplots where
relevant. Comment on the relation between volume and all the remaining variables.
c. Linear model
Try to estimate the model with volume as the dependent variable and all other variables as independent variables. What do you see? Are there any problems with the estimation? Comment on the results. If needed, adjust the variables in the sample and re-estimate the model. Briefly comment on the relationship between the variables and volume (focus on the sign of the coefficients rather than the magnitude).
d. Model comparison I
Test whether cloudcover can be dropped from the regression model given that precipitation, hightemp,
and lowtemp are retained. Use the F statistic and level of significance 0.01. Hint: you can use anova().
State the hypotheses, the corresponding p-value, and conclusion in terms of the problem.
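A hedged sketch of the nested-model F test (assuming the RailTrail data from mosaicData and the variable names above):

library("mosaicData")
m_red  <- lm(volume ~ precip + hightemp + lowtemp, data = RailTrail)
m_full <- lm(volume ~ precip + hightemp + lowtemp + cloudcover, data = RailTrail)
anova(m_red, m_full)   # H0: the coefficient of cloudcover is zero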
e. Model comparison II
Test whether in the model containing precip, hightemp and lowtemp variables hightemp and lowtemp
should have equal coefficients or rather different coefficients. Use the F statistic and level of significance
0.01. State the hypotheses, the corresponding p-value, and conclusion in terms of the problem.
f. Model selection
Use the step function from the base stats package in R to perform backward model selection starting from
a linear model containing all possible variables (i.e., precip, hightemp, lowtemp, cloudcover, dayType,
season). Check the documentation of the step() function and make sure you understand how the function
works. What is the final model selected by this procedure?
Task 25
a.
Write a function which for argument n constructs the n×n Hilbert matrix where the entry (i, j) = 1/(i+j−1).
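A minimal sketch using outer():

hilbert <- function(n) outer(seq_len(n), seq_len(n),
                             function(i, j) 1 / (i + j - 1))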
b.
For n = 3 and n = 10, using the microbenchmark library, compare computing the inverse via the Cholesky decomposition with computing it via solve(). Comment on the results.
Task 26
a.
Compute $Y^\top Y$ and $Y Y^\top$ by matrix multiplication and by the built-in functions. Benchmark the results for 100 repetitions. Discuss the results.
b.
Microbenchmark the operations $A^2 v$, $(AA)v$ and $A(Av)$. Discuss the results.
c.
Assume $B = I_{1000}$ is the identity matrix. Similar to b., microbenchmark the operations $Av$, $ABv$, $(AB)v$ and $A(Bv)$. Discuss the results.
Task 27
a.
For X being the Hilbert matrix for n = 6, calculate $H = X(X^\top X)^{-1} X^\top$ (try to do this in an efficient way).
b.
Compute the eigenvectors and eigenvalues of H in R (you can use the built-in function). Look at the object obtained and check its structure using str(). Extract the vector of eigenvalues.
c.
d.
e.
Compute the inverse of X and its eigenvalues and eigenvectors. Is there any relationship between the
eigenvalues of the inverse and of the original matrix?
Task 28
a.
Try to compute in R the Cholesky decomposition of the Hilbert matrix for n = 20. Comment on the results and explain what you think is happening in the background.
b.
The function kappa() can be used to find the condition number of a given matrix (the ratio of the largest to smallest non-zero singular values). This gives an idea of how badly certain matrix calculations will behave when applied to the matrix. Large values indicate poor numerical properties. Calculate the condition number for the 3 × 3, 5 × 5, 7 × 7 and 20 × 20 Hilbert matrices. Interpret the results.
Task 29
Assume y is a numeric n-vector (the response) and X the n × p (n > p) model matrix having full column
rank. Let β be the coefficient vector of interest having length p.
The ordinary least squares problem is then formulated as
$$\hat{\beta} = (X^\top X)^{-1} X^\top y.$$
a.
Write a function LmNaive which takes as input y and X and returns β̂. The function should be a naive
implementation of the above equation.
b.
Using solve with two arguments is more efficient than with one argument, and similarly crossprod and tcrossprod have advantages over the standard matrix multiplication operator %*%. Implement a function using these functions and name it LmCP.
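A minimal sketch of what LmCP could look like (solving the normal equations $(X^\top X) b = X^\top y$ directly):

LmCP <- function(y, X) solve(crossprod(X), crossprod(X, y))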
c.
d.
Implement a function called LmSvd which uses singular value decomposition of X in computing the OLS
solution.
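A minimal sketch: with the (thin) SVD $X = U D V^\top$, the OLS solution is $\hat{\beta} = V D^{-1} U^\top y$:

LmSvd <- function(y, X) {
  sv <- svd(X)
  sv$v %*% (crossprod(sv$u, y) / sv$d)   # divide componentwise by the singular values
}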
e.
Another way to compute $\hat{\beta}$ is by solving the system $X\hat{\beta} = y$ using a QR decomposition of the non-quadratic matrix X. There is a built-in function qr.solve(X, y) for this purpose. However, here you should implement a function called LmQR by hand. Hint: check how the QR decomposition can be used for solving linear systems of equations. Use the two-step approach here.
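A minimal sketch of the two-step approach ($X = QR$, then backsolving the triangular system $R b = Q^\top y$):

LmQR <- function(y, X) {
  qrX <- qr(X)
  backsolve(qr.R(qrX), crossprod(qr.Q(qrX), y))   # solves R b = Q'y
}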
f.
Familiarize yourself with the function lm.fit() and describe in words what it does.
g.
set.seed(1)
n <- 50
x <- rt(n, 2)
eps <- rnorm(n,0, 0.1)
y <- 3 - 2 * x + eps
X <- cbind(1, x)
colnames(X) <- c("Intercept","x")
Check for each of the functions in subtasks a–f the OLS solutions. Then use the microbenchmark package for timing comparisons (repeat 10 times) of all versions above and of the qr.solve(X, y) and lm.fit(X, y) functions. Discuss the results.
Task 30
A matrix is called sparse if most of its entries are zero and dense otherwise. Typically, a sparse matrix has a
degree of sparsity (i.e., percentage of zero entries) of at least 90%. Building special functionality for sparse
matrices can lead to memory savings and improved computation time.
a.
and β is an n × 1 vector.
How many flops (one flop is a unit of computation which could denote one addition, subtraction, multipli-
cation or division of floating point numbers) are required to compute y = Xβ if we do not take the sparsity
of X into account?
How can we calculate y by taking the sparsity structure of X into account? How many flops would be
required for this operation?
b.
One way to represent sparse matrices is through a coordinate list with the following elements: the row index i, the column index j and the value x of each non-zero entry.
In R, write a function make_sparse_cl which takes a (sparse) matrix M as an argument and converts it into a data frame with 3 columns, where the columns contain the 3 elements i, j and x. Hint: you can use which(..., arr.ind = TRUE). Call the function on the following matrix:
set.seed(1234)
n <- 30
M <- matrix(0, nrow = n, ncol = n)
M[sample(1:(n^2), floor(n^2 * 0.1))] <- rnorm(floor(n^2 * 0.1))
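A minimal sketch of such a function, following the hint:

make_sparse_cl <- function(M) {
  idx <- which(M != 0, arr.ind = TRUE)   # row/column indices of the non-zeros
  data.frame(i = idx[, 1], j = idx[, 2], x = M[idx])
}
str(make_sparse_cl(M))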
c.
Using the Matrix package, one can cast a matrix M into a sparse format by as(M, "sparseMatrix"). Using
str(as(M, "sparseMatrix")) inspect the resulting object. Explain the “slots” or elements of the resulting
object.
d.
Write a function sparse_add for sparse matrix addition which takes two data frames as in b. as arguments. The function should return another data frame with columns i, j, x containing information on the non-zero entries of the sum matrix. Hint: you can make use of the function merge(..., all = TRUE).
e.
Many simple operations such as standardization (more specifically centering) can destroy the sparsity of a
matrix. This can be problematic, for example, in a regression context where the X matrix is sparse.
One example where standardizing is important is in estimating a linear regression with ℓ1 -penalty (also
known as lasso regression):
$$\hat{\beta}_{\ell_1}(\lambda) \in \arg\min_b \, \|y - Xb\|_2^2 + \lambda \|b\|_1.$$
set.seed(1)
X <- matrix(as.numeric(runif(2000) > 0.9), ncol = 20)
y <- X[,1] * 0.1 + rnorm(100, sd = 0.1)
Task 31
Linear regression models are often used for prediction purposes, i.e., to predict new observations $y^{new}$ based on a collection of $x^{new}$ values. Ideally, when comparing different models, we should focus on how well the
model predicts on unseen data, not only by in-sample measures of fit such as R2 , which measure how well
the model fits the given data. For this purpose, a common strategy is to split the data at hand into a train
and a test sample. The train sample will be used for estimating the regression coefficients, while the test
data will be used to check how well the model is predicting the response variable.
Consider the following data where the goal is to explore the relation of heat capacity for a type of acid and
temperature.
The model selection exercise here is concerned with choosing the degree of the polynomial K to include in
the model.
a.
Split the heatcap data set into a train set containing 15 observations (roughly 80% of the sample) and a
test set containing 3 observations. The observations should be randomly allocated to either train or test set.
b.
For the train set estimate the polynomial regression for K = 5. Specify the formula in lm() “by hand”: Cp ~ Temp + I(Temp^2) + .... Check the summary.
c.
If K is large it is inconvenient to type the formula by hand. The poly() function in R can be used instead.
Estimate the linear regression. Compare the coefficients and the goodness of fit of this model and the one
in b. Hint: Note that by default poly will create orthogonal polynomials.
d.
For the linear model in c, use the predict(lm.object, ...) function to predict the heat capacity Cp on
the test sample based on the fitted regression model on the train data. Compute the root mean squared
error between the predicted and the observed Cp on the test sample:
$$RMSE = \sqrt{\frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \left( \hat{y}_i - y_i^{observed} \right)^2}$$
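A hedged sketch, assuming fit_train is the model from c. and test is the test data frame:

pred <- predict(fit_train, newdata = test)
rmse <- sqrt(mean((pred - test$Cp)^2))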
e.
Repeat the prediction exercise in d. for K = 1, . . . , 7. Choose the degree K which delivers the smallest
RMSE.
Task 32
Consider the following data, which contains the delay (in minutes) in the departure of a random sample of flights for two different airlines “AA” and “AB” operating at Vienna International Airport. Note that negative numbers represent departures before the scheduled time.
yAA <- c(2.5, 2.9, 4.5, 5, 5.5, 6.5, 7.5, 8.48, 8.49, 8.5, 8.5, 8.5,
8.9, 9, 9.1, 9.2, 11.5, 28.5)
yAB <- c(-10, -9.9, -8.1, -8, -7.5, -0.9, -0.5, -0.4, 0, 0, 1, 8.5, 8.6,
8.6, 8.7, 8.8, 9, 9.5, 10, 10.1, 10.5, 15, 32)
a.
Look at the distribution of delays for the two airlines. Generate boxplots, histograms and density plots for
the two airlines (superimposed, not in separate plots). Add appropriate coloring and legends where needed.
Comment on what you see in terms of differences between the distributions (location, dispersion, skewness).
b.
Perform a two sample t-test to check whether the difference in delays between the two airlines is statistically significant (assume a 10% α-level). Comment on the R output and interpret the results.
c.
Look at normal qqplots for the two groups. Do you see deviations from normality? What would this imply
for the t-test performed?
d.
If you think the two sample t-test is the most appropriate test for this data, clearly state the arguments why this is the case. Otherwise, perform in R another test for the difference in location between two distributions which you consider more appropriate than the two sample t-test for this data.
Task 33
In this exercise you should examine how the level accuracy of the one sample t test depends on the underlying distribution of the X variable. Let $X_1, \ldots, X_n$ be an i.i.d. sample with mean $\mu$ and consider the hypothesis $H_0: \mu = 0$. The standard test rejects when $|T| > c_0$, where $T = \sqrt{n}\,\bar{x}/s$ is the test statistic, $\bar{x}$ and $s$ are the sample mean and standard deviation, respectively, and $c_0$ is the $1 - \alpha/2$ quantile of the t-distribution with $n - 1$ degrees of freedom for a nominal significance level $\alpha$.
Remember to set a seed at the beginning of the code.
a.
Run 10000 repetitions to estimate the empirical Type I error rate (effective significance level) for standard
normal data when n = 15. Assume a nominal α of 5%. Compute also the standard error of the empirical
Type I error rate.
Hint: After simulating the mean zero data which is in accordance to the null hypothesis, run the one sample
t test with a significance level of α. Record whether you reject the null or not. The empirical Type I error rate is the percentage of rejections over the repetitions. For computing the standard error, use the fact that this
estimator is a proportion.
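A minimal sketch of the simulation in a. (the seed is illustrative):

set.seed(1)
M <- 10000
reject <- replicate(M, t.test(rnorm(15))$p.value < 0.05)
type1  <- mean(reject)                   # empirical Type I error rate
se     <- sqrt(type1 * (1 - type1) / M)  # standard error of a proportion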
b.
Repeat a. with
• a gamma distributed variable that is the sum of three exponentials with rate 1: $X = Y_1 + Y_2 + Y_3$, $Y_j \sim Exp(1)$, $j = 1, 2, 3$;
• a Cauchy distribution with location 0 and scale 1: $X \sim Cauchy(0, 1)$ (note that for the Cauchy the first and second moments do not exist, so the central limit theorem does not hold).
Make sure you translate the data so that expectation of X is equal to zero in each case.
c.
d.
Task 34
Especially when dealing with time series data, it is quite common that error terms are serially correlated
(also called auto-correlated). You should examine the effect of having auto-correlated errors in a simple
regression framework which assumes iid errors and ignores the serial correlations.
$$y_t = \beta_0 + \beta_1 x_t + \epsilon_t, \qquad \epsilon_t \overset{iid}{\sim} N(0, \sigma^2)$$
Using simulation, you will investigate the behavior of the standard least-squares estimate of β1 . (Note that
one can also analyze the impact of the serial correlation analytically, but this is a computational class :-))
a.
Write a function sim_y_ar_errors which takes as arguments n, x (the independent variable) and parameters $\beta_0$, $\beta_1$, $a$, $\sigma_\eta^2$, and returns a response vector y simulated from the regression model above, where the errors are serially correlated rather than iid normal.
For creating auto-correlated errors, use the following autoregressive AR(1) model: $\epsilon_t = a \epsilon_{t-1} + \eta_t$, where $\eta_t \overset{iid}{\sim} N(0, \sigma_\eta^2)$. Hint: you can use the arima.sim() function in R to generate this type of errors.
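A hedged sketch of such a function (the argument names are assumptions):

sim_y_ar_errors <- function(n, x, beta0, beta1, a, sigma2_eta) {
  if (a == 0) {
    eps <- rnorm(n, sd = sqrt(sigma2_eta))   # a = 0: plain iid errors
  } else {
    eps <- as.numeric(arima.sim(model = list(ar = a), n = n,
                                sd = sqrt(sigma2_eta)))
  }
  beta0 + beta1 * x + eps
}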
b.
Set up a simulation study where for M = 1000 replications you generate a response y by drawing a different
set of errors in each replication (remember to set a seed at the beginning of the code).
• Use n = 20
• As an independent variable assume X ∼ Unif(−1, 1) – you can use the same X in all repetitions, as
we consider x to be deterministic in the model.
• Assume β0 = 0, β1 = 2, ση2 = 1 throughout the analysis.
• For a you should use the scenarios: a = {−0.9, −0.1, 0, 0.1, 0.9}. Note that a = 0 corresponds to the
no violation case.
c.
For each scenario ($a \in \{-0.9, -0.1, 0, 0.1, 0.9\}$), estimate the bias $E(\hat{\beta}_1) - \beta_1$ and the variability $Var(\hat{\beta}_1)$ of the standard OLS estimator based on the M repetitions. Discuss your results. Compare with the bias and variability you nominally have from the elementary least-squares theory that does not take serial correlations into account (see e.g., Chapter 6, slide 14).
Hint: you can use lm to get β̂1 for each replication or the closed form formula.
Task 35
Many statistical learning applications require the prediction of a response variable that takes on categorical
values, rather than numerical values. The binary case, where the response variable can take one of two values
or factor levels, is the most commonly encountered:
$$y_i = \begin{cases} 1, & \text{category}_i = A \\ 0, & \text{category}_i = B \end{cases}$$
a.
In principle there is no “algorithmic” reason why we cannot estimate a linear regression for the type of
problem above (such a model is sometimes referred to as the linear probability model).
$$P(y = 1) = \beta_0 + X_1 \beta_1 + \ldots + X_p \beta_p + \epsilon$$
b.
A more common model for binary response is the logistic regression model, where the probability of observing
a one is modeled as:
$$p_i = P(y_i = 1) = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)}.$$
Note that the model above is a member of the class of generalized linear models (GLMs). When we observe a sample of n observations, the likelihood function of such a model is given by

$$L(\beta; y) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$
Derive the log likelihood as well as the gradient and the Hessian matrix of the log likelihood for the logistic
regression model.
c.
Implement a function logistic_nr based on the Newton-Raphson algorithm which finds the vector of coefficients β that minimizes the negative log-likelihood function of the logistic regression model. For a convergence criterion you can use the Euclidean distance between $\hat{\beta}^{(i)}$ and $\hat{\beta}^{(i+1)}$:

$$\sqrt{\left( \hat{\beta}^{(i+1)} - \hat{\beta}^{(i)} \right)^\top \left( \hat{\beta}^{(i+1)} - \hat{\beta}^{(i)} \right)} < tol.$$
Allow a maximum of maxit iterations in the function. If convergence is not achieved before maxit iterations,
you should print a warning using warning("Maximum iterations reached without convergence!")
Important: Avoid inverting the Hessian matrix (for a hint see slide 5 of Chapter 9).
d.
Run the linear probability model and the logistic regression model (using logistic_nr) for the following X
and y.
set.seed(1234)
n <- 1000
p <- 2
beta <- c(0.2, 2, 1)
X <- cbind(1, matrix(rnorm(n * p), ncol = p))
mu <- 1 / (1 + exp(-X %*% beta))
y <- as.numeric(runif(n) < mu)
Compare the coefficients obtained by your logistic_nr function with the ones of the linear probability
model and with the coefficients obtained when running the built-in function in R for logistic regression:
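A hedged sketch of the built-in comparison (assuming the first column of X is the intercept):

glm_fit <- glm(y ~ X[, -1], family = binomial())
coef(glm_fit)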
Task 36
How many zeros does the function $f(x) = \sin(10x) - x$ have? Find all of them using a root solver of your choice. Hint: inspecting the graph of the function will be helpful.
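A minimal sketch: plot first, then bracket each sign change for uniroot():

f <- function(x) sin(10 * x) - x
curve(f, -2, 2); abline(h = 0, lty = 2)
uniroot(f, c(0.2, 0.4))$root   # one root; repeat with the remaining brackets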
Task 37
Rosenbrock’s banana function is a commonly used test function for optimization algorithms, given by $f(x, y) = (a - x)^2 + b (y - x^2)^2$.
a.
Use a contour plot to inspect the function for a = 1, b = 100. Why is this function difficult to optimize?
Hint: you can look at filled.contour().
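A hedged sketch of the contour plot for a = 1, b = 100:

rosen <- function(x, y, a = 1, b = 100) (a - x)^2 + b * (y - x^2)^2
xs <- seq(-2, 2, length.out = 200)
ys <- seq(-1, 3, length.out = 200)
filled.contour(xs, ys, outer(xs, ys, rosen), nlevels = 30)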
b.
Try to find the minimum of f with a = 1, b = 100 using a starting value of (0, 10). Try different optimizers
in R in case the first one fails.
Task 38
Using built-in functions in R (and its extension packages), find non-negative $x_1$, $x_2$, $x_3$ and $x_4$ to minimize
Task 39
Consider a portfolio management problem where the portfolio weights should be allocated to a collection of three stocks such that the following function is minimized:
$$Q(x) = \frac{k}{2} x^\top V x - m^\top x$$
where k is a risk aversion parameter which is fixed in advance. Assume that the mean and covariance of the
monthly returns for the three stocks are
a.
Find the optimal weights x under the assumption that short-selling is not allowed for k = 4 and k = 1,
respectively. How does being less risk-averse affect the investor’s behavior?
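A hedged sketch using the quadprog package; m and V below are placeholders for the mean vector and covariance matrix given in the task:

library("quadprog")
m <- c(0.01, 0.02, 0.015)   # placeholder values, not the task's data
V <- diag(3) * 0.001        # placeholder values, not the task's data
k <- 4
# solve.QP minimizes (1/2) b' D b - d' b, so D = k * V and d = m match Q(x)
res <- solve.QP(Dmat = k * V, dvec = m,
                Amat = cbind(rep(1, 3), diag(3)),  # sum(x) = 1 and x >= 0
                bvec = c(1, 0, 0, 0), meq = 1)
res$solution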
b.
Find the optimal weights x under the assumption that short-selling is allowed for k = 4 and k = 1,
respectively. How does being less risk-averse affect the investor’s behavior?