23-24Exam-withanswers
———————————————————————————————
Exam
Version 2
———————————————————————
15-01-2024, 08:30 - 10:30, Exam Hall 1
• Please state clearly on this form your name and student number (see above).
• This is a closed book, 2-hour, on campus exam.
• This exam consists of 8 multiple-choice questions (max 40 points) and 3 open questions
with sub-questions (max 60 points). Every multiple-choice question is worth 5 points.
The number of points you can obtain for the open questions is
indicated for every question. A total of 100 points can be scored on this exam.
• Please mark your version number and answers to the multiple choice questions
on the multiple-choice form.
• Use this exam to write your answers to the open questions.
• Unless it is specified otherwise, assume a 5% significance level for all questions.
• Use of a graphical calculator is forbidden.
• Extra writing-space for the open questions may be found at the back of the exam.
Good luck!
Multiple choice questions
1. You want to test a hypothesis that female students have the same average grade for Statistics
as male students using data that contains grades for all students in the current academic year
and their gender. Which of the following tests is suitable for this purpose?
a. One sample t-test
b. Independent samples t-test
c. Wilcoxon Signed Rank Test for Matched Pairs
d. Paired-Samples t-test
Answer B
2. You are testing the null-hypothesis that life expectancy is equal to 71 years against the
alternative hypothesis that life expectancy is less than 71 years (at a 5% significance level).
The Stata output may be found below. Which of the following statements is true for this test?
a. We reject the null hypothesis because the p-value is 0.0286 which is less than 0.05.
b. We cannot reject the null hypothesis because the p-value is 0.9857 which is more
than 0.05
c. We reject the null hypothesis because the p-value is 0.0143 which is less than 0.05
d. We cannot reject the null hypothesis because the p-value is 0.0286 which is less than
0.05
Answer B
3. Which of the following statements is not true about testing association between two
variables?
a. The expected frequency in each cell of the contingency table must exceed 10
b. The test is based on a χ² distribution
c. The test of association is a goodness-of-fit test.
d. The test of association tests the relationship between two qualitative variables
Answer A.
The expected count should be a minimum of 5.
4. John is an experienced financial analyst studying the financial characteristics of a sample of
technological (tech) companies. He would like to test the statement “The yearly cash flows
(thousand $) generated by tech companies are normally distributed”. The table below shows
the descriptive statistics of the variable cashflow obtained in Stata. Which of the following
statements is true?
Answer B
JB = n* [(skewness)^2 / 6 + (kurtosis – 3)^2 / 24]
JB = 164* [(4.640159)^2 / 6 + (25.89162 – 3)^2 / 24] = 4169.36
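The Jarque-Bera calculation above can be checked with a short sketch. The sample size, skewness and kurtosis are the values quoted in the answer key. The function name `jarque_bera` and the constant name are illustrative, and the 5% critical value of the chi-squared distribution with 2 degrees of freedom (about 5.991) is the standard table value, not something taken from the exam's Stata output.

```python
# Jarque-Bera statistic from the descriptive statistics quoted in the answer key.
n = 164               # number of companies in the sample
skewness = 4.640159   # sample skewness of cashflow
kurtosis = 25.89162   # sample kurtosis of cashflow (a normal distribution has 3)


def jarque_bera(n, skewness, kurtosis):
    """Return JB = n * [S^2 / 6 + (K - 3)^2 / 24]."""
    return n * (skewness ** 2 / 6 + (kurtosis - 3) ** 2 / 24)


jb = jarque_bera(n, skewness, kurtosis)

# Under H0 (normality) JB is approximately chi-squared with 2 degrees of freedom.
# 5.991 is the standard 5% critical value for that distribution.
CHI2_CRIT_5PCT_DF2 = 5.991
reject_normality = jb > CHI2_CRIT_5PCT_DF2

print(f"JB = {jb:.2f}, reject normality at 5%: {reject_normality}")
```

A statistic this far above the critical value means the null hypothesis of normally distributed cash flows is rejected.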
5. We are testing the effectiveness of a new fuel additive. We run an experiment with 12 cars.
We first run each car without the fuel treatment and measure the fuel efficiency (how many
kilometres per liter it can ride). We then add the fuel treatment and repeat the experiment. We
then run a Wilcoxon signed-rank test. The results of the experiment are shown below. Which
of the following statements is true about the test results?
a. We test a null hypothesis that the treatment has an effect on the fuel efficiency and
we cannot reject this hypothesis at a 5% significance level
b. We test a null hypothesis that the treatment has an effect on the fuel efficiency and
we reject this hypothesis at a 5% significance level
c. We test a null hypothesis that the treatment has no effect on the fuel efficiency and
we cannot reject this hypothesis at a 5% significance level
d. We test a null hypothesis that the treatment has no effect on the fuel efficiency and
we reject this hypothesis at a 5% significance level.
Answer D
6. Mandy is a bright social scientist studying the main characteristics of the labour market. She
has just finished regressing female labour participation on a set of independent variables. The
table below presents the regression output obtained using Stata. Which of the following
statements is true?
a. The correct F-statistic for testing overall significance is 0.395, and unemployment
contributes significantly to predict the dependent variable (alpha=5%).
b. The correct F-statistic for testing overall significance is 2.531, and population density
contributes significantly to predict the dependent variable (alpha=5%).
c. The correct F-statistic for testing overall significance is 2.531, and unemployment
contributes significantly to predict the dependent variable (alpha=5%).
d. The correct F-statistic for testing overall significance is 0.395, and population density
contributes significantly to predict the dependent variable (alpha=5%).
Answer C
F = MSR / MSE = 28.7722647 / 11.3664741 = 2.531
Unemployment contributes significantly because the 95% confidence interval does not include 0.
Furthermore, the t-value shown is significant if you look it up in Table 8 of the book.
Population density is insignificant, as evidenced by its t-value and confidence interval.
7. Two biochemists, Renzo and Femke, are studying data from the Food and Agriculture
Organization of the United Nations (FAO). They focus on daily data from European citizens
surveyed between 2018-2023. They are interested in conducting linear regression analysis on
the factors determining the level of hunger in a person.
Variable        Definition
Hunger          Feeling caused by lack of food, coupled with a desire to eat. Represented by a 100-point scale (0 if the individual is completely satiated and 100 if the individual is starving).
Temperature     Average daily temperature. Measured in degrees Celsius.
Minutes         Time elapsed since the last meal. Measured in minutes.
Day_or_Night    Dummy variable with a value of 0 if the measurement of hunger takes place between sunset and sunrise (night), and a value of 1 between sunrise and sunset (day).
Calories        Unit of measure of the energy people get from food and drinks consumed before the measurement of hunger.
Which of the following statements is NOT correct?
a. Day_or_Night has the largest impact on Hunger because it has the largest coefficient
value (around 5).
b. Temperature is significant at a significance level of 10%.
c. An increase of 1 minute in time elapsed since the last meal is expected to be
significantly associated with a decrease of 2.9 points in the level of hunger.
d. Compared to nighttime, people’s hunger is on average (around) 5 points higher
during daytime.
Answer A
Day_or_Night has the largest impact on “Hunger” because it has the largest Beta value (standardized
coefficient), not because it has the largest unstandardized coefficient.
Statement b is also incorrect, because Temperature is not significant at a significance level of 10%.
Both answers are accepted.
8. Which of the following is NOT one of the assumptions required to perform a one-way
Analysis of Variance?
a. Independent observations
b. Homogeneity of Variance
c. Absence of Multicollinearity
d. Normality of the Sampling Distribution
Answer C
Multicollinearity cannot exist in a one-way ANOVA, because there is only one independent variable.
Therefore, this is not a required assumption for a one-way ANOVA.
Open questions
9. (20 points) Laura wants to study the life expectancies of 68 countries in the world. She
explores her data file by running some descriptive statistics in Stata and then regresses the life
expectancy on variables access to safe water, gross national product per capita, population
growth, and region. You may find the Stata output below.
a. How well does her model explain the dependent variable? Motivate your answer by using
information from the regression output. (2 points + 1 bonus point)
b. Which variable(s) contribute(s) significantly to the prediction of life expectancy? Use a 5%
significance level. Please provide an interpretation for the coefficient(s) of the significant variable(s)
and the constant. (5 points)
c. Laura notices that Stata has left the region Europe & C. Asia out of the regression model. Should
Laura be worried about that? Explain why Stata left this region out and what would happen if it were
included. (4 points)
d. Laura decides to show her analysis to her friend Dylan, who is an expert in statistics. Dylan
suggests that she should perform a logarithmic transformation for the variable gnppc and use the
transformed variable in the regression analysis. What problems for a regression model can such a
transformation solve? (2 points + 2 bonus points)
e. Laura replaces the variable gnppc with the natural logarithm of gnppc in her regression. The Stata
output is shown below. Interpret the coefficient of the variable ln_gnppc. (4 points)
a) The F-test shows that the overall model is significant (1 bonus point). R-squared is given (1
point). The regression model explains 71.55% of the variation/variance in the dependent
variable (1 point). Adjusted R-squared gets zero points.
b) At the 5% level only safewater is significant (1 point). Interpretation: If the access to safe
water increases by 1%, the life expectancy increases on average (or is expected to) by 0.18
years, ceteris paribus. (2 points, 0.5 points reduction if units missing, if “on average” missing,
or if “ceteris paribus” missing). Constant: for a country that has zeros in all other variables,
the average life expectancy is 57.86 years. (2 points)
c) Should not be worried (1 point). Region is a dummy variable. Stata leaves out the reference
category (1 point) to avoid the dummy trap / perfect multicollinearity (2 points).
d) Logarithmic transformation can help with non-linearity (2 points) and also heteroskedasticity
(2 points).
e) A 1% increase in the gross national product per capita is expected to result in a
1.586536*ln(1.01)= 0.01578655811 ≈ 0.016 year increase in life expectancy, ceteris paribus.
(4 points, 1 point reduction if “ceteris paribus” missing).
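The level-log interpretation in answer e) can be verified numerically. The sketch below uses the coefficient quoted in the answer key. The variable names are illustrative, and the division-by-100 shortcut is the usual textbook approximation rather than something stated in the exam.

```python
import math

# Coefficient of ln_gnppc as quoted in the answer key.
beta_ln_gnppc = 1.586536

# Exact change in predicted life expectancy (years) when gnppc rises by 1%:
# beta * [ln(1.01 * gnppc) - ln(gnppc)] = beta * ln(1.01)
exact_effect = beta_ln_gnppc * math.log(1.01)

# Common shortcut for small percentage changes: beta / 100.
approx_effect = beta_ln_gnppc / 100

print(f"exact: {exact_effect:.5f} years, shortcut: {approx_effect:.5f} years")
```

Both figures round to about 0.016 years, which is why either route is normally accepted for small percentage changes.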
10. (20 points) Doctor Feinstein is a neuropsychologist interested in how primates learn to
perform tasks. Using chimpanzees as test subjects, she gave the animals a simple task:
opening a sealed transparent food container using a wooden stick. A sample of 20 wild
chimpanzees was divided based on four learning techniques: “Friend” (chimp watched a
friendly chimpanzee opening the container), “Foe” (chimp watched a non-friendly
chimpanzee opening the container), “Trainer” (chimp observed one of the trainers do the
task), and “Reward” (chimp was offered a larger container with more food inside). The time
taken by the animals to solve the task was recorded. Doctor Feinstein performed various tests
on the data, for which the resulting output is presented below.
a. Please provide a definition or explanation of the Total Sum of Squares and its components when
using a one-way ANOVA. (6 points)
The components of SST are SSW and SSG, since SST = SSW + SSG.
SST = Total Sum of Squares. Total Variation – the aggregate dispersion of the individual data values
(across the various groups). (2 points)
SSW = Sum of Squares Within Groups. Within-Group Variation – dispersion that exists among the
data values within a particular group. (2 points)
SSG = Sum of Squares Between Groups. Between-Group Variation – dispersion between the group
sample means. (2 points)
b. Compute the F-statistic of the one-way ANOVA. Provide the null and alternative hypotheses of the
F-test. What can you say about the one-way ANOVA results in the context of Dr Feinstein’s research?
(5 points)
MSW=233.4/36=6.4833 (1 point)
Test statistic: F=MSG/MSW=127.5/6.4833=19.67 (1 point)
Since the F-statistic has a p-value below 5% (significant even at lower significance levels), the results
are significant. (1 point)
This means that we should expect different means, on average, for these groups. In other words, the
expected number of minutes needed to perform the task is not the same across the different learning
techniques. Some techniques seem to perform better than others. (1 point)
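The F-statistic in answer b) can be reproduced in a few lines. The sketch assumes the figures quoted in the answer key (SSW = 233.4 with 36 within-group degrees of freedom, and MSG = 127.5), since the Stata output itself is not reproduced here. The variable names are illustrative.

```python
# One-way ANOVA F-statistic from the quantities quoted in the answer key.
ssw = 233.4      # sum of squares within groups
df_within = 36   # within-group degrees of freedom, as used in the answer key
msg = 127.5      # mean square between groups

# Mean square within groups: within-group variation per degree of freedom.
msw = ssw / df_within

# F compares between-group variation with within-group variation.
f_stat = msg / msw

print(f"MSW = {msw:.4f}, F = {f_stat:.2f}")
```

A large F means the group means differ by more than the within-group noise would suggest, which is what leads to rejecting the null hypothesis of equal means.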
c. Doctor Feinstein cannot decide between the one-way ANOVA and Kruskal-Wallis test. She therefore
asks for your professional advice. Please explain which test result you would use by making use of the
assumptions. (5 points)
The one-way ANOVA assumes normality and equality of variances. (1 point for each)
The normality assumption does not hold. We can see that in both normality tests presented in the
tables (swilk and sfrancia), where the null hypothesis of normality is rejected at the 5% significance
level in both cases. (1 point)
On the other hand, the null hypothesis of equality of variances is not rejected by Levene’s test using
an alpha of 5%, so this important assumption holds. (1 point)
The KW test is a non-parametric version of the one-way ANOVA, used when the normality assumption
of the one-way ANOVA is violated. For this reason, we advise using the Kruskal-Wallis test. (1
point)
d. Please provide the hypotheses for the Kruskal-Wallis test and interpret the test result. Does the
conclusion differ from the one-way ANOVA result? (4 points)
The null hypothesis of this rank-based test is the same as for the traditional one-way ANOVA (2
points):
Testing for equality of population means…
H0: μ1 = μ2 = ... = μK
H1: μi ≠ μj for at least one i, j pair
In this case, the KW test also rejects the null hypothesis (1 point): the expected number of minutes
needed to perform the task is not the same across the different learning techniques. (1 point)
11. (20 points) You want to create a forecasting model for the Gross National Product (GNP) of
the United States using quarterly data from 1992 to 2002. You first draw a line graph using
Stata.
a) You decide to try exponential smoothing for forecasting. The Stata output may be found
below. The GNP for Quarter 2 of 2002 is 9678.4 and the smoothed value is equal to 9658.98.
What is the predicted GNP for Quarter 3 of 2002 according to this model? Explain your
answer. (5 points)
Answer: The smoothed value in the current period is used as the forecast value for the next
period.
Since α is almost 1, an acceptable answer is $\hat{x}_t = x_t = 9678.4$, but more precisely
$\hat{x}_t = (1 - \alpha)\hat{x}_{t-1} + \alpha x_t = 9676.4$ (5 points)
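The smoothing recursion used in this answer can be sketched as follows. The estimated α from the Stata output is not reproduced here, so the value 0.9 below is purely hypothetical. Treating the quoted smoothed value 9658.98 as the previous smoothed value is also an assumption, and the function name `ses_update` is illustrative.

```python
def ses_update(alpha, observed, previous_smoothed):
    """One step of simple exponential smoothing:
    new smoothed value = (1 - alpha) * previous smoothed value + alpha * observation.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1 - alpha) * previous_smoothed + alpha * observed


x_t = 9678.4               # observed GNP, Quarter 2 of 2002 (from the question)
previous_smoothed = 9658.98  # smoothed value quoted in the question (assumed role)

# With alpha exactly 1 the forecast collapses to the latest observation,
# which is the simplified answer the key accepts.
limit_case = ses_update(1.0, x_t, previous_smoothed)

# HYPOTHETICAL alpha, for illustration only; the real estimate is in the Stata output.
illustrative = ses_update(0.9, x_t, previous_smoothed)

print(limit_case, illustrative)
```

For any α strictly between 0 and 1 the forecast lies between the previous smoothed value and the latest observation, and it moves toward the observation as α approaches 1.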
b) Next you decide to fit an AR(1) process. However, before running the autoregressive model,
you regress the GNP variable on the first lag of the GNP variable and run a Durbin-Watson
test in Stata. See the output below. You find the following cutoff points for your
Durbin-Watson Test Statistic: DL=1.39 and DU=1.60. Interpret the Durbin-Watson test results. (5
points)
The test statistic lies between the cutoff points: DL<DW<DU. (1 point)
This means that the test is inconclusive. (2 points)
We can’t tell if additional lags should be included. (2 points for conclusion)
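The decision rule behind this answer can be written out as a small helper. The cutoff points are the ones given in the question. The test statistic 1.5 is a hypothetical value chosen to fall between them, because the Stata output is not reproduced here, and the function name is illustrative.

```python
def durbin_watson_decision(dw, d_lower, d_upper):
    """Classify a Durbin-Watson statistic using the standard decision regions."""
    if not 0.0 <= dw <= 4.0:
        raise ValueError("the Durbin-Watson statistic lies between 0 and 4")
    if dw < d_lower:
        return "positive autocorrelation"
    if dw <= d_upper:
        return "inconclusive"
    if dw < 4 - d_upper:
        return "no autocorrelation detected"
    if dw <= 4 - d_lower:
        return "inconclusive"
    return "negative autocorrelation"


D_LOWER, D_UPPER = 1.39, 1.60   # cutoff points given in the question

# HYPOTHETICAL statistic between the cutoffs; the real value is in the Stata output.
example_dw = 1.5
decision = durbin_watson_decision(example_dw, D_LOWER, D_UPPER)
print(decision)
```

Any statistic between DL and DU falls in the inconclusive region, which matches the answer key's conclusion.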
c) Finally, you fit the AR(1) on the GNP data. You may find the results below. Did this yield a
good fit? Motivate your answer by referencing appropriate statistics (including the relevant
hypotheses) and graphs. (6 points)
Answer:
The model shows a good fit. (1 point)
Portmanteau (Q) statistics are all statistically insignificant (1 point), therefore we cannot reject H0
that all autocorrelation coefficients are zero. (1 point)
ACF and PACF show that none of the lags are significantly correlated. (2 points)
This suggests that the residuals follow a white noise process. (1 point)
d) The GNP of the US for the second quarter of 2002 was 9678.4 bn USD. According to the
AR(1) model, what is the predicted GNP for the third quarter of 2002? (4 bonus points)
Answer: