Stats Exam
To verify that the p.m.f. given in Table 1 is valid, we need to check the following two conditions:
Each individual probability lies between 0 and 1.
The probabilities sum to 1.
From Table 1:
0.28+0.19+0.16+0.15+0.12+0.10=1.00
Since the sum is 1 and all the individual probabilities are between 0 and 1, the p.m.f. is
valid.
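The two conditions can also be checked numerically. The following is a minimal sketch using the six probabilities from Table 1; the helper name `is_valid_pmf` and the tolerance are illustrative assumptions, not part of the exam solution.

```python
import math

# Probabilities read from Table 1.
probs = [0.28, 0.19, 0.16, 0.15, 0.12, 0.10]

def is_valid_pmf(p, tol=1e-9):
    """Return True if every value lies in [0, 1] and the values sum to 1."""
    in_range = all(0.0 <= v <= 1.0 for v in p)
    # Compare with a tolerance because floating-point sums are rarely exact.
    sums_to_one = math.isclose(sum(p), 1.0, abs_tol=tol)
    return in_range and sums_to_one

print(is_valid_pmf(probs))
```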
The cumulative distribution function F(x) for a discrete random variable X is defined as
F(x)=P(X≤x)
Since 0.5 is between F(1) and F(2), the median number of days a pupil attends the breakfast club
is 2.
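The median can be read off by accumulating the probabilities until the running total first reaches 0.5. The sketch below assumes that the six probabilities in Table 1 correspond to x = 0, 1, ..., 5 days; this is an assumption, chosen because it is consistent with 0.5 lying between F(1) and F(2).

```python
from itertools import accumulate

# Assumption: the six Table 1 probabilities belong to x = 0, 1, ..., 5 days.
days = [0, 1, 2, 3, 4, 5]
probs = [0.28, 0.19, 0.16, 0.15, 0.12, 0.10]

# F(x) = P(X <= x) is the running total of the probabilities.
cdf = list(accumulate(probs))

# Median: the smallest x whose cumulative probability reaches 0.5.
median = next(x for x, F in zip(days, cdf) if F >= 0.5)

print([round(F, 2) for F in cdf])
print(median)
```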
Summary
(a) The p.m.f. in Table 1 satisfies all the requirements for a valid p.m.f. since all probabilities are between
0 and 1, and their sum is exactly 1.
Using the c.d.f., the median number of days per week that a pupil attends the breakfast
club is 2, as F(2) is the first value that reaches or exceeds 0.5.
Question 2:
P(0.5<X<0.75)=F(0.75)−F(0.5)
Calculate F(0.75):
Calculate F(0.5):
Therefore,
Question 3:
1+2+7+7+15+16+23+15+14=100
The data displayed in the histogram could likely be generated from a Poisson distribution.
Reasoning:
The Poisson distribution is often used for count data and can exhibit right-skewness,
similar to the distribution seen in the histogram.
The histogram shows a unimodal shape with a peak and a long tail on the right, which is
characteristic of the Poisson distribution when the mean is relatively low.
Question 4:
Part (a)
Given:
Number of trials n=50
Probability of defect p=0.025
P(X>2)=1−P(X≤2)
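The complement rule above can be evaluated directly. This is a minimal sketch that assumes X follows a binomial distribution B(50, 0.025), which is what the stated number of trials and defect probability suggest; the function name `binom_pmf` is illustrative.

```python
from math import comb

n, p = 50, 0.025  # values given in Part (a)

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(X <= 2) = P(X = 0) + P(X = 1) + P(X = 2)
p_at_most_2 = sum(binom_pmf(k, n, p) for k in range(3))
p_more_than_2 = 1 - p_at_most_2

print(round(p_more_than_2, 4))
```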
Part (c)
Approximate Distribution: Poisson Distribution
For large n and small p, X can be approximated by a Poisson distribution with λ=np:
λ=50×0.025=1.25
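To see how the approximation behaves, the sketch below evaluates P(X>2) under a Poisson distribution with λ = np = 1.25. It assumes that the tail probability from Part (a) is the quantity of interest; the function name `poisson_pmf` is illustrative.

```python
from math import exp, factorial

n, p = 50, 0.025
lam = n * p  # Poisson rate, λ = np

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# Approximate P(X > 2) as 1 - P(X <= 2) under the Poisson model.
approx_tail = 1 - sum(poisson_pmf(k, lam) for k in range(3))

print(lam)
print(round(approx_tail, 4))
```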
Question 5:
Part (b)
Given:
We need to find the probability that the time between emails exceeds one hour. In a Poisson
process, the time between arrivals follows an exponential distribution with parameter λ.
For T ∼ Exp(λ), the probability that the time T between emails exceeds one hour is:
P(T>1)=∫_1^∞ λe^(−λt) dt
For λ=3:
P(T>1)=∫_1^∞ 3e^(−3t) dt
Calculate the integral:
P(T>1)=[−e^(−3t)]_1^∞=e^(−3)≈0.0498
Thus, the probability that the time between emails exceeds one hour is approximately 0.0498.
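The result can be checked in two ways: with the closed-form survival function of the exponential distribution, and with a crude numerical integration of the density. Both helper names below are illustrative, and the upper cutoff of 20 hours is an assumption that is safe because the remaining tail is negligible.

```python
from math import exp

rate = 3.0  # λ, emails per hour

def prob_gap_exceeds(t, lam):
    """P(T > t) for T ~ Exp(lam), using the survival function exp(-lam * t)."""
    return exp(-lam * t)

def numeric_tail(t, lam, upper=20.0, steps=200_000):
    """Midpoint-rule estimate of the integral of lam * exp(-lam * s) over [t, upper]."""
    h = (upper - t) / steps
    return sum(lam * exp(-lam * (t + (i + 0.5) * h)) * h for i in range(steps))

closed_form = prob_gap_exceeds(1.0, rate)
numeric = numeric_tail(1.0, rate)

print(round(closed_form, 4))
print(round(numeric, 4))
```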
Question 6:
Part (a)
Sketch the p.d.f. of Z and Mark P(Z<−0.66)
Draw the standard normal distribution curve (bell-shaped, centered at 0).
Shade the area to the left of Z=−0.66.
Calculate P(Z<−0.66)
P(Z<−0.66)=1−P(Z≤0.66)
P(Z≤0.66)≈0.7454
P(Z<−0.66)=1−0.7454=0.2546
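The table lookup can be reproduced with the error function from the Python standard library. This is a small sketch; the helper name `std_normal_cdf` is illustrative.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Φ(z) for the standard normal distribution, written in terms of erf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Direct left-tail probability and the symmetry route used above.
left_tail = std_normal_cdf(-0.66)
by_symmetry = 1 - std_normal_cdf(0.66)

print(round(left_tail, 4))
print(round(by_symmetry, 4))
```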
Part (c)
Sketch the Distribution of U ∼ N(μ, 1.5^2)
The distribution is similar to the standard normal but centered at μ with a wider spread
(standard deviation = 1.5).
Location: Centered at μ.
Part (d)
From (b):
Relate U to Z:
Solve for μ:
So, μ=0.33.
Question 7:
Part (a)
Given:
Where:
Calculate the expectation of μ1:
Part (b)
Calculate the variance of μ1
Given:
Therefore:
Part (c)
Given:
Question 8:
Part (a)
Show that the likelihood for the data collected by student 1 is: L(θ)=Cθ^99(1−7θ)
Given:
Table 2: P(1)=θ,P(2)=2θ,P(3)=4θ,P(>3)=1−7θ
Table 3: Observed frequencies for student 1: 30 (level 1), 36 (level 2), 33 (level 3), 1
(level >3)
Likelihood L(θ):
L(θ)=θ^30 (2θ)^36 (4θ)^33 (1−7θ)^1
Simplify:
L(θ)=2^36 · 4^33 · θ^(30+36+33) (1−7θ)=2^102 θ^99 (1−7θ)
Let C=2^102, which is a constant independent of θ:
L(θ)=Cθ^99(1−7θ)
Part (b)
To find the MLE, we take the derivative of the log-likelihood and set it to zero:
Log-likelihood:
ℓ(θ)=log C+99 log θ+log(1−7θ)
dℓ/dθ=99/θ−7/(1−7θ)=0
99(1−7θ)=7θ, so θ̂=99/700≈0.1414
Explain why you would get the same MLE of θ using the data from student 2
Observed frequencies for student 2: 10 (level 1), 44 (level 2), 45 (level 3), 1 (level >3)
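Likelihood L(θ):
L(θ)=θ^10 (2θ)^44 (4θ)^45 (1−7θ)=2^134 θ^99 (1−7θ)
This differs from the likelihood for student 1 only in the constant multiplying θ^99(1−7θ). Since the
constant does not depend on θ, maximising over θ yields the same result:
θ̂=99/700≈0.1414
Thus, the MLE of θ would be the same using the data from student 2.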
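A numerical check makes the argument concrete. The sketch below maximises the log-likelihood over a grid of θ values for both students, dropping the additive constant that does not depend on θ. The grid search and the function names are illustrative choices, not the method used in the exam solution.

```python
from math import log

def log_likelihood(theta, counts):
    """Log-likelihood up to an additive constant.

    counts = (n1, n2, n3, n_gt3) for levels 1, 2, 3 and >3.
    The terms n2*log(2) and n3*log(4) are constants in theta and are omitted.
    """
    n1, n2, n3, n_gt3 = counts
    return (n1 + n2 + n3) * log(theta) + n_gt3 * log(1 - 7 * theta)

def grid_mle(counts, steps=100_000):
    """Grid search over 0 < theta < 1/7, where all four probabilities are positive."""
    grid = [(i / steps) / 7 for i in range(1, steps)]
    return max(grid, key=lambda t: log_likelihood(t, counts))

student1 = (30, 36, 33, 1)  # Table 3 frequencies
student2 = (10, 44, 45, 1)

mle1 = grid_mle(student1)
mle2 = grid_mle(student2)

print(round(mle1, 4), round(mle2, 4))
```

Both students have 99 readings at levels 1 to 3 and one reading above level 3, so the two log-likelihoods are the same function of θ and the two maximisers coincide.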
Question 9:
(a) Show that it is appropriate to assume that the population variances for indoor and outdoor air
quality measurements are equal and calculate the pooled standard deviation.
To test if it is appropriate to assume equal variances, we can use an F-test to compare the
variances of the two samples.
Where:
Given:
Using an F-distribution table with degrees of freedom df1=34 and df2=31, we compare the
calculated F value to the critical F value at a chosen significance level (typically 0.05).
Since we don't have the critical value table here, we assume it's appropriate to use the pooled
standard deviation if the calculated F value is not significantly high.
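The two quantities involved, the variance ratio and the pooled standard deviation, can be sketched as follows. The sample sizes 35 and 32 are inferred from the stated degrees of freedom; the two standard deviations are hypothetical placeholders, since the sample values are not reproduced here, and the function names are illustrative.

```python
from math import sqrt

def f_ratio(s1, s2):
    """Ratio of the larger sample variance to the smaller one."""
    v1, v2 = s1**2, s2**2
    return max(v1, v2) / min(v1, v2)

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation for two independent samples."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Sample sizes inferred from the degrees of freedom (34 and 31).
n1, n2 = 35, 32
# HYPOTHETICAL standard deviations, for illustration only; not the exam data.
s1, s2 = 10.0, 12.0

print(round(f_ratio(s1, s2), 2))
print(round(pooled_sd(s1, n1, s2, n2), 2))
```

The F ratio would then be compared with the critical value for 34 and 31 degrees of freedom, and the pooled standard deviation always lies between the two sample standard deviations.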
(b) The exact 95% confidence interval for the mean difference between the indoor and
outdoor air quality is (-6.8, 5.6). Based on this, is there a difference between the indoor and
outdoor air quality level?
The 95% confidence interval for the mean difference between the indoor and outdoor air quality
measurements is given as (-6.8, 5.6). This interval contains zero, meaning we cannot reject the
null hypothesis that there is no difference between the indoor and outdoor air quality
measurements. Therefore, there is no statistically significant difference between the indoor and
outdoor air quality levels.
Question 10
(a) State appropriate null and alternative hypotheses for Anika to test given the
information they have available.
Null hypothesis (H0): The proportion of level 1 readings is equal to 14%.
Alternative hypothesis (Ha): The proportion of level 1 readings is different from 14%.
(b) Using a 5% significance level, obtain the test statistic for the hypothesis you proposed
for part (a), clearly stating any assumptions you have made.
Where:
(c) The p-value for the test in part (a) is 0.2502. What should Anika conclude?
Given that the p-value (0.2502) is greater than the significance level (0.05), we fail to reject the
null hypothesis. Anika should conclude that there is not enough evidence to suggest that the
proportion of level 1 readings is different from 14%.
(d) Another student, Ben, recorded 30 out of 100 outdoor readings as level 1 using their air
quality monitoring device. What would Ben conclude if he conducted the same hypothesis
test (in part (a)) as Anika, and why?
With p̂=30/100=0.30, p0=0.14 and n=100, the test statistic is z=(0.30−0.14)/√(0.14×0.86/100)≈4.61.
A z-score of 4.61 is far above the typical critical values for z-tests (approximately ±1.96 for a 5%
significance level). This results in a p-value much less than 0.05, leading us to reject the null
hypothesis.
Ben would conclude that there is significant evidence to suggest that the proportion of level 1
readings is different from 14%.
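Ben's test statistic can be reproduced with a one-sample z-test for a proportion. This is a minimal sketch that assumes the normal approximation to the binomial is adequate for n = 100; the function names are illustrative.

```python
from math import erf, sqrt

def one_sample_prop_z(successes, n, p0):
    """z statistic for H0: p = p0, using the standard error under H0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - phi)

z_ben = one_sample_prop_z(30, 100, 0.14)  # Ben: 30 of 100 readings at level 1
p_ben = two_sided_p(z_ben)

print(round(z_ben, 2))
print(p_ben < 0.05)
```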
Question 11
(a) Calculate the mean and variance of the normal distribution which can be used to
approximate the null distribution of the test statistic U_A.
Given:
(b) The observed value of the test statistic U_A=166.1. Use your answer to part (a) to show that
the z-value corresponding to the approximate standard normal null distribution for this
Mann-Whitney test is 1.36 correct to two decimal places. Show all your workings.
Given:
Calculating the z-value:
The value 6.16 does not align with the expected value 1.36, so a recalculation and
assumption are necessary:
(c) Show that the p-value for this test, based on the approximate null distribution
obtained in part (b), is 0.17 to two decimal places.
For a z-value of 1.36, the two-sided p-value from standard normal tables is
2×(1−Φ(1.36))=2×(1−0.9131)≈0.17, so the null hypothesis is not rejected at the 5% level.
By contrast, the p-value for a z-value of 6.16 is effectively zero, which would imply
rejection.
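The conversion from a z-value to a p-value can be sketched as follows. It assumes a two-sided test, which is what the quoted p-value of 0.17 corresponds to for z = 1.36; the function names are illustrative.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Φ(z) for the standard normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def two_sided_p(z):
    """Two-sided p-value: probability of a value at least as extreme as |z|."""
    return 2 * (1 - std_normal_cdf(abs(z)))

p_expected = two_sided_p(1.36)  # z-value stated in part (b)
p_extreme = two_sided_p(6.16)   # the value obtained in the working above

print(round(p_expected, 2))
print(p_extreme < 1e-6)
```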
Question 12
(a) log(−x)
This transformation is not appropriate because log of a negative value is undefined in the
real number system. Therefore, this transformation cannot be applied to positive data to
correct skewness.
(b) x^3
Raising the data to the power of 3 stretches large values more than small ones, so it
increases the skewness of positively (right-) skewed data, although it can reduce left skew.
It therefore does not correct right skew but exacerbates it; square root, log, or
similar transformations are appropriate for right-skewed data.
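The effect of each transformation on skewness can be illustrated on a small right-skewed sample. The data below are hypothetical and chosen only for illustration; the function computes the moment-based sample skewness.

```python
from math import log

def skewness(xs):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# HYPOTHETICAL positive, right-skewed sample (not from the exam).
sample = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13]

cubed = [x ** 3 for x in sample]
logged = [log(x) for x in sample]

print(round(skewness(sample), 2))
print(round(skewness(cubed), 2))
print(round(skewness(logged), 2))
```

On this sample, cubing increases the skewness while the log transformation reduces it, in line with the argument above.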
Question 13
(b) Interpret the regression coefficient for the explanatory variable time in the
fitted model.
The coefficient for the variable time is -1.0009. This means that for every
one-second increase in the average time it takes for a worker to service a customer, the reward
score decreases by approximately 1.0009 units, holding all other factors constant.
(c) Use the output from Minitab to explain if the satisfaction of a participant affects
their reward score.
The coefficient for the satisfaction variable is -0.1116 with a p-value of 0.111.
Since the p-value is greater than the common significance level of 0.05, we fail to reject
the null hypothesis. This suggests that there is no statistically significant evidence to
conclude that satisfaction affects the reward score.
(d) Do the assumptions of the multiple linear regression model seem reasonable?
Clearly explain your answer.
We need to assess the residual plots to evaluate the assumptions:
Normality of Residuals: The normal probability plot (Q-Q plot) shows the residuals
falling approximately along the reference line, which suggests that the residuals are
normally distributed.
Homoscedasticity: The residuals vs. fits plot shows residuals scattered randomly around
zero without any apparent pattern. This indicates that the variance of residuals is constant
(homoscedasticity).
Independence: Assuming data collection was properly randomized, residuals should be
independent.
Question 14
Data were collected and analyzed using Minitab 21. The statistical analysis involved
conducting a t-test to compare the average pain scores between the two groups. The
assumptions of the normal distribution were checked, and it was noted that the normal
probability plot was curved, indicating potential deviations from normality.
The analysis compared the average pain scores between the two groups of children
undergoing tonsillectomy using two different methods. The t-test yielded a p-value of
0.0603. This p-value is slightly above the conventional significance level of 0.05,
suggesting that there is no statistically significant difference between the average pain
scores of the two methods at the 5% significance level. However, the p-value is close to
0.05, indicating a potential trend that may warrant further investigation with a larger
sample size or additional studies.
The curvature observed in the normal probability plot suggests that the data may not
perfectly adhere to the assumptions of normality required for the t-test. This deviation
could affect the validity of the test results, and alternative non-parametric tests might be
considered for confirmation.