Stats Exam


Question 1:

Part (a): Validity of the Probability Mass Function (p.m.f.)

To verify that the p.m.f. given in Table 1 is valid, we need to check the following two conditions:

1. Each probability p(x) must be between 0 and 1, inclusive.


2. The sum of all probabilities must equal 1.

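From Table 1 (values as they appear in the sum below, for x=0,1,2,3,4,5):

x      0     1     2     3     4     5
p(x)   0.28  0.19  0.16  0.15  0.12  0.10

Every value lies between 0 and 1. Let's check the sum:

0.28+0.19+0.16+0.15+0.12+0.10=1.00

Since the sum is 1 and all the individual probabilities are between 0 and 1, the p.m.f. is valid.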

Part (b): Cumulative Distribution Function (c.d.f.)

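The cumulative distribution function F(x) for a discrete random variable X is defined as

F(x)=P(X≤x), the sum of p(k) over all k≤x.

We need to calculate F(x) for x=0,1,2,3,4,5.

Calculating each step:

F(0)=0.28
F(1)=0.28+0.19=0.47
F(2)=0.47+0.16=0.63
F(3)=0.63+0.15=0.78
F(4)=0.78+0.12=0.90
F(5)=0.90+0.10=1.00

This gives us:

x      0     1     2     3     4     5
F(x)   0.28  0.47  0.63  0.78  0.90  1.00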

Calculating the Median
The median is the value of X for which the c.d.f. F(x) is greater than or equal to 0.5 for the first
time.

From the c.d.f. table:

• F(1)=0.47
• F(2)=0.63

Since F(1)=0.47<0.5≤F(2)=0.63, x=2 is the smallest value at which the c.d.f. reaches 0.5, so the median number of days a pupil attends the breakfast club is 2.
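
The three steps above (validity check, running totals, first crossing of 0.5) can be sketched in a few lines of Python. This is a minimal illustration that assumes the six probabilities in the sum correspond to x=0,...,5 in that order; the function names are made up for the example.

```python
from itertools import accumulate

# Probabilities in the order they appear in the sum, assumed to be x = 0..5.
pmf = {0: 0.28, 1: 0.19, 2: 0.16, 3: 0.15, 4: 0.12, 5: 0.10}

def is_valid_pmf(p, tol=1e-9):
    """True if every probability is in [0, 1] and the total is 1 (up to rounding)."""
    return all(0 <= v <= 1 for v in p.values()) and abs(sum(p.values()) - 1) < tol

def cdf_table(p):
    """Running totals F(x) = P(X <= x), rounded to 2 d.p. to hide float noise."""
    xs = sorted(p)
    totals = accumulate(p[x] for x in xs)
    return {x: round(t, 2) for x, t in zip(xs, totals)}

def median(p):
    """Smallest x whose cumulative probability reaches 0.5."""
    return next(x for x, f in cdf_table(p).items() if f >= 0.5)

cdf = cdf_table(pmf)
print(is_valid_pmf(pmf), cdf, median(pmf))
```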

Summary

(a) The p.m.f. in Table 1 satisfies all the requirements for a valid p.m.f. since all probabilities are between 0 and 1, and their sum is exactly 1.

(b) The cumulative distribution function (c.d.f.) values are:

F(0)=0.28, F(1)=0.47, F(2)=0.63, F(3)=0.78, F(4)=0.90, F(5)=1.00

Using the c.d.f., the median number of days per week that a pupil attends the breakfast club is 2, as F(2) is the first value that reaches or exceeds 0.5.

Question 2:

Part (a): Validity of the p.d.f.


Verify the normalization:

Part (b): Cumulative Distribution Function (c.d.f.)


Part (c): Probability P(0.5<X<0.75)

P(0.5<X<0.75)=F(0.75)−F(0.5)

Calculate F(0.75):

Calculate F(0.5):

Therefore,

Question 3:

Calculation of Total Frequency


1. First Bar (17-18):
o Height: 1
o Frequency: 1
2. Touching Bars (21-29):
o Heights: 2, 7, 7, 15, 16, 23, 15, 14

Total Frequency Calculation:

1+2+7+7+15+16+23+15+14=100

Total frequency: 100

(a) Description of the Shape of the Data Distribution

The histogram displays a distribution that is:

• Left-skewed: Going by the bar heights listed above, the frequencies build up gradually from the isolated bar at 17-18 to the peak and then stop soon after it, so the longer tail is on the left side.

• Unimodal: There is a single peak, the bar of height 23, which falls at around 26-27 if the eight touching bars each cover one unit from 21 to 29.

(b) Suitable Discrete Probability Distribution

The data displayed in the histogram could plausibly be generated from a binomial distribution with a success probability p well above 0.5.

Reasoning:

• A binomial distribution with p>0.5 is unimodal and left-skewed, with values bunched just below its maximum n, which matches the pattern described above.
• A Poisson distribution is a less natural choice here: its skewness is 1/√λ, which is always positive, so it cannot reproduce a longer left tail.

Question 4:

Part (a)

Exact Model: Binomial Distribution

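Given:
• Number of trials n=50
• Probability of defect p=0.025

The number of defective engines X follows a Binomial distribution: X∼Binomial(n=50,p=0.025)

We need to find P(X>2):

P(X>2)=1−P(X≤2)

Calculate P(X≤2) using the binomial formula P(X=k)=C(n,k) p^k (1−p)^(n−k):

P(X=0)=(0.975)^50≈0.2820
P(X=1)=50(0.025)(0.975)^49≈0.3615
P(X=2)=1225(0.025)^2(0.975)^48≈0.2271

P(X≤2)≈0.2820+0.3615+0.2271=0.8706

P(X>2)≈1−0.8706=0.1294

Part (b)
Mean Number of Defective Engines
The mean of a Binomial distribution is given by μ=np:

μ=50×0.025=1.25

Part (c)
Approximate Distribution: Poisson Distribution
For large n and small p, X can be approximated by a Poisson distribution with λ=np:

λ=50×0.025=1.25

The number of defective engines X approximately follows a Poisson distribution: X∼Poisson(λ=1.25)

We need to find P(X>2):

P(X>2)=1−P(X≤2)

Calculate P(X≤2) using the Poisson formula P(X=k)=e^(−λ) λ^k/k!:

P(X=0)=e^(−1.25)≈0.2865
P(X=1)=1.25e^(−1.25)≈0.3581
P(X=2)=(1.25^2/2)e^(−1.25)≈0.2238

P(X≤2)≈0.8685

P(X>2)≈1−0.8685=0.1315

Part (d)
Comparison of Probabilities
• Exact Binomial Probability: P(X>2)≈0.129
• Approximate Poisson Probability: P(X>2)≈0.132

The two values differ by only about 0.002, so the Poisson approximation is good here.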
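
The two tail probabilities can be checked numerically. The sketch below uses only the Python standard library and the values n=50 and p=0.025 given in the question; the helper names are illustrative.

```python
from math import comb, exp, factorial

n, p = 50, 0.025
lam = n * p  # Poisson rate used for the approximation, lambda = np = 1.25

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# P(X > 2) = 1 - [P(X=0) + P(X=1) + P(X=2)] under each model.
binom_tail = 1 - sum(binom_pmf(k, n, p) for k in range(3))
poisson_tail = 1 - sum(poisson_pmf(k, lam) for k in range(3))

print(f"binomial: {binom_tail:.4f}  poisson: {poisson_tail:.4f}")
```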

Question 5:

Part (a) Assumptions for a Poisson Process


1. Independence: Emails arrive independently.
o Context: Each email is from an unrelated source.
o Holding: Likely, if emails are not influenced by each other.
2. Constant Rate (λ): Average rate of 3 emails per hour.
o Context: Consistent email rate over time.
o Holding: True if the business operates uniformly without peak hours.
3. One Event at a Time: No simultaneous arrivals.
o Context: Emails do not arrive exactly at the same moment.
o Holding: Reasonable with fine enough time intervals.
4. Negligible Multiple Events in Small Intervals: Very low probability of multiple emails in
a short time.
o Context: Rare occurrence of multiple emails in an infinitesimal period.
o Holding: Likely if emails are infrequent.

Part (b)

Given:

• Rate of email arrivals λ=3 per hour

We need to find the probability that the time between emails exceeds one hour. In a Poisson process, the time between arrivals follows an exponential distribution with parameter λ.

The exponential distribution has the probability density function:

f(t)=λe^(−λt), t≥0

We need the probability that the time T between emails exceeds one hour:

P(T>1)=∫_1^∞ λe^(−λt) dt

For λ=3:

P(T>1)=∫_1^∞ 3e^(−3t) dt

Calculate the integral:

P(T>1)=[−e^(−3t)] evaluated from t=1 to t=∞

Evaluate the limits:

P(T>1)=0−(−e^(−3))=e^(−3)≈0.0498

Thus, the probability that the time between emails exceeds one hour is approximately 0.0498.
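
As a sanity check on the integral, the closed form e^(−λt) can be compared with a crude numerical integration of the density. This is only a sketch: the upper limit of 20 hours stands in for infinity, and the midpoint rule is an arbitrary choice of method.

```python
from math import exp

RATE = 3.0  # emails per hour, from the question

def exp_survival(t, rate):
    """Closed form P(T > t) = e^(-rate * t) for an exponential waiting time."""
    return exp(-rate * t)

def midpoint_integral(f, a, b, steps=200_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def density(t):
    """Exponential p.d.f. with the rate above."""
    return RATE * exp(-RATE * t)

exact = exp_survival(1.0, RATE)
# The tail beyond t = 20 is of order e^(-60), so truncating there is harmless.
numeric = midpoint_integral(density, 1.0, 20.0)

print(f"exact: {exact:.4f}  numeric: {numeric:.4f}")
```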

Question 6:

Part (a)
Sketch the p.d.f. of Z and Mark P(Z<−0.66)
• Draw the standard normal distribution curve (bell-shaped, centered at 0).
• Shade the area to the left of Z=−0.66.

Part (b)
Calculate P(Z<−0.66)

Using standard normal distribution tables:

P(Z<−0.66)=1−P(Z≤0.66)
P(Z≤0.66)≈0.7454
P(Z<−0.66)=1−0.7454=0.2546

Part (c)
Sketch the Distribution of U∼N(μ,1.5^2)
The distribution has the same shape as the standard normal but is centered at μ with a wider spread (standard deviation = 1.5).

• Shape: Bell-shaped, like Z.

• Location: Centered at μ.

• Spread: Wider than Z because the standard deviation is 1.5 rather than 1.

Part (d)

Find μ given P(U<−0.66)=0.2546

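From (b):

P(Z<−0.66)=0.2546

Relate U to Z:

Z=(U−μ)/1.5

Convert U to standard normal:

P(U<−0.66)=P(Z<(−0.66−μ)/1.5)=0.2546

This is equivalent to:

P(Z<(−0.66−μ)/1.5)=P(Z<−0.66)

Using standard normal distribution:

(−0.66−μ)/1.5=−0.66

Solve for μ:

−0.66−μ=−0.99, so μ=−0.66+0.99=0.33

So, μ=0.33.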
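
Both the table lookup and the solution for μ can be reproduced with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The sketch assumes the standard deviation of 1.5 and the target probability 0.2546 stated above; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, mean 0 and standard deviation 1

# P(Z < -0.66), the tail area worked out from tables above.
p_left = Z.cdf(-0.66)

# Solve P(U < -0.66) = 0.2546 for mu when U ~ N(mu, 1.5^2):
# (-0.66 - mu) / 1.5 = z_star, where z_star is the 0.2546 quantile of Z.
sigma = 1.5
z_star = Z.inv_cdf(0.2546)
mu = -0.66 - sigma * z_star

print(f"P(Z < -0.66) = {p_left:.4f}, mu = {mu:.2f}")
```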

Question 7:

Part (a)

Show that μ1 is an unbiased estimator of μ

Given:

Where:
Calculate the expectation of μ1:

Part (b)
Calculate the variance of μ1

Given:

Calculate the variance:

Since X1 and X2 are independent:

Therefore:
Part (c)

Compare μ1 and μ2 with variances

Given:

Question 8:

Part (a)

Show that the likelihood for the data collected by student 1 is: L(θ)=C θ^99 (1−7θ)

Given:

• Table 2: P(1)=θ, P(2)=2θ, P(3)=4θ, P(>3)=1−7θ
• Table 3: Observed frequencies for student 1: 30 (level 1), 36 (level 2), 33 (level 3), 1 (level >3)

Likelihood L(θ):

L(θ)=θ^30 (2θ)^36 (4θ)^33 (1−7θ)^1

Simplify:

L(θ)=2^36 × 4^33 × θ^(30+36+33) (1−7θ)=2^102 θ^99 (1−7θ)

Let C=2^102, which is a constant independent of θ:

L(θ)=C θ^99 (1−7θ)

Part (b)

Calculate the maximum likelihood estimate (MLE) of θ

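To find the MLE, we take the derivative of the log-likelihood function and set it to zero:

Log-likelihood:

log L(θ)=log C+99 log θ+log(1−7θ)

Derivative of the log-likelihood:

d/dθ log L(θ)=99/θ−7/(1−7θ)

Set the derivative to zero:

99/θ=7/(1−7θ), so 99(1−7θ)=7θ, giving 99=700θ and θ=99/700≈0.1414
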
Part (c)

Explain why you would get the same MLE of θ using the data from student 2

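The likelihood function for student 2 would be similarly constructed:

• Observed frequencies for student 2: 10 (level 1), 44 (level 2), 45 (level 3), 1 (level >3)

Likelihood L(θ):

L(θ)=θ^10 (2θ)^44 (4θ)^45 (1−7θ)^1=2^134 θ^99 (1−7θ)

The power of θ is again 10+44+45=99 and the factor (1−7θ) is unchanged. Only the constant differs, and a constant multiplier does not affect where the likelihood is maximized, so solving for θ yields the same result:

θ=99/700≈0.1414

Thus, the MLE of θ would be the same using the data from student 2.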
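
A numerical check of the MLE for both students is sketched below. It assumes the category probabilities θ, 2θ, 4θ and 1−7θ from Table 2; the closed form m/(7(m+k)) comes from setting m/θ−7k/(1−7θ)=0, where m is the number of readings at levels 1 to 3 and k the number above level 3. The grid search is only a crude confirmation, not an efficient optimizer.

```python
from math import log

def log_lik(theta, counts):
    """Log-likelihood for counts at levels 1, 2, 3 and >3 (valid for 0 < theta < 1/7)."""
    n1, n2, n3, n_over = counts
    return (n1 * log(theta) + n2 * log(2 * theta)
            + n3 * log(4 * theta) + n_over * log(1 - 7 * theta))

def mle_closed_form(counts):
    """Root of m/theta - 7k/(1 - 7*theta) = 0, i.e. theta = m / (7 * (m + k))."""
    n1, n2, n3, n_over = counts
    m = n1 + n2 + n3
    return m / (7 * (m + n_over))

def mle_grid(counts, step=1e-5):
    """Brute-force maximizer of the log-likelihood over a grid inside (0, 1/7)."""
    grid = [i * step for i in range(1, int((1 / 7) / step))]
    return max(grid, key=lambda t: log_lik(t, counts))

student1 = (30, 36, 33, 1)
student2 = (10, 44, 45, 1)

for counts in (student1, student2):
    print(mle_closed_form(counts), mle_grid(counts))
```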

Question 9:

(a) Show that it is appropriate to assume that the population variances for indoor and outdoor air
quality measurements are equal and calculate the pooled standard deviation.

To test if it is appropriate to assume equal variances, we can use an F-test to compare the variances of the two samples.

F=s1^2/s2^2

Where:

• s1^2 and s2^2 are the sample variances of the two groups (sample sizes n1=35 and n2=32, matching the degrees of freedom below).

Given:

Calculating the variances:

Using an F-distribution table with degrees of freedom df1=34 and df2=31, we compare the calculated F value to the critical F value at a chosen significance level (typically 0.05).

Without the critical value to hand, we treat the equal-variance assumption as appropriate, and use the pooled standard deviation, provided the calculated F value is not far above 1.

The pooled standard deviation sp is calculated as follows:

sp=√[((n1−1)s1^2+(n2−1)s2^2)/(n1+n2−2)]

Here n1−1=34 and n2−1=31, so the denominator is n1+n2−2=65.
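
The variance ratio and the pooled standard deviation can be sketched as two small functions. The sample sizes 35 and 32 are inferred from the degrees of freedom 34 and 31 quoted above; the two standard deviations in the usage lines are hypothetical placeholders, since the sample values are not reproduced here, and which group is indoor or outdoor is also an assumption.

```python
from math import sqrt

def variance_ratio(s1, s2):
    """Ratio of the larger sample variance to the smaller one (always >= 1)."""
    big, small = max(s1, s2), min(s1, s2)
    return (big / small) ** 2

def pooled_sd(n1, s1, n2, s2):
    """Pooled standard deviation of two samples with sizes n1, n2 and SDs s1, s2."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return sqrt(pooled_var)

# Hypothetical standard deviations, for illustration only.
n_indoor, n_outdoor = 35, 32
s_indoor, s_outdoor = 12.0, 13.5

print(variance_ratio(s_indoor, s_outdoor))
print(pooled_sd(n_indoor, s_indoor, n_outdoor, s_outdoor))
```

The pooled value always lies between the two sample standard deviations, weighted towards the larger sample.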

(b) The exact 95% confidence interval for the mean difference between the indoor and
outdoor air quality is (-6.8, 5.6). Based on this, is there a difference between the indoor and
outdoor air quality level?

The 95% confidence interval for the mean difference between the indoor and outdoor air quality
measurements is given as (-6.8, 5.6). This interval contains zero, meaning we cannot reject the
null hypothesis that there is no difference between the indoor and outdoor air quality
measurements. Therefore, there is no statistically significant difference between the indoor and
outdoor air quality levels.

Question 10

(a) State appropriate null and alternative hypotheses for Anika to test given the
information they have available.

Null hypothesis (H0): The proportion of level 1 readings is equal to 14%.

Alternative hypothesis (Ha): The proportion of level 1 readings is different from 14%.

(b) Using a 5% significance level, obtain the test statistic for the hypothesis you proposed
for part (a), clearly stating any assumptions you have made.

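We use a z-test for proportions. The test statistic is calculated as follows:

z=(p̂−p0)/√(p0(1−p0)/n)

Where:

• p̂ is the sample proportion of level 1 readings
• p0=0.14 is the proportion under the null hypothesis
• n is the number of readings

Assumptions: the readings are independent, and n is large enough for the normal approximation to the sample proportion to hold.
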
(c) The p-value for the test in part (a) is 0.2502. What should Anika conclude?

Given that the p-value (0.2502) is greater than the significance level (0.05), we fail to reject the
null hypothesis. Anika should conclude that there is not enough evidence to suggest that the
proportion of level 1 readings is different from 14%.

(d) Another student, Ben, recorded 30 out of 100 outdoor readings as level 1 using their air
quality monitoring device. What would Ben conclude if he conducted the same hypothesis
test (in part (a)) as Anika, and why?

For Ben's data:

p̂=30/100=0.30, n=100, p0=0.14

Calculating the z-score for Ben:

z=(0.30−0.14)/√(0.14×0.86/100)=0.16/0.0347≈4.61

A z-score of 4.61 is far above the critical values for a two-sided z-test (±1.96 at a 5% significance level). This gives a p-value much less than 0.05, leading us to reject the null hypothesis.

Ben would conclude that there is significant evidence to suggest that the proportion of level 1
readings is different from 14%.
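
Ben's test statistic can be reproduced with a short function. This is a sketch of a two-sided one-sample z-test for a proportion under the normal approximation; the function name is illustrative, and `statistics.NormalDist` supplies the normal c.d.f.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: p = p0, using the null standard error."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Ben: 30 level 1 readings out of 100, tested against 14%.
z_ben, p_ben = one_proportion_z_test(30, 100, 0.14)
print(f"z = {z_ben:.2f}, p-value = {p_ben:.2g}")
```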

Question 11

(a) Calculate the mean and variance of the normal distribution which can be used to approximate the null distribution of the test statistic U_A.

Given:

• nA=12 (sample size from accounting)

• nB=11 (sample size from marketing)

Calculating the mean:

E(U_A)=nA(nA+nB+1)/2=12×24/2=144

Calculating the variance:

V(U_A)=nA nB(nA+nB+1)/12=12×11×24/12=264

(b) The observed value of the test statistic U_A=166.1. Use your answer to part (a) to show that the z-value corresponding to the approximate standard normal null distribution for this Mann-Whitney test is 1.36 correct to two decimal places. Show all your workings.

The z-value is calculated as follows:

z=(U_A−E(U_A))/√V(U_A)

Given:

U_A=166.1, E(U_A)=144, V(U_A)=264

Calculating the z-value:

z=(166.1−144)/√264=22.1/16.248≈1.36

This uses the rank-sum form of U_A, whose null mean is nA(nA+nB+1)/2=144. The mean nA nB/2=66 belongs to the pair-counting form of the statistic and would give (166.1−66)/16.248≈6.16 rather than the required 1.36.

(c) Show that the p-value for this test, based on the approximate null distribution obtained in part (b), is 0.17 to two decimal places.

For a two-sided test, the p-value is 2×P(Z>1.36)=2×(1−0.9131)=0.1738, which is 0.17 to two decimal places.

(d) State your conclusions for the test.

Since the p-value of 0.17 is greater than 0.05, the null hypothesis is not rejected: there is insufficient evidence that the accounting and marketing populations differ in location.
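
The normal approximation for this Mann-Whitney test can be checked numerically. The sketch assumes U_A is the sum of the ranks of sample A, so that its null mean is nA(nA+nB+1)/2; that assumption is what reproduces the z-value of 1.36 and the p-value of 0.17 quoted in the question.

```python
from math import sqrt
from statistics import NormalDist

n_a, n_b = 12, 11   # sample sizes: accounting, marketing
u_obs = 166.1       # observed value of U_A from the question

# Null mean and variance of the rank-sum form of the statistic (assumed form).
mean_u = n_a * (n_a + n_b + 1) / 2
var_u = n_a * n_b * (n_a + n_b + 1) / 12

z = (u_obs - mean_u) / sqrt(var_u)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

print(f"mean = {mean_u}, variance = {var_u}, z = {z:.2f}, p = {p_value:.2f}")
```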

Question 12

(a) log(−x)

This transformation is not appropriate because log of a negative value is undefined in the
real number system. Therefore, this transformation cannot be applied to positive data to
correct skewness.
(b) x^3

Raising the data to the power of 3 stretches the upper tail, so it increases the skewness of positively (right-) skewed data and reduces the skewness of left-skewed data.

It therefore does not correct right skew but exacerbates it; for right-skewed data a square root, log, or similar transformation is appropriate.
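
The effect of each transformation on skewness can be illustrated on a small data set. The values below are hypothetical, chosen only to be positive and right-skewed, and the skewness measure is the usual moment coefficient m3/m2^1.5.

```python
from math import log

def skewness(xs):
    """Moment coefficient of skewness: third central moment over m2^1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2**1.5

# Hypothetical positive, right-skewed sample (not data from the exam).
data = [1, 2, 2, 3, 3, 3, 4, 5, 8, 15]

skew_raw = skewness(data)
skew_log = skewness([log(x) for x in data])   # log pulls in the upper tail
skew_cube = skewness([x**3 for x in data])    # cubing stretches the upper tail

print(f"raw: {skew_raw:.2f}  log: {skew_log:.2f}  cubed: {skew_cube:.2f}")
```

For this sample the log transformation lowers the skewness and cubing raises it.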

Question 13

(a) Write down the regression model.

The regression model is given by the equation:

Using the coefficients from the output:

(b) Interpret the regression coefficient for the explanatory variable time in the fitted model.

The coefficient for the variable time is -1.0009. This means that for every one-second increase in the average time it takes for a worker to service a customer, the predicted reward score decreases by approximately 1.0009 units, holding the other explanatory variables fixed.

(c) Use the output from Minitab to explain if the satisfaction of a participant affects
their reward score.

The coefficient for the satisfaction variable is -0.1116 with a p-value of 0.111. Since the p-value is greater than the common significance level of 0.05, we fail to reject the null hypothesis that this coefficient is zero. There is therefore no statistically significant evidence that satisfaction affects the reward score once the other explanatory variables are in the model.

(d) Do the assumptions of the multiple linear regression model seem reasonable?
Clearly explain your answer.
We need to assess the residual plots to evaluate the assumptions:

• Normality of Residuals: The normal probability plot (Q-Q plot) shows the residuals falling approximately along the reference line, which suggests that the residuals are normally distributed.
• Homoscedasticity: The residuals vs. fits plot shows residuals scattered randomly around zero without any apparent pattern. This indicates that the variance of residuals is constant (homoscedasticity).
• Independence: Assuming data collection was properly randomized, residuals should be independent.

Question 14

(a) The methods section

A study was conducted to compare the efficacy of two different methods of tonsillectomy, referred to as method A and method B, in terms of the pain experienced by the patients post-operation. The study involved 70 children, who were randomly assigned to either method A or method B, resulting in 35 children per group. The pain experienced by each child following the operation was measured using a pain rating scale, where the scale ranged from 1 to 5.

Data were collected and analyzed using Minitab 21. The statistical analysis involved conducting a two-sample t-test to compare the average pain scores between the two groups. The normality assumption was checked using a normal probability plot, which was curved, indicating potential deviations from normality.

(b) The discussion section

The analysis compared the average pain scores between the two groups of children
undergoing tonsillectomy using two different methods. The t-test yielded a p-value of
0.0603. This p-value is slightly above the conventional significance level of 0.05,
suggesting that there is no statistically significant difference between the average pain
scores of the two methods at the 5% significance level. However, the p-value is close to
0.05, indicating a potential trend that may warrant further investigation with a larger
sample size or additional studies.

The curvature observed in the normal probability plot suggests that the data may not
perfectly adhere to the assumptions of normality required for the t-test. This deviation
could affect the validity of the test results, and alternative non-parametric tests might be
considered for confirmation.
