Question 1 (5 Marks) : Part 3 Hypothesis Testing (5 Marks)
The SETU (https://www.monash.edu/ups/setu) score of FIT units is known to follow a Gaussian distribution (https://en.wikipedia.org/wiki/Normal_distribution) with a variance of 0.25.
Suppose you wish to estimate the mean SETU score μ for all units by taking a sample of n units and checking their SETU scores from last semester. How many units do you
need in this sample to obtain a 95% confidence interval for μ with a width of 0.1?
ANSWER
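The sample-size calculation can be sketched as follows. This is a minimal sketch, not a model answer: it assumes that "width" means the full width of the interval (upper limit minus lower limit) and that a z-interval applies because the variance is known. The function name `sample_size_for_ci_width` is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_ci_width(variance, width, confidence=0.95):
    """Smallest n for which a z-interval for the mean is no wider than `width`.

    Assumption: `width` is the full interval width (upper minus lower limit).
    """
    sigma = math.sqrt(variance)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # width = 2 * z * sigma / sqrt(n)  =>  n = (2 * z * sigma / width) ** 2
    return math.ceil((2 * z * sigma / width) ** 2)


n_required = sample_size_for_ci_width(variance=0.25, width=0.1)
print(n_required)
```

If the question instead intends 0.1 to be the half-width (margin of error), drop the factor of 2 inside the function.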
Question 2 (5 marks)
You run a poll to see what fraction p of the students participated in the FIT5197 SETU survey. You then take the observed proportion among all surveyed people as an estimate p̂ for p. Now it
is necessary to ensure that there is at least 95% certainty that the difference between the surveyed rate p̂ and the actual rate p is not more than 10%. At least how many people
do you need to survey?
ANSWER
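One common way to approach this is the normal approximation with the worst-case value p = 0.5, sketched below. This is an assumption about the intended method, not the unit's official solution; the function name `poll_sample_size` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def poll_sample_size(margin, confidence=0.95, p_guess=0.5):
    """Smallest n such that z * sqrt(p(1-p)/n) <= margin (normal approximation).

    Assumption: p_guess = 0.5 is the worst case, since it maximises p(1-p).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)


n_poll = poll_sample_size(margin=0.10)
print(n_poll)
```

If a distribution-free guarantee is wanted instead, Chebyshev's inequality with Var(p̂) ≤ 1/(4n) gives the more conservative requirement n ≥ 1/(4 × 0.1² × 0.05) = 500.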
Question 3 (5 marks)
Suppose you repeated the above polling process multiple times and obtained 40 confidence intervals, each with a confidence level of 90%. About how many of them would you expect to
be "wrong"? That is, how many of them would not actually contain the parameter being estimated? Should you be surprised if 12 of them are wrong?
ANSWER
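The reasoning can be checked numerically with a binomial model. The sketch below assumes the 40 intervals are independent, so the number of "wrong" intervals follows a Binomial(40, 0.1) distribution; the helper name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


n_intervals = 40
miss_probability = 0.10  # a 90% interval misses the parameter 10% of the time

expected_misses = n_intervals * miss_probability
tail_probability = binomial_upper_tail(n_intervals, miss_probability, 12)

print(expected_misses)
print(tail_probability)
```

A very small tail probability would indicate that 12 misses out of 40 is surprising under the stated confidence level.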
Question 4 (5 marks)
In lecture 3 (https://d3cgwrxphz0fqu.cloudfront.net/81/8c/818c7ed4d0cd856607bf4a5347fb10a6f9dcea50?response-content-
disposition=inline%3Bfilename%3D%22FIT5197_L3.pdf%22&response-content-
type=application%2Fpdf&Expires=1649953740&Signature=JqqTutDRrQhBB6QLX9pCb58FlEcx4WdmvWt6fOdki83rImO0cY8z5~VM1G8xyXBa81U9ffBzCivE5eoZCGB8LulfUuiuUlPaY7f
IBlEqW1k41YRZzwdlgmL~UCbMKHmFCOwfw2aoD1MgC2hE-2-iPCFesIXUrdY9oWUsjx6XaDjEAdRylr30SQGV93JdqehV46MvsU-
YW8Miq6BfeMWLPT2gvIjz7sz0Dqwp~6PRMGuJWNf6GfiAPW6-mjnAx91AKBKopIG4LRjkvL98oEgh~dSmPS4Hg__&Key-Pair-Id=APKAJRIEZFHR4FGFTJHA), we mentioned the use
of the weak law of large numbers which tells us that the sample estimator will converge to the population parameter if we have a sufficiently large number of observations (or sample
size). In this question, we would like to see how big the sample size should be in order to get the approximation error down to a certain level.
Continuing from Question 3, let the random variable X indicate whether a confidence interval covers the unknown parameter or not. Thus, X follows the Bernoulli
distribution with parameter θ, i.e., X ∼ Be(θ), where θ = 0.9 as given in Question 3. Suppose you collect n random variables X1, X2, …, Xn. Calculate the smallest
number of confidence intervals, n, that you have to observe to guarantee that
$$P\left(\left|\frac{\sum_{i=1}^{n} X_i}{n} - \theta\right| > 0.01\right) < 0.1.$$
ANSWER
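Since the question refers to the weak law of large numbers, one plausible route is Chebyshev's inequality, P(|X̄ − θ| > ε) ≤ θ(1 − θ)/(nε²). The sketch below assumes this is the intended tool; it finds the smallest n for which the Chebyshev bound is strictly below the target. The function name `chebyshev_min_n` is hypothetical, and exact fractions are used to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction


def chebyshev_min_n(theta, epsilon, delta):
    """Smallest n with theta * (1 - theta) / (n * epsilon**2) < delta.

    Assumption: the guarantee is obtained through Chebyshev's inequality,
    which is a bound, so the true probability may fall below delta earlier.
    """
    variance = theta * (1 - theta)  # variance of a Bernoulli(theta) variable
    threshold = variance / (epsilon ** 2 * delta)  # need n > threshold
    return math.floor(threshold) + 1


n_min = chebyshev_min_n(Fraction(9, 10), Fraction(1, 100), Fraction(1, 10))
print(n_min)
```

The strict inequality matters here: the bound equals 0.1 exactly at n = 9000, so one more observation is needed.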
As a hard-working student yourself, you have earned 12 chances by the end of the semester. When you finished your spins, the results were {"N", "A", "N", "N", "B", "C", "N", "N", "N",
"A", "A", "N"} ("A", "B" and "C" denote the three hampers respectively, while "N" denotes "Better Luck Next Time"). You are shocked by the result and feel the game might be faulty. Before
questioning Levin, you would like to perform a hypothesis test to check whether you are simply unlucky or whether Levin has secretly done something to influence the probability of
winning. State your hypotheses, perform the test and interpret the result.
ANSWER
Question 2 (2.5 marks)
The operations team of a retailer is about to report on the performance of the year 2022. As the data analyst, your job entails reviewing the reports provided by the team. One of the reports,
regarding membership subscription, looks suspicious to you. In this report, they compared the amount of money spent by members against non-members over the year. The
methodology is that they randomly selected 20 customers and compared their spending before and after becoming a member.
The average spending before becoming a member is $88.5 per week with a standard deviation of $11.2. The average after becoming a member is $105 per week with a standard
deviation of $15. In the report, the retailer claimed that after becoming a member, customers tend to spend 10% more than before on average.
As a statistician, you decide to perform a hypothesis test to check this claim. State your hypotheses, perform the test and interpret the result. Additionally, please suggest
another methodology to compare members vs non-members.
ANSWER
simulations : Number of samples you repeatedly take. For all of Part 4, Q2 we set this number equal to 10000, i.e., you have 10000 samples. If you have trouble understanding
this, perhaps it is time to rewatch the lecture recordings/materials.
n : Number of observations per sample. This will be given in the question, as we will experiment with different values of n.
PMF(Y) : The probability mass function that the random variable Y follows (please check Lecture 2 and Tutorial 2). Similar to n, we can experiment with different settings for
PMF(Y).
Random variables (RVs) Y1, Y2, …, Yn ∼ PMF(Y) : All the random variables in the sample (observation RVs) follow the distribution set out by the PMF. Again, the
number of observations n and the distribution PMF(Y) have not been set here but will be given in the questions.
Question 1: Theoretical Set-up for the CLT (No Coding or Simulation here!) (2 Marks)
Before simulating the CLT, we must first establish what we want to see from the simulation, i.e., what the theory tells us. Thus, we are going to set up the experiment here, as well as
set up our expectations for (1) the Summation Distribution, $\sum_{i=1}^{n} Y_i$, and (2) the Mean Distribution, $\bar{Y} \equiv \frac{1}{n}\sum_{i=1}^{n} Y_i$.
We will consider one of the possible set-ups for the distribution PMF(Y) as shown below. Additionally, we will also consider three different values for n, namely nSmall = 5,
nMedium = 30, nBig = 100.
In short, we would like to obtain the distributions (1) and (2) for each pair of n and PMF(Y) that we set here. Again, please revisit the lecture materials if you have any doubts,
since we have done a live presentation of this in our unit. Please report your results to five decimal places, as we would like to compare them with the simulation results later.
y 1 2 3 4 5
ANSWER
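The theoretical set-up can be sketched as follows. By the CLT, the sum is approximately N(nμ, nσ²) and the mean is approximately N(μ, σ²/n), where μ and σ² are the mean and variance of PMF(Y). The probabilities of the PMF are not reproduced in this sheet, so `placeholder_pmf` below is a hypothetical uniform distribution on y = 1, …, 5, used only for illustration; the function names are also hypothetical.

```python
def pmf_moments(pmf):
    """Return (mean, variance) of a discrete distribution given as {value: probability}."""
    mean = sum(y * p for y, p in pmf.items())
    variance = sum((y - mean) ** 2 * p for y, p in pmf.items())
    return mean, variance


def clt_parameters(pmf, n):
    """Normal-approximation (mean, variance) for the sum and the mean of n i.i.d. draws."""
    mean, variance = pmf_moments(pmf)
    return {
        "sum": (round(n * mean, 5), round(n * variance, 5)),
        "mean": (round(mean, 5), round(variance / n, 5)),
    }


# Hypothetical placeholder: replace with the probabilities given in the question.
placeholder_pmf = {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}

for n in (5, 30, 100):  # nSmall, nMedium, nBig
    print(n, clt_parameters(placeholder_pmf, n))
```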
For each pair of n and PMF(Y), and for each of the distributions (1) and (2), you are required to display a histogram representing the results of repeated sampling, and a curve displaying the
theoretical results from Question 1. Explain your findings and results (no more than 150 words).
Instructions for plots (MUST FOLLOW) : The marking for this question also includes the cleanliness of your plots (proper labels for the axes; the name of the plot must include
the type of sampling distribution and the sample size that you are using, e.g. Mean Distribution: n = 30 ). The theoretical values and simulated values need to be presented
side by side for ease of comparison, so you must put these values in the legends.
Instructions for code (MUST FOLLOW) : The code needs to be elegant (do not hard code), with enough comments describing what you want to do. Furthermore,
the variable names need to make sense. If you need to use a chunk of code more than once, please write a function for it; we will deduct marks if you copy and
paste your code here and there. As specified from the beginning, please report your results to 5 decimal places so we can compare and assess the theoretical results of the CLT and its
simulation.
ANSWER
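The repeated-sampling step could be structured along the following lines. This is only a sketch using Python's standard library: the plotting required by the instructions is omitted, the seed is an arbitrary choice, and `placeholder_pmf` is a hypothetical uniform distribution standing in for the PMF given in the question.

```python
import random
import statistics


def simulate_sampling(pmf, n, simulations=10_000, seed=0):
    """Draw `simulations` samples of size n from pmf; return the sample sums and means."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so results are reproducible
    values = list(pmf.keys())
    weights = list(pmf.values())
    sums = [sum(rng.choices(values, weights=weights, k=n)) for _ in range(simulations)]
    means = [total / n for total in sums]
    return sums, means


# Hypothetical placeholder: replace with the probabilities given in the question.
placeholder_pmf = {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}

sums, means = simulate_sampling(placeholder_pmf, n=30)
simulated_mean = round(statistics.mean(means), 5)
simulated_variance = round(statistics.pvariance(means), 5)
print(simulated_mean, simulated_variance)
```

The simulated mean and variance can then be placed in the plot legend next to the theoretical values μ and σ²/n from Question 1.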