unit 2 part - 2
For large populations, measuring a parameter exactly is usually unrealistic.
For example, when finding the average age of kids attending kindergarten, it
is impossible to collect the exact age of every kindergarten kid in the world.
Instead, a point estimator is used to estimate the population parameter from
a sample.
It is desirable for a point estimate to be:
Consistent: the larger the sample size, the more accurate the estimate. For
the point estimator to be consistent, the estimate should converge to the
true value of the parameter as the sample size increases.
Unbiased: The expected value of the estimator equals the corresponding
population parameter, e.g., the sample mean is an unbiased estimator of the
population mean.
Most efficient: The most efficient point estimator is the one with the
smallest variance. Generally, the efficiency of the estimator depends on the
distribution of the population. For example, in a normal distribution, the
mean is considered more efficient than the median, but the same does not
apply in asymmetrical distributions.
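The efficiency and consistency claims above can be checked with a small simulation. The following is a minimal sketch, assuming a standard normal population; the function name `estimator_variances`, the sample size, the number of trials, and the seed are illustrative choices, not part of the original material.

```python
import random
import statistics

def estimator_variances(n=25, trials=2000, seed=0):
    """Draw `trials` samples of size n from a standard normal population and
    return (variance of the sample means, variance of the sample medians)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.mean(sample))
        medians.append(statistics.median(sample))
    return statistics.variance(means), statistics.variance(medians)

var_mean, var_median = estimator_variances()
print(f"variance of sample means:   {var_mean:.4f}")
print(f"variance of sample medians: {var_median:.4f}")
```

For normal data the sample means should show the smaller spread, which is what "more efficient" means here. Calling the function with a larger `n` should shrink both variances, which illustrates consistency.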
School of Computer Engineering
How to find Point Estimate?
Example 2: Calculate the best point estimate of the population mean from the
following data: 15.22, 14.34, 18.12, 12.61, 15.61, 14.22, 19.41, 12.22, 17.12,
14.22, 12.91 and 18.12.
Solution: In such a case, the sample mean (i.e., 15.34) is the best point estimate
of the population mean.
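The calculation in Example 2 can be reproduced in a few lines. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Data from Example 2
data = [15.22, 14.34, 18.12, 12.61, 15.61, 14.22,
        19.41, 12.22, 17.12, 14.22, 12.91, 18.12]

# The sample mean is used as the point estimate of the population mean
point_estimate = statistics.mean(data)
print(round(point_estimate, 2))  # 15.34
```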
An interval is a range of values. Let’s say we wanted to find out how common
cigarette use is among senior citizens. We can’t survey every senior citizen on the
planet (due to time constraints and finances), so we take a sample of 1000
senior citizens and find that 10% of them smoke cigarettes. Although we have
only taken a sample, we can use that figure to estimate that “about” 10% of the
whole population smoke cigarettes. In reality, it’s unlikely to be exactly 10% (as
we only sampled a small percentage of people), but it’s probably somewhere
around there, perhaps between 5 and 15%. That “somewhere between 5 and
15%” is an interval estimate.
There’s nothing wrong with making a good guess at an interval, but sometimes
we want to be very confident that our results are sound and repeatable.
“Repeatable” means that if we do the whole thing over again, we’ll get the same
results. One way to do this is to express a confidence level. Confidence levels
are percentages of certainty. For example, we might say we are 99% confident
(i.e., we have a 99% confidence level) that between 5 and 15% of senior
citizens smoke cigarettes. When the interval estimate has a confidence level
attached, it’s called a confidence interval.
Confidence Interval Estimation
The lower bound (in the example, 5%) is called a lower confidence limit and
the upper bound (in the example, 15%) is called an upper confidence limit.
The bigger the sample size, the narrower the confidence interval will be.
How to determine the lower and upper confidence limit?
Confidence limit = x̄ ± Z * (σ / √n)
where x̄ = mean, σ = standard deviation, n = sample size, and Z = Z-score (a measure
of how many standard deviations a value is below or above the population mean)
Suppose a student measuring the boiling temperature of a certain liquid observes the
readings (in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6 different
samples of the liquid. What is the interval estimation for the population mean at a 95%
confidence level?
Solution:
The sample mean of the boiling temperatures is 101.82, with a standard deviation of
σ = 0.984 (computed from the sample). The confidence level is 95% and the sample size
is 6. The Z-score for a 95% confidence level is 1.96.
µ = 101.82 ± 1.96 * (0.984 / √6) = 101.82 ± 0.787
So, upper confidence limit = 102.61 and lower confidence limit = 101.03
Standard error (SE) = σ / √n = 0.984 / 2.45 = 0.402. It tells how accurately the
sample mean reflects the population mean (it measures the precision of an estimate
of a population mean).
Margin of error = Z * (σ / √n) = 1.96 * (0.984 / 2.45) = 0.787. It is the largest
difference expected, due to random sampling, between the sample mean and the
population mean at the chosen confidence level (it measures the half-width of a
confidence interval for a population mean).
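The boiling-temperature example can be checked with a short script. This is a minimal sketch, assuming the standard deviation is the sample standard deviation (n − 1 denominator) and keeping the Z-score of 1.96 used in the example; the function name `confidence_interval` is illustrative.

```python
import math
import statistics

def confidence_interval(sample, z=1.96):
    """Return (lower limit, upper limit, standard error, margin of error)
    for the mean of `sample`, using mean ± z * (sd / sqrt(n))."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)        # sample standard deviation (n - 1)
    standard_error = sd / math.sqrt(n)
    margin = z * standard_error          # half-width of the interval
    return mean - margin, mean + margin, standard_error, margin

readings = [102.5, 101.7, 103.1, 100.9, 100.5, 102.2]
lower, upper, se, margin = confidence_interval(readings)
print(f"SE = {se:.3f}, margin of error = {margin:.3f}")
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

With unrounded intermediate values the interval comes out at roughly 101.03 to 102.60. For a sample this small a t-based multiplier is often used in place of the Z-score; the sketch keeps the Z-score so that it matches the worked example.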
Problem statement: A sample of test scores in data analytics after the end-semester
examination is 55, 65, 80, 95, 90, 90, 95, 75, 75, 85, 90 and 80. Calculate the
confidence limits and the margin of error at a 95% confidence level.
Sampling Distributions
Recap: A sampling distribution represents the probability of varied outcomes
when a study is conducted.
There are 2 types of sampling distributions i.e.,
Sampling distribution of mean – [Discussed earlier]
T-distribution
As the degrees of freedom (df = n − 1 for a single sample of size n) increase,
the t-distribution gets closer and closer to the standard normal distribution.
The t-statistic is: t = [ x̄ − μ ] / [ s / √n ] where,
t = t score,
x̄ = sample mean,
μ = population mean,
s = standard deviation of the sample,
n = sample size
Note: A t-score is equivalent to the number of standard deviations away
from the mean of the t-distribution.
A law school claims its graduates earn an average of $300 per hour. A
sample of 15 graduates is selected and found to have a mean salary of $280
per hour with a sample standard deviation of $50. Assuming the school’s claim
is true, what is the t-score?
Solution: t= (280 – 300) / (50/ √ 15) = -20 / 12.909945 = -1.549.
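The t-score formula can be written as a small function and applied to the law school example. This is a minimal sketch; the function name `t_score` and its parameter names are illustrative.

```python
import math

def t_score(sample_mean, population_mean, sample_sd, n):
    """t = (x̄ - μ) / (s / √n)"""
    return (sample_mean - population_mean) / (sample_sd / math.sqrt(n))

# Law school example: claimed mean 300, sample mean 280, s = 50, n = 15
t = t_score(280, 300, 50, 15)
print(round(t, 3))  # -1.549
```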
In today’s data-driven world, decisions are based on data all the time.
Hypotheses play a crucial role in that process, whether in business, the
health sector, academia, or quality improvement. Without hypotheses and
hypothesis tests, you risk drawing the wrong conclusions and making bad
decisions.
Hypothesis testing is a type of statistical analysis in which assumptions
about a population parameter are put to the test. It is used to assess the
relationship between variables.
Examples:
A faculty assumes that 60% of his students come from higher-middle-class
families.
A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for
diabetic patients.
It involves setting up a null hypothesis and an alternative hypothesis. These
two hypotheses will always be mutually exclusive. This means that if the null
hypothesis is true then the alternative hypothesis is false and vice versa.
The null hypothesis is the assumption that there is no effect or no difference,
i.e., that the event of interest will not occur. A null hypothesis has no
bearing on the study's outcome unless it is rejected.
Example:
Smokers are no more susceptible to heart disease than nonsmokers.
The new drug has a cure rate no higher than other drugs on the market.
H0 is the symbol for it, and it is pronounced H-naught.
Hypothesis testing is used to conclude if the null hypothesis can be rejected or not.
Suppose an experiment is conducted to check if girls are shorter than boys at the
age of 5. The null hypothesis will say that they are of the same height.
The alternate hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null hypothesis.
It states that there is a statistically significant effect or difference, and it is
denoted as Ha.
For the above-mentioned example, the alternative hypothesis would be that girls
are shorter than boys at the age of 5.
The null hypothesis is usually the current thinking, or status quo. The alternative
hypothesis is usually the hypothesis to be proved. The burden of proof is on the
alternative hypothesis.
Null Hypothesis and Alternate Hypothesis cont…
How to write null and alternate hypotheses - The only things to know are the
dependent variable (DV) and the independent variable (IV). To write the null
hypothesis and the alternative hypothesis, fill in the following sentences with
the variables, i.e., does the independent variable affect the dependent variable?
Null hypothesis (H0): IV does not affect DV.
Alternative Hypothesis (Ha): IV affects DV.
Characteristics of a Hypothesis
It has to be clear and accurate in order to look reliable.
It has to be specific.
There should be scope for further investigation and experiments.
It should be explained in simple language while retaining its
significance.
IVs and DVs must be included with the relationship between them.
Example: Effect of a newly passed bill on the loans of farmers. H0: The newly
passed bill has no significant effect on the loans of farmers. The main
intention is to check whether the new bill affects the loans of farmers in
either direction, i.e., increases or decreases them.
Types of Error
                    Truth
Decision            H0 is true      Ha is true
Reject H0           Type I error    No error
Do not reject H0    No error        Type II error
The question, then, is how strong the evidence in favor of the alternative
hypothesis must be to reject the null hypothesis.
This is done by means of a p-value. The p-value is the probability of seeing
a random sample at least as extreme as the observed sample, given that the
null hypothesis is true. The smaller the p-value, the more evidence there is
in favor of the alternative hypothesis.
The p-values are expressed as decimals and can be converted into a
percentage. For example, a p-value of 0.0237 is 2.37%, which means that, if the
null hypothesis were true, there would be a 2.37% chance of seeing results at
least as extreme as the observed ones.
In a hypothesis test at the 0.05 significance level:
A small p-value (<= 0.05): reject the null hypothesis.
A large p-value (> 0.05): do not reject the null hypothesis.
The p-values are usually calculated using p-value tables, or calculated
automatically using statistical software like R, SPSS, Python etc.
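As an illustration of computing a p-value in software, the sketch below derives a two-sided p-value from a test statistic and applies the 0.05 decision rule. It assumes the test statistic follows a standard normal distribution under the null hypothesis; the function names `two_sided_p_value` and `decide` are hypothetical helpers, not part of any library.

```python
import math

def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z
    in either direction, i.e. P(|Z| >= |z|)."""
    return math.erfc(abs(z) / math.sqrt(2))

def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when the p-value is at most alpha."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

print(decide(0.0237))                      # reject H0
print(round(two_sided_p_value(1.96), 3))   # 0.05
```

A statistic of 1.96 sits right at the 0.05 boundary, which is why 1.96 appears as the critical value for a 95% confidence level.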
Note: Another way to decide the rejection region is with the z-score, which is
applicable when the sample size is greater than or equal to 30.
Hypothesis Testing Example
z = (x̄ − µ0) / (σ / √n)
Where x̄ is the sample mean, µ0 is the population mean under the null hypothesis
(i.e., the mean to be tested), σ is the standard deviation, n is the sample size.
Hypothesis Testing Numerical cont…
Using the data given in the problem, we have the following:
μ0 = 100, σ = 15, n = 30, x̄ = 140
Plugging the values into the formula:
z = (140 − 100) / (15 / √30) = 40 / 2.739 = 14.606
Step 4: Look up the values of z (called the critical value) from statistical
table (The table is predefined and should be referred)
From the table, the critical value is 1.96 for a confidence level of 0.95
(two-tailed, α = 0.05).
Step 5: Draw a conclusion
In this case the calculated test statistic z is greater than the critical
value obtained from statistical tables (i.e., 14.606 > 1.96). Therefore the null
hypothesis is rejected in favor of the alternative hypothesis.
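The numerical example can be verified with a short script. This is a minimal sketch of the z-statistic calculation and the comparison against the tabulated critical value; the function name `z_statistic` is illustrative, and 1.96 is taken from the table rather than computed.

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """z = (x̄ - µ0) / (σ / √n)"""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

# Values from the example: µ0 = 100, σ = 15, n = 30, x̄ = 140
z = z_statistic(140, 100, 15, 30)
critical_value = 1.96  # from the statistical table, 95% confidence level

print(round(z, 3))  # 14.606
print("reject H0" if abs(z) > critical_value else "do not reject H0")
```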
Let’s take a closer look at the movie snacks example. Suppose we collect
data for 600 people at our theater. For each person, we know the type of
movie they saw and whether or not they bought snacks.
For a valid Chi-square test, the following conditions must be satisfied:
1. Data values that are a simple random sample from the population of
interest.
2. Two categorical or nominal variables.
3. For each combination of the levels of the two variables, we need an
expected count of at least five. When the expected count is below five for
any one combination, the test results are not reliable. To confirm this, we need
to know the total counts for each type of movie and the total counts for
whether snacks were bought or not. For now, we assume we meet this
requirement and will check it later.
Lastly, to get our test statistic, we add the numbers in the final row for each
cell: 3.46 + 3.75 + 5.81 + 6.21 + 12.65 + 13.52 + 9.14 + 10.70 = 65.24
Now, we need to find the critical value from the Chi-square distribution based
on degrees of freedom and significance level. If the two variables are
independent, the test statistic exceeds this value only with probability α.
The degrees of freedom depend on how many rows and how many columns
we have. The degrees of freedom (df) are calculated as df=(r−1)×(c−1) where
r is the number of rows, and c is the number of columns in the contingency
table. From the example, r is 4 and c is 2. Hence, df = (4−1)×(2−1)=3×1=3.
The Chi-square value with α = 0.05 (it is given and represents the probability
of rejecting the null hypothesis when it is true) and three degrees of freedom
is 7.815. Note: This value of 7.815 is read from the Chi-square distribution
table. Refer to the Appendix for further details.
We compare the value of our test statistic (65.24) to the Chi-square value.
Since 65.24 > 7.815, we reject the idea that movie type and snack purchases
are independent.
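The final steps of the movie snacks example can be sketched in code. The per-cell contributions, the table shape (4 rows, 2 columns), and the critical value 7.815 are taken from the example; the helper `chi_square_statistic` is a hypothetical function showing how such contributions are typically obtained from a table of observed counts, and it is not applied here because the observed counts are not listed in this part.

```python
def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells, where
    expected = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

def degrees_of_freedom(rows, cols):
    """df = (r - 1) * (c - 1)"""
    return (rows - 1) * (cols - 1)

# Per-cell contributions quoted in the example
contributions = [3.46, 3.75, 5.81, 6.21, 12.65, 13.52, 9.14, 10.70]
statistic = sum(contributions)
df = degrees_of_freedom(4, 2)
critical_value = 7.815  # table value for alpha = 0.05 and df = 3

print(round(statistic, 2), df)  # 65.24 3
print("reject independence" if statistic > critical_value
      else "do not reject independence")
```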
Chi-square Test for Independence Example cont…