Module_3_Class

Class Notes by professor for Module 3 of BCS301 CSE

Module III: Statistical Inference 1

Topic 1: Introduction to Statistical Inference, Sampling Distribution,
Standard Error, Testing of Hypothesis, Levels of Significance, Test of
Significance, and Confidence Limits

Dr. P. Rajendra

Professor, Dept. of Maths

CMRIT, Bengalore.

Dr. P. Rajendra (Professor, Dept. of Maths) Module III: Statistical Inference 1 CMRIT, Bengalore. 1 / 14
Outline

1 Sampling in Statistics

2 Statistical Inference

3 Sampling Distribution

4 Standard Error

5 Hypothesis Testing

6 Type I, Type II Errors and Levels of Significance

7 Confidence Interval

Sampling in Statistics
* Sampling is a statistical method of obtaining representative data
(observations) from a group. We often use sampling concepts in
everyday life without realizing it. For example, checking the quality
of rice by taking a handful is an example of random sampling from a
large population.
* Population (Universe): The group of objects (or individuals) under
study is called the population or universe. A population can be either
finite or infinite.
* Sample: A part of the population that contains selected objects (or
individuals) is called a sample. The number of individuals in a sample
is called the sample size.
* Random sampling is the selection of objects (individuals) from the
population in such a way that each object has the same chance of
being selected. The lottery system is a common example of random
sampling.
* Simple sampling is a special case of random sampling in which each
trial has the same probability of success, and the trials are
independent of one another.
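The idea of random sampling described above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered items, the sample size of 10, and the helper name `simple_random_sample` are assumptions made for the example, not part of the notes.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct items so that every item is equally likely to be chosen."""
    rng = random.Random(seed)  # seeded generator so the draw is reproducible
    return rng.sample(population, k)  # sampling without replacement


# Hypothetical population: items numbered 1..100 (like lottery tickets)
population = list(range(1, 101))
sample = simple_random_sample(population, k=10, seed=42)

print("sample size:", len(sample))
print("sample:", sorted(sample))
```

Changing the seed gives a different sample, but every item always has the same chance of being selected.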
Statistical Inference

Statistical inference is the process of drawing conclusions about
populations based on samples. It involves two main activities:
(i). Estimation: Determining population parameters from sample
statistics. Estimating a single value for a population parameter is
called point estimation. Providing a range of values for a population
parameter is called interval estimation.
(ii). Hypothesis testing: Making decisions about populations based on
sample data.
Key concepts:
. Population: The entire group we want to study
. Sample: A subset of the population
. Parameter: A characteristic of the population (e.g., mean, variance)
. Statistic: An estimate of the parameter based on the sample.

Population vs. Sample
Population:
(a) Entire group of interest
(b) Often too large to study entirely
(c) Described by parameters (e.g., µ, σ)
Sample:
(a) Subset of the population
(b) Used to make inferences
(c) Described by statistics (e.g., x̄, s)
Example (AI/ML): We often use statistical inference to understand and
make predictions about large datasets, for instance when predicting
house prices from features such as size and location.
. Population: All houses in a city
. Sample: Dataset of houses with known prices and features
. Parameter: True relationship between features and price
. Statistic: Estimated relationship from our model (e.g., coefficients in
linear regression)
. Inference: Using the model to predict prices of new houses

Sampling Distribution
A sampling distribution is the distribution of a statistic over many
samples. It describes the variability of the statistic. Most commonly used:
sampling distribution of the sample mean (X̄ ).
Figure: Sampling Distribution of the Mean (normal distribution curve; horizontal axis: Sample Mean, from 95 to 105; vertical axis: Frequency)


* Properties:
. Center: Expected value of the statistic
. Spread: Variability of the statistic
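A small simulation can make the center and spread of a sampling distribution concrete. The sketch below assumes a normal population with mean 100 and standard deviation 10 and samples of size 25; these values and the function name `sample_means` are illustrative assumptions only.

```python
import random
import statistics


def sample_means(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples samples of size n and return the mean of each one."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


# Assumed population: normal with mean 100 and standard deviation 10
means = sample_means(mu=100.0, sigma=10.0, n=25, num_samples=5000)

center = statistics.fmean(means)   # should be close to the population mean
spread = statistics.stdev(means)   # should be close to sigma / sqrt(n) = 2

print(f"center of the sampling distribution: {center:.2f}")
print(f"spread of the sampling distribution: {spread:.2f}")
```

The spread of the simulated sample means is much smaller than the population standard deviation, which is why averages of samples are more stable than individual observations.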
Standard Error
i. The Standard Error (SE) is the standard deviation of a sampling
distribution. The most commonly used one is Standard Error of the
Mean.
ii. Formula: $SE = \sigma / \sqrt{n}$, where σ is the population standard deviation
and n is the sample size. If σ is unknown, we estimate it with the
sample standard deviation s. The reciprocal of standard error is called
the precision.
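The formula for the standard error of the mean can be evaluated directly. The sketch below uses an assumed population standard deviation of 10 and a few sample sizes to show how the standard error shrinks, and the precision grows, as n increases; the function name `standard_error` is chosen only for this example.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)


sigma = 10.0  # assumed population standard deviation
for n in (20, 40, 60, 80, 100):
    se = standard_error(sigma, n)
    precision = 1.0 / se  # reciprocal of the standard error
    print(f"n = {n:3d}  SE = {se:.3f}  precision = {precision:.3f}")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.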
Figure: Effect of Sample Size on Standard Error (curve SE = σ/√n; horizontal axis: Sample Size (n), from 20 to 100; vertical axis: Standard Error)
Hypothesis Testing
* Hypothesis testing is a statistical method used to make inferences
about a population based on sample data. It involves making an
assumption (hypothesis) about a population parameter and then
testing this assumption using sample data.
* Key components:
. Null hypothesis (H0 ): The assumption we start with
. Alternative hypothesis (H1 or Ha ): The competing claim
. Test statistic: A value calculated from the sample data
. Decision rule: Criteria for rejecting or failing to reject H0
Steps in Hypothesis Testing:
1 State the null and alternative hypotheses
2 Choose a significance level (α)
3 Select the appropriate test statistic
4 Determine the critical region
5 Calculate the test statistic from sample data
6 Make a decision: Reject H0 or fail to reject H0
7 Interpret the results
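Steps 4 to 6 can be sketched for a two-tailed Z-test as follows. This is a minimal illustration that assumes the test statistic is already a standard normal z-value; the function name `two_tailed_z_decision` and the example inputs are hypothetical.

```python
from statistics import NormalDist


def two_tailed_z_decision(z, alpha=0.05):
    """Return (critical value, p-value, reject?) for a two-tailed Z-test."""
    std_normal = NormalDist()                      # mean 0, standard deviation 1
    z_crit = std_normal.inv_cdf(1 - alpha / 2)     # boundary of the critical region
    p_value = 2 * (1 - std_normal.cdf(abs(z)))     # probability of a more extreme |z|
    reject = abs(z) > z_crit                       # decision rule
    return z_crit, p_value, reject


# Illustrative test statistics at two significance levels
for z, alpha in ((1.6, 0.05), (3.2, 0.01)):
    z_crit, p_value, reject = two_tailed_z_decision(z, alpha)
    decision = "reject H0" if reject else "fail to reject H0"
    print(f"z = {z}, alpha = {alpha}: critical value = {z_crit:.3f}, "
          f"p-value = {p_value:.4f}, decision: {decision}")
```

Comparing |z| with the critical value and comparing the p-value with α always lead to the same decision.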

Type I, Type II Errors and Levels of Significance
* Type I Error (False Positive): Rejecting the null hypothesis when it’s
actually true. Probability = α (significance level)
* Type II Error (False Negative): Failing to reject the null hypothesis
when it’s actually false. Probability = β
* Power of a test = 1 - β (probability of correctly rejecting a false null
hypothesis)
* The level of significance (α) is the probability of rejecting the null
hypothesis when it is actually true. It represents the risk of making a
Type I error.
* Common levels: 0.05 (5%), 0.01 (1%), 0.10 (10%)
* A smaller α means a lower risk of Type I error and a higher risk of
Type II error (failing to reject a false null hypothesis).
* The choice of significance level depends on the nature of the
problem and the consequences of errors: (i) critical applications such as
healthcare diagnosis call for a lower α (0.01 or 0.001); (ii) exploratory
data analysis can tolerate a higher α (0.05 or 0.10).
* For a fixed sample size, decreasing α typically increases β.
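The trade-off between α and β can be illustrated numerically. The sketch below assumes a two-tailed Z-test in which the true mean lies a fixed number of standard errors (`shift`, a hypothetical parameter set to 2.5 here) away from the hypothesized mean; under that assumption the test statistic is normal with mean `shift` and unit variance.

```python
from statistics import NormalDist


def type_ii_error(alpha, shift):
    """Beta for a two-tailed Z-test when the true mean is `shift` standard errors from H0."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative the statistic is N(shift, 1);
    # beta is the probability that it still lands in the acceptance region.
    return std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)


shift = 2.5  # assumed true effect, measured in standard errors
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, shift)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

With the effect size held fixed, lowering α widens the acceptance region, so β rises and the power 1 − β falls.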
Figure: Relationship between significance level and errors (Type II error region β and Type I error region α, with the critical value marked; horizontal axis: Test Statistic; vertical axis: Probability)
Confidence Interval

* A confidence interval is a range of values that is likely to contain the
true population parameter
* It quantifies the uncertainty in our point estimate
* Interpretation: If we repeated the sampling process many times, the
computed interval would contain the true parameter in X% of the cases,
where X% is the confidence level
* Common confidence levels: 95%, 99%, 90%
* General form: Point Estimate ± (Critical Value × Standard Error)
* For a population mean (large sample):
$$\bar{X} \pm Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$
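The large-sample interval for a mean can be computed as in the sketch below. The sample mean of 100, standard deviation of 10, and sample size of 25 are assumed values for illustration, and `mean_confidence_interval` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Interval x_bar ± z_(alpha/2) * sigma / sqrt(n) for a population mean."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_(alpha/2)
    margin = z * sigma / math.sqrt(n)         # critical value times standard error
    return x_bar - margin, x_bar + margin


# Assumed inputs: sample mean 100, sigma 10, sample size 25
low, high = mean_confidence_interval(x_bar=100.0, sigma=10.0, n=25)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")

low99, high99 = mean_confidence_interval(x_bar=100.0, sigma=10.0, n=25, confidence=0.99)
print(f"99% confidence interval: ({low99:.2f}, {high99:.2f})")
```

A higher confidence level uses a larger critical value, so the 99% interval is wider than the 95% interval for the same data.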

Figure: 95% Confidence Interval (interval centered on the point estimate; horizontal axis: Parameter Value; vertical axis: Probability)
Question 1: Explain the following terms:
(i). Standard Error
(ii). Statistical Hypothesis
(iii). Critical Region of a Statistical Test
(iv). Test of Significance
Answer:
i. Standard Error: The standard deviation of the sampling distribution of
a statistic, usually the mean. It measures the accuracy with which a
sample represents the population.
ii. Statistical Hypothesis: A statement about a population parameter
that can be tested using statistical methods. Common hypotheses include
null (H0 ) and alternate (H1 ).
iii. Critical Region of a Statistical Test: The range of values for which
the null hypothesis is rejected. If the test statistic falls in this region, it
indicates that the result is statistically significant.
iv. Test of Significance: A method to determine if the observed data
provide enough evidence to reject a null hypothesis. Common tests include
the Z-test, t-test, and chi-square test.
Question 2: Define the following terms:
(i). Alternate Hypothesis
(ii). A Statistic
(iii). Level of Significance
(iv). Two-Tailed Test
Answer:
i. Alternate Hypothesis (H1 ): A hypothesis that proposes a change or
difference from the null hypothesis. It represents the conclusion that is
accepted if the null hypothesis is rejected.
ii. A Statistic: A quantity calculated from sample data, used to estimate
a population parameter. Examples: sample mean, sample variance.
iii. Level of Significance (α): The probability of rejecting the null
hypothesis when it is actually true (Type I error). Common values are 0.05
(5%) and 0.01 (1%).
iv. Two-Tailed Test: A test of significance where the critical region is in
both tails of the probability distribution. It checks for deviation in either
direction from the hypothesized value.

Hypothesis Testing problems
(Problems based on Binomial distributions and Proportions)

Hypothesis Testing problems based on Binomial
distribution
The binomial distribution can be approximated by the normal distribution
when the sample size is large. The normal approximation is usually
considered valid if np ≥ 5 and nq ≥ 5, where p is the probability of
success and q = 1 − p.
Steps of Hypothesis Testing:
1 State the Hypotheses:
H0 : p = p0 (Null Hypothesis)
H1 : p ≠ p0 (Alternative Hypothesis) for a two-tailed test.
2 Choose the Significance Level:
Common values: α = 0.05 or α = 0.01.
3 Calculate the Test Statistic:
Use the formula $z = \dfrac{x - np}{\sqrt{npq}}$, where x is the observed
number of successes in n trials.
4 Determine the Critical Value:
For a two-tailed test, compare |z| with $z_{\alpha/2}$ (e.g., 1.96 for
α = 0.05).
5 Make a Decision:
Reject or fail to reject H0 based on the comparison.
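The five steps above can be sketched as a small function. The helper name `binomial_z_test` is hypothetical, the inputs (216 successes in 400 trials, and 864 in 1600) are used only as illustrative data, and the extra check on nq is a common textbook condition added here as an assumption.

```python
import math
from statistics import NormalDist


def binomial_z_test(x, n, p0, alpha=0.05):
    """Two-tailed Z-test for x successes in n trials under H0: p = p0."""
    q0 = 1 - p0
    # Assumed validity check for the normal approximation
    if n * p0 < 5 or n * q0 < 5:
        raise ValueError("normal approximation may be unreliable (need np >= 5 and nq >= 5)")
    z = (x - n * p0) / math.sqrt(n * p0 * q0)        # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # two-tailed critical value
    return z, z_crit, abs(z) > z_crit


z1, crit1, reject1 = binomial_z_test(x=216, n=400, p0=0.5, alpha=0.05)
print(f"z = {z1:.2f}, critical value = {crit1:.3f}, reject H0: {reject1}")

z2, crit2, reject2 = binomial_z_test(x=864, n=1600, p0=0.5, alpha=0.01)
print(f"z = {z2:.2f}, critical value = {crit2:.3f}, reject H0: {reject2}")
```

The same deviation of 4 percentage points from 0.5 is not significant with 400 trials but is significant with 1600, because the standard deviation of the count grows only like the square root of n.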
Problem 1: A coin was tossed 400 times, and the head turned up 216
times. Test the hypothesis that the coin is unbiased at a 5% level of
significance.

Solution:
Step 1:
Null hypothesis (H0 ): The coin is unbiased (p = 0.5).
Alternative hypothesis (H1 ): The coin is biased (p ≠ 0.5).
Step 2: Expected number of heads
$$E(\text{heads}) = np = \frac{1}{2} \times 400 = 200$$
Observed number of heads = 216.
Step 3: Standard Deviation
$$\text{S.D.} = \sqrt{npq} = \sqrt{400 \times \frac{1}{2} \times \frac{1}{2}} = \sqrt{100} = 10$$
Step 4:
The z-test statistic formula is:
$$z = \frac{x - np}{\sqrt{npq}}$$
Substituting the values:
$$z = \frac{216 - 200}{10} = \frac{16}{10} = 1.6$$
Step 5:
At the 5% level of significance, the critical values for a two-tailed test are
z = ±1.96. Since z = 1.6 is less than 1.96, the calculated z-score lies in
the acceptance region and we fail to reject the null hypothesis.

Conclusion: There is no significant evidence that the coin is biased, so it
may be regarded as unbiased at the 5% significance level.
Problem 2: A coin was tossed 1600 times, and the tail turned up 864
times. Test the hypothesis that the coin is unbiased at a 1% level of
significance.

Solution:
Step 1:
Null hypothesis (H0 ): The coin is unbiased (p = 0.5).
Alternative hypothesis (H1 ): The coin is biased (p ≠ 0.5).
Step 2: Expected number of tails
$$E(\text{tails}) = np = \frac{1}{2} \times 1600 = 800$$
Observed number of tails = 864.
Step 3: Standard Deviation
$$\text{S.D.} = \sqrt{npq} = \sqrt{1600 \times \frac{1}{2} \times \frac{1}{2}} = \sqrt{400} = 20$$
Step 4:
The z-test statistic formula is:
$$z = \frac{x - np}{\sqrt{npq}}$$
Substituting the values:
$$z = \frac{864 - 800}{20} = \frac{64}{20} = 3.2$$
Step 5:
At the 1% level of significance, the critical values for a two-tailed test are
z = ±2.576. Since z = 3.2 is greater than 2.576, the calculated z-score lies
in the rejection region and we reject the null hypothesis.
Conclusion: The coin is biased at the 1% significance level.
Problem 3: In 324 throws of a six-faced die, an odd number turned up
181 times. Test the hypothesis that the die is unbiased at the 1% level of
significance.

Solution:
Step 1:
Null hypothesis (H0 ): The die is unbiased (p = 0.5), i.e., the
probability of getting an odd number (1, 3, 5) is the same as the
probability of getting an even number (2, 4, 6).
Alternative hypothesis (H1 ): The die is biased (p ≠ 0.5).
Step 2: Expected Number of Odd Numbers
The probability of getting an odd number (1, 3, or 5) on a fair die is:
$$P(\text{odd number}) = \frac{3}{6} = 0.5$$
Hence, the expected number of odd numbers in 324 throws is:
$$E(\text{odd numbers}) = np = 0.5 \times 324 = 162$$
Observed number of odd numbers = 181.
Step 3:
The standard deviation is calculated using the formula for binomial
distribution:
$$\text{S.D.} = \sqrt{npq} = \sqrt{324 \times 0.5 \times 0.5} = \sqrt{81} = 9$$
Step 4:
The z-test statistic formula is:
$$z = \frac{x - np}{\sqrt{npq}}$$
Substituting the values:
$$z = \frac{181 - 162}{9} = \frac{19}{9} \approx 2.11$$
Step 5: At the 1% level of significance for a two-tailed test, the critical
values are z = ±2.576. Since z = 2.11 is less than 2.576, the calculated
z-score lies within the acceptance region and we fail to reject the
null hypothesis.

Conclusion: The die is not significantly biased at the 1% significance
level.
Problem 4: A die is thrown 9000 times, and a throw of 3 or 4 was
observed 3240 times. Test whether the die can be regarded as unbiased.

Solution:
Step 1:
Null hypothesis (H0 ): The die is unbiased (p = 2/6 = 1/3), i.e., the
probability of getting a 3 or 4 is 1/3.
Alternative hypothesis (H1 ): The die is biased (p ≠ 1/3).
Step 2: Expected Number of 3’s or 4’s
The probability of getting a 3 or 4 on a fair die is:
$$P(3 \text{ or } 4) = \frac{2}{6} = \frac{1}{3}$$
Hence, the expected number of 3’s or 4’s in 9000 throws is:
$$E(3 \text{ or } 4) = np = \frac{1}{3} \times 9000 = 3000$$
Observed number of 3’s or 4’s = 3240.
Step 3:
The standard deviation is calculated using the formula for binomial
distribution:
$$\text{S.D.} = \sqrt{npq} = \sqrt{9000 \times \frac{1}{3} \times \frac{2}{3}} = \sqrt{2000} \approx 44.72$$
Step 4:
The z-test statistic formula is:
$$z = \frac{x - np}{\sqrt{npq}}$$
Substituting the values:
$$z = \frac{3240 - 3000}{44.72} = \frac{240}{44.72} \approx 5.37$$
Step 5: At the 5% level of significance for a two-tailed test, the critical
values are z = ±1.96. Since z = 5.37 is much greater than 1.96, the
calculated z-score lies far in the rejection region and we reject the
null hypothesis.
Conclusion: The die is significantly biased at the 5% significance level.
Hypothesis test problems based on Proportions
A hypothesis test for a single proportion is used to determine if the sample
proportion differs significantly from a hypothesized proportion in the
population.
Step 1:
Null Hypothesis (H0 ): The population proportion is equal to the
hypothesized proportion. H0 : p = p0
Alternative Hypothesis (Ha ): The population proportion is different
from the hypothesized proportion. Ha : p ≠ p0
Step 2: Choose Significance Level (α) as 0.05 or 0.01.
Step 3: Formula for the test statistic Z :
$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}}$$
where p̂ is the sample proportion, p0 is the hypothesized proportion,
q0 = 1 − p0 , and n is the sample size.
Step 4: Compare the calculated Z to the critical value(s) and draw a
conclusion.
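The proportion form of the test can be sketched as below. Dividing the numerator and denominator of the count-based statistic (x − np0)/√(np0q0) by n gives exactly the proportion formula, so the two forms must agree; the code checks this numerically. The function names and the inputs (3240 successes in 9000 trials with p0 = 1/3) are illustrative assumptions.

```python
import math
from statistics import NormalDist


def proportion_z(x, n, p0):
    """Z statistic in proportion form: (p_hat - p0) / sqrt(p0 * q0 / n)."""
    p_hat = x / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


def count_z(x, n, p0):
    """Z statistic in count form: (x - n * p0) / sqrt(n * p0 * q0)."""
    return (x - n * p0) / math.sqrt(n * p0 * (1 - p0))


x, n, p0, alpha = 3240, 9000, 1 / 3, 0.05   # illustrative inputs
z_prop = proportion_z(x, n, p0)
z_count = count_z(x, n, p0)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

print(f"proportion form: z = {z_prop:.4f}")
print(f"count form:      z = {z_count:.4f}")
print(f"critical value = {z_crit:.3f}, reject H0: {abs(z_prop) > z_crit}")
```

Any small difference between hand calculations of the two forms comes only from rounding intermediate values such as p0 = 0.3333.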
Problem 5: A coin is tossed 400 times, and it turns up heads 216 times.
Test whether the coin may be regarded as an unbiased one at the 5%
significance level.
Solution:
Step 1: Null Hypothesis (H0 ):
The coin is unbiased, meaning the proportion of heads is 0.5.
H0 : p = 0.5
Alternative Hypothesis (H1 ):
The coin is biased, meaning the proportion of heads is not equal to 0.5.
H1 : p ≠ 0.5
This is a two-tailed test.
Step 2:
Observed Proportion of Heads:
$$\hat{p} = \frac{216}{400} = 0.54$$
Expected Proportion under H0 :
p0 = 0.5
Step 3: Test Statistic:
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}}$$
Substituting the values:
$$z = \frac{0.54 - 0.5}{\sqrt{\frac{0.5 \times 0.5}{400}}} = \frac{0.04}{\sqrt{\frac{0.25}{400}}} = \frac{0.04}{0.025} = 1.6$$
Step 4:
For a two-tailed test at the 5% significance level, the critical values are
z = ±1.96 and the calculated z-value is 1.6, which is within the
acceptance region (−1.96, 1.96).
Conclusion: Since the calculated z-value does not exceed the critical
value, we fail to reject the null hypothesis. Therefore, there is no
significant evidence at the 5% level that the coin is biased.
Problem 6: A coin is tossed 1600 times, and tails turn up 864 times. Test
the hypothesis that the coin is unbiased at a 1% level of significance.
Step 1: Null Hypothesis (H0 ):
The coin is unbiased, meaning the proportion of tails is 0.5.
H0 : p = 0.5
Alternative Hypothesis (H1 ):
The coin is biased, meaning the proportion of tails is not equal to 0.5.
H1 : p ≠ 0.5
This is a two-tailed test.
Step 2: Significance level (α) = 0.01.
Step 3: Observed Proportion of Tails:
$$\hat{p} = \frac{864}{1600} = 0.54$$
Expected Proportion under H0 :
p0 = 0.5
Step 4: Test Statistic is
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$
Substituting the values:
$$z = \frac{0.54 - 0.5}{\sqrt{\frac{0.5(1 - 0.5)}{1600}}} = \frac{0.04}{\sqrt{\frac{0.25}{1600}}} = \frac{0.04}{0.0125} = 3.2$$
Step 5: For a two-tailed test at the 1% significance level, the critical
values are z = ±2.576. The calculated z-value is 3.2, which lies in the
rejection region.
Conclusion: Since the calculated z = 3.2 exceeds the critical value 2.576,
we reject the null hypothesis. Therefore, the coin is biased at the 1%
significance level.
Problem 7: In 324 throws of a six-faced die, an odd number turned up
181 times. Test the hypothesis that the die is unbiased at a 1% level of
significance.
Solution:
Step 1:
Null Hypothesis (H0 ): The die is unbiased, meaning the proportion of odd
numbers is 0.5.
H0 : p = 0.5
Alternative Hypothesis (H1 ): The die is biased, meaning the proportion of
odd numbers is not equal to 0.5.
H1 : p ≠ 0.5
Step 2: Significance level (α) = 0.01.
Step 3: Observed Proportion of Odd Numbers:
$$\hat{p} = \frac{181}{324} \approx 0.5586$$
Expected Proportion under H0 :
p0 = 0.5
Step 4: The Test Statistic is
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$
Substituting the values:
$$z = \frac{0.5586 - 0.5}{\sqrt{\frac{0.5(1 - 0.5)}{324}}} = \frac{0.0586}{\sqrt{\frac{0.25}{324}}} \approx \frac{0.0586}{0.0278} \approx 2.11$$
Step 5: For a two-tailed test at the 1% significance level, the critical
values are z = ±2.576. The calculated z-value is approximately 2.11,
which lies in the acceptance region and not in the rejection region.
Conclusion: Since the calculated z = 2.11 is less than the critical value
2.576, we fail to reject the null hypothesis. Therefore, there is insufficient
evidence to conclude that the die is biased.
Problem 8 A die is thrown 9000 times, and a throw of 3 or 4 was
observed 3240 times. Test whether the die can be regarded as unbiased
using a hypothesis test for proportions.
Solution:
Step 1:
Null Hypothesis (H0 ): The die is unbiased, meaning the proportion of
throws resulting in 3 or 4 is 2/6 = 1/3.
$$H_0 : p = \frac{1}{3}$$
Alternative Hypothesis (H1 ): The die is biased, meaning the proportion of
throws resulting in 3 or 4 is not equal to 1/3.
$$H_1 : p \neq \frac{1}{3}$$
Step 2:
We will conduct the test at the 5% significance level (α = 0.05). This
means there is a 5% risk of rejecting the null hypothesis when it is true.
Step 3: Observed Proportion of Throws with 3 or 4:
$$\hat{p} = \frac{3240}{9000} = 0.36$$
Expected Proportion under H0 : p0 = 1/3 ≈ 0.3333
Step 4: The Test Statistic is
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$
Substituting the values:
$$z = \frac{0.36 - 0.3333}{\sqrt{\frac{0.3333(1 - 0.3333)}{9000}}} = \frac{0.0267}{\sqrt{\frac{0.2222}{9000}}} \approx \frac{0.0267}{0.00497} \approx 5.37$$
Step 5: For a two-tailed test at the 5% significance level, the critical
values are z = ±1.96. The calculated z-value is approximately 5.37, which
lies in the rejection region.
Conclusion: Since the calculated z = 5.37 is greater than the critical
value 1.96, we reject the null hypothesis. Therefore, we conclude that the
die is biased.
Assignment Problems

1. In a city, a sample of 500 people is taken, out of which 280 are tea
drinkers, and the rest are coffee drinkers. Can we assume that both coffee
and tea are equally popular in this city at a 5% level of significance?

2. A manufacturing company claims that at least 95% of its products
supplied conform to the specifications. Out of a sample of 200 products,
18 are found to be defective. Test the claim at a 5% level of significance.
Sampling and Significance Tests
(Simple sampling of attributes. Test of significance for large samples,
comparison of large samples)

Simple Sampling of Attributes

Simple Sampling: A sampling method where every individual or
attribute in the population has an equal chance of being selected. This
method ensures a random and unbiased sample, essential for
statistical inference in AI and ML models.
Scenario in AI/ML: When working with datasets, simple sampling
can be used to create training and test sets. This helps ensure that
the model generalizes well to unseen data.
For example, if we want to sample 100 data points from a dataset of
10,000, we use simple random sampling to ensure all data points have
an equal chance of being selected.

Test of Significance for Large Samples

Test of significance: Used to determine whether the observed data
significantly deviate from what is expected under the null
hypothesis. For large samples (sample size n > 30), we can
apply the Z-test.
Z-test:
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \qquad (1)$$
where x̄ is the sample mean, µ is the population mean, σ is the population
standard deviation, and n is the sample size.
Critical value: Compare the Z-value to the critical value from the
Z-table at a given significance level (e.g., α = 0.05).
Scenario in AI/ML: Tests of significance can be used for feature selection
or model comparison, for example comparing the predictions of two models
to see if one is significantly better than the other, or testing whether a
feature significantly affects the output of a model.
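The Z-test for a single mean in formula (1) can be sketched as follows. The function name `one_sample_z` and the inputs (population mean 74.5, standard deviation 8, sample mean 75.9, sample size 200) are illustrative assumptions.

```python
import math
from statistics import NormalDist


def one_sample_z(x_bar, mu, sigma, n):
    """Z statistic for a sample mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / math.sqrt(n))


z = one_sample_z(x_bar=75.9, mu=74.5, sigma=8.0, n=200)
print(f"Z = {z:.2f}")

# Compare against two-tailed critical values at two significance levels
for alpha in (0.05, 0.01):
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = abs(z) > z_crit
    print(f"alpha = {alpha}: critical value = {z_crit:.3f}, significant: {significant}")
```

A result can be significant at the 5% level and not at the 1% level, because the 1% critical value is larger.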
Comparison of Large Samples

When comparing two large samples, we test if the means of two
populations are significantly different.
Hypotheses:
Null Hypothesis H0 : The two population means are equal (µ1 = µ2 ).
Alternate Hypothesis H1 : The two population means are different (µ1 ≠ µ2 ).
Two-sample Z-test formula:
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \qquad (2)$$
where x̄1 , x̄2 are the sample means, σ1 , σ2 are the population
standard deviations, and n1 , n2 are the sample sizes.
Scenario: We have two datasets with different feature values, and we
want to test whether their mean feature values differ significantly.
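The two-sample Z-test for a difference of means can be sketched as below. The function name `two_sample_z` is hypothetical, and the two sets of inputs are illustrative summary statistics (mean, standard deviation, sample size) for two groups.

```python
import math
from statistics import NormalDist


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic for the difference between two sample means."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)   # two-tailed critical value at the 5% level

# Illustrative data set 1: means 75 and 73
z_a = two_sample_z(75, 8, 60, 73, 10, 100)
print(f"Z = {z_a:.2f}, significant at 5%: {abs(z_a) > z_crit}")

# Illustrative data set 2: means 85 and 82
z_b = two_sample_z(85, 5, 50, 82, 6, 80)
print(f"Z = {z_b:.2f}, significant at 5%: {abs(z_b) > z_crit}")
```

The variances of the two sample means add, which is why the denominator is the square root of a sum and not a sum of square roots.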

Test of Significance of Difference between Two Sample
Proportions
Used to test whether two population proportions are significantly
different.
Null Hypothesis (H0 ): P1 = P2
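Alternate Hypothesis (H1 ): P1 ≠ P2
Test statistic:
$$Z = \frac{p_1 - p_2}{\sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \qquad (3)$$
where p1 and p2 are the sample proportions,
$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{x_1 + x_2}{n_1 + n_2}$$
is the pooled sample proportion, x1 and x2 are the numbers of successes in
the two samples, and n1 and n2 are the sample sizes.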
Compare the Z-value to the critical value at the chosen significance
level.
Scenario in AI / ML: Testing whether the proportion of successful
outcomes in two models is significantly different.
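The pooled two-proportion Z-test can be sketched as follows. The function name `two_proportion_z` is hypothetical, and the counts (65 successes out of 300 versus 35 out of 200) are illustrative inputs.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Return (pooled proportion, Z statistic) for two samples of successes."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                        # pooled sample proportion P
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, (p1 - p2) / standard_error


pooled, z = two_proportion_z(x1=65, n1=300, x2=35, n2=200)
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)   # two-tailed critical value at the 5% level

print(f"pooled proportion P = {pooled:.3f}")
print(f"Z = {z:.2f}, critical value = {z_crit:.2f}, significant: {abs(z) > z_crit}")
```

Pooling is appropriate here because the null hypothesis assumes both samples come from populations with the same proportion.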
Problem 1: In an examination, the mean grade of students across various
schools was 74.5 with a standard deviation of 8. At one particular school,
200 students took the exam, and their mean grade was 75.9. Test the
significance of this result at the 5% and 1% significance levels.
Solution:
Step 1: Hypotheses
Null hypothesis: H0 : µ = 74.5
Alternative hypothesis: HA : µ ≠ 74.5 (two-tailed test)
Step 2: µ = 74.5, σ = 8, x̄ = 75.9, n = 200
Step 3: Z-test formula
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{75.9 - 74.5}{8 / \sqrt{200}} \approx 2.47$$
Step 4: Compare Z-value
. Critical Z-value at 5%: Z = ±1.96
. Critical Z-value at 1%: Z = ±2.58
Conclusion:
. Significant at 5% level, as Z = 2.47 exceeds 1.96.
. Not significant at 1% level, as Z = 2.47 is less than 2.58.
Problem 2: In a coding competition, the national mean score was 82 with
a standard deviation of 10. At a particular university, 150 students
participated, and their mean score was 85. Is the difference significant at
5% and 1% significance levels?
Solution:
Step 1:
Null hypothesis: H0 : µ = 82
Alternative hypothesis: HA : µ ≠ 82 (two-tailed test)
Step 2: µ = 82, σ = 10, x̄ = 85, n = 150
Step 3: Z-test formula
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{85 - 82}{10 / \sqrt{150}} \approx 3.67$$
Step 4:
Critical Z-value at 5%: Z = ±1.96
Critical Z-value at 1%: Z = ±2.58
Conclusion:
Significant at both 5% and 1% levels, as Z = 3.67 exceeds both
critical values.
Problem 3: Intelligence tests were given to two groups of boys and girls.
Their respective means, standard deviations, and sample sizes are given
below:
Group Mean (x̄) Standard Deviation (σ) Sample Size (n)
Girls 75 8 60
Boys 73 10 100
We need to determine if the two means significantly differ at the 5% level
of significance.

Solution:
Step 1: Null hypothesis (H0 ): µ1 = µ2
Alternative hypothesis (HA ): µ1 ≠ µ2 (two-tailed test)
Step 2: x̄1 = 75, x̄2 = 73, σ1 = 8, σ2 = 10, n1 = 60, n2 = 100
Step 3: Z-test formula for difference of means:
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{75 - 73}{\sqrt{\frac{8^2}{60} + \frac{10^2}{100}}} \approx 1.39$$
Since the calculated Z-value (1.39) is less than 1.96, we fail to reject the
null hypothesis at the 5% level. Hence, there is no significant difference
between the means of boys and girls.
Problem 4: A group of researchers tested two machine learning
algorithms (Algorithm A and Algorithm B) on a benchmark dataset. The
results of their accuracy scores are as follows:
Algorithm Mean (x̄) Standard Deviation (σ) Sample Size (n)
Algorithm A 85 5 50
Algorithm B 82 6 80
Using a significance level of 5%, determine if there is a significant
difference in the mean accuracy scores of the two algorithms.

Solution:
Step 1: Null hypothesis (H0 ): µA = µB
Alternative hypothesis (HA ): µA ≠ µB (two-tailed test)
Step 2: x̄A = 85, x̄B = 82, σA = 5, σB = 6, nA = 50, nB = 80
Step 3: Z-test formula for difference of means:
$$Z = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}} = \frac{85 - 82}{\sqrt{\frac{5^2}{50} + \frac{6^2}{80}}} \approx 3.08$$
Since the calculated Z-value (3.08) is greater than 1.96, we reject the
null hypothesis at the 5% level. Hence, there is a significant difference
between the mean accuracy scores of Algorithm A and Algorithm B.
Hypothesis Testing problems - II
Difference of Two Proportions

Problem 1: A sample of 300 units of a manufactured product contains 65
defective units. In another sample of 200 units, 35 units were found
defective. At the 5% level of significance, we want to test if there is a
significant difference in the proportion of defectives between the two
samples.
Solution:
Step 1:
Null hypothesis (H0 ): H0 : p1 = p2
Alternative hypothesis (H1 ): H1 : p1 ̸= p2
Step 2: The sample proportions are calculated as follows:
$$\hat{p}_1 = \frac{65}{300} = 0.2167$$

$$\hat{p}_2 = \frac{35}{200} = 0.175$$
Step 3: The pooled proportion (P) is the combined proportion of
defectives from both samples:
$$P = \frac{x_1 + x_2}{n_1 + n_2} = \frac{65 + 35}{300 + 200} = \frac{100}{500} = 0.2$$
Step 4: The Z-test statistic is calculated using the formula:

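$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{P(1 - P)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Substituting the values:

$$z = \frac{0.2167 - 0.175}{\sqrt{0.2(1 - 0.2)\left(\dfrac{1}{300} + \dfrac{1}{200}\right)}} \approx 1.14$$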

For a two-tailed test at the 5% level of significance, the critical z-value is
±1.96. Since the calculated z-value (1.14) is less than the critical value
(1.96), we fail to reject the null hypothesis.

Conclusion: There is no significant difference in the proportions of
defectives in the two samples at the 5% level of significance.
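
Steps 2 to 4 can be collected into one small function. This is a minimal sketch, not code from the notes; the name `z_two_proportions` and its argument order are assumptions made for illustration.

```python
import math

def z_two_proportions(x1, n1, x2, n2):
    """Pooled two-proportion Z statistic under H0: p1 = p2."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled proportion: combined successes over combined sample size
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / standard_error

# Problem 1: 65 defectives in 300 units versus 35 defectives in 200 units
z_defectives = z_two_proportions(65, 300, 35, 200)
print(f"z = {z_defectives:.2f}")  # compare |z| with the critical value 1.96
```

Passing counts rather than rounded proportions avoids carrying rounding error into the test statistic.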
Figure: The shaded areas represent the rejection regions; the red dashed
line marks the calculated z-value.

Problem 2: In a large city A, 20% of a random sample of 900 school boys
had a slight physical defect. In another large city B, 18.5% of a random
sample of 1600 school boys had the same defect. At the 5% level of
significance, we want to test if the difference between the proportions is
significant.
Solution:
Step 1:
Null hypothesis (H0): p1 = p2
Alternative hypothesis (H1): p1 ≠ p2
Step 2: The sample proportions are:

$$\hat{p}_1 = 0.20 \quad \text{(City A)}$$

$$\hat{p}_2 = 0.185 \quad \text{(City B)}$$


Step 3: The pooled proportion (P) is calculated as:

$$P = \frac{x_1 + x_2}{n_1 + n_2} = \frac{0.20 \times 900 + 0.185 \times 1600}{900 + 1600} = \frac{476}{2500} \approx 0.19$$

Step 4: The Z-test statistic is calculated using the formula:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{P(1 - P)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Substituting the values:

$$z = \frac{0.20 - 0.185}{\sqrt{0.19(1 - 0.19)\left(\dfrac{1}{900} + \dfrac{1}{1600}\right)}} = \frac{0.015}{0.01635} \approx 0.92$$

For a two-tailed test at the 5% level of significance, the critical z-value is
±1.96. Since the calculated z-value (0.92) is less than the critical value
(1.96), we fail to reject the null hypothesis.

Conclusion: There is no significant difference in the proportions of boys
with physical defects in cities A and B at the 5% level of significance.
Figure: The shaded areas represent the rejection regions; the red dashed
line marks the calculated z-value.

Problem 3: Before an increase in excise duty on tea, 800 out of 1000
people were tea drinkers. After the increase, 800 people were tea drinkers
out of 1200 people sampled. At the 5% level of significance, we want to
test whether the difference in the proportion of tea drinkers before and
after the excise duty increase is significant.
Solution:
Step 1:
Null hypothesis (H0): p1 = p2
Alternative hypothesis (H1): p1 ≠ p2
Step 2: The sample proportions are calculated as follows:
$$\hat{p}_1 = \frac{800}{1000} = 0.80 \quad \text{(Before excise duty increase)}$$

$$\hat{p}_2 = \frac{800}{1200} = 0.6667 \quad \text{(After excise duty increase)}$$
Step 3: The pooled proportion (P) is the combined proportion of tea
drinkers from both samples:
$$P = \frac{x_1 + x_2}{n_1 + n_2} = \frac{800 + 800}{1000 + 1200} = \frac{1600}{2200} = 0.7273$$
Step 4: The Z-test statistic is calculated using the formula:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{P(1 - P)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Substituting the values:

$$z = \frac{0.80 - 0.6667}{\sqrt{0.7273(1 - 0.7273)\left(\dfrac{1}{1000} + \dfrac{1}{1200}\right)}} = \frac{0.1333}{0.01907} \approx 6.99$$

For a two-tailed test at the 5% level of significance, the critical z-values
are ±1.96. Since the calculated z-value (6.99) is much greater than the
critical value (1.96), we reject the null hypothesis.

Conclusion: There is a significant difference in the proportion of tea
drinkers before and after the excise duty increase at the 5% level of
significance.
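
The critical value ±1.96 used throughout these problems is the point of the standard normal distribution that leaves 2.5% of the area in each tail. The sketch below recovers it with `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper names `two_tailed_critical_z` and `reject_null` are hypothetical, chosen only for this illustration.

```python
from statistics import NormalDist

def two_tailed_critical_z(alpha):
    """Critical z such that P(|Z| > z) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def reject_null(z, alpha=0.05):
    """Two-tailed decision rule: reject H0 when |z| exceeds the critical value."""
    return abs(z) > two_tailed_critical_z(alpha)

z_crit_5 = two_tailed_critical_z(0.05)
z_crit_1 = two_tailed_critical_z(0.01)
print(f"5% level: {z_crit_5:.2f}, 1% level: {z_crit_1:.3f}")

# A statistic close to 7 falls far inside the rejection region,
# while a statistic such as 1.14 does not.
print(reject_null(7.0), reject_null(1.14))
```

Writing the rule in terms of `alpha` makes it easy to repeat a test at another significance level without looking up a table.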
Figure: The shaded areas represent the rejection regions; the red dashed
line marks the calculated z-value.

Assignment Questions
(1) In a sample of 600 men from a certain city, 450 are found smokers. In
another sample of 900 men from another city, 450 are smokers. Do the
data indicate that the cities are significantly different with respect to the
habit of smoking among men? Test at 5% significance level.
(Answer: z ≈ 9.68, using the pooled proportion P = 0.6.)
(2) A sample of 100 tyres is taken from a lot. The mean life of a tyre is
found to be 39350 kms with a SD of 3260. Can it be considered as the
true random sample from a population with a mean life of 40000 kms?
(Use 5% significance level).
(3) In two large populations there are 30% and 25% respectively of fair
haired people. Is this difference likely to be hidden in samples of 1200 and
900 respectively from the two populations?
(4) A stenographer claims that she can type at the rate of 120 words per
minute. Can we reject her claim on the basis of 100 trials in which she
demonstrates a mean of 116 words with a standard deviation of 15 words?
Use 5% level of significance.
Confidence Intervals for Means and Proportions

Dr. P. Rajendra

Professor, Dept. of Maths

CMRIT, Bengalore.

Dr. P. Rajendra (Professor, Dept. of Maths)Confidence Intervals for Means and Proportions CMRIT, Bengalore. 1 / 13
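(i) Confidence Interval for Mean:
If the population standard deviation σ is known, the confidence interval for
the population mean µ is given by:

$$\bar{x} \pm Z \times \frac{\sigma}{\sqrt{n}}$$

Where x̄ is the sample mean, Z is the z-value corresponding to the desired
confidence level (1.96 for 95%, 2.576 for 99%), σ is the population standard
deviation, and n is the sample size.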

(ii) Confidence Interval for Proportion:
The confidence interval for a population proportion p is given by:

$$\hat{p} \pm Z \times \sqrt{\frac{\hat{p}\,\hat{q}}{n}}$$

Where p̂ is the sample proportion, q̂ = 1 − p̂, Z is the z-value
corresponding to the desired confidence level, and n is the sample size.

(iii) Confidence Interval for Difference of Two Means:
For two independent samples, the confidence interval for the difference in
means µ1 − µ2 is given by:

$$(\bar{x}_1 - \bar{x}_2) \pm Z \times \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

Where x̄1 and x̄2 are the sample means, σ1 and σ2 are the population
standard deviations for the two groups, n1 and n2 are the sample sizes.

(iv) Confidence Interval for Difference of Two Proportions:
The confidence interval for the difference in proportions p1 − p2 is given by:

$$(\hat{p}_1 - \hat{p}_2) \pm Z \times \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

Where p̂1 and p̂2 are the sample proportions, Z is the Z-value
corresponding to the desired confidence level, n1 and n2 are the sample
sizes.
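
Formulas (iii) and (iv) have no worked problem in this part of the notes, so the sketch below applies them to data reused from the earlier hypothesis-testing problems (the two algorithms and the two batches of defective units). Reusing that data, the function names, and the default of Z = 1.96 are all choices made for this illustration.

```python
import math

Z_95 = 1.96  # z-value for a 95% confidence level

def ci_diff_means(mean1, mean2, sd1, sd2, n1, n2, z=Z_95):
    """Confidence interval for mu1 - mu2 with known standard deviations."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    diff = mean1 - mean2
    return diff - z * se, diff + z * se

def ci_diff_proportions(x1, n1, x2, n2, z=Z_95):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

low_m, high_m = ci_diff_means(85, 82, 5, 6, 50, 80)
low_p, high_p = ci_diff_proportions(65, 300, 35, 200)
print(f"difference of means:       ({low_m:.2f}, {high_m:.2f})")
print(f"difference of proportions: ({low_p:.3f}, {high_p:.3f})")
```

An interval that excludes 0 points the same way as rejecting H0 in the corresponding two-tailed test, and an interval that contains 0 points the same way as failing to reject it. Note that formula (iv) uses the unpooled standard error, whereas the Z-test for two proportions uses the pooled one.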
Problem 1: To know the mean weights of all 10-year-old boys in Delhi, a
sample of 225 was taken. The mean weight of the sample was found to be
67 pounds, with a standard deviation of 2 pounds (considered as the
population standard deviation). Find the 95% confidence interval for the
mean weight of the population.
Solution: Given:
Sample size n = 225, Sample mean x̄ = 67
Population standard deviation σ = 2, Confidence level = 95%
Standard error (SE) is calculated as:

$$SE = \frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{225}} = \frac{2}{15} = 0.1333$$

The 95% confidence interval is:

67 ± 1.96 × 0.1333 = (66.74, 67.26)

Thus, we are 95% confident that the true mean weight lies between 66.74
pounds and 67.26 pounds.
Problem 2: The heights of a random sample of 50 college students
showed a mean of 174.5 centimeters and a standard deviation of 6.9
centimeters. Construct a 99% confidence interval for the mean height of
all college students.
Solution: Given:
Sample size n = 50, Sample mean x̄ = 174.5 cm
Sample standard deviation s = 6.9 cm
Confidence level = 99%
Standard error (SE) is calculated as:

$$SE = \frac{s}{\sqrt{n}} = \frac{6.9}{\sqrt{50}} = 0.9758 \text{ cm}$$

The 99% confidence interval is:

174.5 ± 2.576 × 0.9758 = (171.99 cm, 177.01 cm)

Thus, we are 99% confident that the true mean height of college students
lies between 171.99 cm and 177.01 cm.
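
The interval can also be computed for any confidence level by deriving the z-value instead of reading it from a table. This is a minimal sketch assuming a large sample, so the normal distribution is used with the sample standard deviation; the function name `mean_confidence_interval` is illustrative.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, confidence):
    """Large-sample confidence interval for a population mean."""
    # Two-sided interval: put (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Problem 2: heights of 50 college students, 99% confidence
low_h, high_h = mean_confidence_interval(174.5, 6.9, 50, 0.99)
print(f"({low_h:.2f} cm, {high_h:.2f} cm)")

# Problem 1: weights of 225 boys, 95% confidence
low_w, high_w = mean_confidence_interval(67, 2, 225, 0.95)
print(f"({low_w:.2f}, {high_w:.2f}) pounds")
```

Raising the confidence level widens the interval, because a larger z-value is needed to cover more of the distribution.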
Figure: Normal Distribution Curve showing the 99% Confidence Interval

Problem 3: A random sample of 500 apples was taken from a large
consignment, and 65 were found to be bad. Estimate the proportion of
bad apples in the consignment as well as the standard error of the
estimate. Also, find the percentage of bad apples in the consignment.
Solution: Given:
Sample size n = 500, Number of bad apples in the sample x = 65
The estimated proportion p̂ is calculated as:
$$\hat{p} = \frac{x}{n} = \frac{65}{500} = 0.13$$
The standard error (SE) of the proportion is given by:

$$SE = \sqrt{\frac{\hat{p}\,\hat{q}}{n}} = \sqrt{\frac{0.13 \times 0.87}{500}} = 0.0150$$

where q̂ = 1 − p̂ = 0.87.
The percentage of bad apples is:
p̂ × 100 = 0.13 × 100 = 13%
Thus, the estimated proportion of bad apples in the consignment is 0.13,
with a standard error of 0.0150, and an estimated 13% of the apples are bad.
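
The estimate and its standard error follow directly from the two counts. A minimal sketch, with the helper name `proportion_with_se` introduced only for illustration:

```python
import math

def proportion_with_se(successes, n):
    """Sample proportion and its standard error sqrt(p * q / n)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

# Problem 3: 65 bad apples in a sample of 500
p_hat, se = proportion_with_se(65, 500)
print(f"p_hat = {p_hat:.2f}, SE = {se:.4f}, percentage = {100 * p_hat:.0f}%")
```

Because n appears under the square root, quadrupling the sample size only halves the standard error.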
Problem 4: In a locality of 18,000 families, a sample of 840 families was
selected at random. Of these 840 families, 206 families were found to have
a monthly income of Rs. 2500 or less. Estimate how many of the 18,000
families have a monthly income of Rs. 2500 or less. Also, find the limits
within which this estimate would lie.

Solution: Given:
Total number of families N = 18,000
Sample size n = 840
Number of families with income Rs. 2500 or less x = 206
The sample proportion p̂ is calculated as:
$$\hat{p} = \frac{206}{840} = 0.2452$$
Estimation of Families with income Rs. 2500 or less in the locality:

Estimate = p̂ × N = 0.2452 × 18,000 = 4,413.6 ≈ 4,414

Standard Error (SE) is given by:

$$SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{0.2452 \times (1 - 0.2452)}{840}} \approx 0.0148$$

Confidence Interval for 95% confidence level (Z = 1.96):

CI = p̂ ± 1.96 × SE = 0.2452 ± 1.96 × 0.0148 = 0.2452 ± 0.0290

CI = (0.216, 0.274)
Multiplying by the total population N:

Limits = (0.216 × 18,000, 0.274 × 18,000) = (3,888, 4,932)

Thus, the estimate is that between 3,888 and 4,932 families in the locality
have a monthly income of Rs. 2500 or less.
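
The same calculation can be written as one function that scales the interval for the proportion up to the whole locality. This is a sketch under the assumptions of the worked solution (95% level, Z = 1.96, no finite-population adjustment); the name `population_count_limits` is hypothetical.

```python
import math

def population_count_limits(x, n, population, z=1.96):
    """Point estimate and confidence limits for a count in the population."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    low_p, high_p = p_hat - z * se, p_hat + z * se
    # Scale the proportion and its limits to the population size
    return p_hat * population, low_p * population, high_p * population

# Problem 4: 206 of 840 sampled families, locality of 18,000 families
estimate, low, high = population_count_limits(206, 840, 18_000)
print(f"estimate = {estimate:.0f}, limits = ({low:.0f}, {high:.0f})")
```

Because the function keeps full precision rather than rounding the proportions to three decimals, its limits come out a few families away from the hand-calculated values; the difference is only a rounding effect.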

Problem 5: The mean and standard deviation of the maximum loads
supported by 60 cables are 11.09 tonnes and 0.73 tonnes, respectively.
Find: (i) 95% confidence limits for the mean of the maximum loads of all
cables produced by the company. (ii) 99% confidence limits for the mean
of the maximum loads of all cables produced by the company.

Solution: Given:
Sample size n = 60
Sample mean x̄ = 11.09 tonnes
Standard deviation s = 0.73 tonnes
Standard Error (SE):

$$SE = \frac{s}{\sqrt{n}} = \frac{0.73}{\sqrt{60}} = 0.0942 \text{ tonnes}$$

(i) 95% Confidence Limits: For 95% confidence level, the Z-value is
1.96. The confidence limits are:

x̄ ± Z × SE = 11.09 ± 1.96 × 0.0942

= 11.09 ± 0.1846
= (10.9054 tonnes, 11.2746 tonnes)
Thus, the 95% confidence limits for the mean are (10.91, 11.27) tonnes.
(ii) 99% Confidence Limits: For 99% confidence level, the Z-value is
2.576. The confidence limits are:

x̄ ± Z × SE = 11.09 ± 2.576 × 0.0942

= 11.09 ± 0.2427
= (10.8473 tonnes, 11.3327 tonnes)
Thus, the 99% confidence limits for the mean are (10.85, 11.33) tonnes.

Assignment Questions

(1) A survey was conducted in a slum locality of 2000 families by selecting
a sample of size 800. It was revealed that 180 families were illiterate.
Find the probable limits of the number of illiterate families in the
population of 2000.
(2) A sample of 900 days was taken in a coastal town and it was found
that on 100 days the weather was very hot. Obtain the probable limits of
the percentage of very hot weather.
(3) The mean and S.D of the maximum loads supported by 60 cables are
11.09 tonnes and 0.73 tonnes respectively. Find (a) 95% (b) 99%
confidence limits for the mean of the maximum loads of all cables
produced by the company.
(4) 400 children are chosen in an industrial town and 150 are found to be
underweight. Assuming the conditions of simple sampling, estimate the
percentage of children who are underweight in the industrial town and
assign limits within which the percentage probably lies.

