unit 2 part - 2

Uploaded by

shivamskashyap
Copyright
© © All Rights Reserved
Estimation

 It is often of interest to learn about the characteristics of a large
group of elements such as individuals, households, buildings,
products, parts, customers, and so on. All the elements of interest in
a particular study form the population. Because of time, cost, and
other considerations, data often cannot be collected from every
element of the population. In such cases, a subset of the population,
called a sample, is used to provide the data.
 Data from the sample are then used to develop estimates of the
characteristics of the larger population. The procedure of making a
judgment or decision about a population parameter from sample data is
referred to as statistical estimation, or simply estimation.
 There are two types of estimation, namely point estimation and
interval estimation.

School of Computer Engineering


Point Estimation

The objective of point estimation is to obtain a single number from the
sample which will represent the unknown value of the population
parameter. Population parameters such as the population mean, population
variance, etc., are estimated from the corresponding sample statistics
such as the sample mean, sample variance, etc.

Point Estimation cont…

 Most often, measuring a parameter directly on a large population is
unrealistic. For example, when finding the average age of kids attending
kindergarten, it is impractical to collect the exact age of every kindergarten kid
in the world. Instead, a point estimator is used to make an estimate of the
population parameter.
 It is desirable for a point estimator to be
 Consistent: the larger the sample size, the more accurate the estimate. For
the point estimator to be consistent, the estimate should move toward
the true parameter value as the sample size grows.
 Unbiased: the expected value of the estimates over many samples equals
the corresponding population parameter, e.g., the sample mean is an
unbiased estimator of the population mean.
 Most efficient: the most efficient point estimator is the one with the
smallest variance. Generally, the efficiency of the estimator depends on the
distribution of the population. For example, in a normal distribution, the
mean is more efficient than the median, but the same need not
apply in asymmetrical distributions.

How to find Point Estimate?

Point Estimate                      Population Parameter
x̄ (sample mean)                     µ (population mean)
S (sample standard deviation)       σ (population standard deviation)
S² (sample variance)                σ² (population variance)
 Example 1: A sample of 40 packages of rice has a mean weight of 5.7 kg with a
standard deviation of 0.4 kg. Find the best point estimate of the population mean.
Solution: The sample mean (i.e., 5.7 kg) is the best point estimate
of the population mean.

 Example 2: Calculate the best point estimate of the population mean from the
data 15.22, 14.34, 18.12, 12.61, 15.61, 14.22, 19.41, 12.22, 17.12, 14.22, 12.91
and 18.12.
Solution: The sample mean (i.e., 184.12 / 12 ≈ 15.34) is the best point estimate
of the population mean.
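
The arithmetic in Example 2 can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean

# Data from Example 2 (12 observations)
data = [15.22, 14.34, 18.12, 12.61, 15.61, 14.22,
        19.41, 12.22, 17.12, 14.22, 12.91, 18.12]

# The sample mean is the point estimate of the population mean
point_estimate = mean(data)
print(round(point_estimate, 2))  # 15.34
```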

Interval Estimation

 An interval is a range of values. Let’s say we wanted to find out what
proportion of senior citizens smoke cigarettes. We can’t survey every senior
citizen on the planet (due to time constraints and finances), so we take a sample
of 1000 senior citizens and find that 10% of them smoke cigarettes. Although we
have only taken a sample, we can use that figure to estimate that “about” 10% of the
whole population smoke cigarettes. In reality, it’s unlikely to be exactly 10% (as
we only sampled a small percentage of people), but it’s probably somewhere
around there, perhaps between 5 and 15%. That “somewhere between 5 and
15%” is an interval estimate.
 There’s nothing wrong with making a good guess at an interval, but sometimes
we want to be very confident that our results are sound and repeatable.
“Repeatable” means that if we do the whole thing over again, we’ll get similar
results. One way to express this is a confidence level: the percentage of such
repeated samples whose interval would be expected to contain the true
population value. For example, we might say we are 99% confident
(i.e., we have a 99% confidence level) that between 5 and 15% of senior
citizens smoke cigarettes. When the interval estimate has a confidence level
attached, it’s called a confidence interval.

Confidence Interval Estimation

 The lower bound (in the example, 5%) is called a lower confidence limit and
the upper bound (in the example, 15%) is called an upper confidence limit.

 The bigger the sample size, the narrower the confidence interval will be.
 How to determine the lower and upper confidence limits?
Confidence limits = x̄ ± Z * (σ / √n)
where x̄ is the sample mean, σ is the standard deviation, n is the sample size,
and Z is the Z-score, a measure of how many standard deviations a value lies
below or above the population mean.

 Z-scores for commonly used confidence levels are as follows (refer to the
Appendix for further details):
50% → 0.674
80% → 1.282
90% → 1.645
95% → 1.96
98% → 2.326
99% → 2.576

Confidence Interval Estimation cont…

Suppose a student measuring the boiling temperature of a certain liquid observes the
readings (in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6 different
samples of the liquid. What is the interval estimation for the population mean at a 95%
confidence level?
Solution:
The sample mean of the boiling temperatures is 101.82, with a sample standard
deviation of 0.984 (used here as σ). The confidence level is 95% and the sample size
is 6. The Z-score for a 95% confidence level is 1.96.
Standard error (SE) = σ / √n = 0.984 / 2.45 = 0.402 → tells how accurately the sample
mean reflects the population mean (measures the precision of an estimate of a
population mean)
Margin of error = Z * (σ / √n) = 1.96 * (0.984 / 2.45) = 0.787 → the half-width of the
confidence interval for the population mean
µ = 101.82 ± 1.96 * (0.984 / √6) = 101.82 ± 0.787
So, upper confidence limit = 102.61 and lower confidence limit = 101.03
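
The boiling-temperature calculation can be reproduced as follows. This is a minimal sketch under the same assumption the slide makes, namely that the Z-score 1.96 is used together with the sample standard deviation; the function name `z_interval` is hypothetical. Because the code does not round intermediate values, its last digit may differ slightly from a hand calculation.

```python
from math import sqrt
from statistics import mean, stdev

def z_interval(data, z=1.96):
    """Return (lower, upper, standard_error, margin) for the mean of data."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)            # sample standard deviation (n - 1 in the denominator)
    se = s / sqrt(n)           # standard error of the mean
    margin = z * se            # half-width of the interval
    return x_bar - margin, x_bar + margin, se, margin

readings = [102.5, 101.7, 103.1, 100.9, 100.5, 102.2]
lower, upper, se, margin = z_interval(readings)
print(f"SE = {se:.3f}, margin of error = {margin:.3f}")
print(f"95% interval: ({lower:.2f}, {upper:.2f})")
```

With only 6 observations and an unknown population standard deviation, an interval based on the t-distribution would usually be somewhat wider than this Z-based one.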
Problem statement: The test scores of a sample of students in data analytics after the
end semester examination are 55, 65, 80, 95, 90, 90, 95, 75, 75, 85, 90 and 80. Calculate
the confidence limits and the margin of error. Consider a 95% confidence level.

Sampling Distributions

Recap: A sampling distribution represents the probability of the varied outcomes
when a study is conducted.
There are 2 types of sampling distributions, i.e.,
 Sampling distribution of the mean – [Discussed earlier]
 T-distribution

T-Distribution

 The t-distribution describes the standardized distances of sample
means to the population mean when the population standard
deviation is not known, and the observations come from a normally
distributed population.
 The t-distribution is a symmetrical, bell-shaped distribution similar to the
standard normal curve, but flatter and shorter, with heavier tails.
 The shape of the t-distribution depends on the degrees of freedom (df),
which refers to the maximum number of logically independent values (values
that have the freedom to vary) in the sample.

Degree of freedom

The easiest way to understand degrees of freedom conceptually is through several
examples.
 Consider a data sample consisting of five positive integers. The values of the
five integers must have an average of six. If four of the items within the data set
are {3, 8, 5, and 4}, the fifth number must be 10. Because the first four numbers
can be chosen at random, the degrees of freedom is four.
 Consider a data sample consisting of one integer. That integer must be odd.
Because there are constraints on the single item within the data set, the
degrees of freedom is zero.
 The formula to determine degrees of freedom is df = N – 1 where N is sample
size.
 For example, imagine a task of selecting 10 baseball players whose batting
averages must average to .250. The total number of players that will make up
our data set is the sample size, so N = 10. In this example, 9 (10 - 1) baseball
players can theoretically be picked at random, with the 10th baseball player
having to have a specific batting average to adhere to the .250 batting average
constraint.
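
The first example (five integers with a required mean of six) can be sketched in code to show why only N - 1 values are free. The function name below is hypothetical and chosen only for illustration.

```python
def last_value_for_mean(free_values, target_mean):
    """Given N - 1 freely chosen values, return the value the last one must take."""
    n = len(free_values) + 1              # total sample size N
    return target_mean * n - sum(free_values)

free_values = [3, 8, 5, 4]                # chosen freely
fifth = last_value_for_mean(free_values, target_mean=6)
degrees_of_freedom = len(free_values)     # N - 1 = 4

print(fifth)               # 10
print(degrees_of_freedom)  # 4
```

Once the mean is fixed, the last value is fully determined by the others, so it contributes no freedom.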

T-Distribution cont…

 As the df increases, the t-distribution will get closer and closer to matching
the standard normal distribution.
 The value of the t-statistic is: t = [ x̄ - μ ] / [ s / √n ] where,
t = t score,
x̄ = sample mean,
μ = population mean,
s = standard deviation of the sample,
n = sample size
Note: A t-score is equivalent to the number of standard deviations away
from the mean of the t-distribution.
 A law school claims its graduates earn an average of $300 per hour. A
sample of 15 graduates is selected and found to have a mean of $280 per hour
with a sample standard deviation of $50. Assuming the school’s claim is
true, what is the t-score?
Solution: t = (280 – 300) / (50 / √15) = -20 / 12.909945 = -1.549.
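
The same t-score can be computed directly from the formula. This is a small illustrative sketch; the function and parameter names are assumptions, not part of the slides.

```python
from math import sqrt

def t_score(sample_mean, population_mean, sample_sd, n):
    """t = (x̄ - μ) / (s / √n)"""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - population_mean) / standard_error

t = t_score(sample_mean=280, population_mean=300, sample_sd=50, n=15)
df = 15 - 1                # degrees of freedom for this sample

print(round(t, 3))  # -1.549
print(df)           # 14
```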

T-Distribution cont…

 Student’s t distribution is used when
 The sample size is 30 or less.
 The population standard deviation (σ) is unknown.
 The population distribution is unimodal and approximately symmetric
(close to normal), not strongly skewed.
 Note:
The t-score represents the number of standard errors by which the sample
mean differs from the population mean. For example, if a t-score is 2.5, the
sample mean is 2.5 standard errors above the population mean. If a t-score
is −2.5, the sample mean is 2.5 standard errors below the population mean.

Inferential Statistics

 Statistics can be classified into two different categories, i.e., descriptive
statistics and inferential statistics.
 Descriptive statistics summarize the features of the dataset, whereas
inferential statistics help to draw conclusions from the data.
 Inferential statistics is the process of using a sample to infer the properties
of a population; it allows findings from the sample to be generalized to the
population.
 In general, to infer means to draw a reasoned conclusion from evidence. So,
statistical inference means drawing conclusions about the population from
sample data.
 Let’s look at a real flu vaccine study for an example of making a statistical
inference. The scientists for this study want to evaluate whether a flu
vaccine effectively reduces flu cases in the general population. However, the
general population is much too large to include in their study, so they must
use a representative sample to make a statistical inference about the
vaccine’s effectiveness.
 Hypothesis testing is one type of inferential statistics.

Hypothesis

 A hypothesis is defined as a formal statement that explains the
relationship between two or more variables of the specified population,
i.e., it includes components like variables, population and the relation
between the variables.
 Hypothesis example:
 Two variables - if you eat more vegetables, you will lose weight faster.
Here, eating more vegetables is an independent variable, while losing
weight is the dependent variable.
 Two or more dependent variables and two or more independent
variables - Eating more vegetables and fruits leads to weight loss,
glowing skin, and reduces the risk of many diseases such as heart
disease.
 Consumption of sugary drinks every day leads to obesity
 If a person gets 7 hours of sleep, then he will feel less fatigue than if he
sleeps less.

Hypothesis Testing

 In today’s data-driven world, decisions are based on data all the time.
Hypothesis testing plays a crucial role in that process, whether in making
business decisions, in the health sector, academia, or in quality improvement.
Without hypotheses and hypothesis tests, you risk drawing the wrong conclusions
and making bad decisions.
 Hypothesis testing is a type of statistical analysis in which an assumption
about a population parameter is put to the test. It is used to assess the
relationship between variables.
 Examples:
 A faculty member assumes that 60% of his students come from
higher-middle-class families.
 A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for
diabetic patients.
 It involves setting up a null hypothesis and an alternative hypothesis. These
two hypotheses will always be mutually exclusive. This means that if the null
hypothesis is true then the alternative hypothesis is false and vice versa.

Null Hypothesis and Alternate Hypothesis

 The null hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected.
 Example:
 Smokers are no more susceptible to heart disease than nonsmokers.
 The new drug has a cure rate no higher than other drugs on the market.
 H0 is the symbol for it, and it is pronounced H-naught.
 Hypothesis testing is used to conclude if the null hypothesis can be rejected or not.
Suppose an experiment is conducted to check if girls are shorter than boys at the
age of 5. The null hypothesis will say that they are of the same height.
 The alternate hypothesis is the logical opposite of the null hypothesis. The
alternative hypothesis is accepted only when the null hypothesis is rejected.
 It indicates that there is a statistically significant difference or effect,
and it is denoted as Ha.
 For the above-mentioned example, the alternative hypothesis would be that girls
are shorter than boys at the age of 5.
 The null hypothesis is usually the current thinking, or status quo. The alternative
hypothesis is usually the hypothesis to be proved. The burden of proof is on the
alternative hypothesis.

Null Hypothesis and Alternate Hypothesis cont…

 A sanitizer manufacturer claims that its product kills 95 percent of germs
on average. To put this company's claim to the test, create a null and
alternate hypothesis.
 H0 (Null Hypothesis): Average = 95%.
 Alternative Hypothesis (Ha): The average is less than 95%.
 Further research questions with their null and alternate hypotheses:
 Research question: Does tooth flossing affect the number of cavities?
H0: Tooth flossing has no effect on the number of cavities.
Ha: Tooth flossing has an effect on the number of cavities.
 Research question: Does the amount of text highlighted in the textbook
affect exam scores?
H0: The amount of text highlighted in the textbook has no effect on exam scores.
Ha: The amount of text highlighted in the textbook has an effect on exam scores.
 Research question: Does daily meditation decrease the incidence of depression?
H0: Daily meditation does not decrease the incidence of depression.
Ha: Daily meditation decreases the incidence of depression.

Null Hypothesis and Alternate Hypothesis cont…

 How to write null and alternate hypotheses - The only things to know are the
dependent variable (DV) and the independent variable (IV). To write the null
hypothesis and the alternative hypothesis, fill in the following sentences with
the variables, i.e., does the independent variable affect the dependent variable?
 Null hypothesis (H0): IV does not affect DV.
 Alternative Hypothesis (Ha): IV affects DV.
 Characteristics of a Hypothesis
 It has to be clear and accurate in order to look reliable.
 It has to be specific.
 There should be scope for further investigation and experiments.
 It should be explained in simple language while retaining its
significance.
 IVs and DVs must be included with the relationship between them.

Tails of distributions

The tails of a distribution are the appendages on the sides of a distribution.
Although the term can apply to a set of data, it makes more sense if that data is
graphed, because the tails become easily visible.

[Figure: a distribution curve. The lower tail contains the lower values in a
distribution; the upper tail contains the upper values in a distribution.]

Hypothesis Testing cont…

 The purpose of statistical inference is to draw conclusions about a
population on the basis of data obtained from a sample of that population.
 Hypothesis testing is the process used to evaluate the strength of evidence
from the sample and provides a framework for making determinations
related to the population, i.e, it provides a method for understanding how
reliably one can extrapolate observed findings in a sample under study to
the larger population from which the sample was drawn.
 The investigator formulates a specific hypothesis, evaluates data from the
sample, and uses these data to decide whether they support the specific
hypothesis.
 The first step in testing hypotheses is the transformation of the research
question into a null hypothesis and an alternative hypothesis.
The hypothesis test is then carried out.
 In hypothesis testing, a one-tailed test and a two-tailed test are alternative
ways of computing the statistical significance of a parameter inferred from
a data set, in terms of a test statistic.

One-Tailed Hypothesis Testing

 A one-tailed test is based on a unidirectional hypothesis where the area of
rejection is on only one side of the sampling distribution.
 It determines whether a particular population parameter is larger or
smaller than the predefined value. It uses a single critical value to
test the data.

 Example: Effect of students' participation in a coding competition on their
fear level. H0: Participation in a coding competition has no significant effect
on students' fear level. The main intention is to check whether the fear level
decreases when students participate in a coding competition.

Two-Tailed Hypothesis Testing

 A two-tailed test is also called a non-directional test. It is used to check
whether the sample is either greater than or less than a range of values, so the
area of rejection lies on both sides of the sampling distribution. It is used
for null hypothesis testing.

 Example: Effect of a newly passed bill on the loans of farmers. H0: The new
bill has no significant effect on the loans of farmers. The main
intention is to check whether the new bill affects the loans of farmers in
either direction, i.e., increases or decreases them.
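
The practical difference between the two kinds of test shows up in the critical value. The sketch below assumes a z-based test at a significance level of 0.05 and a made-up test statistic of 1.80; both the value and the function names are hypothetical.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()

# One-tailed: all of alpha sits in a single tail
one_tailed_critical = std_normal.inv_cdf(1 - alpha)
# Two-tailed: alpha is split equally between both tails
two_tailed_critical = std_normal.inv_cdf(1 - alpha / 2)

def reject_upper_tail(z):
    """One-tailed test whose rejection region is the upper tail."""
    return z > one_tailed_critical

def reject_two_tailed(z):
    """Two-tailed test: reject when z is extreme in either direction."""
    return abs(z) > two_tailed_critical

z = 1.80  # hypothetical test statistic
print(round(one_tailed_critical, 3), reject_upper_tail(z))   # 1.645 True
print(round(two_tailed_critical, 3), reject_two_tailed(z))   # 1.96 False
```

The same statistic can therefore be significant in a one-tailed test and not significant in a two-tailed test, which is why the direction of the hypothesis has to be chosen before looking at the data.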

Types of Error

 Regardless of whether the investigator decides to accept or reject the null
hypothesis, it might be the wrong decision.
 The investigator might incorrectly reject the null hypothesis when it is true,
and might incorrectly accept the null hypothesis when it is false.
 In the tradition of hypothesis testing, these two types of errors have
acquired the names type I and type II errors.
 In general, a type I error occurs when one incorrectly rejects a null
hypothesis that is true. On the other hand, a type II error occurs when
one incorrectly accepts (fails to reject) a null hypothesis that is false.

Decision              Truth: H0 is true     Truth: Ha is true
Reject H0             Type I error          No error
Do not reject H0      No error              Type II error
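
A small simulation can make the type I error concrete: if samples are drawn from a population for which H0 really is true, a two-tailed z-test at the 5% level should still reject H0 in roughly 5% of the samples. The population values (mean 100, standard deviation 15), the sample size and the number of trials below are assumptions chosen only for illustration.

```python
import random
from math import sqrt
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

MU0, SIGMA, N = 100, 15, 30      # assumed population under H0 and sample size
Z_CRITICAL = 1.96                # two-tailed critical value for alpha = 0.05
TRIALS = 2000

rejections = 0
for _ in range(TRIALS):
    # H0 is true here: the sample really comes from a population with mean MU0
    sample = [random.gauss(MU0, SIGMA) for _ in range(N)]
    z = (mean(sample) - MU0) / (SIGMA / sqrt(N))
    if abs(z) > Z_CRITICAL:
        rejections += 1          # rejecting a true H0 is a type I error

type_i_rate = rejections / TRIALS
print(f"Observed type I error rate: {type_i_rate:.3f}")
```

The observed rate fluctuates around 0.05 from run to run; it is the significance level, not a flaw in the data, that sets how often a true null hypothesis gets rejected.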

Rejection Region

 The question, then, is how strong the evidence in favor of the alternative
hypothesis must be to reject the null hypothesis.
 This is done by means of a p-value. The p-value is the probability of seeing
a random sample at least as extreme as the observed sample, given that the
null hypothesis is true. The smaller the p-value, the more evidence there is
in favor of the alternative hypothesis.
 The p-values are expressed as decimals and can be converted into
percentages. For example, a p-value of 0.0237 is 2.37%, which means that, if the
null hypothesis were true, there would be a 2.37% chance of observing results
at least as extreme as those in the sample.
 In the hypothesis test, if the value is:
 A small p value (<=0.05), reject the null hypothesis.
 A large p value (>0.05), do not reject the null hypothesis
 The p-values are usually calculated using p-value tables, or calculated
automatically using statistical software like R, SPSS, Python etc.
 Note: Another way to decide the rejection region is with the z-score; it is
applicable when the population standard deviation is known or the sample size
is at least 30.
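
As a rough sketch of how software arrives at a p-value, the snippet below converts a z-score into a two-tailed p-value with the standard normal distribution and applies the 0.05 rule stated above. The z-score of 2.5 and the function names are hypothetical.

```python
from statistics import NormalDist

def two_tailed_p_value(z):
    """Probability of a result at least as extreme as z, in either direction, under H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Apply the decision rule: small p-value -> reject H0."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

p = two_tailed_p_value(2.5)   # hypothetical z-score
print(round(p, 4), decide(p))
```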

Hypothesis Testing Example

 An investor says that the performance of their investment portfolio is
equivalent to that of the Standard & Poor’s (S&P) 500 Index. The person
performs a two-tailed test to determine this.
 The null hypothesis here says that the portfolio’s returns are equivalent to
the returns of S&P 500, while the alternative hypothesis says that the
returns of the portfolio and the returns of the S&P 500 are not equivalent.
 The p-value hypothesis test gives a measure of how much evidence is
present to reject the null hypothesis. The smaller the p value, the higher the
evidence against null hypothesis.
 Therefore, if the investor gets a p value of .001, it indicates strong evidence
against null hypothesis. So he confidently deduces that the portfolio’s
returns and the S&P 500’s returns are not equivalent.

Hypothesis Testing Numerical

Problem Statement: In the population, the average IQ is 100 with a standard
deviation of 15. A team of scientists wants to test a new medication to see if it has
either a positive or negative effect on intelligence, or no effect at all. A sample of
30 participants who have taken the medication has a mean IQ of 140. Did the
medication affect intelligence?
Solution:
Step 1: Set up the null and alternate hypothesis
H0: medication does not affect intelligence.
Ha: medication affects intelligence.
Step 2: Determine the type of test to use
Since the population standard deviation is known and the sample size is 30, the
z-test is used.
Step 3: Calculate the test statistic z using the formula
z = [ x̄n - µ0 ] / [ σ / √n ]
where x̄n is the mean of the sample, µ0 is the population mean under the null
hypothesis (i.e., the mean to be tested), σ is the population standard deviation,
and n is the sample size.

Hypothesis Testing Numerical cont…

Using the data given in the problem, we have the following:
μ0 = 100, σ = 15, n = 30, x̄n = 140
Plugging the values into the formula:
z = [ 140 - 100 ] / [ 15 / √30 ] = 40 / 2.739 = 14.606
Step 4: Look up the critical value of z from the statistical
table (the table is predefined and should be referred to)
From the table, the critical value is 1.96 for a two-tailed test at the 95%
confidence level (α = 0.05).
Step 5: Draw a conclusion
In this case the calculated test statistic z is greater than the critical
value obtained from the statistical table (i.e., 14.606 > 1.96). Therefore the null
hypothesis is rejected in favor of the alternative hypothesis.

This means that the medication administered affects intelligence.
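
The five steps of this numerical example can be sketched in a few lines. This is only an illustration of the z-test described above; the variable names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, x_bar = 100, 15, 30, 140   # values from the problem statement
alpha = 0.05                              # 95% confidence level

# Step 3: test statistic
z = (x_bar - mu0) / (sigma / sqrt(n))

# Step 4: two-tailed critical value (the 1.96 read from the table)
critical = NormalDist().inv_cdf(1 - alpha / 2)

# Step 5: conclusion
reject_h0 = abs(z) > critical

print(round(z, 3))          # 14.606
print(round(critical, 2))   # 1.96
print(reject_h0)            # True
```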

Chi-square test for independence

 A chi-square test of independence tests whether two categorical
variables are related to each other or not.
 Example 1: we have a list of movie genres; this is the first variable. The
second variable is whether or not the patrons of those genres bought
snacks at the theater. The idea (or null hypothesis) is that the type of movie
and whether or not people bought snacks are unrelated. The owner of the
movie theater wants to estimate how many snacks to buy. If movie type and
snack purchases are unrelated, estimating will be simpler than if the movie
types impact snack sales.
 Example 2: a veterinary clinic has a list of dog breeds they see as patients.
The second variable is whether owners feed dry food, canned food or a
mixture. The idea (or null hypothesis) is that the dog breed and types of
food are unrelated. If this is true, then the clinic can order food based only
on the total number of dogs, without consideration for the breeds.

Chi-square Test for Independence Example

 Let’s take a closer look at the movie snacks example. Suppose we collect
data for 600 people at our theater. For each person, we know the type of
movie they saw and whether or not they bought snacks.
 For a valid Chi-square test, the following conditions must be satisfied:
1. The data values are a simple random sample from the population of
interest.
2. There are two categorical or nominal variables.
3. For each combination of the levels of the two variables, the expected
count is at least five. When the expected count is below five for any one
combination, the test results are not reliable. To confirm this, we need
to know the total counts for each type of movie and the total counts for
whether snacks were bought or not. For now, we assume we meet this
requirement and will check it later.

Chi-square Test for Independence Example cont…

 The data summarized in a contingency table is as follows:
Type of movie    Snacks    No snacks
Action           50        75
Comedy           125       175
Family           90        30
Horror           45        10
 Before we go any further, let’s check the assumption of five expected values
in each category. The data has more than five counts in each combination of
Movie Type and Snacks.
 To find expected counts for each Movie-Snack combination, we first need
the row and column totals, which are shown below:
Type of movie Snacks No snacks Row Totals
Action 50 75 50 + 75 = 125
Comedy 125 175 125 + 175 = 300
Family 90 30 90 + 30 = 120
Horror 45 10 45 + 10 = 55
Column Totals 50+125+90+45 = 310 75+175+30+10 = 290 Grand Total = 600

Chi-square Test for Independence Example cont…

 The expected counts for each Movie-Snack combination are based on the row and
column totals. We multiply the row total by the column total and then divide by the
grand total. This gives us the expected count for each cell in the table.
 For example, for the Action-Snacks cell: (125 * 310) / 600 = 64.58 ≈ 65. If there is
no relationship between movie type and snack purchasing, we would expect about 65
people to have watched an action film with snacks.
 For the Action-No Snacks cell: (125 * 290) / 600 = 60.42 ≈ 60. The expected counts
for the other cells are computed similarly.
 The expected count (rounded to the nearest whole number) appears beneath the
actual count.
Type of movie    Snacks                 No snacks              Row Totals
Action           50                     75                     125
                 125*310/600 ≈ 65       125*290/600 ≈ 60
Comedy           125                    175                    300
                 300*310/600 = 155      300*290/600 = 145
Family           90                     30                     120
                 120*310/600 = 62       120*290/600 = 58
Horror           45                     10                     55
                 55*310/600 ≈ 28        55*290/600 ≈ 27
Column Totals    310                    290                    Grand Total = 600
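
The rule "row total times column total divided by grand total" can be sketched as follows. The dictionary layout and variable names are illustrative assumptions; the code keeps the unrounded expected counts, whereas the table rounds them to whole numbers.

```python
# Observed counts from the contingency table: (snacks, no snacks)
observed = {
    "Action": (50, 75),
    "Comedy": (125, 175),
    "Family": (90, 30),
    "Horror": (45, 10),
}

row_totals = {movie: sum(counts) for movie, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(row_totals.values())

# Expected count for each cell = row total * column total / grand total
expected = {
    movie: tuple(row_totals[movie] * col_totals[j] / grand_total for j in range(2))
    for movie in observed
}

for movie, (snacks, no_snacks) in expected.items():
    print(f"{movie}: {snacks:.2f} {no_snacks:.2f}")
```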

Chi-square Test for Independence Example cont…

 All of the expected counts for our data are larger than five, so we meet the
requirement for applying the independence test.
 If we look at each of the cells, we can see that some expected counts are close to the
actual counts but most are not.
 If there is no relationship between the movie type and snack purchases, the actual
and expected counts will be similar. If there is a relationship, the actual and
expected counts will be different.
Performing the Chi-square Test
 The basic idea in calculating the test statistic is to compare actual and expected
values, given the row and column totals that we have in the data.
 First, we calculate the difference between actual and expected for each
Movie-Snacks combination.
 Next, we square that difference. Squaring gives the same importance to
combinations with fewer actual values than expected and combinations with more
actual values than expected.
 Next, we divide by the expected value for the combination. We add up these values
for each Movie-Snacks combination. This gives the test statistic.

Chi-square Test for Independence Example cont…

Type of movie Snacks No snacks Row Totals


Action Actual: 50 Actual: 75 125
Expected: 65 Expected: 60
Difference: 50 – 65 = -15 Difference: 75 – 60 = 15
Squared Difference = 225 Squared Difference = 225
Divide by Expected: 225/65 = 3.46 Divide by Expected: 225/60 = 3.75
Comedy Actual: 125 Actual: 175 300
Expected:155 Expected: 145
Difference: 125– 155 = -30 Difference: 175– 145 = 30
Squared Difference = 900 Squared Difference = 900
Divide by Expected: 900/155 = 5.81 Divide by Expected: 900/145 = 6.21
Family Actual: 90 Actual: 30 120
Expected:62 Expected: 58
Difference: 90 – 62 = 28 Difference: 30 – 58 = -28
Squared Difference = 784 Squared Difference = 784
Divide by Expected: 784/62 = 12.65 Divide by Expected: 784/58 = 13.52
Horror Actual: 45 Actual: 10 55
Expected: 28 Expected: 27
Difference: 45 – 28 = 17 Difference: 10 – 27 = -17
Squared Difference = 289 Squared Difference = 289
Divide by Expected: 289/28 = 10.32 Divide by Expected: 289/27 = 10.70
Column Totals 310 290 Grand Total = 600

Chi-square Test for Independence Example cont…

 Lastly, to get our test statistic, we add the numbers in the final row for each
cell: 3.46 + 3.75 + 5.81 + 6.21 + 12.65 + 13.52 + 10.32 + 10.70 = 66.42. (These
terms use the rounded expected counts; with unrounded expected counts the
statistic is about 65.0, and the conclusion is the same.)
 Now, we need to find the critical value from the Chi-square distribution based
on degrees of freedom and significance level. This is the value that the test
statistic would exceed only with probability α if the two variables were
independent.
 The degrees of freedom depend on how many rows and how many columns
we have. The degrees of freedom (df) are calculated as df=(r−1)×(c−1) where
r is the number of rows, and c is the number of columns in the contingency
table. From the example, r is 4 and c is 2. Hence, df = (4−1)×(2−1)=3×1=3.
 The Chi-square value with α = 0.05 (it is given and represents the probability
of rejecting the null hypothesis when it is true) and three degrees of freedom
is 7.815. Note: This value of 7.815 is read from the Chi-square
distribution table. Refer to the Appendix for further details.
 We compare the value of our test statistic (66.42) to the Chi-square value.
Since 66.42 > 7.815, we reject the idea that movie type and snack purchases
are independent.
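
The whole test can be sketched end to end as below. The code works with unrounded expected counts, so its statistic differs slightly from a hand calculation that rounds the expected counts to whole numbers. The critical value 7.815 is taken from the slide rather than computed, and the variable names are illustrative.

```python
# Observed counts: rows are movie types, columns are (snacks, no snacks)
observed = [
    [50, 75],    # Action
    [125, 175],  # Comedy
    [90, 30],    # Family
    [45, 10],    # Horror
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (actual - expected)^2 / expected over every cell
chi_square = 0.0
for i, row in enumerate(observed):
    for j, actual in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (actual - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_VALUE = 7.815   # alpha = 0.05, df = 3, as read from the table

reject_independence = chi_square > CRITICAL_VALUE
print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {reject_independence}")
```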

Chi-square Test for Independence Example cont…

 Therefore, we conclude that there is some relationship between movie type
and snack purchases.
 As a result, the owner of the movie theater cannot estimate how many snacks
to buy regardless of the type of movies being shown. Instead, the owner
must think about the type of movies being shown when estimating snack
purchases.
 It's important to note that we cannot conclude that the type of movie
causes a snack purchase. The independence test tells us only whether there
is a relationship or not; it does not tell that one variable causes the other.
Statistical details
 The null hypothesis is that the type of movie and snack purchases are
independent. It is written as: H0:Movie Type and Snack purchases are
independent
 The alternative hypothesis is the opposite i.e., Ha: Movie Type and Snack
purchases are not independent.

Appendix

Z-values for confidence interval


Confidence Level Z value
0.70 1.04
0.75 1.15
0.80 1.28
0.85 1.44
0.90 1.64
0.92 1.75
0.95 1.96
0.96 2.05
0.98 2.33
0.99 2.58
0.50 0.674


Chi-square Distribution Table