Hypothesis Testing

This document provides an overview of hypothesis testing using Python. It discusses different types of hypothesis tests including t-tests, ANOVA tests, chi-square tests, and Pearson correlation. It provides examples of one-sample t-tests to compare the average weight of a class to a hypothesized mean. It also demonstrates two-sample independent and paired t-tests using random weight data for different classes and before/after medicine examples. The document emphasizes the importance of establishing hypotheses, interpreting p-values, and using hypothesis tests in model building and real-world applications.

Uploaded by

siddharth shirwadkar


Hypothesis Testing with Python

USING STATISTICS FOR FINDING SIGNIFICANCE

Contributors ¯\_(ツ)_/¯
Vivek Chuadhary | Chintan Chitroda | Manvendra Singh
Hey Data Science Enthusiasts, we are back with another mini book on
applied statistics, covering a few of its methods. This is the basic
version of the book; another part will be out very soon.

Everyone learns statistics, but many fail to apply it when it comes to
solving a real project. Why? Because roughly 80% to 85% of people are
scared of research, statistics & applying their own common sense.

Common sense cannot be mastered through any course. It develops through
your curiosity & a deep understanding of the problem statement, and
taking this initial step is what gets you started.

Before applying statistics, think about the assumptions that fit your
problem statement. Statistics works on assumptions that are likely, but
not guaranteed, to be true. This means statistics cannot always give you
the true result, so you have to compare each result with strong domain
knowledge of the problem you are working on (understanding the problem
statement deeply & strongly).

In this book we have covered some material that may help you explore
further. Some of the resources have been taken from other sources and
combined here, while much of the material is our own.

We hope you will love this book. We plan to release updated versions of
it, with 4 to 5 more versions upcoming.

Happy Learning

What is Hypothesis Testing and Why Do We Do It?
Hypothesis testing is a statistical method used to make statistical decisions from experimental data.
A hypothesis is basically an assumption that we make about a population parameter. -- Google

In simple words, we make a Yes (Significant) or No (Not Significant) decision using statistics on a sample of
population data to check significance between features.

We have to make a decision about the hypothesis: either reject the null hypothesis or fail to reject it (often
loosely called "accepting" it). Every hypothesis test produces a significance value (the p-value) for that
particular test. If the p-value is greater than the predetermined significance level, we fail to reject the null
hypothesis. If the p-value is less than the predetermined significance level, we reject the null hypothesis.
For example,
if we want to see the degree of relationship between two stock prices and the p-value of the correlation
coefficient is greater than the predetermined significance level, then we fail to reject the null hypothesis: the
data do not provide evidence of a relationship between the two stock prices. Any relationship that shows up in
the sample can be attributed to the chance factor.
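
The decision rule described above can be sketched in a few lines. This is only an illustration and not part of the original notebook: the helper name decide, the 0.05 significance level and both price lists are hypothetical values made up for the example.

```python
from scipy import stats

def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p_value < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Hypothetical closing prices of two stocks (illustrative values only)
stock_a = [101, 103, 102, 105, 107, 106, 109, 111]
stock_b = [55, 54, 56, 53, 57, 55, 54, 56]

# H0: there is no linear relationship between the two price series
r, p_value = stats.pearsonr(stock_a, stock_b)
print("correlation =", round(r, 3), "p-value =", round(p_value, 3))
print(decide(p_value))
```

Keeping the rule in one small function makes it easy to reuse the same significance level across every test in a project.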

Terms you should be familiar with

1. Null hypothesis:
A null hypothesis is a statistical hypothesis that assumes the observation is due to a chance factor.
For a comparison of two population means it is denoted by H0: μ1 = μ2, which states that there is no
difference between the two population means.
2. Alternative hypothesis:
Contrary to the null hypothesis, the alternative hypothesis states that the observations are the result of a
real effect.
3. Level of significance / P-value:
The level of significance is the threshold at which we reject or fail to reject the null hypothesis. 100%
accuracy is not possible when deciding on a hypothesis, so we select a level of significance, usually 5%.
The p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as
extreme as the one observed; it is compared against the level of significance.
4. Type I error:
When we reject the null hypothesis although it was true. The probability of a Type I error is denoted by
alpha. In hypothesis testing, the part of the normal curve that shows the critical region is called the
alpha region.
5. Type II error:
When we fail to reject the null hypothesis although it is false. The probability of a Type II error is
denoted by beta. In hypothesis testing, the part of the normal curve that shows the acceptance region is
called the beta region.
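
To make the link between the level of significance and the Type I error concrete, here is a small simulation sketch. It is not from the original notebook: the population mean of 35, the standard deviation of 4, the sample size, the number of trials and the seed are all assumed values chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
alpha = 0.05                     # chosen level of significance
n_trials, n_obs = 2000, 20       # hypothetical simulation sizes

# Every row is a sample drawn from a population whose true mean IS 35,
# so H0 (mean = 35) is true in every trial.
samples = rng.normal(loc=35, scale=4, size=(n_trials, n_obs))

# One-sample t-test per row; each rejection here is a Type I error.
result = stats.ttest_1samp(samples, 35, axis=1)
type1_rate = np.mean(result.pvalue < alpha)
print("Observed Type I error rate:", type1_rate)
```

Because the null hypothesis is true in every trial, the share of rejections should land close to the chosen alpha.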

Types of Hypothesis Testing:

1. T-Test
A t-test is used to compare the means of two given samples, or the mean of one sample against a
hypothesized value
2. Anova Test
An Anova test is used to compare the means of multiple (three or more) samples with a single test
3. Chi-Square Test
A chi-square test is used to check the association between categorical variables
4. Pearson Correlation
It is used to measure the linear relationship between two numerical variables
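
As a rough guide, each of the four tests listed above has a counterpart in scipy.stats. The sketch below calls each one once; the three groups and the contingency table are hypothetical values made up for illustration and are not taken from the book's datasets.

```python
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
group_a = [30, 32, 31, 35, 33, 34]
group_b = [28, 29, 27, 31, 30, 29]
group_c = [35, 36, 34, 38, 37, 36]

# 1. T-test: compare the means of two samples
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# 2. ANOVA: compare the means of three or more samples in one test
f_stat, f_p = stats.f_oneway(group_a, group_b, group_c)

# 3. Chi-square: test association between two categorical variables,
#    given as a contingency table of counts (rows x columns)
table = [[20, 15], [10, 25]]
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

# 4. Pearson correlation: linear relationship between two numerical variables
r, r_p = stats.pearsonr(group_a, group_b)

for name, p in [("t-test", t_p), ("ANOVA", f_p),
                ("chi-square", chi_p), ("Pearson", r_p)]:
    print(name, "p-value:", round(p, 4))
```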

Datasets used in this file can be found here:

https://drive.google.com/open?id=1u0YImKHCPahDReepW0Nru2f2RfDX_UjO
Let's start by looking at simple examples using various
testing methods
Hypothesis Testing, T-Testing
In [3]:
#creating random data set of different weights for individuals
average_weight = [33,34,35,36,32,28,29,30,31,37,36,35,33,34,31,40,24]

Hypothesis testing is all about the assumptions we create.
We assume the average weight in class 12th is 35, which means the hypothesized mean is 35
Null Hypothesis , H0 = Average weight in class 12th is 35
Alt Hypothesis , Ha = Average weight in class 12th is not 35
In [14]:
from scipy import stats #importing stats package

One Sample T-Test

One Sample T-test is used for one parameter. In this example
we want to verify whether the average weight in class 12th is 35
or not
In [15]:
stats.ttest_1samp(average_weight,35)

Out[15]:
Ttest_1sampResult(statistic=-2.354253623010381, pvalue=0.03166804359862131)

P value = 0.031 = 3.1%. This means that if the average weight in class 12th
really were 35, there would be only a 3.1% chance of seeing a sample mean at
least this far from 35. The data are unlikely under our Null Hypothesis.
Generalizing, if P value < 5%, we REJECT the Null Hypothesis.
In our example, we REJECT H0 and conclude Ha: the average weight in class
12th is NOT 35
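
To see where the statistic and P value come from, the one-sample t-test can be reproduced by hand with the usual formula t = (sample mean - hypothesized mean) / (s / sqrt(n)). The sketch below reuses the average_weight data from above; the variable names t_manual and p_manual are made up for this illustration.

```python
import math
from scipy import stats

average_weight = [33,34,35,36,32,28,29,30,31,37,36,35,33,34,31,40,24]
hypothesized_mean = 35

n = len(average_weight)
sample_mean = sum(average_weight) / n
# Sample standard deviation (divides by n - 1)
sample_sd = math.sqrt(sum((x - sample_mean) ** 2 for x in average_weight) / (n - 1))
standard_error = sample_sd / math.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - hypothesized_mean) / standard_error
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(t_manual, p_manual)
print(stats.ttest_1samp(average_weight, hypothesized_mean))
```

Both print statements should report the same statistic and P value, which confirms that ttest_1samp performs a two-sided test by default.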

Two sample Independent T-test

Used to compare the means of two independent samples & to verify our
assumption about them.
In [16]:
#creating a sample data set of weights of students in class 11th
average_weight1 = [29,31,28,33,31,34,32,20,32,28,27,26,30,31,34,30]

In [18]:
average_weight #average weight of class 12th student as seen in One-Sample T-Test

Out[18]:
[33, 34, 35, 36, 32, 28, 29, 30, 31, 37, 36, 35, 33, 34, 31, 40, 24]

Creating our assumptions

Null Hypothesis , H0 = Average weight of class 12th & class
11th students is the same.
Alt Hypothesis , Ha = Average weight of class 12th & class 11th students is not
the same.
In [20]:
stats.ttest_ind(average_weight,average_weight1)

Out[20]:
Ttest_indResult(statistic=2.404544177024533, pvalue=0.022355127034138323)

P value = 0.022 = 2.2%. This means that if the average weights of class 12th &
class 11th students really were the same, there would be only a 2.2% chance of
seeing a difference in sample means at least this large.
Since P value < 5%, we REJECT the Null Hypothesis.
Concluding, the average weight of class 12th & class 11th students is not
the same.
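
By default, stats.ttest_ind assumes both groups have equal variances. As a hedged check that is not part of the original notebook, the same comparison can be rerun with equal_var=False (Welch's t-test), which drops that assumption; the variable names pooled and welch are made up for this sketch.

```python
from scipy import stats

average_weight = [33,34,35,36,32,28,29,30,31,37,36,35,33,34,31,40,24]   # class 12th
average_weight1 = [29,31,28,33,31,34,32,20,32,28,27,26,30,31,34,30]     # class 11th

# Default: pooled-variance t-test (assumes equal variances)
pooled = stats.ttest_ind(average_weight, average_weight1)
# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(average_weight, average_weight1, equal_var=False)

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

When both versions lead to the same decision, as they should for this data, the conclusion does not hinge on the equal-variance assumption.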

Two Sample Paired (or Related) T-Test

We use the Two Sample Paired T-Test as a simple way to check the
effect before & after a treatment on the same individuals.

Studying the effect of the medicine metaphor on headache for
individuals who are suffering from migraine.
Let's create a sample data set of individuals who are suffering from headache.
We have given them two medicines, one is paracetamol & another is
metaphor, and took the reading before having metaphor & after having
metaphor.
In [21]:
#readings for the same individuals before and after having metaphor
before_metaphor = [68,45,46,34,23,67,80,120,34,54,68]
after_metaphor = [28,25,26,24,13,37,30,30,54,34,38]

H Null = H0 = Readings before and after metaphor are the same. This
means Metaphor has NO EFFECT
H Alternative = Ha = Readings before and after Metaphor are NOT
the same. This means Metaphor has an EFFECT
In [22]:
stats.ttest_rel(before_metaphor,after_metaphor)

Out[22]:
Ttest_relResult(statistic=3.2771720738937873, pvalue=0.00832867082029929)

P value = 0.008 = 0.8%. P < 5%, so we reject H0 and accept Ha. This means
Metaphor has an EFFECT on individuals suffering from migraine
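
A paired t-test is equivalent to a one-sample t-test on the per-individual differences, tested against a mean of 0. The sketch below checks this on the same before/after readings; the names differences, paired and one_sample are made up for this illustration.

```python
from scipy import stats

before_metaphor = [68,45,46,34,23,67,80,120,34,54,68]
after_metaphor = [28,25,26,24,13,37,30,30,54,34,38]

# Per-individual change in the reading (before minus after)
differences = [b - a for b, a in zip(before_metaphor, after_metaphor)]

paired = stats.ttest_rel(before_metaphor, after_metaphor)
# H0 for the differences: their mean is 0 (no effect)
one_sample = stats.ttest_1samp(differences, 0)

print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)
```

This is why the paired test needs both lists to have the same length and the same ordering of individuals.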

End of T-tests Tutorial

T-Test Example (Bike-Sharing Dataset)
We learn the theoretical concepts from abc institute but fail to implement them in the real world. Here we will see
the use of hypothesis testing in Model Building.

What are we going to learn?

1. Different types of hypothesis testing.

2. Their implications in model building.

T-Test
A T-test is mostly used to check the difference in the means of two samples

Pre-processing

The first step always is to pre-process the dataset.

In [0]:
#install researchpy
!pip install researchpy
## it combines pandas, scipy.stats and statsmodels to
##get more complete information in a single API call

Requirement already satisfied: researchpy in /usr/local/lib/python3.6/dist-packages (0.1.9)

Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from researchpy) (1.18.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from researchpy) (1.0.3)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from researchpy) (1.4.1)
Requirement already satisfied: statsmodels in /usr/local/lib/python3.6/dist-packages (from researchpy) (0.10.2)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->researchpy) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->researchpy) (2.8.1)
Requirement already satisfied: patsy>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from statsmodels->researchpy) (0.5.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->researchpy) (1.12.0)

In [0]:
#import the libraries
import statsmodels.api as sm
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import researchpy as rc
import warnings
from scipy import stats
%matplotlib inline

In [0]:
#read the data
df = pd.read_csv('/content/drive/My Drive/data_set/bike_sharing.csv')

In [0]:
#check the shape
df.shape

Out[98]:
(10886, 12)
In [0]:
#check the head
df.head()

Out[99]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered cou