Hypothesis Testing
Contributors ¯\_(ツ)_/¯
Vivek Chuadhary | Chintan Chitroda | Manvendra Singh
Hey Data Science Enthusiast, we are back again with
a mini book on applied statistics covering some of
its methods. This is the basic version of the book;
another part will be out very soon.
Everyone out there learns statistics, but many fail
to apply it when it comes to solving a real project. Why?
Because almost 80% to 85% of learners are scared of
research, statistics & applying their common sense.
Common sense cannot be mastered through any
course. It develops through your curiosity & a deep understanding of the
problem statement, and taking this initial step is what gets you
there.
Before applying statistics you have to think about the assumptions behind
your problem statement. Statistics works on assumptions that are
likely, but not guaranteed, to be true. This means that
statistics cannot always give you the true result, so you have to compare
the result of any statistical method with your strong domain
knowledge of the problem you are working on (understanding the problem
statement deeply & strongly).
In simple words, we use statistics to make a Yes (Significant) or No (Not Significant) decision, using a sample of
population data to check the significance of a relationship between features.
We have to make decisions about the hypothesis: either we reject the null hypothesis or we fail to reject it.
Every test in hypothesis testing produces a significance value (the p-value) for that particular test.
If the p-value of the test is greater than the predetermined significance level, then we fail to reject the
null hypothesis. If the p-value is less than the predetermined significance level, then we reject the
null hypothesis.
For example,
if we want to see the degree of relationship between two stock prices and the p-value of the correlation
coefficient is greater than the predetermined significance level, then we fail to reject the null hypothesis.
The data do not give enough evidence of a relationship between the two stock prices, and any relationship
seen in the sample can be attributed to the chance factor.
1. Null hypothesis:
Null hypothesis is a statistical hypothesis that assumes that the observation is due to a chance factor.
Null hypothesis is denoted by; H0: μ1 = μ2, which shows that there is no difference between the two
population means.
2. Alternative hypothesis:
Contrary to the null hypothesis, the alternative hypothesis shows that observations are the result of a
real effect.
3. Level of significance / P-value:
The level of significance (alpha) is the threshold at which we reject the null hypothesis. The p-value is
the probability of seeing data at least as extreme as ours if the null hypothesis were true, and we reject
the null hypothesis when the p-value falls below alpha. 100% certainty is not possible when rejecting or
retaining a hypothesis, so we select a level of significance that is usually 5%.
4. Type I error:
Rejecting the null hypothesis although it is true. The probability of a Type I error is denoted by alpha.
In hypothesis testing, the area under the null distribution's curve that falls in the critical region is called the alpha region.
5. Type II error:
Failing to reject the null hypothesis although it is false. The probability of a Type II error is denoted by beta.
In hypothesis testing, the area under the alternative distribution's curve that falls in the acceptance region is called the beta region.
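The meaning of alpha as a Type I error rate can be checked with a small simulation. The sketch below is illustrative only: the population parameters (mean 50, standard deviation 5, 30 observations per sample) and the number of trials are assumptions, not values from this book. Both samples are drawn from the same population, so every rejection is a false one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
alpha = 0.05
n_trials = 2000                  # assumed number of repeated experiments

# Both samples come from the same population, so H0 (equal means) is true
false_rejections = 0
for _ in range(n_trials):
    a = rng.normal(loc=50, scale=5, size=30)
    b = rng.normal(loc=50, scale=5, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_rejections += 1    # rejecting a true H0 is a Type I error

type_1_rate = false_rejections / n_trials
print(f"Observed Type I error rate: {type_1_rate:.3f}")
```

The observed rate should land close to alpha, which is what choosing a 5% level of significance means in practice.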
https://drive.google.com/open?id=1u0YImKHCPahDReepW0Nru2f2RfDX_UjO
Let's start by looking at simple examples using various
testing methods
Hypothesis Testing , T-Testing
In [3]:
#creating random data set of different weights for individuals
average_weight = [33,34,35,36,32,28,29,30,31,37,36,35,33,34,31,40,24]
Out[15]:
Ttest_1sampResult(statistic=-2.354253623010381, pvalue=0.03166804359862131)
P value = 0.031 = 3.1%. This means that, if the true average weight were 35, the
probability (or chance) of seeing a sample mean at least this far from 35 is only 3.1%.
That is strong evidence against our Null Hypothesis.
Generalizing, if P value < 5%, we REJECT the Null Hypothesis.
In our example, we REJECT H0 and conclude Ha: the average_weight in class
12th is NOT 35
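The call that produced the one-sample result above is not shown. A minimal sketch of how it could be reproduced follows, assuming the hypothesized mean is 35 (taken from the interpretation text) and that scipy.stats.ttest_1samp was the function used (suggested by the Ttest_1sampResult output).

```python
from scipy import stats

# weights from the cell above
average_weight = [33, 34, 35, 36, 32, 28, 29, 30, 31, 37, 36, 35, 33, 34, 31, 40, 24]
hypothesized_mean = 35   # assumed H0 value, inferred from the text

# H0: the population mean weight is 35; Ha: it is not 35
t_stat, p_value = stats.ttest_1samp(average_weight, hypothesized_mean)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")
```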
In [18]:
average_weight #average weight of class 12th student as seen in One-Sample T-Test
Out[18]:
[33, 34, 35, 36, 32, 28, 29, 30, 31, 37, 36, 35, 33, 34, 31, 40, 24]
Out[20]:
Ttest_indResult(statistic=2.404544177024533, pvalue=0.022355127034138323)
P value = 0.022 = 2.2%. This means that, if the average_weight of class 12th & class 11th
students were the same, the probability (or chance) of seeing a difference at least this
large is only 2.2%. This is evidence against the Null Hypothesis.
We REJECT the Null Hypothesis.
Concluding, the average_weight of class 12th & class 11th students is not the
same.
H NULL = H0 = Response times before and after metaphor are same. This
means Metaphor has NO EFFECT
H Alternative = Ha = Response times before and after Metaphor are NOT
same. This means Metaphor has EFFECT
In [22]:
stats.ttest_rel(before_metaphor,after_metaphor)
Out[22]:
Ttest_relResult(statistic=3.2771720738937873, pvalue=0.00832867082029929)
P value = 0.008 = 0.8%. P < 5%, so we reject H0 in favour of Ha. This means
Metaphor has an EFFECT on individuals suffering from migraine
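The before_metaphor and after_metaphor arrays are not shown in this section, so the sketch below uses hypothetical response times purely for illustration. It also shows why the paired test suits before/after data: it is the same as a one-sample t-test on the per-person differences against zero.

```python
import numpy as np
from scipy import stats

# Hypothetical response times for the same 8 individuals (NOT the original data)
before_metaphor = np.array([10, 12, 9, 11, 13, 10, 12, 11])
after_metaphor = np.array([8, 11, 8, 9, 12, 9, 10, 10])

# Paired (related samples) t-test
t_rel, p_rel = stats.ttest_rel(before_metaphor, after_metaphor)

# Equivalent view: one-sample t-test on the differences, H0: mean difference = 0
t_diff, p_diff = stats.ttest_1samp(before_metaphor - after_metaphor, 0)

print(f"paired:      t = {t_rel:.3f}, p = {p_rel:.5f}")
print(f"differences: t = {t_diff:.3f}, p = {p_diff:.5f}")
```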
T-Test
T-test is mostly used to check the difference in the means of two samples.
Pre-processing
In [0]:
#install researchpy
!pip install researchpy
## it combines pandas, scipy.stats and statsmodels to
##get more complete information in a single API call
In [0]:
#import the libraries
import statsmodels.api as sm
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import researchpy as rc
import warnings
from scipy import stats
%matplotlib inline
In [0]:
#read the data
df = pd.read_csv('/content/drive/My Drive/data_set/bike_sharing.csv')
In [0]:
#check the shape
df.shape
Out[98]:
(10886, 12)
In [0]:
#check the head
df.head()
Out[99]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered cou
In [0]:
#check the information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10886 non-null object
1 season 10886 non-null int64
2 holiday 10886 non-null int64
3 workingday 10886 non-null int64
4 weather 10886 non-null int64
5 temp 10886 non-null float64
6 atemp 10886 non-null float64
7 humidity 10886 non-null int64
8 windspeed 10886 non-null float64
9 casual 10886 non-null int64
10 registered 10886 non-null int64
11 count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB
In [0]:
#check the number of null values in each column
df.isnull().sum()
Out[101]:
datetime 0
season 0
holiday 0
workingday 0
weather 0
temp 0
atemp 0
humidity 0
windspeed 0
casual 0
registered 0
count 0
dtype: int64
In [0]:
df['atemp'].corr(df['temp']) #atemp and temp are correlated
Out[102]:
0.9849481104817068
atemp and temp have a correlation of 0.985, so they provide almost the same information. We will drop the atemp
feature, and also datetime, for simplicity.
In [0]:
#drop datetime and atemp
df.drop(['datetime','atemp'],axis = 1,inplace=True)
Now check the columns. Keep in mind that we have dropped datetime and atemp from our dataset.
In [0]:
#check the unique values in each column
df.apply(lambda x : x.nunique())
Out[104]:
season 4
holiday 2
workingday 2
weather 4
temp 49
humidity 89
windspeed 28
casual 309
registered 731
count 822
dtype: int64
In [0]:
#standardize all the numerical features
num_scaled = scale (df[['temp','humidity','windspeed','casual','registered']],copy=False)
#scale subtracts the mean from each value and divides by the standard deviation
num_scaled
Out[105]:
array([[-1.33366069, 0.99321305, -1.56775367, -0.66099193, -0.94385353],
[-1.43890721, 0.94124921, -1.56775367, -0.56090822, -0.81805246],
[-1.43890721, 0.94124921, -1.56775367, -0.62095844, -0.851158 ],
...,
[-0.80742813, -0.04606385, 0.26970368, -0.64097518, 0.05593396],
[-0.80742813, -0.04606385, -0.83244247, -0.48084125, -0.25525818],
[-0.91267464, 0.21375537, -0.46560752, -0.64097518, -0.47375478]])
Now we will perform a t-test to check whether the number of bike rentals depends on workingday or not. For
this we will use a two-sample t-test.
A two-sample t-test is used to check whether the means of two samples (groups) are the same or different. We want to
check whether the number of bikes rented on working days is different from the number of bikes rented on
non-working days.
Let's check the mean of bikes rented on working and non-working days.
In [0]:
df.groupby('workingday')['count'].describe()
Out[106]:
count mean std min 25% 50% 75% max
workingday
We can see that the mean on working days is 193.0 and the mean on non-working days is 188.5, so the
sample means differ.
But the question is: is this difference in the means statistically significant, or is it just due to random chance?
In [0]:
#create 2 samples one for working days and one for non-working days
sample_01 = df[df['workingday'] == 1]
sample_02 = df[df['workingday'] == 0]
In [0]:
#check the shape of both the samples
print(sample_01.shape,sample_02.shape)
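sample_01 has 7412 observations whereas sample_02 has only 3474 observations. A two-sample t-test does not itself
require equal sample sizes, but the element-wise difference we compute later does, so we take an equal
number of observations in both samples.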
In [0]:
#make equal number of records in each sample
sample_01 = sample_01.sample(3474)
print(sample_01.shape,sample_02.shape)
Before jumping directly into hypothesis testing we have to check the assumptions of the
hypothesis test we want to perform.
1. The variances of the 2 samples are equal (we will use Levene's test to check this assumption).
2. The distribution of the residuals between the two groups should follow the normal distribution. We can plot a
histogram and see whether the distribution follows the normal distribution or not. We can also plot a Q-Q
plot. We can check the normality using the Shapiro-Wilk test as well.
In [0]:
#Levene's test to check whether the variances of the two groups are the same.
#H0 : Variances are same.
#H1 : Variances are not same.
#Alpha = 0.05 (5%)
#if p-value > alpha (Cannot reject H0)
#if p-value < alpha (Reject H0)
In [0]:
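alpha = 0.05
#compare the two groups: working days (sample_01) vs non-working days (sample_02)
Stats,Pvalue = stats.levene(sample_01['count'],sample_02['count'])
print(f' Test statistics : {Stats} \n Alpha : {alpha} \n P-value : {Pvalue}')
if Pvalue > alpha:
    print(' Variances are same, fail to reject null hypothesis ')
else:
    print(' Variances are not same, reject null hypothesis ')
Levene's test returns two values:
1. The test statistic.
2. The p-value associated with the test statistic. If the p-value is greater than alpha (0.05), we fail to reject
the null hypothesis and treat the variances of the 2 samples as equal. Passing the same sample twice, as in
stats.levene(sample_01['count'],sample_01['count']), always gives a p-value of 1.0, so both groups must be passed.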
Take the difference between two samples and scale it to check the normality of the residuals.
In [0]:
#we will take the difference b/w sample_01 and sample_02 and plot a histogram to check for normality
#we will scale the difference
diff = scale((np.array(sample_01['count']) - np.array(sample_02['count'])))
plt.figure(figsize=(12,6))
plt.hist(diff)
plt.show()
The distribution seems very close to a normal distribution. Let's use other methods to check the normality of the
residuals.
A Q-Q plot plots the quantiles of the sample data against the quantiles of a theoretical distribution.
In [0]:
#q-q plot to check the normality
plt.figure(figsize=(12,6))
stats.probplot(diff,plot=plt,dist='norm')
plt.show()
When the points closely follow the red line, we can say that the residuals are normally distributed. Here we
see that beyond 2 standard deviations the points drift away from the red line. But most
of the data points are still close to the red line, so we accept the assumption of normality.
Till now we have seen graphical methods to check the assumption of normality. Now let's check it
with a statistical test (Shapiro-Wilk test).
In [0]:
#Statistical test for checking normality
#Shapiro-Wilk test
#H0 : Normally distributed
#H1 : Not Normally distributed
In [0]:
alpha = 0.05
statistic,p_value = stats.shapiro(diff)
if p_value > alpha:
    print(f'Fail to reject Null Hypothesis p-value : {p_value}')
else:
    print(f'Reject Null Hypothesis p-value : {p_value}')
Here the Shapiro-Wilk test shows that the residuals are not normally distributed. For demonstration purposes we will
continue with the t-test, but in practice we should not perform a t-test when the assumption of normality is violated.
In [0]:
# H0 : There's no difference in mean (bike rental doesn't depend on workingday)
# H1 : There's a difference in mean (bike rental depends on workingday)
# Alpha : 0.05 (5%)
alpha = 0.05
statistic , p_value = stats.ttest_ind(sample_01['count'],sample_02['count'])
if p_value > alpha:
    print(f'Fail to reject Null Hypothesis p-value is {p_value}')
else:
    print('Reject Null Hypothesis')
As we can see, the p-value is greater than alpha, so we cannot reject our null hypothesis.
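When the normality assumption is violated, one commonly used alternative is a rank-based test. The sketch below swaps in the Mann-Whitney U test (scipy.stats.mannwhitneyu), which is not part of the original walkthrough. It runs on synthetic skewed data rather than the bike sharing counts, so the group names and parameters are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic, right-skewed samples (hypothetical stand-ins for two groups of counts)
group_a = rng.exponential(scale=1.0, size=200)
group_b = rng.exponential(scale=3.0, size=200)

# H0: the two groups come from the same distribution
# H1: values in one group tend to be larger than in the other
alpha = 0.05
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

if p_value > alpha:
    print(f'Fail to reject Null Hypothesis p-value is {p_value}')
else:
    print(f'Reject Null Hypothesis p-value is {p_value}')
```

With the bike data, the same call could be applied to the two 'count' columns in place of group_a and group_b.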
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns
Given data
In [2]:
population_mean = 100
population_std = 15
n_sample = 30
In [3]:
avg_class = np.vectorize(int)(np.random.normal(loc=population_mean,scale=population_std,size=n_sample))
In [4]:
print("A typical class I.Q.:",avg_class)
The given class data (generated with the given mean and assumed same
variance as population)
In [5]:
given_class = np.vectorize(int)(np.random.normal(loc=112.5,scale=population_std,size=n_sample))
In [6]:
print("Given class I.Q.:",given_class)
Given class I.Q.: [128 115 132 80 96 95 142 122 93 114 95 77 110 148 123 108 82 119
128 74 97 77 122 137 119 119 89 93 115 134]
In [7]:
plt.figure(figsize=(7,5))
sns.kdeplot(avg_class,shade=True)
sns.kdeplot(given_class,shade=True)
plt.legend(['Average class','Given class'],fontsize=14)
plt.vlines(x=avg_class.mean(),ymin=0,ymax=0.025,color='blue',linestyle='--')
plt.vlines(x=given_class.mean(),ymin=0,ymax=0.025,color='brown',linestyle='--')
plt.show()
In [8]:
std_err = population_std/np.sqrt(n_sample)
z_stat = (given_class.mean()-population_mean)/std_err
In [9]:
alpha = 0.05
rejection_threshold = st.norm.ppf(1-alpha)
In [10]:
if z_stat>rejection_threshold:
    print("We reject the NULL hypothesis. The class I.Q. is indeed above average")
else:
    print("We cannot reject the NULL hypothesis that class average is same as population average.")
We reject the NULL hypothesis. The class I.Q. is indeed above average
In [11]:
def hypothesis_testing(n_sample=30,population_mean=100,population_std=15,alpha=0.05):
    """
    Tests the hypothesis of above average I.Q. and reports the conclusion
    """
    given_class=np.vectorize(int)(np.random.normal(loc=112.5,scale=population_std,size=n_sample))
    std_err = population_std/np.sqrt(n_sample)
    z_stat = (given_class.mean()-population_mean)/std_err
    # use the alpha passed by the caller (overwriting it here would ignore the argument)
    rejection_threshold = st.norm.ppf(1-alpha)
    if z_stat>rejection_threshold:
        print("We reject the NULL hypothesis. The class I.Q. is indeed above average")
    else:
        print("We cannot reject the NULL hypothesis that class average is same as population average.")
We reject the NULL hypothesis. The class I.Q. is indeed above average
We cannot reject the NULL hypothesis that class average is same as population average.
What if the population standard deviation is lower, say 10? What happens if it is
higher instead?
In [14]:
hypothesis_testing(population_std=10)
We reject the NULL hypothesis. The class I.Q. is indeed above average
In [15]:
hypothesis_testing(population_std=40)
We cannot reject the NULL hypothesis that class average is same as population average.
What if there were only 4 employees from ABC company in the sample? What if we could test
100 employees instead?
In [16]:
hypothesis_testing(n_sample=4)
We reject the NULL hypothesis. The class I.Q. is indeed above average
In [17]:
hypothesis_testing(n_sample=100)
We reject the NULL hypothesis. The class I.Q. is indeed above average
What is the impact of changing the significance level to 0.01 (or even 0.001)
from 0.05?
In [18]:
hypothesis_testing(alpha=0.01)
We reject the NULL hypothesis. The class I.Q. is indeed above average
In [19]:
hypothesis_testing(alpha=0.001)
We reject the NULL hypothesis. The class I.Q. is indeed above average
A custom function
In [20]:
def independent_ttest(data1, data2, alpha=0.05):
    """
    Student's t-test for independent groups
    Argument:
    data1: First group data in numpy array format
    data2: Second group data in numpy array format
    alpha: Significance level
    Returns:
    t_stat: Computed t-statistic
    df: Degrees of freedom
    cv: Critical value
    p: p-value (of NULL hypothesis)
    """
    import scipy.stats as st
    # calculate means
    mean1, mean2 = np.mean(data1), np.mean(data2)
    # calculate standard errors
    se1, se2 = st.sem(data1), st.sem(data2)
    # standard error on the difference between the samples
    sed = np.sqrt(se1**2.0 + se2**2.0)
    # calculate the t statistic
    t_stat = (mean1 - mean2) / sed
    # degrees of freedom
    df = len(data1) + len(data2) - 2
    # calculate the critical value (one-sided at level alpha)
    cv = st.t.ppf(1.0 - alpha, df)
    # calculate the p-value (two-sided)
    p = (1.0 - st.t.cdf(abs(t_stat), df)) * 2.0
    # return everything
    return t_stat, df, cv, p
In [22]:
data1 = 5 * np.random.randn(n_sample) + 50
data2 = 5 * np.random.randn(n_sample) + 51
plt.figure(figsize=(7,5))
sns.kdeplot(data1,shade=True)
sns.kdeplot(data2,shade=True)
plt.legend(['data1','data2'],fontsize=14)
plt.vlines(x=data1.mean(),ymin=0,ymax=0.09,color='blue',linestyle='--')
plt.vlines(x=data2.mean(),ymin=0,ymax=0.09,color='brown',linestyle='--')
plt.show()
In [23]:
# calculate the t test
alpha = 0.05
t_stat, df, cv, p = independent_ttest(data1, data2, alpha)
print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p))
print()
Statistical power can be equivalently thought of as the probability of accepting the alternative hypothesis ( 𝐻𝑎 ) when it is true,
that is, the ability of a test to detect a specific effect, if that specific effect actually exists.
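The ability of a test to detect an effect (its statistical power) can be estimated by simulation. The sketch below is a rough illustration that mirrors the synthetic data used in this section (means 50 and 51, standard deviation 5). The helper name estimate_power and the number of trials are assumptions introduced here.

```python
import numpy as np
from scipy import stats

def estimate_power(n_sample, mean_diff=1.0, std=5.0, alpha=0.05, n_trials=1000, seed=0):
    """Fraction of simulated experiments in which a true difference is detected."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        data1 = rng.normal(50, std, n_sample)
        data2 = rng.normal(50 + mean_diff, std, n_sample)   # Ha is true by construction
        _, p = stats.ttest_ind(data1, data2)
        rejections += int(p < alpha)
    return rejections / n_trials

power_30 = estimate_power(30)
power_200 = estimate_power(200)
print(f"Estimated power with n=30 : {power_30:.2f}")
print(f"Estimated power with n=200: {power_200:.2f}")
```

A larger sample makes the same true difference easier to detect, which is why the experiment is repeated with n_sample = 200.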
In [24]:
n_sample = 200
In [25]:
data1 = 5 * np.random.randn(n_sample) + 50
data2 = 5 * np.random.randn(n_sample) + 51
plt.figure(figsize=(7,5))
sns.kdeplot(data1,shade=True)
sns.kdeplot(data2,shade=True)
plt.legend(['data1','data2'],fontsize=14)
plt.vlines(x=data1.mean(),ymin=0,ymax=0.09,color='blue',linestyle='--')
plt.vlines(x=data2.mean(),ymin=0,ymax=0.09,color='brown',linestyle='--')
plt.show()
In [26]:
# calculate the t test
alpha = 0.05
t_stat, df, cv, p = independent_ttest(data1, data2, alpha)
print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p))
print()
In [28]:
group1 = 5 * np.random.randn(n_sample) + 50
group2 = 5 * np.random.randn(n_sample) + 52
plt.figure(figsize=(7,5))
sns.kdeplot(group1,shade=True)
sns.kdeplot(group2,shade=True)
plt.legend(['group1','group2'],fontsize=14)
plt.vlines(x=group1.mean(),ymin=0,ymax=0.09,color='blue',linestyle='--')
plt.vlines(x=group2.mean(),ymin=0,ymax=0.09,color='brown',linestyle='--')
plt.show()
In [29]:
plt.figure(figsize=(7,5))
sns.boxplot(data=[group1,group2])
sns.swarmplot(data=[group1,group2],color='.2')
plt.legend(['group1','group2'],fontsize=14)
plt.show()
In [30]:
f,p=st.f_oneway(group1,group2)
In [33]:
trial_A = 9*np.random.randn(34) + 102
trial_B = 6*np.random.randn(48) + 109
trial_C = 12*np.random.randn(38) + 103
trial_CONTROL = 10*np.random.randn(35) + 110
In [35]:
plt.figure(figsize=(9,5))
sns.kdeplot(trial_A,shade=True,color='Blue')
sns.kdeplot(trial_B,shade=True,color='red')
sns.kdeplot(trial_C,shade=True,color='yellow')
sns.kdeplot(trial_CONTROL,shade=True,color='black')
plt.legend(['trial_A','trial_B','trial_C','trial_CONTROL'],fontsize=14)
plt.show()
In [36]:
groups = {'trial_A':trial_A,'trial_B':trial_B,'trial_C':trial_C,'trial_CONTROL':trial_CONTROL}
In [37]:
multi_anova(groups)
We reject the hypothesis of equal mean for trial_A and trial_B as per ANOVA test result
We reject the hypothesis of equal mean for trial_A and trial_C as per ANOVA test result
We reject the hypothesis of equal mean for trial_A and trial_CONTROL as per ANOVA test result
We reject the hypothesis of equal mean for trial_B and trial_C as per ANOVA test result
ANOVA fails to reject the hypothesis of equal mean for trial_B and trial_CONTROL
We reject the hypothesis of equal mean for trial_C and trial_CONTROL as per ANOVA test result
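The multi_anova helper called above is not defined in this section. The sketch below is a hypothetical reconstruction that matches the printed messages: it runs scipy's f_oneway on every pair of groups. The original implementation may differ, and the random data is regenerated here with a fixed seed, so the decisions printed need not match the output above.

```python
from itertools import combinations
import numpy as np
from scipy import stats

def multi_anova(groups, alpha=0.05):
    """Hypothetical reconstruction: pairwise one-way ANOVA over a dict of groups."""
    p_values = {}
    for name1, name2 in combinations(groups, 2):
        _, p = stats.f_oneway(groups[name1], groups[name2])
        p_values[(name1, name2)] = p
        if p < alpha:
            print(f"We reject the hypothesis of equal mean for {name1} and {name2} as per ANOVA test result")
        else:
            print(f"ANOVA fails to reject the hypothesis of equal mean for {name1} and {name2}")
    return p_values

rng = np.random.default_rng(1)
groups = {
    'trial_A': 9 * rng.standard_normal(34) + 102,
    'trial_B': 6 * rng.standard_normal(48) + 109,
    'trial_C': 12 * rng.standard_normal(38) + 103,
    'trial_CONTROL': 10 * rng.standard_normal(35) + 110,
}
p_values = multi_anova(groups)
```

Running six pairwise tests at alpha = 0.05 raises the overall chance of a false rejection, so a post-hoc procedure such as Tukey's HSD is usually preferred for this kind of comparison.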
What's the conclusion?
From the printed results, we see that the equal mean hypothesis (with the CONTROL group) could not be rejected for trial_B,
whereas the hypothesis was rejected for the pairs trial_A and trial_CONTROL, and trial_C and trial_CONTROL.
Therefore, the trials of medicine A and medicine C showed a statistically significant lowering of blood pressure, whereas
medicine B did not.
In [3]: all_scores = A + B + C
company_names = (['A'] * len(A)) + (['B'] * len(B)) + (['C'] * len(C))
In [5]: data.head(20)
Out[5]:
company score
0 A 12.6
1 A 12.0
2 A 11.8
3 A 11.9
4 A 13.0
5 A 12.5
6 A 14.0
7 B 10.0
8 B 10.2
9 B 10.0
10 B 12.0
11 B 14.0
12 B 13.0
13 C 10.1
14 C 13.0
15 C 13.4
16 C 12.9
17 C 8.9
18 C 10.7
19 C 13.6
In [6]: data.groupby('company').mean() # the group means show the average time each company takes to deliver the package;
# the lowest average delivery time is seen for company B
Out[6]:
score
company
A 12.542857
B 11.533333
C 11.825000
Do you think that computing the mean is enough to say that company B is faster
at delivering a package?
No. As you can observe from the original data, company B also delivers some packages very slowly.
In [7]: data.head(1)
Out[7]:
company score
0 A 12.6
Out[12]:
43.132380952380956
In [14]: # add group means and overall mean to the original data frame
data = data.merge(group_means, left_on = 'company', right_index = True)
Degrees of freedom
Mean Squares
Out[19]:
2.1958597883597886
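The fragments above (sum of squares, degrees of freedom, mean squares) can be tied together in one pass. The sketch below is a reconstruction under assumptions. The scores are copied from the 20 rows shown by data.head(20), and an eighth value of 12.0 for company C is inferred from its group mean of 11.825, since that row is not displayed. The variable names are illustrative. With that assumption the sketch gives a total sum of squares of about 43.13 and a within-group mean square of about 2.196, in line with the two outputs shown above.

```python
import pandas as pd
from scipy import stats

A = [12.6, 12.0, 11.8, 11.9, 13.0, 12.5, 14.0]
B = [10.0, 10.2, 10.0, 12.0, 14.0, 13.0]
C = [10.1, 13.0, 13.4, 12.9, 8.9, 10.7, 13.6, 12.0]  # last value is an assumption

data = pd.DataFrame({
    'company': ['A'] * len(A) + ['B'] * len(B) + ['C'] * len(C),
    'score': A + B + C,
})

overall_mean = data['score'].mean()
group_stats = data.groupby('company')['score'].agg(['mean', 'count'])

# Sum of squares: total, between groups, within groups
ss_total = ((data['score'] - overall_mean) ** 2).sum()
ss_between = (group_stats['count'] * (group_stats['mean'] - overall_mean) ** 2).sum()
ss_within = ss_total - ss_between

# Degrees of freedom
df_between = len(group_stats) - 1          # k - 1
df_within = len(data) - len(group_stats)   # N - k

# Mean squares and F statistic
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within
p_value = stats.f.sf(f_stat, df_between, df_within)

print(f"SS total = {ss_total:.4f}, MS within = {ms_within:.4f}")
print(f"F = {f_stat:.4f}, p = {p_value:.4f}")
```

The F statistic computed this way agrees with scipy.stats.f_oneway(A, B, C), which is a quick cross-check on the hand calculation.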
For example, we want to check whether the average weight of babies born in 3 different states is similar or
different.
Before moving forward with any kind of hypothesis testing we should always have a question in mind. Based
on the question, we decide the kind of hypothesis test and test our hypothesis against it.
Here we will work with the bike sharing data again. The question that we are asking here is: are the number
of bike rentals similar or different in all 4 seasons?
Pre-processing
We have already looked at preprocessing, so here let's dive directly into hypothesis testing.
In [0]:
#import the libraries
import statsmodels.api as sm
from statsmodels.formula.api import ols
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import warnings
from scipy import stats
%matplotlib inline
In [0]:
#read the data
df = pd.read_csv('/content/drive/My Drive/data_set/bike_sharing.csv')
In [0]:
#drop datetime
df.drop(['datetime','atemp'],axis = 1,inplace=True)
In [0]:
df['weather'].value_counts()
Out[4]:
1 7192
2 2834
3 859
4 1
Name: weather, dtype: int64
We have only 1 record in 4th category. We will drop the records of 4th weather situation.
In [0]:
df.drop(df[df['weather']==4].index,axis=0,inplace=True) #remove the records where weather == 4
In [0]:
df.groupby('weather')['count'].describe() #groupby weather situation and check the description
Out[6]:
count mean std min 25% 50% 75% max
weather
Clearly we can see that the means of the 3 groups are very different. But are these differences statistically significant?
We will use one-way ANOVA to test whether this difference in means is statistically significant or not.
So the question is: does the weather situation have any impact on the number of bikes rented?
In [0]:
#perform one-way ANOVA using the stats module from the scipy library
#H0 : There is no difference in the mean
#H1 : There is a difference in the mean
#Alpha : 0.05
alpha = 0.05
Stats,p_value = stats.f_oneway(df['count'][df['weather']==1],
df['count'][df['weather']==2],
df['count'][df['weather']==3])
Here our p-value is less than alpha, which means that the weather situation has an impact on the number of bike rentals.
Using one-way ANOVA we only know that the means of the groups are not all the same. But we don't know which group
means differ.
We use a post-hoc test to find out which group means are not equal.
In [0]:
#Use TukeyHSD to know which group mean are not similar.
from statsmodels.stats.multicomp import MultiComparison
mul_comp = MultiComparison(df['count'],df['weather'])
mul_result = mul_comp.tukeyhsd()
print(mul_result)
If you look at the last column, all the values are Reject = True, which means we reject the null hypothesis
(means are the same) for every pair. The means of all the groups are significantly different.
Two Way Anova
Two-way ANOVA is used to examine the influence of 2 different independent categorical
variables on 1 dependent continuous variable.
Before hypothesis testing we perform a regression analysis using the two variables. We will go one step at a time
and keep it simple to understand.
Let's examine whether season and weather situation have any effect on bike rentals or not. We have 4 seasons
and 3 weather situations.
In [0]:
#check the description of groups of different weather situations
df.groupby('weather')['count'].describe()
Out[9]:
count mean std min 25% 50% 75% max
weather
We checked this before as well, and found that the means differ significantly across weather situations.
In [0]:
#check the description of groups of different seasons
df.groupby('season')['count'].describe()
Out[10]:
count mean std min 25% 50% 75% max
season
In [0]:
#Perform regression analysis with weather situation and season (including their interaction)
model = ols('count ~ C(weather) * C(season)',df).fit() #fit the regression model
print(model.summary()) #print summary
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Table 1 tells us whether the regression as a whole is significant, and table 2 tells us whether each individual
variable is significant.
Looking at table 1, we can see that the p-value associated with the F-statistic is very low, which means the
regression is significant. Similarly, when we look at the p-values associated with the t-statistics in table 2, we
observe that they are close to zero for most of the variables.
In [0]:
#H0: There's no difference in mean of weather
# There is No difference in Mean of Season
# There is no difference in mean of Weather and Season combined
Out[12]:
df sum_sq mean_sq F PR(>F)
Looking at the p-values in the last column, we can see that most of the values are close to zero. So we can
say that the means are significantly different.
Let's look into a small data set & apply EDA as well as
the Chi-Square Test
In [1]:
#Importing the required libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
In [2]:
#Reading the dataset
df_raw=pd.read_csv('QoSvsQoESyn.csv')
In [3]:
# Creating a copy of dataset
copy_df=df_raw.copy(deep=True)
Copy_df = pd.DataFrame(df_raw, columns=['cellAccessibilityRankDesc'])
Copy_df.head()
0 ACCEPTABLE
1 ACCEPTABLE
2 ACCEPTABLE
3 ACCEPTABLE
4 ACCEPTABLE
In [4]:
# Display first five rows
copy_df.head()
0 ACCEPTABLE 1
1 ACCEPTABLE 1
2 ACCEPTABLE 1
3 ACCEPTABLE 1
4 ACCEPTABLE 1
What can you analyze from the above data set? Can you
build a scalable ML model from the information
given, & how will you interpret the data set to make
it useful by understanding the column names &
making your own assumptions?
Let's work through some steps to discuss further
In [5]:
# shape of the dataset
copy_df.shape
Out[5]:
(3047, 2)
In [6]:
#Checking the null values in the dataframe
copy_df.isnull().sum()
In [7]:
# information about dataset
copy_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3047 entries, 0 to 3046
Data columns (total 2 columns):
cellAccessibilityRankDesc 3047 non-null object
crmInboundInteractionCount 3047 non-null int64
dtypes: int64(1), object(1)
memory usage: 47.7+ KB
In [8]:
# Statistical description of dataframe
copy_df.describe()
count 3047.000000
mean 0.985560
std 0.780879
min 0.000000
25% 0.000000
50% 1.000000
75% 2.000000
max 2.000000
In [9]:
# Statistical description of dataframe
copy_df.describe().T #transposing the summary to have a better view of the statistics
In [10]:
# Extracting the unique values of the 'cellAccessibilityRankDesc' column
a=copy_df['cellAccessibilityRankDesc'].unique()
print(a)
len(a)
Out[10]:
3
In [11]:
#Finding the count of QoS
copy_df['cellAccessibilityRankDesc'].value_counts()
In [13]:
#Finding the count of the user experiences
copy_df['crmInboundInteractionCount'].value_counts()
In [15]:
#Finding the unique values of user experience for each parameter of QoS
copy_df.groupby('cellAccessibilityRankDesc')['crmInboundInteractionCount'].unique()
In [17]:
#Drawing crosstab for the QoS and QoE for easily understanding as per convenience
pd.crosstab(copy_df['cellAccessibilityRankDesc'],df_raw['crmInboundInteractionCount'],margins=True).style.b
cellAccessibilityRankDesc
In [18]:
sns.set_style('whitegrid')
sns.countplot(x='cellAccessibilityRankDesc', hue='crmInboundInteractionCount', data=copy_df, palette='rainbo
In [19]:
sns.set_style('whitegrid')
sns.countplot(x='crmInboundInteractionCount', hue='cellAccessibilityRankDesc', data=copy_df, palette='rainbo
cellAccessibilityRankDesc
GOOD 1.557965
ACCEPTABLE 0.988810
EXCELLENT 0.188439
In [21]:
#Using catplot (the replacement for the removed factorplot) to plot the mean user rating for each service quality
sns.catplot(x='cellAccessibilityRankDesc',y='crmInboundInteractionCount',data=copy_df,kind='point')
plt.show()
In [23]:
copy_df
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
3042 0 0 1
3043 0 0 1
3044 0 0 1
3045 0 0 1
3046 0 0 1
0 ACCEPTABLE
1 ACCEPTABLE
2 ACCEPTABLE
3 ACCEPTABLE
4 ACCEPTABLE
In [25]:
# applying get dummy method
dum_df = pd.get_dummies(Copy_df, columns=["cellAccessibilityRankDesc"] )
In [26]:
dum_df
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
3042 0 0 1
3043 0 0 1
3044 0 0 1
3045 0 0 1
3046 0 0 1
Out[27]:
crmInboundInteractionCount
0 1
1 1
2 1
3 1
4 1
In [28]:
new_data=dum_df.join(df_raw)
new_data.head()
0 1 0 0 ACCEPTAB
1 1 0 0 ACCEPTAB
2 1 0 0 ACCEPTAB
3 1 0 0 ACCEPTAB
4 1 0 0 ACCEPTAB
0 ACCEPTABLE 1
1 ACCEPTABLE 1
2 ACCEPTABLE 1
3 ACCEPTABLE 1
4 ACCEPTABLE 1
In [30]:
rank_desk=pd.crosstab(index=df_raw['cellAccessibilityRankDesc'],columns=df_raw['crmInboundInteractionCount'
In [31]:
rank_desk #before applying particular test we have to look for Contingency table
cellAccessibilityRankDesc
ACCEPTABLE 55 884 44
EXCELLENT 767 33 65
In [33]:
from scipy import stats #import stats package
(chi2, p, dof,_) = stats.chi2_contingency([rank_desk.iloc[0].values,rank_desk.iloc[1].values,rank_desk.iloc
In [34]:
print ("chi2 : " ,chi2)
print ("p-value : " ,p)
print ("Degree of Freedom : " ,dof)
chi2 : 3192.2255875828437
p-value : 0.0
Degree of Freedom : 4
Limitations of Chi-Square
It cannot be used when samples are matched or related.
It won't give much information about the strength of the
relationship.
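Because the chi-square statistic alone says little about the strength of a relationship, it is often paired with an effect-size measure. The sketch below adds Cramér's V, which is not part of the original analysis. The contingency counts are hypothetical, since the full QoS vs QoE table is not shown in this section.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical 3x3 contingency table (NOT the original counts)
observed = pd.DataFrame(
    [[60, 300, 40],
     [280, 30, 50],
     [50, 40, 310]],
    index=['ACCEPTABLE', 'EXCELLENT', 'GOOD'],
    columns=[0, 1, 2],
)

# H0: the row variable and the column variable are independent
chi2, p, dof, expected = stats.chi2_contingency(observed.to_numpy())

# Cramér's V: 0 means no association, 1 means perfect association
n = observed.to_numpy().sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))

print("chi2 : ", chi2)
print("p-value : ", p)
print("Degree of Freedom : ", dof)
print("Cramér's V : ", cramers_v)
```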
In [38]:
data=pd.read_csv('zomato.csv')
Out[39]:
url 0.00
address 0.00
name 0.00
online_order 0.00
book_table 0.00
rate 15.03
votes 0.00
phone 2.34
location 0.04
rest_type 0.44
dish_liked 54.29
cuisines 0.09
approx_cost(for two people) 0.67
reviews_list 0.00
menu_item 0.00
listed_in(type) 0.00
listed_in(city) 0.00
dtype: float64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
url 51717 non-null object
address 51717 non-null object
name 51717 non-null object
online_order 51717 non-null object
book_table 51717 non-null object
rate 43942 non-null object
votes 51717 non-null int64
phone 50509 non-null object
location 51696 non-null object
rest_type 51490 non-null object
dish_liked 23639 non-null object
cuisines 51672 non-null object
approx_cost(for two people) 51371 non-null object
reviews_list 51717 non-null object
menu_item 51717 non-null object
listed_in(type) 51717 non-null object
listed_in(city) 51717 non-null object
dtypes: int64(1), object(16)
memory usage: 6.7+ MB
In [41]:
#Deleting Unnecessary Columns
data=data.drop(['url','dish_liked','phone'],axis=1) #Dropping the column "dish_liked", "phone", "url" and sa
   address                                              name                   online_order  book_table  rate   votes  location      rest_type            cuisines                        approx_cost(for two people)  reviews_list
0  942, 21st Main Road, 2nd Stage, Banashankari, ...    Jalsa                  Yes           Yes         4.1/5  775    Banashankari  Casual Dining        North Indian, Mughlai, Chinese  800                          [('Rat...
1  2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...    Spice Elephant         Yes           No          4.1/5  787    Banashankari  Casual Dining        Chinese, North Indian, Thai     800                          [('Rat...
2  1112, Next to KIMS Medical College, 17th Cross...    San Churro Cafe        Yes           No          3.8/5  918    Banashankari  Cafe, Casual Dining  Cafe, Mexican, Italian          800                          [('Rat...
3  1st Floor, Annakuteera, 3rd Stage, Banashankar...    Addhuri Udupi Bhojana  No            No          3.7/5  88     Banashankari  Quick Bites          South Indian, North Indian      300                          [('Rat...
4  10, 3rd Floor, Lakshmi Associates, Gandhi Baza...    Grand Village          No            No          3.8/5  166    Basavanagudi  Casual Dining        North Indian, Rajasthani        600                          [('Rat...
In [43]:
#Removing the Duplicates
data.duplicated().sum()
data.drop_duplicates(inplace=True)
In [44]:
#Remove the NaN values from the dataset
data.isnull().sum()
data.dropna(how='any',inplace=True)
data.info() #.info() function is used to get a concise summary of the dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43499 entries, 0 to 51716
Data columns (total 14 columns):
address 43499 non-null object
name 43499 non-null object
online_order 43499 non-null object
book_table 43499 non-null object
rate 43499 non-null object
votes 43499 non-null int64
location 43499 non-null object
rest_type 43499 non-null object
cuisines 43499 non-null object
approx_cost(for two people) 43499 non-null object
reviews_list 43499 non-null object
menu_item 43499 non-null object
listed_in(type) 43499 non-null object
listed_in(city) 43499 non-null object
dtypes: int64(1), object(13)
memory usage: 5.0+ MB
In [45]:
#Reading Column Names
data.columns
Out[46]:
Index(['address', 'name', 'online_order', 'book_table', 'rate', 'votes',
'location', 'rest_type', 'cuisines', 'cost', 'reviews_list',
'menu_item', 'type', 'city'],
dtype='object')
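The column list above shows the shorter names cost, type and city in place of approx_cost(for two people), listed_in(type) and listed_in(city). A rename step along the following lines would produce that result; this is a sketch on an empty stand-in frame, not necessarily the exact call used in the notebook.

```python
import pandas as pd

# Stand-in frame that only carries the three long column names
demo = pd.DataFrame(columns=[
    "approx_cost(for two people)", "listed_in(type)", "listed_in(city)",
])

# Map the long names to the short ones seen in data.columns
demo = demo.rename(columns={
    "approx_cost(for two people)": "cost",
    "listed_in(type)": "type",
    "listed_in(city)": "city",
})
print(list(demo.columns))
```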
In [47]:
data['cost'].unique() #looking the unique values of cost column
In [48]:
#Reading Rate of dataset
data['rate'].unique()
In [49]:
data.rate = data.rate.replace("NEW", np.nan)
data.dropna(how ='any', inplace = True)
In [50]:
data.rate = data.rate.replace("-", np.nan)
data.dropna(how ='any', inplace = True)
   address                                            name                   online_order  book_table  rate  votes  location      rest_type            cuisines                        cost  ...
0  942, 21st Main Road, 2nd Stage, Banashankari, ...  Jalsa                  Yes           Yes         4.1   775    Banashankari  Casual Dining        North Indian, Mughlai, Chinese  800   ...
1  2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...  Spice Elephant         Yes           No          4.1   787    Banashankari  Casual Dining        Chinese, North Indian, Thai     800   ...
2  1112, Next to KIMS Medical College, 17th Cross...  San Churro Cafe        Yes           No          3.8   918    Banashankari  Cafe, Casual Dining  Cafe, Mexican, Italian          800   ...
3  1st Floor, Annakuteera, 3rd Stage, Banashankar...  Addhuri Udupi Bhojana  No            No          3.7   88     Banashankari  Quick Bites          South Indian, North Indian      300   ...
4  10, 3rd Floor, Lakshmi Associates, Gandhi Baza...  Grand Village          No            No          3.8   166    Basavanagudi  Casual Dining        North Indian, Rajasthani        600   ...
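In the preview above the rate column no longer carries the "/5" suffix, and data.info() later reports it as float64. One way to get there is sketched below on a small hypothetical Series; the exact cleaning code in the notebook may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw rate values, including the "NEW" and "-" placeholders
raw_rate = pd.Series(["4.1/5", "3.8 /5", "NEW", "-", "3.7/5"])

# Treat the placeholders as missing and drop them
rate = raw_rate.replace(["NEW", "-"], np.nan).dropna()

# Strip the "/5" suffix and any stray spaces, then convert to float
rate = rate.str.replace("/5", "", regex=False).str.strip().astype(float)
print(rate.tolist())
```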
In [52]:
data['rate'].unique()
In [53]:
data.online_order = data.online_order.apply(lambda X : 0 if X == 'No' else 1)
data.book_table=data.book_table.apply(lambda X : 0 if X == 'No' else 1)
   address                                            name                   online_order  book_table  rate  votes  location      rest_type            cuisines                        cost  ...
0  942, 21st Main Road, 2nd Stage, Banashankari, ...  Jalsa                  1             1           4.1   775    Banashankari  Casual Dining        North Indian, Mughlai, Chinese  800   ...
1  2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...  Spice Elephant         1             0           4.1   787    Banashankari  Casual Dining        Chinese, North Indian, Thai     800   ...
2  1112, Next to KIMS Medical College, 17th Cross...  San Churro Cafe        1             0           3.8   918    Banashankari  Cafe, Casual Dining  Cafe, Mexican, Italian          800   ...
3  1st Floor, Annakuteera, 3rd Stage, Banashankar...  Addhuri Udupi Bhojana  0             0           3.7   88     Banashankari  Quick Bites          South Indian, North Indian      300   ...
4  10, 3rd Floor, Lakshmi Associates, Gandhi Baza...  Grand Village          0             0           3.8   166    Basavanagudi  Casual Dining        North Indian, Rajasthani        600   ...
In [55]:
#Encode the input Variables
def Encode(data):
    #replace every column except rate, cost and votes with integer category codes
    for column in data.columns[~data.columns.isin(['rate', 'cost', 'votes'])]:
        data[column] = data[column].factorize()[0]
    return data
data_en = Encode(data.copy())
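The Encode function relies on factorize(), which assigns an integer code to each distinct value in order of first appearance. A minimal sketch on a hypothetical column:

```python
import pandas as pd

# Hypothetical location values
city = pd.Series(["Banashankari", "Basavanagudi", "Banashankari", "Jayanagar"])

codes, uniques = city.factorize()
print(codes)          # one integer code per row
print(list(uniques))  # the distinct values, in order of first appearance

# Missing values receive the code -1 by default
codes_with_missing = pd.Series(["a", None]).factorize()[0]
print(codes_with_missing)
```

The -1 code matters later on: sklearn's chi2 rejects negative feature values, so missing values need to be handled before or after encoding.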
In [56]:
data_en.head(2)
In [57]:
#data_en.isnull().sum()
data=data.dropna(axis=1,how='all')
In [58]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 41237 entries, 0 to 51716
Data columns (total 14 columns):
address 41237 non-null object
name 41237 non-null object
online_order 41237 non-null int64
book_table 41237 non-null int64
rate 41237 non-null float64
votes 41237 non-null int64
location 41237 non-null object
rest_type 41237 non-null object
cuisines 41237 non-null object
cost 41237 non-null object
reviews_list 41237 non-null object
menu_item 41237 non-null object
type 41237 non-null object
city 41237 non-null object
dtypes: float64(1), int64(3), object(10)
memory usage: 4.7+ MB
In [59]:
from sklearn.feature_selection import chi2 #importing chi2
In [60]:
data_en.isnull().sum() #looking into null values
In [61]:
data_en.isnull().sum() #count null values per column
data_en.dropna(how='any',inplace=True) #drop rows that contain any null value
In [62]:
data_en.isnull().sum() #check again whether any null values remain
In [63]:
X = data_en.drop('cost',axis=1) #split into independent variables (X) and the target variable (cost)
y = data_en['cost']
In [64]:
chi_scores = chi2(X,y)
In [65]:
chi_scores #in the output, the first array holds the chi2 values and the second array holds the p-values
In [66]:
p_values = pd.Series(chi_scores[1],index = X.columns)
p_values.sort_values(ascending = False , inplace = True)
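Once the p-values are sorted, features whose p-value falls below the chosen significance level are the ones the test flags as related to the target. The sketch below shows that last step on synthetic data; the feature names, the random data and the 0.05 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)  # hypothetical binary target

X = pd.DataFrame({
    "related": y * 3 + rng.integers(0, 2, size=n),  # values depend on y
    "unrelated": rng.integers(0, 4, size=n),        # generated independently of y
})

chi_values, p = chi2(X, y)
p_values = pd.Series(p, index=X.columns).sort_values(ascending=False)
print(p_values)

alpha = 0.05
selected = p_values[p_values < alpha].index.tolist()
print("Features flagged as dependent on the target:", selected)
```

Note that sklearn's chi2 expects non-negative features such as booleans or frequencies, so scores computed on factorized category codes depend on the arbitrary code order and are best read as a rough screening step.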
In [0]:
df = pd.read_csv('/content/drive/My Drive/data_set/E-commerce.csv') #read the data
In [0]:
df.shape #check shape of the data
Out[3]:
(23472, 9)
In [0]:
df.head() #check head of the data
Out[4]:
   Unnamed: 0  Clothing ID  Age  Rating  Recommended IND  Positive Feedback Count  Division Name  Department Name  Class Name
In [0]:
df.drop('Unnamed: 0',axis=1,inplace=True) #drop Unnamed: 0 column
In [0]:
df.columns #check columns
Out[7]:
Index(['Clothing ID', 'Age', 'Rating', 'Recommended IND',
'Positive Feedback Count', 'Division Name', 'Department Name',
'Class Name'],
dtype='object')
In [0]:
df.isnull().sum()
Out[8]:
Clothing ID 0
Age 0
Rating 0
Recommended IND 0
Positive Feedback Count 0
Division Name 0
Department Name 0
Class Name 0
dtype: int64
Here we will work with only two columns, Recommended IND and Rating.
We want to check whether the website's recommendations are independent of ratings or not.
In [0]:
df_1 = df[['Rating','Recommended IND']] #store rating and Recommended IND in df_1
In [0]:
df_1.head() #check head
Out[17]:
Rating Recommended IND
0 4 1
1 5 1
2 3 0
3 5 1
4 5 1
In [0]:
#Let's plot and check the frequency of ratings
We can see from the distribution that most of the products are highly rated; most are rated greater than 3.
In [0]:
#crosstab to check the rating of the product and whether website recommends or not.
cross_tab = pd.crosstab(df_1['Rating'],df_1['Recommended IND']).T
cross_tab
Out[21]:
Rating             1     2     3     4      5
Recommended IND
0                826  1471  1682   168     25
1                 16    94  1189  4909  13092
As we can see, mostly highly rated products are recommended by the website. We want a statistical method to check
whether the website's recommendations are dependent on ratings or not.
In [0]:
#H0 : Recommended IND is independent of Ratings
#H1 : Recommended IND is not independent of Ratings
#Alpha : 0.05
In [0]:
from scipy.stats import chi2_contingency
alpha = 0.05
#chi2_stat avoids shadowing the scipy stats module imported earlier
chi2_stat,p_value,degrees_of_freedom,expected = chi2_contingency(cross_tab)
if p_value > alpha:
    print(f' Fail to Reject Null Hypothesis\n P-Value is {p_value}\n Recommendations are Independent of Ratings')
else:
    print(f' Reject Null Hypothesis\n P-Value is {p_value}\n Recommendations are not Independent of Ratings')
The chi-square test of independence tells us that the recommendations are not independent of ratings. We can also check this
by comparing the rating counts of the recommended and not recommended products.
In [0]:
recommended = df_1[df_1['Recommended IND']==1] #rows for recommended products
not_recommended = df_1[df_1['Recommended IND']==0] #rows for products that were not recommended
In [0]:
recommended['Rating'].value_counts() #check the value counts in recommended
Out[43]:
5 13092
4 4909
3 1189
2 94
1 16
Name: Rating, dtype: int64
In [0]:
not_recommended['Rating'].value_counts() #check the values count in not recommended
Out[46]:
3 1682
2 1471
1 826
4 168
5 25
Name: Rating, dtype: int64
Most of the recommended products have higher ratings, and those which were not recommended have lower
ratings. The website's recommendation is indeed dependent on ratings.
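The two value_counts() outputs above contain everything the chi-square test needs, so the test can be repeated from the counts alone. A minimal sketch, with the counts copied from those outputs:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Rating counts copied from the value_counts() outputs above
# (row 0 = not recommended, row 1 = recommended)
counts = pd.DataFrame(
    {1: [826, 16], 2: [1471, 94], 3: [1682, 1189], 4: [168, 4909], 5: [25, 13092]},
    index=pd.Index([0, 1], name="Recommended IND"),
)
counts.columns.name = "Rating"

chi2_stat, p_value, dof, expected = chi2_contingency(counts)

alpha = 0.05
print("Total reviews      :", counts.values.sum())
print("Degrees of freedom :", dof)
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```

The counts add up to 23472, the number of rows reported by df.shape, and a 2x5 table gives 4 degrees of freedom.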
In [0]:
#plot different histograms for recommended and not recommended ratings
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(15,6))
recommended['Rating'].hist(ax=ax1)
ax1.set_title('Recommended')
not_recommended['Rating'].hist(ax=ax2,color='blue')
ax2.set_title('Not Recommended')
plt.show()
The plot supports the conclusion that the recommendations are dependent on ratings. The left panel shows that the
recommended products mostly have ratings greater than three, and the right panel shows that the not
recommended products were mostly those with lower ratings.
Hypothesis Testing Example (Blood Pressure Dataset)
In [0]: #install researchpy
!pip install researchpy
Collecting researchpy
Successfully installed researchpy-0.1.9
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 patient 120 non-null int64
1 sex 120 non-null object
2 agegrp 120 non-null object
3 bp_before 120 non-null int64
4 bp_after 120 non-null int64
dtypes: int64(3), object(2)
memory usage: 4.8+ KB
In [0]: #check null values
df.isnull().sum()
Out[0]:
patient 0
sex 0
agegrp 0
bp_before 0
bp_after 0
dtype: int64
Here we can observe that the mean before the intervention is 156.45 and the mean after the intervention is 151.36. Clearly
there is a difference in the mean before and after the intervention, but is this difference statistically significant?
The result shows that we can reject our null hypothesis and conclude that there is a significant difference before and after
the intervention.
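A paired comparison of this kind can be sketched with scipy.stats.ttest_rel. The data below is synthetic (random numbers shaped loosely like bp_before and bp_after, not the real dataset), so the printed values are for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for bp_before / bp_after of 120 patients (not the real data)
bp_before = rng.normal(156, 11, size=120)
bp_after = bp_before - rng.normal(5, 10, size=120)

# Paired (dependent samples) t-test
t_stat, p_value = stats.ttest_rel(bp_before, bp_after)

# Equivalent view: a one-sample t-test on the per-patient differences
t_check, p_check = stats.ttest_1samp(bp_before - bp_after, 0.0)

print("t statistic:", t_stat)
print("p-value    :", p_value)
```

The equivalence with the one-sample test on the differences is why the paired test looks at each patient's change rather than at the two group means separately.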
However, when we checked the equal-variance assumption using Levene's test, we saw that the variances of the two samples
are not the same. When the sample variances differ, we use the Wilcoxon signed-rank test, a non-parametric alternative to the paired t-test.
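Levene's test is available as scipy.stats.levene. The sketch below runs it on two synthetic samples with deliberately different spreads; the data and the 0.05 threshold are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples with the same mean but very different spread
group_a = rng.normal(150, 5, size=200)
group_b = rng.normal(150, 20, size=200)

# H0: both samples come from populations with equal variances
stat, p_value = stats.levene(group_a, group_b)

alpha = 0.05
equal_var = p_value > alpha
print("Levene statistic:", stat)
print("p-value         :", p_value)
print("Treat variances as equal?", bool(equal_var))
```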
In [0]: #we will use researchpy and see its usefulness here.
import researchpy as rp
rp.ttest(df['bp_before'],df['bp_after'], equal_variances = False, paired =True)
Out[0]:
Wilcoxon signed-rank test results
0 Mean for bp_before = 156.450000
1 Mean for bp_after = 151.358333
2 T value = 2234.500000
3 Z value = -3.191600
4 Two sided p value = 0.001400
5 r= -0.206000
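Here we have used the researchpy library for the Wilcoxon signed-rank test. We specified equal_variances = False because the
sample variances are not equal and paired = True because both measurements come from the same patients. With this combination rp.ttest reports Wilcoxon signed-rank results, as the output header shows.
We see that the p-value (0.0014) is less than alpha (0.05), so we can reject the null hypothesis.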
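The same kind of test is also available directly in SciPy as scipy.stats.wilcoxon. The sketch below uses synthetic paired data (not the blood pressure dataset), so its numbers will not match the researchpy output above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic paired measurements for 120 subjects (not the real data)
before = rng.normal(156, 11, size=120)
after = before - rng.normal(5, 10, size=120)

# Wilcoxon signed-rank test on the paired differences (two-sided by default)
stat, p_value = stats.wilcoxon(before, after)

alpha = 0.05
print("Wilcoxon statistic:", stat)
print("p-value           :", p_value)
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```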
Thank You