DSBA Master Codebook - Python and Statistics
Preface
Data Science is the art and science of solving real-world problems and making data-driven decisions. It involves an
amalgamation of three aspects, and a good data scientist has expertise in all three of them. These are:
Your lack of expertise should not become an impediment on your journey into Data Science. With consistent effort, you
can become fairly proficient in coding over a period of time. This Codebook is intended to help you become
comfortable with the finer nuances of Python and can be used as a handy reference for anything related to data science
code throughout the program journey and beyond.
Please keep in mind that there is no one right way to write code to achieve an intended outcome. There can be multiple
ways of doing things in Python. The examples presented in this document use just one of the approaches to perform
the analysis. Please explore different ways of achieving the same result on your own.
Contents
PREFACE
Loops
For Loop
While Loop
PRE-PROCESSING
Null Value Check
Scaling
Standard Scaler
Min-Max Scaler
STATISTICS
VISUALISATIONS
Histogram
Pairplot
Boxplot
Table of Figures
Figure 1: Syntax Error
Figure 2: Syntax Error - EOF while parsing
Figure 3: Index Error
Figure 4: Module Not Found Error
Figure 5: Import Error
Figure 6: Key Error
Figure 7: Value Error
Figure 8: Type Error
Figure 9: Name Error
Figure 10: Indentation Error
Figure 11: Zero Division Error
Table of Equations
Equation 1: Z-Statistic
Equation 2: T-Statistic
Equation 3: T-Statistic for 2 samples
Equation 4: F-Statistic
Python Basics
Arithmetic Operations
Addition
a=4
b=7
a+b
11
Subtraction
a-b
-3
Multiplication
a*b
28
Division
a/b
0.5714285714285714
Square
a**2
16
Square Root
a**0.5
2.0
Loops
For Loop
When the number of iterations required is known in advance (say n), a 'for' loop is used, as in the sketch below.
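A minimal sketch (the original example is not reproduced here; looping five times is an illustrative choice):
for i in range(5):    # i takes the values 0, 1, 2, 3 and 4
    print(i)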
While Loop
When the number of iterations required is not known in advance, a 'while' loop is used; it keeps repeating as long as its condition remains True, as in the sketch below.
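A minimal sketch on the same illustrative lines:
i = 0
while i < 5:          # repeats as long as the condition is True
    print(i)
    i += 1            # without this update the loop would never stop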
Conditional Statements
IF Statement
This statement checks the condition provided and, if the condition is True, the program moves ahead with the defined steps.
x=300
if x>200:
    print('Hey, x is greater than 200!')
Hey, x is greater than 200!
IF-Else Statement
This statement is an extension of the IF statement. The program moves to the Else block and performs the alternative steps defined there in
case the 'IF' condition is not met.
x=100
if x>200:
    print('Hey, x is greater than 200!')
else:
    print('Hey, x is smaller than 200!')
Hey, x is smaller than 200!
IF-ELIF Statement
This statement is an extension of the IF-Else statement. The program moves to the ELIF (else-if) statement and performs the alternative steps defined
there in case the 'IF' condition is not met; if the ELIF condition is also not met, the program moves on to another ELIF or to the Else
statement.
Syntax along with an example. In this example, else would be executed only when x equals y.
x=100
y=150
if x>y:
    print('Hey, x is greater than y!')
elif x<y:
    print('Hey, y is greater than x!')
else:
    print('Hey, x is equal to y!')
Hey, y is greater than x!
User Defined Functions
User-defined functions are very helpful in automating repetitive tasks, like selecting odd numbers out of a series or converting a
series of dates to timestamps.
#A function is defined using 'def' followed by the function name and the arguments the function takes in
def addition(x,y):
    return (x+y) #'return' sends the result back to the caller
#This function returns the sum of the two variables passed into it.
#calling a function
a=1
b=2
addition(a,b)
3
Importing Libraries/Modules
A module is a file containing Python definitions and statements. Use the 'import' statement, together with 'as' to give the module an alias.
Example:
import pandas as pd
Here, we are importing pandas module with an alias ‘pd’.
Python Error Debugging
Syntax Error
Raised when the syntax used is wrong, most commonly when one or more parentheses are missing. If the closing ')' of a print statement is left out, Python reports a SyntaxError such as 'unexpected EOF while parsing', as in the example below.
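An illustrative snippet that reproduces this error:
# The closing ')' is missing, so Python raises a SyntaxError
print('Hello'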
Index Error
Raised when an index position is out of bounds. In the example below, lst[8] looks for index position 8, which is not present in the given
list lst.
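An illustrative example (the list contents are assumed, as the original snapshot is not reproduced here):
lst = [10, 20, 30]    # a list with valid index positions 0 to 2
lst[8]                # IndexError: list index out of range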
Import Error
Raised when the specified name cannot be imported from a module; in Figure 5, 'feature_importances_' is not found in the module sklearn.tree.
Figure 5: Import Error
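An illustrative snippet (assuming scikit-learn is installed):
# 'feature_importances_' is an attribute of a fitted tree model, not an importable name
from sklearn.tree import feature_importances_    # ImportError: cannot import name 'feature_importances_'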
Key Error
Raised when a key that is looked up is not present in the given dictionary. In the example below, the key 'd' does not exist.
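An illustrative example:
d = {'a': 1, 'b': 2, 'c': 3}    # dictionary with keys 'a', 'b' and 'c' only
d['d']                          # KeyError: 'd'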
Value Error
Raised when an inappropriate value is passed into a function. In the example below, the string 'hello' cannot be typecast to float.
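An illustrative example:
float('hello')    # ValueError: could not convert string to float: 'hello'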
Type Error
Raised when an unsupported operation is performed, such as subtracting the integer 10 from the string '10' in the example below.
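An illustrative example:
'10' - 10    # TypeError: unsupported operand type(s) for -: 'str' and 'int'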
Name Error
Raised when an undefined variable or object is used (Figure 9), like in the example below.
Figure 9: Name Error
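An illustrative example (the variable name is assumed):
print(result)    # NameError: name 'result' is not defined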
Indentation Error
Raised when the indentation is incorrect, like the 'for' loop in the example below whose body is not indented.
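An illustrative example:
for i in range(3):
print(i)    # IndentationError: expected an indented block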
Pre-processing
Null Value Check
df.isnull().sum() # Count of null values in each column of the data frame 'df'
To impute null values, use SimpleImputer, an imputation transformer for completing missing values. For numeric data, 'median' is a robust
choice (especially when there are outliers); for categorical data, use 'most_frequent'.
Below, the columns are first separated by type; for the "object" type (categorical) columns, the SimpleImputer strategy 'most_frequent'
is used.
objects=df[cols].select_dtypes(include='object').columns       # 'cols' is the list of columns being imputed
non_objects=df[cols].select_dtypes(exclude='object').columns   # numeric (non-object) columns
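A minimal sketch of the imputation itself, assuming the 'objects' and 'non_objects' column lists above; using 'median' for the numeric columns is an assumption here:
from sklearn.impute import SimpleImputer

cat_imputer = SimpleImputer(strategy='most_frequent')   # for object (categorical) columns
num_imputer = SimpleImputer(strategy='median')          # for numeric columns

df[objects] = cat_imputer.fit_transform(df[objects])
df[non_objects] = num_imputer.fit_transform(df[non_objects])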
Outlier Check
Use this custom function to check and treat outliers using the IQR method: values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are capped at those
bounds. An alternative is percentile capping (e.g. "95% - 5%"); based on the count of outliers, a different capping percentage such as 99%-1% could be chosen.
def remove_outlier(col):
    Q1,Q3=np.percentile(col,[25,75])   # first and third quartiles
    IQR=Q3-Q1
    lower_range= Q1-(1.5 * IQR)
    upper_range= Q3+(1.5 * IQR)
    return lower_range, upper_range

lower_range,upper_range=remove_outlier(df[column])                    # 'column' is the column being treated
df[column]=np.where(df[column]>upper_range,upper_range,df[column])    # cap high outliers
df[column]=np.where(df[column]<lower_range,lower_range,df[column])    # cap low outliers
Splitting arrays or matrices into random train and test subsets: the model will be fitted on the train set and predictions will be made on the
test set, as in the sketch below.
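A minimal sketch using scikit-learn's train_test_split; the column name 'target', the 70:30 split and the random_state are illustrative assumptions:
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)    # independent variables ('target' is a placeholder column name)
y = df['target']                 # dependent/target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)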
Scaling
Standard Scaler
It is a scaling technique that transforms the data so that each feature has a mean equal to zero and a standard deviation equal to 1, as in the sketch below.
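A minimal sketch, assuming the train/test split shown earlier:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training data only
X_test_scaled = scaler.transform(X_test)         # apply the same scaling to the test data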
Min-Max Scaler
This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero
and one.
Source: scikit-learn
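A minimal sketch, on the same assumptions as the Standard Scaler example above:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                          # default feature_range is (0, 1)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)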
Statistics
Packages Required: Usually the packages required for this module are numpy, pandas, statsmodels, scipy.stats.
Descriptive Statistics
Descriptive Statistics is a collective term for the summary statistics of a data set. It comprises the mean, median, mode, standard
deviation, Inter Quartile Range (IQR), etc., and is studied through tables, graphs and charts.
Population refers to the entire set of observations of a data set. Sample refers to a subset of the population on which most studies
about the population are based. A sample is drawn randomly to make inferences about the population parameters.
import numpy as np
a = 100,56,29,90,102,134,809
np.mean(a)
188.57142857142858
Alternatively, print(df.mean()) would give you the mean of all numeric columns of the data frame 'df'.
Please note that you need to import the numpy package for the np.mean() function.
import numpy as np
a = 100,56,29,90,102,134,809
np.median(a)
Output: 100.0
Alternatively, print(df.median()) would give you the median of all numeric columns of the data frame 'df'.
print(df.mode()) gives mode of each column of data frame df. If no value appears more than once it displays NaN as output.
Please note that you can generate the five-point summary which will display most of the measures at one place. It can be generated
using df.describe() or df.describe().T
Measures of Dispersion
Measures of Dispersion (Spread) are statistics that describe how the data varies. A measure of dispersion gives us a sense of how
much the data tends to diverge from the central tendency.
Range: It shows the spread of the values contained in the variable. It is the difference between the maximum and minimum
values.
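A one-line illustration, assuming a numeric column 'column' in the data frame 'df':
data_range = df['column'].max() - df['column'].min()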
Interquartile Range (IQR)
IQR gives a much better idea about the spread of the data because it is not affected by outliers, which makes it more
popular than the Range.
import numpy as np
Q1,Q3=np.percentile(col,[25,75]) #col is the column, e.g. df['column']
IQR=Q3-Q1
IQR
Correlation
Correlation measures how strongly two variables are related to each other.
df.corr()
This gives the correlation table showing all variables against each other.
Correlation Plot
import seaborn as sns
corr = df.corr()              # the correlation table computed above
sns.heatmap(corr, annot=True)
This presents a pictorial representation of the correlation data frame and is much easier to make sense of, for smaller data frames.
For bigger data frames with a lot of variables, a correlation table would be preferable.
Skewness
Skewness shows the asymmetry in the data. It shows where most of the data points lie.
import pandas as pd
import scipy.stats as stats
Skewness = pd.DataFrame({'Skewness' : [stats.skew(df.col1),stats.skew(df.col2),stats.skew(df.col3)]}, index=['col1','col2','col3'])
This checks the skewness of columns col1, col2 and col3 of the data frame 'df'. The values below are representative output for illustration:
Col1 0.283729
Col2 0.055610
Col3 1.514180
Visualisations
Histogram:
df.hist() # Histogram of all continuous variables of the data frame ‘df’
df['column'].hist() # Histogram of a particular column of the data frame 'df'
Pairplot:
It is a powerful plot which is used to see the distributions and correlations of all variables of the data frame 'df', as in the sketch below.
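A minimal sketch:
import seaborn as sns
sns.pairplot(df)    # scatter plots for every pair of numeric variables, with distributions on the diagonal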
Boxplot:
df.boxplot(figsize=(15,4))
Probability Distributions
A probability distribution describes the likelihood of all the possible values a random variable can take within a given range.
Please note that loc=0 and scale =1 are default values in the codes below.
Normal Distribution
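The original code for this section is not reproduced here; a minimal sketch using scipy.stats.norm, where loc is the mean and scale is the standard deviation:
import scipy.stats as stats
stats.norm.pdf(0, loc=0, scale=1)      # probability density at x = 0
stats.norm.cdf(1.96, loc=0, scale=1)   # cumulative probability P(X <= 1.96), approximately 0.975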
Binomial Distribution
import scipy.stats as stats
n,p = 10,0.22
k=0
stats.binom.pmf(k,n,p)
0.083357758312362 #output
n = number of trials
p = Probability of success
k = random variable (number of successes)
For Cumulative:
stats.binom.cdf(k,n,p)
If k=0, then the outcome would be same as the above output.
Poisson Distribution
import scipy.stats as stats
stats.poisson.pmf(k, mu) #k = random variable, mu = average rate (scipy uses 'mu' because 'lambda' is a reserved word in Python)
For Cumulative:
stats.poisson.cdf(k, mu)
Inferential Statistics
The key idea in Inferential Statistics is inference: we draw a sample from the population and put the sample through various
tests to make inferences about the population.
Hypothesis Testing
Step 1: State the Null and Alternate Hypotheses. They can be mentioned as a comment or in a markdown cell; having them in the sheet itself is helpful.
The Null Hypothesis is often denoted by H0 or HNull and the Alternate Hypothesis is denoted by H1 or HA.
Step 2: Decide a significance level (Alpha) and state it in the sheet itself in comments or markdown.
(Generally the significance level is 5%; however, it can be increased or decreased as per the situation at hand.)
Step 3: Identify the test to be undertaken and find the critical value.
** Please note that if Alpha is 5% (0.05), then for a one-tailed test the full 0.05 lies in a single tail, whereas for a two-tailed test Alpha is
split equally between the two tails, so each tail uses 0.025.
Step 4: Take a decision to accept or reject H0 based on the critical value or the p-value approach, and reach conclusions.
Golden Rule: If the p-value is low (lower than Alpha), the Null Hypothesis should be rejected (reject H0) and the Alternate
Hypothesis accepted.
If the p-value is greater than Alpha, we fail to reject the Null Hypothesis.
z-Distribution
import scipy.stats as stats
cv = stats.norm.ppf(Alpha, 0, 1) #Alpha is the chosen significance level; loc=0, scale=1
t-Distribution
import scipy.stats as stats
cv=stats.t.ppf(0.05, df) #df is the degrees of freedom (n-1)
f- Distribution
import scipy.stats as stats
stats.f.ppf(0.95,dfn=4,dfd=26) #q=0.95
2.7425941372218587
Chi Square
Sample Code:
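The original sample code is not reproduced here; a minimal sketch of a chi-square test of independence with scipy.stats, where the column names 'col1' and 'col2' are illustrative:
import pandas as pd
import scipy.stats as stats

crosstab = pd.crosstab(df['col1'], df['col2'])                    # contingency table of two categorical columns
chi2, p_value, dof, expected = stats.chi2_contingency(crosstab)   # test statistic, p-value, degrees of freedom, expected counts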
Z_\text{statistic} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
Equation 1: Z-Statistic
Where 𝑋̅ is sample mean, 𝜇 is population mean, 𝜎 is population standard deviation, and n is sample size.
Sample Code:
For an array, the following code can be used to calculate the z score of each element.
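A minimal sketch using scipy's zscore; the array values are illustrative (reused from the descriptive statistics examples above):
import scipy.stats as stats
a = [100, 56, 29, 90, 102, 134, 809]
stats.zscore(a)    # z score of each element: (value - mean) / standard deviation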
t_\text{statistic} = \frac{\bar{X} - \mu}{s / \sqrt{n}}
Equation 2: T-Statistic
Where 𝑋̅ is sample mean, 𝜇 is population mean, 𝑠 is sample standard deviation, and n is sample size.
t_\text{statistic} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
Equation 3: T-Statistic for 2 samples
f-test
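The body of this section is not reproduced here. The standard F-statistic for comparing two sample variances (Equation 4 in the Table of Equations) is F_\text{statistic} = \frac{s_1^2}{s_2^2}, where s_1^2 and s_2^2 are the two sample variances. A minimal sketch, where 'sample1' and 'sample2' are illustrative arrays:
import numpy as np
import scipy.stats as stats

sample1 = [23, 25, 28, 30, 32]                            # illustrative samples
sample2 = [20, 22, 27, 35, 40]
F = np.var(sample1, ddof=1) / np.var(sample2, ddof=1)     # ratio of sample variances
dfn, dfd = len(sample1) - 1, len(sample2) - 1             # degrees of freedom
p_value = 1 - stats.f.cdf(F, dfn, dfd)                    # one-tailed p-value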