EDA Cheatsheet - Class Note
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 1
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.
This cheatsheet contains the syntax of the code used in the EDA content.
Assume we have a dataframe named df, with variables (columns) named
variable_name1, variable_name2, and so on.
import numpy as np
import pandas as pd
print(np.__version__)
print(pd.__version__)
To import the file:
df = pd.read_csv('filename.csv')
## here the file is of type csv
To understand the statistics/description of the data,
such as count, mean, std, min, 25%, 50%, 75%, and max:
df.describe()
If we want to replace a value (here, let's say -1)
with a particular column value, first look up that value:
New_variable_name = df[df['variable_name1']<condition]['variable_name2'].values[0]
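A minimal sketch of this lookup-and-replace pattern, assuming a small made-up dataframe in which -1 marks a missing age. The column names, the condition, and the final replace() step are illustrative assumptions, not part of the original slide.

```python
import pandas as pd

# Hypothetical data: -1 in 'age' is a placeholder for a missing value.
df = pd.DataFrame({
    "group": ["a", "b", "c"],
    "age": [25, -1, 40],
    "typical_age": [24, 31, 42],
})

# Look up the value of another column in the first row that meets a condition.
new_value = df[df["group"] == "b"]["typical_age"].values[0]

# Replace the placeholder with the looked-up value.
df["age"] = df["age"].replace(-1, new_value)
print(df["age"].tolist())
```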
To count how many times each value occurs in a column:
df['variable_name1'].value_counts()
IMPUTATION
Imputation using mean(), in case no outliers are present:
df['variable_name1'] = df['variable_name1'].fillna(df['variable_name1'].mean())
How to create dataframes by data type from the given data:
df_num = df.select_dtypes(['float64', 'int64'])
df_cat = df.select_dtypes(['object'])
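As a rough illustration of mean imputation followed by a split by data type, the sketch below uses a made-up dataframe (the column names are assumptions). It selects with include="number" and exclude="number" instead of exact dtype names, which is one way to avoid depending on how a given pandas version stores integers or strings.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing numeric value.
df = pd.DataFrame({
    "height": [150.0, np.nan, 170.0],
    "weight": [50, 60, 70],
    "city": ["Pune", "Delhi", "Pune"],
})

# Fill the missing height with the column mean (reasonable when there are no outliers).
df["height"] = df["height"].fillna(df["height"].mean())

# Split the columns by data type.
df_num = df.select_dtypes(include="number")
df_cat = df.select_dtypes(exclude="number")
print(df_num.columns.tolist(), df_cat.columns.tolist())
```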
The same can be done for df_cat.
How to drop rows with null values:
df.dropna()
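The source does not show how the categorical columns are imputed. A common choice, assumed here, is the most frequent value (the mode). The sketch also shows that dropna() returns a new dataframe rather than changing df in place. All data and column names are made up.

```python
import pandas as pd

# Assumption: impute a categorical column with its most frequent value.
df_cat = pd.DataFrame({"city": ["Pune", "Delhi", None, "Pune"]})
most_frequent = df_cat["city"].mode()[0]
df_cat["city"] = df_cat["city"].fillna(most_frequent)

# dropna() returns a new dataframe; assign the result to keep it.
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})
df_clean = df.dropna()
print(len(df), len(df_clean))
```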
Univariate Analysis
VISUALIZATION
Pick any number of fields to perform univariate analysis:
df[['variable_name1', 'variable_name2']].describe()
## kde=True overlays a density curve to show the shape of the distribution
## ax=axes[row][col] selects the row and column of the drawing
## board (subplot grid) where the plot will be displayed
## histplot or displot can be used; only histplot accepts ax, since displot creates its own figure
Bivariate Analysis
VISUALIZATION
Here, let's analyze using two variables.
Both are numeric variables: use a scatterplot.
plt.scatter(df['variable_name1'], df['variable_name2'])
## margins gives the totals of the values along the x and y axes
To capture all numeric fields, check their combinations (put them in pairs):
sns.pairplot(df)
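The margins comment above seems to refer to a cross-tabulation of two categorical variables, but the code is missing from the source. Assuming pd.crosstab was intended, a small sketch with made-up data:

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "smoker": ["yes", "no", "no", "no", "yes"],
})

# margins=True adds an "All" row and an "All" column holding the totals.
table = pd.crosstab(df["gender"], df["smoker"], margins=True)
print(table)
```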
Multivariate analysis involves more than two variables.
We can put one categorical variable on the x-axis and one continuous
variable on the y-axis, and represent the third variable by color.
sns.boxplot(data=df, x='categorical_variable1', y='numeric_variable', hue='categorical_variable2')
In case a categorical variable has many levels, we can use FacetGrid:
a = sns.FacetGrid(df, col='categorical_variable1', hue='categorical_variable2', col_wrap=3, height=3)
where col_wrap=3 shows at most three plots in a row.
DATA PREPARATION
SCALING: Scaling gives the same weightage to all numeric
variables, irrespective of the range of values they contain.
To collect all numeric columns together and all categorical columns
together, we can create two lists and fill them with a for loop:
cat = []  # categorical list
num = []  # numeric list
# the for loop checks every column: columns with datatype object are
# appended to the cat list, and all other (numeric) columns to the num list
for i in df.columns:
    if df[i].dtype == "object":
        cat.append(i)
    else:
        num.append(i)
# to see the lists, print them
print(cat)
print(num)
METHODS / TECHNIQUES FOR SCALING
Z-SCORE: How to apply zscore to numeric data
to scale or standardize it.
We import zscore from the scipy.stats library:
from scipy.stats import zscore
data_scaled = df[num].apply(zscore)
# here a new dataframe, data_scaled, is created so that it can be compared
# with the original dataframe. The zscore function is applied to
# all numeric fields in the dataframe
Note: applying zscore centers the data: the mean is close to 0 and the
standard deviation is 1. The scale also changes (i.e. the values change)
and ends up in a range that is comparable across variables.
We can draw a histplot for both dataframes, one before scaling and
the other after scaling, as explained previously.
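A quick numeric check of the note above, using made-up data. One detail worth knowing: scipy's zscore divides by the population standard deviation (ddof=0), while pandas .std() defaults to ddof=1, so the check passes ddof=0 explicitly.

```python
import pandas as pd
from scipy.stats import zscore

# Hypothetical numeric data on very different scales.
df = pd.DataFrame({
    "income": [20000, 35000, 50000, 65000, 80000],
    "age": [22, 30, 38, 46, 54],
})
num = ["income", "age"]

data_scaled = df[num].apply(zscore)

# After scaling, each column has mean ~0 and (population) standard deviation ~1.
print(data_scaled.mean().round(6))
print(data_scaled.std(ddof=0).round(6))
```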
Scaling can be done using the StandardScaler class as well. Import it
from sklearn.preprocessing, create an object, fit the numeric dataframe to it, and
transform it. The result is an array, which needs
to be converted back to a dataframe.
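The steps described above can be sketched as follows. The data and column names are made up; fit_transform combines the fit and transform steps in one call.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data.
df = pd.DataFrame({
    "income": [20000, 35000, 50000, 65000, 80000],
    "age": [22, 30, 38, 46, 54],
})
num = ["income", "age"]

scaler = StandardScaler()                     # create the object
scaled_array = scaler.fit_transform(df[num])  # fit and transform; returns a NumPy array

# Convert the array back to a dataframe, keeping column names and index.
data_scaled = pd.DataFrame(scaled_array, columns=num, index=df.index)
print(data_scaled.head())
```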
Another method is MinMax scaling (the minimum value becomes
0, the maximum becomes 1, and all other values lie between 0 and 1):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(df[num])
# created the object & fit the numeric fields
data_minmax = scaler.transform(df[num])
Note: the scaling technique is chosen based on the needs/requirements of the algorithm.
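As with StandardScaler, transform() returns an array. A short sketch, on made-up data, of wrapping the result back into a dataframe and confirming the 0 to 1 range:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric data.
df = pd.DataFrame({
    "income": [20000, 35000, 50000, 65000, 80000],
    "age": [22, 30, 38, 46, 54],
})
num = ["income", "age"]

scaler = MinMaxScaler().fit(df[num])
data_minmax = pd.DataFrame(scaler.transform(df[num]), columns=num, index=df.index)

# Each column now runs from 0 (its minimum) to 1 (its maximum).
print(data_minmax.min())
print(data_minmax.max())
```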
TRANSFORMATION
Check for skewness/kurtosis in the data and
transform accordingly.
To check skewness / kurtosis:
df['variable_name'].skew()
# helps us understand how symmetric the distribution is
df['variable_name'].kurtosis()
# helps us understand the sharpness/peak (and tail heaviness) of the distribution
To see the transformation on a plot (log transformation):
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
sns.histplot(np.log(df['variable_name']), kde=True, ax=axs[0])
sns.boxplot(x=np.log(df['variable_name']), ax=axs[1])
Sqrt transformation:
print(np.sqrt(df['variable_name']).skew())
print(np.sqrt(df['variable_name']).kurtosis())
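To compare how strongly each transformation reduces skewness, the skew values can be printed side by side. The sample below is made up and deliberately right-skewed. Both transformations assume positive values (log is undefined at 0 and below).

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive sample.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 15, 40, 100], dtype=float)

skew_raw = s.skew()
skew_sqrt = np.sqrt(s).skew()
skew_log = np.log(s).skew()

# The stronger the transformation, the more the right skew is pulled in.
print("raw :", skew_raw)
print("sqrt:", skew_sqrt)
print("log :", skew_log)
```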
Distribution means more spread
Using the tenth-root transformation:
print((df['variable_name']**0.1).skew())
print((df['variable_name']**0.1).kurtosis())
fig_dims = (10, 5)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=fig_dims)
sns.histplot((df['variable_name']**0.1), kde=True, ax=axs[0])
sns.boxplot(x=(df['variable_name']**0.1), ax=axs[1])
Note: the transformation must be reversed to get results back on the original scale.
Apply a square where a square root was applied, or an exponential in case of log.
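A small sketch of reversing each transformation, on made-up values. The tenth-root case (raise to the power 10) is added by analogy with the other two.

```python
import numpy as np
import pandas as pd

# Hypothetical positive values.
s = pd.Series([1.0, 4.0, 9.0, 100.0])

# log -> reverse with the exponential
restored_from_log = np.exp(np.log(s))

# square root -> reverse with a square
restored_from_sqrt = np.sqrt(s) ** 2

# tenth root -> reverse by raising to the power 10
restored_from_tenth_root = (s ** 0.1) ** 10

print(restored_from_log.tolist())
```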
Outlier treatment
Sometimes we might not need to apply outlier treatment,
if treating the outliers would hamper the data.
Z-score: standardization
Using the z-score technique, treat a value as an outlier if its
z-score is either less than -3 or greater than +3:
z = (x - μ) / σ
Calculate the threshold value that will be imputed to the
outliers, i.e. x = (z * σ) + μ
Here, z is 3, so the threshold is
3 * σ + μ
impute_value = (3 * df.variable_name.std()) + df.variable_name.mean()
round(impute_value, 2)
Imputing this value in place of the outliers caps them at the
threshold and reduces their distortion of the distribution.
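One possible way to apply this threshold is with clip(). The source only shows the upper value (z = +3); the lower bound at z = -3 is added here by symmetry and is an assumption. The data is random with one artificial outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 ordinary values plus one artificial outlier at the end.
rng = np.random.default_rng(42)
values = np.append(rng.normal(loc=50, scale=5, size=200), [150.0])
df = pd.DataFrame({"variable_name": values})

mean = df["variable_name"].mean()
std = df["variable_name"].std()
upper = round(mean + 3 * std, 2)  # threshold for z = +3, as on the slide
lower = round(mean - 3 * std, 2)  # assumed symmetric threshold for z = -3

# Values beyond the thresholds are replaced by the thresholds themselves.
df["variable_capped"] = df["variable_name"].clip(lower=lower, upper=upper)
print(df["variable_name"].max(), df["variable_capped"].max())
```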
Use a boxplot to detect outliers:
Points above or below the whisker values are outliers.
def detect_outlier(col):
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    # the source stops here; the usual 1.5 * IQR whisker bounds are assumed
    return Q1 - (1.5 * IQR), Q3 + (1.5 * IQR)
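A self-contained sketch of the boxplot rule on made-up data. The helper name iqr_bounds and the 1.5 * IQR whisker multiplier are assumptions (1.5 is the conventional boxplot default); the source function is cut off before that step.

```python
import numpy as np
import pandas as pd

def iqr_bounds(col):
    """Return the (lower, upper) whisker bounds using the 1.5 * IQR rule."""
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data with one obvious outlier.
s = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 95])

lower, upper = iqr_bounds(s)
outliers = s[(s < lower) | (s > upper)]  # points beyond the whiskers
print(lower, upper, outliers.tolist())
```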
One way:
Capping the value at the 99th percentile
q99percent = df.variable_name.quantile(q=0.99)
(the value is calculated and stored in q99percent)
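The source computes the 99th percentile but does not show the capping step itself. One way to do it, assumed here, is clip(upper=...), shown on made-up data:

```python
import pandas as pd

# Hypothetical data: 1..99 plus one extreme value.
df = pd.DataFrame({"variable_name": list(range(1, 100)) + [1000]})

q99percent = df["variable_name"].quantile(q=0.99)

# Every value above the 99th percentile is replaced by the 99th percentile.
df["variable_capped"] = df["variable_name"].clip(upper=q99percent)
print(q99percent, df["variable_capped"].max())
```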
ENCODING: 2 OPTIONS
LABEL ENCODING: the result should be numeric
Can be applied to ordinal variables: we change the categorical
variable to numeric values (it assigns numbers
based on the alphabetical order of the category names in that field).
The first step is to change the datatype to category.
Below is the syntax:
df['categorical_variable_name'] = df['categorical_variable_name'].astype('category')
df['categorical_variable_name'] = df['categorical_variable_name'].cat.codes
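Because the codes follow alphabetical order, they may not match the real order of an ordinal variable. The sketch below, on a made-up size column, shows the default codes next to an explicitly ordered alternative. The ordered pd.Categorical variant is an addition, not part of the source.

```python
import pandas as pd

# Hypothetical ordinal variable.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Default: categories are sorted alphabetically (large=0, medium=1, small=2).
alphabetical = df["size"].astype("category").cat.codes

# Alternative: state the intended order explicitly (small=0, medium=1, large=2).
ordered = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True
).codes

print(alphabetical.tolist())
print(ordered.tolist())
```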
Preferred if the number of levels in the categorical variable is less than 20.
Syntax to perform one-hot encoding:
cat.remove('categorical_variable')  # drop from the list any column that should not be one-hot encoded
df_new = pd.get_dummies(df, columns=cat, drop_first=True)
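A small illustration of what get_dummies produces, on a made-up dataframe. With drop_first=True the first level (alphabetically, here "blue") is dropped, so a row with zeros in all remaining dummy columns means "blue".

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [10, 12, 9, 11],
})
cat = ["color"]

df_new = pd.get_dummies(df, columns=cat, drop_first=True)
print(df_new.columns.tolist())
```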