
Machine Learning with Python

PART - 2

Debdeep Chaudhuri | Machine Learning | 02.07.22


[email protected]
Objective
This is part 2 of my ongoing article series on Machine Learning / Deep Learning / AI.

In the first part we covered the very basic steps of data preprocessing, i.e. getting the data ready to be used for machine learning and building an ML model using Python.

Now, in part 2, we will take it one step further.

We will create our first ML model to predict car prices based on the available information.

STEPS
For this purpose, we will work through the following steps:

1. Creating a pandas DataFrame from a .csv file downloaded from kaggle.com.
2. Finding noise / outliers in the data & removing them.
3. Using ColumnTransformer to encode our data with OneHotEncoder / OrdinalEncoder under one umbrella.
4. Building our first ML model & predicting car prices.
5. Validating the model & finding out its score / accuracy.

For this demo I will be writing my Python code using Visual Studio Code, but you can use any Python notebook editor you like.

I am not going to write down steps like "How to install Python / Anaconda" or "How to set up VS Code to work with Python notebooks", because you can find many articles on these topics using any search engine, such as Google.

I will share links for this purpose in the "More to Read" section at the very end of this article.

CREATING A PANDAS DATA FRAME FROM A DOWNLOADED .CSV FILE

First we need to import some libraries to do our work in the Python notebook.

Code Section (Copy & Paste)


import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as snb
import matplotlib as mat

# ColumnTransformer lets us apply different encoders to different columns in one step
from sklearn.compose import ColumnTransformer

# Encoders for categorical columns & an imputer for missing values
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

Copy & paste the above code into your Python notebook & run it.

If you are using VS Code like me, do not forget to choose the kernel before running the code, as per the following image:

Now we are ready to create our pandas DataFrame from the .csv file.

For this purpose, do the following:

First, go to this URL: https://www.kaggle.com/datasets/shaistashaikh/carprice-assignment

Download the file & save it on your disk; having it available offline will help.

Now load it from your hard drive, making sure to give the right path to the file.

It will be a lot easier if you save the .csv file in the same folder where you are saving this notebook.

Code Section (Copy & Paste)


# Load the dataset into a pandas DataFrame & inspect its structure
df = pd.read_csv('CarPrice_Assignment.csv')
df.info()

If we run the code, we will get the following results.

The output shows that our data frame contains 205 rows, indexed from 0 to 204, & 26 columns.

For this demo I have not taken a screenshot of the whole output; rather, I am showing only 20 columns.

But we are not going to analyze all the columns for this demo, so we will take just a few of them.

Code Section (Copy & Paste)


dfOriginal = df[['CarName','fueltype','carbody','enginetype','cylindernumber','enginesize','fuelsystem','horsepower','price']]
dfOriginal.head(2)

Here we are taking 8 input columns & one output column, price, which we will predict.

By running the above code, we get the following result.

So we have our initial data frame ready to start our work.

But just before jumping into the world of building an ML model, it is always recommended to have a close look at the data, in our case our data frame.

Code Section (Copy & Paste)


dfOriginal.info()

By running the code, we get the following output

As expected it is a pandas DataFrame object & it has 205 rows & 9 columns.

Among these 9 columns, we will take the first 8 as our input data to test our model's predictions.

Now look very carefully at the first 8 columns. What do we find?

Well, we are closely looking at the nature / data type of these columns.

Among these 8 columns, "enginesize" & "horsepower" are numerical; the rest are of type string, categorical in nature.

Why is this so important for us?

Well my friend, as ML models only work with numerical values & not other data types, we need to convert our string-type categorical / ordinal values to numerical ones.

CHECKING FOR NOISE / OUTLIERS IN DATA
This is the very first step we need to perform.

Noise can also take the form of missing values.

If you run the code "df.isnull().sum()", it will show you whether any null values (NaN) are present in the data frame.

Fortunately, we do not have any in our working data frame.

But if you do have them in your working data frame, you need to handle them first.

If you are wondering how to do it, read my 1st article; I have already explained that to you ☺

Now we will run one particular block of code on our numerical columns.

Just run it, see the result & I will explain it to you.

Code Section (Copy & Paste)


dfOriginal.describe()

The Output

Now look at the "enginesize" & "horsepower" columns carefully.

Look in particular at the "min", "25%", "50%", "75%" & "max" rows.

Let's pick up the "enginesize" column values.

The values from min to 75% change, but not by a very large amount (61, 97, 120, 141); from 75% to max, however, we can observe a huge jump (141 to 326).

You may be wondering what this indicates.

Well, each percentile row tells you the value below which that share of the data points falls.

That means 25% of the enginesize values lie between 61 (the minimum) & 97, & similarly for the 50% & 75% rows.

So the 75% row tells us that 75% of the enginesize values lie between 61 & 141.

But only the remaining 25% of the values lie in the massive region from 141 (75%) to 326 (max).

Min~75% (61~141) vs. 75%~Max (141~326)

So the small bucket has far more data points than the large bucket.

So we have possible noise / outliers in our data.
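If you prefer numbers to eyeballing, a common rule of thumb is the IQR fence: values more than 1.5 times the inter-quartile range beyond the quartiles are flagged as potential outliers. A minimal sketch of the idea (the 1.5 multiplier is a statistical convention, not something taken from this data set):

Code Section (Copy & Paste)

# Quartiles of enginesize & the inter-quartile range between them
q1 = dfOriginal['enginesize'].quantile(0.25)
q3 = dfOriginal['enginesize'].quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything outside these bounds is a potential outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(dfOriginal[(dfOriginal['enginesize'] < lower) | (dfOriginal['enginesize'] > upper)].shape)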

Let's plot a graph to understand this more clearly.

Just run this code:

Code Section (Copy & Paste)


# Box plot of enginesize to visualise the quartiles & the outliers
snb.boxplot(data=dfOriginal,x='enginesize')

The output we get:

This is the box plot from Seaborn.

Seaborn is an excellent library if you want to analyze your data graphically.

In the above image you can see one bar around 60 & another around 200. Together they define a range.

It is showing us the inter-quartile distribution of our data points.

Do not be afraid now that we are into statistics ☺

For now, just notice that we have some points beyond the 200 mark.

These data points are called OUTLIERS or NOISE in the data.

In ML it is better to have data points as close together as possible, or to pick a segment for your analysis where most of the data points lie.

Whatever data resides outside this dense region you may call outliers; they make no significant contribution in terms of ML model building & prediction.

We can also plot a histogram to view the distribution of the data points.

Code Section (Copy & Paste)


# Histogram of enginesize with a kernel density estimate overlaid
snb.histplot(data=dfOriginal,x='enginesize',kde=True)

Now if you look at the histogram, you can see that more data points are plotted towards the left side.

This is called right skewness of the data: the long, thin tail points to the right.

In a nutshell, more data points are available on the left than on the right.

We can also get the numerical value of the skewness if we run the code

"dfOriginal['enginesize'].skew()"

For us the value is 1.95; a positive value indicates that the data is right skewed.

So we need to exclude these outliers from our data as far as possible.

You may ask: well, that is the theory, man, but how do we actually do it?

Just run the code below; it will do the work for you.

Code Section (Copy & Paste)


from scipy.stats import zscore

# Z-score of every enginesize value: its distance from the mean in standard deviations
z_scores = zscore(dfOriginal['enginesize'])
abs_z_scores = np.abs(z_scores)

# Keep only the rows within 2 standard deviations of the mean
filtered_entries = (abs_z_scores < 2)
dfOriginal = dfOriginal[filtered_entries]

snb.boxplot(data=dfOriginal,x='enginesize')

dfOriginal.shape

If we run the code, we get the following information.

As you can see, our data frame has been reduced from 205 rows to 198 & we have removed almost all the outliers.

Now if we run "dfOriginal.nunique()" we get the following result:

NOW - after noise removal | THEN - before noise removal

So before, the frame had 205 rows & 44 distinct enginesize values.

Now it has 198 rows & 39 distinct enginesize values.

So we have removed (44-39=5) distinct enginesize values & only (205-198=7) rows.

From this we can further see that the outliers we removed from our data frame did not have any significant effect on the rest of the data.

I hope outliers are much clearer to you now …. ☺

Now do the same thing with the "horsepower" column; I leave it up to you, & you can check your attempt against the sketch below.
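It is the same z-score pattern as above, just pointed at 'horsepower'. A quick sketch, assuming the zscore import from the earlier block:

Code Section (Copy & Paste)

# Same z-score filter as before, applied to the horsepower column
hp_abs_z = np.abs(zscore(dfOriginal['horsepower']))
dfOriginal = dfOriginal[hp_abs_z < 2]

snb.boxplot(data=dfOriginal,x='horsepower')
dfOriginal.shape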

Now we will work on the string / ordinal column values….

Code Section (Copy & Paste)


print(dfOriginal.nunique())

The result

As you can see, the column with by far the most distinct values is "CarName".

If we apply OneHotEncoding to this column, it will produce 139 new columns.

That is not what we really want! We need another plan.

Yes, we do have 140 unique car names, but let's find out the frequency distribution of those names.

Code Section (Copy & Paste)


# Frequency of each car name in the data frame
CarNameCount = dfOriginal['CarName'].value_counts()
CarNameCount

From the above result it is clear that many car names appear only once or twice.

Why do we focus on this?

The reason is that these car names are present in our data frame in only 1 or 2 rows each.

So they do not contribute that much!

We can easily club them together under one new group.

Let's call this new car name 'uncommon'.

But how do we do that? Let's run a few pieces of code …

Code Section (Copy & Paste)

threshold = 2
# Index (the car names) whose count is at or below the threshold
replindex = CarNameCount[CarNameCount <= threshold].index
replindex

Result

Here we are getting the index of those car names whose count is 2 or less.

Code Section (Copy & Paste)


# Replace every rare car name (the ones in replindex) with 'uncommon'
NameTransformed = dfOriginal['CarName'].replace(replindex,'uncommon').values
NameTransformed

Result

Here we have replaced those car names with 'uncommon', based on the index values we collected.

Code Section (Copy & Paste)
# Build a one-column data frame from the transformed name array
dfNameReplace = pd.DataFrame(NameTransformed, columns = ['CarName'])
dfNameReplace

Result

Here we have created a new data frame from the array we got in the previous step.

Now we have to join our new data frame with our original data frame.

But for that we first need to drop the 'CarName' column from our original data frame.

Code Section (Copy & Paste)


# Original frame minus the CarName column
dfWithoutname = dfOriginal.drop(['CarName'],axis=1)
dfWithoutname

Now both of our data frames are ready to be joined.

Code Section (Copy & Paste)


# Join the two frames column-wise into a single NumPy array
myArrry2 = np.concatenate((dfNameReplace,dfWithoutname),axis=1,casting='same_kind')
myArrry2

Result

We get a new array.

But we need a data frame to continue our work.

So we will generate a new data frame from our new array.

A data frame has column names, but our array does not.

So let's first get the column names ready for the data frame we are about to generate.

Code Section (Copy & Paste)


# Reuse the original column names (their order matches the concatenation above)
Column_Names = dfOriginal.columns.values
Column_Names

Now we will generate our new data frame from our new array.

For that we use the pandas DataFrame constructor.

Code Section (Copy & Paste)


dfJoined = pd.DataFrame(myArrry2,columns = Column_Names)
dfJoined.describe()

Result

As we can see, we have our new data frame "dfJoined" ready for further use.

Let's get some information from it.

As we can see, we have an issue here: our 'enginesize' & 'horsepower' columns are now of data type object, not numerical. So we need to convert their data types back to numerical.

Code Section (Copy & Paste)

# Convert the object columns back to numeric. Note: we convert dfJoined's own columns,
# not the original df, whose rows no longer line up after the outlier removal.
dfJoined[["enginesize", "horsepower","price"]] = dfJoined[["enginesize", "horsepower","price"]].apply(pd.to_numeric)

Code Section (Copy & Paste)


dfJoined.info()
print(" ")
dfJoined.nunique()

Result

We can see in pic-1 that 'enginesize' & 'horsepower' have been converted back to numerical types.

In pic-2 we can see that the column 'CarName' now has only 11 unique values; previously it had 140.

Now we can apply OneHotEncoding to all our string / ordinal type columns apart from 'cylindernumber'.

Why so? Let's find out.

Code Section (Copy & Paste)


dfJoined['cylindernumber'].value_counts()

As we can see, the values of the column 'cylindernumber' have a natural order to them.

So it will be better to encode this column with OrdinalEncoder, not OHE.
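To see what OrdinalEncoder does on its own, here is a tiny standalone sketch: each value is mapped to its position in the category order we supply (the three sample rows below are made up for illustration):

Code Section (Copy & Paste)

# 'two' -> 0.0, 'four' -> 2.0, 'twelve' -> 6.0: each value's position in the supplied order
enc = OrdinalEncoder(categories=[['two','three','four','five','six','eight','twelve']])
enc.fit_transform([['two'], ['four'], ['twelve']])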

So we have 5 columns ('CarName','fueltype','carbody','enginetype','fuelsystem') which will undergo OneHotEncoding, & 'cylindernumber' will have OrdinalEncoding applied to it.

But do we need to perform OHE 5 separate times & join the 5 resulting arrays with the one array coming from OrdinalEncoding?

That would be a mammoth task indeed!

So, is there an easy way around this?

Yes, fortunately we have an easier alternative way to do the same thing …. ☺

We will use ‘ColumnTransformer’ class from sklearn for this process.

COLUMNTRANSFORMER

Code Section (Copy & Paste)


ct = ColumnTransformer(
    [
        # One-hot encode the nominal columns; drop='first' avoids a redundant column per feature
        ('Trans1',OneHotEncoder(sparse=False,drop='first'),['CarName','fueltype','carbody','enginetype','fuelsystem']),
        # Ordinal-encode cylindernumber with an explicit, meaningful category order
        ('Trans2',OrdinalEncoder(categories=[['two','three','four','five','six','eight','twelve']]),['cylindernumber'])
    ],
    remainder='passthrough')  # pass the numerical columns through untouched

The ColumnTransformer does not actually do any data processing at this point.

It just remembers the steps you asked for.

If we just type "ct", it will show us the following information.

So it is basically a skeleton.
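As a side note, this "skeleton" behaviour is exactly what makes the fit / transform split useful: you fit once & can then apply the same remembered encoding to any new data. A small sketch (dfNew is a hypothetical new frame with the same columns):

Code Section (Copy & Paste)

ct.fit(dfJoined)               # learn the categories & the column layout once
arr = ct.transform(dfJoined)   # apply the remembered steps
# arr_new = ct.transform(dfNew)  # hypothetical new data would get the same encoding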

Now we can do our data processing / transformation in just one step with the help of our
defined “ct” object.

Code Section (Copy & Paste)

# Fit the transformers & apply them to the data in one go
arrayTransformed = ct.fit_transform(dfJoined)
print(arrayTransformed.shape)
print(dfJoined.shape)

Result

As we can see, our data frame had 9 columns, & after the transformation we have 32 columns. Here is how our data looks now…
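The screenshot of the transformed data is not reproduced here; if you want to peek at it yourself, one quick way (a sketch) is to wrap the array back into a frame:

Code Section (Copy & Paste)

# Quick look at the transformed, all-numeric data
pd.DataFrame(arrayTransformed).head()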

All numerical in nature & ready for building our Machine Learning Model.

So let’s build our ML model now .. ☺

ML MODEL BUILDING
For this purpose, we first need to split our array into two segments.

One segment will be used to train our model & the other will be used by our ML model for prediction.

Code Section (Copy & Paste)


from sklearn.model_selection import train_test_split

# 80% of the rows go to training; 20% are held back for testing
X_Train,X_Test,Y_Train,Y_Test = train_test_split(arrayTransformed[:,0:31],arrayTransformed[:,-1],test_size=0.2)
print(X_Train.shape)
print(X_Test.shape)

Result

Here we have used sklearn's train_test_split to split our array into two parts.

Look very carefully at the parameters we passed to this function.

The 1st parameter, "arrayTransformed[:,0:31]", holds all the input values (the first 31 columns) on which the car price prediction will be based.

The 2nd parameter, "arrayTransformed[:,-1]", is the last column: the price of the cars.

The 3rd parameter, "test_size=0.2", is, as the name suggests, the share of the data held back as test data, which our ML model will use to predict the price of the cars.

So X_Train now has 158 rows & X_Test has 40.

That basically means we will train our ML model with X_Train & Y_Train, ask it to predict prices for X_Test, & compare the results with Y_Test to derive its accuracy.
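One thing worth knowing: train_test_split shuffles the rows randomly, so your exact split (& the score later on) will differ slightly from run to run. If you want reproducible results, you can pass the optional random_state parameter, which we did not use above. A sketch:

Code Section (Copy & Paste)

# Same split as before, but reproducible across runs thanks to the fixed seed
X_Train,X_Test,Y_Train,Y_Test = train_test_split(
    arrayTransformed[:,0:31], arrayTransformed[:,-1],
    test_size=0.2, random_state=42)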

As for the ML model itself: for this type of data analysis we will build our model on the "LinearRegression" class of sklearn.

Code Section (Copy & Paste)


from sklearn.linear_model import LinearRegression

# Create the model & train it on the training portion of the data
lirM = LinearRegression()
lirM.fit(X_Train,Y_Train)

So our ML model 'lirM' has been created, & we have also trained it by calling the 'fit' method & passing in X_Train & Y_Train.

Time for some prediction ….

Code Section (Copy & Paste)


# Ask the trained model to predict prices for the unseen test rows
arrMyPrediction = lirM.predict(X_Test)
print(arrMyPrediction.shape)

print(" ")
print('The predicted values are:')
print(arrMyPrediction)

Result

This is what our ML model 'lirM' predicted: the prices of the cars.

And the following are the actual prices of the cars …

Code Section (Copy & Paste)

print("The actual Price is : ")


print(Y_Test)
print(" ")
Y_Test.shape

Let's compare them side by side …
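The side-by-side screenshot is not reproduced here; one simple way to build the comparison yourself (a sketch) is:

Code Section (Copy & Paste)

# Predicted vs. actual prices, side by side
dfCompare = pd.DataFrame({'Predicted': arrMyPrediction, 'Actual': Y_Test})
dfCompare.head(10)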

Not bad for our 1st try…

Let’s find out how accurate our ML model is!

Code Section (Copy & Paste)


# Score the model on the held-out test data
lirM.score(X_Test,Y_Test)

Result

Well, the score comes out at about 0.82 …. ☺ For LinearRegression, score returns R², the coefficient of determination, so our model explains roughly 82% of the variance in the car prices.
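If you want more than the single R² number, sklearn.metrics offers other standard regression metrics. A quick sketch:

Code Section (Copy & Paste)

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Average absolute error, in the same unit as the price itself
print(mean_absolute_error(Y_Test, arrMyPrediction))
# Root mean squared error penalises large misses more heavily
print(mean_squared_error(Y_Test, arrMyPrediction) ** 0.5)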

That's all for now, my friends.

In the next part we will be working with Pipelines to automate this process & make our lives easier …

If you like my article, please leave your comments, give it a like & share it in your social media circle.

Until then, goodbye …. Happy coding … ☺ Thanks ……….

MORE TO READ

https://www.python.org/

https://www.anaconda.com/products/distribution

https://numpy.org/learn/

https://pandas.pydata.org/docs/index.html

https://scikit-learn.org/stable/

https://code.visualstudio.com/docs/python/python-tutorial

https://www.geeksforgeeks.org/introduction-machine-learning-using-python/

https://www.tutorialspoint.com/machine_learning_with_python/index.htm

https://en.wikipedia.org/wiki/Machine_learning

https://azure.microsoft.com/en-in/overview/what-is-machine-learning-platform/

https://www.tensorflow.org/

