Machine Learning With Python - Part-2
PART - 2
In the first part we covered the very basic steps of data preprocessing, getting the data ready
to be used in machine learning, i.e. building an ML model using Python.
We will create our first ML model to predict the car price based on available information.
STEPS
For this purpose, we will be working on the following steps:
1. Creating a pandas DataFrame from a .csv file downloaded from kaggle.com.
2. Finding noise/outliers in the data & removing them.
3. Using ColumnTransformer to encode our data with
OneHotEncoder/OrdinalEncoder under one umbrella.
4. Building our first ML model & predicting the car price.
5. Validating it & finding out the score / accuracy of our ML model.
For this demo I will be writing my Python code using Visual Studio Code, but you
can use any Python notebook editor you like.
I am not going to write down steps like "How to install Python/Anaconda" or "How to
set up VS Code to work with Python notebooks", because you can find many articles
for this purpose using any search engine like Google.
I will share links for this purpose in the "More to Read" section at the very end of
this article.
Creating a pandas DataFrame from a downloaded .csv file.
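The code screenshot from the original article is not reproduced here; a minimal import block, consistent with the aliases used later in this article (pd, np, snb), might look like this:
import pandas as pd
import numpy as np
import seaborn as snb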
Copy & paste the above code into your Python notebook & run it.
If you are using VS Code like me, before running the code do not forget to choose the
kernel as shown in the following image:
Now we are ready to create our pandas DataFrame from the .csv file.
For this purpose, do the following:
Download the file & save it on your disk; having it available offline will help.
Now load it from your hard drive, making sure to give the right path to the file.
It will be a lot easier if you save the .csv file in the same folder where you are saving
this notebook.
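A minimal sketch of the loading step; the file name here is an assumption, so substitute the name of the .csv you downloaded from kaggle.com:
df = pd.read_csv('CarPrice_Assignment.csv')  # file name is an assumption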
df.info()
If we run the code, we get the following result.
The output shows that our data frame contains 205 rows, indexed from 0 to 204, & 26
columns.
For this demo I have not captured the full output in the screenshot; I am showing only 20
columns.
But we are not going to analyse all the columns for this demo, so we will take just a few
of them.
Here we are taking 8 input columns & one output column, the price column, which we will
predict.
But before jumping into the world of building an ML model, it is always recommended
to have a close look at the data, in our case our data frame.
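A hedged sketch of the selection step: "CarName", "cylindernumber", "enginesize", "horsepower" & "price" are named in this article, while the other four categorical column names are assumptions, chosen to stay consistent with the 32 columns we end up with after encoding:
# the four middle categorical names are assumptions from the Kaggle dataset
dfOriginal = df[['CarName', 'fueltype', 'aspiration', 'carbody',
                 'fuelsystem', 'cylindernumber', 'enginesize',
                 'horsepower', 'price']]
print(type(dfOriginal))
print(dfOriginal.shape)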
As expected it is a pandas DataFrame object & it has 205 rows & 9 columns.
Among these 9 columns we will take the first 8 as the input data for our model's
predictions.
Now look very carefully at those first 8 columns; what do we find?
Well, we are looking closely at the nature / data type of these columns.
Among the 8 columns, "enginesize" & "horsepower" are numerical; the rest
are of type string, categorical in nature.
Well my friend, as ML models only work with numerical values & not other data types, we
need to convert our string-type ordinal values to numerical ones.
CHECKING FOR NOISE / OUTLIERS IN DATA
This is the very first step we need to perform.
If you run the code block "df.isnull().sum()", it will show you whether any null values
(NaN) are present in the data frame.
If you do have them in your working data frame, you need to handle them first.
If you are wondering how to do it, then read my 1st article; I have already explained that to
you ☺
Now we will run one particular code block on our numerical columns.
Just run it, see the result & I will explain it to you.
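The code block itself did not survive the extraction; given the min/25%/50%/75%/max rows in the output, it is presumably pandas' describe():
dfOriginal[['enginesize', 'horsepower']].describe()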
The output
Look at the rows "min", "25%", "50%", "75%" & "max".
The values from min to 75% increase without any dramatic jumps (61, 97, 120, 141), but
from 75% to max we observe a huge change in value (141 to 326).
Well, the 25% value (the first quartile) tells us below which value 25% of the data
points fall.
That means 25% of our enginesize values reside between 61 and 97, and the 50% & 75% rows read the same way.
So the 75% row tells us that 75% of our enginesize data points reside between the values 61 and 141.
But the massive region from 141 (75%) to 326 (max) contains only 25% of the data
points.
So the small bucket has far more data points than the large bucket.
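We can visualise the same spread with a box plot (the identical call appears again later in this article):
snb.boxplot(data=dfOriginal, x='enginesize')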
The output we get
Seaborn is an excellent library if you want to analyse your data graphically.
In the image above you can see one bar around 60 & another around 200; they define a
range.
For now, just note that we have some points beyond the 200 mark.
In ML it is better to have the data points as close together as possible, or to pick a segment
for your analysis where you can find more data points, you might say more data for the analysis.
Whatever data reside outside this dense region can be called outliers, which make no
significant contribution to ML model building & prediction.
We can also plot a histogram to view the distribution of the data points.
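One way to draw it, assuming matplotlib is installed (pandas uses it under the hood):
dfOriginal['enginesize'].hist()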
Now if you look at the histogram, you can see that more data points are plotted towards the
left side.
In a nutshell, more data points are available on the left than on the right.
We can also get the skewness value if we run the code
dfOriginal['enginesize'].skew()
For us the value is 1.95; the positive value indicates that the distribution is right skewed.
So we need to exclude the outliers from our data as far as possible.
You may ask: that is all theory, but how do we actually do it?
Just run the code below; it will do the work for you:
from scipy.stats import zscore

# Keep only the rows whose enginesize lies within 2 standard deviations
z_scores = zscore(dfOriginal['enginesize'])
abs_z_scores = np.abs(z_scores)
filtered_entries = abs_z_scores < 2
dfOriginal = dfOriginal[filtered_entries]

snb.boxplot(data=dfOriginal, x='enginesize')
dfOriginal.shape
As you can see, our data frame has been reduced from 205 rows to 198 & we have
removed almost all the outliers.
Now if we run "dfOriginal.nunique()" we get the following result.
So we have removed 5 enginesize categories (44 - 39 = 5) & only 7 rows (205 - 198 = 7).
This further confirms that the outliers we removed from our data frame did not carry
any significant weight.
Now do the same thing with the "horsepower" column; I leave that up to you.
The result
As you can see, the column with by far the most unique values is "CarName".
If we apply OneHotEncoding to this column, it will produce 139 new columns.
Yes, we do have 140 unique car names, but let's first find out the frequency distribution of
those names.
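The counting code is not shown in the extracted text; given the CarNameCount variable used below, it was presumably produced with value_counts():
CarNameCount = dfOriginal['CarName'].value_counts()
CarNameCount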
From the result above it is clear that many entries have a count of 1 or 2; such car names
appear in our data frame in only one or two rows.
Code Section (Copy & Paste)
threshold = 2
# Index labels (car names) whose count is at or below the threshold
replindex = CarNameCount[CarNameCount <= threshold].index
replindex
Result
Here we get the index of those car names whose count is 2 or less.
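The replacement screenshot is missing; a plausible reconstruction that yields the NameTransformed array used in the next step is:
# Replace every rare car name with the label 'uncommon'
NameTransformed = np.where(dfOriginal['CarName'].isin(replindex),
                           'uncommon', dfOriginal['CarName'])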
Result
Here we replaced those car names with 'uncommon' based on their index values.
Code Section (Copy & Paste)
dfNameReplace = pd.DataFrame(NameTransformed, columns=['CarName'])
dfNameReplace
Result
Here we have created a new data frame from the array we got in the previous step.
Now we have to join our new data frame with the original data frame.
But for that we first need to drop the 'CarName' column from the original data frame.
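A hedged sketch of the drop step; the name dfDropped is an assumption:
# Drop the original CarName column; reset the index so rows line up later
dfDropped = dfOriginal.drop(columns=['CarName']).reset_index(drop=True)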
Now both data frames are ready to be joined.
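The join screenshot is missing; since the next step talks about a plain array, the join was presumably done with NumPy, along these lines (arrayJoined is a hypothetical name):
# Put the replaced names next to the remaining columns, giving a raw array
arrayJoined = np.concatenate([dfNameReplace.values, dfDropped.values], axis=1)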
Result
Our data frames have column names, but the joined array does not.
So let's first get the column names ready for the data frame we are about to generate.
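A minimal sketch, assuming 'CarName' comes first in the joined array as in the concatenation above (newColumns is a hypothetical name):
newColumns = ['CarName'] + list(dfDropped.columns)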
Now we will generate our new data frame from the new array.
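A hedged reconstruction of that step, using the names introduced above:
dfJoined = pd.DataFrame(arrayJoined, columns=newColumns)
dfJoined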
Result
As we can see, our new data frame "dfJoined" is ready for further use.
But we have an issue here: "enginesize" & "horsepower" are now of data type
object, not numerical. So we need to convert their data types back to numerical
again.
Code Section (Copy & Paste)
dfJoined[["enginesize", "horsepower", "price"]] = dfJoined[["enginesize", "horsepower", "price"]].apply(pd.to_numeric)
Result
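The screenshots referred to as pic-1 & pic-2 presumably come from the usual inspection calls:
dfJoined.info()      # pic-1: column data types
dfJoined.nunique()   # pic-2: unique values per column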
In pic-2 we can see that the column "CarName" now has only 11 unique values; previously
it had 140.
Now we can apply OneHotEncoding to all our string/ordinal type columns apart from
"cylindernumber".
As we can see, the values of the "cylindernumber" column have a natural order to them.
So it is better to encode this column with OrdinalEncoder rather than OHE.
But do we really need to perform OHE 5 times & join the 5 resulting arrays with the one
array coming from OrdinalEncoding?!
COLUMNTRANSFORMER
The ColumnTransformer does not actually do any data processing at this point.
It just remembers the steps you asked for.
So it is basically a skeleton.
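A hedged sketch of how "ct" could be defined; the exact column lists are the assumptions from earlier, and in scikit-learn versions before 1.2 the argument sparse_output is spelled sparse:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

ct = ColumnTransformer(
    transformers=[
        # One-hot encode the five nominal columns (names are assumptions)
        ('ohe', OneHotEncoder(sparse_output=False),
         ['CarName', 'fueltype', 'aspiration', 'carbody', 'fuelsystem']),
        # Ordinal-encode cylindernumber; pass categories=[...] to enforce the
        # real two-three-four... order, since the default is alphabetical
        ('ord', OrdinalEncoder(), ['cylindernumber']),
    ],
    remainder='passthrough',  # enginesize, horsepower & price pass through
)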
Now we can perform our data processing / transformation in just one step with the help of
the "ct" object we defined:
arrayTransformed = ct.fit_transform(dfJoined)
print(arrayTransformed.shape)
print(dfJoined.shape)
Result
As we can see, our data frame had 9 columns & after the transformation we have 32
columns. Here is how our data looks now…
All numerical in nature & ready for building our Machine Learning Model.
ML MODEL BUILDING
For this purpose, we first need to split our array into two segments.
One segment will be used to train our model & the other will be used by our ML model
for prediction.
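A minimal sketch of the split, reconstructed from the parameters described below:
from sklearn.model_selection import train_test_split

X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    arrayTransformed[:, 0:31],   # input features
    arrayTransformed[:, -1],     # target: the car price
    test_size=0.2,               # keep 20% of the rows for testing
)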
Result
Here we have used sklearn's train_test_split to split our array into two parts.
The 1st parameter, arrayTransformed[:, 0:31], holds all the input values from which the
car price will be predicted.
The 2nd parameter, arrayTransformed[:, -1], holds the prices of the
cars.
The 3rd parameter, test_size=0.2, is, as the name suggests, the share of the data our
ML model will use for testing its price predictions.
That basically means we will train our ML model with the help of X_Train & Y_Train,
then ask it to predict prices for X_Test & compare the results with Y_Test to
derive its accuracy.
As for the ML model creation, for this type of data analysis we will build our model on
the "LinearRegression" class of sklearn.
from sklearn.linear_model import LinearRegression

lirM = LinearRegression()
lirM.fit(X_Train, Y_Train)
So our ML model "lirM" is created & we have also trained it by calling the "fit" method &
passing in X_Train & Y_Train.
print(" ")
print('The predected values are :')
print(arryMyPredection)
Result
These are the car prices our ML model "lirM" predicted.
Let's compare them side by side…
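A hedged sketch of such a comparison, plus the score promised in the steps at the top of this article (dfCompare is a hypothetical name):
# Put actual & predicted prices next to each other, then print the R^2 score
dfCompare = pd.DataFrame({'Actual': Y_Test,
                          'Predicted': arryMyPredection})
print(dfCompare.head(10))
print('R2 score :', lirM.score(X_Test, Y_Test))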
Result
In the next part we will work on Pipelines to give this process an automated feel & make
our lives easier…
If you like my article, please leave a comment, give it a like & share it in your social
media circles.
MORE TO READ
https://www.python.org/
https://www.anaconda.com/products/distribution
https://numpy.org/learn/
https://pandas.pydata.org/docs/index.html
https://scikit-learn.org/stable/
https://code.visualstudio.com/docs/python/python-tutorial
https://www.geeksforgeeks.org/introduction-machine-learning-using-python/
https://www.tutorialspoint.com/machine_learning_with_python/index.htm
https://en.wikipedia.org/wiki/Machine_learning
https://azure.microsoft.com/en-in/overview/what-is-machine-learning-platform/
https://www.tensorflow.org/