
Machine Learning

Transport Choice of Employees

Senthil Kumar M
22.Sep.2019
Machine Learning (PGP-BABI)
by Great Learning

Table of Contents

INTRODUCTION
Observation
Step-by-step approach
Exploratory Data Analysis
EDA Summary
Logistic Regression
KNN
Naive Bayes
REFERENCES
INTRODUCTION 
This project aims to understand the determinants of the transport choice made by employees. The given data contains employee information about their mode of transport, along with personal and professional details such as age, salary and work experience. We need to predict whether or not an employee will use a car as a mode of transport, and identify which variables are significant predictors behind this decision.

We will use multiple models and performance metrics to derive a model that best describes the variables influencing an employee to use a car as a mode of transport. The input variables include employee personal details such as Age, Salary and Work.exp. We are going to use logistic regression, KNN and Naive Bayes models.

Process Map 

The structure of the input variables is tabled below:

Given the dataset, we are required to perform the following tasks to complete this project successfully:

1. EDA
2. Data Preparation
3. Modeling
4. Actionable Insights & Recommendations


Observation

Employees use two-wheelers, public transport and cars as modes of transport to commute to their workplace. We have been given 418 rows of data with 9 variables. We may need to clean up the dataset and convert column types appropriately before processing it for analysis.

The problem statement is to predict whether or not an employee will use a car as a mode of transport, and to identify which variables are significant predictors behind that decision.

Step-by-step approach

We shall do the following to perform a stepwise analysis and conclude this project:

1. Exploratory Data Analysis
2. Data Preparation
3. Logistic Regression
4. KNN
5. Naive Bayes
6. Performance Measurement & Conclusion

1. Exploratory Data Analysis

We will start our EDA process by converting the categorical variables to factors.
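As a minimal sketch, this step can be done in R as follows (the file name "Cars.csv" and the exact column names are assumptions based on the variable list described above):

```r
# Load the 418-row dataset and convert categorical variables to factors
# ("Cars.csv" and the column names are assumptions).
cars <- read.csv("Cars.csv")

cat_vars <- c("Gender", "Engineer", "MBA", "license", "Transport")
cars[cat_vars] <- lapply(cars[cat_vars], as.factor)

str(cars)  # verify the converted structure before starting EDA
```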


The following graph gives an overview of how the variables are spread by volume of usage:


The structure of the dataset is printed for reference.

Notice that there is a missing value in the variable MBA. There are several ways to treat missing values, but we will remove the whole record since there is only one.
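A sketch of that treatment in base R (assuming the cars data frame loaded above):

```r
# Count the missing values in MBA, then drop the single incomplete record
sum(is.na(cars$MBA))   # the dataset has one missing MBA value
cars <- na.omit(cars)  # 418 rows become 417 after removal
```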

There are several automated packages in R for performing exploratory data analysis; we are going to use one such package, "dlookr", in this project. The EDA report from the "dlookr" package gives us a detailed count of distinct values in each variable, along with a normality test, correlation coefficients and other descriptive statistics, elaborated below:
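A sketch of the dlookr calls that produce these diagnostics (run against the cleaned cars data frame; the report options shown are illustrative):

```r
library(dlookr)

diagnose(cars)    # distinct counts and missing values per variable
normality(cars)   # Shapiro-Wilk normality test for numeric variables
correlate(cars)   # pairwise correlation coefficients (numeric variables)

# Full automated EDA report against the target variable:
# eda_report(cars, target = Transport, output_format = "pdf")
```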


Normality Test of Numeric Variables:

The normality test statistics show that the Age and Distance variables are close to normally distributed, while Work Exp and Salary are positively skewed. The numeric variables were individually tested for normality and skewness, and QQ plots for each variable are printed below for reference.
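The per-variable check can be sketched with base R alone (numeric column names are assumptions from the variable list):

```r
# Shapiro-Wilk test and QQ plot for each numeric variable;
# a small p-value suggests departure from normality.
num_vars <- c("Age", "Work.Exp", "Salary", "Distance")
for (v in num_vars) {
  print(shapiro.test(cars[[v]]))
  qqnorm(cars[[v]], main = v)
  qqline(cars[[v]])
}
```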


Univariate Distribution: Histogram


Car usage ratio by numerical predictors:

We can see that employees with higher salary and age tend to use a car. There is a clear indication that employees aged 30 and above, as well as those with a salary of 30k and above, prefer to use a car as a mode of transport. Employees commuting more than 15 miles who also earn higher salaries choose the car; this is very evident in the dataset.
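These cut-offs can be verified with a quick base-R aggregation (a sketch; the threshold values mirror the observations above, column names assumed):

```r
# Share of car users above vs. below each observed cut-off
with(cars, tapply(Transport == "Car", Age >= 30, mean))
with(cars, tapply(Transport == "Car", Salary >= 30, mean))
with(cars, tapply(Transport == "Car", Distance > 15, mean))
```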


The above map depicts that female car usage is much lower compared to male usage, whereas qualification does not show any correlation with car usage. As for license, we can assume that employees without a license use public transport.

Target based Analysis: (Categorical Variables) 


Target based Analysis: (Numerical Variables) 


 

Age:


Work Exp:


Salary: 


Distance: 


Grouped Correlation Plot of Numerical Variables 


EDA Summary:

1. There is 1 NA in the entire dataset.
2. Correlation between predictor variables was found, and the correlated variables were removed from the dataset.
3. Several numeric variables were strongly positively correlated with each other; removing Age and Work.Exp left only two numeric predictors to carry forward into the models. We could have used other methods such as PCA to address this, but since the correlation is about 90%, we retain only Salary from the personal details to train our model.

Data Preparation:

Our primary interest, as per the problem statement, is to understand the factors influencing car usage. Hence we will create a new column for car usage: it takes the value 0 for Public Transport and 2 Wheeler, and 1 for car usage. We then look at the proportion of car users in the Transport mode.
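A sketch of this derivation in base R (the column name CarUsage is an assumption):

```r
# Binary target: 1 for Car, 0 for Public Transport / 2 Wheeler
cars$CarUsage <- as.factor(ifelse(cars$Transport == "Car", 1, 0))

prop.table(table(cars$CarUsage))  # proportion of car users in the data
```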


Only 8% of the employees in the dataset use a car as a mode of transport.

SMOTE the Data

Class distribution before SMOTE and after SMOTE:
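A sketch of the balancing step, assuming the DMwR package commonly used at the time (it has since been archived on CRAN; smotefamily is an alternative). The train data frame and the perc.over/perc.under values are illustrative assumptions:

```r
library(DMwR)

set.seed(123)
# Oversample the minority (car) class and undersample the majority class
balanced <- SMOTE(CarUsage ~ ., data = train,
                  perc.over = 300, perc.under = 200)

table(balanced$CarUsage)  # class counts after balancing
```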


Model Building:
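The logistic regression fit can be sketched as follows (a sketch under assumptions: the train/test split is not shown in this report, and those object names, along with the 0.5 cut-off, are illustrative):

```r
# Full logistic regression model on the balanced training set
logit_full <- glm(CarUsage ~ ., data = train, family = binomial)
summary(logit_full)  # coefficients and significance of each predictor

# Predicted probabilities on the test set, with a 0.5 cut-off
pred <- ifelse(predict(logit_full, newdata = test,
                       type = "response") > 0.5, 1, 0)
table(Predicted = pred, Actual = test$CarUsage)
```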

Improving the model 


VIF scores were used to verify multicollinearity; the Work.Exp variable scoring above 10 confirms that multicollinearity exists in the dataset.

After dropping the Age and Work.Exp variables, we notice that the VIF results are significantly lower, and we can conclude that the data is free from multicollinearity. We can go ahead and train the model with the remaining variables.
