MachineLearning Project PDF
Senthil Kumar M
22.Sep.2019
Machine Learning (PGP-BABI)
by Great Learning
Table of Contents
INTRODUCTION
Observation
REFERENCES
INTRODUCTION
This project aims to understand the determinants of the transport choices made by employees.
The given data contains information about each employee's mode of transport as well as
personal and professional details such as age, salary and work experience. We need to predict
whether or not an employee will use a car as a mode of transport, and to identify which
variables are significant predictors of this decision.
Process Map
Given the dataset, we are required to perform the following tasks to complete this
project successfully:
1. EDA
2. Data Preparation
3. Modeling
4. Actionable Insights & Recommendations
Observation
Employees use a 2 wheeler, public transport or a car to commute to their workplace.
We have been given 418 rows of data with 9 variables. The dataset needs to be cleaned
and its variables converted to appropriate types before it is processed for analysis.
The problem statement is to predict whether or not an employee will use a car
as a mode of transport, and to identify which variables are significant predictors
of that decision.
The following graph gives an overview of how the variables are distributed by volume of usage:
There are several automated packages in R for exploratory data analysis; in this project
we use one such package, “dlookr”. The EDA report from the “dlookr” package gives a
detailed count of distinct values in each variable, along with normality tests,
correlation coefficients and other descriptive statistics, as shown below:
The normality test statistics suggest that the Age and Distance variables are close to
normally distributed, while Work Exp and Salary are positively skewed in the dataset.
Each numeric variable was tested individually for normality and skewness, and a QQ plot
for each variable is included for reference.
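To make the idea of positive skew concrete, the following is a minimal sketch of the moment-based skewness coefficient. It is written in plain Python purely as an illustration (the report's own analysis was produced in R with “dlookr”), and the function name and sample values are hypothetical, not taken from the project dataset. A value near 0 indicates a roughly symmetric distribution, while a clearly positive value indicates a long right tail.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: treat as having no skew
    return m3 / m2 ** 1.5


# Hypothetical illustration values, NOT rows from the project dataset.
symmetric = [24, 26, 28, 30, 32]            # evenly spread around the mean
right_tailed = [8, 9, 10, 11, 12, 14, 40]   # one large value creates a right tail

print("symmetric skewness:", sample_skewness(symmetric))
print("right-tailed skewness:", sample_skewness(right_tailed))
```

A coefficient like this only summarises asymmetry; it complements rather than replaces the normality tests and QQ plots mentioned above.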
We can see that employees with higher salary and age tend to use a car. There is a
clear indication that employees aged above 30, as well as those with a salary of 30k and
above, prefer a car as their mode of transport. It is also evident in this dataset that
employees who commute more than 15 miles and have a higher salary choose a car.
The above map shows that car usage is much lower among female employees than among male
employees, whereas qualification does not show any correlation with car usage. As for
license, we can assume that employees without a license use public transport.
AGE:
Work Exp:
Salary:
Distance:
EDA Summary:
1. There is 1 NA value in the entire dataset.
2. Correlation was found between predictor variables, and the correlated variables were
removed from the dataset.
3. The numeric variables were strongly positively correlated, so removing Age and
Work.Exp reduced the numeric predictors to only 2 for the model. We could have used
other methods such as PCA to address this, but since the correlation is about 90% we
retain only Salary from the personal details to train our model.
Data Preparation:
Our primary interest, as per the problem statement, is to understand the factors influencing
car usage. Hence we create a new column for car usage, which takes the value 0 for
Public Transport and 2 Wheeler and 1 for Car. We then examine the proportion of cars in
Transport Mode.
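The encoding step described above can be sketched as follows. This is a plain Python illustration rather than the R code used for the report; the function name, the exact category labels and the sample values are assumptions made for the example and are not taken from the project dataset.

```python
from collections import Counter


def encode_car_usage(transport_modes):
    """Map each transport mode to 1 for car and 0 for any other mode."""
    # Assumption: car rows are labelled "Car" (case and surrounding spaces ignored).
    return [1 if mode.strip().lower() == "car" else 0 for mode in transport_modes]


# Hypothetical illustration values, NOT rows from the project dataset.
modes = ["Public Transport", "2Wheeler", "Car", "Public Transport", "Car"]

car_use = encode_car_usage(modes)
counts = Counter(car_use)
car_share = counts[1] / len(car_use)

print(car_use)
print(f"Share of car users: {car_share:.1%}")
```

Checking the share of the positive class at this point is useful because a small proportion of car users would make the target imbalanced, which affects how the model is trained and evaluated.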
Model Building:
VIF scores were used to check for multicollinearity. The Work.Exp score above 10 confirms
that multicollinearity exists in the dataset.
After dropping the Age and Work.Exp variables, the VIF results are considerably lower,
and we can conclude that the remaining predictors are free from problematic
multicollinearity. We can go ahead and train the model with the remaining variables.
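As an illustration of the VIF check, the sketch below computes VIF scores as the diagonal of the inverse of the predictors' correlation matrix, which is equivalent to 1 / (1 - R²) from regressing each predictor on the others. It is written in plain Python rather than the R code behind the report; all function names and data values are hypothetical and only mimic the pattern described above (Age and Work.Exp moving together).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_scores(columns):
    """Return {predictor name: VIF} for a dict of numeric columns."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Hypothetical illustration values, NOT rows from the project dataset.
data = {
    "Age": [23, 25, 27, 29, 31, 34, 38, 41],
    "Work.Exp": [1, 3, 4, 7, 8, 12, 15, 19],
    "Salary": [10, 14, 9, 16, 12, 30, 15, 45],
    "Distance": [12, 7, 15, 9, 11, 14, 8, 16],
}

full = vif_scores(data)
reduced = vif_scores({k: data[k] for k in ("Salary", "Distance")})

print("All predictors:", full)
print("After dropping Age and Work.Exp:", reduced)
```

In this toy data Age and Work.Exp rise almost in lockstep, so their VIF scores are far above 10, while the two remaining predictors have low scores once the correlated pair is dropped.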