Machine Learning (Project5) PDF

This document summarizes an analysis of employee commute data to predict which mode of transportation employees will use. Exploratory data analysis found salary and public transit to be skewed, while age, work experience, and salary did not differ significantly between groups. Logistic regression and KNN models found work experience, distance, license, and salary were important predictors, with employees over 10 years experience, commuting over 12 miles, having a license, or earning over $20,000 more likely to use a car. Bagging ensemble methods were also applied.

Mini Project – Mode of transport employees prefer to commute to their office

13th October 2019


Submitted by: Jagajit Singh
Project Objectives
This project requires you to understand which mode of transport employees prefer for
commuting to their office. We need to predict whether or not an employee will use a car as
a mode of transport, based on the personal and professional details provided.

Assumptions
• None

Exploratory Data Analysis – Step by Step approach


Environment Set Up and Data Import
Install Necessary Packages and Invoke Libraries

Set up working Directory

Data
Description:
The str() function indicates that all the variables are numeric or integer.

The dimensions show that the data has 444 rows and 9 columns.


Variance of the overall data

Summary of the data

• One data point for MBA is missing
• Salary might have a skewed distribution
• Again, public transport is the most common mode of transportation

Visual Analysis
boxplot(cardata$Age ~cardata$Engineer, main = "Age vs Eng.")
boxplot(cardata$Age ~cardata$MBA, main = "Age vs MBA")
Employees span all ages and levels of work experience.

boxplot(cardata$Salary ~cardata$Engineer, main = "Salary vs Eng.")


boxplot(cardata$Salary ~cardata$MBA, main = "Salary vs MBA.")

We do not see any appreciable difference in salary between engineers and non-engineers, or
between MBAs and non-MBAs.
Also, the mean salary for both MBAs and engineers is around 16.

hist(cardata$Work.Exp, col = "red", main = "Distribution of work exp")


The distribution is right-skewed, which is along expected lines, as any firm has more
juniors than seniors.

boxplot(cardata$Work.Exp ~ cardata$Gender)

Work experience is similarly distributed for males and females, as there is not much difference
between the mean work experience of the two genders.

Hypothesis Testing
The higher the salary, the greater the chance of using a car for the commute.

boxplot(cardata$Salary ~cardata$Transport, main="Salary vs Transport")

The graph clearly shows that as salary increases, the inclination to commute by car is higher.

boxplot(cardata$Age~cardata$Transport, main="Age vs Transport")


We can see a clear demarcation in the mode of transport used. The lower age group prefers the 2-wheeler,
while employees with higher work experience prefer the car.

As distance increases, employees would prefer a car for comfort and ease.

boxplot(cardata$Distance~cardata$Transport, main="Distance vs Transport")

A slight pattern can be observed here. For greater distances the car is preferred, followed by the
2-wheeler and then public transport.

Females would prefer private transport over public transport.


We can see that around 40% of females use private transport and 10% use a car, compared to males,
of whom 15% prefer a car and a total of 30% use private transport. Thus, although the percentage of car
usage is higher among males, they also rely more on public transport.

Bivariate Analysis:

As per the graph:
1. "CarUsage" appears to be correlated with "Age", "Work Experience" and "Salary"
Missing values
There is one missing value.
Checking for missing values in the dataset:
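
A check of this kind can be sketched in base R as below. The data frame toy and its values are invented stand-ins for cardata, and dropping the incomplete row is only one possible treatment; the report does not show how the missing MBA value was actually handled.

```r
# Toy stand-in for cardata; the values are invented for illustration only
toy <- data.frame(
  Age    = c(28, 24, 30, 35),
  MBA    = c(0, NA, 1, 0),          # one missing MBA value, as in the report
  Salary = c(14.4, 10.6, 15.5, 36.6)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# One simple option (an assumption here): drop the incomplete row
toy_complete <- na.omit(toy)
print(dim(toy_complete))
```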

Logistic Regression
What logistic regression predicts
The variate or value produced by logistic regression is a probability value
between 0.0 and 1.0.
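
As an illustration of where that probability comes from, the sketch below fits a logistic regression on synthetic data and recovers the predicted probabilities by applying the logistic function 1 / (1 + exp(-eta)) to the linear predictor by hand. The variable names, the simulated salary effect, and the two example salaries are assumptions for the example, not the project's data or fitted model.

```r
set.seed(1)

# Synthetic data: car use is simulated to become more likely as salary rises
n      <- 200
salary <- runif(n, 5, 50)
car    <- rbinom(n, 1, plogis(-6 + 0.2 * salary))
toy    <- data.frame(salary = salary, car = car)

# Fit a logistic regression of car use on salary
fit <- glm(car ~ salary, data = toy, family = binomial)

# type = "response" returns probabilities rather than log-odds
new_data <- data.frame(salary = c(10, 40))
p_hat <- predict(fit, newdata = new_data, type = "response")
print(p_hat)

# The same values, computed by applying the logistic function by hand
eta      <- coef(fit)[1] + coef(fit)[2] * new_data$salary
p_manual <- 1 / (1 + exp(-eta))
```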
No collinearity between the significant variables:

Because the dataset is unbalanced, the model does not predict the 1's accurately. The SMOTE
technique is therefore used to oversample the minority class.
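
SMOTE creates synthetic minority-class rows by interpolating between a minority point and one of its nearest minority neighbours. The report does not show which R package or settings were used, so the following is only a simplified base-R sketch of the idea for numeric features; the function name smote_sketch and the toy values are hypothetical.

```r
# Simplified SMOTE-style oversampling for numeric features (illustrative only)
smote_sketch <- function(X_min, n_new, k = 3) {
  X_min <- as.matrix(X_min)
  d <- as.matrix(dist(X_min))        # pairwise distances among minority rows
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min),
                  dimnames = list(NULL, colnames(X_min)))
  for (i in seq_len(n_new)) {
    a   <- sample(nrow(X_min), 1)    # pick a minority row at random
    nn  <- order(d[a, ])[2:(k + 1)]  # its k nearest neighbours (position 1 is itself)
    b   <- nn[sample(length(nn), 1)] # pick one of those neighbours
    gap <- runif(1)                  # random position on the segment from a to b
    synth[i, ] <- X_min[a, ] + gap * (X_min[b, ] - X_min[a, ])
  }
  synth
}

set.seed(42)
# Invented minority-class rows (for example, car users)
minority <- data.frame(Salary   = c(36, 42, 39, 45, 51),
                       Distance = c(14, 17, 15, 18, 20))
new_rows <- smote_sketch(minority, n_new = 10, k = 3)
oversampled <- rbind(as.matrix(minority), new_rows)
print(dim(oversampled))
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the range of the observed minority data.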
Running logistic regression after applying the SMOTE technique
KNN model
What is kNN Algorithm?
Let’s assume we have several groups of labeled samples, where the items within each group are
homogeneous in nature. Now suppose we have an unlabeled example that needs to be
classified into one of these labeled groups. The kNN algorithm is a natural way to do that.
k nearest neighbors is a simple algorithm that stores all available cases and classifies a new case by
a majority vote of its k nearest neighbors, assigning each unlabeled data point to one of the
labeled groups.
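
The majority vote described above can be sketched in a few lines of base R. The function knn_predict and the toy training data are hypothetical and only illustrate the mechanism; in practice the features would first be put on comparable scales, and a packaged implementation would normally be used instead.

```r
# Classify one new point by majority vote among its k nearest training points
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from new_x to every training row
  dists   <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(dists)[seq_len(k)]
  votes   <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Invented training data: two well-separated groups
train_x <- data.frame(Salary   = c(10, 12, 11, 40, 45, 38),
                      Distance = c(8, 9, 10, 16, 18, 15))
train_y <- c("Other", "Other", "Other", "Car", "Car", "Car")

pred <- knn_predict(train_x, train_y, new_x = c(42, 17), k = 3)
print(pred)
```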

Pros: The algorithm is highly unbiased in nature and makes no prior assumption of the underlying
data. Being simple and effective in nature, it is easy to implement and has gained good popularity.

Cons: The kNN algorithm has drawn a lot of flak for being extremely simple. On a closer look,
it does not really create a model, since no abstraction process is involved.
Training is very fast because the data is stored verbatim (hence "lazy learner"), but
prediction time is high and useful insights are sometimes missing. Building a robust kNN
classifier therefore requires time invested in data preparation (especially treating missing data
and categorical features).
Analysis of Naive Bayes
This gives us the rules or factors that help explain an employee's decision to use a car or not.
(These are summarized at the end.)

A general way to interpret this output: for a factor variable such as license, the table gives the
share of each value within a transport class, so 72% of 2-wheeler users have no license and 27%
have one.
For continuous variables, for example distance, the output gives the mean and standard deviation
within each class: 2-wheeler users have a mean commute distance of 11.9 with an sd of 3.5.
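
The quantities in a Naive Bayes output of this kind are per-class summaries of each predictor, and they can be reproduced directly in base R. The sketch below uses an invented data frame; the column names mirror those mentioned in the report, but the values are made up and do not reproduce the reported 72%, 27%, or 11.9.

```r
# Invented data for illustration only
toy <- data.frame(
  Transport = c("2Wheeler", "2Wheeler", "2Wheeler", "2Wheeler",
                "Car", "Car", "Car", "Car"),
  License   = c(0, 0, 0, 1, 1, 1, 1, 0),
  Distance  = c(10, 12, 14, 11, 16, 18, 17, 15)
)

# Factor predictor: share of each License value within each transport class
# (margin = 1 makes every row, i.e. every class, sum to 1)
license_given_class <- prop.table(table(toy$Transport, toy$License), margin = 1)
print(license_given_class)

# Continuous predictor: per-class mean and standard deviation
dist_mean <- tapply(toy$Distance, toy$Transport, mean)
dist_sd   <- tapply(toy$Distance, toy$Transport, sd)
print(cbind(mean = dist_mean, sd = dist_sd))
```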

Bagging
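
Bagging (bootstrap aggregating) fits the same kind of model on many bootstrap resamples of the training data and combines their predictions. The report does not show the base learner or the package used, so the sketch below is an assumption-laden illustration: it uses synthetic data and logistic regression as the base learner purely to stay within base R, whereas decision trees are the more usual choice.

```r
set.seed(7)

# Synthetic data: car use is simulated to become more likely with distance
n    <- 300
dist <- runif(n, 3, 24)
car  <- rbinom(n, 1, plogis(-5 + 0.4 * dist))
toy  <- data.frame(dist = dist, car = car)

B        <- 25                           # number of bootstrap resamples
new_data <- data.frame(dist = c(6, 20))  # two hypothetical employees

# Fit one model per bootstrap resample and collect its predicted probabilities
boot_preds <- sapply(seq_len(B), function(b) {
  idx <- sample(nrow(toy), replace = TRUE)
  fit <- glm(car ~ dist, data = toy[idx, ], family = binomial)
  predict(fit, newdata = new_data, type = "response")
})

# Aggregate: average the probabilities across resamples, then threshold at 0.5
bagged_prob  <- rowMeans(boot_preds)
bagged_class <- as.integer(bagged_prob > 0.5)
print(bagged_prob)
```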
Let us summarize the conclusions from the analysis and models regarding an employee's decision
whether or not to use a car:

• Important variables are Age, Work.Exp, Distance and License
• Age and Work.Exp are correlated, hence either one could be used (Work.Exp is preferred here)
• Employees with work experience of 10 and above are likely to use a car
• Employees who must commute a distance greater than 12 are more likely to prefer a car
• With license, we see that 74% of those who commute by car have a license and 89% of those who
commute by bus do not. Surprisingly, 72% of 2-wheeler users do not have a license.
• Again, people with higher salaries (>20) are likely to use cars
