Machine Learning (Project5) PDF

This document summarizes an analysis of employee commute data to predict whether an employee will use a car to commute. Exploratory data analysis found salary to be skewed and public transport to be the most common mode, while salary did not differ appreciably between engineers and non-engineers or between MBAs and non-MBAs. The models identified work experience, distance, license, and salary as important predictors: employees with work experience of 10 and above, a commute distance greater than 12, a license, or a salary above 20 are more likely to use a car. Logistic regression (with SMOTE), kNN, Naive Bayes, and bagging were applied.

Uploaded by

jagajits

Mini Project – Mode of transport employees prefer to commute to their office

13 October 2019

Submitted by: Jagajit Singh

Project Objectives
This project requires you to understand which mode of transport employees prefer for
commuting to their office. We need to predict whether or not an employee will use a car as a
mode of transport based on the personal and professional details provided.

Assumptions
• None

Exploratory Data Analysis – Step by Step approach

Environment Set Up and Data Import
Install Necessary Packages and Invoke Libraries

Set up working Directory

Data Description
The str() function indicates that all variables are of numeric or integer type.

dim() shows that the data has 444 rows and 9 columns.

Variance of the overall data

Summary of the data

• One data point for MBA is missing
• Salary appears to have a skewed distribution
• Public transport is the most common mode of transportation
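
The inspection steps above can be sketched in a few lines of base R. The report does not include the data file, so the small data frame below is a simulated stand-in (an assumption for illustration only); the column names follow the ones used elsewhere in the report.

```r
# Hypothetical stand-in for the commute data; values are simulated, not the author's
set.seed(1)
cardata <- data.frame(
  Age       = sample(20:40, 10, replace = TRUE),
  MBA       = c(NA, sample(0:1, 9, replace = TRUE)),   # one missing value, as in the report
  Salary    = round(rlnorm(10, meanlog = 2.6, sdlog = 0.5), 1),
  Transport = factor(sample(c("2Wheeler", "Car", "Public Transport"), 10, replace = TRUE))
)

str(cardata)       # type of every column
dim(cardata)       # number of rows and columns
summary(cardata)   # per-column summaries; NA counts are listed where present

# Variance of each numeric column, ignoring missing values
sapply(cardata[sapply(cardata, is.numeric)], var, na.rm = TRUE)
```

On the real data, `dim()` would be expected to report the 444 rows and 9 columns mentioned above.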

Visual Analysis
boxplot(cardata$Age ~ cardata$Engineer, main = "Age vs Eng.")
boxplot(cardata$Age ~ cardata$MBA, main = "Age vs MBA")
Employees of all ages and levels of work experience are represented.

boxplot(cardata$Salary ~cardata$Engineer, main = "Salary vs Eng.")

boxplot(cardata$Salary ~cardata$MBA, main = "Salary vs MBA.")

We do not see any appreciable difference in salary between engineers and non-engineers, or
between MBAs and non-MBAs.
Also, the mean salary for both MBAs and engineers is around 16.

hist(cardata$Work.Exp, col = "red", main = "Distribution of work exp")

The distribution is right-skewed, which is expected, as any firm has more junior than
senior employees.

boxplot(cardata$Work.Exp ~ cardata$Gender)

Work experience is similarly distributed for males and females; there is little difference
between the mean work experience of the two genders.

Hypothesis Testing
Hypothesis: the higher the salary, the greater the chance of using a car for the commute.

boxplot(cardata$Salary ~cardata$Transport, main="Salary vs Transport")

The plot shows that as salary increases, the inclination to commute by car increases.

boxplot(cardata$Age~cardata$Transport, main="Age vs Transport")

We can see a clear demarcation in transport usage. The lower age group prefers the 2-wheeler, while
older employees, who have more work experience, prefer the car.

Hypothesis: as distance increases, employees prefer the car for comfort and ease.

boxplot(cardata$Distance~cardata$Transport, main="Distance vs Transport")

There is a slight pattern that can be observed here. For greater distances the car is preferred, followed
by the 2-wheeler and then public transport.

Hypothesis: females prefer private transport over public transport.

Around 40% of females use private transport (10% by car), compared with about 30% of males
(15% by car). Thus, although males use cars at a higher rate, they also rely more on public
transport.
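
Percentages of this kind are row proportions of a gender-by-transport cross-table. A minimal sketch follows; the eight records are invented for illustration and do not reproduce the report's figures.

```r
# Hypothetical records, for illustration only
Gender    <- c("Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male")
Transport <- c("Public Transport", "2Wheeler", "Car",
               "Public Transport", "Public Transport", "2Wheeler", "Car", "Public Transport")

tab <- table(Gender, Transport)

# margin = 1 normalises each row, giving the share of each mode within a gender
share <- prop.table(tab, margin = 1)
round(share, 2)
```

Using `margin = 1` keeps the comparison within each gender, so unequal numbers of males and females do not distort the shares.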

Bivariate Analysis:

As per the graph:
1. "CarUsage" appears to be correlated with "Age", "Work Experience", and "Salary"

Missing values
There is one missing value.
Checking for missing values in the dataset
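
A short sketch of such a check is given below. The report does not say how the missing value was handled, so dropping the incomplete row is shown only as one possible option; the data frame is a made-up example.

```r
# Hypothetical four-row example with one missing MBA value
cardata <- data.frame(
  Age    = c(28, 31, 24, 35),
  MBA    = c(0, NA, 1, 0),
  Salary = c(14.3, 20.1, 8.5, 36.6)
)

colSums(is.na(cardata))   # number of missing values per column
anyNA(cardata)            # TRUE if any value is missing

# One option (an assumption, not necessarily the author's choice): drop incomplete rows
clean <- cardata[complete.cases(cardata), ]
nrow(clean)
```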

Logistic Regression
What logistic regression predicts
The variate or value produced by logistic regression is a probability value
between 0.0 and 1.0.
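
As an illustration of that probability output, here is a minimal sketch using base R's glm(). The data are simulated and the coefficients used to generate them are arbitrary assumptions; only the variable names come from the report.

```r
set.seed(42)
n <- 300
# Simulated predictors and outcome (hypothetical, not the report's data)
Distance <- runif(n, 3, 24)
Salary   <- rlnorm(n, meanlog = 2.6, sdlog = 0.5)
CarUsage <- rbinom(n, 1, plogis(-8 + 0.35 * Distance + 0.15 * Salary))
dat <- data.frame(CarUsage, Distance, Salary)

# Logistic regression: models log-odds of CarUsage as a linear function of predictors
fit <- glm(CarUsage ~ Distance + Salary, data = dat, family = binomial)

# type = "response" returns P(CarUsage = 1), i.e. 1 / (1 + exp(-linear predictor))
prob <- predict(fit, type = "response")
range(prob)

# Turn probabilities into class labels with a 0.5 cut-off and tabulate
pred <- ifelse(prob > 0.5, 1, 0)
table(Actual = dat$CarUsage, Predicted = pred)
```

The 0.5 cut-off is a conventional default; with an unbalanced outcome a different threshold may be more appropriate.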
No collinearity was found among the significant variables:

Because the dataset is unbalanced, the model does not predict the 1s accurately, so the SMOTE
technique is used to oversample the minority class.
Running logistic regression after applying the SMOTE technique
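
The report does not show which SMOTE implementation was used. The sketch below is a hand-written illustration of the idea in base R, not a specific package's function: each synthetic point is an interpolation between a minority case and one of its nearest minority neighbours. The function name smote_sketch and the simulated data are assumptions.

```r
set.seed(7)
# Hypothetical imbalanced data: 90 non-car (0) and 10 car (1) commuters
X <- rbind(
  cbind(Distance = rnorm(90, 10, 2), Salary = rnorm(90, 12, 3)),
  cbind(Distance = rnorm(10, 17, 2), Salary = rnorm(10, 35, 5))
)
y <- c(rep(0, 90), rep(1, 10))

smote_sketch <- function(X_min, n_new, k = 5) {
  d <- as.matrix(dist(X_min))               # pairwise distances among minority rows
  synth <- matrix(NA_real_, n_new, ncol(X_min),
                  dimnames = list(NULL, colnames(X_min)))
  for (i in seq_len(n_new)) {
    a  <- sample(nrow(X_min), 1)            # pick a minority case at random
    nn <- order(d[a, ])[2:(k + 1)]          # its k nearest minority neighbours (skip itself)
    b  <- sample(nn, 1)                     # pick one of those neighbours
    # New point lies on the segment between case a and neighbour b
    synth[i, ] <- X_min[a, ] + runif(1) * (X_min[b, ] - X_min[a, ])
  }
  synth
}

X_new <- smote_sketch(X[y == 1, ], n_new = 80)
X_bal <- rbind(X, X_new)
y_bal <- c(y, rep(1, nrow(X_new)))
table(y_bal)   # classes are now balanced
```

Oversampling is normally applied to the training split only, so that the test set still reflects the original class balance.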
KNN model
What is kNN Algorithm?
Let’s assume we have several groups of labeled samples. The items present in the groups are
homogeneous in nature. Now, suppose we have an unlabeled example which needs to be
classified into one of the several labeled groups. How do you do that? Unhesitatingly, using kNN
Algorithm.
k nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by
a majority vote of their k nearest neighbors. The algorithm segregates unlabeled data points into
well-defined groups.
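
The majority vote described above can be sketched in base R as follows. This is a hand-written illustration on simulated data (the function knn_predict is hypothetical); in practice a packaged implementation such as class::knn would normally be used.

```r
set.seed(3)
# Hypothetical labeled samples: two well-separated groups
train <- rbind(
  cbind(Distance = rnorm(30, 9, 1.5),  Salary = rnorm(30, 12, 3)),
  cbind(Distance = rnorm(30, 17, 1.5), Salary = rnorm(30, 35, 5))
)
labels <- factor(rep(c("NoCar", "Car"), each = 30))

knn_predict <- function(train, labels, new_point, k = 5) {
  # Standardise so that no single variable dominates the distance
  mu <- colMeans(train)
  s  <- apply(train, 2, sd)
  tr <- scale(train, center = mu, scale = s)
  np <- (new_point - mu) / s
  d  <- sqrt(rowSums(sweep(tr, 2, np)^2))   # Euclidean distance to every stored case
  votes <- table(labels[order(d)[1:k]])     # labels of the k nearest neighbours
  names(votes)[which.max(votes)]            # majority vote
}

far_high <- knn_predict(train, labels, c(Distance = 18, Salary = 38), k = 5)
near_low <- knn_predict(train, labels, c(Distance = 8,  Salary = 10), k = 5)
c(far_high, near_low)
```

Note that nothing is fitted in advance: all the work happens at prediction time, which is why kNN is called a lazy learner.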

Pros: The algorithm is highly unbiased in nature and makes no prior assumption of the underlying
data. Being simple and effective in nature, it is easy to implement and has gained good popularity.

Cons: Indeed it is simple, but the kNN algorithm has drawn a lot of flak for being extremely simple! If
we take a deeper look, this doesn’t create a model since there’s no abstraction process involved.
Yes, the training process is really fast as the data is stored verbatim (hence lazy learner) but the
prediction time is pretty high with useful insights missing at times. Therefore, building this
algorithm requires time to be invested in data preparation (especially treating the missing data
and categorical features) to obtain a robust model.
Analysis of Naive Bayes
This gives us the rules or factors that help explain an employee's decision to use a car or not.
(These are summarized at the end)

The general way to interpret this output is that, for a factor variable such as license, the table
gives the share of each license status within a transport class: among 2-wheeler users, 72% have
no license and 27% have one.
For continuous variables such as distance, the output gives the class mean and standard deviation:
2-wheeler users have a mean commute distance of 11.9 with an sd of 3.5.
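
To make that reading concrete, the sketch below computes the same two kinds of quantities by hand in base R: class-conditional proportions for a factor predictor, and per-class mean and sd for a continuous one. The data are simulated assumptions, so the numbers will not match the report's.

```r
set.seed(11)
# Hypothetical data, for illustration only
Transport <- factor(sample(c("2Wheeler", "Car", "Public Transport"), 150, replace = TRUE))
license   <- factor(rbinom(150, 1, ifelse(Transport == "Car", 0.75, 0.25)))
Distance  <- rnorm(150, mean = ifelse(Transport == "Car", 16, 11), sd = 3)

# Factor predictor: P(license status | transport class); each row sums to 1
cond_license <- prop.table(table(Transport, license), margin = 1)
round(cond_license, 2)

# Continuous predictor: mean and standard deviation of distance within each class
cond_distance <- cbind(mean = tapply(Distance, Transport, mean),
                       sd   = tapply(Distance, Transport, sd))
round(cond_distance, 2)
```

Each row of the first table describes one transport class, which is why a pair such as 72% and 27% adds up to roughly 100%.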

Bagging
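
The report does not show its bagging code, so the following is only a sketch of the general technique: fit one classification tree per bootstrap resample and combine the trees by majority vote. It assumes the rpart package is available (it ships with standard R installations) and uses simulated data with arbitrary generating coefficients.

```r
library(rpart)  # recommended package; assumed to be installed

set.seed(5)
n <- 300
# Simulated data (hypothetical); variable names follow the report
dat <- data.frame(Distance = runif(n, 3, 24), Work.Exp = rpois(n, 6))
dat$CarUsage <- factor(rbinom(n, 1, plogis(-9 + 0.3 * dat$Distance + 0.6 * dat$Work.Exp)))

idx   <- sample(n, 200)
train <- dat[idx, ]
test  <- dat[-idx, ]

B <- 25  # number of bootstrap trees
votes <- sapply(seq_len(B), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]   # bootstrap resample
  tree <- rpart(CarUsage ~ Distance + Work.Exp, data = boot, method = "class")
  as.character(predict(tree, test, type = "class"))      # this tree's test predictions
})

# Majority vote across the B trees for each test row
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == test$CarUsage)   # test-set accuracy
```

Averaging many trees trained on resampled data reduces the variance of a single tree, which is the main motivation for bagging.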
Let us summarize the conclusions from the analysis and models for an employee's decision whether to
use a car or not:

• Important variables are Age, Work.Exp, Distance and License

• Age and Work.Exp are correlated, hence we could use either one (Work.Exp is preferred here)
• Employees with work experience of 10 and above are likely to use a car
• Employees who must commute a distance greater than 12 are more likely to prefer a car
• For license, 74% of those who commute by car have a license and 89% of those who commute
by bus do not. Surprisingly, 72% of 2-wheeler commuters do not have a license.
• Again, people with higher salaries (>20) are likely to use cars
