Machine Learning (Project5) PDF
Assumptions
none
Data
Description:
The str() output indicates that all the variables are numeric or integer.
Visual Analysis
boxplot(cardata$Age ~cardata$Engineer, main = "Age vs Eng.")
boxplot(cardata$Age ~ cardata$MBA, main = "Age vs MBA")
Employees span all ages and levels of work experience.
We do not see any appreciable difference in salary between engineers and non-engineers, or between
MBAs and non-MBAs.
Also, the mean salary for both MBAs and engineers is around 16.
boxplot(cardata$Work.Exp ~ cardata$Gender)
Work experience is distributed similarly for males and females: there is not much difference in
mean work experience between the two genders.
Hypothesis Testing
The higher the salary, the greater the chance of using a car to commute.
As distance increases, employees would prefer a car for comfort and ease.
There is a slight pattern that can be observed here. For greater distances a car is preferred,
followed by 2-wheeler and then public transport.
Bivariate Analysis:
As per the graph:
1. "CarUsage" appears to be correlated with "Age", "Work Experience" and "Salary"
Missing values
There is one missing value.
Checking for missing values in the dataset:
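A minimal sketch of such a check in base R. The data frame `toy` is a made-up stand-in for cardata, and placing the single missing value in the MBA column is an arbitrary assumption for illustration only.

```r
# Synthetic stand-in for cardata: column names are assumed, values are invented.
toy <- data.frame(
  Age    = c(28, 31, 25, 40, 35),
  MBA    = c(0, 1, NA, 0, 1),
  Salary = c(14.3, 16.1, 8.5, 36.6, 20.7)
)

# Count missing values per column and over the whole data frame.
na_per_column <- colSums(is.na(toy))
total_na <- sum(is.na(toy))
print(na_per_column)

# Find the rows that contain a missing value, then drop them.
# Dropping is only one possible treatment; imputation is another.
incomplete_rows <- which(!complete.cases(toy))
toy_clean <- na.omit(toy)
```

With a single missing value in a dataset of this kind, dropping the affected row loses very little information.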
Logistic Regression
What logistic regression predicts
The value (variate) produced by logistic regression is a probability
between 0.0 and 1.0.
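As an illustration, the sketch below fits a logistic regression with base R's glm() on synthetic data and confirms that the predictions are probabilities. The column names, the coefficients used to simulate the data, and the 0.5 cutoff are all assumptions, not values taken from this analysis.

```r
set.seed(42)
n <- 200

# Synthetic predictors (assumed names and ranges).
toy <- data.frame(
  Salary   = runif(n, 5, 50),
  Distance = runif(n, 3, 25)
)

# Assumed generating process: salary and distance both raise the odds of car use.
linear_part  <- -8 + 0.12 * toy$Salary + 0.25 * toy$Distance
toy$CarUsage <- rbinom(n, 1, plogis(linear_part))

# Fit the model; family = binomial gives logistic regression.
fit <- glm(CarUsage ~ Salary + Distance, data = toy, family = binomial)

# type = "response" returns probabilities rather than log-odds.
prob <- predict(fit, newdata = toy, type = "response")
print(range(prob))

# Turn probabilities into class labels with a 0.5 cutoff (an assumed threshold).
pred <- ifelse(prob > 0.5, 1, 0)
print(table(Actual = toy$CarUsage, Predicted = pred))
```

The cutoff is a modelling choice. With an unbalanced response, a lower threshold or resampling may be needed before the 1's are predicted well.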
No collinearity among the significant variables:
Because the dataset is unbalanced, the model does not predict the 1's accurately, so the SMOTE
technique is used to oversample the minority class.
Running logistic regression after applying SMOTE:
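To show the idea behind SMOTE, here is a simplified base R sketch that creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. It is not the implementation used in this analysis: the function name `smote_sketch`, the synthetic data and the choice of k = 5 are all hypothetical, and it handles numeric predictors only.

```r
# Simplified SMOTE-style oversampling for numeric predictors (illustrative only).
smote_sketch <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  stopifnot(nrow(X_min) >= 2)
  k <- min(k, nrow(X_min) - 1)
  d <- as.matrix(dist(X_min))   # pairwise distances among minority rows
  diag(d) <- Inf                # a point is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min),
                  dimnames = list(NULL, colnames(X_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(nrow(X_min), 1)         # random minority point
    nn   <- order(d[base, ])[seq_len(k)]       # its k nearest minority neighbours
    mate <- nn[sample.int(length(nn), 1)]      # pick one neighbour at random
    gap  <- runif(1)                           # position along the connecting segment
    synth[i, ] <- X_min[base, ] + gap * (X_min[mate, ] - X_min[base, ])
  }
  synth
}

set.seed(1)
# Invented data: 10 car users (minority) and 90 non-users (majority).
minority <- data.frame(Salary = rnorm(10, 36, 5), Distance = rnorm(10, 17, 3))
majority <- data.frame(Salary = rnorm(90, 14, 4), Distance = rnorm(90, 11, 3))

# Generate enough synthetic rows to balance the two classes.
new_rows <- smote_sketch(minority, n_new = nrow(majority) - nrow(minority))
balanced <- rbind(
  cbind(majority, CarUsage = 0),
  cbind(minority, CarUsage = 1),
  cbind(as.data.frame(new_rows), CarUsage = 1)
)
print(table(balanced$CarUsage))
```

Oversampling is normally applied to the training split only, so that the test set keeps the original class balance.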
KNN model
What is kNN Algorithm?
Let's assume we have several groups of labeled samples. The items present in each group are
homogeneous in nature. Now, suppose we have an unlabeled example which needs to be
classified into one of the labeled groups. How do you do that? The kNN algorithm is a natural
choice.
k nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by
a majority vote of their k nearest neighbors. In this way the algorithm assigns unlabeled data
points to well-defined groups.
Pros: The algorithm is highly unbiased in nature and makes no prior assumption about the underlying
data. Being simple and effective, it is easy to implement and has become popular.
Cons: It is simple, but the kNN algorithm has drawn a lot of flak for being extremely simple. If
we take a deeper look, it doesn't create a model, since there is no abstraction process involved.
The training process is very fast because the data is stored verbatim (hence "lazy learner"), but
prediction is slow and useful insights are sometimes missing. Therefore, using this
algorithm requires time to be invested in data preparation (especially treating missing data
and categorical features) to obtain robust results.
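The majority vote described above can be sketched in a few lines of base R. The helper `knn_predict`, the synthetic data and k = 5 are assumptions for illustration. Predictors are standardized first because kNN relies on distances.

```r
# Classify each row of new_x by majority vote among its k nearest training rows.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  new_x   <- as.matrix(new_x)
  apply(new_x, 1, function(row) {
    d     <- sqrt(colSums((t(train_x) - row)^2))  # Euclidean distance to every stored case
    nn    <- order(d)[seq_len(k)]                 # indices of the k closest cases
    votes <- table(train_y[nn])
    names(votes)[which.max(votes)]                # ties go to the first label in table order
  })
}

set.seed(7)
# Invented training data with two well-separated groups.
train <- data.frame(
  Age      = c(rnorm(20, 27, 2), rnorm(20, 38, 2)),
  Salary   = c(rnorm(20, 13, 2), rnorm(20, 37, 3)),
  CarUsage = rep(c("no", "yes"), each = 20)
)

# Standardize the training predictors and reuse the same centre and scale for new cases.
scaled  <- scale(train[, c("Age", "Salary")])
new_emp <- data.frame(Age = c(26, 40), Salary = c(12, 38))
new_scaled <- scale(new_emp,
                    center = attr(scaled, "scaled:center"),
                    scale  = attr(scaled, "scaled:scale"))

pred <- knn_predict(scaled, train$CarUsage, new_scaled, k = 5)
print(pred)
```

There is no fitting step: all the work happens at prediction time, when distances to every stored case are computed. This is why prediction is slow on large datasets.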
Analysis of Naive Bayes
This gives us the rules or factors that help explain an employee's decision to use a car or not.
(These are summarized at the end.)
The output reports, for each predictor, its distribution within each class. For a factor variable,
say license, we can read it as follows: of the people who use a 2-wheeler, 72% have no license and
27% have one.
For a continuous variable, for example distance, we can say that people who use a 2-wheeler have a
mean commute distance of 11.9 with an sd of 3.5.
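The quantities in such output can be reproduced by hand, which makes the direction of the conditioning explicit. The sketch below uses a tiny invented data frame. The column names and values are assumptions and will not reproduce the 72% or 11.9 figures above.

```r
# Invented data: transport mode, license flag (0 = no, 1 = yes), commute distance.
toy <- data.frame(
  Transport = c("2Wheeler", "2Wheeler", "2Wheeler", "2Wheeler", "Car", "Car"),
  license   = c(0, 0, 0, 1, 1, 1),
  Distance  = c(9, 12, 14, 11, 16, 20)
)

# P(license | Transport): proportions within each transport class, so rows sum to 1.
cond_license <- prop.table(table(toy$Transport, toy$license), margin = 1)
print(cond_license)

# For contrast, P(Transport | license): proportions within each license group.
cond_transport <- prop.table(table(toy$Transport, toy$license), margin = 2)
print(cond_transport)

# For a continuous predictor, the per-class mean and sd define a Gaussian likelihood.
dist_mean <- tapply(toy$Distance, toy$Transport, mean)
dist_sd   <- tapply(toy$Distance, toy$Transport, sd)
print(rbind(mean = dist_mean, sd = dist_sd))
```

In this toy data 75% of 2-wheeler users have no license, while 100% of people without a license use a 2-wheeler. The two statements are different, and only the first corresponds to the per-class tables.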
Bagging
Let us summarize the conclusions from the analysis and models regarding an employee's decision
whether or not to use a car: