Regression Project
Prabhu.S-Oct Batch
Table of Contents
Executive Summary-1
Introduction
Data Description
Sample of the dataset
Information of the dataset
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis
Summary of Data set
Univariate Analysis
Unique Values
Carat
Depth
Table
X-Column
Y-Column
Z-Column
Price Column
Bivariate Analysis
Correlation Matrix
Price Vs Cut
Price Vs Colour
Price Vs Clarity
Outlier Treatment
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning, or do we need to change them or drop them? Do you think scaling is necessary in this case?
1.3 Encode the data (having string values) for modelling. Split the data into train and test (70:30). Apply linear regression using scikit-learn. Perform checks for significant variables using the appropriate method from statsmodels. Create multiple models and check the performance of predictions on train and test sets using R-square, RMSE and adjusted R-square. Compare these models and select the best one with appropriate reasoning
1.4 Inference: Based on these predictions, what are the business insights and recommendations
Executive Summary-2
Introduction
Data Description
Dataset Sample
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check, and write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis
Dataset Type
Shape of Data
Checking for Null Values
Checking for Duplicates
Exploratory Data Analysis
Univariate Analysis
Salary
Education
Children (Young and Old)
Bivariate Analysis
Outliers
2.2 Do not scale the data. Encode the data (having string values) for modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis)
Encoding the data
Data Split
LDA (linear discriminant analysis)
Logistic Regression
2.3 Performance Metrics: Check the performance of predictions on train and test sets using accuracy and the confusion matrix; plot the ROC curve and get the ROC_AUC score for each model. Final Model: Compare both models and write an inference on which model is best/optimised
LDA Model-Training and Test Data Classification Report
Coefficient of Variable-LDA
Confusion Matrix-LDA (Train and Test Data)
ROC Curve-Train-LDA
ROC Curve-Test-LDA
LR Model-Training and Test Data Classification Report
Coefficient of Variable-LR
ROC Curve-Train-LR
ROC Curve-Test-LR
2.4 Inference: Based on these predictions, what are the insights and recommendations
Table Details- Problem 1
Figure Details-Problem 1
Figure Details-Problem 2
Introduction
The purpose of this exercise is to explore the dataset and understand its nature, check and explore the various parameters used, understand the outliers and analyse their effect, and apply various models.
Data Description
Sample of the dataset:
Table 1.1 Dataset Sample
We can see that the dataset contains a mix of object (categorical) and float (numerical) columns.
1.1 Read the data and do exploratory data analysis. Describe the
data briefly. (Check the null values, Data types, shape, EDA, duplicate
values). Perform Univariate and Bivariate Analysis.
Checking for duplicates: there are 34 duplicate rows; after removing them the dataset has 26933 rows.
The redundant column 'Unnamed: 0' has also been dropped, leaving 10 columns. The final dataset looks like below.
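A minimal pandas sketch of these cleaning steps, using a tiny synthetic frame (the column names mirror the dataset, but the values here are illustrative only):

```python
import pandas as pd

# Illustrative frame mimicking the dataset's structure (synthetic values,
# not the real cubic zirconia data)
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 3],
    "carat": [0.3, 0.7, 1.1, 1.1],
    "cut": ["Ideal", "Premium", "Fair", "Fair"],
    "price": [500, 2300, 4100, 4100],
})

n_dupes = df.duplicated().sum()        # count exact duplicate rows
df = df.drop_duplicates()              # remove them
df = df.drop(columns=["Unnamed: 0"])   # serial-number column adds no information

print(n_dupes, df.shape)
```

On the real data the same three calls report the 34 duplicates and produce the 26933 x 10 frame described above.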
Table 1.4 summary of Dataset-2
Carat: This is an independent variable ranging from 0.2 to 4.5. The mean is around 0.8, and 75% of the stones are at or below 1.05 carat. The standard deviation of about 0.477 shows that the data is skewed with a right tail, meaning the majority of the stones are of lower carat; very few stones exceed 1.05 carat.
Depth: The percentage height of the cubic zirconia stones ranges from 50.80 to 73.60. The average height is 61.80; 25% of the stones are below 61 and 75% are below 62.5. The standard deviation of 1.4 indicates an approximately normal distribution.
Table: The percentage width of the cubic zirconia ranges from 49 to 79, with an average of around 57. 25% of the stones are below 56 and 75% have a width of less than 59. The standard deviation is 2.24. Thus the data is not normally distributed and, as with carat, most of the stones have less width; this also indicates that outliers are present in the variable.
Price: Price is the target variable. The mean price is around 3938, with a maximum of 18818. The median price is 2375; 25% of the stones are priced below 945 and 75% below 5356. The standard deviation of 4022 indicates that the majority of stones are in the lower price range, as the distribution is right skewed.
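The quantiles and skew figures quoted in these paragraphs come from standard pandas summaries; a small sketch with made-up prices shows the calls involved:

```python
import pandas as pd

# Synthetic price sample, chosen only to illustrate the summary calls
prices = pd.Series([945, 1200, 2375, 3000, 5356], name="price")
stats = prices.describe()  # count, mean, std, min, quartiles, max

q1, median, q3 = stats["25%"], stats["50%"], stats["75%"]
skewness = prices.skew()   # > 0 indicates a right-tailed distribution

print(median, skewness > 0)
```

`df.describe()` on the full frame yields the same row of statistics for every numeric column at once.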
Univariate Analysis
Let’s perform univariate analysis across all variables and understand their characteristics.
Unique Values:
Analysing the unique values for all the columns, we get the following:
Object Variables:
Figure 1.1 summary of Dataset-Object Variables
Numerical Variables
Carat
Figure 1.3 Univariate Analysis-Carat
Depth
Figure 1.4 Univariate Analysis-Depth
Observation: The depth variable has outliers and is centrally distributed.
Table
X- Column
Y- Column
Price column
To analyse how each variable interacts with the others, we perform a bivariate analysis.
Figure 1.9 Bivariate Analysis
Correlation Matrix
Figure 2.0 Correlation Matrix
Observation: The bivariate analysis and the correlation matrix suggest the following:
High correlation between features such as carat, x, y, z and price.
Low correlation between table and the other features.
Depth is negatively correlated with most of the other features, except carat.
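These relationships can be read straight off `DataFrame.corr()`. A sketch on synthetic stand-in columns (the dependence structure is invented to mimic the observations above, not taken from the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, 200)
df = pd.DataFrame({
    "carat": carat,
    "x": carat * 5 + rng.normal(0, 0.1, 200),        # dimensions track carat
    "price": carat * 4000 + rng.normal(0, 200, 200), # price rises with carat
    "table": rng.uniform(49, 79, 200),               # largely independent
})

corr = df.corr()  # Pearson correlation matrix of the numeric columns
print(corr.loc["carat", "price"].round(2))
```

The resulting matrix is what the heat map in Figure 2.0 visualises.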
Also, let's understand how price varies with Cut, Colour and Clarity.
Price Vs Cut
Comparing price across the various cut types, we get the following:
Observation:
The most sold cut type is Ideal and the least sold is Fair.
All cut types have outliers with respect to price.
Ideal cut stones appear slightly less priced, while Premium cut stones appear slightly more expensive.
Price Vs Colour
For the colour variable, the most sold are G-coloured gems and the least sold are J-coloured gems.
All colour types have outliers with respect to price.
However, E-coloured gems appear to be the least priced, while J and I coloured gems appear more expensive.
Price Vs Clarity
Figure 2.3 Price Vs Clarity
For the clarity variable, the most sold are SI1 clarity gems and the least sold are I1 clarity gems.
All clarity types have outliers with respect to price.
SI1 stones appear slightly less priced, while VS2 and SI2 clarity stones appear more expensive.
Outlier Treatment
From the univariate analysis we understand that outliers are present in all the variables.
To treat these outliers, we use the capping method: values beyond a particular limit are capped at that limit, i.e. values above an upper bound or below a lower bound are replaced by the bound itself.
Observation:
Using the capping method, we treated the outliers by capping them at the computed limits.
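A common way to implement this capping is the 1.5*IQR rule used by box plots; a minimal sketch (the helper name `cap_outliers` and the sample series are my own, for illustration):

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Cap values outside the 1.5*IQR whiskers at the whisker limits."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s.clip(lower=lower, upper=upper)

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier
capped = cap_outliers(s)
print(capped.max())
```

Applied column by column, this keeps every row while pulling extreme values back to the whisker limits, which is what "capping" means above.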
Conclusion of EDA:
• Price – this variable gives a continuous output: the price of the cubic zirconia stones. This will be our target variable.
• We drop the first column, 'Unnamed: 0', as it is not important for our study, which leaves the dataset with 26967 rows and 10 columns.
• Only 'depth' has missing values (697), which we impute with its median.
• There are a total of 34 duplicate rows, found using the .duplicated() function; we drop these duplicates.
• After dropping the duplicates, the shape of the dataset is 26933 rows and 10 columns.
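The median imputation of 'depth' mentioned above can be sketched as follows (tiny synthetic column for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"depth": [61.0, 62.5, np.nan, 61.8, np.nan]})

median_depth = df["depth"].median()           # robust to outliers, unlike the mean
df["depth"] = df["depth"].fillna(median_depth)

print(df["depth"].isna().sum())
```

The median is preferred here precisely because the EDA found outliers in several columns: a few extreme values would drag the mean but leave the median untouched.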
1.2. Impute null values if present, also check for the values which are
equal to zero. Do they have any meaning, or do we need to change
them or drop them? Do you think scaling is necessary in this case?
Checking the object-type columns for gibberish values, we find none: there is no gibberish or missing data in the object-type columns Cut, Colour and Clarity.
Observation
In this dataset the variables are on different scales: price is in the thousands, depth and table are in the tens to hundreds, and carat is below ten. Scaling or standardising the data allows each variable to be compared on a common scale. With data measured in different units or on different scales (as here, with different means and variances), this is an important preprocessing step if the results are not to be dominated by the variables with large variances.
But is scaling strictly necessary in this case? No: for ordinary linear regression we get an equivalent solution whether or not we apply a linear scaling. It is still recommended, however, because it helps gradient-descent-based methods converge faster and reach the global minimum. When the number of features is large, scaling helps the model run quickly; without it, the starting point can be very far from the minimum.
For now, we will build the model with scaling for faster output.
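Standardisation (zero mean, unit variance per column) can be done with scikit-learn's `StandardScaler`; a small sketch on made-up carat/depth/price rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns on wildly different scales: carat, depth, price (synthetic values)
X = np.array([[0.5, 60.0, 1000.0],
              [1.0, 62.0, 5000.0],
              [1.5, 64.0, 9000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract column mean, divide by column std

print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Note that in a real pipeline the scaler is fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.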
1.3 Encode the data (having string values) for Modelling. Split the
data into train and test (70:30). Apply Linear regression using scikit
learn. Perform checks for significant variables using appropriate
method from stats model. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare,
RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning
Let's look at the object-type variables in the dataset. We have Cut, Colour and Clarity as objects; we need to convert these categorical variables into dummy variables.
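The dummy-variable conversion is a one-liner with `pd.get_dummies`; a sketch on a tiny synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Fair"],
    "color": ["E", "G", "J"],
    "carat": [0.5, 0.9, 1.2],
})

# One-hot encode the object columns; drop_first avoids the dummy-variable trap
# (perfect multicollinearity among the dummies of one category)
encoded = pd.get_dummies(df, columns=["cut", "color"], drop_first=True)
print(sorted(encoded.columns))
```

After this step every column is numeric, which is what both scikit-learn and statsmodels expect.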
Let's perform a final information check on the dataset before splitting the data.
Figure 3.2-Data Types-Check
Observation:
We have already established that several variables (carat, x, y and z) demonstrate strong correlation, i.e. multicollinearity. So, before proceeding with the linear regression model creation, we need to remove them from the model-creation exercise: we drop x, y and z from the linear regression model-creation step.
Splitting the data into train and Test and applying regression:
The coefficient of variation (CV) is a relative measure of variability that indicates the size of a standard deviation relative to its mean.
R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, while 100% indicates that it explains all of it. In this regression model, the R-squared values on the training and test data are 0.917 and 0.914 respectively.
Figure 4.1-RMSE Score-Test Data
Observation
As the training and test data scores are almost in line, we can conclude this is a right-fit model.
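The split, fit and R-squared checks above follow the standard scikit-learn pattern; a self-contained sketch on synthetic data (one stand-in feature for carat, with an invented price relationship):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0.2, 2.0, (500, 1))           # stand-in for carat
y = 4000 * X[:, 0] + rng.normal(0, 300, 500)  # price rises with carat (synthetic)

# 70:30 split, as in the report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)  # R-squared on training data
r2_test = model.score(X_test, y_test)     # R-squared on held-out data

print(round(r2_train, 3), round(r2_test, 3))
```

Comparable train and test R-squared values, as seen here and in the report's 0.917/0.914, are the signature of a right-fit model.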
Let's perform linear regression using statsmodels. Concatenating X and y into a single data frame and specifying the dependent and independent variables, we get:
Figure 4.6-
Observation: The RMSE score remains almost the same when the regression is done using statsmodels.
Let's compare other models and determine which is optimal for this dataset.
We create four models using ANN, Decision Tree, Random Forest and Linear Regression, then check the train and test RMSE and scores.
Figure 4.8-Consolidated Scores
Using grid search, we can analyse the four models.
Figure 4.9-Consolidated Scores-Grid Search
Observation:
All the models are well within the 10 percent limit, and all of them appear to fit.
The RMSE score decides the best model: the lower the RMSE, the better the model.
Of the four models, the ANN regressor has the lowest RMSE score and is hence the best fit for this dataset.
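The comparison loop can be sketched as below, here with just two of the four model families to keep it short (synthetic data; RMSE values are illustrative, not the report's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.uniform(0.2, 2.0, (400, 1))
y = 4000 * X[:, 0] + rng.normal(0, 300, 400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=1),
}
rmse = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    rmse[name] = np.sqrt(mean_squared_error(y_te, m.predict(X_te)))

best = min(rmse, key=rmse.get)  # lower test RMSE = better model
print(best, {k: round(v) for k, v in rmse.items()})
```

Adding the Decision Tree and an MLP/ANN regressor to the `models` dict extends this to the four-way comparison in Figure 4.8.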
Facts:
We have a dataset with strong correlation between the independent variables, so we need to tackle multicollinearity, which can hinder model performance.
As a result, while creating the model, certain independent variables displaying multicollinearity, or with no direct relation to the target variable, were dropped.
Based on the model created for this test case, some of the key variables likely to positively drive price are (top 5 in descending order):
Carat
Clarity_IF
Clarity VVS_1
Clarity VVS_2
Clarity_vs1
Recommendations:
Data Description
1. Target: Holliday_Package – opted for a holiday package (yes/no)
2. Salary: employee salary
3. age: age in years
4. educ: years of formal education
5. no_young_children: number of young children (younger than 7 years)
6. no_older_children: number of older children
7. foreign: foreigner (yes/no)
Dataset Sample:
Table 1.1- Dataset Sample
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and
do null value condition check, write an inference on it. Perform
Univariate and Bivariate Analysis. Do exploratory data analysis.
Dataset type:
Table 1.2: Dataset Information
Shape of Data:
• Holliday_Package – this is a categorical variable and will be our target variable.
• Salary ranges from 1322 to 236961. The average salary is around 47729 with a standard deviation of 23418, indicating that the data is not normally distributed. A skew of 0.71 indicates the data is right skewed: few employees earn more than the average of 47729. 75% of the employees earn below 53469, while 25% earn below 35324.
• Age of the employees ranges from 20 to 62, with a median of around 39. 25% of the employees are below 32 and 25% are above 48. The standard deviation of around 10 indicates an almost normal distribution.
• Years of formal education range from 1 to 21. 25% of the population has 8 or fewer years of formal education, the median is around 9 years, and 75% of the employees have 12 or fewer years. The standard deviation is around 3, and this variable also shows skewness.
• We have dropped the first column, 'Unnamed: 0', as it simply holds serial numbers and is not important for our study.
Checking the unique values for Numerical variables we get the following output:
Figure1.1: Data set-Unique Check-Numerical
Observation:
We can observe that 54% of the employees did not opt for the holiday package and 46% did, which implies a fairly balanced dataset.
We can also observe that 75% of the employees are not foreigners and 25% are.
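Class-balance figures like these come from `value_counts(normalize=True)`; a sketch with a synthetic column built to match the 54/46 split quoted above:

```python
import pandas as pd

# Synthetic target column reproducing the reported 54% / 46% split
df = pd.DataFrame({"Holliday_Package": ["no"] * 54 + ["yes"] * 46})

balance = df["Holliday_Package"].value_counts(normalize=True)
print(balance.round(2).to_dict())
```

A split this close to even means no resampling or class-weighting is needed before fitting the classifiers.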
Univariate Analysis
Salary
Figure1.4 -Univariate- Salary
Education
Figure1.5 -
Univariate- Education
Observation: It looks like a large share of the population has no young children.
Observation: It looks like a large share of the population has no older children.
Bivariate Analysis
Figure1.8-Bivariate Analysis
From the heat map that follows, we get a clear picture of the correlation between the columns.
Figure1.9-Heat Map
Observation: There is no strong correlation between any variables. Salary and education display moderate correlation, and no_older_children is somewhat correlated with salary. However, there are no strong correlations in the dataset.
Outliers
Data before Outliers:
Figure2.0-Data before Outliers
We can observe significant outliers in the 'Salary' variable, and minimal outliers in 'educ', 'no_young_children' and 'no_older_children'; there are no outliers in 'age'. For interpretation purposes we need to study the children variables before outlier treatment, so for this case study we have performed outlier treatment only on Salary and educ.
Dataset after outlier treatment:
2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis).
Holliday_Package: The distribution seems to be fine, with 54% for no and 46% for yes.
Figure2.3-Dataset-foreign
Foreign: The distribution seems fine and fit for analytical purposes.
Both variables can be encoded into numerical values for model creation.
The no_young_children variable, although numeric, shows a varied distribution between one and two children in a bivariate analysis with the dependent variable. It is therefore advisable to treat this variable as categorical and encode it.
Looking at the table above, there does not seem to be much variation in the distribution for employees with more than 0 older children; the distributions across the Holliday_Package classes are close to identical. For this test case, I don't think this variable will be an important factor in the model or the analysis, and hence I drop it from the model-building process.
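The encoding steps just described can be sketched as follows (tiny synthetic frame; column names follow the dataset, values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Holliday_Package": ["yes", "no", "no", "yes"],
    "foreign": ["no", "yes", "no", "no"],
    "no_young_children": [0, 1, 2, 0],
})

# Map the two binary object columns to 0/1
for col in ["Holliday_Package", "foreign"]:
    df[col] = df[col].map({"no": 0, "yes": 1})

# Treat the child count as categorical and one-hot encode it
df = pd.get_dummies(df, columns=["no_young_children"], drop_first=True)
print(df.dtypes.to_dict())
```

No scaling is applied, per the question's instruction; only string-valued and re-categorised columns are converted to numbers.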
Data Split
Figure2.7-Data Drop
Figure2.8-Data drop-X
Figure2.9-Data drop-Y
Split X and y into training and test sets in a 70:30 ratio: 70% of the data will be used for training and the remaining 30% for testing.
Figure3.0-Data Split
Figure3.1-LDA Model
Logistic Regression
Applying the LR model using the formula below:
Figure3.2-LR Model
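Both models fit with a single call each in scikit-learn; a self-contained sketch on synthetic stand-in data (the salary/age features and the age-based opt-in rule are invented purely for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# Synthetic features: salary in thousands, age in years
X = np.column_stack([rng.normal(47, 20, 300), rng.uniform(20, 62, 300)])
y = (X[:, 1] < 45).astype(int)  # hypothetical rule: younger employees opt in

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(round(lda.score(X_te, y_te), 2), round(logit.score(X_te, y_te), 2))
```

Since both methods learn a linear decision boundary, their scores on the same split tend to be close, which is also what the report observes on the real data.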
Figure3.5-
LDA Confusion Matrix-(Train and Test)
ROC Curve-Train-LDA
Figure3.6-LDA- ROC curve-Train
ROC Curve-Test-LDA
Observation:
The accuracy and AUC scores are not very different, so the models can be considered right-fit, avoiding underfitting and overfitting.
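The metrics behind these observations (confusion matrix and ROC_AUC) come from `sklearn.metrics`; a sketch on synthetic two-feature data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(0, 1, (300, 2))
# Synthetic binary target with a noisy linear boundary
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = LogisticRegression().fit(X_tr, y_tr)

cm = confusion_matrix(y_te, clf.predict(X_te))            # rows: truth, cols: prediction
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # AUC needs probabilities
print(cm, round(auc, 2))
```

Note that `roc_auc_score` takes the positive-class probability, not the hard 0/1 predictions; computing it on both train and test splits gives the pair of ROC curves shown for each model.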
LR Model-Training and Test Data Classification report
Coefficient of variable-LR
Figure3.8-LR Coefficient Variable
Confusion Matrix-LR (Train and Test Data)
ROC Curve-Train-LR
Observation:
The accuracy and AUC scores are not very different, so the models can be considered right-fit, avoiding underfitting and overfitting.
Both models – Logistic Regression and LDA – offer almost similar results.
For this case study, however, I have chosen to proceed with logistic regression, as it is easier to implement and interpret and very efficient to train. Also, our dependent variable follows a binary classification of classes, so it is ideal for us to rely on the logistic regression model to study the test case at hand.
2.4 Inference: Basis on these predictions, what are the insights and
recommendations
Facts:
We started this test case by looking at data correlations to identify early trends and patterns. At one stage, salary and education seemed to be important parameters that might play out as important predictors.
While performing the bivariate analysis, we observe that salary is similar for employees who opt for the holiday package and those who do not; however, the distribution is more spread out for people not opting for holiday packages.
There are no outliers in age, and the distribution of age with respect to the holiday package is also similar in nature; the age range of people not opting for the package is more spread out than that of people opting for it.
We can clearly see that employees in the middle age range (34 to 45 years) go for the holiday package more than older and younger employees.
There is a significant difference between employees with young children who opt for the holiday package and those who do not; people with young children clearly tend not to opt for holiday packages.
We identified that the number of young children has a varied distribution and may end up playing an important role in the model-building process. Employees with older children show an almost identical distribution for opting and not opting across the number-of-children levels, so I do not consider it an important predictor and did not include it in the model-building process.
For this test case, I have chosen logistic regression as the better model for interpretation and analytical purposes.
Recommendations:
The company should focus on foreigners to drive sales of its holiday packages, as that is where the majority of conversions will come from; directing marketing efforts and offers toward foreigners should yield better conversion rates.
The company should avoid targeting parents with young children: the chance of selling to parents with two young children is probably the lowest, which fits with the fact that parents tend to avoid travelling with young children.
If the firm wants to target parents with older children, that may still give a more favourable return on its marketing spend than targeting couples with young children.