Regression Project

This document analyzes two datasets to build predictive models. For the first dataset, exploratory data analysis is performed to understand the data. Linear regression models are built on train-test splits and compared based on performance metrics. The best model is selected. For the second dataset, logistic regression and LDA models are built after encoding categorical variables. Model performance is evaluated using accuracy, confusion matrices and ROC curves. The best model is identified. Business insights and recommendations are provided based on the predictions from both analyses.

SMDM Project- Predictive Modeling

Prabhu.S-Oct Batch
Table of Contents
Executive Summary-1
Introduction
Data Description
Sample of the dataset
Information of the dataset
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.
Summary of Dataset
Univariate Analysis
Unique Values
Carat
Depth
Table
X Column
Y Column
Z Column
Price Column
Bivariate Analysis
Correlation Matrix
Price Vs Cut
Price Vs Colour
Price Vs Clarity
Outlier Treatment
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning, or do we need to change them or drop them? Do you think scaling is necessary in this case?
1.3 Encode the data (having string values) for modelling. Split the data into train and test (70:30). Apply linear regression using scikit-learn. Perform checks for significant variables using the appropriate method from statsmodels. Create multiple models and check the performance of predictions on train and test sets using R-square, RMSE and adjusted R-square. Compare these models and select the best one with appropriate reasoning.
1.4 Inference: Based on these predictions, what are the business insights and recommendations?
Executive Summary-2
Introduction
Data Description
Dataset Sample
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check, and write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
Dataset Type
Shape of Data
Checking for Null Values
Checking for Duplicates
Exploratory Data Analysis
Univariate Analysis
Salary
Education
Children (Young and Old)
Bivariate Analysis
Outliers
2.2 Do not scale the data. Encode the data (having string values) for modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
Encoding the Data
Data Split
LDA (Linear Discriminant Analysis)
Logistic Regression
2.3 Performance Metrics: Check the performance of predictions on train and test sets using accuracy and the confusion matrix; plot the ROC curve and get the ROC_AUC score for each model. Final Model: Compare both models and write an inference on which model is best/optimized.
LDA Model - Training and Test Data Classification Report
Coefficients of Variables - LDA
Confusion Matrix - LDA (Train and Test Data)
ROC Curve - Train - LDA
ROC Curve - Test - LDA
LR Model - Training and Test Data Classification Report
Coefficients of Variables - LR
ROC Curve - Train - LR
ROC Curve - Test - LR
2.4 Inference: Based on these predictions, what are the insights and recommendations?
Table Details - Problem 1

Table: Description
1.1 Dataset Sample
1.2 Dataset Information
1.3 Summary of Dataset-1
1.4 Summary of Dataset-2

Figure Details - Problem 1

Figure: Description
1.1 Summary of Dataset - Object Variables
1.2 Summary of Dataset - Numerical Variables
1.3 Univariate - Depth
1.4 Univariate - Table
1.5 Univariate - X Column
1.6 Univariate - Y Column
1.7 Univariate - Z Column
1.8 Univariate - Price Column
1.9 Bivariate Analysis
2.0 Correlation Matrix
2.1 Price Vs Cut
2.2 Price Vs Colour
2.3 Price Vs Clarity
2.4 Outlier Treatment - Before
2.5 Outlier Treatment - After
2.6 Data Check
2.7 Missing Data - Before
2.8 Missing Data - After
2.9 Object Type Variables
3.0 Dummy Variables
3.1 Missing Values - Check
3.2 Data Types - Check
3.3 Correlation Matrix
3.4 Dataset - After Drop
3.5 Data Split
3.6 Coefficients of the Model
3.7 Intercept of the Model
3.8 Model Score (X and Y Train)
3.9 Model Score (X and Y Test)
4.0 RMSE Score - Train Data
4.1 RMSE Score - Test Data
4.2 Concatenated Dataset
4.3 Stats Model Fit
4.4 Stats Model Summary
4.5 Stats Model RMSE Values
4.6 Predicted vs Actual Price
4.7 Regression Equation
4.8 Consolidated Scores
4.9 Consolidated Scores - Grid Search
Table Details - Problem 2

Table: Description
1.1 Dataset Sample
1.2 Dataset Information
1.3 Dataset Shape
1.4 Null Value Check
1.5 Duplicate Check
1.6 Dataset - Updated
1.7 Dataset - Information

Figure Details - Problem 2

Figure: Description
1.1 Dataset - Unique Check - Numerical
1.2 Dataset - Unique Check - Categorical
1.3 Dataset - Holiday Package and Foreign
1.4 Univariate - Salary
1.5 Univariate - Education
1.6 Plot Check - Young Children
1.7 Plot Check - Older Children
1.8 Bivariate Analysis
1.9 Heat Map
2.0 Data Before Outlier Treatment
2.1 Data After Outlier Treatment
2.2 Dataset - Holiday Package
2.3 Dataset - Foreign
2.4 Dataset - Holiday Package & No. of Young Children
2.5 Dataset - Holiday Package & No. of Older Children
2.6 Dummy Variables
2.7 Data Drop
2.8 Data Split - X
2.9 Data Split - Y
3.0 Data Split
3.1 LDA Model
3.2 LR Model
3.3 LDA Classification Report
3.4 LDA Coefficients of Variables
3.5 LDA Confusion Matrix (Train and Test)
3.6 LDA ROC Curve - Train
3.7 LDA ROC Curve - Test
3.8 LR Classification Report
3.9 LR Confusion Matrix (Train and Test)
4.0 LR ROC Curve - Train
4.1 LR ROC Curve - Test
4.2 LDA vs LR
Executive Summary-1
You are hired by Gem Stones Co. Ltd, a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (cubic zirconia is an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits on different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish between more profitable and less profitable stones and achieve a better profit share. Also, provide them with the five attributes that are most important.

Introduction
The purpose of this exercise is to explore the dataset and understand its nature. Check and explore the various parameters used. Also, understand the outliers and analyse their effect. Use various models.

Data Description
Sample of the dataset:
Table 1.1 Dataset Sample

There are 26,967 rows and 11 columns in the dataset.

Information of the dataset:

Table 1.2 Dataset information

We can see that the dataset contains several object and float columns.

1.1 Read the data and do exploratory data analysis. Describe the
data briefly. (Check the null values, Data types, shape, EDA, duplicate
values). Perform Univariate and Bivariate Analysis.

Summary of Data set


Table 1.3 summary of Dataset-1

Checking for duplicates: there are 34 duplicate rows; after removing them, the dataset has 26,933 rows and 11 columns.

The redundant column "Unnamed: 0" has been dropped. The final dataset looks like the one below.
Table 1.4 summary of Dataset-2
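As a sketch, the de-duplication and column drop can be done with pandas as below. This is a minimal illustration on a toy frame (the real dataset is read from a CSV; the values here are hypothetical):

```python
import pandas as pd

# Toy stand-in for the cubic zirconia data; values are hypothetical
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "carat": [0.30, 0.30, 0.70, 1.10],
    "price": [500, 500, 2000, 6000],
})

# Drop the redundant serial-number column first, then remove exact duplicates
df = df.drop(columns=["Unnamed: 0"])
n_dups = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
print(n_dups, df.shape)
```

Note that the serial-number column has to be dropped before checking for duplicates, otherwise every row looks unique.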

Carat:- This is an independent variable, ranging from 0.2 to 4.5. The mean value is around 0.8, and 75% of the stones are at or below 1.05 carat. The standard deviation is around 0.477, which, together with the right-tailed distribution, shows that the data is skewed: the majority of the stones are of lower carat, and there are very few stones above 1.05 carat.

Depth:- The percentage height of the cubic zirconia stones is in the range 50.80 to 73.60. The average height is 61.80; 25% of the stones are below 61 and 75% are below 62.5. The standard deviation of the height is 1.4, indicating an approximately normal distribution.

Table:- The percentage width of the cubic zirconia stones is in the range 49 to 79. The average is around 57; 25% of the stones are below 56 and 75% have a width of less than 59. The standard deviation is 2.24. Thus the data is not normally distributed and, similar to carat, most of the stones have less width; this also indicates that outliers are present in the variable.

Price:- Price is the target (predicted) variable. The mean price is around 3938 and the maximum is 18818. The median price of the stones is 2375, 25% of the stones are priced below 945, and 75% are priced below 5356. The standard deviation of the price is 4022, indicating that the prices of the majority of the stones are in the lower range, as the distribution is right skewed.

Univariate Analysis
Let’s perform univariate analysis across all variables and understand their characteristics.

Unique Values:
Analysing the unique values for all the columns, we get the following:

Object Variables:
Figure 1.1 summary of Dataset-Object Variables
Numerical Variables

Figure 1.2 summary of Dataset -Numerical Variables

Carat
Figure 1.3 Univariate Analysis-Carat

Observation: The carat variable has outliers and is skewed.

Depth

Figure 1.3 Univariate - Depth

Observation: The depth variable has outliers and is centrally distributed.

Table

Figure 1.4 Univariate -Table


Observation: The table variable has multiple outliers and is slightly skewed.

X- Column

Figure 1.5 Univariate X column

Observation: The X column has outliers and is slightly skewed.

Y- Column

Figure 1.6 Univariate Y Column

Observation: The Y column has outliers and is skewed.


Z column

Figure 1.7 Univariate Z Column

Observation: The Z column has multiple outliers and is skewed.

Price column

Figure 1.8 Univariate Price column

Observation: The price column has no outliers and is slightly skewed.


Bivariate Analysis

To analyse the relationship of each variable with the others, we perform a bivariate analysis.
Figure 1.9 Bivariate Analysis

Correlation Matrix
Figure 2.0 Correlation Matrix

Observation: The bivariate analysis and the correlation matrix suggest the following:

 There is high correlation among features such as carat, x, y, z and price.
 Table shows weak correlation with the other features.
 Depth is negatively correlated with most of the other features, except carat.
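A correlation matrix of this kind comes from `DataFrame.corr()`. The snippet below reproduces the pattern described above on synthetic data, so the numbers are illustrative, not the report's actual figures:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, 200)
df = pd.DataFrame({
    "carat": carat,
    "x": 6.0 * carat + rng.normal(0, 0.1, 200),        # dimensions track carat
    "price": 4000.0 * carat + rng.normal(0, 300, 200),  # price tracks carat
    "table": rng.normal(57, 2.2, 200),                  # roughly independent
})

corr = df.corr()
print(corr.round(2))
# The matrix can be visualised with seaborn: sns.heatmap(corr, annot=True)
```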

Also, let's understand how price varies with cut, colour and clarity.

Price Vs Cut
Comparing price across the various cut types, we get the following:

Figure 2.1 Price Vs Cut

Observation:
 For the cut variable, the most sold are Ideal cut gems and the least sold are Fair cut gems.
 All cut types have outliers with respect to price.
 Ideal cut gems seem to be slightly less priced, while Premium cut gems seem to be slightly more expensive.
Price Vs Colour

Figure 2.2 Price Vs Colour

 For the colour variable, the most sold are G coloured gems and the least sold are J coloured gems.
 All colour types have outliers with respect to price.
 However, the least priced seem to be E coloured gems; J and I coloured gems seem to be more expensive.

Price Vs Clarity
Figure 2.3 Price Vs Clarity

 For the clarity variable, the most sold are SI1 clarity gems and the least sold are I1 clarity gems.
 All clarity types have outliers with respect to price.
 SI1 clarity gems seem to be slightly less priced; VS2 and SI2 clarity stones seem to be more expensive.

Outlier Treatment

From the univariate analysis we understand that there are a few outliers in the dataset.

Before Treating the Outliers

Figure 2.4 Outlier Treatment-Before

From the dataset and the univariate analysis, we see that outliers are present in all the variables.

To treat these outliers, we use the capping method: outlier values are capped at an upper and a lower limit, so that no value lies above or below those limits.

After Treating the outliers, we get the following


Figure 2.5- Outlier Treatment-After

Observation:

Using the capping method, we treated the outliers by capping them at the chosen limits.
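One common way to implement such capping is with the 1.5×IQR whiskers used by box plots; the sketch below assumes that is the rule applied here:

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Cap values outside the 1.5*IQR whiskers at the whisker limits."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

s = pd.Series([1, 2, 2, 3, 3, 4, 50])  # 50 is an obvious outlier
capped = cap_outliers(s)
print(capped.max())
```

Here q1 = 2.0 and q3 = 3.5, so the upper cap is 3.5 + 1.5 × 1.5 = 5.75 and the value 50 is pulled down to it.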

Conclusion of EDA:

• Price – This variable gives a continuous output with the price of the cubic zirconia stones. This will be our target variable.

• Carat, depth, table, x, y and z are numerical or continuous variables.

• Cut, clarity and colour are categorical variables.

• We will drop the first column, 'Unnamed: 0', as it is not important for our study, which leaves the dataset with 26967 rows and 10 columns.

• Only the 'depth' column has 697 missing values, which we will impute with its median.

• There are a total of 34 duplicate rows, as computed using the .duplicated() function. We will drop the duplicates.

• Upon dropping the duplicates, the shape of the dataset is 26933 rows and 10 columns.

1.2. Impute null values if present, also check for the values which are
equal to zero. Do they have any meaning, or do we need to change
them or drop them? Do you think scaling is necessary in this case?
Checking for any gibberish data in the object-type columns, we get the following:

Figure 2.6- data Check

There is no gibberish or missing data in the object type data columns – Cut, colour and clarity

Checking for Missing values we get the following

Figure 2.7- Missing data-Before

 We understand that there are missing values in the depth column.
 From the earlier table, we also have zero values in the x, y and z columns; using an imputer we can impute those values too.
 We impute the missing values using SimpleImputer with the median strategy.

After Imputing we get the following results


Figure 2.8- Missing data-After
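The imputation step can be sketched as follows: zeros in the physical dimensions are first marked as missing, then `SimpleImputer` fills all gaps with the column median (the column names follow the report; the values are toy data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy slice: a NaN in depth, an impossible zero dimension in x
df = pd.DataFrame({
    "depth": [61.0, np.nan, 62.0, 61.5],
    "x": [4.0, 5.0, 0.0, 4.5],
})

# A zero width/height is physically meaningless for a stone, so treat it as missing
df["x"] = df["x"].replace(0, np.nan)

imp = SimpleImputer(strategy="median")
df[["depth", "x"]] = imp.fit_transform(df[["depth", "x"]])
print(df)
```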

Observation

 All the missing values have been imputed.
 Values which were zero have been replaced and imputed.
 Scaling: scaling or standardizing the features around the centre (mean 0, standard deviation 1) is important when we compare measurements that have different units. Variables measured at different scales do not contribute equally to the analysis and might end up creating a bias.

 In this dataset, all the variables are on different scales: price is in the thousands, while depth and table are in the tens and carat is below ten. So it is necessary to scale or standardise the data to allow each variable to be compared on a common scale. With data measured in different "units" or on different scales (as here, with different means and variances), this is an important preprocessing step if the results are to be meaningful and not dominated by the variables with large variances.

 But is scaling necessary in this case? No, it is not strictly necessary: for linear regression we get an equivalent solution whether or not we apply a linear scaling. It is still recommended for regression techniques, because it helps gradient descent converge quickly and reach the global minimum. When the number of features becomes large, scaling also helps the model run quickly; otherwise the starting point can be very far from the minimum.

 For now, we will build the model with scaling for faster output.
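Standardisation as described above centres each feature at 0 with unit standard deviation; a minimal sketch with scikit-learn's `StandardScaler` (toy values standing in for carat and price):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, e.g. carat vs price
X = np.array([[0.3, 945.0],
              [0.8, 2375.0],
              [1.4, 5356.0]])

# Each column becomes mean 0, standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```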
1.3 Encode the data (having string values) for Modelling. Split the
data into train and test (70:30). Apply Linear regression using scikit
learn. Perform checks for significant variables using appropriate
method from stats model. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare,
RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning

Let's understand which object-type variables we have in the dataset.

Figure 2.9-Object Type Variable

We have cut, colour and clarity as objects. We need to convert these categorical variables into dummy variables.

Converting the categorical variables to dummy variables, we get the following output:

Figure 3.0-Dummy Variables

The total number of columns has increased to 24
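Dummy encoding of this kind is typically done with `pd.get_dummies`; a sketch on a toy frame (using `drop_first=True` to avoid the dummy-variable trap in linear models — whether the report did so is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Fair"],
    "carat": [0.3, 0.7, 1.1],
})

# One-hot encode the categorical column; drop the first level as the baseline
encoded = pd.get_dummies(df, columns=["cut"], drop_first=True)
print(list(encoded.columns))
```

Each categorical level becomes its own 0/1 column, which is why the column count grows after encoding.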

Performing a final check on the missing values we get the following


Figure 3.1-Missing Variables-Check

There are no missing values in the data set.

Let's perform a final information check on the dataset before splitting the data.
Figure 3.2-Data Types-Check

Checking for the correlation between the variables, we get:

Figure 3.3-Correlation Matrix

Observation:
 There are several variables (carat, x, y and z) demonstrating strong correlation, i.e. multicollinearity. So, before proceeding with the linear regression model creation, we need to remove some of them from the model creation exercise.
 We drop x, y and z from the linear regression model creation step.

Dropping the x, y and z columns, we get the following output:

Figure 3.4-Data Set- After Drop

Splitting the data into train and test sets and applying regression:

Figure 3.5-Data split
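The 70:30 split and scikit-learn fit can be sketched as below; synthetic data stands in for the encoded diamond features, so the scores printed here are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0.2, 2.0, (300, 1))             # stand-in for carat
y = 4000.0 * X[:, 0] + rng.normal(0, 300, 300)  # price-like target

# 70:30 train-test split, as in the report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
print(model.score(X_train, y_train), model.score(X_test, y_test), rmse)
```

`score` here is the R-square used later in the report; RMSE is taken on the held-out test set.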

Let's find the coefficients of the model.

Figure 3.6-Coefficients of the Model

The regression coefficients indicate the expected change in price for a one-unit change in each feature, holding the other features constant.

We will find out the intercept of the model:

Figure 3.7-Intercept of data set

The score of the model is as follows

Regression model Score- Training Data


Figure 3.8-Model Score (X and Y Train)

Regression model Score-Test Data

Figure 3.9-Model Score (X and Y Test)

RMSE score of Training data:

Figure 4.0-RMSE Score-Train Data

RMSE score of Test Data:

Figure 4.1-RMSE Score-Test Data

R-squared always lies between 0% and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, while 100% indicates that the model explains all of it. In this regression model, the R-square values on the training and test data are 0.917 and 0.914 respectively.

Observation

Linear regression Performance Metrics:

 Intercept of the model: 224.200
 R-square on training data: 0.9171895350186483
 R-square on testing data: 0.9145148930357136
 RMSE on training data: 1159.8896347641783
 RMSE on testing data: 1170.4884135275267

As the training and testing scores are closely aligned, we can conclude that this is a right-fit model (neither over- nor under-fitted).

Let's perform linear regression using statsmodels:

By concatenating the X and Y values into a single data frame, with the dependent and independent variables, we get:

Figure 4.2-Concatenated Dataset

Initializing the statsmodels OLS model and fitting it, we get:


Figure 4.3-Stats Model Fit

Checking for the summary of the regression we get


Figure 4.4-Stats Model summary

RMSE values are as follows:

Figure 4.5-Stats Model RMSE Values

Visualising the actual and linear-model output (predicted price), we get:

Figure 4.6-Predicted vs Actual Price

The final regression equation is as below:


Figure 4.7-Regression Equation

Observation: The RMSE score remains almost the same even when the regression is done using statsmodels.

Let's compare with other models and figure out which is optimal for this dataset:

 Create 4 models using ANN, Decision Tree, Random Forest and Linear Regression
 Check train and test RMSE
 Check train and test scores
Figure 4.8-Consolidated Scores
Checking using grid search, we can analyse the four models:
Figure 4.9-Consolidated Scores - Grid Search

Observation:

 All the models are well within the 10 percent limit, and all of them appear to fit well.
 The RMSE score decides the best model, i.e. the lower the RMSE score, the better the model.
 Out of the four models, the ANN regressor has the lowest RMSE score and is hence the best fit for this dataset.
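A grid-search comparison of candidate regressors can be sketched as below. Only two of the four model families are shown for brevity (the report also used an ANN, i.e. a neural-network regressor, and plain linear regression); the data and parameter grids are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.2, 2.0, (200, 1))
y = 4000.0 * X[:, 0] + rng.normal(0, 300, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

candidates = {
    "tree": (DecisionTreeRegressor(random_state=0), {"max_depth": [3, 5]}),
    "forest": (RandomForestRegressor(random_state=0), {"n_estimators": [50, 100]}),
}
rmse_scores = {}
for name, (est, grid) in candidates.items():
    gs = GridSearchCV(est, grid, cv=3, scoring="neg_root_mean_squared_error")
    gs.fit(X_tr, y_tr)
    rmse_scores[name] = -gs.score(X_te, y_te)  # test RMSE; lower is better
print(rmse_scores)
```

The model with the lowest held-out RMSE is then selected, which is the comparison logic applied above.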

1.4 Inference: Based on these predictions, what are the business insights and recommendations?

Facts:

 We have a dataset with strong correlation between independent variables, and hence we need to tackle the issue of multicollinearity, which can hinder model performance.
 As a result, while creating the model, certain independent variables displaying multicollinearity, or with no direct relation to the target variable, have been dropped.
 For the business, based on the model created for this test case, some of the key variables that are likely to positively drive price are (top 5 in descending order):

 Carat
 Clarity_IF
 Clarity VVS_1
 Clarity VVS_2
 Clarity_vs1
Recommendations:

 Carat is a strong predictor of the overall price of the stone.
 Clarity refers to the absence of inclusions and blemishes and has emerged as a strong predictor of price as well. Stones of clarity types IF, VVS_1, VVS_2 and vs1 are helping the firm command higher prices.
 Stones of colours such as H, I and J won't help the firm command higher prices.
 The company should instead focus on stones of colour D, E and F to command relatively higher price points and support sales. This also indicates that the company could look to come up with new colour stones, such as clear stones or a different/unique colour, that impact the price positively.
 The company should focus on the stones' carat and clarity in order to increase their prices. Ideal customers will also contribute to more profits. Marketing efforts can educate customers about the importance of a better carat score and of the clarity index. After this, the company can create segments and target customers based on their income/paying capacity, which can be studied further.
The End
Executive Summary-2
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn't. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important
factors on the basis of which the company will focus on particular employees to sell their packages.
Introduction
The purpose of this exercise is to explore the dataset and understand its nature, and to check and explore the various parameters used. Based on the data, we need to provide recommendations by comparing Logistic Regression and Linear Discriminant Analysis models, understanding each model, and providing a solution and recommendations.

Data Description
1. Target: Holiday_Package – opted for holiday package: yes/no
2. Salary – employee salary
3. Age – age in years
4. Educ – years of formal education
5. No_young_children – number of young children (younger than 7 years)
6. No_older_children – number of older children
7. Foreign – foreigner: yes/no

Dataset Sample:
Table 1.1- Dataset Sample

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and
do null value condition check, write an inference on it. Perform
Univariate and Bivariate Analysis. Do exploratory data analysis.

Dataset type:
Table 1.2: Dataset Information

Shape of Data:

Table 1.3: Dataset shape

Observation: There are 872 Rows and 8 Columns

Checking for Null Values:

Table 1.4: Null Value check

Observation: There are no NULL values in the data set.


Checking for Duplicates:

Table 1.5: Duplicate check

Observation: There are no duplicates in the dataset

Removing the unnamed column we get the dataset:

Table 1.6: Data set-updated

Information of the dataset:

Table 1.7: Data set-information


Summary of the Dataset

• Holiday Package – This is a categorical variable and will be our target variable.

• Salary, age, educ, no_young_children and no_older_children are numerical or continuous variables.

• Salary ranges from 1322 to 236961. The average salary of employees is around 47729, with a standard deviation of 23418, indicating that the data is not normally distributed. A skew of 0.71 indicates that the data is right skewed and that few employees earn more than the average of 47729. 75% of the employees earn below 53469, while 25% earn below 35324.

• Age of the employee ranges from 20 to 62. Median is around 39. 25% of the employees are below
32 and 25% of the employees are above 48. Standard deviation is around 10. Standard deviation
indicates almost normal distribution.

• Years of formal education ranges from 1 to 21 years. 25% of the population has formal education
for 8 years, while the median is around 9 years. 75% of the employees have formal education of 12
years. Standard deviation of the education is around 3. This variable is also indicating skewness in
the data

• Foreign is a categorical variable

• We have dropped the first column ‘Unnamed: 0’ column as this is not important for our study.
Unnamed is a variable which has serial numbers so may not be required and thus it can be dropped
for further analysis.

The shape would be – 872 rows and 7 columns
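Dropping the serial-number column can be sketched as below, assuming the column is named 'Unnamed: 0' as described above (the toy frame is illustrative):

```python
import pandas as pd

# Illustrative frame with a serial-number column, as in the source data
df = pd.DataFrame({"Unnamed: 0": [1, 2, 3],
                   "Salary": [48000, 35000, 52000]})

df = df.drop(columns=["Unnamed: 0"])  # not needed for the study
print(df.shape)  # one column fewer than before
```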

• There are no null values

• There are no duplicates


Exploratory data analysis 

Checking for Unique values of numerical variables

Checking the unique values for Numerical variables we get the following output:
Figure1.1: Data set-Unique Check-Numerical

Checking for unique values for Categorical variables

Figure1.2: Dataset-Unique Check-Categorical

Figure1.3: Data set-Holiday set and Foreign

Observation:

 We can observe that 54% of the employees are not opting for the holiday package and 46%
are interested in the package. This implies we have a dataset which is fairly balanced
 We can observe that 75% of the employees are not Foreigners and 25% are foreigners
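Class shares like the 54%/46% split above are typically read off `value_counts(normalize=True)`; a minimal sketch on a toy target column (not the real data):

```python
import pandas as pd

# Toy target column standing in for Holliday_Package
packages = pd.Series(["no", "yes", "no", "no", "yes"])

share = packages.value_counts(normalize=True)  # class proportions
print(share)  # "no" has the larger share in this toy series
```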

Univariate Analysis
Salary
Figure1.4 -Univariate- Salary

The Salary column has outliers and is skewed.

Education

Figure1.5 -Univariate- Education

The Education column has a few outliers and is centrally distributed.
Children (Young and Old)

Checking the plot for young children we get the following:

Figure1.6-Plot check Young Children

Observation: It looks like a large portion of the population has no young children.

Checking the plot for older children we get the following:

Figure1.7-Plot check Older Children

Observation: It looks like a large portion of the population has no older children.
Bivariate Analysis

Figure1.8-Bivariate Analysis

The heat map that follows gives a clear picture of the correlation between the columns.

Figure1.9-Heat Map
Observation: There is no strong correlation between any of the variables. Salary and education display moderate correlation, and no_older_children is somewhat correlated with salary. However, there are no strong correlations in the dataset.

Outliers
Data before Outliers:
Figure2.0-Data before Outliers

We can observe that there are significant outliers present in the variable “Salary”, while there are minimal outliers in other variables such as ‘educ’, ‘no_young_children’ and ‘no_older_children’. There are no outliers in the variable ‘age’. For interpretation purposes we need to study the variables no_young_children and no_older_children before outlier treatment. For this case study we have done outlier treatment only for salary and educ.

Data After Outliers:

Figure2.1-Data after outliers
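One common treatment, and a plausible reading of the step above, is capping values beyond 1.5×IQR of the quartiles; a sketch on illustrative salary values (echoing the summary statistics quoted earlier, not the real column):

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Clip values lying beyond 1.5*IQR of the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Illustrative salaries; 236961 is an extreme value that gets capped
salary = pd.Series([1322, 35324, 47729, 53469, 236961])
capped = cap_outliers(salary)
print(capped.max())  # capped at q3 + 1.5*IQR
```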

2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis).

Encoding the data


In the given dataset, the target variable – Holliday Package and an independent variable – Foreign
are object variables. Let us study them one at a time.
Figure2.2-Dataset-Holiday Package

Holliday_Package: The distribution seems to be fine, with 54% for no and 46% for yes.

Figure2.3-Dataset-foreign

Foreign: The distribution seems to be fine, and fit for analytical purpose

Both the variables can be encoded into numerical values for model creation and analytical purposes.

Analysing Holiday package variable with No of Young Children


Figure2.4-Dataset-Holiday package & No of young children

This variable, although numeric, shows a varied distribution between the number of children being 1 and 2 when a bivariate analysis is done against the dependent variable. It is therefore advisable to treat this variable as categorical and encode it.

Analysing Holiday package variable with No of Old Children

Figure2.5-Dataset-Holiday package & No of Old children

Looking at the table above, there does not seem to be much variation in the distribution of data for children counts above 0. The values are close enough across the Holliday_Package classes, with an almost identical distribution. For this test case, I don’t think this variable will be an important factor while creating the model or for analysis purposes, and hence I will drop it from the model building process.

Converting the categorical variables to dummy variables we get the following output:


Figure2.6-Dummy Variables
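Dummy encoding of the two object variables can be sketched with `pd.get_dummies`; the toy frame below is illustrative, and `drop_first=True` keeps one dummy per binary variable:

```python
import pandas as pd

# Toy frame with the two object variables from the dataset
df = pd.DataFrame({
    "Holliday_Package": ["no", "yes", "no"],
    "foreign":          ["no", "no", "yes"],
})

# drop_first=True avoids redundant columns for binary categories
encoded = pd.get_dummies(df, drop_first=True)
print(list(encoded.columns))
```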

Data Split

Splitting the data we get the following:

Figure2.7-Data Drop

Figure2.8-Data drop-X

Figure2.9-Data drop-Y

Split X and y into training and test set in 70:30 ratio. This implies 70% of the total data will be used
for training purposes and remaining 30% will be used for test purposes
Figure3.0-Data Split
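The 70:30 split can be sketched with scikit-learn's `train_test_split`; the toy X and y below stand in for the encoded dataset, and `random_state` is an arbitrary choice for reproducibility:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative features and binary target, not the project data
X = pd.DataFrame({"Salary": range(100), "age": range(20, 120)})
y = pd.Series([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)  # 70 rows for train, 30 for test
```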

LDA (linear discriminant analysis).


Applying LDA Model using the below formula:

Figure3.1-LDA Model
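Fitting the LDA model can be sketched with scikit-learn's `LinearDiscriminantAnalysis`; the synthetic data from `make_classification` stands in for the encoded train split:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic binary-classification data, a stand-in for the train split
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(round(lda.score(X, y), 2))  # mean accuracy on the fitted data
```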

Logistic Regression
Applying LR Model using the below formula

Figure3.2-LR Model
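The logistic regression fit can be sketched the same way; again the synthetic data is a stand-in, and a higher `max_iter` is one way to avoid convergence warnings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data, a stand-in for the train split
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

lr = LogisticRegression(max_iter=1000)  # generous iteration budget
lr.fit(X, y)
print(lr.coef_.shape)  # one coefficient row per binary problem
```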

2.3 Performance Metrics: Check the performance of Predictions on


Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve
and get ROC_AUC score for each model Final Model: Compare Both
the models and write inference which model is best/optimized.
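The metrics named above (accuracy, confusion matrix, ROC_AUC) can be computed with scikit-learn; a minimal sketch on synthetic data, not the project dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data and a 70:30 split, mirroring the workflow in the text
X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]  # positive-class scores for ROC

print(accuracy_score(y_te, pred))
print(confusion_matrix(y_te, pred))
print(roc_auc_score(y_te, proba))
```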

LDA Model-Training and Test Data classification report


Figure3.3-LDA Classification report
Coefficient of variable-LDA

Figure3.4-LDA Coefficient Variable

Confusion Matrix-LDA (Train and Test Data)

Figure3.5-
LDA Confusion Matrix-(Train and Test)

ROC Curve-Train-LDA
Figure3.6-LDA- ROC curve-Train

ROC Curve-Test-LDA

Observation:

Model Accuracy scores: LDA

Training data: model.score(X_train, y_train) at 66.1%

Test data: model.score(X_test, y_test) at 66%

AUC Training LDA:0.74

AUC Test LDA:0.72

The accuracy scores and the AUC scores aren’t too different and can be considered as right fit
models avoiding the scenarios of underfit and overfit models.
LR Model-Training and Test Data Classification report

Figure3.8-LR Classification report

Coefficient of variable-LR
Figure3.8-LR Coefficient Variable
Confusion Matrix-LR (Train and Test Data)

Figure3.9-LR Confusion Matrix-(Train and Test)

ROC Curve-Train-LR

Figure4.0-LR- ROC curve-Train


ROC Curve-Test-LR

Figure4.1-LR- ROC curve-Test

Observation:

Model Accuracy scores: LR

Training data: model.score(X_train, y_train) at 66.1%

Test data: model.score(X_test, y_test) at 66%

AUC Training LR: 0.74

AUC Test LR: 0.72

The accuracy scores and the AUC scores aren’t too different and can be considered as right fit
models avoiding the scenarios of underfit and overfit models.

Lets compare both models:


Figure4.2-LDAVs LR

Both the models – Logistic Regression and LDA – offer almost similar results.

Though for this case study, I have chosen to proceed with Logistic Regression as it is easier to implement and interpret, and very efficient to train. Also, our dependent variable follows a binary classification of classes, and hence it is ideal to rely on the logistic regression model to study the test case at hand.

2.4 Inference: Basis on these predictions, what are the insights and
recommendations

Facts:

 We started this test case with looking at the data correlation to identify early trends and
patterns. At one stage, Salary and education seems to be important parameters which might
have played out as an important predictor
 While performing the bivariate analysis we observe that Salary for employees opting for
holiday package and for not opting for holiday package is similar in nature. However, the
distribution is fairly spread out for people not opting for holiday packages.
 There are no outliers present in age. The distribution of data for the age variable against holiday package is also similar in nature. The range of age for people not opting for the holiday package is more spread out when compared with people opting for it.
 We can clearly see that employees in middle range (34 to 45 years) are going for holiday
package as compared to older and younger employees
 There is a significant difference in employees with younger children who are opting for
holiday package and employees who are not opting for holiday package
 We can clearly see that people with younger children are not opting for holiday packages. We identify that the number of younger children has a varied distribution and might end up playing an important role in our model building process. Employees with older children have an almost similar distribution for opting and not opting for holiday packages across the number of children, and hence I don’t think it will be an important predictor for my model; I did not include this variable in my model building process.
 For this test case, I have chosen Logistic Regression to be a better model for interpretation
and analytical purposes

Recommendations:

 The company should really focus on foreigners to drive the sales of their holiday packages as
that’s where majority of conversions are going to come in.
 The company can try to direct their marketing efforts or offers toward foreigners for a better
conversion opting for holiday packages
 The company should also stay away from targeting parents with younger children. The chances of selling to parents with 2 younger children are probably the lowest. This also aligns with the fact that parents tend to avoid travelling with younger children.
 If the firm wants to target parents with older children, that might still give a more favorable return on marketing effort than spending on couples with younger children.
