Regression Project
Prabhu.S-Oct Batch
Table of Contents
Executive Summary-1
Introduction
Data Description
Sample of the dataset
Information of the dataset
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis
Summary of Data set
Univariate Analysis
Unique Values
Carat
Depth
Table
X-Column
Y-Column
Z-Column
Price Column
Bivariate Analysis
Correlation Matrix
Price Vs Cut
Price Vs Colour
Price Vs Clarity
Outlier Treatment
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning, or do we need to change them or drop them? Do you think scaling is necessary in this case?
1.3 Encode the data (having string values) for modelling. Split the data into train and test (70:30). Apply linear regression using scikit-learn. Perform checks for significant variables using the appropriate method from statsmodels. Create multiple models and check the performance of predictions on train and test sets using R-square, RMSE and adjusted R-square. Compare these models and select the best one with appropriate reasoning
1.4 Inference: Based on these predictions, what are the business insights and recommendations
Executive Summary-2
Introduction
Data Description
Dataset Sample
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check, and write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis
Dataset Type
Shape of Data
Checking for Null Values
Checking for Duplicates
Exploratory Data Analysis
Univariate Analysis
Salary
Education
Children (Young and Old)
Bivariate Analysis
Outliers
2.2 Do not scale the data. Encode the data (having string values) for modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis)
Encoding the data
Data Split
LDA (linear discriminant analysis)
Logistic Regression
2.3 Performance Metrics: Check the performance of predictions on train and test sets using accuracy and the confusion matrix; plot the ROC curve and get the ROC_AUC score for each model. Final Model: Compare both models and write an inference on which model is best/optimised
LDA Model-Training and Test Data Classification Report
Coefficient of Variable-LDA
Confusion Matrix-LDA (Train and Test Data)
ROC Curve-Train-LDA
ROC Curve-Test-LDA
LR Model-Training and Test Data Classification Report
Coefficient of Variable-LR
ROC Curve-Train-LR
ROC Curve-Test-LR
2.4 Inference: Based on these predictions, what are the insights and recommendations
Table Details- Problem 1
Figure Details-Problem 1
Figure Details-Problem 2
Introduction
The purpose of this exercise is to explore the dataset and understand its nature, check and explore the various parameters used, understand the outliers and analyse their effect, and apply various models.
Data Description
Sample of the dataset:
Table 1.1 Dataset Sample
We can see that the dataset contains a mix of object (categorical) and float (numerical) columns.
1.1 Read the data and do exploratory data analysis. Describe the
data briefly. (Check the null values, Data types, shape, EDA, duplicate
values). Perform Univariate and Bivariate Analysis.
Checking for duplicates: there are 34 duplicate rows; after removing them the dataset has 26933 rows.
The redundant column 'Unnamed: 0' has also been dropped, leaving 10 columns. The final dataset looks like below.
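A minimal pandas sketch of these cleaning steps, using a tiny synthetic frame (the column names mirror the dataset, but the values here are illustrative only):

```python
import pandas as pd

# Illustrative frame mimicking the dataset's structure (synthetic values,
# not the real cubic zirconia data)
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 3],
    "carat": [0.3, 0.7, 1.1, 1.1],
    "cut": ["Ideal", "Premium", "Fair", "Fair"],
    "price": [500, 2300, 4100, 4100],
})

n_dupes = df.duplicated().sum()        # count exact duplicate rows
df = df.drop_duplicates()              # remove them
df = df.drop(columns=["Unnamed: 0"])   # serial-number column adds no information

print(n_dupes, df.shape)
```

On the real data the same three calls report the 34 duplicates and produce the 26933 x 10 frame described above.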
Table 1.4 summary of Dataset-2
Carat: This is an independent variable ranging from 0.2 to 4.5. The mean is around 0.8, and 75% of the stones are at or below 1.05 carat. The standard deviation of about 0.477 shows that the data is skewed with a right tail, meaning the majority of the stones are of lower carat; very few stones exceed 1.05 carat.
Depth: The percentage height of the cubic zirconia stones ranges from 50.80 to 73.60. The average height is 61.80; 25% of the stones are below 61 and 75% are below 62.5. The standard deviation of 1.4 indicates an approximately normal distribution.
Table: The percentage width of the cubic zirconia ranges from 49 to 79, with an average of around 57. 25% of the stones are below 56 and 75% have a width of less than 59. The standard deviation is 2.24. Thus the data is not normally distributed and, as with carat, most of the stones have less width; this also indicates that outliers are present in the variable.
Price: Price is the target variable. The mean price is around 3938, with a maximum of 18818. The median price is 2375; 25% of the stones are priced below 945 and 75% below 5356. The standard deviation of 4022 indicates that the majority of stones are in the lower price range, as the distribution is right skewed.
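The quantiles and skew figures quoted in these paragraphs come from standard pandas summaries; a small sketch with made-up prices shows the calls involved:

```python
import pandas as pd

# Synthetic price sample, chosen only to illustrate the summary calls
prices = pd.Series([945, 1200, 2375, 3000, 5356], name="price")
stats = prices.describe()  # count, mean, std, min, quartiles, max

q1, median, q3 = stats["25%"], stats["50%"], stats["75%"]
skewness = prices.skew()   # > 0 indicates a right-tailed distribution

print(median, skewness > 0)
```

`df.describe()` on the full frame yields the same row of statistics for every numeric column at once.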
Univariate Analysis
Let’s perform univariate analysis across all variables and understand their characteristics.
Unique Values:
Analysing the unique values for all the columns, we get the following:
Object Variables:
Figure 1.1 summary of Dataset-Object Variables
Numerical Variables
Carat
Figure 1.3 Univariate Analysis-Carat
Depth
Figure 1.4 Univariate Analysis-Depth
Observation: The depth variable has outliers and is centrally distributed.
Table
X- Column
Y- Column
Price column
To analyse how each variable interacts with the others, we perform a bivariate analysis.
Figure 1.9 Bivariate Analysis
Correlation Matrix
Figure 2.0 Correlation Matrix
Observation: The bivariate analysis and the correlation matrix suggest the following:
High correlation between features such as carat, x, y, z and price.
Low correlation between table and the other features.
Depth is negatively correlated with most of the other features, except carat.
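These relationships can be read straight off `DataFrame.corr()`. A sketch on synthetic stand-in columns (the dependence structure is invented to mimic the observations above, not taken from the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, 200)
df = pd.DataFrame({
    "carat": carat,
    "x": carat * 5 + rng.normal(0, 0.1, 200),        # dimensions track carat
    "price": carat * 4000 + rng.normal(0, 200, 200), # price rises with carat
    "table": rng.uniform(49, 79, 200),               # largely independent
})

corr = df.corr()  # Pearson correlation matrix of the numeric columns
print(corr.loc["carat", "price"].round(2))
```

The resulting matrix is what the heat map in Figure 2.0 visualises.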
Also, let's understand how price varies with Cut, Colour and Clarity.
Price Vs Cut
Comparing price across the various cut types, we get the following:
Observation:
The most sold cut type is Ideal and the least sold is Fair.
All cut types have outliers with respect to price.
Ideal cut stones appear slightly less priced, while Premium cut stones appear slightly more expensive.
Price Vs Colour
For the colour variable, the most sold are G-coloured gems and the least sold are J-coloured gems.
All colour types have outliers with respect to price.
However, E-coloured gems appear to be the least priced, while J and I coloured gems appear more expensive.
Price Vs Clarity
Figure 2.3 Price Vs Clarity
For the clarity variable, the most sold are SI1 clarity gems and the least sold are I1 clarity gems.
All clarity types have outliers with respect to price.
SI1 stones appear slightly less priced, while VS2 and SI2 clarity stones appear more expensive.
Outlier Treatment
From the univariate analysis we understand that outliers are present in all the variables.
To treat these outliers, we use the capping method: values beyond a particular limit are capped at that limit, i.e. values above an upper bound or below a lower bound are replaced by the bound itself.
Observation:
Using the capping method, we treated the outliers by capping them at the computed limits.
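A common way to implement this capping is the 1.5*IQR rule used by box plots; a minimal sketch (the helper name `cap_outliers` and the sample series are my own, for illustration):

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Cap values outside the 1.5*IQR whiskers at the whisker limits."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s.clip(lower=lower, upper=upper)

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an obvious outlier
capped = cap_outliers(s)
print(capped.max())
```

Applied column by column, this keeps every row while pulling extreme values back to the whisker limits, which is what "capping" means above.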
Conclusion of EDA:
• Price – this variable gives a continuous output: the price of the cubic zirconia stones. This will be our target variable.
• We drop the first column, 'Unnamed: 0', as it is not important for our study, which leaves the dataset with 26967 rows and 10 columns.
• Only 'depth' has missing values (697), which we impute with its median.
• There are a total of 34 duplicate rows, found using the .duplicated() function; we drop these duplicates.
• After dropping the duplicates, the shape of the dataset is 26933 rows and 10 columns.
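The median imputation of 'depth' mentioned above can be sketched as follows (tiny synthetic column for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"depth": [61.0, 62.5, np.nan, 61.8, np.nan]})

median_depth = df["depth"].median()           # robust to outliers, unlike the mean
df["depth"] = df["depth"].fillna(median_depth)

print(df["depth"].isna().sum())
```

The median is preferred here precisely because the EDA found outliers in several columns: a few extreme values would drag the mean but leave the median untouched.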
1.2. Impute null values if present, also check for the values which are
equal to zero. Do they have any meaning, or do we need to change
them or drop them? Do you think scaling is necessary in this case?
Checking the object-type columns for gibberish values, we find none: there is no gibberish or missing data in the object-type columns Cut, Colour and Clarity.
Observation
In this dataset the variables are on different scales: price is in the thousands, depth and table are in the tens to hundreds, and carat is below ten. Scaling or standardising the data allows each variable to be compared on a common scale. With data measured in different units or on different scales (as here, with different means and variances), this is an important preprocessing step if the results are not to be dominated by the variables with large variances.
But is scaling strictly necessary in this case? No: for ordinary linear regression we get an equivalent solution whether or not we apply a linear scaling. It is still recommended, however, because it helps gradient-descent-based methods converge faster and reach the global minimum. When the number of features is large, scaling helps the model run quickly; without it, the starting point can be very far from the minimum.
For now, we will build the model with scaling for faster output.
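Standardisation (zero mean, unit variance per column) can be done with scikit-learn's `StandardScaler`; a small sketch on made-up carat/depth/price rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns on wildly different scales: carat, depth, price (synthetic values)
X = np.array([[0.5, 60.0, 1000.0],
              [1.0, 62.0, 5000.0],
              [1.5, 64.0, 9000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract column mean, divide by column std

print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Note that in a real pipeline the scaler is fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.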
1.3 Encode the data (having string values) for Modelling. Split the
data into train and test (70:30). Apply Linear regression using scikit
learn. Perform checks for significant variables using appropriate
method from stats model. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare,
RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning
Let's look at the object-type variables in the dataset. We have Cut, Colour and Clarity as objects; we need to convert these categorical variables into dummy variables.
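The dummy-variable conversion is a one-liner with `pd.get_dummies`; a sketch on a tiny synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Fair"],
    "color": ["E", "G", "J"],
    "carat": [0.5, 0.9, 1.2],
})

# One-hot encode the object columns; drop_first avoids the dummy-variable trap
# (perfect multicollinearity among the dummies of one category)
encoded = pd.get_dummies(df, columns=["cut", "color"], drop_first=True)
print(sorted(encoded.columns))
```

After this step every column is numeric, which is what both scikit-learn and statsmodels expect.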
Let's perform a final information check on the dataset before splitting the data.
Figure 3.2-Data Types-Check
Observation:
We have already established that several variables (carat, x, y and z) demonstrate strong correlation, i.e. multicollinearity. So, before proceeding with the linear regression model creation, we need to remove them from the model-creation exercise: we drop x, y and z from the linear regression model-creation step.
Splitting the data into train and Test and applying regression:
The coefficient of variation (CV) is a relative measure of variability that indicates the size of a standard deviation relative to its mean.
R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, while 100% indicates that it explains all of it. In this regression model, the R-squared values on the training and test data are 0.917 and 0.914 respectively.
Figure 4.1-RMSE Score-Test Data
Observation
As the training and test data scores are almost in line, we can conclude this is a right-fit model.
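The split, fit and R-squared checks above follow the standard scikit-learn pattern; a self-contained sketch on synthetic data (one stand-in feature for carat, with an invented price relationship):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0.2, 2.0, (500, 1))           # stand-in for carat
y = 4000 * X[:, 0] + rng.normal(0, 300, 500)  # price rises with carat (synthetic)

# 70:30 split, as in the report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)  # R-squared on training data
r2_test = model.score(X_test, y_test)     # R-squared on held-out data

print(round(r2_train, 3), round(r2_test, 3))
```

Comparable train and test R-squared values, as seen here and in the report's 0.917/0.914, are the signature of a right-fit model.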
Let's perform linear regression using statsmodels. Concatenating X and y into a single data frame and specifying the dependent and independent variables, we get:
Figure 4.6-
Observation: The RMSE score remains almost the same when the regression is done using statsmodels.
Let's compare other models and determine which is optimal for this dataset.
We create four models using ANN, Decision Tree, Random Forest and Linear Regression, then check the train and test RMSE and scores.
Figure 4.8-Consolidated Scores
Using grid search, we can analyse the four models.
Figure 4.9-Consolidated Scores-Grid Search
Observation:
All the models are well within the 10 percent limit, and all of them appear to fit.
The RMSE score decides the best model: the lower the RMSE, the better the model.
Of the four models, the ANN regressor has the lowest RMSE score and is hence the best fit for this dataset.
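The comparison loop can be sketched as below, here with just two of the four model families to keep it short (synthetic data; RMSE values are illustrative, not the report's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.uniform(0.2, 2.0, (400, 1))
y = 4000 * X[:, 0] + rng.normal(0, 300, 400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=1),
}
rmse = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    rmse[name] = np.sqrt(mean_squared_error(y_te, m.predict(X_te)))

best = min(rmse, key=rmse.get)  # lower test RMSE = better model
print(best, {k: round(v) for k, v in rmse.items()})
```

Adding the Decision Tree and an MLP/ANN regressor to the `models` dict extends this to the four-way comparison in Figure 4.8.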
Facts:
We have a dataset with strong correlation between the independent variables, so we need to tackle multicollinearity, which can hinder model performance.
As a result, while creating the model, certain independent variables displaying multicollinearity, or with no direct relation to the target variable, were dropped.
Based on the model created for this test case, some of the key variables likely to positively drive price are (top 5 in descending order):
Carat
Clarity_IF
Clarity VVS_1
Clarity VVS_2
Clarity_vs1
Recommendations:
Data Description
1. Target: Holliday_Package – opted for a holiday package (yes/no)
2. Salary: employee salary
3. age: age in years
4. educ: years of formal education
5. no_young_children: number of young children (younger than 7 years)
6. no_older_children: number of older children
7. foreign: foreigner (yes/no)
Dataset Sample:
Table 1.1- Dataset Sample
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and
do null value condition check, write an inference on it. Perform
Univariate and Bivariate Analysis. Do exploratory data analysis.
Dataset type:
Table 1.2: Dataset Information
Shape of Data:
• Holliday_Package – this is a categorical variable and will be our target variable.
• Salary ranges from 1322 to 236961. The average salary is around 47729 with a standard deviation of 23418, indicating that the data is not normally distributed. A skew of 0.71 indicates the data is right skewed: few employees earn more than the average of 47729. 75% of the employees earn below 53469, while 25% earn below 35324.
• Age of the employees ranges from 20 to 62, with a median of around 39. 25% of the employees are below 32 and 25% are above 48. The standard deviation of around 10 indicates an almost normal distribution.
• Years of formal education range from 1 to 21. 25% of the population has 8 or fewer years of formal education, the median is around 9 years, and 75% of the employees have 12 or fewer years. The standard deviation is around 3, and this variable also shows skewness.
• We have dropped the first column, 'Unnamed: 0', as it simply holds serial numbers and is not important for our study.
Checking the unique values for Numerical variables we get the following output:
Figure1.1: Data set-Unique Check-Numerical
Observation:
We can observe that 54% of the employees did not opt for the holiday package and 46% did, which implies a fairly balanced dataset.
We can also observe that 75% of the employees are not foreigners and 25% are.
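Class-balance figures like these come from `value_counts(normalize=True)`; a sketch with a synthetic column built to match the 54/46 split quoted above:

```python
import pandas as pd

# Synthetic target column reproducing the reported 54% / 46% split
df = pd.DataFrame({"Holliday_Package": ["no"] * 54 + ["yes"] * 46})

balance = df["Holliday_Package"].value_counts(normalize=True)
print(balance.round(2).to_dict())
```

A split this close to even means no resampling or class-weighting is needed before fitting the classifiers.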
Univariate Analysis
Salary
Figure1.4 -Univariate- Salary
Education
Figure1.5 -
Univariate- Education
Observation: It looks like a large share of the population has no young children.
Observation: It looks like a large share of the population has no older children.
Bivariate Analysis
Figure1.8-Bivariate Analysis
From the heat map that follows, we get a clear picture of the correlation between the columns.
Figure1.9-Heat Map
Observation: There is no strong correlation between any variables. Salary and education display moderate correlation, and no_older_children is somewhat correlated with salary. However, there are no strong correlations in the dataset.
Outliers
Data before Outliers:
Figure2.0-Data before Outliers
We can observe significant outliers in the 'Salary' variable, and minimal outliers in 'educ', 'no_young_children' and 'no_older_children'; there are no outliers in 'age'. For interpretation purposes we need to study the children variables before outlier treatment, so for this case study we have performed outlier treatment only on Salary and educ.
Dataset after outlier treatment:
2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis).
Holliday_Package: The distribution seems to be fine, with 54% for no and 46% for yes.
Figure2.3-Dataset-foreign
Foreign: The distribution seems fine and fit for analytical purposes.
Both variables can be encoded into numerical values for model creation.
The no_young_children variable, although numeric, shows a varied distribution between one and two children in a bivariate analysis with the dependent variable. It is therefore advisable to treat this variable as categorical and encode it.
Looking at the table above, there does not seem to be much variation in the distribution for employees with more than 0 older children; the distributions across the Holliday_Package classes are close to identical. For this test case, I don't think this variable will be an important factor in the model or the analysis, and hence I drop it from the model-building process.
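The encoding steps just described can be sketched as follows (tiny synthetic frame; column names follow the dataset, values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Holliday_Package": ["yes", "no", "no", "yes"],
    "foreign": ["no", "yes", "no", "no"],
    "no_young_children": [0, 1, 2, 0],
})

# Map the two binary object columns to 0/1
for col in ["Holliday_Package", "foreign"]:
    df[col] = df[col].map({"no": 0, "yes": 1})

# Treat the child count as categorical and one-hot encode it
df = pd.get_dummies(df, columns=["no_young_children"], drop_first=True)
print(df.dtypes.to_dict())
```

No scaling is applied, per the question's instruction; only string-valued and re-categorised columns are converted to numbers.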
Data Split
Figure2.7-Data Drop
Figure2.8-Data drop-X
Figure2.9-Data drop-Y
Split X and y into training and test sets in a 70:30 ratio: 70% of the data will be used for training and the remaining 30% for testing.
Figure3.0-Data Split
Figure3.1-LDA Model
Logistic Regression
Applying the LR model using the formula below:
Figure3.2-LR Model
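Both models fit with a single call each in scikit-learn; a self-contained sketch on synthetic stand-in data (the salary/age features and the age-based opt-in rule are invented purely for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# Synthetic features: salary in thousands, age in years
X = np.column_stack([rng.normal(47, 20, 300), rng.uniform(20, 62, 300)])
y = (X[:, 1] < 45).astype(int)  # hypothetical rule: younger employees opt in

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(round(lda.score(X_te, y_te), 2), round(logit.score(X_te, y_te), 2))
```

Since both methods learn a linear decision boundary, their scores on the same split tend to be close, which is also what the report observes on the real data.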
Figure3.5-
LDA Confusion Matrix-(Train and Test)
ROC Curve-Train-LDA
Figure3.6-LDA- ROC curve-Train
ROC Curve-Test-LDA
Observation:
The accuracy and AUC scores are not very different, so the models can be considered right-fit, avoiding underfitting and overfitting.
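The metrics behind these observations (confusion matrix and ROC_AUC) come from `sklearn.metrics`; a sketch on synthetic two-feature data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(0, 1, (300, 2))
# Synthetic binary target with a noisy linear boundary
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = LogisticRegression().fit(X_tr, y_tr)

cm = confusion_matrix(y_te, clf.predict(X_te))            # rows: truth, cols: prediction
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # AUC needs probabilities
print(cm, round(auc, 2))
```

Note that `roc_auc_score` takes the positive-class probability, not the hard 0/1 predictions; computing it on both train and test splits gives the pair of ROC curves shown for each model.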
LR Model-Training and Test Data Classification report
Coefficient of variable-LR
Figure3.8-LR Coefficient Variable
Confusion Matrix-LR (Train and Test Data)
ROC Curve-Train-LR
Observation:
The accuracy and AUC scores are not very different, so the models can be considered right-fit, avoiding underfitting and overfitting.
Both models – Logistic Regression and LDA – offer almost similar results.
For this case study, however, I have chosen to proceed with logistic regression, as it is easier to implement and interpret and very efficient to train. Also, our dependent variable follows a binary classification of classes, so it is ideal for us to rely on the logistic regression model to study the test case at hand.
2.4 Inference: Basis on these predictions, what are the insights and
recommendations
Facts:
We started this test case by looking at data correlations to identify early trends and patterns. At one stage, salary and education seemed to be important parameters that might play out as important predictors.
While performing the bivariate analysis, we observe that salary is similar for employees who opt for the holiday package and those who do not; however, the distribution is more spread out for people not opting for holiday packages.
There are no outliers in age, and the distribution of age with respect to the holiday package is also similar in nature; the age range of people not opting for the package is more spread out than that of people opting for it.
We can clearly see that employees in the middle age range (34 to 45 years) go for the holiday package more than older and younger employees.
There is a significant difference between employees with young children who opt for the holiday package and those who do not; people with young children clearly tend not to opt for holiday packages.
We identified that the number of young children has a varied distribution and may end up playing an important role in the model-building process. Employees with older children show an almost identical distribution for opting and not opting across the number-of-children levels, so I do not consider it an important predictor and did not include it in the model-building process.
For this test case, I have chosen logistic regression as the better model for interpretation and analytical purposes.
Recommendations:
The company should focus on foreigners to drive sales of its holiday packages, as that is where the majority of conversions will come from; directing marketing efforts and offers toward foreigners should yield better conversion rates.
The company should avoid targeting parents with young children: the chance of selling to parents with two young children is probably the lowest, which fits with the fact that parents tend to avoid travelling with young children.
If the firm wants to target parents with older children, that may still give a more favourable return on its marketing spend than targeting couples with young children.