
Business Report – Project

Predictive Modeling

Table of Contents

1 – Linear Regression
  1.1 Problem 1.1
  1.2 Problem 1.2
  1.3 Problem 1.3
  1.4 Problem 1.4
2 – Logistic Regression and LDA
  2.1 Problem 2.1
  2.2 Problem 2.2
  2.3 Problem 2.3
  2.4 Problem 2.4
Conclusion & Recommendation
THE END!

Executive Summary

This business report provides a detailed explanation of the approach taken to each problem given in the assignment and presents the relevant information used to solve it.

Problem 1: Linear Regression

You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits on different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish between higher-profit and lower-profit stones and thus achieve a better profit share. Also, provide them with the best 5 attributes that are most important.

Data Dictionary:

Variable Name – Description
Carat   – Carat weight of the cubic zirconia.
Cut     – Cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color   – Colour of the cubic zirconia, with D being the worst and J the best.
Clarity – Refers to the absence of inclusions and blemishes. In order from worst to best in terms of average price: IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth   – The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table   – The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price   – The price of the cubic zirconia.
X       – Length of the cubic zirconia in mm.
Y       – Width of the cubic zirconia in mm.
Z       – Height of the cubic zirconia in mm.

Descriptive statistics to summarize data

• Using the describe function in Python we can verify the basic descriptive statistics of the dataset.
• The info table also confirms that, apart from the 'depth' column, there are no null values in the dataset.

Q1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.

Summary of the dataset


The dataset contains 26,967 rows and 11 columns. There are 2 integer-type features, 6 float-type features and 3 object-type features, where 'price' is the target variable and all others are predictor variables. The first column ("Unnamed: 0") is only a serial-number index, so we can remove it. Except for the 'depth' column, every column has a non-null count of 26,967.
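A minimal sketch of this initial inspection, assuming pandas and an illustrative file name:

import pandas as pd

# Load the dataset (file name assumed for illustration)
df = pd.read_csv("cubic_zirconia.csv")

# Drop the serial-number index column
df = df.drop(columns=["Unnamed: 0"])

print(df.shape)            # (26967, 10) after dropping the index column
df.info()                  # data types and non-null counts ('depth' shows missing values)
print(df.describe())       # basic descriptive statistics
print(df.isnull().sum())   # null counts per column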

EXPLORATORY DATA ANALYSIS


Step 1: Check and remove any duplicates in the dataset
Step 2: Check and treat any missing values in the dataset
Step 3: Outlier Treatment
Step 4: Univariate Analysis
Step 5: Bi-variate Analysis

Step 1: Check and remove any duplicates in the dataset
After checking for duplicate values in the dataset, it is confirmed that there are no duplicates, hence no treatment is required to remove them.

Step 2: Check and treat any missing values in the dataset

Step 3: Outlier Treatment
Using the boxplot we confirm and visualise the presence of outliers in the dataset, and then proceed to treat the outliers present, as sketched below.
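A minimal sketch of one common capping approach, the IQR rule (the exact treatment method is assumed here, since the report does not spell it out):

import numpy as np

def cap_outliers_iqr(series):
    # Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at the whisker limits
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in df.select_dtypes(include=np.number).columns:
    df[col] = cap_outliers_iqr(df[col])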

Below we see that the outliers have been treated accordingly.

Step 4: Univariate Analysis

The dataset shows a significant number of outliers in a few of the variables. Skewness was measured for every attribute, and after performing the univariate analysis we can see that the distributions of some quantitative features, such as 'carat' and the target feature 'price', are heavily right-skewed.

Step 5: Bi-variate Analysis

• It involves the analysis of two variables (often denoted as X, Y) for the purpose of determining the empirical relationship between them.
• It can be inferred that most features correlate with the price of the diamond. The notable exception is 'depth', which has a negligible correlation (<1%). A sketch of this check follows.
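A minimal sketch of the bivariate/correlation check, assuming seaborn and matplotlib are available:

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)                     # pairwise correlations of numeric columns
print(corr["price"].sort_values(ascending=False))     # correlation of each feature with price

sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()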

OBSERVATIONS BASED ON EDA

The inferences drawn from the above Exploratory Data analysis:

Observation-1: 'Price' is the target variable while all others are the predictors. The dataset contains 26,967 rows and 11 columns: 2 integer-type features, 6 float-type features and 3 object-type features. The first column ("Unnamed: 0") is only a serial-number index, so we can remove it.

Observation-2: In the given dataset the mean and median values do not differ much. The minimum values of 'x', 'y' and 'z' are zero, which indicates faulty values: dimensionless or two-dimensional diamonds are not possible, so we filtered those rows out as clearly faulty data entries. There are three object data types: 'cut', 'colour' and 'clarity'.

Observation-3: There are 697 missing values in the 'depth' column. There are also some duplicate rows (33 duplicate rows out of 26,958, nearly 0.12% of the total data), so in this case we dropped the duplicated rows.

Observation-4: There is a significant number of outliers in some variables, i.e. features with data points far from the rest of the dataset, which would affect the outcome of our regression model, so we treated the outliers. We can also see that the distributions of some quantitative features like 'carat' and the target feature 'price' are heavily right-skewed.

Observation-5: Most features do correlate with the price of the diamond. The notable exception is 'depth', which has a negligible correlation (<1%). Observation on 'cut': the Premium cut diamonds are the most expensive, followed by the Very Good cut.

Q1.2 Impute null values if present; also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Check for the possibility of combining the sub-levels of an ordinal variable and take action accordingly. Explain why you are combining these sub-levels with appropriate reasoning.

Solution:

• We start by checking the dataset for any null values; as seen in the image below, there are a total of 697 null values in the 'depth' column.
• The median is then computed for each attribute so that it can be used to replace the null values present in the dataset.
• In figure 9 below we can see that the null values have been replaced by the computed median.
• After these treatments the shape of the dataset becomes 26,925 rows and 10 columns. A sketch of the treatment follows.
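A minimal sketch of this treatment, using the column names from the data dictionary:

# Impute missing 'depth' values with the column median
df["depth"] = df["depth"].fillna(df["depth"].median())

# Drop rows where any physical dimension is zero (faulty entries)
df = df[(df["x"] > 0) & (df["y"] > 0) & (df["z"] > 0)]

print(df.shape)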

Is scaling necessary in this case?

No, it is not strictly necessary; we will get an equivalent solution whether or not we apply some kind of linear scaling. However, scaling is still recommended for regression techniques because it helps gradient descent converge faster and reach the global minimum. When the number of features becomes large, it helps the model run quickly; otherwise, if scaling is not done in pre-processing, the starting point can be very far from the minimum.
For now we will build the model without scaling, and later we will check the output of the regression model with scaled data.

Q1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30).
Apply Linear regression using scikit learn. Perform checks for significant variables using appropriate
method from statsmodel. Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning.

Solution:

12
Train-Test Split:

• Copy all the predictor variables into the X data frame and the target into the y data frame. We then split the X and y data frames into a training set and a test set.
• For this we use the sklearn package, split X and y in a 70:30 ratio, and then invoke the linear regression function and find the best-fit model on the training data (a sketch follows).
• The intercept for our model is -3171.9504473076336.
• The intercept (often labelled the constant) is the expected mean value of Y when all X = 0; when X is not equal to zero, the intercept has no intrinsic meaning.
• In the present case, when the other predictor variables (carat, cut, color, clarity, etc.) are zero, C = -3172 in Y = m1X1 + m2X2 + … + mnXn + C + e, which means the predicted price is -3172. This does not make sense, so to deal with it we apply the z-score and make the intercept nearly zero.
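A minimal sketch of the split and fit, assuming scikit-learn and the encoded data frame df (variable names are illustrative):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df.drop(columns=["price"])
y = df["price"]

# 70:30 train/test split (random_state assumed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Coefficients:", dict(zip(X.columns, model.coef_)))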

R square on training data : 0.9311935886926559


R square on testing data : 0.931543712584074

• R-square is the percentage of the response-variable variation that is explained by a linear model, computed by the formula:
  R-square = Explained Variation / Total Variation
• It is always between 0% and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, and 100% indicates that the model explains all of the variability of the response data around its mean.
• In the regression model we can see the R-square values on the training and test data as 0.9311935886926559 and 0.931543712584074, respectively.

• The RMSE on the training and test data is 907.1312415459143 and 911.8447345328437, respectively.
• From the scatter plot, we see that the relationship is linear and there is a very strong correlation between the predicted y and the actual y.
• It also shows a lot of spread, which indicates some unexplained variance in the output.
• As the training data and test data scores are almost in line, we can conclude that this model is a right-fit model.

           Training Data          Test Data
R-square   0.9311935886926559     0.931543712584074
RMSE       907.1312415459143      911.8447345328436
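A minimal sketch of how these metrics can be computed with scikit-learn:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print("R-square (train):", r2_score(y_train, y_train_pred))
print("R-square (test) :", r2_score(y_test, y_test_pred))
print("RMSE (train):", np.sqrt(mean_squared_error(y_train, y_train_pred)))
print("RMSE (test) :", np.sqrt(mean_squared_error(y_test, y_test_pred)))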
Applying z-score

• We initiate the linear regression function, find the best-fit model on the training data, and then explore the coefficients for each of the attributes.
• The intercept for our model is -5.879615251304736e-16 and the coefficient of determination is 0.9315051288558229.
• By applying the z-score, the intercept has changed from -3171.950447307667 to -5.87961525130473e-16. The coefficients have changed and the bias has become nearly zero, but the overall accuracy is still the same.
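A minimal sketch of the z-score step, assuming scikit-learn's StandardScaler and that both the predictors and the target are standardised (an assumption, consistent with the near-zero intercept reported above):

from sklearn.preprocessing import StandardScaler

x_scaler, y_scaler = StandardScaler(), StandardScaler()

X_train_z = x_scaler.fit_transform(X_train)
X_test_z = x_scaler.transform(X_test)
y_train_z = y_scaler.fit_transform(y_train.values.reshape(-1, 1)).ravel()
y_test_z = y_scaler.transform(y_test.values.reshape(-1, 1)).ravel()

model_z = LinearRegression().fit(X_train_z, y_train_z)
print("Intercept after scaling:", model_z.intercept_)        # close to zero
print("R-square after scaling:", model_z.score(X_test_z, y_test_z))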

Check Multi-collinearity using VIF


• We can observe very strong multicollinearity in the dataset, when ideally the VIF should be within 1 to 5. A sketch of the check follows.
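A minimal sketch of the VIF check with statsmodels (illustrative):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X_train)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)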

Linear Regression using stats models

• Assuming the null hypothesis is true, i.e. that in the population from which this sample is drawn the regression coefficients are zero, we examine the coefficients for the variables shown above.
• We can then ask what the probability is of finding these coefficients in this sample if, in the real world, the coefficients were zero. As the overall p-value is less than alpha, we reject H0 and accept Ha: at least one regression coefficient is not 0 (here, not all regression coefficients are 0).
• For example, the p-value for the 'depth' variable is 0.449, which is much higher than 0.05. That means this dimension is of no use, so we can say that the attributes with a p-value greater than 0.05 are poor predictors of price. A sketch of the statsmodels check follows.

Root Mean Squared Error (Training) ------ RMSE: 907.1312415459133

Root Mean Squared Error (Test) ---------- RMSE: 911.8447345328433
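A minimal sketch of this check with statsmodels (illustrative):

import statsmodels.api as sm

X_train_sm = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_train_sm).fit()

print(ols_model.summary())                           # R-squared, Adj. R-squared, coefficient p-values
print(ols_model.pvalues[ols_model.pvalues > 0.05])   # weak predictors such as 'depth'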

Q1.4 Inference: Based on these predictions, what are the business insights and recommendations?

Solution:

Inference:

From the linear plot we can see a very strong correlation between the predicted y and the actual y. But there is a lot of spread, which indicates some kind of noise present in the dataset, i.e. unexplained variance in the output.

Linear regression performance metrics:

Intercept for the model: -3171.950447307667
R-square on training data: 0.9311935886926559
R-square on testing data: 0.931543712584074
RMSE on training data: 907.1312415459143
RMSE on testing data: 911.8447345328436

As the training data and testing data scores are almost in line, we can conclude this model is a right-fit model.

Impact of scaling:
By applying the z-score, the intercept became -5.87961525130473e-16; earlier it was -3171.950447307667. The coefficients have changed and the bias became nearly zero, but the overall accuracy is still the same.

Multicollinearity: We can observe very strong multicollinearity in the dataset.

From statsmodels: we can see that R-squared (0.931) and Adj. R-squared (0.931) are the same, and the overall p-value is less than alpha.

• Finally, we can conclude that the best 5 attributes that are most important for predicting the price are 'carat', 'cut', 'colour', 'clarity' and width, i.e. 'y'.
• When 'carat' increases by 1 unit, the diamond price increases by 8901.94 units, keeping all other predictors constant.
• When 'cut' increases by 1 unit, the diamond price increases by 109.19 units, keeping all other predictors constant.
• When 'colour' increases by 1 unit, the diamond price increases by 272.92 units, keeping all other predictors constant.
• When 'clarity' increases by 1 unit, the diamond price increases by 436.44 units, keeping all other predictors constant.
• When 'y' increases by 1 unit, the diamond price increases by 1464.83 units, keeping all other predictors constant.
• The p-value for the 'depth' variable is 0.449, which is much greater than 0.05; that means this attribute is of no use.
• There are also some negative coefficients. 'x', i.e. the length of the cubic zirconia in mm, has a negative coefficient of -1417.9089, and its p-value is less than 0.05, so we can conclude that stones with greater length are lower-profit stones.
• Similarly, the 'z' variable has a negative coefficient of -711.23, and its p-value is less than 0.05, so we can conclude that stones with a greater 'z' (height) are lower-profit stones.
Recommendations:

• The Gem Stones company should consider the features 'carat', 'cut', 'colour', 'clarity' and width ('y') as the most important for predicting the price, in order to distinguish between higher-profit and lower-profit stones and so have a better profit share.
• As we can see from the model, the higher the width ('y') of the stone, the higher the price, so stones with a higher width ('y') should be considered higher-profit stones.
• The 'Premium' cut diamonds are the most expensive, followed by the 'Very Good' cut; these should be considered higher-profit stones.
• Diamonds with clarity 'VS1' and 'VS2' are the most expensive, so these two categories should also be considered higher-profit stones.
• For 'x', i.e. the length of the stone, the higher the length, the lower the price; so stones with a higher length ('x') are lower in profitability.
• The higher the 'z', i.e. the height of the stone, the lower the price. This is because if a diamond's height is too large, the diamond will appear 'dark' because it will no longer return an attractive amount of light. Stones with a higher 'z' are therefore also lower in profitability.

Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn't. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important
factors on the basis of which the company will focus on particular employees to sell their packages.

Data Dictionary:

Variable Name – Description
Holiday_Package   – Opted for holiday package: yes/no?
Salary            – Employee salary
age               – Age in years
edu               – Years of formal education
no_young_children – Number of young children (younger than 7 years)
no_older_children – Number of older children
foreign           – Foreigner: yes/no

Q2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data
analysis.

Solution:
The following are some observations after initial exploration of the data: (Details in Python file)

Head of Data

Tail of Data

Head after dropping 'Unnamed: 0'

• We have no null values in the dataset.
• We have integer and object data.

Describe:

The data we have is of integer and object type; here the holiday package is our target variable.

Salary, age, educ, the number of young children, the number of older children, and whether the employee has been abroad (foreign) are the given attributes we have to examine to help the company predict whether a person will opt for the holiday package or not.

There are no null values in the dataset

CHECK FOR DUPLICATES IN THE GIVEN DATASET

Number of duplicate rows = 0

Unique values for categorical variables

Percentage of employees interested in the holiday package: 45.9%

Data Visualization – Univariate Analysis

SKEWNESS

• We can see that most of the distributions are right-skewed, except for educ.
• The Salary distribution has the maximum number of outliers.
• There are some outliers in educ, no_young_children and no_older_children.

CATEGORICAL UNIVARIATE ANALYSIS

• As we can observe, people with salaries below 150000 prefer the holiday package.
• Employees aged over 50 to 60 seem not to take the holiday package, whereas people aged 30 to 50 with a salary of less than 50000 have opted more for the holiday package.

BIVARIATE ANALYSIS – DATA DISTRIBUTION

There is hardly any correlation between the variables, and the data seems to be normal. There is no huge difference in the data distribution across the holiday-package classes; I don't see two clearly different distributions in the dataset provided.

AFTER TREATING OUTLIERS, THE DATA LOOKS LIKE THIS

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the
data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

Solution:

Encoding the data (having string variables)

Here we have done one-hot encoding to create dummy variables, and we can see all values for foreign_yes are 0 in the rows shown. Encoding the data helps the logistic regression model make better predictions. A sketch of the encoding follows.
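A minimal sketch of this encoding with pandas get_dummies (the file name and column names are assumed for illustration):

import pandas as pd

hp = pd.read_csv("Holiday_Package.csv")                 # file name assumed
hp = hp.drop(columns=["Unnamed: 0"], errors="ignore")   # drop serial-number index if present

# One-hot encode the string columns, dropping the first level of each
hp_encoded = pd.get_dummies(hp, columns=["Holiday_Package", "foreign"], drop_first=True)
print(hp_encoded.head())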
Train/ Test split

We will split the data in a 70:30 ratio.

The grid-search method is used for logistic regression to find the optimal solver and the parameters for solving. The parameters found using grid search are: penalty = 'l2', solver = 'liblinear', tolerance = 1e-06.

Prediction on the training and test sets

ytrain_predict = best_model.predict(X_train)
ytest_predict = best_model.predict(X_test)

Getting the probabilities on the test set
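A minimal sketch of the split, grid search and probability prediction described above (the parameter grid and column names are assumed for illustration):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X = hp_encoded.drop(columns=["Holiday_Package_yes"])
y = hp_encoded["Holiday_Package_yes"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

grid = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid={"penalty": ["l2"], "solver": ["liblinear", "lbfgs"], "tol": [1e-4, 1e-6]},
    scoring="f1",
    cv=3,
)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_     # e.g. penalty='l2', solver='liblinear', tol=1e-06

ytrain_predict = best_model.predict(X_train)
ytest_predict = best_model.predict(X_test)
ytest_prob = best_model.predict_proba(X_test)[:, 1]   # probabilities on the test set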

LDA (linear discriminant analysis)


DATASET HEAD

Build LDA Model

PROBABILITY PREDICTION
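A minimal sketch of the LDA model build and probability prediction, assuming scikit-learn:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

lda_train_pred = lda_model.predict(X_train)             # class predictions
lda_test_prob = lda_model.predict_proba(X_test)[:, 1]   # class-membership probabilities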

Performance Metrics will be discussed in 2.3

2.3 Performance Metrics: Check the performance of predictions on the train and test sets using accuracy and the confusion matrix, plot the ROC curve and get the ROC_AUC score for each model. Final Model: compare both models and write an inference on which model is best/optimized.

Solution:
PERFORMANCE METRICS FOR LOGISTIC REGRESSION
Confusion matrix on the training data

Confusion matrix cells are populated by the terms:


True Positive (TP) – values which are predicted as True and are actually True.
True Negative (TN) – values which are predicted as False and are actually False.
False Positive (FP) – values which are predicted as True but are actually False.
False Negative (FN) – values which are predicted as False but are actually True.

ROC Curve – the Receiver Operating Characteristic (ROC) curve measures the performance of models by evaluating the trade-off between sensitivity (the true positive rate) and 1 - specificity (the false positive rate).
AUC – the area under the curve (AUC) is another measure for classification models based on the ROC; it is the measure of accuracy judged by the area under the ROC curve.
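A minimal sketch of these checks for either model (illustrative):

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

print("Accuracy (test):", accuracy_score(y_test, ytest_predict))
print(confusion_matrix(y_test, ytest_predict))
print(classification_report(y_test, ytest_predict))

print("ROC_AUC (test):", roc_auc_score(y_test, ytest_prob))
fpr, tpr, _ = roc_curve(y_test, ytest_prob)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()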

Performance metrics of the Logistic Regression model:

Train data

AUC score is 0.731 or 73.1%

Confusion Matrix for Train data:

Test Data:

LDA Model:
Confusion Matrix on Training data:

The accuracy score of the training data and test data is the same at 66%. This is almost similar to the Logistic Regression model results so far. The AUC score is marginally lower for the test data; otherwise they are also almost similar to the Logistic Regression model. F1 scores are 61% and 57% for the train and test data, respectively, which again is close to the logistic regression model.
AUC for the training data: 0.731 or 73.1%
AUC for the test data: 0.714 or 71.4%

Overall, the model seems to be a right-fit model and stays away from being referred to as an under-fit or over-fit model. Let us see if we can refine the results further and improve the F1 score of the test data specifically.
Custom cut off for the LDA model:
Comparison of the Classification report:
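A minimal sketch of applying a custom cut-off to the LDA probabilities (the 0.4 threshold is the one discussed below):

import numpy as np
from sklearn.metrics import classification_report

custom_cutoff = 0.40
lda_test_prob = lda_model.predict_proba(X_test)[:, 1]
lda_test_custom = np.where(lda_test_prob >= custom_cutoff, 1, 0)

print(classification_report(y_test, lda_test_custom))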

As stated above, both the models – Logistic Regression and LDA – offer almost similar results, while LDA offers the flexibility to control important metrics such as precision, recall and F1 score by changing the custom cut-off. In this case study, the moment we changed the cut-off to 40%, we were able to improve our precision, recall and F1 scores considerably. Further, it is up to the business whether or not they would allow playing with the custom cut-off values.
For this case study, though, I have chosen to proceed with Logistic Regression as it is easier to implement and interpret, and very efficient to train. Also, our dependent variable follows a binary classification of classes, and hence it is ideal for us to rely on the logistic regression model to study the test case at hand.
Logistic regression is a classification algorithm used to find the probability of event success and event failure. It is used when the dependent variable is binary (0/1, True/False, Yes/No) in nature. It learns a linear relationship from the given dataset and then introduces non-linearity in the form of the sigmoid function.

2.4 Inference: Based on these predictions, what are the insights and recommendations?
Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present.

Solution:
We were given a problem where we had to find out whether the employees will opt for a holiday package or not. We analysed the data using Logistic Regression and LDA.
We found that the results using both methods are the same; predictions were made using both models.

While doing EDA we found that:

• Most of the employees who are above 50 don't opt for holiday packages; it seems they are not interested in holiday packages at all.
• Employees in the age group of 30 to 50 opt for holiday packages. It seems younger people believe in spending on holiday packages, so age plays a very important role in deciding whether they will opt for the package or not.
• Also, people who have a salary of less than 50000 opt for holiday packages, so salary is also a deciding factor for the holiday package.
• Education also plays an important role in deciding on the holiday packages.
• To improve our customer base we need to look into these factors.
Recommendations
• As we already have a customer base in the 30-to-50 age group, we need to look for options to target older people and people who are earning more than 150000.
• As we know, most older people prefer to visit religious places, so it would be better if we target those places and provide them with packages where they can visit religious places.
• We can also look into the family dynamics of the older people: if they have elder children, e.g. aged 30 to 40, those children can use the holiday packages, so the deal should include a family package.
• People who earn more than 150000 don't spend much on the holiday packages; they tend to go for lavish holidays, so we can provide them with customized packages according to their wishes, such as fancy hotels, longer vacations and personal cars during the holiday, to attract such employees.
• In addition, for people who earn more than 150000 we can provide extra facilities according to their own wishes at the moment.

In this project we started with EDA and descriptive statistics, did a null-value check, performed univariate and bivariate analysis, carried out exploratory data analysis and treated outliers. We then moved on to logistic regression: we encoded the data (having string values) for modelling, split the data into train and test (70:30), and finally applied Logistic Regression and LDA (linear discriminant analysis).

