
Business Report – Project

Predictive Modeling

Table of Contents

1 – Linear Regression
  1.1 Problem 1.1
  1.2 Problem 1.2
  1.3 Problem 1.3
  1.4 Problem 1.4
2 – Logistic Regression and LDA
  2.1 Problem 2.1
  2.2 Problem 2.2
  2.3 Problem 2.3
  2.4 Problem 2.4
Conclusion & Recommendation
THE END!

Executive Summary

This business report provides a detailed explanation of the approach taken to each problem given in the assignment and presents the relevant information used to solve it.

Problem 1: Linear Regression

You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits on different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish between higher-profit and lower-profit stones and thus achieve a better profit share. Also, provide them with the best 5 attributes that are most important.

Data Dictionary:

Variable Name – Description
Carat   – Carat weight of the cubic zirconia.
Cut     – Cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color   – Colour of the cubic zirconia, with D being the worst and J the best.
Clarity – Refers to the absence of inclusions and blemishes. In order from worst to best in terms of average price: IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth   – The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table   – The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price   – The price of the cubic zirconia.
X       – Length of the cubic zirconia in mm.
Y       – Width of the cubic zirconia in mm.
Z       – Height of the cubic zirconia in mm.

Descriptive statistics to summarize data

• Using the describe function in Python we can verify the basic descriptive statistics of the dataset.
• The info table also confirms that, apart from the 'depth' column, there are no null values in the dataset.

Q1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.

Summary of the dataset


The dataset contains 26,967 rows and 11 columns. There are 2 integer-type features, 6 float-type features and 3 object-type features, where 'price' is the target variable and all others are predictor variables. The first column ("Unnamed: 0") is only a serial-number index, so we can remove it. Except for the 'depth' column, every column has a non-null count of 26,967.
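A minimal sketch of this initial inspection, assuming pandas and an illustrative file name:

import pandas as pd

# Load the dataset (file name assumed for illustration)
df = pd.read_csv("cubic_zirconia.csv")

# Drop the serial-number index column
df = df.drop(columns=["Unnamed: 0"])

print(df.shape)            # (26967, 10) after dropping the index column
df.info()                  # data types and non-null counts ('depth' shows missing values)
print(df.describe())       # basic descriptive statistics
print(df.isnull().sum())   # null counts per column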

EXPLORATORY DATA ANALYSIS


Step 1: Check and remove any duplicates in the dataset
Step 2: Check and treat any missing values in the dataset
Step 3: Outlier Treatment
Step 4: Univariate Analysis
Step 5: Bi-variate Analysis

Step 1: Check and remove any duplicates in the dataset
After checking for duplicate values in the dataset, it is confirmed that there are no duplicates, hence no treatment is required to remove them.

Step 2: Check and treat any missing values in the dataset

Step 3: Outlier Treatment
Using the boxplot we confirm and visualise the presence of outliers in the dataset, and then proceed to treat the outliers present, as sketched below.
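A minimal sketch of one common capping approach, the IQR rule (the exact treatment method is assumed here, since the report does not spell it out):

import numpy as np

def cap_outliers_iqr(series):
    # Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at the whisker limits
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in df.select_dtypes(include=np.number).columns:
    df[col] = cap_outliers_iqr(df[col])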

Below we see that the outliers have been treated accordingly.

Step 4: Univariate Analysis

The dataset shows a significant number of outliers in a few of the variables. Skewness was measured for every attribute, and after performing the univariate analysis we can see that the distributions of some quantitative features, such as 'carat' and the target feature 'price', are heavily right-skewed.

Step 5: Bi-variate Analysis

• It involves the analysis of two variables (often denoted as X, Y) for the purpose of determining the empirical relationship between them.
• It can be inferred that most features correlate with the price of the diamond. The notable exception is 'depth', which has a negligible correlation (<1%). A sketch of this check follows.
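A minimal sketch of the bivariate/correlation check, assuming seaborn and matplotlib are available:

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)                     # pairwise correlations of numeric columns
print(corr["price"].sort_values(ascending=False))     # correlation of each feature with price

sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()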

OBSERVATIONS BASED ON EDA

The inferences drawn from the above Exploratory Data analysis:

Observation-1: 'Price' is the target variable while all others are the predictors. The dataset contains 26,967 rows and 11 columns: 2 integer-type features, 6 float-type features and 3 object-type features. The first column ("Unnamed: 0") is only a serial-number index, so we can remove it.

Observation-2: In the given dataset the mean and median values do not differ much. The minimum values of 'x', 'y' and 'z' are zero, which indicates faulty values: dimensionless or two-dimensional diamonds are not possible, so we filtered those rows out as clearly faulty data entries. There are three object data types: 'cut', 'colour' and 'clarity'.

Observation-3: There are 697 missing values in the 'depth' column. There are also some duplicate rows (33 duplicate rows out of 26,958, nearly 0.12% of the total data), so in this case we dropped the duplicated rows.

Observation-4: There is a significant number of outliers in some variables, i.e. features with data points far from the rest of the dataset, which would affect the outcome of our regression model, so we treated the outliers. We can also see that the distributions of some quantitative features like 'carat' and the target feature 'price' are heavily right-skewed.

Observation-5: Most features do correlate with the price of the diamond. The notable exception is 'depth', which has a negligible correlation (<1%). Observation on 'cut': the Premium cut diamonds are the most expensive, followed by the Very Good cut.

Q1.2 Impute null values if present; also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Check for the possibility of combining the sub-levels of an ordinal variable and take action accordingly. Explain why you are combining these sub-levels with appropriate reasoning.

Solution:

• We start by checking the dataset for any null values; as seen in the image below, there are a total of 697 null values in the 'depth' column.
• The median is then computed for each attribute so that it can be used to replace the null values present in the dataset.
• In figure 9 below we can see that the null values have been replaced by the computed median.
• After these treatments the shape of the dataset becomes 26,925 rows and 10 columns. A sketch of the treatment follows.
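A minimal sketch of this treatment, using the column names from the data dictionary:

# Impute missing 'depth' values with the column median
df["depth"] = df["depth"].fillna(df["depth"].median())

# Drop rows where any physical dimension is zero (faulty entries)
df = df[(df["x"] > 0) & (df["y"] > 0) & (df["z"] > 0)]

print(df.shape)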

Is scaling necessary in this case?

No, it is not strictly necessary; we will get an equivalent solution whether or not we apply some kind of linear scaling. However, scaling is still recommended for regression techniques because it helps gradient descent converge faster and reach the global minimum. When the number of features becomes large, it helps the model run quickly; otherwise, if scaling is not done in pre-processing, the starting point can be very far from the minimum.
For now we will build the model without scaling, and later we will check the output of the regression model with scaled data.

Q1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30).
Apply Linear regression using scikit learn. Perform checks for significant variables using appropriate
method from statsmodel. Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning.

Solution:

12
Train-Test Split:

• Copy all the predictor variables into the X data frame and the target into the y data frame. We then split the X and y data frames into a training set and a test set.
• For this we use the sklearn package, split X and y in a 70:30 ratio, and then invoke the linear regression function and find the best-fit model on the training data (a sketch follows).
• The intercept for our model is -3171.9504473076336.
• The intercept (often labelled the constant) is the expected mean value of Y when all X = 0; when X is not equal to zero, the intercept has no intrinsic meaning.
• In the present case, when the other predictor variables (carat, cut, color, clarity, etc.) are zero, C = -3172 in Y = m1X1 + m2X2 + … + mnXn + C + e, which means the predicted price is -3172. This does not make sense, so to deal with it we apply the z-score and make the intercept nearly zero.
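A minimal sketch of the split and fit, assuming scikit-learn and the encoded data frame df (variable names are illustrative):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df.drop(columns=["price"])
y = df["price"]

# 70:30 train/test split (random_state assumed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Coefficients:", dict(zip(X.columns, model.coef_)))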

R square on training data : 0.9311935886926559


R square on testing data : 0.931543712584074

• R-square is the percentage of the response-variable variation that is explained by a linear model, computed by the formula:
  R-square = Explained Variation / Total Variation
• It is always between 0% and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, and 100% indicates that the model explains all of the variability of the response data around its mean.
• In the regression model we can see the R-square values on the training and test data as 0.9311935886926559 and 0.931543712584074, respectively.

• The RMSE on the training and test data is 907.1312415459143 and 911.8447345328437, respectively.
• From the scatter plot, we see that the relationship is linear and there is a very strong correlation between the predicted y and the actual y.
• It also shows a lot of spread, which indicates some unexplained variance in the output.
• As the training data and test data scores are almost in line, we can conclude that this model is a right-fit model.

           Training Data          Test Data
R-square   0.9311935886926559     0.931543712584074
RMSE       907.1312415459143      911.8447345328436
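A minimal sketch of how these metrics can be computed with scikit-learn:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print("R-square (train):", r2_score(y_train, y_train_pred))
print("R-square (test) :", r2_score(y_test, y_test_pred))
print("RMSE (train):", np.sqrt(mean_squared_error(y_train, y_train_pred)))
print("RMSE (test) :", np.sqrt(mean_squared_error(y_test, y_test_pred)))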
Applying z-score

• We initiate the linear regression function, find the best-fit model on the training data, and then explore the coefficients for each of the attributes.
• The intercept for our model is -5.879615251304736e-16 and the coefficient of determination is 0.9315051288558229.
• By applying the z-score, the intercept has changed from -3171.950447307667 to -5.87961525130473e-16. The coefficients have changed and the bias has become nearly zero, but the overall accuracy is still the same.
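A minimal sketch of the z-score step, assuming scikit-learn's StandardScaler and that both the predictors and the target are standardised (an assumption, consistent with the near-zero intercept reported above):

from sklearn.preprocessing import StandardScaler

x_scaler, y_scaler = StandardScaler(), StandardScaler()

X_train_z = x_scaler.fit_transform(X_train)
X_test_z = x_scaler.transform(X_test)
y_train_z = y_scaler.fit_transform(y_train.values.reshape(-1, 1)).ravel()
y_test_z = y_scaler.transform(y_test.values.reshape(-1, 1)).ravel()

model_z = LinearRegression().fit(X_train_z, y_train_z)
print("Intercept after scaling:", model_z.intercept_)        # close to zero
print("R-square after scaling:", model_z.score(X_test_z, y_test_z))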

Check Multi-collinearity using VIF


• We can observe very strong multicollinearity in the dataset, when ideally the VIF should be within 1 to 5. A sketch of the check follows.
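A minimal sketch of the VIF check with statsmodels (illustrative):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X_train)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)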

Linear Regression using stats models

• Assuming the null hypothesis is true, i.e. that in the population from which this sample is drawn the regression coefficients are zero, we examine the coefficients for the variables shown above.
• We can then ask what the probability is of finding these coefficients in this sample if, in the real world, the coefficients were zero. As the overall p-value is less than alpha, we reject H0 and accept Ha: at least one regression coefficient is not 0 (here, not all regression coefficients are 0).
• For example, the p-value for the 'depth' variable is 0.449, which is much higher than 0.05. That means this dimension is of no use, so we can say that the attributes with a p-value greater than 0.05 are poor predictors of price. A sketch of the statsmodels check follows.

Root Mean Squared Error (Training) ------ RMSE: 907.1312415459133

Root Mean Squared Error (Test) ---------- RMSE: 911.8447345328433
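A minimal sketch of this check with statsmodels (illustrative):

import statsmodels.api as sm

X_train_sm = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_train_sm).fit()

print(ols_model.summary())                           # R-squared, Adj. R-squared, coefficient p-values
print(ols_model.pvalues[ols_model.pvalues > 0.05])   # weak predictors such as 'depth'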

Q1.4 Inference: Based on these predictions, what are the business insights and recommendations?

Solution:

Inference:

From the linear plot we can see a very strong correlation between the predicted y and the actual y. But there is a lot of spread, which indicates some kind of noise present in the dataset, i.e. unexplained variance in the output.

Linear regression performance metrics:

Intercept for the model: -3171.950447307667
R-square on training data: 0.9311935886926559
R-square on testing data: 0.931543712584074
RMSE on training data: 907.1312415459143
RMSE on testing data: 911.8447345328436

As the training data and testing data scores are almost in line, we can conclude this model is a right-fit model.

Impact of scaling:
By applying the z-score, the intercept became -5.87961525130473e-16; earlier it was -3171.950447307667. The coefficients have changed and the bias became nearly zero, but the overall accuracy is still the same.

Multicollinearity: We can observe very strong multicollinearity in the dataset.

From statsmodels: we can see that R-squared (0.931) and Adj. R-squared (0.931) are the same, and the overall p-value is less than alpha.

• Finally, we can conclude that the best 5 attributes that are most important for predicting the price are 'carat', 'cut', 'colour', 'clarity' and width, i.e. 'y'.
• When 'carat' increases by 1 unit, the diamond price increases by 8901.94 units, keeping all other predictors constant.
• When 'cut' increases by 1 unit, the diamond price increases by 109.19 units, keeping all other predictors constant.
• When 'colour' increases by 1 unit, the diamond price increases by 272.92 units, keeping all other predictors constant.
• When 'clarity' increases by 1 unit, the diamond price increases by 436.44 units, keeping all other predictors constant.
• When 'y' increases by 1 unit, the diamond price increases by 1464.83 units, keeping all other predictors constant.
• The p-value for the 'depth' variable is 0.449, which is much greater than 0.05; that means this attribute is of no use.
• There are also some negative coefficients. 'x', i.e. the length of the cubic zirconia in mm, has a negative coefficient of -1417.9089, and its p-value is less than 0.05, so we can conclude that stones with greater length are lower-profit stones.
• Similarly, the 'z' variable has a negative coefficient of -711.23, and its p-value is less than 0.05, so we can conclude that stones with a greater 'z' (height) are lower-profit stones.
Recommendations:

• The Gem Stones company should consider the features 'carat', 'cut', 'colour', 'clarity' and width ('y') as the most important for predicting the price, in order to distinguish between higher-profit and lower-profit stones and so have a better profit share.
• As we can see from the model, the higher the width ('y') of the stone, the higher the price, so stones with a higher width ('y') should be considered higher-profit stones.
• The 'Premium' cut diamonds are the most expensive, followed by the 'Very Good' cut; these should be considered higher-profit stones.
• Diamonds with clarity 'VS1' and 'VS2' are the most expensive, so these two categories should also be considered higher-profit stones.
• For 'x', i.e. the length of the stone, the higher the length, the lower the price; so stones with a higher length ('x') are lower in profitability.
• The higher the 'z', i.e. the height of the stone, the lower the price. This is because if a diamond's height is too large, the diamond will appear 'dark' because it will no longer return an attractive amount of light. Stones with a higher 'z' are therefore also lower in profitability.

Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn't. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important
factors on the basis of which the company will focus on particular employees to sell their packages.

Data Dictionary:

Variable Name – Description
Holiday_Package   – Opted for holiday package: yes/no?
Salary            – Employee salary
age               – Age in years
edu               – Years of formal education
no_young_children – Number of young children (younger than 7 years)
no_older_children – Number of older children
foreign           – Foreigner: yes/no

Q2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data
analysis.

Solution:
The following are some observations after initial exploration of the data: (Details in Python file)

Head of Data

Tail of Data

Head after dropping 'Unnamed: 0'

• We have no null values in the dataset.
• We have integer and object data.

Describe:

The data we have is of integer and object type; here the holiday package is our target variable.

Salary, age, educ, the number of young children, the number of older children, and whether the employee has been abroad (foreign) are the given attributes we have to examine to help the company predict whether a person will opt for the holiday package or not.

There are no null values in the dataset

CHECK FOR DUPLICATES IN THE GIVEN DATASET

Number of duplicate rows = 0

Unique values for categorical variables

Percentage of employees interested in the holiday package: 45.9%

Data Visualization – Univariate Analysis

SKEWNESS

• We can see that most of the distributions are right-skewed, except for educ.
• The Salary distribution has the maximum number of outliers.
• There are some outliers in educ, no_young_children and no_older_children.

CATEGORICAL UNIVARIATE ANALYSIS

• As we can observe, people with salaries below 150000 prefer the holiday package.
• Employees aged over 50 to 60 seem not to take the holiday package, whereas people aged 30 to 50 with a salary of less than 50000 have opted more for the holiday package.

BIVARIATE ANALYSIS – DATA DISTRIBUTION

There is hardly any correlation between the variables, and the data seems to be normal. There is no huge difference in the data distribution across the holiday-package classes; I don't see two clearly different distributions in the dataset provided.

AFTER TREATING OUTLIERS, THE DATA LOOKS LIKE THIS

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the
data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

Solution:

Encoding the data (having string variables)

Here we have done one-hot encoding to create dummy variables, and we can see all values for foreign_yes are 0 in the rows shown. Encoding the data helps the logistic regression model make better predictions. A sketch of the encoding follows.
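A minimal sketch of this encoding with pandas get_dummies (the file name and column names are assumed for illustration):

import pandas as pd

hp = pd.read_csv("Holiday_Package.csv")                 # file name assumed
hp = hp.drop(columns=["Unnamed: 0"], errors="ignore")   # drop serial-number index if present

# One-hot encode the string columns, dropping the first level of each
hp_encoded = pd.get_dummies(hp, columns=["Holiday_Package", "foreign"], drop_first=True)
print(hp_encoded.head())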
Train/ Test split

We will split the data in a 70:30 ratio.

The grid-search method is used for logistic regression to find the optimal solver and the parameters for solving. The parameters found using grid search are: penalty = 'l2', solver = 'liblinear', tolerance = 1e-06.

Prediction on the training and test sets

ytrain_predict = best_model.predict(X_train)
ytest_predict = best_model.predict(X_test)

Getting the probabilities on the test set
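A minimal sketch of the split, grid search and probability prediction described above (the parameter grid and column names are assumed for illustration):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X = hp_encoded.drop(columns=["Holiday_Package_yes"])
y = hp_encoded["Holiday_Package_yes"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

grid = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid={"penalty": ["l2"], "solver": ["liblinear", "lbfgs"], "tol": [1e-4, 1e-6]},
    scoring="f1",
    cv=3,
)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_     # e.g. penalty='l2', solver='liblinear', tol=1e-06

ytrain_predict = best_model.predict(X_train)
ytest_predict = best_model.predict(X_test)
ytest_prob = best_model.predict_proba(X_test)[:, 1]   # probabilities on the test set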

LDA (linear discriminant analysis)


DATASET HEAD

Build LDA Model

PROBABILITY PREDICTION
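A minimal sketch of the LDA model build and probability prediction, assuming scikit-learn:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

lda_train_pred = lda_model.predict(X_train)             # class predictions
lda_test_prob = lda_model.predict_proba(X_test)[:, 1]   # class-membership probabilities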

Performance Metrics will be discussed in 2.3

2.3 Performance Metrics: Check the performance of predictions on the train and test sets using accuracy and the confusion matrix, plot the ROC curve and get the ROC_AUC score for each model. Final Model: compare both models and write an inference on which model is best/optimized.

Solution:
PERFORMANCE METRICS FOR LOGISTIC REGRESSION
Confusion matrix on the training data

Confusion matrix cells are populated by the terms:


True Positive (TP) – values which are predicted as True and are actually True.
True Negative (TN) – values which are predicted as False and are actually False.
False Positive (FP) – values which are predicted as True but are actually False.
False Negative (FN) – values which are predicted as False but are actually True.

ROC Curve – the Receiver Operating Characteristic (ROC) curve measures the performance of models by evaluating the trade-off between sensitivity (the true positive rate) and 1 - specificity (the false positive rate).
AUC – the area under the curve (AUC) is another measure for classification models based on the ROC; it is the measure of accuracy judged by the area under the ROC curve.
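A minimal sketch of these checks for either model (illustrative):

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

print("Accuracy (test):", accuracy_score(y_test, ytest_predict))
print(confusion_matrix(y_test, ytest_predict))
print(classification_report(y_test, ytest_predict))

print("ROC_AUC (test):", roc_auc_score(y_test, ytest_prob))
fpr, tpr, _ = roc_curve(y_test, ytest_prob)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()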

Performance metrics of the Logistic Regression model:

Train data

AUC score is 0.731 or 73.1%

Confusion Matrix for Train data:

Test Data:

LDA Model:
Confusion Matrix on Training data:

The accuracy score of the training data and test data is the same at 66%. This is almost similar to the Logistic Regression model results so far. The AUC score is marginally lower for the test data; otherwise they are also almost similar to the Logistic Regression model. F1 scores are 61% and 57% for the train and test data, respectively, which again is close to the logistic regression model.
AUC for the training data: 0.731 or 73.1%
AUC for the test data: 0.714 or 71.4%

Overall, the model seems to be a right-fit model and stays away from being referred to as an under-fit or over-fit model. Let us see if we can refine the results further and improve the F1 score of the test data specifically.
Custom cut off for the LDA model:
Comparison of the Classification report:
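A minimal sketch of applying a custom cut-off to the LDA probabilities (the 0.4 threshold is the one discussed below):

import numpy as np
from sklearn.metrics import classification_report

custom_cutoff = 0.40
lda_test_prob = lda_model.predict_proba(X_test)[:, 1]
lda_test_custom = np.where(lda_test_prob >= custom_cutoff, 1, 0)

print(classification_report(y_test, lda_test_custom))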

As stated above, both the models – Logistic Regression and LDA – offer almost similar results, while LDA offers the flexibility to control important metrics such as precision, recall and F1 score by changing the custom cut-off. In this case study, the moment we changed the cut-off to 40%, we were able to improve our precision, recall and F1 scores considerably. Further, it is up to the business whether or not they would allow playing with the custom cut-off values.
For this case study, though, I have chosen to proceed with Logistic Regression as it is easier to implement and interpret, and very efficient to train. Also, our dependent variable follows a binary classification of classes, and hence it is ideal for us to rely on the logistic regression model to study the test case at hand.
Logistic regression is a classification algorithm used to find the probability of event success and event failure. It is used when the dependent variable is binary (0/1, True/False, Yes/No) in nature. It learns a linear relationship from the given dataset and then introduces non-linearity in the form of the sigmoid function.

2.4 Inference: Based on these predictions, what are the insights and recommendations?
Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present.

Solution:
We were given a problem where we had to find out whether the employees will opt for a holiday package or not. We analysed the data using Logistic Regression and LDA.
We found that the results using both methods are the same; predictions were made using both models.

While doing EDA we found that:

• Most of the employees who are above 50 don't opt for holiday packages; it seems they are not interested in holiday packages at all.
• Employees in the age group of 30 to 50 opt for holiday packages. It seems younger people believe in spending on holiday packages, so age plays a very important role in deciding whether they will opt for the package or not.
• Also, people who have a salary of less than 50000 opt for holiday packages, so salary is also a deciding factor for the holiday package.
• Education also plays an important role in deciding on the holiday packages.
• To improve our customer base we need to look into these factors.
Recommendations
• As we already have a customer base in the 30-to-50 age group, we need to look for options to target older people and people who are earning more than 150000.
• As we know, most older people prefer to visit religious places, so it would be better if we target those places and provide them with packages where they can visit religious places.
• We can also look into the family dynamics of the older people: if they have elder children, e.g. aged 30 to 40, those children can use the holiday packages, so the deal should include a family package.
• People who earn more than 150000 don't spend much on the holiday packages; they tend to go for lavish holidays, so we can provide them with customized packages according to their wishes, such as fancy hotels, longer vacations and personal cars during the holiday, to attract such employees.
• In addition, for people who earn more than 150000 we can provide extra facilities according to their own wishes at the moment.

In this project we started with EDA and descriptive statistics, did a null-value check, performed univariate and bivariate analysis, carried out exploratory data analysis and treated outliers. We then moved on to logistic regression: we encoded the data (having string values) for modelling, split the data into train and test (70:30), and finally applied Logistic Regression and LDA (linear discriminant analysis).

