Advanced Regression
in Excel
The Excel Statistical Master
By Mark Harmon
Copyright 2011 Mark Harmon
No part of this publication may be reproduced
or distributed without the express permission
of the author.
[email protected]
www.ExcelMasterSeries.com
ISBN: 978-0-9833070-6-8
Table of Contents
Using Dummy Variable Regression in Excel To Perform Conjoint Analysis
Step-By-Step Video Showing How To Perform Conjoint Analysis Using Dummy Variable Regression in Excel In Order To Find Out Which Product Attributes Your Customers Value The Most
The 6 Steps of Performing Conjoint Analysis
Step 1) List All Product Attributes For 1 Product
Step 2) Make a List of All Possible Combinations of Those Attributes
Step 3) Have Consumer Rate Each Attribute Combination
Step 4) Prepare Completed Survey for Regression
Dummy Variables to Be Removed From Input Data To Prevent Collinearity
Step 5) Run Regression in Excel
Step 6) Derive Attribute Utilities From Regression Output
An Example of Using a Dummy Variable
The Problem of Collinearity - and How To Solve It
The Product Utilities - The Measure of Customer Liking
Assume Non-Negative
Bypass Solver Reports
The video on the next page will make the entire procedure of Dummy Variable
Regression in Excel to perform Conjoint Analysis much easier to understand:
Instructional Video
Go to
http://www.youtube.com/watch?v=EMbiGPGlBEM
to View a
Video From Excel Master Series
About How To Use
Dummy Variable Regression
in Excel To Perform
Conjoint Analysis
(Is Your Internet Connection and Sound Turned On?)
Showing the Regression Equation Predicts Nearly the Same Score as the
Customer's Ranking of Card 13, Even Though Dummy Variables Were
Removed
This video will illustrate exactly how to quickly and easily understand the output
of Regression performed in Excel:
Copyright 2011 http://ExcelMasterSeries.com/New_Manuals.php
Some parts of the Excel Regression output are much more important than
others. The goal here is for you to be able to glance at the Excel Regression
output and immediately understand it, so we will focus our attention only on the
four most important parts of the Excel regression output.
R Square
This is the most important number of the output. R Square tells how well the
regression line approximates the real data. This number tells you how much of
the output variable's variance is explained by the input variables.
Ideally we would like this to be at least 0.6 (60%) or 0.7 (70%).
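As a rough illustration of what R Square measures, the following Python sketch fits a straight line to a small data set and computes R Square as 1 minus the ratio of unexplained variation to total variation. The data values and the function names fit_line and r_square are hypothetical and are not taken from this manual. Excel reports the same quantity in its Regression output.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_square(actual, predicted):
    """1 - (unexplained variation / total variation)."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]
r2 = r_square(ys, predicted)
print(f"R Square = {r2:.2f}")
```

For this made-up data the fitted line explains about 60% of the variance, which sits right at the lower guideline mentioned above.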
Adjusted R Square
This is quoted most often when explaining the accuracy of the regression
equation. Adjusted R Square is more conservative than R Square because it is
never greater than R Square. Another reason that Adjusted R Square is quoted
more often is that when new input variables are added to the Regression
analysis, Adjusted R Square increases only when the new input variable improves
the fit enough to justify the extra variable (it improves the Regression
equation's ability to predict the output). R Square never goes down when a new
variable is added, whether or not the new input variable improves the Regression
equation's accuracy.
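The penalty for extra input variables can be seen in the standard Adjusted R Square formula, sketched here in Python. The function name and the sample figures (20 observations, R Square rising from 0.600 to 0.605 when a second input is added) are hypothetical and chosen only to show the effect.

```python
def adjusted_r_square(r_square, n_observations, n_inputs):
    """Standard adjustment: penalizes R Square for each input variable used."""
    return 1 - (1 - r_square) * (n_observations - 1) / (n_observations - n_inputs - 1)


# Hypothetical figures: adding a weak second input nudges R Square up slightly.
before = adjusted_r_square(0.600, n_observations=20, n_inputs=1)
after = adjusted_r_square(0.605, n_observations=20, n_inputs=2)

print(f"Adjusted R Square with 1 input:  {before:.4f}")
print(f"Adjusted R Square with 2 inputs: {after:.4f}")
```

In this example R Square rose, but Adjusted R Square fell, which signals that the second input did not earn its place in the equation.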
Significance of F
This is the probability of obtaining a Regression fit at least this strong purely
by chance if the input variables actually had no relationship to the output. A
small Significance of F supports the validity of the Regression output. For
example, if Significance of F = 0.030, a result like this would occur by chance
only 3% of the time.
The residuals are the differences between the Regression's predicted values and
the actual values of the output variable. You can quickly plot the Residuals on a
scatterplot chart. Look for patterns in the scatterplot. The more random (without
patterns) and centered around zero the residuals appear to be, the more likely it
is that the Regression equation is valid.
There are many other pieces of information in the Excel regression output but the
above four items will give a quick read on the validity of your Regression.
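A minimal Python sketch of the residual check follows. The data and the fitted line (2.2 + 0.6 * x) are hypothetical, and the residual is taken as actual minus predicted, which is the sign convention assumed here.

```python
# Hypothetical data and a hypothetical fitted line, for illustration only.
xs = [1, 2, 3, 4, 5]
actual = [2, 4, 5, 4, 5]
predicted = [2.2 + 0.6 * x for x in xs]

# Residual = actual value minus predicted value (assumed sign convention).
residuals = [a - p for a, p in zip(actual, predicted)]
mean_residual = sum(residuals) / len(residuals)

for x, r in zip(xs, residuals):
    print(f"x = {x}: residual = {r:+.2f}")
print(f"Mean residual = {mean_residual:.2f}")
```

A mean residual near zero with positive and negative values mixed together is what you hope to see. A run of residuals that are all positive and then all negative would suggest a pattern the equation has missed.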
Instructional Video
Go to
http://www.youtube.com/watch?v=NHOO7iceJrw
to View a
Video From Excel Master Series
About How To Use
Logistic Regression
in Excel To Predict If Your
Next Prospect
WILL BUY! (or not !#!$%!)
(Is Your Internet Connection and Sound Turned On?)
Suppose that you have collected three pieces of data on each of your previous
prospects. The data you have collected on each prospect was:
The Logit
Event X is a purchase. In other words, P(X) is the probability that Y = 1.
P(X) has only one variable. That is L, which is called the Logit.
L, the Logit, has 3 variables: Constant, A, and B. They must be known before
P(X) can be calculated. Those 3 variables can be found in Excel by using the
Excel Solver. The Excel Solver will find the optimal combination of those 3
variables that causes the resulting P(X) to most accurately predict whether Y = 1
or 0 for all previous prospects.
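The relationship between the Logit and P(X) can be sketched in Python. This sketch assumes the standard logistic form, P(X) = e^L / (1 + e^L), and assumes that L = Constant + A * Age + B * Gender with Gender coded as 0 or 1. The function names and the sample values of Constant, A, and B are hypothetical.

```python
import math


def logit(constant, a, b, age, gender):
    """L = Constant + A * Age + B * Gender (assumed form; gender coded 0 or 1)."""
    return constant + a * age + b * gender


def purchase_probability(l):
    """Standard logistic function: P(X) = e^L / (1 + e^L)."""
    return 1.0 / (1.0 + math.exp(-l))


# Hypothetical values for the three Logit variables and one prospect.
l_value = logit(constant=-12.0, a=0.25, b=1.0, age=40, gender=1)
p = purchase_probability(l_value)
print(f"L = {l_value:.2f}, P(X) = {p:.4f}")
```

Whatever values the Logit takes, P(X) always lands between 0 and 1, which is why it can be read as a probability of purchase.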
Using Excel, each recorded prospect has the following calculation performed:
P(X)^Y * [1 - P(X)]^(1-Y)
The Y refers to Y = 1 if the prospect bought and Y = 0 if the prospect didn't buy.
The P(X) is the probability of purchase that will be calculated using the equation
listed above. In Excel, P(X) is initially calculated using starting values of the
Logit variables (Constant, A, and B) which are not optimal. The Excel Solver will
then keep trying new combinations of these variables until the optimal P(X) is
found.
The sum of each P(X)^Y * [1 - P(X)]^(1-Y) calculation for all prospects is taken.
The only variables that exist when calculating P(X)^Y * [1 - P(X)]^(1-Y) are Y and
the variables of P(X), which are Constant, A, and B. Using the Excel Solver, these
variables are adjusted until their values maximize the sum of all
P(X)^Y * [1 - P(X)]^(1-Y) calculations.
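The per-prospect calculation and its sum can be sketched in Python as follows. The prospect records, the two sets of trial values, and the function names are hypothetical, and P(X) is assumed to follow the standard logistic form. For comparison, the sketch also computes the sum of the logarithms of the same terms, which is the objective that standard maximum-likelihood logistic regression maximizes.

```python
import math


def purchase_probability(constant, a, b, age, gender):
    """P(X) from the assumed Logit L = Constant + A * Age + B * Gender."""
    l = constant + a * age + b * gender
    return 1.0 / (1.0 + math.exp(-l))


def likelihood_term(p, y):
    """P(X)^Y * [1 - P(X)]^(1-Y): equals p if the prospect bought, 1 - p if not."""
    return p ** y * (1 - p) ** (1 - y)


# Hypothetical prospects: (age, gender coded 0/1, Y = 1 if bought else 0).
prospects = [(25, 0, 0), (32, 1, 0), (41, 0, 1), (48, 1, 1), (55, 0, 1)]


def sum_of_terms(constant, a, b):
    """Objective described in the text: the sum of the per-prospect terms."""
    return sum(
        likelihood_term(purchase_probability(constant, a, b, age, g), y)
        for age, g, y in prospects
    )


def log_likelihood(constant, a, b):
    """Standard maximum-likelihood objective: the sum of the logs of the terms."""
    return sum(
        math.log(likelihood_term(purchase_probability(constant, a, b, age, g), y))
        for age, g, y in prospects
    )


start = (0.0, 0.0, 0.0)      # non-optimal starting values
better = (-8.0, 0.2, 0.0)    # hypothetical improved values
print(sum_of_terms(*start), sum_of_terms(*better))
print(log_likelihood(*start), log_likelihood(*better))
```

Both objectives rise when the trial values describe the recorded purchases better, which is the direction the Solver pushes Constant, A, and B.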
Stated another way, we now have a predictive equation P(X) which uses the
optimal combination of Constant, A, and B to most accurately calculate the
probability that Y = 1 given a prospect's age and gender.
The embedded video provides a clear picture of all of this in action in Excel.
The use of the Excel Solver does require some hand-tweaking to ensure that the
most accurate answer is obtained. The video shows an example of this.
Ultimately what the Solver is doing is adjusting variables Constant, A, and B to
maximize the sum of the column of
P(X)^Y * [1 - P(X)]^(1-Y) calculations. The answer obtained by the Solver should
maximize that sum and provide realistic answers for the probabilities of each
prospect, including the new one.
In the video, you will be able to watch how a Decision Variable is constrained to
make the final answer more accurate. The Decision Variable called Constant was
constrained to always remain above -25 during the Solver analysis. This resulted
in the most accurate and realistic maximization of the sum of the
P(X)^Y * [1 - P(X)]^(1-Y) calculations.
The input and output variables will be graphed together. The y-axis of the chart
will provide the scale for plotting of those values. The x-axis will provide a
measure of whatever continuum was used, e.g. time, to collect the values of all of
the variables. Excel's charting function is the way to go here. The above linked
video shows exactly how to chart all the data in Excel.
between the output variable and an input variable indicates that the input variable
is not a good predictor of the output. That input variable should be removed from
the Regression Analysis. The attached video provides an example of this.
In this problem we are going to show how to use the Excel Solver to calculate an
equation which most closely describes the relationship between sales and
number of ads being run. The purpose of this equation is to be able to predict the
number of sales based upon the number of ads that will be run.
A marketing manager has collected the following data on the company's sales
vs. the number of ads that were running at different times.
Sales: 6700, 7500, 8700, 8900, 8800, 10900, 11200, 11400, 11500, 12300
We would like to create an equation from this data that allows us to predict the
sales based upon the number of ads currently running.
The first step is to eyeball the data and estimate what general type of curve this
graph probably is. In this case it appears to be a graph in which the y value rises
by diminishing amounts as the x value increases. A formula for such a curve would
have the general form:
Y = A1 + A2 * X^B1
Sales = A1 + A2 * (Number of Ads Running)^B1
We can use the Excel Solver to solve for A1, A2, and B1. We need to arrange
the data in a form that can be input into the Excel Solver as follows:
This table shows the arrangement of data and the calculations. Here we have
created an Excel model based upon our model of:
Sales = A1 + A2 * (Number of Ads Running)^B1
One example of this formula in action is explained for Cell E16. We are listing the
variables that we are solving for (A1, A2, and B1) in cells B3 to B5. In Solver
language, these values that we are changing are called Decision Variables.
We now take the difference between the actual number of sales and the number
of sales predicted by our model with our arbitrary settings for the Decision
Variables. The square of each difference is taken and then all squares are
summed up.
We are trying to find the settings for the Decision Variables that will minimize the
sum of the squares of the differences. In other words, we are trying to find A1,
A2, and B1 that will minimize the number in cell G13.
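The quantity being minimized can be sketched in Python. The function names and the three (ads, sales) pairs are hypothetical and were chosen so that one setting of the Decision Variables fits them exactly. They are not the data from this problem.

```python
def predicted_sales(a1, a2, b1, number_of_ads):
    """Model from the text: Sales = A1 + A2 * (Number of Ads Running)^B1."""
    return a1 + a2 * number_of_ads ** b1


def sum_of_squared_differences(a1, a2, b1, data):
    """Objective: sum over all rows of (actual sales - predicted sales)^2."""
    return sum(
        (sales - predicted_sales(a1, a2, b1, ads)) ** 2
        for ads, sales in data
    )


# Hypothetical (number of ads, actual sales) pairs, for illustration only.
data = [(1, 10.0), (4, 20.0), (9, 30.0)]

arbitrary = sum_of_squared_differences(0.0, 1.0, 1.0, data)
better = sum_of_squared_differences(0.0, 10.0, 0.5, data)
print(f"Arbitrary settings: {arbitrary}")
print(f"Better settings:    {better}")
```

The arbitrary settings leave a large sum of squares, while the better settings drive it to zero for this made-up data. The Solver performs the same comparison automatically, changing the Decision Variables until the sum stops shrinking.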
Once the Solver has been installed as an add-in (To add-in Solver: File /
Options / Add-Ins / Manage / Excel Add-Ins / Go / Solver Add-In), you can
access the Solver in Excel 2010 by: Data / Solver.
The Solver dialogue box has the following 4 parameters that need to be set:
1) The Objective Cell: This is the target cell that we are trying to
maximize, minimize, or set to a certain value.
4) Constraints: These are the limitations that the problem places on the
Solver during its calculations.
Objective:
We are trying to minimize Cell G13, the sum of the square of differences
between the actual and predicted sales.
Decision Variables:
We are changing A1, A2, and B1 (cells B3 to B5) to minimize our Objective, Cell
G13. The Decision Variables are therefore Cells B3 to B5.
Constraints:
There are none for this curve-fitting operation.
These functions have graphs that are curved (nonlinear), but have no breaks
(smooth)
Solver has optimized the Decision Variables to minimize the objective function as
follows:
A1 = -445,616
A2 = 437,247
B1 = 0.00911
The Objective is minimized to: 2,556,343
We can now create an Excel graph of the Actual Sales vs. the Predicted Sales as
follows:
Solver calculates that Sales can be predicted from Number of Ads Running by
the following equation:
Sales = -445616 + 437247 * (Number of Ads Running)^0.00911
The trickiest part of this problem is the first step: eyeballing the data to
determine what kind of curve the data follows. You should take time to
evaluate whether you are fitting the correct curve type.
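The fitted equation can be used for prediction as in the following Python sketch. The values of A1, A2, and B1 are the Solver results quoted above. The ad counts of 50, 100, and 150 are hypothetical inputs chosen only to show the shape of the curve.

```python
# Decision Variable values reported by the Solver in the text.
A1 = -445616.0
A2 = 437247.0
B1 = 0.00911


def predicted_sales(number_of_ads):
    """Sales = A1 + A2 * (Number of Ads Running)^B1."""
    return A1 + A2 * number_of_ads ** B1


# Hypothetical ad counts, for illustration only.
for ads in (50, 100, 150):
    print(f"{ads} ads -> predicted sales of about {predicted_sales(ads):,.0f}")

first_gain = predicted_sales(100) - predicted_sales(50)
second_gain = predicted_sales(150) - predicted_sales(100)
```

Each additional block of 50 ads adds fewer predicted sales than the block before it, which matches the diminishing shape that was assumed when the curve type was chosen.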
Solver Tips
You may notice that if you run this problem through the Solver multiple times, you
will get slightly different answers. Each time that you run Solver's GRG algorithm,
it may calculate different values for the Decision Variables. You are trying to find
the values for the Decision Variables that minimize the objective function (cell
G13) the most.
When the Solver runs the GRG algorithm, it starts its calculations from the values
currently in the Decision Variable cells. After each run those cells hold the
previous answer, so the next run begins from a slightly different starting point.
That is why different answers can appear during each run. Choose the Decision
Variable values from the run which produces the lowest value of the Objective.
Keep running the Solver until the Objective no longer decreases. That should give
you the optimal values of the Decision Variables. That was done in the example
above.
Show Iteration Results: Leave this unchecked. This stops the GRG Solver after
each iteration, displaying the result for that iteration. Very rarely is there a reason
for doing that.
Use Automatic Scaling: Leave this box unchecked. You would only use this
option if you had reason to believe that inputs of the Solver were measured using
different scales.
Assume Non-Negative: Only check this if you are sure that none of the
variables can ever be negative. That is clearly not the case here, since the
optimized value of A1 is negative.
Summary
Excel Solver is an easy-to-use and powerful nonlinear regression tool as a result
of its curve-fitting capacity. One use of this is to calculate predictive sales
equations for your company. It will work as long as you have properly determined
the correct general curve type in the beginning.