Regression Analysis
AN INTUITIVE GUIDE FOR USING AND INTERPRETING LINEAR MODELS
Jim Frost
Copyright © 2019 by Jim Frost.
INTRODUCTION
My Approach to Teaching Regression and Statistics
NOTE: This sample contains only the introduction and first two chapters. Please buy the full ebook for all the content listed in the Table of Contents. You can buy it in My Store.
I love statistics and analyzing data! I also love talking and writing about it. I was a researcher at a major university. Then, I spent over a decade working at a major statistical software company. During my time at the statistical software company, I learned how to present statistics in a manner that makes it more intuitive. I want you to understand the essential concepts, practices, and knowledge for regression analysis so you can analyze your data confidently. That’s the goal of my book.
You’ll notice that there are not many equations in this book. After all, you should let your statistical software handle the calculations so you don’t get bogged down in them and can instead focus on understanding your results. Instead, I focus on the concepts and practices that you’ll need to know to perform the analysis and interpret the results correctly. I’ll use more graphs than equations!
Please note that throughout this book I use Minitab statistical software. However, this book is not about teaching particular software but rather how to perform regression analysis. All common statistical software packages should be able to perform the analyses that I show. There is nothing in here that is unique to Minitab.
CHAPTER 1
Correlation and an Introduction to Regression
There are different types of correlation that you can use for different kinds of data. In this chapter, I cover the most common type of correlation: Pearson’s correlation coefficient.

Before we get into the numbers, let’s graph some data first so we can understand the concept behind what we are measuring.
At a glance, you can see that there is a relationship between height and
weight. As height increases, weight also tends to increase. However,
it’s not a perfect relationship. If you look at a specific height, say 1.5
meters, you can see that there is a range of weights associated with it.
You can also find short people who weigh more than taller people.
However, the general tendency that height and weight increase together is unquestionably present.

Pearson’s correlation takes all of the data points on this graph and represents them with a single summary statistic. In this case, the statistical output below indicates that the correlation is 0.705.
What do the correlation and p-value mean? We’ll interpret the output
soon. First, let’s look at a range of possible correlation values so we
can understand how our height and weight example fits in.
Graphs and the relevant statistical measures often work better in tandem.

This example illustrates another reason to graph your data! Just because the coefficient is near zero, it doesn’t necessarily indicate that there is no relationship.
For the hypothesis test, our p-value equals 0.000. This p-value is less than any reasonable significance level. Consequently, we can reject the null hypothesis and conclude that the relationship is statistically significant. The sample data provide sufficient evidence to conclude that the relationship between height and weight exists in the population of preteen girls.
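To make the calculation concrete, here is a minimal sketch in plain Python of how Pearson’s correlation coefficient is computed. The height and weight numbers below are made-up stand-ins for illustration, not the dataset from the graph.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation: the covariance of x and y divided by
    the product of their standard deviations (computed via sums of
    squared deviations, so the n's cancel)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical height (m) and weight (kg) pairs -- illustrative only.
heights = [1.4, 1.5, 1.5, 1.6, 1.7]
weights = [40.0, 45.0, 50.0, 52.0, 60.0]
r = pearson_r(heights, weights)
```

The p-value that software reports alongside the correlation comes from a hypothesis test on r (a t-test); it isn’t computed in this sketch — that’s a job for your statistical software.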
For instance, analysts naturally want to fit models that explain more and more of the variability in the data. And, they come up with classification schemes for how well the model fits the data. However, there is a natural amount of variability that the model can’t explain, just as there was in the height and weight correlation example. Regression models can be forced to go past this natural boundary, but bad things happen. Throughout this book, be aware of the tension between trying to explain as much variability as possible and ensuring that you don’t go too far. This issue pops up multiple times!
That’s why I love it! You’ll learn when you should consider using regression analysis.
You might run across unfamiliar terms. Don’t worry. I’ll cover all of them throughout this book! The upcoming section provides a preview of things you’ll learn later in the book. For now, let’s define several basics: the fundamental types of variables that you’ll include in your regression analysis and your primary goals for using regression analysis.
Dependent Variables
The dependent variable is the variable that you want to explain or predict using the model. The values of this variable depend on other variables. It’s also known as the response variable or the outcome variable, and it is commonly denoted by Y. Traditionally, analysts graph dependent variables on the vertical, or Y, axis.
Independent Variables
Independent variables are the variables that you include in the model to explain or predict changes in the dependent variable. In controlled experiments, independent variables are systematically set and changed by the researchers. However, in observational studies, values of the independent variables are not set by researchers but rather observed. These variables are also known as predictor variables or input variables, and they are commonly denoted using Xs. On graphs, analysts place independent variables on the horizontal, or X, axis.
Regression analysis can handle many things. For example, you can use regression analysis to do the following:

These capabilities are all cool, but they don’t include an almost magical ability. Regression analysis can unscramble very intricate problems where the variables are entangled like spaghetti. For example, imagine you’re a researcher studying any of the following:
Regression holds the other variables constant. You can assess the effect of coffee intake while controlling for smoking. Conveniently, you’re also controlling for coffee intake when looking at the effect of smoking.
Note that the study also illustrates how excluding a relevant variable
can produce misleading results. Omitting an important variable causes
it to be uncontrolled, and it can bias the results for the variables that
you do include in the model. In the example above, the first model
without smoking could not control for this important variable, which
forced the model to include the effect of smoking in another variable
(coffee consumption).
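The biasing effect of omitting a confounder can be demonstrated with a small simulation. This is a sketch with made-up effect sizes, not data from any real study: smoking drives both coffee consumption and the outcome, and a model that drops smoking inflates the coffee coefficient.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Made-up causal structure: smoking affects both coffee intake and the outcome.
smoking = rng.normal(size=n)
coffee = 0.7 * smoking + 0.5 * rng.normal(size=n)
outcome = 2.0 * smoking + 1.0 * coffee + 0.5 * rng.normal(size=n)

def ols_coefs(predictors, y):
    """Least-squares coefficients with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

full = ols_coefs([coffee, smoking], outcome)   # controls for smoking
omitted = ols_coefs([coffee], outcome)         # smoking left out

coffee_full = full[1]        # recovers something close to the true effect of 1.0
coffee_omitted = omitted[1]  # absorbs part of smoking's effect, so it is inflated
```

Because coffee is the only variable in the second model that is correlated with smoking, it gets credit for smoking’s effect, which is exactly the bias described above.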
Low p-values (typically < 0.05) indicate that the independent variable is statistically significant. Regression analysis is a form of inferential statistics. Consequently, the p-values help determine whether the relationships that you observe in your sample also exist in the larger population.
The low p-values indicate that both education and IQ are statistically significant. The coefficient for IQ (4.796) indicates that each additional IQ point increases your income by an average of approximately $4.80 while controlling for everything else in the model. Furthermore, the education coefficient (24.215) indicates that an additional year of education increases average earnings by $24.22 while holding the other variables constant.
Using regression analysis gives you the ability to separate the effects involved in complicated research questions. You can disentangle the spaghetti noodles by modeling and controlling all relevant variables and then assessing the role that each one plays.
CHAPTER 2
In later chapters, we’ll cover possible reasons for using other kinds of regression analysis. I’ll ensure that you know when you should consider a specialized type of analysis, and give you pointers about which alternatives to consider for various issues.
Use best practices while collecting your data. The following are some
points to consider:
Now, let’s see how OLS regression goes beyond correlation and produces an equation for the line that best fits a dataset.
Let’s start with some basic terms that I’ll use throughout this book.
While I strive to explain regression analysis in an intuitive manner
using everyday English, I do use proper statistical terminology. Doing
so will help you if you’re following along with a college statistics
course or need to communicate with professionals about your model.
Fitted values are the values that the model predicts for the dependent variable using the independent variables. If you input values for the independent variables into the regression equation, you obtain the fitted value. Predicted values and fitted values are synonyms.
An observed value is one that exists in the real world while your
model generates the fitted/predicted value for that observation.
The length of the line is the value of the residual. The residual, or error, for the ith observation is the observed value minus the fitted value: e_i = y_i − ŷ_i.
It makes sense, right? You want to minimize the distance between the
observed values and the fitted values. For a good model, the residuals
should be relatively small and unbiased. In statistics, bias indicates
that estimates are systematically too high or too low.
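The two definitions above can be sketched in a few lines of Python. The equation and the observation here are made up purely for illustration:

```python
# Hypothetical fitted equation: y-hat = 1.3 + 0.9 * x (made-up coefficients).
intercept, slope = 1.3, 0.9

def fitted(x):
    """Fitted (predicted) value from the regression equation."""
    return intercept + slope * x

# One observation: its x value and the y value we actually measured.
observed_x, observed_y = 2.0, 3.5

y_hat = fitted(observed_x)     # 1.3 + 0.9 * 2.0 = 3.1
residual = observed_y - y_hat  # 3.5 - 3.1 = 0.4
```

A positive residual means the model under-predicted that observation; a negative one means it over-predicted.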
If the residuals become too large or biased, the model is no longer useful. Consequently, these differences play a vital role during both the model estimation process and later when you assess the quality of the model.
Using the Sum of the Squared Errors (SSE) to Find the Best Line
Let’s go back to the height and weight dataset for which we calculated
the correlation.
You could draw many different potential lines. Some observations will
fit the model better or worse than other points, and that will vary
based on the line that you draw. Which measure would you use to
quantify how well the line fits all of the data points? Using what you
learned above, you know that you want to minimize the residuals.
And, it should be a measure that factors in the difference for all of the
points. We need a summary statistic for the entire dataset.
You can’t merely sum the residuals because the positive and negative values will cancel each other out even when they tend to be relatively large. Instead, OLS regression squares those residuals so they’re always positive. In this manner, the process can add them up without canceling each other out.
Then, the ordinary least squares procedure sums these squared errors: SSE = Σ (y_i − ŷ_i)².
OLS draws the line that minimizes the sum of squared errors (SSE).
Hopefully, you’re gaining an appreciation for why the procedure is
named ordinary least squares!
In textbooks, you’ll find equations for how OLS derives the line that minimizes SSE. Statistical software packages use these equations to solve for the solution directly. However, I’m not going to cover those equations. Instead, it’s crucial for you to understand the concepts of residuals and how the procedure minimizes the SSE. If you were to draw any line other than the one that OLS produces, the SSE would increase, which indicates that the distances between the observed and fitted values are growing, and the model is not as good.
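You can check this numerically with a short sketch using illustrative data: compute the least-squares slope and intercept from the standard textbook closed-form expressions, then verify that nudging the line in any direction makes the SSE worse.

```python
def ols_fit(x, y):
    """Closed-form simple OLS: slope = Sxy / Sxx; the line passes
    through the point (mean of x, mean of y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope  # (intercept, slope)

def sse(x, y, intercept, slope):
    """Sum of squared errors for a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]  # illustrative data, not the book's dataset
y = [2, 3, 5, 4, 6]
b0, b1 = ols_fit(x, y)
best = sse(x, y, b0, b1)

# Every nudged line has a larger SSE than the OLS line.
worse = [sse(x, y, b0 + d0, b1 + d1)
         for d0 in (-0.2, 0.2) for d1 in (-0.1, 0.1)]
```

Because SSE is a quadratic function of the intercept and slope, the OLS solution is its unique minimum; any other line you draw must do worse.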
First, because OLS calculates squared errors using residuals, the model fitting process ultimately ties back to the residuals very strongly. Residuals are the underlying foundation for how least squares regression fits the model. Consequently, understanding the properties of the residuals for your model is vital. They play an enormous role in determining whether your model is good or not. You’ll hear so much about them throughout this book. In fact, chapter 9 focuses on them. So, I won’t say much more here. For now, just know that you want relatively small and unbiased residuals (positive and negative are equally likely) that don’t display patterns when you graph them.
Second, the fact that the OLS procedure squares the residuals has significant ramifications. It makes the model susceptible to outliers and unusual observations. To understand why, consider the following set of residuals: {1, 2, 3}. Imagine most of your residuals are in this range. These residuals produce the following squared errors: {1, 4, 9}. Now, imagine that one observation has a residual of 6, which yields a squared error of 36. Compare the magnitude of most squared errors (1–9) to that of the unusual observation (36).
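A quick numerical sketch with made-up points shows how squaring lets one unusual observation dominate the fit: five points lying exactly on the line y = x give a slope of 1, but adding a single outlier pulls the least-squares slope far away from it.

```python
def ols_slope(x, y):
    """Simple least-squares slope: Sxy / Sxx."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]  # points exactly on y = x, so the slope is 1
clean_slope = ols_slope(x, y)

# One outlier with a large residual contributes residual**2 to the SSE,
# so the fitted line swings toward it to shrink that squared error.
outlier_slope = ols_slope(x + [6], y + [20])
```

This is why checking for outliers and unusual observations is part of any careful regression analysis.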
For the same dataset, as you fit better models, RSS increases and SSE
decreases by an exactly corresponding amount. RSS cannot be greater
than TSS while SSE cannot be less than zero.
Additionally, if you take RSS / TSS, you’ll obtain the percentage of the
variability of the dependent variable around its mean that your model
explains. This statistic is R-squared!
Keep in mind that these sums of squares all measure variability. You
might hear about models and variables accounting for variability, and
that harkens back to these measures of variability.
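The decomposition is easy to verify numerically. This sketch, again with illustrative data, computes all three sums of squares from an OLS fit and confirms that TSS = RSS + SSE and that RSS / TSS is R-squared.

```python
def ols_fit(x, y):
    """Closed-form simple OLS fit returning (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

x = [1, 2, 3, 4, 5]  # illustrative data
y = [2, 3, 5, 4, 6]
b0, b1 = ols_fit(x, y)
mean_y = sum(y) / len(y)
fitted = [b0 + b1 * xi for xi in x]

tss = sum((yi - mean_y) ** 2 for yi in y)        # total variability around the mean
rss = sum((fi - mean_y) ** 2 for fi in fitted)   # variability the model explains
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # leftover residual variability

r_squared = rss / tss  # for this data, 8.1 / 10 = 0.81
```

Note that the exact identity TSS = RSS + SSE holds for OLS models that include an intercept, which is the standard case.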
Note: Some texts use RSS to refer to residual sums of squares (which
we’re calling SSE) rather than regression sums of squares. Be aware of
this potentially confusing use of terminology!
This graph shows all the observations together with a line that represents the fitted relationship. As is traditional, the Y-axis displays the dependent variable, which is weight. The X-axis shows the independent variable, which is height. The line is the fitted line. If you enter the full range of height values that are on the X-axis into the regression equation that the chart displays, you will obtain the line shown on the graph. This line produces a smaller SSE than any other line you can draw through these observations.
Visually, we see that the fitted line has a positive slope that corresponds to the positive correlation we obtained earlier. The line follows the data points, which indicates that the model fits the data. The slope of the line equals the coefficient that I circled. This coefficient indicates how much mean weight tends to increase as we increase height. We can also enter a height value into the equation and obtain a prediction for the mean weight.
Each point on the fitted line represents the mean weight for a given
height. However, like any mean, there is variability around the mean.
Notice how there is a spread of data points around the line. You can
assess this variability by picking a spot on the line and observing the
range of data points above and below that point. Finally, the vertical
distance between each data point and the line is the residual for that
observation.
As fantastic as fitted line plots are, they can only show simple regression models, which contain just one independent variable. Fitted line plots use two axes: one for the dependent variable and the other for the independent variable. Consequently, fitted line plots are great for displaying simple regression models on a screen or printed on paper. However, each additional independent variable requires another axis or physical dimension. With two independent variables, we could use a 3D representation, although that’s beyond my abilities for this book. With three independent variables, we’d need a four-dimensional plot. That’s not going to happen!
You learned the basics of how OLS minimizes the sum of squared errors (SSE) to produce the best fitting line for your dataset. And, you saw how SSE fits in with two other sums of squares, regression sums of squares (RSS) and total sums of squares (TSS). In the process, you even got a sneak peek at R-squared (RSS / TSS)!
NOTE: This sample contains only the introduction and first two chapters. Please buy the full ebook for all the content listed in the Table of Contents. You can buy it in My Store.