Regression

This document discusses various statistical techniques for analyzing relationships between variables, including correlation methods such as Pearson's r and Spearman's rho, as well as regression models such as simple and multiple linear regression. It outlines the assumptions necessary for these analyses, the importance of checking for normality and multicollinearity, and cautions regarding the interpretation of correlation and regression results. Additionally, it introduces path analysis as an extension of regression modeling to explore causal relationships among variables.

REGRESSION

CORRELATION TECHNIQUES

Pearson's r: measures the strength and magnitude of a linear relationship between two variables.

Spearman's rho: the nonparametric version of the Pearson r; measures the strength of association between two ranked variables.

Point-biserial coefficient: used to measure the strength and direction of the association that exists between one continuous variable and one dichotomous variable.

Phi coefficient: said to be the measure of the amount of the association between two binary variables.

Check for normality of data:
- Use of box plots (determining outliers)
- Kolmogorov-Smirnov (2,000 samples or more)
- Shapiro-Wilk
- Use of Normal Q-Q plot

Apply the proper correlation technique, interpret the correlation results, and test for significance.
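The notes assume SPSS for these computations. As a rough illustration only, the sketch below shows how the same correlation techniques and normality checks might be run with Python's SciPy; the data, variable names, and sample size are invented for the example.

```python
# Illustrative sketch of the correlation techniques and normality checks above.
# Synthetic data; in practice these would come from your own dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two continuous variables with an imperfect linear relationship.
height = rng.normal(170, 10, size=100)
weight = 0.9 * height + rng.normal(0, 8, size=100)

r, p = stats.pearsonr(height, weight)          # Pearson's r
rho, p_rho = stats.spearmanr(height, weight)   # Spearman's rho (rank-based)

# Point-biserial: one continuous and one dichotomous variable.
group = rng.integers(0, 2, size=100)
r_pb, p_pb = stats.pointbiserialr(group, weight)

# Phi: association between two binary variables
# (equivalent to Pearson's r computed on 0/1 codes).
a = rng.integers(0, 2, size=100)
b = rng.integers(0, 2, size=100)
phi, p_phi = stats.pearsonr(a, b)

# Normality checks before choosing a technique:
# Shapiro-Wilk for modest samples, Kolmogorov-Smirnov for very large ones.
sw = stats.shapiro(weight)
ks = stats.kstest(weight, "norm", args=(weight.mean(), weight.std(ddof=1)))

print(f"Pearson r = {r:.3f} (p = {p:.3g}), Spearman rho = {rho:.3f}")
print(f"point-biserial r = {r_pb:.3f}, phi = {phi:.3f}")
print(f"Shapiro-Wilk p = {sw.pvalue:.3g}, K-S p = {ks.pvalue:.3g}")
```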

SIMPLE LINEAR REGRESSION AND ITS ANALYSIS

Model: In statistics, modeling is one of the techniques under regression analysis used in estimating relationships among variables.

TYPES
1. Deterministic/Functional Models (no randomness)
2. Probabilistic/Statistical Models (with randomness)

Deterministic Models
- Hypothesize exact relationships
- Suitable when prediction error is negligible
- Examples: Body Mass Index and weight; Celsius and Fahrenheit; circumference and diameter; velocity and time; voltage and current; pressure and volume

Probabilistic Models
- Hypothesize two components: a deterministic component plus random error
- The relationship between variables is not perfect
- Examples: height and weight; IQ and test scores; alcohol consumed and tolerance; vital lung capacity and years of smoking

Types of Probabilistic Models
- Regression models
- Correlation models
- Other models

Regression Models
- Relationship between one dependent variable and explanatory variable(s)
- Use an equation to set up the relationship
- One numerical dependent (response) variable
- One or more numerical or categorical independent (explanatory) variables
- Used mainly for prediction and estimation

Simple Linear Regression
A statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables:
- One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
- The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
Use scatter plots as the graphical representation.
Conditions (L.I.N.E.)
- The mean of the response, E(Yi), at each value of the predictor, xi, is a Linear function of xi.
- The errors, εi, are Independent.
- The errors, εi, at each value of the predictor, xi, are Normally distributed.
- The errors, εi, at each value of the predictor, xi, have Equal variances (denoted σ2).

Correlation and R-Square
- Correlation: measures the magnitude and direction of the association between variables.
- R-squared: also known as the coefficient of determination, a statistical measure of how close the data are to the fitted regression line. It is the percentage of the response variable variation that is explained by a linear model. Typically, values of R^2 below 20% are considered weak, between 20% and 40% moderate, and above 40% strong.

R-squared Cautions

Caution 1. The coefficient of determination r2 and the correlation coefficient r quantify the strength of a linear relationship. It is possible that r2 = 0% and r = 0, suggesting there is no linear relation between x and y, and yet a perfect curved (or "curvilinear") relationship exists.

Caution 2. A large r2 value should not be interpreted as meaning that the estimated regression line fits the data well. Another function might better describe the trend in the data.

Caution 3. The coefficient of determination r2 and the correlation coefficient r can both be greatly affected by just one data point (or a few data points).

Caution 4. Correlation (or association) does not imply causation. Recall the distinction between an experiment and an observational study:
- An experiment is a study in which, when collecting the data, the researcher controls the values of the predictor variables.
- An observational study is a study in which, when collecting the data, the researcher merely observes and records the values of the predictor variables as they happen.
The primary advantage of conducting experiments is that one can typically conclude that differences in the predictor values are what caused the changes in the response values. This is not the case for observational studies. Unfortunately, most data used in regression analyses arise from observational studies. Therefore, you should be careful not to overstate your conclusions.

Caution 5. Ecological correlations — correlations that are based on rates or averages — tend to overstate the strength of an association (e.g., correlations across regions, or use of aggregate rather than individual data).

Caution 6. A "statistically significant" r2 value does not imply that the slope β1 is meaningfully different from 0: "statistical significance does not imply practical significance."

Caution 7. A large r2 value does not necessarily mean that a useful prediction of the response ynew, or estimation of the mean response µY, can be made. It is still possible to get prediction intervals or confidence intervals that are too wide to be useful.
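As a hedged sketch (synthetic data, Python's statsmodels and SciPy rather than the SPSS output the notes describe), the example below fits a simple linear regression, reports r and R-squared, and runs quick residual checks for the L.I.N.E. conditions.

```python
# Sketch: simple linear regression, r vs. R-squared, and basic L.I.N.E. checks.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=60)
y = 3.0 + 2.0 * x + rng.normal(0, 2, size=60)   # linear trend plus random error

fit = sm.OLS(y, sm.add_constant(x)).fit()

r, _ = stats.pearsonr(x, y)
print(f"intercept = {fit.params[0]:.2f}, slope = {fit.params[1]:.2f}")
print(f"r = {r:.3f}, R-squared = {fit.rsquared:.3f}")   # r**2 equals R-squared here

# Residual checks for the L.I.N.E. conditions:
resid = fit.resid
print("Shapiro-Wilk on residuals, p =", round(stats.shapiro(resid).pvalue, 3))
# A residuals-vs-fitted plot is the usual visual check for linearity and equal
# variance; sm.qqplot(resid) gives a Normal Q-Q plot for normality.
```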
MULTIPLE LINEAR REGRESSION

Multiple linear regression is an analysis procedure to use when more than one explanatory variable is included in a "model". That is, when we believe there is more than one explanatory variable that might help "explain" or "predict" the response variable, we'll put all of these explanatory variables into the "model" and perform a multiple linear regression analysis.

The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).

FORMULA: y = β0 + β1x1 + β2x2 + … + βkxk + ε

The multiple linear regression model is just an extension of the simple linear regression model. In simple linear regression, we used an "x" to represent the explanatory variable. In multiple linear regression, we'll have more than one explanatory variable, so we'll have more than one "x" in the equation. We'll distinguish between the explanatory variables by putting subscripts next to the "x's" in the equation.

In simple linear regression, it was easy to picture the model two-dimensionally with a scatterplot because there was only one explanatory variable. If we had two explanatory variables, we could still picture the model: the x-axis would represent the first explanatory variable, the y-axis the second explanatory variable, and the z-axis would represent the response variable. The model would actually be an equation of a plane. However, when there are three or more explanatory variables, it becomes impossible to picture the model.

That is, we can't visualize what the equation represents. Because of this, β0 is not called a "y-intercept" anymore but is just called a "constant" term. It is the value in the equation without any "x" next to it. (It is often called a constant term in simple linear regression as well, but we can visualize what this constant term is in simple linear regression: it's the y-intercept.)

Likewise, the numbers in front of the "x's" are no longer slopes in multiple regression since the equation is not an equation of a line anymore. We'll call these numbers coefficients, which means "numbers in front of". As we will see, the interpretation of the coefficients (β1, β2, etc.) will be very similar to the interpretation of the slope in simple linear regression.

ASSUMPTIONS:

Assumption #1: Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable).

Assumption #2: You have two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable).

Assumption #3: You should have independence of observations (i.e., independence of residuals), which you can easily check using the Durbin-Watson statistic.

Assumption #4: There needs to be a linear relationship between (a) the dependent variable and each of your independent variables, and (b) the dependent variable and the independent variables collectively.

Assumption #5: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.

Assumption #6: Your data must not show multicollinearity, which occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model.

Assumption #7: There should be no significant outliers, high leverage points or highly influential points. Outliers, leverage and influential points are different terms used to represent observations in your data set that are in some way unusual when you wish to perform a multiple regression analysis. These different classifications of unusual points reflect the different impact they have on the regression line. An observation can be classified as more than one type of unusual point. However, all these points can have a very negative effect on the regression equation that is used to predict the value of the dependent variable based on the independent variables. This can change the output that SPSS Statistics produces and reduce the predictive accuracy of your results as well as the statistical significance.

Assumption #8: Finally, you need to check that the residuals (errors) are approximately normally distributed. Two common methods to check this assumption include using: (a) a histogram (with a superimposed normal curve) and a Normal P-P plot; or (b) a Normal Q-Q plot of the studentized residuals.
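A minimal sketch of a multiple linear regression fit follows, again assuming Python's statsmodels and invented data rather than the SPSS workflow the notes describe; it prints the constant, the coefficients, and a couple of the assumption checks listed above.

```python
# Sketch: multiple linear regression with a constant term and several "x"s.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
n = 120
df = pd.DataFrame({"x1": rng.normal(size=n),
                   "x2": rng.normal(size=n),
                   "x3": rng.normal(size=n)})
df["y"] = 1.5 + 2.0 * df.x1 - 1.0 * df.x2 + 0.5 * df.x3 + rng.normal(0, 1, n)

X = sm.add_constant(df[["x1", "x2", "x3"]])     # the "constant" term (beta_0)
model = sm.OLS(df["y"], X).fit()

print(model.params)                             # const plus one coefficient per x
print("R-squared:", round(model.rsquared, 3))

# Assumption #3 (independent residuals): Durbin-Watson close to 2 is desirable.
print("Durbin-Watson:", round(durbin_watson(model.resid), 2))
# Assumption #8 (normal residuals): e.g. sm.qqplot(model.resid, line="45").
```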
Influential Points
- An outlier is a data point whose response y does not follow the general trend of the rest of the data.
- A data point has high leverage if it has "extreme" predictor x values. With a single predictor, an extreme x value is simply one that is particularly high or low. With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be "unusual" combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor).
- Note that — for our purposes — we consider a data point to be an outlier only if it is extreme with respect to the other y values, not the x values.
- A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. Outliers and high leverage data points have the potential to be influential, but we generally have to investigate further to determine whether or not they are actually influential.

Diagnostics:
- Durbin-Watson Test: a test that the residuals from a linear regression or multiple regression are independent. The Durbin-Watson statistic is a number that tests for autocorrelation in the residuals from a statistical regression analysis. The Durbin-Watson statistic is always between 0 and 4. A value of 2 means that there is no autocorrelation in the sample. Values approaching 0 indicate positive autocorrelation and values toward 4 indicate negative autocorrelation.
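The outlier, leverage, and influence ideas above can be computed directly from a fitted model. The sketch below is only an illustration (statsmodels on synthetic data with one deliberately unusual point); it is not the SPSS casewise diagnostics the notes have in mind.

```python
# Sketch: outlier / leverage / influence measures for an OLS fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=40)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=40)
x[-1], y[-1] = 20.0, 10.0            # extreme x (leverage) and a poor fit in y

fit = sm.OLS(y, sm.add_constant(x)).fit()
infl = fit.get_influence()

leverage = infl.hat_matrix_diag                   # high values: extreme x
studentized = infl.resid_studentized_external     # large |values|: outlying y
cooks_d = infl.cooks_distance[0]                  # overall influence on the fit

worst = int(np.argmax(cooks_d))
print("most influential observation:", worst)
print(f"leverage = {leverage[worst]:.2f}, "
      f"studentized residual = {studentized[worst]:.2f}, "
      f"Cook's D = {cooks_d[worst]:.2f}")
```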
Correlations
- Zero-order: a correlation between two variables which does not include a control variable. A first-order correlation, then, would include one control variable as well as the independent variable and dependent variable.
- Partial: partial correlation is a measure of the strength and direction of a linear relationship between two continuous variables whilst controlling for the effect of one or more other continuous variables (also known as 'covariates' or 'control' variables). Although partial correlation does not make the distinction between independent and dependent variables, the two variables are often considered in such a manner (i.e., you have one continuous dependent variable and one continuous independent variable, as well as one or more continuous control variables).
- Part: semi-partial correlation.

Collinearity Statistics

You can assess multicollinearity by examining tolerance and the Variance Inflation Factor (VIF). Tolerance is a measure of collinearity reported by most statistical programs such as SPSS; the variable's tolerance is 1 - R2. A small tolerance value indicates that the variable under consideration is almost a perfect linear combination of the independent variables already in the equation and that it should not be added to the regression equation. All variables involved in the linear relationship will have a small tolerance. Some suggest that a tolerance value less than 0.1 should be investigated further. If a low tolerance value is accompanied by large standard errors and nonsignificance, multicollinearity may be an issue.

The Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) measures the impact of collinearity among the variables in a regression model. The VIF is 1/Tolerance, and it is always greater than or equal to 1. There is no formal VIF value for determining the presence of multicollinearity. Values of VIF that exceed 10 are often regarded as indicating multicollinearity, but in weaker models values above 2.5 may be a cause for concern. In many statistics programs, the results are shown both as an individual R2 value (distinct from the overall R2 of the model) and a Variance Inflation Factor (VIF). When those R2 and VIF values are high for any of the variables in your model, multicollinearity is probably an issue. When VIF is high there is high multicollinearity and instability of the b and beta coefficients. It is often difficult to sort this out.

Different ways to assess multicollinearity

1. Examine the correlations and associations (for nominal variables) between independent variables to detect a high level of association. High bivariate correlations are easy to spot by running correlations among your variables. If high bivariate correlations are present, you can delete one of the two variables. However, this may not always be sufficient.

2. Regression coefficients will change dramatically according to whether other variables are included or excluded from the model. Play around with this by adding and then removing variables from your regression model.

3. The standard errors of the regression coefficients will be large if multicollinearity is an issue.

4. Predictor variables with known, strong relationships to the outcome variable will not achieve statistical significance. In this case, neither may contribute significantly to the model after the other one is included. But together they contribute a lot. If you remove both variables from the model, the fit would be much worse. So the overall model fits the data well, but neither X variable makes a significant contribution when it is added to your model last. When this happens, multicollinearity may be present.
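Tolerance and VIF are straightforward to compute outside SPSS as well; the sketch below (statsmodels, invented data with two nearly collinear predictors) mirrors the 1/Tolerance relationship described above.

```python
# Sketch: tolerance and VIF for each predictor in a design matrix.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)    # nearly a copy of x1
x3 = rng.normal(size=n)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:8.1f}, tolerance = {1.0 / vif:.4f}")
# x1 and x2 should show very large VIFs (tiny tolerances), flagging the
# near-perfect linear relationship between them.
```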
Path Analysis

Path analysis is an extension of the regression model. In a path analysis model built from the correlation matrix, two or more causal models are compared. The path of the model is shown by a square and an arrow, which shows the causation. The regression weight is predicted by the model, and then the goodness-of-fit statistic is calculated in order to see how well the model fits.

Key concepts and terms:
- Estimation method: simple OLS and maximum likelihood methods are used to predict the paths.
- Path model: a diagram which shows the independent, intermediate, and dependent variables. A single-headed arrow shows the causal direction between the independent, intermediate and dependent variables. A double-headed arrow shows the covariance between two variables.
- Exogenous and endogenous variables: exogenous variables are those with no error pointing towards them, except the measurement error term. If exogenous variables are correlated with each other, a double-headed arrow connects those variables. Endogenous variables may have both incoming and outgoing arrows.
- Path coefficient: a standardized regression coefficient (beta), showing the direct effect of an independent variable on a dependent variable in the path model.
- Disturbance terms: the residual error terms are also called disturbance terms. Disturbance terms reflect the unexplained variance and measurement error.
- Direct and indirect effect: the path model has two types of effects. The first is the direct effect, and the second is the indirect effect. When the exogenous variable has an arrow directed towards the dependent variable, it is said to be a direct effect. When an exogenous variable affects the dependent variable through another exogenous variable, it is said to be an indirect effect. To see the total effect of the exogenous variable, we add the direct and indirect effects. A variable may have no direct effect but still have an indirect effect.
- Chi-square statistics: a non-significant chi-square value in path analysis indicates good model fit. Sometimes the chi-square statistic is significant; even so, we still have to examine one absolute fit index and one incremental fit index.

ASSUMPTIONS:
- Linearity: relationships should be linear.
- Interval level data: data should be dichotomous nominal, interval or ratio level of measurement.
- Uncorrelated residual term: error terms should not be correlated with any variable.
- Disturbance terms: disturbance terms should not be correlated with endogenous variables.
- Multicollinearity: low multicollinearity is assumed. Perfect multicollinearity may cause problems in the path analysis.
- Identification: the path model should not be under-identified; exactly identified or over-identified models are good.
- Adequate sample size: Kline (1998) recommends that the sample size should be 10 times (or ideally 20 times) as many cases as parameters, and at least 200.

Example of Very Simple Path Analysis via Regression (with correlation matrix input)

Certainly the three most important sets of decisions leading to a path analysis are:
1. Which causal variables to include in the model
2. How to order the causal chain of those variables
3. Which paths are not "important" to the model (the only part that is statistically tested)
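As a rough, assumption-laden sketch of the regression approach to a very simple path model (dedicated SEM software would normally be used), the code below estimates a direct effect, an indirect effect through a mediator, and their total from two OLS fits on standardized variables. The variable names and data are hypothetical.

```python
# Sketch: a tiny path model  x -> m -> y  plus a direct path  x -> y,
# estimated with two OLS regressions on standardized variables so the
# coefficients behave like path (beta) coefficients.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 300
x = rng.normal(size=n)
m = 0.6 * x + rng.normal(scale=0.8, size=n)            # intermediate variable
y = 0.3 * x + 0.5 * m + rng.normal(scale=0.8, size=n)  # endogenous outcome

def z(v):
    return (v - v.mean()) / v.std()

df = pd.DataFrame({"x": z(x), "m": z(m), "y": z(y)})

a = sm.OLS(df.m, sm.add_constant(df.x)).fit().params["x"]        # x -> m
fit_y = sm.OLS(df.y, sm.add_constant(df[["x", "m"]])).fit()
direct, b = fit_y.params["x"], fit_y.params["m"]                 # x -> y, m -> y

indirect = a * b                     # effect of x on y through m
print(f"direct = {direct:.2f}, indirect = {indirect:.2f}, "
      f"total = {direct + indirect:.2f}")
```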
Lesson 4.2: REGRESSION METHODS; ENTER, STEPWISE, FORWARD, BACKWARD, & REMOVE

Introduction

The basis of a multiple linear regression is to assess whether one continuous dependent variable can be predicted from a set of independent (or predictor) variables. Or, in other words, how much variance in a continuous dependent variable is explained by a set of predictors. Certain regression selection approaches are helpful in testing predictors, thereby increasing the efficiency of analysis.

Entry Method

The standard method of entry is simultaneous (a.k.a. the enter method); all independent variables are entered into the equation at the same time. This is an appropriate analysis when dealing with a small set of predictors and when the researcher does not know which independent variables will create the best prediction equation. Each predictor is assessed as though it were entered after all the other independent variables were entered, and assessed by what it offers to the prediction of the dependent variable that is different from the predictions offered by the other variables entered into the model.

Selection Methods

Selection, on the other hand, allows for the construction of an optimal regression equation along with investigation into specific predictor variables. The aim of selection is to reduce the set of predictor variables to those that are necessary and account for nearly as much of the variance as is accounted for by the total set. In essence, selection helps to determine the level of importance of each predictor variable. It also assists in assessing the effects once the other predictor variables are statistically eliminated. The circumstances of the study, along with the nature of the research questions, guide the selection of predictor variables.

Four selection procedures are used to yield the most appropriate regression equation: forward selection, backward elimination, stepwise selection, and block-wise selection. The first three of these four procedures are considered statistical regression methods. Many times researchers use sequential regression (hierarchical or block-wise) entry methods that do not rely upon statistical results for selecting predictors. Sequential entry allows the researcher greater control of the regression process. Items are entered in a given order based on theory, logic or practicality, and are appropriate when the researcher has an idea as to which predictors may impact the dependent variable.

Statistical Regression Methods of Entry:

Forward selection begins with an empty equation. Predictors are added one at a time, beginning with the predictor with the highest correlation with the dependent variable. Variables of greater theoretical importance are entered first. Once in the equation, the variable remains there.

Backward elimination (or backward deletion) is the reverse process. All the independent variables are entered into the equation first, and each one is deleted one at a time if it does not contribute to the regression equation.

Stepwise selection is considered a variation of the previous two methods. Stepwise selection involves analysis at each step to determine the contribution of the predictor variables entered previously in the equation. In this way it is possible to understand the contribution of the previous variables now that another variable has been added. Variables can be retained or deleted based on their statistical contribution.

Sequential Regression Method of Entry:

Block-wise selection is a version of forward selection that is achieved in blocks or sets. The predictors are grouped into blocks based on psychometric considerations or theoretical reasons, and a stepwise selection is applied. Each block is applied separately while the other predictor variables are ignored. Variables can be removed when they do not contribute to the prediction. In general, the predictors included in the blocks will be inter-correlated. Also, the order of entry has an impact on which variables will be selected; those that are entered in the earlier stages have a better chance of being retained than those entered at later stages.

Essentially, the multiple regression selection process enables the researcher to obtain a reduced set of variables from a larger set of predictors, eliminating unnecessary predictors, simplifying data, and enhancing predictive accuracy. Two criteria are used to achieve the best set of predictors: meaningfulness to the situation and statistical significance. By entering variables into the equation in a given order, confounding variables can be investigated and variables that are highly correlated can be combined into blocks.

Other definitions of regression methods
- Enter (default): all independent variables are entered into the equation in one step; also called "forced entry".
- Remove: all variables in a block are removed simultaneously.
- Stepwise: based on the p-value of F (probability of F), SPSS starts by entering the variable with the smallest p-value; at the next step it again enters the variable (from the list of variables not yet in the equation) with the smallest p-value for F, and so on. Variables already in the equation are removed if their p-value becomes larger than the default limit due to the inclusion of another variable. The method terminates when no more variables are eligible for inclusion or removal. This method is based on both probability-to-enter (PIN) and probability-to-remove (POUT) (or alternatively FIN and FOUT).
- Backward elimination: first all variables are entered into the equation and then sequentially removed. For each step SPSS provides statistics, namely R2. At each step, the variable with the largest probability of F is removed (if the value is larger than POUT). Alternatively, FOUT can be specified as a criterion.
- Forward selection: at each step the variable not yet in the equation with the smallest probability of F is entered, as long as the value is smaller than PIN. Alternatively you can use the value of F by specifying FIN on /CRITERIA. The procedure stops when there are no variables that meet the entry criterion.
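The selection procedures above are easy to prototype. The sketch below implements a simplified forward selection driven by the p-value of each candidate's coefficient; it only mimics the PIN idea described above and is not SPSS's exact algorithm, and the data and the 0.05 threshold are assumptions.

```python
# Sketch: simplified forward selection by p-value (a stand-in for PIN).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 150
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2.0 * X.x1 - 1.5 * X.x3 + rng.normal(size=n)   # x2 and x4 are pure noise

def forward_select(y, X, pin=0.05):
    """Repeatedly enter the candidate with the smallest p-value below pin."""
    selected, candidates = [], list(X.columns)
    while candidates:
        pvals = {}
        for c in candidates:
            exog = sm.add_constant(X[selected + [c]])
            pvals[c] = sm.OLS(y, exog).fit().pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= pin:
            break                     # no remaining variable meets the criterion
        selected.append(best)
        candidates.remove(best)
    return selected

print("selected predictors:", forward_select(y, X))   # expect ['x1', 'x3']
```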
Lesson 5: OUTLIERS (Detection, Effects and Solutions)

Definition

Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it's best to remove them from your data. But that's not always the case. Removing outliers is legitimate only for specific reasons.

Outliers can be very informative about the subject-area and data collection process. It's essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. Unfortunately, resisting the temptation to remove outliers inappropriately can be difficult. Outliers increase the variability in your data, which decreases statistical power. Consequently, excluding outliers can cause your results to become statistically significant.

Finding Outliers in your Data

Outliers are data points that are far from other data points. In other words, they're unusual values in a dataset. Outliers are problematic for many statistical analyses because they can cause tests to either miss significant findings or distort real results.

Unfortunately, there are no strict statistical rules for definitively identifying outliers. Finding outliers depends on subject-area knowledge and an understanding of the data collection process. While there is no solid mathematical definition, there are guidelines and statistical tests you can use to find outlier candidates.

Outliers and Their Impact

Outliers are a simple concept: they are values that are notably different from other data points, and they can cause problems in statistical procedures.

To demonstrate how much a single outlier can affect the results, let's examine the properties of an example dataset. It contains 15 height measurements of human males. One of those values is an outlier. [Table: mean height and standard deviation with and without the outlier.]

From the table, it's easy to see how a single outlier can distort reality. A single value changes the mean height by 0.6 m (2 feet) and the standard deviation by a whopping 2.16 m (7 feet)! Hypothesis tests that use the mean with the outlier are off the mark. And the much larger standard deviation will severely reduce statistical power! Before performing statistical analyses, you should identify potential outliers.

WAYS TO FIND OUTLIERS

There are a variety of ways to find outliers. All these methods employ different approaches for finding values that are unusual compared to the rest of the dataset.

Sorting Your Datasheet to Find Outliers

Sorting your datasheet is a simple but effective way to highlight unusual values. Simply sort your data sheet for each variable and then look for unusually high or low values. While this approach doesn't quantify the outlier's degree of unusualness, it is easy to use because, at a glance, you'll find the unusually high or low values. [Figure: the sorted example dataset, illustrating outliers found by sorting.]

Graphing Your Data to Identify Outliers

Boxplots, histograms, and scatterplots can highlight outliers.

Boxplots display asterisks or other symbols on the graph to indicate explicitly when datasets contain outliers. These graphs use the interquartile method with fences to find outliers. [Figure: boxplot of the example dataset.] It's clear that the outlier is quite different from the typical data value.

You can also use boxplots to find outliers when you have groups in your data. [Figure: boxplots by group for a second dataset, showing an outlier in the Method 2 group.]
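The interquartile "fences" that boxplots use are easy to compute directly. The sketch below is illustrative only; the numbers loosely echo the height example in this lesson (plausible heights plus one impossible value) but are not the author's actual data.

```python
# Sketch: flag potential outliers with the 1.5 * IQR fences used by boxplots.
import numpy as np

heights = np.array([1.78, 1.70, 1.75, 1.82, 1.69, 1.74, 1.80, 1.73,
                    1.76, 1.71, 1.79, 1.68, 1.77, 1.72, 10.81])  # one bad value

q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = heights[(heights < lower) | (heights > upper)]
print(f"fences: [{lower:.2f}, {upper:.2f}]")
print("potential outliers:", flagged)

# The same single value also dominates the summary statistics:
print("mean with outlier   :", round(heights.mean(), 2))
print("mean without outlier:", round(np.delete(heights, -1).mean(), 2))
```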
Histograms also emphasize the existence of outliers. Look for isolated bars. [Figure: histogram of the example dataset; the outlier is the bar far to the right, and the graph crams the legitimate data points on the far left.]

Most of the outliers discussed in this lesson are univariate outliers. We look at a data distribution for a single variable and find values that fall outside the distribution. However, you can use a scatterplot to detect outliers in a multivariate setting. [Figure: scatterplot of Output versus Input with one point far from the fitted line.]

Interestingly, the Input value (~14) for this observation isn't unusual at all because the other Input values range from 10 through 20 on the x-axis. Also, notice how the Output value (~50) is similarly within the range of values on the y-axis (10 to 60). Neither the Input nor the Output values themselves are unusual in this dataset. Instead, it's an outlier because it doesn't fit the model.

This type of outlier can be a problem in regression analysis. Given the multifaceted nature of multivariate regression, there are numerous types of outliers in that realm.

Data Entry and Measurement Errors and Outliers

Errors can occur during measurement and data entry. During data entry, typos can produce weird values. Imagine that we're measuring the height of adult men and gather the following dataset. [Data: recorded height measurements, one of which is 10.8135.]

In this dataset, the value of 10.8135 is clearly an outlier. Not only does it stand out, but it's an impossible height value. Examining the numbers more closely, we conclude the zero might have been accidental. Hopefully, we can either go back to the original record or even remeasure the subject to determine the correct height.

These types of errors are easy cases to understand. If you determine that an outlier value is an error, correct the value when possible. That can involve fixing the typo or possibly remeasuring the item or person. If that's not possible, you must delete the data point because you know it's an incorrect value.

Sampling Problems Can Cause Outliers

Inferential statistics use samples to draw conclusions about a specific population. Studies should carefully define a population, and then draw a random sample from it specifically. That's the process by which a study can learn about a population.

Unfortunately, your study might accidentally obtain an item or person that is not from the target population. There are several ways this can occur. For example, unusual events or characteristics can occur that deviate from the defined population. Perhaps the experimenter measures the item or subject under abnormal conditions. In other cases, you can accidentally collect an item that falls outside your target population, and, thus, it might have unusual characteristics.

Examples of Sampling Problems

Suppose a study assesses the strength of a product. The researchers define the population as the output of the standard manufacturing process. The normal process includes standard materials, manufacturing settings, and conditions. If something unusual happens during a portion of the study, such as a power failure or a machine setting drifting off the standard value, it can affect the products. These abnormal manufacturing conditions can cause outliers by creating products with atypical strength values. Products manufactured under these unusual conditions do not reflect your target population of products from the normal process. Consequently, you can legitimately remove these data points from your dataset.

During a bone density study that I participated in as a scientist, I noticed an outlier in the bone density growth for a subject.
Her growth value was very unusual. The study's subject coordinator discovered that the subject had diabetes, which affects bone health. Our study's goal was to model bone density growth in pre-adolescent girls with no health conditions that affect bone growth. Consequently, her data were excluded from our analyses because she was not a member of our target population.

If you can establish that an item or person does not represent your target population, you can remove that data point. However, you must be able to attribute a specific cause or reason for why that sample item does not fit your target population.

Natural Variation Can Produce Outliers

The previous causes of outliers are bad things. They represent different types of problems that you need to correct. However, natural variation can also produce outliers, and it's not necessarily a problem.

All data distributions have a spread of values. Extreme values can occur, but they have lower probabilities. If your sample size is large enough, you're bound to obtain unusual values. In a normal distribution, approximately 1 in 340 observations will be at least three standard deviations away from the mean. However, random chance might include extreme values in smaller datasets! In other words, the process or population you're studying might produce weird values naturally. There's nothing wrong with these data points. They're unusual, but they are a normal part of the data distribution.

Example of Natural Variation Causing an Outlier

For example, I fit a model that uses historical U.S. Presidential approval ratings to predict how later historians would ultimately rank each President. It turns out a President's lowest approval rating predicts the historian ranks. However, one data point severely affects the model. President Truman doesn't fit the model. He had an abysmal lowest approval rating of 22%, but later historians gave him a relatively good rank of #6. If I remove that single observation, the R-squared increases by over 30 percentage points!

However, there was no justifiable reason to remove that point. While it was an oddball, it accurately reflects the potential surprises and uncertainty inherent in the political system. If I remove it, the model makes the process appear more predictable than it actually is. Even though this unusual observation is influential, I left it in the model. It's bad practice to remove data points simply to produce a better fitting model or statistically significant results.

If the extreme value is a legitimate observation that is a natural part of the population you're studying, you should leave it in the dataset.

Guidelines for Dealing with Outliers

Sometimes it's best to keep outliers in your data. They can capture valuable information that is part of your study area. Retaining these points can be hard, particularly when it reduces statistical significance! However, excluding extreme values solely due to their extremeness can distort the results by removing information about the variability inherent in the study area. You're forcing the subject area to appear less variable than it is in reality.

When considering whether to remove an outlier, you'll need to evaluate if it appropriately reflects your target population, subject-area, research question, and research methodology. Did anything unusual happen while measuring these observations, such as power failures, abnormal experimental conditions, or anything else out of the norm? Is there anything substantially different about an observation, whether it's a person, item, or transaction? Did measurement or data entry errors occur?

If the outlier in question is:
- A measurement error or data entry error: correct the error if possible. If you can't fix it, remove that observation because you know it's incorrect.
- Not a part of the population you are studying (i.e., unusual properties or conditions): you can legitimately remove the outlier.
- A natural part of the population you are studying: you should not remove it.

When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences. Comparing results in this manner is particularly useful when you're unsure about removing an outlier and when there is substantial disagreement within a group over this question.
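One concrete way to follow the "analyze it both ways" advice above is to fit the model with and without the flagged points and report both sets of results. The sketch below does this on invented data (it is not the Presidential-approval example from the text) and uses a common Cook's distance rule of thumb to do the flagging.

```python
# Sketch: compare a regression with and without flagged outliers.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
df = pd.DataFrame({"x": rng.uniform(0, 100, 40)})
df["y"] = 10 + 0.8 * df.x + rng.normal(0, 5, 40)
df.loc[len(df)] = [22.0, 95.0]          # add one influential oddball row

def report(data, label):
    res = sm.OLS(data.y, sm.add_constant(data.x)).fit()
    print(f"{label:>16}: slope = {res.params['x']:.2f}, "
          f"R-squared = {res.rsquared:.2f}")
    return res

full = report(df, "with outlier")
flag = full.get_influence().cooks_distance[0] > 4 / len(df)   # rule of thumb
report(df.loc[~flag], "without outlier")
# Document which points were excluded and why, and discuss both results.
```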
Lesson 6: HETEROSCEDASTICITY NATURE AND CONSEQUENCES

A critical assumption of the classical linear regression model is that the disturbances ui all have the same variance, σ². When this condition holds, the error terms are homoscedastic, which means the errors have the same scatter regardless of the value of X. When the scatter of the errors is different, varying depending on the value of one or more of the independent variables, the error terms are heteroscedastic.

Regression model: Yi = β1 + β2Xi + ui
Homoscedasticity: Var(ui) = σ², the same for every observation i
Heteroscedasticity: Var(ui) = σi², differing across observations

[Figure: homoscedastic pattern of errors]
[Figure: heteroscedastic pattern of errors]

HETEROSCEDASTICITY
- One of the assumptions of the classical linear regression model (CLRM) is that the variance of ui, the error term, is constant, or homoscedastic.
- Reasons for heteroscedasticity are many, including:
  - The presence of outliers in the data
  - Incorrect functional form of the regression model
  - Incorrect transformation of data
  - Mixing observations with different measures of scale (such as mixing high-income households with low-income households)
- Heteroscedasticity is a systematic pattern in the errors where the variances of the errors are not constant.
- Heteroscedasticity occurs when the variance of the error terms differs across observations.

Heteroscedasticity implies that the variances (i.e., the dispersion around the expected mean of zero) of the residuals are not constant, but that they are different for different observations. This causes a problem: if the variances are unequal, then the relative reliability of each observation (used in the regression analysis) is unequal. The larger the variance, the lower should be the importance (or weight) attached to that observation.

EXAMPLE: Suppose 100 students enroll in a typing class, some of whom have typing experience and some of whom do not. After the first class there would be a great deal of dispersion in the number of typing mistakes. After the final class the dispersion would be smaller. The error variance is non-constant: it falls as time increases.

[Figure: the heteroscedastic case]

CONSEQUENCES OF HETEROSCEDASTICITY
1. Ordinary least squares estimators are still linear and unbiased.
2. Ordinary least squares estimators are not efficient.
3. The usual formulas give incorrect standard errors for least squares.
4. Confidence intervals and hypothesis tests based on the usual standard errors are wrong.
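To make consequences 2 through 4 concrete, the sketch below (an illustration with simulated data and statsmodels, not part of the original lesson) generates heteroscedastic errors, applies the Breusch-Pagan test, and compares the usual OLS standard errors with heteroscedasticity-robust ones.

```python
# Sketch: heteroscedastic errors, the Breusch-Pagan test, and robust vs. usual
# standard errors for the slope.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(8)
n = 500
x = rng.uniform(1, 10, n)
u = rng.normal(scale=0.5 * x, size=n)        # error spread grows with x
y = 1.0 + 2.0 * x + u

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3g}")   # small p suggests heteroscedasticity

robust = ols.get_robustcov_results(cov_type="HC1")
print("usual  SE of slope:", round(float(ols.bse[1]), 4))
print("robust SE of slope:", round(float(robust.bse[1]), 4))
# The coefficient estimates themselves are unchanged (still unbiased), but the
# usual standard errors, and any tests built on them, can be misleading.
```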
CONSEQUENCES OF USING OLS IN THE PRESENCE OF HETEROSCEDASTICITY

The existence of heteroscedasticity in the error term of an equation violates Classical Assumption V, and the estimation of the equation with OLS has at least three consequences:
- OLS estimation still gives unbiased coefficient estimates, but they are no longer BLUE.
- This implies that if we still use OLS in the presence of heteroscedasticity, our standard errors could be inappropriate and hence any inferences we make could be misleading.
- Whether the standard errors calculated using the usual formulae are too big or too small will depend upon the form of the heteroscedasticity.
- In the presence of heteroscedasticity, the variances of OLS estimators are not provided by the usual OLS formulas. But if we persist in using the usual OLS formulas, the t and F tests based on them can be highly misleading, resulting in erroneous conclusions.

Conclusion:
- One of the assumptions of OLS regression is that the error terms have a constant variance across all values of the independent variable.
- With heteroscedasticity, this error term variance is not constant.
- Heteroscedasticity is more common in cross-sectional data than in time series data.
- Even if heteroscedasticity is present or suspected, whatever conclusions we draw or inferences we make may be very misleading.

Lesson 7: Autocorrelation

Autocorrelation occurs in time-series studies when the errors associated with a given time period carry over into future time periods. For example, if we are predicting the growth of stock dividends, an overestimate in one year is likely to lead to overestimates in succeeding years.

Time series data follow a natural ordering over time. It is likely that such data exhibit intercorrelation, especially if the time interval between successive observations is short, such as weeks or days. We expect stock market prices to move up or move down for several days in succession. In situations like this, the assumption of no auto- or serial correlation in the error term that underlies the CLRM will be violated.

Sometimes the terms autocorrelation and serial correlation are used interchangeably. However, some authors prefer to distinguish between them. For example, Tintner defines autocorrelation as the 'lag correlation of a given series within itself, lagged by a number of time units', whereas serial correlation is the 'lag correlation between two different series'.

There are different types of serial correlation. With first-order serial correlation, errors in one time period are correlated directly with errors in the ensuing time period. With positive serial correlation, errors in one time period are positively correlated with errors in the next time period.

Causes of Autocorrelation

Inertia: macroeconomic data experience cycles/business cycles.

Specification bias (excluded variable):
Appropriate equation: Yt = b1 + b2X2t + b3X3t + b4X4t + ut
Estimated equation: Yt = b1 + b2X2t + b3X3t + vt
Estimating the second equation implies vt = b4X4t + ut

Specification bias (incorrect functional form).
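A small closing sketch (simulated data, statsmodels; not from the lesson) shows how first-order serial correlation in the errors pulls the Durbin-Watson statistic away from 2, tying this lesson back to the diagnostics section earlier.

```python
# Sketch: Durbin-Watson under independent errors vs. AR(1) errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(9)
n = 200
t = np.arange(n, dtype=float)

def dw_for(errors):
    y = 5.0 + 0.3 * t + errors
    fit = sm.OLS(y, sm.add_constant(t)).fit()
    return durbin_watson(fit.resid)

iid = rng.normal(size=n)                 # independent errors: DW close to 2

rho = 0.8                                # positive first-order autocorrelation
ar1 = np.zeros(n)
for i in range(1, n):
    ar1[i] = rho * ar1[i - 1] + rng.normal()

print("DW, independent errors     :", round(dw_for(iid), 2))
print("DW, AR(1) errors (rho=0.8) :", round(dw_for(ar1), 2))
```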