Regression
SIMPLE LINEAR REGRESSION AND ITS ANALYSIS
(Used mainly for prediction & estimation)
Model: In statistics, modeling refers to techniques, such as regression analysis, used for estimating relationships among variables.
TYPES
1. Deterministic/Functional Models (no randomness)
2. Probabilistic/Statistical Models (with randomness)
[Figure: side-by-side examples of a deterministic model and a simple linear regression model]
ASSUMPTIONS:
Linearity: Relationships should be linear.
Interval level data: Data should be dichotomous nominal, interval, or ratio level of measurement.
Uncorrelated residual term: Error terms should not be correlated with any variable.
Disturbance terms: Disturbance terms should not be correlated with endogenous variables.
Multicollinearity: Low multicollinearity is assumed; perfect multicollinearity may cause problems in the path analysis.
Identification: The path model should not be under-identified; exactly identified or over-identified models are good.
Adequate sample size: Kline (1998) recommends that the sample size should be 10 times (or ideally 20 times) as many cases as parameters, and at least 200.

Example of Very Simple Path Analysis via Regression (with correlation matrix input)

Certainly the three most important sets of decisions leading to a path analysis are (a minimal sketch follows the list):
1. Which causal variables to include in the model
2. How to order the causal chain of those variables
3. Which paths are not “important” to the model – the only part that is statistically tested
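Below is a minimal sketch, not from the original lesson, of a very simple path analysis run from correlation-matrix input. The variable names (X1, X2, Y) and the correlation values are hypothetical. With standardized variables, the path coefficients for an endogenous variable are the solution of the normal equations built from the correlation matrix.

```python
# Hypothetical sketch: standardized path coefficients from correlation input.
import numpy as np

# Hypothetical correlation matrix for three variables: X1, X2, Y
names = ["X1", "X2", "Y"]
R = np.array([
    [1.00, 0.30, 0.50],   # X1
    [0.30, 1.00, 0.40],   # X2
    [0.50, 0.40, 1.00],   # Y
])

# With standardized variables, the path (beta) coefficients for the
# regression of Y on X1 and X2 solve R_xx @ beta = r_xy.
R_xx = R[:2, :2]          # correlations among the causal variables
r_xy = R[:2, 2]           # correlations of each cause with Y
beta = np.linalg.solve(R_xx, r_xy)

for name, b in zip(names[:2], beta):
    print(f"path {name} -> Y: {b:.3f}")

# R^2 for the endogenous variable Y is the sum of beta_j * r_jY.
print("R^2 for Y:", float(beta @ r_xy))
```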
Lesson 4.2: REGRESSION METHODS; ENTER, STEPWISE, FORWARD, BACKWARD, & REMOVE

Introduction

The basis of a multiple linear regression is to assess whether one continuous dependent variable can be predicted from a set of independent (or predictor) variables; in other words, how much variance in a continuous dependent variable is explained by a set of predictors. Certain regression selection approaches are helpful in testing predictors, thereby increasing the efficiency of analysis.

Entry Method

The standard method of entry is simultaneous (a.k.a. the enter method); all independent variables are entered into the equation at the same time. This is an appropriate analysis when dealing with a small set of predictors and when the researcher does not know which independent variables will create the best prediction equation. Each predictor is assessed as though it were entered after all the other independent variables, and is judged by what it adds to the prediction of the dependent variable beyond what the other variables in the model already offer.
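As a concrete illustration, here is a minimal sketch of the enter method in Python (the lesson itself describes SPSS); the data, the number of predictors, and the coefficient values are simulated placeholders.

```python
# Hypothetical sketch of the enter (simultaneous) method: all predictors
# go into the model in one step.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))                   # three hypothetical predictors
y = 1.0 + X @ np.array([0.5, 0.0, 0.8]) + rng.normal(size=n)

# "Forced entry": every independent variable is entered at the same time.
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())                        # each coefficient is assessed
                                              # given all the other predictors
```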
Selection Methods

Selection, on the other hand, allows for the construction of an optimal regression equation along with investigation into specific predictor variables. The aim of selection is to reduce the set of predictor variables to those that are necessary and that account for nearly as much of the variance as is accounted for by the total set. In essence, selection helps to determine the level of importance of each predictor variable. It also assists in assessing the effects once the other predictor variables are statistically eliminated. The circumstances of the study, along with the nature of the research questions, guide the selection of predictor variables.

Four selection procedures are used to yield the most appropriate regression equation: forward selection, backward elimination, stepwise selection, and block-wise selection. The first three of these procedures are considered statistical regression methods. Many times researchers use sequential regression (hierarchical or block-wise) entry methods that do not rely upon statistical results for selecting predictors. Sequential entry allows the researcher greater control of the regression process. Items are entered in a given order based on theory, logic, or practicality, and this approach is appropriate when the researcher has an idea as to which predictors may impact the dependent variable.

Statistical Regression Methods of Entry:

Forward selection begins with an empty equation. Predictors are added one at a time, beginning with the predictor with the highest correlation with the dependent variable. Variables of greater theoretical importance are entered first. Once in the equation, the variable remains there.

Backward elimination (or backward deletion) is the reverse process. All the independent variables are entered into the equation first, and each one is deleted one at a time if it does not contribute to the regression equation.

Stepwise selection is considered a variation of the previous two methods. Stepwise selection involves analysis at each step to determine the contribution of each predictor variable entered previously in the equation. In this way it is possible to understand the contribution of the previous variables now that another variable has been added. Variables can be retained or deleted based on their statistical contribution.
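The following is a minimal sketch of the forward-selection loop just described, assuming a p-value entry criterion of .05; the data, variable names, and threshold are hypothetical.

```python
# Hypothetical sketch of forward selection: start with an empty equation and,
# at each step, enter the candidate predictor with the smallest entry p-value.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, names = 200, ["x1", "x2", "x3", "x4"]
X = rng.normal(size=(n, 4))
y = 2.0 + 1.5 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(size=n)

selected, remaining = [], list(range(4))
while remaining:
    best_p, best_j = 1.0, None
    for j in remaining:
        fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
        p = fit.pvalues[-1]          # p-value of the newly added predictor
        if p < best_p:
            best_p, best_j = p, j
    if best_p >= 0.05:               # entry criterion (hypothetical threshold)
        break
    selected.append(best_j)          # once entered, the variable stays
    remaining.remove(best_j)
    print(f"entered {names[best_j]} (p = {best_p:.4f})")

print("final predictors:", [names[j] for j in selected])
```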
Sequential Regression Method of Entry:

Block-wise selection is a version of forward selection that is achieved in blocks or sets. The predictors are grouped into blocks based on psychometric considerations or theoretical reasons, and a stepwise selection is applied. Each block is applied separately while the other predictor variables are ignored. Variables can be removed when they do not contribute to the prediction. In general, the predictors included in the blocks will be inter-correlated. Also, the order of entry has an impact on which variables will be selected; those that are entered in the earlier stages have a better chance of being retained than those entered at later stages.

Essentially, the multiple regression selection process enables the researcher to obtain a reduced set of variables from a larger set of predictors, eliminating unnecessary predictors, simplifying data, and enhancing predictive accuracy. Two criteria are used to achieve the best set of predictors: meaningfulness to the situation and statistical significance. By entering variables into the equation in a given order, confounding variables can be investigated and variables that are highly correlated can be combined into blocks.

Other definitions of regression methods
- Enter (default): All independent variables are entered into the equation in one step; also called "forced entry".
- Remove: All variables in a block are removed simultaneously.
- Stepwise: Based on the p-value of F (probability of F), SPSS starts by entering the variable with the smallest p-value; at each subsequent step it enters the variable (from those not yet in the equation) with the smallest p-value for F, and so on. Variables already in the equation are removed if their p-value becomes larger than the default limit due to the inclusion of another variable. The method terminates when no more variables are eligible for inclusion or removal. This method is based on both probability-to-enter (PIN) and probability-to-remove (POUT), or alternatively FIN and FOUT. A sketch of this loop follows the list.
- Backward elimination: First all variables are entered into the equation and then sequentially removed. For each step SPSS provides statistics, namely R². At each step, the variable with the largest probability of F is removed (if the value is larger than POUT); alternatively, FOUT can be specified as a criterion.
- Forward selection: At each step, the variable not yet in the equation with the smallest probability of F is entered, as long as the value is smaller than PIN. Alternatively, you can use the value of F by specifying FIN on /CRITERIA. The procedure stops when there are no variables that meet the entry criterion.
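Here is a minimal sketch (not SPSS itself) of the stepwise logic described above, using PIN/POUT-style p-value thresholds; the data and the thresholds are hypothetical placeholders.

```python
# Hypothetical sketch of stepwise selection with PIN/POUT thresholds.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 4))
y = 1.0 + 1.2 * X[:, 0] + 0.9 * X[:, 1] + rng.normal(size=n)

PIN, POUT = 0.05, 0.10          # probability-to-enter / probability-to-remove
selected = []

for _ in range(20):             # safety cap on iterations
    changed = False
    # Entry step: among excluded variables, enter the smallest p-value < PIN.
    excluded = [j for j in range(X.shape[1]) if j not in selected]
    entries = []
    for j in excluded:
        fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
        entries.append((fit.pvalues[-1], j))
    if entries:
        p, j = min(entries)
        if p < PIN:
            selected.append(j)
            changed = True
    # Removal step: drop the included variable whose p-value exceeds POUT.
    if selected:
        fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
        p_vals = fit.pvalues[1:]         # skip the intercept
        worst = int(np.argmax(p_vals))
        if p_vals[worst] > POUT:
            selected.pop(worst)
            changed = True
    if not changed:                      # nothing eligible to enter or leave
        break

print("variables in the final equation:", sorted(selected))
```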
Lesson 5: OUTLIERS (Detection, Effects and Solutions)

Definition
Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it's best to remove them from your data. But that's not always the case. Removing outliers is legitimate only for specific reasons.

Outliers can be very informative about the subject area and data collection process. It's essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. Unfortunately, resisting the temptation to remove outliers inappropriately can be difficult. Outliers increase the variability in your data, which decreases statistical power. Consequently, excluding outliers can cause your results to become statistically significant.

WAYS TO FIND OUTLIERS
There are a variety of ways to find outliers. All these methods employ different approaches for finding values that are unusual compared to the rest of the dataset.

Sorting Your Datasheet to Find Outliers
Sorting your datasheet is a simple but effective way to highlight unusual values. Simply sort your data sheet for each variable and then look for unusually high or low values. While this approach doesn't quantify an outlier's degree of unusualness, it is easy to use because, at a glance, you'll spot the unusually high or low values. The data set above is a sample of finding outliers by sorting.
[Figure: fitted-line plot; one observation falls far from the line even though its Input and Output values are within range]

Interestingly, the Input value (~14) for this observation isn't unusual at all because the other Input values range from 10 through 20 on the X-axis. Also, notice how the Output value (~50) is similarly within the range of values on the Y-axis (10–60). Neither the Input nor the Output values themselves are unusual in this dataset. Instead, it's an outlier because it doesn't fit the model.

This type of outlier can be a problem in regression analysis. Given the multifaceted nature of multivariate regression, there are numerous types of outliers in that realm.

Data Entry and Measurement Errors and Outliers
Errors can occur during measurement and data entry. During data entry, typos can produce weird values. Imagine that we're measuring the height of adult men and gather the following dataset.

Examples of Sampling Problems
Suppose a study assesses the strength of a product. The researchers define the population as the output of the standard manufacturing process. The normal process includes standard materials, manufacturing settings, and conditions. If something unusual happens during a portion of the study, such as a power failure or a machine setting drifting off the standard value, it can affect the products. These abnormal manufacturing conditions can cause outliers by creating products with atypical strength values. Products manufactured under these unusual conditions do not reflect your target population of products from the normal process. Consequently, you can legitimately remove these data points from your dataset.

During a bone density study that I participated in as a scientist, I noticed an outlier in the bone density growth for a subject. Her growth value was very unusual. The study's subject coordinator discovered that the subject had diabetes, which affects bone health. Our study's goal was to model bone density growth in pre-adolescent girls with no health conditions that affect bone growth. Consequently, her data were excluded from our analyses because she was not a member of our target population.

If you can establish that an item or person does not represent your target population, you can remove that data point. However, you must be able to attribute a specific cause or reason for why that sample item does not fit your target population.
Natural Variation Can Produce Outliers
The previous causes of outliers are bad things. They represent different types of problems that you need to correct. However, natural variation can also produce outliers, and that's not necessarily a problem.

All data distributions have a spread of values. Extreme values can occur, but they have lower probabilities. If your sample size is large enough, you're bound to obtain unusual values. In a normal distribution, roughly 1 in 370 observations will be at least three standard deviations away from the mean. However, random chance might include extreme values in smaller datasets! In other words, the process or population you're studying might produce weird values naturally. There's nothing wrong with these data points. They're unusual, but they are a normal part of the data distribution.
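To make the three-standard-deviation figure concrete, here is a minimal sketch that flags such values in simulated data; the sample size and distribution parameters are arbitrary.

```python
# Hypothetical sketch: flag values at least three standard deviations from
# the mean. Because the data are simulated from a normal distribution, any
# flagged points are chance extremes, not errors.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=100, scale=15, size=1000)

z = (x - x.mean()) / x.std()
extreme = np.abs(z) >= 3
print(f"{extreme.sum()} of {x.size} values lie 3+ standard deviations out")
print(x[extreme])   # unusual, but still a normal part of the distribution
```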
Example of Natural Variation Causing an Outlier
For example, I fit a model that uses historical U.S. Presidential approval ratings to predict how later historians would ultimately rank each President. It turns out a President's lowest approval rating predicts the historian ranks. However, one data point severely affects the model. President Truman doesn't fit the model. He had an abysmal lowest approval rating of 22%, but later historians gave him a relatively good rank of #6. If I remove that single observation, the R-squared increases by over 30 percentage points!

However, there was no justifiable reason to remove that point. While it was an oddball, it accurately reflects the potential surprises and uncertainty inherent in the political system. If I remove it, the model makes the process appear more predictable than it actually is. Even though this unusual observation is influential, I left it in the model. It's bad practice to remove data points simply to produce a better fitting model or statistically significant results.

If the extreme value is a legitimate observation that is a natural part of the population you're studying, you should leave it in the dataset.

Guidelines for Dealing with Outliers
Sometimes it's best to keep outliers in your data. They can capture valuable information that is part of your study area. Retaining these points can be hard, particularly when it reduces statistical significance! However, excluding extreme values solely due to their extremeness can distort the results by removing information about the variability inherent in the study area. You're forcing the subject area to appear less variable than it is in reality.

When considering whether to remove an outlier, you'll need to evaluate whether it appropriately reflects your target population, subject area, research question, and research methodology. Did anything unusual happen while measuring these observations, such as power failures, abnormal experimental conditions, or anything else out of the norm? Is there anything substantially different about an observation, whether it's a person, item, or transaction? Did measurement or data entry errors occur?

If the outlier in question is:
- A measurement error or data entry error: correct the error if possible. If you can't fix it, remove that observation because you know it's incorrect.
- Not a part of the population you are studying (i.e., unusual properties or conditions): you can legitimately remove the outlier.
- A natural part of the population you are studying: you should not remove it.

When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences (a sketch of this comparison follows). Comparing results in this manner is particularly useful when you're unsure about removing an outlier and when there is substantial disagreement within a group over this question.
accurately reflects the potential surprises and
uncertainty inherent in the political system. If I Lesson 6: HETEROSCEDASTICITY NATURE
remove it, the model makes the process appear AND CONSEQUENCES
more predictable than it actually is. Even though
this unusual observation is influential, I left it in A critical assumption of the classical linear
the model. It’s bad practice to remove data points regression model is that the disturbances ui have
all the same variance, σ 2. When this condition
holds, the error terms are homoscedastic, which Suppose 100 students enroll in a typing class—
means the errors have the same scatter some of which have typing experience and some
regardless of the value of X. of which do not. After the first class there would
be a great deal of dispersion in the number of
When the scatter of the errors is different, varying typing mistakes. After the final class the
depending on the value of one or more of the dispersion would be smaller. The error variance
independent variables, the error terms are
is non constant—it falls as time increases..
heteroscedastic.
HOMOSCEDASTIC PATTERN OF ERRORS
Regression Model:
Homoscedasticity:
Heteroscedasticity:
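As an illustration, here is a minimal sketch simulating heteroscedasticity of the typing-class kind (error spread that falls as time increases) and checking it with the Breusch-Pagan test, a standard diagnostic not covered in the lesson itself; the variable names and parameter values are hypothetical.

```python
# Hypothetical sketch: error variance shrinks as X (weeks of class) grows,
# and the Breusch-Pagan test detects the nonconstant variance.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(5)
weeks = np.linspace(1, 15, 200)                  # time in the course
sigma = 8.0 / weeks                              # error spread falls with time
mistakes = 30.0 - 1.5 * weeks + rng.normal(scale=sigma)

X = sm.add_constant(weeks)
fit = sm.OLS(mistakes, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4g}")
# A small p-value rejects homoscedasticity: Var(ui) is not constant here.
```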