Assumptions of Linear Regression
This "quick start" guide shows you how to carry out linear regression using SPSS
Statistics, as well as interpret and report the results from this test. However, before we
introduce you to this procedure, you need to understand the different assumptions that
your data must meet in order for linear regression to give you a valid result. We discuss
these assumptions next.
SPSS Statistics
Assumptions
When you choose to analyse your data using linear regression, part of the process
involves checking to make sure that the data you want to analyse can actually be
analysed using linear regression. You need to do this because it is only appropriate to
use linear regression if your data "passes" seven assumptions that are required for
linear regression to give you a valid result. In practice, checking for these seven
assumptions just adds a little bit more time to your analysis, requiring you to click a few
more buttons in SPSS Statistics when performing your analysis, as well as think a little
bit more about your data, but it is not a difficult task.
Before we introduce you to these seven assumptions, do not be surprised if, when
analysing your own data using SPSS Statistics, one or more of these assumptions is
violated (i.e., not met). This is not uncommon when working with real-world data rather
than textbook examples, which often only show you how to carry out linear regression
when everything goes well! However, don’t worry. Even when your data fails certain
assumptions, there is often a solution to overcome this. First, let's take a look at these
seven assumptions:
o Assumption #1: Your dependent variable should be measured at the continuous level
(i.e., it is either an interval or ratio variable). In our example below, the price paid for a
car is a continuous variable.
o Assumption #2: Your independent variable should also be measured at the
continuous level. In our example, income is a continuous variable.
o Assumption #3: There needs to be a linear relationship between the two variables,
which you can check by creating a scatterplot of the dependent variable against the
independent variable.
o Assumption #4: There should be no significant outliers, which you can check using
casewise diagnostics.
o Assumption #5: You should have independence of observations, which you can
check using the Durbin-Watson statistic.
o Assumption #6: Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move along the line. You can
check this by plotting the residuals against the predicted values.
Whilst such plots help to illustrate the differences between data that meets and data
that violates the assumption of homoscedasticity, real-world data can be a lot messier
and can show different patterns of heteroscedasticity. Therefore, in our enhanced linear
regression guide, we explain: (a) some of the things you will need to consider when
interpreting your data; and (b) possible ways to continue with your analysis if your data
fails to meet this assumption.
o Assumption #7: Finally, you need to check that the residuals (errors) of the regression
line are approximately normally distributed (we explain these terms in our enhanced
linear regression guide). Two common methods to check this assumption include using
either a histogram (with a superimposed normal curve) or a Normal P-P Plot. Again, in
our enhanced linear regression guide, we: (a) show you how to check this assumption
using SPSS Statistics, whether you use a histogram (with superimposed normal curve)
or Normal P-P Plot; (b) explain how to interpret these diagrams; and (c) provide a
possible solution if your data fails to meet this assumption.
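SPSS Statistics draws the histogram and Normal P-P Plot for you, but the idea behind a P-P plot is easy to sketch outside SPSS. The following Python snippet, using simulated residuals rather than data from this guide, computes the two quantities a Normal P-P Plot compares: the empirical cumulative probabilities of the standardized residuals and the corresponding normal CDF values. Points near the 45-degree line (i.e., the two vectors nearly agreeing) indicate approximately normal residuals.

```python
# Sketch of the logic behind a Normal P-P plot, in pure NumPy + math.
# The residuals below are simulated for illustration; they are not
# taken from this guide's dataset.
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 1.0, size=200)  # stand-in for regression residuals

# Standardize the residuals, then compare empirical cumulative
# probabilities with the normal CDF -- a Normal P-P plot draws
# exactly these two quantities against each other.
z = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))
n = len(z)
empirical = (np.arange(1, n + 1) - 0.5) / n          # observed cumulative proportion
theoretical = np.array([normal_cdf(v) for v in z])   # expected under normality

# If the residuals are roughly normal, the two vectors nearly agree.
max_gap = np.max(np.abs(empirical - theoretical))
print(f"largest P-P deviation: {max_gap:.3f}")
```

A small maximum deviation is consistent with normality; large, systematic gaps suggest the assumption is violated.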
You can check assumptions #3, #4, #5, #6 and #7 using SPSS Statistics. Assumption
#3 should be checked first, before moving on to assumptions #4, #5, #6 and #7. We
suggest testing the assumptions in this order because assumptions #3, #4, #5, #6 and
#7 require you to run the linear regression procedure in SPSS Statistics first, so it is
easier to deal with these after checking assumptions #1 and #2. Just remember that if
you do not run the statistical tests on these assumptions correctly, the results you get
when running a linear regression might not be valid. This is why we dedicate a number
of sections of our enhanced linear regression guide to help you get this right. You can
find out more about our enhanced content as a whole on our Features: Overview page,
or more specifically, learn how we help with testing assumptions on
our Features: Assumptions page.
Example
A salesperson for a large car brand wants to determine whether there is a relationship
between an individual's income and the price they pay for a car. As such, the
individual's "income" is the independent variable and the "price" they pay for a car is the
dependent variable. The salesperson wants to use this information to determine which
cars to offer potential customers in new areas where average income is known.
Setup in SPSS Statistics
In SPSS Statistics, we created two variables so that we could enter our
data: Income (the independent variable) and Price (the dependent variable). It can also
be useful to create a third variable, caseno, to act as a chronological case number. This
third variable is used to make it easy for you to eliminate cases (e.g., significant outliers)
that you have identified when checking for assumptions. However, we do not include it
in the SPSS Statistics procedure that follows because we assume that you have already
checked these assumptions. In our enhanced linear regression guide, we show you how
to correctly enter data in SPSS Statistics to run a linear regression when you are also
checking for assumptions. You can learn about our enhanced data setup content on
our Features: Data Setup page. Alternatively, see our generic, "quick start"
guide: Entering Data in SPSS Statistics.
Test Procedure in SPSS Statistics
The steps below show you how to analyse your data using linear regression in
SPSS Statistics when none of the seven assumptions in the previous
section, Assumptions, have been violated. At the end of these steps, we show you
how to interpret the results from your linear regression. If you are looking for help to
make sure your data meets assumptions #3, #4, #5, #6 and #7, which are required
when using linear regression and can be tested using SPSS Statistics, you can learn
more about our enhanced guides on our Features: Overview page.
Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28,
as well as the subscription version of SPSS Statistics, with version 28 and
the subscription version being the latest versions of SPSS Statistics. However,
in version 27 and the subscription version, SPSS Statistics introduced a new look
to its interface called "SPSS Light", replacing the previous look for version 26
and earlier, which was called "SPSS Standard". Therefore, if you have
SPSS Statistics version 27 or 28 (or the subscription version of SPSS Statistics),
the images that follow will be light grey rather than blue. However, the procedure
is identical.
1. Click Analyze > Regression > Linear... on the top menu, as
shown below:
Published with written permission from SPSS Statistics, IBM Corporation.
You will be presented with the Linear Regression dialogue box:
Output of Linear Regression Analysis
SPSS Statistics will generate quite a few tables of output for a linear regression. In this
section, we show you only the three main tables required to understand your results
from the linear regression procedure, assuming that no assumptions have been
violated. A complete explanation of the output you have to interpret when checking your
data for the seven assumptions required to carry out linear regression is provided in our
enhanced guide. This includes relevant scatterplots, a histogram (with superimposed
normal curve), a Normal P-P Plot, casewise diagnostics and the Durbin-Watson statistic.
Below, we focus on the results for the linear regression analysis only.
The first table of interest is the Model Summary table, as shown below:
This table provides the R and R² values. The R value represents the simple correlation
and is 0.873 (the "R" column), which indicates a high degree of correlation.
The R² value (the "R Square" column) indicates how much of the total variation in the
dependent variable, Price, can be explained by the independent variable, Income. In
this case, 76.2% can be explained, which is very large.
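To make the link between R and R² concrete: in simple (one-predictor) regression, squaring the Pearson correlation gives exactly the R Square value SPSS reports. Here is a short Python sketch using made-up income/price pairs, not the article's dataset, so the numbers will not match the 0.873 and 76.2% above.

```python
# Illustrative only: these income/price pairs are invented.
import numpy as np

income = np.array([15, 22, 28, 35, 41, 50, 58, 64], dtype=float)  # e.g. $000s
price = np.array([12, 15, 19, 22, 24, 30, 33, 38], dtype=float)   # e.g. $000s

r = np.corrcoef(income, price)[0, 1]   # the "R" column (simple correlation)
r_squared = r ** 2                     # the "R Square" column

print(f"R = {r:.3f}, R Square = {r_squared:.3f}")
```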
The next table is the ANOVA table, which reports how well the regression equation fits
the data (i.e., predicts the dependent variable) and is shown below:
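The ANOVA table's F test asks whether the regression model predicts the dependent variable better than simply using the mean. On invented data (not the values from this guide's output), the sums of squares in that table decompose as follows:

```python
# Invented data for illustration; the F value will not match the
# guide's ANOVA table.
import numpy as np

rng = np.random.default_rng(2)
income = np.linspace(10, 70, 20)
price = 5.0 + 0.5 * income + rng.normal(0, 3, size=income.size)

slope, intercept = np.polyfit(income, price, 1)
fitted = intercept + slope * income

# Sums of squares as they appear in SPSS's ANOVA table.
ss_total = np.sum((price - price.mean()) ** 2)   # "Total"
ss_resid = np.sum((price - fitted) ** 2)         # "Residual"
ss_regr = np.sum((fitted - price.mean()) ** 2)   # "Regression"

df_regr, df_resid = 1, income.size - 2
f_stat = (ss_regr / df_regr) / (ss_resid / df_resid)
print(f"F({df_regr}, {df_resid}) = {f_stat:.1f}")
```

A large F (with a small "Sig." value) indicates the regression equation fits the data significantly better than the mean alone.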
The Coefficients table provides us with the necessary information to predict price from
income, as well as determine whether income contributes statistically significantly to the
model (by looking at the "Sig." column). Furthermore, we can use the values in the "B"
column under the "Unstandardized Coefficients" heading to build the regression
equation and make predictions.
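For illustration, suppose the "B" column gave an intercept of 8.2 and an Income slope of 0.564. These coefficient values are hypothetical, not taken from the article's Coefficients table; they simply show how the regression equation is used for prediction.

```python
# Hypothetical unstandardized coefficients -- NOT the values from the
# article's Coefficients table.
b0 = 8.2    # intercept ("Constant" row, "B" column)
b1 = 0.564  # slope for Income ("B" column)

def predict_price(income):
    """Predicted price from the fitted line: price = b0 + b1 * income."""
    return b0 + b1 * income

print(predict_price(30.0))  # predicted price for an income of 30
```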
If you are unsure how to interpret regression equations or how to use them to make
predictions, we discuss this in our enhanced linear regression guide. We also show you
how to write up the results from your assumptions tests and linear regression output if
you need to report this in a dissertation/thesis, assignment or research report. We do
this using the Harvard and APA styles. You can learn more about our enhanced content
on our Features: Overview page.
We also have a "quick start" guide on how to perform a linear regression analysis in
Stata.
Wakjira Fekadu:
In linear regression, how can we check for outliers, and what is the solution if we get outliers?
ChatGPTBot | Midjourney:
In linear regression, outliers are data points that deviate significantly from the overall pattern of the
data. They can have a strong influence on the estimated regression line and may affect the accuracy and
reliability of the linear regression model. Common ways to check for outliers include:
1. Residual analysis: Residuals are the differences between the observed values and the predicted values
from the regression model. Plotting the residuals against the predicted values (residual plot) can help
identify outliers. Outliers would appear as data points that have large positive or negative residuals.
2. Studentized residuals: Studentized residuals take into account the standard errors of the residuals. By
examining the absolute values of the studentized residuals, you can identify outliers. Standardized
cutoffs, such as ±3, are often used as thresholds to identify extreme observations.
3. Cook's distance: Cook's distance is a measure of the influence that each observation has on the
regression coefficients. Large values of Cook's distance indicate influential observations that can greatly
affect the regression line. Observations with Cook's distance values greater than a certain threshold (e.g.,
4/n, where n is the sample size) are often considered as potential outliers.
4. Leverage: Leverage measures how much an observation's predictor values differ from those of the
other observations. Observations with high leverage have the ability to exert a strong influence on the
fitted regression line. Plotting the leverage values can help identify influential observations.
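The four diagnostics above can all be computed directly from the design matrix. Here is a self-contained NumPy sketch on simulated data (the dataset, including the planted outlier, is invented for illustration); it computes raw residuals, internally studentized residuals, leverage, and Cook's distance for a simple linear regression:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=x.size)
y[10] += 12.0  # plant one obvious outlier for the diagnostics to find

# Design matrix with an intercept column; p = number of parameters.
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# OLS fit via least squares.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted                                   # 1. raw residuals

# Hat (projection) matrix; its diagonal is each point's leverage.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)                                # 4. leverage

mse = resid @ resid / (n - p)                        # residual variance estimate
student = resid / np.sqrt(mse * (1 - leverage))      # 2. studentized (internal)

# 3. Cook's distance combines residual size and leverage.
cooks = (student ** 2) * leverage / (p * (1 - leverage))

# Flag points beyond the common cutoffs (|r| > 3 or D > 4/n).
flagged = np.where((np.abs(student) > 3) | (cooks > 4 / n))[0]
print("flagged observations:", flagged)
```

On this simulated dataset, the planted observation is the one that exceeds both cutoffs; on real data you would investigate each flagged case rather than delete it automatically.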
If outliers are detected in a linear regression analysis, there are several approaches that can be taken:
1. Investigating data errors: It's important to first verify whether the outliers are due to data entry errors,
measurement errors, or any other data issues. Review the data to ensure accuracy and correctness. If
any errors are found, the data should be corrected before proceeding with the analysis.
2. Robust regression: Robust regression methods, such as Huber regression or other M-estimators, are
less sensitive to outliers and can provide more accurate parameter estimates even in the presence of
outliers. These methods downweight the influence of outliers, allowing for a more robust estimation of
the regression coefficients.
3. Data transformation: Transforming the data using mathematical functions, such as logarithmic or
power transformations, can sometimes reduce the impact of outliers. However, the choice of
transformation should be based on a careful understanding of the data and the underlying relationships.
4. Removing outliers: If outliers are known to be caused by data errors or measurement issues, or if they
are influential observations that cannot be handled through other means, removing outliers from the
dataset may be considered. However, it is important to exercise caution and use domain knowledge to
justify the removal of outliers, as it can significantly affect the results and interpretation of the analysis.
5. Robust statistical inference: Instead of removing outliers, another approach is to use robust statistical
inference techniques that can handle outliers. These methods provide parameter estimates and
hypothesis tests that are less affected by outliers.
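As a concrete illustration of option 2, here is a minimal iteratively reweighted least squares (IRLS) implementation of Huber regression in NumPy, run on simulated data with one planted gross outlier. All values are invented for illustration, and production code would normally use a tested library routine rather than this hand-rolled sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, size=x.size)
y[5] += 30.0  # one gross outlier

X = np.column_stack([np.ones_like(x), x])

def huber_fit(X, y, k=1.345, iters=50):
    """Huber M-estimation via iteratively reweighted least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # start from OLS
    for _ in range(iters):
        resid = y - X @ beta
        # Robust scale estimate (MAD, scaled for consistency at the normal).
        s = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = resid / s
        # Huber weights: 1 inside the threshold, downweighted outside.
        w = np.where(np.abs(u) <= k, 1.0, k / np.abs(u))
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta

ols_beta, *_ = np.linalg.lstsq(X, y, rcond=None)
huber_beta = huber_fit(X, y)
print("OLS slope:  ", round(ols_beta[1], 3))
print("Huber slope:", round(huber_beta[1], 3))
```

The Huber fit stays close to the true slope of 1.5 because the outlier's weight shrinks, whereas the OLS slope is visibly pulled away by the single bad point.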
It's worth noting that the appropriate approach for handling outliers may vary depending on the specific
context, the nature of the data, and the goals of the analysis. Therefore, it is important to carefully
assess the situation and choose the most appropriate strategy accordingly.
I hope this helps! If you have any further questions, feel free to ask.
Wakjira Fekadu:
What is the effect of outliers in simple linear regression when the other assumptions are met?
ChatGPTBot | Midjourney:
In simple linear regression, outliers can have a significant impact on the estimated regression line, also
known as the least squares line. This line represents the best-fit straight line that minimizes the sum of
squared residuals (the differences between the observed values and the predicted values from the
regression line).
When all other assumptions of linear regression are met (such as linearity, independence,
homoscedasticity, and normality of residuals), outliers can still influence the results in the following
ways:
1. Slope and Intercept: Outliers can affect the slope and intercept of the regression line. If an outlier has
a large residual (vertical distance from the line), it can shift the line upwards or downwards, thus altering
the relationship between the predictor variable (independent variable) and the response variable
(dependent variable). This can lead to biased estimates of the slope and intercept.
2. Statistical Significance: Outliers can also have a significant impact on the statistical significance of the
regression coefficients. In linear regression, hypothesis tests are commonly used to assess the
significance of the slope and intercept. Outliers can increase the variability (standard errors) of the
regression coefficients, making them less statistically significant. In extreme cases, an outlier can even
change the sign or significance of a coefficient.
3. Predictive Accuracy: Outliers can disrupt the accuracy of predictions made by the regression model.
Since the regression line is based on minimizing the sum of squared residuals, an outlier with a large
residual can have an exaggerated influence on the predicted values. Predictions for new observations
located near or influenced by outliers may be highly unreliable.
It's important to note that the impact of outliers on the regression line depends on factors such as the
magnitude of the outlier, the sample size, and the proximity of other data points. A single outlier may
not have a significant effect if the remaining data follows a clear pattern, while multiple outliers or
influential observations can lead to substantial changes in the estimated relationship between the
variables.
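The slope and intercept effect described above is easy to demonstrate: fit the same simulated data twice, with and without a single extreme high-leverage point (all numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 25)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)  # true slope = 2.0

def ols_slope_intercept(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

clean_slope, clean_intercept = ols_slope_intercept(x, y)

# Add one extreme point at the edge of the x-range (high leverage),
# far below where the trend line predicts.
x_out = np.append(x, 10.5)
y_out = np.append(y, 0.0)
out_slope, out_intercept = ols_slope_intercept(x_out, y_out)

print(f"slope without outlier: {clean_slope:.3f}")
print(f"slope with outlier:    {out_slope:.3f}")
```

A single point is enough to drag the estimated slope well away from the true value, precisely because least squares minimizes squared residuals and so gives extreme points outsized influence.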
Detecting and addressing outliers, as discussed in the previous response, is crucial in minimizing their
impact on the regression analysis. However, it's important to exercise caution and appropriately consider
the context and nature of the data before taking any action.
Remember that linear regression is not robust to extreme outliers, and outlier detection and treatment
should be approached in a rigorous and thoughtful manner, taking into account domain knowledge and
the goals of the analysis.
Wakjira Fekadu:
Is a standardized residual value of 3.75 statistically significant enough to say there are outliers?
ChatGPTBot | Midjourney:
When assessing outliers using standardized residuals, a common threshold used to identify potential
outliers is a value of ±3. However, it's important to note that this threshold is not a definitive rule and
should be considered as a guideline rather than a strict cutoff.
A standardized residual value of 3.75 indicates that the residual for that data point is 3.75
standard deviations away from its expected value, which exceeds the common ±3 guideline. This
suggests that the data point may be an
outlier. However, statistical significance is not solely determined by the magnitude of the standardized
residual. Further investigation and consideration of the context and characteristics of the data are
necessary.
Moreover, it's essential to exercise caution when interpreting outliers based solely on standardized
residual values. Outliers can have substantive implications for the analysis and interpretation of results,
but their handling should be done judiciously, taking into account the specific research context, data
quality, and domain knowledge.
In summary, a standardized residual value of 3.75 may be an indication of a potential outlier, but it
should be further investigated and evaluated in conjunction with other relevant factors to make a robust
determination about its significance and impact on the regression analysis.
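To make the ±3 guideline concrete, here is a short NumPy sketch that standardizes a set of invented residuals and flags any beyond the threshold; a value such as 3.75 would be flagged for further investigation, not deleted automatically:

```python
import numpy as np

rng = np.random.default_rng(3)
# Invented residuals from some fitted regression, with one extreme value.
residuals = np.append(rng.normal(0.0, 1.0, size=29), 8.0)

# Standardize: centre and scale by the sample standard deviation.
z = (residuals - residuals.mean()) / residuals.std(ddof=1)

threshold = 3.0
flagged = np.where(np.abs(z) > threshold)[0]
print("indices beyond ±3:", flagged)
```

One caveat worth knowing: in a sample of n observations, a residual standardized this way can never exceed (n-1)/√n, so in very small samples the ±3 cutoff can mathematically never fire; the guideline is meant for moderate-to-large samples.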