Assumptions of Linear Regression


Linear regression is one of the simplest and most widely used machine learning algorithms in predictive analysis. It predicts a continuous target variable from one or more predictor variables. While linear regression is powerful and interpretable, its validity relies heavily on certain assumptions about the data.

In this article, we will explore the key assumptions of linear regression, discuss their significance, and provide insights into how violations of these assumptions can affect the model's performance.


Key Assumptions of Linear Regression

1. Linearity

Assumption: The relationship between the independent and dependent variables is linear.

The first and foremost assumption of linear regression is that the relationship between the predictor(s) and the response variable is linear. This means that a one-unit change in an independent variable produces a constant change in the dependent variable, regardless of where that change occurs. Linearity can be visually assessed using scatter plots or residual plots.

If the relationship is not linear, the model may underfit the data, leading to inaccurate predictions. In such cases, transformations of the data or the use of non-linear regression models may be more appropriate.

Example:

Consider a dataset where the relationship between temperature and ice cream sales is being studied. If sales increase non-linearly with temperature (e.g., disproportionately more sales at high temperatures), a linear model may not capture this effect well. The figure below contrasts a linear and a non-linear scenario.

  • Linear Relationship: This is where the increase in temperature results in a consistent increase in ice cream sales.
  • Non-Linear Relationship: In this case, the increase in temperature leads to a more significant increase in ice cream sales at higher temperatures, indicating a non-linear relationship.
Figure: Linearity
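
One quick way to probe this assumption is to fit an ordinary least squares model and plot the residuals against the fitted values: a curved pattern suggests the linear form is missing something. The snippet below is a minimal sketch using simulated temperature and sales data (the quadratic term and noise level are illustrative assumptions, not values from this article).

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data: sales grow faster at higher temperatures (quadratic term),
# so a straight-line fit should leave a curved pattern in the residuals.
rng = np.random.default_rng(0)
temperature = rng.uniform(15, 40, 200)
sales = 2 * temperature + 0.15 * temperature**2 + rng.normal(0, 10, 200)

X = sm.add_constant(temperature)        # design matrix with an intercept
model = sm.OLS(sales, X).fit()

# Residuals vs. fitted values: a random cloud supports linearity,
# a curved pattern (as here) suggests a non-linear relationship.
plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted values")
plt.show()
```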

2. Homoscedasticity of Residuals in Linear Regression

Homoscedasticity is one of the key assumptions of linear regression, which asserts that the residuals (the differences between observed and predicted values) should have a constant variance across all levels of the independent variable(s). In simpler terms, it means that the spread of the errors should be relatively uniform, regardless of the value of the predictor.

When the residuals maintain constant variance, the model is said to be homoscedastic. Conversely, when the variance of the residuals changes with the level of the independent variable, we refer to this phenomenon as heteroscedasticity.

Heteroscedasticity can lead to several issues:

  • Inefficient Estimates: The estimates of the coefficients may not be the best linear unbiased estimators (BLUE), meaning that they could be less accurate than they should be.
  • Impact on Hypothesis Testing: Standard errors can become biased, leading to unreliable significance tests and confidence intervals.
Figure: Homoscedasticity of Residuals
  • Left plot (Homoscedasticity): The residuals are scattered evenly around the horizontal line at zero, indicating a constant variance.
  • Right plot (Heteroscedasticity): The residuals are not evenly scattered. There is a clear pattern of increasing variance as the predicted values increase, indicating heteroscedasticity.
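
To complement the visual check, a formal test such as Breusch-Pagan can be applied to the residuals. The snippet below is a minimal sketch on simulated data whose error spread is deliberately made to grow with the predictor (an assumption for illustration); a small p-value points to heteroscedasticity.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
# The error spread grows with x, i.e. the data are deliberately heteroscedastic.
y = 3 + 2 * x + rng.normal(0, 0.5 + 0.5 * x, 300)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# het_breuschpagan returns (LM statistic, LM p-value, F statistic, F p-value).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")  # small value -> heteroscedasticity
```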

3. Multivariate Normality - Normal Distribution

Multivariate normality is a key assumption for linear regression models when making statistical inferences. In practice, it means that the residuals (the differences between observed and predicted values) should follow a normal distribution, jointly across all predictors in the model. This assumption ensures that hypothesis tests, confidence intervals, and p-values are valid.

This assumption is crucial because it allows us to make valid inferences about the model's parameters and the relationship between the dependent and independent variables.

Figure: Multivariate Normality - Normal Distribution
  • The first row shows a normally distributed dataset, as evidenced by the bell-shaped histogram and the points falling close to a straight line in the Q-Q plot.
  • The second row shows a dataset that is too peaked in the middle, indicating a deviation from normality.
  • The third row shows a skewed dataset, also indicating a deviation from normality.

The image thus demonstrates how different residual distributions affect the assumption of multivariate normality.
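
A histogram and Q-Q plot of the residuals, like those in the figure, are straightforward to produce, and the Shapiro-Wilk test provides a numerical check. The sketch below uses simulated data with normally distributed errors (an illustrative assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 200)
y = 1 + 2 * x + rng.normal(0, 1, 200)        # normally distributed errors

model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(resid, bins=20)                  # roughly bell-shaped if normal
axes[0].set_title("Histogram of residuals")
stats.probplot(resid, dist="norm", plot=axes[1])  # points near the line if normal
plt.show()

stat, p_value = stats.shapiro(resid)          # H0: residuals are normally distributed
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
```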

4. Independence of Errors

Independence of errors is another critical assumption for linear regression models. It ensures that the residuals (the differences between the observed and predicted values) are not correlated with one another. This means that the error associated with one observation should not influence the error of any other observation. When errors are correlated, it can indicate that some underlying pattern or trend in the data has been overlooked by the model.

If the errors are correlated, it can lead to underestimated standard errors, resulting in overconfident predictions and misleading significance tests. Violation of this assumption is most common in time series data, where the error at one point in time may influence errors at subsequent time points. Such patterns suggest the presence of autocorrelation.

Figure: Independence of Errors

In the image above:

  • The Residuals vs. Time plot shows a random scatter of points, suggesting no clear pattern or correlation over time.
  • The ACF of Residuals plot shows a few spikes at low lags, but they are not significant enough to indicate strong autocorrelation.
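
For time-ordered data, plotting the residuals' autocorrelation function (as in the figure above) is a natural first check, and the Durbin-Watson statistic adds a numeric summary. The sketch below simulates AR(1) errors purely for illustration, so the autocorrelation it reports is expected:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 200
t = np.arange(n)

# AR(1) errors: each error depends on the previous one, so they are autocorrelated.
errors = np.zeros(n)
for i in range(1, n):
    errors[i] = 0.7 * errors[i - 1] + rng.normal(0, 1)

y = 5 + 0.3 * t + errors
model = sm.OLS(y, sm.add_constant(t)).fit()

plot_acf(model.resid, lags=20)       # significant spikes indicate autocorrelation
plt.show()

dw = durbin_watson(model.resid)      # ~2 means no first-order autocorrelation
print(f"Durbin-Watson statistic: {dw:.2f}")   # well below 2 here
```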

5. Lack of Multicollinearity

Assumption: The independent variables are not highly correlated with each other.

Multicollinearity occurs when two or more independent variables in the model are highly correlated, leading to redundancy in the information they provide. This can inflate the standard errors of the coefficients, making it difficult to determine the effect of each independent variable.

When multicollinearity is present, it becomes challenging to interpret the coefficients of the regression model accurately. It can also lead to overfitting, where the model performs well on training data but poorly on unseen data. Highly correlated features can be identified using scatter plots or a correlation heatmap.

Example: In a model predicting health outcomes based on multiple health metrics, if both blood pressure and heart rate are included as predictors, their high correlation may lead to multicollinearity.
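
The variance inflation factor (VIF) quantifies how much the variance of each coefficient is inflated by correlation among the predictors. The snippet below is a minimal sketch with simulated health metrics (the column names and data-generating process are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 300
blood_pressure = rng.normal(120, 15, n)
# Heart rate is constructed to be strongly correlated with blood pressure.
heart_rate = 0.6 * blood_pressure + rng.normal(0, 5, n)
age = rng.normal(45, 12, n)

X = pd.DataFrame({
    "blood_pressure": blood_pressure,
    "heart_rate": heart_rate,
    "age": age,
})
X = sm.add_constant(X)

# A VIF above 5 (or 10, depending on the threshold used) flags a collinear predictor.
for i, col in enumerate(X.columns):
    if col != "const":
        print(f"VIF({col}) = {variance_inflation_factor(X.values, i):.2f}")
```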

6. Absence of Endogeneity

The assumption of no endogeneity is important in multiple linear regression: the independent variables in the regression model should not be correlated with the error term. If this assumption is violated, the estimates of the regression coefficients become biased and inconsistent.

  • Bias and Consistency: When endogeneity is present, the estimates of the regression coefficients are biased, meaning they do not accurately reflect the true relationships between the variables. Additionally, the estimates become inconsistent, which means they do not converge to the true parameter values as the sample size increases.
  • Valid Inference: The assumption of no endogeneity is critical for conducting valid hypothesis tests and creating reliable confidence intervals. If endogeneity exists, the statistical tests based on these estimates may lead to incorrect conclusions.
Figure: Absence of Endogeneity
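
A common source of endogeneity is an omitted variable that drives both a predictor and the outcome. The simulation below is a sketch under assumed parameters: leaving the confounder out of the model biases the estimated coefficient on x away from its true value of 2, while including it restores an approximately unbiased estimate.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5000

confounder = rng.normal(0, 1, n)             # unobserved variable
x = confounder + rng.normal(0, 1, n)         # predictor depends on the confounder
y = 2.0 * x + 3.0 * confounder + rng.normal(0, 1, n)   # true coefficient on x is 2

# Omitting the confounder pushes it into the error term, which is then
# correlated with x, producing endogeneity and a biased coefficient estimate.
biased = sm.OLS(y, sm.add_constant(x)).fit()
print(f"Estimate without confounder: {biased.params[1]:.2f}")   # noticeably above 2

# Including the confounder restores an (approximately) unbiased estimate.
full = sm.OLS(y, sm.add_constant(np.column_stack([x, confounder]))).fit()
print(f"Estimate with confounder:    {full.params[1]:.2f}")     # close to 2
```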

Detecting Violations of Assumptions

It is crucial to assess whether the assumptions of linear regression are met before fitting a model. Here are some common techniques to detect violations:

  1. Residual Plots: Plotting the residuals against the fitted values or independent variables can help visualize linearity, homoscedasticity, and independence of errors. Ideally, the residuals should show no pattern, indicating a linear relationship and constant variance.
  2. Q-Q Plots: A Quantile-Quantile plot can be used to assess the normality of residuals. If the points follow a straight line in a Q-Q plot, the residuals are approximately normally distributed.
  3. Variance Inflation Factor (VIF): To check for multicollinearity, calculate the VIF for each independent variable. A VIF value greater than 5 or 10 indicates significant multicollinearity.
  4. Durbin-Watson Test: This statistical test helps detect the presence of autocorrelation in the residuals. A value close to 2 indicates no autocorrelation, while values significantly less than or greater than 2 indicate the presence of positive or negative autocorrelation, respectively.
  5. Statistical Tests: Perform statistical tests like the Breusch-Pagan test for homoscedasticity and the Shapiro-Wilk test for normality.
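
These checks can be bundled into a single diagnostic pass over a fitted statsmodels OLS model. The helper below is a sketch: the function name and output format are our own, not a standard API.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

def run_diagnostics(model, X):
    """Print standard assumption checks for a fitted statsmodels OLS result.

    `model` is the fitted results object and `X` is the design matrix
    (including the constant) that was used to fit it.
    """
    resid = model.resid
    _, bp_pvalue, _, _ = het_breuschpagan(resid, X)
    _, sw_pvalue = stats.shapiro(resid)
    dw = durbin_watson(resid)
    print(f"Breusch-Pagan p-value (homoscedasticity): {bp_pvalue:.4f}")
    print(f"Shapiro-Wilk p-value (normality):         {sw_pvalue:.4f}")
    print(f"Durbin-Watson statistic (independence):   {dw:.2f}")

# Example: a well-behaved simulated dataset should pass all three checks.
rng = np.random.default_rng(6)
x = rng.normal(0, 1, 200)
y = 1 + 2 * x + rng.normal(0, 1, 200)
X = sm.add_constant(x)
run_diagnostics(sm.OLS(y, X).fit(), X)
```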

Addressing Violations of Assumptions

If any of the assumptions are violated, there are various strategies to mitigate the issue:

  • Transformations: Apply transformations (e.g., logarithmic, square root) to the dependent variable to address non-linearity and heteroscedasticity.
  • Adding Variables: If autocorrelation or omitted variable bias is suspected, consider adding relevant predictors to the model.
  • Regularization Techniques: Techniques like Ridge or Lasso regression can help handle multicollinearity and improve model performance.
  • Robust Regression: Consider using robust regression methods, such as quantile regression or Huber regression, that are less sensitive to violations of assumptions.
  • Generalized Least Squares (GLS): This approach can be used when the residuals are heteroscedastic or correlated.
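
As one example of these remedies, Ridge regression adds an L2 penalty that keeps the coefficients of highly correlated predictors from blowing up. The snippet below is a minimal sketch with scikit-learn on simulated, deliberately collinear data (the data-generating process and alpha value are illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)      # nearly a copy of x1 -> strong collinearity
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + rng.normal(0, 1, n)

# Plain OLS coefficients can swing wildly when predictors are this collinear.
ols = LinearRegression().fit(X, y)
print("OLS coefficients:  ", ols.coef_)

# Ridge adds an L2 penalty (controlled by alpha) that shrinks and stabilises them.
ridge = Ridge(alpha=1.0).fit(X, y)
print("Ridge coefficients:", ridge.coef_)
```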

Conclusion

Linear regression is a powerful tool for modeling relationships between variables, but its effectiveness is contingent upon the validity of several key assumptions. Understanding and checking these assumptions (linearity, homoscedasticity, normality of residuals, independence of errors, absence of multicollinearity, and absence of endogeneity) is critical for ensuring reliable results.

