Assumptions of Linear Regression
Linear regression is one of the simplest and most widely used algorithms in predictive analysis. It predicts a continuous target variable from one or more predictor variables. While linear regression is powerful and interpretable, its validity relies heavily on certain assumptions about the data.
In this article, we will explore the key assumptions of linear regression, discuss their significance, and provide insights into how violations of these assumptions can affect the model's performance.
Key Assumptions of Linear Regression
1. Linearity
Assumption: The relationship between the independent and dependent variables is linear.
The first and foremost assumption of linear regression is that the relationship between the predictor(s) and the response variable is linear. This means that a change in the independent variable results in a proportional change in the dependent variable. This can be visually assessed using scatter plots or residual plots.
If the relationship is not linear, the model may underfit the data, leading to inaccurate predictions. In such cases, transformations of the data or the use of non-linear regression models may be more appropriate.
Example:
Consider a dataset where the relationship between temperature and ice cream sales is being studied. If sales increase non-linearly with temperature (for example, far more sales at very high temperatures), a straight-line model may not capture this effect well. Both scenarios are described below, followed by a small code sketch.
- Linear Relationship: This is where the increase in temperature results in a consistent increase in ice cream sales.
- Non-Linear Relationship: In this case, the increase in temperature leads to a more significant increase in ice cream sales at higher temperatures, indicating a non-linear relationship.
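A minimal sketch of this comparison, using made-up temperature and sales numbers (the generating equations and coefficients are illustrative assumptions, not data from the article):

```python
# Illustrative sketch: straight-line fits to simulated linear vs.
# non-linear temperature / ice-cream-sales data (all numbers made up).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
temp = np.linspace(15, 40, 100).reshape(-1, 1)

# Linear case: sales grow proportionally with temperature
sales_linear = 20 * temp.ravel() + rng.normal(0, 30, 100)
# Non-linear case: sales grow much faster at high temperatures
sales_nonlinear = np.exp(0.2 * temp.ravel()) + rng.normal(0, 30, 100)

for name, sales in [("linear", sales_linear), ("non-linear", sales_nonlinear)]:
    model = LinearRegression().fit(temp, sales)
    r2 = model.score(temp, sales)
    print(f"{name:>10} data: R^2 of straight-line fit = {r2:.3f}")
# A lower R^2 and a curved pattern in the residuals on the non-linear data
# suggest that a plain linear model underfits it.
```

Plotting the residuals of each fit against temperature makes the difference even clearer: the non-linear case leaves a systematic curve in the residuals rather than a random scatter.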
2. Homoscedasticity of Residuals in Linear Regression
Homoscedasticity is one of the key assumptions of linear regression, which asserts that the residuals (the differences between observed and predicted values) should have a constant variance across all levels of the independent variable(s). In simpler terms, it means that the spread of the errors should be relatively uniform, regardless of the value of the predictor.
When the residuals maintain constant variance, the model is said to be homoscedastic. Conversely, when the variance of the residuals changes with the level of the independent variable, we refer to this phenomenon as heteroscedasticity.
Heteroscedasticity can lead to several issues:
- Inefficient Estimates: The coefficient estimates are no longer the best linear unbiased estimators (BLUE); they remain unbiased, but their variance is larger than necessary, so they are less precise than they could be.
- Impact on Hypothesis Testing: Standard errors can become biased, leading to unreliable significance tests and confidence intervals.
In a side-by-side comparison of residual plots:
- Left plot (Homoscedasticity): The residuals are scattered evenly around the horizontal line at zero, indicating a constant variance.
- Right plot (Heteroscedasticity): The residuals are not evenly scattered. There is a clear pattern of increasing variance as the predicted values increase, indicating heteroscedasticity.
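A minimal sketch of this check, comparing residual spread for simulated homoscedastic and heteroscedastic data (the generating equations and the split at x = 5 are arbitrary illustrative choices):

```python
# Illustrative sketch: comparing residual spread for homoscedastic vs.
# heteroscedastic data (simulated values, not from the article).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)

y_homo = 3 * x + rng.normal(0, 1.0, x.size)               # constant error variance
y_hetero = 3 * x + rng.normal(0, 0.3 * x + 0.1, x.size)   # variance grows with x

for name, y in [("homoscedastic", y_homo), ("heteroscedastic", y_hetero)]:
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    low, high = resid[x < 5], resid[x >= 5]
    print(f"{name:>16}: std(resid | x<5) = {low.std():.2f}, "
          f"std(resid | x>=5) = {high.std():.2f}")
# Roughly equal spreads suggest homoscedasticity; a large gap suggests
# heteroscedasticity. A formal Breusch-Pagan test appears in the
# diagnostics section below.
```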
3. Multivariate Normality - Normal Distribution
Multivariate normality is a key assumption for linear regression models when making statistical inferences. Specifically, it means that the residuals (the differences between observed and predicted values) should follow a normal distribution when considering multiple predictors together. This assumption ensures that hypothesis tests, confidence intervals, and p-values are valid.
This assumption is crucial because it allows us to make valid inferences about the model's parameters and the relationship between the dependent and independent variables.
In a typical comparison of residual histograms and Q-Q plots:
- The first row shows a normally distributed dataset, as evidenced by the bell-shaped histogram and the points falling close to a straight line in the Q-Q plot.
- The second row shows a dataset that is too peaked in the middle, indicating a deviation from normality.
- The third row shows a skewed dataset, also indicating a deviation from normality.
Together, these cases show how different residual distributions can violate the normality assumption.
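A minimal sketch of a normality check on simulated residuals, using the Shapiro-Wilk test from SciPy (the data-generating equation is an illustrative assumption):

```python
# Illustrative sketch: testing whether regression residuals look normal
# (simulated data with normal errors by construction).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = 2 * x + 5 + rng.normal(0, 1, 300)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

stat, p_value = stats.shapiro(resid)
print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p_value:.3f}")
# A large p-value is consistent with normal residuals; stats.probplot(resid)
# gives the points needed to draw the corresponding Q-Q plot.
```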
4. Independence of Errors
Independence of errors is another critical assumption for linear regression models. It ensures that the residuals (the differences between the observed and predicted values) are not correlated with one another. This means that the error associated with one observation should not influence the error of any other observation. When errors are correlated, it can indicate that some underlying pattern or trend in the data has been overlooked by the model.
If the errors are correlated, it can lead to underestimated standard errors, resulting in overconfident predictions and misleading significance tests. Violation of this assumption is most common in time series data, where the error at one point in time may influence errors at subsequent time points. Such patterns suggest the presence of autocorrelation.
For residuals that satisfy this assumption, diagnostic plots typically look as follows:
- The Residuals vs. Time plot shows a random scatter of points, suggesting no clear pattern or correlation over time.
- The ACF of Residuals plot shows a few spikes at low lags, but they are not significant enough to indicate strong autocorrelation.
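A minimal sketch of an autocorrelation check with the Durbin-Watson statistic from statsmodels, on simulated independent versus AR(1)-style correlated errors (the 0.8 autocorrelation coefficient is an arbitrary illustrative choice):

```python
# Illustrative sketch: Durbin-Watson statistic for independent vs.
# autocorrelated errors (simulated time-series-style data).
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
t = np.arange(200)

e_indep = rng.normal(0, 1, t.size)        # independent errors
e_corr = np.zeros(t.size)                  # AR(1)-style correlated errors
for i in range(1, t.size):
    e_corr[i] = 0.8 * e_corr[i - 1] + rng.normal(0, 1)

for name, e in [("independent", e_indep), ("autocorrelated", e_corr)]:
    y = 0.5 * t + e
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    print(f"{name:>15} errors: Durbin-Watson = {durbin_watson(resid):.2f}")
# Values near 2 suggest independence; values well below 2 suggest
# positive autocorrelation.
```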
5. Lack of Multicollinearity
Assumption: The independent variables are not highly correlated with each other.
Multicollinearity occurs when two or more independent variables in the model are highly correlated, leading to redundancy in the information they provide. This can inflate the standard errors of the coefficients, making it difficult to determine the effect of each independent variable.
When multicollinearity is present, it becomes challenging to interpret the coefficients of the regression model accurately. It can also lead to overfitting, where the model performs well on training data but poorly on unseen data. Highly correlated features can be identified using scatter plots or a correlation heatmap.
Example: In a model predicting health outcomes based on multiple health metrics, if both blood pressure and heart rate are included as predictors, their high correlation may lead to multicollinearity.
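A minimal sketch of this check using variance inflation factors from statsmodels, on simulated health-metric-style predictors (the variable names and the strength of the blood-pressure/heart-rate correlation are illustrative assumptions):

```python
# Illustrative sketch: VIFs for simulated, deliberately correlated predictors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 500
blood_pressure = rng.normal(120, 10, n)
heart_rate = 0.6 * blood_pressure + rng.normal(0, 1.5, n)  # strongly tied to BP
age = rng.normal(45, 12, n)                                # largely independent

X = sm.add_constant(np.column_stack([blood_pressure, heart_rate, age]))
names = ["blood_pressure", "heart_rate", "age"]
for i, name in enumerate(names, start=1):   # skip the constant column
    print(f"{name:>15}: VIF = {variance_inflation_factor(X, i):.1f}")
# Large VIFs for blood_pressure and heart_rate (well above the 5-10
# rule of thumb) flag multicollinearity between them; age stays near 1.
```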
6. Absence of Endogeneity
No endogeneity is an important assumption in the context of multiple linear regression. The assumption of no endogeneity states that the independent variables in the regression model should not be correlated with the error term. If this assumption is violated, it leads to biased and inconsistent estimates of the regression coefficients.
- Bias and Consistency: When endogeneity is present, the estimates of the regression coefficients are biased, meaning they do not accurately reflect the true relationships between the variables. Additionally, the estimates become inconsistent, which means they do not converge to the true parameter values as the sample size increases.
- Valid Inference: The assumption of no endogeneity is critical for conducting valid hypothesis tests and creating reliable confidence intervals. If endogeneity exists, the statistical tests based on these estimates may lead to incorrect conclusions.
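A minimal sketch of what endogeneity does to ordinary least squares, using a simulation in which the regressor is deliberately correlated with the error term (the true slope of 2.0 and the 0.7 correlation strength are illustrative assumptions):

```python
# Illustrative sketch: simulating endogeneity. The regressor x is
# correlated with the error term u, so the OLS slope is biased away
# from the true value of 2.0 (all numbers are made up).
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

u = rng.normal(0, 1, n)             # error term
x = 0.7 * u + rng.normal(0, 1, n)   # x correlated with u -> endogenous
y = 2.0 * x + u                     # true slope is 2.0

slope, intercept = np.polyfit(x, y, 1)
print(f"estimated slope = {slope:.3f} (true slope = 2.0)")
# The estimate stays biased even with a very large sample, which is
# what 'biased and inconsistent' means in practice.
```

Even as the sample size grows, the estimated slope does not converge to the true value, which is exactly the inconsistency described above.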
Detecting Violations of Assumptions
It is crucial to assess whether the assumptions of linear regression are met before fitting a model. Here are some common techniques to detect violations:
- Residual Plots: Plotting the residuals against the fitted values or independent variables can help visualize linearity, homoscedasticity, and independence of errors. Ideally, the residuals should show no pattern, indicating a linear relationship and constant variance.
- Q-Q Plots: A Quantile-Quantile plot can be used to assess the normality of residuals. If the residuals fall roughly along the reference line in a Q-Q plot, they are approximately normally distributed.
- Variance Inflation Factor (VIF): To check for multicollinearity, calculate the VIF for each independent variable. A VIF value greater than 5 or 10 indicates significant multicollinearity.
- Durbin-Watson Test: This statistical test helps detect the presence of autocorrelation in the residuals. A value close to 2 indicates no autocorrelation, while values significantly less than or greater than 2 indicate the presence of positive or negative autocorrelation, respectively.
- Statistical Tests: Perform statistical tests like the Breusch-Pagan test for homoscedasticity and the Shapiro-Wilk test for normality.
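A minimal sketch that runs several of these diagnostics on a single fitted statsmodels OLS model (the simulated data are constructed to satisfy the assumptions, so all three checks should come back unremarkable):

```python
# Illustrative sketch: common diagnostics on one fitted OLS model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 300)
y = 1.5 * x + 4 + rng.normal(0, 1, 300)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
resid = model.resid

bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, X)
sw_stat, sw_pvalue = stats.shapiro(resid)
dw = durbin_watson(resid)

print(f"Breusch-Pagan p-value : {bp_pvalue:.3f}  (homoscedasticity)")
print(f"Shapiro-Wilk p-value  : {sw_pvalue:.3f}  (normality of residuals)")
print(f"Durbin-Watson         : {dw:.2f}   (~2 means no autocorrelation)")
```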
Addressing Violations of Assumptions
If any of the assumptions are violated, there are various strategies to mitigate the issue:
- Transformations: Apply transformations (e.g., logarithmic, square root) to the dependent variable to address non-linearity and heteroscedasticity.
- Adding Variables: If autocorrelation or omitted variable bias is suspected, consider adding relevant predictors to the model.
- Regularization Techniques: Techniques like Ridge or Lasso regression can help handle multicollinearity and improve model performance.
- Robust Regression: Consider using robust regression methods, such as quantile regression or Huber regression, that are less sensitive to violations of assumptions.
- Generalized Least Squares (GLS): This approach can be used when the residuals are heteroscedastic or correlated.
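A minimal sketch of two of these remedies, a log transform of the target and Ridge regression for correlated predictors (the data-generating equations and the alpha value are illustrative assumptions):

```python
# Illustrative sketch: log-transforming a multiplicative-error target and
# using Ridge regression when predictors are nearly collinear.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.1, n)                 # nearly collinear with x1
X = np.column_stack([x1, x2])
y = np.exp(1.0 * x1 + 0.5 * x2 + rng.normal(0, 0.3, n))  # multiplicative errors

ols_log = LinearRegression().fit(X, np.log(y))  # log transform stabilises variance
ridge = Ridge(alpha=1.0).fit(X, np.log(y))      # shrinkage tames collinearity

print("OLS coefficients on log(y):  ", np.round(ols_log.coef_, 2))
print("Ridge coefficients on log(y):", np.round(ridge.coef_, 2))
```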
Conclusion
Linear regression is a powerful tool for modeling relationships between variables, but its effectiveness depends on the validity of several key assumptions. Understanding and checking these assumptions (linearity, homoscedasticity, normality of residuals, independence of errors, absence of multicollinearity, and absence of endogeneity) is critical for ensuring reliable results.