Linear regression
Simple linear regression is a technique that models the relationship between one dependent variable and one independent variable.
random_state is a parameter of train_test_split that seeds the random number generator used to shuffle the data before splitting. Fixing it to a constant value ensures the same shuffle, and therefore the same train/test split, each time the code is run.
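A minimal sketch of this behaviour, using a toy array (the data and split size here are illustrative, not from the notes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)  # toy feature matrix (20 rows, 1 column)
y = np.arange(20)                 # toy target

# Same random_state -> identical splits on every run
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.25, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.25, random_state=42)

print((X_te1 == X_te2).all())  # True: the split is reproducible
```

With test_size=0.25, five of the twenty rows land in the test set, and both calls return exactly the same rows.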
The Ordinary Least Squares (OLS) algorithm is a method for estimating the parameters of a linear regression model. It finds the values of the model's parameters (i.e., the coefficients) that minimize the sum of the squared residuals. In statsmodels' OLS, the target y_train is passed as the first argument and the design matrix X_train as the second.
An R-squared value shows how much of the variance in the dependent variable the model explains. R-squared values typically range from 0 to 1 (on held-out data the score can even be negative if the model is worse than predicting the mean). An R-squared of 0 means the model explains none of the relationship between the dependent and independent variables.
The Root Mean Squared Error (RMSE) is another key performance indicator for a regression model. It measures the typical difference between the values predicted by the model and the actual values, expressed in the units of the target variable, and so gives an estimate of how accurately the model predicts the target.
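Both metrics can be computed with sklearn; a minimal sketch on made-up predictions (the numbers below are illustrative, not the MBA salary results):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_test = np.array([3.0, 5.0, 7.0, 9.0])  # toy actual values
y_pred = np.array([2.8, 5.1, 7.3, 8.9])  # toy predictions

r2 = r2_score(y_test, y_pred)                       # 1 - SSE/SST
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # sqrt of mean squared error
print(round(r2, 3), round(rmse, 3))
```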
R-squared = 0.16
RMSE = 73553.54
Residual Analysis:
In linear regression, a residual is the difference between the actual value and the value predicted by the model (y-ŷ) for any given point.
Normality of residuals refers to how closely the residuals follow a normal distribution. To check it, we use probplot from scipy.stats:

from scipy import stats
import matplotlib.pyplot as plt

model_residuals = model.resid
stats.probplot(model_residuals, plot=plt)
plt.show()
Standardizing the fitted values and the residuals, then plotting them against each other, gives a visual check of their distribution:

def standard(vals):
    return (vals - vals.mean()) / vals.std()

plt.scatter(standard(model.fittedvalues), standard(model.resid))
plt.title("Residual Plot: MBA Salary Prediction")
plt.xlabel("Standardized predicted values")
plt.ylabel("Standardized residuals")
plt.show()
Outlier Analysis:
Outliers are observations whose values show a large deviation from the mean value.
We can detect outliers using Z-Score, Mahalanobis Distance, Cook's Distance, Leverage Values and also IQR (Inter-Quartile range)
Z-score is the standardized distance of an observation from its mean value. For an observation x, the Z-score is given by
z = (x - μ)/σ
where μ is the mean and σ the standard deviation.
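The formula above translates directly into a small numpy helper (the data, threshold, and function name are assumptions for illustration; a common rule flags |z| > 3):

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    # z = (x - mean) / std; flag points more than `threshold` sds from the mean
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])
print(data[zscore_outliers(data, threshold=2.0)])  # [95.]
```

A single extreme value inflates the standard deviation itself, which is one reason robust alternatives such as the IQR method below are also used.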
Cook's distance measures how much the predicted values of the dependent variable change, across all observations in the sample, when a particular observation is excluded from the estimation of the regression parameters. A Cook's distance greater than 1 indicates a highly influential observation; such high-influence points are candidates for inspection and possible removal.
Leverage value of an observation measures the influence of that observation on the overall fit of the regression function and is related to the Mahalanobis distance. A leverage value greater than 3(k + 1)/n is treated as marking a highly influential observation, where k is the number of features in the model and n is the sample size.
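Leverage values are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ; a self-contained numpy sketch (the helper name, data, and seed are assumptions for illustration):

```python
import numpy as np

def leverage_values(X):
    # Diagonal of the hat matrix H = X (X'X)^-1 X'
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

rng = np.random.default_rng(2)
n, k = 40, 2                        # 40 observations, 2 features
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + features
h = leverage_values(X)

cutoff = 3 * (k + 1) / n            # the 3(k + 1)/n threshold from the text
flagged = np.where(h > cutoff)[0]   # indices of high-leverage observations
print(round(h.sum(), 6))            # leverages always sum to k + 1 parameters
```

A useful sanity check: the leverages sum to the number of fitted parameters (here k + 1 = 3), since that sum is the trace of the hat matrix.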
The Inter-Quartile Range (IQR) tells you the range of the middle half of your dataset. You can use the IQR to create “fences” around your data and then define outliers as any values that fall outside those fences.
def remove_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (data >= lower_bound) & (data <= upper_bound)

mba_salary_df = mba_salary_df[remove_outliers_iqr(mba_salary_df['Salary'])]
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_pred = model.predict(X_test)
print("R-squared for new model =", round(r2_score(y_test, y_pred), 2))
print("RMSE for new model =", round(np.sqrt(mean_squared_error(y_test, y_pred)), 2))

(Note: wrapping the R-squared in np.abs would silently hide a negative score, so the score is printed as-is.)
We can see that the R-squared of the new model has increased and the RMSE has decreased, indicating that the model performs better than before.