
Simple Linear Regression

Simple linear regression is a technique that models the relationship between one dependent variable and one independent variable.
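Mathematically, the model takes the form Y = β0 + β1·X + ε, where β0 is the intercept, β1 is the slope, and ε is a random error term; fitting the model means estimating β0 and β1 from the data.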

In [3]: # Importing necessary libraries and reading our CSV file
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

mba_salary_df = pd.read_csv('/content/sample_data/MBA Salary.csv')

The add_constant method from statsmodels adds a column of 1's to the features so that OLS can estimate an intercept term.
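As a minimal illustration of what add_constant does (the values and the Series name here are made up; 'const' is the column name statsmodels adds by default):

import pandas as pd
import statsmodels.api as sm

s = pd.Series([62.0, 76.3, 72.0], name='Percentage')
print(sm.add_constant(s))
#    const  Percentage
# 0    1.0        62.0
# 1    1.0        76.3
# 2    1.0        72.0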

random_state is a parameter of train_test_split that seeds the random number generator used to shuffle the data before splitting. Fixing it ensures the same randomization on every run, and therefore the same train/test split. In simple words, random_state controls the shuffling process.
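For instance, a quick sketch showing that a fixed random_state reproduces the same split (toy data, illustrative only):

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)
first, _ = train_test_split(data, train_size=0.8, random_state=100)
second, _ = train_test_split(data, train_size=0.8, random_state=100)
print(np.array_equal(first, second))  # True: identical shuffle and split on each run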

The Ordinary Least Squares (OLS) algorithm is a method for estimating the parameters of a linear regression model. It finds the values of the model's coefficients that minimize the sum of the squared residuals. In statsmodels, sm.OLS takes the dependent variable (y_train) as its first argument and the feature matrix (X_train) as its second.
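Under the hood, OLS has a closed-form solution: the coefficient vector is β = (XᵀX)⁻¹Xᵀy. A minimal NumPy sketch of that computation (illustrative only, not necessarily how statsmodels implements it internally):

import numpy as np

def ols_coefficients(X, y):
    # Solve the normal equations (X'X) beta = X'y for beta;
    # X must already contain the constant column for an intercept.
    return np.linalg.solve(X.T @ X, X.T @ y)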

In [4]: # Add constant 1 to the dataset


X = sm.add_constant(mba_salary_df['Percentge in Grade 10'])
y = mba_salary_df['Salary']

# Split dataset into train and test set


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 100)

# Fit the regression model


model = sm.OLS(y_train, X_train).fit()

print(f"Our SLR Model is: Y = {round(model.params[0],2)} + {round(model.params[1],2)} X")


print(f"i.e., For every 1 unit increase in Percentage in Grade 10, Salary is increased by {round(model.params[1],2)} times")

Our SLR Model is: Y = 28442.88 + 3590.79 X
i.e., for every 1 unit increase in Percentage in Grade 10, Salary increases by 3590.79 units

An R-squared value shows how well the model predicts the outcome of the dependent variable. R-squared values range from 0 to 1: a value of 0 means the model explains none of the variation in the dependent variable, and a value of 1 means it explains all of it. A higher R-squared indicates a better-fitting model.
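Formally, R-squared = 1 - SS_res / SS_tot, where SS_res = Σ(y - ŷ)² is the residual sum of squares and SS_tot = Σ(y - ȳ)² is the total sum of squares around the mean.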

The Root Mean Squared Error (RMSE) is, together with R-squared, one of the two main performance indicators for a regression model. It measures the average magnitude of the difference between the values predicted by the model and the actual values, giving an estimate of how accurately the model predicts the target.
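Formally, RMSE = sqrt( Σ(y - ŷ)² / n ). Both metrics can be computed directly from these definitions; a minimal sketch (the function names here are ours, not from a library):

import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))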

In [5]: # Calculating R2 score and RMSE

from sklearn.metrics import r2_score, mean_squared_error


y_pred = model.predict(X_test)
print("R-squared =",np.abs(round(r2_score(y_test, y_pred), 2)))
print("RMSE =",round(np.sqrt(mean_squared_error(y_test, y_pred)), 2))

R-squared = 0.16
RMSE = 73553.54

Residual Analysis:

In linear regression, a residual is the difference between the actual value and the value predicted by the model (y-ŷ) for any given point.

Normality of residuals refers to how closely the residuals follow a normal distribution. To check it, we use probplot from scipy.stats, which draws a Q-Q plot of the residuals against the theoretical normal quantiles.

In [6]: # Residual Analysis


import scipy.stats as se
import matplotlib.pyplot as plt

model_residuals = model.resid
se.probplot(model_residuals, plot = plt)
plt.show()

Next, we standardize the fitted values and the residuals and use a scatter plot to inspect their distribution. We can also use sklearn's StandardScaler to standardize the values; a sketch of that alternative follows the cell below.

In [8]: # Standardizing the values

def standard(vals):
    return (vals - vals.mean()) / vals.std()

plt.scatter(standard(model.fittedvalues), standard(model.resid))
plt.title("Residual Plot: MBA Salary Prediction")
plt.xlabel("Standardized predicted values")
plt.ylabel("Standardized Residuals")
plt.show()
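As mentioned above, sklearn's StandardScaler can standardize the values instead of the manual function. A sketch using the fitted model from above (note: StandardScaler expects a 2-D array, and it divides by the population standard deviation while pandas' .std() uses the sample standard deviation, so the results differ very slightly):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# reshape(-1, 1) turns each 1-D Series into the 2-D shape the scaler expects
std_fitted = scaler.fit_transform(model.fittedvalues.values.reshape(-1, 1))
std_resid = scaler.fit_transform(model.resid.values.reshape(-1, 1))
plt.scatter(std_fitted, std_resid)
plt.show()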

Outlier Analysis:

Outliers are observations whose values show a large deviation from the mean value.

We can detect outliers using the Z-score, Mahalanobis distance, Cook's distance, leverage values, and the inter-quartile range (IQR).

The Z-score is the standardized distance of an observation from the mean. For an observation x, it is given by

z = (x - μ)/σ

where μ is the mean and σ is the standard deviation. If the Z-score is greater than 3 or less than -3, the observation is treated as an outlier.

Cook's distance measures how much the predicted values of the dependent variable change, across all observations in the sample, when a particular observation is excluded from the estimation of the regression parameters. A Cook's distance greater than 1 indicates a highly influential observation; such observations should be inspected and, if justified, removed.
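With statsmodels, Cook's distances are available from the fitted model's influence diagnostics. A minimal sketch using the model fitted above:

influence = model.get_influence()
cooks_d = influence.cooks_distance[0]  # first element of the tuple is the array of distances
print("Observations with Cook's distance > 1:", (cooks_d > 1).sum())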

The leverage value of an observation measures its influence on the overall fit of the regression function and is related to the Mahalanobis distance. A leverage value above 3(k + 1)/n marks a highly influential observation, where k is the number of features in the model and n is the sample size.
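Leverage (hat) values come from the same influence diagnostics; a sketch applying the 3(k + 1)/n rule to the model fitted above:

influence = model.get_influence()
leverage = influence.hat_matrix_diag  # leverage value for each training observation
k = X_train.shape[1] - 1              # number of features, excluding the constant column
n = X_train.shape[0]
print("High-leverage observations:", (leverage > 3 * (k + 1) / n).sum())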

The Inter-Quartile Range (IQR) tells you the range of the middle half of your dataset. You can use the IQR to create “fences” around your data and then define outliers as any values that fall outside those fences.

In [9]: # Outlier detection using Z-score

from scipy.stats import zscore


mba_salary_df['z_score_salary'] = zscore(mba_salary_df.Salary)
mba_salary_df[ (mba_salary_df.z_score_salary > 3.0) | (mba_salary_df.z_score_salary < -3.0)]

Out[9]: S.No  Percentge in Grade 10  Salary  z_score_salary
(no rows returned: no observation lies beyond ±3 standard deviations, so the Z-score method flags no outliers in this dataset)

In [10]: # Outlier removal using IQR

def remove_outliers_iqr(data):
    # Returns a boolean mask: True for values inside the 1.5*IQR fences, False for outliers
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (data >= lower_bound) & (data <= upper_bound)

mba_salary_df = mba_salary_df[remove_outliers_iqr(mba_salary_df['Salary'])]

After removing outliers, we can run the model evaluation again.

In [11]: # Add constant 1 to the dataset

X = sm.add_constant(mba_salary_df['Percentge in Grade 10'])
y = mba_salary_df['Salary']

# Split dataset into train and test set


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 100)

# Fit the regression model


model = sm.OLS(y_train, X_train).fit()

y_pred = model.predict(X_test)
print("R-squared for new model =",np.abs(round(r2_score(y_test, y_pred), 2)))
print("RMSE for new model =",round(np.sqrt(mean_squared_error(y_test, y_pred)), 2))

R-squared for new model = 0.6
RMSE for new model = 59869.17

The R-squared of the new model has increased (from 0.16 to 0.6) and the RMSE has decreased (from 73553.54 to 59869.17), which indicates that the model performs better after outlier removal.
