MLR Example 2predictors
[ ]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# fix the seed (random state), so that next time we use the code we get the same data
#
init_seed=48
np.random.seed(init_seed)
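# NOTE: the lines defining nobs, sigma and the predictors x1, x2 are missing from this
# export; the values below are assumptions, chosen only so that the code runs end to end
nobs=100                             # assumed sample size
sigma=10                             # assumed noise standard deviation
x1=np.random.uniform(0, 10, nobs)    # assumed range for predictor x1
x2=np.random.uniform(0, 10, nobs)    # assumed range for predictor x2
# stack the predictors: the 1st column of x holds x1, the 2nd column holds x2
x=np.c_[x1, x2]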
err=np.random.normal(0,sigma,nobs)
# Consider the linear model for the dependence of the response Y on predictors X1 and X2:
# in other words, assume the model in which we observe the Y-value with an error:
#y=alpha+beta1*x1+beta2*x2+err
# true parameters:
alpha=2023
beta1=20
beta2=-5
y=alpha+beta1*x1+beta2*x2+err
plt.scatter(x2, y)
plt.title('y vs. x2')
plt.xlabel('x2')
plt.ylabel('y')
plt.show()
#below we are checking dimensions and shapes (just to see what we have got)
#the 1st column in x contains values of x1, and the 2nd column contains values of x2
print(np.shape(x), x.ndim)
#check y
print(np.shape(y), y.ndim)
1 Fitting the MLR model to the whole dataset by using Sklearn.
Model = LinearRegression()
Model.fit(x, y)
print(Model.coef_[0], Model.coef_[1], Model.intercept_)
y_pr= Model.predict(x)
mse = mean_squared_error(y, y_pr)
print("Mean Squared Error:", mse)
[ ]: # Prepare matrices
B=np.ones([nobs,1])
X=np.c_[B, x]
XT=np.transpose(X)
XTX=XT@X
print(XTX)
XTX_inv=np.linalg.inv(XTX)
print(XTX_inv)
M=XTX_inv
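# NOTE: the step computing the OLS estimate from the normal equations appears to be
# missing from this export; it is reconstructed here so that hat_y below is defined
# beta_hat = (X^T X)^{-1} X^T y
beta_hat=M@XT@y
print(beta_hat)
# fitted values, needed for SSR
hat_y=X@beta_hat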
#Compute the sums of squares SSR and SST, R^2 and MS_E(=MSE)
mean_y=np.mean(y)
SSR=np.sum((hat_y-mean_y)**2)
SST=np.sum((y-mean_y)**2)
R2=SSR/SST
print(R2)
#MS_E = SSE/(n-p) with SSE = SST-SSR and p = 3 estimated parameters (intercept and two slopes);
#the second printed value, SSE/n, is what sklearn's mean_squared_error returns
MSE=(SST-SSR)/(len(y)-3)
print(MSE, MSE*(len(y)-3)/len(y))
2.1 Train_Test_Split
In this example we deal with synthetic data which are “homogeneous” in the sense that all
observations are generated by the same model. Therefore, splitting the data into training and
test sets does not make much sense. However, for training purposes (training ourselves :)),
let’s see how it works.
Split the data into a training subset and a test subset by using the function train_test_split
from Sklearn. Then train the model on the training subset and see how well the trained model
predicts similar data.
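A minimal sketch of that split (the test_size and random_state values are assumptions, since
the cell defining the split and Train_model is not shown in this export):
[ ]: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=init_seed)
Train_model = LinearRegression()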
Train_model.fit(x_train, y_train)
To assess overfitting using MSE: check the Mean Squared Error (MSE) on both the training and
test sets. If the training error is much lower than the test error, it might be a sign of potential
overfitting.
R-squared (coefficient of determination) is a useful metric to assess the goodness of fit of a linear
regression model. However, R-squared alone may not be sufficient to detect overfitting. While a
high R-squared on the training set is desirable, it doesn’t necessarily imply good generalization to
new, unseen data.
To assess overfitting using R-squared: compare the R-squared values on both the training and test
sets. If the R-squared is significantly higher on the training set than on the test set, it may be an
indication of overfitting.
print("Mean Squared Error:", mse_train)
y_pred = Train_model.predict(x_test)
#check how good the computed predicted values are, i.e. how good they match the
,→observed
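A sketch of the R-squared comparison described above, assuming the split and Train_model
defined earlier (this cell is not part of the original export):
[ ]: r2_train = r2_score(y_train, Train_model.predict(x_train))
r2_test = r2_score(y_test, Train_model.predict(x_test))
print("R^2 (train):", r2_train, "R^2 (test):", r2_test)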
In this example the trained model works very well on the “new”, i.e. test, data.
[ ]: