MLR Example 2predictors

This document presents an example of fitting a multiple linear regression model with 2 predictors (X1 and X2) to synthetic data. It: 1) Generates random predictor and response variable data according to a linear model with errors. 2) Fits the full MLR model to the entire dataset and estimates the coefficients, finding values close to the true parameters. 3) Implements the least squares method to directly estimate the coefficients, obtaining similar results. 4) Performs a train-test split and trains the model on the train set, finding it predicts the test set well without overfitting.


MLR-Example-2predictors

[ ]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

[ ]: # Prepare synthetic data

# fix the seed (random state), so that next time we run the code we get the same data
init_seed=48
np.random.seed(init_seed)

nobs=100 # the number of observations

# generate normally distributed errors with the common distribution N(0,sigma^2)

# specify sigma (the standard deviation of the error)
sigma=20

# generate a random sample err (errors) of size nobs from N(0,sigma^2)
err=np.random.normal(0,sigma,nobs)

# Prepare values of predictors X1 and X2

x1=np.random.randint(1,10,nobs) # random integers from the discrete set {1,...,9}
x2=np.random.uniform(20,30,nobs) # random values from the interval [20,30)

# Consider the linear model for the dependence of the response Y on predictors X1 and X2:
# given that X1=x1 and X2=x2, the mean response is given by
# y=alpha+beta1*x1+beta2*x2

# in other words, assume the model in which we observe the Y-value with an error:
# y=alpha+beta1*x1+beta2*x2+err

# choose the coefficients of the linear function

# true parameters:
alpha=2023
beta1=20
beta2=-5

# generate observations of the response
y=alpha+beta1*x1+beta2*x2+err

# Here we know the true parameters alpha, beta1 and beta2.
# If the estimation code is correct, it should give something close to the true parameters.
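For reference, the data-generating model used in the cell above, written out with the chosen true parameters, is

$$ y_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \varepsilon_i = 2023 + 20\,x_{1,i} - 5\,x_{2,i} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, 20^2), \quad i = 1,\dots,100. $$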

[ ]: # as a preliminary step it might be useful to visualise the data

# for example, produce the scatter plot of
# the response against each predictor
plt.scatter(x1, y)
plt.title('y vs. x1')
plt.xlabel('x1')
plt.ylabel('y')
plt.show()

plt.scatter(x2, y)
plt.title('y vs. x2')
plt.xlabel('x2')
plt.ylabel('y')
plt.show()

[ ]: # produce the two-dimensional array of x-observations

# we have two vectors x1 and x2 of shape (100,)
# need to combine them into the 2d matrix of shape (100,2)
x = np.column_stack((x1, x2))

# below we check dimensions and shapes (just to see what we have got)
# the 1st column of x contains the values of x1, and the 2nd column contains the values of x2
print(np.shape(x), x.ndim)

# check y
print(np.shape(y), y.ndim)

1 Fitting the MLR model to the whole dataset using sklearn.

[ ]: # Full model, i.e. fit the model to ALL data

Model = LinearRegression()
Model.fit(x, y)
print(Model.coef_[0], Model.coef_[1], Model.intercept_)
y_pr= Model.predict(x)
mse = mean_squared_error(y, y_pr)
print("Mean Squared Error:", mse)

#R-squared, coeff. of determination


r2 = r2_score(y, y_pr)
print("R-squared:", r2)

2 Estimation of the model coefficients (using, as before, all observations) directly by implementing the least squares method.
Below we use the code from the example with the STEAM data.
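The estimator computed below is the usual least squares solution of the normal equations,

$$ \hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}, $$

where $X$ is the $n \times 3$ design matrix whose first column consists of ones (for the intercept) and whose remaining two columns hold the observed values of X1 and X2.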

[ ]: # Prepare matrices
B=np.ones([nobs,1])
X=np.c_[B, x]
XT=np.transpose(X)

XTX=XT@X
print(XTX)

XTX_inv=np.linalg.inv(XTX)
print(XTX_inv)
M=XTX_inv

[ ]: # Compute the estimators


Coef=M@XT@y
print(Coef)

# Compute the fitted values


hat_y=X@Coef

#Compute the sums of squares SSR and SST, R^2 and MS_E(=MSE)
mean_y=sum([i for i in y])/len(y)
SSR=sum([(h-mean_y)**2 for h in hat_y])
SST=sum([(i-mean_y)**2 for i in y])
R2=SSR/SST
print(R2)

MSE=(SST-SSR)/(len(y)-3)
print(MSE, MSE*(len(y)-3)/len(y))
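The quantities computed in the cell above are the usual sums of squares and the derived measures of fit,

$$ SST = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2, \qquad R^2 = \frac{SSR}{SST}, \qquad MS_E = \frac{SST - SSR}{n - 3}, $$

where n - 3 is the number of observations minus the number of estimated parameters (the intercept and two slope coefficients).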

2.1 Train_Test_Split
In this example we deal with synthetic data which are "homogeneous" in the sense that all observations
are generated from the same model. Therefore, splitting the data into training and testing subsets does
not make much sense. However, for training purposes (training ourselves :)), let's see how it works.
Split the data into a training subset and a test subset by using the function train_test_split from
sklearn. Then train the model on the training subset and see how well the trained model predicts
similar data.

[ ]: # Split the data into training and testing sets

seed=121
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=seed)

[ ]: # Initiate an object (a linear regression model) from the class LinearRegression

# here the object is named Train_model
# you can choose another name (your choice)
Train_model = LinearRegression()

# Train the model by fitting it to the training subset
Train_model.fit(x_train, y_train)

To assess overfitting using MSE, check the Mean Squared Error (MSE) on both the training and
test sets. If the training error is much lower than the test error, it may be a sign of
overfitting.
R-squared (coefficient of determination) is a useful metric to assess the goodness of fit of a linear
regression model. However, R-squared alone may not be sufficient to detect overfitting. While a
high R-squared on the training set is desirable, it doesn’t necessarily imply good generalization to
new, unseen data.
To assess overfitting using R-squared: compare the R-squared values on both the training and test
sets. If the R-squared is significantly higher on the training set than on the test set, it may be an
indication of overfitting.
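The next two cells compute these quantities step by step. As a rough sketch of the comparison described above (reusing the split and the fitted Train_model from the earlier cells; the 1.5 cut-off below is only an illustrative choice, not a formal test), the training and test errors can also be put side by side:

[ ]: # illustrative side-by-side comparison of training and test errors
y_train_pred = Train_model.predict(x_train)
y_test_pred = Train_model.predict(x_test)

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

print("train MSE:", mse_train, " test MSE:", mse_test)
print("train R^2:", r2_train, " test R^2:", r2_test)

# purely illustrative rule of thumb: flag a large gap between test and training error
if mse_test > 1.5 * mse_train:
    print("Test MSE is much larger than training MSE: possible overfitting.")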

[ ]: print(Train_model.coef_[0], Train_model.coef_[1], Train_model.intercept_)

[ ]: # MSE and R^2 for the training set


y_train_pred = Train_model.predict(x_train)

# Evaluate the model


mse_train = mean_squared_error(y_train, y_train_pred)

print("Mean Squared Error:", mse_train)

r2_train = r2_score(y_train, y_train_pred)


print("R-squared:", r2_train)

[ ]: # Predictions: calculate predictions of the response for
# x-values from the test subset

y_pred = Train_model.predict(x_test)

# check how good the computed predicted values are, i.e. how well they match
# the corresponding observed values of the response from the test subset

# Mean squared error


mse_test = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse_test)

#R-squared, coeff. of determination


r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)

In this example the trained model works very well on the "new", i.e. test, data.

[ ]:
