ml2020 Pythonlab02
Introduction
In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent
variable y and one or more explanatory variables (or independent variables) denoted X. The case of one
explanatory variable is called simple linear regression. For more than one explanatory variable, the process
is called multiple linear regression.
Linear regression models are often fitted using the least squares approach, but they may also be fitted in
other ways, such as by minimizing the "lack of fit" in some other norm (as with least absolute deviations
regression), or by minimizing a penalized version of the least squares loss function as in ridge regression
(L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit
models that are not linear models. Thus, although the terms "least squares" and "linear model" are closely
linked, they are not synonymous.
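As a minimal sketch of the difference between plain least squares and a penalized fit, the snippet below solves both in closed form with NumPy. The synthetic data, the true coefficients, and the penalty strength `alpha` are assumptions chosen only for illustration; they are not part of the lab data set.

```python
import numpy as np

# Synthetic data (assumption for illustration): y = 2 + 3*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 + X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=200)

# Add a column of ones so that the first coefficient is the intercept
A = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimize ||A b - y||^2
beta_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

# Ridge regression: minimize ||A b - y||^2 + alpha * ||b||^2 (closed form).
# Note: this simplified version penalizes the intercept as well.
alpha = 10.0
beta_ridge = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

The ridge coefficients come out slightly smaller in norm than the least squares ones, which is the effect of the L2 penalty. Lasso has no closed form in general and is usually fitted iteratively, for example with scikit-learn's `Lasso`.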
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Use the 'describe()' method to get a statistical summary of the features in the data set
df.describe(percentiles=[0.1,0.25,0.5,0.75,0.9])
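df.select_dtypes(include="number").corr() # Restrict to numeric columns; Address is a string and has no correlation
plt.figure(figsize=(10,7))
sns.heatmap(df.select_dtypes(include="number").corr(),annot=True,linewidths=2)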
Put all the numerical features in X and Price in y. Ignore Address, which is a string and cannot be used directly in linear regression
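l_column = list(df.columns) # Assumed definition: list of all column names
len_feature = len(l_column) # Assumed definition: total number of columns
X = df[l_column[0:len_feature-2]] # All columns except the last two (Price, Address)
y = df[l_column[len_feature-2]] # Second-to-last column (Price)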
print("Feature set size:",X.shape)
print("Variable set size:",y.shape)
X.head()
y.head()
Test-train split
Import train_test_split function from scikit-learn
from sklearn.model_selection import train_test_split # 'sklearn.cross_validation' was removed in scikit-learn 0.20
Create X and y train and test splits in one command using a split ratio and a random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
Check the size and shape of the train/test splits (the ratio should match the test_size parameter above)
print("Training feature set size:",X_train.shape)
print("Test feature set size:",X_test.shape)
print("Training variable set size:",y_train.shape)
print("Test variable set size:",y_test.shape)
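The split can also be checked on a small stand-in data set. The sketch below is self-contained and uses hypothetical random data (`X_demo`, `y_demo`) rather than the lab's `df`; it assumes a scikit-learn version where `train_test_split` lives in `sklearn.model_selection`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 numeric features and one target
rng = np.random.default_rng(123)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.normal(size=100), name="target")

# Same call pattern as in the lab: 30% of the rows go to the test set
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

print("Train shapes:", Xtr.shape, ytr.shape)
print("Test shapes: ", Xte.shape, yte.shape)
print("Test fraction:", len(Xte) / len(X_demo))
```

Fixing `random_state` makes the split reproducible, so repeated runs give the same train and test rows.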
Fit the model on the instantiated object itself, then check the intercept and coefficients and put them in a
DataFrame
from sklearn.linear_model import LinearRegression
lm = LinearRegression() # Instantiate the linear model object
lm.fit(X_train,y_train) # Fit the linear model on the 'lm' object itself, i.e. no need to assign the result to another variable
print("The intercept term of the linear model:", lm.intercept_)
print("The coefficients of the linear model:", lm.coef_)
#idf = pd.DataFrame(data=idict,index=['Intercept'])
cdf = pd.DataFrame(data=lm.coef_, index=X_train.columns, columns=["Coefficients"])
#cdf=pd.concat([idf,cdf], axis=0)
cdf
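To see what the intercept and the coefficient table should look like when the true values are known, here is a minimal sketch on synthetic data. The feature names, the true coefficients, and the intercept of 4.0 are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (assumption for illustration)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
true_coef = np.array([1.5, -2.0, 0.5])
y_demo = 4.0 + X_demo.to_numpy() @ true_coef + rng.normal(scale=0.1, size=500)

# Fit the model in place on the instantiated object
lm_demo = LinearRegression()
lm_demo.fit(X_demo, y_demo)

# One row per feature, labelled with the feature name
cdf_demo = pd.DataFrame(
    data=lm_demo.coef_, index=X_demo.columns, columns=["Coefficients"]
)
print("Intercept:", lm_demo.intercept_)
print(cdf_demo)
```

With little noise and enough samples, the fitted values should land close to the true ones, which is a quick sanity check on the workflow before applying it to real data.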
Generate a random dataset of points and apply all of the regression methods mentioned above to it.
Show the results on a graph. Also, determine which method is most suitable in terms of execution time.
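One possible starting point for this exercise is sketched below. It times three scikit-learn estimators on random one-dimensional data. The data-generating line, the `alpha` values, and the choice of estimators are assumptions; least absolute deviations regression is left out here and would need a separate estimator.

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Random data (assumption for illustration): y = 3x + 5 + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(1000, 1))
y = 3.0 * x[:, 0] + 5.0 + rng.normal(scale=1.0, size=1000)

models = {
    "least squares": LinearRegression(),
    "ridge (alpha=1.0)": Ridge(alpha=1.0),
    "lasso (alpha=0.1)": Lasso(alpha=0.1),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(x, y)
    elapsed = time.perf_counter() - start
    results[name] = {
        "slope": float(model.coef_[0]),
        "intercept": float(model.intercept_),
        "seconds": elapsed,
    }
    print(f"{name}: slope={results[name]['slope']:.3f}, "
          f"intercept={results[name]['intercept']:.3f}, time={elapsed:.6f}s")

# Method with the shortest fit time in this single run
fastest = min(results, key=lambda k: results[k]["seconds"])
print("Fastest in this run:", fastest)
```

A single timing is noisy, so repeating each fit several times and averaging gives a fairer comparison. For the graph, plotting each model's `predict` output over a scatter of the points with `plt.scatter` and `plt.plot` is one option.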