Linear Regression - 25mar2025
You are part of an investment firm, and your task is to research these 759 firms.
You are provided with a dataset containing the sales and other attributes of the 759
firms. Predict the sales of these firms on the basis of the details given in the dataset,
so as to help your company invest wisely. Also, identify the 5 most important
attributes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
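The notebook calls `df` before any loading cell appears, so the `pd.read_csv` step was presumably lost in export. A hedged, self-contained sketch: the file name `firms.csv` and all the sample values below are placeholders, not the real data.

```python
import numpy as np
import pandas as pd

# The real notebook presumably did something like:
# df = pd.read_csv("firms.csv")   # "firms.csv" is a placeholder name

# Self-contained stand-in with the same 9 columns (after dropping "Unnamed: 0")
# and one missing tobinq value, mirroring the structure reported by df.info():
df = pd.DataFrame({
    "sales":        [100.0, 250.0, 8000.0],
    "capital":      [50.0, 120.0, 6000.0],
    "patents":      [10, 2, 138],
    "randd":        [30.0, 0.0, 3000.0],
    "employment":   [2.5, 1.8, 49.0],
    "sp500":        ["no", "no", "yes"],
    "tobinq":       [11.0, np.nan, 3.5],
    "value":        [1600.0, 240.0, 20000.0],
    "institutions": [80.0, 59.0, 47.0],
})

print(df.isnull().sum())  # only tobinq has a missing value here
```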
df.isnull().sum()
Unnamed: 0 0
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 21
value 0
institutions 0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 759 entries, 0 to 758
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 759 non-null int64
1 sales 759 non-null float64
2 capital 759 non-null float64
3 patents 759 non-null int64
4 randd 759 non-null float64
5 employment 759 non-null float64
6 sp500 759 non-null object
7 tobinq 738 non-null float64
8 value 759 non-null float64
9 institutions 759 non-null float64
dtypes: float64(7), int64(2), object(1)
memory usage: 59.4+ KB
df.dtypes
Unnamed: 0 int64
sales float64
capital float64
patents int64
randd float64
employment float64
sp500 object
tobinq float64
value float64
institutions float64
dtype: object
df.describe().transpose()
df.shape
(759, 10)
df.head()
df.drop("Unnamed: 0",axis=1,inplace=True)
df.iloc[np.where(df["sales"]==max(df["sales"]))]
(output truncated; row 49 holds the firm with the maximum sales, institutions = 6.11)
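The `np.where` lookup above works, but `idxmax` is the more idiomatic way to pull the row with the maximum sales. A small sketch with made-up numbers:

```python
import pandas as pd

# Toy frame; the values are illustrative, not from the real dataset
toy = pd.DataFrame({"sales": [5.0, 12.0, 7.0], "institutions": [1.1, 6.11, 2.2]})

# idxmax returns the index label of the largest value; loc fetches that row
top_row = toy.loc[toy["sales"].idxmax()]
print(top_row)
```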
EDA
Histogram
dfcolumns = ['sales', 'capital', 'patents', 'randd', 'employment',
             'tobinq', 'value', 'institutions']
fig = plt.figure(figsize=(12, 12))
for i in range(len(dfcolumns)):
    ax = fig.add_subplot(4, 3, i + 1)
    sns.histplot(df[dfcolumns[i]], kde=True)  # distplot is deprecated
    ax.set_title(dfcolumns[i], color='Red')
plt.tight_layout()
As we can see from the histogram plots, the variables ['sales', 'capital',
'patents', 'randd', 'employment', 'tobinq', 'value'] are right-skewed, which
means the data is not normally distributed.
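Right-skewed variables like these are often log-transformed before linear regression. The notebook itself does not do this; a minimal sketch of the idea on a toy series:

```python
import numpy as np
import pandas as pd

# A heavily right-skewed toy series (one large outlier)
skewed = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
logged = np.log1p(skewed)  # log(1 + x) keeps zeros finite

# The log transform pulls the long right tail in, reducing skewness
print(round(skewed.skew(), 2), round(logged.skew(), 2))
```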
CORRELATION
df.columns
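The CORRELATION heading has no code attached, so the heatmap cell was presumably lost in export. A hedged reconstruction of the usual seaborn pattern, on a toy numeric frame rather than the real `df`:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Toy numeric frame standing in for df
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)),
                   columns=["sales", "capital", "randd"])

# Pairwise Pearson correlations, drawn as an annotated heatmap
corr = toy.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.close("all")
print(corr.shape)
```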
BOXPLOT
fig, ax = plt.subplots(figsize=(15,10))
sns.boxplot(data=df, width= 0.5,ax=ax, fliersize=3)
As we can see from the boxplot, all the variables have outliers, but we
can't simply drop them because the dataset is small.
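Rather than dropping outliers from a small dataset, a common alternative is capping them at the IQR fences. The notebook does not do this; a sketch on a toy series:

```python
import pandas as pd

# Toy series with one obvious outlier
s = pd.Series([1, 2, 3, 4, 100])

# Tukey fences: 1.5 * IQR beyond the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(capped.tolist())  # the outlier is pulled in to the upper fence
```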
sns.pairplot(df)
1.2. Impute null values if present? Do you think scaling is necessary in
this case?
df.isnull().sum()
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 21
value 0
institutions 0
dtype: int64
Imputation is not needed: only 21 values are missing, all in the 'tobinq'
column, so we can simply drop those rows. As for scaling: it is not strictly
necessary for plain linear regression predictions, since the coefficients
absorb the units, but it would matter if we compared coefficient magnitudes
or used regularised models.
df=df.dropna()
df.isnull().sum()
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 0
value 0
institutions 0
dtype: int64
df.head()
df1= pd.get_dummies(df)
df1.head()
df1.shape
(738, 10)
X.head()   # X is the feature matrix (df1 minus the target); only the dummy columns are shown below
sp500_no sp500_yes
0 1.0 0.0
1 1.0 0.0
2 0.0 1.0
3 1.0 0.0
4 1.0 0.0
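`x_train`, `y_train`, `x_test`, and `y_test` are used below without being defined, so the split cell was presumably lost in export. A hedged reconstruction on a placeholder frame (the column names are assumed from the dummified data, and the values are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame standing in for df1 (the dummified data)
df1 = pd.DataFrame({
    "sales":    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "capital":  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "sp500_no": [1, 0, 1, 0, 1, 0],
    "sp500_yes":[0, 1, 0, 1, 0, 1],
})

# Separate target from features, then hold out 30% for testing
X = df1.drop("sales", axis=1)
y = df1["sales"]
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

print(x_train.shape, x_test.shape)
```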
Linear regression
from sklearn.linear_model import LinearRegression
lr= LinearRegression()
lr.fit(x_train,y_train)
LinearRegression()
lr.score(x_train,y_train)
0.9231828383630586
lr.score(x_test,y_test)
0.8842133828141475
#r2
from sklearn.metrics import r2_score
r2_test = r2_score(y_test,lr.predict(x_test))
r2_test
0.8842133828141475
from sklearn import metrics
print("R2_train:", r2_score(y_train, lr.predict(x_train)))
print("R2_test:", r2_score(y_test, lr.predict(x_test)))
print("RMSE_Train:", np.sqrt(metrics.mean_squared_error(y_train, lr.predict(x_train))))
print("RMSE_Test:", np.sqrt(metrics.mean_squared_error(y_test, lr.predict(x_test))))
R2_train: 0.9231828383630586
R2_test: 0.8842133828141475
RMSE_Train: 2231.2961798939587
RMSE_Test: 3545.7901768414704
plt.scatter(y_test, lr.predict(x_test))
plt.xlabel("Actual sales")
plt.ylabel("Predicted sales")
ExtraTreesRegressor()
X.shape
(738, 9)
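The stray `ExtraTreesRegressor()` output above hints at how the 5 most important attributes were found, but the code cell is missing. A hedged sketch of tree-based feature importances on synthetic data (the column names are borrowed from the dataset; the signal is artificial):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: randd drives the target strongly, capital weakly
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=["capital", "patents", "randd", "employment", "value"])
y = 3 * X["randd"] + 0.5 * X["capital"] + rng.normal(scale=0.1, size=200)

# Fit and rank features by impurity-based importance
model = ExtraTreesRegressor(random_state=0).fit(X, y)
imp = pd.Series(model.feature_importances_,
                index=X.columns).sort_values(ascending=False)

print(imp.head(5))  # the top entries would be the "5 most important attributes"
```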