
Linear Regression

You are part of an investment firm, and your task is to research these 759 firms.
You are provided with a dataset containing the sales and other attributes of these 759
firms. Predict the sales of these firms on the basis of the details given in the dataset,
so as to help your company invest judiciously. Also, provide the five attributes that are
most important.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Reading the data
df = pd.read_csv("Firm_level_data.csv")
df.head()

   Unnamed: 0        sales      capital  patents        randd  employment sp500     tobinq         value  institutions
0           0   826.995050   161.603986       10   382.078247    2.306000    no  11.049511   1625.453755         80.27
1           1   407.753973   122.101012        2     0.000000    1.860000    no   0.844187    243.117082         59.02
2           2  8407.845588  6221.144614      138  3296.700439   49.659005   yes   5.205257  25865.233800         47.70
3           3   451.000010   266.899987        1    83.540161    3.071000    no   0.305221     63.024630         26.88
4           4   174.927981   140.124004        2    14.233637    1.947000    no   1.063300     67.406408         49.46

df.isnull().sum()

Unnamed: 0 0
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 21
value 0
institutions 0
dtype: int64

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 759 entries, 0 to 758
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 759 non-null int64
1 sales 759 non-null float64
2 capital 759 non-null float64
3 patents 759 non-null int64
4 randd 759 non-null float64
5 employment 759 non-null float64
6 sp500 759 non-null object
7 tobinq 738 non-null float64
8 value 759 non-null float64
9 institutions 759 non-null float64
dtypes: float64(7), int64(2), object(1)
memory usage: 59.4+ KB

df.dtypes

Unnamed: 0 int64
sales float64
capital float64
patents int64
randd float64
employment float64
sp500 object
tobinq float64
value float64
institutions float64
dtype: object

df.describe().transpose()

              count         mean          std       min         25%         50%          75%            max
Unnamed: 0    759.0   379.000000   219.248717  0.000000  189.500000  379.000000   568.500000     758.000000
sales         759.0  2689.705158  8722.060124  0.138000  122.920000  448.577082  1822.547366  135696.788200
capital       759.0  1977.747498  6466.704896  0.057000   52.650501  202.179023  1075.790020   93625.200560
patents       759.0    25.831357    97.259577  0.000000    1.000000    3.000000    11.500000    1220.000000
randd         759.0   439.938074  2007.397588  0.000000    4.628262   36.864136   143.253403   30425.255860
employment    759.0    14.164519    43.321443  0.006000    0.927500    2.924000    10.050001     710.799925
tobinq        738.0     2.794910     3.366591  0.119001    1.018783    1.680303     3.139309      20.000000
value         759.0  2732.734750  7071.072362  1.971053  103.593946  410.793529  2054.160385   95191.591160
institutions  759.0    43.020540    21.685586  0.000000   25.395000   44.110000    60.510000      90.150000

df.shape

(759, 10)

df.head()

   Unnamed: 0        sales      capital  patents        randd  employment sp500     tobinq         value  institutions
0           0   826.995050   161.603986       10   382.078247    2.306000    no  11.049511   1625.453755         80.27
1           1   407.753973   122.101012        2     0.000000    1.860000    no   0.844187    243.117082         59.02
2           2  8407.845588  6221.144614      138  3296.700439   49.659005   yes   5.205257  25865.233800         47.70
3           3   451.000010   266.899987        1    83.540161    3.071000    no   0.305221     63.024630         26.88
4           4   174.927981   140.124004        2    14.233637    1.947000    no   1.063300     67.406408         49.46

df.drop("Unnamed: 0",axis=1,inplace=True)
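The same cleanup can also be done at load time by treating the first CSV column as the index (an equivalent alternative to the drop above):

# Equivalent alternative: skip the drop step by reading the first
# column of the CSV as the index
df = pd.read_csv("Firm_level_data.csv", index_col=0)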

df.iloc[np.where(df["sales"] == max(df["sales"]))]

           sales      capital  patents        randd  employment sp500    tobinq        value  institutions
295  135696.7882  93625.20056      774  30425.25586  710.799925   yes  0.559656  42499.13324          41.8

df.iloc[np.where(df["sales"] == min(df["sales"]))]

    sales  capital  patents   randd  employment sp500     tobinq      value  institutions
49  0.138    1.512       22  9.3931       0.046    no  18.816427  36.292813          6.11

EDA

Histogram
dfcolumns = ['sales', 'capital', 'patents', 'randd', 'employment',
             'tobinq', 'value', 'institutions']

fig = plt.figure(figsize=(12, 12))
for i in range(len(dfcolumns)):
    ax = fig.add_subplot(4, 3, i + 1)
    sns.distplot(df[dfcolumns[i]])
    ax.set_title(dfcolumns[i], color='Red')
plt.tight_layout()
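Note: sns.distplot is deprecated in recent seaborn releases (0.11 and later); assuming such a version, an equivalent loop with the current API would be:

# Current-API equivalent of the loop above: histogram with a KDE overlay
fig = plt.figure(figsize=(12, 12))
for i, col in enumerate(dfcolumns):
    ax = fig.add_subplot(4, 3, i + 1)
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(col, color='Red')
plt.tight_layout()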
As we can see from the histogram plots, the variables sales, capital, patents, randd, employment, tobinq, and value are right-skewed, which means the data is not normally distributed.

The variable institutions appears to be normally distributed.

CORRELATION
df.columns

Index(['sales', 'capital', 'patents', 'randd', 'employment', 'sp500',
       'tobinq', 'value', 'institutions'],
      dtype='object')

num_var = ['sales', 'capital', 'patents', 'randd', 'employment',
           'tobinq', 'value', 'institutions']
plt.figure(figsize=(10,8))
sns.heatmap(df[num_var].corr(),cmap="YlGnBu",annot=True)

<AxesSubplot:>

As we can see from the correlation heatmap, the variables "patents" and "randd" are highly correlated with each other.
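The strongest relationships can also be listed programmatically instead of being read off the heatmap; a short sketch reusing the num_var list:

# Keep the upper triangle of the |correlation| matrix and list the top pairs
corr = df[num_var].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.unstack().dropna().sort_values(ascending=False).head(5))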

BOXPLOT
fig, ax = plt.subplots(figsize=(15,10))
sns.boxplot(data=df, width= 0.5,ax=ax, fliersize=3)

<AxesSubplot:>
As we can see from the boxplot, all the variables have outliers, but we can't remove them because the dataset is small.
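Rather than removing them, the outliers can at least be quantified per column with the usual 1.5 × IQR rule (a sketch):

# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column
q1 = df[num_var].quantile(0.25)
q3 = df[num_var].quantile(0.75)
iqr = q3 - q1
outlier_counts = ((df[num_var] < q1 - 1.5 * iqr) | (df[num_var] > q3 + 1.5 * iqr)).sum()
print(outlier_counts)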
sns.pairplot(df)

<seaborn.axisgrid.PairGrid at 0x21e8e517f40>
1.2. Impute null values if present? Do you think scaling is necessary in
this case?
df.isnull().sum()

sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 21
value 0
institutions 0
dtype: int64
Imputation is not needed: there are only 21 null values, all in the 'tobinq' column, so we can simply drop those rows.
df=df.dropna()

df.isnull().sum()

sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 0
value 0
institutions 0
dtype: int64
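For completeness, had we preferred imputation over dropping, a median fill would have been a one-liner (the median is the safer centre here because tobinq is skewed):

# Alternative to dropna(): fill the 21 missing tobinq values with the median
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].median())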

Yes, scaling is necessary. The data is not normally distributed, and each column has its own range of values; without scaling, features with larger ranges would dominate and the algorithm might not work properly, which is why the dataset is scaled.
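For reference, min-max scaling maps every column onto [0, 1] relative to its own range; the MinMaxScaler used below implements exactly this simple sketch:

# Hand-rolled min-max scaling for one column, equivalent to MinMaxScaler
def minmax(col):
    return (col - col.min()) / (col.max() - col.min())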

df.head()

         sales      capital  patents        randd  employment sp500     tobinq         value  institutions
0   826.995050   161.603986       10   382.078247    2.306000    no  11.049511   1625.453755         80.27
1   407.753973   122.101012        2     0.000000    1.860000    no   0.844187    243.117082         59.02
2  8407.845588  6221.144614      138  3296.700439   49.659005   yes   5.205257  25865.233800         47.70
3   451.000010   266.899987        1    83.540161    3.071000    no   0.305221     63.024630         26.88
4   174.927981   140.124004        2    14.233637    1.947000    no   1.063300     67.406408         49.46

df1= pd.get_dummies(df)

df1.head()

         sales      capital  patents        randd  employment     tobinq         value  institutions  sp500_no  sp500_yes
0   826.995050   161.603986       10   382.078247    2.306000  11.049511   1625.453755         80.27         1          0
1   407.753973   122.101012        2     0.000000    1.860000   0.844187    243.117082         59.02         1          0
2  8407.845588  6221.144614      138  3296.700439   49.659005   5.205257  25865.233800         47.70         0          1
3   451.000010   266.899987        1    83.540161    3.071000   0.305221     63.024630         26.88         1          0
4   174.927981   140.124004        2    14.233637    1.947000   1.063300     67.406408         49.46         1          0

df1.shape

(738, 10)
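Note that sp500_no and sp500_yes always sum to 1, so they are perfectly collinear, which is redundant for a linear model. An alternative worth considering is to keep a single indicator per category:

# Alternative encoding: drop one dummy per category to avoid collinearity
df1 = pd.get_dummies(df, drop_first=True)  # keeps only sp500_yes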

Splitting the dataset


X = df1.drop("sales", axis=1)
y = df1["sales"]

Scaling of the dataset


from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X = scaler.fit_transform(X)
X = pd.DataFrame(X, columns=['capital', 'patents', 'randd', 'employment',
                             'tobinq', 'value', 'institutions',
                             'sp500_no', 'sp500_yes'])

X.head()

    capital   patents     randd  employment    tobinq     value  institutions  sp500_no  sp500_yes
0  0.001725  0.008197  0.012558    0.003236  0.549797  0.017055      0.890405       1.0        0.0
1  0.001304  0.001639  0.000000    0.002608  0.036476  0.002533      0.654687       1.0        0.0
2  0.066447  0.113115  0.108354    0.069856  0.255835  0.271703      0.529118       0.0        1.0
3  0.002850  0.000820  0.002746    0.004312  0.009367  0.000641      0.298170       1.0        0.0
4  0.001496  0.001639  0.000468    0.002731  0.047498  0.000687      0.548641       1.0        0.0

Train Test Split


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=16)
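One caveat: the scaler above was fitted on the full feature matrix before splitting, which lets test-set statistics leak into training. A stricter variant (a sketch, assuming the split is done on the unscaled features) fits the scaler on the training split only:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training split only, then reuse it unchanged on
# the test split so no test-set statistics leak into training
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # transform only, no refit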

Linear regression
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(x_train, y_train)

LinearRegression()

#Performance of prediction on train set

lr.score(x_train,y_train)

0.9231828383630586
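Since all features share the common [0, 1] scale, the fitted coefficients can be read alongside the feature names as a rough indication of each feature's direction and weight (a sketch, assuming pandas 1.1+ for the key argument):

# Coefficients of the fitted model, sorted by absolute magnitude
coefs = pd.Series(lr.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coefs)
print("Intercept:", lr.intercept_)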

#r2

from sklearn.metrics import r2_score

r2_train = r2_score(y_train, lr.predict(x_train))
r2_train

0.9231828383630586

#RMSE

from sklearn import metrics
from sklearn.metrics import mean_squared_error

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, lr.predict(x_train))))

RMSE: 2231.2961798939587
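In recent scikit-learn releases the square root can be delegated to the metric itself (equivalent, assuming a version that supports the squared argument, i.e. 0.22+):

# Same RMSE in a single call, no explicit square root needed
print('RMSE:', mean_squared_error(y_train, lr.predict(x_train), squared=False))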

#Performance of prediction on test set

lr.score(x_test,y_test)

0.8842133828141475

#r2
from sklearn.metrics import r2_score
r2_test = r2_score(y_test,lr.predict(x_test))
r2_test

0.8842133828141475

from sklearn import metrics
from sklearn.metrics import mean_squared_error

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, lr.predict(x_test))))

RMSE: 3545.7901768414704

print("R2_train:", r2_score(y_train, lr.predict(x_train)))
print("R2_test:", r2_score(y_test, lr.predict(x_test)))
print('RMSE_Train:', np.sqrt(metrics.mean_squared_error(y_train, lr.predict(x_train))))
print('RMSE_Test:', np.sqrt(metrics.mean_squared_error(y_test, lr.predict(x_test))))

R2_train: 0.9231828383630586
R2_test: 0.8842133828141475
RMSE_Train: 2231.2961798939587
RMSE_Test: 3545.7901768414704

plt.scatter(y_test,lr.predict(x_test))

<matplotlib.collections.PathCollection at 0x21e948b0d60>
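The bare scatter plot is hard to judge without labels or a reference; a labelled version with a 45° perfect-prediction line is a small sketch:

# Actual vs. predicted sales with a perfect-prediction reference line
preds = lr.predict(x_test)
plt.scatter(y_test, preds, alpha=0.5)
lims = [min(y_test.min(), preds.min()), max(y_test.max(), preds.max())]
plt.plot(lims, lims, 'r--')  # points on this line are predicted perfectly
plt.xlabel("Actual sales")
plt.ylabel("Predicted sales")
plt.show()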

Feature Importance

from sklearn.ensemble import ExtraTreesRegressor

model = ExtraTreesRegressor()
model.fit(X, y)

ExtraTreesRegressor()

X.shape

(738, 9)

# Plot feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(9).plot(kind='barh')
plt.show()
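To answer the five-attribute question directly, the top five importances can also be printed:

# Five most important attributes by ExtraTrees feature importance
print(feat_importances.nlargest(5))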

Key Insights from Linear Regression Analysis

1. Sales Prediction Accuracy
   - The linear regression model has an R² score of 0.92 on the training set and 0.88 on the test set.
   - The RMSE values indicate some error but are within a reasonable range for investment decisions.
2. Feature Importance
   - The top five features influencing sales are:
     1. R&D Investment
     2. Employment Level
     3. Capital Investment
     4. Patents Owned
     5. Market Value (Tobin's Q Ratio)
   - These attributes are critical for making informed investment decisions.
3. Data Distribution & Skewness
   - Most variables (sales, capital, patents, R&D, employment, and value) are right-skewed, meaning a few firms dominate the dataset with significantly higher values.
   - The presence of outliers suggests high variability in firm performance.
4. Impact of Stock Market Listing (S&P 500 Membership)
   - Firms listed on the S&P 500 (sp500_yes = 1) tend to have higher sales and capital investment.
   - However, not all high-sales firms are S&P 500 members, implying investment opportunities in non-listed firms.
5. Correlation Insights
   - There is a strong correlation between patents and R&D spending, which indicates that firms investing in R&D tend to generate more intellectual property.
   - There is a moderate correlation between sales and employment, meaning firms with more employees tend to have higher revenues.
