OpenLab2
Trainer:
FARHAD M RIAZ
Submitted By:
AIMAN SOHAIL
LAB TASK
Regression in Machine Learning
1.1 Objective:
The objective of this task is to apply linear, multiple linear, and polynomial regression, each on a
dataset suited to that regression type, and to analyze performance using the mean squared error, root
mean squared error, mean absolute error, and R-squared score. The results are also to be visualized
with a bar graph.
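For reference, the four metrics named in the objective are commonly defined as follows. The notation is introduced here and does not come from the original report: $y_i$ is an actual value, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the actual values, and $n$ the number of test samples.

```latex
\begin{aligned}
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
R^2           &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\end{aligned}
```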
2. Methodology:
The following procedure applies each regression type to its dataset.
First, import the required libraries: ‘sklearn’ for the regression models, preprocessing, and
performance metrics, ‘NumPy’ for numerical operations, and ‘matplotlib’ for plotting.
To read the data from the dataset, use the pandas library and store the result in a variable.
Next, preprocess the data. Check whether there are any null columns or textual data. If there are,
remove the null columns and convert the textual columns to numerical values with an encoder; I used
LabelEncoder.
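The preprocessing step can be sketched as follows. This is a minimal illustration, not the code used in the lab: the DataFrame and its column names ("hours", "activity", "empty", "score") are hypothetical stand-ins for a CSV loaded with pd.read_csv.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data; the lab itself reads a CSV file with pd.read_csv
df = pd.DataFrame({
    "hours": [2, 5, 7, 3],
    "activity": ["Yes", "No", "Yes", "No"],      # textual column
    "empty": [np.nan, np.nan, np.nan, np.nan],   # column that holds only nulls
    "score": [20.0, 48.0, 66.0, 31.0],
})

# Count the null values in each column
print(df.isnull().sum())

# Drop every column in which all values are null
df = df.dropna(axis=1, how="all")

# Replace the text labels with integer codes (classes are sorted, so "No" -> 0, "Yes" -> 1)
encoder = LabelEncoder()
df["activity"] = encoder.fit_transform(df["activity"])

print(df)
```

LabelEncoder assigns codes in sorted order of the labels, which is adequate for a two-valued column like this one; for a feature with several unordered categories, one-hot encoding is often preferred.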
Then drop the target column from x (which holds all attributes except the target) and store the target
in y. In the student marks dataset, for example, the target column is ‘Marks’.
Finally, split the data into training and test sets. After splitting the data, apply the regression type.
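A minimal sketch of separating the target and splitting the data is shown below. The rows are made up for illustration (only the column names follow the student marks dataset), and the test_size and random_state values are assumptions rather than settings taken from the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows; only the column names follow the student marks dataset
df = pd.DataFrame({
    "number_courses": [3, 4, 4, 6, 8, 6, 3, 5, 4, 7],
    "time_study": [4.5, 0.1, 3.1, 7.9, 7.8, 3.2, 6.1, 3.4, 2.1, 5.0],
    "Marks": [19.2, 7.7, 13.8, 53.0, 55.3, 17.8, 29.9, 17.3, 11.0, 36.0],
})

x = df.drop("Marks", axis=1)  # all attributes except the target
y = df["Marks"]               # the target column

# Assumed settings: hold out 20% of the rows and fix the seed for repeatability
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
```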
For each regression type:
1. First, build the regression model.
2. Then train the model on x_train and y_train (or whichever variable names you have declared), so
that it learns to predict the target.
3. After training the model, predict the targets for x_test.
4. Finally, perform the performance analysis to see how well the regression type fits the given
dataset.
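The four steps above can be sketched for a plain linear model as follows. The data is synthetic (an assumption made so the example is self-contained), and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 1.0, size=100)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()       # 1. build the model
model.fit(x_train, y_train)      # 2. train on the training split
y_pred = model.predict(x_test)   # 3. predict on the test split

# 4. performance analysis on the held-out test data
mse = mean_squared_error(y_test, y_pred)
metrics = {
    "MSE": mse,
    "RMSE": float(np.sqrt(mse)),
    "MAE": mean_absolute_error(y_test, y_pred),
    "R2": r2_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(name, value)
```

The same pattern applies to multiple and polynomial regression; only the preparation of the feature matrix changes.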
3. Result Analysis:
The results below report the mean squared error, root mean squared error, mean absolute error, and
R-squared score for the linear and polynomial regression models.
Linear Regression:
Following are the results of Linear Regression:
Mean Squared Error: 14.20
R-squared (R2) Score: 0.95
Mean Absolute Error (MAE): 3.08
Root Mean Squared Error (RMSE): 3.77
Polynomial Regression:
Following are the results of Polynomial Regression:
Mean Squared Error: 14.88
R-squared (R2) Score: 0.84
Mean Absolute Error (MAE): 3.23
Root Mean Squared Error (RMSE): 3.86
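As a cross-check, the polynomial regression metrics can be recomputed by hand with NumPy from the actual and predicted test values printed in the notebook output. This sketch uses only those ten printed pairs; rounded to two decimals, the results should agree with the figures reported here.

```python
import numpy as np

# Actual and predicted test values copied from the polynomial regression output
actual = np.array([
    10.01286785, 17.84395652, 27.69838335, 28.91218793, 9.39296866,
    2.31380636, 0.78997365, 4.62568946, 4.8579878, 4.67364254,
])
predicted = np.array([
    10.24499852, 25.82807714, 31.8241041, 24.61704203, 4.43380256,
    4.00785672, 3.58277491, 3.17865087, 7.86990452, 2.91591368,
])

errors = actual - predicted

mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error
mae = np.mean(np.abs(errors))    # mean absolute error
# R-squared: 1 minus residual sum of squares over total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print("MSE:", round(mse, 2))
print("RMSE:", round(rmse, 2))
print("MAE:", round(mae, 2))
print("R2:", round(r2, 2))
```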
4. Conclusion:
After analyzing the results, we conclude that different regression types suit different datasets:
linear regression for a simple linear relationship with one predictor, multiple linear regression for a
linear relationship with several predictors, and polynomial regression for curved, non-linear patterns.
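The point about matching the model to the shape of the data can be illustrated with a small sketch. The data here is synthetic (an assumed quadratic relationship, not one of the lab datasets), and both models are scored on the data they were fitted to, purely to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data (assumption): y = x^2 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 60).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 1.0, size=60)

# Straight-line fit
linear = LinearRegression().fit(x, y)
r2_linear = r2_score(y, linear.predict(x))

# Degree-2 fit: expand x into the columns [1, x, x^2], then fit a linear model on them
x_poly = PolynomialFeatures(degree=2).fit_transform(x)
poly = LinearRegression().fit(x_poly, y)
r2_poly = r2_score(y, poly.predict(x_poly))

print("Linear R2:", r2_linear)
print("Polynomial R2:", r2_poly)
```

On data like this, the straight line explains almost none of the variance while the degree-2 model explains most of it, which is the pattern the conclusion describes.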
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
Linear Regression
data1 = pd.read_csv("C:\\Users\\DELL\\Downloads\\Student_Marks.csv")
data1.head()
data1.describe()
data1.info()
data1.isnull().sum()
number_courses 0
time_study 0
Marks 0
dtype: int64
sns.heatmap(data1.corr(),annot=True)
<Axes: >
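# The feature/target definition and the split are not shown in this export.
# The next three lines are an assumed reconstruction; test_size and random_state are assumptions.
x = data1.drop('Marks', axis=1)
y = data1['Marks']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Build the model, train it, and predict on the test split
lr = LinearRegression()
lr.fit(x_train,y_train)
lr_predict = lr.predict(x_test)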
print('Actual Value')
print(y_test)
print('Predicted Value')
print(lr_predict)
Actual Value
83 16.106
53 36.653
70 16.606
45 8.924
44 9.742
39 51.142
22 12.209
80 54.321
10 42.036
0 19.202
18 50.986
30 24.172
73 7.014
33 39.965
90 24.394
4 55.299
76 36.746
77 38.278
12 24.318
31 8.100
Name: Marks, dtype: float64
Predicted Value
[19.27278272 37.76035676 20.18779372  9.65670863 10.97508223 44.81200554
 13.34810968 47.62447209 37.01567001 22.30738483 44.48113375 28.33573684
  7.42332402 38.70638699 28.10820618 48.74122069 35.72331468 39.30103485
 28.29432156  8.94326632]
Multiple Linear Regression
data2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Hours Studied                     10000 non-null  int64
 1   Previous Scores                   10000 non-null  int64
 2   Extracurricular Activities        10000 non-null  object
 3   Sleep Hours                       10000 non-null  int64
 4   Sample Question Papers Practiced  10000 non-null  int64
 5   Performance Index                 10000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 468.9+ KB
data2.isnull().sum()
Hours Studied 0
Previous Scores 0
Extracurricular Activities 0
Sleep Hours 0
Sample Question Papers Practiced 0
Performance Index 0
dtype: int64
data2.describe()
le = LabelEncoder()
data2['Extracurricular Activities'] = le.fit_transform(data2['Extracurricular Activities'])
data2.head()
sns.heatmap(data2.corr(), annot=True)
<Axes: >
X = data2.iloc[:,:-1].values
Y = data2.iloc[:,-1].values
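# The split is not shown in this export; this line is an assumed reconstruction
# (test_size and random_state are assumptions).
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
mlr = LinearRegression()
mlr.fit(X_train,Y_train)
mlr_predict = mlr.predict(X_test)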
print('Actual Value')
print(Y_test)
print('Predicted Value')
print(mlr_predict)
Actual Value
[51. 20. 46. ... 16. 65. 47.]
Predicted Value
[54.71185392 22.61551294 47.90314471 ... 16.79341955 63.34327368
45.94262301]
Polynomial Regression
data3 = pd.read_csv("C:\\Users\\DELL\\Downloads\\Ice_cream selling.csv")
data3.head()
data3.isnull().sum()
Temperature (°C) 0
Ice Cream Sales (units) 0
dtype: int64
data3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 2 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Temperature (°C)         49 non-null     float64
 1   Ice Cream Sales (units)  49 non-null     float64
dtypes: float64(2)
memory usage: 916.0 bytes
data3.describe()
sns.heatmap(data3.corr(), annot=True)
<Axes: >
# Plotting the scatter plot
plt.scatter(data3['Temperature (°C)'], data3['Ice Cream Sales (units)'], color='blue')
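# The feature/target definition and the split are not shown in this export.
# The next three lines are an assumed reconstruction; test_size and random_state are assumptions.
X3 = data3.drop('Ice Cream Sales (units)', axis=1)
Y3 = data3['Ice Cream Sales (units)'].values
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X3, Y3, test_size=0.2, random_state=42)
degree = 2
poly_features = PolynomialFeatures(degree=degree)
x_train_poly = poly_features.fit_transform(X_Train)
# Reuse the transformer fitted on the training data rather than refitting it on the test data
x_test_poly = poly_features.transform(X_Test)
pr = LinearRegression()
pr.fit(x_train_poly, Y_Train)
pr_predict = pr.predict(x_test_poly)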
print('Actual Value')
print(Y_Test)
print('Predicted Value')
print(pr_predict)
Actual Value
[10.01286785 17.84395652 27.69838335 28.91218793  9.39296866  2.31380636
  0.78997365  4.62568946  4.8579878   4.67364254]
Predicted Value
[10.24499852 25.82807714 31.8241041  24.61704203  4.43380256  4.00785672
  3.58277491  3.17865087  7.86990452  2.91591368]
Performance Analysis
print('Linear Regression')
mse1 = mean_squared_error(y_test,lr_predict)
print('Mean Squared Error: ',mse1)
r2_1 = r2_score(y_test,lr_predict)
print("R-squared (R2) Score:", r2_1)
mae1 = mean_absolute_error(y_test, lr_predict)
print("Mean Absolute Error (MAE):", mae1)
rmse1 = np.sqrt(mse1)
print("Root Mean Squared Error (RMSE):", rmse1)
Linear Regression
Mean Squared Error: 14.200726136374588
R-squared (R2) Score: 0.9459936100591212
Mean Absolute Error (MAE): 3.079345229666688
Root Mean Squared Error (RMSE): 3.768385083344666
print('Polynomial Regression')
mse3 = mean_squared_error(Y_Test,pr_predict)
print('Mean Squared Error: ',mse3)
r2_3 = r2_score(Y_Test,pr_predict)
print("R-squared (R2) Score:", r2_3)
mae3 = mean_absolute_error(Y_Test, pr_predict)
print("Mean Absolute Error (MAE):", mae3)
rmse3 = np.sqrt(mse3)
print("Root Mean Squared Error (RMSE):", rmse3)
Polynomial Regression
Mean Squared Error: 14.87879644098147
R-squared (R2) Score: 0.8430551371938841
Mean Absolute Error (MAE): 3.2299819836597266
Root Mean Squared Error (RMSE): 3.857304297171986
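The objective also calls for a bar graph of the results. One possible sketch is given below, using the metric values printed above rounded to two decimals. The grouped-bar layout, labels, and output file name are illustrative choices, not taken from the original notebook. Because the two models were trained on different datasets, the chart is descriptive rather than a like-for-like comparison.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Metric values copied from the printed results (rounded to two decimals)
metric_names = ["MSE", "RMSE", "MAE", "R2"]
linear_scores = [14.20, 3.77, 3.08, 0.95]
poly_scores = [14.88, 3.86, 3.23, 0.84]

positions = np.arange(len(metric_names))
width = 0.35  # width of each bar within a group

fig, ax = plt.subplots()
ax.bar(positions - width / 2, linear_scores, width, label="Linear Regression")
ax.bar(positions + width / 2, poly_scores, width, label="Polynomial Regression")
ax.set_xticks(positions)
ax.set_xticklabels(metric_names)
ax.set_ylabel("Score")
ax.set_title("Regression performance metrics")
ax.legend()

fig.savefig("regression_metrics.png")  # hypothetical output file name
```

In a notebook, the backend line can be dropped and plt.show() called instead of saving to a file.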