
Artificial Intelligence & Data Science

Trainer:
FARHAD M RIAZ

Submitted By:
AIMAN SOHAIL

LAB TASK
Regression in Machine Learning

High Impact IT Training

June 2024 - December 2024


1. Introduction:
Regression is a method used to model the relationship between a dependent variable and one or
more independent variables. It has many types; here we discuss three: linear, multiple linear, and
polynomial regression.
Linear regression fits a regression model that describes a straight-line relationship between one or
more predictor variables and a numeric response variable. Multiple linear regression is a statistical
technique used to model the relationship between one dependent variable and two or more independent
variables. Polynomial regression fits a regression model in which the relationship between the predictor
variables and the numeric response variable is described by an nth-degree polynomial, which lets it
capture non-linear patterns.
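In general form, the three models can be written as follows, where $\beta$ denotes the coefficients estimated from the data and $\varepsilon$ the error term:

$$y = \beta_0 + \beta_1 x + \varepsilon \quad \text{(simple linear)}$$
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon \quad \text{(multiple linear)}$$
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d + \varepsilon \quad \text{(polynomial of degree } d\text{)}$$

Note that polynomial regression is still linear in its coefficients, which is why it can be fit with ordinary least squares after expanding the features.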

1.1 Objective:
The objective of this task is to apply each of the regression types mentioned above to a dataset suited
to that type, to evaluate performance by computing the mean squared error (MSE), root mean squared
error (RMSE), mean absolute error (MAE), and R-squared (R2) score, and to visualize the results using a
bar graph.

2. Methodology:
The procedure for applying these regression types to the relevant datasets is as follows (a minimal code
sketch of the full pipeline appears after this list):
 First, import the required libraries: 'sklearn' (models, preprocessing, and performance metrics),
'NumPy' (numerical operations), 'pandas' (data handling), and 'matplotlib' (plotting).
 Read the data from the dataset using the pandas library and store it in a variable.
 Preprocess the data: check for null values and textual columns. Drop any null columns and convert
textual columns to numerical form using an encoder; here, LabelEncoder is used.
 Drop the target column from x (which then contains all attributes except the target) and store the
target column in y. Depending on the dataset, the target is 'Marks', 'Performance Index', or
'Ice Cream Sales (units)'.
 Split the data into training and test sets, then apply each regression type.
 For each regression type:
1. First, build the regression model.
2. Train the model on the training data (x_train, or whichever variable you have declared) so that
it can predict the test data.
3. After training, generate predictions on the test data (x_test, or whichever variable you have
declared).
4. After training and testing, perform a performance analysis to see how well the regression
model fits the given dataset.
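
The following is a minimal sketch of that pipeline, assuming a hypothetical file data.csv with a numeric
target column named 'target' (the actual file paths and column names used in this task appear in the
full code listing at the end of the report):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Read the dataset (hypothetical path and column names, for illustration only)
data = pd.read_csv("data.csv")

# Preprocess: encode any textual columns as numbers
le = LabelEncoder()
for col in data.select_dtypes(include="object").columns:
    data[col] = le.fit_transform(data[col])

# Separate features (x) from the target (y)
x = data.drop(columns=["target"])
y = data["target"]

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Build, train, and predict
model = LinearRegression()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

# Performance analysis
mse = mean_squared_error(y_test, predictions)
print("MSE: ", mse)
print("R2:  ", r2_score(y_test, predictions))
print("MAE: ", mean_absolute_error(y_test, predictions))
print("RMSE:", np.sqrt(mse))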

3. Result Analysis:
The following are the mean squared error, root mean squared error, mean absolute error, and R-squared
scores for linear, multiple linear, and polynomial regression.
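
For reference, with $y_i$ the actual values, $\hat{y}_i$ the predicted values, $\bar{y}$ the mean of the actual values, and $n$ the number of test samples, the metrics are defined as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \qquad R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

Lower values are better for MSE, RMSE, and MAE; for the R-squared score, values closer to 1 indicate a better fit.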

Linear Regression:
Following are the results of Linear Regression:
Mean Squared Error: 14.20
R-squared (R2) Score: 0.95
Mean Absolute Error (MAE): 3.08
Root Mean Squared Error (RMSE): 3.77

Multiple Linear Regression:
Following are the results of Multiple Linear Regression:
Mean Squared Error: 4.08
R-squared (R2) Score: 0.99
Mean Absolute Error (MAE): 1.61
Root Mean Squared Error (RMSE): 2.02

Polynomial Regression:
Following are the results of Polynomial Regression:
Mean Squared Error: 14.88
R-squared (R2) Score: 0.84
Mean Absolute Error (MAE): 3.23
Root Mean Squared Error (RMSE): 3.86
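
The table below consolidates these results for comparison:

Model                        MSE     R2     MAE    RMSE
Linear Regression            14.20   0.95   3.08   3.77
Multiple Linear Regression    4.08   0.99   1.61   2.02
Polynomial Regression        14.88   0.84   3.23   3.86
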
4. Conclusion:
After analyzing the results, we conclude that different regression types suit different datasets: simple
linear regression for a single linear relationship, multiple linear regression when several independent
factors contribute linearly, and polynomial regression for complex, non-linear patterns. On the datasets
used in this task, multiple linear regression performed best (R2 = 0.99), while polynomial regression was
the least effective (R2 = 0.84).

5. Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

Linear Regression
data1 = pd.read_csv("C:\\Users\\DELL\\Downloads\\Student_Marks.csv")
data1.head()

   number_courses  time_study   Marks
0               3       4.508  19.202
1               4       0.096   7.734
2               4       3.133  13.811
3               6       7.909  53.018
4               8       7.811  55.299

data1.describe()

       number_courses  time_study       Marks
count      100.000000  100.000000  100.000000
mean         5.290000    4.077140   24.417690
std          1.799523    2.372914   14.326199
min          3.000000    0.096000    5.609000
25%          4.000000    2.058500   12.633000
50%          5.000000    4.022000   20.059500
75%          7.000000    6.179250   36.676250
max          8.000000    7.957000   55.299000

data1.info()  # note: info() must be called with parentheses

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   number_courses  100 non-null    int64
 1   time_study      100 non-null    float64
 2   Marks           100 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 2.5 KB

data1.isnull().sum()

number_courses 0
time_study 0
Marks 0
dtype: int64

sns.heatmap(data1.corr(),annot=True)

<Axes: >

sns.scatterplot(x='time_study', y='Marks', hue='number_courses', data=data1)
plt.show()

x = data1[['time_study','number_courses']]
y = data1['Marks']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(x_train,y_train)

lr_predict = lr.predict(x_test)

print('Actual Value')
print(y_test)
print('Predicted Value')
print(lr_predict)

Actual Value
83 16.106
53 36.653
70 16.606
45 8.924
44 9.742
39 51.142
22 12.209
80 54.321
10 42.036
0 19.202
18 50.986
30 24.172
73 7.014
33 39.965
90 24.394
4 55.299
76 36.746
77 38.278
12 24.318
31 8.100
Name: Marks, dtype: float64
Predicted Value
[19.27278272 37.76035676 20.18779372  9.65670863 10.97508223 44.81200554
 13.34810968 47.62447209 37.01567001 22.30738483 44.48113375 28.33573684
  7.42332402 38.70638699 28.10820618 48.74122069 35.72331468 39.30103485
 28.29432156  8.94326632]

Multiple Linear Regression


data2 = pd.read_csv("C:\\Users\\DELL\\Downloads\\Student_Performance.csv")
data2.head()

   Hours Studied  Previous Scores Extracurricular Activities  Sleep Hours  \
0              7               99                        Yes            9
1              4               82                         No            4
2              8               51                        Yes            7
3              5               52                        Yes            5
4              7               75                         No            8

   Sample Question Papers Practiced  Performance Index
0                                 1               91.0
1                                 2               65.0
2                                 2               45.0
3                                 2               36.0
4                                 5               66.0

data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Hours Studied 10000 non-null int64
1 Previous Scores 10000 non-null int64
2 Extracurricular Activities 10000 non-null object
3 Sleep Hours 10000 non-null int64
4 Sample Question Papers Practiced 10000 non-null int64
5 Performance Index 10000 non-null float64
dtypes: float64(1), int64(4), object(1)
memory usage: 468.9+ KB

data2.isnull().sum()

Hours Studied 0
Previous Scores 0
Extracurricular Activities 0
Sleep Hours 0
Sample Question Papers Practiced 0
Performance Index 0
dtype: int64

data2.describe()

       Hours Studied  Previous Scores   Sleep Hours  \
count   10000.000000     10000.000000  10000.000000
mean        4.992900        69.445700      6.530600
std         2.589309        17.343152      1.695863
min         1.000000        40.000000      4.000000
25%         3.000000        54.000000      5.000000
50%         5.000000        69.000000      7.000000
75%         7.000000        85.000000      8.000000
max         9.000000        99.000000      9.000000

       Sample Question Papers Practiced  Performance Index
count                      10000.000000       10000.000000
mean                           4.583300          55.224800
std                            2.867348          19.212558
min                            0.000000          10.000000
25%                            2.000000          40.000000
50%                            5.000000          55.000000
75%                            7.000000          71.000000
max                            9.000000         100.000000

# Encode the textual Yes/No column as numeric 1/0
le = LabelEncoder()
data2['Extracurricular Activities'] = le.fit_transform(data2['Extracurricular Activities'])
data2.head()

   Hours Studied  Previous Scores  Extracurricular Activities  Sleep Hours  \
0              7               99                           1            9
1              4               82                           0            4
2              8               51                           1            7
3              5               52                           1            5
4              7               75                           0            8

   Sample Question Papers Practiced  Performance Index
0                                 1               91.0
1                                 2               65.0
2                                 2               45.0
3                                 2               36.0
4                                 5               66.0

sns.heatmap(data2.corr(), annot=True)

<Axes: >
X = data2.iloc[:,:-1].values
Y = data2.iloc[:,-1].values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

mlr = LinearRegression()
mlr.fit(X_train,Y_train)

mlr_predict = mlr.predict(X_test)

print('Actual Value')
print(Y_test)
print('Predicted Value')
print(mlr_predict)

Actual Value
[51. 20. 46. ... 16. 65. 47.]
Predicted Value
[54.71185392 22.61551294 47.90314471 ... 16.79341955 63.34327368
45.94262301]

Polynomial Regression
data3 = pd.read_csv("C:\\Users\\DELL\\Downloads\\Ice_cream selling.csv")
data3.head()

   Temperature (°C)  Ice Cream Sales (units)
0         -4.662263                41.842986
1         -4.316559                34.661120
2         -4.213985                39.383001
3         -3.949661                37.539845
4         -3.578554                32.284531

data3.isnull().sum()

Temperature (°C) 0
Ice Cream Sales (units) 0
dtype: int64

data3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Temperature (°C) 49 non-null float64
1 Ice Cream Sales (units) 49 non-null float64
dtypes: float64(2)
memory usage: 916.0 bytes

data3.describe()

       Temperature (°C)  Ice Cream Sales (units)
count         49.000000                49.000000
mean           0.271755                15.905308
std            2.697672                12.264682
min           -4.662263                 0.328626
25%           -2.111870                 4.857988
50%            0.688781                12.615181
75%            2.784836                25.142082
max            4.899032                41.842986

sns.heatmap(data3.corr(), annot=True)

<Axes: >
# Plotting the scatter plot
plt.scatter(data3['Temperature (°C)'], data3['Ice Cream Sales (units)'], color='blue')

# Adding titles and labels
plt.title('Temperature vs. Ice Cream Sales')
plt.xlabel('Temperature (°C)')
plt.ylabel('Ice Cream Sales (units)')

# Show the plot
plt.show()

X1 = data3.iloc[:,:-1].values
Y1 = data3.iloc[:,-1].values

X_Train, X_Test, Y_Train, Y_Test = train_test_split(X1, Y1, test_size=0.2, random_state=42)

# Expand the single feature into polynomial terms (1, x, x^2)
degree = 2
poly_features = PolynomialFeatures(degree=degree)
x_train_poly = poly_features.fit_transform(X_Train)
x_test_poly = poly_features.transform(X_Test)  # use transform (not fit_transform) on test data

pr = LinearRegression()
pr.fit(x_train_poly, Y_Train)

pr_predict = pr.predict(x_test_poly)

print('Actual Value')
print(Y_Test)
print('Predicted Value')
print(pr_predict)

Actual Value
[10.01286785 17.84395652 27.69838335 28.91218793  9.39296866  2.31380636
  0.78997365  4.62568946  4.8579878   4.67364254]
Predicted Value
[10.24499852 25.82807714 31.8241041  24.61704203  4.43380256  4.00785672
  3.58277491  3.17865087  7.86990452  2.91591368]

Performance Analysis
print('Linear Regression')
mse1 = mean_squared_error(y_test,lr_predict)
print('Mean Squared Error: ',mse1)
r2_1 = r2_score(y_test,lr_predict)
print("R-squared (R2) Score:", r2_1)
mae1 = mean_absolute_error(y_test, lr_predict)
print("Mean Absolute Error (MAE):", mae1)
rmse1 = np.sqrt(mse1)
print("Root Mean Squared Error (RMSE):", rmse1)

Linear Regression
Mean Squared Error: 14.200726136374588
R-squared (R2) Score: 0.9459936100591212
Mean Absolute Error (MAE): 3.079345229666688
Root Mean Squared Error (RMSE): 3.768385083344666

print('Multiple Linear Regression')
mse2 = mean_squared_error(Y_test,mlr_predict)
print('Mean Squared Error: ',mse2)
r2_2 = r2_score(Y_test,mlr_predict)
print("R-squared (R2) Score:", r2_2)
mae2 = mean_absolute_error(Y_test, mlr_predict)
print("Mean Absolute Error (MAE):", mae2)
rmse2 = np.sqrt(mse2)
print("Root Mean Squared Error (RMSE):", rmse2)

Multiple Linear Regression
Mean Squared Error: 4.082628398521854
R-squared (R2) Score: 0.9889832909573145
Mean Absolute Error (MAE): 1.6111213463123044
Root Mean Squared Error (RMSE): 2.020551508505006

print('Polynomial Regression')
mse3 = mean_squared_error(Y_Test,pr_predict)
print('Mean Squared Error: ',mse3)
r2_3 = r2_score(Y_Test,pr_predict)
print("R-squared (R2) Score:", r2_3)
mae3 = mean_absolute_error(Y_Test, pr_predict)
print("Mean Absolute Error (MAE):", mae3)
rmse3 = np.sqrt(mse3)
print("Root Mean Squared Error (RMSE):", rmse3)

Polynomial Regression
Mean Squared Error: 14.87879644098147
R-squared (R2) Score: 0.8430551371938841
Mean Absolute Error (MAE): 3.2299819836597266
Root Mean Squared Error (RMSE): 3.857304297171986
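
Bar Graph Visualization
The objective also calls for a bar-graph comparison of these metrics across the three models. The
following is one possible sketch, assuming the metric variables computed above (mse1 ... rmse3) are
still in scope:

# Grouped bar chart comparing the four metrics for the three regression models
import numpy as np
import matplotlib.pyplot as plt

models = ['Linear', 'Multiple Linear', 'Polynomial']
metrics = {
    'MSE':  [mse1, mse2, mse3],
    'R2':   [r2_1, r2_2, r2_3],
    'MAE':  [mae1, mae2, mae3],
    'RMSE': [rmse1, rmse2, rmse3],
}

x = np.arange(len(models))   # one group of bars per model
width = 0.2                  # width of each bar within a group

for i, (name, values) in enumerate(metrics.items()):
    plt.bar(x + i * width, values, width, label=name)

plt.xticks(x + 1.5 * width, models)  # center the tick labels under each group
plt.ylabel('Score')
plt.title('Performance Comparison of Regression Models')
plt.legend()
plt.show()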
