
Model Building using Machine Learning Techniques

Presented by
Dr. J.V. Bibal Benifa
Assistant Professor (Sl. Gr),
Data Science Research Group
Indian Institute of Information Technology, Kottayam
(An Institute of National Importance under MoE, Govt. of India)

Learning Algorithms
Supervised Learning


Classifier metrics
Accuracy - The proportion of correctly predicted instances out of the total instances.
Precision - The proportion of correctly predicted positive instances out of all predicted positives.
Recall (Sensitivity or True Positive Rate) - The proportion of correctly predicted positive instances out of all actual positives.
F1 Score - The harmonic mean of precision and recall, balancing the two metrics.
Specificity (True Negative Rate) - The proportion of correctly predicted negative instances out of all actual negatives.
ROC-AUC (Receiver Operating Characteristic - Area Under Curve) - Measures the ability of a classifier to distinguish between classes across all thresholds.
Log Loss (Logarithmic Loss) - Quantifies the uncertainty of predictions by penalizing incorrect classifications whose probabilities are far from the actual label.
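A minimal sketch of computing these metrics with scikit-learn (the labels and probabilities below are hypothetical illustration data):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual classes
y_prob = [0.9, 0.2, 0.6, 0.8, 0.4, 0.3, 0.4, 0.1]    # predicted P(class = 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]      # threshold at 0.5

print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1 Score :', f1_score(y_true, y_pred))

# Specificity = TN / (TN + FP), read off the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('Specificity:', tn / (tn + fp))

# Threshold-free / probabilistic metrics use the raw probabilities
print('ROC-AUC :', roc_auc_score(y_true, y_prob))
print('Log Loss:', log_loss(y_true, y_prob))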

Regression

Unsupervised Learning

Clustering Metrics

Reinforcement Learning

Linear & Logistic Regression Model Building

Presented by
Dr. J.V. Bibal Benifa
Assistant Professor (Sl. Gr),
Data Science Research Group
Indian Institute of Information Technology, Kottayam
(An Institute of National Importance under MoE, Govt. of India)


Contents
• Introduction to Regression
• Simple Linear Regression
• Multiple Linear Regression
• Logistic Regression with linear models
• Multiple Logistic Regression with linear models
• Conclusion
• References

Introduction to Regression

• Regression analysis is a predictive modelling technique which investigates the relationship between a dependent (target) variable and an independent (predictor) variable.
• Regression analysis is an important tool for modelling and analyzing data.
• It reveals significant relationships between the dependent variable and the independent variables.
• It reveals the strength of impact of multiple independent variables on a dependent variable.

Predictive Model

Linear Regression
• The linear regression method models a linear relationship between a dependent variable (y) and one or more independent variables (x):

y = a0 + a1x + ε

➢ Simple & Multiple Linear Regression

Mathematical Representation of LR
Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,
y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value; the slope/gradient)
ε = Random error

The values of the x and y variables form the training dataset for the linear regression model.

LR Cont.
The Mean Squared Error (MSE) cost function is the average of the squared errors between the predicted values and the actual values:

MSE = (1/N) Σ (yi − (a1xi + a0))²

Where,
N = Total number of observations
yi = Actual value
(a1xi + a0) = Predicted value

Minimizing the MSE gives the least-squares estimates

a1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
a0 = ȳ − a1x̄

and the coefficient of determination

R² = [ (1/N) Σ (xi − x̄)(yi − ȳ) / (σx σy) ]²
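A minimal sketch of these closed-form estimates (numpy assumed; the data points are hypothetical):

import numpy as np

# Hypothetical training data
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 6.8, 9.1, 10.9])

x_bar, y_bar = x.mean(), y.mean()
a1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope
a0 = y_bar - a1 * x_bar                                             # intercept

mse = np.mean((y - (a1 * x + a0)) ** 2)   # MSE cost at the fitted line
# R-squared via the correlation form above (population std, ddof=0)
r2 = (np.mean((x - x_bar) * (y - y_bar)) / (x.std() * y.std())) ** 2

print(a0, a1, mse, r2)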

Example-LR

Example Cont.

With Σ(xi − x̄)(yi − ȳ) = 470, Σ(xi − x̄)² = 730, x̄ = 78, ȳ = 77 and N = 5:

a1 = 470 / 730 = 0.644
a0 = ȳ − a1x̄ = 77 − (0.644 × 78) = 26.768

so the fitted line is

ŷ = a0 + a1x = 26.768 + 0.644x

Prediction at x = 80:
ŷ = 26.768 + 0.644 × 80 = 78.28

σx = √( Σ(xi − x̄)² / N ) = 12.083
σy = 11.225

R² = [ (1/5) × 470 / (12.083 × 11.225) ]² = 0.48
Predict the speed of a 10-year-old car:

from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

speed = myfunc(10)
print(speed)   # ≈ 85.59

r ≈ −0.76, indicating a moderately strong (negative) linear relationship.

Multiple Linear Regression
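Written out in the notation used earlier, multiple linear regression extends the simple model to several predictors:

y = a0 + a1x1 + a2x2 + … + anxn + ε

where each coefficient ai is the scale factor for its predictor xi, a0 is the intercept and ε is the random error.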

Implementation of Multiple Linear Regression model using Python

Step-1: Data Pre-processing
• Dataset: https://www.kaggle.com/farhanmd29/50-startups
• Features: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('50_CompList.csv')

Example Cont.

# Extracting independent and dependent variables
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 4].values

# Encoding categorical data (the 'State' column at index 3).
# ColumnTransformer + OneHotEncoder replaces the deprecated
# LabelEncoder / OneHotEncoder(categorical_features=[3]) pattern.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
onehotencoder = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough')
x = onehotencoder.fit_transform(x)

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Fitting the MLR model to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
Example Cont.

Step-3: Prediction of test set results

# Predicting the test set result
y_pred = regressor.predict(x_test)

print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))
Boston house pricing dataset - MLR
• Dataset: https://www.kaggle.com/vikrishnan/boston-house-prices
• 506 samples and 13 feature variables in this dataset

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)

# model evaluation for training set
y_train_predict = lin_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(Y_train, y_train_predict))
r2 = r2_score(Y_train, y_train_predict)

print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")

# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(Y_test, y_test_predict))
r2 = r2_score(Y_test, y_test_predict)

print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
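The snippet above assumes X_train, X_test, Y_train, Y_test already exist. A minimal way to produce them, assuming the Kaggle CSV above has been saved as 'housing.csv' with the target in a 'MEDV' column (both names are assumptions, not fixed by the dataset page):

# Hypothetical loading step: the file name and target column 'MEDV' are assumptions
boston = pd.read_csv('housing.csv')
X = boston.drop('MEDV', axis=1)   # the 13 feature variables
Y = boston['MEDV']                # target: median house value

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)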
Rainfall prediction using Linear regression
• Dataset: https://www.kaggle.com/grubenm/austin-weather

General Applications
• Trend lines
• Economics
• Finance
• Biology
• Agriculture
• Sports

Performance Metrics for Regression (a computation sketch follows this list)
• Mean Absolute Error (MAE)
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• R-Squared
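A minimal sketch of these four metrics with scikit-learn (the actual and predicted values are hypothetical):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])   # hypothetical actual values
y_pred = np.array([2.8, 5.4, 7.1, 9.3])   # hypothetical predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)

print(mae, mse, rmse, r2)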

Assumptions of LR
• Linear relationship between the features and target
• Small or no multi-collinearity between the features
• Homoscedasticity assumption (see the diagnostic sketch below)
• Normal distribution of error terms
• No autocorrelations
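A minimal diagnostic sketch for two of these assumptions, reusing the fitted regressor, x_train and y_train from the MLR example earlier (matplotlib and scipy assumed):

import matplotlib.pyplot as mtp
from scipy import stats

residuals = y_train - regressor.predict(x_train)

# Homoscedasticity: residuals should scatter evenly around zero
mtp.scatter(regressor.predict(x_train), residuals)
mtp.axhline(0, color='red')
mtp.xlabel('Predicted value')
mtp.ylabel('Residual')
mtp.show()

# Normality of error terms: Shapiro-Wilk test (p > 0.05 is consistent with normality)
stat, p = stats.shapiro(residuals)
print('Shapiro-Wilk p-value:', p)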

Limitations of Linear Regression
• It over-simplifies real-world problems by assuming a linear relationship among the variables.
• Linear regression is sensitive to outliers.
• It assumes normally distributed error terms.

Multiple Linear Regression

References
• David L. Poole, Alan K. Mackworth, "Artificial Intelligence: Foundations of Computational Agents", 2nd Edition, pp. 283-305.

Diabetes dataset
Logistic Regression

To predict whether a student will be admitted to a school based on CGPA
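Logistic regression maps a linear combination of the inputs through the sigmoid function, p = 1 / (1 + e^−(a0 + a1x)), so the output is a probability between 0 and 1 that can be thresholded into a class. A minimal sketch for this admission example, with hypothetical CGPA/admission data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical CGPA values and admission outcomes (1 = admitted)
cgpa = np.array([[6.2], [6.8], [7.1], [7.9], [8.4], [8.9], [9.3], [9.6]])
admitted = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(cgpa, admitted)

# predict_proba returns [P(not admitted), P(admitted)] for each input
print(model.predict_proba([[8.0]]))
print(model.predict([[8.0]]))   # class label at the 0.5 threshold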

Result

Example - Python Implementation
• Data pre-processing step
• Fitting Logistic Regression to the training set
• Predicting the test result
• Test accuracy of the result (creation of the confusion matrix)
• Visualizing the test set result

Python code - Data Pre-processing step

# Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')

# Extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
# Output (parameter echo from an older scikit-learn version):
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#                    intercept_scaling=1, l1_ratio=None, max_iter=100,
#                    multi_class='warn', n_jobs=None, penalty='l2',
#                    random_state=0, solver='warn', tol=0.0001,
#                    verbose=0, warm_start=False)

Confusion Matrix and Predictor Output

# Predicting the test set result
y_pred = classifier.predict(x_test)

# Creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
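In the matrix, rows correspond to actual classes and columns to predicted classes; accuracy can be read off it directly:

# For a binary problem, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm.ravel()
print('Accuracy:', (tn + tp) / (tn + fp + fn + tp))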

Results and discussions

# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Multinomial Logistic Regression
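As a minimal sketch, scikit-learn extends logistic regression to more than two classes by fitting a softmax (multinomial) model over all classes (the toy data is hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical two-feature samples from three classes (0, 1, 2)
X = np.array([[1, 2], [2, 1], [5, 6], [6, 5], [9, 9], [10, 8]])
y = np.array([0, 0, 1, 1, 2, 2])

# 'multinomial' fits one softmax over all classes (the default with lbfgs in recent scikit-learn)
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
clf.fit(X, y)

print(clf.predict([[6, 6]]))         # most likely class
print(clf.predict_proba([[6, 6]]))   # one probability per class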

Assumptions & Applications of Logistic Regression

Assumptions
• Logistic regression requires the dependent variable to be categorical.
• Logistic regression requires there to be little or no multi-collinearity among the independent variables.
• There should be a linear relationship between the link function and independent variables in the logit model.

Applications
• Credit scoring
• Classification of healthcare data
• Text editing
• Gaming
Cautions & Pitfalls
• Choosing the right predictor variables
• Avoiding the use of highly correlated variables (see the sketch below)
• Handling continuous input variables
• Assumptions regarding the relationship between input and output variables
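A minimal sketch of screening for highly correlated predictors before fitting, reusing the pandas frame data_set loaded earlier (the 0.8 cut-off is an arbitrary rule of thumb):

# Pairwise correlations among numeric predictors; values near ±1 flag
# pairs that should not both enter the model
corr = data_set.select_dtypes('number').corr()
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)   # off-diagonal, strongly correlated
print(corr[high].stack())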

Conclusion

• Linear Regression is used when the dependent variable is continuous in nature.
• Logistic Regression is used when the dependent variable is binary or limited.
• Mathematical formulations of Linear and Logistic Regression were presented.
• Implementation of Linear and Logistic Regression was demonstrated.
• Research applications of Linear and Logistic Regression were discussed.

References
• Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining, "Introduction to Linear Regression Analysis", 5th Edition, Wiley.
• Sanford Weisberg, "Applied Linear Regression", Wiley, 2014.
• Ronald Christensen, "Log-Linear Models and Logistic Regression", Springer, New York, 1997.
• Scott Menard, "Logistic Regression: From Introductory to Advanced Concepts and Applications", SAGE Publications, 2010.

Queries…?

Write to me: [email protected]
