Logistic Regression
Anita Shirol
Introduction
Logistic regression is a supervised machine learning algorithm mainly used for
classification tasks, where the goal is to predict the probability that an
instance belongs to a given class.
Although it is a classification algorithm, it is called logistic regression
because it takes the output of the linear regression function as its input.
It then uses a sigmoid function to estimate the probability of the given class.
The difference between linear regression and logistic regression is that linear
regression outputs a continuous value that can be any real number, while logistic
regression predicts the probability that an instance belongs to a given class.
Logistic regression predicts the output of a categorical dependent variable, so
the outcome must be a categorical or discrete value.
It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact
value 0 or 1, the model gives a probability that lies between 0 and 1.
Logistic regression is very similar to linear regression; the difference lies in how
they are used. Linear regression is used for solving regression problems, whereas
logistic regression is used for solving classification problems.
In logistic regression, instead of fitting a regression line, we fit an “S”-shaped logistic
function whose output is bounded by two limiting values, 0 and 1.
The curve from the logistic function indicates the likelihood of something, such as
whether cells are cancerous or not, or whether a mouse is obese or not based on its
weight.
Logistic regression is a significant machine learning algorithm because it can provide
probabilities and classify new data using both continuous and discrete features.
Logistic regression can classify observations using different types of data, and its
fitted coefficients help identify the variables that matter most for the classification.
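One way to look at which variables matter most is to compare coefficient magnitudes after standardizing the features. The sketch below is only an illustration under assumptions not in the slides: it uses scikit-learn's breast cancer dataset, a `StandardScaler` pipeline, and `max_iter=1000`, and it treats a larger absolute coefficient as a rough sign of stronger influence.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Standardize first so coefficient magnitudes are comparable across features.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# make_pipeline names each step after its lowercased class name.
coefs = model.named_steps["logisticregression"].coef_[0]

# Indices of the five coefficients with the largest absolute value.
top = np.argsort(np.abs(coefs))[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {coefs[i]:+.3f}")
```

Coefficient size is only a rough guide: correlated features can share or trade off weight, so the ranking should be read with care.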
Logistic Function (Sigmoid Function)
The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
It maps any real value into another value within the range 0 to 1.
The output of logistic regression must lie between 0 and 1 and cannot go
beyond these limits, so the function forms a curve shaped like an “S”.
The S-form curve is called the Sigmoid function or the logistic function.
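The mapping can be sketched in a few lines of NumPy. The function name `sigmoid` and the sample inputs are illustrative choices, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 maps exactly to 0.5 (the middle of the "S").
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
p = sigmoid(z)
print(np.round(p, 4))
```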
In logistic regression, we use a threshold value to convert the predicted
probability into a class label of 0 or 1: probabilities above the threshold
are assigned to class 1, and probabilities below it are assigned to class 0.
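A minimal sketch of thresholding, assuming the common default of 0.5. The probabilities are made-up values for illustration, and treating a probability exactly equal to the threshold as class 1 is a convention chosen here.

```python
import numpy as np

# Hypothetical predicted probabilities for five instances (made-up values).
probabilities = np.array([0.05, 0.40, 0.50, 0.73, 0.98])
threshold = 0.5  # assumed default; it can be tuned for the application

# Probabilities at or above the threshold become class 1, the rest class 0.
labels = (probabilities >= threshold).astype(int)
print(labels)  # [0 0 1 1 1]

# A stricter threshold assigns fewer instances to class 1.
strict_labels = (probabilities >= 0.8).astype(int)
print(strict_labels)  # [0 0 0 0 1]
```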
Types
Binomial: In binomial logistic regression, the dependent variable can take only
two possible values, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial logistic regression, the dependent variable can take
3 or more possible unordered values, such as “cat”, “dog”, or “sheep”.
Ordinal: In ordinal logistic regression, the dependent variable can take 3 or more
possible ordered values, such as “Low”, “Medium”, or “High”.
Difference between Linear Regression and Logistic Regression
Terminologies
Independent variables: The input characteristics or predictor factors used to predict the
dependent variable.
Dependent variable: The target variable in a logistic regression model, which we are trying to
predict.
Logistic function: The formula used to represent how the independent and dependent variables
relate to one another. The logistic function transforms a linear combination of the input variables
into a probability value between 0 and 1, which represents the likelihood of the dependent variable being 1.
Odds: The ratio of the probability of something occurring to the probability of it not occurring. It
differs from probability, which is the ratio of something occurring to everything that could possibly
occur.
Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In
logistic regression, the log odds of the dependent variable are modeled as a linear combination of
the independent variables and the intercept.
Coefficient: The logistic regression model’s estimated parameters, which show how the independent
and dependent variables relate to one another.
Intercept: A constant term in the logistic regression model, which represents the log odds when all
independent variables are equal to zero.
Maximum likelihood estimation: The method used to estimate the coefficients of the logistic
regression model, which maximizes the likelihood of observing the data given the model.
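The relationship between probability, odds, and log-odds defined above can be checked numerically. The helper names `odds`, `log_odds`, and `sigmoid` are illustrative, and p = 0.8 is an arbitrary example value.

```python
import math

def odds(p):
    """Odds: probability of occurring divided by probability of not occurring."""
    return p / (1.0 - p)

def log_odds(p):
    """Log-odds (logit): natural logarithm of the odds."""
    return math.log(odds(p))

def sigmoid(z):
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
print(odds(p))               # about 4: occurring is four times as likely as not
print(log_odds(p))           # about 1.386, the natural log of 4
print(sigmoid(log_odds(p)))  # recovers about 0.8
```

A probability of 0.5 gives odds of 1 and log-odds of 0, which is why a log-odds of 0 corresponds to the usual 0.5 threshold.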
How does Logistic Regression work?
The logistic regression model transforms the continuous output of the linear regression
function into a probability, and from there into a categorical output, using a sigmoid
function, which maps any real-valued combination of the independent variables to a value
between 0 and 1. This function is known as the logistic function.
Let the independent input features be:
Sigmoid
Equation
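In standard notation, the setup can be written as follows. The symbols are assumptions, since the slides do not fix them: X is the n-by-m feature matrix, x is one row of it, w is the weight vector, b is the intercept, and p(x) is the predicted probability of class 1.

```latex
X = \begin{bmatrix}
      x_{11} & \cdots & x_{1m} \\
      \vdots & \ddots & \vdots \\
      x_{n1} & \cdots & x_{nm}
    \end{bmatrix},
\qquad y_i \in \{0, 1\}

z = w^{\top} x + b

\sigma(z) = \frac{1}{1 + e^{-z}},
\qquad
p(x) = P(y = 1 \mid x) = \sigma(w^{\top} x + b)
```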
Odds of occurrence
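With p(x) = 1 / (1 + e^{-(w^T x + b)}) as the predicted probability of class 1 (notation assumed, with w the weights and b the intercept), the odds and log-odds take the standard form below.

```latex
\frac{p(x)}{1 - p(x)} = e^{\,w^{\top} x + b}
\quad\Longrightarrow\quad
\log \frac{p(x)}{1 - p(x)} = w^{\top} x + b
```

The log-odds are linear in the features, which is the sense in which logistic regression is a linear model.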
Likelihood function for Logistic Regression
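For n independent observations (x_i, y_i) with y_i in {0, 1} and predicted probability p(x_i) = 1 / (1 + e^{-(w^T x_i + b)}), the standard likelihood and log-likelihood are given below. The notation is assumed rather than taken from the slides.

```latex
L(w, b) = \prod_{i=1}^{n} p(x_i)^{y_i} \, \bigl(1 - p(x_i)\bigr)^{1 - y_i}

\ell(w, b) = \log L(w, b)
           = \sum_{i=1}^{n} \Bigl[ y_i \log p(x_i)
             + (1 - y_i) \log\bigl(1 - p(x_i)\bigr) \Bigr]
```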
Gradient of the log-likelihood function
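For the standard log-likelihood, the gradient with respect to the weights is the sum over observations of (y_i - p(x_i)) x_i, and the gradient with respect to the intercept is the sum of (y_i - p(x_i)). A minimal sketch of maximizing the log-likelihood by gradient ascent follows. The synthetic data, the learning rate, the iteration count, and averaging the gradient over the sample are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, w, b):
    """Sum of y*log(p) + (1 - y)*log(1 - p) over all observations."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guards against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_gradient_ascent(X, y, lr=0.1, n_iter=500):
    """Fit weights and intercept by repeatedly stepping along the gradient."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        error = y - sigmoid(X @ w + b)   # (y_i - p(x_i)) for every observation
        w += lr * (X.T @ error) / n      # averaged gradient w.r.t. the weights
        b += lr * error.mean()           # averaged gradient w.r.t. the intercept
    return w, b

# Synthetic data: labels drawn from a known logistic model (made-up parameters).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_w + 0.5)).astype(float)

w, b = fit_gradient_ascent(X, y)
print("log-likelihood at zero weights:", log_likelihood(X, y, np.zeros(2), 0.0))
print("log-likelihood after fitting:  ", log_likelihood(X, y, w, b))
print("estimated weights and intercept:", w, b)
```

Libraries such as scikit-learn use more sophisticated solvers, but the quantity being optimized is the same up to regularization.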
Assumptions for Logistic Regression
The assumptions for Logistic regression are as follows:
Independent observations: Each observation is independent of the others,
meaning no observation is a repeat of, or otherwise correlated with, another.
Binary dependent variable: The dependent variable is assumed to be binary
or dichotomous, meaning it can take only two values.
For more than two categories, the softmax function is used.
Linear relationship between independent variables and log odds: The
relationship between the independent variables and the log odds of the
dependent variable should be linear.
No extreme outliers: There should be no extreme outliers that unduly influence the fit.
Large sample size: The sample size should be sufficiently large.
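The softmax function mentioned under the binary assumption generalizes the sigmoid to more than two classes. A minimal NumPy sketch follows. The scores are made-up values, and subtracting the maximum score is a common numerical-stability choice rather than something from the slides.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of real-valued class scores into probabilities summing to 1."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=-1, keepdims=True)  # avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores for three classes.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())
print("predicted class:", int(np.argmax(probs)))

# With two classes and scores [z, 0], softmax reduces to the sigmoid of z.
z = 1.5
print(softmax([z, 0.0])[0], sigmoid(z))
```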
Binomial
Code
# import the necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)
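# split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=23)
# LogisticRegression; max_iter is raised above the default of 100 because
# the solver may not converge on these unscaled features
clf = LogisticRegression(max_iter=10000, random_state=0)
clf.fit(X_train, y_train)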
# Prediction
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Logistic Regression model accuracy (in %):", acc*100)
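Since the model produces probabilities before it produces labels, it can help to look at them directly. The sketch below refits the same kind of model and calls `predict_proba`. The `max_iter=10000` setting and the choice to inspect the first five test rows are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=23)

clf = LogisticRegression(max_iter=10000, random_state=0)
clf.fit(X_train, y_train)

# One row per instance, one column per class (ordered as in clf.classes_).
proba = clf.predict_proba(X_test[:5])
labels = clf.predict(X_test[:5])
print(proba)
print(labels)
```

Each row sums to 1, and the predicted label is the class with the larger probability, which for two classes is the same as applying a 0.5 threshold.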
Multi class
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model, metrics
# load the digit dataset
digits = datasets.load_digits()
# defining feature matrix(X) and response vector(y)
X = digits.data
y = digits.target
# splitting X and y into training and testing sets
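X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)
# create logistic regression object; max_iter is raised above the default
# of 100 because the solver may not converge on the raw pixel features
reg = linear_model.LogisticRegression(max_iter=10000)
# train the model using the training sets
reg.fit(X_train, y_train)
# making predictions on the testing set
y_pred = reg.predict(X_test)
# compare predictions with the true labels
print("Logistic Regression model accuracy (in %):",
      metrics.accuracy_score(y_test, y_pred) * 100)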
Cost function
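A common cost function for logistic regression is the log loss (binary cross-entropy), which is the negative of the average log-likelihood. The sketch below assumes that definition. The function name `log_loss`, the clipping constant, and the example labels and probabilities are illustrative.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for probabilities of exactly 0 or 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = [1, 0, 1, 1]
confident_right = [0.9, 0.1, 0.8, 0.95]  # made-up probabilities
confident_wrong = [0.1, 0.9, 0.2, 0.05]  # made-up probabilities

print(log_loss(y_true, confident_right))  # small cost
print(log_loss(y_true, confident_wrong))  # much larger cost
```

Confident wrong predictions are penalized heavily, so minimizing this cost is equivalent to maximizing the likelihood.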