Logistic Regression in R Programming
Last Updated: 22 Apr, 2025
Logistic regression (also known as binomial logistic regression) in R Programming is a classification algorithm used to estimate the probability of an event's success or failure. It is used when the dependent variable is binary (0/1, True/False, Yes/No) in nature.
At the core of logistic regression is the logistic (or sigmoid) function, which maps any real-valued input to a value between 0 and 1 that can be interpreted as a probability. This allows the model to describe the relationship between the input features and the probability of the binary outcome.
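A minimal sketch of this function in R (the helper name sigmoid is our own, not part of any package):
R
# Logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

sigmoid(c(-5, 0, 5))   # approximately 0.0067, 0.5000, 0.9933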

Mathematical Implementation
Logistic regression is a type of generalized linear model (GLM) used for classification tasks, particularly when the response variable is binary. The goal is to model the probability that a given input belongs to a particular category. The output represents a probability, ranging between 0 and 1. It can be expressed as:
[Tex]P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}}[/Tex]
In logistic regression, the odds represent the ratio of the probability of success to the probability of failure. The odds ratio (OR) is a key concept that helps interpret logistic regression coefficients: it measures how the odds change with a one-unit increase in a predictor variable.
- An OR of 1 means the predictor does not change the odds of success.
- An OR of 2 means a one-unit increase doubles the odds of success.
- An OR of 0.5 means a one-unit increase halves the odds of success.

Odds Ratio
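A small numeric sketch of odds and odds ratios in R (the values are illustrative only, not taken from any model):
R
# Odds: probability of success divided by probability of failure
p <- 0.8
odds <- p / (1 - p)    # 4: success is four times as likely as failure

# Odds ratio for a one-unit increase in a predictor with coefficient beta
beta <- log(2)         # hypothetical coefficient
OR <- exp(beta)        # 2: the odds double with each one-unit increase
odds * OR              # 8: the new odds after a one-unit increase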
Since the outcome is binary and follows a binomial distribution, logistic regression uses the logit link function, which connects the linear model to the probability:
[Tex]\text{logit}(P) = \log\left( \frac{P}{1 - P} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n[/Tex]
This transformation ensures that the predicted probabilities stay within the (0, 1) interval and that the model is linear in the log-odds.
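In base R, qlogis() computes this logit (log-odds) transform and plogis() is its inverse, the logistic function, so both directions can be checked quickly:
R
p <- 0.75
qlogis(p)            # 1.098612: the log-odds, log(0.75 / 0.25)
plogis(qlogis(p))    # 0.75: the inverse logit recovers the original probability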
Logistic regression estimates the model parameters using maximum likelihood estimation. This approach finds the coefficients that make the observed outcomes most probable. Each coefficient [Tex]\beta_i[/Tex] in the logistic regression model represents the change in the log-odds of the outcome for a one-unit increase in the corresponding predictor [Tex]x_i[/Tex], assuming all other variables are held constant.
- If [Tex]\beta_i > 0[/Tex], an increase in [Tex]x_i[/Tex] increases the probability of success.
- If [Tex]\beta_i < 0[/Tex], an increase in [Tex]x_i[/Tex] decreases the probability of success.
This interpretation is crucial when analyzing how predictor variables influence the predicted outcome.
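For instance, with hypothetical coefficients [Tex]\beta_0 = -1[/Tex] and [Tex]\beta_1 = 0.8[/Tex] (values chosen only for illustration), the predicted probability rises as the predictor increases:
R
# Illustrative coefficients, not taken from a fitted model
beta0 <- -1
beta1 <- 0.8

x <- c(0, 1, 2)
plogis(beta0 + beta1 * x)   # 0.269, 0.450, 0.646: probability rises since beta1 > 0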
Code Implementation
We will implement the logistic regression model on the mtcars dataset and make some predictions to see the performance of the model.
1. Importing the Dataset
The mtcars dataset ships with base R, so it can be used without installing any additional package. We use the wt (weight of the car) and disp (engine displacement) columns as the predictor variables to estimate the engine type. The target variable we are trying to predict is vs, which tells us whether a car has a V-shaped (0) or straight (1) engine.
R
install.packages("dplyr")
library(dplyr)
head(mtcars)
Output:

Head of the Dataset
2. Splitting the Dataset
We are using the caTools package to randomly split the mtcars dataset into two parts: 80% for training (train_reg) and 20% for testing (test_reg). This allows us to train the logistic regression model on one set and evaluate its performance on unseen data.
R
install.packages("caTools")
library(caTools)
split <- sample.split(mtcars, SplitRatio = 0.8)
train_reg <- subset(mtcars, split == "TRUE")
test_reg <- subset(mtcars, split == "FALSE")
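As a quick sanity check (assuming the split above has been run), we can confirm the sizes of the two subsets and the class balance of the target in the training data:
R
nrow(train_reg)       # number of training rows
nrow(test_reg)        # number of test rows
table(train_reg$vs)   # distribution of the target variable in the training set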
3. Building the Model
Logistic regression is implemented in R with the glm() function, specifying family = "binomial" and training the model on the features in the training set.
R
logistic_model <- glm(vs ~ wt + disp,
                      data = train_reg,
                      family = "binomial")
logistic_model
summary(logistic_model)
Output:
Call:
glm(formula = vs ~ wt + disp, family = "binomial", data = train_reg)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6552  -0.4051   0.4446   0.6180   1.9191

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.58781    2.60087   0.610   0.5415
wt           1.36958    1.60524   0.853   0.3936
disp        -0.02969    0.01577  -1.882   0.0598 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.617  on 24  degrees of freedom
Residual deviance: 20.212  on 22  degrees of freedom
AIC: 26.212

Number of Fisher Scoring iterations: 6
In the output:
- Call: Shows the formula and dataset used to build the model.
- Deviance Residuals: Measure how well the model fits the data. Smaller values mean a better fit.
- Coefficients: Indicate the effect of each predictor on the log-odds of the outcome. Also includes standard errors.
- Significance Codes: Show how statistically significant each predictor is (e.g., ‘***’ means highly significant).
- Dispersion Parameter: For logistic regression, it’s fixed at 1 (since it uses a binomial distribution).
- Null Deviance: The model’s deviance when no predictors are used (only the intercept).
- Residual Deviance: The model's deviance after adding predictors. A value lower than the null deviance indicates that the predictors improve the fit.
- AIC (Akaike Information Criterion): Helps compare models. Lower AIC = better model (with fewer unnecessary variables).
- Fisher Scoring Iterations: The number of steps taken to find the best-fitting model.
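As an optional sketch (assuming the logistic_model fitted above), the coefficients can be converted to odds ratios by exponentiating them, which ties back to the odds-ratio interpretation discussed earlier:
R
# Coefficients are on the log-odds scale; exp() converts them to odds ratios
exp(coef(logistic_model))
# For example, exp(-0.02969) is about 0.971, so each additional unit of
# displacement multiplies the odds of a straight engine by roughly 0.97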
4. Predicting Test Data Based on the Model
We will use the model to predict some values from our test split. For each car in the test set, the model outputs a probability score. A higher score (close to 1) means the model is more confident that the car has a straight engine, while a lower score (close to 0) indicates it is likely a V-shaped engine.
R
predict_reg <- predict(logistic_model,
                       test_reg, type = "response")
predict_reg
Output:
Hornet Sportabout Merc 280C Merc 450SE Chrysler Imperial
0.01226166 0.78972164 0.26380531 0.01544309
AMC Javelin Camaro Z28 Ford Pantera L
0.06104267 0.02807992 0.01107943
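To turn these probabilities into class labels, a common choice (assumed here) is a 0.5 cutoff; the confusion-matrix code in the next step reuses the same conversion:
R
# Classify as 1 (straight engine) when the predicted probability exceeds 0.5
predict_class <- ifelse(predict_reg > 0.5, 1, 0)
predict_class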
5. Plotting a Confusion Matrix
We first convert the predicted probabilities into class labels using a 0.5 cutoff and build a confusion matrix comparing actual vs. predicted values. The table is then reshaped into a long-format data frame, and we create a heatmap with ggplot2 by mapping the counts to tile colors, providing a clear visual of the prediction performance.
R
library(ggplot2)

# Convert predicted probabilities to class labels using a 0.5 cutoff
predict_class <- ifelse(predict_reg > 0.5, 1, 0)

# Confusion matrix of actual vs. predicted classes
conf_matrix <- table(test_reg$vs, predict_class)

# Reshape the confusion matrix into long format for ggplot2
conf_matrix_melted <- as.data.frame(conf_matrix)
colnames(conf_matrix_melted) <- c("Actual", "Predicted", "Count")

ggplot(conf_matrix_melted, aes(x = Actual, y = Predicted, fill = Count)) +
  geom_tile() +
  geom_text(aes(label = Count), color = "black", size = 6) +  # add count labels
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Confusion Matrix Heatmap", x = "Actual", y = "Predicted") +
  theme_minimal()
Output:

Confusion Matrix
The confusion matrix shows that the model correctly predicted the negative class (0) 6 times and made 2 false positive errors, predicting 1 when the actual class was 0. It made no false negatives and correctly predicted the positive class (1) once. Overall, the model performs well, with few mistakes on the negative class and no errors on the positive class.
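As a final sketch (assuming the conf_matrix built above and that both classes appear among the predictions, so the table is square), the overall accuracy can be computed directly from the confusion matrix:
R
# Accuracy: correctly classified cases divided by all test cases
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy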