How to Create a Partial Dependence Plot for a Categorical Variable in R?

Last Updated : 16 Aug, 2024

Partial Dependence Plots (PDPs) are a powerful tool for understanding the relationship between predictor variables and the predicted outcome in machine learning models. PDPs are particularly useful for visualizing how a feature affects the predictions, holding other features constant. While they are commonly used for continuous variables, PDPs can also be created for categorical variables to understand their influence on the model's predictions using R Programming Language.

Understanding Partial Dependence Plots

A Partial Dependence Plot shows the marginal effect of one or more features on the predicted outcome. For a single feature, the PDP is created by:

Holding all other features constant.
Varying the feature of interest across its range of values.
Averaging the model predictions for each value of the feature.

When dealing with categorical variables, the PDP will show how the predicted outcome changes as the categorical variable takes on different levels.

Use Cases

Interpretability: PDPs help in interpreting the impact of a specific categorical variable on the predictions.
Model Debugging: PDPs can reveal unexpected relationships or interactions between variables.
Feature Selection: PDPs can help in identifying important categorical variables.

This article will guide you through the steps of creating a Partial Dependence Plot for a categorical variable in R. We will cover the theory behind PDPs, the necessary packages, and a step-by-step example using a popular dataset.

Step 1: Load the Necessary Packages

To create a PDP in R, we need several packages that facilitate model building and visualization. The key packages include randomForest, pdp, and ggplot2.

# Install necessary packages if not already installed
install.packages("randomForest")
install.packages("pdp")
install.packages("ggplot2")

# Load the packages
library(randomForest)
library(pdp)
library(ggplot2)

Step 2: Prepare the Dataset

For this example, we will use the Titanic dataset available in R. This dataset contains information about the passengers on the Titanic, including whether they survived, their age, class.

# Load the Titanic dataset
data("Titanic")
df <- as.data.frame(Titanic)

# Preview the dataset
head(df)

Output:

  Class    Sex   Age Survived Freq
1   1st   Male Child       No    0
2   2nd   Male Child       No    0
3   3rd   Male Child       No   35
4  Crew   Male Child       No    0
5   1st Female Child       No    0
6   2nd Female Child       No    0

Step 3: Build a Model

We will create a Random Forest model to predict the survival of passengers based on various features, including the categorical variable Class.

# Build a Random Forest model
set.seed(42)
rf_model <- randomForest(Survived ~ Class + Sex + Age + Freq, data = df, ntree = 100)

# Print model summary
print(rf_model)

Output:

Call:
 randomForest(formula = Survived ~ Class + Sex + Age + Freq, data = df,      ntree = 100) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of  error rate: 65.62%
Confusion matrix:
    No Yes class.error
No   3  13      0.8125
Yes  8   8      0.5000

Step 4: Generate the Partial Dependence Plot

Now that we have a trained model, we can generate a Partial Dependence Plot for the categorical variable Class. The partial function from the pdp package is used to create the plot.

# Create a Partial Dependence Plot for the categorical variable 'Class'
pdp_class <- partial(rf_model, pred.var = "Class", plot = TRUE, which.class = 1)

# Display the plot
print(pdp_class)

Output:

Partial Dependence Plot for a Categorical Variable in R

The resulting plot will show the effect of different passenger classes on the probability of survival, holding all other features constant. For example, if the plot shows that passengers in 1st class have a higher probability of survival, this indicates the importance of the Class variable in predicting survival in the model.

Conclusion

Creating a Partial Dependence Plot for a categorical variable in R is a straightforward process that can provide valuable insights into the influence of specific features on model predictions. By following the steps outlined in this article, you can generate and customize PDPs for your categorical variables, helping to improve model interpretability and transparency.