
Decision Tree Using Continuous Variables in R

Last Updated: 25 Jul, 2024

Decision trees are widely used because they are simple and effective. They split data into branches according to decision rules, forming a tree structure that is intuitive and easy to interpret. In R, several packages, such as rpart and party, support decision tree modeling. This guide focuses on how to use these tools with continuous variables.

Overview of Decision Trees

A decision tree is a model used to make predictions based on a series of decision rules inferred from the data. Starting from a root node, the data is split according to these rules, creating branches and leaf nodes. Each node in the tree represents a decision point, and each leaf node represents an outcome or prediction.

What are Continuous Variables?

Continuous variables can take on an infinite number of values within a given range. They are typically measurements or quantities that can be subdivided into ever smaller parts, rather than data restricted to discrete steps or categories. Continuous variables describe a wide range of real-world phenomena and are central to statistical analysis and data modeling. Their key characteristics are listed below, followed by a quick way to check for them in R.

Key Characteristics of Continuous Variables

  1. Infinite Possibilities: Continuous variables can take on any value within a specified range. For example, the weight of an object can be 50.5 kg, 50.55 kg, 50.555 kg, and so on.
  2. Measurable Quantities: They represent quantities that can be measured rather than counted. Examples include height, temperature, time, and distance.
  3. Precision: The precision of a continuous variable can be increased by using more refined measurement tools. For example, the length of an object can be measured to the nearest millimeter, micrometer, or even nanometer.
  4. Interdependence: Continuous variables can be related to each other. For example, in a physical system, temperature and pressure might be related, with changes in one affecting the other.
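
In the mtcars dataset used below, every column is stored as numeric, but only some behave as truly continuous measurements. A quick, minimal check (a sketch, not part of the original example) is to count distinct values per column:

R
# Count distinct values per column: continuous measurements such as
# mpg, disp, drat, wt, and qsec take many distinct values, while
# coded variables such as vs and am take only a few
data(mtcars)
sapply(mtcars, function(x) length(unique(x)))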

Now we will walk step by step through building a decision tree with continuous variables in R Programming Language.

Step 1: Load Required Libraries

First, we install the required packages (if needed) and load them.

R
# Install the packages once if they are not already available:
# install.packages(c("rpart", "rpart.plot"))
library(rpart)        # recursive partitioning for decision trees
library(rpart.plot)   # plotting utilities for rpart trees

Step 2: Prepare the Data

Use a dataset with continuous variables. We'll use the mtcars dataset, which is available in R.

R
data(mtcars)
head(mtcars)

Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
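
Optionally, you can hold out part of the data so the tree can later be evaluated on unseen observations. This split is not part of the original walkthrough (which fits on all 32 rows); the 70/30 ratio and the seed below are arbitrary choices:

R
# Optional: random 70/30 train/test split for later evaluation
set.seed(42)                                       # reproducibility
train_idx  <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train_data <- mtcars[train_idx, ]
test_data  <- mtcars[-train_idx, ]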

Step 3: Create the Decision Tree

Use the rpart function to fit the tree. Setting method = "anova" builds a regression tree, which is appropriate because the response variable mpg is continuous.

R
# Build the decision tree model
model <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Print the model summary
summary(model)

Output:

Call:
rpart(formula = mpg ~ ., data = mtcars, method = "anova")
n= 32

          CP nsplit rel error    xerror      xstd
1 0.64312523      0 1.0000000 1.0492564 0.2515360
2 0.09748407      1 0.3568748 0.7454559 0.1710954
3 0.01000000      2 0.2593907 0.6173034 0.1284871

Variable importance
 cyl disp   hp   wt qsec   vs carb
  20   20   19   16   12   11    1

Node number 1: 32 observations,    complexity param=0.6431252
  mean=20.09062, MSE=35.18897
  left son=2 (21 obs) right son=3 (11 obs)
  Primary splits:
      cyl  < 5      to the right, improve=0.6431252, (0 missing)
      wt   < 2.3925 to the right, improve=0.6356630, (0 missing)
      disp < 163.8  to the right, improve=0.6130502, (0 missing)
      hp   < 118    to the right, improve=0.6010712, (0 missing)
      vs   < 0.5    to the left,  improve=0.4409477, (0 missing)
  Surrogate splits:
      disp < 142.9  to the right, agree=0.969, adj=0.909, (0 split)
      hp   < 101    to the right, agree=0.938, adj=0.818, (0 split)
      wt   < 2.5425 to the right, agree=0.906, adj=0.727, (0 split)
      qsec < 18.41  to the left,  agree=0.844, adj=0.545, (0 split)
      vs   < 0.5    to the left,  agree=0.844, adj=0.545, (0 split)

Node number 2: 21 observations,    complexity param=0.09748407
  mean=16.64762, MSE=9.451066
  left son=4 (7 obs) right son=5 (14 obs)
  Primary splits:
      hp   < 192.5  to the right, improve=0.5530828, (0 missing)
      cyl  < 7      to the right, improve=0.5068475, (0 missing)
      disp < 266.9  to the right, improve=0.5068475, (0 missing)
      wt   < 3.49   to the right, improve=0.4414890, (0 missing)
      drat < 3.075  to the left,  improve=0.1890739, (0 missing)
  Surrogate splits:
      disp < 334    to the right, agree=0.857, adj=0.571, (0 split)
      wt   < 4.66   to the right, agree=0.810, adj=0.429, (0 split)
      qsec < 15.455 to the left,  agree=0.810, adj=0.429, (0 split)
      carb < 3.5    to the right, agree=0.762, adj=0.286, (0 split)
      gear < 4.5    to the right, agree=0.714, adj=0.143, (0 split)

Node number 3: 11 observations
  mean=26.66364, MSE=18.48959

Node number 4: 7 observations
  mean=13.41429, MSE=4.118367

Node number 5: 14 observations
  mean=18.26429, MSE=4.276582

The summary provides detailed information about the splits and nodes: the complexity parameter (CP) table, variable importance scores, and the primary and surrogate splits considered at each node.
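
The CP table also guides pruning: choosing the complexity parameter with the lowest cross-validated error (xerror) and pruning back to it helps avoid overfitting. A short sketch of that workflow:

R
# Display the complexity parameter (CP) table
printcp(model)

# Prune the tree back to the CP value with the lowest
# cross-validated error (xerror)
best_cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(model, cp = best_cp)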

Step 4: Plot the Decision Tree

Now we will plot the decision tree.

R
# Plot the decision tree
rpart.plot(model, type = 3, fallen.leaves = TRUE, cex = 0.7)

Output:

[Figure: Decision tree using continuous variables in R, plotted with rpart.plot]
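
As mentioned in the introduction, the party package offers an alternative: its ctree function fits a conditional inference tree that also handles continuous predictors. A minimal sketch, assuming party is installed:

R
# Alternative: conditional inference tree from the party package
# install.packages("party")   # run once if the package is missing
library(party)
ctree_model <- ctree(mpg ~ ., data = mtcars)
plot(ctree_model)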

Step 5: Predict Values and Calculate the Mean Squared Error

Now we will predict mpg for a few observations and calculate the mean squared error (MSE).

R
# Predict mpg for the first few rows
predictions <- predict(model, mtcars[1:5,])
print(predictions)

# Calculate Mean Squared Error (MSE)
actuals <- mtcars$mpg[1:5]
mse <- mean((predictions - actuals)^2)
print(paste("MSE:", mse))

Output:

        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
         18.26429          18.26429          26.66364          18.26429
Hornet Sportabout
         18.26429

[1] "MSE: 7.98370045538877"

Conclusion

Decision trees are versatile and intuitive models for regression tasks involving continuous variables. By using the rpart package in R, you can easily build, visualize, and interpret decision trees. This article provided a comprehensive guide with a practical example to help you get started with decision trees using continuous variables in R.

