Decision tree using continuous variable in R
Decision trees are widely used due to their simplicity and effectiveness. They split data into branches to form a tree structure based on decision rules, making them intuitive and easy to interpret. In R, several packages such as rpart and party are available to facilitate decision tree modeling. This guide will specifically delve into how to utilize these tools for continuous variables.
Overview of Decision Trees
A decision tree is a model used to make predictions based on a series of decision rules inferred from the data. Starting from a root node, the data is split according to these rules, creating branches and leaf nodes. Each node in the tree represents a decision point, and each leaf node represents an outcome or prediction.
What are Continuous Variables?
Continuous variables can take on an infinite number of values within a given range. They are typically measurements or quantities that can be subdivided into ever smaller parts, so their values are not restricted to discrete steps or categories. Continuous variables describe a wide range of real-world phenomena and are central to statistical analysis and data modeling.
Key Characteristics of Continuous Variables
- Infinite Possibilities: Continuous variables can take on any value within a specified range. For example, the weight of an object can be 50.5 kg, 50.55 kg, 50.555 kg, and so on.
- Measurable Quantities: They represent quantities that can be measured rather than counted. Examples include height, temperature, time, and distance.
- Precision: The precision of a continuous variable can be increased by using more refined measurement tools. For example, the length of an object can be measured to the nearest millimeter, micrometer, or even nanometer.
- Interdependence: Continuous variables can be related to each other. For example, in a physical system, temperature and pressure might be related, with changes in one affecting the other.
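In R, a continuous variable is simply a numeric vector. A minimal illustration (using a few car weights taken from the mtcars dataset used later in this article):
R
# A continuous variable: car weights in thousands of pounds
weights <- c(2.620, 2.875, 2.320, 3.215, 3.440)

# str() reports the type as "num" (numeric), R's storage mode for continuous data
str(weights)

# Continuous values are summarised with means, ranges and quantiles rather than counts
summary(weights)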
Now we will walk through building a decision tree with continuous variables in R, step by step.
Step 1: Load Required Libraries
First, we load the required packages.
R
library(rpart)
library(rpart.plot)
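If the packages are not yet installed, install them once from CRAN before loading:
R
# One-time installation (only needed if the packages are missing)
install.packages("rpart")
install.packages("rpart.plot")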
Step 2: Prepare the Data
Use a dataset with continuous variables. We'll use the mtcars dataset, which is built into R.
R
data(mtcars)
head(mtcars)
Output:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
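Because a regression tree needs numeric inputs, it is worth confirming how the columns are stored. A quick check:
R
# Every column in mtcars is numeric, so all predictors are continuous
str(mtcars)

# Summarise the response variable we will predict
summary(mtcars$mpg)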
Step 3: Create the Decision Tree
Use the rpart function to create the decision tree model. Because the response variable mpg is continuous, method = "anova" fits a regression tree.
R
# Build the decision tree model
model <- rpart(mpg ~ ., data = mtcars, method = "anova")
# Print the model summary
summary(model)
Output:
Call:
rpart(formula = mpg ~ ., data = mtcars, method = "anova")
n= 32
CP nsplit rel error xerror xstd
1 0.64312523 0 1.0000000 1.0492564 0.2515360
2 0.09748407 1 0.3568748 0.7454559 0.1710954
3 0.01000000 2 0.2593907 0.6173034 0.1284871
Variable importance
cyl disp hp wt qsec vs carb
20 20 19 16 12 11 1
Node number 1: 32 observations, complexity param=0.6431252
mean=20.09062, MSE=35.18897
left son=2 (21 obs) right son=3 (11 obs)
Primary splits:
cyl < 5 to the right, improve=0.6431252, (0 missing)
wt < 2.3925 to the right, improve=0.6356630, (0 missing)
disp < 163.8 to the right, improve=0.6130502, (0 missing)
hp < 118 to the right, improve=0.6010712, (0 missing)
vs < 0.5 to the left, improve=0.4409477, (0 missing)
Surrogate splits:
disp < 142.9 to the right, agree=0.969, adj=0.909, (0 split)
hp < 101 to the right, agree=0.938, adj=0.818, (0 split)
wt < 2.5425 to the right, agree=0.906, adj=0.727, (0 split)
qsec < 18.41 to the left, agree=0.844, adj=0.545, (0 split)
vs < 0.5 to the left, agree=0.844, adj=0.545, (0 split)
Node number 2: 21 observations, complexity param=0.09748407
mean=16.64762, MSE=9.451066
left son=4 (7 obs) right son=5 (14 obs)
Primary splits:
hp < 192.5 to the right, improve=0.5530828, (0 missing)
cyl < 7 to the right, improve=0.5068475, (0 missing)
disp < 266.9 to the right, improve=0.5068475, (0 missing)
wt < 3.49 to the right, improve=0.4414890, (0 missing)
drat < 3.075 to the left, improve=0.1890739, (0 missing)
Surrogate splits:
disp < 334 to the right, agree=0.857, adj=0.571, (0 split)
wt < 4.66 to the right, agree=0.810, adj=0.429, (0 split)
qsec < 15.455 to the left, agree=0.810, adj=0.429, (0 split)
carb < 3.5 to the right, agree=0.762, adj=0.286, (0 split)
gear < 4.5 to the right, agree=0.714, adj=0.143, (0 split)
Node number 3: 11 observations
mean=26.66364, MSE=18.48959
Node number 4: 7 observations
mean=13.41429, MSE=4.118367
Node number 5: 14 observations
mean=18.26429, MSE=4.276582
The summary output provides detailed information about the splits, the surrogate splits, and the mean and mean squared error at each node, along with the complexity parameter (CP) table and variable importance.
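By default, rpart stops splitting according to its built-in control settings. As a hedged sketch (the parameter values below are illustrative, not tuned for this dataset), rpart.control can be used to grow a deeper or shallower tree:
R
# Illustrative growth settings (example values, not recommendations)
ctrl <- rpart.control(minsplit = 10,  # minimum observations needed to attempt a split
                      cp = 0.005,     # lower complexity threshold allows more splits
                      maxdepth = 5)   # cap on the depth of the tree

model_tuned <- rpart(mpg ~ ., data = mtcars, method = "anova", control = ctrl)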
Step 4: Plot the Decision Tree
Now we will plot the decision tree using rpart.plot.
R
# Plot the decision tree: type = 3 draws separate split labels for the
# left and right branches, fallen.leaves = TRUE puts all leaves at the
# bottom of the plot, and cex = 0.7 scales the text
rpart.plot(model, type = 3, fallen.leaves = TRUE, cex = 0.7)
Output:
Decision tree plot for the mtcars model
Step 5: Predict Values and Calculate Mean Squared Error
Now we will predict mpg for the first few observations and calculate the Mean Squared Error (MSE).
R
# Predict mpg for the first few rows
predictions <- predict(model, mtcars[1:5,])
print(predictions)
# Calculate Mean Squared Error (MSE)
actuals <- mtcars$mpg[1:5]
mse <- mean((predictions - actuals)^2)
print(paste("MSE:", mse))
Output:
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
18.26429 18.26429 26.66364 18.26429
Hornet Sportabout
18.26429
[1] "MSE: 7.98370045538877"
Conclusion
Decision trees are versatile and intuitive models for regression tasks involving continuous variables. Using the rpart package in R, you can easily build, visualize, and interpret decision trees. This article provided a practical, step-by-step example to help you get started with decision trees using continuous variables in R.