Decision tree using continuous variable in R
Decision trees are widely used due to their simplicity and effectiveness. They split data into branches based on decision rules, forming a tree structure that is intuitive and easy to interpret. In R, several packages such as rpart and party facilitate decision tree modeling. This guide focuses on how to use these tools with continuous variables.
Overview of Decision Trees
A decision tree is a model used to make predictions based on a series of decision rules inferred from the data. Starting from a root node, the data is split according to these rules, creating branches and leaf nodes. Each node in the tree represents a decision point, and each leaf node represents an outcome or prediction.
What are Continuous Variables?
Continuous variables are variables that can take on an infinite number of values within a given range. They are typically measurements or quantities that can be subdivided into smaller and smaller parts, rather than being restricted to discrete steps or categories. Continuous variables represent a wide range of real-world phenomena and are crucial in statistical analysis and data modeling. Their key characteristics are listed below (a short code sketch follows the list).
Key Characteristics of Continuous Variables
- Infinite Possibilities: Continuous variables can take on any value within a specified range. For example, the weight of an object can be 50.5 kg, 50.55 kg, 50.555 kg, and so on.
- Measurable Quantities: They represent quantities that can be measured rather than counted. Examples include height, temperature, time, and distance.
- Precision: The precision of a continuous variable can be increased by using more refined measurement tools. For example, the length of an object can be measured to the nearest millimeter, micrometer, or even nanometer.
- Interdependence: Continuous variables can be related to each other. For example, in a physical system, temperature and pressure might be related, with changes in one affecting the other.
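To make the distinction concrete, the sketch below counts the distinct values in each column of the mtcars data used throughout this article. This is a rough heuristic for illustration, not a formal test: continuous measurements tend to take many distinct values, while discrete codes take only a few.
R
# Heuristic illustration: continuous measurements (mpg, disp, hp, drat, wt,
# qsec) show many distinct values; discrete codes (cyl, vs, am, gear, carb)
# show only a few
data(mtcars)
sapply(mtcars, function(col) length(unique(col)))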
Now we will walk through, step by step, how to build a decision tree with continuous variables in R Programming Language.
Step 1: Load Required Libraries
First, install (if necessary) and load the required libraries.
R
# Install once if needed: install.packages(c("rpart", "rpart.plot"))
library(rpart)       # recursive partitioning for decision trees
library(rpart.plot)  # enhanced plotting of rpart trees
Step 2: Prepare the Data
Use a dataset with continuous variables. We'll use the mtcars dataset, which is built into R.
R
data(mtcars)
head(mtcars)
Output:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
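Although every column of mtcars is stored as numeric, it is worth confirming the structure before modeling. A quick optional check:
R
# Optional sanity check: all columns should be numeric
str(mtcars)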
Step 3: Create the Decision Tree
Use the rpart function to create a decision tree model. Because the response mpg is continuous, we pass method = "anova" so that rpart grows a regression tree.
R
# Build the decision tree model
model <- rpart(mpg ~ ., data = mtcars, method = "anova")
# Print the model summary
summary(model)
Output:
Call:
rpart(formula = mpg ~ ., data = mtcars, method = "anova")
n= 32
CP nsplit rel error xerror xstd
1 0.64312523 0 1.0000000 1.0492564 0.2515360
2 0.09748407 1 0.3568748 0.7454559 0.1710954
3 0.01000000 2 0.2593907 0.6173034 0.1284871
Variable importance
cyl disp hp wt qsec vs carb
20 20 19 16 12 11 1
Node number 1: 32 observations, complexity param=0.6431252
mean=20.09062, MSE=35.18897
left son=2 (21 obs) right son=3 (11 obs)
Primary splits:
cyl < 5 to the right, improve=0.6431252, (0 missing)
wt < 2.3925 to the right, improve=0.6356630, (0 missing)
disp < 163.8 to the right, improve=0.6130502, (0 missing)
hp < 118 to the right, improve=0.6010712, (0 missing)
vs < 0.5 to the left, improve=0.4409477, (0 missing)
Surrogate splits:
disp < 142.9 to the right, agree=0.969, adj=0.909, (0 split)
hp < 101 to the right, agree=0.938, adj=0.818, (0 split)
wt < 2.5425 to the right, agree=0.906, adj=0.727, (0 split)
qsec < 18.41 to the left, agree=0.844, adj=0.545, (0 split)
vs < 0.5 to the left, agree=0.844, adj=0.545, (0 split)
Node number 2: 21 observations, complexity param=0.09748407
mean=16.64762, MSE=9.451066
left son=4 (7 obs) right son=5 (14 obs)
Primary splits:
hp < 192.5 to the right, improve=0.5530828, (0 missing)
cyl < 7 to the right, improve=0.5068475, (0 missing)
disp < 266.9 to the right, improve=0.5068475, (0 missing)
wt < 3.49 to the right, improve=0.4414890, (0 missing)
drat < 3.075 to the left, improve=0.1890739, (0 missing)
Surrogate splits:
disp < 334 to the right, agree=0.857, adj=0.571, (0 split)
wt < 4.66 to the right, agree=0.810, adj=0.429, (0 split)
qsec < 15.455 to the left, agree=0.810, adj=0.429, (0 split)
carb < 3.5 to the right, agree=0.762, adj=0.286, (0 split)
gear < 4.5 to the right, agree=0.714, adj=0.143, (0 split)
Node number 3: 11 observations
mean=26.66364, MSE=18.48959
Node number 4: 7 observations
mean=13.41429, MSE=4.118367
Node number 5: 14 observations
mean=18.26429, MSE=4.276582
The summary provides detailed information about the splits, the surrogate splits, and the statistics at each node, including the complexity parameter (CP) table used to control tree size.
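The CP table can also guide pruning. As an optional extension not part of the original walkthrough, the sketch below picks the subtree with the lowest cross-validated error and prunes back to it:
R
# Show the complexity parameter (CP) table with cross-validated error
printcp(model)

# Prune at the CP value that minimizes the cross-validated error (xerror)
best_cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(model, cp = best_cp)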
Step 4: Plot the Decision Tree
Now we will plot the decision tree.
R
# Plot the decision tree
rpart.plot(model, type = 3, fallen.leaves = TRUE, cex = 0.7)
Output:
Plot of the decision tree fitted to the mtcars data.
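If rpart.plot is unavailable, base graphics can draw the same tree. This is a minimal alternative sketch using rpart's built-in plot and text methods:
R
# Base-graphics alternative to rpart.plot
plot(model, uniform = TRUE, margin = 0.1)  # draw the tree skeleton
text(model, use.n = TRUE, cex = 0.8)       # label splits and leaf counts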
Step 5: Predict Values and Calculate Mean Squared Error
Now we will generate predictions and calculate the mean squared error (MSE).
R
# Predict mpg for the first few rows
predictions <- predict(model, mtcars[1:5,])
print(predictions)
# Calculate Mean Squared Error (MSE)
actuals <- mtcars$mpg[1:5]
mse <- mean((predictions - actuals)^2)
print(paste("MSE:", mse))
Output:
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
18.26429 18.26429 26.66364 18.26429
Hornet Sportabout
18.26429
[1] "MSE: 7.98370045538877"
Conclusion
Decision trees are versatile and intuitive models for regression tasks involving continuous variables. With the rpart package in R, you can easily build, visualize, and interpret decision trees. This article walked through a practical, step-by-step example to help you get started with decision trees for continuous variables in R.