Decision tree using continuous variable in R
Decision trees are widely used due to their simplicity and effectiveness. They split data into branches based on decision rules, forming a tree structure that is intuitive and easy to interpret. In R, several packages such as rpart and party facilitate decision tree modeling. This guide focuses on how to use these tools with continuous variables.
Overview of Decision Trees
A decision tree is a model used to make predictions based on a series of decision rules inferred from the data. Starting from a root node, the data is split according to these rules, creating branches and leaf nodes. Each node in the tree represents a decision point, and each leaf node represents an outcome or prediction.
What are Continuous Variables?
Continuous variables are variables that can take on an infinite number of values within a given range. They are typically measurements or quantities that can be subdivided into smaller and smaller parts, rather than being restricted to discrete steps or categories. Continuous variables represent a wide range of real-world phenomena and are crucial in statistical analysis and data modeling. Their key characteristics are listed below (a short code sketch follows the list).
Key Characteristics of Continuous Variables
- Infinite Possibilities: Continuous variables can take on any value within a specified range. For example, the weight of an object can be 50.5 kg, 50.55 kg, 50.555 kg, and so on.
- Measurable Quantities: They represent quantities that can be measured rather than counted. Examples include height, temperature, time, and distance.
- Precision: The precision of a continuous variable can be increased by using more refined measurement tools. For example, the length of an object can be measured to the nearest millimeter, micrometer, or even nanometer.
- Interdependence: Continuous variables can be related to each other. For example, in a physical system, temperature and pressure might be related, with changes in one affecting the other.
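To make the distinction concrete, the sketch below counts the distinct values in each column of the mtcars data used throughout this article. This is a rough heuristic for illustration, not a formal test: continuous measurements tend to take many distinct values, while discrete codes take only a few.
R
# Heuristic illustration: continuous measurements (mpg, disp, hp, drat, wt,
# qsec) show many distinct values; discrete codes (cyl, vs, am, gear, carb)
# show only a few
data(mtcars)
sapply(mtcars, function(col) length(unique(col)))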
Now we will walk through, step by step, how to build a decision tree with continuous variables in R Programming Language.
Step 1: Load Required Libraries
First, install (if necessary) and load the required libraries.
R
# Install once if needed: install.packages(c("rpart", "rpart.plot"))
library(rpart)       # recursive partitioning for decision trees
library(rpart.plot)  # enhanced plotting of rpart trees
Step 2: Prepare the Data
Use a dataset with continuous variables. We'll use the mtcars dataset, which is built into R.
R
data(mtcars)
head(mtcars)
Output:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
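Although every column of mtcars is stored as numeric, it is worth confirming the structure before modeling. A quick optional check:
R
# Optional sanity check: all columns should be numeric
str(mtcars)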
Step 3: Create the Decision Tree
Use the rpart function to create a decision tree model. Because the response mpg is continuous, we pass method = "anova" so that rpart grows a regression tree.
R
# Build the decision tree model
model <- rpart(mpg ~ ., data = mtcars, method = "anova")
# Print the model summary
summary(model)
Output:
Call:
rpart(formula = mpg ~ ., data = mtcars, method = "anova")
n= 32
CP nsplit rel error xerror xstd
1 0.64312523 0 1.0000000 1.0492564 0.2515360
2 0.09748407 1 0.3568748 0.7454559 0.1710954
3 0.01000000 2 0.2593907 0.6173034 0.1284871
Variable importance
cyl disp hp wt qsec vs carb
20 20 19 16 12 11 1
Node number 1: 32 observations, complexity param=0.6431252
mean=20.09062, MSE=35.18897
left son=2 (21 obs) right son=3 (11 obs)
Primary splits:
cyl < 5 to the right, improve=0.6431252, (0 missing)
wt < 2.3925 to the right, improve=0.6356630, (0 missing)
disp < 163.8 to the right, improve=0.6130502, (0 missing)
hp < 118 to the right, improve=0.6010712, (0 missing)
vs < 0.5 to the left, improve=0.4409477, (0 missing)
Surrogate splits:
disp < 142.9 to the right, agree=0.969, adj=0.909, (0 split)
hp < 101 to the right, agree=0.938, adj=0.818, (0 split)
wt < 2.5425 to the right, agree=0.906, adj=0.727, (0 split)
qsec < 18.41 to the left, agree=0.844, adj=0.545, (0 split)
vs < 0.5 to the left, agree=0.844, adj=0.545, (0 split)
Node number 2: 21 observations, complexity param=0.09748407
mean=16.64762, MSE=9.451066
left son=4 (7 obs) right son=5 (14 obs)
Primary splits:
hp < 192.5 to the right, improve=0.5530828, (0 missing)
cyl < 7 to the right, improve=0.5068475, (0 missing)
disp < 266.9 to the right, improve=0.5068475, (0 missing)
wt < 3.49 to the right, improve=0.4414890, (0 missing)
drat < 3.075 to the left, improve=0.1890739, (0 missing)
Surrogate splits:
disp < 334 to the right, agree=0.857, adj=0.571, (0 split)
wt < 4.66 to the right, agree=0.810, adj=0.429, (0 split)
qsec < 15.455 to the left, agree=0.810, adj=0.429, (0 split)
carb < 3.5 to the right, agree=0.762, adj=0.286, (0 split)
gear < 4.5 to the right, agree=0.714, adj=0.143, (0 split)
Node number 3: 11 observations
mean=26.66364, MSE=18.48959
Node number 4: 7 observations
mean=13.41429, MSE=4.118367
Node number 5: 14 observations
mean=18.26429, MSE=4.276582
The summary provides detailed information about the splits, the surrogate splits, and the statistics at each node, including the complexity parameter (CP) table used to control tree size.
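The CP table can also guide pruning. As an optional extension not part of the original walkthrough, the sketch below picks the subtree with the lowest cross-validated error and prunes back to it:
R
# Show the complexity parameter (CP) table with cross-validated error
printcp(model)

# Prune at the CP value that minimizes the cross-validated error (xerror)
best_cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(model, cp = best_cp)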
Step 4: Plot the Decision Tree
Now we will plot the decision tree.
R
# Plot the decision tree
rpart.plot(model, type = 3, fallen.leaves = TRUE, cex = 0.7)
Output:
Plot of the decision tree fitted to the mtcars data.
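If rpart.plot is unavailable, base graphics can draw the same tree. This is a minimal alternative sketch using rpart's built-in plot and text methods:
R
# Base-graphics alternative to rpart.plot
plot(model, uniform = TRUE, margin = 0.1)  # draw the tree skeleton
text(model, use.n = TRUE, cex = 0.8)       # label splits and leaf counts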
Step 5: Predict Values and Calculate Mean Squared Error
Now we will generate predictions and calculate the mean squared error (MSE).
R
# Predict mpg for the first few rows
predictions <- predict(model, mtcars[1:5,])
print(predictions)
# Calculate Mean Squared Error (MSE)
actuals <- mtcars$mpg[1:5]
mse <- mean((predictions - actuals)^2)
print(paste("MSE:", mse))
Output:
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
18.26429 18.26429 26.66364 18.26429
Hornet Sportabout
18.26429
[1] "MSE: 7.98370045538877"
Conclusion
Decision trees are versatile and intuitive models for regression tasks involving continuous variables. With the rpart package in R, you can easily build, visualize, and interpret decision trees. This article walked through a practical, step-by-step example to help you get started with decision trees for continuous variables in R.