Random Forest with Parallel Computing in R Programming
Random Forest in
R Programming is a bagging (bootstrap aggregating) technique. As the name suggests, the algorithm builds a forest made up of many decision trees. It is a supervised classification algorithm.
Just as a forest with more trees is generally considered a better forest, a random forest classifier generally becomes more accurate and more stable as the number of trees grows, and hence turns out to be a better model. Unlike a single decision tree, which is grown once on the full training set using measures such as entropy and information gain, a random forest draws random bootstrap samples from the training set and grows a separate tree on each sample.
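To make the bootstrap idea concrete, here is a minimal sketch of how a single bootstrap sample is drawn (it uses the built-in iris data purely for illustration, not this article's dataset); each tree in the forest is grown on a different sample like this:
r
# Draw one bootstrap sample: the same number of rows as the original data,
# sampled with replacement, so some rows repeat and others are left out.
set.seed(1)
n <- nrow(iris)
boot_idx <- sample(seq_len(n), size = n, replace = TRUE)
boot_sample <- iris[boot_idx, ]
length(unique(boot_idx)) / n   # roughly 0.63: about 63% of rows appear at least once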
Advantages of Random Forest:
- Handles large datasets well.
- Training is fast and the accuracy is typically high.
- Can handle a large number of input variables at once.
- Less prone to over-fitting than a single decision tree.
Disadvantages of Random Forest:
- Complexity is the major issue: because the algorithm builds many trees and combines their outputs to produce the final prediction, it needs more computational time and resources.
- The training time for a random forest model is usually longer, since a large number of trees has to be grown.
Parallel Computing
Parallel computing refers to using two or more cores (or processors) at the same time to solve a single problem. The goal is to break the task into smaller sub-tasks and execute them simultaneously.
A simple mathematical example makes the basic idea behind parallel computing clear. Suppose we have the following expression to evaluate:
Z = 7a + 8b + 2c + 3d
where a = 1, b = 2, c = 9, d = 5.
The normal process, without parallel computing, would be:
Step 1: Putting in the values of the variables.
Z = (7*1) + (8*2) + (2*9) + (3*5)
Step 2: Evaluating the expression:
Z = 7 + (8*2) + (2*9) + (3*5)
Step 3: Z = 7 + 16 + (2*9) + (3*5)
Step 4: Z = 7 + 16 + 18 + (3*5)
Step 5: Z = 7 + 16 + 18 + 15
Step 6: Z = 56
The evaluation of the same expression with parallel computing would be:
Step 1: Putting in the values of the variables.
Z = (7*1) + (8*2) + (2*9) + (3*5)
Step 2: Evaluating the expression:
Z = 7 + 16 + 18 + 15
Step 3: Z = 56
Comparing the two walk-throughs, the evaluation takes far fewer steps, and hence finishes much faster, in the parallel case because the four products are computed at the same time.
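The same idea can be sketched in R with the parallel package (the variable names here are purely illustrative): each of the four products is handed to a separate worker, and only the final sum is done on the main process.
r
library(parallel)

cl <- makePSOCKcluster(4)                                   # one worker per term
terms <- list(c(7, 1), c(8, 2), c(2, 9), c(3, 5))
products <- parLapply(cl, terms, function(t) t[1] * t[2])   # products computed in parallel
stopCluster(cl)
Z <- sum(unlist(products))                                  # 7 + 16 + 18 + 15
Z                                                           # 56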
For the demonstration I have taken a radar dataset (the ionosphere data). It consists of a total of 35 attributes; the 35th attribute is the target variable, which is either "g" or "b". The targets describe radar returns from free electrons in the ionosphere: "g" (good) returns are those showing evidence of some type of structure in the ionosphere, while "b" (bad) returns are those that do not; their signals pass through the ionosphere. So it is a binary classification task. Let's start with the coding part.
Loading the required libraries:
r
library(caret)
library(randomForest)
library(doParallel)
Reading the dataset:
r
datafile<-read.csv("C:/Users/prana/Downloads/ionosphere.data.csv")
datafile
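As a quick sanity check (a small sketch; the column name target is assumed to match the CSV header used in the rest of this article), we can inspect what was read in:
r
dim(datafile)           # expect 35 columns, the last one being the target
str(datafile$target)    # the target column, labelled "g" or "b"
table(datafile$target)  # class balance of good vs. bad radar returns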
Converting the target variable into a factor with labels 0 and 1, and checking for missing values.
r
datafile$target<-factor(datafile$target, labels=c(0, 1))
sum(is.na(datafile))    # returns 0, i.e. no missing values
set.seed(100)
As there are no missing values, the dataset is clean and we can move on to model building. We split the data in an 80:20 ratio into a training set and a testing set respectively.
r
Trainingindex<-createDataPartition(datafile$target, p=0.8, list=FALSE)
trainingset<-datafile[Trainingindex, ]
testingset<-datafile[-Trainingindex, ]
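As an optional check (a small sketch), we can confirm that the split is roughly 80:20 and that both classes are present in each set:
r
nrow(trainingset)                      # about 80% of the rows
nrow(testingset)                       # about 20% of the rows
prop.table(table(trainingset$target))  # class proportions in the training set
prop.table(table(testingset$target))   # class proportions in the testing set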
Implementation of Random Forest without Parallel Computing
Now we build the model in the usual way, without a parallel backend, and record the time taken:
r
start.time<-proc.time()      # timestamp before training starts
model<-train(target~., data=trainingset, method='rf')
stop.time<-proc.time()       # timestamp after training finishes
run.time<-stop.time - start.time
print(run.time)
Output:
user system elapsed
13.05 0.20 13.62
Implementation of Random Forest with Parallel Computing
Now we build the model using parallel computing. This uses the doParallel library in R (we already loaded it at the beginning, so there is no need to load it again). The function makePSOCKcluster() creates a set of copies of R running in parallel and communicating over sockets, registerDoParallel() registers those workers as the parallel backend used by caret, and stopCluster() shuts down the worker processes in the cluster cl. We will again record the time taken to build the model with this approach.
r
cl<-makePSOCKcluster(5)      # create a cluster of 5 worker copies of R
registerDoParallel(cl)       # register the cluster as the parallel backend
start.time<-proc.time()
model<-train(target~., data=trainingset, method='rf')
stop.time<-proc.time()
run.time<-stop.time - start.time
print(run.time)
stopCluster(cl)              # shut down the worker processes
Output:
user system elapsed
0.56 0.02 6.19
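In practice the number of workers is usually tied to the machine's core count rather than hard-coded to 5; one common pattern (a small sketch) is to leave one core free for the operating system:
r
library(doParallel)   # provides registerDoParallel() and attaches the parallel package

n_workers <- max(1, parallel::detectCores() - 1)   # use all but one available core
cl <- makePSOCKcluster(n_workers)
registerDoParallel(cl)
# ... train the model exactly as in the block above ...
stopCluster(cl)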
Comparison Table for Time Differences
The table below summarises the timings (in seconds) of the two approaches.
Approach                       user    system   elapsed
Without parallel computing     13.05   0.20     13.62
With parallel computing        0.56    0.02     6.19
From the table we can conclude that, comparing elapsed times, the modeling process with parallel computing is
2.200323 (= 13.62 / 6.19) times faster than the normal approach.
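The speed-up can also be computed directly in R; the sketch below re-enters the two timings under separate, hypothetical names (run.time.serial and run.time.parallel), since the code above overwrote run.time both times:
r
run.time.serial   <- c(user = 13.05, system = 0.20, elapsed = 13.62)
run.time.parallel <- c(user = 0.56,  system = 0.02, elapsed = 6.19)
unname(run.time.serial["elapsed"] / run.time.parallel["elapsed"])   # about 2.2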