
Random Forest in R: Exercise

# Install the package "randomForest"


install.packages("randomForest")
library(randomForest)

In the description example of Random Forest, we used the iris dataset to understand how this algorithm can be used for classification. In the
following exercise, we are going to learn how to use Random Forest for regression analysis. In regression with the random forest
method, we may see instances where the individual trees are weak predictors; however, the final combined random forest model is much
stronger, and is therefore able to give better predictions. The major parameter of interest in this case is the resulting importance value, as you will
see below.

For this case study, we will use housing sales data and its attributes. The variables are:

sldprice - house sale price
rooms - number of rooms
beds - number of bedrooms
d_cbd - distance to the centre of town
hway_1 - within 5 km of a highway
sway_1 - within 1 km of a subway
hh_avinc - average household income
detach - detached house
brick - brick exterior
air_con - air conditioning
bsmt_fin - finished basement
As a first step, read in the data from the CSV file and have a quick look at the various attributes and some of their values.

my_data1 <- read.csv("https://ibm.box.com/shared/static/fzceg5vdj9hxpf7aopgvfgobi1g4vb4v.csv")

head(my_data1)

Since we are going to analyse housing prices, it is good practice to get a better understanding of this variable. We can use the plot() function as
one method of doing this. To ensure that there are no NA values, we can use the na.omit() function. We perform these steps to prepare the data
for our random forest implementation.

plot(my_data1$sldprice)

## removing NAs from the data

new_data <- na.omit(my_data1)

Now to the actual work. Just like in the description example, we are going to use all the features to create the random forest. Since regression
analysis makes sense with the importance value, we need to include this keyword.

Q1. Just like in the example, can you model the data for selling price, including all of the variables and with the "importance" parameter set to
TRUE, and print out the fit?

## Your Answer Code Here: ##

fit1 <- randomForest(sldprice~hh_avinc+rooms+beds+sway_1+hway_1+d_cbd+detach+air_con+brick+bsmt_fin,data=new_data,importance=TRUE)


print(fit1)

From the resulting fit, we see that 3 variables are randomly selected at each tree node (mtry = 3) and that the model can explain ~73% of the variability
in the data. You can learn more about the meaning of these values in a regression module.
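As a sketch, both quantities quoted above can also be read directly off the fit object, assuming fit1 from Q1; mtry and rsq are standard fields of a randomForest regression fit:

```r
# Both numbers from the printed summary are stored on the fit object:
# fit1$mtry : number of variables randomly tried at each split
# fit1$rsq  : running pseudo R-squared (1 - MSE / Var(y)), one entry per tree
fit1$mtry
tail(fit1$rsq, 1)  # fraction of variability explained by the full forest
```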
Let us have a look at the importance factor. We can use the type keyword in the importance function to look at only the percentage increase in
MSE.

Q2: Can you print out the importance factor, preferably rounded to two decimal places and comment on the values observed?

## Your Answer Code Here: ##


round(importance(fit1,type=1),2)

The important deciding factors for housing prices are the average household income (hh_avinc), the distance to the centre of town (d_cbd), and
the number of rooms (rooms).
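The same ranking can also be shown graphically. As a sketch, assuming fit1 from Q1 (which was fit with importance=TRUE), the varImpPlot() function from the randomForest package draws the variables ordered by importance:

```r
# Plot the variables ranked by importance; type = 1 restricts the plot
# to the %IncMSE measure used above (requires importance = TRUE in the fit).
varImpPlot(fit1, type = 1, main = "Variable importance (%IncMSE)")
```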

When the trees are grown, some pairs of observations will repeatedly end up in the same terminal node. The algorithm computes a
parameter called proximity to capture this: the proximity of two observations is the fraction of trees in which they land in the same
terminal node. When we use this keyword while computing the model, this proximity matrix is computed and stored with the fit.

To understand this, we will fit the model again with this keyword, print the model and look at the importance value as before.

Q3: Fit the model with the same variables along with the proximity keyword, then print the fit and the importance factor. Compare and comment on the
importance factors relative to the previous fit.

## Your Answer Code Here: ##

fit2 <- randomForest(sldprice~hh_avinc+rooms+beds+sway_1+hway_1+d_cbd+detach+air_con+brick+bsmt_fin,data=new_data,proximity=TRUE,na.action=na.omit,importance=TRUE)
print(fit2)
round(importance(fit2,type=1),2)
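As a quick check of what the keyword actually computed, the stored matrix can be inspected directly (a sketch, assuming fit2 from Q3):

```r
# fit2$proximity is an N x N matrix: entry [i, j] is the fraction of trees
# in which observations i and j landed in the same terminal node.
dim(fit2$proximity)
round(fit2$proximity[1:5, 1:5], 2)  # proximities among the first 5 houses
```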

Although both models identify the average household income as the most important deciding factor for housing prices, the importance of
the distance to the centre of town and the number of rooms differ significantly.

Also, from the % variation explained, the second model fits the data slightly better.
These small variations can make a noticeable difference when we use these models to predict future housing prices.
Now, let us plot these two models to see how the errors evolved during the process.
Now, let us plot these two models to see how the errors evolved during the process.

Q4: Divide the plot area into two, plot the individual fits side by side, and comment on your observation.

## Your Answer Code Here: ##


par(mfrow=c(1,2))
plot(fit1)
plot(fit2)

The errors decrease almost exponentially as the number of trees increases. However, it is also interesting to note that the range of errors
differs between the two models. This gives us a helpful clue for understanding how the underlying algorithm behaves with respect to the proximity keyword.
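The error curves plotted above are stored on each fit as mse, with one out-of-bag value per tree, so the end points and ranges can also be compared numerically (a sketch, assuming fit1 and fit2 from above):

```r
# Final out-of-bag mean squared error of each forest
tail(fit1$mse, 1)
tail(fit2$mse, 1)

# Range of the error over the growing forest, per model
range(fit1$mse)
range(fit2$mse)
```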

It is important to understand that there will be differences each time you run the same forest with the same parameters, due to the random
nature of the algorithm. Hence, the percentages may not match every time.
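One way to get repeatable numbers is to fix R's random seed immediately before each fit (a sketch, assuming new_data from above):

```r
# Two forests grown from the same seed are identical
set.seed(123)
fit_a <- randomForest(sldprice ~ ., data = new_data)
set.seed(123)
fit_b <- randomForest(sldprice ~ ., data = new_data)
identical(fit_a$mse, fit_b$mse)  # TRUE
```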

Now that you have learnt how to use the random forest algorithm for both classification and regression analysis, it is time for you
to try these techniques on your own datasets. Best wishes!

Want to learn more?


IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive
intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course,
available here: SPSS Modeler for Mac users and SPSS Modeler for Windows users

Also, you can use Data Science Experience to run these notebooks faster with bigger datasets. Data Science Experience is IBM's leading cloud
solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the
cloud, DSX enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of DSX
users today with a free account at Data Science Experience
Thanks for completing this lesson!
Notebook created by: Vino Sangaralingam
Copyright © 2017 [IBM Cognitive Class](https://cognitiveclass.ai/?utm_source=ML0151&utm_medium=lab&utm_campaign=cclab). This notebook and its source
code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).
