
Random Forest in R: Exercise

# Install the package "randomForest"


install.packages("randomForest")
library(randomForest)

In the description example of Random Forest, we used the iris dataset to understand how this algorithm can be used for classification. In the
following exercise, we are going to learn how to use Random Forest for regression analysis. In regression with the random forest
method, we may see instances where the individual trees are weak predictors; however, the final combined random forest model is much
stronger, and is therefore able to give better predictions. The major parameter of interest in this case is the resulting importance value, as you will
see below.

For this case study, we will use housing sales data and its attributes. The variables are:

sldprice - house sale price
rooms - number of rooms
beds - number of bedrooms
d_cbd - distance to the centre of town
hway_1 - within 5 km of a highway
sway_1 - within 1 km of a subway
hh_avinc - average household income
detach - detached house
brick - brick exterior
air_con - air conditioning
bsmt_fin - finished basement
As a first step, read in the data from the CSV file and have a quick look at the various attributes and some of their values.

my_data1 <- read.csv("https://ibm.box.com/shared/static/fzceg5vdj9hxpf7aopgvfgobi1g4vb4v.csv")

head(my_data1)

Since we are going to analyse housing prices, it is good practice to get a better understanding of this variable. We can use the plot() function as
one method of doing this. To ensure that there are no NA values, we can use the na.omit() function. We perform these steps to prepare the data
for our random forest implementation.

plot(my_data1$sldprice)

## removing NAs from the data

new_data <- na.omit(my_data1)

Now to the actual work. Just like in the description example, we are going to use all the features to create the random forest. Since regression
analysis makes sense with the importance value, we need to include this keyword.

Q1. Just like in the example, can you model the data for selling price, including all of the variables and with the "importance" parameter set to
TRUE, and print out the fit?

## Your Answer Code Here: ##

fit1 <- randomForest(sldprice~hh_avinc+rooms+beds+sway_1+hway_1+d_cbd+detach+air_con+brick+bsmt_fin,data=new_data,importance=TRUE)


print(fit1)

From the resulting fit, we see that 3 variables are randomly selected at each tree node (mtry = 3) and that the model can explain ~73% of the variability
in the data. You can learn more about the meaning of these values in a regression module.
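As a sketch, both quantities quoted above can also be read directly off the fit object, assuming fit1 from Q1; mtry and rsq are standard fields of a randomForest regression fit:

```r
# Both numbers from the printed summary are stored on the fit object:
# fit1$mtry : number of variables randomly tried at each split
# fit1$rsq  : running pseudo R-squared (1 - MSE / Var(y)), one entry per tree
fit1$mtry
tail(fit1$rsq, 1)  # fraction of variability explained by the full forest
```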
Let us have a look at the importance factor. We can use the type keyword in the importance function to look at only the percentage increase in
MSE.

Q2: Can you print out the importance factor, preferably rounded to two decimal places and comment on the values observed?

## Your Answer Code Here: ##


round(importance(fit1,type=1),2)

The important deciding factors for housing prices are the average household income (hh_avinc), the distance to the centre of town (d_cbd), and
the number of rooms (rooms).
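The same ranking can also be shown graphically. As a sketch, assuming fit1 from Q1 (which was fit with importance=TRUE), the varImpPlot() function from the randomForest package draws the variables ordered by importance:

```r
# Plot the variables ranked by importance; type = 1 restricts the plot
# to the %IncMSE measure used above (requires importance = TRUE in the fit).
varImpPlot(fit1, type = 1, main = "Variable importance (%IncMSE)")
```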

When the trees are grown, some pairs of observations will repeatedly end up in the same terminal node. The algorithm computes a
parameter called proximity to capture this: the proximity of two observations is the fraction of trees in which they land in the same
terminal node. When we use this keyword while computing the model, this proximity matrix is computed and stored with the fit.

To understand this, we will fit the model again with this keyword, print the model and look at the importance value as before.

Q3: Fit the model with the same variables along with the proximity keyword, then print the fit and the importance factor. Compare and comment on the
importance factors relative to the previous fit.

## Your Answer Code Here: ##

fit2 <- randomForest(sldprice~hh_avinc+rooms+beds+sway_1+hway_1+d_cbd+detach+air_con+brick+bsmt_fin,data=new_data,proximity=TRUE,na.action=na.omit,importance=TRUE)
print(fit2)
round(importance(fit2,type=1),2)
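As a quick check of what the keyword actually computed, the stored matrix can be inspected directly (a sketch, assuming fit2 from Q3):

```r
# fit2$proximity is an N x N matrix: entry [i, j] is the fraction of trees
# in which observations i and j landed in the same terminal node.
dim(fit2$proximity)
round(fit2$proximity[1:5, 1:5], 2)  # proximities among the first 5 houses
```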

Although both models identify the average household income as the most important deciding factor for housing prices, the importance of
the distance to the centre of town and the number of rooms differ significantly.

Also, from the % variation explained, the second model fits the data slightly better.
These small variations can make a noticeable difference when we use these models to predict future housing prices.
Now, let us plot these two models to see how the errors evolved during the process.
Now, let us plot these two models to see how the errors evolved during the process.

Q4: Divide the plot area into two, plot the individual fits side by side, and comment on your observation.

## Your Answer Code Here: ##


par(mfrow=c(1,2))
plot(fit1)
plot(fit2)

The errors decrease almost exponentially as the number of trees increases. However, it is also interesting to note that the range of errors
differs between the two models. This gives us a helpful clue for understanding how the underlying algorithm behaves with respect to the proximity keyword.
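The error curves plotted above are stored on each fit as mse, with one out-of-bag value per tree, so the end points and ranges can also be compared numerically (a sketch, assuming fit1 and fit2 from above):

```r
# Final out-of-bag mean squared error of each forest
tail(fit1$mse, 1)
tail(fit2$mse, 1)

# Range of the error over the growing forest, per model
range(fit1$mse)
range(fit2$mse)
```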

It is important to understand that there will be differences each time you run the same forest with the same parameters, due to the random
nature of the algorithm. Hence, the percentages may not match every time.
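One way to get repeatable numbers is to fix R's random seed immediately before each fit (a sketch, assuming new_data from above):

```r
# Two forests grown from the same seed are identical
set.seed(123)
fit_a <- randomForest(sldprice ~ ., data = new_data)
set.seed(123)
fit_b <- randomForest(sldprice ~ ., data = new_data)
identical(fit_a$mse, fit_b$mse)  # TRUE
```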

Now that you have learnt how to use the random forest algorithm for both classification and regression analysis, it is time for you
to try these techniques on your own datasets. Best wishes!

Want to learn more?


IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive
intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course,
available here: SPSS Modeler for Mac users and SPSS Modeler for Windows users

Also, you can use Data Science Experience to run these notebooks faster with bigger datasets. Data Science Experience is IBM's leading cloud
solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the
cloud, DSX enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of DSX
users today with a free account at Data Science Experience
Thanks for completing this lesson!
Notebook created by: Vino Sangaralingam
Copyright © 2017 [IBM Cognitive Class](https://cognitiveclass.ai/?utm_source=ML0151&utm_medium=lab&utm_campaign=cclab). This notebook and its source
code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).
