Random Forest Regression in Python

Last Updated : 07 Apr, 2025

A random forest is an ensemble learning method that combines the predictions from multiple decision trees to produce a more accurate and stable prediction. It is a type of supervised learning algorithm that can be used for both classification and regression tasks.

For regression tasks we can use Random Forest Regression to predict numerical values. It predicts continuous values by averaging the outputs of multiple decision trees.

Working of Random Forest Regression

Random Forest Regression works by creating multiple decision trees, each trained on a random subset of the data. The process begins with bootstrap sampling, where rows are selected at random with replacement to form a different training dataset for each tree. This is followed by feature sampling, where only a random subset of features is considered at each split, ensuring diversity among the trees.

Once the trees are trained, each one makes its own prediction, and the final prediction for a regression task is the average of all the individual tree predictions. This step is called aggregation; a minimal from-scratch sketch of the whole process is shown below.
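The sketch below is an illustration only, not the article's pipeline: it uses hypothetical toy data and scikit-learn's DecisionTreeRegressor to show bootstrap sampling and averaging by hand.

Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical toy data: one feature, noisy non-linear target
X_toy = rng.uniform(0, 10, size=(100, 1))
y_toy = X_toy[:, 0] ** 2 + rng.normal(0, 5, size=100)

# Bootstrap sampling: each tree is trained on rows drawn with replacement
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_toy[idx], y_toy[idx]))

# Aggregation: the ensemble prediction is the average of the tree predictions
X_new = np.array([[5.0]])
print(np.mean([t.predict(X_new)[0] for t in trees]))

With a single feature there is nothing to subsample at each split; in scikit-learn's forests, per-split feature sampling is controlled by the max_features parameter.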

Random Forest Regression Model Working

This approach is beneficial because an individual decision tree has high variance and is prone to overfitting, especially on complex data. By averaging the predictions of many trees, Random Forest reduces this variance, leading to more accurate and stable predictions and better generalization, as the short comparison below illustrates.
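As a hedged illustration of this variance reduction, the following snippet compares a single deep tree with a forest on a held-out split; the synthetic dataset and all names here are invented for the demo.

Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 10, size=(300, 1))
y_syn = np.sin(X_syn[:, 0]) * 10 + rng.normal(0, 2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# The averaged ensemble typically scores higher on unseen data than one deep tree
print('Single tree R2  :', r2_score(y_te, tree.predict(X_te)))
print('Random forest R2:', r2_score(y_te, forest.predict(X_te)))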

Implementing Random Forest Regression in Python

We will implement Random Forest Regression on a small salaries dataset.

1. Import Libraries

Here we import numpy, pandas, matplotlib, seaborn and scikit-learn, and suppress warnings to keep the output clean.

  • RandomForestRegressor: the ensemble regression model, built from multiple decision trees.
  • LabelEncoder: encodes categorical data into numerical values.
  • KNNImputer: imputes missing values using a k-nearest neighbors approach (imported for completeness; this dataset has no missing values).
  • train_test_split: splits a dataset into training and testing sets.
  • StandardScaler: standardizes features by removing the mean and scaling to unit variance.
  • mean_squared_error, r2_score: regression metrics used later to evaluate the model.
  • cross_val_score: performs k-fold cross-validation to evaluate the performance of a model.
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor

warnings.filterwarnings('ignore')

2. Import Dataset

Now let's load the dataset into a pandas DataFrame, which gives us convenient functions for handling and transforming the data. You can download the dataset from here.

Python
# Load the salaries dataset into a DataFrame
df = pd.read_csv('Salaries.csv')
print(df)

Output:

            Position  Level   Salary
0   Business Analyst      1    45000
1  Junior Consultant      2    50000
2  Senior Consultant      3    60000
3            Manager      4    80000
4    Country Manager      5   110000
5     Region Manager      6   150000
6            Partner      7   200000
7     Senior Partner      8   300000
8            C-level      9   500000
9                CEO     10  1000000

Python
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Position  10 non-null     object
 1   Level     10 non-null     int64
 2   Salary    10 non-null     int64
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes

3. Data Preparation

Here the code extracts two subsets of data from the DataFrame and stores them in separate variables.

  • Extracting Features: extracts the Level column (column index 1) from the DataFrame and stores it in a variable named X.
  • Extracting Target Variable: extracts the Salary column (column index 2) from the DataFrame and stores it in a variable named y.
Python
X = df.iloc[:, 1:2].values  # 'Level' column as a 2D feature array
y = df.iloc[:, 2].values    # 'Salary' column as the target vector

4. Random Forest Regressor Model

The code encodes the categorical column numerically, combines it with the numerical feature, and trains a Random Forest Regression model on the prepared data. Note that the target column Salary must be dropped from the feature matrix, otherwise the model would simply read the answer off its own input.

  • RandomForestRegressor: builds multiple decision trees and combines their predictions.
  • n_estimators=10: the number of decision trees in the forest (10 trees in this case).
  • random_state=0: controls the randomness in model training for reproducibility.
  • oob_score=True: enables out-of-bag scoring, which evaluates the model on the rows each tree did not see during training.
  • LabelEncoder(): converts categorical (object-type) variables into numerical values, making them suitable for machine learning models.
  • select_dtypes(): selects columns by data type: include=['object'] picks the categorical columns and exclude=['object'] picks the numerical ones.
  • apply(label_encoder.fit_transform): applies the LabelEncoder to each categorical column, converting string labels into numbers.
  • concat(): joins the numerical and encoded categorical features side by side into one feature matrix, which is then used as input for the model.
  • fit(): trains the Random Forest model on the feature matrix (x) and target variable (y).
Python
# pandas, RandomForestRegressor and LabelEncoder are already imported above

# Encode the categorical column and build the feature matrix
label_encoder = LabelEncoder()
x_categorical = df.select_dtypes(include=['object']).apply(label_encoder.fit_transform)
x_numerical = df.select_dtypes(exclude=['object']).drop(columns=['Salary'])  # drop the target
x = pd.concat([x_numerical, x_categorical], axis=1).values  # columns: [Level, Position]

regressor = RandomForestRegressor(n_estimators=10, random_state=0, oob_score=True)

regressor.fit(x, y)

5. Making Predictions and Evaluation

The code evaluates the trained Random Forest Regression model:

  • oob_score_: retrieves the out-of-bag (OOB) score, which estimates the model's generalization performance from the samples each tree did not train on.
  • predict(): makes predictions with the trained model and stores them in the 'predictions' array.
  • mean_squared_error() and r2_score(): evaluate the model's performance using the Mean Squared Error (MSE) and R-squared (R2) metrics, defined below.
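For reference, with y_i the true values, ŷ_i the predictions and ȳ the mean of the true values over n samples, the two metrics are:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$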
Python
from sklearn.metrics import mean_squared_error, r2_score

oob_score = regressor.oob_score_
print(f'Out-of-Bag Score: {oob_score}')

predictions = regressor.predict(x)

mse = mean_squared_error(y, predictions)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y, predictions)
print(f'R-squared: {r2}')

Output:


Out-of-Bag Score: 0.644879832593859
Mean Squared Error: 2647325000.0
R-squared: 0.9671801245316117
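Since cross_val_score was imported at the start, here is a minimal, hedged sketch of how it could be used with this regressor. With only 10 rows, each of the 5 folds holds out just 2 samples, so treat the scores as illustrative rather than a reliable estimate.

Python
# k-fold cross-validation: the regressor is re-fit from scratch on each training fold
cv_scores = cross_val_score(regressor, x, y, cv=5, scoring='r2')
print('Cross-validated R2 scores:', cv_scores)
print('Mean CV R2:', cv_scores.mean())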

6. Visualization

Now let's visualize the results of the Random Forest Regression model on our salaries dataset.

  • Creates a fine grid of Level values covering the range of the feature.
  • Pads the grid with zeros for the encoded Position column, so it matches the two-feature shape the model was trained on.
  • Plots the real data points as blue scatter points and the predicted values over the grid as a green line.
  • Adds labels and a title to the plot for better readability.
Python
# Build a dense grid over Level and pad the encoded Position column with zeros,
# because the regressor was trained on two features: [Level, Position]
X_grid = np.arange(min(X[:, 0]), max(X[:, 0]), 0.01).reshape(-1, 1)
X_grid = np.hstack((X_grid, np.zeros((X_grid.shape[0], 1))))

# Plot the real data points and the model's predictions over the grid
plt.scatter(X[:, 0], y, color='blue', label='Actual Data')
plt.plot(X_grid[:, 0], regressor.predict(X_grid), color='green', label='Random Forest Prediction')
plt.title('Random Forest Regression Results')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.legend()
plt.show()

Output: 

[Plot: actual salaries shown as blue points with the Random Forest prediction curve in green]

7. Visualizing a Single Decision Tree from the Random Forest Model

The code visualizes one of the decision trees from the trained Random Forest model, displaying the decision-making process of a single tree within the ensemble.

Python
from sklearn.tree import plot_tree

# Select the first tree in the ensemble
tree_to_plot = regressor.estimators_[0]

# Plot it; the feature names must match the columns the model was trained on
plt.figure(figsize=(20, 10))
plot_tree(tree_to_plot, feature_names=['Level', 'Position'], filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree from Random Forest")
plt.show()

Output:

[Plot: a single decision tree from the trained Random Forest]

Applications of Random Forest Regression

Random Forest Regression is used for a wide range of real-world problems, including:

  • Predicting continuous numerical values: Predicting house prices, stock prices or customer lifetime value.
  • Identifying risk factors: Detecting risk factors for diseases, financial crises or other negative events.
  • Handling high-dimensional data: Analyzing datasets with a large number of input features.
  • Capturing complex relationships: Modeling complex relationships between input features and the target variable.

Advantages of Random Forest Regression

  • Handles Non-Linearity: it can capture complex, non-linear relationships in the data that simpler models might miss.
  • Reduces Overfitting: by combining multiple decision trees and averaging their predictions it reduces the risk of overfitting compared to a single decision tree.
  • Robust to Outliers: Random Forest is less sensitive to outliers because it aggregates the predictions from multiple trees.
  • Works Well with Large Datasets: it can efficiently handle large datasets and high-dimensional data without a significant loss in performance.
  • Tolerates Missing Data: missing values are commonly imputed first (e.g., with KNNImputer), and some tree implementations use surrogate splits; either way, Random Forest typically maintains good accuracy on incomplete data.
  • No Need for Feature Scaling: unlike many other algorithms, Random Forest does not require normalization or scaling of the data, as the sketch after this list illustrates.
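A hedged sketch of that last point, on invented synthetic data: because trees split on thresholds rather than distances, monotonically rescaling the features should leave the fitted forest's predictions essentially unchanged.

Python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(200, 2))             # hypothetical features
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=200)

# One forest on raw features, one on standardized features, same random_state
raw = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
scaler = StandardScaler().fit(X_demo)
scaled = RandomForestRegressor(n_estimators=50, random_state=0).fit(
    scaler.transform(X_demo), y_demo)

# Same split structure up to threshold values, so the predictions should agree
# (barring rare floating-point tie-breaking effects)
print(np.allclose(raw.predict(X_demo), scaled.predict(scaler.transform(X_demo))))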

Disadvantages of Random Forest Regression

  • Complexity: it can be computationally expensive and slow to train, especially with a large number of trees and high-dimensional data.
  • Less Interpretability: since it combines many trees, it is harder to interpret than simpler models like linear regression or a single decision tree.
  • Memory Intensive: storing many decision trees for large datasets requires significant memory.
  • Overfitting on Noisy Data: while Random Forest reduces overfitting, it can still overfit when the data is highly noisy.
  • Sensitive to Skewed Data: it may perform poorly when parts of the target range are heavily under-represented in the training data.
  • Difficulty in Real-Time Predictions: because every prediction must pass through many trees, it may be too slow for strict real-time latency requirements.

Random Forest Regression is a powerful tool for continuous prediction tasks, with clear advantages over a single decision tree. Its ability to handle high-dimensional data, capture complex relationships and reduce overfitting has made it a popular choice for a variety of applications, and Python's scikit-learn library makes implementing Random Forest Regression models straightforward.


