Random Forest Regression in Python

Last Updated : 07 Apr, 2025

A random forest is an ensemble learning method that combines the predictions from multiple decision trees to produce a more accurate and stable prediction. It is a type of supervised learning algorithm that can be used for both classification and regression tasks.

For regression tasks we can use Random Forest Regression to predict numerical values. It predicts continuous values by averaging the outputs of multiple decision trees.

Working of Random Forest Regression

Random Forest Regression works by creating multiple decision trees, each trained on a random subset of the data. The process begins with bootstrap sampling, where rows are selected at random with replacement to form a different training dataset for each tree. This is followed by feature sampling, where only a random subset of features is considered at each split, ensuring diversity among the trees.

Once the trees are trained, each one makes its own prediction, and the final prediction for a regression task is the average of all the individual tree predictions. This step is called aggregation; a minimal from-scratch sketch of the whole process is shown below.
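The sketch below is an illustration only, not the article's pipeline: it uses hypothetical toy data and scikit-learn's DecisionTreeRegressor to show bootstrap sampling and averaging by hand.

Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical toy data: one feature, noisy non-linear target
X_toy = rng.uniform(0, 10, size=(100, 1))
y_toy = X_toy[:, 0] ** 2 + rng.normal(0, 5, size=100)

# Bootstrap sampling: each tree is trained on rows drawn with replacement
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_toy[idx], y_toy[idx]))

# Aggregation: the ensemble prediction is the average of the tree predictions
X_new = np.array([[5.0]])
print(np.mean([t.predict(X_new)[0] for t in trees]))

With a single feature there is nothing to subsample at each split; in scikit-learn's forests, per-split feature sampling is controlled by the max_features parameter.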

Random Forest Regression Model Working

This approach is beneficial because an individual decision tree has high variance and is prone to overfitting, especially on complex data. By averaging the predictions of many trees, Random Forest reduces this variance, leading to more accurate and stable predictions and better generalization, as the short comparison below illustrates.
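As a hedged illustration of this variance reduction, the following snippet compares a single deep tree with a forest on a held-out split; the synthetic dataset and all names here are invented for the demo.

Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 10, size=(300, 1))
y_syn = np.sin(X_syn[:, 0]) * 10 + rng.normal(0, 2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# The averaged ensemble typically scores higher on unseen data than one deep tree
print('Single tree R2  :', r2_score(y_te, tree.predict(X_te)))
print('Random forest R2:', r2_score(y_te, forest.predict(X_te)))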

Implementing Random Forest Regression in Python

We will implement Random Forest Regression on a small salaries dataset.

1. Import Libraries

Here we import numpy, pandas, matplotlib, seaborn and scikit-learn, and suppress warnings to keep the output clean.

  • RandomForestRegressor: the ensemble regression model, built from multiple decision trees.
  • LabelEncoder: encodes categorical data into numerical values.
  • KNNImputer: imputes missing values using a k-nearest neighbors approach (imported for completeness; this dataset has no missing values).
  • train_test_split: splits a dataset into training and testing sets.
  • StandardScaler: standardizes features by removing the mean and scaling to unit variance.
  • mean_squared_error, r2_score: regression metrics used later to evaluate the model.
  • cross_val_score: performs k-fold cross-validation to evaluate the performance of a model.
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor

warnings.filterwarnings('ignore')

2. Import Dataset

Now let's load the dataset into a pandas DataFrame, which gives us convenient functions for handling and transforming the data. You can download the dataset from here.

Python
# Load the salaries dataset into a DataFrame
df = pd.read_csv('Salaries.csv')
print(df)

Output:

            Position  Level   Salary
0   Business Analyst      1    45000
1  Junior Consultant      2    50000
2  Senior Consultant      3    60000
3            Manager      4    80000
4    Country Manager      5   110000
5     Region Manager      6   150000
6            Partner      7   200000
7     Senior Partner      8   300000
8            C-level      9   500000
9                CEO     10  1000000

Python
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Position  10 non-null     object
 1   Level     10 non-null     int64
 2   Salary    10 non-null     int64
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes

3. Data Preparation

Here the code extracts two subsets of data from the DataFrame and stores them in separate variables.

  • Extracting Features: extracts the Level column (column index 1) from the DataFrame and stores it in a variable named X.
  • Extracting Target Variable: extracts the Salary column (column index 2) from the DataFrame and stores it in a variable named y.
Python
X = df.iloc[:, 1:2].values  # 'Level' column as a 2D feature array
y = df.iloc[:, 2].values    # 'Salary' column as the target vector

4. Random Forest Regressor Model

The code encodes the categorical column numerically, combines it with the numerical feature, and trains a Random Forest Regression model on the prepared data. Note that the target column Salary must be dropped from the feature matrix, otherwise the model would simply read the answer off its own input.

  • RandomForestRegressor: builds multiple decision trees and combines their predictions.
  • n_estimators=10: the number of decision trees in the forest (10 trees in this case).
  • random_state=0: controls the randomness in model training for reproducibility.
  • oob_score=True: enables out-of-bag scoring, which evaluates the model on the rows each tree did not see during training.
  • LabelEncoder(): converts categorical (object-type) variables into numerical values, making them suitable for machine learning models.
  • select_dtypes(): selects columns by data type: include=['object'] picks the categorical columns and exclude=['object'] picks the numerical ones.
  • apply(label_encoder.fit_transform): applies the LabelEncoder to each categorical column, converting string labels into numbers.
  • concat(): joins the numerical and encoded categorical features side by side into one feature matrix, which is then used as input for the model.
  • fit(): trains the Random Forest model on the feature matrix (x) and target variable (y).
Python
# pandas, RandomForestRegressor and LabelEncoder are already imported above

# Encode the categorical column and build the feature matrix
label_encoder = LabelEncoder()
x_categorical = df.select_dtypes(include=['object']).apply(label_encoder.fit_transform)
x_numerical = df.select_dtypes(exclude=['object']).drop(columns=['Salary'])  # drop the target
x = pd.concat([x_numerical, x_categorical], axis=1).values  # columns: [Level, Position]

regressor = RandomForestRegressor(n_estimators=10, random_state=0, oob_score=True)

regressor.fit(x, y)

5. Making Predictions and Evaluation

The code evaluates the trained Random Forest Regression model:

  • oob_score_: retrieves the out-of-bag (OOB) score, which estimates the model's generalization performance from the samples each tree did not train on.
  • predict(): makes predictions with the trained model and stores them in the 'predictions' array.
  • mean_squared_error() and r2_score(): evaluate the model's performance using the Mean Squared Error (MSE) and R-squared (R2) metrics, defined below.
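For reference, with y_i the true values, ŷ_i the predictions and ȳ the mean of the true values over n samples, the two metrics are:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$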
Python
from sklearn.metrics import mean_squared_error, r2_score

oob_score = regressor.oob_score_
print(f'Out-of-Bag Score: {oob_score}')

predictions = regressor.predict(x)

mse = mean_squared_error(y, predictions)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y, predictions)
print(f'R-squared: {r2}')

Output:


Out-of-Bag Score: 0.644879832593859
Mean Squared Error: 2647325000.0
R-squared: 0.9671801245316117
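Since cross_val_score was imported at the start, here is a minimal, hedged sketch of how it could be used with this regressor. With only 10 rows, each of the 5 folds holds out just 2 samples, so treat the scores as illustrative rather than a reliable estimate.

Python
# k-fold cross-validation: the regressor is re-fit from scratch on each training fold
cv_scores = cross_val_score(regressor, x, y, cv=5, scoring='r2')
print('Cross-validated R2 scores:', cv_scores)
print('Mean CV R2:', cv_scores.mean())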

6. Visualization

Now let's visualize the results of the Random Forest Regression model on our salaries dataset.

  • Creates a fine grid of Level values covering the range of the feature.
  • Pads the grid with zeros for the encoded Position column, so it matches the two-feature shape the model was trained on.
  • Plots the real data points as blue scatter points and the predicted values over the grid as a green line.
  • Adds labels and a title to the plot for better readability.
Python
# Build a dense grid over Level and pad the encoded Position column with zeros,
# because the regressor was trained on two features: [Level, Position]
X_grid = np.arange(min(X[:, 0]), max(X[:, 0]), 0.01).reshape(-1, 1)
X_grid = np.hstack((X_grid, np.zeros((X_grid.shape[0], 1))))

# Plot the real data points and the model's predictions over the grid
plt.scatter(X[:, 0], y, color='blue', label='Actual Data')
plt.plot(X_grid[:, 0], regressor.predict(X_grid), color='green', label='Random Forest Prediction')
plt.title('Random Forest Regression Results')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.legend()
plt.show()

Output: 

[Plot: actual salaries shown as blue points with the Random Forest prediction curve in green]

7. Visualizing a Single Decision Tree from the Random Forest Model

The code visualizes one of the decision trees from the trained Random Forest model, displaying the decision-making process of a single tree within the ensemble.

Python
from sklearn.tree import plot_tree

# Select the first tree in the ensemble
tree_to_plot = regressor.estimators_[0]

# Plot it; the feature names must match the columns the model was trained on
plt.figure(figsize=(20, 10))
plot_tree(tree_to_plot, feature_names=['Level', 'Position'], filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree from Random Forest")
plt.show()

Output:

[Plot: a single decision tree from the trained Random Forest]

Applications of Random Forest Regression

Random Forest Regression is used for a wide range of real-world problems, including:

  • Predicting continuous numerical values: Predicting house prices, stock prices or customer lifetime value.
  • Identifying risk factors: Detecting risk factors for diseases, financial crises or other negative events.
  • Handling high-dimensional data: Analyzing datasets with a large number of input features.
  • Capturing complex relationships: Modeling complex relationships between input features and the target variable.

Advantages of Random Forest Regression

  • Handles Non-Linearity: it can capture complex, non-linear relationships in the data that simpler models might miss.
  • Reduces Overfitting: by combining multiple decision trees and averaging their predictions it reduces the risk of overfitting compared to a single decision tree.
  • Robust to Outliers: Random Forest is less sensitive to outliers because it aggregates the predictions from multiple trees.
  • Works Well with Large Datasets: it can efficiently handle large datasets and high-dimensional data without a significant loss in performance.
  • Tolerates Missing Data: missing values are commonly imputed first (e.g., with KNNImputer), and some tree implementations use surrogate splits; either way, Random Forest typically maintains good accuracy on incomplete data.
  • No Need for Feature Scaling: unlike many other algorithms, Random Forest does not require normalization or scaling of the data, as the sketch after this list illustrates.
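A hedged sketch of that last point, on invented synthetic data: because trees split on thresholds rather than distances, monotonically rescaling the features should leave the fitted forest's predictions essentially unchanged.

Python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(200, 2))             # hypothetical features
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=200)

# One forest on raw features, one on standardized features, same random_state
raw = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
scaler = StandardScaler().fit(X_demo)
scaled = RandomForestRegressor(n_estimators=50, random_state=0).fit(
    scaler.transform(X_demo), y_demo)

# Same split structure up to threshold values, so the predictions should agree
# (barring rare floating-point tie-breaking effects)
print(np.allclose(raw.predict(X_demo), scaled.predict(scaler.transform(X_demo))))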

Disadvantages of Random Forest Regression

  • Complexity: it can be computationally expensive and slow to train, especially with a large number of trees and high-dimensional data.
  • Less Interpretability: since it combines many trees, it is harder to interpret than simpler models like linear regression or a single decision tree.
  • Memory Intensive: storing many decision trees for large datasets requires significant memory.
  • Overfitting on Noisy Data: while Random Forest reduces overfitting, it can still overfit when the data is highly noisy.
  • Sensitive to Skewed Data: it may perform poorly when parts of the target range are heavily under-represented in the training data.
  • Difficulty in Real-Time Predictions: because every prediction must pass through many trees, it may be too slow for strict real-time latency requirements.

Random Forest Regression is a powerful tool for continuous prediction tasks, with clear advantages over a single decision tree. Its ability to handle high-dimensional data, capture complex relationships and reduce overfitting has made it a popular choice for a variety of applications, and Python's scikit-learn library makes implementing Random Forest Regression models straightforward.


