Multiregression using CatBoost
Last Updated: 06 Jun, 2024
Multiregression, also known as multiple regression, is a statistical method used to predict a target variable based on two or more predictor variables. This technique is widely used in various fields such as finance, economics, marketing, and machine learning. CatBoost, a powerful gradient boosting library, provides efficient and robust algorithms for multiregression tasks. In this article, we will explore how to leverage CatBoost for multiregression and achieve accurate predictions.
Implementing Multiregression with CatBoost
Let's dive into a practical example of using CatBoost for multiregression:
Install CatBoost
Ensure you have CatBoost installed in your Python environment. You can install it via pip:
pip install catboost
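Once installed, you can optionally confirm the setup by checking the library version from Python (the exact version string will depend on your environment):
Python
import catboost

# Print the installed CatBoost version to confirm the installation worked
print(catboost.__version__)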
Step 1: Loading a Public Dataset
We'll use a publicly accessible online dataset for this example and load it directly from its URL.
Python
import pandas as pd
# Load dataset
url = 'https://media.geeksforgeeks.org/wp-content/uploads/20240527142547/BostonHousing.csv'
df = pd.read_csv(url)
print(df.head())
Output:
crim zn indus chas nox rm age dis rad tax ptratio \
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7
b lstat medv
0 396.90 4.98 24.0
1 396.90 9.14 21.6
2 392.83 4.03 34.7
3 394.63 2.94 33.4
4 396.90 5.33 36.2
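Before moving on, it can help to take a quick look at the size, column types, and missing values of the loaded frame; a minimal inspection sketch (using only the df loaded above) might look like this:
Python
# Quick inspection of the dataset
print(df.shape)           # (number of rows, number of columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # count of missing values per column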
Step 2: Preprocessing Data
Before preparing the data for modeling, let's first visualize the distribution of the target variable, medv (the median home value we want to predict).
Python
import seaborn as sns
import matplotlib.pyplot as plt
# Visualize the distribution of the target variable
sns.histplot(df['medv'], bins=30, kde=True)
plt.title('Distribution of MEDV (Median Value of Homes)')
plt.savefig('Distribution.webp')
plt.show()
Output:
(Histogram of medv with a KDE overlay, titled 'Distribution of MEDV (Median Value of Homes)')
Next, our data must be made ready for the model. In general this covers handling missing values, standardizing the features, and encoding categorical features; for this dataset we split off the target, create train/test sets, and standardize the numeric features.
Python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split the data into features and target
X = df.drop('medv', axis=1)
y = df['medv']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Normalize the feature data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
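As an optional variation, CatBoost also provides its own Pool data structure for bundling features and labels (and, when present, categorical feature indices). A minimal sketch using the splits created above could look like this; note that scaling is not strictly required for tree-based models like CatBoost, so passing the unscaled splits would also work:
Python
from catboost import Pool

# Wrap the training and test splits in CatBoost's native Pool objects;
# cat_features could also be passed here if the dataset had categorical columns
train_pool = Pool(X_train_scaled, label=y_train)
test_pool = Pool(X_test_scaled, label=y_test)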
Step 3: Training the Model
Now, we will define and train our CatBoost regressor model.
Python
from catboost import CatBoostRegressor
# Initialize the CatBoostRegressor
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=3,
    loss_function='RMSE',
    verbose=200
)
# Fit the model
model.fit(X_train_scaled, y_train)
Output:
0: learn: 9.0223472 total: 138ms remaining: 2m 18s
200: learn: 2.4369710 total: 252ms remaining: 1s
400: learn: 1.8078506 total: 365ms remaining: 545ms
600: learn: 1.4641839 total: 475ms remaining: 315ms
800: learn: 1.2249782 total: 587ms remaining: 146ms
999: learn: 1.0551550 total: 696ms remaining: 0us
<catboost.core.CatBoostRegressor at 0x193071691d0>
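As an optional refinement, you can pass an evaluation set to fit so that CatBoost reports validation metrics during training and stops early once they stop improving. The sketch below reuses the test split for brevity; in practice a separate validation split is preferable, and values such as early_stopping_rounds=50 are illustrative choices rather than tuned settings:
Python
# Train a second model with a held-out evaluation set and early stopping
model_es = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=3,
    loss_function='RMSE',
    verbose=200
)
model_es.fit(
    X_train_scaled, y_train,
    eval_set=(X_test_scaled, y_test),
    early_stopping_rounds=50,
    use_best_model=True
)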
Step 4: Making Predictions and Evaluating the Model
After training, we make predictions on the test set and evaluate our model using RMSE.
Python
from sklearn.metrics import mean_squared_error
# Make predictions
predictions = model.predict(X_test_scaled)
# Calculate RMSE
rmse = mean_squared_error(y_test, predictions, squared=False)
print(f'Root Mean Squared Error: {rmse}')
Output:
Root Mean Squared Error: 2.9516912601424115
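Optionally, additional metrics such as MAE and R² give a fuller picture of model quality. Note that recent scikit-learn releases deprecate the squared=False flag in favor of root_mean_squared_error; the sketch below computes the RMSE with a plain square root to stay version-agnostic:
Python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Additional evaluation metrics on the test set
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print(f'MAE: {mae:.3f}, RMSE: {rmse:.3f}, R^2: {r2:.3f}')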
Step 5: Visualizing the Results
Lastly, in order to evaluate the performance of our model, we will plot the actual values against the predictions.
Python
# Visualize the actual vs predicted values
plt.scatter(y_test, predictions)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')  # Diagonal reference line
plt.show()
Output:
(Scatter plot of actual vs. predicted MEDV values with the red diagonal reference line)
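As a final optional step, CatBoost can also report how much each feature contributed to the trained model; a short sketch using the model and DataFrame from the steps above might look like this:
Python
# Feature importances from the trained CatBoost model, highest first
importances = model.get_feature_importance()
for name, score in sorted(zip(X.columns, importances), key=lambda t: t[1], reverse=True):
    print(f'{name}: {score:.2f}')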
This example offers a detailed walkthrough of using CatBoost for multiregression, covering data preparation, model training, and result visualization. Recall that practice and experimentation are the keys to mastering machine learning, so feel free to experiment with other datasets and parameter adjustments to observe how the model performs.
Understanding Multiregression
Multiregression extends the concept of simple linear regression by allowing multiple independent variables to be used in predicting a dependent variable. The relationship between the predictor variables and the target variable is expressed through a linear equation:
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon
Where:
- Y is the dependent variable (target).
- X_1, X_2, \ldots, X_n are the independent variables (predictors).
- \beta_0, \beta_1, \ldots, \beta_n are the coefficients representing the strength and direction of the relationship between the predictors and the target.
- \epsilon is the error term.
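To make the equation concrete, here is a minimal numeric sketch with two predictors; the coefficients, inputs, and error values are made up purely for illustration:
Python
import numpy as np

# Illustrative evaluation of Y = beta_0 + beta_1*X_1 + beta_2*X_2 + epsilon
beta_0, beta_1, beta_2 = 2.0, 0.5, -1.5   # hypothetical coefficients
X_1 = np.array([1.0, 2.0, 3.0])           # first predictor for three samples
X_2 = np.array([4.0, 5.0, 6.0])           # second predictor for three samples
epsilon = np.array([0.1, -0.2, 0.05])     # hypothetical error term

Y = beta_0 + beta_1 * X_1 + beta_2 * X_2 + epsilon
print(Y)  # target values implied by the linear relationship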
What is CatBoost?
CatBoost stands for Categorical Boosting. It is an open-source gradient boosting library developed by Yandex that is particularly powerful for datasets with categorical features. Gradient boosting is a machine learning technique well suited to regression problems, and CatBoost is renowned for being fast and effective. It is a versatile tool that works well with many kinds of data, including categorical features (such as colors or types). Among CatBoost's noteworthy attributes are:
- Support for Categorical Data: Unlike many other boosting algorithms, CatBoost can directly handle categorical features without the need for explicit encoding (see the sketch after this list).
- Fast Training and Prediction: CatBoost is optimized for speed, making it suitable for large datasets.
- Excellent Performance: In terms of accuracy and generalization, it frequently performs better than other gradient boosting techniques like XGBoost and LightGBM.
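As referenced in the list above, a minimal sketch of handing raw categorical columns directly to CatBoost might look like this; the tiny DataFrame and column names are made up for illustration:
Python
import pandas as pd
from catboost import CatBoostRegressor

# Hypothetical dataset with a raw categorical column (no manual encoding needed)
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue'],   # categorical feature
    'size': [1.0, 2.5, 3.2, 0.7],                # numeric feature
    'price': [10.0, 20.0, 30.0, 8.0]             # target
})

cat_model = CatBoostRegressor(iterations=100, verbose=0)
cat_model.fit(data[['color', 'size']], data['price'], cat_features=['color'])
print(cat_model.predict(data[['color', 'size']]))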
Conclusion
Multiregression is a powerful technique for predicting a target variable based on multiple predictor variables. With the advent of advanced machine learning libraries like CatBoost, performing multiregression tasks has become more accessible and efficient. By following the steps outlined in this article, you can leverage CatBoost to build accurate multiregression models for a wide range of applications. Experiment with different parameters and features to fine-tune your models and achieve optimal performance.