Multiregression using CatBoost
Last Updated: 06 Jun, 2024
Multiregression, also known as multiple regression, is a statistical method used to predict a target variable based on two or more predictor variables. This technique is widely used in various fields such as finance, economics, marketing, and machine learning. CatBoost, a powerful gradient boosting library, provides efficient and robust algorithms for multiregression tasks. In this article, we will explore how to leverage CatBoost for multiregression and achieve accurate predictions.
Implementing Multiregression with CatBoost
Let's dive into a practical example of using CatBoost for multiregression:
Install CatBoost
Ensure you have CatBoost installed in your Python environment. You can install it via pip:
pip install catboost
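Once installed, you can optionally confirm the setup by checking the library version from Python (the exact version string will depend on your environment):
Python
import catboost

# Print the installed CatBoost version to confirm the installation worked
print(catboost.__version__)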
Step 1: Loading a Public Dataset
We'll use a publicly accessible online dataset for this example and load it directly from its URL.
Python
import pandas as pd
# Load dataset
url = 'https://media.geeksforgeeks.org/wp-content/uploads/20240527142547/BostonHousing.csv'
df = pd.read_csv(url)
print(df.head())
Output:
crim zn indus chas nox rm age dis rad tax ptratio \
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7
b lstat medv
0 396.90 4.98 24.0
1 396.90 9.14 21.6
2 392.83 4.03 34.7
3 394.63 2.94 33.4
4 396.90 5.33 36.2
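Before moving on, it can help to take a quick look at the size, column types, and missing values of the loaded frame; a minimal inspection sketch (using only the df loaded above) might look like this:
Python
# Quick inspection of the dataset
print(df.shape)           # (number of rows, number of columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # count of missing values per column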
Step 2: Preprocessing Data
Before preparing the data for modeling, let's first visualize the distribution of the target variable, medv (the median home value we want to predict).
Python
import seaborn as sns
import matplotlib.pyplot as plt
# Visualize the distribution of the target variable
sns.histplot(df['medv'], bins=30, kde=True)
plt.title('Distribution of MEDV (Median Value of Homes)')
plt.savefig('Distribution.webp')
plt.show()
Output:
(Histogram of medv with a KDE overlay, titled 'Distribution of MEDV (Median Value of Homes)')
Next, our data must be made ready for the model. In general this covers handling missing values, standardizing the features, and encoding categorical features; for this dataset we split off the target, create train/test sets, and standardize the numeric features.
Python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split the data into features and target
X = df.drop('medv', axis=1)
y = df['medv']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Normalize the feature data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
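As an optional variation, CatBoost also provides its own Pool data structure for bundling features and labels (and, when present, categorical feature indices). A minimal sketch using the splits created above could look like this; note that scaling is not strictly required for tree-based models like CatBoost, so passing the unscaled splits would also work:
Python
from catboost import Pool

# Wrap the training and test splits in CatBoost's native Pool objects;
# cat_features could also be passed here if the dataset had categorical columns
train_pool = Pool(X_train_scaled, label=y_train)
test_pool = Pool(X_test_scaled, label=y_test)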
Step 3: Training the Model
Now, we will define and train our CatBoost regressor model.
Python
from catboost import CatBoostRegressor
# Initialize the CatBoostRegressor
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=3,
    loss_function='RMSE',
    verbose=200
)
# Fit the model
model.fit(X_train_scaled, y_train)
Output:
0: learn: 9.0223472 total: 138ms remaining: 2m 18s
200: learn: 2.4369710 total: 252ms remaining: 1s
400: learn: 1.8078506 total: 365ms remaining: 545ms
600: learn: 1.4641839 total: 475ms remaining: 315ms
800: learn: 1.2249782 total: 587ms remaining: 146ms
999: learn: 1.0551550 total: 696ms remaining: 0us
<catboost.core.CatBoostRegressor at 0x193071691d0>
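As an optional refinement, you can pass an evaluation set to fit so that CatBoost reports validation metrics during training and stops early once they stop improving. The sketch below reuses the test split for brevity; in practice a separate validation split is preferable, and values such as early_stopping_rounds=50 are illustrative choices rather than tuned settings:
Python
# Train a second model with a held-out evaluation set and early stopping
model_es = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=3,
    loss_function='RMSE',
    verbose=200
)
model_es.fit(
    X_train_scaled, y_train,
    eval_set=(X_test_scaled, y_test),
    early_stopping_rounds=50,
    use_best_model=True
)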
Step 4: Making Predictions and Evaluating the Model
After training, we make predictions on the test set and evaluate our model using RMSE.
Python
from sklearn.metrics import mean_squared_error
# Make predictions
predictions = model.predict(X_test_scaled)
# Calculate RMSE
rmse = mean_squared_error(y_test, predictions, squared=False)
print(f'Root Mean Squared Error: {rmse}')
Output:
Root Mean Squared Error: 2.9516912601424115
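Optionally, additional metrics such as MAE and R² give a fuller picture of model quality. Note that recent scikit-learn releases deprecate the squared=False flag in favor of root_mean_squared_error; the sketch below computes the RMSE with a plain square root to stay version-agnostic:
Python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Additional evaluation metrics on the test set
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print(f'MAE: {mae:.3f}, RMSE: {rmse:.3f}, R^2: {r2:.3f}')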
Step 5: Visualizing the Results
Lastly, in order to evaluate the performance of our model, we will plot the actual values against the predictions.
Python
# Visualize the actual vs predicted values
plt.scatter(y_test, predictions)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')  # Diagonal reference line
plt.show()
Output:
(Scatter plot of actual vs. predicted MEDV values with the red diagonal reference line)
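As a final optional step, CatBoost can also report how much each feature contributed to the trained model; a short sketch using the model and DataFrame from the steps above might look like this:
Python
# Feature importances from the trained CatBoost model, highest first
importances = model.get_feature_importance()
for name, score in sorted(zip(X.columns, importances), key=lambda t: t[1], reverse=True):
    print(f'{name}: {score:.2f}')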
This example offers a detailed walkthrough of using CatBoost for multiregression, covering data preparation, model training, and result visualization. Recall that practice and experimentation are the keys to mastering machine learning, so feel free to experiment with other datasets and parameter adjustments to observe how the model performs.
Understanding Multiregression
Multiregression extends the concept of simple linear regression by allowing multiple independent variables to be used in predicting a dependent variable. The relationship between the predictor variables and the target variable is expressed through a linear equation:
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon
Where:
- Y is the dependent variable (target).
- X_1, X_2, \ldots, X_n are the independent variables (predictors).
- \beta_0, \beta_1, \ldots, \beta_n are the coefficients representing the strength and direction of the relationship between the predictors and the target.
- \epsilon is the error term.
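To make the equation concrete, here is a minimal numeric sketch with two predictors; the coefficients, inputs, and error values are made up purely for illustration:
Python
import numpy as np

# Illustrative evaluation of Y = beta_0 + beta_1*X_1 + beta_2*X_2 + epsilon
beta_0, beta_1, beta_2 = 2.0, 0.5, -1.5   # hypothetical coefficients
X_1 = np.array([1.0, 2.0, 3.0])           # first predictor for three samples
X_2 = np.array([4.0, 5.0, 6.0])           # second predictor for three samples
epsilon = np.array([0.1, -0.2, 0.05])     # hypothetical error term

Y = beta_0 + beta_1 * X_1 + beta_2 * X_2 + epsilon
print(Y)  # target values implied by the linear relationship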
What is CatBoost?
CatBoost stands for Categorical Boosting. It is an open-source gradient boosting library developed by Yandex that is particularly powerful for datasets with categorical features. Gradient boosting is a machine learning technique well suited to regression problems, and CatBoost is renowned for being fast and effective. It is a versatile tool that works well with many kinds of data, including categorical features (such as colors or types). Among CatBoost's noteworthy attributes are:
- Support for Categorical Data: Unlike many other boosting algorithms, CatBoost can directly handle categorical features without the need for explicit encoding (see the sketch after this list).
- Fast Training and Prediction: CatBoost is optimized for speed, making it suitable for large datasets.
- Excellent Performance: In terms of accuracy and generalization, it frequently performs better than other gradient boosting techniques like XGBoost and LightGBM.
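As referenced in the list above, a minimal sketch of handing raw categorical columns directly to CatBoost might look like this; the tiny DataFrame and column names are made up for illustration:
Python
import pandas as pd
from catboost import CatBoostRegressor

# Hypothetical dataset with a raw categorical column (no manual encoding needed)
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue'],   # categorical feature
    'size': [1.0, 2.5, 3.2, 0.7],                # numeric feature
    'price': [10.0, 20.0, 30.0, 8.0]             # target
})

cat_model = CatBoostRegressor(iterations=100, verbose=0)
cat_model.fit(data[['color', 'size']], data['price'], cat_features=['color'])
print(cat_model.predict(data[['color', 'size']]))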
Conclusion
Multiregression is a powerful technique for predicting a target variable based on multiple predictor variables. With the advent of advanced machine learning libraries like CatBoost, performing multiregression tasks has become more accessible and efficient. By following the steps outlined in this article, you can leverage CatBoost to build accurate multiregression models for a wide range of applications. Experiment with different parameters and features to fine-tune your models and achieve optimal performance.