CatBoost is one of the current state-of-the-art machine learning models and can be used for both regression and classification tasks. As the name suggests, CatBoost was built with datasets that contain categorical columns in mind. In this article, we will learn how to train a CatBoost model for classification on a student placement dataset taken from Kaggle.
CatBoost
CatBoost (Categorical Boosting) is a high-performance, open-source gradient-boosting framework developed by Yandex. It is designed to address a broad range of machine learning problems, such as regression, ranking, and classification, with a focus on handling categorical features effectively. CatBoost stands out for structured (tabular) data because of its speed, accuracy, and ease of use.
How Does CatBoost Work?
CatBoost is built on gradient boosting, an ensemble learning technique. The model typically starts from a simple initial prediction, such as the mean of the target variable. It then builds an ensemble of decision trees one at a time, with each new tree trained to reduce the residual errors left by the trees before it. What sets CatBoost apart is the way it handles categorical features: it can consume them directly, encoding each category with target statistics computed only from earlier rows in a random ordering of the data. A related scheme called "ordered boosting" applies the same idea to the boosting steps themselves, which helps reduce overfitting caused by target leakage.
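To make the categorical handling more concrete, here is a minimal pure-Python sketch of an ordered target statistic, the kind of encoding CatBoost's approach is based on. Each row's category is replaced by a smoothed average of the targets of earlier rows with the same category, so a row never sees its own label. The function name, the `prior` and `weight` parameters, and the toy data are assumptions made for illustration. The library's real implementation also uses random permutations and differs in detail.

```python
from collections import defaultdict

def ordered_target_stat(categories, targets, prior=0.5, weight=1.0):
    """Encode each category using only the targets of rows seen earlier.

    Hypothetical helper for illustration; not part of the catboost API.
    """
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running count of rows per category
    encoded = []
    for cat, y in zip(categories, targets):
        # Smoothed mean of the targets seen so far for this category.
        encoded.append((sums[cat] + weight * prior) / (counts[cat] + weight))
        # Only now add the current row, so it never encodes its own label.
        sums[cat] += y
        counts[cat] += 1
    return encoded

training = ["Yes", "No", "Yes", "Yes", "No"]   # toy PlacementTraining values
placed = [1, 0, 1, 0, 0]                       # toy 0/1 targets
encoded = ordered_target_stat(training, placed)
print(encoded)
```

The first occurrence of each category falls back to the prior, and later occurrences move toward the average target seen so far for that category.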
To prevent overfitting, regularization strategies are also included. When generating predictions, CatBoost combines the outputs of every tree in the ensemble into a single, more reliable prediction. It also provides feature importance rankings that help in understanding the model's decisions and in selecting features. CatBoost is a useful tool for many machine learning tasks, including regression and classification.
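The boosting loop described in this section can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not CatBoost's algorithm: it uses one-feature decision stumps and squared error on toy regression data, whereas CatBoost grows symmetric trees over many features and uses a log loss for classification. The names `fit_stump`, `boost`, and `mse` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the 1-D threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_trees=20, learning_rate=0.3):
    base = sum(ys) / len(ys)          # initial guess: the target mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # Each new stump is fitted to what the ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_val, right_val = fit_stump(xs, residuals)
        stumps.append((threshold, left_val, right_val))
        preds = [p + learning_rate * (left_val if x <= threshold else right_val)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, stumps, preds = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, preds))
```

The final prediction is the initial mean plus the scaled contributions of all stumps, which is the sense in which the ensemble "combines" its trees.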
Implementation Using CatBoost
!pip install catboost
Importing Libraries and Dataset
Python libraries make it easy to handle the data and perform both typical and complex tasks with a single line of code.
- Pandas – This library loads the data into a 2D DataFrame and provides many functions for analysis.
- Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
- Matplotlib/Seaborn – These libraries are used to draw visualizations.
- Sklearn – This package contains multiple modules with pre-implemented functions for tasks ranging from data preprocessing to model development and evaluation.
We'll use the placement dataset to perform a classification task using the CatBoost algorithm.
Python3
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score as ras
import warnings
warnings.filterwarnings('ignore')
First, we load the data into a pandas DataFrame.
Python3
df = pd.read_csv('placementdata.csv')
print(df.head())
Output:
StudentID CGPA Internships Projects Workshops/Certifications \
0 1 7.5 1 1 1
1 2 8.9 0 3 2
2 3 7.3 1 2 2
3 4 7.5 1 1 2
4 5 8.3 1 2 2
AptitudeTestScore SoftSkillsRating ExtracurricularActivities \
0 65 4.4 No
1 90 4.0 Yes
2 82 4.8 Yes
3 85 4.4 Yes
4 86 4.5 Yes
PlacementTraining SSC_Marks HSC_Marks PlacementStatus
0 No 61 79 NotPlaced
1 Yes 78 82 Placed
2 No 79 80 NotPlaced
3 Yes 81 80 Placed
4 Yes 74 88 Placed
If we take a moment to understand the data, we can see that this dataset contains information about students' academic records, training, and placement status.
- CGPA - The overall grade point average achieved by the student.
- Internships - The number of internships a student has done.
- Projects - The number of projects a student has done.
- Workshops/Certifications - The number of workshops or certifications completed. Many MOOC courses are available online, and students opt for them to upskill themselves.
- AptitudeTestScore - Aptitude tests are generally part of the recruitment process and assess the quantitative and logical thinking of the student.
- SoftSkillsRating - A rating of soft skills. Communication plays a key role in placement and in many other aspects of life.
- ExtracurricularActivities - This gives an insight into the personality of an individual, indicating how active the student is outside academics.
- PlacementTraining - Training provided to students in college to help them succeed in the placement process.
- SSC_Marks and HSC_Marks - Secondary School Certificate and Higher Secondary Certificate marks.
- PlacementStatus - This is our target column with two classes, Placed and NotPlaced.
That covers the columns. Now let's check the shape of the dataset to see how many rows and columns it has.
Python3
df.shape
Output:
(10000, 12)
Using the df.info() function, we can see the data type of each column along with the number of non-null values it contains.
Python3
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 StudentID 10000 non-null int64
1 CGPA 10000 non-null float64
2 Internships 10000 non-null int64
3 Projects 10000 non-null int64
4 Workshops/Certifications 10000 non-null int64
5 AptitudeTestScore 10000 non-null int64
6 SoftSkillsRating 10000 non-null float64
7 ExtracurricularActivities 10000 non-null object
8 PlacementTraining 10000 non-null object
9 SSC_Marks 10000 non-null int64
10 HSC_Marks 10000 non-null int64
11 PlacementStatus 10000 non-null object
dtypes: float64(2), int64(7), object(3)
memory usage: 937.6+ KB
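The df.describe() function gives a statistical summary of the DataFrame df. To provide a preliminary understanding of the data's central tendencies and distribution, it reports the count, mean, standard deviation, minimum, quartiles, and maximum for each numerical column. The output below is transposed so that each row corresponds to one column of the dataset. It also lists ExtracurricularActivities, PlacementTraining, and PlacementStatus, so it appears to have been produced after those columns were encoded as 0/1.
Python3
df.describe().T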
Output:
count mean std min 25% \
StudentID 10000.0 5000.50000 2886.895680 1.0 2500.75
CGPA 10000.0 7.69801 0.640131 6.5 7.40
Internships 10000.0 1.04920 0.665901 0.0 1.00
Projects 10000.0 2.02660 0.867968 0.0 1.00
Workshops/Certifications 10000.0 1.01320 0.904272 0.0 0.00
AptitudeTestScore 10000.0 79.44990 8.159997 60.0 73.00
SoftSkillsRating 10000.0 4.32396 0.411622 3.0 4.00
ExtracurricularActivities 10000.0 0.58540 0.492677 0.0 0.00
PlacementTraining 10000.0 0.73180 0.443044 0.0 0.00
SSC_Marks 10000.0 69.15940 10.430459 55.0 59.00
HSC_Marks 10000.0 74.50150 8.919527 57.0 67.00
PlacementStatus 10000.0 0.41970 0.493534 0.0 0.00
50% 75% max
StudentID 5000.5 7500.25 10000.0
CGPA 7.7 8.20 9.1
Internships 1.0 1.00 2.0
Projects 2.0 3.00 3.0
Workshops/Certifications 1.0 2.00 3.0
AptitudeTestScore 80.0 87.00 90.0
SoftSkillsRating 4.4 4.70 4.8
ExtracurricularActivities 1.0 1.00 1.0
PlacementTraining 1.0 1.00 1.0
SSC_Marks 70.0 78.00 90.0
HSC_Marks 73.0 83.00 88.0
PlacementStatus 0.0 1.00 1.0
Exploratory Data Analysis
EDA is an approach to analyzing data using visual techniques. It is used to discover trends and patterns and to check assumptions with the help of statistical summaries and graphical representations. While performing EDA on this dataset, we will look at how the features relate to one another and to the target.
Let's start with a quick column-wise check for null values in the data frame.
Python3
df.isnull().sum()
Output:
StudentID 0
CGPA 0
Internships 0
Projects 0
Workshops/Certifications 0
AptitudeTestScore 0
SoftSkillsRating 0
ExtracurricularActivities 0
PlacementTraining 0
SSC_Marks 0
HSC_Marks 0
PlacementStatus 0
dtype: int64
So, we are good to go for the data exploration as there are no null values in the dataset.
Distribution of Target Class
Python3
temp = df['PlacementStatus'].value_counts()
plt.pie(temp.values, labels=temp.index.values,
autopct='%1.1f%%')
plt.title("Distribution of Target Class")
plt.show()
Output:
Target Class
From the pie chart above, we can see that the classes are not perfectly balanced, but the split (roughly 58% NotPlaced to 42% Placed) is acceptable. The dataset contains both categorical and numerical columns, so let's separate them into two lists before analyzing these features.
Python3
cat_cols, num_cols = list(), list()
for col in df.columns:
    # Treat text columns and low-cardinality columns as categorical.
    if df[col].dtype=='object' or df[col].nunique()<10:
        cat_cols.append(col)
    else:
        num_cols.append(col)
print('Categorical :', cat_cols)
print('Numerical :',num_cols)
Output:
Categorical : ['Internships', 'Projects', 'Workshops/Certifications', 'ExtracurricularActivities', 'PlacementTraining', 'PlacementStatus']
Numerical : ['StudentID', 'CGPA', 'AptitudeTestScore', 'SoftSkillsRating', 'SSC_Marks', 'HSC_Marks']
Countplot for categorical columns
Now, let's create count plots for the categorical columns, using the placement status as the hue.
Python3
plt.subplots(figsize=(15,15))
for i, col in enumerate(cat_cols):
    plt.subplot(3,2,i+1)
    sb.countplot(data=df, x=col, hue='PlacementStatus')
plt.tight_layout()
plt.show()
Output:

From the charts above, we can observe several patterns that support the idea that working on skill development improves the chances of getting placed. There are certainly students who completed training programs and projects but were still not placed, but their share is quite low compared with students who did neither.
Label Encoding of Categorical Columns
Python3
for col in ['ExtracurricularActivities', 'PlacementTraining']:
    df[col] = df[col].map({'No':0,'Yes':1})
df['PlacementStatus']=df['PlacementStatus'].map({'NotPlaced':0,
                                                 'Placed':1})
Now that we have encoded the categorical features in the dataset, let's create a heatmap to identify features that are highly correlated with the target column or with each other.
Correlation Heatmap
Python3
sb.heatmap(df.corr(),fmt='.1f', cbar=True, annot=True)
plt.show()
Output:
Correlation heatmap
From this heatmap we can observe that no pair of features is highly correlated, so there are no obviously redundant features and no sign of a feature leaking the target.
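The heatmap is built from df.corr(), which by default computes the Pearson correlation between each pair of numeric columns. As a rough illustration of what a single cell of that matrix holds, the sketch below computes the coefficient by hand for the first five rows shown earlier (AptitudeTestScore against the encoded PlacementStatus). The `pearson` helper is hypothetical, and five rows are far too few to draw any conclusion about the full dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

aptitude = [65, 90, 82, 85, 86]   # AptitudeTestScore, first five rows
placed = [0, 1, 0, 1, 1]          # PlacementStatus after 0/1 encoding
r = pearson(aptitude, placed)
print(round(r, 2))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.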
Train and Validation Data Split
To evaluate the model during training, let's split the dataset in an 85:15 ratio. The validation split acts as unseen data on which we can measure the model's performance.
Python3
features = df.drop(['StudentID', 'PlacementStatus'], axis=1)
target = df['PlacementStatus']
X_train, X_val, Y_train, Y_val = train_test_split(
features,
target,
random_state=2023,
test_size=0.15)
X_train.shape, X_val.shape
Output:
((8500, 10), (1500, 10))
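For intuition about where the 8500 and 1500 rows come from, here is a minimal sketch of a shuffled hold-out split written with the standard library only. It is an assumption-laden illustration: `shuffled_split` is a hypothetical helper, and it will not reproduce the exact rows that train_test_split selects, because scikit-learn uses its own random number generator.

```python
import random

def shuffled_split(n_rows, val_fraction, seed):
    """Return (train_indices, val_indices) after shuffling the row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_val = round(n_rows * val_fraction)      # rows held out for validation
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = shuffled_split(10000, 0.15, seed=2023)
print(len(train_idx), len(val_idx))
```

Every row lands in exactly one of the two sets, so the validation rows are never seen while the model is being fitted.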
Feature Scaling
This code fits the StandardScaler on the training data to compute each feature's mean and standard deviation, and then transforms both the training and validation data with those values so that the two sets are scaled consistently. Tree-based models such as CatBoost are generally insensitive to feature scaling, so this step is optional here.
Python3
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
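What the scaler does to a single column can be sketched with the standard library: learn the mean and the population standard deviation from the training values, then apply z = (x - mean) / std to both splits. The column values and the helper names `fit_scaler` and `transform` below are made up for illustration.

```python
from statistics import fmean, pstdev

def fit_scaler(train_column):
    """Return the mean and population standard deviation of the training values."""
    return fmean(train_column), pstdev(train_column)

def transform(column, mean, std):
    """Standardise values with statistics learned from the training split."""
    return [(x - mean) / std for x in column]

train_cgpa = [7.0, 7.5, 8.0, 8.5]   # hypothetical training values
val_cgpa = [7.75, 9.0]              # hypothetical validation values

mean, std = fit_scaler(train_cgpa)
train_scaled = transform(train_cgpa, mean, std)
val_scaled = transform(val_cgpa, mean, std)   # reuses the training statistics
print(train_scaled, val_scaled)
```

Reusing the training statistics for the validation data is the important part: fitting a second scaler on the validation split would leak information about it into preprocessing.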
Now we are ready to train the model on the prepared training data. This is a binary classification task because the target arrays Y_train and Y_val contain only 0 and 1. It is not necessary to tell the model separately whether the task is binary or multi-class classification.
Build and Train the Model
To limit overfitting, we can tune some of the model's hyperparameters.
- iterations: The maximum number of boosting iterations (trees) to train. Lowering it can help avoid overfitting when the dataset is small.
- loss_function: The loss function the model is trained to minimize.
- early_stopping_rounds: Useful when the data has been split into training and validation sets. Training stops once the validation metric has not improved for this many iterations, which is a sign that the model has started to overfit the training data.
- custom_metric: Additional metrics to compute and report as training progresses.
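The idea behind early_stopping_rounds can be sketched as a small loop over validation losses. This is a simplified illustration with a hypothetical helper and made-up loss values, not CatBoost's overfitting detector itself.

```python
def early_stopping(val_losses, patience):
    """Return (best_iteration, stop_iteration) for a list of validation losses.

    Hypothetical helper; stop_iteration is None if training never stops early.
    """
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss     # new best validation loss
        elif i - best_iter >= patience:
            return best_iter, i                # no improvement for `patience` rounds
    return best_iter, None

losses = [0.66, 0.55, 0.48, 0.45, 0.46, 0.47, 0.47, 0.48]   # made-up values
print(early_stopping(losses, patience=3))
```

After stopping, only the trees up to the best iteration are worth keeping, since later ones did not improve the validation loss.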
Python3
model = CatBoostClassifier(verbose=100,
iterations=1000,
loss_function='Logloss',
early_stopping_rounds=50,
custom_metric=['AUC'])
model.fit(X_train, Y_train,
eval_set=(X_val, Y_val))
y_train = model.predict(X_train)
y_val = model.predict(X_val)
Output:
Learning rate set to 0.053762
0: learn: 0.6621731 test: 0.6623146 best: 0.6623146 (0) total: 1.58ms remaining: 1.58s
100: learn: 0.3971504 test: 0.4332513 best: 0.4331288 (92) total: 158ms remaining: 1.41s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4331287949
bestIteration = 92
Shrink model to first 93 iterations.
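Now let's check the performance of the model using the ROC-AUC metric on the training and validation data. Note that y_train and y_val come from model.predict, which returns hard class labels, so the scores below are computed from 0/1 predictions rather than predicted probabilities.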
Python3
print("Training ROC-AUC: ", ras(Y_train, y_train))
print("Validation ROC-AUC: ", ras(Y_val, y_val))
Output:
Training ROC-AUC: 0.8140948743198752
Validation ROC-AUC: 0.7850069999416671
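ROC-AUC measures how well a model ranks positives above negatives, so it is usually computed from scores or probabilities, such as those returned by model.predict_proba(X_val)[:, 1]. The pure-Python sketch below, with made-up labels and probabilities and a hypothetical `roc_auc` helper, shows how thresholding the scores into hard labels can change the value.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the chance that a positive outranks a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 0, 1, 1]
probabilities = [0.10, 0.40, 0.45, 0.60, 0.70, 0.90]   # hypothetical model outputs
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = roc_auc(labels, probabilities)
auc_from_hard_labels = roc_auc(labels, hard_labels)
print(auc_from_probabilities, auc_from_hard_labels)
```

In this toy case the probabilities rank 8 of the 9 positive/negative pairs correctly, while the hard labels create many ties and score lower. On real data the gap can be larger or smaller.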
Conclusion
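In this article, we trained a CatBoost model for a binary classification task on the placement dataset. With early stopping, the model reached a ROC-AUC of about 0.79 on the validation data, which suggests that CatBoost can give a reasonable baseline on tabular data with little tuning.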