Using ColumnTransformer in Scikit-Learn for Data Preprocessing
Data preprocessing is a critical step in any machine learning workflow. It involves cleaning and transforming raw data into a format suitable for modeling. One challenge in preprocessing is dealing with datasets that contain different types of features, such as numerical and categorical data. Scikit-learn's ColumnTransformer is a powerful tool that lets you apply different transformations to different subsets of features within your dataset. This article explores how to use ColumnTransformer effectively to streamline your data preprocessing tasks.
The ColumnTransformer is a class in the scikit-learn Python machine learning library that lets you selectively apply data preparation transforms to different columns in your dataset. This is particularly useful when you have a mix of numerical and categorical data that require different preprocessing steps.
Using ColumnTransformer offers several advantages:
- Selective Transformation: Apply specific transformations to subsets of columns.
- Pipeline Integration: Easily integrate with scikit-learn's Pipeline for streamlined workflows, as the sketch below shows.
- Code Organization: Encapsulate preprocessing logic in a single, maintainable object.
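To see the pattern in miniature before we build it up step by step, here is a small hedged sketch of a ColumnTransformer wired into a Pipeline. The column names ('age', 'speed', 'gender', 'city') and the classifier are placeholders for illustration, not the article's dataset:
Python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# apply a different transform to each group of (hypothetical) columns
preprocess = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'speed']),                         # scale numerical columns
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender', 'city'])  # encode categorical columns
])

# the preprocessor and the model travel together as one estimator
model = Pipeline(steps=[('preprocess', preprocess),
                        ('classifier', LogisticRegression())])
# model.fit(X, y) would then scale, encode, and train in a single call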
Let's create a dataset named CAR_SPEED_DATA, which consists of six columns:
- AGE: This column contains numerical data representing the age of individuals. It's a continuous variable that may require scaling to ensure it fits well within our model.
- GENDER: A categorical feature that denotes the gender of individuals, often represented as 'Male' or 'Female.' To make this data usable for machine learning models, we'll need to encode it into numerical values.
- SPEED: Another numerical feature, this column represents the speed of an individual’s vehicle. Like AGE, it might need scaling or normalization.
- AVERAGE_SPEED: This feature is an ordinal categorical value, representing speed categories such as 'high' or 'low'. Although it seems similar to numerical data, it needs special handling because the order matters but the differences between categories are not consistent.
- CITY: A categorical feature indicating the city where the individual resides. With potentially many unique values, we'll need to apply one-hot encoding to convert it into a form suitable for modeling.
- HAS_DRIVING_LICENSE: This binary categorical variable shows whether an individual has a driving license ('Yes' or 'No'). Simple encoding can transform this into a numerical feature.
When we work with such a diverse dataset, different preprocessing steps are required for different columns:
- Numerical Data Handling: AGE and SPEED need to be scaled to ensure that they don’t overpower other features in the model. Without proper scaling, numerical columns with larger ranges could disproportionately affect the model's learning process.
- Categorical Encoding: GENDER and CITY need encoding into numerical formats. With multiple categorical features, applying encoding manually can become a big task, especially when dealing with a large number of categories.
- Ordinal Encoding: AVERAGE_SPEED, as an ordinal categorical feature, requires careful encoding to preserve the order of categories. Applying standard one-hot encoding might not respect this inherent ordering.
- Binary Features: The HAS_DRIVING_LICENSE column is binary and relatively straightforward, but it's another step that adds complexity when handled separately.
Step 1: Importing Necessary Libraries
From sklearn we import SimpleImputer, OneHotEncoder, and OrdinalEncoder.
- SimpleImputer: Fills in missing values with a specified strategy, such as the mean, median, or mode; we use the mean for our dataset.
- OneHotEncoder: Converts categorical features into a format that can be provided to machine learning algorithms by creating binary columns for each category.
- OrdinalEncoder: Transforms categorical features into integer values that represent the ordinal relationship between categories, preserving their order. In our dataset, low speed is encoded as 0 and high speed as 1.
Python
# import the required libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
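Before applying these to the real dataset, here is a small self-contained sketch of what each transformer does on made-up toy values (illustrative only, not rows from CAR_SPEED_DATA). Note that sparse_output is the scikit-learn 1.2+ spelling; older versions use sparse instead:
Python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# SimpleImputer: fills the NaN with the column mean (default strategy='mean')
speeds = np.array([[40.0], [np.nan], [120.0]])
print(SimpleImputer().fit_transform(speeds))    # the NaN becomes 80.0

# OrdinalEncoder: maps ordered categories to integers, preserving the order
levels = np.array([['low'], ['high'], ['low']])
oe = OrdinalEncoder(categories=[['low', 'high']])
print(oe.fit_transform(levels))                 # low -> 0.0, high -> 1.0

# OneHotEncoder: one binary column per category
cities = np.array([['Delhi'], ['Mumbai'], ['Delhi']])
print(OneHotEncoder(sparse_output=False).fit_transform(cities))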
Step 2: Loading The Dataset
Python
# reading the dataset (the download link is given below)
df = pd.read_csv('/content/Car_Speed_Data.csv')
df.head()
Output:
Age Gender Speed Average_speed City has_driving_license
0 54 male 40.0 low Kolkata yes
1 34 female 70.0 low Delhi yes
2 19 female 140.0 high Delhi no
3 45 male 120.0 high Kolkata yes
4 23 male 80.0 low Mumbai no
You can download the dataset from the given link: CAR_SPEED_DATA
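If the download is unavailable, a small stand-in with the same schema can be built by hand from the rows shown above. This is only a convenience for following along; shapes later in the article assume the real 100-row file:
Python
import pandas as pd

# hand-made stand-in mirroring the CAR_SPEED_DATA schema (values copied
# from the df.head() output above; the real file has 100 rows)
df = pd.DataFrame({
    'Age': [54, 34, 19, 45, 23],
    'Gender': ['male', 'female', 'female', 'male', 'male'],
    'Speed': [40.0, 70.0, 140.0, 120.0, 80.0],
    'Average_speed': ['low', 'low', 'high', 'high', 'low'],
    'City': ['Kolkata', 'Delhi', 'Delhi', 'Kolkata', 'Mumbai'],
    'has_driving_license': ['yes', 'yes', 'no', 'yes', 'no'],
})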
Now let's check whether any null values are present.
Python
df.isnull().sum()
Output:
Age 0
Gender 0
Speed 9
Average_speed 0
City 0
has_driving_license 0
dtype: int64
After identifying missing values in the dataset using the isnull().sum() method, we can use sklearn's SimpleImputer to handle these gaps by replacing the missing values with the mean of each feature. This approach ensures that the dataset remains complete and consistent.
- Now we are going to split our dataset into training and testing sets with a test size of 0.2. Since our dataset contains a total of 100 samples, this means we will use 80 samples for training and 20 samples for testing.
Python
from sklearn.model_selection import train_test_split

# note: without a fixed random_state the exact split differs between runs
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['has_driving_license']),
    df['has_driving_license'],
    test_size=0.2)
X_train
Output:
Age Gender Speed Average_speed City
57 23 female 70.0 low Delhi
20 24 female NaN high Banglore
98 42 female 50.0 low Kolkata
86 34 male 60.0 low Banglore
37 42 male NaN low Mumbai
... ... ... ... ... ...
52 29 male 60.0 low Mumbai
47 25 female 130.0 high Banglore
16 42 male 60.0 low Kolkata
76 42 male 50.0 low Delhi
32 23 male NaN high Delhi
80 rows × 5 columns
We have created the X_train and X_test data.
Now we have to handle the null values in the Speed feature using SimpleImputer. Let's do this!
Python
# using SimpleImputer() for the Speed column
si = SimpleImputer()
X_train_Speed = si.fit_transform(X_train[['Speed']])
# for test data: transform only, so missing test values are filled with the
# mean learned from the training data (avoids data leakage)
X_test_Speed = si.transform(X_test[['Speed']])
X_train_Speed.shape
# output: (80, 1)
# to check whether the values were replaced by the mean, display X_train_Speed
Next Step:
Why Use Ordinal Encoding for the Average_speed Feature?
The Average_speed column is an ordinal categorical feature, meaning its categories have a meaningful order, such as low and high. By using OrdinalEncoder, we transform these categories into numerical values that preserve their inherent order. This encoding allows machine learning models to interpret the relative significance of each category, enabling them to capture patterns associated with different speeds. For instance, if low is encoded as 0 and high as 1, the model can recognize that high represents a greater speed than low, impacting how it learns relationships in the data.
Python
# ordinal encoding for Average_speed
oe = OrdinalEncoder(categories=[['low','high']])
X_train_Average_speed = oe.fit_transform(X_train[['Average_speed']])
# for test data: transform only, reusing the category order learned above
X_test_Average_speed = oe.transform(X_test[['Average_speed']])
X_train_Average_speed.shape
# output: (80, 1)
# you can also inspect the result by displaying X_train_Average_speed
Next Step:
Why Use One-Hot Encoding for Gender and City?
One-Hot Encoding is used to convert categorical variables into a binary matrix, allowing machine learning models to interpret these features without assuming any ordinal relationship. By transforming Gender and City into binary columns, we avoid misleading the model into thinking there's a rank or order between the categories.
Python
# performing one-hot encoding for Gender and City
# (sparse=False was renamed to sparse_output=False in scikit-learn 1.2+)
ohe = OneHotEncoder(drop='first', sparse=False)
X_train_Gender_City = ohe.fit_transform(X_train[['Gender','City']])
# for test data: transform only, so the test set is encoded with the
# categories learned from the training set
X_test_Gender_City = ohe.transform(X_test[['Gender','City']])
X_train_Gender_City.shape
# output: (80, 7)
# display X_train_Gender_City to check the encoded feature values
Next Step:
Why Handle Age Separately?
The Age feature is a numerical variable that often plays a critical role in predictive modeling. By extracting Age separately, we can focus on its unique contribution to the model without interference from categorical features. This separation also simplifies preprocessing steps, such as scaling or normalizing numerical data, ensuring that Age is treated with its own distinct statistical considerations (mean, median, mode).
Python
# Extracting Age
X_train_age = X_train.drop(columns=['Gender','Speed','Average_speed','City']).values
# for test data
X_test_age = X_test.drop(columns=['Gender','Speed','Average_speed','City']).values
X_train_age.shape
#output: (80,1)
Final Step:
Concatenation into a single matrix: Why Concatenate and Transform the Data?
The concatenation of X_train_age, X_train_Speed, X_train_Gender_City, and X_train_Average_speed into X_train_transformed creates a single feature set where each feature is represented in a format suitable for machine learning models. This step combines all preprocessed features—numerical and categorical—into a unified matrix, enabling the model to access and learn from all relevant data simultaneously. By using the np.concatenate function, we ensure that each feature retains its processed form and contributes equally to the predictive modeling task.
Python
# np.concatenate combines multiple arrays along a specified axis;
# axis=1 joins the arrays horizontally (column-wise)
X_train_transformed = np.concatenate((X_train_age, X_train_Speed, X_train_Gender_City, X_train_Average_speed), axis=1)
# for test data
X_test_transformed = np.concatenate((X_test_age, X_test_Speed, X_test_Gender_City, X_test_Average_speed), axis=1)
X_train_transformed.shape
# output: (80, 10)
# 80 samples and 10 total features (numerical plus encoded)
Now we have completed the data preprocessing without using ColumnTransformer. This approach is quite tedious because we perform the transformations on each feature separately, which becomes a burden with larger datasets or more complex preprocessing needs.
Without ColumnTransformer, we can see how hectic it is to transform each feature individually and then concatenate everything at the end. This manual process is time-consuming, error-prone, and difficult to maintain. To make it easier and simpler, we can use the ColumnTransformer from Scikit-learn, which applies different preprocessing steps to specific columns simultaneously, making the workflow more efficient and less error-prone.
Using ColumnTransformer
NOTE: Perform the same steps up to the train_test_split method, then import ColumnTransformer from the sklearn library:
Python
from sklearn.compose import ColumnTransformer
The ColumnTransformer is a powerful tool in sklearn for applying different preprocessing transformations to specific columns within a dataset, allowing processing based on the nature of each feature.
Syntax for ColumnTransformer:
transformer = ColumnTransformer(transformers=[
    ('imputer', SimpleImputer(), ['NumericalColumn1', 'NumericalColumn2']),
    ('ordinal', OrdinalEncoder(), ['OrdinalColumn']),
    ('onehot', OneHotEncoder(), ['CategoricalColumn1', 'CategoricalColumn2'])
], remainder='passthrough')
Python
# declare the transformer by calling ColumnTransformer():
#   t1 applies SimpleImputer to Speed,
#   t2 applies OrdinalEncoder to the ordinal Average_speed column,
#   t3 applies one-hot encoding to Gender and City
transformer = ColumnTransformer(transformers=[
    ('t1', SimpleImputer(), ['Speed']),
    ('t2', OrdinalEncoder(categories=[['low','high']]), ['Average_speed']),
    ('t3', OneHotEncoder(sparse=False, drop='first'), ['Gender','City'])
], remainder='passthrough')
What is remainder='passthrough'?
The remainder parameter specifies what to do with the columns not explicitly handled by the transformers list.
- remainder='passthrough': leaves all other columns untouched and includes them in the transformed output.
- Why use it? It is useful when you want to apply specific transformations to only some columns while preserving the rest as they are, ensuring that no information from the original dataset is lost. A concrete comparison with the default behaviour is sketched below.
- Each per-column transformation is registered in this single ColumnTransformer object.
- Since a model cannot consume raw categorical words, the transformer converts the data into a numerical matrix, which is then given to the machine for training.
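To make the effect of remainder concrete, here is a small sketch comparing 'passthrough' with the default 'drop' on a toy frame (column names invented for illustration):
Python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'speed_level': ['low', 'high'], 'age': [23, 45]})
enc = [('ord', OrdinalEncoder(categories=[['low', 'high']]), ['speed_level'])]

# remainder='drop' (the default) discards the untransformed 'age' column
print(ColumnTransformer(enc, remainder='drop').fit_transform(toy).shape)        # (2, 1)

# remainder='passthrough' appends 'age' untouched after the transformed columns
print(ColumnTransformer(enc, remainder='passthrough').fit_transform(toy).shape) # (2, 2)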
With this single ColumnTransformer, every preprocessing step runs on the dataset in one call, producing one matrix that contains all the transformed feature values:
Python
# fit on the training data and apply the same fitted transforms to both sets
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
X_train_transformed.shape
# output: (80, 10)
X_test_transformed.shape
# output: (20, 10)
print(X_train_transformed)
Output:
[[ 70.         0. 0. 0. 0. 1. 0. 0. 0. 23.]
 [ 88.45070423 1. 0. 0. 0. 0. 0. 0. 0. 24.]
 [ 50.         0. 0. 0. 0. 0. 0. 1. 0. 42.]
 [ 60.         0. 1. 0. 0. 0. 0. 0. 0. 34.]
 [ 88.45070423 0. 1. 0. 0. 0. 0. 0. 1. 42.]
 [ 60.         0. 1. 0. 0. 0. 0. 1. 0. 20.]
 [ 88.45070423 0. 0. 0. 0. 0. 0. 0. 1. 29.]
 [ 50.         0. 0. 0. 0. 0. 0. 0. 0. 35.]
 [130.         1. 1. 0. 0. 0. 0. 0. 0. 36.]
 [ 70.         0. 0. 0. 0. 0. 0. 1. 0. 62.]
 [120.         1. 1. 0. 0. 0. 0. 0. 0. 24.]
 [180.         1. 1. 0. 0. 1. 0. 0. 0. 49.]
 [ 60.         0. 0. 0. 0. 0. 0. 1. 0. 24.]
 [ 90.         0. 0. 0. 0. 0. 0. 1. 0. 28.]
 [ 50.         0. 1. 0. 0. 0. 0. 1. 0. 34.]
 [120.         1. 0. 0. 1. 1. 0. 0. 0. 39.]
 [ 70.         0. 1. 0. 0. 0. 0. 0. 0. 25.]
 [ 40.         0. 0. 0. 0. 0. 0. 1. 0. 36.]
 [ 90.         0. 1. 0. 0. 0. 0. 0. 1. 45.]
 [ 70.         0. 1. 0. 0. 0. 0. 0. 0. 31.]
 [ 70.         0. 1. 0. 0. 1. 0. 0. 0. 45.]
 [ 50.         0. 0. 0. 0. 1. 0. 0. 0. 39.]
 [130.         1. 0. 0. 0. 0. 0. 0. 0. 47.]
 [170.         1. 1. 0. 0. 0. 0. 1. 0. 61.]
 [ 70.         0. 0. 0. 0. 0. 0. 1. 0. 43.]
 [130.         1. 0. 0. 0. 0. 0. 0. 0. 48.]
 [180.         1. 0. 0. 0. 0. 0. 0. 1. 28.]
 [130.         1. 1. 0. 0. 0. 0. 0. 0. 17.]
 [130.         1. 0. 0. 0. 0. 0. 0. 0. 25.]
 [150.         1. 0. 0. 0. 1. 0. 0. 0. 27.]
 [ 50.         0. 1. 0. 0. 0. 0. 1. 0. 36.]
 [ 50.         0. 1. 0. 0. 0. 0. 0. 0. 28.]
 [ 40.         0. 0. 0. 0. 0. 0. 0. 1. 52.]
 [ 60.         0. 0. 0. 0. 0. 0. 0. 0. 62.]
 [ 40.         0. 1. 0. 0. 1. 0. 0. 0. 57.]
 [150.         1. 0. 0. 0. 0. 0. 0. 1. 32.]
 [120.         1. 1. 0. 0. 0. 0. 1. 0. 45.]
 [170.         1. 1. 0. 0. 1. 0. 0. 0. 27.]
 [ 88.45070423 0. 1. 0. 0. 1. 0. 0. 0. 56.]
 [ 80.         0. 0. 1. 0. 0. 0. 0. 1. 23.]
 [ 70.         0. 1. 0. 0. 0. 0. 0. 1. 29.]
 [ 50.         0. 0. 0. 0. 0. 0. 0. 1. 40.]
 [ 80.         0. 1. 0. 0. 0. 0. 1. 0. 35.]
 [ 70.         0. 1. 0. 0. 0. 0. 0. 0. 42.]
 [120.         1. 0. 0. 0. 0. 0. 0. 0. 25.]
 [ 88.45070423 1. 0. 0. 0. 0. 0. 0. 1. 45.]
 [ 70.         0. 0. 0. 0. 1. 0. 0. 0. 34.]
 [140.         1. 1. 0. 0. 0. 0. 0. 1. 46.]
 [ 88.45070423 0. 0. 0. 0. 0. 0. 0. 0. 39.]
 [ 50.         0. 0. 0. 0. 0. 0. 0. 0. 24.]
 [120.         1. 1. 0. 0. 1. 0. 0. 0. 47.]
 [ 70.         0. 0. 0. 0. 0. 0. 0. 0. 19.]
 [120.         1. 0. 0. 0. 0. 0. 1. 0. 47.]
 [ 50.         0. 0. 0. 0. 1. 0. 0. 0. 28.]
 [ 50.         0. 0. 0. 0. 1. 0. 0. 0. 21.]
 [ 40.         0. 1. 0. 0. 0. 0. 0. 0. 32.]
 [ 70.         0. 0. 0. 0. 0. 0. 0. 0. 28.]
 [ 90.         0. 1. 0. 0. 0. 0. 0. 1. 28.]
 [100.         1. 1. 0. 0. 0. 0. 0. 0. 35.]
 [ 50.         0. 0. 0. 0. 1. 0. 0. 0. 46.]
 [120.         1. 1. 0. 0. 0. 0. 0. 0. 52.]
 [120.         1. 1. 0. 0. 0. 0. 0. 1. 42.]
 [ 60.         0. 0. 0. 0. 0. 0. 0. 1. 34.]
 [ 50.         0. 0. 0. 0. 1. 0. 0. 0. 21.]
 [140.         1. 0. 0. 0. 0. 1. 0. 0. 19.]
 [ 88.45070423 1. 1. 0. 0. 0. 0. 0. 1. 35.]
 [ 40.         0. 0. 0. 0. 0. 0. 0. 0. 49.]
 [ 70.         0. 0. 0. 0. 1. 0. 0. 0. 39.]
 [120.         1. 0. 1. 0. 1. 0. 0. 0. 33.]
 [ 50.         0. 1. 0. 0. 0. 0. 0. 0. 41.]
 [150.         1. 0. 1. 0. 0. 0. 0. 0. 26.]
 [ 70.         0. 1. 0. 0. 0. 0. 0. 1. 23.]
 [ 88.45070423 1. 1. 0. 0. 0. 0. 0. 0. 49.]
 [ 60.         0. 1. 0. 0. 0. 0. 0. 1. 37.]
 [170.         1. 0. 0. 0. 0. 0. 0. 0. 36.]
 [ 60.         0. 1. 0. 0. 0. 0. 0. 1. 29.]
 [130.         1. 0. 0. 0. 0. 0. 0. 0. 25.]
 [ 60.         0. 1. 0. 0. 0. 0. 1. 0. 42.]
 [ 50.         0. 1. 0. 0. 1. 0. 0. 0. 42.]
 [ 88.45070423 1. 1. 0. 0. 1. 0. 0. 0. 23.]]
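To check which column of this 10-column matrix came from which feature, the fitted transformer can report its output names. This assumes a reasonably recent scikit-learn (roughly 1.1 or newer), where ColumnTransformer and all the transformers used here expose get_feature_names_out():
Python
# names are prefixed by the transformer that produced them
# ('t1', 't2', 't3', or 'remainder' for the passed-through columns)
print(transformer.get_feature_names_out())
# expected layout: Speed first, then Average_speed, then the one-hot
# Gender/City columns, and finally the passed-through Age column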
Python
print(transformer.transform(X_test))
Output:
[[120. 1. 1. 0. 0. 0. 0. 1. 0. 45.]
[150. 1. 1. 0. 0. 0. 0. 1. 0. 46.]
[ 50. 0. 1. 0. 0. 0. 0. 0. 1. 32.]
[ 90. 0. 1. 0. 0. 0. 0. 0. 1. 32.]
[ 80. 0. 0. 0. 0. 1. 0. 0. 0. 35.]
[120. 1. 0. 0. 0. 0. 0. 0. 0. 29.]
[ 80. 0. 1. 0. 0. 1. 0. 0. 0. 56.]
[ 80. 0. 1. 0. 0. 0. 0. 1. 0. 42.]
[150. 1. 1. 0. 0. 0. 0. 0. 0. 27.]
[120. 1. 1. 0. 0. 0. 0. 1. 0. 27.]
[170. 1. 0. 0. 0. 0. 0. 0. 1. 51.]
[170. 1. 0. 0. 0. 0. 0. 1. 0. 37.]
[140. 1. 1. 0. 0. 0. 0. 0. 1. 51.]
[ 50. 0. 0. 0. 0. 1. 0. 0. 0. 25.]
[140. 1. 0. 0. 0. 0. 0. 1. 0. 61.]
[ 40. 0. 1. 0. 0. 0. 0. 1. 0. 54.]
[ 60. 0. 1. 0. 0. 1. 0. 0. 0. 25.]
[ 70. 0. 0. 0. 0. 0. 0. 0. 0. 37.]
[140. 1. 1. 0. 0. 0. 0. 0. 0. 24.]
[170. 1. 0. 0. 0. 0. 0. 0. 1. 21.]]
We can see that there are no null values and everything has been combined into a single matrix; we are now ready for the next step.
- Using the Scikit-learn ColumnTransformer, we have efficiently transformed each feature in our dataset into a single matrix. This approach streamlines the preprocessing workflow by applying different transformations to different columns simultaneously.
- Unlike handling each feature separately, it ensures consistent preprocessing across training and test datasets, enhances code readability, and reduces the likelihood of mistakes.
- By combining our separate operations into a single transformation, we not only save time but also create a more robust foundation for our machine learning models to build upon.
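As a final optional step, sketched here under the assumption that a classifier is wanted on top, the transformer can be dropped into a Pipeline so that fitting and prediction run the preprocessing automatically. DecisionTreeClassifier is just an illustrative choice, not part of the original workflow:
Python
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# the ColumnTransformer defined earlier becomes the first pipeline step
pipe = Pipeline(steps=[
    ('preprocess', transformer),
    ('model', DecisionTreeClassifier())
])

# fit() imputes, encodes, and trains in one call; predict()/score() reuse
# the exact same fitted preprocessing on unseen data
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))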