
Using ColumnTransformer in Scikit-Learn for Data Preprocessing

Last Updated : 21 Apr, 2025

Data preprocessing is a critical step in any machine learning workflow. It involves cleaning and transforming raw data into a format suitable for modeling. One of the challenges in preprocessing is dealing with datasets that contain different types of features, such as numerical and categorical data. Scikit-learn's ColumnTransformer is a powerful tool that allows you to apply different transformations to different subsets of features within your dataset. This article will explore how to use ColumnTransformer effectively to streamline your data preprocessing tasks.

What is ColumnTransformer?

The ColumnTransformer is a class in the scikit-learn Python machine learning library that allows you to selectively apply data preparation transforms to different columns in your dataset. This is particularly useful when you have a mix of numerical and categorical data that require different preprocessing steps.

Why Use ColumnTransformer?

Using ColumnTransformer offers several advantages:

  • Selective Transformation: Apply specific transformations to subsets of columns.
  • Pipeline Integration: Easily integrate with scikit-learn's Pipeline for streamlined workflows.
  • Code Organization: Encapsulate preprocessing logic in a single, maintainable object.

Preprocessing Strategies with ColumnTransformer

Let's work with a dataset named CAR_SPEED_DATA, which consists of six columns:

  • AGE: This column contains numerical data representing the age of individuals. It's a continuous variable that may require scaling to ensure it fits well within our model.
  • GENDER: A categorical feature that denotes the gender of individuals, often represented as 'Male' or 'Female.' To make this data usable for machine learning models, we'll need to encode it into numerical values.
  • SPEED: Another numerical feature, this column represents the speed of an individual’s vehicle. Like AGE, it might need scaling or normalization.
  • AVERAGE_SPEED: This feature is an ordinal categorical value. It represents speed categories such as 'low' or 'high'. Although it resembles numerical data, it needs special handling because the order matters but the differences between categories are not consistent.
  • CITY: A categorical feature indicating the city where the individual resides. With potentially many unique values, we'll need to apply one-hot encoding to convert it into a form suitable for modeling.
  • HAS_DRIVING_LICENSE: This binary categorical variable shows whether an individual has a driving license ('Yes' or 'No'). Simple encoding can transform this into a numerical feature.

Challenges Without Column Transformer

When we work with such a diverse dataset, different preprocessing steps are required for different columns:

  • Numerical Data Handling: AGE and SPEED need to be scaled to ensure that they don’t overpower other features in the model. Without proper scaling, numerical columns with larger ranges could disproportionately affect the model's learning process.
  • Categorical Encoding: GENDER and CITY need encoding into numerical formats. With multiple categorical features, applying encoding manually can be a big task, especially when dealing with a large number of categories.
  • Ordinal Encoding: AVERAGE_SPEED, as an ordinal categorical feature, requires careful encoding to preserve the order of categories. Applying standard one-hot encoding might not respect this inherent ordering.
  • Binary Features: The HAS_DRIVING_LICENSE column is binary and relatively straightforward, but it's another step that adds complexity when handled separately.
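
The scaling concern in the first bullet can be illustrated with a minimal sketch. StandardScaler is used here as one common choice, which is an assumption: the walkthrough below does not apply scaling itself. The Age and Speed values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age and Speed values (not taken from Car_Speed_Data.csv)
X = np.array([[54, 40.0],
              [34, 70.0],
              [19, 140.0],
              [45, 120.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and unit variance,
# so neither feature dominates purely because of its range
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```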

Implementing Column Transformer in Sklearn

Step 1: Importing Necessary Libraries

From sklearn we import SimpleImputer, OneHotEncoder and OrdinalEncoder.

  • SimpleImputer: Fills in missing data with a specified strategy, such as the mean, median, or mode. We use the mean (the default) for our dataset.
  • OneHotEncoder: Converts categorical features into a format that can be provided to machine learning algorithms by creating a binary column for each category.
  • OrdinalEncoder: Transforms categorical features into integer values that represent the ordinal relationship between categories, preserving their order. In our dataset, 'low' speed is encoded as 0 and 'high' speed as 1.
Python
# import the required libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

Step 2: Loading The Dataset

Python
# reading the dataset which is attached at end.
df = pd.read_csv('/content/Car_Speed_Data.csv')
df.head()

Output:

   Age  Gender  Speed Average_speed     City has_driving_license
0   54    male   40.0           low  Kolkata                 yes
1   34  female   70.0           low    Delhi                 yes
2   19  female  140.0          high    Delhi                  no
3   45    male  120.0          high  Kolkata                 yes
4   23    male   80.0           low   Mumbai                  no

Click on the given link for downloading Dataset: CAR_SPEED_DATA

Now we'll check whether any null values are present.

Python
df.isnull().sum()

Output:

Age                    0
Gender                 0
Speed                  9
Average_speed          0
City                   0
has_driving_license    0
dtype: int64

After identifying missing values in the dataset using the isnull().sum() method, we can use sklearn's SimpleImputer to handle these gaps by replacing the missing values with the mean of each feature. This approach ensures that the dataset remains complete and consistent.
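
As a minimal sketch of mean imputation (the values below are made up and are not taken from the dataset): the imputer learns the mean from one array and reuses that same mean when filling a second array, which is the pattern typically wanted for training and test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical Speed values with missing entries
train_speed = np.array([[40.0], [70.0], [np.nan], [120.0]])
test_speed = np.array([[np.nan], [80.0]])

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_speed)  # learns the mean, then fills
test_filled = imputer.transform(test_speed)        # reuses the learned mean

print(imputer.statistics_)   # mean learned from the non-missing training values
print(test_filled.ravel())
```

Calling fit_transform again on the second array would instead compute a new mean from that array alone.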

  • Now we split the dataset into training and test sets with a test size of 0.2. Since the dataset contains 100 samples, 80 samples are used for training and 20 for testing.
Python
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns=['has_driving_license']),df['has_driving_license'],
                                                test_size=0.2)
X_train

Output:

    Age  Gender  Speed Average_speed      City
57   23  female   70.0           low     Delhi
20   24  female    NaN          high  Banglore
98   42  female   50.0           low   Kolkata
86   34    male   60.0           low  Banglore
37   42    male    NaN           low    Mumbai
...  ...    ...    ...           ...       ...
52   29    male   60.0           low    Mumbai
47   25  female  130.0          high  Banglore
16   42    male   60.0           low   Kolkata
76   42    male   50.0           low     Delhi
32   23    male    NaN          high     Delhi
80 rows × 5 columns

We have created the X_train and X_test data.

Now we handle the null values in the Speed feature using SimpleImputer.

Python
# we are using SimpleImputer() for the Speed column
si = SimpleImputer()
X_train_Speed = si.fit_transform(X_train[['Speed']])

# for test data: reuse the mean learned from the training set
X_test_Speed = si.transform(X_test[['Speed']])

X_train_Speed.shape
# to check whether the missing values were replaced by the mean, display X_train_Speed
# output: (80,1)

Next step:

Why Use Ordinal Encoding for the Average_speed Feature?

The Average_speed column is an ordinal categorical feature, meaning its categories have a meaningful order, such as low and high. By using OrdinalEncoder, we transform these categories into numerical values that preserve their inherent order. This encoding allows machine learning models to interpret the relative significance of each category, enabling them to capture patterns associated with different speeds. For instance, if low is encoded as 0 and high as 1, the model can recognize that high represents a greater speed than low, impacting how it learns relationships in the data.
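
A small sketch on made-up values shows why the category order is passed explicitly. Without the categories argument, OrdinalEncoder sorts the categories, which for these strings would place 'high' before 'low'.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical Average_speed values
avg_speed = np.array([["low"], ["high"], ["high"], ["low"]])

# Explicit order: 'low' -> 0, 'high' -> 1
explicit = OrdinalEncoder(categories=[["low", "high"]])
print(explicit.fit_transform(avg_speed).ravel())  # [0. 1. 1. 0.]

# Default order is sorted, so 'high' -> 0 and 'low' -> 1
default = OrdinalEncoder()
print(default.fit_transform(avg_speed).ravel())   # [1. 0. 0. 1.]
```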

Python
# Ordinal encoding for Average_speed
oe = OrdinalEncoder(categories=[['low','high']])
X_train_Average_speed = oe.fit_transform(X_train[['Average_speed']])

# for test data: apply the encoder fitted on the training set
X_test_Average_speed = oe.transform(X_test[['Average_speed']])

X_train_Average_speed.shape
# output: (80,1)
# you can also inspect the encoded values by displaying X_train_Average_speed

Next Step:

Why Use One-Hot Encoding for Gender and City?

One-Hot Encoding is used to convert categorical variables into a binary matrix, allowing machine learning models to interpret these features without assuming any ordinal relationship. By transforming Gender and City into binary columns, we avoid misleading the model into thinking there's a rank or order between the categories.

Python
# Performing one-hot encoding for Gender and City
# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
ohe = OneHotEncoder(drop='first', sparse_output=False)
X_train_Gender_City = ohe.fit_transform(X_train[['Gender','City']])

# for test data: apply the encoder fitted on the training set
X_test_Gender_City = ohe.transform(X_test[['Gender','City']])

X_train_Gender_City.shape
# output: (80,7)
# display X_train_Gender_City to check the encoded values

Next Step:

Why Handle Age Separately?

The Age feature is a numerical variable that often plays an important role in predictive modeling. It needs no encoding, so we extract it as it is and keep it apart from the categorical features. Keeping it separate also makes it easy to apply numerical preprocessing later, such as scaling or normalization.

Python
# Extracting Age
X_train_age = X_train.drop(columns=['Gender','Speed','Average_speed','City']).values

# for test data
X_test_age = X_test.drop(columns=['Gender','Speed','Average_speed','City']).values

X_train_age.shape
#output: (80,1)

Final Step:

Concatenation into a Single Matrix: Why Concatenate the Transformed Data?

Concatenating X_train_age, X_train_Speed, X_train_Gender_City and X_train_Average_speed into X_train_transformed creates a single feature matrix in which every feature is in a numerical form suitable for machine learning models. This step combines all preprocessed features, numerical and categorical, so the model can learn from all of them at once. np.concatenate with axis=1 joins the arrays column-wise, so each feature keeps its processed values and each row still corresponds to the same sample.

Without using Column Transformer

Python
# np.concatenate combines multiple arrays along a given axis; axis=1 joins them horizontally (column-wise)
X_train_transformed = np.concatenate((X_train_age,X_train_Speed,X_train_Gender_City,X_train_Average_speed),axis=1)
# for test data
X_test_transformed = np.concatenate((X_test_age,X_test_Speed,X_test_Gender_City,X_test_Average_speed),axis=1)

X_train_transformed.shape

# output: (80,10)
# 80 samples and 10 features in total (numerical plus encoded columns)

We have now completed the preprocessing without using the ColumnTransformer. This approach is tedious because each feature is transformed separately. It becomes a burden with larger datasets or more complex preprocessing needs.

Let’s explore why this is burdensome with the representation below:

[Figure: Without Column Transformer]

We can see how tedious it is to transform each feature individually and then concatenate everything at the end. This manual process is time-consuming, error-prone, and difficult to maintain. To make it simpler, we can use the ColumnTransformer from Scikit-learn. The ColumnTransformer applies different preprocessing steps to specific columns in one object, making the workflow more efficient and less error-prone.

Using Column Transformer

[Figure: Using Column Transformer]

NOTE: Perform the same steps up to the train_test_split call.

Then import ColumnTransformer from sklearn:

Python
from sklearn.compose import ColumnTransformer

ColumnTransformer: a tool in sklearn for applying different preprocessing transformations to specific columns within a dataset, so that each feature is processed according to its type.

Syntax for ColumnTransformer:

transformer = ColumnTransformer(
    transformers=[
        ('imputer', SimpleImputer(), ['NumericalColumn1', 'NumericalColumn2']),
        ('ordinal', OrdinalEncoder(), ['OrdinalColumn']),
        ('onehot', OneHotEncoder(), ['CategoricalColumn1', 'CategoricalColumn2'])
    ],
    remainder='passthrough'
)
Python
# create the ColumnTransformer; each entry is (name, transformer, columns)
# t1: SimpleImputer for Speed, t2: OrdinalEncoder for Average_speed,
# t3: OneHotEncoder for Gender and City
# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
transformer = ColumnTransformer(transformers=[
    ('t1',SimpleImputer(),['Speed']),
    ('t2',OrdinalEncoder(categories=[['low','high']]),['Average_speed']),
    ('t3',OneHotEncoder(sparse_output=False,drop='first'),['Gender','City'])
],remainder='passthrough')

What is remainder='passthrough'?

The remainder parameter specifies what to do with the columns that are not explicitly listed in the transformers list.

remainder='passthrough': leaves all other columns untouched and includes them in the transformed output. The default, remainder='drop', discards them.

  • Why use it? It is useful when you want to transform only some columns while preserving the rest as they are, so that no information from the original dataset is lost. Here it keeps the Age column.
  • The ColumnTransformer collects the outputs of all its transformers, together with the passed-through columns, into a single array.
  • Categorical features cannot be given to a model as raw text. They are first converted into a numerical representation, which is then used for training.
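
The effect of the remainder setting can be seen on a made-up two-column frame. This is a sketch: the helper name build is hypothetical and the rows are not from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows; Age is not listed in any transformer
toy = pd.DataFrame({
    "Age": [54, 34, 19],
    "Average_speed": ["low", "low", "high"],
})

def build(remainder):
    """Hypothetical helper: same transformer, different remainder setting."""
    return ColumnTransformer(
        transformers=[
            ("ordinal", OrdinalEncoder(categories=[["low", "high"]]), ["Average_speed"]),
        ],
        remainder=remainder,
    )

kept = build("passthrough").fit_transform(toy)
dropped = build("drop").fit_transform(toy)

print(kept.shape)     # (3, 2): encoded column plus Age
print(dropped.shape)  # (3, 1): Age is discarded
```

Passed-through columns are appended after the transformed ones, which is why Age ends up as the last column of the output.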

Using the ColumnTransformer, we can do all of this in a single step.

[Figure: Single vector containing different feature values]
Python
transformer.fit_transform(X_train).shape
#output:(80,10)
transformer.transform(X_test).shape
#output: (20,10)

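Output of transformer.fit_transform(X_train), the full transformed training array. The columns are Speed, Average_speed, the seven one-hot columns for Gender and City, and finally the passed-through Age: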

[[ 70.           0.           0.           0.           0.
1. 0. 0. 0. 23. ]
[ 88.45070423 1. 0. 0. 0.
0. 0. 0. 0. 24. ]
[ 50. 0. 0. 0. 0.
0. 0. 1. 0. 42. ]
[ 60. 0. 1. 0. 0.
0. 0. 0. 0. 34. ]
[ 88.45070423 0. 1. 0. 0.
0. 0. 0. 1. 42. ]
[ 60. 0. 1. 0. 0.
0. 0. 1. 0. 20. ]
[ 88.45070423 0. 0. 0. 0.
0. 0. 0. 1. 29. ]
[ 50. 0. 0. 0. 0.
0. 0. 0. 0. 35. ]
[130. 1. 1. 0. 0.
0. 0. 0. 0. 36. ]
[ 70. 0. 0. 0. 0.
0. 0. 1. 0. 62. ]
[120. 1. 1. 0. 0.
0. 0. 0. 0. 24. ]
[180. 1. 1. 0. 0.
1. 0. 0. 0. 49. ]
[ 60. 0. 0. 0. 0.
0. 0. 1. 0. 24. ]
[ 90. 0. 0. 0. 0.
0. 0. 1. 0. 28. ]
[ 50. 0. 1. 0. 0.
0. 0. 1. 0. 34. ]
[120. 1. 0. 0. 1.
1. 0. 0. 0. 39. ]
[ 70. 0. 1. 0. 0.
0. 0. 0. 0. 25. ]
[ 40. 0. 0. 0. 0.
0. 0. 1. 0. 36. ]
[ 90. 0. 1. 0. 0.
0. 0. 0. 1. 45. ]
[ 70. 0. 1. 0. 0.
0. 0. 0. 0. 31. ]
[ 70. 0. 1. 0. 0.
1. 0. 0. 0. 45. ]
[ 50. 0. 0. 0. 0.
1. 0. 0. 0. 39. ]
[130. 1. 0. 0. 0.
0. 0. 0. 0. 47. ]
[170. 1. 1. 0. 0.
0. 0. 1. 0. 61. ]
[ 70. 0. 0. 0. 0.
0. 0. 1. 0. 43. ]
[130. 1. 0. 0. 0.
0. 0. 0. 0. 48. ]
[180. 1. 0. 0. 0.
0. 0. 0. 1. 28. ]
[130. 1. 1. 0. 0.
0. 0. 0. 0. 17. ]
[130. 1. 0. 0. 0.
0. 0. 0. 0. 25. ]
[150. 1. 0. 0. 0.
1. 0. 0. 0. 27. ]
[ 50. 0. 1. 0. 0.
0. 0. 1. 0. 36. ]
[ 50. 0. 1. 0. 0.
0. 0. 0. 0. 28. ]
[ 40. 0. 0. 0. 0.
0. 0. 0. 1. 52. ]
[ 60. 0. 0. 0. 0.
0. 0. 0. 0. 62. ]
[ 40. 0. 1. 0. 0.
1. 0. 0. 0. 57. ]
[150. 1. 0. 0. 0.
0. 0. 0. 1. 32. ]
[120. 1. 1. 0. 0.
0. 0. 1. 0. 45. ]
[170. 1. 1. 0. 0.
1. 0. 0. 0. 27. ]
[ 88.45070423 0. 1. 0. 0.
1. 0. 0. 0. 56. ]
[ 80. 0. 0. 1. 0.
0. 0. 0. 1. 23. ]
[ 70. 0. 1. 0. 0.
0. 0. 0. 1. 29. ]
[ 50. 0. 0. 0. 0.
0. 0. 0. 1. 40. ]
[ 80. 0. 1. 0. 0.
0. 0. 1. 0. 35. ]
[ 70. 0. 1. 0. 0.
0. 0. 0. 0. 42. ]
[120. 1. 0. 0. 0.
0. 0. 0. 0. 25. ]
[ 88.45070423 1. 0. 0. 0.
0. 0. 0. 1. 45. ]
[ 70. 0. 0. 0. 0.
1. 0. 0. 0. 34. ]
[140. 1. 1. 0. 0.
0. 0. 0. 1. 46. ]
[ 88.45070423 0. 0. 0. 0.
0. 0. 0. 0. 39. ]
[ 50. 0. 0. 0. 0.
0. 0. 0. 0. 24. ]
[120. 1. 1. 0. 0.
1. 0. 0. 0. 47. ]
[ 70. 0. 0. 0. 0.
0. 0. 0. 0. 19. ]
[120. 1. 0. 0. 0.
0. 0. 1. 0. 47. ]
[ 50. 0. 0. 0. 0.
1. 0. 0. 0. 28. ]
[ 50. 0. 0. 0. 0.
1. 0. 0. 0. 21. ]
[ 40. 0. 1. 0. 0.
0. 0. 0. 0. 32. ]
[ 70. 0. 0. 0. 0.
0. 0. 0. 0. 28. ]
[ 90. 0. 1. 0. 0.
0. 0. 0. 1. 28. ]
[100. 1. 1. 0. 0.
0. 0. 0. 0. 35. ]
[ 50. 0. 0. 0. 0.
1. 0. 0. 0. 46. ]
[120. 1. 1. 0. 0.
0. 0. 0. 0. 52. ]
[120. 1. 1. 0. 0.
0. 0. 0. 1. 42. ]
[ 60. 0. 0. 0. 0.
0. 0. 0. 1. 34. ]
[ 50. 0. 0. 0. 0.
1. 0. 0. 0. 21. ]
[140. 1. 0. 0. 0.
0. 1. 0. 0. 19. ]
[ 88.45070423 1. 1. 0. 0.
0. 0. 0. 1. 35. ]
[ 40. 0. 0. 0. 0.
0. 0. 0. 0. 49. ]
[ 70. 0. 0. 0. 0.
1. 0. 0. 0. 39. ]
[120. 1. 0. 1. 0.
1. 0. 0. 0. 33. ]
[ 50. 0. 1. 0. 0.
0. 0. 0. 0. 41. ]
[150. 1. 0. 1. 0.
0. 0. 0. 0. 26. ]
[ 70. 0. 1. 0. 0.
0. 0. 0. 1. 23. ]
[ 88.45070423 1. 1. 0. 0.
0. 0. 0. 0. 49. ]
[ 60. 0. 1. 0. 0.
0. 0. 0. 1. 37. ]
[170. 1. 0. 0. 0.
0. 0. 0. 0. 36. ]
[ 60. 0. 1. 0. 0.
0. 0. 0. 1. 29. ]
[130. 1. 0. 0. 0.
0. 0. 0. 0. 25. ]
[ 60. 0. 1. 0. 0.
0. 0. 1. 0. 42. ]
[ 50. 0. 1. 0. 0.
1. 0. 0. 0. 42. ]
[ 88.45070423 1. 1. 0. 0.
1. 0. 0. 0. 23. ]]
Python
print(transformer.transform(X_test))

Output:

[[120.   1.   1.   0.   0.   0.   0.   1.   0.  45.]
[150. 1. 1. 0. 0. 0. 0. 1. 0. 46.]
[ 50. 0. 1. 0. 0. 0. 0. 0. 1. 32.]
[ 90. 0. 1. 0. 0. 0. 0. 0. 1. 32.]
[ 80. 0. 0. 0. 0. 1. 0. 0. 0. 35.]
[120. 1. 0. 0. 0. 0. 0. 0. 0. 29.]
[ 80. 0. 1. 0. 0. 1. 0. 0. 0. 56.]
[ 80. 0. 1. 0. 0. 0. 0. 1. 0. 42.]
[150. 1. 1. 0. 0. 0. 0. 0. 0. 27.]
[120. 1. 1. 0. 0. 0. 0. 1. 0. 27.]
[170. 1. 0. 0. 0. 0. 0. 0. 1. 51.]
[170. 1. 0. 0. 0. 0. 0. 1. 0. 37.]
[140. 1. 1. 0. 0. 0. 0. 0. 1. 51.]
[ 50. 0. 0. 0. 0. 1. 0. 0. 0. 25.]
[140. 1. 0. 0. 0. 0. 0. 1. 0. 61.]
[ 40. 0. 1. 0. 0. 0. 0. 1. 0. 54.]
[ 60. 0. 1. 0. 0. 1. 0. 0. 0. 25.]
[ 70. 0. 0. 0. 0. 0. 0. 0. 0. 37.]
[140. 1. 1. 0. 0. 0. 0. 0. 0. 24.]
[170. 1. 0. 0. 0. 0. 0. 0. 1. 21.]]

We can see that there are no null values and that all features have been combined into a single matrix, so the data is ready for the next step.

  • Using the Scikit-learn ColumnTransformer, we have transformed all the features in our dataset into a single matrix. This approach streamlines the preprocessing workflow by applying different transformations to different columns in one object.
  • Unlike handling each feature separately, it ensures consistent preprocessing across the training and test datasets, improves code readability, and reduces the likelihood of mistakes.
  • By combining the separate transformations into a single operation, we save time and create a more robust foundation for our machine learning models.
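
As a possible next step, the transformer can be placed inside a scikit-learn Pipeline together with a model, so that preprocessing and training happen in one fit call. The sketch below is an illustration under assumptions: the rows are made up, LogisticRegression is an arbitrary choice of classifier, and the numeric columns are imputed and scaled in a small sub-pipeline that the article itself does not use.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical rows shaped like the article's dataset
X = pd.DataFrame({
    "Age": [54, 34, 19, 45, 23, 31],
    "Gender": ["male", "female", "female", "male", "male", "female"],
    "Speed": [40.0, 70.0, 140.0, 120.0, np.nan, 80.0],
    "Average_speed": ["low", "low", "high", "high", "low", "low"],
    "City": ["Kolkata", "Delhi", "Delhi", "Kolkata", "Mumbai", "Mumbai"],
})
y = pd.Series(["yes", "yes", "no", "yes", "no", "yes"], name="has_driving_license")

# Numeric columns: fill missing values, then scale
numeric = Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric, ["Age", "Speed"]),
        ("ord", OrdinalEncoder(categories=[["low", "high"]]), ["Average_speed"]),
        # ignore cities or genders at predict time that were not seen during fit
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender", "City"]),
    ]
)

model = Pipeline([("preprocess", preprocess), ("classifier", LogisticRegression())])
model.fit(X, y)

print(model.predict(X.head(2)))
```

Because the preprocessing lives inside the pipeline, calling predict on new rows applies the statistics learned during fit rather than recomputing them.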
