Using ColumnTransformer in Scikit-Learn for Data Preprocessing
Data preprocessing is a critical step in any machine learning workflow. It involves cleaning and transforming raw data into a format suitable for modeling. One challenge in preprocessing is dealing with datasets that contain different types of features, such as numerical and categorical data. Scikit-learn's ColumnTransformer is a powerful tool that lets you apply different transformations to different subsets of features within your dataset. This article explores how to use ColumnTransformer effectively to streamline your data preprocessing tasks.
The ColumnTransformer is a class in the scikit-learn Python machine learning library that lets you selectively apply data preparation transforms to different columns in your dataset. This is particularly useful when you have a mix of numerical and categorical data that require different preprocessing steps.
Using ColumnTransformer offers several advantages:
- Selective Transformation: Apply specific transformations to subsets of columns.
- Pipeline Integration: Easily integrate with scikit-learn's Pipeline for streamlined workflows, as the sketch below shows.
- Code Organization: Encapsulate preprocessing logic in a single, maintainable object.
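To see the pattern in miniature before we build it up step by step, here is a small hedged sketch of a ColumnTransformer wired into a Pipeline. The column names ('age', 'speed', 'gender', 'city') and the classifier are placeholders for illustration, not the article's dataset:
Python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# apply a different transform to each group of (hypothetical) columns
preprocess = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'speed']),                         # scale numerical columns
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender', 'city'])  # encode categorical columns
])

# the preprocessor and the model travel together as one estimator
model = Pipeline(steps=[('preprocess', preprocess),
                        ('classifier', LogisticRegression())])
# model.fit(X, y) would then scale, encode, and train in a single call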
Let's create a dataset named CAR_SPEED_DATA, which consists of six columns:
- AGE: This column contains numerical data representing the age of individuals. It's a continuous variable that may require scaling to ensure it fits well within our model.
- GENDER: A categorical feature that denotes the gender of individuals, often represented as 'Male' or 'Female.' To make this data usable for machine learning models, we'll need to encode it into numerical values.
- SPEED: Another numerical feature, this column represents the speed of an individual’s vehicle. Like AGE, it might need scaling or normalization.
- AVERAGE_SPEED: This feature is an ordinal categorical value, representing speed categories such as 'high' or 'low'. Although it seems similar to numerical data, it needs special handling because the order matters but the differences between categories are not consistent.
- CITY: A categorical feature indicating the city where the individual resides. With potentially many unique values, we'll need to apply one-hot encoding to convert it into a form suitable for modeling.
- HAS_DRIVING_LICENSE: This binary categorical variable shows whether an individual has a driving license ('Yes' or 'No'). Simple encoding can transform this into a numerical feature.
When we work with such a diverse dataset, different preprocessing steps are required for different columns:
- Numerical Data Handling: AGE and SPEED need to be scaled to ensure that they don’t overpower other features in the model. Without proper scaling, numerical columns with larger ranges could disproportionately affect the model's learning process.
- Categorical Encoding: GENDER and CITY need encoding into numerical formats. With multiple categorical features, applying encoding manually can become a big task, especially when dealing with a large number of categories.
- Ordinal Encoding: AVERAGE_SPEED, as an ordinal categorical feature, requires careful encoding to preserve the order of categories. Applying standard one-hot encoding might not respect this inherent ordering.
- Binary Features: The HAS_DRIVING_LICENSE column is binary and relatively straightforward, but it's another step that adds complexity when handled separately.
Step 1: Importing Necessary Libraries
From sklearn we import SimpleImputer, OneHotEncoder, and OrdinalEncoder.
- SimpleImputer: Fills in missing values with a specified strategy, such as the mean, median, or mode; we use the mean for our dataset.
- OneHotEncoder: Converts categorical features into a format that can be provided to machine learning algorithms by creating binary columns for each category.
- OrdinalEncoder: Transforms categorical features into integer values that represent the ordinal relationship between categories, preserving their order. In our dataset, low speed is encoded as 0 and high speed as 1.
Python
# import the required libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
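Before applying these to the real dataset, here is a small self-contained sketch of what each transformer does on made-up toy values (illustrative only, not rows from CAR_SPEED_DATA). Note that sparse_output is the scikit-learn 1.2+ spelling; older versions use sparse instead:
Python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# SimpleImputer: fills the NaN with the column mean (default strategy='mean')
speeds = np.array([[40.0], [np.nan], [120.0]])
print(SimpleImputer().fit_transform(speeds))    # the NaN becomes 80.0

# OrdinalEncoder: maps ordered categories to integers, preserving the order
levels = np.array([['low'], ['high'], ['low']])
oe = OrdinalEncoder(categories=[['low', 'high']])
print(oe.fit_transform(levels))                 # low -> 0.0, high -> 1.0

# OneHotEncoder: one binary column per category
cities = np.array([['Delhi'], ['Mumbai'], ['Delhi']])
print(OneHotEncoder(sparse_output=False).fit_transform(cities))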
Step 2: Loading The Dataset
Python
# reading the dataset (the download link is given below)
df = pd.read_csv('/content/Car_Speed_Data.csv')
df.head()
Output:
Age Gender Speed Average_speed City has_driving_license
0 54 male 40.0 low Kolkata yes
1 34 female 70.0 low Delhi yes
2 19 female 140.0 high Delhi no
3 45 male 120.0 high Kolkata yes
4 23 male 80.0 low Mumbai no
You can download the dataset from the given link: CAR_SPEED_DATA
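If the download is unavailable, a small stand-in with the same schema can be built by hand from the rows shown above. This is only a convenience for following along; shapes later in the article assume the real 100-row file:
Python
import pandas as pd

# hand-made stand-in mirroring the CAR_SPEED_DATA schema (values copied
# from the df.head() output above; the real file has 100 rows)
df = pd.DataFrame({
    'Age': [54, 34, 19, 45, 23],
    'Gender': ['male', 'female', 'female', 'male', 'male'],
    'Speed': [40.0, 70.0, 140.0, 120.0, 80.0],
    'Average_speed': ['low', 'low', 'high', 'high', 'low'],
    'City': ['Kolkata', 'Delhi', 'Delhi', 'Kolkata', 'Mumbai'],
    'has_driving_license': ['yes', 'yes', 'no', 'yes', 'no'],
})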
Now let's check whether any null values are present.
Python
df.isnull().sum()
Output:
Age 0
Gender 0
Speed 9
Average_speed 0
City 0
has_driving_license 0
dtype: int64
After identifying missing values in the dataset using the isnull().sum() method, we can use sklearn's SimpleImputer to handle these gaps by replacing the missing values with the mean of each feature. This approach ensures that the dataset remains complete and consistent.
- Now we are going to split our dataset into training and testing sets with a test size of 0.2. Since our dataset contains a total of 100 samples, this means we will use 80 samples for training and 20 samples for testing.
Python
from sklearn.model_selection import train_test_split

# note: without a fixed random_state the exact split differs between runs
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['has_driving_license']),
    df['has_driving_license'],
    test_size=0.2)
X_train
Output:
Age Gender Speed Average_speed City
57 23 female 70.0 low Delhi
20 24 female NaN high Banglore
98 42 female 50.0 low Kolkata
86 34 male 60.0 low Banglore
37 42 male NaN low Mumbai
... ... ... ... ... ...
52 29 male 60.0 low Mumbai
47 25 female 130.0 high Banglore
16 42 male 60.0 low Kolkata
76 42 male 50.0 low Delhi
32 23 male NaN high Delhi
80 rows × 5 columns
We have created the X_train and X_test data.
Now we have to handle the null values in the Speed feature using SimpleImputer. Let's do this!
Python
# using SimpleImputer() for the Speed column
si = SimpleImputer()
X_train_Speed = si.fit_transform(X_train[['Speed']])
# for test data: transform only, so missing test values are filled with the
# mean learned from the training data (avoids data leakage)
X_test_Speed = si.transform(X_test[['Speed']])
X_train_Speed.shape
# output: (80, 1)
# to check whether the values were replaced by the mean, display X_train_Speed
Next Step:
Why Use Ordinal Encoding for the Average_speed Feature?
The Average_speed column is an ordinal categorical feature, meaning its categories have a meaningful order, such as low and high. By using OrdinalEncoder, we transform these categories into numerical values that preserve their inherent order. This encoding allows machine learning models to interpret the relative significance of each category, enabling them to capture patterns associated with different speeds. For instance, if low is encoded as 0 and high as 1, the model can recognize that high represents a greater speed than low, impacting how it learns relationships in the data.
Python
# ordinal encoding for Average_speed
oe = OrdinalEncoder(categories=[['low','high']])
X_train_Average_speed = oe.fit_transform(X_train[['Average_speed']])
# for test data: transform only, reusing the category order learned above
X_test_Average_speed = oe.transform(X_test[['Average_speed']])
X_train_Average_speed.shape
# output: (80, 1)
# you can also inspect the result by displaying X_train_Average_speed
Next Step:
Why Use One-Hot Encoding for Gender and City?
One-Hot Encoding is used to convert categorical variables into a binary matrix, allowing machine learning models to interpret these features without assuming any ordinal relationship. By transforming Gender and City into binary columns, we avoid misleading the model into thinking there's a rank or order between the categories.
Python
# performing one-hot encoding for Gender and City
# (sparse=False was renamed to sparse_output=False in scikit-learn 1.2+)
ohe = OneHotEncoder(drop='first', sparse=False)
X_train_Gender_City = ohe.fit_transform(X_train[['Gender','City']])
# for test data: transform only, so the test set is encoded with the
# categories learned from the training set
X_test_Gender_City = ohe.transform(X_test[['Gender','City']])
X_train_Gender_City.shape
# output: (80, 7)
# display X_train_Gender_City to check the encoded feature values
Next Step:
Why Handle Age Separately?
The Age feature is a numerical variable that often plays a critical role in predictive modeling. By extracting Age separately, we can focus on its unique contribution to the model without interference from categorical features. This separation also simplifies preprocessing steps, such as scaling or normalizing numerical data, ensuring that Age is treated with its own distinct statistical considerations (mean, median, mode).
Python
# Extracting Age
X_train_age = X_train.drop(columns=['Gender','Speed','Average_speed','City']).values
# for test data
X_test_age = X_test.drop(columns=['Gender','Speed','Average_speed','City']).values
X_train_age.shape
#output: (80,1)
Final Step:
Concatenation into a single matrix: Why Concatenate and Transform the Data?
The concatenation of X_train_age, X_train_Speed, X_train_Gender_City, and X_train_Average_speed into X_train_transformed creates a single feature set where each feature is represented in a format suitable for machine learning models. This step combines all preprocessed features—numerical and categorical—into a unified matrix, enabling the model to access and learn from all relevant data simultaneously. By using the np.concatenate function, we ensure that each feature retains its processed form and contributes equally to the predictive modeling task.
Python
# np.concatenate combines multiple arrays along a specified axis;
# axis=1 joins the arrays horizontally (column-wise)
X_train_transformed = np.concatenate((X_train_age, X_train_Speed, X_train_Gender_City, X_train_Average_speed), axis=1)
# for test data
X_test_transformed = np.concatenate((X_test_age, X_test_Speed, X_test_Gender_City, X_test_Average_speed), axis=1)
X_train_transformed.shape
# output: (80, 10)
# 80 samples and 10 total features (numerical plus encoded)
Now we have completed the data preprocessing without using ColumnTransformer. This approach is quite tedious because we perform the transformations on each feature separately, which becomes a burden with larger datasets or more complex preprocessing needs.
Without ColumnTransformer, we can see how hectic it is to transform each feature individually and then concatenate everything at the end. This manual process is time-consuming, error-prone, and difficult to maintain. To make it easier and simpler, we can use the ColumnTransformer from Scikit-learn, which applies different preprocessing steps to specific columns simultaneously, making the workflow more efficient and less error-prone.
Using ColumnTransformer
NOTE: Perform the same steps up to the train_test_split method, then import ColumnTransformer from the sklearn library:
Python
from sklearn.compose import ColumnTransformer
The ColumnTransformer is a powerful tool in sklearn for applying different preprocessing transformations to specific columns within a dataset, allowing processing based on the nature of each feature.
Syntax for ColumnTransformer:
transformer = ColumnTransformer(transformers=[
    ('imputer', SimpleImputer(), ['NumericalColumn1', 'NumericalColumn2']),
    ('ordinal', OrdinalEncoder(), ['OrdinalColumn']),
    ('onehot', OneHotEncoder(), ['CategoricalColumn1', 'CategoricalColumn2'])
], remainder='passthrough')
Python
# declare the transformer by calling ColumnTransformer():
#   t1 applies SimpleImputer to Speed,
#   t2 applies OrdinalEncoder to the ordinal Average_speed column,
#   t3 applies one-hot encoding to Gender and City
transformer = ColumnTransformer(transformers=[
    ('t1', SimpleImputer(), ['Speed']),
    ('t2', OrdinalEncoder(categories=[['low','high']]), ['Average_speed']),
    ('t3', OneHotEncoder(sparse=False, drop='first'), ['Gender','City'])
], remainder='passthrough')
What is remainder='passthrough'?
The remainder parameter specifies what to do with the columns not explicitly handled by the transformers list.
- remainder='passthrough': leaves all other columns untouched and includes them in the transformed output.
- Why use it? It is useful when you want to apply specific transformations to only some columns while preserving the rest as they are, ensuring that no information from the original dataset is lost. A concrete comparison with the default behaviour is sketched below.
- Each per-column transformation is registered in this single ColumnTransformer object.
- Since a model cannot consume raw categorical words, the transformer converts the data into a numerical matrix, which is then given to the machine for training.
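To make the effect of remainder concrete, here is a small sketch comparing 'passthrough' with the default 'drop' on a toy frame (column names invented for illustration):
Python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'speed_level': ['low', 'high'], 'age': [23, 45]})
enc = [('ord', OrdinalEncoder(categories=[['low', 'high']]), ['speed_level'])]

# remainder='drop' (the default) discards the untransformed 'age' column
print(ColumnTransformer(enc, remainder='drop').fit_transform(toy).shape)        # (2, 1)

# remainder='passthrough' appends 'age' untouched after the transformed columns
print(ColumnTransformer(enc, remainder='passthrough').fit_transform(toy).shape) # (2, 2)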
With this single ColumnTransformer, every preprocessing step runs on the dataset in one call, producing one matrix that contains all the transformed feature values:
Python
# fit on the training data and apply the same fitted transforms to both sets
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
X_train_transformed.shape
# output: (80, 10)
X_test_transformed.shape
# output: (20, 10)
print(X_train_transformed)
Output:
[[ 70.         0. 0. 0. 0. 1. 0. 0. 0. 23.]
 [ 88.45070423 1. 0. 0. 0. 0. 0. 0. 0. 24.]
 [ 50.         0. 0. 0. 0. 0. 0. 1. 0. 42.]
 [ 60.         0. 1. 0. 0. 0. 0. 0. 0. 34.]
 [ 88.45070423 0. 1. 0. 0. 0. 0. 0. 1. 42.]
 [ 60.         0. 1. 0. 0. 0. 0. 1. 0. 20.]
 [ 88.45070423 0. 0. 0. 0. 0. 0. 0. 1. 29.]
 [ 50.         0. 0. 0. 0. 0. 0. 0. 0. 35.]
 [130.         1. 1. 0. 0. 0. 0. 0. 0. 36.]
 [ 70.         0. 0. 0. 0. 0. 0. 1. 0. 62.]
 [120.         1. 1. 0. 0. 0. 0. 0. 0. 24.]
 [180.         1. 1. 0. 0. 1. 0. 0. 0. 49.]
 [ 60.         0. 0. 0. 0. 0. 0. 1. 0. 24.]
 [ 90.         0. 0. 0. 0. 0. 0. 1. 0. 28.]
 [ 50.         0. 1. 0. 0. 0. 0. 1. 0. 34.]
 [120.         1. 0. 0. 1. 1. 0. 0. 0. 39.]
 [ 70.         0. 1. 0. 0. 0. 0. 0. 0. 25.]
 [ 40.         0. 0. 0. 0. 0. 0. 1. 0. 36.]
 [ 90.         0. 1. 0. 0. 0. 0. 0. 1. 45.]
 [ 70.         0. 1. 0. 0. 0. 0. 0. 0. 31.]
 [ 70.         0. 1. 0. 0. 1. 0. 0. 0. 45.]
 [ 50.         0. 0. 0. 0. 1. 0. 0. 0. 39.]
 [130.         1. 0. 0. 0. 0. 0. 0. 0. 47.]
 [170.         1. 1. 0. 0. 0. 0. 1. 0. 61.]
 [ 70.         0. 0. 0. 0. 0. 0. 1. 0. 43.]
 [130.         1. 0. 0. 0. 0. 0. 0. 0. 48.]
 [180.         1. 0. 0. 0. 0. 0. 0. 1. 28.]
 [130.         1. 1. 0. 0. 0. 0. 0. 0. 17.]
 [130.         1. 0. 0. 0. 0. 0. 0. 0. 25.]
 [150.         1. 0. 0. 0. 1. 0. 0. 0. 27.]
 [ 50.         0. 1. 0. 0. 0. 0. 1. 0. 36.]
 [ 50.         0. 1. 0. 0. 0. 0. 0. 0. 28.]
 [ 40.         0. 0. 0. 0. 0. 0. 0. 1. 52.]
 [ 60.         0. 0. 0. 0. 0. 0. 0. 0. 62.]
 [ 40.         0. 1. 0. 0. 1. 0. 0. 0. 57.]
 [150.         1. 0. 0. 0. 0. 0. 0. 1. 32.]
 [120.         1. 1. 0. 0. 0. 0. 1. 0. 45.]
 [170.         1. 1. 0. 0. 1. 0. 0. 0. 27.]
 [ 88.45070423 0. 1. 0. 0. 1. 0. 0. 0. 56.]
 [ 80.         0. 0. 1. 0. 0. 0. 0. 1. 23.]
 [ 70.         0. 1. 0. 0. 0. 0. 0. 1. 29.]
 [ 50.         0. 0. 0. 0. 0. 0. 0. 1. 40.]
 [ 80.         0. 1. 0. 0. 0. 0. 1. 0. 35.]
 [ 70.         0. 1. 0. 0. 0. 0. 0. 0. 42.]
 [120.         1. 0. 0. 0. 0. 0. 0. 0. 25.]
 [ 88.45070423 1. 0. 0. 0. 0. 0. 0. 1. 45.]
 [ 70.         0. 0. 0. 0. 1. 0. 0. 0. 34.]
 [140.         1. 1. 0. 0. 0. 0. 0. 1. 46.]
 [ 88.45070423 0. 0. 0. 0. 0. 0. 0. 0. 39.]
 [ 50.         0. 0. 0. 0. 0. 0. 0. 0. 24.]
 [120.         1. 1. 0. 0. 1. 0. 0. 0. 47.]
 [ 70.         0. 0. 0. 0. 0. 0. 0. 0. 19.]
 [120.         1. 0. 0. 0. 0. 0. 1. 0. 47.]
 [ 50.         0. 0. 0. 0. 1. 0. 0. 0. 28.]
 [ 50.         0. 0. 0. 0. 1. 0. 0. 0. 21.]
 [ 40.         0. 1. 0. 0. 0. 0. 0. 0. 32.]
 [ 70.         0. 0. 0. 0. 0. 0. 0. 0. 28.]
 [ 90.         0. 1. 0. 0. 0. 0. 0. 1. 28.]
 [100.         1. 1. 0. 0. 0. 0. 0. 0. 35.]
 [ 50.         0. 0. 0. 0. 1. 0. 0. 0. 46.]
 [120.         1. 1. 0. 0. 0. 0. 0. 0. 52.]
 [120.         1. 1. 0. 0. 0. 0. 0. 1. 42.]
 [ 60.         0. 0. 0. 0. 0. 0. 0. 1. 34.]
 [ 50.         0. 0. 0. 0. 1. 0. 0. 0. 21.]
 [140.         1. 0. 0. 0. 0. 1. 0. 0. 19.]
 [ 88.45070423 1. 1. 0. 0. 0. 0. 0. 1. 35.]
 [ 40.         0. 0. 0. 0. 0. 0. 0. 0. 49.]
 [ 70.         0. 0. 0. 0. 1. 0. 0. 0. 39.]
 [120.         1. 0. 1. 0. 1. 0. 0. 0. 33.]
 [ 50.         0. 1. 0. 0. 0. 0. 0. 0. 41.]
 [150.         1. 0. 1. 0. 0. 0. 0. 0. 26.]
 [ 70.         0. 1. 0. 0. 0. 0. 0. 1. 23.]
 [ 88.45070423 1. 1. 0. 0. 0. 0. 0. 0. 49.]
 [ 60.         0. 1. 0. 0. 0. 0. 0. 1. 37.]
 [170.         1. 0. 0. 0. 0. 0. 0. 0. 36.]
 [ 60.         0. 1. 0. 0. 0. 0. 0. 1. 29.]
 [130.         1. 0. 0. 0. 0. 0. 0. 0. 25.]
 [ 60.         0. 1. 0. 0. 0. 0. 1. 0. 42.]
 [ 50.         0. 1. 0. 0. 1. 0. 0. 0. 42.]
 [ 88.45070423 1. 1. 0. 0. 1. 0. 0. 0. 23.]]
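To check which column of this 10-column matrix came from which feature, the fitted transformer can report its output names. This assumes a reasonably recent scikit-learn (roughly 1.1 or newer), where ColumnTransformer and all the transformers used here expose get_feature_names_out():
Python
# names are prefixed by the transformer that produced them
# ('t1', 't2', 't3', or 'remainder' for the passed-through columns)
print(transformer.get_feature_names_out())
# expected layout: Speed first, then Average_speed, then the one-hot
# Gender/City columns, and finally the passed-through Age column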
Python
print(transformer.transform(X_test))
Output:
[[120. 1. 1. 0. 0. 0. 0. 1. 0. 45.]
[150. 1. 1. 0. 0. 0. 0. 1. 0. 46.]
[ 50. 0. 1. 0. 0. 0. 0. 0. 1. 32.]
[ 90. 0. 1. 0. 0. 0. 0. 0. 1. 32.]
[ 80. 0. 0. 0. 0. 1. 0. 0. 0. 35.]
[120. 1. 0. 0. 0. 0. 0. 0. 0. 29.]
[ 80. 0. 1. 0. 0. 1. 0. 0. 0. 56.]
[ 80. 0. 1. 0. 0. 0. 0. 1. 0. 42.]
[150. 1. 1. 0. 0. 0. 0. 0. 0. 27.]
[120. 1. 1. 0. 0. 0. 0. 1. 0. 27.]
[170. 1. 0. 0. 0. 0. 0. 0. 1. 51.]
[170. 1. 0. 0. 0. 0. 0. 1. 0. 37.]
[140. 1. 1. 0. 0. 0. 0. 0. 1. 51.]
[ 50. 0. 0. 0. 0. 1. 0. 0. 0. 25.]
[140. 1. 0. 0. 0. 0. 0. 1. 0. 61.]
[ 40. 0. 1. 0. 0. 0. 0. 1. 0. 54.]
[ 60. 0. 1. 0. 0. 1. 0. 0. 0. 25.]
[ 70. 0. 0. 0. 0. 0. 0. 0. 0. 37.]
[140. 1. 1. 0. 0. 0. 0. 0. 0. 24.]
[170. 1. 0. 0. 0. 0. 0. 0. 1. 21.]]
We can see that there are no null values and everything has been combined into a single matrix; we are now ready for the next step.
- Using the Scikit-learn ColumnTransformer, we have efficiently transformed each feature in our dataset into a single matrix. This approach streamlines the preprocessing workflow by applying different transformations to different columns simultaneously.
- Unlike handling each feature separately, it ensures consistent preprocessing across training and test datasets, enhances code readability, and reduces the likelihood of mistakes.
- By combining our separate operations into a single transformation, we not only save time but also create a more robust foundation for our machine learning models to build upon.
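As a final optional step, sketched here under the assumption that a classifier is wanted on top, the transformer can be dropped into a Pipeline so that fitting and prediction run the preprocessing automatically. DecisionTreeClassifier is just an illustrative choice, not part of the original workflow:
Python
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# the ColumnTransformer defined earlier becomes the first pipeline step
pipe = Pipeline(steps=[
    ('preprocess', transformer),
    ('model', DecisionTreeClassifier())
])

# fit() imputes, encodes, and trains in one call; predict()/score() reuse
# the exact same fitted preprocessing on unseen data
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))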