Prediction using ColumnTransformer, OneHotEncoder and Pipeline
Last Updated: 17 Jul, 2020
In this tutorial, we’ll predict each customer’s insurance premium from various features, using ColumnTransformer, OneHotEncoder and Pipeline.
We’ll import the necessary data manipulating libraries:
Code:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
We’ll now load the dataset, which is available here:
Each row is a different individual, with an age, gender, body mass index (bmi), number of dependents, whether they smoke, the region they belong to, and the insurance premium they pay.
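The snippet below is a minimal sketch of loading the data; the file name 'insurance.csv' is a placeholder for wherever your copy of the dataset lives.
Code:
df = pd.read_csv('insurance.csv')  # placeholder path; point this at your copy of the dataset
df.head()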
We’ll check whether the dataset has any missing values:
Code:
df.isna().sum()
We see there are none. But we will introduce some ‘impurities’ into this dataset, because a smooth sea has never made a skilled sailor! And, more practically, because we need missing values to demonstrate ColumnTransformer properly.
Code:
np.random.seed(0)
for _ in range(10):
    r = np.random.randint(len(df))  # random row index
    c = np.random.randint(6)        # random column index among the 6 feature columns
    df.iloc[r, c] = np.nan
With range(10) we introduce NaNs at 10 random positions in the data; it doesn’t matter whether each NaN lands in a different row or several of them fall in the same row.
Code:
df.isna().sum()
We’ll now split the data into train and test sets.
Code:
X_train, X_test, y_train, y_test = train_test_split(df.drop('charges', axis=1),
                                                     df['charges'],
                                                     test_size=0.2, random_state=0)
Now enters the ColumnTransformer!
A ColumnTransformer takes in a list, which contains tuples of the transformations we wish to perform on the different columns. Each tuple expects 3 comma-separated values: first, the name of the transformer, which can be practically anything (passed as a string), second is the estimator object, and the final one being the columns upon which we wish to perform that operation.
Code:
trf1 = ColumnTransformer(transformers=[
    ('cat', SimpleImputer(strategy='most_frequent'), ['sex', 'smoker', 'region']),
    ('num', SimpleImputer(strategy='median'), ['age', 'bmi', 'children']),
], remainder='passthrough')
First, we’ll impute the categorical columns. We’ll use the most_frequent, or the ‘mode’ type of imputation, and the categorical columns are ‘sex’, ‘smoker’ and ‘region’. We’ll name this transformer ‘cat’ for simplicity.
Similarly, we’ll impute the numerical columns using the medians of the respective columns. We now need to tell the ColumnTransformer what it should do with the remaining columns, i.e. the columns upon which no transformation was performed. In our case, all features are used, but in cases where you have ‘unused’ columns, you can specify whether you want to drop or retain those columns after the transformation. We’ll retain them, hence we pass remainder='passthrough' instead of the default behavior, which is to drop those columns.

We could also have specified the columns by their integer positions instead of their names; for [‘age’, ‘bmi’, ‘children’], for example, we could have said [0, 2, 3]. Now we’ll fit and transform the X_train to see the output, which is a numpy array by default:
Code:
first_step = trf1.fit_transform(X_train)
first_step

We’ll make a data frame out of it:
Code:
pd.DataFrame(first_step).head()
Did you notice that the columns have been reordered, and the column names are now lost? They’ve been reordered in the order of the transformers that we passed to the ColumnTransformer, i.e. we first asked it to impute the categorical columns, hence they’ve been placed first, and so on…
We can also confirm that no missing values remain:
Code:
pd.DataFrame(first_step).isna().sum()
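Since the column names are lost after the transform, we can re-attach them manually. The order below is a sketch inferred from the transformers defined in trf1 (the ‘cat’ columns first, then the ‘num’ columns), not something returned by the ColumnTransformer itself.
Code:
# Order follows trf1: categorical columns first, then numerical columns
cols = ['sex', 'smoker', 'region', 'age', 'bmi', 'children']
pd.DataFrame(first_step, columns=cols).head()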

We can check what each transformer is doing by using the ‘names’ we passed in the tuples:
Code:
trf1.named_transformers_['cat'].statistics_

Code:
trf1.named_transformers_['num'].statistics_
Now that all columns are free of missing values, we can go ahead with encoding of the categorical columns.
Note: OneHotEncoder can’t handle missing values, hence it is important to get rid of them before encoding. We now make another transformer object for the encoding. We couldn’t include this in ‘trf1’ because, at that point, X_train still contained missing values and OneHotEncoder can’t deal with them, as discussed earlier. Hence we first needed to remove the missing values, and we can now pass the new ‘first_step’ array (with no missing values) to OneHotEncoder.
Code:
trf2 = ColumnTransformer(transformers=[
    ('enc', OneHotEncoder(sparse=False, drop='first'), list(range(3))),
], remainder='passthrough')
We set the sparse parameter to False (because we want a dense array output), and we can toggle between dropping the first of the dummy-encoded columns or not, depending upon the type of model we’re fitting, to avoid the ‘dummy variable trap’. Learn more about it here. A general rule of thumb: drop a dummy-encoded column if using a linear-based model, and do not drop it if using a tree-based model (a small sketch of the effect of drop='first' follows the next code block). Also, did you see how for the columns parameter we specified list(range(3)) instead of the column names? That is because we’ve now lost the column names (as seen in ‘first_step’), but we know the categorical columns are the first three columns (after reordering), hence we specify [0, 1, 2].
Code:
second_step = trf2.fit_transform(first_step)
pd.DataFrame(second_step).head()
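To see what drop='first' actually does, here is a small sketch on a toy column (not part of the insurance dataset): with four categories, dropping the first leaves three dummy columns instead of four.
Code:
# Toy data, only to illustrate drop='first'
# (newer scikit-learn versions use sparse_output instead of sparse)
toy = pd.DataFrame({'region': ['southwest', 'southeast', 'northwest', 'northeast']})
full = OneHotEncoder(sparse=False).fit_transform(toy)                    # 4 dummy columns
dropped = OneHotEncoder(sparse=False, drop='first').fit_transform(toy)   # 3 dummy columns
print(full.shape, dropped.shape)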
Now comes the Pipeline! We could’ve performed all these steps in one single Pipeline instance. The pipeline also expects a list of tuples, and each tuple in turn expects two values: the name of the step and the object.
Code:
pipe = Pipeline(steps=[
    ('tf1', trf1),
    ('tf2', trf2),
    ('tf3', MinMaxScaler()),
    ('model', RandomForestRegressor(n_estimators=200)),
])
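Next, we evaluate the whole pipeline with cross-validation on the training set. The exact call isn’t shown in the original snippet, so the fold count below (cv=5) is an assumption; for a regressor, cross_val_score returns R² scores by default.
Code:
# cv=5 is an assumed fold count; the default scoring for a regressor is R^2
cvs = cross_val_score(pipe, X_train, y_train, cv=5)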
Code:
print("All cross val scores:", cvs)
print("Mean of all scores:", cvs.mean())

So our model achieves a mean cross-validated R² score of around 0.812. You could try different regressors, tweak parameters, use StandardScaler or other scalers, and see if you can achieve better results. We can use GridSearchCV to do this work of finding the best set of parameters for us (a minimal sketch is given at the end of this article). We’ll now fit the model on the entire training set, and predict results on the test set:
Code:
pipe.fit(X_train, y_train)
Code:
preds = pipe.predict(X_test)
pd.DataFrame({'original test set': y_test, 'predictions': preds})
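As mentioned earlier, GridSearchCV can search over the pipeline’s parameters for us. Below is a minimal sketch: the step name 'model' refers to the RandomForestRegressor defined in the pipeline above, while the grid values themselves are illustrative placeholders rather than tuned recommendations.
Code:
from sklearn.model_selection import GridSearchCV

# Pipeline parameters are addressed as '<step name>__<parameter name>'
param_grid = {
    'model__n_estimators': [100, 200, 500],
    'model__max_depth': [None, 5, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)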