How to Transform Nominal Data for ML with OneHotEncoder from Scikit-Learn
Last Updated: 01 Jul, 2024
In machine learning, pre-processing data, particularly categorical data, is key to a model's effectiveness. Because nominal data has no inherent order, it needs special preparation before it can be represented numerically. Many strategies support this transformation, but one of the most frequently recommended is OneHotEncoding. This article shows how to use the OneHotEncoder from the Scikit-Learn library in Python to convert nominal data for machine learning.
Overview of Nominal Data Transformation
Nominal data is data whose categories have no rank, position, or relative importance; examples include colours, cities, and types of products. Ordinal data, by contrast, retains a meaningful order. Nominal data has none, so it must be transformed and handled differently to be used effectively in models.
OneHotEncoding transforms nominal data into a binary matrix in which each category becomes a vector containing a single high value (1) and low values (0) everywhere else. This converts categorical variables into a format that machine learning algorithms can consume directly.
Using OneHotEncoder in Scikit-Learn
Scikit-Learn provides a convenient and efficient implementation of OneHotEncoder. Here's a step-by-step guide on how to use it:
Step 1: Importing the Necessary Libraries
Python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
Step 2: Creating the Data
Suppose we have a dataset with a nominal feature "City":
Python
data = np.array([['New York'], ['Los Angeles'], ['Chicago'], ['New York'], ['Chicago']])
Step 3: Initializing and Fitting OneHotEncoder
Python
encoder = OneHotEncoder(sparse_output=False)  # sparse_output=False returns a dense array (use sparse=False in scikit-learn < 1.2)
encoded_data = encoder.fit_transform(data)
Step 4: Viewing the Encoded Data
Python
print(encoded_data)
This will output:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
Each city is now represented as a binary vector. The columns follow the alphabetical order of the learned categories: Chicago, Los Angeles, New York.
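To see which column corresponds to which category, the fitted encoder exposes its learned categories and can generate readable column names:
Python
# The learned categories, in the column order used by the encoding
print(encoder.categories_)
# [array(['Chicago', 'Los Angeles', 'New York'], ...)]

# Human-readable names for the encoded columns
print(encoder.get_feature_names_out(['City']))
# ['City_Chicago' 'City_Los Angeles' 'City_New York']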
Handling Categorical Variables
Dealing with New Categories
One of the challenges with OneHotEncoding is handling new categories that were not present during training. Scikit-Learn's OneHotEncoder provides a way to handle this using the handle_unknown parameter:
Python
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_data = encoder.fit_transform(data)
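With handle_unknown='ignore', a category never seen during fitting is encoded as an all-zero row instead of raising an error. A quick illustration with the encoder fitted above ('Houston' is a hypothetical unseen city):
Python
new_data = np.array([['Houston'], ['Chicago']])
print(encoder.transform(new_data))
# [[0. 0. 0.]   <- unknown category: all zeros
#  [1. 0. 0.]]  <- known category encoded normally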
Column Transformers
When dealing with datasets that have both numerical and categorical data, the ColumnTransformer from Scikit-Learn can be used to apply different transformations to different columns:
Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# Assume we have a dataset with a numerical column 'Age' and a nominal column 'City'.
# A DataFrame keeps 'Age' numeric; a mixed NumPy array would coerce every value to a string.
import pandas as pd

data = pd.DataFrame({'Age': [25, 30, 35],
                     'City': ['New York', 'Los Angeles', 'Chicago']})

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age']),   # standard scaling for the numerical column
        ('cat', OneHotEncoder(), ['City'])    # one-hot encoding for the nominal column
    ])

processed_data = preprocessor.fit_transform(data)
print(processed_data)
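The preprocessor can also sit at the front of a Pipeline so that scaling, encoding, and model fitting happen together. A minimal sketch, assuming a hypothetical binary target y for the three rows above:
Python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

y = [0, 1, 0]  # hypothetical labels, one per row of the data above

model = Pipeline(steps=[
    ('preprocess', preprocessor),       # scaling + one-hot encoding
    ('classify', LogisticRegression())
])
model.fit(data, y)
print(model.predict(data))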
Use Cases and Examples
Example 1: Customer Segmentation
Customer segmentation commonly relies on categorical variables such as location, gender, or membership type, which must be converted to numbers before modelling. OneHotEncoding makes this conversion straightforward, so machine learning models can interpret these categories correctly for segmentation and well-defined targeted marketing, as sketched below.
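A minimal sketch with hypothetical customer columns (location, gender, membership tier); all values are illustrative:
Python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

customers = np.array([['Urban', 'F', 'Gold'],
                      ['Rural', 'M', 'Silver'],
                      ['Urban', 'M', 'Gold']])

encoder = OneHotEncoder(sparse_output=False)
segment_features = encoder.fit_transform(customers)
print(segment_features.shape)  # (3, 6): 2 locations + 2 genders + 2 tiers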
Example 2: Predictive Maintenance
In predictive maintenance, machine or equipment types are often recorded as nominal data. OneHotEncoder makes these categorical variables usable as model inputs, improving the accuracy of maintenance predictions.
Conclusion
OneHotEncoding is one of the most important techniques for converting nominal data into a numerical form suitable for machine learning algorithms. OneHotEncoder in Scikit-Learn is a convenient tool for this operation and deserves a place in every data scientist's toolbox. Proper treatment of categorical variables helps in building better machine learning models.
FAQs
Q1. What is the difference between OneHotEncoder and LabelEncoder?
OneHotEncoder expands each categorical feature into a set of binary columns, one per category, while LabelEncoder maps the categories of a single column to integer values. OneHotEncoder is preferable for nominal data because it does not artificially impose an order on the categories.
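A minimal side-by-side comparison of the two encoders on the same column:
Python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

cities = np.array([['New York'], ['Los Angeles'], ['Chicago']])

# LabelEncoder: one integer per category (expects a 1-D array)
print(LabelEncoder().fit_transform(cities.ravel()))  # [2 1 0]

# OneHotEncoder: one binary column per category
print(OneHotEncoder(sparse_output=False).fit_transform(cities))
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]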
Q2. Can OneHotEncoder handle multiple categorical features at once?
Yes, OneHotEncoder can encode several categorical columns at the same time; you can pass an entire array or DataFrame of categorical data to it.
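For example, a two-column input (column names here are illustrative) is encoded in a single fit_transform call, producing one block of binary columns per input column:
Python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'City': ['New York', 'Chicago'],
                   'Membership': ['Gold', 'Silver']})

encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(df))
# [[0. 1. 1. 0.]
#  [1. 0. 0. 1.]]
print(encoder.get_feature_names_out())
# ['City_Chicago' 'City_New York' 'Membership_Gold' 'Membership_Silver']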
Q3. What are some limitations of OneHotEncoding?
The encoding can drastically expand the feature space when a column has many possible categories. This leads to large, sparse matrices and, at some point, higher computational cost.
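One mitigation is built in: by default the encoder returns a SciPy sparse matrix that stores only the non-zero entries. A quick check:
Python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cities = np.array([['New York'], ['Los Angeles'], ['Chicago'],
                   ['New York'], ['Chicago']])

sparse_encoded = OneHotEncoder().fit_transform(cities)  # sparse by default
print(type(sparse_encoded))      # a SciPy sparse matrix
print(sparse_encoded.toarray())  # convert to dense only when needed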
Q4. How does OneHotEncoder handle missing values in categorical data?
OneHotEncoder from Scikit-Learn does not address missing values in categorical data directly; it expects all values to be present during fitting. Hence, the first step is to pre-process the data, either by filling in the missing values or by dropping the affected rows or columns.
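A common pattern is to impute before encoding, for example with SimpleImputer filling missing entries with the most frequent category. A minimal sketch:
Python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cities = np.array([['New York'], [np.nan], ['Chicago'], ['Chicago']], dtype=object)

encode_with_impute = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),  # fills the NaN with 'Chicago'
    ('onehot', OneHotEncoder(sparse_output=False))
])
print(encode_with_impute.fit_transform(cities))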
Q5. Can OneHotEncoder handle large datasets with a high number of unique categories?
OneHotEncoder copes with large datasets, but a high number of unique categories (high cardinality) inflates the dimensionality of the output, so be careful with the expanded feature space. In such cases, keeping the sparse output or applying feature-extraction or dimensionality-reduction methods helps keep the computational cost manageable.
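For high-cardinality columns, scikit-learn >= 1.1 also offers max_categories, which keeps only the most frequent categories and folds the rest into a single 'infrequent' column. A sketch with a hypothetical product column:
Python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

products = np.array([['A'], ['A'], ['B'], ['B'], ['C'], ['D'], ['E']])

# At most 3 output columns: the 2 most frequent categories plus 'infrequent'
encoder = OneHotEncoder(max_categories=3, sparse_output=False)
print(encoder.fit_transform(products).shape)  # (7, 3)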