Open In App

One Hot Encoding in Machine Learning

Last Updated : 07 Feb, 2025
Comments
Improve
Suggest changes
Like Article
Like
Report

One Hot Encoding is a method for converting categorical variables into a binary format. It creates new columns for each category where 1 means the category is present and 0 means it is not. The primary purpose of One Hot Encoding is to ensure that categorical data can be effectively used in machine learning models.

Importance of One Hot Encoding

We use one hot Encoding because:

  1. Eliminating Ordinality: Many categorical variables have no inherent order (e.g., “Male” and “Female”). If we were to assign numerical values (e.g., Male = 0, Female = 1) the model might mistakenly interpret this as a ranking and lead to biased predictions. One Hot Encoding eliminates this risk by treating each category independently.
  2. Improving Model Performance: By providing a more detailed representation of categorical variables. One Hot Encoding can help to improve the performance of machine learning models. It allows models to capture complex relationships within the data that might be missed if categorical variables were treated as single entities.
  3. Compatibility with Algorithms: Many machine learning algorithms particularly based on linear regression and gradient descent which require numerical input. It ensures that categorical variables are converted into a suitable format.

How One-Hot Encoding Works: An Example

To grasp the concept better let’s explore a simple example. Imagine we have a dataset with fruits their categorical values and corresponding prices. Using one-hot encoding we can transform these categorical values into numerical form. For example:

  • Wherever the fruit is “Apple,” the Apple column will have a value of 1 while the other fruit columns (like Mango or Orange) will contain 0.
  • This pattern ensures that each categorical value gets its own column represented with binary values (1 or 0) making it usable for machine learning models.
FruitCategorical value of fruitPrice
apple15
mango210
apple115
orange320

The output after applying one-hot encoding on the data is given as follows,

Fruit_appleFruit_mangoFruit_orangeprice
1005
01010
10015
00120

Implementing One-Hot Encoding Using Python

To implement one-hot encoding in Python we can use either the Pandas library or the Scikit-learn library both of which provide efficient and convenient methods for this task.

1. Using Pandas

Pandas offers the get_dummies function which is a simple and effective way to perform one-hot encoding. This method converts categorical variables into multiple binary columns.

  • For example the Gender column with values 'M' and 'F' becomes two binary columns: Gender_F and Gender_M.
  • drop_first=True in pandas drops one redundant column e.g., keeps only Gender_F to avoid multicollinearity.
Python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {
    'Employee id': [10, 20, 15, 25, 30],
    'Gender': ['M', 'F', 'F', 'M', 'F'],
    'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice']
}

df = pd.DataFrame(data)
print(f"Original Employee Data:\n{df}\n")
# Use pd.get_dummies() to one-hot encode the categorical columns
df_pandas_encoded = pd.get_dummies(df, columns=['Gender', 'Remarks'], drop_first=True)
print(f"One-Hot Encoded Data using Pandas:\n{df_pandas_encoded}\n")

encoder = OneHotEncoder(sparse_output=False)

one_hot_encoded = encoder.fit_transform(df[categorical_columns])

one_hot_df = pd.DataFrame(one_hot_encoded, 
                          columns=encoder.get_feature_names_out(categorical_columns))

df_sklearn_encoded = pd.concat([df.drop(categorical_columns, axis=1), one_hot_df], axis=1)

print(f"One-Hot Encoded Data using Scikit-Learn:\n{df_sklearn_encoded}\n")

Output:

Original Employee Data:
   Employee id Gender Remarks
0           10      M    Good
1           20      F    Nice
2           15      F    Good
3           25      M   Great
4           30      F    Nice

One-Hot Encoded Data using Pandas:
   Employee id  Gender_M  Remarks_Great  Remarks_Nice
0           10      True          False         False
1           20     False          False          True
2           15     False          False         False
3           25      True           True         False
4           30     False          False          True

We can observe that we have 3 Remarks and 2 Gender columns in the data. However you can just use n-1 columns to define parameters if it has n unique labels. For example if we only keep the Gender_Female column and drop the Gender_Male column then also we can convey the entire information as when the label is 1 it means female and when the label is 0 it means male. This way we can encode the categorical data and reduce the number of parameters as well.

2. One Hot Encoding using Scikit Learn Library

Scikit-learn(sklearn) is a popular machine-learning library in Python that provide numerous tools for data preprocessing. It provides a OneHotEncoder function that we use for encoding categorical and numerical variables into binary vectors. Using df.select_dtypes(include=['object']) in Scikit Learn Library:

  • This selects only the columns with categorical data (data type object).
  • In this case, ['Gender', 'Remarks'] are identified as categorical columns.
Python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {'Employee id': [10, 20, 15, 25, 30],
        'Gender': ['M', 'F', 'F', 'M', 'F'],
        'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice'],
        }
df = pd.DataFrame(data)
print(f"Employee data : \n{df}")

categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
encoder = OneHotEncoder(sparse_output=False)

one_hot_encoded = encoder.fit_transform(df[categorical_columns])

one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

df_encoded = pd.concat([df, one_hot_df], axis=1)

df_encoded = df_encoded.drop(categorical_columns, axis=1)
print(f"Encoded Employee data : \n{df_encoded}")

Output:

Employee data : 
   Employee id Gender Remarks
0           10      M    Good
1           20      F    Nice
2           15      F    Good
3           25      M   Great
4           30      F    Nice
Encoded Employee data : 
   Employee id  Gender_F  Gender_M  Remarks_Good  Remarks_Great  Remarks_Nice
0           10       0.0       1.0           1.0            0.0           0.0
1           20       1.0       0.0           0.0            0.0           1.0
2           15       1.0       0.0           1.0            0.0           0.0
3           25       0.0       1.0           0.0            1.0           0.0
4           30       1.0       0.0           0.0            0.0           1.0

Both Pandas and Scikit-Learn offer robust solutions for one-hot encoding.

  • Use Pandas get_dummies() when you need quick and simple encoding.
  • Use Scikit-Learn OneHotEncoder when working within a machine learning pipeline or when you need finer control over encoding behavior.

Advantages and Disadvantages of One Hot Encoding

Advantages of Using One Hot Encoding

  1. It allows the use of categorical variables in models that require numerical input.
  2. It can improve model performance by providing more information to the model about the categorical variable.
  3. It can help to avoid the problem of ordinality which can occur when a categorical variable has a natural ordering (e.g. “small”, “medium”, “large”).

Disadvantages of Using One Hot Encoding

  1. It can lead to increased dimensionality as a separate column is created for each category in the variable. This can make the model more complex and slow to train.
  2. It can lead to sparse data as most observations will have a value of 0 in most of the one-hot encoded columns.
  3. It can lead to overfitting especially if there are many categories in the variable and the sample size is relatively small.

Best Practices for One Hot Encoding

To make the most of One Hot Encoding and we must consider the following best practices:

  1. Limit the Number of Categories: If you have high cardinality categorical variables consider limiting the number of categories through grouping or feature engineering.
  2. Use Feature Selection: Implement feature selection techniques to identify and retain only the most relevant features after One Hot Encoding. This can help reduce dimensionality and improve model performance.
  3. Monitor Model Performance: Regularly evaluate your model’s performance after applying One Hot Encoding. If you notice signs of overfitting or other issues consider alternative encoding methods.
  4. Understand Your Data: Before applying One Hot Encoding take the time to understand the nature of your categorical variables. Determine whether they have a natural order and whether One Hot Encoding is appropriate.

Alternatives to One Hot Encoding

While One Hot Encoding is a popular choice for handling categorical data there are several alternatives that may be more suitable depending on the context:

  1. Label Encoding: In cases where categorical variables have a natural order (e.g., “Low,” “Medium,” “High”) label encoding can be a better option. This method assigns a unique integer to each category without introducing the same risks of hierarchy misinterpretation as with nominal data.
  2. Binary Encoding: This technique combines the benefits of One Hot Encoding and label encoding. It converts categories into binary numbers and then creates binary columns. This method can reduce dimensionality while preserving information.
  3. Target Encoding: In target encoding, we replace each category with the mean of the target variable for that category. This method can be particularly useful for categorical variables with a high number of unique values but it also carries a risk of leakage if not handled properly.


Next Article

Similar Reads