Handling Missing Data with KNN Imputer
Last Updated: 13 Aug, 2024
Missing data is a common issue in data analysis and machine learning, often leading to inaccurate models and biased results. One effective method for addressing this issue is the K-Nearest Neighbors (KNN) imputation technique. This article will delve into the technical aspects of KNN imputation, its implementation, advantages, and limitations.
Understanding KNN Imputation for Handling Missing Data
KNN imputation is a technique used to fill missing values in a dataset by leveraging the K-Nearest Neighbors algorithm. This method involves finding the k-nearest neighbors to a data point with a missing value and imputing the missing value using the mean or median of the neighboring data points. This approach preserves the relationships between features, which can lead to better model performance compared to simpler imputation methods like mean or median imputation.
How Does the KNN Imputer Work?
- Identifying Missing Values: The first step is to identify the missing values in the dataset, typically marked as NaN (Not a Number).
- Finding Nearest Neighbors: For each data point with a missing value, the KNN imputer finds the k-nearest neighbors using a distance metric that tolerates missing entries; scikit-learn's KNNImputer uses a NaN-aware Euclidean distance (nan_euclidean) by default, which skips missing coordinates when computing distances.
- Imputing Missing Values: The missing value is then imputed using the mean (uniform or distance-weighted) of the values that the k-nearest neighbors have for that feature, as the sketch below demonstrates.
For more in-depth coverage, refer to: How KNN Imputer Works in Machine Learning
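To make the mechanics concrete, here is a minimal sketch of the neighbor-finding and averaging steps done by hand. It uses scikit-learn's NaN-aware distance helper nan_euclidean_distances; the toy array X and the choice of k = 2 are purely illustrative.
Python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy data: row 1 is missing its second feature
X = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 4.0],
              [3.0, 4.0, 5.0],
              [8.0, 9.0, 10.0]])

# NaN-aware Euclidean distances from the incomplete row to all rows
dists = nan_euclidean_distances(X[1:2], X)[0]
dists[1] = np.inf  # exclude the row itself

# Indices of the k = 2 nearest neighbors
k = 2
neighbors = np.argsort(dists)[:k]

# Impute with the mean of the neighbors' values in the missing column
imputed_value = X[neighbors, 1].mean()
print("Imputed value:", imputed_value)  # mean of rows 0 and 2 -> 3.0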
Implementing KNN Imputer in Python for Missing Data
Choosing the Right Parameters for KNN Imputer
The performance of the KNN Imputer depends on the choice of parameters:
- n_neighbors: The number of neighbors to consider for imputation. A smaller value may be more sensitive to noise, while a larger value may oversmooth the data.
- weights: Determines how to weight the contributions of the neighbors. Options include:
  - uniform: All neighbors have equal weight.
  - distance: Weights neighbors by the inverse of their distance, giving closer neighbors more influence.
- metric: The distance metric used to find neighbors. KNNImputer supports 'nan_euclidean' (the default, a Euclidean distance that skips missing coordinates) or a user-supplied callable; unlike the KNN classifiers and regressors, it does not expose a Minkowski power parameter p.
Example:
imputer = KNNImputer(n_neighbors=3, weights='distance')
The KNNImputer class from the scikit-learn library provides a straightforward way to implement KNN imputation.
Example 1: Basic Implementation of KNN Imputer for Handling Missing Data
Python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
# Sample dataset with missing values
data = {
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [np.nan, 2.0, 3.0, 4.0],
    'Feature3': [1.0, 2.0, 3.0, np.nan]
}
df = pd.DataFrame(data)

# Initialize KNNImputer
imputer = KNNImputer(n_neighbors=2)

# Impute missing values
df_imputed = imputer.fit_transform(df)
print("Data after KNN Imputation:\n", df_imputed)
Output:
Data after KNN Imputation:
[[1. 2.5 1. ]
[2. 2. 2. ]
[3. 3. 3. ]
[4. 4. 2.5]]
Here, for example, the missing Feature2 value in the first row is imputed as 2.5, the mean of the Feature2 values (2.0 and 3.0) of that row's two nearest neighbors.
Example 2: Handling Missing Data with Mixed Feature Types
Python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

data = {
    'Numerical_1': [5.1, 4.9, np.nan, 4.7, 5.0],
    'Numerical_2': [3.5, np.nan, 3.0, np.nan, 3.2],
    'Categorical': ['A', 'B', 'A', 'C', np.nan]
}
df = pd.DataFrame(data)

numerical_features = ['Numerical_1', 'Numerical_2']
categorical_features = ['Categorical']

# Pipeline for numerical features
numerical_pipeline = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=2)),  # Apply KNN Imputer first
    ('scaler', StandardScaler())             # Then standardize
])

# Pipeline for categorical features
categorical_pipeline = Pipeline(steps=[
    ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value',
                               unknown_value=-1)),  # Encode categories first
    ('imputer', KNNImputer(n_neighbors=2))          # Impute after encoding
])

# Combine both pipelines into a ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

# Apply the preprocessor to the data
df_imputed = preprocessor.fit_transform(df)

# Convert the imputed array back to a DataFrame
df_imputed = pd.DataFrame(df_imputed,
                          columns=numerical_features + categorical_features)
print("Data after KNN Imputation with Mixed Feature Types:\n", df_imputed)
Output:
Data after KNN Imputation with Mixed Feature Types:
Numerical_1 Numerical_2 Categorical
0 1.060660 1.300887 0.00
1 -0.353553 0.413919 1.00
2 0.707107 -1.655675 0.00
3 -1.767767 0.413919 2.00
4 0.353553 -0.473050 0.75
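Note that the numerical columns appear in standardized units because StandardScaler runs after imputation, and the imputed Categorical value (0.75) is a fractional ordinal code: since the categorical pipeline sees only one column, KNNImputer falls back to the column mean of the encoded values. Below is a minimal sketch of mapping the codes back to category labels, assuming the preprocessor above has already been fitted; the rounding step is an illustrative heuristic, not part of the scikit-learn API.
Python
import numpy as np

# Retrieve the fitted OrdinalEncoder from inside the ColumnTransformer
encoder = preprocessor.named_transformers_['cat'].named_steps['encoder']

# Round the fractional codes to the nearest category code (simple heuristic)
cat_codes = np.round(df_imputed[['Categorical']].to_numpy())

# Decode the codes back to the original labels
print(encoder.inverse_transform(cat_codes))  # e.g. 0.75 -> 1 -> 'B'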
Example 3: Handling Missing Data for Time Series Model
Python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

np.random.seed(42)
date_range = pd.date_range(start='2023-01-01', periods=100, freq='D')
data = {
    'Temperature': np.random.normal(20, 5, 100),
    'Humidity': np.random.normal(60, 10, 100)
}
df = pd.DataFrame(data, index=date_range)

# Introduce missing values
df.loc['2023-01-10':'2023-01-15', 'Temperature'] = np.nan
df.loc['2023-02-05':'2023-02-10', 'Humidity'] = np.nan

# Use a pipeline to scale the data before imputation
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),           # Scale the time series data
    ('imputer', KNNImputer(n_neighbors=3))  # Then impute with 3 neighbors
])

# Impute missing values
df_imputed = pipeline.fit_transform(df)
print("Time Series Data after KNN Imputation:\n", df_imputed)
Output:
Time Series Data after KNN Imputation:
[[ 6.35360812e-01 -1.51320098e+00]
[-7.62416368e-02 -4.65281950e-01]
[ 8.04553549e-01 -3.83183742e-01]
[ 1.78552388e+00 -8.67321922e-01]
[-1.83701818e-01 -1.92052909e-01]
[-1.83683419e-01 4.03515413e-01]
[ 1.84848654e+00 1.96490863e+00]
[ 9.38749728e-01 1.61771143e-01]
...
[-1.84187919e-01 -1.22623374e+00]]
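Because scaling happens before imputation, the pipeline's output is in standardized units. A short sketch for converting the imputed values back to the original units, assuming the pipeline above has been fitted:
Python
# Undo the standardization to recover the original temperature/humidity scale
restored = pipeline.named_steps['scaler'].inverse_transform(df_imputed)
df_restored = pd.DataFrame(restored, columns=df.columns, index=df.index)
print(df_restored.head())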
Practical Considerations and Limitations
While the KNN Imputer is powerful, it has limitations:
- Computational Cost: The algorithm can be computationally expensive for large datasets due to the distance calculations.
- Scalability: Performance may degrade with high-dimensional data or large numbers of neighbors.
- Data Quality: Imputation quality depends on the quality and quantity of the available data.
To mitigate these issues, consider reducing dimensionality before imputation or combining KNN imputation with cheaper imputation methods, as sketched below.
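As one illustration of such a combination, here is a hedged sketch that applies KNNImputer only to a pair of related columns and falls back to an inexpensive SimpleImputer for the rest; the column split and the toy data are illustrative choices, not a prescribed recipe.
Python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 4.0],
    'B': [2.0, 2.5, np.nan, 4.5],
    'C': [np.nan, 1.0, 1.0, 2.0]
})

# KNN imputation for the related pair, cheap median imputation elsewhere
mixed_imputer = ColumnTransformer(transformers=[
    ('knn', KNNImputer(n_neighbors=2), ['A', 'B']),
    ('simple', SimpleImputer(strategy='median'), ['C'])
])
print(mixed_imputer.fit_transform(df))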
Conclusion
KNN imputation is a robust technique for handling missing data, leveraging the power of the K-nearest neighbors algorithm to estimate missing values based on the patterns in the data. Despite its computational demands, it remains a popular choice due to its simplicity and effectiveness. By understanding its workings, advantages, and limitations, data scientists can effectively incorporate KNN imputation into their data preprocessing pipelines, enhancing the quality and reliability of their machine learning models.