Handling Missing Data with KNN Imputer
Last Updated: 13 Aug, 2024
Missing data is a common issue in data analysis and machine learning, often leading to inaccurate models and biased results. One effective method for addressing this issue is the K-Nearest Neighbors (KNN) imputation technique. This article will delve into the technical aspects of KNN imputation, its implementation, advantages, and limitations.
Understanding KNN Imputation for Handling Missing Data
KNN imputation is a technique used to fill missing values in a dataset by leveraging the K-Nearest Neighbors algorithm. This method involves finding the k-nearest neighbors to a data point with a missing value and imputing the missing value using the mean or median of the neighboring data points. This approach preserves the relationships between features, which can lead to better model performance compared to simpler imputation methods like mean or median imputation.
How Does the KNN Imputer Work?
- Identifying Missing Values: The first step is to identify the missing values in the dataset, typically marked as NaN (Not a Number).
- Finding Nearest Neighbors: For each data point with a missing value, the KNN imputer finds the k-nearest neighbors using a distance metric that tolerates missing entries; scikit-learn's KNNImputer uses a NaN-aware Euclidean distance (nan_euclidean) by default, which skips missing coordinates when computing distances.
- Imputing Missing Values: The missing value is then imputed using the mean (uniform or distance-weighted) of the values that the k-nearest neighbors have for that feature, as the sketch below demonstrates.
For more in-depth coverage, refer to: How KNN Imputer Works in Machine Learning
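To make the mechanics concrete, here is a minimal sketch of the neighbor-finding and averaging steps done by hand. It uses scikit-learn's NaN-aware distance helper nan_euclidean_distances; the toy array X and the choice of k = 2 are purely illustrative.
Python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy data: row 1 is missing its second feature
X = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 4.0],
              [3.0, 4.0, 5.0],
              [8.0, 9.0, 10.0]])

# NaN-aware Euclidean distances from the incomplete row to all rows
dists = nan_euclidean_distances(X[1:2], X)[0]
dists[1] = np.inf  # exclude the row itself

# Indices of the k = 2 nearest neighbors
k = 2
neighbors = np.argsort(dists)[:k]

# Impute with the mean of the neighbors' values in the missing column
imputed_value = X[neighbors, 1].mean()
print("Imputed value:", imputed_value)  # mean of rows 0 and 2 -> 3.0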
Implementing KNN Imputer in Python for Missing Data
Choosing the Right Parameters for KNN Imputer
The performance of the KNN Imputer depends on the choice of parameters:
- n_neighbors: The number of neighbors to consider for imputation. A smaller value may be more sensitive to noise, while a larger value may oversmooth the data.
- weights: Determines how to weight the contributions of the neighbors. Options include:
  - uniform: All neighbors have equal weight.
  - distance: Weights neighbors by the inverse of their distance, giving closer neighbors more influence.
- metric: The distance metric used to find neighbors. KNNImputer supports 'nan_euclidean' (the default, a Euclidean distance that skips missing coordinates) or a user-supplied callable; unlike the KNN classifiers and regressors, it does not expose a Minkowski power parameter p.
Example:
imputer = KNNImputer(n_neighbors=3, weights='distance')
The KNNImputer class from the scikit-learn library provides a straightforward way to implement KNN imputation.
Example 1: Basic Implementation of KNN Imputer for Handling Missing Data
Python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
# Sample dataset with missing values
data = {
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [np.nan, 2.0, 3.0, 4.0],
    'Feature3': [1.0, 2.0, 3.0, np.nan]
}
df = pd.DataFrame(data)

# Initialize KNNImputer
imputer = KNNImputer(n_neighbors=2)

# Impute missing values
df_imputed = imputer.fit_transform(df)
print("Data after KNN Imputation:\n", df_imputed)
Output:
Data after KNN Imputation:
[[1. 2.5 1. ]
[2. 2. 2. ]
[3. 3. 3. ]
[4. 4. 2.5]]
Here, for example, the missing Feature2 value in the first row is imputed as 2.5, the mean of the Feature2 values (2.0 and 3.0) of that row's two nearest neighbors.
Example 2: Handling Missing Data with Mixed Feature Types
Python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

data = {
    'Numerical_1': [5.1, 4.9, np.nan, 4.7, 5.0],
    'Numerical_2': [3.5, np.nan, 3.0, np.nan, 3.2],
    'Categorical': ['A', 'B', 'A', 'C', np.nan]
}
df = pd.DataFrame(data)

numerical_features = ['Numerical_1', 'Numerical_2']
categorical_features = ['Categorical']

# Pipeline for numerical features
numerical_pipeline = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=2)),  # Apply KNN Imputer first
    ('scaler', StandardScaler())             # Then standardize
])

# Pipeline for categorical features
categorical_pipeline = Pipeline(steps=[
    ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value',
                               unknown_value=-1)),  # Encode categories first
    ('imputer', KNNImputer(n_neighbors=2))          # Impute after encoding
])

# Combine both pipelines into a ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

# Apply the preprocessor to the data
df_imputed = preprocessor.fit_transform(df)

# Convert the imputed array back to a DataFrame
df_imputed = pd.DataFrame(df_imputed,
                          columns=numerical_features + categorical_features)
print("Data after KNN Imputation with Mixed Feature Types:\n", df_imputed)
Output:
Data after KNN Imputation with Mixed Feature Types:
Numerical_1 Numerical_2 Categorical
0 1.060660 1.300887 0.00
1 -0.353553 0.413919 1.00
2 0.707107 -1.655675 0.00
3 -1.767767 0.413919 2.00
4 0.353553 -0.473050 0.75
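Note that the numerical columns appear in standardized units because StandardScaler runs after imputation, and the imputed Categorical value (0.75) is a fractional ordinal code: since the categorical pipeline sees only one column, KNNImputer falls back to the column mean of the encoded values. Below is a minimal sketch of mapping the codes back to category labels, assuming the preprocessor above has already been fitted; the rounding step is an illustrative heuristic, not part of the scikit-learn API.
Python
import numpy as np

# Retrieve the fitted OrdinalEncoder from inside the ColumnTransformer
encoder = preprocessor.named_transformers_['cat'].named_steps['encoder']

# Round the fractional codes to the nearest category code (simple heuristic)
cat_codes = np.round(df_imputed[['Categorical']].to_numpy())

# Decode the codes back to the original labels
print(encoder.inverse_transform(cat_codes))  # e.g. 0.75 -> 1 -> 'B'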
Example 3: Handling Missing Data for Time Series Model
Python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

np.random.seed(42)
date_range = pd.date_range(start='2023-01-01', periods=100, freq='D')
data = {
    'Temperature': np.random.normal(20, 5, 100),
    'Humidity': np.random.normal(60, 10, 100)
}
df = pd.DataFrame(data, index=date_range)

# Introduce missing values
df.loc['2023-01-10':'2023-01-15', 'Temperature'] = np.nan
df.loc['2023-02-05':'2023-02-10', 'Humidity'] = np.nan

# Use a pipeline to scale the data before imputation
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),           # Scale the time series data
    ('imputer', KNNImputer(n_neighbors=3))  # Then impute with 3 neighbors
])

# Impute missing values
df_imputed = pipeline.fit_transform(df)
print("Time Series Data after KNN Imputation:\n", df_imputed)
Output:
Time Series Data after KNN Imputation:
[[ 6.35360812e-01 -1.51320098e+00]
[-7.62416368e-02 -4.65281950e-01]
[ 8.04553549e-01 -3.83183742e-01]
[ 1.78552388e+00 -8.67321922e-01]
[-1.83701818e-01 -1.92052909e-01]
[-1.83683419e-01 4.03515413e-01]
[ 1.84848654e+00 1.96490863e+00]
[ 9.38749728e-01 1.61771143e-01]
...
[-1.84187919e-01 -1.22623374e+00]]
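Because scaling happens before imputation, the pipeline's output is in standardized units. A short sketch for converting the imputed values back to the original units, assuming the pipeline above has been fitted:
Python
# Undo the standardization to recover the original temperature/humidity scale
restored = pipeline.named_steps['scaler'].inverse_transform(df_imputed)
df_restored = pd.DataFrame(restored, columns=df.columns, index=df.index)
print(df_restored.head())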
Practical Considerations and Limitations
While the KNN Imputer is powerful, it has limitations:
- Computational Cost: The algorithm can be computationally expensive for large datasets due to the distance calculations.
- Scalability: Performance may degrade with high-dimensional data or large numbers of neighbors.
- Data Quality: Imputation quality depends on the quality and quantity of the available data.
To mitigate these issues, consider reducing dimensionality before imputation or combining KNN imputation with cheaper imputation methods, as sketched below.
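As one illustration of such a combination, here is a hedged sketch that applies KNNImputer only to a pair of related columns and falls back to an inexpensive SimpleImputer for the rest; the column split and the toy data are illustrative choices, not a prescribed recipe.
Python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 4.0],
    'B': [2.0, 2.5, np.nan, 4.5],
    'C': [np.nan, 1.0, 1.0, 2.0]
})

# KNN imputation for the related pair, cheap median imputation elsewhere
mixed_imputer = ColumnTransformer(transformers=[
    ('knn', KNNImputer(n_neighbors=2), ['A', 'B']),
    ('simple', SimpleImputer(strategy='median'), ['C'])
])
print(mixed_imputer.fit_transform(df))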
Conclusion
KNN imputation is a robust technique for handling missing data, leveraging the power of the K-nearest neighbors algorithm to estimate missing values based on the patterns in the data. Despite its computational demands, it remains a popular choice due to its simplicity and effectiveness. By understanding its workings, advantages, and limitations, data scientists can effectively incorporate KNN imputation into their data preprocessing pipelines, enhancing the quality and reliability of their machine learning models.