ML | Handling Missing Values
Last Updated: 21 Jul, 2025
Missing values are a common challenge in machine learning and data analysis. They occur when certain data points are absent for specific variables in a dataset. These gaps can take the form of blank cells, null values or special symbols like "NA", "NaN" or "unknown". If not addressed properly, missing values can harm the accuracy and reliability of our models: they reduce the sample size, can introduce bias and make it difficult to apply analysis techniques that require complete data. Handling missing values efficiently is therefore important for producing accurate and unbiased results. In this article, we'll look at methods and strategies for dealing with missing data effectively.
Importance of Handling Missing Values
Handling missing values is important for ensuring the accuracy and reliability of data analysis and machine learning models. Key reasons include:
- Improved Model Accuracy: Addressing missing values helps avoid incorrect predictions and boosts model performance.
- Increased Statistical Power: Imputation or removal of missing data allows the use of more analysis techniques, maintaining the sample size.
- Bias Prevention: Proper handling ensures that missing data doesn’t introduce systematic bias, leading to more reliable results.
- Better Decision-Making: A clean dataset leads to more informed, trustworthy decisions based on accurate insights.
Challenges Posed by Missing Values
Missing values can introduce several challenges in data analysis including:
- Reduced sample size: Removing rows or data points with missing values shrinks the dataset, which may decrease the reliability and accuracy of the analysis.
- Bias in Results: When missing data is not handled carefully, it can introduce bias. This is especially problematic when the missingness is not random, leading to misleading conclusions.
- Difficulty in Analysis: Many statistical techniques and machine learning algorithms require complete data for all variables. Missing values can render certain analyses or models inapplicable, limiting the methods we can use.
Reasons Behind Missing Values in the Dataset
Data can be missing from a dataset for several reasons and understanding the cause is important for selecting the most effective way to handle it. Common reasons for missing data include:
- Technical issues: Failed data collection or errors during data transmission.
- Human errors: Mistakes like incorrect data entry or oversights during data processing.
- Privacy concerns: Missing sensitive or personal information due to confidentiality policies.
- Data processing issues: Errors that occur during data preparation.
By identifying the reason behind the missing data, we can better assess its impact (whether it is introducing bias or limiting the analysis) and select the proper handling method, such as imputation or removal.
Types of Missing Values
Missing values in a dataset can be categorized into three main types, each with different implications for how it should be handled; a small simulation illustrating all three follows the list:
- Missing Completely at Random (MCAR): In this case, the missing data is completely random and unrelated to any other variable in the dataset. The absence of data points occurs without any systematic pattern such as a random technical failure or data omission.
- Missing at Random (MAR): The missingness is related to other observed variables but not to the value of the missing data itself. For example, if younger individuals are more likely to skip a particular survey question, the missingness can be explained by age but not by the content of the missing data.
- Missing Not at Random (MNAR): Here, the probability of missing data is related to the value of the missing data itself. For example, people with higher incomes may be less likely to report their income, leading to a direct connection between the missingness and the value of the missing data.
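To make the three mechanisms concrete, here is a minimal, hypothetical simulation. The column names, probabilities and thresholds are illustrative only, not taken from the article's dataset:
Python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
sim = pd.DataFrame({
    'age': rng.integers(18, 80, size=n),
    'income': rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value is dropped with the same 10% probability,
# independent of both age and income.
mcar = sim.copy()
mcar.loc[rng.random(n) < 0.10, 'income'] = np.nan

# MAR: younger respondents are more likely to skip the income question;
# missingness depends only on the observed 'age' column.
mar = sim.copy()
mar.loc[(sim['age'] < 30) & (rng.random(n) < 0.30), 'income'] = np.nan

# MNAR: high earners are more likely to withhold income; missingness
# depends on the (unobserved) value of income itself.
mnar = sim.copy()
mnar.loc[(sim['income'] > 70_000) & (rng.random(n) < 0.30), 'income'] = np.nan

print(mcar['income'].isna().mean(),
      mar['income'].isna().mean(),
      mnar['income'].isna().mean())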
Methods for Identifying Missing Data
Detecting and managing missing data is an important part of data analysis. Pandas provides several useful functions for detecting, removing and replacing null values in a DataFrame; a short runnable example follows the table.
| Function | Description |
|---|---|
| .isnull() | Identifies missing values in a Series or DataFrame, returning True where data is missing. |
| .notnull() | Opposite of .isnull(); returns True for non-missing values and False for missing values. |
| .info() | Displays a DataFrame summary including data types, memory usage and the count of non-null values per column. |
| .isna() | An alias of .isnull(); returns True for missing data and False for valid data. |
| dropna() | Removes rows or columns with missing values, with customizable options such as axis, thresh and subset. |
| fillna() | Fills missing values with a specified value (like the mean or median) or a strategy (forward/backward fill). |
| replace() | Replaces specified values in the DataFrame; useful for correcting or standardizing data. |
| drop_duplicates() | Removes duplicate rows, optionally based on specified columns. |
| unique() | Returns the unique values in a Series. |
For more detail, refer to Working with Missing Data in Pandas.
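As a quick sketch of the detection functions in action, here is a tiny hypothetical frame (the column names and values are illustrative only):
Python
import pandas as pd
import numpy as np

tiny = pd.DataFrame({'a': [1, np.nan, 3], 'b': ['x', None, 'z']})

print(tiny.isnull())         # boolean mask: True where a value is missing
print(tiny.isnull().sum())   # missing-value count per column
print(tiny.notnull().sum())  # non-missing count per column
tiny.info()                  # dtypes, memory usage and non-null counts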
Representation of Missing Values in Datasets
Missing values can be represented by blank cells, specific values like "NA" or special codes. It's important to use a consistent and documented representation to ensure transparency and ease in data handling.
Common representations include:
- Blank Cells: Empty cells in data tables or spreadsheets are used to signify missing values. This is common in many data formats like CSVs.
- Specific Values: Commonly used placeholders for missing data include "NA", "NaN", "NULL" or even arbitrary sentinel values like -999. It's important to choose a standardized value and document its meaning to prevent confusion.
- Codes or Flags: In some cases, non-numeric codes or flags (e.g., "MISSING", "UNKNOWN") are used to mark missing data. These can be useful for distinguishing between different kinds of missingness or categorizing missing data by its origin. A short sketch of normalizing such placeholders follows this list.
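As a minimal sketch (the file name, column names and sentinel values here are hypothetical), placeholders can be normalized to NaN either at load time or afterwards with replace():
Python
import pandas as pd
import numpy as np

# At load time: treat the listed strings as NaN while parsing a CSV.
# df = pd.read_csv('survey.csv', na_values=['NA', 'NULL', 'MISSING', -999])

raw = pd.DataFrame({'income': [52000, -999, 61000],
                    'city': ['Miami', 'UNKNOWN', 'Houston']})

# Afterwards: map every sentinel to a proper NaN in one pass.
clean = raw.replace({-999: np.nan, 'UNKNOWN': np.nan, 'MISSING': np.nan})
print(clean.isna().sum())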
Strategies for Handling Missing Values in Data Analysis
Depending on the nature of the data and the missingness, several strategies can help maintain the integrity of our analysis. Let's see some of the most effective methods to handle missing values.
Before moving to various strategies, let's first create a Sample Dataframe so that we can use it for different methods.
Creating a Sample Dataframe
Here we will be using the Pandas and NumPy libraries.
Python
import pandas as pd
import numpy as np
data = {
'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan, '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan, 'Houston', 'New York'],
'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
Output:
[The eight-row sample DataFrame prints with NaN for the missing 'School ID' (row 3), 'Address' and 'Marks' (row 4) and 'City' (row 5) entries]
1. Removing Rows with Missing Values
Removing rows with missing values is a simple and straightforward method to handle missing data, used when we want to keep our analysis clean and minimize complexity.
Advantages:
- Simple and efficient: It’s easy to implement and quickly removes data points with missing values.
- Cleans data: It removes potentially problematic data points, ensuring that only complete rows remain in the dataset.
Disadvantages:
- Reduces sample size: When rows are removed, the overall dataset shrinks which can affect the power and accuracy of our analysis.
- Potential bias: If missing data is not random (e.g., if certain groups are more likely to have missing values), removing rows could introduce bias. A short sketch of dropna()'s finer-grained options follows this list.
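Beyond dropping every incomplete row, dropna() accepts arguments for finer control. A minimal sketch, using the sample df created above:
Python
# Drop columns (instead of rows) that contain any missing value.
cols_dropped = df.dropna(axis=1)

# Keep rows that have at least 7 non-null values out of 8 columns.
thresh_kept = df.dropna(thresh=7)

# Only consider the 'Marks' column when deciding which rows to drop.
subset_dropped = df.dropna(subset=['Marks'])

print(cols_dropped.shape, thresh_kept.shape, subset_dropped.shape)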
In this example, we are removing rows with missing values from the original DataFrame (df) using the dropna() method and then displaying the cleaned DataFrame (df_cleaned).
Python
df_cleaned = df.dropna()
print("\nDataFrame after removing rows with missing values:")
print(df_cleaned)
Output:
[Rows 3, 4 and 5 (David, Eva and Frank) each contain at least one NaN and are removed, leaving five complete rows]
2. Imputation Methods
Imputation involves replacing missing values with estimated values. This approach is beneficial when we want to preserve the dataset’s sample size and avoid losing data points. However, it's important to note that the accuracy of the imputed values may not always be reliable.
Let's see some common imputation methods:
2.1 Mean, Median and Mode Imputation
This method involves replacing missing values with the mean, median or mode of the relevant variable. It's a simple approach but it doesn't account for the relationships between variables.
In this example, we demonstrate imputation techniques for handling missing values in the 'Marks' column of the DataFrame (df). We calculate the mean, median and mode of the existing values, fill the missing entry with each in turn and print the results for comparison.
- df['Marks'].fillna(df['Marks'].mean()): Fills missing values in the 'Marks' column with the mean value.
- df['Marks'].fillna(df['Marks'].median()): Fills missing values in the 'Marks' column with the median value.
- df['Marks'].fillna(df['Marks'].mode().iloc[0]): Fills missing values in the 'Marks' column with the mode value.
- .iloc[0]: Accesses the first element of the Series returned by mode(), since mode() can return several values when there are ties.
Python
mean_imputation = df['Marks'].fillna(df['Marks'].mean())
median_imputation = df['Marks'].fillna(df['Marks'].median())
mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])
print("\nImputation using Mean:")
print(mean_imputation)
print("\nImputation using Median:")
print(median_imputation)
print("\nImputation using Mode:")
print(mode_imputation)
Output:
[The NaN at index 4 is filled with about 86.71 by mean imputation, 88.0 by median imputation and 78.0 by mode imputation; all marks are unique, so mode() returns every value sorted ascending and .iloc[0] picks the smallest]
Advantages:
- Simple and efficient: Easy to implement and quick.
- Works well with numerical data: It is useful for numerical variables with a normal distribution.
Disadvantages:
- Inaccuracy: It assumes the missing value is similar to the central tendency (mean/median/mode) which may not always be the case.
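The same strategies are also available in scikit-learn, which is convenient when imputation has to live inside a model pipeline. A minimal sketch, assuming scikit-learn is installed (this is an alternative illustration, not part of the original example):
Python
from sklearn.impute import SimpleImputer

# strategy can be 'mean', 'median', 'most_frequent' or 'constant'
imputer = SimpleImputer(strategy='median')
marks_imputed = imputer.fit_transform(df[['Marks']])
print(marks_imputed.ravel())  # the NaN at index 4 is replaced by the median, 88.0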
2.2 Forward and Backward Fill
Forward and backward fill techniques are used to replace missing values by filling them with the nearest non-missing values from the same column. This is useful when there’s an inherent order or sequence in the data.
Recent versions of pandas expose these strategies directly as the ffill() and bfill() methods (the older fillna(method='ffill'/'bfill') form is deprecated).
- df['Marks'].ffill(): Fills missing values in the 'Marks' column of the DataFrame (df) using a forward-fill strategy, replacing each missing value with the last observed non-missing value in the column.
- df['Marks'].bfill(): Fills missing values in the 'Marks' column using a backward-fill strategy, replacing each missing value with the next observed non-missing value in the column.
Python
forward_fill = df['Marks'].ffill()
backward_fill = df['Marks'].bfill()
print("\nForward Fill:")
print(forward_fill)
print("\nBackward Fill:")
print(backward_fill)
Output:
[Forward fill replaces the NaN at index 4 with 89.0, the previous observed value; backward fill replaces it with 95.0, the next observed value]
Advantages:
- Simple and Intuitive: Preserves the temporal or sequential order in data.
- Preserves Patterns: Fills missing values logically, especially in time-series or ordered data.
Disadvantages:
- Assumption of Closeness: Assumes that the missing values are similar to the observed values nearby which may not always be true.
- Potential Inaccuracy: May not work well if there are large gaps between non-missing values.
Note:
- Forward fill uses the last valid observation to fill missing values.
- Backward fill uses the next valid observation to fill missing values.
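Both fills also accept a limit argument, which caps how far a value is propagated and helps avoid carrying stale values across long gaps. A small sketch on a hypothetical series:
Python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Propagate the last valid value at most one step forward;
# the remaining gaps stay NaN.
print(s.ffill(limit=1))  # [1.0, 1.0, NaN, NaN, 5.0]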
3. Interpolation Techniques
Interpolation is a technique for estimating missing values from the surrounding data points. Unlike simpler imputation methods (e.g., mean, median, mode), interpolation uses the relationship between neighboring values to make more informed estimates.
The interpolate() method in pandas supports several strategies; here we use linear and quadratic interpolation.
- df['Marks'].interpolate(method='linear'): Performs linear interpolation on the 'Marks' column of the DataFrame (df).
- df['Marks'].interpolate(method='quadratic'): Performs quadratic interpolation on the 'Marks' column (methods other than 'linear' rely on SciPy).
Python
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')
print("\nLinear Interpolation:")
print(linear_interpolation)
print("\nQuadratic Interpolation:")
print(quadratic_interpolation)
Output:
[Linear interpolation fills the NaN at index 4 with 92.0, the midpoint of its neighbors 89 and 95; quadratic interpolation fits a curve through neighboring points and produces a slightly different estimate]
Advantages:
- Sophisticated Approach: Interpolation is often more accurate than simple imputation methods like mean or median because it considers the underlying data structure.
- Preserves Data Relationships: Captures patterns or trends that exist between data points, which helps maintain the integrity of the dataset.
Disadvantages:
- Complexity: Requires more computation and, for non-linear methods, an additional library (SciPy).
- Assumptions on Data: Assumes that data points follow a specific pattern (e.g., linear or quadratic), which may not always be true.
Note:
- Linear interpolation assumes a straight line between two adjacent non-missing values.
- Quadratic interpolation assumes a quadratic curve that passes through three adjacent non-missing values.
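For time series with unevenly spaced timestamps, interpolate(method='time') weights the estimate by the actual time gaps. A minimal sketch on a hypothetical date-indexed series:
Python
import pandas as pd
import numpy as np

ts = pd.Series(
    [10.0, np.nan, 16.0],
    index=pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-04']),
)

# 'time' interpolation: Jan 2 is one third of the way from Jan 1 to
# Jan 4, so the gap is filled with 10 + (16 - 10) / 3 = 12.0.
print(ts.interpolate(method='time'))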
Impact of Handling Missing Values
Handling missing values effectively is important to ensure the accuracy and reliability of our findings.
Let's see some key impacts of handling missing values:
- Improved data quality: A cleaner dataset with fewer missing values is more reliable for analysis and model training.
- Enhanced model performance: Properly handling missing values helps models perform better by training on complete data, leading to more accurate predictions.
- Preservation of Data Integrity: Imputing or removing missing values ensures consistency and accuracy in the dataset, maintaining its integrity for further analysis.
- Reduced bias: Addressing missing values prevents bias in analysis, ensuring a more accurate representation of the underlying patterns in the data.
Effectively handling missing values is important for maintaining data integrity, improving model performance and ensuring reliable analysis. By carefully choosing appropriate strategies for imputation or removal, we increase the quality of our data, minimize bias and maximize the accuracy of our findings.