EXP-12_IAIML
02
Date of Performance: 19/3/25
Date of Submission: 22/3/25
Experiment No. 12
LABORATORY OUTCOME:
CO3: Apply the most suitable search strategy to design problem solving agents.
CO4: Identify the pattern in data using scientific programming language.
RELATED THEORY:
Introduction
Missing values are a common issue in machine learning. They occur when a variable lacks one or
more data points, leaving incomplete information that can harm the accuracy and dependability of
a model.
What is a Missing Value?
Missing values are data points that are absent for a specific variable in a dataset. They can be
represented in various ways, such as blank cells, null values, or special symbols like “NA” or
“unknown.” These missing data points pose a significant challenge in data analysis and can lead to
inaccurate or biased results.
Functions Descriptions
Common Representations
1. Blank cells: Empty cells in spreadsheets or databases often signify missing data.
2. Specific values: Special values like “NULL”, “NA”, or “-999” are used to represent
missing data explicitly.
3. Codes or flags: Non-numeric codes or flags can be used to indicate different types of
missing values.
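The representations listed above usually have to be converted to a single missing-value marker before analysis. The following is a minimal sketch, assuming a hypothetical DataFrame in which "NA", "unknown", and -999 stand for missing data; the column names and sentinel values are illustrative and are not taken from the experiment's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: missing entries are encoded in different ways.
raw = pd.DataFrame({
    "Name": ["Asha", "NA", "Ravi", "unknown"],
    "Marks": [78, -999, 85, 91],
})

# Replace the assumed sentinel values with NaN, column by column.
# Series.mask() sets an entry to NaN wherever the condition is True.
clean = pd.DataFrame({
    "Name": raw["Name"].mask(raw["Name"].isin(["NA", "unknown"])),
    "Marks": raw["Marks"].mask(raw["Marks"] == -999),
})

# Count the missing entries per column.
missing_counts = clean.isna().sum()
print(clean)
print(missing_counts)
```

Once every placeholder is a NaN, methods such as isna(), dropna(), and fillna() treat all of them the same way.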
import pandas as pd
import numpy as np

# Sample data in which missing entries are marked with np.nan
data = {
    'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
    'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan,
                '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
    'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan,
             'Houston', 'New York'],
}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
Output:
This example removes every row that contains at least one missing value from the original
DataFrame (df) using the dropna() method and then displays the cleaned DataFrame (df_cleaned).
df_cleaned = df.dropna()
print(df_cleaned)
Output:
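By default dropna() discards a row if any column is missing, which can remove more data than intended. The sketch below, built on a small hypothetical DataFrame (df_demo, not the experiment's data), compares the default with the subset and thresh parameters of DataFrame.dropna().

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Marks' and 'City' each contain missing entries.
df_demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Marks": [80.0, np.nan, 65.0, np.nan],
    "City": ["Pune", "Mumbai", np.nan, np.nan],
})

# Default: drop every row that has at least one missing value.
drop_any = df_demo.dropna()

# Drop a row only when 'Marks' is missing; other columns may stay incomplete.
drop_subset = df_demo.dropna(subset=["Marks"])

# Keep rows that have at least two non-missing values.
drop_thresh = df_demo.dropna(thresh=2)

# Row counts: original, default, subset, thresh
print(len(df_demo), len(drop_any), len(drop_subset), len(drop_thresh))  # prints: 4 1 2 3
```

Restricting the check to the columns that matter keeps more rows available for analysis.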
Imputation Methods
● Replacing missing values with estimated values.
● Preserves sample size: Doesn’t reduce data points.
● Can introduce bias: Estimated values might not be accurate.
1. Mean, Median, and Mode Imputation
● Replace missing values with the mean, median, or mode of the relevant variable.
● Simple and efficient: Easy to implement.
● Can be inaccurate: Doesn’t consider the relationships between variables.
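One concrete way imputation can introduce bias is that filling with the mean leaves the column mean unchanged but reduces its spread. The sketch below illustrates this on a hypothetical Series of scores (the values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing entries.
marks = pd.Series([50.0, 60.0, np.nan, 80.0, np.nan, 90.0])

# Fill the gaps with the mean of the observed values (70.0 here).
filled = marks.fillna(marks.mean())

# The mean is preserved, but every filled value sits exactly at the mean,
# so the standard deviation of the column becomes smaller.
print("mean before/after:", marks.mean(), filled.mean())
print("std  before/after:", marks.std(), filled.std())
```

The more values are imputed this way, the more the variability of the column is understated.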
This example demonstrates imputation of missing values in the ‘Marks’ column of the DataFrame
(df), which is assumed to hold numeric scores. It calculates the mean, median, and mode of the
existing values in that column, fills the missing entries with each in turn, and then prints the
results for observation.
1. Mean Imputation: Calculates the mean of the ‘Marks’ column in the DataFrame (df).
● df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the mean value.
● mean_imputation: The result is stored in the variable mean_imputation.
2. Median Imputation: Calculates the median of the ‘Marks’ column in the DataFrame (df).
● df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the median value.
● median_imputation: The result is stored in the variable median_imputation.
3. Mode Imputation: Calculates the mode of the ‘Marks’ column in the DataFrame (df). The
result is a Series.
● .iloc[0]: Accesses the first element of the Series, which represents the mode.
● df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the mode value.
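# Fill missing 'Marks' with the column mean, median, and mode respectively
mean_imputation = df['Marks'].fillna(df['Marks'].mean())
median_imputation = df['Marks'].fillna(df['Marks'].median())
mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])
print("\nImputation using Mean:")
print(mean_imputation)
print("\nImputation using Median:")
print(median_imputation)
print("\nImputation using Mode:")
print(mode_imputation)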
Output:
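The three statistics can differ noticeably when the column contains an outlier. The following self-contained sketch uses a hypothetical ‘Marks’ Series (invented values, including one very low score) to show the fill value each method would choose.

```python
import numpy as np
import pandas as pd

# Hypothetical marks: two missing entries and one outlier (20.0).
marks = pd.Series([90.0, 85.0, np.nan, 85.0, 20.0, np.nan], name="Marks")

mean_value = marks.mean()          # 70.0, pulled down by the outlier
median_value = marks.median()      # 85.0, robust to the outlier
mode_value = marks.mode().iloc[0]  # 85.0, the most frequent value

# Collect the three imputed versions side by side for comparison.
imputed = pd.DataFrame({
    "mean": marks.fillna(mean_value),
    "median": marks.fillna(median_value),
    "mode": marks.fillna(mode_value),
})
print(imputed)
```

Here the median and mode give the same fill value, while the mean is dragged toward the outlier, which is why the median is often preferred for skewed data.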
2. Forward and Backward Fill
● Replace missing values with the previous or next non-missing value in the same
variable.
● Simple and intuitive: Preserves temporal order.
● Can be inaccurate: Assumes missing values are close to observed values.
● Forward Fill (forward_fill)
○ df['Marks'].ffill(): This method fills missing values in the ‘Marks’
column of the DataFrame (df) using a forward fill strategy. It replaces missing
values with the last observed non-missing value in the column. It is equivalent to the
older df['Marks'].fillna(method='ffill'), which has been deprecated since pandas 2.1.
○ forward_fill: The result is stored in the variable forward_fill.
● Backward Fill (backward_fill)
○ df['Marks'].bfill(): This method fills missing values in the ‘Marks’
column using a backward fill strategy. It replaces missing values with the next
observed non-missing value in the column. It is equivalent to the older
df['Marks'].fillna(method='bfill'), which is deprecated in the same way.
○ backward_fill: The result is stored in the variable backward_fill.
forward_fill = df['Marks'].ffill()
backward_fill = df['Marks'].bfill()
print("\nForward Fill:")
print(forward_fill)
print("\nBackward Fill:")
print(backward_fill)
Output:
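Forward and backward fill each leave a gap when there is no value to copy from, at the start or at the end of the column respectively. The sketch below shows this on a hypothetical Series, using the dedicated Series.ffill() and Series.bfill() methods; the data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with gaps at the start, in the middle, and at the end.
s = pd.Series([np.nan, 70.0, np.nan, np.nan, 90.0, np.nan], name="Marks")

# Forward fill copies the last observed value downward.
# The leading NaN stays because nothing precedes it.
forward = s.ffill()

# Backward fill copies the next observed value upward.
# The trailing NaN stays because nothing follows it.
backward = s.bfill()

# Chaining both fills every gap.
both = s.ffill().bfill()

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward, "both": both}))
```

Chaining the two fills is a common way to close the remaining edge gaps, at the cost of copying values across longer distances.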
3. Interpolation Techniques
● Linear Interpolation
○ df['Marks'].interpolate(method='linear'): This method performs linear
interpolation on the ‘Marks’ column of the DataFrame (df). Linear interpolation
estimates missing values by considering a straight line between two adjacent non-
missing values.
○ linear_interpolation: The result is stored in the variable linear_interpolation.
● Quadratic Interpolation
○ df['Marks'].interpolate(method='quadratic'): This method performs Quadratic
Interpolation on the ‘Marks’ column. Quadratic interpolation estimates missing
values by considering a quadratic curve that passes through three adjacent non-
missing values.
○ quadratic_interpolation: The result is stored in the variable
quadratic_interpolation.
● Estimate missing values based on surrounding data points using techniques like
linear interpolation or spline interpolation.
● More sophisticated than mean/median imputation: Uses the neighboring observations, so
it can follow local trends in the data.
● Methods other than linear, such as quadratic or spline interpolation, require SciPy
and more computation.
# Interpolation techniques
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')
print("\nLinear Interpolation:")
print(linear_interpolation)
print("\nQuadratic Interpolation:")
print(quadratic_interpolation)
Output:
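To make the arithmetic behind linear interpolation concrete, the sketch below fills a hypothetical Series and checks one value by hand. With method='linear', pandas treats the values as equally spaced, so each gap is filled along the straight line between its two neighbors. The data is invented for illustration, and only the linear method is used so that SciPy is not needed.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with a two-value gap and a one-value gap.
s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan, 60.0], name="Marks")

# Linear interpolation on equally spaced positions.
linear = s.interpolate(method="linear")

# Hand calculation for position 1, which lies between position 0 (10.0)
# and position 3 (40.0): left + (right - left) * (k - i) / (j - i)
manual = 10.0 + (40.0 - 10.0) * (1 - 0) / (3 - 0)

print(linear.tolist())
print("manual value at position 1:", manual)
```

The gap between 10 and 40 is filled with 20 and 30, and the gap between 40 and 60 with 50, matching the hand calculation.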
RESULT/OUTPUT:
CONCLUSION:
In this experiment, we implemented a program to handle missing values in a dataset, which is a critical
step in data preprocessing for any data analysis or machine learning task. We removed incomplete rows
with dropna() and applied several imputation techniques: mean, median, and mode imputation, forward
and backward fill, and linear and quadratic interpolation. These techniques preserve the sample size,
although the estimated values can introduce bias. The pattern of missingness (missing completely at
random, missing at random, or missing not at random) plays a crucial role in determining the most
appropriate handling technique.
QUESTIONS:
2. What are the two main categories of techniques for exploring data? Choose two.
a. Histogram
b. Outliers
c. Visualization
d. Trends
e. Correlations
f. Summary statistics
3. Which method is used to fill missing values with the mean of a column in Pandas?
A dropna()
B fillna()
C mean()
D interpolate()
4. What is the output of the following code?
import pandas as pd
import numpy as np
df = pd.DataFrame(data)
result = df.isnull().sum()
print(result)
A) A 1
B 1
dtype: int64
B) A 2
B 2
dtype: int64
C) A 1
B 2
dtype: int64
D) A 2
B 1
dtype: int64
5. Which of the following methods is suitable for forward-filling missing data in a DataFrame?
A `fillna(method='ffill')`
B `fillna(method='bfill')`
C `interpolate()`
D `dropna()`