EXP-12_IAIML

The document outlines an experiment focused on handling missing values in data, detailing various types of missing values and methods for identifying and addressing them using Python's Pandas library. It discusses techniques such as mean, median, mode imputation, forward and backward fill, and interpolation methods to manage missing data effectively. The experiment concludes that understanding the pattern of missingness is crucial for selecting appropriate handling techniques.

Uploaded by

samyak.18240

Experiment No.: 02
Date of Performance: 19/3/25
Date of Submission: 22/3/25

Program Execution/ formation/ correction/ ethical practices (06)
Timely Submission (01)
Viva Answer to Sample questions (03)
Experiment Total (10)
Sign with Date

Experiment No. 12

AIM: Program to handle missing values in data.

LABORATORY OUTCOME:
CO3: Apply the most suitable search strategy to design problem solving agents.
CO4: Identify the pattern in data using scientific programming language.

PROBLEM STATEMENT: Program to handle missing values in data.

RELATED THEORY:

Introduction

Missing values are a common issue in machine learning. They occur when a variable lacks some of
its data points, which leaves the information incomplete and can harm the accuracy and
dependability of your models.

What is a Missing Value?

Missing values are data points that are absent for a specific variable in a dataset. They can be
represented in various ways, such as blank cells, null values, or special symbols like “NA” or
“unknown.” These missing data points pose a significant challenge in data analysis and can lead to
inaccurate or biased results.

Types of Missing Values

There are three main types of missing values:

1. Missing Completely at Random (MCAR): MCAR is a specific type of missing data in
which the probability of a data point being missing is entirely random and
independent of any other variable in the dataset. In simpler terms, whether a value is
missing or not has nothing to do with the values of other variables or the
characteristics of the data point itself.
2. Missing at Random (MAR): MAR is a type of missing data where the probability of a
data point being missing depends on the values of other variables in the dataset, but not on
the missing variable itself. This means that the missingness mechanism is not entirely
random, but it can be predicted based on the available information.
3. Missing Not at Random (MNAR): MNAR is the most challenging type of missing
data to deal with. It occurs when the probability of a data point being missing is
related to the missing value itself. This means that the reason for the missing data is
informative and directly associated with the variable that is missing.
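
The three mechanisms can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions: the columns 'hours' and 'marks', the sample size, and the missingness probabilities are all made up for illustration and are not part of the experiment's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 1000

# Hypothetical data: 'hours' is always observed, 'marks' may go missing.
hours = rng.uniform(0, 10, size=n)
marks = 40 + 5 * hours + rng.normal(0, 5, size=n)
df_sim = pd.DataFrame({"hours": hours, "marks": marks})

# MCAR: every value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed 'hours' column.
mar_mask = rng.random(n) < np.where(hours < 5, 0.4, 0.05)

# MNAR: missingness depends on the 'marks' value itself.
mnar_mask = rng.random(n) < np.where(marks < 60, 0.4, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed = df_sim["marks"].mask(mask)  # masked positions become NaN
    print(f"{name}: {observed.isna().mean():.0%} missing, "
          f"observed mean = {observed.mean():.1f}, "
          f"true mean = {df_sim['marks'].mean():.1f}")
```

In this sketch the MCAR mean should stay close to the true mean, while the MAR and MNAR means drift upward because low marks are dropped more often. Under MAR the drift can be explained by the observed 'hours' column; under MNAR it cannot be recovered from the observed data alone.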
Methods for Identifying Missing Data

Locating and understanding patterns of missingness in the dataset is an important step in
addressing its impact on analysis. Pandas provides several useful functions for detecting,
removing, and replacing null values in a DataFrame.

Function              Description

.isnull()             Identifies missing values in a Series or DataFrame. Returns True for
                      missing values and False for non-missing values.

.notnull()            Checks for non-missing values in a pandas Series or DataFrame. It returns
                      a boolean Series or DataFrame, where True indicates non-missing values and
                      False indicates missing values.

.info()               Displays information about the DataFrame, including data types, memory
                      usage, and the number of non-null values in each column.

.isna()               Alias of isnull(): returns True for missing values and False for
                      non-missing values.

.dropna()             Drops rows or columns containing missing values based on custom criteria.

.fillna()             Fills missing values with specific values, means, medians, or other
                      calculated values.

.replace()            Replaces specific values with other values, facilitating data correction
                      and standardization.

.drop_duplicates()    Removes duplicate rows based on specified columns.

.unique()             Finds unique values in a Series.
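
The detection functions listed above can be tried on a small frame. This is a minimal sketch; the frame `demo` and its values are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame used only for this illustration.
demo = pd.DataFrame({"Marks": [85, np.nan, 78],
                     "City": ["Miami", "Houston", None]})

print(demo.isna())            # True where a value is missing
print(demo.notna())           # the exact inverse of isna()

missing_per_column = demo.isna().sum()   # number of missing values per column
print(missing_per_column)

print(demo.isna().mean())     # fraction of missing values per column
demo.info()                   # dtypes and non-null counts per column
```

Counting with `isna().sum()` is usually the first step, because it shows which columns need attention before any value is dropped or filled.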

Common Representations
1. Blank cells: Empty cells in spreadsheets or databases often signify missing data.
2. Specific values: Special values like “NULL”, “NA”, or “-999” are used to represent
missing data explicitly.
3. Codes or flags: Non-numeric codes or flags can be used to indicate different types of
missing values.
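
Sentinel values such as "unknown" or -999 are not treated as missing until they are converted to NaN. A minimal sketch of two possible ways to do this follows; the CSV text, the frame `loaded`, and the choice of sentinels are assumptions made for illustration.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical CSV text in which "unknown" and -999 stand for missing data.
raw = io.StringIO(
    "Name,Marks,City\n"
    "Alice,85,Miami\n"
    "Bob,-999,unknown\n"
    "Eva,,Houston\n"
)

# Option 1: declare the sentinels while reading the file.
parsed = pd.read_csv(raw, na_values=["unknown", "-999"])

# Option 2: convert the sentinels after the data is already loaded.
loaded = pd.DataFrame({"Marks": [85, -999, 78],
                       "City": ["Miami", "unknown", "Houston"]})
cleaned = loaded.replace({"Marks": -999, "City": "unknown"}, np.nan)

print(parsed.isna().sum())
print(cleaned.isna().sum())
```

Blank cells are already read as NaN by `read_csv`, so only the explicit codes need to be declared.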

Creating a Sample DataFrame

import pandas as pd
import numpy as np

# Creating a sample DataFrame with missing values
data = {
    'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan,
                '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
    'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan,
             'Houston', 'New York'],
    'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
    'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
    'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
    'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}

# Build the DataFrame from the dictionary of columns
df = pd.DataFrame(data)

print("Sample DataFrame:")
print(df)

Output:

Removing Rows with Missing Values


● Simple and efficient: Removes data points with missing values altogether.
● Reduces sample size: Can lead to biased results if missingness is not random.
● Not recommended when many rows contain missing values: Can discard valuable information.

In this example, we are removing rows with missing values from the original DataFrame (df) using
the dropna() method and then displaying the cleaned DataFrame (df_cleaned).

# Removing rows with missing values
df_cleaned = df.dropna()

# Displaying the DataFrame after removing missing values
print("\nDataFrame after removing rows with missing values:")
print(df_cleaned)

Output:
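
Besides dropping every incomplete row, dropna() accepts parameters that narrow the criteria. The following minimal sketch shows the `how`, `subset`, `thresh`, and `axis` parameters on a hypothetical frame `demo` that is not part of the experiment's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete row, two partly missing rows, one empty row.
demo = pd.DataFrame({
    "Name":  ["Alice", "Bob", "Eva", None],
    "Marks": [85, np.nan, 78, np.nan],
    "City":  ["Miami", None, None, None],
})

any_missing = demo.dropna()                   # drop rows with at least one missing value
all_missing = demo.dropna(how="all")          # drop only rows that are entirely missing
marks_known = demo.dropna(subset=["Marks"])   # consider the 'Marks' column only
two_or_more = demo.dropna(thresh=2)           # keep rows with at least 2 non-missing values
full_columns = demo.dropna(axis=1)            # drop columns instead of rows

print(len(any_missing), len(all_missing), len(marks_known), len(two_or_more))
print(full_columns.shape)
```

Restricting the check with `subset` or `thresh` keeps more rows than a plain dropna(), which matters when only a few columns are needed for the analysis.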

Imputation Methods
● Replacing missing values with estimated values.
● Preserves sample size: Doesn’t reduce data points.
● Can introduce bias: Estimated values might not be accurate.

Here are some common imputation methods:

1. Mean, Median, and Mode Imputation

● Replace missing values with the mean, median, or mode of the relevant variable.
● Simple and efficient: Easy to implement.
● Can be inaccurate: Doesn’t consider the relationships between variables.

This example applies three imputation techniques to the ‘Marks’ column of the DataFrame (df). It
calculates the mean, median, and mode of the existing values in that column, fills the missing
value with each of them in turn, and then prints the results for comparison.

1. Mean Imputation: Calculates the mean of the ‘Marks’ column in the DataFrame (df).
● df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the mean value.
● mean_imputation: The result is stored in the variable mean_imputation.
2. Median Imputation: Calculates the median of the ‘Marks’ column in the DataFrame (df).
● df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the median value.
● median_imputation: The result is stored in the variable median_imputation.
3. Mode Imputation: Calculates the mode of the ‘Marks’ column in the DataFrame (df). The
result is a Series.
● .iloc[0]: Accesses the first element of the Series, which represents the mode.
● df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the mode value.

# Mean, Median, and Mode Imputation
mean_imputation = df['Marks'].fillna(df['Marks'].mean())
median_imputation = df['Marks'].fillna(df['Marks'].median())
mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])

print("\nImputation using Mean:")
print(mean_imputation)

print("\nImputation using Median:")
print(median_imputation)

print("\nImputation using Mode:")
print(mode_imputation)

Output:
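
The fill values themselves can be checked separately. The sketch below recomputes the three statistics for the ‘Marks’ values of the sample DataFrame; the standalone Series `marks` is introduced here only so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# The 'Marks' values from the sample DataFrame, repeated so the block is self-contained.
marks = pd.Series([85, 92, 78, 89, np.nan, 95, 80, 88], name="Marks")

mean_value = marks.mean()       # NaN is skipped by default
median_value = marks.median()
mode_values = marks.mode()      # a Series, because several values can tie

print(round(mean_value, 2))     # 86.71
print(median_value)             # 88.0
print(len(mode_values))         # 7, every observed mark occurs exactly once
print(mode_values.iloc[0])      # 78.0, the smallest of the tied values
```

Because every observed mark is unique, mode() returns all seven values in sorted order and .iloc[0] simply picks the smallest one. Mode imputation is therefore usually a better fit for categorical columns than for a numeric column like this one.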

2. Forward and Backward Fill

● Replace missing values with the previous or next non-missing value in the same
variable.
● Simple and intuitive: Preserves temporal order.
● Can be inaccurate: Assumes missing values are close to the neighbouring observed values.
● Forward Fill (forward_fill)
○ df['Marks'].ffill(): This method fills missing values in the ‘Marks’
column of the DataFrame (df) using a forward fill strategy. It replaces missing
values with the last observed non-missing value in the column. The older spelling
df['Marks'].fillna(method='ffill') is deprecated since pandas 2.1.
○ forward_fill: The result is stored in the variable forward_fill.
● Backward Fill (backward_fill)
○ df['Marks'].bfill(): This method fills missing values in the ‘Marks’
column using a backward fill strategy. It replaces missing values with the next
observed non-missing value in the column. The older spelling
df['Marks'].fillna(method='bfill') is deprecated in the same way.
○ backward_fill: The result is stored in the variable backward_fill.

# Forward and Backward Fill
forward_fill = df['Marks'].ffill()
backward_fill = df['Marks'].bfill()

print("\nForward Fill:")
print(forward_fill)

print("\nBackward Fill:")
print(backward_fill)

Output:
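
Forward and backward fill behave differently at the edges of a column and across long gaps. The following minimal sketch uses the `ffill()` and `bfill()` methods on a hypothetical Series `s` (not part of the experiment's data) with a gap at the start, in the middle, and at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap at the start, in the middle, and at the end.
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

forward = s.ffill()           # the leading NaN stays: nothing precedes it
backward = s.bfill()          # the trailing NaN stays: nothing follows it
capped = s.ffill(limit=1)     # fill at most one consecutive NaN per gap
both = s.ffill().bfill()      # forward fill, then backward fill what is left

print(forward.tolist())       # [nan, 10.0, 10.0, 10.0, 40.0, 40.0]
print(backward.tolist())      # [10.0, 10.0, 40.0, 40.0, 40.0, nan]
print(capped.tolist())        # [nan, 10.0, 10.0, nan, 40.0, 40.0]
print(both.tolist())          # [10.0, 10.0, 10.0, 10.0, 40.0, 40.0]
```

The `limit` parameter is one way to avoid carrying a stale value across a long run of missing entries.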

3. Interpolation Techniques
● Estimate missing values based on surrounding data points using techniques like
linear interpolation or spline interpolation.
● More sophisticated than mean/median imputation: Uses the neighbouring observations
rather than a single column-wide statistic.
● May require additional libraries and computational resources: In pandas,
method='quadratic' relies on SciPy, whereas method='linear' does not.
● Linear Interpolation
○ df['Marks'].interpolate(method='linear'): This method performs linear
interpolation on the ‘Marks’ column of the DataFrame (df). Linear interpolation
estimates missing values by considering a straight line between two adjacent
non-missing values.
○ linear_interpolation: The result is stored in the variable linear_interpolation.
● Quadratic Interpolation
○ df['Marks'].interpolate(method='quadratic'): This method performs quadratic
interpolation on the ‘Marks’ column. Quadratic interpolation estimates missing
values with a second-order (quadratic) spline fitted through the non-missing
values.
○ quadratic_interpolation: The result is stored in the variable
quadratic_interpolation.

# Interpolation Techniques
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')

print("\nLinear Interpolation:")
print(linear_interpolation)

print("\nQuadratic Interpolation:")
print(quadratic_interpolation)

Output:
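
The linear estimate can be verified by hand. The sketch below repeats the ‘Marks’ values in a standalone Series `marks` (introduced only so the block runs on its own) and compares the straight-line formula with the pandas result.

```python
import numpy as np
import pandas as pd

# The 'Marks' values from the sample DataFrame, repeated so the block is self-contained.
marks = pd.Series([85, 92, 78, 89, np.nan, 95, 80, 88], name="Marks")

# The gap at position 4 lies between 89 (position 3) and 95 (position 5).
by_hand = 89 + (95 - 89) * (4 - 3) / (5 - 3)

filled = marks.interpolate(method="linear")

print(by_hand)           # 92.0
print(filled.iloc[4])    # 92.0
```

With method='linear', pandas treats the rows as equally spaced, so a single missing value becomes the midpoint of its two neighbours.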

RESULT/OUTPUT:
CONCLUSION:

In this experiment, we successfully implemented a program designed to handle missing values in datasets,
which is a critical step in data preprocessing for any data analysis or machine learning task. We removed
incomplete rows with dropna() and applied imputation techniques such as mean, median, and mode imputation,
forward and backward fill, and linear and quadratic interpolation. These techniques preserve the sample
size, but they can introduce bias when the estimated values are inaccurate. Additionally, we identified
that the pattern of missingness (missing completely at random, missing at random, or missing not at
random) plays a crucial role in determining the most appropriate handling technique.

QUESTIONS:

1. What is the purpose of exploring data?

a. To gain a better understanding of your data.
b. To gather your data into one repository.
c. To digitize your data.
d. To generate labels for your data.

2. What are the two main categories of techniques for exploring data? Choose two.

a. Histogram
b. Outliers
c. Visualization
d. Trends
e. Correlations
f. Summary statistics

3. Which method is used to fill missing values with the mean of a column in Pandas?

A dropna()

B fillna()

C mean()

D interpolate()

4. What will be the output of the following code?

import pandas as pd
import numpy as np

data = {'A': [1, 2, np.nan, 4], 'B': [np.nan, 2, 3, 4]}
df = pd.DataFrame(data)
result = df.isnull().sum()
print(result)

A) A    1
   B    1
   dtype: int64

B) A    2
   B    2
   dtype: int64

C) A    1
   B    2
   dtype: int64

D) A    2
   B    1
   dtype: int64

5. Which of the following methods is suitable for forward-filling missing data in a DataFrame?

A `fillna(method='ffill')`

B `fillna(method='bfill')`

C `interpolate()`

D `dropna()`
