Register no: 2022510020
Ex no: 1
Date: 14.02.2024
HANDLING MISSING VALUES, OUTLIERS AND IRREGULAR CARDINALITIES
AIM
To handle the missing values, outliers and irregular cardinalities present in the dataset.
DESCRIPTION ABOUT THE DATASET
The UNICEF child malnutrition dataset comprises comprehensive information on the nutritional status of children worldwide. It includes demographic details such as age, gender, and possibly socioeconomic factors, alongside crucial nutritional indicators like weight-for-age, height-for-age, and weight-for-height. Geographical breakdowns enable analysis of malnutrition rates across regions, from countries to local communities, while time-series data facilitates the identification of trends over time.
ANALYSIS
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Try UTF-8 first and fall back to Latin-1 if the file is not UTF-8 encoded
try:
    df = pd.read_csv('C:\\Users\\my pc\\Downloads\\JMECountryEstimatesApril2021.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('C:\\Users\\my pc\\Downloads\\JMECountryEstimatesApril2021.csv', encoding='latin1')

print("The total number of datapoints:", len(df))
cols = df.columns
print("The total number of columns:", cols.size)
OUTPUT:
CODE:
df.info()
OUTPUT:
HANDLING THE MISSING VALUES
CODE:
numerical_columns = df.select_dtypes(include=['int64', 'float64'])
new_df = pd.DataFrame(numerical_columns)
new_df['country'] = df['Country and areas']
new_df['region'] = df['World Bank Region']
new_df.info()
OUTPUT:
Handling the WHZ Survey Sample column
The WHZ Survey Sample (N) column contains the number of samples that were taken from each country to measure the weight-for-height Z-scores. Since this is just the number of people surveyed, it cannot be meaningfully imputed, so we drop the rows with null values.
CODE:
new_df.dropna(subset=['WHZ Survey Sample (N)'], inplace=True)
new_df.info()
OUTPUT:
Handling the U5 Population column
CODE:
num_null_values = new_df['U5 Population (\'000s)'].isnull().sum()
# Print the number of null values in the column
print("Number of null values in the U5 Population ('000s) column:", num_null_values)
OUTPUT:
Similarly, the population column cannot be predicted because population varies from place to place, and there are only 3 missing values in this column, so we can drop the rows with null values.
CODE:
new_df.dropna(subset=['U5 Population (\'000s)'], inplace=True)
new_df.info()
OUTPUT:
Handling the Wasting and Severe Wasting columns
"wasting" refers to a condition of malnutrition characterized by a rapid weight loss and/or failure to gain weight in
children under the age of five. Wasting is typically assessed by measuring a child's weight in relation to their height
or length, often expressed as a Z-score. A Z-score below a certain threshold indicates that the child's weight is
significantly lower than expected for their height or length, which is indicative of acute malnutrition.
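For concreteness, here is a minimal sketch of how such a Z-score is computed; the reference median and standard deviation below are hypothetical illustration values, not WHO reference data:
# Minimal sketch: computing a weight-for-height Z-score (WHZ).
# The reference median and SD below are hypothetical illustration values.
def whz_score(observed_weight_kg, ref_median_kg, ref_sd_kg):
    # Z-score: how many SDs the observed weight is from the reference median
    return (observed_weight_kg - ref_median_kg) / ref_sd_kg

z = whz_score(7.0, 8.5, 0.9)
print(round(z, 2))  # -1.67; a WHZ below -2 indicates wasting, below -3 severe wasting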
Children who are wasted are often visibly thin or emaciated, and they may suffer from weakened immune systems, increased susceptibility to infections, and impaired physical and cognitive development. Severe wasting refers to a more severe form of malnutrition where the child's weight is significantly below the expected level for their height or length.
So we are going to fill the missing values in the Wasting and Severe Wasting columns with the mean of the respective country's Wasting or Severe Wasting values.
CODE:
# Fill missing values with the mean of the same country's observations
country_mean_wasting = new_df.groupby('country')['Wasting'].transform('mean')
new_df['Wasting'] = new_df['Wasting'].fillna(country_mean_wasting)
country_mean_severe_wasting = new_df.groupby('country')['Severe Wasting'].transform('mean')
new_df['Severe Wasting'] = new_df['Severe Wasting'].fillna(country_mean_severe_wasting)
new_df.info()
OUTPUT:
Now you can see that the Wasting column has been handled, but the Severe Wasting column still has some missing values. This is because we filled the missing values with the mean of the respective country, but some countries have no Severe Wasting values at all, so a mean cannot be calculated for them. We therefore fill the remaining rows with 0.
CODE:
new_df['Severe Wasting'] = new_df['Severe Wasting'].fillna(0)
new_df.info()
OUTPUT:
Handling null values in the Overweight, Stunting and Underweight columns
CODE:
# Fill missing values with the mean of the same country's observations
country_mean_overweight = new_df.groupby('country')['Overweight'].transform('mean')
new_df['Overweight'] = new_df['Overweight'].fillna(country_mean_overweight)
country_mean_underweight = new_df.groupby('country')['Underweight'].transform('mean')
new_df['Underweight'] = new_df['Underweight'].fillna(country_mean_underweight)
country_mean_stunting = new_df.groupby('country')['Stunting'].transform('mean')
new_df['Stunting'] = new_df['Stunting'].fillna(country_mean_stunting)
# Countries with no Overweight values at all still have NaNs; fill those with 0
new_df['Overweight'] = new_df['Overweight'].fillna(0)
new_df.info()
OUTPUT:
DEALING WITH THE OUTLIERS
CODE:
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.boxplot(new_df['Wasting'])
plt.title('Box Plot of Wasting Column')
plt.ylabel('Wasting Value')
plt.grid(True)
plt.show()
OUTPUT:
CODE:
new_df.hist(column=['Severe Wasting', 'Wasting', 'Overweight', 'Stunting', 'Underweight'],
            figsize=(12, 16), bins=30)
plt.tight_layout()
plt.show()
OUTPUT:
As this is survey data, the chance of outliers is minimal unless there are wrong measurements, so manipulating or removing these values would not be appropriate; there is no need to worry about the outliers here.
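To back this claim up, a quick IQR-based check (a minimal sketch, not part of the original analysis) can count how many points fall outside the usual 1.5 * IQR whiskers of each indicator:
# Minimal sketch: count points outside the 1.5*IQR whiskers for each indicator.
# This only flags candidate outliers; as noted above, we do not remove them.
for col in ['Severe Wasting', 'Wasting', 'Overweight', 'Stunting', 'Underweight']:
    q1, q3 = new_df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = ((new_df[col] < lower) | (new_df[col] > upper)).sum()
    print(f"{col}: {n_out} points outside [{lower:.2f}, {upper:.2f}]")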
HANDLING IRREGULAR CARDINALITIES
CODE:
country_counts = new_df['country'].value_counts()
top_n = 40
print(country_counts.head(top_n))
country_counts.head(top_n).plot(kind='bar', figsize=(10, 6), title=f'Top {top_n} Countries by Frequency')
plt.xlabel('Country')
plt.ylabel('Frequency')
plt.show()
OUTPUT:
country
BANGLADESH 25
JAMAICA 21
NIGER 20
KUWAIT 16
MALAWI 16
PERU 16
INDONESIA 15
VIET NAM 14
SENEGAL 14
CHILE 14
MALI 14
VENEZUELA (BOLIVARIAN REPUBLIC OF) 13
BURKINA FASO 12
CHINA 12
BOLIVIA (PLURINATIONAL STATE OF) 11
EGYPT 11
TAJIKISTAN 11
RWANDA 11
NEPAL 10
PHILIPPINES 10
MYANMAR 10
NIGERIA 10
GHANA 10
MONGOLIA 10
TOGO 10
ZIMBABWE 10
MALAYSIA 10
URUGUAY 10
MEXICO 9
CENTRAL AFRICAN REPUBLIC 9
UGANDA 9
MAURITANIA 9
UNITED REPUBLIC OF TANZANIA 9
GAMBIA 9
KENYA 9
SIERRA LEONE 9
INDIA 8
PAKISTAN 8
SRI LANKA 8
GUINEA 8
Name: count, dtype: int64
We cannot deal with the cardinality in the country column because the least frequent countries occur just once, and doing stratification on country therefore becomes impossible.
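If the country column were needed as a categorical feature, one common workaround (not applied here) is to lump the rare countries into a single 'Other' bucket; a minimal sketch, with the cutoff of 5 chosen arbitrarily for illustration:
# Minimal sketch (not applied to new_df): lump countries that occur fewer
# than 5 times into an 'Other' category to tame the cardinality.
threshold = 5  # arbitrary cutoff for this illustration
counts = new_df['country'].value_counts()
rare = counts[counts < threshold].index
lumped = new_df['country'].where(~new_df['country'].isin(rare), 'Other')
print(lumped.value_counts().head(10))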
CODE:
region_counts = new_df['region'].value_counts()
print(region_counts)
region_counts.plot(kind='bar', figsize=(10, 6), title='Regions by Frequency')
plt.xlabel('Region')
plt.ylabel('Frequency')
plt.show()
OUTPUT:
region
Sub-Saharan Africa 335
Latin America & Caribbean 192
East Asia & Pacific 138
Middle East & North Africa 99
Europe & Central Asia 89
South Asia 73
Name: count, dtype: int64
CODE:
from sklearn.model_selection import train_test_split

# Stratify on region so each split preserves the regional distribution
strata = new_df['region']
train, test = train_test_split(new_df, test_size=0.2, stratify=strata, random_state=42)
train, val = train_test_split(train, test_size=0.2, stratify=train['region'], random_state=42)
print("Train size:", len(train))
print("Validation size:", len(val))
print("Test size:", len(test))
OUTPUT:
CODE:
region_counts = test['region'].value_counts()
print(region_counts)
region_counts.plot(kind='bar', figsize=(10, 6), title='Regions by Frequency (test set)')
plt.xlabel('Region')
plt.ylabel('Frequency')
plt.show()
OUTPUT:
region
Sub-Saharan Africa 67
Latin America & Caribbean 38
East Asia & Pacific 28
Middle East & North Africa 20
Europe & Central Asia 18
South Asia 15
Name: count, dtype: int64
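As a quick sanity check (a minimal sketch, not part of the original run), the regional proportions of the test set can be compared against the full dataset to confirm that the stratification preserved the distribution:
# Compare regional proportions: full dataset vs. stratified test set.
comparison = pd.DataFrame({
    'full': new_df['region'].value_counts(normalize=True),
    'test': test['region'].value_counts(normalize=True),
})
print(comparison.round(3))  # the two columns should be nearly identical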
CODE:
from sklearn.utils import resample

# region_counts here holds the test-set distribution computed above
# Oversampling minority classes (regions with fewer observations)
minority_regions = region_counts[region_counts < region_counts.max()].index

# Oversample each minority class and collect the resampled frames
oversampled_data = []
for region in minority_regions:
    region_data = new_df[new_df['region'] == region]
    oversampled_region = resample(region_data, replace=True,
                                  n_samples=region_counts.max(), random_state=42)
    oversampled_data.append(oversampled_region)

# Concatenate the oversampled data with the original dataset
oversampled_df = pd.concat([new_df] + oversampled_data)

# Undersampling the majority class (the region with the most observations)
majority_region = region_counts.idxmax()
majority_data = new_df[new_df['region'] == majority_region]
undersampled_majority = resample(majority_data, replace=False,
                                 n_samples=region_counts.min(), random_state=42)

# Concatenate the undersampled majority class with the minority class data
undersampled_df = pd.concat([undersampled_majority] +
                            [new_df[new_df['region'] == region] for region in minority_regions])

# Print the sizes of the resulting oversampled and undersampled datasets
print("Oversampled size:", len(oversampled_df))
print("Undersampled size:", len(undersampled_df))
OUTPUT:
Oversampled size: 1261
Undersampled size: 606
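To see the effect of the resampling (again a minimal sketch, not part of the original run), the region distributions of both new frames can be inspected:
# Each minority region in oversampled_df gains resampled rows on top of its
# originals, while undersampled_df cuts the majority region down to the
# smallest test-set class size.
print(oversampled_df['region'].value_counts())
print(undersampled_df['region'].value_counts())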
OBSERVATION (10) RECORD (10) TOTAL (20)
RESULT:
The missing values, outliers and irregular cardinalities have been successfully handled and the dataset has been cleaned.