
Register no:2022510020

Ex no : 1
Date: 14.02.2024

Handling missing values, outliers and irregular cardinalities

AIM
To handle the missing values, outliers and the irregular cardinalities present in the dataset.

DESCRIPTION OF THE DATASET
The UNICEF child malnutrition dataset comprises comprehensive information on the nutritional status of children worldwide. It includes demographic details such as age, gender and possibly socioeconomic factors, alongside crucial nutritional indicators like weight-for-age, height-for-age and weight-for-height. Geographical breakdowns enable analysis of malnutrition rates across regions, from countries to local communities, while time-series data facilitates the identification of trends over time.

ANALYSIS

CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

try:
    df = pd.read_csv('C:\\Users\\my pc\\Downloads\\JMECountryEstimatesApril2021.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('C:\\Users\\my pc\\Downloads\\JMECountryEstimatesApril2021.csv', encoding='latin1')

print("the total number of datapoints:", len(df))
cols = df.columns
print("the total number of columns:", cols.size)
OUTPUT:


CODE:

df.info()

OUTPUT:

HANDLING THE MISSING VALUES

CODE:

numerical_columns = df.select_dtypes(include=['int64', 'float64'])
new_df = pd.DataFrame(numerical_columns)
new_df['country'] = df['Country and areas']
new_df['region'] = df['World Bank Region']
new_df.info()

OUTPUT:

Handling WHZ SURVEY SAMPLE column

The WHZ Survey Sample column contains the number of samples taken from each country to measure weight-for-height Z-scores. Since it simply records how many people were surveyed, the value cannot be meaningfully imputed, so we drop the rows with null values.

CODE:

new_df.dropna(subset=['WHZ Survey Sample (N)'], inplace=True)

new_df.info()


OUTPUT:

Handling U5 population column

CODE:

num_null_values = new_df['U5 Population (\'000s)'].isnull().sum()

# Print the number of null values

print("Number of null values U5 Population (\'000s) in the column:", num_null_values)

OUTPUT:

Similarly, the population column cannot be predicted either, because population varies from place to place. Since there are only 3 missing values in this column, we drop the rows with null values.

CODE:

new_df.dropna(subset=['U5 Population (\'000s)'], inplace=True)

new_df.info()


OUTPUT:

Handling wasting and severe wasting columns

"wasting" refers to a condition of malnutrition characterized by a rapid weight loss and/or failure to gain weight in
children under the age of five. Wasting is typically assessed by measuring a child's weight in relation to their height
or length, often expressed as a Z-score. A Z-score below a certain threshold indicates that the child's weight is
significantly lower than expected for their height or length, which is indicative of acute malnutrition.

Children who are wasted are often visibly thin or emaciated, and they may suffer from weakened immune systems, increased susceptibility to infections, and impaired physical and cognitive development. Severe wasting refers to a more severe form of malnutrition in which the child's weight falls far below the expected level for their height or length.
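The Z-score idea described above can be sketched in a few lines. Note this is a simplified linear sketch with made-up reference values; the actual WHO growth standards derive Z-scores via the LMS method, not a plain mean/SD formula.

```python
# Simplified weight-for-height Z-score: how many standard deviations a
# child's weight lies from a reference median for their height.
# The reference median and SD below are made-up illustrative numbers,
# not actual WHO growth-standard values.
def whz_z_score(weight_kg, ref_median_kg, ref_sd_kg):
    return (weight_kg - ref_median_kg) / ref_sd_kg

z = whz_z_score(11.0, 15.0, 1.5)
print(round(z, 2))  # -2.67: below -2 flags wasting, below -3 severe wasting
```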

So we fill the Wasting and Severe Wasting columns with the mean of the respective country's Wasting or Severe Wasting values.

CODE:

country_mean_wasting = new_df.groupby('country')['Wasting'].transform('mean')

new_df['Wasting'].fillna(country_mean_wasting, inplace=True)

country_mean_severe_wasting = new_df.groupby('country')['Severe Wasting'].transform('mean')

new_df['Severe Wasting'].fillna(country_mean_severe_wasting, inplace=True)

new_df.info()


OUTPUT:

Now the Wasting column has been handled, but the Severe Wasting column still has some missing values. This is because we filled the missing values with the mean of the respective country, but some countries have no Severe Wasting measurements at all, so a mean cannot be computed for them. We fill the remaining rows with 0.

CODE:

new_df['Severe Wasting'].fillna(0,inplace=True)

new_df.info()

OUTPUT:


Handling null values in the overweight, stunting and underweight columns

CODE:

country_mean_overweight = new_df.groupby('country')['Overweight'].transform('mean')

new_df['Overweight'].fillna(country_mean_overweight, inplace=True)

country_mean_underweight = new_df.groupby('country')['Underweight'].transform('mean')

new_df['Underweight'].fillna(country_mean_underweight, inplace=True)

country_mean_stunting = new_df.groupby('country')['Stunting'].transform('mean')

new_df['Stunting'].fillna(country_mean_stunting, inplace=True)

new_df['Overweight'].fillna(0,inplace=True)

new_df.info()

OUTPUT:


DEALING WITH THE OUTLIERS

CODE:

import seaborn as sns

plt.figure(figsize=(8, 6))

sns.boxplot(new_df['Wasting'])

plt.title('Box Plot of Wasting Column')

plt.ylabel('Wasting Value')

plt.grid(True)

plt.show()

OUTPUT:


CODE:

import matplotlib.pyplot as plt

new_df.hist(column=['Severe Wasting', 'Wasting', 'Overweight', 'Stunting', 'Underweight'], figsize=(12, 16), bins=30)

plt.tight_layout()

plt.show()

OUTPUT:

As this is survey data, the chance of genuine outliers is minimal unless there are incorrect measurements, so manipulating or removing these values is not worthwhile and no outlier treatment is needed.
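This judgment can be spot-checked numerically, for example by counting points outside the Tukey fences that the box plot above draws. The helper below is a hypothetical sketch run on toy numbers, not part of the original analysis:

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series) -> int:
    # Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# toy series with one obvious outlier
print(iqr_outlier_count(pd.Series([4, 5, 5, 6, 7, 30])))
```

Applying the same helper to new_df['Wasting'] would quantify how few points the box plot actually flags.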


HANDLING IRREGULAR CARDINALITIES

CODE:


country_counts = new_df['country'].value_counts()

top_n = 40

print(country_counts.head(top_n))

country_counts.head(top_n).plot(kind='bar', figsize=(10, 6), title=f'Top {top_n} Countries by Frequency')

plt.xlabel('Country')

plt.ylabel('Frequency')

plt.show()

OUTPUT:
country
BANGLADESH 25
JAMAICA 21
NIGER 20
KUWAIT 16
MALAWI 16
PERU 16
INDONESIA 15
VIET NAM 14
SENEGAL 14
CHILE 14
MALI 14
VENEZUELA (BOLIVARIAN REPUBLIC OF) 13
BURKINA FASO 12
CHINA 12
BOLIVIA (PLURINATIONAL STATE OF) 11
EGYPT 11
TAJIKISTAN 11
RWANDA 11
NEPAL 10
PHILIPPINES 10
MYANMAR 10
NIGERIA 10
GHANA 10
MONGOLIA 10
TOGO 10


ZIMBABWE 10
MALAYSIA 10
URUGUAY 10
MEXICO 9
CENTRAL AFRICAN REPUBLIC 9
UGANDA 9
MAURITANIA 9
UNITED REPUBLIC OF TANZANIA 9
GAMBIA 9
KENYA 9
SIERRA LEONE 9
INDIA 8
PAKISTAN 8
SRI LANKA 8
GUINEA 8
Name: count, dtype: int64


We cannot reduce the cardinality of the country column, because the least frequent countries occur only once, which makes stratification on country impossible.
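The claim is easy to verify: scikit-learn's stratified splitting requires at least two samples per class, so any class that occurs once rules out stratifying on that column. A minimal sketch on a toy column (not the real data):

```python
import pandas as pd

# toy 'country' column for illustration only
counts = pd.Series(["India", "Peru", "Peru", "Chile"]).value_counts()

# classes with fewer than 2 occurrences make stratification impossible
rare = counts[counts < 2]
print(sorted(rare.index))
```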

CODE:

country_counts = new_df['region'].value_counts()

print(country_counts)

country_counts.plot(kind='bar', figsize=(10, 6), title='Regions by Frequency')

plt.xlabel('Region')

plt.ylabel('Frequency')

plt.show()

OUTPUT:
region
Sub-Saharan Africa 335
Latin America & Caribbean 192
East Asia & Pacific 138
Middle East & North Africa 99
Europe & Central Asia 89
South Asia 73
Name: count, dtype: int64


CODE:


from sklearn.model_selection import train_test_split

strata = new_df['region']

train, test = train_test_split(new_df, test_size=0.2, stratify=strata, random_state=42)

train, val = train_test_split(train, test_size=0.2, stratify=train['region'], random_state=42)

print("Train size:", len(train))

print("Validation size:", len(val))

print("Test size:", len(test))

OUTPUT:


CODE:

country_counts = test['region'].value_counts()

print(country_counts)

country_counts.plot(kind='bar', figsize=(10, 6), title='Regions by Frequency (test set)')

plt.xlabel('Region')

plt.ylabel('Frequency')

plt.show()

OUTPUT:

region
Sub-Saharan Africa 67
Latin America & Caribbean 38
East Asia & Pacific 28
Middle East & North Africa 20
Europe & Central Asia 18
South Asia 15
Name: count, dtype: int64


CODE:

from sklearn.utils import resample

# Oversampling minority classes (regions with fewer observations)
minority_regions = country_counts[country_counts < country_counts.max()].index

# Oversample each minority class and collect the results
oversampled_data = []
for region in minority_regions:
    region_data = new_df[new_df['region'] == region]
    oversampled_region = resample(region_data, replace=True, n_samples=country_counts.max(), random_state=42)
    oversampled_data.append(oversampled_region)

# Concatenate the oversampled data with the original dataset
oversampled_df = pd.concat([new_df] + oversampled_data)

# Undersampling the majority class (region with the most observations)
majority_region = country_counts.idxmax()
majority_data = new_df[new_df['region'] == majority_region]
undersampled_majority = resample(majority_data, replace=False, n_samples=country_counts.min(), random_state=42)

# Concatenate the undersampled majority class data with the minority class data
undersampled_df = pd.concat([undersampled_majority] + [new_df[new_df['region'] == region] for region in minority_regions])

# Print the sizes of the resulting oversampled and undersampled datasets



print("Oversampled size:", len(oversampled_df))

print("Undersampled size:", len(undersampled_df))

OUTPUT:
Oversampled size: 1261
Undersampled size: 606

OBSERVATION (10) RECORD(10) TOTAL(20)

RESULT:
The missing values, outliers and irregular cardinalities have been successfully handled and the dataset has been cleaned.
