Register no: 2022510020
Ex no: 1
Date: 14.02.2024
HANDLING MISSING VALUES, OUTLIERS AND IRREGULAR CARDINALITIES
AIM
To handle the missing values, outliers and irregular cardinalities present in the dataset.
DESCRIPTION ABOUT THE DATASET
The UNICEF child malnutrition dataset comprises comprehensive information on the nutritional status of children worldwide. It includes demographic details such as age, gender, and possibly socioeconomic factors, alongside crucial nutritional indicators like weight-for-age, height-for-age, and weight-for-height. Geographical breakdowns enable analysis of malnutrition rates across regions, from countries to local communities, while time-series data facilitates the identification of trends over time.
ANALYSIS
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Try UTF-8 first and fall back to Latin-1 if the file is not UTF-8 encoded
try:
    df = pd.read_csv('C:\\Users\\my pc\\Downloads\\JMECountryEstimatesApril2021.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('C:\\Users\\my pc\\Downloads\\JMECountryEstimatesApril2021.csv', encoding='latin1')

print("The total number of datapoints:", len(df))
cols = df.columns
print("The total number of columns:", cols.size)
OUTPUT:
CODE:
df.info()
OUTPUT:
HANDLING THE MISSING VALUES
CODE:
numerical_columns = df.select_dtypes(include=['int64', 'float64'])
new_df = pd.DataFrame(numerical_columns)
new_df['country'] = df['Country and areas']
new_df['region'] = df['World Bank Region']
new_df.info()
OUTPUT:
Handling the WHZ Survey Sample column
The WHZ Survey Sample (N) column contains the number of samples that were taken from each country to measure the weight-for-height Z-scores. Since this is just the number of people surveyed, it cannot be meaningfully imputed, so we drop the rows with null values.
CODE:
new_df.dropna(subset=['WHZ Survey Sample (N)'], inplace=True)
new_df.info()
OUTPUT:
Handling the U5 Population column
CODE:
num_null_values = new_df['U5 Population (\'000s)'].isnull().sum()
# Print the number of null values in the column
print("Number of null values in the U5 Population ('000s) column:", num_null_values)
OUTPUT:
Similarly, the population column cannot be predicted because population varies from place to place, and there are only 3 missing values in this column, so we can drop the rows with null values.
CODE:
new_df.dropna(subset=['U5 Population (\'000s)'], inplace=True)
new_df.info()
OUTPUT:
Handling the Wasting and Severe Wasting columns
"wasting" refers to a condition of malnutrition characterized by a rapid weight loss and/or failure to gain weight in
children under the age of five. Wasting is typically assessed by measuring a child's weight in relation to their height
or length, often expressed as a Z-score. A Z-score below a certain threshold indicates that the child's weight is
significantly lower than expected for their height or length, which is indicative of acute malnutrition.
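For concreteness, here is a minimal sketch of how such a Z-score is computed; the reference median and standard deviation below are hypothetical illustration values, not WHO reference data:
# Minimal sketch: computing a weight-for-height Z-score (WHZ).
# The reference median and SD below are hypothetical illustration values.
def whz_score(observed_weight_kg, ref_median_kg, ref_sd_kg):
    # Z-score: how many SDs the observed weight is from the reference median
    return (observed_weight_kg - ref_median_kg) / ref_sd_kg

z = whz_score(7.0, 8.5, 0.9)
print(round(z, 2))  # -1.67; a WHZ below -2 indicates wasting, below -3 severe wasting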
Children who are wasted are often visibly thin or emaciated, and they may suffer from weakened immune systems, increased susceptibility to infections, and impaired physical and cognitive development. Severe wasting refers to a more severe form of malnutrition where the child's weight is significantly below the expected level for their height or length.
So we are going to fill the missing values in the Wasting and Severe Wasting columns with the mean of the respective country's Wasting or Severe Wasting values.
CODE:
# Fill missing values with the mean of the same country's observations
country_mean_wasting = new_df.groupby('country')['Wasting'].transform('mean')
new_df['Wasting'] = new_df['Wasting'].fillna(country_mean_wasting)
country_mean_severe_wasting = new_df.groupby('country')['Severe Wasting'].transform('mean')
new_df['Severe Wasting'] = new_df['Severe Wasting'].fillna(country_mean_severe_wasting)
new_df.info()
OUTPUT:
Now you can see that the Wasting column has been handled, but the Severe Wasting column still has some missing values. This is because we filled the missing values with the mean of the respective country, but some countries have no Severe Wasting values at all, so a mean cannot be calculated for them. We therefore fill the remaining rows with 0.
CODE:
new_df['Severe Wasting'] = new_df['Severe Wasting'].fillna(0)
new_df.info()
OUTPUT:
Handling null values in the Overweight, Stunting and Underweight columns
CODE:
# Fill missing values with the mean of the same country's observations
country_mean_overweight = new_df.groupby('country')['Overweight'].transform('mean')
new_df['Overweight'] = new_df['Overweight'].fillna(country_mean_overweight)
country_mean_underweight = new_df.groupby('country')['Underweight'].transform('mean')
new_df['Underweight'] = new_df['Underweight'].fillna(country_mean_underweight)
country_mean_stunting = new_df.groupby('country')['Stunting'].transform('mean')
new_df['Stunting'] = new_df['Stunting'].fillna(country_mean_stunting)
# Countries with no Overweight values at all still have NaNs; fill those with 0
new_df['Overweight'] = new_df['Overweight'].fillna(0)
new_df.info()
OUTPUT:
DEALING WITH THE OUTLIERS
CODE:
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.boxplot(new_df['Wasting'])
plt.title('Box Plot of Wasting Column')
plt.ylabel('Wasting Value')
plt.grid(True)
plt.show()
OUTPUT:
CODE:
new_df.hist(column=['Severe Wasting', 'Wasting', 'Overweight', 'Stunting', 'Underweight'],
            figsize=(12, 16), bins=30)
plt.tight_layout()
plt.show()
OUTPUT:
As this is survey data, the chance of outliers is minimal unless there are wrong measurements, so manipulating or removing these values would not be appropriate; there is no need to worry about the outliers here.
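To back this claim up, a quick IQR-based check (a minimal sketch, not part of the original analysis) can count how many points fall outside the usual 1.5 * IQR whiskers of each indicator:
# Minimal sketch: count points outside the 1.5*IQR whiskers for each indicator.
# This only flags candidate outliers; as noted above, we do not remove them.
for col in ['Severe Wasting', 'Wasting', 'Overweight', 'Stunting', 'Underweight']:
    q1, q3 = new_df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = ((new_df[col] < lower) | (new_df[col] > upper)).sum()
    print(f"{col}: {n_out} points outside [{lower:.2f}, {upper:.2f}]")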
HANDLING IRREGULAR CARDINALITIES
CODE:
country_counts = new_df['country'].value_counts()
top_n = 40
print(country_counts.head(top_n))
country_counts.head(top_n).plot(kind='bar', figsize=(10, 6), title=f'Top {top_n} Countries by Frequency')
plt.xlabel('Country')
plt.ylabel('Frequency')
plt.show()
OUTPUT:
country
BANGLADESH 25
JAMAICA 21
NIGER 20
KUWAIT 16
MALAWI 16
PERU 16
INDONESIA 15
VIET NAM 14
SENEGAL 14
CHILE 14
MALI 14
VENEZUELA (BOLIVARIAN REPUBLIC OF) 13
BURKINA FASO 12
CHINA 12
BOLIVIA (PLURINATIONAL STATE OF) 11
EGYPT 11
TAJIKISTAN 11
RWANDA 11
NEPAL 10
PHILIPPINES 10
MYANMAR 10
NIGERIA 10
GHANA 10
MONGOLIA 10
TOGO 10
ZIMBABWE 10
MALAYSIA 10
URUGUAY 10
MEXICO 9
CENTRAL AFRICAN REPUBLIC 9
UGANDA 9
MAURITANIA 9
UNITED REPUBLIC OF TANZANIA 9
GAMBIA 9
KENYA 9
SIERRA LEONE 9
INDIA 8
PAKISTAN 8
SRI LANKA 8
GUINEA 8
Name: count, dtype: int64
We cannot deal with the cardinality in the country column because the least frequent countries occur just once, and doing stratification on country therefore becomes impossible.
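If the country column were needed as a categorical feature, one common workaround (not applied here) is to lump the rare countries into a single 'Other' bucket; a minimal sketch, with the cutoff of 5 chosen arbitrarily for illustration:
# Minimal sketch (not applied to new_df): lump countries that occur fewer
# than 5 times into an 'Other' category to tame the cardinality.
threshold = 5  # arbitrary cutoff for this illustration
counts = new_df['country'].value_counts()
rare = counts[counts < threshold].index
lumped = new_df['country'].where(~new_df['country'].isin(rare), 'Other')
print(lumped.value_counts().head(10))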
CODE:
region_counts = new_df['region'].value_counts()
print(region_counts)
region_counts.plot(kind='bar', figsize=(10, 6), title='Regions by Frequency')
plt.xlabel('Region')
plt.ylabel('Frequency')
plt.show()
OUTPUT:
region
Sub-Saharan Africa 335
Latin America & Caribbean 192
East Asia & Pacific 138
Middle East & North Africa 99
Europe & Central Asia 89
South Asia 73
Name: count, dtype: int64
CODE:
from sklearn.model_selection import train_test_split

# Stratify on region so each split preserves the regional distribution
strata = new_df['region']
train, test = train_test_split(new_df, test_size=0.2, stratify=strata, random_state=42)
train, val = train_test_split(train, test_size=0.2, stratify=train['region'], random_state=42)
print("Train size:", len(train))
print("Validation size:", len(val))
print("Test size:", len(test))
OUTPUT:
CODE:
region_counts = test['region'].value_counts()
print(region_counts)
region_counts.plot(kind='bar', figsize=(10, 6), title='Regions by Frequency (test set)')
plt.xlabel('Region')
plt.ylabel('Frequency')
plt.show()
OUTPUT:
region
Sub-Saharan Africa 67
Latin America & Caribbean 38
East Asia & Pacific 28
Middle East & North Africa 20
Europe & Central Asia 18
South Asia 15
Name: count, dtype: int64
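As a quick sanity check (a minimal sketch, not part of the original run), the regional proportions of the test set can be compared against the full dataset to confirm that the stratification preserved the distribution:
# Compare regional proportions: full dataset vs. stratified test set.
comparison = pd.DataFrame({
    'full': new_df['region'].value_counts(normalize=True),
    'test': test['region'].value_counts(normalize=True),
})
print(comparison.round(3))  # the two columns should be nearly identical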
CODE:
from sklearn.utils import resample

# region_counts here holds the test-set distribution computed above
# Oversampling minority classes (regions with fewer observations)
minority_regions = region_counts[region_counts < region_counts.max()].index

# Oversample each minority class and collect the resampled frames
oversampled_data = []
for region in minority_regions:
    region_data = new_df[new_df['region'] == region]
    oversampled_region = resample(region_data, replace=True,
                                  n_samples=region_counts.max(), random_state=42)
    oversampled_data.append(oversampled_region)

# Concatenate the oversampled data with the original dataset
oversampled_df = pd.concat([new_df] + oversampled_data)

# Undersampling the majority class (the region with the most observations)
majority_region = region_counts.idxmax()
majority_data = new_df[new_df['region'] == majority_region]
undersampled_majority = resample(majority_data, replace=False,
                                 n_samples=region_counts.min(), random_state=42)

# Concatenate the undersampled majority class with the minority class data
undersampled_df = pd.concat([undersampled_majority] +
                            [new_df[new_df['region'] == region] for region in minority_regions])

# Print the sizes of the resulting oversampled and undersampled datasets
print("Oversampled size:", len(oversampled_df))
print("Undersampled size:", len(undersampled_df))
OUTPUT:
Oversampled size: 1261
Undersampled size: 606
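To see the effect of the resampling (again a minimal sketch, not part of the original run), the region distributions of both new frames can be inspected:
# Each minority region in oversampled_df gains resampled rows on top of its
# originals, while undersampled_df cuts the majority region down to the
# smallest test-set class size.
print(oversampled_df['region'].value_counts())
print(undersampled_df['region'].value_counts())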
OBSERVATION (10) RECORD (10) TOTAL (20)
RESULT:
The missing values, outliers and irregular cardinalities have been successfully handled and the dataset has been cleaned.