ML Data Preprocessing in Python

The document discusses data preprocessing in Python for machine learning. It describes the need for data preprocessing to prepare raw data for analysis by converting it into a proper format. The main steps covered are loading and exploring the diabetes dataset, detecting and removing outliers, analyzing correlations between variables, and separating features from the target variable. The goal of data preprocessing is to make the raw data cleaner and more suitable for machine learning algorithms.

Uploaded by

Fathoni Mahardika


ML | Data Preprocessing in Python




Data science combines statistical analysis, machine learning, and computer
programming to derive knowledge and insights from data. It involves gathering,
cleaning, and converting unstructured data into a form that can be analysed and
visualised. Data scientists process and analyse data using a number of methods and
tools, such as statistical models, machine learning algorithms, and data
visualisation software. Data science seeks to uncover patterns in data that can help
with decision-making, process improvement, and the creation of new opportunities.
It is an interdisciplinary field that spans business, engineering, and the social
sciences.
Data Preprocessing
Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm. Data preprocessing converts the raw data into a clean data set. Data
gathered from different sources arrives in a raw format that is usually not suitable
for analysis as it is.

Need for Data Preprocessing
• To achieve better results from a Machine Learning model, the data has to be
in a proper format. Some models need information in a specific format. For
example, many Random Forest implementations do not accept null values, so
null values in the original raw data set have to be handled before the
algorithm is run.
• The data set should also be formatted so that more than one Machine
Learning or Deep Learning algorithm can be run on the same data set, and the
best of them chosen.
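
As a minimal sketch of what handling null values can look like, the snippet below fills each missing value with its column median. The toy DataFrame and its column names are hypothetical, and median imputation is only one of several possible strategies, not the one this walkthrough prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
raw = pd.DataFrame({
    "glucose": [148.0, np.nan, 183.0, 89.0],
    "bmi": [33.6, 26.6, np.nan, 28.1],
})

# Count the missing values per column before handling them.
missing_before = raw.isnull().sum()

# One simple option: replace each missing value with that column's median.
filled = raw.fillna(raw.median())

missing_after = filled.isnull().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```

After this step the frame contains no nulls, so a model that rejects missing values can be trained on it.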

Steps in Data Preprocessing

Step 1: Import the necessary libraries
Python3

# importing libraries
import pandas as pd
import scipy
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Load the dataset

Dataset link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Python3

# Load the dataset

df = pd.read_csv('Geeksforgeeks/Data/diabetes.csv')
print(df.head())

Output:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6
1            1       85             66             29        0  26.6
2            8      183             64              0        0  23.3
3            1       89             66             23       94  28.1
4            0      137             40             35      168  43.1

   DiabetesPedigreeFunction  Age  Outcome
0                     0.627   50        1
1                     0.351   31        0
2                     0.672   32        1
3                     0.167   21        0
4                     2.288   33        1
Check the data info

Python3

df.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
The info above shows that the dataset has 9 columns and that each column has
768 non-null values, so there are no null values in the dataset.
We can also count the null values per column using df.isnull().sum()
Python3

df.isnull().sum()

Output:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
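
A null check only finds values that are stored as NaN. The rows printed by df.head() above contain zeros in SkinThickness and Insulin, and whether such zeros are real measurements or placeholders for missing ones is an assumption that should be checked against the dataset's documentation. A small sketch for counting them, using only the five rows shown earlier:

```python
import pandas as pd

# The five rows printed by df.head() above, limited to three columns.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
})

# Count zeros per column; isnull() reports 0 missing values for all of these.
zero_counts = (sample == 0).sum()
null_counts = sample.isnull().sum()
print(zero_counts.to_dict())
print(null_counts.to_dict())
```

On the full dataset the same expression would be applied to df instead of the hypothetical `sample` frame.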
Step 3: Statistical Analysis
For the statistical analysis we first use df.describe(), which gives a descriptive
overview of the dataset.
Python3

df.describe()

Output:

Data summary
The table above shows the count, mean, standard deviation, min, 25%, 50%, 75%, and
max values for each column. A careful look at the table suggests that the
Insulin, Pregnancies, BMI, and BloodPressure columns have outliers.
Let's plot a boxplot for each column to make this easier to see.
Step 4: Check the outliers

Python3

# Box plots: one horizontal boxplot per column
fig, axs = plt.subplots(9, 1, dpi=95, figsize=(7, 17))
for i, col in enumerate(df.columns):
    axs[i].boxplot(df[col], vert=False)
    axs[i].set_ylabel(col)
plt.show()

Output:

Boxplots
From the boxplots above, we can see that almost every column has some
outliers.
Drop the outliers

Python3

# Identify the quartiles

q1, q3 = np.percentile(df['Insulin'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = df[(df['Insulin'] >= lower_bound)
& (df['Insulin'] <= upper_bound)]

# Identify the quartiles

q1, q3 = np.percentile(clean_data['Pregnancies'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['Pregnancies'] >= lower_bound)
& (clean_data['Pregnancies'] <= upper_bound)]

# Identify the quartiles

q1, q3 = np.percentile(clean_data['Age'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['Age'] >= lower_bound)
& (clean_data['Age'] <= upper_bound)]

# Identify the quartiles

q1, q3 = np.percentile(clean_data['Glucose'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['Glucose'] >= lower_bound)
& (clean_data['Glucose'] <= upper_bound)]

# Identify the quartiles

q1, q3 = np.percentile(clean_data['BloodPressure'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (0.75 * iqr)
upper_bound = q3 + (0.75 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['BloodPressure'] >= lower_bound)
& (clean_data['BloodPressure'] <= upper_bound)]

# Identify the quartiles

q1, q3 = np.percentile(clean_data['BMI'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['BMI'] >= lower_bound)
& (clean_data['BMI'] <= upper_bound)]

# Identify the quartiles

q1, q3 = np.percentile(clean_data['DiabetesPedigreeFunction'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)

# Drop the outliers

clean_data = clean_data[(clean_data['DiabetesPedigreeFunction'] >=
lower_bound)
& (clean_data['DiabetesPedigreeFunction'] <=
upper_bound)]
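
The same filter is repeated for each column above, with a multiplier of 1.5 everywhere except BloodPressure, which uses 0.75. As a sketch, the repeated block can be wrapped in a helper and applied in a loop. The function name `drop_iqr_outliers`, its parameter `k`, and the toy data are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd


def drop_iqr_outliers(data, column, k=1.5):
    """Keep rows whose `column` value lies within [q1 - k*iqr, q3 + k*iqr]."""
    q1, q3 = np.percentile(data[column], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    mask = (data[column] >= lower_bound) & (data[column] <= upper_bound)
    return data[mask]


# Hypothetical toy data with one obvious outlier in "Insulin".
toy = pd.DataFrame({
    "Insulin": [0, 10, 20, 30, 40, 500],
    "BMI": [20.0, 22.0, 24.0, 26.0, 28.0, 30.0],
})

# Filter column by column; each step sees the rows kept by the previous one,
# which mirrors how clean_data is narrowed step by step above.
clean_toy = toy
for column, k in [("Insulin", 1.5), ("BMI", 1.5)]:
    clean_toy = drop_iqr_outliers(clean_toy, column, k)

print(clean_toy)
```

Because the quartiles are recomputed on the already filtered rows, the order of the columns can change which rows survive.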

Step 5: Correlation

Python3

# correlation matrix (computed on the original df, not on clean_data)
corr = df.corr()

plt.figure(dpi=130)
sns.heatmap(corr, annot=True, fmt='.2f')
plt.show()

Output:
Correlation
We can also compare each column's correlation with Outcome, sorted in descending order
Python3

corr['Outcome'].sort_values(ascending = False)

Output:
Outcome 1.000000
Glucose 0.466581
BMI 0.292695
Age 0.238356
Pregnancies 0.221898
DiabetesPedigreeFunction 0.173844
Insulin 0.130548
SkinThickness 0.074752
BloodPressure 0.0
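Check Outcome Proportionality

Python3
# Outcome: 0 = no diabetes, 1 = diabetes.
# value_counts() orders by frequency, so sort by label to match the labels list.
plt.pie(df.Outcome.value_counts().sort_index(),
        labels=['Not Diabetes', 'Diabetes'],
        autopct='%.f', shadow=True)
plt.title('Outcome Proportionality')
plt.show()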

Output:

Outcome Proportionality
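
The class balance can also be read as numbers rather than from a chart. The sketch below uses a short hypothetical label series in place of df.Outcome and assumes the usual encoding of 0 for no diabetes and 1 for diabetes.

```python
import pandas as pd

# Hypothetical labels standing in for df.Outcome (0 = no diabetes, 1 = diabetes).
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1], name="Outcome")

# sort_index() fixes the order to 0, 1 so counts can be matched to labels reliably.
counts = outcome.value_counts().sort_index()
proportions = outcome.value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(proportions.to_dict())
```

A strongly uneven split is worth knowing about before choosing an evaluation metric.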
Step 6: Separate independent features and Target Variables

Python3

# separate array into input and output components

X = df.drop(columns =['Outcome'])
Y = df.Outcome

Step 7: Normalization or Standardization

Normalization
• MinMaxScaler scales the data so that each feature is in the range [0, 1].
• It works well when the features have different scales and the algorithm
being used is sensitive to the scale of the features, such as k-nearest
neighbors or neural networks.
• We can rescale the data with scikit-learn's MinMaxScaler class.
Python3

# initialising the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

# learning the statistical parameters for each feature and transforming
rescaledX = scaler.fit_transform(X)
rescaledX[:5]

Output:
array([[0.353, 0.744, 0.59 , 0.354, 0. , 0.501, 0.234, 0.483],
[0.059, 0.427, 0.541, 0.293, 0. , 0.396, 0.117, 0.167],
[0.471, 0.92 , 0.525, 0. , 0. , 0.347, 0.254, 0.183],
[0.059, 0.447, 0.541, 0.232, 0.111, 0.419, 0.038, 0. ],
[0. , 0.688, 0.328, 0.354, 0.199, 0.642, 0.944, 0.2 ]])
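
For the range [0, 1], min-max scaling amounts to x' = (x - min) / (max - min), computed per column. The sketch below checks that formula against MinMaxScaler on a small hypothetical matrix (`X_toy` is not part of the original data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature matrix: two columns on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [5.0, 500.0]])

# Manual min-max scaling, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)

print(np.allclose(manual, scaled))
```

Since the minimum and maximum define the range, a single extreme value compresses all other values of that column toward one end.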
Standardization
• Standardization transforms attributes with a Gaussian distribution and
differing means and standard deviations to a standard Gaussian distribution
with a mean of 0 and a standard deviation of 1.
• We can standardize data using scikit-learn with the StandardScaler class.
• It works well when the features are approximately normally distributed and
the algorithm being used is sensitive to the scale of the features, such as
regularized linear models or support vector machines.
Python3
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
rescaledX[:5]

Output:
array([[ 0.64 ,  0.848,  0.15 ,  0.907, -0.693,  0.204,  0.468,  1.426],
       [-0.845, -1.123, -0.161,  0.531, -0.693, -0.684, -0.365, -0.191],
       [ 1.234,  1.944, -0.264, -1.288, -0.693, -1.103,  0.604, -0.106],
       [-0.845, -0.998, -0.161,  0.155,  0.123, -0.494, -0.921, -1.042],
       [-1.142,  0.504, -1.505,  0.907,  0.766,  1.41 ,  5.485,
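
Standardization amounts to z = (x - mean) / std, computed per column. As a sketch on a small hypothetical matrix (`X_toy` is not part of the original data), the manual formula can be compared with StandardScaler, which uses the population standard deviation (ddof=0), the same default as NumPy's std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: two columns on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [5.0, 500.0]])

# Manual standardization with the population standard deviation (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Each standardized column ends up with mean 0 and standard deviation 1, up to floating-point error.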
