Data Pre-Processing
Data preprocessing is an important step in data science that transforms raw
data into a clean, structured format for analysis. It involves tasks like
handling missing values, normalizing data and encoding categorical variables.
Mastering preprocessing in Python ensures reliable insights, accurate
predictions and effective decision-making. Pre-processing refers to the
transformations applied to data before feeding it to the algorithm.
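To make those tasks concrete, here is a minimal sketch of the three operations mentioned above (missing-value handling, normalization and encoding) on a small made-up DataFrame; the column names and values are purely illustrative.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Small illustrative dataset (hypothetical values)
data = pd.DataFrame({
    'Age': [25, None, 47, 35],
    'Salary': [50000, 64000, None, 58000],
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']
})

# 1. Handle missing values: fill numeric gaps with the column mean
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Salary'] = data['Salary'].fillna(data['Salary'].mean())

# 2. Normalize numeric columns to the 0-1 range
scaler = MinMaxScaler()
data[['Age', 'Salary']] = scaler.fit_transform(data[['Age', 'Salary']])

# 3. Encode the categorical column as one-hot indicator columns
data = pd.get_dummies(data, columns=['City'])

print(data)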
Data summary
The table above shows the count, mean, standard deviation, min, 25%, 50%, 75%
and max values for each column. Observing it carefully, we find that the
Insulin, Pregnancies, BMI and BloodPressure columns have outliers.
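For reference, a summary table like this is typically produced with pandas' describe(); a minimal sketch, assuming the dataset has been loaded into a DataFrame df (the file name diabetes.csv is an assumption):
import pandas as pd

# Load the dataset (file name assumed) and print the summary statistics
df = pd.read_csv('diabetes.csv')
print(df.describe())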
Let’s plot a boxplot for each column to see this more clearly.
Step 3: Check the outliers
# Box plots for every column to visualize outliers
import matplotlib.pyplot as plt

fig, axs = plt.subplots(9, 1, dpi=95, figsize=(7, 17))
i = 0
for col in df.columns:
    axs[i].boxplot(df[col], vert=False)
    axs[i].set_ylabel(col)
    i += 1
plt.show()
Output:
Boxplots
From the boxplots above we can clearly see that every column has some
outliers.
Step 4: Drop the outliers
import numpy as np

# Identify the quartiles of the Insulin column
q1, q3 = np.percentile(df['Insulin'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = df[(df['Insulin'] >= lower_bound)
                & (df['Insulin'] <= upper_bound)]
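The same IQR rule can be applied to the other columns flagged earlier (Pregnancies, BMI, BloodPressure); a minimal sketch that filters them one at a time (the choice to filter column by column is an assumption):
# Apply the IQR filter to each of the remaining outlier-prone columns
for col in ['Pregnancies', 'BMI', 'BloodPressure']:
    q1, q3 = np.percentile(clean_data[col], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    clean_data = clean_data[(clean_data[col] >= lower_bound)
                            & (clean_data[col] <= upper_bound)]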
# Correlation heatmap
import seaborn as sns

corr = df.corr()

plt.figure(dpi=130)
# Plot the correlation matrix (a seaborn heatmap is assumed here)
sns.heatmap(corr, annot=True, fmt='.2f')
plt.show()
Output:
Correlation
from sklearn.preprocessing import StandardScaler

# X is assumed to hold the independent features (every column except Outcome)
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
rescaledX[:5]
Output:
array([[ 0.64 ,  0.848,  0.15 ,  0.907, -0.693,  0.204,  0.468,  1.426],
       [-0.845, -1.123, -0.161,  0.531, -0.693, -0.684, -0.365, -0.191],
       [ 1.234,  1.944, -0.264, -1.288, -0.693, -1.103,  0.604, -0.106],
       [-0.845, -0.998, -0.161,  0.155,  0.123, -0.494, -0.921, -1.042],
       [-1.142,  0.504, -1.505,  0.907,  0.766,  1.41 ,  5.485, ...]])
In conclusion, data preprocessing is an essential step that turns raw data
into a clean form ready for analysis. Using Python we can handle missing
values, organize data and prepare it for accurate results. This ensures our
model is reliable and helps us uncover valuable insights from the data.
Data cleaning is an important step in the machine learning (ML) pipeline:
it involves identifying and removing missing, duplicate or irrelevant data.
The goal of data cleaning is to ensure that the data is accurate, consistent
and free of errors, because raw data is often noisy, incomplete and
inconsistent, which can negatively impact the accuracy of the model and the
reliability of the insights derived from it. Professional data scientists
usually invest a large portion of their time in this step because of the
belief that
“Better data beats fancier algorithms”
Clean datasets also help in EDA, enhancing the interpretability of the data
so that the right actions can be taken based on the insights.
How to Perform Data Cleaning?
The process begins with a thorough understanding of the data and its
structure in order to identify issues like missing values, duplicates and
outliers. Performing data cleaning then follows a systematic process to
identify and remove errors from the dataset. The following are essential
steps to perform data cleaning.
Data Cleaning
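As a first step, a quick inspection usually surfaces those issues; a minimal sketch, assuming the data has already been loaded into a DataFrame df3 (the Titanic training data used in the examples below):
# Inspect structure, missing values and duplicates before cleaning
df3.info()                          # column types and non-null counts
print(df3.isnull().sum())           # missing values per column
print(df3.duplicated().sum())       # number of duplicate rows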
# Box plot of the Age column to visualize outliers
plt.boxplot(df3['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
Output:
Box Plot
As we can see from the box-and-whisker plot above, the Age column has
outlier values: values less than 5 or greater than 55 are outliers.
# calculate summary statistics
mean = df3['Age'].mean()
std = df3['Age'].std()
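These statistics can then be used to bound the outliers; a minimal sketch that caps Age at two standard deviations around the mean (the 2-sigma cut-off is an assumption, not taken from the text above):
# Cap Age values outside mean ± 2 standard deviations (threshold assumed)
lower, upper = mean - 2 * std, mean + 2 * std
df3['Age'] = df3['Age'].clip(lower=lower, upper=upper)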
# Numerical columns
num_col_ = [col for col in X.columns if X[col].dtype != 'object']
x1 = X
# Learn the statistical parameters for each numeric column and transform it
# (a MinMaxScaler is assumed here, given the 0-1 range in the output below)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()
Output:
   Pclass     Sex       Age  SibSp  Parch      Fare Embarked
0     1.0    male  0.271174  0.125    0.0  0.014151        S
1     0.0  female  0.472229  0.125    0.0  0.139136        C
2     1.0  female  0.321438  0.000    0.0  0.015469        S
3     0.0  female  0.434531  0.125    0.0  0.103644        S
4     1.0    male  0.434531  0.000    0.0  0.015713        S
Standardization (Z-score scaling): Standardization
transforms the values to have a mean of 0 and a standard
deviation of 1. It centers the data around the mean and scales it
based on the standard deviation. Standardization makes the data
more suitable for algorithms that assume a Gaussian distribution
or require features to have zero mean and unit variance.
Z = (X - μ) / σ
Where,
X = Data
μ = Mean value of X
σ = Standard deviation of X
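To illustrate the formula, here is a minimal sketch computing the z-score of the Age column by hand and comparing it with scikit-learn's StandardScaler (missing values are dropped for the comparison; that step is an assumption):
from sklearn.preprocessing import StandardScaler

# Manual z-score: subtract the mean and divide by the standard deviation
age = df3['Age'].dropna()
z_manual = (age - age.mean()) / age.std(ddof=0)   # population std, to match StandardScaler

# The same transformation via scikit-learn
z_sklearn = StandardScaler().fit_transform(age.to_frame())

print(z_manual.head())
print(z_sklearn[:5])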
Data Cleansing Tools
Some data cleansing tools:
OpenRefine: A powerful open-source tool for cleaning and transforming messy
data. It supports tasks like removing duplicates and enriching data through
an easy-to-use interface.
Trifacta Wrangler: A user-friendly tool designed for cleaning, transforming
and preparing data for analysis. It uses AI to suggest transformations and
streamline workflows.
TIBCO Clarity: A tool that helps in profiling, standardizing and enriching
data. It is ideal for producing high-quality, consistent data across
datasets.
Cloudingo: A cloud-based tool focusing on de-duplication, data cleansing and
record management to maintain data accuracy.
IBM InfoSphere QualityStage: Highly suitable for large-scale and complex
data.
Advantages and Disadvantages of Data
Cleaning in Machine Learning
Advantages:
Improved model performance: Removal of errors,
inconsistencies and irrelevant data helps the model to
better learn from the data.
Increased accuracy: Helps ensure that the data is
accurate, consistent and free of errors.
Better representation of the data: Data cleaning
allows the data to be transformed into a format that
better represents the underlying relationships and
patterns in the data.
Improved data quality: Improves the quality of the data, making it more
reliable and accurate.
Improved data security: Helps to identify and remove
sensitive or confidential information that could
compromise data security.
Disadvantages:
Time-consuming: It is a very time-consuming task, especially for large and
complex datasets.
Error-prone: It can result in the loss of important information.
Cost and resource-intensive: It is a resource-intensive process that requires
significant time, effort and expertise. It can also require the use of
specialized software tools.
Overfitting: Data cleaning can contribute to overfitting
by removing too much data.
So we have discussed the different steps in data cleaning that make the data
more reliable and produce good results. After properly completing the data
cleaning steps, we’ll have a robust dataset that is free of errors and
inconsistencies. In summary, data cleaning is a crucial step in the data
science pipeline that involves identifying and correcting errors,
inconsistencies and inaccuracies in the data to improve its quality and
usability.
Overview of Data Cleaning – FAQs
What does it mean to cleanse our data?
Cleansing data involves identifying and rectifying errors,
inconsistencies and inaccuracies in a dataset to improve its
quality, ensuring reliable results in analyses and decision-
making.
What is an example of cleaning data?
Removing duplicate records in a customer database ensures
accurate and unbiased analysis, preventing redundant
information from skewing results or misrepresenting the
customer base.
What is the meaning of data wash?
“Data wash” is not a standard term in data management. If used, it could
refer to cleaning or processing data, but it is not a widely recognized term
in the field.
How is data cleansing done?
Data cleansing involves steps like removing duplicates, handling
missing values, and correcting inconsistencies. It requires
systematic examination and correction of data issues.
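As a concrete illustration of those steps, here is a minimal sketch on a hypothetical customer DataFrame (the column names and values are illustrative only):
import pandas as pd

# Hypothetical customer data with a duplicate row, a missing value and inconsistent casing
customers = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', 'Carol'],
    'city': ['delhi', 'Mumbai', 'Mumbai', None]
})

customers = customers.drop_duplicates()                    # remove duplicate records
customers['city'] = customers['city'].fillna('Unknown')    # handle missing values
customers['city'] = customers['city'].str.title()          # correct inconsistent formatting

print(customers)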