Data Pre-Processing

Data preprocessing in Python is crucial for transforming raw data into a clean format suitable for analysis, involving tasks like handling missing values, normalizing data, and encoding variables. The document outlines steps for data preprocessing, including importing libraries, loading datasets, checking for null values, statistical analysis, outlier detection, and normalization techniques. It emphasizes the importance of data cleaning in the machine learning pipeline to ensure accurate and reliable insights.


ML | Data Preprocessing in Python


Data preprocessing is an important step in the data science workflow, transforming raw data into a clean, structured format for analysis. It involves tasks like handling missing values, normalizing data and encoding variables. Mastering preprocessing in Python ensures reliable insights for accurate predictions and effective decision-making. Pre-processing refers to the transformations applied to data before feeding it to the algorithm.

Data Preprocessing

Steps in Data Preprocessing


Step 1: Import the necessary libraries
# importing libraries
import pandas as pd
import scipy
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
Step 2: Load the dataset
You can download the dataset from here.
# Load the dataset
df = pd.read_csv('Geeksforgeeks/Data/diabetes.csv')
print(df.head())
Output:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
Check the data info
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
As we can see from the above info, the dataset has 9 columns and each column has 768 non-null values, so there are no null values in the dataset.
We can also check for null values using df.isnull()
df.isnull().sum()
Output:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
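There are no missing values here, but if there were, they could be imputed instead of dropped. A minimal sketch using scikit-learn's SimpleImputer (the median strategy is an illustrative choice, not part of this tutorial):
# Hypothetical imputation step, only needed if df contained missing values
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')  # median is an example strategy
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)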
Step 3: Statistical Analysis
In statistical analysis we use df.describe() which will give a
descriptive overview of the dataset.
df.describe()
Output:

Data summary

The above table shows the count, mean, standard deviation, min, 25%, 50%, 75% and max values for each column. When we observe the table carefully, we find that the Insulin, Pregnancies, BMI and BloodPressure columns have outliers.
Let’s plot the boxplot for each column for easy understanding.
Step 4: Check the outliers
# Box Plots
fig, axs = plt.subplots(9, 1, dpi=95, figsize=(7, 17))
i = 0
for col in df.columns:
    axs[i].boxplot(df[col], vert=False)
    axs[i].set_ylabel(col)
    i += 1
plt.show()
Output:
Boxplots
From the above boxplots we can clearly see that every column has some amount of outliers.
Step 5: Drop the outliers
# Identify the quartiles
q1, q3 = np.percentile(df['Insulin'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = df[(df['Insulin'] >= lower_bound)
                & (df['Insulin'] <= upper_bound)]

# Identify the quartiles
q1, q3 = np.percentile(clean_data['Pregnancies'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['Pregnancies'] >= lower_bound)
                        & (clean_data['Pregnancies'] <= upper_bound)]

# Identify the quartiles
q1, q3 = np.percentile(clean_data['Age'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['Age'] >= lower_bound)
                        & (clean_data['Age'] <= upper_bound)]

# Identify the quartiles
q1, q3 = np.percentile(clean_data['Glucose'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['Glucose'] >= lower_bound)
                        & (clean_data['Glucose'] <= upper_bound)]

# Identify the quartiles
q1, q3 = np.percentile(clean_data['BloodPressure'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (0.75 * iqr)
upper_bound = q3 + (0.75 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['BloodPressure'] >= lower_bound)
                        & (clean_data['BloodPressure'] <= upper_bound)]

# Identify the quartiles
q1, q3 = np.percentile(clean_data['BMI'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['BMI'] >= lower_bound)
                        & (clean_data['BMI'] <= upper_bound)]

# Identify the quartiles
q1, q3 = np.percentile(clean_data['DiabetesPedigreeFunction'], [25, 75])
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the lower and upper bounds
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Drop the outliers
clean_data = clean_data[(clean_data['DiabetesPedigreeFunction'] >= lower_bound)
                        & (clean_data['DiabetesPedigreeFunction'] <= upper_bound)]
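The seven nearly identical blocks above can also be written as a single loop. The following is a minimal, behavior-equivalent sketch (the per-column multipliers simply mirror the code above, including the tighter 0.75 factor used for BloodPressure); it is an optional rewrite, not part of the original tutorial:
# Loop form of the column-by-column IQR filtering shown above
iqr_multipliers = {'Insulin': 1.5, 'Pregnancies': 1.5, 'Age': 1.5, 'Glucose': 1.5,
                   'BloodPressure': 0.75, 'BMI': 1.5, 'DiabetesPedigreeFunction': 1.5}
clean_data = df.copy()
for col, k in iqr_multipliers.items():
    q1, q3 = np.percentile(clean_data[col], [25, 75])
    iqr = q3 - q1
    lower_bound, upper_bound = q1 - k * iqr, q3 + k * iqr
    clean_data = clean_data[(clean_data[col] >= lower_bound)
                            & (clean_data[col] <= upper_bound)]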
Step 6: Correlation
# correlation
corr = df.corr()
plt.figure(dpi=130)
sns.heatmap(corr, annot=True, fmt='.2f')
plt.show()
Output:

Correlation

We can also inspect the correlations with a single column (Outcome), sorted in descending order


corr['Outcome'].sort_values(ascending = False)
Output:
Outcome 1.000000
Glucose 0.466581
BMI 0.292695
Age 0.238356
Pregnancies 0.221898
DiabetesPedigreeFunction 0.173844
Insulin 0.130548
SkinThickness 0.074752
BloodPressure 0.0
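If we want to use these correlations to shortlist features, a small sketch like the following can help; the 0.2 cut-off is an arbitrary illustrative threshold, not a recommendation from this article:
# Keep features whose absolute correlation with Outcome exceeds a chosen threshold
target_corr = corr['Outcome'].drop('Outcome')
selected_features = target_corr[target_corr.abs() > 0.2].index.tolist()
print(selected_features)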
Step 7: Check Outcome Proportionality
plt.pie(df.Outcome.value_counts(),
        labels=['Not Diabetes', 'Diabetes'],  # Outcome 0 (no diabetes) is the larger class
        autopct='%.f', shadow=True)
plt.title('Outcome Proportionality')
plt.show()
Output:

Outcome Proportionality

Step 8: Separate Independent Features and Target Variables
# separate array into input and output components
X = df.drop(columns =['Outcome'])
Y = df.Outcome
Step 9: Normalization or Standardization
Normalization
 Normalization works well when the features have different
scales and the algorithm being used is sensitive to the
scale of the features, such as k-nearest neighbors or
neural networks.
 Rescale your data with scikit-learn's MinMaxScaler.
 MinMaxScaler scales the data so that each feature is in
the range [0, 1].
# initialising the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

# learning the statistical parameters for each of the data and transforming
rescaledX = scaler.fit_transform(X)
rescaledX[:5]
Output:
array([[0.353, 0.744, 0.59 , 0.354, 0.   , 0.501, 0.234, 0.483],
       [0.059, 0.427, 0.541, 0.293, 0.   , 0.396, 0.117, 0.167],
       [0.471, 0.92 , 0.525, 0.   , 0.   , 0.347, 0.254, 0.183],
       [0.059, 0.447, 0.541, 0.232, 0.111, 0.419, 0.038, 0.   ],
       [0.   , 0.688, 0.328, 0.354, 0.199, 0.642, 0.944, 0.2  ]])
Standardization
 Standardization is a useful technique to transform
attributes with a Gaussian distribution and differing
means and standard deviations to a standard Gaussian
distribution with a mean of 0 and a standard deviation of
1.
 We can standardize data using scikit-learn with the
StandardScaler class.
 It works well when the features have a normal distribution
or when the algorithm being used is not sensitive to the
scale of the features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
rescaledX[:5]
Output:
array([[ 0.64 ,  0.848,  0.15 ,  0.907, -0.693,  0.204,  0.468,  1.426],
       [-0.845, -1.123, -0.161,  0.531, -0.693, -0.684, -0.365, -0.191],
       [ 1.234,  1.944, -0.264, -1.288, -0.693, -1.103,  0.604, -0.106],
       [-0.845, -0.998, -0.161,  0.155,  0.123, -0.494, -0.921, -1.042],
       [-1.142,  0.504, -1.505,  0.907,  0.766,  1.41 ,  5.485,  ...]])
In conclusion, data preprocessing is an important step in making raw data clean for analysis. Using Python we can handle missing values, organize data and prepare it for accurate results. This ensures our model is reliable and helps us uncover valuable insights from the data.
Data cleaning is an important step in the machine learning (ML) pipeline, as it involves identifying and removing missing, duplicate or irrelevant data. The goal of data cleaning is to ensure that the data is accurate, consistent and free of errors, because raw data is often noisy, incomplete and inconsistent, which can negatively impact the accuracy of the model and the reliability of the insights derived from it. Professional data scientists usually invest a large portion of their time in this step because of the belief that "Better data beats fancier algorithms".
Clean datasets also help in EDA, enhancing the interpretability of the data so that the right actions can be taken based on the insights.
How to Perform Data Cleaning?
The process begins with a thorough understanding of the data and its structure to identify issues like missing values, duplicates and outliers. Performing data cleaning involves a systematic process to identify and remove errors in a dataset. The following are essential steps to perform data cleaning.
Data Cleaning

 Removal of Unwanted Observations: Identify and remove irrelevant or redundant (unwanted) observations from the dataset. This step involves analyzing data entries for duplicate records, irrelevant information or data points that do not contribute to analysis and prediction. Removing them from the dataset helps reduce noise and improves the overall quality of the dataset.
 Fixing Structural Errors: Address structural issues in the dataset, such as inconsistencies in data formats or variable types. Standardizing formats ensures uniformity in data structure and hence data consistency.
 Managing Outliers: Outliers are points that deviate significantly from the dataset's mean. Identifying and managing outliers can significantly improve model accuracy, as these extreme values influence the analysis. Depending on the context, decide whether to remove outliers or transform them to minimize their impact on the analysis.
 Handling Missing Data: To handle missing data effectively we can impute missing values based on statistical methods, remove records with missing values or employ advanced imputation techniques. Handling missing data helps prevent bias and maintains the integrity of the data.
Throughout the process, documentation of changes is crucial for transparency and future reference. Iterative validation is done to test the effectiveness of the data cleaning, resulting in a refined dataset that can be used for meaningful analysis and insights.
Python Implementation for Data Cleaning
Let's understand each step of data cleaning using the Titanic dataset. Below are the necessary steps:
 Import the necessary libraries
 Load the dataset
 Check the data information using df.info()
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('titanic.csv')
df.head()
Output:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
Data Inspection and Exploration
Let's first understand the data by inspecting its structure and identifying missing values, outliers and inconsistencies. Check for duplicate rows with the Python code below:
df.duplicated()
Output:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Length: 891, dtype: bool
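To summarise the duplicate check in a single number, we can count the duplicated rows; a short sketch (the Titanic data above happens to have none, so drop_duplicates() would change nothing):
# Count duplicate rows; df.drop_duplicates() would remove them if any existed
print('Duplicate rows:', df.duplicated().sum())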
Check the data information using df.info()
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
From the above data info we can see that Age and Cabin have fewer non-null values than the other columns, i.e. they contain missing values. Some of the columns are categorical with data type object, while others hold integer and float values.
Check the Categorical and Numerical Columns.
# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :',cat_col)
# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :',num_col)
Output:
Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin',
'Embarked']
Numerical columns : ['PassengerId', 'Survived', 'Pclass',
'Age', 'SibSp', 'Parch', 'Fare']
Check the total number of Unique Values in the Categorical
Columns
df[cat_col].nunique()
Output:
Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64
Removal of all Above Unwanted Observations
Duplicate observations most frequently arise during data collection, and irrelevant observations are those that don't actually fit the specific problem that we're trying to solve.
 Redundant observations reduce efficiency to a great extent, because the repeated data can push the results towards the correct or the incorrect side, thereby producing misleading results.
 Irrelevant observations are any type of data that is of no use to us and can be removed directly.
Now we have to decide, according to the subject of analysis, which factors are important for our discussion.
As we know, our machines don't understand text data, so we have to either drop or convert the categorical column values into numerical types. Here we drop the Name column because a name is always unique and has no great influence on the target variable. For the Ticket column, let's first print the first 50 unique tickets.
df['Ticket'].unique()[:50]
Output:
array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450',
       '330877', '17463', '349909', '347742', '237736', 'PP 9549',
       '113783', 'A/5. 2151', '347082', '350406', '248706', '382652',
       '244373', '345763', '2649', '239865', '248698', '330923', '113788',
       '347077', '2631', '19950', '330959', '349216', 'PC 17601',
       'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789', '2677',
       'A./5. 2152', '345764', '2651', '7546', '11668', '349253',
       'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371', '14311',
       '2662', '349237', '3101295'], dtype=object)
From the above tickets we can observe that each value is made of two parts; for example 'A/5 21171' is a combination of 'A/5' and '21171', and this might influence our target variable. That would be a case of Feature Engineering, where we derive new features from a column or a group of columns. In the current case we simply drop the "Name" and "Ticket" columns; a quick sketch of the splitting idea is shown below.
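A minimal sketch of that feature-engineering idea, assuming we split each ticket at its last space; the 'TicketPrefix' and 'TicketNumber' column names are hypothetical and this step is not used further in the tutorial:
# Hypothetical split of Ticket into a prefix part and a number part (exploratory only)
tickets = df[['Ticket']].copy()
parts = tickets['Ticket'].str.rsplit(' ', n=1)
tickets['TicketPrefix'] = parts.apply(lambda p: p[0] if len(p) > 1 else 'NONE')
tickets['TicketNumber'] = parts.apply(lambda p: p[-1])
print(tickets.head())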
Drop Name and Ticket Columns
df1 = df.drop(columns=['Name','Ticket'])
df1.shape
Output:
(891, 10)
Handling Missing Data
Missing data is a common issue in real-world datasets and it can
occur due to various reasons such as human errors, system
failures or data collection issues. Various techniques can be used
to handle missing data, such as imputation, deletion or
substitution.
Let's check the missing values column-wise. df1.isnull() checks whether each value is null and returns boolean values, sum() counts the number of null values in each column, and dividing by the total number of rows and multiplying by 100 gives the percentage of missing values in each column.
round((df1.isnull().sum()/df1.shape[0])*100,2)
Output:
PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00
Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64
We cannot just ignore or remove the missing observations. They must be handled carefully, as they can be an indication of something important.
 The fact that the value was missing may be informative in itself.
 In the real world we often need to make predictions on new data even if some of the features are missing!
As we can see from the above result, Cabin has 77.1% null values, Age has 19.87% and Embarked has 0.22% null values.
It is not a good idea to fill 77% null values, so we will drop the Cabin column. The Embarked column has only 0.22% null values, so we drop the rows with null values in Embarked.
df2 = df1.drop(columns='Cabin')
df2.dropna(subset=['Embarked'], axis=0, inplace=True)
df2.shape
Output:
(889, 9)
Imputing the missing values from past observations:
 Again, "missingness" is almost always informative in itself, and we should tell our algorithm if a value was missing.
 Even if we build a model to impute the values, we're not adding any real information; we're just reinforcing the patterns already provided by other features. We can use mean imputation or median imputation for this case.
Note:
 Mean imputation is suitable when the data is normally distributed and has no extreme outliers.
 Median imputation is preferable when the data contains outliers or is skewed.
# Mean imputation (at this point only Age still contains missing values)
df3 = df2.fillna(df2.Age.mean())
# Let's check the null values again
df3.isnull().sum()
Output:
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
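Given the note above that median imputation is preferable for skewed data or data with outliers, the median-based alternative is a one-line change; a sketch, not what the tutorial actually runs:
# Median imputation of Age as an alternative to the mean imputation above
df3_median = df2.copy()
df3_median['Age'] = df3_median['Age'].fillna(df3_median['Age'].median())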
Handling Outliers
Outliers are extreme values that deviate significantly from the
majority of the data. They can negatively impact the analysis
and model performance. Techniques such as clustering,
interpolation or transformation can be used to handle outliers.
To check for outliers we generally use a box plot. A box plot is a graphical representation of a dataset's distribution. It shows a variable's median, quartiles and potential outliers. The line inside the box denotes the median, while the box itself denotes the interquartile range (IQR). The whiskers extend to the most extreme non-outlier values within 1.5 times the IQR, and individual points beyond the whiskers are considered potential outliers. A box plot offers an easy-to-understand overview of the range of the data and makes it possible to identify outliers or skewness in the distribution.
Let's plot the box plot for the Age column.
import matplotlib.pyplot as plt

plt.boxplot(df3['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
Output:
Box Plot

As we can see from the above box-and-whisker plot, the Age column has outliers: values less than about 5 and greater than about 55 are flagged as outliers.
# calculate summary statistics
mean = df3['Age'].mean()
std = df3['Age'].std()

# Calculate the lower and upper bounds
lower_bound = mean - std * 2
upper_bound = mean + std * 2

print('Lower Bound :', lower_bound)
print('Upper Bound :', upper_bound)

# Drop the outliers
df4 = df3[(df3['Age'] >= lower_bound)
          & (df3['Age'] <= upper_bound)]
Output:
Lower Bound : 3.705400107925648
Upper Bound : 55.578785285332785
Similarly, we can remove the outliers of the remaining columns.
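A hedged sketch of that idea, applying the same mean ± 2·std rule in a loop; the column list is an assumption (Fare is used as the example) and the result is not reused later in this article:
# Apply the mean +/- 2*std rule to other numerical columns (illustrative sketch)
for col in ['Fare']:  # extend this list with other numerical columns as needed
    col_mean, col_std = df4[col].mean(), df4[col].std()
    lower_bound, upper_bound = col_mean - 2 * col_std, col_mean + 2 * col_std
    df4 = df4[(df4[col] >= lower_bound) & (df4[col] <= upper_bound)]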
Data Transformation
Data transformation involves converting the data from one form
to another to make it more suitable for analysis. Techniques such
as normalization, scaling or encoding can be used to transform
the data.
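Of these, encoding is the one technique not demonstrated elsewhere in this article; a minimal sketch using pandas one-hot encoding on the Titanic categorical columns (the drop_first choice is an illustrative assumption):
# One-hot encode the categorical columns Sex and Embarked (illustrative sketch)
df_encoded = pd.get_dummies(df3, columns=['Sex', 'Embarked'], drop_first=True)
print(df_encoded.head())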
Data validation and verification
Data validation and verification involve ensuring that the data is
accurate and consistent by comparing it with external sources or
expert knowledge.
For the machine learning prediction we separate the independent features and the target. Here we consider 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare' and 'Embarked' as the independent features and 'Survived' as the target variable, because PassengerId will not affect the survival rate.
X = df3[['Pclass','Sex','Age', 'SibSp','Parch','Fare','Embarked']]
Y = df3['Survived']
Data formatting
Data formatting involves converting the data into a standard
format or structure that can be easily processed by the
algorithms or models used for analysis. Here we will discuss
commonly used data formatting techniques i.e. Scaling and
Normalization.
Scaling
 Scaling involves transforming the values of features to a
specific range. It maintains the shape of the original
distribution while changing the scale.
 Particularly useful when features have different scales,
and certain algorithms are sensitive to the magnitude of
the features.
 Common scaling methods include Min-Max scaling and
Standardization (Z-score scaling).
Min-Max Scaling: Min-Max scaling rescales the values to a
specified range, typically between 0 and 1. It preserves the
original distribution and ensures that the minimum value maps
to 0 and the maximum value maps to 1.
from sklearn.preprocessing import MinMaxScaler

# initialising the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

# Numerical columns
num_col_ = [col for col in X.columns if X[col].dtype != 'object']
x1 = X.copy()  # work on a copy so the original X stays unscaled
# learning the statistical parameters for each of the data and transforming
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()
Output:
   Pclass     Sex       Age  SibSp  Parch      Fare Embarked
0     1.0    male  0.271174  0.125    0.0  0.014151        S
1     0.0  female  0.472229  0.125    0.0  0.139136        C
2     1.0  female  0.321438  0.000    0.0  0.015469        S
3     0.0  female  0.434531  0.125    0.0  0.103644        S
4     1.0    male  0.434531  0.000    0.0  0.015713        S
Standardization (Z-score scaling): Standardization
transforms the values to have a mean of 0 and a standard
deviation of 1. It centers the data around the mean and scales it
based on the standard deviation. Standardization makes the data
more suitable for algorithms that assume a Gaussian distribution
or require features to have zero mean and unit variance.
Z = (X - μ) / σ
Where,
 X = Data
 μ = Mean value of X
 σ = Standard deviation of X
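Following the same pattern as the Min-Max scaling above, a hedged sketch of Z-score scaling of the numerical Titanic features with StandardScaler (the x2 variable is introduced here purely for illustration):
from sklearn.preprocessing import StandardScaler

# Standardize the numerical columns so each has mean 0 and standard deviation 1
x2 = X.copy()
x2[num_col_] = StandardScaler().fit_transform(x2[num_col_])
x2.head()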
Data Cleansing Tools
Some data cleansing tools:
 OpenRefine: A powerful open-source tool for cleaning and transforming messy data. It supports tasks like removing duplicates and data enrichment with an easy-to-use interface.
 Trifacta Wrangler: A user-friendly tool designed for cleaning, transforming and preparing data for analysis. It uses AI to suggest transformations to streamline workflows.
 TIBCO Clarity: A tool that helps in profiling, standardizing and enriching data. It is ideal for producing high-quality, consistent data across datasets.
 Cloudingo: A cloud-based tool focusing on de-duplication, data cleansing and record management to maintain data accuracy.
 IBM Infosphere Quality Stage: Highly suitable for large-scale and complex data.
Advantages and Disadvantages of Data
Cleaning in Machine Learning
Advantages:
 Improved model performance: Removal of errors,
inconsistencies and irrelevant data helps the model to
better learn from the data.
 Increased accuracy: Helps ensure that the data is
accurate, consistent and free of errors.
 Better representation of the data: Data cleaning
allows the data to be transformed into a format that
better represents the underlying relationships and
patterns in the data.
 Improved data quality: Improve the quality of the
data, making it more reliable and accurate.
 Improved data security: Helps to identify and remove
sensitive or confidential information that could
compromise data security.
Disadvantages:
 Time-consuming: It is a very time-consuming task, especially for large and complex datasets.
 Error-prone: It can result in the loss of important information.
 Cost and resource-intensive: It is a resource-intensive process that requires significant time, effort and expertise. It can also require the use of specialized software tools.
 Overfitting: Data cleaning can contribute to overfitting by removing too much data.
So we have discussed four different steps in data cleaning that make the data more reliable and produce good results. After properly completing the data cleaning steps, we'll have a robust dataset free of errors and inconsistencies. In summary, data cleaning is a crucial step in the data science pipeline that involves identifying and correcting errors, inconsistencies and inaccuracies in the data to improve its quality and usability.
Overview of Data Cleaning – FAQs
What does it mean to cleanse our data?
Cleansing data involves identifying and rectifying errors,
inconsistencies and inaccuracies in a dataset to improve its
quality, ensuring reliable results in analyses and decision-
making.
What is an example of cleaning data?
Removing duplicate records in a customer database ensures
accurate and unbiased analysis, preventing redundant
information from skewing results or misrepresenting the
customer base.
What is the meaning of data wash?
“Data wash” is not a standard term in data management. If used
it could refer to cleaning or processing data but it’s not a widely
recognized term in the field.
How is data cleansing done?
Data cleansing involves steps like removing duplicates, handling
missing values, and correcting inconsistencies. It requires
systematic examination and correction of data issues.
