7.data Preprocessing

Uploaded by

anshbamotra11

Data Preprocessing:

Need for Preprocessing the Data

6. What is the need for preprocessing of data? Explain.
What is Data Preprocessing?

• Data preprocessing is the process of transforming raw data into an understandable format.
• It is an important step in data mining because raw data cannot be used directly. The quality of the data should be checked before applying data mining algorithms.
• Preprocessing is mainly concerned with data quality, which can be assessed along the following dimensions:
• Accuracy: whether the data entered is correct.
• Completeness: whether all required data is recorded and available.
• Consistency: whether copies of the same data kept in different places match.
• Timeliness: whether the data is kept up to date.
• Believability: whether the data can be trusted.
• Interpretability: whether the data is easy to understand.
Major Tasks in Data Preprocessing

• There are four major tasks in data preprocessing:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data Cleaning

• Data cleaning is the process of removing or correcting incorrect, incomplete, and inaccurate data in a dataset, and of filling in missing values. Some techniques for data cleaning follow.
Handling Missing Values
• A standard value such as “Not Available” or “NA” can be used to replace missing values.
• Missing values can also be filled in manually, but this is not recommended when the dataset is large.
• The attribute’s mean can be used to replace a missing value when the data is normally distributed; for a non-normal distribution, the attribute’s median can be used instead.
• The most probable value, predicted with regression or a decision tree, can also be used to replace a missing value.
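The mean and median replacement strategies can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `impute`, the use of `None` to mark a missing value, and the sample income figures are all assumptions made for the example.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Return a copy of `values` with None entries replaced by the mean or median.

    `None` marks a missing value here; that convention is an assumption of this sketch.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


# Hypothetical, skewed income data: one very large value pulls the mean upward,
# so the median is the more representative replacement.
incomes = [30.0, 32.0, None, 35.0, 500.0]
print(impute(incomes, "mean"))
print(impute(incomes, "median"))
```

With skewed data like this, the mean is dominated by the single large value, which is why the median is preferred when the distribution is not normal.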
Handling Noisy Data

• Noise generally means random error or unnecessary data points. Handling noisy data is one of the most important steps because it improves the model being built.
• Some methods for handling noisy data:
1. Binning
• Smoothing by bin mean
• Smoothing by bin median
• Smoothing by bin boundary
2. Regression
3. Clustering
• Binning: This method smooths noisy data. The data is first sorted, and the sorted values are then partitioned into bins. There are three methods for smoothing the data in a bin.
• Smoothing by bin mean: each value in the bin is replaced by the mean of the bin.
• Smoothing by bin median: each value in the bin is replaced by the median of the bin.
• Smoothing by bin boundary: the minimum and maximum values of the bin are taken as its boundaries, and each value is replaced by the closest boundary value.
• Regression: This smooths the data by fitting it to a function, which helps when unnecessary data is present. Regression also helps decide which variables are suitable for the analysis.
• Clustering: This groups similar data and is used to find outliers. Clustering is generally used in unsupervised learning.
Sorted data for price (in dollars): 2, 6, 7, 9, 13, 20, 21, 24, 30
• Partition using the equal-frequency approach:
• Bin 1: 2, 6, 7
• Bin 2: 9, 13, 20
• Bin 3: 21, 24, 30

• Smoothing by bin mean:
• Bin 1: 5, 5, 5
• Bin 2: 14, 14, 14
• Bin 3: 25, 25, 25

• Smoothing by bin median:
• Bin 1: 6, 6, 6
• Bin 2: 13, 13, 13
• Bin 3: 24, 24, 24

• Smoothing by bin boundary:
• Bin 1: 2, 7, 7
• Bin 2: 9, 9, 20
• Bin 3: 21, 21, 30
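The worked example above can be reproduced with a short Python sketch. The function names are hypothetical, the partitioning assumes the number of values divides evenly by the number of bins, and sending a value that is equally close to both boundaries to the lower one is an assumption, since the notes do not say how ties are handled.

```python
from statistics import mean, median


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        raise ValueError("this sketch assumes the values divide evenly into bins")
    size = len(ordered) // n_bins
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_mean(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_median(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundary(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [2, 6, 7, 9, 13, 20, 21, 24, 30]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_mean(bins))
print(smooth_by_median(bins))
print(smooth_by_boundary(bins))
```

Run on the price data, the three smoothing functions give the same bins as the example above.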
Data Integration

• Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management.
• Some problems must be considered during data integration:
• Schema integration: integrating metadata (data that describes other data) from different sources.
• Entity identification problem: identifying the same entity across multiple databases. For example, the system or the user should know that the student id in one database and the student name in another database belong to the same entity.
• Detecting and resolving data value conflicts: attribute values for the same entity may differ between the databases being merged. For example, the date format may differ, such as “MM/DD/YYYY” versus “DD/MM/YYYY”.
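The date-format conflict can be resolved by converting every source to one common representation before merging. The sketch below assumes two hypothetical sources, `db_a` and `db_b`, whose formats are known in advance, and uses ISO 8601 (YYYY-MM-DD) as the common target; none of these names come from the notes.

```python
from datetime import datetime

# Hypothetical sources and the date format each one is assumed to use.
SOURCE_DATE_FORMATS = {
    "db_a": "%m/%d/%Y",  # MM/DD/YYYY
    "db_b": "%d/%m/%Y",  # DD/MM/YYYY
}


def to_iso(raw_date, source):
    """Parse a date string using its source's format and return YYYY-MM-DD."""
    parsed = datetime.strptime(raw_date, SOURCE_DATE_FORMATS[source])
    return parsed.date().isoformat()


# The same string means different days depending on where it came from.
print(to_iso("03/04/2024", "db_a"))  # 2024-03-04
print(to_iso("03/04/2024", "db_b"))  # 2024-04-03
```

The format has to be known per source: a string such as 03/04/2024 is valid in both formats, so it cannot be disambiguated from the value alone.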
Data Reduction

• Data reduction decreases the volume of the data, which makes analysis easier while producing the same or almost the same result.
• It also reduces storage space. Data reduction techniques include dimensionality reduction, numerosity reduction, and data compression.
• Dimensionality reduction: This is necessary in real-world applications where the data is large. The number of random variables or attributes under consideration is reduced, so that the dimensionality of the dataset is lower. Attributes are combined or merged without losing the original characteristics of the data, which also reduces storage space and computation time. When the data has very many dimensions, a problem called the “Curse of Dimensionality” occurs.
• Numerosity reduction: The data is replaced by a smaller representation, which reduces its volume. The aim is to do this with little or no loss of information.
• Data compression: The data is encoded in a compressed form. Compression can be lossless or lossy. Lossless compression loses no information, so the original data can be reconstructed exactly. Lossy compression discards some information, ideally only what is unnecessary.
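The difference between lossless and lossy compression can be illustrated with a small Python sketch. The lossless part uses the standard-library `zlib` module. The lossy part, rounding values to a fixed step, is a deliberately simple stand-in chosen for this example and is not a method named in the notes; the data is made up.

```python
import zlib

# Lossless: the decompressed bytes are identical to the original.
raw = ("2,6,7,9,13,20,21,24,30\n" * 200).encode("utf-8")  # repetitive sample data
packed = zlib.compress(raw)
restored = zlib.decompress(packed)
print(len(raw), "bytes before,", len(packed), "bytes after")
print("identical after round trip:", restored == raw)


def round_to_step(values, step):
    """Lossy stand-in: keep only the nearest multiple of `step` for each value."""
    return [step * round(v / step) for v in values]


# Lossy: detail finer than the step is discarded and cannot be recovered.
prices = [2, 6, 7, 9, 13]
print(round_to_step(prices, 5))  # [0, 5, 5, 10, 15]
```

After rounding, 6 and 7 are both stored as 5, so the original values cannot be reconstructed, whereas the `zlib` round trip returns the input exactly.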
Data Transformation

• Data transformation changes the format or the structure of the data. This step can be simple or complex, depending on the requirements. Some methods for data transformation:
• Smoothing: Algorithms are used to remove noise from the dataset, which helps reveal its important features. Smoothing can expose even small changes that help in prediction.
• Aggregation: The data is stored and presented in summary form. Data from multiple sources is integrated into a summary description for analysis. This is an important step because the accuracy of the results depends on the quantity and quality of the data; when both are good, the results are more relevant.
• Discretization: Continuous data is split into intervals, which reduces the data size. For example, rather than specifying the exact class time, we can use an interval such as 3 pm-5 pm or 6 pm-8 pm.
• Normalization: The data is scaled so that it falls within a smaller range, for example -1.0 to 1.0.
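Normalization and discretization can both be sketched briefly in Python. The notes do not name a specific scaling method, so min-max scaling is used here as one common choice. The function names, the 24-hour representation of class times, and the "other" label for hours outside the two intervals are assumptions of this example.

```python
def min_max_scale(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_min == old_max:
        raise ValueError("cannot scale: all values are identical")
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]


def discretize_hour(hour):
    """Map a class start hour (24-hour clock) to one of the example intervals."""
    if 15 <= hour < 17:
        return "3 pm-5 pm"
    if 18 <= hour < 20:
        return "6 pm-8 pm"
    return "other"  # hypothetical catch-all, not part of the original example


print(min_max_scale([2, 16, 30]))  # [-1.0, 0.0, 1.0]
print([discretize_hour(h) for h in (15.5, 16.75, 19.0)])
```

Min-max scaling preserves the relative spacing of the values, while discretization deliberately discards it: 15.5 and 16.75 become the same interval label.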
