Preprocessing
Types of Data
Data Quality
Data Preprocessing
Aggregation
Sampling
Discretization and Binarization
Attribute Transformation
Dimensionality Reduction
Feature Subset Selection
Feature Creation
Dimensionality Reduction
Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
Techniques
– Principal Components Analysis (PCA)
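As a rough illustration of PCA, the sketch below finds the first principal component of a small two-attribute data set and projects each point onto it, reducing two dimensions to one. The data values and the function names (first_principal_component, project) are made up for illustration. Power iteration is used here only as a simple, dependency-free way to find the dominant eigenvector of the covariance matrix; practical PCA implementations typically use an eigendecomposition or SVD routine instead.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (mean, direction) for the first principal component of rows."""
    n = len(rows)
    d = len(rows[0])
    # Center the data: PCA works on deviations from the attribute means.
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumption: the start vector is not orthogonal to the dominant eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # all attributes are constant; no direction of variance
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, direction):
    """Map each row to its coordinate along the given direction."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for r in rows]

# Hypothetical data: the second attribute is roughly twice the first.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
mean, pc1 = first_principal_component(points)
scores = project(points, mean, pc1)  # one number per point instead of two
```

Because the two attributes are almost perfectly linearly related, nearly all of the variance lies along pc1, so the one-dimensional scores lose very little information.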
01/27/2021 Introduction to Data Mining, 2nd Edition 18
Tan, Steinbach, Karpatne, Kumar
Feature Subset Selection
Redundant features
– Duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
Irrelevant features
– Contain no information that is useful for the data
mining task at hand
– Example: a student's ID is often irrelevant to the task of
predicting that student's GPA
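One simple way to spot redundant features, sketched below, is to drop any attribute that is almost perfectly correlated with an attribute already kept. This is only one possible heuristic, not a method prescribed by the slides. The column names, data values, 8% tax rate, and 0.95 threshold are all assumptions chosen for illustration. Note that correlation only addresses redundancy; deciding that a feature such as a student ID is irrelevant generally requires knowledge of the task.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with any kept column."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: sales tax is a fixed 8% of the purchase price,
# so it duplicates the information already carried by price.
columns = {
    "price": [10.0, 25.0, 40.0, 55.0, 80.0],
    "sales_tax": [0.8, 2.0, 3.2, 4.4, 6.4],
    "weight_kg": [1.2, 0.4, 3.1, 0.9, 2.0],
}
kept = drop_redundant(columns)
print(kept)
```

The result depends on column order: the first attribute of a correlated pair is the one retained.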
Feature Creation
General methodologies:
– Feature extraction
◆ Example: extracting edges from images
– Feature construction
◆ Example: dividing mass by volume to get density
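Both examples above can be sketched in a few lines. The first function is a deliberately crude stand-in for edge extraction: it takes the absolute difference between horizontally adjacent pixels of a tiny grayscale image, whereas real edge detectors are considerably more sophisticated. The second constructs a density attribute from mass and volume. The field names, units, and data values are hypothetical.

```python
def horizontal_edge_strength(image):
    """Feature extraction: absolute difference between neighboring pixels.

    image is a list of rows of grayscale intensities; each output row is
    one element shorter than the corresponding input row.
    """
    return [[abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
            for row in image]

def add_density(record):
    """Feature construction: derive density from mass and volume.

    Returns a new dict and leaves the original record unchanged.
    Assumes volume_cm3 is nonzero.
    """
    augmented = dict(record)
    augmented["density_g_per_cm3"] = record["mass_g"] / record["volume_cm3"]
    return augmented

# Hypothetical 2x4 image with a dark region on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
edges = horizontal_edge_strength(image)  # large values mark the boundary

# Hypothetical samples described by raw mass and volume.
samples = [
    {"mass_g": 193.0, "volume_cm3": 10.0},
    {"mass_g": 27.0, "volume_cm3": 10.0},
]
with_density = [add_density(s) for s in samples]
```

In the constructed data, density separates the two samples directly, while mass alone would also vary with the size of the sample.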