Chapter 3 - Data Pre-Processing Notes
• Data preprocessing
• Data cleaning
• Data transformation
• Data reduction
Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Data transformation
• Data reduction
– Obtains a reduced representation that is smaller in volume but produces the same or similar
analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for numerical data
Forms of Data Preprocessing
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
• e.g., occupation=“ ”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
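The two examples above (an empty occupation and a salary of -10) can be flagged with simple checks. The following is a minimal sketch, not a complete cleaning routine: the records, the field names and the `find_problems` helper are hypothetical, and the rule that a salary must be non-negative is an assumption made for illustration.

```python
# Hypothetical records; the field names mirror the examples in the notes.
records = [
    {"name": "A", "occupation": "engineer", "salary": 52000},
    {"name": "B", "occupation": " ", "salary": 48000},       # blank attribute value
    {"name": "C", "occupation": "teacher", "salary": -10},   # implausible value
    {"name": "D", "salary": 39000},                          # attribute absent entirely
]


def find_problems(record):
    """Return a list of problems found in one record (empty list if none)."""
    problems = []
    occupation = record.get("occupation")
    if occupation is None or occupation.strip() == "":
        problems.append("missing occupation")
    salary = record.get("salary")
    if salary is None:
        problems.append("missing salary")
    elif salary < 0:  # assumed domain rule: salaries cannot be negative
        problems.append("negative salary")
    return problems


report = {r["name"]: find_problems(r) for r in records}
for name, problems in report.items():
    print(name, problems or "ok")
```

Flagging is kept separate from repair here, since whether to fill in, correct or drop a flagged value depends on the data set.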
Data Cleaning
• Importance
– “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
Noisy Data
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data to regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
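Smoothing by regression can be sketched with an ordinary least-squares line: fit the line, then replace each observed value with the value the line predicts. This is one possible illustration only. The data points and the `fit_line` helper are hypothetical, and a straight line is assumed to be an adequate model for the data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy observations that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y with the fitted value on the line.
smoothed = [slope * x + intercept for x in xs]
print(smoothed)
```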
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
❑ Partition into equal-frequency bins:
- Bin 1: 4, 8, 9, 15
❑ Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
❑ Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
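The Bin 1 values above can be reproduced, and extended to the remaining prices, with a short sketch. The function names are hypothetical. Using three bins of four values each is an assumption consistent with Bin 1 holding four of the twelve prices, and sending ties to the lower boundary is an arbitrary choice.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        raise ValueError("this sketch assumes the values divide evenly into bins")
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min or max (ties go low)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

for i, (b, m, e) in enumerate(zip(bins, by_means, by_boundaries), start=1):
    print(f"Bin {i}: {b} | means: {m} | boundaries: {e}")
```

Bin means are left unrounded here, so they need not be whole numbers.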
Cluster Analysis
Regression
Data Transformation
Data Transformation: Normalization
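The notes name normalization without giving a formula. As a hedged illustration, two widely used variants are min-max scaling, which maps values linearly onto a target range, and z-score scaling, which subtracts the mean and divides by the standard deviation. The function names and sample values below are hypothetical, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-score scaling is undefined")
    return [(v - mu) / sigma for v in values]


incomes = [20, 40, 60, 100]  # hypothetical values
scaled = min_max(incomes)
standardized = z_score(incomes)
print(scaled)
print(standardized)
```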
Data Reduction
Data Reduction Strategies
• Data reduction
– Obtain a reduced representation of the data set that is much smaller in volume yet
produces the same (or almost the same) analytical results
• Why data reduction?
– A database/data warehouse may store terabytes of data
– Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction strategies
– Data cube aggregation
– Attribute subset selection
– Principal component analysis
Data Cube Aggregation
• Queries regarding aggregated information should be answered using the data cube whenever possible
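Aggregation along cube dimensions can be sketched as grouping detailed rows by the dimensions to keep and summing the measure over the rest. This is a simplified stand-in for a real data cube, which would normally precompute such totals. The sales rows, the dimension names and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical detailed sales rows: one row per (year, quarter, branch).
sales = [
    {"year": 2023, "quarter": "Q1", "branch": "north", "amount": 100},
    {"year": 2023, "quarter": "Q2", "branch": "north", "amount": 150},
    {"year": 2023, "quarter": "Q1", "branch": "south", "amount": 80},
    {"year": 2023, "quarter": "Q2", "branch": "south", "amount": 120},
    {"year": 2024, "quarter": "Q1", "branch": "north", "amount": 200},
    {"year": 2024, "quarter": "Q1", "branch": "south", "amount": 90},
]


def roll_up(rows, dims):
    """Sum 'amount' over every dimension not listed in dims."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row["amount"]
    return dict(totals)


by_year = roll_up(sales, ["year"])
by_year_branch = roll_up(sales, ["year", "branch"])
print(by_year)
print(by_year_branch)
```

A query about annual totals can then be answered from the two entries in `by_year` instead of scanning all six detailed rows.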
A Sample Data Cube
Attribute Subset Selection
– Select a minimum set of features such that the probability distribution of different classes
given the values for those features is as close as possible to the original distribution given
the values of all features
– Decision-tree induction
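One way to illustrate selecting features with a decision-tree criterion is to score each attribute by information gain, the split measure used by ID3-style tree induction, and keep only the attributes that score above a threshold. This is a simplification: it evaluates only a single top-level split per attribute rather than growing a full tree. The toy rows, attribute names and threshold are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy data: 'student' determines the class, 'color' is irrelevant.
rows = [
    {"student": "yes", "color": "red", "buys": "yes"},
    {"student": "yes", "color": "blue", "buys": "yes"},
    {"student": "no", "color": "red", "buys": "no"},
    {"student": "no", "color": "blue", "buys": "no"},
]

gains = {a: information_gain(rows, a, "buys") for a in ["student", "color"]}
threshold = 0.01  # assumed cut-off; attributes at or below it are dropped
selected = [a for a, g in gains.items() if g > threshold]
print(gains)
print(selected)
```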
Example of Decision Tree Induction
Summary
• Data preparation or preprocessing is a big issue for both data warehousing and data mining
• Descriptive data summarization is needed for quality data preprocessing
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• Many methods have been developed, but data preprocessing is still an active area of research