2 Data Pre-Processing
Contents
Accuracy
Completeness
Consistency
Timeliness
Value added
Interpretability
Accessibility etc
Data Preprocessing
Techniques
Data Cleaning
Data Integration
Data Transformation
Data Reduction
What is Data?
Collection of data objects and their attributes
Rows are objects; columns are attributes; Cheat is the class attribute.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
Data types
numeric, categorical (see the hierarchy for their relationship)
static, dynamic (temporal)
Record Data
Data that consists of a collection of records, each of which consists of
a fixed set of attributes
Data Matrix
Data objects with a fixed set of numeric attributes
Consider them as points in a multi-dimensional space, where each dimension represents a distinct attribute
Represented by an m by n matrix, with m rows, one for each object, and n columns, one for each attribute
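As a minimal sketch of the m by n representation, the snippet below stores a small data matrix as a list of rows and treats each row as a point in n-dimensional space. The three objects and their values are hypothetical, and Euclidean distance is used only as one illustration of what the "points in space" view allows.

```python
# Hypothetical data matrix: m = 3 objects (rows), n = 2 numeric attributes (columns).
data_matrix = [
    [1.0, 2.0],
    [4.0, 6.0],
    [0.5, 1.5],
]

m = len(data_matrix)      # number of objects
n = len(data_matrix[0])   # number of attributes


def euclidean(p, q):
    """Distance between two objects viewed as points in n-dimensional space."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


distance_01 = euclidean(data_matrix[0], data_matrix[1])
print(m, n, distance_01)
```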
Document Data
Each document is a vector of term counts (document-term matrix)

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
Transaction Data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
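One way to hold transaction data in code, sketched here under the assumption that item order inside a transaction does not matter, is a mapping from TID to a set of items. The binary (item present or absent) view and the helper `count_containing` are illustrative additions, not part of the slides.

```python
# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# All distinct items, in a fixed (alphabetical) column order.
items = sorted(set().union(*transactions.values()))

# Binary view: one row per transaction, 1 if the item occurs in it, else 0.
binary_rows = {
    tid: [int(item in basket) for item in items]
    for tid, basket in transactions.items()
}


def count_containing(itemset):
    """Number of transactions that contain every item in itemset (hypothetical helper)."""
    wanted = set(itemset)
    return sum(1 for basket in transactions.values() if wanted <= basket)


print(items)
print(binary_rows[1])
print(count_containing({"Diaper", "Milk"}))
```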
Graph Data
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
[Figure: Average Monthly Temperature of land and ocean]
The data analysis pipeline
Mining is not the only step in the analysis process
Data -> Preprocessing -> Data Mining -> Post-processing -> Result
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Data Quality
Examples:
Same person with multiple email addresses
Forms of data preprocessing
• Fill in missing values
• Smooth noisy data
• Remove outliers
• Resolve inconsistencies
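The first three cleaning steps above can be sketched in a few lines. The choices made here are assumptions for illustration only: missing values are filled with the attribute mean, noise is smoothed by bin means over equal-size bins, and outliers are dropped with a 1.5 x IQR rule. All function names and sample values are hypothetical, and resolving inconsistencies is not covered because it depends on domain rules.

```python
from statistics import mean, median


def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]; needs at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the lower half
    q3 = median(ordered[-half:])   # median of the upper half
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


print(fill_missing([10, None, 20, 30]))
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
print(remove_outliers([10, 12, 11, 13, 12, 95]))
```

Other fill strategies (a constant, the median, a model-based estimate) and other outlier rules are equally plausible; which one fits depends on the attribute.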
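z-score normalization
A value v of attribute A is normalized to $v' = \frac{v - \bar{A}}{\sigma_A}$, where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of A.
Example: Let the mean and standard deviation of the values for the attribute income be $54,000 and $16,000, respectively.
With z-score normalization, a value of $73,600 for income is transformed to
$v' = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$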
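The income example can be checked with a short sketch. The function name `z_score` is a hypothetical helper; it simply subtracts the mean and divides by the standard deviation.

```python
def z_score(value, mean, std_dev):
    """z-score normalization: how many standard deviations value lies from the mean."""
    if std_dev == 0:
        raise ValueError("standard deviation must be non-zero")
    return (value - mean) / std_dev


# Income example from the slide: mean 54,000, standard deviation 16,000.
income_z = z_score(73_600, 54_000, 16_000)
print(round(income_z, 3))
```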
Data Transformation: Normalization
Decimal scaling
normalizes by moving the decimal point of values of attribute A.
The number of decimal points moved depends on the
maximum absolute value of A.
$v' = \frac{v}{10^{j}}$
where j is the smallest integer such that $\max(|v'|) < 1$
Example: Suppose that the recorded values of A range from -986 to 917.
The maximum absolute value of A is 986.
To normalize by decimal scaling, we therefore divide each value by
1,000 (i.e., j = 3)
-986 normalizes to -0.986 and 917 normalizes to 0.917.
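The search for j can be sketched as a small loop that keeps increasing j until the largest absolute value drops below 1. The function name `decimal_scaling` is a hypothetical helper, and the input is assumed to be a non-empty list of numbers.

```python
def decimal_scaling(values):
    """Return (j, scaled values), where j is the smallest integer with max(|v / 10**j|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# Example from the slide: values of A range from -986 to 917.
j, scaled = decimal_scaling([-986, 917])
print(j, scaled)
```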
Data Reduction
Data reduction
Obtains a reduced representation of the data set that is much
smaller in volume
but produces the same (or almost the same) analytical results
Data Reduction Strategies
Dimensionality reduction
Data compression
use encoding schemes to reduce the data set size
Numerosity reduction
data is replaced or estimated by alternative smaller data
representations
Sampling
Histograms
Clustering
Discretization and concept hierarchy generation
replace raw data values for attributes by ranges or higher conceptual levels
Histograms
A popular data reduction technique
Divide data into buckets and store average (sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
[Figure: example histogram; x-axis from 10,000 to 100,000, y-axis (bucket count) from 0 to 40]
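A minimal sketch of the bucket idea follows. It uses equal-width buckets, which is a simpler substitute for the optimal dynamic-programming construction mentioned above, and keeps only a count and a sum per bucket so the average can be derived later. The function name and the sample values are hypothetical.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as num_buckets equal-width buckets holding only count and sum."""
    low, high = min(values), max(values)
    width = (high - low) / num_buckets
    buckets = [
        {"low": low + i * width, "high": low + (i + 1) * width, "count": 0, "sum": 0.0}
        for i in range(num_buckets)
    ]
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything goes in the first bucket
        else:
            # The maximum value would land one past the end, so clamp it into the last bucket.
            index = min(int((v - low) / width), num_buckets - 1)
        buckets[index]["count"] += 1
        buckets[index]["sum"] += v
    return buckets


histogram = equal_width_histogram([10, 12, 18, 25, 27, 40, 41, 45, 50], 4)
for bucket in histogram:
    average = bucket["sum"] / bucket["count"] if bucket["count"] else None
    print(bucket["low"], bucket["high"], bucket["count"], average)
```

Nine raw values are replaced by four (count, sum) pairs here; the saving grows with the size of the data set.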
Cluster Analysis
Sampling
Using a sample works almost as well as using the entire data set, if the sample is representative
[Figure: raw data compared with samples of different sizes]
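As a rough sketch of sampling for data reduction, the snippet below draws a simple random sample without replacement and compares its mean with the population mean. The population, the sample size, and the seed are hypothetical; simple random sampling is only one of several schemes and does not by itself guarantee a representative sample.

```python
import random
from statistics import mean


def simple_random_sample(values, sample_size, seed=None):
    """Draw sample_size distinct elements uniformly at random, without replacement."""
    rng = random.Random(seed)  # a local generator keeps the global random state untouched
    return rng.sample(values, sample_size)


population = list(range(1, 1001))  # hypothetical attribute values
sample = simple_random_sample(population, 100, seed=42)

print(len(sample), mean(population), mean(sample))
```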
Discretization:
Divide the range of a continuous attribute into intervals
Reduce data size by discretization
Interval labels can be used to replace actual data values.
Discretization for numeric data
Binning
sensitive to the user-specified number of bins and to outliers
Histogram
Clustering analysis
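Binning can be sketched with equal-width intervals, one common variant; the slides do not say which variant is meant, so this choice is an assumption. Each value is replaced by the index of its interval, and the second call shows how a single outlier stretches the range and pushes almost every value into the first bin. The function name and the sample ages are hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Return (edges, bin index per value) for equal-width binning over [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    edges = [low + i * width for i in range(num_bins + 1)]
    indices = []
    for v in values:
        if width == 0:
            indices.append(0)             # all values identical
        elif v == high:
            indices.append(num_bins - 1)  # the maximum belongs to the last bin
        else:
            indices.append(int((v - low) / width))
    return edges, indices


ages = [5, 12, 18, 33, 47, 60, 65]
edges, labels = equal_width_bins(ages, 3)
print(edges, labels)

# One outlier changes the picture: the other values collapse into bin 0.
_, labels_with_outlier = equal_width_bins(ages + [500], 3)
print(labels_with_outlier)
```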