2-Data_Preprocessing
... variable, field, characteristic, dimension, or feature
▪ A collection of attributes describes an object
▪ Object is also known as record, point, case, sample, entity, or instance

Objects
4   Yes  Married   120K  No
5   No   Divorced  95K   Yes
6   No   Married   60K   No
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes
A More Complete View of Data
▪ Data may have parts
... female} ... test
Ordinal
▪ Description: Ordinal attribute values also order objects. (<, >)
▪ Examples: hardness of minerals, {good, better, best}, grades, street numbers
▪ Operations: median, percentiles, rank correlation, run tests, sign tests
Interval (Numeric, Quantitative)
▪ Description: For interval attributes, differences between ...
▪ Examples: calendar dates, temperature in Celsius or Fahrenheit
▪ Operations: mean, standard deviation, Pearson's ...
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Transaction Data
▪ A special type of record data, where
▪ Each record (transaction) involves a set of items.
▪ For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes
a transaction, while the individual products that were
purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
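The table above maps naturally onto a dictionary of sets, one set of items per transaction ID. The sketch below is one possible representation, not the only one; the helper name `fraction_containing` is hypothetical and introduced only for illustration.

```python
from collections import Counter

# Transactions from the table above: TID -> set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count how many transactions contain each individual item.
item_counts = Counter(item for items in transactions.values() for item in items)

def fraction_containing(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

print(item_counts["Milk"])                      # 4
print(fraction_containing({"Diaper", "Milk"}))  # 0.6
```

Using sets reflects that a transaction records which items were bought, not their order or quantity.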
Graph Data
▪ Examples: Generic graph, a molecule, and webpages
Ordered Data
▪ Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
▪ Spatio-Temporal Data
Average monthly temperature of land and ocean
Data Analysis Pipeline
▪ Mining is not the only step in the analysis process
Data → Preprocessing → Data Mining → Post-processing → Result
▪ Causes?
Duplicate Data
▪ Data set may include data objects that are duplicates, or
almost duplicates of one another
▪ Major issue when merging data from heterogeneous sources
▪ Examples:
▪ Same person with multiple email addresses
▪ Data cleaning
▪ Process of dealing with duplicate data issues
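As a rough illustration of the "same person with multiple email addresses" case, the sketch below groups records by a normalized name and reports names that map to more than one address. The records and the `normalize` helper are hypothetical, and matching on name alone is an assumption made for brevity; real duplicate detection usually needs more evidence than a name.

```python
from collections import defaultdict

# Hypothetical records: the same person appears under two email addresses.
records = [
    {"name": "Ann Lee",  "email": "ann.lee@example.com"},
    {"name": "ann lee ", "email": "alee@example.org"},
    {"name": "Bob Ray",  "email": "bob@example.com"},
]

def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())

# Group email addresses under the normalized name.
groups = defaultdict(list)
for rec in records:
    groups[normalize(rec["name"])].append(rec["email"])

# Names with more than one address are candidate duplicates.
duplicates = {name: emails for name, emails in groups.items() if len(emails) > 1}
print(duplicates)
```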
▪ Data integration
▪ Integration of multiple databases, data cubes, or files
▪ Data transformation
▪ Normalization and aggregation
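Normalization can take several forms; the slide does not name one. The sketch below shows two common choices, min-max scaling to [0, 1] and z-score standardization, on illustrative values (in thousands). Both choices are assumptions made here for illustration.

```python
from statistics import mean, pstdev

# Illustrative numeric values (in thousands).
values = [120, 95, 60, 220, 85, 75, 90]

# Min-max normalization: map the smallest value to 0 and the largest to 1.
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Z-score normalization: subtract the mean, divide by the (population) standard deviation.
mu, sigma = mean(values), pstdev(values)
z_scores = [(x - mu) / sigma for x in values]

print(min_max[0])  # 0.375, since (120 - 60) / (220 - 60)
```

Min-max scaling is sensitive to extreme values such as 220 here, while z-scores express each value as a number of standard deviations from the mean.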
Major Tasks in Data Preprocessing
▪ Data reduction
▪ Obtains a reduced representation of the data that is smaller in
volume but produces the same or similar analytical results (for
example, by keeping only the useful values and/or attributes)
▪ Dimensionality reduction
▪ Numerosity reduction
▪ Data compression
▪ Data discretization
▪ Part of data reduction but with particular importance,
especially for numerical data
▪ Concept hierarchy generation
Forms of Data Preprocessing
Mining Data Descriptive Characteristics
▪ Motivation
▪ To better understand the data
▪ To highlight which data values should be treated as noise or
outliers.
▪ Data dispersion characteristics
▪ median, max, min, quantiles, outliers, variance, etc.
▪ Numerical dimensions correspond to sorted intervals
▪ Data dispersion: analyzed with multiple granularities of
precision
▪ Boxplot or quantile analysis on sorted intervals
▪ Dispersion analysis on computed measures
▪ Folding measures into numerical dimensions
▪ Boxplot or quantile analysis on the transformed cube
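The quantile and boxplot analysis mentioned above can be sketched with the standard library. The values are illustrative, and the 1.5 × IQR fence used to flag outliers is the usual boxplot convention, assumed here rather than taken from the slide.

```python
from statistics import quantiles

# Illustrative numeric values (in thousands).
values = [120, 95, 60, 220, 85, 75, 90]

# Quartiles; statistics.quantiles sorts the data internally.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Boxplot convention (assumed): points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < lower_fence or x > upper_fence]

print(min(values), q1, q2, q3, max(values))  # 60 75.0 90.0 120.0 220
print(outliers)                              # [220]
```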
Measuring the Central Tendency
▪ Mean (algebraic measure) (sample vs. population):
▪ Arithmetic mean: The most common and most effective
numerical measure of the “center” of a set of data is the
(arithmetic) mean.
Median
▪ Mode
▪ Value that occurs most frequently in the data
▪ Unimodal, bimodal, trimodal
▪ Empirical formula:
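The measures listed above (mean, median, mode) can be computed with Python's `statistics` module. The data set is illustrative only; it was chosen because it has two modes, which makes it bimodal.

```python
from statistics import mean, median, multimode

# Illustrative values, already sorted for readability.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(data))       # 58
print(median(data))     # 54.0, the average of the two middle values 52 and 56
print(multimode(data))  # [52, 70], two modes, so the data is bimodal
```

For unimodal, moderately skewed data, the empirical relation usually quoted is mean − mode ≈ 3 × (mean − median).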
Symmetric vs. Skewed Data
▪ Discretization
▪ Divide the range of a continuous attribute into intervals
▪ Some classification algorithms only accept categorical
attributes.
▪ Reduce data size by discretization
▪ Prepare for further analysis
Discretization and Concept Hierarchy
▪ Discretization:
▪ Reduce the number of values for a given continuous
attribute by dividing the range of the attribute into intervals
▪ Interval labels can then be used to replace actual data
values
▪ Supervised vs. unsupervised
▪ Split (top-down) vs. merge (bottom-up)
▪ Discretization can be performed recursively on an attribute
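One simple unsupervised way to divide a range into intervals is equal-width binning. The sketch below is a minimal version under that assumption; the function name `equal_width_bins`, the sample values, and the choice of four bins are all hypothetical.

```python
def equal_width_bins(values, k):
    """Split [min, max] into k equal-width intervals; return bin indices and intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values identical: everything falls into the first bin.
        return [0] * len(values), [(lo, hi)]
    # Clamp so the maximum value lands in the last bin instead of a (k+1)-th one.
    labels = [min(int((v - lo) / width), k - 1) for v in values]
    intervals = [(lo + i * width, lo + (i + 1) * width) for i in range(k)]
    return labels, intervals

# Hypothetical continuous attribute.
ages = [5, 12, 18, 25, 33, 47, 52, 61, 70, 85]
labels, intervals = equal_width_bins(ages, 4)

print(intervals)  # [(5.0, 25.0), (25.0, 45.0), (45.0, 65.0), (65.0, 85.0)]
print(labels)     # [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
```

The bin indices play the role of the interval labels that replace the actual data values.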
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Distance Matrix
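The entries of the distance matrix above are the Euclidean distances between the four points. A short sketch that recomputes them, assuming Python 3.8+ for `math.dist`:

```python
import math

# Points from the table above.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = list(points)

# Pairwise Euclidean distances, rounded to three decimals as in the slide.
matrix = [
    [round(math.dist(points[a], points[b]), 3) for b in names]
    for a in names
]

for name, row in zip(names, matrix):
    print(name, row)
```

The matrix is symmetric with zeros on the diagonal, since the distance from a point to itself is 0 and d(a, b) = d(b, a).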
Manhattan Distance
Dissimilarity of Numeric Data: Minkowski Distance
▪ Minkowski Distance is a generalization of Euclidean
and Manhattan Distance
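To make the generalization concrete: for points x and y, the Minkowski distance with parameter r is (Σ |x_k − y_k|^r)^(1/r). Setting r = 1 gives the Manhattan distance and r = 2 the Euclidean distance. The sketch below is a minimal implementation; the function name `minkowski` and the parameter name `r` are choices made here for illustration.

```python
def minkowski(x, y, r):
    """Minkowski distance between equal-length points x and y, for r >= 1."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

p1, p2 = (0, 2), (2, 0)

print(minkowski(p1, p2, 1))  # 4.0, Manhattan: |0 - 2| + |2 - 0|
print(minkowski(p1, p2, 2))  # about 2.828, Euclidean: sqrt(4 + 4)
```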