Data Mining: Preprocessing
Data Preprocessing
Why preprocess the data?
Data cleaning
Data reduction
Summary
Why Data Preprocessing?
Data in the real world is dirty:
◦ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  ◦ e.g., occupation = “ ”
◦ noisy: containing errors or outliers
  ◦ e.g., Salary = “-10”
◦ inconsistent: containing discrepancies in codes or names
  ◦ e.g., Age = “42”, Birthday = “03/07/1997”
  ◦ e.g., rating was “1, 2, 3”, now rating is “A, B, C”
  ◦ e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
◦ “Not applicable” data value when collected
◦ Different considerations between the time when the data was collected and when it is
analyzed.
◦ Human/hardware/software problems
Broad categories:
◦ Intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration
◦ Integration of multiple databases, data cubes, or files
Data transformation
◦ Normalization and aggregation
Data reduction
◦ Obtains reduced representation in volume but produces the same or similar
analytical results
Data discretization
◦ Part of data reduction but with particular importance, especially for numerical data
Forms of Data Preprocessing
Chapter 2: Data Preprocessing
Data cleaning
Data reduction
Summary
Mining Data Descriptive Characteristics
Motivation
◦ To better understand the data: central tendency, variation and spread
Mean (algebraic measure) (sample vs. population):
◦ $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
◦ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
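The two formulas above can be sketched in plain Python; the 12-value list is illustrative data, not from the slide:

```python
def mean(xs):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

print(mean([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]))  # → 58.0
print(weighted_mean([10, 20], [3, 1]))  # → 12.5
```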
Median
◦ Middle value of the ordered data; for grouped data, estimated by interpolation: $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$
Mode
◦ Value that occurs most frequently in the data
◦ Unimodal, bimodal, trimodal
◦ Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
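A minimal sketch of median and mode for ungrouped data (the grouped-data interpolation formula is omitted for brevity):

```python
from collections import Counter

def median(xs):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def modes(xs):
    """Value(s) occurring most frequently; more than one value => bimodal/trimodal data."""
    counts = Counter(xs)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]
```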
February 8, 2025 DATA MINING: CONCEPTS AND TECHNIQUES 11
Symmetric vs. Skewed Data
◦ Median, mean and mode of symmetric, positively and negatively skewed data
◦ Variance (algebraic, scalable computation):
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$ (sample)
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$ (population)
◦ Standard deviation $s$ (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$)
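A sketch showing why the second form is "scalable": it needs only the running sums of x and x², so it can be computed in one pass. Both forms give the same sample variance:

```python
def sample_variance_two_pass(xs):
    """Definition form: (1/(n-1)) * sum((x_i - xbar)^2); needs the mean first."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_one_pass(xs):
    """Scalable form: (1/(n-1)) * [sum(x^2) - (1/n) * (sum x)^2]; one pass over the data."""
    n = len(xs)
    s = s2 = 0.0
    for x in xs:
        s += x
        s2 += x * x
    return (s2 - s * s / n) / (n - 1)
```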
Boxplot
◦ Data is represented with a box
◦ The ends of the box are at the first and third quartiles, i.e., the height of the
box is the IQR (interquartile range)
◦ The median is marked by a line within the box
◦ Whiskers: two lines outside the box extend to Minimum and Maximum
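The boxplot's five-number summary can be sketched as below; note that several quartile conventions exist, and this sketch uses the median-of-halves one:

```python
def five_number_summary(xs):
    """(min, Q1, median, Q3, max) using the median-of-halves quartile convention."""
    s = sorted(xs)
    n = len(s)

    def med(seq):
        m = len(seq) // 2
        return seq[m] if len(seq) % 2 else (seq[m - 1] + seq[m]) / 2

    q1 = med(s[: n // 2])          # lower half (excludes the median when n is odd)
    q3 = med(s[(n + 1) // 2 :])    # upper half
    return s[0], q1, med(s), q3, s[-1]
```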
Data cleaning
Data reduction
Summary
Regression
◦ smooth by fitting the data into regression functions
Clustering
◦ detect and remove outliers
Binning: equal-width (distance) partitioning
◦ If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B − A)/N
◦ The most straightforward, but outliers may dominate the presentation
(Figure: regression line y = x + 1 fitted to the data; Y1′ is the regression estimate at X1)
Data cleaning
Data reduction
Summary
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
Correlation coefficient:
$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(AB)$ is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The
higher the value, the stronger the correlation.
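The cross-product form of the correlation coefficient can be sketched directly (sample standard deviations, matching the n−1 in the formula):

```python
import math

def pearson_r(a, b):
    """r_{A,B} = (sum(AB) - n*mean_A*mean_B) / ((n-1) * sigma_A * sigma_B)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / (n - 1))  # sample std dev of A
    sb = math.sqrt(sum((y - mb) ** 2 for y in b) / (n - 1))  # sample std dev of B
    return (sum(x * y for x, y in zip(a, b)) - n * ma * mb) / ((n - 1) * sa * sb)
```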
Χ² (chi-square) test:
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose actual count
is very different from the expected count
$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
It shows that like_science_fiction and play_chess are correlated in the group
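The computation can be sketched for a general contingency table, where expected counts come from the row and column marginals; the table below reproduces the slide's like_science_fiction × play_chess counts:

```python
def chi_square(table):
    """Chi-square statistic for a 2D contingency table given as a list of rows."""
    row_sums = [sum(r) for r in table]
    col_sums = [sum(c) for c in zip(*table)]
    total = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total  # marginal-based expectation
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: likes science fiction (yes/no); columns: plays chess (yes/no).
stat = chi_square([[250, 200], [50, 1000]])  # ≈ 507.93, as on the slide
```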
Attribute/feature construction
◦ New attributes constructed from the given ones
Normalization: z-score normalization (μ: mean, σ: standard deviation): $v' = \frac{v - \mu}{\sigma}$
◦ Ex. Let μ = 54,000, σ = 16,000. Then $v' = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where j is the smallest integer such that Max(|ν′|) < 1
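Both normalizations sketched in Python, using the slide's z-score example values:

```python
def z_score(v, mu, sigma):
    """z-score normalization: v' = (v - mu) / sigma."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """v' = v / 10^j, j the smallest integer such that max(|v'|) < 1."""
    j = 0
    m = max(abs(v) for v in values)
    while m / (10 ** j) >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(z_score(73_600, 54_000, 16_000))  # → 1.225
```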
Chapter 2: Data Preprocessing
Data cleaning
Data reduction
Summary
Data reduction
◦ Obtain a reduced representation of the data set that is much smaller in volume but
yet produce the same (or almost the same) analytical results
(Figure: decision-tree induction selecting a reduced attribute set, branching on attributes such as A1, A4, A6)
Audio/video compression
◦ Typically lossy compression, with progressive refinement
◦ Sometimes small fragments of signal can be reconstructed without reconstructing the
whole
(Figure: lossless vs. lossy compression, original data vs. its approximation)
(Figure: principal components Y1 and Y2 of the original axes X1 and X2)
Non-parametric methods
◦ Do not assume models
◦ Major families: histograms, clustering, sampling
Linear regression: Y = w X + b
◦ Two regression coefficients, w and b, specify the line and are to be estimated by
using the data at hand
◦ Using the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
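The least-squares estimates for w and b have the standard closed form, sketched below; the example data lies exactly on the line y = x + 1 from the earlier regression figure:

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = w*X + b:
    w = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b = ybar - w * xbar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    w = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    b = ybar - w * xbar
    return w, b

w, b = fit_line([0, 1, 2, 3], [1, 2, 3, 4])  # data on the line y = x + 1
```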
Log-linear models:
◦ The multi-way table of joint probabilities is approximated by a product of lower-
order tables
◦ Probability: $p(a, b, c, d) \approx \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
Data Reduction Method (2): Histograms
Divide data into buckets and store the average (sum) for each bucket
Partitioning rules:
◦ Equal-width: equal bucket range
◦ Equal-frequency (or equal-depth): equal number of values per bucket
◦ V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)
◦ MaxDiff: set bucket boundaries between the pairs of adjacent values with the β−1 largest differences
(Figure: example equal-width histogram over the value range 10,000–90,000)
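An equal-depth histogram can be sketched as: sort, split into buckets of (roughly) equal size, then keep only each bucket's average as the reduced representation:

```python
def equal_depth_buckets(values, n_buckets):
    """Equal-frequency partitioning: sorted data split into near-equal-size buckets."""
    s = sorted(values)
    size, extra = divmod(len(s), n_buckets)
    buckets, start = [], 0
    for i in range(n_buckets):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        buckets.append(s[start:end])
        start = end
    return buckets

def bucket_summary(buckets):
    """Reduced representation: store only the average of each bucket."""
    return [sum(b) / len(b) for b in buckets]
```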
Data Reduction Method (3): Clustering
Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
◦ SRSWOR: simple random sample without replacement
◦ SRSWR: simple random sample with replacement
(Figure: SRSWOR and SRSWR samples drawn from the raw data)
Sampling: Cluster or Stratified Sampling
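The two simple-random-sampling schemes above are a thin wrapper over the standard library (a seed is used only to make the sketch reproducible):

```python
import random

def srswor(data, n, seed=None):
    """Simple random sample WITHOUT replacement: each item appears at most once."""
    return random.Random(seed).sample(data, n)

def srswr(data, n, seed=None):
    """Simple random sample WITH replacement: items may repeat."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n)]
```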
Data cleaning
Data reduction
Summary
Discretization:
◦ Divide the range of a continuous attribute into intervals
Entropy-based discretization (split):
◦ The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
◦ The process is recursively applied to the partitions obtained until some stopping criterion is met
Merge: Find the best neighboring intervals and merge them to form larger intervals
recursively
ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
◦ Initially, each distinct value of a numerical attr. A is considered to be one interval
◦ χ² tests are performed for every pair of adjacent intervals
◦ Adjacent intervals with the least χ² values are merged together, since low χ² values for a
pair indicate similar class distributions
◦ This merge process proceeds recursively until a predefined stopping criterion is met (such
as significance level, max-interval, max inconsistency, etc.)
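A simplified ChiMerge sketch: start with one interval per data point (the slide says one per distinct value; duplicates would first be grouped), and repeatedly merge the adjacent pair with the lowest χ², here using max-interval as the stopping criterion:

```python
def chi2_pair(c1, c2):
    """Chi-square for two adjacent intervals; c1, c2 are per-class count dicts."""
    classes = set(c1) | set(c2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    total = n1 + n2
    chi2 = 0.0
    for cls in classes:
        col = c1.get(cls, 0) + c2.get(cls, 0)
        for counts, n in ((c1, n1), (c2, n2)):
            expected = n * col / total
            if expected:
                chi2 += (counts.get(cls, 0) - expected) ** 2 / expected
    return chi2

def chimerge(points, max_intervals):
    """points: (value, label) pairs. Returns the lower bounds of the merged intervals."""
    intervals = [(v, {lab: 1}) for v, lab in sorted(points)]
    while len(intervals) > max_intervals:
        scores = [chi2_pair(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))            # adjacent pair with least chi2
        lo, c1 = intervals[i]
        _, c2 = intervals[i + 1]
        merged = dict(c1)
        for k, v in c2.items():
            merged[k] = merged.get(k, 0) + v
        intervals[i:i + 2] = [(lo, merged)]      # merge the pair
    return [lo for lo, _ in intervals]
```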
A simple 3-4-5 rule can be used to segment numeric data into relatively
uniform, “natural” intervals.
◦ If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit,
partition the range into 3 equi-width intervals
◦ If it covers 2, 4, or 8 distinct values at the most significant digit, partition the
range into 4 intervals
◦ If it covers 1, 5, or 10 distinct values at the most significant digit, partition the
range into 5 intervals
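The three cases above reduce to a small lookup, sketched here for just the interval-count decision (finding the most significant digit and the recursion are omitted):

```python
def three_4_5_interval_count(distinct_msd_values):
    """Number of equi-width intervals the 3-4-5 rule suggests, given how many
    distinct values the range covers at the most significant digit."""
    if distinct_msd_values in (3, 6, 7, 9):
        return 3
    if distinct_msd_values in (2, 4, 8):
        return 4
    if distinct_msd_values in (1, 5, 10):
        return 5
    raise ValueError("the 3-4-5 rule does not cover this count")
```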
(Worked-example figure: Step 3 partitions (−$1,000 – $2,000) into three equi-width intervals; Step 4 adjusts the result to the actual range (−$400 – $5,000))
Data cleaning
Data reduction
Summary
Many methods have been developed, but data preprocessing is still an active area
of research