Data Mining and Business Intelligence
Business Intelligence
3rd Lecture
Data preprocessing
Iraklis Varlamis
Data preprocessing
2
The need for preprocessing
3
Reasons for “bad data”
4
Data quality
• A multidimensional concept
• May refer to data that are:
– Accurate
– Complete
– Consistent
– Up to date
– Trustworthy
– Easy to understand
– Accessible
– Adding value
5
Basic operations
• Data cleaning
– Fill missing values, remove noise, find and remove
extreme/wrong values
• Data integration
– Integrate multiple databases, use data from tables or files
• Data transformation
– Value normalization, aggregation etc
• Data reduction
– Reduce the dataset size without decreasing the overall
performance
• Discretisation
– Part of data reduction, important for numeric data
6
Data preprocessing
7
Data cleaning tasks (1/2)
8
Data cleaning tasks (2/2)
9
Data binning
• Equi-width split:
– Select N sub-ranges of the same width
– The width of each sub-range is W = (max – min)/N
– Quality is affected by outliers; the final result is skewed towards them
• Equi-depth split:
– Select N sub-ranges that contain the same number of instances
– Gives a better distribution of the data
– Hard to apply to nominal/ordinal data (categorical rather than continuous values)
(both splitting strategies are sketched in code below)
10
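A minimal sketch of the two splitting strategies, using pandas (the toy values and the choice of N = 3 are illustrative):

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])  # toy attribute values
N = 3

# Equi-width: N sub-ranges of equal width W = (max - min) / N
equi_width = pd.cut(values, bins=N)

# Equi-depth: N sub-ranges containing (roughly) the same number of instances
equi_depth = pd.qcut(values, q=N)

print(equi_width.value_counts().sort_index())  # equal widths, unequal counts
print(equi_depth.value_counts().sort_index())  # equal counts, unequal widths
```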
Smoothing
11
Regression – Curve fitting
(Figure: curve fitting on noisy data; outliers and noise are marked on the plot)
12
Clustering
13
Data preprocessing
14
Data integration
• May result in
– inconsistent data
– redundant data
• Because of
– Different metrics
– Different representation models
– Different IDs
• Example:
– Year_of_birth = 1980
– Age = 30
Which of the two values is correct?
15
Transformations
• Data aggregation
• Data discretisation and generalization: from
values to categories and hierarchies
• Range conversion: map continuous values to a
different range
– Using x^k, log(x), e^x, |x|, etc.
• Normalization: scale all values to the same range
usually 0..1 or -1..1
• Create additional features by
processing/combining existing features (e.g.
create age from date_of_birth, create
day_of_week from date)
16
Normalization
• Min-max normalization: rescales values to a target range, typically 0…1
v' = (v – min_A) / (max_A – min_A)
• z-score normalization: centres values on the mean and scales them by the standard deviation
v' = (v – mean_A) / σ_A
• Decimal scaling: maps values into –1…1 by dividing by a power of 10
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
(all three scalings are sketched in code below)
17
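A minimal sketch of the three scalings in NumPy (the sample values are illustrative):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # toy attribute values

# Min-max normalization to [0, 1]
minmax = (v - v.min()) / (v.max() - v.min())

# z-score normalization: mean 0, standard deviation 1
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by the smallest power of 10 that brings every |value| below 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")
```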
Data preprocessing
18
Data reduction
• Data selection
– Compress
– Fit to functions
– Sample
• Dimension reduction
– Select the most representative/useful
dimensions (features)
– Create new composite dimensions that merge
multiple old dimensions
– Avoid repetitions (or highly correlated
dimensions)
19
Data compression
• String compression
– Usually lossless
– Difficult to process without decompressing first
• Image/sound compression
– Usually lossy, with progressive (gradual) quality refinement
– A fraction of the original information is enough to reconstruct an approximation of the whole
– Wavelet transformations can be employed
• Time-frequency analysis
20
Data reduction
• Parametric methods
– They assume that the data fit a model; they estimate the model's parameters and store those instead of the data
– Log-linear models: find associations between features that are significantly different from zero (important subspaces), then replace the initial vector representation (in the original m-dimensional space) with a product of probabilities over these subspaces
– Regression (see the sketch below)
• Non-parametric methods
– They do not assume a model
– Histograms, clusters, sampling
21
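A minimal sketch of the parametric idea using regression: fit a simple model and keep only its parameters instead of the raw points (NumPy's least-squares polynomial fit is one way to do this; the synthetic data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)  # 200 noisy (x, y) pairs

# Fit y ≈ a*x + b and store only (a, b) instead of the 200 pairs
a, b = np.polyfit(x, y, deg=1)
print(f"stored parameters: a={a:.2f}, b={b:.2f}")

# Any value can later be approximated from the parameters alone
print("reconstruction at x = 4.2:", a * 4.2 + b)
```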
Histograms
22
Sampling
23
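A minimal sketch of simple random sampling, with and without replacement, and of a stratified sample that preserves class proportions (pandas; the data frame and the 10% fraction are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000), "label": ["a", "b", "b", "b"] * 250})

srswor = df.sample(frac=0.1, replace=False, random_state=42)  # without replacement
srswr = df.sample(frac=0.1, replace=True, random_state=42)    # with replacement

# Stratified sample: draw 10% from each value of "label" separately
stratified = df.groupby("label").sample(frac=0.1, random_state=42)

print(len(srswor), len(srswr), stratified["label"].value_counts().to_dict())
```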
Dimensionality reduction
• Feature selection
– Select the minimum set of features that gives a class distribution as close as possible to the one obtained with the original set of features
– Reduces the dimensionality of the resulting models and makes them easier to understand
• Heuristics
– Step-wise forward selection: select the best attribute and keep adding one attribute at a time (sketched below)
– Step-wise backward elimination: the reverse process
– Combination of selection and elimination
– Decision tree induction
24
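A minimal sketch of step-wise forward selection using scikit-learn's SequentialFeatureSelector (the dataset, the classifier and the target of 5 features are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Forward selection: start with no attributes and greedily add, one at a time,
# the attribute that improves cross-validated accuracy the most.
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
selector = SequentialFeatureSelector(
    estimator,
    n_features_to_select=5,
    direction="forward",   # "backward" gives step-wise backward elimination
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```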
Example
(Figure: example of attribute selection over the attributes gender, height and weight)
25
Principal Component Analysis
27
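A minimal sketch of PCA with scikit-learn: the attributes are standardised and then projected onto two composite dimensions (the dataset and the number of components are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to attribute scales

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)             # new composite dimensions

print("shape before / after:", X.shape, X_2d.shape)
print("variance explained by the 2 components:", pca.explained_variance_ratio_)
```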
Data preprocessing
28
Attribute types
30
Data preprocessing
31
Discretize
– Histogram analysis
– Clustering analysis
– Entropy-based discretization
– Natural grouping
32
Entropy-Based Discretization
• For a set S of instances, split into ranges S1 and S2 at boundary T, the entropy after the split is
E(S, T) = |S1|/|S| · Ent(S1) + |S2|/|S| · Ent(S2)
• We select the boundary T that minimizes the entropy after the split
• We then split along another dimension, or at a different split point, until a stopping criterion is met (e.g. a maximum number of intervals or a minimum information gain)
(a single split step is sketched in code below)
33
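A minimal sketch of a single split step: every boundary between consecutive values is tried and the one that minimizes the post-split entropy is kept (pure Python; the toy values and labels are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the boundary T minimizing |S1|/|S|*Ent(S1) + |S2|/|S|*Ent(S2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2      # candidate boundary
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(values, labels))   # splits around 6.5 with entropy 0.0
```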
Value hierarchies
(Example hierarchy: "city" has about 2,000 distinct values, "street" about 700,000; lower levels of the hierarchy carry many more distinct values)
34
Data preprocessing
35
Similarity and dissimilarity
• Similarity
– A measure of how much two instances resemble each other
– Higher resemblance 🡺 higher similarity
– Values usually range in [0,1]
• Dissimilarity
– A measure of how much two instances differ from each other
– Higher resemblance 🡺 lower dissimilarity
– The lower limit is usually 0; the upper limit varies
• Proximity
– A general term that refers to either the similarity or the dissimilarity between two instances
36
Instance similarity
37
Distance
(Figure: two points; what is their distance?)
Euclidean distance: d(x, y) = sqrt( Σ_k (x_k – y_k)² )
40
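A minimal sketch of the Euclidean distance between two instances (NumPy; the vectors are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# d(x, y) = sqrt( sum_k (x_k - y_k)^2 )
d = np.sqrt(np.sum((x - y) ** 2))
print(d)                        # 5.0
print(np.linalg.norm(x - y))    # same result via the L2 norm
```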
Other distance norms
42
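A minimal sketch, assuming the norms in question are the Minkowski family (Manhattan L1, Euclidean L2 and the supremum/Chebyshev distance as the limiting case):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

def minkowski(x, y, p):
    """L_p distance: ( sum_k |x_k - y_k|^p )^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(minkowski(x, y, 1))        # Manhattan (city-block) distance: 7.0
print(minkowski(x, y, 2))        # Euclidean distance: 5.0
print(np.max(np.abs(x - y)))     # supremum (L-infinity) distance: 4.0
```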
Example
43
Scaling and weights
• Attributes measured on different scales can be weighted, e.g. d(x, y) = sqrt( Σ_k (x_k – y_k)² / σ_k² )
• where σ_k is the standard deviation of attribute k
44
Attribute relation
45
Mahalanobis distance
d(x, y) = sqrt( (x – y)ᵀ Σ⁻¹ (x – y) ), where Σ is the covariance matrix of the data
46
Example
47
Variations
⚫ When
◦ the covariance matrix is diagonal and isotropic
◦ all dimensions have the same variance (and are uncorrelated)
◦ Mahalanobis becomes Euclidean
⚫ When
◦ the covariance matrix is diagonal and non-isotropic
◦ dimensions have different variances (but are uncorrelated)
◦ Mahalanobis becomes a weighted Euclidean distance
(both cases are illustrated in the sketch below)
48
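A minimal sketch of the Mahalanobis distance and of the two diagonal special cases above (NumPy; the synthetic data and the two query points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0, 0], cov=[[4.0, 1.5], [1.5, 1.0]], size=500)

x = np.array([1.0, 0.5])
y = np.array([-1.0, -0.5])
diff = x - y

cov = np.cov(data, rowvar=False)              # estimated covariance matrix Σ
mahalanobis = np.sqrt(diff @ np.linalg.inv(cov) @ diff)
print("Mahalanobis:", mahalanobis)

# Diagonal, non-isotropic Σ -> Euclidean distance weighted by 1/σ_k²
weights = 1.0 / np.diag(cov)
print("weighted Euclidean:", np.sqrt(np.sum(weights * diff ** 2)))

# Diagonal, isotropic Σ (Σ = σ²·I) -> plain Euclidean distance (up to the 1/σ factor)
print("Euclidean:", np.linalg.norm(diff))
```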
For binary vectors
• A = 1000000000
• B = 0000001001
49
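A minimal sketch for the two vectors above, assuming the intended comparison is between the simple matching coefficient (SMC) and the Jaccard coefficient:

```python
# Assumption: the slide contrasts SMC and Jaccard on these two binary vectors
A = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
B = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

m11 = sum(a == b == 1 for a, b in zip(A, B))   # positions where both are 1
m00 = sum(a == b == 0 for a, b in zip(A, B))   # positions where both are 0
mismatch = sum(a != b for a, b in zip(A, B))   # m01 + m10

smc = (m11 + m00) / len(A)                     # matches over all attributes
jaccard = m11 / (m11 + mismatch) if (m11 + mismatch) else 0.0

print(smc)       # 0.7 -- high, dominated by the shared zeros
print(jaccard)   # 0.0 -- ignores the 0/0 matches
```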
Other distance/similarity measures
50
Cosine similarity
51
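A minimal sketch of cosine similarity, cos(x, y) = (x · y) / (‖x‖·‖y‖), on two illustrative term-frequency vectors:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cosine)   # ≈ 0.31
```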
Combined similarity
52