Preprocessing
WHAT IS DATA?
● A collection of data objects and their attributes
● An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
● A collection of attributes describes an object
(Figure: table of data objects as rows and their attributes as columns)
For example:
Hair colour: black, brown, red
Opinion: agree, disagree, neutral
Quantitative Data
• Quantitative data is information gathered from a group of individuals that can be analysed statistically. Numerical data is another name for quantitative data.
• Simply put, it describes the quantities of items in the data, and those quantities can be measured and expressed in terms of numbers.
For example:
We can measure height (1.70 meters) or distance (1.35 miles) with the help of a ruler or tape.
– Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
DISCRETE AND CONTINUOUS ATTRIBUTES
● Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
● Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a
finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
– Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
Examples: temperature in Kelvin, length, time, counts
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
– Broad categories:
• intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies (see the sketch after this list)
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization (scaling to a specific range)
– Aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or similar
analytical results
– Data discretization: with particular importance, especially for numerical data
– Data aggregation, dimensionality reduction, data compression, generalization
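To make the data-cleaning task above concrete, here is a minimal Python sketch (using pandas on a toy column) that fills missing values with the attribute mean and flags values far from the mean as possible outliers. The column name and the 3-standard-deviation threshold are illustrative assumptions, not part of the slides.

import pandas as pd

# Toy attribute with a missing value and a suspicious value (illustrative data).
df = pd.DataFrame({"age": [23, 25, None, 24, 180]})

# Fill in the missing value with the attribute mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Flag values more than 3 standard deviations from the mean as possible outliers.
z = (df["age"] - df["age"].mean()) / df["age"].std()
df["age_outlier"] = z.abs() > 3
print(df)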
(Figure: forms of data preprocessing)
Data Cleaning
(Figure: cluster analysis used to detect outliers)
(Figure: deviation/anomaly detection of values over days)
Regression
(Figure: point (X1, Y1) and its smoothed value Y1' on the fitted line y = x + 1)
• Linear regression (best line to fit two variables)
• Multiple linear regression (more than two variables, fit to a multidimensional surface)
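A minimal sketch of smoothing by linear regression in Python with NumPy: a best-fit line is computed by least squares and each noisy value is replaced by its value on the line. The sample points roughly follow the y = x + 1 line from the figure; the numbers themselves are made up for illustration.

import numpy as np

# Noisy observations roughly following y = x + 1 (illustrative data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Fit the best straight line (degree-1 polynomial) by least squares.
a, b = np.polyfit(x, y, deg=1)

# Replace each noisy value with its value on the fitted line.
y_smoothed = a * x + b
print(a, b, y_smoothed)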
How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic using various tools
– To detect violation of known functional
dependencies and data constraints
– To correct redundant data
Data Integration
• Data integration:
– combines data from multiple sources into a coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different
sources are different
– possible reasons: different representations, different scales,
e.g., metric vs. British units, different currency
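A small, hypothetical sketch of schema integration and value-conflict resolution in pandas: two sources name the customer key differently and store height in different units, so one column is renamed and inches are converted to centimetres before merging. All column names and values are illustrative assumptions.

import pandas as pd

# Two sources describing the same customers (hypothetical column names and values).
a = pd.DataFrame({"cust_id": [1, 2], "height_cm": [170.0, 182.0]})
b = pd.DataFrame({"cust_num": [1, 2], "height_in": [66.9, 71.7]})

# Entity identification: cust_id in A corresponds to cust_num in B.
b = b.rename(columns={"cust_num": "cust_id"})

# Resolve the value conflict: convert British units (inches) to metric (cm).
b["height_cm_from_b"] = b["height_in"] * 2.54

merged = a.merge(b[["cust_id", "height_cm_from_b"]], on="cust_id")
print(merged)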
Handling Redundant Data in Data Integration
• Redundant data occur often when integrating multiple DBs
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
• Redundant data may be able to be detected by correlational
analysis
r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B}
• Careful integration can help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
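A quick illustration of correlational analysis for redundancy detection: NumPy's corrcoef computes the Pearson coefficient r_{A,B}, and values close to +1 or -1 suggest that one attribute is largely redundant given the other. The attribute values below are made up.

import numpy as np

# Two attributes from different sources (illustrative values).
annual_revenue = np.array([100.0, 150.0, 200.0, 250.0])
monthly_revenue = np.array([8.3, 12.6, 16.5, 20.9])

# Pearson correlation coefficient r_{A,B}; a value near +/-1 indicates redundancy.
r = np.corrcoef(annual_revenue, monthly_revenue)[0, 1]
print(r)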
Data Transformation
• Smoothing: remove noise from data (binning,
clustering, regression)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small,
specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
Data Transformation: Normalization
Particularly useful for classification (neural networks, distance measures, nearest-neighbour classification, etc.)
• min-max normalization
v' = \frac{v - \min_A}{\max_A - \min_A} (new\_max_A - new\_min_A) + new\_min_A
• z-score normalization
v' = \frac{v - mean_A}{stand\_dev_A}
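The normalization methods listed above, sketched in NumPy on an illustrative value vector; the new range [0, 1] for min-max and the example values are assumptions.

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [0, 1].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation.
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, the smallest power of 10 making all |v'| < 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / (10 ** j)

print(v_minmax, v_zscore, v_decimal)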
• Solution?
– Data reduction…
Data Reduction
• Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
• Data reduction strategies
– Dimensionality reduction
– Data compression
Dimensionality Reduction
• Problem: Feature selection (i.e., attribute subset selection):
– Select a minimum set of features such that the probability
distribution of different classes given the values for those features
is as close as possible to the original distribution given the values
of all features
– Nice side-effect: reduces # of attributes in the discovered patterns
(which are now easier to understand)
• Solution: Heuristic methods (due to exponential # of
choices) usually greedy:
– step-wise forward selection (sketched below)
– step-wise backward elimination
– combining forward selection and backward elimination
– decision-tree induction
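A minimal sketch of greedy step-wise forward selection: starting from the empty feature set, the single feature that most improves a score is added at each step. The scoring function here (correlation of a naive prediction with the class label) is a stand-in assumption; in practice one would use something like cross-validated classifier accuracy.

import numpy as np

def forward_selection(X, y, score, k):
    # Greedy step-wise forward selection: start from the empty set and
    # repeatedly add the single feature that improves the score the most.
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best_f, best_s = None, -np.inf
        for f in remaining:
            s = score(X[:, selected + [f]], y)
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

def score(X_sub, y):
    # Stand-in scorer: correlation of a naive prediction (row mean of the
    # selected features) with the class label.
    pred = X_sub.mean(axis=1)
    return abs(np.corrcoef(pred, y)[0, 1])

rng = np.random.default_rng(0)
X = rng.random((50, 6))
y = (X[:, 3] > 0.5).astype(float)   # class depends mostly on feature 3
print(forward_selection(X, y, score, k=2))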
Example of Decision Tree Induction
nonleaf nodes: tests
branches: outcomes of tests
leaf nodes: class prediction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
(Figure: decision tree whose tests use only A4, A1, and A6, giving a reduced attribute set)
(Figure: original data and its approximated, compressed representation)
Principal Component Analysis (PCA)
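A compact PCA sketch in NumPy: the data is centred, the covariance matrix is eigen-decomposed, and the data is projected onto the top principal component, giving a reduced representation from which an approximation of the original data can be rebuilt. The toy 2-D points are illustrative.

import numpy as np

# Toy 2-D data (illustrative); PCA projects it onto the directions of
# greatest variance so fewer components can approximate the original data.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

Xc = X - X.mean(axis=0)                    # center the data
cov = np.cov(Xc, rowvar=False)             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigen-decomposition (ascending order)
order = np.argsort(eigvals)[::-1]          # sort components by variance

top1 = eigvecs[:, order[:1]]               # keep the first principal component
X_reduced = Xc @ top1                      # reduced (1-D) representation
X_approx = X_reduced @ top1.T + X.mean(axis=0)   # approximated original data
print(X_reduced.ravel())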
Sampling
• SRSWOR: simple random sample without replacement
• SRSWR: simple random sample with replacement
(Figure: raw data sampled by SRSWOR and SRSWR)
(Figure: raw data vs. cluster/stratified sample)
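The sampling schemes above, sketched with NumPy's random generator; the data, strata, and sample sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)          # raw data (illustrative)
labels = data % 2              # strata for stratified sampling (illustrative)

# SRSWOR: simple random sample without replacement.
srswor = rng.choice(data, size=10, replace=False)

# SRSWR: simple random sample with replacement.
srswr = rng.choice(data, size=10, replace=True)

# Stratified sample: draw the same number of items from each stratum.
stratified = np.concatenate([
    rng.choice(data[labels == s], size=5, replace=False) for s in np.unique(labels)
])
print(srswor, srswr, stratified)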
SAMPLE SIZE
Splitting Datasets
• To use a dataset in Machine Learning, the dataset is first split into a training set and a test set.
(Figure: DATA is split into Training Data, used to train and produce the Model, and Test Data, used to test the Model and determine its Accuracy)
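A minimal train/test split sketch: the dataset is shuffled and a fixed fraction is held out as the test set. The 80/20 ratio and the helper's name are assumptions for illustration.

import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=0):
    # Shuffle the dataset, then hold out test_ratio of it as the test set.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, y_train, X_test, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))   # 8 training samples, 2 test samples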
Serial Splitting of the Dataset
• Simplest method of splitting data is to split it serially.
(Figure: the training data is 90% Apple and 10% Pear; the test sample is a Pear, and the model will most likely predict the pear as an apple)
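A small sketch of why a purely serial split can go wrong when the data is ordered by class: the training data ends up with few or no examples of one class, so the model tends to predict the majority class (the apple/pear situation above). The 90/10 proportions come from the slide; the serial layout of the data is an assumption.

import numpy as np

# 90 "apple" items followed by 10 "pear" items, stored serially (assumed layout).
labels = np.array(["apple"] * 90 + ["pear"] * 10)

# Serial split: the last 10% becomes the test set, so the training data
# contains no pears at all and the model cannot learn to predict them.
train_serial, test_serial = labels[:90], labels[90:]
print(np.unique(train_serial))    # only 'apple'

# Shuffling before splitting keeps both classes in the training data.
rng = np.random.default_rng(0)
shuffled = rng.permutation(labels)
train_shuf, test_shuf = shuffled[:90], shuffled[90:]
print(np.unique(train_shuf))      # 'apple' and 'pear' (with high probability)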
Cross Validation
(Figure: the dataset is split into Training Data and Validation Data by randomly picking the split; the training data is used to train the model, and the process is repeated until the accuracy converges to the desired level)
• Steps:
1. Partition the dataset into k equal-sized partitions.
2. Select one partition as the validation data.
3. Use the remaining k-1 partitions as the training data.
4. Train the model and determine its accuracy on the validation data.
5. Repeat the process k times, selecting a different partition each time for the validation data.
6. Average the accuracy results.
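The k-fold procedure above as a minimal NumPy sketch; the "model" is a trivial majority-class baseline used only so the example runs end to end.

import numpy as np

def k_fold_cross_validation(X, y, k, train_and_score):
    # k-fold cross validation: each partition is used once as validation data.
    folds = np.array_split(np.random.default_rng(0).permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx], X[val_idx], y[val_idx]))
    return np.mean(scores)   # average the accuracy results

def majority_baseline(X_tr, y_tr, X_val, y_val):
    # Illustrative "model": predict the majority class of the training labels.
    pred = np.bincount(y_tr).argmax()
    return np.mean(y_val == pred)

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 0, 1, 0, 1, 0])
print(k_fold_cross_validation(X, y, k=5, train_and_score=majority_baseline))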
k-fold cross validation
• Break the training data into k equal-sized parts (splits).
• Repeat for all splits: train on the other k-1 parts and evaluate on the held-out split.
(Figure: training data divided into split 1, split 2, split 3, …; each split is evaluated in turn, producing score 1, score 2, score 3, …)