6 Data Preprocessing
Recap
• Instance-Based Learning
o Is it supervised or unsupervised learning?
o How does it work? What are the criteria for choosing K?
o What does Euclidean distance measure? How does it differ from the cosine
similarity measure?
o How does Euclidean distance handle
Nominal attributes
Numeric attributes
o What are the positives and negatives of instance-based learning?
o How can some of the problems of instance-based learning be alleviated?
Recap
• Support Vector Machines
o Is it supervised or unsupervised learning?
o How does it work?
o How does SVM handle
Noisy data
o What is soft margin classification?
o What are kernel functions?
o What are the positives and negatives of SVM?
Knowledge Discovery Flow
• Data preparation/pre-processing is estimated to take 70-80% of the total
time and effort
Data Pre-processing
• Data in the real world is dirty
o Incomplete
Missing attribute values, missing certain attributes of interest, containing only
aggregate data
o Noisy
Filled with errors or outliers
o Inconsistent
Containing discrepancies in codes, names or values
Major Tasks in Data Pre-processing
• Data integration
o Integration of multiple databases, data cubes or files
• Data transformation
o Normalization and aggregation
• Data reduction
o Obtains a reduced representation that is smaller in volume but produces the
same or similar analytical results
o Data discretization for numerical data
o Dimensionality reduction
o Data compression
o Generalization
Data Cleaning
• Fill in missing values
• Smooth noisy data
• Identify or remove outliers
• Resolve inconsistencies
• Remove duplicate records
Data Cleaning: Example
• Original data (fixed column format)
000000000130.06.19971979-10-3080145722 #000310 111000301.01.000100000000004
0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000
000000000000. 000000000000000.000000000000000.0000000...…
000000000000000.000000000000000.000000000000000.000000000000000.00000000000000
0.000000000000000.000000000000000.000000000000000.000000000000000.000000000000
000.000000000000000.000000000000000.000000000000000.000000000000000.0000000000
00000.000000000000000.000000000000000.000000000000000.000000000000000.00000000
0000000.000000000000000.000000000000000.00 0000000000300.00 0000000000300.00
• Clean data
0000000001,199706,1979.833,8014,5722 , ,#000310 …. ,
111,03,000101,0,04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0300,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,0300.00
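A minimal pandas sketch of this kind of fixed-column clean-up. The column positions and field names below are hypothetical stand-ins; the real record layout of the file above is not given in the lecture:

```python
import pandas as pd

# Hypothetical byte ranges and field names -- the true layout is unknown
colspecs = [(0, 10), (10, 18), (18, 28)]
names = ["record_id", "date", "amount"]

# Parse the fixed-width dump and write it out as comma-separated values
df = pd.read_fwf("raw_records.txt", colspecs=colspecs, names=names)
df.to_csv("clean_records.csv", index=False)
```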
Missing Data
• May be due to
o Equipment malfunction
o Inconsistent with other recorded data and thus deleted
o Data not entered due to misunderstanding
o Certain data may not be considered important at the time of entry
o Changes or history of the data not recorded
How to handle Missing Data?
• Ignore the instance containing missing information
o Especially when the class label is missing
o Bad if there are many instances with missing values!
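A minimal pandas sketch of the "ignore the instance" strategy, on a made-up table (column names are illustrative). Only rows missing the class label are dropped, since dropping every row with any missing value is wasteful when many rows are affected:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 31],
    "income": [50_000, 62_000, np.nan, 58_000],
    "label":  ["yes", "no", None, "yes"],
})

# Ignore instances whose class label is missing
df = df.dropna(subset=["label"])
print(df)
```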
How to Handle Noisy Data?
• Binning
o Smooth sorted values by consulting their neighbours (see the next slides)
• Clustering
o Detect and remove outliers
• Regression
o Smooth by fitting the data to regression functions
Binning
• Equal-width (distance) partitioning
o Divide the range into N intervals of equal size
For N bins, the width of each interval will be W = (max − min) / N
Binning Examples - 1
• Sorted price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Equal-width binning
o Minimum = 4, Maximum = 34
If we want 2 bins, Width = (34 − 4) / 2 = 15
o Bin 1 (4 – 19)
4, 8, 9, 15
o Bin 2 (20 – 35)
21, 21, 24, 25, 26, 28, 29, 34
Binning
• Equal-width (distance) partitioning
o Divide the range into N intervals of equal size
For N bins, the width of each interval will be W = (max − min) / N
o Straightforward, but outliers may dominate the presentation
Worst for skewed data!
• Smoothing by bin means: Change all the values in a bin to the mean of the bin
o Using equal-frequency (depth) bins of the sorted prices:
o Bin 1 (4, 8, 9, 15): 9, 9, 9, 9
o Bin 2 (21, 21, 24, 25): 23, 23, 23, 23
o Bin 3 (26, 28, 29, 34): 29, 29, 29, 29
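A NumPy sketch of both partitionings on the price list above. The split into three equal-frequency bins of four values is taken from the bin contents on the slide:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width binning: width = (max - min) / N
n_bins = 2
width = (prices.max() - prices.min()) / n_bins        # (34 - 4) / 2 = 15
edges = prices.min() + width * np.arange(n_bins + 1)  # [4, 19, 34]
bin_index = np.clip(np.digitize(prices, edges) - 1, 0, n_bins - 1)
print(bin_index)   # first 4 prices land in bin 0, the rest in bin 1

# Equal-frequency binning into 3 bins of 4 values each,
# then smoothing by bin means (rounded, as on the slide)
for b in np.split(prices, 3):
    print(np.round(b.mean()) * np.ones_like(b))       # 9s, 23s, 29s
```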
(Figure: regression fits a line y = f(x) to the data; a noisy value Y1 at X1 is smoothed to Y1′ on the fitted line)
Cluster Analysis
• Remove data that does not belong to any group
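The slide does not name a particular algorithm; one way to realize this idea is scikit-learn's DBSCAN, which labels points belonging to no cluster as -1. A sketch on made-up data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),  # one dense group
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),  # another dense group
    [[20.0, 20.0]],                                # an obvious outlier
])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)
cleaned = points[labels != -1]   # keep only points that belong to a group
print(f"removed {len(points) - len(cleaned)} outlier(s)")
```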
How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic
o Detect violation of known functional dependencies and data constraints
E.g. use dictionary or grammar rules
o Correct redundant data
Use correlation analysis or similarity measures to detect redundant data
Data Integration
• Combines data from multiple sources into a coherent store
o Integrate metadata from different sources
• Possible problems
o The same attribute may have different names in different data sources, e.g.
CustID and CustomerNo
o One attribute may be a “derived” attribute in another table, e.g. annual revenue
o Different representations and scales, e.g. metric vs. British units, different
currencies, different time zones
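A pandas sketch of resolving the two problems named above: joining on differently named keys (CustID vs. CustomerNo, as on the slide) and converting to a common unit. The tables and the exchange rate are made-up placeholders:

```python
import pandas as pd

crm   = pd.DataFrame({"CustID":     [1, 2], "name": ["Ana", "Bo"]})
sales = pd.DataFrame({"CustomerNo": [1, 2], "revenue_gbp": [100.0, 250.0]})

# Same attribute under different names: join CustID to CustomerNo
merged = crm.merge(sales, left_on="CustID", right_on="CustomerNo")

# Different currencies: convert to one unit before analysis
# (the rate below is a made-up placeholder)
merged["revenue_usd"] = merged["revenue_gbp"] * 1.25
print(merged.drop(columns="CustomerNo"))
```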
Data Transformation
• Aggregation: Summarization, Data cube construction
• Normalization: Scale the value to fall within a specified range
o Min-Max normalization: v' = (v − min) / (max − min) × (new_max − new_min) + new_min
o Z-score normalization: v' = (v − mean) / standard_deviation
o Decimal scaling: v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1
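A NumPy sketch of all three normalizations on a small made-up vector:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [0, 1] (new_min = 0, new_max = 1)
minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer
# such that max(|v'|) < 1  (here j = 4, so 1000 -> 0.1)
j = 0
while (np.abs(v) / 10**j).max() >= 1:
    j += 1
decimal = v / 10**j
print(minmax, zscore, decimal, sep="\n")
```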
Ordinal to Numeric
• Sometimes it is better to convert ordinal attributes to numeric ones
o So you can use mathematical comparisons on the fields
o E.g. cold, warm, hot → -5, 25, 33
o Or grades: A → 85, A- → 80, B+ → 75, B → 70
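A pandas sketch using the slide's mappings:

```python
import pandas as pd

df = pd.DataFrame({"temp":  ["cold", "warm", "hot"],
                   "grade": ["A", "B+", "A-"]})

# Explicit ordinal -> numeric mappings (values mirror the slide)
temp_map  = {"cold": -5, "warm": 25, "hot": 33}
grade_map = {"A": 85, "A-": 80, "B+": 75, "B": 70}

df["temp_num"]  = df["temp"].map(temp_map)
df["grade_num"] = df["grade"].map(grade_map)
print(df)
```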
Data Reduction
• Complex data analysis may take a very long time to run on the
complete data set
o Obtain a reduced representation of the data set that is much smaller in volume
but produces (almost) the same analytical results
• Strategies
o Dimensionality reduction
o Data Compression: compress data using a compression algorithm
o Discretization
o Concept hierarchy generalization
Quantity
• Generally
o 5,000 or more instances are desired
If fewer, results are less reliable; special methods such as boosting may be needed
o At least 10 instances for each unique attribute value
o At least 100 instances for each class label
If unbalanced, use stratified sampling
Dimensionality Reduction
• Feature selection
o Select a minimum set of features so that the probability distribution of different
classes given the values for those features is as close as possible to the original
distribution given the values of all features
o Reduce the number of attributes in the discovered patterns
Makes the patterns easier to understand
o Ways to select attributes include
Decision Tree induction (information gain and gain ratio)
Principal Component Analysis (covered in two weeks' time)
o Generally, keep top 50 attributes
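As a sketch of the information-gain route above, scikit-learn's mutual information score can rank features against the class; the iris data below is just a stand-in, and keeping the top 2 of 4 features mirrors the "keep the top k" idea:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Information gain ~ mutual information between a feature and the class;
# rank the features and keep the top k
scores = mutual_info_classif(X, y, random_state=0)
top_k = np.argsort(scores)[::-1][:2]
X_reduced = X[:, top_k]
print("kept feature indices:", top_k)
```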
Feature Selection - Example
• Use Decision Tree to filter attributes
o Initial attribute set: A1, A2, A3, A4, A5, A6
o The tree produced (root tests A4, with subtrees testing A1 and A6)
o The reduced attribute set: A1, A4, A6
• Proxy methods
o Determine which features are important without knowing/using the
learning algorithm that will be employed
Information gain, Gain ratio, Cosine similarity, etc.
o Algorithm-independent and fast, but may not suit all algorithms
Pros and Cons of Feature Selection
• Advantages
o Improved accuracy
o Less complex model: Run faster & Easier to understand, verify and explain
o Don’t need to collect/process features not used in models
• Disadvantages
o Prone to over-fitting
o Can be expensive to run multiple times to find the best set of features
o May throw away features domain experts want in model
o May remove important redundant features
Sampling
• Choose a representative subset of the data
o Simple random sampling may perform very poorly in the presence of
skewed class distributions
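A pandas sketch of stratified sampling on a skewed label column (data made up for illustration): sampling within each class keeps the class proportions, whereas simple random sampling could miss the rare class entirely:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10),
                   "label": ["a"] * 8 + ["b"] * 2})   # skewed classes

# Sample 50% within each class to preserve the 8:2 proportion
stratified = df.groupby("label", group_keys=False).sample(frac=0.5,
                                                          random_state=0)
print(stratified["label"].value_counts())
```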
Discretization/Quantization
• Divide the range of a continuous attribute into intervals
o Interval labels can then be used to replace actual data values
• Techniques
o Binning methods
o Use information gain/gain ratio to find the best splitting points
o Clustering analysis
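A sketch of the information-gain technique from the list above: try each midpoint between sorted values as a candidate cut and keep the one with the highest gain. The values and labels are made up so that the best split is obvious:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    """Return (cut_point, information_gain) of the best binary split."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue                    # no cut between equal values
        cut = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (cut, gain)
    return best

vals = np.array([4, 8, 9, 15, 21, 24, 25, 29])
labs = np.array(["low"] * 4 + ["high"] * 4)
print(best_split(vals, labs))   # splits cleanly at 18.0 with gain 1.0
```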
Concept Hierarchy
• Replace low level concepts by higher level concepts
o E.g. replace Age values such as 15, 65, 3 with labels such as teen, senior, child, middle-aged, etc.
o Instead of street, use city, state or country for the geographical location
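A pandas sketch of climbing the age hierarchy; the cut-off ages below are illustrative assumptions, not values from the lecture:

```python
import pandas as pd

ages = pd.Series([15, 65, 3, 42])

# Raw ages -> age-group concept labels (hypothetical boundaries)
groups = pd.cut(ages,
                bins=[0, 12, 19, 54, 120],
                labels=["child", "teen", "middle-aged", "senior"])
print(groups.tolist())   # ['teen', 'senior', 'child', 'middle-aged']
```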
WEKA Feature Selection
• AddExpression (MathExpression)
o Apply a math expression to an existing attribute to create/modify one
• Center/Normalize/Standardize
o Center shifts numeric attributes to zero mean; Normalize scales them into a
given range; Standardize transforms them to zero mean and unit variance
• PrincipalComponents
o Perform a principal component analysis/transformation of the data
• RemoveUseless
o Remove attributes that do not vary at all, or vary too much
• TimeSeriesDelta, TimeSeriesTranslate
o Replace attribute values with those of an earlier/later instance (Translate) or with the difference between successive instances (Delta)
Summary
• Data preparation is a big issue for data mining
• It includes
o Data cleaning
o Data integration
o Data reduction
o Data transformation
• Some pre-processing steps (e.g. feature selection) are prone to over-fitting
• Remember correlation does not imply causation
o Data mining reveals correlation