6 Data Preprocessing

This document discusses various techniques for preparing and cleaning data prior to conducting data mining analysis. It covers topics like handling missing data, noisy data, inconsistent data, and data integration and transformation. Specific techniques discussed include data cleaning, binning, smoothing, clustering, and dimensionality reduction methods like feature selection. Pre-processing is estimated to take 70-80% of the time and effort for a data mining project. Clean, consistent, and well-structured data is essential for obtaining quality results from data mining and machine learning algorithms.


Data Mining

DATA PRE-PROCESSING - 1
Recap
• Instance Based Learning
o Is it supervised or unsupervised learning?
o How does it work? What are the criteria for choosing K?
o What does Euclidean distance measure? How does it differ from the cosine
similarity measure?
o How does Euclidean distance handle
 Nominal attributes
 Numeric attributes
o What are the positives and negatives of instance based learning?
o How to alleviate some of the problems of instance based learning?
Recap
• Support Vector Machines (SVM)
o Is it supervised or unsupervised learning?
o How does it work?
o How does SVM handle
 Noisy data
o What is soft margin classification?
o What are kernel functions?
o What are the positives and negatives of SVM?
Knowledge Discovery Flow
• Data preparation/pre-processing is estimated to take 70-80% of the time
and effort of a data mining project
Data Pre-processing
• Data in the real world is dirty
o Incomplete
 Missing attribute values, missing certain attributes of interest, containing only
aggregate data
o Noisy
 Filled with errors or outliers
o Inconsistent
 Containing discrepancies in codes, names or values

• No quality data → no quality mining results


Data Quality Measures
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
Major Tasks in Data Pre-processing
• Data cleaning
o Fill in missing values, smooth noisy data, identify or remove outliers and resolve
inconsistencies

• Data integration
o Integration of multiple databases, data cubes or files

• Data transformation
o Normalization and aggregation
Major Tasks in Data Pre-processing
• Data reduction
o Obtains reduced representation in volume but produces the same or similar
analytical results
o Data discretization for numerical data
o Dimensionality reduction
o Data compression
o Generalization
Data Cleaning
• Fill in missing values
• Smooth noisy data
• Identify or remove outliers
• Resolve inconsistencies
• Remove duplicate records
Data Cleaning: Example
• Original data (fixed column format)
000000000130.06.19971979-10-3080145722 #000310 111000301.01.000100000000004
0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000
000000000000. 000000000000000.000000000000000.0000000...…
000000000000000.000000000000000.000000000000000.000000000000000.00000000000000
0.000000000000000.000000000000000.000000000000000.000000000000000.000000000000
000.000000000000000.000000000000000.000000000000000.000000000000000.0000000000
00000.000000000000000.000000000000000.000000000000000.000000000000000.00000000
0000000.000000000000000.000000000000000.00 0000000000300.00 0000000000300.00

• Clean data
0000000001,199706,1979.833,8014,5722 , ,#000310 …. ,
111,03,000101,0,04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0300,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,0300.00

Missing Data
• May be due to
o Equipment malfunction
o Inconsistent with other recorded data and thus deleted
o Data not entered due to misunderstanding
o Certain data may not be considered important at the time of entry
o Not registered in history or changes of the data
How to handle Missing Data?
• Ignore the instance containing missing information
o Especially when the class label is missing
o Bad! If there are many instances with missing values

• Fill in the missing value manually → tedious and infeasible!


• Use a constant (like "?" or "unknown") to fill in the missing value
• Infer the missing value
o Use the attribute mean to fill in every missing value for that attribute (see the sketch below)
o Use a Bayesian formula or a decision tree to infer the most probable value and
fill it in
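A minimal sketch of mean imputation, assuming pandas is available; the column names and values are made up for illustration.

import pandas as pd

df = pd.DataFrame({"age": [25, None, 47, 31],
                   "income": [50_000, 62_000, None, 58_000]})

# Mean imputation: every missing value in a column is replaced by that column's mean
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)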
Expectation Maximization (EM)
• Build model of the data (ignoring missing values)
• Use the model to estimate missing values
• Build new models of data values (including the estimated values)
• Use new models to re-estimate missing values
• Repeat until convergence (old model = new model)
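A rough analogue of this EM-style loop, assuming scikit-learn is installed: IterativeImputer also models each attribute from the others and re-estimates the missing values until convergence (not the exact algorithm on the slide, just a sketch in the same spirit).

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[4.0, 2.0],
              [8.0, np.nan],
              [np.nan, 6.0],
              [10.0, 5.0]])

# Fit a model per attribute, estimate the missing values, refit, and repeat until convergence
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)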
Noisy Data
• Incorrect attribute values may be due to
o Faulty data collection instruments
o Data entry problems
o Data transmission problems
o Technology limitation
o Inconsistency in naming conventions
How to Handle Noisy Data?
• Binning method:
o Sort data and partition into (equal-depth) bins
o Smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

• Clustering
o Detect and remove outliers

• Combined computer and human inspection


o Detect suspicious values and check by human

• Regression
o Smooth by fitting the data into regression functions
Binning
• Equal-width (distance) partitioning
o Divide the range into N intervals of equal size
 For N bins, the width of each interval will be width = (max - min) / N
Binning Examples - 1
• Sorted price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Equal-width binning
o Minimum = 4, Maximum = 34,
 If we want to have 2 bins, Width = ?
o Bin 1 (4 – 19)
 4, 8, 9, 15
o Bin 2 (20 – 35)
 21, 21, 24, 25, 26, 28, 29, 34
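A hedged sketch of equal-width binning on the slide's price list, assuming NumPy; it reproduces the two bins above with width = (34 - 4) / 2 = 15.

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 2
width = (prices.max() - prices.min()) / n_bins        # (34 - 4) / 2 = 15

edges = prices.min() + width * np.arange(n_bins + 1)  # [4, 19, 34]
bin_index = np.digitize(prices, edges[1:-1])          # 0 for values below 19, 1 otherwise
for b in range(n_bins):
    print(f"Bin {b + 1}: {prices[bin_index == b]}")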
Binning
• Equal-width (distance) partitioning
o Divide the range into N intervals of equal size
 For N bins, the width of each interval will be width = (max - min) / N
o Straightforward, but outliers may dominate the presentation
 Worst for skewed data!

• Equal-depth (frequency) partitioning


o Divide the range into N intervals so that each interval contains approximately the
same number of samples (the intervals don't need to have the same width)
o Good data scaling
Binning Examples - 2
• Sorted price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Equal-depth binning
o If we want to have 3 bins, Frequency = ?
o Bin 1 (4 – 19)
 4, 8, 9, 15
o Bin 2 (20 – 25)
 21, 21, 24, 25
o Bin 3 (26 – 34)
 26, 28, 29, 34
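A short sketch of equal-depth (equal-frequency) binning with pandas.qcut, assuming pandas; it reproduces the three bins of four prices each shown above.

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = pd.qcut(prices, q=3)                # 3 quantile-based bins, ~4 values per bin
print(prices.groupby(bins, observed=True).apply(list))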
Smoothing
• Partition into (equi-depth) bins:
o Bin 1: 4, 8, 9, 15
o Bin 2: 21, 21, 24, 25
o Bin 3: 26, 28, 29, 34

• Smoothing by bin means: Change all the values in a bin to the mean of the bin
o Bin 1: 9, 9, 9, 9
o Bin 2: 23, 23, 23, 23
o Bin 3: 29, 29, 29, 29

• Smoothing by bin boundaries: Replace each value with the closest bin boundary (the bin minimum or maximum)


o Bin 1: 4, 4, 4, 15
o Bin 2: 21, 21, 25, 25
o Bin 3: 26, 26, 26, 34
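A minimal sketch of both smoothing rules applied to the three equal-depth bins above, using plain Python lists.

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Smoothing by bin means: replace every value with its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace every value with the closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]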
Regression (next week)
• Smooth all the values according to the best line fit
[Figure: data points are smoothed onto the fitted regression line y = x + 1; an observed value Y1 at X1 is replaced by the fitted value Y1'.]
Cluster Analysis
• Remove data that does not belong to any group
How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic
o Detect violation of known functional dependencies and data constraints
 E.g. use dictionary or grammar rules
o Correct redundant data
 Use correlational analysis or similarity measures to detect redundant data
Data Integration
• Combines data from multiple sources into a coherent store
o Integrate metadata from different sources

• Possible problems
o The same attribute may have different names in different data sources, e.g.
CustID and CustomerNo
o One attribute may be a “derived” attribute in another table, e.g. annual revenue
o Different representation and scales, e.g. metric vs. British units, different
currency, different timezone
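A minimal sketch of integrating two illustrative sources whose key attribute has different names (CustID vs. CustomerNo), assuming pandas; the tables and values are made up.

import pandas as pd

sales = pd.DataFrame({"CustID": [1, 2], "annual_revenue": [120_000, 95_000]})
crm = pd.DataFrame({"CustomerNo": [1, 2], "name": ["Acme", "Binco"]})

# Align the attribute names, then merge into one coherent store
crm = crm.rename(columns={"CustomerNo": "CustID"})
integrated = sales.merge(crm, on="CustID", how="inner")
print(integrated)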
Data Transformation
• Aggregation: Summarization, data cube construction
• Normalization: Scale the values to fall within a specified range
o Min-max normalization: v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
o Z-score normalization: v' = (v - mean_A) / stddev_A
o Decimal scaling: v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
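A hedged sketch of the three normalizations, assuming NumPy; the target range [0, 1] for min-max and the sample values are illustrative choices.

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the range [0, 1]
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer such that max(|v'|) < 1
j = 0
while (np.abs(v) / 10 ** j).max() >= 1:
    j += 1
v_decimal = v / 10 ** j

print(v_minmax, v_zscore, v_decimal, sep="\n")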
Ordinal to Numeric
• Sometimes it is better to convert ordinal attributes to numeric ones
o So you can use mathematical comparisons on the fields
o E.g. instead of cold, warm, hot → -5, 25, 33
o Or A → 85, A- → 80, B+ → 75, B → 70 (a minimal sketch follows below)
Data Reduction
• Complex data analysis may take a very long time to run on the
complete data set
o Obtain a reduced representation of the data set that is much smaller in volume
but produces (almost) the same analytical results

• Strategies
o Dimensionality reduction
o Data Compression: compress data using a compression algorithm
o Discretization
o Concept hierarchy generalization
Quantity
• Generally
o 5,000 or more instances are desired
 If fewer, results are less reliable; special methods such as boosting may be needed
o At least 10 instances for each unique attribute value
o At least 100 instances for each class label
 If unbalanced, use stratified sampling
Dimensionality Reduction
• Feature selection
o Select a minimum set of features so that the probability distribution of different
classes given the values for those features is as close as possible to the original
distribution given the values of all features
o Reduce the number of attributes in the discovered patterns
 Makes the patterns easier to understand
o Ways to select attributes include
 Decision Tree induction (information gain and gain ratio)
 Principal Component Analysis (in two weeks' time)
o Generally, keep the top 50 attributes
Feature Selection - Example
• Use a Decision Tree to filter attributes
o Initial attribute set: A1, A2, A3, A4, A5, A6
o The tree produced tests only A4 (at the root) and A1, A6 (at internal nodes), with Class 1 / Class 2 leaves
o The reduced attribute set: A1, A4, A6

Feature Selection Approach
• Wrapper approach
o Try all possible combinations of feature subsets
 Train on train set, evaluate on a validation set (or use cross-validation)
 Use set of features that performs best on the validation set
o Algorithm dependent

• Proxy methods
o Determine which features are important or not without knowing/using which
learning algorithm will be employed (see the sketch below)
 Information gain, Gain ratio, Cosine similarity, etc.
o Algorithm independent & fast, but may not be suitable for all algorithms
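A hedged sketch of a proxy (algorithm-independent) selection step using a mutual-information score, assuming scikit-learn; the Iris data and k=2 are illustrative choices, not from the slides.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score every feature against the class label, then keep the top 2
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)   # per-feature relevance scores
print(X_reduced.shape)    # (150, 2) after selection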
Pros and Cons of Feature Selection
• Advantages
o Improved accuracy
o Less complex model: Run faster & Easier to understand, verify and explain
o Don’t need to collect/process features not used in models

• Disadvantages
o Prone to over-fitting
o Can be expensive to run multiple times to find the best set of features
o May throw away features domain experts want in model
o May remove important redundant features
Sampling
• Choose a representative subset of the data
o Simple random sampling may have very poor performance in the presence of
skew

• Develop adaptive (stratified) sampling methods


o Approximate the percentage of each class
o Sample the data so that the class distribution stays the same after sampling

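A minimal sketch of stratified sampling with scikit-learn, assuming the Iris data; the 20% sample size is an illustrative choice.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the sample's class distribution the same as in the full data
_, X_sample, _, y_sample = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
print(X_sample.shape)   # (30, 4), with the original class proportions preserved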
Discretization/Quantization
• Divide the range of a continuous attribute into intervals
o Interval labels can then be used to replace actual data values

• Techniques
o Binning methods
o Use information gain/gain ratio to find the best splitting points
o Clustering analysis
Concept Hierarchy
• Replace low level concepts by higher level concepts
o E.g. Age: 15, 65, 3 → Age: teen, senior, child (other labels such as middle-aged are possible)
o Instead of street, use city or state or country for the geographical location
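A hedged sketch of replacing raw ages with higher-level concepts using pandas.cut; the age boundaries are illustrative assumptions, not taken from the slide.

import pandas as pd

ages = pd.Series([15, 65, 3, 42])
age_concepts = pd.cut(ages,
                      bins=[0, 12, 19, 59, 120],
                      labels=["child", "teen", "middle-aged", "senior"])
print(age_concepts.tolist())   # ['teen', 'senior', 'child', 'middle-aged']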
WEKA Feature Selection
• AddExpression (MathExpression)
o Apply a math expression to an existing attribute to create/modify one

• Center/Normalize/Standardize
o Transform numeric attributes (to zero mean, to a fixed range, or to zero mean and unit variance, respectively)

• Discretize (as well as Supervised Discretization)


o Convert numeric to nominal values

• PrincipalComponents
o Perform a principal component analysis/transformation of the data

• RemoveUseless
o Remove attributes that do not vary at all, or vary too much

• TimeSeriesDelta, TimeSeriesTranslate
o Replace attribute values with the value of (Translate) or the difference from (Delta) an adjacent instance in the series
Summary
• Data preparation is a big issue for data mining
• It includes
o Data cleaning
o Data integration
o Data reduction
o Data transformation

• Pre-processing choices (e.g. feature selection) can be prone to over-fitting
• Remember correlation does not imply causation
o Data mining reveals correlation
