Week2_DataPreprocessing
Learning Applications
Roselyne Tchoua
[email protected]
School of Computing, CDM, DePaul University
Understanding Your Data
Data Preprocessing
• Why do we need to prepare the data?
– In real-world applications data can be inconsistent, incomplete,
and/or noisy
• Data entry, data transmission, or data collection problems
• Discrepancy in naming conventions
• Duplicated records
• Incomplete or missing data
• Contradictions in data
Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Cleaning
• Real-world application data can be incomplete, noisy,
and inconsistent
– No recorded values for some attributes
– Not considered at time of entry
– Random errors
– Irrelevant records or fields
• Data cleaning attempts to:
– Fill in missing values
– Smooth out noisy data
– Correct inconsistencies
– Remove irrelevant data
Dealing with Missing Values
• Solving the Missing Data Problem
– Ignore the record with missing values; (Can you afford this?)
– Fill in the missing values manually; (Error-prone, Unsustainable)
– Use a global constant to fill in missing values (NULL, unknown, etc.);
– Use the attribute value mean to fill in missing values of that attribute;
– Use the attribute mean for all samples belonging to the same class to
fill in the missing values;
– Infer the most probable value to fill in the missing value
• may need to use methods such as Bayesian classification
(probabilities) to automatically infer missing attribute values
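A minimal pandas sketch of the fill-in options above, on a hypothetical toy table with missing "temperature" values and a "play" class label (both column names are illustrative):

```python
import pandas as pd

# Hypothetical toy data: "temperature" has missing values, "play" is the class label.
df = pd.DataFrame({
    "temperature": [64, None, 72, 81, None, 70],
    "play":        ["yes", "yes", "no", "no", "yes", "no"],
})

# Ignore records with missing values
dropped = df.dropna()

# Fill with a global constant
constant = df.fillna({"temperature": -1})

# Fill with the overall attribute mean
overall_mean = df["temperature"].fillna(df["temperature"].mean())

# Fill with the attribute mean of samples belonging to the same class
class_mean = df.groupby("play")["temperature"].transform(lambda s: s.fillna(s.mean()))
```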
Smoothing Noisy Data
The purpose of data smoothing is to eliminate noise and “smooth out” the
data fluctuations.
Ex: Original Data for “price” (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34
Smoothing Noisy Data - Example
Want to smooth “Temperature” by bin means with bins of size 3:
1. First sort the values of the attribute (keep track of the ID or key so
that the transformed values can be replaced in the original table).
2. Divide the data into bins of size 3 (or less in the case of the last bin).
3. Convert the values in each bin to the mean value for that bin.
4. Put the resulting values into the original table.
Smoothing Noisy Data - Example
Bin    ID   Temperature (original)   Temperature (smoothed)
Bin1    7   58                       64
Bin1    6   65                       64
Bin1    5   68                       64
Bin2    9   69                       70
Bin2    4   70                       70
Bin2   10   71                       70
Bin3    8   72                       73
Bin3   12   73                       73
Bin3   11   75                       73
Bin4   14   75                       79
Bin4    2   80                       79
Bin4   13   81                       79
Bin5    3   83                       84
Bin5    1   85                       84
The value of every record in each bin is changed to the mean value for
that bin. If it is necessary to keep the values as integers, the bin
means are rounded to the nearest integer.
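The steps above can be reproduced with a short pandas sketch; the data frame below simply re-creates the slide's ID/Temperature columns:

```python
import pandas as pd

# Re-creation of the slide's example: smooth "Temperature" by bin means, bins of size 3.
df = pd.DataFrame({
    "ID":          [7, 6, 5, 9, 4, 10, 8, 12, 11, 14, 2, 13, 3, 1],
    "Temperature": [58, 65, 68, 69, 70, 71, 72, 73, 75, 75, 80, 81, 83, 85],
})

# 1. Sort by the attribute, keeping the ID column so values can be written back.
df = df.sort_values("Temperature").reset_index(drop=True)

# 2. Assign bins of size 3 (the last bin may be smaller).
df["bin"] = df.index // 3

# 3. Replace each value by its bin mean, rounded to the nearest integer.
df["Temperature_smoothed"] = (
    df.groupby("bin")["Temperature"].transform("mean").round().astype(int)
)

print(df[["ID", "Temperature", "Temperature_smoothed"]])
```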
Smoothing Noisy Data - Example
The final table with the new values for the Temperature attribute.
Data Integration
Ideal case → access to a data warehouse, in which data integration has already
combined data from multiple sources into a coherent store
[Diagram: databases and files of different formats combined into a unified view for analysis]
Data Integration
• Reality and Research:
– Data Lakes: A data lake is a centralized repository that allows you to
store all your structured and unstructured data at any scale. You can
store your data as-is, without having to first structure the data, and
run different types of analytics.
• Low-cost storage and maintenance
• Scalable
• Store structured and non-structured data
• Usually have some organization (metadata), some data integration tools
– Data swamps: no organization, no system, no curation, no or broken
metadata/context data
• Here, integration is difficult and requires advanced solutions.
• Meta-data is often necessary for successful data integration
Data Integration
• Data analysis may require a combination of data from
multiple sources into a coherent data store
• Challenges in Data Integration:
– Schema integration: CID = C_number = Cust-id = cust#
– Semantic heterogeneity (diagnosis, medical condition in diff.
ontologies)
– Data value conflicts (different representations or scales, etc.)
– Synchronization (especially important in Web usage mining)
– Redundant attributes (an attribute is redundant if it can be derived from
other attributes) -- redundancies may be identified via correlation
analysis, e.g., the measure Pr(A,B) / (Pr(A) · Pr(B)):
= 1: independent,
> 1: positive correlation,
< 1: negative correlation.
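As a sketch, the lift-style measure above can be computed directly from value frequencies; the two attributes A and B and their values are made up for illustration:

```python
import pandas as pd

# Hypothetical check for redundancy between two categorical attributes A and B
# using Pr(A,B) / (Pr(A) * Pr(B)).
df = pd.DataFrame({
    "A": ["x", "x", "y", "y", "x", "y"],
    "B": ["u", "u", "v", "v", "u", "v"],
})

p_a = (df["A"] == "x").mean()                        # Pr(A = x)
p_b = (df["B"] == "u").mean()                        # Pr(B = u)
p_ab = ((df["A"] == "x") & (df["B"] == "u")).mean()  # Pr(A = x, B = u)

lift = p_ab / (p_a * p_b)
print(lift)  # = 1: independent, > 1: positive correlation, < 1: negative correlation
```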
Data Transformation: Normalization
• Min-max normalization: linear transformation from v to v’
– v’ = [(v - min)/(max - min)] x (newmax - newmin) + newmin
– Note that if the new range is [0..1], then this simplifies to
v’ = [(v - min)/(max - min)]
– Ex: transform $30000 in the range [10000..45000] into [0..1] ==>
(30000 – 10000) / 35000 = 0.571
• z-score normalization: normalization of v into v’ based on attribute
value mean and standard deviation
– v’ = (v - Mean) / StandardDeviation
• Normalization by decimal scaling
– moves the decimal point of v by j positions such that j is the minimum
number of positions moved so that absolute maximum value falls in
[0..1].
– v’ = v / 10^j
– Ex: if v in [-56 .. 9976] and j=4 ==> v’ in [-0.0056 .. 0.9976]
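A small NumPy sketch of the three normalization methods, applied to an illustrative attribute v:

```python
import numpy as np

# Illustrative numeric attribute.
v = np.array([30000.0, 10000.0, 45000.0, 25000.0])

# Min-max normalization into [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer
# such that the maximum absolute value falls within [0, 1]
j = int(np.ceil(np.log10(np.abs(v).max())))
v_decimal = v / 10**j

print(v_minmax, v_zscore, v_decimal, sep="\n")
```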
Normalization: Example
• z-score normalization: v’ = (v - Mean) / Stdev
• Example: normalizing the “Humidity” attribute:
Humidity   z-score
85          0.48
90          0.99
78         -0.23
96          1.60
80         -0.03
70         -1.05
65         -1.55
95          1.49
70         -1.05
80         -0.03
70         -1.05
90          0.99
75         -0.54
80         -0.03
Mean = 80.3, Stdev = 9.84
Normalization: Example II
• Min-Max normalization on an employee database
– max distance for salary: 100000-19000 = 81000
– max distance for age: 52-27 = 25
– New min for age and salary = 0; new max for age and salary = 1
Data Transformation: Discretization
• 3 Types of attributes
– nominal - values from an unordered set (also “categorical” attributes)
– ordinal - values from an ordered set
– numeric/continuous - real numbers (but sometimes also integer values)
• Discretization is used to reduce the number of values for a given
continuous attribute
– usually done by dividing the range of the attribute into intervals
– interval labels are then used to replace actual data values
• Some data mining algorithms only accept categorical attributes and
cannot handle a range of continuous attribute values
• Discretization can also be used to generate concept hierarchies
– reduce the data by collecting and replacing low level concepts (e.g.,
numeric values for “age”) by higher level concepts (e.g., “young”,
“middle aged”, “old”)
Discretization - Example
Example: discretizing the “Humidity” attribute using 3 bins.
Bins: Low = 60-69, Normal = 70-79, High = 80+

Humidity   Discretized
85         High
90         High
78         Normal
96         High
80         High
70         Normal
65         Low
95         High
70         Normal
80         High
70         Normal
90         High
75         Normal
80         High
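The same discretization can be sketched with pandas' cut; the bin edges below follow the slide's Low/Normal/High definitions:

```python
import pandas as pd

# Discretize "Humidity" into the slide's three bins (Low = 60-69, Normal = 70-79, High = 80+).
humidity = pd.Series([85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80])

bins = [60, 70, 80, float("inf")]   # intervals [60,70), [70,80), [80, inf)
labels = ["Low", "Normal", "High"]

discretized = pd.cut(humidity, bins=bins, labels=labels, right=False)
print(pd.DataFrame({"Humidity": humidity, "Discretized": discretized}))
```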
Data Discretization Methods
• Binning
– Top-down split, unsupervised
• Histogram analysis
– Top-down split, unsupervised
• Clustering analysis
– Unsupervised, top-down split or bottom-up merge
• Decision-tree analysis
– Supervised, top-down split
• Correlation (e.g., χ²) analysis
– Unsupervised, bottom-up merge
Simple Discretization: Binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– If A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B − A)/N.
– The most straightforward, but outliers may dominate
presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing
approximately same number of samples
– Good data scaling
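A quick pandas sketch contrasting the two partitioning schemes, reusing the sorted "price" data from the earlier smoothing slide:

```python
import pandas as pd

# Sorted "price" data from the smoothing example.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width partitioning: N intervals of width W = (B - A) / N
equal_width = pd.cut(prices, bins=3)

# Equal-depth (frequency) partitioning: N intervals with ~equal counts
equal_depth = pd.qcut(prices, q=3)

print(pd.DataFrame({"price": prices,
                    "equal_width": equal_width,
                    "equal_depth": equal_depth}))
```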
Discretization by Classification & Correlation Analysis
Converting Categorical Attributes to Numerical Attributes – Dummy Variables
ID   Outlook    Temperature   Humidity   Windy
1    sunny      85            85         FALSE
2    sunny      80            90         TRUE
3    overcast   83            78         FALSE
4    rain       70            96         FALSE
5    rain       68            80         FALSE
6    rain       65            70         TRUE
7    overcast   58            65         TRUE
8    sunny      72            95         FALSE
9    sunny      69            70         FALSE
10   rain       71            80         FALSE
11   sunny      75            70         TRUE
12   overcast   73            90         TRUE
13   overcast   81            75         FALSE
14   rain       75            80         TRUE

Attributes: Outlook (overcast, rain, sunny), Temperature (real), Humidity (real), Windy (true, false)

Create separate columns for each value of a categorical attribute (e.g., 3 values
for the Outlook attribute and two values of the Windy attribute). There is no
change to the numerical attributes.

Standard Spreadsheet Format
Outlook    Outlook   Outlook   Temp   Humidity   Windy   Windy
overcast   rain      sunny                       TRUE    FALSE
0          0         1         85     85         0       1
0          0         1         80     90         1       0
1          0         0         83     78         0       1
0          1         0         70     96         0       1
0          1         0         68     80         0       1
0          1         0         65     70         1       0
1          0         0         64     65         1       0
...        ...       ...       ...    ...        ...     ...
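pandas' get_dummies performs this conversion directly; the sketch below only reproduces the first few rows of the table:

```python
import pandas as pd

# First few rows of the weather table.
df = pd.DataFrame({
    "Outlook":     ["sunny", "sunny", "overcast", "rain"],
    "Temperature": [85, 80, 83, 70],
    "Humidity":    [85, 90, 78, 96],
    "Windy":       [False, True, False, False],
})

# One 0/1 column per category value; numerical attributes are unchanged.
dummies = pd.get_dummies(df, columns=["Outlook", "Windy"], dtype=int)
print(dummies)
```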
Data Reduction
• Data is often too large; reducing data can improve
performance
• Data reduction consists of reducing the representation of
the data set while producing the same (or almost the
same) results
• Data reduction includes:
– Data (cube) aggregation
– Dimensionality reduction
– Discretization
– Numerosity reduction
• Regression
• Histograms
• Clustering
• Sampling
Data Cube Aggregation
• Reduce the data to the concept level needed in the analysis
– Use the smallest (most detailed) level necessary to solve the problem
Data Aggregation
• Change of scale: Cities aggregated into regions, states, countries, etc.
• More “stable” data: Aggregated data tends to have less variability
• Data reduction: Reduce the number of objects
Principal Component Analysis (Steps)
• Given N data vectors (rows in a table) from n dimensions
(attributes), find k ≤ n orthogonal vectors (principal
components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– The size of the data can be reduced by eliminating the weak
components, i.e., those with low variance
• Using the strongest principal components, it is possible to
reconstruct a good approximation of the original data
• Works for numeric data only
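A scikit-learn sketch of these steps on randomly generated numeric data (the choice of k = 2 components is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric table X: rows = data vectors, columns = attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 records, n = 5 attributes

# Normalize input data so each attribute falls within the same range
X_scaled = StandardScaler().fit_transform(X)

# Compute the principal components and keep the k strongest (k = 2 here)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Components are sorted by decreasing explained variance ("significance")
print(pca.explained_variance_ratio_)

# Reconstruct an approximation of the (scaled) data from the k components
X_approx = pca.inverse_transform(X_reduced)
```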
Other Feature Reduction Methods
• Discrete wavelet transform (DWT):
linear signal processing,
multiresolutional analysis
• Compressed approximation: store
only a small fraction of the strongest
of the wavelet coefficients
• Idea: summarize an image (average
and difference)
• Method:
– Length, L, must be an integer
power of 2 (padding with 0s,
when necessary)
– Each transform has 2 functions:
smoothing, difference
– Applies to pairs of data, resulting
in two sets of data of length L/2
– Applies the two functions recursively
until the desired length is reached
• Non-linear methods: neural-
network based (e.g., text data)
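A minimal sketch of one smoothing/difference step of a Haar-style transform, hand-rolled here rather than using a wavelet library:

```python
import numpy as np

# One Haar-style step on data whose length is a power of 2 (as required above).
x = np.array([2.0, 4.0, 6.0, 8.0])

pairs = x.reshape(-1, 2)
smooth = pairs.mean(axis=1)                 # "smoothing" function: pairwise averages
detail = (pairs[:, 0] - pairs[:, 1]) / 2    # "difference" function

# Applying the two functions recursively to `smooth` yields the full transform;
# keeping only the strongest coefficients gives a compressed approximation.
print(smooth, detail)
```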
Attribute Subset Selection
• Start here. Remove:
– Attributes for which the values change for every object
– Attributes for which the values are the same for every object
• Or almost the same (the spread on a box plot may be an indication)
– Redundant attributes
• Duplicate much or all of the information contained in one or more other
attributes
• e.g., age and date of birth
– Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• e.g., student’s area code to predict GPA
Attribute Relevance Analysis
• Idea: compute a quantified measure for each
attribute given the class (e.g., info gain, Gini index,
correlation coef.)
• Rank attributes from most to least
discriminating
• Set an arbitrary threshold for selection
• Problem?
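A scikit-learn sketch of the ranking idea, using mutual information as the quantified measure and the iris data purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Rank attributes by an information-gain-style score against the class.
X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)

ranking = np.argsort(scores)[::-1]   # most to least discriminating
print(ranking, scores[ranking])

# A selection cutoff still has to be chosen by the analyst.
```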
Feature Subset Selection
• Techniques:
– Brute-force approach:
• Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
• Feature selection occurs naturally as part of the data mining
algorithm
– Filter approaches:
• Features are selected before data mining algorithm is run
– Wrapper approaches:
• Use the data mining algorithm as a black box to find best subset
of attributes
– We revisit this topic later in the course
Heuristic Search in Attribute Selection
• There are 2^d possible attribute combinations
of d attributes
• Typical heuristic attribute selection methods:
– “Best” single attribute under the attribute
independence assumption: choose by
significance tests
– Best step-wise feature selection:
• The best single attribute is picked
first; then the next best attribute
conditioned on the first, ...
• {} → {A1} → {A1, A3} → {A1, A3, A5}
– Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute:
{A1, A2, A3, A4, A5} → {A1, A3, A4, A5} → {A1, A3, A5}, ….
Rüping, Stefan. "Learning interpretable models." (2006).
– Combined attribute selection and
elimination
– The stopping criteria for the methods may
vary
– Decision Tree Induction
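Step-wise selection and elimination can be sketched with scikit-learn's SequentialFeatureSelector (a wrapper-style, greedy search); the k-NN classifier and the choice of 2 features here are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward step-wise selection; direction="backward" gives step-wise elimination.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(X, y)
print(sfs.get_support())   # mask of the selected attributes
```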
Decision Tree Induction
Use information theory techniques to find the most
“informative” attributes
Attribute/Feature Creation/Generation
Data Reduction: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of
data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
– Ex.: Linear models — keep equation, discard points
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
Regression Analysis
• Collection of techniques for the modeling and analysis of numerical data
consisting of values of a dependent variable (also response variable or
measurement) and of one or more independent variables (a.k.a. explanatory
variables or predictors)
• The parameters are estimated to obtain a "best fit" of the data
• Typically, the best fit is evaluated using the least squares method, but
other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
[Figure: scatter plot with fitted regression line y = x + 1; the fitted value Y1' approximates the observed value Y1 at X1]
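A NumPy sketch of the parametric idea: fit a least-squares line and keep only its parameters instead of the raw points (the data values below are made up):

```python
import numpy as np

# Illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.9])

# Least-squares fit of y = a * x + b
a, b = np.polyfit(x, y, deg=1)

# Store only (a, b); the original points can be discarded
# (apart from possible outliers) and approximated on demand.
y_approx = a * x + b
print(a, b, y_approx)
```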
Numerosity Reduction
• Reduction via histograms:
– Divide data into buckets and store
representation of buckets (sum, count, etc.)
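Sketch: a numeric attribute reduced to bucket boundaries and counts with NumPy, reusing the "price" values from earlier slides:

```python
import numpy as np

# Store only bucket edges and counts instead of the raw values.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

counts, edges = np.histogram(prices, bins=3)
print(counts, edges)
```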
Sampling
• The key principle for effective sampling is the
following:
– using a sample will work almost as well as using the
entire data set, if the sample is representative
– A sample is representative if it has approximately the
same property (of interest) as the original set of data
• Potential problems:
– Imbalanced classes
– Not enough data
Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
• Sampling without replacement
– As each item is selected, it is removed from the population
• Sampling with replacement
– Objects are not removed from the population as they are selected for
the sample.
• In sampling with replacement, the same object can be picked up more
than once
• Stratified sampling
– Split the data into several partitions; then draw random samples from
each partition
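A pandas sketch of the three schemes on a hypothetical data frame with an imbalanced "label" column:

```python
import pandas as pd

# Hypothetical data with an imbalanced class column used for stratification.
df = pd.DataFrame({
    "value": range(100),
    "label": ["a"] * 70 + ["b"] * 30,
})

# Simple random sampling without replacement (SRSWOR)
srswor = df.sample(n=10, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR): the same object can be picked more than once
srswr = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: draw the same fraction from each class partition
stratified = df.groupby("label", group_keys=False).sample(frac=0.1, random_state=0)

print(len(srswor), len(srswr), len(stratified))
```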
Sampling Techniques
[Figure: raw data reduced via SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement), and via a cluster/stratified sample]