Module 2_DM_AI
DATA PREPROCESSING
A data mining technique that involves
transforming raw data into an understandable
format.
Real-world data is often incomplete, inconsistent,
and/or lacking in certain behaviours or trends,
and is likely to contain many errors. Data pre-
processing is a proven method of resolving such
issues.
Data pre-processing prepares raw data for
further processing.
WHY DATA PREPROCESSING?
Data in the real world is dirty
incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
e.g., occupation=" "
noisy: containing errors or outliers
e.g., Salary="-10"
inconsistent: containing discrepancies in codes or names
e.g., Age="42", Birthday="03/07/1997"
e.g., Was rating "1, 2, 3", now rating "A, B, C"
e.g., discrepancy between duplicate records
FORMS OF DATA PREPROCESSING
DATA PREPROCESSING
Data cleaning
"clean" the data by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies.
Data Integration
to include data from multiple sources.
involve integrating multiple databases, data cubes, or files.
Data Transformation
Normalization
Aggregation
Data reduction
obtains a reduced representation of the data set that is much
smaller in volume, yet produces the same (or almost the same)
analytical results.
data aggregation
attribute subset selection
dimensionality reduction
numerosity reduction
DATA CLEANING
1. Missing Values
When tuples have no recorded value for one or more
attributes, the missing values can be filled in by any of the
following methods:
A. Ignore the tuple:
This is usually done when the class label is missing.
This method is not very effective, unless the tuple
contains several attributes with missing values.
It is especially poor when the percentage of missing
values per attribute varies considerably.
DATA CLEANING
1. Missing Values
B. Fill in the missing value manually:
time-consuming
may not be feasible given a large data set with many
missing values.
C. Use a global constant to fill in the missing value:
Replace all missing attribute values by the same constant,
such as a label like "Unknown" or ∞.
Although this method is simple, it is not foolproof.
If missing values are replaced by "Unknown," then the
mining program may mistakenly think that they form an
interesting concept, since they all have a value in
common: that of "Unknown."
D. Use a measure of central tendency to fill in the
missing value:
The attribute mean (or, for skewed data, the median) is used
to replace the missing value for the attribute.
DATA CLEANING
1. Missing Values
E. Use the attribute mean for all samples belonging to
the same class as the given tuple:
For example, if classifying customers according to credit
risk, replace the missing value with the average income
value for customers in the same credit risk category as
that of the given tuple.
F. Use the most probable value to fill in the missing
value:
This may be determined with regression, inference-based
tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer
attributes in your data set, you may construct a decision
tree to predict the missing values for income.
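A minimal Python sketch of methods C, D, and E, assuming pandas is available; the DataFrame and the column names income and credit_risk are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "income": [54000.0, None, 73600.0, None, 41000.0, 98000.0],
        "credit_risk": ["low", "low", "high", "high", "low", "high"],
    })

    # C. Global constant: every missing income gets the same value/label.
    filled_const = df["income"].fillna(-1)

    # D. Central tendency: missing values replaced by the attribute mean.
    filled_mean = df["income"].fillna(df["income"].mean())

    # E. Class-conditional mean: missing values replaced by the mean income
    #    of the tuples in the same credit_risk class.
    filled_class = df.groupby("credit_risk")["income"].transform(
        lambda s: s.fillna(s.mean())
    )
    print(filled_class)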
DATA CLEANING
2. Noisy Data
Noise is a random error or variance in a measured
variable.
The common data smoothing techniques are:
A. Binning:
Binning methods smooth a sorted data value by
consulting its "neighbourhood," that is, the values around
it. The sorted values are distributed into a number of
"buckets," or bins. Because binning methods consult the
neighbourhood of values, they perform local smoothing.
smoothing by bin means
smoothing by bin medians
smoothing by bin boundaries
DATA CLEANING
Sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
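A minimal Python sketch reproducing the worked example above; the function names are illustrative:

    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

    def equal_frequency_bins(values, size):
        return [values[i:i + size] for i in range(0, len(values), size)]

    def smooth_by_means(bins):
        # every value in a bin is replaced by the bin mean
        return [[round(sum(b) / len(b))] * len(b) for b in bins]

    def smooth_by_boundaries(bins):
        # every value is replaced by the closer of the bin's min and max
        return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                for b in bins]

    bins = equal_frequency_bins(prices, 3)
    print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]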
DATA CLEANING
2. Noisy Data
B. Regression:
Data can be smoothed by fitting the data to a
function, such as with regression.
Linear regression involves finding the "best" line to
fit two attributes (or variables), so that one attribute
can be used to predict the other.
Multiple linear regression is an extension of linear
regression, where more than two attributes are
involved and the data are fit to a multidimensional
surface.
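A minimal numpy sketch of smoothing with a least-squares line; the data values are hypothetical:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)    # predictor attribute
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])   # noisy attribute to smooth

    # "best" line y = w*x + b by least squares: w = Cov(x, y) / Var(x)
    w = ((x - x.mean()) * (y - y.mean())).mean() / x.var()
    b = y.mean() - w * x.mean()

    # replace the noisy values by the corresponding points on the line
    y_smoothed = w * x + b
    print(np.round(y_smoothed, 2))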
DATA CLEANING
2. Noisy Data
C. Clustering:
Outliers may be detected by clustering, where
similar values are organized into groups, or
"clusters."
Values that fall outside of the set of clusters may be
considered outliers.
DATA CLEANING
[Figure: clustering of data values; values lying outside the clusters are candidate outliers]
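A minimal sketch of cluster-based outlier detection, assuming scikit-learn is available; the values and the rule "a singleton cluster is an outlier" are illustrative choices:

    import numpy as np
    from sklearn.cluster import KMeans

    values = np.array([[4], [8], [15], [21], [21], [24], [25], [28], [34], [120]])

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(values)

    # a value that ends up in a cluster of its own falls outside the
    # clusters formed by the rest, so treat it as a candidate outlier
    sizes = np.bincount(km.labels_)
    print(values[sizes[km.labels_] == 1].ravel())  # typically [120]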
DATA INTEGRATION
combines data from multiple sources into a
coherent data store, as in data warehousing.
Issues to be considered during data integration
Entity identification problem:
How can equivalent real-world entities from multiple data
sources be matched up?
For example, how can the data analyst or the computer be
sure that customer_id in one database and cust_number in
another refer to the same attribute?
Metadata can be used to help avoid errors in schema
integration.
DATA INTEGRATION
Issues to be considered during data integration
Redundancy
An attribute (such as annual revenue, for instance) may be
redundant if it can be "derived" from another attribute or
set of attributes.
Some redundancies can be detected by correlation
analysis.
DATA INTEGRATION
Issues to be considered during data integration
Redundancy
χ2 (chi-square) test:
For nominal (discrete) data, a correlation relationship
between two attributes, A and B, can be discovered by a χ2 test.
The χ2 value (also known as the Pearson χ2 statistic) is computed as:
χ2 = Σ_i Σ_j (o_ij − e_ij)² / e_ij
where o_ij is the observed frequency (i.e., actual count) of the joint event
(A_i, B_j) and e_ij is the expected frequency of (A_i, B_j), computed as
e_ij = count(A = a_i) × count(B = b_j) / n
where n is the number of data tuples.
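A minimal sketch of the test, assuming scipy is available; the contingency table of observed counts o_ij is hypothetical:

    import numpy as np
    from scipy.stats import chi2_contingency

    # observed joint counts o_ij for two nominal attributes
    #                  value B1  value B2
    observed = np.array([[250,   200],    # value A1
                         [ 50,  1000]])   # value A2

    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print(chi2, p)   # a large chi2 (tiny p) suggests A and B are correlated
    print(expected)  # e_ij = count(A = a_i) * count(B = b_j) / n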
DATA INTEGRATION
Issues to be considered during data integration
Redundancy
Correlation coefficient (Pearson's product-moment coefficient):
For numeric data, the correlation between attributes A and B is computed as
r_(A,B) = Σ_{i=1..n} (a_i − E(A)) · (b_i − E(B)) / (n · σ_A · σ_B)
where n is the number of tuples,
E(A) = (Σ_{i=1..n} a_i) / n and E(B) = (Σ_{i=1..n} b_i) / n
are the means of A and B, and σ_A, σ_B are the standard deviations of A and B.
Note that −1 ≤ r_(A,B) ≤ +1: a value greater than 0 indicates positive
correlation, a value below 0 indicates negative correlation, and the larger
the absolute value, the stronger the correlation.
DATA INTEGRATION
REDUNDANCY
Covariance
The covariance between A and B is defined as
Cov(A, B) = E[(A − E(A)) · (B − E(B))] = Σ_{i=1..n} (a_i − E(A)) · (b_i − E(B)) / n
which simplifies to
Cov(A, B) = E(A·B) − E(A) · E(B)
The correlation coefficient can then be written as
r_(A,B) = Cov(A, B) / (σ_A · σ_B)
If two attributes A and B tend to change together,
the covariance between A and B is positive.
If one attribute tends to be above its expected value
when the other is below its expected value, then the
covariance of A and B is negative.
If A and B are independent, Cov(A,B) = 0; the converse,
however, does not hold, since some pairs of dependent
attributes can also have zero covariance.
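A minimal numpy sketch of the two measures; the attribute values are hypothetical:

    import numpy as np

    a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])     # attribute A
    b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])  # attribute B

    n = len(a)
    cov_ab = ((a - a.mean()) * (b - b.mean())).sum() / n  # Cov(A, B)
    r_ab = cov_ab / (a.std() * b.std())                   # correlation coefficient

    print(cov_ab)                   # positive: A and B tend to rise together
    print(r_ab)                     # scaled into [-1, +1]
    print(np.corrcoef(a, b)[0, 1])  # cross-check with numpy's built-in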
DATA INTEGRATION
Issues to be considered during data integration
Detection and resolution of data value conflicts:
for the same real-world entity, attribute values from
different sources may differ, owing to differences in
representation, scaling, or encoding.
Aggregation
summary or aggregation operations are applied to the data.
For example, daily sales data may be aggregated so as to
compute monthly and annual totals; this is typically done when
constructing a data cube for analysis of the data at
multiple granularities.
DATA TRANSFORMATION
Data transformation can involve the following:
Generalization of the data
low-level or "primitive" (raw) data are replaced by higher-
level concepts through the use of concept hierarchies.
For example, values for numerical attributes, like age, may
be mapped to higher level concepts, like youth, middle-aged,
and senior.
Normalization
the attribute data are scaled so as to fall within a small,
specified range, such as 0.0 to 1.0.
Methods:
Min-max normalization
performs a linear transformation on the original data. A value, v,
of attribute A is mapped to v' in the range [new_min_A, new_max_A]
by computing
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
Eg: Suppose that the minimum and maximum values for the attribute
income are $12,000 and $98,000, and that income is to be mapped to
the range [0.0, 1.0]. By min-max normalization, a value of $73,600 is
transformed to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
DATA TRANSFORMATION
Normalization
Methods:
z-score normalization
(also called zero-mean normalization): the values for an attribute, A,
are normalized based on the mean and standard deviation of A.
A value, v, of A is normalized to v' by computing
v' = (v − E(A)) / σ_A
where E(A) and σ_A are the mean and standard deviation of A.
DATA TRANSFORMATION
Normalization
Methods:
z-score normalization
Eg: Suppose that the mean and standard deviation of the
values for the attribute income are $54,000 and $16,000,
respectively. By z-score normalization, normalize the
value of $73,600 for income.
(73600-54000)/16000 = 1.225
DATA TRANSFORMATION
Normalization
Methods:
Normalization by decimal scaling
normalizes by moving the decimal point of values of attribute
A. The number of decimal points moved depends on the
maximum absolute value of A.
A value, v, of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
Eg: Suppose that the recorded values of A range from -986 to
917. The maximum absolute value of A is 986. To normalize by
decimal scaling, we therefore divide each value by 1,000 (i.e., j =
3) so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
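A minimal Python sketch of the three normalization methods, reproducing the worked values above:

    import numpy as np

    def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
        return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

    def z_score(v, mean, std):
        return (v - mean) / std

    def decimal_scaling(values):
        # smallest j such that max(|v'|) < 1
        j = int(np.floor(np.log10(np.abs(values).max()))) + 1
        return values / 10 ** j

    print(round(min_max(73600, 12000, 98000), 3))      # 0.716
    print(round(z_score(73600, 54000, 16000), 3))      # 1.225
    print(decimal_scaling(np.array([-986.0, 917.0])))  # [-0.986  0.917]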
DATA TRANSFORMATION
Normalization
Methods:
Attribute construction
new attributes are constructed from the given attributes and
added in order to help improve the accuracy and understanding
of structure in high-dimensional data.
For example, we may wish to add the attribute area based on
the attributes height and width.
By combining attributes, attribute construction can discover
missing information about the relationships between data
attributes that can be useful for knowledge discovery.
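A minimal pandas sketch of the area example; the column names are hypothetical:

    import pandas as pd

    df = pd.DataFrame({"height": [2.0, 3.5, 4.0], "width": [1.0, 2.0, 2.5]})

    # construct a new attribute from the given ones
    df["area"] = df["height"] * df["width"]
    print(df)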
DATA REDUCTION
NUMEROSITY REDUCTION
replaces the original data volume by alternative, smaller forms
of data representation. The techniques may be parametric or
non-parametric.
Parametric methods
a model is used to estimate the data, so that typically only the
model parameters need be stored, instead of the actual data.
eg: regression and log-linear models.
Non-parametric methods
store reduced representations of the data.
eg: histograms, clustering, and sampling.
Regression
can be used to approximate the given data.
linear regression:
the data are modelled to fit a straight line.
a random variable, y (called a response variable), can be
modelled as a linear function of another random variable, x
(called a predictor variable), with the equation y = wx+b,
where the variance of y is assumed to be constant.
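A minimal numpy sketch: instead of storing all n data points, only the two model parameters w and b are kept; the generated data are hypothetical:

    import numpy as np

    x = np.arange(1.0, 1001.0)   # 1000 predictor values
    y = 3.0 * x + 7.0 + np.random.default_rng(0).normal(0.0, 2.0, x.size)

    w, b = np.polyfit(x, y, deg=1)  # the reduced representation: just (w, b)
    y_approx = w * x + b            # data re-estimated from the model
    print(w, b)                     # close to the true 3.0 and 7.0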
Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
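A minimal pandas sketch mapping numeric age values to higher-level concept labels; the cut points are hypothetical:

    import pandas as pd

    age = pd.Series([13, 22, 35, 47, 58, 66, 71])

    # replace actual values by interval / concept labels
    labels = pd.cut(age, bins=[0, 29, 59, 120],
                    labels=["young", "middle-aged", "senior"])
    print(labels.tolist())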
DISCRETIZATION AND CONCEPT HIERARCHY
GENERATION FOR NUMERIC DATA
Entropy-based discretization
a supervised, top-down splitting technique: the value of the
attribute that minimizes the entropy (expected information
requirement) of the resulting partitions is selected as a split
point, and the procedure is applied recursively to each partition
until a stopping criterion is met.
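A minimal Python sketch of the idea: the candidate split that minimizes the class-label entropy of the resulting partitions is chosen; the values and class labels below are hypothetical:

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def best_split(values, labels):
        order = np.argsort(values)
        v, y = np.asarray(values)[order], np.asarray(labels)[order]
        best = (None, np.inf)
        for i in range(1, len(v)):
            split = (v[i - 1] + v[i]) / 2
            # expected information requirement after splitting at `split`
            info = (i * entropy(y[:i]) + (len(v) - i) * entropy(y[i:])) / len(v)
            if info < best[1]:
                best = (split, info)
        return best

    values = [4, 8, 15, 21, 24, 25, 28, 34]
    labels = ["low", "low", "low", "high", "high", "high", "high", "high"]
    split, info = best_split(values, labels)
    print(split, info)  # 18.0 0.0 -- this split cleanly separates the classes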