Module 2_DM_AI

Data preprocessing is a crucial step in data mining that transforms raw data into a usable format, addressing issues such as incompleteness, inconsistency, and noise. It includes techniques like data cleaning, integration, transformation, and reduction to prepare data for analysis. Effective data preprocessing enhances the quality and accuracy of data mining results.

DATA PREPROCESSING

DATA PREPROCESSING
 A data mining technique that involves
transforming raw data into an understandable
format.
 Real-world data is often incomplete, inconsistent,
and/or lacking in certain behaviours or trends,
and is likely to contain many errors. Data pre-
processing is a proven method of resolving such
issues.
 Data pre-processing prepares raw data for
further processing.
WHY DATA PREPROCESSING?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation=" "
 noisy: containing errors or outliers
e.g., Salary="-10"
 inconsistent: containing discrepancies in codes or names
e.g., Age="42", Birthday="03/07/1997"
e.g., Was rating "1, 2, 3", now rating "A, B, C"
e.g., discrepancy between duplicate records
FORMS OF DATA PREPROCESSING
DATA PREPROCESSING
 Data cleaning
 "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
 Data Integration
 to include data from multiple sources.
 involve integrating multiple databases, data cubes, or files.
 Data Transformation
 Normalization
 Aggregation
 Data reduction
 obtains a reduced representation of the data set that is much
smaller in volume, yet produces the same (or almost the same)
analytical results.
 data aggregation
 attribute subset selection
 dimensionality reduction
 numerosity reduction
DATA CLEANING
1. Missing Values
If there are many tuples that have no recorded
value for several attributes, then the missing
value can be filled in by any of the following
methods:
A. Ignore the tuple:
 This is usually done when the class label is missing.
 This method is not very effective, unless the tuple
contains several attributes with missing values.
 It is especially poor when the percentage of missing
values per attribute varies considerably.
DATA CLEANING
1. Missing Values
B. Fill in the missing value manually:
 time-consuming
 may not be feasible given a large data set with many
missing values.
C. Use a global constant to fill in the missing value:
 Replace all missing attribute values by the same constant,
such as a label like "Unknown" or ∞.
 although this method is simple, it is not foolproof.
 If missing values are replaced by "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown."
D. Use a measure of central tendency to fill in the missing value:
 A measure of central tendency for the attribute (e.g., the mean or median) is used to replace the missing value.
DATA CLEANING
1. Missing Values
E. Use the attribute mean for all samples belonging to
the same class as the given tuple:
For example, if classifying customers according to credit
risk, replace the missing value with the average income
value for customers in the same credit risk category as
that of the given tuple.
F. Use the most probable value to fill in the missing
value:
This may be determined with regression, inference-based
tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer
attributes in your data set, you may construct a decision
tree to predict the missing values for income
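
A minimal pandas sketch of methods C, D, and E, assuming a hypothetical table with income and credit_risk columns (the data and column names are illustrative, not from the slides):

import pandas as pd

# Hypothetical customer table with missing income values (illustrative data only).
df = pd.DataFrame({
    "income":      [48000, None, 61000, None, 39000],
    "credit_risk": ["low", "low", "high", "high", "low"],
})

# C. Global constant: replace every missing value with one label (column becomes object-typed).
filled_constant = df["income"].fillna("Unknown")

# D. Central tendency: replace missing values with the attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# E. Class-conditional mean: use the mean income of tuples in the same
# credit-risk class as the tuple with the missing value.
filled_by_class = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(filled_by_class)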
DATA CLEANING
2. Noisy Data
 Noise is a random error or variance in a measured
variable.
 The common data smoothing techniques are:
A. Binning:
Binning methods smooth a sorted data value by consulting its "neighbourhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing.
 smoothing by bin means
 smoothing by bin medians
 smoothing by bin boundaries
DATA CLEANING
Sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
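
A small numpy sketch of the equal-frequency binning example above, assuming bins of depth 3 as in the slide:

import numpy as np

# Sorted prices from the slide example.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency bins of depth 3 (assumes the number of values divides evenly).
bins = prices.reshape(-1, 3)

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1).round(), 3)

# Smoothing by bin boundaries: every value snaps to the closer of the bin's min or max.
by_bounds = np.where(
    bins - bins.min(axis=1, keepdims=True)
    <= bins.max(axis=1, keepdims=True) - bins,
    bins.min(axis=1, keepdims=True),
    bins.max(axis=1, keepdims=True),
)
print(by_means)          # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_bounds.ravel()) # [ 4  4 15 21 21 24 25 25 34]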
DATA CLEANING
2. Noisy Data
B. Regression:
 Data can be smoothed by fitting the data to a
function, such as with regression.
 Linear regression involves finding the ―best‖ line to
fit two attributes (or variables), so that one attribute
can be used to predict the other.
 Multiple linear regression is an extension of linear
regression, where more than two attributes are
involved and the data are fit to a multidimensional
surface.
DATA CLEANING
2. Noisy Data
C. Clustering:
 Outliers may be detected by clustering, where similar values are organized into groups, or "clusters."
 Values that fall outside of the set of clusters may be considered outliers.
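
A rough sketch of outlier detection by clustering; k-means from scikit-learn and the distance-to-centroid threshold are illustrative choices, not prescribed by the slides:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Mostly clustered values plus a few far-away points (illustrative data).
values = np.concatenate([rng.normal(20, 2, 50),
                         rng.normal(60, 2, 50),
                         [120.0, -40.0]]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
# Distance of each value to its nearest cluster centroid.
dist = km.transform(values).min(axis=1)

# Flag values that fall far outside all clusters as outliers
# (the threshold is an arbitrary illustrative choice).
outliers = values[dist > 3 * dist.mean()].ravel()
print(outliers)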
DATA CLEANING
CLUSTERING
DATA INTEGRATION
 combines data from multiple sources into a
coherent data store, as in data warehousing.
 Issues to be considered during data integration
 Entity identification problem:
 How can equivalent real-world entities from multiple data
sources be matched up?
 For example, how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute?
 metadata can be used to help avoid errors in schema integration.
DATA INTEGRATION
 Issues to be considered during data integration
 Redundancy
 An attribute (such as annual revenue, for instance) may be redundant if it can be "derived" from another attribute or set of attributes.
 Some redundancies can be detected by correlation
analysis.
DATA INTEGRATION
 Issues to be considered during data integration
 Redundancy
 χ² (chi-square) test:
 For nominal (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test.
 Suppose A has c distinct values, namely a1, a2, ..., ac, and B has r distinct values, namely b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows.
 The χ² value (also known as the Pearson χ² statistic) is computed as shown on the next slide.
DATA INTEGRATION
 Issues to be considered during data integration
 Redundancy
 χ² (chi-square) test:
 The χ² value (also known as the Pearson χ² statistic) is computed as:

χ² = Σ_{i=1..c} Σ_{j=1..r} (o_ij - e_ij)² / e_ij

 where o_ij is the observed frequency (i.e., actual count) of the joint event (A_i, B_j) and e_ij is the expected frequency of (A_i, B_j), given by

e_ij = count(A = a_i) × count(B = b_j) / N

 where N is the number of data tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B.
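
A short sketch of the χ² test using scipy's chi2_contingency; the 2x2 contingency table for two nominal attributes is hypothetical:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are values of B, columns are values of A.
observed = np.array([[250, 200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 1), round(p_value, 4))
# A large chi-square value (small p-value) suggests A and B are correlated,
# so one of them may be redundant for mining purposes.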
DATA INTEGRATION
REDUNDANCY

 Correlation analysis can measure how strongly one attribute implies the other, based on the available data.
 For numerical attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson's product moment coefficient, named after its inventor, Karl Pearson):

r_A,B = Σ_{i=1..n} (a_i - Ā)(b_i - B̄) / (n · σ_A · σ_B)

 where n is the number of tuples, Ā and B̄ are the respective mean values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
DATA INTEGRATION
REDUNDANCY
DATA INTEGRATION
 Issues to be considered during data integration
 Redundancy
 Correlation coefficient:
 -1 <= r_A,B <= +1
 If r_A,B > 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase. The higher the value, the stronger the correlation.
 A higher value may indicate that A (or B) may be removed as a redundancy.
 If the resulting value is equal to 0, then A and B are independent and there is no correlation between them.
 If the resulting value is less than 0, then A and B are negatively correlated, where the values of one attribute increase as the values of the other attribute decrease.
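
A brief numpy sketch of correlation-based redundancy checking; the two attributes and their values are hypothetical:

import numpy as np

# Hypothetical attributes: annual_revenue is roughly 12x monthly_revenue,
# so the two attributes are redundant.
monthly_revenue = np.array([10.0, 12.5, 9.0, 15.0, 11.0])
annual_revenue  = 12 * monthly_revenue + np.array([1.0, -0.5, 0.3, 0.8, -1.2])

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(round(r, 3))   # close to +1: strong positive correlation,
                     # so one of the attributes may be dropped as redundant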
DATA INTEGRATION
REDUNDANCY
 Covariance
 For numerical data.
 The mean values of A and B, also known as their expected values, are

E(A) = Ā = (1/n) Σ_{i=1..n} a_i          E(B) = B̄ = (1/n) Σ_{i=1..n} b_i
DATA INTEGRATION
REDUNDANCY
 Covariance
 The covariance between A and B is defined as:

Cov(A, B) = E[(A - Ā)(B - B̄)] = (1/n) Σ_{i=1..n} (a_i - Ā)(b_i - B̄) = E(A·B) - Ā·B̄

 and the correlation coefficient can be written in terms of it:

r_A,B = Cov(A, B) / (σ_A · σ_B)

 If two attributes A and B tend to change together, the covariance between A and B is positive.
 If one attribute tends to be above its expected value when the other is below its expected value, then the covariance of A and B is negative.
 If A and B are independent, Cov(A, B) = 0.
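
A small sketch computing Cov(A, B) = E(A·B) - Ā·B̄ and r_A,B directly from the formulas above, on made-up values:

import numpy as np

# Hypothetical numeric attributes A and B.
A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([10.0, 12.0, 18.0, 15.0, 20.0])

# Cov(A, B) = E(A*B) - mean(A)*mean(B)  (population covariance, dividing by n).
cov = (A * B).mean() - A.mean() * B.mean()

# r_A,B = Cov(A, B) / (sigma_A * sigma_B), using population standard deviations.
r = cov / (A.std() * B.std())
print(round(cov, 3), round(r, 3))   # positive covariance: A and B change together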
DATA INTEGRATION
 Issues to be considered during data integration
 Detection and resolution of data value
conflicts.
 For the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding.
 An attribute in one system may be recorded at a lower level of abstraction than the "same" attribute in another.
DATA TRANSFORMATION
 The data are transformed or consolidated into
forms appropriate for mining.
 Data transformation can involve the following:
 Smoothing:
 to remove noise from the data.
 techniques include binning, regression, and clustering.
 Aggregation
 summary or aggregation operations are applied to the data.
 For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts.
 used in constructing a data cube for analysis of the data at multiple granularities.
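
A short pandas sketch of the daily-to-monthly/annual aggregation example; the dates and sales figures are illustrative:

import pandas as pd

# Hypothetical daily sales (illustrative values).
daily = pd.DataFrame(
    {"sales": [120.0, 95.5, 130.2, 110.0]},
    index=pd.to_datetime(["2009-01-03", "2009-01-17", "2009-02-05", "2010-03-11"]),
)

# Aggregate the daily figures into monthly and annual totals.
monthly_totals = daily["sales"].groupby([daily.index.year, daily.index.month]).sum()
annual_totals = daily["sales"].groupby(daily.index.year).sum()
print(monthly_totals)
print(annual_totals)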
DATA TRANSFORMATION
 Data transformation can involve the following:
 Generalization of the data
 low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies.
 For example, values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.
 Normalization
 the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
 Attribute construction (or feature construction)
 new attributes are constructed and added from the given set of attributes to help the mining process.
DATA TRANSFORMATION
 Normalization,
 An attribute is normalized by scaling its values so
that they fall within a small specified range, such as
0.0 to 1.0.
 useful for classification algorithms involving neural
networks, or distance measurements such as nearest-
neighbour classification and clustering.
 Methods:
1. Min-max normalization performs a linear transformation
on the original data.

minA and maxA are the minimum and maximum values of an attribute, A.
DATA TRANSFORMATION
 Normalization,
 Methods:
1. Min-max normalization performs a linear transformation on
the original data.

v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA

where minA and maxA are the minimum and maximum values of an attribute A, and v is mapped to v' in the new range [new_minA, new_maxA].
DATA TRANSFORMATION
 Normalization,
 Methods:
 Min-max normalization
Eg: Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, normalize the value of $73,600 for income.

v' = ((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0) + 0 = 0.716
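
A minimal sketch of min-max normalization reproducing the income example above:

def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The income example from the slide: $73,600 in [$12,000, $98,000] mapped to [0.0, 1.0].
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))   # 0.716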
DATA TRANSFORMATION
 Normalization,
 Methods:
 z-score normalization
 In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean, Ā, and standard deviation, σ_A, of A:

v' = (v - Ā) / σ_A
DATA TRANSFORMATION
 Normalization,
 Methods:
 z-score normalization
Eg: Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. By z-score normalization, normalize the value of $73,600 for income.

v' = (73,600 - 54,000) / 16,000 = 1.225
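
A minimal sketch of z-score normalization reproducing the income example above:

def z_score_normalize(v, mean_a, std_a):
    """Zero-mean normalization: how many standard deviations v lies from the mean."""
    return (v - mean_a) / std_a

# The income example from the slide: mean $54,000, standard deviation $16,000.
print(z_score_normalize(73_600, 54_000, 16_000))   # 1.225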
DATA TRANSFORMATION
 Normalization,
 Methods:
 Normalization by decimal scaling
 normalizes by moving the decimal point of values of attribute
A. The number of decimal points moved depends on the
maximum absolute value of A.
 A value, v, of A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
Eg: Suppose that the recorded values of A range from -986 to
917. The maximum absolute value of A is 986. To normalize by
decimal scaling, we therefore divide each value by 1,000 (i.e., j =
3) so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
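
A small sketch of normalization by decimal scaling, reproducing the [-986, 917] example:

import math

def decimal_scaling_normalize(values):
    """Divide by 10^j, with j the smallest integer making all |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = math.floor(math.log10(max_abs)) + 1
    return [v / 10 ** j for v in values]

# The slide example: values in [-986, 917] are divided by 1,000 (j = 3).
print(decimal_scaling_normalize([-986, 917]))   # [-0.986, 0.917]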
DATA TRANSFORMATION
 Attribute construction (or feature construction)
 new attributes are constructed from the given attributes and added in order to help improve the accuracy and understanding of structure in high-dimensional data.
 For example, we may wish to add the attribute area based on the attributes height and width.
 By combining attributes, attribute construction can discover missing information about the relationships between data attributes that can be useful for knowledge discovery.
DATA REDUCTION
 Complex data analysis and mining on huge
amounts of data can take a long time, making
such analysis impractical or infeasible.
 Data reduction techniques can be applied to
obtain a reduced representation of the data set
that is much smaller in volume, yet closely
maintains the integrity of the original data.
 Strategies for data reduction
1. Data cube aggregation
2. Attribute subset selection
3. Dimensionality reduction
4. Numerosity reduction
5. Discretization and concept hierarchy generation
DATA REDUCTION
1. Data cube aggregation
DATA REDUCTION
DATA CUBE AGGREGATION
 Cube at the lowest level of abstraction –Base Cuboid
 Cube at the highest level of abstraction –Apex Cuboid
DATA REDUCTION
ATTRIBUTE SUBSET SELECTION
 reduces the data set size by removing irrelevant or
redundant attributes.
 The goal is to find a minimum set of attributes such
that the resulting probability distribution of the data
classes is as close as possible to the original
distribution obtained using all attributes.
 For n attributes, there are 2^n possible subsets.
 Heuristic methods that explore a reduced search
space are commonly used for attribute subset
selection. These methods are typically greedy in that,
while searching through attribute space, they always
make what looks to be the best choice at the time.
 The "best" (and "worst") attributes are typically determined using tests of statistical significance, which assume that the attributes are independent of one another.
DATA REDUCTION
ATTRIBUTE SUBSET SELECTION
1. Stepwise forward selection:
 starts with an empty set of attributes as the reduced set.
 The best of the original attributes is determined and added to the reduced set.
 At each subsequent iteration or step, the best of the remaining original
attributes is added to the set.
2. Stepwise backward elimination:
 starts with the full set of attributes.
 At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination:
 At each step, the procedure selects the best attribute and removes the worst
from among the remaining attributes.
4. Decision tree induction:
 Decision tree algorithms, such as ID3, C4.5, and CART, were originally
intended for classification.
 Decision tree induction constructs a flow chart like structure where each internal
(non leaf) node denotes a test on an attribute, each branch corresponds to an
outcome of the test, and each external (leaf) node denotes a class prediction.
 At each node, the algorithm chooses the "best" attribute to partition the data into individual classes.
 The set of attributes appearing in the tree form the reduced subset of attributes.
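
A rough sketch of stepwise forward selection; the iris data set, the decision-tree classifier, and the cross-validated accuracy criterion are illustrative assumptions, not part of the slides:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stepwise forward selection (greedy): start with an empty set, repeatedly add
# the attribute that most improves cross-validated accuracy.
X, y = load_iris(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))

def score(cols):
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, cols], y, cv=5).mean()

best = 0.0
while remaining:
    gains = {a: score(selected + [a]) for a in remaining}
    a, s = max(gains.items(), key=lambda kv: kv[1])
    if s <= best:            # stop when no remaining attribute improves the score
        break
    selected.append(a)
    remaining.remove(a)
    best = s

print("reduced attribute subset:", selected, "cv accuracy:", round(best, 3))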
DATA REDUCTION
ATTRIBUTE SUBSET SELECTION
DATA REDUCTION
DIMENSIONALITY REDUCTION
 Data encoding or transformations are applied so as to
obtain a reduced or ―compressed‖ representation of
the original data.
 Lossless: If the original data can be reconstructed from
the compressed data without any loss of information.
 Lossy: we can reconstruct only an approximation of the original data. There are several well-tuned algorithms for lossy data compression.
 Two popular and effective methods of lossy
dimensionality reduction:
 Wavelet transforms
 Principal components analysis.
DATA REDUCTION
DIMENSIONALITY REDUCTION
 Wavelet Transforms
 The discrete wavelet transform (DWT) is a linear
signal processing technique.
 It transforms a vector into a numerically different vector (D to D') of wavelet coefficients.
 The two vectors are of the same length, but the wavelet-transformed data can be truncated.
 Given a set of coefficients, an approximation of the original data can be obtained by applying the inverse DWT.
 The wavelet transform halves the data at each iteration, resulting in fast computational speed.
DATA REDUCTION
DIMENSIONALITY REDUCTION
 Wavelet Transforms
 The method is as follows:
1. The length, L, of the input data vector must be an integer power
of 2.
2. Each transform involves applying two functions. The first
applies some data smoothing, such as a sum or weighted average.
The second performs a weighted difference, which acts to bring
out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x_2i, x_2i+1). This results in two sets of data of length L/2. In general, these represent a smoothed or low-frequency version of the input data and the high-frequency content of it, respectively.
4. The two functions are recursively applied to the sets of data
obtained in the previous loop, until the resulting data sets
obtained are of length 2.
5. Selected values from the data sets obtained in the above
iterations are designated the wavelet coefficients of the
transformed data.
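
A minimal sketch of this procedure for the Haar wavelet (one member of the DWT family); the input values are illustrative:

import numpy as np

def haar_dwt(x):
    """One level of a (normalized) Haar transform: pairwise smooth + detail."""
    x = np.asarray(x, dtype=float).reshape(-1, 2)
    smooth = (x[:, 0] + x[:, 1]) / np.sqrt(2)   # low-frequency / smoothed half
    detail = (x[:, 0] - x[:, 1]) / np.sqrt(2)   # high-frequency / detail half
    return smooth, detail

# The input length must be a power of 2; recurse on the smoothed half
# until it has length 2, collecting the detail coefficients along the way.
data = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])   # L = 8
coeffs = []
smooth = data
while len(smooth) > 2:
    smooth, detail = haar_dwt(smooth)
    coeffs.append(detail)
coeffs.append(smooth)
print([np.round(c, 3) for c in coeffs])   # the wavelet coefficients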
DATA REDUCTION
DIMENSIONALITY REDUCTION
 Wavelet Transforms
DATA REDUCTION
DIMENSIONALITY REDUCTION
 Principal Components Analysis
 Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions.
 PCA (also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k <= n.
 The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
 PCA "combines" the essence of attributes by creating an alternative, smaller set of variables.
DATA REDUCTION
DIMENSIONALITY REDUCTION
 Principal Components Analysis
 The method is as follows:
1. The input data are normalized, so that each attribute falls within
the same range.
2. PCA computes k ortho-normal vectors that provide a basis for the
normalized input data. These vectors are referred to as the
principal components. The input data are a linear combination of
the principal components.
3. The principal components are sorted in order of decreasing "significance" or strength. The principal components essentially serve as a new set of axes for the data, providing important information about variance. This information helps identify groups or patterns within the data.
4. Because the components are sorted according to decreasing order of "significance," the size of the data can be reduced by eliminating the weaker components, that is, those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.
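
A short scikit-learn sketch of the PCA steps above; the generated data and the choice of k = 2 are illustrative:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 100 tuples, 5 correlated numeric attributes.
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(100, 3))])

# Step 1: normalize so each attribute falls within a comparable range.
X_norm = StandardScaler().fit_transform(X)

# Steps 2-4: compute the principal components and keep the k strongest.
pca = PCA(n_components=2).fit(X_norm)
X_reduced = pca.transform(X_norm)             # 100 x 2 instead of 100 x 5
print(pca.explained_variance_ratio_.round(3)) # variance carried by each component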
DATA REDUCTION
DIMENSIONALITY REDUCTION
 Principal Components Analysis
DATA REDUCTION
NUMEROSITY REDUCTION
 reduce the data volume by choosing alternative, 'smaller' forms of data representation.
 Two types of methods:
 Parametric
 A model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data.
 Eg: log-linear models
 Non-parametric
 stores reduced representations of the data.
 Eg: histograms, clustering, and sampling.
DATA REDUCTION
NUMEROSITY REDUCTION
 Regression
 can be used to approximate the given data.
 linear regression:
 the data are modelled to fit a straight line.
 a random variable, y (called a response variable), can be
modelled as a linear function of another random variable, x
(called a predictor variable), with the equation y = wx+b,
 where the variance of y is assumed to be constant.
 In the context of data mining, x and y are numerical database attributes. The coefficients, w and b (called regression coefficients), specify the slope of the line and the y-intercept, respectively.
 Multiple linear regression
 an extension of (simple) linear regression, which allows a
response variable, y, to be modelled as a linear function of
two or more predictor variables.
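
A small numpy sketch of (simple) linear regression for numerosity reduction; only the fitted coefficients w and b need to be stored, and the x, y values here are made up:

import numpy as np

# Hypothetical attributes: keep the fitted coefficients (w, b) instead of the
# raw y values, and approximate y as w*x + b when needed.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

w, b = np.polyfit(x, y, deg=1)     # least-squares fit of y = w*x + b
y_approx = w * x + b

print(round(w, 3), round(b, 3))    # slope and y-intercept
print(np.round(y_approx, 2))       # reconstructed (approximate) y values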
DATA REDUCTION
NUMEROSITY REDUCTION
 Log-Linear Models
 Log-linear models approximate discrete
multidimensional probability distributions.
 Given a set of tuples in n dimensions (e.g., described
by n attributes), we can consider each tuple as a point
in an n-dimensional space.
 Log-linear models can be used to estimate the
probability of each point in a multidimensional space
for a set of discretized attributes, based on a smaller
subset of dimensional combinations. This allows a
higher-dimensional data space to be constructed from
lower dimensional spaces. Log-linear models are
therefore also useful for dimensionality reduction
DATA REDUCTION
NUMEROSITY REDUCTION
 Histogram
 A histogram for an attribute, A, partitions the data
distribution of A into disjoint subsets, or buckets.
 If each bucket represents only a single attribute-
value/frequency pair, the buckets are called singleton
buckets. Often, buckets instead represent continuous
ranges for the given attribute.
DATA REDUCTION
NUMEROSITY REDUCTION
 Histogram
 Eg: The following data are a list of prices of commonly
sold items at AllElectronics (rounded to the nearest
dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8,
8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18,
18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21,
21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
DATA REDUCTION
NUMEROSITY REDUCTION
 Histogram
 There are several partitioning rules, including the following:
 Equal-width: the width of each bucket range is uniform
 Equal-frequency (or equidepth):
 the buckets are created so that, roughly, the frequency of each bucket
is constant (that is, each bucket contains roughly the same number of
contiguous data samples).
 V-Optimal:
 V-Optimal histogram is the one with the least variance.
 Histogram variance is a weighted sum of the original values that each
bucket represents, where bucket weight is equal to the number of
values in the bucket.
 MaxDiff:
 we consider the difference between each pair of adjacent values.
 A bucket boundary is established between each pair for pairs having
the B-1 largest differences, where B is the user-specified number of
buckets.
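
A brief numpy sketch of equal-width and equal-frequency bucketing on the AllElectronics price list above; the choice of 3 equal-width buckets and quartile boundaries is illustrative:

import numpy as np

prices = np.array([1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                   15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
                   20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
                   28, 28, 30, 30, 30])

# Equal-width histogram: 3 buckets of uniform width over [1, 30].
counts, edges = np.histogram(prices, bins=3)
print(edges.round(2), counts)     # bucket boundaries and their frequencies

# Equal-frequency (equidepth) bucket boundaries: quartiles of the sorted data.
print(np.quantile(prices, [0.0, 0.25, 0.5, 0.75, 1.0]))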
DATA REDUCTION
NUMEROSITY REDUCTION
 Clustering
 Clustering techniques consider data tuples as objects.
 They partition the objects into groups or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters.
 Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function.
 The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid.
DATA REDUCTION
NUMEROSITY REDUCTION
 Sampling
 Sampling can be used as a data reduction technique
because it allows a large data set to be represented by a
much smaller random sample (or subset) of the data
1. Simple random sample without replacement (SRSWOR)
of size s
2. Simple random sample with replacement (SRSWR) of size
s
3. Cluster sample
4. Stratified sample
 Simple Random Sampling
In this method, each individual or item in the population has an equal chance of being selected.
 Example: Assign each student in a class of 30 a number from 1 to 30 and then use a random number generator to pick 5 numbers.
 SRSWOR: Once an individual is selected, they cannot
be selected again.
 SRSWR: Individuals can be selected more than once
because they are returned to the pool after each
selection.
 Stratified Sampling: The population is divided into
subgroups, and random samples are taken from each
subgroup to ensure representation across different
groups in the population.
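
A small numpy sketch of SRSWOR, SRSWR, and stratified sampling; the population of 30 and the strata split are illustrative:

import numpy as np

rng = np.random.default_rng(42)
population = np.arange(1, 31)          # e.g., 30 students numbered 1..30

# SRSWOR: simple random sample of size s = 5 without replacement.
srswor = rng.choice(population, size=5, replace=False)

# SRSWR: the same individual may be drawn more than once.
srswr = rng.choice(population, size=5, replace=True)

# Stratified sample: split into subgroups and sample from each stratum.
strata = {"youth": population[:10], "middle-aged": population[10:20],
          "senior": population[20:]}
stratified = {name: rng.choice(group, size=2, replace=False)
              for name, group in strata.items()}
print(srswor, srswr, stratified)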
DATA REDUCTION
DATA DISCRETIZATION AND CONCEPT
HIERARCHY GENERATION
 Data discretization
 used to reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals.
 Interval labels can then be used to replace actual data
values.
 A concept hierarchy for a given numerical attribute
defines a discretization of the attribute.
 Concept hierarchies can be used to reduce the data by
collecting and replacing low-level concepts (such as
numerical values for the attribute age) with higher-level
concepts (such as youth, middle-aged, or senior).
 Methods: Binning, Histogram analysis,..
DISCRETIZATION AND CONCEPT HIERARCHY
 Discretization
 reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
 Concept hierarchies
 reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
DISCRETIZATION AND CONCEPT HIERARCHY
GENERATION FOR NUMERIC DATA
 Binning (see sections before)
 Histogram analysis (see sections before)
 Clustering analysis (see sections before)
 Entropy-based discretization
 Segmentation by natural partitioning
SPECIFICATION OF A SET OF ATTRIBUTES
Concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

 country: 15 distinct values
 province_or_state: 65 distinct values
 city: 3,567 distinct values
 street: 674,339 distinct values
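A short pandas sketch of this rule, sorting attributes by their number of distinct values; the sample location table is made up:

import pandas as pd

# Hypothetical location table (values are illustrative).
df = pd.DataFrame({
    "country": ["India", "India", "India", "USA", "USA", "USA"],
    "state":   ["Kerala", "Kerala", "Kerala", "Texas", "Texas", "Ohio"],
    "city":    ["Kochi", "Kochi", "Mysuru", "Austin", "Austin", "Dayton"],
    "street":  ["MG Rd", "Marine Dr", "Palace Rd", "6th St", "Main St", "Oak St"],
})

# Order attributes by number of distinct values: fewest distinct values at the
# top of the hierarchy, most distinct values at the lowest level.
hierarchy = df.nunique().sort_values().index.tolist()
print(" > ".join(hierarchy))   # country > state > city > street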
EXERCISE
2.
 3. Suppose a group of 12 sales price records has been sorted as follows: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215. Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR, stratified sampling. Use samples of size 5 and the strata "youth," "middle-aged," and "senior."
EXERCISE
4. Using the data for age given, answer the
following:
1. Use min-max normalization to transform the value
35 for age on to the range [0.0,1.0].
2. Use z-score normalization to transform the value 35
for age, where the standard deviation of age is 12.94
years.
3. Use normalization by decimal scaling to transform
the value 35 for age.
