UNIT-3

Data Preprocessing

Data preprocessing is carried out to improve the quality of the data before mining. It includes the following techniques:
1. Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
2. Data integration: integration of multiple databases, data cubes, or files.
3. Data transformation: normalization and concept hierarchy generation.
4. Data reduction: dimensionality reduction, numerosity reduction, and data compression.

1. DATA CLEANING:
Data in the real world is dirty: there is a great deal of potentially incorrect data, e.g., due to faulty instruments, human or computer error, or transmission errors.
o incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
 e.g., Occupation=“ ” (missing data)
o noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
o inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
o Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
(i). Missing values
Data is not always available. E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data.
How to handle missing data (a short code sketch of options 3-5 follows this list):

1. Ignore the tuple: This is usually done when the class label is missing. This method is not very effective,
unless the tuple contains several attributes with missing values. It is especially poor when the percentage
of missing values per attribute varies considerably.

2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible
given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown".
4. Use the attribute mean to fill in the missing value: For example, suppose that the average income of All
Electronics customers is $28,000. Use this value to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if
classifying customers according to credit risk, replace the missing value with the average income value for
customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with inference-based
tools using a Bayesian formalism or decision tree induction. For example, using the other customer attributes
in your data set, you may construct a decision tree to predict the missing values for income.
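The last few options are straightforward to express in code. A minimal sketch (not from the notes) using pandas, with a hypothetical table holding an income attribute and a credit_risk class label:

# Hypothetical data: "income" has missing values, "credit_risk" is the class label.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [28000, np.nan, 45000, np.nan, 31000, 52000],
    "credit_risk": ["low", "low", "high", "high", "low", "high"],
})

# Option 3: fill with a global constant (here a sentinel standing in for "Unknown")
df["income_const"] = df["income"].fillna(-1)

# Option 4: fill with the attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Option 5: fill with the mean of samples in the same class (credit_risk category)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean"))

print(df)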
(ii). Noisy data
Noise is a random error or variance in a measured variable. Data may contain incorrect attribute values due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
Other data problems which require data cleaning are
 duplicate records
 incomplete data
 inconsistent data
How to handle noisy data:
1. Binning methods:
Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of 'buckets', or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
In the example below, the data for price are first sorted and partitioned into equi-depth bins (of depth 3). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9, so each original value in this bin is replaced by 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value. A short code sketch follows the worked example.
(i).Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(ii). Partition into (equi-depth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
(iii).Smoothing by bin means:

Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
(iv).Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
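The worked example above can be reproduced with a short Python sketch (the prices and the depth of 3 come from the example; everything else is plain Python):

# Smoothing by bin means and by bin boundaries for the sorted price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]          # already sorted
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of min/max of its bin
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]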
2. Clustering:
Outliers may be detected by clustering, where similar values are organized into groups or clusters.
Intuitively, values which fall outside of the set of clusters may be considered outliers.
Figure: Outliers may be detected by clustering analysis.
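A hedged sketch of the idea (the data, the choice of k-means from scikit-learn, and the 3-standard-deviation cutoff are illustrative assumptions, not part of the notes):

# Flag points that lie unusually far from their nearest cluster centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)),     # cluster around (0, 0)
               rng.normal(8, 1, (50, 2)),     # cluster around (8, 8)
               [[20.0, 20.0]]])               # one obvious outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

cutoff = dist.mean() + 3 * dist.std()
print(np.where(dist > cutoff)[0])             # index of the outlying point(s)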

3. Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection. In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label. Outlier patterns may be informative or "garbage". Patterns whose surprise content is above a threshold are output to a list. A human can then sort through the patterns in the list to identify the actual garbage ones.
4. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface.
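A minimal numpy sketch of smoothing by simple linear regression (the x and y values are made up for illustration):

# Fit the "best" line y = slope * x + intercept and replace noisy y by fitted values.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])     # noisy, roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)        # least-squares line fit
y_smoothed = slope * x + intercept
print(y_smoothed)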
(iii). Inconsistent data
There may be inconsistencies in the data recorded for some transactions. Some data inconsistencies may be
corrected manually using external references. For example, errors made at data entry may be corrected by
performing a paper trace. This may be coupled with routines designed to help correct the inconsistent use of
codes. Knowledge engineering tools may also be used to detect the violation of known data constraints.
2. DATA INTEGRATION:
Data integration combines data from multiple sources into a coherent store, as in a data warehouse. There are a number of issues to consider during data integration:
1. Schema integration: Integrate metadata from different sources, e.g., A.cust-id ≡ B.cust-#

2. Entity identification problem: Identify real world entities from multiple data sources, e.g., Bill Clinton
= William Clinton
3. Detecting and resolving data value conflicts: For the same real world entity, attribute values from
different sources are different.
Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration:
Redundant data often occur when multiple databases are integrated, for example:
 Object identification: The same attribute or object may have different names in different
databases
 Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual
revenue
Redundant attributes may be detected by correlation analysis and covariance analysis. Careful integration of the data from multiple sources can help reduce or avoid redundancies and inconsistencies and improve mining speed and quality.
Correlation Analysis (Nominal Data):
The χ² (chi-square) test compares observed and expected counts: χ² = Σ (observed − expected)² / expected. The larger the χ² value, the more likely the variables are related. The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count (a code sketch is given below). Note that correlation does not imply causality:
 # of hospitals and # of car thefts in a city are correlated
 both are causally linked to a third variable: population
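A hedged sketch of the chi-square test using SciPy (the contingency table counts are illustrative only):

# Chi-square test of independence between two nominal attributes.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: gender (male, female); columns: preferred reading (fiction, non-fiction).
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}, dof = {dof}")
# A large chi2 (small p) suggests the attributes are related -- though, as noted
# above, correlation still does not imply causality.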

3. DATA TRANSFORMATION:
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data
transformation can involve the following:
A. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -
1.0 to 1.0, or 0 to 1.0.
There are three main methods for data normalization: min-max normalization, z-score normalization, and normalization by decimal scaling. A combined code sketch follows their descriptions.
(i). Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing

v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
(ii). In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing

v' = (v − meanA) / std_devA

where meanA and std_devA are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers which dominate the min-max normalization.

(iii). Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
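A combined numpy sketch of the three normalization methods just described (the attribute values are illustrative):

import numpy as np

v = np.array([73600.0, 54000.0, 600.0, -1000.0, 98700.0])   # illustrative attribute values

# (i) min-max normalization to the new range [0, 1]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# (ii) z-score normalization
zscore = (v - v.mean()) / v.std()

# (iii) decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / (10 ** j)

print(minmax, zscore, decimal, sep="\n")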

B. Smoothing, which works to remove the noise from data. Such techniques include binning, clustering, and regression (these were already covered under noisy data in data cleaning).
C. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily
sales data may be aggregated so as to compute monthly and annual total amounts.
D. Generalization of the data, where low level or 'primitive' (raw) data are replaced by higher
level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can
be generalized to higher level concepts, like city or county.
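As an illustration of item C (aggregation) above, here is a small pandas sketch that rolls invented daily sales figures up to monthly totals:

import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2010-01-01", periods=90, freq="D"),
    "sales": range(90),                          # made-up daily sales figures
})

# Aggregate daily sales to monthly totals
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)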
4. DATA REDUCTION:
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much
smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data
set should be more efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following.
Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube. We aggregate the data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse. Data cubes store multiple levels of aggregation, so referencing the appropriate level further reduces the size of the data to deal with; use the smallest representation that is sufficient to solve the task.

1. Dimension reduction, where irrelevant, weakly relevant or redundant attributes or dimensions may be
detected and removed.
A. Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features. This reduces the number of patterns discovered, making them easier to understand.
B. Heuristic methods:
I. Step-wise forward selection: The procedure starts with an empty set of attributes. The best of the
original attributes is determined and added to the set. At each subsequent iteration or step, the best of
the remaining original attributes is added to the set.

II. Step-wise backward elimination: The procedure starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.

III. Combined forward selection and backward elimination: The step-wise forward selection and backward elimination methods can be combined, so that at each step the procedure selects the best attribute and removes the worst from among the remaining attributes.
IV. Decision tree induction: Decision tree algorithms, such as ID3 and C4.5, were originally intended for classification. Decision tree induction constructs a flow-chart-like structure where each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the "best" attribute to partition the data into individual classes.
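A hedged sketch of using a decision tree for attribute subset selection: scikit-learn's CART implementation stands in for ID3/C4.5 here, and the Iris data set is used only as a convenient example.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes that actually appear in the tree (non-zero importance) form the subset;
# the rest are treated as irrelevant or redundant and can be dropped.
selected = [name for name, imp in zip(feature_names, tree.feature_importances_)
            if imp > 0]
print(selected)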

2. Data compression, where encoding mechanisms are used to reduce the data set size. In data compression, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data compression technique used is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data compression technique is called lossy. Two popular and effective methods of lossy data compression are wavelet transforms and principal components analysis.
I. Wavelet transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector, D', of wavelet coefficients. The two vectors are of the same length.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving
sines and cosines. In general, however, the DWT achieves better lossy compression.
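A minimal hand-rolled numpy sketch of one level of the Haar DWT (an illustration, not a full wavelet library; the input vector is made up):

import numpy as np

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])   # length must be even

avg = (x[0::2] + x[1::2]) / np.sqrt(2)    # approximation (smooth) coefficients
diff = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients

x_dwt = np.concatenate([avg, diff])       # same length as x, as stated above
print(x_dwt)
# Lossy compression keeps only the largest-magnitude coefficients and zeroes the rest.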

II. Principal components analysis:
Principal components analysis (PCA) searches for c k-dimensional orthogonal vectors that can best be used
to represent the data, where c << N. The original data is thus projected onto a much smaller space, resulting
in data compression. PCA can be used as a form of dimensionality reduction. The initial data can then be
projected onto this smaller set.
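A minimal scikit-learn sketch of PCA-based reduction (the data, the number of components, and the variable names are illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 tuples in 5 dimensions

pca = PCA(n_components=2)                # keep c = 2 principal components
X_reduced = pca.fit_transform(X)         # project onto the smaller space

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component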

3. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data), or nonparametric methods such as clustering, sampling, and the use of histograms.
I. Multiple regression & Log-linear models:
Multiple regression is an extension of linear regression allowing a response variable Y to be modeled as a
linear function of a multidimensional feature vector.
Log-linear models approximate discrete multidimensional probability distributions. The method can be used
to estimate the probability of each cell in a base cuboid for a set of discretized attributes, based on the
smaller cuboids making up the data cube lattice.
II. Histograms:
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a bucket typically reflects the average frequency of the values represented by the bucket. Common bucketing rules are listed below, followed by a short code sketch.

 Equi-width: In an equi-width histogram, the width of each bucket range is constant (for example, a width of $10 per bucket).

 Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created so that, roughly,
the frequency of each bucket is constant (that is, each bucket contains roughly the same number of
contiguous data samples).
 V-Optimal: If we consider all of the possible histograms for a given number of buckets, the V-
optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the
original values that each bucket represents, where bucket weight is equal to the number of values in
the bucket.
 MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is established between the pairs having the β−1 largest differences, where β is the user-specified number of buckets.
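A small numpy sketch contrasting equi-width and equi-depth buckets (the price values are illustrative):

import numpy as np

prices = np.array([4, 8, 9, 10, 12, 14, 15, 15, 18, 21, 21, 24, 25, 26, 28, 30, 34, 40])

# Equi-width: 3 buckets with equal value ranges
counts, edges = np.histogram(prices, bins=3)
print("equi-width edges:", edges, "counts:", counts)

# Equi-depth: bucket boundaries at quantiles, so each bucket holds roughly the same count
q_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
q_counts, _ = np.histogram(prices, bins=q_edges)
print("equi-depth edges:", q_edges, "counts:", q_counts)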
III. Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters. Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function. The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, and is defined as the average distance of each cluster object from the cluster centroid.

IV. Sampling
Sampling can be used as a data reduction technique since it allows a large data set to be represented by a
much smaller random sample (or subset) of the data. Suppose that a large data set, D, contains N tuples. Let us look at some possible samples for D; a short code sketch follows the list.
 Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
 Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR, except
that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is
drawn, it is placed back in D so that it may be drawn again.
 Cluster sample: If the tuples in D are grouped into M mutually disjoint "clusters" (for example, the pages on which the tuples are stored), then an SRS of m clusters can be obtained, where m < M. A reduced data representation can be obtained by applying, say, SRSWOR to the clusters, resulting in a cluster sample of the tuples.
 Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group.
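A minimal sketch of SRSWOR, SRSWR, and stratified sampling with Python's random module (the data set D and the age-group strata are invented for illustration):

import random
from collections import defaultdict

random.seed(0)
# 100 tuples of (customer id, age group)
D = [("cust%03d" % i, random.choice(["<=30", "31..50", ">50"])) for i in range(100)]
n = 10

srswor = random.sample(D, n)                    # without replacement
srswr = [random.choice(D) for _ in range(n)]    # with replacement: tuples may repeat

# Stratified sample: an SRS drawn from each stratum (age group), proportional to its size
strata = defaultdict(list)
for tup in D:
    strata[tup[1]].append(tup)
stratified = [t for group in strata.values()
              for t in random.sample(group, max(1, round(n * len(group) / len(D))))]

print(len(srswor), len(srswr), len(stratified))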

Three types of attributes
 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic rank
 Numeric—quantitative values, e.g., integer or real numbers
5. DATA DISCRETIZATION:
Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges
or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of abstraction, and
are a powerful tool for data mining.
Discretization divides the range of a continuous attribute into intervals (a short code sketch follows this list):
o Interval labels can then be used to replace actual data values
o Reduce data size by discretization
o Supervised vs. unsupervised
o Split (top-down) vs. merge (bottom-up)
o Discretization can be performed recursively on an attribute
o Prepare for further analysis, e.g., classification
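A small pandas sketch of unsupervised discretization, replacing raw values by interval labels (the ages and the labels are illustrative):

import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 25, 30, 33, 35, 35, 36, 40, 45, 46, 52, 70])

# Equal-width intervals (split the value range into 3 equally wide intervals)
equal_width = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])

# Equal-frequency intervals (each interval holds roughly the same number of values)
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))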
