Data Reduction Techniques

Dimensionality reduction techniques like principal component analysis (PCA) and wavelet transforms can reduce large, complex datasets while maintaining most of the original information. PCA transforms the data into a new set of orthogonal variables ordered by how much of the variance they capture. Most of the variance can be retained using just the strongest principal components, reducing dimensionality. Wavelet transforms represent data as wavelet coefficients, most of which can be removed by truncation since the strongest coefficients contain the most information. Attribute subset selection identifies and removes irrelevant or redundant attributes to reduce the dataset size. Greedy search methods are commonly used to explore the space of attribute subsets efficiently.


Data Reduction
• Complex data analysis and mining on huge amounts of data can take a
long time, making such analysis impractical or infeasible.
• Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data.
• That is, analysis on the reduced data set should be more efficient yet
produce the same (or almost the same) analytical results
Overview of Data Reduction Strategies
• Data reduction strategies include:
  • Dimensionality reduction: wavelet transformation, principal component analysis, and attribute subset selection
  • Numerosity reduction: regression and log-linear models, histograms, clustering, sampling, and data cube generation
  • Data compression: lossy and lossless techniques
Dimensionality reduction
• is the process of reducing the number of random variables or attributes under consideration.
• Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space.
• Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
Numerosity reduction techniques
• Numerosity reduction techniques replace the original data volume by
alternative, smaller forms of data representation.
• These techniques may be parametric or nonparametric.
• For parametric methods, a model is used to estimate the data, so that
typically only the data parameters need to be stored, instead of the
actual data. (Outliers may also be stored.)
• Regression and log-linear models are examples.
• Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.
Data Compression
• In data compression, transformations are applied so as to obtain a
reduced or “compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data
without any information loss, the data reduction is called lossless.
• If, instead, we can reconstruct only an approximation of the original
data, then the data reduction is called lossy.
• There are several lossless algorithms for string compression; however,
they typically allow only limited data manipulation.
• Dimensionality reduction and numerosity reduction techniques can
also be considered forms of data compression.
Dimensionality Reduction
• Wavelet Transformation
• Principal Component Analysis
• Attribute Subset Selection
Wavelet Transform
• The discrete wavelet transform (DWT) is a linear signal processing
technique that, when applied to a data vector X, transforms it to a
numerically different vector, X’ , of wavelet coefficients.
• The two vectors are of the same length.
• When applying this technique to data reduction, we consider each
tuple as an n-dimensional data vector, that is, X = (x1,x2,...,xn),
depicting n measurements made on the tuple from n datafile
attributes.
“How can this technique be useful for data reduction if
the wavelet transformed data are of the same length
as the original data?”
• The usefulness lies in the fact that the wavelet transformed data can
be truncated.
• A compressed approximation of the data can be retained by storing
only a small fraction of the strongest of the wavelet coefficients.
• For example, all wavelet coefficients larger than some user-specified
threshold can be retained.
• All other coefficients are set to 0.
• The resulting data representation is therefore very sparse, so that
operations that can take advantage of data sparsity are
computationally very fast if performed in wavelet space.
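As an illustration of this truncation idea, here is a minimal sketch using the PyWavelets package (an assumption; the slides name no library), where the weak coefficients of one data tuple are zeroed out and an approximation is reconstructed. The wavelet choice, the threshold, and the sample values are illustrative.

```python
# Hedged sketch: DWT-based reduction by coefficient truncation (assumes PyWavelets).
import numpy as np
import pywt

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])   # one n-dimensional data tuple

coeffs = pywt.wavedec(x, "haar")                # wavelet coefficients, same total length as x

# Keep only coefficients whose magnitude exceeds a user-specified threshold; set the rest to 0.
threshold = 1.0                                 # illustrative value
coeffs_trunc = [np.where(np.abs(c) >= threshold, c, 0.0) for c in coeffs]

# The sparse coefficient vector still allows an approximate reconstruction of the tuple.
x_approx = pywt.waverec(coeffs_trunc, "haar")
```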
Principal Components Analysis
• In this subsection we provide an intuitive introduction to principal components
analysis as a method of dimensionality reduction.
• Suppose that the data to be reduced consist of tuples or data vectors described
by n attributes or dimensions.
• Principal components analysis (PCA; also called the Karhunen-Loeve, or K-L,
method)
• searches for k n-dimensional orthogonal vectors that can best be used to represent the data,
where k ≤ n.
• The original data are thus projected onto a much smaller space, resulting in dimensionality
reduction.
• Unlike attribute subset selection (Section 3.4.4), which reduces the attribute set size by
retaining a subset of the initial set of attributes, PCA “combines” the essence of attributes by
creating an alternative, smaller set of variables.
• The initial data can then be projected onto this smaller set.
• PCA often reveals relationships that were not previously suspected and thereby allows
interpretations that would not ordinarily result.
Steps in PCA
1. The input data are normalized, so that each attribute falls within the
same range. This step helps ensure that attributes with large domains
will not dominate attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the
normalized input data. These are unit vectors that each point in a
direction perpendicular to the others. These vectors are referred to as
the principal components. The input data are a linear combination of the
principal components.
3. The principal components are sorted in order of decreasing
“significance” or strength. The principal components essentially serve as
a new set of axes for the data providing important information about
variance. That is, the sorted axes are such that the first axis shows the
most variance among the data, the second axis shows the next highest
variance, and so on.
• The figure (not reproduced here) shows the first two principal components, Y1 and Y2, for the given set of data originally mapped to the axes X1 and X2.
• This information helps identify groups or patterns within the data.

Because the components are sorted in decreasing order of “significance,” the data size can be reduced
by eliminating the weaker components, that is, those with low variance. Using the strongest principal
components, it should be possible to reconstruct a good approximation of the original data.
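The steps above can be sketched directly with NumPy. This is only an illustrative implementation (eigendecomposition of the covariance matrix of the normalized data), not code from the slides; the function name, variable names, and the choice of k are assumptions.

```python
# Hedged sketch of the PCA steps: normalize, find orthonormal components, sort, keep the top k.
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the k strongest principal components."""
    # 1. Normalize each attribute (z-score) so large-domain attributes do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Orthonormal components are the eigenvectors of the covariance matrix.
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 3. Sort components by decreasing variance ("significance") and keep the strongest k.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    return Z @ components            # reduced representation, shape (n_samples, k)

# Example: reduce 5-dimensional data to its 2 strongest components.
X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced = pca_reduce(X, k=2)
```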
PCA
• PCA can be applied to
• ordered and unordered attributes,
• and can handle sparse data and skewed data.
• Multidimensional data of more than two dimensions can be handled by
reducing the problem to two dimensions.
• Principal components may be used as inputs to multiple regression
and cluster analysis.
• In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.
Attribute Subset Selection
• Data sets for analysis may contain hundreds of attributes, many of which
may be irrelevant to the mining task or redundant.
• For example, if the task is to classify customers based on whether or not
they are likely to purchase a popular new CD at AllElectronics when
notified of a sale, attributes such as the customer’s telephone number are
likely to be irrelevant, unlike attributes such as age or music taste
• Attribute subset selection reduces the data set size by removing irrelevant
or redundant attributes (or dimensions).
• The goal of attribute subset selection is to find a minimum set of attributes
such that the resulting probability distribution of the data classes is as
close as possible to the original distribution obtained using all attributes.
How can we find a ‘good’ subset of the
original attributes?”
• For n attributes, there are 2^n possible subsets.
• An exhaustive search for the optimal subset of attributes can be
prohibitively expensive, especially as n and the number of data classes
increase.
• Therefore, heuristic methods that explore a reduced search space are
commonly used for attribute subset selection.
• These methods are typically greedy in that, while searching through
attribute space, they always make what looks to be the best choice at the
time.
• Their strategy is to make a locally optimal choice in the hope that this will
lead to a globally optimal solution.
• Such greedy methods are effective in practice and may come close to
estimating an optimal solution.
• The “best” (and “worst”) attributes are typically determined using
tests of statistical significance, which assume that the attributes are
independent of one another.
• Many other attribute evaluation measures can be used such as the
information gain measure used in building decision trees for
classification.
1. Stepwise forward selection:
• The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set (a sketch of this greedy loop follows the list).
2. Stepwise backward elimination:
• The procedure starts with the full set of attributes. At each step, it removes the worst attribute
remaining in the set.
3. Combination of forward selection and backward elimination:
• The stepwise forward selection and backward elimination methods can be combined so that, at
each step, the procedure selects the best attribute and removes the worst from among the
remaining attributes.
4. Decision tree induction:
• Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification.
Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node
denotes a test on an attribute, each branch corresponds to an outcome of the test, and each
external (leaf) node denotes a class prediction. At each node, the algorithm chooses the “best”
attribute to partition the data into individual classes.
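A minimal sketch of the greedy forward-selection loop is given below, assuming scikit-learn is available for the evaluation measure (here, cross-validated accuracy of a decision tree). The model, the scoring choice, and the stopping rule are illustrative assumptions, not prescribed by the slides.

```python
# Hedged sketch: stepwise forward selection using a greedy, locally optimal choice at each step.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_attrs=None):
    """Greedily add the attribute that most improves cross-validated accuracy."""
    n_attrs = X.shape[1]
    max_attrs = max_attrs or n_attrs
    selected, best_score = [], -np.inf
    while len(selected) < max_attrs:
        candidates = [j for j in range(n_attrs) if j not in selected]
        scores = {j: cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:      # stop when no candidate improves the measure
            break
        selected.append(j_best)
        best_score = scores[j_best]
    return selected
```

Stepwise backward elimination is the mirror image: start from the full attribute set and repeatedly drop the attribute whose removal hurts the measure least.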
Attribute construction method
• When decision tree induction is used for attribute subset selection, a tree is
constructed from the given data. All attributes that do not appear in the tree are
assumed to be irrelevant. The set of attributes appearing in the tree form the
reduced subset of attributes.
• The stopping criteria for the methods may vary.
• The procedure may employ a threshold on the measure used to determine when
to stop the attribute selection process.
• In some cases, we may want to create new attributes based on others.
• Such attribute construction can help improve accuracy and understanding of
structure in high dimensional data. For example, we may wish to add the
attribute area based on the attributes height and width. By combining attributes,
attribute construction can discover missing information about the relationships
between data attributes that can be useful for knowledge discovery.
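For instance, the area example can be expressed in one line with pandas (an assumption of the sketch; the column names are illustrative):

```python
# Hedged sketch: constructing a new attribute from existing ones (assumes pandas).
import pandas as pd

df = pd.DataFrame({"height": [2.0, 3.5, 4.0], "width": [1.0, 2.0, 2.5]})
df["area"] = df["height"] * df["width"]   # derived attribute combining height and width
```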
Numerosity Reduction
• Parametric Model
• Linear Regression
• Log Linear Model
• Non Parametric
• Histogram
• Clustering
• Sampling
• Data Cube Generation
Regression and Log-Linear Models:
Parametric Data Reduction
• Regression and log-linear models can be used to approximate the given
data.
• In (simple) linear regression, the data are modeled to fit a straight line.
• For example, a random variable, y (called a response variable), can be
modeled as a linear function of another random variable, x (called a
predictor variable), with the equation
• y = wx + b
• where the variance of y is assumed to be constant.
• In the context of data analysis, x and y are numeric database attributes.
The coefficients, w and b (called regression coefficients), specify the slope
of the line and the y-intercept, respectively.
Linear Regression
• These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimated line.
• Multiple linear regression is an extension of (simple) linear regression,
which allows a response variable, y, to be modeled as a linear
function of two or more predictor variables.
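As a sketch of parametric reduction, the fitted coefficients can stand in for the raw (x, y) pairs. The least-squares fit below uses NumPy (an assumption) on illustrative data.

```python
# Hedged sketch: simple linear regression as parametric data reduction (assumes NumPy).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # predictor attribute (illustrative values)
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])     # response attribute (illustrative values)

w, b = np.polyfit(x, y, deg=1)               # least-squares slope and y-intercept
y_hat = w * x + b                            # the model approximates y; only w and b need storing
```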
Log Linear model
• Log-linear models approximate discrete multidimensional probability
distributions.
• Given a set of tuples in n dimensions (e.g., described by n attributes), we can
consider each tuple as a point in an n-dimensional space.
• Log-linear models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on a smaller
subset of dimensional combinations.
• This allows a higher-dimensional data space to be constructed from lower-
dimensional spaces.
• Log-linear models are therefore also useful for dimensionality reduction (since
the lower-dimensional points together typically occupy less space than the
original data points) and data smoothing (since aggregate estimates in the lower-
dimensional space are less subject to sampling variations than the estimates in
the higher-dimensional space).
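A minimal sketch of the simplest log-linear model, mutual independence, is shown below: the joint cell probabilities of a two-attribute discretized space are estimated from the one-dimensional marginals, so only the lower-dimensional tables need to be stored. The contingency table is illustrative, and NumPy is an assumed dependency.

```python
# Hedged sketch: independence log-linear model estimated from lower-dimensional marginals.
import numpy as np

counts = np.array([[30.0, 10.0],
                   [20.0, 40.0]])            # illustrative 2x2 contingency table
p = counts / counts.sum()                    # observed joint probabilities

p_row = p.sum(axis=1)                        # marginal of the first attribute
p_col = p.sum(axis=0)                        # marginal of the second attribute
p_hat = np.outer(p_row, p_col)               # estimated joint probabilities under independence
```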
• Regression and log-linear models
• can both be used on sparse data, although their application may be limited.
• While both methods can handle skewed data, regression does exceptionally
well.
• Regression can be computationally intensive when applied to high-
dimensional data,
• whereas log-linear models show good scalability for up to 10 or so dimensions
Histograms
• Histograms use binning to approximate data distributions and are a
popular form of data reduction.
• A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, referred to as buckets or bins.
• If each bucket represents only a single attribute–value/frequency pair,
the buckets are called singleton buckets.
• Often, buckets instead represent continuous ranges for the given
attribute.
• The following data are a list of AllElectronics prices for commonly sold
items (rounded to the nearest dollar).
• The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10,
12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30,
30.
• Figure 3.7 shows a histogram for the data using singleton buckets.
• To further reduce the data, it is common to have each bucket denote
a continuous value range for the given attribute.
• In Figure 3.8, each bucket represents a different $10 range for price
“How are the buckets determined and the
attribute values partitioned?”
• Equal-width:
• Equal-width histogram, the width of each bucket range is uniform (e.g., the width of
$10 for the buckets ).
• Equal-frequency (or equal-depth):
• Equal-frequency histogram,
• the buckets are created so that, roughly, the frequency of each bucket is constant
(i.e., each bucket contains roughly the same number of contiguous data samples)
Histograms are highly effective at approximating
• both sparse and dense data,
• as well as highly skewed and uniform data.
• The histograms described before for single attributes can be extended for multiple
attributes.
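The two bucketing schemes can be sketched with NumPy on the price list above; the number of buckets is an illustrative choice.

```python
# Hedged sketch: equal-width vs. equal-frequency bucketing (assumes NumPy).
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                   15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
                   20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
                   28, 28, 30, 30, 30])

# Equal-width: every bucket spans the same price range.
eq_width_counts, eq_width_edges = np.histogram(prices, bins=3)

# Equal-frequency (equal-depth): edges at quantiles, so buckets hold roughly equal counts.
eq_freq_edges = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
eq_freq_counts, _ = np.histogram(prices, bins=eq_freq_edges)
```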
Clustering
• Clustering techniques consider data tuples as objects.
• They partition the objects into groups, or clusters, so that objects within a
cluster are “similar” to one another and “dissimilar” to objects in other
clusters.
• Similarity is commonly defined in terms of how “close” the objects are in
space, based on a distance function.
• The “quality” of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster.
• Centroid distance is an alternative measure of cluster quality and is defined
as the average distance of each cluster object from the cluster centroid
(denoting the “average object,” or average point in space for the cluster).
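A minimal sketch of clustering-based reduction, assuming scikit-learn's KMeans (the slides do not prescribe an algorithm): the tuples are replaced by a handful of centroids, and the centroid distance described above is computed for one cluster.

```python
# Hedged sketch: replacing data tuples by cluster centroids (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 2))      # illustrative 2-D data tuples
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

centroids = km.cluster_centers_                          # reduced representation of the data
# Centroid distance of cluster 0: average distance of its members from the centroid.
members = X[km.labels_ == 0]
centroid_dist = np.linalg.norm(members - centroids[0], axis=1).mean()
```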
Sampling
• Sampling can be used as a data reduction technique because it allows a large data
set to be represented by a much smaller random data sample (or subset).
• Suppose that a large data set, D, contains N tuples. Let’s look at the most
common ways that we could sample D for data reduction,
• Simple random sample without replacement (SRSWOR) of size s:
• This is created by drawing s of the N tuples from D (s < N), where the probability of drawing
any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
• Simple random sample with replacement (SRSWR) of size s:
• This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and
then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn
again.
• Cluster sample:
• If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters
can be obtained, where s < M. For example, tuples in a database are usually retrieved a page
at a time, so that each page can be considered a cluster.
• Stratified sample:
• If D is divided into mutually disjoint parts called strata, a stratified sample of D
is generated by obtaining an SRS at each stratum.
• This helps ensure a representative sample, especially when the data are
skewed.
• For example, a stratified sample may be obtained from customer data, where
a stratum is created for each customer age group.
• In this way, the age group having the smallest number of customers will be sure to be represented.
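These sampling schemes can be sketched with NumPy's random generator (an assumption); the data set D, the sample size s, and the two strata are illustrative.

```python
# Hedged sketch: SRSWOR, SRSWR, and a proportional stratified sample (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
D = np.arange(1000)                  # stand-in for a data set of N = 1000 tuples
s = 50                               # requested sample size

srswor = rng.choice(D, size=s, replace=False)   # simple random sample without replacement
srswr  = rng.choice(D, size=s, replace=True)    # simple random sample with replacement

# Stratified sample: an SRS drawn from each stratum (two illustrative strata here).
strata = {"young": D[:300], "older": D[300:]}
stratified = np.concatenate(
    [rng.choice(part, size=max(1, int(s * len(part) / len(D))), replace=False)
     for part in strata.values()])
```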
• An advantage of sampling for data reduction is that the cost of
obtaining a sample is proportional to the size of the sample, s, as
opposed to N, the data set size.
• Hence, sampling complexity is potentially sublinear to the size of the
data.
• Other data reduction techniques can require at least one complete
pass through D. For a fixed sample size, sampling complexity increases
only linearly as the number of data dimensions, n, increases, whereas
techniques using histograms, for example, increase exponentially in n.
Data Cube Aggregation
• Imagine that you have collected the data for your
analysis.
• These data consist of the AllElectronics sales per
quarter, for the years 2008 to 2010.
• You are, however, interested in the annual sales
(total per year), rather than the total per quarter.
• Thus, the data can be aggregated so that the
resulting data summarize the total sales per year
instead of per quarter.
• The resulting data set is smaller in volume, without
loss of information necessary for the analysis task
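A minimal pandas sketch of this roll-up from quarterly to annual sales is shown below; the library choice and the sales figures are illustrative assumptions.

```python
# Hedged sketch: aggregating quarterly sales to annual totals (assumes pandas).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 301, 298, 412, 515],   # illustrative values
})

# Roll up to the coarser level of abstraction: total sales per year.
annual = sales.groupby("year", as_index=False)["sales"].sum()
```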
Data Cube
• Data cubes store multidimensional aggregated information. For example, Figure 3.11 shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each AllElectronics branch.
• Each cell holds an aggregate data value, corresponding to the data point in multidimensional space.
• Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple abstraction levels. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address.
• Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing.
Data Cube
• The cube created at the lowest abstraction level is referred to as the base
cuboid.
• The base cuboid should correspond to an individual entity of interest such
as sales or customer
• This lowest level should be usable, or useful for the analysis.
• A cube at the highest level of abstraction is the apex cuboid.
• For the sales data in the figure, the apex cuboid would give one total: the total sales for all three years, for all item types, and for all branches.
• Data cubes created for varying levels of abstraction are often referred to as
cuboids, so that a data cube may instead refer to a lattice of cuboids.
• Each higher abstraction level further reduces the resulting data size.
Question 1
• Suppose that a hospital tested the age and body fat data for 18
randomly selected adults with the following results:

• (a) Calculate the mean, median, and standard deviation of age and
%fat.
• (b) Draw the boxplots for age and %fat.
Questions 2
• Using the data for age and body fat given in question 1, answer the
following:
• (a) Normalize the two attributes based on z-score normalization.
• (b) Calculate the correlation coefficient (Pearson’s product moment
coefficient).
• Are these two attributes positively or negatively correlated? Compute
their covariance.
Question 3
• Exercise 2.2
• gave the following data (in increasing order) for the attribute age: 13,
15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
35, 36, 40, 45, 46, 52, 70.
• (a) Use smoothing by bin means to smooth these data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given
data.
• (b) How might you determine outliers in the data?
• (c) What other methods are there for data smoothing?
