DATA ANALYTICS
UE22CS342AA2
UNIT-1
Lecture 5 : Data Preprocessing – Data Integration and Reduction
Gowri Srinivasa
Department of Computer Science and Engineering
Slides collated by:
Nishanth M S, PESU-2023, Department of CSE, [email protected]
Harshitha Srikanth, PESU-2024, Department of CSE, [email protected]
Karthik Namboori, VII Sem, PESU, Department of CSE, [email protected]
Slides excerpted from: Data Mining : Concepts and Techniques by Han, Kamber and Pei, 3rd Edition
With grateful thanks for contribution of slides to:
Dr. Mamatha H R, Professor at the Department of CSE, PESU
Data Integration
• Data analysis often requires data integration – the merging of data from
multiple data stores into a coherent store.
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting dataset. This can help improve the accuracy
and speed of the subsequent data analysis process.
• The semantic heterogeneity and structure of data pose great challenges in
data integration.
• How can we match schema and objects from different sources?
• Schema Integration!
• Example : How can a data analyst be sure that the attribute
customer_id in table A and customer_number in table B refer to the
same attribute?
• With the help of metadata! It provides all available information regarding the attributes, thus helping ensure error-free schema integration.
Data Integration
• Entity identification problem : Identify real world entities from multiple
data sources. Example : Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources
are different.
• Possible reasons : different representations, different scales, example –
metric vs British units
• During integration , special attention must be paid to the structure of the
data. This is to ensure that any attribute functional dependencies and
referential constraints in the source system match those in the target
system. For example , in one system, a discount may be applied to the
entire order whereas in another system , it is applied to each individual line
item. If this is not caught before integration, items in the target system
may be improperly discounted.
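A small illustration (with hypothetical prices and a hypothetical threshold rule) of why the discount mismatch matters: a flat order-level discount and a conditional line-level discount produce different totals for the same order.

```python
# Hypothetical order: two line items.
# System 1 applies a 10% discount to the whole order.
# System 2 applies 10% only to line items priced above 100.
items = [80.0, 150.0]

order_level = sum(items) * 0.9                               # 10% off the order total
line_level = sum(p * 0.9 if p > 100 else p for p in items)   # per-line threshold rule

print(order_level)  # ~207.0
print(line_level)   # ~215.0 -- the two systems disagree on the same order
```

If this mismatch is not caught before integration, the target system silently mixes two incompatible discount semantics.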
Redundancy in Data Integration
Redundant data often occur during the integration of multiple databases.
• Object identification : The same attribute or object may have different
names in different databases which causes redundancy.
• Derivable data : An attribute may be redundant if it can be derived from
another attribute or set of attributes. For example , annual revenue can be
derived from monthly revenue.
• Some redundancies can be detected by correlation analysis.
• For nominal data, the χ² (chi-square) test is employed.
• For numeric data, the correlation coefficient and covariance are used.
χ² (chi-square) test
χ² (chi-square) test for independence of two variables in a contingency table
• Null hypothesis : The two variables are independent.
• Alternate hypothesis : The two variables are not independent.
χ² = Σ (Observed − Expected)² / Expected
• Expected stands for what we would expect if the null hypothesis were true.
• The larger the value of χ², the more likely it is that the variables are correlated.
• The cells that contribute the most to the χ² value are those whose actual count differs most from the expected count.
• The test can be used for categorical variables whose entries are counts, not percentages or fractions (10% of 100 must be entered as 10).
• Correlation does not imply causation.
▪ The number of hospitals and the number of car thefts in a city may appear to be correlated. Both are causally linked to a third variable: population.
χ² (chi-square) Example
Actual (observed) distribution:
                           Play chess   Not play chess   Sum (row)
Like science fiction           250            200           450
Not like science fiction        50           1000          1050
Sum (col.)                     300           1200          1500

Expected distribution (under independence):
                           Play chess   Not play chess   Sum (row)
Like science fiction            90            360           450
Not like science fiction       210            840          1050
Sum (col.)                     300           1200          1500
• χ² (chi-square) calculation:
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• Degrees of freedom, k = (no_of_rows − 1)(no_of_columns − 1) = 1
• The critical value read from the χ² table (0.001 significance level, 1 degree of freedom) is 10.828. Since 507.93 > 10.828, we reject the null hypothesis.
• This shows that like_science_fiction and play_chess are correlated.
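The calculation above can be reproduced directly from the formula; a minimal Python sketch (table values taken from the example, expected counts derived from the row and column totals):

```python
# Observed 2x2 contingency table from the example:
# rows = {like sci-fi, not like sci-fi}, cols = {play chess, not play chess}
observed = [[250, 200], [50, 1000]]

row_sums = [sum(row) for row in observed]        # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]  # [300, 1200]
total = sum(row_sums)                            # 1500

# Expected count under independence: (row total * column total) / grand total
expected = [[r * c / total for c in col_sums] for r in row_sums]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

print(round(chi2, 2))  # 507.94 (the slide rounds this down to 507.93)
```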
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson's product-moment coefficient):

r_A,B = cov(A, B) / (σ_A · σ_B) = (Σ(aᵢbᵢ) − n·Ā·B̄) / (n·σ_A·σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(aᵢbᵢ) is the sum of the AB cross-products.
• If r_A,B > 0, A and B are positively correlated, that is, A's values increase as B's do. The higher the value of the coefficient, the stronger the correlation.
• If r_A,B = 0, there is no linear correlation between A and B.
• If r_A,B < 0, A and B are negatively correlated, that is, A's values decrease as B's increase.
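The formula above can be sketched in a few lines of Python; the function name and sample data are illustrative only (B is an exact linear function of A, so r should come out to 1):

```python
import math

def pearson_r(a, b):
    """Pearson product-moment correlation (population form, as in the slide)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # population standard deviations (divide by n, matching the formula above)
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))   # sum of AB cross-products
    return (cross - n * mean_a * mean_b) / (n * std_a * std_b)

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]       # b = 2a, a perfect linear relationship
print(pearson_r(a, b))     # 1.0 (perfect positive correlation)
```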
Correlation (viewed as a linear relationship)
• Correlation measures the linear relationship between objects.
• To compute correlation , we standardize data objects A and B , and then take their
dot product.
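The "standardize, then take the dot product" view can be sketched as follows (a minimal illustration with made-up data; the mean of the element-wise product of z-scores equals Pearson's r):

```python
import math

def zscores(values):
    # standardize: subtract the mean, divide by the population std deviation
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def correlation_via_dot(a, b):
    # standardize both objects, then average their element-wise product
    za, zb = zscores(a), zscores(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

# Illustrative data (the stock prices used later in this lecture)
print(correlation_via_dot([2, 3, 5, 4, 6], [5, 8, 10, 11, 14]))  # ~0.94
```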
Visually Evaluating Correlation
[Figure: scatter plots illustrating strengths of correlation, from strong negative through zero to strong positive]
Covariance analysis (Numeric Data)
• Covariance is similar to correlation:

Cov(A, B) = E[(A − Ā)(B − B̄)] = (Σᵢ (aᵢ − Ā)(bᵢ − B̄)) / n = E(A·B) − Ā·B̄

where n is the number of tuples, Ā and B̄ are the respective mean or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B. Correlation is the covariance normalized by the standard deviations: r_A,B = Cov(A, B) / (σ_A · σ_B).
Covariance analysis (Numeric Data)
• Positive Covariance : If CovA,B > 0, then A and B both tend to be larger than
their expected values.
• Negative Covariance : If CovA,B < 0, then if A is larger than its expected value, B
is likely to be smaller than its expected value.
• Independence : If A and B are independent, Cov_A,B = 0, but the converse is not true:
some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (for example, the data follow a multivariate normal distribution) does Cov_A,B = 0 imply independence.
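A classic counterexample for "covariance 0 does not imply independence", sketched with made-up data: Y is fully determined by X (so they are clearly dependent), yet their covariance is exactly zero.

```python
def covariance(a, b):
    # population covariance: mean of the products of deviations
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

# X takes the values -1, 0, 1 with equal frequency; Y = X^2 is a function
# of X, so X and Y are NOT independent ...
x = [-1, 0, 1]
y = [v * v for v in x]   # [1, 0, 1]

# ... yet the positive and negative deviations cancel exactly:
print(covariance(x, y))  # 0.0
```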
Covariance analysis : An Example
Reference: https://mathcs.clarku.edu/~djoyce/ma217/covar.pdf
Suppose two stocks A and B have the following values in one
week : (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4×9.6 = 42.4 − 38.4 = 4
Thus, A and B rise together since Cov(A, B) > 0.
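The same computation, written out as a short Python sketch using the Cov(A, B) = E(A·B) − Ā·B̄ form from the definition:

```python
# Weekly prices of the two stocks from the example: (A, B) pairs
prices = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]
a = [p[0] for p in prices]
b = [p[1] for p in prices]

n = len(prices)
mean_a = sum(a) / n   # E(A) = 4
mean_b = sum(b) / n   # E(B) = 9.6

# Cov(A, B) = E(A*B) - E(A)*E(B)
cov_ab = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b

print(cov_ab)  # ~4.0 -> positive, so the stocks tend to rise together
```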
Tuple Duplication
• In addition to detecting redundancies between attributes, duplication should be
detected at the tuple level (Example , where there are two or more identical
tuples for a unique data entry case)
• The use of denormalized tables (often done to improve performance by avoiding
joins ) is another source of data redundancy.
• Inconsistencies often arise between various duplicates, due to inaccurate data
entry or updating some but not all data occurrences.
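Tuple-level duplicate detection can be sketched with a simple count of identical tuples; the record values here are hypothetical:

```python
from collections import Counter

# Hypothetical purchase records; the first and third tuples are exact duplicates
records = [
    ("C001", "2024-01-05", 250.0),
    ("C002", "2024-01-06", 120.0),
    ("C001", "2024-01-05", 250.0),
]

counts = Counter(records)
duplicates = [t for t, c in counts.items() if c > 1]  # tuples seen more than once
deduplicated = list(counts)                           # one copy of each, first-seen order

print(duplicates)        # [('C001', '2024-01-05', 250.0)]
print(len(deduplicated)) # 2
```

Real integration pipelines must also catch *inconsistent* near-duplicates (same entity, differing values), which exact matching like this cannot detect on its own.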
Data Value Conflict Detection and Resolution
• Data integration also involves the detection and resolution of data value conflicts.
• For example, for the same real-world entity, attribute values from different
sources may differ.
• This may be due to differences in representation, scaling or encoding.
• For instance, a weight attribute may be stored in metric units in one system and in British imperial units in another.
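One common way to resolve such a conflict is to normalize all sources to a single unit during integration; a minimal sketch (source names and values are hypothetical):

```python
# Hypothetical unit-conflict resolution: source X stores weight in kilograms,
# source Y in pounds; normalize everything to kilograms before merging.
LB_TO_KG = 0.45359237  # definition of the international pound

def to_kg(value, unit):
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown unit: {unit}")

# The same real-world entity as reported by two systems
records = [("parcel-17", 10.0, "kg"), ("parcel-17", 22.0462, "lb")]
normalized = [(name, round(to_kg(w, u), 2)) for name, w, u in records]
print(normalized)  # both values agree (10 kg) once units are reconciled
```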
Data Reduction
• Data reduction techniques are applied to obtain a reduced representation of the
dataset that is much smaller in volume , yet closely maintains the integrity of the
original data.
• Analysis on the reduced dataset should be more efficient yet produce the same or
almost the same analytical results.
• Why do we need data reduction? A database or a data warehouse may store
terabytes of data. Complex data analysis may take a very long time to run on the
complete data set.
Data Reduction Strategies
• Dimensionality reduction – process of removing unimportant attributes
• Wavelet transforms
• Principal Component Analysis (PCA)
• Attribute subset selection
• Numerosity reduction – replaces the original data volume by an alternative,
smaller forms of data representation
• Regression and log-linear models
• Histograms, clustering and sampling
• Data cube aggregation
• Data compression – transformations are applied on the data to obtain a reduced
or a compressed representation of the original data.
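As a first taste of numerosity reduction, simple random sampling without replacement keeps a small, representative subset of the tuples; a minimal sketch (the data is a stand-in, and the seed is fixed only for reproducibility):

```python
import random

# Numerosity reduction by simple random sampling without replacement:
# keep a fraction of the tuples as a smaller representation of the data.
random.seed(42)  # fixed seed so the sketch is reproducible

data = list(range(1, 1001))          # stand-in for 1000 tuples
sample = random.sample(data, k=50)   # reduced representation: 5% of the data

print(len(sample))                       # 50
print(len(set(sample)) == len(sample))   # True: no tuple is drawn twice
```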
Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse.
• Density and distance between points , which are critical to clustering and
outlier analysis become less meaningful.
• The possible combinations of subspaces will grow exponentially.
• Dimensionality reduction
• Avoids the curse of dimensionality.
• Helps to eliminate irrelevant attributes and reduce noise.
• Reduces time and space required for data analytics.
• Enables easier visualization.
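The "distances become less meaningful" point can be demonstrated empirically: for random points, the contrast between the nearest and farthest neighbour of a query shrinks as dimensionality grows. A small sketch (sample sizes and dimensions chosen arbitrarily; the seed is fixed for reproducibility):

```python
import math
import random

# Curse of dimensionality: relative contrast between nearest and farthest
# neighbour, (max_dist - min_dist) / min_dist, shrinks in high dimensions,
# which is why distance-based clustering and outlier analysis degrade.
random.seed(0)

def relative_contrast(dim, n=200):
    points = [[random.random() for _ in range(dim)] for _ in range(n)]
    query = [random.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

low, high = relative_contrast(2), relative_contrast(1000)
print(low > high)  # True: far less contrast between distances in 1000-D
```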
Mapping data to a new space
• Fourier transform
• Wavelet transform
[Figure: two sine waves, the same waves with added noise, and their frequency-domain representation]
Wavelet Transforms
What is a Wavelet?
A wavelet is a wave-like oscillation that is localized in time. Wavelets have two basic properties: scale and location. Scale (or dilation) defines how “stretched” or “squished” a wavelet is; this property is related to frequency as defined for waves. Location defines where the wavelet is positioned in time (or space).
Wavelet transformation
• Discrete wavelet transform (DWT) is used for
linear signal processing and multi-resolution
analysis.
• It decomposes a signal into different
frequency sub-bands. It is applicable to n-
dimensional signals.
• Data is transformed to preserve relative distance between objects at different resolutions.
• Compressed approximation : it stores only a small fraction of the strongest wavelet coefficients.
• It is insensitive to noise and to input order, but is applicable only to low-dimensional data.
[Figures: an example of DWT decomposition; common wavelet families]
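One level of the Haar DWT (the simplest wavelet) can be sketched in a few lines. This is a minimal averaging/differencing variant for even-length signals; library implementations use normalized coefficients and handle boundaries more carefully.

```python
def haar_dwt(signal):
    # approximation = pairwise averages  (low-frequency sub-band)
    # detail        = pairwise half-differences (high-frequency sub-band)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    # the transform is reversible: a + d and a - d recover each original pair
    out = []
    for a, d in zip(approx, detail):
        out += [a + d, a - d]
    return out

signal = [4, 6, 10, 12, 8, 6, 5, 5]
approx, detail = haar_dwt(signal)
print(approx)   # [5.0, 11.0, 7.0, 5.0]
print(detail)   # [-1.0, -1.0, 1.0, 0.0]
print(haar_idwt(approx, detail) == signal)  # True: perfect reconstruction
```

Compression comes from recursing on the approximation band and then keeping only the strongest coefficients, zeroing the rest.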
Wavelet Transforms
Why wavelet transforms?
• A major disadvantage of the Fourier transform is that it captures global frequency information, meaning frequencies that persist over an entire signal. An alternative approach is the wavelet transform, which decomposes a function into a set of wavelets.
• Wavelet transform can extract local spectral and temporal information
simultaneously
• There is a variety of wavelet families to choose from (Daubechies, Coiflet, Symmlet, etc.).
Wavelet transformation-Working
The basic idea is to compute “how much” of a wavelet is in a signal for a particular scale and location. For those familiar with convolutions, that is exactly what this is: the signal is convolved with a set of wavelets at a variety of scales.
In other words, we pick a wavelet of a particular scale. Then we slide this wavelet across the entire signal, i.e. vary its location, and at each time step we multiply the wavelet and the signal. The product of this multiplication gives us a coefficient for that wavelet scale at that time step. We then increase the wavelet scale and repeat the process.
Wavelet transformation-Working
Like the Fourier transform, the wavelet transform deconstructs the original signal
waveform into a series of basis waveforms, which in this case are called wavelets.
However, unlike the simple sinusoidal waves of Fourier analysis, the wavelet shapes are
complex, and, at first sight apparently arbitrary – they look like random squiggles
(although in fact they fulfil rigorous mathematical requirements). One important feature
that all wavelets share is that they are bounded, i.e. they decline to zero amplitude at
some distance either side of the centre, which is in obvious contrast to the sine/cosine
waves used in Fourier analysis, which go on forever. This is the underlying key to the
time localisation of the DWT.
There is a whole series of different types of “mother” wavelets (Daubechies, Coiflet, Symmlet, etc.) available, and each type occurs in a range of sizes. A particular episode of wavelet analysis only uses one type of mother wavelet; the user decides which type and size to use depending on the characteristics of the signal to be analysed (and probably some trial and error).
Wavelet transformation-Working
After transformation of a raw data signal using a particular mother wavelet you
end up with basis waveforms consisting of a series of daughter wavelets. The
daughter wavelets are all compressed or expanded versions of the mother
wavelet (they have different scales or frequencies), and each daughter wavelet extends across a different part of the original signal (they have different locations).
The important point is that each daughter wavelet is associated with a
corresponding coefficient that specifies how much the daughter wavelet at that
scale contributes to the raw signal at that location. It is these coefficients that
contain the information relating to the original input signal, since the daughter
wavelets derived from a particular mother wavelet are completely fixed and
independent of the input signal. Like the Fourier transform, the wavelet
transform is reversible - you can reconstruct the original signal by adding
together the appropriately daughter wavelets, each weighted by its associated
References
• Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber and Jian Pei, The Morgan Kaufmann Series in Data Management Systems, 3rd Edition, Chapters 3.3 – 3.4
• https://www.st-andrews.ac.uk/~wjh/dataview/tutorials/dwt.html
• https://towardsdatascience.com/the-wavelet-transform-e9cfa85d7b34
• https://www.cs.unm.edu/~mueen/Teaching/CS_521/Lectures/Lecture2.pdf
• https://medium.com/analytics-vidhya/understanding-principle-component-analysis-pca-step-by-step-e7a4bb4031d9
THANK YOU
Dr. Gowri Srinivasa
Professor, Department of Computer Science and
Engineering, PES University, Bengaluru
Email:
[email protected]