DATA ANALYTICS
UE22CS342AA2
UNIT-1
Lecture 5 : Data Preprocessing – Data Integration and Reduction
Gowri Srinivasa
Department of Computer Science and Engineering
Slides collated by:
Nishanth M S, PESU-2023, Department of CSE, [email protected]
Harshitha Srikanth, PESU-2024, Department of CSE, [email protected]
Karthik Namboori, VII Sem, PESU, Department of CSE, [email protected]
Slides excerpted from: Data Mining : Concepts and Techniques by Han, Kamber and Pei, 3rd Edition
With grateful thanks for contribution of slides to:
Dr. Mamatha H R, Professor at the Department of CSE, PESU
Data Integration
• Data analysis often requires data integration – the merging of data from
multiple data stores into a coherent store.
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting dataset. This can help improve the accuracy
and speed of the subsequent data analysis process.
• The semantic heterogeneity and structure of data pose great challenges in
data integration.
• How can we match schema and objects from different sources?
• Schema Integration!
• Example : How can a data analyst be sure that the attribute
customer_id in table A and customer_number in table B refer to the
same attribute?
• With the help of metadata! It provides all available information regarding the attributes, thus helping ensure error-free schema integration.
Data Integration
• Entity identification problem : Identify real world entities from multiple
data sources. Example : Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources
are different.
• Possible reasons : different representations, different scales, example –
metric vs British units
• During integration , special attention must be paid to the structure of the
data. This is to ensure that any attribute functional dependencies and
referential constraints in the source system match those in the target
system. For example , in one system, a discount may be applied to the
entire order whereas in another system , it is applied to each individual line
item. If this is not caught before integration, items in the target system
may be improperly discounted.
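A small illustration (with hypothetical prices and a hypothetical threshold rule) of why the discount mismatch matters: a flat order-level discount and a conditional line-level discount produce different totals for the same order.

```python
# Hypothetical order: two line items.
# System 1 applies a 10% discount to the whole order.
# System 2 applies 10% only to line items priced above 100.
items = [80.0, 150.0]

order_level = sum(items) * 0.9                               # 10% off the order total
line_level = sum(p * 0.9 if p > 100 else p for p in items)   # per-line threshold rule

print(order_level)  # ~207.0
print(line_level)   # ~215.0 -- the two systems disagree on the same order
```

If this mismatch is not caught before integration, the target system silently mixes two incompatible discount semantics.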
Redundancy in Data Integration
Redundant data often occur during the integration of multiple databases.
• Object identification : The same attribute or object may have different
names in different databases which causes redundancy.
• Derivable data : An attribute may be redundant if it can be derived from
another attribute or set of attributes. For example , annual revenue can be
derived from monthly revenue.
• Some redundancies can be detected by correlation analysis.
• For nominal data, the χ² (chi-square) test is employed.
• For numeric data, the correlation coefficient and covariance are used.
χ² (chi-square) test
χ² (chi-square) test for independence of two variables in a contingency table
• Null hypothesis : The two variables are independent.
• Alternate hypothesis : The two variables are not independent.
χ² = Σ (Observed − Expected)² / Expected
• Expected stands for what we would expect if the null hypothesis were true.
• The larger the value of χ², the more likely it is that the variables are correlated.
• The cells that contribute the most to the χ² value are those whose actual count differs most from the expected count.
• The test can be used for categorical variables whose entries are counts, not percentages or fractions (10% of 100 must be entered as 10).
• Correlation does not imply causation.
▪ The number of hospitals and the number of car thefts in a city may appear to be correlated. Both are causally linked to a third variable: population.
χ² (chi-square) Example
Actual (observed) distribution:
                           Play chess   Not play chess   Sum (row)
Like science fiction           250            200           450
Not like science fiction        50           1000          1050
Sum (col.)                     300           1200          1500

Expected distribution (under independence):
                           Play chess   Not play chess   Sum (row)
Like science fiction            90            360           450
Not like science fiction       210            840          1050
Sum (col.)                     300           1200          1500
• χ² (chi-square) calculation:
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• Degrees of freedom, k = (no_of_rows − 1)(no_of_columns − 1) = 1
• The critical value read from the χ² table (0.001 significance level, 1 degree of freedom) is 10.828. Since 507.93 > 10.828, we reject the null hypothesis.
• This shows that like_science_fiction and play_chess are correlated.
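The calculation above can be reproduced directly from the formula; a minimal Python sketch (table values taken from the example, expected counts derived from the row and column totals):

```python
# Observed 2x2 contingency table from the example:
# rows = {like sci-fi, not like sci-fi}, cols = {play chess, not play chess}
observed = [[250, 200], [50, 1000]]

row_sums = [sum(row) for row in observed]        # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]  # [300, 1200]
total = sum(row_sums)                            # 1500

# Expected count under independence: (row total * column total) / grand total
expected = [[r * c / total for c in col_sums] for r in row_sums]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

print(round(chi2, 2))  # 507.94 (the slide rounds this down to 507.93)
```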
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson's product-moment coefficient):

r_A,B = cov(A, B) / (σ_A · σ_B) = (Σ(aᵢbᵢ) − n·Ā·B̄) / (n·σ_A·σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(aᵢbᵢ) is the sum of the AB cross-products.
• If r_A,B > 0, A and B are positively correlated, that is, A's values increase as B's do. The higher the value of the coefficient, the stronger the correlation.
• If r_A,B = 0, there is no linear correlation between A and B.
• If r_A,B < 0, A and B are negatively correlated, that is, A's values decrease as B's increase.
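The formula above can be sketched in a few lines of Python; the function name and sample data are illustrative only (B is an exact linear function of A, so r should come out to 1):

```python
import math

def pearson_r(a, b):
    """Pearson product-moment correlation (population form, as in the slide)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # population standard deviations (divide by n, matching the formula above)
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))   # sum of AB cross-products
    return (cross - n * mean_a * mean_b) / (n * std_a * std_b)

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]       # b = 2a, a perfect linear relationship
print(pearson_r(a, b))     # 1.0 (perfect positive correlation)
```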
Correlation (viewed as a linear relationship)
• Correlation measures the linear relationship between objects.
• To compute correlation , we standardize data objects A and B , and then take their
dot product.
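The "standardize, then take the dot product" view can be sketched as follows (a minimal illustration with made-up data; the mean of the element-wise product of z-scores equals Pearson's r):

```python
import math

def zscores(values):
    # standardize: subtract the mean, divide by the population std deviation
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def correlation_via_dot(a, b):
    # standardize both objects, then average their element-wise product
    za, zb = zscores(a), zscores(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

# Illustrative data (the stock prices used later in this lecture)
print(correlation_via_dot([2, 3, 5, 4, 6], [5, 8, 10, 11, 14]))  # ~0.94
```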
Visually Evaluating Correlation
[Figure: scatter plots illustrating strengths of correlation, from strong negative through zero to strong positive]
Covariance analysis (Numeric Data)
• Covariance is similar to correlation:

Cov(A, B) = E[(A − Ā)(B − B̄)] = (Σᵢ (aᵢ − Ā)(bᵢ − B̄)) / n = E(A·B) − Ā·B̄

where n is the number of tuples, Ā and B̄ are the respective mean or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B. Correlation is the covariance normalized by the standard deviations: r_A,B = Cov(A, B) / (σ_A · σ_B).
Covariance analysis (Numeric Data)
• Positive Covariance : If CovA,B > 0, then A and B both tend to be larger than
their expected values.
• Negative Covariance : If CovA,B < 0, then if A is larger than its expected value, B
is likely to be smaller than its expected value.
• Independence : If A and B are independent, Cov_A,B = 0, but the converse is not true:
some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (for example, the data follow a multivariate normal distribution) does Cov_A,B = 0 imply independence.
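A classic counterexample for "covariance 0 does not imply independence", sketched with made-up data: Y is fully determined by X (so they are clearly dependent), yet their covariance is exactly zero.

```python
def covariance(a, b):
    # population covariance: mean of the products of deviations
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

# X takes the values -1, 0, 1 with equal frequency; Y = X^2 is a function
# of X, so X and Y are NOT independent ...
x = [-1, 0, 1]
y = [v * v for v in x]   # [1, 0, 1]

# ... yet the positive and negative deviations cancel exactly:
print(covariance(x, y))  # 0.0
```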
Covariance analysis : An Example
Reference: https://mathcs.clarku.edu/~djoyce/ma217/covar.pdf
Suppose two stocks A and B have the following values in one
week : (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4×9.6 = 42.4 − 38.4 = 4
Thus, A and B rise together since Cov(A, B) > 0.
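The same computation, written out as a short Python sketch using the Cov(A, B) = E(A·B) − Ā·B̄ form from the definition:

```python
# Weekly prices of the two stocks from the example: (A, B) pairs
prices = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]
a = [p[0] for p in prices]
b = [p[1] for p in prices]

n = len(prices)
mean_a = sum(a) / n   # E(A) = 4
mean_b = sum(b) / n   # E(B) = 9.6

# Cov(A, B) = E(A*B) - E(A)*E(B)
cov_ab = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b

print(cov_ab)  # ~4.0 -> positive, so the stocks tend to rise together
```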
Tuple Duplication
• In addition to detecting redundancies between attributes, duplication should be
detected at the tuple level (Example , where there are two or more identical
tuples for a unique data entry case)
• The use of denormalized tables (often done to improve performance by avoiding
joins ) is another source of data redundancy.
• Inconsistencies often arise between various duplicates, due to inaccurate data
entry or updating some but not all data occurrences.
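Tuple-level duplicate detection can be sketched with a simple count of identical tuples; the record values here are hypothetical:

```python
from collections import Counter

# Hypothetical purchase records; the first and third tuples are exact duplicates
records = [
    ("C001", "2024-01-05", 250.0),
    ("C002", "2024-01-06", 120.0),
    ("C001", "2024-01-05", 250.0),
]

counts = Counter(records)
duplicates = [t for t, c in counts.items() if c > 1]  # tuples seen more than once
deduplicated = list(counts)                           # one copy of each, first-seen order

print(duplicates)        # [('C001', '2024-01-05', 250.0)]
print(len(deduplicated)) # 2
```

Real integration pipelines must also catch *inconsistent* near-duplicates (same entity, differing values), which exact matching like this cannot detect on its own.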
Data Value Conflict Detection and Resolution
• Data integration also involves the detection and resolution of data value conflicts.
• For example, for the same real-world entity, attribute values from different
sources may differ.
• This may be due to differences in representation, scaling or encoding.
• For instance, a weight attribute may be stored in metric units in one system and in British imperial units in another.
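One common way to resolve such a conflict is to normalize all sources to a single unit during integration; a minimal sketch (source names and values are hypothetical):

```python
# Hypothetical unit-conflict resolution: source X stores weight in kilograms,
# source Y in pounds; normalize everything to kilograms before merging.
LB_TO_KG = 0.45359237  # definition of the international pound

def to_kg(value, unit):
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown unit: {unit}")

# The same real-world entity as reported by two systems
records = [("parcel-17", 10.0, "kg"), ("parcel-17", 22.0462, "lb")]
normalized = [(name, round(to_kg(w, u), 2)) for name, w, u in records]
print(normalized)  # both values agree (10 kg) once units are reconciled
```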
Data Reduction
• Data reduction techniques are applied to obtain a reduced representation of the
dataset that is much smaller in volume , yet closely maintains the integrity of the
original data.
• Analysis on the reduced dataset should be more efficient yet produce the same or
almost the same analytical results.
• Why do we need data reduction? A database or a data warehouse may store
terabytes of data. Complex data analysis may take a very long time to run on the
complete data set.
Data Reduction Strategies
• Dimensionality reduction – process of removing unimportant attributes
• Wavelet transforms
• Principal Component Analysis (PCA)
• Attribute subset selection
• Numerosity reduction – replaces the original data volume by an alternative,
smaller forms of data representation
• Regression and log-linear models
• Histograms, clustering and sampling
• Data cube aggregation
• Data compression – transformations are applied on the data to obtain a reduced
or a compressed representation of the original data.
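As a first taste of numerosity reduction, simple random sampling without replacement keeps a small, representative subset of the tuples; a minimal sketch (the data is a stand-in, and the seed is fixed only for reproducibility):

```python
import random

# Numerosity reduction by simple random sampling without replacement:
# keep a fraction of the tuples as a smaller representation of the data.
random.seed(42)  # fixed seed so the sketch is reproducible

data = list(range(1, 1001))          # stand-in for 1000 tuples
sample = random.sample(data, k=50)   # reduced representation: 5% of the data

print(len(sample))                       # 50
print(len(set(sample)) == len(sample))   # True: no tuple is drawn twice
```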
Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse.
• Density and distance between points , which are critical to clustering and
outlier analysis become less meaningful.
• The possible combinations of subspaces will grow exponentially.
• Dimensionality reduction
• Avoids the curse of dimensionality.
• Helps to eliminate irrelevant attributes and reduce noise.
• Reduces time and space required for data analytics.
• Enables easier visualization.
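The "distances become less meaningful" point can be demonstrated empirically: for random points, the contrast between the nearest and farthest neighbour of a query shrinks as dimensionality grows. A small sketch (sample sizes and dimensions chosen arbitrarily; the seed is fixed for reproducibility):

```python
import math
import random

# Curse of dimensionality: relative contrast between nearest and farthest
# neighbour, (max_dist - min_dist) / min_dist, shrinks in high dimensions,
# which is why distance-based clustering and outlier analysis degrade.
random.seed(0)

def relative_contrast(dim, n=200):
    points = [[random.random() for _ in range(dim)] for _ in range(n)]
    query = [random.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

low, high = relative_contrast(2), relative_contrast(1000)
print(low > high)  # True: far less contrast between distances in 1000-D
```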
Mapping data to a new space
• Fourier transform
• Wavelet transform
[Figure: two sine waves, the same waves with added noise, and their frequency-domain representation]
Wavelet Transforms
What is a Wavelet?
A wavelet is a wave-like oscillation that is localized in time. Wavelets have two basic properties: scale and location. Scale (or dilation) defines how “stretched” or “squished” a wavelet is; this property is related to frequency as defined for waves. Location defines where the wavelet is positioned in time (or space).
Wavelet transformation
• Discrete wavelet transform (DWT) is used for
linear signal processing and multi-resolution
analysis.
• It decomposes a signal into different
frequency sub-bands. It is applicable to n-
dimensional signals.
• Data is transformed to preserve relative distance between objects at different resolutions.
• Compressed approximation : it stores only a small fraction of the strongest wavelet coefficients.
• It is insensitive to noise and to input order, but is applicable only to low-dimensional data.
[Figures: an example of DWT decomposition; common wavelet families]
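One level of the Haar DWT (the simplest wavelet) can be sketched in a few lines. This is a minimal averaging/differencing variant for even-length signals; library implementations use normalized coefficients and handle boundaries more carefully.

```python
def haar_dwt(signal):
    # approximation = pairwise averages  (low-frequency sub-band)
    # detail        = pairwise half-differences (high-frequency sub-band)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    # the transform is reversible: a + d and a - d recover each original pair
    out = []
    for a, d in zip(approx, detail):
        out += [a + d, a - d]
    return out

signal = [4, 6, 10, 12, 8, 6, 5, 5]
approx, detail = haar_dwt(signal)
print(approx)   # [5.0, 11.0, 7.0, 5.0]
print(detail)   # [-1.0, -1.0, 1.0, 0.0]
print(haar_idwt(approx, detail) == signal)  # True: perfect reconstruction
```

Compression comes from recursing on the approximation band and then keeping only the strongest coefficients, zeroing the rest.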
Wavelet Transforms
Why wavelet transforms?
• A major disadvantage of the Fourier transform is that it captures global frequency information, meaning frequencies that persist over an entire signal. An alternative approach is the wavelet transform, which decomposes a function into a set of wavelets.
• Wavelet transform can extract local spectral and temporal information
simultaneously
• There is a variety of wavelet families to choose from (Daubechies, Coiflet, Symmlet, etc.).
Wavelet transformation-Working
The basic idea is to compute “how much” of a wavelet is in a signal for a particular scale and location. For those familiar with convolutions, that is exactly what this is: the signal is convolved with a set of wavelets at a variety of scales.
In other words, we pick a wavelet of a particular scale. Then we slide this wavelet across the entire signal, i.e. vary its location, and at each time step we multiply the wavelet and the signal. The product of this multiplication gives us a coefficient for that wavelet scale at that time step. We then increase the wavelet scale and repeat the process.
Wavelet transformation-Working
Like the Fourier transform, the wavelet transform deconstructs the original signal
waveform into a series of basis waveforms, which in this case are called wavelets.
However, unlike the simple sinusoidal waves of Fourier analysis, the wavelet shapes are
complex, and, at first sight apparently arbitrary – they look like random squiggles
(although in fact they fulfil rigorous mathematical requirements). One important feature
that all wavelets share is that they are bounded, i.e. they decline to zero amplitude at
some distance either side of the centre, which is in obvious contrast to the sine/cosine
waves used in Fourier analysis, which go on forever. This is the underlying key to the
time localisation of the DWT.
There is a whole series of different types of “mother” wavelets (Daubechies, Coiflet, Symmlet, etc.) available, and each type occurs in a range of sizes. A particular episode of wavelet analysis only uses one type of mother wavelet; the user decides which type and size to use depending on the characteristics of the signal to be analysed (and probably some trial and error).
Wavelet transformation-Working
After transformation of a raw data signal using a particular mother wavelet you
end up with basis waveforms consisting of a series of daughter wavelets. The
daughter wavelets are all compressed or expanded versions of the mother
wavelet (they have different scales or frequencies), and each daughter wavelet extends across a different part of the original signal (they have different locations).
The important point is that each daughter wavelet is associated with a
corresponding coefficient that specifies how much the daughter wavelet at that
scale contributes to the raw signal at that location. It is these coefficients that
contain the information relating to the original input signal, since the daughter
wavelets derived from a particular mother wavelet are completely fixed and
independent of the input signal. Like the Fourier transform, the wavelet
transform is reversible - you can reconstruct the original signal by adding
together the appropriately daughter wavelets, each weighted by its associated
References
• Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber and Jian Pei, The Morgan Kaufmann Series in Data Management Systems, 3rd Edition, Chapters 3.3 – 3.4
• https://www.st-andrews.ac.uk/~wjh/dataview/tutorials/dwt.html
• https://towardsdatascience.com/the-wavelet-transform-e9cfa85d7b34
• https://www.cs.unm.edu/~mueen/Teaching/CS_521/Lectures/Lecture2.pdf
• https://medium.com/analytics-vidhya/understanding-principle-component-analysis-pca-step-by-step-e7a4bb4031d9
THANK YOU
Dr. Gowri Srinivasa
Professor, Department of Computer Science and
Engineering, PES University, Bengaluru
Email:
[email protected]