ML Unit-II
Attribute Types
❑ Categorical (Qualitative)
◼ Nominal and Ordinal attributes are collectively referred to as categorical or qualitative attributes.
❑ Numeric (Quantitative)
◼ Interval and Ratio attributes are collectively referred to as quantitative or numeric attributes.
Interval
◼ No true zero-point
◼ e.g., temperature in °C or °F, calendar dates
Ratio
◼ Inherent zero-point
◼ We can speak of values as being a multiple (ratio) of the unit of measurement (10 K is twice as high as 5 K)
◼ e.g., temperature in Kelvin, length, counts, monetary quantities
Attribute Types
(Discrete vs. Continuous Attributes)
Discrete Attribute
◼ Has only a finite or countably infinite set of values
◼ e.g., zip codes, profession, or the set of words in a collection of documents
Continuous Attribute
◼ Has real numbers as attribute values
◼ Practically, real values can only be measured and represented using a finite number of digits
◼ Continuous attributes are typically represented as floating-point variables
Basic statistical description of data
Dispersion measures the extent to which the items vary from a central value. It is also called spread, scatter, or variability.
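As a sketch, the common dispersion measures can be computed with NumPy; the data values below are made up for illustration.

```python
# Common dispersion measures on a small made-up sample.
import numpy as np

data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

print("range   :", data.max() - data.min())
print("variance:", data.var(ddof=1))        # sample variance
print("std dev :", data.std(ddof=1))        # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
print("IQR     :", q3 - q1)                 # interquartile range
```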
Data Preprocessing
Data Preprocessing: An Overview
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
Data Quality: Why Preprocess the Data?
Measures of data quality include accuracy, completeness, consistency, timeliness, believability, and interpretability.
Major Techniques/Tasks in Data Preprocessing
(Figure: forms of data preprocessing — data cleaning, data integration, data reduction, and data transformation/discretization)
Data Preprocessing
Data Preprocessing: An Overview
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., due to instrument faults, human or computer error, or transmission error.
◼ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., Occupation=" " (missing data)
◼ Noisy: containing noise, errors, or outliers
• e.g., Salary="−10" (an error)
◼ Inconsistent: containing discrepancies in codes or names
• e.g., Age="42", Birthday="20/03/2010"
• e.g., was rating "1, 2, 3", now rating "A, B, C"
• e.g., discrepancy between duplicate records
◼ Intentional (e.g., disguised missing data)
• e.g., Jan. 1 as everyone's birthday?
Incomplete (Missing) Data
Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to:
• Equipment malfunction
• Being inconsistent with other recorded data and thus deleted
• Data not entered due to misunderstanding
• Certain data not being considered important at the time of entry
• History or changes of the data not being registered
• Data transmission problems
• Technology limitations
• Inconsistency in naming conventions
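A minimal pandas sketch of the usual remedies (ignore the tuple, fill with a global constant, fill with the attribute mean); the DataFrame and its columns are invented for the demo.

```python
# Three common ways to handle missing values in pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, np.nan, 45, 31],
                   "income": [52000, 48000, np.nan, 61000]})

df_drop = df.dropna()                             # ignore tuples with missing values
df_mean = df.fillna(df.mean(numeric_only=True))   # fill with the attribute mean
df_const = df.fillna({"income": 0})               # fill with a global constant
print(df_mean)
```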
How to Handle Noisy Data?
Binning
• First sort the data and partition it into (equal-frequency) bins
• Then smooth by bin means, bin medians, or bin boundaries, etc.
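A short pandas sketch of equal-frequency binning and smoothing by bin means/medians; the price list is made up for the demo (smoothing by bin boundaries would instead replace each value by its nearer bin edge).

```python
# Equal-frequency binning, then smoothing within each bin.
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bins = pd.qcut(prices, q=3)                           # equal-frequency bins
by_mean = prices.groupby(bins).transform("mean")      # smooth by bin means
by_median = prices.groupby(bins).transform("median")  # smooth by bin medians
print(by_mean.tolist())
```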
Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
Data Integration
Data integration: combines data from multiple sources into a coherent store
Redundancy:
• Object identification
• Derivable data
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases:
• Object identification: the same attribute or object may have different names in different databases
• Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis and covariance analysis.
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
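As a sketch of redundancy detection, pandas' corr() and cov() flag near-perfectly correlated attributes; the columns below (annual_revenue derivable from monthly_revenue) are invented for the demo.

```python
# Correlation/covariance analysis to spot a derivable (redundant) attribute.
import pandas as pd

df = pd.DataFrame({"monthly_revenue": [10, 12, 15, 11, 14],
                   "annual_revenue":  [120, 144, 180, 132, 168],  # 12x monthly
                   "num_employees":   [3, 4, 6, 4, 5]})

print(df.corr())  # correlation ~1.0 flags annual_revenue as redundant
print(df.cov())   # covariance matrix: same diagnosis in unscaled form
```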
Χ² Correlation Test (Nominal Data)
Χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

The larger the Χ² value, the more likely the variables are related.
The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count.
Correlation does not imply causality:
◼ The number of hospitals and the number of car thefts in a city are correlated
◼ Both are causally linked to a third variable: population
Chi-Square Calculation: An Example

|                          | Play chess | Not play chess | Sum (row) |
| Like science fiction     | 250 (90)   | 200 (360)      | 450       |
| Not like science fiction | 50 (210)   | 1000 (840)     | 1050      |
| Sum (col.)               | 300        | 1200           | 1500      |

(Numbers in parentheses are expected counts.)

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

This shows that like_science_fiction and play_chess are correlated in the group.
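The same test can be reproduced with SciPy as a sketch; correction=False disables Yates' continuity correction so the result matches the hand calculation above.

```python
# Chi-square test of independence on the contingency table above.
from scipy.stats import chi2_contingency

observed = [[250, 200],   # like science fiction:    plays chess / doesn't
            [50, 1000]]   # dislikes science fiction: plays chess / doesn't
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2))     # 507.93
print(expected)           # [[ 90. 360.] [210. 840.]]
```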
Data Reduction Strategies
Why data reduction? Data reduction is a process that reduces the volume of the original data and represents it in a much smaller volume.
• Increase storage efficiency and reduce storage cost
• Performance: complex data analysis may take a very long time to run on the complete data set
Data reduction strategies:
• Dimensionality reduction (remove unimportant attributes)
◼ Wavelet transforms
◼ Principal Components Analysis (PCA)
◼ Feature subset selection, feature creation
• Numerosity reduction (some simply call it: data reduction)
◼ Regression and log-linear models
◼ Histograms, clustering, sampling
◼ Data cube aggregation
• Data compression
Principal Component Analysis (PCA)
Principal Component Analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning.
It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation.
These new transformed features are called the Principal Components.
It is one of the popular tools used for exploratory data analysis and predictive modeling. It is a technique for extracting strong patterns from a dataset by projecting it onto the directions of greatest variance.
PCA generally tries to find a lower-dimensional surface onto which to project the high-dimensional data.
(Figure: 2-D data in the x1–x2 plane with the principal component axes along the directions of greatest variance)
Step 1: Standardize the dataset.
Step 2: Calculate the covariance matrix for the features in the dataset.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.
Step 4: Sort the eigenvalues and their corresponding eigenvectors.
Step 5: Pick the top k eigenvalues and form a matrix of their eigenvectors.
Step 6: Transform the original matrix.
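A minimal NumPy sketch of the six steps above (illustrative, not a production implementation; the random data is made up).

```python
# PCA from scratch, following steps 1-6.
import numpy as np

def pca(X, k):
    # Step 1: standardize the dataset (zero mean, unit variance per feature).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the features.
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigenvalues/eigenvectors (eigh: the covariance matrix is symmetric).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort eigenvalues (and eigenvectors) in decreasing order.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: pick the top-k eigenvectors as the projection matrix.
    W = eigvecs[:, :k]
    # Step 6: transform the (standardized) original matrix.
    return X_std @ W

X = np.random.rand(100, 5)   # 100 samples, 5 features
Z = pca(X, k=2)              # reduced to 2 principal components
print(Z.shape)               # (100, 2)
```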
Regression Analysis
(Figure: least-squares line y = x + 1; Y1 is an observed value at X1 and Y1′ its fitted value on the line)
◼ Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
◼ More specifically, regression analysis helps us understand how the value of the dependent variable changes with one independent variable while the other independent variables are held fixed.
◼ Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships.
Linear regression: Y = wX + b
◼ The two regression coefficients, w and b, specify the line and are estimated from the data at hand
◼ by applying the least-squares criterion to the known values (X1, Y1), (X2, Y2), …
Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Many nonlinear functions can be transformed into the above
Log-linear models:
◼ Approximate discrete multidimensional probability distributions
◼ Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
◼ Useful for dimensionality reduction and data smoothing
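A minimal NumPy sketch of least-squares estimation of w and b for Y = wX + b; the sample points are made up for illustration.

```python
# Least-squares fit of a line Y = wX + b.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])

w, b = np.polyfit(x, y, deg=1)   # least-squares estimates of w and b
print(f"Y = {w:.3f} X + {b:.3f}")
```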
Histogram Analysis
(Figure: histogram of price values, with buckets of width 10,000 spanning 10,000 to 100,000)
Divide data into buckets and store the average (sum) for each bucket.
Partitioning rules:
◼ Equal-width: equal bucket range
◼ Equal-frequency (or equal-depth): each bucket contains roughly the same number of samples
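A short NumPy sketch of the two partitioning rules, with invented price data.

```python
# Equal-width buckets vs. equal-frequency bucket edges.
import numpy as np

prices = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

counts, edges = np.histogram(prices, bins=4)               # equal-width buckets
eq_freq_edges = np.quantile(prices, [0, .25, .5, .75, 1])  # equal-frequency edges
print(counts, edges)
print(eq_freq_edges)
```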
Clustering
Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is “smeared”
Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
There are many choices of clustering definitions and clustering
algorithms
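A hedged scikit-learn sketch of clustering as numerosity reduction: only each cluster's centroid and a rough diameter are kept instead of all points (the 2-D data is synthetic, and the diameter is approximated as twice the maximum radius).

```python
# Cluster the data, then store only a compact cluster representation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + rng.choice([0, 8], size=(300, 1))

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_           # compact representation of the data

for k in range(2):
    pts = X[km.labels_ == k]
    diameter = 2 * np.linalg.norm(pts - centroids[k], axis=1).max()  # rough proxy
    print(f"cluster {k}: centroid {centroids[k]}, diameter ~{diameter:.2f}")
```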
Sampling
Sampling: obtaining a small sample s to represent the whole data set N.
Types of Sampling
◼ Simple random sampling: there is an equal probability of selecting any particular tuple
◼ Sampling without replacement: once a tuple is selected, it is removed from the population
◼ Sampling with replacement: a selected tuple is not removed from the population, so it may be drawn again
◼ Cluster or stratified sampling: partition (or stratify) the data set, then draw samples from each partition (proportionally in the stratified case)
(Figure: simple random sampling, with and without replacement, drawn from the raw data)
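A small pandas sketch of these sampling schemes; the DataFrame and its 'grade' stratum column are invented for the demo.

```python
# Simple random sampling (with/without replacement) and stratified sampling.
import pandas as pd

df = pd.DataFrame({"grade": list("AABBBCCCCC"), "score": range(10)})

srswor = df.sample(n=4, replace=False, random_state=1)  # without replacement
srswr = df.sample(n=4, replace=True, random_state=1)    # with replacement

# Stratified: draw ~40% from every stratum to keep class proportions.
stratified = df.groupby("grade").sample(frac=0.4, random_state=1)
print(stratified)
```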
What Is Wavelet Transform?
Decomposes a signal into different frequency sub-bands
◼ Applicable to n-dimensional signals
Data are transformed to preserve relative distances between objects at different levels of resolution
Allows natural clusters to become more distinguishable
Used for image compression
Wavelet Transformation
Discrete wavelet transform (DWT): for linear signal processing and multi-resolution analysis
(Examples of wavelet families: Haar-2, Daubechies-4)
Compressed approximation: store only a small fraction of the strongest wavelet coefficients
Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
Method:
▪ The length, L, must be an integer power of 2 (pad with 0s when necessary)
▪ Each transform has 2 functions: smoothing and difference
▪ Applies to pairs of data, resulting in two sets of data of length L/2
▪ Applies the two functions recursively until reaching the desired length
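A minimal NumPy sketch of the recursive smoothing/difference scheme above, using the unnormalized averaging variant of the Haar DWT (an assumption; other normalizations exist). The input length must be a power of 2.

```python
# One-dimensional Haar DWT via pairwise smoothing and difference.
import numpy as np

def haar_dwt(x):
    x = np.asarray(x, dtype=float)
    coeffs = []
    while len(x) > 1:
        pairs = x.reshape(-1, 2)
        smooth = (pairs[:, 0] + pairs[:, 1]) / 2   # smoothing: pairwise average
        detail = (pairs[:, 0] - pairs[:, 1]) / 2   # difference: pairwise detail
        coeffs.append(detail)      # detail coefficients at this resolution
        x = smooth                 # recurse on the smoothed half (length L/2)
    coeffs.append(x)               # final overall average
    return coeffs

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
```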
Why Wavelet Transform?
◼ Uses hat-shaped filters
• Emphasizes regions where points cluster
• Suppresses weaker information at their boundaries
◼ Effective removal of outliers
• Insensitive to noise and to input order
◼ Multi-resolution
• Detects arbitrarily shaped clusters at different scales
◼ Efficient
• Complexity O(N)
◼ Only applicable to low-dimensional data
Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
Data Transformation
Data transformation is the process of converting data from one format or structure into another.
A transformation is a function that maps the entire set of values of a given attribute to a new set of replacement values, i.e., each old value can be identified with one of the new values.
Normalization methods include min-max normalization, z-score normalization, and normalization by decimal scaling.
Normalization
Min-max normalization: to [new_minA, new_maxA]

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$

◼ Ex. Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$$

Z-score normalization (μA: mean, σA: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

◼ Ex. Let μA = 54,000 and σA = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

Normalization by decimal scaling:

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that Max(|v′|) < 1
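A NumPy sketch of all three normalizations, reusing the slide's income example (the four sample values are chosen for the demo).

```python
# Min-max, z-score, and decimal-scaling normalization.
import numpy as np

v = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0].
minmax = (v - v.min()) / (v.max() - v.min()) * (1.0 - 0.0) + 0.0

# Z-score normalization (using the slide's mu = 54,000, sigma = 16,000).
zscore = (v - 54_000) / 16_000

# Decimal scaling: smallest j such that max(|v'|) < 1 (+1 guards exact powers of 10).
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10**j

print(minmax[2], zscore[2], decimal[2])  # 0.716..., 1.225, 0.736
```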
Discretization
Discretization: divide the range of a continuous attribute into intervals.
Discretization can be performed recursively on an attribute.
Data Discretization Methods
(All the methods can be applied recursively)
◼ Binning: top-down split, unsupervised
◼ Histogram analysis: top-down split, unsupervised
Simple Discretization: Binning
◼ Equal-width (distance) partitioning: divides the range into N intervals of equal size; if A and B are the lowest and highest values of the attribute, the interval width is W = (B − A)/N
◼ Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples
BY PUNNA RAO