
Data Preprocessing

By
E. Sivasankar
NITT



Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary



Why Data Preprocessing?
• Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data



Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
• Broad categories:
  - intrinsic, contextual, representational, and accessibility



Major Tasks in Data Preprocessing
• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data transformation
  - Normalization and aggregation
• Data reduction
  - Obtains a reduced representation in volume but produces the same or similar analytical results
• Data discretization
  - Part of data reduction, with particular importance for numerical data



Forms of data preprocessing [figure]



Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary



Data Cleaning
• Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data



Missing Data
• Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  - equipment malfunction
  - inconsistency with other recorded data, so the value was deleted
  - data not entered due to misunderstanding
  - certain data not considered important at the time of entry
  - history or changes of the data not registered
• Missing data may need to be inferred.



How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
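A minimal sketch of these options in pandas (the DataFrame, its income column, and its class column are invented for the example, not part of the slides):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [12000, np.nan, 48000, np.nan, 73600, 98000],
    "class":  ["low", "low", "mid", "mid", "high", "high"],
})

dropped     = df.dropna(subset=["income"])                 # ignore tuples with a missing value
global_fill = df["income"].fillna(-1)                      # global constant (sentinel value)
mean_fill   = df["income"].fillna(df["income"].mean())     # attribute mean
class_fill  = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))                          # mean within the same class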
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
• Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data



How to Handle Noisy Data?
• Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
• Clustering
  - detect and remove outliers
• Combined computer and human inspection
  - detect suspicious values and check by human
• Regression
  - smooth by fitting the data into regression functions



Noise Removal Methods
• Noise is a random error or variance in a measured variable.

• Binning
  - Binning methods smooth a sorted data value by consulting its neighborhood. The sorted values are distributed into a number of buckets, or bins.
  - In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
  - In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.



Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

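A small sketch of the same two smoothing steps on this price data, in plain Python with equi-depth bins of size 4:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]          # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]       # equi-depth partition

# smoothing by bin means: every value becomes the (rounded) mean of its bin
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smoothing by bin boundaries: every value snaps to the closer of its bin's min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]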


Clustering
• Outliers (abnormal patterns) may be detected by clustering, where similar values are organized into groups, or clusters. Values that fall outside of the clusters may be considered outliers.



Cluster Analysis [figure: clusters of values, with outliers falling outside the clusters]



Regression
• We can smooth data by fitting the data to a function.
• Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other.
• Multiple linear regression is an extension of linear regression in which more than two variables are involved and the data are fitted to a multi-dimensional surface.



Regression [figure: a data point (X1, Y1) and its fitted value Y1' on the regression line y = x + 1]



Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary



Data Integration
• Data integration:
  - combines data from multiple sources into a coherent store
• Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units



Handling Redundant Data in Data Integration
• Redundant data occur often when integrating multiple databases
  - The same attribute may have different names in different databases
  - One attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant data may be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality



Redundancy
• An attribute may be redundant if it can be derived from another table. Redundancy can be detected by correlation analysis.



Correlation Analysis

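The equation on this slide was lost in the export; it is presumably the correlation coefficient between two numeric attributes A and B, one common form of which is

  r(A, B) = Σ (a_i − mean_A)(b_i − mean_B) / ((n − 1) · stand_dev_A · stand_dev_B)

where n is the number of tuples and the sum runs over corresponding values a_i of A and b_i of B.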


Correlation Analysis
• If the resulting value of the equation is greater than 0, then A and B are positively correlated, meaning that the value of A increases as the value of B increases.

• If the resulting value is less than 0, then A and B are negatively correlated: the value of one attribute increases as the value of the other attribute decreases.



Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
• Attribute/feature construction
  - New attributes constructed from the given ones



Data Transformation: Normalization
• min-max normalization:
  v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
• z-score normalization:
  v' = (v − mean_A) / stand_dev_A
• normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1



Min-Max Normalization
• It performs a linear transformation on the original data. Suppose min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A].

• Ex. Suppose the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively, and we would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed into 0.716, as computed below.

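The computation behind that result, using the min-max formula above:

  v' = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0
     = 61,600 / 86,000
     ≈ 0.716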


Z-Score Normalization (or Zero-Mean Normalization)
• In z-score normalization, the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing v' = (v − mean_A) / stand_dev_A.

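As a worked illustration (the mean and standard deviation here are assumed for the example, not given on the slide): if income has mean_A = $54,000 and stand_dev_A = $16,000, then a value of $73,600 is normalized to

  v' = (73,600 − 54,000) / 16,000 = 1.225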


Normalization by Decimal Scaling
• It normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

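As a worked illustration (the values are assumed for the example): if the recorded values of A range from −986 to 917, the maximum absolute value is 986, so j = 3 and each value is divided by 1,000; −986 normalizes to −0.986 and 917 to 0.917.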


Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary



DATA REDUCTION
(Note: a data cube is a multi-dimensional array of values, typically used to describe data in a way that allows for easy analysis and exploration; imagine a cube where each dimension represents a different attribute of the data, such as time, location, or product category.)

• Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet produces the same analytical results.

1. Data cube aggregation: aggregation operations are applied to the data in the construction of a data cube.
2. Dimension reduction: irrelevant, weak, and redundant attributes or dimensions may be detected and removed.



DATA REDUCTION
3. Data compression: encoding mechanisms are used to reduce the data set size.
4. Discretization and concept hierarchy generation: raw data values for attributes are replaced by ranges or higher conceptual levels.



Data Cube Aggregation
• Data cubes store multi-dimensional aggregate information.
• They provide fast access to precomputed, summarized data, thereby benefiting online analytical processing (OLAP) as well as data mining.
• Each cell in a data cube holds an aggregate data value corresponding to a data point in multi-dimensional space.
• The cube created at the lowest level of abstraction is referred to as the base cuboid. The cube at the highest level of abstraction is the apex cuboid. Data cubes created at varying levels of abstraction are often referred to as a lattice of cuboids.
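A small sketch of the idea with pandas (the sales table and column names are invented for the example): aggregating quarterly sales up to yearly totals is one step up the lattice of cuboids.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "branch":  ["A", "A", "B", "B", "A", "A", "B", "B"],
    "amount":  [100, 120, 90, 110, 130, 125, 95, 105],
})

# base-level view: amount per (year, quarter, branch)
base = sales.pivot_table(values="amount", index=["year", "quarter"],
                         columns="branch", aggfunc="sum")

# climb one level: aggregate quarters away, keep yearly totals per branch
yearly = sales.groupby(["year", "branch"])["amount"].sum().unstack("branch")
print(yearly)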
Dimensionality Reduction
• It reduces the data set size by removing attributes or dimensions from the data.
• We need to find a minimum set of attributes such that the resulting probability distribution of the classes is as close as possible to the original distribution obtained using all attributes.



1. Stepwise forward selection
• The procedure starts with an empty set of attributes.
• The best of the original attributes is determined and added to the set.
• At each subsequent iteration, the best of the remaining attributes is added to the set (a code sketch of this greedy procedure appears after the three variants below).



2. Stepwise backward elimination
• This procedure starts with the full set of attributes. At each step it removes the worst attribute remaining in the set.

3. Combination of forward selection and backward elimination
• Here, at each step, the procedure selects the best attribute and removes the worst among the remaining attributes.

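A minimal sketch of greedy stepwise forward selection, assuming scikit-learn is available and using cross-validated accuracy as the "best attribute" criterion (the data set and classifier are placeholders, not part of the slides; a real implementation would also stop when no attribute improves the score):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))   # candidate attribute indices
selected = []                         # start with an empty attribute set

while remaining:
    # score every candidate attribute when added to the current set
    scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best, best_score = max(scores.items(), key=lambda kv: kv[1])
    selected.append(best)
    remaining.remove(best)
    print(f"added attribute {best}, CV accuracy = {best_score:.3f}")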
Data Compression
• Data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data.

• If the original data can be reconstructed from the compressed data without any loss of information, the data compression technique is called "lossless".

• If we can reconstruct only an approximation of the original data, then the data compression technique is called "lossy compression".
Data Compression
• String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
• Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole



Data Compression [figure: the original data maps to compressed data and back losslessly, or only to an approximation of the original data under lossy compression]


Wavelet Transforms (Haar-2, Daubechies-4)
• Discrete wavelet transform (DWT): linear signal processing
• Compressed approximation: store only a small fraction of the strongest wavelet coefficients
• Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
• Method:
  - Length, L, must be an integer power of 2 (padding with 0s when necessary)
  - Each transform has 2 functions: smoothing, difference
  - Applies to pairs of data, resulting in two sets of data of length L/2
  - Applies the two functions recursively until the desired length is reached



Wavelet Transformation
• The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector D, transforms it into a numerically different vector D' of wavelet coefficients. The two vectors are of the same length.
• For example, all wavelet coefficients larger than a user-specified threshold can be retained; the remaining coefficients are set to zero.



1. The length L of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary.

2. Each transform involves applying two functions. The first applies a data smoothing function, such as a sum or weighted average. The second performs a weighted difference, which brings out the detailed features of the data.

3. The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2. In general, these represent a smoothed (low-frequency) version of the input data and its high-frequency content, respectively.
4. The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets are of length two.

5. Selected values from the data sets obtained in the above iterations are designated as the wavelet coefficients of the transformed data.

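A minimal sketch of the pairwise smoothing/difference recursion for an (unnormalized) Haar transform, assuming the input length is already a power of 2; this is an illustration of the procedure above, not a production DWT:

import numpy as np

def haar_dwt(data):
    """Repeatedly replace pairs by their average (smoothing) and half-difference
    (detail), recursing on the smoothed half until one approximation value remains."""
    data = np.asarray(data, dtype=float)
    coeffs = []
    while len(data) > 1:
        smooth = (data[0::2] + data[1::2]) / 2.0   # low-frequency part, length L/2
        detail = (data[0::2] - data[1::2]) / 2.0   # high-frequency part, length L/2
        coeffs.append(detail)
        data = smooth                              # recurse on the smoothed half
    coeffs.append(data)                            # final approximation coefficient
    return coeffs

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))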


Principal Component Analysis
• Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
  - The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal component vectors
• Works for numeric data only
• Used when the number of dimensions is large



The basic procedure is as follows:
1. The input data are normalized so that each attribute falls within the same range. This step ensures that attributes with larger domains will not dominate attributes with smaller domains.

2. PCA computes c orthonormal vectors that provide a basis for the normalized input data. These are unit vectors, each pointing in a direction perpendicular to the others. These vectors are referred to as principal components. The input data are a linear combination of the principal components.



3. The principal components are sorted in order of decreasing significance. They essentially serve as a new set of axes for the data, providing important information about the variance.

4. Since the components are sorted in decreasing order of significance, the size of the data can be reduced by eliminating the weaker components, i.e., those with low variance.

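A minimal sketch of these steps with scikit-learn (the data set is a placeholder; n_components corresponds to c above):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)              # N data vectors in k = 4 dimensions
X_norm = StandardScaler().fit_transform(X)     # step 1: normalize the attributes

pca = PCA(n_components=2)                      # keep the c = 2 strongest components
X_reduced = pca.fit_transform(X_norm)          # steps 2-4: project onto the components

print(X_reduced.shape)                         # (150, 2): reduced representation
print(pca.explained_variance_ratio_)           # variance captured by each component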


Principal Component Analysis [figure: original axes X1, X2 and principal component axes Y1, Y2]



Numerosity Reduction
• Parametric methods
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces
• Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling



Regression and Log-Linear Models
• Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
• Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear model: approximates discrete multidimensional probability distributions


Regression Analysis and Log-Linear Models
• Linear regression: Y = α + β X
  - The two regression coefficients, α and β, specify the line and are to be estimated from the data at hand
  - Apply the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
• Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
• Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = α_ab · β_ac · χ_ad · δ_bcd
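A small sketch of fitting Y = α + βX by least squares with NumPy (the x and y values are invented for the example; only the two parameters need to be stored):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

beta, alpha = np.polyfit(x, y, deg=1)    # slope and intercept by least squares
print(f"Y = {alpha:.2f} + {beta:.2f} X")

y_hat = alpha + beta * x                 # reconstructed (approximate) data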
Histograms
• A popular data reduction technique
• Divide the data into buckets and store the average (sum) for each bucket
• Can be constructed optimally in one dimension using dynamic programming
• Related to quantization problems
[figure: histogram of prices with equi-width buckets at 10,000, 30,000, 50,000, 70,000, 90,000]

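A quick sketch of the idea with NumPy: replace the raw values by per-bucket counts, which is the reduced representation that gets stored (the prices are randomly generated for the example):

import numpy as np

rng = np.random.default_rng(0)
prices = rng.integers(1_000, 100_000, size=1_000)     # raw values

edges = np.arange(0, 110_000, 20_000)                 # equi-width bucket boundaries
counts, _ = np.histogram(prices, bins=edges)          # store only these counts
print(dict(zip(edges[:-1], counts)))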


Clustering
• Partition the data set into clusters, and store only the cluster representations
• Can be very effective if the data is clustered, but not if the data is "smeared"
• Can use hierarchical clustering and be stored in multi-dimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms



Sampling
• Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
• Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods
  - Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data (see the sketch below)
• Sampling may not reduce database I/Os (page at a time)

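A minimal sketch of simple random sampling with and without replacement, and stratified sampling, using pandas (the data frame, the class proportions, and the 10% fraction are illustrative assumptions):

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cls":   rng.choice(["A", "B", "C"], size=1_000, p=[0.7, 0.2, 0.1]),
    "value": rng.normal(size=1_000),
})

srswor = df.sample(frac=0.1, replace=False, random_state=0)   # without replacement
srswr  = df.sample(frac=0.1, replace=True,  random_state=0)   # with replacement

# stratified: sample 10% within each class so skewed classes keep their share
stratified = df.groupby("cls", group_keys=False).sample(frac=0.1, random_state=0)
print(stratified["cls"].value_counts())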


Sampling

W O R
SRS le random
i m p ho ut
( s e wi t
l
samp ment)
p l a ce
re

SRSW
R

Raw Data
Sampling [figure: raw data versus a cluster/stratified sample]



Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary



Discretization
• Three types of attributes:
  - Nominal — values from an unordered set
  - Ordinal — values from an ordered set
  - Continuous — real numbers
• Discretization:
  - divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis



Discretization and Concept Hierarchy
• Discretization
  - reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
• Concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior).

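A minimal sketch of this with pandas, turning raw ages into the interval labels mentioned above (the cut points 20/40/60/90 are an assumption for illustration):

import pandas as pd

ages = pd.Series([23, 35, 41, 58, 62, 74])
labels = pd.cut(ages, bins=[20, 40, 60, 90],
                labels=["young", "middle_aged", "senior"])
print(labels.tolist())   # ['young', 'young', 'middle_aged', 'middle_aged', 'senior', 'senior']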


Specification of a Set of Attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

country            15 distinct values
province_or_state  65 distinct values
city               3,567 distinct values
street             674,339 distinct values
Background Knowledge: Concept Hierarchies
• Schema hierarchy
  - E.g., street < city < province_or_state < country
• Set-grouping hierarchy
  - E.g., {20-39} = young, {40-59} = middle_aged
• Operation-derived hierarchy
  - email address: login-name < department < university < country
• Rule-based hierarchy
  - low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50



Syntax for Concept Hierarchy Specification
• To specify which concept hierarchies to use:
  use hierarchy <hierarchy> for <attribute_or_dimension>
• Different syntax is used to define different types of hierarchies
• schema hierarchies
  define hierarchy time_hierarchy on date as [date, month, quarter, year]
• set-grouping hierarchies
  define hierarchy age_hierarchy for age on customer as
    level1: {young, middle_aged, senior} < level0: all
    level2: {20, ..., 39} < level1: young
    level2: {40, ..., 59} < level1: middle_aged
    level2: {60, ..., 89} < level1: senior
Syntax for Concept Hierarchy Specification (Cont.)
• operation-derived hierarchies
  define hierarchy age_hierarchy for age on customer as
    {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
• rule-based hierarchies
  define hierarchy profit_margin_hierarchy on item as
    level_1: low_profit_margin < level_0: all
      if (price - cost) < $50
    level_1: medium_profit_margin < level_0: all
      if ((price - cost) > $50) and ((price - cost) <= $250)
    level_1: high_profit_margin < level_0: all
      if (price - cost) > $250



Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary



Summary
• Data preparation is a big issue for both warehousing and mining
• Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
• A lot of methods have been developed, but data preparation remains an active area of research
THANK YOU
