2 Data Pre-Processing

The document discusses data preprocessing techniques. It explains that data preprocessing involves transforming raw data into an understandable format through techniques like data cleaning, integration, transformation and reduction. The goal of preprocessing is to improve data quality by handling issues like incomplete, noisy and inconsistent data.

Uploaded by

NooR
Copyright
© All Rights Reserved

Data Preprocessing

Contents

• What data preprocessing is and why it is needed
• Data cleaning
• Data integration
• Data transformation
• Data reduction

Data Preprocessing

• Data preprocessing is a data mining technique that transforms raw data into an understandable format.

Why Preprocess Data

• Data in the real world is:
  - Incomplete: lacking attribute values or certain attributes of interest
  - Noisy: containing errors or outliers
  - Inconsistent: containing discrepancies between two or more recorded facts

Why Preprocess Data

• No quality data, no quality mining results:
  - Quality decisions must be based on quality data.

Measures of Data Quality

• Accuracy
• Completeness
• Consistency
• Timeliness
• Value added
• Interpretability
• Accessibility, etc.

Data Preprocessing Techniques

• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction

What is Data?

• A collection of data objects and their attributes
• Each row is an object and each column is an attribute
• Cheat is the class attribute

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Types of Attributes

• There are four types of attributes:
  - Nominal. Examples: ID numbers, eye color, zip codes
  - Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  - Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
  - Ratio. Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values

• The type of an attribute depends on which of the following properties it possesses:
  - Distinctness: = ≠
  - Order: < >
  - Addition: + -
  - Multiplication: * /

• Nominal attribute: distinctness
• Ordinal attribute: distinctness and order
• Interval attribute: distinctness, order and addition
• Ratio attribute: all four properties

Data Types and Forms

• Attribute-value data: a table with columns A1, A2, …, An and C
• Data types:
  - numeric, categorical (see the hierarchy for their relationship)
  - static, dynamic (temporal)

Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes
• Kinds of record data: data matrix, document data, transaction data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Matrix

• Data objects with the same fixed set of numeric attributes can be treated as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such data can be represented by an m by n matrix, with m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Document Data

• Each document becomes a "term" vector:
  - each term is a component (attribute) of the vector
  - the value of each component is the number of times the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
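
As a rough illustration of the term-vector idea, the sketch below counts how often each vocabulary term occurs in a piece of text. The function name `term_vector`, the five-term vocabulary, and the sample sentence are made up for this example, and splitting on whitespace is a simplifying assumption (real tokenization is more involved).

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    # Simplifying assumption: lowercase the text and split on whitespace.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc = "Team play matters because the team must play as a team"

vector = term_vector(doc, vocabulary)
print(vector)
```

Each position in the resulting list corresponds to one term, so documents of any length map to vectors of the same size.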
Transaction Data

• A special type of record data, where each record (transaction) involves a set of items
• For example, the transactions of a grocery store:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph Data

• Examples:
  - World Wide Web: generic graph and HTML links
  - Molecular structures: benzene molecule, C6H6

Ordered Data

• Sequences of transactions
• Spatial data
• Temporal data
• Sequential data
• Genetic sequence data

Genomic sequence data:

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA

Spatio-temporal data: average monthly temperature of land and ocean

The data analysis pipeline

• Mining is not the only step in the analysis process:

Data → Preprocessing → Data Mining → Post-processing → Result

• Preprocessing: real data is noisy, incomplete and inconsistent
  - Data cleaning is required to make sense of the data
  - Techniques: sampling, dimensionality reduction, feature selection
• Post-processing: make the data actionable and useful to the user
  - Statistical analysis of importance
  - Visualization

Data Preprocessing

• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation

Data Quality

• Examples of data quality problems:
  - Noise and outliers
  - Missing values
  - Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat  Problem
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes    A mistake or a millionaire?
6    No      NULL            60K             No     Missing value
7    Yes     Divorced        220K            NULL   Missing value
8    No      Single          85K             Yes
9    No      Married         90K             No     Inconsistent duplicate entries
9    No      Single          90K             No     Inconsistent duplicate entries
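
The three kinds of problems in this table can be flagged with a few simple checks. The following is a minimal sketch, not a general-purpose validator: the tuple layout, the use of `None` for NULL, and the plausibility limit `INCOME_LIMIT_K` are all assumptions chosen for this example.

```python
from collections import Counter

# Rows of the data quality table: (tid, refund, marital_status, income_k, cheat).
# None stands in for NULL.
records = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 10000, "Yes"),
    (6, "No", None, 60, "No"),
    (7, "Yes", "Divorced", 220, None),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 90, "No"),
    (9, "No", "Single", 90, "No"),
]

# Missing values: Tids of records with at least one NULL field.
missing = [r[0] for r in records if None in r]

# Duplicate data: Tids that appear on more than one record.
tid_counts = Counter(r[0] for r in records)
duplicated = sorted(tid for tid, n in tid_counts.items() if n > 1)

# Possible noise or outliers: incomes above a hypothetical plausibility limit.
INCOME_LIMIT_K = 1000
suspicious = [r[0] for r in records if r[3] > INCOME_LIMIT_K]

print(missing, duplicated, suspicious)
```

A flagged record is only a candidate for cleaning; whether Tid 5 is a mistake or a millionaire still needs a human decision.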
Noise

• Noise refers to modification of original values
• Example: distortion of a person's voice when talking on a poor phone connection

Figure: two sine waves, and the same two sine waves with noise added

Outliers

• Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set
• Outliers can help to:
  - detect new phenomena
  - discover unusual behavior in data
  - detect problems
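
The slides do not prescribe a detection method. One common convention, used here purely as an assumption, is the interquartile range (IQR) rule: flag values more than 1.5 × IQR below the first quartile or above the third. The sketch applies it to the taxable incomes (in thousands) from the data quality example.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the [Q1 - k*IQR, Q3 + k*IQR] fences."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Taxable incomes in thousands, taken from the data quality table.
incomes_k = [125, 100, 70, 120, 10000, 60, 220, 85, 90]

outliers = iqr_outliers(incomes_k)
print(outliers)
```

The factor 1.5 is a rule of thumb. A larger `k` flags fewer values, and the right setting depends on the data.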
Sample applications of outlier detection

• Fraud detection
  - Abnormal buying patterns can characterize credit card abuse
• Medicine
  - Unusual symptoms or test results may indicate potential health problems of a patient
• Public health
  - An unusual pattern in the occurrence of a particular disease
• Sports statistics
  - Outstanding (in a positive as well as a negative sense) players may be identified as having abnormal parameter values
• Detecting measurement errors
  - Data derived from sensors (e.g., in a given scientific experiment) may contain measurement errors
  - Abnormal values could provide an indication of a measurement error
• "One person's noise could be another person's signal."

Missing Values

• Reasons for missing values:
  - Information is not collected (e.g., people decline to give their age and weight)
  - Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values:
  - Eliminate data objects
  - Estimate missing values
  - Ignore the missing value during analysis
  - Replace with all possible values (weighted by their probabilities)
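
Two of these strategies can be sketched in a few lines. The ages below are hypothetical, `None` marks a missing value, and using the mean of the observed values as the estimate is just one simple choice among many.

```python
from statistics import mean

# Hypothetical ages; None marks a value that was not collected.
ages = [23, None, 31, 45, None, 27]

# Strategy 1: eliminate data objects that have a missing value.
complete_only = [a for a in ages if a is not None]

# Strategy 2: estimate each missing value, here with the mean of the observed ones.
fill = mean(complete_only)
imputed = [fill if a is None else a for a in ages]

print(complete_only)
print(imputed)
```

Elimination shrinks the data set, while imputation keeps every object at the cost of introducing estimated values.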
Duplicate Data

• A data set may include data objects that are duplicates, or almost duplicates, of one another
  - A major issue when merging data from heterogeneous sources
• Example:
  - The same person with multiple email addresses

Forms of data preprocessing

• Fill in missing values
• Smooth noisy data
• Remove outliers
• Resolve inconsistencies
• Normalization and aggregation

How to Handle Noisy Data?

• Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering:
  - detect and remove outliers
• Combined computer and human inspection:
  - detect suspicious values automatically and have a human check them
• Regression:
  - smooth by fitting the data to regression functions

Simple Discretization Methods: Binning

• Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size (a uniform grid)
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
  - The most straightforward method, but outliers may dominate the presentation
  - Skewed data is not handled well
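
A minimal sketch of equal-width partitioning using W = (B - A)/N. The function name `equal_width_bins` is made up for this example, the age values are the ones from the binning example in these slides, and the code assumes the attribute has at least two distinct values so that W is not zero.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return bins


ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
width_bins = equal_width_bins(ages, 3)
print(width_bins)
```

Because the bin edges depend only on the minimum and maximum, a single extreme value stretches every interval, which is why outliers and skewed data are a problem for this method.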
Simple Discretization Methods: Binning

• Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
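
A minimal sketch of equal-depth partitioning: sort the values and cut the sorted list into N consecutive chunks of nearly equal size. The function name `equal_depth_bins` is made up for this example, and giving any leftover values to the first bins is an arbitrary choice.

```python
def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins groups of (almost) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
depth_bins = equal_depth_bins(ages, 3)
print(depth_bins)
```

Here the interval boundaries follow the data, so dense regions get narrow bins and sparse regions get wide ones.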
Binning Example

• Attribute values (for one attribute, e.g., age): 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equi-width binning, for a bin width of e.g. 10:
  - Bin 1: 0, 4            [-∞, 10) bin
  - Bin 2: 12, 16, 16, 18  [10, 20) bin
  - Bin 3: 24, 26, 28      [20, +∞) bin
  - -∞ denotes negative infinity, +∞ positive infinity
• Equi-frequency binning, for a bin density of e.g. 3:
  - Bin 1: 0, 4, 12        [-∞, 14) bin
  - Bin 2: 16, 16, 18      [14, 21) bin
  - Bin 3: 24, 26, 28      [21, +∞) bin
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
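
The two smoothing steps can be reproduced with a short sketch. The function names are made up for this example. Rounding the bin means to whole dollars matches the figures shown (22.75 becomes 23 and 29.25 becomes 29), and sending ties to the lower boundary is an arbitrary choice that this data never triggers.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to a whole number."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an arbitrary choice).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# The equi-depth price bins from the example.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)
print(by_boundaries)
```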
Data Integration

• Data integration:
  - combines data from multiple sources into a coherent store
• Detecting and resolving data value conflicts:
  - for the same real-world entity, attribute values from different sources may differ
  - possible reasons: different representations or different scales, e.g., metric vs. British units
Handling Redundant Data

• Redundant data often occur when multiple databases are integrated
  - The same attribute may have different names in different databases
• Careful integration of the data from multiple sources may help to:
  - reduce or avoid redundancies and inconsistencies
  - improve mining speed and quality
Data Transformation

• Transform or consolidate data into forms appropriate for mining
  - Smoothing: remove noise from data
  - Aggregation: summarization, data cube construction (e.g., daily sales data aggregated to compute monthly or annual amounts)
  - Normalization: values are scaled to fall within a small, specified range
    - min-max normalization
    - z-score normalization
    - normalization by decimal scaling

Data Transformation

• Aggregation: summarization, data cube construction
  - Daily sales data aggregated to compute monthly or annual amounts

Figure: a data cube for sales

Data Transformation: Normalization

• Attribute values are scaled to fall within a small, specified range, such as 0.0 to 1.0
• Min-max normalization performs a linear transformation on the original data:

\[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]

• Example: Let the min and max values for the attribute income be $12,000 and $98,000, respectively.
  - Map income to the range [0.0, 1.0].
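
A worked sketch of the min-max formula for the income example. The slide does not say which income value to map, so the value of $73,600 is borrowed from the z-score example as an assumption. The function name `min_max_normalize` is also made up for this illustration.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income range from the example; 73,600 is an assumed value to map.
normalized = min_max_normalize(73_600, 12_000, 98_000)
print(round(normalized, 3))  # 0.716
```

The minimum of the attribute always maps to `new_min` and the maximum to `new_max`, so a value outside the original range would fall outside the target range as well.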
Data Transformation: Normalization

• z-score normalization (or zero-mean normalization)
  - The values of an attribute A are normalized based on the mean and standard deviation of A:

\[ v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A} \]

• Example: Let the mean and standard deviation of the values for the attribute income be $54,000 and $16,000, respectively.
  - With z-score normalization, a value of $73,600 for income is transformed to (73,600 - 54,000) / 16,000 = 1.225.
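
A short sketch of z-score normalization. The first call applies the formula to the income example. The second normalizes a small hypothetical list using its own mean and population standard deviation, which is an assumption since the slide does not say whether the population or sample deviation is meant.

```python
from statistics import mean, pstdev


def z_score(v, mean_a, std_a):
    """Normalize v by the mean and standard deviation of its attribute."""
    return (v - mean_a) / std_a


# Income example: mean $54,000, standard deviation $16,000.
income_z = z_score(73_600, 54_000, 16_000)
print(income_z)

# Hypothetical values with mean 5 and population standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
m, s = mean(values), pstdev(values)
normalized = [z_score(v, m, s) for v in values]
print(normalized)
```

After normalization the values have mean zero, which is where the name zero-mean normalization comes from.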
Data Transformation: Normalization

• Decimal scaling normalizes by moving the decimal point of the values of attribute A.
  - The number of decimal places moved depends on the maximum absolute value of A:

\[ v' = \frac{v}{10^{j}} \]

  where j is the smallest integer such that max(|v'|) < 1

• Example: Suppose that the recorded values of A range from -986 to 917.
  - The maximum absolute value of A is 986.
  - To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3).
  - -986 normalizes to -0.986 and 917 normalizes to 0.917.
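
The search for j can be sketched as a simple loop. The function name `decimal_scale` is made up for this example, and the search starts at j = 0, so values that are already below 1 in absolute value are left unchanged (an assumption).

```python
def decimal_scale(values):
    """Return (j, scaled), where j is the smallest j >= 0 with max(|v| / 10**j) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# The extremes of A from the example.
j, scaled = decimal_scale([-986, 917])
print(j, scaled)
```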
Data Reduction

• A warehouse may store terabytes of data: complex data analysis or mining may take a very long time to run on the complete data set
• Data reduction:
  - obtains a reduced representation of the data set that is much smaller in volume
  - but produces the same (or almost the same) analytical results
Data Reduction Strategies

• Dimensionality reduction
• Data compression
  - use encoding schemes to reduce the data set size
• Numerosity reduction
  - data is replaced or estimated by alternative, smaller data representations
  - sampling, histograms, clustering
• Discretization and concept hierarchy generation
  - replace raw data values for attributes by ranges or higher conceptual levels
Histograms

• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each bucket
• Can be constructed optimally in one dimension using dynamic programming

Figure: histogram with buckets from 10,000 to 100,000 on the horizontal axis
Cluster Analysis

• Partition data into clusters, and store the cluster representation only
Sampling

• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
  - Example: What is the average height of a person in Pakistan?
  - We cannot measure the height of everybody.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
  - Example: We have 1M documents. What fraction of document pairs has at least 100 words in common?
  - Computing the number of common words for all pairs requires on the order of 10^12 comparisons.
  - Example: What fraction of tweets in a year contain the word "Lahore"?
  - 300M tweets per day, if 100 characters on average, 86.5TB to store all tweets
Sampling …

• The key principle for effective sampling is the following:
  - Using a sample will work almost as well as using the entire data set, if the sample is representative
  - A sample is representative if it has approximately the same property (of interest) as the original set of data
  - Otherwise we say that the sample introduces some bias

Types of Sampling

• Simple random sampling
  - There is an equal probability of selecting any particular item
• Sampling without replacement
  - As each item is selected, it is removed from the population
• Sampling with replacement
  - Objects are not removed from the population as they are selected for the sample
  - The same object can be picked more than once
  - This makes analytical computation of probabilities easier
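
Both variants are available in Python's standard `random` module: `sample` draws without replacement and `choices` draws with replacement. The population of ten item ids and the fixed seed below are assumptions made so that the sketch is repeatable.

```python
import random

# Hypothetical population of 10 item ids.
population = list(range(1, 11))
rng = random.Random(0)  # fixed seed so that the run is repeatable

# Without replacement: each item can be selected at most once.
without_replacement = rng.sample(population, k=5)

# With replacement: the same item may be selected more than once.
with_replacement = rng.choices(population, k=5)

print(without_replacement)
print(with_replacement)
```

With replacement, the sample may even be larger than the population, which is not possible when drawing without replacement.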
Sampling

Figure (sample size): the raw data with 8000 points, and samples of 2000 points and 500 points
Discretization

• Discretization:
  - Divide the range of a continuous attribute into intervals
  - Reduce data size by discretization
  - Interval labels can be used to replace actual data values
Discretization for numeric data

• Binning
  - sensitive to the user-specified number of bins and to outliers
• Histogram analysis
• Clustering analysis
• Segmentation by natural partitioning

Summary

• Data preparation is a big issue for both warehousing and mining
• Data preparation includes:
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
• Many methods have been developed, but this is still an active area of research
