Data Mining Requires Collecting a Great Amount of Data (Available in Data Warehouses or Databases) to Achieve the Intended Objective
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
redundant: including everything, some of which is
irrelevant to our task.
No quality data, no quality mining results!
Data is often of low quality
Collecting the required data is challenging
In addition to the heterogeneous and distributed nature of data, real-world
data is often low in quality.
Why?
You didn’t collect it yourself
It was probably created for some other purpose, and then you came
along wanting to integrate it.
People make mistakes (typos)
Data collection instruments used may be faulty.
Everyone had their own way of structuring and formatting data,
based on what was convenient for them.
Users may purposely submit incorrect data values for mandatory
fields when they do not wish to submit personal information.
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
2. Data Cleaning
Data cleaning (or data cleansing) routines attempt to fill
in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
Data cleaning tasks – this routine attempts to
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Incomplete (Missing) Data:
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data.
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of
entry
failure to register the history or changes of the data.
How to Handle Missing Data?
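Common ways of handling missing values include ignoring the tuple, filling in the value manually, using a global constant, or imputing with a measure of central tendency such as the attribute mean. A minimal sketch with pandas is shown below; the DataFrame and its income column are hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical sales data with missing customer income values
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income":   [50_000, np.nan, 62_000, np.nan],
})

# Option 1: ignore (drop) tuples with missing values
dropped = df.dropna(subset=["income"])

# Option 2: fill with a global constant
filled_const = df.fillna({"income": 0})

# Option 3: impute with the attribute mean
filled_mean = df.fillna({"income": df["income"].mean()})

print(filled_mean)
```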
Noisy Data
How to Handle Noisy Data?
Manually check all data: tedious and often infeasible
Sort data by frequency
‘green’ is more frequent than ‘rgeen’
Works well for categorical data
Use numerical constraints to catch corrupt data
Weight can’t be negative
People can’t have more than 2 parents
Salary can’t be less than Birr 300
Check for outliers (the case of the 8-meter man)
Check for correlated outliers using n-grams (e.g., “pregnant
male”)
People can be male
People can be pregnant
People can’t be male AND pregnant
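A sketch of constraint-based checks like those above (negative weight, salary below a minimum, contradictory attribute combinations), applied to a hypothetical pandas DataFrame:

```python
import pandas as pd

# Hypothetical records to validate
df = pd.DataFrame({
    "name":     ["Abebe", "Sara", "Kebede"],
    "weight":   [70, -5, 85],          # kg; can't be negative
    "salary":   [1200, 250, 4000],     # Birr; can't be less than 300
    "sex":      ["male", "female", "male"],
    "pregnant": [True, False, False],  # can't be male AND pregnant
})

# Flag rows that violate simple numerical constraints
bad_weight = df["weight"] < 0
bad_salary = df["salary"] < 300

# Flag correlated outliers: impossible attribute combinations
bad_combo = (df["sex"] == "male") & df["pregnant"]

suspicious = df[bad_weight | bad_salary | bad_combo]
print(suspicious)   # rows to inspect or correct by hand
```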
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Regression
Data can be smoothed by fitting it to a function, as with
regression (linear regression or multiple linear regression).
Clustering
Similar values are organized into groups (clusters).
Values that fall outside of the clusters are considered outliers and
can be removed, e.g., a noisy value such as age = 200, or other widely deviating points.
Combined computer and human inspection
detect suspicious values automatically and have a human check them
(e.g., deal with possible outliers)
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 15, 21, 21,
24, 25, 28, 34
Partition into (equi-depth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
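A sketch of the equal-frequency binning example above, with smoothing by bin means and by bin boundaries (plain Python, no extra libraries):

```python
# Sorted prices from the example above
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
n_bins = 3
size = len(prices) // n_bins

# Partition into equal-frequency (equi-depth) bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the nearest boundary
by_bounds = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```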
3. Data Integration
Data Integration: Formats
Not everyone uses the same format. Do you agree?
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Dates are especially problematic:
12/19/97
19/12/97
19/12/1997
19-12-97
Dec 19, 1997
19 December 1997
19th Dec. 1997
Are you frequently writing money as:
Birr 200, Br. 200, 200 Birr, …
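One way to reconcile such date formats is to parse each string into a single canonical representation. A minimal sketch using Python's datetime follows; the list of candidate formats is illustrative, not exhaustive, and ambiguous strings (e.g., 01/02/97) still need a per-source convention.

```python
from datetime import datetime

# Candidate formats seen across the integrated sources (illustrative)
FORMATS = ["%m/%d/%y", "%d/%m/%Y", "%d-%m-%y", "%b %d, %Y", "%d %B %Y"]

def to_iso(date_string):
    """Try each known format; return an ISO 8601 date or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_string, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual inspection

print(to_iso("12/19/97"))      # 1997-12-19
print(to_iso("19/12/1997"))    # 1997-12-19
print(to_iso("Dec 19, 1997"))  # 1997-12-19
```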
Data at different level of detail than needed
If it is at a finer level of detail, you can sometimes bin
it
• Example
– If I need age ranges of 20-30, 30-40, 40-50, etc. and
imported data contains birth date
– No problem! Divide the data into appropriate categories (see the sketch after this list)
• Sometimes you cannot bin it
• Example
– I need age ranges 20-30, 30-40, 40-50 etc.
– Data is of age ranges 25-35, 35-45, etc.
– What to do?
• Ignore the age ranges because you aren’t sure
• Make an educated guess based on the imported data (e.g.,
assume that the # of people aged 25-35 is the average of the
# of people aged 20-30 and 30-40)
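A sketch of the “finer level of detail” case: deriving age from a birth date and binning it into the needed ranges with pandas (column names and the reference date are hypothetical):

```python
import pandas as pd

# Hypothetical imported data containing birth dates
df = pd.DataFrame({"birth_date": ["1997-03-07", "1985-11-20", "1971-06-02"]})
df["birth_date"] = pd.to_datetime(df["birth_date"])

# Derive age (in whole years) relative to a fixed reference date
reference = pd.Timestamp("2024-01-01")
df["age"] = (reference - df["birth_date"]).dt.days // 365

# Bin ages into the ranges needed for the analysis
df["age_range"] = pd.cut(df["age"], bins=[20, 30, 40, 50, 60],
                         labels=["20-30", "30-40", "40-50", "50-60"])
print(df)
```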
Data Integration: Conflicting Data
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
American vs. British units
weight measurement: KG or pound
Height measurement: meter or inch
Information source #1 says that Alex lives in Bahirdar
Information source #2 says that Alex lives in Mekele
What to do?
Use both (He lives in both places)
Use the most recently updated piece of information
Flag row to be investigated further by hand
Use neither (We’d rather be incomplete than wrong)
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple
databases
Object identification: The same attribute or object may
have different names in different databases.
Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
Redundant attributes may be detected by correlation
analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson’s product moment
coefficient)
$$ r_{A,B} \;=\; \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{(n-1)\,\sigma_A\,\sigma_B} \;=\; \frac{\sum_{i=1}^{n} a_i b_i \;-\; n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A\,\sigma_B} $$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, and $\sigma_A$ and $\sigma_B$ are their respective standard deviations.
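A sketch computing r(A, B) for two hypothetical numeric attributes, both directly from the formula above and with numpy's built-in Pearson correlation for comparison:

```python
import numpy as np

# Two hypothetical numeric attributes measured on the same tuples
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 7.0, 9.0, 12.0])

n = len(A)
# Sample standard deviations (ddof=1) match the (n - 1) in the denominator
r = ((A - A.mean()) * (B - B.mean())).sum() / (
    (n - 1) * A.std(ddof=1) * B.std(ddof=1)
)
print(round(r, 4))                        # from the formula above
print(round(np.corrcoef(A, B)[0, 1], 4))  # numpy's Pearson correlation
```

Values of r range from −1 to +1: r > 0 indicates positive correlation, r = 0 no linear correlation, and r < 0 negative correlation.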
Covariance
Covariance is similar to correlation
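One common (population) form, consistent with the notation above, and its relation to the correlation coefficient:

$$ Cov(A,B) \;=\; \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B}) \;=\; \frac{1}{n}\sum_{i=1}^{n} a_i b_i \;-\; \bar{A}\,\bar{B}, \qquad r_{A,B} \;=\; \frac{Cov(A,B)}{\sigma_A\,\sigma_B} $$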
4. Data Reduction
Data reduction strategies
Dimensionality reduction,
Select best attributes or remove unimportant attributes
Numerosity reduction
Reduce data volume by choosing alternative, smaller forms of
data representation
Data compression
Is a technology that reduce the size of large files such that
smaller files take less memory space and fast to transfer over a
network or the Internet,
Data Reduction: Dimensionality Reduction
Dimensionality reduction
Helps to eliminate irrelevant attributes and reduce noise: attributes
that contain no information useful for the data mining task at hand
E.g., is a student's ID relevant for predicting the student's GPA?
Helps to avoid redundant attributes: attributes that duplicate
information contained in one or more other attributes
E.g., the purchase price of a product and the amount of sales tax paid
Reduces the time and space required for data mining
Allows easier visualization
Method: attribute subset selection
One method of reducing the dimensionality of data is
selecting the best subset of attributes
Heuristic Search in Attribute Selection
• Commonly used heuristic attribute selection methods:
– Best step-wise attribute selection:
• Start with empty set of attributes
• The best single attribute is picked first
• Then combine best attribute with the remaining to select the
best combined two attributes, then three attributes,…
• The process continues until the performance of the combined
attributes starts to decline
– Step-wise attribute elimination:
• Start with all attributes as best
• Eliminate the worst-performing attribute
• Repeat the process as long as the performance of the remaining
attributes improves
– Best combined attribute selection and elimination
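A minimal sketch of best step-wise (forward) attribute selection. The evaluate() callable is a placeholder for whatever scoring function measures how well a candidate attribute subset performs (e.g., model accuracy); the toy weights used below are purely illustrative.

```python
def forward_select(attributes, evaluate):
    """Greedy best step-wise attribute selection.

    attributes : list of candidate attribute names
    evaluate   : callable taking a list of attributes and returning a score
                 (higher is better) -- a placeholder for real model performance
    """
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        # Try adding each remaining attribute to the current subset
        scored = [(evaluate(selected + [a]), a) for a in remaining]
        score, best_attr = max(scored)
        if score <= best_score:
            break  # performance stopped improving: stop
        selected.append(best_attr)
        remaining.remove(best_attr)
        best_score = score
    return selected

# Toy usage: 'income' and 'age' carry information, 'id' does not
weights = {"income": 3, "age": 2, "id": 0}
print(forward_select(["id", "age", "income"],
                     lambda subset: sum(weights[a] for a in subset)))
# ['income', 'age'] -- 'id' adds nothing, so selection stops
```

Step-wise elimination works the same way in reverse: start with all attributes and repeatedly drop the one whose removal improves (or least harms) the score.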
Data Reduction: Numerosity Reduction
Different methods can be used, including Clustering and
sampling
Clustering
Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
There are many choices of clustering definitions and clustering
algorithms
Sampling
Obtain a small sample s to represent the whole data set N
Allows a mining algorithm to run with complexity that is potentially
sub-linear in the size of the data
Key principle: Choose a representative subset of the data using
suitable sampling technique
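A sketch of simple random sampling without replacement and of stratified sampling (which preserves class proportions) with pandas; the column names and class ratio are hypothetical.

```python
import pandas as pd

# Hypothetical data set N with an imbalanced class attribute
N = pd.DataFrame({
    "value": range(1000),
    "cls":   ["yes"] * 100 + ["no"] * 900,
})

# Simple random sample without replacement (10% of the data)
srs = N.sample(frac=0.10, random_state=1)

# Stratified sample: take 10% from each class separately
strat = N.groupby("cls", group_keys=False).sample(frac=0.10, random_state=1)

print(srs["cls"].value_counts())
print(strat["cls"].value_counts())  # preserves the 1:9 class ratio
```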
5. Data Transformation
A function that maps the entire set of values of a given
attribute to a new set of replacement values such that each
old value can be identified with one of the new values.
Methods for data transformation
– Normalization: Scaled to fall within a smaller, specified range of
values
• min-max normalization
• z-score normalization
Discretization: Reduce data size by dividing the range of a
continuous attribute into intervals. Interval labels can then be
used to replace actual data values.
Discretization can be performed recursively on an attribute using
methods such as
– Binning: divide values into intervals
– Concept hierarchy climbing: organizes concepts (i.e., attribute
values) hierarchically
Data Transformation: Normalization
min-max normalization
$$ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$
z-score normalization
$$ v' = \frac{v - mean_A}{stand\_dev_A} $$
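A sketch applying both normalizations above to a hypothetical income attribute, with min-max mapping into the new range [0.0, 1.0]:

```python
import numpy as np

income = np.array([12_000, 35_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation
zscore = (income - income.mean()) / income.std()

print(minmax.round(3))
print(zscore.round(3))
```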
Concept Hierarchy Generation
A concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse.
Concept hierarchy formation: recursively reduce the data by collecting
and replacing low-level concepts (such as numeric values for age) with
higher-level concepts (such as child, youth, adult, or senior).
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers.
[Figure: location hierarchy Kebele → Sub city → city → Region or state → country]
• A concept hierarchy can be automatically formed by analyzing the
number of distinct values, e.g., for a set of attributes:
{Kebele, city, state, country}
For numeric data, use discretization methods.
What is Data?
Data (a dataset) is a collection of data objects and their attributes.
An attribute is a property or characteristic of an object.
Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic,
dimension, or feature.
A collection of attributes describes an object.
An object is also known as a record, point, case, sample, entity, or
instance.
Example (columns are attributes, rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
The type of an attribute is determined by the set of possible values the
attribute can have: nominal, binary, ordinal, or numeric.
There are different types of attributes
Nominal: means “relating to names”.
The values of a nominal attribute are symbols or names of
things.
Nominal attributes are also referred to as categorical.
Examples: hair color (black, brown, blond, etc.), marital
status (single, married, divorced, widowed), occupation, etc.
Ordinal:
an attribute with possible values that have a meaningful order or
ranking among them
Examples: rankings, e.g., grades, height {tall, medium, short}
Types of Attributes
Binary :
is a nominal attribute with only two categories or
states: 0 (absent) or 1 (present), or Boolean (true or false)
Example: smoker (0 = non-smoker, 1 = smoker)
Interval-Scaled : Numeric Attributes
are measured on a scale of equal-size units.
allow us to compare and quantify the difference between
values
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio-Scaled: Numeric Attributes
a value as being a multiple (or ratio) of another value
Examples: temperature in Kelvin, length, time, counts
Data sets preparation for learning
A standard machine learning technique is to divide the
dataset into a training set and a test set.
Training dataset is used for learning the parameters of the model
in order to produce hypotheses.
A training set is a set of problem instances (described as a set
of properties and their values), together with a classification
of the instance.
Test dataset, which is never seen during the hypothesis forming
stage, is used to get a final, unbiased estimate of how well the
model works.
Test set evaluates the accuracy of the model/hypothesis in
predicting the categorization of unseen examples.
A set of instances and their classifications used to test the
accuracy of a learned hypothesis.
Divide the dataset into training & test
There are various ways in which to separate the data into
training and test sets
There are established ways of using the two sets to
assess the effectiveness and the predictive/descriptive
accuracy of a machine learning technique on unseen
examples:
The holdout method
Repeated holdout method
Cross-validation
The bootstrap
The holdout method
In this method, the given data are randomly partitioned
into two independent sets, a training set and a test set.
Usually: one third for testing, the rest for training
For small or “unbalanced” datasets, samples might not be
representative
Few or none instances of some classes
Stratified sampling: an advanced version that balances the
data
Make sure that each class is represented with approximately
equal proportions in both subsets.
Random subsampling (repeated holdout): a variation of the holdout
method in which it is repeated k times.
The overall accuracy estimate is taken as the average of the
accuracies obtained from each iteration.
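A sketch of the holdout split (one third for testing) after a random shuffle; with scikit-learn one would typically use train_test_split, but a plain-Python version makes the idea explicit.

```python
import random

def holdout_split(dataset, test_fraction=1/3, seed=42):
    """Randomly partition a dataset into independent training and test sets."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_fraction)
    return data[n_test:], data[:n_test]   # (training set, test set)

# Toy usage: 30 labelled instances
instances = [(i, "yes" if i % 3 == 0 else "no") for i in range(30)]
train, test = holdout_split(instances)
print(len(train), len(test))   # 20 10
```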
Cross-validation
Cross-validation works as follows:
First step: the data is randomly split into k subsets of equal
size.
A partition of a set is a collection of subsets for which the
intersection of any pair of sets is empty. That is, no element of
one subset is an element of another subset in a partition.
Second step: each subset in turn is used for testing and the
remainder for training
This is called k-fold cross-validation
Often the subsets are stratified before the cross-validation is
performed
The error estimates are averaged to yield an overall error
estimate.
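A sketch of k-fold cross-validation: the data is split into k disjoint folds, each fold serves once as the test set, and the error estimates are averaged. The evaluate(train, test) callable is a placeholder for training a model and returning its error.

```python
import random

def k_fold_cross_validation(dataset, k, evaluate, seed=0):
    """Return the average error over k train/test rounds."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k disjoint, equal-sized subsets
    errors = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        errors.append(evaluate(train, test))
    return sum(errors) / k

# Toy usage: the "error" here is just the fraction of odd numbers in each test fold
data = list(range(100))
print(k_fold_cross_validation(
    data, k=10,
    evaluate=lambda tr, te: sum(x % 2 for x in te) / len(te)))
```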
Bootstrap
The bootstrap method samples the given training tuples uniformly
with replacement,
i.e., the same tuple may be selected more than once.
A commonly used one is the .632 bootstrap
Suppose we are given a data set of d tuples. The data set is
sampled d times, with replacement, resulting in a bootstrap sample
or training set of d samples.
The data tuples that did not make it into the training set end up
forming the test set.
on average, 63.2% of the original data tuples will end up in the
bootstrap sample, and the remaining 36.8% will form the test set
(hence, the name, .632 bootstrap)
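A sketch of drawing one bootstrap sample: sample d tuples with replacement for training, and let the tuples that were never drawn form the test set. For reasonably large d, roughly 63.2% of the distinct tuples end up in the training sample.

```python
import random

def bootstrap_sample(dataset, seed=0):
    """Sample len(dataset) tuples with replacement; unused tuples form the test set."""
    rng = random.Random(seed)
    d = len(dataset)
    train_idx = [rng.randrange(d) for _ in range(d)]      # sample with replacement
    train = [dataset[i] for i in train_idx]
    chosen = set(train_idx)
    test = [x for i, x in enumerate(dataset) if i not in chosen]
    return train, test

data = list(range(1000))
train, test = bootstrap_sample(data)
print(len(set(train)) / len(data))   # ~0.632 distinct tuples in the sample
print(len(test) / len(data))         # ~0.368 left over for testing
```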