Data Mining Requires Collecting a Great Amount of Data (Available in Data Warehouses or Databases) to Achieve the Intended Objective

1. Data preprocessing involves cleaning, integrating, and transforming raw data from various sources into a format suitable for data mining.
2. Data cleaning aims to fill in missing values, smooth noisy data by identifying outliers, and resolve inconsistencies.
3. Data integration combines data from multiple sources, which can result in conflicts from different formats, structures, and levels of detail that must be resolved.

1. Overview of Data Preprocessing

 Data mining requires collecting a great amount of data (available in data warehouses or databases) to achieve the intended objective.
 Data mining starts by understanding the business or problem domain in order to gain business knowledge.
 Based on the business knowledge, data related to the business problem are identified from the database/data warehouse for mining.
 Before feeding data to data mining, we have to make sure of the quality of the data.
Data Quality: Why Preprocess the Data?
 Well-accepted multidimensional data quality measures are the following:
 Accuracy (free from errors and outliers)
 Completeness (no missing attributes and values)
 Consistency (no inconsistent values and attributes)
 Timeliness (appropriateness of the data for the purpose it is
required)
 Believability (acceptability)
 Interpretability (easy to understand)

 However, most of the data in the real world is of poor quality (incomplete, inconsistent, noisy, invalid, redundant, …).
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 redundant: including everything, some of which is irrelevant to our task
 No quality data, no quality mining results!

Data is often of low quality
 Collecting the required data is challenging
 In addition to its heterogeneous and distributed nature, real-world data is often low in quality.
 Why?
 You didn’t collect it yourself
 It probably was created for some other use, and then you came
along wanting to integrate it.
 People make mistakes (typos)
 Data collection instruments used may be faulty.
 Everyone had their own way of structuring and formatting data,
based on what was convenient for them.
 Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information.
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
2. Data Cleaning
 Data cleaning (or data cleansing) routines attempt to fill
in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
 Data cleaning tasks – this routine attempts to
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration

Incomplete (Missing) Data:
 Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data.
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of
entry
 history or changes of the data may not have been registered
How to Handle Missing Data?
 Ignore the tuple:
 usually done when the class label is missing (when doing classification)
 not an effective method unless the tuple contains several attributes with missing values
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with:
 a global constant: e.g., “unknown”, a new class?!
 a measure of central tendency for the attribute (e.g., the mean or median)
 e.g., if the average income of customers is $28,000, use this value to replace missing incomes
 the most probable value: determined with regression, inference-based methods such as the Bayesian formula, or a decision tree (most popular)
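As a rough illustration of the automatic fill-in strategies above, the following sketch uses pandas (an assumption; the slides do not prescribe a tool), with hypothetical "occupation" and "income" columns:

```python
# Sketch: filling in missing values automatically (pandas assumed available;
# the column names and values are hypothetical examples).
import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],
    "income": [30000, None, 26000, None],
})

# Global constant for a categorical attribute
df["occupation"] = df["occupation"].fillna("unknown")

# Measure of central tendency (mean) for a numeric attribute,
# e.g. replace missing income with the average income
df["income"] = df["income"].fillna(df["income"].mean())

print(df)
```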
Noisy Data

 Noise is a random error or variance in a measured variable.
 Incorrect attribute values may be due to
 faulty data collection instruments (e.g., OCR)
 data entry problems, e.g., ‘green’ is written as ‘rgeen’
 data transmission problems
 technology limitation
 inconsistency in naming conventions
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
 Manually check all data: tedious + infeasible?
 Sort data by frequency
 ‘green’ is more frequent than ‘rgeen’
 works well for categorical data
 Use, say, numerical constraints to catch corrupt data (see the sketch below)
 weight can’t be negative
 people can’t have more than 2 parents
 salary can’t be less than Birr 300
 Check for outliers (the case of the 8-meter man)
 Check for correlated outliers using n-grams (“pregnant male”)
 people can be male
 people can be pregnant
 people can’t be male AND pregnant
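The constraint checks above can be expressed as simple validation rules. The sketch below is a plain-Python illustration; the record fields and thresholds mirror the slide's examples and are otherwise hypothetical.

```python
# Sketch: catching corrupt values with simple numerical constraints.
records = [
    {"weight": 72, "salary": 4500, "n_parents": 2},
    {"weight": -5, "salary": 150, "n_parents": 3},   # violates all three rules
]

rules = {
    "weight cannot be negative":              lambda r: r["weight"] >= 0,
    "salary cannot be below Birr 300":        lambda r: r["salary"] >= 300,
    "people cannot have more than 2 parents": lambda r: r["n_parents"] <= 2,
}

for i, rec in enumerate(records):
    for message, is_valid in rules.items():
        if not is_valid(rec):
            print(f"record {i}: {message} -> {rec}")
```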
How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
 Regression
 Data can be smoothed by fitting the data to a function such as
with regression. (linear regression/multiple linear regression)
 Clustering
 Similar values are organized into groups (clusters).
 Values that fall outside of the clusters are considered outliers and may be removed, e.g., a noisy value such as one’s age = 200, or other widely deviating points.
 Combined computer and human inspection
 detect suspicious values and have them checked by a human (e.g., deal with possible outliers)
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 15, 21, 21,
24, 25, 28, 34
 Partition into (equi-depth) bins:
 Bin 1: 4, 8, 15
 Bin 2: 21, 21, 24
 Bin 3: 25, 28, 34
 Smoothing by bin means:
 Bin 1: 9, 9, 9
 Bin 2: 22, 22, 22
 Bin 3: 29, 29, 29
 Smoothing by bin boundaries:
 Bin 1: 4, 4, 15
 Bin 2: 21, 21, 24
 Bin 3: 25, 25, 34
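The worked example above can be reproduced with a short script. This is an illustrative sketch in plain Python, not part of the original slides.

```python
# Sketch: equi-depth binning with smoothing by bin means and bin boundaries.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
k = 3                                   # number of equal-frequency bins
size = len(prices) // k
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
# Smoothing by bin boundaries: each value is replaced by the closer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```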
3. Data Integration
 Data integration combines data from multiple sources (databases, data warehouses, files, and sometimes non-electronic sources) into a coherent store.
 Because of the use of different sources, data that is fine on its own may become problematic when we want to integrate it.
 Some of the issues are:
 Different formats and structures
 Conflicting and redundant data
 Data at different levels of detail
Data Integration: Formats
 Not everyone uses the same format. Do you agree?
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Dates are especially problematic:
 12/19/97
 19/12/97
 19/12/1997
 19-12-97
 Dec 19, 1997
 19 December 1997
 19th Dec. 1997
 Are you frequently writing money as:
 Birr 200, Br. 200, 200 Birr, …
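One common way to reconcile such date formats during integration is to try a list of candidate patterns and convert everything to a single canonical form. A sketch using Python's standard datetime module; the format list is illustrative, and because the first matching pattern wins, ambiguous strings (e.g. 19/12/97) still need care:

```python
# Sketch: normalizing heterogeneous date strings to ISO format during integration.
from datetime import datetime

CANDIDATE_FORMATS = ["%m/%d/%y", "%d/%m/%Y", "%d-%m-%y", "%b %d, %Y", "%d %B %Y"]

def to_iso(date_string):
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(date_string, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {date_string}")

print(to_iso("12/19/97"))          # 1997-12-19
print(to_iso("Dec 19, 1997"))      # 1997-12-19
print(to_iso("19 December 1997"))  # 1997-12-19
print(to_iso("19-12-97"))          # 1997-12-19
```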
Data at different level of detail than needed
 If it is at a finer level of detail, you can sometimes bin
it
• Example
– If I need age ranges of 20-30, 30-40, 40-50, etc. and
imported data contains birth date
– No problem! Divide data into appropriate categories
• Sometimes you cannot bin it
• Example
– I need age ranges 20-30, 30-40, 40-50 etc.
– Data is of age ranges 25-35, 35-45, etc.
– What to do?
• Ignore age ranges because you aren’t sure
• Make an educated guess based on imported data (e.g., assume that the # of people of age 25-35 is the average of the # of people of age 20-30 and 30-40)
Data Integration: Conflicting Data
 Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
American vs. British units
 weight measurement: KG or pound
 Height measurement: meter or inch
 Information source #1 says that Alex lives in Bahirdar
Information source #2 says that Alex lives in Mekele
 What to do?
Use both (He lives in both places)
Use the most recently updated piece of information
Flag row to be investigated further by hand
Use neither (We’d rather be incomplete than wrong)
Handling Redundancy in Data Integration
 Redundant data often occur when integrating multiple databases
 Object identification: The same attribute or object may
have different names in different databases.
 Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
 Redundant attributes may be detected by correlation analysis
 Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality

Correlation Analysis (Numeric Data)
 Correlation coefficient (also called Pearson’s product moment
coefficient)
   r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B}
           = \frac{\sum_{i=1}^{n} (a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A\,\sigma_B}

   where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum(a_i b_i) is the sum of the AB cross-product.

 If r_{A,B} > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
 r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
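A small sketch of the correlation coefficient, assuming NumPy is available; the two value lists are hypothetical:

```python
# Sketch: Pearson correlation coefficient following the (n - 1) formulation above.
import numpy as np

a = np.array([2, 3, 5, 4, 6], dtype=float)
b = np.array([5, 8, 10, 11, 14], dtype=float)

r = ((a - a.mean()) * (b - b.mean())).sum() / ((len(a) - 1) * a.std(ddof=1) * b.std(ddof=1))
print(round(r, 3))              # 0.941: positively correlated
print(np.corrcoef(a, b)[0, 1])  # the same value from NumPy's built-in
```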
Covariance
 Covariance is similar to correlation:

   Cov(p, q) = E\big((p - \bar{p})(q - \bar{q})\big) = \frac{\sum_{i=1}^{n} (p_i - \bar{p})(q_i - \bar{q})}{n},
   \qquad r_{p,q} = \frac{Cov(p, q)}{\sigma_p\,\sigma_q}

   where n is the number of tuples, \bar{p} and \bar{q} are the respective means of p and q, and \sigma_p and \sigma_q are the respective standard deviations of p and q.

 It can be simplified in computation as

   Cov(p, q) = E(p \cdot q) - \bar{p}\,\bar{q}

 Positive covariance: if Cov(p,q) > 0, then p and q tend to be directly related.
 Negative covariance: if Cov(p,q) < 0, then p and q are inversely related.
 Independence: Cov(p,q) = 0
Example: Co-Variance
 Suppose two stocks A and B have the following values in
one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry
trends, will their prices rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
 Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
 Thus, A and B rise together since Cov(A, B) = 4 > 0.
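The arithmetic can be checked with a few lines of code (NumPy assumed):

```python
# Sketch verifying the covariance example: Cov(A,B) = E(A*B) - E(A)E(B).
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov = (A * B).mean() - A.mean() * B.mean()
print(cov)                         # 4.0
print(np.cov(A, B, ddof=0)[0, 1])  # same value (population covariance)
```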
4. Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
Why data reduction? A database/data warehouse may store terabytes
of data. Complex data analysis may take a very long time to run on
the complete data set.

Data reduction strategies
Dimensionality reduction,
 Select best attributes or remove unimportant attributes
Numerosity reduction
 Reduce data volume by choosing alternative, smaller forms of
data representation
Data compression
 is a technology that reduces the size of large files so that the smaller files take less memory space and are faster to transfer over a network or the Internet
Data Reduction: Dimensionality Reduction
 Dimensionality reduction
 Helps to eliminate Irrelevant attributes and reduce noise: that
contain no information useful for the data mining task at hand
 E.g., is a student’s ID relevant for predicting the student’s GPA?
 Helps to avoid redundant attributes : that contain duplicate
information in one or more other attributes
 E.g. purchase price of a product & the amount of sales tax paid
 Reduce time and space required in data mining
 Allow easier visualization
 Method: attribute subset selection
 One of the methods to reduce the dimensionality of data is selecting the best attributes.
Heuristic Search in Attribute Selection
• Commonly used heuristic attribute selection methods:
– Best step-wise attribute selection:
• Start with empty set of attributes
• The best single-attribute is picked first
• Then combine best attribute with the remaining to select the
best combined two attributes, then three attributes,…
• The process continues until the performance of the combined
attributes starts to decline
– Step-wise attribute elimination:
• Start with all attributes as best
• Eliminate the worst performing attribute
• Repeat the process as long as the performance of the remaining combined attributes improves
– Best combined attribute selection and elimination
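A minimal sketch of best step-wise (forward) attribute selection, assuming scikit-learn is available; the iris data and the decision-tree scorer are illustrative choices, not part of the slides:

```python
# Sketch: greedy forward attribute selection, stopping when performance declines.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected, best_score = [], 0.0

while remaining:
    # Try adding each remaining attribute to the current best subset
    scores = {a: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [a]], y, cv=5).mean()
              for a in remaining}
    attr, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:        # stop when the combined performance stops improving
        break
    selected.append(attr)
    remaining.remove(attr)
    best_score = score

print("selected attribute indices:", selected, "cv accuracy:", round(best_score, 3))
```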
Data Reduction: Numerosity Reduction
 Different methods can be used, including Clustering and
sampling
 Clustering
 Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
 There are many choices of clustering definitions and clustering
algorithms
 Sampling
 obtaining a small sample s to represent the whole data set N
 Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
 Key principle: Choose a representative subset of the data using
suitable sampling technique
5. Data Transformation
 A function that maps the entire set of values of a given
attribute to a new set of replacement values such that each
old value can be identified with one of the new values.
 Methods for data transformation
– Normalization: Scaled to fall within a smaller, specified range of
values
• min-max normalization
• z-score normalization
 Discretization: Reduce data size by dividing the range of a
continuous attribute into intervals. Interval labels can then be
used to replace actual data values.
 Discretization can be performed recursively on an attribute using methods such as
– Binning: divide values into intervals
– Concept hierarchy climbing: organizes concepts (i.e., attribute
values) hierarchically
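As a small illustration of discretization by binning, the sketch below maps a continuous age attribute to interval labels, assuming pandas; the bin edges and labels are hypothetical:

```python
# Sketch: discretizing a continuous attribute into labeled intervals.
import pandas as pd

ages = pd.Series([3, 15, 24, 37, 52, 70])
labels = pd.cut(ages, bins=[0, 12, 25, 60, 120],
                labels=["child", "youth", "adult", "senior"])
print(labels.tolist())   # ['child', 'youth', 'youth', 'adult', 'adult', 'senior']
```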
Data Transformation: Normalization
 min-max normalization:

   v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

 z-score normalization:

   v' = \frac{v - mean_A}{stand\_dev_A}

 normalization by decimal scaling:

   v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
Example:
 Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0].
 Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively.
 Suppose that the recorded values of A range from -986 to 917.
Normalization
 Min-max normalization:

   v' = \frac{v - min_A}{max_A - min_A}\,(newMax - newMin) + newMin

   Ex. Let income, ranging from $12,000 to $98,000, be normalized to [0.0, 1.0]. Then $73,600 is mapped to

   \frac{73,600 - 12,000}{98,000 - 12,000}\,(1.0 - 0) + 0 = 0.716

 Z-score normalization (μ: mean, σ: standard deviation):

   v' = \frac{v - \mu_A}{\sigma_A}

   Ex. Let μ = 54,000, σ = 16,000. Then \frac{73,600 - 54,000}{16,000} = 1.225

 Decimal scaling: Suppose that the recorded values of A range from -986 to 917. To normalize by decimal scaling, we divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
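The three worked examples can be verified with plain Python (an illustrative sketch, not part of the slides):

```python
# Sketch reproducing the normalization examples above.
income = 73_600.0

# Min-max normalization to [0.0, 1.0] with min = 12,000 and max = 98,000
min_a, max_a = 12_000.0, 98_000.0
print(round((income - min_a) / (max_a - min_a), 3))   # 0.716

# Z-score normalization with mean = 54,000 and std = 16,000
mean_a, std_a = 54_000.0, 16_000.0
print(round((income - mean_a) / std_a, 3))            # 1.225

# Decimal scaling for values in [-986, 917]: divide by 10**3
for v in (-986, 917):
    print(v / 10**3)                                  # -0.986, 0.917
```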
Concept Hierarchy Generation
 A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse.
 Example location hierarchy: Kebele → Sub city → city → Region or state → country.
 Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as child, youth, adult, or senior).
 Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers.
• A concept hierarchy can be automatically formed by analysis of the number of distinct values, e.g., for a set of attributes: {Kebele, city, state, country}.
 For numeric data, use discretization methods.
What is Data?
 Data (a dataset) is a collection of data objects and their attributes.
 An attribute is a property or characteristic of an object.
 Examples: eye color of a person, temperature, etc.
 An attribute is also known as a variable, field, characteristic, dimension, or feature.
 A collection of attributes describes an object.
 An object is also known as a record, point, case, sample, entity, or instance.

 Example dataset (rows are objects, columns are attributes):

 Tid  Refund  Marital Status  Taxable Income  Cheat
 1    Yes     Single          125K            No
 2    No      Married         100K            No
 3    No      Single          70K             No
 4    Yes     Married         120K            No
 5    No      Divorced        95K             Yes
 6    No      Married         60K             No
 7    Yes     Divorced        220K            No
 8    No      Single          85K             Yes
 9    No      Married         75K             No
 10   No      Single          90K             Yes
Types of Attributes
 The type of an attribute is determined by the set of possible values the attribute can have: nominal, binary, ordinal, or numeric.
 There are different types of attributes
Nominal: means “relating to names”.
 The values of a nominal attribute are symbols or names of
things.
 Nominal attributes are also referred to as categorical.
 Examples: hair color (black, brown, blond, etc.), marital status (single, married, divorced, widowed), occupation, etc.
Ordinal:
 an attribute with possible values that have a meaningful order or
ranking among them
 Examples: rankings, e.g., grades, or height {tall, medium, short}
Types of Attributes
Binary:
 is a nominal attribute with only two categories or states: 0 (absent) or 1 (present), or Boolean (true or false)
 Example: smoker (0 = not a smoker, 1 = smoker)
Interval-Scaled: Numeric Attributes
 are measured on a scale of equal-size units.
 allow us to compare and quantify the difference between
values
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio-Scaled: Numeric Attributes
 a value is a multiple (or ratio) of another value
 Examples: temperature in Kelvin, length, time, counts
Data sets preparation for learning
A standard machine learning technique is to divide the
dataset into a training set and a test set.
 Training dataset is used for learning the parameters of the model
in order to produce hypotheses.
 A training set is a set of problem instances (described as a set
of properties and their values), together with a classification
of the instance.
 Test dataset, which is never seen during the hypothesis forming
stage, is used to get a final, unbiased estimate of how well the
model works.
 Test set evaluates the accuracy of the model/hypothesis in
predicting the categorization of unseen examples.
 A set of instances and their classifications used to test the
accuracy of a learned hypothesis.
Divide the dataset into training & test
 There are various ways in which to separate the data into
training and test sets
There are established ways of using the two sets to assess the effectiveness and the predictive/descriptive accuracy of a machine learning technique over unseen examples:
The holdout method
 Repeated holdout method
Cross-validation
The bootstrap
The holdout method
 In this method, the given data are randomly partitioned into two independent sets, a training set and a test set.
 Usually: one third for testing, the rest for training
 For small or “unbalanced” datasets, samples might not be
representative
 Few or none instances of some classes
 Stratified sample: advanced version of balancing the
data
 Make sure that each class is represented with approximately
equal proportions in both subsets.
 Random subsampling : a variation of the holdout method in
which the holdout method is repeated k times.
 The overall accuracy estimate is taken as the average of the
accuracies obtained from each iteration.
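A sketch of a (stratified) holdout split, assuming scikit-learn; the iris dataset stands in for any labeled data:

```python
# Sketch: holdout method with one third for testing, stratified by class.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y)

print(len(X_train), len(X_test))   # 100 50
```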
Cross-validation
 Cross-validation works as follows:
 First step: data is split into k subsets of equal-sized sets
randomly.
 A partition of a set is a collection of subsets for which the
intersection of any pair of sets is empty. That is, no element of
one subset is an element of another subset in a partition.
 Second step: each subset in turn is used for testing and the
remainder for training
This is called k-fold cross-validation
 Often the subsets are stratified before the cross-validation is
performed
 The error estimates are averaged to yield an overall error
estimate.
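A sketch of stratified k-fold cross-validation, again assuming scikit-learn; the classifier choice is illustrative:

```python
# Sketch: 10-fold stratified cross-validation; each fold is used once for testing,
# and the fold accuracies are averaged into one overall estimate.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())
```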
Bootstrap
 the bootstrap method samples the given training tuples uniformly
with replacement
 i.e., the sampling may select the same tuple more than once.
 A commonly used one is the .632 bootstrap
 Suppose we are given a data set of d tuples. The data set is
sampled d times, with replacement, resulting in a bootstrap sample
or training set of d samples.
 The data tuples that did not make it into the training set end up
forming the test set.
 on average, 63.2% of the original data tuples will end up in the
bootstrap sample, and the remaining 36.8% will form the test set
(hence, the name, .632 bootstrap)
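A sketch of drawing one bootstrap sample of size d with replacement, assuming NumPy; it also shows that roughly 63.2% of the tuples end up in the sample:

```python
# Sketch: one .632 bootstrap sample.
import numpy as np

rng = np.random.default_rng(0)
d = 1000                                        # number of tuples in the data set
idx = np.arange(d)

train = rng.choice(idx, size=d, replace=True)   # sample d times with replacement
test = np.setdiff1d(idx, train)                 # tuples never drawn form the test set

print(len(np.unique(train)) / d)   # ~0.632 of the original tuples appear in training
print(len(test) / d)               # ~0.368 left for testing
```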