Lecture Notes 1.7 & 1.8
1. Data Cleaning.
Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
(i). Missing values
1. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification or description). This method
is not very effective, unless the tuple contains several attributes with missing
values. It is especially poor when the percentage of missing values per attribute
varies considerably.
2. Fill in the missing value manually: In general, this approach is time-
consuming and may not be feasible given a large data set with many missing
values.
3. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant, such as a label like
"Unknown". If missing values are replaced by, say, "Unknown", then the
mining program may mistakenly think that they form an interesting concept,
since they all have a value in common, that of "Unknown". Hence, although
this method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value: For example, suppose
that the average income of All Electronics customers is $28,000. Use this value
to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as
the given tuple: For example, if classifying customers according to credit risk,
replace the missing value with the average income value for customers in the
same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be
determined with inference-based tools using a Bayesian formalism or decision
tree induction. For example, using the other customer attributes in your data
set, you may construct a decision tree to predict the missing values for income.
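Methods 3 to 5 above can be sketched in a few lines of Python. This is only a minimal illustration: the customer records, the attribute names (risk, income), and the helper function names are all hypothetical and are not taken from any real data set.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customer records; None marks a missing income value.
customers = [
    {"risk": "low", "income": 40000},
    {"risk": "low", "income": None},
    {"risk": "low", "income": 50000},
    {"risk": "high", "income": 20000},
    {"risk": "high", "income": None},
    {"risk": "high", "income": 10000},
]

def fill_constant(rows, attr, constant="Unknown"):
    """Method 3: replace every missing value with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Method 4: replace missing values with the mean of the observed values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]

def fill_class_mean(rows, attr, class_attr):
    """Method 5: replace missing values with the mean of the same class.

    Assumes every class has at least one observed value for attr.
    """
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[class_attr]].append(r[attr])
    class_means = {c: mean(v) for c, v in observed.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

by_mean = fill_mean(customers, "income")
by_class = fill_class_mean(customers, "income", "risk")
```

With this toy data, the overall mean gives both incomplete tuples the same income, while the class mean gives the low-risk and high-risk tuples different values. Each function returns new records and leaves the input list unchanged.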
(ii). Noisy data
Noise is a random error or variance in a measured variable.
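1. Binning:
Binning methods smooth a sorted data value by consulting its neighborhood, that
is, the values around it. The sorted values are distributed into bins, and each
value is then replaced using the other values in its bin, for example by the
closest bin boundary:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34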
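The three bins listed above are consistent with the result of smoothing by bin boundaries, where each value is replaced by the closer of its bin's minimum and maximum. The sketch below illustrates that idea, along with smoothing by bin means. The raw values in prices are an assumption, chosen so that boundary smoothing reproduces the bins shown; the function names are hypothetical.

```python
from statistics import mean

def equal_depth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # each bin is sorted, so these are its boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Assumed raw data (not given in these notes).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
by_boundaries = smooth_by_boundaries(bins)
by_means = smooth_by_means(bins)
```

Ties between the two boundaries are resolved toward the lower boundary here; that is a design choice of this sketch, not something the notes specify.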
2. Clustering:
Outliers may be detected by clustering, where similar values are organized into
groups or clusters. Intuitively, values that fall outside of the set of clusters
may be considered outliers.
Figure: Outliers may be detected by clustering analysis.
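As a rough illustration of cluster-based outlier detection, the sketch below groups one-dimensional values by the gaps between sorted neighbors and flags values whose cluster is very small. This gap-based grouping is a deliberately simple stand-in for a real clustering algorithm; the function names, the thresholds max_gap and min_size, and the sample readings are all assumptions.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values; a new cluster starts when a gap exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough to the previous value
        else:
            clusters.append([v])     # gap too large: start a new cluster
    return clusters

def cluster_outliers(values, max_gap, min_size):
    """Flag values that belong to a cluster with fewer than min_size members."""
    return [
        v
        for cluster in gap_clusters(values, max_gap)
        if len(cluster) < min_size
        for v in cluster
    ]

# Hypothetical readings: two dense groups and one isolated value.
readings = [10, 11, 12, 13, 50, 51, 52, 53, 95]
outliers = cluster_outliers(readings, max_gap=3, min_size=2)
```

Both thresholds have to be tuned to the data; a max_gap that is too large merges the isolated value into a neighboring cluster and hides it.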
(iii). Inconsistent data
There may be inconsistencies in the data recorded for some transactions. Some
data inconsistencies may be corrected manually using external references. For
example, errors made at data entry may be corrected by performing a paper
trace. This may be coupled with routines designed to help correct the
inconsistent use of codes.
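A routine for correcting the inconsistent use of codes might look like the sketch below. It maps known variants of a code to one canonical form and sets aside anything it cannot resolve for manual checking. The payment codes, the lookup table, and the function names are hypothetical examples, not part of any particular system.

```python
# Hypothetical lookup table: known variants mapped to one canonical code.
CANONICAL = {
    "cc": "CREDIT",
    "credit": "CREDIT",
    "credit card": "CREDIT",
    "cash": "CASH",
    "csh": "CASH",
}

def normalize_code(raw, table=CANONICAL):
    """Return the canonical code for raw, or None if the variant is unknown."""
    return table.get(raw.strip().lower())

def clean_codes(records, attr):
    """Split records into those with a corrected code and those left unresolved."""
    cleaned, unresolved = [], []
    for r in records:
        code = normalize_code(r[attr])
        if code is None:
            unresolved.append(r)               # needs manual correction
        else:
            cleaned.append({**r, attr: code})  # copy with the canonical code
    return cleaned, unresolved

# Hypothetical transactions with inconsistently entered payment codes.
transactions = [
    {"id": 1, "payment": "CC"},
    {"id": 2, "payment": " Credit Card "},
    {"id": 3, "payment": "csh"},
    {"id": 4, "payment": "chq?"},
]
cleaned, unresolved = clean_codes(transactions, "payment")
```

Records in unresolved are the ones that would still need manual correction against an external reference, such as the paper trace mentioned above.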