Lecture Notes 1.7 & 1.8

The lecture notes cover data cleaning techniques in data mining, including handling missing values, noisy data, and inconsistent data. Various methods for filling missing values are discussed, such as using global constants, attribute means, and decision tree induction. Additionally, the notes explain binning methods, clustering for outlier detection, and regression for data smoothing.

Uploaded by

Sajal Jain

Lecture Notes

Course Name: Data Mining and Warehousing


Course Code: 22CSH-380

1. Data Cleaning.

Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
(i). Missing values

1. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification or description). This method
is not very effective unless the tuple contains several attributes with missing
values, and it is especially poor when the percentage of missing values per
attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-
consuming and may not be feasible given a large data set with many missing
values.
3. Use a global constant to fill in the missing value: Replace all
missing attribute values with the same constant, such as a label like
"Unknown". If missing values are replaced by, say, "Unknown", the
mining program may mistakenly conclude that they form an interesting concept,
since they all share the common value "Unknown". Hence, although
this method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value: For example, suppose
that the average income of All Electronics customers is $28,000. Use this value
to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as
the given tuple: For example, if classifying customers according to credit risk,
replace the missing value with the average income value for customers in the
same credit risk category as that of the given tuple.

6. Use the most probable value to fill in the missing value: This may be
determined with inference-based tools using a Bayesian formalism or decision
tree induction. For example, using the other customer attributes in your data
set, you may construct a decision tree to predict the missing values for income.
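Methods 4 and 5 above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the `records` data, the `None` marker for missing values, and the function names are all hypothetical.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing income value.
records = [
    {"credit_risk": "low",  "income": 30000.0},
    {"credit_risk": "low",  "income": None},
    {"credit_risk": "high", "income": 18000.0},
    {"credit_risk": "high", "income": None},
]

def fill_with_attribute_mean(rows, attr):
    """Method 4: replace missing values with the overall attribute mean."""
    known = [r[attr] for r in rows if r[attr] is not None]
    m = mean(known)
    return [dict(r, **{attr: m if r[attr] is None else r[attr]}) for r in rows]

def fill_with_class_mean(rows, attr, cls):
    """Method 5: replace missing values with the mean of the same class."""
    out = []
    for r in rows:
        if r[attr] is None:
            same = [x[attr] for x in rows
                    if x[cls] == r[cls] and x[attr] is not None]
            r = dict(r, **{attr: mean(same)})
        out.append(r)
    return out

# Overall mean of the known incomes is 24000.0, so method 4 fills both
# gaps with 24000.0; method 5 fills them with 30000.0 (low) and 18000.0 (high).
filled = fill_with_attribute_mean(records, "income")
by_class = fill_with_class_mean(records, "income", "credit_risk")
```

Note how the class-conditional fill (method 5) preserves the difference between risk groups that the global mean (method 4) would wash out.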
(ii). Noisy data
Noise is a random error or variance in a measured variable.

Apex Institute of Technology, Chandigarh University, India


1. Binning methods:

Binning methods smooth a sorted data value by consulting its "neighbourhood",
or the values around it. The sorted values are distributed into a number of
'buckets', or bins. Because binning methods consult the neighbourhood of
values, they perform local smoothing.
In this example, the data for price are first sorted and partitioned into equal-
depth bins (of depth 3). In smoothing by bin means, each value in a bin is
replaced by the mean value of the bin. For example, the mean of the values 4,
8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by
the value 9. Similarly, smoothing by bin medians can be employed, in which
each bin value is replaced by the bin median. In smoothing by bin boundaries,
the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
(i).Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(ii).Partition into equi-depth bins (of depth 3):
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
(iii).Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
(iv).Smoothing by bin boundaries:

Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
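The worked example above can be reproduced with a short sketch. The function names are hypothetical; the data is the price list from the notes.

```python
def equi_depth_bins(values, depth):
    """Partition sorted values into bins of equal depth (size)."""
    s = sorted(values)
    return [s[i:i + depth] for i in range(0, len(s), depth)]

def smooth_by_means(bins):
    """Replace each value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closest bin boundary (min or max)."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equi_depth_bins(prices, 3)
# bins            → [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
# smooth by means → [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
```

Running `smooth_by_boundaries(bins)` reproduces the last step of the example: 8 is closer to 4 than to 15, and 28 is closer to 25 than to 34.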
2. Clustering:

Outliers may be detected by clustering, where similar values are organized into
groups or clusters.
Figure: Outliers may be detected by clustering analysis.
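As a minimal sketch of this idea (not from the notes): a simple one-dimensional clustering splits sorted values wherever the gap between neighbours exceeds a threshold, and values that end up in very small clusters are flagged as outliers. The function names, the `gap` threshold, and the extra value 120 are all hypothetical.

```python
def cluster_1d(values, gap):
    """Group sorted 1-D values: start a new cluster whenever the gap
    between consecutive values exceeds the threshold."""
    s = sorted(values)
    clusters = [[s[0]]]
    for v in s[1:]:
        if v - clusters[-1][-1] > gap:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    return clusters

def outliers_by_clustering(values, gap, min_size=2):
    """Values falling in clusters smaller than min_size are outliers."""
    return [v for c in cluster_1d(values, gap) for v in c if len(c) < min_size]

# The price data from above, plus one suspicious value.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34, 120]
# 120 sits far from every cluster, so it is flagged:
# outliers_by_clustering(prices, 10) → [120]
```

Real systems would use a proper clustering algorithm (e.g. k-means or DBSCAN) in more than one dimension, but the principle is the same: values that fall outside all dense groups are outlier candidates.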



3. Combined computer and human inspection: Outliers may be
identified through a combination of computer and human inspection. In one
application, for example, an information-theoretic measure was used to help
identify outlier patterns in a handwritten character database for classification.
4. Regression: Data can be smoothed by fitting the data to a function, such as
with regression.
Linear regression involves finding the "best" line to fit two variables, so that
one variable can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more
than two variables are involved and the data are fit to a multidimensional
surface.
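Smoothing by linear regression can be sketched with the standard least-squares formulas for the line y = a + b·x. The data points below are hypothetical, invented for illustration.

```python
def linear_fit(xs, ys):
    """Least-squares fit of the line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical noisy measurements roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = linear_fit(xs, ys)
# Smoothed values: each y is replaced by the fitted value a + b*x.
smoothed = [a + b * x for x in xs]
```

Each original value is then replaced by its point on the fitted line, which removes the random noise while keeping the underlying trend.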
(iii). Inconsistent data:

There may be inconsistencies in the data recorded for some transactions. Some
data inconsistencies may be corrected manually using external references. For
example, errors made at data entry may be corrected by performing a paper
trace. This may be coupled with routines designed to help correct the
inconsistent use of codes.

Suggestive Reading Material


• TEXT BOOKS
Introduction to Data Mining, Tan, Steinbach and Vipin Kumar, Pearson Education, 2016
• REFERENCE BOOKS
Data Mining: Concepts and Techniques, Han, Kamber and Pei, Elsevier
• Journals:
• http://www.ijsmsjournal.org/ijsms-v4i4p137.html
• https://www.springer.com/journal/41060
