4 - Finding and Fixing Data Quality Issues
Tasks:
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
Finding and Fixing Data Quality Issues
• Data in the real world is dirty
– Incomplete (Missing): lacking attribute values, lacking
certain attributes of interest, or containing only aggregated
data
– Noisy (inaccurate): containing errors or outliers (values that
deviate from the expected)
– Inconsistent: containing discrepancies in codes or names (e.g., discrepancies in the department codes used to categorize items)
• No quality data, no quality results!
– Quality decisions must be based on quality data
– The data set/warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness, believability,
interpretability, accessibility
Finding and Fixing Data Quality Issues: Main Tasks
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and standardization (scaling to a specific range)
– Data discretization (mostly for numerical data), concept hierarchy generation (for categorical data), and entropy-based discretization; these are all also forms of data transformation
• Data reduction
– Obtains reduced representation in volume but produces the same or similar
analytical results
– Dimensionality reduction, Numerosity reduction
Data Cleaning
(Data Cleansing)
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective in certain cases, e.g., when the fraction of missing values is large
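A minimal pandas sketch of this step, plus one common way of filling the remaining gaps (the attribute mean) as an assumed illustration; the DataFrame and column names are hypothetical:

import numpy as np
import pandas as pd

# Toy sales data; buys_pc is the class label (hypothetical columns).
df = pd.DataFrame({
    "income": [42000, np.nan, 58000, 61000],
    "age": [25, 31, np.nan, 47],
    "buys_pc": ["yes", "no", None, "yes"],
})

# Ignore the tuple: drop rows whose class label is missing.
df = df.dropna(subset=["buys_pc"])

# One common option for the remaining gaps (an assumption here, not the
# only choice): fill numeric attributes with the attribute mean.
for col in ["income", "age"]:
    df[col] = df[col].fillna(df[col].mean())

print(df)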
Data Transformation: Normalization and Standardization
• Min-max normalization (to a new range [new_min_A, new_max_A]):
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
• Z-score standardization:
v' = (v - mean_A) / stand_dev_A
• Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful for algorithms that do not assume any distribution of the data, such as k-nearest neighbors and neural networks.
• Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution. Also, unlike normalization, standardization does not have a bounding range.
• In the end, the choice between normalization and standardization depends on your problem and the machine learning algorithm you are using. There is no hard and fast rule; you can always fit your model to raw, normalized, and standardized data and compare the performance.
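A minimal NumPy sketch of both rescalings defined above; the attribute values and the target range [0, 1] are made-up:

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # values of attribute A

# Min-max normalization to the new range [0, 1].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score standardization: subtract the mean, divide by the standard deviation.
v_zscore = (v - v.mean()) / v.std()

print(v_minmax)  # bounded within [0, 1]
print(v_zscore)  # unbounded, mean 0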
Discretization
Binning (for discretization and smoothing)
Binning or discretization is the process of transforming continuous numerical variables into discrete (categorical) counterparts. An example is to bin values for Age into categories such as 20-39, 40-59, and 60-79. Binning may also improve the accuracy of predictive models by reducing noise or non-linearity.
Equal-width binning divides the range of the data into k intervals of equal size. The width of the intervals is:
w = (max - min)/k
and the interval boundaries are:
min+w, min+2w, ..., min+(k-1)w
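A short NumPy sketch of equal-width binning using the width and boundaries defined above; the Age values and k = 3 are illustrative:

import numpy as np

age = np.array([22, 25, 31, 38, 45, 52, 58, 63, 71, 79])
k = 3

w = (age.max() - age.min()) / k            # interval width: (max - min) / k
edges = age.min() + w * np.arange(1, k)    # boundaries: min+w, ..., min+(k-1)w

# Assign each value to one of the k equal-width bins (0 .. k-1).
bins = np.digitize(age, edges)
print(w, edges, bins)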
Entropy-Based Discretization
• E = -F_Yes log2(F_Yes) - F_No log2(F_No)
• where F_Yes and F_No are the fractions of the two classes (for example, YES and NO) in the class attribute
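A small sketch of the two-class entropy above; the label vector is invented and the helper name binary_entropy is mine:

import numpy as np

labels = np.array(["yes", "yes", "no", "yes", "no"])  # class attribute

def binary_entropy(labels):
    # E = -F_yes*log2(F_yes) - F_no*log2(F_no), with 0*log2(0) treated as 0.
    f_yes = np.mean(labels == "yes")
    f_no = 1.0 - f_yes
    return -sum(f * np.log2(f) for f in (f_yes, f_no) if f > 0)

print(binary_entropy(labels))  # about 0.971 for 3 "yes" vs 2 "no"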
Gini Index
• The Gini index is the cost function used to evaluate splits in the dataset. Each split has two important aspects: the attribute being split on and the attribute value at which the split is made.
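A short sketch of evaluating one candidate split with the Gini index, assuming the standard definition (impurity of a group = 1 minus the sum of squared class proportions; split cost = size-weighted impurity of the two groups); the age/label data and the split point are invented:

import numpy as np

def gini(labels):
    # Impurity of one group: 1 - sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_of_split(values, labels, split_value):
    # Cost of splitting on an (attribute, attribute value) pair:
    # size-weighted impurity of the two resulting groups.
    left, right = labels[values <= split_value], labels[values > split_value]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

age = np.array([22, 25, 31, 45, 52, 63])
buys = np.array(["no", "no", "yes", "yes", "yes", "no"])
print(gini_of_split(age, buys, split_value=28))  # candidate split: Age <= 28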
Data Reduction
• Problem: a big data repository may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set
• Solution?
– Data reduction…
Data Reduction
• Obtains a reduced representation of the data
set that is much smaller in volume but yet
produces the same (or almost the same)
analytical results
• Data reduction strategies
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
4. Discretization and concept hierarchy generation
5. Sampling
1. Dimensionality Reduction
– Attribute subset selection is a method of dimensionality reduction in
which irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed.
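A minimal sketch of one simple form of attribute subset selection: dropping attributes with a single distinct value (irrelevant) and one of each pair of near-perfectly correlated numeric attributes (redundant). The DataFrame, column names, and the 0.95 correlation threshold are assumptions for illustration, not a prescribed procedure:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40, 55, 60, 75, 90],
    "income_in_k": [40, 55, 60, 75, 90],          # redundant copy of income
    "country": ["US", "US", "US", "US", "US"],    # constant, hence irrelevant
    "age": [60, 25, 48, 30, 41],
})

# Drop attributes with a single distinct value (they carry no information).
df = df.loc[:, df.nunique() > 1]

# Drop one of each pair of (near-)perfectly correlated numeric attributes.
corr = df.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=redundant)

print(df.columns.tolist())  # the reduced attribute subset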
5. Sampling
Figure: drawing a simple random sample from the raw data, either without replacement (SRSWOR) or with replacement (SRSWR).
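Both sampling schemes in the figure can be sketched in a few lines of NumPy; the raw-data array and the sample size n = 1000 are placeholders:

import numpy as np

rng = np.random.default_rng(seed=0)
raw = np.arange(1_000_000)   # stand-in for the tuples of the raw data
n = 1000

srswor = rng.choice(raw, size=n, replace=False)  # SRSWOR: no tuple drawn twice
srswr = rng.choice(raw, size=n, replace=True)    # SRSWR: tuples may repeat

print(len(np.unique(srswor)), len(np.unique(srswr)))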
• Ordinal
An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the categories. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.
Interval
• Interval scales are numeric scales in which we know not only the
order, but also the exact differences between the values. The
classic example of an interval scale is Celsius temperature because
the difference between each value is the same. For example, the
difference between 60 and 50 degrees is a measurable 10 degrees,
as is the difference between 80 and 70 degrees. Time is another
good example of an interval scale in which the increments are
known, consistent, and measurable.
Concept hierarchy generation without data semantics: specification of a set of attributes
Concept hierarchy can be automatically generated
based on the number of distinct values per attribute
in the given attribute set. The attribute with the most
distinct values is placed at the lowest level of the
hierarchy.
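A small pandas sketch of this heuristic, counting distinct values per attribute and ordering the attributes from fewest (top, most general) to most (bottom, most specific); the location attributes are a typical illustration assumed here:

import pandas as pd

df = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA", "USA", "Canada"],
    "province_or_state": ["BC", "ON", "NY", "NY", "CA", "BC"],
    "city": ["Vancouver", "Toronto", "New York", "Buffalo", "LA", "Victoria"],
})

# Fewer distinct values -> higher (more general) level of the hierarchy;
# the attribute with the most distinct values goes to the lowest level.
distinct = df.nunique().sort_values()
print(distinct.to_dict())
print("hierarchy, top to bottom:", " -> ".join(distinct.index))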