Data Quality

The document discusses different types of data quality issues including noisy data, outliers, missing values, and duplicate data. It provides examples and methods for handling each type of issue.

Uploaded by Kiran

There are many definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making and planning".
6 Characteristics of Data Quality
 Accuracy
 Completeness
 Validity
 Relevance
 Uniqueness
 Timeliness
No quality data, no quality mining results! Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Examples of data quality problems:
 Noise
 Outliers
 Missing values
 Duplicate data
Noisy data:
Noisy data is data containing a large amount of meaningless additional information, called noise. This includes data corruption, and the term is often used as a synonym for corrupt data.

Noisy data contains errors or outlier values that deviate from what is expected.
e.g., Salary = "-10"
Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
How to Handle Noisy Data?
• Binning
– first sort the data and partition it into (equal-frequency) bins
– then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
– smooth by fitting the data to regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and have a human check them (e.g., deal with possible outliers)
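The binning approach above can be sketched in a few lines of plain Python. This is a minimal illustration of equal-frequency binning with smoothing by bin means; the function name and the sample prices are illustrative, not from the original slides.

```python
# Equal-frequency binning with smoothing by bin means (illustrative sketch).

def smooth_by_bin_means(values, n_bins):
    """Sort values, split into equal-frequency bins, and replace each
    value with the mean of its bin."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # items per bin; last bin takes any remainder
    smoothed = []
    for i in range(n_bins):
        start = i * size
        end = start + size if i < n_bins - 1 else len(ordered)
        bin_vals = ordered[start:end]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# each bin of three values is replaced by its mean:
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians or bin boundaries works the same way, only the value substituted per bin changes.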
Outlier:
An outlier is a point or observation that deviates significantly from the other observations. Outliers can be detected from graphical representations:
 Scatter plot and
 Box plot
An outlier may be due to variability in the measurement or it
may indicate experimental error.
Outlier treatment:
 Retention
 Exclusion
 Other treatment method
Boxplot:
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary:
 Minimum
 First quartile (Q1)
 Median
 Third quartile (Q3)
 Maximum
Fig: boxplot
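The five-number summary behind a boxplot, and the usual 1.5 × IQR whisker rule for flagging outliers, can be computed directly. This is a sketch using Python's standard `statistics` module; the sample data and function names are made up for illustration.

```python
# Five-number summary and IQR-based outlier flagging, as used by a boxplot.
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    return min(values), q1, med, q3, max(values)

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the common
    boxplot whisker convention."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 100]
print(iqr_outliers(data))  # [100] -- far above Q3 + 1.5*IQR
```

Whether such a flagged point is retained, excluded, or treated otherwise is the separate outlier-treatment decision discussed above.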
Scatter plot:
A scatter plot uses dots to represent values for
two different numeric variables. The position of
each dot on the horizontal and vertical axis
indicates values for an individual data
point. Scatter plots are used to observe
relationships between variables.
Fig: Scatter plot
Missing Data
 Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the
time of entry
– history or changes of the data were not recorded
 Missing data may need to be inferred.
How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
– a global constant: e.g., "unknown", a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class: smarter
– the most probable value: inference-based, such as a Bayesian formula or decision tree
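The automatic filling options above can be sketched as follows. This is a minimal illustration of the "attribute mean for the same class" strategy on plain Python dicts; the record fields (`income`, `segment`) and the function name are made-up examples, not from the slides.

```python
# Fill missing values with the class-conditional attribute mean,
# falling back to the global mean when a class has no known values.

def fill_with_class_mean(records, attr, class_attr):
    present = [r[attr] for r in records if r[attr] is not None]
    global_mean = sum(present) / len(present)
    # Collect known values per class.
    by_class = {}
    for r in records:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    # Replace each None with its class mean (or the global mean).
    for r in records:
        if r[attr] is None:
            vals = by_class.get(r[class_attr])
            r[attr] = sum(vals) / len(vals) if vals else global_mean
    return records

customers = [
    {"income": 30, "segment": "A"},
    {"income": 50, "segment": "A"},
    {"income": None, "segment": "A"},  # filled with mean of segment A
    {"income": 90, "segment": "B"},
]
fill_with_class_mean(customers, "income", "segment")
print(customers[2]["income"])  # 40.0
```

Using the class-conditional mean rather than the global mean is the "smarter" variant listed above, since it conditions the estimate on records most similar to the incomplete one.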
Handling Duplicate Data
• Redundant data often occur when multiple databases are integrated
– Object identification: the same attribute or object may have different names in different databases
– Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
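The correlation analysis mentioned above can be sketched with Pearson's correlation coefficient: a |r| close to 1 between two numeric attributes suggests one is derivable from the other. The attribute names and data below are made-up examples echoing the "annual revenue" case.

```python
# Detecting a derivable (redundant) attribute via Pearson correlation.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

monthly_revenue = [10, 12, 9, 14, 11]
annual_revenue = [12 * m for m in monthly_revenue]  # exactly derivable
print(round(pearson_r(monthly_revenue, annual_revenue), 6))  # 1.0
```

A correlation this high indicates the second attribute carries no extra information and can be dropped before mining; for categorical attributes a chi-square test plays the analogous role.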
