
Data Quality and Data Cleaning:

An Overview
Based on:
• The book
Exploratory Data Mining and Data Cleaning
Dasu and Johnson
(Wiley, 2003)

• SIGMOD 2003 tutorial.


Focus
• What research is relevant to Data Quality?
– DQ is universal and expensive. It is an important problem.
– But the problems are so messy and unstructured that research
seems irrelevant.
• This tutorial will try to structure the problem to make
research directions clearer.
• Overview
– Data quality process
• Where do problems come from
• How can they be resolved
– Disciplines
• Management
• Statistics
• Database
• Metadata
The Meaning of Data Quality (1)
• Generally, you have a problem if the data
doesn’t mean what you think it does, or should
– Data not up to spec : garbage in, glitches, etc.
– You don’t understand the spec : complexity, lack of
metadata.
• Many sources and manifestations
– As we will see.
• Data quality problems are expensive and
general
– DQ problems cost hundreds of billions of dollars each year.
– Resolving data quality problems is often the biggest
effort in a data mining study.
Example
T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000

• Can we interpret the data?
– What do the fields mean?
– What is the key? The measures?
• Data glitches
– Typos, multiple formats, missing / default values
• Metadata and domain expertise
– Field three is Revenue. In dollars or cents?
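
A minimal sketch of how these two records might be loaded and checked programmatically. The field names, the pandas-based approach, and the phone-number pattern are illustrative assumptions, not part of the original example.

# Sketch: load the two pipe-delimited example records and flag obvious glitches.
# Field names are assumed for illustration; the original slide leaves them undefined.
import io
import pandas as pd

raw = "T.Das|97336o8327|24.95|Y|-|0.0|1000\nTed J.|973-360-8779|2000|N|M|NY|1000"
cols = ["name", "phone", "revenue", "flag", "gender", "state", "account"]
df = pd.read_csv(io.StringIO(raw), sep="|", names=cols, dtype=str)

# Typo check: phone numbers should contain only digits and dashes.
bad_phone = ~df["phone"].str.fullmatch(r"[0-9-]+")

# Unit check: is revenue in dollars or cents? 24.95 vs 2000 suggests mixed units.
revenue = pd.to_numeric(df["revenue"], errors="coerce")

print(df[bad_phone])          # row 0: '97336o8327' contains the letter 'o'
print(revenue.describe())     # wide spread hints at a dollars-vs-cents mismatch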
Data Glitches
• Systemic changes to data which are external to
the recorded process.
– Changes in data layout / data types
• Integer becomes string, fields swap positions, etc.
– Changes in scale / format
• Dollars vs. euros
– Temporary reversion to defaults
• Failure of a processing step
– Missing and default values
• Application programs do not handle NULL values well …
– Gaps in time series
• Especially when records represent incremental changes.
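
A hedged sketch of two of these checks, gap detection and a scale-change heuristic. The column names, the sample values, and the factor-of-100 threshold are assumptions made up for illustration.

# Sketch: detect two common glitches in a daily series of payment records.
import pandas as pd

ts = pd.DataFrame({
    "day": pd.to_datetime(["2004-01-01", "2004-01-02", "2004-01-05", "2004-01-06"]),
    "amount": [120.0, 115.0, 11800.0, 123.0],   # Jan 5 looks like cents, not dollars
})

# Gaps in a time series: missing days between consecutive records.
gaps = ts["day"].diff().dt.days.gt(1)
print(ts.loc[gaps, "day"])            # 2004-01-05 follows a multi-day gap

# Scale / format change: values that jump by roughly a factor of 100.
ratio = ts["amount"] / ts["amount"].shift()
print(ts.loc[ratio.abs().gt(50), :])  # flags the suspected cents-valued row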
Conventional Definition of Data Quality
• Accuracy
– The data was recorded correctly.
• Completeness
– All relevant data was recorded.
• Uniqueness
– Entities are recorded once.
• Timeliness
– The data is kept up to date.
• Special problems in federated data: time consistency.
• Consistency
– The data agrees with itself.
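
One way to make these dimensions measurable is to compute crude per-table scores; a minimal sketch, where the customer table, the key column, and the cutoff date are illustrative assumptions.

# Sketch: turn three of the conventional dimensions into simple numbers.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "state": ["NY", "NJ", "NJ", None, "CA"],
    "last_updated": pd.to_datetime(["2004-01-10", "2003-06-01", "2003-06-01",
                                    "2004-02-01", "2002-11-15"]),
})

completeness = 1 - df["state"].isna().mean()                 # share of non-null values
uniqueness   = 1 - df.duplicated("customer_id").mean()       # share of non-duplicate keys
timeliness   = (df["last_updated"] >= "2004-01-01").mean()   # share updated recently

print(f"completeness={completeness:.2f} uniqueness={uniqueness:.2f} timeliness={timeliness:.2f}")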
Finding a modern definition
• We need a definition of data quality which
– Reflects the use of the data
– Leads to improvements in processes
– Is measurable (we can define metrics)

• First, we need a better understanding of how
and where data quality problems occur
– The data quality continuum
The Data Quality Continuum
• Data and information are not static; they flow through a
data collection and usage process
– Data gathering
– Data delivery
– Data storage
– Data integration
– Data retrieval
– Data mining/analysis
Data Gathering
• How does the data enter the system?
• Sources of problems:
– Manual entry
– No uniform standards for content and formats
– Parallel data entry (duplicates)
– Approximations
– Measurement errors.
Solutions
• Potential Solutions:
– Preemptive:
• Process architecture (build in integrity checks)
• Process management (reward accurate data entry)
– Retrospective:
• Cleaning focus (duplicate removal, merge, name &
address matching, field value standardization)
• Diagnostic focus (automated detection of
glitches).
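
A minimal sketch of the retrospective cleaning focus: field value standardization plus crude name matching. The lookup table, the use of difflib, and the 0.85 similarity cutoff are illustrative assumptions, not part of the tutorial.

# Sketch: standardize a free-form field and catch near-duplicate names.
import difflib
import pandas as pd

df = pd.DataFrame({
    "name":  ["T. Das", "Ted Johnson", "Ted  Johnson ", "T Das"],
    "state": ["ny", "NY", "n.y.", "New York"],
})

# Field value standardization: map free-form state spellings to one code.
state_map = {"ny": "NY", "n.y.": "NY", "new york": "NY"}
df["state_std"] = df["state"].str.strip().str.lower().map(state_map).fillna(df["state"])

# Crude name matching: normalize whitespace, then compare string similarity.
names = df["name"].str.split().str.join(" ").tolist()
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if difflib.SequenceMatcher(None, a, b).ratio() > 0.85:
            print("possible duplicate:", a, "<->", b)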
Data Delivery
• Destroying or damaging information by
inappropriate pre-processing
– Inappropriate aggregation
– Nulls converted to default values
• Loss of data:
– Buffer overflows
– Transmission problems
– No checks
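
One simple remedy for the "no checks" problem is to reconcile record counts and a content hash on both sides of a transfer; a minimal sketch, with file names and contents made up for illustration.

# Sketch: compare record counts and a checksum before and after delivery.
import hashlib

def summarize(path):
    """Return (record_count, sha256_of_contents) for a delivered file."""
    h = hashlib.sha256()
    count = 0
    with open(path, "rb") as f:
        for line in f:
            h.update(line)
            count += 1
    return count, h.hexdigest()

# Simulate a feed that loses its last record in transit (e.g. a buffer overflow).
with open("feed_sent.dat", "w") as f:
    f.write("1|24.95\n2|19.00\n3|31.50\n")
with open("feed_received.dat", "w") as f:
    f.write("1|24.95\n2|19.00\n")

if summarize("feed_sent.dat") != summarize("feed_received.dat"):
    print("delivery problem: counts or checksums disagree")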
The Meaning of Data Quality (2)
• There are many types of data, which have
different uses and typical quality problems
– Federated data
– High dimensional data
– Descriptive data
– Longitudinal data
– Streaming data
– Web (scraped) data
– Numeric vs. categorical vs. text data
Meaning of Data Quality (2)
• There are many uses of data
– Operations
– Aggregate analysis
– Customer relations …
• Data Interpretation : the data is useless if we
don’t know all of the rules behind the data.
• Data Suitability : Can you get the answer from
the available data?
– Use of proxy data
– Relevant data is missing
Data Quality Constraints
• Many data quality problems can be captured by
static constraints based on the schema.
– Nulls not allowed, field domains, foreign key
constraints, etc.
• Many others are due to problems in workflow,
and can be captured by dynamic constraints
– E.g., orders above $200 are processed by Biller 2
• The constraints follow an 80-20 rule
– A few constraints capture most cases, thousands of
constraints to capture the last few cases.
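
A minimal sketch of both kinds of constraints expressed as checks over a table; the order table and the way the Biller 2 rule is encoded are assumptions for illustration.

# Sketch: one static and one dynamic constraint as table checks.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount":   [150.0, 250.0, None, 900.0],
    "biller":   ["B1", "B2", "B1", "B1"],
})

# Static, schema-level constraint: amount must be non-null and non-negative.
static_violations = orders[orders["amount"].isna() | (orders["amount"] < 0)]

# Dynamic, workflow-level constraint: orders above $200 are processed by Biller 2.
dynamic_violations = orders[(orders["amount"] > 200) & (orders["biller"] != "B2")]

print(static_violations)   # order 3: null amount
print(dynamic_violations)  # order 4: $900 routed to Biller 1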
Data Quality Metrics
Missing Data
• Missing data - values, attributes, entire
records, entire sections
• Missing values and defaults are
indistinguishable
• Truncation/censoring - the analyst is often not
aware of it, and the mechanisms are not known
• Problem: Misleading results, bias.
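
A small illustration of why indistinguishable defaults mislead results; the sentinel value 0 and the usage figures are made-up assumptions.

# Sketch: default values standing in for "missing" silently bias a summary.
import pandas as pd

usage = pd.Series([12.0, 0.0, 7.5, 0.0, 9.0], name="minutes")

print(usage.mean())                             # 5.7 -- treats the defaults as real zeros
print(usage.replace(0.0, float("nan")).mean())  # 9.5 -- if 0 really meant "not recorded"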
Censoring and Truncation
• Well studied in Biostatistics, relevant to
time dependent data e.g. duration
• Censored - measurement is bounded but
not precise, e.g. call durations > 20 minutes
are recorded as 20
• Truncated - data point dropped if it
exceeds or falls below a certain bound, e.g.
customers with less than 2 minutes of
calling per month are dropped
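
A small worked example of the bias censoring introduces, using the slide's cap of 20; the duration values themselves are made up for illustration.

# Sketch: censoring call durations at 20 pulls the mean down.
true_durations     = [3, 8, 14, 22, 35, 41]                 # what actually happened
recorded_durations = [min(d, 20) for d in true_durations]   # calls > 20 recorded as 20

print(sum(true_durations) / len(true_durations))          # 20.5
print(sum(recorded_durations) / len(recorded_durations))  # ~14.2, biased low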
Censored time intervals
Suspicious Data
• Consider the data points
3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92
• “92” is suspicious - an outlier
• Outliers are potentially legitimate
• Often, they are data or model glitches
• Or, they could be a data miner’s dream,
e.g. highly profitable customers
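
One simple way to flag the 92 is an interquartile-range fence; a minimal sketch, where the conventional 1.5×IQR rule is an illustrative choice rather than something the slide prescribes.

# Sketch: flag the suspicious value with an IQR fence.
import numpy as np

x = np.array([3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92])
q1, q3 = np.percentile(x, [25, 75])          # 4.0 and 7.5
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # -1.25 and 12.75

print(x[(x < low) | (x > high)])             # [92] -- suspicious, but possibly legitimate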
Outliers
• Outlier – “departure from the expected”
• Types of outliers – defining “expected”
• Many approaches
– Error bounds, tolerance limits – control charts
– Model based – regression depth, analysis of
residuals
– Geometric
– Distributional
– Time Series outliers
Database Profiling
• Systematically collect summaries of the data in the
database
– Number of rows in each table
– Number of unique, null values of each field
– Skewness of distribution of field values
– Data type, length of the field
• Use free-text field extraction to guess field types (address, name,
zip code, etc.)
– Functional dependencies, keys
– Join paths
• Does the database contain what you think it contains?
– Usually not.
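
A minimal per-column profiling sketch in the spirit of this list; the example table is made up, and a real profiler would also look for keys, functional dependencies, and join paths.

# Sketch: collect simple per-field summaries for one table.
import pandas as pd

df = pd.DataFrame({
    "cust_id": [1, 2, 3, 4, 4],
    "name":    ["T. Das", "Ted J.", None, "A. Roy", "A. Roy"],
    "revenue": [24.95, 2000.0, 19.0, 31.5, 31.5],
})

profile = pd.DataFrame({
    "dtype":    df.dtypes.astype(str),
    "n_rows":   len(df),
    "n_unique": df.nunique(),
    "n_null":   df.isna().sum(),
    "skewness": df.skew(numeric_only=True),   # only defined for numeric fields
})
print(profile)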
References
• Metadata
– “A Metadata Resource to Promote Data Integration”, L. Seligman,
A. Rosenthal, IEEE Metadata Workshop, 1996
– “Using Semantic Values to Facilitate Interoperability Among
Heterogeneous Information Sources”, E. Sciore, M. Siegel, A.
Rosenthal, ACM Trans. on Database Systems 19(2), 254-290, 1994
– “XML Data: From Research to Standards”, D. Florescu, J.
Simeon, VLDB 2000 Tutorial,
http://www-db.research.bell-labs.com/user/simeon/vldb2000.ppt
– “XML’s Impact on Databases and Data Sharing”, A. Rosenthal,
IEEE Computer, 59-67, 2000
– “Lineage Tracing for General Data Warehouse Transformations”,
Y. Cui, J. Widom, Proc. VLDB Conf., 471-480, 2001
