This document discusses data quality and data cleaning. It provides an overview of where data quality problems come from as data moves through the stages of data gathering, delivery, storage, integration, retrieval, and analysis. Common data quality issues that can occur include missing or incorrect data, data glitches, and lack of metadata and standards. The document also discusses how data quality can be measured and improved through constraints, profiling, and outlier detection.
Data Quality and Data Cleaning: An Overview

Based on:
• Recent book Exploratory Data Mining and Data Cleaning, Dasu and Johnson (Wiley, 2003)
• SIGMOD 2003 tutorial.
Focus
• What research is relevant to Data Quality?
  – DQ is universal and expensive. It is an important problem.
  – But the problems are so messy and unstructured that research seems irrelevant.
• This tutorial will try to structure the problem to make research directions clearer.
• Overview
  – Data quality process
    • Where do problems come from
    • How can they be resolved
  – Disciplines
    • Management
    • Statistics
    • Database
    • Metadata

The Meaning of Data Quality (1)
• Generally, you have a problem if the data doesn’t mean what you think it does, or should
  – Data not up to spec: garbage in, glitches, etc.
  – You don’t understand the spec: complexity, lack of metadata.
• Many sources and manifestations
  – As we will see.
• Data quality problems are expensive and general
  – DQ problems cost hundreds of billions of dollars each year.
  – Resolving data quality problems is often the biggest effort in a data mining study.
Example
T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000
• Can we interpret the data?
  – What do the fields mean?
  – What is the key? The measures?
• Data glitches
  – Typos, multiple formats, missing / default values
• Metadata and domain expertise
  – Field three is Revenue. In dollars or cents?
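As an illustration of these interpretation questions, here is a minimal sketch of checks one might run on the two example records; the field names in the guessed schema are assumptions, not part of the original data.

import re

# Two pipe-delimited records from the example; field meanings are only partly known.
records = [
    "T.Das|97336o8327|24.95|Y|-|0.0|1000",
    "Ted J.|973-360-8779|2000|N|M|NY|1000",
]

# Assumed schema: the slide only tells us field three is Revenue; the rest are unknown.
fields = ["name", "phone", "revenue", "f4", "f5", "f6", "f7"]

for line in records:
    row = dict(zip(fields, line.split("|")))
    # Glitch check: phone numbers should contain only digits and dashes.
    if not re.fullmatch(r"[0-9-]+", row["phone"]):
        print(f"suspect phone {row['phone']!r} in record for {row['name']!r}")
    # Revenue scale is ambiguous (24.95 vs. 2000): is it dollars or cents?
    print(row["name"], "revenue:", row["revenue"])

Even a check this crude surfaces the letter "o" hiding in the first phone number; deciding whether Revenue is in dollars or cents still requires metadata.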
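To make the "measurable" requirement concrete, the sketch below computes a few toy metrics (completeness, key uniqueness, domain consistency) over a small in-memory table; the rows, field names, and allowed domain are invented for illustration and are not part of the tutorial.

# Toy metrics over a list-of-dicts "table"; rows and field names are hypothetical.
rows = [
    {"id": 1, "state": "NY", "amount": 24.95},
    {"id": 2, "state": None, "amount": 2000.0},
    {"id": 2, "state": "nj", "amount": -5.0},
]

def completeness(rows, field):
    # Fraction of rows where the field is present and not NULL.
    return sum(r.get(field) is not None for r in rows) / len(rows)

def uniqueness(rows, key):
    # Fraction of rows whose key value appears exactly once.
    values = [r[key] for r in rows]
    return sum(values.count(v) == 1 for v in values) / len(rows)

def domain_consistency(rows, field, allowed):
    # Fraction of non-null values that fall in the expected domain.
    vals = [r[field] for r in rows if r[field] is not None]
    return sum(v in allowed for v in vals) / len(vals) if vals else 1.0

print("completeness(state):", completeness(rows, "state"))                       # 0.67
print("uniqueness(id):", uniqueness(rows, "id"))                                 # 0.33
print("domain(state):", domain_consistency(rows, "state", {"NY", "NJ"}))         # 0.50

Metrics like these are crude, but they satisfy the slide's requirement: they are measurable, repeatable, and can be tracked as the underlying processes improve.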
The Data Quality Continuum
• Data and information are not static; they flow in a data collection and usage process
  – Data gathering
  – Data delivery
  – Data storage
  – Data integration
  – Data retrieval
  – Data mining/analysis

Data Gathering
• How does the data enter the system?
• Sources of problems:
  – Manual entry
  – No uniform standards for content and formats
  – Parallel data entry (duplicates)
  – Approximations
  – Measurement errors

Solutions
• Potential solutions:
  – Preemptive:
    • Process architecture (build in integrity checks)
    • Process management (reward accurate data entry)
  – Retrospective:
    • Cleaning focus (duplicate removal, merge, name & address matching, field value standardization; see the sketch below)
    • Diagnostic focus (automated detection of glitches)
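A minimal sketch of the retrospective cleaning idea, combining field value standardization with duplicate detection on the standardized key; the records and normalization rules here are hypothetical, not taken from the tutorial.

from collections import defaultdict

# Hypothetical raw entries: the same customer keyed in twice with format differences.
raw = [
    {"name": "T. Das", "phone": "973-360-8327"},
    {"name": "t das",  "phone": "(973) 360 8327"},
    {"name": "Ted J.", "phone": "973-360-8779"},
]

def standardize(rec):
    # Field value standardization: case-fold the name, keep only digits in the phone.
    name = " ".join(rec["name"].replace(".", " ").lower().split())
    phone = "".join(ch for ch in rec["phone"] if ch.isdigit())
    return name, phone

# Group by standardized phone number; groups larger than one are duplicate candidates.
groups = defaultdict(list)
for rec in raw:
    name, phone = standardize(rec)
    groups[phone].append(name)

for phone, names in groups.items():
    if len(names) > 1:
        print("possible duplicates for", phone, ":", names)

Real name and address matching needs fuzzier comparison than exact keys, but the standardize-then-group pattern is the core of most retrospective dedup pipelines.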
Data Delivery
• Destroying or damaging information by inappropriate pre-processing
  – Inappropriate aggregation
  – Nulls converted to default values
• Loss of data:
  – Buffer overflows
  – Transmission problems
  – No checks

The Meaning of Data Quality (2)
• There are many types of data, which have different uses and typical quality problems
  – Federated data
  – High dimensional data
  – Descriptive data
  – Longitudinal data
  – Streaming data
  – Web (scraped) data
  – Numeric vs. categorical vs. text data
• There are many uses of data
  – Operations
  – Aggregate analysis
  – Customer relations …
• Data interpretation: the data is useless if we don’t know all of the rules behind the data.
• Data suitability: can you get the answer from the available data?
  – Use of proxy data
  – Relevant data is missing

Data Quality Constraints
• Many data quality problems can be captured by static constraints based on the schema.
  – Nulls not allowed, field domains, foreign key constraints, etc.
• Many others are due to problems in workflow, and can be captured by dynamic constraints
  – E.g., orders above $200 are processed by Biller 2
• The constraints follow an 80-20 rule
  – A few constraints capture most cases; thousands of constraints are needed to capture the last few cases.

Data Quality Metrics: Missing Data
• Missing data: values, attributes, entire records, entire sections
• Missing values and defaults are indistinguishable
• Truncation/censoring: often goes unrecognized, and the mechanisms are not known
• Problem: misleading results, bias

Censoring and Truncation
• Well studied in biostatistics; relevant to time-dependent data, e.g. durations
• Censored: measurement is bounded but not precise, e.g. call durations over 20 are recorded as 20
• Truncated: data point dropped if it exceeds or falls below a certain bound, e.g. customers with less than 2 minutes of calling per month
[Figure: censored time intervals]

Suspicious Data
• Consider the data points 3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92
• "92" is suspicious: an outlier
• Outliers are potentially legitimate
• Often, they are data or model glitches
• Or, they could be a data miner’s dream, e.g. highly profitable customers
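A small sketch of one simple error-bounds style check on the series above, using a modified z-score based on the median and median absolute deviation; the 3.5 cutoff is a common rule of thumb, not something this tutorial prescribes.

import statistics

data = [3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92]

med = statistics.median(data)                        # 6
mad = statistics.median(abs(x - med) for x in data)  # median absolute deviation, 2

# Modified z-score (Iglewicz & Hoaglin); values above ~3.5 are flagged as outliers.
for x in data:
    score = 0.6745 * (x - med) / mad if mad else 0.0
    if abs(score) > 3.5:
        print(f"{x} looks like an outlier (modified z-score {score:.1f})")

Only 92 is flagged. Using the median rather than the mean matters here: a single extreme value inflates the mean and standard deviation enough to mask itself.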
Outliers
• Outlier: "departure from the expected"
• Types of outliers: defining "expected"
• Many approaches
  – Error bounds, tolerance limits: control charts
  – Model based: regression depth, analysis of residuals
  – Geometric
  – Distributional
  – Time series outliers

Database Profiling
• Systematically collect summaries of the data in the database
  – Number of rows in each table
  – Number of unique and null values of each field
  – Skewness of the distribution of field values
  – Data type, length of the field
    • Use free-text field extraction to guess field types (address, name, zip code, etc.)
  – Functional dependencies, keys
  – Join paths
• Does the database contain what you think it contains?
  – Usually not.

References
• Metadata
  – "A Metadata Resource to Promote Data Integration", L. Seligman, A. Rosenthal, IEEE Metadata Workshop, 1996.
  – "Using Semantic Values to Facilitate Interoperability Among Heterogeneous Information Sources", E. Sciore, M. Siegel, A. Rosenthal, ACM Transactions on Database Systems 19(2), 1994.
  – "XML Data: From Research to Standards", D. Florescu, J. Simeon, VLDB 2000 Tutorial, http://www-db.research.bell-labs.com/user/simeon/vldb2000.ppt
  – "XML’s Impact on Databases and Data Sharing", A. Rosenthal, IEEE Computer, 59-67, 2000.
  – "Lineage Tracing for General Data Warehouse Transformations", Y. Cui, J. Widom, Proc. VLDB Conf., 471-480, 2001.