The What, Why and How of Data Quality
As data is becoming a core part of every business operation, the quality of the data that is gathered, stored and consumed during business processes will determine the success achieved in doing business today and tomorrow.
This article covers the following topics about Data Quality (see Figure 1).
In research commissioned by Experian Data Quality in 2013, the top reason for data inaccuracy was found to be human error, with 59% of cases assessed to be stemming from that cause. Avoiding, or eventually correcting, low quality data caused by human errors requires a comprehensive effort with the right mix of remedies spanning people, processes and technology.

Other top reasons for data inaccuracy found in that research include a lack of communication between departments.

Data quality resembles human health. Accurately testing how any one element of our diet and exercise may affect our health is fiendishly difficult. In the same way, accurately testing how any one element of our data may affect our business is fiendishly difficult too.

Nevertheless, numerous experiences tell us that bad data quality is not very healthy for business.

The classic examples are:
• In financial reporting you get different answers for the same question. This is due to inconsistent data, varying freshness of data and unclear data definitions.

On a corporate level, data quality issues have a drastic impact on meeting core business objectives, such as:

• Inability to react in a timely manner to new market opportunities, thus hindering profit and growth. Often this is due to not being ready to repurpose existing data that were only fit for yesterday's requirements.

• Obstacles in implementing cost reduction programs, as the data that must support the ongoing business processes needs too much manual inspection and correction. Automation will only work on complete and consistent data.

• Shortcomings in meeting increasing compliance requirements. These requirements span from privacy and data protection regulations such as GDPR, through health and safety requirements in various industries, to financial restrictions, requirements and guidelines. Better data quality is most often a must in order to meet those compliance objectives.

• Difficulties in exploiting predictive analysis on corporate data assets, resulting in more risk than necessary when making both short-term and long-term decisions. These challenges stem from issues around duplication of data, data incompleteness, data inconsistency and data inaccuracy.
HOW TO IMPROVE

When improving data quality, the aim will typically be to improve within the relevant data quality dimensions.

Uniqueness is the most addressed data quality dimension when it comes to customer master data. Customer master data are often marred by duplicates, meaning two or more database rows describing the same real-world entity, as in the small illustration below.

[…] whether being a postal address and/or a […]

What is relevant to know about your customers and what is relevant to tell about your products are essential questions in the intersection of the customer and product master data domains.
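As a minimal, made-up illustration of such duplicates (the records, field names and values below are assumptions for the sake of the example), two database rows can describe the same person while hardly any raw value matches exactly:

```python
# Hypothetical duplicate customer records: both rows describe the same
# real-world person, but the raw field values differ.
duplicate_rows = [
    {"customer_id": "C-1001", "name": "Robert Smith",
     "address": "12 Main Street", "city": "Springfield"},
    {"customer_id": "C-2047", "name": "Bob Smith",
     "address": "12 Main St.", "city": "Springfield"},
]

# A naive equality check on name and address reports no duplicate,
# which is exactly why data matching (covered later) is needed.
is_exact_duplicate = (duplicate_rows[0]["name"] == duplicate_rows[1]["name"]
                      and duplicate_rows[0]["address"] == duplicate_rows[1]["address"])
print(is_exact_duplicate)  # False, even though both rows refer to the same customer
```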
Conformity of product data is related to locations. Take unit measurement: in the United States the length of a small thing will be in inches; in most of the rest of the world it will be in centimetres; in the UK you will never know.

Timeliness, meaning whether the data is available at the time it is needed, is the everlasting data quality dimension everywhere.

Other data quality dimensions to measure and improve are data accuracy, being about real-world alignment or alignment with a verifiable source; data validity, being about whether data is within the specified business requirements; and data integrity, being about whether the relations between entities and attributes are technically consistent.

The data quality KPIs will typically be measured on the core business data assets within data quality dimensions such as data uniqueness, data completeness, data consistency, data conformity, data precision, data relevance, data timeliness, data accuracy, data validity and data integrity. A minimal sketch of measuring two of these dimensions follows after the list of remedies below.

The data quality KPIs must relate to the KPIs used to measure the business performance in general.

The remedies used to prevent data quality issues and for eventual data cleansing include these disciplines:

• Data Governance

• Data Profiling

• Data Matching
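As promised above, here is a minimal sketch of measuring two data quality dimensions as KPIs. The sample table, column names and the use of pandas are illustrative assumptions, not something prescribed by this article.

```python
# Minimal sketch: measuring completeness and uniqueness as data quality KPIs
# on a small, made-up customer table (pandas assumed to be available).
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Robert Smith", "Robert Smith", "Acme Ltd", None],
    "postal_code": ["10001", "10001", None, "30303"],
})

# Completeness KPI: share of non-missing values per column.
completeness = customers.notna().mean()

# Uniqueness KPI: share of rows that are not exact duplicates on selected columns.
uniqueness = 1 - customers.duplicated(subset=["name", "postal_code"]).mean()

print(completeness.round(2))
print(f"uniqueness: {uniqueness:.2f}")
```

In practice such ratios would be tracked per core business data asset and reported against the targets agreed within the data governance framework.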
Data Governance
Data Profiling

It is essential that the people who are appointed to be responsible for data quality, and those who are tasked with preventing data quality issues and with data cleansing, have a deep understanding of the data at hand.

Data profiling is a method, often supported by dedicated technology, used to understand the data assets involved in data quality management. These data assets have most often been populated over the years by different people operating under varying business rules, and gathered for bespoke business objectives.

In data profiling, the frequency and distribution of data values are counted on the relevant structural levels. Data profiling can also be used to discover the keys that relate data entities across different databases and, to the degree that this is not already done, within the single databases.

Data profiling can be used to directly measure data integrity and can be used as input to set up the measurement of other data quality dimensions.
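A minimal profiling sketch along the lines described above, assuming pandas and a made-up order table, simply counts distinct, missing and most frequent values per column:

```python
# Minimal data profiling sketch: frequency and distribution of values
# per column on an illustrative, made-up table (pandas assumed).
import pandas as pd

orders = pd.DataFrame({
    "country": ["US", "US", "DK", "uk", None, "US"],
    "unit":    ["inch", "inch", "cm", "cm", "cm", None],
})

for column in orders.columns:
    print(f"--- {column} ---")
    print("distinct values:", orders[column].nunique(dropna=True))
    print("missing values: ", orders[column].isna().sum())
    # Frequency distribution, including missing values; inconsistent casing
    # such as "uk" versus "US" shows up immediately in the counts.
    print(orders[column].value_counts(dropna=False))
```

Such counts are a natural starting point for the data quality KPI measurements discussed earlier.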
Data Matching

When it comes to real-world alignment, using exact keys in databases is not enough.

The classic example is how we spell the name of a person differently due to misunderstandings, typos, the use of nicknames and more. With company names the issues just pile up, with funny mnemonics and the inclusion of legal forms. When we place these persons and organizations at locations using a postal address, the ways of writing that have numerous outcomes too.

Data matching is a technology, based on match codes (as for example soundex), fuzzy logic and increasingly also machine learning, used to determine if two or more data records are describing the same real-world entity (typically a person, a household or an organization).

This method can be used in deduplicating a single database and in finding matching entities across several data sources.

Often data matching is based on data parsing, where names, addresses and other data elements are split into discrete data elements; for example, an envelope-type address is split into building name, unit, house number, street, postal code, city, state/province and country. This may be supplemented by data standardization, for example using the same value for street, str and st.
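The following is a minimal sketch of these ideas, not a production matching engine: the sample records, the simplified soundex-style match code, the street-abbreviation standardization and the similarity threshold are all assumptions made for illustration.

```python
# Minimal data matching sketch: standardize street abbreviations, build a
# simplified soundex-style match code for names, and score address similarity
# with fuzzy string comparison from the standard library.
from difflib import SequenceMatcher

STREET_SYNONYMS = {"street": "st", "str": "st"}

def standardize_address(address: str) -> str:
    # Data standardization: map variants such as "street", "str" and "st"
    # to one standard value.
    return " ".join(STREET_SYNONYMS.get(token, token)
                    for token in address.lower().replace(".", "").split())

def match_code(name: str) -> str:
    # Simplified soundex-style code: first letter plus consonant classes.
    classes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
               **dict.fromkeys("dt", "3"), "l": "4",
               **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [c for c in name.lower() if c.isalpha()]
    code = letters[0].upper() if letters else ""
    for c in letters[1:]:
        digit = classes.get(c, "")
        if digit and (len(code) == 1 or code[-1] != digit):
            code += digit
    return (code + "000")[:4]

record_a = {"name": "Robert Smith", "address": "12 Main Street"}
record_b = {"name": "Robert Smyth", "address": "12 Main Str."}

same_name_code = match_code(record_a["name"]) == match_code(record_b["name"])
address_similarity = SequenceMatcher(
    None,
    standardize_address(record_a["address"]),
    standardize_address(record_b["address"]),
).ratio()

# Combine the signals with an assumed threshold to flag a probable duplicate.
probable_duplicate = same_name_code and address_similarity > 0.8
print(same_name_code, round(address_similarity, 2), probable_duplicate)
```

Real matching engines add many more signals, for example nickname tables, parsed address components and machine-learned similarity models, but the combination of standardization, match codes and fuzzy comparison shown here is the common core.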
Data Quality Reporting

The findings from data profiling can be used as input to measure data quality KPIs based on the data quality dimensions relevant to a given organization. The findings from data matching are especially useful for measuring data uniqueness.

In addition to that, it is helpful to operate a data quality issue log, where known data quality issues are documented and the preventive and data cleansing activities are followed up.

Organizations focussing on data quality find it useful to operate a data quality dashboard highlighting the data quality KPIs and the trend in their measurements, as well as the trend in issues going through the data quality issue log.
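A minimal sketch of what an entry in such an issue log could look like follows; the field names echo the issue log description given in the best practices later in this article, while the concrete class, types and sample values are assumptions.

```python
# Minimal sketch of a data quality issue log entry; the fields follow the
# issue log description in this article (assigned data owner, involved data
# stewards, impact, resolution and timing), while values are made up.
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class DataQualityIssue:
    issue_id: str
    description: str
    data_owner: str
    data_stewards: List[str]
    impact: str
    resolution: str = ""
    due_date: Optional[date] = None
    status: str = "open"

issue_log = [
    DataQualityIssue(
        issue_id="DQ-0042",
        description="Duplicate customer records between CRM and self-service registration",
        data_owner="Head of Sales Operations",
        data_stewards=["CRM data steward"],
        impact="Inflated customer counts in reporting",
        due_date=date(2024, 6, 30),
    ),
]

# Following up on preventive and data cleansing activities: list open issues.
print([issue.issue_id for issue in issue_log if issue.status == "open"])
```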
Master Data Management (MDM)

Master Data Management and Data Quality Management (DQM) are tightly coupled disciplines. MDM and DQM will be a part of the same data governance framework and share the same roles, such as data owners, data stewards and data custodians. Data profiling activities will most often be done with master data assets. When doing data matching, the results must be kept in master data assets controlling the merged and purged records and the survivorship of data attributes relating to those records.
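As a minimal sketch of the survivorship aspect mentioned above (the records, attributes and the "most recent non-empty value wins" rule are assumptions; real MDM hubs support several survivorship rules per attribute), merging two matched customer records into one golden record could look like this:

```python
# Minimal survivorship sketch: build a golden record from matched source
# records by letting the most recently updated non-empty value win per
# attribute. Sources, fields and the rule itself are illustrative assumptions.
matched_records = [
    {"source": "CRM", "updated": "2023-11-02",
     "name": "Robert Smith", "email": "", "phone": "+1 678 555 0100"},
    {"source": "Self-service portal", "updated": "2024-02-17",
     "name": "Bob Smith", "email": "[email protected]", "phone": ""},
]

def survive(records, attributes):
    golden = {}
    for attribute in attributes:
        candidates = [r for r in records if r.get(attribute)]
        if candidates:
            # ISO-formatted dates compare correctly as strings.
            golden[attribute] = max(candidates, key=lambda r: r["updated"])[attribute]
        else:
            golden[attribute] = ""
    return golden

print(survive(matched_records, ["name", "email", "phone"]))
# {'name': 'Bob Smith', 'email': '[email protected]', 'phone': '+1 678 555 0100'}
```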
Customer Data Integration (CDI)

Not least, customer master data are in many organizations sourced from a range of applications. These include self-service registration sites and Customer Relationship Management (CRM) applications.
Product Information Management (PIM)
Data Quality Best Practices

In the following we will, based on the reasoning provided above in this article, list a collection of 10 highly important data quality best practices. These are:

1. Ensure top-level management involvement. Quite a lot of data quality issues can only be solved by having a cross-departmental view.

2. Manage data quality activities as a part of a data governance framework. This framework should set the data policies and data standards, define the roles needed and provide a business glossary.

3. Occupy roles as data owners and data stewards from the business side of the organization, and occupy data custodian roles from business or IT where it makes most sense.

4. Use a business glossary as the foundation for metadata management. Metadata is data about data, and metadata management must be used to have common data definitions and link those to current and future business applications.

5. Operate a data quality issue log with an entry for each issue, with information about the assigned data owner and the involved data steward(s), the impact of the issue, the resolution and the timing of the necessary proceedings.
6. For each data quality issue raised, start with a root cause analysis. The data quality problems will only go away if the solution addresses the root cause.
Data Quality Resources

There are many resources out there where you can learn more about data quality. Please find below a list of some of the resources that may be very useful when framing a data quality strategy and addressing specific data quality issues:

• Larry P. English is the father of data and information quality management. His thoughts are still available here: https://www.information-management.com/author/larry-english-im30029

• Thomas C. Redman, aka the Data Doc, writes about data quality and data in general on Harvard Business Review. His articles are found here: https://hbr.org/search?term=thomas%20c.%20redman

• David Loshin has written a book with the title The Practitioner's Guide to Data Quality Improvement: http://dataqualitybook.com/?page_id=2

• Gartner, the analyst firm, has a glossary with definitions of data quality terms here: https://www.gartner.com/it-glossary/?s=data+quality

• Massachusetts Institute of Technology (MIT) has a Total Data Quality Management (TDQM) program: http://web.mit.edu/tdqm/www/index.shtml

• Knowledgent, a part of Accenture, provides a white paper on Data Quality Management here: https://knowledgent.com/whitepaper/building-successful-data-quality-management-program/

• Deloitte has published a case study called Data quality driven, customer insights enabled: https://www2.deloitte.com/us/en/pages/deloitte-analytics/articles/data-quality-driven-customer-insights-enabled.html

• The University of Leipzig has a page on data matching in big data environments (they call it Dedoop): https://dbs.uni-leipzig.de/dedoop

• A Toolbox article by Steve Jones goes through how to achieve quality data in a big data context: https://it.toolbox.com/blogs/stevejones/how-to-achieve-quality-data-111618

• Data Quality Pro is a site, managed by Dylan Jones, with a lot of information about data quality: https://www.dataqualitypro.com/

• Obsessive-Compulsive Data Quality (OCDQ) by Jim Harris is an inspiring blog about data quality and its related disciplines: http://www.ocdqblog.com/

• Nicola Askham runs a blog about data governance: https://www.nicolaaskham.com/blog One of the posts in this blog is about what to include in a data quality issue log: https://www.nicolaaskham.com/blog/2018-21-02what-do-you-include-in-data-quality-issue-log

• Henrik Liliendahl has a long-running blog with over 1,000 posts about data quality and Master Data Management: https://liliendahl.com/

• A blog called Viqtor Davis Data Craftmanship provides some useful insights on data management: https://www.viqtordavis.com/blog/
Fast Track Data Management
Profisee Headquarters
+1 678 202 8990
[email protected]
www.profisee.com