Talend_DefinitiveGuide_DataGovernance
Guide to Data
Governance
Contents
Introduction: Why trusted data is the key to digital transformation
Chapter 4: Dos & don’ts: the 12 labors of the data governance hero
Please don’t blame the librarians; they also need to deal with CDs and DVDs, new digital
formats to classify, and a growing queue of visitors to manage (as well as online visitors
clamoring for additional references).

You might think about ways to make things better organized so that people can find their
books quicker. But nobody asked you for help; you were just here as a reader. Besides, the
overall integrity of this library does not encourage you to trust it. The poor conditions,
low-quality books, and your precious time wasted leave you with a negative perception of the
library; it’s certainly not a trustworthy institution you would recommend to others.

Does this sound like a discouraging and frustrating experience? Your data community may
share the same feeling when looking for the right data sets in your organization.

What if we could make all this data trustworthy, organize it at scale, and deliver it to
everybody who needs it? What if we could give people the right tools to organize themselves
and work as a team to cleanse, extract hidden value, and then assemble and deliver data
everyone can trust? The ability to do this is the essence of data governance.

“Consolidating our data in a single system made us better placed to have clean data and
also helped with data governance.”
Senior IT Manager, Enterprise Telecommunications Services Company
governance strategy is fundamental for

standard processes and responsibilities. Business drivers highlight what data
needs to be carefully controlled in your data governance strategy and the
benefits expected from this effort. This strategy becomes the basis of your data
For example, if a business driver for your data governance strategy is to ensure
the privacy of health care-related data, patient data will need to be securely
managed as it flows through your business. Retention requirements (e.g.,
history of who changed what information and when) will need to be defined
to ensure compliance with relevant government requirements, such as the
GDPR and the CCPA.
Data governance ensures that roles related to data are clearly defined and that
responsibility and accountability are agreed upon across the enterprise. A well-
planned data governance framework covers strategic, tactical, and operational
roles and responsibilities.
An effective data governance strategy provides many crucial benefits to your organization
that would be hard to live without.
• Improved quality of data: Data governance creates a plan that ensures data
accuracy, completeness, and consistency.
• A data map: Data governance provides an advanced ability to understand
the location of all data related to critical entities, which is necessary for data
integration. Like a GPS that can represent a physical landscape and help
people find their way in unknown territory, data governance makes data
assets usable and easier to connect with business outcomes.
• A 360-degree view of each customer and other business entities:
Data governance establishes a framework so an organization can agree on
“a single version of the truth” for critical business entities. The organization
can then create an appropriate level of consistency across entities and
business activities.
• Improve the quality of your data with validation, data cleansing, and
data enrichment.
• Manage your data with metadata-driven ETL and ELT and data integration
applications so that data pipelines can be tracked and traced with end-to-end
data lineage.
• Control your data with tools that actively review and monitor. Document
your data so that it can be augmented by metadata to increase its relevance,
searchability, accessibility, linkability, and compliance.
• Empower the people who know the data best to contribute to the data
stewardship tasks with self-service tools.
[Figure: Traditional data governance / Authoritative data warehouse]
Your organization is facing the same issue with your data. You might have
the best experts in your central organization, but you do not have enough
resources to bring all this data accurately to everyone who needs it as quickly
as they want it, nor can you address the growing needs of the business users for
new and different types of data.
Ultimately, people will find other ways, such as shadow IT or creating other
bodies, to meet their data needs. The IT teams who can’t evolve from this
centralized model will rapidly lose control, jeopardizing speed, accuracy,
and security.
[Figure: Struggling to control the data sprawl. Source data (open data, Hadoop and NoSQL,
cloud, raw, traditional data, streams, enterprise apps) is ingested, curated, managed, and
consumed by data scientists, business analysts, and operations. Any data; costs, time to
value, scalability, governance, risks; any data worker.]
Data access is tightly controlled. The
encyclopedia model fails to scale in this
Big Data era, when multiple people demand
governance alongside this more agile model, but rather as an afterthought.

Facebook faces this problem. Initially, it just provided a platform without acting as a
content provider. This allowed the company to create self-governed communities on a
platform with no limits on the content it could ingest and the number of users and
communities it could serve.

needed to turn your lake into something that your business can safely leverage might be huge.

Delivering data at the speed of the business is the Holy Grail, but there is no compromise
with data governance.
[Figure: Traditional data governance / Authoritative data warehouse: restricted data access;
governance costs, time to market, scalability; limited user reach]
Establishing collaborative governance in the digital era
What is missed in the second model is the ability to take control of the data as it enters
your systems, rather than after the fact. But at the same time, we need to recognize that
there are more and more incoming data sources, introduced by more and more people from
different parts of the organization. It’s helpful to establish a more collaborative
approach to governance up front so that the most knowledgeable among your business users can

By introducing a Wikipedia-like approach where anyone can potentially collaborate in data
curation as long as standards are followed, organizations can engage the entire business in
contributing to the process of turning raw data into something that is trusted, documented,
and ready to be shared. Businesses can implement a system of trust that scales by
leveraging smart and workflow-driven self-service tools with embedded data quality controls.
[Figure: Struggling to control the data sprawl]
In the past, this activity was performed manually by data experts using traditional data
profiling tools. But this approach doesn’t work anymore, since it requires working with each
dataset individually. The digital era’s data sprawl requires a more automatic and systematic
approach. This is what modern data cataloging tools such as Talend Data Catalog can do. They
help you schedule data discovery processes that then crawl your data lake or other data
landscapes and intelligently examine the underlying data, so that you can understand,
document, and take action based on the content of your datasets.

Data consumers can see what’s in the data before they consume it, by viewing data samples
or getting indications that the data within a column might be, for example, a phone number,
an account number, or an email address.
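To make the idea of automated discovery more concrete, here is a minimal Python sketch of how a profiling pass might guess that a column holds emails, phone numbers, or account numbers. It is an illustration only and does not reflect how Talend Data Catalog works internally; the patterns, the `infer_semantic_type` helper, the 80% threshold, and the sample columns are all assumptions.

```python
import re

# Hypothetical patterns for illustration; real catalogs rely on far richer
# dictionaries and rules than three regular expressions.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone_number": re.compile(r"\+?[\d\s().-]{7,20}"),
    "account_number": re.compile(r"[A-Z]{2}\d{8,}"),
}


def infer_semantic_type(values, threshold=0.8):
    """Return the first pattern name matched by at least `threshold`
    of the non-empty values, or None when nothing fits."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    for name, pattern in PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        if matches / len(non_empty) >= threshold:
            return name
    return None


# Assumed sample of three columns taken from a dataset.
sample = {
    "contact": ["ana@example.com", "bob@example.org", "", "carol@example.net"],
    "mobile": ["+33 1 23 45 67 89", "555-010-0199", "(555) 010-0142"],
    "notes": ["call back Monday", "prefers email", "n/a"],
}
profile = {column: infer_semantic_type(values) for column, values in sample.items()}
print(profile)
```

Run on a schedule across many datasets, a pass like this is what lets a catalog annotate columns without an expert opening each dataset by hand.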
80% of surveyed organizations say they have been affected by GDPR or other data protection
legislation.
» Figure 2: Data profiling for power users with Talend Data Preparation
Empowering power users with self-service profiling
Incorporating appropriate data quality controls in your data chain is vital for the
success of your data governance initiative.
Suppose, for example, that you want to start a campaign to contact customers
for billing and payment and your primary source to contact appropriate people
is email and postal addresses. Having consistent and correct address data is
vital to be able to reach everyone. Otherwise, you may lose lots of revenue
or miss out on opportunities due to missing or inconsistent data.
Data integrity issues have exploded over the last several years. The sources and
volume of data are growing, and so is the number of data professionals who
want to work with it. The proliferation of data across a growing number of clouds
and digital channels, and among an increasing number and variety of people,
puts the enterprise at risk of data leaks, data breaches, and misguided
insights based on rogue and inconsistent data. As an example, 62% of end
users admit they have access to data they should not. Dealing with integrity is
crucial as new data governance regulations are implemented, which may have a
tangible impact on business; for example, the fine for violating the European
Union’s General Data Protection Regulation (GDPR) can be as high as 4% of the
organization’s worldwide annual revenue or 20 million euros, whichever is greater.
Data quality is the process of conditioning data to meet the specific needs of
business users. Accuracy, completeness, consistency, timeliness, uniqueness,
and validity are the chief measures of data quality.
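Some of these measures can be expressed as simple ratios. The sketch below is a minimal Python illustration covering three of them (completeness, uniqueness, and validity) for a single field. The `quality_metrics` helper, the loose email pattern, and the sample records are assumptions made for the example, not part of any Talend product.

```python
import re

# Deliberately loose email pattern, used only for illustration.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def quality_metrics(records, field, is_valid):
    """Compute completeness, uniqueness, and validity ratios for one field.

    completeness: share of records where the field is filled in
    uniqueness:   share of distinct values among the filled-in values
    validity:     share of filled-in values accepted by `is_valid`
    """
    total = len(records)
    present = [r[field] for r in records if r.get(field)]
    completeness = len(present) / total if total else 0.0
    uniqueness = len(set(present)) / len(present) if present else 0.0
    validity = sum(1 for v in present if is_valid(v)) / len(present) if present else 0.0
    return {"completeness": completeness, "uniqueness": uniqueness, "validity": validity}


# Assumed sample data with one duplicate, one invalid, and one missing value.
customers = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": "ana@example.com"},
    {"id": 3, "email": "not-an-email"},
    {"id": 4, "email": None},
]
metrics = quality_metrics(customers, "email", lambda v: bool(EMAIL.fullmatch(v)))
print(metrics)
```

Accuracy, consistency, and timeliness usually need a reference source or timestamps to measure, which is why they are left out of this sketch.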
Data quality is key to us, as well as understanding provenance for data governance purposes.
Executive, health insurance company

So before establishing your single point of trust and putting data at your
disposal, you must be able to apply data quality controls and remediations to
the ingested data sources. To do so, Talend Data Quality generates native code
to run data quality controls at the right place (on premises, inside a Big Data
cluster, or in the cloud) and at the right time (on data at rest or on
streaming data).
You need to profile, cleanse, and standardize your data while monitoring data
quality over time, in any format or size. This is why you need more than isolated
point solutions for data quality: you need a pervasive platform that provides a
wide array of data quality controls not only for cleansing and standardizing data
sources, but also for delegating some tasks to business experts using integrated
self-service tools.
» Figure 4: Talend Data Preparation orchestrates data with integrity across pipelines
Self-service is the way to get data quality standards to scale. When trusted data is not
provided in a self-service way, multiple surveys have shown that business analysts and data
scientists spend 80% of their time cleaning data and getting it ready to use. Reduced time
and effort mean reduced costs; as a result, more value and more insight can be extracted
from data.

At the end of the first of the three-step approach to deliver data you can trust, data
sources have been identified and documented. Actions have been taken for the data sources
that required attention with respect to their data quality.
Top takeaway:
When choosing a data governance platform, consider its ability to delegate data quality operations to
business users in a self-service mode while keeping control. This is critical if you want to scale rapidly and pool your
data cleansing efforts at the speed of the business. It would be risky to do nothing and let people prepare and cleanse
data on their own, spending a considerable amount of time on repetitive tasks on uncontrolled data sources.
Typical questions that might arise include: When an error
is identified in a management report, where did it come from? When did the
error occur? Who is accountable for it? How can you solve it? All of these
questions can be answered by a metadata management solution that
integrates data lineage. Lineage gives you a visual overview of your data
flows so you can easily spot the problem.
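As a rough illustration of how lineage answers the "where did it come from?" question, the following Python sketch walks a small lineage graph upstream from a report. The dataset names and the `upstream` helper are hypothetical; a real metadata management solution would harvest this graph from the pipelines themselves.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it is derived from.
LINEAGE = {
    "management_report": ["sales_mart"],
    "sales_mart": ["crm_clean", "orders_clean"],
    "crm_clean": ["crm_raw"],
    "orders_clean": ["orders_raw"],
}


def upstream(dataset, lineage):
    """Return every upstream dataset of `dataset`, nearest first (breadth-first)."""
    seen, order, queue = {dataset}, [], deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


sources = upstream("management_report", LINEAGE)
print(sources)
```

Checking the datasets in this order, from the report back to the raw sources, narrows down the step at which the error was introduced.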
“Cleansing and consolidating consumer data enables us to deliver the kind of personalized

Now it’s time to encourage people to actually do it. Communication is vital; you may have
decided with your executive sponsor to communicate about your project launch officially.
When doing so, involve your internal
In many cases, data owners realize that they should not manage everything in
their data domains, and thus need to act as orchestrators rather than doers.
The collaborative part of data management here makes a great deal of sense.
You need to engage with — occasionally or regularly — the ones who know
the data best for data certification, arbitration, resolution, or reconciliation.
Using applications such as Talend Data Stewardship, data stewards can design,
orchestrate, and launch “stewardship campaigns” that ask for identified
contributors’ key inputs to enrich your data dynamically.
Through this process, anyone can be promoted at any time to be a data steward
who participates in the data value chain. These data stewards can resolve
and validate inconsistent data in a user-friendly application, which is fully
operationalized by the steward campaign manager.
» Figure 10: Perform data remediation tasks with Talend Data Stewardship
automation to streamline your dataflows. Use machine learning to learn from remediation and
scale faster.
[Figure: Manual duplicate labeling; prediction of potential duplicates]
Machine learning helps to suggest the next best action to apply to the data
pipeline or capture tacit knowledge from the users of the Talend platform (such
as a developer in Talend Studio, or a steward in Talend Data Stewardship) and
run it at scale through automation.
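The general idea of learning from stewards' remediation can be sketched in a few lines of Python. This is a deliberately simple stand-in, not the machine learning used in the Talend platform: it scores record pairs with a string-similarity ratio from the standard library and picks the cut-off that best reproduces the stewards' manual labels. The labeled pairs and helper names are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def learn_threshold(labeled_pairs):
    """Pick the similarity cut-off that agrees with the most manual labels."""
    scored = [(similarity(a, b), is_dup) for a, b, is_dup in labeled_pairs]
    candidates = sorted({score for score, _ in scored})

    def agreement(threshold):
        return sum((score >= threshold) == is_dup for score, is_dup in scored)

    return max(candidates, key=agreement)


# Assumed manual labels captured from data stewards: (value_a, value_b, is_duplicate).
labeled = [
    ("Acme Corp", "ACME Corp.", True),
    ("Jon Smith", "John Smith", True),
    ("Acme Corp", "Globex Inc", False),
    ("Jon Smith", "Mary Jones", False),
]
threshold = learn_threshold(labeled)


def is_potential_duplicate(a, b):
    """Flag a pair for review when its similarity reaches the learned cut-off."""
    return similarity(a, b) >= threshold


predictions = [is_potential_duplicate(a, b) for a, b, _ in labeled]
print(threshold, predictions)
```

Once a cut-off has been learned from a handful of labeled pairs, the same check can be run automatically over the rest of the data, with only borderline pairs routed back to stewards.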
» Figure 13: Smart assistance with machine learning in Talend Data Preparation
GDPR focus:
Article 25 in the GDPR establishes data protection by design and by default, while recital 26 states that the principles
of data protection should apply to any information concerning an identified or identifiable natural person. The principles
of data protection should therefore not apply to anonymous information, namely, information that does not relate to
an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data
subject is not, or is no longer, identifiable.
» Figure 17: Enabling everyone in the business to use trusted data with Talend Management Console
One big mistake would be to forget or ignore the rationale behind data. So don’t just
govern to govern. Whether you need to minimize risks or maximize your benefits, link your
data governance project to clear and measurable outcomes. As data governance is a
nondepartmental but company-wide initiative, you need to prove its value from the start to
convince leaders to prioritize and allocate some resources.

What is your “Emerald City”?
Define your meaning of success

In “The Wonderful Wizard of Oz,” the Emerald City is Dorothy’s destination at the end of
the yellow brick road. Success can take different forms: reinforcing data control,
mitigating risks or data breaches, reducing time spent by business teams, monetizing your
data, or producing new value from your data pipelines. Meeting compliance standards to
avoid penalties is crucial.

Secure your funding

As you’re building the fundamentals of your projects and you’re defining your criteria for
success, explain the why, the what, and the how. Then make sure you don’t forget “how
much.” Identify the associated costs and the resources involved. If you’re a newly assigned
data protection officer, make sure you have a minimum secured operating fund. If you’re a
chief data officer, ally with the chief technology officer to secure your funding together.
Then defend your proposal to your finance team so that they understand how the company’s
risks are linked to failed compliance, and explain the value of your data strategy and all
the hidden potential behind data. Make sure you give them the perspective of data as a
financial asset.

As we know, and it cannot be said often enough, a data journey is not a single project to
be tackled by IT.

Even if you can go fast with tools and take advantage of powerful apps, delivering trusted
data is a team sport. Gather your colleagues from various departments and start a
discussion group around the data challenges they’re facing. Try to identify what kind of
issues they have. Frequent complaints are:
• “I cannot access datasets easily.”
• “I don’t find the right data I am looking for.”
• “Salesforce data is polluted.”
• “How can I make sure it’s trusted?”
• “We spent too much time removing duplicates manually.”

You will soon discover that one of the biggest challenges is to build a data value chain
that various profiles can leverage to get more trustworthy data into the data pipelines.
Work with peers to clarify, document, and see together how to remove these pains. Bring
people along on your data journey and give them some responsibilities so your project won’t
be your project but a team project. Show that the entire success will not be for you but
for all team members.

Bring people on your data journey and give them some responsibilities so your solo project
becomes a team project.
Avoid too much control and an overly authoritative top-down approach whenever possible. On
the contrary, apply the collaborative and controlled model of data governance to enable
controlled role-based applications that allow your data stakeholders and the entire
stakeholder community to harness the power of data with governance put in place from the
get-go.

Make sure that the business understands the benefits, but also that stakeholders are ready
to participate in the effort of delivering trusted data at the speed of the business.

Start with your data

Gartner predicts that “by 2023, 75% of all databases will be on a cloud platform,
increasing complexity for data governance and integration.” The move to cloud is
accelerating as organizations need to collect more data, including new datasets that are
created beyond their firewalls, deliver that data in real time to a wider audience, and
seek more agility and on-demand processing capabilities.

Because your data can be off premises, running on top of third-party infrastructures, using
the cloud might require stronger data governance principles. Take the example of data
privacy, where regulations mandate that:
Data governance is not a project; rather, it’s an

Depending on your context, there’s a good chance that the cloud is the perfect place to
capture the footprints of all the data in your data landscape, and then empower all the
Skeptics will challenge you on your ability to control and solve their problems.
Don’t take for granted that people will understand that your data has value. You
will need to prove to them that they will save resources and money by working
with trusted data. Take a data sample like a Salesforce dataset, for instance, or
a Marketo data source. Use data preparation tools to explain how easy it is to
remove duplicates and identify data quality issues. Show the recipe function
that allows users to effortlessly reproduce their prep work on other datasets.
Make sure that everyone understands the benefits of data quality, and that they
can, for example, use validated customer contact data to improve the ROI of their
sales and marketing activities.
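The notion of a reusable recipe can be sketched outside any tool as an ordered list of preparation steps applied to one dataset after another. The Python below is a minimal illustration under that assumption; the step functions, the `apply_recipe` helper, and the sample Salesforce and Marketo rows are hypothetical and do not represent the Talend Data Preparation recipe format.

```python
def trim_and_lowercase_email(rows):
    """Normalize the email field so equivalent addresses compare equal."""
    return [{**row, "email": row["email"].strip().lower()} for row in rows]


def drop_duplicate_emails(rows):
    """Keep only the first row seen for each email address."""
    seen, kept = set(), []
    for row in rows:
        if row["email"] not in seen:
            seen.add(row["email"])
            kept.append(row)
    return kept


# A "recipe" here is simply an ordered list of steps.
RECIPE = [trim_and_lowercase_email, drop_duplicate_emails]


def apply_recipe(rows, recipe):
    """Run each step of the recipe in order and return the prepared rows."""
    for step in recipe:
        rows = step(rows)
    return rows


# Assumed samples standing in for a Salesforce dataset and a Marketo source.
salesforce_contacts = [
    {"name": "Ana", "email": " Ana@Example.com"},
    {"name": "Ana D.", "email": "ana@example.com "},
    {"name": "Bob", "email": "bob@example.org"},
]
marketo_leads = [
    {"name": "Carol", "email": "CAROL@example.net"},
    {"name": "Carol", "email": "carol@example.net"},
]
clean_contacts = apply_recipe(salesforce_contacts, RECIPE)
clean_leads = apply_recipe(marketo_leads, RECIPE)
print(clean_contacts)
print(clean_leads)
```

Because the recipe is defined once and applied twice, the second dataset is cleaned with no extra preparation work, which is the point worth demonstrating to skeptics.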
Another quick win is to show them how easy it is to mask data with Talend
Data Preparation.
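For readers who want to see what masking amounts to, here is a minimal Python sketch of two common techniques: partial masking for display and salted hashing for joining records without exposing the original value. It is an assumption-laden illustration, not the masking functions of Talend Data Preparation; the helper names and the salt are hypothetical.

```python
import hashlib


def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}{'*' * max(len(local) - 1, 0)}@{domain}"


def pseudonymize(value, salt):
    """Replace a value with a truncated salted SHA-256 digest.

    The same input and salt always give the same token, so records can
    still be joined on the token without showing the original value.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


masked = mask_email("ana.diaz@example.com")
token = pseudonymize("ana.diaz@example.com", salt="demo-salt")
print(masked)
print(token)
```

Note that a salted hash is pseudonymization rather than anonymization: anyone holding the salt can recompute tokens, so such data should still be handled as personal data.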
Top takeaway:
The more you target business people, the simpler and more intelligent self-service apps need to be.
It’s critical that these roles work together in a team. Collaborative data governance is a
team sport, like the America’s Cup, where everyone works on the same boat to win the race,
using their unique skills and capabilities and working together to stay ahead of the
competition.

Here’s an example of how these roles work together in practice.

Let’s imagine a company that wants to incorporate weather data to improve the precision of
their sales forecasts.

One approach to executing this data project is the siloed way, where IT controls the access
to and distribution of data using data integration tools. Then the data scientists in their
data lab could use a data science platform, while the CIO’s office would use this data
governance framework to ensure compliance. However, how could they work as a team with this
siloed approach? Moreover, who could control this disparate set of practices, tools, and
datasets?

This scenario is what collaborative data management is all about: allowing people to work
as a team to reap all the benefits of your data.
has met these expectations.
Abderrahmane Belarfaoui, Chief Data Officer, Euronext

Brussels, Dublin, Lisbon, and Paris. Euronext comprises close to 1,300 issuers, reporting a
total market capitalization of 3,700 billion euros at the end of March 2018.

In 2016, Euronext began the typical process of migrating its data to the cloud, except that
this migration had nothing typical about it at all. First off, the Euronext database
contained 100 TB of data, making it one of the biggest databases in Europe. Then there was
the fact that this was not just a simple transfer of a database to a hosted platform. The
idea was to create a governed data lake with self-service access for business units and
clients in an effort to monetize new services and generate additional revenues.
INFORMATION:
HQ: Germany
10,000+ employees
USE CASE:
Operational efficiency
CHALLENGE:
Providing self-service data and analytics in real time
TALEND PRODUCTS USED:
Talend Real-Time Big Data
Talend Data Catalog
Talend Data Preparation

Uniper delivers trusted data at the speed of demand

Uniper generates, trades, and markets energy on a large scale. With about 36 gigawatts of
installed generation capacity, Uniper is among the most significant global power
generators. The company also procures, stores, transports, and supplies commodities such as
natural gas, LNG, and coal as well as energy-related products.
to become the airline that best caters to its customers.”

on AirFrance.com. In addition, there are exchanges with the company’s 16 million Facebook
fans and three million Twitter followers, as well as data from media campaigns, since Air
France-KLM is one of the few advertisers to carry out its own media buying process online.
institution leveraged hundreds of datasets from disparate sources into the cloud data lake.

For this leading financial services company, IT modernization was at the core of the
project as well, with cloud and Big Data as the two pillars for bringing all relevant data
together onto a flexible, scalable analytics and reporting platform. Through a data lake on
Amazon Web Services (AWS), this company was able to collect all the raw data needed for
risk aggregation.
And that’s where Talend can offer a “start small and grow fast” approach. Most data-driven
initiatives start by creating a “data place” where companies capture all their data. You
could call it a data hub, a data warehouse, a data lake, or a customer 360°; the rationale
is the same.

Generally it starts with data capture and data movement and then transformation (for
example, aggregation or reconciliation). This is the starting point for data governance.
This is where businesses build and run their data pipelines at the speed of their business;
with respect to data governance, this is the origin of data management.

This simple diagram shows the three maturity levels organizations go through as they become
more data-driven companies.

This said, the road to data integrity is scattered with traps. One of the biggest obstacles
you’ll be confronted with is the ability of your communities to understand why data is an
asset and how to make it better.

According to Accenture, 78% of business leaders expected their organizations to be digital,
yet only 49% of them said they had a strategy for the management and development of the
skills needed for the digital world.
Data intelligence
– Data Cataloging – Data Lineage – Metadata MGMT
Data integrity
– Data Preparation – Data Stewardship – Data Quality
Data integration
– Application Integration – Data Integration – Data Loading
Choose your most active members and encourage them to foster these
communities through learning apps.
Talend Data Fabric brings together in a single platform all the necessary
capabilities that ensure enterprise data is complete, clean, compliant, and
readily available to everyone who needs it throughout the organization.
Over 4,250 organizations across the globe rely on Talend to deliver exceptional
customer experiences, make smarter decisions in the moment, drive
innovation, and improve operations. Talend has been recognized as a leader in
its field by leading analyst firms and industry publications.