Data Lineage in 2024 For Data Architects 1717374070

Uploaded by

Arun Kumar
Copyright
© © All Rights Reserved

THE ADVANCED GUIDE

DATA LINEAGE

Trust your data at every step!


What Is Data Lineage?
The core idea behind data lineage is the ability to fully understand how data
flows from one place to another within the infrastructure that was built to
house and process it. It seems like it should not be a difficult problem, but it is.
In fact, it is a huge issue for organizations as they face 2023 and beyond. If
organizations do not know where their data comes from or goes, they have
uncontrolled environments that carry risk on many different levels. An
uncontrolled data environment also makes it very difficult to extract
value from data, and data is the new oil or new gold. Organizations that
cannot extract value from data stand a good chance of being outcompeted
and replaced by organizations that can.

Our modern technological civilization is full of examples that would make data
lineage seem easy by comparison. An oil refinery is a gigantic piece of
infrastructure, built to a precise specification, where the operators know
exactly what is happening with the liquid products flowing through it –
admittedly with a few very rare, and sometimes tragic, exceptions. Modern
telephone networks are another exquisite example of complex engineering,
being fully controlled from Network Operating Centers (NOCs) so that calls go
through from origin to destination without a hitch.

We can agree that data lineage is a hard problem to solve, but just what is it?
To answer this, think of an item of data being captured for the first time within
the firewalls of an organization, perhaps via data entry. The days when data
stayed put in the same silo where it was first captured are long gone.
Inevitably, the item of data will be sent to other data stores (databases or files),
and from these to yet more, until finally, our item of data ends up as a piece of
information in one or more data consumption platforms such as reports,
operational systems or even customer facing applications. As the data item
travels it may be replicated, or transformed to standardize it, or used in
calculations to generate other data elements that enrich the overall data
environment. All of this – where the data is stored, the pathways it travels, the
changes that happen to it along the way, how it becomes a constituent of
other data, and where it appears in the various data consumption platforms –
makes up data lineage.
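
One way to make this definition concrete is to treat lineage as a directed graph: every stored copy or derived element is a node, and every movement or transformation is an edge. The sketch below is illustrative only. The class `LineageEdge`, the function `describe_journey`, and all system, table, and column names are hypothetical, and the pathway is assumed to be a single unbranched chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One hop in a data flow: source element -> target element."""
    source: str          # hypothetical name, e.g. "crm.orders.amount"
    target: str
    transformation: str  # "copy", "standardize", "calculate", ...


# Hypothetical journey of one data item from data entry to a report.
EDGES = [
    LineageEdge("crm.orders.amount", "staging.orders.amount", "copy"),
    LineageEdge("staging.orders.amount", "dwh.fact_sales.amount_usd", "standardize"),
    LineageEdge("dwh.fact_sales.amount_usd", "dwh.agg_sales.total_usd", "calculate"),
    LineageEdge("dwh.agg_sales.total_usd", "report.monthly_sales.total", "copy"),
]


def describe_journey(start, edges):
    """Follow a single, unbranched, acyclic pathway from `start`."""
    by_source = {edge.source: edge for edge in edges}
    steps, current = [], start
    while current in by_source:
        edge = by_source[current]
        steps.append(f"{edge.source} --{edge.transformation}--> {edge.target}")
        current = edge.target
    return steps


journey = describe_journey("crm.orders.amount", EDGES)
for step in journey:
    print(step)
```

Real landscapes branch and merge, so a production model would keep several outgoing edges per node; the single-chain assumption here only keeps the example short.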
Understanding the Different Layers of Lineage

Data lineage exists on different levels, each of which has its own particular characteristics and value.

Cross-system data lineage (often referred to as “horizontal”)


Data lineage was first documented manually as overall flows between systems from source to
destination. This high-level data lineage, usually at the dataset level, shows the big picture and
answers questions like:

• Where did the data come from?
• What stops did the data make on its journey?
• Which reports, visualizations or analyses use the data?
• Which systems are involved in moving the data within the data landscape?

This is of enormous help to data architects and engineers – as far as it goes. But in many cases it does
not go far enough; we need to dig deeper into a specific point in the horizontal data lineage to get
the answers we seek.
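
The cross-system questions above map naturally onto graph traversal at the dataset level. The following is a minimal sketch under stated assumptions: the `FLOWS` list, the dataset names, and the convention that the text before the first dot names the system are all invented for illustration.

```python
from collections import deque

# Hypothetical dataset-level flows: (source dataset, target dataset).
FLOWS = [
    ("crm.customers", "staging.customers"),
    ("erp.invoices", "staging.invoices"),
    ("staging.customers", "dwh.dim_customer"),
    ("staging.invoices", "dwh.fact_revenue"),
    ("dwh.dim_customer", "report.revenue_by_customer"),
    ("dwh.fact_revenue", "report.revenue_by_customer"),
    ("dwh.fact_revenue", "report.cash_forecast"),
]


def reachable(start, flows, downstream=True):
    """Breadth-first walk of the flow graph in one direction."""
    neighbours = {}
    for src, dst in flows:
        key, value = (src, dst) if downstream else (dst, src)
        neighbours.setdefault(key, []).append(value)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def system_of(dataset):
    """Assumed naming convention: '<system>.<dataset>'."""
    return dataset.split(".")[0]


# Where did the data come from, and which stops did it make?
origins = reachable("report.revenue_by_customer", FLOWS, downstream=False)
# Which reports use the data?
consumers = {d for d in reachable("erp.invoices", FLOWS) if system_of(d) == "report"}
# Which systems are involved in moving the data?
systems_involved = sorted({system_of(d) for d in origins})
```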

End-to-end column data lineage (often referred to as “vertical”)

End-to-end column lineage enables you to “zoom in” on the cross-system lineage, detailing
column-to-column or column-to-report-element lineage between systems across the data
landscape. This “vertical” lineage gives the widest view of any column, from its very first
origin, through the entire landscape, to wherever it is being used, and answers questions like:

• What is the ultimate source of a metric in my report?
• Which processes are involved in getting it to the report?
• Why do these two “identical” fields have different values?
• Where is this column being used?

This is the type of lineage needed by data analysts trying to solve a report discrepancy or working
to understand the true scope of what exists in an environment that is to be migrated to a new
platform.
While this layer of data lineage is sufficient for many use cases, oftentimes we need to dig deeper still.
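
As an illustration of the end-to-end column questions, the sketch below walks upstream from two report fields that look identical and shows that they originate in different systems. The `UPSTREAM` mapping, the column names, and the process names are hypothetical, and the graph is assumed to contain no cycles.

```python
# Hypothetical column-level edges: target column -> [(source column, process)].
UPSTREAM = {
    "report.sales.net_revenue": [("dwh.fact_sales.net_revenue", "bi_extract")],
    "report.finance.net_revenue": [("dwh.fact_gl.net_revenue", "bi_extract")],
    "dwh.fact_sales.net_revenue": [
        ("staging.orders.gross", "etl_sales"),
        ("staging.orders.discount", "etl_sales"),
    ],
    "dwh.fact_gl.net_revenue": [("staging.ledger.revenue", "etl_gl")],
    "staging.orders.gross": [("shop.orders.gross", "ingest_shop")],
    "staging.orders.discount": [("shop.orders.discount", "ingest_shop")],
    "staging.ledger.revenue": [("erp.ledger.revenue", "ingest_erp")],
}


def ultimate_sources(column, upstream):
    """Return (origin columns, processes) found by walking upstream.

    Assumes the lineage graph is acyclic.
    """
    parents = upstream.get(column, [])
    if not parents:  # nothing feeds this column, so it is an origin
        return {column}, set()
    origins, processes = set(), set()
    for source, process in parents:
        processes.add(process)
        src_origins, src_processes = ultimate_sources(source, upstream)
        origins |= src_origins
        processes |= src_processes
    return origins, processes


sales_origins, sales_processes = ultimate_sources("report.sales.net_revenue", UPSTREAM)
finance_origins, _ = ultimate_sources("report.finance.net_revenue", UPSTREAM)

# The two "identical" fields share no origin, which explains differing values.
print(sales_origins.isdisjoint(finance_origins))
```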

Inner-system data lineage

Inner-system lineage details the column-level lineage within an ETL process,
report or compilable database object. This layer of lineage answers
questions like:

• What logic was applied to a metric?
• What are the different components impacting a field within a process?
• How was this KPI calculated?
This level of detail has traditionally only been found within ETL or
reporting tools themselves (when it is found at all), and, even in
2023, is found only in select advanced data lineage solutions that
strive to provide a full picture.
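
To give a feel for what inner-system lineage involves, the sketch below extracts the input columns of a calculation by walking its syntax tree with Python's standard `ast` module. The `DERIVATIONS` mapping and the KPI names are hypothetical, and the expressions are written in Python syntax purely for illustration; a real ETL tool would have its own expression language and parser.

```python
import ast

# Hypothetical derivation logic, as it might appear inside an ETL step.
DERIVATIONS = {
    "gross_margin": "(revenue - cost_of_goods) / revenue",
    "revenue": "unit_price * quantity",
}


def input_columns(expression):
    """Names referenced by an expression, found by walking its syntax tree."""
    tree = ast.parse(expression, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def explain(kpi, derivations, depth=0):
    """Render how a KPI is calculated as a list of indented lines."""
    indent = "  " * depth
    expression = derivations.get(kpi)
    if expression is None:
        return [f"{indent}{kpi} (input column)"]
    lines = [f"{indent}{kpi} = {expression}"]
    for column in sorted(input_columns(expression)):
        lines.extend(explain(column, derivations, depth + 1))
    return lines


explanation = explain("gross_margin", DERIVATIONS)
print("\n".join(explanation))
```

The printed tree answers "How was this KPI calculated?" by listing each intermediate formula and the raw columns it depends on.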
Why Every Business Process Should Start with Data Lineage
So far we have described data lineage and its three layers in the context of systems and data objects.
There is another frequently overlooked, but extremely important, aspect of data lineage.
Data lineage represents a good deal of the business processes that occur in an organization. Long ago, all
processes in an organization were manual, and information went from one person, or department, to another
where individual people processed it. Today, all that has changed. We can think of systems having replaced the
people who did the processing, and data lineage having replaced the ways in which information was sent
between people and departments. An organization must ask itself whether the business value chain represented by
its data lineage is the most effective and efficient way possible to implement its business processes. All too
often, organic growth in IT architecture and point-to-point data transfers accumulate over time to distort how
business value chains are actually implemented. This is where data lineage becomes a strategic concern for
enterprises, closely linked to their overall business model.

Truth be told, most organizations are still a long way off from using data lineage in business strategy.
Nevertheless, the more technical use cases for data lineage are still extremely valuable, and we will look at
a range of them now to gain a better appreciation of what data lineage is and the value it provides.

Use Case 1

Assurance of Data Integrity in Reports
A compelling argument for the need for data lineage is quickly resolving end-user doubts about the reliability of the data they are seeing in their reports.

Many BI developers, and report developers in the business, live in constant terror of being asked by a business user to confirm the accuracy of some strange “blip” of data that the user is seeing in a report. The most important thing to understand here is that it is not whether there is an error in the report or not that matters. What matters is whether the developer can give the business user a convincing explanation of what is going on within a reasonable time or not – even if it is an error. Failure to deliver a timely explanation inevitably makes the business user suspect that they are only seeing the tip of the iceberg, and they have no reason to trust the whole suite of reports they are working with.

With data lineage, a BI developer can trace back the lineage of the offending data element and inspect each node in the data lineage chain to determine what is happening. Clarity of the situation – whether good or bad – is achieved.

Use Case 2

Impact Analysis
Changes involving data objects are frequent in an organization. A major headache is the determination of who may be affected if the change is implemented. One way of approaching this is to describe the change and ask a wide range of staff if they think that something in their area may be affected. Exactly which staff should be asked is difficult to know, because the impact is not known. Continuously asking a very broad range of individuals in an organization if they might be affected, when most of them inevitably will not, risks disruption that will lead to complaints about those managing the proposed change. All too often a team managing a change only makes a perfunctory effort to assess impact, and simply makes the change and waits to see if anyone will complain right away.

Data lineage is a huge advantage in impact analysis. The data objects downstream of where the change will be implemented are identified, along with the business users who interact with them. This is important because not every impact is a technical system or data impact, and business process changes may be required.
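
Impact analysis of this kind can be sketched as a downstream traversal combined with a lookup of who uses each affected object. Everything named below (`DOWNSTREAM`, `USERS`, `impact_of_change`, and the object and team names) is hypothetical; in practice the user mapping would come from a catalog or from report access logs.

```python
from collections import deque

# Hypothetical lineage edges: upstream object -> downstream objects.
DOWNSTREAM = {
    "erp.customer.region_code": ["dwh.dim_customer.region"],
    "dwh.dim_customer.region": ["report.sales_by_region", "ml.churn_features.region"],
    "ml.churn_features.region": ["app.retention_dashboard"],
}

# Hypothetical mapping of data objects to the business users who rely on them.
USERS = {
    "report.sales_by_region": ["Sales Operations"],
    "app.retention_dashboard": ["Customer Success", "Marketing"],
}


def impact_of_change(changed_object, downstream, users):
    """Return (affected objects in discovery order, sorted affected users)."""
    affected, seen = [], {changed_object}
    queue = deque([changed_object])
    while queue:
        for nxt in downstream.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                affected.append(nxt)
                queue.append(nxt)
    people = sorted({user for obj in affected for user in users.get(obj, [])})
    return affected, people


objects, people = impact_of_change("erp.customer.region_code", DOWNSTREAM, USERS)
print(objects)
print(people)
```

Instead of polling a broad range of staff, the change team would contact only the groups returned in `people`.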
Use Case 3

Tracking Personal Information (PI)
Data privacy has become a far more prominent issue in the past few years, with widespread concerns about the misuse of personal
information (PI) by governments, corporations, and criminals. New laws have been enacted in several
jurisdictions, and the European Union’s GDPR explicitly calls out the need to know where PI is located in
the data landscape.

The traditional approach to the need to know where PI is located has been data profiling. This involves
examining each column in each relational database table (or other non-relational data objects) to try to infer
if it is PI or not. However, data lineage provides a better solution. If all the data flows are known and a
column at any node in a data flow pathway can be identified as PI, then every node in that pathway
logically carries the same piece of PI. This makes it scalable to have analysts identify PI, since this only has to be
done once in any particular pathway. Furthermore, data lineage extends into reporting layers as well as
databases, unlike traditional data profiling. This means that if we can identify a data element in a report as
PI, which a report user should be easily able to do, then we can find PI columns across all the lineage
pathways involving this data element. An added bonus is that knowing what reports contain PI makes it
easier to govern their dissemination – both inside and outside the enterprise.
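
The propagation idea described above can be sketched as follows: tag one element as PI, then collect everything upstream and downstream of it. The `FLOWS` list and all column names are hypothetical. The sketch also assumes that every hop copies or standardizes the value; a hop that aggregates or anonymizes data may no longer carry PI and would need separate review.

```python
# Hypothetical column-level flows: (source column, target column).
FLOWS = [
    ("web.signup.email", "staging.users.email"),
    ("staging.users.email", "dwh.dim_user.email_address"),
    ("dwh.dim_user.email_address", "report.user_list.contact"),
    ("web.signup.plan", "staging.users.plan"),
]


def walk(start, flows, downstream):
    """All nodes reachable from `start` in one direction."""
    step = {}
    for src, dst in flows:
        key, value = (src, dst) if downstream else (dst, src)
        step.setdefault(key, set()).add(value)
    found, stack = set(), [start]
    while stack:
        for nxt in step.get(stack.pop(), ()):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found


def pi_columns(tagged, flows):
    """The tagged element plus everything upstream and downstream of it."""
    return {tagged} | walk(tagged, flows, True) | walk(tagged, flows, False)


# A report user recognizes the "contact" field as PI; lineage finds the rest.
pi = pi_columns("report.user_list.contact", FLOWS)
reports_with_pi = sorted(c for c in pi if c.startswith("report."))
```

Walking upstream and downstream separately, rather than treating the graph as undirected, avoids tagging unrelated columns that merely feed the same target.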

Using a data lineage tool to bring a data environment under control from the perspective of PI is a concern
of Data Governance and any Legal or Privacy function in an organization. This demonstrates how valuable
data lineage can be outside of IT.

Use Case 4

Broken ETL
This use case is often a consequence of the lack of implementation of proper impact analysis.
Extract-Transform-and-Load (ETL) tasks move data around the organization, often reshaping, enriching, and
integrating the data as they do so. Closely related is ELT (Extract-Load-and-Transform). In ELT, data is taken
from a source, loaded into a target database, and then transformed. With ETL, there is a more traditional
approach of doing the transformation before loading the data. We shall use “ETL” to cover both approaches.

ETL jobs sometimes break in production, often as the result of some upstream change that was
not communicated. Once this happens IT is under the gun to figure out what happened and fix it.

Since it is so often the case that an upstream change has broken the ETL, this hypothesis needs to be tested
right away. Data lineage allows IT staff to trace back the pathway to the ETL job. With this, it is a
comparatively simple job to investigate what, if anything, has changed in this pathway and fix it.

Most importantly, data lineage can pinpoint what is broken. This means that the root cause of the problem can
be detected and analyzed. All too often, this is not done, and downstream workarounds are implemented that
further distort the overall architecture. The role of data lineage in root cause analysis and error correction
cannot be overstated.
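
Testing the "upstream change" hypothesis can be sketched as walking the pathway into the failed job and comparing schema snapshots for each object on it. The job name, the `READS_FROM` map, and the `BEFORE` and `AFTER` snapshots are all hypothetical; a real environment would take them from its lineage tool and metadata history.

```python
# Hypothetical upstream map for an ETL job: object -> objects it reads from.
READS_FROM = {
    "etl.load_fact_orders": ["staging.orders", "staging.customers"],
    "staging.orders": ["shop.orders"],
    "staging.customers": ["crm.customers"],
}

# Hypothetical schema snapshots (column -> type) taken before and after the failure.
BEFORE = {
    "shop.orders": {"order_id": "int", "amount": "decimal(10,2)"},
    "crm.customers": {"customer_id": "int", "name": "varchar(100)"},
}
AFTER = {
    "shop.orders": {"order_id": "int", "amount_usd": "decimal(10,2)"},
    "crm.customers": {"customer_id": "int", "name": "varchar(100)"},
}


def upstream_objects(job, reads_from):
    """Every object on any pathway leading into `job`."""
    found, stack = [], [job]
    while stack:
        for src in reads_from.get(stack.pop(), []):
            if src not in found:
                found.append(src)
                stack.append(src)
    return found


def schema_changes(job, reads_from, before, after):
    """Upstream objects whose schema differs between the two snapshots."""
    changes = {}
    for obj in upstream_objects(job, reads_from):
        old, new = before.get(obj, {}), after.get(obj, {})
        if old != new:
            changes[obj] = {
                "removed": sorted(set(old) - set(new)),
                "added": sorted(set(new) - set(old)),
                "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
            }
    return changes


suspects = schema_changes("etl.load_fact_orders", READS_FROM, BEFORE, AFTER)
print(suspects)
```

Here the renamed `amount` column in the source system is reported as the likely root cause, which points the fix upstream rather than at a downstream workaround.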
Use Case 5

Migration of Applications and Reports
Migration of applications and reports is needed for many reasons. Support can run out for hardware or software. A new software product can become available with very attractive features. This is often the case with reporting software which seems to turn over on a very frequent basis. Today, however, there is a generational shift away from on-premise data centers to the Cloud, and this is the example of migration we will focus on.

When a migration away from on-premise occurs, there is not simply a need to replicate the data structures, processing logic, and reports in the Cloud. The flow of data also has to be replicated. This means understanding the existing data lineage.

There is an overwhelming temptation to adopt a “lift and shift” approach to migration, meaning that the legacy environment, with all its blemishes, is replicated as closely as possible in the Cloud environment. This is incredibly appealing to project managers and sponsors who want to drive down risk and deliver the project on time. Yet it is a very grave risk to the long-term viability of the mature enterprises that take this approach. A long chain of periodic migrations in an enterprise that has been around for decades heaps idiosyncrasy upon idiosyncrasy and workaround upon workaround with each migration.

The point here is that mature enterprises should use data lineage during a migration to understand their business processes and re-engineer them. Data lineage is not just a map of how data flows – it reveals an understanding of how the ultimate business processes have been implemented. Ideally, a mature enterprise will abstract back from the data lineage to what these processes should be today and redesign them.

That said, there are some additional quick wins that can be gained from data lineage during a migration. In particular, identifying data and report objects where data “dead-ends” and which are not used is extremely helpful. There is no point in migrating something that is not used. Thus, these “dead-end” objects and the data lineage pathways to them can be discarded in the migration.
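
Finding "dead-end" objects before a migration can be sketched as a reverse traversal: start from the reports that are actually used, keep everything that feeds them, and treat the remainder as candidates to leave behind. The `FLOWS` list, the object names, and the `REPORTS_IN_USE` set are hypothetical; usage information would typically come from report access statistics.

```python
# Hypothetical flows in the legacy environment: (source, target).
FLOWS = [
    ("src.orders", "stg.orders"),
    ("stg.orders", "dwh.fact_orders"),
    ("dwh.fact_orders", "report.weekly_orders"),
    ("stg.orders", "dwh.orders_backup_2019"),
    ("src.fax_log", "stg.fax_log"),
    ("stg.fax_log", "report.fax_volume"),
]

# Hypothetical usage information, e.g. reports opened during the last year.
REPORTS_IN_USE = {"report.weekly_orders"}


def migration_scope(flows, in_use):
    """Split objects into those feeding a used report and dead-end candidates."""
    feeds = {}
    for src, dst in flows:
        feeds.setdefault(dst, set()).add(src)
    keep, stack = set(in_use), list(in_use)
    while stack:
        for src in feeds.get(stack.pop(), ()):
            if src not in keep:
                keep.add(src)
                stack.append(src)
    everything = {obj for flow in flows for obj in flow}
    return keep, everything - keep


keep, discard_candidates = migration_scope(FLOWS, REPORTS_IN_USE)
print(sorted(discard_candidates))
```

The second set is a list of candidates for review rather than an automatic delete list, since an object may serve an operational purpose that the lineage and usage data do not capture.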

Use Case 6

Enterprise-Wide Data Administration
In the 1990s Data Administration became very popular, only to largely disappear in the recessions of the early
2000s. It never became a serious component of the Data Governance revolution that began in 2005, probably
because of the intensely manual aspect of Data Administration. Yet, data lineage now offers a solution to this
problem. Examples of the concerns of Data Administration that data lineage can address include:

• Continuously monitoring for tables and ETL processes that are not used in reporting, and go nowhere. This is
not just within the context of the migration projects we mentioned earlier but is a continuous activity of Data
Administration.
• Discovering and remediating datatype discrepancies that corrupt data as it flows. A target column may be
shorter than a source column, and truncate data. Or a target column may be numeric and remove
meaningful leading zeroes from a source column that is character data, but contains only numbers.
• Discovering suspicious data extracts, such as “private” files that might be used for ungoverned data
processing, or may even possibly be fraudulent.

There is actually a wide range of use cases within Data Administration in addition to these
examples. Without data lineage it is difficult to see how these could be meaningfully addressed at the scale of
an entire enterprise data ecosystem.
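
The datatype discrepancies mentioned in the list above lend themselves to a simple automated check once column-to-column mappings are known. The sketch below is a minimal illustration: the `MAPPINGS` rows, the type strings, and the two rules are assumptions, and real databases have many more types and conversion rules than are covered here.

```python
import re

# Hypothetical mappings: (source column, source type, target column, target type).
MAPPINGS = [
    ("crm.customer.name", "varchar(100)", "dwh.dim_customer.name", "varchar(50)"),
    ("crm.customer.postal_code", "char(5)", "dwh.dim_customer.postal_code", "integer"),
    ("crm.customer.id", "integer", "dwh.dim_customer.id", "integer"),
]

CHAR_TYPE = re.compile(r"^(?:var)?char\((\d+)\)$")
NUMERIC_TYPES = {"integer", "bigint", "smallint"}


def check_mapping(source_type, target_type):
    """Return a list of risks for one source-to-target type pair."""
    risks = []
    src_char = CHAR_TYPE.match(source_type)
    tgt_char = CHAR_TYPE.match(target_type)
    if src_char and tgt_char and int(tgt_char.group(1)) < int(src_char.group(1)):
        risks.append("possible truncation")
    if src_char and target_type in NUMERIC_TYPES:
        risks.append("leading zeroes may be lost")
    return risks


findings = {}
for src, src_type, tgt, tgt_type in MAPPINGS:
    risks = check_mapping(src_type, tgt_type)
    if risks:
        findings[(src, tgt)] = risks

for (src, tgt), risks in findings.items():
    print(f"{src} -> {tgt}: {', '.join(risks)}")
```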
The Role of Automated Data Lineage
At this point, we have described what data lineage is and illustrated it with
a set of use cases where it is particularly valuable. However, it is natural to
ask how data lineage actually gets done.

Traditionally, it has been documented during IT development efforts – at
least sometimes. Even where there are attempts to document data
lineage, they are nearly always wasted effort. The environment evolves, but
the documentation is usually not updated. And if there is even the slightest
suspicion of the accuracy of the documentation, then all of it is suspect,
and nobody trusts it.

Traditionally, this has left us with manual efforts to trace data lineage
whenever the need arises – which is pretty frequent. These manual efforts
are costly, error-prone, and frustrating to all involved. But today, there is
an alternative – automated data lineage discovery.

The tools that enable automated data lineage address the use
cases described above very effectively, for the following reasons:

• Automated data lineage tools scale well. Very often there are not enough technical IT staff to do manual data lineage work.
• Automated data lineage tools are accurate. Manual effort is error-prone, and different analysts may document data lineage differently, leading to interpretation problems.
• Automated data lineage tools deal well with complexity. Even moderately sized enterprises have a huge number of columns and ETL processes in their data landscapes.
• Automated data lineage tools are very fast. They can scan huge environments and produce data lineage diagrams in just minutes. It would take humans anywhere from several days to several months to do the same work. As we have seen, there are use cases where answers about data lineage are needed immediately.

IT managers and executive sponsors sometimes mistakenly assume that the need for data lineage is very
intermittent, arising only for events such as migration projects, and ask why they should invest in an automated data
lineage tool that will eat up recurrent annual license fees. Sometimes they think it is better to hire
consultants to document the data lineage manually when it is needed – a one-time cost. The use cases
discussed earlier show that this is a short-sighted view and an automated data lineage tool needs to be on
hand for the use cases that are permanent in nature, and others where a rapid response is needed. That
really is the situation for most enterprises as they face 2023 and beyond.
Choosing the Right Data Lineage Solution for You
How do you choose the right data lineage solution for your organization from the
ones available on the market?

Let your priorities and use cases (see the above Data Lineage and Business Processes
section) be your first measuring stick with which to filter out the available tools. If your
primary use case is answering the questions of business users and auditors to assure
them of data integrity, an always up-to-date tool may be more important to you. If
your primary use case is facilitating seamless migrations, you may want to prioritize
tools that provide detailed and intuitive visualizations, enabling you to easily compare
target and source and assess the completeness of your migration.

In addition, it is wise to consider the following factors:

Which technologies does it support?


At the risk of stating the obvious: your data lineage solution will need to support
your company’s data systems. Ideally, you want a data lineage solution that will
support every single data technology tool you use, providing full visibility
end-to-end. You don’t want black boxes in the middle of your data lineage charts.

If you are coping with an awkward proprietary system, you may
be forced to find a technology-agnostic solution. What such solutions provide in
tech flexibility, however, they lose in accuracy.

If you are at the other end of the spectrum with a cloud-based
modern data stack, you want your enterprise data lineage solution to
support not only the tools you use now, but also any tools you may
decide to use in the future. A good sign is if you see that the data lineage
solution you’re evaluating is continually adding new out-of-the-box
integrations for new cloud-based technologies. That means that keeping
up-to-date is a priority for them. So if a new technology bursts onto the
market in another year and your enterprise decides a few months later to
use it, it’s likely that this data lineage solution will have developed an
integration by then.

And if the solution doesn't support all of your technologies out-of-the-box, there should still
be an option to enrich the automated lineage with additional relationships between data
assets to make sure to cover all of your ‘home grown’ applications and processes as well.
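
Enriching automated lineage with hand-documented relationships can be pictured as merging two edge sets while recording where each edge came from. This is only a sketch: `SCANNED`, `MANUAL`, `enrich`, and the object names are hypothetical, and an actual product would expose its own mechanism for adding such relationships.

```python
# Hypothetical edges found by an automated scan: (source, target).
SCANNED = {
    ("erp.invoices", "staging.invoices"),
    ("dwh.fact_revenue", "report.revenue"),
}

# Hypothetical relationships documented by hand for a home-grown job
# that the scanner cannot read.
MANUAL = {
    ("staging.invoices", "dwh.fact_revenue"),
    ("erp.invoices", "staging.invoices"),  # already known to the scanner
}


def enrich(scanned, manual):
    """Merge both edge sets, recording the origin of each edge."""
    merged = {edge: "scanned" for edge in scanned}
    for edge in manual:
        merged.setdefault(edge, "manual")  # scanned evidence takes precedence
    return merged


lineage = enrich(SCANNED, MANUAL)
manual_only = sorted(edge for edge, origin in lineage.items() if origin == "manual")
print(manual_only)
```

Keeping the origin label makes it easy to review the manually added edges later, since those are the ones that will not refresh automatically when the environment changes.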

In which data environment is it designed to best function?
What does your data environment look like? Are you working with a single vendor? Do you
integrate systems from multiple vendors? Are your systems on-prem, cloud-based or a type
of hybrid configuration?

Your answer will inform the level of data lineage tool that you need. If you use only one
on-premises legacy system, you may only need a simple data lineage tool. On the other
hand, if your data passes through multiple systems while flying to the cloud and back, you
need data lineage that can follow your data wherever it goes.
What dimensions of data lineage does it show?
Think about what dimensions of data lineage you need to access (see the
Understanding the Different Layers of Lineage section), and compare that to the
functionality of the data lineage system that you’re evaluating.

Most data lineage solutions will offer some form of cross-system (“horizontal”) lineage.
Fewer will include end-to-end column (one aspect of “vertical”) lineage. Fewer still
provide inner-system lineage.

Assess your use cases and the dimensions of data lineage you will need and seek out a
data lineage solution that will provide them.

Is it automated, and to what extent?


As mentioned above (see The Role of Automated Data Lineage section), speed to insight
is a key factor in today’s fast-paced business environment. Automated data lineage uses
analysis of your system’s metadata to trace your data’s path, without having to burden
your data team with manual work. The result is faster, more scalable, more accurate
understanding of your data flow and landscape. If you are choosing a data lineage
solution in 2023, automation should be a given.
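
To illustrate what "analysis of your system's metadata" can mean in the simplest case, the sketch below derives table-level lineage from the text of two SQL statements. It is a deliberately naive toy: the statements and table names are hypothetical, and the regular expressions handle only plain `INSERT INTO ... SELECT ... FROM ... JOIN` statements. Subqueries, common table expressions, views, and vendor-specific syntax would need a proper SQL parser, which is where dedicated tools earn their keep.

```python
import re

# Hypothetical SQL text, as it might be read from a job repository.
STATEMENTS = [
    "INSERT INTO dwh.fact_sales SELECT * FROM staging.orders o "
    "JOIN staging.customers c ON o.cid = c.id",
    "INSERT INTO report.sales_summary SELECT region, SUM(amount) "
    "FROM dwh.fact_sales GROUP BY region",
]

TARGET = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCES = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return (source table, target table) pairs for one simple INSERT ... SELECT."""
    target = TARGET.search(sql)
    if target is None:
        return []
    return [(source, target.group(1)) for source in SOURCES.findall(sql)]


edges = [edge for sql in STATEMENTS for edge in table_lineage(sql)]
for source, target in edges:
    print(f"{source} -> {target}")
```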

How user-friendly is it?


Even if your data lineage tool is meant for technical eyes only, you’d still want it to
be somewhat intuitive.

If business users are going to have direct access to the data lineage tool, then you
definitely want it to be intuitive. How clearly presented is the information? How hard is it to
find answers? Does it require extensive training to use and understand?

In addition, how long does setup take? Is implementing a data lineage solution going
to occupy days, weeks, or months? Ideally setup time should be measured in hours or
less. Once setup is complete, the time to full data lineage report and visualization
availability should be a few days or less.

Does it include other tools, such as data discovery or a data catalog?
When a data lineage solution is integrated with other data intelligence tools, such as data
discovery or a data catalog, it becomes even more powerful and empowering to your
users.
Data discovery enables your users to find exactly what they need within your data
landscape - and specifically within your data lineage - in a matter of seconds. It saves your
organization time and resources, is an invaluable asset in regulatory compliance situations
and enables fast, accurate, data-driven business decisions.

A data catalog imparts order to your entire data landscape when it has comprehensive data
asset entries featuring definitions, data owner and steward, user ratings and reviews,
security status and more. When this type of data catalog contains or integrates with data
lineage, it provides a one-stop-shop to get answers about your data.
Conclusion
In this eBook we have described data lineage in detail with
illustrations from several core use cases. We have seen how useful
it can be from an IT perspective and a Data Governance
perspective. In fact, the importance and value of data lineage go
well beyond what we have described as it is needed to successfully
address Data Quality (e.g. source-target reconciliation), Master
Data Management (e.g. flows into integration processes), and other
aspects of Data Governance (e.g. selecting the best source of data).
We have also seen barriers to adoption, including:

• The perception that automated data lineage is needed on an
infrequent basis
• Inertia in IT based on data lineage being impractical in the past,
and so never discussed
• A lack of understanding of the use cases due to the problems
they solve simply being ignored

Yet we have also clearly demonstrated the value of automated
data lineage. Going back to our oil refinery metaphor, no refinery
could operate without the instrumentation to understand what
is happening in it at any time. Why should we expect a complex data
environment to function efficiently and without risk if we do not even
have a map of how it is laid out? From a perspective of business
strategy, operational efficiency, and risk mitigation, this map is needed
and it is precisely what an automated data lineage tool provides.
Perhaps of all the use cases we have laid out, the most strategic one is
migrations wherein the opportunity exists to reengineer business
processes to match business objectives. The most tactically useful is
perhaps when data lineage enables a quick diagnosis of what broke an
ETL process. Yet the combination of all of the use cases we have
described provides an overwhelming justification for the acquisition of
an automated data lineage tool. So overwhelming in fact that we can
expect continued widespread adoption of these tools in 2023.
