Data Lineage in 2024 For Data Architects
DATA LINEAGE
Our modern technological civilization is full of examples that would make data
lineage seem easy by comparison. An oil refinery is a gigantic piece of
infrastructure, built to a precise specification, where the operators know
exactly what is happening with the liquid products flowing through it –
admittedly with a few very rare, and sometimes tragic, exceptions. Modern
telephone networks are another exquisite example of complex engineering,
being fully controlled from Network Operating Centers (NOCs) so that calls go
through from origin to destination without a hitch.
We can agree that data lineage is a hard problem to solve, but just what is it?
To answer this, think of an item of data being captured for the first time within
the firewalls of an organization, perhaps via data entry. The days when data
stayed put in the same silo where it was first captured are long gone.
Inevitably, the item of data will be sent to other data stores (databases or files),
and from these to yet more, until finally, our item of data ends up as a piece of
information in one or more data consumption platforms such as reports,
operational systems or even customer facing applications. As the data item
travels it may be replicated, or transformed to standardize it, or used in
calculations to generate other data elements that enrich the overall data
environment. All of this – where the data is stored, the pathways it travels, the
changes that happen to it along the way, how it becomes a constituent of
other data, and where it appears in the various data consumption platforms –
makes up data lineage.
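One way to picture this definition is as a directed graph in which each edge records where a data item came from, where it went, and what happened to it on the way. The following is a minimal sketch only; the `Flow` class, the column names, and the `change` labels are all hypothetical and not taken from any particular lineage tool.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One hop in a lineage pathway (hypothetical model for illustration)."""
    source: str   # where the data item is read from
    target: str   # where it is written to
    change: str   # what happens on the way: "copy", "standardize", "calculate"


# Hypothetical pathway from data entry to a report.
FLOWS = [
    Flow("crm.customer.birth_date", "staging.customer.birth_date", "copy"),
    Flow("staging.customer.birth_date", "dwh.dim_customer.birth_date", "standardize"),
    Flow("dwh.dim_customer.birth_date", "dwh.dim_customer.age_band", "calculate"),
    Flow("dwh.dim_customer.age_band", "report.customer_segments.age_band", "copy"),
]


def downstream(start, flows):
    """Return every hop reachable from `start`, in breadth-first order."""
    by_source = defaultdict(list)
    for flow in flows:
        by_source[flow.source].append(flow)
    seen, queue, hops = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for flow in by_source[node]:
            hops.append(flow)
            if flow.target not in seen:
                seen.add(flow.target)
                queue.append(flow.target)
    return hops


for hop in downstream("crm.customer.birth_date", FLOWS):
    print(f"{hop.source} --{hop.change}--> {hop.target}")
```

Storing the change on the edge rather than on the node keeps replication, standardization, and calculation visible as distinct steps in the pathway.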
Understanding the Different Layers of Lineage
Data lineage exists on different levels, each of which has its own particular characteristics and value.
This is of enormous help to data architects and engineers – as far as it goes. But in many cases it does
not go far enough; we need to dig deeper into a specific point in the horizontal data lineage to get
the answers we seek.
End-to-end column lineage enables you to “zoom in” on the cross-system lineage, detailing
column-to-column or column-to-report-element lineage between systems from across the data
landscape. This “vertical” lineage gives the widest view of lineage for any column, from its very first
origin, through the entire landscape, to where it is being used.
This is the type of lineage needed by data analysts trying to solve a report discrepancy or working
to understand the true scope of what exists in an environment that is to be migrated to a new
platform.
While this layer of data lineage is sufficient for many use cases, oftentimes we need to dig deeper still.
Truth be told, most organizations are still a long way off from using data lineage in business strategy.
Nevertheless, the more technical use cases for data lineage are still extremely valuable, and we will look at
a range of them now to gain a better appreciation of what data lineage is and the value it provides.
Tracking Personal Information (PI)
Interest in data privacy has exploded in the past few years, with widespread concerns about the misuse of personal
information (PI) by governments, corporations, and criminals. New laws have been enacted in several
jurisdictions, and the European Union’s GDPR explicitly calls out the need to know where PI is located in
the data landscape.
The traditional approach to the need to know where PI is located has been data profiling. This involves
examining each column in each relational database table (or other non-relational data objects) to try to infer
if it is PI or not. However, data lineage provides a better solution. If all the data flows are known and a
column at any node in a data flow pathway can be identified as PI, then every node in that pathway is
logically the same piece of PI. This makes it scalable to have analysts identify PI, since this only has to be
done once in any particular pathway. Furthermore, data lineage extends into reporting layers as well as
databases, unlike traditional data profiling. This means that if we can identify a data element in a report as
PI, which a report user should be easily able to do, then we can find PI columns across all the lineage
pathways involving this data element. An added bonus is that knowing what reports contain PI makes it
easier to govern their dissemination – both inside and outside the enterprise.
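The propagation idea described above can be sketched as a graph walk: start from one element that a person has identified as PI, then collect everything upstream and downstream of it. This is an illustration under stated assumptions, not a reference implementation. The edge list, the column names, and the function `pi_candidates` are hypothetical, and the result is best treated as a list of candidates for review, since a transformation along the way may have masked or aggregated the value.

```python
from collections import defaultdict

# Hypothetical column-level lineage as (source, target) pairs.
EDGES = [
    ("crm.customer.email", "staging.customer.email"),
    ("staging.customer.email", "dwh.dim_customer.email"),
    ("dwh.dim_customer.email", "report.contact_list.email"),
    ("erp.orders.amount", "dwh.fact_orders.amount"),
    ("dwh.fact_orders.amount", "report.revenue.total"),
]


def _reachable(start, adjacency):
    """Iterative depth-first walk; returns all nodes reachable from `start`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def pi_candidates(tagged, edges):
    """Nodes on any pathway through `tagged`: its ancestors and descendants."""
    down, up = defaultdict(set), defaultdict(set)
    for source, target in edges:
        down[source].add(target)
        up[target].add(source)
    return {tagged} | _reachable(tagged, up) | _reachable(tagged, down)


# A report user flags one report element; lineage finds the related columns.
for name in sorted(pi_candidates("report.contact_list.email", EDGES)):
    print(name)
```

Here the tag is applied once, at the report element, and the three email columns behind it are found without profiling each table separately.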
Using a data lineage tool to bring a data environment under control from the perspective of PI is a concern
of Data Governance and any Legal or Privacy function in an organization. This demonstrates how valuable
data lineage can be outside of IT.
Use Case 4
Broken ETL
This use case is often a consequence of failing to implement proper impact analysis.
Extract-Transform-and-Load (ETL) tasks move data around the organization, often reshaping, enriching, and
integrating the data as they do so. Closely related is ELT (Extract-Load-and-Transform). In ELT, data is taken
from a source, loaded into a target database, and then transformed. With ETL, there is a more traditional
approach of doing the transformation before loading the data. We shall use “ETL” to cover both approaches.
ETL jobs sometimes break in production, often as the result of some upstream change that was
not communicated. Once this happens IT is under the gun to figure out what happened and fix it.
Since it is so often the case that an upstream change has broken the ETL, this hypothesis needs to be tested
right away. Data lineage allows IT staff to trace back the pathway to the ETL job. With this, it is a
comparatively simple job to investigate what if anything has changed in this pathway and fix it.
Most importantly, data lineage can pinpoint what is broken. This means that the root cause of the problem can
be detected and analyzed. All too often, this is not done, and downstream workarounds are implemented that
further distort the overall architecture. The role of data lineage in root cause analysis and error correction
cannot be overstated.
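Testing the "upstream change" hypothesis amounts to walking the lineage backwards from the failed job and checking each object against a list of recent changes. The sketch below assumes the lineage is available as a simple mapping and that some change log exists; `READS_FROM`, `RECENTLY_CHANGED`, and `upstream_suspects` are hypothetical names used only for illustration.

```python
# Hypothetical lineage: each ETL job or table maps to the objects it reads from.
READS_FROM = {
    "etl.load_sales_mart": ["dwh.fact_sales", "dwh.dim_product"],
    "dwh.fact_sales": ["staging.sales"],
    "dwh.dim_product": ["staging.product"],
    "staging.sales": ["pos.transactions"],
    "staging.product": ["erp.product_master"],
}

# Hypothetical change log, for example from deployment or schema-change records.
RECENTLY_CHANGED = {"erp.product_master", "hr.employees"}


def upstream_suspects(broken_job, reads_from, changed):
    """Walk upstream from the broken job; map each changed object to its path."""
    suspects = {}
    visited = {broken_job}
    stack = [(broken_job, [broken_job])]
    while stack:
        node, path = stack.pop()
        if node in changed:
            suspects[node] = path
        for parent in reads_from.get(node, []):
            if parent not in visited:
                visited.add(parent)
                stack.append((parent, path + [parent]))
    return suspects


for obj, path in upstream_suspects(
    "etl.load_sales_mart", READS_FROM, RECENTLY_CHANGED
).items():
    print(obj, "reached via", " <- ".join(path))
```

Returning the path along with the suspect points at the root cause itself, rather than at the place where a downstream workaround would be applied.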
Use Case 5
Migration of Applications and Reports
Migration of applications and reports is needed for many reasons. Support can run out for hardware or software.
A new software product can become available with very attractive features. This is often the case with reporting
software, which seems to turn over on a very frequent basis. Today, however, there is a generational shift away
from on-premise data centers to the Cloud, and this is the example of migration we will focus on.
When a migration away from on-premise occurs, there is not simply a need to replicate the data structures,
processing logic, and reports in the Cloud. The flow of data also has to be replicated. This means understanding
the existing data lineage.
[…] take this approach. A long chain of periodic migrations in an enterprise that has been around for decades
heaps idiosyncrasy upon idiosyncrasy and workaround upon workaround with each migration.
The point here is that mature enterprises should use data lineage during a migration to understand their business
processes and re-engineer them. Data lineage is not just a map of how data flows – it reveals an understanding of
how the ultimate business processes have been implemented. Ideally, a mature enterprise will abstract back from
the data lineage to what these processes should be today and redesign them.
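Whether the flow of data has been replicated in a migration can be checked by comparing the lineage of the old environment with that of the new one. The following sketch assumes both lineages can be exported as (source, target) pairs and that a mapping from old object names to new ones is available. All object names and the function `compare_lineage` are hypothetical.

```python
# Hypothetical on-premise lineage.
ON_PREM = [
    ("ora.sales", "ora.sales_agg"),
    ("ora.sales_agg", "rpt.monthly_sales"),
    ("ora.customers", "rpt.monthly_sales"),
]

# Hypothetical mapping from on-premise object names to their Cloud equivalents.
RENAME = {
    "ora.sales": "cloud.sales",
    "ora.sales_agg": "cloud.sales_agg",
    "ora.customers": "cloud.customers",
    "rpt.monthly_sales": "bi.monthly_sales",
}

# Hypothetical lineage discovered in the Cloud environment so far.
CLOUD = [
    ("cloud.sales", "cloud.sales_agg"),
    ("cloud.sales_agg", "bi.monthly_sales"),
]


def compare_lineage(source_edges, target_edges, rename):
    """Compare old and new lineage after translating old names to new ones."""
    expected = {(rename.get(s, s), rename.get(t, t)) for s, t in source_edges}
    actual = set(target_edges)
    return {
        "missing": sorted(expected - actual),     # flows not yet replicated
        "unexpected": sorted(actual - expected),  # flows with no on-premise origin
    }


print(compare_lineage(ON_PREM, CLOUD, RENAME))
# {'missing': [('cloud.customers', 'bi.monthly_sales')], 'unexpected': []}
```

A flow listed as missing is not necessarily an error. It may be one of the accumulated workarounds that the migration deliberately leaves behind.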
Use Case 6
Enterprise-Wide Data Administration
In the 1990s, Data Administration became very popular, only to largely disappear in the recessions of the early
2000s. It never became a serious component of the Data Governance revolution that began in 2005, probably
because of the intensely manual aspect of Data Administration. Yet, data lineage now offers a solution to this
problem. Examples of the concerns of Data Administration that data lineage can address include:
• Continuously monitoring for tables and ETL processes that are not used in reporting, and go nowhere. This is
not just within the context of the migration projects we mentioned earlier but is a continuous activity of Data
Administration.
• Discovering and remediating datatype discrepancies that corrupt data as it flows. A target column may be
shorter than a source column, and truncate data. Or a target column may be numeric and remove
meaningful leading zeroes from a source column that is character data, but contains only numbers.
• Discovering suspicious data extracts, such as “private” files that might be used for ungoverned data
processing, or may even possibly be fraudulent.
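The first concern in the list above, tables and ETL outputs that "go nowhere", can be sketched as a reachability check: anything from which no report can be reached is a candidate for review. The edge list, the report set, and the function `dead_ends` are hypothetical. A real check would also need to count other consumers, such as operational systems, as legitimate end points.

```python
from collections import defaultdict

# Hypothetical lineage as (source, target) pairs.
EDGES = [
    ("src.orders", "dwh.orders"),
    ("dwh.orders", "report.sales"),
    ("src.orders", "tmp.orders_backup"),
    ("src.legacy", "dwh.legacy_copy"),
]
REPORTS = {"report.sales"}


def dead_ends(edges, reports):
    """Return the objects from which no report is reachable."""
    feeds = defaultdict(set)  # target -> its direct sources
    nodes = set()
    for source, target in edges:
        feeds[target].add(source)
        nodes.update((source, target))
    # Walk backwards from the reports; everything visited is in use.
    useful, stack = set(reports), list(reports)
    while stack:
        node = stack.pop()
        for source in feeds[node]:
            if source not in useful:
                useful.add(source)
                stack.append(source)
    return sorted(nodes - useful)


print(dead_ends(EDGES, REPORTS))
# ['dwh.legacy_copy', 'src.legacy', 'tmp.orders_backup']
```

Because the walk starts from the reports, the check can be rerun cheaply whenever the lineage is refreshed, which suits a continuous activity.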
There is actually a wide range of use cases that exist within Data Administration in addition to these
examples. Without data lineage it is difficult to see how these could be meaningfully addressed at the scale of
an entire enterprise data ecosystem.
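As one illustration of how such checks scale, the datatype discrepancies mentioned in the examples above can be tested on every source-to-target hop once the lineage and column metadata are known. This is a simplified sketch. The `Column` model with only a type kind and a length, the column names, and the function `find_discrepancies` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Simplified, hypothetical column metadata."""
    name: str
    kind: str     # "char" or "numeric" in this simplified model
    length: int   # declared length or precision


# Hypothetical source-to-target hops taken from column-level lineage.
FLOWS = [
    (Column("crm.customer.name", "char", 100), Column("dwh.dim_customer.name", "char", 50)),
    (Column("crm.customer.postal_code", "char", 10), Column("dwh.dim_customer.postal_code", "numeric", 10)),
    (Column("erp.orders.amount", "numeric", 12), Column("dwh.fact_orders.amount", "numeric", 12)),
]


def find_discrepancies(flows):
    """Flag hops where the target type may corrupt data from the source."""
    issues = []
    for source, target in flows:
        if source.kind == target.kind and target.length < source.length:
            issues.append((source.name, target.name, "possible truncation"))
        if source.kind == "char" and target.kind == "numeric":
            issues.append((source.name, target.name, "leading zeroes may be lost"))
    return issues


for source_name, target_name, issue in find_discrepancies(FLOWS):
    print(f"{source_name} -> {target_name}: {issue}")
```

The first hop is flagged for truncation and the second for lost leading zeroes, while the third passes both checks.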
The Role of Automated Data Lineage
At this point, we have described what data lineage is and illustrated it with
a set of use cases where it is particularly valuable. However, it is natural to
ask how data lineage actually gets done.
Traditionally, we have been left with manual efforts to trace data lineage
whenever the need arises – which is pretty frequent. These manual efforts
are costly, error-prone, and frustrating to all involved. But today, there is
an alternative – automated data lineage discovery.
The tools that enable automated data lineage address the use cases described above very effectively, for the
following reasons:
• Automated data lineage tools deal well with complexity. Even moderately sized enterprises have a huge number
of columns and ETL processes in their data landscapes.
• Automated data lineage tools are very fast. They can scan huge environments and produce data lineage
diagrams in just minutes. It would take humans anywhere from several days to several months to do the same
work. As we have seen, there are use cases where answers about data lineage are needed immediately.
IT managers and executive sponsors sometimes make the mistake of assuming that the need for data lineage is
very intermittent, as with migration projects, so why should they invest in an automated data lineage tool that
will eat up recurrent annual license fees? Sometimes they think it is better to hire consultants to document the
data lineage manually when it is needed, as a one-time cost. The use cases discussed earlier show that this is a
short-sighted view: an automated data lineage tool needs to be on hand for the use cases that are permanent in
nature, and for others where a rapid response is needed. That really is the situation for most enterprises as they
face 2023 and beyond.
Choosing the Right Data Lineage Solution for You
How do you choose the right data lineage solution for your organization from the
ones available on the market?
Let your priorities and use cases (see the above Data Lineage and Business Processes
section) be your first measuring stick with which to filter out the available tools. If your
primary use case is answering the questions of business users and auditors to assure
them of data integrity, an always up-to-date tool may be more important to you. If
your primary use case is facilitating seamless migrations, you may want to prioritize
tools that provide detailed and intuitive visualizations, enabling you to easily compare
target and source and assess the completeness of your migration.
If you are coping with an awkward proprietary system, you may
be forced to find a technology-agnostic solution. What such solutions
gain in flexibility, however, they lose in accuracy.
And if the solution doesn't support all of your technologies out-of-the-box, there should still
be an option to enrich the automated lineage with additional relationships between data
assets to make sure to cover all of your ‘home grown’ applications and processes as well.
Your answer will inform the level of data lineage tool that you need. If you use only one
on-premises legacy system, you may only need a simple data lineage tool. On the other
hand, if your data passes through multiple systems while flying to the cloud and back, you
need data lineage that can follow your data wherever it goes.
What dimensions of data lineage does it show?
Think about what dimensions of data intelligence you need to access (see the
Understanding the Different Layers of Lineage section), and compare that to the
functionality of the data lineage system that you’re evaluating.
Most data lineage solutions will offer some form of cross-system (“horizontal”) lineage.
Fewer will include end-to-end column (one aspect of “vertical”) lineage. Fewer still
provide inner-system lineage.
Assess your use cases and the dimensions of data lineage you will need and seek out a
data lineage solution that will provide them.
If business users are going to have direct access to the data lineage tool, then you
definitely want it to be intuitive. How clearly presented is the information? How hard is it to
find answers? Does it require extensive training to use and understand?
In addition, how long does setup take? Is implementing a data lineage solution going
to occupy days, weeks, or months? Ideally setup time should be measured in hours or
less. Once setup is complete, the time to full data lineage report and visualization
availability should be a few days or less.
A data catalog imparts order to your entire data landscape when it has comprehensive data
asset entries featuring definitions, data owner and steward, user ratings and reviews,
security status and more. When this type of data catalog contains or integrates with data
lineage, it provides a one-stop-shop to get answers about your data.
Conclusion
In this eBook we have described data lineage in detail with
illustrations from several core use cases. We have seen how useful
it can be from an IT perspective and a Data Governance
perspective. In fact, the importance and value of data lineage go
well beyond what we have described, as it is needed to successfully
address Data Quality (e.g. source-target reconciliation), Master
Data Management (e.g. flows into integration processes), and other
aspects of Data Governance (e.g. selecting the best source of data).
We have also seen barriers to adoption, including: