
TALLINN UNIVERSITY OF TECHNOLOGY
DOCTORAL THESIS
15/2018

Semantic Data Lineage and Impact Analysis of Data Warehouse Workflows

KALLE TOMINGAS
TALLINN UNIVERSITY OF TECHNOLOGY
School of Information Technologies
Department of Software Science

This dissertation was accepted for the defence of the degree of Doctor of Philosophy in Computer Science on April 20, 2018.

Supervisor: Professor Tanel Tammet


Department of Software Science
Tallinn University of Technology
Tallinn, Estonia

Opponents: Professor Alexandra Poulovassilis


Department of Computer Science and Information Systems
Birkbeck University of London
U.K.

Ph.D. Peeter Laud


Research Director
Cybernetica AS
Estonia

Defence of the thesis: May 21, 2018, Tallinn

Declaration:
Hereby I declare that this doctoral thesis, my original investigation and
achievement, submitted for the doctoral degree at Tallinn University of
Technology, has not been submitted for any academic degree.

Copyright: Kalle Tomingas, 2018


ISSN 2585-6898 (publication)
ISBN 978-9949-83-238-5 (publication)
ISSN 2585-6901 (PDF)
ISBN 978-9949-83-239-2 (PDF)
TALLINNA TEHNIKAÜLIKOOL 
DOKTORITÖÖ 
15/2018 

Semantiline andmevoogude‐ ja mõjuanalüüs 
andmelao keskkonnas 

KALLE  TOMINGAS 

 
Table of Contents
ABSTRACT ........................................................................................................ 7
ACKNOWLEDGEMENTS ................................................................................. 8
LIST OF PUBLICATIONS ................................................................................. 9
OTHER RELATED PUBLICATIONS ............................................................... 9
AUTHOR’S CONTRIBUTION TO THE PUBLICATIONS ........................... 10
Abbreviations ..................................................................................................... 11
Terms ................................................................................................................. 12
List of Figures .................................................................................................... 14
INTRODUCTION ............................................................................................. 15
Motivation and the Problem Statement.......................................................... 16
Contribution of the Thesis ............................................................................. 18
Organization of the Thesis ............................................................................. 19
1. DATA LINEAGE ...................................................................................... 21
1.1. Overview of Data Lineage and Provenance ..................................... 21
1.2. A Motivating Example ..................................................................... 23
1.3. Summary .......................................................................................... 26
2. RELATED WORK .................................................................................... 27
2.1. Summary .......................................................................................... 30
3. ALGORITHMS AND METHODS ........................................................... 31
3.1. Overall Architecture and Methodology ............................................ 31
3.2. Metadata Database ........................................................................... 32
3.3. Design of Metadata Models and Mappings ...................................... 34
3.4. Data Capture, Store and Processing with Scanners .......................... 34
3.5. Query Parsing and Metadata Extraction ........................................... 35
3.6. Data Transformation Weight Calculation ........................................ 38
3.7. Rule System and Dependency Calculation....................................... 40
3.8. Semantic Layer Calculation ............................................................. 42
3.9. Summary .......................................................................................... 44
4. IMPLEMENTATION AND APPLICATIONS ......................................... 45
4.1. dLineage.com ................................................................................... 45
4.2. Performance Evaluation ................................................................... 49
4.3. Visualization..................................................................................... 52
4.4. Proposed Novel Applications ........................................................... 55
4.5. Summary .......................................................................................... 58
CONCLUSIONS ............................................................................................... 59
REFERENCES .................................................................................................. 61
KOKKUVÕTE .................................................................................................. 67
Publication A ..................................................................................................... 69
Publication B ..................................................................................................... 83
Publication C ..................................................................................................... 95
Publication D ................................................................................................... 113
CURRICULUM VITAE .................................................................................. 125
ELULOOKIRJELDUS .................................................................................... 126

ABSTRACT
The subject of the thesis is data flow in data warehouses. Data warehousing is
a complex process of collecting data, cleansing and transforming it into
information and knowledge to support strategic and tactical business decisions in
organizations. Our goal is to develop a new way to automatically solve a
significant class of existing management and analysis problems in a corporate
data warehouse environment.
We will present and validate a method and an underlying set of languages,
data structures and algorithms to calculate, categorize and visualize component
dependencies, data lineage and business semantics from the database structure
and a large set of associated procedures and queries, independently of actual data
in the data warehouse.
Our approach is based on scanning, mapping, modelling and analysing
metadata of existing systems without accessing the contents of the database or
impacting the behaviour of the data processing system. This requires collecting
metadata from structures, queries, programs and reports from the existing
environments.
We have designed a domain-specific language XDTL for specifying data
transformations between different data formats, locations and storage
mechanisms. XDTL scripts guide the work of database schema and query
scanners.
We will present a flexible and dynamic database structure to store various
metadata sources and implement a web-based analytical application stack for the
delivery and visualization of analysis tools for various user groups with different
needs.
The core of the designed method relies on semantic techniques, probabilistic
weight calculation and estimation of the impact of data in queries. We develop a
method to estimate the impact factor of input variables in SQL statements. We
will present a rule system supporting the efficient calculation of the query
dependencies using these estimates.
We will show how to use the results of the conducted analysis to categorize,
aggregate and visualize the dependencies to address various planning and
decision support problems.
The methods and algorithms presented in the thesis have been implemented
and tested in different data warehouse analysis and visualization tasks for tens of
large international organizations. Some of these systems contain over a hundred
thousand database objects and over a million ETL objects, producing data lineage
graphs with more than a hundred thousand nodes. The analysis of the system
performance over real-life datasets of various sizes and structures presented in
the last chapter demonstrates linear performance scaling and the practical
capacity to handle very large datasets.

ACKNOWLEDGEMENTS
First, I would like to warmly thank my supervisor, Prof. Tanel Tammet, for
the motivation, encouragement and guidance through all these years, as well as
the patience and support during all stages of the scientific process and practical work.
I would like to thank all those people who have contributed to the process
leading to the completion of the given work. I thank Margus Kliimask and other
colleagues from Mindworks Industries for a creative and productive environment,
for wonderful ideas and hard work. I would like to thank my colleagues from the
Eliko Competence Centre and my fellow doctoral students from Tallinn
University of Technology and Graz University of Technology.
Finally, I thank my family members and my friends who have been supportive
and have been with me during the long journey of my doctoral studies.

LIST OF PUBLICATIONS
The work of this thesis is based on the following publications:

A Tomingas, K.; Kliimask, M.; Tammet, T. Data Integration Patterns for Data Warehouse Automation. In: New Trends in Database and Information Systems II: 18th East European Conference on Advances in Databases and Information Systems (ADBIS 2014). Springer, 2014.
B Tomingas, K.; Tammet, T.; Kliimask, M. Rule-Based Impact Analysis
for Enterprise Business Intelligence. In: Artificial Intelligence
Applications and Innovations (AIAI 2014), IFIP Advances in
Information and Communication Technology. Springer, 2014.
C Tomingas, K.; Tammet, T.; Kliimask, M.; Järv, P. Automating
Component Dependency Analysis for Enterprise Business Intelligence.
In: 2014 International Conference on Information Systems (ICIS 2014).
D Tomingas, K.; Järv, P; Tammet, T. Discovering Data Lineage from Data
Warehouse Procedures. In: 8th International Joint Conference on
Knowledge Discovery and Information Retrieval (KDIR 2016).

OTHER RELATED PUBLICATIONS


E Tomingas, Kalle; Järv, Priit; Tammet, Tanel (2017). Computing Data Lineage and Business Semantics for Data Warehouse. Accepted for publication in: Lecture Notes in Communications in Computer and Information Science (CCIS), Springer.
F Tomingas, Kalle; Kliimask, Margus; Tammet, Tanel (2014). Mappings,
Rules and Patterns in Template Based ETL Construction. In: TUT
Research Report Series: The 11th International Baltic Conference on DB
and IS, DB&IS2014, Tallinn, Estonia. Tallinn, Estonia.
G Tammet, T.; Tomingas, K.; Luts, M. (2010). Semantic Interoperability
Framework for Estonian Public Sector's eServices Integration. In:
Proceedings of the 11th European Conference on Knowledge
Management: Universidade Lusíada de Vila Nova de Famalicão
Portugal: 2-3 September 2010, 2: 11th European Conference on
Knowledge Management - ECKM 2010, Portugal, 2-3 September 2010.
Ed. Eduardo Tomé. Academic Publishing Limited, 988−995.
H Tomingas, Kalle; Luts, Martin (2010). Semantic Interoperability
Framework for Estonian Public Sector’s E-Services Integration. In:
Ontology Repositories and Editors for the Semantic Web: Proceedings
of the 1st Workshop on Ontology Repositories and Editors for the
Semantic Web, Hersonissos, Crete, Greece, May 31st, 2010. (CEUR
Workshop Proceedings; 596).

AUTHOR’S CONTRIBUTION TO THE
PUBLICATIONS
A The author's contribution to paper A started with the research problem and methodology setup. It covers model and software development, testing and experimentation, conducting the analysis and writing most of the text.
B The author was one of the main contributors and writers of the paper B.
The most important part of the work was the development of the
methodology to solve data lineage and impact problems based on
technologies described in the previous paper A. Additional technical
work, like building models, development of software, testing, and
analysing the results, were part of feasibility studies and adjustment of
the methodology.
C The author was one of the main writers, continuing the development of
the methodology and the rule system started in the paper B. The main
content of paper B is re-published in the paper C with more additional
details, examples and visualizations. The new details, practical data
processing and visualizations were the main tasks and the focus of the
paper C that was published in the field of information systems rather than
computer science.
D The author was one of the main contributors of the paper D. The main
tasks and the results of the work were: new formalizations for the data
processing rule system, development of the new prototype, performance
measurements and new visualization techniques.
E The author was one of the main contributors of the paper E. Again, the
main tasks and the results of the work were new formalizations for the
data processing rule system, development of the new prototype,
performance measurements and new visualization techniques, plus a new
business semantics model development.
F The paper F was written as a short and initial version of the paper B that
was published at the DB&IS2014 conference in Tallinn and presented in
the poster session by the author.
G The author was a member of the semantic assets management
development project of the Estonian Information Systems Authority. The
paper G concludes the project and presents the ideas of the
interoperability framework development. The author was one of the main
writers of the paper.
H The paper H is a short initial version of the paper G that was presented at
the Workshop on Ontology Repositories in the Extended Semantic Web
Conference by the author.

Abbreviations
API Application Programming Interface
BI Business Intelligence
DBMS Database Management System
DDL Data Definition Language
DI Data Integration
DL Data Lineage
DML Data Manipulation Language
DSS Decision Support Systems
DW Data Warehouse
EBNF Extended Backus-Naur Form
EDW Enterprise Data Warehouse
EAV Entity Attribute Value
ETL Extract, Transform, Load
ELT Extract, Load, Transform
ER Entity–Relationship
IA Impact Analysis
IT Information Technology
ODS Operational Data Store
OLTP On-Line Transaction Processing
OLAP On-Line Analytical Processing
RDF Resource Description Framework
SQL Structured Query Language
XDTL eXtensible Data Transformation Language
XML eXtensible Markup Language

Terms1
Data Warehouse
A data warehouse (DW) is a collection of corporate information and data derived
from operational systems and external data sources. DW is designed to support
business decisions by allowing data consolidation, analysis and reporting at
different aggregate levels. Data is populated into the DW through the processes
of data integration or extraction, transformation and loading.

Data Lineage
Data lineage is generally defined as a kind of data life cycle that includes the
data's origins and where it moves over time. This term can also describe what
happens to data as it goes through diverse processes. Data lineage can help with
efforts to analyze how information is used and to track key bits of information
that serve a particular purpose (see also: Data Provenance).

Data Integration
Data integration (DI) is a process in which heterogeneous data is retrieved and
combined as an incorporated form and structure. Data integration allows different
data types (such as data sets, documents and tables) to be merged by users,
organizations and applications, for use as personal or business processes and/or
functions (see also: Extract-Transform-Load).

Data Provenance2
Data Provenance provides a historical record of the data and its origins. The
provenance of data which is generated by complex transformations such as
workflows is of considerable value to scientists. Provenance is also essential to
the business domain where it can be used to drill down to the source of data in a
data warehouse, track the creation of intellectual property, and provide an audit
trail for regulatory purposes (see also: Data Lineage).

Enterprise Data Warehouse


An enterprise data warehouse (EDW) is a unified database that holds all the business information of an organization and makes it accessible all across the company.

Extract-Transform-Load
Extract, transform, load (ETL) is the process of extraction, transformation and loading during database use, particularly during data storage use.

Impact Analysis3

1 https://www.techopedia.com/
2 https://en.wikipedia.org/wiki/Data_lineage
3 https://en.wikipedia.org/wiki/Change_impact_analysis
Change impact analysis (IA) is defined as "identifying the potential consequences of a change, or estimating what needs to be modified to accomplish a change", with a focus on scoping changes within the details of a design.

Dependency Graph4
A dependency graph is a directed graph representing dependencies of several objects towards each other. From the dependency graph, it is possible to derive an evaluation order, or the absence of an evaluation order, that respects the given dependencies.

4 https://en.wikipedia.org/wiki/Dependency_graph
List of Figures
Figure 0.1 A general scheme of a Data Warehouse process and data flows...... 16
Figure 0.2 Real life Data Warehouse data flows from tables and views (left and
middle with blue) to reports (right side with red). .................................... 18
Figure 1.1 DW data transformation flows in table, job and query levels. ......... 25
Figure 1.2 DW data transformation flows in table, column and query component
levels. ........................................................................................................ 25
Figure 3.1 Methodology and system architecture components......................... 32
Figure 3.2 Metadata database physical schema tables. ..................................... 33
Figure 3.3 Visual representation of data lineage graph inference rule R1 ........ 40
Figure 3.4 Visual representation of data impact graph inference rule R2 . ....... 41
Figure 3.5 Visual representation of data lineage and impact graph inference rule
R3. .............................................................................................................. 41
Figure 3.6 Semantic layer illustration for two independent data flows based on
overlapping query conditions. ................................................................... 43
Figure 4.1 Data lineage visualization example in DW environment using Sankey
diagram...................................................................................................... 46
Figure 4.2 dLineage sub-graph table view, source and target objects with
calculated metrics. ..................................................................................... 47
Figure 4.3 dLineage sub-graph graphical view, selected object with all connected
targets. ....................................................................................................... 48
Figure 4.4 dLineage sub-graph graphical view, selected object with all connected
sources....................................................................................................... 48
Figure 4.5 dLineage dashboard has aggregated overview about collected
metadata and calculated results and metrics. ............................................ 49
Figure 4.6 Datasets size and structure compared to overall processing time. ... 50
Figure 4.7 Calculated graph size and structure compared to graph data processing
time............................................................................................................ 51
Figure 4.8 Dataset processing time with two main subcomponents. ....... 51
Figure 4.9 Dataset size and processing time correlation with linear regression
(semi-log scale). ........................................................................................ 52
Figure 4.10 Data flows (blue,red) and control flows (green,yellow) between
DW tables, views and reports. .............................................................. 53
Figure 4.11 Data flows between DW tables, views (blue) and reports (red). ... 53
Figure 4.12 Control flows in scripts, queries (green) and reporting queries
(yellow) are connecting DW tables, views and reports. ............................ 54
Figure 4.13 Data Warehouse loading packages plot with number of data sources
and targets (axis), loading complexity (size) and relative cost (color). .... 55
Figure 4.14 Data Warehouse tables plot with number of data sources and targets
(axis), loading complexity (size) and relative cost (color). ....................... 55

INTRODUCTION
The amount of available data is growing rapidly in many domains and areas
of human activity. Traditional and Internet businesses, social media, healthcare
and science are a few examples of the fields where accumulated data and
processed information can change the scale and the state of those businesses. The
development of the internet, connected information systems, social media, new
scientific equipment and the rising Internet of Things (IoT) has brought us to the
big data era and scale where traditional data processing technologies and methods
do not function, do not perform or simply stop working [1].
There are many reasons why we may want to understand the internal structure and functions of complex data processing systems like data warehouses. Some of the reasons are related to the need to improve system functions, performance or quality and the ability to evaluate them. Others are related to controlling and managing the system effectively and avoiding unwanted or unpredictable behavior of the system. Data warehouse systems collect data from various distributed and heterogeneous data sources, integrating detailed or summarized information in a local database for further processing and analysis for various applications and purposes. Data warehouses are living, continuously developed, enriched and updated systems with variable load, performance and growing data volumes. Data transformation chains can be very long and the complexity of structural changes can be high. Tracing long and complex data flows or the dependencies of data transformation components is a serious task without special supporting metadata and tools. Tracing data items back from the final reports or applications to the source items and structures is the data lineage problem. Traceability of internal component dependencies is critical when developing and changing system software or configuration and can be defined as the problem of impact analysis. Data lineage allows for tracing the internal functional relations of data processing systems and gives insight into the data flows for a better understanding of what the system does. Impact analysis allows for tracing the internal component structures and formal relations of the system and gives an understanding of how the system is built from interconnected components.
In this thesis, we address the data lineage and impact analysis problems in a generalized and multidisciplinary way, so that the same methods and approaches can be used in data warehouses and other decision support, data processing, enterprise integration or service-oriented systems. Our goal is to implement a methodology, algorithms, representations, an architecture and applications that have a relatively small set of functions for specialized tasks, designed to perform and automate complex analytical tasks. The final system design has to be modular, flexible and robust, but also scalable and efficient, to easily adapt to the heterogeneous environments of real-life data processing systems.
The chosen approach combines techniques from multiple fields of information
technology and computer science, like metadata capture and loading, unified and
open-schema data storing, grammar-based program parsing and resolving,
probabilistic semantic interpretation of data transformations and rule-based
reasoning, graph-based dependency calculations, data and component flow graph
visualization, etc.
Motivation and the Problem Statement
Data warehousing (DW) is a complex process of collecting data, cleansing
and transforming it into information and knowledge to support strategic and
tactical business decisions in organizations. DW is designed as a rapidly growing,
subject-oriented, integrated, time-variant and non-volatile collection of data from
heterogeneous data sources, with various connected applications, query engines,
fixed or open reporting and analytical tools (see Figure 0.1). Data sources can be
volatile and data can be structured (e.g., databases, XML files), semi-structured (e.g., log files, emails) or unstructured (e.g., text documents). Data consumers
from different domains with various interests (e.g., management information,
accounting, customer relationship, sales and marketing, resource planning,
forecasting, regulatory reporting, etc.) may have a broad spectrum of
requirements and service level quality. The process of source data integration is
called Extract, Transform, Load (ETL), and has a specific set of specialized tools
for data capturing and processing tasks. The processed and stored data consuming
process is called Business Intelligence (BI) and has its own set of tools for
reporting, ad-hoc querying, data mining, dashboards and other types of analytics.
ETL and BI are not independent components: ETL and data requirements are
driven by business needs and BI capabilities are limited by the collected and
integrated data.

Figure 0.1 A general scheme of a Data Warehouse process and data flows.

To make reasonable and informed business decisions, we need appropriate data and metadata about context, structure, requirements, processing and timing.
Answering questions about used data sources, formulas, structures and freshness
of data in analytical systems or reports is challenging and not trivial. Components
of data warehouses are distributed over multiple physical locations and a diverse
set of software tools, and therefore tracing complex data processing metadata is
more complicated compared to using processed data. When the produced data
and information is the desired and emergent result of a DW system, then the
processing metadata is often hidden and captured into internal structures,
relations and programming code of separate components of the data processing
system. Emerging results, behavior and functions of such a complex system
depend on the subsystems and interconnections (formal and functional) of the
system’s components. To control, manage or predict the behavior of the system,
we must review the elements and the relationships between the components on a
detailed level. Large data warehouse systems can have hundreds of thousands of
tables/views and millions of columns with tens of millions of estimated
dependencies between those components.

We call networks of all dependencies over data warehouse system components
Enterprise Dependency Graphs (EDG) and we handle functional and structural
dependencies as directed graph edges between component nodes. The problem of
data lineage (DL) is seen as a data flow sub-graph construction, calculation and
navigation between static data structure components (e.g., tables, views, columns,
files, reports, fields, etc.). The problem of component impact analysis (IA) is seen
as a sub-graph calculation and navigation between active data transformation
components (e.g., ETL tasks and mappings, SQL scripts and queries, DB
procedures, reporting queries and components, etc.) and passive data structures.
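To make this graph representation concrete, one simple relational encoding of EDG edges is sketched below. The table and column names are purely illustrative (the metadata model actually used in this work is described in section 3.2), and the weight column anticipates the probabilistic dependency weights introduced in sections 3.6 and 3.7:

CREATE TABLE edg_edge (
    source_node   VARCHAR(256),  -- a column, table, view, file, report field, ...
    target_node   VARCHAR(256),
    via_component VARCHAR(256),  -- the ETL mapping, SQL query or report query creating the dependency
    weight        NUMERIC(5,4)   -- probabilistic dependency weight (sections 3.6 and 3.7)
);

-- With this encoding, a data lineage question becomes a target-to-source
-- sub-graph query and an impact analysis question a source-to-target one.
SELECT source_node, via_component, weight
FROM edg_edge
WHERE target_node = 'SOME_REPORT.SOME_FIELD';  -- placeholder node name
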
Data warehouse owners and users are facing various data lineage and impact
analysis problems because the chains of data transformations are often very long
with complex changes of data structures. More than a dozen staging steps in a sequence is not a rare case when the transformation steps are generated by the
supporting ETL tools. The data models that are designed for OLTP systems are
not usually suitable for OLAP systems. Denormalization, aggregation, and new
fact inference are some of the practical techniques that require new or changed
data structures and new processes to perform the tasks. The management of such
a complex integration process is unpredictable, and the cost is uncontrollable due
to the lack of information about data flows and internal relations of system
components. The consequences can include unmanageable complexity,
fragmentary knowledge, a large amount of technical work, unpredictable results,
wrong estimations, rigid administrative and development processes, high cost,
lack of flexibility, quality and trust. These risks are related to the ability to answer
the following questions about data lineage and impact analysis problems:
- How can the origins of data elements, structures and transformation formulas be traced?
- How are the data elements of a specific column, table, view or report used?
- When was the data loaded, updated or calculated in a specific column, table, view or report?
- Which loadings, structures, components and reports are impacted when other components are changed?
- Which data, structure or report is used by whom and when?
- What is the time and cost of making changes in programs or data structures?
- What will break when we change a program or data structure?
- Who is responsible for a data structure, program or formula?
The ability to support and automate answering such day-to-day questions
determines the benefits, cost, flexibility and manageability of the system. The
dynamics in business, the environment and the requirements ensure that regular
changes in data management are required for every living organization. Due to
its reflective nature, business intelligence is often the most fluid and unsteady
part of enterprise information systems. The most promising way to tackle the
challenges in such a rapidly growing, changing and complex field is automation.
Efficient automation in this particular field requires techniques from multiple
areas of computer science: computer language and semantic technologies, a
combination of rule systems and reasoning. Our goal is to aid users with

intelligent tools that can reduce the time required for several difficult tasks from
weeks to minutes, with higher quality results and smaller costs.
As an example showcasing the complexity, Figure 0.2 presents a real-life data flow graph captured and visualized with the methods and tools we introduce in this thesis. The underlying graph structure, rules and algorithms form the basis for
understanding and automation of complex analysis tasks.

Figure 0.2 Real life Data Warehouse data flows from tables and views (left and middle
with blue) to reports (right side with red).

Contribution of the Thesis


The thesis presents a full stack of methods, technologies and algorithms which
give analysts a novel way to efficiently solve several existing management and
analysis problems in a corporate data warehouse environment.
The work presented lies in the domain of software and knowledge engineering
and is based on experimentation with different real-life datasets. The feasibility
and usefulness of the results to analysts are validated by practical application on
data warehouses of actual large international companies in the financial, utilities,
governance, telecom and healthcare sectors. In particular, table 4.1 presents the
performance analysis on six large datasets.

The main components of the contribution are:

- A new formalized mapping representation for specifying data transformations between different data formats, locations and storage mechanisms.
- An EAV-style open data model for storing meta-information, ontologies and dependencies of the investigated information system, along with a corresponding graph-based internal representation.
- Algorithms estimating the impact factor of input variables in SQL statements.
- A method for computing the transitive closure of probabilistic dependency chains.
- Data lineage and component dependence visualization methodology.
- Experiments demonstrating the feasibility of the method on large information systems of real companies.
- Analysis and proposals for new ways to apply the lineage analysis to practical problems of finding critical software components, estimating development time, generating documentation and compliance reports.

We describe the underlying technology and abstract mapping concept in our paper A, which forms the foundation for dependency graph representation of data
flows and structures (sections 3.2 to 3.4). We draw the methodology framework,
system architecture (section 3.1) and define the formal rule system for weighted
graph calculation in paper B (sections 3.6 and 3.7). We then extend our rule
system with in-memory data structures, illustrate the algorithms with examples
and present real-life applications in paper C (section 3.8 and chapter 4). Finally,
we present formal definitions and algorithms for graph models and calculations
to support semantic data lineage and impact analysis applications (section 3.8),
and we present the performance analysis over different real-life datasets in paper
D (section 4.3).

The core technologies that are named and used in this thesis and the
underlying papers are referenced to their origins in the footnotes. Some of them
are closely related to the contribution of the thesis and therefore require additional
explanation. The XDTL5 language and runtime engine are technologies of
Mindworks Industries OÜ6, designed and built by several people inside and
outside of the company (including the author of this thesis). The dLineage7 technology was initially built by the author of the thesis together with my colleague Margus Kliimask, and XDTL is used as one of the core components of the toolkit. The later development of the modern UI and new features was done with my colleagues from Mindworks Industries.

Organization of the Thesis


The thesis starts with the general introduction and the summary of the
contribution.

5 http://www.xdtl.org/
6 http://www.mindworks.ee/
7 http://www.dlineage.com/
The first chapter of the thesis presents an overview of the data lineage and
impact analysis fields in data warehousing systems. We will give a simplified
example of the problems to be solved. The methodology chapter illustrates our
approach to the problems. The first chapter gives a background to the problems
that are common for all published papers A to D.
The second chapter gives an overview of the related work in the fields of data
lineage, provenance and impact analysis. The focus of the related work chapter is on the fields of data lineage and data provenance, along with other applications in these fields, and the chapter draws a wider context for papers B, C and D.
The third chapter of the thesis focuses on the algorithms developed along with
the design and the details of our system architecture. We will give detailed
presentations and will describe the considerations, options and reasons behind
our choices. We will draw a picture of the data model and the basic building
blocks with key figures and components that are introduced and used in published
papers A to D.
The fourth chapter of the thesis focuses on the details and requirements of our
system implementation and the practical case studies in different industries. We
will also present new potential application areas. The chapter extends the case
studies and the visualizations topics that were introduced in the paper D.
The conclusions chapter summarizes the advantages of our data lineage
architecture and system, our contributions and gives suggestions for future work
on the topic.
The rest of the thesis consists of the four selected publications from the full list of eight.

1. DATA LINEAGE
This chapter presents a detailed introduction to data lineage and provenance
problems, starting with an overview in section 1.1. We continue with an example
in section 1.2, with a query example and mapping representation that forms the
interconnected data flow graph. We use the same examples in subsequent
chapters to illustrate different data lineage or impact problems, keeping a connection with the different parts of the thesis.

1.1. Overview of Data Lineage and Provenance


Data lineage, data provenance and pedigree are overlapping terms used to describe tracing the origin sources and derivation of data. The term provenance, used in the scientific community, is synonymous with the term lineage used in the database community. Sometimes provenance is also referred to as source
attribution or source tagging. Data lineage is a common key component for many
different application domains and is also the subject of studies in the field of
Computer Science or Data Science. Many business and scientific domains, like
scientific data management, big data, machine learning, data warehousing or
business intelligence, need provenance or lineage metadata on the origin, rules,
transformation, derivation, history, timing, context and background of the used
and processed data. Authenticity, integrity, correctness and trustworthiness of
information are common requirements for different domains that can be
established with effective tracing of data lineage. From scientific and business
perspectives, data sets are not very useful without knowing the exact sources,
processing methods and rules of derived data sets [2].
Data warehouses [3] and curated databases [4] are typical examples where
lineage information is essential. In both cases, comprehensive and often
manual effort is usually expended in the construction of the resulting database —
in the former, in specifying the ETL process, and in the latter, in incrementally
adding and updating the database. Data lineage adds value to the data by
explaining how it was obtained. It is important to understand the lineage of data
in the resulting database to check the correctness of an ETL specification or assess
the quality and trustworthiness of the collected data [5].
There are two levels of granularity in lineage described in previous works:
workflow or coarse-grained provenance and data or fine-grained provenance [6].
The coarse-grained workflow lineage describes the data processing components,
tasks and programs as a sequence of steps to capture and present general
transformations between data sources and targets without specific details. The
number of steps and the level of detail can vary between hardware and software
platforms and components to transformation programs and sub-components.
Fine-grained data lineage describes detailed information and derivation of data
items, like data structures, columns, tuples or rows, and represents it as a sequence
of transformation steps to trace from sources to targets or vice versa.
Both detail and granularity levels can be seen in combination with up to three
types of lineages to answer different questions [7]:
- Why lineage refers to the context of data transformations and provides justification for input data elements appearing in the output. Why lineage answers questions like how some parts of the input data influenced the output data.
- How lineage refers to the transformations of the source data elements and answers questions like how the inputs were manipulated to produce a given output.
- Where lineage refers to the locations of the data sources and structures from which the data was extracted and answers questions like where the data comes from or which inputs were used for a given output.

These three notions of why, how and where provenance are used as
independent or combined approaches to the data lineage solutions in databases.
The previous works that follow and cover these categories are analyzed by
Cheney et al. [5] and Tan [6], but there are also works that do not fit neatly into
the why, where and how provenance framework. Such works include Wang et
al.’s Polygen model [8], Cui et al.’s lineage tracing [9], Widom’s Trio system
[10] or Woodruff and Stonebraker’s work on lineage [11] [5].

To illustrate different lineage types, consider the following simple data loading SQL query from the source table Account (Nbr, Type, State) to the target table Agreement (Agreement_Nbr, Agreement_Type, Agreement_State):
INSERT INTO AGREEMENT (Agreement_Nbr, Agreement_Type, Agreement_State)
SELECT Nbr, Type, Coalesce(State,0)
FROM ACCOUNT
WHERE Type = 'A'
AND End_Date is not null

The Where lineage for every target table column (Agreement_Nbr, Agreement_Type, Agreement_State) describes where the data comes from and corresponds to the select list columns (Account.Nbr, Account.Type, Account.State) in the SQL query. The How lineage for each target column (Agreement_Nbr, Agreement_Type, Agreement_State) describes the column data transformation logic and expressions over each source column (copyOf(Nbr), copyOf(Type), Coalesce(State,0)) in the SQL query. The Why lineage for each target column comes from the conditions part present in the where (or join) clause of the SQL query and describes the context of the data transformation, like the two predicates here: Account.Type = 'A' and Account.End_Date is not null.
A generic data transformation can be defined as a set of functions Tr(tr_1..tr_n) over source datasets S_1(s_1.1..s_1.m) to S_n(s_n.1..s_n.m) that produces the target or output dataset T(t_1..t_n) in the context C(S_1..S_n): T = Tr(S_1..S_n, C(S_1..S_n)). The general data lineage of the target dataset T is defined as a lineage function L: L(T) = (S_1..S_n), and the specific where, how and why properties by the functions L_where(T) = (S_1..S_n), L_how(T) = Tr(S_1..S_n) and L_why(T) = C(S_1..S_n). The lineage properties of the column Agreement_State in the previous example SQL query can be described as follows:
- L_where(Agreement.Agreement_State) = Account.State
- L_how(Agreement.Agreement_State) = Coalesce(Account.State, 0)
- L_why(Agreement.Agreement_State) = Account.Type = 'A' and Account.End_Date is not null
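These three properties map naturally onto relational metadata records. The sketch below is a minimal illustration only; the table and column names are invented for this example, and the metadata schema actually used in our system is described in section 3.2:

-- One row per (target column, source column) pair captured from a parsed query.
CREATE TABLE column_lineage (
    target_column  VARCHAR(128),
    source_column  VARCHAR(128),   -- where-lineage
    transform_expr VARCHAR(4000),  -- how-lineage
    filter_context VARCHAR(4000)   -- why-lineage
);

INSERT INTO column_lineage VALUES (
    'AGREEMENT.Agreement_State',
    'ACCOUNT.State',
    'Coalesce(Account.State, 0)',
    'Account.Type = ''A'' AND Account.End_Date IS NOT NULL');

-- Where does Agreement_State come from, how is it derived, and under which conditions?
SELECT source_column, transform_expr, filter_context
FROM column_lineage
WHERE target_column = 'AGREEMENT.Agreement_State';
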

The previous research on data lineage and provenance has generally been based on one of two computational approaches:
- The non-annotation approach, which assumes the execution of a set of transformation functions against the source or input dataset to generate the output dataset, in order to compute the data- or row-level lineage of the transformation and the target dataset; and
- The annotation approach, which carries additional information through the transformation into the target dataset; this requires modifications of the initial transformation functions and extra space for maintaining the additional data. The analysis of the additional data allows for computation of the data- or row-level lineage without access to the input dataset.

In this thesis, we focus mainly on the data lineage problem and practical
solutions in database environments and use the data lineage term instead of
provenance. We have chosen the non-annotation approach to the data lineage
problem to support a fast start and no impact on the working systems. We also take advantage of data structure and transformation metadata, capture query semantics, and make probabilistic score calculations and logic-based inferences about the input or output data, without any need for or access to the real data (i.e., only metadata is used).

1.2. A Motivating Example


As an example of the data lineage and data impact problems of a financial industry data warehouse, we have constructed a data loading and transformation scenario with four SQL queries and four source tables. The data from the ACCOUNT and LOAN tables is consolidated into one unified AGREEMENT table; we then join the BALANCE table and populate two new tables, DEPOSIT_SUMMARY and LOAN_SUMMARY, with denormalized data for further querying and reporting. Table 1.1 below presents the four SQL DML queries from two different but dependent data loading jobs. Job 1 is responsible for loading data into the DW and Job 2 is responsible for manipulation and denormalization of the loaded data.
Table 1.1 Data transformation SQL query examples used in DW loading jobs.

SQL Query 1 from Job 1


INSERT INTO AGREEMENT (Agreement_Nbr, Agreement_Type, Agreement_State)
SELECT T1.Account_Nbr, T1.Type, T1.State_Code
FROM ACCOUNT T1
JOIN ACCOUNT_STATE T2 ON T2.Code = T1.State_Code
WHERE T2.State = 'Active'
AND T1.Type = 'A'

SQL Query 2 from Job 2

INSERT INTO DEPOSIT_SUMMARY (Period_Date, Agreement_Nbr, Agreement_State,
Balance_Amt)
SELECT T3.Balance_Date, T4.Agreement_Nbr, T4.Agreement_State, T3.Balance_Amt
FROM AGREEMENT T4
JOIN BALANCE T3 ON T4.Agreement_Nbr = T3.Agreement_Nbr
WHERE T4.Agreement_Type = 'A'
AND T4.Agreement_State = 2
AND T3.Balance_Date = DATE-1

SQL Query 3 from Job 1


INSERT INTO AGREEMENT (Agreement_Nbr, Agreement_Type, Agreement_State)
SELECT T6.Loan_Id, 'L', case when T6.State = 'New' then 1 when T6.State = 'Active'
then 2 else 0 end
FROM LOAN T6
JOIN LOAN_TYPE T7 ON T6.Loan_Type = T7.Code
WHERE T7.Type in ('Private', 'Business')
AND T6.State in ('New', 'Active')

SQL Query 4 from Job 2


INSERT INTO LOAN_SUMMARY (Period_Date, Agreement_Nbr, Agreement_State,
Principal_Amt)
SELECT T3.Balance_Date, T4.Agreement_Nbr, T4.Agreement_State, T3.Balance_Amt
FROM AGREEMENT T4
JOIN BALANCE T3 ON T4.Agreement_Nbr = T3.Agreement_Nbr
WHERE T4.Agreement_Type = 'L'
AND T4.Agreement_State = 2
AND T3.Balance_Date = DATE-1

The dependencies between the source and target tables, the jobs and the queries can be extracted from the queries and presented as a directed graph. The structures and components are the nodes of the graph and the dependencies between source and target tables are its directed edges. The direction of an edge indicates the data flow from the source to the target structure. Figure 1.1 shows two coarse-grained data flow graphs at the detail level of tables and jobs, and of tables and queries. We can use those graphs as illustrations of data lineage and impact analysis problems, where data lineage questions are answered by querying sub-graphs in the target-to-source direction and data or component impact questions are answered by sub-graph queries in the source-to-target direction (a relational sketch of this traversal follows Figure 1.1 below). We can also notice that, without going to the fine-grained level of columns and query components, it is not possible to see which table's data is moved to the target tables and which is used only for filtering or lookups. For example, we can see that the ACCOUNT_STATE and LOAN_TYPE tables are used as sources at the job and query levels, but we cannot recognize that their data is not loaded into the AGREEMENT table and is used only for filtering rows with certain types or statuses.
Figure 1.1 DW data transformation flows in table, job and query levels.
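To make the coarse-grained graph of Figure 1.1 queryable, its edges can be stored in a simple relation and traversed recursively. The sketch below is illustrative only: the table and column names are invented for this example, and the full method additionally attaches probabilistic weights to the edges (sections 3.6 and 3.7), which this sketch omits:

-- Table-to-table edges of Figure 1.1, labelled with the query that creates them.
CREATE TABLE dependency_edge (
    source_table VARCHAR(64),
    target_table VARCHAR(64),
    via_query    VARCHAR(4)
);

INSERT INTO dependency_edge VALUES
 ('ACCOUNT',       'AGREEMENT',       'Q1'),
 ('ACCOUNT_STATE', 'AGREEMENT',       'Q1'),
 ('LOAN',          'AGREEMENT',       'Q3'),
 ('LOAN_TYPE',     'AGREEMENT',       'Q3'),
 ('AGREEMENT',     'DEPOSIT_SUMMARY', 'Q2'),
 ('BALANCE',       'DEPOSIT_SUMMARY', 'Q2'),
 ('AGREEMENT',     'LOAN_SUMMARY',    'Q4'),
 ('BALANCE',       'LOAN_SUMMARY',    'Q4');

-- Data lineage question: everything upstream of DEPOSIT_SUMMARY (target-to-source).
WITH RECURSIVE lineage (source_table, via_query, depth) AS (
    SELECT source_table, via_query, 1
    FROM dependency_edge
    WHERE target_table = 'DEPOSIT_SUMMARY'
    UNION ALL
    SELECT e.source_table, e.via_query, l.depth + 1
    FROM dependency_edge e
    JOIN lineage l ON e.target_table = l.source_table
)
SELECT DISTINCT source_table, via_query, depth FROM lineage;
-- The impact analysis direction simply swaps the roles of source and target.
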

Figure 1.2 illustrates the fine-grained level of detail, where the query components allow us to construct more complex and detailed dependency graphs to answer data lineage and impact questions at the column level. The transformation queries (Q1…Q4) are parsed into abstract mappings (M1…M4) with all the available source and target tables. Each mapping has data transformation elements (t1.1…t4.3), joins (j1.1…j4.1) and filter conditions (f1.1…f4.1) according to the query structure and expressions. All source and target tables have connected columns according to their usage in the query expressions. Additional transformation expressions, key-value constraints and conditions are extracted from the query text and connected to the mappings for further semantic calculations and instance-level data lineage tracing (a relational sketch of mapping M1 is shown after Figure 1.2).

Figure 1.2 DW data transformation flows in table, column and query component levels.
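As an illustration of what the parsing step produces, the elements of mapping M1 (from SQL Query 1) can be written down as plain metadata rows. The layout below is a simplified sketch invented for this example; the actual metadata model and mapping design are described in sections 3.2 and 3.3:

-- Simplified, illustrative layout: one row per element of mapping M1.
CREATE TABLE mapping_element (
    mapping_id   VARCHAR(8),
    element_id   VARCHAR(8),
    element_type VARCHAR(16),   -- TRANSFORM, JOIN or FILTER
    target_col   VARCHAR(128),  -- only transformation elements have a target column
    source_cols  VARCHAR(256),
    expression   VARCHAR(4000)
);

INSERT INTO mapping_element VALUES
 ('M1', 't1.1', 'TRANSFORM', 'AGREEMENT.Agreement_Nbr',   'ACCOUNT.Account_Nbr', 'T1.Account_Nbr'),
 ('M1', 't1.2', 'TRANSFORM', 'AGREEMENT.Agreement_Type',  'ACCOUNT.Type',        'T1.Type'),
 ('M1', 't1.3', 'TRANSFORM', 'AGREEMENT.Agreement_State', 'ACCOUNT.State_Code',  'T1.State_Code'),
 ('M1', 'j1.1', 'JOIN',   NULL, 'ACCOUNT_STATE.Code, ACCOUNT.State_Code', 'T2.Code = T1.State_Code'),
 ('M1', 'f1.1', 'FILTER', NULL, 'ACCOUNT_STATE.State, ACCOUNT.Type',
  'T2.State = ''Active'' AND T1.Type = ''A''');
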

The result of the parsing and query processing is a detail-level dependency graph that allows for more precise data lineage and impact analysis at the table and column levels. The graph is a representation of the discrete source and target dependencies between the input and output components, without additional knowledge describing how the data is transformed or filtered in the transformation query. Analysis of the queries Q1…Q4 and the predicates from their where clauses shows that different and independent sets of rows produced by queries Q1 and Q3 from the ACCOUNT and LOAN tables are loaded into the same AGREEMENT table. We also notice that queries Q2 and Q4 use independent sub-sets of rows from that same AGREEMENT table, selected by the filtering predicates Agreement_Type = 'A' and Agreement_Type = 'L' respectively.
We can conclude the example by saying that, based on the data structure information and an understanding of the query semantics in terms of transformation functions and filter predicates, we can make logical inferences about the data rows or tuples that are involved in or excluded from the data lineage workflows.
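One way to automate this kind of inference is to compare the constant equality predicates that a query writes into a table with those that a later query reads from it. The sketch below is a deliberately naive illustration restricted to simple equality constants on AGREEMENT.Agreement_Type; the predicate table and its columns are invented for this example, while the actual rule system and semantic layer calculation are presented in sections 3.7 and 3.8:

-- Equality predicates extracted from the why-lineage of queries Q1..Q4.
CREATE TABLE query_predicate (
    query_id  VARCHAR(4),
    role      VARCHAR(8),     -- WRITES or READS rows of AGREEMENT
    column_nm VARCHAR(64),
    value_txt VARCHAR(64)
);

INSERT INTO query_predicate VALUES
 ('Q1', 'WRITES', 'Agreement_Type', 'A'),  -- Q1 copies T1.Type, restricted to 'A'
 ('Q3', 'WRITES', 'Agreement_Type', 'L'),  -- Q3 loads the constant 'L'
 ('Q2', 'READS',  'Agreement_Type', 'A'),
 ('Q4', 'READS',  'Agreement_Type', 'L');

-- Which loader/reader pairs can actually exchange rows through AGREEMENT?
SELECT w.query_id AS loader, r.query_id AS reader
FROM query_predicate w
JOIN query_predicate r
  ON  w.role = 'WRITES'
 AND r.role = 'READS'
 AND r.column_nm = w.column_nm
 AND r.value_txt = w.value_txt;
-- Returns (Q1, Q2) and (Q3, Q4): rows loaded by Q3 never reach DEPOSIT_SUMMARY.
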

1.3. Summary
This chapter presented an introduction to data lineage, provenance and impact
analysis problems, starting with the overview in section 1.1, followed by the
example in section 1.2, with the queries and mapping representation forms for the interconnected data flow graph that will be used in subsequent chapters to illustrate different data lineage or impact problems. These connect with different parts of the current thesis.

2. RELATED WORK
Impact analysis, traceability and data lineage issues are not new. An overview of data lineage and data provenance tracing studies was collected by Cheney et al. [5], historical and future perspectives were discussed by Tan [6] and the last decade of research activities was presented by Priebe et al. [12]. Lineage and
provenance has been studied in scientific data processing areas [7], [8], [9] and
in the context of database management systems [2], [6], [16]. Multiple notions of
lineage and provenance in database systems have been used to describe
relationships between data in the source and in the target: where output records came from [7], why output records were produced by the inputs [7], [17] and how an output record was produced [18]. Lineage tracking of query behavior has been used in classical database problems like view updates [19] or the expressiveness of update languages [20], and in the study of annotation propagation [20], [21] or updates across peer-to-peer systems [22]. Theoretical and practical models of data-driven and data-dependent processes and their provenance were described by Deutch et al. [23].
A distinction is made between coarse-grained, or schema-level, provenance tracking [24] and fine-grained, or data-instance-level, tracking [25]. The methods of extracting the lineage are divided into physical (annotation of data, by Missier et al.) and logical, where the lineage is derived from the graph of data transformations [26].
We can also find various research approaches and published papers from the early 1990s and later with methodologies for software traceability [27]. The problem of data lineage tracing in data warehousing environments was formally founded by Cui and Widom [9], [17]. Data lineage or provenance detail levels (e.g., coarse-grained vs fine-grained), question types (e.g., why-
provenance, how-provenance and where-provenance) and two different
calculation approaches (e.g., eager approach vs. lazy approach) have been
discussed in multiple papers [6], [28], and formal definitions of the why-
provenance have been given by Buneman et al. [7]. Other theoretical works for
data lineage tracing can be found in [29] and [30]. Fan and Poulovassilis
developed algorithms for deriving affected data items along the transformation
pathway [31]. These approaches formalized a way to trace tuples (resp. attribute
values) through rather complex transformations, given that the transformations
are known on a schema level. This assumption does not often hold in practice.
Transformations may be documented in source-to-target matrices (specification
lineage) and implemented in ETL tools (implementation lineage). In their research paper [11], Woodruff and Stonebraker created a solid base for data-level, operator-processing-based fine-grained lineage, in contrast to metadata-based lineage calculation.
Priebe et al. concentrated on proper handling of specification lineage, a
significant problem in large-scale DW projects, especially when different sources
have to be consistently mapped to the same target [12]. They proposed a business
information model (or conceptual business model) as the solution and a central
mapping point to overcome those issues. Requirement- and design-level lineage and traceability solutions for a next-generation DW and BI architecture were described by Dayal et al. [32].
Other ETL-related practical works that are based on conceptual models can
be found in [33] and [34]. Ontology- and graph-based practical works related to data quality and data lineage tracking can be found in [35], [36] and [10]. De Santana proposed an integrated metadata and CWM metamodel-based data lineage documentation approach [37]. A conceptual modeling approach for ETL workflows in the Big Data landscape was described by Bala et al. [38], and Basal [39] presented a semantic approach to combining the traditional ETL approach with the Big Data challenges. Another related work from the field of data lineage and scientific data provenance by Wang et al. [40] brings together the challenges and opportunities of Big Data, including volume, variety, velocity and veracity, with the problems of scientific workflow tracking and reproducibility. Cloud-based or distributed systems have their own limitations for data lineage tracing, and data-centric event logging was introduced and discussed by Suen et al. [41].
In addition to data lineage and provenance in databases, closely related
workflow provenance tracking is an active research topic in the scientific
community. The overview of scientific workflow provenance was captured in
surveys by Bose and Frew [15] and Glavic and Dittrich [42], and tutorials with
research issues, challenges and opportunities were described by Davidson and
Freire in [43]. General design and principles of scientific workflow lineage and
provenance systems were introduced and discussed by Bose [44], Simmhan et al.
[45], Altintas et al.[46], Chervenak et al. [47] and Wu et al. [48], and there are
many different flavors and accents, like the collaborative approach from Missier
et al. [49] and Altintas [50]; the cloud-based or distributed systems by Cruz et al.
[51], Marinho et al. [52] and Wang et al. [53]; the Big Data-oriented approach by
Wang et al. [40]; the graph-oriented approach by Anand et al. [54], [55], Acar et
al. [56] and Biton et al. [57]; the ontology-driven approach by Bowers et al. [58];
the semantic web and semantic technologies based approaches by Kim et al. [59],
Ding et al. [60] and Sahoo et al. [61]; the user- or scientist-oriented systems from
Bowers et al. [62]; and the user- or subjective scientist eliminative-based
approach by Finlay [63]. The scientific workflow lineage and provenance
research does not end here, but continues in different scientific domains, like
bioinformatics by de Paula et al. [64] and Buneman et al. [65] or genomics by de
Paula et al. [66].
The lineage and provenance problems are not limited to databases, data flows and scientific workflows, but share common challenges with the fields of curated databases, the semantic web, open linked data, e-Science and the growing social networking landscape. Some interesting works can be found on the borders of the
different domains and disciplines by Chirigati and Freire [67], Hartig and Zhao
[68], Moreau [69], [70] and Moreau et al. [71].
In the context of our work, efficiently querying the lineage information after
the provenance graph has been captured is of specific interest. Heinis and Alonso
presented an encoding method that allows space-efficient storage of transitive
closure graphs and enables fast lineage queries over that data [24]. Anand et al.
proposed a high-level language QLP, together with the evaluation techniques that
allow storing provenance graphs in a relational database [72]. These techniques
are supported by a pointer-based encoding of the dependency closure that
supports reducing storage requirements by eliminating redundancy.
Several commercial ETL products address the impact analysis and data
lineage problems to some extent (e.g., Oracle Data Integrator, Informatica
PowerCenter, IBM DataStage, Teradata Metadata Services or Microsoft SQL
Server Integration Services), but the dependency analysis those tools perform is
often limited to the basic functions of the particular system. Another group of
commercial tools is formed by specialized metadata integration products not tied
to a particular ETL tool, offering a more sophisticated suite of dependency
analysis functionality. Examples are ASG Rochade8, InfoSphere Information
Governance Catalog from IBM9, Data Governance and Catalog from Collibra10,
Informatica Metadata Manager11, SAP Information Steward12, Metacenter from
Data Advantage Group13, Adaptive Metadata Manager14, Troux Enterprise
Architecture Solution15, Metadata Management from Cambridge Semantics16,
the Metadata System from AB Initio17 and MetaIntegration Metadata Management18,
most of which have their own limitations in terms of available functionality and
adapters to other products [12].
In addition to full-scale metadata management or data governance products,
there are several new-generation technology companies that fit into the picture
in one way or another: Automated SQL query parsing and lineage extraction from
SqlDep19 and Manta20; Metadex data lineage solution from CompactBI21;
Accurity business glossary and data governance solutions from Simplity22;
Machine Learning based metadata and data lineage discovery solutions from
RokittAstra23; Data lineage and governance solutions from Synergy24; SQL
parsing, analyzing, documenting and data lineage discovery tools from General
SQL Parser25; Data mapping and documenting oriented Mapping Manager

8 https://www.asg.com/
9 http://www-03.ibm.com/software/products/en/infosphere-information-governance-catalog
10 https://www.collibra.com/
11 https://www.informatica.com/products/informatica-platform/metadata-management.html
12 http://www.sap.com/community/topic/information-steward.html
13 http://www.dag.com/
14 http://www.adaptive.com/metadata-manager
15 http://www.troux.com/
16 https://www.cambridgesemantics.com/solutions/metadata-management
17 https://www.abinitio.com/en/system/enterprise-meta-environment
18 http://www.metaintegration.com/Solutions/#MetadataManagement
19 https://www.sqldep.com/
20 https://getmanta.com
21 http://www.compactbi.com/
22 http://www.accurity.eu/
23 https://www.rokittastra.com/
24 http://www.meta-analysis.fr/en/la-solution/
25 http://sqlparser.com/
solution from AnalytixDS26; Automated metadata capture, analysis and
collaboration tools by AlexSolutions27; Data lineage and graph data analysis and
visualization tools from Linkurious28; Synapse data mapping, analysis, tagging
and visualization tools from Sapient29; Axon governance, lineage and
collaboration tool from Diaku30; and finally, fully automated semantic metadata
capture, data lineage, impact analysis, business governance and visualization in
the dLineage31 toolset, which is based on the methodology, algorithms and ideas
described in this thesis.

2.1. Summary
This chapter gave an overview of previous works and scientific studies in the
field, along with the industry landscape.

26 http://analytixds.com/products/mapping-manager/
27 http://alexsolutions.com.au/
28 https://linkurio.us
29 https://synapse.sapientconsulting.com/
30 https://www.diaku.com
31 http://www.dlineage.com
3. ALGORITHMS AND METHODS
This chapter presents the algorithms and methods we have designed and
implemented. The overall architecture follows the methodological pathway
presented on a conceptual level in section 3.1. We describe the metadata database
design in section 3.2 and different metadata models (metamodels) for databases
and data integration in section 3.3. More details about the underlying foundation
and mapping design can be found in article A. The logical path continues with
query parsing and resolving techniques in section 3.5 and with data transformation
evaluation and weight calculation in section 3.6. The rule system implemented for
graph construction and calculations is discussed in section 3.7, and the semantic
layers on the calculated graphs are discussed in section 3.8. The rule system and graph
calculations are discussed at a detailed level in papers B, C and D.

3.1. Overall Architecture and Methodology


The overall architecture is based on an independent metadata collection and
storage framework with dynamic schema and unified metamodels, grammar-
based query parsing and resolving, probabilistic data transformation weight
calculation, rule-based graph calculation and web-based user interface
components. The architecture follows the methodology steps (from 1 to 8)
presented in Figure 3.1:
1. Scanners collect metadata from different systems that are part of the DW’s
data flow (DI/ETL processes, data structures, queries, reports, etc.) to the
open-schema metadata database (PostgreSQL or Oracle).
2. The SQL parser is based on a customized grammar, the GoldParser parsing
engine and the Java-based XDTL engine.
3. The rule-based parse tree mapper extracts and collects meaningful
expressions from the parsed text, using declared combinations of grammar
rules and parsed text tokens.
4. The query resolver applies additional rules to expand and resolve all the
variables, aliases, sub-query expressions and other SQL syntax structures that
encode crucial information for data flow construction.
5. The expression weight calculator applies rules to calculate the meaning of
data transformation, join and filter expressions for impact analysis and data
flow construction.
6. The rule-based reasoning engine propagates and aggregates weighted
dependencies.
7. The dependency graph is stored along with the collected metadata in a
relational database as binary and directed relations between node objects.
8. Directed and weighted sub-graph calculations, visualization and a web-based
UI are used for data lineage and impact analysis applications.

Figure 3.1 Methodology and system architecture components.

The color codes differentiate the data capture components (blue), active data
processing components (red) and passive supporting components (white). The
double lines in the comb-cell figure express the data flow bonds between the
active or passive components.
The base components of the system architecture were introduced in paper A.
Our general methodological and architecture scheme is presented in papers B and
C and developed further in paper D.

3.2. Metadata Database


Our metadata database is built on relational database technology for different
knowledge management and rule-based analytical applications. The repository is
designed according to the OMG Meta Object Facility (MOF)32 idea with
separate abstraction and modeling layers (M0-M3). The physical data model
(schema) is based on the principles and guidelines of the EAV (Entity-Attribute-
Value)33 modeling technique, suitable for modeling highly heterogeneous data
with a very dynamic nature. Metadata models and schema definitions in EAV are
separated from physical storage, and therefore modifications to the schema on the
"data" level can easily be made without changing DB structures, by just modifying
the corresponding metadata. The chosen approach is suitable for open-schema
implementations (similar to key-value stores), where the model is dynamic and
semantics are applied at query time, but also for model-driven implementations with
a formal and well-defined schema, structure and semantics. The URI
reference mechanism and resource storage scheme used make our metadata
repository a semantic data store that is comparable to the Resource Description
Framework (RDF) and can be serialized in different semantic formats or
notations (e.g., RDF/XML, N3, N-Triples, XMI, etc.) using the XML or RDF APIs.

32 https://en.wikipedia.org/wiki/Meta-Object_Facility
33 https://en.wikipedia.org/wiki/Entity-attribute-value_model
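
To make the open-schema idea concrete, the following minimal DDL sketch shows how an EAV-style repository could be laid out in PostgreSQL. The table and column names (meta_object, meta_property, etc.) are illustrative assumptions and not the actual repository schema, which is shown in Figure 3.2:

-- Minimal EAV-style open-schema layout (illustrative assumption, not the actual dLineage schema).
CREATE TABLE meta_object (
    object_id   bigserial PRIMARY KEY,
    model_id    bigint NOT NULL,            -- metamodel the object belongs to (RDB, ETL, BI, mappings, ...)
    class_name  text   NOT NULL,            -- e.g. 'Schema', 'Table', 'Column', 'Package', 'Report'
    parent_id   bigint REFERENCES meta_object(object_id),
    object_uri  text   UNIQUE               -- RDF-style resource reference
);

CREATE TABLE meta_property (
    object_id   bigint NOT NULL REFERENCES meta_object(object_id),
    prop_name   text   NOT NULL,            -- e.g. 'datatype', 'nullable', 'expression'
    prop_value  text,
    PRIMARY KEY (object_id, prop_name)
);

Under such a layout, adding a new metamodel or a new attribute means inserting metadata rows instead of altering the physical database structures.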
The physical schema (Figure 3.2) can be seen as a general-purpose storage
mechanism for different metadata and knowledge models, and also as a
communication medium or information integration and exchange platform for
different software agents or applications (e.g., metadata scanners, metadata
consumers, etc.). A built-in, limited reasoning capability is based on recursive
SQL and is provided through the data and metadata APIs to implement
inheritance and model validation functions. The semantic representation of data
allows for extended functionality with predicate calculus reasoners or by applying
other external rule-based reasoners (e.g., Jena) for more complicated reasoning
tasks, like the deduction of new knowledge.
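
As a minimal sketch of this recursive SQL reasoning, the containment/inheritance closure of a single object can be computed with a recursive query; it reuses the hypothetical meta_object table from the sketch above and an example object id:

-- Collect an object together with all of its descendants by walking the parent links;
-- this is the kind of closure the metadata API needs for inheritance and model validation.
WITH RECURSIVE subtree AS (
    SELECT object_id, class_name, parent_id
    FROM meta_object
    WHERE object_id = 42                      -- example root object id
    UNION ALL
    SELECT o.object_id, o.class_name, o.parent_id
    FROM meta_object o
    JOIN subtree s ON o.parent_id = s.object_id
)
SELECT * FROM subtree;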

Figure 3.2 Metadata database physical schema tables.

The repository contains integrated object-level security mechanisms and
different data access APIs (e.g., data, metadata, XML/XMI, RDF API, etc.) that
are implemented as relational database procedures or functions.
An unlimited number of different data models can exist inside our metadata
model simultaneously, with relationships between them. Each of these data
models constitutes a hierarchy of classes where the hierarchy might denote an
instance relationship, a whole-part relationship or some other form of generic
relationship between hierarchy members. We designed several predefined
metadata models for data lineage and impact analysis data:

• terminology and classification (business meaning and governance);
• relational database (DB and SQL);
• data integration model (ETL);
• reporting model (OLAP, BI, Reporting); and
• mappings model (formalized abstract mappings).

3.3. Design of Metadata Models and Mappings


The relational database metamodel is used to store detailed information about
the sources and targets of data transformations. The RDB metamodel focuses on
the main database objects, e.g. Schema, Table, View, Column, Datatype,
Procedure, etc. The ETL metamodel is based on the OMG CWM34 reference
architecture with base concepts like Folder, Package, Step and Task. The ETL
model is focused on the organization and structure of data processing packages,
sequences and dependencies of events, relations between elements controlling
data processing workflow, etc. The reporting metamodel focuses on Report,
Model, Dimension, Hierarchy and Measure elements, taking advantage of the
mapping metamodel to store query mappings and related classes, and is used to
store information describing the presentation layer. The mappings metamodel
used to manage decomposed relationships and expressions in a unified manner.
Various metadata and data integration and ETL models are discussed and used
in previous works [73],[74]. We decided to implement our own “soft” models
that do not require a database physical schema change when changing the
metamodel. The details about the abstract mappings model design, storage and
usage is presented in article A.

3.4. Data Capture, Store and Processing with Scanners


The Extensible Data Transformation Language (XDTL) is an XML-based
descriptive language designed for specifying data transformations between
different data formats, locations and storage mechanisms. XDTL was created by
Mindworks Industries as a Domain Specific Language (DSL) for the ETL domain
and was designed with principles like modularity, extensibility, reusability and
decoupled declarative (unique) and procedural (repeated) patterns in mind.
XDTL syntax is defined in an XML Schema document. Wildcard elements of
XML Schema enable extending the syntax of the core language with new
functionality implemented in other programming languages or in XDTL itself.
XDTL scripts are built as reusable components that have clearly defined
interfaces via parameter sets. Components can be serialized and de-serialized
between XML and database representations, thus making XDTL scripts suitable
for storing and managing in a data repository. XDTL provides functionality to
use externally stored data mappings that are decoupled from the scripts.
Therefore, mappings stored in a repository can exist as objects independent from
the transformation process and can be reused by several different processes.
XDTL acts as a container for a process that often must use facilities not present
in XDTL itself (e.g., SQL, SAS language, etc.).

34 https://en.wikipedia.org/wiki/Common_Warehouse_Metamodel
The purpose of a scanner is to extract and capture all relevant metadata about
a certain class of data elements and store it in a predefined, structured manner.
Scanner components (No1 in Figure 3.1) collect metadata from external systems,
like database data dictionary structures, ETL system scripts and queries,
or reporting system query models and reports, and all structural information is
extracted and stored in the metadata database. The scanned objects and their
properties are extracted and stored according to the defined metamodels, like the
relational database, data integration, reporting and business terminology models.
Metamodels contain ontological knowledge about the collected metadata and
relations across different domains and models. The scanner technology and
open-schema metadata database design are described in more detail in our article
A.
The database scanner is a program implemented as an XDTL package or script
that transforms metadata from a database dictionary into an RDB metamodel.
Database scanners are based on ANSI SQL Information Schema35 specification
and are currently being implemented for MsSQL, PostgreSQL, Greenplum,
Oracle, Teradata, IBM DB2, Netezza, Vertica and other database platforms. All
database scanners are implemented as two-phase processes that materialize (scan)
the scanned data in a format conforming to the Information Schema definition. A
separate process (store) stores this temporary information in a permanent storage
medium (database). Decoupling those processes allows for reusing components
created for different database products in multiple combinations.
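
As an illustration of the scan phase, a minimal query over the ANSI Information Schema could look as follows; the real scanners materialize a much richer set of dictionary views, so this is only a simplified sketch:

-- Scan phase sketch: read basic table and column metadata from the standard Information Schema.
SELECT c.table_schema,
       c.table_name,
       c.column_name,
       c.ordinal_position,
       c.data_type,
       c.is_nullable
FROM information_schema.columns AS c
JOIN information_schema.tables  AS t
  ON t.table_schema = c.table_schema AND t.table_name = c.table_name
WHERE t.table_type IN ('BASE TABLE', 'VIEW')
ORDER BY c.table_schema, c.table_name, c.ordinal_position;

The separate store phase then loads such result sets into the metadata repository according to the RDB metamodel.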
Application scanning is a procedure implemented as an XDTL package that
transforms metadata from an application repository or internal representation into
the application metamodel. Several application scanners have been implemented for
various ETL, OLAP and Reporting tools.
Oracle Data Integrator (ODI) is an ETL tool quite common in DW
environments, especially in relation to Oracle databases. The ODI scanner
extracts information relevant for impact analysis, i.e., all data sources and targets,
column mappings, transformations, JOIN and WHERE conditions, variables,
references to external processes, etc.
Business Objects (BO) is a reporting tool widely used in DW environments. The BO
scanner extracts metadata from the BO application repository and File Store,
transforming it into the reporting metamodel. The granularity of the extracted
information matches the impact analysis requirements.

3.5. Query Parsing and Metadata Extraction


To construct data flows from the originating data sources (e.g., an
accounting system) to the end points (e.g., a reporting system), we should be able
to connect the same and related objects across different systems. To connect objects,

35 https://en.wikipedia.org/wiki/Information_schema
we have to understand and extract relations from SQL queries (e.g., ETL tasks,
DB views and procedures) and scripts (e.g., loader utility scripts) and expressions
(e.g., report structure) that are collected and stored by scanners. To understand
the data transformation semantics that are captured within the query language
statements (e.g., insert, update, select and delete queries) and expressions, we
have to involve external knowledge about query language syntax and
grammatical structure. We used a general-purpose Java-based parser engine36 and
developed a custom SQL grammar written in Extended Backus-Naur
Form (EBNF). Our grammar is based on ANSI SQL syntax, but it contains a
large set of dialect specific notations, syntax elements and functions that were
developed and trained using large real-life SQL query sets from the DW field.
The current grammar edition is based on the Teradata, Oracle, Greenplum,
Vertica, Postgres, IBM DB2, Netezza and MsSql dialects.
Example 1. SQL select statement grammar sample in EBNF format:

<Select Stm> ::= <Select> UNION <Select Stm>
              |  <Select> UNION ALL <Select Stm>
              |  <Select>
<Select> ::= SELECT <Columns> <Into Clause> <From> <Where> <Group Clause> <Qualify Clause> <Having Clause> <Order Clause>
<Subquery Stm> ::= '(' <Select Stm> ')'
<Columns> ::= <Restriction> '*'
           |  <Restriction> <Column List>
<Column List> ::= <Column Item> ',' <Column List>
               |  <Column Item>
<Column Item> ::= <Column Source>
               |  <Column Source> <Alias>
               |  <Column Source> ' AS ' <Alias>
<Column Source> ::= <Column Source Item>
<Column Source Item> ::= '(' <Column Source Item> ')' | <Add Exp>
<From> ::= FROM <Id List> <Join Chain> |
<Join Chain> ::= <Join> <Join Chain> |

Grammar-based parsing functionality is built into the scanner technology, and
a configurable "parse" command brings semi-structured text parsing and
information extraction into the XDTL data integration environment. As a result
of the SQL parsing step (No2 in Figure 3.1), we have a large parse tree where
every SQL query token has a special, disambiguated meaning based on the
grammar syntax.
Example 2. Parse tree fragment with grammar rules and parsed text tokens:

| +--<SelectStm>::=<Select>
| | +--<Select>::=SELECT<Columns><IntoClause><From><Where><GroupClause><QualifyClause><HavingClause><OrderClause>
| | | +--SELECT
| | | +--<Columns>::=<Restriction><ColumnList>
| | | | +--<Restriction>::=
| | | | +--<ColumnList>::=<ColumnItem>','<ColumnList>
| | | | | +--<ColumnItem>::=<ColumnSource><Alias>
| | | | | | +--<ColumnSource>::=<ColumnSourceItem>
| | | | | | | +--<ColumnSourceItem>::=<AddExp>
| | | | | | | | +--<AddExp>::=<Exp><Operator><AddExp>
| | | | | | | | | +--<Exp>::=<Value>
| | | | | | | | | | +--<Value>::=Id
| | | | | | | | | | | +--MK.Kood
| | | | | | | | | +--<Operator>::='||'
| | | | | | | | | | +--||
| | | | | | | | | +--<AddExp>::=<Exp><Operator><AddExp>
| | | | | | | | | | +--<Exp>::=<Value>
| | | | | | | | | | | +--<Value>::=StringLiteral
| | | | | | | | | | | | +--'/'
| | | | | | | | | | +--<Operator>::='||'
| | | | | | | | | | | +--||

36 http://www.goldparser.org/
To parse different texts into the tree structure and to be able to reduce tokens and
map the parse tree back to meaningful expressions (depending on the search goals),
we use a declarative rule set (in JSON format) based on combinations of tokens and
grammar rules. The configurable grammar and a synchronized reduction rule set
make the XDTL parse command suitable for general-purpose information
extraction, and they capture the resource-hungry computation steps into one single
parse-and-map step with a flat table outcome. The Parse Tree Mapper (No3 in Figure
3.1) uses 3 different rule sets with more than 100 rules to map the parse tree into
data transformation expressions. The defined rules are declared in the following
sets and are illustrated in Example 3:
• The stopword list and grammar rules indicate to the mapper when to flush the
buffer and start collecting tokens for a new expression;
• The mapword list and grammar rules map collected expressions to
meaningful items (e.g., sources, targets, data transformations, joins and
filters); and
• The tagword list and grammar rules tag special meaningful tokens in
expressions to identify all DB object references (e.g., tables, views, columns,
functions, constants, etc.).

Example 3. Mapper rule set sample with SQL query tokens and grammar rules:

{"parse-map":
  {"stopwords": [
    {"token":"SELECT", "rule": "<Select>"},
    {"token":"FROM", "rule": "<From>"},
    {"token":"WHERE", "rule": "<Where>"},
    {"token":"JOIN", "rule": "<Join>"},
    ...
  ],
  "mapwords": [
    {"map":"FilterCondition", "token":"WHERE", "rule": "<Where>", "group": "0"},
    {"map":"JoinCondition", "token":"ON", "rule": "<Join>", "group": "0"},
    {"map":"Source", "token":"FROM", "rule": "<From>", "group": "0"},
    {"map":"Target", "token":"INTO", "rule": "<Ins Prefix>", "group": "0"},
    {"map":"Transformation", "token":",", "rule": "<Column List>", "group": "0"},
    ...
  ],
  "tagwords": [
    {"token":"Id"},
    {"token":"IntegerLiteral"},
    {"token":"StringLiteral"},
    {"token":"Alias"},
    ...
  ]
}}

After extracting and mapping each SQL query statement into a series of
expressions, we execute the SQL Query Resolver (No4 in Figure 3.1) that
contains a series of functions to resolve SQL query structure-specific tasks:
• Resolve source and target object aliases to fully qualified (schema name +
object name) object names;
• Resolve sub-query aliases to context-specific source and target object names;
• Resolve sub-query expressions and identifiers to expand all query-level
expressions and identifiers to fully qualified and functional ones;
• Resolve syntactic dissymmetry in different data transformation expressions
(e.g., insert statement column lists, select '*' statements, select statement
column lists, update statement assignment lists, etc.); and
• Extract quantitative metrics from data transformation, filter and join
expressions to calculate expression weights (e.g., number of columns in the
expression, functions, predicates, string constants, number constants, etc.).


3.6. Data Transformation Weight Calculation


The problem of the origin of data is often related to context, confidence and
trustworthiness. We can find papers in the literature that focus on mathematical
models or algorithms to measure importance, certainty and trust in data
processing systems [75], or beliefs, opinions and trust transitivity, propagation
and reasoning in agent communication [76]. We notice some similarities in data
source confidence, trust calculation and propagation, but our data lineage and
impact weight calculation has a different purpose. Our data transformation weight
calculation is based on a probabilistic estimation of data source usage in data
transformations and filtering, and its purpose is to make metadata-based
inferences about the data flows and the data usage.

Data structure transformations are parsed and extracted from queries, and are
stored as formalized, declarative mappings in the system (articles B and C). To
add additional quantitative measures to each column transformation or column
usage in join and filter conditions, we evaluate each expression and calculate
transformation and filter weights for them.
The Expression Weight Calculation (No5 in Figure 3.1) was based on the idea
that we can evaluate column data “transformation rate” and column data “filtering
rate” using data structure and structure transformation information captured from
the SQL queries. Such a heuristic evaluation allows for distinguishing columns
and structures used in transformation expressions or in filtering conditions or
both, and gives probabilistic weights to expressions without the need to
understand the full semantics of each expression. We defined two measures that
we further used in our rule system for new facts calculation:
• The column transformation weight Sw is based on an expression complexity
estimation of the column transformation, and the calculated weight expresses the
source column transfer rate or strength. Weights are calculated on the scale [0,1],
where 0 means that data is not transformed from the source (e.g., a constant
assignment in the query) and 1 means that the source is directly copied to the
target (no additional column transformations).
• The column filter weight Fp is based on an expression complexity estimation for
each filter column in the filter expression, and the calculated weight expresses
the column filtering rate or strength. The weight is calculated on the scale [0,1], where
0 means that the column is not used in the filter and 1 means that the column
is directly used in the filter predicate (no additional expressions).

The general column weight W for the Sw and Fp components of each expression
is calculated as the ratio of the column count to the total count of all expression
components (e.g., column count, constant count, function count, predicate count):

$$W = \frac{ColumnCount}{ColumnCount + FunctionCount + StringCount + NumberCount + PredicateCount}$$

All counts are normalized by evaluating the functions in the expression against
a positive function list (e.g., CAST, ROUND, COALESCE, TRIM, etc.). If a
function in the expression is in the positive function list, then the normalization
reduces the corresponding component count, so that the expression "pays a smaller
price" when the used function does not have a significant impact on the column data.
When the column data is mapped from the source column to the target column
in an SQL DML statement column expression, the data transformation
weight depends on the complexity of the expression and is between 0 and 1. The
following expression samples and the calculated weights for each source-target
column pair illustrate the variation of the data transformations:

q1: CAST(T1.LogDate AS DATE) as Request_Date => 0.91


q2: T1.First_Name||' '||T1.Last_Name as Full_Name => 0.67
q3: MIN(T1.Balance_Amt) as Min_Balance_Amt => 0.5
q4: SUM(ZEROIFNULL(T1.Payment_Amt)) as Sales_Amt => 0.33
q5: SUM(CASE T1.Acc_Type IN (2,42) THEN T1.Acc_Amt ELSE 0 END) as Credit_Amt => 0.2
q6: CASE WHEN T1.Feature_Id is not null THEN 'Y' ELSE 'N' END as Dynamic_Ind => 0.17

The last expression q6 contains parts and measures like ColumnCount:
1 (T1.Feature_Id), FunctionCount: 2 (Case, WhenThen) and StringCount:
3 (null, Y, N). Using those values and the weight definition, we calculate the
weight of the column pair operation O(T1.Feature_Id, T2.Dynamic_Ind, q6, 0.17)
in the expression q6 as follows:

$$W = \frac{1}{1 + 2 + 3 + 0 + 0} = \frac{1}{6} = 0.16667 \approx 0.17$$
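
The basic (un-normalized) weight can be computed directly from the component counts collected by the resolver. The following sketch assumes a hypothetical expression_stats table with one row per column transformation expression; the positive-function-list normalization described above is omitted for brevity:

-- Weight sketch over assumed per-expression component counts.
-- For q6: column_cnt = 1, function_cnt = 2, string_cnt = 3, number_cnt = 0, predicate_cnt = 0.
SELECT expr_id,
       round(column_cnt::numeric
             / NULLIF(column_cnt + function_cnt + string_cnt + number_cnt + predicate_cnt, 0),
             2) AS transformation_weight       -- q6 gives 1/6 = 0.17
FROM expression_stats;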

3.7. Rule System and Dependency Calculation


The defined notions, operations and weights are used in combination with
declarative inference rules and formal reasoning to calculate possible relations
and dependencies between data structures and software components. Applying
the rule system to the extracted query graph, we calculate and produce the lineage
and impact graphs that are used for data lineage and impact analysis.
First, we define the rule R1 to map the column-level primitive data
transformations to the data lineage graph edges, with the aggregation of multiple
paths over pairs of nodes. Let $E_{x,y} = \{e \in E_O \mid e.X = x,\ e.Y = y\}$ be the set
of edges connecting nodes $x, y$ in the graph $G_O$. The data lineage graph $G_L$ edges
are calculated by rule R1: $\forall x, y \in N:\ E_{x,y} \neq \emptyset \Rightarrow \exists e' \in E_L$ with a set of
properties:
• $e'.X = x \wedge e'.Y = y$
• $e'.M = \bigcup_{e \in E_{x,y}} e.M$
• $e'.W = \max\{e.W \mid e \in E_{x,y}\}$

An inference of this rule should be understood as creating edges $e'$ into the set
$E_L$ until R1 is satisfied.

Figure 3.3 Visual representation of the data lineage graph inference rule R1.

The filter conditions are mapped to edges in the impact graph $G_I$. Let
$F_{M,p} = \{x \mid Parent(x,p) \wedge x \text{ is a filter in } M\}$ be the set of nodes that are the filter
conditions for the mapping $M$ with parent $p$ in the database schema. Let
$T_{M,p'} = \{x \mid Parent(x,p') \wedge x \text{ is a target in } M\}$ be the set of nodes that represent the
target columns of mapping $M$. To assign filter weights to columns, we use the
function $W_f : N \rightarrow [0,1]$. The data impact graph $G_I$ edges are calculated by rule
R2: $\forall p, p' \in N:\ F_{M,p} \neq \emptyset \wedge T_{M,p'} \neq \emptyset \Rightarrow \exists e' \in E_I$ with a set of properties:
• $e'.X = p \wedge e'.Y = p'$
• $e'.M = M$
• $e'.W = avg\{W_f(x) \mid x \in F_{M,p}\}$

Figure 3.4 Visual representation of the data impact graph inference rule R2.

To propagate information upwards through the database structure, to view the
data flows on a more abstract level (such as the table or schema level), or to calculate
the dependency closure to answer lineage queries, we treat the graphs $G_L$ and $G_I$
similarly. Let $E_{p,p'} = \{e \in E \mid Parent(e.X, p) \wedge Parent(e.Y, p')\}$ be the set
of edges where the source nodes share a common parent $p$ and the target nodes
share a common parent $p'$. The aggregation of the edges to the pair of common
parents in the lineage graph $G_L$ or the impact graph $G_I$ is calculated by rule R3:
$\forall p, p' \in N:\ E_{p,p'} \neq \emptyset \Rightarrow \exists e' \in E$ with a set of properties:
• $e'.X = p \wedge e'.Y = p'$
• $e'.M = \bigcup_{e \in E_{p,p'}} e.M$
• $e'.W = \frac{\sum_{e \in E_{p,p'}} e.W}{|E_{p,p'}|}$

Figure 3.5 Visual representation of data lineage and impact graph inference rule R3.
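
Since the rules operate on stored, binary weighted edges, R3 maps naturally onto a grouped SQL query. The following sketch assumes a simplified lineage_edge table (source_id, target_id, weight) and the node table with a parent_id column from the earlier sketch; the names are illustrative, not the actual repository schema, and the aggregation of mapping sets is omitted:

-- Rule R3 sketch: aggregate column-level edges to their parent (e.g. table-level) pairs,
-- producing one edge per parent pair with the average weight of the underlying edges.
SELECT src.parent_id  AS source_parent_id,
       tgt.parent_id  AS target_parent_id,
       count(*)       AS aggregated_edges,
       avg(e.weight)  AS weight              -- e'.W = sum(e.W) / |E_{p,p'}|
FROM lineage_edge e
JOIN meta_object src ON src.object_id = e.source_id
JOIN meta_object tgt ON tgt.object_id = e.target_id
GROUP BY src.parent_id, tgt.parent_id;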

Based on the derived dependency graph, we can solve different business tasks
by calculating the lineage or impact of the selected component(s) over the available
layers and the chosen level of detail. Business questions like "What reports are using
my data...?", "Which components should be changed or tested...?" or "What is the
time and cost of change...?" are turned into directed sub-graph navigation and
calculation tasks. We calculate new quantitative measures for each component or
node from the number of sources and targets in the graph, and we use those results in
the UI to sort and select the correct components for specific tasks:

• Local lineage and impact dependency scores are calculated as a ratio over the sum
of the local source and target lineage or impact weights. Zero percent means that
no data sources were detected for the object, and 100% means that no data
consumers (targets) were detected for the object. A value around 50% means
that roughly equal numbers of weighted sources and consumers (targets) were
detected for the object.
• Global lineage and impact dependency scores are calculated as sums of the local
dependency scores over the connected source and target chains for each node.

The local dependency score for each connected node is calculated as follows:

$$LD = \frac{\sum source(W)}{\sum source(W) + \sum target(W)}$$
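
Under the same simplified edge-table assumption as above, the local dependency score can be sketched as a ratio of the summed incoming (source) and outgoing (target) edge weights per node:

-- Local dependency score sketch (multiply by 100 for a percentage):
-- 0 = no weighted sources detected, 1 = no weighted targets (consumers) detected.
SELECT n.object_id,
       coalesce(src.w, 0) / NULLIF(coalesce(src.w, 0) + coalesce(tgt.w, 0), 0) AS local_dependency
FROM meta_object n
LEFT JOIN (SELECT target_id AS object_id, sum(weight) AS w
           FROM lineage_edge GROUP BY target_id) src ON src.object_id = n.object_id  -- weights of the node's sources
LEFT JOIN (SELECT source_id AS object_id, sum(weight) AS w
           FROM lineage_edge GROUP BY source_id) tgt ON tgt.object_id = n.object_id; -- weights of the node's targets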

More details about data transformation weight, node score calculations and
rule systems are presented in articles B and C. Rule system improvements and
current formulations are presented in article D.

3.8. Semantic Layer Calculation


The semantic layer is an additional visualization and a specific filter set used to
localize the connected sub-graphs of the expected data flows for the selected node.
All connected nodes and edges in the semantic layer share overlapping filter
predicate conditions or data production conditions that are extracted during
edge construction, so that the layer indicates not all possible data flows (based on
connections in the initial query graph), but only the expected, probabilistic data flows.
The main idea of the semantic layer is to narrow down all possible data flows
over the connected graph nodes to the expected ones by cutting unlikely or
not-allowed connections from the graph, based on additional query filters, the
semantic interpretation of those filters and the calculated transformation expression
weights. The semantic layer of the data lineage graph hides irrelevant or
highlights relevant graph nodes and edges (depending on user choice and
interaction), which makes a difference when the underlying data structures are abstract
enough and independent data flows store and use independent "horizontal" slices
of data. The essence of semantic layers is to use the available query and schema
information to estimate the row-level data flows without additional row-level
lineage information, which is unavailable at the schema level and expensive
or impossible to collect at the row level.
The visualization of the semantically connected subgraph corresponding to
the selected node is created by fetching the path nodes and the edges along those
paths from the appropriate dependency graph (impact or lineage). Any nodes not
included in the semantic layer are removed or visually muted (by changing their

color or opacity) and semantically connected subgraphs are returned or visualized
in the UI.
The semantic layer calculation is based on the filter set of the selected node and is
calculated separately for the backward (predecessor) and forward (successor)
directions, using a similar recursive algorithm that searches for overlapping filter conditions.
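
The following recursive SQL sketch illustrates the backward (predecessor) direction under two simplifying assumptions: each edge carries an array of normalized filter keys, and "overlapping conditions" is approximated by array intersection. The actual implementation is the JavaScript graph traversal described in paper D, and the sketch also assumes the dependency graph is acyclic:

-- Semantic layer sketch: walk lineage edges backwards from a selected node, but only across
-- edges whose (assumed) normalized filter keys overlap with the filters seen along the path.
WITH RECURSIVE sem_layer AS (
    SELECT e.source_id, e.target_id, e.filter_keys
    FROM lineage_edge e
    WHERE e.target_id = 42                        -- example id of the selected node
    UNION ALL
    SELECT e.source_id, e.target_id, e.filter_keys
    FROM lineage_edge e
    JOIN sem_layer s ON e.target_id = s.source_id
    WHERE e.filter_keys && s.filter_keys          -- keep only edges with overlapping filter conditions
)
SELECT DISTINCT source_id, target_id FROM sem_layer;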
The illustration of the different semantics of connected data flows (see Figure 3.6)
is based on the previously presented example queries and lineage graphs (see Section
1.2). Data from the ACCOUNT and LOAN tables is integrated into one AGREEMENT
table by queries 1 and 3 (see Table 1.1), which in turn feeds two separate tables,
DEPOSIT_SUMMARY and LOAN_SUMMARY, through queries 2 and 4. This is
a typical scenario in DW or OLAP environments and data models, where
dimension and fact tables integrate data from different sources and various
queries, reports, applications or data marts use that data for different purposes.
Based only on database structures and query mappings, we can see how such hub
tables integrate all dimension or fact sources and feed all their targets. In other
words, we can see and visualize all possible data flows based on query mappings.
To distinguish all possible data flows from the actual flows based on query conditions
and restrictions, we have to go deeper into query condition analysis to track
the semantics of the data flows.
[Figure 3.6 diagram: the ACCOUNT and LOAN tables feed the AGREEMENT table, which in turn feeds the DEPOSIT_SUMMARY and LOAN_SUMMARY tables; the flows are annotated with overlapping filter conditions such as AGREEMENT.Agreement_Type = 'A' or 'L', AGREEMENT.Agreement_State = 2, BALANCE.Balance_Date = DATE-1 and AGREEMENT.Agreement_Nbr = BALANCE.Agreement_Nbr.]
Figure 3.6 Semantic layer illustration for two independent data flows based on
overlapping query conditions.

When comparing the mapping and filter predicate conditions of queries 1-4, we
can see two separate data flows going into the AGREEMENT table and two
separate flows moving out to the DEPOSIT_SUMMARY and
LOAN_SUMMARY tables. The data in the AGREEMENT table has the same
structure, but different sources and possibly different semantics. The intersection
or overlap in query conditions allows us to notice separate slices of filtered
subsets in integrated structures, and such semantic analysis and matching of
normalized query conditions allows us to make rule-based inferences about actual
data flows. Queries 1 and 2 are dealing with the same data slice and are
transforming it from the ACCOUNT to the DEPOSIT_SUMMARY table, and
queries 3 and 4 are dealing with the same data slice and are transforming it from
the LOAN to the LOAN_SUMMARY table. Those two different data flows in
Figure 2.3 are marked with different colors (blue and green).
We can conclude the example by stating that to answer the data lineage
questions more precisely we need to look into query semantics in addition to
structural mappings. The semantic analysis of query conditions and recursive
conditions overlapping search allows us to detect more likely data sources and
flows than all possible sources and flows. We can make probabilistic decisions
about row level (or set of rows) data flows using database and query metadata
without interfering with the work of the actual system.
The details and recursive graph traversal algorithm descriptions of the
semantic layer are published in paper D.

3.9. Summary
This system design chapter gave a high-level methodological and technical
overview of the designed and implemented system components, their functions and
their form. The system architecture follows the methodological pathway that is
defined on a conceptual level in section 3.1. The metadata database design was
described in section 3.2, and the different semantic models (metamodels) for
databases, data integration, business intelligence and generalized mappings
metadata were described in section 3.3. The metadata capture and scanners were
described in section 3.4. More details about the underlying foundation and
mapping design can be found in article A. A discussion of logical paths with
query parsing and resolving techniques continued in section 3.5, with data
transformations evaluation and weight calculation in section 3.6. The
implemented rule system for graph construction and calculations was discussed
in section 3.7 and the semantic layer on top of calculated graphs was discussed
in section 3.8.

4. IMPLEMENTATION AND APPLICATIONS
This chapter presents an overview of the actual implementation along with
real-life experiments and relevant statistics. The developed software components
and applications are introduced in section 4.1. A system performance
evaluation based on six different real-life datasets, together with performance
overview details, is presented in section 4.2. Special attention has been given to the
dataset visualization techniques presented in section 4.3. Details of the
visualization methods are published in papers C and D. Possible additional
application areas are discussed in section 4.4.

4.1. dLineage.com
The previously described architecture and algorithms have been used to
implement the dLineage37 toolset for data lineage and impact analysis in real
organizations. dLineage is packaged as web-based software as a service (SaaS)
or a local appliance, prepackaged and configured as a virtual machine (VM) with
all the vital components included, such as scanners, parsers and calculation
engine, metadata database and web-based user interface with multiple
applications. The web-based tools are divided into different applications for
different user groups:
 The technical application for metadata management, browsing and
navigation to keep track of the source systems content and interconnection
with all the available technical details.
 The analytical application for data lineage and impact analysis, data sources,
targets and data flow visualizations.
 The business applications for technical metadata management with the help
of a connected classification system, business glossary or ontology, and data
or business governance with the help of domains, role system and
responsibilities.

The scanners and web-based tools of dLineage have been extended and tested
in real-life projects and environments to support several popular DW database
platforms (e.g., Oracle, Greenplum, Teradata, Vertica, PostgreSQL, MsSQL,
Sybase), ETL tools (e.g., Informatica, Pentaho, Oracle Data Integrator, SSIS,
SQL scripts and different data loading utilities) and BI tools (e.g., SAP Business
Objects, Microstrategy, Microsoft SSRS etc.). The dLineage database is built on
PostgreSQL, using an open schema data modeling approach and predefined
metamodels, described in sections 3.2 and 3.3. The rule system and dependency
graph calculation is implemented in SQL queries and stored as a specialized
relation between the scanned node objects. The current implementation uses
recursive SQL for subgraph query tasks, which works reasonably well because
of the local single-object context and the sparse nature of the dependency graph. The
number of objects in our test datasets (see section 4.2) was about 1.3 million, and
we have tested the recursive SQL approach with datasets three times bigger
without any remarkable drawbacks. We have also tested special storage and
indexing methods and in-memory database approaches as alternatives to
recursive SQL. The most promising approach would be in-memory structures
and algorithms for graph querying, which can be easily adapted and added as
application components when needed. The algorithms for interactive transitive
calculations and semantic layer calculation (see sections 3.7 and 3.8) are
implemented in JavaScript and work in browsers for small and local subgraph
optimization and visualization. Visualization of data lineage and impact flows is
built using the d3.js graphics library in combination with Sankey38 diagram
techniques. Additional information can be found on our dLineage39 online demo
site, and more technical details are in article D.

37 http://dlineage.com
The general idea of capturing and visualizing data flows in an organization's
DW ecosystem is drawn in Figure 4.1. The idea of the Sankey-diagram
visualization is to align all the data sources (e.g., files, interfaces or tables in a source
database) on the left side, all the final data consumers (ending targets, like reports,
export files, API interfaces, etc.) on the right side, and all other structures and
components between them (depending on sources and targets). Figure 4.1
illustrates a traditional DW environment with several data transformation layers
(e.g., source, staging, storage, access and applications) using a small subset of
Human Resource Management System (HRMS) data structures. The data
structures and data are copied one-to-one from the source to the DW, and data
transformations are built on the access view layer in this simplified example.
Real-life DW environments are usually much more complex, with different
modeling paradigms (e.g., ODS, dimensional, 3NF or hybrid), which means there
will be data restructuring and transformations in almost any layer or stage of the data
flow.

Figure 4.1 Data lineage visualization example in a DW environment using a Sankey diagram.

38 https://en.wikipedia.org/wiki/Sankey_diagram
39 http://www.dlineage.com/
The Analytics application in the dLineage toolset was designed for data
lineage and impact graph navigation and visualizations. In the Analytics
application, there are two built-in alternative data representation formats: table
and graph view; and two complementary content representations: data lineage
and impact view. The table view consists of two parts for each selected object:
dependent sources and dependent targets, which represent the list of objects that
are detected as a source or a target in the context of the current focus. Figure 4.2 is an
illustration of one report object in a financial reporting hierarchy with more than
a hundred different sources (and no targets) that are connected to one report. The
table view shows the data lineage or impact graph with calculated metrics (e.g.,
distance, number of queries, number of sources and targets) and is sorted by the
most influential objects first.
The graph view in Figure 4.4 represents the same information about connected
sources and targets using a clickable and zoomable Sankey diagram, but in
contrast to the flattened table view, the graph view is stretched out from sources
to targets and rendered from left to right with all levels and distances clearly
visible.

Figure 4.2 dLineage sub-graph table view, source and target objects with calculated
metrics.

At the same time, the content filters for the lineage and impact graphs, based on the
graph calculation rules (see section 3.7), produce two different dependency relations:
lineage (based on the data transformation rules R1 or R3) and impact (based on the data
impact rules R2 or R3). Based on the lineage or impact content filters, the user can see
and switch between a direct data lineage graph and a dependent component graph. The
latter also contains impact graph data that is used for data filtering, joining or coding
and that does not contribute directly to the target structures. In Figure 4.3 and Figure 4.4,
one can see the impact view with two or three colored dependency lines, where direct
data transformations are in gray and indirect impact dependencies are in red. Both
representations and content filters have their own aspects to emphasize, they complement
each other in combination, and together they support the lineage and impact analysis tasks.

Figure 4.3 dLineage sub-graph graphical view, selected object with all connected
targets.

Figure 4.4 dLineage sub-graph graphical view, selected object with all connected
sources.

The other applications in the dLineage toolset are built to support related activities:
managing metadata scanners, browsing and searching the collected data, managing
system state and health, analyzing discovered dependencies, managing and governing
corporate information assets, and collecting and building business glossaries and
definitions to give meaning to IT assets. In addition to the technical, analytical and
business applications, we collect and calculate various measures to estimate system
health, integrity, graph connectivity, parse rate and errors, business coverage, etc.
Figure 4.5 illustrates the dashboard functionality of the dLineage toolset that visualizes
the collected measures and data.

Figure 4.5 The dLineage dashboard gives an aggregated overview of the collected metadata,
calculated results and metrics.

4.2. Performance Evaluation


We have tested our solution in several real-life case studies involving a
thorough analysis of large international companies in the financial, utilities,
governance, telecom and healthcare sectors. The case studies analyzed thousands
of database tables and views and tens of thousands of data loading scripts and BI
reports. Those figures are far beyond the capacity limits of human analysts not
assisted by special tools and technologies.
The following six datasets of varying size have been used for our
system performance evaluation. The datasets DS1 to DS6 represent data
warehouse and business intelligence data from different industry sectors and are
ordered by dataset size (Table 4.1). The structures of the datasets are
diverse and complex; hence we have analyzed the results at a more abstract level
(e.g., the number of objects and processing time) to evaluate the system
performance under different conditions.
Table 4.1 Evaluation of processed datasets with different size and structure.

Metric                              | DS1       | DS2     | DS3     | DS4     | DS5    | DS6
Number of scanned objects           | 1,341,863 | 673,071 | 132,588 | 120,239 | 26,026 | 2,369
DB objects                          | 43,773    | 179,365 | 132,054 | 120,239 | 26,026 | 2,324
ETL objects                         | 1,298,090 | 361,438 | 534     | 0       | 0      | 45
BI objects                          | 0         | 132,268 | 0       | 0       | 0      | 0
Scan time (min)                     | 114       | 41      | 17      | 33      | 6      | 0
Number of scripts to parse          | 6,541     | 8,439   | 7,996   | 8,977   | 1,184  | 495
Number of parsed query mappings     | 48,971    | 13,946  | 11,215  | 14,070  | 1,544  | 635
Query parse success rate (%)        | 96        | 98      | 96      | 92      | 88     | 100
Query parse/resolve perf. (qry/sec) | 3.6       | 2.5     | 26.0    | 12.1    | 4.1    | 6.3
Query parse/resolve time (min)      | 30        | 57      | 5       | 12      | 5      | 1
Number of graph nodes               | 73,350    | 192,404 | 24,878  | 17,930  | 360    | 1,930
Number of graph links               | 95,418    | 357,798 | 24,823  | 15,933  | 330    | 2,629
Graph processing time (min)         | 36        | 62      | 14      | 15      | 6      | 2
Total processing time (min)         | 150       | 103     | 31      | 48      | 12     | 2

The biggest dataset, DS1, contained a big set of Informatica ETL package
files, a small set of connected DW database objects and no business intelligence
data. The next dataset, DS2, contained a data warehouse, SQL scripts for ETL
loadings and an SAP Business Object for reporting for business intelligence. The
DS3 dataset contained a smaller subset of the DW database (MsSql), SSIS ETL
loading packages and SSRS reporting for business intelligence. The DS4 dataset
had a subset of the DW (Oracle) and data transformations in the stored procedures
(Oracle). The DS5 dataset is similar but much smaller compared to DS4 and is
based on the Oracle database and stored procedures. The DS6 dataset had a small
subset of a data warehouse in Teradata and data loading scripts in the Teradata
TPT format.

Figure 4.6 Datasets size and structure compared to overall processing time.
Figure 4.7 Calculated graph size and structure compared to graph data processing
time.

The dataset sizes, internal structure and processing time are visible in Figure
4.6, where the longer processing time of DS4 is related to the very large Oracle stored
procedure texts and to loading those into the database. The initial dataset and the
processed data dependency graphs have different graph structures (see Figure
4.7) that do not necessarily correspond to the initial dataset size. DS2 has a more
integrated graph structure and a higher number of connected objects (Figure 4.7)
than DS1. At the same time, the DS1 initial raw data size is about two times
bigger than that of DS2.

Figure 4.8 Dataset processing time with two main subcomponents.

Figure 4.9 Dataset size and processing time correlation with linear regression (semi-
log scale).

We have analyzed the correlation of the processing time and the dataset size
(see Figure 4.8 and Figure 4.9) showing that the growth of the execution time
follows the same linear trend as the size and complexity growth. The data scan
time is related mostly to the initial dataset size. The query parsing, resolving and
graph processing time also depend mainly on the initial data size, and less so on
the calculated graph size (Figure 4.8). The linear correlation between the overall
system processing time (seconds) and the dataset size (object count) can be seen
in Figure 4.9.

4.3. Visualization
The Enterprise Dependency Graph examples (Figure 4.10 - Figure 4.12)
illustrate the complex structure of dependencies between the DW storage scheme,
access views and user reports. The examples were generated using data
warehouse and business intelligence lineage layers. The details are at the database
and reporting object level, not at the column level. At the column and the report
field levels, a full data lineage graph would be about ten times bigger and too
complex to visualize in a single picture. The following graph from the data
warehouse structures and user reports presents about 50,000 nodes (tables, views,
scripts, queries, reports) and about 200,000 links (data transformations in views
and queries) on a single image (Figure 4.10).
The real-life dependency graph examples illustrate the automated data
collection, parsing, resolving, graph calculation and visualization tasks
implemented in our system. The system requires only the setup and configuration
tasks to be performed manually. The rest will be done by the scanners, parsers
and the calculation engine.
The final result consists of data flows and system component dependencies
visualized in the navigable and drillable graph or table form. The results can be
viewed as a local sub-graph with a fixed focus and a suitable filter set to visualize
the data lineage path from any source to a single report with click and zoom
navigation features. The big picture of the dependency network gives a full-scale
overview of the organization’s data flows. It explicates potential architectural,
performance and security problems.

Figure 4.10 Data flows (blue,red) and control flows (green,yellow) between DW tables,
views and reports.

Figure 4.11 Data flows between DW tables, views (blue) and reports (red).

Figure 4.12 Control flows in scripts, queries (green) and reporting queries (yellow) are
connecting DW tables, views and reports.

In addition to the visualization of data flows, we have developed an
aggregated plot view of graph nodes that helps to analyze database tables, data
loading programs or reports in terms of connectedness, complexity and cost. The
main idea of the visualization is to draw a two-dimensional plot or bubble chart
with the number of connected sources and targets on the X and Y axes, which allows us
to clearly distinguish more and less connected nodes and the balance between the
number of sources and targets, or data producers and consumers. The size of a
bubble in the chart is the recursively calculated number of child objects, which expresses
the complexity of the object and its structure. The color of a bubble is calculated
as the sum of all three components (the number of sources, targets and children),
expressing the cost of the object in terms of change, development or maintenance.
The more costly objects are located in the upper right corner (see Figure 4.13
and Figure 4.14), with a bigger diameter and colored in red. The less costly
objects are located in the lower left corner and colored in blue. The color layer is
the fourth dimension of the chart, giving a quick aggregated overview of the
selected object set. The bigger and more red an object is, the costlier and more
complex it is to change. The smaller and more blue an object is, the less costly
and less complex it is to change.
The data axes, with their numbers of sources and targets, and the bubble size are
calculated and drawn on a logarithmic scale. The number of sources, targets and
child elements of the objects in the same chart can vary by several orders of
magnitude, and therefore the logarithmic scale is more suitable for visualizing
and reading the charts.
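
A sketch of how the plot metrics could be derived from the dependency graph, reusing the assumed lineage_edge and meta_object tables from section 3.7: the source and target counts give the axes, the recursive child count gives the bubble size, and their sum gives the cost color. The sketch assumes the parent links form a tree:

-- Per-node plot metrics sketch: X = number of distinct sources, Y = number of distinct targets,
-- bubble size = recursive child count, cost colour = sum of the three (drawn on a log scale in the UI).
WITH RECURSIVE children AS (
    SELECT object_id AS root_id, object_id FROM meta_object
    UNION ALL
    SELECT c.root_id, o.object_id
    FROM meta_object o JOIN children c ON o.parent_id = c.object_id
)
SELECT n.object_id,
       count(DISTINCT le_in.source_id)  AS n_sources,
       count(DISTINCT le_out.target_id) AS n_targets,
       count(DISTINCT ch.object_id) - 1 AS n_children
FROM meta_object n
LEFT JOIN lineage_edge le_in  ON le_in.target_id  = n.object_id
LEFT JOIN lineage_edge le_out ON le_out.source_id = n.object_id
LEFT JOIN children ch         ON ch.root_id       = n.object_id
GROUP BY n.object_id;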

Figure 4.13 Data Warehouse loading packages plot with number of data sources and
targets (axis), loading complexity (size) and relative cost (color).

Figure 4.14 Data Warehouse tables plot with number of data sources and targets
(axis), loading complexity (size) and relative cost (color).

4.4. Proposed Novel Applications


The previously described architecture and the dLineage toolset allow us to
address and solve different IT management tasks based on the evidence stored in the
dependency graph. In the following sections, we describe some practical use cases,
in addition to data lineage and impact analysis, that can be seen as additional
applications or plugins for the dLineage toolset.

Planning and Budgeting
ETL programming is often the most time-consuming, complex and hard-to-predict
task in enterprise DW projects, and it depends on many variables: the
analysis and quality of source data, the complexity of data mappings and
transformations, the design of the target model, etc. Estimates and budgets for such
tasks are usually based on the available input figures and expert opinions, and cannot
easily be produced without prior analysis. Automating these analysis tasks
by replacing expert opinions with traceable calculation and decision
algorithms would save money and provide decision support for ETL planning and
budgeting projects. We have successfully implemented and used an Excel-based
calculation algorithm for estimating ETL programming resources (time and
money) in several financial, retail and telecom sector DW projects. It was based
on the available input figures (i.e., the number of tables/columns to load, the number
of tables/columns to design/create/change/drop, the number of views/columns to
design/create/change/drop, the number of tasks and packages, etc.), customized
weights and constants, and calculation models that allowed us to validate and
replace the human expert opinion and speed up planning tasks. Such a model,
ability to imitate the average human expert decisions with accuracy over 90%.
When implementing a similar model on a real DW dependency graph and
bringing the existing components with their sources and target object counts,
weights and complexity measures, we can build a new evidence-based estimation
calculator. Such an approach allows us to automate and speed up the project
estimations and make it available via a web-based UI or wizard to end users such
as project managers or business experts. The planning and budgeting app allows
faster decisions assisted by connected content and might even outperform the
average expert estimation because of additional knowledge captured into the
dependency graph.
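A minimal sketch of such a weight-based estimator is given below. The input figure names, weights and the simple linear model are hypothetical placeholders standing in for the customized, per-organization calculation models described above; they are not the actual calculator.

```python
# Hypothetical weight-based effort estimator imitating the Excel model described above.
# The weights are placeholders; in practice they are calibrated per organization.
WEIGHT_HOURS = {
    "tables_to_load": 6.0,       # hours per table to load
    "columns_to_load": 0.25,     # hours per column mapping
    "tables_to_change": 3.0,
    "views_to_create": 2.0,
    "packages": 8.0,             # hours per loading package/task
}

def estimate_effort(figures, hourly_rate=80.0):
    """Translate input figures (counts of objects to load/change) into hours and cost."""
    hours = sum(WEIGHT_HOURS.get(name, 0.0) * count for name, count in figures.items())
    return {"hours": round(hours, 1), "cost": round(hours * hourly_rate, 2)}

# Example: a change touching 4 tables with 120 columns and one new view.
print(estimate_effort({"tables_to_load": 4, "columns_to_load": 120, "views_to_create": 1}))
# -> {'hours': 56.0, 'cost': 4480.0}
```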

Automatic System Documentation


Relevant system documentation is an important topic in IT systems development and especially so in the context of DW development. A crucial part of DW documentation describes the actual data mappings, transformations and loads with all their sources and targets. DW development and management can quickly become expensive and error-prone when detailed mappings and dependencies are not available. Design-time mapping documents are usually not detailed enough and are outdated by the end of ETL design and programming. Lack of time, the project setup and the tools used often do not support keeping documentation available and up to date until the end of the development phase. Automated documentation generation from the actual data transformation code or ETL metadata addresses this problem.
The toolset, which scans, parses, resolves and stores DW systems and program metadata in a unified metadata database, is a good starting point for automated documentation. The unified data mappings and the constructed dependency graphs contain all the information required to generate detailed (column-level) ETL mapping documents. A web-based user interface allows for linked and living documentation that is accurate and more usable than traditional design-time system documents.
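A minimal sketch of how such column-level mapping documentation could be rendered from a unified dependency graph is shown below. The triple-like mapping layout and the Markdown output are illustrative assumptions, not the actual dLineage implementation.

```python
# Sketch: generate a column-level ETL mapping document from dependency-graph data.
# The (source column, transformation, target column) layout is an illustrative assumption.
from collections import defaultdict

mappings = [
    ("stg.orders.amount", "SUM(amount)", "dw.f_sales.total_amount"),
    ("stg.orders.order_dt", "CAST(order_dt AS DATE)", "dw.f_sales.order_date"),
]

def mapping_document(mappings):
    """Group mappings by target table and render a simple Markdown mapping document."""
    by_target = defaultdict(list)
    for src, expr, tgt in mappings:
        table, column = tgt.rsplit(".", 1)
        by_target[table].append((column, src, expr))
    lines = []
    for table, rows in sorted(by_target.items()):
        lines.append(f"## {table}")
        lines.append("| target column | source column | transformation |")
        lines.append("|---|---|---|")
        lines.extend(f"| {c} | {s} | {e} |" for c, s, e in rows)
        lines.append("")
    return "\n".join(lines)

print(mapping_document(mappings))
```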

Enterprise Search and IT Asset Management


The overview and management of corporate IT assets is a challenging topic for many organizations. IT systems are physically separated by design or for security reasons. Integration of technical artefacts requires extra effort and tools. Different counterparties require the same data, but with different details and viewpoints highlighted, and there are few tools that support them all from one source. IT architecture, maintenance, support, development and data delivery requirements differ, and the interested parties are rarely ready to find a common solution. Enterprise asset management with connected dependencies, business terminology, full-text search, responsibilities and role systems could serve as a common solution for these different needs.
The core functionality described above provides metadata for IT systems, organized in a format suitable for full-scale IT asset management functions. The built-in Google-like full-text search makes every scanned object fast and easy to find. The business applications include functions to build up a full-scale business glossary system in a top-down or bottom-up manner, and the additional role, domain and responsibility systems allow one to implement IT asset governance applications suitable for different needs throughout an organization.
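As a simple illustration of the search side, the sketch below builds an inverted full-text index over scanned metadata objects. The object fields, the tokenization and the AND-semantics are assumptions made for the example, not the actual implementation.

```python
# Illustrative inverted index over scanned metadata objects (tables, views, reports).
# Field names and tokenization are assumptions for this sketch, not the dLineage design.
import re
from collections import defaultdict

objects = [
    {"id": "dw.f_sales", "type": "table", "text": "fact table for daily sales transactions"},
    {"id": "rpt.sales_by_region", "type": "report", "text": "sales report aggregated by region"},
]

index = defaultdict(set)
for obj in objects:
    for token in re.findall(r"[a-z0-9_]+", (obj["id"] + " " + obj["text"]).lower()):
        index[token].add(obj["id"])

def search(query):
    """Return ids of objects matching every query token (AND semantics)."""
    tokens = re.findall(r"[a-z0-9_]+", query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for t in tokens[1:]:
        result &= index.get(t, set())
    return result

print(search("sales region"))   # -> {'rpt.sales_by_region'}
```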

Auditing and Compliance Reporting


Compliance with various internal and external requirements can be critical for many organizations, and alignment with the requirements is time-consuming and costly. Specific industry sectors have their own requirement standards or mandatory governance regulations, and compliance reduces risks and business costs or is a precondition for operating in the market at all. Compliance requires auditing or certification processes, and automation of data capturing, consolidation, measurement and alignment tasks allows for cost savings and quality improvements. Examples of such regulations include the Sarbanes-Oxley Act40 for public and private companies in the US, designed to protect investors, competitors and the companies themselves; Basel III41 and Solvency II42 for capital requirements and risk regulation in the EU financial and insurance industries; and the General Data Protection Regulation (GDPR)43 from the EU for the usage and protection of personal data in online businesses worldwide.

40 https://en.wikipedia.org/wiki/Sarbanes-Oxley_Act
41 https://en.wikipedia.org/wiki/Basel_III
42 https://en.wikipedia.org/wiki/Solvency_II_Directive_2009
43 https://en.wikipedia.org/wiki/General_Data_Protection_Regulation

In order to fulfill regulations, we need to catalog the requirements in the form of a business ontology and connect IT assets, manually or automatically, with the requirements. Depending on the specific regulations, we can build a logic-based rule system and connect it with the underlying dependency graph to derive data for the requirements, to check the internal logic and consistency of the requirements, and to provide a solid, fact-based audit trail and proof of compliance.

4.5. Summary
This implementation chapter concluded the presentation of the designed and implemented software system, its performance evaluation and the visualization of the datasets. The developed software components and applications were introduced in section 4.1. The system performance evaluation based on real-life datasets, together with the performance overview details, was presented in section 4.2, and the dataset visualization was presented in section 4.3. Finally, novel further application areas were discussed in section 4.4.

CONCLUSIONS
This thesis presents novel methods, algorithms and experimental results for
practical data lineage and impact analysis. We are able to map, aid and automate
the solution of management and analysis problems in a corporate data warehouse
environment.
Automation of human-intensive analysis tasks reduces time and costs, improves quality and leads to better decisions with reduced risks. It may take a human analyst a week or two to solve a moderately complex impact analysis task. We show that this time can be reduced to hours or minutes, and that the interpretation of the results is feasible for users without the help of domain experts.
The traditional data lineage and impact analysis problems can be compared to
the internet search problem before the invention of Google. The analyst of a new
system component, functionality or business requirement had to find and read all
the relevant documents and/or code bases to trace and model the data sources and
dependencies. Our chosen approach to DW impact analysis and data lineage in a
closed corporate environment can be compared to Google’s approach to web
scanning and indexing to build a sophisticated search engine. We scan, collect
and map an organization’s IT systems and data warehouse environment, data
structures, queries, reports and programs, without using the DW data or affecting
the normal work and behavior of those systems.
Processing and mapping the collected data to an RDF-style database schema creates a unified physical base for data storage. The unified data representation allows us to define and implement a set of formalized rules to build weighted and directed dependency graphs. Probabilistic weight calculation during query parsing and weight propagation by the rule system bring the data transformation semantics into the graph for further use. The weights are used for node dependency and transitivity calculations, for layer visualization, filtering and object sorting. The weight system is also used in the semantic layer calculation to visualize only the applicable data flow subgraphs for each selected node.
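To make the weight propagation concrete, the sketch below derives transitive dependency weights on a small directed, weighted graph by multiplying edge weights along a path and keeping the strongest path per reachable node. This simple max-product rule is an illustrative stand-in for the rule system described in the thesis, not its exact formulation.

```python
# Illustrative max-product propagation of dependency weights over a directed graph.
# Edge weights and the propagation rule are simplified stand-ins for the thesis' rule system.
edges = {  # source node -> {target node: dependency weight in [0, 1]}
    "src.a": {"stg.a": 1.0},
    "stg.a": {"dw.fact": 0.8, "dw.dim": 0.3},
    "dw.fact": {"rpt.sales": 0.9},
}

def transitive_weights(start, edges):
    """Return the strongest derived dependency weight from `start` to every reachable node."""
    best = {start: 1.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt, w in edges.get(node, {}).items():
            candidate = best[node] * w          # combine weights along the path
            if candidate > best.get(nxt, 0.0):  # keep only the strongest path
                best[nxt] = candidate
                stack.append(nxt)
    best.pop(start)
    return {node: round(weight, 3) for node, weight in best.items()}  # round for readable output

print(transitive_weights("src.a", edges))
# -> {'stg.a': 1.0, 'dw.fact': 0.8, 'dw.dim': 0.3, 'rpt.sales': 0.72}
```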
We have implemented all the algorithms described in the thesis and built the web-based dLineage software toolkit for browsing, analyzing and visualizing the collected and calculated data. The toolset, algorithms and techniques have been successfully employed in dozens of case studies and projects. The presented case studies and the performance analysis with six different real-life datasets demonstrate that our algorithms and implementations scale linearly.
We will continue our research and system development in the field of business
semantics and governance automation to employ the underlying dependency
graph in combination with semantic techniques and ontology learning.
Combining different techniques to automate business definitions management
and IT asset governance will hopefully allow us to fill another gap in the
corporate knowledge and asset management landscape.

KOKKUVÕTE
The topic of this doctoral thesis is the analysis of data flows and of the components that implement them, and the automation of this process in an enterprise data warehouse environment. The aim of the work is to create a universal methodology, algorithms and a software solution that can be applied with little effort to automate data lineage and impact analysis in an already existing environment. The basic principles of the methodological approach are to map a working data warehouse environment without affecting its operation and without using the data processed in the data warehouse systems. Such an approach requires collecting and processing the metadata of data warehouse structures, programs and reports, and it allows the solution to be applied at minimal cost in an already operating environment, without affecting its work and without needing access to sensitive data.
The architecture of the created system includes the use of a database with a dynamic and flexible structure for storing various metadata, the creation of metadata collection and processing programs built from modular and reusable components, and web-based applications for different user groups for performing analysis and visualizing the data. The described semantic methods and the rule-based and probabilistic inference system help to construct a directed graph based on the inputs and outputs of structures and programs, which allows analysis tasks over data structures and data flows to be transformed into subgraph traversal and computation tasks.
The software described in this work has been tested for analyzing and visualizing the data warehouses of dozens of international companies. An overview of the datasets and of the system's performance is given in the last chapter of the thesis. In summary, we show that the chosen system architecture, algorithms and methods are suitable for metadata-based analysis of data warehouses of very different domains, sizes and contents, and that the performance of the described system components scales linearly with the volume of the source data.

Publication A

Tomingas, K.; Kliimask, M.; Tammet, T. Data Integration Patterns for Data
Warehouse Automation. In: New Trends in Database and Information Systems
II: 18th East European Conference on Advances in Databases and Information
Systems (ADBIS 2014). Springer, 2014.

Publication B

Tomingas, K.; Tammet, T.; Kliimask, M. Rule-Based Impact Analysis for Enterprise Business Intelligence. In: Artificial Intelligence Applications and Innovations (AIAI 2014), IFIP Advances in Information and Communication Technology. Springer, 2014.

Publication C

Tomingas, K.; Tammet, T.; Kliimask, M.; Järv, P. Automating Component Dependency Analysis for Enterprise Business Intelligence. In: 2014 International Conference on Information Systems (ICIS 2014).

Publication D

Tomingas, K.; Järv, P.; Tammet, T. Discovering Data Lineage from Data Warehouse Procedures. In: 8th International Joint Conference on Knowledge Discovery and Information Retrieval (KDIR 2016).

CURRICULUM VITAE
Personal data
Name: Kalle Tomingas
Date of birth: 22.08.1973
Place of birth: Pärnu, Estonia
Citizenship: Estonia

Contact data
Phone: +372 5040568
E-mail: [email protected]

Education
2008 – 2018 Tallinn University of Technology, PhD
1991 – 2000 Tallinn University of Technology, MSc
1989 – 1991 Pärnu Ülejõe Gymnasium, high school

Language competence
English Fluent
Russian Communication
Estonian Native language

Professional employment
2017– … Orion Information Governance, Chief Data Scientist
2005–2017 Mindworks Industries, Consultant
2011–2015 ELIKO Technology and Competence Center, Researcher
2012–2012 Marie Curie Research Fellow in Technical University Graz
1999–2005 Swedbank (Hansabank), Architect
1993–1998 Forexbank (Raebank), Manager, Architect

ELULOOKIRJELDUS
Personal data
Name: Kalle Tomingas
Date of birth: 22.08.1973
Place of birth: Pärnu, Estonia
Citizenship: Estonian

Contact data
Phone: +372 5040568
E-mail: [email protected]

Education
2008 – 2018 Tallinn University of Technology, PhD
1991 – 2000 Tallinn University of Technology, MSc
1989 – 1991 Pärnu Ülejõe Gymnasium, secondary education

Language competence
English: fluent
Russian: conversational
Estonian: native language

Professional employment
2017– … Orion Information Governance, Head of Research and Development
2005–2017 Mindworks Industries, Consultant
2011–2015 ELIKO Technology and Competence Center, Researcher
2012–2012 Marie Curie Research Fellow at Technical University Graz
1999–2005 Swedbank (Hansapank), Architect
1993–1998 Forexpank (Raepank), Head of IT, Architect
