Data Versioning For Graph Databases
Master’s Thesis in Computer Science
Abstract
This thesis investigates the extent to which delta-based version control can be used to manage the versions and historical records of graph databases. The delta encoding is represented as a set of queries that reconstruct one version from another. We present Grit, a Git-style command-line tool for data versioning in graph databases. We investigated two variants of our approach, with and without leveraging snapshots of the data in the process. The experimental results show that the tool can be used to view the historical records of the data, compare two versions of the data, and track the changes to every entity in the database.
We evaluated the tool by comparing its performance with other graph version control tools. The evaluation results showed that version control for graph databases can be done with queries as the delta encoding. However, this approach carries some additional costs. First, the delta object size increases linearly with the number of changes in a version. We also found that using queries as the delta to reconstruct versions can take a significant amount of time if the number of changes is large. While leveraging snapshots can reduce the time to materialize a version, it also increases the storage cost of saving the delta objects and snapshot objects. A recommendation for future studies in data versioning is to explore and improve the delta encoding.
Thesis Committee: Asterios Katsifodimos, Alessandro Bozzon, Maurício Aniche
Preface
This thesis marks the completion of my study at Delft University of Technology. It was written to fulfill the requirements of the MSc Computer Science. It was a fantastic journey with many ups and downs, and I would like to thank the Almighty God, to whom I pray for strength and wisdom. Throughout the completion of this master’s thesis, I have gained a great deal of knowledge and experience, particularly in scientific research. I would like to acknowledge the people who have helped and supported me, without whom this thesis would not have been finished.
I would like to express my sincerest gratitude to my thesis supervisor, Asterios
Katsifodimos, for accepting me to work on this thesis. I have gained many insights
from the discussions with him throughout the course of this thesis. I am grateful
for his guidance, feedback, and patience. I would also like to acknowledge the other members of the committee, Alessandro Bozzon and Maurício Aniche. Without them, I would not have been able to defend this thesis.
I would also like to acknowledge and express my gratitude to LPDP (Indonesia
Endowment Fund for Education) for the scholarship to pursue my study at TU Delft.
I would like to thank my parents and my family for their unconditional love and support and for always being there for me. I want to thank my fellow computer science students from Indonesia: Helmi, Reza, Romi, Andre, Sindu, and Gilang for their companionship. Lastly, I would like to take this opportunity to acknowledge every person I have been in contact with, friends, colleagues, and professors, during my study at TU Delft.
Thank you.
Contents
Abstract
Preface
1 Introduction
1.1 Version control for data
1.2 Challenges
1.3 Research questions
1.4 Contributions
1.5 Thesis outline
5 Evaluation
5.1 Evaluation setup
5.2 Evaluation metrics
5.2.1 Commit time
5.2.2 Checkout time
5.2.3 Storage cost
5.3 Evaluation results
5.3.1 Commit time
5.3.2 Checkout time
5.3.3 Storage cost
6 Conclusions
6.1 Conclusions
6.2 Future Works
Bibliography
Chapter 1
Introduction
In this introductory chapter, we present the background that motivates this thesis in Sections 1.1 and 1.2, followed by the research questions, sub-questions, and contributions in Sections 1.3 and 1.4. Finally, the outline of this document is presented in Section 1.5.
and assign the version during their analysis. This practice is essential to track the
changes in the dataset and the processes that introduce these changes, thus keeping
the lineage or the provenance of the dataset.
Data can also come in many formats. A commonly used one is tabular data in the form of a CSV file or spreadsheet. Relational database systems, such as MySQL and PostgreSQL, also save data in tabular format. However, datasets exist in other forms, such as graph-based, document-based, and key-value databases. These databases, commonly called NoSQL databases, are usually saved in binary format.
A data analysis pipeline is often carried out by a team, where each member performs different tasks and introduces different modifications to the dataset, depending on the processes and the goals of the tasks. In this scenario, there will be several branches of the dataset, and to integrate them back, there needs to be a mechanism or system to manage these various branches (Maddox et al., 2016).
An example of the process is a team of scientists who want to perform a relationship analysis with data from Twitter. Following the CRISP-DM model, a frequently used data analysis pipeline model illustrated in Figure 1.2, one of the steps in the pipeline is data understanding (Chapman et al., 2000). After acquiring raw data from Twitter, each member of the team can have different tasks and can introduce modifications to the raw data before performing the actual analysis. For example, one member performs data cleaning and normalization. A team member then modifies the cleaned data for feature extraction, while another member edits the data for visualization. These tasks can result in several versions of the dataset with different changes from each team member.
A common approach to track the lineage and assign versions of a dataset is to create a snapshot of the current state of the data in each process (Maddox et al., 2016). In this practice, a copy or a version of the dataset is created per process and renamed differently, for example, data_01.txt, data_02.txt, data_for_visualization.txt. In this case, there is no associated information about the relationship between these copies, i.e., no information about which dataset is the earlier version or which process transforms one version into the next. For a small pipeline consisting of a few processes, this practice may be acceptable, because there will not be too many versions and it will be faster to recreate or retrieve different versions. However, for a more complex pipeline with larger datasets, this practice will create a considerable amount of data and result in wasted storage. Furthermore, by creating snapshots, it is possible to recreate previous versions of the data, but it is not possible to compare the changes or track the lineage between versions.
In recent years, there has been emerging research on data versioning with the capability to compare the differences between versions and to track the individual who authorized the changes. Studies such as Huang et al., 2017, Bhardwaj et al., 2015, and Maddox et al., 2016 explore version control for relational databases, including the materialization/storage trade-off. There is, however, limited research on how to do dataset versioning for non-relational databases.
1.2 Challenges
Based upon the aforementioned background and the nature of dataset versioning,
we can identify some challenges as follows:
3. Delta representation
Based on how data and versions are encoded, there are two common approaches to data versioning: snapshot-based version control and delta-based version control. In delta-based version control, after the changes to the data are detected, the delta between two versions needs to be stored. A delta can be described as a list of changes from one version to another, with instructions on how to generate one version of the dataset from the other. The difference between delta-based and snapshot-based version control is that by storing deltas, the modifications between two datasets can be calculated. Because of the various formats of datasets, a suitable delta representation needs to be determined so that the retrieval of different versions can be efficient.
To answer the main research question, the research sub-questions are formulated
as follows:
RQ1: What is the current state of the art regarding the techniques and tools for dataset ver-
sioning?
This question aims to explore the previous work on data versioning. The explanations of the existing approaches and tools can be found in Sections 2.4 and 2.5.
RQ2: Can we design a version control tool for graph databases with queries as the delta?
Concerning the challenges of data versioning, there are many kinds of data types, and we focus on graph databases, since this type of database is claimed to capture the relationship information of data better than other databases. According to Bhattacherjee et al., 2015, in databases, SQL queries can be used to compare and reconstruct tuples or records from one version to another. We would like to investigate whether it is possible to represent the delta as queries.
RQ3: How can the changes in graph databases be detected and represented?
One of the essential steps to manage the versions of databases is to detect and store
the changes in the database. These changes can be used to compare the difference
between versions and to retrieve one version from the other.
RQ4: Can we improve the performance of the version control tool by leveraging the use of
snapshots?
Delta-based data versioning stores the difference between versions of data instead of a full snapshot of each version. In doing so, there are additional costs to retrieve a version. For example, to check out a version, the delta needs to be executed. This execution can take longer as the number of changes grows. In snapshot-based data versioning, on the other hand, a snapshot of the data is created for each version. The advantage of this approach is that checkout is expected to be faster regardless of the number of changes, because the user can restore the version from the snapshot. We compare the performance of the data versioning tool with and without leveraging the use of snapshots.
1.4 Contributions
The main contributions of this thesis are as follows:
1. A literature survey of recent research regarding version control for data
Chapter 2
This chapter discusses the description and a brief history of version control for data, along with various specific tools and techniques related to data versioning. This chapter aims to provide an answer to the first research sub-question:
RQ1: What is the current state of the art regarding the techniques and tools for dataset versioning?
The applications of version control for code and for data are elaborated in Sections 2.1 and 2.2, followed by the common features of version control systems in Section 2.3. The existing research and tools for data versioning are further elaborated in Sections 2.4 and 2.5, followed by the summary in Section 2.6.
The concept of the temporal database has been introduced and studied by Tansel et al., 1993 and Ozsoyoglu and Snodgrass, 1995, among others, to capture time information in the database, as opposed to a conventional database where only the current state of the data is stored. A temporal database saves information about the timeframe of a record, namely valid time, the period during which a particular record is true in the modeled world, and transaction time, the period during which a record is considered current in the database. This concept has been implemented as a feature in major relational database management systems, such as OracleDB, Microsoft SQL Server, and MariaDB.
Several studies, such as Cheney, Chiticariu, and Tan, 2007 and Karvounarakis, Ives, and Tannen, 2010, explore the concept of provenance in databases along with how to query the provenance information. Storing provenance information in a database allows users to track the lineage of data, for example, where a tuple in a database comes from and how it relates to its base tuples. In short, provenance information can be used to track the lineage of changes in the database from its creation to its current state.
Data versioning also relates to the practice of archiving databases. Buneman et al. described database archiving as the practice of storing past versions of a database that are used in data analysis so that the results of the analysis can be verified (Buneman et al., 2004). The authors argued that since a given analysis is based on a specific version of a database, it is essential to keep track of the previous states of the data, as the database always evolves with additions, corrections, and deletions of records. In addition to the ability to analyze older versions of the database, database archiving makes it possible to query historical data (Müller, Buneman, and Koltsidas, 2008).
According to Buneman et al., 2004, there are several approaches to archiving data. The first is to create a snapshot or version for every significant transformation of the data. By creating snapshots, it is possible to retrieve past versions of the data should the need to analyse them arise. The disadvantage of this approach is that it can result in wasted storage, as both small and large changes produce a full snapshot of the data. Furthermore, it is impractical to compare the differences between versions by using snapshots alone.
Another approach to archiving data is storing the delta, which describes the list of changes between two consecutive versions. For text-based data, this can mean storing the line-by-line differences between the same file in different versions, while for a database, this can mean storing the script or SQL to generate one version from the other. Saving the delta means that less storage is used compared to saving a full snapshot. The difference between two versions can also be compared using the delta or a sequence of deltas. However, if the number of changes is huge, retrieving a version by using the delta can be time-consuming.
Recent studies, such as Bhardwaj et al., 2014 and Maddox et al., 2016, explore versioning of complex data, i.e., relational databases, with the ability to retrieve previous versions and compare the differences between versions. These approaches introduce methods to store deltas efficiently, along with a user interface in which the user can interact with Git-style commands. The authors claimed that the challenges for these approaches are two-fold: recording deltas and encoding the version graph.
From the existing tools and research, there are several approaches to addressing data versioning challenges. One approach links the data to the pipeline alongside the code. This approach is commonly used in machine learning and data engineering pipelines, where the data are stored in certain storage and linked to the pipeline via script/code. This approach is effective for repeatable processes with common dataset formats. However, not all data types are suitable for this approach.
Another approach is to manage the versions of the data separately from the code and the pipeline. In this approach, a new version of the data is created regardless of the pipeline, for example, based on the current time or on a certain number of records added to the database. The analysis then uses a certain version as one of its dependencies.
In summary, the main objective of data versioning is to manage the versions of data for reproducibility purposes. Given a set of versions V, we want to be able to retrieve or recreate a target version, say Vj, from its predecessor version, say Vi. The version can then be used as a data dependency for data analysis or data science processes.
According to Bhattacherjee et al., 2015, the delta between two consecutive versions can be calculated in several ways.
• Text-based files can be compared line by line with the Unix diff command.
• Database systems can employ SQL queries to compare and reconstruct tuples or records from one version to another.
• Tabular data, which consists of rows, columns, and cells, can use cell-wise differences to compare the changes between the same rows in subsequent versions.
In a database system, one way to store the delta is to save it in external files, for example, by storing serialized changes from source version Vs to target version Vt in JSON files. Another way is to store it inside the database itself, for example, by adding additional columns to represent the version or by saving the version in separate tables.
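To make the first option concrete, the following Python sketch serializes a delta from Vs to Vt into an external JSON file; the field names and the sample changes are hypothetical, not a fixed format.

import json

# A hypothetical serialized delta from source version Vs to target version Vt.
# Each change carries enough information to apply it (forward) or undo it
# (rollback), since both old and new values are recorded.
delta = {
    "source_version": "v1",
    "target_version": "v2",
    "changes": [
        {"action": "insert", "table": "users", "key": 42,
         "new_value": {"name": "alice"}},
        {"action": "update", "table": "users", "key": 7,
         "old_value": {"name": "bob"}, "new_value": {"name": "robert"}},
    ],
}

# One delta file per pair of consecutive versions.
with open("delta_v1_v2.json", "w") as f:
    json.dump(delta, f, indent=2)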
• checkout: Change the current working directory to a specific version that was
already committed before, for example, going back from version Develop a fea-
ture part 2 to Develop a feature part 1.
• diff: Compare two committed versions, or compare a version against the current working directory. For example, inspecting the additions and updates from version Develop a feature part 1 to Develop a feature part 2.
• branch: Create a new branch from an existing branch in the version graph.
This is done, for example, when a user wants to work on his/her own version
and explore some modifications towards the files but does not want to change
it directly in the main branch. In the figure, the grey line is the main or master
branch. The blue line is the develop branch made from the master branch, while
the yellow line is the myfeature branch made from the develop branch.
• merge: Merge the changes made in a branch into the parent branch. This is illustrated as the blue develop branch being merged into the grey master branch.
2.4.1 Taverna
Taverna is a workflow system designed for the integration of various tools and resources in bioinformatics (Oinn et al., 2004; Hull et al., 2006). It was also developed due to the need for seamless communication between the diverse analysis tools and web services from third-party providers. Taverna acts as a user interface to create and run workflows over these resources.
Lineage information in a Taverna workflow consists of the source data and the metadata related to the processes and the subsequent results in the workflow. The lineage of data can be used, for example, as an audit trail to determine the source of an error in a workflow or to determine the source of data. Another use of lineage information is for reproducibility and reusability purposes.
month. Therefore, it is possible to retrieve the version from last month from the list
of previous commits.
VCS systems are designed for unstructured data and compare changes on a line-by-line basis. For text-based data with small changes, using Git can be sufficient. However, VCS systems are not optimized for semi-structured or structured data, for example, tabular or serialized data (CSV, XML, JSON), since a change in one cell of a table is considered a change to the whole row. Furthermore, for complex data, such as database systems, VCS systems do not know how to calculate the delta since the files for database systems are usually in binary format.
There are some plugins for Git that are designed to handle large files in a Git repository, for example, Git LFS 3 and git-annex 4. They work by committing symlinks to the dataset into the repository. The actual files are instead stored in external storage, for example, AWS S3 or an LFS server.
By using Git for data versioning, it is possible to track the lineage of data through the Git version graph, although there can be some limitations. Furthermore, for a dataset in a binary format, it is not possible to compare the changes from one version to another.
The data structure in the array-oriented database is in the form of arrays with a predefined schema. During the creation of a record, a schema is defined that includes the data type of each element in the array along with the dimensions of the array.
The operations for version control, including the creation of a new version and the retrieval of past versions, are provided in the form of SQL-like queries. To create an updatable array, for example, a user can issue the SQL command:
CREATE UPDATABLE ARRAY Example (A::INTEGER) [I=0:2, J=0:2];
When inserting a new array, the operation takes one of the three forms of pay-
load, namely dense representation, sparse representation, and delta list representa-
tion. In a dense representation, the value of every cell in the array is specified even
though some of them are empty. In a sparse representation, a list of (dimension, at-
tribute) pairs is provided, together with the default value for undefined dimension
values. A delta list representation contains a parent version of the record along with
a list of (dimension, attribute) pairs that are changed. After data is inserted, the user
can explicitly define the version, i.e. commit, by issuing a query to the database.
To handle a new version of an array in the database, for example, changing some
of the attributes, there are several options to store these changes. The first one is
to fully materialize the array whenever changes happen. The array in each version
is chunked and compressed for more efficient storage. The second one is to calcu-
late the delta encoding, that is the cell-wise differences between arrays of the same
dimension.
When a new version of an array is loaded into the database, the system computes
the materialization matrix to determine the best encoding and representation for the
new version with respect to the previous versions. The latest version of the database is always materialized since, in many cases, it is accessed most often.
Retrieval of previous versions is also done using an SQL-like query. In the paper, the authors gave the example query SELECT * FROM Example@'1-5-2011'; to retrieve the version that was created on January 5th, 2011.
To store the list of versions, the system employs a graph representation in which the vertices denote the versions and the edges denote the delta encodings. The system uses an undirected graph to represent the version graph, since it can materialize in both directions by applying or subtracting a delta.
DataHub
Figure 2.4 shows the architecture of DataHub. At a high level, the system uses a "schema-later" data representation. It allows data to be represented in many forms, from textual blobs to key-value pairs to structured records. The main abstraction that the system uses is that of a table containing records, in which every record has a key and an arbitrary number of typed, named attributes associated with it. For unstructured data, a single key can be used to refer to an entire file.
Another abstraction that the system uses is the dataset. A dataset consists of a set of tables, along with any correlation/connection structure between them (e.g., foreign key constraints).
The versioning information is maintained at the level of datasets in the form of a version graph: a directed acyclic graph whose nodes correspond to the datasets and whose edges capture relationships between versions, as well as the provenance information that records user-provided and automatically inferred annotations for derivation relationships between data. Whenever a version of a dataset is created in DataHub, users or applications may additionally supply provenance metadata that shows the relationship between two versions. This metadata is designed to allow queries over the relationships between versions.
The paper mentioned two challenges in developing DataHub. The first challenge is how to calculate deltas between two versions of a dataset so that one version can be retrieved by using the other version and the delta. Since the dataset is represented as tables and records, versions are created explicitly via INSERT/DELETE commands (Bhardwaj et al., 2015). However, for very large datasets and large changes between versions, the comparison can be slow.
The other challenge is version graph encoding. The authors argued that there is a trade-off between materializing specific versions and storing only deltas from previous versions, since following deltas might take a considerable amount of time to execute.
Decibel
Decibel is a relational storage system with built-in version control. Decibel represents a dataset as a table. It is a key component of the DataHub (Bhardwaj et al., 2015) versioning system. There are three representations of data, namely tuple-first, version-first, and hybrid. Figure 2.5 illustrates how Decibel handles data storage and versioning capability.
OrpheusDB
OrpheusDB (Huang et al., 2017) is a dataset version management system that is built on top of relational databases. OrpheusDB is designed to get the benefit of both the sophisticated querying of relational databases and versioning capabilities. The authors claimed that OrpheusDB supports both Git-style version control commands and SQL-like queries. Figure 2.6 illustrates the delta encoding in OrpheusDB.
Document-oriented database
One of the categories of NoSQL databases is document-oriented, where the data is stored in documents such as XML, JSON, and BSON (Sadalage and Fowler, 2012). MongoDB and Couchbase are two examples of database systems that belong to this category. Several approaches have been proposed to manage versioning for document-oriented databases.
τXSchema (Currim et al., 2004) and τJSchema (Brahmia et al., 2016) are two approaches for versioning XML and JSON documents, respectively. In each of these approaches, a schema is proposed to handle the version history of XML and JSON documents.
Graph-oriented database
Graph-oriented database systems, such as Neo4j, ArangoDB, and HyperGraphDB, store the data as a graph. This type of NoSQL database is claimed to be suitable for representing data from social networks, organizations, and natural phenomena, as it can illustrate the relationships between data better than relational database systems (Robinson, Webber, and Eifrem, 2013).
Castelltort and Laurent, 2013 defined several criteria for versioning graph databases: it should be non-intrusive to the graph structure, distributable as a plugin, pluggable into the database at any time, and it should provide the history of nodes and relationships.
A plugin for Neo4j called neo4j-versioner-core (D’Este and Falcier, 2017) was developed to manage versions of graphs in Neo4j. It is based on the Entity-State approach, in which the identity and state of graphs are separated (Robinson, 2014). Every entity has a main unique property, while the non-unique properties are stored in another node called the state node. The main node and the state nodes are connected with a relationship. A change to the entity generates a new state node that has a relationship with the main node and the previous state node. All the functionalities are provided via SQL-like queries. Figure 2.7 illustrates how data versioning works in neo4j-versioner-core.
A study by Vijitbenjaronk et al., 2018 aimed to add time-versioning support to graph databases by using LMDB, a B-tree-based key-value store, as the backend storage. Figure 2.8 shows the overview of the time-versioned graph database.
2.6 Summary
Based on the relation to the workflow or analysis, data versioning can be categorized into two groups: linking the data to the workflow and separating the data from the workflow.
Linking the data to the workflow is useful for projects that have predefined workflows. Scientific workflow systems are practical for managing the processes and the versions of data in this category. From the perspective of the datasets, a user can create a new version by introducing a transformation process. Several systems provide the ability to explicitly define the version number, while others do this automatically. To check out a version of the data, a user can scan the workflow graph and select which version of the data to retrieve from the list of processes.
To some extent, it is possible to compare what the data look like in each step of the workflow. The comparison between data can be made at the file level. It is possible to see the state of the file before and after a transformation. However, for a file consisting of records that undergoes a transformation modifying a large number of those records, it is impractical to determine which records changed during the transformation.
Separating the data versions from the workflow is useful for projects where the manipulation of data does not always fit into a certain workflow. A relational database, for example, can have numerous client applications that constantly update the data, where the process flow in one application can differ significantly from the other applications.
Based on how data and versions are encoded, data versioning can be categorized into two groups: the delta-based approach and the snapshot-based approach. The snapshot-based approach stores the whole content of the data in each version, while the delta-based approach stores the difference between versions (the delta) to reconstruct the full version of the data.
Table 2.1 displays a comparison of the existing tools and methods for managing the versions of data. There is a considerable number of tools for managing data versions used in workflows. On the other hand, there are still limited studies on managing data versions in database systems, especially in NoSQL databases.
TABLE 2.1: A comparison of data versioning tools
Chapter 3
Figure 3.1 illustrates an example of a graph model. It depicts a social graph where a person follows another person. Each node represents a person, and each relationship indicates the direction in which one user follows another. There are three nodes with the label ‘Person’, depicted as circles, and four relationships with the type ‘FOLLOWS’, illustrated as purple directed arrows. The number of properties on each node or relationship can differ; for example, node ‘Alice’ has two properties, namely ‘name’ and ‘dob’, while node ‘Bob’ has only one property, ‘name’.
Over time, there can be changes to the graph, for example, additions of nodes and relationships and updates to the values of some properties. Figures 3.2a and 3.2b display example states of the graph, before and after the changes, respectively. Figure 3.2a can be referred to as version 1 (V1) of the graph, and Figure 3.2b as version 2 (V2). In Figure 3.2b, there is one new node ‘Dave’ and one new relationship from node ‘Charlie’ to ‘Dave’. There are also new properties on nodes ‘Alice’, ‘Bob’, and ‘Dave’, as well as a new property on the relationship from ‘Bob’ to ‘Charlie’. Given the existing versions of the graph, V1 in Fig. 3.2a and V2 in Fig. 3.2b, we want to be able to retrieve V1 from V2 and to retrieve V2 from V1.
3. Delta encoding
Delta encoding deals with how the changes, whether an insert, an update, or a delete transaction, are encoded and stored.
4. Version graph
The version graph represents the order of versions, for example, how to reach a target version Vt from a source version Vs.
1. Init: This function tells the system to start managing the versions of the active Neo4j database. All nodes and relationships that exist in the database or that will be inserted into the database will be versioned. It also instructs the system to start detecting changes in the database and storing them. For the graph example in Figure 3.2a, the ‘init’ function tells the system to track changes in all three nodes, all four relationships, and the upcoming additions of nodes and relationships to the graph.
2. Commit: It is used to create a new version of the data. The changes from the previous version are calculated and stored. Commit can be used to save the state in Figure 3.2a as "version 1", and Figure 3.2b as "version 2".
3. Checkout: This function will materialize a certain version that has been com-
mitted before. For example, if the current state of the graph is depicted in
Figure 3.2b and we want to see the state of the graph in version 1, illustrated
in Figure 3.2a, we can use checkout function with version 1 as the parameter.
4. Diff: Diff is used to compare two versions and determine, for example, which
nodes are modified, which relationships are created, which properties are added,
and which labels are removed. In the case in Figure 1, the diff command will
compare the list of changes between version 1 and version 2, and the results
are the highlighted properties.
7. Blame: When we want to learn what changes were applied to a certain node or relationship and who made them, we can use the blame function. A sketch of how these functions could be exposed on the command line follows below.
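The following Python snippet sketches a corresponding Git-style command surface with argparse; the subcommand names follow the thesis, while the flags and arguments are illustrative assumptions, not Grit’s exact interface.

import argparse

# A minimal sketch of a Git-style CLI for the functions above.
# Subcommand names follow the thesis; the arguments are assumptions.
parser = argparse.ArgumentParser(prog="grit")
sub = parser.add_subparsers(dest="command", required=True)

sub.add_parser("init", help="start versioning the active Neo4j database")

commit = sub.add_parser("commit", help="create a new version")
commit.add_argument("-m", "--message", required=True)

checkout = sub.add_parser("checkout", help="materialize a committed version")
checkout.add_argument("version_id")

diff = sub.add_parser("diff", help="compare two committed versions")
diff.add_argument("version_a")
diff.add_argument("version_b")

blame = sub.add_parser("blame", help="show the change history of an entity")
blame.add_argument("grit_id")

args = parser.parse_args()
print(args.command)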
• createdNodes: the list of nodes, along with their properties and labels, that are created during a transaction.
• assignedLabels: a node normally has one label, but it can also have multiple labels or none. This element contains the list of labels that are assigned to a node during a transaction.
• removedLabels: similar to assignedLabels, but for the labels that are removed from a node.
These properties can be used to determine the state of the graph before and after
some queries or transactions are committed and therefore determine how to recon-
struct the graph.
• entity: the type of the entity (node or relationship) that is modified. The value is either ‘node’ or ‘edge’.
• grit_id: the id of the entity that was assigned by the versioning tool.
• start_node_grit_id and end_node_grit_id: the ids of the start node and the end node of a relationship.
• action: the action that is performed on the entity. The value is one of ‘create’, ‘delete’, ‘add_label’, ‘remove_label’, ‘add_property’, ‘update_property’, and ‘delete_property’.
• key, old_value, and new_value: the key-value pair of the property that was assigned, overwritten, or removed. We store the old and new values so that it is possible to roll back to an older version.
• id: the unique identifier by which the contents of the changelog file are grouped.
The aggregate of the changelog file contains the delta of a version from Vi to Vj. To retrieve these objects, we store them in a file called the commit object or delta object, with a unique identifier called ‘version_id’.
From the delta object, we can generate a set of SQL-like queries to reconstruct Vj from Vi and to roll back from Vj to Vi. It is also possible to show the differences between Vi and Vj from the aggregated objects.
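As a minimal sketch of this query generation, the following Python function turns one aggregated changelog entry into a forward (apply) and a backward (rollback) Cypher query; only the ‘add_property’ case is shown, and the exact queries that Grit emits may differ.

# A sketch of translating one aggregated changelog entry into a forward
# and a rollback Cypher query. Only the 'add_property' case is handled.

def entry_to_queries(entry):
    if entry["entity"] == "node" and entry["action"] == "add_property":
        match = "MATCH (n {grit_id: $grit_id}) "
        forward = match + "SET n." + entry["key"] + " = $new_value"
        # Rolling back an added property means removing it again.
        rollback = match + "REMOVE n." + entry["key"]
        return forward, rollback
    raise NotImplementedError(entry["action"])

entry = {
    "entity": "node",
    "grit_id": "41e8373a-7b66-4c06-afa2-c06e1833a39c",
    "action": "add_property",
    "key": "dob",
    "new_value": "1990-01-01",
}
apply_query, rollback_query = entry_to_queries(entry)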
• Assign a unique identifier called ‘grit_id’ to each created node and relationship
The plugin will generate a changelog file containing the changes for every entity.
The following is an example of the output of the plugin.
{
  "grit_id": "41e8373a-7b66-4c06-afa2-c06e1833a39c",
  "action": "create",
  "entity": "node",
  "properties": {
    "user_id": "10263002"
  },
  "username": "neo4j"
}
{
  "grit_id": "41e8373a-7b66-4c06-afa2-c06e1833a39c",
  "action": "delete",
  "entity": "edge",
  "edge_type": "FOLLOWS",
  "start_node_grit_id": "1ae66808-7c8c-4f7f-9335-d284df75b43e",
  "start_node_label": ["Users"],
  "end_node_grit_id": "41e8373a-7b66-4c06-afa2-c06e1833a39c",
  "end_node_label": ["Users"],
  "username": "neo4j"
}
The edges in the version graph represent the deltas between subsequent versions. Given two subsequent versions denoted as (Vi)-[Δij]->(Vj), where Vi and Vj are the two subsequent versions and Δij is the delta between them, Δij stores the delta object from Vi to Vj and vice versa.
Since the terms for the components of the version graph and the Neo4j graph are similar, to differentiate them, we use nodes and relationships for the components of the graph database (Neo4j), and vertices and edges for the components of the version graph. To help with the creation of the version graph, the Python package networkx is utilized.
Figure 3.4 illustrates a version graph. Each version, denoted by a dot or vertex, has several properties, namely branch_name, version_id, commit_message, and author. The version_id in the version graph corresponds to the version_id of the delta object.
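The following is a minimal sketch of how such a version graph could be built with networkx; the vertex property names follow Figure 3.4, while the identifiers and the edge attribute are illustrative assumptions.

import networkx as nx

# Vertices are versions; edges connect subsequent versions and point at
# the delta object that reconstructs one version from the other.
G = nx.DiGraph()
G.add_node("v1", branch_name="master", version_id="v1",
           commit_message="initial import", author="alice")
G.add_node("v2", branch_name="master", version_id="v2",
           commit_message="clean user nodes", author="bob")
# Since delta objects keep both old and new values, the same edge supports
# materializing v2 from v1 and rolling v2 back to v1.
G.add_edge("v1", "v2", delta_object="delta_v1_v2.json")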
There are two functionalities in the branch function: creating a branch and switching the current branch. The create branch function takes one parameter, the name of the new branch, branch_name. Switching a branch means that the working copy of the database is set to the state of the graph at the latest version referred to by that branch. The switch branch function takes the target branch branch_target as its parameter.
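A minimal sketch of these two operations, assuming the tool keeps a heads mapping from branch names to their latest version_id and a checkout_version routine like the one described above (both helper names are hypothetical):

# Sketch of the two branch operations. `heads` maps branch_name to the
# latest version_id on that branch; `checkout_version` materializes a
# committed version (hypothetical helpers).

def create_branch(heads, current_branch, new_branch):
    # A new branch starts from the latest version of the current branch.
    heads[new_branch] = heads[current_branch]

def switch_branch(heads, branch_target, checkout_version):
    # Switching materializes the latest version referred to by the branch.
    checkout_version(heads[branch_target])
    return branch_target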
Chapter 4
This chapter discusses an approach for improving the initial delta-based versioning system for graph databases that was elaborated in Chapter 3. Leveraging snapshots in version materialization is expected to reduce the checkout time of a version where a large number of changes is involved.
a snapshot will be generated and stored under a name corresponding to Vi.
After generating the snapshot, the version graph needs to be updated to record that, in addition to the delta object, a snapshot was also generated for version Vi. To achieve this, we store the predicted checkout time as a property of the edge in the version graph. Given a sub-graph of a version graph denoted as (Vi)-[Δij]->(Vj), where Vi and Vj are two subsequent versions and Δij is the delta between them, we assign the following attributes to Δij.
1. Find the path P from the source version Vs to the target version Vt.
2. From the list of versions in P, find all versions that have a snapshot; we call this set Vsnapshot. The current version Vs and the version at the root of the version graph Vroot are also included in Vsnapshot.
5. Otherwise, materialize Vn from the snapshot, and check out the target version Vt with Vn as the source version (a sketch of this selection follows below).
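A minimal sketch of this snapshot-aware starting-point selection on a networkx version graph; has_snapshot and predicted_cost stand in for the snapshot flag and the predicted checkout time stored on the graph, and are hypothetical helpers.

import networkx as nx

# Choose the cheapest starting point for checking out v_target: the
# current version, the root, or any snapshotted version on the path.

def choose_start(G, v_source, v_target, v_root, has_snapshot, predicted_cost):
    path = nx.shortest_path(G.to_undirected(), v_source, v_target)
    candidates = [v for v in path if has_snapshot(v)]
    candidates += [v_source, v_root]
    # Restore the winner's snapshot (a no-op for v_source), then apply
    # the deltas along the path from it to v_target.
    return min(candidates, key=lambda v: predicted_cost(G, v, v_target))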
The first four attributes are the independent variables that can be retrieved from the delta objects. The last attribute, actual_checkout_time, is the dependent variable that we want to predict. Table 4.1 displays a sample of checkout metadata based on several test cases.
Figure 4.2 displays the correlations between the columns of the checkout metadata. It can be seen that the checkout time is linear in all the other attributes.
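Given this linearity, a linear regression model can map the delta-object attributes to a predicted checkout time. The sketch below uses scikit-learn; the sample rows and the snapshot threshold are placeholder values, not the measurements of Table 4.1.

import numpy as np
from sklearn.linear_model import LinearRegression

# Each row holds the four delta-object attributes for one past checkout;
# y holds the measured checkout time in seconds (placeholder values).
X = np.array([
    [1200, 300, 150, 80],
    [5000, 1200, 600, 300],
    [9800, 2500, 1300, 650],
])
y = np.array([4.2, 17.9, 36.5])

model = LinearRegression().fit(X, y)

# At commit time, predict the checkout time of the new delta object; if
# it exceeds a threshold, store a snapshot alongside the delta.
predicted = model.predict([[7000, 1800, 900, 450]])[0]
SNAPSHOT_THRESHOLD_SECONDS = 30.0
store_snapshot = predicted > SNAPSHOT_THRESHOLD_SECONDS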
Chapter 5
Evaluation
This chapter summarizes the evaluation process for the system described in Chapters 3 and 4. The evaluation setup is elaborated in Section 5.1, followed by the explanation of the evaluation metrics in Section 5.2 and the evaluation results in Section 5.3.
• Grit-base (base approach): the base approach for our delta-based versioning, which stores only delta objects
• Grit-snapshot: the delta-based approach that also leverages snapshots (Chapter 4)
• Snapshot-only: snapshot-based versioning using Git
• Entity-state: entity-state based versioning using neo4j-versioner-core
The dataset used for evaluation is the Twitter connectivity dataset ‘ego-Twitter’ from the Stanford Network Analysis Project (Leskovec and Krevl, 2014). The dataset is a directed graph of users’ "following" information. It consists of 81,306 Twitter users (nodes) and 1,768,149 "follow" relationships.
There are several control variables that are measured in each iteration of the evaluation process, described as follows.
• Number of initial entities: The number of entities existing in the database before changes are applied
TABLE 5.1: Evaluation environment

Aspect             Details
Machine            Google Compute Engine instance
Operating system   Ubuntu 18.04
Processor          8 vCPUs
Memory             32 GB
Java               java version "1.8.0_172"
Python             Python 3.6.8
Neo4j              neo4j 3.4.12
We would like to investigate to what extent the control variables affect the performance of the versioning systems. Therefore, we deploy several test cases, changing one variable at a time to see the effect on the evaluation metrics. For each test case, the commit and checkout functionalities are performed and the evaluation metrics are calculated. The evaluation is run in the environment described in Table 5.1.
object into the database. On the other hand, the checkout process in the entity-state approach is performed by executing Cypher queries.
The checkout time is thus the sum of the time to generate the queries and the time to execute them.
On the other hand, the delta-based approach takes up more storage than the snapshot-only approach but less than the entity-state approach. Breaking down the storage cost, illustrated in Figures 5.6, 5.5b, 5.5c, and 5.5d, a larger portion of the storage cost goes to storing the database files. The larger database size in the delta-based and entity-state approaches is probably due to the additional information that needs to be stored in the database.
[Figure: Commit time (seconds) versus number of changed entities (2,500 to 20,000) for the grit-base, grit-snapshot, snapshot-only, and entity-state approaches, shown per number of initial entities (~0, ~5,000, ~10,000, ~15,000) and per approach.]
[Figure: Checkout time (seconds) versus number of changed entities (2,500 to 20,000) for the grit-base, grit-snapshot, snapshot-only, and entity-state approaches, shown per number of initial entities (~0, ~5,000, ~10,000, ~15,000) and per approach.]
[Figure: Storage (MB) versus number of changed entities (2,500 to 20,000) for the grit-base, grit-snapshot, snapshot-only, and entity-state approaches, shown per number of initial entities.]
[Figure: Breakdown of the storage cost (snapshot size, delta size, and database size, in MB) per approach (grit-base, grit-snapshot, snapshot-only, entity-state), for ~0, ~5,000, ~10,000, and ~15,000 initial entities and ~1,500 to ~20,000 changed entities.]
Chapter 6
Conclusions
This chapter provides a summary of conclusions based on the results of the experiments. We also propose some recommendations for potential future work continuing this thesis.
6.1 Conclusions
In this section, we reiterate the research questions defined at the beginning of this thesis and answer them based on the findings of our experiments.
RQ1: What is the current state of the art regarding the techniques and tools for dataset ver-
sioning?
Based on the literature survey, it can be concluded that there are several approaches to data versioning. We can group them into two categories: linking the data version to the workflow, and separating the data version from any workflow, where the versions of the data are created regardless of the workflow. The challenges of data versioning were laid out in Chapter 1, for example, the various structures and formats of data, change detection in data, delta representation, and the trade-off between storing deltas and full snapshots.
RQ2: Can we design a version control tool for graph databases with queries as the delta
objects?
The primary goal of this project was to develop a data versioning tool for graph databases. We presented Grit, a versioning tool for graph databases. The version control API of Grit is provided as a Git-style command line tool. The proposed framework is elaborated in Chapter 3.
It was found that, by detecting changes in the graph database and generating the associated queries based on those changes, it is possible to assign and retrieve versions of the graph database. This was supported by the evaluation results. The cost of data versioning with Grit is the additional storage needed for the delta and snapshot objects, the time to generate the delta objects, and the time to execute the queries that retrieve the versions. With this versioning tool, in addition to retrieving previous versions, it is also possible to compare the differences between two versions of a graph, to track the changes to each entity, and to create branches of the graph.
To evaluate the performance of the tool, we designed an evaluation setup that is explained in Section 5.1. The evaluation metrics include commit time, checkout time, and storage cost. We proposed two versions of the tool: a base model that only stores delta objects to manage the versions, and an improved model that stores snapshots alongside the deltas. We also compared the performance with two other tools: snapshot-based versioning using Git and entity-state based versioning using neo4j-versioner-core (D’Este and Falcier, 2017).
The experimental results showed that the storage cost of saving the delta in Grit is larger than the storage cost of the snapshot-based and entity-state based versioning tools. For small to medium changes, Grit incurs a relatively small cost in storage and checkout time. However, the tool gets more expensive as the number of changes grows. Checkout time can be improved by leveraging snapshots in the process.
RQ3: How can the changes in graph databases be detected and represented?
The framework for data versioning in this thesis consists of four main components, namely change detection, delta representation, delta aggregation, and version materialization. Change detection relates to how to capture the value changes in graph databases. It can be done by developing a plugin for the database that acts as a trigger to capture transaction data.
The second and third components, delta representation and delta aggregation, relate to how to represent and store the changes so that previous versions of the data can be retrieved. We use serialized objects to encode the changes. The serialized delta objects are then aggregated to group the changes by entity. From the aggregated delta objects, a set of queries can be generated. The set of queries can be used to apply the transactions or to roll them back. Thus, it is possible to retrieve past versions of the database from the queries.
RQ4: Can we improve the performance of the version control tool by leveraging the use of snapshots?
The evaluation results showed that the checkout process of a version in which a large number of changes are involved takes a considerable amount of time. The checkout time relates to the time needed to execute the queries from the delta objects against the database.
To reduce the checkout time, we propose an algorithm that predicts the checkout time of a version based on the information in the associated delta object. Since the checkout time is linear in the number of changes, we use a linear regression model to predict it. When the predicted checkout time exceeds a certain limit, a snapshot is generated in addition to the delta object.
Based on our experimental findings, we can answer yes: leveraging snapshots improves the version control tool in the form of reduced checkout time. Instead of executing the queries from the delta object to recreate a version, we can materialize the version from a full snapshot of the database, eliminating the query execution time. However, saving snapshots introduces additional storage cost.
1. The version materialization process of the tool presented in this thesis uses queries to reconstruct a version. The base approach can handle up to several thousand changes in a version without a considerable amount of storage cost or checkout time. For larger changes, however, it needs to leverage snapshots, because checkout takes longer. A possible improvement would be to design an alternative delta encoding and checkout process to speed up checkout.
2. Changes in the database are detected by a plugin that utilizes the database API. While this works on Neo4j, not every database or general graph data format has this feature. Change detection could be improved so that it works more generally, for example, in other database systems.
3. The granularity of the delta encoding in this project is at the graph property level, that is, every property change to the database is stored. A further improvement would be to move the granularity of the delta encoding to the character level.
Bibliography
Bhardwaj, Anant et al. (2014). “DataHub: Collaborative Data Science & Dataset Version Management at Scale”. In: arXiv: 1409.0798. URL: http://arxiv.org/abs/1409.0798.
Bhardwaj, Anant et al. (2015). “Collaborative Data Analytics with DataHub”. In: Proc. VLDB Endow. 8.12, pp. 1916–1919. ISSN: 2150-8097. DOI: 10.14778/2824032.2824100. URL: http://dx.doi.org/10.14778/2824032.2824100.
Bhattacherjee, Souvik et al. (2015). “Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff”. In: Proc. VLDB Endow. 8.12, pp. 1346–1357. ISSN: 2150-8097. DOI: 10.14778/2824032.2824035. URL: http://dx.doi.org/10.14778/2824032.2824035.
Blischak, John D., Emily R. Davenport, and Greg Wilson (2016). “A Quick Introduction to Version Control with Git and GitHub”. In: PLoS Computational Biology 12.1, pp. 1–18. ISSN: 15537358. DOI: 10.1371/journal.pcbi.1004668.
Brahmia, Safa et al. (2016). “τJSchema: A Framework for Managing Temporal JSON-Based NoSQL Databases”. In: Database and Expert Systems Applications. Ed. by Sven Hartmann and Hui Ma. Vol. 9828. Cham: Springer International Publishing, pp. 167–181. ISBN: 978-3-319-44406-2. DOI: 10.1007/978-3-319-44406-2. URL: http://link.springer.com/10.1007/978-3-319-44406-2.
Buneman, Peter et al. (2004). “Archiving scientific data”. In: ACM Transactions on Database Systems 29.1, pp. 2–42. ISSN: 03625915. DOI: 10.1145/974750.974752. URL: http://portal.acm.org/citation.cfm?doid=974750.974752.
Castelltort, Arnaud and Anne Laurent (2013). “Representing history in graph-oriented NoSQL databases: A versioning system”. In: 8th International Conference on Digital Information Management, ICDIM 2013, pp. 228–234. DOI: 10.1109/ICDIM.2013.6694022.
Chapman, Pete et al. (2000). CRISP-DM 1.0 Step-by-step data mining guide. Tech. rep. The CRISP-DM consortium. URL: http://www.crisp-dm.org/CRISPWP-0800.pdf.
Chavan, Amit et al. (2015). “Towards a Unified Query Language for Provenance and Versioning”. In: Proceedings of the 7th USENIX Conference on Theory and Practice of Provenance, p. 5. arXiv: 1506.04815v1. URL: http://dl.acm.org/citation.cfm?id=2814579.2814584.
Cheney, James, Laura Chiticariu, and Wang-Chiew Tan (2007). “Provenance in Databases: Why, How, and Where”. In: Foundations and Trends in Databases 1.4, pp. 379–474. ISSN: 1931-7883. DOI: 10.1561/1900000006. URL: http://www.nowpublishers.com/article/Details/DBS-006.
Currim, Faiz et al. (2004). “A Tale of Two Schemas: Creating a Temporal XML Schema from a Snapshot Schema with τXSchema”. In: Advances in Database Technology - EDBT 2004. Ed. by Elisa Bertino et al. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 348–365. ISBN: 978-3-540-24741-8.
Davidson, Susan B. and Juliana Freire (2008). “Provenance and scientific workflows”. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data.
Tansel, Abdullah Uz et al., eds. (1993). Temporal Databases: Theory, Design, and Implementation. Redwood City, CA, USA: Benjamin-Cummings Publishing Co., Inc. ISBN: 0-8053-2413-5.
Vijitbenjaronk, Warut D. et al. (2018). “Scalable time-versioning support for property graph databases”. In: Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017, pp. 1580–1589. DOI: 10.1109/BigData.2017.8258092.