
Data Versioning for Graph Databases

Master’s Thesis

Mohamat Ulin Nuha


Data Versioning for Graph Databases

THESIS

submitted in partial fulfillment of the


requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

by

Mohamat Ulin Nuha


born in Tulungagung, Indonesia

Software Engineering Research Group


Department of Software Technology
Faculty EEMCS, Delft University of Technology
Delft, the Netherlands
www.ewi.tudelft.nl
Data Versioning for Graph Databases

Author: Mohamat Ulin Nuha


Student id: 4620747
Email: [email protected]

Abstract

This thesis investigates the extent to which delta-based version control can be used to manage versions and the historical records of graph databases. The delta encoding is represented as a set of queries that reconstruct one version from the other. We present Grit, a Git-style command line tool to perform data versioning for graph databases. We investigated two variants of our approach, with and without leveraging snapshots of data in the process. The experiment results showed that the tool can be used to inspect the historical records of data, compare two versions of data, and track the changes for every entity in the database.
We evaluated the tool by comparing its performance with other graph version control tools. The evaluation results showed that version control for graph databases can be done with queries as the delta encoding. However, this approach comes with additional costs. First, the delta object size increases linearly with the number of changes in a version. We also found that using queries as the delta to reconstruct versions can take a significant amount of time if the number of changes is large. While leveraging snapshots can reduce the time to materialize a version, it also increases the storage cost for saving the delta objects and snapshot objects. A recommendation for future studies in data versioning is to explore and improve the delta encoding.

Thesis Committee:

Chair: Dr. ir. Alessandro Bozzon, Faculty EEMCS, TU Delft


University supervisor: Dr. Asterios Katsifodimos, Faculty EEMCS, TU Delft
Committee Member: Dr. Maurício Aniche, Faculty EEMCS, TU Delft

Preface
This thesis marks the completion of my study at Delft University of Technology. It was written to fulfill the requirements of the MSc in Computer Science. It was a fantastic journey with many ups and downs, and I would like to thank the Almighty God, to whom I turn for strength and wisdom. Throughout the completion of this master's thesis, I have gained a great deal of knowledge and experience, particularly in scientific research. I would like to acknowledge the people who have helped and supported me, without whom this thesis would not have been finished.
I would like to express my sincerest gratitude to my thesis supervisor, Asterios Katsifodimos, for accepting me to work on this thesis. I have gained many insights from the discussions with him throughout the course of this thesis. I am grateful for his guidance, feedback, and patience. I would also like to acknowledge the other members of the committee, Alessandro Bozzon and Maurício Aniche. Without them, I would not be able to defend this thesis.
I would also like to acknowledge and express my gratitude to LPDP (Indonesia Endowment Fund for Education) for the scholarship to pursue my study at TU Delft. I would like to thank my parents and my family for their unconditional love and support and for always being there for me. I want to thank my fellow computer science students from Indonesia: Helmi, Reza, Romi, Andre, Sindu, and Gilang, for their companionship. Lastly, I would like to take this opportunity to acknowledge every person I have been in contact with (friends, colleagues, professors) during my study at TU Delft.
Thank you.

Mohamat Ulin Nuha


Delft, the Netherlands
April 2019

Contents

Abstract

Preface

1 Introduction
1.1 Version control for data
1.2 Challenges
1.3 Research questions
1.4 Contributions
1.5 Thesis outline

2 Background and Related Works
2.1 Version control for code
2.2 Version control for data
2.3 Version control features
2.3.1 Retrieval of previous versions
2.3.2 Comparison between versions
2.3.3 Version graph
2.3.4 Version control API
2.4 Linking the data version to the workflows
2.4.1 Taverna
2.4.2 Machine learning pipeline systems
2.4.3 Scheduling workflow systems
2.5 Separating the data version from the pipelines
2.5.1 Git for data versioning
2.5.2 Data versioning for array-oriented database
2.5.3 Data versioning for relational database systems
Datahub
Decibel
OrpheusDB
2.5.4 Data versioning for NoSQL database systems
Document-oriented database
Graph-oriented database
2.6 Summary

3 Data Versioning for Graph Databases
3.1 Graph database
3.2 Version control components
3.3 Version control API
3.4 Delta encoding
3.4.1 Detecting changes from a transaction
3.4.2 Defining the data structures
3.4.3 Building the plugin
3.5 Version graph representation
3.6 Initializing the versioning system (init function)
3.7 Creating a new version (commit)
3.8 Materializing a version (checkout function)
3.9 Comparing the differences between versions (diff function)
3.10 Tracking the changes for one entity (blame function)
3.11 Creating and switching to another branch (branch function)
3.12 Merging two branches (merge function)

4 Leveraging snapshots in version materialization
4.1 Problem with delta-based versioning
4.2 Generating a snapshot after a commit
4.3 Optimizing the checkout time
4.4 Building the model for optimizing the execution time
4.4.1 Generating the dataset for the initial model
4.4.2 Building and updating the model
4.4.3 Deploying the model
4.5 Optimizing the storage cost

5 Evaluation
5.1 Evaluation setup
5.2 Evaluation metrics
5.2.1 Commit time
5.2.2 Checkout time
5.2.3 Storage cost
5.3 Evaluation result
5.3.1 Commit time
5.3.2 Checkout time
5.3.3 Storage cost

6 Conclusions
6.1 Conclusions
6.2 Future Works

Bibliography

List of Figures

1.1 Components of ML systems (Sculley et al., 2015)
1.2 CRISP-DM data analysis model (Chapman et al., 2000)
2.1 An example of Version Graph in a VCS
2.2 An example of machine learning workflow
2.3 SciDB version system architecture (Seering et al., 2012)
2.4 Delta encoding (Bhardwaj et al., 2014)
2.5 Decibel storage (Maddox et al., 2016)
2.6 Delta encoding (Huang et al., 2017)
2.7 Entity-state model (D’Este and Falcier, 2017)
2.8 Time versioning graph (Vijitbenjaronk et al., 2018)
3.1 An example of graph model
3.2 An example of the transformation of a graph
3.3 Data structures for changes detection
3.4 An illustration of Version Graph
4.1 Flowchart for the checkout function
4.2 Plot of pairwise relationships of the checkout metadata
5.1 Comparison of commit time. Each sub-figure represents a different number of initial entities.
5.2 Comparison of commit time. Each sub-figure represents a different approach.
5.3 Comparison of checkout time. Each sub-figure represents a different number of initial entities.
5.4 Comparison of checkout time. Each sub-figure represents a different approach.
5.5 Comparison of storage cost. Each sub-figure represents a different number of initial entities.
5.6 Comparison of storage cost for initial number of entities = 0. Each sub-figure represents a different number of changed entities.
5.7 Comparison of storage cost for initial number of entities = 4000. Each sub-figure represents a different number of changed entities.
5.8 Comparison of storage cost for initial number of entities = 8000. Each sub-figure represents a different number of changed entities.
5.9 Comparison of storage cost for initial number of entities = 10000. Each sub-figure represents a different number of changed entities.

List of Tables

2.1 A comparison of data versioning tools
4.1 Sample of checkout time metadata
5.1 Environment for system evaluation


Chapter 1

Introduction

In this introductory chapter, we present the background that motivates this thesis in Sections 1.1 and 1.2, followed by the research questions, sub-questions, and contributions in Sections 1.3 and 1.4. Finally, the outline of this document is presented in Section 1.5.

1.1 Version control for data


In many domains, a data analysis pipeline requires datasets to be manipulated and processed, from simple cleaning and normalization to data value correction, visualization, and machine learning applications. Such a pipeline can grow from a small number of steps to a large and complex application. According to Buneman et al., 2004, a good pipeline requires the results to be reproducible by other researchers; thus, maintaining the detailed steps to produce the results from the initial raw data is critical.
In the data science domain, machine learning involves many moving parts. According to Sculley et al., 2015, the actual machine learning code makes up only a small portion of the overall ML system, as illustrated in Figure 1.1. Data collection and data verification are two other components of ML systems. Data and trends change over time; new information is added and updated. Therefore, to make a reproducible machine learning process, it is necessary to describe which version of the data is used in the process.
Data can come from many sources, from either private or public repositories. Data are also updated periodically, by the addition of new data or the correction or deletion of existing data based on newly acquired information (Buneman et al., 2004). As data evolve and change periodically, there may be different versions of a dataset over time. A data analysis pipeline may refer to a specific version of a dataset; thus, researchers sometimes keep a snapshot of the dataset and assign it a version during their analysis. This practice is essential to track the changes in the dataset and the processes that introduce these changes, thus keeping the lineage or the provenance of the dataset.

Figure 1.1: Components of ML systems (Sculley et al., 2015)

Figure 1.2: CRISP-DM data analysis model (Chapman et al., 2000)
Data can also come in many formats. A commonly used one is tabular data in the form of a CSV file or spreadsheet. Relational database systems, such as MySQL and PostgreSQL, also save data in tabular format. However, there exist datasets in other forms, such as graph-based, document-based, and key-value databases. These databases, commonly called NoSQL databases, usually store their data in binary format.
A data analysis pipeline is often carried out in a team, where each member performs different tasks and introduces different modifications to the dataset, depending on the processes and the goals of the tasks. In this scenario, there will be several branches of the dataset, and to integrate them back, there needs to be a mechanism or system to manage these various branches (Maddox et al., 2016).
As an example of this process, consider a team of scientists that wants to perform a relationship analysis with data from Twitter. Following the CRISP-DM model, a frequently used data analysis pipeline model illustrated in Figure 1.2, one of the steps in the pipeline is data understanding (Chapman et al., 2000). After acquiring raw data from Twitter, each member of the team can have different tasks and can introduce modifications to the raw data before they perform the actual analysis. For example, one member of the team does data cleaning and data normalization. A team member then modifies the cleaned data for feature extraction, while another member edits the data for visualization. These tasks can result in several versions of the dataset with different changes from each team member.
A common approach to track the lineage and assign versions of a dataset is to create a snapshot of the current state of the data in each process (Maddox et al., 2016). In this practice, a copy or a version of the dataset is created per process and renamed differently, for example, data_01.txt, data_02.txt, data_for_visualization.txt. In this case, there is no associated information about the relationship between these copies, i.e. no information about which is the former version of a dataset and what process transforms one version into the subsequent version. For a small pipeline that consists of a few processes, this practice may be acceptable, because there will not be too many versions and it will be faster to recreate or retrieve different versions. However, for a more complex pipeline with larger datasets, this practice will create a considerable amount of data and will result in wasted storage. Furthermore, by creating snapshots, it is possible to recreate previous versions of data, but it is not possible to compare the changes or track the lineage between versions.
In recent years, there has been emerging research on data versioning with the capability to compare the differences between versions and to track the individuals who authorized the changes. Studies such as Huang et al., 2017, Bhardwaj et al., 2015, and Maddox et al., 2016 explore version control for relational databases with an exploration of the materialization/storage trade-off. There is, however, limited research on how to do dataset versioning for non-relational databases.

1.2 Challenges
Based on the aforementioned background and the nature of dataset versioning, we can identify the following challenges:

1. Dataset structures and formats

The various structures and formats of datasets make it impossible to create a "one size fits all" method that can handle versioning for all different kinds of datasets. Tabular data can be represented in CSV, JSON, or other text-based formats. A set of related tabular data can be stored in relational databases, which use a binary format to store the data in the file system and can be accessed from a client application or from an API. Schemaless data can be stored in non-relational databases which, similar to relational databases, also use a binary format to store the information in the file system.

2. Detecting changes in datasets

Data values change over time, and many kinds of applications can be used to manipulate a dataset. A CSV file can be modified manually by a user, for example using a text editor or a spreadsheet application. It can also be transformed by a set of instructions in an external program, such as a Python script. Similarly, a database can be manipulated using a direct query from a database client or instructions from external programs.
For text-based files, such as source code or CSV files, the Unix diff function can be used to compare the differences between two files. For other kinds of files, such as relational and non-relational databases, comparing two different versions can present a challenge because, at a lower level, they are stored in binary files.
There needs to be a way to capture the changes regardless of the application that is used, for example, whether a change is caused by an external script, a client application, or another API.

3. Delta representation
Based on how data and versions are encoded, there are two common approaches to data versioning: snapshot-based version control and delta-based version control. In delta-based version control, after the changes to the data are detected, the delta between the two versions needs to be stored. A delta can be described as a list of changes from one version to another, with instructions on how to generate one version of the dataset from the other. The difference between delta-based and snapshot-based version control is that, by storing deltas, the modifications between two datasets can be calculated. Because of the various formats of datasets, a suitable delta representation needs to be determined so that the retrieval of different versions can be efficient.

4. Storing deltas vs recreation/materialization

Sometimes, the changes between versions can be so massive that, although storing the delta is cheaper, it can be faster to materialize using a snapshot of the data. The delta can still be used to determine the list of changes between versions.

1.3 Research questions


Based on the background in Section 1.1 and the challenges in Section 1.2, the main goal of this thesis is to investigate the design and functionality of a version control system and how to implement it for data. This objective can be summarized in the following main research question:

RQ: To what extent can version control be implemented in graph databases?

To answer the main research question, the research sub-questions are formulated
as follows:

RQ1: What is the current state of the art regarding the techniques and tools for dataset versioning?

This question aims to explore the previous works on data versioning. The explanations of the existing approaches and tools can be found in Sections 2.4 and 2.5.

RQ2: Can we design a version control tool for graph databases with queries as the delta?

Concerning the challenges of data versioning, there are many kinds of data types, and we focus on graph databases, since this type of database is claimed to capture the relationship information of data better than other databases. According to Bhattacherjee et al., 2015, in databases, SQL queries can be used to compare and reconstruct tuples or records from one version to another. We would like to investigate whether it is possible to represent the delta as queries.

RQ3: How can the changes in graph databases be detected and represented?

One of the essential steps in managing the versions of databases is to detect and store the changes in the database. These changes can be used to compare the differences between versions and to retrieve one version from the other.

RQ4: Can we improve the performance of the version control tool by leveraging the use of
snapshots?

Delta-based data versioning stores the differences between versions of data instead of a full snapshot of each version. In doing so, there are expected additional costs to retrieve a version. For example, to checkout a version, the delta needs to be executed, and this execution can take a long time when the number of changes gets large. In snapshot-based data versioning, on the other hand, a snapshot of the data is created for each version. The advantage of this approach is that checkout is expected to be faster regardless of the number of changes, because the user can restore the version from the snapshot. We compare the performance of the data versioning tool with and without leveraging the use of snapshots.

1.4 Contributions
The main contributions of this thesis are as follows:

1. A literature survey of recent research on version control for data

2. A version control tool for graph databases

3. A comparison of version control systems for graph databases

4. A decision-making process to determine whether a snapshot needs to be stored for efficient version materialization

1.5 Thesis outline


The rest of this thesis report is organized as follows. In Chapter 2, we discuss recent research on data versioning, including the techniques, delta representations, and how they are incorporated in real-life cases. In Chapter 3, we describe our approach to the implementation of version control for graph databases. We further improve the versioning system and describe the improvements in Chapter 4. The evaluation method, setup, and results are elaborated in Chapter 5. Finally, the conclusions, recommendations, and future works are provided in Chapter 6.

Chapter 2

Background and Related Works

This chapter discusses the description and a brief history of version control for data, along with various specific tools and techniques related to data versioning. This chapter aims to provide an answer to the first research sub-question:
RQ1: What is the current state of the art regarding the techniques and tools for dataset versioning?

The applications of version control for code and for data are elaborated in Sections 2.1 and 2.2, followed by the common features of version control systems in Section 2.3. The existing research and tools for data versioning are further elaborated in Sections 2.4 and 2.5, followed by the summary in Section 2.6.

2.1 Version control for code


Version control can be defined as a mechanism to track changes that happen to certain files while maintaining the previous versions of those files. The common usage of version control systems is to track changes in source code or other text-based files, as source code often undergoes revisions, bug fixes, and feature additions.
Version Control Systems (VCSs), such as Git and SVN, are designed to manage versions of files and calculate the changes with the Unix-style diff command (Loeliger and McCullough, 2012). The ability to manage files in a project repository has made VCSs widely used for codebases and team projects. Git, one of the most popular VCSs, works by tracking file-based changes and saving the changes, along with the configurations and the history of files, in a directory inside the project/repository. Git hashes the content of the files and compresses them for efficient data storage. If there are changes in a file, a new hash for that file will be created (Blischak, Davenport, and Wilson, 2016). Thus, it is possible to detect changes and calculate delta objects efficiently. This approach works well with text-based files and other non-structured files, as it allows comparison on a per-line basis, and the compression algorithm for text-based files can reduce the amount of storage significantly.
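For illustration, the following is a minimal Python sketch of content-addressable change detection in the spirit of Git (not Git's full object model): an object identifier is the SHA-1 hash of a small header plus the file content, so any change to the content yields a new identifier.

    import hashlib

    # Git-style blob hash: SHA-1 over a "blob <size>\0" header plus the content.
    def blob_hash(content):
        header = ("blob %d\0" % len(content)).encode()
        return hashlib.sha1(header + content).hexdigest()

    old = b"name,age\nAlice,30\n"
    new = b"name,age\nAlice,31\n"

    # Different content yields a different hash, so a new object is stored and
    # the change is detected without comparing the files line by line.
    print(blob_hash(old) != blob_hash(new))  # True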

2.2 Version control for data


One of the critical components that determine an excellent scientific workflow is the reproducibility of the process, so that other scientists can repeat the steps and produce similar results (Simmhan, Plale, and Gannon, 2005; Davidson and Freire, 2008). Since a scientific pipeline mainly deals with datasets, tracking the changes in these datasets is one of the approaches to maintain reproducibility. Over the years, a growing number of techniques have been developed to track changes in data.

The concept of the temporal database has been introduced and studied by Tansel et al., 1993 and Ozsoyoglu and Snodgrass, 1995, among others, to denote time information in the database, as opposed to a conventional database where only information about the current state of the data is stored. A temporal database saves information about the timeframe of a record, namely the valid time, the period during which a particular record is valid in the modeled reality, and the transaction time, the period during which a record is stored in the database and considered correct. This concept has been implemented as a feature in major relational database management systems, such as OracleDB, Microsoft SQL Server, and MariaDB.
Several studies, such as Cheney, Chiticariu, and Tan, 2007 and Karvounarakis, Ives, and Tannen, 2010, explore the concept of provenance in databases, along with how to query the provenance information. Storing provenance information in a database allows users to track the lineage of data, for example, where a tuple in a database comes from and how a tuple relates to its base tuples. In short, provenance information can be used to track the lineage of changes in the database from its creation to its current state.
Data versioning also relates to the practice of archiving databases. Buneman et al. described database archiving as the practice of storing past versions of a database that are used in data analysis so that the results of the analysis can be verified (Buneman et al., 2004). The authors argued that since a certain analysis is based on a specific version of a database, it is essential to keep track of the previous states of the data, as a database always evolves with the addition, correction, and deletion of records. In addition to the ability to analyze older versions of the database, database archiving makes it possible to query historical data (Müller, Buneman, and Koltsidas, 2008).
According to Buneman et al., 2004, there are several approaches to archiving data. The first is to create a snapshot or version for every significant transformation of the data. By creating snapshots, it is possible to retrieve past versions of data should the need to analyse those data arise. The disadvantage of this approach is that it can result in wasted storage, as either small or large changes will result in a full snapshot of the data. Furthermore, it is impractical to compare the differences between versions by using snapshots alone.
Another approach to archiving data is storing the delta, which describes the list of changes between two consecutive versions. For text-based data, this can mean storing the line-by-line differences between the same files in different versions, while for a database, this can mean storing the script or SQL to generate one version from the other. Saving the delta means that less storage is used compared to saving a full snapshot. The difference between two versions can also be compared using the delta or a sequence of deltas. However, if the number of changes is huge, retrieving a version by using the delta can be time-consuming.
Recent studies, such as Bhardwaj et al., 2014 and Maddox et al., 2016, explore versioning of complex data, i.e. relational databases, with the ability to retrieve previous versions and compare the differences between versions. These approaches introduce methods to store deltas efficiently, along with a user interface through which the user interacts via Git-style commands. The authors claimed that the challenges for these approaches are two-fold: recording deltas and encoding the version graph.
From the existing tools and research, there are several approaches to addressing data versioning challenges. One approach is to link the data to the pipeline along with the code. This approach is commonly used in machine learning and data engineering pipelines, where the data are stored in certain storage and linked to the pipeline via script/code. This approach is effective for repeatable processes with common dataset formats. However, not all data types are suitable for this approach.
Another approach is to manage the version of the data separately from the code and the pipeline. In this approach, a new version of the data is created regardless of the pipeline, for example, based on the current time or on a certain addition of records to the database. An analysis then uses a certain version as one of its dependencies.
In summary, the main objective of data versioning is to manage the versions of data for reproducibility purposes. Given a set of versions V, we want to be able to retrieve or recreate a target version, for example Vj, from its predecessor version, for example Vi. The version can then be used as a data dependency for data analysis or data science processes.
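To make this objective concrete, the following is a minimal Python sketch, assuming a toy key-value representation of a dataset (not the delta format used later in this thesis): a target version Vj is recreated by applying a chain of deltas to a predecessor Vi.

    # Apply a chain of deltas d1, ..., dn to a base version Vi to obtain Vj.
    def materialize(base, deltas):
        version = dict(base)
        for delta in deltas:
            version.update(delta.get("upserts", {}))   # added or modified entries
            for key in delta.get("deletes", []):       # removed entries
                version.pop(key, None)
        return version

    v0 = {"alice": 30}
    deltas = [{"upserts": {"bob": 25}},
              {"upserts": {"alice": 31}, "deletes": ["bob"]}]
    print(materialize(v0, deltas))  # {'alice': 31}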

2.3 Version control features


Based on the brief history of data versioning in Section 2.2, related works on data versioning share several important objectives, namely: the ability to retrieve previous versions, the ability to compare versions, the ability to track provenance information, and an API or user interface for interacting with the version control system.

2.3.1 Retrieval of previous versions


An essential feature in version control is the ability to retrieve or recreate versions. Given a set of versions, the goal of data versioning is to minimize the cost of storage, e.g. the cost to store deltas, while also minimizing the cost of materialization, e.g. the time to recreate versions (Bhattacherjee et al., 2015).
Consider a scenario in a relational database system: if a user changes a record or a tuple from source version Vs to target version Vt, it is cheaper to store the delta from Vs to Vt instead of storing the entire database of Vt. The time it takes to recreate one changed record would also be minimal.
On the other hand, if the changes from Vs to Vt consist of millions of tuples, then while storing the delta can be cheaper, the cost to recreate Vt from Vs using a delta in the form of SQL queries would be expensive, since executing millions of queries takes longer than restoring the snapshot of Vt.
While the delta always needs to be stored to compare the differences between versions, there needs to be a decision system to determine whether it is cheaper to materialize using the delta alone or using a snapshot of the whole data.
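An illustrative decision rule in Python, with hypothetical cost parameters rather than measured values: materializing via deltas costs roughly one query per change, while restoring a snapshot costs a largely change-independent amount.

    # Illustrative rule of thumb; real systems would estimate these costs from data.
    def use_snapshot(num_changes, per_query_ms, snapshot_restore_ms):
        return num_changes * per_query_ms > snapshot_restore_ms

    print(use_snapshot(10, 2.0, 5000.0))         # False: few changes, replay the delta
    print(use_snapshot(1_000_000, 2.0, 5000.0))  # True: restoring the snapshot is faster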

2.3.2 Comparison between versions


Aside from the capability to recreate versions, another objective of version control is to be able to compare the differences between versions. For text files, this can be in the form of character or line differences between two files. For a relational database, it can be in the form of the list of tuple differences between two versions of the database. One approach to achieve this is to store the delta, or the modifications made to each version from its prior version(s). This approach is known as the delta-based approach (Buneman et al., 2004; Huang et al., 2017).
A delta can be defined as the changes or the sequence of modifications between two consecutive versions. Changes to the files or data can be introduced by either a client application, e.g. a database client, a script, e.g. a Python script, or a quick-and-dirty update to the files, e.g. a manual edit with external applications.

According to Bhattacherjee et al., 2015, the delta between two consecutive versions can be calculated in several ways.
• Text-based files can utilize the Unix diff command to compare line-by-line changes.
• Database systems can employ SQL queries to compare and reconstruct tuples or records from one version to another.
• Tabular data, which consist of rows, columns, and cells, can use cell-wise differences to compare the changes between the same rows in subsequent versions.
In a database system, one way to store the delta is to save it in external files, for example, by storing serialized changes from source version Vs to target version Vt in JSON files. Another way is to store it inside the database itself, for example, by adding additional columns to represent the version or by saving the versions in separate tables.
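For illustration, the following Python snippet serializes a hypothetical delta object to an external JSON file; the queries are Cypher-style statements (as used by graph databases such as Neo4j), and the field names are examples of ours, not a prescribed format.

    import json

    # A hypothetical delta object: queries that transform source version Vs
    # into target version Vt.
    delta = {
        "from_version": "Vs",
        "to_version": "Vt",
        "queries": [
            "CREATE (n:Person {id: 42, name: 'Alice'})",
            "MATCH (n:Person {id: 7}) SET n.age = 31",
            "MATCH (n:Person {id: 13}) DETACH DELETE n",
        ],
    }

    # Serialize the delta to an external JSON file alongside the database.
    with open("delta_vs_vt.json", "w") as f:
        json.dump(delta, f, indent=2)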

2.3.3 Version graph


One of the mechanisms to track the lineage of data in a set of versions is a version graph. In VCSs such as Git and workflow systems such as Apache Airflow, the version graph is represented as a directed acyclic graph (DAG), a directed graph in which no path leads from any node n back to n itself.
Figure 2.1 shows an example of a version graph in Git. In this example, there are several versions in the repository, illustrated with dots. A repository in a VCS can contain several branches, indicated by lines of different colours. The branches can be merged back into the main branch.

Figure 2.1: An example of Version Graph in a VCS

2.3.4 Version control API


Version control systems, such as Git, provide the user with a versioning API to interact with the VCS. The API can be in the form of a set of instructions in the command line or a set of SQL queries. Some of the functionalities of the versioning API are as follows; a toy sketch exercising them follows the list.

• init: Initialize the repository in a directory, either an empty directory or an existing directory. With this command, an initial version of the files is created. This is illustrated as the uppermost dot in Figure 2.1.

• commit: Create a new version after changes or modifications are performed on the files. In the figure, commits or versions are illustrated with dots.

• checkout: Change the current working directory to a specific version that was already committed before, for example, going back from version Develop a feature part 2 to Develop a feature part 1.

• diff: Compare two committed versions or compare a version against the current working directory, for example, inspecting the additions and updates from version Develop a feature part 1 to Develop a feature part 2.

• branch: Create a new branch from an existing branch in the version graph. This is done, for example, when a user wants to work on his/her own version and explore some modifications to the files without changing the main branch directly. In the figure, the grey line is the main or master branch; the blue line is the develop branch made from the master branch, while the yellow line is the myfeature branch made from the develop branch.

• merge: Merge the changes made in a branch into the parent branch. This is illustrated as the blue develop branch being merged into the grey master branch.

• blame: Check the list of changes made by users to a file.
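To make these functionalities concrete, the following is a toy in-memory version store in Python (a sketch for illustration only, not Grit or any existing VCS) that exercises init, commit, checkout, and diff over a simple version graph.

    import itertools

    # A toy version store: each commit id maps to a full snapshot and to its
    # parent commit, forming a version graph (here a simple chain for brevity).
    class VersionStore:
        def __init__(self):                  # init: create an empty root version
            self._ids = itertools.count(1)
            self.versions = {0: {}}          # commit id -> snapshot of the data
            self.parents = {0: None}         # commit id -> parent commit id
            self.head = 0

        def commit(self, data):              # commit: record a new version
            vid = next(self._ids)
            self.versions[vid] = dict(data)
            self.parents[vid] = self.head
            self.head = vid
            return vid

        def checkout(self, vid):             # checkout: materialize a past version
            self.head = vid
            return dict(self.versions[vid])

        def diff(self, a, b):                # diff: compare two committed versions
            va, vb = self.versions[a], self.versions[b]
            return {
                "added": {k: vb[k] for k in vb.keys() - va.keys()},
                "removed": {k: va[k] for k in va.keys() - vb.keys()},
                "changed": {k: (va[k], vb[k])
                            for k in va.keys() & vb.keys() if va[k] != vb[k]},
            }

    store = VersionStore()
    v1 = store.commit({"alice": 30})
    v2 = store.commit({"alice": 31, "bob": 25})
    print(store.diff(v1, v2))
    # {'added': {'bob': 25}, 'removed': {}, 'changed': {'alice': (30, 31)}}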

2.4 Linking the data version to the workflows


Several tools have been developed to aid the reproducibility of machine learning processes by linking the version of the data to the workflow. In this approach, the datasets are stored in a storage technology while a piece of code ties them to the overall workflow.
Workflow systems are designed to deploy data analysis workflows or pipelines that are reproducible while keeping the lineage of the data. The general goal of these systems is to provide a systematic way and a visualization of the step-by-step phases to develop and run pipelines.
A workflow mainly consists of input, one or more processes, and output. In such a workflow, the intermediate output of one process becomes the input of the subsequent process. Workflow systems aim to maintain the provenance of data and make sure that the correct sets of data are fed into the correct process.
A workflow system may provide iterative and incremental capabilities. Iterative capability means that the end results of a workflow can be used as the input of the same workflow in the next iteration/cycle. Incremental capability means that the input data can be increased incrementally in the next cycle without changing the overall processes; for example, the initial cycle of the workflow can have 1,000 records of data as the input, resulting in a certain set of outputs, while in the next iteration the input data increase to 10,000 records, resulting in another set of outputs.
An example of a machine learning workflow is a classification task for text data, illustrated in Figure 2.2. The workflow consists of data collection and extraction, data transformation, model training, and model evaluation. This process can be iterative and incremental, meaning that certain steps can be repeated and data can be added incrementally. Thus, a workflow system has to make sure that the correct model and correct testing data are fed into the evaluation process.

Figure 2.2: An example of machine learning workflow

2.4.1 Taverna
Taverna is a workflow system designed for the integration of the various tools and resources in bioinformatics (Oinn et al., 2004; Hull et al., 2006). It was also developed due to the need for seamless communication between the diverse analysis tools and web services from third-party providers. Taverna acts as a user interface to create and run workflows from these several resources.
Lineage information in a Taverna workflow consists of the source data and the metadata related to the processes and the subsequent results in the workflow. The lineage of data can be used, for example, as an audit trail to determine the source of an error in a workflow or the source of the data. Another use of lineage information is for reproducibility and reusability purposes.

2.4.2 Machine learning pipeline systems


An example of a machine learning pipeline system is DVC. DVC 1 is an open-source command line tool designed for managing pipelines in machine learning projects. It stores the content of large files, i.e. a large dataset, in a local key-value store. File pointers to the actual dataset cache are created in the repository, while the actual files can be saved externally, either in local storage or in cloud storage such as AWS S3 or GCP.
The main functionality of DVC revolves around the ability to design a step-by-step machine learning pipeline, run the transformations and the processes, and reproduce the results. The pipeline is language-agnostic, meaning that the scripts for the machine learning pipeline can be written in any language.
Aside from tracking the processes and the transformations applied to the dataset, DVC can also be used for switching between versions of data and machine learning models. Oftentimes, in a machine learning pipeline, we want to generate several models from different sets of data. When there are multiple versions of data and models, it is possible to snapshot, i.e. commit, the working directory (code, data files, model) into a version. This version can be checked out later into the working directory. Since the files are not actually saved into the repository, the commit and checkout processes are faster.
1 https://dvc.org/

Another example of a machine learning workflow system is Pachyderm. Similar to DVC, Pachyderm 2 is an open-source tool designed to create iterative and incremental machine learning pipelines. The main selling point of Pachyderm is the ability to manage pipelines that are scalable. The scalability is achieved by implementing parallelisation and distributed processing.
The data used in the pipeline are version controlled. A user can see which training data generate a certain model and certain results. A user can switch to different versions of the input data to compare the results.

2 https://www.pachyderm.io/

2.4.3 Scheduling workflow systems


There are numerous examples of open-source projects designed for building, monitoring, and scheduling complex workflows of batch jobs, for example, Airflow, Luigi, Oozie, and Azkaban. These workflow systems provide a user interface for managing and visualizing the workflow as a directed acyclic graph, where a vertex illustrates a process in the workflow and an edge depicts the direction to the subsequent process.
In general, workflow systems are useful for designing and running scientific pipelines, since they provide user interfaces or APIs for managing the data from the start of the pipeline, the processes or services involved, and the results. A user can replace or modify the data input of one process, and the output of the succeeding processes will change accordingly.
There are many more examples of workflow systems designed for these purposes. The main similarity between these systems is that they can be used to keep track of the data transformations, from the origin or input to the end result or output; therefore, it is possible to query information about where the data transformations come from and which processes introduced those transformations.
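As an illustration, a minimal DAG definition in Apache Airflow might look as follows; this sketch assumes Airflow 2.x is installed, and the DAG id, task names, and callables are placeholders of ours, not from the thesis.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Two placeholder tasks connected by an edge: a vertex per process,
    # an edge for the direction to the subsequent process.
    with DAG(dag_id="twitter_analysis",
             start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:
        clean = PythonOperator(task_id="clean_data",
                               python_callable=lambda: print("cleaning"))
        features = PythonOperator(task_id="extract_features",
                                  python_callable=lambda: print("extracting features"))
        clean >> features  # feature extraction runs after cleaning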

2.5 Separating the data version from the pipelines


Section 2.4 described tools for versioning of data that revolve around workflows or pipelines, where the changes in the data originate from a finite set of processes defined in the workflow. Oftentimes, a user would also like to manage the versions of the data regardless of the processes involved, for example, to see last year's version of a database after several incremental updates made this year. The changes in a database do not always fit into a certain workflow, for example, in a database system with daily transactions resulting from several client applications that write into the database. For this reason, there exist several tools and studies to manage the versions of data without tying them to a workflow.
Two common ways to do data versioning in this approach are to store a snapshot of the whole data in each version and to store the delta or changes from version Vi to version Vj.

2.5.1 Git for data versioning


To some extent, version control systems like Git and SVN can also be used for data versioning, since they provide version control APIs and functionalities. An example of data versioning using Git is to create monthly snapshots of a spreadsheet file that is changed frequently. The commits are the snapshots made each month. Therefore, it is possible to retrieve last month's version from the list of previous commits.
VCS systems are designed for unstructured data and compare changes on a line-per-line basis. For text-based data with small changes, using Git can be sufficient. However, VCS systems are not optimized for semi-structured or structured data, for example, tabular data or serialized data (CSV, XML, JSON), since a change in one cell of a table is considered a change in the whole row. Furthermore, for complex data, such as database systems, VCS systems do not know how to calculate the delta, since the files for database systems are usually in binary format.
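The cell-versus-row limitation can be seen with Python's difflib, which performs a line-based comparison similar to Unix diff; a single changed cell surfaces as a removed row plus an added row.

    import difflib

    old = ["id,name,age", "1,Alice,30", "2,Bob,25"]
    new = ["id,name,age", "1,Alice,31", "2,Bob,25"]

    # The one changed cell (age 30 -> 31) appears as "-1,Alice,30" and
    # "+1,Alice,31": the diff carries no notion of columns or cells.
    print("\n".join(difflib.unified_diff(old, new, lineterm="")))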
There are some plugins for Git that are designed to handle large data in a Git repository, for example, Git LFS 3 and git-annex 4. They work by committing symlinks of the dataset to the repository, while the actual files are stored in external storage, for example, AWS S3 or an LFS server.
By using Git for data versioning, it is possible to track the lineage of data through the Git version graph, although with some limitations. Furthermore, for a dataset in a binary format, it is not possible to compare the changes from one version to another.

2.5.2 Data versioning for array-oriented database


A study by Seering et al., 2012 attempted to solve the version control problem in array databases and developed a versioning system on top of SciDB. The authors suggested that array-oriented databases are used by many scientific applications. The study introduced delta representations and algorithms for minimizing the storage cost in an array-oriented database. It also introduced an optimization algorithm to determine which versions to materialize, based on the access frequencies of the past versions. Figure 2.3 illustrates the architecture of the versioning system.

Figure 2.3: SciDB version system architecture (Seering et al., 2012)


3 https://github.com/git-lfs/git-lfs
4 https://git-annex.branchable.com/

The data structure in the array-oriented database is in the form of arrays with a predefined schema. During the creation of a record, a schema is defined that includes the data type of each element in the array along with the dimensions of the array.
The operations for version control, including the creation of a new version and the retrieval of past versions, are provided in the form of SQL-like queries. To create an updatable array, for example, a user can issue an SQL command:
CREATE UPDATABLE ARRAY Example (A::INTEGER) [I=0:2, J=0:2];
When inserting a new array, the operation takes one of three forms of payload, namely dense representation, sparse representation, and delta list representation. In a dense representation, the value of every cell in the array is specified, even if some of them are empty. In a sparse representation, a list of (dimension, attribute) pairs is provided, together with a default value for undefined dimension values. A delta list representation contains the parent version of the record along with a list of (dimension, attribute) pairs that are changed. After the data are inserted, the user can explicitly define the version, i.e. commit, by issuing a query to the database.
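For illustration, the three payload forms for a small 3x3 integer array could be sketched in Python as follows; the structures and field names are ours, not SciDB syntax.

    # Dense: every cell is specified, including empty/default cells.
    dense = [[0, 0, 1],
             [0, 2, 0],
             [3, 0, 0]]

    # Sparse: only non-default cells as (dimension -> attribute) pairs, plus a default.
    sparse = {"default": 0, "cells": {(0, 2): 1, (1, 1): 2, (2, 0): 3}}

    # Delta list: a parent version plus only the cells that changed from it.
    delta = {"parent": "v1", "changed": {(1, 1): 5}}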
To handle a new version of an array in the database, for example after changing some of the attributes, there are several options to store the changes. The first is to fully materialize the array whenever changes happen; the array in each version is chunked and compressed for more efficient storage. The second is to calculate the delta encoding, that is, the cell-wise differences between arrays of the same dimensions.
When a new version of an array is loaded into the database, the system computes the materialization matrix to determine the best encoding and representation for the new version with respect to the previous versions. The latest version of the database will always be materialized since, in many cases, it will be accessed more often.
Retrieval of previous versions is also done by using an SQL-like query. In the paper, the authors gave the example query SELECT * FROM Example@'1-5-2011'; to retrieve a version that was created on January 5th, 2011.
To store the list of versions, the system employs a graph representation in which the vertices denote the versions and the edges denote the delta encodings. The system uses an undirected graph to represent the version graph, since the system can materialize in both directions by applying or subtracting deltas.
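A minimal Python sketch of such a bidirectional cell-wise delta, assuming arrays of equal dimensions: storing both old and new values lets the system apply the delta forwards or subtract it backwards.

    # Store (position, old_value, new_value) for every changed cell, so the delta
    # can be applied forwards (old -> new) or subtracted backwards (new -> old).
    def array_delta(old, new):
        return [((i, j), old[i][j], new[i][j])
                for i in range(len(old)) for j in range(len(old[0]))
                if old[i][j] != new[i][j]]

    def apply_delta(arr, delta, forward=True):
        out = [row[:] for row in arr]
        for (i, j), old_val, new_val in delta:
            out[i][j] = new_val if forward else old_val
        return out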

2.5.3 Data versioning for relational database systems


A collection of studies (Bhardwaj et al., 2014; Bhattacherjee et al., 2015; Chavan et al., 2015; Maddox et al., 2016; Huang et al., 2017) attempted to solve the data versioning problem in relational database systems. The tools from these studies aim to make it easier to organize data for researchers who often collaborate on analyses. They are also designed to overcome the limitation of having to manually create a local working copy of a dataset for each member of a team.

Datahub
Figure 2.4 shows the architecture of Datahub. At a high level, the system uses a "schema-later" data representation, which allows data to be represented in many forms, from textual blobs to key-value pairs to structured records. The main abstraction the system uses is that of a table containing records, in which every record has a key and an arbitrary number of typed, named attributes associated with it. For unstructured data, a single key can be used to refer to an entire file.

Figure 2.4: Delta encoding (Bhardwaj et al., 2014)

Another abstraction that the system uses is the dataset. A dataset consists of a set of tables, along with any correlation/connection structure between them (e.g., foreign key constraints).
The versioning information is maintained at the level of datasets in the form of a version graph, a directed acyclic graph where the nodes correspond to the datasets and the edges capture relationships between versions, as well as the provenance information that records user-provided and automatically inferred annotations for derivation relationships between data. Whenever a version of a dataset is created in DataHub, users or applications may additionally supply provenance metadata that shows the relationship between two versions. This metadata is designed to allow queries over the relationships between versions.
The paper mentioned two challenges in developing DataHub. The first is how to calculate deltas between two versions of a dataset so that one version can be retrieved by using the other version and the delta. Since the dataset is represented in tables and records, versions are created explicitly via INSERT/DELETE commands (Bhardwaj et al., 2015). However, for very large datasets and large changes between versions, the comparison can be slow.
The other challenge is version graph encoding. The authors argued that there is a trade-off between materializing a specific version and just storing the delta from previous versions, since following deltas might take a considerable amount of time to execute.

Decibel
Decibel is a relational storage system with built-in version control. Decibel represents a dataset as a table. It is a key component of the DataHub (Bhardwaj et al., 2015) versioning system. There are three representations of data, namely tuple-first, version-first, and hybrid. Figure 2.5 illustrates how Decibel handles data storage and versioning.

Figure 2.5: Decibel storage (Maddox et al., 2016)



Figure 2.6: Delta encoding (Huang et al., 2017)

OrpheusDB
OrpheusDB (Huang et al., 2017) is a dataset version management system built on top of relational databases. OrpheusDB is designed to combine the sophisticated querying of relational databases with versioning capabilities. The authors claimed that OrpheusDB supports both Git-style version control commands and SQL-like queries. Figure 2.6 illustrates the delta encoding in OrpheusDB.

2.5.4 Data versioning for NoSQL database systems


To some extent, relational databases, such as MySQL and PostgreSQL, are still used as the default storage for handling transactional data due to their maturity and because they provide the ACID (Atomicity, Consistency, Isolation, Durability) properties. However, with the massive explosion of data in the last couple of years, traditional databases are not always suitable to process the latest data inflow. Nowadays, there are more than 100 new database systems that have been developed to overcome the limitations of relational databases. These database systems are generally called NoSQL databases. Based on the data representation, NoSQL databases are commonly categorized into four classes: document-oriented, key-value-oriented, column-oriented, and graph-oriented databases.
NoSQL database systems are characterized by aggregate data models that differ from how relational database systems model data. Unlike a relational database, where the schema should be defined in advance, NoSQL database systems are described as schemaless, making them more flexible for storing various kinds of data (Sadalage and Fowler, 2012).
Similar to relational databases, NoSQL databases commonly do not have built-in support for versioning. The following are several studies aimed at managing versions in document- and graph-oriented databases. To the best of our knowledge, there are no studies yet on how to manage versions in key-value and column-oriented databases.

Document-oriented database
One of the categories of NoSQL databases is document-oriented, where the data is
stored in documents, such as XML, JSON, and BSON (Sadalage and Fowler, 2012).
MongoDB and Couchbase are two examples of database systems that belong to
this category. Several approaches have been proposed to manage versioning for
document-oriented databases.
τXSchema (Currim et al., 2004) and τJSchema (Brahmia et al., 2016) are two
approaches for versioning XML and JSON documents, respectively. In each of these
approaches, a schema is proposed to handle the version history of XML and JSON

FIGURE 2.7: Entity-state model (D’Este and Falcier, 2017)

documents so that temporal information is kept while maintaining compatibility
with the existing schema. This is done by adding physical and temporal annotations
to the documents. By using this schema, the authors claimed that it is possible to
retrieve time-varying changes in the documents.

Graph-oriented database
Graph-oriented database systems, such as Neo4j, ArangoDB, and HyperGraphDB,
store the data as graphs. This type of NoSQL database is claimed to be suitable
for representing data from social networks, organizations, and natural phenomena,
as it can illustrate the relationships between data better than relational
database systems (Robinson, Webber, and Eifrem, 2013).
Castelltort and Laurent (2013) defined several criteria for versioning graph databases,
namely that the mechanism should be non-intrusive to the graph structure, distributable as a plugin,
pluggable to the database at any time, and able to provide the history of
nodes and relationships.
A plugin for Neo4j called neo4j-versioner-core (D’Este and Falcier, 2017) was developed
to manage versions of graphs in Neo4j. It is based on the Entity-State approach,
in which the identity and the state of graphs are separated (Robinson, 2014). Every
entity has a main node with its unique property, while the non-unique properties are stored in
another node called the state node. The main node and the state nodes are connected
with a relationship. A change in the entity generates a new state node that has a
relationship with the main node and the previous state node. All the functionalities
are provided via Cypher queries. Figure 2.7 illustrates how data versioning works
in neo4j-versioner-core.
A study by Vijitbenjaronk et al. (2018) aimed to add time-versioning support in
graph databases by using LMDB, a B-tree-based key-value store, as the backend
storage. Figure 2.8 shows the overview of a time-versioned graph database.

2.6 Summary
Based on the relation to the workflow or analysis, data versioning can be categorized
into two groups: linking the data to the workflow and separating the
data from the workflow.
Linking the data to the workflow is useful for projects that have predefined work-
flows. Scientific workflow systems are practical for managing the processes and the

FIGURE 2.8: Time versioning graph (Vijitbenjaronk et al., 2018)

versions of data in this category. From the perspective of the datasets, a user creates a
new version by introducing a transformation process. Several systems provide
the ability to explicitly define the version number, while others do this
automatically. To check out a version of data, a user can scan the workflow graph
and select which version of data to retrieve from the list of processes.
To some extent, it is possible to compare what the data look like in each step
of the workflow. The comparison between data can be made at the file level: it
is possible to see the state of a file before and after a transformation. However,
for a file consisting of records that undergoes a transformation modifying a large
number of these records, it is impractical to compare which records were changed
during the transformation.
Separating the data versions from the workflow is useful for projects where
the manipulation of data does not always fit into a certain workflow. A relational
database, for example, can have numerous client applications that constantly update
the data, where the flow of the process in one application can be significantly
different from that in the other applications.
Based on how data and versions are encoded, data versioning can be categorized
into two groups: the delta-based approach and the snapshot-based approach.
The snapshot-based approach stores the whole content of the data in each version, while
the delta-based approach stores the differences between versions (deltas) to reconstruct
the full version of the data.
Table 2.1 displays a comparison of the existing tools and methods for managing the
versions of data. There is a considerable number of tools for managing data
versions that are used in workflows. On the other hand, there are still limited studies
on managing data versions in database systems, especially in NoSQL databases.
TABLE 2.1: A comparison of data versioning tools

Name | Data Type | Approach | Delta encoding
Git | Any files | Snapshot-based for binary files, delta-based for text files | Hash of files; line-per-line comparison
Workflow systems | Any files | Snapshot-based or symlinks of data | Hash of files; a piece of code to link the version to the workflow
SciDB | Array databases | Delta-based | Dense, sparse, delta-list representation of the changes
DataHub, Decibel | Tabular data/relational databases | Delta-based | Changes in each record are written into separate tables in the database
OrpheusDB | Tabular data/relational databases | Delta-based | A column or a table to store the association between a record and the list of versions where the record exists
τJSchema and τXSchema | JSON, XML | Delta-based | Physical and temporal annotations to the documents
neo4j-versioner-core | Graph database | Entity-state approach where the entities (identifier) and the states (modifiable attributes) are separated | Additional nodes in the database to store every change that happens to an entity
B-tree graph database | Graph database | Delta-based | B-tree as the backend of the database

Chapter 3

Data Versioning for Graph Databases

The main research question of this thesis, as described in Chapter 1, is to investigate
to what extent a version control tool can be developed for graph databases. This
chapter discusses the approaches that are used to answer the question and the sub-questions.
First, we describe what a graph database is and what its components are.
Then, we identify the components of version control for a graph database, along
with a description of how to develop them.

3.1 Graph database


A graph database is developed to address complex relationships in data. Similar
to a graph data structure, a property graph database consists of nodes
and relationships. To group nodes of the same kind, a user can assign a label to them.
Furthermore, each node and relationship can have zero or more properties. Thus,
there are four main components of a graph database, namely nodes, relationships,
labels, and properties (Robinson, Webber, and Eifrem, 2013).
In a relational database, a relationship can be represented by a foreign key
column. Robinson, Webber, and Eifrem (2013) suggested that rows in the relational
model can be represented by nodes in the graph model. Joins can be represented
by relationships, table names by labels, and columns by properties. In contrast with
the relational model, where each record in a table has to have the same number of
columns, in the graph model, nodes that belong to the same label do not have to have
the same number of properties.

FIGURE 3.1: An example of graph model



Figure 3.1 illustrates an example of the graph model. It depicts a social graph where
a person follows another person. Each node represents a person, and each directed
relationship indicates that one user follows another. There are three nodes
with the label ‘Person‘, depicted as circles, and four relationships with the type
‘FOLLOWS‘, illustrated as purple directed arrows. The number of properties on each
node or relationship can differ; for example, node ‘Alice‘ has two properties,
namely ‘name‘ and ‘dob‘, while node ‘Bob‘ only has one property, namely ‘name‘.
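As a concrete illustration, a graph like the one in Figure 3.1 could be created from Python through the official Neo4j driver. This is only a sketch: the connection URI, the credentials, and the property values (such as the ‘dob‘ value) are placeholder assumptions, and the exact directions of the four relationships are illustrative.

from neo4j import GraphDatabase

# Placeholder connection settings; adjust for the local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    # Create three Person nodes and four FOLLOWS relationships,
    # mirroring the social graph of Figure 3.1.
    session.run(
        "CREATE (alice:Person {name: 'Alice', dob: '1990-01-01'}) "
        "CREATE (bob:Person {name: 'Bob'}) "
        "CREATE (charlie:Person {name: 'Charlie'}) "
        "CREATE (alice)-[:FOLLOWS]->(bob) "
        "CREATE (bob)-[:FOLLOWS]->(alice) "
        "CREATE (bob)-[:FOLLOWS]->(charlie) "
        "CREATE (charlie)-[:FOLLOWS]->(alice)")
driver.close()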
Over time there can be changes to the graph, for example, the addition
of nodes and relationships and updates to the values of some properties. Figures
3.2a and 3.2b display examples of the states of the graph, before and after the
changes, respectively. Figure 3.2a can be referred to as version 1 (V1) of the graph
and Figure 3.2b as version 2 (V2). In Figure 3.2b, there is one
new node ‘Dave‘ and one new relationship from node ‘Charlie‘ to ‘Dave‘. There are
also new properties in nodes ‘Alice‘, ‘Bob‘, and ‘Dave‘, as well as a new property in the
relationship from ‘Bob‘ to ‘Charlie‘. Given the existing versions of the graph, V1 in
Fig. 3.2a and V2 in Fig. 3.2b, we want to be able to retrieve V1 from V2 and to retrieve
V2 from V1.

3.2 Version control components


To develop a version control system for the graph-oriented database, there are sev-
eral components that need to be defined and identified.

1. Version control API


Version control API is the user interface through which users interact with the
versioning tool. We are interested in building a version control system with as
few modifications as possible to the database system.
2. Database storage
Database storage relates to how data is stored in the database. Based on the
description in Chapter 2, there are two common approaches to manage ver-
sions in the database. The first one is by changing data representation when it
is inserted or updated into the database. The second one is by storing the list
of changes in the database without changing the structure of the database.
Both options are valid, but because we do not want to modify the data structure
of the database, we choose the second option, that is, defining how to store
the delta or the list of changes in the database.

( A ) Graph before changes


( B ) Graph after changes

FIGURE 3.2: An example of the transformation of a graph



3. Delta encoding
Delta encoding deals with how the changes, whether from an insert, an update, or a
delete transaction, are going to be encoded and stored.

4. Version graph
Version graph represents the order of versions, for example, how to reach a
target version Vt from a source version Vs .

3.3 Version control API


In this thesis, we are interested in building a version control tool that introduces
minimal modification to the data and the functionality of the database system. This
means that we do not want to store the delta inside the database, and we do not
want to modify how data is queried from the database. For this purpose, we adopt
Git-style command-line functionality, developed in Python.
Each version control system has a set of functionalities that may differ
from the others. However, there is a set of functions that is commonly available
in most version control systems, as described in Section 2.3.4. The following are
the functionalities that we are interested in implementing in our version control
system; a hypothetical command-line session illustrating them follows the list.

1. Init: This function tells the system to start managing the version of the active
Neo4j database. All nodes and relationships that exist in the database or that
will be inserted into the database will be versioned. It also instructs the sys-
tem to start detecting changes in the database and store them. For the graph
example in Figure 3.2a, ‘init‘ function tells the system to identify changes in all
three nodes, all four relationships, and the upcoming additions of nodes and
relationships to the graph.

2. Commit: It is used to create a new version of the data. The changes from the
previous version are calculated and stored. Commit can be used to save the
state in Figure 3.2a as "version 1" and the state in Figure 3.2b as "version 2".

3. Checkout: This function will materialize a certain version that has been com-
mitted before. For example, if the current state of the graph is depicted in
Figure 3.2b and we want to see the state of the graph in version 1, illustrated
in Figure 3.2a, we can use checkout function with version 1 as the parameter.

4. Diff: Diff is used to compare two versions and determine, for example, which
nodes are modified, which relationships are created, which properties are added,
and which labels are removed. In the case of the graphs in Figures 3.2a and 3.2b, the diff command will
compare the list of changes between version 1 and version 2, and the results
are the highlighted properties.

5. Branch: We want the version control system to support branching, so that it is


possible to explore some changes and either merge them or revert to the parent
branch.

6. Merge: It can be used to merge the changes to the parent branch.

7. Blame: When we want to learn what changes were applied to a certain node or
relationship and who changed it, we can use the blame function.
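The following hypothetical command-line session sketches how these functions could be combined in practice. Only the ‘grit init‘ command is named explicitly later in this chapter; the other flags and placeholder ids are illustrative assumptions, not a specification of the tool.

$ grit init
$ grit commit -m "version 1"        # save the state of Figure 3.2a
  ... apply changes to the database ...
$ grit commit -m "version 2"        # save the state of Figure 3.2b
$ grit diff <id-of-version-1> <id-of-version-2>
$ grit checkout <id-of-version-1>
$ grit branch experiment            # explore changes on a branch
$ grit merge experiment
$ grit blame <grit-id-of-entity>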

3.4 Delta encoding


We are interested in having the ability to compare the differences between versions
of the graph. Considering this requirement, we propose a delta-based approach
for managing the versions. One of the approaches is to serialize the data that are
changed, along with the type of transaction. There are two steps in determining the
delta encoding: detecting the changes and determining the data structure.

3.4.1 Detecting changes from a transaction


One of the essential steps in version control is to figure out the changes that the users
make over time. One possible way of detecting changes is by intercepting the
command before it is committed to the database. This way, the list of changes
can be observed before it is written into the database. The disadvantage of
this method is that the changes may or may not actually happen in the database,
since something might prevent the modifications from being committed.
Based on Section 2.3.2, changes in a database system can be represented as SQL
queries and can also be detected using triggers just before or after a transaction is
committed. The trigger is installed and configured in the database system so that
alterations to the database can be discovered without any knowledge of the front-end
application or the workflow. By using this technique, the user can be assured
that the changes will take place, because the database will throw an error if the
changes fail to be committed.
A graph database has four main classes or components, namely nodes, relationships,
labels, and properties. Therefore, we need to save the states of the graph
before and after a node is created or deleted, before and after a relationship is created
or deleted, and before and after a property is assigned, overwritten, or deleted.
To detect the changes, Neo4j provides the ‘TransactionData‘ interface to handle the
changes before and after a transaction. Examples of ‘TransactionData‘ elements are
as follows:

• username: the name of the user who committed the transaction

• createdNodes: the list of nodes, along with the properties and labels, that are
created during a transaction.

• deletedNodes: the list of nodes that are deleted during a transaction.

• assignedNodeProperties: the list of node properties that are assigned or overwritten during a transaction

• removedNodeProperties: the list of node properties that are removed during a transaction

• assignedLabels: a node normally has one label, but it can also have multiple
labels or none. This element contains the list of labels that are assigned to a
node during a transaction.

• removedLabels: similar to assignedLabels but for the labels that are removed
from a node

• createdRelationships: similar to createdNodes but for relationships

• deletedRelationships: similar to deletedNodes but for relationships



FIGURE 3.3: Data structures for changes detection

• assignedRelationshipProperties: the list of properties that are added or changed on a relationship

• removedRelationshipProperties: the list of properties that are removed from a relationship

These properties can be used to determine the state of the graph before and after
some queries or transactions are committed and therefore determine how to recon-
struct the graph.

3.4.2 Defining the data structures


After changes are detected, it is important to save those changes or deltas appro-
priately so that the retrieval of a target version Vt from source version Vs can be
done efficiently. To accommodate this, we define several data structures that are
illustrated in Figure 3.3.
For the graph database, we add a system-assigned id, called ‘grit_id‘, for each
entity (node and relationship) in the database. Thus, it is possible to identify which
entity undergoes a modification. Aside from that, we do not modify the structure of
the database.
To capture the modification in each transaction, we define the first data structure
in our versioning system. For every transaction, a JSON-formatted serialized object
consisting of the modification is saved into a file named ‘changelog‘. The object
consists of a combination of the following elements.

• entity: the type of the entity (node or relationship) that is modified. The value
is either ‘node‘ or ‘edge‘.

• internal_id: the id of the entity that was assigned by Neo4j.

• grit_id: the id of the entity that was assigned by the versioning tool.

• start_node_grit_id and end_node_grit_id: the id of the start node and the end
node of a relationship.

• action: the action that is performed on the entity. The value is one of ‘create‘,
‘delete‘, ‘add_label‘, ‘remove_label‘, ‘add_property‘, ‘update_property‘,
and ‘delete_property‘.

• label: the list of assigned or removed labels from a node.

• edge_type: the type of a created relationship.

• key, old_value, and new_value: the key-value pair of the property that was
assigned, overwritten, or removed. We store the old and new values so that it
is possible to roll back to an older version.

Each line in the ‘changelog‘ file is a JSON-formatted object that represents a
change in the database. One entity in the database can be modified multiple times
and therefore will result in multiple JSON objects in the changelog. We can further
aggregate the changelog file by grouping the objects based on the ‘grit_id‘ so that
each entity is represented by one object.
The aggregation is performed when a commit command is issued, signalling
that a new version V j will be created after the current version Vi . The aggregated
data consist of several entity objects, each with a unique id. The attributes of each object
represent the list of changes to an entity in the database from version Vi to V j . We
formulate the attributes of the aggregate objects as follows.

• id: the unique identifier by which the contents of the changelog file are grouped.

• entity_type: the type of the entity (node or relationship)

• labels_added: the list of all assigned labels for a node from Vi to V j .

• labels_removed: the list of all removed labels for a node from Vi to V j .

• relationship_type (String): the type of a created relationship.

• properties_before_commit: a dictionary object depicting the list of entity properties in Vi

• properties_after_commit: a dictionary object depicting the list of entity properties in V j

• is_created: a boolean value denoting whether this entity is created in V j

• is_changed: a boolean value denoting whether this entity is changed in V j

• is_deleted: a boolean value denoting whether this entity is deleted in V j

The aggregate of the changelog file contains the delta of a version from Vi to
V j . To retrieve these objects, we store them in a file called the commit object or delta
object, with a unique identifier called ‘version_id‘.
From the delta object, we can generate a set of Cypher queries to reconstruct V j
from Vi and to roll back from V j to Vi . It is also possible to show the differences
between Vi and V j from the aggregate objects.
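As an illustration of this aggregation step, the following minimal Python sketch groups changelog lines by ‘grit_id‘ into per-entity aggregate objects. The field names follow the data structures above, but the helper itself is a hypothetical simplification: it handles only a subset of the actions and omits relationship-specific attributes.

import json

def aggregate_changelog(path):
    """Group changelog lines by grit_id into per-entity aggregate objects."""
    aggregates = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            change = json.loads(line)
            entity = aggregates.setdefault(change["grit_id"], {
                "id": change["grit_id"],
                "entity_type": change["entity"],
                "labels_added": [],
                "labels_removed": [],
                "properties_before_commit": {},
                "properties_after_commit": {},
                "is_created": False,
                "is_changed": False,
                "is_deleted": False,
            })
            action = change["action"]
            if action == "create":
                entity["is_created"] = True
            elif action == "delete":
                entity["is_deleted"] = True
            elif action in ("add_property", "update_property"):
                key = change["key"]
                # Keep the first-seen old value and the last-seen new value.
                entity["properties_before_commit"].setdefault(key, change.get("old_value"))
                entity["properties_after_commit"][key] = change["new_value"]
                entity["is_changed"] = True
            elif action == "add_label":
                entity["labels_added"].extend(change["label"])
                entity["is_changed"] = True
    return aggregates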

3.4.3 Building the plugin


To extend the functionality of the database to capture the changes, Neo4j allows
developers to build plugins. Plugins are written in Java and must be compiled and
installed in the appropriate directory before the database is started.
The plugin that we developed has the following functions.

• Assign a unique identifier called ‘grit_id‘ for each created node and relation-
ship

• Detect every transaction that happens in the database

• Store the changes in the aforementioned data structures

The plugin will generate a changelog file containing the changes for every entity.
The following is an example of the output of the plugin.
{
    "grit_id": "41e8373a-7b66-4c06-afa2-c06e1833a39c",
    "action": "create",
    "entity": "node",
    "properties": {
        "user_id": "10263002"
    },
    "username": "neo4j"
}
{
    "grit_id": "41e8373a-7b66-4c06-afa2-c06e1833a39c",
    "action": "delete",
    "entity": "edge",
    "edge_type": "FOLLOWS",
    "start_node_grit_id": "1ae66808-7c8c-4f7f-9335-d284df75b43e",
    "start_node_label": ["Users"],
    "end_node_grit_id": "41e8373a-7b66-4c06-afa2-c06e1833a39c",
    "end_node_label": ["Users"],
    "username": "neo4j"
}

3.5 Version graph representation


The version graph in a version control system is commonly saved in the form of a
directed acyclic graph (DAG), which consists of vertices and edges without cycles
and represents the sequence of versions. The vertices represent the versions of the
database. Given a source version Vs identified by an id, we should be able to find a
path to any other version Vt even though the version graph has many branches.
Each vertex or version Vi in the version graph has several attributes, as follows.

• version_id: a unique identifier for each version

• author: the user who made the commit

• timestamp: the timestamp of when the version was created

• message: the commit message

• branch_name: the name of the branch to which a version belongs

The edges in the version graph represent the deltas between subsequent versions.
Given two subsequent versions denoted as (Vi )-[∆ij ]->(V j ), where Vi and V j are
two subsequent versions and ∆ij is the delta of the two versions, ∆ij stores the delta
object from Vi to V j and vice versa.

Since the terms for the components of version graph and Neo4j graph are similar,
to differentiate them, we will use nodes and relationships as the components of the
graph database (Neo4j) and use vertices and edges to denote the components of the
version graph. To help with the creation of the version graph, a Python package,
networkx, is utilized.
Figure 3.4 illustrates a version graph. Each version, denoted by a dot or vertex,
has several properties, namely branch_name, version_id, commit_message, and
author. The version_id in the version graph corresponds to the version_id of the
delta object.

FIGURE 3.4: An illustration of Version Graph
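As a minimal sketch of this representation, the version graph could be modelled with networkx as follows; the version ids, attribute values, and the delta-object path are placeholder assumptions.

import networkx as nx

# Directed graph: vertices are versions, edges carry the delta objects.
version_graph = nx.DiGraph()
version_graph.add_node("v0", branch_name="master", author="neo4j",
                       timestamp=1546300800, message="initial commit")
version_graph.add_node("v1", branch_name="master", author="neo4j",
                       timestamp=1546387200, message="add users")
# The edge property points to the delta object for v0 -> v1 (and back).
version_graph.add_edge("v0", "v1", delta_object="deltas/v1.json")

# A checkout needs a path between two versions regardless of edge
# direction, so the search is done on the undirected view.
path = nx.shortest_path(version_graph.to_undirected(), "v1", "v0")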

3.6 Initializing the versioning system (init function)


We designed the init function to initialize the versioning system via a command
‘grit init‘ that can be issued in the command line. The init function is described in
Algorithm 1.

Algorithm 1 Pseudocode for initialization


1: procedure INIT
2: Assign a unique identifier to each existing node and relationship in the
database
3: Create a new version Vinit with the initial snapshot of the database
4: Create the root directory and the configuration files for the versioning system
repository. The repository directory is set to ‘ /.grit‘
5: Generate the initial version graph with one root vertex with an id that corre-
sponds to the version Vinit
6: Create the default branch ‘master‘ and assign a reference to the current ver-
sion Vinit
7: Create a file named ‘head‘ and assign a reference to the current branch ‘mas-
ter‘

3.7 Creating a new version (commit)


After the initialization, the version control system is ready to detect changes to the
database and write them into a changelog file as described in Section 3.4.2. This file
will be the basis of the creation of the delta object. The commit function is described
in Algorithm 2.

Algorithm 2 Pseudocode for commit


1: procedure COMMIT
2: Retrieve the id of the current version as Vc
3: Generate a new version id Vn and add Vn to the version graph
4: Add a new edge ∆cn in the version graph from Vc to Vn , describing that ver-
sion Vn is the successor of version Vc
5: Retrieve the current branch CB
6: Read the changelog file
7: Aggregate the content of the changelog file by grouping the id
8: Store the aggregate data as the delta object and assign it as the property of
∆cn
9: Delete the changelog file
10: Assign a reference of the new version Vn to the current branch CB
11: The result is two subsequent versions in the version graph denoted as (Vc )-
[∆cn ]->(Vn ) where Vc is the version before commit, Vn is the version after commit,
and ∆cn is the delta object from Vc to Vn

3.8 Materializing a version (checkout function)


To materialize a certain version, we need to traverse the version graph, find the
path from the source version to the target version, and generate the Cypher queries from the
delta objects along the path. For example, in the version graph in Figure 2.1, if we
are currently in version ‘665003d‘ (master) and want to materialize version ‘3e89ec8‘
(develop), we would have the two delta objects along the path, ‘6c6faa5‘ (master) and
‘3e89ec8‘ (develop). To materialize them, we combine the two delta objects, generate
the queries, and execute them against the database.
Across the list of delta objects, there can be overlaps in the entities that are modified.
Therefore, before generating the queries, we create the aggregates of the delta objects
by grouping the entities based on their unique identifiers.
The materialization functionality is provided as the checkout function. The parameter
of the function is the version_id that is going to be materialized. The checkout
function is defined in Algorithm 3.

Algorithm 3 Pseudocode for checkout


1: procedure CHECKOUT
2: Retrieve the id of the target version that is going to be materialized Vt .
3: Retrieve the id of the current version/source version Vs .
4: Find the path P from Vs to Vt in the version graph. The result is the list of
versions and delta objects along the path.
5: Load the delta objects ∆list based on the list in P
6: Aggregate the delta objects ∆list into one object ∆aggregate
7: Generate the Cypher queries from the delta object ∆aggregate
8: Execute the queries
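A minimal sketch of this procedure is shown below. The ‘load_delta‘ helper and the query templates are hypothetical, only node creation and deletion are handled, and the inversion of deltas when moving backwards through the graph is omitted for brevity.

import networkx as nx

def checkout(version_graph, load_delta, source, target):
    """Generate Cypher statements that transform `source` into `target`."""
    # Find the path between the two versions, ignoring edge direction.
    path = nx.shortest_path(version_graph.to_undirected(), source, target)
    queries = []
    for a, b in zip(path, path[1:]):
        for entity in load_delta(version_graph, a, b).values():
            if entity["entity_type"] != "node":
                continue  # relationships are omitted in this sketch
            if entity["is_created"]:
                queries.append("CREATE (n {grit_id: '%s'})" % entity["id"])
            elif entity["is_deleted"]:
                queries.append(
                    "MATCH (n {grit_id: '%s'}) DETACH DELETE n" % entity["id"])
    return queries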

3.9 Comparing the differences between versions (diff function)

The diff function allows the user to compare the lists of created, changed, and deleted
entities between two versions of the graph. The parameters of the function are the
id of the source version Vs and the id of the target version Vt . The diff function is
described in Algorithm 4.

Algorithm 4 Pseudocode for diff


1: procedure DIFF
2: Find the common ancestor CA of source version Vs and target version Vt
3: Find the path PCA−s from the common ancestor CA to source version Vs
4: Aggregate the delta objects ∆CA−s−list from the list of versions in path PCA−s
into one delta object ∆CA−s−aggregate
5: Find the path PCA−t from the common ancestor CA to target version Vt
6: Aggregate the delta objects ∆CA−t−list from the list of versions in path PCA−t
into one delta object ∆CA−t−aggregate
7: Get the unique entities in ∆CA−s−aggregate and print them as the deleted enti-
ties
8: Get the unique entities in ∆CA−t−aggregate and print them as the created entities
9: Get the entities that exist in both ∆CA−s−aggregate and ∆CA−t−aggregate , get
their properties, and print them as the changed entities
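The set logic of Algorithm 4 can be sketched in Python as follows, assuming the two aggregate delta objects are dictionaries keyed by entity id (the structure sketched in Section 3.4.2); the property comparison is simplified.

def diff(agg_source, agg_target):
    """Classify entities as created, deleted, or changed between two versions."""
    source_ids = set(agg_source)
    target_ids = set(agg_target)
    deleted = source_ids - target_ids  # only on the source side of the ancestor
    created = target_ids - source_ids  # only on the target side of the ancestor
    changed = {
        eid for eid in source_ids & target_ids
        if agg_source[eid]["properties_after_commit"]
        != agg_target[eid]["properties_after_commit"]
    }
    return created, deleted, changed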

3.10 Tracking the changes for one entity (blame function)


The blame function provides the description of the list of changes of an entity from
the initialization of the versioning tool to its current state. Given the id of an entity
e, the blame function will execute the following set of steps.

Algorithm 5 Pseudocode for blame


1: procedure BLAME
2: Retrieve the root version Vroot and the current version Vhead
3: Find the path P from Vroot to Vhead
4: Check if the entity e exists in each version Vi in the path P
5: If the entity e exists in version Vi , print the properties of the entity e and the
author of version Vi

3.11 Creating and switching to another branch (branch function)

Each version Vi in the version graph stores information about the branch name
to which the version belongs. Furthermore, the reference to the current version of a
specific branch is also stored in the repository directory.
Branching in the version control system allows the users to explore new ideas
and temporary modifications apart from the main work. The user has the option to either
integrate the modifications into the main work or revert the changes.

There are two functionalities in the branch function: creating a branch and switching
the current branch. The create branch function takes one parameter, the name of the
new branch branch_name. It is described as follows.

Algorithm 6 Pseudocode for branch


1: procedure CREATE B RANCH
2: Get the current version Vc
3: Create a new file named branch_name and a reference to Vc
4: Set the ‘head‘ reference to the new branch branch_name

Switching branch means that the working copy of the database is set to the state
of the graph based on the latest version referred by that branch. The switch branch
function takes the parameter of the target branch branch_target and is described as a
set of processes as follows.

Algorithm 7 Pseudocode for switch branch


1: procedure SWITCH B RANCH
2: Get the source version Vs from the current branch
3: Get the target version Vt from branch_target
4: Checkout Vt from Vs

3.12 Merging two branches (merge function)


The merge functionality can be used to merge the changes that were performed in a
branch into its parent branch. The merge function takes the name of another branch
branch_target from which the data are going to be merged. The list of steps in the
merge function is as follows.

Algorithm 8 Pseudocode for merge


1: procedure MERGE
2: Retrieve the current version from as Vs
3: Retrieve the target version as Vt from branch_target
4: Find the common ancestor CA of source version Vs and target version Vt
5: Find the path PCA−s from the common ancestor CA to source version Vs
6: Aggregate the delta objects ∆CA−s−list from the list of versions in path PCA−s
into one delta object ∆CA−s−agg
7: Find the path PCA−t from the common ancestor CA to target version Vt
8: Aggregate the delta objects ∆CA−t−list from the list of versions in path PCA−t
into one delta object ∆CA−t−agg
9: Get the symmetric-difference entities S ⊖ T from ∆CA−s−agg and ∆CA−t−agg
10: Get the union entities S ∪ T from ∆CA−s−agg and ∆CA−t−agg
11: Find the conflicting properties from S ∪ T
12: If there are conflicting properties, cancel the merge process
13: Otherwise, create a delta object ∆merge from S ⊖ T and S ∪ T
14: Generate Cypher queries from ∆merge
15: Execute the queries
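The conflict check in steps 9-13 can be sketched as follows, where agg_s and agg_t are the aggregate delta objects from the common ancestor to each branch (the hypothetical structure from Section 3.4.2).

def find_conflicts(agg_s, agg_t):
    """Return (entity id, property key) pairs changed differently on both branches."""
    conflicts = []
    for eid in set(agg_s) & set(agg_t):  # entities touched on both branches
        props_s = agg_s[eid]["properties_after_commit"]
        props_t = agg_t[eid]["properties_after_commit"]
        for key in set(props_s) & set(props_t):
            if props_s[key] != props_t[key]:
                conflicts.append((eid, key))
    return conflicts  # a non-empty result cancels the merge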

Chapter 4

Leveraging snapshots in version materialization

This chapter discusses an approach for improving the initial delta-based
versioning system for graph databases that was elaborated in Chapter 3. Leveraging
snapshots in version materialization is expected to reduce the checkout time of a
version when a large number of changes is involved.

4.1 Problem with delta-based versioning


In our initial build of the system, the system stores the delta from any version Vi to
its subsequent version V j . Storing the delta is considered cost-effective in terms of
storage compared to storing a snapshot. However, materialization using deltas is
not always efficient in terms of execution time.
A delta object represents the list of changes from one version to its subsequent
version. There is no limit to how many modifications can be performed in one
version. One delta object may take only a couple of seconds to execute if there are
only a couple of changes, but it can take minutes if there are millions of
changes to the database; thus, while the storage cost is relatively low, the execution-time
cost can be expensive. On the other hand, saving snapshots of the database is
considered expensive in terms of storage cost, but it is faster to materialize a version by
using snapshots.

4.2 Generating a snapshot after a commit


Based on the problem described in the previous section, we propose version materialization
using snapshots in cases where the execution-time cost is too expensive.
The decision process on whether to store a snapshot is executed after every
commit. Figure 4.1 illustrates the initial approach and the improved approach for this
process.
After every commit, a delta object is always stored while a snapshot is not always
generated. The decision on whether to generate a snapshot is based on the previous
execution time cost from the materialization/checkout process. We then define a
modifiable parameter to determine the threshold of the execution time cost. If the
predicted execution time exceeds this threshold, a snapshot is generated.
To help with the decision process, a regression model will be created and updated
after every version materialization process. When a commit command is issued
to create version Vi , the system will take some properties of the delta object and
predict the checkout time. If the predicted time exceeds the user-defined time threshold,

( A ) Initial approach for the checkout function

( B ) Updated approach for the checkout function

FIGURE 4.1: Flowchart for the checkout function

a snapshot will be generated and will be stored with the name that corresponds with
Vi .
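A minimal sketch of this post-commit decision, assuming a trained regression model (see Section 4.4), the 300-second threshold used later in the evaluation, and a hypothetical ‘dump_snapshot‘ helper around Neo4j's dump facility:

def maybe_snapshot(model, delta_stats, version_id, dump_snapshot,
                   threshold_seconds=300):
    """Generate a snapshot if the predicted checkout time exceeds the threshold."""
    features = [[
        delta_stats["num_changed_nodes"],
        delta_stats["num_changed_relationships"],
        delta_stats["num_changed_properties"],
        delta_stats["delta_storage_size"],
    ]]
    predicted_time = model.predict(features)[0]
    if predicted_time > threshold_seconds:
        dump_snapshot(version_id)  # snapshot named after the new version
    return predicted_time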
After generating the snapshot, the version graph needs to be updated, describing
that aside from creating the delta object, the snapshot is also generated for version Vi .
To achieve this, we store the predicted checkout time as the property of the edge in
the version graph. Given a sub-graph of a version graph denoted as (Vi )-[∆ij ]->(V j )
where Vi and V j are 2 subsequent versions and ∆ij is the delta of the two versions,
we assign the following attributes to ∆ij .

1. num_changed_nodes: the number of changed nodes from Vi to V j and vice versa

2. num_changed_relationships: the number of changed relationships from Vi to V j and vice versa

3. num_changed_properties: the number of changed properties in a checkout from Vi to V j and vice versa

4. delta_storage_size: the size of the delta object from Vi to V j

4.3 Optimizing the checkout time


Now that we have snapshots of the database, the version materialization process is
also modified. The updated approach is described as the following set of processes;
a sketch of the snapshot-selection step follows the list:

1. Find the path P from the source version Vs to the target version Vt .

2. From the list of versions in P , find all versions that have a snapshot. We call
this set Vsnapshot . The current version Vs and the version at the root of the version
graph Vroot are also included in Vsnapshot .

3. From Vsnapshot , get the nearest version to Vt and assign it to Vn . Vn is
the version that has the lowest sum of checkout time to the target version;
this sum can be calculated from the aggregated delta objects from Vn to Vt .

4. If Vn is equal to the current version Vs , do the delta-based checkout process
from Vs to Vt .

5. Otherwise, materialize Vn from the snapshot, and checkout the target version
Vt with Vn as the source version.
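The snapshot-selection step (steps 2-3) can be sketched as follows; ‘has_snapshot‘ and ‘predicted_cost‘ are hypothetical helpers, the latter returning the summed predicted checkout time between two versions along the path.

def nearest_snapshot(path, has_snapshot, predicted_cost, source, target):
    """Pick the starting version with the lowest predicted cost to reach target."""
    # Candidate starting points: snapshotted versions plus the current version.
    candidates = [v for v in path if has_snapshot(v)]
    candidates.append(source)  # delta-only checkout from the current version
    return min(candidates, key=lambda v: predicted_cost(v, target))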

4.4 Building the model for optimizing the execution time


To help with the decision process, a regression model will be created and updated
after every version materialization. When a version is checked out by using delta
objects, the attributes of the delta objects and the actual checkout time are added as
the training data to build the model. The model will then be used to predict future
checkout time.
To mitigate the cold-start problem, where we do not yet have data on the actual
checkout time, we need to build an initial regression model based on several test
cases. The initial model will be used for the decision process until enough
checkout-time data from the user has been collected.

4.4.1 Generating the dataset for the initial model


To generate the dataset for the initial model, we conduct several test cases of the ver-
sion materialization using our initial approach to generate the checkout metadata.
The dataset of checkout metadata consists of the following attributes:

1. num_changed_nodes: the number of changed nodes from Vi to V j and vice versa

2. num_changed_relationships: the number of changed relationships from Vi to V j and vice versa

3. num_changed_properties: the number of changed properties in a checkout from Vi to V j and vice versa

4. delta_storage_size: the size of the delta object from Vi to V j

5. actual_checkout_time: the time that it takes to do a delta-based checkout from Vi to V j and vice versa

TABLE 4.1: Sample of checkout time metadata

num_changed_nodes | num_changed_relationships | num_changed_properties | delta_object_size | actual_checkout_time
18 | 10 | 18 | 3683 | 4.40
108 | 100 | 108 | 24186 | 7.60
212 | 1000 | 212 | 151463 | 22.00
428 | 9994 | 428 | 1329475 | 134.00
5280 | 97419 | 5280 | 13083278 | 1000.00

FIGURE 4.2: Plot of pairwise relationships of the checkout metadata

The first four attributes are the independent variables that can be retrieved from
the delta objects. The last attribute, actual_checkout_time, is the dependent variable
that we want to predict. Table 4.1 displays a sample of checkout metadata based on
several test cases.
Figure 4.2 displays the pairwise correlation of the columns of the checkout metadata.
It can be seen that the checkout time is linear in all the other attributes.

4.4.2 Building and updating the model


Since the checkout time is linear in the number of changes and the delta object storage
size, we use linear regression to build our initial model. The model is then stored
as a Pickle file to be used after a commit. The evaluation of the base regression model
results in an R-squared score of 0.9.
The materialization time in one computer system will differ from that in another
system. Therefore, this initial model will be replaced by a new model once a dataset
of checkout metadata on the user’s computer is obtained, with a minimum of 5 data
points. The new model will be updated after every checkout.
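A minimal sketch of building and persisting such a model with scikit-learn, using the sample rows from Table 4.1 as training data (the file name is a placeholder):

import pickle
from sklearn.linear_model import LinearRegression

# Feature rows: [num_changed_nodes, num_changed_relationships,
#                num_changed_properties, delta_object_size]
X = [[18, 10, 18, 3683],
     [108, 100, 108, 24186],
     [212, 1000, 212, 151463],
     [428, 9994, 428, 1329475]]
y = [4.40, 7.60, 22.00, 134.00]  # actual checkout times in seconds

model = LinearRegression().fit(X, y)

# Persist the model so it can be reloaded after the next commit.
with open("checkout_model.pkl", "wb") as f:
    pickle.dump(model, f)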

4.4.3 Deploying the model


The model is saved into a Pickle file after every build. During a checkout, the meta-
data for the checkout, consisting of the number of changed nodes, the number of
changed relationships, the number of changed properties, and delta object size will
be used as the testing data to infer the predicted checkout time. The predicted check-
out time is compared to the user-defined threshold to determine whether the snap-
shot is created or not.

4.5 Optimizing the storage cost


Over time, after many commits and checkouts, the size of the version control
repository will increase accordingly. While deltas must always be saved, storing the
snapshots is optional. We propose using a solution to the knapsack problem to eliminate
some snapshots when a certain storage limit is exceeded.
Snapshot removal starts by checking the user-defined maximum storage
for the repository. If this limit is exceeded, the system will run a knapsack
algorithm. The knapsack algorithm takes three inputs, namely the list of snapshot
storage sizes (weights), the list of predicted checkout times (values), and the storage
limit (capacity).
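As a minimal sketch under these assumptions, a standard 0/1 knapsack dynamic program can select which snapshots to keep: the sizes act as weights, the predicted checkout times as values, and the storage limit as the capacity. The function below is a textbook formulation, not the tool's exact implementation.

def select_snapshots(sizes, times, capacity):
    """0/1 knapsack: keep the snapshots that save the most predicted
    checkout time while fitting within the storage capacity."""
    best = [(0, [])] * (capacity + 1)  # best[c] = (total value, kept indices)
    for i in range(len(sizes)):
        for c in range(capacity, sizes[i] - 1, -1):
            value, chosen = best[c - sizes[i]]
            if value + times[i] > best[c][0]:
                best[c] = (value + times[i], chosen + [i])
    return best[capacity][1]  # indices of the snapshots to keep

# Example: snapshot sizes in MB, predicted checkout times in seconds.
keep = select_snapshots([10, 20, 15], [120, 300, 200], capacity=30)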

Chapter 5

Evaluation

This chapter summarizes the evaluation process for the system described in Chapters
3 and 4. The evaluation setup is elaborated in Section 5.1, followed by the explanation
of the evaluation metrics in Section 5.2 and the evaluation results in Section 5.3.

5.1 Evaluation setup


There are two variants of the delta-based approach in our data versioning tool. The first
variant only stores deltas to retrieve versions, as described in Chapter 3. The second
variant stores deltas and, when necessary, also stores snapshots. The decision whether
or not to store a snapshot is determined by the number of changes in a version, as
described in Chapter 4.
To evaluate the performance of our version control system, we also compare it
with snapshot-based data versioning using Git and entity-state-based versioning
using neo4j-versioner-core (D’Este and Falcier, 2017). Thus, there are four approaches
that are evaluated, described as follows.

• Grit-base (base approach): the base approach for our delta-based versioning.

• Grit-snapshot (base approach + snapshot): the delta-based versioning tool where a snapshot is created when certain criteria are met. The threshold for creating a snapshot is set to 300 seconds.

• Snapshot-only (using snapshots with Git): in each version of the database, a snapshot is created. The disadvantage is that it is not possible to compare two graph datasets from the snapshots.

• Entity-state (using neo4j-versioner-core by D’Este and Falcier, 2017): in this approach, the entity and its states are separated into different nodes. Commit and checkout functions create a new state node for each entity in the version.

The dataset that is used for evaluation is the Twitter users’ connectivity information
named ‘ego-Twitter‘ from the Stanford Network Analysis Project (Leskovec and Krevl,
2014). The dataset is an example of a directed graph of users’ "following" information.
It consists of 81,306 Twitter users (nodes) and 1,768,149 "follow" relationships.
There are several control variables that are varied in each iteration of the evaluation
processes, described as follows.

• Number of initial entities: the number of existing entities in the database
before changes are applied

• Number of changed entities: The number of changed entities in one version



TABLE 5.1: Environment for system evaluation

Aspect | Details
Machine | Google Compute Machine
Operating system | Ubuntu 18.04
Processor | 8 vCPUs
Memory | 32 GB
Java | java version "1.8.0_172"
Python | Python 3.6.8
Neo4j | neo4j 3.4.12

• Approach: the tool that is used to do data versioning

We would like to investigate to what extent the control variables affect the performance
of the versioning systems. Therefore, we deploy several test cases, changing
one variable at a time to see the effect on the evaluation metrics. For each test
case, the commit and checkout functionalities are performed and the evaluation
metrics are calculated. The evaluation is run in the same environment, which is
described in Table 5.1.

5.2 Evaluation metrics


The objectives of our versioning tool are to store versions and to retrieve and
compare versions with efficient commit and checkout times. Another
objective of the tool is to store the delta encoding without requiring a large
storage cost. Therefore, we propose three evaluation metrics for the system evaluation,
namely commit time, checkout time, and storage cost.

5.2.1 Commit time


The commit time can be defined as the execution time to create a new version by
storing the delta object and/or a snapshot. Each approach described in Section 5.1
has a different process of committing changes. In our versioning tool, the commit
process consists of generating the changelog file and constructing the delta objects.
On the other hand, the commit process in snapshot-based versioning consists of
applying the changes to the database and creating dump objects of the database. In
the entity-state approach, the commit process is similar to executing the Cypher queries,
albeit with a different syntax.
Therefore, in this evaluation process, the commit time is described as the sum
of the time to execute the queries and the time to generate the delta/snapshot objects.
The delta-based approach has both query execution time and delta object generation
time. The snapshot-based approach only has snapshot generation time. Meanwhile,
the entity-state approach only has query execution time.

5.2.2 Checkout time


Similar to the commit time, each approach for data versioning has a different pro-
cess of retrieving past versions of data. In our versioning tool, the checkout process
consists of generating queries from delta objects and executing these queries. Mean-
while, in the snapshot-based versioning tool, the process includes loading the dump

object into the database. On the other hand, the checkout process in the entity-state
approach consists of executing Cypher queries.
The checkout time is thus described as the sum of the time to generate the
queries and the time to execute them.

5.2.3 Storage cost


The storage cost consists of three items: the database size in each evaluation scenario,
the delta size, and the snapshot size. The delta-based approach using Grit generates both
deltas and snapshots. The snapshot-based approach generates only snapshots. Meanwhile,
the entity-state approach does not generate delta or snapshot objects.

5.3 Evaluation result


5.3.1 Commit time
Figure 5.1 depicts the comparison of commit time as a function of the number of
changed entities. Each sub-figure represents the results for a different number of
initial entities in the database. It can be seen that the commit time for all approaches is
linear in the number of changes. Both delta-based approaches take longer to commit
than the snapshot-based and entity-state approaches. The extra commit time
in the delta-based approaches arises because our tool needs to generate delta objects for each
version.
Figure 5.2 depicts the comparison of commit time for each approach. It can be
seen that the initial number of entities in the database slightly increases the commit
time. From both Figures 5.1 and 5.2, it can be seen that although the same set of
queries is executed against the database, the execution time can differ.

5.3.2 Checkout time


Figure 5.3 visualizes the trends of the checkout time for the approaches. The checkout
time of the snapshot-based approach is constant regardless of the number of changes,
since the checkout process is basically loading the database from a snapshot. The
checkout time of the entity-state approach is also constant, since the versions are stored
and encoded as nodes in the database and are therefore faster to retrieve.
The checkout time for the delta-based approach follows trends similar to those of the
commit time, where the increase in the number of changed entities correlates with
the time to checkout. The more changes applied to the database, the longer it
takes to execute the queries against the database.
The snapshot is created when the predicted checkout time exceeds 300 seconds,
which happens at roughly 20,000 changes. By leveraging snapshots in the approach,
there is an improvement in checkout time, because the target version is
retrieved from a snapshot instead of from delta objects.

5.3.3 Storage cost


Figure 5.5 depicts the storage cost comparison for all approaches. It can be seen
that the snapshot-based approach, despite creating a new snapshot object in each version,
takes the smallest amount of storage. The entity-state approach takes the largest
amount of storage due to the need to store the state nodes each time a version is
retrieved or created.
On the other hand, the delta-based approach takes up more storage than the snapshot-only
approach but less than the entity-state approach. Breaking down the storage cost,
illustrated in Figures 5.6 to 5.9, a larger portion of the storage cost is for storing
the database files. The larger database size in the delta-based and entity-state
approaches is probably due to the additional information that needs to be stored
in the database.

( A ) Number of initial entities = 0 ( B ) Number of initial entities = ~4000
( C ) Number of initial entities = ~10000 ( D ) Number of initial entities = ~15000

FIGURE 5.1: Comparison of commit time. Each sub-figure represents a different number of initial entities. [Four line plots of commit time (seconds, 0-1000) versus the number of changed entities (2,500-20,000) for the grit-base, grit-snapshot, snapshot-only, and entity-state approaches.]

( A ) Approach = grit-base ( B ) Approach = grit-snapshot
( C ) Approach = snapshot-only ( D ) Approach = entity-state

FIGURE 5.2: Comparison of commit time. Each sub-figure represents a different approach. [Four line plots of commit time (seconds) versus the number of changed entities for initial entity counts of 0, 5000, 10000, and 15000.]

( A ) Number of initial entities = 0 ( B ) Number of initial entities = ~4000
( C ) Number of initial entities = ~10000 ( D ) Number of initial entities = ~15000

FIGURE 5.3: Comparison of checkout time. Each sub-figure represents a different number of initial entities. [Four line plots of checkout time (seconds, 0-500) versus the number of changed entities for the four approaches.]

( A ) Approach: grit-base ( B ) Approach: grit-snapshot
( C ) Approach: snapshot-only ( D ) Approach: entity-state

FIGURE 5.4: Comparison of checkout time. Each sub-figure represents a different approach. [Four line plots of checkout time (seconds) versus the number of changed entities for initial entity counts of 0, 4000, 8000, and 12000.]

( A ) Number of initial entities = 0 ( B ) Number of initial entities = 4000
( C ) Number of initial entities = 10000 ( D ) Number of initial entities = 15000

FIGURE 5.5: Comparison of storage cost. Each sub-figure represents a different number of initial entities. [Four line plots of storage (MB, 0-100) versus the number of changed entities for the four approaches.]

( A ) Number of changed entities = ~1500 ( B ) Number of changed entities = ~3000
( C ) Number of changed entities = ~8000 ( D ) Number of changed entities = ~12000
( E ) Number of changed entities = ~16000 ( F ) Number of changed entities = ~20000

FIGURE 5.6: Comparison of storage cost for initial number of entities = 0. Each sub-figure represents a different number of changed entities. [Six bar charts of snapshot size, delta size, and database size (MB) for each approach.]

( A ) Number of changed entities = ~1500 ( B ) Number of changed entities = ~3000
( C ) Number of changed entities = ~8000 ( D ) Number of changed entities = ~12000
( E ) Number of changed entities = ~16000 ( F ) Number of changed entities = ~20000

FIGURE 5.7: Comparison of storage cost for initial number of entities = 4000. Each sub-figure represents a different number of changed entities. [Six bar charts of snapshot size, delta size, and database size (MB) for each approach.]

[Figure 5.8: Comparison of storage cost for initial number of entities = 8000 (the panel titles report ~10000 initial entities). Each sub-figure represents a different number of changed entities: (A) ~1500, (B) ~3000, (C) ~8000, (D) ~12000, (E) ~16000, (F) ~20000. Each panel plots snapshot size, database size, and delta size in MB for the approaches grit-base, grit-snapshot, snapshot-only, and entity-state.]

[Figure 5.9: Comparison of storage cost for initial number of entities = 10000 (the panel titles report ~15000 initial entities). Each sub-figure represents a different number of changed entities: (A) ~1500, (B) ~3000, (C) ~8000, (D) ~12000, (E) ~16000, (F) ~20000. Each panel plots snapshot size, database size, and delta size in MB for the approaches grit-base, grit-snapshot, snapshot-only, and entity-state.]

Chapter 6

Conclusions

This chapter summarizes the conclusions drawn from the results of the experiments. We also propose several recommendations for future work that could continue this thesis.

6.1 Conclusions
In this section, we revisit the research questions defined at the beginning of this thesis and answer them based on the findings of our experiments.

RQ1: What is the current state of the art regarding the techniques and tools for dataset ver-
sioning?

Based on the literature survey, it can be concluded that there are several approaches to data versioning. They can be grouped into two categories: approaches that link data versions to a workflow, and workflow-independent approaches in which versions of the data are created regardless of any workflow. The challenges of data versioning were laid out in Chapter 1; they include the varied structures and formats of data, change detection, delta representation, and the trade-off between storing deltas and storing full snapshots.

RQ2: Can we design a version control tool for graph databases with queries as the delta
objects?

The primary goal of this project was to develop a data versioning tool for graph databases. We presented Grit, a versioning tool for graph databases whose version control API is provided as a Git-style command line tool. The proposed framework is elaborated in Chapter 3.
We found that, by detecting changes in the graph database and generating the corresponding queries based on those changes, it is possible to assign and retrieve versions of the graph database. This was supported by the evaluation results. The cost of data versioning with Grit consists of the additional storage needed for the delta and snapshot objects, the time to generate the delta objects, and the time to execute the queries that retrieve a version. With this versioning tool, in addition to retrieving previous versions, it is also possible to compare the differences between two versions of a graph, to track the changes in each entity, and to create branches of the graph.
To evaluate the performance of the tool, we designed an evaluation setup that is explained in Section 5.1. The evaluation metrics include commit time, checkout time, and storage cost. We proposed two versions of the tool: a base model that stores only delta objects to manage the versions, and an improved model that stores snapshots alongside the deltas. We also compared the performance with two other tools: snapshot-based versioning using Git and entity-state-based versioning using neo4j-versioner-core (D’Este and Falcier, 2017).
The experiment results showed that the storage cost of saving the deltas in Grit is larger than the storage cost of the snapshot-based and entity-state-based versioning tools. For small to medium numbers of changes, Grit incurs a relatively small cost in storage and checkout time. However, the tool becomes more expensive as the number of changes grows. Checkout time can be improved by incorporating snapshots into the process.

RQ3: How can the changes in graph databases be detected and represented?

The framework for data versioning in this thesis consists of four main components, namely, change detection, delta representation, delta aggregation, and version materialization. Change detection relates to how value changes in the graph database are captured. It can be done by developing a plugin for the database that acts as a trigger to capture transaction data; a sketch of such a trigger is shown below.
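As an illustration only, the following sketch targets Neo4j’s transaction event API as available in the Neo4j 3.x series; the class name, the recorded operation labels, and the hand-off to storage are assumptions for this sketch, not Grit’s actual implementation.

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.event.PropertyEntry;
import org.neo4j.graphdb.event.TransactionData;
import org.neo4j.graphdb.event.TransactionEventHandler;

// Hypothetical change-capture trigger. Neo4j invokes beforeCommit for every
// transaction and hands over the created/deleted entities and property changes.
public class ChangeCaptureHandler implements TransactionEventHandler<Void> {

    @Override
    public Void beforeCommit(TransactionData data) {
        for (Node node : data.createdNodes()) {
            record("CREATE_NODE", node.getId(), null, null);
        }
        for (Node node : data.deletedNodes()) {
            record("DELETE_NODE", node.getId(), null, null);
        }
        for (PropertyEntry<Node> entry : data.assignedNodeProperties()) {
            record("SET_PROPERTY", entry.entity().getId(), entry.key(), entry.value());
        }
        return null; // no state needs to be passed to afterCommit
    }

    @Override
    public void afterCommit(TransactionData data, Void state) {
        // In a real tool: persist the captured changes as a serialized delta object.
    }

    @Override
    public void afterRollback(TransactionData data, Void state) {
        // Rolled-back transactions leave no trace in the version history.
    }

    // Placeholder sink for captured changes; a real implementation would
    // append them to the delta object of the pending commit.
    private void record(String op, long id, String key, Object value) {
        System.out.printf("%s id=%d key=%s value=%s%n", op, id, key, value);
    }

    // Registration on an embedded database instance.
    public static void register(GraphDatabaseService db) {
        db.registerTransactionEventHandler(new ChangeCaptureHandler());
    }
}
```

Registering the handler once at database start-up is enough for every subsequent transaction to be captured.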
The second and third components, delta representation and delta aggregation, relate to how the changes are represented and stored so that previous versions of the data can be retrieved. We use serialized objects to encode the changes. The serialized delta objects are then aggregated to group the changes by entity. From the aggregated delta objects, a set of queries can be generated. This set of queries can be used to commit the transactions or to roll back the transactions. Thus, it is possible to retrieve past versions of the database from the queries.
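To make this concrete, the fragment below sketches how one aggregated delta entry, covering a single property change, could be translated into a forward (commit) query and a rollback query. The field layout is an assumption for illustration rather than Grit’s exact serialization format, and the naive value quoting omits the escaping a real implementation would need.

```java
// Hypothetical aggregated delta entry for one node property change.
public class PropertyDelta {
    private final long nodeId;
    private final String key;
    private final Object oldValue; // null if the property did not exist before
    private final Object newValue;

    public PropertyDelta(long nodeId, String key, Object oldValue, Object newValue) {
        this.nodeId = nodeId;
        this.key = key;
        this.oldValue = oldValue;
        this.newValue = newValue;
    }

    // Cypher that re-applies the change, moving the graph forward one version.
    public String forwardQuery() {
        return "MATCH (n) WHERE id(n) = " + nodeId
             + " SET n." + key + " = " + literal(newValue);
    }

    // Cypher that undoes the change, rolling the graph back one version.
    public String rollbackQuery() {
        if (oldValue == null) {
            return "MATCH (n) WHERE id(n) = " + nodeId + " REMOVE n." + key;
        }
        return "MATCH (n) WHERE id(n) = " + nodeId
             + " SET n." + key + " = " + literal(oldValue);
    }

    // Naive quoting for illustration; real code must escape string values.
    private static String literal(Object v) {
        return (v instanceof String) ? "'" + v + "'" : String.valueOf(v);
    }
}
```

Executing the forward queries of a delta object materializes the child version, while executing the rollback queries in reverse order restores the parent version.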

RQ4: Can we improve the performance of the version control tool by leveraging the use of snapshots?

The evaluation results showed that checking out a version that involves a large number of changes takes a considerable amount of time. The checkout time corresponds to the time needed to execute the queries from the delta objects against the database.
To reduce the checkout time, we propose an algorithm that predicts the checkout time of a version based on information from the associated delta object. Since the checkout time is linear in the number of changes, we use a linear regression model to predict it. When the predicted checkout time exceeds a certain limit, a snapshot is generated in addition to the delta object.
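A minimal sketch of this decision rule follows, assuming a model of the form t = a·c + b fitted by ordinary least squares on past (number of changes, checkout time) observations; the coefficient values and the time limit are placeholders, not the thresholds used in the experiments.

```java
// Sketch of the snapshot decision: fit checkout time as a linear function of
// the number of changes, and take a snapshot when the prediction exceeds a
// threshold. Coefficients and threshold are illustrative only.
public class SnapshotPolicy {
    private final double slope;      // seconds per changed entity (fitted)
    private final double intercept;  // fixed overhead in seconds (fitted)
    private final double limitSec;   // maximum acceptable checkout time

    public SnapshotPolicy(double slope, double intercept, double limitSec) {
        this.slope = slope;
        this.intercept = intercept;
        this.limitSec = limitSec;
    }

    // Ordinary least squares fit over past (numChanges, checkoutSeconds) pairs.
    public static SnapshotPolicy fit(double[] changes, double[] seconds, double limitSec) {
        int n = changes.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += changes[i];
            sy += seconds[i];
            sxx += changes[i] * changes[i];
            sxy += changes[i] * seconds[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return new SnapshotPolicy(slope, intercept, limitSec);
    }

    public double predictedCheckoutSeconds(int numChanges) {
        return slope * numChanges + intercept;
    }

    // True if a snapshot should be stored alongside the delta for this commit.
    public boolean shouldSnapshot(int numChanges) {
        return predictedCheckoutSeconds(numChanges) > limitSec;
    }
}
```

At commit time the tool would call shouldSnapshot with the size of the new delta and, when it returns true, store a full snapshot next to the delta object.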
Based on our experimental findings, we can answer that, yes, we can improve the version control tool, in the form of reduced checkout time, by leveraging snapshots. Instead of executing the queries from the delta object to recreate a version, we can materialize the version from a full snapshot of the database, thereby eliminating the query execution time. However, saving snapshots introduces additional storage cost.
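To illustrate the resulting checkout path, the sketch below chooses between restoring a snapshot and replaying delta queries; the SnapshotStore, DeltaStore, and QueryRunner interfaces are hypothetical stand-ins for the tool’s internals, not Grit’s actual API.

```java
import java.util.List;

// Sketch of version materialization: restore from a stored snapshot when the
// target version has one; otherwise replay the delta queries leading to it.
public class Checkout {
    private final SnapshotStore snapshots;
    private final DeltaStore deltas;
    private final QueryRunner db;

    public Checkout(SnapshotStore snapshots, DeltaStore deltas, QueryRunner db) {
        this.snapshots = snapshots;
        this.deltas = deltas;
        this.db = db;
    }

    public void checkout(String versionId) {
        if (snapshots.has(versionId)) {
            // Fast path: restore the full database snapshot directly,
            // skipping query execution entirely.
            snapshots.restore(versionId);
            return;
        }
        // Slow path: replay every query in the delta chain of the version.
        for (String query : deltas.queriesUpTo(versionId)) {
            db.run(query);
        }
    }

    // Hypothetical interfaces standing in for the tool's storage and database layers.
    interface SnapshotStore { boolean has(String versionId); void restore(String versionId); }
    interface DeltaStore { List<String> queriesUpTo(String versionId); }
    interface QueryRunner { void run(String cypher); }
}
```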

6.2 Future Works


This thesis was subject to some limitations; the available time, in particular, bounded what the project could achieve. For future work, there are several directions in which this research could be continued, as described below.

1. The version materialization process of the tool presented in this thesis uses queries to reconstruct a version. The base approach can handle up to several thousand changes in a version without taking a considerable amount of storage or checkout time. For larger change sets, however, it needs to leverage snapshots because checkout takes longer. A possible improvement would be to design an alternative delta encoding and checkout process to speed up the checkout time.

2. Changes in the database are detected by developing a plugin that utilizes the database API. While this works on Neo4j, not every database or general graph data format has such a feature. Change detection could be improved so that it works more generally, for example, in other database systems.

3. The granularity of the delta encoding in this project is at the graph property level; that is, every property change to the database is stored. A further improvement would be to refine the granularity of the delta encoding to the character level.

Bibliography

Bhardwaj, Anant et al. (2014). “DataHub: Collaborative Data Science & Dataset Version Management at Scale”. In: arXiv: 1409.0798. URL: http://arxiv.org/abs/1409.0798.
Bhardwaj, Anant et al. (2015). “Collaborative Data Analytics with DataHub”. In: Proc. VLDB Endow. 8.12, pp. 1916–1919. ISSN: 2150-8097. DOI: 10.14778/2824032.2824100. URL: http://dx.doi.org/10.14778/2824032.2824100.
Bhattacherjee, Souvik et al. (2015). “Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff”. In: Proc. VLDB Endow. 8.12, pp. 1346–1357. ISSN: 2150-8097. DOI: 10.14778/2824032.2824035. URL: http://dx.doi.org/10.14778/2824032.2824035.
Blischak, John D., Emily R. Davenport, and Greg Wilson (2016). “A Quick Introduction to Version Control with Git and GitHub”. In: PLoS Computational Biology 12.1, pp. 1–18. ISSN: 15537358. DOI: 10.1371/journal.pcbi.1004668.
Brahmia, Safa et al. (2016). “τJSchema: A Framework for Managing Temporal JSON-Based NoSQL Databases”. In: Database and Expert Systems Applications. Ed. by Sven Hartmann and Hui Ma. Vol. 9828. Cham: Springer International Publishing, pp. 167–181. ISBN: 978-3-319-44406-2. DOI: 10.1007/978-3-319-44406-2. URL: http://link.springer.com/10.1007/978-3-319-44406-2.
Buneman, Peter et al. (2004). “Archiving scientific data”. In: ACM Transactions on Database Systems 29.1, pp. 2–42. ISSN: 03625915. DOI: 10.1145/974750.974752. URL: http://portal.acm.org/citation.cfm?doid=974750.974752.
Castelltort, Arnaud and Anne Laurent (2013). “Representing history in graph-oriented NoSQL databases: A versioning system”. In: 8th International Conference on Digital Information Management, ICDIM 2013, pp. 228–234. DOI: 10.1109/ICDIM.2013.6694022.
Chapman, Pete et al. (2000). CRISP-DM 1.0: Step-by-step data mining guide. Tech. rep. The CRISP-DM consortium. URL: http://www.crisp-dm.org/CRISPWP-0800.pdf.
Chavan, Amit et al. (2015). “Towards a Unified Query Language for Provenance and Versioning”. In: Proceedings of the 7th USENIX Conference on Theory and Practice of Provenance, p. 5. arXiv: 1506.04815v1. URL: http://dl.acm.org/citation.cfm?id=2814579.2814584.
Cheney, James, Laura Chiticariu, and Wang-Chiew Tan (2007). “Provenance in Databases: Why, How, and Where”. In: Foundations and Trends in Databases 1.4, pp. 379–474. ISSN: 1931-7883. DOI: 10.1561/1900000006. URL: http://www.nowpublishers.com/article/Details/DBS-006.
Currim, Faiz et al. (2004). “A Tale of Two Schemas: Creating a Temporal XML Schema from a Snapshot Schema with τXSchema”. In: Advances in Database Technology - EDBT 2004. Ed. by Elisa Bertino et al. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 348–365. ISBN: 978-3-540-24741-8.
Davidson, Susan B. and Juliana Freire (2008). “Provenance and scientific workflows”. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data - SIGMOD ’08, p. 1345. ISSN: 07308078. DOI: 10.1145/1376616.1376772. URL: http://portal.acm.org/citation.cfm?doid=1376616.1376772.
D’Este, Alberto and Marco Falcier (2017). Neo4j Versioner Core Documentation. https://h-omer.github.io/neo4j-versioner-core/.
Huang, Silu et al. (2017). “OrpheusDB: Bolt-on Versioning for Relational Databases”. In: CoRR abs/1703.02475. arXiv: 1703.02475. URL: http://arxiv.org/abs/1703.02475.
Hull, Duncan et al. (2006). “Taverna: A tool for building and running workflows of services”. In: Nucleic Acids Research 34.Web Server issue, pp. 729–732. ISSN: 03051048. DOI: 10.1093/nar/gkl320.
Karvounarakis, Grigoris, Zachary G. Ives, and Val Tannen (2010). “Querying data provenance”. In: Proceedings of the 2010 international conference on Management of data - SIGMOD ’10, p. 951. ISSN: 07308078. DOI: 10.1145/1807167.1807269. URL: http://portal.acm.org/citation.cfm?doid=1807167.1807269.
Leskovec, Jure and Andrej Krevl (2014). SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.
Loeliger, Jon and Matthew McCullough (2012). Version Control with Git: Powerful Tools and Techniques for Collaborative Software Development. O’Reilly Media, Inc. ISBN: 1449316387, 9781449316389.
Maddox, Michael et al. (2016). “Decibel: The Relational Dataset Branching System”. In: Proc. VLDB Endow. 9.9, pp. 624–635. ISSN: 2150-8097. DOI: 10.14778/2947618.2947619. URL: http://dx.doi.org/10.14778/2947618.2947619.
Müller, Heiko, Peter Buneman, and Ioannis Koltsidas (2008). “XArch: Archiving Scientific and Reference Data”. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1295–1298. ISBN: 9781605581026. DOI: 10.1145/1376616.1376758.
Oinn, Tom et al. (2004). “Taverna: A tool for the composition and enactment of bioinformatics workflows”. In: Bioinformatics 20.17, pp. 3045–3054. ISSN: 13674803. DOI: 10.1093/bioinformatics/bth361.
Ozsoyoglu, Gultekin and Richard T. Snodgrass (1995). “Temporal and real-time databases: A survey”. In: IEEE Transactions on Knowledge and Data Engineering 7.4, pp. 513–532.
Robinson, Ian (2014). Time-Based Versioned Graphs. http://iansrobinson.com/2014/05/13/time-based-versioned-graphs/.
Robinson, Ian, Jim Webber, and Emil Eifrem (2013). Graph Databases. Sebastopol: O’Reilly Media, Inc. ISBN: 9781449356262.
Sadalage, Pramod J. and Martin Fowler (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. 1st. Addison-Wesley Professional, p. 192. ISBN: 0321826620, 9780321826626.
Sculley, D. et al. (2015). “Hidden Technical Debt in Machine Learning Systems”. In: Advances in Neural Information Processing Systems 28. Ed. by C. Cortes et al. Curran Associates, Inc., pp. 2503–2511. URL: http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf.
Seering, Adam et al. (2012). “Efficient versioning for scientific array databases”. In: Proceedings - International Conference on Data Engineering, pp. 1013–1024. ISSN: 10844627. DOI: 10.1109/ICDE.2012.102.
Simmhan, Yogesh L., Beth Plale, and Dennis Gannon (2005). “A survey of data provenance in e-science”. In: ACM SIGMOD Record 34.3, pp. 31–36. ISSN: 01635808. DOI: 10.1145/1084805.1084812. URL: http://portal.acm.org/citation.cfm?id=1084812.
Tansel, Abdullah Uz et al., eds. (1993). Temporal Databases: Theory, Design, and Implementation. Redwood City, CA, USA: Benjamin-Cummings Publishing Co., Inc. ISBN: 0-8053-2413-5.
Vijitbenjaronk, Warut D. et al. (2018). “Scalable time-versioning support for property graph databases”. In: Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017, 2018-January, pp. 1580–1589. DOI: 10.1109/BigData.2017.8258092.
