(Atzeni et al., 2020) Data Modeling in the NoSQL World
Abstract
NoSQL systems have gained popularity for many reasons, including the
flexibility they provide in organizing data, as they relax the rigidity of the
relational model and of other structured models. This flexibility, and the
heterogeneity that has emerged in the area, have led to little use of
traditional modeling techniques, in contrast to what has happened with
databases for decades.
In this paper, we argue that traditional notions related to data modeling
can be useful in this context as well. Specifically, we propose NoAM (NoSQL
Abstract Model), a novel abstract data model for NoSQL databases, which
exploits the commonalities of various NoSQL systems. We also propose a database
design methodology for NoSQL systems based on NoAM, with initial activities
that are independent of the specific target system. NoAM is used to specify a
system-independent representation of the application data; this intermediate
representation can then be implemented in target NoSQL databases, taking
into account their specific features. Overall, the methodology aims at
supporting scalability, performance, and consistency, as needed by
next-generation web applications.
Keywords: Data models, database design, NoSQL systems
✩ This paper extends a short article that appeared in the Proceedings of the
33rd International Conference on Conceptual Modeling (ER 2014) with the title
Database Design for NoSQL Systems [1].
can reason at a high level, before delving into the details of the specific systems.
Instead, given the variety of systems, the design process for NoSQL applications
is currently based mainly on best practices and guidelines [5], which are
specific to the selected system [6, 7, 8], with no systematic methodology.
Several authors have observed that high-level methodologies and tools
supporting NoSQL database design are needed [9, 10, 11]; models are needed
here as well, in order to achieve some level of generality.
Let us recall the various reasons for which modeling is considered important
in database design and development [12]. First of all, besides being crucial
in the conceptual and logical design phases, it offers support throughout the
lifecycle, from requirements analysis, where it helps give structure to the
process, to coding and maintenance, where it provides valuable documentation.
The main point is that modeling allows the specialist to describe the domain
of interest and the application from various perspectives and at various
levels of abstraction. Moreover, it supports communication (and individual
comprehension). Finally, it supports performance management, as physical
database design is also based on data structures, and query processing
efficiency often relies on the regularity of data.
Conceptual and logical modeling, as they are currently known, were developed
in the database world, with specific attention to relational systems, but
have also found applications in other contexts. Indeed, while the importance of
relational databases has been clear since the Eighties, it was soon understood
that there were many “non-business” application domains for which other
modeling features were needed: the advocates of object-oriented databases
observed, more or less at the same time, that some requirements were not
satisfied, such as those in CAD, CASE, and multimedia and text management [13].
This led to the development of models with nested structures, more complex
than the relational one, and less regular, and so more difficult to manage.
Flexibility in structures was also required in another area, which emerged a
decade later and has since been very important: the area of Web applications,
where there were at least two kinds of developments concerned with models: on
the one hand, work on complex object models for representing hypertexts [14,
15, 16], and on the other hand, significant developments in semistructured
data, especially with reference to XML [17].
Another recurring claim in the database world in the last ten or fifteen
years has been that, while relational databases are a de facto standard,
no single solution works well for all kinds of applications. As Stonebraker
and Çetintemel [18] argued, it is not the case that “one size fits all,” and
different engines and technologies are needed in different contexts. For
example, OLAP and OLTP have different requirements, and the same holds for
other kinds of applications, such as stream processing, sensor networks, or
scientific databases.
The NoSQL movement emerged from a number of motivations, including most
of the above, with the goal of supporting highly scalable systems with specific
requirements, usually involving very simple operations over many nodes, on
sets of data that have a flexible structure. Given that there are many
different applications and that the specific requirements vary, many systems
have emerged, each offering a different way of organizing data and a different
programming interface. Heterogeneity can become a problem if migration or
integration is needed, as is often the case in a world with changing
requirements and new technological developments. Also, the availability of
many different systems, with different implementations, has led to different
design techniques, usually related just to individual systems or small
families thereof.
In this paper we argue that a model-based approach can be useful to tackle
the difficulties related to heterogeneity, and can provide support in the form
of abstraction. In fact, modeling can be the basis of a design process at
various levels: at a higher level, to represent the features of interest for
the application, and at a lower level, to describe some implementation features
in a concrete but system-independent way.
Indeed, we will present a high-level data model for NoSQL databases, called
NoAM (NoSQL Abstract Model), and show how it can be used as an intermediate
data representation in the context of a general design methodology for
NoSQL applications whose initial steps are independent of the individual
target system. We propose a design process that includes a conceptual phase,
as is common in traditional applications, followed (and this is unconventional
and original) by a system-independent logical design phase, where the
intermediate representation is used as the basis for both modeling and
performance aspects, with only a final phase that takes into account the
specific features of individual systems.
The rest of the paper is organized as follows. In Section 2, we illustrate
the features of the main categories of NoSQL systems, arguing that, for each
of them, there exists a sort of data model. In Section 3 we present NoAM,
our system-independent data model for NoSQL databases, and in Section 4 we
discuss our design methodology for NoSQL databases. In Section 5 we briefly
review some related literature. Finally, in Section 6 we draw some conclusions.
[Figure: object diagram; labels recovered: games[0], games[1], games[2],
firstPlayer, secondPlayer, rounds[0], rounds[1], moves[0], moves[1], : Round]
Player:mary : ⟨
  username : "mary",
  firstName : "Mary",
  lastName : "Wilson",
  games : {
    ⟨ game : Game:2345, opponent : Player:rick ⟩,
    ⟨ game : Game:2611, opponent : Player:ann ⟩
  }
⟩
Player:rick : ⟨
  username : "rick",
  firstName : "Ricky",
  lastName : "Doe",
  score : 42,
  games : {
    ⟨ game : Game:2345, opponent : Player:mary ⟩,
    ⟨ game : Game:7425, opponent : Player:ann ⟩,
    ⟨ game : Game:1241, opponent : Player:johnny ⟩
  }
⟩
Game:2345 : ⟨
  id : "2345",
  firstPlayer : Player:mary,
  secondPlayer : Player:rick,
  rounds : {
    ⟨ moves : ..., comments : ... ⟩,
    ⟨ moves : ..., actions : ..., spell : ... ⟩
  }
⟩
tems [2, 3]: key-value stores, document stores, extensible record stores, plus
others (e.g., graph databases) that are beyond the scope of this paper.
• Represent an aggregate using a single key-value pair. The key (major key)
is the aggregate identifier. The value is the complex value of the aggregate.
See Figure 4(a).
The data access operations provided by key-value stores usually enable
efficient and atomic access to aggregates with respect to both data
representations. Indeed, all systems support access to individual key-value
pairs (useful in the former case) and most of them (such as Oracle NoSQL) also
provide access to groups of related key-value pairs (required in the latter case).
key (/major/key/-)    value
/Player/mary/-        { username: "mary", firstName: "Mary", ... }
/Player/rick/-        { username: "rick", firstName: "Ricky", ... }
/Game/2345/-          { id: "2345", firstPlayer: "Player:mary", ... }
(a) Single key-value pair per aggregate
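As a minimal sketch of representation (a), the following Python fragment builds one key-value pair per aggregate. The helper names `major_key` and `to_pair`, the use of JSON as the serialization format, and the in-memory dictionary standing in for a key-value store are all assumptions made for illustration; they are not part of any real key-value store API.

```python
import json

def major_key(aggregate_class: str, aggregate_id: str) -> str:
    """Build a key of the form /major/key/-, as in Figure 4(a)."""
    return f"/{aggregate_class}/{aggregate_id}/-"

def to_pair(aggregate_class: str, aggregate_id: str, value: dict) -> tuple:
    """Represent a whole aggregate as a single (key, serialized value) pair."""
    return major_key(aggregate_class, aggregate_id), json.dumps(value)

# Hypothetical in-memory stand-in for a key-value store.
store = {}
key, serialized = to_pair("Player", "mary", {"username": "mary", "firstName": "Mary"})
store[key] = serialized
```

With this layout, reading an aggregate is a single lookup by its major key, followed by deserialization of the value.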
{
  "username" : "mary",
  "firstName" : "Mary",
  "lastName" : "Wilson",
  "games" : [
    { "id" : "Game:2345", "opponent" : "Player:rick" },
    { "id" : "Game:2611", "opponent" : "Player:ann" }
  ]
}
Figure 5: The JSON representation of the complex value of a sample Player object
Data access operations are usually over individual rows, which are units of data
distribution and atomic data manipulation.
A representative extensible record store is Amazon DynamoDB [26], a NoSQL
database service provided on the cloud by Amazon Web Services (AWS).
In DynamoDB a database is organized in tables. A table is a set of items. Each
item contains one or more attributes, each with a name and a value (or a set
of values). Each table designates an attribute as primary key. Items in the
same table are not required to have the same set of attributes, apart from the
primary key, which is the only mandatory attribute of a table. Thus, DynamoDB
databases are mostly schemaless.
Specifically, the primary key is composed of a partition key and an optional
sort key. If the primary key of a table includes a sort key, then DynamoDB
stores together all the items having the same partition key, so that they can
be accessed efficiently.
Distribution operates at the item level and, for each table, is controlled
by the partition key only.
Some operations offered by DynamoDB are as follows: putItem(table, key, av)
adds (or modifies) an item in table table with primary key key, using the
set of attribute-value pairs av; and getItem(table, key) retrieves the item of
table table having primary key key. It is also possible to access or update
just a subset of the attributes of an item. All these operations can be
executed efficiently.
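The semantics of these two operations can be sketched with an in-memory model. The code below is not the AWS SDK: the functions `put_item` and `get_item` merely mirror the operation names used in the text, the nested dictionary `tables` is a hypothetical stand-in for the service, and treating a put on an existing key as a full replacement of the item is an assumption of this sketch.

```python
# Hypothetical in-memory stand-in: table name -> primary key -> item.
tables = {}

def put_item(table: str, key: str, av: dict) -> None:
    """Add the item with primary key `key`, or replace it if it already exists."""
    tables.setdefault(table, {})[key] = dict(av)

def get_item(table: str, key: str, attributes=None):
    """Return the item, or only the requested attributes; None if the key is absent."""
    item = tables.get(table, {}).get(key)
    if item is None:
        return None
    if attributes is None:
        return dict(item)
    return {name: item[name] for name in attributes if name in item}

# Items in the same table need not share the same attributes (no score for mary).
put_item("Player", "mary", {"username": "mary", "firstName": "Mary", "lastName": "Wilson"})
put_item("Player", "rick", {"username": "rick", "firstName": "Ricky", "lastName": "Doe", "score": 42})
```

Passing a list of attribute names to `get_item` illustrates access to just a subset of the attributes of an item.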
In an extensible record store (such as DynamoDB), each aggregate can be
represented by a record/row/item. The table corresponds to the aggregate class
(or type). The primary key (partition key) is the aggregate identifier. Then,
the item can have a distinct attribute-value pair for each top-level attribute of
the complex value of the aggregate (or for each major part of the aggregate that
needs to be accessed separately). See Figure 7.
Again, the data access operations provided by the systems in this category
support efficient access to aggregates or to specific portions of them.
table Player
username  firstName  lastName  score  games[0]                       games[1]  games[2]
"mary"    "Mary"     "Wilson"         { game: ..., opponent: ... }   { ... }
"rick"    "Ricky"    "Doe"     42     { game: ..., opponent: ... }   { ... }   { ... }
table Game
id    firstPlayer  secondPlayer  rounds[0]                       rounds[1]  rounds[2]
2345  Player:mary  Player:rick   { moves: ..., comments: ... }   { ... }
2.6. Comparison
NoSQL categories, this element is: (i) a group of related key-value pairs, in
key-value stores; (ii) a document, in document stores; or (iii) a
record/row/item, in extensible record stores.
In NoAM, a data access and distribution unit is modeled by a block.
Specifically, a block represents a maximal data unit for which atomic,
efficient, and scalable access operations are provided. Indeed, while access
to an individual block can be performed efficiently in the various systems,
access to multiple blocks can be quite inefficient. In particular, NoSQL
systems do not usually provide an efficient “join” operation. Moreover, most
NoSQL systems provide atomic operations only over single blocks and do not
support the atomic manipulation of a group of blocks. For example, MongoDB [25]
provides atomic operations only over individual documents, whereas Bigtable
does not support transactions across rows [22].
A second common feature of NoSQL systems is the ability to access and
manipulate just a component of a data access unit (i.e., of a block). This
component is: (i) an individual key-value pair, in key-value stores; (ii) a field,
in document stores; or (iii) a column, in extensible record stores. In NoAM,
such a smaller data access unit is called an entry.
Finally, most NoSQL databases provide a notion of collection of data access
units. Examples are a table in extensible record stores and a document
collection in document stores. In NoAM, a collection of data access units is
called a collection.
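The three notions introduced above (block, entry, collection) can be sketched as plain Python data structures. This is only an illustration under stated assumptions: the class names `Block` and `Collection`, the `put` method, and the choice of identifying each block by a block key within its collection are hypothetical and are not prescribed by the text. Entry keys are modeled as dictionary keys, which makes them unique within a block by construction.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class Block:
    """A unit of atomic access: a non-empty set of entries with unique entry keys."""
    key: str
    entries: Dict[str, Any]  # entry key -> entry value

    def __post_init__(self) -> None:
        if not self.entries:
            raise ValueError("a block must contain at least one entry")

@dataclass
class Collection:
    """A named set of blocks, each identified here by its block key (assumption)."""
    name: str
    blocks: Dict[str, Block] = field(default_factory=dict)

    def put(self, block: Block) -> None:
        self.blocks[block.key] = block

players = Collection("Player")
players.put(Block("rick", {"username": "rick", "firstName": "Ricky",
                           "lastName": "Doe", "score": 42}))
```

A whole database would then simply be a set of such collections.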
According to the above observations, the NoAM data model is defined as
follows.
• A block is a non-empty set of entries. Each entry is a pair ⟨ek, ev⟩, where
ek is the entry key (which is unique within its block) and ev is its value
Player
          username     "mary"
          firstName    "Mary"
  rick    username     "rick"
          firstName    "Ricky"
          lastName     "Doe"
          score        42
Game
          id           2345
          firstPlayer  Player:mary
Player
  mary   ε   ⟨ username : "mary",
               firstName : "Mary",
               lastName : "Wilson",
               games : {
                 ⟨ game : Game:2345, opponent : Player:rick ⟩,
                 ⟨ game : Game:2611, opponent : Player:ann ⟩
               } ⟩
  rick   ε   ⟨ username : "rick",
               firstName : "Ricky",
               lastName : "Doe",
               score : 42,
               games : {
                 ⟨ game : Game:2345, opponent : Player:mary ⟩,
                 ⟨ game : Game:7425, opponent : Player:ann ⟩,
                 ⟨ game : Game:1241, opponent : Player:johnny ⟩
               } ⟩
Game
  2345   ε   ⟨ id : "2345",
               firstPlayer : Player:mary,
               secondPlayer : Player:rick,
               rounds : {
                 ⟨ moves : ..., comments : ... ⟩,
                 ⟨ moves : ..., actions : ..., spell : ... ⟩
               } ⟩
The motivations for considering database design for NoSQL systems are as
follows. Despite the fact that NoSQL databases are claimed to be “schemaless,”
the data of interest for applications do show some structure, which should be
mapped to the modeling elements (collections, tables, documents, key-value
pairs) available in the target system. Moreover, different alternatives in the
organization of data in a NoSQL database are usually possible, but they are
not equivalent in supporting qualities such as performance, scalability, and
consistency (which are typically required when a NoSQL database is adopted).
For example, a “wrong” database representation can lead to performance that is
worse by an order of magnitude, as well as to the inability to guarantee
atomicity of important operations.
Specifically, our design methodology has the goal of designing a “good”
representation of the application data in a target NoSQL database, and is
intended to support major qualities such as performance, scalability, and
consistency, as needed by next-generation Web applications.
The NoAM approach is based on the following main activities:
• conceptual data modeling and aggregate design, to identify the various
entities and relationships thereof needed in an application, and to group
related entities into aggregates;
and attributes. (This activity is discussed in most database textbooks, e.g.,
[12].) Following Domain-Driven Design (DDD [19]), which is a widely followed
object-oriented methodology, we assume that the outcome of this activity is a
conceptual UML class diagram defining the entities, value objects, and
relationships of the application. An entity is a persistent object that has
independent existence and is distinguished by a unique identifier (e.g., a
player or a game, in our running example). A value object is a persistent
object that is mainly characterized by its value, without an identifier of its
own (e.g., a round or a move). Then, the methodology proceeds by identifying
aggregates.
The design of aggregates has the goal of identifying the classes of aggregates
for an application, and various approaches are possible. After the preliminary
conceptual design phase, entities and value objects are grouped into aggregates.
Each aggregate has an entity as its root, and it can also contain many value
objects. Intuitively, an entity and a group of value objects are used to define an
aggregate having a complex structure and value.
The relevant decisions in aggregate design involve the choice of aggregates
and of their boundaries. This activity can be driven by the data access
patterns of the application operations, as well as by scalability and
consistency needs [19]. Specifically, aggregates should be designed as the
units on which atomicity must be guaranteed [20] (with eventual consistency
for update operations spanning multiple aggregates [27]). In general, it is
indeed the case that most real applications require only operations that
access individual aggregates [2, 22]. Each aggregate should be large enough to
include all the data required by a relevant data access operation. (Please
note that NoSQL systems do not provide a “join” operation, and this is a main
motivation for clustering each group of related application objects into an
aggregate.) Furthermore, to support strong consistency (that is, atomicity) of
update operations, each aggregate should include all the data involved in some
integrity constraint or other form of business rule [28]. On the other hand,
aggregates should be as small as possible; small aggregates reduce concurrency
collisions and support performance and scalability requirements [28].
Thus, aggregate design is mainly driven by data access operations. In our
running example, the online game application needs to manage various
collections of objects, including players, games, and rounds. Figure 2 shows a
few representative application objects. (There, boxes and arrows denote
objects and links between them, respectively. An object having a colored top
compartment is an entity; otherwise it is a value object.) When a player
connects to the application, all data on the player should be retrieved,
including an overview of the games she is currently playing. Then, the player
can choose to continue a game, and data on the selected game should be
retrieved. When a player completes a round in a game she is playing, the game
should be updated. These operations suggest that the candidate aggregate
classes are players and games. Figure 2 also shows how application objects can
be grouped in aggregates. (There, a closed curve denotes the boundary of an
aggregate.)
As we mentioned above, aggregate design is also driven by consistency needs.
Assume that the application should enforce a rule specifying that a round can
be added to a game only if some condition that involves the other rounds of the
game is satisfied. An individual round cannot check, alone, the above condition;
therefore, it cannot be an aggregate by itself. On the other hand, the above
business rule can be supported by a game (comprising, as an aggregate, its
rounds).
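The following sketch illustrates why such a rule has to be enforced at the level of the game aggregate. The concrete rule used here (a game may contain at most `MAX_ROUNDS` rounds) is purely hypothetical, since the text leaves the condition unspecified; the class and method names are likewise assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List

MAX_ROUNDS = 3  # hypothetical business rule, not taken from the paper

@dataclass
class Game:
    """Aggregate root: it owns its rounds, so it can check rules that span them."""
    id: str
    rounds: List[dict] = field(default_factory=list)

    def add_round(self, new_round: dict) -> bool:
        # The check needs all the existing rounds, which only the aggregate holds.
        if len(self.rounds) >= MAX_ROUNDS:
            return False
        self.rounds.append(new_round)
        return True

game = Game("2345")
results = [game.add_round({"moves": []}) for _ in range(4)]
```

A round stored and updated on its own would have no access to its sibling rounds, and so could not evaluate the condition.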
In conclusion, the aggregate classes for our sample application are Player
and Game, as shown in Figures 2 and 3.
represented by the NoAM database shown in Figure 8. The representation of
aggregates as blocks is motivated by the fact that both concepts represent a
unit of data access and distribution, but at different abstraction levels. Indeed,
NoSQL systems provide efficient, scalable, and consistent (i.e., atomic)
operations on blocks and, in turn, this choice propagates such qualities to
operations on aggregates.
In general, an application dataset of aggregates can be represented in a NoAM
database in several different ways. Each data representation for a dataset δ is a
NoAM database Dδ representing δ. Specifically, the various data representations
for a dataset differ only in the choice of the entries used to represent the complex
value of each aggregate. We first discuss basic data representation strategies,
which we illustrate with respect to the example described in Figure 3. We then
introduce additional and more flexible data representations.
A simple data representation strategy, called Entry per Aggregate Object
(EAO), represents each individual aggregate using a single entry. The entry
key is empty. The entry value is the whole complex value of the aggregate. The
data representation of the aggregates of Figure 3 according to the EAO strategy
is shown in Figure 9.
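A minimal sketch of the EAO strategy follows. It assumes that a block is modeled as a Python dictionary from entry keys to entry values and that the empty entry key is modeled as the empty string; the function name `eao` and the use of a list for the games of a player are also assumptions of the example.

```python
def eao(aggregate: dict) -> dict:
    """Entry per Aggregate Object: a single entry, with empty key and whole value."""
    return {"": aggregate}

v_mary = {
    "username": "mary",
    "firstName": "Mary",
    "lastName": "Wilson",
    "games": [
        {"game": "Game:2345", "opponent": "Player:rick"},
        {"game": "Game:2611", "opponent": "Player:ann"},
    ],
}
block_mary = eao(v_mary)  # block for the Player aggregate with username mary
```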
Another data representation strategy, called Entry per Top-level Field (ETF),
represents each aggregate by means of multiple entries, using a distinct entry
for each top-level field of the complex value of the aggregate. For each top-level
field f of an aggregate o, it employs an entry having as value the value of field f
in the complex value of o (with values that can be complex themselves), and as
key the field name f. Figure 10 shows the data representation of the aggregates
of Figure 3 according to the ETF strategy.
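The ETF strategy can be sketched in the same style, again assuming that a block is modeled as a dictionary from entry keys to entry values. The function name `etf` is hypothetical, and the sample value is the one for the Player aggregate with username mary.

```python
def etf(aggregate: dict) -> dict:
    """Entry per Top-level Field: one entry per top-level field, keyed by field name."""
    return {name: value for name, value in aggregate.items()}

v_mary = {
    "username": "mary",
    "firstName": "Mary",
    "lastName": "Wilson",
    "games": [
        {"game": "Game:2345", "opponent": "Player:rick"},
        {"game": "Game:2611", "opponent": "Player:ann"},
    ],
}
block_mary = etf(v_mary)
entry_keys = sorted(block_mary)
```

Note that the value of the games entry is still complex: ETF splits only the top level of the aggregate.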
As a comparison, we can observe that the EAO data representation uses a
block with a single entry to represent the Player object having username mary,
while the ETF representation needs a block with four entries, corresponding to
fields username, firstName, lastName, and games. Moreover, blocks in EAO
do not depend on the structure of aggregates, while blocks in ETF depend on
the top-level structure of aggregates (which can be “almost fixed” within each
class).
Player
          username     "mary"
          firstName    "Mary"
  rick    username     "rick"
          firstName    "Ricky"
          lastName     "Doe"
          score        42
Game
          id           2345
          firstPlayer  Player:mary
The general data representation strategies we just described can be suitable in
some cases, but they are often too rigid and limiting. For example, none of the
above strategies leads to the data representation shown in Figure 8. The main
limitation of such general data representations is that they refer only to the
structure of aggregates, and do not take into account the data access patterns
of the application operations. Therefore, these strategies are not usually able to
support the performance of these operations. This motivates the introduction
of aggregate partitioning.
We first need to introduce a preliminary notion of access path, to specify a
“location” in the structure of a complex value. Intuitively, if v is a complex value
and w is a value (possibly complex as well) occurring in v, then the access path
ap for w in v represents the sequence of “steps” that should be taken to reach
the component value w in v. More precisely, an access path ap is a (possibly
empty) sequence of access steps, ap = p1 p2 ... pn, where each step pi identifies
a component value in a structured value. Furthermore, if v is a complex value
and ap is an access path, then ap(v) denotes the component value identified by
ap in v.
For example, consider the complex value v_mary of the Player aggregate
having username mary shown in Figure 3. Examples of access paths for this
complex value are firstName and games[0].opponent. If we apply these access
paths to v_mary, we access values Mary and Player:rick, respectively.
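The application of an access path to a complex value can be sketched as follows. The textual syntax of paths (field names separated by dots, list positions in square brackets) follows the examples above, while the function names `parse` and `apply_path` and the modeling of complex values as nested Python dictionaries and lists are assumptions of this sketch.

```python
import re

def parse(ap: str) -> list:
    """Split an access path such as 'games[0].opponent' into its access steps."""
    steps = []
    for token in re.findall(r"[^.\[\]]+|\[\d+\]", ap):
        # '[0]' becomes the integer 0; anything else is a field name.
        steps.append(int(token[1:-1]) if token.startswith("[") else token)
    return steps

def apply_path(ap: str, v):
    """Return ap(v), the component of v identified by ap; the empty path yields v."""
    for step in parse(ap):
        v = v[step]
    return v

v_mary = {
    "username": "mary",
    "firstName": "Mary",
    "lastName": "Wilson",
    "games": [
        {"game": "Game:2345", "opponent": "Player:rick"},
        {"game": "Game:2611", "opponent": "Player:ann"},
    ],
}
first_name = apply_path("firstName", v_mary)
opponent = apply_path("games[0].opponent", v_mary)
```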
A complex value v can be represented using a set of entries, whose keys are
access paths for v. Each entry is intended to represent a distinct portion of the
complex value v, characterized by a location in its structure (the access path,
used as entry key) and a value (the entry value). Specifically, in NoAM we
represent each aggregate by means of a partition of its complex value v, that is,
a set E of entries that fully cover v, without redundancy. Consider again the
complex value v_mary shown in Figure 3; a possible entry for v_mary is the pair
⟨games[0].opponent, Player:rick⟩. We have already applied the above intuition
earlier in this section. For example, the ETF data representation (shown in
Figure 10) uses field names as entry keys (which are indeed a case of access
paths) and field values as entry values.
Aggregate partitioning can be based on the following guidelines (which are a
variant of guidelines proposed in [12] in the context of logical database design):
• Two or more data elements should belong to the same entry if they are
• Two or more data elements should belong to distinct entries if they are
usually accessed or modified separately.
Game
  2345   ε   ⟨ id : 2345,
               firstPlayer : Player:mary,
               secondPlayer : Player:rick ⟩
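As an illustration of these guidelines, the sketch below partitions a Game aggregate under the assumption that its scalar fields are accessed together, while rounds are added one at a time. The function names, the entry keys of the form rounds[i], and the use of the empty string for the empty entry key are choices made for this example only. The `rebuild` function shows that the entries cover the complex value without redundancy, since the original value can be reconstructed from them.

```python
def partition_game(game: dict) -> dict:
    """Split a game into one entry for its scalar fields and one entry per round."""
    scalars = {name: value for name, value in game.items() if name != "rounds"}
    entries = {"": scalars}  # empty entry key for the scalar fields
    for i, game_round in enumerate(game.get("rounds", [])):
        entries[f"rounds[{i}]"] = game_round
    return entries

def rebuild(entries: dict) -> dict:
    """Inverse of partition_game: reassemble the complex value from its entries."""
    game = dict(entries[""])
    round_count = len(entries) - 1
    game["rounds"] = [entries[f"rounds[{i}]"] for i in range(round_count)]
    return game

game_2345 = {
    "id": "2345",
    "firstPlayer": "Player:mary",
    "secondPlayer": "Player:rick",
    "rounds": [{"moves": ["a"]}, {"moves": ["b"]}],
}
entries = partition_game(game_2345)
```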
4.3. Implementation
various NoSQL systems, while keeping their major aspects, it is rather
straightforward to perform this activity. We have implementations for various
NoSQL systems, including Cassandra, Couchbase, Amazon DynamoDB, HBase,
MongoDB, Oracle NoSQL, and Redis. For the sake of space, we discuss the
implementation only with respect to a single representative system for each main
NoSQL category. Moreover, with reference to the same aggregate objects of
Figures 2 and 3 we will sometimes show only the data for one aggregate.
Similar representations can be obtained for the other aggregates of the running
example.
into units of data access and distribution. The effectiveness of our
implementation is based on the use we make of Oracle NoSQL keys, where the
major key controls distribution (sharding is based on it) and consistency (an
operation involving multiple key-value pairs can be executed atomically only if
the various pairs share the same major key).
More precisely, a technical precaution is needed to guarantee atomic
consistency when the selected data representation uses more than one entry per
block. Consider two separate operations that each need to update just a subset
of the entries of the block for an aggregate object. Since aggregates should be
units of atomicity and consistency, if these operations are requested
concurrently on the same aggregate object, then the application would require
that the NoSQL system identify a concurrency collision, commit only one of the
operations, and abort the other. However, if the operations update two disjoint
subsets of entries, then Oracle NoSQL is unable to identify the collision, since
it has no notion of block. We support this requirement, thus providing atomicity
and consistency over aggregates, by always including in each update operation
an access to the entry that includes the identifier of the aggregate (or some
other distinguished entry of the block).
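The idea can be illustrated with a small single-process simulation. This is not the Oracle NoSQL API: the class `VersionedStore`, its per-pair version counters, and the compare-on-version check are assumptions introduced only to show how two updates over disjoint entries still collide once both are forced to touch the same distinguished entry (here the entry with the empty key).

```python
class VersionedStore:
    """Hypothetical in-memory key-value store keeping a version for each pair."""

    def __init__(self):
        self.data = {}  # (major key, entry key) -> (version, value)

    def read(self, major, entry):
        return self.data.get((major, entry), (0, None))

    def update(self, major, writes, expected_root_version):
        """Apply `writes` under one major key if the distinguished entry is unchanged."""
        root_version, root_value = self.read(major, "")
        if root_version != expected_root_version:
            return False  # collision: another update touched this block first
        # Always rewrite the distinguished entry, so that its version changes.
        self.data[(major, "")] = (root_version + 1, writes.get("", root_value))
        for entry, value in writes.items():
            if entry != "":
                version, _ = self.read(major, entry)
                self.data[(major, entry)] = (version + 1, value)
        return True

store = VersionedStore()
store.update("Game/2345", {"": {"id": "2345"}}, expected_root_version=0)
seen, _ = store.read("Game/2345", "")  # two clients both observe this version
first = store.update("Game/2345", {"rounds[0]": {"moves": []}}, seen)
second = store.update("Game/2345", {"rounds[1]": {"moves": []}}, seen)
```

In this simulation the first update succeeds and the second is rejected, although the two updates write different entries. In a real deployment the check and the writes would have to be executed atomically by the store, which is possible only because all the pairs share the same major key.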
collection Player
id      document
mary    {
          id: "mary",
          username: "mary",
          firstName: "Mary",
          lastName: "Wilson",
          games:
            [ { game: "Game:2345", opponent: "Player:rick" },
              { game: "Game:2611", opponent: "Player:ann" } ]
        }
in the item for b (a serialization of the values is used, if needed). For example,
Figure 7 shows the implementation of the NoAM database of Figure 8.
The retrieval of a block, given its collection C and block key id, can be
implemented by performing a single getItem operation, which retrieves the item
that contains all the entries of the block. The storage of a block can be
implemented using a putItem operation, which saves all the entries of the block
atomically. It is worth noting that, using operation getItem, it is also
possible to retrieve a subset of the entries of a block. Similarly, using
operation updateItem, it is also possible to update just a subset of the entries
of a block atomically.
This implementation is also effective, since DynamoDB controls distribution
and atomicity with reference to items.
collection Player
id      document
mary    {
          id: "mary",
          username: "mary",
          firstName: "Mary",
          lastName: "Wilson",
          games[0]: { game: "Game:2345", opponent: "Player:rick" },
          games[1]: { game: "Game:2611", opponent: "Player:ann" }
        }
4.4. Experiments
We will now discuss a case study of NoSQL database design, with reference
to our running example. For the sake of simplicity, we just focus on the
representation and management of aggregates for games.
Data for each game include a few scalar fields and a collection of rounds.
The important operations over games are: (1) the retrieval of a game, which
should read all the data concerning the game; and (2) the addition of a round
to a game.
Assume that, to manage games, we have chosen a key-value store as the
target system. The candidate data representations are: (i) using a single entry
for each game (as shown in Figure 9, in the following called EAO); (ii) splitting
the data for each game into a group of entries, one for each round, and including
all the remaining scalar fields in a separate entry (as shown in Figure 11, called
Rounds).
We expect that the first operation (retrieval of a game) performs better in
EAO, since it needs to read just one key-value pair, while the second one
(addition of a round to a game) is favored by Rounds, which does not require
rewriting the whole game.
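The intuition behind this expectation can be made concrete with a rough count of the data touched by each operation. The sketch below is not a benchmark and does not reproduce any measurement: the game content is made up, the JSON byte length is only a crude proxy for the amount of data written, and all names are hypothetical.

```python
import json

def size(value) -> int:
    """Byte length of the JSON serialization, used as a crude cost proxy."""
    return len(json.dumps(value).encode("utf-8"))

# Made-up game with a dozen rounds.
game = {
    "id": "2345",
    "firstPlayer": "Player:mary",
    "secondPlayer": "Player:rick",
    "rounds": [{"moves": ["m"] * 50} for _ in range(12)],
}
new_round = {"moves": ["m"] * 50}

# Round addition in EAO: the single entry holding the whole game is rewritten.
eao_written = size({**game, "rounds": game["rounds"] + [new_round]})
# Round addition in Rounds: only the entry for the new round is written.
rounds_written = size(new_round)

# Game retrieval: EAO reads one pair; Rounds reads the scalar entry plus one pair per round.
eao_pairs_read = 1
rounds_pairs_read = len(game["rounds"]) + 1
```

Under these assumptions a round addition writes far fewer bytes in Rounds than in EAO, while a game retrieval touches a single pair in EAO and thirteen pairs in Rounds.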
We ran a number of experiments to compare the above data representations
in situations of different application workloads. Each game has, on average, a
dozen rounds, for a total of about 8KB per game. At each run, we simulated
the following workloads: (a) game retrievals only (in random order); (b) round
additions only (to random games); and (c) a mixed workload, with game
retrieval and round addition operations, with a read/write ratio of 50/50. We
ran the experiments using different database sizes, and measured the running
time required by the workloads. The target system was Oracle NoSQL, deployed
over Amazon AWS on a cluster of four EC2 servers.¹
The results are shown in Figure 14. Database sizes are in gigabytes, timings
are in milliseconds, and points denote the average running time of a single
operation. The experiments confirm the intuition that the retrieval of games
(Figure 14(a)) is always favored by the EAO data representation, for any
database size. On the other hand, the addition of a round to an existing game
(Figure 14(b)) is favored by the Rounds data representation. Finally, the
experiments over the mixed workload (Figure 14(c)) show a general advantage of
Rounds over EAO, which however decreases as the database size increases.
Overall, it turns out that the Rounds data representation is preferable.
[Figure 14: average running time per operation (ms) against database size (GB);
panel titles: Game Retrieval, Round Addition]
We also performed other experiments on a data representation that does
not conform to the design guidelines proposed in this paper. Specifically, we
considered a data representation that divides the rounds of a game into
independent key-value pairs, rather than keeping them together in the same
block, as suggested by our approach. In this case, the performance of the
various operations worsens by at least an order of magnitude. Moreover, with
this data representation it is not possible to update a game in an atomic way.
Overall, these experiments show that: (i) the design of NoSQL databases
should be done with care, as it considerably affects the performance and
consistency of data access operations, and (ii) our methodology provides an
effective tool for choosing among different alternatives.
5. Related works
Although several authors have observed that there is a need for data-model
approaches to the design and management of NoSQL databases [9, 10, 11],
very few works have addressed this issue, especially from a general and
system-independent point of view. Indeed, most of them propose a solution to a
specific problem in a limited scenario.
For instance, Pasqualin et al. have recently shown how a document-oriented
model can be efficiently implemented in a NoSQL document store [30].
Similarly, Olivera et al. [31] and de Lima and Mello [32] have proposed
data-model-based methodologies for the design of NoSQL document databases,
whereas Chevalier et al. have addressed the specific problem of leveraging
a document-oriented model for implementing a multidimensional database in a
NoSQL document store [33] and in a column-oriented NoSQL database [34].
Most of the other contributions to data modeling for NoSQL systems come
from online articles, usually published on practitioners' blogs, that discuss
best practices and guidelines for modeling NoSQL databases; most of these
are suited only to specific systems. For instance, [5] lists some techniques
for implementing and managing data stored in different types of NoSQL systems,
while [35] discusses design issues for the specific case of key-value
datastores. Similarly, Mior et al. [36] have recently proposed an approach to
the problem of schema design for the specific class of extensible record
stores. On the system-oriented side, [6, 7, 8] illustrate design principles
for the specific cases of HBase, MongoDB, and Cassandra, respectively.
However, none of them tackles the problem from a general perspective, as we
advocate in this paper.
Recently, Ruiz et al. have proposed a reverse engineering strategy aimed at
inferring the implicit schema of NoSQL databases [37]. This approach supports
the idea that, even in this context, a model-based description of the organization
of data is very useful during the entire life-cycle of a data set.
To the best of our knowledge, this paper presents the first general design
methodology for NoSQL systems with initial activities that are independent of
the specific target system. Our approach to data modeling is based on data
aggregates, a notion that is central in NoSQL databases, where application data
are grouped in atomic units that are accessed and manipulated together [3].
The notion of aggregate also occurs in other contexts with a similar meaning.
For example, in Domain Driven Design [19], a widely followed object-oriented
software development approach, an aggregate is a group of related application
objects, used to govern transactions and distribution. Helland [20] also
advocates the use of aggregates (there called entities) as units of
distribution and consistency. In this framework, Baker et al. [38] propose the
notion of entity groups, sets of entities that can be manipulated in an atomic
way. They also describe a specific mapping of entity groups to Bigtable [22],
which, however, targets the approach to a single NoSQL system. Our approach is
based on a more abstract database model, NoAM, and is system independent,
as it is targeted to a wide class of NoSQL systems.
The issue of identifying data access units in database design shows some
similarities with problems studied in the past, such as: (i) the early works
on vertical partitioning and clustering [39], with the idea of putting together
the attributes that are accessed together and separating those that are
visited independently, and (ii) the more recent approaches to relational (or
object-relational) storage of XML documents [40], where various alternatives
exist, with tables that can be very small and handle individual edges, or very
wide and handle entire paths, and many alternatives in between.
A major observation from [9] is that the availability of a high-level
representation of the data remains a fundamental tool for developers and
users, since it makes understanding, managing, accessing, and integrating
information sources much easier, independently of the technologies used. We
have addressed this issue by proposing NoAM, an abstract data model that makes
it possible to devise an initial phase of the design process that is
independent of any specific system but suitable for each.
Along this line, SOS [41] is a tool that provides a common programming
interface towards different NoSQL systems, to access them in a unified way.
The interface is based on a simple, high-level common data model which is
inspired by those of non-relational systems and provides simple operations for
inserting, deleting, and retrieving database objects. However, the definition of
tools for data access is complementary to data models and design issues.
Finally, Jain et al. discuss the potential mismatch between the requirements
of scientific data analysis and the models and languages of relational
database systems [42], whereas Alagiannis et al. [43] advocate a new database
design philosophy for emerging applications. This paper aims to contribute
to the solution of these problems.
6. Conclusion
In this paper we have argued how data modeling can be useful in the
NoSQL arena. Specifically, we have proposed a comprehensive methodology for
the design of NoSQL databases, which relies on an aggregate-oriented view of
application data, an intermediate system-independent data model for NoSQL
datastores, and finally an implementation activity that takes into account the
features of specific systems.
References
[2] R. Cattell, Scalable SQL and NoSQL data stores, SIGMOD Record 39 (4)
(2010) 12–27.
[11] C. Mohan, History repeats itself: sensible and NonsenSQL aspects of the
NoSQL hoopla, in: EDBT, 2013, pp. 11–16.
[13] F. Bancilhon, Object-oriented database systems, in: Proceedings of the
Seventh ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems, March 21-23, 1988, Austin, Texas, USA, 1988, pp. 152–162.
[17] S. Abiteboul, P. Buneman, D. Suciu, Data on the Web: From Relations to
Semistructured Data and XML, Morgan Kaufmann, 1999.
[18] M. Stonebraker, U. Çetintemel, “One size fits all”: An idea whose time
has come and gone (abstract), in: Proceedings of the 21st International
Conference on Data Engineering, ICDE 2005, 5-8 April 2005, Tokyo, Japan,
2005, pp. 2–11.
[24] J. Shute, et al., F1: A distributed SQL database that scales, PVLDB 6 (11)
(2013) 1068–1079.
[27] D. Pritchett, BASE: An ACID alternative, ACM Queue 6 (3) (2008) 48–55.
[34] M. Chevalier, M. E. Malki, A. Kopliku, O. Teste, R. Tournier,
Implementation of multidimensional databases in column-oriented NoSQL systems,
in: 19th East European Conference on Advances in Databases and Information
Systems (ADBIS 2015), 2015, pp. 79–91.
[36] M. J. Mior, K. Salem, A. Aboulnaga, R. Liu, NoSE: Schema design for NoSQL
applications, in: 32nd IEEE International Conference on Data Engineering,
ICDE 2016, Helsinki, Finland, May 16-20, 2016, 2016, pp. 181–192.
[38] J. Baker, et al., Megastore: Providing scalable, highly available storage for
interactive services, in: CIDR 2011, 2011, pp. 223–234.