
A Workload-Driven Logical Design Approach For NoSQL Document Databases



A workload-driven logical design approach for NoSQL document databases

Conference Paper · December 2015


DOI: 10.1145/2837185.2837218


2 authors:

Cláudio Lima Ronaldo Mello


Federal University of Santa Catarina Federal University of Santa Catarina


A Workload-Driven Logical Design Approach
for NoSQL Document Databases
Claudio de Lima
Postgraduate Program in Computer Science (PPGCC)
Informatics and Statistics Department (INE)
Federal University of Santa Catarina (UFSC)
Florianópolis/SC, Brazil 88040-900
[email protected]

Ronaldo dos Santos Mello
Postgraduate Program in Computer Science (PPGCC)
P.P. in Methods and Management in Evaluation (PPGMGA)
Informatics and Statistics Department (INE)
Federal University of Santa Catarina (UFSC)
Florianópolis/SC, Brazil 88040-900
[email protected]

ABSTRACT
NoSQL databases are designed to manage large volumes of data. Although they do not require a default schema associated with the data, they are categorized by data models. Because of this, data organization in NoSQL databases requires significant design decisions, since these decisions affect quality requirements such as scalability, consistency and performance. In traditional database design, in the logical modeling phase, a conceptual schema is transformed into a schema with lower abstraction that is suitable to the target database data model. In this context, the contribution of this paper is an approach for the logical design of NoSQL document databases. Our approach consists of a process that converts a conceptual model into efficient logical representations for a NoSQL document database. Workload information is considered to determine an optimized logical schema, providing better access performance for the application. We evaluate our approach through a case study in the e-commerce domain and demonstrate that the NoSQL logical structure generated by our approach reduces the number of items accessed by the application queries.

Categories and Subject Descriptors
H.2.1 [Database Management]: Logical Design.

General Terms
Algorithms, Performance, Design.

Keywords
NoSQL document, logical design, conceptual schema, workload.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
iiWAS '15, December 10-12, 2015, Brussels, Belgium
© 2015 ACM. ISBN 978-1-4503-3491-4/15/12…$15.00
DOI: http://dx.doi.org/10.1145/2837185.2837218

1. INTRODUCTION
The immense amount of data generated daily by applications from several domains, such as Web data management, social networks, sensor networks and educational evaluation, brings several challenges for data management in the cloud, including how to handle and store these data. NoSQL Databases (DBs) are designed to manage large volumes of data, commonly referred to as Big Data, and a large number of read and write operations [5].

Although NoSQL DBs do not require a default schema associated with the data, they are categorized by data models (key-value, document, columnar and graph-based) [16], demonstrating that their data show some degree of structuring. The importance of a model associated with the data is related to the definition of better strategies for persistence and manipulation of such data in the target DB. In addition, data organization in NoSQL DBs requires significant design decisions because it affects quality requirements such as scalability, consistency and performance [4].

In this context of data modeling, conceptual schemas and ontologies are crucial to define data semantics, providing access to the data with higher accuracy. Traditional DB design is a process consisting of three data modeling phases [1, 9]: conceptual, logical and physical. In the conceptual modeling phase, a schema with the information of a domain is represented in a high-level abstraction model. Next, in the logical modeling phase, the conceptual schema is transformed into a schema with lower abstraction that is suitable to the target DB data model. This logical design phase, specifically for NoSQL DBs, is the focus of this paper.

Methodologies that support the logical design of NoSQL DBs are a topic little explored in the database literature. Therefore, this paper aims to contribute to this problem by proposing a methodology for the logical design of NoSQL document DBs. This methodology consists of a process that converts a conceptual model into suitable and efficient logical representations for a NoSQL document DB. Document-oriented DBs are an appropriate category for Web applications or applications that deal with Big Data because they provide semistructured data storage, dynamic query execution, horizontal scalability and high availability [15]. For these reasons, we chose this NoSQL DB category.

Our conversion approach for generating NoSQL document logical schemas from conceptual schemas considers the expected workload of the application. Workload information is given by the designer in terms of the amount of data instances estimated for the NoSQL DB, as well as the main operations that will be performed over these data. This information is used to determine an optimized logical structuring for the NoSQL DB schema, contributing, in general, to a better access performance for the application. We evaluate our approach through an experimental evaluation in the e-commerce domain, where existing datasets were redesigned by our approach in order to compare the number
of accesses generated by queries over the redesigned schema as well as over the workload-based schema generated by our methodology. We demonstrate that the NoSQL logical structure generated by our approach reduces the data accessing overhead.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 presents the document NoSQL logical model considered by our approach and some definitions regarding workload information. Section 4 provides an overview of our approach, including the conversion algorithms for mapping conceptual constructs to suitable structures in the NoSQL document logical model considering the workload of an application. Section 5 presents the experimental evaluation and Section 6 is dedicated to the conclusion.

2. RELATED WORK
This section presents work related to NoSQL DB modeling and includes approaches dealing with other non-relational schemas (XML and Object-Oriented (OO)) in order to identify contributions to the logical modeling of NoSQL DBs, since the literature specific to NoSQL is quite limited. XML and OO models, as well as NoSQL data models, are complex data models whose similarities in terms of mapping strategies, such as the treatment of multivalued and nested attributes, can be adapted to the logical design of NoSQL DBs. This section also presents a brief comparison of these approaches in order to identify the modeling levels (conceptual, logical and physical) addressed by each of them.

The work of [4] presents an approach to NoSQL DB design which explores the commonalities of some NoSQL DB categories. The proposal introduces a data model (NoAM – NoSQL Abstract Model) for the logical level, and demonstrates how data modeled in NoAM can be implemented in some NoSQL DBs. NoAM is based on the concept of aggregates, which is a term of Domain-Driven Design (DDD) [10]. DDD is a widely adopted OO design approach, where an aggregate is a collection of related objects, organized in a nested way, which can be treated as a unit [20]. The approach of [4] suggests support for scalability, consistency and performance, and has four phases: (i) aggregate design: the classes of aggregated objects needed for the application are identified (conducted by use cases and functional requirements); (ii) aggregate partitioning: aggregates are divided into smaller data elements (conducted by use cases and performance requirements); (iii) high level NoSQL DB design: aggregates are mapped to the NoAM model according to the identified partitions; and (iv) implementation: the NoAM schema is converted to the schema of the target NoSQL DB.

The work of [14] uses IDEF1X (Integration DEFinition for Information Modeling), a data modeling language for the development of semantic data models, in the conceptual modeling phase to represent the application domain, and also to represent the aggregate-based NoSQL logical model obtained through a conversion process between these models. This proposal provides support for the analysis of different modeling strategies, like schema partitioning into smaller and independent aggregates in the SOA context (Service Oriented Architecture).

Besides these specific approaches for NoSQL DB modeling, the literature presents several design methodologies for XML DBs [2, 6, 8, 12, 21], as well as conversion processes from conceptual models to OO logical representations [3, 7, 11, 18, 19]. Table 1 shows a comparison of related work that relates the modeling levels addressed by each proposal.

Regarding specific design methodologies for NoSQL DBs, we observe that they propose logical schemas using the concept of aggregates at the logical level. According to [20], aggregates are suitable to represent key-value, document, and columnar NoSQL data models. The study of [4] explores the commonalities of some categories of NoSQL DBs, and although it covers the three design phases, it neither considers all the conceptual constructs nor formalizes conversion processes between conceptual modeling and logical representations in the NoAM model. The proposal of [14] presents the conversion of a conceptual model into a logical model, but it does not address the physical modeling phase and does not formalize conversion processes between conceptual and logical representations.

Table 1. Comparison of related work

         Database Design
#        Conceptual   Logical                    Physical
NoSQL Logical Model
[4]      UML          NoAM – aggregate-based     specific elements of NoSQL DBs categories
[14]     IDEF1X       IDEF1X – aggregate-based   -
XML Logical Model
[21]     EER          XML logical model          DTD / XML Schema
[6]      EER          -                          DTD
[12]     EER          -                          DTD
[8]      EER          hierarchical structures    XML Schema
[2]      UML          UML + stereotypes          XML Schema
OO Logical Model
[18]     ER           OO logical schema          DDL O2
[3]      ER           F-Logic                    DDL ONTOS
[7]      EER          OO schema                  -
[19]     ER           OO schema                  -
[11]     EER          OMT based                  -

Considering design methodologies for XML DBs, we observe that the work of [21] considers the three DB design phases as well as all EER conceptual model constructs to generate an equivalent XML structure. Information regarding the estimated load for the DBs is used to define optimizations on the XML structure generated by the transformation process. Regarding approaches for OO logical modeling, we observe that most of them consider the ER model in the conceptual modeling phase. However, there is no consensus with respect to the logical model and, in particular, to the way binary relationships and generalization types are converted.

Different from related work, our proposal covers all typical conceptual constructs, details the conversion algorithms between conceptual schemas and logical representations for the NoSQL document DB category, and additionally considers the estimated DB workload to perform optimizations in the logical structure. It is detailed in the following.

3. FUNDAMENTALS
Our approach provides the conversion of conceptual schemas into NoSQL document logical schemas. It starts with a conceptual schema and workload information given by the application designer, as shown in Figure 1. The workload information is estimated over a conceptual schema and is also used as input for the Logical Design phase in order to generate appropriate logical structures. The mapping of the conceptual schema to a NoSQL
document logical schema is governed by a set of rules that converts each conceptual construct to an equivalent representation in the NoSQL document logical model.

Our logical model is an abstract model to represent NoSQL document implementation models. In the Implementation Design phase, a NoSQL document logical schema is translated to the common implementation model for NoSQL documents, i.e., the JSON¹ specification. Even though the Implementation Design level is considered by our approach, this paper focuses on generating optimized NoSQL document structures from a conceptual schema in the Logical Design level.

¹ A lightweight data-interchange format (json.org).

Figure 1: An overview of the proposed approach.

Our input conceptual schema is defined by the Extended Entity-Relationship (EER) model [1], a classical and suitable model for representing data concerning an application domain. Other conceptual models could be considered, like UML. Instead, we adopt EER because it contains the essential constructs for conceptual modeling.

Some definitions regarding workload information and the logical model defined by our approach are presented in the following.

3.1 Workload Information
Workload information corresponds to the data load expected for a NoSQL-based application. This information allows our conversion process to choose an optimized NoSQL document structure to represent a conceptual schema. According to Batini et al. [1], we may concentrate on the 20% of the most frequent operations that will be performed by the application. This assumption is rooted in the so-called 20-80 rule, which says that 20% of the operations produce 80% of the application load. Our workload analysis identifies the concepts frequently accessed by transactions and is based on the workload modeling methodology defined in [1, 21, 22] as follows.

Definition 1. Volume of Data. Given an EER schema Ɛ, the volume of data of Ɛ is defined by V = {N(t), Avg(Ƭ, r)}, where N(t) is the average number of occurrences of a conceptual type t ∈ Ɛ and, given an n-tuple Ƭ = t_1, ..., t_n (n > 1) of entity types associated through a relationship type r, Avg(Ƭ, r) is the average cardinality among the entities in Ƭ through r.

Figure 2 shows an example of an EER schema augmented with volume of data. The average number of instances N expected for the conceptual types is represented in the type shape and the average cardinality (Avg) is presented on the associations. We omitted Avg parameters for the sake of clarity. Thus, we have N(A) = 250, N(R1) = 500 and so on. For the average cardinality, we have Avg(A,B,R1) = 2, Avg(B,A,R1) = 1 and so on. Cardinality interpretation is given as follows: the average volume of instances of A related to B through R1 is 2 and it appears next to B. In the same way, the value of Avg(B,A,R1) appears next to A.

Figure 2. An EER schema with volume of data information.

Definition 2. Application Load. Consider an EER schema Ɛ = {t_1, ..., t_m} and a set of operations O = {o_1, ..., o_n} over Ɛ such that each o_i ∈ O is applied over a list of types T = (t_1, ..., t_p) with T ⊆ Ɛ. The application load on Ɛ is defined by a set of operations and it is composed of two functions: (i) f(o_i) is the average frequency of o_i in a period of time; and (ii) v(o_i, t_j) is the volume of instances of t_j accessed by o_i. This volume is given for each t_j ∈ T respecting the access order imposed by o_i. v(o_i, t_j) is defined as f(o_i) when j = 1; otherwise, it is defined as v(o_i, t_k) × ω, where t_k is the type accessed by o_i before t_j, and ω is 1 if t_j is an entity type or Avg(t′, ..., t″, t_k, t_j) if t_j is a relationship type, where t′, ..., t″ are the types associated with t_k in the relationship determined by t_j.

An operation is an elementary interaction with the application, which includes retrieval or updating operations. Table 2 shows an example of a set of operations estimated as the application load. Operation O1, for example, has an average frequency of 900 times a day. The entity and relationship types C, R2 and B are accessed, in this sequence, by O1. Note that the initial concept C is accessed 900 times by O1. In the navigation sequence, the average number of accessed instances of the concepts R2 and B is obtained by multiplying 900 by 20, considering that Avg(C,B,R2) = 20. Analogously, we have Avg(A,B,R1) = 2 for operation O2.

Table 2. Operations for the schema of Figure 2

Operation   Frequency per day   Concept accessed   Access volume
O1          900                 C                  900
                                R2                 18000
                                B                  18000
O2          300                 A                  300
                                R1                 600
                                B                  600

Given the volume of data and the application load, the definition of the total frequency of access (operation access) on a type in an EER schema is presented.

Definition 3. General Access Frequency (GAF). Given an EER schema Ɛ, O = {o_1, ..., o_n} is the set of operations such that each o_i ∈ O is applied over a list of types T ⊆ Ɛ. The GAF of a type t ∈ Ɛ, where n (n ≥ 0) represents the number of operations in which t is accessed, is defined as follows:

GAF(t) = \sum_{i=1}^{n} v(o_i, t)    (1)

The GAF of the concept B is GAF(B) = 18600 by considering the 18000 instances accessed by O1, and the 600 accessed by O2.

In order to evaluate the GAF measure, we may consider a Minimal Access Frequency (MAF), which is given by the designer as an input of our process. MAF is a value that represents the minimal frequency for accesses involving operations, and values below it are considered as insignificant frequencies. We introduce an example as follows. Suppose the designer assumes that the set of considered operations in Table 2 represents 80% of the
application load and the MAF should represent 0.9%. Thus, if the sum of the GAF of all schema concepts is 38400 accesses and this sum corresponds to 80% of the application load, the full load is 38400 / 0.8 = 48000 accesses and the MAF is 0.9% of this value, i.e., 432 accesses. Given this minimal value, we can evaluate if the GAF of a concept is relevant for the workload. If GAF(B) = 18600, for example, we say that B is a concept frequently accessed by transactions because its GAF is higher than MAF (432).

3.2 NoSQL Document Logical Model
We propose a NoSQL document logical model to represent the document data model. Our conversion approach generates NoSQL schemas defined by this logical model in the Logical Design phase. The NoSQL document logical model is an abstract representation for NoSQL document models and consists of an adaptation of the aggregate approach [10], which is a widely adopted OO approach. In this context, an aggregate represents a collection of related objects, in a nested way, which can be treated as a unit. Such a notion is suitable to NoSQL documents given that they are hierarchical data structures that consist of nested data collections and scalar values [20]. Besides, the choice of an aggregate-based logical representation is justified by the fact that aggregates support typical NoSQL database requirements, like scalability and consistency, as they provide a natural unit for sharding and atomic manipulation of data in distributed environments [10, 13].

A NoSQL document logical schema is composed of collections, blocks and attributes. A schema has one or more collections, and each collection has a root block. All updates to a collection pass through its respective root block, which enforces the business rules. The root block is the only block accessible from outside the collection. The main concepts of the logical model are defined as follows.

Definition 4. NoSQL document logical schema. A NoSQL document logical schema NDS_i is a set of collections C_i, where each collection has a unique name in the schema.

Definition 5. Collection. A collection c_j is a non-empty set of blocks B_j, where c_j has a root block rb_j ∈ B_j.

Definition 6. Root Block. A root block rb_k consists of an attribute a_k that uniquely identifies rb_k in a collection c_k, and a non-empty set of attributes A_k and/or blocks B_k.

Definition 7. Block. A block b_x is a set of attributes A_x, or a set of inner blocks B_x, that supports disjointness constraints for the inner blocks b_x ∈ B_x.

Definition 8. Attribute. An attribute a_y of a block b_y is a tuple (c_y, v_y), where c_y uniquely identifies a_y in b_y, and v_y is the value of a_y.

Definition 9. Hierarchical Relationship. A hierarchical relationship hr_m ∈ sb_m, where sb_m is a source block, is defined between sb_m and an inner (target) block tb_m ∈ sb_m.

Definition 10. Reference Relationship. A reference relationship rr_n is represented by a reference attribute ra_n ∈ sb_n or a set of reference attributes RA_n ⊆ sb_n, where sb_n identifies a source block. The reference attribute ra_n refers to the identifier attribute a_o of a target collection's root block rb_o.

Figure 3 presents a NoSQL document logical schema in accordance with our proposed logical model. There are three types of attributes: normal, identifier and reference attribute. A normal attribute models a block property and does not impose restrictions, like participation in the Partner block. An identifier attribute is an attribute that is part of a root block, like ID_code in the Person collection. A reference attribute is an attribute that refers to a block identifier, like contributor_REF, which refers to the Contributor collection.

Figure 3: Example of a NoSQL document logical schema.

The logical model supports two types of relationships: hierarchical relationship and reference relationship. A hierarchical relationship defines the minimum and maximum occurrences of a target concept in a source concept. For instance, the Person (root) block may have zero or one Student block. The default minimum and maximum occurrence for target concepts is 1. The disjointness constraint on generalization hierarchies is represented by the curly bracket (brace) symbol ("{"), graphically aligned to the left of the target inner blocks. An example in Figure 3 is shown for the inner blocks Student and Employee. A reference relationship is represented by a reference attribute (or a set of reference attributes), which refers to the identifier of another collection, like contributor_REF, which refers to the collection Contributor.

As the NoSQL document logical model is an abstract representation for NoSQL document models, a NoSQL document logical schema can be easily converted to JSON, the common storage format for NoSQL documents. Only a few decisions must be made for the translation of the identifier and reference attributes. Such a conversion is out of the scope of this paper.

4. THE CONVERSION PROCESS
Our conversion process is based on conversion rules for mapping EER constructs into equivalent NoSQL document constructs in the logical model described in Section 3.2. Algorithm 1 presents the overall process, which comprises two main steps: conversion of generalization types and conversion of relationship types.

Algorithm 1 EER-NoSQL
Input: An EER Schema Ɛ with load data information;
       The Minimal Access Frequency MAF of Ɛ
Output: A NoSQL document logical schema NF

H ← convertHierarchies(Ɛ, MAF);
R ← convertRelationships(Ɛ, MAF, H);
NF ← listOfCollections(R);
defineRootBlockIDs(NF).

Generalization types are converted first, followed by the conversion of the relationships. The blocks generated by the function convertHierarchies are maintained by the function convertRelationships. After the conversion of relationship types, the remaining root blocks are finally defined as the schema's collections. At the end of the process, a list of collections is returned, after the definition of identifier attributes for the collections' root blocks when necessary. Load information is considered during the conversion
of the relationship types in order to generate well-structured NoSQL document logical schemas.

The next sections detail the rules for converting generalization hierarchies and relationship types as well as their respective functions, as presented in Algorithm 1.

4.1 Hierarchy Types Conversion
A generalization hierarchy in the EER model defines a subset relationship between a generic entity, namely the superclass, and one or more specialized entities, namely subclasses. The disjointness and completeness constraints that are set on the subclasses establish four possible constraints on generalization types: total and disjoint (t, d); partial and disjoint (p, d); total and overlapping (t, o); and partial and overlapping (p, o) [1].

Categories or union types of the EER model can be considered restricted cases of multiple inheritance [1]. Thus, their conversion strategies are similar to the strategies for processing generalization types. For the sake of space, we omit these strategies. In this section, we define alternative rules to convert a generalization hierarchy from an EER schema to a NoSQL document logical schema. We also detail the function convertHierarchies of Algorithm 1, which selects the suitable rule to be applied to each occurrence of an EER generalization type.

4.1.1. Conversion Rules
Three alternatives are provided to convert generalization types, inspired by the relational logical design methodology [1]. The alternatives differ in the size of the NoSQL document schema that each one generates, and in the constraints on generalization types they are able to support.

The conversion strategy defined by Rule 1 generates only one block from a generalization hierarchy. The block represents the superclass and its attributes, as well as the attributes of its subclasses. Subclasses' attributes are defined as optional in the content model of the superclass block. On applying this rule, we assume that the subclasses' attributes will act as discriminating attributes to identify an instance of a subclass in the NoSQL documents. The subclasses previously converted (marked) become optional inner blocks of the block generated by this rule.

Rule 1. Generalization Focused on Superclass. The conversion of a generalization type G proceeds as follows:
1. given an entity E_sp defined as the G superclass, generate a block b_sp. The attributes of E_sp become attributes of b_sp;
2. given the set of E_sp subclasses {E_sb1, E_sb2, ..., E_sbn}, for each E_sbi ∈ {E_sb1, E_sb2, ..., E_sbn} do:
   if (E_sbi was not converted) then define the attributes of E_sbi as optional attributes in b_sp
   else given b_sbi the block that represents E_sbi, generate a hierarchical relationship from b_sp to b_sbi where the occurrence of b_sbi in b_sp is defined as [0..1].

The main restriction to the application of Rule 1 occurs when one of the subclasses is defined as a referenced block. A referenced block is an entity that was previously processed and defined as referenced by another block. This restriction guarantees that the referenced block will be a root block, preventing this root block from being further converted into an inner block of another block.

The alternative defined by Rule 2 generates only NoSQL document blocks for the subclasses, and the superclass attributes are reproduced into each subclass block.

In the alternative defined by Rule 3, the superclass and subclasses are explicitly represented by blocks. Hierarchical relationships are established among the superclass and subclass blocks to represent the relationship. The generalization constraints are represented by the minimum and maximum occurrences of the subclasses' blocks in the superclass block. In cases where a subclass of a generalization type has already been converted, the relationship with the superclass block is established by a reference relationship between the block previously created to represent the converted subclass and the superclass block. In this case, the superclass block is defined as a referenced block.

Rule 2. Generalization Focused on Subclasses. The conversion of a generalization type G proceeds as follows:
given an entity E_sp defined as the superclass of a generalization type and {E_sb1, E_sb2, ..., E_sbn} the set of subclasses of E_sp, for each E_sbi ∈ {E_sb1, E_sb2, ..., E_sbn} do: generate a block b_sbi and define the attributes of E_sbi and E_sp as attributes of b_sbi.

Rule 3. Generalization Focused on Hierarchy. The conversion of a generalization type G proceeds as follows:
1. given an entity E_sp defined as the G superclass, generate a block b_sp and if (G is a disjoint generalization) then generate a disjointness constraint. The attributes of E_sp are defined as attributes of b_sp;
2. given the set of E_sp subclasses {E_sb1, E_sb2, ..., E_sbn}, for each E_sbi ∈ {E_sb1, E_sb2, ..., E_sbn} do:
   if (E_sbi was not converted) then generate a block b_sbi and a hierarchical relationship from b_sp to b_sbi where the occurrence of b_sbi in b_sp is defined as ([0-1],[1]), depending on the completeness constraint of G (total or partial)
   else given b_sbi the block that represents E_sbi, generate a reference attribute ra_sbi in b_sbi which refers to the b_sp identifier, and define b_sp as a referenced block.

Function 1 convertHierarchies
Input: An EER Schema with load data information Ɛ;
       The Minimal Access Frequency of Ɛ (MAF)
Output: A set of blocks H' of a NoSQL logical schema

H ← the list of generalization types of Ɛ;
H' ← sort H so that the generalization types at the bottom of the hierarchy with superclasses that have highest GAF appear first;
for each h_i ∈ H' (1 ≤ i ≤ n) with superclass E_sp and subclasses {E_sb1, ..., E_sbn} do
    if ( converted subclasses in h_i) AND (all subclasses in h_i have GAF < MAF) AND (∄ subclasses with more than one superclass) AND (∄ subclasses defined as referenced block) then
        Apply Rule 1 and mark as converted all the subclasses of h_i
    else if (GAF(E_sp) < MAF) AND (∄ subclasses with more than one superclass) then
        Apply Rule 2 and mark E_sp as converted
    else
        Apply Rule 3 and mark as converted all the subclasses of h_i
    end if
end for
return H'
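Function 1 relies on the GAF and MAF measures of Section 3.1, so the two parts can be read together. The following Python snippet is a minimal sketch of how they might be computed and combined, not the authors' implementation. The class names `Operation` and `Generalization`, the function names and the boolean flags are assumptions introduced only for illustration. The first precondition of Rule 1 (about previously converted subclasses) is left out because its quantifier is not legible in the listing above, and the generalizations in the example are hypothetical since Figure 2 is not available here.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    name: str
    frequency: int  # f(o): average executions per period
    path: list      # ordered (type name, omega) pairs; omega is 1 for an
                    # entity type and Avg(...) for a relationship type


@dataclass
class Generalization:
    superclass: str
    subclasses: list
    multiple_inheritance: bool = False  # some subclass has more than one superclass
    referenced_subclass: bool = False   # some subclass is already a referenced block


def access_volumes(op):
    """v(o, t): f(o) for the first type, then the previous volume times omega."""
    volumes, volume = {}, op.frequency
    for position, (type_name, omega) in enumerate(op.path):
        if position > 0:
            volume *= omega
        volumes[type_name] = volumes.get(type_name, 0) + volume
    return volumes


def general_access_frequency(operations):
    """GAF(t): access volume of t summed over all operations."""
    gaf = {}
    for op in operations:
        for type_name, volume in access_volumes(op).items():
            gaf[type_name] = gaf.get(type_name, 0) + volume
    return gaf


def minimal_access_frequency(gaf, covered_share, maf_share):
    """MAF as a share of the full load; sum(GAF) covers covered_share of that load."""
    return maf_share * sum(gaf.values()) / covered_share


def choose_rule(g, gaf, maf):
    """Rule selection of Function 1; rules yielding smaller fragments are tried first."""
    if (all(gaf.get(s, 0) < maf for s in g.subclasses)
            and not g.multiple_inheritance
            and not g.referenced_subclass):
        return 1
    if gaf.get(g.superclass, 0) < maf and not g.multiple_inheritance:
        return 2
    return 3


# Operations of Table 2, with Avg(C,B,R2) = 20 and Avg(A,B,R1) = 2.
operations = [
    Operation("O1", 900, [("C", 1), ("R2", 20), ("B", 1)]),
    Operation("O2", 300, [("A", 1), ("R1", 2), ("B", 1)]),
]
gaf = general_access_frequency(operations)
maf = minimal_access_frequency(gaf, covered_share=0.8, maf_share=0.009)

# Hypothetical generalization (not part of Figure 2): C as superclass of B.
rule = choose_rule(Generalization("C", ["B"]), gaf, maf)
```

With the Table 2 data, `gaf["B"]` is 18600 and `maf` is about 432, matching the example of Section 3.1. In the hypothetical generalization, both C and B exceed the MAF, so `choose_rule` falls through to Rule 3 and the distinction between superclass and subclass is kept.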
4.1.2. Conversion Function
The function convertHierarchies (Function 1) is responsible for choosing the appropriate rule for converting each generalization type of a conceptual schema. A generalization type is converted by analyzing the load data and the constraints of the generalization hierarchy. The function establishes the order in which the entities involved in a generalization hierarchy must be converted. A bottom-up conversion is performed when there is a multiple-level hierarchy, i.e., the entities are converted from the bottom to the top of the hierarchy. Besides, when there is a multiple-inheritance case, the superclass with the highest General Access Frequency (GAF) takes priority. It means that the superclass that is most frequently accessed becomes the parent block of the block that represents the subclass with more than one superclass. In this case, the remaining superclasses are referenced by reference attributes as defined in Rule 3. In fact, generalization types involved in multiple-inheritance cases are always converted by Rule 3.

Once the conversion order of the generalization types is established, we apply the conversion rules for generalization types (Rule 1, 2 or 3) and verify the preconditions of each one, so that the rules that generate the smallest NoSQL document logical fragment are verified first, as illustrated in Function 1. For Rule 1 and Rule 2, we verify whether the GAF of the entities that will be omitted is lower than the Minimal Access Frequency (MAF). If the GAF is higher than MAF, it means that these entities participate in frequent operations and the distinction between superclass and subclasses must be preserved. The last option to convert generalization types is Rule 3.

4.2 Relationship Types Conversion
A relationship type is a common conceptual construct which establishes a correspondence among two or more entities [1]. The cardinality of a relationship type is the main constraint that is considered on the conversion to a NoSQL document logical structure. Our rules for converting EER relationship types also proceed from logical design of traditional data models. In this section, we present these rules and their constraints, as well as the function that controls their execution.

4.2.1. Conversion Rules
We define three conversion rules that deal with specific constraints for relationship types. Rule 4 is applied only to 1:1 relationships, Rule 5 regards 1:N relationships, and Rule 6 is applied to relationships with cardinality N:N, n-ary ones with n > 2, or in cases where 1:1 and 1:N relationships cannot be treated by rules 4 and 5, respectively.

Rule 4 generates only one block to represent the relationship type and its related entities.

Rule 4. Relationship Modeled as One Block. The conversion of a relationship type R proceeds as follows:
given a 1:1 relationship type R which relates the entities E1 and E2, generate a block bE1 and define the attributes of E1, E2 and R as attributes in bE1.

Rule 5 generates blocks for each related entity, where one of them is converted to a nested block of the other one, and the relationship attributes are appended to the nested block. To guarantee that the referenced block will be a root block, Rule 5 cannot be applied to relationship types in which the entity with participation (1,1) was previously defined as a referenced block.

Finally, Rule 6 generates independent blocks for each entity of a relationship type, and reference relationships are established among the generated blocks.

Rule 5. Relationship Modeled as a Hierarchy. Given a 1:N relationship type R which relates the entities E1 and E2, the conversion of R proceeds as follows:
1. generate a block bE1 for representing E1 and define the attributes of E1 as attributes in bE1;
2. generate a block bE2 for representing E2 as a nested block of bE1. The occurrence of bE2 in bE1 depends on the participation of E1 in R (optional or mandatory);
3. define the attributes of E2 and R as attributes in bE2.

Rule 6. Relationship Modeled as References. Given a relationship type R and the set of entities {E1, E2, .., En} related by R, the conversion of R proceeds as follows:
1. for each Ei ∈ R (1 ≤ i ≤ n) do: generate a block bEi and define the attributes of Ei as attributes in bEi;
2. if (R is a binary relationship without attributes) AND (the participation of E1 in R is defined as ([0-1],1)) then generate a reference attribute in bE1 referring to the identifier of bE2, and define bE2 as a referenced block
else
3. generate a block bR as a nested block of bE1 and define the attributes of R as attributes in bR;
4. for each Ei ∈ R (1 ≤ i ≤ n) do: generate a reference attribute in bR referring to the identifier of bEi, and define bEi as a referenced block.

Function 2 convertRelationships
Input: An EER schema with load data information Ɛ;
       The Minimal Access Frequency of Ɛ (MAF);
       A set of blocks H of a NoSQL logical schema generated by convertHierarchies
Output: A set of root blocks R' of a NoSQL logical schema

  R ← the list of relationship types of Ɛ;
  R' ← sort R so that the relationship types with the highest GAF appear first;
  for each ri ∈ R' do
    ri ← the first unconverted relationship of R';
    ES ← the set of entities {E1, .., En} related by ri;
    if (ri is binary) AND (∃ an unconverted entity Ei ∈ ES with participation (1,1) in ri) then
      E2 is Ei and E1 is the other entity of ri
    else
      E1 is the entity that has the highest GAF in ri
    end if
    if (ri is 1:1) AND (the participation of E2 is (1,1)) AND (E2 is unconverted) then
      Apply Rule 4 (H) and mark E2 as converted
    else if (ri is binary) AND (the participation of E2 is (1,1)) AND (E2 is unconverted) AND (E2 is not defined as referenced block) AND (E1 ≠ E2) then
      Apply Rule 5 (H) and mark E2 as converted
    else
      Apply Rule 6 (H)
    end if
    Mark ri as converted
  end for
  return R'
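The rule selection performed for a single relationship type in Function 2 can be sketched in Python as below. This is only an illustrative reading of the pseudocode: the `Relationship` class, the `choose_rule` function, the string encoding of cardinalities and participations, and the entity names in the usage example are all assumptions introduced here. The sketch decides which rule applies and which entities play the roles of E1 and E2; it does not generate blocks.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Relationship:
    name: str
    entities: List[str]            # entities related by the relationship type
    cardinality: str               # "1:1", "1:N" or "N:N" (assumed encoding)
    participation: Dict[str, str]  # entity name -> "(1,1)", "(0,N)", ...


def choose_rule(rel: Relationship, gaf: Dict[str, int],
                converted: Set[str], referenced: Set[str]
                ) -> Tuple[int, str, Optional[str]]:
    """Return (rule number, E1, E2) for one relationship type."""
    binary = len(rel.entities) == 2
    e2: Optional[str] = None
    mandatory = [e for e in rel.entities
                 if e not in converted and rel.participation.get(e) == "(1,1)"]
    if binary and mandatory:
        # E2 is the unconverted entity with participation (1,1); E1 is the other one.
        e2 = mandatory[0]
        e1 = rel.entities[1 - rel.entities.index(e2)]
    else:
        # Otherwise the top entity is the one with the highest GAF.
        e1 = max(rel.entities, key=lambda e: gaf.get(e, 0))
        if binary:
            e2 = rel.entities[1 - rel.entities.index(e1)]

    e2_ok = (e2 is not None and e2 not in converted
             and rel.participation.get(e2) == "(1,1)")
    if rel.cardinality == "1:1" and e2_ok:
        return 4, e1, e2          # one block for both entities
    if binary and e2_ok and e2 not in referenced and e1 != e2:
        return 5, e1, e2          # E2 nested in E1
    return 6, e1, e2              # independent blocks linked by references


# Hypothetical schema fragment, not taken from the case study.
gaf = {"Parent": 500, "Child": 300, "Tag": 50}
contains = Relationship("contains", ["Parent", "Child"], "1:N",
                        {"Parent": "(0,N)", "Child": "(1,1)"})
tagged = Relationship("tagged", ["Parent", "Tag"], "N:N",
                      {"Parent": "(0,N)", "Tag": "(0,N)"})
print(choose_rule(contains, gaf, set(), set()))
print(choose_rule(tagged, gaf, set(), set()))
```

In a full implementation the caller would iterate over the relationship types in decreasing GAF order and update the `converted` and `referenced` sets after each rule is applied.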
4.2.2. Conversion Function
The function convertRelationships (Function 2) controls the execution of the conversion rules for relationship types of an EER schema. It orders the relationship types so that relationships with the highest GAF appear first. This order is established to give priority to the relationships that have the largest impact on the application workload. Then, if there is more than one nesting possibility for an entity type given all the relationship types in which it participates, we process this entity by considering the relationship with the highest GAF first. This order ensures that relationship types involving associative entities are converted after the internal relationship types of these associative entities.

For converting a relationship type, we first determine which entity will be on the top of the hierarchy in the NoSQL document logical schema. If the relationship type is binary and there is an entity with participation (1,1) in the relationship, the top entity is the other entity of the relationship type. For other cases, the top entity is the entity type with the highest GAF in the relationship, i.e., we assume that the relationship type is more frequently accessed through this entity for the considered operations.

In the next section, we evaluate our approach with a case study in the e-commerce domain.

5. EXPERIMENTAL EVALUATION
We evaluate our approach with an experiment in the e-commerce domain. Our intention here is to validate our conversion methodology, exemplify the usage of our process and show the positive effects, in terms of processing time, of considering the application workload. In fact, we show here that our method can improve query performance on NoSQL documents by reducing the number of accesses to the NoSQL database. Experimental settings and results are presented in the following.

5.1 Experiment Settings
We perform a reverse engineering from a real e-commerce application dataset. The resulting conceptual schema is presented in Figure 4. Then, we apply our conversion process twice over the conceptual schema obtained from this reverse engineering process. The first time, we do not consider workload information, and the generated schema is called the conventional schema. The second time, we apply our complete conversion process, and the generated schema is called the optimized schema.

For the generation of the conventional schema, small changes in the convertHierarchies and convertRelationships functions were required. In the function convertHierarchies, instead of considering GAF and MAF, we verify the existence of relationships involving subclasses and superclasses for the application of Rules 1 and 2, respectively. Rule 1 assumes that the explicit distinction between subclasses is irrelevant for most instances of the superclass. The existence of relationships involving subclasses is the main constraint for the application of this rule. Rule 2 is not considered for cases where the superclass's relationships must be converted into relationships with each one of the subclasses.

For function convertRelationships, we modify the ordering of the relationship types: instead of comparing GAF to establish the order, we use the concept of fully functional closures [17], which determines the list of entities that can be reached from a starting entity through relationship pathways determined by participation (1,1) of entities in the relationships. After identifying the fully functional closures of the entities of the conceptual schema, a list of entities (EL) is generated. These entities must be ordered in EL so that the entities participating in more fully functional closures appear first in the list. Then, the remaining entities are added to the end of EL, so that the entities that have a higher number of relationships appear first. The final list obtained for our case study is EL = {Customer, Order, Product, Category, Carrier, Item, Supplier, CreditCard, Payment, Bill, Person}.

It is important to notice that the generation of the conventional and optimized schemas has the same goal, which is to generate compact and redundancy-free schemas and define appropriate representations in the NoSQL document logical model. The main difference between the conversion processes is that the optimized schema is generated by considering workload information to select the appropriate conversion rules.

The number of instances from the original e-commerce application dataset was used to measure the volume of data in our case study, i.e., the average number of instances of the entities and relationships as well as the average cardinality of the entities in each relationship type. We omit most of the attributes to simplify the schema readability. The volume of data for the conceptual schema of the application is also shown in Figure 4.

We also obtained the main operations that comprise the application workload. They were provided by an expert user of the application considering the concepts (entity and relationship types) defined in the conceptual schema. The third column of Table 3 presents the operation load in terms of access frequency for each concept of the conceptual schema.

The GAF of each concept involved in the operations on the conceptual schema was also measured. The values are shown in Table 4. We omit the GAF of the concepts that are not accessed by the considered operations. The conventional and optimized NoSQL document logical schemas generated by our approach are shown in Figure 5 and Figure 6, respectively.

The EER schema, the volume of data, a set of operations and their average frequencies are given as input for our conversion process. The volume of data is included in the conceptual schema and the operations are shown in Table 3. The application load was measured over the conceptual schema according to Definition 2, while GAF was generated according to Definition 3.

In order to obtain the MAF measure, we considered that the set of operations produces 80% of the load. Thus, the total volume generated on the conceptual schema (651,990 daily accesses) represents 80% of the total number of accesses that can be performed by the application over the conceptual types. We assume that 1.15% is given as a parameter and denotes MAF as a percentage. This percentage is applied over the total volume (651,990 / 0.8 ≈ 814,988 accesses), and we obtain 9,372 accesses as the MAF value.

In the following, we measure and compare the access frequency generated by each concept by executing the operations on the conventional and optimized schemas. The access frequency generated by the schemas is shown in the last two columns of Table 3. In order to evaluate the effects of the query processing on these schemas, the operations were performed on compliant NoSQL documents generated and stored in the NoSQL document-oriented database MongoDB². We developed a Java application to produce the JSON documents defined by each collection for both schemas. For each schema, we generated a set of documents with the same volume of data defined by Figure 4. The tests were carried out on a Core i7 2.40 GHz processor with 8 GB of memory, 1 TB of disk and Windows 8.1 Pro. The MongoDB-shell query specifications were defined according to the structure of the schemas and the sequence of accesses for the operations as presented in Table 3. We use the trial version of the NoSQL Manager for MongoDB Professional tool to execute the queries.

The next section presents and discusses the experiment results.

² A document-oriented database (mongodb.org).
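The GAF values of Table 4 and the MAF value of 9,372 can be reproduced from the conceptual-schema column of Table 3 with a short calculation. The sketch below assumes that the GAF of a concept is the sum of its access frequencies over all operations; Definitions 2 and 3 are not reproduced in this part of the text, so this is an assumption, although it matches every value in Table 4. The variable names and the rounding step are also choices made only for this example.

```python
from collections import Counter

# Daily access frequency per operation and concept (conceptual-schema column of Table 3).
conceptual_load = {
    "O1": {"Order": 1500, "request": 1500, "Customer": 1500, "composite": 2475,
           "Item": 2475, "reference": 2475, "Product": 2475},
    "O2": {"Order": 900, "request": 900, "Customer": 900,
           "commitment": 900, "Payment": 900},
    "O3": {"Customer": 450, "request": 13185, "Order": 13185,
           "delivery": 13185, "Carrier": 13185},
    "O4": {"Customer": 300, "commitment": 300, "Payment": 300},
    "O5": {"Product": 100, "reference": 144100, "Item": 144100,
           "composite": 144100, "Order": 144100},
    "O6": {"Supplier": 100, "furnishing": 600, "Product": 600,
           "catalog": 600, "Category": 600},
}

# Assumed GAF: total accesses of each concept over all operations.
gaf = Counter()
for accesses in conceptual_load.values():
    gaf.update(accesses)

total_load = sum(gaf.values())   # load produced by the listed operations
LOAD_COVERAGE = 0.80             # the operations are assumed to produce 80% of the load
MAF_PERCENT = 0.0115             # MAF parameter of 1.15%

maf = round(total_load / LOAD_COVERAGE * MAF_PERCENT)

print("total load:", total_load)
print("MAF:", maf)
for concept, value in gaf.most_common():
    marker = "below MAF" if value < maf else "at or above MAF"
    print(f"{concept:12s} {value:8d}  {marker}")
```

Concepts whose GAF falls below the MAF are the candidates that Rules 1 and 2 may omit as separate blocks.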

Figure 4. EER schema for an e-commerce application.

Figure 5. The conventional NoSQL document logical schema.

Figure 6. The optimized NoSQL document logical schema.
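Figures 5 and 6 are not reproduced in this text. As a rough illustration of why the two schemas differ in cost, the sketch below uses the access-counting model that the paper applies to the Payment block: when Payment is reached through a reference attribute, each reference value is compared with every Payment instance, whereas a nested Payment block is read together with its parent. The function names are hypothetical; the figures (785,000 Payment instances, 900 and 8,790 lookups) are the ones reported for operations O2 and O4.

```python
def referenced_accesses(lookups: int, target_instances: int) -> int:
    """Accesses when each reference value is compared with every target instance."""
    return lookups * target_instances


def nested_accesses(lookups: int) -> int:
    """Accesses when the target block is nested in the block already being read."""
    return lookups


PAYMENT_INSTANCES = 785_000
payment_lookups = {"O2": 900, "O4": 8790}

for operation, lookups in payment_lookups.items():
    conventional = referenced_accesses(lookups, PAYMENT_INSTANCES)
    optimized = nested_accesses(lookups)
    print(f"{operation}: conventional={conventional:,} optimized={optimized:,}")
```

This counting model ignores indexes, which is consistent with the paper leaving physical design to future work.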
Table 3. Operations on schemas

                                  Access Frequency
#    Concept        Conceptual      Conventional       Optimized
                    schema          logical schema     logical schema
O1   Order               1,500              1,500             1,500
     request             1,500                  -                 -
     Customer            1,500              1,500             1,500
     composite           2,475                  -                 -
     Item                2,475              2,475             2,475
     reference           2,475                  -                 -
     Product             2,475          2,227,500         2,227,500
     Subtotal:          14,400          2,232,975         2,232,975
O2   Order                 900                900               900
     request               900                  -                 -
     Customer              900                900               900
     commitment            900                900                 -
     Payment               900        706,500,000               900
     Subtotal:           4,500        706,502,700             2,700
O3   Customer              450                450               450
     request            13,185                  -                 -
     Order              13,185             13,185            13,185
     delivery           13,185             13,185            13,185
     Carrier            13,185            210,960           210,960
     Subtotal:          53,190            237,780           237,780
O4   Customer              300                300               300
     commitment            300              8,790                 -
     Payment               300      6,900,150,000             8,790
     Subtotal:             900      6,900,159,090             9,090
O5   Product               100                100               100
     reference         144,100                  -                 -
     Item              144,100        129,700,000       129,700,000
     composite         144,100                  -                 -
     Order             144,100        129,700,000       129,700,000
     Subtotal:         576,500        259,400,100       259,400,100
O6   Supplier              100                100               100
     furnishing            600             90,000            90,000
     Product               600             90,000            90,000
     catalog               600                  -            90,000
     Category              600          5,400,000         5,400,000
     Subtotal:           2,500          5,580,100         5,670,100
     Total:            651,990      7,874,112,745       267,552,745

Table 4. GAF of concepts of Figure 4

Concept        GAF      Concept         GAF
Order      159,685      composite   146,575
Item       146,575      reference   146,575
Carrier     13,185      request      15,585
Product      3,175      delivery     13,185
Customer     3,150      commitment    1,200
Payment      1,200      catalog         600
Category       600      furnishing      600
Supplier       100

5.2 Result Analysis
On analyzing the last line of Table 3, we verify that the total access frequency is considerably higher for the conventional schema (7,874,112,745 accesses) than for the optimized one (267,552,745 accesses). Basically, this difference is due to extra blocks and reference relationships generated in the conventional schema to represent some conceptual relationships in the NoSQL document schemas. The main reason for the block reduction in the optimized schema is that Rule 1 was applied to convert the union type involving the subclass Payment. This rule was considered because the GAF of the superclasses is lower than the assumed MAF. Besides, for the relationship commitment, the associative entity Sale was represented within the Order block, as the GAF of the Order entity (159,685) is higher than that of the Payment entity (1,200). Because of this, the function convertRelationships chooses Order as the E1 entity and Rule 4 was processed.

In short, the main difference between the produced logical schemas is the representation of the Payment union type in the optimized schema, where it is nested in the Order block. Thus, the optimized schema has fewer collections and reference attributes than the conventional one.

These different representations generate different access frequencies for the operations O2 and O4, as shown in Table 3. In O2, the conventional schema generates 706,500,000 accesses on the block Payment because it is necessary to compare the 900 values of the reference attribute in commitment with all the instances of the Payment block (785,000). This does not occur when performing O2 on the optimized schema because the Payment block is represented in the Order block. In this case, only 900 accesses are necessary to reach the Payment content in O2.

In practice, the impact of these different structures is evaluated by measuring the query processing time on both schemas in the MongoDB NoSQL document DB. The operations were performed on the compliant NoSQL documents, as stated before, and the results are presented in Figure 7. The results are presented in two ways: (i) one execution, and (ii) accumulated execution.

Figure 7. Operation processing time in seconds.

Table 5. Operations processing total time in seconds

Execution      Conventional    Optimized
One                   12.95        12.02
Accumulated         5095.25      4404.20

Figure 7 presents the response time in seconds for the execution of each of the six operations on JSON documents conforming to the conventional and optimized schemas. For each schema, it shows the time spent in a single run (one execution) of a query and the daily system time spent (accumulated execution) to run a query considering its daily frequency. For example, for the conventional schema, a single query O1 shows the response time
of 1.525 seconds. On considering the frequency of 1500 times a day, the daily system time needed to perform this accumulated operation is 2287.5 seconds (1.525 * 1500).

The results show that the optimized schema generated query processing times close to the ones generated for the conventional schema for some operations. However, the cost to perform O2 on the conventional schema is notably higher because it is necessary to retrieve the Payment of an Order block through value joins on a reference relationship. This result demonstrates the positive effect of avoiding value joins. The daily total accumulated execution of the operations performed on NoSQL documents, as shown in Table 5, demonstrates that the optimized schema produces a better response time. It reinforces the relevance of considering workload information.

6. CONCLUSION
Indeed, NoSQL DBs are suitable solutions for data management in the Web as well as in the cloud, and an associated data model allows the definition of better strategies for persistence and manipulation of such data in the target DB. In this context, the aggregate-based logical representation is a tendency in related work, providing support for scalability and consistency, as aggregates are a natural unit for sharding and atomic manipulation of data in distributed environments.

This paper presents an approach for logical design of NoSQL document DB schemas based on a conceptual schema and workload information. NoSQL document DBs are an appropriate category for Web and cloud applications that provide dynamic queries execution, horizontal scalability and high availability. In our proposal, the estimated volume of data and workload information are considered to generate optimized NoSQL document structures in terms of the main application operations and their frequency.

We evaluate our approach through an experimental evaluation for an e-commerce application domain, with the data stored in the MongoDB NoSQL document DB. The results demonstrate that our workload-based conversion process improves query performance on NoSQL documents by reducing the number of DB accesses.

As future work, we intend to evaluate the application of our process over a larger volume of data in a distributed environment, as well as to consider the NoSQL DB physical design, including the definition of indexes.

We also consider the comparison of our approach with a baseline in order to evaluate application performance for the logical schemas generated by each one of them. It depends on the availability of detailed conversion algorithms in related work, which were not found by the time of writing this paper.

7. REFERENCES
[1] Batini, C., Ceri, S. and Navathe, S. B. 1992. Conceptual Database Design: An Entity-Relationship Approach. Benjamin/Cummings, 1992.
[2] Bird, L., Goodchild, A. and Halpin, T. 2000. Object Role Modeling and XML-Schema. In ER 2000, pages 661–705, 2000.
[3] Biskup, J., Menzel, R. and Polle, T. 1995. Transforming an Entity-Relationship Schema into Object-Oriented Database Schemas. In ADBIS 1995, pages 109–136, 1995.
[4] Bugiotti, F., Cabibbo, L., Atzeni, P. and Torlone, R. 2014. Database Design for NoSQL Systems. In ER 2014, pages 223–231, 2014.
[5] Cattell, R. 2010. Scalable SQL and NoSQL Data Stores. SIGMOD Record, volume 39 (4), pages 12–27, 2010.
[6] Choi, M., Lim, J. and Joo, K. 2003. Developing a Unified Design Methodology based on Extended Entity-Relationship Model for XML. In ICCS 2003, pages 920–929, 2003.
[7] Elmasri, R., James, S. and Kouramajian, V. 1993. Automatic Class and Method Generation for Object-Oriented Databases. In DOOD 1993, Springer LNCS 760, pages 395–414, 1993.
[8] Elmasri, R., Wu, Y., Hojabri, B., Li, C. and Fu, J. 2002. Conceptual Modeling for Customized XML Schemas. In ER 2002, pages 429–443, 2002.
[9] Elmasri, R., Navathe, S. B. 2011. Fundamentals of Database Systems. Pearson Addison Wesley, 2011.
[10] Evans, E. 2003. Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003.
[11] Fong, J. 1995. Mapping Extended Entity-Relationship Model to Object Modeling Technique. SIGMOD Record, volume 24 (3), pages 18–22, 1995.
[12] Fong, J., Fong, A., Wong, H. K. and Yu, P. 2006. Translating Relational Schema with Constraints into XML Schema. In International Journal of Software Engineering and Knowledge Engineering, volume 16, pages 201–244, 2006.
[13] Helland, P. 2007. Life beyond distributed transactions: an apostate's opinion. In CIDR 2007, pages 132–141, 2007.
[14] Jovanovic, V., Benson, S. 2013. Aggregate Data Modeling Style. In SAIS 2013, pages 70–75, 2013.
[15] Kaur, K. and Rani, R. 2013. Modeling and Querying Data in NoSQL Databases. IEEE. In International Conference on Big Data, pages 1–7, 2013.
[16] McMurtry, D., Oakley, A., Sharp, J., Subramanian, M., Zhang, H. 2013. Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence. Microsoft, 2013. Available in: <http://www.microsoft.com/en-us/download/details.aspx?id=40327>. Accessed on March of 2014.
[17] Mok, W. Y., E. D. W. and Rani, R. 2006. Generating compact redundancy-free xml documents from conceptual-model hypergraphs. In IEEE Transactions on Knowledge and Data Engineering, volume 18, pages 1082–1096, 2006.
[18] Nachouki, J., Chastang, M.P. and Briand, H. 1991. From Entity-Relationship Diagram to an Object-Oriented Database. In ER 1991, pages 459–482, 1991.
[19] Narasimhan, B., Navathe, S. and Jayaraman, S. 1993. On Mapping ER and Relational Models onto OO Schemas. In ER 1993, pages 402–413, 1993.
[20] Sadalage, P. J. and Fowler, M. J. 2013. NoSQL Distilled. Addison-Wesley, 2013.
[21] Schroeder, R. and Mello, R. S. 2008. Improving Query Performance on XML Documents: A Workload-Driven Design Approach. In DocEng 2008, pages 177–186, 2008.
[22] Schroeder, R., Duarte, D. and Mello, R.S. 2011. A workload-aware approach for optimizing the XML schema design trade-off. In iiWAS 2011, pages 12–19, ACM, New York, NY, 2011.
