A Workload-Driven Logical Design Approach For NoSQL Document Databases
All content following this page was uploaded by Ronaldo Mello on 02 February 2019.
Our input conceptual schema is defined by the Extended Entity-Relationship (EER) model [1], a classical and suitable model for representing data concerning an application domain. Other conceptual models could be considered, like UML. Instead, we adopt EER because it contains the essential constructs for conceptual modeling.

Some definitions regarding workload information and the logical model defined by our approach are presented in the following.

3.1 Workload Information
Workload information corresponds to the data load expected for a NoSQL-based application. This information allows our conversion process to choose an optimized NoSQL document structure to represent a conceptual schema. According to Batini et al. [1], we may concentrate on the 20% of the most frequent operations that will be performed by the application. This assumption is rooted in the so-called 20-80 rule, which says that 20% of the operations produce 80% of the application load. Our workload analysis identifies the concepts frequently accessed by transactions and is based on the workload modeling methodology defined in [1, 21, 22] as follows.

Definition 1. Volume of Data. Given an EER schema Ɛ, the volume of data of Ɛ is defined by V = {N(t), Avg(Ƭ, r)}, where N(t) is the average number of occurrences of a conceptual type t ∈ Ɛ and, given an n-tuple Ƭ = t1, ..., tn (n > 1) of entity types associated through a relationship type r, Avg(Ƭ, r) is the average cardinality among the entities in Ƭ through r.

Figure 2 shows an example of an EER schema augmented with volume of data. The average number of instances N expected for the conceptual types is represented in the type shape and the average cardinality (Avg) is presented on the associations. We omitted Avg parameters for the sake of clarity. Thus, we have N(A) = 250, N(R1) = 500 and so on. For the average cardinality, we have Avg(A,B,R1) = 2, Avg(B,A,R1) = 1 and so on. Cardinality interpretation is given as follows: the average volume of instances of A related to B through R1 is 2 and it appears next

1 A lightweight data-interchange format (json.org).

An operation is an elementary interaction with the application, which includes retrieval or updating operations. Table 2 shows an example of a set of operations estimated as the application load. Operation O1, for example, has an average frequency of 900 times a day. The entity and relationship types C, R2 and B are accessed, in this sequence, by O1. Note that the initial concept C is accessed 900 times by O1. In the navigation sequence, the average number of accessed instances of the concepts R2 and B is obtained by multiplying 900 by 20, considering that Avg(C,B,R2) = 20. Analogously, we have Avg(A,B,R1) = 2 for operation O2.

Table 2. Operations for the schema of Figure 2

Operation   Frequency per day   Concept accessed   Access volume
O1          900                 C                  900
                                R2                 18000
                                B                  18000
O2          300                 A                  300
                                R1                 600
                                B                  600

Given the volume of data and the application load, the definition of the total frequency of access (operation access) on a type in an EER schema is presented.

Definition 3. General Access Frequency (GAF). Given an EER schema Ɛ, O = {o1, ..., on} is the set of operations such that each oi ∈ O is applied over a list of types T ⊆ Ɛ. The GAF of a type t ∈ Ɛ, where n (n ≥ 0) represents the number of operations in which t is accessed, is defined as follows:

GAF(t) = \sum_{i=1}^{n} v_i(t)    (1)

where v_i(t) is the access volume of t in the i-th operation that accesses it (the access volume column of Table 2).

The GAF of the concept B is GAF(B) = 18600 by considering the 18000 instances accessed by O1, and the 600 accessed by O2.

In order to evaluate the GAF measure, we may consider a Minimal Access Frequency (MAF), which is given by the designer as an input of our process. MAF is a value that represents the minimal frequency for accesses involving operations, and values below it are considered insignificant frequencies. We introduce an example as follows. Suppose the designer assumes that the set of considered operations in Table 2 represents 80% of the application load and the MAF should represent 0.9% of the full load. Thus, if the sum of the GAF of all schema concepts is 38400 accesses, the full application load is 38400 / 0.8 = 48000 accesses, and the MAF is 0.9% of this value, i.e., 432 accesses. Given this minimal value, we can evaluate if the GAF of a concept is relevant for the workload. If GAF(B) = 18600, for example, we say that B is a concept frequently accessed by transactions because its GAF is higher than MAF (432).

Definition 6. Root Block. A root block rbk consists of an attribute ak that identifies uniquely rbk in a collection ck, and a non-empty set of attributes Ak and/or blocks Bk.

Definition 7. Block. A block bx is a set of attributes Ax, or a set of inner blocks Bx, that supports disjointness constraints for the inner blocks bx ∈ Bx.

Definition 8. Attribute. An attribute ay of a block by is a tuple (cy, vy), where cy identifies uniquely ay in by, and vy is the value of ay.

Definition 9. Hierarchical Relationship. A hierarchical relationship hrm ∈ sbm, where sbm is a source block, is defined between sbm and an inner (target) block tbm ∈ sbm.

Definition 10. Reference Relationship. A reference relationship rrn is represented by a reference attribute ran ∈ sbn or a set of reference attributes RAn ⊆ sbn, where sbn identifies a source block. The reference attribute ran refers to the identifier attribute ao of a target collection's root block rbo.

Figure 3 presents a NoSQL document logical schema in accordance with our proposed logical model. There are three types of attributes: normal, identifier and reference attribute. A normal attribute models a block property and does not impose restrictions, like participation in the Partner block. An identifier attribute is an attribute that is part of a root block, like ID_code in the Person collection. A reference attribute is an attribute that refers to a block identifier, like contributor_REF, that refers to the Contributor collection.

4. THE CONVERSION PROCESS
Our conversion process is based on conversion rules for mapping EER constructs into equivalent NoSQL document constructs in the logical model described in Section 3.2. Algorithm 1 presents the overall process, which comprises two main steps: conversion of generalization types and conversion of relationship types.

Algorithm 1 EER-NoSQL
Input: An EER Schema Ɛ with load data information;
       The Minimal Access Frequency MAF of Ɛ
Output: A NoSQL document logical schema NF

H ← convertHierarchies(Ɛ, MAF);
R ← convertRelationships(Ɛ, MAF, H);
NF ← listOfCollections(R);
defineRootBlockIDs(NF).

Generalization types are converted first, followed by the conversion of the relationships. The blocks generated by the function convertHierarchies are maintained by the function convertRelationships. After relationship types conversion, the remaining root blocks are finally defined as the schema's collections. At the end of the process, a list of collections is returned, after the definition of identifier attributes for collections' root blocks when necessary. Load information is considered during the conversion
of the relationship types in order to generate well-structured NoSQL document logical schemas.

Next sections detail the rules for converting generalization hierarchies and relationship types as well as their respective functions, as presented in Algorithm 1.

4.1 Hierarchy Types Conversion
A generalization hierarchy in the EER model defines a subset relationship between a generic entity, namely superclass, and one or more specialized entities, namely subclasses. The disjointness and completeness constraints that are set to the subclasses establish four possible constraints on generalization types: total and disjoint (t, d); partial and disjoint (p, d); total and overlapping (t, o); and partial and overlapping (p, o) [1].

Categories or union types of the EER model can be considered restricted cases of multiple inheritance [1]. Thus, their conversion strategies are similar to the strategies for processing generalization types. For the sake of space, we omit these strategies. In this section, we define alternative rules to convert a generalization hierarchy from an EER schema to a NoSQL document logical schema. We also detail the function convertHierarchies of Algorithm 1, which selects the suitable rule to be applied on each occurrence of an EER generalization type.

4.1.1. Conversion Rules
Three alternatives are provided to convert generalization types, inspired by the relational logical design methodology [1]. The difference among these alternatives is given by the different size of the NoSQL document schema that each one generates, and the constraints on generalization types they are able to support.

The conversion strategy defined by Rule 1 generates only one block from a generalization hierarchy. The block represents the superclass and its attributes, as well as the attributes of its subclasses. Subclasses' attributes are defined as optional in the content model of the superclass block. On applying this rule, we assume that the subclasses' attributes will act as discriminating attributes to identify an instance of a subclass in the NoSQL documents. The subclasses previously converted (marked) become an optional inner block of the block generated by this rule.

Rule 1. Generalization Focused on Superclass. The conversion of a generalization type G proceeds as follows:
1. given an entity Esp defined as the G superclass, generate a block bsp. The attributes of Esp become attributes of bsp;
2. given the set of Esp subclasses {Esb1, Esb2, ..., Esbn}, for each Esbi ∈ {Esb1, Esb2, ..., Esbn} do:
   if (Esbi was not converted) then define the attributes of Esbi as optional attributes in bsp
   else given bsbi the block that represents Esbi, generate a hierarchical relationship from bsp to bsbi where the occurrence of bsbi in bsp is defined as [0..1].

The main restriction to the application of Rule 1 occurs when one of the subclasses is defined as a referenced block. A referenced block is an entity that was previously processed and defined as referenced by another block. This restriction guarantees that the referenced block will be a root block, preventing this root block from being further converted to an inner block of another block.

The alternative defined by Rule 2 generates only NoSQL document blocks for the subclasses, and the superclass attributes are reproduced into each subclass block.

In the alternative defined by Rule 3, the superclass and subclasses are explicitly represented by blocks. Hierarchical relationships are established among the superclass and subclasses blocks to represent the relationship. The generalization constraints are represented by the minimum and maximum occurrences of the subclasses' blocks in the superclass block. In cases where a subclass of a generalization type has already been converted, the relationship with the superclass block is established by a reference relationship between the block previously created to represent the converted subclass and the superclass block. In this case, the superclass block is defined as a referenced block.

Rule 2. Generalization Focused on Subclasses. The conversion of a generalization type G proceeds as follows:
given an entity Esp defined as the superclass of a generalization type and {Esb1, Esb2, ..., Esbn} the set of subclasses of Esp, for each Esbi ∈ {Esb1, Esb2, ..., Esbn} do: generate a block bsbi and define the attributes of Esbi and Esp as attributes of bsbi.

Rule 3. Generalization Focused on Hierarchy. The conversion of a generalization type G proceeds as follows:
1. given an entity Esp defined as the G superclass, generate a block bsp and if (G is a disjoint generalization) then generate a disjointness constraint. The attributes of Esp are defined as attributes of bsp;
2. given the set of Esp subclasses {Esb1, Esb2, ..., Esbn}, for each Esbi ∈ {Esb1, Esb2, ..., Esbn} do:
   if (Esbi was not converted) then generate a block bsbi and a hierarchical relationship from bsp to bsbi where the occurrence of bsbi in bsp is defined as ([0-1],[1]), depending on the completeness constraint of G (total or partial)
   else given bsbi the block that represents Esbi, generate a reference attribute rasbi in bsbi which refers to the bsp identifier, and define bsp as a referenced block.

Function 1 convertHierarchies
Input: An EER Schema with load data information Ɛ;
       The Minimal Access Frequency of Ɛ (MAF)
Output: A set of blocks H' of a NoSQL logical schema

H ← the list of generalization types of Ɛ;
H' ← sort H so that the generalization types at the bottom of the hierarchy with superclasses that have highest GAF appear first;
for each hi ∈ H' (1 ≤ i ≤ n) with superclass Esp and subclasses {Esb1, ..., Esbn} do
   if ( converted subclasses in hi) AND (all subclasses in hi have GAF < MAF) AND (∄ subclasses with more than one superclass) AND (∄ subclasses defined as referenced block) then
      Apply Rule 1 and mark as converted all the subclasses of hi
   else if (GAF(Esp) < MAF) AND (∄ subclasses with more than one superclass) then
      Apply Rule 2 and mark Esp as converted
   else
      Apply Rule 3 and mark as converted all the subclasses of hi
   end if
end for
return H'
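The workload figures of Section 3.1 and the order of the tests in Function 1 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the names gaf, maf, Generalization and choose_rule are hypothetical, the two Boolean flags stand in for the multiple-inheritance and referenced-block preconditions, the test on previously converted subclasses is not modeled, and the GAF values for Super, SubA and SubB are invented. Only the Table 2 volumes and the resulting MAF of 432 come from the text.

```python
from dataclasses import dataclass
from fractions import Fraction

# Access volumes from Table 2: operation -> {concept: accesses per day}.
TABLE_2 = {
    "O1": {"C": 900, "R2": 18000, "B": 18000},
    "O2": {"A": 300, "R1": 600, "B": 600},
}


def gaf(concept, load=TABLE_2):
    """Sum the access volume of `concept` over the operations that access it."""
    return sum(volumes.get(concept, 0) for volumes in load.values())


def maf(load=TABLE_2, covered=Fraction(80, 100), threshold=Fraction(9, 1000)):
    """MAF = `threshold` of the full load, where `load` covers `covered` of it."""
    observed = sum(sum(volumes.values()) for volumes in load.values())
    return threshold * observed / covered


@dataclass
class Generalization:
    superclass: str
    subclasses: list
    # Hypothetical flags for the structural preconditions of Function 1.
    has_multi_superclass_subclass: bool = False
    has_referenced_subclass: bool = False


def choose_rule(g, gaf_of, maf_value):
    """Return 1, 2 or 3, testing the rules in the order used by Function 1."""
    subclasses_rare = all(gaf_of[s] < maf_value for s in g.subclasses)
    if (subclasses_rare
            and not g.has_multi_superclass_subclass
            and not g.has_referenced_subclass):
        return 1  # one block for the whole hierarchy
    if gaf_of[g.superclass] < maf_value and not g.has_multi_superclass_subclass:
        return 2  # blocks for the subclasses only
    return 3      # explicit blocks for superclass and subclasses


threshold = maf()                       # 432 for the Table 2 load
g = Generalization("Super", ["SubA", "SubB"])
invented_gaf = {"Super": 5000, "SubA": 100, "SubB": 50}
print(gaf("B"), threshold, choose_rule(g, invented_gaf, threshold))
```

Exact fractions are used for the percentages so that the MAF comes out as exactly 432 rather than a rounded floating-point value.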
4.1.2. Conversion Function
The function convertHierarchies (Function 1) is responsible for choosing the appropriate rule for converting each generalization type of a conceptual schema. A generalization type is converted by analyzing the load data and the constraints of the generalization hierarchy. The function establishes a conversion order in which the entities involved in a generalization hierarchy must be converted. A bottom-up conversion is performed when there is a multiple-level hierarchy, i.e., the entities are converted from the bottom to the top of the hierarchy. Besides, when there is a multiple-inheritance case, the superclass with the highest General Access Frequency (GAF) has the highest priority. It means that the superclass that is most frequently accessed becomes the parent block of the block that represents the subclass with more than one superclass. In this case, the remaining superclasses are referenced by reference attributes as defined in Rule 3. In fact, generalization types involved in multiple-inheritance cases are always converted by Rule 3.

Once the conversion order of the generalization types is established, we apply the conversion rules for generalization types (Rule 1, 2 or 3) and verify the preconditions of each one, so that the rules that generate the smallest NoSQL document logical fragment are verified first, as illustrated in Function 1. For Rule 1 and Rule 2, we verify if the GAF of the entities that will be omitted is lower than the Minimal Access Frequency (MAF). If the GAF is higher than MAF, it means that these entities participate in frequent operations and the distinction between superclass and subclasses must be preserved. The last option to convert generalization types is Rule 3.

Finally, Rule 6 generates independent blocks for each entity of a relationship type and reference relationships are established among the generated blocks.

Rule 5. Relationship Modeled as a Hierarchy. Given a 1:N relationship type R which relates the entities E1 and E2, the conversion of R proceeds as follows:
1. generate a block bE1 for representing E1 and define the attributes of E1 as attributes in bE1;
2. generate a block bE2 for representing E2 as a nested block of bE1. The occurrence of bE2 in bE1 depends on the participation of E1 in R (optional or mandatory);
3. define the attributes of E2 and R as attributes in bE2.

Rule 6. Relationship Modeled as References. Given a relationship type R and the set of entities {E1, E2, ..., En} related by R, the conversion of R proceeds as follows:
1. for each Ei ∈ R (1 ≤ i ≤ n) do: generate a block bEi and define the attributes of Ei as attributes in bEi;
2. if (R is a binary relationship without attributes) AND (the participation of E1 in R is defined as ([0-1],1)) then generate a reference attribute in bE1 referring to the identifier of bE2, and define bE2 as a referenced block
   else
   3. generate a block bR as a nested block of bE1 and define the attributes of R as attributes in bR;
   4. for each Ei ∈ R (1 ≤ i ≤ n) do: generate a reference attribute in bR referring to the identifier of bEi, and define bEi as a referenced block.
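To make the difference between Rule 5 and Rule 6 concrete, the sketch below builds the document shape each rule would produce for a single E1 instance, using plain Python dictionaries as stand-ins for blocks. It is only an instance-level illustration under assumptions: the function names, the keys "E2" and "R", and the sample attribute values are hypothetical, and the _REF suffix simply follows the contributor_REF naming of Figure 3.

```python
def rule5(e1_attrs, related):
    """Rule 5: nest each related E2 instance, merged with the attributes of R,
    as an inner block of the E1 block."""
    block = dict(e1_attrs)
    block["E2"] = [{**e2_attrs, **r_attrs} for e2_attrs, r_attrs in related]
    return block


def rule6_reference(e1_attrs, e2_id):
    """Rule 6, step 2 (binary relationship without attributes): the E1 block
    only stores a reference attribute pointing at the identifier of bE2."""
    block = dict(e1_attrs)
    block["E2_REF"] = e2_id
    return block


def rule6_block(e1_attrs, r_attrs, entity_ids):
    """Rule 6, steps 3 and 4: a nested block bR inside bE1 holds the attributes
    of R plus one reference attribute per related entity."""
    block = dict(e1_attrs)
    refs = {f"{name}_REF": ident for name, ident in entity_ids.items()}
    block["R"] = {**r_attrs, **refs}
    return block


# Invented sample data: one E1 instance related to two E2 instances.
e1 = {"ID_code": 1, "name": "Ana"}
related = [({"number": 10}, {"date": "2015-01-05"}),
           ({"number": 11}, {"date": "2015-02-17"})]

print(rule5(e1, related))
print(rule6_reference(e1, 10))
print(rule6_block(e1, {"date": "2015-01-05"}, {"E1": 1, "E2": 10}))
```

With Rule 5 the E2 content is read together with E1 in a single document, whereas both Rule 6 variants require a lookup in another collection to resolve each reference attribute.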
Figure 5. The conventional NoSQL document logical schema.
Figure 6. The optimized NoSQL document logical schema.

union type involving the subclass Payment. This rule was considered because the GAF of the superclasses is lower than the assumed MAF. Besides, for the relationship commitment, the associative entity Sale was represented in the Order block content, as the GAF of the Order entity (159,685) is higher than that of the Payment entity (1,200). Because of this, the function convertRelationships chose Order as the E1 entity and Rule 4 was applied.

In short, the main difference between the produced logical schemas is the representation of the Payment union type in the optimized schema, which was nested in the Order block. Thus, the optimized schema has fewer collections and reference attributes than the conventional one.

These different representations generate different access frequencies for the operations O2 and O4, as shown in Table 3. In O2, the conventional schema generates 706,500,000 accesses on the block Payment because it is necessary to compare the 900 values of the reference attribute in commitment with all the instances of the Payment block (785,000). This does not occur when performing O2 on the optimized schema because the Payment block is represented in the Order block. In this case, only 900 accesses are necessary to reach the Payment content in O2.

In practice, the impact of these different structures is evaluated by measuring the query processing time on both schemas in the MongoDB NoSQL document database. The operations were performed on the compliant NoSQL documents, as stated before, and the results are presented in Figure 7. The results are presented in two ways: (i) one execution, and (ii) accumulated execution.

Table 3. Operations on schemas

                        Access frequency
#    Concept       Conceptual   Conventional      Optimized
                   schema       logical schema    logical schema
O1   Order              1,500            1,500            1,500
     request            1,500                -                -
     Customer           1,500            1,500            1,500
     composite          2,475                -                -
     Item               2,475            2,475            2,475
     reference          2,475                -                -
     Product            2,475        2,227,500        2,227,500
     Subtotal:         14,400        2,232,975        2,232,975
O2   Order                900              900              900
     request              900                -                -
     Customer             900              900              900
     commitment           900              900                -
     Payment              900      706,500,000              900
     Subtotal:          4,500      706,502,700            2,700
O3   Customer             450              450              450
     request           13,185                -                -
     Order             13,185           13,185           13,185
     delivery          13,185           13,185           13,185
     Carrier           13,185          210,960          210,960
     Subtotal:         53,190          237,780          237,780
O4   Customer             300              300              300
     commitment           300            8,790                -
     Payment              300    6,900,150,000            8,790
     Subtotal:            900    6,900,159,090            9,090
O5   Product              100              100              100
     reference        144,100                -                -
     Item             144,100      129,700,000      129,700,000
     composite        144,100                -                -
     Order            144,100      129,700,000      129,700,000
     Subtotal:        576,500      259,400,100      259,400,100
O6   Supplier             100              100              100
     furnishing           600           90,000           90,000
     Product              600           90,000           90,000
     catalog              600                -           90,000
     Category             600        5,400,000        5,400,000
     Subtotal:          2,500        5,580,100        5,670,100
     Total:           651,990    7,874,112,745      267,552,745
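The arithmetic behind the largest entries of Table 3 can be checked with a few lines of Python. This is a sketch rather than part of the evaluation: the helper name lookup_accesses is hypothetical, and the use of the same product for O4 is an inference from the table (8,790 commitment values against 785,000 Payment instances), since the text spells out the computation only for O2.

```python
PAYMENT_INSTANCES = 785_000  # number of Payment instances stated in the text


def lookup_accesses(reference_values, target_instances):
    """Accesses needed when every reference value is compared with every
    instance of the referenced collection."""
    return reference_values * target_instances


# Subtotals per operation, copied from Table 3:
# (conceptual schema, conventional logical schema, optimized logical schema).
SUBTOTALS = {
    "O1": (14_400, 2_232_975, 2_232_975),
    "O2": (4_500, 706_502_700, 2_700),
    "O3": (53_190, 237_780, 237_780),
    "O4": (900, 6_900_159_090, 9_090),
    "O5": (576_500, 259_400_100, 259_400_100),
    "O6": (2_500, 5_580_100, 5_670_100),
}

# Column-wise sums reproduce the "Total" row of Table 3.
totals = tuple(sum(column) for column in zip(*SUBTOTALS.values()))

print(lookup_accesses(900, PAYMENT_INSTANCES))    # Payment accesses in O2
print(lookup_accesses(8_790, PAYMENT_INSTANCES))  # Payment accesses in O4
print(totals)
```

Nearly all of the gap between the two logical schemas comes from these two Payment lookups; in O6 the optimized schema is slightly more expensive than the conventional one.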