05-DocumentStores (1)
2. Explain the main resemblances and differences between XML and JSON documents
3. Explain the design principle of documents
4. Name 3 consequences of the design principle of a document store
5. Explain the difference between relational foreign keys and document references
6. Exemplify 6 alternatives in deciding the structure of a document
7. Explain the difference between JSON and BSON
8. Name the main functional components of the MongoDB architecture
12. Explain what shards and chunks are in MongoDB
13. Explain the two horizontal fragmentation mechanisms in MongoDB
14. Explain how the catalog works in MongoDB
15. Identify the characteristics of the replica synchronization management in MongoDB
16. Explain how primary copy failure is managed in MongoDB
17. Name the three query mechanisms of MongoDB
18. Explain the query optimization mechanism of MongoDB
Understanding objectives
1. Given two alternative structures of a document, explain the performance impact of the choice in a given setting
2. Simulate splitting and migration of chunks in MongoDB
3. Configure the number of replicas needed for confirmation on both reading and writing in a given scenario
Application objectives
1. Perform some queries on MongoDB through the shell and aggregation framework
2. Compare the access costs given different document designs
3. Compare the access costs with different indexing strategies (i.e., hash and range based)
4. Compare the access costs with different sharding distributions (i.e., balanced and unbalanced)
Semi-structured database model
XML and JSON
Semi-structured data
• Document stores are essentially key-value stores
  • The value is a document
  • Allow secondary indexes
• Different implementations
  • eXtensible Markup Language (XML)
  • JavaScript Object Notation (JSON)
• Tightly related to the web
  • Easily readable by humans and machines
  • Data exchange formats for REST APIs
XML Documents
• Tree data structure
  • Document: the root node of the XML document
  • Element: nodes that correspond to the tagged nodes in the document
  • Attribute: nodes attached to Element nodes
  • Text: text nodes, i.e., untagged leaves of the XML tree
• XML-oriented database storage
  • eXist-db
  • MarkLogic
  • Relational extensions for Oracle, PostgreSQL, etc.
XML Document Example
S. Abiteboul et al.
JSON Documents
• Lightweight data interchange format
• Can contain unbounded nesting of arrays and objects
  • Brackets ([]) represent ordered lists
  • Curly braces ({}) represent key-value dictionaries
    • Keys must be strings, delimited by quotes (")
    • Values can be strings, numbers, booleans, null, lists, or key-value dictionaries
• Natively compatible with JavaScript
  • Web browsers are natural clients
• JSON-like storage
  • MongoDB
  • CouchDB
  • Relational extensions for Oracle, PostgreSQL, etc.
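The nesting rules above can be tried out with Python's standard json module. This is only a small sketch: the sample document and its field names are invented for illustration.

```python
import json

# Hypothetical document: field names and values are illustrative only.
text = '''
{
  "_id": 123,
  "name": "Alice",
  "active": true,
  "tags": ["admin", "dev"],
  "address": {"city": "Barcelona", "geo": {"lat": 41.39, "lon": 2.17}}
}
'''

doc = json.loads(text)

# Curly braces become dictionaries, brackets become ordered lists.
print(type(doc["address"]).__name__)
print(type(doc["tags"]).__name__)

# Nested objects are navigated key by key.
print(doc["address"]["geo"]["lat"])

# Keys are always strings in JSON: a non-string key is coerced when serializing.
coerced = json.dumps({1: "x"})
print(coerced)
```

Parsing the text gives plain dictionaries and lists, which is why the format maps so directly onto the data structures of most programming languages.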
JSON Example (I)
JSON Example (II)
source: MongoDB
JSON Example (III)
source: MongoDB
Data structure alternatives
Designing Document Stores
Do not think relational-wise
• Break 1NF to avoid joins
• Get all data needed with one single fetch
• Use indexes to identify finer data granularities
Consequences:
• Massive denormalization
• Independent documents
• Avoid pointers (i.e., we may have references but not FKs)
• Massive rearrangement of documents on changing the application layout (e.g., queries)
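The "one single fetch" principle can be illustrated with a toy comparison between a referenced and an embedded (denormalized) design. This is a sketch with in-memory dictionaries standing in for collections; the data, the field names and the fetch counter are all assumptions made for the example.

```python
# Hypothetical data: an order that either references its customer or embeds it.
customers = {7: {"_id": 7, "name": "Alice", "city": "Barcelona"}}

orders_referenced = {
    1: {"_id": 1, "customer_id": 7, "amount": 30},   # a reference, not an enforced FK
}
orders_embedded = {
    1: {"_id": 1, "customer": {"name": "Alice", "city": "Barcelona"}, "amount": 30},
}

def fetch(collection, key, counter):
    """Simulate one key-based fetch and count it."""
    counter["fetches"] += 1
    return collection[key]

def customer_city_referenced(order_id, counter):
    order = fetch(orders_referenced, order_id, counter)
    customer = fetch(customers, order["customer_id"], counter)  # second round trip
    return customer["city"]

def customer_city_embedded(order_id, counter):
    order = fetch(orders_embedded, order_id, counter)
    return order["customer"]["city"]

ref_cost, emb_cost = {"fetches": 0}, {"fetches": 0}
print(customer_city_referenced(1, ref_cost), ref_cost["fetches"])
print(customer_city_embedded(1, emb_cost), emb_cost["fetches"])
```

The embedded design answers the question with one fetch, at the price of duplicating customer data in every order, which is the denormalization trade-off listed above.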
Metadata representation
JSON:
{
  _id: 123,
  A1: "x",
  …
  An: "x"
}
Tuple:
  _id   A1    …   An
  123   "x"   …   "x"
Attribute optionality
JSON, attributes set to a value:
{
  _id: 123,
  A1: 666,
  …
  An: 666
}
JSON, attributes set to null:
{
  _id: 123,
  A1: null,
  …
  An: null
}
JSON, attributes absent:
{
  _id: 123
}
T-666:
  _id   A1    …   An
  123   666   …   666
T-NULL:
  _id   A1    …   An
  123   null  …   null
Structure and Data Types
JSON:
{
  _id: 123,
  A1: k,
  …
  An: k
}
JSON type:
{
  "type": "object",
  "properties": {
    "A1": { "type": "number" },
    …
    "An": { "type": "number" }
  },
  "required": ["A1", …, "An"]
}
Tuple:
  _id   A1   …   An
  123   k    …   k
Tuple type:
CREATE TABLE T (
  _id INTEGER,
  A1 INTEGER,
  …
  An INTEGER
);
Integrity Constraints
JSON-IC
Document:
{
  _id: 123,
  A1: k,
  …
  An: k
}
Schema:
{
  "type": "object",
  "properties": {
    "A1": { "type": "number", "minimum": -k', "maximum": k' },
    …
    "An": { "type": "number", "minimum": -k', "maximum": k' }
  }
}
Tuple-IC
  _id   A1   …   An
  123   k    …   k
ALTER TABLE T ADD CONSTRAINT val_A1 CHECK (A1 BETWEEN -k' AND k');
…
ALTER TABLE T ADD CONSTRAINT val_An CHECK (An BETWEEN -k' AND k');
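To make the effect of such a schema concrete, the following sketch checks documents against the constraints by hand. It is not MongoDB's validator nor a JSON Schema library: the validate function is hypothetical and covers only "required", "type": "number", "minimum" and "maximum". The constant K stands in for k' with an arbitrary value.

```python
def validate(doc, schema):
    """Return the violations of a tiny JSON-Schema-like subset:
    'required' plus per-property 'type': 'number', 'minimum', 'maximum'."""
    errors = []
    for name in schema.get("required", []):
        if name not in doc:
            errors.append(f"{name}: missing")
    for name, rules in schema.get("properties", {}).items():
        if name not in doc:
            continue
        value = doc[name]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number:
            if rules.get("type") == "number":
                errors.append(f"{name}: not a number")
            continue  # range checks only make sense for numbers
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: below minimum")
        if "maximum" in rules and value > rules["maximum"]:
            errors.append(f"{name}: above maximum")
    return errors

K = 100  # plays the role of k'; the value is arbitrary
schema = {
    "type": "object",
    "properties": {
        "A1": {"type": "number", "minimum": -K, "maximum": K},
        "A2": {"type": "number", "minimum": -K, "maximum": K},
    },
    "required": ["A1", "A2"],
}

valid = validate({"_id": 123, "A1": 5, "A2": -7}, schema)
invalid = validate({"_id": 123, "A1": 500}, schema)
print(valid)
print(invalid)
```

The relational CHECK constraints reject a violating row at write time in the same spirit; the difference is that in a document store the schema is optional and attached to the collection rather than required by the model.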
Structure complexity
JSON variants: JSON-Attrib, JSON-Array, JSON-Nest
Tuple-Attrib:
  _id   A1   …   An
  123   k    …   k
Tuple-Array:
  _id   A
  123   [1,…,n]
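The three JSON variants can be generated programmatically to compare them. This is a sketch under assumptions: the JSON-Attrib and JSON-Array shapes mirror the tuple layouts above, while the JSON-Nest shape (each attribute wrapping the next one) is a guess, since the original figure is not available. The builder function names are hypothetical.

```python
import json

def json_attrib(n, k=1):
    """One top-level attribute per value: {_id, A1: k, ..., An: k}."""
    doc = {"_id": 123}
    doc.update({f"A{i}": k for i in range(1, n + 1)})
    return doc

def json_array(n):
    """All values in a single array attribute: {_id, A: [1, ..., n]}."""
    return {"_id": 123, "A": list(range(1, n + 1))}

def json_nest(n, k=1):
    """Assumed shape: each attribute wraps the next one, n levels deep."""
    inner = k
    for i in range(n, 0, -1):
        inner = {f"A{i}": inner}
    return {"_id": 123, **inner}

for build in (json_attrib, json_array, json_nest):
    serialized = json.dumps(build(3))
    print(build.__name__, serialized, len(serialized))
```

Comparing the serialized sizes for growing n gives a first intuition of how much of each document is attribute names rather than data.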
MongoDB architecture
Abstraction
• Documents
  • Definition: JSON documents (serialized as BSON)
  • Basic atom
  • Identified by "_id" (user or system generated)
  • May contain
    • References (not FKs!)
    • Embedded documents
• Collections
  • Definition: A grouping of MongoDB documents
  • A collection exists within a single database
  • Collections do not enforce a schema
• MongoDB Namespace: database.collection
JSON vs. BSON (Binary JSON)
A. Hogan
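The difference between the two encodings can be seen by serializing a small document by hand. The sketch below follows the BSON layout (a little-endian int32 total length, typed elements, a terminating zero byte) but supports only two types, int32 and string; the function names are hypothetical and a real driver should be used in practice.

```python
import struct

def encode_element(name, value):
    """Encode one field for two BSON types only: int32 and UTF-8 string."""
    key = name.encode("utf-8") + b"\x00"            # field name as a C string
    if isinstance(value, int) and not isinstance(value, bool):
        # 0x10 = int32; the value must fit in 32 bits in this sketch
        return b"\x10" + key + struct.pack("<i", value)
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # 0x02 = string, prefixed by its byte length (terminator included)
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")

def encode_document(doc):
    """Document = int32 total length + elements + terminating 0x00 byte."""
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body

encoded = encode_document({"hello": "world"})
print(encoded.hex(" "))
print(len(encoded), "bytes in BSON vs", len('{"hello":"world"}'), "characters in JSON")
```

BSON is not necessarily smaller than the JSON text; the length prefixes and type tags are there so that a reader can skip over fields without parsing them.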
Shell commands
• show dbs
• show collections
• show users
• use <database>
• coll = db.<collection>
• find([<criteria>], [<projection>])
• insert(<document>)
• update(<query>, <update>, <options [e.g., upsert]>)
• remove(<query>, [justOne])
• drop()
• createIndex(<keys>, <options>)
• Notes:
• db refers to the current database
• query is a document (query-by-example)
http://docs.mongodb.org/manual/reference/mongo-shell
MongoDB syntax
db.[collection-name].[method]([query],[options])
• db: global variable
• [query]: query-by-example
• Arguments depend on the method: document, array of documents, etc.
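Query-by-example means that the query is itself a document describing the documents wanted. The following sketch mimics that idea over a plain Python list. It is not the MongoDB matching engine: it handles only top-level fields and a handful of comparison operators, and it ignores arrays, nested paths and MongoDB's rules for missing fields. The matches and find helpers are hypothetical.

```python
import operator

# Subset of MongoDB comparison operators handled by this sketch.
OPERATORS = {
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$eq": operator.eq, "$ne": operator.ne,
}

def matches(doc, query):
    """True if doc satisfies every condition of the example document `query`."""
    for field, condition in query.items():
        if field not in doc:
            return False
        value = doc[field]
        if isinstance(condition, dict):
            # e.g. {"$gt": 25, "$lt": 40}: all operators must hold
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            return False
    return True

def find(collection, query=None):
    """Mimics db.<collection>.find(<criteria>) over a list of dicts."""
    return [doc for doc in collection if matches(doc, query or {})]

users = [
    {"_id": 1, "name": "Alice", "age": 30},
    {"_id": 2, "name": "Bob", "age": 22},
    {"_id": 3, "name": "Carol", "age": 41},
]
print(find(users, {"age": {"$gt": 25}}))
print(find(users, {"name": "Bob"}))
print(find(users, {"age": {"$gt": 25, "$lt": 40}}))
```

An empty query document matches everything, which is why find() without arguments returns the whole collection.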
MongoDB functional components
[Figure: class diagram of the MongoDB functional components. Legend: association, aggregation, specialization; concepts not in MongoDB vs. MongoDB concepts. Elements shown: Server, Manager, Store, ReplicaSet, Chunk (ch1, ch2), Collection, Document. Annotations: "distributes data across the shards in the cluster"; "Nodes containing data"; "Nodes containing a copy of the directory to re-direct queries".]
https://docs.mongodb.com/manual/core/sharded-cluster-components
Data Design
Challenge I
Sharding (horizontal fragmentation)
• Shard key
  • Must be indexed (sh.shardCollection(namespace, key))
  • If not existing in a document, treated as null
• Chunk (64MB)
  • Horizontal fragment according to the shard key
    • Range-based: Range of values determines the chunks
      • Adequate for range queries
    • Hash-based: Hash function determines the chunks
      • Consistent hashing
https://docs.mongodb.com/manual/core/ranged-sharding/#sharding-ranged
https://docs.mongodb.com/manual/core/hashed-sharding/#sharding-hashed
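Why range-based fragmentation suits range queries can be seen by counting the chunks a query has to touch under each scheme. This is a simplified sketch: the split points, the number of chunks and both helper functions are assumptions, and the hash variant uses a plain modulo rather than consistent hashing.

```python
import bisect
import hashlib

def range_chunk(key, boundaries):
    """Chunk index for a key given sorted split points (range-based)."""
    return bisect.bisect_right(boundaries, key)

def hash_chunk(key, n_chunks):
    """Chunk index from a hash of the key (simple modulo, not consistent hashing)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_chunks

boundaries = [100, 200, 300]   # 4 chunks: below 100, [100,200), [200,300), 300 and above
query = range(120, 131)        # range query: 120 <= key <= 130

range_touched = {range_chunk(k, boundaries) for k in query}
hash_touched = {hash_chunk(k, 4) for k in query}
print("range-based chunks touched:", sorted(range_touched))
print("hash-based chunks touched: ", sorted(hash_touched))
```

With range-based chunks the whole query falls into one chunk, whereas hashing typically scatters consecutive keys over several chunks; in exchange, hashing spreads monotonically increasing keys evenly across shards.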
Splitting and migrating chunks
• Inserts and updates above a threshold trigger splits
  • Not in single-key chunks (same value in the shard keys)
• Uneven distributions in the number of chunks per shard trigger migrations
  1. A new chunk is created in an underused shard
  2. Per-document requests are sent to the origin shard
  3. Origin keeps working as usual
    • Changes made during the migration are applied a posteriori in the destination shard
  4. Changes are annotated in the config servers, which enables the new chunk
  5. Chunk at origin is dropped
  6. Client cache in query routers is inconsistent
    • Eventually synchronized
https://docs.mongodb.com/manual/core/sharding-balancer-administration/#sharding-balancing
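Splitting and balancing can be simulated with a few lines of code. The sketch below is deliberately simplified: the threshold counts documents instead of bytes, the split point is the median distinct key, and the balancer just equalizes the number of chunks per shard. It does not model the six migration steps, the config servers or the router caches, and none of the thresholds are MongoDB's actual values.

```python
MAX_DOCS_PER_CHUNK = 4   # stand-in for the chunk size threshold (assumed value)

def insert(chunks, key):
    """Insert a key into the chunk covering it; split the chunk if it overflows."""
    for i, chunk in enumerate(chunks):
        if chunk["lo"] <= key < chunk["hi"]:
            chunk["keys"].append(key)
            chunk["keys"].sort()
            distinct = sorted(set(chunk["keys"]))
            # A chunk holding a single key value cannot be split.
            if len(chunk["keys"]) > MAX_DOCS_PER_CHUNK and len(distinct) > 1:
                mid = distinct[len(distinct) // 2]
                left = {"lo": chunk["lo"], "hi": mid, "shard": chunk["shard"],
                        "keys": [k for k in chunk["keys"] if k < mid]}
                right = {"lo": mid, "hi": chunk["hi"], "shard": chunk["shard"],
                         "keys": [k for k in chunk["keys"] if k >= mid]}
                chunks[i:i + 1] = [left, right]
            return
    raise ValueError("key outside every chunk range")

def balance(chunks, shards):
    """Move chunks from the most to the least loaded shard until counts differ by at most 1."""
    while True:
        load = {s: sum(c["shard"] == s for c in chunks) for s in shards}
        src, dst = max(load, key=load.get), min(load, key=load.get)
        if load[src] - load[dst] <= 1:
            return
        next(c for c in chunks if c["shard"] == src)["shard"] = dst   # migrate one chunk

chunks = [{"lo": 0, "hi": 1000, "shard": "s1", "keys": []}]
for key in [10, 20, 30, 40, 50, 60, 70, 80, 90, 95]:
    insert(chunks, key)
balance(chunks, ["s1", "s2"])
for c in chunks:
    print(c["shard"], f"[{c['lo']}, {c['hi']})", c["keys"])
```

Inserting ten increasing keys produces three splits on the same shard, after which the balancer moves two of the four chunks to the empty shard. Note that increasing keys always land in the last chunk, which is the hot-spot problem of range-based shard keys.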
Catalog Management
Challenge II
Catalog structure
• Content
  • List of chunks in every shard
• Implemented in a replica set (as any other data)
• Client cache in the query routers
  • Lazy/Primary-copy replication maintenance
https://docs.mongodb.com/manual/core/sharded-cluster-config-servers
Transaction Management
Challenge III
Replica sets
• A replica set is a set of mongod instances (3 in a typical deployment)
• Primary copy with lazy replication
  • One primary copy
    • Writes (inserts, updates, deletes)
    • Reads
  • Secondary copies
    • Reads
source: MongoDB
Read preference
• By default, applications try to read from the primary replica
• An application can also specify a read preference
  • primary
  • primaryPreferred
  • secondary
  • secondaryPreferred
  • nearest
    • Least network latency
source: MongoDB
Required reads and writes
• ReadConcern
  • Specifies how many copies need to be read before confirmation
  • The copies read should coincide
• WriteConcern
  • Specifies how many copies need to be written before confirmation
  • Might be zero
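The interplay between the write concern and lazy replication can be sketched with a toy replica set. This is an illustration only: the ReplicaSet class is hypothetical, the parameter w plays the role of the number of copies confirmed before acknowledging, and only the write side is modelled.

```python
class ReplicaSet:
    """Toy primary-copy replica set: copies[0] is the primary."""

    def __init__(self, n_copies=3):
        self.copies = [dict() for _ in range(n_copies)]
        self.pending = []                      # writes not yet applied everywhere

    def write(self, key, value, w=1):
        """Confirm after `w` copies hold the write; w=0 confirms before any does."""
        for copy in self.copies[:w]:
            copy[key] = value
        self.pending.append((key, value))      # the rest is propagated lazily
        return "acknowledged"

    def replicate(self):
        """Lazy propagation of pending writes to every copy."""
        for key, value in self.pending:
            for copy in self.copies:
                copy[key] = value
        self.pending.clear()

    def read(self, key, copy_index=0):
        return self.copies[copy_index].get(key)

rs = ReplicaSet()
rs.write("x", 1, w=1)
print(rs.read("x", copy_index=0), rs.read("x", copy_index=2))   # primary vs. lagging secondary
rs.replicate()
print(rs.read("x", copy_index=2))

rs.write("y", 2, w=3)
print(rs.read("y", copy_index=2))                                # already on every copy
```

With w=1 the write is confirmed while a secondary still returns nothing for the key; with w=3 every copy holds the value at confirmation time, at the cost of a slower write.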
Handling failures
• Heartbeat system
  • Primary does not communicate with the other members for 10 sec → Failure
  • New primary is decided based on consensus protocols
    • PAXOS
source: MongoDB
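The two ingredients above, timeout-based failure detection and a majority decision, can be sketched as follows. This is not an implementation of Paxos or of MongoDB's election protocol; it only shows the timeout rule from the slide and the strict-majority condition that consensus protocols rely on. Function names and timestamps are made up.

```python
HEARTBEAT_TIMEOUT = 10.0   # seconds, as stated above

def primary_failed(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """A member suspects the primary once no heartbeat arrived within the timeout."""
    return now - last_heartbeat > timeout

def elect(candidate, votes, n_members):
    """The candidate becomes primary only with a strict majority of all members."""
    return candidate if votes > n_members // 2 else None

print(primary_failed(last_heartbeat=100.0, now=105.0))
print(primary_failed(last_heartbeat=100.0, now=111.0))
print(elect("node2", votes=2, n_members=3))
print(elect("node2", votes=1, n_members=3))
```

The majority requirement is why replica sets are usually deployed with an odd number of members: after a network partition, at most one side can gather enough votes to elect a primary.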
Query Processing
Challenge IV
Query mechanisms
a) JavaScript API
• find and findOne methods (Query By Example)
• db.collection.find()
• db.collection.find( { qty: { $gt: 25 } } )
• db.collection.find( { field: { $gt: value1, $lt: value2 } } )
b) Aggregation Framework
• Documents enter a multi-stage pipeline that transforms them
• Filters that operate like queries
• Transformations that reshape the output document
• Grouping
• Sorting
• Other stage operations
c) MapReduce
Example queries
1. SQL: SELECT * FROM users;
   MongoDB: db.users.find({});
2. SQL: SELECT * FROM users WHERE age > 25;
   MongoDB: db.users.find({ age: { $gt: 25 } });
3. SQL: SELECT name, age FROM users;
   MongoDB: db.users.find({}, { name: 1, age: 1, _id: 0 });
4. SQL: INSERT INTO users (name, age) VALUES ('Alice', 30);
   MongoDB: db.users.insertOne({ name: "Alice", age: 30 });
5. SQL: UPDATE users SET age = 31 WHERE name = 'Alice';
   MongoDB: db.users.updateOne({ name: "Alice" }, { $set: { age: 31 } });
Aggregation Framework Steps
https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline
Aggregation Framework Syntax
db.orders.aggregate([
  { $match: { status: "A" } },
  { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])
https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline
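What the two stages of this pipeline compute can be reproduced over an in-memory list. This is a sketch, not the aggregation framework: the sample orders are invented, and the two helper functions cover only equality matching and a single sum per group.

```python
from collections import defaultdict

# Hypothetical orders; field names follow the pipeline above.
orders = [
    {"cust_id": "A123", "amount": 500, "status": "A"},
    {"cust_id": "A123", "amount": 250, "status": "A"},
    {"cust_id": "B212", "amount": 200, "status": "A"},
    {"cust_id": "A123", "amount": 300, "status": "D"},
]

def match(docs, criteria):
    """$match stage restricted to equality conditions."""
    return [d for d in docs if all(d.get(f) == v for f, v in criteria.items())]

def group_sum(docs, key_field, sum_field):
    """$group stage with a single $sum accumulator."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key_field]] += d[sum_field]
    return [{"_id": key, "total": total} for key, total in totals.items()]

# Stages are chained: the output of $match is the input of $group.
result = group_sum(match(orders, {"status": "A"}), "cust_id", "amount")
print(result)
```

Placing the filter first matters for cost: the grouping stage then only sees the documents that survive the match.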
Query routing
https://docs.mongodb.com/manual/core/sharded-cluster-query-router
Indexing
• Kinds
  • B+
  • Hash
  • Geospatial
  • Text
• Allow
  • Multi-attribute indexes
  • Multi-valued indexes
    • On arrays
  • Index-only query answering
• Usage
  • Best plan is cached
  • Performance is evaluated on execution
  • New candidate plans are evaluated for some time
https://www.docs4dev.com/docs/en/mongodb/v3.6/reference/core-query-plans.html
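The practical difference between a hash index and an ordered index can be sketched with a dictionary and a sorted list. These are stand-ins, not MongoDB's index structures: the sorted list with binary search only approximates what a B+ tree offers, and the collection and attribute names are invented.

```python
import bisect

# Hypothetical collection: 1000 documents with an integer attribute "age".
docs = [{"_id": i, "age": 18 + (i * 7) % 60} for i in range(1000)]

# Hash index: value -> list of _ids (exact match only).
hash_index = {}
for d in docs:
    hash_index.setdefault(d["age"], []).append(d["_id"])

# Ordered index (stand-in for a B+ tree): sorted (value, _id) pairs.
sorted_index = sorted((d["age"], d["_id"]) for d in docs)

def range_lookup(lo, hi):
    """_ids with lo <= age <= hi via two binary searches on the ordered index."""
    start = bisect.bisect_left(sorted_index, (lo,))
    end = bisect.bisect_right(sorted_index, (hi, float("inf")))
    return [doc_id for _, doc_id in sorted_index[start:end]]

equal_30 = hash_index.get(30, [])
between = range_lookup(30, 35)
print(len(equal_30), len(between))
```

The hash index answers the equality lookup directly but gives no help for the range, which would require probing every value in the interval; the ordered index serves both with a contiguous scan.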
Closing
Summary
• Document-stores
• Semi-structured database model
• Indexing
• MongoDB
• Architecture
• Interfaces
References
• E. Brewer. Towards Robust Distributed Systems. PODC'00
• L. Liu and M.T. Özsu (Eds.). Encyclopedia of Database Systems. Springer, 2009
• S. Abiteboul et al. Web Data Management. Cambridge University Press, 2012
• M. Hewasinghage et al. On the Performance Impact of Using JSON, Beyond Impedance Mismatch. ADBIS 2020
• A. Hogan. Procesado de Datos Masivos. U. de Chile. http://aidanhogan.com/teaching/cc5212-1-2020
Lab 2
Document Stores
23D020
Lab 2: Document Stores - Teams
• Teams of two
• You cannot repeat a teammate
• Assign yourself to a team; otherwise you will be assigned randomly
• https://docs.google.com/spreadsheets/d/1jEzgsNGEEHR6yeS0HsQuynAo2IkHi0731aNMF8pV6bI/edit?usp=sharing
Lab 2: Document Stores - Training
Training [not evaluated]
• Installing MongoDB
  • MongoDB Community Server: https://www.mongodb.com/try/download/community
  • MongoDB Compass (GUI): https://www.mongodb.com/try/download/compass
  • How To/FAQs: https://diligent-skirt-36b.notion.site/MongoDB-2f1db119176c4be7886edfac2062d3cc?pvs=4
• Tasks:
  • Importing data
  • Querying data
    • Insert, Delete, Update, Select
  • Geospatial queries
Lab 2: Document Stores - Assignment
Lab Assignment
• Deadline: Week 8 (27/05/2025, 12:25)
• Tasks:
• Model data in MongoDB
• Querying data in MongoDB
• Reporting query latencies
• Discussion of modeling alternatives
• Deliverables
• Python Code
• PDF Document