Document Stores

23D020: Big Data Management for Data Science


Barcelona School of Economics
Knowledge objectives
1. Explain the main difference between key-value and document stores
2. Explain the main resemblances and differences between XML and JSON documents
3. Explain the design principle of documents
4. Name 3 consequences of the design principle of a document store
5. Explain the difference between relational foreign keys and document references
6. Exemplify 6 alternatives in deciding the structure of a document
7. Explain the difference between JSON and BSON
8. Name the main functional components of the MongoDB architecture
9. Explain the role of "mongos" in query processing
10. Explain what a replica set is in MongoDB
11. Name the three storage engines of MongoDB
12. Explain what shards and chunks are in MongoDB
13. Explain the two horizontal fragmentation mechanisms in MongoDB
14. Explain how the catalog works in MongoDB
15. Identify the characteristics of the replica synchronization management in MongoDB
16. Explain how primary copy failure is managed in MongoDB
17. Name the three query mechanisms of MongoDB
18. Explain the query optimization mechanism of MongoDB

Understanding objectives
1. Given two alternative structures of a document, explain the
performance impact of the choice in a given setting
2. Simulate splitting and migration of chunks in MongoDB
3. Configure the number of replicas needed for confirmation on both
reading and writing in a given scenario

Application objectives
1. Perform some queries on MongoDB through the shell and aggregation
framework
2. Compare the access costs given different document designs
3. Compare the access costs with different indexing strategies (i.e., hash
and range based)
4. Compare the access costs with different sharding distributions (i.e.,
balanced and unbalanced)

Semi-structured database model
XML and JSON

Semi-structured data
• Document stores are essentially key-value stores
  • The value is a document
  • Allow secondary indexes
• Different implementations
  • eXtensible Markup Language (XML)
  • JavaScript Object Notation (JSON)
• Tightly related to the web
  • Easily readable by humans and machines
  • Data exchange formats for REST APIs

XML Documents
• Tree data structure
  • Document: the root node of the XML document
  • Element: nodes that correspond to the tagged nodes in the document
  • Attribute: nodes attached to Element nodes
  • Text: text nodes, i.e., untagged leaves of the XML tree
• XML-oriented database storage
  • eXist-db
  • MarkLogic
  • Relational extensions for Oracle, PostgreSQL, etc.

XML Document Example
[Figure: XML document example; source: S. Abiteboul et al.]

JSON Documents
• Lightweight data interchange format
• Can contain unbounded nesting of arrays and objects
  • Brackets ([]) represent ordered lists
  • Curly braces ({}) represent key-value dictionaries
    • Keys must be strings, delimited by quotes (")
    • Values can be strings, numbers, booleans, null, lists, or key-value dictionaries
• Natively compatible with JavaScript
  • Web browsers are natural clients
• JSON-like storage
  • MongoDB
  • CouchDB
  • Relational extensions for Oracle, PostgreSQL, etc.
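
As a minimal illustration of the rules on this slide, the sketch below builds a nested document in JavaScript, serializes it to JSON text, and parses it back. The document itself (a restaurant with an address) is invented for illustration.

```javascript
// Hypothetical document: field names and values are invented for illustration.
const doc = {
  name: "x",                 // string value
  rating: 4.5,               // number value
  open: true,                // boolean value
  tags: ["tapas", "bar"],    // ordered list
  address: {                 // nested key-value dictionaries
    city: "Barcelona",
    geo: { lat: 41.39, lon: 2.17 }
  }
};

// Serialization always emits the keys as quoted strings, as JSON requires.
const text = JSON.stringify(doc);

// Parsing the text gives back a structurally equal object.
const parsed = JSON.parse(text);
console.log(text);
console.log(parsed.address.geo.lat, parsed.tags[1]);
```

Because JSON is a subset of JavaScript object syntax, no mapping layer is needed between the exchanged text and the in-memory object.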

JSON Example (I)
[Figure: JSON example]

JSON Example (II)
[Figure: JSON example; source: MongoDB]

JSON Example (III)
[Figure: JSON example; source: MongoDB]
Data structure alternatives

Designing Document Stores
Do not think relational-wise
• Break 1NF to avoid joins
• Get all data needed with one single fetch
• Use indexes to identify finer data granularities
Consequences:
• Massive denormalization
• Independent documents
• Avoid pointers (i.e., we may have references but not FKs)
• Massive rearrangement of documents on changing the application layout (e.g.,
queries)
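
The "one single fetch" principle can be sketched with an in-memory stand-in for a document store that counts fetches. The `Store` class and the order/customer documents are hypothetical and only serve to compare a referenced design with an embedded one.

```javascript
// In-memory stand-in for a document store: every get() counts as one fetch.
class Store {
  constructor() { this.docs = new Map(); this.fetches = 0; }
  put(doc) { this.docs.set(doc._id, doc); }
  get(id) { this.fetches += 1; return this.docs.get(id); }
}

// Design 1: reference (the order stores only the customer's _id, not an enforced FK).
const referenced = new Store();
referenced.put({ _id: "c1", name: "Alice" });
referenced.put({ _id: "o1", customer: "c1", amount: 30 });

// Design 2: embedding (denormalized, breaks 1NF).
const embedded = new Store();
embedded.put({ _id: "o1", customer: { _id: "c1", name: "Alice" }, amount: 30 });

// Same question on both designs: who placed order o1?
const order = referenced.get("o1");
const nameByReference = referenced.get(order.customer).name; // second fetch, resolved by the application
const nameByEmbedding = embedded.get("o1").customer.name;    // single fetch

console.log(nameByReference, referenced.fetches);
console.log(nameByEmbedding, embedded.fetches);
```

The embedded design answers the query in one fetch, at the price of duplicating the customer in every order, which is the denormalization the slide warns about.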

Metadata representation

JSON
{
  _id: 123,
  A1: "x",
  …
  An: "x"
}

Tuple
_id  A1   …  An
123  "x"  …  "x"

Attribute optionality

J-666
{
  _id: 123,
  A1: 666,
  …
  An: 666
}

J-NULL
{
  _id: 123,
  A1: null,
  …
  An: null
}

J-Abs
{
  _id: 123
}

T-666
_id  A1   …  An
123  666  …  666

T-NULL
_id  A1    …  An
123  null  …  null
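
A rough way to compare the three JSON encodings is to generate them and measure the serialized text. The helper names below are hypothetical, and the measure is the length of the JSON text, not the BSON size MongoDB would actually store, so only the relative position of J-Abs should be read from it.

```javascript
// Build a document with n optional attributes A1..An under one of the three encodings.
function makeDoc(n, encoding) {
  const doc = { _id: 123 };
  for (let i = 1; i <= n; i++) {
    if (encoding === "J-666") doc["A" + i] = 666;   // placeholder value marks "missing"
    if (encoding === "J-NULL") doc["A" + i] = null; // explicit null
    // "J-Abs": the attribute is simply left out
  }
  return doc;
}

// Size of the JSON text in characters (an approximation, not the BSON size).
function jsonSize(doc) {
  return JSON.stringify(doc).length;
}

const sizes = {};
for (const enc of ["J-666", "J-NULL", "J-Abs"]) {
  sizes[enc] = jsonSize(makeDoc(3, enc));
}
console.log(sizes);
```

In J-666 and J-NULL every document repeats the attribute names, while J-Abs carries no metadata at all for missing attributes.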

Structure and Data Types

JSON Type
Document:
{
  _id: 123,
  A1: k,
  …
  An: k
}
Schema:
{
  "type": "object",
  "properties": {
    "A1": { "type": "number" },
    …
    "An": { "type": "number" }
  },
  "required": ["A1", …, "An"]
}

Tuple Type
Tuple:
_id  A1  …  An
123  k   …  k
Schema:
CREATE TABLE T (
  _id INTEGER,
  A1 INTEGER,
  …
  An INTEGER
);

Integrity Constraints

JSON-IC
Document:
{
  _id: 123,
  A1: k,
  …
  An: k
}
Schema:
{
  "type": "object",
  "properties": {
    "A1": { "type": "number", "minimum": -k’, "maximum": k’ },
    …
    "An": { "type": "number", "minimum": -k’, "maximum": k’ }
  }
}

Tuple-IC
Tuple:
_id  A1  …  An
123  k   …  k
Constraints:
ALTER TABLE T ADD CONSTRAINT val_A1 CHECK (A1 BETWEEN -k’ AND k’);
…
ALTER TABLE T ADD CONSTRAINT val_An CHECK (An BETWEEN -k’ AND k’);
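
To make the JSON side of the comparison concrete, here is a minimal checker for the schema keywords used on the last two slides ("type": "number", "minimum", "maximum", "required"). It is a sketch of the idea only, not a full JSON Schema validator, and the bound k’ = 100 is an assumption chosen for the example.

```javascript
// Minimal checker for a small subset of JSON Schema:
// "type": "number", "minimum", "maximum" per property, plus "required".
function validate(doc, schema) {
  const errors = [];
  for (const key of schema.required || []) {
    if (!(key in doc)) errors.push(`${key}: missing`);
  }
  for (const [key, rule] of Object.entries(schema.properties || {})) {
    if (!(key in doc)) continue; // absence is handled by "required"
    const value = doc[key];
    if (rule.type === "number" && typeof value !== "number") {
      errors.push(`${key}: not a number`);
      continue;
    }
    if (rule.minimum !== undefined && value < rule.minimum) errors.push(`${key}: below minimum`);
    if (rule.maximum !== undefined && value > rule.maximum) errors.push(`${key}: above maximum`);
  }
  return errors;
}

// Hypothetical schema with k’ = 100 for a single attribute A1.
const schema = {
  type: "object",
  properties: { A1: { type: "number", minimum: -100, maximum: 100 } },
  required: ["A1"]
};

const okErrors = validate({ _id: 123, A1: 42 }, schema);
const rangeErrors = validate({ _id: 123, A1: 666 }, schema);
const missingErrors = validate({ _id: 123 }, schema);
console.log(okErrors, rangeErrors, missingErrors);
```

In the relational alternative the same checks are declared once in the DDL and enforced by the engine on every write.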

Structure complexity

JSON-Attrib
{
  _id: 123,
  A1: k,
  …
  An: k
}

JSON-Array
{
  _id: 123,
  A: [1,…,n]
}

JSON-Nest
{
  _id: 123,
  L1: {
    …
    Ln: {
      An+1: k
    }
  }
}

Tuple-Attrib
_id  A1  …  An
123  k   …  k

Tuple-Array
_id  A
123  [1,…,n]
MongoDB architecture

Abstraction
• Documents
  • Definition: JSON documents (serialized as BSON)
  • Basic atom
    • Identified by "_id" (user or system generated)
  • May contain
    • References (not FKs!)
    • Embedded documents

• Collections
  • Definition: A grouping of MongoDB documents
  • A collection exists within a single database
  • Collections do not enforce a schema
  • MongoDB Namespace: database.collection

JSON vs. BSON (Binary JSON)
[Figure: JSON vs. BSON comparison; source: A. Hogan]

Shell commands
• show dbs
• show collections
• show users
• use <database>
• coll = db.<collection>
• find([<criteria>], [<projection>])
• insert(<document>)
• update(<query>, <update>, <options [e.g., upsert]>)
• remove(<query>, [justOne])
• drop()
• createIndex(<keys>, <options>)

• Notes:
  • db refers to the current database
  • query is a document (query-by-example)

http://docs.mongodb.org/manual/reference/mongo-shell
MongoDB syntax

db.[collection-name].[method]([query],[options])

• db: global variable
• [query]: query-by-example (depending on the method: document, array of documents, etc.)
• Collection methods: insert, update, remove, find, …
    db.restaurants.find({"name": "x"})
• Cursor methods: forEach, hasNext, count, sort, skip, size, ...
    db.restaurants.find({"name": "x"}).count()
• Database methods: createCollection, copyDatabase, ...
    db.createCollection("collection-name")
• …

MongoDB functional components
[Figure: UML class diagram of the MongoDB functional components]
• Legend: Association, Aggregation, Specialization; NotInMongoDB vs. MongoDB concept
• Classes: Server, Manager, Store, ReplicaSet, Driver, Router (mongos), Shard (mongod), ConfigServer (config), Chunk (ch1, ch2), Collection, Document
• Storage engines: In-Memory, MMAPv1, WiredTiger
• Notes in the diagram:
  • Nodes containing a copy of the directory to re-direct queries
  • Nodes containing data
  • (distributes data across the shards in the cluster)
  • In the sharding setup, a collection can be partitioned (by a key) into chunks (which are distributed across multiple shards)

https://docs.mongodb.com/manual/core/sharded-cluster-components
Data Design
Challenge I

Sharding (horizontal fragmentation)
• Shard key
  • Must be indexed (sh.shardCollection(namespace, key))
  • If not existing in a document, treated as null
• Chunk (64MB)
  • Horizontal fragment according to the shard key
  • Range-based: Range of values determines the chunks
    • Adequate for range queries
  • Hash-based: Hash function determines the chunks
    • Consistent hashing

https://docs.mongodb.com/manual/core/ranged-sharding/#sharding-ranged
https://docs.mongodb.com/manual/core/hashed-sharding/#sharding-hashed
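
The difference between the two fragmentation mechanisms can be sketched by counting how many chunks a range query touches. The chunk boundaries and the toy hash below are assumptions: MongoDB uses its own hash function, and this sketch places keys by plain modulo rather than consistent hashing.

```javascript
// Range-based: lowerBounds holds the sorted lower bound of each chunk;
// a key belongs to the last chunk whose lower bound is <= key.
function rangeChunk(key, lowerBounds) {
  let chunk = 0;
  for (let i = 0; i < lowerBounds.length; i++) {
    if (key >= lowerBounds[i]) chunk = i;
  }
  return chunk;
}

// Hash-based: a toy multiplicative hash (an assumption, not MongoDB's function)
// reduced to 32 bits and mapped onto numChunks by modulo.
function hashChunk(key, numChunks) {
  return ((key * 2654435761) >>> 0) % numChunks;
}

const lowerBounds = [0, 25, 50, 75]; // 4 chunks: [0,25), [25,50), [50,75), [75,...)
const keys = [];
for (let k = 10; k < 20; k++) keys.push(k); // the range query 10 <= key < 20

const rangeTouched = new Set(keys.map(k => rangeChunk(k, lowerBounds)));
const hashTouched = new Set(keys.map(k => hashChunk(k, 4)));
console.log(rangeTouched.size, hashTouched.size);
```

With range-based chunks the query stays inside one chunk, whereas hashing scatters neighbouring keys, which is why range sharding is the adequate choice for range queries and hashing favours balanced load.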
Splitting and migrating chunks
• Inserts and updates above a threshold trigger splits
  • Not in single-key chunks (same value in the shard keys)
• Uneven distributions in the number of chunks per shard trigger migrations
  1. A new chunk is created in an underused shard
  2. Per-document requests are sent to the origin shard
  3. Origin keeps working as usual
     • Changes made during the migration are applied a posteriori in the destination shard
  4. Changes are annotated in the config servers, which enables the new chunk
  5. Chunk at origin is dropped
  6. Client cache in query routers is inconsistent
     • Eventually synchronized

https://docs.mongodb.com/manual/core/sharding-balancer-administration/#sharding-balancing
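
The two triggers above can be simulated with a toy model in which a chunk is a sorted list of shard-key values and a shard is a list of chunks. The threshold `maxDocs` (standing in for the 64MB limit) and the allowed imbalance `maxDiff` are assumptions; the six-step migration protocol itself is not modeled, only its outcome.

```javascript
// Split a chunk once it exceeds maxDocs keys, unless all its keys are equal.
function splitIfNeeded(shard, chunkIdx, maxDocs) {
  const keys = shard[chunkIdx].keys; // kept sorted by the caller
  if (keys.length <= maxDocs) return false;            // below the threshold
  if (keys[0] === keys[keys.length - 1]) return false; // single-key chunk: cannot split
  let splitKey = keys[Math.floor(keys.length / 2)];
  if (splitKey === keys[0]) splitKey = keys.find(k => k > keys[0]);
  const left = { keys: keys.filter(k => k < splitKey) };
  const right = { keys: keys.filter(k => k >= splitKey) };
  shard.splice(chunkIdx, 1, left, right);
  return true;
}

// Migrate chunks from the fullest to the emptiest shard while the difference
// in chunk counts exceeds maxDiff (maxDiff must be >= 1 to terminate).
function balance(shards, maxDiff) {
  let moved = 0;
  while (true) {
    const counts = shards.map(s => s.length);
    const from = counts.indexOf(Math.max(...counts));
    const to = counts.indexOf(Math.min(...counts));
    if (counts[from] - counts[to] <= maxDiff) return moved;
    shards[to].push(shards[from].pop());
    moved++;
  }
}

// Shard 0 holds the only chunk, shard 1 is empty; keys arrive in increasing order.
const shards = [[{ keys: [] }], []];
for (let k = 1; k <= 8; k++) {
  const shard = shards[0];
  const last = shard.length - 1;
  shard[last].keys.push(k);
  splitIfNeeded(shard, last, 2);
}
const chunksBefore = shards[0].length;
const moved = balance(shards, 1);
console.log(chunksBefore, moved, shards.map(s => s.length));
```

Monotonically increasing keys always hit the last chunk, so all splits pile up on one shard and the balancer has to migrate chunks afterwards, which is the unbalanced scenario mentioned in the application objectives.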
Catalog Management
Challenge II

Catalog structure
• Content
  • List of chunks in every shard
• Implemented in a replica set (as any other data)
• Client cache in the query routers
  • Lazy/Primary-copy replication maintenance

https://docs.mongodb.com/manual/core/sharded-cluster-config-servers
Transaction Management
Challenge III

Replica sets
• A replica set is a group of mongod instances (typically 3)
• Primary copy with lazy replication
  • One primary copy
    • Inserts, writes, updates
    • Reads
  • Secondary copies
    • Reads

source: MongoDB
Read preference
• By default, applications try to read from the primary replica
• They can also specify a read preference
  • primary
  • primaryPreferred
  • secondary
  • secondaryPreferred
  • nearest
    • Least network latency

source: MongoDB
Required reads and writes
• ReadConcern
  • Specifies how many copies need to be read before confirmation
  • They should coincide
• WriteConcern
  • Specifies how many copies need to be written before confirmation
  • Might be zero
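
Following the slide's counting view, the interplay of the two settings can be sketched with the classical quorum rule. That rule (a read is guaranteed to reach a confirmed write when r + w > n) is an assumption brought in for illustration and is not stated on the slide; the function names are hypothetical.

```javascript
// n replicas; a write is confirmed after w copies, a read consults r copies.
// Classical quorum rule (an assumption here, not stated on the slide):
// every read set overlaps every confirmed write set exactly when r + w > n.
function readSeesConfirmedWrite(n, w, r) {
  return r + w > n;
}

// Fewest up-to-date copies a read can hit in the worst case,
// i.e. when it picks as many stale replicas as possible.
function guaranteedFreshCopies(n, w, r) {
  return Math.max(0, r - (n - w));
}

const n = 3;
for (const [w, r] of [[0, 1], [1, 1], [2, 2], [3, 1]]) {
  console.log(`w=${w} r=${r}`, readSeesConfirmedWrite(n, w, r), guaranteedFreshCopies(n, w, r));
}
```

With w = 0 the write returns before any copy is confirmed, so no read setting can guarantee seeing it; with lazy replication the secondaries catch up later, which is why small values of r and w only give eventual consistency.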

Handling failures
• Heartbeat system
  • Primary does not communicate with the other members for 10 sec → Failure
• New primary is decided based on consensus protocols
  • PAXOS

source: MongoDB
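
The two ingredients of this slide, timeout-based failure detection and a vote among the members, can be sketched as follows. The majority vote is a deliberate simplification of a consensus protocol, and the function names are hypothetical.

```javascript
const HEARTBEAT_TIMEOUT_MS = 10000; // the 10 sec from the slide

// A member suspects the primary when no heartbeat arrived within the timeout.
function primarySuspected(lastHeartbeatMs, nowMs) {
  return nowMs - lastHeartbeatMs >= HEARTBEAT_TIMEOUT_MS;
}

// A candidate becomes primary only with votes from a strict majority of all
// members (reachable or not), so two sides of a partition cannot both elect one.
function winsElection(votes, members) {
  return votes > members / 2;
}

console.log(primarySuspected(0, 9999), primarySuspected(0, 10000));
console.log(winsElection(2, 3), winsElection(1, 3), winsElection(2, 4));
```

Counting votes against the full membership is what makes an odd number of members attractive: in a set of 4, a 2-2 partition leaves neither side able to elect a primary.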
Query Processing
Challenge IV

Query mechanisms
a) JavaScript API
  • find and findOne methods (Query By Example)
    • db.collection.find()
    • db.collection.find( { qty: { $gt: 25 } } )
    • db.collection.find( { field: { $gt: value1, $lt: value2 } } )
b) Aggregation Framework
  • Documents enter a multi-stage pipeline that transforms them
    • Filters that operate like queries
    • Transformations that reshape the output document
    • Grouping
    • Sorting
    • Other stage operations
c) MapReduce
Example queries
1. SQL: SELECT * FROM users;
   MongoDB: db.users.find({});
2. SQL: SELECT * FROM users WHERE age > 25;
   MongoDB: db.users.find({ age: { $gt: 25 } });
3. SQL: SELECT name, age FROM users;
   MongoDB: db.users.find({}, { name: 1, age: 1, _id: 0 });
4. SQL: INSERT INTO users (name, age) VALUES ('Alice', 30);
   MongoDB: db.users.insertOne({ name: "Alice", age: 30 });
5. SQL: UPDATE users SET age = 31 WHERE name = 'Alice';
   MongoDB: db.users.updateOne({ name: "Alice" }, { $set: { age: 31 } });

Aggregation Framework Steps
[Figure: aggregation pipeline steps]

https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline
Aggregation Framework Syntax

Pipeline stages: $match, $group, $addFields, $sort, $unwind, …

db.orders.aggregate([
  {$match: {status: "A"}},
  {$group: {_id: "$cust_id", total: {$sum: "$amount"}}}
])

• _id: required field, identifies the field for the group by
• total: the name of the computed field
• $sum: pipeline operator (others: $max, $min, …)
• "$cust_id", "$amount": the $ prefix references the field

https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline
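
What the two stages of this pipeline compute can be reproduced in memory. The helpers `match` and `groupSum` are hypothetical stand-ins for $match and for $group with a single $sum, and the orders are invented for illustration.

```javascript
// $match stage: keep documents whose fields equal the given values.
function match(docs, criteria) {
  return docs.filter(d => Object.entries(criteria).every(([f, v]) => d[f] === v));
}

// $group stage with a single $sum accumulator:
// groupField plays the role of _id: "$cust_id", sumField of "$amount".
function groupSum(docs, groupField, sumField, outField) {
  const totals = new Map();
  for (const d of docs) {
    totals.set(d[groupField], (totals.get(d[groupField]) || 0) + d[sumField]);
  }
  return [...totals].map(([key, sum]) => ({ _id: key, [outField]: sum }));
}

// Hypothetical orders; values invented for illustration.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" }
];

// Stages are applied in order: the output of $match is the input of $group.
const result = groupSum(match(orders, { status: "A" }), "cust_id", "amount", "total");
console.log(result);
```

Placing the filter first keeps the grouping stage small, which is the usual reason to start a pipeline with $match.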
Query routing
[Figure: query routing through mongos]

https://docs.mongodb.com/manual/core/sharded-cluster-query-router
Indexing
• Kinds
  • B+
  • Hash
  • Geospatial
  • Text
• Allow
  • Multi-attribute indexes
  • Multi-valued indexes
    • On arrays
  • Index-only query answering
• Usage
  • Best plan is cached
  • Performance is evaluated on execution
  • New candidate plans are evaluated for some time

https://www.docs4dev.com/docs/en/mongodb/v3.6/reference/core-query-plans.html
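
The practical difference between a hash index and an ordered (B+-like) index can be sketched in memory: a Map stands in for the hash index and a sorted array with binary search for the leaves of a B+ tree. The data set and the field name "age" are invented for illustration.

```javascript
// Hypothetical data: 1000 documents with an integer field "age" in 0..99.
const docs = Array.from({ length: 1000 }, (_, i) => ({ _id: i, age: i % 100 }));

// Hash index: value -> list of _ids. Good for equality, of no help for ranges.
const hashIndex = new Map();
for (const d of docs) {
  if (!hashIndex.has(d.age)) hashIndex.set(d.age, []);
  hashIndex.get(d.age).push(d._id);
}

// Ordered index: (value, _id) entries sorted by value, standing in for B+ tree leaves.
const ordered = docs.map(d => [d.age, d._id]).sort((a, b) => a[0] - b[0] || a[1] - b[1]);

// First position whose value is >= target (binary search).
function lowerBound(entries, target) {
  let lo = 0, hi = entries.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (entries[mid][0] < target) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// Range query low <= age < high: two searches, then one contiguous slice.
function rangeIds(low, high) {
  const start = lowerBound(ordered, low), end = lowerBound(ordered, high);
  return ordered.slice(start, end).map(e => e[1]);
}

const equalityHits = hashIndex.get(42).length; // equality lookup through the hash index
const rangeHits = rangeIds(10, 20).length;     // range lookup through the ordered index
console.log(equalityHits, rangeHits);
```

Both lookups return only _ids taken from the index entries, which is the idea behind index-only query answering when the requested fields are all in the index.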
Closing

Summary
• Document-stores
  • Semi-structured database model
  • Indexing
• MongoDB
  • Architecture
  • Interfaces


References
• E. Brewer. Towards Robust Distributed Systems. PODC’00
• L. Liu and M.T. Özsu (Eds.). Encyclopedia of Database Systems. Springer, 2009
• S. Abiteboul et al. Web Data Management. Cambridge University Press, 2012
• M. Hewasinghage et al. On the Performance Impact of Using JSON, Beyond Impedance Mismatch. ADBIS 2020
• A. Hogan. Procesado de Datos Masivos. U. de Chile. http://aidanhogan.com/teaching/cc5212-1-2020

Lab 2
Document Stores

Lab 2: Document Stores - Teams
• Teams of two
• You cannot repeat teammates
• Assign yourself to a team; otherwise you will be assigned randomly
  • https://docs.google.com/spreadsheets/d/1jEzgsNGEEHR6yeS0HsQuynAo2IkHi0731aNMF8pV6bI/edit?usp=sharing

Lab 2: Document Stores - Training
Training [not evaluated]
• Installing MongoDB
  • MongoDB Community Server: https://www.mongodb.com/try/download/community
  • MongoDB Compass (GUI): https://www.mongodb.com/try/download/compass
  • How To/FAQs: https://diligent-skirt-36b.notion.site/MongoDB-2f1db119176c4be7886edfac2062d3cc?pvs=4
• Tasks:
  • Importing data
  • Querying data
    • Insert, Delete, Update, Select
    • Geospatial queries

Lab 2: Document Stores - Assignment
Lab Assignment
• Deadline: Week 8 (27/05/2025, 12:25)
• Tasks:
• Model data in MongoDB
• Querying data in MongoDB
• Reporting query latencies
• Discussion of modeling alternatives
• Deliverables
• Python Code
• PDF Document
