Unit 5 NOSQL
NOSQL characteristics fall into two groups: those related to distributed databases and distributed systems, and those related to data models and query languages.
NOSQL characteristics related to distributed
databases and distributed systems.
1. Scalability
2. Availability, Replication, and Eventual Consistency
3. Replication Models
4. Sharding of Files
Horizontal Scalability and Vertical Scalability
Additional Categories: Hybrid NOSQL systems, Object databases, XML databases
Categories of NOSQL Systems
• Document-based NOSQL systems
• Store data in the form of documents using JSON
• Documents are accessible by document id, but can also be accessed rapidly using other indexes.
• NOSQL key-value stores
• Simple data model based on fast access by the key to the value associated with the key.
• The value can be a record, an object, a document, or an even more complex data structure (see the sample document after this list).
• Column-based or wide column NOSQL systems
• Partition a table by column into column families (a form of vertical partitioning), where each
column family is stored in its own files.
• They also allow versioning of data values.
• Graph-based NOSQL systems
• Data is represented as graphs, and related nodes can be found by traversing the edges using path expressions.
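For example (a hypothetical record, not from any particular system), a document-based store might keep an entire customer record as one JSON document, retrievable by its document id or through a secondary index on a field such as city:

    {
      "_id": "cust1001",
      "name": "A. Rao",
      "city": "Hyderabad",
      "orders": [
        { "orderNo": 17, "total": 450.00 },
        { "orderNo": 18, "total": 120.50 }
      ]
    }

In a key-value store, a key such as "cust1001" would map to this entire value; fast lookup is by the key only, with the value treated as an opaque structure.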
The CAP Theorem
Three desirable properties of distributed systems with replicated data: Consistency, Availability, and Partition tolerance.
• Consistency: nodes will have the same copies of a replicated data item visible for various transactions.
• Availability: each read or write request for a data item will either be processed successfully or will receive a message that the operation cannot be completed.
• Partition tolerance: the system can continue operating if the network connecting the nodes has a fault that results in two or more partitions, where the nodes in each partition can only communicate among each other.
The CAP theorem states that it is not possible to guarantee all three of these properties simultaneously; a distributed system with replicated data can provide at most two of them.
Replication in MongoDB
• All write operations must be applied to the primary copy and then propagated to
the secondaries.
• For read operations, the user can choose the particular read preference for their
application.
• The default read preference processes all reads at the primary copy, so all read and write operations are performed at the primary node. In this case, secondary copies mainly serve to keep the system operating if the primary fails, and MongoDB can ensure that every read request gets the latest document value.
• To increase read performance, it is possible to set the read preference so that read
requests can be processed at any replica (primary or secondary); however, a read at
a secondary is not guaranteed to get the latest version of a document because there
can be a delay in propagating writes from the primary to the secondaries.
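As a minimal sketch of these read preferences (assuming Python with the pymongo driver, and a hypothetical replica set rs0 with a database mydb), a read preference can be set per database or per collection:

    from pymongo import MongoClient, ReadPreference

    # Hypothetical hosts and replica set name.
    client = MongoClient("mongodb://host1:27017,host2:27017/?replicaSet=rs0")

    # Default read preference: all reads and writes go to the primary,
    # so every read sees the latest document value.
    primary_db = client.get_database("mydb")

    # Allow reads at secondaries to spread the read load; such reads
    # may return stale values while writes are still propagating.
    secondary_db = client.get_database(
        "mydb", read_preference=ReadPreference.SECONDARY_PREFERRED
    )

    doc = secondary_db["customers"].find_one({"_id": "cust1001"})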
Sharding in MongoDB
• When a collection holds a very large number of documents
or requires a large storage space, storing all the documents
in one node can lead to performance problems, particularly
if there are many user operations accessing the documents
concurrently using various CRUD operations.
• Sharding of the documents in the collection, also known as horizontal partitioning, divides the documents into disjoint partitions known as shards. This allows the system to add more nodes as needed, a process known as horizontal scaling of the distributed system, and to store the shards of the collection on different nodes to achieve load balancing.
Sharding in MongoDB
• Each node will process only those operations pertaining to the documents in the shard
stored at that node. Also, each shard will contain fewer documents than if the entire
collection were stored at one node, thus further improving performance.
• There are two ways to partition a collection into shards in MongoDB: range partitioning and hash partitioning. Both require that the user specify a particular document field to be used as the basis for partitioning the documents into shards.
• The partitioning field, known as the shard key in MongoDB, must have two characteristics: it must exist in every document in the collection, and it must have an index.
• The ObjectId can be used, but any other field possessing these two characteristics can
also be used as the basis for sharding.
• The values of the shard key are divided into chunks either through range partitioning
or hash partitioning, and the documents are partitioned based on the chunks of shard
key values.
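As a minimal sketch (assuming a sharded cluster reached through its query router, and a hypothetical collection mydb.customers whose custId field exists in every document), sharding can be set up with MongoDB's standard administrative commands via pymongo:

    from pymongo import MongoClient

    # Connect to the query router (hypothetical host name).
    client = MongoClient("mongodb://mongos-host:27017")

    # Enable sharding for the database.
    client.admin.command("enableSharding", "mydb")

    # The shard key needs an index; create one on custId.
    client["mydb"]["customers"].create_index("custId")

    # Shard the collection by range on custId (1 = ascending range key).
    client.admin.command("shardCollection", "mydb.customers", key={"custId": 1})

For hash partitioning, the key specification would instead be key={"custId": "hashed"} (with a corresponding hashed index).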
Sharding in MongoDB
• Range partitioning creates the chunks by specifying a range of key values
• for example, if the shard key values ranged from one to ten million, it is possible
to create ten ranges—1 to 1,000,000; 1,000,001 to 2,000,000; … ; 9,000,001 to
10,000,000—and each chunk would contain the key values in one range.
• Hash partitioning applies a hash function h(K) to each shard key K, and the
partitioning of keys into chunks is based on the hash values.
• In general, if range queries are commonly applied to a collection (for example,
retrieving all documents whose shard key value is between 200 and 400), then
range partitioning is preferred because each range query will typically be
submitted to a single node that contains all the required documents in one
shard.
• If most searches retrieve one document at a time, hash partitioning may be
preferable because it randomizes the distribution of shard key values into
chunks.
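The following self-contained sketch (purely illustrative; this is not MongoDB's internal chunking logic) shows the difference: range partitioning keeps nearby shard key values in the same chunk, while hash partitioning scatters them across chunks:

    import hashlib

    NUM_CHUNKS = 10
    MAX_KEY = 10_000_000

    def range_chunk(key):
        # 1..1,000,000 -> chunk 0; 1,000,001..2,000,000 -> chunk 1; ...
        return (key - 1) * NUM_CHUNKS // MAX_KEY

    def hash_chunk(key):
        # Stand-in hash function h(K); MongoDB uses its own internally.
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return int(digest, 16) % NUM_CHUNKS

    for key in (200, 300, 400):
        print(key, range_chunk(key), hash_chunk(key))

All three keys land in range chunk 0, so the range query for keys 200 to 400 touches a single shard, whereas their hash chunks are effectively random and the same query would have to visit several shards.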
Sharding in MongoDB
• When sharding is used, MongoDB queries are submitted to a module called the query router (the mongos process), which keeps track of which nodes contain which shards based on the particular partitioning method used on the shard keys.
• The query (CRUD operation) will be routed to the nodes that contain the shards holding the documents that the query is requesting. If the system cannot determine which shards hold the required documents, the query will be submitted to all the nodes that hold shards of the collection (see the sketch after this list).
• Sharding and replication are used together:
• Sharding focuses on improving performance via load balancing and horizontal scalability;
• Replication focuses on ensuring system availability when certain nodes fail in the distributed system.
• MongoDB also provides many other services in areas such as system
administration, indexing, security, and data aggregation.
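A brief sketch of how this routing looks from the client side (continuing the hypothetical mydb.customers collection sharded on custId; routing itself is transparent to the application):

    customers = client["mydb"]["customers"]

    # The filter contains the shard key, so the query router can send
    # this request to the single shard whose chunk covers custId 12345.
    one = customers.find_one({"custId": 12345})

    # The filter lacks the shard key, so the query router must scatter
    # the request to every shard and gather the results.
    matches = list(customers.find({"city": "Hyderabad"}))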
MongoDB and Oracle
MongoDB installation steps
Cassandra’s data model with column families
• Cassandra can be described as fast and easily scalable
with write operations spread across the cluster.
• The cluster does not have a master node, so any node in the cluster can handle any read or write request.
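As a minimal sketch (assuming Python with the DataStax cassandra-driver package and a locally running node; the keyspace and table names are hypothetical), any node can be contacted to create a table (column family) and write to it:

    from cassandra.cluster import Cluster

    # Contact any node; with no master, this node can coordinate
    # reads and writes for the whole cluster.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)

    # A table (column family) with a partition key.
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.users (
            user_id int PRIMARY KEY,
            name text,
            city text
        )
    """)

    session.execute(
        "INSERT INTO demo.users (user_id, name, city) VALUES (%s, %s, %s)",
        (1, "A. Rao", "Hyderabad"),
    )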