
CS19443 DATABASE MANAGEMENT SYSTEMS

UNIT V: NoSQL DATABASE

Syllabus

• Introduction to NoSQL - CAP Theorem - Data Models - Key-Value Databases - Document Databases - Column Family Stores - Graph Databases - Working of NoSQL Using MongoDB
Emergence of NOSQL Systems
Many companies and organizations are faced with applications that store vast amounts of data. Example: a free e-mail application such as Google Mail or Yahoo Mail. There is a need for a storage system that can manage all these e-mails.
A structured relational SQL system may not be appropriate because:
1. SQL systems offer too many services (powerful query language, concurrency control, etc.), which this application may not need.
2. A structured data model such as the traditional relational model may be too restrictive.
Although newer relational systems do have more complex object-relational modelling options, they still require schemas.
Emergence of NOSQL Systems
Another example is an application such as Facebook, with millions of users who submit posts, many with images and videos; these posts must then be displayed on the pages of other users using the social media relationships among the users.
• Some of the data for this type of application is not suitable for a traditional relational system.
• Such an application needs multiple types of databases and data storage systems.
EXAMPLES of NOSQL Systems
• Google - BigTable, used in many of Google's applications that require vast amounts of data storage, such as Gmail, Google Maps, and Web site indexing. Apache Hbase is an open source NOSQL system based on similar concepts. Google's innovation led to the category of NOSQL systems known as column-based or wide column stores; they are also sometimes referred to as column family stores.
• Amazon - DynamoDB, which is available through Amazon's cloud services. This innovation led to the category known as key-value data stores or sometimes key-tuple or key-object data stores.
• Facebook - Cassandra, now open source and known as Apache Cassandra. This NOSQL system uses concepts from both key-value stores and column-based systems.
EXAMPLES of NOSQL Systems
• Other software companies - MongoDB and CouchDB, which are classified as document-based NOSQL systems or document stores.
• Another category is graph-based NOSQL systems, or graph databases; these include Neo4j and GraphBase.
• Some NOSQL systems, such as OrientDB, combine concepts from many of the categories discussed above.
Major differences between SQL and NoSQL databases
• Centralized SQL database vs. distributed NOSQL database
• Scalability
• Rigid vs. flexible schema
Characteristics of NOSQL Systems
The characteristics of NOSQL systems fall into two groups:
• those related to distributed databases and distributed systems, and
• those related to data models and query languages.
NOSQL characteristics related to distributed databases and distributed systems
1. Scalability
2. Availability, Replication and Eventual Consistency
3. Replication Models
4. Sharding of Files
5. High-Performance Data Access

Scalability
• Horizontal scalability: the distributed system is expanded by adding more nodes for data storage and processing as the volume of data grows. Horizontal scalability is employed while the system is operational, so techniques for distributing the existing data among new nodes without interrupting system operation are necessary.
• Vertical scalability: refers to expanding the data storage and computing power of existing nodes.
Availability, Replication and Eventual Consistency
• Data is replicated over two or more nodes in a transparent manner, so that if one node fails, the data is still available on other nodes.
• Many NOSQL systems do not require serializable consistency, so more relaxed forms of consistency, known as eventual consistency, are used.
• Replication can also improve read performance, because read requests can often be serviced from any of the replicated data nodes. However, write performance becomes more expensive, since a write must be applied to every copy of the replicated data item.
Replication Models
• Master-slave replication requires one copy to be the master copy; all write
operations must be applied to the master copy and then propagated to the
slave copies, usually using eventual consistency (the slave copies will
eventually be the same as the master copy).
• For read, the master-slave paradigm can be configured in various ways.
• One configuration requires all reads to also be at the master copy, so this would be similar to
the primary site or primary copy methods of distributed concurrency control, with similar
advantages and disadvantages.
• Another configuration would allow reads at the slave copies but would not guarantee that the
values are the latest writes, since writes to the slave nodes can be done after they are
applied to the master copy.

• Master-master replication allows reads and writes at any of the replicas


but may not guarantee that reads at nodes that store different copies see the
same values.
• Different users may write the same data item concurrently at different nodes
of the system, so the values of the item will be temporarily inconsistent.
• A reconciliation method to resolve conflicting write operations on the same data item is therefore needed.
Sharding of Files

• In many NOSQL applications, files (or collections of data


objects) can have many millions of records (or documents or
objects), and these records can be accessed concurrently by
thousands of users. So it is not practical to store the whole file
in one node.
• Sharding (also known as horizontal partitioning) of the file
records is often employed in NOSQL systems. This serves to
distribute the load of accessing the file records to multiple
nodes.
• The combination of sharding the file records and replicating the
shards works in tandem to improve load balancing as well as
data availability.
• It is necessary to find individual records or objects from among the
millions of data records or objects in a file.
• To achieve this, most systems use one of two techniques: hashing or
range partitioning on object keys.
• The majority of accesses to an object will be by providing the key value rather
than by using complex query conditions. The object key is similar to the
concept of object id. In hashing, a hash function h(K) is applied to the key K,
and the location of the object with key K is determined by the value of h(K).
• In range partitioning, the location is determined via a range of key values; for example, location i would hold the objects whose key values K are in the range Ki_min ≤ K ≤ Ki_max. In applications that require range queries, where multiple objects within a range of key values are retrieved, range partitioning is preferred. Other indexes can also be used to locate objects based on attribute conditions different from the key K.
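To make the two schemes concrete, here is a small illustrative JavaScript sketch (not tied to any particular NOSQL system); the node count, key ranges, and hash function are all hypothetical assumptions.

// Sketch: locating an object's node by hashing vs. range partitioning.
const NUM_NODES = 4;

// Hashing: a simple string hash h(K); the node is h(K) mod NUM_NODES.
function hashLocation(key) {
  let h = 0;
  for (const ch of key) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // keep h a 32-bit unsigned int
  }
  return h % NUM_NODES;
}

// Range partitioning: node i holds keys with Ki_min <= K <= Ki_max.
const ranges = [
  { min: "A", max: "F", node: 0 },
  { min: "G", max: "L", node: 1 },
  { min: "M", max: "R", node: 2 },
  { min: "S", max: "Z", node: 3 },
];

function rangeLocation(key) {
  const first = key[0].toUpperCase();
  const r = ranges.find((r) => first >= r.min && first <= r.max);
  return r ? r.node : -1; // -1: key outside all defined ranges
}

console.log(hashLocation("W1234"));  // node chosen by the hash value
console.log(rangeLocation("Smith")); // node 3: "S" falls in range S..Z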
NOSQL characteristics related to data models and query languages
• Not requiring a schema
  • Achieved by allowing semi-structured, self-describing data.
  • Any constraints on the data would have to be programmed in the application programs that access the data items.
  • There are various languages for describing semi-structured data, such as JSON and XML.
• Less powerful query languages
  • NOSQL systems may not require a powerful query language such as SQL, because search (read) queries in these systems often locate single objects in a single file based on their object keys.
  • They provide a set of functions and operations as a programming API.
  • CRUD operations: Create, Read, Update, and Delete; SCRUD adds Search.
• Versioning
Categories of NOSQL Systems
Four major categories:
• Document-based NOSQL systems
• NOSQL key-value stores
• Column-based or wide column NOSQL systems
• Graph-based NOSQL systems
Additional categories:
• Hybrid NOSQL systems
• Object databases
• XML databases
Categories of NOSQL Systems
• Document-based NOSQL systems
• Store data in the form of documents using JSON
• Documents are accessible by document id, but can also be accessed rapidly using other indexes.
• NOSQL key-value stores
• Simple data model based on fast access by the key to the value associated with the key;
• The value can be a record or an object or a document or even have a more complex data
structure.
• Column-based or wide column NOSQL systems
• Partition a table by column into column families (a form of vertical partitioning), where each
column family is stored in its own files.
• They also allow versioning of data values.
• Graph-based NOSQL systems
• Data is represented as graphs, and related nodes can be found by traversing the edges using
path expressions
The CAP Theorem
Three desirable properties of distributed systems with replicated data: Consistency, Availability, and Partition tolerance.
• Consistency: nodes will have the same copies of a replicated data item visible for various transactions.
• Availability: each read or write request for a data item will either be processed successfully or will receive a message that the operation cannot be completed.
• Partition tolerance: the system can continue operating even if the network connecting the nodes has a fault that results in two or more partitions, where the nodes in each partition can only communicate among each other.
The CAP Theorem
• The CAP theorem states that it is not possible to guarantee all three of the desirable properties at the same time in a distributed system with data replication.
• If this is the case, then the distributed system designer has to choose two properties out of the three to guarantee.
• In a NOSQL distributed data store, a weaker consistency level is often acceptable, and guaranteeing the other two properties (availability and partition tolerance) is more important.
• Eventual consistency is therefore often adopted in NOSQL systems.
Key-Value Databases

• Key-value stores focus on high performance, availability, and scalability by


storing data in a distributed storage system.
• The key is a unique identifier associated with a data item and is used to locate
this data item rapidly.
• The value is the data item itself, and it can have very different formats for
different key-value storage systems.
• In some cases, the value is just a string of bytes or an array of bytes, and the
application using the key-value store has to interpret the structure of the data
value.
• In other cases, some standard formatted data is allowed; for example,
structured data rows (tuples) similar to relational data, or semistructured data
using JSON or some other self-describing data format.
• Different key-value stores can thus store unstructured, semistructured, or
structured data items.
• The main characteristic of key-value stores is the fact that every value (data
item) must be associated with a unique key, and that retrieving the value by
supplying the key must be very fast.
MongoDB CRUD Operations
• insert operation
db.<collection_name>.insert(<document(s)>)
• delete operation is called remove
db.<collection_name>.remove(<condition>)
• update operation, which has a condition to select certain
documents, and a $set clause to specify the update. It is
also possible to use the update operation to replace an
existing document with another one but keep the same
ObjectId.
• For read queries, the main command is called find,
db.<collection_name>.find(<condition>)
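A short mongo shell session tying the four operations together; the employee collection, field names, and values below are hypothetical, for illustration only.

// Hypothetical "employee" collection; run in the mongo shell.
db.employee.insert({ _id: "E1", Fname: "John", Lname: "Smith", Salary: 30000 });

// Read: find documents matching a condition.
db.employee.find({ Salary: { $gt: 25000 } });

// Update: $set modifies selected fields, keeping the same ObjectId.
db.employee.update({ _id: "E1" }, { $set: { Salary: 32000 } });

// Delete: remove documents matching a condition.
db.employee.remove({ _id: "E1" });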
Create Database and Collection
NoSQL tool: NoSQLBooster for MongoDB
• Create database: right-click localhost, then click Create Database.
• Switch to a database: use <database_name>
• Show the collections in the current database: show collections
Insert Command
Find Command
db.transport.find({ $and: [ { Max_speed: { $lt: 500 } }, { Brand: { $eq: "Benz" } } ] }).pretty()
Data Types in MongoDB
Update Command
Remove command
DynamoDB Overview

• The DynamoDB system is an Amazon product


• Available as part of Amazon’s AWS/SDK platforms
(Amazon Web Services/Software Development Kit).
• It can be used as part of Amazon’s cloud computing
services, for the data storage component.
DynamoDB Data Model
Tables, Items, and Attributes
• A table in DynamoDB does not have a schema; it holds a collection of self-describing items.
• Each item consists of a number of (attribute, value) pairs; attribute values can be single-valued or multivalued.
• DynamoDB also allows the user to specify items in JSON format, and the system converts them to the internal storage format of DynamoDB.
• DynamoDB is thus a key-value store.
DynamoDB Data Model
• To create a table, it is required to specify a table name and a primary key.
• The primary key will be used to rapidly locate the items in the table. Thus, the primary key is the key and the item is the value for the DynamoDB key-value store.
• The primary key attribute must exist in every item in the table. The primary key can be one of the following two types:
  • A single attribute. The DynamoDB system will use this attribute to build a hash index on the items in the table. This is called a hash type primary key. The items are not ordered in storage by the value of the hash attribute.
  • A pair of attributes. This is called a hash and range type primary key. The primary key will be a pair of attributes (A, B): attribute A will be used for hashing, and because there will be multiple items with the same value of A, the B values will be used for ordering the records with the same A value.
• A table with this type of key can have additional secondary indexes defined on its attributes. For example, if we want to store multiple versions of some type of items in a table, we could use ItemID as hash and Date or Timestamp (when the version was created) as range in a hash and range type primary key. A sketch of creating such a table follows.
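The following is an illustrative sketch, not part of the original slides; it uses the AWS SDK for JavaScript (v2). The table name, attribute names, region, and throughput values are hypothetical assumptions.

// Sketch: creating a table with a hash and range type primary key.
// ItemID is the hash attribute and Date the range attribute.
const AWS = require("aws-sdk");
const dynamodb = new AWS.DynamoDB({ region: "us-east-1" }); // region assumed

const params = {
  TableName: "ItemVersions", // hypothetical table name
  AttributeDefinitions: [
    { AttributeName: "ItemID", AttributeType: "S" },
    { AttributeName: "Date", AttributeType: "S" },
  ],
  KeySchema: [
    { AttributeName: "ItemID", KeyType: "HASH" },  // hashing attribute
    { AttributeName: "Date", KeyType: "RANGE" },   // ordering attribute
  ],
  ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 },
};

dynamodb.createTable(params, (err, data) => {
  if (err) console.error(err);
  else console.log("Created table:", data.TableDescription.TableName);
});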
Voldemort Key-Value Distributed Data Store
• Voldemort is an open source system available through Apache
2.0 open source licensing rules.
• It is based on Amazon’s DynamoDB.
• The focus is on high performance and horizontal scalability, as
well as on providing replication for high availability and sharding
for improving latency (response time) of read and write requests.
• All three of those features—replication, sharding, and horizontal
scalability—are realized through a technique to distribute the
key-value pairs among the nodes of a distributed cluster; this
distribution is known as consistent hashing.
• Voldemort has been used by LinkedIn for data storage.
Features of Voldemort

1. Simple basic operations

2. High-level formatted data values

3. Consistent hashing for distributing (key, value) pairs

4. Consistency and versioning


Simple basic operations
• A collection of (key, value) pairs is kept in a Voldemort store; suppose the store is called s.
• The basic interface for data storage and retrieval is very simple and includes three operations: get, put, and delete.
• s.put(k, v) inserts an item as a key-value pair with key k and value v.
• s.delete(k) deletes the item whose key is k from the store.
• v = s.get(k) retrieves the value v associated with key k.
• The application can use these basic operations to build its own application logic, as the sketch below illustrates.
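Below is a minimal in-memory JavaScript sketch of that three-operation interface; it illustrates the get/put/delete contract only and is not the actual Voldemort client API. The store contents are hypothetical.

// Minimal in-memory sketch of a store s with put, get, and delete.
// This illustrates the interface only; it is not the Voldemort API.
class Store {
  constructor() {
    this.items = new Map(); // key -> value
  }
  put(k, v) { this.items.set(k, v); }   // insert/overwrite the (k, v) pair
  get(k) { return this.items.get(k); }  // retrieve the value for key k
  delete(k) { this.items.delete(k); }   // remove the item whose key is k
}

const s = new Store();
s.put("W1", { Ename: "Smith, John", ProjectID: "P1" });
const v = s.get("W1"); // { Ename: "Smith, John", ProjectID: "P1" }
s.delete("W1");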
High-level formatted data values
• The values v in the (k, v) items can be specified in JSON
(JavaScript Object Notation), and the system will convert
between JSON and the internal storage format.
• Other data object formats can also be specified if the application
provides the conversion (also known as serialization) between
the user format and the storage format as a Serializer class.
• The Serializer class must be provided by the user and will
include operations to convert the user format into a string of
bytes for storage as a value, and to convert back a string (array
of bytes) retrieved via s.get(k) into the user format.
• Voldemort has some built-in serializers for formats other than
JSON.
Consistent hashing for distributing (key, value) pairs
• A variation of the data distribution algorithm known as consistent hashing is used in Voldemort for data distribution among the nodes in the distributed cluster of nodes.
• A hash function h(k) is applied to the key k of each (k, v) pair, and h(k) determines where the item will be stored.
• The method assumes that h(k) is an integer value, usually in the range 0 to Hmax = 2^n − 1, where n is chosen based on the desired range for the hash values. This method is best visualized by considering the range of all possible integer hash values 0 to Hmax to be evenly distributed on a circle (or ring).
• The nodes in the distributed system are then also located on the same ring; usually each node will have several locations on the ring. The positioning of the points on the ring that represent the nodes is done in a pseudorandom manner.
Example of consistent hashing. (a) A ring having three nodes A, B, and C, with C having greater capacity. The h(K) values that map to the circle points in range 1 have their (k, v) items stored in node A, range 2 in node B, and range 3 in node C. (b) Adding a node D to the ring. Items in range 4 are moved to node D from node B (range 2 is reduced) and node C (range 3 is reduced).
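To make the ring concrete, here is an illustrative JavaScript sketch; the hash function, ring size, node names, and the number of positions per node are assumptions, not Voldemort's actual implementation.

// Sketch of consistent hashing on a ring of size 2^32.
const RING_SIZE = 2 ** 32;

function h(key) {
  let x = 0;
  for (const ch of key) x = (x * 31 + ch.charCodeAt(0)) >>> 0;
  return x % RING_SIZE;
}

// Each node gets several pseudorandom positions on the ring;
// a node with greater capacity (like C) can get more positions.
const points = [];
for (const [node, copies] of [["A", 2], ["B", 2], ["C", 4]]) {
  for (let i = 0; i < copies; i++) {
    points.push({ pos: h(node + ":" + i), node });
  }
}
points.sort((a, b) => a.pos - b.pos);

// An item is stored at the first node position clockwise from h(k).
function locate(key) {
  const pos = h(key);
  const p = points.find((p) => p.pos >= pos) || points[0]; // wrap around
  return p.node;
}

console.log(locate("item-42")); // e.g., "C"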
Consistency and versioning
• Voldemort uses a method similar to the one developed for DynamoDB for
consistency in the presence of replicas.
• Basically, concurrent write operations are allowed by different processes so there
could exist two or more different values associated with the same key at different
nodes when items are replicated.
• Consistency is achieved when the item is read by using a technique known as
versioning and read repair.
• Concurrent writes are allowed, but each write is associated with a vector clock
value.
• When a read occurs, it is possible that different versions of the same value
(associated with the same key) are read from different nodes.
• If the system can reconcile to a single final value, it will pass that value to the
read; otherwise, more than one version can be passed back to the application,
which will reconcile the various versions into one version based on the application
semantics and give this reconciled value back to the nodes.
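To make versioning concrete, here is an illustrative JavaScript sketch of vector-clock comparison and read-time reconciliation; the node names, clock representation, and reconcile policy are assumptions, not Voldemort's actual code.

// Sketch: comparing vector clocks attached to versions of a value.
// A clock maps a node name to a counter, e.g. { N1: 2, N2: 1 }.
// Clock a dominates b if a's counters are >= b's for every node in b.
function dominates(a, b) {
  return Object.keys(b).every((n) => (a[n] || 0) >= b[n]);
}

function reconcileOnRead(versions) {
  // Drop versions dominated by some other version ("read repair").
  const latest = versions.filter(
    (v) => !versions.some((w) => w !== v && dominates(w.clock, v.clock))
  );
  // One survivor: the system reconciles to a single final value.
  // Several survivors: concurrent writes; the application must merge.
  return latest;
}

const versions = [
  { value: "blue",  clock: { N1: 2, N2: 1 } },
  { value: "green", clock: { N1: 1, N2: 1 } }, // dominated by "blue"
  { value: "red",   clock: { N1: 1, N2: 2 } }, // concurrent with "blue"
];
console.log(reconcileOnRead(versions).map((v) => v.value)); // [ 'blue', 'red' ]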
Other Key-Value Store Examples
• Oracle key-value store
  Oracle has one of the well-known SQL relational database systems, and Oracle also offers a system based on the key-value store concept; this system is called the Oracle NoSQL Database.
• Redis key-value cache and store
  • Redis differs from the other systems discussed here because it caches its data in main memory to further improve performance.
  • It offers master-slave replication and high availability, and it also offers persistence by backing up the cache to disk.
• Apache Cassandra
  Cassandra is a NOSQL system that is not easily categorized into one category; it is sometimes listed in the column-based NOSQL category or in the key-value category. It offers features from several NOSQL categories and is used by Facebook as well as many other customers.
Column-Based or Wide Column NOSQL Systems
• Google's distributed storage system for big data is known as BigTable; it is used in many Google applications that require large amounts of data storage, such as Gmail.
• BigTable uses the Google File System for data storage and distribution.
• An open source system known as Apache Hbase is somewhat similar to Google BigTable, but it typically uses the Hadoop Distributed File System (HDFS) for data storage.
• HDFS is used in many cloud computing applications.
• Hbase can also use Amazon's Simple Storage System (known as S3) for data storage.
• Another well-known example of column-based NOSQL systems is Cassandra, which can also be characterized as a key-value store.
• We use Hbase as an example of this category of NOSQL systems.
Column-Based or Wide Column NOSQL Systems
• BigTable (and Hbase) is sometimes described as a sparse
multidimensional distributed persistent sorted map, where
the word map means a collection of (key, value) pairs (the key
is mapped to the value).
• One of the main differences that distinguish column-based
systems from key-value stores is the nature of the key.
• In column-based systems such as Hbase, the key is
multidimensional and so has several components: typically, a
combination of table name, row key, column, and timestamp.
• The column is typically composed of two components: column
family and column qualifier.
Column Family Stores
Hbase Data Model and Versioning
• The data model in Hbase organizes data using the concepts of namespaces,
tables, column families, column qualifiers, columns, rows, and data cells.
• A column is identified by a combination of (column family:column qualifier).
• Data is stored in a self-describing form by associating columns with data
values, where data values are strings.
• Hbase also stores multiple versions of a data item, with a timestamp associated with each version, so versions and timestamps are also part of the Hbase data model (this is similar to the concept of attribute versioning in temporal databases).
• As with other NOSQL systems, unique keys are associated with stored data
items for fast access, but the keys identify cells in the storage system.
Because the focus is on high performance when storing huge amounts of data,
the data model includes some storage-related concepts.
• It is important to note that the use of the words table, row, and column is not
identical to their use in relational databases, but the uses are related.
Tables and Rows
• Data in Hbase is stored in tables, and each table has a table name.
• Data in a table is stored as self-describing rows.
• Each row has a unique row key, and row keys are strings that must have the property that they can be lexicographically ordered, so characters that do not have a lexicographic order in the character set cannot be used as part of a row key.
(a) Creating a table called EMPLOYEE with three column families: Name, Address, and Details.
Column Families, Column Qualifiers, and Columns.
• A table is associated with one or more column families.
• Each column family will have a name, and the column families associated with a table must be specified
when the table is created and cannot be changed later.
• When a table is created, the table name is followed by the names of the column families associated with the table.
• When the data is loaded into a table, each column family can be associated with many column
qualifiers, but the column qualifiers are not specified as part of creating a table.
• The column qualifiers make the model a self-describing data model, because the qualifiers can be dynamically specified as new rows are created and inserted into the table. A column is specified by a combination of ColumnFamily:ColumnQualifier.
• Basically, column families are a way of grouping together related columns (attributes in relational terminology) for storage purposes, except that the column qualifier names are not specified during table creation. Rather, they are specified when the data is created and stored in rows, so the data is self-describing.
• Any column qualifier name can be used in a new row of data; however, it is important that the application programmers know which column qualifiers belong to each column family, even though they have the flexibility to create new column qualifiers on the fly when new data rows are created.
• The concept of column family is similar to vertical partitioning, because columns (attributes) that are
accessed together because they belong to the same column family are stored in the same files.
• Each column family of a table is stored in its own files using the HDFS file system.
(b) Inserting some rows into the EMPLOYEE table; different rows can have different self-describing column qualifiers (Fname, Lname, Nickname, Mname, Minit, Suffix, … for column family Name; Job, Review, Supervisor, Salary for column family Details).
Versions and Timestamps
• Hbase can keep several versions of a data item, along with the
timestamp associated with each version.
• The timestamp is a long integer number that represents the
system time when the version was created, so newer versions
have larger timestamp values.
• Hbase uses midnight ‘January 1, 1970 UTC’ as timestamp value
zero, and uses a long integer that measures the number of
milliseconds since that time as the system timestamp value (this
is similar to the value returned by the Java utility
java.util.Date.getTime() and is also used in MongoDB).
• It is also possible for the user to define the timestamp value
explicitly in a Date format rather than using the system-
generated timestamp.
Cells
• A cell holds a basic data item in Hbase.
• The key (address) of a cell is specified by a combination of
(table, rowid, columnfamily, columnqualifier, timestamp).
• If timestamp is left out, the latest version of the item is
retrieved unless a default number of versions is specified, say
the latest three versions.
• The default number of versions to be retrieved, as well as the
default number of versions that the system needs to keep, are
parameters that can be specified during table creation.
Namespaces

• A namespace is a collection of tables.


• A namespace basically specifies a collection of one or
more tables that are typically used together by user
applications, and it corresponds to a database that
contains a collection of tables in relational terminology
Hbase CRUD Operations
• Creating a table: create <tablename>, <column family>, <column family>, …
• Inserting data: put <tablename>, <rowid>, <column family>:<column qualifier>, <value>
• Reading data (all data in a table): scan <tablename>
• Retrieving data (one item): get <tablename>, <rowid>
• Hbase only provides low-level CRUD operations. It is the responsibility of the application programs to implement more complex operations, such as joins between rows in different tables.
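A short illustrative HBase shell session applying these templates to the EMPLOYEE table from the earlier example; the row ids and values are hypothetical.

# Create the table with its three column families.
create 'EMPLOYEE', 'Name', 'Address', 'Details'

# Insert cells; each put names a ColumnFamily:ColumnQualifier.
put 'EMPLOYEE', 'row1', 'Name:Fname', 'John'
put 'EMPLOYEE', 'row1', 'Name:Lname', 'Smith'
put 'EMPLOYEE', 'row1', 'Details:Salary', '30000'

# Retrieve one row by its row key, then read all rows in the table.
get 'EMPLOYEE', 'row1'
scan 'EMPLOYEE'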
Hbase Storage and Distributed System Concepts
• Each Hbase table is divided into a number of regions, where each region will hold a
range of the row keys in the table; this is why the row keys must be lexicographically
ordered.
• Each region will have a number of stores, where each column family is assigned to one
store within the region.
• Regions are assigned to region servers (storage nodes) for storage.
• A master server (master node) is responsible for monitoring the region servers and for
splitting a table into regions and assigning regions to region servers.
• Hbase uses the Apache Zookeeper open source system for services related to
managing the naming, distribution, and synchronization of the Hbase data on the
distributed Hbase server nodes, as well as for coordination and replication services.
• Hbase also uses Apache HDFS for distributed file services. So Hbase is built on top of
both HDFS and Zookeeper.
• Zookeeper can itself have several replicas on several nodes for availability, and it keeps
the data it needs in main memory to speed access to the master servers and region
servers.
Column Family Store Characteristics
• Column family databases can handle hundreds of terabytes of data easily.
• Updates are performed without reading the row that contains them; hence writes are very quick.
• Column family databases support real-time insertion of huge amounts of data, e.g., one million writes per second.
Column Family Store Use Cases
• Real-time weather data (minimum temperature, maximum temperature, air pressure, etc.) collected through sensors at multiple locations.
• Log files of web servers, for data analysis.
NOSQL Graph Databases and Neo4j
• The data is represented as a graph, which is a collection of vertices
(nodes) and edges.
• Both nodes and edges can be labeled to indicate the types of entities
and relationships they represent, and it is generally possible to store
data associated with both individual nodes and individual edges.
• Graph traversals are much faster compared to SQL joins
• Graph databases are also used for fraud detection wherein,
• Customers are represented as nodes and
• Transactions are represented as edges
• Transaction paths that are not related to any customer are identified as frauds
Neo4j Data Model
• The data model in Neo4j organizes data using the concepts of nodes and
relationships.
• Both nodes and relationships can have properties, which store the data items
associated with nodes and relationships.
• Nodes can have labels; the nodes that have the same label are grouped into a
collection that identifies a subset of the nodes in the database graph for querying
purposes.
• A node can have zero, one, or several labels.
• Relationships are directed; each relationship has a start node and end node as
well as a relationship type, which serves a similar role to a node label by
identifying similar relationships that have the same relationship type.
• Properties can be specified via a map pattern, which is made of one or more
“name : value” pairs enclosed in curly brackets; for example {Lname : ‘Smith’,
Fname : ‘John’, Minit : ‘B’}.
Graph Data Model Example
Neo4j Data Model
• Comparing the Neo4j graph model with ER/EER concepts, nodes
correspond to entities, node labels correspond to entity types and
subclasses, relationships correspond to relationship instances,
relationship types correspond to relationship types, and properties
correspond to attributes.
• One notable difference is that a relationship is directed in Neo4j, but is
not in ER/EER.
• Another is that a node may have no label in Neo4j, which is not allowed
in ER/EER because every entity must belong to an entity type.
• A third crucial difference is that the graph model of Neo4j is used as a
basis for an actual high-performance distributed database system
whereas the ER/EER model is mainly used for database design.
Graph Database characteristics

• Graph databases are suited for data that are heavily


interconnected through relationships.
• Graphs do not need joins for querying.
• Graph databases use graph theory for traversal. It
improves performance by keeping track of and thereby
skipping nodes already visited.
• Graph databases provide Atomicity, Consistency, Isolation
and Durability (similar to SQL).
Limitations of graph-based databases
• These databases are inappropriate for transactional data
like financial accounting
• They do not scale out horizontally
• Difficulty in performing aggregations like sum and
max efficiently
• You need to learn a new query language like CYPHER
• You have fewer vendors to choose from, so harder to get
technical support
Neo4j Interfaces and Distributed System Characteristics
• Enterprise edition vs. community edition
• Both editions support the Neo4j graph data model and storage system, as well as the Cypher graph
query language, and several other interfaces, including a high-performance native API, language
drivers for several popular programming languages, such as Java, Python, PHP, and the REST
(Representational State Transfer) API. In addition, both editions support ACID properties. The
enterprise edition supports additional features for enhancing performance, such as caching and
clustering of data and locking.
• Graph visualization interface
• Neo4j has a graph visualization interface, so that a subset of the nodes and edges in a database graph
can be displayed as a graph. This tool can be used to visualize query results in a graph representation
• Master-slave replication
  • Neo4j can be configured on a cluster of nodes with one node designated as the master; write operations are applied at the master and propagated to the slave copies, while read operations can be serviced by any node.
• Caching
A main memory cache can be configured to store the graph data for improved performance.
• Logical logs
Logs can be maintained to recover from failures
Why Document-Oriented Databases?
Document Database: characteristics
Document-Based NOSQL Systems and MongoDB
• Store data as collections of similar documents.
• These types of systems are also sometimes known as document stores.
• The individual documents somewhat resemble complex objects or XML documents, but a major difference between document-based systems and object, object-relational, and XML systems is that there is no requirement to specify a schema; rather, the documents are specified as self-describing data.
• Although the documents in a collection should be similar, they can have
different data elements (attributes), and new documents can have new data
elements that do not exist in any of the current documents in the collection.
• The system basically extracts the data element names from the self-describing
documents in the collection, and the user can request that the system create
indexes on some of the data elements. Documents can be specified in various
formats, such as XML
• A popular language to specify documents in NOSQL systems is JSON.
MongoDB Data Model
• MongoDB documents are stored in BSON (Binary JSON)
format, which is a variation of JSON with some additional
data types and is more efficient for storage than JSON.
• Individual documents are stored in a collection.
• Simple example based on COMPANY database.
• The operation createCollection is used to create each collection.
• For example, the following command can be used to create a collection called project to hold PROJECT objects from the COMPANY database:
db.createCollection("project", { capped: true, size: 1310720, max: 500 })
Here "project" is the name of the collection, and the second argument is an optional document that specifies collection options. This collection is capped, which means it has upper limits on its storage space (size, in bytes) and number of documents (max).
• The capping parameters help the system choose the storage options for each collection.
A document collection called worker can hold information about the EMPLOYEEs who work on each project.
• Example:
• db.createCollection("worker", { capped: true, size: 5242880, max: 2000 })
• Each document in a collection has a unique ObjectId field, called _id, which is automatically indexed in the collection unless the user explicitly requests no index for the _id field.
• The value of ObjectId can be specified by the user, or it can be system-generated if the user does not specify an _id field for a particular document.
• System-generated ObjectIds have a specific format, which combines the timestamp when the object is created (4 bytes, in an internal MongoDB format), the node id (3 bytes), the process id (2 bytes), and a counter (3 bytes) into a 12-byte Id value.
• User-generated ObjectIds can have any value specified by the user as long as it uniquely identifies the document; these Ids are thus similar to primary keys in relational systems.
MongoDB Data Model
• A collection does not have a schema.
• The structure of the data fields in documents is chosen based on how
documents will be accessed and used
• User can choose
• a normalized design (similar to normalized relational tuples)
• Or a denormalized design (similar to XML documents or complex objects).
• Interdocument references can be specified by storing in one document
the ObjectId or ObjectIds of other related documents.
• In the example, the _id values are user-defined, and the documents
whose _id starts with P (for project) will be stored in the “project”
collection, whereas those whose _id starts with W (for worker) will be
stored in the “worker” collection
Examples (see the sketch below): a denormalized document design with embedded subdocuments; an embedded array of document references; normalized documents; and inserting the documents into their collections.
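The original figures are not reproduced here; the following illustrative mongo shell sketch shows the two design styles. Field names and values are hypothetical, loosely based on the COMPANY example.

// Denormalized design: a project document embeds its workers.
db.project.insert({
  _id: "P1",
  Pname: "ProductX",
  Plocation: "Bellaire",
  Workers: [
    { Ename: "John Smith", Hours: 32.5 },
    { Ename: "Joyce English", Hours: 20.0 }
  ]
});

// Normalized design: workers live in their own collection and
// reference the project by its _id (an interdocument reference).
db.project.insert({ _id: "P2", Pname: "ProductY", Plocation: "Sugarland" });
db.worker.insert({ _id: "W1", Ename: "Franklin Wong", ProjectId: "P2", Hours: 10.0 });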
MongoDB Distributed Systems Characteristics

• Most MongoDB updates are atomic if they refer to a single


document, but MongoDB also provides a pattern for
specifying transactions on multiple documents.
• Since MongoDB is a distributed system, the two-phase
commit method is used to ensure atomicity and
consistency of multidocument transactions.
Replication in MongoDB
• Replication in MongoDB creates multiple copies (replicas) of the same data set on different nodes in the distributed system, and it uses a variation of the master-slave approach for replication.
• For example, suppose that we want to replicate a particular document collection C.
A replica set will have one primary copy of the collection C stored in one node N1,
and at least one secondary copy (replica) of C stored at another node N2.
Additional copies can be stored in nodes N3, N4, etc., as needed, but the cost of
storage and update (write) increases with the number of replicas.
• The total number of participants in a replica set must be at least three, so if only
one secondary copy is needed, a participant in the replica set known as an arbiter
must run on the third node N3. The arbiter does not hold a replica of the collection
but participates in elections to choose a new primary if the node storing the
current primary copy fails.
• If the total number of members in a replica set is n (one primary plus i secondaries,
for a total of n = i + 1), then n must be an odd number; if it is not, an arbiter is
added to ensure the election process works correctly if the primary fails.
Replication in MongoDB

• All write operations must be applied to the primary copy and then propagated to
the secondaries.
• For read operations, the user can choose the particular read preference for their
application.
• The default read preference processes all reads at the primary copy, so all read and
write operations are performed at the primary node. In this case, secondary copies
are mainly to make sure that the system continues operation if the primary fails,
and MongoDB can ensure that every read request gets the latest document value.
• To increase read performance, it is possible to set the read preference so that read
requests can be processed at any replica (primary or secondary); however, a read at
a secondary is not guaranteed to get the latest version of a document because there
can be a delay in propagating writes from the primary to the secondaries.
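For illustration, the read preference can be set per query in the mongo shell; the collection and condition below are hypothetical.

// Default: reads go to the primary and see the latest values.
db.worker.find({ ProjectId: "P2" });

// Allow reads at secondaries to spread the load; results may lag
// the primary because writes propagate asynchronously.
db.worker.find({ ProjectId: "P2" }).readPref("secondaryPreferred");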
Sharding in MongoDB
• When a collection holds a very large number of documents
or requires a large storage space, storing all the documents
in one node can lead to performance problems, particularly
if there are many user operations accessing the documents
concurrently using various CRUD operations.
• Sharding of the documents in the collection—also known
as horizontal partitioning— divides the documents into
disjoint partitions known as shards. This allows the system
to add more nodes as needed by a process known as
horizontal scaling of the distributed system and to store
the shards of the collection on different nodes to achieve
load balancing.
Sharding in MongoDB
• Each node will process only those operations pertaining to the documents in the shard stored at that node. Also, each shard will contain fewer documents than if the entire collection were stored at one node, thus further improving performance.
• There are two ways to partition a collection into shards in MongoDB: range partitioning and hash partitioning. Both require that the user specify a particular document field to be used as the basis for partitioning the documents into shards.
• The partitioning field, known as the shard key in MongoDB, must have two characteristics: it must exist in every document in the collection, and it must have an index.
• The ObjectId can be used, but any other field possessing these two characteristics can also be used as the basis for sharding.
• The values of the shard key are divided into chunks either through range partitioning or hash partitioning, and the documents are partitioned based on the chunks of shard key values.
Sharding in MongoDB
• Range partitioning creates the chunks by specifying a range of key values
• for example, if the shard key values ranged from one to ten million, it is possible
to create ten ranges—1 to 1,000,000; 1,000,001 to 2,000,000; … ; 9,000,001 to
10,000,000—and each chunk would contain the key values in one range.
• Hash partitioning applies a hash function h(K) to each shard key K, and the
partitioning of keys into chunks is based on the hash values.
• In general, if range queries are commonly applied to a collection (for example,
retrieving all documents whose shard key value is between 200 and 400), then
range partitioning is preferred because each range query will typically be
submitted to a single node that contains all the required documents in one
shard.
• If most searches retrieve one document at a time, hash partitioning may be
preferable because it randomizes the distribution of shard key values into
chunks.
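For illustration, the mongo shell's sh helpers can enable either style against a sharded cluster; the database, collection, and shard key names below are hypothetical.

// Run against a sharded cluster (through mongos).
sh.enableSharding("company");

// Range partitioning on the shard key: good for range queries.
sh.shardCollection("company.worker", { ProjectId: 1 });

// Hash partitioning: randomizes the distribution of shard key
// values into chunks; good when most lookups fetch one document.
sh.shardCollection("company.logs", { _id: "hashed" });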
Sharding in MongoDB
• When sharding is used, MongoDB queries are submitted to a module called the
query router, which keeps track of which nodes contain which shards based
on the particular partitioning method used on the shard keys.
• The query (CRUD operation) will be routed to the nodes that contain the
shards that hold the documents that the query is requesting. If the system
cannot determine which shards hold the required documents, the query will be
submitted to all the nodes that hold shards of the collection.
• Sharding and replication are used together;
• Sharding focuses on improving performance via load balancing and horizontal
scalability,
• Replication focuses on ensuring system availability when certain nodes fail in
the distributed system.
• MongoDB also provides many other services in areas such as system
administration, indexing, security, and data aggregation.
MongoDB and Oracle
MongoDB installation steps
Cassandra’s data model with column families
• Cassandra can be described as fast and easily scalable
with write operations spread across the cluster.
• The cluster does not have a master node, so any read and
write can be handled by any node in the cluster.