Unit 3: NoSQL Databases (ADT)
2. Availability
Availability means that each read or write request for a data item will either be
processed successfully or will receive a message that the operation cannot be completed.
Every non-failing node returns a response for all the read and write requests in a
reasonable amount of time. The key word here is “every”. In simple terms, every node
(on either side of a network partition) must be able to respond in a reasonable amount
of time.
For example, user A is a content creator with 1,000 other users subscribed to
his channel. Another user B, who is far away from user A, tries to subscribe to user A's
channel. Since the distance between the two users is large, they are connected to different
database nodes of the social media network. If the distributed system follows the
principle of availability, user B must be able to subscribe to user A's channel.
3. Partition Tolerance
Partition tolerance means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions, where
the nodes in each partition can only communicate among each other. That means, the
system continues to function and upholds its consistency guarantees in spite of network
partitions. Network partitions are a fact of life. Distributed systems guaranteeing
partition tolerance can gracefully recover from partitions once the partition heals.
For example, consider the same social media network where two users
are trying to find the subscriber count of a particular channel. Due to a technical
fault there is a network outage, and the second database node used by user B loses its
connection with the first database node. The subscriber count is still shown to user B
using the replica of the data that was copied from database 1 before the outage.
Hence the distributed system is partition tolerant.
SHARDING:
It is basically a database architecture pattern in which we split a large dataset into
smaller chunks (logical shards) and we store/distribute these chunks in different
machines/database nodes (physical shards).
Each chunk/partition is known as a “shard” and each shard has the same database
schema as the original database.
We distribute the data in such a way that each row appears in exactly one shard.
It’s a good mechanism to improve the scalability of an application.
Methods of Sharding
1. Key Based Sharding
This technique is also known as hash-based sharding. Here, we take the value of an
entity such as customer ID, customer email, IP address of a client, zip code, etc and we
use this value as an input of the hash function. This process generates a hash
value which is used to determine which shard we need to use to store the data.
We need to keep in mind that the values entered into the hash function should all
come from the same column (the shard key), to ensure that data is placed in a
consistent manner and can always be located again.
Basically, shard keys act like a primary key or a unique identifier for individual
rows.
For example: You have 3 database servers, and each request has an application ID that
is incremented by 1 every time a new application is registered.
To determine which server the data should be placed on, we perform a modulo operation
on the application ID with the number 3; the remainder identifies the server that
stores the data.
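The calculation above can be sketched in a few lines of JavaScript; this is only an illustration (pickShard and NUM_SHARDS are made-up names, not part of any sharding library or of MongoDB):
// Minimal sketch of key-based (modulo) sharding across 3 servers.
const NUM_SHARDS = 3;
function pickShard(applicationId) {
  // The remainder of the shard key decides which physical shard stores the row.
  return applicationId % NUM_SHARDS;
}
// 100 % 3 = 1, 101 % 3 = 2, 102 % 3 = 0
console.log(pickShard(100), pickShard(101), pickShard(102));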
MongoDB Advantages
● MongoDB is schema less. It is a document database in which one collection holds
different documents.
● There may be differences between the number of fields, content and size of the
document from one to another.
● Structure of a single object is clear in MongoDB.
● There are no complex joins in MongoDB.
● MongoDB provides the facility of deep query because it supports a powerful dynamic
query on documents.
● It is very easy to scale.
● It uses internal memory for storing working sets, which is the reason for its fast
data access.
Distinctive features of MongoDB
● Easy to use
● Light Weight
● Much faster than RDBMS
Where MongoDB should be used
● Big and complex data
● Mobile and social infrastructure
● Content management and delivery
● User data management
● Data hub
MongoDB Data Types:
MongoDB supports many data types. Some of them are:
1. String: String is the most commonly used datatype to store the data. It is used to
store words or text. String in MongoDB must be UTF-8 valid.
2. Integer: This data type is used to store a numerical value. Integer can be 32-bit or
64-bit depending upon the server.
3. Boolean: This data type is used to store a Boolean (true/ false) value.
4. Float: This data type is used to store floating point values.
5. Min/Max keys: This data type is used to compare a value against the lowest and
highest BSON elements.
6. Arrays: This data type is used to store arrays, lists, or multiple values in one key.
7. Timestamp: This data type is used to store the date and time at which a particular
event occurred. For example, recording when a document has been modified or added
8. Object: This datatype is used for embedded documents
9. Null: This data type is used to store a Null value.
10.Symbol: This datatype is used identically to a string; however, it’s generally
reserved for languages that use a specific symbol type
11.Date: This datatype is used to store the current date or time in UNIX time format.
We can specify our own date time by creating an object of Date and passing day,
month, year into it.
12.Object ID: This datatype is used to store the document’s ID
13.Binary data: This datatype is used to store binary data.
14.Code: This datatype is used to store JavaScript code into the document.
15.Regular expression: This datatype is used to store regular expressions.
MongoDB Create Database
MongoDB does not provide a separate command to create a database.
How and when to create a database
The use command switches to the given database and creates it automatically as soon
as data is first stored in it. If the database does not already exist, the following
command is used to create it.
Syntax:
use DATABASE_NAME
INPUT:- >>>use inventory
OUTPUT:
switched to db inventory
INSERT OPERATION:-
It is used to add new documents to the collection.
SAMPLE QUERY:- insertOne()
INPUT:-
>>>db.inventory.insertOne({ item: "canvas", qty: 100, tags: ["cotton"], size: { h: 28, w:
35.5, uom: "cm" } })
OUTPUT:-
{
"acknowledged" : true,
"insertedId" : ObjectId("603e3d2f6b88c382606523ad")
}
Syntax:- insertMany()
db.collection.insertMany(
[ <document 1>, <document 2>, ... ],
{
writeConcern: <document>,
ordered: <boolean>
}
)
SAMPLE QUERY:-
INPUT:-
>>>db.inventory.insertMany([
{ item: "journal", qty: 25, tags: ["blank", "red"], size: { h: 14, w: 21, uom: "cm" } },
{ item: "mat", qty: 85, tags: ["gray"], size: { h: 27.9, w: 35.5, uom: "cm" } },
{ item: "mousepad", qty: 25, tags: ["gel", "blue"], size: { h: 19, w: 22.85, uom: "cm" } }
])
OUTPUT:-
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("603ee5973b41040c0b3227107"),
ObjectId("603ee5973b41040c0b3227108"),
ObjectId("603ee5973b41040c0b32271079")
]
}
READ OPERATION:-
It is used to retrieve documents from the collection based on some constraints.
Syntax:-
db.collection.find(query, { <field1>: <value>, <field2>: <value> ... })
SAMPLE QUERY:-
INPUT:-
>>>db.inventory.find( {} )
OUTPUT:
SAMPLE QUERY:-
INPUT:-
>>>db.inventory.find( {qty:85} )
OUTPUT:-
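The second argument of find() is a projection document that limits the fields returned; a short hedged example, assuming the inventory collection created earlier:
INPUT:-
>>>db.inventory.find( { qty: 85 }, { item: 1, qty: 1, _id: 0 } )
This returns only the item and qty fields of the matching documents.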
UPDATE OPERATION:-
It is used to modify (add/replace) one or more documents in the collection. It
consists of 3 methods:
updateOne()
updateMany()
replaceOne()
Syntax:-
db.collection.updateOne(<filter>, <update>, <options>)
db.collection.updateMany(<filter>, <update>, <options>)
db.collection.replaceOne(<filter>, <update>, <options>)
SAMPLE QUERY:-updateOne()
INPUT:-
>>>db.inventory.updateOne( { item: "paper" }, { $set: { "size.uom": "cm"},
$currentDate: { lastModified: true } })
OUTPUT:-
SAMPLE QUERY:-updateMany()
INPUT:-
>>>db.inventory.updateMany( { "qty": { $lt: 50 } }, { $set: { "size.uom": "in", status: "P" },
$currentDate: { lastModified: true } })
OUTPUT:-
DELETE OPERATION:-
It is used to delete one or more documents from the collection based on
the constraints.
Syntax:-
db.collection.deleteOne()
db.collection.deleteMany()
SAMPLE QUERY:- deleteOne()
INPUT:-
>>>db.inventory.deleteOne( { qty:85 } )
OUTPUT:-
{
"acknowledged" : true,
"deletedCount" : 1
}
SAMPLE QUERY:-deleteMany()
INPUT:-
>>>db.inventory.deleteMany( { qty:25 } )
OUTPUT:-
{
"acknowledged" : true,
"deletedCount" : 2
}
Indexing in MongoDB :
MongoDB uses indexing in order to make the query processing more efficient. If
there is no indexing, then the MongoDB must scan every document in the collection
and retrieve only those documents that match the query. Indexes are special data
structures that store some information related to the documents such that it becomes
easy for MongoDB to find the right data file. The indexes are ordered by the value of
the field specified in the index.
Creating an Index :
MongoDB provides a method called createIndex() that allows users to create an
index.
Syntax
db.COLLECTION_NAME.createIndex({KEY:1})
Example
db.mycol.createIndex({"age":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
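The indexes on a collection can be verified with the getIndexes() method; a short example, assuming the mycol collection used above:
db.mycol.getIndexes()
The output lists the default _id index together with the age index created above.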
Dropping an Index :
In order to drop an index, MongoDB provides the dropIndex() method.
Syntax:
db.NAME_OF_COLLECTION.dropIndex({KEY:1})
The dropIndex() methods can only delete one index at a time. In order to delete (or
drop) multiple indexes from the collection, MongoDB provides the dropIndexes() method
that takes multiple indexes as its parameters.
Example
db.NAME_OF_COLLECTION.dropIndexes({KEY1:1, KEY2:1})
Application
1. Web Applications
○ MongoDB is widely used across various web applications as the primary data
store.
○ One of the most popular web development stacks, the MEAN stack employs
MongoDB as the data store (MEAN stands for MongoDB, ExpressJS,
AngularJS, and NodeJS).
2. Big Data
○ MongoDB also provides the ability to handle big data.
○ Big Data refers to massive data that is fast-changing and must be quickly
accessible and highly available to address needs efficiently.
○ So, it can be used in applications where Big Data is needed.
3. Demographic and Biometric Data
○ MongoDB is used by one of the biggest biometric databases in the world to
store a massive amount of demographic and biometric data.
○ For example, India’s Unique Identification project, Aadhar, is using
MongoDB as its database to store a massive amount of demographic and
biometric data of more than 1.2 billion Indians.
4. Synchronization
○ MongoDB can easily handle complicated data that needs to be kept entirely
synchronized.
○ So, it is mainly used in gaming applications.
○ For example, EA, a world-famous gaming studio, uses MongoDB as the
database for its game FIFA Online 3.
5. Ecommerce
○ For e-commerce websites and product data management and solutions, we
can use MongoDB to store information because it has a flexible schema well
suited for the job.
○ MongoDB's "Inventory Management" pattern can be used to handle interactions
between users' shopping carts and inventory.
○ MongoDB also has a use case called "Category Hierarchy," which describes
the techniques for interacting with category hierarchies in MongoDB.
MongoDB Replication
● In MongoDB, data can be replicated across machines by the means of replica sets.
● A replica set consists of a primary node together with two or more secondary
nodes.
● The primary node accepts all write requests, which are propagated asynchronously
to the secondary nodes.
● The primary node is determined by an election involving all available nodes.
● To be eligible to become primary, a node must be able to contact more than half of
the replica set.
● This ensures that if a network partitions a replica set in two, only one of the partitions
will attempt to establish a primary.
● The successful primary will be elected based on the number of nodes to which it is
in contact, together with a priority value that may be assigned by the system
administrator.
● Setting a priority of 0 to an instance prevents it from ever being elected as primary.
● In the event of a tie, the server with the most recent optime — the timestamp of
the last operation—will be selected.
● The primary stores information about document changes in a collection within
its local database, called the oplog.
● The primary will continuously attempt to apply these changes to secondary
instances. Members within a replica set communicate frequently via heartbeat
messages.
● If a primary finds it is unable to receive heartbeat messages from more than half of
the secondaries, then it will renounce its primary status and a new election will be
called.
● Figure illustrates a three-member replica set and shows how a network partition
leads to a change of primary.
● Arbiters are special servers that can vote in the primary election, but that don’t
hold data.
● For large databases, these arbiters can avoid the necessity of creating
unnecessary extra servers to ensure that a quorum is available when electing a
primary.
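As a hedged illustration of the priority setting mentioned above, a member can be prevented from ever becoming primary by reconfiguring the replica set from the mongo shell (the member index 2 is an assumption):
cfg = rs.conf()              // fetch the current replica set configuration
cfg.members[2].priority = 0  // this member can now never be elected primary
rs.reconfig(cfg)             // apply the new configuration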
The replication process works as follows:
● Write operations on the primary:
○ When a user sends a write operation (such as an insert, update, or delete) to the
primary node, the primary node processes the operation and records it in its oplog
(operations log).
● Oplog replication to secondaries:
○ Secondary nodes poll the primary's oplog at regular intervals.
○ The oplog contains a chronological record of all the write operations performed.
○ The secondary nodes read the oplog entries and apply the same operations to their
data sets in the same order they were executed on the primary node.
● Achieving data consistency:
○ Through this oplog-based replication, secondary nodes catch up with the
primary node's data over time.
○ This process ensures that the data on secondary nodes remains consistent with
the primary node's data.
● Read operations:
○ While primary nodes handle write operations, both primary and secondary
nodes can serve read operations, which can help in load balancing.
○ Clients can choose to read from secondary nodes, which helps distribute the
read load and reduce the primary node's workload.
○ But in some instances secondary nodes might have slightly outdated data due
to replication lag.
A replica set is configured through a mongod.conf file on each server, for example:
net:
  bindIp:
  port:
replication:
  replSetName: myReplSet
To start MongoDB on each server, use the configuration file made above (mongod.conf)
with the following bash command:
mongod -f /path/to/mongod.conf
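Before an arbiter can be added, the replica set itself must be initiated from one of the members (a step not shown above); a minimal sketch in the mongo shell, where the host names are placeholders:
rs.initiate({
  _id: "myReplSet",                    // must match replSetName in mongod.conf
  members: [
    { _id: 0, host: "host1:27017" },   // placeholder host names
    { _id: 1, host: "host2:27017" },
    { _id: 2, host: "host3:27017" }
  ]
})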
To add an arbiter to the replica set, connect to the primary and run:
rs.addArb("<arbiter_host>:<arbiter_port>")
Step 7: Check Replica Set Status
To check the status of the Replica Set, connect to any of the MongoDB
instances and run the following javascript command:
rs.status()
Step 8: Test Connection Failure
To test connection failure, you can simulate a primary node failure by stopping
the MongoDB instance. The Replica Set should automatically elect a new primary
node. Please note that the provided steps and code snippets are generalized; the
actual steps might require adjustments based on the specific environment and use
case. This is where a near real-time, low-code tool like Fivetran can be leveraged:
you just need to connect MongoDB with it, and Fivetran will handle all the
replication tasks without any hassle.
While MongoDB replication using the replica set method offers numerous
benefits, there are situations where its complexity, resource requirements, or
alignment with specific use cases make it less feasible. Organizations need to
carefully assess their requirements, infrastructure, and operational capabilities to
determine whether replica sets are the appropriate solution or if alternative strategies
should be considered.
Sharding
A high-level representation of the MongoDB sharding architecture is shown
in Figure. Each shard is implemented by a distinct MongoDB database, which in
most respects is unaware of its role in the broader sharded server. The architecture
has three components:
(1) The shard servers: each is a separate MongoDB database holding a portion of the data.
(2) The config server: contains the metadata that can be used to determine
how data is distributed across shards.
(3) The router process: responsible for routing requests to the appropriate
shard server.
Sharding Mechanisms
Distribution of data across shards can be either range based or hash based.
● Range-based partitioning:
○ Each shard is allocated a specific range of shard key values.
○ MongoDB consults the distribution of key values in the index to ensure that each
shard is allocated approximately the same number of keys.
○ Range-based partitioning allows for more efficient execution of queries that
process ranges of values, since these queries can often be resolved by accessing
a single shard.
○ When range partitioning is enabled and the shard key is continuously
incrementing, the load tends to aggregate against only one of the shards, thus
unbalancing the cluster.
● Hash-based sharding:
○ The keys are distributed based on a hash function applied to the shard key.
○ Hash-based sharding requires that range queries be resolved by accessing all
shards.
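In MongoDB, the choice between the two mechanisms is made when a collection is sharded; a hedged sketch in the mongo shell (the database and collection names are illustrative):
sh.enableSharding("mydb")
sh.shardCollection("mydb.inventory", { item: 1 })            // range-based on item
sh.shardCollection("mydb.orders", { customer_id: "hashed" }) // hash-based on customer_id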
Data Model
The data model in Cassandra is totally different from what we normally see in an RDBMS.
Cluster
Cassandra and other Dynamo-based databases distribute data throughout the
cluster by using consistent hashing. The rowkey (analogous to a primary key in an
RDBMS) is hashed. Each node is allocated a range of hash values, and the node that
has the specific range for a hashed key value takes responsibility for the initial
placement of that data.
In the default Cassandra partitioning scheme, the hash values range from -2^63
to 2^63-1. Therefore, if there were four nodes in the cluster and we wanted to assign
equal numbers of hashes to each node, then the hash ranges for each would be
approximately as follows:
Node 1: -2^63 to -2^62 - 1
Node 2: -2^62 to -1
Node 3: 0 to 2^62 - 1
Node 4: 2^62 to 2^63 - 1
We usually visualize the cluster as a ring: the circumference of the ring
represents all the possible hash values, and the location of the node on the ring
represents its area of responsibility. Figure illustrates simple consistent hashing:
the value for a rowkey is hashed, which determines its position on “the
ring.” Nodes in the cluster take responsibility for ranges of values within the
ring, and therefore take ownership of specific rowkey values.
The four-node cluster in Figure 8-10 is well balanced because every node
is responsible for hash ranges of similar magnitude. But we risk unbalancing the
cluster as we add nodes. If we double the number of nodes in the cluster, then we
can assign the new nodes at points on the ring between existing nodes and the
cluster will remain balanced. However, doubling the cluster is usually
impractical: it’s more economical to grow the cluster incrementally.
Early versions of Cassandra had two options when adding a new node. We
could either remap all the hash ranges, or we could map the new node within an
existing range. In the first option we obtain a balanced cluster, but only after an
expensive rebalancing process. In the second option the cluster becomes
unbalanced; since each node is
responsible for the region of the ring between itself and its predecessor, adding a
new node without changing the ranges of other nodes essentially splits a region
in half. Figure shows how adding a node to the cluster can unbalance the
distribution of hash key ranges.
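A tiny JavaScript sketch (illustrative only, not Cassandra code) of how a hashed rowkey maps to one of four equally sized token ranges on the ring; nodeForToken is a made-up helper:
// The default token range is -2^63 .. 2^63-1; four nodes split it into equal slices.
const NODES = 4n;
const RANGE = 2n ** 64n;     // total number of 64-bit tokens
const MIN = -(2n ** 63n);    // lowest token on the ring

function nodeForToken(token) {
  const slice = RANGE / NODES;            // width of each node's slice
  return Number((token - MIN) / slice);   // node index 0..3
}

console.log(nodeForToken(MIN)); // 0 (start of the ring)
console.log(nodeForToken(0n));  // 2 (middle of the ring)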
Order-Preserving Partitioning
The Cassandra partitioner determines how keys are distributed across
nodes. The default partitioner uses consistent hashing, as described in the
previous section. Cassandra also supports order-preserving partitioners that
distribute data across the nodes of the cluster as ranges of actual (e.g., not hashed)
rowkeys. This has the advantage of isolating requests for specific row ranges to
specific machines, but it can lead to an unbalanced cluster and may create
hotspots, especially if the key value is incrementing. For instance, if the key value
is a timestamp and the order-preserving partitioner is implemented, then all new
rows will tend to be created on a single node of the cluster. In early versions of
Cassandra, the order-preserving partitioner might be warranted to optimize range
queries that could not be satisfied in any other way; however, following the
introduction of secondary indexes, the order-preserving partitioner is maintained
primarily for backward compatibility, and the Cassandra documentation
recommends against its use in new applications.
Key Space
Keyspace is the outermost container for data in Cassandra. A keyspace is
an object that is used to hold column families and user defined types. A keyspace is
like an RDBMS database: it contains column families, indexes, user defined
types, data center awareness, the strategy used in the keyspace, the replication factor, etc.
Following are the basic attributes of Keyspace in Cassandra:
● Replication factor: It specifies the number of machines in the cluster that will
receive copies of the same data.
● Replica placement Strategy: It is a strategy which specifies how to place
replicas in the ring.
● There are three types of strategies:
1) Simple strategy (rack-unaware strategy)
2) Old network topology strategy (rack-aware strategy)
3) Network topology strategy (datacenter-shared strategy)
In Cassandra, "Create Keyspace" command is used to create keyspace.
Cassandra Create Keyspace
Cassandra Query Language (CQL) facilitates developers to communicate with
Cassandra. The syntax of Cassandra query language is very similar to SQL. In
Cassandra, "Create Keyspace" command is used to create keyspace.
Syntax:
CREATE KEYSPACE <identifier> WITH <properties>
Example:
Let's take an example to create a keyspace named "StudentDB".
CREATE KEYSPACE StudentDB WITH replication = {'class':'SimpleStrategy',
'replication_factor' : 3};
Different components of Cassandra Keyspace
Strategy: There are two types of strategy declaration in Cassandra syntax:
● Simple Strategy: Simple strategy is used in the case of one data center. In this
strategy, the first replica is placed on the selected node and the remaining replicas
are placed in the clockwise direction in the ring without considering rack or node
location.
● Network Topology Strategy: This strategy is used in the case of more than one
data center. In this strategy, you have to provide a replication factor for each data
center separately.
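A hedged CQL example of a NetworkTopologyStrategy keyspace (the keyspace name and the data center names dc1 and dc2 are placeholders):
CREATE KEYSPACE StudentDB_Multi WITH replication = {'class':'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};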
Using a Keyspace
To use the created keyspace, you have to use the USE command.
Syntax:
USE <identifier>
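Example:
To switch to the keyspace created earlier:
USE StudentDB;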
Cassandra Alter Keyspace
The "ALTER keyspace" command is used to alter the replication factor,
strategy name and durable writes properties in created keyspace in Cassandra.
Syntax:
ALTER KEYSPACE <identifier> WITH <properties>
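Example:
For instance, the replication factor of the StudentDB keyspace created earlier can be changed as follows (the new factor of 2 is illustrative):
ALTER KEYSPACE StudentDB WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 2};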
Cassandra Drop Keyspace
In Cassandra, "DROP Keyspace" command is used to drop keyspaces
with all the data, column families, user defined types and indexes from
Cassandra.
Syntax:
DROP keyspace KeyspaceName ;
Syntax:
CREATE TABLE tablename(
column1_name datatype PRIMARY KEY,
column2_name datatype,
column3_name datatype
);
There are two types of primary keys:
1. Single primary key: Use the following syntax for single primary key.
Primary key (ColumnName)
2. Compound primary key: Use the following syntax for a compound primary key.
Primary key(ColumnName1,ColumnName2 . . )
Example:
Let's take an example to demonstrate the CREATE TABLE command.
Here, we are using the already created keyspace "StudentDB".
CREATE TABLE student(
student_id int PRIMARY KEY,
student_name text,
student_city text,
student_fees varint,
student_phone varint
);
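Before querying the table, rows can be inserted with the INSERT command; a short example with illustrative values:
INSERT INTO student (student_id, student_name, student_city, student_fees, student_phone)
VALUES (1, 'Ram', 'Chennai', 5000, 9876543210);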
SELECT * FROM student;
Cassandra Alter Table
ALTER TABLE command is used to alter the table after creating it. You
can use the ALTER command to perform two types of operations:
● Add a column
● Drop a column
Syntax:
ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>
Adding a Column
You can add a column to the table by using the ALTER command. While
adding a column, you have to be aware that the column name does not conflict with
the existing column names and that the table is not defined with the compact storage
option.
Syntax:
ALTER TABLE <tablename> ADD <new_column> <datatype>;
After using the following command:
ALTER TABLE student ADD student_email text;
A new column is added. You can check it by using the SELECT command.
Dropping a Column
You can also drop an existing column from a table by using ALTER
command. You should check that the table is not defined with compact storage
option before dropping a column from a table.
Syntax:
ALTER TABLE <tablename> DROP <column_name>;
Example:
After using the following command:
ALTER TABLE student DROP student_email;
Now you can see that a column named "student_email" is dropped now. If you
want to drop the multiple columns, separate the column name by ",".
Cassandra DROP table
DROP TABLE command is used to drop a table.
Syntax:
DROP TABLE <tablename>
Example:
After using the following command:
DROP TABLE student;
The table named "student" is dropped now. You can use DESCRIBE
command to verify if the table is deleted or not. Here the student table has been
deleted; you will not find it in the column families list.
Cassandra Truncate Table
TRUNCATE command is used to truncate a table. If you truncate a table,
all the rows of the table are deleted permanently.
Syntax:
TRUNCATE <tablename>
Cassandra Batch
In Cassandra, BATCH is used to execute multiple modification statements
(insert, update, delete) simultaneously. It is very useful when you have to update
some columns as well as delete some of the existing ones.
Syntax:
BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH
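A short hedged example that combines an insert, an update, and a delete on the student table (the values are illustrative):
BEGIN BATCH
INSERT INTO student (student_id, student_name, student_city, student_fees, student_phone) VALUES (2, 'Sita', 'Madurai', 6000, 9123456780);
UPDATE student SET student_fees = 7000 WHERE student_id = 1;
DELETE student_phone FROM student WHERE student_id = 1;
APPLY BATCH;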
Use of WHERE Clause
WHERE clause is used with SELECT command to specify the exact location
from where we have to fetch data.
Syntax:
SELECT * FROM <table name> WHERE <condition>;
Example:
SELECT * FROM student WHERE student_id=2;
Cassandra Update Data
UPDATE command is used to update data in a Cassandra table. If you see
no result after updating the data, it means data is successfully updated otherwise
an error will be returned. While updating data in Cassandra table, the following
keywords are commonly used:
● Where: The WHERE clause is used to select the row that you want to update.
● Set: The SET clause is used to set the value.
● Must: The WHERE clause must include all the columns composing the primary key.
Syntax:
UPDATE <tablename>
SET <column name> = <new value>,
<column name> = <value> ....
WHERE <condition>;
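For example, the fees and city of the student whose student_id is 2 can be changed as follows (the values are illustrative):
UPDATE student
SET student_fees = 8000, student_city = 'Coimbatore'
WHERE student_id = 2;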
Cassandra DELETE Data
DELETE command is used to delete data from Cassandra table. You can
delete the complete table or a selected row by using this command.
Syntax:
DELETE FROM <identifier> WHERE <condition>;
Delete an entire row
To delete the entire row of the student_id "3", use the following command:
DELETE FROM student WHERE student_id=3;
Delete a specific column name
Example:
Delete the student_fees where student_id is 4.
DELETE student_fees FROM student WHERE student_id=4;
HAVING Clause in SQL
The HAVING clause places the condition in the groups defined by the
GROUP BY clause in the SELECT statement. This SQL clause is implemented
after the 'GROUP BY' clause in the 'SELECT' statement. This clause is used in
SQL because we cannot use the WHERE clause with the SQL aggregate
functions. Both WHERE and HAVING clauses are used for filtering the records
in SQL queries.
Syntax of HAVING clause in SQL
SELECT column_Name1, column_Name2, ....., column_NameN, aggregate_function_name(column_Name)
FROM table_Name
GROUP BY column_Name1, column_Name2, ....., column_NameN
HAVING condition;
Example:
SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City;
The same query with the HAVING clause in SQL:
SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City HAVING SUM(Emp_Salary) > 12000;
MIN Function with HAVING Clause:
If you want to show each department and the minimum salary in each
department, you have to write the following query:
SELECT MIN(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
MAX Function with HAVING Clause:
SELECT MAX(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
AVG Function:
SELECT AVG(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
CQL Types
CQL defines built-in data types for columns. The counter type is unique.
CQL Type | Constants supported | Description
ascii | strings | US-ASCII character string
bigint | integers | 64-bit signed long
blob | blobs | Arbitrary bytes (no validation), expressed as hexadecimal
boolean | booleans | true or false
counter | integers | Distributed counter value (64-bit long)
date | strings | Value is a date with no corresponding time value; Cassandra encodes date as a 32-bit integer representing days since epoch (January 1, 1970). Dates can be represented in queries and inserts as a string, such as 2015-05-03 (yyyy-mm-dd)
decimal | integers, floats | Variable-precision decimal
double | integers, floats | 64-bit IEEE-754 floating point
float | integers, floats | 32-bit IEEE-754 floating point
frozen | user-defined types, collections, tuples | A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields. Cassandra treats the value of a frozen type as a blob; the entire value must be overwritten.
inet | strings | IP address string in IPv4 or IPv6 format, used by the python-cql driver and CQL native protocols
int | integers | 32-bit signed integer
HIVE
● Hive is thought of as “SQL for Hadoop,” although Hive provides a catalog for
the Hadoop system, as well as a SQL processing layer.
● The Hive metadata service contains information about the structure of registered
files in the HDFS file system.
● This metadata effectively “schematizes” these files, providing definitions of
column names and data types.
● The Hive client or server (depending on the Hive configuration) accepts SQL-
like commands called Hive Query Language (HQL).
● These commands are translated into Hadoop jobs that process the query and
return the results to the user.
● Most of the time, Hive creates MapReduce programs that implement query
operations such as joins, sorts, aggregation, and so on.
● Hive is a data warehouse system which is used to analyze structured data. It is
built on the top of Hadoop. It was developed by Facebook.
● Hive provides the functionality of reading, writing, and managing large datasets
residing in distributed storage. It runs SQL-like queries called HQL (Hive query
language) which gets internally converted to MapReduce jobs.
● Hive supports Data Definition Language (DDL), Data Manipulation Language
(DML), and User Defined Functions (UDF).
HIVE ARCHITECTURE
Date/Time Types
1. TIMESTAMP
● It supports traditional UNIX timestamp with optional nanosecond precision.
● As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
● As Floating point numeric type, it is interpreted as UNIX timestamp in seconds
with decimal precision.
● As string, it follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision)
2. DATES
● The Date value is used to specify a particular year, month and day, in the form
YYYY-MM-DD. However, it does not provide the time of the day. The range of the
Date type lies between 0000-01-01 and 9999-12-31.
String Types
1. STRING
The string is a sequence of characters. Its values can be enclosed within
single quotes (') or double quotes (").
2. Varchar
The varchar is a variable-length type whose length lies between 1 and
65535, which specifies the maximum number of characters allowed in the
character string.
3. CHAR
The char is a fixed-length type whose maximum length is fixed at 255.
Complex Types
Hive also supports complex types such as arrays, maps, structs and unions.
Database Operations
Hive - Create Database
In Hive, the database is considered as a catalog or namespace of tables. So,
we can maintain multiple tables within a database where a unique name is
assigned to each table. Hive also provides a default database with a name default.
● Initially, we check the default database provided by Hive. To create a new database
named demo and then list the existing databases, follow the below commands: -
hive> create database demo;
hive> show databases;
Hive - Drop Database
In this section, we will see how to drop an existing database.
Drop the database by using the following command:
hive> drop database demo;
Hive - Create Table
In Hive, we can create a table by using conventions similar to SQL.
It supports a wide range of flexibility in where the data files for tables are stored. It
provides two types of table:
● Internal table
● External table
Internal Table
The internal tables are also called managed tables, as the lifecycle of their
data is controlled by Hive. By default, these tables are stored in a subdirectory
under the directory defined by hive.metastore.warehouse.dir (i.e.
/user/hive/warehouse). The internal tables are not flexible enough to share with
other tools like Pig. If we try to drop the internal table, Hive deletes both the table
schema and the data. Let's create an internal table by using the following command:-
hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;
Let's see the metadata of the created table by using the following command:-
hive> describe demo.employee;
External Table
The external table allows us to create and access a table and a data
externally. The external keyword is used to specify the external table, whereas
the location keyword is used to determine the location of loaded data. As the table
is external, the data is not present in the Hive directory. Therefore, if we try to
drop the table, the metadata of the table will be deleted, but the data still exists.
Let's create an external table using the following command: -
hive> create external table emplist (Id int, Name string, Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';
we can use the following command to retrieve the data: -
select * from emplist;
Hive - Load Data
Once the internal table has been created, the next step is to load the data
into it. So, in Hive, we can easily load data from any file to the database.
Let's load the data of the file into the database by using the following command: -
hive> load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;
Hive - Drop Table
Hive facilitates us to drop a table by using the SQL drop table
command. Let's follow the below steps to drop the table from the database.
Let's check the list of existing databases by using the following command: -
hive> show databases;
hive> use demo;
hive> show tables;
hive> drop table new_employee;
Hive - Alter Table
In Hive, we can perform modifications in the existing table like
changing the table name, column name, comments, and table properties. It
provides SQL like commands to alter the table.
Rename a Table
If we want to change the name of an existing table, we can rename
that table by using the following signature: -
Alter table old_table_name rename to new_table_name;
Now, change the name of the table by using the following command:-
Alter table emp rename to employee_data;
Adding column
In Hive, we can add one or more columns in an existing table by
using the following signature:
Alter table table_name add columns(column_name datatype);
Now, add a new column to the table by using the
following command: -
Alter table employee_data add columns (age int);
Change Column
In Hive, we can rename a column, change its type and position.
Here, we are changing the name of the column by using the following
signature: -
Alter table table_name change old_column_name new_column_name
datatype;
Now, change the name of the column by using the following command: -
Alter table employee_data change name first_name string;
Delete or Replace Column
Hive allows us to delete one or more columns by replacing them with
the new columns. Thus, we cannot drop the column directly. Let's see the
existing schema of the table.
alter table employee_data replace columns( id string, first_name string, age int);
Partitioning
The partitioning in Hive means dividing the table into some parts based
on the values of a particular column like date, course, city or country. The
advantage of partitioning is that since the data is stored in slices, the query
response time becomes faster.
The partitioning in Hive can be executed in two ways –
● Static partitioning
● Dynamic partitioning
Static Partitioning
In static or manual partitioning, it is required to pass the values of
partitioned columns manually while loading the data into the table. Hence, the
data file doesn't contain the partitioned columns.
Example of Static Partitioning
Select the database in which we want to create a table.
hive> use test;
Create the table and provide the partitioned columns by using the following
command: -
hive> create table student (id int, name string, age int, institute string)partitioned by
(course string) row format delimited fields terminated by ',';
hive> describe student;
Load the data into the table and pass the values of partition columns with it by
using the following command: -
hive> load data local inpath '/home/codegyani/hive/student_details1'
into table student partition(course= "java");
Here, we are partitioning the students of an institute based on courses.
Load the data of another file into the same table and pass the values of
partition columns with it by using the following command: -
hive> load data local inpath '/home/codegyani/hive/student_details2'
into table student partition(course= "hadoop");
hive> select * from student;
Retrieve the data based on partitioned columns by using the following command:-
hive> select * from student where course="java";
Dynamic Partitioning
In dynamic partitioning, the values of partitioned columns exist within the
table.
So, it is not required to pass the values of partitioned columns manually.
First, select the database in which we want to create a table.
hive> use show;
Enable the dynamic partition by using the following commands:-
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Create a dummy table to store the data.
hive> create table stud_demo(id int, name string, age int, institute
string, course string) row format delimited fields terminated by ',';
load the data into the table.
hive> load data local inpath '/home/codegyani/hive/student_details' into
table stud_demo;
Create a partition table by using the following command: -
hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string) row format delimited fields terminated by ',';
Insert the data of dummy table into the partition table.
hive> insert into student_part partition(course) select id, name, age, institute, course
from stud_demo;
HiveQL
Hive is the original SQL on Hadoop. From the very early days of
Hadoop, Hive represented the most accessible face of Hadoop for many users.
Hive Query Language (HQL) is a SQL-based language that comes close to
SQL-92 entry-level compliance, particularly within its SELECT statement.
DML statements—such as INSERT, DELETE, and UPDATE—are supported
in recent versions, though the real purpose of Hive is to provide query access
to Hadoop data usually ingested via other means. Some SQL-2003 analytic
window functions are also supported. HQL is compiled to MapReduce or—
in later releases—more sophisticated YARN-based DAG algorithms.
The following is a simple Hive query:
0: jdbc:Hive2://> SELECT country_name, COUNT (cust_id)
0: jdbc:Hive2://> FROM countries co JOIN customers cu
0: jdbc:Hive2://> ON (cu.country_id = co.country_id)
0: jdbc:Hive2://> WHERE region = 'Asia'
0: jdbc:Hive2://> GROUP BY country_name
0: jdbc:Hive2://> HAVING COUNT (cust_id) > 500;
2015-10-10 11:38:55 Starting to launch local task to process map join;
maximum memory = 932184064
<<Bunch of Hadoop JobTracker output deleted>>
2015-10-10 11:39:05,928 Stage-2 map = 0%, reduce = 0%
2015-10-10 11:39:12,246 Stage-2 map = 100%, reduce = 0%, Cumulative CPU
2.28 sec
2015-10-10 11:39:20,582 Stage-2 map = 100%, reduce = 100%, Cumulative CPU
4.4 sec
+---------------+------+
| country_name  | _c1  |
+---------------+------+
| China         | 712  |
| Japan         | 624  |
| Singapore     | 597  |
+---------------+------+
3 rows selected (29.014 seconds)
HQL statements look and operate like SQL statements. There are a
few notable differences between HQL and commonly used standard SQL,
however:
● HQL supports a number of table generating functions which can be used to
return multiple rows from an embedded field that may contain an array of
values or a map of name:value pairs. The Explode() function returns one row
for each element in an array or map, while json_tuple() explodes an embedded
JSON document.
● Hive provides a SORT BY clause that requests output be sorted only within
each reducer within the MapReduce pipeline. Compared to ORDER BY, this
avoids a large sort in the final reducer stage, but may not return results in
sorted order.
● DISTRIBUTE BY controls how mappers distribute output to reducers. Rather
than distributing values to reducers based on hashing of key values, we can
insist that each reducer receive contiguous ranges of a specific column.
DISTRIBUTE BY can be used in conjunction with SORT BY to achieve an
overall ordering of results without requiring an expensive final sort operation.
CLUSTER BY combines the semantics of DISTRIBUTE BY and SORT BY
operations that specify the same column list. Hive can query data in HBase
tables and data held in HDFS.
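A hedged HQL sketch of these clauses, assuming a sales table with cust_id and amount columns (this table is not part of the earlier examples):
-- Send all rows for a customer to the same reducer, sorted within each reducer.
SELECT cust_id, amount
FROM sales
DISTRIBUTE BY cust_id
SORT BY cust_id, amount;

-- CLUSTER BY cust_id is shorthand for DISTRIBUTE BY cust_id SORT BY cust_id.
SELECT cust_id, amount
FROM sales
CLUSTER BY cust_id;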
Nodes: Nodes are the records/data in graph databases. Data is stored as properties,
and properties are simple name/value pairs.
Relationships: Relationships are used to connect nodes. They specify how the nodes are related.
Relationships always have a direction.
Relationships always have a type.
Relationships form patterns of data.
If the command is executed successfully, you will get the following output.
Database updated successfully
Truncate
Truncate Record command is used to delete the values of a particular
record.
The following statement is the basic syntax of the Truncate command.
TRUNCATE RECORD <rid>*
Where <rid>* indicates the Record ID to truncate. You can use multiple
Rids separated by comma to truncate multiple records. It returns the number of
records truncated.
Try the following query to truncate the record having Record ID #11:4.
orientdb {db=demo}> TRUNCATE RECORD #11:4
DELETE
Delete Record command is used to delete one or more records completely
from the database. The following statement is the basic syntax of the Delete
command.
DELETE FROM <Class>|CLUSTER:<cluster>|INDEX:<index>
[LOCK <default|record>]
[RETURN <returning>]
[WHERE <Condition>*]
[LIMIT <MaxRecords>]
[TIMEOUT <timeout>]
Following are the details about the options in the above syntax.
LOCK - Specifies how to lock the records between load and delete. We
have two options to specify: Default and Record.
RETURN - Specifies an expression to return instead of the
number of records.
LIMIT - Defines the maximum number of records to delete.
TIMEOUT - Defines the time you want to allow the delete to run before it
times out.
OrientDB Features:
OrientDB is a multi-model database providing more functionality and flexibility,
while being powerful enough to replace your operational DBMS.
SPEED
OrientDB was engineered from the ground up with performance as a key
specification. It is fast on both read and write operations, storing up to 120,000
records per second.
No more joins: relationships are physical links to the records.
Better RAM use.
Traverses parts of or entire trees and graphs of records in
milliseconds.
Traversing speed is not affected by the database size.
ENTERPRISE
o Incremental backups
o Unmatched security
o 24x7 Support
o Query Profiler
o Distributed Clustering configuration
o Metrics Recording
o Live Monitor with configurable alerts