Unit - III
CAP theorem
It is very important to understand the limitations of NoSQL databases. A NoSQL database cannot provide consistency and high availability together. This was first expressed by Eric Brewer in the CAP theorem.
The CAP theorem, or Eric Brewer's theorem, states that a distributed database can guarantee at most two out of three properties: Consistency, Availability and Partition Tolerance.
Here, Consistency means that all nodes in the network see the same data at the same time.
Availability is a guarantee that every request receives a response about whether it was successful or failed. However, it does not guarantee that a read request returns the most recent write. The more users a system can cater to, the better its availability.
Partition Tolerance is a guarantee that the system continues to operate despite arbitrary message loss or failure of part of the system. In other words, even if there is a network outage in the data center and some of the computers are unreachable, the system continues to perform.
What is Database Sharding?
Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows a large dataset to be split into smaller chunks and stored across multiple data nodes, increasing the total storage capacity of the system.
What is the difference between sharding and partitioning?
Sharding and partitioning are both about breaking up a large dataset into smaller subsets. The difference is that sharding implies the data is spread across multiple computers, while partitioning does not. Partitioning is about grouping subsets of data within a single database instance.
What are the types of sharding?
Sharding Architectures
o Key Based Sharding: This technique is also known as hash-based sharding. ...
o Horizontal or Range Based Sharding: In this method, we split the data based on the ranges of a given value inherent in each entity. ...
o Vertical Sharding: ...
o Directory-Based Sharding.
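As a minimal sketch of key (hash) based sharding in the MongoDB shell (discussed later in this unit); the database, collection and shard-key names here are illustrative assumptions:
sh.enableSharding("mydb")
sh.shardCollection("mydb.users", { user_id: "hashed" })
Here MongoDB hashes the user_id value of each document to decide which shard stores it.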
NoSQL
NoSQL provides a mechanism for the storage and retrieval of data other than the tabular relations used in relational databases. A NoSQL database does not use tables for storing data. It is generally used for big data and real-time web applications.
Advantages of NoSQL
o Schema-free, flexible data model
o Easy replication and horizontal scaling across commodity servers
o Simple API
o Can handle huge amounts of structured, semi-structured and unstructured data
What is MongoDB?
MongoDB is a document-oriented NoSQL database that stores data in JSON-like documents. It was designed to work with commodity servers. Now it is used by companies of all sizes, across all industries.
MongoDB Advantages
o Easy to use
o Light Weight
o Extremely faster than RDBMS
MongoDB Create Database
There is no explicit create database command in MongoDB. Instead, the use command switches to a database, and MongoDB creates the database automatically when data is first stored in it.
Syntax:
use DATABASE_NAME
>use javatpointdb
>db
In MongoDB, the db.collection.insert() method is used to add or insert new documents into a
collection in your database.
>db.movie.insert({"name":"javatpoint"})
The dropDatabase command is used to drop a database. It also deletes the associated data
files. It operates on the current database.
Syntax:
db.dropDatabase()
This syntax will delete the selected database. If you have not selected any database, it will delete the default "test" database.
If you want to delete the database "javatpointdb", use the dropDatabase() command as
follows:
>db.dropDatabase()
MongoDB Create Collection
Syntax:
db.createCollection(name, options)
Name: is a string type; it specifies the name of the collection to be created.
Options: is a document type; it specifies options such as the memory size and indexing of the collection. It is an optional parameter.
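For example, a capped collection can be created by passing standard options (the collection name "log" and the size limits here are illustrative):
>db.createCollection("log", { capped: true, size: 5242880, max: 5000 })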
>show collections
MongoDB creates collections automatically when you insert some documents. For example:
Insert a document named seomount into a collection named SSSIT. The operation will create
the collection if the collection does not currently exist.
>db.SSSIT.insert({"name" : "seomount"})
>show collections
SSSIT
MongoDB update documents
Syntax:
db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)
Example
Consider an example with a collection named javatpoint. Insert the following document into the collection:
db.javatpoint.insert(
{
course: "java",
details: {
duration: "6 months",
Trainer: "Sonoo jaiswal"
},
Batch: [ { size: "Small", qty: 15 }, { size: "Medium", qty: 25 } ],
category: "Programming language"
}
)
>db.javatpoint.update({'course':'java'},{$set:{'course':'android'}})
If you want to insert multiple documents in a collection, you have to pass an array of
documents to the db.collection.insert() method.
var Allcourses =
[
{
Course: "Java",
details: { Duration: "6 months", Trainer: "Sonoo Jaiswal" },
Batch: [ { size: "Medium", qty: 25 } ],
category: "Programming Language"
},
{
Course: ".Net",
details: { Duration: "6 months", Trainer: "Prashant Verma" },
Batch: [ { size: "Small", qty: 5 }, { size: "Medium", qty: 10 }, ],
category: "Programming Language"
},
{
Course: "Web Designing",
details: { Duration: "3 months", Trainer: "Rashmi Desai" },
Batch: [ { size: "Small", qty: 5 }, { size: "Large", qty: 10 } ],
category: "Programming Language"
}
];
Pass this Allcourses array to the db.collection.insert() method to perform a bulk insert.
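For example, using the javatpoint collection from the earlier example:
>db.javatpoint.insert(Allcourses)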
1. Deletion criteria: Using the following syntax, you can remove documents from the collection.
Syntax:
db.collection_name.remove(DELETION_CRITERIA)
If you want to remove all documents from a collection, pass an empty query document {} to
the remove() method. The remove() method does not remove the indexes.
db.javatpoint.remove({})
Indexing in MongoDB:
MongoDB uses indexing to make query processing more efficient. Without indexing, MongoDB must scan every document in the collection to retrieve only those documents that match the query. Indexes are special data structures that store some information related to the documents, so that it becomes easy for MongoDB to find the right data. The indexes are ordered by the value of the field specified in the index.
Creating an Index:
MongoDB provides a method called createIndex() that allows users to create an index.
Syntax:
db.COLLECTION_NAME.createIndex({KEY:1})
Example
db.mycol.createIndex({"age":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
In order to drop an index, MongoDB provides the dropIndex() method.
Syntax
db.NAME_OF_COLLECTION.dropIndex({KEY:1})
The dropIndex() method can only delete one index at a time. To delete (or drop) multiple indexes from a collection, MongoDB provides the dropIndexes() method, which takes multiple indexes as its parameters.
Syntax:
db.NAME_OF_COLLECTION.dropIndexes()
Called with no arguments, dropIndexes() removes all indexes except the default _id index; in recent MongoDB versions, index names can also be passed to drop specific indexes.
Features of MongoDB
1. Ad hoc queries: In MongoDB, you can search by field and by range query, and it also supports regular expression searches.
2. Indexing: Any field in a MongoDB document can be indexed.
3. Replication: MongoDB supports master-slave replication. A master can perform reads and writes, while a slave copies data from the master and can only be used for reads or backup (not writes).
4. Duplication of data: MongoDB can run over multiple servers. The data is duplicated to keep the system up and keep it running in case of hardware failure.
5. Load balancing: Load is balanced automatically because data is placed in shards.
10. Stores files of any size easily without complicating your stack.
Nowadays, many companies use MongoDB to create new types of applications and to improve performance and availability.
The MongoDB replication methods are used to manage the members of a replica set.
rs.add(host, arbiterOnly)
The add method adds a member to the specified replica set. We are required to connect to the primary of the replica set to use this method. The connection to the shell will be terminated if the method triggers an election for the primary, for example, if we try to add a new member with a higher priority than the current primary. In that case the mongo shell may report an error even though the operation succeeds.
Example:
In the following example, we will add a new secondary member with a default vote.
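A sketch of the command; the hostname and port are illustrative:
rs.add( { host: "mongodb4.example.net:27017" } )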
Sharding is a method of distributing data across different machines. MongoDB uses sharding to support deployments with very large data sets and high-throughput operations.
MongoDB sh.addShard(<url>) command
A shard replica set is added to a sharded cluster using this command. Adding it affects the balance of chunks among the shards of the cluster; the balancer starts transferring chunks to balance the cluster.
Syntax:
sh.addShard("<replica_set>/<hostname><:port>")
The shard URL has the form:
<replica_set>/<hostname><:port>,<hostname><:port>, ...
Example:
sh.addShard("repl0/mongodb3.example.net:27327")
This command adds a shard by specifying the name of the replica set and the hostname of at least one member of the replica set.
Cassandra
What is Cassandra?
Apache Cassandra is a highly scalable, distributed NoSQL database. A NoSQL database is a non-relational database, also called "Not Only SQL": a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.
The data model in Cassandra is totally different from what we normally see in an RDBMS. Let's see how Cassandra stores its data.
Cluster
A Cassandra database is distributed over several machines that operate together. The outermost container is known as the Cluster, which contains different nodes. Every node contains a replica, and in case of a failure, the replica takes charge. Cassandra arranges the nodes of a cluster in a ring format and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. Following are the basic attributes
of Keyspace in Cassandra:
o Replication factor: It specifies the number of machines in the cluster that will receive copies of the same data.
What is Keyspace?
A keyspace is an object that is used to hold column families and user-defined types. A keyspace is like an RDBMS database: it contains column families, indexes, user-defined types, data center awareness, the strategy used in the keyspace, the replication factor, etc.
Syntax:
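A typical CQL form of the command (exact options may vary by Cassandra version):
CREATE KEYSPACE <identifier>
WITH replication = {'class': '<strategy name>', 'replication_factor': <number of replicas>};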
o Simple Strategy: Simple strategy is used in the case of one data center. In this strategy, the first replica is placed on the selected node and the remaining replicas are placed clockwise around the ring, without considering rack or node location.
o Network Topology Strategy: This strategy is used in the case of more than one data center. In this strategy, you have to provide a replication factor for each data center separately.
Replication Factor: The replication factor is the number of replicas of data placed on different nodes. A replication factor greater than two is recommended to avoid a single point of failure, so 3 is a good replication factor.
Example:
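A minimal example, using an illustrative keyspace name and the Simple Strategy described above:
CREATE KEYSPACE javatpoint
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};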
Using a Keyspace
To use the created keyspace, you have to use the USE command.
Syntax:
USE <identifier>
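For example (illustrative keyspace name):
USE javatpoint;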
Cassandra Alter Keyspace
The "ALTER keyspace" command is used to alter the replication factor, strategy name and
durable writes properties in created keyspace in Cassandra.
Syntax:
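A typical CQL form, matching the properties mentioned above:
ALTER KEYSPACE <identifier>
WITH replication = {'class': '<strategy name>', 'replication_factor': <number of replicas>}
AND durable_writes = <true/false>;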
In Cassandra, "DROP Keyspace" command is used to drop keyspaces with all the data,
column families, user defined types and indexes from Cassandra.
Syntax:
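The usual CQL form (the keyspace name in the second line is illustrative):
DROP KEYSPACE <identifier>;
DROP KEYSPACE javatpoint;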
Cassandra Create Table
In Cassandra, the CREATE TABLE command is used to create a table. Here, a column family is used to store data, just like a table in an RDBMS. So, you can say that the CREATE TABLE command is used to create a column family in Cassandra.
Syntax:
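A common CQL form of the command:
CREATE TABLE <tablename> (
<column1 name> <datatype> PRIMARY KEY,
<column2 name> <datatype>,
<column3 name> <datatype>
);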
Single primary key: Use the following syntax for a single primary key.
Primary key (ColumnName)
Compound primary key: Use the following syntax for a compound primary key.
Primary key (ColumnName1, ColumnName2, . . .)
Example:
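A sketch of a table that matches the student examples used later in this section (the student_name column is an assumption):
CREATE TABLE student (
student_id int PRIMARY KEY,
student_name text,
student_fees int
);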
ALTER TABLE command is used to alter the table after creating it. You can use the ALTER
command to perform two types of operations:
o Add a column
o Drop a column
Syntax:
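The general form is roughly:
ALTER TABLE <tablename> <instruction>;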
Adding a Column
You can add a column to a table by using the ALTER command. While adding a column, you have to be aware that the column name does not conflict with the existing column names and that the table is not defined with the compact storage option.
Syntax:
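A typical form, with an example that matches the student_email column referred to below:
ALTER TABLE <tablename> ADD <new column name> <datatype>;
ALTER TABLE student ADD student_email text;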
A new column is added. You can check it by using the SELECT command.
Dropping a Column
You can also drop an existing column from a table by using the ALTER command. You should check that the table is not defined with the compact storage option before dropping a column from a table.
Syntax:
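A typical form, with an example matching the column mentioned below:
ALTER TABLE <tablename> DROP <column name>;
ALTER TABLE student DROP student_email;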
Now you can see that the column named "student_email" has been dropped.
If you want to drop multiple columns, separate the column names with ','.
Syntax:
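In recent Cassandra versions the column names are listed in parentheses; a sketch:
ALTER TABLE <tablename> DROP (<column name1>, <column name2>);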
Cassandra Drop Table
The DROP TABLE command is used to drop a table.
Syntax:
DROP TABLE <tablename>
Example:
After using the following command:
DROP TABLE student;
The table named "student" is now dropped. You can use the DESCRIBE command to verify whether the table has been deleted; since the student table has been deleted, you will not find it in the column families list.
TRUNCATE command is used to truncate a table. If you truncate a table, all the rows of the
table are deleted permanently.
Syntax:
TRUNCATE <tablename>
Example:
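For example, to empty the student table used above:
TRUNCATE student;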
Cassandra Batch
Syntax:
BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH
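A sketch of a batch, using the student table and columns from the earlier examples:
BEGIN BATCH
INSERT INTO student (student_id, student_fees) VALUES (6, 5000);
UPDATE student SET student_fees = 6000 WHERE student_id = 4;
DELETE student_fees FROM student WHERE student_id = 6;
APPLY BATCH;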
The WHERE clause is used with the SELECT command to specify exactly which rows should be fetched.
Syntax:
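A common form, with an example based on the student table:
SELECT * FROM <tablename> WHERE <condition>;
SELECT * FROM student WHERE student_id = 2;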
The UPDATE command is used to update data in a Cassandra table. If you see no result after updating the data, it means the data was successfully updated; otherwise an error will be returned.
While updating data in Cassandra table, the following keywords are commonly used:
o Where: The WHERE clause is used to select the row that you want to update.
o Set: The SET clause is used to set the value.
o Must: The WHERE clause must include all the columns composing the primary key.
Syntax:
UPDATE <tablename>
SET <column name> = <new value>
<column name> = <value>....
WHERE <condition>
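For example, using the student table from above:
UPDATE student SET student_fees = 9000 WHERE student_id = 2;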
The DELETE command is used to delete data from a Cassandra table. You can delete an entire row or selected columns by using this command.
Syntax:
DELETE FROM <identifier> WHERE <condition>;
Delete an entire row
To delete the entire row of the student_id "3", use the following command:
DELETE FROM student WHERE student_id=3;
Delete a specific column name
Example:
Delete the student_fees where student_id is 4.
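The corresponding command would be:
DELETE student_fees FROM student WHERE student_id = 4;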
HAVING Clause
The HAVING clause places a condition on the groups defined by the GROUP BY clause in the SELECT statement. It is applied after the GROUP BY clause in the SELECT statement. This clause is used because the WHERE clause cannot be combined with SQL aggregate functions; both WHERE and HAVING are used for filtering the records in SQL queries.
For example, assuming an Employee table with Emp_Dept and Emp_Salary columns, the following query returns only those departments whose total salary exceeds 12000:
SELECT Emp_Dept, SUM(Emp_Salary) AS Total_Salary
FROM Employee
GROUP BY Emp_Dept
HAVING SUM(Emp_Salary)>12000;
MIN Function with HAVING Clause:
If you want to show each department and the minimum salary in each department, you have
to write the following query:
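A sketch of such a query, assuming the same Employee table with Emp_Dept and Emp_Salary columns used above:
SELECT Emp_Dept, MIN(Emp_Salary) AS Minimum_Salary
FROM Employee
GROUP BY Emp_Dept;
A HAVING condition, for example HAVING MIN(Emp_Salary) > 4000, can then be appended to keep only certain departments.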
Cassandra vs MongoDB
3) Cassandra stores data in a tabular form, like the SQL format, whereas MongoDB stores data in JSON format.
4) Cassandra is licensed by Apache, whereas MongoDB is licensed under the AGPL (with drivers licensed under Apache).
Hive
What is HIVE?
Hive is a data warehouse system which is used to analyze structured data. It is built on top of Hadoop and was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
User Defined Functions (UDF).
Features of Hive
o It provides SQL-like queries (HQL), so it is familiar to SQL users.
o It can analyze large datasets stored in distributed storage such as HDFS.
o It supports partitioning to speed up query processing.
o It is fast, scalable and extensible.
Hive Data Types
Hive data types are categorized into numeric types, string types, misc types, and complex types. A list of Hive data types is given below.
Integer Types
Hive provides the integer types TINYINT, SMALLINT, INT and BIGINT.
Decimal Types
Hive provides FLOAT, DOUBLE and DECIMAL for decimal values.
Date/Time Types
TIMESTAMP
It is used to store date and time values, supporting the traditional UNIX timestamp with optional nanosecond precision.
DATE
The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD. However, it does not provide the time of day. The range of the Date type lies between 0000-01-01 and 9999-12-31.
String Types
STRING
A string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
Varchar
The varchar is a variable-length type whose length lies between 1 and 65535; this specifies the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length type whose maximum length is fixed at 255 characters.
Complex Types
Hive also provides complex types such as Struct, Map and Array.
Hive Databases
In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain multiple tables within a database, where a unique name is assigned to each table. Hive also provides a default database named default.
o Initially, we check the default database provided by Hive. To check the list of existing databases, use the following command: -
hive> show databases;
o To create a new database, use the following command: -
hive> create database demo;
In this section, we will see various ways to drop the existing database.
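For example, the demo database created above can be dropped with the first command below; the CASCADE form is needed if the database still contains tables:
hive> drop database demo;
hive> drop database demo cascade;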
In Hive, we can create a table by using conventions similar to SQL. Hive provides a great deal of flexibility in where the data files for tables are stored. It provides two types of table: -
o Internal table
o External table
Internal Table
The internal tables are also called managed tables, as the lifecycle of their data is controlled by Hive. By default, these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). The internal tables are not flexible enough to share with other tools like Pig. If we try to drop an internal table, Hive deletes both the table schema and the data.
hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;
Let's see the metadata of the created table by using the following command:-
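For example, for the table created above:
hive> describe demo.employee;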
External Table
The external table allows us to create and access a table and its data externally.
The external keyword is used to specify the external table, whereas the location keyword is
used to determine the location of loaded data.
As the table is external, the data is not present in the Hive directory. Therefore, if we try to
drop the table, the metadata of the table will be deleted, but the data still exists.
hive> create external table emplist (Id int, Name string , Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';
Once the internal table has been created, the next step is to load the data into it. In Hive, we can easily load data from a file into a table.
o Let's load the data of the file into the database by using the following command: -
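A sketch of the load command; the local file path is illustrative:
hive> load data local inpath '/home/user/emp_details' into table demo.employee;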
Hive facilitates us to drop a table by using the SQL drop table command. Let's follow the
below steps to drop the table from the database.
o Let's check the list of existing databases by using the following command: -
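A sketch of the steps, reusing the database and table names from the earlier examples:
hive> show databases;
hive> use demo;
hive> show tables;
hive> drop table employee;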
Alter Table
In Hive, we can perform modifications on an existing table, such as changing the table name, column names, comments, and table properties. It provides SQL-like commands to alter the table.
Rename a Table
If we want to change the name of an existing table, we can rename that table by using the
following signature: -
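The usual form is:
ALTER TABLE <old_table_name> RENAME TO <new_table_name>;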
In Hive, we can add one or more columns in an existing table by using the following
signature:
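The usual form is:
ALTER TABLE <table_name> ADD COLUMNS (<column_name> <datatype>);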
Change Column
In Hive, we can rename a column, change its type and position. Here, we are changing the
name of the column by using the following signature: -
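The usual form is:
ALTER TABLE <table_name> CHANGE <old_column_name> <new_column_name> <datatype>;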
Hive allows us to delete one or more columns by replacing them with the new columns. Thus,
we cannot drop the column directly.
alter table employee_data replace columns( id string, first_name string, age int);
Partitioning in Hive
Partitioning in Hive means dividing the table into parts based on the values of a particular column, such as date, course, city or country. The advantage of partitioning is that, since the data is stored in slices, the query response time becomes faster. Hive supports two types of partitioning: static and dynamic.
Static Partitioning
o Create the table and provide the partitioned columns by using the following
command: -
hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
o Load the data of another file into the same table and pass the values of partition
columns with it by using the following command: -
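A sketch of the command; the file path and partition value are illustrative:
hive> load data local inpath '/home/user/student_details2' into table student
partition(course = 'java');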
Dynamic Partitioning
In dynamic partitioning, the values of the partitioned columns exist within the table, so it is not required to pass the values of the partitioned columns manually.
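Dynamic partitioning usually has to be enabled first with the standard Hive settings:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;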
hive> create table stud_demo(id int, name string, age int, institute string, course string)
row format delimited
fields terminated by ',';
hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
o Now, insert the data of dummy table into the partition table.
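A sketch of the insert, matching the two tables defined above:
hive> insert into table student_part partition(course)
select id, name, age, institute, course from stud_demo;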
What is Graph?
A graph is a pictorial representation of objects that are connected by links. A graph contains two elements: nodes (vertices) and relationships (edges).
A graph database is a database that models data in the form of a graph. It stores any kind of data using:
o Nodes
o Relationships
o Properties
Nodes: Nodes are the records/data in graph databases. Data is stored as properties, and properties are simple name/value pairs.
Relationships: Relationships connect nodes. They specify how the nodes are related.
Properties: Properties are the name/value pairs attached to nodes and relationships.
Neo4j is the most popular graph database. Other graph databases also exist, such as OrientDB, which is discussed below.
Graph Database vs RDBMS
3. In a graph database, there are properties and their values. In an RDBMS, there are columns and data.
4. In a graph database, the connected nodes are defined by relationships. In an RDBMS, constraints are used instead.
MongoDB vs OrientDB
MongoDB and OrientDB share many common features, but the engines are fundamentally different. MongoDB is a pure document database, whereas OrientDB is a hybrid document database with a graph engine.
Indexes: MongoDB uses the B-Tree algorithm for all indexes, while OrientDB supports three different indexing algorithms so that the user can achieve the best performance.
The following table illustrates the comparison between relational model, document model,
and OrientDB document model −
The SQL Reference of the OrientDB database provides several commands to create, alter, and
drop databases.
Create database
The following statement is a basic syntax of Create Database command.
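The console syntax is roughly as follows (the options are described below):
CREATE DATABASE <database-url> [<user> <password> <storage-type>]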
Following are the details about the options in the above syntax.
<database-url> − Defines the URL of the database. URL contains two parts, one is <mode>
and the second one is <path>.
<mode> − Defines the mode, i.e. local mode or remote mode.
<path> − Defines the path to the database.
<user> − Defines the user you want to connect to the database.
<password> − Defines the password for connecting to the database.
<storage-type> − Defines the storage types. You can choose between PLOCAL and
MEMORY.
Example
You can use the following command to create a local database named demo.
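A sketch of the command; the plocal path is illustrative:
orientdb> CREATE DATABASE plocal:/opt/orientdb/databases/demo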
If the database is successfully created, you will get the following output.
Database created successfully.
OrientDB also provides an Alter Database command to change the configuration of an existing database. If the command is executed successfully, you will get the following output.
Database updated successfully
Example
We have already created a database named 'demo' in the previous section. In this example, we will connect to it using the user admin.
You can use the following command to connect to demo database.
You can use the following command to connect to demo database.
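A sketch of the command, assuming the illustrative path used above and the default admin/admin credentials:
orientdb> CONNECT plocal:/opt/orientdb/databases/demo admin admin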
The following statement is the basic syntax of the Drop database command.
DROP DATABASE [<database-name> <server-username> <server-user-password>]
Following are the details about the options in the above syntax.
<database-name> − Database name you want to drop.
<server-username> − Username of the database who has the privilege to drop a database.
<server-user-password> − Password of the particular user.
In this example, we will use the same database named 'demo' that we created earlier. You can use the following command to drop the database demo.
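Since we are already connected to demo, the command can be run on the current database:
orientdb {db = demo}> DROP DATABASE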
If this command is successfully executed, you will get the following output.
Database 'demo' deleted successfully
INSERT RECORD
The following statement is the basic syntax of the Insert Record command.
INSERT INTO [class:]<class>|cluster:<cluster>|index:<index>
[(<field>[,]*) VALUES (<expression>[,]*)[,]*]|
[SET <field> = <expression>|<sub-command>[,]*]|
[CONTENT {<JSON>}]
[RETURN <expression>]
[FROM <query>]
Following are the details about the options in the above syntax.
SET − Defines each field along with the value.
CONTENT − Defines JSON data to set field values. This is optional.
RETURN − Defines the expression to return instead of number of records inserted. The most
common use cases are −
@rid − Returns the Record ID of the new record.
@this − Returns the entire new record.
INSERT INTO Customer (id, name, age) VALUES (04,'javeed', 21), (05,'raja', 29)
SELECT COMMAND
The following statement is the basic syntax of the SELECT command.
SELECT [ <Projections> ] [ FROM <Target> [ LET <Assignment>* ] ]
[ WHERE <Condition>* ]
[ GROUP BY <Field>* ]
[ ORDER BY <Fields>* [ ASC|DESC ] * ]
[ UNWIND <Field>* ]
[ SKIP <SkipRecords> ]
[ LIMIT <MaxRecords> ]
[ FETCHPLAN <FetchPlan> ]
[ TIMEOUT <Timeout> [ <STRATEGY> ] ]
[ LOCK default|record ]
[ PARALLEL ]
[ NOCACHE ]
Following are the details about the options in the above syntax.
<Projections> − Indicates the data you want to extract from the query as a result records set.
FROM − Indicates the object to query. This can be a class, cluster, single Record ID, set of
Record IDs. You can specify all these objects as target.
WHERE − Specifies the condition to filter the result-set.
LET − Indicates the context variable which are used in projections, conditions or sub queries.
GROUP BY − Indicates the field to group the records.
ORDER BY − Indicates the field to arrange the records in order.
UNWIND − Designates the field on which to unwind the collection of records.
SKIP − Defines the number of records you want to skip from the start of the result-set.
LIMIT − Indicates the maximum number of records in the result-set.
FETCHPLAN − Specifies the strategy defining how you want to fetch results.
TIMEOUT − Defines the maximum time in milliseconds for the query.
LOCK − Defines the locking strategy. DEFAULT and RECORD are the available lock
strategies.
PARALLEL − Executes the query against 'x' concurrent threads.
NOCACHE − Defines whether you want to use cache or not.
Example
Method 1 − You can use the following query to select all records from the Customer table.
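One form of the query (SELECT * FROM Customer is equivalent):
orientdb {db = demo}> SELECT FROM Customer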
UPDATE QUERY
Update Record command is used to modify the value of a particular record. SET is the basic
command to update a particular field value.
The following statement is the basic syntax of the Update command.
UPDATE <class>|cluster:<cluster>|<recordID>
[SET|INCREMENT|ADD|REMOVE|PUT <field-name> = <field-value>[,]*] |[CONTENT|
MERGE <JSON>]
[UPSERT]
[RETURN <returning> [<returning-expression>]]
[WHERE <conditions>]
[LOCK default|record]
[LIMIT <max-records>] [TIMEOUT <timeout>]
Following are the details about the options in the above syntax.
SET − Defines the field to update.
INCREMENT − Increments the specified field value by the given value.
ADD − Adds the new item in the collection fields.
REMOVE − Removes an item from the collection field.
PUT − Puts an entry into map field.
CONTENT − Replaces the record content with JSON document content.
MERGE − Merges the record content with a JSON document.
LOCK − Specifies how to lock the records between load and update. There are two options: Default and Record.
UPSERT − Updates a record if it exists or inserts a new record if it doesn't. It helps in executing a single query in place of two queries.
RETURN − Specifies an expression to return instead of the number of records.
LIMIT − Defines the maximum number of records to update.
TIMEOUT − Defines the time you want to allow the update run before it times out.
Try the following query to update the age of the customer 'Raja'.
Orientdb {db = demo}> UPDATE Customer SET age = 28 WHERE name = 'Raja'
Truncate
Truncate Record command is used to delete the values of a particular record.
The following statement is the basic syntax of the Truncate command.
TRUNCATE RECORD <rid>*
Where <rid>* indicates the Record ID to truncate. You can use multiple Rids separated by
comma to truncate multiple records. It returns the number of records truncated.
Try the following query to truncate the record having Record ID #11:4.
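Using the Record ID given above:
orientdb {db = demo}> TRUNCATE RECORD #11:4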
OrientDB Features
OrientDB is a multi-model NoSQL DBMS that combines the features of document and graph databases, providing more functionality and flexibility, while being powerful enough to replace your operational DBMS.
SPEED
OrientDB was engineered from the ground up with performance as a key specification. It's fast on both read and write operations, storing up to 120,000 records per second.
ENTERPRISE
Incremental backups
Unmatched security
24x7 Support
Query Profiler
Distributed Clustering configuration
Metrics Recording
Live Monitor with configurable alerts
With a master-slave architecture, the master often becomes the bottleneck. With OrientDB,
throughput is not limited by a single server. Global throughput is the sum of the throughput
of all the servers.