Unit - III

The document discusses NoSQL databases and MongoDB. It explains CAP theorem, database sharding, and MongoDB operations like insert, update, delete and query. It also provides details about MongoDB concepts like databases, collections and documents.

UNIT III NOSQL DATABASES

NoSQL – CAP Theorem – Sharding – Document based – MongoDB Operation: Insert,
Update, Delete, Query, Indexing, Application, Replication, Sharding – Cassandra: Data Model,
Key Space, Table Operations, CRUD Operations, CQL Types – HIVE: Data types, Database
Operations, Partitioning – HiveQL – OrientDB Graph database – OrientDB Features.

CAP theorem
It is important to understand the limitations of NoSQL databases. A distributed NoSQL
database cannot provide both consistency and high availability while a network partition is in
progress. This was first expressed by Eric Brewer in the CAP theorem.
The CAP theorem, also called Brewer's theorem, states that a distributed database can provide
at most two out of three guarantees at the same time: Consistency, Availability and Partition
Tolerance.
Consistency means that all nodes in the network see the same data at the same time, so every
read returns the most recent write.
Availability is a guarantee that every request receives a response about whether it was
successful or failed. However, it does not guarantee that a read request returns the most recent
write. A highly available system keeps answering requests even when some of its nodes are down.
Partition Tolerance is a guarantee that the system continues to operate despite arbitrary
message loss or failure of part of the system. In other words, even if there is a network outage
in the data center and some of the computers are unreachable, the system continues to
perform.
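
The trade-off can be illustrated with a small Python simulation. This is only a sketch under simplifying assumptions: the Replica and Cluster classes, the two-replica setup and the "CP"/"AP" mode labels are hypothetical names chosen for this example and do not correspond to the API of any real database.

```python
class Replica:
    """One node holding its own copy of the data."""
    def __init__(self):
        self.data = {}


class Cluster:
    """Two replicas that copy every write to each other while connected."""
    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (labels used only in this sketch)
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False    # True while the replicas cannot reach each other

    def write(self, replica, key, value):
        if self.partitioned:
            if self.mode == "CP":
                return False        # refuse: consistency is kept, availability is lost
            replica.data[key] = value
            return True             # accept locally: the replicas may now disagree
        self.a.data[key] = value
        self.b.data[key] = value
        return True

    def read(self, replica, key):
        return replica.data.get(key)


cp = Cluster("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
cp_accepted = cp.write(cp.a, "x", 2)    # rejected during the partition

ap = Cluster("AP")
ap.write(ap.a, "x", 1)
ap.partitioned = True
ap_accepted = ap.write(ap.a, "x", 2)    # accepted, but only on replica a
```

In the CP cluster both replicas still agree on the old value, at the price of a rejected write. In the AP cluster the write succeeds, but a read from replica b returns stale data until the partition heals.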

What is database sharding?
Sharding is a method for distributing a single dataset across multiple databases, which can
then be stored on multiple machines. This allows larger datasets to be split into smaller chunks
and stored on multiple data nodes, increasing the total storage capacity of the system.

What is the difference between sharding and partitioning?
Sharding and partitioning are both about breaking up a large data set into smaller subsets. The
difference is that sharding implies the data is spread across multiple computers while
partitioning does not. Partitioning is about grouping subsets of data within a single database
instance.

What are the types of sharding?
Sharding architectures:
o Key-Based Sharding. This technique is also known as hash-based sharding. ...
o Horizontal or Range-Based Sharding. In this method, we split the data based on the ranges of
a given value inherent in each entity. ...
o Vertical Sharding. ...
o Directory-Based Sharding.
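
The key-based, range-based and directory-based approaches listed above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database implements sharding: the function names, the shard numbering, the boundary values and the region-to-shard directory are all assumptions made for this example.

```python
import hashlib
from bisect import bisect_right


def hash_shard(key, num_shards):
    """Key-based (hash-based) sharding: hash the key, then take it modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def range_shard(value, boundaries):
    """Range-based sharding: boundaries [100, 200] give shard 0 for values below 100,
    shard 1 for 100..199 and shard 2 for 200 and above."""
    return bisect_right(boundaries, value)


def directory_shard(key, directory):
    """Directory-based sharding: a lookup table records which shard owns each key."""
    return directory[key]


boundaries = [100, 200]                 # made-up range limits
directory = {"EU": 0, "US": 1}          # made-up lookup table
```

Hash-based sharding spreads keys evenly but scatters neighbouring values, whereas range-based sharding keeps neighbouring values together, which helps range queries but can create hot shards.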

NoSQL

The term NoSQL database refers to a non-SQL or non-relational database.

It provides a mechanism for storage and retrieval of data that is modeled in ways other than the
tabular relations used in relational databases. A NoSQL database does not store data in fixed
relational tables. It is generally used for big data and real-time web applications.

Databases can be divided into 3 types:

1. RDBMS (Relational Database Management System)
2. OLAP (Online Analytical Processing)
3. NoSQL (recently developed database)

Advantages of NoSQL

o It supports query languages.
o It provides fast performance.
o It provides horizontal scalability.

What is MongoDB?

MongoDB is an open-source document database that provides high performance, high
availability, and automatic scaling.

MongoDB is a document-oriented database. It is an open-source product, developed and
supported by a company named 10gen (renamed MongoDB Inc. in 2013).

"MongoDB is a scalable, open source, high performance, document-oriented database." -
10gen

MongoDB was designed to work with commodity servers. It is now used by companies of
all sizes, across all industries.

MongoDB Advantages

o MongoDB is schema-less. It is a document database in which one collection can hold
different documents.
o The number of fields, the content and the size of a document may differ from one
document to another.
o The structure of a single object is clear in MongoDB.
o There are no complex joins in MongoDB.
o MongoDB supports deep queries because it provides a powerful dynamic query
facility on documents.
o It is very easy to scale.
o It keeps the working set in internal memory, which is the reason for its fast
access.

Distinctive features of MongoDB

o Easy to use
o Lightweight
o Much faster than an RDBMS for many workloads

Where MongoDB should be used

o Big and complex data
o Mobile and social infrastructure
o Content management and delivery
o User data management
o Data hub

MongoDB Create Database

MongoDB does not provide a dedicated command to create a database.

How and when to create database

The use command switches to the named database. If the database does not exist yet,
MongoDB creates it when you first store data in it.

Syntax:

use DATABASE_NAME

For example, to create a database named "javatpointdb":

>use javatpointdb

To check the currently selected database, use the command db:

>db

To check the database list, use the command show dbs:

>show dbs

The new database does not appear in this list until it contains data, so insert at least one
document into it to display the database.

MongoDB insert documents

In MongoDB, the db.collection.insert() method is used to add or insert new documents into a
collection in your database. Newer shell versions deprecate insert() in favor of insertOne()
and insertMany().

>db.movie.insert({"name":"javatpoint"})

MongoDB Drop Database

The dropDatabase command is used to drop a database. It also deletes the associated data
files. It operates on the current database.

Syntax:

db.dropDatabase()

This deletes the currently selected database. If you have not selected any database,
it deletes the default "test" database.

If you want to delete the database "javatpointdb", select it first and then use the
dropDatabase() command as follows:

>use javatpointdb
>db.dropDatabase()

MongoDB Create Collection

In MongoDB, db.createCollection(name, options) is used to create a collection. Usually you
don't need to create a collection explicitly, because MongoDB creates it automatically when
you insert some documents. This is explained later. First see how to create a collection:

Syntax:

db.createCollection(name, options)

Name: a string that specifies the name of the collection to be created.

Options: a document that specifies the memory size and indexing of the collection. It is
an optional parameter.

To check the created collection, use the command "show collections".

>show collections

How does MongoDB create collection automatically

MongoDB creates collections automatically when you insert some documents. For example:
Insert a document named seomount into a collection named SSSIT. The operation will create
the collection if the collection does not currently exist.

>db.SSSIT.insert({"name" : "seomount"})
>show collections
SSSIT

MongoDB update documents

In MongoDB, the update() method is used to update or modify the existing documents of a
collection. By default it modifies only the first document that matches the selection criteria.

Syntax:

db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)

Example

Consider an example which has a collection name javatpoint. Insert the following documents
in collection:

db.javatpoint.insert(
{
course: "java",
details: {
duration: "6 months",
Trainer: "Sonoo jaiswal"
},
Batch: [ { size: "Small", qty: 15 }, { size: "Medium", qty: 25 } ],
category: "Programming language"
}
)

Update the existing course "java" into "android":

>db.javatpoint.update({'course':'java'},{$set:{'course':'android'}})

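To make the effect of $set concrete, the update can be mimicked on plain Python dictionaries. This is a simulation under stated assumptions, not the MongoDB driver: update_first is a hypothetical helper, it only supports equality criteria on top-level fields, and the second sample document is made up.

```python
def update_first(collection, criteria, update):
    """Apply a {"$set": {...}} update to the first document matching the criteria.

    Returns the number of modified documents (0 or 1), mirroring the
    default single-document behaviour of update().
    """
    for doc in collection:
        if all(doc.get(field) == value for field, value in criteria.items()):
            doc.update(update.get("$set", {}))   # only the listed fields change
            return 1
    return 0


docs = [
    {"course": "java", "category": "Programming language"},
    {"course": "java", "category": "Archived"},   # made-up second match
]
modified = update_first(docs, {"course": "java"}, {"$set": {"course": "android"}})
```

Only the first matching document is changed, and fields that are not named in $set (here category) keep their values.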
MongoDB insert multiple documents

If you want to insert multiple documents into a collection, pass an array of
documents to the db.collection.insert() method.

Create an array of documents

Define a variable named Allcourses that holds an array of documents to insert.

var Allcourses =
[
{
Course: "Java",
details: { Duration: "6 months", Trainer: "Sonoo Jaiswal" },
Batch: [ { size: "Medium", qty: 25 } ],
category: "Programming Language"
},
{
Course: ".Net",
details: { Duration: "6 months", Trainer: "Prashant Verma" },
Batch: [ { size: "Small", qty: 5 }, { size: "Medium", qty: 10 }, ],
category: "Programming Language"
},
{
Course: "Web Designing",
details: { Duration: "3 months", Trainer: "Rashmi Desai" },
Batch: [ { size: "Small", qty: 5 }, { size: "Large", qty: 10 } ],
category: "Programming Language"
}
];

Insert the documents

Pass this Allcourses array to the db.collection.insert() method to perform a bulk insert.

> db.javatpoint.insert( Allcourses );


MongoDB Delete documents

In MongoDB, the db.collection.remove() method is used to delete documents from a
collection. The remove() method takes two parameters.

1. Deletion criteria: A query document that selects the documents to remove from the
collection.

2. JustOne: An optional flag that removes only one document when set to true or 1.

Syntax:

db.collection_name.remove(DELETION_CRITERIA)

Remove all documents

If you want to remove all documents from a collection, pass an empty query document {} to
the remove() method. The remove() method does not remove the indexes.

db.javatpoint.remove({})
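
The two parameters of remove() can be illustrated with a small Python simulation on a list of dictionaries. This is a sketch, not the MongoDB implementation: the remove function below is a hypothetical stand-in that only understands equality criteria, and the sample documents are made up.

```python
def remove(collection, criteria, just_one=False):
    """Delete matching documents in place and return how many were removed.

    An empty criteria dict matches every document; just_one stops after
    the first match.
    """
    removed = 0
    kept = []
    for doc in collection:
        matches = all(doc.get(field) == value for field, value in criteria.items())
        if matches and not (just_one and removed >= 1):
            removed += 1
        else:
            kept.append(doc)
    collection[:] = kept
    return removed


docs = [{"course": "java"}, {"course": "java"}, {"course": ".net"}]
removed_one = remove(docs, {"course": "java"}, just_one=True)
remaining_after_one = len(docs)
removed_all = remove(docs, {})
```

The first call removes a single "java" document even though two match; the second call, with the empty criteria {}, removes everything that is left.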

Indexing in MongoDB:

MongoDB uses indexes to make query processing more efficient. Without an index,
MongoDB must scan every document in the collection and return only those documents that
match the query. Indexes are special data structures that store a small part of each
document's data, so that MongoDB can find the matching documents quickly. The entries of
an index are ordered by the value of the field specified in the index.
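
The difference between a full scan and an ordered index can be sketched in Python. The sorted list below merely stands in for the ordered index structure; build_index, find_with_index and find_with_scan are hypothetical helpers written for this example, and the sample documents are made up.

```python
from bisect import bisect_left, bisect_right


def build_index(docs, field):
    """Return (sorted keys, matching document positions) for one field."""
    pairs = sorted((doc[field], pos) for pos, doc in enumerate(docs))
    keys = [key for key, _ in pairs]
    positions = [pos for _, pos in pairs]
    return keys, positions


def find_with_index(docs, index, value):
    """Binary-search the ordered keys instead of touching every document."""
    keys, positions = index
    lo = bisect_left(keys, value)
    hi = bisect_right(keys, value)
    return [docs[pos] for pos in positions[lo:hi]]


def find_with_scan(docs, field, value):
    """What happens without an index: every document is inspected."""
    return [doc for doc in docs if doc[field] == value]


docs = [
    {"name": "a", "age": 30},
    {"name": "b", "age": 25},
    {"name": "c", "age": 30},
]
age_index = build_index(docs, "age")
```

Both lookups return the same documents, but the indexed lookup needs only a logarithmic number of key comparisons, while the scan grows linearly with the size of the collection.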

Creating an Index:
MongoDB provides a method called createIndex() that allows the user to create an index.
Here 1 requests ascending order; -1 requests descending order.

Syntax:

db.COLLECTION_NAME.createIndex({KEY:1})

Example:

db.mycol.createIndex({"age":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}

In order to drop an index, MongoDB provides the dropIndex() method.

Syntax:

db.NAME_OF_COLLECTION.dropIndex({KEY:1})

The dropIndex() method can delete only one index at a time. To drop multiple indexes
from a collection, MongoDB provides the dropIndexes() method. Called without arguments,
it drops every index except the one on _id; it also accepts an array of index names.

Syntax:

db.NAME_OF_COLLECTION.dropIndexes(["KEY1_1", "KEY2_1"])


Applications of MongoDB

These are some important features of MongoDB:

1. Support ad hoc queries:

In MongoDB, you can search by field, range query and it also supports regular expression
searches.

2. Indexing:

You can index any field in a document.

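3. Replication:

MongoDB supports replication. Older releases offered master-slave replication; current
releases use replica sets, which follow the same idea.

A master (primary) can perform reads and writes, while a slave (secondary) copies data from
the master and can only be used for reads or backup (not writes).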

4. Duplication of data:

MongoDB can run over multiple servers. The data is duplicated to keep the system up and
also keep its running condition in case of hardware failure.

5. Load balancing:

It has an automatic load balancing configuration because of data placed in shards.

6. Supports map reduce and aggregation tools.

7. Uses JavaScript instead of Procedures.

8. It is a schema-less database written in C++.

9. Provides high performance.

10. Stores files of any size easily without complicating your stack.

11. Easy to administer in the case of failures.

12. It also supports:

o JSON data model with dynamic schemas
o Auto-sharding for horizontal scalability
o Built-in replication for high availability

Nowadays many companies use MongoDB to create new types of applications and to improve
performance and availability.

MongoDB Replication Methods

The MongoDB replication methods are used to manage the members of a replica set.

rs.add(host, arbiterOnly)

The add method adds a member to the replica set. We must be connected to the primary of
the replica set to run this method. If the method triggers an election for primary, for
example when we add a new member with a higher priority than the current primary, the
shell's connection is terminated. In that case the mongo shell may report an error even if
the operation succeeds.

Example:

In the following example we add a new secondary member with the default vote.

rs.add( { host: "mongodbd4.example.net:27017" } )


MongoDB Sharding Commands

Sharding is a method of distributing data across different machines. MongoDB uses sharding
to support deployments with very large data sets and high-throughput operations.

MongoDB sh.addShard(<url>) command

This command adds a shard replica set to a sharded cluster. Adding a shard affects the
balance of chunks across the shards of the cluster: the balancer starts transferring chunks
to the new shard to rebalance the cluster.

The <url> argument has the following form:

<replica_set>/<hostname><:port>,<hostname><:port>, ...

Syntax:

sh.addShard("<replica_set>/<hostname><:port>")

Example:

sh.addShard("repl0/mongodb3.example.net:27327")

Output:

The command adds the shard identified by the name of the replica set and the hostname of at
least one member of that replica set.

Cassandra

What is Cassandra?

Apache Cassandra is a highly scalable, high-performance, distributed NoSQL database.
Cassandra is designed to handle huge amounts of data across many commodity servers,
providing high availability without a single point of failure.

Cassandra is a NoSQL database

A NoSQL database is a non-relational database. It is also called Not Only SQL. It is a database
that provides a mechanism to store and retrieve data other than the tabular relations used in
relational databases. These databases are schema-free, support easy replication, have a simple
API, are eventually consistent, and can handle huge amounts of data.

Important Points of Cassandra

o Cassandra is a column-oriented database.
o Cassandra is scalable, consistent, and fault-tolerant.
o Cassandra was created at Facebook. It is totally different from relational database
management systems.
o Cassandra is used by some of the biggest companies, such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.

Cassandra Data Model

The data model in Cassandra is totally different from what we normally see in an RDBMS.
Let's see how Cassandra stores its data.

Cluster

A Cassandra database is distributed over several machines that operate together. The
outermost container is known as the Cluster, which contains different nodes. Every node
contains a replica, and in case of a failure, the replica takes charge. Cassandra arranges the
nodes of a cluster in a ring format and assigns data to them.

Keyspace

Keyspace is the outermost container for data in Cassandra. Following are the basic attributes
of a Keyspace in Cassandra:

o Replication factor: It specifies the number of machines in the cluster that will receive
copies of the same data.
o Replica placement strategy: It is a strategy which specifies how to place replicas in
the ring. There are three types of strategies:

1) Simple strategy (rack-unaware strategy)

2) Old network topology strategy (rack-aware strategy)

3) Network topology strategy (datacenter-shared strategy)

Cassandra Create Keyspace

Cassandra Query Language (CQL) lets developers communicate with Cassandra.
The syntax of Cassandra Query Language is very similar to SQL.

What is Keyspace?

A keyspace is an object that is used to hold column families and user-defined types. A keyspace
is like a database in an RDBMS: it contains column families, indexes, user-defined types, data
center awareness, the strategy used in the keyspace, the replication factor, etc.

In Cassandra, "Create Keyspace" command is used to create keyspace.

Syntax:

CREATE KEYSPACE <identifier> WITH <properties>

Different components of Cassandra Keyspace

Strategy: There are two types of strategy declaration in Cassandra syntax:

o Simple Strategy: Simple strategy is used in the case of one data center. In this
strategy, the first replica is placed on the selected node and the remaining replicas are
placed on the next nodes in clockwise direction in the ring, without considering rack or
node location.
o Network Topology Strategy: This strategy is used in the case of more than one data
center. In this strategy, you have to provide the replication factor for each data center
separately.

Replication Factor: Replication factor is the number of replicas of data placed on different
nodes. A replication factor greater than two is good for avoiding a single point of failure.
So, 3 is a good replication factor.
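
The clockwise placement used by the simple strategy can be sketched in Python. This is an illustration of the rule described above, not Cassandra's code: the function name, the node names and the way the first node is chosen are assumptions made for this example.

```python
def simple_strategy_replicas(ring, first_index, replication_factor):
    """Pick the selected node, then walk clockwise around the ring.

    ring is the list of nodes in ring order; first_index is the position of
    the node that owns the row. Rack and data center are ignored, as in the
    simple strategy.
    """
    count = min(replication_factor, len(ring))   # never place two copies on one node
    return [ring[(first_index + step) % len(ring)] for step in range(count)]


ring = ["node1", "node2", "node3", "node4", "node5"]   # made-up node names
replicas = simple_strategy_replicas(ring, 3, 3)
```

With a replication factor of 3 and the first replica on node4, the walk wraps around the end of the ring, so the copies land on node4, node5 and node1.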

Example:

Let's take an example to create a keyspace named "javatpoint".

CREATE KEYSPACE javatpoint
WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};

Keyspace is created now.

Using a Keyspace

To use the created keyspace, you have to use the USE command.

Syntax:

USE <identifier>

Cassandra Alter Keyspace

The "ALTER KEYSPACE" command is used to alter the replication factor, strategy name and
durable writes properties of an existing keyspace in Cassandra.

Syntax:

ALTER KEYSPACE <identifier> WITH <properties>


Cassandra Drop Keyspace

In Cassandra, the "DROP KEYSPACE" command is used to drop a keyspace with all its data,
column families, user-defined types and indexes.

Syntax:

DROP KEYSPACE KeyspaceName;


Cassandra Create Table

In Cassandra, the CREATE TABLE command is used to create a table. Here, a column family is
used to store data just like a table in an RDBMS.

So, you can say that the CREATE TABLE command is used to create a column family in
Cassandra.

Syntax:

CREATE TABLE tablename(
column1_name datatype PRIMARY KEY,
column2_name datatype,
column3_name datatype
)

There are two types of primary keys:

Single primary key: Use the following syntax for a single primary key.

Primary key (ColumnName)

Compound primary key: Use the following syntax for a compound primary key.

Primary key(ColumnName1,ColumnName2 . . .)

Example:

Let's take an example to demonstrate the CREATE TABLE command.

Here, we are using already created Keyspace "javatpoint".

CREATE TABLE student(
student_id int PRIMARY KEY,
student_name text,
student_city text,
student_fees varint,
student_phone varint
);

SELECT * FROM student;


Cassandra Alter Table

ALTER TABLE command is used to alter the table after creating it. You can use the ALTER
command to perform two types of operations:

o Add a column
o Drop a column

Syntax:

ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>

Adding a Column

You can add a column to the table by using the ALTER command. While adding a column, you
have to make sure that the column name does not conflict with the existing column names and
that the table is not defined with the compact storage option.

Syntax:

ALTER TABLE table_name
ADD new_column datatype;

Example:

After using the following command:

ALTER TABLE student
ADD student_email text;

A new column is added. You can check it by using the SELECT command.

Dropping a Column

You can also drop an existing column from a table by using the ALTER command. You should
check that the table is not defined with the compact storage option before dropping a column
from a table.

Syntax:

ALTER TABLE table_name
DROP column_name;

Example:

After using the following command:

ALTER TABLE student
DROP student_email;

The column named "student_email" is now dropped.

If you want to drop multiple columns, separate the column names with ",".

Cassandra DROP table

The DROP TABLE command is used to drop a table.

Syntax:

DROP TABLE <tablename>

Example:

After using the following command:

DROP TABLE student;

The table named "student" is now dropped. You can use the DESCRIBE command to verify
whether the table has been deleted. Since the student table has been deleted, you will not
find it in the column families list.

Cassandra Truncate Table

The TRUNCATE command is used to truncate a table. If you truncate a table, all the rows of the
table are deleted permanently.

Syntax:

TRUNCATE <tablename>

Example:

TRUNCATE student;

Cassandra Batch

In Cassandra, BATCH is used to execute multiple modification statements (insert, update,
delete) together as a single operation. It is very useful when you have to update some columns
as well as delete some of the existing data.

Syntax:

BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH;

Use of WHERE Clause

The WHERE clause is used with the SELECT command to specify exactly which rows we
have to fetch.

Syntax:

SELECT * FROM <table name> WHERE <condition>;

Example:

SELECT * FROM student WHERE student_id=2;

Cassandra Update Data

The UPDATE command is used to update data in a Cassandra table. If you see no result after
updating the data, the data has been updated successfully; otherwise an error is returned.
While updating data in a Cassandra table, the following keywords are commonly used:

o Where: The WHERE clause is used to select the row that you want to update.
o Set: The SET clause is used to set the value.
o Must: The WHERE clause must include all the columns composing the primary key.

Syntax:

UPDATE <tablename>
SET <column name> = <new value>,
<column name> = <value>....
WHERE <condition>

Cassandra DELETE Data

The DELETE command is used to delete data from a Cassandra table. You can delete a complete
row or a selected column of a row by using this command.

Syntax:

DELETE FROM <identifier> WHERE <condition>;

Delete an entire row

To delete the entire row with student_id 3, use the following command:

DELETE FROM student WHERE student_id=3;

Delete a specific column

Example:

Delete the student_fees where student_id is 4.

DELETE student_fees FROM student WHERE student_id=4;


HAVING Clause in SQL

The HAVING clause places a condition on the groups defined by the GROUP BY clause in
the SELECT statement.

This SQL clause is written after the 'GROUP BY' clause in the 'SELECT' statement.

This clause is used in SQL because we cannot use the WHERE clause with the SQL
aggregate functions. Both WHERE and HAVING clauses are used for filtering the records in
SQL queries: WHERE filters individual rows before grouping, and HAVING filters the groups.

Syntax of HAVING clause in SQL

SELECT column_Name1, column_Name2, ....., column_NameN, aggregate_function_name(column_Name)
FROM table_name
GROUP BY column_Name1
HAVING condition;

The following query uses only GROUP BY:

SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City;

The following query adds the HAVING clause:

SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City
HAVING SUM(Emp_Salary)>12000;
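
The two queries above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the Employee rows are made-up sample data, and an ORDER BY is appended only so that the result order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Emp_Name TEXT, Emp_Salary INTEGER, Emp_City TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [
        ("A", 8000, "Delhi"),   # made-up sample rows
        ("B", 6000, "Delhi"),
        ("C", 5000, "Pune"),
        ("D", 4000, "Pune"),
    ],
)

# GROUP BY alone returns one total per city.
grouped = conn.execute(
    "SELECT SUM(Emp_Salary), Emp_City FROM Employee "
    "GROUP BY Emp_City ORDER BY Emp_City"
).fetchall()

# HAVING keeps only the groups whose total exceeds 12000.
filtered = conn.execute(
    "SELECT SUM(Emp_Salary), Emp_City FROM Employee "
    "GROUP BY Emp_City HAVING SUM(Emp_Salary) > 12000 ORDER BY Emp_City"
).fetchall()
```

Delhi totals 14000 and Pune totals 9000, so the HAVING condition removes the Pune group from the result.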

MIN Function with HAVING Clause:

If you want to show each department and the minimum salary in each department, you have
to write the following query. A HAVING condition on MIN(Emp_Salary) can then be appended
to filter the departments.

SELECT MIN(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;

MAX Function with HAVING Clause:

SELECT MAX(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;

AVG Function with HAVING Clause:

SELECT AVG(Emp_Salary), Emp_Dept FROM Employee_Dept GROUP BY Emp_Dept;

SQL ORDER BY Clause

o Whenever we want to sort the records based on the columns stored in the tables of the
SQL database, we use the ORDER BY clause in SQL.
o The ORDER BY clause sorts the records based on a specific column of a table. The values
in the column to which ORDER BY is applied are sorted, and the other column values of
each record are displayed in the same sequence.

Syntax to sort the records in ascending order:

SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY
ColumnName ASC;

Syntax to sort the records in descending order:

SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY
ColumnName DESC;

Syntax to sort the records in ascending order without using ASC keyword:

SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY
ColumnName;
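
The three forms can be compared with Python's built-in sqlite3 module. This is a small sketch on made-up data; the Employee table and its rows exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Emp_Name TEXT, Emp_Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Asha", 5000), ("Ravi", 8000), ("Meena", 6500)],   # made-up sample rows
)


def names(query):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(query)]


ascending = names("SELECT Emp_Name FROM Employee ORDER BY Emp_Salary ASC")
descending = names("SELECT Emp_Name FROM Employee ORDER BY Emp_Salary DESC")
default_order = names("SELECT Emp_Name FROM Employee ORDER BY Emp_Salary")
```

Leaving out the keyword gives the same result as ASC, because ascending order is the default.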

Index | Cassandra | MongoDB
1) | Cassandra is a high-performance distributed database system. | MongoDB is a cross-platform document-oriented database system.
2) | Cassandra is written in Java. | MongoDB is written in C++.
3) | Cassandra stores data in tabular form, like the SQL format. | MongoDB stores data in JSON format.
4) | Cassandra is licensed under the Apache License. | MongoDB is licensed under the AGPL (newer releases use the SSPL), and its drivers under the Apache License.
5) | Cassandra is mainly designed to handle large amounts of data across many commodity servers. | MongoDB is designed to deal with JSON-like documents and to make access from applications easier and faster.
6) | Cassandra provides high availability with no single point of failure. | MongoDB is easy to administer in the case of failure.

Hive
What is HIVE?

Hive is a data warehouse system which is used to analyze structured data. It is built on top
of Hadoop. It was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which are
internally converted to MapReduce jobs.

Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
User Defined Functions (UDF).

Features of Hive

o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce
or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs), through which users can provide their own
functionality.

HIVE Data Types

Hive data types are categorized into numeric types, string types, misc types, and complex types.
A list of Hive data types is given below.

Integer Types

Type | Size | Range
TINYINT | 1-byte signed integer | -128 to 127
SMALLINT | 2-byte signed integer | -32,768 to 32,767
INT | 4-byte signed integer | -2,147,483,648 to 2,147,483,647
BIGINT | 8-byte signed integer | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
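
The ranges in the table follow directly from the storage size: an n-byte signed integer covers -2^(8n-1) to 2^(8n-1) - 1. A short Python check (signed_range is a helper name made up for this example):

```python
def signed_range(num_bytes):
    """Return (minimum, maximum) of a two's-complement integer of the given size."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


# Sizes taken from the table above.
hive_integer_ranges = {
    "TINYINT": signed_range(1),
    "SMALLINT": signed_range(2),
    "INT": signed_range(4),
    "BIGINT": signed_range(8),
}
```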

Decimal Type

Type | Size | Description
FLOAT | 4-byte | Single precision floating point number
DOUBLE | 8-byte | Double precision floating point number

Date/Time Types

TIMESTAMP

o It supports the traditional UNIX timestamp with optional nanosecond precision.
o As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
o As a floating point numeric type, it is interpreted as a UNIX timestamp in seconds with
decimal precision.
o As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal place precision).

DATES

The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD.
However, it does not provide the time of the day. The range of the Date type lies between
0000-01-01 and 9999-12-31.

String Types

STRING

A string is a sequence of characters. Its values can be enclosed within single quotes (') or
double quotes (").

Varchar

The varchar is a variable-length type whose length specifier lies between 1 and 65535 and
specifies the maximum number of characters allowed in the character string.

CHAR

The char is a fixed-length type whose maximum length is fixed at 255.

Complex Type

Type | Description | Example
Struct | It is similar to a C struct or an object where fields are accessed using the "dot" notation. | struct('James','Roy')
Map | It contains key-value tuples where the fields are accessed using array notation. | map('first','James','last','Roy')
Array | It is a collection of values of a similar type that are indexable using zero-based integers. | array('James','Roy')

Hive - Create Database

In Hive, a database is considered a catalog or namespace of tables. So, we can maintain
multiple tables within a database, where a unique name is assigned to each table. Hive also
provides a default database named default.

o Initially, we check the default database provided by Hive. To check the list of
existing databases, use the following command: -

hive> show databases;

o Let's create a new database by using the following command: -

hive> create database demo;

Hive - Drop Database

In this section, we will see how to drop an existing database.

o Drop the database by using the following command: -

hive> drop database demo;


Hive - Create Table

In Hive, we can create a table by using conventions similar to SQL. Hive offers a lot of
flexibility in where the data files for tables are stored. It provides two types of table: -

o Internal table
o External table

Internal Table

The internal tables are also called managed tables, as the lifecycle of their data is controlled by
Hive. By default, these tables are stored in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). The internal tables are not flexible
enough to share with other tools like Pig. If we drop an internal table, Hive deletes
both the table schema and the data.

o Let's create an internal table by using the following command:-

hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;

o Let's see the metadata of the created table by using the following command:-

hive> describe demo.employee;

External Table

The external table allows us to create and access a table whose data is stored externally.
The external keyword is used to specify the external table, whereas the location keyword is
used to specify the location of the loaded data.

As the table is external, the data is not present in the Hive directory. Therefore, if we
drop the table, the metadata of the table is deleted, but the data still exists.

Let's create an external table using the following command: -

hive> create external table emplist (Id int, Name string , Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';

we can use the following command to retrieve the data: -

select * from emplist;


Hive - Load Data

Once the internal table has been created, the next step is to load data into it. In Hive,
we can easily load data from a file into a table.

o Let's load the data of the file into the table by using the following command: -

hive> load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;

Hive - Drop Table

Hive lets us drop a table by using the SQL drop table command. Let's follow the
steps below to drop a table from the database.

o Let's check the list of existing databases by using the following command: -

hive> show databases;

o Select the database that contains the table: -

hive> use demo;

o Check the list of existing tables in that database: -

hive> show tables;

o Drop the table by using the following command: -

hive> drop table new_employee;

Hive - Alter Table

In Hive, we can perform modifications in the existing table like changing the table name,
column name, comments, and table properties. It provides SQL like commands to alter the
table.

Rename a Table

If we want to change the name of an existing table, we can rename that table by using the
following signature: -

Alter table old_table_name rename to new_table_name;


o Now, change the name of the table by using the following command: -

Alter table emp rename to employee_data;


Adding column

In Hive, we can add one or more columns in an existing table by using the following
signature:

Alter table table_name add columns(column_name datatype);


o Now, add a new column to the table by using the following command: -

Alter table employee_data add columns (age int);



Change Column

In Hive, we can rename a column, change its type and position. Here, we are changing the
name of the column by using the following signature: -

Alter table table_name change old_column_name new_column_name datatype;


o Now, change the name of the column by using the following command: -

Alter table employee_data change name first_name string;


Delete or Replace Column

Hive does not provide a command to drop a column directly. Instead, we delete one or more
columns by replacing the whole column list with only the columns we want to keep.

o Let's see the existing schema of the table.

o Now, drop a column from the table.

alter table employee_data replace columns( id string, first_name string, age int);

Partitioning in Hive

Partitioning in Hive means dividing a table into parts based on the values of a particular
column, such as date, course, city or country. Because each partition is stored separately, a
query that filters on the partition column reads only the matching partitions, which shortens
its response time.

The partitioning in Hive can be executed in two ways –


o Static partitioning
o Dynamic partitioning

Static Partitioning

In static or manual partitioning, it is required to pass the values of partitioned columns
manually while loading the data into the table. Hence, the data file doesn't contain the
partitioned columns.

Example of Static Partitioning

o First, select the database in which we want to create a table.

hive> use test;

o Create the table and provide the partitioned columns by using the following
command: -

hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

hive> describe student;


o Load the data into the table and pass the values of partition columns with it by using
the following command: -

hive> load data local inpath '/home/codegyani/hive/student_details1' into table student
partition(course= "java");

Here, we are partitioning the students of an institute based on courses.

o Load the data of another file into the same table and pass the values of partition
columns with it by using the following command: -

hive> load data local inpath '/home/codegyani/hive/student_details2' into table student
partition(course= "hadoop");

hive> select * from student;


o Now, try to retrieve the data based on partitioned columns by using the following
command: -

hive> select * from student where course="java";
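
To see why a query on the partitioned column can be answered quickly, it helps to picture how the two loads above are laid out: Hive keeps each partition in its own subdirectory named column=value under the table directory. The Python sketch below only simulates that layout in memory. The warehouse path, the helper names and the sample rows are assumptions made for illustration, not output from Hive.

```python
# Simulated layout of a Hive table partitioned by `course`.
# The warehouse path and the rows are hypothetical.
WAREHOUSE = "/user/hive/warehouse/test.db/student"


def partition_dir(course):
    """Directory that holds all rows loaded with partition(course=...)."""
    return f"{WAREHOUSE}/course={course}"


# Each static load places its file under exactly one partition directory.
partitions = {
    partition_dir("java"): [(1, "Gaurav", 24, "javatpoint")],
    partition_dir("hadoop"): [(2, "John", 22, "javatpoint")],
}


def select_where_course(course):
    """Mimic `select * from student where course=...`.

    Only the matching directory is read (partition pruning), and the
    partition value is appended to every row as the last column.
    """
    rows = partitions.get(partition_dir(course), [])
    return [row + (course,) for row in rows]


print(select_where_course("java"))
```

Note that the stored rows have only four fields; the course value comes from the directory name, which is why the data file in static partitioning does not contain the partitioned column.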


Dynamic Partitioning

In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not
required to pass the values of partitioned columns manually.

o First, select the database in which we want to create a table.

hive> use show;


o Enable the dynamic partition by using the following commands: -

hive> set hive.exec.dynamic.partition=true;


hive> set hive.exec.dynamic.partition.mode=nonstrict;

o Create a dummy table to store the data.



hive> create table stud_demo(id int, name string, age int, institute string, course string)
row format delimited
fields terminated by ',';

o Now, load the data into the table.

hive> load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;

o Create a partition table by using the following command: -

hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

o Now, insert the data of dummy table into the partition table.

hive> insert into student_part
partition(course)
select id, name, age, institute, course
from stud_demo;
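
The effect of the insert above can be sketched in a few lines of Python: the partition is not named by the user but read from the last selected column of each row. The rows and the function name insert_dynamic are hypothetical, and the sketch models only the routing of rows, not how Hive executes the statement.

```python
from collections import defaultdict

# Hypothetical rows of stud_demo: (id, name, age, institute, course).
stud_demo = [
    (1, "Asha", 21, "ABC", "java"),
    (2, "Ravi", 23, "ABC", "hadoop"),
    (3, "Meena", 22, "XYZ", "java"),
]


def insert_dynamic(rows):
    """Route each row to the partition named by its last column."""
    partitions = defaultdict(list)
    for *data, course in rows:
        # The partition column is stored in the partition key, not in the row.
        partitions[course].append(tuple(data))
    return dict(partitions)


student_part = insert_dynamic(stud_demo)
print(sorted(student_part))  # partition values discovered from the data
```

This is why the partitioned column must appear last in the select list: its value decides where the rest of the row is written.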

OrientDB Graph database

What is Graph?

A graph is a pictorial representation of a set of objects in which some pairs of objects are
connected by links. A graph contains two elements: nodes (vertices) and relationships (edges).

What is a Graph database?

A graph database is a database that models data in the form of a graph. It stores any kind
of data using:

o Nodes
o Relationships
o Properties

Nodes: Nodes are the records/data in graph databases. Data is stored as properties and
properties are simple name/value pairs.

Relationships: It is used to connect nodes. It specifies how the nodes are related.

o Relationships always have direction.
o Relationships always have a type.
o Relationships form patterns of data.

Properties: Properties are named data values.
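
The three building blocks can be sketched as plain Python data classes. This is a minimal illustration of the model described above, not the storage format of any particular graph database; the class names, the KNOWS relationship type and the sample values are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Properties are simple name/value pairs.
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node  # direction: start -> end
    end: Node
    type: str    # every relationship has a type
    properties: dict = field(default_factory=dict)


alice = Node({"name": "Alice"})
bob = Node({"name": "Bob"})
knows = Relationship(alice, bob, "KNOWS", {"since": 2020})

print(knows.start.properties["name"], knows.type, knows.end.properties["name"])
```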

Popular Graph Databases

Neo4j is among the most popular graph databases. Other graph databases include:

o Oracle NoSQL Database
o OrientDB
o HyperGraphDB
o GraphBase
o InfiniteGraph
o AllegroGraph

Graph Database vs. RDBMS

Differences between Graph database and RDBMS:

Index | Graph Database | RDBMS
1. | In graph database, data is stored in graphs. | In RDBMS, data is stored in tables.
2. | In graph database there are nodes. | In RDBMS, there are rows.
3. | In graph database there are properties and their values. | In RDBMS, there are columns and data.
4. | In graph database the connected nodes are defined by relationships. | In RDBMS, constraints are used instead of that.
5. | In graph database traversal is used instead of join. | In RDBMS, join is used instead of traversal.
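
The last row of the comparison, traversal versus join, can be illustrated with a small Python sketch. The names and the friendship data are invented for the example; the point is only that a join searches a table of pairs for every hop, whereas a traversal follows links already stored on each node.

```python
# Relational style: a table of pairs that is searched for every hop.
friend_rows = [("alice", "bob"), ("bob", "carol"), ("alice", "dave")]


def friends_by_join(person):
    """Scan the whole table and keep the rows that match."""
    return [b for a, b in friend_rows if a == person]


# Graph style: each node keeps direct links to its neighbours.
links = {"alice": ["bob", "dave"], "bob": ["carol"], "carol": [], "dave": []}


def reachable(start):
    """Depth-first traversal that follows outgoing links only."""
    seen, stack = [], [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(reversed(links[node]))
    return seen


print(friends_by_join("alice"))
print(reachable("alice"))
```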

MongoDB vs OrientDB

MongoDB and OrientDB share many features, but their engines are fundamentally different.
MongoDB is a pure document database, whereas OrientDB is a hybrid document database with a
graph engine.

Features | MongoDB | OrientDB
Relationships | Relates entities through join-style lookups between documents, as an RDBMS does. This has a high runtime cost and does not scale as the database grows. | Embeds and connects documents. It uses direct, super-fast links taken from the graph database world.
Fetch Plan | Costly JOIN operations. | Easily returns a complete graph with interconnected documents.
Transactions | Did not support multi-document ACID transactions before version 4.0, but has always supported atomic operations on a single document. | Supports ACID transactions as well as atomic operations.
Query language | Has its own language based on JSON. | Query language is built on SQL.
Indexes | Uses the B-Tree algorithm for all indexes. | Supports three different indexing algorithms so that the user can achieve the best performance.
Storage engine | Uses a memory mapping technique. | Uses the storage engines named LOCAL and PLOCAL.

The following table illustrates the comparison between relational model, document model,
and OrientDB document model −

Relational Model | Document Model | OrientDB Document Model
Table | Collection | Class or Cluster
Row | Document | Document
Column | Key/value pair | Document field
Relationship | Not available | Link

The SQL Reference of the OrientDB database provides several commands to create, alter, and
drop databases.

Create database

The following statement is the basic syntax of the Create Database command.

CREATE DATABASE <database-url> [<user> <password> <storage-type> [<db-type>]]



Following are the details about the options in the above syntax.
<database-url> − Defines the URL of the database. URL contains two parts, one is <mode>
and the second one is <path>.
<mode> − Defines the mode, i.e. local mode or remote mode.
<path> − Defines the path to the database.
<user> − Defines the user you want to connect to the database.
<password> − Defines the password for connecting to the database.
<storage-type> − Defines the storage types. You can choose between PLOCAL and
MEMORY.
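
As a small illustration of the <mode> and <path> parts of a database URL, the following Python sketch splits a URL at its first colon. The function parse_database_url is a hypothetical helper written for this example; it is not part of OrientDB or of any OrientDB client library.

```python
def parse_database_url(url):
    """Split '<mode>:<path>' at the first colon."""
    mode, sep, path = url.partition(":")
    if not sep or not mode or not path:
        raise ValueError(f"expected <mode>:<path>, got {url!r}")
    # The mode is compared case-insensitively here, so normalise it.
    return mode.lower(), path


print(parse_database_url("PLOCAL:/opt/orientdb/databases/demo"))
```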

Example

You can use the following command to create a local database named demo.

orientdb> CREATE DATABASE PLOCAL:/opt/orientdb/databases/demo

If the database is successfully created, you will get the following output.
Database created successfully.

Current database is: plocal:/opt/orientdb/databases/demo

orientdb {db = demo}>


The following statement is the basic syntax of the Alter Database command.
ALTER DATABASE <attribute-name> <attribute-value>
Where <attribute-name> defines the attribute that you want to modify and
<attribute-value> defines the value you want to set for that attribute.

orientdb> ALTER DATABASE custom strictSQL = false

If the command is executed successfully, you will get the following output.
Database updated successfully

The following statement is the basic syntax of the Connect command.


CONNECT <database-url> <user> <password>
Following are the details about the options in the above syntax.
<database-url> − Defines the URL of the database. URL contains two parts one is <mode>
and the second one is <path>.
<mode> − Defines the mode, i.e. local mode or remote mode.
<path> − Defines the path to the database.
<user> − Defines the user you want to connect to the database.
<password> − Defines the password for connecting to the database.

Example

We have already created a database named 'demo' in the earlier example. In this example,
we will connect to it as the user admin.
You can use the following command to connect to the demo database.

orientdb> CONNECT PLOCAL:/opt/orientdb/databases/demo admin admin


If it is successfully connected, you will get the following output −
Connecting to database [plocal:/opt/orientdb/databases/demo] with user 'admin'…OK
orientdb {db = demo}>

The following statement is the basic syntax of the List Databases command.

LIST DATABASES

The following statement is the basic syntax of the Drop database command.
DROP DATABASE [<database-name> <server-username> <server-user-password>]
Following are the details about the options in the above syntax.
<database-name> − Database name you want to drop.
<server-username> − Username of a server user who has the privilege to drop a database.
<server-user-password> − Password of the particular user.

In this example, we will use the same database named 'demo' that we created earlier.
You can use the following command to drop the demo database.

orientdb {db = demo}> DROP DATABASE

If this command is successfully executed, you will get the following output.
Database 'demo' deleted successfully

INSERT RECORD

The following statement is the basic syntax of the Insert Record command.
INSERT INTO [class:]<class>|cluster:<cluster>|index:<index>
[(<field>[,]*) VALUES (<expression>[,]*)[,]*]|
[SET <field> = <expression>|<sub-command>[,]*]|
[CONTENT {<JSON>}]
[RETURN <expression>]
[FROM <query>]
Following are the details about the options in the above syntax.
SET − Defines each field along with the value.
CONTENT − Defines JSON data to set field values. This is optional.
RETURN − Defines the expression to return instead of the number of records inserted. The most
common use cases are −
o @rid − Returns the Record ID of the new record.
o @this − Returns the entire new record.
FROM − Defines the query whose result set you want to insert.


The following command is to insert the first record into the Customer table.

INSERT INTO Customer (id, name, age) VALUES (01,'satish', 25)


The following command is to insert the second record into the Customer table.

INSERT INTO Customer SET id = 02, name = 'krishna', age = 26


The following command is to insert the next two records into the Customer table.

INSERT INTO Customer (id, name, age) VALUES (04,'javeed', 21), (05,'raja', 29)

SELECT COMMAND
The following statement is the basic syntax of the SELECT command.
SELECT [ <Projections> ] [ FROM <Target> [ LET <Assignment>* ] ]
[ WHERE <Condition>* ]
[ GROUP BY <Field>* ]
[ ORDER BY <Fields>* [ ASC|DESC ] * ]
[ UNWIND <Field>* ]
[ SKIP <SkipRecords> ]
[ LIMIT <MaxRecords> ]
[ FETCHPLAN <FetchPlan> ]
[ TIMEOUT <Timeout> [ <STRATEGY> ] ]
[ LOCK default|record ]
[ PARALLEL ]
[ NOCACHE ]
Following are the details about the options in the above syntax.
<Projections> − Indicates the data you want to extract from the query as a result records set.
FROM − Indicates the object to query. This can be a class, cluster, single Record ID, set of
Record IDs. You can specify all these objects as target.
WHERE − Specifies the condition to filter the result-set.
LET − Indicates the context variables that are used in projections, conditions or subqueries.
GROUP BY − Indicates the field by which to group the records.
ORDER BY − Indicates the field by which to order the records.
UNWIND − Designates the field on which to unwind the collection of records.
SKIP − Defines the number of records you want to skip from the start of the result-set.
LIMIT − Indicates the maximum number of records in the result-set.
FETCHPLAN − Specifies the strategy defining how you want to fetch results.
TIMEOUT − Defines the maximum time in milliseconds for the query.
LOCK − Defines the locking strategy. DEFAULT and RECORD are the available lock
strategies.
PARALLEL − Executes the query against 'x' concurrent threads.
NOCACHE − Defines whether you want to use cache or not.

Example

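Method 1 − Select all records from the Customer table.

orientdb {db = demo}> SELECT FROM Customer

Method 2 − Select all records whose name starts with the letter 'k'. Both queries are equivalent.

orientdb {db = demo}> SELECT FROM Customer WHERE name LIKE 'k%'
orientdb {db = demo}> SELECT FROM Customer WHERE name.left(1) = 'k'

Method 3 − Select the id and the name in upper case for every record.

orientdb {db = demo}> SELECT id, name.toUpperCase() FROM Customer

Method 4 − Select all records whose age is 25 or 29.

orientdb {db = demo}> SELECT FROM Customer WHERE age in [25,29]

Method 5 − Select all records in which any field contains the text 'sh'.

orientdb {db = demo}> SELECT FROM Customer WHERE ANY() LIKE '%sh%'

Method 6 − Select all records ordered by age in descending order.

orientdb {db = demo}> SELECT FROM Customer ORDER BY age DESC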
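
To make the roles of WHERE, ORDER BY, SKIP and LIMIT concrete, here is a rough Python sketch that applies them, in that order, to an in-memory list standing in for the Customer records inserted earlier. The select function and its parameter names are assumptions for illustration; the sketch imitates the clauses and says nothing about how OrientDB evaluates a query internally.

```python
customers = [
    {"id": 1, "name": "satish", "age": 25},
    {"id": 2, "name": "krishna", "age": 26},
    {"id": 4, "name": "javeed", "age": 21},
    {"id": 5, "name": "raja", "age": 29},
]


def select(records, where=None, order_by=None, desc=False, skip=0, limit=None):
    """Filter, then sort, then skip, then cap the number of rows."""
    rows = [r for r in records if where is None or where(r)]
    if order_by is not None:
        rows = sorted(rows, key=lambda r: r[order_by], reverse=desc)
    rows = rows[skip:]
    return rows if limit is None else rows[:limit]


# Like: SELECT FROM Customer WHERE name LIKE 'k%'
print(select(customers, where=lambda r: r["name"].startswith("k")))

# Like: SELECT FROM Customer WHERE age in [25,29]
print(select(customers, where=lambda r: r["age"] in (25, 29)))

# Like: SELECT FROM Customer ORDER BY age DESC SKIP 1 LIMIT 2
print(select(customers, order_by="age", desc=True, skip=1, limit=2))
```

SKIP and LIMIT together give simple paging: skip the rows of the previous pages and return at most one page of records.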

UPDATE QUERY

Update Record command is used to modify the value of a particular record. SET is the basic
command to update a particular field value.
The following statement is the basic syntax of the Update command.
UPDATE <class>|cluster:<cluster>|<recordID>
[SET|INCREMENT|ADD|REMOVE|PUT <field-name> = <field-value>[,]*] |[CONTENT|MERGE <JSON>]
[UPSERT]
[RETURN <returning> [<returning-expression>]]
[WHERE <conditions>]
[LOCK default|record]
[LIMIT <max-records>] [TIMEOUT <timeout>]
Following are the details about the options in the above syntax.
SET − Defines the field to update.
INCREMENT − Increments the specified field value by the given value.
ADD − Adds the new item in the collection fields.
REMOVE − Removes an item from the collection field.
PUT − Puts an entry into map field.
CONTENT − Replaces the record content with JSON document content.
MERGE − Merges the record content with a JSON document.
LOCK − Specifies how to lock the records between load and update. We have two options to
specify: Default and Record.
UPSERT − Updates a record if it exists or inserts a new record if it doesn't. It helps in
executing a single query in the place of executing two queries.
RETURN − Specifies an expression to return instead of the number of records.
LIMIT − Defines the maximum number of records to update.
TIMEOUT − Defines the time you want to allow the update run before it times out.
Try the following query to update the age of the customer 'raja'.
orientdb {db = demo}> UPDATE Customer SET age = 28 WHERE name = 'raja'
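
The UPSERT option described above replaces a "look up, then update or insert" pair of queries with one. The Python sketch below shows that idea on a list of dictionaries. The upsert function is hypothetical and deliberately simplified: in particular, which fields OrientDB stores in a newly inserted record is not modelled here.

```python
def upsert(records, match, changes):
    """Update every record that matches; insert one if none does.

    Returns the number of records written.
    """
    written = 0
    for record in records:
        if match(record):
            record.update(changes)
            written += 1
    if written == 0:
        # Nothing matched, so the changes become a new record.
        records.append(dict(changes))
        written = 1
    return written


customers = [{"name": "raja", "age": 29}]

# Existing record: updated in place.
upsert(customers, lambda r: r["name"] == "raja", {"age": 28})

# No record named 'ravi': a new one is inserted instead.
upsert(customers, lambda r: r["name"] == "ravi", {"name": "ravi", "age": 30})

print(customers)
```
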

Truncate
Truncate Record command is used to delete the values of a particular record.
The following statement is the basic syntax of the Truncate command.
TRUNCATE RECORD <rid>*
Where <rid>* indicates the Record ID to truncate. You can use multiple Rids separated by
comma to truncate multiple records. It returns the number of records truncated.
Try the following query to truncate the record having Record ID #11:4.

orientdb {db = demo}> TRUNCATE RECORD #11:4
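
A Record ID such as #11:4 has two parts: the number before the colon identifies the cluster and the number after it is the position of the record within that cluster. The Python sketch below splits a Record ID into those two numbers; parse_rid is a hypothetical helper for illustration and not an OrientDB API.

```python
def parse_rid(rid):
    """Split '#<cluster-id>:<cluster-position>' into two integers."""
    if not rid.startswith("#"):
        raise ValueError(f"a Record ID starts with '#', got {rid!r}")
    cluster, _, position = rid[1:].partition(":")
    return int(cluster), int(position)


print(parse_rid("#11:4"))
```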


DELETE
Delete Record command is used to delete one or more records completely from the database.
The following statement is the basic syntax of the Delete command.
DELETE FROM <Class>|cluster:<cluster>|index:<index>
[LOCK <default|record>]
[RETURN <returning>]
[WHERE <Condition>*]
[LIMIT <MaxRecords>]
[TIMEOUT <timeout>]
Following are the details about the options in the above syntax.
LOCK − Specifies how to lock the records between load and delete. We have two options to
specify: Default and Record.
RETURN − Specifies an expression to return instead of the number of records.
LIMIT − Defines the maximum number of records to delete.
TIMEOUT − Defines the time you want to allow the delete to run before it times out.
Note − Don't use DELETE to remove Vertices or Edges because it affects the integrity of the
graph.
Try the following query to delete the record having id = 4.

orientdb {db = demo}> DELETE FROM Customer WHERE id = 4



OrientDB Features
OrientDB provides more functionality and flexibility, while being powerful enough to replace
your operational DBMS.
SPEED

OrientDB was engineered from the ground up with performance as a key specification. It's
fast on both read and write operations, storing up to 120,000 records per second.

o No more Joins: relationships are physical links to the records.
o Better RAM use.
o Traverses parts of or entire trees and graphs of records in milliseconds.
o Traversing speed is not affected by the database size.

ENTERPRISE

o Incremental backups
o Unmatched security
o 24x7 Support
o Query Profiler
o Distributed Clustering configuration
o Metrics Recording
o Live Monitor with configurable alerts

With a master-slave architecture, the master often becomes the bottleneck. With OrientDB,
throughput is not limited by a single server. Global throughput is the sum of the throughput
of all the servers.

o Multi-Master + Sharded architecture
o Elastic Linear Scalability
o Restore the database content using WAL

o OrientDB Community is free for commercial use.
o Comes with an Apache 2 Open Source License.
o Eliminates the need for multiple products and multiple licenses.
