Unit 3: NoSQL Databases (ADT)
2. Availability
Availability means that each read or write request for a data item will either be
processed successfully or will receive a message that the operation cannot be completed.
Every non-failing node returns a response for all the read and write requests in a
reasonable amount of time. The key word here is “every”. In simple terms, every node
(on either side of a network partition) must be able to respond in a reasonable amount
of time.
For example, user A is a content creator with 1,000 other users subscribed to
his channel. Another user B, who is far away from user A, tries to subscribe to user A's
channel. Since the distance between the two users is large, they are connected to different
database nodes of the social media network. If the distributed system follows the
principle of availability, user B must be able to subscribe to user A's channel.
3. Partition Tolerance
Partition tolerance means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions, where
the nodes in each partition can only communicate among each other. That means, the
system continues to function and upholds its consistency guarantees in spite of network
partitions. Network partitions are a fact of life. Distributed systems guaranteeing
partition tolerance can gracefully recover from partitions once the partition heals.
For example, consider the same social media network where two users
are trying to find the subscriber count of a particular channel. Due to a technical
fault there is a network outage, and the second database node used by user B loses its
connection with the first database node. The subscriber count is still shown to user B
using the replica of the data that was copied from database 1 before the outage.
Hence the distributed system is partition tolerant.
SHARDING:
It is basically a database architecture pattern in which we split a large dataset into
smaller chunks (logical shards) and we store/distribute these chunks in different
machines/database nodes (physical shards).
Each chunk/partition is known as a “shard” and each shard has the same database
schema as the original database.
We distribute the data in such a way that each row appears in exactly one shard.
It’s a good mechanism to improve the scalability of an application.
Methods of Sharding
1. Key Based Sharding
This technique is also known as hash-based sharding. Here, we take the value of an
entity such as customer ID, customer email, IP address of a client, zip code, etc and we
use this value as an input of the hash function. This process generates a hash
value which is used to determine which shard we need to use to store the data.
We need to keep in mind that the values entered into the hash function should all
come from the same column (the shard key), to ensure that data is placed in a
consistent manner and can always be located again.
Basically, shard keys act like a primary key or a unique identifier for individual
rows.
For example: You have 3 database servers, and each request has an application ID that
is incremented by 1 every time a new application is registered.
To determine which server the data should be placed on, we perform a modulo operation
on the application ID with the number 3; the remainder identifies the server that
stores the data.
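The calculation above can be sketched in a few lines of JavaScript; this is only an illustration (pickShard and NUM_SHARDS are made-up names, not part of any sharding library or of MongoDB):
// Minimal sketch of key-based (modulo) sharding across 3 servers.
const NUM_SHARDS = 3;
function pickShard(applicationId) {
  // The remainder of the shard key decides which physical shard stores the row.
  return applicationId % NUM_SHARDS;
}
// 100 % 3 = 1, 101 % 3 = 2, 102 % 3 = 0
console.log(pickShard(100), pickShard(101), pickShard(102));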
MongoDB Advantages
● MongoDB is schema less. It is a document database in which one collection holds
different documents.
● There may be differences between the number of fields, content and size of the
document from one to another.
● Structure of a single object is clear in MongoDB.
● There are no complex joins in MongoDB.
● MongoDB provides the facility of deep query because it supports a powerful dynamic
query on documents.
● It is very easy to scale.
● It uses internal memory for storing working sets, which is the reason for its fast
data access.
Distinctive features of MongoDB
● Easy to use
● Light Weight
● Much faster than RDBMS
Where MongoDB should be used
● Big and complex data
● Mobile and social infrastructure
● Content management and delivery
● User data management
● Data hub
MongoDB Data Types:
MongoDB supports many data types. Some of them are:
1. String: String is the most commonly used datatype to store the data. It is used to
store words or text. String in MongoDB must be UTF-8 valid.
2. Integer: This data type is used to store a numerical value. Integer can be 32-bit or
64-bit depending upon the server.
3. Boolean: This data type is used to store a Boolean (true/ false) value.
4. Float: This data type is used to store floating point values.
5. Min/Max keys: This data type is used to compare a value against the lowest and
highest BSON elements.
6. Arrays: This data type is used to store arrays, lists, or multiple values in one key.
7. Timestamp: This data type is used to store the date and time at which a particular
event occurred. For example, recording when a document has been modified or added
8. Object: This datatype is used for embedded documents
9. Null: This data type is used to store a Null value.
10.Symbol: This datatype is used identically to a string; however, it’s generally
reserved for languages that use a specific symbol type
11.Date: This datatype is used to store the current date or time in UNIX time format.
We can specify our own date time by creating an object of Date and passing day,
month, year into it.
12.Object ID: This datatype is used to store the document’s ID
13.Binary data: This datatype is used to store binary data.
14.Code: This datatype is used to store JavaScript code into the document.
15.Regular expression: This datatype is used to store regular expressions.
MongoDB Create Database
MongoDB does not provide a separate command to create a database.
How and when to create a database
The use command switches to the given database and creates it automatically as soon
as data is first stored in it. If the database does not already exist, the following
command is used to create it.
Syntax:
use DATABASE_NAME
INPUT:- >>>use inventory
OUTPUT:
switched to db inventory
INSERT OPERATION:-
It is used to add new documents to the collection.
SAMPLE QUERY:- insertOne()
INPUT:-
>>>db.inventory.insertOne({ item: "canvas", qty: 100, tags: ["cotton"], size: { h: 28, w:
35.5, uom: "cm" } })
OUTPUT:-
{
"acknowledged" : true,
"insertedId" : ObjectId("603e3d2f6b88c382606523ad")
}
Syntax:- insertMany()
db.collection.insertMany(
[ <document 1>, <document 2>, ... ],
{
writeConcern: <document>,
ordered: <boolean>
}
)
SAMPLE QUERY:-
INPUT:-
>>>db.inventory.insertMany([
{ item: "journal", qty: 25, tags: ["blank", "red"], size: { h: 14, w: 21, uom: "cm" } },
{ item: "mat", qty: 85, tags: ["gray"], size: { h: 27.9, w: 35.5, uom: "cm" } },
{ item: "mousepad", qty: 25, tags: ["gel", "blue"], size: { h: 19, w: 22.85, uom: "cm" } }
])
OUTPUT:-
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("603ee5973b41040c0b3227107"),
ObjectId("603ee5973b41040c0b3227108"),
ObjectId("603ee5973b41040c0b32271079")
]
}
READ OPERATION:-
It is used to retrieve documents from the collection based on some constraints.
Syntax:-
db.collection.find(query, { <field1>: <value>, <field2>: <value> ... })
SAMPLE QUERY:-
INPUT:-
>>>db.inventory.find( {} )
OUTPUT:
SAMPLE QUERY:-
INPUT:-
>>>db.inventory.find( {qty:85} )
OUTPUT:-
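The second argument of find() is a projection document that limits the fields returned; a short hedged example, assuming the inventory collection created earlier:
INPUT:-
>>>db.inventory.find( { qty: 85 }, { item: 1, qty: 1, _id: 0 } )
This returns only the item and qty fields of the matching documents.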
UPDATE OPERATION:-
It is used to modify (add/replace) one or more documents in the collection. It
consists of 3 methods:
updateOne()
updateMany()
replaceOne()
Syntax:-
db.collection.updateOne(<filter>, <update>, <options>)
db.collection.updateMany(<filter>, <update>, <options>)
db.collection.replaceOne(<filter>, <update>, <options>)
SAMPLE QUERY:-updateOne()
INPUT:-
>>>db.inventory.updateOne( { item: "paper" }, { $set: { "size.uom": "cm"},
$currentDate: { lastModified: true } })
OUTPUT:-
SAMPLE QUERY:-updateMany()
INPUT:-
>>>db.inventory.updateMany( { "qty": { $lt: 50 } }, { $set: { "size.uom": "in", status: "P" },
$currentDate: { lastModified: true } })
OUTPUT:-
DELETE OPERATION:-
It is used to delete one or more documents from the collection based on
the constraints.
Syntax:-
db.collection.deleteOne()
db.collection.deleteMany()
SAMPLE QUERY:- deleteOne()
INPUT:-
>>>db.inventory.deleteOne( { qty:85 } )
OUTPUT:-
{
"acknowledged" : true,
"deletedCount" : 1
}
SAMPLE QUERY:-deleteMany()
INPUT:-
>>>db.inventory.deleteMany( { qty:25 } )
OUTPUT:-
{
"acknowledged" : true,
"deletedCount" : 2
}
Indexing in MongoDB :
MongoDB uses indexing in order to make the query processing more efficient. If
there is no indexing, then the MongoDB must scan every document in the collection
and retrieve only those documents that match the query. Indexes are special data
structures that store some information related to the documents such that it becomes
easy for MongoDB to find the right data file. The indexes are ordered by the value of
the field specified in the index.
Creating an Index :
MongoDB provides a method called createIndex() that allows users to create an
index.
Syntax
db.COLLECTION_NAME.createIndex({KEY:1})
Example
db.mycol.createIndex({"age":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
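The indexes on a collection can be verified with the getIndexes() method; a short example, assuming the mycol collection used above:
db.mycol.getIndexes()
The output lists the default _id index together with the age index created above.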
Dropping an Index :
In order to drop an index, MongoDB provides the dropIndex() method.
Syntax:
db.NAME_OF_COLLECTION.dropIndex({KEY:1})
The dropIndex() methods can only delete one index at a time. In order to delete (or
drop) multiple indexes from the collection, MongoDB provides the dropIndexes() method
that takes multiple indexes as its parameters.
Example
db.NAME_OF_COLLECTION.dropIndexes({KEY1:1, KEY2:1})
Application
1. Web Applications
○ MongoDB is widely used across various web applications as the primary data
store.
○ One of the most popular web development stacks, the MEAN stack employs
MongoDB as the data store (MEAN stands for MongoDB, ExpressJS,
AngularJS, and NodeJS).
2. Big Data
○ MongoDB also provides the ability to handle big data.
○ Big Data refers to massive data that is fast-changing and must be quickly
accessible and highly available to address needs efficiently.
○ So, it can be used in applications where Big Data is needed.
3. Demographic and Biometric Data
○ MongoDB is used by one of the biggest biometric databases in the world to
store a massive amount of demographic and biometric data.
○ For example, India’s Unique Identification project, Aadhar, is using
MongoDB as its database to store a massive amount of demographic and
biometric data of more than 1.2 billion Indians.
4. Synchronization
○ MongoDB can easily handle complicated data that needs to be kept entirely
synchronized.
○ So, it is mainly used in gaming applications.
○ For example, EA, a world-famous gaming studio, uses MongoDB as the
database for its game FIFA Online 3.
5. Ecommerce
○ For e-commerce websites and product data management and solutions, we
can use MongoDB to store information because it has a flexible schema well
suited for the job.
○ MongoDB's "Inventory Management" pattern can be used to handle interactions
between users' shopping carts and inventory.
○ MongoDB also has a use case called "Category Hierarchy," which describes
the techniques for interacting with category hierarchies in MongoDB.
MongoDB Replication
● In MongoDB, data can be replicated across machines by the means of replica sets.
● A replica set consists of a primary node together with two or more secondary
nodes.
● The primary node accepts all write requests, which are propagated asynchronously
to the secondary nodes.
● The primary node is determined by an election involving all available nodes.
● To be eligible to become primary, a node must be able to contact more than half of
the replica set.
● This ensures that if a network partitions a replica set in two, only one of the partitions
will attempt to establish a primary.
● The successful primary will be elected based on the number of nodes to which it is
in contact, together with a priority value that may be assigned by the system
administrator.
● Setting a priority of 0 to an instance prevents it from ever being elected as primary.
● In the event of a tie, the server with the most recent optime — the timestamp of
the last operation—will be selected.
● The primary stores information about document changes in a collection within
its local database, called the oplog.
● The primary will continuously attempt to apply these changes to secondary
instances. Members within a replica set communicate frequently via heartbeat
messages.
● If a primary finds it is unable to receive heartbeat messages from more than half of
the secondaries, then it will renounce its primary status and a new election will be
called.
● Figure illustrates a three-member replica set and shows how a network partition
leads to a change of primary.
● Arbiters are special servers that can vote in the primary election, but that don’t
hold data.
● For large databases, these arbiters can avoid the necessity of creating
unnecessary extra servers to ensure that a quorum is available when electing a
primary.
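As a hedged illustration of the priority setting mentioned above, a member can be prevented from ever becoming primary by reconfiguring the replica set from the mongo shell (the member index 2 is an assumption):
cfg = rs.conf()              // fetch the current replica set configuration
cfg.members[2].priority = 0  // this member can now never be elected primary
rs.reconfig(cfg)             // apply the new configuration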
The replication process works as follows:
● Write operations on the primary:
○ When a user sends a write operation (such as an insert, update, or delete) to the
primary node, the primary node processes the operation and records it in its oplog
(operations log).
● Oplog replication to secondaries:
○ Secondary nodes poll the primary's oplog at regular intervals.
○ The oplog contains a chronological record of all the write operations performed.
○ The secondary nodes read the oplog entries and apply the same operations to their
data sets in the same order they were executed on the primary node.
● Achieving data consistency:
○ Through this oplog-based replication, secondary nodes catch up with the
primary node's data over time.
○ This process ensures that the data on secondary nodes remains consistent with
the primary node's data.
● Read operations:
○ While primary nodes handle write operations, both primary and secondary
nodes can serve read operations, which can help in load balancing.
○ Clients can choose to read from secondary nodes, which helps distribute the
read load and reduce the primary node's workload.
○ But in some instances secondary nodes might have slightly outdated data due
to replication lag.
A replica set is configured through a mongod.conf file on each server, for example:
net:
  bindIp:
  port:
replication:
  replSetName: myReplSet
To start MongoDB on each server, use the configuration file made above (mongod.conf)
with the following bash command:
mongod -f /path/to/mongod.conf
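Before an arbiter can be added, the replica set itself must be initiated from one of the members (a step not shown above); a minimal sketch in the mongo shell, where the host names are placeholders:
rs.initiate({
  _id: "myReplSet",                    // must match replSetName in mongod.conf
  members: [
    { _id: 0, host: "host1:27017" },   // placeholder host names
    { _id: 1, host: "host2:27017" },
    { _id: 2, host: "host3:27017" }
  ]
})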
To add an arbiter to the replica set, connect to the primary and run:
rs.addArb("<arbiter_host>:<arbiter_port>")
Step 7: Check Replica Set Status
To check the status of the Replica Set, connect to any of the MongoDB
instances and run the following javascript command:
rs.status()
Step 8: Test Connection Failure
To test connection failure, you can simulate a primary node failure by stopping
the MongoDB instance. The Replica Set should automatically elect a new primary
node. Please note that the provided steps and code snippets are generalized; the
actual steps might require adjustments based on the specific environment and use
case. This is where a near real-time, low-code tool like Fivetran can be leveraged:
you just need to connect MongoDB with it, and Fivetran will handle all the
replication tasks without any hassle.
While MongoDB replication using the replica set method offers numerous
benefits, there are situations where its complexity, resource requirements, or
alignment with specific use cases make it less feasible. Organizations need to
carefully assess their requirements, infrastructure, and operational capabilities to
determine whether replica sets are the appropriate solution or if alternative strategies
should be considered.
Sharding
A high-level representation of the MongoDB sharding architecture is shown
in Figure. Each shard is implemented by a distinct MongoDB database, which in
most respects is unaware of its role in the broader sharded server. The architecture
has three components:
(1) The shard servers: each is a separate MongoDB database holding a portion of the data.
(2) The config server: contains the metadata that can be used to determine
how data is distributed across shards.
(3) The router process: responsible for routing requests to the appropriate
shard server.
Sharding Mechanisms
Distribution of data across shards can be either range based or hash based.
● Range-based partitioning:
○ Each shard is allocated a specific range of shard key values.
○ MongoDB consults the distribution of key values in the index to ensure that each
shard is allocated approximately the same number of keys.
○ Range-based partitioning allows for more efficient execution of queries that
process ranges of values, since these queries can often be resolved by accessing
a single shard.
○ When range partitioning is enabled and the shard key is continuously
incrementing, the load tends to aggregate against only one of the shards, thus
unbalancing the cluster.
● Hash-based sharding:
○ The keys are distributed based on a hash function applied to the shard key.
○ Hash-based sharding requires that range queries be resolved by accessing all
shards.
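In MongoDB, the choice between the two mechanisms is made when a collection is sharded; a hedged sketch in the mongo shell (the database and collection names are illustrative):
sh.enableSharding("mydb")
sh.shardCollection("mydb.inventory", { item: 1 })            // range-based on item
sh.shardCollection("mydb.orders", { customer_id: "hashed" }) // hash-based on customer_id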
Data Model
The data model in Cassandra is totally different from what we normally see in an RDBMS.
Cluster
Cassandra and other Dynamo-based databases distribute data throughout the
cluster by using consistent hashing. The rowkey (analogous to a primary key in an
RDBMS) is hashed. Each node is allocated a range of hash values, and the node that
has the specific range for a hashed key value takes responsibility for the initial
placement of that data.
In the default Cassandra partitioning scheme, the hash values range from -2^63
to 2^63-1. Therefore, if there were four nodes in the cluster and we wanted to assign
equal numbers of hashes to each node, then the hash ranges for each would be
approximately as follows:
Node 1: -2^63 to -2^62 - 1
Node 2: -2^62 to -1
Node 3: 0 to 2^62 - 1
Node 4: 2^62 to 2^63 - 1
We usually visualize the cluster as a ring: the circumference of the ring
represents all the possible hash values, and the location of the node on the ring
represents its area of responsibility. Figure illustrates simple consistent hashing:
the value for a rowkey is hashed, which determines its position on “the
ring.” Nodes in the cluster take responsibility for ranges of values within the
ring, and therefore take ownership of specific rowkey values.
The four-node cluster in Figure 8-10 is well balanced because every node
is responsible for hash ranges of similar magnitude. But we risk unbalancing the
cluster as we add nodes. If we double the number of nodes in the cluster, then we
can assign the new nodes at points on the ring between existing nodes and the
cluster will remain balanced. However, doubling the cluster is usually
impractical: it’s more economical to grow the cluster incrementally.
Early versions of Cassandra had two options when adding a new node. We
could either remap all the hash ranges, or we could map the new node within an
existing range. In the first option we obtain a balanced cluster, but only after an
expensive rebalancing process. In the second option the cluster becomes
unbalanced; since each node is
responsible for the region of the ring between itself and its predecessor, adding a
new node without changing the ranges of other nodes essentially splits a region
in half. Figure shows how adding a node to the cluster can unbalance the
distribution of hash key ranges.
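A tiny JavaScript sketch (illustrative only, not Cassandra code) of how a hashed rowkey maps to one of four equally sized token ranges on the ring; nodeForToken is a made-up helper:
// The default token range is -2^63 .. 2^63-1; four nodes split it into equal slices.
const NODES = 4n;
const RANGE = 2n ** 64n;     // total number of 64-bit tokens
const MIN = -(2n ** 63n);    // lowest token on the ring

function nodeForToken(token) {
  const slice = RANGE / NODES;            // width of each node's slice
  return Number((token - MIN) / slice);   // node index 0..3
}

console.log(nodeForToken(MIN)); // 0 (start of the ring)
console.log(nodeForToken(0n));  // 2 (middle of the ring)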
Order-Preserving Partitioning
The Cassandra partitioner determines how keys are distributed across
nodes. The default partitioner uses consistent hashing, as described in the
previous section. Cassandra also supports order-preserving partitioners that
distribute data across the nodes of the cluster as ranges of actual (e.g., not hashed)
rowkeys. This has the advantage of isolating requests for specific row ranges to
specific machines, but it can lead to an unbalanced cluster and may create
hotspots, especially if the key value is incrementing. For instance, if the key value
is a timestamp and the order-preserving partitioner is implemented, then all new
rows will tend to be created on a single node of the cluster. In early versions of
Cassandra, the order-preserving partitioner might be warranted to optimize range
queries that could not be satisfied in any other way; however, following the
introduction of secondary indexes, the order-preserving partitioner is maintained
primarily for backward compatibility, and the Cassandra documentation
recommends against its use in new applications.
Key Space
Keyspace is the outermost container for data in Cassandra. A keyspace is
an object that is used to hold column families and user defined types. A keyspace is
like an RDBMS database: it contains column families, indexes, user defined
types, data center awareness, the strategy used in the keyspace, the replication factor, etc.
Following are the basic attributes of Keyspace in Cassandra:
● Replication factor: It specifies the number of machines in the cluster that will
receive copies of the same data.
● Replica placement Strategy: It is a strategy which specifies how to place
replicas in the ring.
● There are three types of strategies:
1) Simple strategy (rack-unaware strategy)
2) Old network topology strategy (rack-aware strategy)
3) Network topology strategy (datacenter-shared strategy)
In Cassandra, "Create Keyspace" command is used to create keyspace.
Cassandra Create Keyspace
Cassandra Query Language (CQL) facilitates developers to communicate with
Cassandra. The syntax of Cassandra query language is very similar to SQL. In
Cassandra, "Create Keyspace" command is used to create keyspace.
Syntax:
CREATE KEYSPACE <identifier> WITH <properties>
Example:
Let's take an example to create a keyspace named "StudentDB".
CREATE KEYSPACE StudentDB WITH replication = {'class':'SimpleStrategy',
'replication_factor' : 3};
Different components of Cassandra Keyspace
Strategy: There are two types of strategy declaration in Cassandra syntax:
● Simple Strategy: Simple strategy is used in the case of one data center. In this
strategy, the first replica is placed on the selected node and the remaining replicas
are placed in the clockwise direction in the ring without considering rack or node
location.
● Network Topology Strategy: This strategy is used in the case of more than one
data center. In this strategy, you have to provide a replication factor for each data
center separately.
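A hedged CQL example of a NetworkTopologyStrategy keyspace (the keyspace name and the data center names dc1 and dc2 are placeholders):
CREATE KEYSPACE StudentDB_Multi WITH replication = {'class':'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};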
Using a Keyspace
To use the created keyspace, you have to use the USE command.
Syntax:
USE <identifier>
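Example:
To switch to the keyspace created earlier:
USE StudentDB;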
Cassandra Alter Keyspace
The "ALTER keyspace" command is used to alter the replication factor,
strategy name and durable writes properties in created keyspace in Cassandra.
Syntax:
ALTER KEYSPACE <identifier> WITH <properties>
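Example:
For instance, the replication factor of the StudentDB keyspace created earlier can be changed as follows (the new factor of 2 is illustrative):
ALTER KEYSPACE StudentDB WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 2};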
Cassandra Drop Keyspace
In Cassandra, "DROP Keyspace" command is used to drop keyspaces
with all the data, column families, user defined types and indexes from
Cassandra.
Syntax:
DROP keyspace KeyspaceName ;
Syntax:
CREATE TABLE tablename(
column1_name datatype PRIMARY KEY,
column2_name datatype,
column3_name datatype
);
There are two types of primary keys:
1. Single primary key: Use the following syntax for single primary key.
Primary key (ColumnName)
2. Compound primary key: Use the following syntax for a compound primary key.
Primary key(ColumnName1,ColumnName2 . . )
Example:
Let's take an example to demonstrate the CREATE TABLE command.
Here, we are using the already created keyspace "StudentDB".
CREATE TABLE student(
student_id int PRIMARY KEY,
student_name text,
student_city text,
student_fees varint,
student_phone varint
);
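Before querying the table, rows can be inserted with the INSERT command; a short example with illustrative values:
INSERT INTO student (student_id, student_name, student_city, student_fees, student_phone)
VALUES (1, 'Ram', 'Chennai', 5000, 9876543210);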
SELECT * FROM student;
Cassandra Alter Table
ALTER TABLE command is used to alter the table after creating it. You
can use the ALTER command to perform two types of operations:
● Add a column
● Drop a column
Syntax:
ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>
Adding a Column
You can add a column to the table by using the ALTER command. While
adding a column, you have to be aware that the column name does not conflict with
the existing column names and that the table is not defined with the compact storage
option.
Syntax:
ALTER TABLE <tablename> ADD <new_column> <datatype>;
After using the following command:
ALTER TABLE student ADD student_email text;
A new column is added. You can check it by using the SELECT command.
Dropping a Column
You can also drop an existing column from a table by using ALTER
command. You should check that the table is not defined with compact storage
option before dropping a column from a table.
Syntax:
ALTER TABLE <tablename> DROP <column_name>;
Example:
After using the following command:
ALTER TABLE student DROP student_email;
Now you can see that a column named "student_email" is dropped now. If you
want to drop the multiple columns, separate the column name by ",".
Cassandra DROP table
DROP TABLE command is used to drop a table.
Syntax:
DROP TABLE <tablename>
Example:
After using the following command:
DROP TABLE student;
The table named "student" is dropped now. You can use DESCRIBE
command to verify if the table is deleted or not. Here the student table has been
deleted; you will not find it in the column families list.
Cassandra Truncate Table
TRUNCATE command is used to truncate a table. If you truncate a table,
all the rows of the table are deleted permanently.
Syntax:
TRUNCATE <tablename>
Cassandra Batch
In Cassandra, BATCH is used to execute multiple modification statements
(insert, update, delete) simultaneously. It is very useful when you have to update
some columns as well as delete some of the existing ones.
Syntax:
BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH
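A short hedged example that combines an insert, an update, and a delete on the student table (the values are illustrative):
BEGIN BATCH
INSERT INTO student (student_id, student_name, student_city, student_fees, student_phone) VALUES (2, 'Sita', 'Madurai', 6000, 9123456780);
UPDATE student SET student_fees = 7000 WHERE student_id = 1;
DELETE student_phone FROM student WHERE student_id = 1;
APPLY BATCH;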
Use of WHERE Clause
WHERE clause is used with SELECT command to specify the exact location
from where we have to fetch data.
Syntax:
SELECT * FROM <table name> WHERE <condition>;
Example:
SELECT * FROM student WHERE student_id=2;
Cassandra Update Data
UPDATE command is used to update data in a Cassandra table. If you see
no result after updating the data, it means data is successfully updated otherwise
an error will be returned. While updating data in Cassandra table, the following
keywords are commonly used:
● Where: The WHERE clause is used to select the row that you want to update.
● Set: The SET clause is used to set the value.
● Must: The WHERE clause must include all the columns composing the primary key.
Syntax:
UPDATE <tablename>
SET <column name> = <new value>,
<column name> = <value> ....
WHERE <condition>;
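For example, the fees and city of the student whose student_id is 2 can be changed as follows (the values are illustrative):
UPDATE student
SET student_fees = 8000, student_city = 'Coimbatore'
WHERE student_id = 2;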
Cassandra DELETE Data
DELETE command is used to delete data from Cassandra table. You can
delete the complete table or a selected row by using this command.
Syntax:
DELETE FROM <identifier> WHERE <condition>;
Delete an entire row
To delete the entire row of the student_id "3", use the following command:
DELETE FROM student WHERE student_id=3;
Delete a specific column name
Example:
Delete the student_fees where student_id is 4.
DELETE student_fees FROM student WHERE student_id=4;
HAVING Clause in SQL
The HAVING clause places the condition in the groups defined by the
GROUP BY clause in the SELECT statement. This SQL clause is implemented
after the 'GROUP BY' clause in the 'SELECT' statement. This clause is used in
SQL because we cannot use the WHERE clause with the SQL aggregate
functions. Both WHERE and HAVING clauses are used for filtering the records
in SQL queries.
Syntax of HAVING clause in SQL
SELECT column_Name1, column_Name2, ....., column_NameN, aggregate_function_name(column_Name)
FROM table_Name
GROUP BY column_Name1, column_Name2, ....., column_NameN
HAVING condition;
Example:
SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City;
The same query with the HAVING clause in SQL:
SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City HAVING SUM(Emp_Salary) > 12000;
MIN Function with HAVING Clause:
If you want to show each department and the minimum salary in each
department, you have to write the following query:
SELECT MIN(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
MAX Function with HAVING Clause:
SELECT MAX(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
AVG Function:
SELECT AVG(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
CQL Types
CQL defines built-in data types for columns. The counter type is unique.
CQL Type | Constants supported | Description
ascii | strings | US-ASCII character string
bigint | integers | 64-bit signed long
blob | blobs | Arbitrary bytes (no validation), expressed as hexadecimal
boolean | booleans | true or false
counter | integers | Distributed counter value (64-bit long)
date | strings | Value is a date with no corresponding time value; Cassandra encodes date as a 32-bit integer representing days since epoch (January 1, 1970). Dates can be represented in queries and inserts as a string, such as 2015-05-03 (yyyy-mm-dd)
decimal | integers, floats | Variable-precision decimal
double | integers, floats | 64-bit IEEE-754 floating point
float | integers, floats | 32-bit IEEE-754 floating point
frozen | user-defined types, collections, tuples | A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields. Cassandra treats the value of a frozen type as a blob; the entire value must be overwritten.
inet | strings | IP address string in IPv4 or IPv6 format, used by the python-cql driver and CQL native protocols
int | integers | 32-bit signed integer
HIVE
● Hive is thought of as “SQL for Hadoop,” although Hive provides a catalog for
the Hadoop system, as well as a SQL processing layer.
● The Hive metadata service contains information about the structure of registered
files in the HDFS file system.
● This metadata effectively “schematizes” these files, providing definitions of
column names and data types.
● The Hive client or server (depending on the Hive configuration) accepts SQL-
like commands called Hive Query Language (HQL).
● These commands are translated into Hadoop jobs that process the query and
return the results to the user.
● Most of the time, Hive creates MapReduce programs that implement query
operations such as joins, sorts, aggregation, and so on.
● Hive is a data warehouse system which is used to analyze structured data. It is
built on the top of Hadoop. It was developed by Facebook.
● Hive provides the functionality of reading, writing, and managing large datasets
residing in distributed storage. It runs SQL-like queries called HQL (Hive query
language) which gets internally converted to MapReduce jobs.
● Hive supports Data Definition Language (DDL), Data Manipulation Language
(DML), and User Defined Functions (UDF).
HIVE ARCHITECTURE
Date/Time Types
1. TIMESTAMP
● It supports traditional UNIX timestamp with optional nanosecond precision.
● As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
● As Floating point numeric type, it is interpreted as UNIX timestamp in seconds
with decimal precision.
● As string, it follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision)
2. DATES
● The Date value is used to specify a particular year, month and day, in the form
YYYY-MM-DD. However, it does not provide the time of the day. The range of the
Date type lies between 0000-01-01 and 9999-12-31.
String Types
1. STRING
The string is a sequence of characters. Its values can be enclosed within
single quotes (') or double quotes (").
2. Varchar
The varchar is a variable-length type whose length lies between 1 and
65535, which specifies the maximum number of characters allowed in the
character string.
3. CHAR
The char is a fixed-length type whose maximum length is fixed at 255.
Complex Types
Hive also supports complex types such as arrays, maps, structs and unions.
Database Operations
Hive - Create Database
In Hive, the database is considered as a catalog or namespace of tables. So,
we can maintain multiple tables within a database where a unique name is
assigned to each table. Hive also provides a default database with a name default.
● Initially, we check the default database provided by Hive. To create a new database
named demo and then list the existing databases, follow the below commands: -
hive> create database demo;
hive> show databases;
Hive - Drop Database
In this section, we will see how to drop an existing database.
Drop the database by using the following command:
hive> drop database demo;
Hive - Create Table
In Hive, we can create a table by using conventions similar to SQL.
It supports a wide range of flexibility in where the data files for tables are stored. It
provides two types of table:
● Internal table
● External table
Internal Table
The internal tables are also called managed tables, as the lifecycle of their
data is controlled by Hive. By default, these tables are stored in a subdirectory
under the directory defined by hive.metastore.warehouse.dir (i.e.
/user/hive/warehouse). The internal tables are not flexible enough to share with
other tools like Pig. If we try to drop the internal table, Hive deletes both the table
schema and the data. Let's create an internal table by using the following command:-
hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;
Let's see the metadata of the created table by using the following command:-
hive> describe demo.employee;
External Table
The external table allows us to create and access a table and a data
externally. The external keyword is used to specify the external table, whereas
the location keyword is used to determine the location of loaded data. As the table
is external, the data is not present in the Hive directory. Therefore, if we try to
drop the table, the metadata of the table will be deleted, but the data still exists.
Let's create an external table using the following command: -
hive> create external table emplist (Id int, Name string, Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';
we can use the following command to retrieve the data: -
select * from emplist;
Hive - Load Data
Once the internal table has been created, the next step is to load the data
into it. So, in Hive, we can easily load data from any file to the database.
Let's load the data of the file into the database by using the following command: -
hive> load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;
Hive - Drop Table
Hive facilitates us to drop a table by using the SQL drop table
command. Let's follow the below steps to drop the table from the database.
Let's check the list of existing databases by using the following command: -
hive> show databases;
hive> use demo;
hive> show tables;
hive> drop table new_employee;
Hive - Alter Table
In Hive, we can perform modifications in the existing table like
changing the table name, column name, comments, and table properties. It
provides SQL like commands to alter the table.
Rename a Table
If we want to change the name of an existing table, we can rename
that table by using the following signature: -
Alter table old_table_name rename to new_table_name;
Now, change the name of the table by using the following command:-
Alter table emp rename to employee_data;
Adding column
In Hive, we can add one or more columns in an existing table by
using the following signature:
Alter table table_name add columns(column_name datatype);
Now, add a new column to the table by using the
following command: -
Alter table employee_data add columns (age int);
Change Column
In Hive, we can rename a column, change its type and position.
Here, we are changing the name of the column by using the following
signature: -
Alter table table_name change old_column_name new_column_name
datatype;
Now, change the name of the column by using the following command: -
Alter table employee_data change name first_name string;
Delete or Replace Column
Hive allows us to delete one or more columns by replacing them with
the new columns. Thus, we cannot drop the column directly. Let's see the
existing schema of the table.
alter table employee_data replace columns( id string, first_name string, age int);
Partitioning
The partitioning in Hive means dividing the table into some parts based
on the values of a particular column like date, course, city or country. The
advantage of partitioning is that since the data is stored in slices, the query
response time becomes faster.
The partitioning in Hive can be executed in two ways –
● Static partitioning
● Dynamic partitioning
Static Partitioning
In static or manual partitioning, it is required to pass the values of
partitioned columns manually while loading the data into the table. Hence, the
data file doesn't contain the partitioned columns.
Example of Static Partitioning
Select the database in which we want to create a table.
hive> use test;
Create the table and provide the partitioned columns by using the following
command: -
hive> create table student (id int, name string, age int, institute string)partitioned by
(course string) row format delimited fields terminated by ',';
hive> describe student;
Load the data into the table and pass the values of partition columns with it by
using the following command: -
hive> load data local inpath '/home/codegyani/hive/student_details1'
into table student partition(course= "java");
Here, we are partitioning the students of an institute based on courses.
Load the data of another file into the same table and pass the values of
partition columns with it by using the following command: -
hive> load data local inpath '/home/codegyani/hive/student_details2'
into table student partition(course= "hadoop");
hive> select * from student;
Retrieve the data based on partitioned columns by using the following command:-
hive> select * from student where course="java";
Dynamic Partitioning
In dynamic partitioning, the values of partitioned columns exist within the
table.
So, it is not required to pass the values of partitioned columns manually.
First, select the database in which we want to create a table.
hive> use show;
Enable the dynamic partition by using the following commands:-
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Create a dummy table to store the data.
hive> create table stud_demo(id int, name string, age int, institute
string, course string) row format delimited fields terminated by ',';
load the data into the table.
hive> load data local inpath '/home/codegyani/hive/student_details' into
table stud_demo;
Create a partition table by using the following command: -
hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string) row format delimited fields terminated by ',';
Insert the data of dummy table into the partition table.
hive> insert into student_part partition(course) select id, name, age, institute, course
from stud_demo;
HiveQL
Hive is the original SQL on Hadoop. From the very early days of
Hadoop, Hive represented the most accessible face of Hadoop for many users.
Hive Query Language (HQL) is a SQL-based language that comes close to
SQL-92 entry-level compliance, particularly within its SELECT statement.
DML statements—such as INSERT, DELETE, and UPDATE—are supported
in recent versions, though the real purpose of Hive is to provide query access
to Hadoop data usually ingested via other means. Some SQL-2003 analytic
window functions are also supported. HQL is compiled to MapReduce or—
in later releases—more sophisticated YARN-based DAG algorithms.
The following is a simple Hive query:
0: jdbc:Hive2://> SELECT country_name, COUNT (cust_id)
0: jdbc:Hive2://> FROM countries co JOIN customers cu
0: jdbc:Hive2://> ON (cu.country_id = co.country_id)
0: jdbc:Hive2://> WHERE region = 'Asia'
0: jdbc:Hive2://> GROUP BY country_name
0: jdbc:Hive2://> HAVING COUNT (cust_id) > 500;
2015-10-10 11:38:55 Starting to launch local task to process map join;
maximum memory = 932184064
<<Bunch of Hadoop JobTracker output deleted>>
2015-10-10 11:39:05,928 Stage-2 map = 0%, reduce = 0%
2015-10-10 11:39:12,246 Stage-2 map = 100%, reduce = 0%, Cumulative CPU
2.28 sec
2015-10-10 11:39:20,582 Stage-2 map = 100%, reduce = 100%, Cumulative CPU
4.4 sec
+---------------+------+
| country_name  | _c1  |
+---------------+------+
| China         | 712  |
| Japan         | 624  |
| Singapore     | 597  |
+---------------+------+
3 rows selected (29.014 seconds)
HQL statements look and operate like SQL statements. There are a
few notable differences between HQL and commonly used standard SQL,
however:
● HQL supports a number of table generating functions which can be used to
return multiple rows from an embedded field that may contain an array of
values or a map of name:value pairs. The Explode() function returns one row
for each element in an array or map, while json_tuple() explodes an embedded
JSON document.
● Hive provides a SORT BY clause that requests output be sorted only within
each reducer within the MapReduce pipeline. Compared to ORDER BY, this
avoids a large sort in the final reducer stage, but may not return results in
sorted order.
● DISTRIBUTE BY controls how mappers distribute output to reducers. Rather
than distributing values to reducers based on hashing of key values, we can
insist that each reducer receive contiguous ranges of a specific column.
DISTRIBUTE BY can be used in conjunction with SORT BY to achieve an
overall ordering of results without requiring an expensive final sort operation.
CLUSTER BY combines the semantics of DISTRIBUTE BY and SORT BY
operations that specify the same column list. Hive can query data in HBase
tables and data held in HDFS.
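A hedged HQL sketch of these clauses, assuming a sales table with cust_id and amount columns (this table is not part of the earlier examples):
-- Send all rows for a customer to the same reducer, sorted within each reducer.
SELECT cust_id, amount
FROM sales
DISTRIBUTE BY cust_id
SORT BY cust_id, amount;

-- CLUSTER BY cust_id is shorthand for DISTRIBUTE BY cust_id SORT BY cust_id.
SELECT cust_id, amount
FROM sales
CLUSTER BY cust_id;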
Nodes: Nodes are the records/data in graph databases. Data is stored as properties,
and properties are simple name/value pairs.
Relationships: Relationships are used to connect nodes. They specify how the nodes are related.
Relationships always have a direction.
Relationships always have a type.
Relationships form patterns of data.
If the command is executed successfully, you will get the following output.
Database updated successfully
Truncate
Truncate Record command is used to delete the values of a particular
record.
The following statement is the basic syntax of the Truncate command.
TRUNCATE RECORD <rid>*
Where <rid>* indicates the Record ID to truncate. You can use multiple
Rids separated by comma to truncate multiple records. It returns the number of
records truncated.
Try the following query to truncate the record having Record ID #11:4.
orientdb {db=demo}> TRUNCATE RECORD #11:4
DELETE
Delete Record command is used to delete one or more records completely
from the database. The following statement is the basic syntax of the Delete
command.
DELETE FROM <Class>|CLUSTER:<cluster>|INDEX:<index>
[LOCK <default|record>]
[RETURN <returning>]
[WHERE <Condition>*]
[LIMIT <MaxRecords>]
[TIMEOUT <timeout>]
Following are the details about the options in the above syntax.
LOCK - Specifies how to lock the records between load and delete. We
have two options to specify: Default and Record.
RETURN - Specifies an expression to return instead of the
number of records.
LIMIT - Defines the maximum number of records to delete.
TIMEOUT - Defines the time you want to allow the delete to run before it
times out.
OrientDB Features:
OrientDB is a multi-model database providing more functionality and flexibility,
while being powerful enough to replace your operational DBMS.
SPEED
OrientDB was engineered from the ground up with performance as a key
specification. It is fast on both read and write operations, storing up to 120,000
records per second.
No more joins: relationships are physical links to the records.
Better RAM use.
Traverses parts of or entire trees and graphs of records in
milliseconds.
Traversing speed is not affected by the database size.
ENTERPRISE
o Incremental backups
o Unmatched security
o 24x7 Support
o Query Profiler
o Distributed Clustering configuration
o Metrics Recording
o Live Monitor with configurable alerts