UNIT II
Introduction to NoSQL
NoSQL is a type of database management system (DBMS) that is designed to handle and store
large volumes of unstructured and semi-structured data. Unlike traditional relational databases that use
tables with pre-defined schemas to store data, NoSQL databases use flexible data models that can adapt
to changes in data structures and are capable of scaling horizontally to handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term has
since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a wide range of
different database architectures and data models.
NoSQL databases are generally classified into four main categories:
1. Document databases: These databases store data as semi-structured documents, such as JSON
or XML, and can be queried using document-oriented query languages.
2. Key-value stores: These databases store data as key-value pairs, and are optimized for simple and
fast read/write operations.
3. Column-family stores: These databases store data as column families, which are sets of columns
that are treated as a single entity. They are optimized for fast and efficient querying of large
amounts of data.
4. Graph databases: These databases store data as nodes and edges, and are designed to handle
complex relationships between data.
NoSQL databases are often used in applications where there is a high volume of data that needs to be
processed and analyzed in real-time, such as social media analytics, e-commerce, and gaming. They
can also be used for other applications, such as content management systems, document management,
and customer relationship management.
However, NoSQL databases may not be suitable for all applications, as they may not provide the same
level of data consistency and transactional guarantees as traditional relational databases. It is important
to carefully evaluate the specific needs of an application when choosing a database management
system.
Key Features of NoSQL:
1. Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate changing data structures without the need for migrations or schema alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to a database cluster, making them well-suited for handling large amounts of data and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model, where data is stored in a semi-structured format, such as JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model, where data is stored as a collection of key-value pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a column-based data model, where data is organized into columns instead of rows.
6. Distributed and high availability: NoSQL databases are often designed to be highly available and to automatically handle node failures and data replication across multiple nodes in a database cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and dynamic manner, with support for multiple data types and changing data structures.
8. Performance: NoSQL databases are optimized for high performance and can handle a high volume of reads and writes, making them suitable for big data and real-time applications.
Advantages of NoSQL: There are many advantages of working with NoSQL databases such as
MongoDB and Cassandra. The main advantages are high scalability and high availability.
1. High scalability: NoSQL databases use sharding for horizontal scaling. Sharding means partitioning the data and placing it on multiple machines in such a way that the order of the data is preserved. Vertical scaling means adding more resources to the existing machine, whereas horizontal scaling means adding more machines to handle the data. Vertical scaling is not that easy to implement, but horizontal scaling is. Examples of horizontally scaling databases are MongoDB, Cassandra, etc. Because of this scalability, NoSQL can handle huge amounts of data: as the data grows, NoSQL scales itself to handle that data in an efficient manner.
2. Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data, which
means that they can accommodate dynamic changes to the data model. This makes NoSQL
databases a good fit for applications that need to handle changing data requirements.
3. High availability: The auto-replication feature in NoSQL databases makes them highly available, because in case of any failure data is replicated across nodes and the database can recover to a consistent state.
4. Scalability: NoSQL databases are highly scalable, which means that they can handle large amounts
of data and traffic with ease. This makes them a good fit for applications that need to handle large
amounts of data or traffic.
5. Performance: NoSQL databases are designed to handle large amounts of data and traffic, which
means that they can offer improved performance compared to traditional relational databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective than traditional relational
databases, as they are typically less complex and do not require expensive hardware or software.
7. Agility: Ideal for agile development.
Disadvantages of NoSQL: NoSQL has the following disadvantages.
1. Lack of standardization : There are many different types of NoSQL databases, each with its own
unique strengths and weaknesses. This lack of standardization can make it difficult to choose the
right database for a specific application.
2. Lack of ACID compliance : NoSQL databases are not fully ACID-compliant, which means that
they do not guarantee the consistency, integrity, and durability of data. This can be a drawback for
applications that require strong data consistency guarantees.
3. Narrow focus: NoSQL databases have a very narrow focus, as they are mainly designed for storage and provide relatively little functionality beyond it. Relational databases are a better choice in the field of transaction management than NoSQL.
4. Open-source: Many NoSQL databases are open source, and there is no reliable standard for NoSQL yet. In other words, two database systems are likely to be unequal.
5. Lack of support for complex queries : NoSQL databases are not designed to handle complex
queries, which means that they are not a good fit for applications that require complex data analysis
or reporting.
6. Lack of maturity : NoSQL databases are relatively new and lack the maturity of traditional
relational databases. This can make them less reliable and less secure than traditional databases.
7. Management challenge : The purpose of big data tools is to make the management of a large
amount of data as simple as possible. But it is not so easy. Data management in NoSQL is much
more complex than in a relational database. NoSQL, in particular, has a reputation for being
challenging to install and even more hectic to manage on a daily basis.
8. GUI is not available: GUI tools for accessing and managing NoSQL databases are not widely available in the market.
9. Backup: Backup is a weak point for some NoSQL databases. MongoDB, for example, has historically lacked a built-in approach for backing up data in a consistent manner.
SRM TRP Engineering College
Department of Computer Science and Engineering
10. Large document size: Some database systems, like MongoDB and CouchDB, store data in JSON format. This means that documents can be quite large (costing storage, network bandwidth, and speed), and having descriptive key names actually hurts, since they increase the document size.
SQL vs NoSQL:
Schema: SQL databases have a fixed (static, predefined) schema; NoSQL databases have a dynamic schema.
Hierarchical data: SQL databases are not suited for hierarchical data storage; NoSQL databases are best suited for it.
Complex queries: SQL databases are best suited for complex queries; NoSQL databases are not as good for complex queries.
Examples: SQL – MySQL, PostgreSQL, Oracle, MS SQL Server, etc.; NoSQL – MongoDB, HBase, Neo4j, Cassandra, etc.
Types of NoSQL databases: The main types, with database systems that fall in each category, are:
1. Graph Databases: Examples – Amazon Neptune, Neo4j
2. Key value store: Examples – Memcached, Redis, Coherence
3. Tabular: Examples – HBase, Bigtable, Accumulo
4. Document-based: Examples – MongoDB, CouchDB, Cloudant
When should NoSQL be used:
1. When a huge amount of data needs to be stored and retrieved.
2. The relationship between the data you store is not that important
3. The data changes over time and is not structured.
4. Support of Constraints and Joins is not required at the database level
5. The data is growing continuously and you need to scale the database regularly to handle the data.
Difference between Relational database and NoSQL
1. Relational Database:
RDBMS stands for Relational Database Management System. It is the most popular type of database. In it, data is stored in the form of rows, that is, in the form of tuples. It contains a number of tables, and data can be easily accessed because it is stored in tables. This model was proposed by E.F. Codd.
2. NoSQL:
NoSQL stands for a non-SQL (or "not only SQL") database. A NoSQL database doesn't use tables to store data the way a relational database does. It is used for storing and fetching data, and it is generally used to store large amounts of data. It supports query languages and provides good performance.
Architecture Pattern is a logical way of categorizing data that will be stored on the
Database. NoSQL is a type of database which helps to perform operations on big data and store it in a
valid format. It is widely used because of its flexibility and a wide variety of services.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained below.
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the data is
stored in form of Key-Value Pairs. The key is usually a sequence of strings, integers or characters but
can also be a more advanced data type. The value is typically linked or co-related to the key. The key-
value pair storage databases generally store data as a hash table where each key is unique. The value
can be of any type (JSON, BLOB(Binary Large Object), strings, etc). This type of pattern is usually
used in shopping websites or e-commerce applications.
Advantages:
Can handle large amounts of data and heavy load.
Easy retrieval of data by keys.
Limitations:
Complex queries that involve multiple key-value pairs can degrade performance.
Data involving many-to-many relationships is hard to model.
Examples:
DynamoDB
Berkeley DB
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document can be a complex data structure: text, arrays, strings, JSON, XML, or any such format. The use of nested documents is also very common. It is very effective, as most of the data created is usually in JSON form and unstructured.
Advantages:
This type of format is very useful and apt for semi-structured data.
Storing, retrieving, and managing documents is easy.
Limitations:
Handling multiple documents is challenging
Aggregation operations may not work accurately.
Examples:
MongoDB
CouchDB
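As a concrete illustration of the document model, here is a minimal sketch using the MongoDB Java driver (assumptions: the mongodb-driver-sync library is on the classpath and a MongoDB instance runs locally; the database, collection, and field names are illustrative).

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class DocumentStoreExample {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance (hypothetical address).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers =
                    client.getDatabase("shop").getCollection("customers");

            // A semi-structured document: nested fields and arrays are stored as-is.
            Document doc = new Document("name", "Asha")
                    .append("city", "Delhi")
                    .append("orders", java.util.List.of("o-101", "o-102"));
            customers.insertOne(doc);

            // Fetch the document back by a field value.
            Document found = customers.find(new Document("name", "Asha")).first();
            System.out.println(found == null ? "not found" : found.toJson());
        }
    }
}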
An efficient and compact index structure is used by the key-value store so that it can rapidly and dependably find a value using its key. For example, Redis is a key-value store used to track lists, maps, heaps, and primitive types (simple data structures) in a persistent database. Redis exposes a very simple interface to query and manipulate value types; by supporting only a predetermined number of value types, it is, when configured appropriately, capable of high throughput.
When to use a key-value database:
Here are a few situations in which you can use a key-value database:
User session attributes in an online app like finance or gaming (real-time random data access).
Caching mechanisms for repeatedly accessed data, or key-based designs.
Applications built on queries that are based on keys.
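As a concrete sketch of this key-based access, here is a minimal example using Redis through the Jedis client (assumptions: the jedis library is on the classpath and Redis runs locally on its default port; the key names are illustrative).

import redis.clients.jedis.Jedis;

public class KeyValueExample {
    public static void main(String[] args) {
        // Connect to a local Redis server (default port 6379).
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Store a user-session attribute under a key; all access is by key.
            jedis.set("session:42:user", "asha");
            jedis.expire("session:42:user", 1800); // cache-style expiry, in seconds

            // Retrieval is a simple, fast lookup by the same key.
            String user = jedis.get("session:42:user");
            System.out.println("session user = " + user);
        }
    }
}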
Features:
One of the simplest kinds of NoSQL data models.
For storing, getting, and removing data, key-value databases utilize simple functions.
Querying language is not present in key-value databases.
Built-in redundancy makes this database more reliable.
Advantages:
It is very easy to use. Due to the simplicity of the database, values can be of any kind, or even of different kinds, when required.
Its response time is fast due to its simplicity, provided that the surrounding environment is well constructed and optimized.
Key-value store databases are scalable vertically as well as horizontally.
Built-in redundancy makes this database more reliable.
Disadvantages:
Because no standard querying language is present in key-value databases, queries cannot be ported from one database to a different database.
The key-value store is not refined: you cannot query the database without a key.
Some examples of key-value databases:
Here are some popular key-value databases which are widely used:
Couchbase: It permits SQL-style querying and searching for text.
Amazon DynamoDB: The key-value database which is mostly used is Amazon DynamoDB as it is
a trusted database used by a large number of users. It can easily handle a large number of requests
every day and it also provides various security options.
Riak: It is a distributed, highly available key-value database used to develop scalable applications.
Aerospike: It is an open-source and real-time database working with billions of exchanges.
Berkeley DB: It is a high-performance and open-source database providing scalability.
2. Column Store Database:
Both columnar and row databases are methods used for processing big data analytics and data warehousing, but their approaches differ.
For example:
Row Database: “Customer 1: Name, Address, Location.”(The fields for each new record are stored
in a long row).
Columnar Database: “Customer 1: Name, Address, Location.”(Each field has its own set of columns).
Example:
(Example table omitted: a simple database table with four columns and three rows.)
Advantages:
1. Columnar databases can be used for different tasks; when applications related to big data come into play, column-oriented databases receive particular attention.
2. The data in a columnar database is highly compressible, and aggregate operations such as AVG, MIN, and MAX are made cheap by the compression.
3. Efficiency and speed: analytical queries run faster in columnar databases.
4. Self-indexing: another benefit of a column-based DBMS is self-indexing, which uses less disk space than a relational database management system containing the same data.
Limitations:
1. For loading incremental data, traditional row-oriented databases are more suitable than column-oriented databases.
2. For online transaction processing (OLTP) applications, row-oriented databases are more appropriate than columnar databases.
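To make the layout difference concrete, here is a toy sketch in plain Java (no database involved; the data is made up): the row form keeps each record's fields together, while the columnar form keeps each field in its own array, so an aggregate such as AVG scans only the single column it needs.

import java.util.List;

public class RowVsColumnLayout {
    // Row layout: the fields of each record are stored together.
    record CustomerRow(String name, String address, String location, int purchases) {}

    public static void main(String[] args) {
        List<CustomerRow> rows = List.of(
                new CustomerRow("A", "addr1", "Delhi", 4),
                new CustomerRow("B", "addr2", "Pune", 7),
                new CustomerRow("C", "addr3", "Mumbai", 1));

        // Columnar layout: each field has its own array ("column").
        String[] names = {"A", "B", "C"};
        int[] purchases = {4, 7, 1};

        // An analytical query like AVG(purchases) touches only one column...
        double avg = 0;
        for (int p : purchases) avg += p;
        avg /= purchases.length;

        // ...whereas with rows, every whole record is read just to reach one field.
        double avgFromRows = rows.stream().mapToInt(CustomerRow::purchases).average().orElse(0);

        System.out.println(names.length + " customers, AVG " + avg + " == " + avgFromRows);
    }
}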
SCHEMALESS DATABASE:
Traditional relational databases are well-defined, using a schema to describe every functional
element, including tables, rows, views, indexes, and relationships. By exerting a high degree of control,
the database administrator can improve performance and prevent capture of low-quality, incomplete, or
malformed data. In a SQL database, the schema is enforced by the Relational Database Management
System (RDBMS) whenever data is written to disk.
But in order to work, data needs to be heavily formatted and shaped to fit into the table structure.
This means sacrificing any undefined details during the save, or storing valuable information outside the
database entirely.
A schemaless database, like MongoDB, does not have these up-front constraints, mapping to a
more ‘natural’ database. Even when sitting on top of a data lake, each document is created with a partial
schema to aid retrieval. Any formal schema is applied in the code of your applications; this layer of
abstraction protects the raw data in the NoSQL database and allows for rapid transformation as your
needs change.
Any data, formatted or not, can be stored in a non-tabular NoSQL type of database. At the same
time, using the right tools in the form of a schemaless database can unlock the value of all of your
structured and unstructured data types.
In practice, the data itself normally has a fairly consistent structure. With the schemaless MongoDB database, there is some additional structure: the system namespace contains an explicit list of collections and indexes. Collections may be implicitly or explicitly created; indexes must be explicitly declared.
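A short sketch of this dynamic-schema behavior, again using the MongoDB Java driver (assumed available, with a local server; the names are illustrative): two documents of different shapes land in the same collection, with no migration or schema alteration.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SchemalessExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> items =
                    client.getDatabase("demo").getCollection("items"); // created implicitly

            // Two documents with different fields coexist in one collection.
            items.insertOne(new Document("sku", "p-1").append("price", 99.0));
            items.insertOne(new Document("sku", "p-2")
                    .append("price", 120.0)
                    .append("tags", java.util.List.of("new", "sale"))); // extra field, no migration

            items.find().forEach(d -> System.out.println(d.toJson()));
        }
    }
}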
No data truncation: a schemaless database makes almost no changes to the data as it is saved; the raw information is preserved.
Views need not be updated every time the relation on which a view is defined is updated, because the tuples of a view are computed each time the view is accessed. Materialized views, by contrast, are stored in the database system and are updated as the underlying tuples change, in one of three ways depending on the database system.
Guidelines for distributing data in NoSQL systems:
1. Queries should be moved to the data rather than moving data to the queries: sending the query to the node that holds the data avoids shipping large datasets over the network.
2. Hash rings should be used for an even distribution of data: a hash ring assigns each key a position on a ring and stores it on the nearest node, spreading data evenly across the cluster (see the sketch below).
3. For scaling read requests, replication should be used: copies of the data on several nodes let reads be spread across them.
4. Distribution of queries to nodes should be done by the database, not the application, so that clients need not know where data lives.
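To illustrate guideline 2, here is a minimal consistent-hashing sketch in plain Java (the node names, the use of MD5, and the virtual-node count are all illustrative assumptions): nodes and keys hash onto the same ring, each key is owned by the first node at or after its position, and virtual nodes keep the distribution even.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public HashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node); // virtual nodes smooth the distribution
        }
    }

    // A key belongs to the first node clockwise from its hash; wrap around at the end.
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF); // first 8 bytes as ring position
            return h;
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing(100);
        ring.addNode("node-A"); ring.addNode("node-B"); ring.addNode("node-C");
        System.out.println("customer:42 -> " + ring.nodeFor("customer:42"));
    }
}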
Distribution Models
The primary driver of interest in NoSQL has been its ability to run databases on a large cluster. As data volumes increase, it becomes more difficult and expensive to scale up (buy a bigger server to run the database on). A more appealing option is to scale out (run the database on a cluster of servers). Aggregate orientation fits well with scaling out because the aggregate is a natural unit to use for distribution.
Depending on your distribution model, you can get a data store that will give you the ability to handle
larger quantities of data, the ability to process greater read or write traffic, or more availability in the face
of network slowdowns or breakages.
Broadly, there are two paths to data distribution: replication and sharding. Replication takes the same data and copies it over multiple nodes. Sharding puts different data on different nodes. Replication and sharding are orthogonal techniques: you can use either or both of them. Replication comes in two forms: master-slave and peer-to-peer. We will now discuss these techniques starting at the simplest and
working up to the more complex: first single-server, then master-slave replication, then sharding, and finally
peer-to-peer replication.
1.1 Single Server
The first and the simplest distribution option is the one we would most often recommend: no distribution at
all. Run the database on a single machine that handles all the reads and writes to the data store. We prefer
this option because it eliminates all the complexities that the other options introduce; it’s easy for
operations people to manage and easy for application developers to reason about.
Although a lot of NoSQL databases are designed around the idea of running on a cluster, it can make
sense to use NoSQL with a single-server distribution model if the data model of the NoSQL store is more
suited to the application. Graph databases are the obvious category here; these work best in a single-server configuration. If your data usage is mostly about processing aggregates, then a single-server document or key-value store may well be worthwhile because it's easier on application developers. For the rest of this chapter we'll be wading through the advantages and complications of more sophisticated distribution schemes. Don't let the volume of words fool you into thinking that we would prefer these options. If we can get away without distributing our data, we will always choose a single-server approach.
Sharding
Often, a busy data store is busy because different people are accessing different parts of the dataset. In these circumstances we can support horizontal scalability by putting different parts of the data onto different servers, a technique called sharding (Figure 1.1).
Figure 1.1. Sharding puts different data on separate nodes, each of which does its own reads and writes.
In the ideal case, we have different users all talking to different server nodes. Each user only has to talk to one server, so gets rapid responses from that server. The load is balanced out nicely between servers; for example, if we have ten servers, each one only has to handle 10% of the load.
Of course the ideal case is a pretty rare beast. In order to get close to it we have to ensure that data that's accessed together is clumped together on the same node and that these clumps are arranged on the nodes to provide the best data access.
The first part of this question is how to clump the data up so that one user mostly gets her data from a single server. This is where aggregate orientation comes in really handy. The whole point of aggregates is that we design them to combine data that's commonly accessed together, so aggregates leap out as an obvious unit of distribution.
Another factor is trying to keep the load even. This means that you should try to arrange aggregates so they are evenly distributed across the nodes, which all get equal amounts of the load. This may vary over time; for example, some data tends to be accessed on certain days of the week, so there may be domain-specific rules you would like to use.
Pros: It can improve both reads and writes.
Cons: Clusters use less reliable machines, so resilience decreases.
Master-Slave Replication
With master-slave distribution, you replicate data across multiple nodes. One node is designated as the master, or primary. This master is the authoritative source for the data and is usually responsible for processing any updates to that data. The other nodes are slaves, or secondaries. A replication process synchronizes the slaves with the master (Figure 1.2).
Master-slave replication is most helpful for scaling when you have a read-intensive dataset. You can scale horizontally to handle more read requests by adding more slave nodes and ensuring that all read requests are routed to the slaves. You are still, however, limited by the ability of the master to process updates and its ability to pass those updates on. Consequently it isn't such a good scheme for datasets with heavy write traffic, although offloading the read traffic will help a bit with handling the write load.
A second advantage of master-slave replication is read resilience: should the master fail, the slaves can still handle read requests. Again, this is useful if most of your data access is reads. The failure of the master does eliminate the ability to handle writes until either the master is restored or a new master is appointed. However, having slaves as replicas of the master does speed up recovery after a failure of the master, since a slave can be appointed as the new master very quickly.
Replication comes with some alluring benefits, but it also comes with an inevitable dark side: inconsistency. You have the danger that different clients, reading different slaves, will see different values because the changes haven't all propagated to the slaves. In the worst case, that can mean that a client cannot read a write it just made. Even if you use master-slave replication just for hot backup this can be a concern, because if the master fails, any updates not passed on to the backup are lost.
Pros:
Handles more read requests: add more slave nodes.
Read scaling works when all read requests are routed to the slaves.
Cons:
The master is a bottleneck, limited by its ability to process updates and to pass those updates on.
Its failure eliminates the ability to handle writes until the master is restored or a new master is appointed.
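To make the routing rule behind this scheme concrete, here is a toy sketch in plain Java (node names are hypothetical, not part of any replication product): writes always go to the single master, while reads are balanced round-robin across the slaves, which is exactly why adding slaves scales reads but the master remains the write bottleneck.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class MasterSlaveRouter {
    private final String master;
    private final List<String> slaves;
    private final AtomicInteger next = new AtomicInteger();

    public MasterSlaveRouter(String master, List<String> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    // Every update must be processed by the single authoritative master.
    public String nodeForWrite() { return master; }

    // Reads are balanced round-robin over the slave replicas.
    public String nodeForRead() {
        return slaves.get(Math.floorMod(next.getAndIncrement(), slaves.size()));
    }

    public static void main(String[] args) {
        MasterSlaveRouter r = new MasterSlaveRouter("m1", List.of("s1", "s2", "s3"));
        System.out.println("write -> " + r.nodeForWrite());
        for (int i = 0; i < 4; i++) System.out.println("read  -> " + r.nodeForRead());
    }
}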
Peer-to-Peer Replication
Master-slave replication helps with read scalability but doesn't help with scalability of writes. It provides resilience against failure of a slave, but not of a master. Essentially, the master is still a bottleneck and a single point of failure. Peer-to-peer replication (Figure 1.3) attacks these problems by not having a master. All the replicas have equal weight, they can all accept writes, and the loss of any of them doesn't prevent access to the data store.
Figure 1.3. Peer-to-peer replication has all nodes applying reads and writes to all the data.
Pros:
You can ride over node failures without losing access to data.
You can easily add nodes to improve performance.
Cons:
Inconsistency
Slow propagation of changes to copies on different nodes
1.5 Combining Sharding and Replication
Replication and sharding are strategies that can be combined. If we use both master-slave replication and sharding (see Figure 1.4), this means that we have multiple masters, but each data item only has a single master. Depending on your configuration, you may choose a node to be a master for some data and a slave for other data, or you may dedicate nodes for master or slave duties.
Using peer-to-peer replication and sharding is a common strategy for column-family databases. In a scenario like this you might have tens or hundreds of nodes in a cluster with data sharded over them. A good starting point for peer-to-peer replication is to have a replication factor of 3, so each shard is present on three nodes. Should a node fail, the shards on that node will be rebuilt on the other nodes.
• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.
• Master-slave replication reduces the chance of update conflicts, but peer-to-peer replication avoids loading all writes onto a single point of failure.
CONSISTENCY:
VARIOUS FORMS OF CONSISTENCY:
VERSION STAMP:
RELAXING CONSISTENCY
Apache Cassandra (NOSQL database)
Apache Cassandra: Apache Cassandra is an open-source NoSQL database used for handling big data. Apache Cassandra can handle structured, semi-structured, and unstructured data. It was originally developed at Facebook, open-sourced in 2008, and became a top-level Apache project in 2010.
Features of Cassandra:
1. It is scalable.
2. It is flexible (it can accept structured, semi-structured, and unstructured data).
3. It offers tunable consistency and atomic writes at the row level, though it is not fully ACID-compliant.
4. It is highly available and fault tolerant.
5. It is open source.
6. In terms of the CAP theorem, it favors availability and partition tolerance.
(Figure 3: a cluster with replication factor RF = 3.)
cqlsh (CQL shell): cqlsh is a command-line shell for interacting with Cassandra through CQL (Cassandra Query Language).
CQL queries for basic operations:
Step-1: To create a keyspace, use the following CQL query.
CREATE KEYSPACE Emp
WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'};
Step-2: To use a keyspace, use the following CQL query.
Syntax:
USE keyspace_name;
USE Emp;
Step-3: To create a table use the following CQL query.
Example:
CREATE TABLE Emp_table (
name text PRIMARY KEY,
Emp_id int,
Emp_city text,
Emp_email text
);
Step-4: To insert into Emp_table use the following CQL query.
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('ashish', 1001, 'Delhi', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('Ashish Gupta', 1001, 'Bangalore', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('amit ', 1002, 'noida', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('dhruv', 1003, 'pune', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('shivang', 1004, 'mumbai', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('aayush', 1005, 'gurugram', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('bhagyesh', 1006, 'chandigar', '[email protected]');
Step-5: To read data, use the following CQL query.
SELECT * FROM Emp_table;
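The same read can also be issued from application code. Below is a minimal sketch using the DataStax Java driver (an assumption: driver 4.x is on the classpath and Cassandra is listening on the default 127.0.0.1:9042; note that unquoted CQL identifiers are lowercased, hence emp and emp_table).

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class ReadEmpTable {
    public static void main(String[] args) {
        // With no explicit contact point, the driver tries 127.0.0.1:9042.
        try (CqlSession session = CqlSession.builder()
                .withKeyspace("emp") // unquoted "Emp" was lowercased by CQL
                .build()) {
            ResultSet rs = session.execute(
                    "SELECT name, emp_id, emp_city, emp_email FROM emp_table");
            for (Row row : rs) {
                System.out.printf("%s (%d) %s %s%n",
                        row.getString("name"), row.getInt("emp_id"),
                        row.getString("emp_city"), row.getString("emp_email"));
            }
        }
    }
}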
Introduction to Cassandra
Cassandra is an open-source, distributed, wide-column NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is written in Java and developed by the Apache Software Foundation.
Avinash Lakshman and Prashant Malik initially developed Cassandra at Facebook to power the Facebook inbox search feature. Facebook released Cassandra as an open-source project on Google Code in July 2008. In March 2009 it became an Apache Incubator project, and in February 2010 it became a top-level project. Cassandra became popular due to its outstanding technical features.
Apache Cassandra is used to manage very large amounts of structured data spread out across the world.
It provides highly available service with no single point of failure. Listed below are some points of
Apache Cassandra:
It is scalable, fault-tolerant, and consistent.
It is a column-oriented database.
Its distributed design is based on Amazon's Dynamo, and its data model on Google's Bigtable.
It was created at Facebook, and it differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful "column family" data model. Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes of the cluster.
All the nodes of Cassandra in a cluster play the same role. Each node is independent, at the same time
interconnected to other nodes. Each node in a cluster can accept read and write requests, regardless of
where the data is actually located in the cluster. When a node goes down, read/write request can be
served from other nodes in the network.
Features of Cassandra:
Cassandra has become popular because of its technical features. There are some of the features of
Cassandra:
1. Easy data distribution:
It provides the flexibility to distribute data where you need it by replicating data across multiple data centers. For example, if there are 5 nodes, say N1, N2, N3, N4, N5, a partitioning algorithm decides the token ranges and distributes the data accordingly. Each node has a specific token range within which data will be distributed.
Data Replication in Cassandra
In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it is
detected that some of the nodes responded with an out-of-date value, Cassandra will return the most
recent value to the client. After returning the most recent value, Cassandra performs a read repair in the
background to update the stale values.
The following figure shows a schematic view of how Cassandra uses data replication among the nodes in
a cluster to ensure no single point of failure.
Note − Cassandra uses the Gossip Protocol in the background to allow the nodes to communicate with
each other and detect any faulty nodes in the cluster.
Components of Cassandra
Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the
database (Keyspace) as a container of tables. Programmers use cqlsh: a prompt to work with CQL or
separate application language drivers.
Clients approach any of the nodes for their read-write operations. That node (coordinator) plays a proxy
between the client and the nodes holding the data.
Write Operations
Every write activity of the nodes is captured by the commit logs written on the nodes. Later the data is captured and stored in the mem-table. Whenever the mem-table is full, data is written into the SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data.
Step-1:
In a write operation, as soon as a request is received, it is first dumped into the commit log to make sure the data is saved durably.
Step-2:
The data is also written to the MemTable, which holds the data until it gets full.
Step-3:
If the MemTable reaches its threshold, the data is flushed to an SSTable.
Read Operations
During read operations, Cassandra gets values from the mem-table and checks the bloom filter to find the appropriate SSTable that holds the required data.
In a read operation, there are three types of read requests that a coordinator can send to a replica. The node that receives the client request is called the coordinator for that particular operation.
Step-1: Direct Request:
The coordinator node sends the read request to one of the replicas.
Step-2: Digest Request:
The coordinator contacts the replicas specified by the consistency level. For example, CONSISTENCY TWO simply means that any two nodes in the data center must acknowledge.
Step-3: Read Repair Request:
If data is not consistent across the nodes, a background read repair request is initiated to make sure the most recent data is available across the nodes.
Storage Engine:
1. Commit log:
The commit log is the first entry point when writing to disk. The purpose of the commit log in Apache Cassandra is durability: if a node goes down, the commit log can be replayed to recover writes that had not yet been flushed.
2. Mem-table:
After data is written to the commit log, it is written to the Mem-table. Data is held in the Mem-table temporarily.
3. SSTable:
Once the Mem-table reaches a certain threshold, its data is flushed to an SSTable disk file.
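As a rough illustration of this write path, here is a toy sketch in plain Java (the file names and the threshold are illustrative; real SSTables also carry indexes, bloom filters, and compaction): each write is appended to a commit log first, kept in a sorted in-memory memtable, and flushed to an immutable sorted file when the memtable crosses its threshold.

import java.io.FileWriter;
import java.io.IOException;
import java.util.TreeMap;

public class ToyStorageEngine {
    private static final int MEMTABLE_THRESHOLD = 3;              // illustrative
    private final TreeMap<String, String> memtable = new TreeMap<>(); // sorted in memory
    private int sstableCount = 0;

    public void write(String key, String value) throws IOException {
        // Step 1: append to the commit log so the write survives a crash.
        try (FileWriter log = new FileWriter("commit.log", true)) {
            log.write(key + "=" + value + "\n");
        }
        // Step 2: store in the memtable until it fills up.
        memtable.put(key, value);
        // Step 3: flush the memtable to a sorted, immutable SSTable file.
        if (memtable.size() >= MEMTABLE_THRESHOLD) flush();
    }

    private void flush() throws IOException {
        try (FileWriter sstable = new FileWriter("sstable-" + (sstableCount++) + ".dat")) {
            for (var e : memtable.entrySet()) {                   // keys come out sorted
                sstable.write(e.getKey() + "=" + e.getValue() + "\n");
            }
        }
        memtable.clear();
    }

    public static void main(String[] args) throws IOException {
        ToyStorageEngine db = new ToyStorageEngine();
        db.write("name:1001", "ashish");
        db.write("city:1001", "Delhi");
        db.write("name:1002", "amit"); // third write triggers a flush
    }
}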
The data model of Cassandra is significantly different from what we normally see in an RDBMS. This
chapter provides an overview of how Cassandra stores its data.
Cluster: The Cassandra database is distributed over several machines that operate together. The outermost container is known as the cluster. For failure handling, every node contains a replica, and in case of a failure, the replica takes charge. Cassandra arranges the nodes in a cluster in a ring format and assigns data to them.
Keyspace: Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace
in Cassandra are −
Replication factor − It is the number of machines in the cluster that will receive copies of the
same data.
Replica placement strategy − It is nothing but the strategy to place replicas in the ring. We have strategies such as simple strategy (rack-unaware strategy), old network topology strategy (rack-aware strategy), and network topology strategy (datacenter-aware strategy).
Column families − Keyspace is a container for a list of one or more column families. A column
family, in turn, is a container of a collection of rows. Each row contains ordered columns.
Column families represent the structure of your data. Each keyspace has at least one and often
many column families.
Column Family
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered
collection of columns.
Note − Unlike relational tables, a column family's schema is not fixed: Cassandra does not force individual rows to have all the columns.
Column
A column is the basic data structure of Cassandra with three values, namely key or column name, value,
and a time stamp. Given below is the structure of a column.
SuperColumn
A super column is a special column, therefore, it is also a key-value pair. But a super column stores a
map of sub-columns.
Generally column families are stored on disk in individual files. Therefore, to optimize performance, it is
important to keep columns that you are likely to query together in the same column family, and a super
column can be helpful here. Given below is the structure of a super column.
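To make the model concrete, here is a small sketch in plain Java types (the names are illustrative, not Cassandra's internal classes): a column is a (name, value, timestamp) triple, a row is an ordered map of columns keyed by column name, and a column family maps row keys to rows; note that the two rows need not share the same columns.

import java.util.Map;
import java.util.TreeMap;

public class ColumnFamilyModel {
    // A column: key/column name, value, and a timestamp.
    record Column(String name, String value, long timestamp) {}

    public static void main(String[] args) {
        long now = System.currentTimeMillis();

        // A column family: row key -> ordered collection of columns.
        Map<String, TreeMap<String, Column>> employees = new TreeMap<>();

        TreeMap<String, Column> row1 = new TreeMap<>();
        row1.put("emp_city", new Column("emp_city", "Delhi", now));
        row1.put("emp_id", new Column("emp_id", "1001", now));
        employees.put("ashish", row1);

        // A second row with a different set of columns: nothing forces
        // individual rows to have all the columns.
        TreeMap<String, Column> row2 = new TreeMap<>();
        row2.put("emp_email", new Column("emp_email", "[email protected]", now));
        employees.put("amit", row2);

        employees.forEach((key, cols) -> System.out.println(key + " -> " + cols.keySet()));
    }
}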
The following points differentiate the data model of Cassandra from that of an RDBMS:
RDBMS deals with structured data; Cassandra deals with unstructured data.
In an RDBMS, the database is the outermost container that contains data corresponding to an application; in Cassandra, the keyspace is the outermost container.
Integrating Cassandra with Hadoop:
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
<property>
<name>cassandra.output.thrift.address</name>
<value>localhost</value>
</property>
These settings configure Hadoop to use the Cassandra input and output formats.
Step 3: Configure Cassandra
Next, you need to configure Cassandra to work with Hadoop. This involves modifying the Cassandra
configuration files to include the necessary settings for connecting to Hadoop.
First, navigate to the Cassandra installation directory and open the cassandra.yaml file. Add the following
lines to the file:
hadoop_config:
fs.default.name: hdfs://localhost:9000
This setting configures Cassandra to use the Hadoop file system.
Step 4: Create a Hadoop Job to Access Cassandra Data
Once you have configured Hadoop and Cassandra to work together, you can create a Hadoop job to
access the data stored in Cassandra. This involves writing a MapReduce program that uses the Cassandra
input and output formats to read and write data.
Here’s an example MapReduce program that reads data from a Cassandra table and writes it to a Hadoop
file:
// Example MapReduce driver that reads rows from a Cassandra column family and
// writes them out as text. The Hadoop imports below are standard;
// CassandraInputFormat, CassandraConfigHelper, CassandraMapper, and
// CassandraReducer are assumed to come from the Cassandra-Hadoop integration
// library and from your own job code (class names vary by version).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CassandraHadoopJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Cassandra Hadoop Job");
        job.setJarByClass(CassandraHadoopJob.class);
        // Read from Cassandra, write plain text to HDFS.
        job.setInputFormatClass(CassandraInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Map and Reduce implementations (defined elsewhere in the job code).
        job.setMapperClass(CassandraMapper.class);
        job.setReducerClass(CassandraReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Point the job at the Cassandra keyspace/table and the cluster address.
        CassandraConfigHelper.setInputColumnFamily(job.getConfiguration(), "keyspace", "table");
        CassandraConfigHelper.setOutputColumnFamily(job.getConfiguration(), "keyspace", "table");
        CassandraConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
        CassandraConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
This program uses the CassandraInputFormat and TextOutputFormat classes to read and write data,
respectively. The CassandraMapper and CassandraReducer classes define the Map and Reduce functions,
respectively. The CassandraConfigHelper class is used to configure the input and output column families,
as well as the initial address and RPC port for connecting to Cassandra.
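For completeness, here is a hypothetical sketch of what the CassandraMapper and CassandraReducer referenced above might look like (the real input key/value types depend on the version of the Cassandra input format; Text is used here purely for illustration).

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits each input record unchanged; a real job would extract fields here.
class CassandraMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}

// Writes one line per key; a real job might aggregate the grouped values.
class CassandraReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            context.write(key, v);
        }
    }
}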
Once you have written your MapReduce program, you can run the Hadoop job using the following
command:
$ hadoop jar <path-to-jar-file> <main-class> <input-path> <output-path>
Replace <path-to-jar-file> with the path to your MapReduce program’s JAR file, <main-class> with the
fully qualified name of your program’s main class, <input-path> with the path to the input data,
and <output-path> with the path to the output data.
Conclusion
Integrating Cassandra with Hadoop provides several benefits to organizations looking to manage their big
data efficiently. By leveraging the scalability and availability of Cassandra and the efficient processing
capabilities of Hadoop, organizations can handle even larger data sets with ease. Integrating the two
technologies involves several steps, including installing and configuring Hadoop and Cassandra, creating
a Hadoop job to access Cassandra data, and running the job. With this guide, you’ll be able to integrate
Cassandra with Hadoop and take advantage of the benefits that come with it.
We will start by using separate tables for storing the Customer and Product information. However, we
need to introduce a fair amount of denormalization to support the 3rd and 4th queries shown above.
We will create two more tables to achieve this – “Customer_by_Product” and “Product_by_Customer“.
Let's look at the Cassandra table schema for this example:
CREATE TABLE Customer (
cust_id text,
first_name text,
last_name text,
registered_on timestamp,
PRIMARY KEY (cust_id));
Review Questions:
Predict who is generating big data, and name the ecosystem projects used for processing it. (Understand)
What is a NoSQL database? (Remembering)
What is a key-value data store? (Remembering)
Compare document stores vs. key-value stores. (Remembering)
Provide your own definition of what big data means to your organization. (Remembering)
Outline sharding. (Creating)