
SRM TRP Engineering College

Department of Computer Science and Engineering

UNIT II
Introduction to NoSQL
NoSQL is a type of database management system (DBMS) that is designed to handle and store
large volumes of unstructured and semi-structured data. Unlike traditional relational databases that use
tables with pre-defined schemas to store data, NoSQL databases use flexible data models that can adapt
to changes in data structures and are capable of scaling horizontally to handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term has
since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a wide range of
different database architectures and data models.
NoSQL databases are generally classified into four main categories:
1. Document databases: These databases store data as semi-structured documents, such as JSON
or XML, and can be queried using document-oriented query languages.
2. Key-value stores: These databases store data as key-value pairs, and are optimized for simple and
fast read/write operations.
3. Column-family stores: These databases store data as column families, which are sets of columns
that are treated as a single entity. They are optimized for fast and efficient querying of large
amounts of data.
4. Graph databases: These databases store data as nodes and edges, and are designed to handle
complex relationships between data.
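To make the four categories concrete, here is a hypothetical sketch of how one "user" record might be represented under each model, using plain Python structures as stand-ins (all names and values are illustrative, not tied to any particular product):

```python
# Hypothetical illustration: one "user" record as each of the four
# NoSQL categories might represent it (plain Python stand-ins).

# 1. Document database: a self-contained JSON-like document.
document = {"_id": "u1", "name": "Asha", "orders": [{"item": "book", "qty": 2}]}

# 2. Key-value store: an opaque value looked up by a unique key.
key_value = {"user:u1": '{"name": "Asha"}'}

# 3. Column-family store: rows keyed by id, columns grouped into families.
column_family = {"u1": {"profile": {"name": "Asha"},
                        "activity": {"last_login": "2024-01-01"}}}

# 4. Graph database: nodes plus explicit edges between them.
nodes = {"u1": {"name": "Asha"}, "p1": {"item": "book"}}
edges = [("u1", "PURCHASED", "p1")]
```

The same information is present in every case; what differs is the unit of storage and retrieval each model is optimized for.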
NoSQL databases are often used in applications where there is a high volume of data that needs to be
processed and analyzed in real-time, such as social media analytics, e-commerce, and gaming. They
can also be used for other applications, such as content management systems, document management,
and customer relationship management.
However, NoSQL databases may not be suitable for all applications, as they may not provide the same
level of data consistency and transactional guarantees as traditional relational databases. It is important
to carefully evaluate the specific needs of an application when choosing a database management
system.
Key Features of NoSQL :
Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate changing data
structures without the need for migrations or schema alterations.
1. Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to a
database cluster, making them well-suited for handling large amounts of data and high levels of
traffic.
2. Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model,
where data is stored in semi-structured format, such as JSON or BSON.
3. Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model, where data
is stored as a collection of key-value pairs.
4. Column-based: Some NoSQL databases, such as Cassandra, use a column-based data model,
where data is organized into columns instead of rows.
5. Distributed and high availability: NoSQL databases are often designed to be highly available and
to automatically handle node failures and data replication across multiple nodes in a database
cluster.
6. Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and dynamic
manner, with support for multiple data types and changing data structures.
7. Performance: NoSQL databases are optimized for high performance and can handle a high volume
of reads and writes, making them suitable for big data and real-time applications.
Advantages of NoSQL: There are many advantages of working with NoSQL databases such as
MongoDB and Cassandra. The main advantages are high scalability and high availability.
1. High scalability : NoSQL databases use sharding for horizontal scaling. Sharding is the partitioning of data and its placement across multiple machines in such a way that the order of the data is preserved. Vertical scaling means adding more resources to an existing machine, whereas horizontal scaling means adding more machines to handle the data; vertical scaling is not easy to implement, while horizontal scaling is. Examples of horizontally scaling databases are MongoDB, Cassandra, etc. Because of this scalability, NoSQL can handle huge amounts of data: as the data grows, NoSQL scales out to handle it efficiently.
2. Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data, which
means that they can accommodate dynamic changes to the data model. This makes NoSQL
databases a good fit for applications that need to handle changing data requirements.
3. High availability : The auto-replication feature of NoSQL databases makes them highly available, because in case of a failure the data can be served from a replica holding the last consistent state.
4. Scalability: NoSQL databases are highly scalable, which means that they can handle large amounts
of data and traffic with ease. This makes them a good fit for applications that need to handle large
amounts of data or traffic
5. Performance: NoSQL databases are designed to handle large amounts of data and traffic, which
means that they can offer improved performance compared to traditional relational databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective than traditional relational
databases, as they are typically less complex and do not require expensive hardware or software.
7. Agility: Ideal for agile development.
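The sharding mentioned under high scalability can be sketched in a few lines. This is a minimal, assumption-laden illustration of hash-based sharding (the node names are hypothetical), not how any particular database implements it:

```python
# A minimal sketch of hash-based sharding (horizontal scaling):
# each key is routed to one of several machines ("shards") so that
# data and load are spread across the cluster.

import hashlib

SHARDS = ["node-a", "node-b", "node-c"]  # machines in the cluster

def shard_for(key: str) -> str:
    """Deterministically map a key to a shard."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# The same key always lands on the same machine.
assert shard_for("customer:42") == shard_for("customer:42")
```

Note that naively adding a machine to `SHARDS` would remap most keys, which is why production systems typically use consistent hashing to limit how much data moves during scaling.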
Disadvantages of NoSQL: NoSQL has the following disadvantages.
1. Lack of standardization : There are many different types of NoSQL databases, each with its own
unique strengths and weaknesses. This lack of standardization can make it difficult to choose the
right database for a specific application
2. Lack of ACID compliance : Most NoSQL databases are not fully ACID-compliant, which means they do not guarantee the atomicity, consistency, isolation, and durability of transactions. This can be a drawback for applications that require strong data-consistency guarantees.
3. Narrow focus : NoSQL databases have a narrow focus: they are designed mainly for storage and provide comparatively little functionality beyond it. Relational databases remain a better choice for transaction management.
4. Open-source : Most NoSQL databases are open-source, and there is no reliable standard for NoSQL yet. In practice, two NoSQL database systems are likely to behave quite differently.
5. Lack of support for complex queries : NoSQL databases are not designed to handle complex
queries, which means that they are not a good fit for applications that require complex data analysis
or reporting.
6. Lack of maturity : NoSQL databases are relatively new and lack the maturity of traditional
relational databases. This can make them less reliable and less secure than traditional databases.
7. Management challenge : The purpose of big data tools is to make the management of a large
amount of data as simple as possible. But it is not so easy. Data management in NoSQL is much
more complex than in a relational database. NoSQL, in particular, has a reputation for being
challenging to install and even more hectic to manage on a daily basis.
8. GUI is not available : GUI mode tools to access the database are not flexibly available in the
market.
9. Backup : Backup is a weak point for some NoSQL databases. In MongoDB, for example, taking a consistent point-in-time backup across a cluster is not straightforward.
10. Large document size : Some database systems, such as MongoDB and CouchDB, store data in JSON format. This means documents can be quite large, which costs network bandwidth and speed, and having descriptive key names actually hurts, since they increase the document size.
| SQL | NoSQL |
|---|---|
| Relational database management system (RDBMS) | Non-relational or distributed database system |
| Fixed, static, predefined schema | Dynamic schema |
| Not suited for hierarchical data storage | Best suited for hierarchical data storage |
| Best suited for complex queries | Not as good for complex queries |
| Vertically scalable | Horizontally scalable |
| Follows ACID properties | Follows CAP (consistency, availability, partition tolerance) |
| Examples: MySQL, PostgreSQL, Oracle, MS-SQL Server, etc. | Examples: MongoDB, HBase, Neo4j, Cassandra, etc. |

Types of NoSQL database: Types of NoSQL databases and the name of the databases system that
falls in that category are:
1. Graph Databases: Examples – Amazon Neptune, Neo4j
2. Key value store: Examples – Memcached, Redis, Coherence
3. Tabular: Examples – Hbase, Big Table, Accumulo
4. Document-based: Examples – MongoDB, CouchDB, Cloudant
When should NoSQL be used:
1. When a huge amount of data needs to be stored and retrieved.
2. The relationship between the data you store is not that important
3. The data changes over time and is not structured.
4. Support of Constraints and Joins is not required at the database level
5. The data is growing continuously and you need to scale the database regularly to handle the data.
Difference between Relational database and NoSQL
1. Relational Database :
RDBMS stands for Relational Database Management System. It is the most popular kind of database. In it, data is stored in the form of rows, that is, tuples. It contains a number of tables, and data can be accessed easily because of this tabular structure. The relational model was proposed by E.F. Codd.
2. NoSQL :
NoSQL stands for "non-SQL" (or "not only SQL"). A NoSQL database does not use tables to store data the way a relational database does. It is used for storing and fetching data, generally large amounts of it, supports simple query languages, and provides good performance at scale.
Difference between Relational database and NoSQL :
| Relational Database | NoSQL |
|---|---|
| Handles data arriving at low velocity | Handles data arriving at high velocity |
| Gives only read scalability | Gives both read and write scalability |
| Manages structured data | Manages all types of data |
| Data arrives from one or few locations | Data arrives from many locations |
| Supports complex transactions | Supports simple transactions |
| Has a single point of failure | No single point of failure |
| Handles data in low volume | Handles data in high volume |
| Transactions written in one location | Transactions written in many locations |
| Supports ACID properties | Does not guarantee ACID properties |
| Difficult to make changes once the database is defined | Enables easy and frequent changes to the database |
| Schema is mandatory to store the data | Schema design is not required |
| Deployed in vertical fashion | Deployed in horizontal fashion |

Aggregate Data Model in NoSQL


As we know, NoSQL databases store data in formats other than relational tables, and NoSQL is used in nearly every industry nowadays. For the people who interact with data in these databases, the aggregate data model helps structure that interaction.
Features of NoSQL Databases:
 Schema agnostic: NoSQL databases do not require a specific schema or storage structure the way a traditional RDBMS does.
 Scalability: NoSQL databases scale horizontally; as data grows rapidly, commodity hardware can be added while the scalability characteristics are preserved.
 Performance: The performance of a NoSQL system can be increased simply by adding commodity servers, giving reliable and fast database access with minimum overhead.
 High availability: A traditional RDBMS relies on primary and secondary nodes for fetching data, whereas some NoSQL databases use a masterless (peer-to-peer) architecture.
 Global availability: Because data is replicated among multiple servers and clouds, it is accessible from anywhere, which minimizes latency.
Aggregate Data Models:
The term aggregate means a collection of objects treated as a unit: an aggregate is a collection of related data that we interact with as a single unit. These aggregates form the natural boundaries for ACID operations.
Example of Aggregate Data Model:

The example diagram shows two aggregates:

 Customer and Order, with the link between them crossing the aggregate boundary.
 The diamonds show how the data fits into the aggregate structure.
 Customer contains a list of billing addresses.
 Payment also contains the billing address.
 The address appears three times; it is copied each time rather than shared.
 This model fits a domain where we do not want a shipping or billing address to change retroactively.
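The customer/order aggregates described above can be sketched as nested documents. The field names and values below are illustrative; the point is that the billing address is copied into each aggregate rather than normalized into its own table:

```python
# Sketch of the customer/order aggregates as nested documents.
# The billing address is copied wherever it is needed, so each
# aggregate is self-contained.

billing_address = {"city": "Chennai", "street": "1 Main St"}

customer = {  # aggregate 1: Customer
    "id": 1,
    "name": "Martin",
    "billing_addresses": [dict(billing_address)],
}

order = {  # aggregate 2: Order, linked to the customer by id
    "id": 99,
    "customer_id": 1,
    "shipping_address": dict(billing_address),  # a copy, not a reference
    "payment": {"card": "amex", "billing_address": dict(billing_address)},
}

# Each aggregate can be read, written, and replicated as one unit.
assert order["payment"]["billing_address"]["city"] == "Chennai"
```

Because each aggregate is a single unit, the database can store, replicate, and atomically update a whole customer or a whole order without joins.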
Consequences of Aggregate Orientation:
 Aggregation is not a logical data property; it is all about how the data is used by applications.
 An aggregate structure may be an obstacle for some data interactions but a help for others.
 It has an important consequence for transactions.
 Many NoSQL databases do not support ACID transactions across aggregates, sacrificing some consistency.
 Aggregate-oriented databases do, however, support atomic manipulation of a single aggregate at a time.
Advantage:
 It can be used as a primary data source for online applications.
 Easy Replication.
 No single point Failure.
 It provides fast performance and horizontal Scalability.
 It can handle structured, semi-structured, and unstructured data with equal effort.
Disadvantage:
 No standard rules.
 Limited query capabilities.
 Doesn’t work well with relational data.
 Not so popular in the enterprise.
 When the value of data increases it is difficult to maintain unique values.
Architecture Patterns of NoSQL:

Architecture Pattern is a logical way of categorizing data that will be stored on the
Database. NoSQL is a type of database which helps to perform operations on big data and store it in a
valid format. It is widely used because of its flexibility and a wide variety of services.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained as following below.
1. Key-Value Store Database:
This model is one of the most basic NoSQL models. As the name suggests, data is stored in the form of key-value pairs. The key is usually a sequence of strings, integers, or characters, but can also be a more advanced data type; the value is linked to the key. Key-value databases generally store data in a hash table in which each key is unique. The value can be of any type (JSON, a BLOB (Binary Large Object), a string, etc.). This pattern is commonly used in shopping websites and e-commerce applications.
Advantages:
 Can handle large amounts of data and heavy load,
 Easy retrieval of data by keys.
Limitations:
 Complex queries may involve multiple key-value pairs, which can hurt performance.
 Data with many-to-many relationships is awkward to model.
Examples:
 DynamoDB
 Berkeley DB
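The hash-table design described above can be sketched as a toy in-memory key-value store. This is purely illustrative (the class and method names are our own, not any product's API):

```python
# A toy in-memory key-value store: unique keys, opaque values,
# O(1) average-case put/get/delete backed by a hash table.

class KVStore:
    def __init__(self):
        self._data = {}          # backing hash table

    def put(self, key, value):
        self._data[key] = value  # last write wins

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("cart:alice", ["book", "pen"])   # the value can be any type
assert store.get("cart:alice") == ["book", "pen"]
store.delete("cart:alice")
assert store.get("cart:alice") is None
```

Real key-value databases add persistence, replication, and expiry on top of this basic shape, but the interface stays this simple, which is where their speed comes from.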

2. Column Store Database:


Rather than storing data in relational tuples, the data is stored in individual cells that are grouped into columns. Column-oriented databases store large amounts of data for each column together, and the format and set of columns can diverge from one row to another. Every column is treated separately, but related columns can be grouped into column families, much as tables group columns in traditional databases.
In short, columns are the unit of storage in this type.
Advantages:
 Data is readily available
 Queries like SUM, AVERAGE, COUNT can be easily performed on columns.
Examples:
 HBase
 Bigtable by Google
 Cassandra

3. Document Database:
A document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document is a complex data structure: it can be text, an array, a string, JSON, XML, or any similar format, and nested documents are very common. This type is very effective because most data created today is unstructured and usually in the form of JSON.
Advantages:
 This type of format is very useful and apt for semi-structured data.
 Storage retrieval and managing of documents is easy.
Limitations:
 Handling multiple documents is challenging
 Aggregation operations may not work accurately.
Examples:
 MongoDB
 CouchDB

Figure – Document store model in the form of JSON documents


4. Graph Databases:
This architecture pattern deals with the storage and management of data as graphs. Graphs are structures that depict connections between two or more objects in the data. The objects or entities are called nodes, and they are joined together by relationships called edges; each edge has a unique identifier. Each node serves as a point of contact in the graph. This pattern is very commonly used in social networks, where there are a large number of entities and each entity has one or more characteristics connected by edges. Whereas a relational schema only loosely connects its tables through joins, the relationships in a graph are explicit, first-class data.
Advantages:
 Fastest traversal because of connections.
 Spatial data can be easily handled.
Limitations:
 Wrong connections may lead to infinite loops.
Examples:
 Neo4J
 FlockDB( Used by Twitter)

TYPES OF DATA MODELS


Key-Value Data Model in NoSQL
A key-value data model, or key-value store, is a non-relational type of database. It uses an associative array as its basic structure, in which each individual key is linked to exactly one value in a collection. Keys are unique identifiers for the values, and a value can be any kind of entity. The collection of key-value pairs stored as separate records forms the key-value database, and these records do not have a predefined structure.

How do key-value databases work?


A value, which may be anything from a simple string to a complicated entity, is associated with a key that the key-value database uses to locate it. A key-value database thus resembles the map object, array, or dictionary found in many programming paradigms, except that it is stored persistently and managed by a DBMS.

A key-value store uses an efficient and compact index structure to find a value quickly and reliably by its key. For example, Redis is a key-value store used to track lists, maps, sets, and primitive types (simple data structures) in a persistent database. By supporting only a predetermined number of value types, Redis can expose a very simple interface for querying and manipulating them and, when properly configured, deliver high throughput.
When to use a key-value database:
Here are a few situations in which you can use a key-value database:-
 User session attributes in an online app such as finance or gaming, i.e. real-time random data access.
 Caching mechanism for repeatedly accessing data or key-based design.
 The application is developed on queries that are based on keys.
Features:
 One of the simplest kinds of NoSQL data models.
 For storing, getting, and removing data, key-value databases utilize simple functions.
 Querying language is not present in key-value databases.
 Built-in redundancy makes this database more reliable.

Advantages:
 It is very easy to use. Because of the database's simplicity, values of any kind, or even of different kinds, can be stored as required.
 Its response time is fast, provided the surrounding environment is well constructed and tuned.
 Key-value store databases are scalable vertically as well as horizontally.
 Built-in redundancy makes this database more reliable.
Disadvantages:
 As querying language is not present in key-value databases, transportation of queries from one
database to a different database cannot be done.
 The key-value store database is not refined. You cannot query the database without a key.
SRM TRP Engineering College
Department of Computer Science and Engineering
Some examples of key-value databases:
Here are some popular key-value databases which are widely used:
 Couchbase: It permits SQL-style querying and searching for text.
 Amazon DynamoDB: The key-value database which is mostly used is Amazon DynamoDB as it is
a trusted database used by a large number of users. It can easily handle a large number of requests
every day and it also provides various security options.
 Riak: A distributed key-value database designed for high availability, used to develop applications.
 Aerospike: It is an open-source and real-time database working with billions of exchanges.
 Berkeley DB: It is a high-performance and open-source database providing scalability.

Document Data Model:


A document data model differs from other data models because it stores data in JSON, BSON, or XML documents. In this model, documents can be nested inside one another, and particular elements can be indexed to make queries run faster. Documents are stored and retrieved in a form close to the data objects used in applications, which means very few translations are required to use the data in an application. JSON is the native format most often used to store and query this data.
So in the document data model, each document holds a set of key-value pairs. Below is an example:
{
  "Name" : "Yashodhra",
  "Address" : "Near Patel Nagar",
  "Email" : "[email protected]",
  "Contact" : "12345"
}
Working of the Document Data Model:
This model works as a semi-structured data model: the records and the data associated with them are stored in a single document, so the data is not completely unstructured. The key point is that all of the data for a record lives together in one document.
Features:
 Document type model: Because data is stored in documents rather than tables or graphs, it is easy to map to structures in many programming languages.
 Flexible schema: The overall schema is very flexible; not all documents in a collection need to have the same fields.
 Distributed and resilient: Document data models are highly distributed, which enables horizontal scaling and distribution of data.
 Manageable query language: The query language lets developers perform CRUD (Create, Read, Update, Delete) operations on the data model.
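The CRUD operations just mentioned can be sketched against a toy in-memory document collection. The query style below loosely mimics document databases such as MongoDB, but every function name here is our own illustration, not a real API:

```python
# Sketch of CRUD over a toy in-memory document "collection".

collection = []  # a collection of schemaless documents

def insert(doc):                      # Create
    collection.append(dict(doc))

def find(**criteria):                 # Read: match on field values
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

def update(criteria, changes):        # Update all matching documents
    for d in find(**criteria):
        d.update(changes)

def delete(**criteria):               # Delete all matching documents
    matches = find(**criteria)
    collection[:] = [d for d in collection if d not in matches]

insert({"Name": "Yashodhra", "City": "Patel Nagar"})
insert({"Name": "Ravi"})              # different fields: flexible schema
update({"Name": "Ravi"}, {"City": "Chennai"})
assert find(Name="Ravi")[0]["City"] == "Chennai"
```

Note how the two documents have different fields; nothing forces them to share a schema, which is exactly the flexibility the features above describe.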
Examples of Document Data Models :
 Amazon DocumentDB
 MongoDB
 Cosmos DB
 ArangoDB
 Couchbase Server
 CouchDB
Advantages:
 Schema-less: These databases are very good at retaining existing data at massive volumes, because there are no restrictions on the format or structure of the stored data.
 Fast creation and low maintenance: It is very simple to create a document, and beyond that almost no maintenance is required.
 Open formats: Documents use a simple build process based on XML, JSON, and related formats.
 Built-in versioning: As documents grow in size they may also grow in complexity; built-in versioning decreases the conflicts this can cause.
Disadvantages:
 Weak atomicity: Many document databases lack support for multi-document ACID transactions. A change that involves two collections requires two separate queries, one per collection, which breaks atomicity requirements.
 Consistency-check limitations: It is possible to search collections and documents that are not connected to one another, but doing so can hurt database performance.
 Security: Many web applications lack adequate security, which can result in the leakage of sensitive data, so web-application vulnerabilities deserve attention.
 Security: Nowadays many web applications lack security which in turn results in the leakage of
sensitive data. So it becomes a point of concern, one must pay attention to web app vulnerabilities.
Applications of Document Data Model :
 Content management: These data models are widely used for video-streaming platforms, blogs, and similar services, because each piece of content is stored as a single document and the database is easier to maintain as the service evolves over time.
 Book databases: These are very useful for book databases, because the model lets us nest related data inside a document.
 Catalogs: These data models are well suited to storing and reading catalog files, because reads stay fast even when catalog entries have thousands of attributes.
 Analytics platforms: These data models are also widely used in analytics platforms.
 Analytics Platform: These data models are very much used in the Analytics Platform.
Introduction to Graph Database on NoSQL
A graph database is a type of NoSQL database that is designed to handle data with complex
relationships and interconnections. In a graph database, data is stored as nodes and edges, where nodes
represent entities and edges represent the relationships between those entities.
The description of components are as follows:
 Nodes: represent the objects or instances; a node is roughly equivalent to a row in a relational database and acts as a vertex in the graph. Nodes are grouped by applying a label to each member.
 Relationships: the edges of the graph. They have a specific direction and type, form patterns in the data, and establish the relationships between nodes.
 Properties: the information associated with nodes (and edges).
Some examples of graph database software are Neo4j, Oracle NoSQL Database, GraphBase, etc., of which Neo4j is the most popular.
Types of Graph Databases:
 Property graphs: Used for querying and analyzing data by modelling the relationships among the data. They comprise vertices, which hold information about a particular subject, and edges, which denote the relationships. Both vertices and edges can carry additional attributes called properties.
 RDF graphs: RDF stands for Resource Description Framework. RDF graphs focus more on data integration and are used to represent complex data with well-defined semantics. Each statement is represented by three elements, two vertices and an edge, reflecting the subject, predicate, and object of a sentence; every vertex and edge is identified by a URI (Uniform Resource Identifier).
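The RDF idea above can be sketched as (subject, predicate, object) triples queried by pattern matching, with `None` as a wildcard. The `ex:` URIs here are hypothetical placeholders, not real identifiers:

```python
# Sketch of RDF-style data: (subject, predicate, object) triples
# queried by pattern matching (None = wildcard).

triples = [
    ("ex:alice", "ex:knows",   "ex:bob"),
    ("ex:alice", "ex:livesIn", "ex:chennai"),
    ("ex:bob",   "ex:livesIn", "ex:chennai"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the given pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Who lives in Chennai?
residents = [s for s, _, _ in match(p="ex:livesIn", o="ex:chennai")]
assert residents == ["ex:alice", "ex:bob"]
```

Real RDF stores answer this kind of pattern query with SPARQL, but the underlying shape of the data, a set of subject-predicate-object statements, is exactly this.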
When to Use Graph Database?
 Graph databases should be used for heavily interconnected data.
 It should be used when amount of data is larger and relationships are present.
 It can be used to represent the cohesive picture of the data.
How Graph and Graph Databases Work?
Graph databases provide graph models and allow users to perform traversal queries, since the data is connected. Graph algorithms can also be applied to find patterns, paths, and other relationships, enabling deeper analysis of the data: they help explore neighbouring nodes, cluster vertices, and analyze relationships and patterns. Countless joins are not required in this kind of database.
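A traversal query of the kind described above can be sketched over a plain adjacency list: finding everyone within two hops of a person ("friends of friends") without any joins. The names are illustrative:

```python
# Sketch of a traversal query over an adjacency-list graph:
# breadth-first search up to a fixed number of edge hops.

from collections import deque

friends = {                # node -> neighbouring nodes (edges)
    "asha":   ["bala", "chitra"],
    "bala":   ["asha", "deepa"],
    "chitra": ["asha", "elango"],
    "deepa":  ["bala"],
    "elango": ["chitra"],
}

def within_hops(start, hops):
    """Return all nodes reachable from start in at most `hops` edges."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue               # don't expand past the hop limit
        for nxt in friends.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

# Friends of friends of "asha": direct friends plus one more hop.
assert within_hops("asha", 2) == {"bala", "chitra", "deepa", "elango"}
```

In a relational database the same two-hop query would need a self-join per hop; in a graph database it is a direct walk along stored edges.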

Example of Graph Database:


 Recommendation engines in e-commerce use graph databases to give customers accurate recommendations and updates about new products, increasing sales and satisfying customers' desires.
 Social media companies use graph databases to find "friends of friends," or products that a user's friends like, and send suggestions to the user accordingly.
 Graph databases play a major role in fraud detection. Users can build a graph from the transactions between entities along with other important information; once created, a simple query can help identify fraud.
Advantages of Graph Database:
 A potential advantage of a graph database is that it can also establish relationships with external sources.
 No joins are required, since the relationships are already specified.
 Query time depends on the concrete relationships traversed, not on the total amount of data.
 It is flexible and agile.
 It is easy to manage the data in terms of a graph.
 Efficient data modeling: Graph databases allow for efficient data modeling by representing data as
nodes and edges. This allows for more flexible and scalable data modeling than traditional
relational databases.
 Flexible relationships: Graph databases are designed to handle complex relationships and
interconnections between data elements. This makes them well-suited for applications that require
deep and complex queries, such as social networks, recommendation engines, and fraud detection
systems.
 High performance: Graph databases are optimized for handling large and complex datasets, making
them well-suited for applications that require high levels of performance and scalability.
 Scalability: Graph databases can be easily scaled horizontally, allowing additional servers to be
added to the cluster to handle increased data volume or traffic.
 Easy to use: Graph databases are typically easier to use than traditional relational databases. They
often have a simpler data model and query language, and can be easier to maintain and scale.
Disadvantages of Graph Database:
 For very complex relationship patterns, searches can become slower.
 The query language is platform-dependent.
 They are inappropriate for transactional data.
 They have a smaller user base.
Future of Graph Database:
A graph database is an excellent tool for storing data, but it cannot completely replace the traditional database; it deals with a particular kind of interconnected data. Although graph databases are still maturing, they are becoming important as businesses and organizations adopt big data, where graph databases help with complex analysis. These databases have become a must for today's needs and tomorrow's success.

Column data bases:


A columnar database is a database management system (DBMS) that stores data in columns rather than rows. This speeds up the time required to return a particular query and greatly improves disk I/O performance, which is helpful in data analytics and data warehousing. The major motive of a columnar database is to read and write data effectively. Some examples of columnar databases are MonetDB, Apache Cassandra, SAP HANA, and Amazon Redshift.

Columnar Database VS Row Database:

Both columnar and row databases are used for big data analytics and data warehousing, but their approaches differ.
For example, given the record "Customer 1: Name, Address, Location":
 Row database: the fields of each new record are stored together in one long row.
 Columnar database: each field has its own set of columns, stored alongside the same field from all other records.
Example:
Here is an example of a simple database table with four columns and three rows.

ID Number   Last Name   First Name   Bonus
534782      Miller      Ginny        6000
585523      Parker      Peter        8000
479148      Stacy       Gwen         2000

In a Columnar DBMS, the data stored is in this format:


534782, 585523, 479148; Miller, Parker, Stacy; Ginny, Peter, Gwen; 6000, 8000, 2000.
In a Row-oriented DBMS, the data stored is in this format:
534782, Miller, Ginny, 6000; 585523, Parker, Peter, 8000; 479148, Stacy, Gwen, 2000.
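The difference between the two layouts can be sketched in plain Python. This is a toy illustration (lists standing in for disk pages, not a real DBMS); the point is that a query touching one field scans a single contiguous column in the columnar layout, but must pick one field out of every record in the row layout.

```python
# Sketch: the same three records from the example table, in both layouts.

# Row-oriented: each record stored together.
rows = [
    (534782, "Miller", "Ginny", 6000),
    (585523, "Parker", "Peter", 8000),
    (479148, "Stacy", "Gwen", 2000),
]

# Column-oriented: each field stored together.
columns = {
    "id":    [534782, 585523, 479148],
    "last":  ["Miller", "Parker", "Stacy"],
    "first": ["Ginny", "Peter", "Gwen"],
    "bonus": [6000, 8000, 2000],
}

# Average bonus in the row store: touch every record, extract one field each.
avg_row = sum(r[3] for r in rows) / len(rows)

# Average bonus in the column store: read a single contiguous column.
avg_col = sum(columns["bonus"]) / len(columns["bonus"])

print(avg_row, avg_col)  # same value either way; the I/O pattern differs
```

Both computations give the same answer; the win for the columnar layout is that the other three columns are never read at all.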

When to use the Columnar Database:

1. Queries that involve only a few columns.
2. When column-wise compression is desired.
3. Clustering queries against a huge amount of data.

Advantages of Columnar Database:

1. Columnar databases can be used for different tasks; in particular, when applications related to
big data come into play, column-oriented databases receive greater attention.
2. The data in a columnar database is highly compressible, and operations like AVG, MIN, and MAX
can be performed directly on the compressed data.
3. Efficiency and Speed: Analytical queries run faster in columnar databases.
4. Self-indexing: Another benefit of a column-based DBMS is self-indexing, which uses less disk space
than a relational database management system containing the same data.
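Advantage 2 above can be made concrete with run-length encoding (RLE), one common compression scheme for sorted columns. The sketch below is illustrative (function names are made up); it shows that MIN, MAX, and AVG can be computed on the compressed runs without decompressing the column.

```python
# Sketch: run-length encoding of a sorted column, with aggregates computed
# directly on the compressed representation.

def rle_encode(column):
    """Turn [2000, 2000, 2000, 6000] into [(2000, 3), (6000, 1)]."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs

def rle_avg(runs):
    # AVG over the compressed form: weight each run by its length.
    total = sum(value * count for value, count in runs)
    n = sum(count for _, count in runs)
    return total / n

bonus_column = [2000, 2000, 2000, 6000, 6000, 8000]
runs = rle_encode(bonus_column)
print(runs)  # [(2000, 3), (6000, 2), (8000, 1)] - 6 values stored as 3 runs
print(min(v for v, _ in runs), max(v for v, _ in runs), rle_avg(runs))
```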

Limitation of Columnar Database:

1. For loading incremental data, traditional row-oriented databases are more suitable than
column-oriented databases.
2. For Online Transaction Processing (OLTP) applications, row-oriented databases are more appropriate
than columnar databases.

SCHEMALESS DATABASE:
Traditional relational databases are well-defined, using a schema to describe every functional
element, including tables, rows, views, indexes, and relationships. By exerting a high degree of control,
the database administrator can improve performance and prevent capture of low-quality, incomplete, or
malformed data. In a SQL database, the schema is enforced by the Relational Database Management
System (RDBMS) whenever data is written to disk.
But in order to work, data needs to be heavily formatted and shaped to fit into the table structure.
This means sacrificing any undefined details during the save, or storing valuable information outside the
database entirely.
A schemaless database, like MongoDB, does not have these up-front constraints, mapping to a
more ‘natural’ database. Even when sitting on top of a data lake, each document is created with a partial
schema to aid retrieval. Any formal schema is applied in the code of your applications; this layer of
abstraction protects the raw data in the NoSQL database and allows for rapid transformation as your
needs change.
Any data, formatted or not, can be stored in a non-tabular NoSQL type of database. At the same
time, using the right tools in the form of a schemaless database can unlock the value of all of your
structured and unstructured data types.

How does a schemaless database work?


In schemaless databases, information is stored in JSON-style documents which can have varying sets of
fields with different data types for each field. So, a collection could look like this:
{
  name: "Joe", age: 30, interests: "football"
}
{
  name: "Kate", age: 25
}

As you can see, the data itself normally has a fairly consistent structure. With the schemaless MongoDB
database, there is some additional structure — the system namespace contains an explicit list of
collections and indexes. Collections may be implicitly or explicitly created — indexes must be explicitly
declared.
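A minimal sketch of the same idea in Python: a "collection" is just a list of dicts, documents may carry different fields, and any schema lives in application code. The helper function here is hypothetical, not part of any database driver.

```python
# Sketch: a schemaless "collection" as a list of dicts. No schema is enforced
# on write; each document may have its own set of fields.

collection = [
    {"name": "Joe",  "age": 30, "interests": ["football"]},
    {"name": "Kate", "age": 25},   # no "interests" field, and that's fine
]

# The "schema" is applied in application code: this query tolerates the
# missing field instead of requiring it up front.
def find_with_interest(docs, interest):
    return [d["name"] for d in docs if interest in d.get("interests", [])]

print(find_with_interest(collection, "football"))
```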

What are the benefits of using a schemaless database?

 Greater flexibility over data types

 No pre-defined database schemas

 No data truncation

 Suitable for real-time analytics functions

 Enhanced scalability and flexibility


Differences between Views and Materialized Views in SQL
Views:
A View is a virtual relation that acts as an actual relation. It is not a part of the logical
relational model of the database system. Tuples of the view are not stored in the database system;
they are generated every time the view is accessed. The query expression of the view is stored in the
database system.
Views can be used wherever we can use the actual relation. Views can be used to create custom
virtual relations according to the needs of a specific user. We can create as many views as we want in a
database system.
Materialized Views:
When the results of a view expression are stored in a database system, they are called materialized
views. SQL does not provide any standard way of defining a materialized view; however, some database
management systems provide custom extensions to use materialized views. The process of keeping the
materialized views updated is known as view maintenance.
A database system uses one of three ways to keep the materialized view updated:
 Update the materialized view as soon as the relation on which it is defined is updated.
 Update the materialized view every time the view is accessed.
 Update the materialized view periodically.
A materialized view is useful when the view is accessed frequently, as it saves computation time
because the results are stored in the database beforehand. A materialized view can also be helpful where
the relation on which the view is defined is very large and the resulting relation of the view is very small.
A materialized view has storage costs and update overheads associated with it.
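The distinction can be simulated in a few lines of Python. This is only a sketch (the relation and function names are made up): the "view" re-runs its query on every access, while the "materialized view" stores a result that goes stale until it is refreshed.

```python
# Sketch: a view recomputes on access; a materialized view stores the result
# and must be refreshed when the base relation changes (view maintenance).

employees = [("Ginny", 6000), ("Peter", 8000), ("Gwen", 2000)]

def high_bonus_view():
    """A 'view': just the stored query expression, re-evaluated on access."""
    return [name for name, bonus in employees if bonus >= 6000]

# A 'materialized view': the result tuples stored up front (storage cost).
high_bonus_materialized = high_bonus_view()

# The base relation changes...
employees.append(("Miles", 9000))

print(high_bonus_view())        # always current: recomputed on access
print(high_bonus_materialized)  # stale: missing 'Miles' until refreshed

# View maintenance: refresh the materialized copy (update cost).
high_bonus_materialized = high_bonus_view()
print(high_bonus_materialized)
```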
Differences between Views and Materialized Views:

Views:
 The query expression is stored in the database system, not the resulting tuples of the query
expression.
 Views need not be updated every time the relation on which the view is defined is updated, as the
tuples of the view are computed every time the view is accessed.
 A view does not have any storage cost associated with it.
 A view does not have any update cost associated with it.
 There is an SQL standard for defining a view.
 Views are useful when the view is accessed only occasionally.

Materialized Views:
 The resulting tuples of the query expression are stored in the database system.
 Materialized views are updated as the tuples are stored in the database system; they can be updated
in one of the three ways mentioned above, depending on the database system.
 A materialized view has a storage cost associated with it.
 A materialized view has an update cost associated with it.
 There is no SQL standard for defining a materialized view; the functionality is provided by some
database systems as an extension.
 Materialized views are efficient when the view is accessed frequently, as they save computation
time by storing the results beforehand.

How Do NoSQL Systems Handle Big Data Problems?


Datasets that are difficult to store and analyze with any conventional database software tool are
referred to as big data. Due to the growth of data, a question arises from recent trends in the IT field:
how can the data be processed effectively? A requirement for ideas, techniques, tools, and technologies
has been set for handling and transforming large amounts of data into business value and knowledge.
The major features of NoSQL solutions that help us handle a large amount of data are stated below.
NoSQL databases that are best for big data are:
 MongoDB
 Cassandra
 CouchDB
 Neo4j

Different ways to handle Big Data problems:

1. Queries should be moved to the data rather than moving data to the queries.
2. Hash rings should be used for even distribution of data.
3. For scaling read requests, replication should be used.
4. Distribution of queries to nodes should be done by the database.
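Point 2 above, the hash ring, can be sketched in a few lines of Python. This is a minimal, assumption-laden illustration (node and key names are invented; real systems add virtual nodes for smoother balance): keys and nodes are hashed onto the same circular space, and a key belongs to the first node clockwise from its hash, so adding or removing a node only moves the keys in one arc.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Any stable hash works; md5 is used here purely for illustration.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Place every node on the ring, sorted by its hash position.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # A key belongs to the first node clockwise from its hash.
        h = ring_hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-A", "node-B", "node-C"])
placement = {k: ring.node_for(k) for k in ("user:1", "user:2", "user:3", "user:4")}
print(placement)  # each key deterministically lands on one of the three nodes
```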

Distribution Models

The primary driver of interest in NoSQL has been its ability to run databases on a large cluster. As data
volumes increase, it becomes more difficult and expensive to scale up (buy a bigger server to run the
database on). A more appealing option is to scale out (run the database on a cluster of servers). Aggregate
orientation fits well with scaling out because the aggregate is a natural unit to use for distribution.
Depending on your distribution model, you can get a data store that will give you the ability to handle
larger quantities of data, the ability to process greater read or write traffic, or more availability in the face
of network slowdowns or breakages.
Broadly, there are two paths to data distribution: replication and sharding. Replication takes the
same data and copies it over multiple nodes. Sharding puts different data on different nodes. Replication
and sharding are orthogonal techniques: you can use either or both of them. Replication comes in two
forms: master-slave and peer-to-peer. We will now discuss these techniques starting at the simplest and
working up to the more complex: first single-server, then master-slave replication, then sharding, and finally
peer-to-peer replication.
1.1 Single Server
The first and the simplest distribution option is the one we would most often recommend: no distribution at
all. Run the database on a single machine that handles all the reads and writes to the data store. We prefer
this option because it eliminates all the complexities that the other options introduce; it's easy for
operations people to manage and easy for application developers to reason about.
Although a lot of NoSQL databases are designed around the idea of running on a cluster, it can make
sense to use NoSQL with a single-server distribution model if the data model of the NoSQL store is more
suited to the application. Graph databases are the obvious category here; these work best in a single-server
configuration. If your data usage is mostly about processing aggregates, then a single-server document
or key-value store may well be worthwhile because it's easier on application developers. For the rest of
this chapter we'll be wading through the advantages and complications of more sophisticated distribution
schemes. Don't let the volume of words fool you into thinking that we would prefer these options. If we
can get away without distributing our data, we will always choose a single-server approach.
Sharding
Often, a busy data store is busy because different people are accessing different parts of the dataset. In
these circumstances we can support horizontal scalability by putting different parts of the data
onto different servers, a technique that's called sharding (Figure 1.1).

Figure 1.1. Sharding puts different data on separate nodes, each of which does its own reads
andwrites.
In the ideal case, we have different users all talking to different server nodes. Each user only has to talk
to one server, so gets rapid responses from that server. The load is balanced out nicely between
servers; for example, if we have ten servers, each one only has to handle 10% of the load.
Of course the ideal case is a pretty rare beast. In order to get close to it we have to ensure that data that's
accessed together is clumped together on the same node and that these clumps are arranged on the nodes
to provide the best data access.
The first part of this question is how to clump the data up so that one user mostly gets her data from a single
server. This is where aggregate orientation comes in really handy. The whole point
of aggregates is that we design them to combine data that's commonly accessed together, so aggregates
leap out as an obvious unit of distribution.
Another factor is trying to keep the load even. This means that you should try to arrange aggregates so
they are evenly distributed across the nodes, which all get equal amounts of the load. This may vary over
time; for example, some data tends to be accessed on certain days of the week, so there may be
domain-specific rules you would like to use.
Pros: It can improve both reads and writes.
Cons: Clusters use many less reliable machines, so resilience decreases; a node failure makes that
shard's data unavailable.

Master-Slave Replication
With master-slave distribution, you replicate data across multiple nodes. One node is designated as the
master, or primary. This master is the authoritative source for the data and is usually responsible for
processing any updates to that data. The other nodes are slaves, or secondaries. A
replication process synchronizes the slaves with the master (Figure 1.2).

Master-slave replication is most helpful for scaling when you have a read-intensive dataset. You can
scale horizontally to handle more read requests by adding more slave nodes and ensuring that all read
requests are routed to the slaves. You are still, however, limited by the ability of the master to process
updates and its ability to pass those updates on. Consequently it isn't such a good scheme for datasets with
heavy write traffic, although offloading the read traffic will help a bit with handling the write load.
A second advantage of master-slave replication is read resilience: should the master fail, the slaves can
still handle read requests. Again, this is useful if most of your data access is reads. The failure of the
master does eliminate the ability to handle writes until either the master is restored or a new master is
appointed. However, having slaves as replicas of the master does speed up recovery after a failure of
the master, since a slave can be appointed as the new master very quickly.
Replication comes with some alluring benefits, but it also comes with an inevitable dark side:
inconsistency. You have the danger that different clients, reading different slaves, will see different values
because the changes haven't all propagated to the slaves. In the worst case, that can mean that a client cannot
read a write it just made. Even if you use master-slave replication just for hot backup this can be a concern,
because if the master fails, any updates not passed on to the backup are lost.
Pros:
More read requests can be handled
Add more slave nodes
Ensure that all read requests are routed to the slaves
Cons:
The master is a bottleneck
Limited by its ability to process updates and to pass those updates on
Its failure eliminates the ability to handle writes until a new master is appointed
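The replication lag that causes the inconsistency described above can be shown in a toy simulation. This is only a sketch under simplifying assumptions (class names invented, replication modeled as an explicit batch step): a write reaches the master immediately, but a slave read returns a stale value until the replication process runs.

```python
# Sketch: master-slave replication with asynchronous propagation.

class Slave:
    def __init__(self):
        self.data = {}

class Master:
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves
        self.log = []  # updates not yet shipped to the slaves

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def replicate(self):
        """The replication process: ship pending updates to every slave."""
        for key, value in self.log:
            for slave in self.slaves:
                slave.data[key] = value
        self.log.clear()

slaves = [Slave(), Slave()]
master = Master(slaves)

master.write("balance", 100)
print(slaves[0].data.get("balance"))  # None - not yet propagated (stale read)

master.replicate()
print(slaves[0].data.get("balance"))  # 100 - slaves have caught up
```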

Peer-to-Peer Replication
Master-slave replication helps with read scalability but doesn't help with scalability of writes. It
provides resilience against failure of a slave, but not of a master. Essentially, the master is still a
bottleneck and a single point of failure. Peer-to-peer replication (Figure 1.3) attacks these problems by
not having a master. All the replicas have equal weight, they can all accept writes, and the loss of any
of them doesn't prevent access to the data store.

Figure 1.3. Peer-to-peer replication has all nodes applying reads and writes to all the data.
Pros:
You can ride over node failures without losing access to data
You can easily add nodes to improve your performance
Cons:
Inconsistency
Slow propagation of changes to copies on different nodes
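The inconsistency and slow propagation above can be illustrated with a toy peer model. This is a sketch under stated assumptions: a simple per-key version counter decides which value is newer during sync; real systems use more elaborate schemes (vector clocks, timestamps), and the class and method names here are invented.

```python
# Sketch: peer-to-peer replicas that all accept writes, reconciling to the
# newest version of each key when they sync with one another.

class Peer:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def write(self, key, value):
        version, _ = self.store.get(key, (0, None))
        self.store[key] = (version + 1, value)

    def sync_from(self, other):
        """Pull any newer versions from another peer."""
        for key, (ver, val) in other.store.items():
            if ver > self.store.get(key, (0, None))[0]:
                self.store[key] = (ver, val)

a, b, c = Peer(), Peer(), Peer()
a.write("cart", ["book"])      # the write is accepted at peer a alone
print(b.store.get("cart"))     # None - b hasn't synced yet (inconsistency)

b.sync_from(a)                 # changes propagate peer to peer...
c.sync_from(b)                 # ...eventually reaching every copy
print(c.store["cart"])
```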
1.5 Combining Sharding and Replication
Replication and sharding are strategies that can be combined. If we use both master-slave
replication and sharding (see Figure 1.4), this means that we have multiple masters, but each data item only
has a single master. Depending on your configuration, you may choose a node to be a master for some
data and a slave for others, or you may dedicate nodes to master or slave duties.

Using peer-to-peer replication and sharding is a common strategy for column-family databases. In a scenario
like this you might have tens or hundreds of nodes in a cluster with data sharded over them.
A good starting point for peer-to-peer replication is to have a replication factor of 3, so each shard
is present on three nodes. Should a node fail, then the shards on that node will be rebuilt on the other
nodes.

Figure 1.5. Using peer-to-peer replication together with sharding


Key Points
• There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each server acts as the single source
for a subset of data.
• Replication copies data across multiple servers, so each bit of data can be found in multiple places.
A system may use either or both techniques.
Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that handles writes while slaves
synchronize with the master and may handle reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their
copies of the data.
Master-slave replication reduces the chance of update conflicts but peer-to-peer replication avoids
loading all writes onto a single point of failure.

CONSISTENCY:
VARIOUS FORMS OF CONSISTENCY:

VERSION STAMP:

RELAXING CONSISTENCY
Apache Cassandra (NOSQL database)
Apache Cassandra: Apache Cassandra is an open-source NoSQL database that is used for handling big
data. Apache Cassandra has the capability to handle structured, semi-structured, and unstructured data.
Apache Cassandra was originally developed at Facebook, was open-sourced in 2008, and became one of
the top-level Apache projects in 2010.
Features of Cassandra:
1. It is scalable.
2. It is flexible (can accept structured, semi-structured and unstructured data).
3. It has transaction support as it follows ACID properties.
4. It is highly available and fault tolerant.
5. It is open source.

Figure-1: Masterless ring architecture of Cassandra


Apache Cassandra is a highly scalable, distributed database that strictly follows the principle of the
CAP (Consistency, Availability and Partition tolerance) Theorem.

Figure-2: CAP Theorem


In Apache Cassandra, there is no master-client architecture. It has a peer-to-peer architecture. In Apache
Cassandra, we can create multiple copies of data at the time of keyspace creation. We can simply define
replication strategy and RF (Replication Factor) to create multiple copies of data. Example:
CREATE KEYSPACE Example
WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': '3'};
In this example, we define the RF (Replication Factor) as 3, which simply means that we are creating
3 copies of the data across multiple nodes in a clockwise direction.

Figure-3: RF = 3
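The clockwise placement shown in Figure-3 can be sketched as follows. This is a simplified, SimpleStrategy-like illustration (node names and token values are made up; real Cassandra placement also accounts for racks and data centers): the replica set for a key is the node owning the key's token plus the next RF-1 nodes clockwise around the ring.

```python
# Sketch: RF = 3 placement on a token ring. Each tuple is (token, node),
# sorted by token; values are invented for illustration.
ring = [(0, "N1"), (25, "N2"), (50, "N3"), (75, "N4")]

def replicas(token, rf=3):
    # The owner is the first node whose token is >= the key's token
    # (wrapping around to the lowest token if none is).
    idx = next((i for i, (t, _) in enumerate(ring) if t >= token), 0)
    # Replicas: the owner plus the next rf-1 nodes clockwise.
    return [ring[(idx + k) % len(ring)][1] for k in range(rf)]

print(replicas(30))  # owner N3, then its two clockwise neighbours
print(replicas(80))  # wraps past the end of the ring back to N1
```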
cqlsh (CQL shell) is a command-line shell for interacting with Cassandra through CQL (Cassandra
Query Language). CQL queries for basic operations:
Step1: To create keyspace use the following CQL query.
CREATE KEYSPACE Emp
WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'};
Step2: CQL query for using keyspace
Syntax:
USE keyspace-name
USE Emp;
Step-3: To create a table use the following CQL query.
Example:
CREATE TABLE Emp_table (
name text PRIMARY KEY,
Emp_id int,
Emp_city text,
Emp_email text
);
Step-4: To insert into Emp_table use the following CQL query.
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('ashish', 1001, 'Delhi', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('Ashish Gupta', 1001, 'Bangalore', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('amit ', 1002, 'noida', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('dhruv', 1003, 'pune', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('shivang', 1004, 'mumbai', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('aayush', 1005, 'gurugram', '[email protected]');
Insert into Emp_table(name, Emp_id, Emp_city, Emp_email)
VALUES ('bhagyesh', 1006, 'chandigar', '[email protected]');
Step-5: To read data use the following CQL query.
SELECT * FROM Emp_table;
Introduction to Cassandra
Cassandra is an open-source distributed database management system: a wide-column store
NoSQL database designed to handle large amounts of data across many commodity servers, providing
high availability with no single point of failure. It is written in Java and developed by the Apache Software
Foundation.
Avinash Lakshman and Prashant Malik initially developed Cassandra at Facebook to power the
Facebook inbox search feature. Facebook released Cassandra as an open-source project on Google Code
in July 2008. In March 2009 it became an Apache Incubator project, and in February 2010 it became a
top-level project. Due to its outstanding technical features, Cassandra has become very popular.

Apache Cassandra is used to manage very large amounts of structured data spread out across the world.
It provides a highly available service with no single point of failure. Listed below are some points about
Apache Cassandra:
 It is scalable, fault-tolerant, and consistent.
 It is a column-oriented database.
 Its distributed design is based on Amazon's Dynamo and its data model on Google's Bigtable.
 It was created at Facebook and differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a
more powerful "column family" data model. Cassandra is used by some of the biggest companies,
such as Facebook, Twitter, Cisco, Rackspace, eBay, and Netflix.
The design goal of Cassandra is to handle big data workloads across multiple nodes without any
single point of failure. Cassandra has a peer-to-peer distributed system across its nodes, and data is
distributed among all the nodes of the cluster.
All the nodes of Cassandra in a cluster play the same role. Each node is independent, at the same time
interconnected to other nodes. Each node in a cluster can accept read and write requests, regardless of
where the data is actually located in the cluster. When a node goes down, read/write request can be
served from other nodes in the network.
Features of Cassandra:
Cassandra has become popular because of its technical features. Here are some of the features of
Cassandra:
1. Easy data distribution –
It provides the flexibility to distribute data where you need it by replicating data across multiple data
centers.
For example:
If there are 5 nodes, say N1, N2, N3, N4, N5, then using a partitioning algorithm we decide
the token range and distribute data accordingly. Each node has a specific token range in which data
will be distributed. Let us look at the diagram for a better understanding.

Ring structure with token range.


2. Flexible data storage –
Cassandra accommodates all possible data formats, including structured, semi-structured, and
unstructured. It can dynamically accommodate changes to your data structures according to your
need.
3. Elastic scalability –
Cassandra is highly scalable and allows you to add more hardware to accommodate more customers and
more data as per requirement.
4. Fast writes –
Cassandra was designed to run on cheap commodity hardware. Cassandra performs blazingly fast
writes and can store hundreds of terabytes of data, without sacrificing the read efficiency.
5. Always on Architecture –
Cassandra has no single point of failure and it is continuously available for business-critical
applications that can’t afford a failure.
6. Fast linear-scale performance –
Cassandra is linearly scalable therefore it increases your throughput as you increase the number of
nodes in the cluster. It maintains a quick response time.
7. Transaction support –
Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID)
properties of transactions.
Cassandra - Architecture

The design goal of Cassandra is to handle big data workloads across multiple nodes without any single
point of failure. Cassandra has peer-to-peer distributed system across its nodes, and data is distributed
among all the nodes in a cluster.

 All the nodes in a cluster play the same role. Each node is independent and at the same time
interconnected to other nodes.
 Each node in a cluster can accept read and write requests, regardless of where the data is actually
located in the cluster.
 When a node goes down, read/write requests can be served from other nodes in the network.
Data Replication in Cassandra

In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it is
detected that some of the nodes responded with an out-of-date value, Cassandra will return the most
recent value to the client. After returning the most recent value, Cassandra performs a read repair in the
background to update the stale values.

The following figure shows a schematic view of how Cassandra uses data replication among the nodes in
a cluster to ensure no single point of failure.

Note − Cassandra uses the Gossip Protocol in the background to allow the nodes to communicate with
each other and detect any faulty nodes in the cluster.

Components of Cassandra

The key components of Cassandra are as follows −

 Node − It is the place where data is stored.


 Data center − It is a collection of related nodes.
 Cluster − A cluster is a component that contains one or more data centers.
 Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write
operation is written to the commit log.
 Mem-table − A mem-table is a memory-resident data structure. After commit log, the data will
be written to the mem-table. Sometimes, for a single-column family, there will be multiple mem-
tables.
 SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach
a threshold value.
 Bloom filter − These are nothing but quick, nondeterministic, algorithms for testing whether an
element is a member of a set. It is a special kind of cache. Bloom filters are accessed after every
query.
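The Bloom filter described above can be sketched in miniature. This is a toy illustration (tiny bit array, md5-based hashing chosen purely for convenience, class name invented): the filter can answer "definitely not present" or "possibly present", which is exactly what is needed to skip SSTables that cannot contain a key.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size = size      # number of bits (toy value)
        self.hashes = hashes  # number of hash functions
        self.bits = 0         # the bit array, packed into one int

    def _positions(self, key):
        # Derive several bit positions for a key by salting one hash.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # All bits set -> possibly present; any bit clear -> definitely absent.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True - never a false negative
```

A "True" answer may occasionally be a false positive (an unnecessary SSTable read), but a "False" answer is always correct, so the filter safely prunes lookups.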
Cassandra Query Language

Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the
database (Keyspace) as a container of tables. Programmers use cqlsh: a prompt to work with CQL or
separate application language drivers.

Clients approach any of the nodes for their read-write operations. That node (the coordinator) acts as a
proxy between the client and the nodes holding the data.

Write Operations

Every write activity of the nodes is captured by the commit logs written on the nodes. Later the data will
be captured and stored in the mem-table. Whenever the mem-table is full, data will be written into
the SSTable data file. All writes are automatically partitioned and replicated throughout the cluster.
Cassandra periodically consolidates the SSTables, discarding unnecessary data.

 Step-1:
In the Write Operation, as soon as a request is received, it is first dumped into the commit log to
make sure that the data is saved.
 Step-2:
The data is then inserted into the table and also written to the MemTable, which holds the data until
it gets full.
 Step-3:
If the MemTable reaches its threshold, the data is flushed to an SSTable.

Figure – Write Operation in Cassandra
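The three steps of the write path can be sketched as follows. This is a deliberately simplified model (toy threshold, plain Python structures standing in for the real on-disk formats): append to the commit log first, then update the memtable, and flush the memtable to an immutable "SSTable" once it reaches its threshold.

```python
# Sketch: Cassandra's write path in miniature.

MEMTABLE_THRESHOLD = 3  # toy value; real thresholds are size-based

commit_log = []  # durable, append-only record of every write (crash recovery)
memtable = {}    # in-memory structure, sorted when flushed
sstables = []    # immutable on-disk files produced by flushes

def write(key, value):
    commit_log.append((key, value))          # Step 1: commit log
    memtable[key] = value                    # Step 2: memtable
    if len(memtable) >= MEMTABLE_THRESHOLD:  # Step 3: flush to an SSTable
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

for i in range(4):
    write(f"row{i}", i)

# Four writes: all four hit the commit log, the first three were flushed
# into one SSTable, and the fourth still sits in the memtable.
print(len(commit_log), len(sstables), memtable)
```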

Read Operations

During read operations, Cassandra gets values from the mem-table and checks the bloom filter to find the
appropriate SSTable that holds the required data.

In a Read Operation there are three types of read requests that a coordinator can send to a replica. The
node that receives the request is called the coordinator for that particular operation.
 Step-1: Direct Request:
In this operation the coordinator node sends the read request to one of the replicas.
 Step-2: Digest Request:
In this operation the coordinator will contact the replicas specified by the consistency level. For
example: CONSISTENCY TWO simply means that any two nodes in the data center will
acknowledge.
 Step-3: Read Repair Request:
If there is any case in which data is not consistent across the nodes, then a background Read Repair
Request is initiated that makes sure that the most recent data is available across the nodes.
Storage Engine:
1. Commit log:
The commit log is the first entry point when writing to disk or the MemTable. The purpose of the
commit log in Apache Cassandra is to recover from sync issues if a data node goes down.
2. Mem-table:
After the data is written to the commit log, it is written to the Mem-table, where it is held
temporarily.
3. SSTable:
Once the Mem-table reaches a certain threshold, the data is flushed to the SSTable disk file.

Application of Apache Cassandra:


Some of the application use cases that Cassandra excels in include:
 Real-time, big data workloads
 Time series data management
 High-velocity device data consumption and analysis
 Media streaming management (e.g., music, movies)
 Social media (i.e., unstructured data) input and analysis
 Online web retail (e.g., shopping carts, user transactions)
 Real-time data analytics
 Online gaming (e.g., real-time messaging)
 Software as a Service (SaaS) applications that utilize web services
 Online portals (e.g., healthcare provider/patient interactions)
 Most write-intensive systems

Pre-defined data type in Apache Cassandra



Built-in Data Type:


It is a pre-defined data type in Cassandra. We can use it directly by simply giving the data type name as
needed. There are many built-in data types in Cassandra; let us discuss them one by one.
 Boolean:
It is a data type that represents two values, true or false. So, we can use such type of data type
where we need just two values.
true or false
Syntax:
CREATE TABLE table_name(
field_name1 Boolean,
...
);
 blob:
It is used for binary large objects such that audio, video or other multimedia and sometimes binary
executable code stored as a blob.
binary large objects
Syntax:
CREATE TABLE table_name(
field_name1 blob,
...
);
 ASCII:
It is used for string types such as words and sentences, restricted to US-ASCII characters. Each
character is stored by its ASCII value; for example, 'A' is stored as 65.
65 for A, 97 for a, etc.
Syntax:
CREATE TABLE table_name(
field_name1 ascii,
...
);
 bigint:
It is used for a 64-bit signed long integer. Basically it is used for a high range of integers,
representing values from -(2^63) to +(2^63)-1.
only for integers: -(2^63) to +(2^63)-1
Syntax:
CREATE TABLE table_name(
field_name1 bigint,
...
);
 counter:
It is used for integers and represents a counter column: a 64-bit signed value that can only be
incremented or decremented, useful for keeping counts such as the number of page views.
1, 2, 3... (integer)
Syntax:
CREATE TABLE table_name(
field_name1 counter,
...
);
 Decimal:
It is used to save integer and float values. With the Decimal data type it is important to note that when
we try to save a decimal value such as .907 (dot 907), it will give an error: "no viable alternative at
input '.' (…DecimalValue) Values ( 1, [.]…)". If we need to save such a decimal, start it with a zero,
e.g. 0.907.
10.45, 1, -1, 0.32...
Syntax:
CREATE TABLE table_name(
field_name1 DECIMAL,
...
);
 Double:
It is used for numbers and represents a 64-bit floating point value, i.e. a number with a decimal point, for example 5.838, 10.45 etc.
10.4556, 3.566, 0.5875 etc. [64-bit floating point]
Syntax:
CREATE TABLE table_name(
field_name1 double,
...
);
 float:
It is used for numbers and represents a 32-bit floating point value, i.e. a number with a decimal point, for example 6.254, 5.23 etc.
5.423, 2.31, 3.12... [32-bit floating point]
Syntax:
CREATE TABLE table_name(
field_name1 float,
...
);
 inet:
It is used to represent an IP address, either IPv4 or IPv6. For example, for an address such as 64.233.160.0 we can use the inet data type.
64.233.160.0 ... [IP address, IPv4 or IPv6]
Syntax:
CREATE TABLE table_name(
field_name1 inet,
...
);
 int:
It is used to represent 32-bit signed integers, both positive and negative. The range of int is -(2^31) to +(2^31 - 1) [only integers].
24, 907, -9, ...
[Represents a 32-bit signed int, -(2^31) to +(2^31 - 1), only integers]
Syntax:
CREATE TABLE table_name(
field_name1 int,
...
);
 text:
It is used to store strings and represents a UTF-8 encoded string. UTF-8 encodes all valid Unicode code points using one to four bytes each.
Ashish, rana, ... [Represents a UTF-8 encoded string]
Syntax:
CREATE TABLE table_name(
field_name1 text,
...
);
 varchar:
It is used to store an arbitrary string and represents a UTF-8 encoded string; it is an alias of the text type.
Ashish, rana, a$#34, A67dgg...
[arbitrary string, represents a UTF-8 encoded string]
Syntax:
CREATE TABLE table_name(
field_name1 varchar,
...
);
 timestamp:
It is used to represent a timestamp, which is very helpful for storing an exact date-and-time value. For example, if we want to store 15 Dec 1995 at 4:00 AM, the timestamp would be:
1995-12-15 04:00 +0530
Here +0530 is the UTC offset for India (IST).
[formats: yyyy-mm-dd HH:mm or yyyy-mm-dd HH:mm:ss]
Syntax:
CREATE TABLE table_name(
field_name1 timestamp,
...
);
 varint:
It is used to represent an arbitrary-precision integer such as 124, 24, 1, 5468 etc. (Note: the type name is varint, not "variant".)
1, 24, 07, 897, 4568, etc.
Syntax:
CREATE TABLE table_name(
field_name1 varint,
...
);
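To tie the types above together, here is a hypothetical table sketch (the table and column names are illustrative only) that uses most of the built-in types in one definition:

```sql
-- Illustrative only: one column per built-in type discussed above.
CREATE TABLE user_profile (
    user_id uuid PRIMARY KEY,  -- unique row identifier
    name text,                 -- UTF-8 string
    nickname ascii,            -- US-ASCII string
    is_active boolean,         -- true / false
    age int,                   -- 32-bit signed integer
    followers bigint,          -- 64-bit signed integer
    rating float,              -- 32-bit floating point
    score double,              -- 64-bit floating point
    balance decimal,           -- variable-precision decimal
    total_logins varint,       -- arbitrary-precision integer
    last_login timestamp,      -- date and time
    last_ip inet,              -- IPv4 or IPv6 address
    avatar blob                -- binary data
);
```

(Counters are omitted here because, as noted above, counter columns cannot be mixed with regular non-key columns in the same table.)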
Data model of Apache Cassandra

The data model of Cassandra is significantly different from what we normally see in an RDBMS. This
chapter provides an overview of how Cassandra stores its data.
Cluster: The Cassandra database is distributed over several machines that operate together. The outermost container is known as the Cluster. For failure handling, data is replicated across nodes, and in case of a failure, a replica takes charge. Cassandra arranges the nodes in a cluster in a ring format and assigns data to them.

Keyspace:Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace
in Cassandra are −
 Replication factor − It is the number of machines in the cluster that will receive copies of the
same data.
 Replica placement strategy − It is nothing but the strategy to place replicas in the ring. We have strategies such as simple strategy (rack-unaware strategy), old network topology strategy (rack-aware strategy), and network topology strategy (datacenter-aware strategy).
 Column families − Keyspace is a container for a list of one or more column families. A column
family, in turn, is a container of a collection of rows. Each row contains ordered columns.
Column families represent the structure of your data. Each keyspace has at least one and often
many column families.

The syntax of creating a Keyspace is as follows −

CREATE KEYSPACE keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
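For clusters spanning multiple datacenters, NetworkTopologyStrategy is typically used instead of SimpleStrategy; a hypothetical example (the keyspace and datacenter names are illustrative):

```sql
-- Keep 3 replicas in DC1 and 2 replicas in DC2 (names are illustrative).
CREATE KEYSPACE my_keyspace
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3,
    'DC2': 2
};
```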

The following illustration shows a schematic view of a Keyspace.

Column Family

A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered
collection of columns. The following table lists the points that differentiate a column family from a table
of relational databases.

Relational Table vs. Cassandra Column Family:

 A schema in a relational model is fixed: once we define certain columns for a table, every row inserted must fill all the columns, at least with a null value. In Cassandra, although the column families are defined, the columns are not; you can freely add any column to any column family at any time.
 Relational tables define only columns, and the user fills in the table with values. In Cassandra, a table contains columns, or can be defined as a super column family.

A Cassandra column family has the following attributes −

 keys_cached − It represents the number of locations to keep cached per SSTable.


 rows_cached − It represents the number of rows whose entire contents will be cached in memory.
 preload_row_cache − It specifies whether you want to pre-populate the row cache.

Note − Unlike relational tables, where the schema is fixed, Cassandra does not force individual rows to have all the columns.

The following figure shows an example of a Cassandra column family.

Column

A column is the basic data structure of Cassandra with three values, namely key or column name, value,
and a time stamp. Given below is the structure of a column.

SuperColumn
A super column is a special column, therefore, it is also a key-value pair. But a super column stores a
map of sub-columns.
Generally, column families are stored on disk in individual files. Therefore, to optimize performance, it is important to keep columns that you are likely to query together in the same column family, and a super column can be helpful here. Given below is the structure of a super column.

Data Models of Cassandra and RDBMS

The following table lists down the points that differentiate the data model of Cassandra from that of an
RDBMS.

RDBMS vs. Cassandra:

 RDBMS deals with structured data. Cassandra deals with unstructured data.
 RDBMS has a fixed schema. Cassandra has a flexible schema.
 In an RDBMS, a table is an array of arrays (ROW x COLUMN). In Cassandra, a table is a list of "nested key-value pairs" (ROW x COLUMN key x COLUMN value).
 In an RDBMS, the database is the outermost container that contains data corresponding to an application. In Cassandra, the keyspace is the outermost container that contains data corresponding to an application.
 In an RDBMS, tables are the entities of a database. In Cassandra, tables or column families are the entities of a keyspace.
 In an RDBMS, a row is an individual record. In Cassandra, a row is a unit of replication.
 In an RDBMS, a column represents an attribute of a relation. In Cassandra, a column is a unit of storage.
 RDBMS supports the concepts of foreign keys and joins. In Cassandra, relationships are represented using collections.

CASSANDRA HADOOP INTERACTION


As a data scientist, you know that managing and processing large amounts of data is no easy task. That’s
why many organizations turn to distributed systems like Hadoop and Cassandra to handle their big data
needs. Hadoop is an open-source framework that provides distributed storage and processing of large
data sets, while Cassandra is a distributed NoSQL database that offers high scalability and availability.
Integrating these two powerful technologies can provide even greater benefits to organizations looking to
manage their big data efficiently. In this post, we’ll explore how to integrate Cassandra with Hadoop and
the benefits of doing so.

Why Integrate Cassandra with Hadoop?


Integrating Cassandra with Hadoop provides several benefits, including:
1. Scalability: By integrating Cassandra with Hadoop, you can scale your data processing and
storage capabilities to handle even larger data sets. Cassandra’s distributed architecture allows it
to store and process data across multiple nodes, while Hadoop’s distributed computing framework
enables parallel processing of large data sets.
2. High Availability: Cassandra’s distributed architecture also provides high availability, ensuring
that your data remains accessible even in the event of a single node failure. This is especially
important for organizations that require continuous availability of their data.
3. Efficient Data Processing: Hadoop’s MapReduce framework allows for efficient processing of
large data sets. By integrating Cassandra with Hadoop, you can leverage Hadoop’s processing
capabilities to analyze and process data stored in Cassandra.

How to Integrate Cassandra with Hadoop


Integrating Cassandra with Hadoop involves several steps, including:
Step 1: Install Hadoop and Cassandra
To integrate Cassandra with Hadoop, you need to have both technologies installed on your system. You
can download the latest versions of Hadoop and Cassandra from their respective websites. Once
downloaded, follow the installation instructions provided by each technology.
Step 2: Configure Hadoop
Once you have installed Hadoop, you need to configure it to work with Cassandra. This involves
modifying the Hadoop configuration files to include the necessary settings for connecting to Cassandra.
First, navigate to the Hadoop installation directory and open the core-site.xml file. Add the following
lines to the file:
<property>
<name>cassandra.input.thrift.address</name>
<value>localhost</value>
</property>
Next, open the hdfs-site.xml file and add the following lines:

<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
<property>
<name>cassandra.output.thrift.address</name>
<value>localhost</value>
</property>
These settings configure Hadoop to use the Cassandra input and output formats.
Step 3: Configure Cassandra
Next, you need to configure Cassandra to work with Hadoop. This involves modifying the Cassandra
configuration files to include the necessary settings for connecting to Hadoop.
First, navigate to the Cassandra installation directory and open the cassandra.yaml file. Add the following
lines to the file:
hadoop_config:
fs.default.name: hdfs://localhost:9000
This setting configures Cassandra to use the Hadoop file system.
Step 4: Create a Hadoop Job to Access Cassandra Data
Once you have configured Hadoop and Cassandra to work together, you can create a Hadoop job to
access the data stored in Cassandra. This involves writing a MapReduce program that uses the Cassandra
input and output formats to read and write data.

Here’s an example MapReduce program that reads data from a Cassandra table and writes it to a Hadoop
file:
public class CassandraHadoopJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Cassandra Hadoop Job");
        job.setJarByClass(CassandraHadoopJob.class);
        job.setInputFormatClass(CassandraInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(CassandraMapper.class);
        job.setReducerClass(CassandraReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        CassandraConfigHelper.setInputColumnFamily(job.getConfiguration(), "keyspace", "table");
        CassandraConfigHelper.setOutputColumnFamily(job.getConfiguration(), "keyspace", "table");
        CassandraConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
        CassandraConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
This program uses the CassandraInputFormat and TextOutputFormat classes to read and write data,
respectively. The CassandraMapper and CassandraReducer classes define the Map and Reduce functions,
respectively. The CassandraConfigHelper class is used to configure the input and output column families,
as well as the initial address and RPC port for connecting to Cassandra.
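The program above references CassandraMapper and CassandraReducer without showing them. A minimal sketch of what they might look like, assuming identity-style map and reduce over the Text key/value types configured in the job (in real use the mapper's input types must match whatever the configured input format actually emits), is:

```java
// Hypothetical sketch only; requires Hadoop (and the Cassandra Hadoop
// support classes) on the classpath to compile.
public class CassandraMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass each row key and value through unchanged.
        context.write(key, value);
    }
}

// Shown package-private here only so both classes fit one sketch;
// in a real project each public class lives in its own file.
class CassandraReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Write every value seen for this key to the output file.
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
```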

Step 5: Run the Hadoop Job

Once you have written your MapReduce program, you can run the Hadoop job using the following
command:
$ hadoop jar <path-to-jar-file> <main-class> <input-path> <output-path>
Replace <path-to-jar-file> with the path to your MapReduce program’s JAR file, <main-class> with the
fully qualified name of your program’s main class, <input-path> with the path to the input data,
and <output-path> with the path to the output data.
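For instance, if the program above were packaged as cassandra-hadoop-job.jar (an illustrative name), the invocation might look like:

```shell
# Illustrative invocation; paths and jar name are assumptions.
hadoop jar cassandra-hadoop-job.jar CassandraHadoopJob /user/input /user/output
```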

Conclusion
Integrating Cassandra with Hadoop provides several benefits to organizations looking to manage their big
data efficiently. By leveraging the scalability and availability of Cassandra and the efficient processing
capabilities of Hadoop, organizations can handle even larger data sets with ease. Integrating the two
technologies involves several steps, including installing and configuring Hadoop and Cassandra, creating
a Hadoop job to access Cassandra data, and running the job. With this guide, you’ll be able to integrate
Cassandra with Hadoop and take advantage of the benefits that come with it.

Real World Data Modeling Examples


5.1. Facebook Posts
Suppose that we are storing Facebook posts of different users in Cassandra. One of the common query
patterns will be fetching the top ‘N‘ posts made by a given user.
Thus, we need to store all data for a particular user on a single partition as per the above guidelines.
Also, using the post timestamp as the clustering key will be helpful for retrieving the top ‘N‘ posts more
efficiently.
Let's define the Cassandra table schema for this use case:
CREATE TABLE posts_facebook (
    user_id uuid,
    post_id timeuuid,
    content text,
    PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);
Now, let's write a query to find the top 20 posts for the user Anna (CQL string literals use single quotes; in practice user_id would be a uuid value):
SELECT content FROM posts_facebook WHERE user_id = 'Anna_id' LIMIT 20;
5.2. Gyms Across the Country
Suppose that we are storing the details of different partner gyms across the different cities and states of
many countries and we would like to fetch the gyms for a given city.
Also, let's say we need to return the results having gyms sorted by their opening date.
Based on the above guidelines, we should store the gyms located in a given city of a specific state and
country on a single partition and use the opening date and gym name as a clustering key.
Let's define the Cassandra table schema for this example:
CREATE TABLE gyms_by_city (
    country_code text,
    state text,
    city text,
    gym_name text,
    opening_date timestamp,
    PRIMARY KEY ((country_code, state, city), opening_date, gym_name)
) WITH CLUSTERING ORDER BY (opening_date ASC, gym_name ASC);
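Before querying, rows would be inserted with every partition-key column supplied; a hypothetical example row (the gym name and date are illustrative):

```sql
-- Illustrative row; all partition key columns must be provided.
INSERT INTO gyms_by_city (country_code, state, city, gym_name, opening_date)
VALUES ('us', 'Arizona', 'Phoenix', 'Iron Works Gym', '2020-06-01');
```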
Now, let's look at a query that fetches the first ten gyms by their opening date for the city of Phoenix
within the U.S. state of Arizona:
SELECT * FROM gyms_by_city
WHERE country_code = 'us' AND state = 'Arizona' AND city = 'Phoenix'
LIMIT 10;
Next, let’s see a query that fetches the ten most recently-opened gyms in the city of Phoenix within the
U.S. state of Arizona:
SELECT * FROM gyms_by_city
WHERE country_code = 'us' AND state = 'Arizona' AND city = 'Phoenix'
ORDER BY opening_date DESC
LIMIT 10;
Note: As the last query's sort order is opposite of the sort order defined during the table creation, the
query will run slower as Cassandra will first fetch the data and then sort it in memory.
5.3. E-commerce Customers and Products
Let's say we are running an e-commerce store and that we are storing
the Customer and Product information within Cassandra. Let's look at some of the common query
patterns around this use case:

1. Get Customer info


2. Get Product info
3. Get all Customers who like a given Product
4. Get all Products a given Customer likes

We will start by using separate tables for storing the Customer and Product information. However, we
need to introduce a fair amount of denormalization to support the 3rd and 4th queries shown above.
We will create two more tables to achieve this – “Customer_by_Product” and “Product_by_Customer“.
Let's look at the Cassandra table schema for this example:
CREATE TABLE Customer (
    cust_id text,
    first_name text,
    last_name text,
    registered_on timestamp,
    PRIMARY KEY (cust_id));

CREATE TABLE Product (
    prdt_id text,
    title text,
    PRIMARY KEY (prdt_id));

CREATE TABLE Customer_By_Liked_Product (
    liked_prdt_id text,
    liked_on timestamp,
    title text,
    cust_id text,
    first_name text,
    last_name text,
    PRIMARY KEY (liked_prdt_id, liked_on));

CREATE TABLE Product_Liked_By_Customer (
    cust_id text,
    first_name text,
    last_name text,
    liked_prdt_id text,
    liked_on timestamp,
    title text,
    PRIMARY KEY (cust_id, liked_on));
Note: To support both the queries, recently-liked products by a given customer and customers who
recently liked a given product, we have used the “liked_on” column as a clustering key.
Let's look at the query to find the ten customers who most recently liked the product "Pepsi". We filter on the table's partition key, the liked product's id, rather than title, since Cassandra only allows filtering on key columns ('pepsi_id' stands in for the product's id):
SELECT * FROM Customer_By_Liked_Product
WHERE liked_prdt_id = 'pepsi_id'
ORDER BY liked_on DESC LIMIT 10;
And let's see the query that finds the recently-liked products (up to ten) by a customer named "Anna", again filtering on the partition key cust_id rather than first_name:
SELECT * FROM Product_Liked_By_Customer
WHERE cust_id = 'Anna_id'
ORDER BY liked_on DESC LIMIT 10;
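Keeping the two denormalized tables in sync means every "like" event is written twice, once per table. A sketch of the fan-out write (the literal values, including the customer's last name, are illustrative):

```sql
-- A single "like" event is fanned out to both denormalized tables.
BEGIN BATCH
    INSERT INTO Customer_By_Liked_Product
        (liked_prdt_id, liked_on, title, cust_id, first_name, last_name)
    VALUES ('pepsi_id', toTimestamp(now()), 'Pepsi', 'Anna_id', 'Anna', 'Smith');
    INSERT INTO Product_Liked_By_Customer
        (cust_id, liked_on, liked_prdt_id, title, first_name, last_name)
    VALUES ('Anna_id', toTimestamp(now()), 'pepsi_id', 'Pepsi', 'Anna', 'Smith');
APPLY BATCH;
```

Using a logged batch here trades some write latency for the guarantee that both tables eventually receive the write.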
IMPORTANT QUESTIONS
1. Compare and contrast RDBMS and NoSQL databases
2. Explain the advantages and disadvantages of the Integration database, also discuss the
alternative to the integration database
3. Write a short note on Relational databases, and examine the impedance mismatch issue of
Relational databases
4. Explain and mention the features of key-value databases, with an example
5. Explain and mention the features of Column store databases, Explain with an example
6. Make use of the functionalities and properties of the Document database and explain its
working with an example
7. Make use of the functionalities and properties of Graph databases and explain its working with
an example
8. What is sharding? Analyze in detail how sharding is achieved in distributed models
9. Explain the properties of Master-slave replication and p2p replication models
10. Analyse the various issues that arise out of read consistency and update consistency, describe
with examples for each case
11. Analyse the essential features of the CAP theorem and the ways in which consistency and
availability can be relaxed
12. How can version stamps be constructed, also explain how they are implemented for multiple nodes
13. Explain the steps involved in working on Map-reduce with suitable diagrams, and explain its
applications in the real world
14. Explain the functionalities and features of Key value pair data store, and use diagrams and
examples wherever necessary
15. Apply the concept of key-value store NoSQL database to a user profile/preferences use case and
derive suitable inferences; use diagrams and examples wherever necessary
16. Architecture of Cassandra
17. Data modeling in Cassandra
18. Cassandra and Hadoop integration
What is NoSQL? [Remember]
Analyze the reasons why we need NoSQL. [Analyze]
Assess the categories of NoSQL. [Analyze]
List the advantages of NoSQL. [Remember]
Discuss the disadvantages of NoSQL. [Understand]
Compare Cassandra vs Hadoop. [Evaluate]
Predict who is generating the big data and also name the ecosystem projects used for processing. [Understand]
What is a NoSQL database? [Remember]
What is a Key-Value data store? [Remember]
Compare document store vs key-value store. [Remember]
Provide your own definition of what big data means to your organization. [Remember]
Outline sharding. [Create]
Identify three "big data" sources, either within or external to your organization, that would be relevant to your business. [Understand]
Define Tabular store. [Understand]
What is a Graph database? [Remember]
Enumerate the term Graph Analytics. [Evaluate]
Describe a pilot application for graph analytics. [Analyze]
(i) Describe what is NoSQL. (7) (ii) Identify the advantages and disadvantages of NoSQL. (6) [Remember]
Describe how Cassandra is integrated with Hadoop and also the tools related to Hadoop. (13) [Remember]
List the classification of NoSQL databases and explain Key-Value stores. (13) [Apply]
Describe the system architecture and components of Hive and Hadoop. (13) [Remember]
What is NoSQL? What are the advantages of NoSQL? Explain the types of NoSQL databases. (13) [Remember]
Explain Graph databases and descriptive statistics. (13) [Remember]
Write short notes on: i. NoSQL databases and their types. (7) [Apply]
With suitable examples, differentiate the applications, structure, working and usage of different NoSQL databases. (13) [Understand]
Explain the differences between SQL and NoSQL with a suitable example. (13) [Understand]
Explain the types of NoSQL data stores in detail. (13) [Apply]
Discuss in detail the characteristics of NoSQL databases. (13) [Analyze]
Explain in detail the market and business drivers for Big Data Analytics. (13) [Remember]
What is the purpose of sharding? (6) [Create]
Formulate how big data analytics helps business people to increase their revenue; discuss with any one real-time application. (15) [Evaluate]
Draw insights out of any one visualization tool. (15) [Evaluate]
Explain in detail the brief history of NoSQL, and explain ACID vs. BASE in detail. (15) [Create]
