1,2,3 Units

NoSQL is a flexible database management system designed for handling large volumes of unstructured and semi-structured data, differing from traditional relational databases by using adaptable data models. It encompasses four main categories: document databases, key-value stores, column-family stores, and graph databases, each optimized for specific data storage and retrieval needs. NoSQL databases offer advantages such as high scalability, flexibility, and performance, but also face challenges like lack of standardization and ACID compliance.

NO SQL

UNIT-1
Definition:
NoSQL is a type of database management system (DBMS) that is designed to handle
and store large volumes of unstructured and semi-structured data. Unlike traditional
relational databases that use tables with pre-defined schemas to store data, NoSQL
databases use flexible data models that can adapt to changes in data structures and
are capable of scaling horizontally to handle growing amounts of data.

The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but
the term has since evolved to mean “not only SQL,” as NoSQL databases have
expanded to include a wide range of different database architectures and data
models.

NOSQL DATABASES ARE GENERALLY CLASSIFIED INTO FOUR MAIN CATEGORIES:
1. Document databases: These databases store data as semi-structured
documents, such as JSON or XML, and can be queried using document-oriented
query languages.
2. Key-value stores: These databases store data as key-value pairs, and are
optimized for simple and fast read/write operations.
3. Column-family stores: These databases store data as column families, which
are sets of columns that are treated as a single entity. They are optimized for fast and
efficient querying of large amounts of data.
4. Graph databases: These databases store data as nodes and edges, and are
designed to handle complex relationships between data.

1.Document-Based Database:
The document-based database is a nonrelational database. Instead of storing the data
in rows and columns (tables), it uses the documents to store the data in the database.
A document database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data
objects used in applications which means less translation is required to use these data
in the applications. In the Document database, the particular elements can be
accessed by using the index value that is assigned for faster querying.
Collections are groups of documents that have similar contents. Documents in a
collection do not all need to follow the same schema, because document databases
have a flexible schema.
EX: MongoDB, CouchDB

2. Key-Value Stores:
A key-value store is a nonrelational database. The simplest form of a NoSQL
database is a key-value store. Every data element in the database is stored in key-
value pairs. The data can be retrieved by using a unique key allotted to each element
in the database. The values can be simple data types like strings and numbers or
complex objects.
A key-value store is like a relational database with only two columns: the key and
the value.
EX: Redis, Amazon DynamoDB
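The key-value pattern can be illustrated with a minimal in-memory store in Python. This is a sketch only; real key-value stores such as Redis add persistence, expiry, and networking on top of the same idea:

```python
class KeyValueStore:
    """A minimal in-memory key-value store (illustration only)."""

    def __init__(self):
        self._data = {}  # backing hash table: each key is unique

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

# Values can be simple types (strings, numbers) or complex objects
store = KeyValueStore()
store.put("user:1:name", "Alice")
store.put("user:1:cart", {"items": ["book", "pen"], "total": 12.5})
print(store.get("user:1:name"))           # Alice
print(store.get("user:1:cart")["total"])  # 12.5
```

The `"user:1:name"` style of composite key is a common convention for simulating structure in a flat key space.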

3. Column Oriented Databases:


A column-oriented database is a non-relational database that stores data in
columns instead of rows. That means when you want to run analytics on a small
number of columns, you can read those columns directly without consuming memory
with unwanted data.
Columnar databases are designed to read data more efficiently and retrieve the data
with greater speed. A columnar database is used to store large amounts of data.
EX: Cassandra, HBase
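The difference between the two layouts can be sketched in plain Python (data values are made up for illustration). In the column-oriented layout, an aggregate over one column touches only that column's values:

```python
# Row-oriented layout: one record per row
rows = [
    {"id": 1, "city": "Delhi", "sales": 120},
    {"id": 2, "city": "Pune",  "sales": 80},
    {"id": 3, "city": "Delhi", "sales": 200},
]

# Column-oriented layout: one list per column, values stored together
columns = {
    "id":    [1, 2, 3],
    "city":  ["Delhi", "Pune", "Delhi"],
    "sales": [120, 80, 200],
}

# Analytics on a single column reads only that column's data
total_sales = sum(columns["sales"])
print(total_sales)  # 400
```

Reading `columns["sales"]` never touches the `id` or `city` data, which is why aggregates like SUM and AVERAGE are fast on columnar stores.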

4. Graph-Based databases:
Graph-based databases focus on the relationships between elements. They store
data in the form of nodes; the connections between the nodes are called links or
relationships.
EX: Neo4j
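The nodes-and-edges model can be sketched with plain Python structures (node and relationship names are illustrative). Traversal is just following edges:

```python
# Nodes with properties, and edges (relationships) between them
nodes = {
    "alice": {"type": "person"},
    "bob":   {"type": "person"},
    "carol": {"type": "person"},
}
edges = [
    ("alice", "follows", "bob"),
    ("bob",   "follows", "carol"),
]

def neighbors(node, relation):
    """Return all nodes reachable from `node` via `relation` edges."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

print(neighbors("alice", "follows"))  # ['bob']
print(neighbors("bob", "follows"))    # ['carol']
```

A real graph database indexes these edges so traversals do not scan every edge, but the data model is the same.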

LOCATION PREFERENCE STORE


If you are looking to store location preferences in a NoSQL database, there are
several options to consider based on your specific requirements. NoSQL databases
are often chosen for their flexibility and scalability. Here are a few NoSQL database
options and how you might structure your data for location preferences:

1. MongoDB
In MongoDB, you can structure your data so that each user document
contains information about their location preference. This can be achieved
using embedded or nested documents within the user document. MongoDB is known for its
flexibility in handling diverse data structures, making it suitable for storing
location preferences in a user-friendly manner.
2. Cassandra
Cassandra is a highly scalable, distributed NoSQL database with a decentralized
architecture, designed for high availability, fault tolerance, and horizontal
scalability. It uses a column-family data model, features the Cassandra Query
Language (CQL), and is suitable for applications requiring massive data storage
and processing, such as real-time analytics and sensor data.
3. Amazon DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service provided by
AWS. It offers seamless scalability, high performance, and low-latency access to
data. DynamoDB uses a key-value and document data model, providing
flexibility for diverse data types. It is designed for high availability and
durability, with automatic scaling capabilities. DynamoDB is suitable for
applications requiring fast and consistent performance, such as web and mobile
applications, gaming, and IoT.
4. Redis
Redis is an open-source, in-memory data structure store. It functions as a versatile
key-value database, supporting various data structures like strings, hashes, lists,
sets, and more. Known for its exceptional speed, Redis is often used as a
caching layer or message broker. Its simplicity and efficiency make it popular for
real-time applications, session management, and other use cases requiring fast
data access. Redis also offers features like replication and clustering for
scalability and high availability.
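As a concrete example of the nested-document approach described under MongoDB, a user document carrying a location preference might look like the dict below. The field names (`location_preference`, `radius_km`, etc.) are illustrative assumptions, not a fixed schema:

```python
# A sample user document with an embedded location preference
user_doc = {
    "_id": "user_42",
    "name": "Asha",
    "location_preference": {
        "city": "Hyderabad",
        "country": "India",
        "coordinates": {"lat": 17.38, "lon": 78.48},
        "radius_km": 25,
    },
}

# With pymongo, this dict could be inserted as-is:
#   db.users.insert_one(user_doc)
# and the nested field queried with dot notation:
#   db.users.find({"location_preference.city": "Hyderabad"})
print(user_doc["location_preference"]["city"])  # Hyderabad
```

Because the schema is flexible, other users' documents could omit `coordinates` or add extra preference fields without any migration.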

CAR MAKE AND MODEL DATABASE:


Building a car make and model database using a NoSQL database can be a good
choice, as it allows for flexibility in storing and retrieving data. MongoDB is a
popular NoSQL database that you might consider for this purpose. Below, I'll
provide a basic example schema and show you how you could structure the data
using MongoDB as an example.
```json
{
"_id": "unique_id",
"make": "Toyota",
"models": [
{
"name": "Camry",
"year_range": "2000-2022"
},
{
"name": "Corolla",
"year_range": "2005-2022"
},
{
"name": "Rav4",
"year_range": "2002-2022"
}
// Additional models can be added
]
}
```

In this example:
- Each document in the collection represents a car make.
- The "make" field stores the name of the car make.
- The "models" field is an array of sub-documents, each representing a specific
model.
- Each model sub-document has a "name" field for the model name and a
"year_range" field to specify the range of years that the model was produced.

This structure allows for easy retrieval of all models for a given make or details for a
specific model within a make.

In MongoDB, you could use the official drivers (like pymongo for Python) to
interact with the database and perform operations such as inserting, updating, and
querying data.

Remember to adjust the schema based on your specific use case and requirements. If
you have additional data or relationships, you may need to expand the schema
accordingly.

WORKING WITH LANGUAGE BINDINGS


When working with NoSQL databases, you often interact with them using language-
specific drivers or bindings. Different NoSQL databases have their own drivers for
various programming languages. I'll provide a general overview of how you might
work with language bindings in a NoSQL context using MongoDB as an example.

### MongoDB Example with Python (using pymongo):

1. Install the MongoDB driver:

```shell
pip install pymongo
```

2. Connect to the MongoDB server:

```python
from pymongo import MongoClient

# Replace 'mongodb://localhost:27017/' with your MongoDB server connection string
client = MongoClient('mongodb://localhost:27017/')

# Access a specific database (created if it doesn't exist)
db = client['your_database_name']
```

3. Insert data into the database:

```python
car_data = {
    "make": "Toyota",
    "models": [
        {"name": "Camry", "year_range": "2000-2022"},
        {"name": "Corolla", "year_range": "2005-2022"},
        {"name": "Rav4", "year_range": "2002-2022"}
    ]
}

# Insert data into a collection
collection = db['car_collection']
result = collection.insert_one(car_data)
print(f"Inserted document with ID: {result.inserted_id}")
```

4. Query data from the database:

```python
# Find all documents in the collection
all_cars = collection.find()

for car in all_cars:
    print(car)
```

5. Update data in the database:

```python
# Update a specific document
query = {"make": "Toyota"}
update_data = {"$set": {"models.0.year_range": "2001-2022"}}
collection.update_one(query, update_data)
```

6. Delete data from the database:

```python
# Delete a specific document
delete_query = {"make": "Toyota"}
collection.delete_one(delete_query)
```

Remember to replace placeholders such as `'mongodb://localhost:27017/'` and
`'your_database_name'` with your actual MongoDB connection string and database
name.

This is just a basic example with Python and MongoDB. Depending on the NoSQL
database you choose, the code structure and syntax will vary, but the general
principles of connecting, inserting, querying, updating, and deleting data will be
similar. Always refer to the documentation for the specific NoSQL database and
programming language you are using.

KEY FEATURES OF NOSQL


1. Dynamic schema: NoSQL databases do not have a fixed schema and can
accommodate changing data structures without the need for migrations or schema
alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by adding
more nodes to a database cluster, making them well-suited for handling large
amounts of data and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a document-
based data model, where data is stored in a semi-structured format, such as
JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value data
model, where data is stored as a collection of key-value pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a column-based
data model, where data is organized into columns instead of rows.
6. Distributed and high availability: NoSQL databases are often designed to be
highly available and to automatically handle node failures and data replication
across multiple nodes in a database cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve data in a
flexible and dynamic manner, with support for multiple data types and changing
data structures.
8. Performance: NoSQL databases are optimized for high performance and can
handle a high volume of reads and writes, making them suitable for big data and
real-time applications.

ADVANTAGES OF NOSQL
There are many advantages of working with NoSQL databases such as MongoDB
and Cassandra. The main advantages are high scalability and high availability.

1. High scalability: NoSQL databases use sharding for horizontal scaling.
Sharding is the partitioning of data and placing it on multiple machines in such a
way that the order of the data is preserved. Vertical scaling means adding more
resources to the existing machine, whereas horizontal scaling means adding more
machines to handle the data. Vertical scaling is not that easy to implement, but
horizontal scaling is. Examples of horizontally scaling databases
are MongoDB, Cassandra, etc. NoSQL can handle a huge amount of data because
of this scalability: as the data grows, NoSQL scales itself to handle that data
in an efficient manner.
2. Flexibility: NoSQL databases are designed to handle unstructured or semi-
structured data, which means that they can accommodate dynamic changes to the
data model. This makes NoSQL databases a good fit for applications that need to
handle changing data requirements.

3. High availability: The auto-replication feature in NoSQL databases makes them
highly available, because in case of any failure data replicates itself back to the
last consistent state.

4. Scalability: NoSQL databases are highly scalable, which means that they can
handle large amounts of data and traffic with ease. This makes them a good fit for
applications that need to handle large amounts of data or traffic

5. Performance: NoSQL databases are designed to handle large amounts of data and
traffic, which means that they can offer improved performance compared to
traditional relational databases.

6. Cost-effectiveness: NoSQL databases are often more cost-effective than


traditional relational databases, as they are typically less complex and do not
require expensive hardware or software.

7. Agility: Ideal for agile development.
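The sharding described under high scalability can be sketched with a toy range-based partitioner in Python, where keys are split across shards so that their order is preserved. The shard boundaries below are illustrative assumptions; production systems add rebalancing and replication:

```python
import bisect

# Range-based sharding: sorted boundary keys split the key space.
# Shard 0 holds keys < "g", shard 1 holds keys < "n", and so on.
boundaries = ["g", "n", "t"]
shards = [dict() for _ in range(len(boundaries) + 1)]  # each would be a machine

def shard_for(key):
    """Map a key to its shard; keys keep their sort order across shards."""
    return bisect.bisect_right(boundaries, key)

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("alice", 1); put("mohan", 2); put("zara", 3)
print(shard_for("alice"), shard_for("mohan"), shard_for("zara"))  # 0 1 3
```

Adding capacity means adding a boundary and a machine (horizontal scaling) rather than upgrading one machine (vertical scaling).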

DISADVANTAGES OF NOSQL


NoSQL has the following disadvantages.

1. Lack of standardization: There are many different types of NoSQL databases,


each with its own unique strengths and weaknesses. This lack of standardization
can make it difficult to choose the right database for a specific application
2. Lack of ACID compliance: NoSQL databases are not fully ACID-compliant,
which means that they do not guarantee the consistency, integrity, and durability of
data. This can be a drawback for applications that require strong data consistency
guarantees.
3. Narrow focus: NoSQL databases have a very narrow focus: they are mainly designed
for storage and provide relatively little functionality beyond it. Relational databases
are a better choice in the field of transaction management than NoSQL.
4. Open-source: NoSQL is an open-source database. There is no reliable
standard for NoSQL yet; in other words, two database systems are likely to be
unequal.
5. Lack of support for complex queries: NoSQL databases are not designed to
handle complex queries, which means that they are not a good fit for applications
that require complex data analysis or reporting.
6. Lack of maturity: NoSQL databases are relatively new and lack the maturity of
traditional relational databases. This can make them less reliable and less secure
than traditional databases.
7. Management challenge: The purpose of big data tools is to make the management
of a large amount of data as simple as possible. But it is not so easy. Data
management in NoSQL is much more complex than in a relational database.
NoSQL, in particular, has a reputation for being challenging to install and even
more hectic to manage on a daily basis.
8. GUI is not available: GUI mode tools to access the database are not flexibly
available in the market.
9. Backup: Backup is a great weak point for some NoSQL databases like MongoDB.
MongoDB has no approach for the backup of data in a consistent manner.
10. Large document size: Some database systems like MongoDB and CouchDB store
data in JSON format. This means that documents are quite large (BigData, network
bandwidth, speed), and having descriptive key names actually hurts since they
increase the document size.

DIFFERENCE BETWEEN SQL AND NOSQL


***
UNIT -2
If NOSQL Then What ?
NoSQL databases address specific needs and challenges that traditional relational
databases may face in certain scenarios. The primary needs and use cases for NoSQL
databases include:

1.Handling Large Amounts of Data:


NoSQL databases are designed to scale horizontally, making them well-suited for
handling large volumes of data across distributed and clustered environments. This
scalability is crucial in scenarios where the data size is expected to grow rapidly.

2.Flexible Schema:
NoSQL databases, unlike traditional relational databases, often allow for a flexible
or schema-less data model. This flexibility is beneficial when dealing with diverse
and evolving data structures, as there is no need to predefine a rigid schema for the
entire database.

3.Performance and Low Latency:


Many NoSQL databases are optimized for read and write performance, making them
suitable for applications that require low-latency data access. This is important in
scenarios where real-time data processing and quick response times are critical.

4.Horizontal Scalability:
NoSQL databases can easily scale out by adding more servers to a distributed
system. This is in contrast to some traditional relational databases, which may face
scalability challenges when trying to handle large amounts of data or increased
concurrent users.

5.Unstructured and Semi-structured Data:


NoSQL databases can efficiently handle unstructured or semi-structured data, such
as JSON or XML documents. This makes them suitable for applications dealing with
diverse data formats and varying data structures.

6.High Availability and Fault Tolerance:


NoSQL databases are often designed with distributed architectures that provide high
availability and fault tolerance. They can continue to operate even if some nodes in
the system fail, making them resilient in distributed and cloud environments.

7.Agile Development and Rapid Prototyping:


NoSQL databases are well-suited for agile development methodologies and rapid
prototyping, allowing developers to quickly adapt to changing requirements and
iterate on their data models without the constraints of a fixed schema.

8.Scalable for Read and Write Workloads:


NoSQL databases are designed to handle both read and write operations at scale.
Some NoSQL databases are specifically optimized for read-heavy workloads (e.g.,
caching databases), while others excel in write-intensive scenarios.

9.Support for Geographic Distribution:


NoSQL databases often support geographically distributed data, enabling the storage
and retrieval of data across different data centers or regions. This is beneficial for
applications that require global access to data.

It's important to note that the choice of using a NoSQL database should be based on
the specific requirements of the application and the nature of the data rather than a
one-size-fits-all approach. Different types of NoSQL databases may be more suitable
for different use cases.

Performing Crud Operations


We can use MongoDB for various things, like building an
application (including web and mobile), analyzing data, or administering a
MongoDB database. In all these cases we need to interact with the MongoDB server
to perform certain operations, like entering new data into the application, updating
data, deleting data, and reading the data of
the application. MongoDB provides a set of basic but essential operations
that help you easily interact with the MongoDB server; these operations
are known as CRUD operations.

Create Operations
The create or insert operations are used to insert or add new documents in the
collection. If a collection does not exist, then it will create a new collection in the
database. You can perform create operations using the following methods provided
by MongoDB:
Example : In this example, we are inserting details of a single student in the form of
document in the student collection using db.collection.insertOne() method.

Read Operations
The Read operations are used to retrieve documents from the collection, or in other
words, read operations are used to query a collection for a document. You can
perform read operation using the following method provided by the MongoDB:

The .pretty() method is used to format the result so that it is easy to read.
Example : In this example, we are retrieving the details of students from the student
collection using db.collection.find() method.
Update Operations
The update operations are used to update or modify the existing document in the
collection. You can perform update operations using the following methods provided
by the MongoDB:

Example : In this example, we are updating the age of Sumit in the student
collection using db.collection.updateOne() method.
Delete Operations
The delete operations are used to delete or remove documents from a collection.
You can perform delete operations using the following methods provided by
MongoDB:

Example : In this example, we are deleting all the documents from the student
collection using db.collection.deleteMany() method.
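The four CRUD operations above map directly onto MongoDB's method names. The sketch below mirrors them with a minimal in-memory "collection" so the pattern is visible without a running server; the method names imitate MongoDB's, but the implementation is a deliberately simplified toy:

```python
class ToyCollection:
    """In-memory stand-in mirroring MongoDB's CRUD method names."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):                # Create
        self.docs.append(dict(doc))

    def find(self, query=None):               # Read
        query = query or {}
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

    def update_one(self, query, new_values):  # Update (first match only)
        for d in self.find(query):
            d.update(new_values)
            break

    def delete_many(self, query):             # Delete (all matches)
        matches = self.find(query)
        self.docs = [d for d in self.docs if d not in matches]

student = ToyCollection()
student.insert_one({"name": "Sumit", "age": 20})
student.update_one({"name": "Sumit"}, {"age": 21})
print(student.find({"name": "Sumit"}))  # [{'name': 'Sumit', 'age': 21}]
```

In real MongoDB, `db.student.insertOne(...)`, `db.student.find(...)`, `db.student.updateOne(...)`, and `db.student.deleteMany(...)` follow this same query-document pattern.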

***
UNIT 3
NoSQL Storage Architecture

NoSQL Architecture
Architecture Pattern is a logical way of categorizing data that will be stored on the
Database. NoSQL is a type of database which helps to perform operations on big
data and store it in a valid format. It is widely used because of its flexibility and a
wide variety of services.

Architecture Patterns of NoSQL:


The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained as following below.
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name
suggests, the data is stored in form of Key-Value Pairs. The key is usually a
sequence of strings, integers or characters but can also be a more advanced data
type. The value is typically linked or co-related to the key. The key-value pair
storage databases generally store data as a hash table where each key is unique.
The value can be of any type (JSON, BLOB(Binary Large Object), strings, etc).
This type of pattern is usually used in shopping websites or e-commerce
applications.
Advantages:
● Can handle large amounts of data and heavy load,
● Easy retrieval of data by keys.
Limitations:
● Complex queries may involve multiple key-value pairs, which can degrade
performance.
● Data with many-to-many relationships is difficult to model.
Examples:
● DynamoDB
● Berkeley DB

2. Column Store Database:


Rather than storing data in relational tuples, the data is stored in individual cells
which are further grouped into columns. Column-oriented databases work only
on columns; they store large amounts of data in columns together. The format and
titles of the columns can diverge from one row to another. Every column is treated
separately, but each individual column may still contain multiple other columns,
like traditional databases. Basically, columns are the mode of storage in this type.
Advantages:
● Data is readily available
● Queries like SUM, AVERAGE, COUNT can be easily performed on columns.
Examples:
● HBase

● Bigtable by Google
● Cassandra
3. Document Database:
The document database fetches and accumulates data in form of key-value pairs
but here, the values are called as Documents. Document can be stated as a
complex data structure. Document here can be a form of text, arrays, strings,
JSON, XML or any such format. The use of nested documents is also very
common. It is very effective as most of the data created is usually in form of
JSONs and is unstructured.
Advantages:
● This type of format is very useful and apt for semi-structured data.
● Storage retrieval and managing of documents is easy.
Limitations:
● Handling multiple documents is challenging
● Aggregation operations may not work accurately.
Examples:
● MongoDB

● CouchDB

Figure – Document Store Model in form of JSON documents

4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data
in graphs. Graphs are basically structures that depict connections between two or
more objects in some data. The objects or entities are called as nodes and are
joined together by relationships called Edges. Each edge has a unique identifier.
Each node serves as a point of contact for the graph. This pattern is very
commonly used in social networks where there are a large number of entities and
each entity has one or many characteristics which are connected by edges. The
relational database pattern has tables that are loosely connected, whereas graphs
are often very strong and rigid in nature.
Advantages:
● Fastest traversal because of connections.
● Spatial data can be easily handled.
Limitations:
Wrong connections may lead to infinite loops.

Examples:
● Neo4J

● FlockDB( Used by Twitter)

Figure – Graph model format of NoSQL Databases

Working with Column Oriented Database


NoSQL databases have four distinct types: key-value stores, document stores, graph
databases, and column-oriented databases. In this article, we’ll explore column-
oriented databases, also known simply as “NoSQL columns”. At a very surface level,
column-store databases do exactly what is advertised on the tin: namely, instead
of organizing information into rows, they do so in columns. This essentially makes
them function the same way that tables work in relational databases. Of course, since
this is a NoSQL database, this data model makes them much more flexible.
More specifically, column databases use the concept of keyspace, which is sort of
like a schema in relational models. This keyspace contains all the column families,
which then contain rows, which then contain columns. It’s a bit tricky to wrap your
head around at first but it’s relatively straightforward.
By taking a quick look, we can see that a column family has several rows. Within
each row, there can be several different columns, with different names, links, and
even sizes (meaning they don’t need to adhere to a standard). Furthermore, these
columns only exist within their own row and can contain a value pair, name, and a
timestamp.

If we take a specific row as an example:

The Row Key is exactly that: the specific identifier of that row, and it is always unique.
The column contains the name, value, and timestamp, so that’s straightforward. The
name/value pair is also straightforward, and the timestamp is the date and time the
data was entered into the database. Some examples of column-store databases include
Cassandra, Cosmos DB, Bigtable, and HBase.
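The keyspace → column family → row → column hierarchy described above can be sketched as nested dicts, where every column carries a name, value, and timestamp. The structure below is an illustration loosely modeled on Cassandra's data model, with made-up field values:

```python
import time

# keyspace -> column family -> row key -> column name -> {value, timestamp}
keyspace = {
    "users": {                        # a column family
        "row-001": {                  # row key: the row's unique identifier
            "name":  {"value": "Alice", "timestamp": time.time()},
            "email": {"value": "alice@example.com", "timestamp": time.time()},
        },
        "row-002": {                  # rows need not share the same columns
            "name":    {"value": "Bob", "timestamp": time.time()},
            "country": {"value": "India", "timestamp": time.time()},
        },
    }
}

row = keyspace["users"]["row-001"]
print(row["name"]["value"])  # Alice
```

Note how `row-002` has a `country` column that `row-001` lacks: columns exist only within their own row and need not adhere to a standard.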

Benefits of Column Databases


There are several benefits that go along with columnar databases:
● Column stores are excellent at compression and therefore are efficient in terms of
storage. This means you can reduce disk resources while holding massive amounts of
information in a single column
● Since a majority of the information is stored in a column, aggregation queries are
quite fast, which is important for projects that require large amounts of queries in a
small amount of time.
● Scalability is excellent with column-store databases. They can be expanded nearly
infinitely, and are often spread across large clusters of machines, even numbering in
thousands. That also means that they are great for Massive Parallel Processing
● Load times are similarly excellent, as you can easily load a billion-row table in a few
seconds. That means you can load and query nearly instantly.
● Large amounts of flexibility as columns do not necessarily have to look like each
other. That means you can add new and different columns without disrupting the
whole database. That being said, entering completely new record queries requires a
change to all tables.
Overall, column-store databases are great for analytics and reporting: fast querying
speeds and abilities to hold large amounts of data without adding a lot of overhead
make it ideal.

Disadvantages of Column Databases


As it usually is in life, nothing is perfect and there are a couple of disadvantages to
using column-oriented databases as well:
● Designing an indexing schema that’s effective is difficult and time consuming. Even
then, the said schema would still not be as effective as simple relational database
schemas.
● While this may not be an issue for some users, incremental data loading is
suboptimal and should be avoided if possible.
● Security vulnerabilities in web applications are ever present, and the fact that
NoSQL databases lack inbuilt security features doesn’t help (this goes for all
NoSQL database types, not just columnar ones). If security is your number one
priority, you should either look into relational databases you could employ or
employ a well-defined schema if possible.
● Online Transaction Processing (OLTP) applications are also not compatible with
columnar databases due to the way data is stored.

What is Hadoop ?
Hadoop is an open source software programming framework for storing a large
amount of data and performing the computation. Its framework is based on Java
programming with some native code in C and shell scripts.
Hadoop is an open-source software framework that is used for storing and processing
large amounts of data in a distributed computing environment. It is designed to
handle big data and is based on the MapReduce programming model, which allows
for the parallel processing of large datasets.
Hadoop has two main components:
● HDFS (Hadoop Distributed File System): This is the storage component of
Hadoop, which allows for the storage of large amounts of data across multiple
machines. It is designed to work with commodity hardware, which makes it cost-
effective.
● YARN (Yet Another Resource Negotiator): This is the resource management
component of Hadoop, which manages the allocation of resources (such as CPU
and memory) for processing the data stored in HDFS.
● Hadoop also includes several additional modules that provide additional
functionality, such as Hive (a SQL-like query language), Pig (a high-level
platform for creating MapReduce programs), and HBase (a non-relational,
distributed database).
● Hadoop is commonly used in big data scenarios such as data warehousing,
business intelligence, and machine learning. It’s also used for data processing,
data analysis, and data mining. It enables the distributed processing of large data
sets across clusters of computers using a simple programming model.
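The MapReduce model that Hadoop is based on can be illustrated with a tiny pure-Python word count (no Hadoop cluster involved): the map phase emits key-value pairs, a shuffle groups them by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate the grouped values for one key
    return (key, sum(values))

lines = ["big data big ideas", "big clusters"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # 3
```

In Hadoop, the map and reduce functions run in parallel on many nodes, and the shuffle moves data between them over the network; the logic per record is the same as in this single-process sketch.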

History of Hadoop
Hadoop was developed at the Apache Software Foundation; its co-founders
are Doug Cutting and Mike Cafarella. Co-founder Doug Cutting named it after
his son’s toy elephant. In October 2003, the first paper, on the Google File
System, was released. In January 2006, MapReduce development started on Apache Nutch,
which consisted of around 6,000 lines of code for MapReduce and around 5,000 lines of
code for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data. It
was created by Apache Software Foundation in 2006, based on a white paper written
by Google in 2003 that described the Google File System (GFS) and the MapReduce
programming model. The Hadoop framework allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering
local computation and storage. It is used by many organizations, including Yahoo,
Facebook, and IBM, for a variety of purposes such as data warehousing, log
processing, and research. Hadoop has been widely adopted in the industry and has
become a key technology for big data processing.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It has huge, flexible storage.
5. It is low cost.

Hadoop has several key features that make it well-suited


for big data processing:
● Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
● Scalability: Hadoop can scale from a single server to thousands of machines,
making it easy to add more capacity as needed.
● Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can
continue to operate even in the presence of hardware failures.
● Data Locality: Hadoop stores data on the same node where it will be processed, a feature that reduces network traffic and improves performance.
● High Availability: Hadoop provides a high-availability feature, which helps ensure that data is always available and is not lost.
● Flexible Data Processing: Hadoop's MapReduce programming model allows for the processing of data in a distributed fashion, making it easy to implement a wide variety of data processing tasks.
● Data Integrity: Hadoop provides a built-in checksum feature, which helps ensure that the stored data is consistent and correct.
● Data Replication: Hadoop provides a data replication feature, which replicates data across the cluster for fault tolerance.
● Data Compression: Hadoop provides a built-in data compression feature, which helps reduce storage space and improve performance.
● YARN: A resource management platform that allows multiple data processing
engines like real-time streaming, batch processing, and interactive SQL, to run
and process data stored in HDFS.
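The MapReduce programming model referenced in these features can be sketched in plain Python. This is a toy, single-process simulation of the map, shuffle, and reduce phases of a word count (the classic Hadoop example), not a real distributed job; all function names here are illustrative.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input split
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "data stored in clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real Hadoop job the map and reduce functions run on different machines and the shuffle moves data over the network, but the data flow is the same.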

Hadoop Distributed File System


It has distributed file system known as HDFS and this HDFS splits files into blocks
and sends them across various nodes in form of large clusters. Also in case of a node
failure, the system operates and data transfer takes place between the nodes which
are facilitated by HDFS.
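The block splitting and replication that HDFS performs can be sketched as follows. This is a toy model, not HDFS code: the block size is shrunk from the real default (128 MB) to a few hundred bytes, and the round-robin placement used here is a simplification of HDFS's actual rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    # HDFS splits a file into fixed-size blocks (default 128 MB; tiny here)
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    # Each block is copied to `replication` distinct nodes (HDFS default is 3)
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(blocks, nodes)
print(len(blocks))    # 4 blocks (256 + 256 + 256 + 232 bytes)
print(placement[0])   # ['node1', 'node2', 'node3']
```

Because every block lives on several nodes, losing one node never loses the only copy of a block, which is how HDFS tolerates hardware failures.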

HDFS

Advantages of HDFS: It is inexpensive, immutable in nature, stores data


reliably, ability to tolerate faults, scalable, block structured, can process a large
amount of data simultaneously and many more.

Disadvantages of HDFS: It’s the biggest disadvantage is that it is not fit


for small quantities of data. Also, it has issues related to potential stability, restrictive
and rough in nature. Hadoop also supports a wide range of software packages such as
Apache Flumes, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark,
Apache Storm, Apache Pig, Apache Hive, Apache Phoenix, Cloudera Impala.

Some common frameworks of Hadoop


1. Hive- It uses HiveQl for data structuring and for writing complicated MapReduce
in HDFS.
2. Drill- It consists of user-defined functions and is used for data exploration.
3. Storm- It allows real-time processing and streaming of data.
4. Spark- It contains a Machine Learning Library(MLlib) for providing enhanced
machine learning and is widely used for data processing. It also supports Java,
Python, and Scala.
5. Pig- It has Pig Latin, a SQL-Like language and performs data transformation of
unstructured data.
6. Tez- It reduces the complexity of Hive and Pig and helps their code run faster.

Hadoop framework is made up of the following modules:


1. Hadoop MapReduce- a MapReduce programming model for handling and
processing large data.
2. Hadoop Distributed File System- distributed files in clusters among nodes.
3. Hadoop YARN- a platform which manages computing resources.
4. Hadoop Common- it contains packages and libraries which are used for other
modules.

Advantages and Disadvantages of Hadoop


Advantages:
● Ability to store a large amount of data.
● High flexibility.
● Cost effective.
● High computational power.
● Tasks are independent.
● Linear scaling.

Hadoop has several advantages that make it a popular


choice for big data processing:

● Scalability: Hadoop can easily scale to handle large amounts of data by adding
more nodes to the cluster.
● Cost-effective: Hadoop is designed to work with commodity hardware, which
makes it a cost-effective option for storing and processing large amounts of data.
● Fault-tolerance: Hadoop’s distributed architecture provides built-in fault-
tolerance, which means that if one node in the cluster goes down, the data can still
be processed by the other nodes.
● Flexibility: Hadoop can process structured, semi-structured, and unstructured
data, which makes it a versatile option for a wide range of big data scenarios.
● Open-source: Hadoop is open-source software, which means that it is free to use
and modify. This also allows developers to access the source code and make
improvements or add new features.
● Large community: Hadoop has a large and active community of developers and
users who contribute to the development of the software, provide support, and
share best practices.
● Integration: Hadoop is designed to work with other big data technologies such as
Spark, Storm, and Flink, which allows for integration with a wide range of data
processing and analysis tools.

Disadvantages:
● Not very effective for small data.
● Hard cluster management.
● Has stability issues.
● Security concerns.
● Complexity: Hadoop can be complex to set up and maintain, especially for
organizations without a dedicated team of experts.
● Latency: Hadoop is not well-suited for low-latency workloads and may not be the
best choice for real-time data processing.
● Limited Support for Real-time Processing: Hadoop’s batch-oriented nature
makes it less suited for real-time streaming or interactive data processing use
cases.
● Limited Support for Structured Data: Hadoop is designed to work with unstructured and semi-structured data; it is not well-suited for structured data processing.
● Data Security: Hadoop does not provide built-in security features such as data
encryption or user authentication, which can make it difficult to secure sensitive
data.
● Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming
model is not well-suited for ad-hoc queries, making it difficult to perform
exploratory data analysis.
● Limited Support for Graph and Machine Learning: Hadoop's core components, HDFS and MapReduce, are not well-suited for graph and machine learning workloads; specialized components like Apache Giraph and Mahout are available but have some limitations.
● Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
● Data Loss: In the event of a hardware failure, the data stored in a single node may
be lost permanently.
● Data Governance: Data governance is a critical aspect of data management; Hadoop does not provide built-in features to manage data lineage, data quality, data cataloging, and data audit.

HBase Architecture
HBase architecture has 3 main components: HMaster, Region Server, Zookeeper.
Figure – Architecture of HBase
All the 3 components are described below:
1. HMaster –
HMaster is the implementation of the Master Server in HBase. It is the process that assigns regions to region servers and performs DDL (create, delete table) operations. It monitors all Region Server instances present in the cluster. In a distributed environment, the Master runs several background threads. HMaster has many responsibilities, such as controlling load balancing, failover, etc.

2. Region Server –
HBase tables are divided horizontally by row key range into regions. Regions are the basic building blocks of an HBase cluster: they hold the distributed portions of tables and are comprised of column families. A Region Server runs on an HDFS DataNode present in the Hadoop cluster. A Region Server is responsible for handling, managing, and executing reads and writes of HBase operations on its set of regions. The default size of a region is 256 MB.
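The row-key-range partitioning described above can be sketched as a sorted-range lookup, similar in spirit to the client-side meta lookup HBase performs to find the region server for a key. The region boundaries and server names below are invented for illustration.

```python
import bisect

# Each region serves a half-open range of row keys [start, next_start).
# Boundaries here are illustrative, not from a real cluster.
region_starts = ["", "g", "p"]          # region 0: [""-"g"), 1: ["g"-"p"), 2: ["p"-...)
region_servers = ["rs1", "rs2", "rs3"]  # region server hosting each region

def find_region(row_key: str) -> str:
    # Binary-search the sorted start keys to locate the hosting region server
    idx = bisect.bisect_right(region_starts, row_key) - 1
    return region_servers[idx]

print(find_region("apple"))   # rs1
print(find_region("mango"))   # rs2
print(find_region("zebra"))   # rs3
```

Because keys are sorted, a scan over a contiguous key range touches only the few regions covering that range rather than the whole cluster.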

3. Zookeeper –
It is like a coordinator in HBase. It provides services like maintaining
configuration information, naming, providing distributed synchronization, server
failure notification etc. Clients communicate with region servers via zookeeper.

Advantages of HBase –
1. Can store large data sets
2. Database can be shared
3. Cost-effective from gigabytes to petabytes
4. High availability through failover and replication
Disadvantages of HBase
1. No support for an SQL-like structure
2. No transaction support
3. Sorted only on key
4. Memory issues on the cluster

Comparison between HBase and HDFS:


HBase provides low latency access while HDFS provides high latency operations.
● HBase supports random read and writes while HDFS supports Write once Read
Many times.
● HBase is accessed through shell commands, Java API, REST, Avro or Thrift API
while HDFS is accessed through MapReduce jobs.

Features of HBase architecture :


● Distributed and Scalable: HBase is designed to be distributed and scalable,
which means it can handle large datasets and can scale out horizontally by adding
more nodes to the cluster.
● Column-oriented Storage: HBase stores data in a column-oriented manner,
which means data is organized by columns rather than rows. This allows for
efficient data retrieval and aggregation.
● Hadoop Integration: HBase is built on top of Hadoop, which means it can
leverage Hadoop’s distributed file system (HDFS) for storage and MapReduce for
data processing.
● Consistency and Replication: HBase provides strong consistency guarantees for
read and write operations, and supports replication of data across multiple nodes
for fault tolerance.
● Built-in Caching: HBase has a built-in caching mechanism that can cache
frequently accessed data in memory, which can improve query performance.
● Compression: HBase supports compression of data, which can reduce storage
requirements and improve query performance.
● Flexible Schema: HBase supports flexible schemas, which means the schema can
be updated on the fly without requiring a database schema migration.
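The column-oriented layout can be pictured as a map keyed by (row key, family:qualifier, timestamp), which is how HBase logically addresses a cell. This is a toy in-memory model with made-up data, not HBase's actual storage format.

```python
import time

# An HBase cell is logically keyed by (row key, "family:qualifier", timestamp).
cells = {}

def put(row, column, value, ts=None):
    # Writes never overwrite: each put adds a new timestamped version
    cells[(row, column, ts if ts is not None else time.time())] = value

def get_latest(row, column):
    # Return the highest-timestamp version of the cell, if any
    versions = [(ts, v) for (r, c, ts), v in cells.items() if r == row and c == column]
    return max(versions)[1] if versions else None

put("row1", "info:name", "Alice", ts=1)
put("row1", "info:name", "Alicia", ts=2)   # newer version of the same cell
print(get_latest("row1", "info:name"))     # Alicia
```

Keeping old versions around is what makes HBase's flexible schema and time-based reads possible; real clusters bound how many versions each column family retains.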

Note – HBase is extensively used for online analytical operations, like in banking
applications such as real-time data updates in ATM machines, HBase can be used.
Document Store Internals
A document data model is quite different from other data models because it stores data in JSON, BSON, or XML documents. In this data model we can nest documents under other documents, and particular elements can be indexed to make queries run faster. Documents are often stored and retrieved in a form close to the data objects used in applications, which means very few translations are required to use the data in an application. JSON is a native format often used both to store and to query data. In the document data model each document holds key-value pairs; below is an example:
{
  "Name": "Yashodhra",
  "Address": "Near Patel Nagar",
  "Email": "[email protected]",
  "Contact": "12345"
}

Working of Document Data Model:


This data model works as a semi-structured data model: a record and the data associated with it are stored in a single document, which means the model is not completely unstructured. The key point is that the data is stored in documents.

Features:
● Document Type Model: As we all know data is stored in documents rather than
tables or graphs, so it becomes easy to map things in many programming
languages.
● Flexible Schema: The schema is very flexible; not all documents in a collection need to have the same fields.
● Distributed and Resilient: Document data models are highly distributed, which enables horizontal scaling and distribution of data.
● Manageable Query Language: The query language allows developers to perform CRUD (Create, Read, Update, Delete) operations on the data model.
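The CRUD operations that a document store's query language exposes can be sketched with an in-memory collection of schemaless dicts. The function names are invented for illustration and are not any particular database's API.

```python
import uuid

collection = {}  # a "collection" of schemaless JSON-like documents

def create(doc: dict) -> str:
    # Every document gets a unique id, like MongoDB's _id
    doc_id = str(uuid.uuid4())
    collection[doc_id] = doc
    return doc_id

def read(doc_id):
    return collection.get(doc_id)

def update(doc_id, fields: dict):
    # Documents need not share a schema, so any fields may be added
    collection[doc_id].update(fields)

def delete(doc_id):
    collection.pop(doc_id, None)

doc_id = create({"Name": "Yashodhra", "Contact": "12345"})
update(doc_id, {"City": "Patel Nagar"})   # add a field other docs may lack
print(read(doc_id)["City"])               # Patel Nagar
delete(doc_id)
print(read(doc_id))                       # None
```

Note that the update adds a field the original document never declared, which is the flexible-schema property described above.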

Examples of Document Data Models :


● Amazon DocumentDB
● MongoDB
● Cosmos DB
● ArangoDB
● Couchbase Server
● CouchDB

Advantages:
● Schema-less: These databases are very good at retaining data at massive volumes because there are no restrictions on the format and structure of the stored data.
● Faster creation and maintenance of documents: It is very simple to create a document, and beyond that, maintenance requires almost nothing.
● Open formats: It has a very simple build process that uses XML, JSON, and its
other forms.
● Built-in versioning: As documents grow in size they may also grow in complexity; versioning decreases conflicts.

Disadvantages:
● Weak Atomicity: It lacks support for multi-document ACID transactions. A change in the document data model involving two collections requires two separate queries, one for each collection. This is where it breaks atomicity requirements.
● Consistency Check Limitations: One can search collections and documents that are not connected to an author collection, but doing so might degrade database performance.
● Security: Nowadays many web applications lack security which in turn results in
the leakage of sensitive data. So it becomes a point of concern, one must pay
attention to web app vulnerabilities.

Applications of Document Data Model :


● Content Management: These data models are widely used for video streaming platforms, blogs, and similar services, because each item is stored as a single document and the database is easier to maintain as the service evolves over time.
● Book Database: These are very useful for building book databases, because this data model lets us nest related data within a document.
● Catalog: These data models are widely used for storing and reading catalog files because of their fast read performance, even when catalogs have thousands of attributes stored.
● Analytics Platform: These data models are very much used in the Analytics
Platform.

Understanding Key/Value Store


A key-value data model or database is also referred to as a key-value store. It is a non-relational type of database in which an associative array serves as the basic structure: each individual key is linked with just one value in a collection. Keys are unique identifiers for the values, and a value can be any kind of entity. A collection of key-value pairs stored as separate records, without a predefined structure, constitutes a key-value database.

How do key-value databases work?


A key-value database associates a key with a value, which may be anything from a simple string to a complex entity, and uses the key to look that entity up. As in many programming paradigms, a key-value database resembles a map, array, or dictionary object; however, it is stored in a persistent manner and managed by a DBMS.
A key-value store uses an efficient and compact index structure so that it can quickly and reliably find a value by its key. For example, Redis is a key-value store used to track lists, maps, heaps, and primitive types (simple data structures) in a persistent database. By supporting only a predetermined set of value types, Redis can expose a very simple interface for querying and manipulating values and, when properly configured, is capable of high throughput.
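The basic key-value contract (one opaque value per unique key, looked up only by key) can be sketched with a dict-backed class. Real stores such as Redis add persistence, rich value types, and networking on top of this idea; the class and method names here are illustrative.

```python
class KeyValueStore:
    """A toy key-value store: one opaque value per unique key."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value          # the value may be any kind of entity

    def get(self, key, default=None):
        # Lookups are only ever by exact key; there is no query language
        return self._data.get(key, default)

    def delete(self, key):
        return self._data.pop(key, None) is not None

store = KeyValueStore()
store.set("user:42:name", "Alice")       # keys are often namespaced strings
print(store.get("user:42:name"))         # Alice
print(store.delete("user:42:name"))      # True
print(store.get("user:42:name", "?"))    # ?
```

The `get` method makes the main disadvantage visible: without the key there is no way to find the value, which is why key-value stores have no general query language.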

When to use a key-value database:


Here are a few situations in which you can use a key-value database:-

● User session attributes in an online app like finance or gaming, which is referred
to as real-time random data access.
● Caching mechanism for repeatedly accessing data or key-based design.
● The application is developed on queries that are based on keys.
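The session and caching use cases above can be sketched as a small TTL cache: each stored attribute expires after a fixed time, as a real caching layer would arrange. The class and parameter names are illustrative, not a real product's API.

```python
import time

class SessionCache:
    """Toy cache of session attributes that expire after `ttl` seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store = {}   # key -> (expiry_time, value)

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() > expiry:   # lazily evict expired entries on read
            del self._store[key]
            return None
        return value

cache = SessionCache(ttl=0.05)
cache.set("session:abc", {"user": "alice", "cart": 3})
print(cache.get("session:abc")["user"])   # alice
time.sleep(0.1)
print(cache.get("session:abc"))           # None (expired)
```

Real systems like Redis and Memcached implement exactly this expire-on-TTL behavior, but with eviction policies and memory limits layered on top.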

Features:
● One of the simplest kinds of NoSQL data models.
● For storing, getting, and removing data, key-value databases utilize simple
functions.
● Querying language is not present in key-value databases.
● Built-in redundancy makes this database more reliable.

Advantages:
● It is very easy to use. Due to the simplicity of the database, values can be of any kind, or even of different kinds when required.
● Its response time is fast due to its simplicity, provided the surrounding environment is well built and optimized.
● Key-value store databases are scalable vertically as well as horizontally.
● Built-in redundancy makes this database more reliable.

Disadvantages:
● As a querying language is not present in key-value databases, queries cannot be transported from one database to a different database.
● The key-value store database is not refined; you cannot query the database without a key.
Some examples of key-value databases:
Here are some popular key-value databases which are widely used:

● Couchbase: It permits SQL-style querying and searching for text.


● Amazon DynamoDB: The key-value database which is mostly used is Amazon
DynamoDB as it is a trusted database used by a large number of users. It can
easily handle a large number of requests every day and it also provides various
security options.
● Riak: It is a distributed key-value database used to develop applications.
● Aerospike: It is an open-source, real-time database that handles billions of transactions.
● Berkeley DB: It is a high-performance and open-source database providing
scalability.
Difference Between Memcached and Redis
1. Redis:
Redis is an open-source, key-value, NoSQL database. It is an in-memory data structure store that serves all data from memory and uses disk for persistence. It offers a unique data model and high performance, supporting data structures such as strings, lists, sets, and hashes, and it can be used as a database, cache, or message broker. It is also called a Data Structure Server. It does not support an RDBMS schema, SQL, or ACID transactions.
2. Memcached:
Memcached is a simple, open-source, in-memory caching system that can be used as temporary in-memory data storage. The stored data has high read and write performance, and data is distributed across multiple servers. It stores string values against string keys, and client APIs are available for all major languages. Memcached is very efficient for websites.
Difference between Redis and Memcached –

● Data types: Redis supports strings, lists, sets, and hashes; Memcached stores plain string values only.
● Persistence: Redis can persist data to disk; Memcached is purely in-memory.
● Role: Redis can serve as a cache, database, or message broker; Memcached is purely a cache.
Eventual Consistency in Non-Relational Databases
1. Eventual Consistency: Eventual consistency is a consistency model that enables the data store to be highly available. It is also known as optimistic replication and is key to distributed systems. So, how exactly does it work? Let's understand this with the help of a use case.

Real World Use Case :


● Think of a popular microblogging site deployed across the world in different
geographical regions like Asia, America, and Europe. Moreover, each
geographical region has multiple data center zones: North, East, West, and South.
● Furthermore, each zone has multiple clusters, each running multiple server nodes. So we have many datastore nodes spread across the world that the micro-blogging site uses for persisting data. Since there are so many nodes running, there is no single point of failure.
● The data store service is highly available. Even if a few nodes go down, the persistence service is still up. Let's say a celebrity makes a post on the website that everybody around the world starts liking.
● At a point in time, a user in Japan likes a post which increases the “Like” count of
the post from say 100 to 101. At the same point in time, a user in America, in a
different geographical zone, clicks on the post, and he sees “Like” count as 100,
not 101.

Reason for the above Use case :


● Simply, because the new updated value of the Post “Like” counter needs some
time to move from Japan to America and update server nodes running there.
Though the value of the counter at that point in time was 101, the user in America
sees old inconsistent values.
● But when he refreshes his web page after a few seconds “Like” counter value
shows as 101. So, data was initially inconsistent but eventually got consistent
across server nodes deployed around the world. This is what eventual consistency
is.
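The like-counter scenario can be simulated with two replicas and an asynchronous replication log: a write is applied locally at once, while the other region keeps serving the old value until replication runs. This is a toy model of eventual consistency, not a real replication protocol.

```python
class Replica:
    def __init__(self):
        self.likes = 100   # both regions start with the same count

replicas = {"japan": Replica(), "america": Replica()}
replication_log = []       # updates waiting to propagate to other regions

def like(region):
    # Apply locally right away; ship the update to other regions later
    replicas[region].likes += 1
    replication_log.append((replicas[region].likes, region))

def replicate():
    # "Eventually", pending updates reach every other replica
    while replication_log:
        value, origin = replication_log.pop(0)
        for name, rep in replicas.items():
            if name != origin:
                rep.likes = max(rep.likes, value)   # simple highest-value merge

like("japan")
print(replicas["japan"].likes)     # 101 (local read, immediately up to date)
print(replicas["america"].likes)   # 100 (stale read before replication)
replicate()
print(replicas["america"].likes)   # 101 (eventually consistent)
```

The window between `like("japan")` and `replicate()` is exactly the period during which the user in America sees the old count of 100.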

2. Strong Consistency: Strong Consistency simply means the data must be


strongly consistent at all times. All the server nodes across the world should
contain the same value as an entity at any point in time. And the only way to
implement this behavior is by locking down the nodes when being updated.
Real World Use Case:
● Let’s continue the same Eventual Consistency example from the previous lesson.
To ensure Strong Consistency in the system, when a user in Japan likes posts, all
nodes across different geographical zones must be locked down to prevent any
concurrent updates.
● This means at one point in time, only one user can update the post “Like” counter
value. So, once a user in Japan updates the “Like” counter from 100 to 101. The
value gets replicated globally across all nodes. Once all nodes reach consensus,
locks get lifted. Now, other users can Like posts.
● If the nodes take a while to reach a consensus, they must wait until then. Well, this
is surely not desired in the case of social applications. But think of a stock market
application where the users are seeing different prices of the same stock at one
point in time and updating it concurrently. This would create chaos. Therefore, to
avoid this confusion we need our systems to be Strongly Consistent.
● The nodes must be locked down for updates. Queuing all requests is one good way of making a system strongly consistent. The strong consistency model limits the system's ability to be highly available and to perform concurrent updates. This is how strongly consistent ACID transactions are implemented.
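The lock-down behavior described above can be sketched with a single global lock standing in for cluster-wide coordination: no read or write proceeds while an update is in flight, so replicas can never be observed to diverge. This is a toy model, not a real consensus protocol.

```python
import threading

replicas = {"japan": 100, "america": 100}
global_lock = threading.Lock()   # stands in for a cluster-wide lock/consensus

def like_strongly_consistent():
    # No reader or writer proceeds until every replica holds the new value
    with global_lock:
        new_value = replicas["japan"] + 1
        for region in replicas:
            replicas[region] = new_value

def read(region):
    with global_lock:            # reads also wait for in-flight updates
        return replicas[region]

like_strongly_consistent()
print(read("japan"), read("america"))   # 101 101 — the replicas never diverge
```

The price is visible in the lock itself: while one update holds it, every other reader and writer in the world waits, which is precisely the availability cost the text describes.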
ACID Transaction Support :
Distributed systems like NoSQL databases which scale horizontally on the fly don’t
support ACID transactions globally & this is due to their design. The whole reason
for the development of NoSQL tech is the ability to be Highly Available and
Scalable. If we must lock down nodes every time, it becomes just like SQL. So,
NoSQL databases don’t support ACID transactions and those that claim to, have
terms and conditions applied to them. Generally, transaction support is limited to a
geographic zone or an entity hierarchy. Developers of tech make sure that all the
Strongly consistent entity nodes reside in the same geographic zone to make ACID
transactions possible.

Conclusion: For transactional things go for MySQL because it provides a lock-in


feature and supports ACID transactions.

You might also like