1,2,3 Units
UNIT-1
Definition:
NoSQL is a type of database management system (DBMS) that is designed to handle
and store large volumes of unstructured and semi-structured data. Unlike traditional
relational databases that use tables with pre-defined schemas to store data, NoSQL
databases use flexible data models that can adapt to changes in data structures and
are capable of scaling horizontally to handle growing amounts of data.
1.Document-Based Database:
The document-based database is a nonrelational database. Instead of storing the data
in rows and columns (tables), it uses documents to store the data in the database.
A document database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data
objects used in applications, which means less translation is required to use the data
in an application. In a document database, particular elements can be indexed for
faster querying.
Collections are groups of documents with similar contents. Documents in a
collection do not all need to follow the same schema, because document databases
have a flexible schema.
EX: MongoDB, CouchDB
2. Key-Value Stores:
A key-value store is a nonrelational database. The simplest form of a NoSQL
database is a key-value store. Every data element in the database is stored in key-
value pairs. The data can be retrieved by using a unique key allotted to each element
in the database. The values can be simple data types like strings and numbers or
complex objects.
A key-value store is like a relational database with only two columns: the key and
the value.
EX: Redis, Amazon DynamoDB
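The two-column analogy above can be sketched with an in-memory Python dict standing in for a key-value store. This is a deliberate simplification: real stores like Redis add persistence, replication, and network access on top of the same idea.

```python
# A minimal in-memory stand-in for a key-value store.
# Real key-value databases (e.g. Redis) add persistence,
# replication, and network access on top of this idea.
store = {}

def put(key, value):
    """Store a value under a unique key."""
    store[key] = value

def get(key):
    """Retrieve a value by its key; returns None if absent."""
    return store.get(key)

# Values can be simple types (strings, numbers) or complex objects.
put("user:1", "Alice")
put("user:2", {"name": "Bob", "age": 30})

print(get("user:1"))          # Alice
print(get("user:2")["name"])  # Bob
```

Every lookup goes through the unique key; there is no secondary querying, which mirrors the limitation noted later for key-value databases.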
3. Graph-Based Databases:
Graph-based databases focus on the relationships between elements. They store the
data in the form of nodes in the database, and the connections between the nodes are
called links or relationships.
EX: Neo4j
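The node-and-relationship model can be sketched in Python with a dict of nodes and a list of edges. This is a toy illustration only; a real graph database such as Neo4j adds indexing, storage, and a query language on top, and the node and relationship names here are made up.

```python
# A toy graph: nodes keyed by id, edges as (source, relationship, target).
nodes = {
    "alice": {"type": "Person"},
    "bob": {"type": "Person"},
    "acme": {"type": "Company"},
}
edges = [
    ("alice", "FRIENDS_WITH", "bob"),
    ("alice", "WORKS_AT", "acme"),
]

def neighbors(node_id, relationship=None):
    """Traverse edges from a node, optionally filtered by relationship type."""
    return [t for s, r, t in edges
            if s == node_id and (relationship is None or r == relationship)]

print(neighbors("alice"))              # ['bob', 'acme']
print(neighbors("alice", "WORKS_AT"))  # ['acme']
```

Traversal follows stored connections directly, which is why graph databases are fast at relationship queries that would need joins in a relational database.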
1. MongoDB
In MongoDB, you can structure your data so that each user document contains
information about their location preference. This can be achieved using embedded
or nested documents within the user document. MongoDB is known for its
flexibility in handling diverse data structures, making it suitable for storing
location preferences in a user-friendly manner.
2. Cassandra
Cassandra is a highly scalable, distributed NoSQL database with a decentralized
architecture, designed for high availability, fault tolerance, and horizontal
scalability. It uses a column-family data model, features the Cassandra Query
Language (CQL), and is suitable for applications requiring massive data storage
and processing, such as real-time analytics and sensor data.
3. Amazon DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service provided by
AWS. It offers seamless scalability, high performance, and low-latency access to
data. DynamoDB uses a key-value and document data model, providing
flexibility for diverse data types. It is designed for high availability and
durability, with automatic scaling capabilities. DynamoDB is suitable for
applications requiring fast and consistent performance, such as web and mobile
applications, gaming, and IoT.
4. Redis
Redis is an open-source, in-memory data structure store. It functions as a versatile
key-value database, supporting various data structures like strings, hashes, lists,
sets, and more. Known for its exceptional speed, Redis is often used as a
caching layer or message broker. Its simplicity and efficiency make it popular for
real-time applications, session management, and other use cases requiring fast
data access. Redis also offers features like replication and clustering for
scalability and high availability.
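The car-make example discussed below might look like the following document. This is a hedged reconstruction, since the original snippet is not shown in the notes; the make, model names, and year values are assumptions, written as a Python dict that maps one-to-one onto a JSON/BSON document.

```python
# A hypothetical car-make document of the shape described below
# (field values are assumptions, not from the original notes).
car_make = {
    "make": "Toyota",
    "models": [
        {"name": "Corolla", "year_range": {"start": 1966, "end": 2024}},
        {"name": "Camry",   "year_range": {"start": 1982, "end": 2024}},
    ],
}

# All models for a given make are just the embedded "models" array:
model_names = [m["name"] for m in car_make["models"]]
print(model_names)  # ['Corolla', 'Camry']
```

Because the models are nested inside the make document, one read retrieves the whole structure with no joins.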
In this example:
- Each document in the collection represents a car make.
- The "make" field stores the name of the car make.
- The "models" field is an array of sub-documents, each representing a specific
model.
- Each model sub-document has a "name" field for the model name and a
"year_range" field to specify the range of years that the model was produced.
This structure allows for easy retrieval of all models for a given make or details for a
specific model within a make.
In MongoDB, you could use the official drivers (like pymongo for Python) to
interact with the database and perform operations such as inserting, updating, and
querying data.
Remember to adjust the schema based on your specific use case and requirements. If
you have additional data or relationships, you may need to expand the schema
accordingly.
This is just a basic example with Python and MongoDB. Depending on the NoSQL
database you choose, the code structure and syntax will vary, but the general
principles of connecting, inserting, querying, updating, and deleting data will be
similar. Always refer to the documentation for the specific NoSQL database and
programming language you are using.
ADVANTAGES OF NOSQL
There are many advantages of working with NoSQL databases such as MongoDB
and Cassandra. The main advantages are high scalability and high availability.
1. Scalability: NoSQL databases are highly scalable, which means that they can
handle large amounts of data and traffic with ease. This makes them a good fit for
applications that need to handle large amounts of data or traffic.
2. Performance: NoSQL databases are designed to handle large amounts of data and
traffic, which means that they can offer improved performance compared to
traditional relational databases.
3. Flexible Schema:
NoSQL databases, unlike traditional relational databases, often allow for a flexible
or schema-less data model. This flexibility is beneficial when dealing with diverse
and evolving data structures, as there is no need to predefine a rigid schema for the
entire database.
4. Horizontal Scalability:
NoSQL databases can easily scale out by adding more servers to a distributed
system. This is in contrast to some traditional relational databases, which may face
scalability challenges when trying to handle large amounts of data or increased
concurrent users.
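Scaling out by adding servers is typically implemented by sharding: each key is hashed to decide which node stores it. The sketch below illustrates the idea only; the node names are made up, and production systems usually prefer consistent hashing so that adding a node moves only a fraction of the keys.

```python
import hashlib

# Hypothetical node names for illustration.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key):
    """Map a key to one of the nodes by hashing it.

    Note: simple modulo hashing reshuffles most keys when NODES
    changes; real systems use consistent hashing to avoid that."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every key deterministically lands on exactly one node.
print(node_for("user:42") in NODES)                 # True
print(node_for("user:42") == node_for("user:42"))   # True
```

With this scheme, adding capacity means adding entries to the node list and redistributing data, which is what "scaling out" refers to above.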
It's important to note that the choice of using a NoSQL database should be based on
the specific requirements of the application and the nature of the data rather than a
one-size-fits-all approach. Different types of NoSQL databases may be more suitable
for different use cases.
Create Operations
The create or insert operations are used to insert or add new documents to a
collection. If the collection does not exist, it will be created in the database. You can
perform create operations using the following methods provided by MongoDB:
db.collection.insertOne() and db.collection.insertMany().
Example: In this example, we are inserting the details of a single student in the form
of a document in the student collection using the db.collection.insertOne() method.
Read Operations
The read operations are used to retrieve documents from a collection; in other
words, read operations are used to query a collection for documents. You can
perform read operations using the following method provided by MongoDB:
db.collection.find().
.pretty(): This method is used to format the result so that it is easy to read.
Example: In this example, we are retrieving the details of students from the student
collection using the db.collection.find() method.
Update Operations
The update operations are used to update or modify an existing document in the
collection. You can perform update operations using the following methods provided
by MongoDB: db.collection.updateOne(), db.collection.updateMany(), and
db.collection.replaceOne().
Example: In this example, we are updating the age of Sumit in the student
collection using the db.collection.updateOne() method.
Delete Operations
The delete operations are used to delete or remove documents from a collection.
You can perform delete operations using the following methods provided by
MongoDB: db.collection.deleteOne() and db.collection.deleteMany().
Example: In this example, we are deleting all the documents from the student
collection using the db.collection.deleteMany() method.
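Since the shell snippets for the examples above are not shown in the notes, here is a hedged Python sketch that mimics the semantics of insertOne, find, updateOne, and deleteMany on an in-memory list of documents. The Collection class and the student fields are illustrative stand-ins, not MongoDB's actual implementation.

```python
# An in-memory stand-in for a MongoDB collection, mimicking the
# semantics of insertOne / find / updateOne / deleteMany.
class Collection:
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        """Add one document (like db.collection.insertOne)."""
        self.docs.append(dict(doc))

    def find(self, query=None):
        """Return documents whose fields match the query exactly."""
        query = query or {}
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

    def update_one(self, query, new_values):
        """Modify the first matching document in place."""
        for d in self.find(query)[:1]:
            d.update(new_values)

    def delete_many(self, query=None):
        """Remove every matching document ({} matches all)."""
        self.docs = [d for d in self.docs if d not in self.find(query)]

student = Collection()
student.insert_one({"name": "Sumit", "age": 20})    # create
print(student.find({"name": "Sumit"}))              # read
student.update_one({"name": "Sumit"}, {"age": 21})  # update
student.delete_many({})                             # delete all
print(student.docs)  # []
```

With a real server, the same four operations would be issued through a driver such as pymongo, as the notes mention, using the corresponding insert_one/find/update_one/delete_many calls.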
***
UNIT 3
NoSQL Storage Architecture
NoSQL Architecture
An architecture pattern is a logical way of categorizing the data that will be stored in
a database. NoSQL is a type of database that helps to perform operations on big
data and store it in a valid format. It is widely used because of its flexibility and the
wide variety of services it supports.
● Bigtable by Google
● Cassandra
3. Document Database:
The document database fetches and accumulates data in the form of key-value
pairs, but here the values are called documents. A document can be seen as a
complex data structure: it can take the form of text, arrays, strings, JSON, XML,
or any similar format. The use of nested documents is also very common. This
model is very effective because most of the data created today is unstructured and
usually in the form of JSON.
Advantages:
● This type of format is very useful and apt for semi-structured data.
● Storage, retrieval, and management of documents are easy.
Limitations:
● Handling multiple documents is challenging
● Aggregation operations may not work accurately.
Examples:
● MongoDB
● CouchDB
4. Graph Databases:
This architecture pattern deals with the storage and management of data in
graphs. Graphs are structures that depict connections between two or more
objects in some data. The objects or entities are called nodes and are joined
together by relationships called edges. Each edge has a unique identifier. Each
node serves as a point of contact for the graph. This pattern is very commonly
used in social networks, where there are a large number of entities and each
entity has one or more characteristics that are connected by edges. The relational
database pattern has tables that are loosely connected, whereas the connections
in a graph are explicit and strong in nature.
Advantages:
● Fastest traversal because of connections.
● Spatial data can be easily handled.
Limitations:
● Wrong connections may lead to infinite loops.
Examples:
● Neo4J
The Row Key is exactly that: the specific identifier of that row, and it is always
unique. The column contains the name, value, and timestamp. The name/value pair
is straightforward, and the timestamp is the date and time the data was entered into
the database. Some examples of column-store databases include Cassandra,
Cosmos DB, Bigtable, and HBase.
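The row-key / column / timestamp layout described above can be sketched as nested Python dicts. This is a simplification of what stores like Cassandra or HBase do internally, and the row key and column names are made-up examples.

```python
import time

# One row in a column store: a unique row key mapping to columns,
# where each column holds a value plus the timestamp it was written.
row = {
    "row_key": "user#1001",
    "columns": {
        "name": {"value": "Alice", "timestamp": 1700000000},
        "city": {"value": "Pune",  "timestamp": 1700000050},
    },
}

def write_column(row, name, value):
    """Write a name/value pair, recording the current time as the timestamp."""
    row["columns"][name] = {"value": value, "timestamp": int(time.time())}

write_column(row, "email", "alice@example.com")
print(sorted(row["columns"]))  # ['city', 'email', 'name']
```

Note how rows in this model need not share the same set of columns, which is part of the flexibility column stores offer over fixed relational schemas.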
What is Hadoop?
Hadoop is an open-source software framework for storing large amounts of data
and processing it in a distributed computing environment. Its framework is based on
Java, with some native code in C and shell scripts. It is designed to handle big data
and is based on the MapReduce programming model, which allows for the parallel
processing of large datasets.
Hadoop has two main components:
● HDFS (Hadoop Distributed File System): This is the storage component of
Hadoop, which allows for the storage of large amounts of data across multiple
machines. It is designed to work with commodity hardware, which makes it cost-
effective.
● YARN (Yet Another Resource Negotiator): This is the resource management
component of Hadoop, which manages the allocation of resources (such as CPU
and memory) for processing the data stored in HDFS.
● Hadoop also includes several modules that provide additional functionality,
such as Hive (a SQL-like query language), Pig (a high-level platform for
creating MapReduce programs), and HBase (a non-relational, distributed
database).
● Hadoop is commonly used in big data scenarios such as data warehousing,
business intelligence, and machine learning. It’s also used for data processing,
data analysis, and data mining. It enables the distributed processing of large data
sets across clusters of computers using a simple programming model.
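The MapReduce model mentioned above can be illustrated in pure Python with the classic word count: a map step emits (word, 1) pairs and a reduce step groups them by key and sums each group. This is a single-process sketch; Hadoop runs the same idea in parallel across a cluster, with HDFS supplying the input splits.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Shuffle + reduce step: group pairs by key and sum counts per key."""
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

lines = ["big data big clusters", "big ideas"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(pairs)
print(counts["big"])  # 3
```

In Hadoop, each input split would be mapped on a different node and the framework would shuffle the intermediate pairs to the reducers, but the programmer only writes the two functions above.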
History of Hadoop
Hadoop was developed by the Apache Software Foundation; its co-founders are
Doug Cutting and Mike Cafarella. Co-founder Doug Cutting named it after his
son's toy elephant. In October 2003, Google released its first related paper, on the
Google File System. In January 2006, MapReduce development started on Apache
Nutch, which consisted of around 6,000 lines of code for MapReduce and around
5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data. It
was created by the Apache Software Foundation in 2006, based on papers published
by Google that described the Google File System (GFS, 2003) and the MapReduce
programming model (2004). The Hadoop framework allows for the distributed
processing of large data sets across clusters of computers using simple
programming models. It is
designed to scale up from single servers to thousands of machines, each offering
local computation and storage. It is used by many organizations, including Yahoo,
Facebook, and IBM, for a variety of purposes such as data warehousing, log
processing, and research. Hadoop has been widely adopted in the industry and has
become a key technology for big data processing.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It has huge, flexible storage.
5. It is low cost.
Advantages of Hadoop
● Scalability: Hadoop can easily scale to handle large amounts of data by adding
more nodes to the cluster.
● Cost-effective: Hadoop is designed to work with commodity hardware, which
makes it a cost-effective option for storing and processing large amounts of data.
● Fault-tolerance: Hadoop’s distributed architecture provides built-in fault-
tolerance, which means that if one node in the cluster goes down, the data can still
be processed by the other nodes.
● Flexibility: Hadoop can process structured, semi-structured, and unstructured
data, which makes it a versatile option for a wide range of big data scenarios.
● Open-source: Hadoop is open-source software, which means that it is free to use
and modify. This also allows developers to access the source code and make
improvements or add new features.
● Large community: Hadoop has a large and active community of developers and
users who contribute to the development of the software, provide support, and
share best practices.
● Integration: Hadoop is designed to work with other big data technologies such as
Spark, Storm, and Flink, which allows for integration with a wide range of data
processing and analysis tools.
Disadvantages:
● Not very effective for small data.
● Hard cluster management.
● Has stability issues.
● Security concerns.
● Complexity: Hadoop can be complex to set up and maintain, especially for
organizations without a dedicated team of experts.
● Latency: Hadoop is not well-suited for low-latency workloads and may not be the
best choice for real-time data processing.
● Limited Support for Real-time Processing: Hadoop’s batch-oriented nature
makes it less suited for real-time streaming or interactive data processing use
cases.
● Limited Support for Structured Data: Hadoop is designed to work with
unstructured and semi-structured data; it is not well-suited for structured data
processing.
● Data Security: Hadoop does not provide built-in security features such as data
encryption or user authentication, which can make it difficult to secure sensitive
data.
● Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming
model is not well-suited for ad-hoc queries, making it difficult to perform
exploratory data analysis.
● Limited Support for Graph and Machine Learning: Hadoop's core components
HDFS and MapReduce are not well-suited for graph and machine learning
workloads; specialized components like Apache Giraph and Mahout are available
but have some limitations.
● Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
● Data Loss: In the event of a hardware failure, the data stored in a single node may
be lost permanently.
● Data Governance: Data governance is a critical aspect of data management, and
Hadoop does not provide built-in features to manage data lineage, data quality,
data cataloging, and data auditing.
HBase Architecture
HBase architecture has 3 main components: HMaster, Region Server, Zookeeper.
Figure – Architecture of HBase
All the 3 components are described below:
1. HMaster –
The implementation of the Master Server in HBase is HMaster. It is the process
that assigns regions to region servers and performs DDL (create, delete table)
operations. It monitors all Region Server instances present in the cluster. In a
distributed environment, the master runs several background threads. HMaster
has many features, like controlling load balancing and failover.
2. Region Server –
HBase tables are divided horizontally by row key range into regions. Regions are
the basic building blocks of an HBase cluster; each holds a portion of a table's
data and is composed of column families. A Region Server runs on an HDFS
DataNode present in the Hadoop cluster and is responsible for handling,
managing, and executing read and write operations on its set of regions. The
default size of a region is 256 MB.
3. Zookeeper –
ZooKeeper acts as a coordinator in HBase. It provides services like maintaining
configuration information, naming, distributed synchronization, and server
failure notification. Clients locate region servers via ZooKeeper.
Advantages of HBase –
1. Can store large data sets
2. Database can be shared
3. Cost-effective from gigabytes to petabytes
4. High availability through failover and replication
Disadvantages of HBase
1. No support for SQL queries
2. No transaction support
3. Sorted only on key
4. Memory issues on the cluster
Note – HBase is extensively used for online analytical operations; for example, in
banking applications it can be used for real-time data updates in ATM machines.
Document Store Internals
A document data model is quite different from other data models because it stores
data in JSON, BSON, or XML documents. In this data model, documents can be
nested under other documents, and particular elements can be indexed to make
queries run faster. Documents are often stored and retrieved in a form that is close
to the data objects used in applications, which means less translation is required to
use the data in an application. JSON is also the format most often used to store and
query the data. In the document data model, each document holds key-value pairs;
below is an example of the same.
{
"Name" : "Yashodhra",
"Address" : "Near Patel Nagar",
"Email" : "[email protected]",
"Contact" : "12345"
}
Features:
● Document Type Model: Because data is stored in documents rather than tables
or graphs, it is easy to map documents to objects in many programming
languages.
● Flexible Schema: The schema is very flexible; not all documents in a collection
need to have the same fields.
● Distributed and Resilient: Document databases are easily distributed, which
enables horizontal scaling and distribution of data.
● Manageable Query Language: These databases provide a query language that
allows developers to perform CRUD (Create, Read, Update, Delete) operations
on the data model.
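The flexible-schema point above can be shown concretely: documents in the same collection may carry different fields, so queries have to tolerate missing ones. The field names and values in this sketch are assumptions for illustration.

```python
# Two documents in the same collection with different fields —
# no predefined schema is required in a document database.
users = [
    {"name": "Yashodhra", "email": "y@example.com"},
    {"name": "Ravi", "contact": "98765", "address": "Patel Nagar"},
]

# Queries must therefore tolerate missing fields:
with_email = [u["name"] for u in users if "email" in u]
print(with_email)  # ['Yashodhra']
```

A relational table would force both rows into one column layout; here each document simply carries the fields it needs.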
Advantages:
● Schema-less: These databases are very good at retaining existing data at massive
volumes because there are no restrictions on the format or structure of data
storage.
● Faster creation and maintenance of documents: It is very simple to create a
document, and maintenance requires almost no effort.
● Open formats: Documents are built from simple, open formats such as XML and
JSON.
● Built-in versioning: As documents grow in size, they may also grow in
complexity; versioning decreases conflicts.
Disadvantages:
● Weak Atomicity: It lacks support for multi-document ACID transactions. A
change in the document data model involving two collections requires running
two separate queries, one for each collection; this is where it breaks atomicity
requirements.
● Consistency Check Limitations: One can search collections and documents that
are not connected to an author collection, but doing so may degrade database
performance.
● Security: Nowadays many web applications lack security which in turn results in
the leakage of sensitive data. So it becomes a point of concern, one must pay
attention to web app vulnerabilities.
Key-Value Store Internals
Typical use cases of a key-value store include:
● User session attributes in an online app like finance or gaming, which is referred
to as real-time random data access.
● Caching mechanism for repeatedly accessing data or key-based design.
● The application is developed on queries that are based on keys.
Features:
● One of the simplest kinds of NoSQL data models.
● For storing, getting, and removing data, key-value databases utilize simple
functions.
● Querying language is not present in key-value databases.
● Built-in redundancy makes this database more reliable.
Advantages:
● It is very easy to use. Due to the simplicity of the database, it can accept any
kind of data, or even different kinds, when required.
● Its response time is fast due to its simplicity, provided the rest of the
environment around it is well built and optimized.
● Key-value store databases are scalable vertically as well as horizontally.
● Built-in redundancy makes this database more reliable.
Disadvantages:
● As no query language is present in key-value databases, queries cannot be
ported from one database to another.
● The key-value store database is not refined. You cannot query the database
without a key.
Some examples of key-value databases:
Here are some popular key-value databases that are widely used: Redis, Amazon
DynamoDB, Memcached, and Riak.