Unit 1
What is NoSQL?
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term has since evolved
to mean “not only SQL,” as NoSQL databases have expanded to include a wide range of different database
architectures and data models.
NoSQL is a type of database management system (DBMS) that is designed to handle and store large volumes of
unstructured and semi-structured data.
Unlike traditional relational databases that use tables with pre-defined schemas to store data, NoSQL
databases use flexible data models that can adapt to changes in data structures and are capable of scaling
horizontally to handle growing amounts of data.
A database is one of the most important components of any technology stack or application.
Data usually needs to be stored in a specific structure.
However, there are situations where data is not in a structured format.
NoSQL databases are well known for handling such data.
Types of Database
• Centralized database
• Cloud database
• Commercial database
• Distributed database
• End-user database
• Object-oriented database
• NoSQL database
• Open-source database
• Operational database
• Personal database
• Relational database
These are a few of the main types of database.
The type used most often is the relational database, which is queried with SQL; MySQL is one example.
• There are multiple clients (systems) but only one server.
• All of the clients connect to that single server.
• If the database server has a problem or error, every client is affected as well.
• Scale up means there is one node, and all upgrades are made to that one node.
• For example: adding more CPU or RAM.
• It is also called vertical scaling.
Scale out
• It is the opposite of scale up.
• Instead of upgrading one node, more nodes or servers are added; this is called scale out.
• It is also called horizontal scaling.
Document databases: These databases store data as semi-structured documents, such as JSON or XML,
and can be queried using document-oriented query languages.
Key-value stores: These databases store data as key-value pairs, and are optimized for simple and fast
read/write operations.
Column-family stores: These databases store data as column families, which are sets of columns that
are treated as a single entity. They are optimized for fast and efficient querying of large amounts of data.
Graph databases: These databases store data as nodes and edges, and are designed to handle complex
relationships between data.
Advantages of NoSQL
• High scalability
• Flexibility
• High availability
• Scalability
• Performance
• Cost-effectiveness
• Agility
Disadvantages of NoSQL
• Lack of standardization
• Lack of ACID compliance
• Open-source
• Lack of support for complex queries
• Lack of maturity
• Management challenge
• GUI is not available
• Backup
• Large document size
Data Base Revolutions
1. First Generation
2. Second Generation
3. Third Generation
First Generation
The first generation of the database revolution is the relational database.
Data is organized in rows and columns.
Data is stored in tables.
It is based on the mathematical concept of set theory and uses Structured Query Language (SQL).
The relational model uses a collection of tables to represent both data and the relationship among those data.
Each table has multiple columns and each table has a unique name.
Tables are also known as relations.
Advantages:
• Data is easy to organize.
• Querying is straightforward.
• It can be used to enforce data integrity.
Disadvantages
• The relational database model is not well suited to very large databases.
• Sometimes it becomes difficult to work out the relationships between tables.
Second Generation
The second generation of the database revolution is the object-oriented database.
Data is organized into objects with attributes and methods.
It is based on the concepts of object-oriented programming.
Object Query Language (OQL) is used for accessing and manipulating the data.
An object-oriented database is a database that stores data as objects.
Objects are similar to files in a file system: each object contains a collection of information.
Advantages:
• More flexible than relational databases.
• It can represent more complicated relationships between data.
• It can be easier to work with object-oriented programming languages.
Third Generation
Advantages
• It can be more scalable than relational databases.
• It can be more suitable for working with large amounts of data.
• It can be more flexible in terms of schema.
Key-Value Pair
The key-value pair database is the simplest NoSQL database and is often used for storing simple data such as
configuration settings.
A key-value pair database is a database that stores data as key-value pairs.
Each pair has a key and a value.
The key identifies the value stored in the database and is used to look it up.
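The idea can be sketched in a few lines of Python. This is only an illustration built on an in-memory dictionary, not the interface of any real key-value product; the class name KeyValueStore and the configuration keys are made up for the example.

```python
class KeyValueStore:
    """A tiny in-memory key-value store (illustrative only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Storing under an existing key overwrites the old value.
        self._data[key] = value

    def get(self, key, default=None):
        # The key is the only way to look a value up.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


# Configuration settings are a typical use case.
store = KeyValueStore()
store.put("app.theme", "dark")
store.put("app.max_connections", 50)
theme = store.get("app.theme")
```

Note that the store never inspects the value: it can be a string, a number or any other object, which is what keeps reads and writes simple and fast.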
Document Data Base
The document database is more complex and is used for storing semi-structured or unstructured data.
A document database is a database that stores data in documents.
Documents are similar to files in a file system: each document contains a collection of information.
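As a rough illustration, documents can be modelled as Python dictionaries that serialize to JSON. The collection, the field names and the find helper below are hypothetical and only mimic the idea of a document query; they are not the API of any particular database.

```python
import json

# Two documents in the same collection with different shapes:
# the second has fields (skills, address) that the first lacks.
users = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ravi", "skills": ["python", "sql"],
     "address": {"city": "Chennai"}},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


matches = find(users, name="Ravi")
serialized = json.dumps(matches[0])  # a document maps directly to JSON text
```

The two documents do not share the same fields, which is the schema flexibility that distinguishes a document database from a table with fixed columns.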
Managing transactions and ensuring data integrity in a NoSQL database can be quite different
from doing so in a traditional relational database.
Examples of NoSQL databases are MongoDB, Cassandra and Couchbase.
For example: a banking system
1. Eventual consistency
2. ACID Transactions in NoSQL
3. Single Document Transactions
4. Data Integrity
Eventual consistency
• It means that when you make changes to data, those changes might not show up everywhere right away.
• The application must be designed with the limitations of eventual consistency in mind.
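A small simulation can make the delay visible. The sketch below assumes a hypothetical store with three replicas in which a write is acknowledged by the first replica and copied to the others only when sync() runs; real systems propagate updates in the background, so the explicit sync() call is a simplification.

```python
class Replica:
    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # updates not yet applied to the other replicas

    def write(self, key, value):
        # The write is acknowledged once the first replica has it.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Propagate queued updates, in order, to the remaining replicas.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("balance", 500)
stale = store.read("balance", replica_index=2)  # not yet propagated
store.sync()
fresh = store.read("balance", replica_index=2)  # now propagated
```

Between the write and the sync, a reader on the third replica sees no value at all. An application built on such a store has to tolerate that window, for example by reading from the replica it wrote to.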
Single Document Transactions
• Many NoSQL databases provide strong consistency and atomicity at the document or row level.
• For example, MongoDB treats each document as an atomic unit.
• This approach works well for use cases where a single document encapsulates all the data that needs to be
manipulated in a transaction.
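The following plain-Python sketch imitates document-level atomicity. It is not MongoDB code: the DocumentStore class, the single lock and the place_order function are assumptions made for the example. The update is applied to a copy, and the stored document is replaced only if the whole update succeeds.

```python
import copy
import threading


class DocumentStore:
    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()  # one lock for simplicity

    def insert(self, doc_id, doc):
        with self._lock:
            self._docs[doc_id] = copy.deepcopy(doc)

    def get(self, doc_id):
        with self._lock:
            return copy.deepcopy(self._docs[doc_id])

    def update(self, doc_id, mutate):
        """Apply mutate() to a copy; swap it in only if mutate() succeeds."""
        with self._lock:
            working = copy.deepcopy(self._docs[doc_id])
            mutate(working)  # may raise; the stored document is then untouched
            self._docs[doc_id] = working


def place_order(account):
    # Two fields of the same document change together.
    account["balance"] -= 100
    if account["balance"] < 0:
        raise ValueError("insufficient funds")
    account["orders"].append({"item": "book", "price": 100})


store = DocumentStore()
store.insert("acct-1", {"balance": 150, "orders": []})
store.update("acct-1", place_order)  # succeeds: balance and orders both change
try:
    store.update("acct-1", place_order)  # fails: neither field changes
except ValueError:
    pass
```

Because the balance and the order list live in one document, no multi-document transaction is needed to keep them in step.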
Data Integrity
• Data modelling: In a NoSQL database, data integrity often comes from designing your data models carefully
to avoid inconsistency.
• This may involve denormalizing data (storing related data together in a single document) to avoid having to
update multiple documents in response to a change.
• Schema flexibility: NoSQL databases are typically schema-less or have flexible schema designs, which can
lead to inconsistencies if not managed well.
• Validation and constraints are defined at the application level or through database mechanisms.
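Application-level validation can be as simple as a function that checks each document before it is written. The sketch below assumes a hypothetical account document with three required fields; the field names and rules are illustrative, not taken from any real schema.

```python
# Required fields and the Python types accepted for each (assumed for the example).
REQUIRED_FIELDS = {"account_id": str, "owner": str, "balance": (int, float)}


def validate_account(doc):
    """Return a list of problems; an empty list means the document is acceptable."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    # A constraint that a relational database might express as a CHECK rule.
    if isinstance(doc.get("balance"), (int, float)) and doc["balance"] < 0:
        problems.append("balance must not be negative")
    return problems


good = validate_account({"account_id": "A1", "owner": "Asha", "balance": 500})
bad = validate_account({"account_id": "A2", "balance": -5})
```

The application would refuse to store any document for which the returned list is not empty.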
Best Practices for managing transactions and data integrity
A transaction is an action, or series of actions, performed by a single user or application program
that reads or updates the contents of the database.
For example: X = 500, Y = 300
T1            T2
Read(X)       Read(Y)
X = X - 100   Y = Y + 100
Write(X)      Write(Y)
Atomicity(A)
The entire transaction takes place at once or doesn’t happen at all.
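A minimal sketch of the all-or-nothing rule, using the X = 500, Y = 300 example: the transfer function below is hypothetical and keeps a snapshot in memory so that any failure restores the starting state. Real databases achieve the same effect with logs rather than full copies.

```python
def transfer(accounts, source, target, amount):
    """Move amount from source to target; restore the original state on any failure."""
    snapshot = dict(accounts)  # copy taken before the transaction starts
    try:
        accounts[source] -= amount
        if accounts[source] < 0:
            raise ValueError("insufficient funds")
        accounts[target] += amount
    except Exception:
        accounts.clear()
        accounts.update(snapshot)  # roll back: no partial change survives
        raise


accounts = {"X": 500, "Y": 300}
transfer(accounts, "X", "Y", 100)  # commits: X = 400, Y = 400
try:
    transfer(accounts, "X", "Y", 1000)  # aborts and rolls back
except ValueError:
    pass
```

After the failed second transfer the balances are unchanged, so the total of 800 is preserved.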
Consistency(C)
• Correctness
• Integrity constraint must be maintained.
Isolation(I)
• Multiple transactions can occur concurrently without leading to an inconsistent database
state.
• Transactions occur independently.
• Changes made in one transaction are not visible to other transactions until they are committed.
• This is the responsibility of the concurrency control subsystem.
For example: if Person A is performing a transaction, it should not affect any other person's account.
T1            T2
Read(X)
X = X * 100
Write(X)      Read(X)
Read(Y)       Read(Y)
Y = Y - 50    Z = X + Y
Write(Y)      Write(Z)
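One simple way to picture isolation is to let each transaction keep its writes in a private buffer until it commits. The sketch below follows the T1 steps from the schedule above; the Transaction class and the starting values X = 500 and Y = 500 are assumptions made for the example, and the sketch does not deal with two transactions writing the same item.

```python
class Transaction:
    def __init__(self, database):
        self._database = database
        self._writes = {}  # private, uncommitted changes

    def read(self, key):
        # A transaction sees its own writes first, then committed data.
        return self._writes.get(key, self._database[key])

    def write(self, key, value):
        self._writes[key] = value

    def commit(self):
        # Only now do the changes become visible to other transactions.
        self._database.update(self._writes)
        self._writes.clear()


database = {"X": 500, "Y": 500}
t1 = Transaction(database)
t2 = Transaction(database)

t1.write("X", t1.read("X") * 100)  # T1: X = X * 100 (uncommitted)
seen_by_t2_before = t2.read("X")   # T2 still sees the committed value
t1.write("Y", t1.read("Y") - 50)   # T1: Y = Y - 50
t1.commit()
seen_by_t2_after = t2.read("X")    # T2 now sees T1's committed change
```

T2 never observes a state in which X has been updated but Y has not, so any Z = X + Y it computes is based on a consistent pair of values.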
Durability(D)
• Once the transaction is committed, its updates and modifications to the database are written to disk,
and they persist even if a system failure occurs.
• For example: a banking system failure.
• The effects of the transaction are never lost.
• Permanent
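The sketch below shows one common way to make a committed state survive a crash: write it to a temporary file, force it to disk with os.fsync, then rename it over the old file. The file name accounts.json and the helper names are assumptions for the example; production databases normally use a write-ahead log instead of rewriting the whole file.

```python
import json
import os
import tempfile


def commit_to_disk(path, state):
    """Write state to a temporary file, force it to disk, then replace the old file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the operating system to write the data to disk
    os.replace(tmp_path, path)  # the old file is swapped for the new one in one step


def recover(path):
    """Reload the last committed state, for example after a restart."""
    with open(path) as f:
        return json.load(f)


data_dir = tempfile.mkdtemp()
db_path = os.path.join(data_dir, "accounts.json")
commit_to_disk(db_path, {"X": 400, "Y": 400})
restored = recover(db_path)  # what a restarted process would read back
```

The committed balances are read back from disk rather than from memory, which is the property durability asks for.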
BASE for reliable database transactions
1. Basically Available(BA)
2. Soft State(S)
3. Eventual Consistency(E)
Basically Available(BA)
A distributed system should respond to every incoming request with some acknowledgment, even if it is only a
failure message.
Soft State(S)
The system keeps changing state as and when it receives new information.
Eventual Consistency(E)
The components in the system may not reflect the same value/state of a record at a given point in time.
They will converge on the same value eventually, given time.
For example: an e-commerce site with an Order Service and a Payment Service.
Speed
RAM is much faster than permanent storage devices.
Having sufficient RAM ensures that frequently used data is readily available, reducing the need to access
slower storage and thereby speeding up overall system performance.
Multitasking efficiency
More RAM allows more applications to run simultaneously without slowing down the system.
This multitasking efficiency is important for professionals running resource-intensive applications, such as
video editing software or virtual machines.
Reducing Bottleneck
RAM minimizes the number of times the CPU has to fetch data from slower storage (SSD or HDD),
preventing bottlenecks in processing and increasing the efficiency of the system.
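The effect of keeping frequently used data in memory can be sketched with Python's functools.lru_cache. The read_from_storage function is a stand-in for a slow disk read and the key "user:42" is made up; the counter simply records how often the slow path is taken.

```python
from functools import lru_cache

storage_reads = 0  # how many times the slow storage was actually accessed


def read_from_storage(key):
    """Stand-in for a slow read from SSD or HDD."""
    global storage_reads
    storage_reads += 1
    return f"value-for-{key}"


@lru_cache(maxsize=128)  # keep up to 128 recently used results in memory
def cached_read(key):
    return read_from_storage(key)


# The same record is requested 1000 times.
for _ in range(1000):
    cached_read("user:42")
```

Only the first request reaches storage; the remaining requests are answered from memory, which is the bottleneck reduction described above.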
2. SSD(Solid State Drive)
An SSD is a type of storage device that uses flash memory to store data. It serves the same purpose as a
traditional hard disk drive (HDD), but it is faster and more efficient.
Uses
The primary use is faster data storage and access.
SSDs are ideal for hosting applications that require fast read/write access to large amounts of data.
SSDs can also be used to cache data that is frequently accessed by the CPU or by other programs.
3. Disk
In a NoSQL database, disks (HDDs or SSDs) are crucial for storing large amounts of data persistently.
NoSQL databases are designed for high-performance, distributed and scalable environments.
The way a NoSQL database uses disks affects how data is stored, retrieved and optimized for performance.
Achieving horizontal scalability with data base sharding
Sharding
Definition: Sharding is the method of splitting a single logical dataset and storing it across multiple
servers or databases.
[Figure: User, Internet, Database server, shard1]
Horizontal sharding
Most of the time, sharding refers to horizontal sharding.
It is a data architecture pattern related to horizontal partitioning, which is a method of splitting one big
table into several smaller tables.
This is also called partitioning.
Each partition has the same columns and schema, but holds a different set of rows.
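A common way to decide which shard holds a row is to hash a key column and take the result modulo the number of shards. The sketch below assumes three in-memory shards and a user_id key; both are illustrative choices, and real systems often use more elaborate schemes such as range-based or consistent hashing.

```python
import hashlib

SHARD_COUNT = 3
shards = [dict() for _ in range(SHARD_COUNT)]  # each dict stands in for one server


def shard_for(key):
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % SHARD_COUNT


def put(key, row):
    shards[shard_for(key)][key] = row


def get(key):
    # The same hash tells us which shard to ask; the others are not touched.
    return shards[shard_for(key)].get(key)


# Every row has the same fields, but each shard ends up with different rows.
for user_id in range(1, 101):
    put(user_id, {"user_id": user_id, "name": f"user{user_id}"})

row = get(42)
```

A cryptographic hash is used here rather than Python's built-in hash() because the built-in hash of a string changes between runs, which would send the same key to different shards after a restart.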
Brewer's CAP theorem
The CAP theorem applies mainly to distributed systems.
C Consistency
A Availability
P Partition Tolerance
Consistency(C)
It states that all replicas of a record must possess the same value at every point in time.
Availability(A)
All the active nodes at any moment must be able to respond to different operations.
Partition Tolerance(P)
The system must be able to tolerate network partitions among its participant nodes.
In other words, partitioning should not affect the retrieval of records.
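The trade-off between the three properties shows up when a partition actually happens. The toy two-node cluster below is an assumption made for illustration: in "CP" mode it refuses writes during a partition so the replicas stay equal, and in "AP" mode it accepts the write on one node and lets the replicas diverge.

```python
class Node:
    def __init__(self, value):
        self.value = value


class TwoNodeCluster:
    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP" (labels chosen for this sketch)
        self.nodes = [Node(0), Node(0)]
        self.partitioned = False

    def write(self, node_index, value):
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by giving up availability.
                raise RuntimeError("unavailable during partition")
            # Stay available by giving up consistency.
            self.nodes[node_index].value = value
            return
        for node in self.nodes:  # no partition: replicate to every node
            node.value = value

    def read(self, node_index):
        return self.nodes[node_index].value


cp = TwoNodeCluster("CP")
cp.partitioned = True
try:
    cp.write(0, 7)
    cp_write_accepted = True
except RuntimeError:
    cp_write_accepted = False

ap = TwoNodeCluster("AP")
ap.partitioned = True
ap.write(0, 7)
```

In the CP cluster both nodes still hold the same value but the write was rejected; in the AP cluster the write succeeded but the two nodes now disagree until the partition heals.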