NoSQL Intro
NOSQL
Module-1 Notes
when data is spread across different nodes, further complicating the use of relational
databases in clustered environments.
This mismatch between relational databases and clusters led some organizations to
consider an alternative route to data storage. Two companies in particular—Google and
Amazon—have been very influential. Both were on the forefront of running large
clusters of this kind; furthermore, they were capturing huge amounts of data.
Emergence of NOSQL
The emergence of NoSQL databases addresses many of the limitations of traditional
relational databases, especially in handling large-scale, distributed, and unstructured
data. NoSQL databases are designed to provide flexible schemas and horizontal
scalability, making them well-suited for modern applications that require quick access
to large volumes of diverse data. Unlike relational databases, which rely on a rigid
schema and ACID transactions, NoSQL databases often sacrifice some aspects of
consistency to achieve higher availability and partition tolerance, as per the CAP
theorem.
NoSQL databases come in various types, including document stores (e.g., MongoDB),
key-value stores (e.g., Redis), column-family stores (e.g., Cassandra), and graph
databases (e.g., Neo4j), each optimized for specific use cases. They allow for efficient
storage and retrieval of unstructured or semi-structured data and are particularly
effective in handling large-scale, distributed data environments typical of big data
applications. The flexibility, scalability, and performance of NoSQL databases have
made them popular in industries such as social media, e-commerce, and real-time
analytics, where traditional RDBMS solutions struggle to meet the demands of modern,
high-velocity data processing.
The term "NoSQL" originally emerged to describe databases that do not adhere to the
traditional relational database model, particularly those that do not use Structured
Query Language (SQL) for data management. While "NoSQL" is often interpreted as
"no SQL," it more accurately conveys "not only SQL," highlighting that these databases
can support various data models and query languages beyond the standard relational
approach. The name reflects a broad category of database systems designed to handle
unstructured, semi-structured, and rapidly changing data with greater flexibility and
scalability than traditional relational databases.
Features of NoSQL
The key characteristics of NoSQL databases that highlight their distinct advantages over
traditional relational databases:
1. Schema Flexibility
NoSQL databases allow for dynamic and flexible schemas, enabling users to store unstructured,
semi-structured, or structured data without predefined schemas. This flexibility allows for
easier adaptation to changing data requirements and makes it simpler to incorporate new data
types without extensive modifications to the database.
2. Horizontal Scalability
NoSQL databases are designed to scale out horizontally by adding more servers or nodes to
distribute data across multiple machines. This scalability allows them to handle large volumes
of data and high levels of concurrent user requests, making them well-suited for applications
with rapidly growing datasets.
3. High Availability and Fault Tolerance
Many NoSQL databases are built to ensure high availability and fault tolerance. They often
employ data replication across multiple nodes, enabling continuous operation even if some
nodes fail. This design ensures that the database remains accessible and resilient in the face of
hardware failures or other disruptions.
4. Support for Various Data Models
NoSQL databases support multiple data models, including document, key-value, column-
family, and graph models. This variety allows developers to choose the most appropriate model
based on their specific use cases, optimizing data storage and retrieval according to the needs
of their applications.
5. Eventual Consistency
Unlike traditional relational databases that emphasize strong consistency through ACID
transactions, many NoSQL databases adopt an eventual consistency model. This approach
allows for higher availability and partition tolerance, as data updates may not be immediately
consistent across all nodes. Instead, the system guarantees that, given enough time, all updates
will propagate throughout the database, making it suitable for distributed environments where
immediate consistency is less critical.
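Schema flexibility (feature 1 above) can be illustrated with a small sketch. The collection below is a plain in-memory list standing in for a document store, and the field names are illustrative:

```python
# Two "documents" in the same hypothetical collection carry different
# fields; no predefined schema forces them to match.
collection = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "age": 34, "tags": ["vip", "beta"]},  # new fields, no migration needed
]

def field_names(doc):
    """Return the set of fields actually present in a document."""
    return set(doc)
```

Adding a new field to one document requires no change to any other document, which is exactly the "easier adaptation to changing data requirements" described above.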
Polyglot persistence refers to the practice of using multiple data storage technologies, each
optimized for specific use cases, within a single application or system architecture. In the
context of NoSQL databases, this approach allows developers to leverage the strengths of
various NoSQL database types—such as document stores, key-value stores, column-family
stores, and graph databases—to handle diverse data needs efficiently.
The data as modeled in an RDBMS design is shown in the following example figure:
A single logical address record appears three times in the example data, but instead of using
IDs it’s treated as a value and copied each time. This fits the domain where we would not
want the shipping address, nor the payment’s billing address, to change. In a relational
database, we would ensure that the address rows aren’t updated for this case, making a new
row instead. With aggregates, we can copy the whole address structure into the aggregate
as we need to.
A typical NoSQL representation of the same data is:
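The original listing is not reproduced here; a minimal sketch of such an order aggregate, with the address copied into the aggregate rather than shared by ID (field names are illustrative), might look like:

```python
# Hedged sketch of an order aggregate: the billing address is copied into
# the payment rather than referenced by ID, so the aggregate is
# self-contained and can be retrieved in one read.
order = {
    "id": 99,
    "customerId": 1,
    "orderItems": [
        {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"},
    ],
    "shippingAddress": {"city": "Chicago"},
    "orderPayment": {
        "ccinfo": "1000-1000-1000-1000",
        "billingAddress": {"city": "Chicago"},  # copied, not shared by ID
    },
}
```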
The link between the customer and the order isn’t within either aggregate—it’s a
relationship between aggregates. Similarly, the link from an order item would cross into a
separate aggregate structure for products. It is more common with aggregates because we
want to minimize the number of aggregates we access during a data interaction. The more
embedded aggregate model is shown in the following figure:
Relational databases typically require a fixed schema, which can lead to complex
designs when trying to model aggregates. This can result in numerous tables and
relationships, making it difficult to manage and evolve the schema over time.
Aggregates often span multiple tables in RDBMS, necessitating complex JOIN
operations to retrieve related data. This can lead to performance bottlenecks, especially
when dealing with large datasets or frequent queries.
The need for multiple queries to fetch related data can degrade performance in relational
databases, particularly in scenarios where quick access to aggregate data is essential,
such as in real-time applications.
Maintaining ACID properties across multiple tables can introduce additional overhead,
making transactions more complex and potentially impacting throughput and latency.
Aggregate Relationships
Aggregate relationships are a key concept in NoSQL databases, particularly in
document-oriented and key-value stores, where data is grouped into aggregates that
represent a cohesive unit of related information.
Aggregate relationships define how different pieces of data within an aggregate relate
to one another. An aggregate is a cluster of related data that can be treated as a single
unit, encapsulating all the necessary information needed to describe a specific entity or
concept, such as a user profile, order, or product.
In the context of aggregate relationships, an aggregate root is the primary entity
through which the aggregate is accessed and modified. It serves as the entry point for
operations and ensures that all changes to the aggregate maintain its integrity.
Aggregate-oriented databases treat the aggregate as the unit of data retrieval.
Consequently, atomicity is supported only within the contents of a single aggregate. If
you update multiple aggregates at once, you have to deal with a partway failure
yourself. Relational databases help you with this by allowing you to modify multiple
records in a single transaction, providing ACID guarantees while altering many rows.
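A sketch of the difference, assuming a hypothetical in-memory key-value store where each put replaces one aggregate in a single step (store layout and keys are illustrative):

```python
# Hypothetical store: each key holds one aggregate, and put() replaces
# the whole aggregate atomically.
store = {}

def put(key, aggregate):
    store[key] = aggregate  # atomic per aggregate

# Updating one aggregate is atomic: the order status and its items
# change together or not at all.
put("order:99", {"status": "paid", "items": [{"productId": 27}]})

# Updating two aggregates is two separate operations; if the second one
# failed, the application itself would have to handle the partial result.
put("order:99", {"status": "shipped", "items": [{"productId": 27}]})
put("customer:1", {"lastOrder": "order:99"})
```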
Graph Databases
Graph databases are a type of NoSQL database specifically designed to represent and store
data in graph structures, which consist of nodes (entities) and edges (relationships)
connecting them.
This model allows for the representation of complex relationships and interconnected data
in a way that is both intuitive and efficient. Unlike traditional relational databases, which
rely on tables and foreign keys to establish relationships, graph databases treat relationships
as first-class citizens.
They excel in scenarios where relationships are as important as the data itself, such as social
networks, recommendation systems, fraud detection, and network analysis.
One of the key advantages of graph databases is their ability to perform complex queries
on relationships with high efficiency. By leveraging graph traversal algorithms, they can
quickly retrieve interconnected data without the need for expensive JOIN operations typical
in relational databases.
This capability allows for real-time analytics and insights into data relationships, enabling
applications to provide richer, context-driven user experiences. Popular graph databases
include Neo4j, Amazon Neptune, and ArangoDB, each offering unique features and
optimizations for handling graph data.
As data continues to grow in complexity and interconnectivity, graph databases are
increasingly recognized as essential tools for managing and analyzing relational data at
scale.
An example Graph structure is shown in the following figure:
With this structure, we can ask questions such as “find the books in the Databases category
that are written by someone whom a friend of mine likes.” Graph databases specialize in
capturing this sort of information—but on a much larger scale than a readable diagram
could capture. This is ideal for capturing any data consisting of complex relationships,
such as social networks and product preferences.
The data model of a graph database is fundamentally centered around two primary
components: nodes and edges. Nodes represent entities or objects, such as users, products,
or locations, while edges define the relationships between these entities, indicating how
they are interconnected.
Each node and edge can have associated properties, which are key-value pairs that provide
additional context or attributes to the entities and relationships. For example, in a social
network graph database, a node might represent a user with properties like name and age,
while an edge could represent a "follows" relationship with properties such as the date the
connection was made.
This flexible structure allows graph databases to efficiently model complex, interconnected
data and perform sophisticated queries that explore relationships, making them particularly
effective for applications requiring insights into data relationships, such as social networks,
recommendation engines, and fraud detection.
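The node-edge-property model described above can be sketched with plain data structures; the names and traversal helper below are illustrative, not a specific product's API:

```python
# A tiny property graph: nodes and edges each carry key-value properties.
nodes = {
    "alice": {"label": "User", "name": "Alice", "age": 30},
    "bob":   {"label": "User", "name": "Bob", "age": 27},
}
edges = [
    {"from": "alice", "to": "bob", "type": "follows", "since": "2023-01-15"},
]

def followers_of(user_id):
    """Traverse incoming 'follows' edges to find a user's followers."""
    return [e["from"] for e in edges
            if e["to"] == user_id and e["type"] == "follows"]
```

A real graph database would index the edges so this traversal does not scan the whole edge list, which is what makes relationship queries cheap compared with relational JOINs.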
Schemaless databases
Schemaless databases, often associated with NoSQL systems, are designed to allow for
flexible data storage without the constraints of a predefined schema. This characteristic
enables users to store unstructured, semi-structured, or structured data in a way that can
evolve over time without requiring significant alterations to the database design. Here are
some key features and advantages of schemaless databases:
They allow developers to add or modify fields within data records without needing to
update the entire database schema.
Because there are no strict schema requirements, schemaless databases can easily integrate
various data types and formats. This characteristic makes them suitable for handling data
from multiple sources, such as JSON, XML, or even binary data, allowing for the
aggregation of diverse datasets within a single system.
The lack of a fixed schema facilitates faster development cycles, enabling teams to iterate
quickly as they build and refine applications.
Schemaless databases are often designed to scale horizontally, meaning they can distribute
data across multiple servers or nodes. This scalability is especially beneficial for
applications experiencing rapid growth or fluctuating workloads.
Popular examples of schemaless databases include document stores like MongoDB, key-
value stores like Redis, and wide-column stores like Cassandra. These databases leverage
their schemaless nature to provide developers with the flexibility needed to handle a wide
range of applications, from content management systems to real-time analytics.
A schemaless store also makes it easier to deal with nonuniform data: data where each
record has a different set of fields. For example, code to iterate over the records of such
a schemaless store:
// pseudocode
foreach (Record r in records) {
    foreach (Field f in r.fields) {
        print(f.name, f.value)
    }
}
Materialized views
Materialized views in RDBMS are pre-computed query results stored as database objects.
Unlike regular views, which are virtual and calculated on-the-fly each time they are
accessed, materialized views store the results physically on disk, allowing for faster query
performance, especially for complex aggregations or joins that involve large datasets.
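The idea can be sketched in Python: the aggregation is computed once by a refresh step and stored, so reads hit the precomputed result instead of re-running the query (table and field names are illustrative):

```python
# Base data standing in for a table of order rows.
orders = [
    {"product": "book", "qty": 2},
    {"product": "book", "qty": 1},
    {"product": "pen", "qty": 5},
]

def refresh_totals_view(rows):
    """Recompute the aggregate and store it: the 'materialization' step."""
    view = {}
    for row in rows:
        view[row["product"]] = view.get(row["product"], 0) + row["qty"]
    return view

# Refreshed periodically (or on commit), not on every read.
totals_view = refresh_totals_view(orders)
```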
When modeling for column-family stores, we have the benefit of the columns being ordered,
allowing us to name columns that are frequently used so that they are fetched first.
When using graph databases to model the same data, we model all objects as nodes and
relations within them as relationships; these relationships have types and directional
significance.
Module II – Notes
Distribution Models
NoSQL databases employ various distribution models to manage data across multiple nodes or
servers effectively. These models are designed to enhance scalability, availability, and
performance, allowing NoSQL systems to handle large volumes of data and high levels of
concurrent requests. Broadly, there are two paths to data distribution: replication and sharding.
Replication takes the same data and copies it over multiple nodes. Sharding puts different data
on different nodes. Replication and sharding are orthogonal techniques. Replication comes
in two forms: master-slave and peer-to-peer.
Sharding
Sharding is a database architecture pattern that involves partitioning data across multiple
servers or nodes, enabling horizontal scaling and improving performance for large datasets.
In a sharded database, data is divided into smaller, more manageable pieces called "shards,"
which are distributed across different servers.
Each shard contains a subset of the data, allowing the database to handle a higher volume
of read and write operations simultaneously.
This model is particularly beneficial for applications with large datasets, as it helps mitigate
performance bottlenecks and ensures that no single server is overwhelmed by excessive
requests.
A sharding example is shown in the following figure. Sharding puts different data on
separate nodes, each of which does its own reads and writes.
Many NoSQL databases offer auto-sharding, where the database takes on the
responsibility of allocating data to shards and ensuring that data access goes to the right
shard.
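One common allocation scheme, whether done by the application or by an auto-sharding layer, is hash-based partitioning; the routing function below is an illustrative sketch, not any particular database's algorithm:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key):
    """Route a key to one of NUM_SHARDS shards by hashing it."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always hashes to the same shard, so reads find the data
# that writes placed there.
```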
Sharding does little to improve resilience when used alone. Although the data is on different
nodes, a node failure makes that shard’s data unavailable just as surely as it does for a
single-server solution. The resilience benefit it does provide is that only the users of the
data on that shard will suffer; however, it’s not good to have a database with part of its data
missing.
Master-Slave Replication
With master-slave distribution, you replicate data across multiple nodes. One node is
designated as the master, or primary. This master is the authoritative source for the data and
is usually responsible for processing any updates to that data. The other nodes are slaves,
or secondaries.
An example master-slave process is shown in the following figure:
Master-slave replication is most helpful for scaling when you have a read-intensive dataset.
You can scale horizontally to handle more read requests by adding more slave nodes and
ensuring that all read requests are routed to the slaves.
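A sketch of this routing policy, assuming hypothetical node names: writes go to the master, while reads are spread round-robin across the slaves.

```python
import itertools

# Illustrative cluster: one master, two slaves.
master = "node-a"
slaves = ["node-b", "node-c"]
_rr = itertools.cycle(slaves)  # round-robin over the slaves

def route(operation):
    """Return the node that should handle the request."""
    return master if operation == "write" else next(_rr)
```

Adding a slave to the `slaves` list increases read capacity without touching the write path, which is the horizontal read scaling described above.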
Read resilience in a master-slave database architecture refers to the ability of the system to
effectively handle read operations even in the presence of failures or high load, leveraging
the characteristics of replication and redundancy inherent in this setup.
The ability to appoint a slave to replace a failed master means that master-slave replication
is useful even if you don’t need to scale out. Masters can be appointed manually or
automatically. Manual appointing typically means that when you configure your cluster,
you configure one node as the master. With automatic appointment, you create a cluster of
nodes and they elect one of themselves to be the master.
Replication comes with some alluring benefits, but it also comes with an inevitable dark
side— inconsistency. You have the danger that different clients, reading different slaves,
will see different values because the changes haven’t all propagated to the slaves.
Peer-to-Peer Replication
In peer-to-peer replication, all replicas have equal weight: any node can accept writes,
and the loss of one node does not prevent access to the data store. The biggest
complication with peer-to-peer replication, however, is consistency. When you can write
to two different places, you run the risk that two people will attempt to update the same
record at the same time, causing a write-write conflict. Inconsistencies on read lead to
problems too, but at least they are relatively transient.
Using peer-to-peer replication together with sharding is a common strategy for
column-family databases. In a scenario like this you might have tens or hundreds of
nodes in a cluster with data sharded over them.
Combining Master-Slave replication with sharding is shown in the following figure:
Updating Consistency
Coincidentally, Martin and Pramod are looking at the company website and notice that the
phone number is out of date. Implausibly, they both have update access, so they both go in
at the same time to update the number. This issue is called a write-write conflict: two people
updating the same data item at the same time.
When the writes reach the server, the server will serialize them—decide to apply one, then
the other. Let’s assume it uses alphabetical order and picks Martin’s update first, then
Pramod’s. Without any concurrency control, Martin’s update would be applied and
immediately overwritten by Pramod’s. In this case Martin’s is a lost update.
Approaches for maintaining consistency in the face of concurrency are often described as
pessimistic or optimistic. A pessimistic approach works by preventing conflicts from
occurring; an optimistic approach lets conflicts occur, but detects them and takes action to
sort them out.
The most common pessimistic approach is to have write locks, so that in order to change a
value you need to acquire a lock, and the system ensures that only one client can get a lock
at a time.
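A minimal sketch of the pessimistic approach, using an in-process lock to stand in for a database write lock (the store and keys are illustrative):

```python
import threading

# One lock guards the record; only one writer can hold it at a time,
# so conflicting updates are serialized before they happen.
phone_numbers = {"company": "555-0100"}
lock = threading.Lock()

def update_phone(key, value):
    with lock:  # acquire the write lock, update, then release
        phone_numbers[key] = value

update_phone("company", "555-0199")
```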
A common optimistic approach is a conditional update: any client that wants to perform
an update tests the value just before updating it, to see whether it has changed since its
last read.
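A sketch of such a conditional update (compare-and-set), using a version number on an illustrative record; returning to the earlier phone-number story, the second writer's stale update is rejected instead of silently overwriting the first:

```python
# Each record carries a version; an update succeeds only if the version
# is unchanged since the client's last read.
record = {"value": "555-0100", "version": 1}

def conditional_update(rec, new_value, expected_version):
    """Apply the update only if nobody else changed the record first."""
    if rec["version"] != expected_version:
        return False  # conflict detected; caller must re-read and retry
    rec["value"] = new_value
    rec["version"] += 1
    return True
```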
There is another optimistic way to handle a write-write conflict—save both updates and
record that they are in conflict. Pessimistic approaches often severely degrade the
responsiveness of a system to the degree that it becomes unfit for its purpose. This problem
is made worse by the danger of errors—pessimistic concurrency often leads to deadlocks,
which are hard to prevent and debug.
Read Consistency
Let’s imagine we have an order with line items and a shipping charge. The shipping charge is
calculated based on the line items in the order. If we add a line item, we thus also need to
recalculate and update the shipping charge. In a relational database, the shipping charge and
line items will be in separate tables. The danger of inconsistency is that Martin adds a line item
to his order, Pramod then reads the line items and shipping charge, and then Martin updates the
shipping charge. This is an inconsistent read or read-write conflict.
The scenario of read-write conflict is shown in the following figure:
It is often claimed that NoSQL databases don’t support transactions and thus can’t be
consistent. This claim is mostly wrong, because the lack of transactions usually applies
only to some NoSQL databases, in particular the aggregate-oriented ones. In contrast,
graph databases tend to support ACID transactions just as relational databases do.
Secondly, aggregate-oriented databases do support atomic updates, but only within a
single aggregate. This means that you will have logical consistency within an aggregate
but not between aggregates.
Not all data can be put in the same aggregate, so any update that affects multiple aggregates
leaves open a time when clients could perform an inconsistent read. The length of time an
inconsistency is present is called the inconsistency window. A NoSQL system may have a
quite short inconsistency window.
Replication consistency refers to the degree to which data remains consistent across
multiple replicas in a distributed database system. In architectures utilizing replication—
whether master-slave, peer-to-peer, or multi-master—ensuring that all copies of the data
reflect the same state is crucial for maintaining data integrity and application reliability. An
example for the replication consistency is depicted in the following figure:
With replication, there can be eventual consistency, meaning that at any time nodes
may have replication inconsistencies but, if there are no further updates, eventually all
nodes will be updated to the same value.
One can tolerate reasonably long inconsistency windows, but you need read-your-writes
consistency, which means that, once you’ve made an update, you’re guaranteed to
continue seeing that update.
One way to get this in an otherwise eventually consistent system is to provide session
consistency: Within a user’s session there is read-your-writes consistency. This does mean
that the user may lose that consistency should their session end for some reason or should
the user access the same system simultaneously from different computers, but these cases
are relatively rare.
There are a couple of techniques to provide session consistency. A common way, and often
the easiest way, is to have a sticky session: a session that’s tied to one node (this is also
called session affinity).
A sticky session allows you to ensure that as long as you keep read-your-writes consistency
on a node, you’ll get it for sessions too. The downside is that sticky sessions reduce the
ability of the load balancer to do its job.
Relaxing consistency
Relaxing consistency refers to the practice of intentionally allowing some degree of
inconsistency in a distributed database system to improve performance, availability, and
scalability.
In traditional database systems, strong consistency is often enforced, ensuring that all
replicas reflect the same state at all times.
However, in distributed environments, maintaining this strict consistency can introduce
latency and bottlenecks, particularly during network partitions or high-traffic scenarios. By
relaxing consistency, systems can achieve greater responsiveness and fault tolerance while
still meeting the needs of many applications.
CAP Theorem
The CAP theorem, also known as Brewer's theorem, is a fundamental principle in distributed
systems that states that it is impossible for a distributed data store to simultaneously provide
all three of the following guarantees:
1. Consistency (C)
Consistency ensures that every read operation returns the most recent write for a given piece
of data. In other words, all nodes in the distributed system view the same data at the same time.
2. Availability (A)
Availability guarantees that every request to the system receives a response, regardless of
whether it is successful or contains the latest data. This means that the system is operational
and accessible, even if some nodes are down or unreachable
3. Partition Tolerance (P)
Partition tolerance ensures that the system continues to operate even in the presence of network
partitions, where communication between some nodes is lost. In a distributed environment,
network failures can occur, and partition tolerance guarantees that the system can still function,
allowing for either consistent or available responses.
According to the CAP theorem, a distributed database can only achieve two of the three
guarantees at any given time:
• CP (Consistency and Partition Tolerance): Systems that prioritize consistency and
partition tolerance may sacrifice availability during network partitions. An example is
a system that returns errors or unavailable responses if it cannot ensure that all nodes
are consistent.
• AP (Availability and Partition Tolerance): Systems that focus on availability and
partition tolerance may allow for eventual consistency, meaning that some reads may
return stale data while the system continues to operate. Examples include systems like
Cassandra and DynamoDB, which prioritize availability.
• CA (Consistency and Availability): It is impossible to achieve consistency and
availability in the presence of network partitions. Systems that claim to provide both
will inevitably fail during network issues.
CAP theorem can be summarized with following figure:
Relaxing durability
Durability is a key property of database systems that ensures once a transaction has
been committed, it remains permanently recorded in the system, even in the event of a
failure such as a power outage or system crash. This property is typically achieved
through mechanisms like write-ahead logging and data replication, which safeguard the
integrity and availability of data.
If a database can run mostly in memory, apply updates to its in-memory representation,
and periodically flush changes to disk, then it may be able to provide substantially
higher responsiveness to requests.
A big website may have many users and keep temporary information about what each
user is doing in some kind of session state. There’s a lot of activity on this state, creating
lots of demand, which affects the responsiveness of the website.
Another example of relaxing durability is capturing telemetric data from physical
devices. It may be that you’d rather capture data at a faster rate, at the cost of missing
the last updates should the server go down.
Replication durability refers to the assurance that data changes made in a distributed
system will persist even in the face of failures, thanks to the mechanisms in place to
replicate data across multiple nodes or servers.
Quorums
A quorum is a minimum number of votes or acknowledgments required from nodes in a
distributed system to consider a read or write operation valid and successful. This mechanism
helps ensure consistency and availability in the face of network partitions or node failures.
There are typically two types of quorums used in distributed systems:
• Write Quorum (W): The minimum number of replicas that must acknowledge a write
operation before it is considered successful. This ensures that the data is sufficiently
replicated across the system to maintain consistency.
• Read Quorum (R): The minimum number of replicas that must be accessed for a read
operation to return a valid result. This ensures that the read operation reflects the most
recent write, maintaining consistency from the user’s perspective.
This relationship between the number of nodes you need to contact for a read (R), those
confirming a write (W), and the replication factor (N) can be captured in two inequalities:
W > N/2
R + W > N
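These conditions can be checked directly. With a replication factor N = 3, choosing W = 2 and R = 2 satisfies both, so any read quorum overlaps the latest write quorum in at least one node:

```python
def is_strongly_consistent(n, w, r):
    """Quorum conditions: W > N/2 keeps write quorums from conflicting,
    and R + W > N forces every read quorum to overlap the write quorum."""
    return w > n / 2 and r + w > n

# Example: 3 replicas, write to 2, read from 2. The two quorums share
# at least one node, so a read always sees the most recent write.
```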
Version Stamps
There are various ways you can construct your version stamps. You can use a counter,
always incrementing it when you update the resource. Counters are useful since they make
it easy to tell if one version is more recent than another. On the other hand, they require the
server to generate the counter value, and also need a single master to ensure the counters
aren’t duplicated.
Another approach is to create a GUID, a large random number that’s guaranteed to be
unique. These use some combination of dates, hardware information, and whatever other
sources of randomness they can pick up. The nice thing about GUIDs is that they can be
generated by anyone and you’ll never get a duplicate; a disadvantage is that they are large
and can’t be compared directly for recentness.
A third approach is to make a hash of the contents of the resource. With a big enough hash
key size, a content hash can be globally unique like a GUID and can also be generated by
anyone; the advantage is that they are deterministic—any node will generate the same
content hash for the same resource data. However, like GUIDs they can’t be directly
compared for recentness, and they can be lengthy.
A fourth approach is to use the timestamp of the last update. Like counters, they are
reasonably short and can be directly compared for recentness, yet have the advantage of
not needing a single master. Multiple machines can generate timestamps—but to work
properly, their clocks have to be kept in sync. One node with a bad clock can cause all sorts
of data corruptions.
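The four approaches can be sketched side by side with the Python standard library (the sample data is illustrative):

```python
import hashlib
import time
import uuid

# 1. Counter: a single master increments it on each update; easy to
#    compare for recentness.
counter = 41
counter += 1

# 2. GUID: anyone can generate one and collisions are negligible, but it
#    carries no ordering information.
stamp_guid = uuid.uuid4()

# 3. Content hash: deterministic, so any node produces the same stamp
#    for the same resource data; still not comparable for recentness.
data = b'{"phone": "555-0199"}'
stamp_hash = hashlib.sha256(data).hexdigest()

# 4. Timestamp: short and comparable without a single master, but the
#    machines' clocks must be kept in sync.
stamp_time = time.time()
```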