NoSQL Final
1. What is NoSQL? Explain briefly about the aggregate data model with a neat diagram, considering an example of relational and aggregate models. 8M/10M
Figure 2.3. An aggregate data model
1. What is NoSQL?
NoSQL databases are non-relational databases designed to store, retrieve, and manage unstructured or semi-structured
data, which doesn’t fit neatly into traditional rows and columns like in relational databases. NoSQL databases are highly
flexible, allowing for different data models like key-value, document, column-family, and graph. They are ideal for big
data and real-time applications, offering scalability and speed.
The aggregate data model in NoSQL groups related data together as a single unit, called an "aggregate." This means
that, instead of splitting data across multiple tables (like in relational databases), related data is stored together.
Aggregates simplify data management, especially in large, distributed systems, making it easy to retrieve and update
data without complex joins.
In NoSQL, an aggregate could include all data related to a customer, such as their orders and addresses, within a single
record. This reduces the number of interactions needed to access the data and makes handling data across multiple
servers (sharding) easier.
Relational vs. Aggregate Model Example:
• Relational Model: Data is spread across multiple tables, with separate tables for customers, orders, and items.
These tables are linked by keys, so accessing all data about a customer’s orders requires joins.
• Aggregate Model (NoSQL): All related data is embedded in one document. For example, a customer document
may contain their orders, items, and addresses directly, simplifying data retrieval.
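As a rough illustration (field names are hypothetical, not taken from any particular product), the same customer data might look like this as a single aggregate in a document store:
// Hypothetical customer aggregate: orders and addresses embedded in one document.
const customer = {
  id: 1,
  name: "Martin",
  addresses: [{ city: "Chicago", type: "billing" }],
  orders: [
    {
      orderId: 99,
      items: [{ productId: "P123", quantity: 2, price: 100 }],
      shippingAddress: { city: "Chicago" }
    }
  ]
};
// A relational model would instead split this into customer, address, order,
// and order_item tables linked by foreign keys and reassembled with joins.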
Relational databases have become an embedded part of our computing culture. The benefits they provide include:
1. Getting at Persistent Data: Databases allow the storage of large amounts of persistent data, offering flexibility
over file systems by enabling applications to access small bits of information quickly and easily.
2. Concurrency: Relational databases help manage data access by multiple users through transactions, which aid
in preventing issues like double booking. Transactions also support error handling by allowing changes to be
rolled back if an error occurs.
3. Integration: Databases enable inter-application collaboration by allowing multiple applications to store and
access data in a single database, making shared data accessible and manageable.
4. A (Mostly) Standard Model: Relational databases provide a mostly standard model, allowing developers to
apply knowledge across different projects. Despite some differences in SQL dialects, core mechanisms like
transactions operate similarly across databases.
3. Explain briefly about Impedance mismatch, with a neat diagram
Impedance Mismatch refers to the difficulties that arise from the differences between the relational model used in
databases and the in-memory data structures used in application programming. Here are the key points:
1. Definition: Impedance mismatch is the discrepancy between the relational model (tables and rows) and in-
memory data structures (objects) in programming.
2. Relational Model: Data is organized into tables (relations) and rows (tuples), where each tuple consists of
simple name-value pairs.
3. Limitations: Relational tuples cannot contain complex structures, such as nested records or lists, which are
common in in-memory data structures.
4. Translation Requirement: Developers must translate rich in-memory structures into a relational format for
storage, leading to added complexity.
5. Historical Context: In the 1990s, object-oriented databases emerged as a potential solution but ultimately faded
as relational databases remained dominant.
6. ORM Solutions: Object-Relational Mapping (ORM) frameworks like Hibernate and iBATIS help manage this
mismatch but can introduce their own performance issues.
7. Ongoing Challenge: Despite advancements, impedance mismatch continues to be a source of frustration for
developers.
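A minimal sketch of the mismatch (names are hypothetical): an in-memory order with a nested list of line items has no direct single-table representation and must be flattened before it can be stored relationally.
// Rich in-memory structure: an order holding a nested list of line items.
const order = {
  orderId: 99,
  customer: "Martin",
  lineItems: [
    { productId: "P123", quantity: 2 },
    { productId: "P456", quantity: 1 }
  ]
};
// To persist this relationally it must be split into flat rows, e.g.:
//   orders(order_id, customer)            -> (99, "Martin")
//   line_items(order_id, product_id, qty) -> (99, "P123", 2), (99, "P456", 1)
// ORM frameworks such as Hibernate automate this translation, at some cost.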
4. Write a short note on:
Lack of Distinction: Relational databases do not differentiate between aggregate relationships and other
relationships.
• Aggregate-Ignorant: They cannot optimize storage or distribution based on aggregate structures.
• Modeling Challenges: Existing modeling techniques often lack consistent semantics for defining aggregates.
• Cluster Efficiency: Aggregate orientation enhances data manipulation on clusters by minimizing node queries.
• Transaction Limitations: Supports atomic operations only on individual aggregates, so the application must manage any operation that spans multiple aggregates.
• Flexibility: Allows storing various types of data without strict structure limits.
• Key-Based Access: Primarily accessed through lookups based on unique keys.
5. Explain about Graph database, with neat diagram
Graph Databases
• Definition: Graph databases store data as a collection of nodes (entities) and edges (relationships) that connect
them. They are designed to handle complex interconnections between data points efficiently.
• Structure:
o Nodes: Represent entities (e.g., people, products).
o Edges: Represent relationships between nodes (e.g., "likes," "purchases").
o Attributes: Nodes and edges can have properties or attributes associated with them.
• Motivation: Graph databases emerged as a response to the limitations of relational databases, particularly in
managing highly connected data without incurring significant performance costs.
• Use Cases: Ideal for scenarios with complex relationships such as:
o Social networks
o Recommendation systems
o Fraud detection
o Knowledge graphs
• Querying:
o Graph databases excel in queries that navigate through relationships, allowing for efficient traversal.
o Queries often involve starting from a node and exploring its connections (e.g., "Find all products liked
by friends of a user").
• Performance:
o Traversal operations are optimized in graph databases, making them cheaper than relational databases,
which often require costly joins.
o The architecture is typically designed for high read performance, especially for connected data.
• ACID Transactions:
o Graph databases support ACID transactions, but a single transaction may need to span multiple nodes and edges to keep the data consistent.
6. What are Schemaless database? Explain
Schemaless Databases
Definition: Schemaless databases are a type of NoSQL database that do not require a predefined schema before storing
data. This allows for greater flexibility in data storage and structure.
Key Characteristics
1. Flexibility:
o Users can store data in any format without needing to define a rigid structure upfront.
o Records can contain varying fields, accommodating non-uniform data easily.
2. Data Storage:
o Key-Value Stores: Store data as pairs of keys and values, allowing arbitrary data under a key.
o Document Databases: Allow for unstructured documents, enabling various data structures.
o Column-Family Databases: Permit dynamic addition of columns for each record.
o Graph Databases: Enable free addition of nodes and edges.
3. Handling Non-Uniform Data:
o Schemaless databases allow records to have different fields without unnecessary nulls or meaningless
columns.
4. Implicit Schema:
o While schemaless databases do not enforce a schema, application code often relies on an implicit
schema to interpret and manipulate the data, leading to assumptions about field names and data types.
5. Decoupling of Schema and Storage:
o The schema is effectively handled in the application logic rather than within the database, which can
create challenges in understanding data structure and consistency across multiple applications.
Advantages
• Rapid Iteration: Easy to modify data storage as project requirements evolve without needing extensive schema
changes.
• Dynamic Changes: New fields can be added as needed, and obsolete fields can be ignored without impacting
existing data.
• Simplicity: Simplifies the data insertion process, particularly in early development stages.
Disadvantages
• Complexity in Data Retrieval: Without a defined schema, it can be challenging to understand the structure of
the data, necessitating deep dives into application code.
• Lack of Consistency Checks: The database cannot enforce validations based on the schema, potentially leading
to inconsistent data manipulation across different applications.
• Difficulty in Data Management: Changing how data is stored or redefining aggregates can be as complex as
it is in relational databases.
Use Cases
• Rapidly Evolving Applications: Ideal for projects where data requirements change frequently and
unpredictably.
• Handling Varied Data: Suitable for applications dealing with diverse datasets, such as content management
systems, social media platforms, and IoT applications.
Conclusion
Schemaless databases offer significant advantages in flexibility and adaptability, making them a popular choice in
dynamic development environments. However, they require careful management of implicit schemas and data integrity,
particularly when multiple applications access the same database.
7. Explain the Modeling for data access with respect to Key-value store
Modeling for Data Access in Key-Value Stores
Key-value stores provide a simple and efficient way to manage data through the use of unique keys paired with values.
When modeling data for access in key-value stores, several key considerations come into play:
1. Data Structure
• Key-Value Pair: Data is stored as pairs where each key serves as a unique identifier for the value. The value
can be a simple data type or a more complex object.
• Embedding Data: Related data can be embedded within a single value object. For instance, customer details
and their associated orders can be stored together to facilitate easy access.
2. Data Retrieval
• Reading Data: Accessing related data typically involves retrieving the entire object associated with a key. If
detailed information is required (like specific orders), the whole object needs to be fetched and processed.
3. Handling References
• Using References: To maintain relationships, separate data objects can include references to related entities.
For example, a customer object may contain references to their order IDs, enabling retrieval of all orders without
duplicating data.
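A rough sketch of that reference-based layout (keys and field names are invented for illustration):
// Customer aggregate stored under its own key, holding only order IDs.
const store = new Map();
store.set("customer:1001", {
  name: "Martin",
  orderIds: ["order:9001", "order:9002"]
});
store.set("order:9001", { total: 250, items: [{ productId: "P123", qty: 2 }] });
store.set("order:9002", { total: 90, items: [{ productId: "P456", qty: 1 }] });

// Fetching all of a customer's orders means one lookup per referenced key.
const customer = store.get("customer:1001");
const orders = customer.orderIds.map(id => store.get(id));
console.log(orders.length); // 2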
4. Denormalization for Read Optimization
• Performance: The design should prioritize read performance, often requiring the duplication or aggregation of data to limit the number of read operations.
• Simplicity: A straightforward structure is essential. Avoid overly complex nested objects to ensure easy access and manipulation of data.
• Scalability: The model should accommodate future growth, allowing for new requirements without necessitating significant changes to the existing structure.
8. Explain the following: a) Column-family store, with a neat diagram b) Emergence of NoSQL
a) Column-Family Store
Definition: Column-family stores, also known as column-family databases, are a type of NoSQL database designed to
handle large amounts of structured and semi-structured data. They organize data into column families, allowing for a
flexible schema and optimized read/write performance.
Access Patterns:
• You can access either an entire row or specific columns within that row. For example, to retrieve a customer's
name, you might use a command like get('1234', 'name').
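As a loose illustration (data and field names made up, not tied to any specific product), a row in a column-family store can be pictured as a nested map keyed first by row key, then by column family, then by column:
// Row key -> column family -> column -> value (illustrative only).
const customers = {
  "1234": {
    profile: { name: "Martin", city: "Boston" },   // "profile" column family
    orders: { "ord-99": "2011-01-15" }             // "orders" column family
  }
};
// get('1234', 'name') conceptually reads one column from one row:
const name = customers["1234"].profile.name;       // "Martin"
// New columns can be added to any row without a schema change.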
Advantages:
• Flexibility: New columns can be added to rows without requiring schema changes, making it adaptable to
evolving data needs.
• Performance: They are optimized for scenarios where reads involve accessing a few columns across many
rows, improving efficiency.
• Scalability: Designed for distributed architectures, enabling horizontal scaling and high availability.
b) Emergence of NoSQL
Definition: NoSQL refers to a broad category of databases that do not primarily use SQL as their query language. These
databases are designed to handle large volumes of diverse data types and offer high performance and scalability.
Historical Context:
• The term "NoSQL" was first used in the late 1990s by Carlo Strozzi to describe an open-source relational
database that didn’t use SQL. However, this early use did not influence the contemporary understanding of
NoSQL.
• The modern concept of NoSQL emerged from a meetup in June 2009 in San Francisco, organized by Johan
Oskarsson. This event focused on discussing new database technologies inspired by systems like Google
BigTable and Amazon Dynamo.
Key Points:
1. Initial Meetup: The meetup aimed to explore various projects that experimented with non-relational data
storage solutions.
2. Naming: The name "NoSQL" was chosen for its brevity and memorability on social media, rather than for a precise technical meaning.
MODULE-2
a) Single Server
Definition: A single-server distribution model involves running a database on one machine, handling all data reads and
writes without distributing data across multiple servers.
Advantages:
1. Simplicity: Easier to manage with no complex network issues or data consistency problems.
2. Developer-Friendly: Application developers find it simpler to work with a single server, avoiding the
challenges of distributed systems.
3. Cost-Effective: Lower operational costs compared to maintaining a cluster of servers.
4. Performance: Sufficient for applications with moderate data needs, especially if optimized.
Use Cases:
• Graph Databases: Best for handling complex relationships without the overhead of distribution.
• Document and Key-Value Stores: Suitable for applications focusing on aggregates rather than heavy concurrent access.
Conclusion: The single-server model is a practical choice for many applications, allowing organizations to keep
operations straightforward and efficient.
b) Combining Sharding and Replication
Definition: Combining sharding and replication in a distributed database helps manage data more efficiently and ensures
availability.
Sharding:
• What It Is: Dividing a large dataset into smaller pieces (shards) that are spread across multiple servers for better
performance.
Replication:
• What It Is: Keeping copies of data on different servers to ensure that if one fails, the system can still access the
data from other servers.
How They Work Together:
• The dataset is split into shards for performance, and each shard is replicated across several nodes, so every piece of data lives on more than one server.
Benefits:
1. Higher Availability: If one node fails, data is still accessible from other replicas.
2. Load Balancing: Read requests can be distributed across replicas, reducing the load on master nodes.
3. Scalability: Allows the system to grow easily by adding more nodes and shards as data increases.
Conclusion: Combining sharding and replication creates a powerful, flexible data management system that enhances
performance and ensures data availability.
2. Explain the following i. CAP theorem ii. Quorums iii. Relaxing consistency iv. Relaxing durability
i. CAP Theorem
The CAP Theorem, proposed by Eric Brewer, states that in a distributed data store, you can only guarantee two out of
three properties: Consistency, Availability, and Partition Tolerance.
• Consistency: All nodes see the same data at the same time, ensuring that every read receives the most recent
write.
• Availability: Every request received by a non-failing node must result in a response, ensuring that the system
remains operational even if some nodes fail.
• Partition Tolerance: The system continues to operate despite network partitions that separate nodes and
prevent communication.
In practice, when network partitions occur (which is common), a system must choose between sacrificing consistency
or availability, leading to a compromise that aligns with the specific needs of the application.
ii. Quorums
Quorums are a strategy used in distributed systems to ensure data consistency during read and write operations. They
specify the minimum number of nodes required to confirm an operation to maintain a certain level of consistency.
• Write Quorum (W): The number of nodes that must acknowledge a write before it is considered successful.
To ensure strong consistency, W must be greater than half the total number of replicas (N), expressed as W>N/2.
• Read Quorum (R): The number of nodes that must be contacted to read the most recent data. To guarantee a
consistent read, the sum of the read and write quorums must also exceed the total number of replicas: R+W>N.
By defining quorums, systems can minimize the risk of reading stale data or encountering write-write conflicts,
balancing the trade-offs between consistency, availability, and performance.
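A quick worked example with made-up numbers: with N = 3 replicas, choosing W = 2 and R = 2 satisfies both conditions.
// Illustrative quorum check for N = 3 replicas.
const N = 3, W = 2, R = 2;
console.log(W > N / 2);  // true: a majority must acknowledge every write
console.log(R + W > N);  // true: every read quorum overlaps the latest write quorum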
iii. Relaxing Consistency
Relaxing Consistency refers to the practice of allowing some level of inconsistency in order to improve system
performance and availability. In many systems, particularly distributed ones, achieving strong consistency can lead to
trade-offs in responsiveness or availability.
• Isolation Levels: In traditional databases, different transaction isolation levels (like read-committed or snapshot
isolation) can allow queries to access uncommitted data, thus sacrificing strict consistency for better performance.
• Domain Tolerance: Different applications have varying tolerances for inconsistency. For example, e-
commerce systems might allow temporary discrepancies in shopping cart data, as long as users can continue
interacting with the system without delays.
In some cases, sacrificing consistency enables faster operations, like allowing overbooking in hotel reservations or
merging shopping carts during checkout, provided users can review their final orders.
iv. Relaxing Durability
Relaxing Durability involves sacrificing some degree of data durability to achieve higher performance. While
durability ensures that once a transaction is committed, it will survive failures, in certain scenarios, this can lead to
unacceptable latency.
• In-Memory Operations: Systems can opt to keep data primarily in memory and periodically flush it to disk.
This improves responsiveness but risks losing data if a crash occurs before the latest changes are saved.
• Use Cases: For example, user session states may not require strict durability since losing session data is less
critical than ensuring a fast user experience. Similarly, telemetry data may prioritize capture speed over
complete durability.
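As one concrete (illustrative) knob, MongoDB allows an unacknowledged write that does not wait for the server to confirm, trading durability for latency; the collection name here is assumed.
// Unacknowledged write: the client does not wait for confirmation,
// so a crash can silently lose this insert (assumes a "sessions" collection).
db.sessions.insertOne(
  { sessionId: "abc123", lastSeen: new Date() },
  { writeConcern: { w: 0 } }
);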
3. Define Version stamps. Explain briefly about various approaches of constructing version stamps
Version Stamps
Definition:
Version stamps are metadata associated with data items that help track their versioning and changes over time. They are
crucial in distributed systems to manage consistency and conflicts, particularly when multiple nodes can update the
same data independently.
Approaches to Constructing Version Stamps:
1. Counters:
o Description: Each time a node updates data, it increments a counter and uses this value as the version
stamp.
o Use Case: Works well for master-slave replication models, where a single authoritative source controls
the versioning.
2. Timestamps:
o Description: Each update is tagged with the current time as the version stamp.
o Challenges:
▪ Difficult to ensure consistent time across distributed nodes.
▪ Cannot detect write-write conflicts effectively, making it suitable primarily for single-master
systems.
3. Version Stamp Histories:
o Description: All nodes maintain a history of version stamps, allowing them to track the relationships
between different versions.
o Use Case: Effective in distributed version control systems, where clients or servers store the history to
detect inconsistencies.
4. Vector Stamps (Vector Clocks/Version Vectors):
o Description: A vector stamp consists of a set of counters, with one counter for each node in the system.
Each node increments its own counter upon an update.
o Synchronization: Nodes synchronize their vector stamps during communication. This allows
comparison of version stamps:
▪ If all counters of one stamp are greater than or equal to another, it is the newer version.
▪ If both stamps have counters greater than the other, it indicates a write-write conflict.
o Flexibility: Missing values in the vector are treated as zero, which facilitates the addition of new nodes
without invalidating existing stamps.
4. Explain Version stamps on multiple nodes.
Version stamps are essential in managing data consistency across multiple nodes in a distributed system, particularly
in a peer-to-peer model. Here’s a concise breakdown based on the provided text:
Basic Concepts
1. Single Authoritative Source: In a master-slave setup, version stamps are straightforward as they are managed
by the master. Slaves simply follow the master's version stamps.
2. Peer-to-Peer Challenges: In a decentralized system, multiple nodes can update data independently, leading to
potential inconsistencies. When querying multiple nodes, they might return different versions of the same
data.
Managing Inconsistencies
• Version Stamps with Counters: Each node maintains a counter that increments with each update. This
allows nodes to determine which version is more recent by comparing counter values.
• Multiple-Master Cases: In scenarios where all nodes can act as masters, a more sophisticated approach is
necessary. Keeping a history of version stamps enables nodes to verify the relationships between updates (e.g.,
checking if one update is an ancestor of another).
Approaches to Versioning
1. Timestamps: While timestamps can denote the order of updates, they often fail due to time synchronization
issues across nodes and cannot effectively detect conflicts.
2. Vector Stamps: This method uses an array of counters (one for each node). For example, a vector stamp for
three nodes might look like [blue: 43, green: 54, black: 12]. Each time a node updates, it
increments its respective counter, allowing nodes to synchronize their vector stamps during communication.
• A vector stamp is considered newer if all its counters are greater than or equal to those in an older stamp.
• If two stamps contain higher values for different counters, a conflict arises (e.g., [blue: 1, green: 2,
black: 5] vs. [blue: 2, green: 1, black: 5]).
• Missing values in a vector are treated as zero (e.g., [blue: 6, black: 2] becomes [blue: 6,
green: 0, black: 2]), facilitating the addition of new nodes without disrupting the existing system.
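A minimal sketch of that comparison logic (node names and counter values are invented):
// Compare two vector stamps; missing counters are treated as zero.
function compareVectorStamps(a, b) {
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aNewer = false, bNewer = false;
  for (const n of nodes) {
    const x = a[n] || 0, y = b[n] || 0;
    if (x > y) aNewer = true;
    if (y > x) bNewer = true;
  }
  if (aNewer && bNewer) return "conflict";  // write-write conflict
  if (aNewer) return "a is newer";
  if (bNewer) return "b is newer";
  return "equal";
}

console.log(compareVectorStamps(
  { blue: 1, green: 2, black: 5 },
  { blue: 2, green: 1, black: 5 }
)); // "conflict"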
5. Explain the following a) Sharding b) Master-Slave replication c) Peer-to-peer replication
a) Sharding
Definition: Sharding is a horizontal scaling technique that involves distributing parts of a dataset across multiple servers
(shards). Each shard is responsible for a subset of the data, allowing multiple users to access different data concurrently.
Key Points:
• Load Distribution: Ideal sharding allows users to access different server nodes, balancing the load evenly
across servers.
• Aggregate Orientation: Data that is frequently accessed together is grouped into aggregates, enhancing
performance by minimizing cross-node requests.
• Data Placement: Factors such as physical location and access patterns can inform where data is stored. For
instance, data related to Boston residents can be placed in an eastern U.S. data center.
• Challenges: Sharding can complicate application logic and may require rebalancing, which involves migrating
data and updating code.
• Resilience: Sharding alone does not improve resilience, as a node failure renders its shard’s data unavailable. It
can, however, limit the impact of such failures to affected users.
• Planning: Transitioning from a single-node setup to sharding should be done early, allowing sufficient
headroom for the change without overwhelming the system.
b) Master-Slave Replication
Definition: Master-slave replication involves creating a primary node (master) that handles all writes, while one or more
secondary nodes (slaves) replicate the master’s data and can serve read requests.
Key Points:
• Scaling Reads: This model is particularly effective for read-intensive datasets, as multiple slaves can handle read requests while all writes go through the master.
c) Peer-to-Peer Replication
Definition: In peer-to-peer replication, all nodes in the system are equal, allowing each node to accept reads and writes,
thus eliminating the single point of failure associated with master-slave setups.
Key Points:
• Fault Tolerance: This model increases resilience since the failure of any single node does not prevent access
to data, improving overall system availability.
• Write Scalability: Since every node can accept writes, the system can scale horizontally to handle increased
write loads effectively.
• Consistency Challenges: A major challenge is maintaining consistency, as simultaneous writes to different
nodes can lead to conflicts (write-write conflicts).
• Conflict Resolution: Approaches to handle inconsistencies include:
o Coordinating writes to ensure no conflicts arise, which may require additional network traffic.
o Allowing inconsistent writes and later merging them based on application-specific rules or policies.
• Trade-offs: There’s a spectrum of options between strict consistency and high availability, where systems must
balance the two based on application needs.
6. What are Distribution models? Briefly explain two paths of data distribution
Distribution models refer to the ways data is spread across multiple servers or nodes in a database system. They are
crucial for handling large amounts of data and increased user traffic. By distributing data, systems can improve
performance and availability, but it can also add complexity.
1. Replication:
o What it is: This involves making copies of the same data on multiple nodes.
o Types:
▪ Master-Slave Replication: One main server (master) handles all the writing, while other
servers (slaves) replicate this data and handle reading. This helps with read traffic but is limited
by the master’s write capacity.
▪ Peer-to-Peer Replication: All servers have equal status and can handle both reads and writes.
This improves fault tolerance and scalability but can complicate data consistency.
2. Sharding:
o What it is: This involves splitting the dataset into smaller pieces (shards), each stored on different
nodes.
o How it works: Each shard contains a portion of the overall data, allowing servers to handle specific
requests. This improves performance by reducing load and contention since each server deals with only
part of the data.
Update Consistency and Read Consistency are key concepts in database management, focusing on how data is
modified and accessed when multiple users interact with it.
Update Consistency
Update Consistency ensures that when multiple users try to change the same data, the final outcome is correct.
Example:
• Scenario: Martin and Pramod both want to update the same phone number.
o Martin enters "123-456-7890," and Pramod enters "(123) 456-7890" at the same time.
Conflict:
• If Martin's update is processed first, Pramod's update will overwrite it, causing Martin's change to be lost. This
is called a lost update.
Resolution Approaches:
• Pessimistic: Use write locks to ensure only one user can update at a time, preventing conflicts.
• Optimistic: Allow both updates but check for changes before saving. If Pramod tries to save after Martin, his update is rejected because the data has changed since he read it, and he must redo it against the current value.
Read Consistency
Read Consistency ensures users see accurate data, especially during concurrent updates.
Example:
• Scenario: Martin adds a line item to his order, affecting the shipping charge. Meanwhile, Pramod reads the
order and charge.
o If Pramod reads while Martin’s update is happening, he might see outdated information, leading to a
read-write conflict.
Prevention:
• Transactions: If Martin uses a transaction for his updates, Pramod will either see all changes before or after,
avoiding inconsistencies.
Replication Issues:
• In distributed systems, different nodes might show outdated data. For example, if a hotel room gets booked, one
user might see it available while another sees it booked, leading to eventual consistency, where updates
eventually sync across all nodes.
Session Consistency:
• This ensures that once a user updates data, any further reads in the same session show the most recent updates.
MODULE 3
1. Explain with a neat diagram Partitioning and Combining in Map reduce 10M
Partitioning and Combining in MapReduce
In MapReduce, the partitioning and combining phases are essential for optimizing performance,
especially when dealing with large datasets and multiple nodes. Here's an explanation along with a
diagram of how they work:
Partitioning in MapReduce
1. Partitioning refers to the division of the output of the map tasks into distinct groups, based on
the key.
2. After the mappers process the data, the results are divided into partitions (or "buckets"), where
each partition contains results for a specific key or a set of keys.
3. These partitions are then sent to separate reducers, allowing parallel processing of the data. This
is beneficial because it means multiple reducers can process data simultaneously, increasing the
overall performance and scalability of the system.
4. The partitioning phase ensures that each reducer handles a specific subset of the data, making
the reduce function operate only on a relevant set of key-value pairs.
Combining in MapReduce
1. Combining is a technique used to reduce the amount of data that needs to be transferred
between the mappers and reducers. Often, the same key will appear multiple times in the output
of a map task, leading to redundant data transfer.
2. A combiner function is applied to reduce this redundancy by merging values for the same key
before the data is sent to the reducer.
3. The combiner function is essentially a mini-reducer that reduces the data on the mapper's side.
This reduces the amount of data being transferred across the network and speeds up the overall
process.
4. Not all reducers are combinable. A combinable reducer must produce the same kind of output
as its input. If the output structure differs (e.g., counting unique items), combining may not be
possible.
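A rough sketch of a combinable reducer for the sales example (function shapes assumed, not tied to any particular framework): because its output has the same shape as its input, it can safely run on the mapper side as a combiner.
// Map output for one product key: a list of { quantity, revenue } pairs.
function combine(pairs) {
  // Merges pairs for the same key; output has the same shape as the input,
  // which is what makes this reducer usable as a combiner.
  return [pairs.reduce(
    (acc, p) => ({ quantity: acc.quantity + p.quantity, revenue: acc.revenue + p.revenue }),
    { quantity: 0, revenue: 0 }
  )];
}

console.log(combine([{ quantity: 5, revenue: 500 }, { quantity: 3, revenue: 360 }]));
// [ { quantity: 8, revenue: 860 } ]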
2. Explain Basic Map-Reduce with a neat diagram 7M
Basic Map-Reduce
Map-Reduce is a computational model used to process large amounts of data in parallel across a
distributed cluster. It divides the computation into two main steps: the Map step and the Reduce step.
Below, we explain the basic concept of Map-Reduce with an example and a diagram.
Problem Scenario:
Let’s consider an example of a sales report system where we have orders, each containing line items.
Each line item has the following attributes:
• Product ID
• Quantity
• Price Charged
We want to calculate the total revenue for each product over the last seven days. This requires
analyzing all the orders, but since the data is sharded across multiple machines, we need to use Map-
Reduce to perform the analysis in a distributed manner.
Map Function:
• Input: The map function takes an order (or a record) as input.
• Output: The map function emits a series of key-value pairs. In this case, the key is
the Product ID, and the value is an embedded structure containing the quantity and price of
the product.
For example, if an order contains two line items for a product, the map function will emit:
• Key: Product ID (e.g., "P123")
• Value: (Quantity, Price)
These key-value pairs are emitted for each line item in every order.
Reduce Function:
• Input: The reduce function takes all key-value pairs with the same key (product ID) that were
emitted by the map function. It aggregates all values associated with that key.
• Output: The reduce function combines the quantities and prices for each product to calculate
the total quantity sold and the total revenue for that product.
For example, if the map function emits multiple values for the product ID "P123" (e.g., (5, 100), (3,
120)), the reduce function will sum the quantities and prices to get the total revenue for that product.
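A minimal sketch of those two functions in JavaScript (the record layout is assumed purely for illustration):
// Map: emit (productId -> { quantity, revenue }) for every line item of an order.
function map(order) {
  return order.lineItems.map(li => [li.productId, { quantity: li.quantity, revenue: li.quantity * li.price }]);
}

// Reduce: sum all values emitted for one product ID.
function reduce(productId, values) {
  return values.reduce(
    (acc, v) => ({ quantity: acc.quantity + v.quantity, revenue: acc.revenue + v.revenue }),
    { quantity: 0, revenue: 0 }
  );
}

const order = { lineItems: [{ productId: "P123", quantity: 5, price: 100 }, { productId: "P123", quantity: 3, price: 120 }] };
console.log(reduce("P123", map(order).map(([, v]) => v)));
// { quantity: 8, revenue: 860 }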
3. Explain two stage Map-Reduce example with a neat diagram 10M
Two-Stage Map-Reduce Example
As the complexity of calculations increases, it's often helpful to break down a Map-Reduce job
into multiple stages. This approach allows each stage to focus on specific tasks, passing the output of
one stage as input to the next, similar to the pipes-and-filters model in UNIX. Below is an explanation
and diagram of a two-stage Map-Reduce example.
Example Scenario:
We want to compare the sales of products for each month in 2011 with the sales for the same months in
2010. This requires calculating the monthly sales for each product in both 2010 and 2011 and then
comparing them.
Stage 1: Aggregating Monthly Sales Data by Product:
The first stage involves processing the raw order records to aggregate product sales for each month in
the year. Here’s how it works:
• Input: Raw order records, which contain product details like product ID, quantity, price, and
date of the order.
• Output: Key-value pairs where:
• Key: Composite key combining product ID and month (e.g., ("P123", "Jan 2011"))
• Value: The quantity and total revenue for that product in that month.
For example:
• Input Order: Product "P123" sold 5 units at $100 in January 2011.
• Map output: Key ("P123", "Jan 2011"), Value (Quantity: 5, Revenue: 500).
The map function is applied to every order, producing a key-value pair for each product in each month.
Stage 2: Comparing Sales for the Same Month in Different Years:
In the second stage, we take the output of the first stage (aggregated product sales by month) and
perform a comparison between the same months in 2011 and 2010.
• Input: The output from Stage 1, which contains product sales for each month in 2010 and 2011.
• Processing: The second-stage mapper groups records by year and month. It then populates
values for the current year (2011) and the previous year (2010) for each product.
• For records from 2011, the quantity and revenue are associated with the current year.
• For records from 2010, the quantity and revenue are associated with the prior year.
• Output: Key-value pairs for each product, where the key is the product ID, and the value is a
tuple containing the sales data for 2011 and 2010.
Final Reduce Step:
• Input: Key-value pairs from Stage 2, grouped by product ID. Each product’s value will contain
two records (one for 2011 and one for 2010).
• Processing: The reduce function takes the values for the same product from both 2010 and
2011 and computes the difference, such as the percentage change in sales between the two
years.
• Output: Final aggregated result, which shows the difference in sales for each product between
2010 and 2011.
In the Map-Reduce model, calculations are composed by structuring them around two main
tasks: Map and Reduce. Each stage of the calculation is carefully designed to work within the
constraints of the model, where:
• The Map function operates on a single aggregate (e.g., an order), producing key-value pairs.
• The Reduce function operates on all key-value pairs associated with a particular key (e.g., all
the values for a given product).
These calculations are structured to efficiently break down tasks into independent units that can be
processed in parallel. However, certain calculations, such as averages, require specific strategies for
composition because not all operations are composable in a simple manner.
Example: Calculating the Average Ordered Quantity
1. Map Step
In the map step, the function emits key-value pairs based on the input data (e.g., orders). If we're
calculating the average ordered quantity for products, each map output would be:
• Key: Product ID
• Value: The ordered quantity and a count of 1
This allows the data to be distributed across different nodes for parallel processing, where each node
handles different products independently.
2. Intermediate Output
After the map step, the output consists of key-value pairs where the key is the product ID, and the value
is a tuple containing the total quantity of orders for the product and a count of how many orders
contributed to that total.
3. Reduce Step
In the reduce step, all the values for a particular product are aggregated. The reducer:
• Sums up the total quantities and counts for that product across all mapped outputs.
• Calculates the average by dividing the total quantity by the total count.
This step can also be seen as a merge of the partial results (sum and count) and the final calculation
(average).
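A small sketch of why averages need the sum-and-count trick (shapes invented for illustration): partial averages cannot be merged, but partial sums and counts can.
// Map emits { sum, count } per product instead of a partial average.
function mapQuantity(lineItem) {
  return [lineItem.productId, { sum: lineItem.quantity, count: 1 }];
}

// Reduce (and any combiner) merges the partials; the average is computed last.
function reduceQuantity(values) {
  const total = values.reduce((a, v) => ({ sum: a.sum + v.sum, count: a.count + v.count }), { sum: 0, count: 0 });
  return total.sum / total.count;
}

console.log(reduceQuantity([{ sum: 5, count: 1 }, { sum: 3, count: 1 }, { sum: 10, count: 1 }])); // 6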
5. What are Key-Value stores? List out some popular key value database. Explain how all the data
is stored in a single bucket of a key value data store 5M/8M
Key-Value Store Explanation
In key-value stores like Riak, data is stored in buckets. A bucket is a logical grouping or container for
a set of keys and their associated values. Each key-value pair is stored in a flat namespace, meaning
each key is unique within a bucket, and its associated value can be any data type or object.
Popular Key-Value Databases:
• Riak
• Redis
• Memcached DB
• Berkeley DB
• HamsterDB
• Amazon DynamoDB
• Project Voldemort
Data Storage in a Single Bucket
In the figure, we see a bucket named userData, which contains multiple types of data.
The key is represented by the sessionID, and the value is an object that contains multiple aggregates:
1. UserProfile
2. SessionData
3. ShoppingCart, which further contains multiple CartItems.
This scenario demonstrates how a single key in a key-value store can hold multiple objects (or
aggregates) within its value. These objects are stored as part of the same key-value pair under
the sessionID key in the userData bucket. This can be convenient for storing all session-related
information in a single place but can lead to potential issues like key conflicts or difficulties in
accessing specific parts of the data if the bucket grows too large.
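A loose sketch of that layout (key and field names are invented for illustration):
// Everything for one session stored as a single value under one key.
const userData = new Map();
userData.set("sessionId:288790b8a421", {
  userProfile: { name: "Martin", language: "en" },
  sessionData: { loggedInAt: "2024-01-01T10:00:00Z" },
  shoppingCart: { cartItems: [{ productId: "P123", qty: 1 }] }
});
// One get() returns the whole aggregate; reading just the cart still
// requires fetching and parsing the entire value.
const session = userData.get("sessionId:288790b8a421");
console.log(session.shoppingCart.cartItems.length); // 1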
Managing Data in a Single Bucket
One way to manage this is to store each type of object in its own bucket (for example, separate buckets for user profiles, session data, and shopping carts), or to append the object name to the key (e.g., sessionID_userProfile), so that individual objects can be read or written without fetching the whole aggregate value.
6. List and explain any 2 features of Key-value store 5M
ANY TWO AMONG THESE
1. Consistency
• Definition: Consistency refers to how data is synchronized across all nodes in a distributed
system.
• Understanding: In a key-value store like Riak, consistency is often eventually achieved,
meaning that updates to data may not immediately reflect across all nodes but will eventually
sync. Riak provides options to handle write conflicts, like the "last write wins" or "siblings"
model, where multiple conflicting versions can be returned for client resolution.
• Trade-off: A strong consistency model can affect write performance. You can tune consistency
settings like W (number of nodes for successful write) and R (number of nodes for successful
read) for balancing between consistency and availability.
2. Transactions
• Definition: Transactions refer to ensuring that multiple operations are treated as a single unit,
either all succeeding or all failing.
• Understanding: Key-value stores generally do not provide full ACID transactions (like
relational databases), but some do offer simplified transactional features. In Riak, this is
implemented using quorum-based writes where a write is considered successful only if it’s
confirmed by a quorum (a specified number of nodes). This allows for write tolerance even if
some nodes are unavailable.
• Trade-off: While this improves availability, it can sacrifice strong consistency because not all
nodes may have the latest data at any given time.
3. Query Features
• Definition: Query features define how data can be retrieved or queried within the database.
• Understanding: Traditional relational databases allow complex queries across multiple fields,
but key-value stores are much simpler. They support querying only by key, not by other
attributes inside the value. Some key-value stores, like Riak, provide an indexing feature
(like Riak Search) to allow querying inside values. However, in most key-value stores, you
can only fetch data by the key, and ad-hoc queries are not as flexible.
• Trade-off: This makes key-value stores suitable for use cases where data is frequently accessed
via a known key, such as session data, shopping carts, or user profiles. The simplicity of key-
based queries also leads to faster retrieval times for those keys.
4. Structure of Data
• Definition: This refers to how the data is stored inside the key-value store, particularly the value
part of the key-value pair.
• Understanding: In key-value stores, the value can be any data structure, from a simple blob of
text to more complex formats like JSON or XML. The key itself is used for quick access, and
the value can be anything the application needs. For example, in Riak, the data is not
constrained to a specific format, and you can store structured data like user profiles or session
data.
• Trade-off: The lack of structure in the data format allows flexibility, but it also means the
application must understand how to interpret the value, unlike structured databases that impose
a schema.
5. Scaling
• Definition: Scaling refers to the ability of the database to handle increased loads by adding
more resources, such as nodes or servers.
• Understanding: Key-value stores often scale horizontally, meaning they distribute data across
multiple nodes. This is achieved through sharding, where each key is assigned to a specific
node based on a hashing function. As the data grows, more nodes can be added to the cluster to
maintain performance and handle higher traffic.
• Trade-off: Sharding improves performance but introduces the challenge of data availability
during node failures. Systems like Riak manage this through replication, where each piece of
data is stored on multiple nodes to ensure availability even when some nodes fail. However,
managing consistency during failures and ensuring that the right data is always accessible can
be complex.
6. Availability and Fault Tolerance
• Definition: This refers to the ability of the database to remain operational even when parts of
the system fail.
• Understanding: Key-value stores like Riak are designed to be highly available and fault-
tolerant. By replicating data across multiple nodes, the system ensures that even if one or more
nodes fail, the data remains accessible from other nodes.
• Trade-off: This replication can affect consistency (i.e., how up-to-date the data is across all
nodes), and trade-offs between availability and consistency can be adjusted based on needs (as
per the CAP Theorem).
7. Data Expiry
• Definition: Data expiry refers to the ability to automatically remove or expire data after a
certain period.
• Understanding: Some key-value stores allow setting an expiration time for a key-value pair,
such as the expiry_secs feature in Riak. This is useful for temporary data like session data or
shopping cart information.
• Trade-off: Data expiration can help keep the database clean and ensure that temporary data is
not stored longer than necessary. However, this feature may not be available in all key-value stores.
7. Elaborate the suitable use cases of Key-value store. When Key-value stores are not suitable.
Explain
When to Use Key-Value Stores
1. Storing Session Information
• Use Case: Web applications store session data (e.g., user ID, preferences) using a
unique session ID.
• Why It Works: Key-value stores allow fast retrieval of all session data with one
request. For example, Memcached or Riak can store and retrieve session data quickly.
2. User Profiles and Preferences
• Use Case: Store user settings, such as language, timezone, and preferences.
• Why It Works: Each user can have a unique key (e.g., user ID), and all their
preferences can be stored and retrieved easily in one operation.
3. Shopping Cart Data
• Use Case: E-commerce sites can store shopping cart contents tied to a user.
• Why It Works: The user's shopping cart data is stored under their user ID and can be
accessed across devices, making it fast and persistent.
When Key-Value Stores Are Not Suitable
1. Relationships Among Data
• Issue: If you need to relate data held under different keys, or correlate data between different sets of keys, key-value stores offer no direct support.
• Why Not: Any relationships must be maintained and resolved entirely in application code.
2. Multi-Operation Transactions
• Issue: If you need to perform multiple operations on different keys at once and need to ensure all succeed or
ensure all succeed or fail together, key-value stores aren't suitable.
• Why Not: Key-value stores typically can't handle multi-step transactions or rollbacks.
3. Querying by Data
• Issue: If you need to search for data based on the value (e.g., finding all users with a
certain preference), key-value stores don't support this.
• Why Not: They only allow querying by the key, not the data inside the value.
4. Operations on Multiple Keys
• Issue: If you need to perform actions across multiple keys (e.g., batch updates), key-
value stores can't do this directly.
• Why Not: Key-value stores operate on one key at a time, so any bulk operation must
be handled in the application itself.
Module 4
1. What are Document databases? Explain with an example. List and explain any 2
features of document databases 10M
A document database is a type of NoSQL database that stores, retrieves, and manages data in
the form of documents, typically in JSON, BSON, or XML formats. Each document is a self-
contained unit of data that can include fields, arrays, and nested sub-documents.
Unlike traditional relational databases (RDBMS), document databases allow for schema
flexibility. Documents within the same collection can have varying structures, eliminating the
need for a predefined schema. This makes document databases ideal for handling unstructured
and semi-structured data.
Example:
Document 1:
{
"firstname": "Martin",
"likes": ["Biking", "Photography"],
"lastcity": "Boston"
}
Document 2:
{
"firstname": "Pramod",
"citiesvisited": ["Chicago", "London", "Pune", "Bangalore"],
"addresses": [
{ "state": "AK", "city": "DILLINGHAM", "type": "R" },
{ "state": "MH", "city": "PUNE", "type": "R" }
],
"lastcity": "Chicago"
}
• Documents with differing fields (e.g., likes in Document 1 and addresses in Document
2) coexist in the same collection.
• The schema is flexible, and unused fields are simply omitted.
9.2 Features
Document databases offer a variety of powerful features. MongoDB, one of the most popular
document databases, serves as a representative example.
9.2.1 Consistency
Consistency in document databases such as MongoDB is tuned through replica sets and write concerns, which control how many nodes must acknowledge a write before it is considered successful.
Example:
• w: "majority" ensures that the write propagates to a majority of nodes in the replica
set before success is confirmed.
• Read operations can be executed from slave nodes by enabling slaveOk.
9.2.2 Transactions
Transactions are atomic at the level of a single document; finer control over write safety is provided through WriteConcern.
Example:
shopping.setWriteConcern(WriteConcern.REPLICAS_SAFE);
shopping.insert(order, WriteConcern.REPLICAS_SAFE);
• REPLICAS_SAFE ensures the data is written to the primary and at least one secondary
node.
9.2.3 Availability
Key Points:
• Applications do not need to detect or manage node failures; the MongoDB driver
handles communication with the new primary.
• Replication provides:
o Data redundancy
o Failover
o Read scaling
9.2.4 Query Features
Examples:
• In SQL, retrieving an order together with its customer and line items typically requires joins across several tables.
• MongoDB simplifies queries for nested fields, since documents are aggregate objects and nested attributes can be queried directly.
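For instance, a query on a nested field might look like this (collection and field names are assumed for illustration):
// Find orders that contain a line item for product "P123".
db.orders.find({ "lineItems.productId": "P123" });
// The equivalent SQL would join the orders and order_items tables.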
9.2.5 Scaling
Document databases scale horizontally to handle increasing data and load:
1. Read Scaling:
o Add more read slaves to a replica set.
o Distribute read queries across secondary nodes using slaveOk.
Example:
rs.add("newNode:27017");
2. Write Scaling:
o Data can be sharded (partitioned) across multiple nodes based on a shard key.
o Shards are dynamically rebalanced when new nodes are added.
Example of Sharding:
• Sharding enables distribution of data and write operations across multiple servers.
• Each shard can also be a replica set, combining scaling with high availability.
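A minimal sketch of enabling sharding in the mongo shell (database, collection, and shard key are assumed for illustration):
// Enable sharding for a database, then shard a collection on a chosen key.
sh.enableSharding("ecommerce");
sh.shardCollection("ecommerce.orders", { customerId: 1 });
// MongoDB then distributes chunks of the orders collection across shards
// and rebalances them automatically as shards are added.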
i. Consistency
Consistency in MongoDB is configured using replica sets and controlling the behavior of
write operations. Every write operation can specify how many servers must acknowledge the
write before it is considered successful, ensuring different levels of consistency.
• Replica Sets:
A replica set is a group of MongoDB nodes where data is replicated asynchronously.
Writes are sent to the primary node and then propagated to secondary nodes.
• Write Concern:
Write consistency is achieved by setting the w parameter to specify how many nodes
must acknowledge the write.
Example:
db.runCommand({ getlasterror: 1, w: "majority" });
o For three nodes, the write must be acknowledged on at least two nodes for
success.
• Read Consistency:
To improve read performance, reads can be performed from secondary nodes using
the slaveOk parameter, though this may result in slightly stale data.
Example:
DBCollection collection = getOrderCollection();
BasicDBObject query = new BasicDBObject();
query.put("name", "Martin");
DBCursor cursor = collection.find(query).slaveOk();
ii. Transactions
In traditional RDBMS, transactions involve multiple operations (insert, update, delete) that
are committed or rolled back together. Document databases, like MongoDB, primarily
support atomic transactions at the single-document level.
• Single-Document Transactions:
Writes in MongoDB are atomic at the document level. A single write operation
involving embedded documents is treated as a single unit.
• WriteConcern for Finer Control:
By using WriteConcern, MongoDB allows you to control the safety level of
writes. For example, you can ensure that a write is replicated to multiple nodes before
being acknowledged.
Example:
shopping.setWriteConcern(WriteConcern.REPLICAS_SAFE);
shopping.insert(order, WriteConcern.REPLICAS_SAFE);
Key Trade-off:
MongoDB focuses on performance and scalability over traditional multi-document
transaction guarantees.
2. Briefly explain Scaling feature of document database, with a neat diagram 10M
Scaling in a document database allows handling more load without simply migrating to a
larger server. Instead, the focus is on features within the database to horizontally scale reads
and writes.
1. Horizontal Scaling for Reads
• Scaling for heavy-read loads is achieved by adding more read slaves to a replica set.
• Reads are directed to the slave nodes to distribute the load.
• Adding new nodes can increase the read capacity without downtime.
• Example: In a 3-node replica set, adding a new slave node like mongo D improves
read performance.
rs.add("mongod:27017")
2. Horizontal Scaling for Writes
• Scaling for writes is achieved using sharding, which splits data across multiple
shards based on a shard key.
• Each shard can also be a replica set for fault tolerance and improved read
performance.
• The shard key ensures the data is evenly distributed for optimal performance.
• New shards can be added to the cluster, and MongoDB will automatically rebalance
the data across shards.
Figure 9.3. MongoDB sharded setup where each shard is a replica set
Summary: Reads scale by adding slaves to a replica set, while writes scale by sharding data across multiple shards, each of which can itself be a replica set.
3. Elaborate the suitable use cases of document database. When document databases are
not suitable. Explain
Document databases are highly flexible and can store semi-structured data without predefined
schemas, making them ideal for several scenarios where data requirements evolve over time.
Below are some suitable use cases:
9.3.1 Event Logging
• Use Case: Many enterprise applications require event logging for various types of
activities such as transactions, user actions, or system events.
• Why Suitable: Document databases can store different types of events in a central
store without enforcing a strict schema.
• Sharding: Events can be efficiently sharded based on the application's name (e.g.,
app1, app2) or event type (e.g., order_processed, customer_logged).
• Benefit: Document databases easily accommodate the changing structure of event
data over time.
9.3.2 Content Management Systems, Blogging Platforms
• Use Case: Websites and platforms for publishing articles, managing user comments,
user profiles, and other web-facing content.
• Why Suitable: Document databases can store JSON-like documents, which are ideal
for representing dynamic and hierarchical data such as:
o Blog articles
o User-generated content
o User registrations and profiles.
• Benefit: No predefined schema makes it easier to accommodate frequent content
changes.
9.3.3. Web Analytics or Real-Time Analytics
• Use Case: Storing and processing real-time analytics data like page views, unique
visitors, or user interactions.
• Why Suitable:
o Parts of documents can be updated easily.
o New metrics or fields can be added without modifying a strict schema.
• Benefit: Suitable for real-time, schema-flexible analytics where metrics evolve over
time.
9.3.4 E-Commerce Applications
• Use Case: Managing product catalogs, orders, and customer data in e-commerce
platforms.
• Why Suitable:
o E-commerce applications often require flexible schemas for dynamic product
details and evolving order structures.
o Document databases allow changes in data models without expensive data
migrations.
• Benefit: Scalability and flexibility make document databases ideal for handling large
product catalogs and customer data.
4. When Document Databases Are Not Suitable
Despite their advantages, there are certain scenarios where document databases may not be
the best solution:
Complex Transactions Spanning Different Operations
• Issue: Document databases do not inherently support atomic cross-document
operations (transactions involving multiple documents).
• Example: Applications requiring ACID transactions across different records or
collections.
• Exception: Some document databases (e.g., RavenDB) provide partial support for
complex transactions.
• Why Unsuitable: The lack of atomic operations limits use in applications requiring
high transaction consistency.
Document databases like MongoDB use JSON-like documents to store data and support
flexible queries. Below are some common example queries that demonstrate typical
operations such as CRUD (Create, Read, Update, and Delete), filtering, and aggregation.
4.1. Insert Documents (Create Operation)
db.customers.insertOne({
firstname: "John",
lastname: "Doe",
email: "[email protected]",
orders: [
{ item: "laptop", price: 1200 },
{ item: "mouse", price: 25 }
],
location: "USA"
});
4.2. Query Documents (Read Operation)
To find data in a document database, you use the find() function with optional filters.
db.customers.find();
db.customers.find({ location: "USA" });
This query retrieves all documents where the location field is "USA".
db.customers.find({
$and: [
{ location: "USA" },
{ "orders.price": { $gt: 100 } }
]
});
This finds all customers in the USA who have at least one order with a price greater than 100.
4.3. Update Documents (Update Operation)
db.customers.updateMany(
{ location: "USA" }, // Filter
{ $set: { status: "Active" } } // Add a new field 'status' with value
'Active'
);
This adds or updates the status field for all customers located in the USA.
4.4. Delete Documents (Delete Operation)
db.customers.deleteOne({ firstname: "Alice" });
Example 9: Delete multiple documents
db.customers.deleteMany({ location: "Canada" });
4.5. Aggregation
Document databases allow data aggregation to perform operations like grouping, sorting, and
counting.
db.customers.aggregate([
{ $group: { _id: "$location", count: { $sum: 1 } } }
]);
This groups customers by their location field and counts the number of customers in each
group.
db.customers.aggregate([
{ $unwind: "$orders" }, // Unwind the orders array
{ $group: { _id: "$firstname", totalSales: { $sum: "$orders.price" } } }
]);
This calculates the total sales amount for each customer based on their orders.
4.6. Sorting and Projection
db.customers.find().sort({ lastname: 1 });
Example 13: Retrieve only firstname and email fields
db.customers.find({}, { firstname: 1, email: 1, _id: 0 });
This retrieves only the firstname and email fields, excluding the _id field.
Conclusion
These example queries demonstrate the core operations supported in a document database:
1. Insert: Adding new documents.
2. Query: Retrieving documents with conditions.
3. Update: Modifying documents.
4. Delete: Removing documents.
Module 5
1. What are Graph databases? Explain with an example 10M
Graph databases are specialized databases designed to store, manage, and query data
represented as graphs. A graph consists of two primary components:
1. Nodes: Represent entities or objects (like people, products, or locations). Each node
can have properties to describe it.
2. Edges (Relationships): Represent connections or relationships between nodes. These
edges can also have properties and directionality.
In essence, a graph database is ideal for handling interconnected data and complex
relationships. Traversing relationships in graph databases is highly efficient because
relationships are persisted rather than calculated at query time.
Consider the example graph structure shown in Figure 11.1, where various entities and their relationships are represented:
Figure 11.1. An example graph structure
• Nodes (Entities):
o People: Anna, Barbara, Carol, Dawn, Elizabeth, Jill, Martin, and Pramod.
o Books: NoSQL Distilled, Refactoring, Databases, Database Refactoring.
o Company: BigCo.
• Edges (Relationships):
o Friend Relationships: For example, Barbara is a friend of Carol.
o Likes: Barbara likes NoSQL Distilled, and Elizabeth likes Databases.
o Employee: Anna and Carol are employees of BigCo.
o Author: Martin authored Refactoring, and Pramod authored Database
Refactoring and NoSQL Distilled.
o Category: The book Database Refactoring belongs to the category Databases.
1. Nodes store data entities (like objects in OOP). Each node has properties (e.g., name: Martin).
2. Edges represent relationships between nodes, such as likes, friend, or author.
Edges can have directionality and properties (e.g., "since").
3. Traversals: A query on a graph is referred to as a traversal. Traversals efficiently
navigate through nodes and relationships to answer complex queries.
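To make the idea of a traversal concrete, below is a minimal sketch using the Neo4j JavaScript driver (neo4j-driver) against the example graph; the connection URL, credentials, and the Person/Book labels are assumptions made for illustration and are not part of the original notes.
// Traversal sketch: start at Barbara, follow FRIEND edges to her friends,
// then follow LIKES edges to the books those friends like.
const neo4j = require('neo4j-driver');

// Hypothetical connection details for a local Neo4j instance.
const driver = neo4j.driver('bolt://localhost:7687', neo4j.auth.basic('neo4j', 'password'));

async function friendsLikes(name) {
  const session = driver.session();
  try {
    // Relationships are stored explicitly, so the traversal simply walks edges;
    // no joins are computed at query time.
    const result = await session.run(
      'MATCH (p:Person {name: $name})-[:FRIEND]->(f:Person)-[:LIKES]->(b:Book) ' +
      'RETURN f.name AS friend, b.title AS book',
      { name }
    );
    return result.records.map(r => ({ friend: r.get('friend'), book: r.get('book') }));
  } finally {
    await session.close();
  }
}

friendsLikes('Barbara').then(console.log).finally(() => driver.close());
Because the FRIEND and LIKES relationships are persisted as edges, the query only walks existing connections instead of computing joins at query time.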
2. Briefly describe relationships in graph databases, with a neat diagram 10M
In graph databases, relationships are fundamental components that connect nodes (entities)
to form meaningful, traversable graphs. They provide the key advantage of enabling queries
that rely on connections between data points, allowing powerful, flexible models that
closely reflect real-world scenarios.
Key Characteristics of Relationships
1. Directionality:
Relationships are directional, meaning they have:
o Start Node (where the relationship begins).
o End Node (where the relationship points to).
Example: A user can "like" a product, but the product does not "like" the user.
2. Traversability:
o Relationships can be traversed in both directions (e.g., incoming and
outgoing paths).
o This makes it easy to navigate through connected nodes.
3. Type:
o Relationships have a type that defines their meaning (e.g., FRIEND, LIKES,
EMPLOYEE_OF).
4. Properties:
o Relationships can carry their own properties (e.g., since on a FRIEND relationship, or role and hired_date on EMPLOYEE_OF).
5. Flexibility:
o New relationship types can be easily added to the graph without restructuring the entire database.
6. Queries:
o Queries can filter relationships based on properties or types, enabling rich
domain models. For example, we can query:
▪ "Find friends who became friends after 2010."
▪ "Find employees working in the Research role."
Below is an example graph structure representing relationships between nodes, similar to the figure above:
Nodes: Anna, Barbara, Carol, and Elizabeth (people) and BigCo (company).
Relationships:
1. Employee_of:
o Directional relationship from "Anna," "Barbara," and "Carol" to "BigCo."
o Contains properties such as role (e.g., "Manager," "Research") and hired_date
(e.g., "Mar 06," "Feb 04").
2. Friend:
o Bidirectional friendships exist between nodes like "Anna ↔ Barbara" and "Carol ↔ Barbara."
o Contains properties like since to denote the starting year of the friendship.
3. Share:
o Between "Barbara" and "Elizabeth," showing shared interests like books,
movies, tweets.
3. Explain scaling and application-level sharding of nodes with a neat diagram 10M
Scaling and Application-Level Sharding of Nodes in Graph Databases
Challenges in Scaling Graph Databases
1. Sharding Complexity:
o Sharding (splitting data across servers) is difficult because any node can have
relationships with any other node.
o Traversing relationships across servers can cause latency and reduce performance.
2. Memory Management:
o Graph databases benefit from storing relationships and nodes in memory for
faster traversal.
o Modern servers with large RAM allow the entire dataset to fit in memory, ensuring better performance.
Scaling Techniques
1. Vertical Scaling:
o Add more RAM to the server so that nodes and relationships fit entirely in
memory.
o Effective when the dataset is small enough to fit in a single server's memory.
2. Read Scaling Using Replicas:
o Master-Slave Replication:
▪ All writes are directed to the master node.
▪ Reads are distributed across multiple slave nodes (read-only replicas).
▪ Proven in systems like MySQL to improve read performance and
availability.
3. Sharding for Large Datasets:
o When datasets are too large for replication, application-level sharding is
used.
o Nodes are split across multiple servers based on domain-specific knowledge
(e.g., geographical location).
o Example: North America nodes are on one server, and Asia nodes are on
another.
Application-Level Sharding
Application-level sharding involves splitting nodes across servers based on specific criteria defined at the application level. The application manages the logic for querying data across these shards.
Example:
1. Nodes related to North America are stored on Server A.
2. Nodes related to Asia are stored on Server B.
3. Relationships between nodes are kept localized to minimize cross-server traversal (see the routing sketch below).
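A minimal sketch of this routing logic in application code is shown below; the shard URIs and the routeToShard helper are hypothetical names that only illustrate keeping the shard choice at the application level.
// Application-level sharding sketch: the application decides which server
// holds a node based on a domain-specific key (here, geographic region).
const shards = {
  northAmerica: 'bolt://server-a:7687',   // hypothetical Server A
  asia:         'bolt://server-b:7687'    // hypothetical Server B
};

// Map a node's city to the shard that stores it.
function routeToShard(city) {
  if (['LA', 'Chicago', 'NY'].includes(city)) return shards.northAmerica;
  if (['Mumbai', 'Xian', 'Singapore', 'Jakarta'].includes(city)) return shards.asia;
  throw new Error('Unknown city: ' + city);
}

// Queries about Chicago go to Server A, queries about Mumbai go to Server B.
console.log(routeToShard('Chicago'));  // bolt://server-a:7687
console.log(routeToShard('Mumbai'));   // bolt://server-b:7687
The database itself is unaware of this split; every query first passes through the routing step so that traversals stay within a single shard whenever possible.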
Diagram Explanation
The diagram below illustrates application-level sharding using two geographical regions:
1. North America (Server A)
2. Asia (Server B)
Nodes:
• North America: LA, Chicago, NY.
• Asia: Mumbai, Xian, Singapore, Jakarta.
Relationships:
• Supplier, Reseller, Distributor, and Warehouse relationships exist within each
shard.
• Cross-region relationships are minimized to optimize performance.
4. Explain some suitable use cases of graph databases and describe when we should not
use graph databases
Suitable Use Cases of Graph Databases and When Not to Use Them
Suitable Use Cases of Graph Databases
Graph databases are highly effective for scenarios where the relationships between data
points are critical. Here are the most suitable use cases:
C
1. Connected Data
Graph databases are ideal when the data is highly interconnected, and relationships are just
as important as the data itself.
• Social Networks:
o Example: Users (nodes) and their relationships (edges), such as friends,
followers, and likes.
o Use Case: Analyze how people are connected or identify mutual friends (e.g.,
Facebook, LinkedIn).
• Organizational Networks:
o Example: Employees, teams, and departments as nodes, connected by reporting and collaboration relationships.
o Use Case: Model reporting lines or find who works with whom across an organization.
2. Routing, Dispatch, and Location-Based Services
Graph databases can optimize routing problems where locations and distances are key factors.
• Delivery Routing:
o Nodes represent delivery points (addresses), and relationships include properties like distance or delivery time.
o Use Case: Determine the shortest route for deliveries.
• Location-Based Recommendations:
o Example: Restaurants or stores as nodes.
o Use Case: Recommend nearby services (e.g., “Best restaurant within 2
kilometers”).
3. Recommendation Engines
Graph databases can analyze patterns in relationships to provide personalized
recommendations.
• E-Commerce Recommendations:
o Example: “People who bought this item also bought that item.”
o Use Case: Suggest products based on user behavior (a query sketch follows this list).
• Travel Recommendations:
o Example: “People visiting Barcelona often visit Gaudi's landmarks.”
• Fraud Detection:
o Relationships between transactions can be analyzed to find suspicious
patterns.
o Use Case: Detect anomalies such as missing co-purchased items or irregular
transactions.
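A hedged sketch of the co-purchase recommendation above, expressed as a graph traversal through the Neo4j JavaScript driver; the Customer and Product labels and the BOUGHT relationship are illustrative assumptions, not part of the original notes.
// "Customers who bought this product also bought ..." traversal sketch.
const neo4j = require('neo4j-driver');
const driver = neo4j.driver('bolt://localhost:7687', neo4j.auth.basic('neo4j', 'password'));

async function alsoBought(product) {
  const session = driver.session();
  try {
    const result = await session.run(
      'MATCH (:Product {name: $product})<-[:BOUGHT]-(c:Customer)-[:BOUGHT]->(other:Product) ' +
      'WHERE other.name <> $product ' +
      'RETURN other.name AS suggestion, count(c) AS buyers ' +
      'ORDER BY buyers DESC LIMIT 5',
      { product }
    );
    return result.records.map(r => r.get('suggestion'));
  } finally {
    await session.close();
  }
}

alsoBought('laptop').then(console.log).finally(() => driver.close());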
4. Knowledge Graphs
Graph databases are effective for building knowledge graphs to model complex, interrelated
domains.
• Example:
o Nodes: Concepts, events, or entities.
o Edges: Relationships between those concepts (e.g., “Paris is the capital of
France”).
• Use Case: Search engines (e.g., Google Knowledge Graph) answering user queries directly.
When Not to Use Graph Databases
While graph databases are powerful, they are not suitable for all scenarios. Below are some situations where graph databases may not be the best choice:
1. Updating All (or Most) Entities
Graph databases struggle with scenarios where large-scale updates must be performed on all nodes or relationships.
• Example: Updating a global property (e.g., “Add a discount flag to all products”).
• Issue: Traversing every node can be slow and inefficient.
2. Bulk Analytical Operations
Graph databases are not optimized for large-scale bulk operations that process entire datasets.
• Example: Performing aggregations, summations, or bulk updates across millions of records.
• Alternative: Relational or columnar databases are better suited for these use cases.
3. Simple, Tabular Data with Few Relationships
When the data is flat and relationships are few, a graph model adds complexity without benefit.
• Example: A simple e-commerce store with customers and orders stored in flat tables.
• Alternative: Use relational databases like MySQL or PostgreSQL for efficient performance.
4. Small Datasets
For datasets with very few nodes and relationships, the benefits of graph databases diminish.