4unit NoSQL
CAP Theorem
The CAP theorem, proposed by Eric Brewer, states that a distributed data
system cannot simultaneously guarantee all three of the properties listed
below. At most two can be fully attained at once; the third is always
compromised. The system requirements should determine which two properties
are prioritized over the third:
1. Consistency (C): Every read receives the most recent write or an error.
2. Availability (A): Every request receives a non-error response, with no
guarantee that it reflects the most recent write.
3. Partition Tolerance (P): The system continues to operate despite network
partitions.
Trade-offs
According to the theorem:
CP Systems: Prioritize Consistency and Partition Tolerance. Availability
may be sacrificed while a network partition lasts.
o Examples: MongoDB, HBase.
o Use case: Banking systems, where consistent data is critical.
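The CP trade-off can be illustrated with a small sketch. This is a toy model, not how MongoDB or HBase are implemented: the class names `Replica` and `CPRegister` and the majority rule used here are assumptions made for illustration. The register accepts a read or write only while a majority of replicas is reachable, so it stays consistent but becomes unavailable on the minority side of a partition.

```python
class Replica:
    """One copy of a single stored value (hypothetical toy model)."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0


class CPRegister:
    """Serves a request only if a majority of replicas is reachable."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.reachable = {r.name for r in replicas}

    def partition(self, unreachable_names):
        # Simulate a network partition that cuts off some replicas.
        self.reachable -= set(unreachable_names)

    def _quorum(self):
        live = [r for r in self.replicas if r.name in self.reachable]
        if len(live) <= len(self.replicas) // 2:
            # Consistency is kept by giving up availability.
            raise RuntimeError("no majority reachable: request rejected")
        return live

    def write(self, value):
        live = self._quorum()
        version = max(r.version for r in live) + 1
        for r in live:
            r.value, r.version = value, version

    def read(self):
        live = self._quorum()
        # Any two majorities overlap, so the highest version is the latest write.
        return max(live, key=lambda r: r.version).value


store = CPRegister([Replica("a"), Replica("b"), Replica("c")])
store.write("balance=100")
store.partition(["c"])       # a and b still form a majority
store.write("balance=80")
print(store.read())          # balance=80
store.partition(["b"])       # only a is left: no majority
try:
    store.read()
except RuntimeError as err:
    print(err)               # no majority reachable: request rejected
```

An AP-style design would instead answer from the single reachable replica and accept that the value might be stale.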
Aggregate Model
In a NoSQL document store (e.g., MongoDB), the same data can be
represented as:
{
  "OrderID": "12345",
  "CustomerID": "789",
  "OrderDate": "2024-12-12",
  "Products": [
    { "ProductID": "001", "Name": "Laptop", "Price": 1000 },
    { "ProductID": "002", "Name": "Mouse", "Price": 50 }
  ]
}
Here, all related data (Order and Products) are grouped into a single
aggregate document.
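As a minimal sketch of what "no joins" means in practice, the order document above can be parsed and used as one unit. The snippet uses only Python's standard `json` module rather than a MongoDB driver, and the in-memory dictionary `orders_by_id` is a hypothetical stand-in for a document collection.

```python
import json

order_json = """
{
  "OrderID": "12345",
  "CustomerID": "789",
  "OrderDate": "2024-12-12",
  "Products": [
    { "ProductID": "001", "Name": "Laptop", "Price": 1000 },
    { "ProductID": "002", "Name": "Mouse", "Price": 50 }
  ]
}
"""

# The whole aggregate is loaded with a single lookup by its key.
order = json.loads(order_json)
orders_by_id = {order["OrderID"]: order}

# The products travel with the order, so no join is needed.
fetched = orders_by_id["12345"]
total = sum(product["Price"] for product in fetched["Products"])
names = [product["Name"] for product in fetched["Products"]]

print(total)   # 1050
print(names)   # ['Laptop', 'Mouse']
```

In a relational design the same total would typically require joining an orders table with an order-items table.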
Advantages of Aggregate Models
Simplified data access (no joins).
Efficient storage and retrieval for distributed systems.
Flexibility to store nested and hierarchical data.
Key Features:
Nodes: Represent entities (e.g., people, places, things).
Edges: Represent relationships between nodes (e.g., "friends with", "likes").
Properties: Key-value pairs associated with nodes and edges to store metadata or
attributes.
Traversal: Enables efficient querying of relationships, such as shortest paths or network
connections.
Schema-less: Flexible data models allow for dynamic and evolving schemas.
Graph databases are ideal for applications such as social networks, recommendation engines,
fraud detection, and network management, where relationships are as important as the
entities themselves.
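The node, edge, property, and traversal ideas can be sketched in plain Python. This is not the Neo4j API: the sample people, the `FRIENDS_WITH` relationship type, and the helper `shortest_path` are assumptions chosen for illustration. The traversal is a breadth-first search, which returns a path with the fewest hops.

```python
from collections import deque

# Nodes carry properties as key-value pairs.
nodes = {
    "alice": {"label": "Person", "city": "Pune"},
    "bob": {"label": "Person", "city": "Delhi"},
    "carol": {"label": "Person", "city": "Mumbai"},
    "dave": {"label": "Person", "city": "Chennai"},
    "erin": {"label": "Person", "city": "Pune"},
}

# Edges are relationships between two nodes, also with properties.
edges = [
    ("alice", "bob", {"type": "FRIENDS_WITH"}),
    ("bob", "carol", {"type": "FRIENDS_WITH"}),
    ("carol", "dave", {"type": "FRIENDS_WITH"}),
    ("alice", "erin", {"type": "FRIENDS_WITH"}),
    ("erin", "dave", {"type": "FRIENDS_WITH"}),
]

# Build an undirected adjacency list so traversal follows edges directly.
adjacency = {name: [] for name in nodes}
for source, target, _properties in edges:
    adjacency[source].append(target)
    adjacency[target].append(source)


def shortest_path(start, goal):
    """Breadth-first search; returns a fewest-hop path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in adjacency[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_path("alice", "dave"))   # ['alice', 'erin', 'dave']
```

A graph database stores relationships so that this kind of hop-by-hop traversal does not require repeated join operations.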
Applications:
Social Networks: Modelling users, connections, and interactions.
Recommendation Engines: Suggesting products, content, or friends based on relationships.
Fraud Detection: Identifying unusual patterns in financial transactions or networks.
Knowledge Graphs: Organizing and connecting domain-specific knowledge.
Neo4j has become one of the most widely used graph databases due to its robust features, ease
of use, and extensive ecosystem of tools and integrations.
5. What are the distribution models? Briefly explain two paths of data
distribution.
2. Sharding (Partitioning)
Sharding involves splitting the database into smaller, independent pieces
called shards, with each shard containing a subset of the data.
Characteristics:
o Horizontal Partitioning: Data is distributed based on specific criteria
(e.g., a range of IDs or a hash of keys).
o Key-Based Access: Each shard manages its data independently and is
identified by a shard key.
Advantages:
o Scalability: Enables horizontal scaling by adding more nodes to the
system.
o Efficient Resource Utilization: Each shard handles only a subset of
the data, reducing the load on individual nodes.
o Cost-Effectiveness: Allows scaling with commodity hardware.
Disadvantages:
o Complex Querying: Queries spanning multiple shards may require
additional coordination and can reduce performance.
o Rebalancing Challenges: Adding or removing shards requires
redistributing data, which can be complex.
o Single Point of Failure: If a shard is not replicated, its failure makes
that subset of the data unavailable, and a central shard map or router
without redundancy can disrupt the entire system.
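Hash-based sharding can be sketched as follows. The class `ShardedStore` and its methods are hypothetical names, each shard is modelled as an in-memory dictionary rather than a separate node, and the choice of SHA-256 for the hash is an assumption made only to get a stable, evenly spread value.

```python
import hashlib


class ShardedStore:
    """Routes each key to one shard by hashing the shard key."""

    def __init__(self, shard_count):
        # Each dict stands in for an independent shard (node).
        self.shards = [dict() for _ in range(shard_count)]

    def shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        # Key-based access touches exactly one shard.
        return self.shards[self.shard_for(key)].get(key)

    def scan(self, predicate):
        # A query that is not keyed must visit every shard.
        return [
            value
            for shard in self.shards
            for value in shard.values()
            if predicate(value)
        ]


store = ShardedStore(shard_count=4)
for user_id in range(100):
    store.put(f"user:{user_id}", user_id)

print(store.get("user:42"))                      # 42
print([len(shard) for shard in self.shards] if False else
      [len(shard) for shard in store.shards])    # keys per shard
```

Because the shard index depends on the number of shards, changing `shard_count` in this scheme remaps many keys, which is one reason rebalancing is listed as a challenge above.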
6. Write short notes on
a. Single Server
A Single Server refers to a database deployment model where the entire
database system, including both the application and its data, is hosted on
a single server (machine). This is the simplest setup for database
management, where both the database engine and all related resources
(e.g., storage, computation, and network) are managed in one place.
Characteristics:
1. Centralized Architecture:
o All operations, including database management, are handled by a
single physical machine or server.
2. Simple Deployment:
o Setting up a single server is easy and requires minimal configuration,
making it suitable for small applications or early-stage development.
3. Limited Scalability:
o The server can handle only a limited amount of data and requests.
Performance bottlenecks may occur as the workload grows, leading to
issues like slow queries or reduced reliability.
4. Resource Contention:
o Since all database processes and application functions run on the same
server, there may be contention for resources such as CPU, memory,
and disk space, especially under heavy load.
Advantages:
1. Cost-Effective:
o Initial setup and maintenance are cheaper because only one server is
required.
2. Simplicity:
o No need for complex configurations like clustering or replication,
making it easy to manage.
3. Low Latency:
o As the database and application run on the same machine, data access
is fast due to minimal network overhead.
Disadvantages:
1. Limited Scalability:
o Performance is constrained by the capabilities of the single server.
Scaling requires moving to more complex architectures, like adding
more servers.
2. Single Point of Failure:
o If the server fails (e.g., due to hardware issues or crashes), the entire
system becomes unavailable, risking data loss or downtime.
3. Resource Constraints:
o As data volume or user load increases, the server may struggle to keep
up, leading to slower performance or crashes.
Mismatch Problems:
1. Relational to Object Mapping: The relational database stores data in rows
and columns, while objects in OOP have properties and methods.
Converting a database row into an object (or vice versa) can be complex, as
relational data does not inherently have behaviors (methods).
For example, converting the database row of Alice (EmployeeID = 1, Name
= Alice, Department = HR) into an instance of the Java Employee class
requires mapping the columns to the object’s properties. Similarly, when
saving an object into the database, the object's fields and the relationships
between objects have to be translated into columns, tables, and foreign keys;
its methods have no relational counterpart.
2. Normalization and Relationships: In relational databases, data is often
normalized to reduce redundancy, meaning data might be split across
multiple tables (e.g., a separate Department table). In OOP, you might want
to model relationships as objects (like a Department object inside an
Employee object). This requires additional logic to handle the mapping.
3. Inheritance: Object-oriented programming often uses inheritance, where a
subclass inherits properties and behaviors from a parent class. Relational
databases do not have a direct way to model this concept. For instance, if
you have an Employee class and a subclass Manager in Java, translating this
hierarchy to relational tables (which are flat) is not straightforward.
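The first two mismatch problems can be made concrete with a hand-written mapping. The sketch is in Python with the standard `sqlite3` module rather than Java, and the table layouts, the `DepartmentID` value 10, and the names `load_employee` and `badge` are assumptions introduced for illustration; only the Alice row (EmployeeID = 1, Department = HR) comes from the example above.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Department:
    department_id: int
    name: str


@dataclass
class Employee:
    employee_id: int
    name: str
    department: Department   # a nested object, not a foreign key value

    def badge(self):
        # Behavior lives on the object; the table has nothing equivalent.
        return f"{self.name} ({self.department.name})"


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department (DepartmentID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Employee (
    EmployeeID INTEGER PRIMARY KEY,
    Name TEXT,
    DepartmentID INTEGER REFERENCES Department(DepartmentID)
);
INSERT INTO Department VALUES (10, 'HR');
INSERT INTO Employee VALUES (1, 'Alice', 10);
""")


def load_employee(connection, employee_id):
    """Join the normalized tables and rebuild the object graph by hand."""
    row = connection.execute(
        "SELECT e.EmployeeID, e.Name, d.DepartmentID, d.Name "
        "FROM Employee e JOIN Department d "
        "ON d.DepartmentID = e.DepartmentID "
        "WHERE e.EmployeeID = ?",
        (employee_id,),
    ).fetchone()
    if row is None:
        return None
    return Employee(row[0], row[1], Department(row[2], row[3]))


alice = load_employee(conn, 1)
print(alice.badge())   # Alice (HR)
```

Every class needs mapping code of this kind in both directions, which is the repetitive work the mismatch creates.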
Solutions to Impedance Mismatch:
1. Object-Relational Mapping (ORM): ORM frameworks (such as Hibernate
in Java, Entity Framework in .NET, or Django ORM in Python) help bridge
the impedance mismatch by automatically handling the conversion between
objects and relational tables. These frameworks map object properties to
columns in a table and provide a mechanism to persist objects to a database
and retrieve them back as objects.
2. Data Transfer Objects (DTOs): Instead of working directly with domain
objects, one can use data transfer objects to explicitly separate the concerns
of business logic and database access. This allows different data structures
for the database and the application while maintaining a clear separation
between them.
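A DTO can be sketched as a flat, immutable record that mirrors what is stored, kept separate from the domain object that carries behavior. The names `EmployeeDTO`, `to_dto`, `from_dto`, and `transfer` are hypothetical, and Python dataclasses are used here only as one convenient way to express the idea.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EmployeeDTO:
    """Flat data carrier shaped like the database row; no behavior."""
    employee_id: int
    name: str
    department: str


class Employee:
    """Domain object holding the business logic."""

    def __init__(self, employee_id, name, department):
        self.employee_id = employee_id
        self.name = name
        self.department = department

    def transfer(self, new_department):
        self.department = new_department


def to_dto(employee):
    # Domain object -> DTO, ready to hand to the data access layer.
    return EmployeeDTO(employee.employee_id, employee.name, employee.department)


def from_dto(dto):
    # DTO read from storage -> domain object used by business logic.
    return Employee(dto.employee_id, dto.name, dto.department)


alice = from_dto(EmployeeDTO(1, "Alice", "HR"))
alice.transfer("Finance")
print(asdict(to_dto(alice)))
# {'employee_id': 1, 'name': 'Alice', 'department': 'Finance'}
```

The database layer sees only `EmployeeDTO`, so the domain class can change its internal structure without affecting the stored shape.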