DBMS Unit4

Master of data science certified course, Database Management System notes

Uploaded by

girab87633

Course: MSc DS

Advanced Database Management Systems

Module: 4
Learning Objectives:

1. Understand Cassandra and Neo4j's core principles and

architectures.

2. Learn optimization techniques for column-family and

graph databases.

3. Identify ideal applications for both Cassandra and Neo4j.

4. Contrast Cassandra and Neo4j's scalability and

consistency models.

5. Engage in hands-on exercises and case studies.

6. Evaluate the pros and cons of Cassandra vs. Neo4j for

specific needs.

Structure:

4.1 Cassandra: A High-performance Distributed Database

4.2 Neo4j: Navigating the World of Graph Databases

4.3 Comparison Between Cassandra and Neo4j

4.4 Summary
4.5 Keywords

4.6 Self-Assessment Questions

4.7 Case Study

4.8 References

4.1 Cassandra: A High-performance Distributed Database

Cassandra is a distributed, NoSQL database designed for

scalability and high availability without compromising on

performance. It supports clusters that span multiple data centres,

making it highly fault-tolerant. Instead of being based on the

relational model, Cassandra is based on the column-family data

model, which offers more flexibility than traditional databases in

terms of how data is stored and queried.

Brief History of Cassandra

● Origin: Developed at Facebook to power its Inbox Search feature.

● Open-sourced: Facebook released it as an open-source project in July 2008.
● Apache Incubation: By 2009, it became part of the Apache

Incubator.

● Graduation: Became a top-level Apache project in 2010.

Over the years, Cassandra has gained significant traction due to its robustness, scalability, and flexibility, making it a common choice for large-scale, write-heavy applications.

Core Architecture and Principles

● Decentralised Architecture: Every node in the cluster is

identical. There's no single point of failure, making it highly

resilient.

● Scalability: Cassandra scales out by adding more nodes to

the cluster. Each node is capable of serving read and write

requests.

● Data Replication: Data is automatically replicated across

multiple nodes. Replication strategies can be configured

based on needs.

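● Consistency Tuning: Offers tunable consistency on a per-operation basis. Each read or write can request a consistency level such as ONE, QUORUM, or ALL.

● Partitioning: Data is distributed across the cluster by hashing each row's partition key. A partition key with many distinct values keeps the distribution even.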
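
The partitioning and replication principles above can be sketched with a toy hash ring. This is a minimal illustration and not Cassandra's actual implementation: the `Ring` class, the `token` helper, the node names, and the use of MD5 are all assumptions made for the example (Cassandra's default partitioner and its virtual nodes work differently in detail). It shows a partition key being hashed to a position on the ring and copied to the next few nodes clockwise.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a ring position (MD5 chosen only for determinism)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical ring with one token per node."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        # Sort by token so the ring can be walked clockwise.
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, partition_key: str):
        """Return the nodes that hold a copy of the given partition."""
        positions = [t for t, _ in self.tokens]
        start = bisect_right(positions, token(partition_key)) % len(self.tokens)
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
```

Because every node can compute `replicas` on its own, any node can route a request, which is one way to picture the "no single point of failure" property.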

Key Features and Benefits of Cassandra

● Linear Scalability: Allows adding more hardware to increase

throughput and accommodate larger datasets.

● Fault Tolerance: Data is replicated across multiple machines,

ensuring no single point of failure.

● Multi-Datacenter Replication: Supports replication across

geographically dispersed data centres.

● Flexible Data Storage: Can handle structured, semi-structured, and unstructured data.

● Fast Writes: Designed to write data quickly and handle high

write loads.

● CQL (Cassandra Query Language): Provides a familiar SQL-like interface for developers.

Data Modeling in Cassandra

Unlike relational databases, Cassandra's data modelling is centred

around the query-first approach, where the queries define the


structure of the data model.

Fundamental Concepts: Columns, Column Families, and

Keyspaces

● Columns: The smallest unit of data in Cassandra. Comprises

a name, value, and a timestamp.

● Column Families: Analogous to tables in RDBMS. They

consist of rows of related data. Each row is uniquely

identified by a key and contains one or more columns.

● Keyspaces: The highest level of data container. Similar to a

database in RDBMS, keyspaces group column families and

define data replication strategies.
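
As a rough picture of how these containers nest, the sketch below models a keyspace holding column families, which hold rows, which hold timestamped columns. The class names and the integer timestamps are assumptions made for illustration. The `write` method keeps the column with the newest timestamp, which mirrors the last-write-wins idea behind storing a timestamp with every column.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Column:
    name: str
    value: Any
    timestamp: int  # write time; plain integers in this sketch


@dataclass
class Row:
    key: str
    columns: Dict[str, Column] = field(default_factory=dict)

    def write(self, column: Column) -> None:
        """Keep the incoming column only if it is newer than the stored one."""
        current = self.columns.get(column.name)
        if current is None or column.timestamp > current.timestamp:
            self.columns[column.name] = column


# keyspace -> column family name -> row key -> Row
keyspace: Dict[str, Dict[str, Row]] = {"users": {}}
row = keyspace["users"].setdefault("u1", Row("u1"))
row.write(Column("email", "old@example.com", timestamp=100))
row.write(Column("email", "new@example.com", timestamp=200))
row.write(Column("email", "stale@example.com", timestamp=150))  # arrives late, ignored
```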

Designing Schemas: Primary Key and Clustering Columns

● Primary Key: Consists of a partition key and zero or more

clustering columns. It uniquely identifies data within a table.

● Partition Key: Determines how data is distributed across the

nodes of the cluster.

● Clustering Columns: Used to sort data within a partition.


They determine the order in which related rows are stored.
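
The roles of the partition key and the clustering column can be illustrated with a small in-memory table. This is a sketch under simplifying assumptions: the `Table` class and the sensor data are hypothetical, and the rows are sorted at read time, whereas Cassandra keeps them sorted in storage.

```python
from collections import defaultdict


class Table:
    """Toy table whose primary key is (partition key, clustering key)."""

    def __init__(self):
        # partition key -> {clustering key: row}
        self.partitions = defaultdict(dict)

    def insert(self, partition_key, clustering_key, row):
        # Writing the same primary key again overwrites the row (an upsert).
        self.partitions[partition_key][clustering_key] = row

    def read_partition(self, partition_key, start=None, end=None):
        """Return rows of one partition in clustering order, optionally sliced."""
        rows = self.partitions.get(partition_key, {})
        return [
            (ck, rows[ck])
            for ck in sorted(rows)
            if (start is None or ck >= start) and (end is None or ck <= end)
        ]


readings = Table()
readings.insert("sensor-1", 30, {"temp": 21.5})
readings.insert("sensor-1", 10, {"temp": 20.0})
readings.insert("sensor-1", 20, {"temp": 20.7})
readings.insert("sensor-2", 10, {"temp": 18.2})
readings.insert("sensor-1", 20, {"temp": 20.9})  # same primary key: overwrites
window = readings.read_partition("sensor-1", start=15, end=30)
```

A range query such as `window` touches only one partition and returns rows already in order, which is why the choice of clustering column follows from the query.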

Best Practices in Cassandra Data Modeling

● Query First: Always design your schema around your queries,

not the other way around.

● Denormalize: With no JOIN operations, it's often better to

denormalize data for efficient querying.

● Use Compound Keys: Make the best use of compound

primary keys for fine-grained control over data distribution

and access.

● Limit Wide Rows: While Cassandra supports wide rows, a partition that grows without bound becomes slow to read and maintain. Keep partition size bounded, for example by adding a time bucket to the partition key.
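
One common way to apply the compound-key and wide-row advice is to add a time bucket to the partition key. The sketch below is only an illustration: the function names, the one-day bucket size, and the sensor readings are assumptions, and the right bucket size depends on the write rate.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucketed_partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Compound partition key: one partition per sensor per day."""
    return (sensor_id, ts.strftime("%Y-%m-%d"))


partitions = defaultdict(list)


def record(sensor_id: str, ts: datetime, value: float) -> None:
    # Each reading lands in the partition for its sensor and day.
    partitions[bucketed_partition_key(sensor_id, ts)].append((ts, value))


record("sensor-1", datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc), 20.1)
record("sensor-1", datetime(2024, 1, 2, 0, 1, tzinfo=timezone.utc), 20.3)
record("sensor-1", datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc), 22.8)
```

With this key, a sensor that reports for years produces many bounded partitions instead of one ever-growing row. The trade-off is that a query spanning several days must read several partitions.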

Denormalization, Materialised Views, and Secondary Indexes

● Denormalization: A trade-off in Cassandra to achieve high read performance. Data might be duplicated to avoid JOIN-like operations.

● Materialised Views: Allow for server-side computed views

of data. Useful for representing data in multiple structures


to support different queries.

● Secondary Indexes: Useful for querying data that is not

identified by the primary key. However, they come with a

performance cost and should be used judiciously.
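
Denormalization can be pictured as keeping one table per query and writing every record to each of them. The sketch below uses plain dictionaries as stand-ins for tables. The table names, the `place_order` helper, and the order data are hypothetical.

```python
from collections import defaultdict

# One "table" per query, each keyed by what that query filters on.
orders_by_customer = defaultdict(list)  # query: all orders for a customer
orders_by_product = defaultdict(list)   # query: all orders for a product


def place_order(order_id: str, customer: str, product: str, qty: int) -> None:
    """Write the same order to both tables so each read touches one key."""
    order = {"order_id": order_id, "customer": customer,
             "product": product, "qty": qty}
    orders_by_customer[customer].append(order)
    orders_by_product[product].append(dict(order))  # duplicated copy


place_order("o1", "alice", "laptop", 1)
place_order("o2", "bob", "laptop", 2)
place_order("o3", "alice", "mouse", 1)
```

The application pays for the duplication at write time and must keep the copies in step. A materialised view moves that bookkeeping to the server, and a secondary index avoids the second copy at the cost of slower reads.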

4.2 Neo4j: Navigating the World of Graph Databases

Neo4j, at its essence, is a leading graph database management

system that emphasises the storage and retrieval of data in the

form of graphs. Unlike traditional relational databases that rely on

rows, columns, and structured tables, Neo4j uses nodes,

relationships, and properties to represent and store data. This

enables a more intuitive representation of complex and

interconnected data structures, often found in real-world

scenarios like social networks, recommendation systems, and

fraud detection.

Origin and Evolution of Neo4j

● Early Beginnings: Neo4j started its journey in the early


2000s when its founders, Emil Eifrem and Johan Svensson,

experienced the limitations of relational databases in

handling graph data for a professional network platform

project. Realising the potential of graph databases, they

embarked on the development of Neo4j.

● First Release: The initial public release of Neo4j came in

2007. It was touted as a revolutionary approach to handle

data relationships and their complexities.

● Evolution: Over the years, Neo4j has evolved, not only in terms of performance and scalability but also in terms of its ecosystem. With the introduction of Cypher, its declarative query language, Neo4j brought forward a powerful tool to express and retrieve graph patterns.

● Open Source: While Neo4j started as an open-source project, it has spawned an enterprise version that caters to larger and more complex use cases. This duality has enabled it to find traction in both the developer community and large-scale enterprises.
Graph Databases: A Paradigm Shift

As data grows exponentially, the interconnections within it also

become more intricate. Traditional relational databases, although

robust and reliable, sometimes falter in the face of intricate

relational data. Enter graph databases:

● Data Representation: Graph databases perceive data as

nodes (entities) and edges (relationships), more accurately

reflecting real-world systems and their interconnectivity.

● Relationship-Centric: Unlike RDBMS where relationships are

inferred through joins, graph databases inherently store

relationships. This ensures rapid traversal and retrieval of

interconnected data.

● Flexibility: Graph databases such as Neo4j do not require a fixed schema up front, which provides greater flexibility in terms of evolving data models without significant architectural changes.

● Performance: When dealing with relationship-heavy data,

graph databases often outperform their relational

counterparts, especially in traversal operations.


Features and Advantages of Neo4j

● Cypher Query Language: Cypher offers an expressive and

efficient way to query and manipulate graph data. Its

pattern-matching capabilities are particularly powerful for

complex graph traversals.

● ACID Compliance: Neo4j maintains the traditional database

integrity by adhering to Atomicity, Consistency, Isolation,

and Durability principles.

● Built-in Algorithms: Neo4j provides a host of built-in graph

algorithms that cater to operations like pathfinding,

centrality, and community detection.

● Scalability: With clustering and replication, and sharding support in newer releases, Neo4j can handle large datasets. Section 4.3 discusses the difficulties of partitioning a graph across servers.

● Integration and Extensibility: Neo4j offers a plethora of APIs

and drivers, ensuring easy integration with other platforms

and languages. Furthermore, its extensibility allows for the

development of custom plugins and extensions.


● Real-time Insights: Owing to its graph-centric nature, Neo4j

can deliver real-time insights for applications that rely on

immediate data relationship analytics, such as fraud

detection or recommendation engines.

Understanding Nodes, Relationships, and Properties

In the graph database paradigm presented by Neo4j, the central

components are nodes, relationships, and properties:

● Nodes: Analogous to entities in a relational database, nodes

are the primary objects in the graph. They often represent

real-world entities, such as users, products, or locations.

Nodes can have labels, helping to categorise or classify them.

● Relationships: Representing the connections between nodes,

relationships are what make the graph model so intuitive.

Each relationship has a type and a direction, indicating the

nature of the relationship and its origin and endpoint.

● Properties: Both nodes and relationships can hold

properties. These are key-value pairs that store attributes of

entities and their relationships, such as a user's age or the


date a relationship was established.
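
The three building blocks can be modelled in a few lines of Python. This is a minimal sketch of the property graph idea and not how Neo4j stores data internally: the `Node`, `Relationship`, and `Graph` classes and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class Node:
    id: int
    labels: Set[str]
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    start: int  # id of the node the relationship leaves
    end: int    # id of the node it points to
    type: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Graph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_node(self, node_id, labels, **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)
        return self.nodes[node_id]

    def relate(self, start, rel_type, end, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.relationships.append(rel)
        return rel

    def outgoing(self, node_id, rel_type=None):
        """Relationships leaving a node, optionally filtered by type."""
        return [r for r in self.relationships
                if r.start == node_id and (rel_type is None or r.type == rel_type)]


g = Graph()
g.add_node(1, ["User"], name="Asha", age=31)
g.add_node(2, ["Product"], title="Keyboard")
g.relate(1, "LIKES", 2, since="2024-03-01")
liked = [g.nodes[r.end].properties["title"] for r in g.outgoing(1, "LIKES")]
```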

Graph Schema Design Patterns

While Neo4j does not require a predefined schema, it's beneficial to follow some design patterns for consistency and clarity:

● Single-label Nodes: Instead of using multiple labels for a

node, opt for a single, most representative label. This

simplifies querying and maintains clarity.

● Relationship Direction: Always establish a clear direction for

relationships, even if they're bidirectional in nature. This

ensures straightforward traversal.

● Rich Relationships: Instead of creating a new node for an

attribute, consider making it a rich relationship. For instance,

instead of a separate node for 'transaction', it can be a

relationship between 'user' and 'product' with properties

detailing the transaction.
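
The direction and rich-relationship patterns can be sketched together. In the hypothetical data below, each purchase is stored once as a directed relationship that carries its own properties, and the same records are followed forwards from a user or backwards from a product. The identifiers, amounts, and helper names are all invented for the example.

```python
purchases = [
    # (start node, relationship type, end node, relationship properties)
    ("user:asha", "PURCHASED", "product:keyboard", {"amount": 49.0, "date": "2024-03-01"}),
    ("user:asha", "PURCHASED", "product:mouse", {"amount": 19.0, "date": "2024-03-04"}),
    ("user:ravi", "PURCHASED", "product:keyboard", {"amount": 45.0, "date": "2024-03-05"}),
]


def spent_by(user):
    """Follow PURCHASED relationships forwards from a user."""
    return sum(props["amount"] for start, _, _, props in purchases if start == user)


def buyers_of(product):
    """Follow the same relationships backwards from a product."""
    return sorted(start for start, _, end, _ in purchases if end == product)
```

A single stored direction is enough because a traversal can follow a relationship either way. A separate transaction node becomes worthwhile mainly when the transaction itself needs relationships to other entities.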

Strategies for Efficient Graph Queries

Graph databases excel in complex data retrieval. However,

optimising the queries can make a significant difference:


● Limit Depth of Traversal: Avoid unnecessarily deep

traversals unless required. The depth of traversal can

significantly affect performance.

● Use Indexes: Just as with relational databases, indexes in

Neo4j speed up retrieval. Ensure critical attributes are

indexed.

● Specify Relationship Types and Node Labels: Whenever

possible, be explicit about the types of relationships and

labels of nodes in your queries. This narrows down the

search space.

● Avoid Global Operations: Operations that require scanning

the entire graph, such as global deletions or updates, can be

resource-intensive.
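
Two of the strategies above, limiting depth and naming the relationship type, can be shown with a bounded breadth-first traversal. The adjacency data and the `reachable` function are assumptions made for illustration, and a real graph database plans traversals differently.

```python
from collections import deque

# adjacency: node -> list of (relationship type, neighbour)
edges = {
    "a": [("FOLLOWS", "b"), ("BLOCKS", "x")],
    "b": [("FOLLOWS", "c")],
    "c": [("FOLLOWS", "d")],
    "d": [],
    "x": [("FOLLOWS", "d")],
}


def reachable(start, rel_type, max_depth):
    """Follow one relationship type breadth-first, up to max_depth hops."""
    seen = {start}
    found = {}  # node -> number of hops from start
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for kind, neighbour in edges.get(node, []):
            if kind == rel_type and neighbour not in seen:
                seen.add(neighbour)
                found[neighbour] = depth + 1
                queue.append((neighbour, depth + 1))
    return found
```

For example, `reachable("a", "FOLLOWS", 2)` stops after two hops and never looks at the `BLOCKS` edge, so both filters shrink the part of the graph that is visited.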

Indexing and Constraints in Neo4j

● Indexing: Indexes in Neo4j are crucial for enhancing query performance. Neo4j provides both native indexes and full-text search indexes. By indexing frequently queried properties, you can significantly improve search times.


● Unique Constraints: Neo4j allows for the definition of

uniqueness constraints. This ensures that a specific property

of a node, for a given label, maintains unique values,

preventing accidental duplication.

● Existence Constraints: These constraints ensure that a

specific property exists for a node or relationship, ensuring

data integrity.

● Node Key Constraints: A combination of uniqueness and

existence constraints. They ensure that a specific

combination of properties exists and is unique for a node.
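
The effect of an index combined with existence and uniqueness constraints can be imitated in plain Python. The `LabelStore` class below is a hypothetical stand-in for the nodes of one label. It illustrates the behaviour described above, not Neo4j's implementation or syntax.

```python
class ConstraintError(ValueError):
    """Raised when a write would violate a constraint."""


class LabelStore:
    """Nodes of one label with a required, unique, indexed property."""

    def __init__(self, key_property):
        self.key_property = key_property
        self.nodes = []
        self.index = {}  # property value -> node, for direct lookup

    def create(self, **properties):
        if self.key_property not in properties:  # existence constraint
            raise ConstraintError(f"missing property {self.key_property!r}")
        value = properties[self.key_property]
        if value in self.index:  # uniqueness constraint
            raise ConstraintError(f"duplicate {self.key_property}={value!r}")
        self.nodes.append(properties)
        self.index[value] = properties
        return properties

    def find(self, value):
        # Index lookup: no scan over self.nodes is needed.
        return self.index.get(value)


users = LabelStore("email")
users.create(email="asha@example.com", name="Asha")
try:
    users.create(email="asha@example.com", name="Duplicate")
    duplicate_rejected = False
except ConstraintError:
    duplicate_rejected = True
```

Requiring the property and requiring it to be unique at the same time is roughly what a node key constraint provides for a combination of properties.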

4.3 Comparison Between Cassandra and Neo4j

1. Differences in Data Model: Column-family vs. Graph-based

Cassandra: Column-family Data Model

● Originating from Google's Bigtable, Cassandra utilises a

column-family data model. This data structure can be

visualised as a multi-dimensional map, with rows being

unique and column-families as containers of columns.


● Each row is identified by a unique key within its column family. Column families are flexible, which means the columns present can vary from one row to another.

● This model allows for efficient reads and writes of data

that has many attributes, which may differ across the

rows.
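
The "multi-dimensional map" picture can be written down directly as nested dictionaries. The product data and the helper names are invented for illustration. The point is that each row stores only the columns it actually has.

```python
# column family: row key -> column name -> value
products = {
    "p1": {"name": "Laptop", "ram_gb": 16, "screen_in": 14},
    "p2": {"name": "T-shirt", "size": "M", "colour": "blue"},
}


def get(column_family, row_key, column, default=None):
    """Look up one cell; absent columns simply are not stored."""
    return column_family.get(row_key, {}).get(column, default)


def columns_used(column_family):
    """All column names that appear in at least one row."""
    return sorted({c for row in column_family.values() for c in row})
```

Here `get(products, "p1", "size")` returns `None` because the laptop row never stored a `size` column, so no space is spent on empty attributes.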

Neo4j: Graph-based Data Model

● Neo4j is grounded in graph theory and employs a

graph-based data model. In this model, data is stored

as nodes, edges, and properties.

● Nodes represent entities, edges denote relationships,

and properties are key-value pairs associated with

nodes and edges.

● This data structure excels at representing complex

relationships and offers rapid traversal, particularly

beneficial for social networks, recommendation

engines, and other connected data use-cases.


2. Scalability: Horizontal Scaling and Partitioning

Cassandra: Horizontal Scaling and Partitioning

● Cassandra was designed for distributed architectures

from the beginning, making it ideal for horizontal

scaling. By adding more machines to the cluster, its

capacity can be easily increased.

● Data partitioning in Cassandra is automatic, meaning

data gets distributed across multiple nodes without

manual intervention. The partition key determines the

distribution of data across the nodes.
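
Why adding a machine is cheap can be seen on a toy hash ring. In the sketch below, which assumes one token per node and uses MD5 purely for determinism, growing the cluster from three nodes to four only hands keys to the new node. The node names, key names, and helper functions are hypothetical.

```python
import hashlib
from bisect import bisect_right


def token(value):
    """Deterministic ring position for a string."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def owner(nodes, partition_key):
    """First node clockwise from the key's token (one token per node)."""
    ring = sorted((token(n), n) for n in nodes)
    positions = [t for t, _ in ring]
    return ring[bisect_right(positions, token(partition_key)) % len(ring)][1]


keys = [f"order:{i}" for i in range(1000)]
before = {k: owner(["n1", "n2", "n3"], k) for k in keys}
after = {k: owner(["n1", "n2", "n3", "n4"], k) for k in keys}

# Keys whose owner changed when n4 joined the ring.
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `n4`, and no key is shuffled between the existing nodes, so the cluster can grow without redistributing all of its data.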

● This results in Cassandra providing high availability and

fault tolerance, particularly for write-intensive

applications.

Neo4j: Horizontal Scaling and Challenges

● While Neo4j can be horizontally scaled, it faces some

challenges due to its graph nature. Maintaining

relationships in a distributed environment can be

complex.
● Neo4j uses clustering for high availability, but it

doesn't naturally partition graph data across nodes.

Hence, there are concerns regarding graph processing

on very large datasets in distributed setups.

● Nevertheless, Neo4j does support replication for read

scalability.

3. Consistency Models in Distributed Systems

Cassandra: Eventual Consistency

● Cassandra follows an eventual consistency model.

While this can lead to temporary inconsistencies

among nodes, over time, all changes propagate

through the system, ensuring data consistency.

● The system provides tunable consistency, meaning

operations can be configured for desired levels of

consistency versus performance.
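
The trade-off behind tunable consistency comes down to simple arithmetic: if the number of replicas that acknowledge a write plus the number consulted on a read exceeds the replication factor, the two sets must share at least one replica. The helper names below are assumptions made for illustration. ONE, QUORUM, and ALL are used as example consistency levels.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_overlap(write_acks: int, read_acks: int, replication_factor: int) -> bool:
    """True when every read set shares at least one replica with every write set."""
    return write_acks + read_acks > replication_factor


rf = 3
levels = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}
```

With a replication factor of 3, QUORUM writes and QUORUM reads overlap (2 + 2 > 3), so a read sees the latest acknowledged write. ONE with ONE does not overlap (1 + 1 is not greater than 3), which is faster but may return stale data until the replicas converge.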

Neo4j: Strong Consistency

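● Neo4j provides ACID transactions, so once a write commits, later reads on that instance see it. In a clustered deployment, Neo4j describes its guarantee as causal consistency, which lets a client read its own writes even when reads are served by other cluster members.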

● This ensures that any read after a write returns the

value of that write or a more recent one, making it

ideal for applications where data integrity and

accuracy are paramount.

4. Suitability for Different Application Domains

Cassandra: Ideal Use Cases

● Time Series Data: Because rows within a partition are stored in clustering-column order (for example by timestamp) and writes are fast, Cassandra is well-suited to time-series data such as metrics collection and monitoring systems.

● Write-Intensive Applications: Cassandra's distributed

architecture makes it favourable for apps with high

write loads.

● E-commerce: For product catalogues, user activity

tracking, and recommendation systems where

scalability and high availability are crucial.

Neo4j: Ideal Use Cases

● Social Networks: To model and query complex


relationships among users.

● Recommendation Engines: Identifying patterns and

relationships between users and items.

● Fraud Detection: To identify patterns that signify

fraudulent activities by analysing data relationships.

● Knowledge Graphs: Representing and querying

knowledge as interconnected entities.

4.4 Summary

❖ Cassandra is a distributed NoSQL database designed for scalability and high availability without compromising performance. Its column-family storage model suits large-scale applications where read and write throughput are crucial.

❖ Unlike relational databases, Cassandra uses columns, column families, and keyspaces to organise data. Efficient data modelling involves understanding primary keys and clustering columns and may involve denormalization.


❖ Neo4j is a leading graph database management system designed to handle interconnected data efficiently. It works on nodes, relationships, and properties to represent and store data.

❖ Neo4j's data structures involve nodes (entities),

relationships (connections), and properties (data values).

Effective data modelling in Neo4j focuses on designing

graphs for efficient traversal and querying.

❖ While both are NoSQL databases, they serve different needs.

Cassandra excels in scenarios demanding high-speed

operations on vast amounts of data, while Neo4j shines in

handling complex, interrelated data structures.

❖ Real-world applications can illustrate the strengths and

limitations of each system. Examples might include using

Cassandra for real-time analytics and Neo4j for building

sophisticated recommendation engines.

4.5 Keywords

● Column-family (from Cassandra): In the context of


Cassandra, a column-family (often compared to a table in

relational databases) is a way of storing and organising data.

A column-family contains a collection of rows, each uniquely

identifiable by a primary key. Each row has multiple columns,

where each column has a name, a value, and a timestamp.

● Keyspace (from Cassandra): A keyspace in Cassandra is a

container for data that defines data replication on nodes.

Think of it as a database in the world of relational databases.

A keyspace is used to group column-families, typically by

application, since column-families under the same keyspace

will have the same replication settings.

● Graph Database (from Neo4j): A graph database is a type of

database designed to treat the relationships between data

as equally important to the data itself. It's optimised to allow

a flexible and fast traversal of these connections. Neo4j is a

prominent example of a graph database, where data is

stored as nodes (entities) and relationships (connections

between entities).
● Nodes and Relationships (from Neo4j): In the context of

graph databases like Neo4j, a node typically represents

entities (e.g., a person, a product) while relationships are the

connections or associations between these entities (e.g., a

person "likes" a product). Both nodes and relationships can

have properties (key-value pairs) to store information.

● Horizontal Scaling: Horizontal scaling refers to adding more machines to a system to improve its performance and ability to handle increased load, rather than upgrading the specifications of existing machines (which would be vertical scaling). Cassandra is designed around horizontal scaling, with data partitioned automatically across nodes. Neo4j scales out mainly through clustering and read replicas, since partitioning a graph across servers is harder.

● Denormalization (from Cassandra): Denormalization is a

strategy used in database design where redundancy is

intentionally introduced for the sake of query performance.

In relational databases, normalisation rules often reduce


data redundancy. However, in databases like Cassandra,

denormalization can be beneficial because it can reduce the

number of reads required for specific queries, making them

faster.

4.6 Self-Assessment Questions

1. How does Cassandra's architecture support high availability

and fault tolerance?

2. What are the primary components of a graph database in

Neo4j, and how do they interrelate?

3. Which of the following best describes the primary use case

of Cassandra?

● a) Social Network Analysis

● b) Knowledge Graphs

● c) Time-series Data Management

● d) Recommendation Systems

4. What are the key differences in data modelling between

Cassandra and Neo4j, especially concerning the handling of

relationships?
5. Which case study focuses on leveraging the strengths of

both Cassandra and Neo4j in hybrid systems?

4.7 Case Study

Title: E-commerce Giant Alibaba’s Use of Database Systems

Introduction:

China’s e-commerce behemoth, Alibaba, has experienced an

extraordinary growth trajectory since its inception. With billions

of transactions processed every day, Alibaba’s infrastructure

needs to be not just robust, but also nimble and adaptable to

manage the diverse demands of its services, from e-commerce to

cloud computing.

Background:

In the early years, Alibaba primarily relied on traditional relational

databases. However, as the volume of data grew exponentially,

they faced issues related to scalability, latency, and complexity.

This called for a move towards NoSQL databases that offer greater

flexibility in terms of data storage and retrieval.

To tackle the scalability challenge, Alibaba adopted a distributed


database system named OceanBase. This system is specifically

designed to support high concurrent and high-volume online

services, which is critical for platforms like Taobao and Tmall.

With its distributed architecture, OceanBase can seamlessly scale out, meeting Alibaba’s demanding data growth and real-time transaction needs.

Furthermore, Alibaba's adoption of graph databases facilitated enhanced recommendation systems. By interpreting the intricate relationships between users and products as a graph, the system could offer more personalised and relevant product suggestions to its users.

However, integrating these advanced database systems was not

without its challenges. Data consistency, ensuring real-time

processing across distributed systems, and managing the massive

influx of data during events like the "Singles Day" sale were

persistent challenges. But with continuous innovation and

investment in database management, Alibaba has set an

exemplary standard for e-commerce platforms globally.


Questions:

1. How do distributed database systems like OceanBase

address the limitations of traditional relational databases,

especially in the context of high-volume platforms like

Alibaba?

2. In what ways can graph databases revolutionise the user

experience in e-commerce platforms?

3. Considering the challenges Alibaba faced during peak events

like "Singles Day", what strategies can be adopted to ensure

data consistency and real-time processing in a distributed

system?

4.8 References

1. "Cassandra: The Definitive Guide" by Jeff Carpenter and

Eben Hewitt

2. "Neo4j in Action" by Aleksa Vukotic and Nicki Watt

3. "Data-Intensive Text Processing with MapReduce" by Jimmy

Lin and Chris Dyer

4. "Designing Data-Intensive Applications" by Martin


Kleppmann

5. "NoSQL Distilled: A Brief Guide to the Emerging World of

Polyglot Persistence" by Pramod J. Sadalage and Martin

Fowler.
