
Big Data Analytics 21CS71

MODULE 3
Introduction to Distributed Systems in Big Data
 Definition: Distributed systems consist of multiple data nodes organized into clusters,
enabling tasks to execute in parallel.
 Communication: Nodes communicate with applications over a network, optimizing
resource utilization.

Features of Distributed-Computing Architecture


1. Increased Reliability and Fault Tolerance
o Failure of some cluster machines does not impact the overall system.
o Data replication across nodes enhances fault tolerance.
2. Flexibility
o Simplifies installation, implementation, and debugging of new services.
3. Sharding
o Definition: Dividing data into smaller, manageable parts called shards.
o Example: A university student database is sharded into datasets per course
and year.
4. Speed
o Parallel processing on individual nodes in clusters boosts computing
efficiency.
5. Scalability
o Horizontal Scalability: Expanding by adding more machines and shards.
o Vertical Scalability: Increasing the capacity of individual machines
(e.g., more CPU, memory, or storage) so each node can run heavier workloads.
6. Resource Sharing
o Shared memory, machines, and networks reduce operational costs.
7. Open System
o Accessibility of services across all nodes in the system.
8. Performance
o Improved performance through collaborative processor operations with lower
communication costs compared to centralized systems.


Demerits of Distributed Computing


1. Troubleshooting Complexity
o Diagnosing issues becomes challenging in large network infrastructures.
2. Software Overhead
o Additional software is often required for distributed system management.
3. Security Risks
o Vulnerabilities in data and resource sharing due to distributed architecture.

NoSQL Concepts
 NoSQL Data Store: Non-relational databases designed to handle semi-structured and
unstructured data.
 NoSQL Data Architecture Patterns: Models such as key-value, document, column-
family, and graph for efficient data organization.
 Shared-Nothing Architecture: Ensures no shared resources among nodes, enabling
independent operation and scalability.

MongoDB
 Type: Document-oriented NoSQL database.
 Features: Schema-less design, JSON-like storage, scalability, and high availability.
 Usage: Suitable for real-time applications and Big Data analytics.

Cassandra
 Type: Column-family NoSQL database.
 Features: High availability, decentralized architecture, linear scalability, and eventual
consistency.
 Usage: Ideal for applications requiring fast writes and large-scale data handling.
SQL Databases: ACID Properties
SQL databases are relational and exhibit ACID properties to ensure reliability and
consistency of transactions:
1. Atomicity
o All operations in a transaction must complete entirely, or none at all.
o Example: In a banking transaction, if updating both withdrawal and balance
fails midway, the entire transaction rolls back.


2. Consistency
o Transactions must maintain the integrity of the database by adhering to
predefined rules.
o Example: The sum of deposits minus withdrawals should always match the
account balance.
3. Isolation
o Transactions are executed independently, ensuring no interference among
concurrent transactions.
4. Durability
o Once a transaction is completed, its results are permanent, even in the event of
a system failure.
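The banking example above (a transfer that rolls back if it fails midway) can be demonstrated with SQLite, whose transactions are ACID. This is a minimal sketch with made-up account names; the simulated crash stands in for any mid-transaction failure:

```python
import sqlite3

# In-memory bank with two accounts; a transfer must update both rows or neither.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(src, dst, amount):
    try:
        with conn:  # opens a transaction; rolls back automatically on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            if amount > 100:          # simulate a crash before the credit is applied
                raise RuntimeError("simulated crash")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except RuntimeError:
        pass  # the withdrawal above was rolled back, not persisted

transfer("alice", "bob", 200)  # fails midway; atomicity keeps balances unchanged
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}
```

The `with conn:` block is what provides atomicity here: on success it commits, on an exception it rolls back every statement issued inside it.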

SQL Features
1. Triggers
o Automated actions executed upon events like INSERT, UPDATE, or
DELETE.
2. Views
o Logical subsets of data from complex queries, simplifying data access.
3. Schedules
o Define the chronological execution order of transactions to maintain
consistency.
4. Joins
o Combine data from multiple tables based on conditions, enabling complex
queries.
CAP Theorem Overview
The CAP Theorem, formulated by Eric Brewer, states that in a distributed system, it is
impossible to simultaneously guarantee all three properties: Consistency (C), Availability
(A), and Partition Tolerance (P). Distributed databases must trade off between these
properties based on specific application needs.

CAP Properties
1. Consistency (C):
o All nodes in the distributed system see the same data at the same time.
o Changes to data are immediately reflected across all nodes.


o Example: If a sales figure in one node is updated, it should instantly reflect in
all other nodes.
2. Availability (A):
o The system remains operational, ensuring every request gets a response,
regardless of success or failure.
o Achieved through replication, where copies of data are maintained on
multiple nodes.
o Example: A query for sales data will return a result even if some nodes are
down.
3. Partition Tolerance (P):
o The system continues to function even when network partitions
(communication breakdowns between nodes) occur.
o Ensures fault tolerance and resilience to node or network failures.
o Example: Operations on one partition of a database do not fail even if another
partition is unreachable.

CAP Combinations
Since achieving all three properties is not possible, distributed systems choose two of the
three based on requirements:
1. Consistency + Availability (CA):
o Ensures all nodes see the same data (Consistency).
o Ensures all requests receive responses (Availability).
o Cannot tolerate network partitions.
o Example: Relational databases in centralized systems.
2. Availability + Partition Tolerance (AP):
o Ensures the system responds to requests even during network failures
(Partition Tolerance).
o May sacrifice consistency, meaning some nodes may have stale or outdated
data.
o Example: DynamoDB, where availability is prioritized over consistency.
3. Consistency + Partition Tolerance (CP):
o Ensures all nodes maintain consistent data (Consistency).
o Tolerates network partitions but sacrifices availability during failures (some
requests may be denied).


o Example: MongoDB, where consistency is more critical than availability.

Network Partition and Trade-offs


When a network partition occurs:
1. AP (Availability + Partition Tolerance):
o The system provides responses but may return outdated or incorrect data.
o Prioritizes availability for user experience.
o Suitable for applications like social media or e-commerce with high fault
tolerance needs.
2. CP (Consistency + Partition Tolerance):
o The system waits for the latest data to be replicated, potentially delaying
responses.
o Prioritizes data accuracy.
o Suitable for applications like banking systems or financial transactions
requiring strict consistency.


Schema-less Models in NoSQL Databases


Schema in traditional databases defines a pre-designed structure for datasets, dictating how
data is organized and stored (e.g., tables, columns, data types). NoSQL databases, however,
often adopt a schema-less model, which increases flexibility and allows for unstructured or
semi-structured data.

Characteristics of Schema-less Models


1. No Fixed Table Schema:
o NoSQL databases do not require a predefined schema for data storage.
o New fields can be added to records without affecting existing data.
2. Non-Mathematical Relations:
o Unlike relational databases, NoSQL systems store relationships as aggregates
or metadata rather than using mathematical joins.
3. Flexibility for Data Manipulation:
o Ideal for applications where data evolves over time.
o For example, a student database can start with basic information and add fields
(e.g., extracurricular activities) dynamically as needed.
4. Cluster-based Management:
o Large datasets are stored and managed across distributed clusters or nodes.
o This setup ensures scalability and fault tolerance.
5. Metadata-driven Relationships:
o Relationships between datasets are stored as metadata, which describes and
specifies inter-linkages without rigid relational constraints.

BASE Model in NoSQL Databases


NoSQL databases follow the BASE model (as opposed to the ACID model in relational
databases). The BASE model prioritizes flexibility and scalability:
1. Basic Availability (BA):
o Ensures availability through data replication and distribution across nodes.
o Even if some segments fail, the system remains partially functional.
2. Soft State:
o Allows intermediate states that may be inconsistent temporarily.


o The system doesn't require immediate consistency after every transaction.


3. Eventual Consistency:
o Guarantees that all updates will eventually propagate through the system to
achieve a consistent state.
o Suitable for systems where immediate consistency is not critical (e.g., social
media platforms).
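Eventual consistency can be sketched with a toy last-write-wins gossip pass over three replicas (the names, versions, and the `anti_entropy` function are illustrative, not a real protocol):

```python
# Toy replication: each replica stores {key: (version, value)}.
replicas = [dict() for _ in range(3)]

def write(node, key, value, version):
    replicas[node][key] = (version, value)   # soft state: other replicas are stale

def read(node, key):
    return replicas[node].get(key, (0, None))[1]

def anti_entropy():
    # Gossip pass: every replica adopts the highest-versioned value it can see.
    for key in {k for r in replicas for k in r}:
        winner = max((r[key] for r in replicas if key in r), key=lambda t: t[0])
        for r in replicas:
            r[key] = winner

write(0, "likes", 10, version=1)
print(read(1, "likes"))   # None -- replica 1 has not seen the update yet (soft state)
anti_entropy()
print(read(1, "likes"))   # 10 -- consistent after the update propagates
```

The interval between `write` and `anti_entropy` is exactly the "soft state" window: reads may return stale data, but all replicas converge once propagation completes.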

Advantages of Schema-less Models


 Increased Agility:
o Easier to adapt to changing data requirements without restructuring the entire
database.
 Scalability:
o Facilitates horizontal scaling by adding nodes to the system.
 Data Variety:
o Handles structured, semi-structured, and unstructured data seamlessly.
 Fault Tolerance:

o Replication across nodes ensures resilience against failures.


Applications of Schema-less Models


 E-commerce Platforms: Flexible product catalogs with dynamic attributes.
 Social Media: Storing varied user-generated content.
 IoT Systems: Managing semi-structured sensor data.
 Content Management Systems: Organizing diverse content types like text, images,
and videos.

NoSQL Data Architecture Patterns


1. Key-Value Pair Data Stores
 Definition: A schema-less store where data is represented as key-value pairs.
 Characteristics:
o High performance, scalability, and flexibility.
o Keys are simple strings, and values can be any large object or BLOB (e.g.,
text, images).
o Uses primary keys for fast data retrieval.
 Functions:
o Get(key) → Retrieves the value associated with the key.
o Put(key, value) → Associates or updates the value with the key.
o Multi-get(key1, key2, ...) → Retrieves multiple values.
o Delete(key) → Removes a key-value pair.
 Advantages:
1. Supports any data type in the value field.
2. Queries return values as a single item.
3. Eventually consistent.
4. Supports hierarchical and ordered structures.
5. High scalability, reliability, and portability with low operational cost.
6. Auto-generated or synthetic keys simplify usage.
 Limitations:
o No indexing or searching within values.
o Lack of traditional database capabilities (e.g., SQL queries).
o Managing unique keys becomes challenging with increasing data volume.


 Uses:
o Image/document storage.
o Lookup tables and query caches.
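The Get/Put/Multi-get/Delete functions listed above can be sketched as a minimal in-memory store (an illustration of the interface, not a real key-value database):

```python
class KeyValueStore:
    """Minimal in-memory key-value store mirroring Get/Put/Multi-get/Delete."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):      # associate or update the value for a key
        self._data[key] = value

    def get(self, key):             # retrieve the value, or None if absent
        return self._data.get(key)

    def multi_get(self, *keys):     # retrieve several values in one call
        return [self._data.get(k) for k in keys]

    def delete(self, key):          # remove the key-value pair if present
        self._data.pop(key, None)

store = KeyValueStore()
store.put("img:42", b"\x89PNG...")       # a value can be any object or BLOB
store.put("user:7", {"name": "Asha"})
print(store.multi_get("img:42", "user:7"))
store.delete("img:42")
print(store.get("img:42"))  # None
```

Note the limitation mentioned above: the store can only look up by key; there is no way to search inside the opaque values.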

2. Document Stores
 Definition: Stores unstructured or semi-structured data in a hierarchical format.
 Features:
1. Stores data as documents (e.g., JSON, XML).
2. Hierarchical tree structures with paths for navigation.
3. Transactions exhibit ACID properties.
4. Flexible schema-less design.
 Advantages:
o Easy querying and navigation using languages like XPath or XQuery.
o Supports dynamic schema changes (e.g., adding new fields).
 Limitations:
o Incompatible with traditional SQL.
o Complex implementation compared to other stores.
 Examples: MongoDB, CouchDB.
 Use Cases:
o Office documents, inventory data, forms, and document searches.


3. CSV and JSON File Formats


 CSV: Stores flat, tabular data without hierarchical structure.
 JSON:
o Supports object-oriented and hierarchical structures.
o Easier parsing in JavaScript compared to XML.

o Preferred for serialization due to concise syntax.

 Comparison:
o JSON includes arrays; XML is more verbose but widely used.
o JSON is easier to handle for developers due to its key-value structure.

4. Columnar Data Stores


 Definition: Stores data in columns rather than rows for high-performance analytical
processing.
 Features:
o Groups columns into families, forming a tree-like structure.
o Keys include row, column family, and column identifiers.
 Advantages:
1. High scalability and partitionability.
2. Efficient querying and replication.
3. Supports dynamic column additions.
 Examples: HBase, BigTable, Cassandra.


 Use Cases:
o Web crawling, large sparsely populated tables, and high-variance systems.
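The row-versus-column layout can be illustrated in a few lines (the records and field names are made up): an analytical aggregate over a columnar layout reads one column's values instead of scanning every whole row.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Pune",  "sales": 120},
    {"id": 2, "city": "Delhi", "sales": 300},
    {"id": 3, "city": "Pune",  "sales": 80},
]

# Column layout: each column's values are stored contiguously, so an
# analytical scan touches only the columns it needs.
columns = {name: [r[name] for r in rows] for name in rows[0]}

total_sales = sum(columns["sales"])   # reads one column, not whole rows
print(total_sales)  # 500
```

This is why columnar stores like HBase and Cassandra excel at analytical, scan-heavy workloads: the query engine skips the columns it does not need.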

5. BigTable Data Stores


 Features:
o Massively scalable (up to petabytes).
o Integrates with Hadoop and MapReduce.
o Handles millions of operations per second.
o Includes features like timestamps for versioning and consistent low latency.
 Use Cases: Ideal for high-throughput applications like analytics and global-scale
services.

6. Object Data Stores


 Definition: Stores data as objects (files, images, metadata) with associated system and
custom metadata.
 Features:
o APIs for scalability, indexing, querying, transactions, replication, and lifecycle
management.
o Persistent object storage and lifecycle control.
 Example: Amazon S3 (Simple Storage Service).
o S3 uses REST, SOAP, and BitTorrent interfaces for accessing trillions of
objects.
o Two storage classes: Standard and infrequent access.
 Uses: Web hosting, image storage, and backup systems.

7. Document and Hierarchical Patterns


 Document stores allow hierarchical structures resembling file directories.
 Query languages like XPath and XQuery facilitate efficient searching and navigation.


Graph Database Overview


Characteristics:
1. High Flexibility:
o Graph databases can easily expand by adding new nodes and edges.
o Best suited when relationships and relationship types are critical to the data
model.
2. Data Representation:
o Data is stored as interconnected nodes (entities or objects) and edges
(relationships between nodes).
o Makes relationship-based queries simple and efficient.
3. Specialized Query Languages:
o Examples include SPARQL for RDF-based graph databases.
4. Hyper-Edges Support:
o Hyper-edges allow edges to connect multiple vertices, providing a more
complex relationship structure (e.g., hypergraphs).
5. Small Data Size Records:
o Consist of small, interconnected records for efficient traversal and queries.


Typical Use Cases:


 Link Analysis: Finding connections in data, such as in social networks,
communication records, etc.
 Friend-of-Friend Queries: Querying indirect relationships, such as second-degree
connections.
 Rules and Inference: Leveraging taxonomies and class hierarchies for advanced rule-
based queries.
 Pattern Matching: Identifying specific patterns in interconnected data.
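The friend-of-friend query above amounts to a depth-2 traversal. A minimal sketch over an adjacency-list graph (the names are made up; real graph databases use query languages such as Cypher or SPARQL):

```python
# Adjacency list: node -> set of directly connected nodes (edges).
friends = {
    "asha":  {"ben", "carla"},
    "ben":   {"asha", "dev"},
    "carla": {"asha", "dev"},
    "dev":   {"ben", "carla", "esha"},
    "esha":  {"dev"},
}

def friends_of_friends(person):
    """Second-degree connections: friends of my friends, minus my own circle."""
    direct = friends[person]
    fof = set()
    for f in direct:
        fof |= friends[f]
    return fof - direct - {person}

print(sorted(friends_of_friends("asha")))  # ['dev']
```

Because relationships are first-class records, a graph store answers this kind of query by walking edges directly, with no join over large tables.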
Examples of Graph Databases:
 Neo4J
 AllegroGraph
 HyperGraph
 InfiniteGraph
 Titan
 FlockDB

Shared-Nothing Architecture for Big Data Tasks


The Shared-Nothing (SN) architecture is a distributed computing model where each node
operates independently and does not share data with any other node. It is typically used in big
data systems for parallel processing and distributed storage. In this model, each node is self-
sufficient in computation, which ensures the system’s scalability and fault tolerance.


Features of Shared-Nothing Architecture:


1. Independence: Each node has its own memory and resources; there is no shared
memory or storage between nodes. Each node operates independently.
2. Self-Healing: If a node or link fails, the system can reconfigure itself by creating
alternative links to maintain operation.
3. Sharding: Data is partitioned across multiple nodes, where each node handles a shard
(a portion of the database). Each shard is processed independently at its respective
node.
4. No Network Contention: Since there is no shared memory, network contention is
minimized, allowing for efficient parallel processing.
Examples of Shared-Nothing Systems:
 Hadoop: A distributed data processing framework where tasks are distributed across
many nodes in a cluster.
 Apache Flink: A stream-processing framework that allows for distributed data
processing in real-time.
 Apache Spark: A unified analytics engine for big data processing, which also uses
shared-nothing architecture for parallel execution across a cluster.

Choosing Distribution Models for Big Data


Big data solutions often require data to be distributed across multiple nodes in a cluster. This
enables horizontal scalability, which allows the system to handle large volumes of data while
providing the ability to process many read and write operations simultaneously.
Distribution Models:
1. Single Server Model (SSD):


o This is the simplest distribution model where all data is stored and processed
on a single server. While this model is easy to implement, it may not scale
well for large datasets or high traffic applications.
o Best for: Small-scale applications or use cases like graph databases where
relationships are processed sequentially on a single server.
o Example: A simple graph database that processes node relationships on a
single server.
2. Sharding Very Large Databases:

o Sharding refers to the process of splitting a large database into smaller, more
manageable parts called "shards". Each shard is distributed across multiple
servers in a cluster.
o Sharding provides horizontal scalability, allowing the system to process data
in parallel across multiple nodes.
o Advantages:
 Enhanced performance by distributing data across multiple nodes.
 If a node fails, the shard can migrate to another node for continued
processing.
o Example: A dataset of customer records is split across four servers, where
each server handles one shard.


3. Master-Slave Distribution Model (MSD):

o In this model, there is one master node that handles write operations, and
multiple slave nodes that replicate the master’s data for read operations.
o The master node directs the slaves to replicate data, ensuring consistency
across nodes.
o Advantages:
 Read performance is optimized as multiple slave nodes handle read
requests.
 Writing is centralized, ensuring data consistency.
o Challenges:
 The replication process can introduce some latency and complexity.
 A failure of the master node may impact the write operations until a
failover mechanism is implemented.
o Example: MongoDB uses this model where data is replicated from the master
node to slave nodes.
4. Peer-to-Peer Distribution Model (PPD):
o In this model, all nodes are equal peers that both read and write data. Each
node has a copy of the data and can handle both read and write operations
independently.
o Advantages:
 High Availability: Since all nodes can read and write, the system can
tolerate node failures without affecting the ability to perform writes.


 Consistency: Each node contains the updated data, ensuring
consistency across the system.
o Challenges:
 More complex to manage compared to the master-slave model, as
every node can serve read and write requests.
o Example: Cassandra uses the Peer-to-Peer model, where data is distributed
across all nodes in a cluster and each node can independently process read and
write requests.

Ways of Handling Big Data Problems:


1. Evenly Distribute Data Using Hash Rings:
o Consistent Hashing: A technique where data in a collection is distributed
across nodes in a cluster using a hashing algorithm. The hash ring serves as a
map, where each client node uses the hash of a collection ID to determine
where the data is located in the cluster. This helps in evenly assigning data to
processors, improving scalability and fault tolerance.
2. Replication for Horizontal Scaling:
o Replication: Involves creating backup copies of data in real-time. Many Big
Data clusters use replication to ensure fault-tolerant data retrieval in a
distributed environment. This technique enables horizontal scaling of client
requests by distributing the load across multiple nodes, improving
performance.
3. Moving Queries to the Data (Not Data to Queries):
o Efficient Query Processing: Instead of moving data to where queries are
executed, moving the queries to the data itself is a more efficient approach.
This is especially true for NoSQL databases, where data is often spread across
a distributed system (e.g., cloud services or enterprise servers). This method
reduces the overhead of data transfer and enhances performance.


4. Distribute Queries to Multiple Nodes:


o Query Distribution: Client queries are distributed across multiple nodes or
replica nodes. High-performance query processing can be achieved by
parallelizing the query execution on different nodes. This strategy involves
analyzing and distributing the query load across the system, improving overall
system throughput and reducing response times.
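The hash-ring idea in point 1 can be sketched with a minimal consistent-hashing ring (node names and keys are hypothetical; production systems add virtual nodes for smoother balance):

```python
import bisect
import hashlib

def h(key):
    """Stable hash: map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Each node owns the arc of the ring ending at its own hash position."""
    def __init__(self, nodes):
        self._ring = sorted((h(n), n) for n in nodes)
        self._points = [p for p, _ in self._ring]

    def node_for(self, key):
        # First node clockwise from the key's hash, wrapping past the end.
        i = bisect.bisect(self._points, h(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
placement = {k: ring.node_for(k) for k in ("orders", "users", "logs")}
print(placement)
```

The payoff is the scalability property described above: removing (or adding) one node relocates only the keys on that node's arc, while every other key stays put.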

MongoDB Database:
MongoDB is a widely-used open-source NoSQL database designed to handle large amounts
of data in a flexible, distributed manner. Initially developed by 10gen (now MongoDB Inc.),
MongoDB was introduced as a platform-as-a-service (PaaS) and later released as an open-
source database. It’s known for its document-oriented model, making it suitable for handling
unstructured and semi-structured data.
Key Characteristics of MongoDB:
 Non-relational: Does not rely on traditional SQL-based relational models.
 NoSQL: Flexible and can handle large volumes of data across multiple nodes.
 Distributed: Data can be stored across multiple machines, supporting horizontal
scalability.
 Open Source: Freely available for use and modification.
 Document-based: Uses a document-oriented storage model, storing data in flexible
formats such as JSON.
 Cross-Platform: Can be used across different operating systems.
 Scalable: Can scale horizontally by adding more servers to handle growing data
needs.
 Fault Tolerant: Provides high availability through replication and data redundancy.
Features of MongoDB:
1. Database Structure:
o Each database is a physical container for collections. Multiple databases can
run on a single MongoDB server. The default database is test; in the shell,
db refers to the current database. The server's main process is mongod,
while the client shell is mongo.
2. Collections:
o Collections are analogous to tables in relational databases, and they store
multiple MongoDB documents. Collections are schema-less, meaning that
documents within a collection can have different fields and structures.
3. Document Model:


o Data in MongoDB is stored in documents, which are structured in BSON
(Binary JSON) format. These documents are similar to rows in a relational
database but are more flexible, allowing fields to vary from document to
document.
4. JSON-Like Storage:
o Documents are stored in a JSON-like format (BSON), which allows flexibility
in storing different types of data structures.
5. Flexible Data Storage:
o MongoDB’s document format allows for storing complex data structures, and
the schema can evolve over time without requiring a predefined structure.
6. Querying and Indexing:
o MongoDB supports dynamic querying and real-time aggregation. Its query
language is similar to SQL but optimized for document-based storage. It also
allows indexing to speed up query execution.
7. No Complex Joins:
o MongoDB does not rely on complex joins, making it more efficient for certain
types of queries, especially when dealing with large datasets.
8. Distributed Architecture:
o MongoDB is designed to support high availability and horizontal scalability.
Data is distributed across multiple servers, which allows it to handle larger
datasets efficiently.
9. Real-Time Aggregation:
o MongoDB includes powerful aggregation capabilities for real-time data
analysis. It supports grouping, filtering, and transforming data to provide
insights on the fly.
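The schema-less document model above can be illustrated with a toy collection (plain Python, not the real pymongo API): documents in one collection may carry different fields, and a new field needs no schema change.

```python
# Toy document collection: a list of dicts standing in for BSON documents.
students = []

def insert(doc):
    students.append(dict(doc))

def find(query):
    # Match documents whose fields equal every key/value in the query.
    return [d for d in students if all(d.get(k) == v for k, v in query.items())]

insert({"name": "Ravi", "course": "BDA"})
# Extra field on the next document -- no schema migration required:
insert({"name": "Mina", "course": "BDA", "activities": ["chess"]})

print(find({"course": "BDA"}))
print(find({"name": "Mina"})[0].get("activities"))  # ['chess']
```

In real MongoDB the equivalent calls would be `db.students.insert()` and `db.students.find()` from the command list later in this module.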


MongoDB Replication
Replication in MongoDB is essential for high availability and fault tolerance in Big Data
environments. Replication involves maintaining multiple copies of data across different
database servers. In MongoDB, this is achieved using replica sets, which ensure data
redundancy and allow for continuous data availability even in the event of server failures.
How Replica Sets Work:
 A replica set is a group of MongoDB server processes (mongod) that store the same
data. Each replica set has at least three nodes:
1. Primary Node: Receives all write operations.
2. Secondary Nodes: Replicate data from the primary node.
The primary node handles all write operations, and these are automatically propagated to the
secondary nodes. If the primary node fails, one of the secondary nodes is promoted to
primary in an automatic failover process, ensuring continuous availability.
o Commands for Replica Set Management:
 rs.initiate(): Initializes a new replica set.
 rs.config(): Checks the replica set configuration.
 rs.status(): Displays the status of the replica set.
 rs.add(): Adds new members to the replica set.
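A toy model of replica-set failover, assuming a deliberately simplified "election" (the first surviving node is promoted) rather than MongoDB's real voting protocol:

```python
class ReplicaSet:
    """Toy replica set: one primary takes writes, secondaries replicate them."""
    def __init__(self, names):
        self.nodes = {n: {} for n in names}   # each node holds a full data copy
        self.primary = names[0]

    def write(self, key, value):
        self.nodes[self.primary][key] = value
        self._replicate()

    def _replicate(self):
        # Propagate the primary's data to every member of the set.
        src = self.nodes[self.primary]
        for data in self.nodes.values():
            data.update(src)

    def fail(self, name):
        del self.nodes[name]
        if name == self.primary:              # automatic failover: promote a survivor
            self.primary = next(iter(self.nodes))

rs = ReplicaSet(["node0", "node1", "node2"])
rs.write("order:1", "paid")
rs.fail("node0")                 # primary goes down
print(rs.primary)                # a secondary was promoted
rs.write("order:2", "shipped")   # writes continue without interruption
print(rs.nodes[rs.primary])
```

This mirrors the behaviour described above: writes always go through the primary, and a secondary takes over so availability is preserved after a failure.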

MongoDB Sharding
Sharding is MongoDB’s method of distributing data across multiple machines, particularly in
scenarios involving large amounts of data. It is useful for scaling out horizontally when a
single machine can no longer store or process the data efficiently.
How Sharding Works:
 Shards: A shard is a single MongoDB server or replica set that holds part of the data.
 Sharded Cluster: MongoDB uses a sharded cluster to distribute data. Each shard
contains a portion of the data, and queries are routed to the appropriate shard based on
a shard key.
 Shard Key: A field in the documents used to determine how data is distributed across
the shards.
Sharding allows MongoDB to handle larger datasets and more operations by spreading the
load across multiple machines.
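A sketch of hashed sharding on a shard key (the `customer_id` field and shard count are made up; in real MongoDB a mongos router performs this routing):

```python
import hashlib

NUM_SHARDS = 3
shards = {i: [] for i in range(NUM_SHARDS)}

def shard_for(key):
    # Hash the shard-key value to pick a shard (hashed sharding).
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def insert(doc):
    shards[shard_for(doc["customer_id"])].append(doc)

def find(customer_id):
    # The router sends the query only to the shard that owns this key.
    target = shard_for(customer_id)
    return [d for d in shards[target] if d["customer_id"] == customer_id]

for cid in range(10):
    insert({"customer_id": cid, "total": cid * 10})

print({s: len(docs) for s, docs in shards.items()})   # rough spread across shards
print(find(7))  # [{'customer_id': 7, 'total': 70}]
```

The key property: a lookup by shard key touches exactly one shard, so adding shards spreads both storage and query load horizontally.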


MongoDB Data Types


MongoDB supports various data types for flexible and efficient data storage. Some of the key
data types include:
 Double: Represents floating-point values.
 String: UTF-8 encoded text.
 Object: Represents embedded documents (similar to a record in RDBMS).
 Array: A list of values.
 Binary Data: Arbitrary bytes, used for storing images or files.
 ObjectId: Unique identifier for documents, often used as the primary key.
 Boolean: Represents true or false.
 Date: Stores a date in BSON format (milliseconds since Unix epoch).
 Null: Represents a missing or unknown value.
 Regular Expression: JavaScript-based regular expression.
 32-bit Integer: Stores numbers without decimals.
 Timestamp: Special timestamp type for internal MongoDB use.

MongoDB Querying Commands


MongoDB provides various commands for interacting with databases, collections, and
documents.
Basic Commands:
 mongo: Starts the MongoDB client.
 db.help(): Displays help for available commands.
 db.stats(): Shows statistics for the database server.
 use <database name>: Switches to or creates a database.
 show dbs: Lists all databases.
 db.dropDatabase(): Drops the current database.
 db.createCollection(): Creates a collection within a database.
 db.<collection>.insert(): Inserts a document into a collection.
 db.<collection>.find(): Retrieves documents from a collection.
 db.<collection>.update(): Updates a document in a collection.
 db.<collection>.remove(): Removes a document from a collection.


Cassandra Database
Cassandra, developed by Facebook and later released by Apache, is a highly scalable NoSQL
database designed to handle large amounts of structured, semi-structured, and unstructured
data. The database is named after the mythological Trojan prophetess Cassandra, who was
cursed to always speak the truth but never to be believed. It was initially designed by
Facebook to handle their massive data needs, and it has since been adopted by several large
companies like IBM, Twitter, and Netflix.
Characteristics:
 Open Source: Cassandra is freely available and open to modifications.
 Scalable: It is designed to scale horizontally by adding more nodes to the system.
 NoSQL: It is a non-relational database, making it suitable for big data applications.
 Distributed: Cassandra's architecture allows it to run on multiple servers, ensuring
high availability and fault tolerance.
 Column-based: Data is stored in columns rather than rows, making it more efficient
for write-heavy workloads.
 Decentralized: All nodes in a Cassandra cluster are peers, which ensures that there is
no single point of failure.
 Fault-tolerant: Due to data replication across multiple nodes, Cassandra can
withstand node failures without data loss.
 Tuneable consistency: It provides flexibility to choose the level of consistency for
different operations.
Features of Cassandra:
 Maximizes write throughput: It is optimized for handling massive amounts of write
operations.
 No support for joins, group by, OR clauses, or complex aggregations: Its
architecture focuses on performance rather than relational operations.
 Fast and easily scalable: The database performs well as more nodes are added, and it
can handle high write volumes.
 Distributed architecture: Data is distributed across the nodes in the cluster, ensuring
high availability.
 Peer-to-peer: Nodes in Cassandra communicate with each other in a peer-to-peer
fashion, unlike master-slave architectures.
Data Replication in Cassandra: Cassandra provides data replication across multiple
nodes, ensuring no single point of failure. The replication factor defines the number of
replicas placed on different nodes. In case of stale data or node failure, Cassandra uses read
repair to ensure that all replicas are consistent. It adheres to the CAP theorem, prioritizing
availability and partition tolerance.


Scalability: Cassandra supports linear scalability: as new nodes are added to the cluster,
throughput increases while response times remain low. It uses a decentralized approach
where each node in the cluster is equally important.
Transaction Support: Cassandra is not a transactional system like a traditional RDBMS
and does not provide full ACID guarantees across the database. It offers atomicity and
isolation at the row level, and tuneable (by default eventual) consistency, trading
strict consistency for high availability and fault tolerance.
Replication Strategies:
 Simple Strategy: A straightforward replication factor for the entire cluster.
 Network Topology Strategy: Allows replication factor configuration per data center,
useful for multi-data center deployments.
Cassandra Data Model:
 Cluster: A collection of nodes and keyspaces.
 Keyspace: The outermost container in Cassandra that holds column families (tables).
Each keyspace defines the replication strategy and factors.
 Column: A single data point consisting of a name, value, and timestamp.
 Column Family: A collection of columns, which is equivalent to a table in relational
databases.
Cassandra CQL (Cassandra Query Language):
 CREATE KEYSPACE: Creates a keyspace to store tables. It includes replication
strategy options.
 ALTER KEYSPACE: Modifies an existing keyspace.
 DROP KEYSPACE: Deletes a keyspace.
 USE KEYSPACE: Connects to a specific keyspace.
 CREATE TABLE: Defines a new table with columns, including primary key
constraints.
 ALTER TABLE: Modifies the structure of an existing table (e.g., adding or dropping
columns).
 DESCRIBE: Provides detailed information about keyspaces, tables, indexes, etc.
CRUD Operations in Cassandra:
1. INSERT: Adds new data into a table.
o Example: INSERT INTO <tablename> (<columns>) VALUES (<values>);
2. UPDATE: Modifies existing data.
o Example: UPDATE <tablename> SET <column> = <value> WHERE
<condition>;


3. SELECT: Retrieves data from a table.


o Example: SELECT <columns> FROM <tablename> WHERE <condition>;
4. DELETE: Removes data from a table.
o Example: DELETE FROM <tablename> WHERE <condition>;
Cassandra Clients and Drivers: Cassandra supports various programming languages, and
clients interact with Cassandra through drivers. It has a peer-to-peer distribution system,
and each node can accept client connections, providing high availability.
Cassandra Hadoop Support: Cassandra integrates with Hadoop for big data processing. In
Cassandra 2.1, it offers Hadoop 2 support, allowing Hadoop's distributed storage and
processing capabilities to overlay Cassandra's data storage.

Column-Family Data Store


----------------------END OF MODULE 3--------------------

