Bda Module 3
MODULE 3
Introduction to Distributed Systems in Big Data
Definition: Distributed systems consist of multiple data nodes organized into clusters,
enabling tasks to execute in parallel.
Communication: Nodes communicate with applications over a network, optimizing
resource utilization.
Big Data Analytics 21CS71
NoSQL Concepts
NoSQL Data Store: Non-relational databases designed to handle semi-structured and
unstructured data.
NoSQL Data Architecture Patterns: Models such as key-value, document, column-
family, and graph for efficient data organization.
Shared-Nothing Architecture: Ensures no shared resources among nodes, enabling
independent operation and scalability.
MongoDB
Type: Document-oriented NoSQL database.
Features: Schema-less design, JSON-like storage, scalability, and high availability.
Usage: Suitable for real-time applications and Big Data analytics.
Cassandra
Type: Column-family NoSQL database.
Features: High availability, decentralized architecture, linear scalability, and eventual
consistency.
Usage: Ideal for applications requiring fast writes and large-scale data handling.
SQL Databases: ACID Properties
SQL databases are relational and exhibit ACID properties to ensure reliability and
consistency of transactions:
1. Atomicity
o All operations in a transaction must complete entirely, or none at all.
o Example: In a banking transaction, if updating both withdrawal and balance
fails midway, the entire transaction rolls back.
2. Consistency
o Transactions must maintain the integrity of the database by adhering to
predefined rules.
o Example: The sum of deposits minus withdrawals should always match the
account balance.
3. Isolation
o Transactions are executed independently, ensuring no interference among
concurrent transactions.
4. Durability
o Once a transaction is completed, its results are permanent, even in the event of
a system failure.
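Atomicity can be seen concretely with Python's built-in sqlite3 module, whose connection context manager commits a transaction on success and rolls it back on any exception. This is an illustrative sketch of the banking example above (account names and amounts are invented):

```python
import sqlite3

# In-memory bank with two accounts; a failed transfer must roll back entirely.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically: both updates happen, or neither."""
    try:
        with conn:  # opens a transaction; rolls back automatically on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 30)    # succeeds: alice 70, bob 80
transfer(conn, "alice", "bob", 500)   # fails mid-way and rolls back completely
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer leaves no partial update: the withdrawal that had already been applied inside the transaction is undone, which is exactly the atomicity guarantee.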
SQL Features
1. Triggers
o Automated actions executed upon events like INSERT, UPDATE, or
DELETE.
2. Views
o Logical subsets of data from complex queries, simplifying data access.
3. Schedules
o Define the chronological execution order of transactions to maintain
consistency.
4. Joins
o Combine data from multiple tables based on conditions, enabling complex
queries.
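The join feature can be sketched with sqlite3 as well; the tables and rows below are invented purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
             "customer_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, "laptop"), (11, 1, "mouse"), (12, 2, "keyboard")])

# An INNER JOIN combines rows from both tables where the condition matches.
rows = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()
# [('alice', 'laptop'), ('alice', 'mouse'), ('bob', 'keyboard')]
```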
CAP Theorem Overview
The CAP Theorem, formulated by Eric Brewer, states that in a distributed system, it is
impossible to simultaneously guarantee all three properties: Consistency (C), Availability
(A), and Partition Tolerance (P). Distributed databases must trade off between these
properties based on specific application needs.
CAP Properties
1. Consistency (C):
o All nodes in the distributed system see the same data at the same time.
o Changes to data are immediately reflected across all nodes.
2. Availability (A):
o Every request receives a response, even when some nodes are down.
3. Partition Tolerance (P):
o The system continues to operate despite network failures that split the
cluster into disconnected groups of nodes.
CAP Combinations
Since achieving all three properties is not possible, distributed systems choose two of the
three based on requirements:
1. Consistency + Availability (CA):
o Ensures all nodes see the same data (Consistency).
o Ensures all requests receive responses (Availability).
o Cannot tolerate network partitions.
o Example: Relational databases in centralized systems.
2. Availability + Partition Tolerance (AP):
o Ensures the system responds to requests even during network failures
(Partition Tolerance).
o May sacrifice consistency, meaning some nodes may have stale or outdated
data.
o Example: DynamoDB, where availability is prioritized over consistency.
3. Consistency + Partition Tolerance (CP):
o Ensures all nodes maintain consistent data (Consistency).
o Tolerates network partitions but sacrifices availability during failures (some
requests may be denied).
1. Key-Value Stores
Uses:
o Image/document storage.
o Lookup tables and query caches.
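A minimal key-value store treats every value as an opaque blob addressed only by its key. The sketch below is a toy illustration of that idea, including the TTL-style expiry typical of query caches; the class and key names are invented, not any real product's API:

```python
import time

class KVStore:
    """Toy key-value store: opaque values addressed by unique keys."""
    def __init__(self):
        self._data = {}

    def put(self, key, value, ttl=None):
        # ttl (seconds) makes an entry expire, as a query cache would.
        expires = time.monotonic() + ttl if ttl else None
        self._data[key] = (value, expires)

    def get(self, key, default=None):
        value, expires = self._data.get(key, (default, None))
        if expires is not None and time.monotonic() > expires:
            del self._data[key]   # evict the stale entry
            return default
        return value

cache = KVStore()
cache.put("user:42:avatar", b"\x89PNG...")    # binary blob, e.g. an image
cache.put("query:top10", ["a", "b"], ttl=60)  # cached query result
```

Note that the store can only look values up by key; there is no querying inside the value, which is the defining trade-off of this architecture pattern.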
2. Document Stores
Definition: Stores unstructured or semi-structured data in a hierarchical format.
Features:
1. Stores data as documents (e.g., JSON, XML).
2. Hierarchical tree structures with paths for navigation.
3. Many document stores offer ACID guarantees at the single-document level.
4. Flexible schema-less design.
Advantages:
o Easy querying and navigation using languages like XPath or XQuery.
o Supports dynamic schema changes (e.g., adding new fields).
Limitations:
o Incompatible with traditional SQL.
o Complex implementation compared to other stores.
Examples: MongoDB, CouchDB.
Use Cases:
o Office documents, inventory data, forms, and document searches.
Comparison:
o JSON includes arrays; XML is more verbose but widely used.
o JSON is easier to handle for developers due to its key-value structure.
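The schema-less, hierarchical nature of JSON documents can be shown with Python's built-in json module. The two records below are invented examples; note that they live in the same "collection" yet carry different fields:

```python
import json

# Two documents with different fields -- no fixed schema is required.
docs = [
    {"_id": 1, "name": "alice", "email": "a@example.com"},
    {"_id": 2, "name": "bob", "tags": ["vip"], "address": {"city": "Pune"}},
]

# Hierarchical navigation follows a path through the nested structure.
city = docs[1]["address"]["city"]        # 'Pune'

# JSON round-trips naturally: arrays and nested objects are first-class.
restored = json.loads(json.dumps(docs))
```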
Use Cases:
o Web crawling, large sparsely populated tables, and high-variance systems.
NoSQL Data Distribution Models:
1. Single-Server Distribution Model:
o This is the simplest distribution model, where all data is stored and
processed on a single server. While easy to implement, it may not scale
well for large datasets or high-traffic applications.
o Best for: Small-scale applications or use cases like graph databases where
relationships are processed sequentially on a single server.
o Example: A simple graph database that processes node relationships on a
single server.
2. Sharding Very Large Databases:
o Sharding refers to the process of splitting a large database into smaller, more
manageable parts called "shards". Each shard is distributed across multiple
servers in a cluster.
o Sharding provides horizontal scalability, allowing the system to process data
in parallel across multiple nodes.
o Advantages:
Enhanced performance by distributing data across multiple nodes.
If a node fails, the shard can migrate to another node for continued
processing.
o Example: A dataset of customer records is split across four servers, with
each server handling one shard.
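One common way to assign records to shards is by hashing the record's key, so every record lands deterministically on exactly one server. This is an illustrative sketch; the server names are hypothetical:

```python
import hashlib

SERVERS = ["db1", "db2", "db3", "db4"]  # hypothetical shard servers

def shard_for(customer_id: str) -> str:
    """Route a record to one shard by hashing its key (hash sharding)."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# The same key always maps to the same shard, spreading load evenly.
placement = {cid: shard_for(cid) for cid in ("c001", "c002", "c003")}
```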
3. Master-Slave Distribution Model:
o In this model, there is one master node that handles write operations, and
multiple slave nodes that replicate the master's data for read operations.
o The master node directs the slaves to replicate data, ensuring consistency
across nodes.
o Advantages:
Read performance is optimized as multiple slave nodes handle read
requests.
Writing is centralized, ensuring data consistency.
o Challenges:
The replication process can introduce some latency and complexity.
A failure of the master node may impact the write operations until a
failover mechanism is implemented.
o Example: MongoDB uses this model where data is replicated from the master
node to slave nodes.
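The read/write split of the master-slave model can be sketched as a toy class: writes go only to the master, which pushes copies to the slaves, while reads are load-balanced across the slave replicas. The class and method names are invented for illustration:

```python
import random

class MasterSlaveStore:
    """Toy sketch: one master takes writes; slaves serve replicated reads."""
    def __init__(self, n_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(n_slaves)]

    def write(self, key, value):
        self.master[key] = value
        self._replicate()            # master directs slaves to copy its data

    def _replicate(self):
        for s in self.slaves:
            s.clear()
            s.update(self.master)

    def read(self, key):
        replica = random.choice(self.slaves)   # spread reads across slaves
        return replica.get(key)

store = MasterSlaveStore()
store.write("x", 1)
value = store.read("x")   # served by a slave replica
```

In a real system replication is asynchronous, which is where the latency and stale-read challenges mentioned above come from; this sketch replicates synchronously for simplicity.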
4. Peer-to-Peer Distribution Model (PPD):
o In this model, all nodes are equal peers that both read and write data. Each
node has a copy of the data and can handle both read and write operations
independently.
o Advantages:
High Availability: Since all nodes can read and write, the system can
tolerate node failures without affecting the ability to perform writes.
MongoDB Database:
MongoDB is a widely-used open-source NoSQL database designed to handle large amounts
of data in a flexible, distributed manner. Initially developed by 10gen (now MongoDB Inc.),
MongoDB was introduced as a platform-as-a-service (PaaS) and later released as an open-
source database. It’s known for its document-oriented model, making it suitable for handling
unstructured and semi-structured data.
Key Characteristics of MongoDB:
Non-relational: Does not rely on traditional SQL-based relational models.
NoSQL: Flexible and can handle large volumes of data across multiple nodes.
Distributed: Data can be stored across multiple machines, supporting horizontal
scalability.
Open Source: Freely available for use and modification.
Document-based: Uses a document-oriented storage model, storing data in flexible
formats such as JSON.
Cross-Platform: Can be used across different operating systems.
Scalable: Can scale horizontally by adding more servers to handle growing data
needs.
Fault Tolerant: Provides high availability through replication and data redundancy.
Features of MongoDB:
1. Database Structure:
o Each database is a physical container for collections. Multiple databases can
run on a single MongoDB server. The default database is test; the server's
main process is mongod, and the command-line client is mongo.
2. Collections:
o Collections are analogous to tables in relational databases, and they store
multiple MongoDB documents. Collections are schema-less, meaning that
documents within a collection can have different fields and structures.
3. Document Model:
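In MongoDB's document model, each record is a document of field-value pairs that may nest arrays and embedded sub-documents, and every document carries a unique _id. The record below is a hypothetical example expressed as a Python dict (MongoDB itself stores documents as BSON):

```python
# A hypothetical MongoDB-style document: field-value pairs with nesting.
student = {
    "_id": "s101",                              # unique document identifier
    "name": "Asha",
    "courses": ["21CS71", "21CS72"],            # array field
    "address": {                                # embedded sub-document
        "city": "Bengaluru",
        "pin": "560001",
    },
}

# Fields are navigated by path, with no fixed schema imposed.
city = student["address"]["city"]
```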
MongoDB Replication
Replication in MongoDB is essential for high availability and fault tolerance in Big Data
environments. Replication involves maintaining multiple copies of data across different
database servers. In MongoDB, this is achieved using replica sets, which ensure data
redundancy and allow for continuous data availability even in the event of server failures.
How Replica Sets Work:
A replica set is a group of MongoDB server processes (mongod) that store the same
data. A replica set typically has at least three nodes:
1. Primary Node: Receives all write operations.
2. Secondary Nodes: Replicate data from the primary node.
The primary node handles all write operations, and these are automatically propagated to the
secondary nodes. If the primary node fails, one of the secondary nodes is promoted to
primary in an automatic failover process, ensuring continuous availability.
o Commands for Replica Set Management:
rs.initiate(): Initializes a new replica set.
rs.config(): Checks the replica set configuration.
rs.status(): Displays the status of the replica set.
rs.add(): Adds new members to the replica set.
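The automatic failover described above can be illustrated with a toy simulation: when the primary fails, a healthy secondary is promoted. This is only a sketch of the idea (real MongoDB elections use votes among replica-set members); the class and node names are invented:

```python
class ReplicaSet:
    """Toy failover sketch: if the primary dies, promote a secondary."""
    def __init__(self, members):
        self.members = {m: "up" for m in members}
        self.primary = members[0]

    def fail(self, node):
        self.members[node] = "down"
        if node == self.primary:
            self._elect()

    def _elect(self):
        # Promote the first healthy member (real elections hold a vote).
        for m, state in self.members.items():
            if state == "up":
                self.primary = m
                return
        self.primary = None   # no healthy members left

rs = ReplicaSet(["node0", "node1", "node2"])
rs.fail("node0")          # primary goes down...
new_primary = rs.primary  # ...and a secondary takes over automatically
```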
MongoDB Sharding
Sharding is MongoDB’s method of distributing data across multiple machines, particularly in
scenarios involving large amounts of data. It is useful for scaling out horizontally when a
single machine can no longer store or process the data efficiently.
How Sharding Works:
Shards: A shard is a single MongoDB server or replica set that holds part of the data.
Sharded Cluster: MongoDB uses a sharded cluster to distribute data. Each shard
contains a portion of the data, and queries are routed to the appropriate shard based on
a shard key.
Shard Key: A field in the documents used to determine how data is distributed across
the shards.
Sharding allows MongoDB to handle larger datasets and more operations by spreading the
load across multiple machines.
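Routing a query to the appropriate shard by shard key can be sketched with range-based partitioning, where each shard owns a contiguous range of key values (MongoDB also supports hashed shard keys). The cluster layout below is hypothetical:

```python
# Hypothetical range-based shard layout: each shard owns [lo, hi) of key space.
RANGES = [
    ("shardA", "a", "h"),    # keys starting a..g
    ("shardB", "h", "p"),    # keys starting h..o
    ("shardC", "p", "{"),    # keys starting p..z ('{' sorts just after 'z')
]

def route(shard_key: str) -> str:
    """Send a query only to the shard that owns this key's range."""
    for shard, lo, hi in RANGES:
        if lo <= shard_key < hi:
            return shard
    raise KeyError(shard_key)

target = route("martin")   # the query touches a single shard, not the cluster
```

Because the router consults only the range table, a query on the shard key avoids broadcasting to every shard, which is what lets the cluster scale out.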
Cassandra Database
Cassandra, developed by Facebook and later released by Apache, is a highly scalable NoSQL
database designed to handle large amounts of structured, semi-structured, and unstructured
data. The database is named after Cassandra, the Trojan prophetess of Greek
myth, who was cursed to utter true prophecies that were never believed. It was
initially designed by
Facebook to handle their massive data needs, and it has since been adopted by several large
companies like IBM, Twitter, and Netflix.
Characteristics:
Open Source: Cassandra is freely available and open to modifications.
Scalable: It is designed to scale horizontally by adding more nodes to the system.
NoSQL: It is a non-relational database, making it suitable for big data applications.
Distributed: Cassandra's architecture allows it to run on multiple servers, ensuring
high availability and fault tolerance.
Column-family based: Data is organized into column families (wide rows)
rather than fixed relational rows, which suits write-heavy workloads.
Decentralized: All nodes in a Cassandra cluster are peers, which ensures that there is
no single point of failure.
Fault-tolerant: Due to data replication across multiple nodes, Cassandra can
withstand node failures without data loss.
Tuneable consistency: It provides flexibility to choose the level of consistency for
different operations.
Features of Cassandra:
Maximizes write throughput: It is optimized for handling massive amounts of write
operations.
No support for joins, group by, OR clauses, or complex aggregations: Its
architecture focuses on performance rather than relational operations.
Fast and easily scalable: The database performs well as more nodes are added, and it
can handle high write volumes.
Distributed architecture: Data is distributed across the nodes in the cluster, ensuring
high availability.
Peer-to-peer: Nodes in Cassandra communicate with each other in a peer-to-peer
fashion, unlike master-slave architectures.
Data Replication in Cassandra: Cassandra provides data replication across multiple
nodes, ensuring no single point of failure. The replication factor defines the number of
replicas placed on different nodes. In case of stale data or node failure, Cassandra uses read
repair to ensure that all replicas are consistent. It adheres to the CAP theorem, prioritizing
availability and partition tolerance.
Scalability: Cassandra supports linear scalability: as new nodes are added to
the cluster, throughput increases proportionally with no downtime. It uses a
decentralized approach in which every node in the cluster plays the same role.
Transaction Support: Cassandra does not provide full ACID transactions like a
traditional RDBMS. Writes are atomic, isolated, and durable at the row
(partition) level, while consistency is tunable per operation; by default it
offers eventual consistency to preserve high availability and fault tolerance.
Replication Strategies:
Simple Strategy: A straightforward replication factor for the entire cluster.
Network Topology Strategy: Allows replication factor configuration per data center,
useful for multi-data center deployments.
Cassandra Data Model:
Cluster: A collection of nodes and keyspaces.
Keyspace: The outermost container in Cassandra that holds column families (tables).
Each keyspace defines the replication strategy and factors.
Column: A single data point consisting of a name, value, and timestamp.
Column Family: A collection of columns, which is equivalent to a table in relational
databases.
Cassandra CQL (Cassandra Query Language):
CREATE KEYSPACE: Creates a keyspace to store tables. It includes replication
strategy options.
ALTER KEYSPACE: Modifies an existing keyspace.
DROP KEYSPACE: Deletes a keyspace.
USE: Switches the current session to a specific keyspace.
CREATE TABLE: Defines a new table with columns, including primary key
constraints.
ALTER TABLE: Modifies the structure of an existing table (e.g., adding or dropping
columns).
DESCRIBE: Provides detailed information about keyspaces, tables, indexes, etc.
CRUD Operations in Cassandra:
1. INSERT: Adds new data into a table.
o Example: INSERT INTO <tablename> (<columns>) VALUES (<values>);
2. UPDATE: Modifies existing data.
o Example: UPDATE <tablename> SET <column> = <value> WHERE
<condition>;