
UNIT-III

NoSQL movement for handling Big Data: Distributing data storage and processing with Hadoop framework, case study on risk assessment for loan sanctioning, ACID principles of relational databases, CAP theorem, BASE principles of NoSQL databases, types of NoSQL databases, case study on disease diagnosis and profiling

Distributing data storage and processing with Hadoop framework


“New big data technologies such as Hadoop and Spark make it much easier to work with
and control a cluster of computers. Hadoop can scale up to thousands of computers,
creating a cluster with petabytes of storage. This enables businesses to grasp the value of
the massive amount of data available.”

Hadoop: a framework for storing and processing large data sets


Hadoop: Hadoop is an open-source software framework for storing large amounts of data and performing computation on it in a distributed computing environment. The framework is written primarily in Java, with some native code in C and shell scripts. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.
Apache Hadoop is a framework that simplifies working with a cluster of computers. It aims
to be all of the following things and more:
■ Reliable—By automatically creating multiple copies of the data and redeploying
processing logic in case of failure.
■ Fault tolerant—It detects faults and applies automatic recovery.
■ Scalable—Data and its processing are distributed over clusters of computers (horizontal
scaling).
■ Portable—Installable on all kinds of hardware and operating systems.
The Different Components of Hadoop:
At the heart of Hadoop, we find
■ A distributed file system (HDFS)
■ A method to execute programs on a massive scale (MapReduce)
■ A system to manage the cluster resources (YARN)
Hadoop is a widely-used framework for distributed storage and processing of large datasets
using the MapReduce programming model. Here's an overview of how Hadoop distributes
data storage and processing:
1. HDFS (Hadoop Distributed File System):
 Distributed Storage: Hadoop stores data in a distributed fashion using HDFS, which splits
files into large blocks (default is 128MB or 256MB) and distributes them across multiple
nodes in a cluster. Each block is replicated (typically 3 copies) to ensure fault tolerance.
 Nodes: In HDFS, the system consists of a NameNode (which manages metadata) and
DataNodes (which store actual data blocks).
 Fault Tolerance: The replication factor ensures that even if a node fails, the data can be
accessed from another replica.

Description: HDFS is a distributed storage system designed for large-scale data processing. It divides files into blocks (default size 128 MB or 256 MB) and distributes them across multiple nodes in a cluster. Each block is replicated (typically three times) to ensure data redundancy and fault tolerance. The architecture is built to handle node failures by replicating data across different nodes.

Diagram: Imagine you have a file File1.txt that is 500 MB. HDFS would divide it into blocks
as follows:

File1.txt (500 MB)
|
+-- Block 1 (128 MB) --> Stored on DataNodes A, B, C
|
+-- Block 2 (128 MB) --> Stored on DataNodes D, E, F
|
+-- Block 3 (128 MB) --> Stored on DataNodes A, D, G
|
+-- Block 4 (116 MB) --> Stored on DataNodes B, E, F

NameNode: Manages metadata, e.g., block locations, directories.

DataNodes: Store actual file data blocks.

Example: Suppose a file of 500 MB is stored in HDFS. It would be divided into 4 blocks:
 Block 1, Block 2, Block 3, Block 4.
 Each block would be replicated three times and stored on different nodes for fault
tolerance.
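To make the block arithmetic above concrete, here is a small Python sketch that computes how a 500 MB file would be split into 128 MB blocks and where the replicas might be placed. This is not HDFS code: the block size, the DataNode names, and the round-robin placement policy are assumptions made purely for illustration (real HDFS placement also takes racks and node load into account).

BLOCK_SIZE_MB = 128          # assumed default block size
REPLICATION_FACTOR = 3       # assumed default replication factor
DATA_NODES = ["A", "B", "C", "D", "E", "F", "G"]   # hypothetical DataNodes

def split_into_blocks(file_size_mb):
    """Return the sizes of the blocks a file of the given size is split into."""
    blocks, remaining = [], file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE_MB, remaining))
        remaining -= BLOCK_SIZE_MB
    return blocks

def place_replicas(num_blocks):
    """Assign each block to REPLICATION_FACTOR DataNodes, round-robin."""
    return [[DATA_NODES[(i + r) % len(DATA_NODES)] for r in range(REPLICATION_FACTOR)]
            for i in range(num_blocks)]

if __name__ == "__main__":
    sizes = split_into_blocks(500)   # File1.txt, 500 MB
    for n, (size, nodes) in enumerate(zip(sizes, place_replicas(len(sizes))), start=1):
        print(f"Block {n} ({size} MB) --> DataNodes {', '.join(nodes)}")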

2. MapReduce:
 Distributed Processing: Hadoop processes data using the MapReduce model, which
breaks the task into two phases:
1. Map Phase: Processes data in parallel across multiple nodes. The input dataset is divided into smaller chunks, known as splits, and the mapper function processes each split to produce key-value pairs such as (word, 1) for each word.
Example: The word "hadoop" appears in File1 and File2. After mapping, you would
have key-value pairs like:
 (hadoop, 1)
 (hadoop, 1)
2. Reduce Phase: After the Map phase, the Reduce phase aggregates the processed
data. The reducer takes these key-value pairs and sums the counts of each word:
Example: For the word "hadoop", the output would be:
 (hadoop, 2)
JobTracker and TaskTracker: In older versions of Hadoop, the JobTracker assigns tasks to
nodes (TaskTrackers) and monitors their execution. In newer versions (YARN),
ResourceManager and NodeManager handle resource management.
Example: Word count in a large document collection. Mappers count word occurrences in
each chunk of data, and reducers combine the results to get the total count of each word.
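One concrete way to run this word count on Hadoop is with Hadoop Streaming, which lets the mapper and reducer be ordinary scripts that read from standard input and write tab-separated key-value pairs to standard output. Below is a minimal sketch of such a pair in Python; the file names mapper.py and reducer.py are only illustrative, and the exact command used to submit them to a cluster depends on the installation.

# mapper.py -- emits (word, 1) for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sums the counts for each word. Hadoop Streaming delivers the
# mapper output to the reducer sorted by key, so equal words arrive as
# consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The pair can be tested locally with a shell pipeline such as cat input.txt | python3 mapper.py | sort | python3 reducer.py, where the sort step plays the role of Hadoop's shuffle and sort phase.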
3. YARN (Yet Another Resource Negotiator)
Purpose: A resource management system that allocates CPU, memory, and other resources
in a Hadoop cluster. YARN decouples resource management and scheduling from
MapReduce, enabling multiple types of distributed applications to run in parallel.
Description: YARN is Hadoop's resource management layer that allocates resources and schedules tasks. It decouples resource management and job scheduling from the MapReduce process. The system consists of:
 ResourceManager: Global resource allocator for the entire cluster.
 NodeManager: Manages resources on individual nodes.
 ApplicationMaster: Oversees the execution of a specific job.
Diagram:
ResourceManager <----> ApplicationMaster <----> NodeManager (on each node)
|
+---> Allocates resources for jobs
Example: Suppose you submit a job to process a large dataset. The ResourceManager will
assign resources (CPU, memory) to the job, while the ApplicationMaster will monitor the
job's progress. NodeManagers on individual nodes will manage task execution on their
nodes.
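The toy simulation below mirrors that flow in plain Python. It is purely conceptual and none of it is YARN's real API; the class names simply echo the three roles, and the memory figures are invented for illustration.

class NodeManager:
    """Manages the resources of one node and launches containers on it."""
    def __init__(self, name, memory_gb):
        self.name, self.free_gb = name, memory_gb

    def launch_container(self, task, memory_gb):
        if self.free_gb < memory_gb:
            return False
        self.free_gb -= memory_gb
        print(f"{self.name}: running {task} in a {memory_gb} GB container")
        return True

class ResourceManager:
    """Global allocator: finds a node with enough free memory for a request."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, task, memory_gb):
        for node in self.nodes:
            if node.launch_container(task, memory_gb):
                return node
        print(f"no resources free for {task}; request is queued")
        return None

class ApplicationMaster:
    """Per-job coordinator: asks the ResourceManager for containers."""
    def __init__(self, resource_manager, tasks):
        self.rm, self.tasks = resource_manager, tasks

    def run(self):
        for task in self.tasks:
            self.rm.allocate(task, memory_gb=4)

if __name__ == "__main__":
    rm = ResourceManager([NodeManager("node1", 8), NodeManager("node2", 8)])
    ApplicationMaster(rm, ["map-task-1", "map-task-2", "map-task-3"]).run()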
SIMPLE EXAMPLE 1: How Hadoop works, particularly the MapReduce process. We'll use a real-life scenario to make it easier to understand.
Example Scenario: Counting Fruits in a Grocery Store
Imagine you own a large grocery store, and you have a list of all the fruits customers have
bought. The list is too big for a single person to count, so you want to split the work
between many workers.
Your list looks like this:
- List 1: "apple, banana, apple, orange"
- List 2: "banana, apple, apple, banana, orange, banana"
You want to find out how many times each type of fruit has been bought (e.g., how many
apples, bananas, and oranges). This is where Hadoop comes in!
Step-by-Step Breakdown
Step 1: Input Data: You have two lists of fruits that represent what customers have bought:
- List 1: "apple, banana, apple, orange"
- List 2: "banana, apple, apple, banana, orange, banana"
These lists are like files that are stored in Hadoop’s file system (HDFS). In Hadoop, big files
are split into smaller parts so that many computers can work on them at the same time.
Step 2: Map Phase: Hadoop splits the lists, and each worker (mapper) counts the fruits in
their assigned list.
Mapper 1 (for List 1) sees: - "apple, banana, apple, orange"
Output: (apple, 1), (banana, 1), (apple, 1), (orange, 1)
Mapper 2 (for List 2) sees: - "banana, apple, apple, banana, orange, banana"
Output: (banana, 1), (apple, 1), (apple, 1), (banana, 1), (orange, 1),
(banana, 1)
So, each worker (mapper) is just counting the number of each fruit it sees, and it outputs a
key-value pair where the key is the fruit and the value is `1`.
Step 3: Shuffle and Sort: Hadoop takes all the results from the mappers and groups them by
the fruit type (key). So all the counts for "apple" are put together, all the counts for
"banana" are put together, and so on.
This is what the grouping looks like:
- (apple, [1, 1, 1, 1]): Four "1"s for apple (two from Mapper 1, two from Mapper 2)
- (banana, [1, 1, 1, 1]): Four "1"s for banana (one from Mapper 1, three from Mapper 2)
- (orange, [1, 1]): Two "1"s for orange (one from each list)
Step 4: Reduce Phase: The reducer adds up the counts for each fruit. It sums the values in
the lists to get the final count for each fruit.
Reduce Output:
- (apple, 4): There are 4 apples in total.
- (banana, 4): There are 4 bananas in total.
- (orange, 2): There are 2 oranges in total.
Step 5: Final Output: The final result tells you how many of each fruit were bought:
apple: 4
banana: 4
orange: 2
Explanation:
Input: Lists of fruits (representing big data files in Hadoop).
Map Phase: Each list is processed separately by different workers (mappers) who count the
fruits.
Shuffle and Sort: Hadoop groups the same fruits together.
Reduce Phase: The reducer sums up the counts and gives the total number of each fruit.
This example helps you understand how Hadoop processes large datasets by breaking the
work into smaller, manageable pieces.
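The whole flow can also be mimicked on a single machine with a few lines of Python. This is only a simulation of the idea; real Hadoop would run the mappers and reducers on different nodes of the cluster.

from collections import defaultdict

lists = [
    "apple, banana, apple, orange",                    # List 1
    "banana, apple, apple, banana, orange, banana",    # List 2
]

# Map phase: each "mapper" turns its list into (fruit, 1) pairs.
mapped = []
for data in lists:
    for fruit in data.split(", "):
        mapped.append((fruit, 1))

# Shuffle and sort: group the pairs by fruit (the key).
grouped = defaultdict(list)
for fruit, count in mapped:
    grouped[fruit].append(count)

# Reduce phase: sum the counts for each fruit.
totals = {fruit: sum(counts) for fruit, counts in grouped.items()}
print(totals)   # {'apple': 4, 'banana': 4, 'orange': 2}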
SIMPLE EXAMPLE 2
Let’s see how MapReduce would work on a small fictitious example. You’re the director of a
toy company. Every toy has two colors, and when a client orders a toy from the web page,
the web page puts an order file on Hadoop with the colors of the toy. Your task is to find out
how many color units you need to prepare. You’ll use a MapReduce-style algorithm to count
the colors. First, let's look at a simplified version in the figure below.

Figure: A simplified example of a MapReduce flow for counting the colors in input texts
As the name suggests, the process roughly boils down to two big phases:
■ Mapping phase—The documents are split up into key-value pairs. Until we reduce, we can
have many duplicates.
■ Reduce phase—It’s not unlike a SQL “group by.” The different unique occurrences are
grouped together, and depending on the reducing function, a different result can be
created. Here we wanted a count per color, so that’s what the reduce function returns. In
reality it’s a bit more complicated than this though
Figure: An example of a MapReduce flow for counting the colors in input texts

Introduction to NoSQL
NoSQL databases are designed to manage large-scale data across distributed
systems and provide more flexible ways to model data, depending on the use
case. Unlike relational databases that strictly follow predefined structures,
NoSQL databases allow the data model to fit the needs of the application.
In order to understand NoSQL, we first need to explore the core principles of
relational databases, known as ACID, and how NoSQL rewrites these principles
into BASE to better suit distributed environments. Additionally, we’ll look at the
CAP theorem, which explains the challenges of distributing databases across
multiple nodes and how ACID and BASE handle these challenges differently.
ACID principle of relational databases
ACID: The Core Principle of Relational Databases
Relational databases follow ACID principles to ensure data reliability and consistency. These
principles are:
1. Atomicity:
o All or nothing. A transaction must be fully completed, or no part of it should be
applied.
o Example: When transferring money between bank accounts, the full amount is moved, or the transaction is aborted. Partial transactions are not allowed (see the sketch after this list).
2. Consistency:
o The database maintains predefined rules for data integrity, ensuring that only valid
data can be saved.
o Example: A field requiring a number will never accept text. All data stored must
follow these set rules.
3. Isolation:
o Changes in the database must not be visible to others until the transaction is
complete.
o Example: A document edited by one user is locked for others. They cannot see the
ongoing changes until the editing is finalized and saved.
4. Durability:
o Once data is committed, it remains safe and permanent, even in the event of a
crash or power failure.
o Example: If you save a record and the system crashes, the saved data is still available
when the system recovers.
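As promised above, here is a minimal sketch of atomicity (and, through commit, durability) using Python's built-in sqlite3 module. The account names and amounts are invented; the point is only that both updates take effect together or not at all.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.commit()      # both updates become permanent together
    except Exception:
        conn.rollback()    # or neither update is applied at all
        raise

transfer(conn, "alice", "bob", 30)
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 70), ('bob', 80)]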

ACID applies to traditional relational databases and some NoSQL databases, like
Neo4j (a graph database). However, most NoSQL databases follow a different set
of principles called BASE to support distributed environments.
The CAP Theorem Overview
The CAP Theorem is often cited in discussions of NoSQL databases to explain why maintaining consistency across a distributed system can be challenging. Proposed by Eric Brewer in 2000, and later formally proven by Seth Gilbert and Nancy Lynch, it describes the trade-offs inherent in distributed systems. Understanding the CAP Theorem helps developers choose the right database for their specific use case, balancing the trade-offs based on the requirements of their application.
Theorem statement
The CAP Theorem can be stated as follows: In a distributed data store, it is impossible to
simultaneously guarantee all three of the following properties:
1. Consistency (C): All nodes see the same data at the same time.
2. Availability (A): Every request (read or write) receives a response, regardless of
whether the data is up-to-date.
3. Partition Tolerance (P): The system continues to operate despite network partitions.
You can have at most two of these three guarantees at any given time.
According to the CAP Theorem, a distributed system can achieve at most two of these
three properties at the same time. This leads to different types of NoSQL databases
being classified based on their design choices:
● CP (Consistency and Partition Tolerance): These systems prioritize consistency and
partition tolerance but may sacrifice availability during network partitions.
An example is HBase.
● AP (Availability and Partition Tolerance): These systems focus on availability and
partition tolerance, potentially allowing for temporary inconsistencies. An
example is Cassandra.
● CA (Consistency and Availability): This is typically not achievable in a distributed system
because network partitions are a reality in any distributed setup. Most systems cannot
guarantee both consistency and availability in the presence of network failures.
Example scenarios
Here are example scenarios for each combination of the CAP Theorem (Consistency,
Availability, and Partition Tolerance):
1. CA (Consistency and Availability)
● Example Scenario: Banking Transaction System
● Description: A banking system that operates on a single-node architecture.
● Behavior: When a user makes a deposit or withdrawal, the transaction is processed
immediately, ensuring that all subsequent reads return the most recent account balance
(consistency). Since there's only one node, the system is always available as long as that
node is operational.
● Limitation: If the server goes down (e.g., due to hardware failure), the system becomes
unavailable, losing availability.
2. AP (Availability and Partition Tolerance)
● Example Scenario: Social Media Feed
● Description: A social media platform that allows users to post updates and comments.
● Behavior: Users can post updates and interact with the platform even if some nodes
are unreachable due to network issues (availability). If a network partition occurs,
different nodes might have slightly different views of the feed, allowing users to continue
posting and commenting without waiting for synchronization.
● Limitation: Because of the partition tolerance, there may be inconsistencies between
users’ feeds, as updates made on one side of the partition may not be reflected on the
other until the partition is resolved.
3. CP (Consistency and Partition Tolerance)
● Example Scenario: Distributed Database for E-Commerce
● Description: An e-commerce platform that needs to maintain accurate inventory counts across multiple geographic locations.
● Behavior: When a user tries to purchase a product, the system ensures that the
inventory count is updated consistently across all nodes before completing the
transaction (consistency). If there’s a network partition, the system may temporarily
reject orders or limit access to ensure that inventory counts remain consistent.
● Limitation: During a partition, users may experience delays or rejections when trying to
place orders, sacrificing availability to maintain consistency in inventory management.

The BASE principles of NoSQL databases


The BASE principles of NoSQL databases represent a more flexible approach
compared to the strict ACID principles of relational databases. While ACID
ensures strong consistency and reliability, BASE offers softer guarantees that
prioritize availability and scalability in distributed systems. Here’s a summary
of the BASE promises:
1. Basically Available:
o Availability is a key feature of BASE. The system ensures that it's
always operational, even in the case of node failures. It focuses on
keeping services running, though the data might not always be up-to-
date or consistent.
o Example: In systems like Cassandra or Elasticsearch, if one node fails,
others can take over to keep the service available, often through data
replication or sharding.
Figure: Sharding: each shard can function as a self-sufficient database, but the shards also work together as a whole. The example represents two nodes, each containing four shards: two main shards and two replicas. Failure of one node is backed up by the other.
2. Soft State:
o The state of the database may change over time, even without new
input, due to the eventual consistency model. This means the system
doesn't guarantee immediate consistency after every transaction.
o Example: Data in one node might say "X" and another node might
say "Y" temporarily, but this will be resolved later when the nodes
synchronize their data.
3. Eventual Consistency:
o The database will become consistent over time, but it might allow for
temporary inconsistencies. Eventually, after all updates are
synchronized, every node will hold the same data.
o Example: If two customers purchase the last item in stock at the
same time, the database may show inconsistent results for a short
period, but it will eventually reconcile the conflict and decide who
gets the item.
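The toy code below illustrates soft state and eventual consistency with two "replicas" that are just Python dictionaries and an explicit synchronization step. Real systems such as Cassandra achieve the same effect with gossip protocols, timestamps, and conflict-resolution rules that are far more involved.

node_a = {"stock": 1}
node_b = {"stock": 1}

# A write lands on node A only (soft state: the replicas now disagree).
node_a["stock"] = 0
print("before sync:", node_a, node_b)   # {'stock': 0} {'stock': 1}

def sync(source, target):
    """Eventually propagate the newer state to the other replica."""
    target.update(source)

sync(node_a, node_b)
print("after sync: ", node_a, node_b)   # {'stock': 0} {'stock': 0}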
ACID versus BASE
The BASE principles are somewhat contrived to fit acid and base from chemistry: an acid is a fluid with a low pH value, and a base is the opposite, with a high pH value. We won't go into the chemistry details here, but the figure below shows a mnemonic for those familiar with the chemistry equivalents of acid and base.

Figure: ACID versus BASE: traditional relational databases versus most NoSQL
databases. The names are derived from the chemistry concept of the pH
scale. A pH value below 7 is acidic; higher than 7 is a base. On this scale, your
average surface water fluctuates between 6.5 and 8.5.
Types of NoSQL databases:
A database is a collection of structured data or information which is stored in a computer system and can be accessed easily. A database is usually managed by a Database Management System (DBMS). NoSQL is a non-relational database that is used to store data in a non-tabular form. NoSQL stands for "Not only SQL." The main types are document, key-value, wide-column, and graph databases.
Types of NoSQL Database:
 Document-based databases
 Key-value stores
 Column-oriented databases
 Graph-based databases
Document-Based Database:
A document-based database is a non-relational database. Instead of storing the data in rows and columns (tables), it uses documents to store the data. A document database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects used in applications, which means less translation is required to use the data in applications. In a document database, particular elements can be accessed by using an index value assigned to them for faster querying.
Collections are groups of documents that store related content. The documents in a collection do not need to share the same schema, because document databases have a flexible schema.

Key features of document databases:

 Flexible schema: Documents in the database have a flexible schema, meaning they do not all need to follow the same schema.
 Faster creation and maintenance: Creating documents is easy, and minimal maintenance is required once a document is created.
 No foreign keys: There is no dynamic relationship between two documents, so documents can be independent of one another and no foreign keys are required.
 Open formats: Documents are built with open formats such as XML and JSON.
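The sketch below shows the document-store idea in plain Python: each record is a self-describing document, and two documents in the same collection need not share a schema. It is an illustration only, not the API of any particular document database, and the product data and helper function are invented.

import json

products = []   # a "collection" of documents

products.append({"_id": 1, "name": "laptop", "price": 899, "specs": {"ram_gb": 16}})
products.append({"_id": 2, "name": "mouse", "price": 25})   # different fields: allowed

def find_one(collection, **criteria):
    """Return the first document whose fields match all the given criteria."""
    for doc in collection:
        if all(doc.get(key) == value for key, value in criteria.items()):
            return doc
    return None

print(json.dumps(find_one(products, name="laptop"), indent=2))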

Key-Value Stores:

A key-value store is a non-relational database, and it is the simplest form of a NoSQL database. Every data element in the database is stored as a key-value pair. The data can be retrieved using the unique key allotted to each element in the database. The values can be simple data types like strings and numbers or complex objects.
A key-value store is like a relational database with only two columns, which are the key and the value.
Key features of the key-value store:
 Simplicity.
 Scalability.
 Speed.
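A minimal in-memory key-value store in Python makes the "two columns" picture concrete. Real key-value stores such as Redis or DynamoDB add persistence, replication, and expiry on top of this basic get/put interface; the class and method names here are invented for illustration.

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", {"user": "chandrika", "cart": ["book", "pen"]})
print(store.get("session:42"))   # values can be simple types or complex objects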

Column-Oriented Databases (also called Column Family Data Stores or Wide Column Data Stores):

A column-oriented database is a non-relational database that stores the data in columns instead of rows. That means when you want to run analytics on a small number of columns, you can read those columns directly without consuming memory with unwanted data.
Columnar databases are designed to read data more efficiently and retrieve the data with greater speed. A columnar database is used to store a large amount of data.
Key features of column-oriented databases:
 Scalability.
 Compression.
 Very responsive.
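The Python sketch below contrasts a row-oriented layout with a column-oriented layout for the same small table. The data is made up, and real columnar stores also compress each column heavily because its values share one type, but the sketch shows why reading a single column is cheaper in the columnar layout.

# Row-oriented: each record is stored together.
rows = [
    {"id": 1, "name": "laptop", "price": 899},
    {"id": 2, "name": "mouse", "price": 25},
    {"id": 3, "name": "screen", "price": 199},
]

# Column-oriented: each column is stored together.
columns = {
    "id": [1, 2, 3],
    "name": ["laptop", "mouse", "screen"],
    "price": [899, 25, 199],
}

# Analytics on one column touches only that column in the columnar layout.
average_price = sum(columns["price"]) / len(columns["price"])
print(average_price)   # 374.33...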

Graph-Based Databases:
Graph-based databases focus on the relationships between elements. They store the data in the form of nodes in the database. The connections between the nodes are called links or relationships.
Key features of graph databases:
 In a graph-based database, it is easy to identify the relationships between data items by using the links.
 Queries return results in real time.
 The speed depends upon the number of relationships among the database elements.
 Updating data is also easy, as adding a new node or edge to a graph database is a straightforward task that does not require significant schema changes.

 Node—The entities themselves. In a social network this could be people.
 Edge—The relationship between two entities. This relationship is represented by a line and has its own properties. An edge can have a direction, for example, if the arrow indicates who is whose boss.
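A tiny Python sketch of the graph model: nodes are entities, and edges are directed, labelled relationships that can carry their own properties. The social-network data and the helper function are invented purely for illustration.

nodes = {"ann": {"role": "manager"}, "bob": {"role": "engineer"}, "carol": {"role": "engineer"}}

# Each edge: (from_node, relationship, to_node, properties)
edges = [
    ("ann", "MANAGES", "bob", {"since": 2021}),
    ("ann", "MANAGES", "carol", {"since": 2022}),
    ("bob", "KNOWS", "carol", {}),
]

def related(person, relationship):
    """Follow outgoing edges of one type from a node."""
    return [dst for src, rel, dst, _ in edges if src == person and rel == relationship]

print(related("ann", "MANAGES"))   # ['bob', 'carol']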
