DS Notes Unit 3

COMPUTER SCIENCE AND ENGINEERING - Artificial Intelligence and Machine Learning (Jawaharlal Nehru Technological University, Kakinada)


UNIT-III
NoSQL movement for handling Big Data: Distributing data storage and processing with the Hadoop framework, case study on risk assessment for loan sanctioning, ACID principles of relational databases, CAP theorem, BASE principles of NoSQL databases, types of NoSQL databases, case study on disease diagnosis and profiling

Distributing data storage and processing with Hadoop framework


“New big data technologies such as Hadoop and Spark make it much easier to work with and
control a cluster of computers. Hadoop can scale up to thousands of computers, creating a
cluster with petabytes of storage. This enables businesses to grasp the value of the massive
amount of data available”.

Hadoop: a framework for storing and processing large data sets


Hadoop: Hadoop is an open-source software framework for storing and processing large amounts of data in a distributed computing environment. The framework is written mainly in Java, with some native code in C and shell scripts. It is designed to handle big data and is based on the MapReduce programming model, which allows large datasets to be processed in parallel.
Apache Hadoop is a framework that simplifies working with a cluster of computers. It aims
to be all of the following things and more:
■ Reliable— It automatically creates multiple copies of the data and redeploys processing logic in case of failure.
■ Fault tolerant— It detects faults and applies automatic recovery.
■ Scalable— Data and its processing are distributed over clusters of computers (horizontal
scaling).
■ Portable— Installable on all kinds of hardware and operating systems.
The Different Components of Hadoop:
At the heart of Hadoop, we find
■ A distributed file system (HDFS)
■ A method to execute programs on a massive scale (MapReduce)
■ A system to manage the cluster resources (YARN)
Hadoop is a widely-used framework for distributed storage and processing of large datasets
using the MapReduce programming model. Here's an overview of how Hadoop distributes
data storage and processing:

1. HDFS (Hadoop Distributed File System):


 Distributed Storage: Hadoop stores data in a distributed fashion using HDFS, which
splits files into large blocks (default is 128MB or 256MB) and distributes them across
multiple nodes in a cluster. Each block is replicated (typically 3 copies) to ensure fault
tolerance.
 Nodes: In HDFS, the system consists of a NameNode (which manages metadata) and
DataNodes (which store actual data blocks).
 Fault Tolerance: The replication factor ensures that even if a node fails, the data can be
accessed from another replica.

Overview: HDFS is a distributed storage system designed for large-scale data processing. It
divides files into blocks (default size 128 MB or 256 MB) and distributes them across
multiple nodes in a cluster. Each block is replicated (typically three times) to ensure data
redundancy and fault tolerance. The architecture is built to handle node failures by
replicating data across different nodes.

Diagram: Imagine you have a file File1.txt that is 500 MB. HDFS would divide it into
blocks as follows:

File1.txt (500 MB)


|
+-- Block 1 (128 MB) --> Stored on DataNode A, B, C
|
+-- Block 2 (128 MB) --> Stored on DataNode D, E, F
|
+-- Block 3 (128 MB) --> Stored on DataNode A, D, G
|
+-- Block 4 (116 MB) --> Stored on DataNode B, E, F

NameNode: Manages metadata, e.g., block locations, directories.


DataNodes: Store actual file data blocks.

Example: Suppose a file of 500 MB is stored in HDFS. It would be divided into 4 blocks:
 Block 1, Block 2, Block 3, Block 4.

 Each block would be replicated three times and stored on different nodes for fault
tolerance.
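To make the block-splitting and replication arithmetic concrete, here is a minimal Python sketch that mimics how a 500 MB file is cut into 128 MB blocks and assigned to DataNodes with a replication factor of 3. The node names and the round-robin placement policy are illustrative assumptions, not actual HDFS placement logic.

```python
import itertools

BLOCK_SIZE_MB = 128   # HDFS default block size assumed here
REPLICATION = 3       # default replication factor

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file would be split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement of each block on `replication` DataNodes."""
    ring = itertools.cycle(datanodes)
    return {f"Block {i+1} ({size} MB)": [next(ring) for _ in range(replication)]
            for i, size in enumerate(blocks)}

if __name__ == "__main__":
    nodes = ["DataNode A", "DataNode B", "DataNode C",
             "DataNode D", "DataNode E", "DataNode F", "DataNode G"]
    layout = place_replicas(split_into_blocks(500), nodes)
    for block, replicas in layout.items():
        print(block, "-->", ", ".join(replicas))
    # A 500 MB file yields 3 blocks of 128 MB and 1 block of 116 MB,
    # each stored on 3 different DataNodes.
```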

2. MapReduce:
 Distributed Processing: Hadoop processes data using the MapReduce model, which
breaks the task into two phases:
1. Map Phase: Processes data in parallel across multiple nodes by splitting the
dataset into smaller chunks, known as splits.
Each file is divided into smaller chunks, and the mapper function processes each
chunk to produce key-value pairs like (word, 1) for each word.
Example: The word "hadoop" appears in File1 and File2. After mapping, you would
have key-value pairs like:
 (hadoop, 1)
 (hadoop, 1)
2. Reduce Phase: After the Map phase, the Reduce phase aggregates the processed
data. The reducer takes these key-value pairs and sums the counts of each word:
Example: For the word "hadoop", the output would be:
 (hadoop, 2)
JobTracker and TaskTracker: In older versions of Hadoop, the JobTracker assigns tasks to
nodes (TaskTrackers) and monitors their execution. In newer versions (YARN),
ResourceManager and NodeManager handle resource management.
Diagram: (MapReduce flow: input splits → Map tasks on each node → shuffle and sort by key → Reduce tasks → final aggregated output)

Example: Word count in a large document collection. Mappers count word occurrences in
each chunk of data, and reducers combine the results to get the total count of each word.
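The word-count flow described above can be sketched in a few lines of Python: mappers emit (word, 1) pairs, the framework groups the pairs by key, and reducers sum the counts. This is an in-process simulation for illustration only, not production Hadoop code.

```python
from collections import defaultdict

def mapper(text):
    """Map phase: emit (word, 1) for every word in one input split."""
    for word in text.lower().split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle and sort: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: sum the counts for one word."""
    return (key, sum(values))

if __name__ == "__main__":
    splits = ["hadoop stores data in hdfs",
              "hadoop processes data with mapreduce"]
    mapped = [pair for split in splits for pair in mapper(split)]
    grouped = shuffle(mapped)
    for word in sorted(grouped):
        print(reducer(word, grouped[word]))
    # prints ('data', 2), ('hadoop', 2), ... matching the (hadoop, 2) result above
```
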
YARN (Yet Another Resource Negotiator)
Purpose: A resource management system that allocates CPU, memory, and other resources
in a Hadoop cluster. YARN decouples resource management and scheduling from
MapReduce, enabling multiple types of distributed applications to run in parallel.
Overview: YARN is Hadoop's resource management layer that allocates resources and
schedules tasks. It decouples resource management and job scheduling from the MapReduce
process. The system consists of:
 ResourceManager: Global resource allocator for the entire cluster.
 NodeManager: Manages resources on individual nodes.
 ApplicationMaster: Oversees the execution of a specific job.
Diagram:
ResourceManager <----> ApplicationMaster <----> NodeManager (on each node)
        |
        +--> Allocates resources for jobs
Example: Suppose you submit a job to process a large dataset. The ResourceManager will
assign resources (CPU, memory) to the job, while the ApplicationMaster will monitor the
job's progress. NodeManagers on individual nodes will manage task execution on their
nodes.
SIMPLE EXAMPLE 1: how Hadoop works, particularly the MapReduce process. We'll use a real-life scenario to make it easier to understand.
Example Scenario: Counting Fruits in a Grocery Store
Imagine you own a large grocery store, and you have a list of all the fruits customers have
bought. The list is too big for a single person to count, so you want to split the work between
many workers.
Your list looks like this:
- List 1: "apple, banana, apple, orange"
- List 2: "banana, apple, apple, banana, orange, banana"
You want to find out how many times each type of fruit has been bought (e.g., how many
apples, bananas, and oranges). This is where Hadoop comes in!


Step-by-Step Breakdown
Step 1: Input Data: You have two lists of fruits that represent what customers have bought:
- List 1: "apple, banana, apple, orange"
- List 2: "banana, apple, apple, banana, orange, banana"
These lists are like files that are stored in Hadoop’s file system (HDFS). In Hadoop, big files
are split into smaller parts so that many computers can work on them at the same time.
Step 2: Map Phase: Hadoop splits the lists, and each worker (mapper) counts the fruits in
their assigned list.
Mapper 1 (for List 1) sees: - "apple, banana, apple, orange"
Output: (apple, 1), (banana, 1), (apple, 1), (orange, 1)
Mapper 2 (for List 2) sees: - "banana, apple, apple, banana, orange, banana"
Output: (banana, 1), (apple, 1), (apple, 1), (banana, 1), (orange, 1),
(banana, 1)
So, each worker (mapper) is just counting the number of each fruit it sees, and it outputs a
key-value pair where the key is the fruit and the value is `1`.
Step 3: Shuffle and Sort: Hadoop takes all the results from the mappers and groups them by
the fruit type (key). So all the counts for "apple" are put together, all the counts for "banana"
are put together, and so on.
This is what the grouping looks like:
- (apple, [1, 1, 1, 1]): Four "1"s for apple (two from Mapper 1, two from Mapper 2)
- (banana, [1, 1, 1, 1]): Four "1"s for banana (one from Mapper 1, three from Mapper 2)
- (orange, [1, 1]): Two "1"s for orange (one from each list)
Step 4: Reduce Phase: The reducer adds up the counts for each fruit. It sums the values in
the lists to get the final count for each fruit.
Reduce Output:
- (apple, 4): There are 4 apples in total.
- (banana, 4): There are 4 bananas in total.
- (orange, 2): There are 2 oranges in total.
Step 5: Final Output: The final result tells you how many of each fruit were bought:
apple: 4

banana: 4
orange: 2
Explanation:
Input: Lists of fruits (representing big data files in Hadoop).
Map Phase: Each list is processed separately by different workers (mappers) who count the
fruits.
Shuffle and Sort: Hadoop groups the same fruits together.
Reduce Phase: The reducer sums up the counts and gives the total number of each fruit.
This example helps you understand how Hadoop processes large datasets by breaking the
work into smaller, manageable pieces.
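If you wanted to run a count like this on an actual cluster, one common option is Hadoop Streaming, which lets any executable act as the mapper and reducer. The sketch below follows that style; the file names, HDFS paths, and the streaming-jar location shown in the comments are assumptions that vary by installation.

```python
#!/usr/bin/env python3
# mapper.py -- reads comma-separated fruit lists from stdin and
# emits one "fruit<TAB>1" line per fruit (the Map phase).
import sys

for line in sys.stdin:
    for fruit in line.strip().split(","):
        fruit = fruit.strip()
        if fruit:
            print(f"{fruit}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming delivers the mapper output sorted by key,
# so we only need to sum counts for consecutive identical fruits (the Reduce phase).
import sys

current_fruit, count = None, 0
for line in sys.stdin:
    fruit, value = line.rstrip("\n").split("\t")
    if fruit != current_fruit:
        if current_fruit is not None:
            print(f"{current_fruit}\t{count}")
        current_fruit, count = fruit, 0
    count += int(value)
if current_fruit is not None:
    print(f"{current_fruit}\t{count}")

# A typical (installation-dependent) invocation might look like:
# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#   -input /fruit/orders -output /fruit/counts \
#   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
```

Because the framework sorts the mapper output by key before it reaches the reducer, the reducer only has to compare consecutive keys rather than hold everything in memory.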
SIMPLE EXAMPLE 2
Let’s see how MapReduce would work on a small fictitious example. You’re the director of a
toy company. Every toy has two colors, and when a client orders a toy from the web page,
the web page puts an order file on Hadoop with the colors of the toy. Your task is to find out
how many color units you need to prepare. You’ll use a MapReduce-style algorithm to count
the colors. First let's look at a simplified version in the figure below.

Figure: A simplified example of a MapReduce flow for counting the colors in input texts
As the name suggests, the process roughly boils down to two big phases:

■ Mapping phase—The documents are split up into key-value pairs. Until we reduce, we can
have many duplicates.
■ Reduce phase—It’s not unlike a SQL “group by.” The different unique occurrences are
grouped together, and depending on the reducing function, a different result can be created.
Here we wanted a count per color, so that’s what the reduce function returns. In reality it’s a
bit more complicated than this, though.

Figure: An example of a MapReduce flow for counting the colors in input texts

Introduction to NoSQL
NoSQL databases are designed to manage large-scale data across distributed systems and
provide more flexible ways to model data, depending on the use case. Unlike relational
databases that strictly follow predefined structures, NoSQL databases allow the data model to
fit the needs of the application.
In order to understand NoSQL, we first need to explore the core principles of relational
databases, known as ACID, and how NoSQL rewrites these principles into BASE to better
suit distributed environments. Additionally, we’ll look at the CAP theorem, which explains
the challenges of distributing databases across multiple nodes and how ACID and BASE
handle these challenges differently.

ACID PRINCIPLES OF RELATIONAL DATABASES


Relational databases follow ACID principles to ensure data reliability and consistency. These
principles are:
1. Atomicity:
o All or nothing. A transaction must be fully completed, or no part of it should be applied.
o Example: When transferring money between bank accounts, the full amount is moved,

or the transaction is aborted. Partial transactions are not allowed (see the code sketch after this list).


2. Consistency:
o The database maintains predefined rules for data integrity, ensuring that only valid data
can be saved.
o Example: A field requiring a number will never accept text. All data stored must follow
these set rules.
3. Isolation:
o Changes in the database must not be visible to others until the transaction is complete.
o Example: A document edited by one user is locked for others. They cannot see the
ongoing changes until the editing is finalized and saved.
4. Durability:
o Once data is committed, it remains safe and permanent, even in the event of a crash or
power failure.
o Example: If you save a record and the system crashes, the saved data is still available
when the system recovers.
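As a concrete illustration of atomicity (the bank-transfer example above), here is a minimal Python sketch using the standard-library sqlite3 module. The table and account names are made up for the example; the point is that the transaction either commits fully or rolls back.

```python
import sqlite3

# Either both legs of the transfer are applied, or neither is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # forces a rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # the failed transfer leaves the table exactly as it was

transfer(conn, "alice", "bob", 70)    # succeeds: alice 30, bob 120
transfer(conn, "alice", "bob", 500)   # fails: balances remain 30 and 120
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

The `with conn:` block is what provides the all-or-nothing behavior: any exception raised inside it rolls the whole transaction back.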
ACID applies to traditional relational databases and some NoSQL databases, like Neo4j (a
graph database). However, most NoSQL databases follow a different set of principles called
BASE to support distributed environments.

The CAP Theorem Overview


The CAP Theorem is often used in discussions of NoSQL databases to explain why maintaining consistency in a distributed system can be challenging. Proposed by Eric Brewer in 2000, and later formally proven by Seth Gilbert and Nancy Lynch, it captures the trade-offs inherent in distributed systems. Understanding the
CAP Theorem helps developers choose the right database for their specific use case,
balancing the trade-offs based on the requirements of their application.
Theorem statement
The CAP Theorem can be stated as follows: In a distributed data store, it is impossible to
simultaneously guarantee all three of the following properties:
1. Consistency (C): All nodes see the same data at the same time.
2. Availability (A): Every request (read or write) receives a response, regardless of
whether the data is up-to-date.
3. Partition Tolerance (P): The system continues to operate despite network partitions.
You can have at most two of these three guarantees at any given time.

According to the CAP Theorem, a distributed system can achieve at most two of these three properties at the same time. This leads to different types of NoSQL databases being classified based on their design choices:

● CP (Consistency and Partition Tolerance): These systems prioritize consistency and partition tolerance but may sacrifice availability during network partitions. An example is HBase.
● AP (Availability and Partition Tolerance): These systems focus on availability and partition tolerance, potentially allowing for temporary inconsistencies. An example is Cassandra.
● CA (Consistency and Availability): This is typically not achievable in a distributed system because network partitions are a reality in any distributed setup. Most systems cannot guarantee both consistency and availability in the presence of network failures.
Here are example scenarios for each combination of the CAP Theorem (Consistency,
Availability, and Partition Tolerance):
1. CA (Consistency and Availability)
● Example Scenario: Banking Transaction System
● Description: A banking system that operates on a single-node architecture.
● Behavior: When a user makes a deposit or withdrawal, the transaction is processed immediately, ensuring that all subsequent reads return the most recent account balance (consistency). Since there's only one node, the system is always available as long as that node is operational.
● Limitation: If the server goes down (e.g., due to hardware failure), the system becomes unavailable, losing availability.

2. AP (Availability and Partition Tolerance)


● Example Scenario: Social Media Feed
● Description: A social media platform that allows users to post updates and comments.
● Behavior: Users can post updates and interact with the platform even if some nodes are unreachable due to network issues (availability). If a network partition occurs, different nodes might have slightly different views of the feed, allowing users to continue posting and commenting without waiting for synchronization.
● Limitation: Because of the partition tolerance, there may be inconsistencies between users' feeds, as updates made on one side of the partition may not be reflected on the other until the partition is resolved.
3. CP (Consistency and Partition Tolerance)
● Example Scenario: Distributed Database for E-Commerce
● Description: An e-commerce platform that needs to maintain accurate inventory counts across multiple geographic locations.
● Behavior: When a user tries to purchase a product, the system ensures that the inventory count is updated consistently across all nodes before completing the transaction (consistency). If there's a network partition, the system may temporarily reject orders or limit access to ensure that inventory counts remain consistent.
● Limitation: During a partition, users may experience delays or rejections when trying to place orders, sacrificing availability to maintain consistency in inventory management.

THE BASE PRINCIPLES OF NOSQL DATABASES


The BASE principles of NoSQL databases represent a more flexible approach compared
to the strict ACID principles of relational databases. While ACID ensures strong
consistency and reliability, BASE offers softer guarantees that prioritize availability and
scalability in distributed systems. Here’s a summary of the BASE promises:
1. Basically Available:
o Availability is a key feature of BASE. The system ensures that it's always operational,
even in the case of node failures. It focuses on keeping services running, though the
data might not always be up-to-date or consistent.
o Example: In systems like Cassandra or Elasticsearch, if one node fails, others can
take over to keep the service available, often through data replication or sharding.

Figure: sharding: each shard can function as a self-sufficient database, but they also work
together as a whole. The example represents two nodes, each containing four shards: two
main shards and two replicas. If one node fails, its data can still be served from the replicas on the other node.

2. Soft State:
o The state of the database may change over time, even without new input, due to the
eventual consistency model. This means the system doesn't guarantee immediate
consistency after every transaction.
o Example: Data in one node might say "X" and another node might say "Y"
temporarily, but this will be resolved later when the nodes synchronize their data.
3. Eventual Consistency:
o The database will become consistent over time, but it might allow for temporary
inconsistencies. Eventually, after all updates are synchronized, every node will hold
the same data.
o Example: If two customers purchase the last item in stock at the same time, the
database may show inconsistent results for a short period, but it will eventually
reconcile the conflict and decide who gets the item.
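Here is a toy Python sketch of soft state and eventual consistency, under the simplifying assumption of two in-memory replicas that merge their latest writes only when a sync runs (real systems use vector clocks, quorums, or similar mechanisms).

```python
import itertools

_clock = itertools.count()  # a toy logical clock shared by all replicas

class Replica:
    """A toy key-value replica that tags each write with a logical timestamp."""
    def __init__(self, name):
        self.name = name
        self.store = {}          # key -> (timestamp, value)

    def write(self, key, value):
        self.store[key] = (next(_clock), value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

def sync(a, b):
    """Last-write-wins merge: after syncing, both replicas agree on every key."""
    for key in set(a.store) | set(b.store):
        latest = max((r.store[key] for r in (a, b) if key in r.store),
                     key=lambda entry: entry[0])
        a.store[key] = b.store[key] = latest

node_a, node_b = Replica("A"), Replica("B")
node_a.write("stock:item42", 1)   # node A still thinks one unit is left
node_b.write("stock:item42", 0)   # node B has recorded the final sale
print(node_a.read("stock:item42"), node_b.read("stock:item42"))  # soft state: 1 0
sync(node_a, node_b)
print(node_a.read("stock:item42"), node_b.read("stock:item42"))  # eventually consistent: 0 0
```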


ACID versus BASE
The BASE principles are somewhat contrived to fit acid and base from chemistry: an acid is a fluid with a low pH value, while a base is the opposite and has a high pH value. We won't go into the chemistry details here, but the figure below shows a mnemonic for those familiar with the chemistry equivalents of acid and base.

Figure: ACID versus BASE: traditional relational databases versus most NoSQL databases. The names are derived from the chemistry concept of the pH scale. A pH value below 7 is acidic; higher than 7 is a base. On this scale, your average surface water fluctuates between 6.5 and 8.5.

TYPES OF NOSQL DATABASES:


A database is a collection of structured data or information which is stored in a computer
system and can be accessed easily. A database is usually managed by a Database
Management System (DBMS).
NoSQL is a non-relational database that stores data in a non-tabular form. NoSQL stands for "Not Only SQL." The main types are document, key-value, wide-column, and graph databases.
Types of NoSQL Database:
 Document-based databases
 Key-value stores
 Column-oriented databases

 Graph-based databases

1. Document-Based Database:
A document-based database is a non-relational database. Instead of storing data in rows and columns (tables), it uses documents to store the data; a document database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects used in applications, which means less translation is required to use the data in those applications.
In a document database, particular elements can be indexed so they can be accessed quickly during queries.
Collections are groups of documents with similar contents. Documents in the same collection are not required to share a common schema, because document databases have a flexible schema.

Key features of a document database:


 Flexible schema: Documents in the database have a flexible schema, meaning different documents need not follow the same schema (see the example after this list).
 Faster creation and maintenance: the creation of documents is easy and minimal
maintenance is required once we create the document.
 No foreign keys: There is no dynamic relationship between two documents so
documents can be independent of one another. So, there is no requirement for a foreign
key in a document database.
 Open formats: To build a document we use XML, JSON, and others.
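For illustration, here is what two documents in a hypothetical "customers" collection might look like, expressed as Python dictionaries (equivalent to JSON). Note the flexible schema: the second document has an extra field the first one lacks.

```python
import json

# Two documents in the same (hypothetical) "customers" collection.
customer_1 = {
    "_id": "c001",
    "name": "Asha",
    "email": "asha@example.com",
    "orders": [{"item": "laptop", "qty": 1}],
}
customer_2 = {
    "_id": "c002",
    "name": "Ravi",
    "email": "ravi@example.com",
    "loyalty_points": 120,          # extra field: no schema change needed
    "orders": [{"item": "phone", "qty": 2}, {"item": "case", "qty": 1}],
}

# Documents are stored and exchanged as JSON text.
print(json.dumps(customer_1, indent=2))
```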

2. Key-Value Stores:

A key-value store is a nonrelational database. The simplest form of a NoSQL database is a key-value store. Every data element in the database is stored in key-value pairs. The data can be retrieved by using a unique key allotted to each element in the database. The values can be simple data types like strings and numbers or complex objects.

A key-value store is like a relational database with only two columns: the key and the value (a small code sketch follows the feature list below).

Key features of the key-value store:

 Simplicity.
 Scalability.
 Speed.
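A minimal sketch of key-value access using the redis-py client (this assumes a Redis server is reachable on localhost, and the key names are made up for the example); conceptually a plain Python dictionary behaves the same way, with values looked up only by their key.

```python
import redis  # third-party client: pip install redis

# Assumes a Redis server is running on localhost:6379 (an assumption,
# not something set up by these notes).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Every element is addressed purely by its key.
r.set("user:1001:name", "Asha")
r.set("cart:1001", '{"items": ["laptop", "mouse"]}')   # values can be serialized objects

print(r.get("user:1001:name"))   # -> "Asha"
print(r.get("cart:1001"))        # -> the JSON string stored above
```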

3. Column-Oriented Databases (also called Column-Family or Wide-Column Data Stores)

A column-oriented database is a non-relational database that stores data in columns instead of rows. This means that when you want to run analytics on a small number of columns, you can read those columns directly without loading unwanted data into memory.

Columnar databases are designed to read data more efficiently and retrieve it with greater speed, and they are used to store large amounts of data.
Key features of a column-oriented database:

 Scalability.
 Compression.
 Very responsive.

4. Graph-Based databases:

Graph-based databases focus on the relationships between elements. They store data in the form of nodes, and the connections between the nodes are called links or relationships.

Key features of a graph database:


 In a graph-based database, it is easy to identify the relationship between the data by
using the links.
 Queries return results in real time.
 The speed depends upon the number of relationships among the database elements.
 Updating data is also easy, as adding a new node or edge to a graph database is a
straightforward task that does not require significant schema changes.
 Node —The entities themselves. In a social network this could be people.
 Edge —The relationship between two entities. This relationship is represented by a
line and has its own properties. An edge can have a direction, for example, if the
arrow indicates who is whose boss.
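As a small illustration, here is a toy graph of people and their relationships built with plain Python dictionaries; a real graph database such as Neo4j would store and query this natively, and the names and relationship types below are made up for the example.

```python
# Nodes: entities with properties (people in a small social network).
nodes = {
    "alice": {"label": "Person", "name": "Alice"},
    "bob":   {"label": "Person", "name": "Bob"},
    "carol": {"label": "Person", "name": "Carol"},
}

# Edges: directed relationships, each with its own type.
edges = [
    ("alice", "FRIENDS_WITH", "bob"),
    ("bob",   "FRIENDS_WITH", "carol"),
    ("carol", "MANAGES",      "alice"),   # direction matters: Carol is Alice's boss
]

def neighbors(person, rel_type):
    """Follow outgoing edges of a given type from one node."""
    return [dst for src, rel, dst in edges if src == person and rel == rel_type]

print(neighbors("bob", "FRIENDS_WITH"))   # -> ['carol']
print(neighbors("carol", "MANAGES"))      # -> ['alice']
```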
