UNIT-III
NoSQL movement for handling Big Data: Distributing data storage and processing with the Hadoop framework, case study on risk assessment for loan sanctioning, ACID principle of relational databases, CAP theorem, BASE principle of NoSQL databases, types of NoSQL databases, case study on disease diagnosis and profiling using the MapReduce programming model.
Distributing Data Storage and Processing with the Hadoop Framework
Here's an overview of how Hadoop distributes data storage and processing:
1. HDFS (Hadoop Distributed File System):
Theory: HDFS is a distributed storage system designed for large-scale data processing. It
divides files into blocks (default size 128 MB or 256 MB) and distributes them across
multiple nodes in a cluster. Each block is replicated (typically three times) to ensure data
redundancy and fault tolerance. The architecture is built to handle node failures by
replicating data across different nodes.
Example: Suppose File1.txt, a 500 MB file, is stored in HDFS with the default 128 MB block size. HDFS would divide it into 4 blocks: Block 1, Block 2, and Block 3 of 128 MB each, and Block 4 of 116 MB.
Each block would be replicated three times and stored on different nodes for fault tolerance.
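To make the block arithmetic concrete, here is a small Python sketch (purely illustrative, not part of Hadoop) that computes how a file would be split with the default 128 MB block size:

import math

def hdfs_block_sizes(file_size_mb, block_size_mb=128):
    # number of blocks needed for the file
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # every block is full-sized except (possibly) the last one
    sizes = [block_size_mb] * (num_blocks - 1)
    sizes.append(file_size_mb - block_size_mb * (num_blocks - 1))
    return sizes

print(hdfs_block_sizes(500))   # [128, 128, 128, 116] -> 4 blocks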
2. MapReduce:
Distributed Processing: Hadoop processes data using the MapReduce model, which
breaks the task into two phases:
1. Map Phase: Processes data in parallel across multiple nodes by splitting the
dataset into smaller chunks, known as splits.
Each file is divided into smaller chunks, and the mapper function processes each
chunk to produce key-value pairs like (word, 1) for each word.
Example: The word "hadoop" appears in File1 and File2. After mapping, you would
have key-value pairs like:
(hadoop, 1)
(hadoop, 1)
2. Reduce Phase: After the Map phase, the Reduce phase aggregates the processed
data. The reducer takes these key-value pairs and sums the counts of each word:
Example: For the word "hadoop", the output would be:
(hadoop, 2)
JobTracker and TaskTracker: In older versions of Hadoop, the JobTracker assigns tasks to
nodes (TaskTrackers) and monitors their execution. In newer versions (YARN),
ResourceManager and NodeManager handle resource management.
Example: Word count in a large document collection. Mappers count word occurrences in
each chunk of data, and reducers combine the results to get the total count of each word.
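To show what such a mapper and reducer might look like as code, here is a minimal Python sketch in the style of Hadoop Streaming, where the mapper reads raw text from standard input and the reducer receives the key-value pairs already sorted by key. The file names mapper.py and reducer.py are only illustrative:

mapper.py:
import sys
for line in sys.stdin:
    for word in line.strip().split():
        # emit one (word, 1) pair per word, tab-separated
        print(word + "\t1")

reducer.py:
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            # emit the total for the previous word
            print(current_word + "\t" + str(count))
        current_word, count = word, int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))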
YARN (Yet Another Resource Negotiator)
Purpose: A resource management system that allocates CPU, memory, and other resources
in a Hadoop cluster. YARN decouples resource management and scheduling from
MapReduce, enabling multiple types of distributed applications to run in parallel.
Theory: YARN is Hadoop's resource management layer; it allocates resources, schedules tasks, and decouples resource management and job scheduling from the MapReduce process. The system consists of:
ResourceManager: Global resource allocator for the entire cluster.
NodeManager: Manages resources on individual nodes.
ApplicationMaster: Oversees the execution of a specific job.
Diagram:
ResourceManager <----> ApplicationMaster <----> NodeManager (on each node)
        |
        +--> Allocates resources for jobs
Example: Suppose you submit a job to process a large dataset. The ResourceManager will
assign resources (CPU, memory) to the job, while the ApplicationMaster will monitor the
job's progress. NodeManagers on individual nodes will manage task execution on their
nodes.
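To make the division of responsibilities concrete, here is a toy Python simulation of the three roles. It is purely illustrative and is not the real YARN API:

class NodeManager:
    def __init__(self, name, free_memory_gb):
        self.name, self.free_memory_gb = name, free_memory_gb
    def run_task(self, task):
        print(self.name, "running", task)

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes
    def allocate(self, memory_gb):
        # give the task to the first node with enough free memory
        for node in self.nodes:
            if node.free_memory_gb >= memory_gb:
                node.free_memory_gb -= memory_gb
                return node
        raise RuntimeError("no node has enough free memory")

class ApplicationMaster:
    def __init__(self, resource_manager):
        self.resource_manager = resource_manager
    def run_job(self, tasks, memory_per_task_gb):
        for task in tasks:
            node = self.resource_manager.allocate(memory_per_task_gb)  # ask for resources
            node.run_task(task)                                        # NodeManager executes the task

rm = ResourceManager([NodeManager("node1", 8), NodeManager("node2", 8)])
ApplicationMaster(rm).run_job(["map-1", "map-2", "reduce-1"], memory_per_task_gb=4)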
SIMPLE EXAMPLE 1: How Hadoop works, particularly the MapReduce process. We'll use a real-life scenario to make it easier to understand.
Example Scenario: Counting Fruits in a Grocery Store
Imagine you own a large grocery store, and you have a list of all the fruits customers have
bought. The list is too big for a single person to count, so you want to split the work between
many workers.
Your list looks like this:
- List 1: "apple, banana, apple, orange"
- List 2: "banana, apple, apple, banana, orange, banana"
You want to find out how many times each type of fruit has been bought (e.g., how many
apples, bananas, and oranges). This is where Hadoop comes in!
Step-by-Step Breakdown
Step 1: Input Data: You have two lists of fruits that represent what customers have bought:
- List 1: "apple, banana, apple, orange"
- List 2: "banana, apple, apple, banana, orange, banana"
These lists are like files that are stored in Hadoop’s file system (HDFS). In Hadoop, big files
are split into smaller parts so that many computers can work on them at the same time.
Step 2: Map Phase: Hadoop splits the lists, and each worker (mapper) counts the fruits in
their assigned list.
Mapper 1 (for List 1) sees: - "apple, banana, apple, orange"
Output: (apple, 1), (banana, 1), (apple, 1), (orange, 1)
Mapper 2 (for List 2) sees: - "banana, apple, apple, banana, orange, banana"
Output: (banana, 1), (apple, 1), (apple, 1), (banana, 1), (orange, 1),
(banana, 1)
So, each worker (mapper) is just counting the number of each fruit it sees, and it outputs a
key-value pair where the key is the fruit and the value is `1`.
Step 3: Shuffle and Sort: Hadoop takes all the results from the mappers and groups them by
the fruit type (key). So all the counts for "apple" are put together, all the counts for "banana"
are put together, and so on.
This is what the grouping looks like:
- (apple, [1, 1, 1, 1]): Four "1"s for apple (two from Mapper 1, two from Mapper 2)
- (banana, [1, 1, 1, 1]): Four "1"s for banana (one from Mapper 1, three from Mapper 2)
- (orange, [1, 1]): Two "1"s for orange (one from each list)
Step 4: Reduce Phase: The reducer adds up the counts for each fruit. It sums the values in
the lists to get the final count for each fruit.
Reduce Output:
- (apple, 4): There are 4 apples in total.
- (banana, 4): There are 4 bananas in total.
- (orange, 2): There are 2 oranges in total.
Step 5: Final Output: The final result tells you how many of each fruit were bought:
apple: 4
banana: 4
orange: 2
Explanation:
Input: Lists of fruits (representing big data files in Hadoop).
Map Phase: Each list is processed separately by different workers (mappers) who count the
fruits.
Shuffle and Sort: Hadoop groups the same fruits together.
Reduce Phase: The reducer sums up the counts and gives the total number of each fruit.
This example helps you understand how Hadoop processes large datasets by breaking the
work into smaller, manageable pieces.
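The same five steps can be written in a few lines of plain Python (illustrative only; this is not Hadoop code):

from collections import defaultdict

lists = [
    "apple, banana, apple, orange",                   # List 1
    "banana, apple, apple, banana, orange, banana",   # List 2
]

# Map phase: each "mapper" emits a (fruit, 1) pair for every fruit in its list
mapped = []
for text in lists:
    for fruit in text.split(", "):
        mapped.append((fruit, 1))

# Shuffle and sort: group all the 1s by fruit
grouped = defaultdict(list)
for fruit, one in mapped:
    grouped[fruit].append(one)

# Reduce phase: sum the counts for each fruit
totals = {fruit: sum(ones) for fruit, ones in grouped.items()}
print(totals)   # {'apple': 4, 'banana': 4, 'orange': 2}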
SIMPLE EXAMPLE 2
Let’s see how MapReduce would work on a small fictitious example. You’re the director of a
toy company. Every toy has two colors, and when a client orders a toy from the web page,
the web page puts an order file on Hadoop with the colors of the toy. Your task is to find out
how many color units you need to prepare. You’ll use a MapReduce-style algorithm to count
the colors. First let’s look at a simplified version in figure .
Figure: A simplified example of a MapReduce flow for counting the colors in input texts
As the name suggests, the process roughly boils down to two big phases:
■ Mapping phase—The documents are split up into key-value pairs. Until we reduce, we can
have many duplicates.
■ Reduce phase—It’s not unlike a SQL “group by.” The different unique occurrences are
grouped together, and depending on the reducing function, a different result can be created.
Here we wanted a count per color, so that’s what the reduce function returns. In reality it’s a bit more complicated than this, though.
Figure: An example of a MapReduce flow for counting the colors in input texts
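One of the extra steps in the fuller flow is local aggregation: a mapper can combine its own key-value pairs before they are shuffled, so less data has to travel between nodes. A small Python sketch of that idea (the colors below are made-up examples, not taken from the figure):

from collections import Counter

orders = [
    ["green", "blue"],                  # order file 1
    ["blue", "red", "green", "blue"],   # order file 2
]

# Map + combine: each mapper counts the colors in its own file first
partial_counts = [Counter(order) for order in orders]

# Reduce: merge the partial counts into the final totals per color
totals = Counter()
for partial in partial_counts:
    totals.update(partial)
print(totals)   # e.g. Counter({'blue': 3, 'green': 2, 'red': 1})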
Introduction to NoSQL
NoSQL databases are designed to manage large-scale data across distributed systems and
provide more flexible ways to model data, depending on the use case. Unlike relational
databases that strictly follow predefined structures, NoSQL databases allow the data model to
fit the needs of the application.
In order to understand NoSQL, we first need to explore the core principles of relational
databases, known as ACID, and how NoSQL rewrites these principles into BASE to better
suit distributed environments. Additionally, we’ll look at the CAP theorem, which explains
the challenges of distributing databases across multiple nodes and how ACID and BASE
handle these challenges differently.
CAP Theorem: A distributed system has three desirable properties: Consistency (every read sees the most recent write), Availability (every request receives a response), and Partition tolerance (the system keeps working even when the network between nodes fails). According to the CAP Theorem, a distributed system can achieve at most two of these three properties at the same time. This leads to different types of NoSQL databases being classified based on their design choices:
For example, databases that favor consistency and partition tolerance are called CP systems, while databases that favor availability and partition tolerance are called AP systems.
BASE principle of NoSQL databases: instead of ACID, NoSQL databases follow BASE, which stands for Basically Available, Soft state, and Eventual consistency.
1. Basically Available:
o The system guarantees that every request receives a response, even when some nodes fail, although the response may be based on slightly stale data. Availability is achieved by sharding the data and keeping replicas of each shard on different nodes, as the figure below shows.
Figure: Sharding: each shard can function as a self-sufficient database, but the shards also work together as a whole. The example represents two nodes, each containing four shards: two main shards and two replicas. If one node fails, the other node holds copies of its shards.
2. Soft State:
o The state of the database may change over time, even without new input, due to the
eventual consistency model. This means the system doesn't guarantee immediate
consistency after every transaction.
o Example: Data in one node might say "X" and another node might say "Y"
temporarily, but this will be resolved later when the nodes synchronize their data.
3. Eventual Consistency:
o The database will become consistent over time, but it might allow for temporary
inconsistencies. Eventually, after all updates are synchronized, every node will hold
the same data.
o Example: If two customers purchase the last item in stock at the same time, the
database may show inconsistent results for a short period, but it will eventually settle on a single, consistent state once the nodes synchronize.
Types of NoSQL Databases
There are four main types: document-based databases, key-value stores, column-oriented databases, and graph-based databases.
1. Document-Based Database:
A document-based database is a nonrelational database. Instead of storing data in rows and columns (tables), it stores data as documents, typically in JSON, BSON, or XML format.
Documents can be stored and retrieved in a form that is much closer to the data objects used in applications, which means less translation is required to use the data in an application. Individual fields in a document can be indexed so that they can be queried quickly.
Collections are groups of documents with similar contents. The documents in a collection are not required to share an identical schema, because document databases have a flexible schema.
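For illustration, a single customer order might be stored as one document. The sketch below uses a Python dict that mirrors a JSON document; all field names and values are made up for the example:

# One "document" describing a customer order, written as a Python dict that mirrors JSON
order_document = {
    "order_id": 1001,
    "customer": {"name": "Asha", "city": "Guntur"},
    "items": [
        {"fruit": "apple", "quantity": 4},
        {"fruit": "banana", "quantity": 2},
    ],
    "paid": True,
}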
2. Key-Value Stores:
A key-value store is like a relational database with only two columns: the key and the value. Key features of key-value stores:
Simplicity.
Scalability.
Speed.
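As a rough illustration of the key-value model (plain Python, not a real key-value database), the data boils down to keys mapped to values; the keys and values below are made up:

# A key-value store exposes little more than put(key, value) and get(key)
store = {}
store["user:42:name"] = "Asha"                  # keys are often namespaced strings
store["user:42:cart"] = ["apple", "banana"]     # the value can be any blob the application understands
print(store["user:42:name"])                    # -> Asha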
3. Column-Oriented Databases (also called Column Family or Wide Column Data Stores):
Columnar databases store data column by column rather than row by row, which lets them read and retrieve data more efficiently; they are designed to store very large amounts of data. Key features of column-oriented databases:
Scalability.
Compression.
Very responsive.
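A tiny sketch of the layout difference (plain Python, with a made-up sales table) shows why reading a single column is cheap in a columnar database:

# Row-oriented layout: each record is stored together
rows = [
    {"date": "2024-01-01", "fruit": "apple", "sales": 40},
    {"date": "2024-01-01", "fruit": "banana", "sales": 25},
]
# Column-oriented layout: each column is stored together,
# so a query such as "total sales" only has to read one column
columns = {
    "date": ["2024-01-01", "2024-01-01"],
    "fruit": ["apple", "banana"],
    "sales": [40, 25],
}
print(sum(columns["sales"]))   # -> 65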
4. Graph-Based databases:
Graph-based databases focus on the relationships between data elements. Data is stored as nodes, and the connections between the nodes are called links or relationships (edges).
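As a rough illustration (plain Python, not a real graph database), nodes and relationships can be modelled as follows; the people and relationship names are invented for the example:

# Nodes with properties, plus edges describing how the nodes are related
nodes = {
    "alice": {"type": "Person", "name": "Alice"},
    "bob": {"type": "Person", "name": "Bob"},
    "report": {"type": "Document", "title": "Loan report"},
}
edges = [
    ("alice", "FRIENDS_WITH", "bob"),
    ("alice", "AUTHORED", "report"),
]
# A simple query: what is Alice directly connected to, and how?
for source, relation, target in edges:
    if source == "alice":
        print(relation, "->", nodes[target].get("name") or nodes[target].get("title"))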