
Big-Data Analytics

Lecture 5: Big Data Processing Concepts


SQL works on
A. Stored Data
B. Live Data
C. Both
D. None
Hive works on
A. Stored Data
B. Live Data
C. Both
D. None
Spark works on
A. Stored Data
B. Live Data
C. Both
D. None
Outline
• Parallel Data Processing
• Distributed Data Processing
• Hadoop
• Processing Workloads
• Cluster
• Processing in Batch Mode
• Processing in Realtime Mode
Parallel Data Processing
• Parallel data processing involves the simultaneous
execution of multiple sub-tasks that collectively comprise a
larger task.
• The goal is to reduce the execution time by dividing a single
larger task into multiple smaller tasks that run
concurrently.
• Although parallel data processing can be achieved through
multiple networked machines, it is more typically achieved
within the confines of a single machine with multiple
processors or cores, as shown in the following Figure.
• A task can be divided into three sub-tasks that are
executed in parallel on three different processors within
the same machine
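To make the divide-and-conquer idea concrete, the following minimal Python sketch splits one larger task (summing a large list) into three sub-tasks that run concurrently on separate processor cores of the same machine. The function and variable names are illustrative assumptions, not part of any framework mentioned in these slides.

```python
# A minimal sketch of parallel data processing on a single multi-core machine.
# The larger task (summing a list) is divided into three sub-tasks that run
# concurrently in separate processes; all names here are illustrative only.
from concurrent.futures import ProcessPoolExecutor

def sum_chunk(chunk):
    """Sub-task: sum one portion of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))          # the "larger task" input
    n = 3                                  # number of parallel sub-tasks
    size = len(data) // n
    chunks = [data[i * size:(i + 1) * size] for i in range(n - 1)]
    chunks.append(data[(n - 1) * size:])   # last chunk takes the remainder

    with ProcessPoolExecutor(max_workers=n) as pool:
        partial_sums = list(pool.map(sum_chunk, chunks))

    print(sum(partial_sums))               # combined result of all sub-tasks
```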
Distributed Data Processing
• Distributed data processing is closely related to parallel
data processing in that the same principle of “divide-and-
conquer” is applied.
• However, distributed data processing is always achieved
through physically separate machines that are networked
together as a cluster.
An example of distributed data
processing
Parallel data processing can be
achieved through
A. Multiple networked machines
B. A single machine with multiple cores
C. Both of them
D. None of them
Hadoop
• Hadoop is an open-source framework for large-scale data
storage and data processing that is compatible with
commodity hardware.
• The Hadoop framework has established itself as a de facto
industry platform for contemporary Big Data solutions.
• It can be used as an ETL engine or as an analytics
engine for processing large amounts of structured,
semi-structured and unstructured data.
• From an analysis perspective, Hadoop implements the
MapReduce processing framework.
Hadoop is a versatile framework that provides both
processing and storage
capabilities
Parallel data processing can be
achieved through
A. A single machine with multiple processors
B. A single machine with multiple cores
C. Both of them
D. None of them
Processing Workloads
• A processing workload in Big Data is defined as the
amount and nature of data that is processed within a
certain amount of time. Workloads are usually divided
into two types:
• batch
• transactional
Distributed data processing can be done in
A. Separate machines that are isolated from each other
B. Same machine with multiple cores in it
C. Separate machines that are connected together
D. Same machine with multiple processors in it
Batch
• Batch processing, also known as offline processing, involves
processing data in batches and usually imposes delays, which in
turn results in high-latency responses.
• Batch workloads typically involve large quantities of data with
sequential reads/writes and comprise groups of read or write
queries.
• Queries can be complex and involve multiple joins. OLAP
systems commonly process workloads in batches.
• Strategic BI and analytics are batch-oriented as they are
highly read-intensive tasks involving large volumes of
data.
A batch workload can include grouped read/writes
to INSERT, SELECT, UPDATE and DELETE
The Hadoop framework has established
itself as a
A. Pre facto industry platform for contemporary Big Data solutions
B. Post facto industry platform for contemporary Big Data solutions
C. de facto industry platform for contemporary Big Data solutions
D. All of them
Transactional
• Transactional processing is also known as online
processing. Transactional workload processing follows an
approach whereby data is processed interactively without
delay, resulting in low-latency responses.
• Transaction workloads involve small amounts of data
with random reads and writes.
• OLTP and operational systems, which are generally write-
intensive, fall within this category.
• Although these workloads contain a mix of read/write queries,
they are generally more write-intensive than read-intensive.
• Transactional workloads comprise random reads/writes that
involve fewer joins than business intelligence and reporting
workloads.
• Given their online nature and operational significance to the
enterprise, they require low-latency responses with a smaller
data footprint
Transactional workloads have few joins and lower
latency responses than
batch workloads
A processing workload in Big Data is
A. The amount of data processed within a certain time
B. The nature of data processed within a certain time
C. Both of them
D. None of them
Cluster
• In the same manner that clusters provide the necessary
support to create horizontally scalable storage solutions,
clusters also provide the mechanism to enable
distributed data processing with linear scalability.
• Since clusters are highly scalable, they provide an ideal
environment for Big Data processing as large datasets
can be divided into smaller datasets and then processed
in parallel in a distributed manner.
• An additional benefit of clusters is that they provide inherent
redundancy and fault tolerance, as they consist of physically
separate nodes.
• Redundancy and fault tolerance allow resilient processing
and analysis to occur if a network or node failure occurs.
• Due to fluctuations in the processing demands placed upon
a Big Data environment, leveraging cloud-hosted infrastructure
services or ready-made analytical environments as the
backbone of a cluster is sensible due to their elasticity and
pay-for-use model of utility-based computing.
A cluster can be utilized to support batch processing of
bulk data and
realtime processing of streaming data
Processing in Batch Mode
• In batch mode, data is processed offline in batches and
the response time can vary from minutes to hours. In
addition, data must be persisted to disk before it can be
processed.
• Batch mode generally involves processing a range of large
datasets, either on their own or joined together,
essentially addressing the volume and variety
characteristics of Big Data datasets.
• The majority of Big Data processing occurs in batch
mode. It is relatively simple, easy to set up and low in
cost compared to realtime mode.
• Strategic BI, predictive and prescriptive analytics and
ETL operations are commonly batch-oriented.
Batch Processing with MapReduce
• MapReduce is a widely used implementation of a batch
processing framework.
• It is highly scalable and reliable and is based on the principle of
divide-and-conquer, which provides built-in fault tolerance and
redundancy.
• It divides a big problem into a collection of smaller problems that
can each be solved quickly.
• MapReduce has roots in both distributed and parallel computing.
• MapReduce is a batch-oriented processing engine used to
process large datasets using parallel processing deployed over
clusters of commodity hardware.
• MapReduce does not require that the input data conform to any
particular data model.
• Therefore, it can be used to process schema-less datasets. A dataset is
broken down into multiple smaller parts, and operations are
performed on each part independently and in parallel.
• The results from all operations are then summarized to arrive at the
answer.
• Because of the coordination overhead involved in managing a job, the
MapReduce processing engine generally only supports batch
workloads as this work is not expected to have low latency.
• MapReduce is based on Google’s research paper on the subject,
published in the early 2000s.
• The MapReduce processing engine works differently
compared to the traditional data processing paradigm.
• Traditionally, data processing requires moving data from
the storage node to the processing node that runs the
data processing algorithm.
• This approach works fine for smaller datasets; however,
with large datasets, moving data can incur more overhead
than the actual processing of the data.
• With MapReduce, the data processing algorithm is
instead moved to the nodes that store the data.
• The data processing algorithm executes in parallel on
these nodes, thereby eliminating the need to move the
data first.
• This not only saves network bandwidth but it also
results in a large reduction in processing time for large
datasets, since processing smaller chunks of data in
parallel is much faster.
Map and Reduce Tasks
• A single processing run of the MapReduce processing engine is known
as a MapReduce job.
• Each MapReduce job is composed of a map task and a reduce task and
each task consists of multiple stages.
• Map tasks
• map
• combine (optional)
• partition
• Reduce tasks
• shuffle and sort
• reduce
An illustration of a MapReduce job
with the map stage highlighted
Map
• The first stage of MapReduce is known as map, during
which the dataset file is divided into multiple smaller splits.
• Each split is parsed into its constituent records as key-
value pairs. The key is usually the ordinal position of the
record, and the value is the actual record.
• The parsed key-value pairs for each split are then sent to a
map function or mapper, with one mapper function per
split. The map function executes user-defined logic.
• Each split generally contains multiple key-value pairs, and
the mapper is run once for each key-value pair in the split.
• The mapper processes each key-value pair as per the
user-defined logic and further generates a key-value pair
as its output.
• The output key can either be the same as the input key
or a substring value from the input value, or another
serializable user-defined object.
• Similarly, the output value can either be the same as the
input value or a substring value from the input value, or
another serializable user-defined object.
• When all records of the split have been processed, the
output is a list of key-value pairs where multiple key-
value pairs can exist for the same key.
• It should be noted that for an input key-value pair, a
mapper may not produce any output key-value pair
(filtering) or can generate multiple key-value pairs
(demultiplexing).
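To make the map stage concrete, here is a minimal, framework-free Python sketch of a mapper run over one split. It assumes each record is a line of text such as "toy,5" (a hypothetical sales format, not defined in these slides), keyed by its ordinal position in the split; it is a conceptual sketch, not Hadoop API code.

```python
# A minimal sketch of the map stage (conceptual, not Hadoop API code).
# Input: (key, value) pairs where the key is the record's ordinal position
# in the split and the value is the raw record, e.g. "toy,5".
# Output: zero or more (product, quantity) pairs per input record.
def mapper(key, value):
    parts = value.split(",")
    if len(parts) != 2:           # filtering: malformed records emit nothing
        return []
    product, quantity = parts[0].strip(), int(parts[1])
    return [(product, quantity)]  # user-defined output key-value pair

split = ["toy,5", "pen,3", "toy,2"]          # one split of sales records
map_output = []
for position, record in enumerate(split):    # mapper runs once per pair
    map_output.extend(mapper(position, record))
print(map_output)                            # [('toy', 5), ('pen', 3), ('toy', 2)]
```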
Combine
• Generally, the output of the map function is handled
directly by the reduce function.
• However, map tasks and reduce tasks are mostly run
over different nodes. This requires moving data between
mappers and reducers. This data movement can consume
a lot of valuable bandwidth and directly contributes to
processing latency.
• With larger datasets, the time taken to move the data
between map and reduce stages can exceed the actual
processing undertaken by the map and reduce tasks.
• For this reason, the MapReduce engine provides an
optional combine function (combiner) that summarizes a
mapper’s output before it gets processed by the reducer.
The combine stage groups the
output from the map stage
• A combiner is essentially a reducer function that locally
groups a mapper’s output on the same node as the mapper.
• A reducer function can be used as a combiner function, or a
custom user-defined function can be used.
• The MapReduce engine combines all values for a given key
from the mapper output, creating multiple key-value pairs as
input to the combiner where the key is not repeated and the
value exists as a list of all corresponding values for that key.
• The combiner stage is only an optimization stage, and may
therefore not even be called by the MapReduce engine.
• For example, a combiner function will work for finding
the largest or the smallest number, but will not work for
finding the average of all numbers since it only works
with a subset of the data.
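The following sketch, assuming the same hypothetical (product, quantity) pairs as before, shows how a combiner could locally sum a single mapper's output on the mapper's node before it is sent to a reducer. It is a conceptual illustration of the stage described above, not the Hadoop combiner API.

```python
# A minimal sketch of the optional combine stage (conceptual, not Hadoop API).
# The engine first groups a single mapper's output by key, then hands each
# key and its list of values to the combiner, which summarizes them locally.
from collections import defaultdict

def combiner(key, values):
    return (key, sum(values))     # works for sums/min/max, not for averages

map_output = [("toy", 5), ("pen", 3), ("toy", 2)]   # one mapper's output

grouped = defaultdict(list)
for key, value in map_output:
    grouped[key].append(value)    # key -> list of all its values

combined = [combiner(k, v) for k, v in grouped.items()]
print(combined)                   # [('toy', 7), ('pen', 3)]
```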
Partition
• During the partition stage, if more than one reducer is
involved, a partitioner divides the output from the mapper
or combiner (if specified and called by the MapReduce
engine) into partitions between reducer instances. The
number of partitions will equal the number of reducers.
The partition stage assigns output
from the map task to reducers
• Although each partition contains multiple key-value pairs, all records for
a particular key are assigned to the same partition.
• The MapReduce engine guarantees a random and fair distribution
between reducers while making sure that all of the same keys across
multiple mappers end up with the same reducer instance.
• Depending on the nature of the job, certain reducers can sometimes
receive a large number of key-value pairs compared to others. As a result
of this uneven workload, some reducers will finish earlier than others.
• Overall, this is less efficient and leads to longer job execution times than
if the work was evenly split across reducers. This can be rectified by
customizing the partitioning logic in order to guarantee a fair
distribution of key-value pairs.
• The partition function is the last stage of the map task. It
returns the index of the reducer to which a particular
partition should be sent.
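As a sketch of the partition stage, the snippet below hashes each key modulo the number of reducers, which keeps all pairs for a given key in the same partition and returns the index of the reducer that partition should be sent to. A hash-based rule is a common default; any user-defined function returning a reducer index would fit the description above. The data and reducer count are assumed for illustration.

```python
# A minimal sketch of the partition stage (conceptual, not Hadoop API).
# The partitioner returns the index of the reducer a key-value pair goes to;
# hashing the key keeps all pairs with the same key in the same partition.
import zlib

NUM_REDUCERS = 2

def partitioner(key):
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

combined = [("toy", 7), ("pen", 3), ("ink", 4)]     # mapper/combiner output

partitions = [[] for _ in range(NUM_REDUCERS)]      # one list per reducer
for key, value in combined:
    partitions[partitioner(key)].append((key, value))

for index, partition in enumerate(partitions):
    print(f"reducer {index}: {partition}")
```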
Shuffle and Sort
• During the first stage of the reduce task, output from all
partitioners is copied across the network to the nodes
running the reduce task. This is known as shuffling. The
list-based key-value output from each partitioner can
contain the same key multiple times.
• Next, the MapReduce engine automatically groups and
sorts the key-value pairs according to the keys so that the
output contains a sorted list of all input keys and their
values with the same keys appearing together. The way in
which keys are grouped and sorted can be customized.
• This merge creates a single key-value pair per group,
where the key is the group key and the value is the list of all
group values.
During the shuffle and sort stage, data is copied across the network
to the
reducer nodes and sorted by key
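A minimal sketch of what one reducer node does after shuffling: it merges the partitions it received from every mapper node, sorts the pairs by key, and groups the values for each key into a list. The node names and data are assumptions for illustration; this is conceptual, not the Hadoop implementation.

```python
# A minimal sketch of the shuffle and sort stage (conceptual, not Hadoop code).
# Each reducer node collects its partition from every mapper node (shuffle),
# then sorts and groups the pairs by key so each key appears exactly once.
from itertools import groupby
from operator import itemgetter

# Partitions copied over the network from two different mapper nodes.
from_node_a = [("toy", 7), ("pen", 3)]
from_node_b = [("toy", 4), ("ink", 2)]

shuffled = from_node_a + from_node_b          # shuffle: gather all pairs
shuffled.sort(key=itemgetter(0))              # sort by key

# Group: one (key, list-of-values) pair per key, same keys appearing together.
reduce_input = [(key, [v for _, v in group])
                for key, group in groupby(shuffled, key=itemgetter(0))]
print(reduce_input)       # [('ink', [2]), ('pen', [3]), ('toy', [7, 4])]
```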
Reduce
• Reduce is the final stage of the reduce task. Depending
on the user-defined logic specified in the reduce function
(reducer), the reducer will either further summarize its
input or will emit the output without making any
changes.
• In either case, for each key-value pair that a reducer
receives, the list of values stored in the value part of the
pair is processed and another key-value pair is written
out.
• The output key can either be the same as the input key or a
substring value from the input value, or another serializable
user-defined object.
• The output value can either be the same as the input value or
a substring value from the input value, or another serializable
user defined object.
• Note that just like the mapper, for the input key-value pair, a
reducer may not produce any output key-value pair (filtering)
or can generate multiple key-value pairs (demultiplexing).
• The output of the reducer, that is the key-value pairs, is then
written out as a separate file —one file per reducer.
The reduce stage is the last stage
of the reduce task
• The number of reducers can be customized. It is also
possible to have a MapReduce job without a reducer, for
example when performing filtering.
• Note that the output signature (key-value types) of the
map function should match the input signature
(key-value types) of the reduce/combine function.
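The reducer sketch below sums the grouped values for each key and emits one output pair per key, matching the reduce stage described above. As with the previous sketches, the names and data are illustrative assumptions rather than Hadoop API code.

```python
# A minimal sketch of the reduce stage (conceptual, not Hadoop API).
# The reducer receives one (key, list-of-values) pair per key and either
# summarizes the values or emits them unchanged, per user-defined logic.
def reducer(key, values):
    return (key, sum(values))      # summarize: total quantity per product

reduce_input = [("ink", [2]), ("pen", [3]), ("toy", [7, 4])]

# Output of this reducer; with several reducers, each writes its own file.
output = [reducer(key, values) for key, values in reduce_input]
print(output)                      # [('ink', 2), ('pen', 3), ('toy', 11)]
```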
A Simple MapReduce Example
• The MapReduce steps are as follows:
1. The input (sales.txt) is divided into two splits.
2. Two map tasks running on two different nodes, Node A
and Node B, extract product and quantity from the
respective split’s records in parallel. The output from each
map function is a key-value pair where product is the key
while quantity is the value.
3. The combiner then performs local summation of product
quantities.
4. As there is only one reduce task, no partitioning is
performed.
5. The output from the two map tasks is then copied to a
third node, Node C, that runs the shuffle stage as part of
the reduce task.
6. The sort stage then groups all quantities of the same
product together as a list.
7. Like the combiner, the reduce function then sums up
the quantities of each unique product in order to create the
output.
An example of MapReduce in
action
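To tie the seven steps together, here is a single, self-contained Python sketch that simulates the whole sales.txt job: two splits, map and local combine on "Node A" and "Node B", and a single reduce task on "Node C" that shuffles, groups and sums. The record format and product quantities are assumed for illustration and are not taken from the original figure.

```python
# A self-contained sketch of the sales.txt MapReduce job described above:
# two splits, map and combine on Node A and Node B, then one reduce task
# (Node C) that shuffles, sorts and sums. Conceptual only, not Hadoop code.
from collections import defaultdict

def mapper(record):
    product, quantity = record.split(",")
    return (product.strip(), int(quantity))

def combine_or_reduce(pairs):
    """Sum quantities per product; used here as both combiner and reducer."""
    totals = defaultdict(int)
    for product, quantity in pairs:
        totals[product] += quantity
    return sorted(totals.items())

# Step 1: the input is divided into two splits (hypothetical sales records).
split_a = ["toy,5", "pen,3", "toy,2"]    # processed on Node A
split_b = ["toy,4", "ink,2", "pen,1"]    # processed on Node B

# Steps 2-3: map, then locally combine, on each node (in parallel in a cluster).
node_a_output = combine_or_reduce(mapper(r) for r in split_a)
node_b_output = combine_or_reduce(mapper(r) for r in split_b)

# Step 4: only one reduce task, so no partitioning is performed.
# Steps 5-6: shuffle both outputs to Node C, then group the quantities by product.
# Step 7: the reducer sums the quantities of each unique product.
result = combine_or_reduce(node_a_output + node_b_output)
print(result)   # [('ink', 2), ('pen', 4), ('toy', 11)]
```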
