Distributed Data Processing
Abstract— Modern computers can process vast amounts of data representing many
aspects of the world. From these big data sets, we can learn about human behavior in
unprecedented ways: how language is used, what photos are taken, what topics are
discussed, and how people engage with their surroundings. To process large data sets
efficiently, programs are organized into pipelines of manipulations on sequential streams of
data. In this article, we consider a suite of techniques to process and manipulate sequential
data streams efficiently. This write-up is an in-depth look at distributed data processing. It
covers the frequently asked questions about it: What is it? How does it differ from
centralized data processing? What are its pros & cons? What are the various approaches &
architectures involved in distributed data processing? And what are the popular
technologies & frameworks used in the industry for processing massive amounts of data
across several nodes running in a cluster?
Data Processing Layer
This layer contains the logic that is the real deal: it is responsible for processing the data. The layer runs business logic on the data to extract meaningful information from it. Machine learning and predictive, descriptive, and decision modelling are primarily used for this.

Data Visualization Layer
All the information extracted is sent to the data visualization layer, which typically contains browser-based dashboards that display the information in the form of graphs, charts, infographics, etc. Kibana is one good example of a data visualization tool that is pretty popular in the industry.

TYPES OF DISTRIBUTED DATA PROCESSING
There are primarily two types: batch processing & real-time streaming data processing.

Batch Processing
Batch processing is the traditional data processing technique where chunks of data are streamed in batches & processed. The processing is either scheduled for a certain time of day or happens at regular intervals or at random, but it is not real-time.

Real-time Streaming Data Processing
In this type of data processing, data is processed in real-time as it streams in. Analytics is run on the data to get insights from it. A good use case is getting insights from sports data: as the game goes on, the data ingested from social media & other sources is analyzed in real-time to figure out viewers' sentiments, player stats, and predictions.

TECHNOLOGIES INVOLVED IN DISTRIBUTED DATA PROCESSING

MAPREDUCE
A map function is applied to each input; it outputs zero or more intermediate key-value pairs of an arbitrary type. All intermediate key-value pairs are grouped by key, so that pairs with the same key can be reduced together. A reduce function combines the values for a given key k; it outputs zero or more values, which are each associated with k in the final output.
To perform this computation, the MapReduce framework creates tasks (perhaps on different machines) that perform various roles in the computation. A map task applies the map function to some subset of the input data and outputs intermediate key-value pairs. A reduce task sorts and groups key-value pairs by key, then applies the reduce function to the values for each key. All communication between map and reduce tasks is handled by the framework, as is the task of grouping intermediate key-value pairs by key.
In order to utilize multiple machines in a MapReduce application, multiple mappers run in parallel in a map phase, and multiple reducers run in parallel in a reduce phase. In between these phases, the sort phase groups together key-value pairs by sorting them, so that all key-value pairs with the same key are adjacent.
Consider the problem of counting the vowels in a corpus of text. We can solve this problem using the MapReduce framework with an appropriate choice of map and reduce functions. The map function takes as input a line of text and outputs key-value pairs in which the key is a vowel and the value is a count. Zero counts are omitted from the output:

def count_vowels(line):
    """A map function that counts the vowels in a line."""
    for vowel in 'aeiou':
        count = line.count(vowel)
        if count > 0:
            emit(vowel, count)
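The snippet above assumes the framework supplies emit. A matching reduce function simply sums the counts for each vowel. Below is a minimal, framework-free sketch in which the local_mapreduce driver is a single-machine stand-in of our own (not part of any real MapReduce API) that simulates the map, sort, and reduce phases:

from collections import defaultdict

def count_vowels(line, emit):
    """Map: emit a (vowel, count) pair for each vowel in a line."""
    for vowel in 'aeiou':
        count = line.count(vowel)
        if count > 0:
            emit(vowel, count)

def sum_counts(key, values):
    """Reduce: combine all counts emitted for a given vowel."""
    return key, sum(values)

def local_mapreduce(lines):
    """Single-machine stand-in for the map, sort, and reduce phases."""
    groups = defaultdict(list)
    for line in lines:                                   # map phase
        count_vowels(line, lambda k, v: groups[k].append(v))
    return dict(sum_counts(k, v) for k, v in sorted(groups.items()))  # sort + reduce

print(local_mapreduce(["hello world", "mapreduce example"]))
# {'a': 2, 'e': 5, 'o': 2, 'u': 1}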
APACHE SPARK
In Apache Spark, the computation model is much richer than just MapReduce. Transformations on input data can be written lazily and batched together. Intermediate results can be cached and reused in future calculations. A series of lazy transformations is followed by actions that force evaluation of all the transformations.
Notably, each step in the Spark model produces a resilient distributed dataset (RDD). Intermediate results can be cached in memory or on disk, optionally serialized. For each RDD, we keep a lineage: the operations that created it. Spark can then recompute any data that is lost without storing it to disk. We can still decide to keep a copy in memory or on disk for performance.
The process of implementing k-means clustering in Spark is this (see the sketch at the end of this section):
1. Load the data and "persist" it.
2. Randomly sample the initial cluster centers.
3. Broadcast the initial centers.
4. Loop over the data and find the new centroids.
5. Repeat from step 3 until the algorithm converges.
Unlike MapReduce, we can keep the input in memory and load it once. The broadcasting in step 3 means that we can quickly send the centers to all machines, and cluster assignments do not need to be written to disk every time.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
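Here is a minimal sketch of those five steps using the RDD API. It assumes a local PySpark installation, NumPy, and a hypothetical whitespace-separated points.txt; it is an illustration of the steps above, not Spark's built-in k-means (MLlib ships a production implementation):

import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[*]", "kmeans-sketch")

# Step 1: load the data and persist it in memory.
points = (sc.textFile("points.txt")
            .map(lambda line: np.array([float(x) for x in line.split()]))
            .cache())

# Step 2: randomly sample the initial centers.
k = 3
centers = points.takeSample(False, k, seed=42)

for _ in range(20):                        # Step 5: repeat until convergence
    # Step 3: broadcast the current centers to every executor.
    b_centers = sc.broadcast(centers)

    def closest(p):
        """Index of the center nearest to point p."""
        return min(range(k), key=lambda i: np.sum((p - b_centers.value[i]) ** 2))

    # Step 4: assign each point to its closest center, then average each cluster.
    sums = (points.map(lambda p: (closest(p), (p, 1)))
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                  .collectAsMap())
    new_centers = [sums[i][0] / sums[i][1] if i in sums else centers[i]
                   for i in range(k)]

    shift = sum(np.sum((c - n) ** 2) for c, n in zip(centers, new_centers))
    centers = new_centers
    if shift < 1e-6:
        break

print(centers)

Because points is cached, each pass over the data reads from memory rather than reloading the input, which is exactly the advantage over MapReduce noted above.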
There are several useful things to note about this architecture:
1. Each application gets its own executor processes, so data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime. As such, the driver program must be network addressable from the worker nodes.
4. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
The system currently supports several cluster managers: Spark's own standalone manager, Mesos and YARN (see the sketch after the Flink overview below).

Figure 3. Example of Hive input

APACHE FLINK
Apache Flink is another system designed for distributed analytics, like Apache Spark. It executes everything as a stream. Iterative computations can be written natively, with cycles in the data flow. It has a very similar architecture to Spark, including (1) a client that optimizes and constructs the data flow graph, (2) a job manager that receives jobs, and (3) a task manager that executes jobs.
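Returning to Spark's cluster managers for a moment: the choice of manager is visible in application code only through the master URL, while the rest of the program stays the same. A minimal sketch (the host names are placeholders, not a real cluster):

from pyspark import SparkConf, SparkContext

# The master URL selects the cluster manager; the application code itself
# does not change. Host names below are placeholders.
conf = SparkConf().setAppName("cluster-manager-demo").setMaster("local[*]")
# Alternatives:
#   .setMaster("spark://host:7077")   Spark's own standalone manager
#   .setMaster("mesos://host:5050")   Apache Mesos
#   .setMaster("yarn")                Hadoop YARN (reads HADOOP_CONF_DIR)

sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())   # 4950, computed by the executors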
APACHE PIG
Apache Pig provides a high-level language, Pig Latin, for expressing data analysis pipelines that are compiled into MapReduce jobs. The classic example below finds the five most-visited pages among users aged 18 to 25:

Users = LOAD 'users.txt' AS (user:chararray, age:int);
Pages = LOAD 'pages.txt' AS (user:chararray, url:chararray);
Filtered = FILTER Users BY age >= 18 AND age <= 25;
Joined = JOIN Filtered BY user, Pages BY user;
Grouped = GROUP Joined BY url;
Summed = FOREACH Grouped GENERATE group AS url, COUNT(Joined) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Limit = LIMIT Sorted 5;
STORE Limit INTO 'top5pages';

Apache Pig is installed locally and can send jobs to any Hadoop cluster. It is slower than Spark, but doesn't require any software on the Hadoop cluster.
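For comparison with Spark, here is a rough PySpark sketch of the same top-five-pages query (the file names and the comma-separated layout are assumptions carried over from the listing above):

from pyspark import SparkContext

sc = SparkContext("local[*]", "top5pages")

# users.txt: "user,age" per line; pages.txt: "user,url" per line (assumed).
users = sc.textFile("users.txt").map(lambda l: l.split(",")) \
          .map(lambda f: (f[0], int(f[1])))
pages = sc.textFile("pages.txt").map(lambda l: l.split(",")) \
          .map(lambda f: (f[0], f[1]))

filtered = users.filter(lambda ua: 18 <= ua[1] <= 25)        # FILTER
joined = filtered.join(pages)                                # JOIN by user
clicks = joined.map(lambda kv: (kv[1][1], 1)) \
               .reduceByKey(lambda a, b: a + b)              # GROUP + COUNT
print(clicks.takeOrdered(5, key=lambda uc: -uc[1]))          # ORDER DESC, LIMIT 5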
APACHE HBASE
Apache HBase is quite similar to Apache Cassandra, the wide column store that we discussed above. It is essentially a large sorted map that we can update. It uses Apache ZooKeeper to ensure consistent updates to the data.

Figure 3. Example of HBase
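That "large sorted map" model shows through directly in client code. A minimal sketch with the happybase Python client, assuming a local HBase Thrift gateway and an existing 'pages' table with a 'stats' column family (all assumptions):

import happybase

# Connect through the HBase Thrift gateway (must be running locally).
connection = happybase.Connection('localhost')
table = connection.table('pages')

# Update: put a value under (row key, column family:qualifier).
table.put(b'row-42', {b'stats:clicks': b'17'})

# Point lookup by row key...
print(table.row(b'row-42'))

# ...and an ordered range scan, which works because rows are kept sorted.
for key, data in table.scan(row_start=b'row-0', row_stop=b'row-9'):
    print(key, data)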
APACHE CALCITE
Apache Calcite is a framework that can parse and optimize SQL queries and process data. It powers query optimization in Flink, Hive, Druid, and others. It also provides many of the pieces needed to implement a database engine. More importantly, Calcite can connect to a variety of database systems, including Spark, Druid, Elasticsearch, Cassandra, MongoDB, and Java JDBC.

Figure 4. Example of Calcite architecture

APACHE STORM
Apache Storm is a distributed stream processing framework. In the industry, it is primarily used for processing massive amounts of streaming data. It has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Apache Storm integrates with the queueing and database technologies you already use. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.

APACHE KAFKA
Apache Kafka is an open source distributed stream processing & messaging platform. It's written in Java & Scala and was developed by LinkedIn. The storage layer of Kafka involves a distributed, scalable pub/sub message queue. It helps read & write streams of data like a messaging system. Kafka is used in the industry to develop real-time features such as notification platforms, managing streams of massive amounts of data, monitoring website activity & metrics, messaging, and log aggregation.
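A minimal pub/sub sketch using the kafka-python client, assuming a broker on localhost:9092 and a hypothetical page-views topic:

from kafka import KafkaProducer, KafkaConsumer

# Producer: write a stream of events to a topic.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('page-views', b'user=alice url=/home')
producer.flush()

# Consumer: read the stream back, starting from the earliest offset.
consumer = KafkaConsumer('page-views',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.offset, message.value)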
LAMBDA ARCHITECTURE
The batch layer aims at perfect accuracy by being able to process all available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views.
Apache Hadoop is the leading batch-processing system used in most high-throughput architectures. New massively parallel, elastic, relational databases like Snowflake, Redshift and BigQuery are also used in this role.
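The batch layer's "recompute from the complete data set, then swap the view" discipline fits in a few lines of plain Python (a toy stand-in for illustration, not a real batch framework):

from collections import Counter

# Immutable master data set: the complete history of raw events.
master_dataset = [
    {"user": "alice", "url": "/home"},
    {"user": "bob", "url": "/home"},
    {"user": "alice", "url": "/about"},
]

def recompute_batch_view(events):
    """Batch layer: rebuild the view from ALL data, so any past error
    in the view is fixed by the next recomputation."""
    return Counter(e["url"] for e in events)

# Serving layer: the freshly computed view completely replaces the old one.
batch_views = {}
batch_views["clicks_per_url"] = recompute_batch_view(master_dataset)
print(batch_views["clicks_per_url"])   # Counter({'/home': 2, '/about': 1})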
KAPPA ARCHITECTURE
The Kappa Architecture was first described by Jay Kreps. It focuses on processing data only as a stream. It is not a replacement for the Lambda Architecture, except where your use case fits it. In this architecture, incoming data is streamed through a real-time layer, and the results are placed in the serving layer for queries.

Figure 7. Example of the Kappa Architecture

The idea is to handle both real-time data processing and continuous reprocessing in a single stream processing engine. That's right: reprocessing occurs from the stream. This requires that the incoming data stream can be replayed (very quickly), either in its entirety or from a specific position. If there are any code changes, a second stream process replays all previous data through the latest real-time engine and replaces the data stored in the serving layer, as sketched below.
This architecture attempts to simplify things by keeping only one code base rather than managing one each for the batch and speed layers of the Lambda Architecture. In addition, queries only need to look in a single serving location instead of going against batch and speed views.
The complication of this architecture mostly revolves around having to process the data in a stream: handling duplicate events, cross-referencing events, and maintaining order are operations that are generally easier to do in batch processing.
Kappa is not an alternative to Lambda; both architectures have their use cases. Kappa is preferred if the batch and streaming analytics results are fairly identical in a system; Lambda is preferred if they are not. Both architectures can be implemented using the distributed data processing technologies I've talked about above.
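A toy illustration of replay-based reprocessing, with plain Python standing in for the log and the serving layer:

# The append-only event log is the system of record; it can be replayed
# in its entirety or from any offset.
event_log = [("alice", "/home"), ("bob", "/home"), ("alice", "/about")]

def stream_process(events):
    """Latest stream-processing code: build the serving-layer view."""
    view = {}
    for user, url in events:
        view[url] = view.get(url, 0) + 1
    return view

serving_layer = stream_process(event_log)   # normal real-time path

# After a code change, a second process replays the log from offset 0,
# and its output replaces the data in the serving layer.
serving_layer = stream_process(event_log[0:])
print(serving_layer)                        # {'/home': 2, '/about': 1}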
ADVANTAGES AND DISADVANTAGES

ADVANTAGES
Distributed data processing facilitates scalability, high availability, fault tolerance, replication, and redundancy, which are typically not available in centralized data processing systems. Parallel distribution of work facilitates faster execution. Enforcing security, authentication, and authorization workflows also becomes easier, as the system is more loosely coupled.

DISADVANTAGES
Setting up and working with a distributed system is complex. Well, that's expected with so many nodes working in conjunction with each other while maintaining a consistent shared state. The management of distributed systems is complex. Since the machines are distributed, it entails additional network latency, which engineering teams have to deal with. Strong consistency of data is hard to maintain when everything is so distributed.