
Department: Telematics Systems for Transport

Editors: Podeanu Alexandru-Stefan, [email protected]
Petrica George-Cristian, [email protected]

Distributed Data Processing - Overview

Ing. Podeanu Alexandru-Stefan
Department of Telematics Systems for Transport
Ing. Petrica George-Cristian
Department of Telematics Systems for Transport

Abstract— Modern computers can process vast amounts of data representing many aspects of the world. From these big data sets, we can learn about human behavior in unprecedented ways: how language is used, what photos are taken, what topics are discussed, and how people engage with their surroundings. To process large data sets efficiently, programs are organized into pipelines of manipulations on sequential streams of data. In this article, we consider a suite of techniques to process and manipulate sequential data streams efficiently. This write-up is an in-depth look at distributed data processing. It covers the frequently asked questions about it: What is it? How does it differ from centralized data processing? What are its pros & cons? What approaches & architectures does it involve? And what technologies & frameworks are popular in the industry for processing massive amounts of data across several nodes running in a cluster?

IT Professional Published by the IEEE Computer Society © 2020 IEEE


INTRODUCTION
Distributed data processing is a computer-networking method in which multiple computers across different locations share computer-processing capability. This is in contrast to a single, centralized server managing and providing processing capability to all connected systems. The computers that comprise the distributed data-processing network are located at different sites but are interconnected by means of wireless or satellite links.
Data processing is ingesting massive amounts of data into the system from several different sources such as IoT devices, social platforms, satellites, wireless networks, software logs etc. & running business logic/algorithms on it to extract meaningful information from it. Running algorithms on the data & extracting information from it is also known as Data Analytics.
Data analytics helps businesses use the information extracted from raw, unstructured and semi-structured data at terabyte and petabyte scale to create better products, understand what their customers want, understand their usage patterns, & subsequently evolve their service or product.

DISTRIBUTED DATA PROCESSING SYSTEM
In a distributed data processing system, a massive amount of data flows into the system from several different sources. This process of data flow is known as Data ingestion. Once the data streams in, there are different layers in the system architecture which break down the entire processing into several different parts.

Figure 1. System architecture

SYSTEM ARCHITECTURE

Data Collection and Preparation Layer
This layer takes care of collecting data from different external sources & preparing it to be processed by the system. When the data streams in, it has no standard structure. It is raw, unstructured or semi-structured in nature. It may be a blob of text, audio, video, an image format, tax return forms, insurance forms, medical bills etc. The task of the data preparation layer is to convert the data into a consistent standard format, and also to classify it as per the business logic to be processed by the system. The layer is intelligent enough to achieve all this without any sort of human intervention.

Data Security Layer
Moving data is vulnerable to security breaches. The role of the data security layer is to ensure that the data transit is secure by watching over it throughout, applying security protocols and encryption.

Data Storage Layer
Once the data streams in, it has to be persisted. There are different approaches to do this. If the analytics is run on streaming data in real time, in-memory distributed caches are used to store & manage the data. On the contrary, if the data is processed in a traditional way like batch processing, distributed databases built for handling big data are used.
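The normalization job of the preparation layer can be illustrated with a small sketch. Everything here (the record shapes, categories and field names) is hypothetical, chosen only to show the idea of converting heterogeneous input into one standard format:

```python
def prepare(raw):
    """Normalize a raw record into a standard {'source', 'category', 'payload'} shape."""
    if isinstance(raw, str):
        # A blob of free text: strip it and mark it unstructured.
        return {'source': 'text', 'category': 'unstructured', 'payload': raw.strip()}
    if isinstance(raw, dict) and 'form_type' in raw:
        # A form-like record (e.g. an insurance or tax form): semi-structured.
        return {'source': raw['form_type'], 'category': 'semi-structured', 'payload': raw}
    return {'source': 'unknown', 'category': 'unclassified', 'payload': raw}

records = ["  ERROR disk full  ", {'form_type': 'insurance', 'amount': 120}]
prepared = [prepare(r) for r in records]
```

A real preparation layer would of course handle far more formats (audio, images, scanned forms), but the pattern is the same: every record leaves the layer in one consistent shape.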

Data Processing Layer
This is the layer that contains the logic which is the real deal; it is responsible for processing the data. The layer runs business logic on the data to extract meaningful information from it. Machine learning, predictive, descriptive and decision modelling are primarily used for this.

Data Visualization Layer
All the information extracted is sent to the data visualization layer, which typically contains browser-based dashboards that display the information in the form of graphs, charts, infographics etc. Kibana is one good example of a data visualization tool, pretty popular in the industry.

TYPES OF DISTRIBUTED DATA PROCESSING
There are primarily two types of it: Batch Processing & Real-time streaming data processing.

Batch Processing
Batch processing is the traditional data processing technique where chunks of data are streamed in batches & processed. The processing is either scheduled for a certain time of day, happens at regular intervals, or is random, but not real-time.

Real-time Streaming Data Processing
In this type of data processing, data is processed in real time as it streams in. Analytics is run on the data to get insights from it. A good use case of this is getting insights from sports data. As the game goes on, the data ingested from social media & other sources is analyzed in real time to figure out the viewers' sentiments, player stats and predictions.

TECHNOLOGIES INVOLVED IN DISTRIBUTED DATA PROCESSING

MAPREDUCE
The MapReduce framework assumes as input a large, unordered stream of input values of an arbitrary type. For instance, each input may be a line of text in some vast corpus. Computation proceeds in three steps.
A map function is applied to each input, which outputs zero or more intermediate key-value pairs of an arbitrary type.
All intermediate key-value pairs are grouped by key, so that pairs with the same key can be reduced together.
A reduce function combines the values for a given key k; it outputs zero or more values, which are each associated with k in the final output.
To perform this computation, the MapReduce framework creates tasks (perhaps on different machines) that perform various roles in the computation. A map task applies the map function to some subset of the input data and outputs intermediate key-value pairs. A reduce task sorts and groups key-value pairs by key, then applies the reduce function to the values for each key. All communication between map and reduce tasks is handled by the framework, as is the task of grouping intermediate key-value pairs by key.
In order to utilize multiple machines in a MapReduce application, multiple mappers run in parallel in a map phase, and multiple reducers run in parallel in a reduce phase. In between these phases, the sort phase groups together key-value pairs by sorting them, so that all key-value pairs with the same key are adjacent.
Consider the problem of counting the vowels in a corpus of text. We can solve this problem using the MapReduce framework with an appropriate choice of map and reduce functions. The map function takes as input a line of text and outputs key-value pairs in which the key is a vowel and the value is a count. Zero counts are omitted from the output:

    def count_vowels(line):
        """A map function that counts the vowels in a line."""
        for vowel in 'aeiou':
            count = line.count(vowel)
            if count > 0:
                emit(vowel, count)

The reduce function is the built-in sum function in Python, which takes as input an iterator over values (all values for a given key) and returns their sum.
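The three steps can be exercised end-to-end on a single machine. The sketch below is a local imitation of the framework, not the distributed implementation: the map function from the text is rewritten to yield its pairs (rather than calling a framework-provided emit), the pairs are grouped by key, and the built-in sum plays the reduce role:

```python
from collections import defaultdict

def count_vowels(line):
    """Map function: emit (vowel, count) pairs for one line, omitting zero counts."""
    for vowel in 'aeiou':
        count = line.count(vowel)
        if count > 0:
            yield (vowel, count)

def mapreduce(inputs, mapper, reducer):
    """Run the three steps of the text on one machine: map, group by key, reduce."""
    groups = defaultdict(list)
    for value in inputs:                       # map phase
        for key, intermediate in mapper(value):
            groups[key].append(intermediate)   # grouping, done by the framework
    return {key: reducer(values) for key, values in groups.items()}  # reduce phase

corpus = ["hello world", "mapreduce is neat"]
totals = mapreduce(corpus, count_vowels, sum)  # {'a': 2, 'e': 4, 'i': 1, 'o': 2, 'u': 1}
```

In a real deployment, the map calls and the reduce calls would each be spread over many machines, with the grouping (shuffle) step moving pairs between them.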

January/February 2020 3
APACHE SPARK
In Apache Spark, the computation model is much richer than just MapReduce. Transformations on input data can be written lazily and batched together. Intermediate results can be cached and reused in future calculations. There is a series of lazy transformations which are followed by actions that force evaluation of all transformations.
Notably, each step in the Spark model produces a resilient distributed dataset (RDD). Intermediate results can be cached in memory or on disk, optionally serialized.
For each RDD, we keep a lineage, which is the set of operations that created it. Spark can then recompute any data which is lost without storing to disk. We can still decide to keep a copy in memory or on disk for performance.
The process of implementing k-means clustering in Spark is this:

1. Load the data and "persist."
2. Randomly sample for initial clusters.
3. Broadcast the initial clusters.
4. Loop over the data and find the new centroids.
5. Repeat from step 3 until the algorithm converges.

Unlike MapReduce, we can keep the input in memory and load it once. The broadcasting in step 3 means that we can quickly send the centers to all machines. Cluster assignments do not need to be written to disk every time.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
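The five k-means steps listed above translate almost directly into a single-machine sketch. This plain-Python version uses one-dimensional points for brevity; in Spark, the persisted dataset would be an RDD, the centers would be broadcast to all machines, and the assignment loop would run in parallel across partitions:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Single-machine sketch of the k-means loop from steps 1-5 (1-D points)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: randomly sample initial clusters
    for _ in range(iterations):              # step 3: "broadcast" is just reusing `centers`
        clusters = {i: [] for i in range(k)}
        for p in points:                     # step 4: assign each point to nearest center
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # step 5: recompute centroids; keep the old center if a cluster went empty
        centers = [sum(c) / len(c) if c else centers[i] for i, c in clusters.items()]
    return sorted(centers)

centers = kmeans([0.9, 1.0, 1.1, 9.8, 10.0, 10.2], k=2)
```

A fixed iteration count stands in for the convergence test of step 5, to keep the sketch short.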

Figure 2. Spark architecture

Seen above is the Spark architecture. There are several useful things to note about this architecture:

1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications

(instances of SparkContext) without writing it to an external storage system.
2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime. As such, the driver program must be network addressable from the worker nodes.
4. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

The system currently supports several cluster managers:

1. Standalone — a simple cluster manager included with Spark that makes it easy to set up a cluster.
2. Apache Mesos — a general cluster manager that can also run Hadoop MapReduce and service applications.
3. Hadoop YARN — the resource manager in Hadoop 2.
4. Kubernetes — an open-source system for automating deployment, scaling, and management of containerized applications.

APACHE HIVE
Apache Hive is a data warehouse software built on top of Hadoop. It converts SQL queries to MapReduce, Spark, etc. The data here is stored in files on HDFS. An example of Hive input is shown below:

Figure 3. Example of Hive input

APACHE FLINK
Apache Flink is another system designed for distributed analytics, like Apache Spark. It executes everything as a stream. Iterative computations can be written natively with cycles in the data flow. It has a very similar architecture to Spark, including (1) a client that optimizes and constructs the data flow graph, (2) a job manager that receives jobs, and (3) a task manager that executes jobs.

Figure 4. Example of Apache Flink

APACHE PIG
Apache Pig is another platform for analyzing large datasets. It provides the Pig Latin language, which is easy to write but runs as MapReduce. Pig Latin code is shorter and faster to develop than the equivalent Java code.

Here is a Pig Latin code example that counts words:

    lines = LOAD 'input.txt' AS (line:chararray);
    words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts = FOREACH grouped GENERATE group, COUNT(words);
    STORE counts INTO 'counts';

Here is a Pig Latin code example that finds the most visited pages:

    Users = LOAD 'users.txt' AS (user:chararray, age:int);
    Pages = LOAD 'pages.txt' AS (user:chararray, url:chararray);
    Filtered = FILTER Users BY age >= 18 AND age <= 25;
    Joined = JOIN Filtered BY user, Pages BY user;
    Grouped = GROUP Joined BY url;
    Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
    Sorted = ORDER Summed BY clicks DESC;
    Top5 = LIMIT Sorted 5;
    STORE Top5 INTO 'top5pages';

Apache Pig is installed locally and can send jobs to any Hadoop cluster. It is slower than Spark, but doesn't require any software on the Hadoop cluster.

APACHE HBASE
Apache HBase is quite similar to Apache Cassandra, the well-known wide column store. It is essentially a large sorted map that we can update. It uses Apache ZooKeeper to ensure consistent updates to the data.

Figure 3. Example of HBase

APACHE CALCITE
Apache Calcite is a framework that can parse/optimize SQL queries and process data. It powers query optimization in Flink, Hive, Druid, and others. It also provides many pieces needed to implement a database engine. More importantly, Calcite can connect to a variety of database systems including Spark, Druid, Elasticsearch, Cassandra, MongoDB and Java JDBC.

Figure 4. Example of Calcite architecture

APACHE STORM
Apache Storm is a distributed stream processing framework. In the industry, it is primarily used for processing massive amounts of streaming data. It has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Apache Storm integrates with the queueing and database technologies you already use. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.

APACHE KAFKA
Apache Kafka is an open source distributed stream processing & messaging platform. It's written using Java & Scala & was developed by LinkedIn.
The storage layer of Kafka involves a distributed scalable pub/sub message queue. It helps read & write streams of data like a messaging system.
Kafka is used in the industry to develop real-time features such as notification platforms, managing streams of massive amounts of data, monitoring website activity & metrics, messaging and log aggregation.
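The pub/sub idea at the heart of Kafka's storage layer can be sketched with a toy in-memory broker. This is purely illustrative (the class and method names are invented and bear no relation to Kafka's real API): producers append to a topic's log, and each consumer group reads forward from its own offset:

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory pub/sub log. Each topic is an append-only list of
    messages; consumer groups poll forward from their own offsets."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only message log
        self.offsets = defaultdict(int)   # (group, topic) -> next index to read

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, group, topic):
        """Return the unread messages for this group and advance its offset."""
        log = self.topics[topic]
        start = self.offsets[(group, topic)]
        self.offsets[(group, topic)] = len(log)
        return log[start:]

broker = MiniBroker()
broker.publish('metrics', {'page': '/home', 'hits': 1})
broker.publish('metrics', {'page': '/about', 'hits': 1})
first = broker.poll('dashboard', 'metrics')    # both messages
second = broker.poll('dashboard', 'metrics')   # nothing new yet
```

Because the log is append-only and offsets belong to consumers, many independent consumer groups can read the same stream at their own pace, which is the property the real Kafka builds on at cluster scale.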

Figure 5. Example of Kafka architecture

ARCHITECTURES INVOLVED

LAMBDA ARCHITECTURE
Lambda is a distributed data processing architecture which leverages both the batch & the real-time streaming data processing approaches to tackle the latency issues arising out of the batch processing approach. It joins the results from both approaches before presenting them to the end user.

Figure 6. Example of Lambda Architecture

Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. The processing layers ingest from an immutable master copy of the entire data set.

Batch Layer
The batch layer precomputes results using a distributed processing system that can handle very large quantities of data. The batch layer aims at perfect accuracy by being able to process all available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views.
Apache Hadoop is the leading batch-processing system used in most high-throughput architectures. New massively parallel, elastic, relational databases like Snowflake, Redshift and BigQuery are also used in this role.

Speed Layer
The speed layer processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views into the most recent data. Essentially, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received, and can be replaced when the batch layer's views for the same data become available.
Stream-processing technologies typically used in this layer include Apache Storm, SQLstream and Apache Spark. Output is typically stored in fast NoSQL databases.

Serving Layer
Output from the batch and speed layers is stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data. Examples of technologies used in the serving layer include Druid, which provides a single cluster to handle output from both layers. Dedicated stores used in the serving layer include Apache Cassandra, Apache HBase, MongoDB, VoltDB or Elasticsearch for speed-layer output, and ElephantDB, Apache Impala, SAP HANA or Apache Hive for batch-layer output.

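How a Lambda serving layer answers a query can be sketched by merging a stale-but-accurate batch view with the speed layer's recent delta. The views and numbers below are made up purely for illustration:

```python
# Hypothetical precomputed views for a page-view counter.
batch_view = {'/home': 1400, '/about': 300}   # accurate, but hours old
speed_view = {'/home': 25, '/pricing': 3}     # recent events not yet in the batch view

def page_views(url):
    """Serving-layer query: batch result plus the real-time delta."""
    return batch_view.get(url, 0) + speed_view.get(url, 0)

total = page_views('/home')   # 1400 + 25 = 1425
```

When the next batch run completes, its recomputed view replaces batch_view and the corresponding entries are dropped from the speed layer, which is exactly the hand-off described above.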
KAPPA ARCHITECTURE
The Kappa Architecture was first described by Jay Kreps. It focuses on only processing data as a stream. It is not a replacement for the Lambda Architecture, except where your use case fits. In this architecture, incoming data is streamed through a real-time layer, and the results are placed in the serving layer for queries.

Figure 7. Example of Kappa Architecture

The idea is to handle both real-time data processing and continuous reprocessing in a single stream processing engine. That's right, reprocessing occurs from the stream. This requires that the incoming data stream can be replayed (very quickly), either in its entirety or from a specific position. If there are any code changes, then a second stream process would replay all previous data through the latest real-time engine and replace the data stored in the serving layer.
This architecture attempts to simplify things by keeping only one code base rather than managing one for each of the batch and speed layers in the Lambda Architecture. In addition, queries only need to look in a single serving location instead of going against batch and speed views.
The complication of this architecture mostly revolves around having to process the data in a stream, such as handling duplicate events, cross-referencing events or maintaining order, operations that are generally easier to do in batch processing.
Kappa is not an alternative for Lambda. Both architectures have their use cases. Kappa is preferred if the batch and the streaming analytics results are fairly identical in a system. Lambda is preferred if they are not. Both architectures can be implemented using the distributed data processing technologies discussed above.

ADVANTAGES AND DISADVANTAGES

ADVANTAGES
Distributed data processing facilitates scalability, high availability, fault tolerance, replication and redundancy, which are typically not available in centralized data processing systems. Parallel distribution of work facilitates faster execution. Enforcing security, authentication and authorization workflows becomes easier as the system is more loosely coupled.

DISADVANTAGES
Setting up and working with a distributed system is complex. Well, that's expected with so many nodes working in conjunction with each other, maintaining a consistent shared state.
The management of distributed systems is complex. Since the machines are distributed, it entails additional network latency which engineering teams have to deal with. Strong consistency of data is hard to maintain when everything is so distributed.
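To make the Kappa Architecture's central idea concrete, reprocessing by replaying an immutable stream through new code, here is a minimal sketch. The event shapes and the two processing versions are invented for illustration only:

```python
# A replayable, immutable event stream (url, count) - in Kappa this would be
# a durable log such as a Kafka topic, replayable from any position.
events = [('/home', 1), ('/about', 1), ('/home', 1)]

def process_v1(stream):
    """Original stream processor: total views per page."""
    view = {}
    for url, n in stream:
        view[url] = view.get(url, 0) + n
    return view

def process_v2(stream):
    """New code version: same replayed stream, different logic."""
    return {'distinct_pages': len({url for url, _ in stream})}

serving_view = process_v1(events)   # current serving-layer view
serving_view = process_v2(events)   # code changed: replay the stream, replace the view
```

The key point the sketch shows is that there is no separate batch code path: upgrading the logic just means replaying the same log through the new processor and swapping the serving view.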


