
Risikobewertung durch Realtime Datenanalyse mittels Big Data Streaming in AWS

Risk assessment through real-time data analysis using Big Data Streaming in AWS

Master Thesis
Submitted in partial fulfilment of the requirements for the degree of

Master of Science in Engineering

to the University of Applied Sciences FH Campus Wien


Master’s degree Program

Author:
Elias Dräxler, BSc (hons)

Student identification number:
1810838024

Supervisor:
FH-Prof.in Mag.a Dr.in Sigrid Schefer-Wenzl, MSc BSc

Date:
31.05.2020
Declaration of authorship:

I declare that this Master Thesis has been written by myself. I have not used any other than
the listed sources, nor have I received any unauthorised help.
I hereby certify that I have not submitted this Master Thesis in any form (to a reviewer for
assessment) either in Austria or abroad.
Furthermore, I assure that the (printed and electronic) copies I have submitted are identical.

Date: 31.05.2020 .................. Signature:


Preface

First, I would like to thank my supervisor Dr. Sigrid Schefer-Wenzl for the input and
guidance that helped me throughout my research.

I also wish to thank Rene Schakman and my colleagues at the viesure innovation center
for supporting this research.

Finally, I would like to thank Anna Kirchgasser and my whole family for supporting me
throughout the process of writing this thesis.

Abstract
The competitive advantage that companies can gain nowadays when utilising their
data is one of the most significant changes across many industries, including
the finance and insurance sector. Being able to utilise available data brings benefits
for both the customers and the insurance companies. Research has shown that
traditional insurance models cannot compete with a data-centric approach. This
study aims to discover possible architectures and implementations for Big Data
real-time analysis. It focuses on real-time analysis in the context of insurance risk
calculation, more specifically on telematics data that could be used to calculate car
insurance rates in real-time. Different approaches to implement such a system are
compared, specifically in the context of cloud-native and on-premise setups.

Based on a review of the literature on Big Data and real-time processing, two
solutions were chosen and implemented. Further, the advantages and
disadvantages of both approaches were evaluated and compared using multiple
metrics, such as latency and performance. Both implementations make use of a
Kappa architecture, implemented once with Apache Spark Streaming and once
with Apache Flink, using AWS as the cloud provider. The results indicate that
Apache Flink, as one of the first true stream processing frameworks, could deliver
up to 170% of the performance of Apache Spark at only 44% of the cost of the
competing implementation. Therefore, it is recommended to use Apache Flink with
AWS Kinesis Analytics when implementing real-time analysis in the cloud.

List of abbreviations

AWS   Amazon Web Services
SaaS  Software as a Service
PaaS  Platform as a Service
RAM   Random-access memory
ES    Elasticsearch
POJO  Plain Old Java Object
EMR   Elastic MapReduce
ECS   Elastic Container Service

Key terms

Big Data
Fast Data
Big Data Streaming
Big Data Analysis
Apache Spark
Apache Flink
Amazon Web Services
AWS Kinesis Analytics
Risk Analysis
Lambda Architecture
Kappa Architecture

Table of Contents

Preface
Abstract
List of abbreviations
Key terms
1. Introduction
1.1 Risk Analysis
1.2 Research Question
1.3 How to compare approaches
1.4 Structure
2. Concepts
2.1 Big Data
2.1.1 Characteristics
2.1.2 Application of Big Data
2.1.3 Security and Privacy
2.2 Fast Data
2.3 Data Streaming
2.4 Cloud-native
3. Architecture
3.1 Lambda Architecture
3.1.1 Batch Layer
3.1.2 Serving Layer
3.1.3 Speed Layer
3.1.4 Integrated Layers
3.1.5 Adoption and Implementation
3.2 Kappa Architecture
3.2.1 Layers
3.2.2 Adoption & Implementation
3.3 Other Architectures
4. Data Streaming
4.1 Introduction
4.2 AWS Kinesis
4.3 Apache Kafka
5. Algorithms & Frameworks
5.1 Introduction
5.2 MapReduce
5.2.1 Programming Model
5.2.2 MapReduce for real-time processing
5.3 Bulk Synchronous Parallel
5.4 Apache Hadoop
5.4.1 AWS EMR
5.5 Apache Spark
5.6 Apache Flink
5.7 Apache Storm
5.8 Comparison: Spark vs Flink vs Storm
6. Storage
6.1 Introduction
6.2 NoSQL Databases
6.2.1 Elasticsearch
6.3 NewSQL Databases
6.4 Cloud Storage
6.5 Data Lake
6.5.1 AWS Lake Formation
7. Implementation & Results
7.1 Introduction
7.2 Chosen Architectures
7.3 Implementation Concept
7.4 Data Setup
7.4.1 Spatial Subdivision
7.5 Apache Spark Streaming
7.5.1 Architecture and Technology
7.5.2 Implementation
7.5.3 Metrics
7.5.4 Advantages & Disadvantages
7.6 Kinesis Analytics with Apache Flink
7.6.1 Architecture and Technology
7.6.2 Implementation
7.6.3 Metrics
7.6.4 Advantages & Disadvantages
7.7 Analysis Results
7.8 Comparison
8. Conclusion
Bibliography
List of figures
List of equations
List of code listings
List of tables
1. INTRODUCTION
1.1 Risk Analysis
Big Data is one of the most important buzzwords of recent years. It refers to vast
sets of data that are difficult to manage, store, process, load, or, to put it simply, to
work with (Rijmeam, 2013). Businesses that can utilise Big Data will have an
advantage over their competitors. The insurance sector will be one of the biggest
beneficiaries of Big Data usage once it can overcome the barriers and start to make
data-driven decisions (Manyika, et al., 2011). To be able to make use of Big Data
and provide straightforward and customer-oriented insurance products, the
insurance industry must make a shift. Today's business is all about the customer
and how the company can offer a better customer experience. Data is the key to
providing a customised experience for every customer, and according to Segment
(2017), customers are typically willing to spend more money on personalised
services.

There are a few factors that influence the willingness of customers to share their
data with a company. Improved transparency around privacy policies and terms
and conditions will help customers gain trust in the product and the business as a
whole. One example would be to write the terms and conditions in a way that
people can understand them without being overwhelmed by legal language. The
second step to building a trust relationship with customers is to communicate the
commitment to data privacy and security. A company can have the best data
security, but if its customers do not know about that commitment, it does not
provide any competitive advantage. The easiest and most effective way to get
customers to share their data with a company is to offer value in exchange for it.
No customer will provide their data if they gain no benefit from it (Harris, 2018).

Using telematics and IoT data available to the customer, both the insurer and the
customers can gain advantages. Baecke & Bocca (2017) state that even the usage
of telematics data without any traditional variables, such as customer-specific data,
car-specific data or the claim history, outperforms the basic insurance model.
The ability to use and analyse Big Data is becoming one of the core competitive
factors for the future of any financial institution (Liu, Peng, & Yu, 2018).
To be able to provide insurance services based on telematics data, the backend
systems must be able to handle vast amounts of sensor data, such as geo-location
or acceleration information. Further, it must be possible to perform real-time analysis
on the incoming data, to calculate dynamic risk models and perform fraud detection.
Using the accumulated data for forecasts and data insights, insurers can optimise
the dynamic risk calculations further.
IoT and InsurTech will enable insurers to embrace the shift to paying in advance,
and to prevention rather than cure (Huckstep, 2019).
According to Gartner, there will be an average of 500 smart devices in every home
by 2022, and even today, most cars already have accessible interfaces to use the
telematics data. Accenture wrote in its insurance blog that roughly 39 per cent of
insurance companies have started to offer services based on IoT devices, and
another 44 per cent are thinking about launching products in this area (Huckstep, 2019).
1.2 Research Question
The goal of this thesis is a comparison of different Big Data streaming architectures
in the context of real-time insurance risk evaluation.
Two architectures for real-time analysis of millions of data points will be
implemented. The architectures will be compared using multiple parameters, such
as latency, implementation cost, elasticity, advantages and disadvantages, and
other benchmarks that will be explained in Chapter 1.3.
The analysed data will consist of GPS location data and acceleration data. For the
purpose of showcasing the solution, the results of the risk evaluation should be
available in real-time and shown, for example, on a heatmap chart.

To be able to find a solution that fits the problem stated in the research question,
this thesis will examine multiple different architectures and technologies. First, a
few suitable architectures have to be evaluated and checked for whether they can
be tailored to real-time analysis of the risk data. Following that, suitable solutions
that can be used in the context of these architectures in the fields of streaming,
real-time analysis, storage and presentation of the data have to be evaluated. After
the evaluation, two architectures with their particular technology choices will be
implemented and compared. The choices will be made with feasibility in a
real-world scenario in mind.
Both architectures should be implemented using AWS cloud-native services, where
possible, and open-source software.

1.3 How to compare approaches


The most important aspects when talking about Big Data are the 4Vs, volume being
one of them. Therefore, it is essential to define the data volume before picking a
Big Data architecture. To be able to compare the two approaches that will be
implemented, there needs to be a set number of data points that will be sent per
second; otherwise, it is not possible to compare them objectively.

Multiple factors have to be considered to get an idea of the volume that a real-time
insurance risk analysis would produce. There are 7.8 million vehicles registered in
Austria (STATISTIK AUSTRIA, 2019). According to VCÖ (2018), there are 3.5
million car rides to get to work and back home every day. This number is from 2018,
but it is sufficient to estimate the possible volume. Assuming a 25 per cent market
share for the insurance company that wants to provide such an insurance model
and a 10 per cent adoption rate would result in 87,500 customers providing data
for the real-time risk analysis; the calculation can also be seen in Table 1. If every
customer sends one data point per second, this results in 87,500 data entries per
second and, therefore, 315 million data points per hour. Based on an average size
of 30 bytes per data entry, this will produce roughly 9.45 GB per hour.

Vehicles in Austria                                             7,800,000
Car rides per day                                               3,500,000
Rides per day for an insurance company with a
25 per cent market share                                          875,000
Rides per day at a 10 per cent adoption rate of
the insurance product                                              87,500

Table 1 - Possible customer base for a real-time insurance product and the number of car
rides for that product
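The estimate can be reproduced in a few lines of code. The following Java sketch
(class and variable names are illustrative, not part of any implementation in this
thesis) recalculates the figures from Table 1:

```java
// Minimal sketch reproducing the volume estimation above; the input
// figures are taken from the text, the names are illustrative only.
public class LoadEstimate {
    public static void main(String[] args) {
        long carRidesPerDay = 3_500_000;   // VCÖ (2018)
        double marketShare  = 0.25;        // assumed market share
        double adoptionRate = 0.10;        // assumed adoption rate
        long recordBytes    = 30;          // average size of one data entry

        long customers = Math.round(carRidesPerDay * marketShare * adoptionRate);
        long recordsPerHour = customers * 3_600;   // one data point per second
        double gbPerHour = recordsPerHour * recordBytes / 1e9;

        System.out.println(customers);       // 87500
        System.out.println(recordsPerHour);  // 315000000
        System.out.println(gbPerHour);       // 9.45
    }
}
```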

Based on the cited data, it is crucial to pick an architecture that scales well
horizontally. The two implementations will be tested by producing a load of 10,000
data points per second. The reason for downscaling the load is the cost of the
associated streaming solutions. Nevertheless, the chosen architectures need to be
able to scale to a much larger number. This can be achieved by scaling horizontally,
which means that, if needed, more servers can be added to the cluster, in contrast
to scaling vertically by adding more resources, such as a more powerful CPU.
Horizontal scaling is essential when working in a cloud environment, as it makes it
possible to provision hundreds of servers in just a few seconds.

Benchmarking Big Data architectures is a challenging and complex task because
multiple dimensions have to be considered. Nevertheless, it is important, as it helps
to compare the performance of equivalent systems. Further, it provides the
possibility to fine-tune existing systems by providing information about the impact
of, for example, configuration tuning or deployment validation. Over the last years,
multiple Big Data benchmarking frameworks were proposed. Some solutions, like
YCSB and PigMix, were considered to deliver adequate results only in particular use
cases (Persico, Pescapé, Picarello, & Sperlí, 2018), in contrast to others that try to
tackle a broader spectrum of dimensions. One of them is BigBench, proposed
by Ghazal et al. (2013). It is based on a fictional retailer selling products to
customers and uses a synthetic data generator to generate structured, semi-
structured and unstructured data. It aims to create a universal standard for Big Data
benchmarking in the industry. A similar solution that also generates data is
BigDataBench, proposed by Wang et al. (2014). In this approach, data is efficiently
generated by employing data models derived from real data; this is done to
preserve data veracity.

To assess the implementations discussed in Chapter 7, the following parameters
will be used to compare the results along different dimensions (Persico, Pescapé,
Picarello, & Sperlí, 2018):
• Latency
• Correctness of data
• Fault tolerance
• Performance – throughput
• Costs of the solution

All the tests must be done using comparable infrastructure for both Big Data
architectures. As the implementations need different hardware and cloud services,
it is essential not to "overscale" one solution, as this would make comparing the
implementations impossible. Where possible, the same servers, database instance
tiers and network bandwidth settings should be used. If no comparable hardware or
service level is available, it is essential to compare the costs of creating a
comparable solution.

1.4 Structure
This thesis is structured as follows. Chapter 2 discusses the most important
concepts that are needed throughout the thesis. Chapter 3 provides an overview of
architectures that could be used for real-time Big Data analysis, especially the
Lambda and the Kappa architecture. Chapter 4 discusses streaming platforms, such
as Apache Kafka and AWS Kinesis. Chapter 5 introduces data processing
algorithms that enable the mentioned architectures and could be used to implement
the data analysis. Further, it includes an evaluation and comparison of several
frameworks for the data processing part of the architectures, like Apache Hadoop,
Apache Spark, Apache Flink and Apache Storm. In Chapter 6, different storage
solutions are presented; the discussed topics range from NoSQL databases
to the implementation of Data Lakes. In Chapter 7, the findings of the preceding
chapters are used to define, implement, test and benchmark two different Big Data
analytics architectures. Chapter 8 concludes the work presented in this thesis and
contains possible directions for future work.

2. CONCEPTS
2.1 Big Data
2.1.1 Characteristics
The term Big Data has increased in relevance since Roger Mougalas from O'Reilly
Media first mentioned it. The term refers to vast sets of data that are difficult to
manage, store, process and load (Rijmeam, 2013).

Big Data grew massively over the last years. To put the growth in perspective: until
2003, five exabytes of data were created by the whole human population; in 2013,
only two days were needed to create the same amount of data according to IBM,
and the amount of generated data is increasing at an incredible rate (Sagiroglu &
Sinanc, 2013). In 2018, Domo published their sixth report on Big Data that shows
examples of how much data gets generated every minute. In this report, they
estimated that in 2020 every person on earth would create 1.7 megabytes of data
every second (Domo Inc., 2018).

There are many articles about the characteristics of Big Data; most of them sum up
the "Vs" of Big Data. How many V characteristics there are, and which are worth
mentioning, is an open question. The four most used terms that characterise Big
Data, known as the 4Vs, are variety, velocity, volume and veracity (Big Data
Framework, 2019). Variety describes how data is represented and is one of the
defining factors for Big Data. Data comes in three forms: structured, semi-structured
and unstructured. Structured data is easy to work with as it is already tagged,
classified and annotated. Semi-structured data does not contain fixed fields but
contains tags to separate the data elements from each other. The real struggle with
Big Data is the handling of unstructured data, as it is random and difficult to analyse.
Volume is all about the amount of data: traditional database, storage, query and
analysis techniques cannot be used on terabytes or petabytes of data. Therefore,
new techniques are needed to tackle the vast amount of data. Velocity describes
the way the data is handled, using batch processing or real-time analytics, to
get the necessary information as soon as possible (Sagiroglu & Sinanc, 2013). The
last V stands for veracity, which refers to the quality of the data that is being
analysed. Quality is defined here as how many data records are valuable and
contribute in a meaningful way to the result of the analysis. If data has low veracity,
it means that a high percentage of this data is meaningless and therefore just noise
that could be ignored.
Big Data mostly refers to datasets with high volume, high velocity and high variety,
which makes it nearly impossible to process this data with traditional tools. (Big
Data Framework, 2019)

2.1.2 Application of Big Data


According to Manyika et al. (2011), the usage of Big Data will generate significant
value across all sectors, though some sectors will see more significant growth than
others. The computer and electronic products sector and the information sector are
the two sectors that already had a productivity boost and gained the most out of the
usage of Big Data. The two sectors that will gain significant benefit by using Big
Data are finance and insurance and the governmental sector; these two are
predicted to gain the most advantages if they can utilise Big Data. Other sectors
that will also see growth include health care, manufacturing, real estate and rental,
and transportation and warehousing. Some sectors are behind the others in
adopting Big Data. For these sectors, which include, for example, construction and
educational services, the prediction is that they will not be able to utilise Big Data
in their businesses. (Manyika, et al., 2011)

New methods and techniques were created to deal with the vast amount of data,
and the industry came up with multiple new architectures to support Big Data. One
of these is MapReduce, a programming framework introduced by Google that uses
a divide and conquer approach; more about MapReduce can be found in Chapter
5.2. (Sagiroglu & Sinanc, 2013)

In 2005 Yahoo! built Hadoop, inspired by Google’s MapReduce, to index the entire
World Wide Web. Today it is known as the open-source project Apache Hadoop
that is used by organisations all around the world for their Big Data processing
needs. More about Hadoop can be found in Chapter 5.4. (Rijmeam, 2013)

Another open-source project to tackle Big Data is HPCC (High-Performance
Computing Cluster). It includes a high-level programming language to efficiently
perform ETL (Extract, Transform, Load) workloads. HPCC has two main
architectural components, Thor and Roxie. Thor is a parallel ETL engine that
enables data integration and is used for batch processing. Roxie is a parallel,
high-throughput, low-latency data delivery engine. (Sagiroglu & Sinanc, 2013)

2.1.3 Security and Privacy


Zhang (2018) wrote that "big data is a double-edged sword. It brings convenience
to people and brings certain risks". The benefits gained from Big Data are
indispensable for most users, yet most people do not even know the risks
associated with their data-sharing habits. Therefore, the protection of the users'
privacy is a top concern. The main concern regarding Big Data is the targeted
prediction of people's state and behaviour. One way of protecting the users' privacy
is anonymisation. Nevertheless, anonymous protection is not enough to adequately
protect the privacy of users. The current usage of Big Data, coupled with the lack
of self-protection awareness among users, can easily cause information leakage
(Zhang, 2018). Data mining and predictive analytics have a big impact on the
privacy of users. Both techniques are used to discover intercorrelated data.
Information linkages can bring advantages for companies, but on an individual
basis, the discoveries of these processes can lead to the exposure of the identity
of the data providers (Grolinger, et al., 2014).
Another issue when thinking about Big Data security is the way people think about
their data. Mostly, the data is thought of as fact. In reality, someone can forge data
and thereby manipulate the outcome and decisions. One could intentionally
fabricate malicious data and thereby create a reality beneficial to them. To be able
to trust Big Data, it is crucial to ensure credibility. Not only malicious activity can
change the outcome of Big Data, but also distortion during processing and
propagation. Therefore, it is essential to be able to ensure the reliability and
authenticity of the data used. (Zhang, 2018)

2.2 Fast Data


Data is growing fast. The stream of data that is produced by enterprises nowadays
is ever-growing and brings many difficulties, as already discussed in the previous
chapter. Besides Big Data, another concept has emerged: Fast Data. Fast Data
operates on smaller data sets than Big Data but uses the same analytical tools
and algorithms to process the data in real-time or near real-time. Fast Data is crucial
for applications that require low latency and instant decision making. In this case,
the data comes mostly in streams and therefore needs a streaming system capable
of delivering these events as quickly as possible for further analysis. The use of
data streaming will be discussed thoroughly in the next chapter, but Fast Data does
not only work with data streams; it also uses rapid batch data processing.
Interacting with Fast Data differs significantly from the typical use cases of Big Data
at rest and therefore needs different architectures that are capable of handling it
(Miloslavskaya & Tolstoy, 2016). Architectures that make use of data streams and
can handle both Big Data at rest and Fast Data are described in Chapter 3.

2.3 Data Streaming


When talking about data streaming, one can imagine hundreds of sources that
continuously generate data records simultaneously. Most of the time, a single data
point that is streamed is as small as a few kilobytes. Streaming data can be
generated from hundreds of different sources, for example sensors, log entries,
clickstreams, geospatial services, social network activity, online trading platforms
and every other system that emits events at a high volume. Different operations
are applied to the data stream, such as correlations, aggregations, filtering or
sampling. As the data stream is continuous, the data can be processed record by
record or grouped using a sliding time window (Amazon Web Services, 2020).
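As a minimal illustration of such an operation, the following sketch uses the Apache
Flink DataStream API (discussed in Chapter 5.6) to sum a keyed stream over a
sliding time window. The source, the field layout and the window sizes are
placeholders, not part of the implementation in Chapter 7:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SlidingWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source of (vehicleId, value) pairs; in a real setup
        // this would be a Kinesis or Kafka source (see Chapter 4).
        DataStream<Tuple2<String, Double>> events = env.fromElements(
                Tuple2.of("car-1", 1.2), Tuple2.of("car-2", 0.7));

        // Sum the values per key over a 60 s window that slides every 5 s.
        events.keyBy(e -> e.f0)
              .window(SlidingProcessingTimeWindows.of(Time.seconds(60), Time.seconds(5)))
              .sum(1)
              .print();

        env.execute("sliding-window-sketch");
    }
}
```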

Streaming data can provide near real-time insights that are crucial for businesses
that want to react to change as quickly as possible. Reacting to change is not the
only benefit companies can gain from streaming data and the analytical capabilities
that come with it. For a long time, the most important factor for making decisions
was the data of the past, mostly reports generated based on old data that may, or
may not, reflect the current situation. Using streaming data and real-time analytics,
it is possible to create better forecasts and, therefore, better outcomes
(Nagvanshi, 2018).

Stream processing differs from batch processing in multiple ways, the most
significant being that the data scope is entirely different. In regular batch
processing, it is possible to run queries or processing over the whole data set, as
opposed to stream processing, where the queries and the processing are only done
over the most recent data, regardless of what techniques are used. The data set
that has to be taken into account for batch processing is a lot bigger than for stream
processing and therefore also more difficult to handle. The critical factor for stream
processing is the low latency, which should not be higher than a few milliseconds,
meaning that the insights gained from the data are available in near real-time, in
contrast to the long delays associated with batch processing (Amazon Web
Services, 2020). As not only the real-time but also the holistic view is vital for most
businesses, they run a hybrid approach such as the Lambda architecture that will
be discussed in Chapter 3.1.

When streaming data, it is not always possible to define the exact volume and
velocity of the input data, which makes it challenging to define how many resources
are needed to process the data stream. Therefore, streaming is another prime
example for cloud-based processing, as it takes only seconds to scale the used
computing resources up or down. There are even solely cloud-based solutions; an
example would be AWS Kinesis for streaming in combination with Kinesis Analytics
for running queries on the data stream. Unfortunately, not all data streaming
platforms are as flexible as cloud platforms. Van der Veen, van der Waaji,
Lazovik, Wijbrandi, & Meijer (2015) mentioned, for example, the Apache Storm
platform: as one of the leading data stream analytics platforms, it still lacks the
capability to scale by itself. The authors proposed and created a tool that sits on
top of the platform, monitors the application running on Apache Storm as well as
external systems such as queues and databases, and decides based on this data
whether additional resources are needed to process the data. This shows
how vital scaling is when assessing Big Data.

2.4 Cloud-native
The term cloud-native will be used throughout this thesis to describe particular
application and architecture characteristics, but what exactly does it mean?

Cloud-native technologies empower organisations to build and run scalable
applications in modern, dynamic environments such as public, private, and
hybrid clouds. Containers, service meshes, microservices, immutable
infrastructure, and declarative APIs exemplify this approach. (Cloud Native
Computing Foundation, 2018)

Using these techniques and a high degree of automation, it is possible to build
loosely coupled, resilient, manageable and observable systems (Cloud Native
Computing Foundation, 2018). Cloud-native applications and architectures try to
satisfy the requirements of customers who expect rapid development,
responsiveness, innovative features and zero downtime. Not being able to satisfy
these requirements means that customers will just use the product of the next
competitor. A cloud-native system should be able to take full advantage of cloud
services. Making use of PaaS and SaaS, these systems are mostly deployed in
highly dynamic cloud environments. Is a server unavailable? Provisioning a new
one takes only minutes, and nobody will notice. Does a service need a new
database? No problem using a SaaS model. To be able to reach this kind of
autonomy, all processes have to be automated (Microsoft, 2019).

3. ARCHITECTURE
This chapter will discuss different suitable architectures for implementing Big Data
streaming. All architectures could be implemented in the cloud and should be able
to process the needed amount of data. This chapter will focus on both cloud-native
architectures and traditional architectures that run in an on-premise datacentre. The
discussion in this chapter contributes to the decision in Chapter 7 on which
architectures should be implemented.

3.1 Lambda Architecture


One of the problems when processing large amounts of data is the delay between
the moment the data is collected and the result of the processing. The delay can
cause problems when it is crucial to react to the data in real-time. However, not
only the near real-time view is essential for today's businesses; they also need the
historical view and analysis of their data. These requirements lead to the need for
a hybrid architecture that can handle both real-time analysis and batch processing.
(Hasani, Velinov, & Kon-Popovska, 2014)

One of the emerging architectures is the Lambda architecture proposed by Nathan
Marz. The suggested architecture contains three layers to fulfil the requirements of
real-time and batch processing. The layers are named batch, serving and speed
layer (Marz & Warren, 2015). Figure 1, presented below, shows the three layers
and the interactions between them, and further gives examples of technologies that
could be used to implement the different parts of the architecture.

Figure 1 - Speed, Serving and Batch Layer of the Lambda Architecture (ITechSeeker, 2019)

3.1.1 Batch Layer


The batch layer contains the master dataset and precomputes batch views on it.
The master dataset is an ever-growing, immutable dataset on which arbitrary
functions are computed. As new data is added, the recomputation of the batch
views will be done during the next batch iteration. (Marz & Warren, 2015)

3.1.2 Serving Layer


The serving layer loads the batch views that are precomputed by the batch layer
and provides the ability to query them. The data that can be retrieved is therefore
calculated based on the entirety of the dataset. If new batch views from the batch
layer are available, the serving layer automatically swaps the old ones out.
Therefore, it is possible to query all the data with a slight delay of a few hours,
depending on the speed of the recalculation of the data in the batch layer. (Marz &
Warren, 2015)

3.1.3 Speed Layer


The speed layer fixes the problem that the batch and serving layers cannot present
the latest data and always have a delay. What is needed is a real-time view of the
most recent data to compensate for the outdated data of the batch views. Like the
batch layer, the speed layer also produces views. The difference is that these
real-time views only look at the most recent data and update based on that; they
do not consider the entirety of the dataset. This incremental computation is done in
order to achieve the smallest latency possible. (Marz & Warren, 2015)

3.1.4 Integrated Layers


The batch layer and the serving layer satisfy almost all constraints needed from a
Big Data architecture. Adding the speed layer provides a real-time view of the data,
and using all three layers makes it possible to provide both a batch-processed view
and a real-time view.
batch view = function(all data)
realtime view = function(realtime view, new data)
query = function(batch view, realtime view)
Equation 1 - Relation between the batch view, real-time view and how the data is queried
(Marz & Warren, 2015)

By querying both the batch views and the real-time view for the queries, it is possible
to get a complete view of the data. The data is continuously added to both the batch
layer and the speed layer, once the batch view includes the new data it will be
dropped from the real-time view, which makes it easier to handle the continuous
data flow. (Marz & Warren, 2015)
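For an additive metric such as an event count, Equation 1 can be illustrated in a
few lines of Java. This is a hypothetical sketch of how a query merges the two
views; it is not the implementation used later in this thesis:

```java
import java.util.HashMap;
import java.util.Map;

public class LambdaQuerySketch {
    // batch view = function(all data), recomputed each batch iteration
    static Map<String, Long> batchView = new HashMap<>();
    // realtime view = function(realtime view, new data)
    static Map<String, Long> realtimeView = new HashMap<>();

    // query = function(batch view, realtime view)
    static long query(String key) {
        return batchView.getOrDefault(key, 0L)
             + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        batchView.put("zone-42", 1_000L);     // result of the last batch run
        realtimeView.put("zone-42", 17L);     // events arrived since then
        System.out.println(query("zone-42")); // 1017
    }
}
```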

3.1.5 Adoption and Implementation


Kiran, Murphy, Monga, Dugan, & Baveja (2015) describe and showcase how a
Lambda architecture can be implemented cost-effectively in multiple cloud
environments; specifically, they looked at Microsoft Azure and Amazon Web
Services. The authors implemented a Lambda architecture in AWS and tried to
utilise all the possibilities that come with a cloud environment, like auto-scaling and
the pay-as-you-go cost model.
The underlying architecture can be implemented using Hadoop for the batch layer
and Apache Spark with its streaming capabilities for the speed layer. For the serving
layer, multiple NoSQL databases can be used; examples would be Apache
Cassandra or Elasticsearch. When using cloud components and services, the
implementation gets a lot easier: using AWS services, most of the needed
components can be consumed as a service, which means less overhead for
maintenance. The basic building blocks consist of AWS S3, AWS Glue, AWS EMR,
AWS Athena, AWS Lambda and various Kinesis services.
In order not to go beyond the scope of this study, this thesis will focus on the speed
layer and, therefore, the real-time streaming part of the implementation.
For the implementation, three AWS services can be used: Kinesis Data Streams to
capture the incoming data continuously and stream it in near real-time, Kinesis Data
Firehose to batch and compress the incoming stream data into incremental views,
and Kinesis Data Analytics to process the data using SQL or an Apache Flink
application (Amazon Web Services, 2018).

The Lambda architecture is a good fit when aiming for fault tolerance against
hardware failures and human mistakes, and it also has advantages in the
computation of arbitrary functions on real-time data. The trade-off that comes with
these advantages is high complexity and redundancy. The different frameworks that
are needed to implement the batch, speed and serving layers are in themselves
highly complex, and the combination of the layers does not help in decreasing the
complexity. Maintaining all the layers and keeping the batch and speed layers
synchronised is no easy task in a fully distributed architecture. All in all, the Lambda
architecture does its job exceptionally well but introduces high complexity.
Therefore, it is essential to consider whether a given use case actually needs both
a batch and a speed layer (Feick, Kleer, & Kohn, 2018).

3.2 Kappa Architecture
The Kappa architecture was introduced by one of the original authors of Apache
Kafka, Jay Kreps. Kreps (2014) wrote a blog post about the Lambda architecture
and its already mentioned disadvantages. In the same blog post, he proposed
another approach for real-time data processing that is in some way inspired by the
Lambda architecture but favours simplicity. The resulting Kappa architecture is
easier to implement. It places greater focus on development-related subjects, such
as implementation, debugging and code maintenance, as there is no need to
implement two systems that work together (Feick, Kleer, & Kohn, 2018).

3.2.1 Layers
As opposed to the Lambda architecture, the Kappa architecture only has a real-time
layer and a serving layer, as shown in Figure 2. The input comes from a data stream
such as Apache Kafka or AWS Kinesis and is fed into a stream processing system.
This is the real-time layer, which can be compared to the speed layer of the Lambda
architecture. In the real-time layer, the stream processing jobs are executed on the
incoming data, providing real-time data processing. After the data has been
processed, it enters the serving layer, which makes it possible to run queries on
the data. The implementation of the two layers does not differ from the
implementation that would be needed for a Lambda architecture; the only thing that
is missing is the batch layer. This is justified by the presumption that most
applications do not need the entirety of the data but just a large enough set of the
most recent data. By dropping the batch layer, the architecture gets a lot simpler
and easier to handle, but this cannot be done without constraints. Dropping the
batch layer means that it is not possible to query the entire dataset easily, as the
whole data would have to be streamed again so that it can be queried. The trade-off
is to lose accuracy for reduced complexity (Feick, Kleer, & Kohn, 2018).

Figure 2 - Outline of a Kappa Architecture

3.2.2 Adoption & Implementation


For the implementation of the Kappa architecture, the same technologies can be
used as for the Lambda architecture. The real-time layer can be implemented using
Apache Spark, and for the serving layer, many NoSQL databases are available.
When comparing the Kappa architecture to its predecessor regarding accuracy,
Sanla & Numonda (2019) discovered that the Lambda architecture outperforms the
newer architecture by 9%. However, it had 2.2 times longer processing times, and
its CPU and RAM usage also exceeded that of the Kappa architecture (Sanla &
Numonda, 2019). This again shows that the Kappa architecture is better suited for
use cases where speed is essential and the accuracy loss is negligible.

3.3 Other Architectures


There are not many defined architectures other than the Lambda and Kappa
architectures; these two are also the foundation for most of the implemented
reference architectures. Two other examples that are worth mentioning are the
NIST Big Data Reference Architecture (NBDRA) and the Zeta architecture. The
NBDRA provides a high-level overview of the components that a Big Data
architecture should contain. The architecture was developed by conducting a
survey of use cases for various application domains. The resulting reference
architecture shall help engineers, developers, architects and data scientists to
create solutions that require diverse approaches due to the volatile nature of Big
Data ecosystems. It further tries to define a common vocabulary that can be used
in describing such architectures (Chang, Boyd, & Levin, 2019).
The second architecture is called the Zeta architecture; it tries to tackle the perpetual
need for an enterprise architecture that enables simplified business processes. The
architecture is pictured as a hexagon and includes seven key components (MapR
Technologies, Inc., 2015):
• Distributed File System
• Real-time Storage
• Container System
• Enterprise Applications
• Solution Architecture
• Computation Model & Execution Engine
• Global Resource Management
Unfortunately, both the NBDRA and the Zeta architecture are very high-level
overviews that are more suited for enterprises, as they not only include technical
concepts but also processes that should be used to implement such Big Data
ecosystems. Therefore, they are not explained in greater detail and will not be
considered for the implementation.

4. DATA STREAMING
4.1 Introduction
The following chapter discusses different frameworks, services and products in their
respective categories. For data streaming, AWS Kinesis and Apache Kafka will be
highlighted, as these two are prime examples of high-volume data streaming, and
both are available as SaaS solutions in AWS.

4.2 AWS Kinesis


The Kinesis streaming platform provided by AWS consists of four cloud-native
services: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Video Streams and
Kinesis Data Analytics. It is a fully managed service; it handles and manages the
infrastructure, storage, networking and configuration, which means that the users
do not have to handle anything related to operating the service. It provides high
availability and data durability by synchronously replicating the data across three
availability zones. It can be used to collect and process large streams of data
records in real-time. Kinesis durably stores the ingested data and makes it available
for consumption by consumers. The data that is stored in Kinesis is called a data
record, which is composed of a sequence number, a partition key and a data blob;
the maximum size of a record is 1 MB. The partition key is used by the producer to
write to different shards of the data stream. The sequence number is a unique
identifier that is assigned by Amazon Kinesis when a producer writes a data record
to the stream. A record is available in the data stream by default for 24 hours. This
retention time can be configured, and the maximum is 168 hours. A data stream
represents a group of data records which are distributed across multiple shards
(Amazon Web Services, 2019). The only configuration parameter of a Kinesis Data
Stream is the number of shards that will be used. A shard is the base throughput
unit of a Kinesis Data Stream and provides a capacity of 1 MB/sec data input and
2 MB/sec data output. It can keep up with 1,000 PUT records per second. To
guarantee more throughput in a Kinesis Data Stream, it is possible to add more
shards; the number of shards needed can be calculated by the formula in Equation 2.

number of shards = max(incoming write bandwidth / 1024, outgoing read bandwidth / 2048)

Equation 2 - Formula to calculate the number of shards needed for a Kinesis Data Stream
(Amazon Web Services, 2019)

• All values are in kilobytes.
• Incoming write bandwidth is defined as the average size of a data record in
the data stream, rounded up to the nearest 1 KB, multiplied by the number of
records written to the stream per second.
• Outgoing read bandwidth is calculated as the incoming write bandwidth
multiplied by the number of consumers that read from the stream.
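As a sketch, Equation 2 can be turned into a small helper; the method and
parameter names below are illustrative:

```java
// Hypothetical helper implementing Equation 2; all sizes in KB, with the
// average record size already rounded up to the nearest 1 KB.
public class ShardCalculator {
    static int shardsNeeded(int avgRecordKb, int recordsPerSecond, int consumers) {
        int incomingWriteKb = avgRecordKb * recordsPerSecond; // write bandwidth
        int outgoingReadKb  = incomingWriteKb * consumers;    // read bandwidth
        return (int) Math.ceil(Math.max(incomingWriteKb / 1024.0,
                                        outgoingReadKb / 2048.0));
    }

    public static void main(String[] args) {
        // The 10,000 records/sec of roughly 1 KB from Chapter 1.3,
        // read by two consumers, would need ten shards:
        System.out.println(shardsNeeded(1, 10_000, 2)); // 10
    }
}
```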

Kinesis Data Streams is mostly used for applications that need high-bandwidth
continuous data intake and aggregation. Some typical scenarios include log data
ingestion or real-time metrics, where producers push data directly into a stream;
the data is therefore available immediately, and no data is lost upon server failure.
Another use case is real-time data analytics, for example in the form of a
clickstream that is processed in real-time to analyse the usability of a website.
All these scenarios are made possible because of the following benefits:
• Kinesis Streams are durable and elastic.
• Low latency: the delay between a record being put into the stream and the
possibility for a consumer to read it is less than one second.
• Multiple consumers can consume the same data stream; therefore, multiple
actions such as archiving, processing and aggregation can be done
concurrently and independently.

Figure 3 shows a standard Kinesis Data Stream application, with one producer that
writes data into the stream and multiple consumers that consume the records from
the shards. To help with building such applications, there are two libraries: the
Kinesis Producer Library (KPL) and the Kinesis Client Library (KCL). The KPL is
used by applications to act as an intermediary between the application and the
Kinesis Data Streams API. It includes an automatic retry mechanism that is
important when a high number of records is sent to the stream. Further, it provides
features such as collecting records before sending them to different shards,
aggregating records to increase payload size and throughput, and multiple
CloudWatch metrics to provide observability.
On the consumer side, the KCL helps applications to read from a data stream. It
handles all the logic for connecting to multiple shards and pulling data from the
stream; further, it handles checkpointing and the de-aggregation of the data that
was sent by the KPL. (Amazon Web Services, 2019)

Figure 3 - Kinesis Data Stream with n shards that are consumed by multiple consumers.
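A minimal producer sketch using the KPL could look as follows; the region, stream
name and payload are assumptions, and error handling is omitted:

```java
import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class KplSketch {
    public static void main(String[] args) {
        KinesisProducerConfiguration config = new KinesisProducerConfiguration()
                .setRegion("eu-west-1");        // assumed region
        KinesisProducer producer = new KinesisProducer(config);

        ByteBuffer data = ByteBuffer.wrap(
                "{\"lat\":48.21,\"lon\":16.37}".getBytes(StandardCharsets.UTF_8));

        // The partition key determines the target shard; retries, batching
        // and record aggregation are handled by the KPL in the background.
        producer.addUserRecord("telematics-stream", "car-1", data);

        producer.flushSync();                   // wait for pending records
        producer.destroy();
    }
}
```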

Using Kinesis Data Streams, consumers can make use of the enhanced fan-out
feature. This creates a logical 2 MB/sec throughput pipe between each consumer
and shard. Consumers can decide whether they want to use enhanced fan-out, as
it comes with additional costs but in return provides sub-200 ms latency between
producers and consumers. Using this feature makes it possible to run multiple
applications that read from the data stream while still maintaining high performance.

Besides Kinesis Data Streams, as already mentioned, there are three other
services on the Kinesis platform. Kinesis Video Streams makes it easy to stream
video from devices to AWS and therefore enables the user to use these videos for
analytics, machine learning and other video processing tasks. Kinesis Data
Firehose helps to capture, transform and load data streams into AWS data stores.
The supported data stores include Amazon S3, Amazon Redshift, Amazon
Elasticsearch Service and Splunk. It provides five times higher input speed than
Kinesis Data Streams, with up to 5 MB/sec. The last service in the Kinesis family is
Kinesis Data Analytics. It provides a comfortable, straightforward way to analyse
streaming data and therefore helps the user gain insights that can be used to
respond to their business and customers in real-time. It provides the possibility to
execute SQL queries and sophisticated Java applications that use operators for
standard stream processing functions to transform, aggregate and analyse data at
any scale. Developers can use the AWS SDK or Apache Flink to write applications
that run in Amazon Kinesis Data Analytics.
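A skeleton of such a Flink application, reading records from a Kinesis Data Stream
via the Flink Kinesis connector, could look as follows; the stream name and region
are assumptions:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class KinesisFlinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty(AWSConfigConstants.AWS_REGION, "eu-west-1");
        props.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        // Read raw records from the stream; the actual analysis operators
        // would replace the print() sink.
        env.addSource(new FlinkKinesisConsumer<>(
                    "telematics-stream", new SimpleStringSchema(), props))
           .print();

        env.execute("kinesis-flink-sketch");
    }
}
```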

Nguyen, Luckow, Duffy, Kennedy, & Apon (2018) compared Amazon Kinesis to
Apache Kafka, which will be discussed thoroughly in Chapter 4.3, in the context of
a highly available cloud streaming system. The authors compared multiple aspects,
such as throughput while using different numbers of shards in Kinesis and
partitions in Kafka. Further, they compared the costs of such a cloud-based
streaming solution. They ran their tests with 1, 2, 4, 8, 16 and 32 shards/partitions
and six different data velocities. The comparison was done across multiple
dimensions, such as reliability, performance and costs. When comparing the
throughput, Kafka can achieve high values with only a single partition, while Kinesis
scales massively with the number of available shards. For both streaming platforms,
the consumer performance scales with the number of shards/partitions. When
comparing costs, Kinesis is around four times cheaper than Apache Kafka for a
message size of 10 KB, and this difference only increases with increasing message
size. For smaller message sizes, the costs for Kafka are nearly the same as for
Kinesis; the price for a Kafka system lies between the Kinesis price for a record size
of 1 KB and 3 KB. Nevertheless, Kafka requires more knowledge to set up in a
resilient, fault-tolerant way, as it needs a lot more configuration than the cloud-native,
fully managed Kinesis Data Streams. (Nguyen, Luckow, Duffy, Kennedy, & Apon,
2018)

4.3 Apache Kafka


Apache Kafka is an open-source distributed streaming platform, that is developed
by LinkedIn and was donated to the Apache Software Foundation. As a streaming
platform, Apache Kafka has three key capabilities.

• Publish and subscribe to streams of records


• Fault-tolerant durable storage of stream records.
• Processing of stream records

The two most common use cases for Apache Kafka are real-time streaming data
pipelines that transport data from one system or application to another, and real-
time streaming applications that transform or react to the data streams.

The core abstraction of Kafka is a topic. Records are published to topics, and
therefore a topic is a stream of records. A topic, as known from other publish-
subscribe models, can have multiple subscribers and at its core is a partitioned
append-only log. Each partition in a topic is an ordered, immutable sequence of
records that is always appended to, also called a structured commit log. Every
record within a partition has a sequence number that is unique within the partition;
this number is called the offset. All records that are published by a producer to a
topic are durably persisted by the Kafka cluster for a configurable retention time.
The two main actors in a publish-subscribe model are the producer, which adds
records to the data stream, and the consumers, which read these records and
process the data. A producer pushes its data to a broker, which is a server
within the Kafka cluster. The consumers work on a pull basis, and every consumer
manages its own offset. Typically, consumers read the messages linearly, but as
shown in Figure 4, since every consumer manages its own offset, it can read any
record it wants to. (Apache Software Foundation, 2017)

Figure 4 - Partition with records that have a unique sequence number and consumers that
use the offset to read any record from the partition (Apache Software Foundation, 2017)
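To illustrate the producer side described above, the following minimal sketch
publishes one record to a Kafka topic using the official Java client; the broker
address and the topic name telematics-events are assumptions for illustration.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KafkaProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key determines the partition; records with the same key stay ordered
            producer.send(new ProducerRecord<>("telematics-events", "driver-42", "{\"speed\":88.5}"));
        }
    }
}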

Another main component of a Kafka cluster is ZooKeeper. ZooKeeper is mostly used
to coordinate distributed systems. In the case of Kafka, it manages
and coordinates the brokers. If any broker is added to or removed from the cluster,
ZooKeeper will manage the coordination with the producers and consumers. Kafka
itself does not save any status data of producers and consumers; this is all done by
ZooKeeper. This makes it possible to quickly scale out by adding or removing
brokers without losing any performance when producers or consumers are added or
removed. Therefore, ZooKeeper is one of the main parts of every Kafka cluster
and helps to ensure the reliability of the cluster. (Wang, et al., 2015)

One benefit of Apache Kafka is that it enables stream processing, which means
that a stream processor can take continual streams of data from one or multiple
input topics, perform processing and produce a continual data stream that is sent
to one or more output topics. To be able to write sophisticated stream processors,
Apache Kafka provides the Streams API, also known as Kafka Streams.
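A minimal Kafka Streams sketch of such a stream processor could look as follows;
the topic names and broker address are assumptions, and the processor simply
uppercases every value it reads from the input topic before writing it to the output
topic.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");  // assumed topic
        input.mapValues(value -> value.toUpperCase())                   // per-record transformation
             .to("output-topic");                                      // assumed topic

        new KafkaStreams(builder.build(), props).start();
    }
}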

Kafka is one of the most popular streaming platforms, used by hundreds of
companies in production environments, including, for example, Netflix, Uber, Cisco,
Goldman Sachs and, of course, the company that developed it, LinkedIn. Lee & Wu
(2019) wrote about the current state of Apache Kafka at LinkedIn and presented
impressive numbers that show how performant Kafka is. According to the
authors, LinkedIn’s Kafka deployment processes 7 trillion messages per day.

Apache Kafka does not only excel in stream processing but is also used as a
messaging system. There are many messaging system implementations, and all
have their respective advantages and disadvantages. One of the most used
messaging systems is RabbitMQ, which is developed as an open-source project by
Pivotal. It also uses brokers and queues messages before they are sent to the
clients. This queuing opens the possibility for message routing, load balancing and
data persistence. RabbitMQ also supports a wide range of protocols, such as
AMQP. In practice, RabbitMQ is mostly used when developing enterprise systems.
Another messaging system, used for example by Twitter, is ZeroMQ. ZeroMQ is
used to develop high-throughput systems but brings a lot of complexity when
working with it. Another problem with ZeroMQ is that messages are not persisted,
which implies that if the system goes down, messages could be lost. The last
comparable messaging solution is ActiveMQ, which is also a project of the Apache
Software Foundation. It implements message queues by using brokers and can
provide point-to-point messaging. Problems arise for ActiveMQ when high
throughput is needed, because every sent message carries high overhead due to
its large message headers. Compared to these messaging systems, Apache Kafka
can provide solutions to a few of the mentioned problems. For example, in Kafka,
as already mentioned, all messages are saved to disk; therefore, they remain
persistent even after a consumer reads them. Another advantage that Kafka has
over other messaging systems is that the brokers do not have to maintain any state
other than offsets and messages, because each consumer manages its own state.
All these techniques help Kafka to reach the high throughput that was demonstrated
by LinkedIn.

5. ALGORITHMS & FRAMEWORKS


5.1 Introduction
In this chapter, the most commonly used algorithms will be introduced. The
following algorithms are the base for many projects that implement Big Data
processing. The first algorithm that will be discussed is MapReduce, and the second
featured algorithm is Bulk Synchronous Parallel. It will be explained how they work
and how they can be used with regard to Big Data stream processing. Secondly,
frameworks that implement the MapReduce approach, as well as supporting
frameworks that are needed when working with Big Data, are introduced, among
them, for example, Apache Hadoop and Apache Spark.

5.2 MapReduce
In 2004, Google presented its algorithm for data processing on large clusters,
MapReduce. At that time, not many companies had as much data as Google, which
had massive amounts of data that had to be processed, such as crawled
documents or web request logs. The challenge was how to do computations on a
large amount of input data across thousands of machines. As parallel computing
and distributed data are challenging to handle, Google created an abstraction that
hides all the complexity of parallelisation, fault-tolerance, load balancing and data
distribution in a library. The MapReduce library was inspired by the primitives map
and reduce as they are known from functional programming languages such as
Lisp. Using these abstractions, it is easy to process lists of values that are too large
to fit in the memory of a single machine. Example applications that can easily be
written as MapReduce programs are inverted indices, distributed sorting and
distributed grep. (Jeffrey & Sanjay, 2004)
An example of a distributed word count and how the different phases influence the
data can be found in Figure 5.
Figure 5 - Example of how a word count application would work using the MapReduce
programming paradigm (Pattamsetti, 2017)

5.2.1 Programming Model


The MapReduce library provides two interfaces that have to be implemented by the
user, Map and Reduce. The Map function takes an input tuple with two values and
creates one or multiple intermediate key/value pairs. In the next step, the library
groups the pairs by key and then passes them to the Reduce function. The
Reduce function merges all values that belong to the same key and creates zero or
one output value. (Jeffrey & Sanjay, 2004)

map(k1, v1) → list(k2, v2)

reduce(k2, list(v2)) → list(v2)

Equation 3 - Interface and return value of the map and reduce functions
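To make the programming model concrete, the following sketch shows the map and
reduce functions of the word-count example from Figure 5, written against Hadoop's
MapReduce API; the surrounding job configuration and driver class are omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (offset, line of text) -> list of (word, 1)
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit an intermediate key/value pair per word
        }
    }
}

// Reduce: (word, list of counts) -> (word, total count)
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // merge all values that the library grouped under this key
        }
        context.write(key, new IntWritable(sum));
    }
}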

5.2.2 MapReduce for real-time processing


Even though the MapReduce framework has been known and improved since
Google first published it in 2004, there are still challenges that need to be solved.
Grolinger et al. (2014) described a few unsolved challenges that still need to be
considered when implementing a system that utilises the MapReduce approach.
The authors grouped the challenges into four categories: data storage, analytics,
online processing, and privacy and security. All four categories contain challenges
and solution approaches. For the scope of this thesis, the focus will be on the online
processing category.
Online processing is a synonym for real-time or stream processing, which is defined
by one of the Vs of Big Data, Velocity. According to Grolinger et al. (2014),
MapReduce is not an appropriate solution for stream processing for multiple
reasons:

• MapReduce computations work on batches rather than data streams
• MapReduce computations are snapshots of data that is stored in files, as
opposed to data streams where new data is generated continuously
• File operations add latency
• Some computations cannot be efficiently expressed using the MapReduce
programming paradigm

Even with all the limitations mentioned above, there is still much work being done
to make data stream processing possible using MapReduce (Grolinger, et al.,
2014). Other projects, such as the improved Hadoop MapReduce framework that
was implemented by Condie, et al. (2010), tried to overcome the limitations that are
imposed by batch processing. The authors extended the MapReduce programming
model by pipelining data between the two operators, which allowed data to be
delivered more promptly to the operators and therefore reduced the response time.
Cheng-Zhang, Ze-Jun, Xiao-Bin, & Zhi-Ke (2012) discussed the viability of the
MapReduce approach and concluded that the standard approach, as implemented
by most frameworks and, for example, Apache Hadoop, is not suitable for real-time
data processing. The first issue that the authors identify concerns dynamically
generated data. For a typical job run on Hadoop, all the data has to be present on
the Hadoop Distributed File System (HDFS); in the case of real-time analytics, the
data is generated on the fly by external systems. The second issue is that only a
small part of all the data needs to be analysed. For real-time analytics, the data is
time correlative; this means that data with the same key should not be aggregated
if the correlation of the timestamps is not given. The authors' solution to implement
real-time analytics using MapReduce included two big adaptations to the Hadoop
framework. First, they modified the programming model to exclude the shuffle and
sort phase and to push intermediate data, including a timestamp, to the reduce
function. The second change was to move from HDFS to a more appropriate
persistent data store that can handle the key-value pairs more efficiently. The
authors decided to use Cassandra for this task (Cheng-Zhang, Ze-Jun, Xiao-Bin,
& Zhi-Ke, 2012). The observation that the sort and merge phases are the most
severe problems during real-time analytics is also verified by Li, Mazur, Diao,
McGregor, & Shenoy (2011). In their paper, the authors describe two key
mechanisms that were implemented into MapReduce. The first one replaces the
sort-merge mechanism with a hash-based framework; this removes the blocking
nature of the algorithm and brings benefits in terms of computation and I/O
performance. The second measure to make MapReduce on Hadoop viable for real-
time processing tackles the problem of expensive I/O operations. By using a
technique that stores frequently used keys in memory and thereby minimises disk
operations, the reduce step can keep up with the map operation. Using these two
optimisations, the authors could return results earlier and reduce internal data spills
(Li, Mazur, Diao, McGregor, & Shenoy, 2011).
Another approach, used by multiple projects such as Twitter’s Storm and Yahoo’s
S4, was to abandon the MapReduce programming paradigm but still use the same
runtime platform and adopt event-by-event processing. Another alternative to
classic MapReduce is Apache Spark Streaming; this data stream processing
framework works with small batches and does all the computation on these batches
(Grolinger, et al., 2014).
5.3 Bulk Synchronous Parallel
Other than MapReduce, there is also Bulk Synchronous Parallel Computing
(BSPC), which was first described in 1990. Bulk Synchronous Parallel is neither a
programming model nor a hardware model; it lies in between. The Bulk
Synchronous Parallel model can be defined as a combination of three attributes:

1. Multiple components that all perform some kind of processing and memory
functions.
2. A router that takes care of the message handling by distributing messages
between the components.
3. The synchronisation of all the components at regular intervals.

The computation is defined as a sequence of supersteps, where each superstep
means that all the components are assigned tasks that consist of processing work,
sending messages and consuming messages that were sent by other components.
After a specific period of time, a global check is made to determine whether the
components have completed their tasks and, therefore, the superstep is finished. If
the superstep has finished, the next superstep will be executed (Valiant, 1990). This
synchronisation step can be seen in Figure 6.

Google created a framework, named Pregel, that uses the BSP processing
model. Pregel itself is not used anymore, but it was the inspiration for multiple open-
source projects that are still developed today, for example, Apache Hama and
Apache Giraph. In the typical BSP model, computations are done by executing
multiple supersteps after each other, and in every superstep, a user-defined
function is executed on every item from the dataset. In the newer implementations
like Pregel and Apache Hama, every agent computation has a graph representation
in BSP that consists of an identifier for the node, its value and state, and all the
outgoing edges; together these form a vertex. Before the computation, all the
vertices are loaded into the local memory of the machines and stay there for the
entire computation, which has the advantage that all computations are done using
local memory. As already described in the explanation of the BSP model, a vertex
consumes messages from other vertices and executes the user-defined function.
In contrast to the MapReduce model, only one function, the compute() function, is
defined. Executing this function, the vertex performs local computations and
produces messages for its neighbours in the graph. After the vertex is finished, it
waits for all the other vertices to finish. (Kajdenowicz, Indyk, Kazienko, & Kubul,
2012)
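To illustrate the vertex-centric model, the following schematic Java sketch shows
what a single superstep of such a compute() function could look like; it is
deliberately not tied to a concrete framework like Pregel or Apache Hama, and all
types and names are illustrative.

import java.util.List;

// Minimal abstraction of the router that delivers messages between components.
interface MessageSender {
    void send(long targetVertexId, double message);
}

// Illustrative vertex of a BSP graph computation.
class Vertex {
    long id;             // unique node identifier
    double value;        // mutable vertex state
    List<Long> outEdges; // identifiers of neighbouring vertices

    // One superstep: consume messages, update local state, send messages.
    void compute(List<Double> messages, MessageSender sender) {
        double sum = 0.0;
        for (double m : messages) {
            sum += m; // aggregate everything received in the previous superstep
        }
        value = sum;

        // Messages produced here are delivered in the next superstep.
        for (long neighbour : outEdges) {
            sender.send(neighbour, value / outEdges.size());
        }
        // After compute() returns, the vertex waits at the global synchronisation barrier.
    }
}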

Figure 6 - Scheduling and synchronisation of a superstep in the bulk synchronous parallel
model (Okada, Amaris, & Goldman, 2015)

Google identified rather quickly that Bulk Synchronous Parallel is a good fit for
graph algorithmic problems, and therefore Pregel and the systems that followed it
fully adopted it for these capabilities. According to Kajdenowicz, Indyk, Kazienko, &
Kubul (2012), the bulk synchronous parallel approach significantly outperforms the
MapReduce approach when tackling graph algorithmic problems.

In the context of stream processing or real-time processing, BSP is only mentioned
in combination with Apache Hama, which makes it possible to run real-time stream
analysis using the BSP model. Jungblut (2011) suggests an implementation using
Apache Hama to enable near real-time complex stream analysis. According to
Jungblut (2011), using Apache Hama instead of a pure MapReduce job on Apache
Hadoop has multiple advantages. First, using MapReduce, one would need to poll
a data stream to get new data records and save them to the HDFS, in the context
of Apache Hadoop, so that multiple jobs can access the data. The reason, as
already mentioned in the preceding chapter, is that all the data used for processing
needs to be available on the HDFS. Further, Hadoop involves overhead for job
scheduling and the setup of the jobs, which adds more latency to the real-time
analysis. Using Apache Hama and its messaging capabilities, there could be
dedicated jobs that act as consumers for the data stream and publish the data for
others to process (Jungblut, 2011). Ultimately, Apache Hama, and therefore BSP,
has mostly been used for graph algorithmic computations, as other frameworks are
better suited for stream processing.

5.4 Apache Hadoop


Apache Hadoop is an open-source project, developed by the Apache Software
Foundation, that provides software utilities for reliable, scalable, distributed
computing. The framework includes five modules that enable distributed processing
of large data sets across a cluster of computers using multiple programming
models, including MapReduce. Apache Hadoop was designed to work both on a
single server and on a cluster of thousands of machines, all using their local
memory and computation power. Failure is a constant and therefore has to be dealt
with; Hadoop handles failures at the application layer, which makes it possible to
run a highly available system on top of a cluster of computers (Apache Software
Foundation, 2020). The Apache Hadoop ecosystem includes multiple Apache
projects, some of which will be discussed in the following chapters, but the
open-source project itself also consists of five sub-modules:

• Hadoop Common provides utilities that are used by the other modules.
• Hadoop Distributed File System (HDFS) is the underlying distributed file
system.
• Hadoop YARN is a framework for job scheduling and cluster resource
management. It was introduced in Hadoop 2 to replace the previously used
MapReduce engine and thereby decouple the programming model from the
resource management infrastructure.
• Hadoop MapReduce is a YARN-based implementation of the MapReduce
programming model used for parallel processing.
• Hadoop Ozone is a highly scalable, redundant, distributed object-store.

Originally, Apache Hadoop was one of many open-source projects that implemented
the MapReduce programming model and focused on tasks like web crawls. The
architecture was designed for precisely this one use case and focused on strong
fault tolerance for large and data-intensive computations. Soon it became the norm
for companies to save their data in the HDFS, as it was easy for developers and
data scientists to access the data instantaneously. Hadoop got much attention, the
community grew, and developers started to misuse the cluster management for
more than just MapReduce jobs. Therefore, Apache Hadoop released YARN to
tackle the shortcomings of Hadoop 1. Hadoop YARN implements a new
architecture that decouples the programming model from the resource
management infrastructure. This means that MapReduce is now only one of many
frameworks that can be executed on top of Apache Hadoop; other programming
frameworks include Apache Spark, Apache Storm and Dryad (Vavilapalli, et al.,
2013).

The already mentioned Hadoop Distributed File System is a crucial component of
Hadoop. It is a distributed filesystem that is designed to run on inexpensive
commodity hardware. It is fault-tolerant by design and provides high-throughput
streaming data access. To make this possible, a few POSIX semantics were
sacrificed. For example, to enable high throughput, applications operating on the
HDFS should support a write-once-read-many access model. This means that a file
is written to the storage once but not changed anymore after that, because new
data is only appended. As the typical applications that run on Hadoop work with
data of considerable volume, and files can easily reach gigabytes to terabytes,
HDFS is designed to support large files. As a Hadoop cluster can easily span
hundreds of nodes, the HDFS can store large files across machines within the
cluster by splitting them into equally sized blocks. This behaviour can be recognised
in Figure 7 when looking at the green blocks that are present on the DataNodes.
For fault-tolerance, all saved blocks are replicated to multiple machines (Apache
Software Foundation, 2020). HDFS uses a master/slave architecture and consists
of two types of nodes:

• NameNode: At least one NameNode acts as a master server that manages
the file system namespace and regulates access to the files. A cluster needs
at least one NameNode, but because of fault-tolerance concerns, there
should be backups.
• DataNode: Multiple DataNodes, usually one per compute node in the
Hadoop cluster. DataNodes manage the storage attached to the nodes,
store the blocks of the files and therefore handle the read and write requests
from the clients.

Figure 7 - Architecture of the Hadoop Distributed File System (Apache Software
Foundation, 2020)

5.4.1 AWS EMR


Amazon EMR is a cloud-native Big Data platform that provides processing for large
amounts of data. Using EMR, it is possible to quickly set up a cluster of services
that use open-source tools such as Apache Hadoop, Apache Spark, Apache Hive
or Apache Flink. The advantage of this service is that users do not have to set
up the cluster by themselves and can profit from the usual scalability of the cloud.
EMR uses the dynamic scalability of EC2 instances and the unlimited storage of S3
to make it easy for teams to set up clusters quickly. As EMR can be run on EC2
spot instances, the costs for large data processing jobs can be very low. More about
EMR and how to set up such a cluster, including Hadoop and Spark, will be
discussed in Chapter 7.5.

5.5 Apache Spark
Apache Spark, initially developed by a group at the University of California, Berkeley,
and later donated to the Apache Software Foundation, unifies multiple specialised
engines into one distributed data processing engine. This has the advantage that
one API can be used for multiple jobs. Most data processing pipelines need to do
multiple things, like MapReduce and SQL queries. Before Spark, it was not possible
to do MapReduce, SQL, streaming and machine learning with only one engine,
which significantly increased the implementation and maintenance costs while
lowering the performance. While Spark unifies all these engines, it can still provide
on-par or even better performance for most jobs than specialised engines. Spark
can be operated on multiple platforms. It can run as a standalone installation but
also on Hadoop YARN, Mesos or Kubernetes, which makes it particularly easy for
most companies to get started with Spark, as most of the setup is already there.

Apache Spark implements a data-sharing abstraction called “Resilient Distributed
Dataset”, in short RDD. Using RDDs, Spark can unify all the mentioned workloads in
one API and therefore makes it very easy to process most Big Data workloads.
RDDs are fault-tolerant collections of elements that can be operated on in parallel,
as they are partitioned across the cluster (Apache Software Foundation, 2020).
Multiple operations, so-called transformations, can be executed on an RDD.
Transformations include, for example, map, filter and groupBy. When executing
these transformations on the data, a new RDD is created. After processing, actions
can be executed upon the final RDD, which compute a result; an example would be
the “count” action that gets the number of elements in an RDD. Spark evaluates
RDDs lazily; this means that each transformation returns a new RDD, but the
computation is done later. The advantage of that approach is that Spark waits until
an action is called and only then looks at the whole list of applied transformations.
The graph of transformations is then used to create an execution plan (Zaharia, et
al., 2016).
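A small sketch of this behaviour using Spark's Java API follows; the input path is an
assumption, and no computation happens until the count action is invoked.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Transformations only extend the lineage graph; nothing is executed yet
            JavaRDD<String> lines = sc.textFile("hdfs:///data/events.txt"); // assumed path
            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

            // The action triggers the actual distributed computation
            long errorCount = errors.count();
            System.out.println("Errors: " + errorCount);
        }
    }
}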

RDDs are not only used for data sharing between the cluster nodes but also for
fault-tolerance. RDDs track the graph of all transformations that have been applied
to the data. Therefore, Spark can rerun the needed transformations in case of lost
partitions. This fault-tolerance model is called “lineage” (Zaharia, et al., 2016).

Other than the Spark core, four high-level libraries are used for operations that
would usually be run on separate, specialised computing engines. These libraries
make use of the RDD programming model to implement the execution techniques
of these specialised engines (Zaharia, et al., 2016). Some of them will be used in
the implementation described in Chapter 7.5.
Spark SQL implements one of the most common data processing concepts,
relational queries. By mirroring the data layout of analytical databases, columnar
storage, inside the RDDs, simple SQL statements can be used to query the data.
Other than that, there are also abstractions for RDDs that contain data with a known
schema and resemble database tables, so-called DataFrames.
Spark Streaming is used to implement streaming over Spark. For that, it uses a
model called “discretised streams”: the input data is split into smaller batches that
are processed, rather than processing each element on its own (a short sketch
follows below).

GraphX provides graph computation capabilities, similar to systems like GraphLab
and Pregel.
MLlib is a collection of over 50 common machine learning algorithms for distributed
model training.
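As Spark Streaming will be used in the implementation, the following minimal
sketch illustrates the discretised-stream model mentioned above; it groups text
received from a socket source into two-second micro-batches and prints the record
count of each batch, with host and port being assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]");
        // Every two seconds, the received data forms one small batch (one RDD)
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999); // assumed source
        lines.count().print(); // number of records in each micro-batch

        jssc.start();
        jssc.awaitTermination();
    }
}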

When comparing the performance of Apache Spark with its most used competitors,
the results are mostly dependent on the executed jobs and the nature of the
workload (Zaharia, et al., 2016). In comparison to Apache Hadoop, it is clear that
Spark provides far better performance. For MapReduce workloads, it is up to 100
times faster for in-memory processing and still ten times faster on disk. A further
comparison found that Spark could process a 100 TB workload three times faster
than Hadoop with only one-tenth of the machines. One exception can be found
when Spark runs on Hadoop YARN: because of the memory overhead, Hadoop is
more efficient in this case (Karlon, 2020). With regard to machine learning, Spark
can again provide better results than, for example, MapReduce. According to
Gopalani & Arora (2015), the processing time for the K-Means algorithm decreased
by up to a factor of three in comparison to the processing time using MapReduce.
A comparison regarding stream processing can be found in Chapter 5.8.

5.6 Apache Flink


Apache Flink is an open-source system of the Apache Software Foundation that
supports both stream processing and static batch data processing. In contrast to
Apache Spark, where an additional module introduces streaming, Apache Flink
treats streaming as a first-class citizen. Apache Flink is built for stream processing
and stream analytics but supports batch processing through a separate API. In
combination with durable message queues or streaming platforms like Apache
Kafka and Amazon Kinesis, Apache Flink makes no distinction between events
processed in real-time and a large number of replayed historical events. Apache
Flink was created to address the shortcomings of the existing approaches to
combining batch and stream processing. Most traditional systems either support
only one of the two or tried to add the other capability later into an already existing
system. Architectures that try to accomplish both, like the Lambda architecture,
suffer from high complexity because of the orchestration of multiple different
products into one functioning system. To support both worlds of fast streaming data
and static batch data, Flink implements two APIs, respectively named DataStream
and DataSet. Even though there are two APIs, both are eventually compiled down
to a common representation, the dataflow graph. A dataflow graph is a directed
acyclic graph that consists of stateful operators and data streams that are produced
by one operator and can be consumed by succeeding operators (Carbone, et al.,
2016).

Stream analytics is implemented on top of Flink's runtime engine using the
DataStream API. For this, Flink implements multiple concepts for managing time in
data streaming. It can distinguish between the event-time, the time at which the
event originates at the source, and the processing time, which can be seen as the
time of the machine where the event is processed. To help the execution engine
process the events in the correct order, so-called watermarks are used. These
watermarks are special events inserted by the system that mark a global progress
measure. Flink supports different kinds of windows for stream processing. Each
window consists of three components (Carbone, et al., 2016):
1. Window assigner: this function assigns each event to a logical window. This
decision can be based on multiple parameters.
2. Trigger: this optional parameter decides when to perform the actions that are
associated with a window. An example of a trigger would be that the action
is executed once 1000 elements are reached.
3. Evictor: the second optional parameter, which decides which records should
be kept in each window for the next window.

Aside from the custom windows that can be built using these parameters, there are
a few predefined windows. The first is called a tumbling window. In a tumbling
window, each value of a stream is only present in one window, as shown in Figure
8.

Figure 8 - Example of a stream partitioned using a tumbling window

The second predefined window is a so-called sliding window. Figure 9 shows an
example of a sliding window; here, a value can be found in multiple windows, as
the windows overlap.

Figure 9 - Example of a stream partitioned with a sliding window
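The following sketch shows how such a tumbling window could be expressed with
the DataStream API; it keys a stream of (key, value) pairs and sums the values per
ten-second processing-time window, with the example elements and names being
assumptions.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source; in practice this would be a Kafka or Kinesis connector
        DataStream<Tuple2<String, Integer>> events = env.fromElements(
                Tuple2.of("driver-42", 3), Tuple2.of("driver-42", 5), Tuple2.of("driver-7", 1));

        events.keyBy(event -> event.f0)                                    // partition by key
              .window(TumblingProcessingTimeWindows.of(Time.seconds(10))) // non-overlapping windows
              .sum(1)                                                     // aggregate the value field
              .print();

        env.execute("window-sketch");
    }
}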

5.7 Apache Storm


Apache Storm originated as an idea of Nathan Marz in 2010. The resulting open-
source, distributed, real-time computation project was later donated to the Apache
Software Foundation. Apache Storm focuses on the reliable processing of
unbounded data streams. It offers low latency and is easy to use in a highly scalable
environment. Developers can write Storm applications in any programming
language (Iqbal & Soomro, 2015). Storm applications that run on the Storm cluster
are called topologies. A topology is a graph of computations, and each node in the
graph contains processing logic. The edges of the graph represent the way the data
is passed around between the nodes. Each node is either a spout or a bolt. In
Storm, everything works with a stream, which is defined as “an unbounded
sequence of tuples”. Spouts and bolts are the primitives that Storm provides to do
stream transformations. A spout is the source of a stream, which will mostly read
tuples from an external data source and emit them into the topology. Example
spouts include Kinesis and Kafka. Bolts, on the other hand, do all the processing in
the topology. A bolt consumes any number of streams, does some computation and
processing, for example, filtering, aggregations, joins, reading/writing data from a
database, and emits 0..n new streams. To illustrate the structure of a topology,
Figure 10 shows how spouts and bolts interact. All nodes in a topology run in
parallel in the Storm cluster, and the whole topology runs forever, or until it is killed
by the user. Apache Storm is fault-tolerant and guarantees no data loss, even if
machines are killed or fail because of arbitrary events. This is done by tracking all
the tuples that are emitted by every spout; if Storm detects that a tuple has not been
completed within a specified timeout, it will fail only this one tuple and retry it later.
According to a benchmark, Apache Storm can process up to one million tuples per
second per node, which makes it one of the fastest stream processing systems
(Apache Software Foundation, 2020). More information and a performance
comparison with other data stream processing frameworks can be found in the next
chapter.

Figure 10 - Example of a Storm topology that shows the link between spouts and bolts
(Apache Software Foundation, 2020)
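A minimal sketch of wiring such a topology with Storm's Java API could look as
follows, assuming Storm 2.x; TestWordSpout is a demo spout shipped with Storm,
and the bolt simply prints each tuple it receives.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class TopologySketch {

    // Terminal bolt that logs incoming tuples and emits no further streams.
    public static class PrinterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println("received: " + input.getString(0));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // no output streams declared
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-spout", new TestWordSpout(), 2);   // spout: source of the stream
        builder.setBolt("printer-bolt", new PrinterBolt(), 4)
               .shuffleGrouping("word-spout");                    // distribute tuples randomly

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("sketch-topology", new Config(), builder.createTopology());
            Thread.sleep(10_000); // let the topology run briefly before shutdown
        }
    }
}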

5.8 Comparison: Spark vs Flink vs Storm


When comparing Spark, Flink and Storm, the first and most notable difference is
that Flink and Storm offer real-time stream processing, as they are built as stream
processing computation engines. In contrast, Spark only offers near real-time
computation, since Spark Streaming is built on top of a batch processing framework.
There are two types of streaming frameworks; Storm and Flink belong to the Native
Streaming implementations. Native Streaming means that every incoming record is
processed as soon as it arrives and does not wait for other records. The second
type of streaming framework implementation is called Micro-batching or Fast
Batching, which means that the incoming records are put into batches that are
processed every few seconds as a single mini batch. Therefore, the latency is
higher in comparison to Native Streaming (Prakash, 2018).

Another comparison can be made in regard to performance. In this case, Spark
Streaming processes the fewest messages per second; both Flink and Storm
process a lot more messages and can therefore achieve up to 15 times lower
response times. Still, in regard to fault-tolerance, Spark can recover from node
failures without losing any messages, as opposed to Flink and Storm, which both
have a higher message loss rate (Lopez, Lobato, & Duarte, 2015).

While Apache Storm provides excellent performance, some features such as
windowing, aggregation and watermarks are missing. Further, Storm has no
concept of state management and is reported to be a lot more challenging for
developers to handle than the other frameworks. Some see Apache Flink as a
successor of Apache Storm and the first “true” streaming framework. It provides a
rich API and all the features needed for high-throughput stream processing, with
the option of Flink Batch to also enable batch processing on top of the streaming
engine. As it is one of the newer projects in this area, Flink has a smaller community
than, for example, Spark. Nevertheless, it is still used by many big companies such
as Uber and Alibaba. Spark Streaming is very easy to integrate when Spark
knowledge and infrastructure are already in place, and it is one of the most used
frameworks, with a big community. Spark was also the first framework that fully
supported the implementation of the Lambda architecture, mentioned in Chapter
3.1. In terms of performance, Spark Streaming is slightly behind Flink and is also
missing a few advanced features (Prakash, 2018).

6. STORAGE
6.1 Introduction
In the following chapter, different forms of storage will be discussed. All solutions
have to comply with the requirements of Big Data, namely the volume, velocity and
variety that are inherent when working with Big Data. To address the challenges
that come with the high volume of the data, most suitable storage solutions make
use of a distributed, shared-nothing architecture. Some solutions, like Apache
Cassandra, can scale very well horizontally without any hassle, simply by adding
more servers to the cluster. The storage solutions have to cope with the high
velocity and still maintain low latency for queries, even with a high rate of incoming
data. Further, it needs to be possible to store data that comes from a lot of different
sources and is not always structured, hence the variety of Big Data (Strohbach,
Daubert, Ravkin, & Lischka, 2016). Different storage solutions will be discussed,
beginning with the NoSQL storage implementations, followed by NewSQL
databases, distributed file systems, Big Data querying platforms and an evaluation
of current cloud storage services. Finally, there will be a description of Data Lakes
in regard to Big Data and real-time processing.

6.2 NoSQL Databases


When working with Big Data, mostly NoSQL storage solutions are used. The reason
why relational databases are not represented very well in the Big Data area is not
that they cannot satisfy the requirements; rather, alternative storage technologies,
such as columnar stores or document stores, are often more efficient and less
expensive, and therefore better suited for most Big Data architectures (Marz &
Warren, 2015). NoSQL means “not only SQL”; these databases are an alternative
to the standard relational databases. They do not always follow the transactional
ACID properties: atomicity, consistency, isolation and durability. Most NoSQL
databases are specialized for one use-case and follow the BASE properties:
basically available, soft state, eventual consistency. There are four main categories
of NoSQL databases that can be distinguished: Key-Value Stores, Columnar
Stores, Document Databases and Graph Databases (Strohbach, Daubert, Ravkin,
& Lischka, 2016).

Key-Value Stores are the simplest form of a NoSQL database that can be found.
However, they are very efficient at what they do and can often provide single-digit
millisecond latency for queries. The data in a key-value store is stored in a
schema-less way and most of the time consists of strings, but other objects are also
supported. Each entry consists of a string that functions as the key and the actual
saved data. The keys are used as indexes, and the basic data model can be
imagined like a map or dictionary. Most querying features that can be found in
relational databases, such as joins and aggregation operations, are sacrificed in
key-value stores for the sake of high scalability and fast lookups. Examples of key-
value databases are Redis and Amazon DynamoDB, which will be discussed more
thoroughly in Chapter 6.4 (Nayak, Poriya, & Poojary, 2013).

Column-Oriented Databases are hybrid row/column stores, which means that the
database does not store the data in row-oriented tables but rather column by
column in a massively distributed architecture. As the data is stored in columns
rather than rows, it can easily be aggregated with less I/O activity than would be
needed in a relational database. This is achieved by saving the data for each
column contiguously on disk or in memory, which brings performance benefits when
running data mining or analytical queries. Examples of columnar databases include
Google’s BigTable, a high-performance, fault-tolerant, consistent and persistent
database that is used for many of the Google products such as YouTube or Gmail.
Unfortunately, BigTable is not available outside of Google but only usable together
with the Google App Engine. The second database that is worth mentioning is
Apache Cassandra. It is developed by the Apache Software Foundation and is
based on the principles of Amazon DynamoDB and Google BigTable. Therefore, it
includes the concepts of both key-value stores and columnar stores. It includes
features such as partition tolerance, persistence and high availability and is used in
various applications, ranging from social media networks to banking and finance
applications (Nayak, Poriya, & Poojary, 2013).

Document Store Databases store, as the name suggests, documents. The stored
documents are somewhat similar to records in relational databases, but the most
significant difference is that they do not follow a predefined schema. The documents
are mostly saved in standard formats such as JSON or XML. The documents within
a document store can be similar or completely different; the database does not
care. Every document can be accessed using a unique key that is used to identify
and find the document. Besides the key, most document-oriented databases also
support some kind of query language that can be used to search for documents
with specific features. This is the point where they differ from key-value stores: in
a key-value store, the values are complete black boxes, in contrast to a document
store, which knows and saves metadata of the stored documents.
Further, document-oriented databases also support relationships between
documents. The best-known example is MongoDB, a highly performant, efficient
and fault-tolerant document-oriented database. MongoDB stores data in JSON-like
documents and provides a powerful query language, indexing and real-time
aggregation. Further examples include Amazon DocumentDB, a cloud-based
solution that offers compatibility with MongoDB, or popular search engines like
Elasticsearch that fit the definition of a document-oriented database (Nayak,
Poriya, & Poojary, 2013).

Graph databases store the data in the form of a graph. A graph consists of nodes
and edges. Nodes are the saved objects, and they are connected through edges,
which represent relationships. Not only the nodes can have properties, but also the
edges. Using graph databases, it is easy to traverse complex hierarchies of data,
as the main emphasis lies on the connections between the data nodes. Further, the
graph can contain semi-structured data, which makes it a lot more flexible than
relational databases. Most graph databases are ACID compliant and offer
functionality to roll back transactions. Neo4j is the most prominent representative
of graph databases. It uses its own query language, Cypher, to traverse the graph
through the REST API. It has an open-source, free-to-use community edition, but
also provides licenses and support for enterprise-grade deployments. It is a highly
available, ACID-compliant graph database. Other notable products include
RedisGraph and SAP HANA (Nayak, Poriya, & Poojary, 2013).

NoSQL databases are mostly compared to relational databases, as these are still
the most widely used databases. One of the significant advantages nowadays, and
a crucial factor in the usability for Big Data projects, is that NoSQL databases can
quickly scale for massive data volumes and still provide low latency where relational
databases are overwhelmed. There are multiple different databases in the NoSQL
world, and each has its area in which it excels. Further, NoSQL technology has
evolved rapidly over the last years, and the community grows steadily. Still, there
are a few disadvantages, because some products are brand new and still immature.
A further problem that hinders growth is that there is no standard query language,
such as SQL, for NoSQL databases (Nayak, Poriya, & Poojary, 2013).

6.2.1 Elasticsearch
“Elasticsearch is a distributed, open-source search and analytics engine for all
types of data, including textual, numerical, geospatial, structured, and unstructured”
(Elasticsearch, 2020).

Elasticsearch, easily the most popular open-source search engine, will be used as
the data store for the implementation described in Chapter 7, and therefore its core
concepts will be elaborated. Elasticsearch is built on top of Apache Lucene and best
known for its scalability and speed. In the context of NoSQL, Elasticsearch can be
categorized as a document-oriented database, as it saves the data in an index as
JSON documents. Using inverted indices that are built upon data ingestion,
Elasticsearch can search for data in the documents in near real-time. The search
engine comes with a management tool named Kibana. Kibana offers a user
interface that allows the user to execute searches, administrate the cluster and
build graphical representations of the data. Because of its capability to aggregate
and display the data within the Elasticsearch cluster, it is often used to create
dashboards that provide a real-time view of the data (Elasticsearch, 2020).
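As a small illustration, the following sketch indexes one JSON-like document into
an assumed risk-scores index using the Elasticsearch high-level REST client for
Java (7.x); the host, index name and fields are assumptions.

import java.util.Map;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class IndexSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) { // assumed host

            // Hypothetical risk score document; the mapping is derived on first ingestion
            IndexRequest request = new IndexRequest("risk-scores")
                    .source(Map.of("driverId", "42", "score", 0.73, "timestamp", 1590000000L));

            client.index(request, RequestOptions.DEFAULT);
        }
    }
}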

6.3 NewSQL Databases


NewSQL databases are a form of modern relational databases that seek to provide
the same scalability and performance as NoSQL databases while still maintaining
the ACID guarantees for transactional workloads required by traditional database
systems. The term NewSQL is not always used for the right kind of databases, but
in recent years, the industry came to a consensus on what a NewSQL database
has to provide to be called that way.
NewSQL databases have to:
• Be able to execute thousands of short-lived read-write transactions
• Touch only small subsets of data using index lookups, without requiring full
table scans
• Use a lock-free concurrency control scheme
• Implement a scale-out, shared-nothing architecture capable of running on
hundreds of nodes
• Support the ACID properties

NewSQL databases can be divided into three categories: firstly, systems built with
a completely new architecture; secondly, middleware that re-implements existing
sharding infrastructure; and thirdly, database-as-a-service offerings. Some
interpretations also include different storage engines and extensions for single-node
DBMSs, such as ScaleDB instead of InnoDB for MySQL or Microsoft’s Hekaton
OLTP engine for SQL Server, but according to Pavlo & Aslett (2016), such systems
are not representative of NewSQL systems.
The most promising NewSQL systems are built from scratch using a new
architecture rather than on top of an existing DBMS, which enables them to start
with a new code base without any of the restrictions of a legacy system. This means
that scalability can be built within the system rather than on top of it, for example,
using a distributed architecture that operates on shared-nothing resources and only
contains components that support multi-node execution. Query optimizers and
communication protocols can be established that are able to send intra-query data
directly from node to node instead of relying on a central component. Further, all
new NewSQL systems implement their own storage layer instead of relying on
existing distributed filesystems. The biggest disadvantage of these new databases
can be found in the small community, as bigger companies do not dare to bet on
small products.
The second category of NewSQL databases consists of products that make use of
the same kind of sharding middleware that was developed by companies like
Facebook and Google. This sharding technology makes it possible to split a single-
node DBMS onto multiple nodes, where every node only contains portions of the
database. A centralized component does the routing of the queries and the
coordination of transactions, as the data on each node cannot be accessed
independently. The biggest advantage of these sharding products is that they can
easily replace existing single-node databases.
The third and last category is mostly about cloud services, so-called database-as-
a-service offerings. The advantage of a DBaaS solution is that the cloud provider
manages the hardware and maintenance. This means that customers do not have
to think about hardware or any configuration concerning the availability of the
database service. The most notable example regarding cloud-based NewSQL
databases is Amazon Aurora, which is compatible with both MySQL and
PostgreSQL. It is built on log-structured storage, which improves the I/O
performance. (Pavlo & Aslett, 2016)

According to Pavlo & Aslett (2016), NewSQL databases mostly incorporate
techniques that have already been used by industry and academia for many years,
but instead of focusing on single approaches that were developed over time, they
combine multiple of these concepts into one single platform.

6.4 Cloud Storage


As the amount of storage that is needed for Big Data varies, it is ideal to just “rent”
this storage in the cloud. Cloud storage can have different forms; the two main
categories are object and block storage. Other than that, there are also cloud-based
NoSQL and relational database offerings, such as Amazon’s DynamoDB and RDS,
respectively (Strohbach, Daubert, Ravkin, & Lischka, 2016).
Object storage is a general term that describes how the data is organized and
stored in so-called objects. Every saved object consists of three parts. First, it
includes a globally unique identifier that can be used to find the object in the
distributed storage system. Further, it contains metadata that can be defined by
the person that saves the object. This metadata can include various things, such as
tags to describe the object, context information or access policies. The last part of
every object is the data itself. The data saved in an object can be anything, which
means that one object is not equivalent to one file on a hard drive. The object
contains bits and bytes that may or may not be related to each other (Porter de
León & Piscopo, 2014). An example of object storage would be Amazon S3, which
provides a data durability of eleven nines (99.999999999%). This means that,
statistically, S3 loses one object every 659,000 years; in regard to data availability
and reliability, no on-premise storage installation can match these numbers.
(Amazon Web Services, 2020)
The concept of block storage, on the other hand, is a bit older but still used in most
enterprise systems. In block storage, a file is split into evenly sized blocks of data,
and each block gets its own address, but no metadata is associated with the blocks.
In contrast to object storage, block storage can easily be mounted as a volume by
an operating system and does not require updating all blocks if a related one is
changed. (Porter de León & Piscopo, 2014)

6.5 Data Lake


A Data Lake stores disparate information while ignoring almost everything.
… the lake pays no attention to how or when its data will be used, governed,
defined or secured. (King, 2016)

With the amount of data that is produced, companies struggle to utilize its value.
Traditional Data Warehouse approaches, used for structured data that comes from
transactional systems and business applications, cannot handle the majority of this
data, because before it can be used within the Data Warehouse, the data needs to
be cleaned, transformed and enriched (Amazon Web Services, 2020). All data that
is produced by an organization will be stored in the Data Lake. To get insights from
the data, it is saved in the Data Lake without any pre-processing, in its original
format. Therefore, it contains structured, semi-structured and unstructured data.
The organization does not know the value of most of the data, but because the data
is in the Data Lake and available for everyone in the organization to access and
analyse, it is possible to create value based on the data (Khine & Wang, 2018).
Exemplary use cases and interactions with a Data Lake are shown in Figure 11.
Fang (2015) describes the following capabilities that a Data Lake should have:

• Capture and save the data at a low cost. First, it must be easy to get the data
into the lake efficiently without much processing. Secondly, the volume of
data in the lake scales infinitely; therefore, it is essential to have cost-efficient
storage that scales well.
• Store data of all types. Data Lakes must be able to store data in all formats,
disregarding whether it is structured data from a DBMS, semi-structured or
unstructured data, such as IoT sensor data.
• ETL and pre-processing. Once data is in the Data Lake, it should be possible
to do pre-processing and ETL transformations on the data, to make it easier
for other systems to work with the data.
• Schema on read. In contrast to a Data Warehouse, where the data schema
is fixed before the data is introduced into the database, the data in a Data
Lake is saved without any schema. In a Data Lake, complex and costly data
modelling has to be avoided, as having a schema that the data must adhere
to increases the data integration effort. The schema of the data has to be
defined once it is used.
• Enable analytics. It must be possible to develop specific analytic applications
to find value in the saved data.

Figure 11 - A Data Lake and possible surrounding systems that interact with the data
(Amazon Web Services, 2020)

According to Khine & Wang (2018), there are still two big concerns regarding Data
Lakes. The first concern is that Data Lakes are just another marketing hype, but
looking at the recent developments, Data Lakes have been implemented
successfully by many companies. According to a TDWI report (2017), 23% of the
respondents of the survey already have a Data Lake and use it in their production
environment. Another 24% were planning on using a Data Lake in production. With
the additional services that cloud providers offer to support Data Lakes, such as the
Azure Data Lake or AWS Lake Formation, and the increased amount of tooling that
was developed in recent years, it is safe to say that Data Lakes are far beyond
being only a marketing hype. The second concern is about creating a Data Swamp.
A Data Lake can transform very quickly into a data swamp where nobody knows
what data is put into it. If the veracity of the data cannot be ensured because nobody
knows what is in there, it is challenging to find corrupted data. Therefore, security
and compliance have to be a top constraint when creating a Data Lake. If companies
start their Data Lakes without sophisticated security measures, the data can easily
be compromised (Khine & Wang, 2018). These concerns lead to the following
challenges that have to be addressed when working with a Data Lake.

• Lack of data quality and documentation of findings in the data
• Oversight and governance
• Maintaining descriptive metadata
• Reanalysis of the data from scratch
• Performance can vary and is not guaranteed
• Security and access control

Data Lakes are a new tool to handle the volume and variety of Big Data and try to
tackle the problem of data silos that exist across organizations. They are no
replacement for Data Warehouses but can complete the data landscape of an
organization that makes use of them to gain unique insights through its data, which
brings a competitive advantage over its competitors. Tableau (2017) predicts that
in the future, Data Warehouses and Data Lakes may be combined into one concept
by enhancing each other's capabilities.

6.5.1 AWS Lake Formation


AWS announced the Lake Formation service at its 2018 re:Invent keynote
presentation. It supports customers in building a Data Lake using S3. The service
aims to reduce the overhead of setting up and managing a Data Lake by taking over
tasks such as loading data, defining transformation jobs, reorganizing data into
columnar formats and deduplication. It provides so-called Data Crawlers that get
data from user-specified data sources. These data sources include Amazon S3 as
well as relational and NoSQL databases. It collects the data and converts it into
formats such as Apache Parquet or ORC for faster analytical queries.
Further, it manages access control, governance and policies in a central place and
provides the user with a data catalogue that describes the different imported
datasets and grants access to specific users. The imported data is cleaned and
classified using machine learning algorithms. In the end, users can decide how they
want to access the data; supported ways include AWS EMR using Spark, Redshift,
or Amazon Athena to query the data from the S3 buckets. (Amazon Web Services,
2020)

7. IMPLEMENTATION & RESULTS


7.1 Introduction
In this chapter, two different implementations of a Big Data Streaming architecture
will be discussed and showcased. First, the reasoning, for choosing the two
architectures and their respective technologies are presented. Further, it will be
explained how the architectures are applied to the use case of real-time risk
analytics. Then a few points will be tackled that are the same for both
implementations, such as the data setup, the risk calculation model and how the
data is visualized. The first architecture that will be discussed uses Apache Spark
Streaming to process the data and runs in AWS EMR. The second solution makes
41
use of AWS Kinesis Analytics and Apache Flink to analyse the data stream. The
advantages and disadvantages of both architectures will be discussed, and further,
they will be compared in a cloud-based setup using the metrics that were discussed
in Chapter 1.3.

7.2 Chosen Architectures


Out of all the mentioned possibilities to implement the risk analysis application, two
specific scenarios were chosen. The decision on which architectures and
technologies to use is based on the review and comparison that was done in the
preceding chapters.

From an architectural standpoint, both the Lambda and the Kappa Architecture
include everything needed for an application that is capable of real-time analysis.
The Lambda Architecture also includes the batch view, which makes it possible to
query historical data and get more accurate results. However, the Lambda
Architecture also brings a lot of overhead, as more moving parts are involved, and
therefore more effort for orchestration is needed than for a Kappa Architecture. The
Kappa Architecture, on the other hand, includes only a stream processing layer and
a presentation layer, which is a lot less to handle and provides everything needed
for the real-time view of the risk-analysis data that the application needs to handle.
All in all, the Kappa Architecture has more advantages for this use-case and will
therefore be used for the risk-analysis application.

Both implemented applications will use a rather similar Kappa Architecture, but the
stream processing frameworks differ. The options for stream processing that were
reviewed in Chapter 5 include Apache Spark, Apache Flink and Apache Storm. All
three could be used to implement the stream processing portion of the architecture.
For the first implementation, Apache Spark with Spark Streaming was chosen, as
Spark is the leading Big Data processing platform and used by many companies in
production. If Spark is already used and the cluster is already set up, adding Spark
Streaming is trivial and therefore the cheapest and easiest route that most
companies would take when they need to add real-time analysis capabilities. Within
the implementation, AWS EMR can be used to set up a cluster where Spark can
run. For the second implementation, Apache Flink was chosen over Apache Storm.
Apache Flink is a newer, more flexible and highly performant implementation that
offers more features than Apache Storm. Further, using Amazon Kinesis Analytics,
the Apache Flink application can be run in a cloud-native way, without managing
the cluster. Apache Storm, on the other hand, cannot be run natively in AWS and
would require manual setup.
Using Apache Spark and Apache Flink makes it possible to compare an approach
that would also work in a self-hosted datacentre with the cloud-native alternative of
managed services.

For data ingestion, Apache Kafka and Amazon Kinesis are available as managed services in AWS, and both are supported by Apache Spark and Apache Flink, which means that either one is a good choice for the implementation. In this case, the decision fell in favour of Amazon Kinesis Data Streams, as the orchestration and configuration are easier than for Apache Kafka. Another decision criterion was the integration with Kinesis Analytics: Kinesis Analytics supports both Amazon MSK and Kinesis Data Streams, but the latter is easier to integrate.

The last part of the Kappa Architecture that is needed is the serving layer. Multiple possible datastores were discussed in Chapter 6. The serving layer does not consist of data storage alone; it also needs to support querying and data visualization, and some tools offer better support for this than others. Without building a custom UI or relying on third-party tools, many of the introduced databases are eliminated from the decision. Two possible scenarios for storage and visualization could be identified.
The first solution would use S3 as a data store and utilize Amazon Athena and Amazon Quicksight to query and visualize the data. S3 as a datastore makes sense, as it scales virtually infinitely and is supported by both Apache Flink and Apache Spark as a data sink. All the components, in this case, are cloud-native and easy to use in AWS.
The second solution uses Elasticsearch as a document store and Kibana to visualize the data. Again, both Apache Flink and Apache Spark support Elasticsearch natively, and AWS provides a service for a managed Elasticsearch cluster, which makes it easy to use.
As both scenarios are equally valid, the decision fell in favour of Elasticsearch and Kibana because prior knowledge of how to set them up and use them already existed.

A few other components, such as a small application that generates data, will be needed to implement and test the architectures thoroughly. All these additional components will be dockerized and run using AWS ECS. The docker containers will be run as AWS Fargate tasks, which removes the need to manage servers.

Briefly summarized, this means that both implementations will use some form of Kappa Architecture. For streaming, Amazon Kinesis Data Streams will be used. For storage and visualization, the decision fell on a managed Elasticsearch with integrated Kibana. The stream processing will be implemented once in Amazon Kinesis Analytics with Apache Flink and once using Amazon EMR and Apache Spark with Spark Streaming. For additional assisting services, AWS ECS with Fargate tasks will be used.

7.3 Implementation Concept


The idea is to analyse the risk on roads and display how risky it is to drive there. The risk, in this case, is assessed using the variance of the velocity of all data points that are located close together and the number of data points within these clusters. A high-level concept overview of the risk analysis can be found in Figure 12. The chosen risk calculation is based on the premise that the risk of a car accident, which the insurance has to pay for, is higher if cars constantly have to brake and accelerate, for example in a traffic jam or in the city. This premise is supported by the number of accidents that happened in Austria throughout the years: according to Statistik Austria (2020), 36.846 accidents happened in 2018 in Austria, and roughly two-thirds of them happened in residential areas. The variance of the velocity was chosen as a factor because if a car travels at a constant speed, and the variance of its velocity is therefore zero or near zero, the risk of a car accident is smaller. The risk calculation model used for the implementation is, of course, not a production-ready model, and a real risk model would have to take more factors into consideration. However, for the sake of testing possible solutions for real-time risk analysis, it is sufficient.

Figure 12 - High-level concept of the risk analysis application

7.4 Data Setup


The data that is streamed and analysed by the two implementations is simulated to appear as if sent from hundreds of different producers. As both applications read the data from a data stream, it does not matter whether the data is produced by hundreds of devices or just one.
First, a small Java project was created that uses the Kinesis Producer Library (KPL), which makes it easy to send data to a Kinesis Stream. This producer simulated basic movement at random coordinates and added a random velocity. The sent data has a simple data model consisting of four fields, sent as a comma-separated string: the first value is the current latitude, the second the current longitude, the third the current random velocity, and the last the timestamp of the entry.
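A single record might therefore look like this (the values are made up for illustration): 48.2082,16.3738,13.4,1589213456000.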
This small Java project helped to create sample data during development but is not useful for simulating thousands of devices all over Austria. Creating producers that behave like real cars and produce meaningful data is not an easy task: to simulate real data, the producers would have to stay on the roads and move along routes rather than through forests or buildings. Therefore, to create meaningful and representative data, another solution was needed.

The company SharedStreets created a tool called trip-simulator that generates simulated raw GPS telemetry. As raw GPS data is highly sensitive, it is difficult to obtain and work with. The application uses an OpenStreetMap extract as input and can generate data for hundreds of so-called agents. An agent can either be a car, a bike or a scooter. The generated data is fake but can still be used for algorithms that need to operate under real-world conditions (SharedStreets, 2019). The trip-simulator is a Node.js application.
As already mentioned, the implementation focuses only on Austria; therefore, the map data is limited to a bounding box around Austria. The original trip-simulator produces log files that can be used to display the generated trips, but these are only available after the execution. The data that is simulated for the streaming applications needs to be continuous. Therefore, the trip-simulator code had to be customized.

The KPL is only available for Java; for Node.js, AWS provides the AWS SDK, which was used to write data into a Kinesis Data Stream. Each agent in the simulation takes one step at a time, defined as a movement along its predefined route with a random velocity. One step of the simulation means that all agents take one step; these steps are collected and sent to the Kinesis Data Stream in a single AWS SDK call.

Simulation.prototype.step = async function () {
  this.time += this.stepSize;
  let records = [];
  // Advance every agent by one step and collect its current data point.
  for (let agent of this.agents) {
    await agent.step();
    records.push(mapToDataEntry(agent.id, agent.location));
  }
  // Send all collected data points to the stream in one batched call.
  await kinesis.putRecords({
    Records: records,
    StreamName: "kinesis-analytics-stream",
  }).promise().then(finish => {
    console.log(finish);
    records = [];
  }).catch(error => console.log(error));
};
Code Listing 1 - Simulation.step method that executes the steps for all agents and sends
the result to a Kinesis Data Stream using the AWS SDK

One trip-simulator process can run up to 500 agents, and each agent produces one data point per second. Therefore, to generate the load needed to benchmark the streaming applications, multiple instances have to be started. One instance needs at least 12 GB of RAM, as all routes available in the OpenStreetMap extract are loaded into memory. To make this possible, further customization was needed. The routes are loaded into a JavaScript Map, and at first, the application crashed because the provided Austrian map contained too much data: the maximum number of entries in a single Map is 2^24, which was not enough to load all the data into memory. To fix this issue, a custom BigMap was used that splits the records across multiple Maps.

Fortunately, trip-simulator already provides a Dockerfile that can be used to create a docker image that does all the map pre-processing and runs the application. With a few tweaks, it was possible to create a docker image that can be used to produce data for the streaming applications. As multiple instances are needed, each requiring at least 12 GB of RAM, it is not possible to run the applications on a local machine. Using AWS ECS with Fargate made it possible to run the needed docker containers in a serverless way. Figure 14 and Figure 20 show how the trip-simulator was used in the two implementations to produce the data points needed to test the streaming applications.

7.4.1 Spatial Subdivision


To be able to aggregate data points during processing, it is necessary to cluster the geospatial data. There are multiple algorithms to cluster data points; in the following, three options will be discussed.

The first algorithm is the most well-known clustering algorithm, K-Means. First, a number of groups is selected, and their centre points are initialized randomly. Then, for each data point, the distance to each group centre is computed, and the point is assigned to the closest group. At the end of each iteration, the group centres are recomputed by taking the mean of all vectors in the group. This is repeated for a fixed number of iterations. K-Means is easy to implement and fast at clustering the data points, as it has linear complexity O(n). The drawback of K-Means is that the user has to choose the number of clusters upfront (Seif, 2018).

The second algorithm that could be used for clustering is Mean-Shift. Mean-Shift uses a sliding window that tries to find areas with many data points. As a centroid-based algorithm, it tries to find the centre of a group of points. Each iteration updates the centre-point candidate to be the mean of the points within the sliding window. Candidate windows are then filtered in a post-processing step that eliminates near-duplicates and forms the final set of centre points. The advantage of this algorithm is that, in contrast to K-Means, there is no need to select the number of clusters upfront. The only difficulty, in this case, is choosing the radius of the sliding window (Seif, 2018).

The third solution is grid-based clustering. It uses the geohash concept, a hierarchical spatial data structure based on a latitude/longitude geocode system. The concept is straightforward: the space is divided into squares of a specific size, and the data points inside each square are grouped together. For more granular clustering, each square can simply be divided again and again, which results in a fine-grained grid system in which each square has a specific hash; an example of how a hash is built is shown in Figure 13. The advantage of this approach is its simplicity and the ease of setting the granularity of the clustering. The disadvantage is that it is not accurate in cases where points are located close to the borders of the grid cells (Amirkhanyan, Cheng, & Meinel, 2015).

Figure 13 - Explanation of how a geohash is built (PubNub, 2020)
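Both implementations described below assign each data point to such a grid cell by offsetting its coordinates against the minimum latitude/longitude of the bounding box and dividing by a resolution parameter. A minimal sketch of this cell assignment (the method name, bounding-box values and resolution are illustrative assumptions, not the actual thesis code):

// Maps a coordinate onto a grid cell index; "minimum" is the lower bound of
// the bounding box around Austria, "resolution" the cell edge length in degrees.
static int cellIndex(double coordinate, double minimum, double resolution) {
    return (int) Math.floor((coordinate - minimum) / resolution);
}

// Example for a point near Vienna, assuming a bounding box starting at (46.0, 9.0):
int cellx = cellIndex(48.2082, 46.0, 0.01);  // latitude  -> 220
int celly = cellIndex(16.3738, 9.0, 0.01);   // longitude -> 737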

7.5 Apache Spark Streaming


7.5.1 Architecture and Technology
To implement the first approach in AWS, the choice fell on AWS EMR because of the possibility to install Apache Spark and Hadoop on the EMR cluster. The Spark application was written in Scala, as the Scala API for the Spark libraries is more accessible and straightforward to use than the Java API.

For processing the data stream, the framework of choice is Spark Streaming, an extension to the core Spark API that helps to build highly scalable, fault-tolerant and high-throughput stream processing applications. It can consume data from many sources, such as AWS Kinesis or Apache Kafka.

AWS Kinesis was selected as the data source, and the results of the processing are saved into Elasticsearch. Figure 14 shows the architecture that was used. What stands out is that the Spark Streaming application does not communicate with the Elasticsearch cluster directly but uses the aws-es-proxy running as a Fargate task in ECS; why this is necessary will be explained in the next chapter.

Figure 14 - Architecture for the Spark Streaming implementation

7.5.2 Implementation
Apache Spark uses a concept named resilient distributed datasets (RDDs). An RDD is a fault-tolerant collection of elements that can be operated on in parallel. Spark Streaming offers a Discretized Stream (DStream), an abstraction that represents a continuous series of RDDs. Every RDD contains the data of a specific interval within the data stream.

A DStream is associated with a Receiver, an object that fetches data from a source for processing. There are two different types of sources, Basic Sources and Advanced Sources. Basic Sources are directly available in the StreamingContext API, for example, streams from the file system or a socket connection. Advanced Sources include, for example, Apache Kafka and AWS Kinesis. These sources need extra utility classes located in separate libraries that have to be imported when needed.

Before starting with the streaming, the boundaries of the area that will be processed need to be defined. For the use case of this study, only location data from within Austria needs to be processed. The area is configurable, but in this case, only the minimum latitude and longitude values for Austria are set. The coordinates are needed for the approach, already mentioned in Chapter 7.4.1, of creating geohashes that are used to cluster the incoming data points.

Out of all the available sources, the Kinesis Stream source is used to read from the Kinesis Stream; the credentials used to access the stream are provided through the underlying JobFlowRole of the EC2 instances running in the EMR Hadoop cluster.

As already mentioned, Spark Streaming works with a DStream that consists of RDDs for a specified interval. Figure 15 shows how the original DStream is windowed to create a small batch of values that can be processed. This interval is used to create a sliding window that is processed as a mini-batch job and needed for the computations.

Figure 15 - DStream and its interaction with windowing operators (Apache Software
Foundation, 2020)
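A minimal sketch of this windowing step, shown here with the Java API (the stream variable and the interval lengths are illustrative assumptions; the real values depend on the configuration):

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;

// Window the incoming stream: each mini-batch covers the last 60 seconds
// of records, and a new batch is scheduled every 10 seconds.
JavaDStream<byte[]> windowed =
        kinesisStream.window(Durations.seconds(60), Durations.seconds(10));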

The data is now split using the sliding window, and it is possible to run aggregations over it. The framework of choice for aggregating the data efficiently is Spark SQL, which can be used to process structured data. Of the available ways to use Spark SQL, Datasets and DataFrames fit this implementation best. A DataFrame can be defined as a distributed collection of data organized into named columns. The concept is the same as that of a table in a relational database, but with optimizations under the hood. A DataFrame can be created out of an RDD of Rows and a schema definition. First, all incoming byte arrays in the RDD that was created by the windowing operator have to be mapped to a Row that corresponds to the schema. Once a DataFrame is available, the following operations are executed:

1. Add a column to each Row, called "cellx", that corresponds to the X coordinate of the cell derived from the latitude. The cell value is calculated based on the minimum latitude and a resolution parameter that specifies the size of one cell.
2. Add a column to each Row, called "celly", that corresponds to the Y coordinate of the cell derived from the longitude. The calculation is done the same way as for the "cellx" column.
3. Now that each Row can be assigned to a specific cell using the "cellx" and "celly" values, the Rows are grouped by these two values. This makes it possible to aggregate the data within one cell and to reduce the number of values emitted per cell.
4. Now that the Rows are grouped by their cells, multiple aggregations can be done, including:
   a. Sum of the velocity values
   b. Variance of the velocity values
   c. Count of the rows that are aggregated
5. The data is now aggregated, and the processing is finished; for development and debugging purposes, the first 100 lines are printed to the logs. The logs are saved to S3 and can be accessed using the AWS console.
6. The data is saved to Elasticsearch using the es-hadoop library.

val rdd: RDD[Row] = createRows(javaRdd)
val streamData = sqlContext.createDataFrame(rdd, schema)
val relData = streamData
  .withColumn("cellx", latCoordsUdf($"lat"))
  .withColumn("celly", longCoordsUdf($"long"))
  .groupBy($"cellx", $"celly")
  .agg(
    sum($"velocity"),
    variance($"velocity"),
    count($"velocity"),
    mean($"lat"),
    mean($"long")
  )
relData.show(100)
relData.saveToEs("risk-analysis-spark-1/object")
Code Listing 2 - Data preparation and aggregation using Spark SQL before using es-
hadoop to save the data

To save the data, the es-hadoop library was used. It offers full support for Spark, Spark Streaming and Spark SQL and makes it easy to save data to Elasticsearch by adding dedicated methods to the RDDs. When working with self-hosted Elasticsearch clusters, this works fine after configuring only a few values, such as the Elasticsearch endpoint. A problem arose because the Elasticsearch cluster was hosted by the AWS Elasticsearch Service. Typically, requests to the ES API need to be authenticated. This is mostly done by sending Basic Authentication credentials with the requests, but with the AWS ES Service, the requests need to be signed using AWS credentials. Unfortunately, the es-hadoop library does not provide any functionality to add request signing, which makes it impossible to write directly to an AWS ES cluster. As a workaround, an additional application was needed to act as a proxy between the Spark Streaming application and the AWS ES cluster. For this, the aws-es-proxy was used. It is a small web server that sits between the originating application and the AWS Elasticsearch Service. It intercepts the requests and signs them using the AWS Signature Version 4 before forwarding them to the ES cluster. The response from ES is then sent back to the application that issued the request (aws-es-proxy, 2020). The aws-es-proxy provides a pre-built docker image, and as it uses the Go AWS SDK to fetch and generate the credentials, it can rely on the standard AWS CredentialsProviderChain. The CredentialsProviderChain then uses a provided TaskRole, which is allowed to access the AWS ES cluster, to obtain the credentials. The implications of this setup can also be seen in Figure 14, which shows an ECS cluster with an aws-es-proxy Fargate task that intercepts the requests to the Elasticsearch cluster.
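With the proxy in place, the es-hadoop connector only needs to be pointed at the proxy host instead of the cluster endpoint. A minimal sketch of the relevant settings (the proxy host name is an assumption; the keys are standard es-hadoop configuration options):

// Point es-hadoop at the signing proxy instead of the AWS ES endpoint.
sparkConf.set("es.nodes", "aws-es-proxy.internal");  // proxy host, not the cluster
sparkConf.set("es.port", "9200");
sparkConf.set("es.nodes.wan.only", "true");          // skip node discovery behind the proxy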

7.5.3 Metrics
All the metrics were measured twice: once while running the Spark Streaming application on only one m4.large node, and once on a cluster of five m4.large nodes. One m4.large EC2 instance has two vCPUs and 8 GB of memory. This is the least powerful machine that can be used in an EMR cluster, but it is sufficient to test the application.

To measure the latency of the Spark Streaming solution, a small test application was used. The tests were conducted while the trip-simulator produced the normal load; this makes it possible to measure the latency under various load and scaling scenarios. The test application adds a record to the Kinesis Data Stream that serves as the data source for the streaming application. The added record has unique values that can then be used to query the Elasticsearch storage. The latency is calculated as the time between the record being written to the data stream and it becoming visible in the data storage. The test was conducted multiple times to achieve a statistically relevant result. Looking at Table 2, the average latency for the first scenario is around 24 seconds, compared to around 12 seconds for the second scenario, where four times as many resources were used. The maximum latency values are also not much higher than the averages, which means that the latency is rather stable and has no outliers.

Scenario               min        mean        median    max
1 m4.large instance    18732 ms   24811.5 ms  24569 ms  29323 ms
4 m4.large instances   8850 ms    12518.9 ms  12207 ms  18386 ms
Table 2 - Measured latency in the Spark Streaming solution
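A minimal sketch of such a latency probe (the stream name, index name, endpoint and marker value are assumptions, and the Elasticsearch check is reduced to a crude string match against the URI search API):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

public class LatencyProbe {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();

        // Write a record with an otherwise unused, easy-to-find velocity value.
        try (KinesisClient kinesis = KinesisClient.create()) {
            kinesis.putRecord(PutRecordRequest.builder()
                    .streamName("kinesis-analytics-stream")
                    .partitionKey("latency-probe")
                    .data(SdkBytes.fromUtf8String("48.2082,16.3738,99999.0," + start))
                    .build());
        }

        // Poll the index until the processed record becomes searchable.
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest query = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/risk-analysis-spark-1/_search?q=99999.0"))
                .build();
        while (http.send(query, HttpResponse.BodyHandlers.ofString())
                   .body().contains("\"hits\":[]")) {
            Thread.sleep(100);
        }
        System.out.println("Latency: " + (System.currentTimeMillis() - start) + " ms");
    }
}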

Multiple metrics can be used to measure the general performance of the solution. The first is the number of records in the Kinesis Data Stream. Figure 16 shows that the 10 Kinesis Data Stream shards were used to their maximum capacity regarding the number of records that can be processed. Within five minutes, three million records were sent into the stream, which means roughly 10.000 records per second were added. Figure 16 shows both the number of records that were added to the stream and the number of records that were read. Both lines in the graph lie on top of each other, which means that each record is read from the stream quickly after being added.

Figure 16 - Spark Streaming Metrics: Kinesis Data Stream PutRecords vs GetRecords

This is also supported by Figure 17, which shows the IteratorAge of the data stream. The IteratorAge metric shows how long data stays in the stream before it is read. On average, a record waited 2.67 milliseconds before being consumed. The spikes in the figure indicate the maximum values, which were measured at 4 seconds.

Figure 17 - Spark Streaming Metrics: Kinesis Data Stream IteratorAge

There are also metrics provided by the Spark History Server UI, shown in Figure 18. They show how many records are consumed and how long the processing takes. Unfortunately, the average values shown in Figure 18 are computed over the whole runtime of the streaming job, which means they also include batches that did not process any data. The first graph shows the input rate, which fluctuates between 6.000 and 10.000 records per second. The histogram for this metric shows that most batches included more than 8.000 records per second. The second graph shows the scheduling delay. It is essential that the scheduling delay stays low, which is only the case if the processing time remains below the batch interval; this issue will be discussed in more detail in the next chapter. In this case, there was only one spike to roughly 15 seconds, but this is not an issue, as the framework catches up rather quickly and the delay is continuously very low for the remaining time. The total time a record needs to be processed is calculated by adding the scheduling delay and the processing time together. Looking at the processing time, most values are around 10 to 15 seconds, which largely matches the values measured for the latency. Also visible is the dotted line that indicates the time a task is allowed to take for the application to run stably and provide low latency.

Figure 18 - Spark Streaming Metrics: Input Rate, Scheduling Delay and Processing Time

First, the cost for the Spark Streaming solution is calculated with the minimal setup in mind. The minimal setup includes an EMR cluster with two m4.large instances and one Kinesis Data Stream shard. The second scenario, for a more considerable deployment, uses five m4.large instances. Further, this calculation uses on-demand instances in contrast to the cheaper spot instances. When deploying this architecture in a real-world scenario, one would have to assess whether it makes sense to use spot instances for the cluster nodes. All prices refer to the Europe/Frankfurt (eu-central-1) region.

Service                         Price               Unit
Kinesis Data Stream
  Shard                         $0.018              per hour
  Put Records                   $0.0175             per million units
EMR Cluster
  m4.large instance price       $0.12               per hour
  m4.large EMR price            $0.03               per hour
ECS Cluster
  Fargate vCPU                  $0.013968           per vCPU/hour
  Fargate memory                $0.001533           per GB/hour
Elasticsearch Service
  Instance pricing              $0.042              per hour
  Storage                       $3.22               per month
CloudWatch
  Logs & metrics                Free Tier eligible
S3 Buckets
  Logs & application artefacts  Free Tier eligible
Table 3 - Spark Streaming Solution Pricing per Service

Costs for CloudWatch and S3 log storage are not considered in the calculation, as they are both Free Tier eligible and negligible. Using the table above, the costs for both a minimal solution and one that provides better performance at the scale mentioned in Chapter 1.3 can be calculated.

Solution                    Price
Minimal solution per hour   $0.38157
Minimal solution per month  $278.4548
Scaled solution per hour    $0.9934
Scaled solution per month   $615.7148
Table 4 - Spark Streaming Solution hourly/monthly price
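For transparency, the minimal hourly figure can be approximately reconstructed from Table 3, assuming a single aws-es-proxy Fargate task with 1 vCPU and 2 GB of memory (the task size is an assumption) and amortizing the monthly storage cost over roughly 730 hours:

  2 x ($0.12 + $0.03)        EMR nodes               = $0.3000
  1 x $0.018                 Kinesis shard           = $0.0180
  1 x $0.042                 Elasticsearch instance  = $0.0420
  $0.013968 + 2 x $0.001533  Fargate task            ≈ $0.0170
  $3.22 / 730 h              ES storage, amortized   ≈ $0.0044
                             Total                   ≈ $0.3814

which closely matches the $0.38157 per hour stated above.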

Scaling is rather easy in AWS EMR: the only thing that has to be done is adjusting the instance group, and the new EC2 instances are added to the cluster immediately. Adding servers to the cluster is one thing; the other is the configuration of the Spark application. Usually, the number of cores and the amount of memory to use have to be defined in the application configuration. Fortunately, AWS offers a configuration flag (maximizeResourceAllocation) that sets all the Spark configurations to the maximum that can be allocated within the cluster. This helps when running only one application that should use all the resources available in the cluster.

7.5.4 Advantages & Disadvantages


A big advantage of the Spark approach is that Spark is already widely adopted by many companies, and for them, it would be easy to use AWS EMR to run their existing applications. All the know-how for working with a Spark application and running a production-grade environment would already be there, and with the help of EMR, this process would even be simplified.

One disadvantage of this solution is that monitoring the Spark application within the EMR cluster is not straightforward. There is no out-of-the-box solution to monitor the application using AWS tools such as CloudWatch Metrics and Logs. The application logs can only be sent to S3, which makes it more challenging to check for failures. To observe what is going on in the application, one would need additional tooling to bring the logs into a data store where they can be searched and aggregated, for example CloudWatch Logs or Elasticsearch. For the metrics, one has to access the Spark and Hadoop GUIs; this means that at least the master node of the cluster has to be deployed in a public subnet, and even then, one needs to install proxy software and open an SSH tunnel to the master instance to access the GUIs. This is a lot of overhead for simply accessing monitoring metrics.

Another disadvantage of Spark Streaming is the mini-batching concept. Spark is optimized for batch processing, and therefore it is necessary to provide an interval that is used to schedule a new task that will work on the next batch. If this interval is too short for the regular processing to finish, more and more tasks will wait to be executed, which creates congestion that the framework cannot overcome. What happens when this parameter is lower than the time a batch needs to complete can be seen in Figure 19. Even if the standard processing time is as low as a few seconds, the latest data that comes into the stream will only be processed after the scheduling delay. This is, of course, not feasible for real-time analytics, where the latency should be as low as possible.

Figure 19 - Spark Streaming Scheduling Delay because of a misconfigured batch interval

To fix this issue, it is necessary to experiment with the batch interval: a longer batch interval also means higher latency, while one that is too short causes the scheduling delay to grow. Finding the sweet spot where a good trade-off is achieved requires multiple iterations. This has to be done for every resource configuration, as the processing time naturally changes when more or fewer resources are available.

7.6 Kinesis Analytics with Apache Flink


7.6.1 Architecture and Technology
For the second implementation, Apache Flink was used for the stream processing application. This was the obvious choice, as Kinesis Analytics only supports Flink and an SQL dialect for analysing the data. Using Kinesis Data Streams and Kinesis Analytics makes it easy to scale horizontally: for the stream, only the shard count has to be increased, and for the Flink application, scaling can easily be controlled by setting the parallelism. Figure 20 shows how the parts of the architecture interact with each other.

Figure 20 - Architecture for the Kinesis Analytics implementation

7.6.2 Implementation
The basic concept used for stream processing in Apache Flink is the DataStream, which was already introduced in Chapter 5.6. A DataStream can be created from multiple data sources. One of the predefined sources is the FlinkKinesisConsumer, which can easily be added to the Flink environment as a source for the DataStream. To create the consumer, all that is needed is the name of the Kinesis Data Stream that should be used as a source, a deserialization schema and configuration properties. The deserialization schema is an interface that needs to be implemented to enable the Flink application to create an object out of the supplied byte array. There are multiple predefined deserialization schemas, such as the SimpleStringDeserializationSchema or the POJODeserializationSchema. Unfortunately, the data from the trip-simulator is sent as a comma-separated string, and therefore a custom deserialization schema was needed. The third parameter, the configuration properties, is needed to supply basic settings such as the AWS region or the AWS credentials provider. Further, there are advanced configuration properties that control the way the stream is read. An example is the initial starting position for reading the Kinesis Data Stream, for which there are multiple possible values:

• AT_SEQUENCE_NUMBER – start streaming from the position of the given sequence number
• AFTER_SEQUENCE_NUMBER – start streaming after the position of the given sequence number
• AT_TIMESTAMP – start streaming from a specific timestamp
• TRIM_HORIZON – start streaming with the oldest data record in the shard
• LATEST – start streaming after the latest record, therefore always reading the newest data records

All possible starting positions have their use cases, but for the risk analysis implementation, the TRIM_HORIZON starting position was chosen. The reasoning behind this decision is that all records should be processed, even if the processing stopped for a few minutes and was started again afterwards, because otherwise there would be holes in the analysed data. Further configuration values include, for example, the interval in milliseconds for reading records from the Kinesis Data Stream.
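A minimal sketch of creating such a consumer, roughly what the consumeInputStream helper shown later in Code Listing 3 might contain (the stream name matches Code Listing 1, while the deserialization schema class name is an assumption):

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

// Configure the region and the initial starting position discussed above.
Properties config = new Properties();
config.setProperty(AWSConfigConstants.AWS_REGION, "eu-central-1");
config.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "TRIM_HORIZON");

// Create the Kinesis source with a custom schema for the comma-separated
// records and add it to the Flink environment.
DataStream<GpsVelocity> input = env.addSource(new FlinkKinesisConsumer<>(
        "kinesis-analytics-stream",
        new GpsVelocityDeserializationSchema(),
        config));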

After a data source has been added to the Flink environment and a DataStream is available, the following computations can be performed. Each computation can also be found in Code Listing 3.

1. First, each data record is mapped to add the X and Y coordinates of the cell corresponding to the longitude and latitude of the record. The cell coordinates are calculated in the same way as in the Spark Streaming application, using the minimum latitude and longitude of Austria and a resolution parameter that specifies the size of one cell.
2. The stream of records is then logically partitioned using the keyBy operator, with the cell coordinates as the key. Creating a KeyedStream makes it possible for Flink to run the following windowed computations in parallel for all cells, as they are independent of each other and can therefore even be processed by different processes.
3. To group related records for the aggregations, a sliding time window is defined.
4. The windowed stream can now be aggregated. There are multiple prebuilt aggregation methods, but unfortunately, not all needed aggregations are available; statistical computations, for example, are missing. For custom aggregations, one needs to supply an AggregateFunction, whose interface has the structure defined in Equation 4 (a sketch of such a function is shown after Code Listing 3).

add(record, accumulator) → accumulator
getResult(accumulator) → output
merge(accumulator1, accumulator2) → accumulator
Equation 4 - AggregateFunction Interface Definition

Using a custom AggregateFunction, all values within a time window are aggregated using the add and merge functions, and at the end, the getResult method calculates the result based on the accumulated values.
5. After the aggregation is done, the data can be saved by adding a SinkFunction to the DataStream. Apache Flink supports multiple sinks, including Elasticsearch. Using the ElasticsearchSink, one only needs to define the ES endpoint and the target index and create a mapper that maps the data from a POJO to a format the Elasticsearch client can understand, such as the XContentBuilder. Fortunately, the ES Flink integration uses the standard Elasticsearch RestClient and therefore makes it possible to add a request interceptor that signs the requests for the AWS Elasticsearch Service. In contrast to the Spark Streaming solution, no workaround is needed for the communication with ES.

DataStream<GpsVelocity> input = consumeInputStream(env);

input.map(calculateCell())
     .keyBy(cellKey())
     .timeWindow(TIME_WINDOW_MILLIS, SILIDING_WINDOW_MILLIS)
     .aggregate(new GpsVelocityAggregateFunction())
     .addSink(ElasticSearchSink.elasticsearchSink());
env.execute("Aggregate GPS Data for risk analysis");
Code Listing 3 - Stream windowing and data aggregation using Apache Flink
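As referenced in step 4 above, a custom AggregateFunction could compute the variance of the velocity within a window. The following is a minimal sketch of a simplified variant of the GpsVelocityAggregateFunction from Code Listing 3, reduced to the variance only (the class name, accumulator layout and getter name are assumptions, not the actual thesis code):

import org.apache.flink.api.common.functions.AggregateFunction;

// Accumulates count, sum and sum of squares so the variance can be derived
// at the end; merge() lets Flink combine partial per-window results.
public class VelocityVarianceAggregate
        implements AggregateFunction<GpsVelocity, VelocityVarianceAggregate.Acc, Double> {

    public static class Acc {
        long count;
        double sum;
        double sumOfSquares;
    }

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    @Override
    public Acc add(GpsVelocity record, Acc acc) {
        double v = record.getVelocity();  // getter name assumed
        acc.count++;
        acc.sum += v;
        acc.sumOfSquares += v * v;
        return acc;
    }

    @Override
    public Double getResult(Acc acc) {
        if (acc.count == 0) {
            return 0.0;
        }
        double mean = acc.sum / acc.count;
        // Population variance: E[X^2] - (E[X])^2
        return acc.sumOfSquares / acc.count - mean * mean;
    }

    @Override
    public Acc merge(Acc a, Acc b) {
        Acc merged = new Acc();
        merged.count = a.count + b.count;
        merged.sum = a.sum + b.sum;
        merged.sumOfSquares = a.sumOfSquares + b.sumOfSquares;
        return merged;
    }
}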

7.6.3 Metrics
All the metrics were measured twice, with different numbers of Kinesis Processing Units (KPUs). Two parameters can be defined to configure the parallelism of a Kinesis Analytics application. The first is the Parallelism parameter, which specifies the number of parallel executions of the operators, sources and sinks. The second is ParallelismPerKPU, which defines the number of parallel tasks that can be executed per KPU. A single KPU provides 1 vCPU and 4 GB of memory. To know how many KPUs are used for the execution, both values have to be considered. The first test was executed with a Parallelism and ParallelismPerKPU of 1, which results in 1 KPU being used. The second test used a Parallelism of 2 and a ParallelismPerKPU of 8, which led to 2 KPUs being used.

To measure the latency, the same test as for the Spark Streaming solution was used; for details on the test setup, refer to Chapter 7.5.3. The tests were run multiple times for each scenario to retrieve statistically meaningful results.

Scenario                            min       mean       median    max
Parallelism 1, ParallelismPerKPU 1  4886 ms   7654.5 ms  7969 ms   10649 ms
Parallelism 2, ParallelismPerKPU 8  4198 ms   7643 ms    7247 ms   9674 ms
Table 5 - Measured latency in the Kinesis Analytics Solution

Table 5 shows the results of the latency tests for both scenarios. What is remarkable is that the performance did not really increase when using more resources. This can be explained by the fact that when multiple instances of an Apache Flink application are running, the framework has to handle the distribution of the dataset, which adds overhead. Further, the resources used in the first scenario were already sufficient for Flink to provide acceptable performance; the CPU utilization was roughly around 50%, as seen in Figure 21. The blue line in the figure from 18:30 to 21:00 is the CPU utilization for the first scenario. After the short interruption at 21:00, the CPU utilization for the second scenario is shown, which is roughly around 30%.

Figure 21 - Kinesis Analytics Solution: CPU & Memory Utilisation

Figure 22 shows the records processed per second by the Apache Flink application. The first line describes the first scenario, with a Parallelism of only 1; here, the application maintained a constant throughput of 9100 records per second. The second scenario, which includes multiple application instances and therefore overhead for distributing the tasks, maintained a throughput of roughly 2400 records per second. Although this looks like lower performance than in the first scenario, the second scenario maintained a lower latency for processing the records throughout the application runtime, as shown in Table 5.
Another way to check whether the application keeps up with the produced data rate are the IteratorAge and MillisecondsBehindLatest metrics. Both show how far behind the application is in processing the records within the stream. If the application were not performant enough, these two metrics would increase, as the oldest unread record in the stream would grow ever older. This would be disastrous for a real-time application, as it would never analyse the current data but fall further behind and process old records instead. For both scenarios, both metrics showed a flat line, meaning that the time behind the latest record is zero milliseconds; in other words, the application always works with the newest data.

Figure 22 - Kinesis Analytics Solution: Records processed per second

For fault tolerance, Kinesis Analytics Java applications use a mechanism called checkpointing. A checkpoint is an up-to-date backup of a currently running application that can be used to recover immediately from an arbitrary failure. In Kinesis Analytics, this is done automatically by the service if it is enabled. Further, the service provides metrics that can be used to monitor the checkpointing. Figure 23 shows the time it takes Apache Flink to perform the checkpointing and save the latest state. As Kinesis Analytics is a managed service, there is no way to test instance failures and other common scenarios, as this is all handled by the cloud-native service itself.

Figure 23 - Kinesis Analytics Solution: Fault tolerance through checkpointing
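In a self-managed Flink deployment, the same mechanism would be enabled explicitly on the execution environment; a minimal sketch (the interval is an illustrative value):

// Take a snapshot of the application state every 60 seconds.
env.enableCheckpointing(60_000);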

First, the cost for the Kinesis Analytics solution is calculated with the minimal setup in mind. The minimal setup includes one Kinesis Data Stream shard and one Kinesis Analytics processing unit. From this, the appropriate scale for the full deployment can be calculated. All prices refer to the Europe/Frankfurt (eu-central-1) region.

Service                         Price               Unit
Kinesis Data Stream
  Shard                         $0.018              per hour
  Put Records                   $0.0175             per million units
Kinesis Analytics
  Processing unit               $0.119              per hour
  Running application storage   $5.59               per month per processing unit
Elasticsearch Service
  Instance pricing              $0.042              per hour
  Storage                       $3.22               per month
CloudWatch
  Logs & metrics                Free Tier eligible
S3 Buckets
  Logs & application artefacts  Free Tier eligible
Table 6 - Kinesis Analytics Solution Pricing per Service

Costs for CloudWatch and S3 log storage are not considered in the calculation, as they are both Free Tier eligible and negligible. Using the table above, the costs for both a minimal solution and one scaled for the data throughput mentioned in Chapter 1.3 can be calculated.

Solution                    Price
Minimal solution per hour   $0.1911
Minimal solution per month  $139.48
Scaled solution per hour    $0.4797
Scaled solution per month   $350.20
Table 7 - Kinesis Analytics Solution hourly/monthly price
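Again, the minimal hourly figure can be reconstructed from Table 6, amortizing the monthly storage items over roughly 730 hours:

  1 x $0.018     Kinesis shard                      = $0.0180
  1 x $0.119     Kinesis Analytics processing unit  = $0.1190
  1 x $0.042     Elasticsearch instance             = $0.0420
  $5.59 / 730 h  running application storage        ≈ $0.0077
  $3.22 / 730 h  ES storage                         ≈ $0.0044
                 Total                              ≈ $0.1911

which matches the stated $0.1911 per hour.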

7.6.4 Advantages & Disadvantages


The most crucial advantage of this approach is its cloud-native nature. Not having to provision resources or maintain servers, and being able to scale seamlessly within seconds, are essential aspects when talking about architectures. The setup is also more fault-tolerant when using cloud-native services than when configuring servers by hand. Further, Kinesis Analytics not only runs Apache Flink applications but can also execute simple SQL statements for the analytical workload. This means that even someone who is not familiar with Apache Flink can use Kinesis Analytics, as SQL is a widely popular language that is easy to use. From an architectural point of view, the integration of the different AWS services works very well.

Another advantage when deploying to Kinesis Analytics is the increased observability. The out-of-the-box metrics available for Kinesis Data Streams and Kinesis Analytics, which were shown in the preceding chapter, are very useful and extremely important when deploying such analytics applications to production. The metrics are all imported into AWS CloudWatch, which in turn makes it easy to configure dashboards and alarms.

Kinesis Analytics also offers automated scaling. This has the advantage that there is no need to test how many KPUs are needed for the application to run stably at all times, as Kinesis Analytics simply scales the number of KPUs based on the CPU and memory utilization. Therefore, it can easily handle a variable load without any manual intervention.

Working with Apache Flink is very easy, as the APIs are well documented and its main supported language is Java, which is not always the case for data processing frameworks. Building an Apache Flink application is therefore straightforward, and all other libraries supported in Java can be used without any problems. The integrations of the different sources and sinks make it very easy to consume data, process it and finally save it without any hassle. In the case of Kinesis Analytics, one could also use SQL for the data analysis, but the possibilities that come with Apache Flink make it worthwhile to accept the overhead of creating a standalone application. The advanced features that Apache Flink exposes through the DataStream API, such as partitioning, windowing and aggregating the data, make up for the increased costs that come with running Apache Flink in Kinesis Analytics.

This points to one of the disadvantages of working with Kinesis Analytics and Apache Flink: the costs for the deployment are rather high, as the user pays for the managed service on top of the underlying compute resources that run the Apache Flink application.

7.7 Analysis Results


Both solutions implement the same basic analytical procedure to process the data. Therefore, the results look very similar, the only exception being the names of the fields in the final JSON document that is saved to Elasticsearch. As both solutions produce the same output, only one result will be discussed in this chapter; for this purpose, the results of the Kinesis Analytics implementation are shown. To query the data and create graphs that can be displayed on dashboards, Kibana is used as the serving layer of the Kappa Architecture.

The trip-simulator produced synthetic data that simulates cars moving from one location to another. The result of the analysis can be seen in Figure 24. The heatmap shows the variance of the velocity everywhere in Austria. Figure 25 shows a zoomed-in view of the variance of the velocity around Vienna and the count of the records that were analysed in this area.

Figure 24 - Analysis Result shown as a heatmap in Kibana

Figure 25 - Variance of velocity around Vienna (left) & count of records around Vienna
(right)

Figure 25 shows the maximum zoom level that is available in Kibana; the data
behind the visualization is, of course, available at a much more granular level.
Still, the heatmap shows that the produced grid is fine enough to display the data
in a meaningful way.

7.8 Comparison
Both approaches, Apache Spark in AWS EMR and Apache Flink in AWS Kinesis Analytics, were able to satisfy all the requirements and proved that they can be used to implement real-time streaming analytics. The following sections compare the two implementations with regard to development, deployment, performance and pricing.

Both solutions were developed using a JVM language: Scala for Spark and Java for Flink. Although Spark also supports Java, the documentation and the Scala source code made it difficult to write the code in Java; further, most of the supporting documents and community examples are written in Scala. The development of the Spark application was therefore a bit tedious, but even for someone not fluent in Scala, it was possible to write a working application using the well-documented API. The development for Flink was considerably more accessible, as the framework is natively written in Java and the developer support was better. For both frameworks, one has to rely on the available support for the different sources and sinks. Here again, Flink made it easier to enhance the Elasticsearch sink, which was impossible in the Spark implementation. Both frameworks have a big community, which is also an essential criterion when adopting a framework. Although Spark Streaming offers excellent support, it is still noticeable that the framework works using mini batches; an example is the batch interval issue mentioned in the Spark implementation chapter. Apache Flink, on the other hand, fully embraces the streaming paradigm, and therefore its API is easier to use. For an application developer, the gap between using a framework such as Spring and data processing frameworks like Spark or Flink is massive: the support that most application frameworks offer is simply not there, and issues with dependencies that are not interoperable or version incompatibilities are common.

The deployment of the two solutions was set up using the AWS CDK to create the CloudFormation stacks. There are some issues in the AWS documentation regarding certain configurations, which take time to overcome and make the deployment non-trivial. The Flink application deployment in Kinesis Analytics was the easier one, as it is a managed service with only a limited number of configuration options. The Spark application, in contrast, was deployed in an EMR cluster, which still needs all the different configurations required to run a Spark application in a Hadoop cluster. Figuring out how to use the spark-submit command within EMR and setting up all the permissions needed to deploy the application was challenging.
Further, the Spark application is deployed as a skinny JAR, meaning that the dependencies are not packaged within the JAR; only the application code is. Therefore, when submitting the Spark application to the cluster, the dependencies have to be declared once again, which can cause availability problems if the wrong Maven repository is used. In terms of observability, Kinesis Analytics makes it easier to access and search the logs, as they are sent to CloudWatch; for Spark, the logs are delivered to an S3 bucket.

In terms of performance, the Spark Streaming solution could not match the latency of the Apache Flink application. Table 2 and Table 5 show the measured latencies for both applications. While the performance did not really improve for Flink when using more resources, significant improvements can be seen for Spark, where the average latency dropped from 24811 ms to 12518 ms. Still, the average latency for Flink was between 7 and 8 seconds and therefore two to three times lower, depending on the resources used for the Spark application. The problem for Spark was not the throughput, as both Spark and Flink consumed up to 10.000 records per second without any problems, as can be seen in Figure 18 and Figure 22, but the processing time. The processing time for Spark, shown in Figure 18, averaged roughly 10 seconds, which in itself is longer than the total latency of the Flink application; on top of that, the scheduling overhead and the Elasticsearch proxy further increased the latency. In terms of consuming a Kinesis Data Stream, both frameworks had no issues reading all records immediately after they became available; this can be seen in Figure 16 for Spark Streaming, and the behaviour was no different for Apache Flink. Comparing the resources used to reach this performance, Spark could use four m4.large instances, which equals 8 vCPUs and 32 GB of memory, while Kinesis Analytics only used two KPUs, which equals 2 vCPUs and 8 GB of memory.

Another big factor when comparing the two solutions and their viability in a real-world scenario is the accumulated cost. All the price calculations and the final prices can be found in Table 4 and Table 7. Both configurations used the same number of Kinesis shards and the same Elasticsearch instance class; therefore, the pricing difference between the two approaches can be attributed to the used compute resources. The full deployment for the Apache Flink application was around $350, while the Apache Spark deployment reached $615. This is a massive price difference, especially considering that the cheaper solution is a managed service, in contrast to the self-managed EMR cluster used for the Spark deployment. The pricing for the nodes inside the EMR cluster was calculated for on-demand instances, so it could be somewhat cheaper with reserved or spot instances. Still, the Kinesis Analytics deployment is 44% cheaper and 170% more performant than the Spark deployment.

8. CONCLUSION
Many companies are already using Big Data analysis to make decisions. The companies that can use their data have a competitive advantage over their competitors. This advantage will grow over time, as the techniques and technology behind Big Data continue to evolve. Being able to make use of all the data produced within the business context is key to gaining competitive advantages and making data-driven decisions. To find relations within the data and draw conclusions from it, businesses have to create Data Lakes that make it easy to analyse all the data. Especially in the fields of financial and insurance services, Big Data analysis will increase in relevance.

The goal of this thesis was to find a solution for Big Data streaming in the context of real-time insurance risk evaluation. Therefore, two solutions were implemented within this problem context and compared to each other.

The prevailing architectures in the field of Big Data are the Lambda and the Kappa Architecture. The Lambda Architecture is more complicated, but once in place, it can deliver great results. The Kappa Architecture, in contrast, is relatively simple in itself but cannot deliver results as accurate as its Lambda counterpart. As always, it depends very much on the requirements that the architecture should fulfil. For Fast Data and endless streaming data, the Kappa Architecture is the better fit, which is also the reason why it was chosen for the implementation.

The next question is which technology and framework should be used to implement these architectures. This is not easy to answer, as many different frameworks specialize in Big Data processing; a few mentioned throughout this thesis are MapReduce, Apache Hadoop, Apache Spark, Apache Storm and Apache Flink. After evaluating these frameworks, Apache Spark and Apache Flink were chosen for the implementation because of their characteristics, popularity and performance. Both support stream processing: in Spark, it is added as a submodule on top of the standard batch processing, while Flink is built from scratch for stream processing.

After the data is processed, it needs to be stored somewhere. The datastore has to support a large data volume while still providing good query performance. To find a fitting data store for the implementation, multiple approaches were discussed, including different NoSQL databases. Elasticsearch was evaluated as the best-fitting database for the given scenario. It fits the NoSQL paradigm by saving entries as JSON documents, and with Kibana, it also provides an easy solution for the serving layer of the Kappa Architecture.

With the Kappa Architecture, Apache Flink, Apache Spark and Elasticsearch, two solutions could be implemented, both designed to be operated in the cloud. Amazon Web Services was chosen as the cloud provider, and both implementations took advantage of managed services, such as AWS Kinesis Data Streams as a source. The first architecture used AWS EMR to run a Hadoop cluster and deploy a Spark Streaming application into it; this approach could also be set up in the same way in an on-premise datacentre. The second implementation used AWS Kinesis Analytics to run an Apache Flink application as a managed service. Elasticsearch was likewise consumed as a managed service via the AWS Elasticsearch Service.

Comparing these two implementations, it is clear that both have their advantages and disadvantages. The biggest advantage of the Spark Streaming solution is that it can easily be run on existing Hadoop clusters that many companies already operate on-premise, hybrid or in the cloud. Still, Spark Streaming works using mini batches, and the configuration needed to run a resilient, low-latency streaming application, for example the batch interval, is a lot harder to figure out than for Apache Flink. Regarding monitoring and observability, both solutions provide metrics, but Kinesis Analytics is better integrated with the other AWS services: the metrics and logs are sent to CloudWatch, where it is possible to query and visualize them. For Spark running on EMR, some metrics are also sent to CloudWatch, but most are only accessible through the Spark History Server UI. Using the metrics provided for Apache Flink and Amazon Kinesis Analytics, it is also possible to automatically scale the streaming application based on the load and the needed resources. When comparing the performance of the two solutions, it is clear that Apache Flink performed far better than Apache Spark: the average latency of the Spark solution was 24811 ms on one node and 12518 ms on four nodes, while the Flink solution stayed below 8000 ms. Overall, the Apache Flink implementation performed 170% better than the Apache Spark solution, while being, in terms of infrastructure, roughly 44% cheaper. Beyond infrastructure, the cost of operating such streaming applications is also a lot lower for the Kinesis Analytics solution, as the service itself is managed by AWS, which reduces the operational overhead by a large margin.

Possible future work could include testing these solutions at a larger scale: once with a higher number of records per second, to determine how many records the current solutions can process, and once with data records close in size to the maximum of one megabyte per record. Using a larger record size should have a visible impact on both implemented solutions, and both scenarios would deliver interesting data for future architectural decisions. Another approach would be to implement the same analysis in Apache Storm and test it against the results of the other two frameworks.

As the research has demonstrated, building a cloud-native real-time analytics
system that is able to work with Big Data is possible using different
techniques. Which technique should be used has to be evaluated for each
context, but the Kappa Architecture implemented using AWS Kinesis Analytics
with Apache Flink is a solid foundation for any stream processing system.

BIBLIOGRAPHY
Sagiroglu, S., & Sinanc, D. (2013). Big data: A review. 2013 International Conference on Collaboration Technologies and Systems (CTS) (pp. 42-47). San Diego: IEEE.
Domo Inc. (2018). Data Never Sleeps 6.0. Retrieved from Domo: https://www.domo.com/learn/data-never-sleeps-6
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition and productivity. McKinsey Global Institute.
Zhang, D. (2018). Big Data Security and Privacy Protection. 8th International Conference on Management and Computer Science (ICMCS 2018) (pp. 275-278). Atlantis Press.
Hasani, Z., Velinov, G., & Kon-Popovska, M. (2014). Lambda Architecture for Real Time Big Data Analytic. ICT Innovations 2014, Web Proceedings ISSN 1857-7288, 133-143.
Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable realtime data systems. Manning Publications Co.
Amazon Web Services. (2018, October). Lambda Architecture for Batch and Stream Processing (white paper). Retrieved from AWS Whitepapers: https://d1.awsstatic.com/whitepapers/lambda-architecure-on-for-batch-aws.pdf
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. 6th Symposium on Operating Systems Design and Implementation (OSDI '04), 137-150.
Grolinger, K., Hayes, M., Higashino, W. A., L'Heureux, A., Allison, D. S., & Capretz, M. A. (2014). Challenges for MapReduce in Big Data. 2014 IEEE World Congress on Services (pp. 182-189). Anchorage, AK, USA: IEEE.
Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., & Sears, R. (2010, April). MapReduce Online. NSDI (Vol. 10, No. 4, p. 20).
Cheng-Zhang, P., Ze-Jun, J., Xiao-Bin, C., & Zhi-Ke, Z. (2012). Real-time analytics processing with MapReduce. 2012 International Conference on Machine Learning and Cybernetics (pp. 1308-1311). Xian, China: IEEE.
Kiran, M., Murphy, P., Monga, I., Dugan, J., & Baveja, S. S. (2015). Lambda Architecture for Cost-effective Batch and Speed Big Data processing. 2015 IEEE International Conference on Big Data (pp. 2785-2792). IEEE.
Li, B., Mazur, E., Diao, Y., McGregor, A., & Shenoy, P. (2011). A Platform for Scalable One-Pass Analytics using MapReduce. 2011 ACM SIGMOD International Conference on Management of Data (pp. 985-996). New York, NY, USA: Association for Computing Machinery.
Seif, G. (2018, February 5). The 5 Clustering Algorithms Data Scientists Need to Know. Retrieved from https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
Amirkhanyan, A., Cheng, F., & Meinel, C. (2015). Real-Time Clustering of Massive Geodata for Online Maps to Improve Visual Analysis. 11th International Conference on Innovations in Information Technology (pp. 308-313). Dubai: IEEE.
Baecke, P., & Bocca, L. (2017, June). The value of vehicle telematics data in insurance risk selection processes. Decision Support Systems, Volume 98, pp. 69-79.
Huckstep, R. (2019, November 19). Insurance of Things – how IoT shows prevention is better than cure for Insurers. InsurTech Insights, Issue 39.
Segment. (2017). The 2017 State of Personalization Report. Retrieved from Segment: http://grow.segment.com/Segment-2017-Personalization-Report.pdf
Harris, M. (2018, December 31). How to Earn Your Customers' Trust and Encourage Data Sharing. Retrieved from Martech Advisor: https://www.martechadvisor.com/articles/data-management/how-to-earn-your-customers-trust-and-encourage-data-sharing/
STATISTIK AUSTRIA. (2019, December 31). Kfz Bestand 2019. Retrieved from STATISTIK AUSTRIA: https://www.statistik.at/wcm/idc/idcplg?IdcService=GET_PDF_FILE&RevisionSelectionMethod=LatestReleased&dDocName=122637
VCÖ. (2018, June 21). VCÖ: Im Österreich-Vergleich kommen in Kärnten die meisten mit Auto zur Arbeit. Retrieved from VCÖ - MOBILITÄT MIT ZUKUNFT: https://www.vcoe.at/presse/presseaussendungen/detail/autofahrten-arbeitsweg-2018
Amazon Web Services. (2020). What is Streaming Data? Retrieved from AWS: https://aws.amazon.com/streaming-data/
Nagvanshi, P. (2018, December 2). 5 proven benefits of real-time analytics for professional services organizations. Retrieved from Diginomica: https://diginomica.com/5-proven-benefits-real-time-analytics-professional-services-organizations
van der Veen, J. S., van der Waaij, B., Lazovik, E., Wijbrandi, W., & Meijer, R. J. (2015). Dynamically Scaling Apache Storm for the Analysis of Streaming Data. 2015 IEEE First International Conference on Big Data Computing Service and Applications (pp. 154-161). Redwood City, CA, USA: IEEE.
van Rijmenam, M. (2013, January 7). A Short History Of Big Data. Retrieved from Datafloq: https://datafloq.com/read/big-data-history/239
Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., & Jacobsen, H.-A. (2013). BigBench: Towards an industry standard benchmark for big data analytics. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (pp. 1197-1208). ACM.
Persico, V., Pescapé, A., Picariello, A., & Sperlì, G. (2018, December). Benchmarking big data architectures for social networks data processing using public cloud platforms. Future Generation Computer Systems, Volume 89, pp. 98-109.

Wang, L., et al. (2014). A big data benchmark suite from internet services. 2014 IEEE 20th International Symposium on High Performance Computer Architecture (pp. 488-499). Orlando: IEEE.
Feick, M., Kleer, N., & Kohn, M. (2018). Fundamentals of Real-Time Data Processing Architectures Lambda and Kappa. SKILL 2018 - Studierendenkonferenz Informatik (pp. 55-66). Bonn: Gesellschaft für Informatik e.V.
Sanla, A., & Numnonda, T. (2019). A Comparative Performance of Real-time Big Data Analytic Architectures. 2019 IEEE 9th International Conference on Electronics Information and Emergency Communication (ICEIEC) (pp. 1-5). Beijing, China: IEEE.
Apache Software Foundation. (2017). Introduction - Apache Kafka. Retrieved from Apache Kafka: https://kafka.apache.org/intro
Lee, J., & Wu, W. (2019, October 8). How LinkedIn customizes Apache Kafka for 7 trillion messages per day. Retrieved from LinkedIn Engineering: https://engineering.linkedin.com/blog/2019/apache-kafka-trillion-messages
Wang, Z., Dai, W., Wang, F., Deng, H., Wei, S., Zhang, X., & Liang, B. (2015). Kafka and its Using in High-throughput and Reliable Message Distribution. 2015 8th International Conference on Intelligent Networks and Intelligent Systems (ICINIS) (pp. 117-120). Tianjin, China: IEEE.
Amazon Web Services. (2019). Amazon Kinesis Data Streams. Retrieved from Developer Guide: https://docs.aws.amazon.com/streams/latest/dev/kinesis-dg.pdf
Nguyen, D., Luckow, A., Duffy, B. E., Kennedy, K., & Apon, A. (2018). Evaluation of Highly Available Cloud Streaming Systems for Performance and Price. 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (pp. 360-363). Washington, DC, USA: IEEE.
Big Data Framework. (2019, March 12). The 4 Characteristics of Big Data. Retrieved from Enterprise Big Data Framework: https://www.bigdataframework.org/four-vs-of-big-data/
Strohbach, M., Daubert, J., Ravkin, H., & Lischka, M. (2016). Big Data Storage. New Horizons for a Data-Driven Economy: A Roadmap for Usage and Exploitation of Big Data in Europe, 119-141.
Nayak, A., Poriya, A., & Poojary, D. (2013). Type of NOSQL Databases and its Comparison with Relational Databases. International Journal of Applied Information Systems, 16-19.
Pavlo, A., & Aslett, M. (2016, September). What's Really New with NewSQL? SIGMOD Record, pp. 45-55.
Porter de León, Y., & Piscopo, T. (2014, August 14). Object Storage versus Block Storage: Understanding the Technology Differences. Retrieved from Druva: https://www.druva.com/blog/object-storage-versus-block-storage-understanding-technology-differences/
Amazon Web Services. (2020). Amazon S3. Retrieved from Amazon Web Services: https://aws.amazon.com/s3/?nc=sn&loc=0

King, T. (2016, March 3). The Emergence of Data Lake: Pros and Cons. Retrieved from Solutions Review - Data Integration: https://solutionsreview.com/data-integration/the-emergence-of-data-lake-pros-and-cons/
Amazon Web Services. (2020). What is a data lake? Retrieved from Amazon Web Services - Data lakes and Analytics on AWS: https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Khine, P. P., & Wang, Z. S. (2018). Data lake: a new ideology in big data era. ITM Web Conf. 17, 03025.
Fang, H. (2015). Managing Data Lakes in Big Data Era. The 5th Annual IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems (pp. 820-824). Shenyang, China: IEEE.
Miloslavskaya, N., & Tolstoy, A. (2016). Big Data, Fast Data and Data Lake Concepts. Procedia Computer Science, Volume 88, 300-305.
TDWI. (2017). Data Lakes: Purposes, Practices, Patterns, and Platforms. TDWI Research.
Tableau. (2017). Top 10 Big Data Trends 2017. Retrieved from Tableau: https://www.tableau.com/resource/top-10-big-data-trends-2017
Amazon Web Services. (2020). AWS Lake Formation. Retrieved from Amazon Web Services: https://aws.amazon.com/lake-formation/
Apache Software Foundation. (2020). Apache Hadoop. Retrieved from Apache Hadoop: http://hadoop.apache.org/
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., . . . Baldeschwieler, E. (2013). Apache Hadoop YARN: yet another resource negotiator. Proceedings of the 4th Annual Symposium on Cloud Computing (pp. 1-16). Santa Clara, California: Association for Computing Machinery.
Zaharia, M., Xin, R., Wendell, P., Das, T., Armbrust, M., Dave, A., . . . Stoica, I. (2016). Apache Spark: A Unified Engine for Big Data Processing. Communications of the ACM, 59(11), 56-65.
Karlon, A. (2020, January 16). How do Hadoop and Spark Stack Up? Retrieved from logz.io: https://logz.io/blog/hadoop-vs-spark/
Gopalani, S., & Arora, R. (2015, March). Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means. International Journal of Computer Applications, pp. 8-11.
Lopez, M. A., Lobato, A. G., & Duarte, O. C. (2015). A Performance Comparison of Open-Source Stream Processing Platforms. 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems (pp. 166-173). New York: IEEE.
Carbone, P., Ewen, S., Haridi, S., Katsifodimos, A., Markl, V., & Tzoumas, K. (2016). Apache Flink™: Stream and Batch Processing in a Single Engine. IEEE Data Engineering Bulletin, 36-40.

Iqbal, M. H., & Soomro, T. R. (2015). Big Data Analysis: Apache Storm Perspective. International Journal of Computer Trends and Technology (IJCTT), 9-14.
Apache Software Foundation. (2020). Apache Storm. Retrieved from Apache Storm: http://storm.apache.org/index.html
Prakash, C. (2018, March 30). Spark Streaming vs Flink vs Storm vs Kafka Streams vs Samza: Choose Your Stream Processing Framework. Retrieved from LinkedIn: https://www.linkedin.com/pulse/spark-streaming-vs-flink-storm-kafka-streams-samza-choose-prakash/
Elasticsearch. (2020). What is Elasticsearch? Retrieved from Elasticsearch: https://www.elastic.co/what-is/elasticsearch
Kreps, J. (2014, July 2). Questioning the Lambda Architecture. Retrieved from O'Reilly: https://www.oreilly.com/radar/questioning-the-lambda-architecture/
ITechSeeker. (2019, January 9). Introduction of Lambda Architecture. Retrieved from ITechSeeker: http://itechseeker.com/en/projects-2/implement-lambda-architecture/introduction-of-lambda-architecture/
PubNub. (2020). What is Geohashing? Retrieved from PubNub: https://www.pubnub.com/learn/glossary/what-is-geohashing/
Apache Software Foundation. (2020). Spark Streaming Programming Guide. Retrieved from Apache Spark: https://spark.apache.org/docs/latest/streaming-programming-guide.html
SharedStreets. (2019). trip-simulator. Retrieved from GitHub: https://github.com/sharedstreets/trip-simulator
abutaha. (2020). aws-es-proxy. Retrieved from GitHub: https://github.com/abutaha/aws-es-proxy
Statistik Austria. (2020). Unfallgeschehen nach Ortsgebiet, Freiland und Straßenarten. Statistik Austria.
Valiant, L. G. (1990, August). A Bridging Model for Parallel Computation. Communications of the ACM, pp. 103-111.
Kajdanowicz, T., Indyk, W., Kazienko, P., & Kukul, J. (2012). Comparison of the Efficiency of MapReduce and Bulk Synchronous Parallel Approaches to Large Network Processing. 2012 IEEE 12th International Conference on Data Mining Workshops (pp. 218-225). Brussels: IEEE.
Okada, T., Amaris, M. G., & Goldman, A. (2015). Scheduling Moldable BSP Tasks on Clouds. XXII Symposium of Systems of High Performance Computing. Florianopolis, Brazil.
Jungblut, T. (2011, October 24). Apache Hama realtime processing. Retrieved from Thomas Jungblut's Blog: https://codingwiththomas.blogspot.com/2011/10/apache-hama-realtime-processing.html
Apache Software Foundation. (2020). HDFS Architecture. Retrieved from Apache Hadoop: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Cloud Native Computing Foundation. (2018, June 11). CNCF Cloud Native Definition v1.0. Retrieved from cncf.io: https://github.com/cncf/toc/blob/master/DEFINITION.md
Microsoft. (2019, August 20). Defining cloud native. Retrieved from Microsoft Documentation: https://docs.microsoft.com/en-us/dotnet/architecture/cloud-native/definition
Apache Software Foundation. (2020). RDD Programming Guide. Retrieved from Apache Spark: https://spark.apache.org/docs/latest/rdd-programming-guide.html
Pattamsetti, R. M. (2017). Distributed Computing in Java 9. Packt Publishing.
Chang, W. L., Boyd, D., & Levin, O. (2019, October 21). NIST Big Data Interoperability Framework: Volume 6, Reference Architecture.
MapR Technologies, Inc. (2015, March). Zeta Architecture. Retrieved from MapR Whitepapers: https://mapr.com/whitepapers/zeta-architecture/assets/zeta-architecture.pdf

List of figures

Figure 1 - Speed, Serving and Batch Layer of the Lambda Architecture (ITechSeeker, 2019) ... 16
Figure 2 - Outline of a Kappa Architecture ... 18
Figure 3 - Kinesis Data Stream with n shards that are consumed by multiple consumers ... 21
Figure 4 - Partition with records that have a unique sequence number and consumers that use the offset to read any record from the partition (Apache Software Foundation, 2017) ... 23
Figure 5 - Example of how a word count application would work using the MapReduce programming paradigm (Pattamsetti, 2017) ... 25
Figure 6 - Scheduling and synchronisation of a superstep in the bulk synchronous parallel model (Okada, Amaris, & Goldman, 2015) ... 28
Figure 7 - Architecture of the Hadoop Distributed File System (Apache Software Foundation, 2020) ... 30
Figure 8 - Example of a stream partitioned using a tumbling window ... 33
Figure 9 - Example of a stream partitioned with a sliding window ... 33
Figure 10 - Example of a Storm topology that shows the link between spouts and bolts (Apache Software Foundation, 2020) ... 34
Figure 11 - A Data Lake and possible surrounding systems that interact with the data (Amazon Web Services, 2020) ... 40
Figure 12 - High-level concept of the risk analysis application ... 44
Figure 13 - Explanation of how a geohash is built (PubNub, 2020) ... 47
Figure 14 - Architecture for the Spark Streaming implementation ... 48
Figure 15 - DStream and its interaction with windowing operators (Apache Software Foundation, 2020) ... 49
Figure 16 - Spark Streaming Metrics: Kinesis Data Stream PutRecords vs GetRecords ... 51
Figure 17 - Spark Streaming Metrics: Kinesis Data Stream IteratorAge ... 52
Figure 18 - Spark Streaming Metrics: Input Rate, Scheduling Delay and Processing Time ... 53
Figure 19 - Spark Streaming Scheduling Delay because of a misconfigured batch interval ... 55
Figure 20 - Architecture for the Kinesis Analytics implementation ... 56
Figure 21 - Kinesis Analytics Solution: CPU & Memory Utilisation ... 59
Figure 22 - Kinesis Analytics Solution: Records processed per second ... 60
Figure 23 - Kinesis Analytics Solution: Fault tolerance through checkpointing ... 60
Figure 24 - Analysis Result shown as a heatmap in Kibana ... 63
Figure 25 - Variance of velocity around Vienna (left) & count of records around Vienna (right) ... 63

List of equations

Equation 1 - Relation between the batch view, real-time view and how the data is queried (Marz & Warren, 2015) ... 17
Equation 2 - Formula to calculate the number of shards needed for a Kinesis Data Stream (Amazon Web Services, 2019) ... 20
Equation 3 - Interface and return value of the map and reduce functions ... 25
Equation 4 - AggregateFunction Interface Definition ... 57

List of code listings

Code Listing 1 - Simulation.step method that executes the steps for all agents and sends the result to a Kinesis Data Stream using the AWS SDK ... 45
Code Listing 2 - Data preparation and aggregation using Spark SQL before using es-hadoop to save the data ... 50
Code Listing 3 - Stream windowing and data aggregation using Apache Flink ... 58

List of tables

Table 1 - Possible customer base for a real-time insurance product and the number of car rides for that product ... 10
Table 2 - Measured latency in the Spark Streaming solution ... 51
Table 3 - Spark Streaming Solution Pricing per Service ... 54
Table 4 - Spark Streaming Solution hourly/monthly price ... 54
Table 5 - Measured latency in the Kinesis Analytics Solution ... 58
Table 6 - Kinesis Analytics Solution Pricing per Service ... 61
Table 7 - Kinesis Analytics Solution hourly/monthly price ... 61
