Risk Assessment Through Real-Time Data Analysis Using Big Data Streaming in AWS
Master Thesis
Submitted in partial fulfilment of the requirements for the degree of
Author:
Elias Dräxler, BSc (hons)
Date:
31.05.2020
Declaration of authorship:
I declare that this Master Thesis has been written by myself. I have not used any other than
the listed sources, nor have I received any unauthorised help.
I hereby certify that I have not submitted this Master Thesis in any form (to a reviewer for
assessment) either in Austria or abroad.
Furthermore, I assure that the (printed and electronic) copies I have submitted are identical.
First, I would like to thank my supervisor Dr. Sigrid Schefer-Wenzl for the input and
guidance that helped throughout my research.
I also wish to thank Rene Schakman and my colleagues at the viesure innovation center
for supporting this research.
Finally, I would like to thank Anna Kirchgasser and my whole family for supporting me
throughout the process of writing this thesis.
Abstract
The competitive advantage that companies can gain by utilizing their data is one of
the most significant changes across many industries, including the finance and
insurance sector. Being able to utilize available data brings benefits for both the
customers and the insurance companies. Research has shown that traditional
insurance models cannot compete with a data-centric approach. This study aims to
discover possible architectures and implementations for Big Data real-time analysis.
It focuses on real-time analysis in the context of insurance risk calculation, more
specifically on telematics data that could be used to calculate car insurance rates
in real-time. Different approaches to implementing such a system are compared,
specifically in the context of cloud-native and on-premise setups.
Based on a review of the literature on Big Data and real-time processing, two
solutions were chosen and implemented. Furthermore, the advantages and
disadvantages of both approaches were evaluated and compared using multiple
metrics, such as latency and performance. Both implementations make use of a
Kappa architecture, implemented once using Apache Spark Streaming and once
using Apache Flink, with AWS as the cloud provider. The results indicate that
Apache Flink, as one of the first true stream processing frameworks, could deliver
up to 170% of the performance of Apache Spark at only 44% of the cost of the
competing implementation. Therefore, it is recommended to use Apache Flink with
AWS Kinesis Analytics when implementing real-time analysis in the cloud.
List of abbreviations
Key terms
Big Data
Fast Data
Big Data Streaming
Big Data Analysis
Apache Spark
Apache Flink
Amazon Web Services
AWS Kinesis Analytics
Risk Analysis
Lambda Architecture
Kappa Architecture
Table of Contents
Preface .......................................................................................................................... i
Abstract ........................................................................................................................ ii
List of abbreviations .................................................................................................... iii
Key terms .....................................................................................................................iv
1. Introduction .......................................................................................................... 8
1.1 Risk Analysis ............................................................................................................. 8
1.2 Research Question .................................................................................................... 9
1.3 How to compare approaches ..................................................................................... 9
1.4 Structure ................................................................................................................. 11
2. Concepts ............................................................................................................. 11
2.1 Big Data .................................................................................................................. 11
2.1.1 Characteristics ........................................................................................................................ 11
2.1.2 Application of Big Data ........................................................................................................... 12
2.1.3 Security and Privacy ............................................................................................................... 12
2.2 Fast Data ................................................................................................................. 13
2.3 Data Streaming ....................................................................................................... 13
2.4 Cloud-native............................................................................................................ 14
3. Architecture ........................................................................................................ 15
3.1 Lambda Architecture ............................................................................................... 15
3.1.1 Batch Layer ............................................................................................................................. 16
3.1.2 Serving Layer .......................................................................................................................... 16
3.1.3 Speed Layer ............................................................................................................................ 16
3.1.4 Integrated Layers .................................................................................................................... 16
3.1.5 Adoption and Implementation ............................................................................................... 17
3.2 Kappa Architecture ................................................................................................. 18
3.2.1 Layers ..................................................................................................................................... 18
3.2.2 Adoption & Implementation .................................................................................................. 18
3.3 Other Architectures................................................................................................. 19
4. Data Streaming ................................................................................................... 19
4.1 Introduction ............................................................................................................ 19
4.2 AWS Kinesis ............................................................................................................ 19
4.3 Apache Kafka .......................................................................................................... 22
5. Algorithms & Frameworks ................................................................................... 24
5.1 Introduction ............................................................................................................ 24
5.2 MapReduce ............................................................................................................. 24
5.2.1 Programming Model............................................................................................................... 25
5.2.2 MapReduce for real-time processing ..................................................................................... 25
5.3 Bulk Synchronous Parallel ....................................................................................... 27
5.4 Apache Hadoop ....................................................................................................... 28
5.4.1 AWS EMR ................................................................................................................................ 30
5.5 Apache Spark .......................................................................................................... 31
5.6 Apache Flink............................................................................................................ 32
5.7 Apache Storm ......................................................................................................... 33
5.8 Comparison: Spark vs Flink vs Storm ....................................................................... 34
6. Storage ............................................................................................................... 35
6.1 Introduction ............................................................................................................ 35
6.2 NoSQL Databases .................................................................................................... 35
6.2.1 Elasticsearch ........................................................................................................................... 37
6.3 NewSQL Databases ................................................................................................. 37
6.4 Cloud Storage .......................................................................................................... 39
6.5 Data Lake ................................................................................................................ 39
6.5.1 AWS Lake Formation .............................................................................................................. 41
There are a few factors that influence the willingness of customers to share their
data with a company. Improved transparency around privacy policies and terms
and conditions will help the customers to gain trust in the product and the business
as a whole. One example would be to write the terms and conditions in a way that
people can understand them and are not overwhelmed by legal language. The
second step to build a trust relationship with the customers is to communicate the
commitment to data privacy and security. A company can have the best data
security, but if their customers do not know about the commitment, it does not
provide any competitive advantages. The easiest and most effective way to get the
customers to share their data with a company is to offer value in exchange for their
data. No customer will provide their data if they gain no benefits from it (Harris,
2018).
Using telematics and IoT data available to the customer, both the insurer and the
customers can gain advantages. Baecke & Bocca (2017) state that even the usage
of telematics data without any traditional variables, such as customer-specific data,
car-specific data or the claim history, outperforms the basic insurance model.
The ability to use and analyse Big Data is becoming one of the core competitive
factors for the future of any financial institution (Liu, Peng, & Yu, 2018).
To be able to provide insurance services based on telematics data, the backend
systems must be able to handle vast amounts of sensor data, such as geo-location
or acceleration information. Further, it must be possible to do real-time analysis on
the incoming data, to calculate dynamic risk models and perform fraud detection.
Using the accumulated data for forecasts and data insights, insurers can optimise
the dynamic risk calculations further.
IoT and InsurTech will enable insurers to embrace the shift to paying in advance:
the shift to prevention rather than cure (Huckstep, 2019).
According to Gartner, there will be an average of 500 smart devices in every home
by 2022, and even today, most cars already have accessible interfaces to use the
telematics data. Accenture wrote in its insurance blog that roughly 39 per cent of
insurance companies started to offer services based on IoT devices, and another
44 per cent are thinking about launching products in this area. (Huckstep, 2019)
1.2 Research Question
The goal of this thesis is a comparison of different Big Data streaming architectures
in the context of real-time insurance risk evaluation.
Two architectures for real-time analysis of millions of data points will be
implemented. The architectures will be compared using multiple parameters, such
as latency, implementation cost, elasticity, advantages and disadvantages, and
other benchmarks that will be explained in Chapter 1.3.
The analysed data will consist of GPS location data and acceleration data. For the
purpose of showcasing the solution, the results of the risk evaluation should be
available in real-time and shown, for example, on a heatmap chart.
To find a solution that fits the problem stated in the research question, this thesis
will examine multiple architectures and technologies. First, a few suitable
architectures have to be evaluated and checked as to whether they can be tailored
to real-time analysis of the risk data. Following that, suitable solutions that can be
used in the context of these architectures in the fields of streaming, real-time
analysis, storage and presentation of the data have to be evaluated. After the
evaluation, two architectures with their particular technology choices will be
implemented and compared. The choices will be made with feasibility in a
real-world scenario in mind.
Both architectures should be implemented using AWS cloud-native services, where
possible, and open-source software.
Multiple factors have to be considered to get an idea of the volume that a real-time
risk analysis insurance calculation would produce. There are 7.8 million vehicles
registered in Austria (STATISTIK AUSTRIA, 2019). According to VCÖ (2018), there
are 3.5 million car rides to get to work and back home every day. This number is
from 2018, but it is sufficient for an estimation of the possible volume. Assuming
a 25 per cent market share for the insurance company that wants to provide such
an insurance model and a 10 per cent adoption rate results in 87,500 customers
that would provide data for the real-time risk analysis; the calculation can also be
seen in Table 1. If every customer sends one data point per second, this results in
87,500 data entries per second and therefore in 315 million data points per hour.
Based on an average size of 30 bytes per data entry, this will produce roughly
9.45 GB per hour.
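This back-of-the-envelope estimate can be reproduced in a few lines of Python; the figures are the ones cited above, with market share and adoption rate being the stated assumptions:

    # Volume estimate from the figures cited above; market share and
    # adoption rate are the assumptions stated in the text.
    daily_rides = 3_500_000        # commuter car rides per day (VCÖ, 2018)
    market_share = 0.25            # assumed insurer market share
    adoption_rate = 0.10           # assumed customer adoption rate
    record_size_bytes = 30         # average size of one data entry

    customers = daily_rides * market_share * adoption_rate    # 87,500
    records_per_hour = customers * 3600                       # 315,000,000
    gb_per_hour = records_per_hour * record_size_bytes / 1e9  # ~9.45 GB

    print(f"{customers:,.0f} customers -> {records_per_hour:,.0f} records/h "
          f"-> {gb_per_hour:.2f} GB/h")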
Based on the cited data, it is crucial to pick an architecture that scales well
horizontally. The two implementations will be tested by producing a load of 10,000
data points per second. The reason behind downscaling the load is the cost of the
associated streaming solutions. Nevertheless, the chosen architectures need to be
able to scale to a much larger number. This can be achieved by scaling
horizontally, which means that, if needed, it is possible to add more servers to the
cluster, in contrast to scaling vertically by adding more resources, such as a more
powerful CPU. Horizontal scaling is essential when working in a cloud environment,
as it is a lot easier to provision hundreds of servers in just a few seconds.
All the tests must be done using comparable infrastructure for both Big Data
architectures. As the implementations need different hardware and cloud services,
it is essential not to “overscale” one solution, as this would make comparing the
implementations impossible. Where possible, the same servers, database instance
tiers and network bandwidth settings should be used. If no comparable hardware or
service level is available, it is essential to compare the costs of creating a
comparable solution.
1.4 Structure
This thesis is structured as follows. Chapter 2 discusses the most important
concepts that are needed throughout the thesis. Chapter 3 provides an overview of
architectures that could be used for real-time Big Data analysis, especially the
Lambda and the Kappa architecture. Chapter 4 discusses streaming platforms, such
as Apache Kafka and AWS Kinesis. Chapter 5 introduces data processing
algorithms that enable the mentioned architectures and could be used to implement
the data analysis. Further, it includes an evaluation and comparison of several
frameworks for the data processing part of the architectures, like Apache Hadoop,
Apache Spark, Apache Flink and Apache Storm. In Chapter 6, different storage
solutions are presented, with topics ranging from NoSQL databases to the
implementation of Data Lakes. In Chapter 7, the findings of the preceding
chapters are used to define, implement, test and benchmark two different Big Data
analytics architectures. Chapter 8 concludes the work presented in this thesis and
outlines possible directions for future work.
2. CONCEPTS
2.1 Big Data
2.1.1 Characteristics
The term Big Data has increased in relevance since Roger Mougalas from O’Reilly
Media first mentioned it. The term refers to vast sets of data that are difficult to
manage, store, process and load (Rijmeam, 2013).
Big Data has grown massively over the last years. To put this growth into
perspective: up until 2003, the whole human population had created five exabytes
of data; in 2013, according to IBM, only two days were needed to create the same
amount of data, and the amount of generated data keeps increasing at an
incredible rate (Sagiroglu & Sinanc, 2013). In 2018, Domo published its sixth report
on Big Data, which shows examples of how much data is generated every minute.
In this report, they estimated that in 2020 every person on earth would create
1.7 megabytes of data every second (Domo Inc., 2018).
There are many articles about the characteristics of Big Data; most of them sum up
the “Vs” of Big Data. How many V characteristics there are, and which are worth
mentioning, is an open question. The four most used terms that characterise Big
Data, known as the 4Vs, are variety, velocity, volume and veracity (Big Data
Framework, 2019). Variety describes how data is represented and is one of the
defining factors for Big Data. Data comes in three forms: structured, semi-structured
and unstructured. Structured data is easy to work with as it is already tagged,
classified and annotated. Semi-structured data does not contain fixed fields but
contains tags to separate the data elements from each other. The real struggle with
Big Data is the handling of unstructured data, as it is random and difficult to analyse.
Volume is all about the amount of data: traditional database, storage, query and
analysis techniques cannot be used on terabytes or petabytes of data, so new
techniques are needed to tackle these vast amounts. Velocity describes
the way the data is handled, whether batch processing or real-time analytics is
used to get the necessary information as soon as possible (Sagiroglu & Sinanc,
2013). The last V stands for veracity, which refers to the quality of the data being
analysed. Quality is defined here as how many data records are valuable and
contribute in a meaningful way to the result of the analysis. If data has low veracity,
a high percentage of it is meaningless and therefore just noise that can be ignored.
Big Data mostly refers to datasets with high volume, high velocity and high variety,
which makes it nearly impossible to process this data with traditional tools. (Big
Data Framework, 2019)
New methods and techniques were created to deal with the vast amount of data.
The industry came up with multiple new architectures to support Big Data. One of
these is MapReduce, a programming framework implemented by Google that uses
a divide-and-conquer approach; more about MapReduce can be found in
Chapter 5.2 (Sagiroglu & Sinanc, 2013).
In 2005 Yahoo! built Hadoop, inspired by Google’s MapReduce, to index the entire
World Wide Web. Today it is known as the open-source project Apache Hadoop
that is used by organisations all around the world for their Big Data processing
needs. More about Hadoop can be found in Chapter 5.4. (Rijmeam, 2013)
The concerns regarding Big Data are the targeted prediction of a person’s state
and behaviour. One way of protecting the user’s privacy is anonymisation.
Nevertheless, anonymous protection is not enough to adequately protect the
privacy of users. The current usage of Big Data, coupled with the lack of self-
protection awareness among users, can easily cause information leakage (Zhang,
2018). Data mining and predictive analytics have a big impact on the privacy of
users. Both techniques are used to discover intercorrelated data. Information
linkages can bring advantages for companies, but on an individual basis, the
discoveries of these processes can lead to the exposure of the identity of the data
providers (Grolinger, et al., 2014).
Another issue when thinking about Big Data security is the way people think about
their data. Mostly, the data is treated as fact. In reality, someone can forge data
and thereby manipulate the outcome and decisions. One could intentionally
fabricate malicious data and thereby create a reality beneficial to oneself. To be
able to trust Big Data, it is crucial to ensure credibility. Not only malicious activity
can change the outcome of Big Data, but also distortion during processing and
propagation. Therefore, it is essential to be able to ensure the reliability and
authenticity of the data used (Zhang, 2018).
Streaming data can provide near real-time insights that are crucial for businesses
that want to react to change as quickly as possible. Reacting to change is not the
only benefit companies can gain from streaming data and the analytical capabilities
that come with it. For a long time, the most important basis for decisions was data
from the past, mostly reports generated from old data that may, or may not, reflect
the current situation. Using streaming data and real-time analytics, it is possible to
create better forecasts and, therefore, better outcomes (Nagvanshi, 2018).
Stream processing differs from batch processing in multiple ways, the most
significant being that the data scope is entirely different. In regular batch
processing, it is possible to run queries or processing over the whole data set, as
opposed to stream processing, where the queries and the processing are only done
over the most recent data, regardless of the techniques used. The data set that has
to be taken into account for batch processing is a lot bigger than for stream
processing and therefore also more difficult to handle. The critical factor for stream
processing is low latency, which should not be higher than a few milliseconds; this
means that the insights gained from the data are available in near real-time, in
contrast to the long delays inherent in batch processing (Amazon Web Services,
2020). As not only the real-time but also the holistic view is vital for most
businesses, many run a hybrid approach such as the Lambda architecture, which
will be discussed in Chapter 3.1.
When streaming data, it is not always possible to define the exact volume and
velocity of the input data, which makes it challenging to define how many resources
are needed to process the data stream. Therefore, streaming is another prime
example for cloud-based processing, as it takes only seconds to scale the used
computing resources up or down. There are even purely cloud-based solutions; an
example would be AWS Kinesis for streaming in combination with Kinesis Analytics
for running queries on the data stream. Unfortunately, not all data streaming
platforms are as flexible as cloud platforms. Van der Veen, van der Waaij, Lazovik,
Wijbrandi, & Meijer (2015) mention, for example, the Apache Storm platform:
although it is one of the leading data stream analytics platforms, it still lacks the
capability to scale by itself. The authors proposed and created a tool that sits on
top of the platform, monitors the application running on Apache Storm as well as
external systems such as queues and databases, and decides based on this data
whether additional resources are needed to process the data. This shows how vital
scaling is when assessing Big Data.
2.4 Cloud-native
The term cloud-native will be used throughout this thesis to describe particular
application and architecture characteristics, but what exactly does it mean?
Using these techniques and having a high degree of automation, it is possible to
build loosely coupled, resilient, manageable and observable systems (Cloud Native
Computing Foundation, 2018). Cloud-native applications and architectures try to
satisfy the requirements of customers who expect rapid development,
responsiveness, innovative features and zero downtime. Not being able to satisfy
these requirements means that customers will simply use the product of the next
competitor. A cloud-native system should be able to take full advantage of cloud
services. Making use of PaaS and SaaS, these systems are mostly deployed in
highly dynamic cloud environments. Is a server unavailable? Provisioning a new
one takes only minutes, and nobody will notice. Does a service need a new
database? No problem using a SaaS model. To reach this kind of autonomy, all
processes have to be automated (Microsoft, 2019).
3. ARCHITECTURE
This chapter discusses different suitable architectures for implementing Big Data
streaming. All architectures could be implemented in the cloud and should be able
to process the needed amount of data. The chapter covers both cloud-native
architectures and traditional architectures that run in an on-premise datacentre.
The discussion in this chapter contributes to the decision, made in Chapter 7,
about which architecture should be implemented.
Figure 1 - Speed, Serving and Batch Layer of the Lambda Architecture (ITechSeeker, 2019)
By querying both the batch views and the real-time view, it is possible to get a
complete view of the data. The data is continuously added to both the batch layer
and the speed layer; once the batch view includes the new data, it is dropped from
the real-time view, which makes it easier to handle the continuous data flow (Marz
& Warren, 2015).
The Lambda architecture is a good fit when aiming for fault-tolerance against
hardware failures and human mistakes, and it also has advantages in the
computation of arbitrary functions on real-time data. The trade-off that comes with
these advantages is high complexity and redundancy. The different frameworks that
are needed to implement the batch, speed and serving layers are in themselves
highly complex, and the combination of the layers does not help in decreasing the
complexity. Maintaining all the layers and keeping the batch and speed layer
synchronised is no easy task in a fully distributed architecture. All in all, the Lambda
architecture does its job exceptionally well but introduces high complexity.
Therefore, it is essential to consider whether most of the use cases really need
both a batch and a speed layer (Feick, Kleer, & Kohn, 2018).
3.2 Kappa Architecture
The Kappa architecture was introduced by one of the original authors of Apache
Kafka, Jay Kreps. Kreps (2014) wrote a blog post about the Lambda architecture
and the already mentioned disadvantages of that highly complex architecture. In
the same blog post, he proposed another approach for real-time data processing
that is in some ways inspired by the Lambda architecture but favours simplicity.
The resulting Kappa architecture is easier to implement. It places greater focus on
development-related subjects, such as implementation, debugging and code
maintenance, as there is no need to implement two systems that work together
(Feick, Kleer, & Kohn, 2018).
3.2.1 Layers
As opposed to the Lambda architecture, the Kappa architecture only has a Real-
Time Layer and a Serving Layer, as shown in Figure 2. The input comes from a
data stream such as Apache Kafka or AWS Kinesis and is fed into a stream
processing system. This is the real-time layer, which can be compared to the speed
layer of the Lambda architecture. In the real-time layer, the stream processing jobs
are executed on the incoming data, providing real-time data processing. After the
data has been processed, it enters the serving layer, which makes it possible to run
queries on the data. The implementation of the two layers does not differ from the
implementation that would be needed for a Lambda architecture; the only thing
missing is the batch layer. This is justified by the presumption that most
applications do not need the entirety of the data but just a large enough set of the
most recent data. By dropping the batch layer, the architecture becomes a lot
simpler and easier to handle, but this cannot be done without constraints. Dropping
the batch layer means that it is not possible to query the entire dataset easily, as
the whole dataset would have to be streamed again before it could be queried. The
trade-off is to lose accuracy for reduced complexity (Feick, Kleer, & Kohn, 2018).
The Kappa architecture is better suited for use cases where speed is essential and
the accuracy loss is negligible.
4. DATA STREAMING
4.1 Introduction
The following chapter discusses different frameworks, services and products in their
respective categories. For data streaming, AWS Kinesis and Apache Kafka will be
highlighted, as these two are the prime examples of high-volume data streaming,
and both are available as managed services in AWS.
Kinesis Data Streams is mostly used for applications that need high-bandwidth
continuous data intake and aggregation. Some typical scenarios include log data
ingestion or real-time metrics, where producers push data directly into a stream.
Therefore, the data is available immediately, and no data is lost upon server failure.
Another use case is real-time data analytics, for example, in the form of a
clickstream that is processed in real-time to analyse the usability of a website.
All these scenarios are made possible by the following benefits.
• Kinesis streams are durable and elastic.
• Latency is low: the delay between a record being put into the stream and it
becoming readable for a consumer is less than one second.
• Multiple consumers can consume the same data stream; therefore, multiple
actions such as archiving, processing and aggregation can be done
concurrently and independently.
Figure 3 shows a standard Kinesis Data Stream application, with one producer that
writes data into the stream and multiple consumers that consume the records from
the shards. To help with building such applications, there are two libraries, the
Kinesis Producer Library (KPL) and the Kinesis Client Library (KCL). The KPL acts
as an intermediary between the application and the Kinesis Data Streams API. It
includes an automatic retry mechanism that is important when a high number of
records is sent to the stream. Further, it provides features such as collecting
records before sending them to different shards, aggregating records to increase
payload size and throughput, and multiple CloudWatch metrics to provide
observability.
On the consumer side, the KCL helps applications read from a data stream. It
handles all the logic for connecting to multiple shards and pulling data from the
stream; further, it handles checkpointing and the de-aggregation of records that
were aggregated by the KPL (Amazon Web Services, 2019).
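To make the producer side concrete, the following minimal sketch writes a single telematics record to a stream using the plain AWS SDK for Python (boto3) instead of the KPL; the stream name, region and record layout are illustrative assumptions, not details of the implementation in Chapter 7:

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="eu-central-1")

    def send_datapoint(vehicle_id, lat, lon, acceleration):
        """Push one telematics record into the stream. The vehicle id is used
        as partition key, so all records of one vehicle land on the same shard."""
        record = {"vehicleId": vehicle_id, "lat": lat, "lon": lon,
                  "acceleration": acceleration}
        kinesis.put_record(
            StreamName="telematics-stream",            # assumed stream name
            Data=json.dumps(record).encode("utf-8"),
            PartitionKey=vehicle_id,
        )

    send_datapoint("AT-1234", 48.2082, 16.3738, 0.3)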
Figure 3 - Kinesis Data Stream with n shards that are consumed by multiple consumers.
Using Kinesis Data Streams, consumers can make use of the enhanced fan-out
feature. This creates a logical 2 MB/sec throughput pipe between consumers and
shards. Consumers can decide whether they want to use enhanced fan-out, as it
comes with additional costs but in return provides sub-200 ms latency between
producers and consumers. Using this feature makes it possible to have multiple
applications reading from the data stream while still maintaining high performance.
Besides Kinesis Data Streams, as already mentioned, there are three other
services under the Kinesis platform. Kinesis Video Streams makes it easy to
stream video from devices to AWS and therefore enables the user to use these
videos for analytics, machine learning and other video processing tasks. Kinesis
Data Firehose helps to capture, transform and load data streams into AWS data
stores. The supported data stores include Amazon S3, Amazon Redshift, Amazon
Elasticsearch Service and Splunk. It provides five times the input speed of Kinesis
Data Streams, with up to 5 MB/sec. The last service in the Kinesis family is
Kinesis Data Analytics. It provides a comfortable, straightforward way to analyse
streaming data and thereby helps the user gain insights that can be used to
respond to their business and customers in real-time. It provides the possibility to
execute SQL queries and sophisticated Java applications that use operators for
standard stream processing functions to transform, aggregate and analyse data at
any scale. Developers can use the AWS SDK or Apache Flink to write applications
that run in Amazon Kinesis Data Analytics.
Nguyen, Luckow, Duffy, Kennedy, & Apon (2018) compared Amazon Kinesis to
Apache Kafka, which will be discussed thoroughly in Chapter 4.3, in the context of
a highly available cloud streaming system. The authors compared multiple aspects,
such as throughput while using a different number of shards in Kinesis and
partitions in Kafka. Further, they compared the costs of such a cloud-based
streaming solution. They did their tests with 1, 2, 4, 8, 16 and 32 shards/partitions
and six different data velocities. The comparison was done across multiple
dimensions, such as reliability, performance and costs. When comparing the
throughput, Kafka can achieve high values with only a single partition, while Kinesis
scales massively with the number of available shards. For both streaming platforms,
the consumer performance scales with the number of shards/partitions. When
comparing costs, Kinesis is around four times cheaper than Apache Kafka for a
message size of 10 KB, and this difference only increases with increasing message
size. For smaller message sizes, the costs for Kafka are nearly the same as for
Kinesis; the price for a Kafka system lies between the Kinesis prices for record
sizes of 1 KB and 3 KB. Nevertheless, Kafka requires more knowledge to set up in
a reliable, fault-tolerant way, as it needs a lot more configuration than the
cloud-native, fully managed Kinesis Data Streams (Nguyen, Luckow, Duffy,
Kennedy, & Apon, 2018).
The two most common use cases for Apache Kafka are real-time streaming data
pipelines that transport data from one system or application to another, and real-
time streaming applications that transform or react according to the data streams.
The core abstraction of Kafka is a topic. Records are published to topics, and
therefore a topic is a stream of records. A topic, as known from other publish-
subscribe models, can have multiple subscribers and at its core is a partitioned
append-only log. Each partition in a topic is an ordered, immutable sequence of
records that is always appended to, also called a structured commit log. Every
record within a partition has a sequence number that is unique within the partition;
this number is called the offset. All records published by a producer to a topic are
durably persisted by the Kafka cluster for a configurable retention time.
The two main actors in a publish-subscribe model are the producer, which adds
records to the data stream, and the consumers, which read these records and
process the data. A producer pushes its data to a broker, which is a server
within the Kafka cluster. The consumers work on a pull basis, and every consumer
manages its own offset. Typically, consumers read the messages linearly, but as
shown in Figure 4, every consumer manages its own offset and can therefore read
any record it wants to (Apache Software Foundation, 2017).
Figure 4 - Partition with records that have a unique sequence number and consumers that
use the offset to read any record from the partition (Apache Software Foundation, 2017)
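The offset mechanics shown in Figure 4 can be illustrated with a short sketch using the kafka-python client; the broker address, topic name and offset value are assumptions made for the example:

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             enable_auto_commit=False)
    partition = TopicPartition("telematics", 0)   # assumed topic, partition 0
    consumer.assign([partition])

    # Each consumer manages its own offset, so it can jump to any record:
    consumer.seek(partition, 42)        # re-read everything from offset 42
    for message in consumer:
        print(message.offset, message.value)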
One benefit of Apache Kafka is that it enables stream processing: a stream
processor takes continual streams of data from one or multiple input topics,
performs processing and produces a continual data stream that is sent to one or
more output topics. To write sophisticated stream processors, Apache Kafka
provides the Streams API, also known as Kafka Streams.
Apache Kafka does not only excel in stream processing but is also used as a
messaging system. There are many messaging system implementations, and all
have their respective advantages and disadvantages. One of the most used
messaging systems is RabbitMQ, which is developed as an open-source project by
Pivotal. It also uses brokers and queues messages before they are sent to the
clients. This queuing opens the possibility for message routing, load balancing and
data persistence. RabbitMQ also supports a wide range of protocols, such as
AMQP. In practice, RabbitMQ is mostly used when developing enterprise systems.
Another messaging system, used by Twitter, is ZeroMQ. ZeroMQ is used to
develop high-throughput systems but brings a lot of complexity when working with
it. Another problem with ZeroMQ is that messages are not persisted, which implies
that if the system goes down, messages can be lost. The last comparable
messaging solution is ActiveMQ, which is also a project of the Apache Software
Foundation. It implements message queues by using brokers and can provide
point-to-point messaging. Problems arise for ActiveMQ when high throughput is
needed, because every sent message carries high overhead, as the message
headers are large. Compared to these messaging systems, Apache Kafka can
provide solutions to a few of the mentioned problems. For example, in Kafka, as
already mentioned, all messages are saved to disk; therefore, they remain
persistent even after a consumer reads them. Another advantage that Kafka has
over other messaging systems is that the brokers do not have to maintain any state
other than offsets and messages, because each consumer manages its own state.
All these techniques help Kafka reach the high throughput that was demonstrated
at LinkedIn.
5.2 MapReduce
In 2004, Google presented to the world its algorithm for data processing on large
clusters: MapReduce. At that time, not many companies had as much data as
Google. Google had massive amounts of data that had to be processed, such as
crawled documents or web request logs. The challenge was how to do
computations on a large amount of input data across thousands of machines. As
parallel computing and distributed data are challenging to handle, they created an
abstraction that hides all the complexity of parallelisation, fault-tolerance, load
balancing and data distribution in a library. The MapReduce library was inspired by
the map and reduce primitives known from functional programming languages such
as Lisp. Using these abstractions, it is easy to process lists of values that are too
large to fit in the memory of a single machine. Example applications that can easily
be written as MapReduce programs are inverted indices, distributed sorting and
distributed grep (Jeffrey & Sanjay, 2004).
An example of a distributed word count and how the different phases influence the
data can be found in Figure 5.
Figure 5 - Example of how a word count application would work using the MapReduce
programming paradigm (Pattamsetti, 2017)
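The phases shown in Figure 5 can be mimicked in a few lines of Python. This toy sketch only makes the map, shuffle/sort and reduce steps concrete; it is not how a distributed MapReduce framework is implemented:

    from itertools import groupby
    from operator import itemgetter

    documents = ["deer bear river", "car car river", "deer car bear"]

    # Map phase: emit a (word, 1) pair for every word in every document.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle/sort phase: group the pairs by key (the word).
    shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))

    # Reduce phase: sum the counts for every word.
    reduced = {word: sum(count for _, count in pairs)
               for word, pairs in shuffled}

    print(reduced)   # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}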
In the context of real-time processing, MapReduce has several limitations:
• MapReduce computations work on batches rather than data streams.
• MapReduce computations are snapshots of data stored in files, as opposed
to data streams where new data is generated all the time.
• File operations add latency.
• Some computations cannot be efficiently expressed using the MapReduce
programming paradigm.
Even with all the limitations mentioned above, there is still much work being done
to make data stream processing possible using MapReduce (Grolinger, et al.,
2014). Other projects, such as the improved Hadoop MapReduce framework
implemented by Condie, et al. (2010), tried to overcome the limitations imposed by
batch processing. The authors extended the MapReduce programming model by
pipelining data between the two operators, which allowed data to be delivered more
promptly to the operators and therefore reduced the response time.
Cheng-Zhang, Ze-Jun, Xiao-Bin, & Zhi-Ke (2012) discussed the viability of the
MapReduce approach and concluded that the standard approach, as implemented
by most frameworks such as Apache Hadoop, is not suitable for real-time data
processing. The first issue that the authors identify concerns dynamically generated
data. For a typical job run on Hadoop, all the data has to be present on the Hadoop
Distributed File System (HDFS); in the case of real-time analytics, the data is
generated on the fly by external systems. The second issue is that only a small
part of all the data needs to be analysed. For real-time analytics, the data is time-
correlative; this means that data with the same key should not be aggregated if the
correlation of the timestamps is not given. The authors' solution to implement real-
time analytics using MapReduce included two big adaptations to the Hadoop
framework. First, they modified the programming model to exclude the shuffle and
sort phase and push intermediate data, including a timestamp, to the reduce
function. The second change was to move from HDFS to a more appropriate
persistent data store that can handle the key-value pairs more efficiently; the
authors decided to use Cassandra for this task (Cheng-Zhang, Ze-Jun, Xiao-Bin,
& Zhi-Ke, 2012). The observation that the sort and merge phases are the most
severe problems during real-time analytics is also verified by Li, Mazur, Diao,
McGregor, & Shenoy (2011). In their paper, the authors describe two key
mechanisms that were implemented into MapReduce. The first one replaces the
sort-merge mechanism with a hash-based framework, which removes the blocking
nature of the algorithm and brings benefits in terms of computation and I/O
performance. The second measure to make MapReduce on Hadoop viable for real-
time processing tackles the problem of expensive I/O operations: by using a
technique that stores frequently used keys in memory and therefore minimises disk
operations, the reduce step can keep up with the map operation. Using these two
optimisations, the authors could return results earlier and reduce internal data spills
(Li, Mazur, Diao, McGregor, & Shenoy, 2011).
Another approach, used by multiple projects such as Twitter's Storm and Yahoo!'s
S4, was to abandon the MapReduce programming paradigm but still use the same
runtime platform and adopt event-by-event processing. Another alternative to
classic MapReduce is Apache Spark Streaming; this data stream processing
framework works with small batches and does all the computation on these batches
(Grolinger, et al., 2014).
5.3 Bulk Synchronous Parallel
Other than MapReduce, there is also Bulk Synchronous Parallel Computing
(BSPC), which was first described in 1990. Bulk Synchronous Parallel is neither a
programming model nor a hardware model; it lies in between. The Bulk
Synchronous Parallel model can be defined as a combination of three attributes:
1. Multiple components that all perform some kind of processing and memory
functions.
2. A router that takes care of message handling by distributing messages
between the components.
3. The synchronisation of all the components at regular intervals.
The computation is defined as a sequence of supersteps, where in each superstep
all components are assigned tasks that consist of processing work, sending
messages and consuming messages sent from other components. After a specific
period of time, a global check is made to determine whether the components have
completed their task and, therefore, the superstep is finished. If the superstep has
finished, the next superstep is executed (Valiant, 1990). This synchronisation step
can be seen in Figure 6.
Google created a framework named Pregel that uses the BSP processing model.
Pregel itself is no longer used, but it was the inspiration for multiple open-source
projects that are still developed today, for example Apache Hama and Apache
Giraph. In the typical BSP model, computations are done by executing multiple
supersteps one after another, and in every superstep a user-defined function is
executed on every item of the dataset. In the newer implementations like Pregel
and Apache Hama, every agent computation has a graph representation in BSP
that consists of an identifier for the node, its value and state, and all outgoing
edges; together these form a vertex. Before the computation, all vertices are loaded
into the local memory of the machines and stay there for the entire computation,
which has the advantage that all computations are done using local memory. As
already described in the explanation of the BSP model, a vertex consumes
messages from other vertices and executes the user-defined function. In contrast
to the MapReduce model, only one function, compute(), is defined. Executing the
function, the vertex performs local computations and produces messages for its
neighbours in the graph. After the vertex is finished, it waits for all the other
vertices to finish (Kajdenowicz, Indyk, Kazienko, & Kubul, 2012).
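The vertex-centric model can be sketched in plain Python under simplifying assumptions (a three-vertex graph held in a single process). The sketch propagates the maximum vertex value through the graph, a classic Pregel example; each loop iteration is one superstep, and swapping the message buffers acts as the global barrier:

    graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}   # assumed toy topology
    values = {"A": 3, "B": 6, "C": 2}                   # initial vertex values
    inbox = {v: [] for v in graph}

    superstep = 0
    while True:
        outbox = {v: [] for v in graph}
        changed = False
        for vertex in graph:                # compute() runs on every vertex
            incoming = inbox[vertex]
            if superstep == 0 or incoming:  # vertices without messages stay idle
                new_value = max([values[vertex], *incoming])
                if superstep == 0 or new_value > values[vertex]:
                    values[vertex] = new_value
                    for neighbour in graph[vertex]:   # send along outgoing edges
                        outbox[neighbour].append(new_value)
                    changed = True
        inbox = outbox                      # global barrier between supersteps
        superstep += 1
        if not changed:                     # all vertices voted to halt
            break

    print(values)   # every vertex converges to the global maximum: 6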
Figure 6 - Scheduling and synchronisation of a superstep in the bulk synchronous parallel
model (Okada, Amaris, & Goldman, 2015)
Google identified rather quickly that Bulk Synchronous Parallel is a good fit for
graph algorithmic problems, and therefore Pregel and the systems that followed it
fully adopted it for these capabilities. According to Kajdenowicz, Indyk, Kazienko, &
Kubul (2012), the bulk synchronous parallel approach significantly outperforms the
MapReduce approach when tackling graph algorithmic problems.
• Hadoop Common provides utilities that are used by the other modules.
• Hadoop Distributed File System (HDFS) is the underlying distributed file
system.
• Hadoop YARN is a framework for job scheduling and cluster resource
management. It was introduced in Hadoop 2 to replace the previously used
MapReduce engine and thereby decouple the programming model from
the resource management infrastructure.
• Hadoop MapReduce is a YARN-based implementation of the MapReduce
programming model used for parallel processing.
• Hadoop Ozone is a highly scalable, redundant, distributed object-store.
Originally, Apache Hadoop was one of many open-source projects that implemented
the MapReduce programming model and focused on tasks like web crawls. The
architecture was designed for precisely this one use case and focused on strong
fault tolerance for large and data-intensive computations. Soon it became the norm
for companies to save their data in HDFS, as it was easy for developers and
data scientists to access the data instantaneously. Hadoop got much attention, the
community grew, and developers started to misuse the cluster management for
more than just MapReduce jobs. Therefore, Apache Hadoop released YARN to
tackle the shortcomings of Hadoop 1. Hadoop YARN implements a new
architecture that decouples the programming model from the resource
management infrastructure. This means that MapReduce is now only one of many
frameworks that can be executed on top of Apache Hadoop; other programming
frameworks include Apache Spark, Apache Storm and Dryad (Vavilapalli, et al.,
2013).
The already mentioned Hadoop Distributed File System is a crucial component of
Hadoop. It is a distributed filesystem designed to run on cheap hardware. It is
fault-tolerant by design and provides high-throughput streaming data access. To
make this possible, a few POSIX semantics were sacrificed. For example, to enable
high throughput, applications operating on HDFS should support a write-once-
read-many access model. This means that a file is written to storage once but not
changed after that, because new data is only appended. As the typical applications
run on Hadoop work with data of considerable volume, and files can easily reach
gigabytes to terabytes, HDFS is designed to support large files. As a Hadoop
cluster can easily span hundreds of nodes, HDFS can store large files across
machines within the cluster by splitting them into equally sized blocks. This
behaviour can be recognised in Figure 7 when looking at the green blocks that are
present on the DataNodes. For fault-tolerance, all saved blocks are replicated to
multiple machines (Apache Software Foundation, 2020). HDFS uses a master/slave
architecture and consists of two types of nodes: the NameNode, which acts as the
master and manages the file system metadata, and the DataNodes, which store
the actual data blocks.
5.5 Apache Spark
Apache Spark, initially developed by a group at the University of California and later
donated to the Apache Software Foundation, unifies multiple specialised engines
into one distributed data processing engine. This has the advantage that one API
can be used for multiple kinds of jobs. Most data processing pipelines need to do
multiple things, like MapReduce and SQL queries. Before Spark, it was not possible
to do MapReduce, SQL, streaming and machine learning with only one engine,
which significantly increased the implementation and maintenance costs while
lowering the performance. While Spark unifies all these engines, it can still provide
on-par or even better performance for most jobs than specialised engines. Spark
can be operated on multiple platforms: it can run as a standalone installation but
also on Hadoop YARN, Mesos or Kubernetes, which makes it particularly easy for
most companies to get started with Spark, as most of the setup is already there.
RDDs are not only used for data sharing between the cluster nodes but also for
fault-tolerance. RDDs track the graph of all transformations that have been applied
to the data. Therefore, Spark can rerun the needed transformations in case of lost
partitions. This fault-tolerance model is called “lineage” (Zaharia, et al., 2016).
Other than the Spark core, four high-level libraries are used for operations that
would usually be run on separate specialised computing engines. These libraries
make use of the RDD programming model to implement the execution techniques
of these specialised engines (Zaharia, et al., 2016). Some of them will be used in
the implementation described in Chapter 7.5.
SparkSQL implements one of the most common data processing concepts:
relational queries. By mirroring the data layout of analytical databases, columnar
storage, inside the RDDs, simple SQL statements can be used to query the data.
In addition, there are abstractions for RDDs that contain data with a known schema
and resemble database tables, so-called DataFrames.
Spark Streaming is used to implement streaming over Spark. For that, it uses a
model called “discretised streams”: the input data is split into smaller batches that
are processed, rather than processing each element on its own.
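A minimal sketch of this micro-batch model, using PySpark's classic DStream API, could look as follows; the socket text source on localhost:9999 is an assumption made for the example:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="MicroBatchWordCount")
    ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

    # Each micro-batch is processed with ordinary RDD operations.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                 # print the result of each batch

    ssc.start()
    ssc.awaitTermination()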
GraphX provides graph computation capabilities, similar to systems like GraphLab
and Pregel.
MLlib is a collection of over 50 common machine learning algorithms for distributed
model training.
When comparing the performance of Apache Spark with its most used competitors,
the results mostly depend on the executed jobs and the nature of the workload
(Zaharia, et al., 2016). In comparison to Apache Hadoop, it is clear that Spark
provides far better performance. For MapReduce workloads, it is up to 100 times
faster for in-memory processing and still ten times faster on disk. A further
comparison found that Spark could process a 100 TB workload three times faster
than Hadoop with only one-tenth of the machines. One exception can be found
when Spark runs on Hadoop YARN: because of the memory overhead, Hadoop is
more efficient in this case (Karlon, 2020). In regard to machine learning, Spark can
again provide better results than, for example, MapReduce. According to Gopalani
& Arora (2015), the processing time for the K-Means algorithm decreased by up to
a factor of three in comparison to the processing time using MapReduce. A
comparison regarding stream processing can be found in Chapter 5.8.
Aside from the custom windows that can be built using these parameters, there are
a few predefined windows. The first is called a tumbling window. In a tumbling
window, each value of a stream is present in exactly one window, as shown in
Figure 8.
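The semantics of a tumbling window can be sketched in a few lines of Python: each event is assigned to exactly one fixed-size window derived from its timestamp. The events below are made up for the example:

    from collections import defaultdict

    events = [(1, 10), (3, 20), (6, 5), (7, 7), (12, 1)]  # (timestamp s, value)
    window_size = 5

    windows = defaultdict(list)
    for timestamp, value in events:
        # A tumbling window is identified by its start time; every timestamp
        # maps to exactly one window.
        window_start = timestamp - (timestamp % window_size)
        windows[window_start].append(value)

    for start in sorted(windows):
        print(f"window [{start}, {start + window_size}): {windows[start]}")
    # window [0, 5): [10, 20]
    # window [5, 10): [5, 7]
    # window [10, 15): [1]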
Figure 10 - Example of a Storm topology that shows the link between spouts and bolts
(Apache Software Foundation, 2020)
6. STORAGE
6.1 Introduction
In the following chapter, different forms of storage will be discussed. All solutions
have to comply with the requirements of Big Data, namely the volume, velocity and
variety that are inherent when working with Big Data. To address the challenges
that come with the high volume of the data, most fitting storage solutions make use
of a distributed, shared-nothing architecture. Some solutions, like Apache
Cassandra, can scale very well horizontally without any hassle, simply by adding
more servers to the cluster. The storage solutions have to cope with the high
velocity and still maintain low latency for queries, even with a high rate of incoming
data. Further, it needs to be possible to store data that comes from many different
sources and is not always structured, hence the variety of Big Data (Strohbach,
Daubert, Ravkin, & Lischka, 2016). Different storage solutions will be discussed,
beginning with NoSQL storage implementations, followed by NewSQL databases,
distributed file systems, Big Data querying platforms and an evaluation of current
cloud storage services. Finally, Data Lakes will be described in regard to Big Data
and real-time processing.
Key-value stores are the most simplistic version of a NoSQL database that can be
found. However, at what they are doing they are very efficient, and they can often
provide single-digit millisecond latency for queries. The data in a key-value store is
stored in a schema-less way and most of the time consists of strings, but other
objects are supported as well. Each entry consists of a string which functions as
the key and the actual saved data. The keys are used as indexes, and the basic
data model can be imagined like a map or dictionary. Most querying features that
can be found in relational databases, such as joins and aggregation operations, are
sacrificed in key-value stores for the sake of high scalability and fast lookups.
Examples of key-value databases are Redis and Amazon DynamoDB, which will be
discussed more thoroughly in Chapter 6.4 (Nayak, Poriya, & Poojary, 2013).
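The access pattern can be sketched against Amazon DynamoDB via boto3; the table name, partition key and attributes are assumptions for the example (the score is stored as a string because boto3 expects Decimal rather than float for numbers):

    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="eu-central-1")
    table = dynamodb.Table("risk-scores")   # assumed partition key: "vehicleId"

    # Writes and reads go through the key alone; there are no joins.
    table.put_item(Item={"vehicleId": "AT-1234", "riskScore": "0.42"})
    response = table.get_item(Key={"vehicleId": "AT-1234"})
    print(response["Item"]["riskScore"])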
Column-oriented databases are hybrid row/column stores, which means that the
database does not store the data in row-based tables but rather in a massively
distributed architecture. The data is stored in columns rather than rows, and
therefore it can easily be aggregated with less I/O activity than would be needed in
a relational database. This is achieved by saving the data for each column
contiguously on disk or in memory, which brings performance benefits when
running data mining or analytical queries. Examples of columnar databases include
Google's BigTable, a high-performance, fault-tolerant, consistent and persistent
database that is used for many Google products such as YouTube or Gmail.
Unfortunately, BigTable is not distributed outside of Google and is only usable
together with the Google App Engine. The second database worth mentioning is
Apache Cassandra. It is developed by the Apache Software Foundation and is
based on the principles of Amazon DynamoDB and Google BigTable; therefore, it
combines the concepts of key-value stores and columnar stores. It includes
features such as partition tolerance, persistence and high availability and is used
in various applications ranging from social media networks to banking and finance
applications (Nayak, Poriya, & Poojary, 2013).
Document store databases store, as the name suggests, documents. The stored
documents are somewhat similar to records in relational databases, but the most
significant difference is that they do not follow a predefined schema. The documents
are mostly saved in standard formats such as JSON or XML. The documents within
a document store can be similar or completely different; the database does not
care. Every document can be accessed using a unique key that is used to identify
and find it. Beyond the key, most document-oriented databases also support some
kind of query language that can be used to search for documents with specific
features. This is the point where they differ from key-value stores: in a key-value
store, the values are complete black boxes, in contrast to a document store, which
knows and saves metadata of the documents that are stored.
Furthermore, document-oriented databases also support relationships between
documents. The best-known example is MongoDB, a highly performant, efficient
and fault-tolerant document-oriented database. MongoDB stores data in JSON-like
documents and provides a powerful query language, indexing and real-time
aggregation. Further examples include Amazon DocumentDB, a cloud-based
solution that offers compatibility with MongoDB, and popular search engines like
Elasticsearch that do fit the definition of a document-oriented database (Nayak,
Poriya, & Poojary, 2013).
Graph databases store the data in the form of a graph. A graph consists of nodes
and edges: nodes are the saved objects, and they are connected through edges,
which represent relationships. Not only the nodes but also the edges can have
properties. Using graph databases, it is easy to traverse complex hierarchies of
data, as the main emphasis lies on the connections between the data nodes.
Furthermore, the graph can contain semi-structured data, which makes it a lot
more flexible than relational databases. Most graph databases are ACID compliant
and offer functionality to roll back transactions. Neo4j is the most prominent
representative of graph databases. It uses its own query language, Cypher, to
traverse the graph through a REST API. It has a free-to-use open-source
community edition but also provides licenses and support for enterprise-grade
deployments. It is a highly available, ACID-compliant graph database. Other
notable products include RedisGraph and SAP HANA (Nayak, Poriya, & Poojary,
2013).
NoSQL databases are mostly compared to relational databases, as the latter are
still the most widely used. A significant advantage, and a crucial factor for Big Data
projects, is that NoSQL databases can quickly scale to massive data volumes and
still provide low latency where relational databases would be overwhelmed. There
are many different databases in the NoSQL world, and each has its area in which
it excels. NoSQL technology has evolved rapidly over recent years, and its
community grows steadily. Still, there are disadvantages: some products are brand
new and therefore immature, and growth is further hindered by the lack of a
standard query language, such as SQL, for NoSQL databases (Nayak, Poriya, &
Poojary, 2013).
6.2.1 Elasticsearch
“Elasticsearch is a distributed, open-source search and analytics engine for all
types of data, including textual, numerical, geospatial, structured, and unstructured”
(Elasticsearch, 2020).
Elasticsearch, easily the most popular open-source search engine, will be used as
the data store for the implementation that will be described in Chapter 7, and
therefore its core concepts will be elaborated. Elasticsearch is built on top of
Apache Lucene and best known for its scalability and speed. In the context of
NoSQL, Elasticsearch can be categorized as a document-oriented database as it
saves the data in an index as JSON documents. Using inverted indices that are
built upon data ingestion, Elasticsearch can search for data in the documents in
near real-time. The search engine comes with a management tool named Kibana,
which offers a user interface that allows the user to execute searches, administer
the cluster and build graphical representations of the data. Because of its capability
to aggregate and display the data within the Elasticsearch cluster, Kibana is often
used to create dashboards that provide a real-time view of the data (Elasticsearch, 2020).
NewSQL databases are a class of modern relational database systems that seek
to provide the scalability of NoSQL systems while still maintaining the ACID
guarantees for transaction workloads required by traditional database systems.
The term NewSQL is not always applied to the right kind of databases, but in
recent years the industry has come to a consensus on what a NewSQL database
has to provide to be called that way.
NewSQL databases have to:
• Be able to execute thousands of short-lived read-write transactions
• Touch only small subsets of data using index lookups, without requiring full
table scans
• Use a lock-free concurrency control scheme
• Implement a scale-out shared-nothing architecture capable of running on
hundreds of nodes
• Support the ACID principle
NewSQL databases can be divided into three categories. Firstly, systems built up
with a completely new architecture, secondly middleware that re-implements
existing sharding infrastructure, and the third category are database-as-a-service
offerings. Some interpretations also include alternative storage engines and
extensions for single-node DBMSs, such as ScaleDB as a replacement for InnoDB
in MySQL or Microsoft's Hekaton OLTP engine for SQL Server, but according to
Pavlo & Aslett (2016) such systems are not representative of NewSQL systems.
The most promising NewSQL systems are built from scratch with a new
architecture rather than adding onto an existing DBMS, which enables them to start
with a new code base without any of the restrictions of a legacy system. This means
that scalability can be built into the system rather than on top of it, for example by
using a distributed architecture that operates on shared-nothing resources and
only contains components that support multi-node execution. Query optimizers
and communication protocols can be designed to send intra-query data directly
from node to node instead of relying on a central component. Furthermore, all of
these new NewSQL systems implement their own storage layer instead of relying
on existing distributed filesystems. Their biggest disadvantage is the small
community around them, as larger companies are reluctant to bet on small
products.
The second category of NewSQL databases consists of products that make use of
the same kind of sharding middleware that was developed by companies like
Facebook and Google. This sharding technology makes it possible to split a single
node DBMS onto multiple nodes, where every node only contains portions of the
database. A centralized component handles the routing of the queries and the
coordination of transactions, as the data on each node cannot be accessed
independently. The biggest advantage of these sharding products is that they can
easily replace existing single-node databases.
The third and last category is mostly about cloud services, so-called database-as-
a-service offerings. The advantage of a DBaaS solution is that the cloud provider
manages the hardware and maintenance. This means that the customers do not
have to think about hardware and any configuration concerning the availability of
the database service. The most notable example regarding cloud-based NewSQL
databases is Amazon Aurora, which is compatible with both MySQL and Postgres.
It is built on log-structured storage, which improves the I/O performance (Pavlo &
Aslett, 2016).
According to Pavlo & Aslett (2016) NewSQL databases mostly incorporate
techniques that were already used by the industry and academia for many years,
but instead of focusing on single approaches that were developed over time, they
combine multiple of these concepts into one single platform.
With the amount of data that is produced, companies struggle to extract its value.
Traditional Data Warehouse approaches, built for structured data from
transactional systems and business applications, cannot handle the majority of
today's data, because data first needs to be cleaned, transformed and enriched
before it can be used within the Data Warehouse (Amazon Web Services, 2020).
A Data Lake takes a different approach: all data that an organization produces is
stored in the Data Lake without any pre-processing, in its original format. It
therefore contains structured, semi-structured and unstructured data. The
organization does not know the value of most of the data yet, but because the data
is in the Data Lake and available for everyone in the organization to access and
analyse, it is possible to create value from it later (Khine & Wang, 2018).
Exemplary use cases and interactions with a Data Lake are shown in Figure 11.
Fang (2015) describes the following capabilities that a Data Lake should have:
• Capture and save the data at a low cost. First, it must be easy to get the data
into the lake efficiently without much processing. Secondly, the volume of
data in the lake grows continuously; therefore, it is essential to have
cost-efficient storage that scales well.
• Store data of all types. Data Lakes must be able to store data in all formats,
regardless of whether it is structured data from a DBMS, semi-structured or
unstructured data, such as IoT sensor data.
• ETL and pre-processing. Once data is in the Data Lake, it should be possible
to perform pre-processing and ETL transformations on it, to make it easier
for other systems to work with the data.
• Schema on read. In contrast to a Data Warehouse, where the data schema
is fixed before the data is introduced into the database, the data in a Data
Lake is saved without any schema. Complex and costly up-front data
modelling has to be avoided, as forcing the data to adhere to a schema
increases the data integration effort. The schema of the data is defined only
once it is used.
• Enable analytics. It must be possible to develop specific analytic applications
to find value in the saved data.
Figure 11 - A Data Lake and possible surrounding systems that interact with the data
(Amazon Web Services, 2020)
According to Khine & Wang (2018) there are still two big concerns regarding Data
Lakes. The first concern is that Data Lakes might be just another marketing hype.
Looking at recent developments, however, Data Lakes are being implemented
successfully by many companies. According to a TDWI report (2017), 23% of the
survey respondents already had a Data Lake in use in their production
environment, and another 24% were planning on using a Data Lake in production.
Together with the Data Lake services offered by cloud providers, such as Azure
Data Lake or AWS Lake Formation, and the increased amount of tooling that was
developed in recent years, it is safe to say that Data Lakes are far beyond being
only a marketing hype. The second concern is about creating a Data Swamp. A
Data Lake can quickly degrade into a data swamp where nobody knows what data
was put into it. If the veracity of the data cannot be ensured because nobody knows
what is in there, it is challenging to find corrupted data. Therefore, security and
compliance have to be top constraints when creating a Data Lake. If companies
start their Data Lakes without sophisticated security measures, the data can easily
be compromised (Khine & Wang, 2018).
In summary, Data Lakes are a new tool to handle the volume and variety of Big
Data and to tackle the problem of data silos that exist across organizations. They
are not a replacement for Data Warehouses but can complete the data landscape
of an organization, which can use them to gain unique insights from its data and
thereby a competitive advantage. Tableau (2017) predicts that in the future, Data
Warehouses and Data Lakes may be combined into a single concept by enhancing
each other's capabilities.
From an architectural standpoint, both the Lambda and the Kappa Architecture
include everything needed for an application that is capable of real-time analysis.
The Lambda Architecture also includes the batch view, which makes it possible to
query historical data and get more accurate results. On the other hand, the Lambda
Architecture brings a lot of overhead, as more moving parts are involved and
therefore more orchestration effort is needed than for a Kappa Architecture. The
Kappa Architecture only includes a stream processing layer and a serving layer,
which is a lot less to manage and provides everything needed for the real-time view
of the risk-analysis data that the application has to handle. All in all, the Kappa
Architecture has more advantages for this use case and will therefore be used for
the risk-analysis application.
Both implemented applications will use a rather similar Kappa Architecture, but the
stream processing frameworks differ. The options for stream processing that were
reviewed in Chapter 5 include Apache Spark, Apache Flink and Apache Storm. All
three could be used to implement the stream processing portion of the architecture.
For the first implementation Apache Spark with Spark Streaming was chosen as
Spark is the leading Big Data processing platform and used by many companies in
production. If Spark is already used, and the cluster is already set up, adding Spark
Streaming is trivial and therefore the cheapest and easiest way that most
companies would take when they need to add real-time analysis capabilities.
Within the implementation, AWS EMR can be used to set up a cluster where Spark
can run. For the second implementation, Apache Flink was chosen over Apache
Storm. Apache Flink is a newer, more flexible and highly performant
implementation that offers more features than Apache Storm. Furthermore, using
Amazon Kinesis Analytics, the Apache Flink application can be run in a
cloud-native way, without managing the cluster. Apache Storm, on the other hand,
cannot be run natively in AWS and would require a manual setup.
Using Apache Spark and Apache Flink makes it possible to compare an approach
that would also work in a self-hosted datacentre with the cloud-native alternative of
managed services.
For data ingestion, Apache Kafka and Amazon Kinesis are available as SaaS
solutions in AWS, and both are supported by Apache Spark and Apache Flink,
which means that either one is a good choice for the implementation. In this case,
the decision fell in favour of Amazon Kinesis Data Streams as the orchestration and
configuration are easier than for Apache Kafka. Another decision criterion was the
integration with Kinesis Analytics. Kinesis Analytics supports both Amazon MSK
and Kinesis Data Streams, but the latter is easier to integrate.
The last part of the Kappa Architecture that is needed is the serving layer. Multiple
possible datastores were discussed in Chapter 6. The serving layer does not only
consist of data storage but also needs querying and data visualization capabilities,
and some tools support this better than others. Ruling out building a custom UI or
relying on third-party tools eliminates a lot of the introduced databases from the
decision. Two possible scenarios for storage and visualization could be identified.
The first solution would use S3 as a data store and utilize Amazon Athena and
Amazon Quicksight to query and visualize the data. S3 as a datastore makes sense
as it scales infinitely and is supported by both Apache Flink and Apache Spark as
a data sink. All the components, in this case, are cloud-native and easy to use in
AWS.
The second solution uses Elasticsearch as a document store and Kibana to
visualize the data. Again, both Apache Flink and Apache Spark support
Elasticsearch natively, and AWS provides a service for a managed Elasticsearch
cluster which makes it easy to use.
As both scenarios are equally valid, the decision fell in favour of Elasticsearch and
Kibana because the knowledge of how to set them up and use them already
existed.
A few other components, such as a small application that generates data, will be
needed to implement and test the architectures thoroughly. All these additional
components will be dockerized and run on AWS ECS. The containers will run as
AWS Fargate tasks, which removes the need to manage and orchestrate servers.
Briefly summarized, this means that both implementations will use a variant of the
Kappa Architecture. For streaming, Amazon Kinesis will be used. For storage and
visualization, the decision fell on a managed Elasticsearch plus integrated Kibana.
The stream processing will be implemented once in Amazon Kinesis Analytics with
Apache Flink and once using Amazon EMR and Apache Spark with Spark
Streaming. For additional assisting services, AWS ECS with Fargate tasks will be
used.
The KPL is only available for Java; for node.js, AWS provides the AWS SDK, which
was used to write data into a Kinesis Data Stream. Each agent used in the
simulation takes one step, which is defined as a movement along its predefined
route at a random velocity. One step in the simulation means that all agents take
one step; these steps are collected and sent to the Kinesis Data Stream in one
AWS SDK call, as shown in Code Listing 1.
try {
  // Send all collected agent steps to the Kinesis Data Stream in one batch call
  const result = await kinesis.putRecords({
    Records: records,
    StreamName: "kinesis-analytics-stream",
  }).promise();
  console.log(result);
  records = []; // reset the batch once it was written successfully
} catch (error) {
  console.log(error);
}
};
Code Listing 1 - Simulation.step method that executes the steps for all agents and sends
the result to a Kinesis Data Stream using the AWS SDK
One trip-simulator process can run up to 500 agents, and each agent produces one
data point per second. Therefore, to generate the data that is needed to benchmark
the streaming applications, multiple instances have to be started. One instance
needs at least 12 GB of RAM as all the routes that are available on the
OpenStreetMap are loaded into memory. To make this possible, further
customization was needed: the routes are loaded into a node.js Map, and at first
the application crashed because the provided map of Austria contained too much
data. The reason is that a node.js Map can hold at most 2^24 entries, which was
not enough to load all the data. To fix this issue, a custom BigMap was used that
splits the records across multiple node.js Maps.
This customized trip-simulator was used in both implementations to produce the
data points needed to test the streaming applications.
The first algorithm is the best-known clustering algorithm, K-Means. First, a
number of groups is selected, and their centre points are initialized randomly.
Then, for each data point, the distance to each group centre is computed, and the
point is assigned to the group whose centre is closest. At the end of an iteration,
each group centre is recomputed as the mean of all the vectors in the group. This
is repeated for a fixed number of iterations. K-Means is easy to implement and fast
in clustering the data points, as it has linear complexity O(n). The drawback of
K-Means is that the user has to choose the number of clusters upfront (Seif, 2018).
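The following Java sketch illustrates the algorithm just described for two-dimensional points; the class name, the initialization from random data points and the fixed iteration count are illustrative choices, not part of any implementation in this thesis.

import java.util.Random;

public class KMeansSketch {
    // One K-Means run over 2D points (e.g. longitude/latitude pairs).
    static double[][] kMeans(double[][] points, int k, int iterations) {
        Random rnd = new Random();
        double[][] centers = new double[k][2];
        for (int i = 0; i < k; i++) { // 1. initialize centres randomly
            centers[i] = points[rnd.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // 2. assign every point to its closest centre
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[p][0] - centers[c][0];
                    double dy = points[p][1] - centers[c][1];
                    double dist = dx * dx + dy * dy;
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                assignment[p] = best;
            }
            // 3. recompute each centre as the mean of its assigned points
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                sums[assignment[p]][0] += points[p][0];
                sums[assignment[p]][1] += points[p][1];
                counts[assignment[p]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centers[c][0] = sums[c][0] / counts[c];
                    centers[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return centers;
    }
}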
The second algorithm that could be used to implement the clustering is
Mean-Shift. Mean-Shift uses a sliding window that tries to find areas with a high
density of data points. As a centroid-based algorithm, it tries to find the centre of a
group of points: each iteration updates a centre-point candidate to be the mean of
the points within the sliding window. The candidate windows are then filtered in a
post-processing step that eliminates near-duplicates and forms the final set of
centre points. The advantage of this algorithm is that, in contrast to K-Means, there
is no need to select the number of clusters upfront. The remaining difficulty is the
selection of the radius of the sliding window (Seif, 2018).
The third solution is grid-based clustering. It uses the geohash concept, a
hierarchical spatial data structure based on a latitude/longitude geocode system.
The concept is straightforward: the space is divided into squares of a specific size,
and the data points inside each square are grouped together. For more granular
clustering, each square can simply be divided again and again, which results in a
fine-grained grid system in which each square has a specific hash; an example of
how such a hash is built is shown in Figure 13. The advantage of this approach is
its simplicity and that the granularity of the clustering is straightforward to set. The
disadvantage is that it is inaccurate for points located close to the borders of the
grid cells (Amirkhanyan, Cheng, & Meinel, 2015).
Figure 13 - Explanation of how a geohash is built (PubNub, 2020)
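As a minimal sketch of the grid idea, the following Java snippet maps a coordinate to the X/Y indices of its enclosing grid square. The bounding-box constants and the resolution value are assumptions for illustration, not the values used in the actual implementation.

public class GridCell {
    // Approximate south-west corner of Austria's bounding box (assumed values).
    static final double MIN_LAT = 46.37;
    static final double MIN_LON = 9.53;

    // Resolution in degrees: smaller values produce a finer grid.
    static final double RESOLUTION = 0.01;

    // Map a coordinate to the X/Y indices of the enclosing grid square.
    // All points inside the same square share the same (x, y) pair and
    // therefore end up in the same cluster.
    static int[] cellOf(double lat, double lon) {
        int x = (int) ((lon - MIN_LON) / RESOLUTION);
        int y = (int) ((lat - MIN_LAT) / RESOLUTION);
        return new int[] { x, y };
    }
}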
For the processing of the data stream, the framework of choice is Spark Streaming,
an extension of the core Spark API that helps to build highly scalable, fault-tolerant
and high-throughput stream processing applications. It can consume data from
many sources, such as AWS Kinesis or Apache Kafka.
AWS Kinesis was selected as the data source, and the results of the processing
are saved to Elasticsearch. Figure 14 shows the architecture that was used. What
stands out is that the Spark Streaming application does not communicate with the
Elasticsearch cluster directly but uses the aws-es-proxy running as a Fargate task
in ECS; why this is necessary will be explained in the next chapter.
Figure 14 - Architecture for the Spark Streaming implementation
7.5.2 Implementation
Apache Spark uses a concept named resilient distributed datasets (RDDs). An
RDD is a fault-tolerant collection of elements that can be operated on in parallel.
Spark Streaming offers the Discretized Stream (DStream), an abstraction that
represents a continuous series of RDDs. Every RDD contains the data of a specific
interval of the data stream.
A DStream is associated with a Receiver, an object that fetches data from a source
for processing. There are two different types of sources: Basic Sources and
Advanced Sources. Basic Sources are directly available in the StreamingContext
API, for example streams from the file system or a socket connection. Advanced
Sources include, for example, Apache Kafka and AWS Kinesis; these sources need
extra utility classes that are located in separate libraries, which have to be imported
when needed.
Before starting with the streaming, the boundaries of the area that will be processed
need to be defined. For the use case of this study, only location data from within
Austria has to be processed. The area is configurable, but in this case only the
minimum values of the latitude and longitude of Austria are set. The coordinates
are needed for the approach for creating geohashes already mentioned in Chapter
7.4.1, which will be used to cluster the incoming data points.
Out of all the available sources, the Kinesis Stream Source is used to read from
the Kinesis Data Stream. The credentials used to access the stream are provided
through the underlying JobFlowRole of the EC2 instances running in the EMR
Hadoop cluster.
As already mentioned, Spark Streaming works with a DStream that consists of
RDDs covering a specified interval. Figure 15 shows how the original DStream is
windowed to create small batches of values that can be processed. This interval is
used to create a sliding window that is processed as a mini-batch job and is needed
for the computations.
Figure 15 - DStream and its interaction with windowing operators (Apache Software
Foundation, 2020)
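To make the wiring concrete, the following sketch shows how such a Kinesis-backed DStream with a sliding window could be set up. The thesis implementation was written in Scala; this Java sketch uses assumed names, batch intervals and window sizes and is not the actual application code.

import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kinesis.KinesisUtils;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;

public class SparkKinesisSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("risk-analysis");
        // The batch interval defines how often a new RDD (mini batch) is created.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(10_000));

        // Advanced Source: consume raw records from the Kinesis Data Stream.
        // Credentials are resolved from the environment (on EMR, the JobFlowRole).
        JavaDStream<byte[]> kinesisStream = KinesisUtils.createStream(
                jssc, "risk-analysis", "kinesis-analytics-stream",
                "https://kinesis.eu-central-1.amazonaws.com", "eu-central-1",
                InitialPositionInStream.TRIM_HORIZON,
                new Duration(10_000), StorageLevel.MEMORY_AND_DISK_2());

        // Sliding window: aggregate the last 60 seconds, recomputed every 10 seconds.
        JavaDStream<byte[]> windowed =
                kinesisStream.window(new Duration(60_000), new Duration(10_000));

        jssc.start();
        jssc.awaitTermination();
    }
}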
The data is now split using the sliding window, and it is possible to run
aggregations over it. The framework of choice for aggregating the data efficiently
is Spark SQL, which can be used to process structured data. There are two ways
to use Spark SQL; in this case, Datasets and DataFrames fit the implementation
best. A DataFrame can be defined as a distributed collection of data organized into
named columns. The concept is the same as that of a table in a relational
database, but with optimizations under the hood. A DataFrame can be created out
of an RDD of Rows and a schema definition. First, all incoming byte arrays in the
RDD that was created by the windowing operator have to be mapped to a Row that
corresponds to the schema. Once a DataFrame is available, the following
operations were executed:
5. The data is now aggregated and the processing is finished; for development
and debugging purposes, the first 100 rows are printed to the logs. The logs
are saved to S3 and can be accessed using the AWS console.
6. The data is saved to Elasticsearch using the es-hadoop library.
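The following Java sketch condenses these operations: the raw byte arrays of one window are mapped to Rows, turned into a DataFrame and aggregated per grid cell before being written to Elasticsearch. The schema fields, index name and CSV layout are assumptions for illustration; the actual implementation was written in Scala.

import java.nio.charset.StandardCharsets;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

public class WindowProcessing {
    // Schema of one parsed telemetry record (field names are assumptions).
    static final StructType SCHEMA = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("cellX", DataTypes.IntegerType, false),
        DataTypes.createStructField("cellY", DataTypes.IntegerType, false),
        DataTypes.createStructField("velocity", DataTypes.DoubleType, false)
    });

    static void process(SparkSession spark, JavaRDD<byte[]> windowedRdd) {
        // Map each raw byte[] record from the window to a Row matching the schema
        // (the assumed CSV layout is: cellX,cellY,velocity).
        JavaRDD<Row> rows = windowedRdd.map(bytes -> {
            String[] f = new String(bytes, StandardCharsets.UTF_8).split(",");
            return RowFactory.create(Integer.parseInt(f[0]),
                                     Integer.parseInt(f[1]),
                                     Double.parseDouble(f[2]));
        });
        Dataset<Row> frame = spark.createDataFrame(rows, SCHEMA);

        // Aggregate per grid cell: record count and variance of the velocity.
        Dataset<Row> result = frame.groupBy("cellX", "cellY")
            .agg(functions.count("velocity").as("count"),
                 functions.variance("velocity").as("velocityVariance"));

        // Print the first 100 rows for debugging (ends up in the S3 logs) ...
        result.show(100);

        // ... and write the aggregates to Elasticsearch via es-hadoop
        // ("es.nodes" is configured to point at the aws-es-proxy, see below).
        JavaEsSparkSQL.saveToEs(result, "risk-analysis/_doc");
    }
}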
To save the data, the es-hadoop library was used. It offers full support for Spark,
Spark Streaming and Spark SQL and makes it easy to save data to Elasticsearch
by adding dedicated methods to the RDDs. For self-hosted Elasticsearch clusters,
this works fine after configuring only a few values, such as the Elasticsearch
endpoint. A problem arose because the Elasticsearch cluster was hosted by the
AWS Elasticsearch Service. Typically, requests to the ES API need to be
authenticated, mostly by sending Basic Authentication credentials with the
requests; in the AWS ES Service, however, the requests need to be signed using
AWS credentials. Unfortunately, the es-hadoop library does not provide any
functionality to add such request signing, which makes it impossible to write to an
AWS ES cluster directly. As a workaround, an additional application was needed
that acts as a proxy between the Spark Streaming application and the AWS ES
cluster. For this, the aws-es-proxy was used. It is a small web server that sits
between the originating application and the AWS Elasticsearch Service. It
intercepts the requests and signs them using the latest AWS Signature Version 4
before forwarding them to the ES cluster. The response from ES is then sent back
to the application that issued the request (aws-es-proxy, 2020). The aws-es-proxy
provides a pre-built Docker image, and as it uses the Go AWS SDK to fetch and
generate the credentials, it can rely on the standard AWS
CredentialsProviderChain. The CredentialsProviderChain then uses a provided
TaskRole, which is allowed to access the AWS ES cluster, to obtain the
credentials. The implications of this setup can also be seen in Figure 14, which
shows an ECS cluster with an aws-es-proxy Fargate task that intercepts the
requests to the Elasticsearch cluster.
7.5.3 Metrics
All the metrics were measured twice: once while running the Spark Streaming
application on a single m4.large node, and a second time on a cluster of five
m4.large nodes. One m4.large EC2 instance has two vCPUs and 8 GB of memory.
This is the least powerful machine that can be used in an EMR cluster, but it is
sufficient to test the application.
To measure the latency of the Spark Streaming solution, a small test application
was used. The tests were conducted while the trip-simulator produced the normal
load, which makes it possible to measure the latency under various load and
scaling scenarios. The test application adds a record to the Kinesis Data Stream
that serves as the data source for the streaming application. The added record has
unique values that can then be used to query the Elasticsearch storage. The
latency is calculated as the time from the moment the record is written to the data
stream until it is present in the data storage. The test is conducted multiple times
to achieve a statistically relevant result. Table 2 shows that the average latency for
the first scenario is around 24 seconds, compared to around 12 seconds for the
second scenario, where four times as many resources were used. The maximum
latency values are not much higher than the average latency, which means that
the latency is rather stable and has no outliers.
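The following Java sketch shows how such a latency probe could look: a uniquely tagged record is written to the stream, and Elasticsearch is polled until the record appears. The stream name, ES endpoint, index and record layout are assumptions; the actual test application is not reproduced here.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;

public class LatencyProbe {
    public static void main(String[] args) throws Exception {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();
        HttpClient http = HttpClient.newHttpClient();

        // A unique marker makes the probe record findable in Elasticsearch.
        String marker = UUID.randomUUID().toString();
        String record = marker + ",48.2082,16.3738,50.0"; // same CSV shape as the simulator

        long start = System.currentTimeMillis();
        kinesis.putRecord("kinesis-analytics-stream",
                ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8)), marker);

        // Poll the index until the marker shows up; the elapsed time is the
        // end-to-end latency from stream ingestion to queryable storage.
        while (true) {
            HttpRequest req = HttpRequest.newBuilder(URI.create(
                "https://es-endpoint/risk-analysis/_search?q=" + marker)).build();
            HttpResponse<String> res = http.send(req, HttpResponse.BodyHandlers.ofString());
            if (res.body().contains(marker)) break;
            Thread.sleep(100);
        }
        System.out.println("Latency: " + (System.currentTimeMillis() - start) + " ms");
    }
}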
Multiple metrics can be used to measure the general performance of the solution.
The first metric is the number of records in the Kinesis Data Stream. Figure 16
shows that the 10 Kinesis Data Stream shards were used to their maximum
capacity regarding the number of records that can be ingested. Within five minutes,
three million records were sent into the stream, which means that roughly 10.000
records per second were added. Figure 16 shows both the number of records that
were added to the stream and the number of records that were read. Both lines in
the graph lie on top of each other, which means that each record is read from the
stream quickly after it was added.
This is also supported by Figure 17, which shows the IteratorAge of the data
stream. The IteratorAge metric indicates how long data resides in the stream
before it is read. On average, data stayed in the stream for only 2.67 milliseconds
before it was consumed. The spikes that can be seen in the figure indicate the
maximum values, which were measured at 4 seconds.
There are also metrics provided by the Spark History Server UI, shown in Figure
18. They show how many records are consumed and how long the processing
takes. Unfortunately, the average values shown in Figure 18 are computed over
the whole runtime of the streaming job, which means that they also include batches
that did not process any data. The first graph shows the input rate, which fluctuates
between 6.000 and 10.000 records per second. The histogram for this metric
shows that most batches included more than 8.000 records per second. The
second graph shows the scheduling delay. For a stable application, it is essential
that the scheduling delay stays low, which is only the case as long as the
processing time does not exceed the batch interval; this issue will be discussed in
more detail in the next chapter. In this case, there was only one spike to roughly
15 seconds, but this is not an issue as the framework catches up rather quickly,
and for the remaining time the delay is continuously very low. The total time a
record needs to be processed is calculated by adding the scheduling delay and the
processing time together. Looking at the processing time, most values are around
10 to 15 seconds, which largely matches the values that were measured for the
latency. The dotted line indicates the time a batch is allowed to take for the
application to run stably and provide low latency.
Figure 18 - Spark Streaming Metrics: Input Rate, Scheduling Delay and Processing Time
First, the cost for the Spark Streaming solution will be calculated with the minimal
setup in mind. The minimal setup includes an EMR cluster with two m4.large
instances and one Kinesis Data Stream shard. The second scenario, for a larger
deployment, uses five m4.large instances. Further, this calculation uses
on-demand instances in contrast to the cheaper spot instances; when deploying
this architecture in a real-world scenario, one would have to assess whether it
makes sense to use spot instances for the cluster nodes. All prices refer to the
Europe/Frankfurt (eu-central-1) region.
S3 Buckets (logs & application artefacts): Free Tier eligible
Table 3 - Spark Streaming Solution Pricing per Service
Costs for CloudWatch and S3 log storage are not considered in the calculation, as
both are free tier eligible and therefore negligible. Using the table above, the costs
for both a minimal solution and one that provides better performance at the scale
mentioned in Chapter 1.3 can be calculated.
Solution Price
Minimal Solution per hour $0.38157
Minimal Solution per month $278.4548
Scaled Solution per hour $0.9934
Scaled Solution per month $615.7148
Table 4 - Spark Streaming Solution hourly/monthly price
Scaling is rather easy in AWS EMR: the only thing that has to be done is adjusting
the instance group, and the new EC2 instances are added to the cluster
immediately. Adding servers to the cluster is one thing; the other is the
configuration of the Spark application. Usually, the number of cores and the
amount of memory that should be used have to be defined in the application
configuration. Fortunately, AWS offers a configuration flag
(maximizeResourceAllocation) that sets all the Spark configurations to the
maximum that can be allocated within the cluster. This helps when running only
one application that should use all the resources available in the cluster.
One disadvantage of this solution is that monitoring the Spark application within
the EMR cluster is not straightforward. There is no out-of-the-box solution to
monitor the application using AWS tools such as CloudWatch Metrics and Logs.
The logs for the application can only be sent to S3, which makes it more
challenging to check for failures. To be able to observe what is going on in the
application, one would need additional tooling to bring the logs to a data store
where they can be searched and aggregated, for example CloudWatch Logs or
Elasticsearch. For the metrics, one has to access the Spark and Hadoop GUIs.
This means that at least the master node of the cluster has to be deployed in a
public subnet, and even then one needs to install proxy software and open an SSH
tunnel to the master instance to access the GUIs. This is a lot of overhead for
simply accessing monitoring metrics.
To fix the scheduling delay issue mentioned above, it is necessary to experiment
with the batch interval: a longer batch interval also means higher latency, but one
that is too short causes the scheduling delay to grow. Finding the sweet spot where
a good trade-off is achieved requires multiple iterations. This has to be done for
every resource configuration, as the processing time naturally changes if more or
fewer resources are available.
Figure 20 - Architecture for the Kinesis Analytics implementation
7.6.2 Implementation
The basic concept used for stream processing in Apache Flink is the DataStream,
which was already introduced in Chapter 5.6. A DataStream can be created from
multiple data sources. One of the predefined sources is the FlinkKinesisConsumer,
which can easily be added to the Flink environment as a source for the
DataStream. To create the consumer, all that is needed is the name of the Kinesis
Data Stream that should be used as a source, a deserialization schema and
configuration properties. The deserialization schema is an interface that needs to
be implemented to make it possible for the Flink application to create an object out
of the supplied byte array. There are multiple predefined deserialization schemas,
such as the SimpleStringDeserializationSchema or the
POJODeserializationSchema. Since the data from the trip-simulator is sent as a
comma-separated string, a custom deserialization schema was needed. The third
parameter, the configuration properties, is needed to supply basic parameters
such as the AWS region or the AWS credentials provider. Furthermore, advanced
configuration properties are available that control the way the stream is read, for
example the initial starting position in the Kinesis Data Stream. There are multiple
possible values, most notably LATEST, which only reads records that arrive after
the application has started, TRIM_HORIZON, which starts at the oldest record still
retained in the stream, and AT_TIMESTAMP, which starts reading from a given
point in time.
All starting positions have their use cases, but for the risk-analysis implementation
the TRIM_HORIZON starting point was chosen. The reasoning behind this decision
is that all records should be processed even if the processing stopped for a few
minutes and is restarted afterwards, because otherwise there would be holes in the
analysed data. Further configuration values include, for example, the interval in
milliseconds at which records are read from the Kinesis Data Stream.
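A minimal version of this source setup, sketched in Java, could look as follows. The stream name is an assumption, and Flink's predefined SimpleStringSchema stands in for the custom deserialization schema that parses the comma-separated trip-simulator records.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class FlinkSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties config = new Properties();
        config.put(AWSConfigConstants.AWS_REGION, "eu-central-1");
        // Start at the oldest retained record so no data is lost after a restart.
        config.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "TRIM_HORIZON");

        DataStream<String> stream = env.addSource(new FlinkKinesisConsumer<>(
                "kinesis-analytics-stream", new SimpleStringSchema(), config));

        stream.print();
        env.execute("risk-analysis");
    }
}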
1. First, each data record is mapped to add the X and Y coordinates of the cell
corresponding to the longitude and latitude of the record. The calculation of
the cell coordinates is done in the same way as in the Spark Streaming
application, using the minimum latitude and longitude of Austria and a
resolution parameter that specifies the size of one cell.
2. The stream of records is then logically partitioned using the keyBy operator,
with the cell coordinates as the key. Creating a KeyedStream allows Flink to
perform the following windowed computations in parallel for all cells, as they
are independent of each other and can therefore even be processed by
different processes.
3. To group related records for the aggregations, a sliding time window is
defined.
4. The windowed stream can now be aggregated. There are multiple prebuilt
aggregation methods, but unfortunately not all needed aggregations are
available; statistical computations, for example, are missing. To perform
custom aggregations, one needs to supply an AggregateFunction. The
AggregateFunction interface has the structure defined in Equation 4. A
condensed sketch of this pipeline is shown after this list.
5. Finally, the aggregated results are written to Elasticsearch. The Flink
integration uses the standard Elasticsearch RestClient and therefore makes
it possible to add a request interceptor that signs the requests for the AWS
Elasticsearch Service. In contrast to the Spark Streaming solution, no
workaround is needed for the communication with ES.
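Condensed into Java, steps 2 to 4 could look like the following sketch. The record and accumulator classes as well as the window sizes are assumptions for illustration; the AggregateFunction skeleton mirrors the interface referenced as Equation 4, with the variance derivable from the accumulated count, sum and sum of squares.

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PipelineSketch {
    // One parsed data point after the cell mapping of step 1 (names assumed).
    public static class TelemetryRecord {
        public int cellX;
        public int cellY;
        public double velocity;
    }

    // Accumulator carrying the running state for one grid cell.
    public static class VelocityStats {
        public long count;
        public double sum;
        public double sumOfSquares;
    }

    static DataStream<VelocityStats> buildPipeline(DataStream<TelemetryRecord> records) {
        return records
            // 2. partition the stream by grid cell so cells are processed in parallel
            .keyBy(r -> r.cellX + ":" + r.cellY)
            // 3. group related records with a sliding time window
            .window(SlidingProcessingTimeWindows.of(Time.seconds(60), Time.seconds(10)))
            // 4. custom aggregation, since statistical functions are not prebuilt
            .aggregate(new AggregateFunction<TelemetryRecord, VelocityStats, VelocityStats>() {
                @Override public VelocityStats createAccumulator() {
                    return new VelocityStats();
                }
                @Override public VelocityStats add(TelemetryRecord r, VelocityStats acc) {
                    acc.count++;
                    acc.sum += r.velocity;
                    acc.sumOfSquares += r.velocity * r.velocity;
                    return acc;
                }
                @Override public VelocityStats getResult(VelocityStats acc) {
                    // variance = sumOfSquares/count - (sum/count)^2
                    return acc;
                }
                @Override public VelocityStats merge(VelocityStats a, VelocityStats b) {
                    a.count += b.count;
                    a.sum += b.sum;
                    a.sumOfSquares += b.sumOfSquares;
                    return a;
                }
            });
    }
}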
7.6.3 Metrics
All the metrics were measured twice, with different numbers of Kinesis Processing
Units (KPUs). Two parameters can be defined to configure the parallelism of a
Kinesis Analytics application. The first is the Parallelism parameter, which
specifies the number of parallel executions of the operators, sources and sinks.
The second parameter is the ParallelismPerKPU, which defines the number of
parallel tasks that can be executed per KPU. One single KPU provides 1 vCPU
and 4 GB of memory. To know how many KPUs are used for the execution, both
settings have to be considered. The first test was executed with a Parallelism and
ParallelismPerKPU of 1, which results in 1 KPU being used. The second test used
a Parallelism of 2 and a ParallelismPerKPU of 8, which leads to 2 KPUs being used.
To measure the latency, the same test as for the Spark Streaming solution was
used; for details on the test setup, refer to Chapter 7.5.3. The tests were run
multiple times for each scenario to retrieve statistically meaningful results.
Table 5 shows the results of the latency tests for both scenarios. Remarkably, the
performance did not really increase when using more resources. This can be
explained by the fact that when multiple instances of an Apache Flink application
are running, the framework has to handle the distribution of the dataset, which
adds overhead. Furthermore, the resources used in the first scenario were already
enough for Flink to provide acceptable performance: the CPU utilization was
roughly around 50%, as seen in Figure 21. The blue line in the figure from 18:30
to 21:00 is the CPU utilization for the first scenario. After the short interruption at
21:00, the CPU utilization for the second scenario is shown, which is roughly
around 30%.
Figure 21 - Kinesis Analytics Solution: CPU & Memory Utilisation
Figure 22 shows the records processed per second by the Apache Flink
application. The first line describes the first scenario with a Parallelism of 1; here
the application processed a constant throughput of 9100 records per second. The
second scenario, which involves multiple application instances and therefore
overhead for the distribution of the tasks, maintained roughly 2400 records per
second. Although this makes the throughput appear lower than in the first scenario,
the second scenario maintained a lower latency for processing the records
throughout the application runtime, as shown in Table 5.
Another measurement to check whether the application can keep up with the
produced data rate is the pair of the IteratorAge and MillisecondsBehindLatest
metrics. Both show how far the application lags behind in processing the records
within the stream. If the application were not performant enough, these two metrics
would increase, as the oldest unread record in the stream would grow older and
older. This would be disastrous for a real-time application, as it would never
analyse the current data but fall further behind and process old records instead.
For both scenarios, both metrics showed a flat line, meaning that the time behind
the latest record is zero milliseconds; in other words, the application always works
with the newest data.
Figure 22 - Kinesis Analytics Solution: Records processed per second
First, the cost for the Kinesis Analytics solution will be calculated with the minimal
setup in mind. The minimal setup includes one Kinesis Data Stream shard and one
Kinesis Analytics processor. From this, the appropriate scale for the full
deployment can be calculated. All prices refer to the Europe/Frankfurt
(eu-central-1) region. Costs for CloudWatch and S3 log storage are not considered
in the calculation, as both are free tier eligible and therefore negligible. Using the
table above, the costs for both a minimal solution and one at the scale needed for
the data throughput mentioned in Chapter 1.3 can be calculated.
Solution Price
Minimal Solution per hour $0.1911
Minimal Solution per month $139.48
Scaled Solution per hour $0.4797
Scaled Solution per month $350.20
Table 7 - Kinesis Analytics Solution hourly/monthly price
Kinesis Analytics offers the possibility of automated scaling. This has the
advantage that there is no need to test how many KPUs the application needs to
run stably at all times, as Kinesis Analytics simply scales the number of KPUs
based on the CPU and memory utilization. Therefore, it can easily handle a
variable load without any manual interaction.
Working with Apache Flink is very easy, as the APIs are well documented and its
primary supported language is Java, which is not the case for all data processing
frameworks. Building an Apache Flink application is therefore straightforward, and
all other Java libraries can be used without any problems. The integrations of the
different sources and sinks make it very easy to consume data, process it and
save the results without any hassle. In the case of Kinesis Analytics, one could
also use SQL for the data analysis, but the possibilities that come with Apache
Flink, such as the partitioning, windowing and aggregation features of the
DataStream API, make it worthwhile to accept the overhead of creating a
standalone application and the increased costs of running Apache Flink on Kinesis
Analytics.
This is also one of the disadvantages of working with Kinesis Analytics and Apache
Flink: the costs for the deployment are rather high, as the user not only pays for
Kinesis Analytics itself but also for the EC2 instances that run the Apache Flink
application.
The trip-simulator produced synthetic data that simulates cars moving from one
location to another. The result of the analysis can be seen in Figure 24. The
heatmap shows the variance of the velocity everywhere in Austria. Figure 25 shows
a zoomed-in version of the variance of the velocity around Vienna, and the count of
the records that were analysed in this area.
Figure 24 - Analysis Result shown as a heatmap in Kibana
Figure 25 - Variance of velocity around Vienna (left) & count of records around Vienna
(right)
Figure 25 shows the maximum zoom level that is available in Kibana; the data
behind the visualization is, of course, available at a much more granular level.
Still, the heatmap shows that the produced grid is fine enough to display the data
in a meaningful way.
7.8 Comparison
Both approaches, Apache Spark in AWS EMR and Apache Flink in AWS Kinesis
Analytics, were able to satisfy all the requirements and proved that they could be
used to implement real-time streaming analytics. The following paragraphs
compare the two implementations with respect to development, deployment,
performance and pricing.
Both solutions were developed using a JVM language: Scala for Spark and Java
for Flink. Although Spark also supports Java, the documentation and the Scala
source code made it difficult to write the code in Java; furthermore, most of the
supporting documents and community examples are written in Scala. The
development of the Spark application was therefore a bit tedious, but even for
someone who is not fluent in Scala, it was possible to write a working application
using the well-documented API. The development for Flink was considerably more
accessible, as it is natively written in Java and the developer support was better.
For both frameworks, one has to rely on the available support for the different
sources and sinks. Here again, Flink made it easier to enhance the Elasticsearch
sink, which was impossible in the Spark implementation. Both frameworks have a
big community, which is an essential criterion when adopting a framework.
Although Spark Streaming offers excellent support, it is still noticeable that the
framework works with mini batches; an example is the batch interval issue
mentioned in the Spark implementation chapter. Apache Flink, on the other hand,
fully embraces the streaming paradigm, and therefore its API is easier to use. For
an application developer, the gap between using a framework such as Spring and
data processing frameworks like Spark or Flink is massive. The support that most
application frameworks offer is simply not there, and issues regarding
dependencies that are not interoperable or version incompatibilities are common.
The deployment of the two solutions was set up using the AWS CDK to create the
CloudFormation stacks. There are some issues in the AWS documentation
regarding certain configurations, which take a bit of time to overcome and make
the deployment non-trivial. The Flink application deployment in Kinesis Analytics
was the easier one, as it is a managed service with only a limited number of
configuration options. The Spark application was deployed in an EMR cluster,
which still needs all the different configurations required to run a Spark application
in a Hadoop cluster. Figuring out how to use the spark-submit command within
EMR and setting up all the permissions needed to deploy the application was
challenging.
Furthermore, the Spark application is deployed as a skinny jar, which means that
the dependencies are not packaged within the jar; only the application code is.
Therefore, when submitting the Spark application to the cluster, the dependencies
have to be defined once again, which can cause problems with the availability of a
dependency if the wrong Maven repository is used. In terms of observability,
Kinesis Analytics makes it easier to access and search the logs, as they are sent
to CloudWatch; for Spark, the logs are delivered to an S3 bucket.
In terms of performance, the Spark Streaming solution could not match the latency
of the Apache Flink application. Table 2 and Table 5 show the measured latencies
for both applications. While the performance did not really improve for Flink when
using more resources, significant improvements can be seen for Spark, which
reduced its average latency from 24811 ms to 12518 ms. Still, the average latency
for Flink was between 7 and 8 seconds and therefore two to three times lower,
depending on the resources used for the Spark application. The problem for Spark
was not the throughput, as both Spark and Flink consumed up to 10.000 records
per second without any problems, as can be seen in Figure 18 and Figure 22, but
the processing time. The processing time for Spark, shown in Figure 18, averaged
roughly 10 seconds, which in itself is longer than the total latency of the Flink
application; on top of that, the scheduling overhead and the Elasticsearch proxy
also added to the latency. In terms of consuming a Kinesis Data Stream, both
frameworks had no issues reading all records immediately after they became
available; this can be seen in Figure 16 for Spark Streaming, and the behaviour
was no different for Apache Flink. Comparing the resources used to reach this
performance, Spark could use four m4.large instances, which equals 8 vCPUs and
32 GB of memory, while Kinesis Analytics only used two KPUs, which equals 2
vCPUs and 8 GB of memory.
Another big factor when comparing the two solutions and their viability in a
real-world scenario is the accumulated cost. All the price calculations and the final
prices can be found in Table 4 and Table 7. Both configurations used the same
number of Kinesis shards and the same Elasticsearch instance class; the pricing
difference between the two approaches can therefore be attributed to the
processing resources used. The full deployment for the Apache Flink application
was around $350 per month, while the Apache Spark deployment reached $615.
This is a massive price difference, especially considering that the cheaper solution
is a managed service, in contrast to the self-managed EMR cluster used for the
Spark deployment. The pricing for the nodes inside the EMR cluster was calculated
for on-demand instances, so it could be somewhat cheaper with reserved or spot
instances. Still, the Kinesis Analytics deployment is around 44% cheaper while
delivering roughly 170% of the performance of the Spark deployment.
8. Conclusion
Many companies are already using Big Data analysis to make decisions. The
companies that can use their data have a competitive advantage over their
competitors, and this advantage will grow over time as the techniques and
technology behind Big Data evolve further. Being able to make use of all the data
that is produced within the business context is key to gaining competitive
advantages and making data-driven decisions. To find relations within the data and
to be able to draw conclusions based on it, businesses have to create Data Lakes
that make it easy to analyse all the data. Especially in the fields of financial and
insurance services, Big Data analysis will increase in relevance.
The goal of this thesis was to find a solution for Big Data streaming in the context
of real-time insurance risk evaluation. Therefore, two solutions were implemented
within this problem-context and compared to each other.
The prevailing architectures in the field of Big Data are the Lambda and the Kappa
Architecture. The Lambda Architecture is more complicated, but once in place it
can deliver great results; the Kappa Architecture, in contrast, is relatively simple
but cannot deliver results as accurate as its Lambda counterpart. As always, it
depends very much on the requirements that the architecture should fulfil. For Fast
Data, or endless streaming data, the Kappa Architecture is the better fit, which is
also the reason why it was chosen for the implementation.
The next question is which technology and framework should be used to
implement these architectures. It is not easy to answer, as many different
frameworks specialize in Big Data processing; a few mentioned throughout this
thesis are MapReduce, Apache Hadoop, Apache Spark, Apache Storm and
Apache Flink. After evaluating these frameworks, Apache Spark and Apache Flink
were chosen for the implementation because of their characteristics, popularity
and performance. Both support stream processing: in Spark it is added as a
submodule that enhances the standard batch processing, whereas Flink is built for
stream processing from scratch.
After the data is processed, it needs to be saved somewhere. The datastore has
to support a large data volume and still provide good query performance. To find
a fitting data store for the implementation, multiple approaches were discussed,
including different NoSQL databases. Elasticsearch was evaluated as the
best-fitting database for the given scenario: it fits the NoSQL paradigm by saving
the entries as JSON documents, and with Kibana it also provides an easy solution
for the serving layer of the Kappa Architecture.
With the Kappa Architecture, Apache Flink, Apache Spark and Elasticsearch, two
solutions could be implemented. Both solutions were implemented to be operated
in the cloud. For this, Amazon Web Services was chosen as a cloud provider. Both
implementations took advantage of SaaS solutions, such as AWS Kinesis Data
Streams as a source. The first architecture used AWS EMR to run a Hadoop cluster
and deploy a Spark Streaming application into it. This approach could also be set
up in the same way in an on-premise datacentre. The second implementation used
AWS Kinesis Analytics to run an Apache Flink application as a managed service.
Elasticsearch was also used as a managed service using the AWS Elasticsearch
Service.
Comparing those two implementations, it is clear that both have their advantages
and disadvantages. The biggest advantage of the Spark Streaming solution is that
it can easily be run on existing Hadoop clusters, which many companies already
operate on-premise, hybrid or in the cloud. However, Spark Streaming works with
mini batches, and the configuration needed to run a resilient, low-latency streaming
application, for example the batch interval, is a lot harder to figure out than for
Apache Flink. Regarding monitoring and observability, both solutions provide
metrics, but Kinesis Analytics is better integrated with the other AWS services: the
metrics and logs are sent to CloudWatch, where it is possible to query and visualize
them. For Spark running on EMR, some metrics are also sent to CloudWatch, but
most are only accessible through the Spark History Server UI. Using the metrics
provided for Apache Flink and Amazon Kinesis Analytics, it is also possible to
automatically scale the streaming application based on the load and the needed
resources. Comparing the performance of the two solutions, Apache Flink clearly
performed far better than Apache Spark: the measured average latencies for the
Spark solution were 24811 ms and 12518 ms, depending on the resources used,
while the Flink solution stayed between 7 and 8 seconds. Overall, the Apache Flink
implementation delivered roughly 170% of the performance of the Apache Spark
solution while costing, in terms of infrastructure, around 44% less. Beyond the
infrastructure, the operational costs of running such a streaming application are
also a lot lower for the Kinesis Analytics solution, as the service itself is managed
by AWS, which reduces the operational overhead by a large margin.
Possible future work could include testing these solutions at a larger scale: once
with a substantially higher number of records per second, to figure out how many
records the current solutions can process, and once with data records that are
close in size to the maximum of one megabyte per record. Using a larger record
size should have a visible impact on both implemented solutions, and both
scenarios would deliver interesting data for future architectural decisions. Another
approach would be to implement the same analysis in Apache Storm and test it
against the results of the other two frameworks.
Bibliography
Sagiroglu, S., & Sinanc, D. (2013). Big data: A review. 2013 International
Conference on Collaboration Technologies and Systems (CTS) (pp. 42-
47). San Diego: IEEE.
Domo Inc. (2018). Data Never Sleeps 6.0. Retrieved from Domo:
https://www.domo.com/learn/data-never-sleeps-6
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A.
H. (2011). Big data: The next frontier for innovation, competition and
productivity. McKinsey Global Institute.
Zhang, D. (2018). Big Data Security and Privacy Protection. 8th International
Conference on Management and Computer Science (ICMCS 2018) (pp.
275-278). Atlantis Press.
Hasani, Z., Velinov, G., & Kon-Popovska, M. (2014). Lambda Architecture for
Real Time Big Data Analytic. ICT Innovations 2014, Web Proceedings
ISSN 1857-7288, 133--143.
Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable
realtime data systems. Manning Publications Co.
Amazon Web Services. (2018, October). Lambda Architecture for Batch and
Stream Processing (white paper). Retrieved from AWS Whitepapers:
https://d1.awsstatic.com/whitepapers/lambda-architecure-on-for-batch-
aws.pdf
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on
Large Clusters. OSDI'04: Sixth Symposium on Operating Systems Design
and Implementation (pp. 137-150). San Francisco, CA, USA: USENIX.
Grolinger, K., Hayes, M., Higashino, W. A., L'Heureux, A., Allison, D. S., &
Capretz, M. A. (2014). Challenges for MapReduce in Big Data. 2014 IEEE
World Congress on Services (pp. 182-189). Anchorage, AK, USA: IEEE.
Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., & Sears, R.
(2010, April). MapReduce Online. NSDI (Vol. 10, No. 4, p. 20).
Cheng-Zhang, P., Ze-Jun, J., Xiao-Bin, C., & Zhi-Ke, Z. (2012). Real-time
analytics processing with MapReduce. 2012 International Conference on
Machine Learning and Cybernetics (pp. 1308-1311). Xi'an, China: IEEE.
Kiran, M., Murphy, P., Monga, I., Dugan, J., & Baveja, S. S. (2015). Lambda
Architecture for Cost-effective Batch and Speed Big Data processing. 2015
IEEE International Conference on Big Data (pp. 2785-2792). IEEE.
Li, B., Mazur, E., Diao, Y., McGregor, A., & Shenoy, P. (2011). A Platform for
Scalable One-Pass Analytics using MapReduce. 2011 ACM SIGMOD
International Conference on Management of Data (pp. 985–996). New
York, NY, USA: Association for Computing Machinery.
Seif, G. (2018, February 5). The 5 Clustering Algorithms Data Scientists Need to
Know. Retrieved from https://towardsdatascience.com/the-5-clustering-
algorithms-data-scientists-need-to-know-a36d136ef68
Amirkhanyan, A., Cheng, F., & Meinel, C. (2015). Real-Time Clustering of
Massive Geodata for Online Maps to Improve Visual Analysis. 11th
International Conference on Innovations in Information Technology (pp.
308-313). Dubai: IEEE.
Baecke, P., & Bocca, L. (2017, June). The value of vehicle telematics data in
insurance risk selection processes. Decision Support Systems Volume 98,
pp. 69-79.
Huckstep, R. (2019, November 19). Insurance of Things – how IoT shows
prevention is better than cure for Insurers. InsurTech Insights Issue 39.
Segment. (2017). The 2017 State of Personalization Report. Retrieved from
Segment: http://grow.segment.com/Segment-2017-Personalization-
Report.pdf
Harris, M. (2018, December 31). How to Earn Your Customers’ Trust and
Encourage Data Sharing. Retrieved from Martech Advisor:
https://www.martechadvisor.com/articles/data-management/how-to-earn-
your-customers-trust-and-encourage-data-sharing/
STATISTIK AUSTRIA. (2019, December 31). Kfz Bestand 2019 [Motor vehicle stock 2019]. Retrieved from
STATISTIK AUSTRIA:
https://www.statistik.at/wcm/idc/idcplg?IdcService=GET_PDF_FILE&Revisi
onSelectionMethod=LatestReleased&dDocName=122637
VCÖ. (2018, June 21). VCÖ: Im Österreich-Vergleich kommen in Kärnten die
meisten mit Auto zur Arbeit [VCÖ: Compared across Austria, Carinthia has the
highest share of people driving to work]. Retrieved from VCÖ - MOBILITÄT MIT
ZUKUNFT:
https://www.vcoe.at/presse/presseaussendungen/detail/autofahrten-
arbeitsweg-2018
Amazon Web Services. (2020). What is Streaming Data? Retrieved from AWS:
https://aws.amazon.com/streaming-data/
Nagvanshi, P. (2018, December 2). 5 proven benefits of real-time analytics for
professional services organizations. Retrieved from Diginomica:
https://diginomica.com/5-proven-benefits-real-time-analytics-professional-
services-organizations
van der Veen, J. S., van der Waaij, B., Lazovik, E., Wijbrandi, W., & Meijer, R.
J. (2015). Dynamically Scaling Apache Storm for the Analysis of Streaming
Data. 2015 IEEE First International Conference on Big Data Computing
Service and Applications (pp. 154-161). Redwood City, CA, USA: IEEE.
van Rijmenam, M. (2013, January 7). A Short History Of Big Data. Retrieved from
Datafloq: https://datafloq.com/read/big-data-history/239
Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., & Jacobsen, H.-A.
(2013). BigBench: Towards an industry standard benchmark for big data
analytics. Proceedings of the 2013 ACM SIGMOD international conference
on Management of data (pp. 1197-1208). ACM.
Persico, V., Pescapé, A., Picariello, A., & Sperlì, G. (2018, December).
Benchmarking big data architectures for social networks data processing
using public cloud platforms. Future Generation Computer Systems,
Volume 89, pp. 98-109.
Wang, L., et al. (2014). BigDataBench: A big data benchmark suite from internet
services. 2014 IEEE 20th International Symposium on High Performance
Computer Architecture (pp. 488-499). Orlando: IEEE.
Feick, M., Kleer, N., & Kohn, M. (2018). Fundamentals of Real-Time Data
Processing Architectures Lambda and Kappa. SKILL 2018 -
Studierendenkonferenz Informatik (pp. 55-66). Bonn: Gesellschaft für
Informatik e.V.
Sanla, A., & Numnonda, T. (2019). A Comparative Performance of Real-time Big
Data Analytic Architectures. 2019 IEEE 9th International Conference on
Electronics Information and Emergency Communication (ICEIEC) (pp. 1-5).
Beijing, China: IEEE.
Apache Software Foundation. (2017). Introduction - Apache Kafka. Retrieved from
Apache Kafka: https://kafka.apache.org/intro
Lee, J., & Wu, W. (2019, October 8). How LinkedIn customizes Apache Kafka for
7 trillion messages per day. Retrieved from LinkedIn Engineering:
https://engineering.linkedin.com/blog/2019/apache-kafka-trillion-messages
Wang, Z., Dai, W., Wang, F., Deng, H., Wei, S., Zhang, X., & Liang, B. (2015).
Kafka and its Using in High-throughput and Reliable Message Distribution.
2015 8th International Conference on Intelligent Networks and Intelligent
Systems (ICINIS) (pp. 117-120). Tianjin, China: IEEE.
Amazon Web Services. (2019). Amazon Kinesis Data Streams. Retrieved from
Developer Guide: https://docs.aws.amazon.com/streams/latest/dev/kinesis-
dg.pdf
Nguyen, D., Luckow, A., Duffy, B. E., Kennedy, K., & Apon, A. (2018). Evaluation
of Highly Available Cloud Streaming Systems for Performance and Price.
2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid
Computing (pp. 360-363). Washington, DC, USA: IEEE.
Big Data Framework. (2019, March 12). The 4 Characteristics of Big Data.
Retrieved from Enterprise Big Data Framework:
https://www.bigdataframework.org/four-vs-of-big-data/
Strohbach, M., Daubert, J., Ravkin, H., & Lischka, M. (2016). Big Data Storage.
New Horizons for a Data-Driven Economy: A Roadmap for Usage and
Exploitation of Big Data in Europe, 119-141.
Nayak, A., Poriya, A., & Poojary, D. (2013). Type of NOSQL Databases and its
Comparison with Relational Databases. International Journal of Applied
Information Systems, 16-19.
Pavlo, A., & Aslett, M. (2016, September). What's Really New with NewSQL?
SIGMOD Rec., pp. 45-55.
Porter de León, Y., & Piscopo, T. (2014, August 14). Object Storage versus Block
Storage: Understanding the Technology Differences. Retrieved from Druva:
https://www.druva.com/blog/object-storage-versus-block-storage-
understanding-technology-differences/
Amazon Web Services. (2020). Amazon S3. Retrieved from Amazon Web
Services: https://aws.amazon.com/s3/?nc=sn&loc=0
King, T. (2016, March 3). The Emergence of Data Lake: Pros and Cons.
Retrieved from Solutions Review - Data Integration:
https://solutionsreview.com/data-integration/the-emergence-of-data-lake-
pros-and-cons/
Amazon Web Services. (2020). What is a data lake? Retrieved from Amazon Web
Services - Data lakes and Analytics on AWS: https://aws.amazon.com/big-
data/datalakes-and-analytics/what-is-a-data-lake/
Khine, P. P., & Wang, Z. S. (2018). Data lake: a new ideology in big data era. ITM
Web Conf. 17, 03025.
Fang, H. (2015). Managing Data Lakes in Big Data Era. The 5th Annual IEEE
International Conference on Cyber Technology in Automation, Control and
Intelligent Systems (pp. 820-824). Shenyang, China: IEEE.
Miloslavskaya, N., & Tolstoy, A. (2016). Big Data, Fast Data and Data Lake
Concepts. Procedia Computer Science Volume 88, 300-305.
TDWI. (2017). Data Lakes: Purposes, Practices, Patterns, and Platforms.
TDWI Research.
Tableau. (2017). Top 10 Big Data Trends 2017. Retrieved from Tableau:
https://www.tableau.com/resource/top-10-big-data-trends-2017
Amazon Web Services. (2020). AWS Lake Formation. Retrieved from Amazon
Web Services: https://aws.amazon.com/lake-formation/
Apache Software Foundation. (2020). Apache Hadoop. Retrieved from Apache
Hadoop: http://hadoop.apache.org/
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., . .
. Baldeschwieler, E. (2013). Apache Hadoop YARN: yet another resource
negotiator. Proceedings of the 4th annual Symposium on Cloud Computing
(pp. 1-16). Santa Clara, California: Association for Computing Machinery.
Zaharia, M. A., Xin, R., Wendell, P., Das, T., Armbrust, M., Dave, A., . . . Stoica, I.
(2016). Apache Spark: A Unified Engine for Big Data Processing.
Communications of the ACM, 59(11), 56-65.
Karlon, A. (2020, January 16). How do Hadoop and Spark Stack Up? Retrieved
from logz.io: https://logz.io/blog/hadoop-vs-spark/
Gopalani, S., & Arora, R. (2015, March). Comparing Apache Spark and Map
Reduce with Performance Analysis using K-Means. International Journal of
Computer Applications, pp. 8-11.
Lopez, M. A., Lobato, A. G., & Duarte, O. C. (2015). A Performance Comparison
of Open-Source Stream Processing Platforms. 2015 IEEE 17th
International Conference on High Performance Computing and
Communications, 2015 IEEE 7th International Symposium on Cyberspace
Safety and Security, and 2015 IEEE 12th International Conference on
Embedded Software and Systems (pp. 166-173). New York: IEEE.
Carbone, P., Ewen, S., Haridi, S., Katsifodimos, A., Markl, V., & Tzoumas, K.
(2016). Apache Flink™: Stream and Batch Processing in a Single Engine.
IEEE Data Engineering Bulletin, 36-40.
Iqbal, M. H., & Soomro, T. R. (2015). Big Data Analysis: Apache Storm
Perspective. International Journal of Computer Trends and Technology
(IJCTT), 9-14.
Apache Software Foundation. (2020). Apache Storm. Retrieved from Apache
Storm: http://storm.apache.org/index.html
Prakash, C. (2018, March 30). Spark Streaming vs Flink vs Storm vs Kafka
Streams vs Samza: Choose Your Stream Processing Framework.
Retrieved from LinkedIn:
https://www.linkedin.com/pulse/spark-streaming-vs-flink-storm-kafka-
streams-samza-choose-prakash/
Elasticsearch. (2020). What is Elasticsearch? Retrieved from Elasticsearch:
https://www.elastic.co/what-is/elasticsearch
Kreps, J. (2014, July 2). Questioning the Lambda Architecture. Retrieved from
Oreilly: https://www.oreilly.com/radar/questioning-the-lambda-architecture/
ITechSeeker. (2019, January 9). Introduction of Lambda Architecture. Retrieved
from ITechSeeker: http://itechseeker.com/en/projects-2/implement-lambda-
architecture/introduction-of-lambda-architecture/
PubNub. (2020). What is Geohashing? Retrieved from PubNub:
https://www.pubnub.com/learn/glossary/what-is-geohashing/
Apache Software Foundation. (2020). Spark Streaming Programming Guide.
Retrieved from Apache Spark:
https://spark.apache.org/docs/latest/streaming-programming-guide.html
SharedStreets. (2019). trip-simulator. Retrieved from Github:
https://github.com/sharedstreets/trip-simulator
aws-es-proxy. (2020). aws-es-proxy. Retrieved from Github:
https://github.com/abutaha/aws-es-proxy
Statistik Austria. (2020). Unfallgeschehen nach Ortsgebiet, Freiland und
Straßenarten [Accidents by urban areas, rural roads and road types]. Statistik Austria.
Valiant, L. G. (1990, August). A Bridging Model for Parallel Computation.
Communications of the ACM, pp. 103-111.
Kajdanowicz, T., Indyk, W., Kazienko, P., & Kubul, J. (2012). Comparison of the
Efficiency of MapReduce and Bulk Synchronous Parallel Approaches to
Large Network Processing. 2012 IEEE 12th International Conference on
Data Mining Workshops (pp. 218-225). Brussels: IEEE.
Okada, T., Amaris, M. G., & Goldman, A. (2015). Scheduling Moldable BSP Tasks
on Clouds. XXII Symposium of Systems of High Performance Computing.
Florianopolis, Brazil.
Jungblut, T. (2011, October 24). Apache Hama realtime processing. Retrieved
from Thomas Jungblut's Blog:
https://codingwiththomas.blogspot.com/2011/10/apache-hama-realtime-
processing.html
Apache Software Foundation. (2020). HDFS Architecture. Retrieved from Apache
Hadoop: https://hadoop.apache.org/docs/current/hadoop-project-
dist/hadoop-hdfs/HdfsDesign.html
Cloud Native Computing Foundation. (2018, June 11). CNCF Cloud Native
Definition v1.0. Retrieved from cncf.io:
https://github.com/cncf/toc/blob/master/DEFINITION.md
Microsoft. (2019, August 20). Defining cloud native. Retrieved from Microsoft
Documentation: https://docs.microsoft.com/en-us/dotnet/architecture/cloud-
native/definition
Apache Software Foundation. (2020). RDD Programming Guide. Retrieved from
Apache Spark: https://spark.apache.org/docs/latest/rdd-programming-
guide.html
Pattamsetti, R. M. (2017). Distributed Computing in Java 9. Packt Publishing.
Chang, W. L., Boyd, D., & Levin, O. (2019, October 21). NIST Big Data
Interoperability Framework: Volume 6, Reference Architecture. National
Institute of Standards and Technology.
MapR Technologies, Inc. (2015, March). Zeta Architecture. Retrieved from MapR
Whitepapers: https://mapr.com/whitepapers/zeta-architecture/assets/zeta-
architecture.pdf
List of figures
Figure 1 - Speed, Serving and Batch Layer of the Lambda Architecture (ITechSeeker,
2019) ......................................................................................................................... 16
Figure 2 - Outline of a Kappa Architecture ....................................................................... 18
Figure 3 - Kinesis Data Stream with n shards that are consumed by multiple consumers.
.................................................................................................................................. 21
Figure 4 - Partition with records that have a unique sequence number and consumers that
use the offset to read any record from the partition (Apache Software Foundation,
2017) ......................................................................................................................... 23
Figure 5 - Example of how a word count application would work using the MapReduce
programming paradigm (Pattamsetti, 2017) ............................................................. 25
Figure 6 - Scheduling and synchronisation of a superstep in the bulk synchronous parallel
model (Okada, Amaris, & Goldman, 2015) ............................................................... 28
Figure 7 - Architecture of the Hadoop Distributed File System (Apache Software
Foundation, 2020)..................................................................................................... 30
Figure 8 - Example of a stream partitioned using a tumbling window .............................. 33
Figure 9 - Example of a stream partitioned with a sliding window .................................... 33
Figure 10 - Example of a Storm topology that shows the link between spouts and bolts
(Apache Software Foundation, 2020) ....................................................................... 34
Figure 11 - A Data Lake and possible surrounding systems that interact with the data
(Amazon Web Services, 2020) ................................................................................. 40
Figure 12 - High-level concept of the risk analysis application ......................................... 44
Figure 13 - Explanation of how a geohash is built (PubNub, 2020) ................................. 47
Figure 14 - Architecture for the Spark Streaming implementation .................................... 48
Figure 15 - DStream and its interaction with windowing operators (Apache Software
Foundation, 2020)..................................................................................................... 49
Figure 16 - Spark Streaming Metrics: Kinesis Data Stream PutRecords vs GetRecords 51
Figure 17 - Spark Streaming Metrics: Kinesis Data Stream IteratorAge ......................... 52
Figure 18 - Spark Streaming Metrics: Input Rate, Scheduling Delay and Processing Time
.................................................................................................................................. 53
Figure 19 - Spark Streaming Scheduling Delay because of a misconfigured batch interval
.................................................................................................................................. 55
Figure 20 - Architecture for the Kinesis Analytics implementation ................................... 56
Figure 21 - Kinesis Analytics Solution: CPU & Memory Utilisation ................................... 59
Figure 22 - Kinesis Analytics Solution: Records processed per second .......................... 60
Figure 23 - Kinesis Analytics Solution: Fault tolerance through checkpointing ................ 60
Figure 24 - Analysis Result shown as a heatmap in Kibana ............................................ 63
Figure 25 - Variance of velocity around Vienna (left) & count of records around Vienna
(right) ........................................................................................................................ 63
List of equations
Equation 1 - Relation between the batch view, real-time view and how the data is queried
(Marz & Warren, 2015) ............................................................................................. 17
Equation 2 - Formula to calculate the number of shards needed for a Kinesis Data Stream
(Amazon Web Services, 2019) ................................................................................. 20
Equation 3 - Interface and return value of the map and reduce functions ........................ 25
Equation 4 - AggregateFunction Interface Definition ........................................................ 57
List of code listings
Code Listing 1 - Simulation.step method that executes the steps for all agents and sends
the result to a Kinesis Data Stream using the AWS SDK ......................................... 45
Code Listing 2 - Data preparation and aggregation using Spark SQL before using es-
hadoop to save the data ........................................................................................... 50
Code Listing 3 - Stream windowing and data aggregation using Apache Flink ................ 58
List of tables
Table 1 - Possible customer base for a real-time insurance product and the number of car
rides for that product ................................................................................................. 10
Table 2 – Measured latency in the Spark Streaming solution .......................................... 51
Table 3 - Spark Streaming Solution Pricing per Service .................................................. 54
Table 4 - Spark Streaming Solution hourly/monthly price ................................................ 54
Table 5 - Measured latency in the Kinesis Analytics Solution .......................................... 58
Table 6 - Kinesis Analytics Solution Pricing per Service .................................................. 61
Table 7 - Kinesis Analytics Solution hourly/monthly price ................................................ 61