Near Real-Time Big-Data Processing For Data Driven Application
Abstract— This paper addresses the context data integration and processing problem in the design of data-driven applications by introducing the ASAPCS (Auto-scaling and Adjustment Platform for Cloud-based Systems) platform. Its conceptual model, technical architecture and data integration process are described. The ASAPCS platform supports model-driven configuration, separation of context acquisition and application, utilization of various context processing algorithms, and scalability. It is based on technologies that have proven to work well with big data, and each part of it is horizontally scalable. ASAPCS integrates data from heterogeneous sources, aggregates raw context data and uses the results to perform real-time adjustments in the data-driven application. Its application is illustrated with an example of providing data store resilience.

Keywords— stream processing, big data, cloud computing, data-driven systems

I. INTRODUCTION

Data-driven applications with advanced real-time decision-making capabilities are becoming more widespread [1]. Modern information technologies such as the Internet of Things and cloud computing have greatly increased the variety of data available to these applications. Contextual data characterizing the situation of an entity [2] are one of the data types used in data-driven applications. These data come from various physical sources (e.g., sensors) or logical sources (e.g., web services). It is assumed that context data are at least partially beyond the control of the application, that their impact cannot be entirely predicted during application design, that their processing requires extra effort, and that their velocity of change is high. Some examples of context data are the changing number of application users and the associated system load, traffic intensity, weather conditions, electricity price in the market, etc. The context life-cycle defines the context processing activities, which include context acquisition, modeling, reasoning and dissemination. Among the challenges associated with context processing in relation to IoT are the unification and standardization of the various techniques used throughout the context processing life-cycle, and context sharing [3]. Context reasoning and interpretation are of particular importance in the development of context-aware information systems, where raw context should be transformed into meaningful interpretations [4]. In general, context capture and pre-processing technologies today are more evolved than the technologies for real-time post-processing, utilization and life-cycle management of context [5]. The main challenges are volatility of providers, semi-structured data formats, incompatibility with legacy applications, and fragmentation of stakeholders, which creates silos of disconnected data and context applications.

This paper investigates the problem of real-time context data integration for consumption in data-driven applications when making application execution and adaptation decisions. Software applications are provided in a Software as a Service model, and the problem is investigated from the service provider's perspective. Several aspects of context processing are addressed, namely: 1) model-driven handling of context processing; 2) scalability of context processing; 3) integration of context processing in data-driven applications; 4) decoupling of context acquisition from other context processing activities; 5) different levels of context data abstraction; and 6) context sharing. The context model clearly distinguishes between context measurements (referred to as measurable properties) and the processed context elements used in data-driven applications. The context elements are perceived as case-independent and represent domain-specific knowledge. They are associated with application execution or adaptation actions referred to as adjustments. Common context elements and actions can be shared and reused across multiple application cases. Measurable properties capture case-specific aspects at low granularity, while context elements are interpretations of one or multiple measurable properties at a higher level of granularity. That is particularly valuable for service providers, who can identify solutions applicable in different environments and apply these solutions for other users of the application service.

The goal of the paper is to describe the context processing and integration approach and to present a horizontally scalable, model-driven technical solution for context acquisition, integration and aggregation. The context processing approach builds on the context processing components used in the CDD methodology [6], and the technical solution is constructed as part of a collaboration project with a company providing network management and data storage solutions. To address the issues above, an agnostic cloud-based platform named ASAPCS (Auto-scaling and Adjustment Platform for Cloud-based Systems) is introduced.
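To make the distinction concrete, the context model can be pictured as follows (a minimal Python sketch; the class and field names are illustrative and are not part of the ASAPCS API):

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MeasurablePropertyItem:
    # Raw, case-specific measurement posted by a context data provider.
    dimensions: Dict[str, str]    # e.g. {"disk": "D1", "server": "s1"}
    values: Dict[str, float]     # e.g. {"writeErrors": 3, "temperature": 41.5}
    timestamp: float

@dataclass
class ContextElement:
    # Case-independent interpretation of one or more measurable properties.
    name: str                    # e.g. "DiskRisk"
    calculate: Callable[[List[MeasurablePropertyItem]], str]

@dataclass
class Adjustment:
    # Application execution or adaptation action bound to a context element.
    trigger: Callable[[str], bool]   # e.g. lambda risk: risk == "high"
    execute: Callable[[], None]      # e.g. replicate data to a safe location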
Fig. 2. Conceptual model of ASAPCS
Fig. 3. ASAPCS architecture and data flow
…the user interface of the ASAPCS. The sliding interval and window length are typical parameters used in stream processing. The window length is the time span, in seconds, over which aggregated values are calculated, while the sliding interval determines how often the values are recalculated. ASAPCS starts a measurable property archiving Apache Spark job according to the supplied schema and aggregation options. The results are stored in the Apache Cassandra database (4th connection in Fig. 3). ASAPCS creates a single table for each measurable property; its schema is synchronized with the user-supplied measurable property structure and extended with a time column.

Raw measurable property data is also processed by a context element (CE) calculation Apache Spark job (5th connection in Fig. 3). ASAPCS ensures that there exists a single context element calculation Apache Spark job per context element. The current status of the job can be monitored in the user interface of the ASAPCS. Once the context element values are calculated, the results are sent back to Kafka (6th connection in Fig. 3) and archived in Apache Cassandra (7th connection in Fig. 3). ASAPCS ensures that there exists a single Kafka topic and a single Cassandra table per context element. Both measurable property and context element data are sent to Kafka in JSON format.

Context element data from Kafka is picked up by the adjustment triggering Apache Spark job (8th connection in Fig. 3). It checks whether the current context element values should trigger an adjustment that performs changes in the data-driven system (for example, scaling the data-driven application or adjusting its business logic). If so, the adjustment triggering job sends the specific context element data rows to Kafka (9th connection in Fig. 3). ASAPCS ensures that there exists one Kafka topic per adjustment. The message from Kafka is picked up by a Docker container running on the Adjustment engine (10th connection in Fig. 3). If the Kafka message does not contain all the data needed for performing the adjustment, additional data can easily be retrieved from Apache Cassandra (e.g., historical context element and measurable property values are stored there). The adjustment can be implemented in any programming language; the only requirement is the existence of a Kafka consumer implementation. Lastly, the adjustment is executed by querying the API of the data-driven system (11th connection in Fig. 3).

The Kafka proxy cluster is used to ensure an extra level of security and flexibility in defining the ASAPCS measurable property API. Kafka is chosen since it is horizontally scalable and fault-tolerant, preserves message order, and supports exactly-once processing. It is also known to perform well with streaming applications and other real-time data. Apache Spark is used since it supports stream processing and provides machine learning (MLlib) and graph processing (GraphX) libraries. Apache Cassandra is chosen as the database because it is horizontally scalable and well integrated with Apache Spark; it has also proven to work well with temporal data.

The prototype of the ASAPCS is in its early stages and is hosted on RTU's CloudStack-based open-source cloud computing platform. Currently ASAPCS is being validated with a use case of a video transcoding application requiring auto-scaling and run-time alteration of data replication logic. This is done in collaboration with Komerccentrs DATI Grupa, a Latvia-based IT company.
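As an illustration of how the window length and sliding interval parameters interact, the following sketch shows a comparable windowed aggregation in Spark Structured Streaming (the topic name, broker address and the DiskHealth-like schema are assumptions, not the ASAPCS implementation):

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("mp-aggregation-sketch").getOrCreate()

# Assumed JSON layout of a DiskHealth-like measurable property item.
schema = StructType([
    StructField("disk", StringType()),
    StructField("temperature", DoubleType()),
    StructField("ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder address
       .option("subscribe", "mp-disk-health")            # assumed topic name
       .load())

items = (raw
         .select(from_json(col("value").cast("string"), schema).alias("mp"))
         .select("mp.*"))

# Window length of 60 s with a 10 s sliding interval: the aggregate is
# recalculated every 10 seconds over the latest 60 seconds of data.
aggregated = (items
              .groupBy(window(col("ts"), "60 seconds", "10 seconds"),
                       col("disk"))
              .agg(avg("temperature").alias("avgTemperature")))

# The console sink stands in for the Cassandra archiving step
# (4th connection in Fig. 3).
query = aggregated.writeStream.outputMode("update").format("console").start()
query.awaitTermination()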
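Similarly, since the only requirement on an adjustment implementation is a Kafka consumer, a minimal adjustment container could be sketched as follows (using the kafka-python client; the topic name, message layout and target API endpoint are assumptions):

import json

import requests
from kafka import KafkaConsumer

# One Kafka topic exists per adjustment; the name below is assumed.
consumer = KafkaConsumer(
    "adj-disk-replication",
    bootstrap_servers="kafka-proxy:9092",  # placeholder address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    row = message.value  # context element data row that triggered the adjustment
    # Execute the adjustment by querying the data-driven system's API
    # (11th connection in Fig. 3); the endpoint is illustrative.
    requests.post(
        "http://data-driven-system/api/replicate",
        json={"disk": row.get("Disk"), "target": "safe-region"},
        timeout=10,
    )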
IV. STEPS FOR BUILDING CONTEXT-AWARE SYSTEMS

The ASAPCS platform is used to develop context processing solutions for data-driven applications. The development process consists of multiple steps and is illustrated by a running example. This section provides a process overview followed by an elaboration of the development steps.

A. Overview

A data-driven application uses an ASAPCS-based solution to provide context-dependent decision-making and adaptation capabilities. The development process described in this section concerns building the ASAPCS-based solution and assumes that the data-driven application is able to consume the inputs provided by ASAPCS-based adjustments.

The main steps of the process are:
1. Identification of context-dependent variations in the data-driven application;
2. Specification of potential context providers;
3. Definition of relevant entities and measurable properties;
4. Creation of context elements and their calculations;
5. Implementation of adjustments associated with the context elements defined;
6. Deployment of the solution;
7. Operation of the solution, including context data integration and execution of adjustments.

The first step is not performed directly in the ASAPCS user interface. It is assumed that this step involves human experts who determine the variations using domain expertise or data analysis methods. The further steps are supported by the ASAPCS, which abstracts away the complexity of configuring measurable property REST web services, Kafka topics, Spark jobs and the Cassandra database schema.

B. Running Example

The running example illustrates a data storage problem. Data is stored on disks that are located on data nodes (servers). Those servers are located in data centers belonging to specific geographic regions. Disk health is measured by a measurable property reflecting the disk's write errors, read errors, temperature and bad sectors. The data center region has a measurable property associated with its safety. The level of safety can be decreased by natural hazards, security incidents or terrorist attacks. A context element showing the risk level for the disk is defined based on both measurable properties. In case of a high risk, the data storage tier should replicate the data to a safe location. A safe location can be determined by querying the measurable property values stored in the Apache Cassandra database.

C. Elaboration of Steps

The process of building a data-driven system starts with identifying the possible context-dependent variations – parts of the application logic that should be adjusted according to the varying context. In the given example it is the replication of data to a safe location in case of a natural hazard or disk health issues.

Once the variations are known, possible context data providers are identified in Step 2. Both local and external systems can be used for this purpose. In some scenarios, context data collection agents must be implemented to support the flow of context data to the ASAPCS. To improve the credibility and continuity of the context data flow, it is advised to use several complementary context data providers whenever possible. In ASAPCS each context provider has a name, description, ID and token. The ID and token are used as credentials for posting measurable property data via the ASAPCS REST API; the name and description are used for informational purposes.

The context data items originating from the context providers have a compound structure – they reference one or many entities of the problem domain and can provide multiple measures for them. For example, a single context data item could describe disk write errors, read errors, bad sectors and temperature, and this data would be linked to a specific disk residing in a server. A corresponding measurable property schema would have two dimensions (disk, server) and four values (write and read errors, bad sectors and temperature).

The next step of designing a context-aware system considers establishing the measurable properties and the entity model. The entity model defines the types of entities that are directly or indirectly referenced by the measurable properties. A fragment of the entity model for the data storage example is given in Fig. 4.

Fig. 4. Entity model fragment

It specifies that the disk is located inside a server, which resides in a data center belonging to a geographic region. Upon saving the entity model, it is parsed by the ASAPCS to determine all entity types and possible relations.

Afterwards the designer can add all entity instances and their relations (see Fig. 5). The entity model makes it possible to omit certain entities from the measurable property dimensions. For the given example, this would allow the designer to remove the server dimension from the measurable property DiskHealth, since it is known where each disk resides. This reduces the amount of measurable property related traffic sent to the ASAPCS from the context data provider.
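For illustration, posting a DiskHealth data item via the ASAPCS REST API (Step 2 above) could look roughly like the following sketch (the endpoint path and JSON field names are assumptions; the ID and token act as the provider's credentials):

import requests

# Assumed endpoint path and payload layout for the ASAPCS REST API.
ASAPCS_URL = "https://asapcs.example.org/api/measurable-properties/DiskHealth"

item = {
    "providerId": "provider-42",   # context provider ID (credential)
    "token": "secret-token",       # context provider token (credential)
    "dimensions": {"disk": "D1", "server": "s1"},
    "values": {
        "writeErrors": 3,
        "readErrors": 0,
        "badSectors": 12,
        "temperature": 41.5,
    },
}

response = requests.post(ASAPCS_URL, json=item, timeout=10)
response.raise_for_status()  # a rejected item (e.g. unknown disk D10) fails here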
Entity and relation instances (e.g., D1, s1, D1-s1) are also used to validate the posted measurable property values. In the current example, posting data about a disk D10 would be considered an error, since no such disk has been previously defined. A screenshot of the measurable property schema from ASAPCS is given in Fig. 6.

…aggregation options have to be specified in a similar manner as the archiving options (see Fig. 7).

If multiple measurable properties are used, the designer must specify how to join them. Measurable properties can only be joined if they contain at least one matching dimension. Extra dimensions from the entity model can be added in the Context Element design interface. Consider another measurable property, RegionalSecurity, that has a dimension Region and a value Safety, and suppose a context element DiskRisk needs to be created based on the measurable properties RegionalSecurity and DiskHealth. Although the measurable property DiskHealth contains only the Disk dimension, based on the entity model it can be extended with the Server, DataCenter and Region dimensions. The dimensions are added only in the scope of the context element DiskRisk. After adding the extra dimensions, both measurable properties contain the matching dimension Region and can finally be joined (see Fig. 8).
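The effect of the join can be pictured with static DataFrames (a PySpark sketch with invented values; in ASAPCS the equivalent happens inside the streaming context element calculation job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ce-join-sketch").getOrCreate()

# DiskHealth with its Disk dimension already extended, via the entity
# model, to the Region dimension (all values invented).
disk_health = spark.createDataFrame(
    [("EU-North", "D1", 3, 41.5)],
    ["Region", "Disk", "BadSectors", "Temperature"],
)

regional_security = spark.createDataFrame(
    [("EU-North", 0.2)],
    ["Region", "Safety"],
)

# Both measurable properties now share the matching dimension Region,
# so the input rows for the DiskRisk context element can be joined.
disk_risk_input = disk_health.join(regional_security, on="Region")
disk_risk_input.show()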
After the list of used measurable properties is finalized, the designer can specify which dimensions and which values will be used in the resulting context element (i.e., the context element schema, see Fig. 9). Only entities from the included measurable properties can be used for dimension definition. A virtually unlimited number of values can be defined; however, the designer must specify how each value is calculated from the measurable property data.

Adjustments consist of a triggering condition and implementation logic. Run-time adjustments are triggered by certain context element values. If multiple context elements are used, they have to be joined in a similar way as measurable properties during context element creation (see Fig. 9). In order to join them, there has to be at least one matching dimension. New dimensions can be added based on the entity model. These changes are, however, reflected only in the scope of the adjustment and have no effect on the context element structure beyond it. An example of adjustment triggering logic is given in Fig. 11.
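The kind of triggering logic shown in Fig. 11 can be paraphrased as follows (a sketch only; the field name, condition and topic name are invented for illustration):

import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # placeholder address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def should_trigger(row):
    # Invented condition: replicate when the calculated risk level is high.
    return row.get("RiskLevel") == "high"

def forward_triggered_rows(rows):
    # Matching context element rows go to the per-adjustment Kafka topic
    # (9th connection in Fig. 3); the topic name is assumed.
    for row in rows:
        if should_trigger(row):
            producer.send("adj-disk-replication", row)
    producer.flush()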
• Adjustments can be implemented in a platform-agnostic way, since the only requirement is the availability of a Kafka consumer (a wide range of programming languages is currently supported);
• The solutions serving as the foundation of ASAPCS have gained wide acceptance in the big data community.

During the next development iterations, it is planned to extend ASAPCS with:
• calculation of numeric continuous context element values;
• integration with Apache Spark MLlib at the ASAPCS user interface level;
• addition of D3.js-based visualization of measurable property and context element data flows;
• integration of real-time Docker container console output next to the adjustment implementation in the user interface of ASAPCS;
• inclusion of scaling adjustments, so that ASAPCS can scale itself based on the load.

It is planned to release the source code of ASAPCS after it reaches a sufficient level of maturity.

ACKNOWLEDGMENT

The research leading to these results has received funding from the research project "Competence Centre of Information and Communication Technologies" of EU Structural Funds, contract No. 1.2.1.1/16/A/007 signed between the IT Competence Centre and the Central Finance and Contracting Agency, Research No. 1.12 "Configurable parameter set based adaptive cloud computing platform scaling method".

REFERENCES

[1] C. L. Philip Chen and C.-Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: A survey on Big Data," Inf. Sci., vol. 275, pp. 314–347, 2014.
[2] A. K. Dey, "Context-aware computing: The CyberDesk project," in Proc. AAAI 1998 Spring Symposium on Intelligent Environments, 1998, pp. 51–54.
[3] C. Perera, A. Zaslavsky, P. Christen, and D. Georgakopoulos, "Context aware computing for the internet of things: A survey," IEEE Commun. Surv. Tutorials, vol. 16, no. 1, pp. 414–454, 2014.
[4] M. Born, J. Kirchner, and J. P. Müller, "Context-driven business process modelling," in Jt. Proc. TCoB 2009, AT4WS 2009, AER 2009, MDMD 2009, in conjunction with ICEIS 2009, 2009, pp. 17–26.
[5] U. Alegre, J. C. Augusto, and T. Clark, "Engineering context-aware systems and applications: A survey," J. Syst. Softw., vol. 117, pp. 55–83, 2016.
[6] S. Bērziša et al., "Capability Driven Development: An approach to designing digital enterprises," Bus. Inf. Syst. Eng., vol. 57, no. 1, pp. 15–25, 2015.
[7] M. Baldauf, S. Dustdar, and F. Rosenberg, "A survey on context aware systems," Int. J. Ad Hoc Ubiquitous Comput., vol. 2, no. 4, pp. 263–277, 2007.
[8] T. D. C. Mattos, F. M. Santoro, K. Revoredo, and V. T. Nunes, "A formal representation for context-aware business processes," Comput. Ind., vol. 65, no. 8, pp. 1193–1214, 2014.
[9] E. P. Blasch, S. Russell, and G. Seetharaman, "Joint data management for MOVINT data-to-decision making," in Proc. 14th International Conference on Information Fusion, 2011, pp. 1–8.
[10] T. Strang and C. Linnhoff-Popien, "A context modeling survey," in Workshop on Advanced Context Modelling, Reasoning and Management, UbiComp 2004 – The Sixth International Conference on Ubiquitous Computing, 2004, pp. 1–8.
[11] N. Khabou and I. B. Rodriguez, "Towards a novel analysis approach for collaborative ubiquitous systems," in Proc. 2012 IEEE 21st International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises, 2012, pp. 30–35.
[12] D. Guan, W. Yuan, S. Lee, and Y. K. Lee, "Context selection and reasoning in ubiquitous computing," in Proc. 2007 International Conference on Intelligent Pervasive Computing (IPC 2007), 2007, pp. 184–187.
[13] S. Bouaziz, A. Nabli, and F. Gargouri, "From traditional data warehouse to real time data warehouse," Advances in Intelligent Systems and Computing, vol. 557, pp. 467–477, 2017.
[14] D. Gomes, J. M. Goncalves, R. O. Santos, and R. Aguiar, "XMPP based context management architecture," in Proc. 2010 IEEE Globecom Workshops, 2010, pp. 1372–1377.
[15] A. Balalaie, A. Heydarnoori, and P. Jamshidi, "Microservices architecture enables DevOps: Migration to a cloud-native architecture," IEEE Software, vol. 33, no. 3, pp. 42–52, 2016.
[16] M. Fazio and A. Puliafito, "Cloud4sens: A cloud-based architecture for sensor controlling and monitoring," IEEE Commun. Mag., vol. 53, no. 3, pp. 41–47, 2015.
[17] J. Samosir, M. Indrawan-Santiago, and P. D. Haghighi, "An evaluation of data stream processing systems for data driven applications," Procedia Computer Science, vol. 80, pp. 439–449, 2016.