
Flow-Based Programming For Machine Learning

Tanmaya Mahapatra ([email protected])
Technische Universität München, https://orcid.org/0000-0002-7946-5497

Syeeda Nilofer Banoo
Technische Universität München

Research

Keywords: End-User Programming, Graphical Flows, Graphical Programming Tools, Machine learning as a service (MLaaS), Machine-Learning-Platform-as-a-Service (ML PaaS), Machine Learning Pipelines

DOI: https://doi.org/10.21203/rs.3.rs-707294/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Flow-based Programming for Machine Learning

Tanmaya Mahapatra · Syeeda Nilofer Banoo

Received: July 11, 2021/ Revised: NA / Accepted: NA

Abstract Machine Learning (ML) has gained prominence and has tremendous applications in fields like medicine, biology, geography and astrophysics, to name a few. Arguably, in such areas it is used by domain experts who are not necessarily skilled programmers. Thus, programming ML applications presents a steep learning curve for such domain experts. To overcome this and foster widespread adoption of ML techniques, we propose to equip them with domain-specific graphical tools. Such tools, based on the principles of the flow-based programming paradigm, would support the graphical composition of ML applications at a higher level of abstraction and the auto-generation of target code. Accordingly, (i) we have modelled ML algorithms as composable components; (ii) we have described an approach to parse a flow created by connecting several such composable components and to use an API-based code generation technique to generate the ML application. To demonstrate the feasibility of our conceptual approach, we have modelled the APIs of Apache Spark ML as composable components and validated it in three use-cases. The use-cases are designed to capture the ease of program specification at a higher abstraction level, easy parametrisation of ML APIs, auto-generation of the ML application and auto-validation of the generated model for better prediction accuracy.

Keywords End-User Programming · Graphical Flows · Graphical Programming Tools · Machine learning as a service (MLaaS) · Machine-Learning-Platform-as-a-Service (ML PaaS) · Machine Learning Pipelines

Tanmaya Mahapatra · Syeeda Nilofer Banoo


Lehrstuhl für Software und Systems Engineering, Fakultät für Informatik,
Technische Universität München, Boltzmannstraße 03, 85748 Garching b. München
E-mail: [email protected]
Tanmaya Mahapatra
Birla Institute of Technology and Science, Department of Computer Science and Information
Systems, Vidya Vihar, Pilani - 333031, India.
E-mail: [email protected]

1 Introduction

Machine Learning (ML) is a scientific discipline that develops and makes use of
a particular class of algorithms which are designed to solve problems without ex-
plicit programming [42]. The algorithms infer about patterns present in a dataset
and learn how to solve a specific problem. This self-learning technique of com-
puter systems has gained prominence and has vast application in the current era.
The massive influx of data from the Internet and other sources creates a large
bed of structured as well as unstructured datasets, where ML techniques can be
leveraged to find meaningful correlations and support automated decision-making. Nevertheless, domain experts like traffic engineers or molecular biologists, who are less-skilled programmers, face a steep learning curve when using ML techniques, i.e. they must learn how to program and write a ML application from scratch using general-purpose, high-level languages like Java, Scala or Python. This learning curve hinders the widespread adoption of ML by researchers unless they are well-trained in programming. In response to this, we propose to equip less-skilled programmers who are domain experts with graphical tools. In particular, we intend to support the graphical specification of ML programs via the flow-based programming paradigm and the auto-generation of target code, thereby shielding the user of such tools from the nuances and complexities of programming. Such graphical flow-based programming tools, called mashup tools, have been extensively used to simplify application development [10].

1.1 Contributions

Succinctly, the paper makes the following contributions:


1. We take the Java APIs of Spark ML operating on DataFrames [24], a popular
   ML library of Apache Spark [40, 41, 39], and model them as composable compo-
   nents. Every component abstracts one or more underlying APIs such that it
   represents one unit of processing in a ML application. The different pa-
   rameters accepted by the underlying APIs are made available on the front-end
   while using a specific component, to support easy parametrisation.
2. Development of a conceptual approach to parse a ML flow created by con-
necting several such components from step 1. The parsing ensures that the
components are connected in an acceptable positional hierarchy such that it
would generate target code which is compilable. The parsed user-flow is used
to generate target ML code using principles of Model-Driven Software Devel-
opment (MDSD). Model to text transformation is used, especially API based
code generation techniques [32], to transform the graphical model to target
code.
3. The conceptual approach is validated by designing three ML use-cases involv-
ing prediction using decision trees, anomaly detection with K-means clustering,
and collaborative filtering techniques to develop a music recommender applica-
tion. The use-cases demonstrate how such flows can be created by connecting
different components from step 1 at a higher level of abstraction, parameters
to various components can be configured with ease, automatic parsing of the
user flow to give feedback to the user if a component has been used in a wrong
position in a flow and finally automatic generation of ML application without
the end-user having to write any code. The user can split the initial dataset
into training and testing datasets, specify a range for different model parame-
ters for the system to iteratively generate models and test them till a model is
produced with higher prediction accuracy.

1.2 Outline

The rest of the paper is structured in the following way: Section 2 summarizes
the background while Section 3 discusses the related work. We give an overview
of Spark and its machine learning library, i.e. Spark ML, the different kinds of data
transformation APIs available in Spark and our design choice to support only a
specific kind of API in Section 4. Section 5 describes our conceptual approach
to support graphical flow-based ML programming at a higher level of abstraction
involving modelling of APIs as components, flow-parsing and target code gener-
ation while Section 6 describes its realization. Section 7 validates the conceptual
approach in three concrete use-cases. We compare our conceptual approach with
existing works in Section 8 which is shortly followed by concluding remarks in
Section 9.

2 Background

2.1 Machine Learning

In classical instruction-based programming, a machine processes datasets based
on predefined rules. However, in ML programming machines are trained on large
datasets to discover the inherent pattern and thereby auto-create the data pro-
cessing rules. ML programming involves the usage of an input dataset, features or
pieces of information in the dataset useful for problem-solving and a ML model.
The features, when passed to a ML algorithm for learning purposes, yield a ML
model. ML algorithms have been broadly classified into two categories: super-
vised and unsupervised learning algorithms. In supervised learning, the system
learns from a training dataset with labelled data to predict future events. Examples
include classification and regression. Regression predicts continuous values,
while classification predicts discrete classes. Unsupervised machine learn-
ing algorithms work on unlabelled data to find unknown patterns in the data.
Unsupervised learning outputs data as clusters. Clusters are a grouping of similar
data from the unlabelled input. Creation of a ML model involves specific steps
like data collection, data processing, choosing a relevant ML algorithm, training
the model on the training dataset, evaluating the model on testing dataset for
prediction accuracy, tuning the parameters to enhance the prediction accuracy of
the model and finally using the model to make predictions on new input data.
ML is often confused with Deep Learning (DL) [18]. The necessary steps in-
volved to create a model are the same for both ML and DL. Nevertheless, there
are many subtle differences between the two. First, we rely on a single algorithm
in ML to predict while DL uses multiple ML algorithms in a layered network to
make predictions. Second, ML typically works with structured data, while DL can
work with unstructured data too. Third, in ML, manual intervention is required in
the form of model parameter tuning to improve prediction accuracy while DL self-
improves to minimise error and increase prediction accuracy. Fourth, DL is suited
well with the availability of massive amounts of data and is a more sophisticated
technique in comparison to ML.

2.2 Machine Learning libraries

There are a plethora of ML libraries available, including TensorFlow [3, 2], Py-
Torch [29], FlinkML [7], SparkML and scikit-learn [30], among others. TensorFlow
has become one of the most prominent libraries for both ML as well as DL. It
provides flexible APIs for different programming languages with support for easy
creation of models by abstracting low-level details. PyTorch is another open-source
ML library developed by Facebook. It is based on Torch [8], an open-source ML
library used for scientific computation. This library provides several algorithms
for DL applications like natural language processing and computer vision, among
others via Python APIs. Similarly, Scikit-learn is an open-source ML framework
based on SciPy [36], which includes lots of packages for scientific computing in
Python. The framework has excellent support for traditional ML applications like
classification, clustering and dimensionality reduction. Open-source computer vision
(OpenCV) is a library providing ML algorithms and is mainly used in the field of
computer vision [9]. OpenCV is implemented in C++. However, it provides APIs
in other languages like Java, Python, Haskell and many more. Apache Flink, the
popular distributed stream processing platform provides ML APIs in the form of a
library called FlinkML. The application designed using these APIs will run inside
the Flink execution environment. Another prominent open-source library is Weka.
These are some of the most widely used libraries, and listing all the available
ML libraries is beyond the scope of this paper. We, therefore, invite interested
readers to refer to [27] for more comprehensive information about ML and DL
libraries.

2.3 Flow-Based Programming

Flow-Based Programming (FBP) is a programming paradigm invented by J. Paul
Rodker Morrison in the late 1960s [26]. It is an approach to develop applications
where program steps communicate with each other by transmitting streams of
data. The data flow is unidirectional. At one point in time, only one process can
work on data. Each process/component is independent of others, and hence many
flows of operation can be generated with different combinations of the compo-
nents. The components are responsible only for the input data that they consume.
Therefore, the input/output data formats are part of the specifications of the com-
ponents in FBP. The components act as software black boxes and are as loosely
coupled as possible in the FBP application which provides the flexibility to add
new features without affecting the existing network of FBP components.
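
The following sketch illustrates what such a black-box component contract could look like in Java; the interface and port names are illustrative choices of ours rather than part of any particular FBP framework:

import java.util.Map;

// Minimal sketch of an FBP-style component contract (illustrative, not a specific
// framework): each component declares the data it consumes and produces and is
// otherwise a black box that can be rewired freely into different flows.
public interface FlowComponent {

    // Named input ports this component expects, e.g. a DataFrame of extracted features.
    Map<String, Class<?>> inputPorts();

    // Named output ports it emits for its downstream neighbour.
    Map<String, Class<?>> outputPorts();

    // Process one packet of data: read from the inputs, produce the outputs.
    Map<String, Object> process(Map<String, Object> inputs);
}

Because each component only knows its own input and output ports, the same component can be reused in many different flows without modification.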

2.4 Model-Driven Software Development

MDSD abstracts away the domain-specific implementation from the design of the software systems [32]. The levels of abstraction in a model-driven approach help to communicate the design, scope, and intent of the software system to a broader audience, which increases the overall quality of the system. The models in MDSD are abstract representations of real-world things that need to be understood before building a system. These models are transformed into platform-specific implementations through domain modelling languages. MDSD can be compared to the transformation of a high-level programming language into machine code. MDSD often involves transforming the model into text, which is popularly known as code generation. There are different kinds of widely used code-generation techniques, like templates and filtering, templates and meta-model, code weaving and API-based code generation, among others [32]. API-based code generators are the simplest and the most popular. These simply provide an API with which the elements of the target platform or language can be generated. They depend on the abstract syntax of the target language and are always tied to that language. To generate target code in a new language, we need new APIs working on the abstract syntax of the new target language.
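
As a minimal sketch of API-based code generation, the snippet below uses JavaPoet, the API-based Java code generator later employed by our prototype (Section 6.1), to emit a trivial Java class; the class and package names are illustrative only:

import com.squareup.javapoet.JavaFile;
import com.squareup.javapoet.MethodSpec;
import com.squareup.javapoet.TypeSpec;
import javax.lang.model.element.Modifier;

// Minimal sketch of API-based code generation with JavaPoet: the generator works on
// the abstract syntax of the target language (Java) instead of concatenating strings.
public class ApiBasedGenerationSketch {
    public static void main(String[] args) throws Exception {
        MethodSpec main = MethodSpec.methodBuilder("main")
                .addModifiers(Modifier.PUBLIC, Modifier.STATIC)
                .returns(void.class)
                .addParameter(String[].class, "args")
                .addStatement("$T.out.println($S)", System.class, "generated code")
                .build();

        TypeSpec generatedClass = TypeSpec.classBuilder("GeneratedDriver")
                .addModifiers(Modifier.PUBLIC, Modifier.FINAL)
                .addMethod(main)
                .build();

        // Emit the generated Java source; a real generator would write it to disk and
        // compile it into the runnable driver program.
        JavaFile.builder("com.example.generated", generatedClass).build().writeTo(System.out);
    }
}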

3 Related work

The literature hardly indicates any significant research work on supporting graphical ML programming at a higher level of abstraction while simultaneously explaining the programming concepts necessary for it. Nevertheless, there are a number of relevant works in the literature as well as products in the market which support high-level ML programming, like WEKA [16], Azure Machine Learning Studio [38], KNIME [5], Orange [11], BigML [6], mljar [25], RapidMiner [17], Streamanalytix [33], Lemonade [31] and Streamsets [34], among others. Out of these, only Streamanalytix, Lemonade and Streamsets specifically deal with Spark ML. We compare our conceptual approach with these solutions in Section 8.

4 Apache Spark

4.1 APIs

A Resilient Distributed Dataset (RDD), an immutable distributed dataset [39],
is the primary data abstraction in Spark. To manipulate data stored within the
Spark runtime environment, Spark provides two kinds of APIs. The first is a
coarse-grained transformation applied directly on RDDs using function handlers
like map, filter or groupby, among others. This involves writing custom low-level
data transformation functions and invoking them via the handler functions. To
simplify this kind of data processing, Spark introduced a layer of abstraction on
top of RDD called a DataFrame [4]. A DataFrame is essentially a table or two-
dimensional array-like structure with named columns. DataFrames can be built
from existing RDDs, external databases or tables. RDDs can work on both struc-
tured and unstructured data while DataFrames strictly require either structured
or semi-structured data. It is easier to process data using named columns
than to work with the data directly. The second set of APIs is the DataFrame-based
APIs which work on DataFrames and perform data transformation. These APIs
take one or more parameters for fine-tuning their operation. A further improve-
ment over the DataFrame based APIs is the Dataset APIs of Spark which are
strictly typed in comparison to the untyped APIs of DataFrames.
The RDD-based APIs operating at low-level provide fine-grained control over
data transformation. However, DataFrame and DataSet APIs offer a higher level
of abstraction in comparison to RDD-based APIs. The APIs are domain-specific
and developers do not have to write custom data transformation functions to use
them.
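
The following minimal sketch contrasts the two styles; the file name and the column name (elevation) are illustrative assumptions, not part of any specific dataset:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ApiComparisonSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("api-comparison").master("local[*]").getOrCreate();

        // Coarse-grained RDD transformation: a custom function handed to filter().
        JavaRDD<String> lines = spark.read().textFile("data.csv").javaRDD();
        JavaRDD<String> highElevation =
                lines.filter(line -> Double.parseDouble(line.split(",")[0]) > 2500.0);

        // Equivalent declarative, DataFrame-based API working on named columns.
        Dataset<Row> df = spark.read().option("header", true).option("inferSchema", true).csv("data.csv");
        Dataset<Row> highElevationDf = df.filter(df.col("elevation").gt(2500.0));

        System.out.println(highElevation.count() + " " + highElevationDf.count());
        spark.stop();
    }
}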

4.2 ML with Spark

Spark supports distributed machine learning via:

Spark MLlib: Spark MLlib, built on top of Spark Core using the RDD abstraction, offers a wide variety of machine learning and statistical algorithms. It supports various supervised, unsupervised and recommendation algorithms. Supervised learning algorithms include decision trees, random forests and support vector machines, while unsupervised learning algorithms include k-means clustering.

Spark ML: Spark ML is the successor of Spark MLlib and has been built on top of Spark SQL using the DataFrame abstraction. It offers Pipeline APIs for easy development, persistence and deployment of models. Practical machine learning scenarios involve different stages, with each stage consuming data from the preceding stage and producing data for the succeeding stage. Operational stages include transforming data into the appropriate format required by the algorithm, converting categorical features into continuous features etc. Each operation involves invoking declarative APIs which transform a DataFrame based on user inputs [24] and produce a new DataFrame for use in the next operation. Hence, a Spark ML pipeline is a sequence of stages consisting of either a transformer or an estimator. A transformer transforms one DataFrame into another, often via operations like adding or modifying columns in the input DataFrame, while an estimator is trained on the input data and outputs a transformer, which can be a ML model.
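
A minimal sketch of such a pipeline, assuming a training DataFrame with numeric feature columns and a labelCol column (the column names and stage choices are illustrative):

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class PipelineSketch {
    public static PipelineModel buildAndTrain(Dataset<Row> trainingData, String[] featureColumns) {
        // Transformer: adds a "features" vector column to the input DataFrame.
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(featureColumns)
                .setOutputCol("features");

        // Estimator: fits on the assembled DataFrame and yields a transformer (the model).
        DecisionTreeClassifier classifier = new DecisionTreeClassifier()
                .setLabelCol("labelCol")
                .setFeaturesCol("features");

        Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{assembler, classifier});
        return pipeline.fit(trainingData);
    }
}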

4.3 Design choice: API selection

We have chosen DataFrame-based APIs for supporting flow-based ML programming because the RDD-based APIs operate at a low level and necessitate the usage of custom-written functions for performing data transformations. This makes it challenging to come up with a generic method to parse a flow and ascertain if the components are connected in a way which leads to the generation of always-compilable target code. Additionally, it would demand strict type checking to be introduced at the tool level to facilitate the creation of such flows. The other kind of API is the Dataset-based APIs. These detect syntax as well as analytical errors
during compile time. Nonetheless, these are too restrictive and would render the
conceptual approach described in this manuscript non-generic in nature and tied
to specific use-cases only. In comparison to these, the DataFrame based APIs are
untyped APIs which provide columnar access to the underlying datasets. These
are easy to use domain-specific APIs and detect syntax errors at compile time. The
analytical errors which might creep in during usage of such APIs can be avoided
by designing checks to ensure that the named columns which a specific API is try-
ing to access are indeed received in its input. Additionally, the Spark ML library
(version 2.4.4) has been updated to use DataFrame APIs. Hence, it is an ideal
choice to use DataFrame APIs over RDD APIs.
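
A minimal sketch of such a named-column check (illustrative, not the prototype's exact implementation):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ColumnChecks {
    // Returns true only if every column the next API will access is present in the
    // DataFrame produced by the preceding component.
    public static boolean hasRequiredColumns(Dataset<Row> df, List<String> requiredColumns) {
        List<String> available = Arrays.asList(df.columns());
        return available.containsAll(requiredColumns);
    }
}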

5 Conceptual approach

The conceptual approach to enable non-programmers to easily specify a ML program graphically at a higher level of abstraction makes use of the principles of FBP and MDSD. It consists of several distinct steps, as discussed below:

5.1 Design of modular components

A ML flow consists of a set of connected components. Hence, we need to design the
foundational constituents, or components, first to realise flow-based ML program-
ming. A component mostly does one data processing step in ML model creation
and internally abstracts a specific API for delivering that functionality. However,
modelling every single Spark ML API as a different component would defeat the
very purpose of abstraction. With such a design, the components would have a
one-to-one correspondence with the underlying APIs and programming syntax of
the ML framework making it harder for less-skilled programmers to comprehend.
Hence, we introduce our first design choice by grouping several APIs as one compo-
nent such that the component represents a data processing operation at high-level
and is understandable to end-users. Moreover, the abstracted APIs are invoked in
an ordered fashion within the component to deliver the functionality. The different
parameters accepted by the APIs used inside a component are made available on
the front-end as component property which the user can configure to fine-tune the
operation of the component. The essential parameters of the APIs are initialised
with acceptable default settings unless overridden by the user of the component.
The second design choice is making the components as loosely coupled to each
other as possible while achieving tight functional cohesion between the APIs used
within a single component. This leaves space for a future extension where we
can introduce new Spark ML APIs as components without interfering with the
existing pool of components. To achieve this, we clearly define the input and output
interface of every component and define positional hierarchy rules for each one of
them. These rules help to decide whether a component can be used in a specific
position in a flow or not. The flow-validation step (discussed in Section 5.2) checks
this to ensure that the order in which the components are connected or rather the
sequence in which the APIs are invoked would not lead to compile-time errors.
The third design choice introduces additional abstraction by hiding away parts of the ML application which are necessary for it to compile and run and which developers write when coding from scratch, but which do not directly correspond to the data processing logic of the application. An example would be the
code responsible for initialising the Spark session or closing it within which the
remaining data processing APIs get invoked. Another example can be configuring
the Spark session like setting the application name, configuring the running envi-
ronment mode (local or cluster), specifying the driver memory size and providing
the binding address among many others. We handle these aspects at the back-end
to enable the end-user of such a tool to focus solely on the business logic or data
processing logic of the application. The code-generator, running at the back-end
(discussed in Section 5.3), is responsible for adding such crucial parts and initialis-
ing required settings with sensible defaults to the final code to make it compilable.
Nevertheless, the default settings can be overridden by the user from the front-
end. For example, the “Start” component (discussed in Section 5.2) is a special
component used by the user to mark the start of the flow which can be configured
to fine-tune and explicitly override different property values of the Spark session
as discussed above.

5.2 Flow specification and Flow-checking

As a second step, we take the components (discussed in Section 5.1) and make them
available to the end-user via a programming tool. The programming tool must have
an interactive graphical user interface consisting of a palette, a drawing area called
canvas, a property pane and a message pane. The palette contains all available
modular components which can be composed in a flow conforming to some standard
flow composition rules. The actual flow composition takes place on the canvas
when the user drags a component from the palette and places it on the canvas.
The component, when present on the canvas and when selected, should display
all its configurable properties in the property pane. The user can override default
settings and provide custom settings for the operation of a component via this
pane. The flow, while being composed on the canvas, is captured, converted into a
directed acyclic graph (DAG) and checked for the correct order of the connected
components. The flow-checking should be done whenever there is a state change.
A state change occurs whenever something changes on the canvas, for example, when a
new component is dragged onto the canvas from the palette or when a connection between
two already present components changes, among other such possibilities.
Flow-checking checks the user flow for any potential irreconcilability with the
compositional rules. Such a flow when passed to the back-end for code generation
will generate target code which does not produce any compile-time errors. The
components are composable in the form of a flow if they adhere to the following
compositional rules:

1. A flow is a DAG consisting of a series of connected components.


2. It starts with a particular component called the “Start” component and ends
with a specific component called the “Save Model” component.
3. Every component has its input and output interface adequately defined, and
a component can be allowed to be used in a specific position in a flow if it is
compatible with the output of its immediate predecessor and if it is permitted
   at that stage of processing. For example, the application of the ML algorithm
   is only possible after feature extraction. Hence, the components corresponding
   to the ML algorithm and evaluation must be connected after the feature
   extraction component.

When the flow is marked complete and flow-checking has completed success-
fully, it is passed to the back-end to start the code generation process. Fig. 1
and Fig. 2 depict the flow-checking sequences for a typical ML flow leading to
passing and failure of the process, respectively. In the figures, it is assumed that the flow-checking begins after the flow is complete, for ease of illustration. Nonetheless, it is done whenever a change is detected on the canvas.
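
The sketch below illustrates, with an illustrative rule set rather than the prototype's actual one, how such positional-hierarchy checking over a (linearised) flow could be implemented:

import java.util.List;
import java.util.Map;
import java.util.Set;

public class FlowChecker {
    // Illustrative positional rules: for each component type, the component types that
    // may directly precede it in a flow. The real rule set would be derived from the
    // input/output interfaces of the implemented components.
    private static final Map<String, Set<String>> ALLOWED_PREDECESSORS = Map.of(
            "Start", Set.of(),
            "FeatureExtraction", Set.of("Start"),
            "DecisionTree", Set.of("FeatureExtraction"),
            "SaveModel", Set.of("DecisionTree"));

    // Checks a linear flow (a path through the DAG) against the positional rules and
    // reports the first violation so the front-end can give feedback to the user.
    public static String check(List<String> flow) {
        if (flow.isEmpty() || !flow.get(0).equals("Start")) {
            return "A flow must begin with the Start component.";
        }
        for (int i = 1; i < flow.size(); i++) {
            String current = flow.get(i);
            String predecessor = flow.get(i - 1);
            Set<String> allowed = ALLOWED_PREDECESSORS.getOrDefault(current, Set.of());
            if (!allowed.contains(predecessor)) {
                return current + " cannot be used directly after " + predecessor + ".";
            }
        }
        if (!flow.get(flow.size() - 1).equals("SaveModel")) {
            return "A flow must end with the Save Model component.";
        }
        return "Flow is valid.";
    }
}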

[Figure 1 shows a valid user flow (Start, Feature Extraction, Decision Tree, Save Model) being converted into a DAG and then checked component by component; every component passes the check. Legend: not checked yet, currently being checked, checked & passed, checked & failed.]

Fig. 1: Flow-checking sequences for a valid ML flow

[Figure 2 shows an invalid user flow (Start, Decision Tree, Feature Extraction, Save Model) being converted into a DAG and checked; the check fails at the Decision Tree component because it can be used only after feature extraction, and the user is notified of the error.]

Fig. 2: Flow-checking sequences for an invalid ML flow



5.3 Model generation

The third step deals with the parsing of the user flow and generating the target code. The back-end consists of a parser, an API library and a code generator. The parser is responsible for parsing the received flow and representing it in an intermediate representation, typically in the form of a DAG. Next, it traverses the DAG to ascertain the type of each component used and checks the library for the corresponding method implementations. In MDSD, there are many code generation techniques, but we have used the API-based code generation technique because it is simple to use and serves our purpose well. The only downside is that in the API-based code generation technique, the API can generate code only for a specific platform or language, as it inherently depends on the abstract syntax of the target language to function. In our case, the API-based code generation technique is restricted to generating Java code. The method implementations for every component in the library contain statements or specific APIs to generate the particular Spark API which is represented by the component on the front-end. The DAG or intermediate flow representation, along with the information of which method to invoke from the library for each vertex, is passed to the code generator. The code generator has extra APIs to generate the necessary but abstracted portions of the ML code, like the start of a Spark session and the inclusion of required libraries. Then, it invokes the specific method implementation from the library for each vertex, i.e. invokes the APIs contained within them to generate the target Java Spark API to be used in the final target code. When the code generator does this for all the vertices of the DAG, the target Spark ML code in Java is generated. The final code is compiled and packaged to create the runnable Spark ML driver program. The driver program is the artefact which is sent to a Spark cluster for execution.
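
A hypothetical shape of such an assembled driver, in which only the abstracted session handling is spelled out (the application name, master and paths are illustrative defaults, not the prototype's actual output):

import org.apache.spark.sql.SparkSession;

public class GeneratedDriverSketch {
    public static void main(String[] args) {
        // Abstracted boilerplate added by the code generator: session creation with
        // sensible defaults, overridable through the "Start" component's properties.
        SparkSession spark = SparkSession.builder()
                .appName("generated-ml-application")
                .master("local[*]")
                .getOrCreate();

        // The methods generated for the individual flow components (for example the
        // featureExtraction() and decisionTree() methods shown in Listings 1 and 2 of
        // Section 7.1) are invoked here in the order in which the components were
        // connected in the user flow, and the resulting model is saved to the path
        // configured in the Save Model component.

        // Abstracted boilerplate added by the code generator: closing the session.
        spark.stop();
    }
}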

5.4 Model evaluation and hyperparameter tuning

The final step is used to test the generated model on the testing dataset to check the prediction accuracy or performance of the model. The user may have specified different model parameters to tune during flow specification. If so, the code generator generates different ML models, each pertaining to a unique model parameter value. All the models are tested, and the best-performing ML model is saved for final usage. For example, consider a user designing a ML model using the k-means algorithm who has supplied a range of values for k as a parameter while specifying the flow, where k is the number of clusters the user wants to create from the input dataset. In this case, the code generator generates different k-means models for the range of k values and evaluates the compute cost for each k-means model. Compute cost is one of the ways to evaluate k-means clustering using Spark ML. The system displays the k values and the corresponding model evaluation scores in ascending order for the user to select from, or it goes with the best-performing model. The goal of the hyperparameter tuning in the case of the k-means algorithm is to find the optimal k value for the given dataset. If the user provides only one k value, then the code generator generates only one model.
Fig. 3 illustrates the conceptual approach to generate ML code after the steps of user flow specification and flow-checking have been completed. Step 1 in the figure corresponds to ideas discussed in Section 5.2, while steps 2-5 correspond to Section 5.3. Finally, step 6 corresponds to ideas discussed in Section 5.4.

[Figure 3 depicts the generation pipeline: (1) a valid user flow (Start, Feature Extraction, Decision Tree, Save Model) is passed to the parser; (2) the parser builds the intermediate DAG representation; (3) every vertex of the DAG corresponds to one module in the API library, which contains the API statements needed to generate its target code; (4) the modules execute as many times as the range of model parameters specified in the user flow, producing target code with different initialised parameters, and all such generated codes are assembled into a set of final target codes; (5) all assembled codes are compiled into a set of runnable JARs (ML models); (6) all ML models are validated against test data to select the best ML model.]

Fig. 3: Generation of a ML model from a graphical user flow

6 Realisation

In this section, we describe the realisation of the conceptual approach described in
Section 5 in the form of a graphical programming tool. In particular, we describe
the functioning of the tool and its architecture.

[Figure 4 shows the prototype's GUI: (1) the palette containing the ML components, (2) the canvas where components are connected to form a ML flow, and (3) the button to start code generation.]

Fig. 4: Interactive GUI of the prototype with palette and canvas



6.1 Prototype Overview

We have implemented a graphical programming tool with Spring Boot [37]. The
application comes with a minimalistic interactive graphical user interface (GUI)
built using Angular [14, 15, 12], a popular open-source web application framework.
Fig. 4 shows the GUI of the prototype with its palette containing available ML
components. It also has a canvas where the components are dragged and connected
in the form of a flow and finally, a button to start the code generation process.
Whenever a ML component is dragged onto the canvas, a property pane opens
up allowing the user to configure its various parameters to fine-tune its operation
by overriding the default values. Additionally, when something changes on the
canvas, the entire flow is checked to ensure its correctness, and in case of any
error, the user is provided feedback on the GUI. Fig. 5 depicts these aspects of
the prototype. On pressing the button, the flow configuration is captured in JavaScript Object Notation (JSON) and sent to the Spring Boot back-end via REST APIs. It is converted into a Data Transfer Object (DTO) before processing.
Fowler describes such a DTO as “an object that carries data between processes to
reduce the number of method calls.” [13]. At the back-end, the user flow is parsed,
converted into an intermediate representation and passed to the code generator.
The code generator auto-generates the target Spark ML application with the help
of JavaPoet, an API-based Java code generator [1]. We have provided three video
files demonstrating use case 7.1, use case 7.2 and use case 7.3 respectively as
supplementary material with this work.
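
A hypothetical sketch of what such a flow DTO could look like (the field names are illustrative and not the prototype's actual JSON schema): each node carries the component type and its user-configured parameters, and each edge records a source/target pair.

import java.util.List;
import java.util.Map;

// Illustrative DTO shape for a flow received from the front-end.
public class FlowDto {
    public static class NodeDto {
        public String id;
        public String componentType;            // e.g. "FeatureExtraction", "DecisionTree"
        public Map<String, String> parameters;  // user-supplied property values
    }

    public static class EdgeDto {
        public String sourceId;
        public String targetId;
    }

    public List<NodeDto> nodes;
    public List<EdgeDto> edges;
}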

[Figure 5 shows (1) the property pane used to configure a component's properties and (2) the flow-checking feedback presented to the user on the GUI.]

Fig. 5: Component property configuration and flow-checking with user feedback

6.2 Components

Components are the essential constituents of a flow. We have implemented around seven components relevant to our use cases, some of which are generic and can be used in all three use cases. A typical ML flow consists of four
components. The first component marks the start of the flow and is responsible for generating the code related to Spark session initialisation. The second component is used to specify and pre-process the input data. It deals with dimensionality reduction, i.e. extracting a reduced set of non-redundant information from the dataset which can be fed to a ML algorithm and is sufficient to solve a given problem. The third component is a training module which houses a specific ML algorithm like decision tree or k-means, among others. The fourth component trains the model, supports hyperparameter tuning and saves the final selected model. Table 1 lists the prototyped components and summarises
their functionalities.

Table 1: Supported graphical ML components with their functionality

SN.  Component Name                      Functionality
1.   Start                               Marks the start of the flow and helps create a Spark session.
2.   Feature Extraction                  Creates a DataFrame from the input data depending on the features selected.
3.   Feature Extraction From Text File   Transforms the input text data to DataFrame(s).
4.   Decision Tree                       Processes the DataFrames to train a decision tree model and evaluates it.
5.   KMeans Clustering                   Creates a k-means model and evaluates it. It works on given hyperparameters to find the most efficient model with minimum prediction error.
6.   Collaborative Filtering             Creates a collaborative filtering model and evaluates it.
7.   Save Model                          Marks the end of the flow and saves the ML model in a specified file path.

6.3 Working

In the back-end, the flow is parsed and represented in the form of a DAG, an intermediate representation format. The back-end traverses all the vertices of the DAG to find out which component maps to which module in the API library. A module in the API library contains API statements written using JavaPoet which, on execution, generate target language statements. The code generator assembles the statements produced by all such modules, initialised with the user-supplied parameters, into the final target code. This code is then compiled and packaged to produce a runnable application. In this entire process, the prototype has many components which interact to accomplish certain functionalities. The system architecture of the prototype, with its major components and their interactions, has been summarised in the form of an Identify, Down, Aid, and Role (IDAR) graph in Fig. 6.

[Figure 6 shows the IDAR graph of the prototype: the front-end subsystem sends the flow (1) to the Controller, which commands the Flow Parser to parse (2) and save (3) the flow, then processes the flow (4) by invoking the Code Generator to generate code (5) through the API modules (Start Spark Session, Feature Extraction, Train Model, Close Spark Session, each executing APIs to generate code) and the Model Generator to generate the model (6); "done" notices (2a, 3a, 6a, 7) travel back upstream, and dotted arrows depict the data flow.]

Fig. 6: IDAR graph of the prototype

IDAR graphs are simpler and more intuitive than traditional Unified Modelling Language (UML) diagrams for comprehending the system structure, hierarchy and communication between components [28]. In an IDAR graph, the objects at a higher level control the objects situated below them. Communication is either via a downstream command message (control message) or an upstream non-command message, which is called a notice. Separate subsystems are denoted by hexagonal boxes. Similarly, a dotted line ending with an arrow depicts the data flow in the system, and an arrow with a bubble on its tail shows an indirect method call. For a comprehensive understanding of IDAR, we refer interested readers to [28] and [19].
The front-end has been depicted as a separate subsystem where the user creates
the flow and it is validated. When the flow is marked as complete, the front-end
sends it via a REST API to the back-end controller. The controller is the main
component which houses all the REST APIs for interaction with the front-end,
invoking other components and coordinating the whole process of code generation.
On receiving a flow from the front-end, it invokes the parser which creates a DAG
out of it, saves it in the local database, traverses all the vertices to find out which
specific modules in the API library must be invoked for code generation. It passes
this information back to the controller which invokes the code-generator. The code-
generator invokes the necessary modules in the API library in the order of their
connection in the user flow. These modules on execution produce target language
statements which are then initialised with user supplied parameters. This is done
for all the components the user has connected in the flow. All the generated target
statements are assembled in the same order in which the modules were invoked
which is the final target code. This code is passed back to the controller. Then,
the controller invokes the model generator which takes the final code compiles and
Flow-based Programming for Machine Learning 15

packages it into a runnable Spark ML application. If the user had supplied a range
of parameters for the ML model, then the code generator invokes the modules in
the APIs and initialise the target statement with different set of parameters which
leads to the production of a series of final target codes. Accordingly, the model
generator compiles the entire set of generated codes to produce a set of runnable
Spark applications. All of the generated applications are evaluated on test data for
prediction accuracy. The best performing application/ML model is selected from
the whole set.

7 Running Examples

In this section, we discuss three running examples/use cases to capture the repli-
cability of the automatic code-generation from graphical Spark flows, which is
the quintessence of our conceptual approach. The running examples are based on
three ML algorithms, namely- Decision Tree, K-Means Clustering and Collabo-
rative Filtering. The goal is to demonstrate that the end-user is able to create a runnable Spark program based on the ML algorithms mentioned without having to understand the details of the Spark ML APIs or requiring any programming skills. Nevertheless, the user is expected to know the dataset on which a ML algorithm is to be applied. Additionally, the user is expected to know the label column and the feature columns of the dataset.

7.1 Use Case 1: Predicting forest cover with decision trees

The first use case involved creating a ML Spark application based on the Decision
tree algorithm. This supervised ML technique splits the input data depending on
the model parameters to make a decision. We have used the covtype dataset in
this example [35].
The dataset is available online in compressed CSV format. The dataset reports
types of forest covering parcels of land in Colorado, USA. The dataset features
describe the parcel of land in terms of its elevation, slope, soil type and the known
forest type covers. The dataset has 54 features to describe the pieces of land and
581,012 examples. The forest covers have been categorised into seven different
cover types. The categorised values range from 1 to 7. The model, trained using the 54 features of the dataset and the labelled data, should learn to predict the forest cover type. The data is already structured, and therefore we have used it directly as input in our example.
The end-user drags various graphical components from the palette to the canvas
and connects them in the form of a flow, as shown in Figure 7. The specifics of the
application like the input dataset, label column, ML model parameters are taken
as inputs from the user. Part 3A in Figure 7 depicts the parameters required to
create a decision tree model. Internally the parameters are specified to make calls
to the Spark ML Decision Tree APIs. The flow creation specifications are sent
as a JSON object to the back-end system. The decision tree model parameters
comprise impurity, depth of the tree and max bins. The model parameters are
crucial to the performance of the model. The impurity parameter minimises the
probability of misclassification of the decision tree classifier.
[Figure 7 shows (1) the graphical flow composed to create the decision tree model and (3A) the configuration pane with the parameters required for the decision tree model.]

Fig. 7: Graphical flow to create a decision tree model for forest cover prediction

The Spark ML library provides two impurity measures for classification, namely entropy and Gini. The impurity is set on the estimator by using the setImpurity function provided by the DecisionTreeClassifier Spark API. Similarly, max bins and max depth also play a vital role in finding an optimal decision tree classifier. The end-user can tweak these settings until the desired outcome is achieved without having to understand the internals of the Spark ML APIs or updating the code manually. This process of trying out different model parameters is called hyperparameter tuning. Listing 1 shows the feature extraction function auto-generated for our use case on the covtype data. Listing 2 shows the automatically generated code of the decision tree function, which includes the estimator and transformer steps of a ML Spark flow.
public static Dataset<Row> featureExtraction(SparkSession spark,
    String filePath, String labelColName) {
  Dataset<Row> df = spark.read()
      .option("header", false)
      .option("inferSchema", true)
      .csv(filePath);

  for (String c : df.columns()) {
    df = df.withColumn(c, df.col(c).cast("double"));
  }
  df = df.withColumnRenamed(labelColName, "labelCol");
  df = df.withColumn("labelCol", df.col("labelCol").minus(1));
  return df;
}

Listing 1: Generated feature extraction code for use case 1

public static DecisionTreeClassificationModel decisionTree(String impurity,
    int depth, int maxBins, Dataset<Row> elbyuvroqk) {
  DecisionTreeClassifier pscsjndfqw = new DecisionTreeClassifier()
      .setLabelCol("labelCol")
      .setFeaturesCol("features")
      .setMaxDepth(depth)
      .setImpurity(impurity)
      .setMaxBins(maxBins);
  DecisionTreeClassificationModel rwljasxoxf = pscsjndfqw.fit(elbyuvroqk);
  return rwljasxoxf;
}

Listing 2: Automatically generated decision tree function

7.2 Use Case 2: Anomaly detection with K-means clustering

The second use case deals with the creation of a ML application involving K-Means. The K-Means algorithm partitions the data into several clusters. Anomaly detection is often used to detect fraud, unusual behaviour or attacks in a network. Unsupervised learning techniques are suitable for this kind of problem as they can learn the underlying patterns. We have used the dataset from the KDD Cup 1999. The data constitutes network packet data and contains 38 features, including a label column. We did not need the label data for applying K-Means, and therefore we removed it in the feature extraction module. The graphical flow to create the Spark K-means application is depicted in Figure 8. The example model parameters necessary to design the model are depicted in part 3A of Figure 8. The data for the use case contained String columns. We have removed those during the feature extraction phase as the KMeans algorithm cannot process them. Additionally, the String columns would cause a runtime error as VectorAssembler only supports numeric, boolean and vector types. The code for this operation is auto-generated when the user inputs the String column names in the feature extraction stage of the flow, as depicted in part 2A of Figure 8.
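
A minimal sketch of such a feature-extraction step for k-means (illustrative, not the verbatim generated code), which drops the user-specified String columns and assembles the remaining numeric columns with VectorAssembler; it assumes the label column has already been removed or is listed among the columns to drop:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class KMeansFeatureExtractionSketch {
    // Drops the user-specified String columns and assembles the remaining numeric
    // columns into a single "features" vector column, since VectorAssembler cannot
    // handle String input.
    public static Dataset<Row> extractFeatures(Dataset<Row> df, List<String> stringColumns) {
        for (String column : stringColumns) {
            df = df.drop(column);
        }
        List<String> featureColumns = new ArrayList<>(Arrays.asList(df.columns()));
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(featureColumns.toArray(new String[0]))
                .setOutputCol("features");
        return assembler.transform(df);
    }
}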

[Figure 8 shows the graphical flow composed to create the k-means clustering model, the feature extraction configuration for k-means (part 2A), the property pane for configuring the k-means model parameters (part 3A) and the running output of the generated k-means Spark ML application (part 5).]

Fig. 8: Graphical flow to create a k-means clustering model for anomaly detection

The TrainModel component takes a range of K values, where K is the number of clusters that the user wants to create from the input dataset. The system generates code to create KMeans models for the range of K values and stores the compute cost of each K-means model evaluation in ascending order. Compute cost is one of the ways to evaluate K-Means clustering using Spark ML. If the end-user provides only one K value, the system generates code for only one model. The output of the K-Means Spark application with automatic hyperparameter tuning is shown in part 5 of Figure 8. The figure indicates the different K values and their corresponding model evaluation scores. The goal of the hyperparameter tuning in the case of the K-Means algorithm is to find the optimal K value for the given dataset. Listing 3 shows the code generated by the back-end system for the graphical flow composed on the front-end.
public static KMeansModel kMeansClustering(String initMode, int lowK, int highK,
    int maxIter, double distanceThreshold, int step, Dataset<Row> oiosnsutcx) {
  Map<Integer, Double> degklrfhyb = new LinkedHashMap<Integer, Double>();
  Map<Integer, KMeansModel> mnqoxuuyhr = new LinkedHashMap<Integer, KMeansModel>();
  for (int iter = lowK; iter <= highK; iter += step) {
    KMeans pjtnfmyssi = new KMeans().setFeaturesCol("features")
        .setK(iter)
        .setInitMode(initMode)
        .setMaxIter(maxIter)
        .setTol(distanceThreshold)
        .setSeed(new Random().nextLong());
    KMeansModel okvhgrwqrk = pjtnfmyssi.fit(oiosnsutcx);
    // Evaluate clustering.
    Double ujyyyccvqv = okvhgrwqrk.computeCost(oiosnsutcx);
    degklrfhyb.put(iter, ujyyyccvqv);
    mnqoxuuyhr.put(iter, okvhgrwqrk);
    System.out.println("*******Sum of Squared Errors = " + ujyyyccvqv);
  }
  Map<Integer, Double> vqtkeniici = degklrfhyb.entrySet()
      .stream()
      .sorted(comparingByValue())
      .collect(toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e2,
          LinkedHashMap::new));
  Integer hwnnwdbhqd = vqtkeniici.entrySet().stream().findFirst().get().getKey();
  KMeansModel okvhgrwqrk = mnqoxuuyhr.get(hwnnwdbhqd);
  System.out.println("*******Optimum K = " + hwnnwdbhqd);
  System.out.println("*******Error with Optimum K = " + degklrfhyb.get(hwnnwdbhqd));
  return okvhgrwqrk;
}

Listing 3: Function generated for applying the KMeans algorithm

7.3 Use Case 3: Music recommender application

The third use case demonstrates the application of the Spark collaborative filtering APIs. The use case exercises the code generation for loading data from text files and creating a custom "Rating" class for the input text file. The aim is to ensure that the generated application loads the input text file using the "Rating" class, then trains and evaluates the recommender model according to the end-user's preferences. For this use case, we have used data published by Audioscrobbler, the first music recommendation system for last.fm. The input data includes three text files: user_artist_data.txt, artist_data.txt and artist_alias.txt. The primary data file is user_artist_data.txt, which contains the user ID, artist ID and play count. The artist_data.txt file maps each artist ID to the artist's name. The artist_alias.txt file maps misspelt artist IDs to the correct artist IDs. The data contains implicit feedback; it does not contain any direct rating or feedback data from users.
Collaborative filtering (CF) is a model-based recommendation algorithm. The CF algorithm finds the hidden or latent factors of the users' preferences from the users' history data. The current dataset is a good fit for a collaborative filtering application, as we do not have any other information about users or artists. This type of data is sparse. The missing entries in the user-artist association matrix are learnt using the alternating least squares (ALS) algorithm. At the moment, the Spark ML library supports only model-based collaborative filtering. We have used the Spark ML ALS estimator to train the recommendation system. Spark ML collaborative filtering requires the user to develop a Rating Java class for parsing the main input data file while loading the raw data into a DataFrame. The Rating class is responsible for casting the raw data to the respective types, and in our case, we also need the implementation to map the misspelt artist IDs to the correct IDs. The spark.ml package also provides an API to set the cold start strategy for NaN (Not a Number) entries in the user data to mitigate the cold start problem. It is quite normal to have missing entries in such data; for example, a user may never have rated a song, so a model cannot learn about that user in the training phase. The dataset used for the evaluation may also have entries for users that are missing in the training dataset. These problems are known as the cold start problem in recommender system design.
[Figure 9 shows (1) the graphical flow composed to create the collaborative filtering model, the feature extraction configuration for collaborative filtering (part 2A), the property pane for configuring the collaborative filtering parameters (part 3A) and the running output of the generated collaborative filtering Spark ML application (part 5).]

Fig. 9: Graphical flow to create a collaborative filtering model for music recommendation

The feature extraction technique of the CF ML flow is a bit different from the previous two use cases. The Spark ML implementation (Spark version 2.4.4) requires the system to generate a Rating class file. The Rating file gives better control over the model design. Listing 4 shows the Rating class auto-generated by the back-end system. Figure 9 depicts the assembling of the graphical components required for CF application generation. The CF Spark program generation requires the FeatureExtractionFromTextFile component.
The user can create an application with automatic hyperparameter tuning through the graphical interface shown in part 3A of Figure 9. All the available Spark ALS API parameter settings have been considered in the design of the collaborative filtering module in our system. In the given dataset, the preference of the user is inferred from the data. Hence, the implicit preference box should be checked, or, in the case of sending direct JSON data, the field should contain the value true. Part 3A of Figure 9 depicts how to set the model parameters for the CF application. The model parameters are crucial in making a sound recommender system. The user interface allows setting different parameters for training the model.
@Getter
@Setter
public static class Rating implements Serializable {
  private Integer userId;

  private Integer playcount;

  private Integer artistId;

  public Rating(Integer artistId, Integer playcount, Integer userId) {
    this.artistId = artistId;
    this.playcount = playcount;
    this.userId = userId;
  }

  public static Rating parseRating(String str) {
    String[] fields = str.split(" ");
    if (fields.length != 3) {
      throw new IllegalArgumentException("Each line must contain 3 fields");
    }
    Integer artistId = Integer.parseInt(fields[0]);
    Integer playcount = Integer.parseInt(fields[1]);
    Integer userId = Integer.parseInt(fields[2]);
    Integer finalArtistData = CollaborativeFiltering.artistAliasMap.getOrDefault(artistId, artistId);
    return new Rating(artistId, playcount, userId);
  }
}

Listing 4: Rating class for parsing CF input data
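
A minimal sketch of how the generated application could use this Rating class to load the raw text data into a DataFrame for the ALS estimator; it assumes the Rating class of Listing 4 is nested in the generated CollaborativeFiltering class, and the wiring shown is illustrative rather than the verbatim generated code:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RatingLoadingSketch {
    // Loads user_artist_data.txt, parses each line through Rating.parseRating and
    // converts the result into a DataFrame whose columns feed the ALS estimator.
    public static Dataset<Row> loadRatings(SparkSession spark, String path) {
        JavaRDD<CollaborativeFiltering.Rating> ratings = spark.read()
                .textFile(path)
                .javaRDD()
                .map(CollaborativeFiltering.Rating::parseRating);
        return spark.createDataFrame(ratings, CollaborativeFiltering.Rating.class);
    }
}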

The code generation for training the model using ALS depends on the input parameters from the end-user. Table 2 shows the available options for the end-user, and Listing 5 shows how the parameters are set for the ALS algorithm in Spark internally.
Table 2: ALS algorithm parameters configurable by the end-user

SN. | Parameter Name | Purpose
1.  | numBlocks      | Parallelises the computations by partitioning the data. Default value is 10.
2.  | rank           | Number of latent factors to be used in the model. Default value is 10.
3.  | maxIter        | Maximum number of iterations to run when training the model. Default value is 10.
4.  | regParam       | Regularisation parameter. Default value is 1.0.
5.  | implicitPrefs  | Specifies whether the data contains implicit or explicit feedback. Default value is false, which means explicit feedback.
6.  | alpha          | Only applicable when implicitPrefs is set to true. Default value is 1.0.
7.  | userCol        | Sets the user column name of the input data in the ALS algorithm.
8.  | itemCol        | Sets the item column name of the input data in the ALS algorithm; in our use case, this is artistId.
9.  | ratingCol      | Sets the rating column name of the input data in the ALS algorithm.

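The grids that Listing 5 iterates over (ranks, regParams and alphas) correspond to the candidate values the end-user supplies in part 3A of Figure 9. A minimal sketch of how the generated program might declare them is shown below; the concrete values are illustrative placeholders, not system defaults:

import java.util.Arrays;
import java.util.List;

// Candidate hyper-parameter values taken from the user's flow configuration.
List<Integer> ranks = Arrays.asList(8, 10, 12);
List<Double> regParams = Arrays.asList(0.01, 0.1, 1.0);
List<Double> alphas = Arrays.asList(1.0, 40.0);
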
// rest of the code
for (Integer rank : ranks) {
    for (Double regParam : regParams) {
        for (Double alpha : alphas) {
            ALS als = new ALS()
                    .setMaxIter(10)
                    .setAlpha(alpha)
                    .setRegParam(regParam)
                    .setImplicitPrefs(true)
                    .setRank(rank)
                    .setUserCol("userId")
                    .setItemCol("artistId")
                    .setRatingCol("count");
            ALSModel alsModel = als.fit(training);
            alsModel.setColdStartStrategy("drop");
            Dataset<Row> predictions = alsModel.transform(test);
            RegressionEvaluator evaluator = new RegressionEvaluator()
                    .setMetricName("rmse") // rmse, mse, mae, r2
                    .setLabelCol("count")
                    .setPredictionCol("prediction");
            Double rmse = evaluator.evaluate(predictions);
            System.out.println("Hyper Params = (" + rank + " " + alpha + " " + regParam + ")");
            System.out.println("Root-mean-square error = " + rmse);
        }
    }
}

Listing 5: Example of ALS model creation and evaluation
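
The generated application stops at model evaluation. For a complete music recommender, the trained ALSModel could additionally be queried for top-N recommendations; the sketch below only illustrates the Spark ML calls that could be appended for this purpose, with the number of recommendations (10) chosen arbitrarily:

// Top-10 artist recommendations for every user known to the model.
Dataset<Row> userRecs = alsModel.recommendForAllUsers(10);
userRecs.show(false);

// Conversely, the top-10 candidate listeners for every artist.
Dataset<Row> artistRecs = alsModel.recommendForAllItems(10);
artistRecs.show(false);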

Spark 2.4.4 offers four metric variants for ALS model evaluation, namely rmse
(Root Mean Square Error), mse (Mean Square Error), mae (Mean Absolute Error) and
r2 (R-Squared). The RegressionEvaluator uses one of these metrics to evaluate the
model; how the metric is set on this API is shown in Listing 5, and the output
of the model evaluation is shown in part 5 of Figure 9. The Spark application
outputs the lowest error together with the corresponding model parameters used for
ALS model creation, and it also reports the error calculated for the other
combinations of parameters sent as input by the end-user. The end-user was thus
able to create a runnable Spark application using a collaborative filtering
algorithm with the given dataset, and could customise the recommender system by
providing different parameters to the flow components, without changing or adding
any source code to the existing system. The following observations summarise the
essence of the running examples:
1. The modular approach to system development helps add new ML functionality
   without affecting the existing behaviour of the system. The end-user can
   design applications from the list of components provided through REST API
   interfaces. The components are independent of each other and interact only
   through the input/output data they exchange, which lets the end-user design
   Spark applications that fit the problem statement (a minimal sketch of such a
   component contract follows this list).
2. The flow-based programming approach hides the underlying Spark implementation
   from the end-users. They do not have to learn the Apache Spark ML library or
   the workings of Spark data abstractions to implement an ML application;
   instead, they customise the Spark application by providing specifications
   through the graphical flow components. The only prerequisite is a sound
   understanding of the input dataset used for training a model.
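
For illustration, one way such a component contract could look in Java is sketched below. The interface name, method signature and parameter map are assumptions made for this sketch and are not part of the implemented system:

import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical contract for a composable flow component: each component consumes
// the DataFrame emitted by its predecessor, applies its configured operation and
// emits a DataFrame for the next component in the flow.
public interface FlowComponent {
    // 'params' carries the user-supplied settings from the component's property pane.
    Dataset<Row> apply(Dataset<Row> input, Map<String, Object> params);
}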

8 Discussion

In this section, we compare the conceptual approach of our flow-based programming
for ML with the existing solutions introduced in Section 3. The comparison
criteria include:
1. Graphical interface: A graphical interface should be intuitive and easy for
   end-users to navigate. A graphical programming interface based on the
   flow-based programming paradigm, with options to customise the automatic code
   generation of the ML application, is an ideal choice for users with limited
   knowledge of data science or ML algorithms. As discussed in Section 3, almost
   all existing tools implement a flow-based graphical interface for creating ML
   models, except Rapidminer, which instead provides a graphical wizard to
   initialise the input parameters. In our approach, we have a list of graphical
   components representing the steps of ML model creation. The drag-and-drop
   feature, together with feedback on incorrect assembly of the components,
   guides the user in submitting a logical flow to the back-end system. The
   implemented graphical interface supports customisation of the components and
   abstracts the underlying technologies used for automatic Spark application
   generation from the user.
2. Target frameworks: The second criterion is the target frameworks over which the
   graphical tools provide a high-level abstraction. Deep Learning Studio offers
   an interface to implement only deep learning models. Both the Microsoft Azure
   and Rapidminer visual platforms support end-to-end automated ML application
   generation, hiding the underlying technology from the user; in addition, the
   Microsoft Azure HDInsight service allows PySpark code snippets to be used
   through the tool's Jupyter notebook interface to run the code on a Spark
   cluster. Lemonade and StreamSets provide a high-level abstraction of Spark ML
   to build ML models, and Lemonade also uses the Keras platform to generate
   models for deep learning applications. By contrast, the Streamanalytix tool
   uses multiple frameworks to create an ML model, such as Spark MLlib, Spark ML,
   PMML, TensorFlow, and H2O. In our solution, we support Spark ML as the target
   framework.
3. Code generation: The third criterion is whether the tool can create a native
   program in the target framework from the graphical flow created by the
   end-user. StreamSets and Microsoft Azure require customised code snippets from
   the end-user to train the ML model; they do not generate any code, and the
   models are loaded directly in their environments. Lemonade and Streamanalytix
   generate native Apache Spark programs, while Deep Learning Studio generates
   Keras code from the graphical flow created by the user; the user can edit the
   generated code and re-run the application. Our conceptual approach likewise
   generates Java source code for the Apache Spark program from the graphical
   flow created by the end-user.
4. Code snippet input from the user: It is desirable for a graphical tool to
   support both expert programmers and non-programmers in creating ML
   applications without having to understand the underlying technology. The
   fourth criterion is therefore whether the tool requires code snippets from the
   user for ML application creation. Typically, a code snippet is required for
   some part of the application generation or for customisation of the program.
   For example, the StreamSets tool provides an extension to add ML features by
   writing customised code in Scala or Python for the pipeline that generates the
   program. Tools like Rapidminer, Lemonade, Deep Learning Studio, and
   Streamanalytix do not require any input code snippet from the user to create
   the ML application. While the Microsoft Azure auto ML feature does not require
   any code snippet from the user, the platform explicitly asks for a code
   snippet to create models that run in the Spark environment. The conceptual
   approach described in this manuscript does not require the user to write any
   code for Spark ML application generation.
5. Data pre-processing: The performance of an ML model depends heavily on the
   quality of the input data. The collected data is usually unstructured and
   requires some processing before the ML algorithms can be applied. Manual
   pre-processing of this data is time-consuming and error-prone, so a visual
   pre-processing feature in an ML tool saves the user considerable time. All the
   tools except Deep Learning Studio and Lemonade have a data pre-processing step
   that supports data cleansing through the graphical interface. Our conceptual
   approach does not include a dedicated data pre-processing step, although it
   does support essential cleansing and feature extraction from the input data.
6. Ensemble learning: The sixth criterion is whether the tool provides ensemble
   learning methods. Ensemble learning is a machine learning technique that
   combines several base models to produce one optimal predictive model, the goal
   being to find the model that best predicts the desired outcome. Tools like
   Streamanalytix, Rapidminer and Microsoft Azure auto ML support ensemble
   learning. This is a limitation of our current approach, as it cannot
   automatically combine several base models to solve a specific use case (a
   minimal Spark ML sketch of such an ensemble learner follows this list).
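
For concreteness, the target framework itself already ships ensemble learners such as random forests, which a future ensemble component of our system could wrap. The sketch below is illustrative only; the column names, the number of trees and the training/test DataFrames are assumptions, and this component is not part of our current system:

import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// A random forest combines many decision trees (base models) into a single predictor.
RandomForestClassifier rf = new RandomForestClassifier()
        .setLabelCol("label")        // assumed label column
        .setFeaturesCol("features")  // assumed feature-vector column
        .setNumTrees(20);            // number of base models to combine
RandomForestClassificationModel rfModel = rf.fit(training);
Dataset<Row> predictions = rfModel.transform(test);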

Table 3 summarises the comparison of our conceptual approach with the ex-
isting solutions for the criteria discussed above.

Tools | Graphical Interface | Target Framework | Code snippet required as input | Code generation for ML program | Includes data pre-processing | Ensemble method supported
Deep Learning Studio | Flow-based GUI | Keras | ✗ | ✓ | ✗ | ✗
Microsoft Azure ML | Flow-based GUI | Spark ML (Python, R) | ✓ (✗ for auto ML) | ✗ | ✓ | ✓ (auto ML), ✗ (Spark application)
Streamanalytix | Graphical wizard-based | Spark ML, H2O, PMML | ✗ | ✓ | ✓ | ✓
StreamSets | Flow-based GUI | Spark ML (Python & Scala) | ✓ | ✗ | ✓ | ✗
Rapidminer | Graphical wizard-based | Unknown | ✗ | ✗ | ✓ | ✓
Lemonade | Flow-based GUI | Spark ML (PySpark) | ✗ | ✓ | ✗ | ✗
Our Solution | Flow-based GUI | Spark ML (Java) | ✗ | ✓ | ✗ | ✗

Table 3: Comparison of existing solutions with our approach to automate ML
application creation

Comments about previous attempts

Previously, we had attempted to support the programming of Spark applications via
the graphical flow-based programming paradigm [20, 21, 22, 23]; that work
culminated in the doctoral dissertation of the first author [20]. The present work
extends it, and the main difference lies in the code-generation technique.
Previously, we used API-based code generation only to produce the basic skeleton
of the Spark application. A library called ‘SparFlo’ [23] was developed, which
contained generic method implementations of various Spark APIs, and code weaving
was used to invoke these generic method implementations inside the basic skeleton
of the target Spark program. This ensured that any graphical programming tool
supporting the SparFlo library would easily support Spark programming.
Nevertheless, any change or update in the Spark libraries required the release of
a new version of the SparFlo library containing the latest generic method
implementations of the Spark APIs. Hence, in this attempt, we rely solely on the
API-based code generation technique, which removes the need for a pre-packaged
implementation of all Spark APIs. It also decouples the approach from a specific
Spark version, since we can now parse a given Spark version independently and
generate the relevant target source code.
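
To make the difference concrete, the fragment below sketches how an API-based code generator can emit the skeleton of a Spark application with JavaPoet [1]. The class and package names are placeholders, and the snippet is a simplified illustration of the technique rather than a reproduction of our generator's actual templates:

import com.squareup.javapoet.JavaFile;
import com.squareup.javapoet.MethodSpec;
import com.squareup.javapoet.TypeSpec;
import javax.lang.model.element.Modifier;
import org.apache.spark.sql.SparkSession;

// Build a 'main' method that creates the SparkSession for the generated application.
MethodSpec main = MethodSpec.methodBuilder("main")
        .addModifiers(Modifier.PUBLIC, Modifier.STATIC)
        .returns(void.class)
        .addParameter(String[].class, "args")
        .addStatement("$T spark = $T.builder().appName($S).getOrCreate()",
                SparkSession.class, SparkSession.class, "CollaborativeFiltering")
        .build();

// Wrap it in a class and render the complete Java source file as a string.
TypeSpec app = TypeSpec.classBuilder("GeneratedSparkApp")
        .addModifiers(Modifier.PUBLIC, Modifier.FINAL)
        .addMethod(main)
        .build();
JavaFile file = JavaFile.builder("com.example.generated", app).build();
System.out.println(file.toString());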

9 Conclusion

The field of data science is challenging in many ways. First, the datasets are
usually messy, and data scientists spend much of their time pre-processing the
data and selecting features from it. Second, ML algorithms apply multiple
iterations over the dataset to train a model, and preparing a model is
computationally expensive and time-consuming. Third, real-world applications of
the trained model to new data, such as fraud detection, require the models to
become part of production services in real time. Data scientists spend most of
their time understanding and analysing the data; they want to try different ML
applications and tweak the models to achieve the sought accuracy in data
analytics. Modelling such ML applications adds another level of difficulty, and
there should be a way to reuse the models while experimenting with the existing
designed systems. The Apache Spark framework combines distributed computing on
clusters with a library for writing ML applications on top of it. Nevertheless,
writing good Spark programs requires the user to understand the Spark session,
data abstractions and transformations, and writing independent code for each ML
application adds code redundancy. This paper enables end-users to create Spark ML
applications while circumventing the tedious task of learning the Spark
programming framework, through a flow-based programming paradigm. Our main
contributions are taking the Java APIs of Spark ML operating on DataFrames, a
popular ML library of Apache Spark, modelling them as composable components, and
developing a conceptual approach to parse an ML flow created by connecting several
such components. The conceptual approach has been validated by designing three ML
use cases involving prediction with decision trees, anomaly detection with k-means
clustering, and collaborative filtering to develop a music recommender
application. The use cases demonstrate how easily ML flows can be created
graphically by connecting different components at a higher level of abstraction,
how parameters of the various components can be configured with ease, how the user
flow is parsed automatically to give feedback if a component has been used in a
wrong position in a flow, and finally how the ML application is generated
automatically without the end-user having to write any code. In addition, our work
lays the foundation for several future directions. These include data
visualisation techniques to make it more appealing for end-users to work on ML
problems, automatic one-click deployment of the ML model after the training phase,
and the design and implementation of a flow validation mechanism based on the
input/output of the flow components, which would make the system more flexible for
future changes and generic for all kinds of flow design.

List of abbreviations

API: Application Programming Interface; DAG: Directed Acyclic Graph; DL: Deep
Learning; DTO: Data Transfer Object; FBP: Flow-based Programming; GUI: Graphical
User Interface; IDAR: Identify, Down, Aid, and Role; JSON: JavaScript Object
Notation; MDSD: Model-Driven Software Development; ML: Machine Learning; OpenCV:
Open Computer Vision; REST: REpresentational State Transfer; UML: Unified
Modelling Language.

Declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Availability of data and material

Data sharing is not applicable to this article as no datasets were generated or
analysed during the current study.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by the German Research Foundation (DFG) and the
Technical University of Munich within the Open Access Publishing Funding Pro-
gramme.

Authors’ contributions

TM designed the conceptual approach, selected the use cases and wrote the entire
manuscript. SB implemented the conceptual approach and evaluated the work
against the use cases.

Acknowledgements

The authors are grateful to all their colleagues who actively provided insights,
reviews and comments leading to the materialisation of this work.

Authors’ information

SB holds a bachelor's degree in computer science from KIIT, India. She recently
completed her master's degree in computer science at the Technical University of
Munich (TUM). In her master's thesis, she worked extensively on building tools to
prototype Machine Learning applications. TM is an Assistant Professor at the
Department of Computer Science and Information Systems at the Birla Institute
of Technology and Science, Pilani. He holds a bachelor's degree in Computer
Engineering from NMiMS Mumbai, an M.Sc. degree in Software Systems Engineering
from RWTH Aachen and a PhD degree in Computer Science from the Technical
University of Munich. His research interest lies in supporting Big Data Analytics
and ML via the graphical flow-based programming paradigm to foster widespread
adoption of data science.

References

1. JavaPoet. https://github.com/square/javapoet. [Online; accessed 18-May-2020]


2. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S.,
Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard,
M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga,
R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I.,
Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P.,
Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning
on heterogeneous systems (2015). URL https://www.tensorflow.org/. Software available
from tensorflow.org
3. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat,
S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G.,
Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Ten-
sorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on
Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016). URL
https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
4. Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T.,
Franklin, M.J., Ghodsi, A., Zaharia, M.: Spark sql: Relational data processing in spark. In:
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’15, p. 1383–1394. Association for Computing Machinery, New York, NY, USA
(2015). DOI 10.1145/2723372.2742797. URL https://doi.org/10.1145/2723372.2742797
5. Berthold, M.R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T., Meinl, T., Ohl, P., Thiel, K.,
Wiswedel, B.: Knime - the konstanz information miner: Version 2.0 and beyond. SIGKDD
Explor. Newsl. 11(1), 26–31 (2009). DOI 10.1145/1656274.1656280. URL https://doi.
org/10.1145/1656274.1656280
6. BigML: Machine Learning that works. https://static.bigml.com/pdf/
BigML-Machine-Learning-Platform.pdf?ver=5b569df (2020). [Online; accessed 06-
June-2020]
7. Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache
flink™: Stream and batch processing in a single engine. IEEE Data Eng. Bull. 38, 28–38
(2015)
8. Collobert, R., Bengio, S., Mariéthoz, J.: Torch: a modular machine learning software li-
brary. Idiap-RR Idiap-RR-46-2002, IDIAP (2002)

9. Culjak, I., Abram, D., Pribanic, T., Dzapo, H., Cifrek, M.: A brief introduction to opencv.
In: 2012 Proceedings of the 35th International Convention MIPRO, pp. 1725–1730 (2012)
10. Daniel, F., Matera, M.: Mashups: Concepts, Models and Architectures. Springer Berlin
Heidelberg, Berlin, Heidelberg (2014)
11. Demšar, J., Curk, T., Erjavec, A., Črt Gorup, Hočevar, T., Milutinovič, M., Možina,
M., Polajnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar,
J., Žitnik, M., Zupan, B.: Orange: Data mining toolbox in python. Journal of Machine
Learning Research 14, 2349–2353 (2013). URL http://jmlr.org/papers/v14/demsar13a.
html
12. Escott, K.R., Noble, J.: Design patterns for angular hotdraw. In: Proceedings of the 24th
European Conference on Pattern Languages of Programs, EuroPLop ’19. Association for
Computing Machinery, New York, NY, USA (2019). DOI 10.1145/3361149.3361185. URL
https://doi.org/10.1145/3361149.3361185
13. Fowler, M.: Patterns of Enterprise Application Architecture. Addison-Wesley Longman
Publishing Co., Inc., USA (2002)
14. Freeman, A.: Pro Angular 6, 3rd edn. Apress, USA (2018)
15. Hajian, M.: Progressive Web Apps with Angular: Create Responsive, Fast and Reliable
PWAs Using Angular, 1st edn. APress (2019)
16. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka
data mining software: An update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009). DOI
10.1145/1656274.1656278. URL https://doi.org/10.1145/1656274.1656278
17. Jannach, D., Jugovac, M., Lerche, L.: Supporting the design of machine learning workflows
with a recommendation system. ACM Trans. Interact. Intell. Syst. 6(1) (2016). DOI
10.1145/2852082. URL https://doi.org/10.1145/2852082
18. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–44 (2015). DOI
10.1038/nature14539
19. Overton, M.A.: The idar graph. Commun. ACM 60(7), 40–45 (2017). DOI 10.1145/
3079970. URL https://doi.org/10.1145/3079970
20. Mahapatra, T.: High-level Graphical Programming for Big Data Applications. Disserta-
tion, Technische Universität München, München (2019). URL http://mediatum.ub.tum.
de/?id=1524977
21. Mahapatra, T., Gerostathopoulos, I., Prehofer, C., Gore, S.G.: Graphical Spark Program-
ming in IoT Mashup Tools. In: 2018 Fifth International Conference on Internet of Things:
Systems, Management and Security, pp. 163–170 (2018). DOI 10.1109/IoTSMS.2018.
8554665. URL https://doi.org/10.1109/IoTSMS.2018.8554665
22. Mahapatra, T., Prehofer, C.: aFlux: Graphical flow-based data analytics. Software Impacts
2, 100007 (2019). DOI 10.1016/j.simpa.2019.100007. URL https://doi.org/10.1016/j.
simpa.2019.100007
23. Mahapatra, T., Prehofer, C.: Graphical Flow-based Spark Programming. Journal of Big
Data 7(1), 4 (2020). DOI 10.1186/s40537-019-0273-5. URL https://doi.org/10.1186/
s40537-019-0273-5
24. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J.,
Tsai, D., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M.,
Talwalkar, A.: Mllib: Machine learning in apache spark. J. Mach. Learn. Res. 17(1),
1235–1241 (2016)
25. mljar: Machine Learning for Humans! Automated machine learning platform. https:
//mljar.com (2018). [Online; accessed 18-May-2020]
26. Morrison, J.P.: Flow-Based Programming, 2nd Edition: A New Approach to Application
Development. CreateSpace, Paramount, CA (2010)
27. Nguyen, G., Dlugolinsky, S., Bobák, M., Tran, V., López Garcı́a, Á., Heredia, I., Malı́k,
P., Hluchý, L.: Machine learning and deep learning frameworks and libraries for large-
scale data mining: a survey. Artificial Intelligence Review 52(1), 77–124 (2019). DOI
10.1007/s10462-018-09679-z. URL https://doi.org/10.1007/s10462-018-09679-z
28. Overton, M.A.: The idar graph. Queue 15(2), 29–48 (2017). DOI 10.1145/3084693.
3089807. URL https://doi.org/10.1145/3084693.3089807
29. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen,
T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De-
Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai,
J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning li-
brary. In: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,
R. Garnett (eds.) Advances in Neural Information Processing Systems 32, pp.
8026–8037. Curran Associates, Inc. (2019). URL http://papers.nips.cc/paper/
9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
30. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,
M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau,
D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in python. J.
Mach. Learn. Res. 12(null), 2825–2830 (2011)
31. Santos, W.d., Avelar, G.P., Ribeiro, M.H., Guedes, D., Meira Jr., W.: Scalable and efficient
data analytics and mining with lemonade. Proc. VLDB Endow. 11(12), 2070–2073 (2018).
DOI 10.14778/3229863.3236262. URL https://doi.org/10.14778/3229863.3236262
32. Stahl, T., Völter, M., Czarnecki, K.: Model-Driven Software Development: Technology,
Engineering, Management. John Wiley & Sons, Inc., Hoboken, NJ, USA (2006)
33. StreamAnalytix: Self-Service Data Flow and Analytics For Apache Spark. https://www.
streamanalytix.com (2018). [Online; accessed 18-May-2020]
34. StreamSets: DataOps for Modern Data Integration. https://streamsets.com (2018). [On-
line; accessed 18-May-2020]
35. University of California, Irvine: UCI Machine Learning Repository.
https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/
36. Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D.,
Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S.J., Brett, M., Wilson,
J., Millman, K.J., Mayorov, N., Nelson, A.R.J., Jones, E., Kern, R., Larson, E., Carey,
C.J., Polat, İ., Feng, Y., Moore, E.W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman,
R., Henriksen, I., Quintero, E.A., Harris, C.R., Archibald, A.M., Ribeiro, A.H., Pedregosa,
F., van Mulbregt, P., Vijaykumar, A., Bardelli, A.P., Rothberg, A., Hilboll, A., Kloeckner,
A., Scopatz, A., Lee, A., Rokem, A., Woods, C.N., Fulton, C., Masson, C., Häggström, C.,
Fitzgerald, C., Nicholson, D.A., Hagen, D.R., Pasechnik, D.V., Olivetti, E., Martin, E.,
Wieser, E., Silva, F., Lenders, F., Wilhelm, F., Young, G., Price, G.A., Ingold, G.L., Allen,
G.E., Lee, G.R., Audren, H., Probst, I., Dietrich, J.P., Silterra, J., Webber, J.T., Slavič,
J., Nothman, J., Buchner, J., Kulick, J., Schönberger, J.L., de Miranda Cardoso, J.V.,
Reimer, J., Harrington, J., Rodrı́guez, J.L.C., Nunez-Iglesias, J., Kuczynski, J., Tritz, K.,
Thoma, M., Newville, M., Kümmerer, M., Bolingbroke, M., Tartre, M., Pak, M., Smith,
N.J., Nowaczyk, N., Shebanov, N., Pavlyk, O., Brodtkorb, P.A., Lee, P., McGibbon, R.T.,
Feldbauer, R., Lewis, S., Tygier, S., Sievert, S., Vigna, S., Peterson, S., More, S., Pudlik,
T., Oshima, T., Pingel, T.J., Robitaille, T.P., Spura, T., Jones, T.R., Cera, T., Leslie, T.,
Zito, T., Krauss, T., Upadhyay, U., Halchenko, Y.O., Vázquez-Baeza, Y., 1.0 Contributors,
S.: Scipy 1.0: fundamental algorithms for scientific computing in python. Nature Methods
17(3), 261–272 (2020). DOI 10.1038/s41592-019-0686-2. URL https://doi.org/10.1038/
s41592-019-0686-2
37. Walls, C.: Spring Boot in Action, 1st edn. Manning Publications Co., USA (2016)
38. Washington, M.: Azure Machine Learning Studio for The Non-Data Scientist: Learn How
to Create Experiments, Operationalize Them Using Excel and Angular .Net Core ... Pro-
grams to Improve Predictive Results., 1st edn. CreateSpace Independent Publishing Plat-
form, North Charleston, SC, USA (2017)
39. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J.,
Shenker, S., Stoica, I.: Resilient distributed datasets: A fault-tolerant abstraction for in-
memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked
Systems Design and Implementation, NSDI’12, p. 2. USENIX Association, USA (2012)
40. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster com-
puting with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics
in Cloud Computing, HotCloud’10, p. 10. USENIX Association, USA (2010)
41. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen,
J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.:
Apache spark: A unified engine for big data processing. Commun. ACM 59(11), 56–65
(2016). DOI 10.1145/2934664. URL https://doi.org/10.1145/2934664
42. Zecevic, P., Bonaci, M.: Spark in Action, 1st edn. Manning Publications Co., USA (2016)
