Flow-Based Programming For Machine Learning
Keywords: End-User Programming, Graphical Flows, Graphical Programming Tools, Machine Learning as a Service (MLaaS), Machine-Learning-Platform-as-a-Service (ML PaaS), Machine Learning Pipelines
DOI: https://doi.org/10.21203/rs.3.rs-707294/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract Machine Learning (ML) has gained prominence and has tremendous applications in fields like medicine, biology, geography and astrophysics, to name a few. Arguably, in such areas it is used by domain experts who are not necessarily skilled programmers. Thus, programming ML applications presents a steep learning curve for such domain experts. To overcome this and foster widespread adoption of ML techniques, we propose to equip them with domain-specific graphical tools. Such tools, based on the principles of the flow-based programming paradigm, would support the graphical composition of ML applications at a higher level of abstraction and the auto-generation of target code. Accordingly, (i) we have modelled ML algorithms as composable components; and (ii) we describe an approach to parse a flow created by connecting several such composable components and to use an API-based code generation technique to generate the ML application. To demonstrate the feasibility of our conceptual approach, we have modelled the APIs of Apache Spark ML as composable components and validated the approach in three use-cases. The use-cases are designed to capture the ease of program specification at a higher abstraction level, the easy parametrisation of ML APIs, the auto-generation of the ML application and the auto-validation of the generated model for better prediction accuracy.
1 Introduction
Machine Learning (ML) is a scientific discipline that develops and makes use of
a particular class of algorithms which are designed to solve problems without ex-
plicit programming [42]. The algorithms infer patterns present in a dataset and learn how to solve a specific problem. This self-learning capability of computer systems has gained prominence and has vast applications in the current era.
The massive influx of data from the Internet and other sources creates a large
bed of structured as well as unstructured datasets, where ML techniques can be
leveraged to identify meaningful correlations and enable automated decision-making. Nevertheless, domain experts such as traffic engineers or molecular biologists, who are typically less-skilled programmers, face a steep learning curve when applying ML techniques: they must learn how to program and write a ML application from scratch using general-purpose, high-level languages like Java, Scala or Python. This learning curve hinders the widespread adoption of ML by researchers unless they are well-trained in programming. In response to this, we propose to equip less-skilled programmers who are domain experts with graphical tools. In particular, we intend to support the graphical specification of ML programs via the flow-based programming paradigm and the auto-generation of target code, thereby shielding the user of such tools from the nuances and complexities of programming. Such graphical flow-based programming tools, called mashup tools, have been extensively used to simplify application development [10].
1.1 Contributions
the end-user having to write any code. The user can split the initial dataset into training and testing datasets, and specify a range of values for different model parameters for the system to iteratively generate and test models until the model with the best prediction accuracy is produced.
1.2 Outline
The rest of the paper is structured in the following way: Section 2 summarizes
the background while Section 3 discusses the related work. We give an overview
of Spark and its machine learning library, i.e. Spark ML, the different kinds of data transformation APIs available in Spark and our design choices to support only specific kinds of APIs in Section 4. Section 5 describes our conceptual approach
to support graphical flow-based ML programming at a higher level of abstraction
involving modelling of APIs as components, flow-parsing and target code gener-
ation while Section 6 describes its realization. Section 7 validates the conceptual
approach in three concrete use-cases. We compare our conceptual approach with
existing works in Section 8, which is followed by concluding remarks in Section 9.
2 Background
the form of model parameter tuning to improve prediction accuracy, while DL self-improves to minimise error and increase prediction accuracy. Fourth, DL is well suited to situations where massive amounts of data are available and is a more sophisticated technique in comparison to traditional ML.
There are a plethora of ML libraries available, including TensorFlow [3, 2], Py-
Torch [29], FlinkML [7], SparkML and scikit-learn [30], among others. TensorFlow
has become one of the most prominent libraries for both ML as well as DL. It
provides flexible APIs for different programming languages with support for easy
creation of models by abstracting low-level details. PyTorch is another open-source
ML library developed by Facebook. It is based on Torch [8], an open-source ML
library used for scientific computation. This library provides several algorithms
for DL applications such as natural language processing and computer vision, among others, via Python APIs. Similarly, scikit-learn is an open-source ML framework based on SciPy [36], which provides many packages for scientific computing in Python. The framework has excellent support for traditional ML tasks like classification, clustering and dimensionality reduction. The Open Source Computer Vision Library (OpenCV) provides ML algorithms and is mainly used in the field of computer vision [9]. OpenCV is implemented in C++ but provides APIs in other languages like Java, Python and Haskell, among many more. Apache Flink, a popular distributed stream processing platform, provides ML APIs in the form of a library called FlinkML. Applications designed using these APIs run inside the Flink execution environment. Another prominent open-source library is Weka. These are some of the most widely used libraries, and listing all the available ML libraries is beyond the scope of this paper. We, therefore, invite interested
readers to refer to [27] for more comprehensive information about ML and DL
libraries.
MDSD abstracts away the domain-specific implementation from the design of the software systems [32]. The levels of abstraction in a model-driven approach help to communicate the design, scope and intent of the software system to a broader audience, which increases the overall quality of the system. The models in MDSD are abstract representations of the real-world things that need to be understood before building a system. These models are transformed into platform-specific implementations through domain modelling languages. MDSD can be compared to the transformation of a high-level programming language into machine code. MDSD often involves transforming the model into text, which is popularly known as code generation. There are different kinds of widely used code-generation techniques, such as templates and filtering, templates and meta-model, code weaving and API-based code generation, among others [32]. API-based code generators are the simplest and the most popular. They simply provide an API with which the elements of the target platform or language can be generated. They are dependent on the abstract syntax of the target language and are always tied to that language. To generate target code in a new language, new APIs working on the abstract syntax of that language are needed.
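One widely used example of such an API-based generator for Java is JavaPoet, the library later employed in Section 6. The following minimal sketch, in which the package, class name and printed message are purely illustrative, builds the abstract syntax of a small Java class and emits its source code:

import com.squareup.javapoet.JavaFile;
import com.squareup.javapoet.MethodSpec;
import com.squareup.javapoet.TypeSpec;
import javax.lang.model.element.Modifier;

public class ApiBasedGenerationSketch {
  public static void main(String[] args) throws Exception {
    // Build the abstract syntax of a method, then a class, then a compilation unit.
    MethodSpec main = MethodSpec.methodBuilder("main")
        .addModifiers(Modifier.PUBLIC, Modifier.STATIC)
        .returns(void.class)
        .addParameter(String[].class, "args")
        .addStatement("$T.out.println($S)", System.class, "Generated by an API-based generator")
        .build();
    TypeSpec generated = TypeSpec.classBuilder("GeneratedApp")
        .addModifiers(Modifier.PUBLIC, Modifier.FINAL)
        .addMethod(main)
        .build();
    // The emitted source is tied to the abstract syntax of Java, as noted above.
    JavaFile.builder("com.example.generated", generated).build().writeTo(System.out);
  }
}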
3 Related work
The literature hardly indicates any significant research on supporting graphical ML programming at a higher level of abstraction while simultaneously explaining the programming concepts necessary for it. Nevertheless, there are a number of relevant works in the literature as well as products in the market which support high-level ML programming, like WEKA [16], Azure Machine Learning Studio [38], KNIME [5], Orange [11], BigML [6], mljar [25], RapidMiner [17], Streamanalytix [33], Lemonade [31] and Streamsets [34], among others. Out of these, only Streamanalytix, Lemonade and Streamsets specifically deal with Spark ML. We compare our conceptual approach with these solutions in Section 8.
4 Apache Spark
4.1 APIs
Spark MLlib: Spark MLlib has been built on top of Spark Core using the RDD abstraction and offers a wide variety of machine learning and statistical algorithms. It supports various supervised, unsupervised and recommendation algorithms. Supervised learning algorithms include decision trees, random forests and support vector machines, while the unsupervised learning algorithms supported include k-means clustering, among others.
Spark ML: Spark ML is the successor of Spark MLlib and has been built on top of Spark SQL using the DataFrame abstraction. It offers Pipeline APIs for easy development, persistence and deployment of models. Practical machine learning scenarios involve different stages, with each stage consuming data from the preceding stage and producing data for the succeeding stage. Operational stages include transforming data into the appropriate format required by the algorithm, converting categorical features into continuous features, etc. Each operation involves invoking declarative APIs which transform a DataFrame based on user inputs [24] and produce a new DataFrame for use in the next operation. Hence, a Spark ML pipeline is a sequence of stages, each consisting of either a transformer or an estimator. A transformer transforms one DataFrame into another, often via operations like adding or modifying columns in the input DataFrame, while an estimator is used to train on the input data and outputs a transformer, which can be a ML model.
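A minimal sketch of such a pipeline in Java is given below; the column names ("elevation", "slope", "labelCol") are illustrative placeholders loosely modelled on the covtype example of Section 7 and not a prescribed schema:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class PipelineSketch {
  public static PipelineModel buildAndFit(Dataset<Row> training) {
    // Transformer: assembles raw columns into a single feature vector column.
    VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[] {"elevation", "slope"})
        .setOutputCol("features");
    // Estimator: fits on the assembled data and yields a transformer (the trained model).
    DecisionTreeClassifier dt = new DecisionTreeClassifier()
        .setLabelCol("labelCol")
        .setFeaturesCol("features");
    Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {assembler, dt});
    return pipeline.fit(training);
  }
}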
during compile time. Nonetheless, these are too restrictive and would render the conceptual approach described in this manuscript non-generic and tied to specific use-cases only. In comparison to these, the DataFrame-based APIs are untyped APIs which provide columnar access to the underlying datasets. They are easy-to-use, domain-specific APIs and detect syntax errors at compile time. The analytical errors which might creep in during the usage of such APIs can be avoided by designing checks to ensure that the named columns which a specific API is trying to access are indeed present in its input. Additionally, the Spark ML library (version 2.4.4) has been updated to use the DataFrame APIs. Hence, the DataFrame APIs are the natural choice over the RDD APIs.
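A simple way to realise such a check, assuming the generated code knows which named column the next API expects, is a small guard of the following kind (class and method names are illustrative):

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class ColumnChecks {
  // Fails fast if a named column required by the next API is absent from the incoming DataFrame.
  public static void requireColumn(Dataset<Row> df, String column) {
    if (!Arrays.asList(df.columns()).contains(column)) {
      throw new IllegalArgumentException("Expected column '" + column + "' in input DataFrame");
    }
  }
}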
5 Conceptual approach
developers when coding from scratch but which do not directly correspond to the data processing logic of the application. An example would be the code responsible for initialising or closing the Spark session, within which the remaining data processing APIs get invoked. Another example is configuring the Spark session, e.g. setting the application name, configuring the running environment mode (local or cluster), specifying the driver memory size and providing the binding address, among many others. We handle these aspects at the back-end so that the end-user of such a tool can focus solely on the business logic or data processing logic of the application. The code generator, running at the back-end (discussed in Section 5.3), is responsible for adding such crucial parts to the final code and initialising the required settings with sensible defaults to make it compilable. Nevertheless, the default settings can be overridden by the user from the front-end. For example, the "Start" component (discussed in Section 5.2) is a special component used to mark the start of the flow; it can be configured to fine-tune and explicitly override different property values of the Spark session as discussed above.
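For illustration, the session set-up emitted by the code generator could look like the following sketch; the concrete default values shown here are assumptions and stand in for whatever the "Start" component does not override:

import org.apache.spark.sql.SparkSession;

public class SessionBootstrap {
  public static SparkSession createSession() {
    // Defaults injected by the back-end; the "Start" component may override them from the front-end.
    return SparkSession.builder()
        .appName("GeneratedMLApplication")
        .master("local[*]")                               // running environment mode: local or cluster
        .config("spark.driver.memory", "2g")              // driver memory size
        .config("spark.driver.bindAddress", "127.0.0.1")  // binding address
        .getOrCreate();
  }
}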
As a second step, we take the components (discussed in Section 5.1) and make them
available to the end-user via a programming tool. The programming tool must have
an interactive graphical user interface consisting of a palette, a drawing area called
canvas, a property pane and a message pane. The palette contains all available
modular components which can be composed into a flow conforming to some standard flow composition rules. The actual flow composition takes place on the canvas
when the user drags a component from the palette and places it on the canvas.
The component, when present on the canvas and when selected, should display
all its configurable properties in the property pane. The user can override default
settings and provide custom settings for the operation of a component via this
pane. The flow, while being composed on the canvas, is captured, converted into a
directed acyclic graph (DAG) and checked for the correct order of the connected
components. The flow-checking should be done whenever there is a state change.
A state change occurs whenever something changes on the canvas, for example, when a new component is dragged onto the canvas from the palette or when the connections between components already on the canvas change, among other possibilities. Flow-checking checks the user flow for any potential irreconcilability with the compositional rules. Such a flow, when passed to the back-end for code generation,
will generate target code which does not produce any compile-time errors. The
components are composable in the form of a flow if they adhere to the following
compositional rules:
When the flow is marked complete and flow-checking has completed success-
fully, it is passed to the back-end to start the code generation process. Fig. 1
and Fig. 2 depict the flow-checking sequences for a typical ML flow leading to
passing and failure of the process respectively. In the figures, it is assumed that flow-checking begins after the flow is complete, for ease of illustration. Nonetheless, it is done whenever a change is detected on the canvas.
Fig. 1: Flow-checking sequence for a user flow Start → Feature Extraction → Decision Tree → Save Model, which passes the check.
Fig. 2: Flow-checking sequence for a user flow Start → Decision Tree → Feature Extraction → Save Model, which fails the check: the "Decision Tree" component can be used only after "Feature Selection", and the user is notified.
The third step deals with the parsing of the user flow and generating the target
code. The back-end consists of a parser, API library and a code generator. The
parser is responsible for parsing the received flow and representing it as an intermediate representation, typically in the form of a DAG. Next, it traverses the DAG
to ascertain the types of components used and checks in the library for their corresponding method implementations. In MDSD, there are many code generation
techniques, but we have used the API-based code generation technique because it
is simple to use and serves our purpose well. The only downside is that, with the API-based code generation technique, the API can generate code only for a specific
platform or language as it inherently depends on the abstract syntax of the tar-
get language to function. In our case, the API-based code generation technique is restricted to generating Java code. The method implementations for every component in the library contain statements or specific APIs to generate the particular
Spark API, which is represented by the component on the front-end. The DAG or
intermediate flow representation along with the information of which method to
invoke from the library for each vertex is passed to the code generator. The code
generator has extra APIs to generate the necessary but abstracted portion of the
ML code like the start of a Spark session and the inclusion of required libraries. Then, it invokes the specific method implementation from the library for each
vertex, i.e. invokes the APIs contained within them to generate the target Java
Spark API to be used in the final target code. When the code generator does this
for all the vertices of the DAG, the target Spark ML code in Java is generated.
The final code is compiled and packaged to create the runnable Spark ML driver program. The driver program is the artefact which is sent to a Spark cluster for execution.
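The following sketch illustrates this assembly step; the module interface and names are hypothetical and only indicate how the statements produced by the library modules could be collected around the abstracted boilerplate:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical module interface and traversal, not the actual implementation.
interface CodeModule {
  String generate(Map<String, String> userParams); // emits one target Spark API call as source text
}

public class GeneratorSketch {
  // Walks the modules mapped from the DAG vertices in their order of connection and
  // assembles their generated statements together with the abstracted boilerplate.
  public static String assemble(List<CodeModule> orderedModules,
                                List<Map<String, String>> params) {
    List<String> statements = new ArrayList<>();
    statements.add("SparkSession spark = SparkSession.builder().getOrCreate();"); // abstracted part
    for (int i = 0; i < orderedModules.size(); i++) {
      statements.add(orderedModules.get(i).generate(params.get(i)));
    }
    statements.add("spark.stop();");
    return String.join("\n", statements);
  }
}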
The final step is used to test the generated model on the testing dataset to check
for prediction accuracy or performance of the model. The user may have specified different model parameters to tune during flow specification. If so, the code generator generates different ML models, each pertaining to a unique model parameter value. All the models are tested, and the best performing ML model is saved for final usage. For example, consider a user designing a ML model using the k-means algorithm who has supplied a range of values for k as a parameter while specifying the flow, where k is the number of clusters the user wants to create from the input dataset. In this case, the code generator generates different k-means models for the range of k values and evaluates the compute cost for each k-means model. Compute cost is one of the ways to evaluate k-means clustering using Spark ML. The system displays the k values and the corresponding model evaluation scores in ascending order for the user to select from or to go with the best performing model. The goal of hyper-parameter tuning in the case of the k-means algorithm is to find the optimal k value for the given dataset. If the user provides only one k value, then the code generator generates only one model.
Fig. 3 illustrates the conceptual approach to generate ML code after the steps
of user flow specification and flow-checking have been completed. Step 1 in the
figure corresponds to ideas discussed in Section 5.2 while steps 2-5 correspond to
Section 5.3. Finally, step 6 corresponds to ideas discussed in Section 5.4.
Fig. 3: Conceptual approach for generating the ML application. The parser converts the user flow into an intermediate representation (a DAG) in which every vertex corresponds to one module in the API library; each module contains API statements to generate its part of the target code. The code generator executes the modules as many times as the range of model parameters specified in the user flow, so that the result is target code initialised with different parameters. All such sets of generated code fragments are assembled into a set of final target codes, which are compiled into runnable applications (JARs). Finally, all resulting ML models are validated against test data to select the best-performing model.
6 Realisation
We have implemented a graphical programming tool with Spring Boot [37]. The
application comes with a minimalistic interactive graphical user interface (GUI)
built using Angular [14, 15, 12], a popular open-source web application framework.
Fig. 4 shows the GUI of the prototype with its palette containing available ML
components. It also has a canvas where the components are dragged and connected
in the form of a flow and finally, a button to start the code generation process.
Whenever a ML component is dragged onto the canvas, a property pane opens
up allowing the user to configure its various parameters to fine-tune its operation
by overriding the default values. Additionally, when something changes on the
canvas, the entire flow is checked to ensure its correctness, and in case of any
error, the user is provided feedback on the GUI. Fig. 5 depicts these aspects of
the prototype. On pressing the button, the flow configuration is captured as a JavaScript Object Notation (JSON) object and sent to the Spring Boot back-end via REST APIs. It is converted into a Data Transfer Object (DTO) before processing.
Fowler describes such a DTO as “an object that carries data between processes to
reduce the number of method calls.” [13]. At the back-end, the user flow is parsed,
converted into an intermediate representation and passed to the code generator.
The code generator auto-generates the target Spark ML application with the help
of JavaPoet, an API-based Java code generator [1]. We have provided three video
files demonstrating use case 7.1, use case 7.2 and use case 7.3 respectively as
supplementary material with this work.
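As an illustration of this hand-over between front-end and back-end, the flow DTO could take a shape similar to the following sketch; the field names are hypothetical and accessors are omitted for brevity:

import java.util.List;
import java.util.Map;

// Hypothetical shape of the flow DTO built from the front-end JSON; not the actual implementation.
public class FlowDto {
  private List<ComponentDto> components;   // one entry per component placed on the canvas
  private List<int[]> connections;         // pairs of component ids describing the wiring of the flow

  public static class ComponentDto {
    private int id;
    private String type;                   // e.g. "Start", "FeatureExtraction", "DecisionTree"
    private Map<String, String> parameters; // user-supplied property-pane values
  }
}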
6.2 Components
and can be used in all the three use cases. A typical ML flow consists of four
components. The first component marks the start of the flow and is responsible for generating the code related to Spark session initialisation. The second component is used to specify and pre-process the input data. It deals with dimensionality reduction, i.e. extracting a reduced set of non-redundant information from the dataset which can be fed to a ML algorithm and is sufficient to solve a given problem. The third component is a very specific training module which houses a specific ML algorithm like decision tree or k-means, among others. The fourth component trains the model, supports hyperparameter tuning of model parameters and saves
the final selected model. Table 1 lists the prototyped components and summarises
their functionalities.
6.3 Working
In the back-end, the flow is parsed and represented in the form of a DAG, an intermediate representation format. The parser traverses all the vertices of the DAG to find out which component maps to which module in the API library. A module in the API library contains API statements written using JavaPoet which, on execution, generate target language statements. The code generator assembles the statements produced by all such modules, initialised with user parameters, into the final target code. This code is then compiled and packaged to produce a runnable application. In this entire process, the prototype has many
to produce a runnable application. In this entire process, the prototype has many
components which interact and accomplish certain functionalities. The system
architecture of the prototype with its major components and their interactions
has been summarised in the form of an Identify, Down, Aid, and Role (IDAR)
graph in Fig. 6.
Fig. 6: IDAR graph of the prototype. The front-end is depicted as a separate subsystem which sends the user flow to the back-end controller. The controller commands the flow parser (parse flow), the flow save component (save flow), the code generator (generate code) and the model generator (generate model) downstream, and receives upstream notices as each step is completed.
Compared with a traditional Unified Modelling Language (UML) diagram, an IDAR graph is simpler and more intuitive for comprehending the system structure, hierarchy and communication between components [28]. In an IDAR graph, the objects at a higher level control the objects situated below them. Communication takes place either via a downstream command message (a control message) or via an upstream non-command message, which is called a notice. Separate subsystems are denoted by hexagonal boxes. Similarly, a dotted line ending with an arrow depicts the data flow in the system, and an arrow with a bubble on its tail shows an indirect method call. For a comprehensive understanding of IDAR, we refer interested readers to [28] and [19].
The front-end has been depicted as a separate subsystem where the user creates
the flow and it is validated. When the flow is marked as complete, the front-end
sends it via a REST API to the back-end controller. The controller is the main
component which houses all the REST APIs for interaction with the front-end,
invoking other components and coordinating the whole process of code generation.
On receiving a flow from the front-end, it invokes the parser, which creates a DAG out of the flow, saves it in the local database and traverses all the vertices to find out which specific modules in the API library must be invoked for code generation. It passes this information back to the controller, which invokes the code generator. The code generator invokes the necessary modules in the API library in the order of their connection in the user flow. These modules, on execution, produce target language statements which are then initialised with user-supplied parameters. This is done for all the components the user has connected in the flow. All the generated target statements are assembled in the same order in which the modules were invoked, yielding the final target code. This code is passed back to the controller. Then, the controller invokes the model generator, which compiles the final code and packages it into a runnable Spark ML application. If the user had supplied a range of parameters for the ML model, the code generator invokes the modules in the API library and initialises the target statements with different sets of parameters, which leads to the production of a series of final target codes. Accordingly, the model generator compiles the entire set of generated codes to produce a set of runnable
Spark applications. All of the generated applications are evaluated on test data for
prediction accuracy. The best performing application/ML model is selected from
the whole set.
7 Running Examples
In this section, we discuss three running examples/use cases to capture the repli-
cability of the automatic code-generation from graphical Spark flows, which is
the quintessence of our conceptual approach. The running examples are based on
three ML algorithms, namely- Decision Tree, K-Means Clustering and Collabo-
rative Filtering. The goal is to demonstrate that the end-user is able to create a
runnable Spark program based on the ML algorithms mentioned without having to
understand the details of the Spark ML APIs or requiring any programming skills.
Nevertheless, the user is expected to know the dataset on which a ML algorithm is to be applied. Additionally, the user is expected to know the label column and the feature columns of the dataset.
The first use case involved creating a ML Spark application based on the Decision
tree algorithm. This supervised ML technique splits the input data depending on
the model parameters to make a decision. We have used the covtype dataset in
this example [35].
The dataset is available online in compressed CSV format. The dataset reports
types of forest covering parcels of land in Colorado, USA. The dataset features
describe the parcel of land in terms of its elevation, slope, soil type and the known
forest type covers. The dataset has 54 features to describe the pieces of land and
581,012 examples. The forest covers have been categorised into seven different
cover types. The category values range from 1 to 7. The model, trained using the 54 features of the dataset and the labelled data, should learn to predict the forest cover type. The data is already structured, and we have therefore used it directly as input in our example.
The end-user drags various graphical components from the palette to the canvas
and connects them in the form of a flow, as shown in Figure 7. The specifics of the
application like the input dataset, label column, ML model parameters are taken
as inputs from the user. Part 3A in Figure 7 depicts the parameters required to
create a decision tree model. Internally the parameters are specified to make calls
to the Spark ML Decision Tree APIs. The flow creation specifications are sent
as a JSON object to the back-end system. The decision tree model parameters
comprise impurity, depth of the tree and max bins. The model parameters are
crucial to the performance of the model. The impurity parameter minimises the
probability of misclassification of the decision tree classifier. The Spark ML library
Fig. 7: Graphical flow to create a decision tree model for forest cover prediction
provides two impurity measures for classification, namely, entropy and Gini. The
impurity is set on the estimator using the setImpurity method provided by the DecisionTreeClassifier Spark API. Similarly, max bins and max depth also play a
vital role in finding an optimal decision tree classifier. The end-user can tweak these settings until the desired outcome is achieved without having to understand
the internals of Spark ML APIs or updating the code manually. This process of
trying out different model parameters is called hyperparameter tuning. Listing 1
lists the feature extraction function auto-generated for our use case on covtype
data. Listing 2 lists the automatically generated code of the decision tree function, which includes the estimator and transformer steps of a Spark ML flow.
public static Dataset<Row> featureExtraction(SparkSession spark,
    String filePath, String labelColName) {
  Dataset<Row> df = spark.read()
      .option("header", false)
      .option("inferSchema", true).csv(filePath);

  for (String c : df.columns()) {
    df = df.withColumn(c, df.col(c).cast("double"));
  }
  df = df.withColumnRenamed(labelColName, "labelCol");
  df = df.withColumn("labelCol", df.col("labelCol").minus(1));
  return df;
}
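Independently of the auto-generated code in Listing 2, the following hand-written sketch shows how the impurity, maximum depth and maximum bins parameters discussed above map onto the DecisionTreeClassifier estimator; the column names are illustrative:

import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class DecisionTreeSketch {
  public static DecisionTreeClassificationModel train(Dataset<Row> training,
                                                      String impurity, int maxDepth, int maxBins) {
    DecisionTreeClassifier estimator = new DecisionTreeClassifier()
        .setLabelCol("labelCol")
        .setFeaturesCol("features")
        .setImpurity(impurity)   // "gini" or "entropy"
        .setMaxDepth(maxDepth)
        .setMaxBins(maxBins);
    return estimator.fit(training); // the estimator produces a transformer: the trained model
  }
}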
The second use case deals with the creation of a ML application involving K-Means. The K-Means algorithm partitions the data into several clusters. Anomaly detection is often used to detect fraud, unusual behaviour or attacks in a network. Unsupervised learning techniques are suitable for this kind of problem as they can learn the underlying patterns. We have used the dataset from the KDD Cup 1999, which constitutes network packet data. The data contains 38 features, including a label column. We did not need the label data for applying K-Means and therefore removed it in the feature extraction module. The graphical flow to create the Spark K-Means application is depicted in Figure 8. The example model parameters necessary to design the model are depicted in part 3A of Figure 8. The data for the use case contained String columns, which we removed during the feature extraction phase as the K-Means algorithm cannot process them. Additionally, the String columns would cause a runtime error as VectorAssembler only supports numeric, boolean and vector types. The code for this operation is auto-generated when the user inputs the String column names in the feature extraction stage of the flow, as depicted in part 2A of Figure 8.
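A minimal sketch of such an assembling step is shown below, assuming the user-named String columns are dropped before the remaining columns are assembled into a feature vector; class and method names are illustrative:

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class KMeansFeatureExtraction {
  // Drops the String columns the user named in the property pane, then assembles the rest,
  // since VectorAssembler only accepts numeric, boolean and vector input columns.
  public static Dataset<Row> assemble(Dataset<Row> df, String[] stringColumns) {
    for (String c : stringColumns) {
      df = df.drop(c);
    }
    VectorAssembler assembler = new VectorAssembler()
        .setInputCols(df.columns())
        .setOutputCol("features");
    return assembler.transform(df);
  }
}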
Fig. 8: Graphical flow to create a k-means clustering model for anomaly detection. Part 2A shows the configuration of feature extraction for k-means, part 3A the property pane used to configure the k-means model parameters and part 5 the running output of the k-means Spark ML application.
Spark ML. If the end-user provides only one K value, the system generates only one program and hence only one model. The output of the K-Means Spark application
for automatic hyperparameter tuning is shown in part 5 of Figure 8. The figure
indicates different K values and their corresponding model evaluation score. The
goal of the hyperparameter tuning in the case of the K-Means algorithm is to find
the optimal K value for the given dataset. Listing 3 shows code generated by the
back-end system for the graphical flow composed on the front-end.
public static KMeansModel kMeansClustering(String initMode, int lowK, int highK, int maxIter,
    double distanceThreshold, int step, Dataset<Row> oiosnsutcx) {
  Map<Integer, Double> degklrfhyb = new LinkedHashMap<Integer, Double>();
  Map<Integer, KMeansModel> mnqoxuuyhr = new LinkedHashMap<Integer, KMeansModel>();
  for (int iter = lowK; iter <= highK; iter += step) {
    KMeans pjtnfmyssi = new KMeans().setFeaturesCol("features")
        .setK(iter)
        .setInitMode(initMode)
        .setMaxIter(maxIter)
        .setTol(distanceThreshold)
        .setSeed(new Random().nextLong());
    KMeansModel okvhgrwqrk = pjtnfmyssi.fit(oiosnsutcx);
    // Evaluate clustering.
    Double ujyyyccvqv = okvhgrwqrk.computeCost(oiosnsutcx);
    degklrfhyb.put(iter, ujyyyccvqv);
    mnqoxuuyhr.put(iter, okvhgrwqrk);
    System.out.println("*******Sum of Squared Errors = " + ujyyyccvqv);
  }
  Map<Integer, Double> vqtkeniici = degklrfhyb.entrySet()
      .stream()
      .sorted(comparingByValue())
      .collect(toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e2,
          LinkedHashMap::new));
  Integer hwnnwdbhqd = vqtkeniici.entrySet().stream().findFirst().get().getKey();
  KMeansModel okvhgrwqrk = mnqoxuuyhr.get(hwnnwdbhqd);
  System.out.println("*******Optimum K = " + hwnnwdbhqd);
  System.out.println("*******Error with Optimum K = " + degklrfhyb.get(hwnnwdbhqd));
  return okvhgrwqrk;
}
The third use case demonstrates the application of Spark Collaborative filtering
APIs. The use case checks the code generation of loading data from text files and
creating a custom "Rating" class from the input text file. The aim is to ensure that the generated application loads the input text file using the "Rating" class and trains and evaluates the recommender model according to the end-user preferences. For this use case, we have used data published by Audioscrobbler, the first music recommendation system for last.fm. The input data includes three text files: user_artist_data.txt, artist_data.txt and artist_alias.txt. The primary data file is user_artist_data.txt, which contains the user ID, artist ID and play count. The artist_data.txt file includes the name of each artist mapped to an artist ID. The artist_alias.txt file maps artist IDs that are known misspellings or variants to the canonical artist ID. The data
contains implicit feedback data; it does not contain any direct rating or feedback
data from users.
Collaborative filtering (CF) is a model-based recommendation algorithm. The
CF algorithm finds the hidden factors or latent factors about the user’s preferences
from the user's history data. The current dataset fits a collaborative filtering application well, as we do not have any other information about the users or artists. Such data are sparse. The missing entries in the user-artist association matrix are learnt using the alternating least squares (ALS) algorithm. At the moment, the Spark ML library supports only model-based collaborative filtering. We have used the Spark ML ALS estimator to train the recommendation system. Spark ML collaborative filtering requires the user to develop a Rating Java class for parsing
the main input data file while loading the raw data into a DataFrame. The Rating
class is responsible for casting the raw data to respective types, and in our case,
we also need the implementation to map the misspelt artist IDs to correct IDs.
The spark.ml package also provides an API to set the cold start strategy for NaN (Not a Number) entries in the user data to mitigate the cold start problem. It is quite normal to have missing entries in such data; for example, a user may never have rated a song, so the model cannot learn about that user in the training phase. The dataset used for the evaluation may also have entries for users that are missing in the training dataset. These problems are referred to as the cold start problem in recommender system design.
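A minimal sketch of an ALS estimator configured along these lines is shown below; the column names follow the Rating class of Listing 4, and the use of setColdStartStrategy("drop") is one possible way to handle the NaN entries described above:

import org.apache.spark.ml.recommendation.ALS;

public class AlsSketch {
  public static ALS configure(int rank, double regParam, double alpha) {
    return new ALS()
        .setUserCol("userId")
        .setItemCol("artistId")
        .setRatingCol("playcount")
        .setImplicitPrefs(true)          // preferences are inferred from play counts, not explicit ratings
        .setRank(rank)
        .setRegParam(regParam)
        .setAlpha(alpha)
        .setColdStartStrategy("drop");   // drop NaN predictions for users/items unseen during training
  }
}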
Fig. 9: Graphical flow to create a collaborative filtering model for music recommendation. Part 1 shows the composed flow, part 2A the configuration of feature extraction for collaborative filtering, part 3A the model parameters and part 5 the running output of the collaborative filtering Spark ML application.
control to the model designing. Listing 4 shows the Rating class auto-generated by the back-end system. Figure 9 depicts the assembling of the graphical components required for CF application generation. The CF Spark program generation required the FeatureExtractionFromTextFile component.
The user can create an application with auto-hyperparameter tuning through
the graphical interface shown in part 3A of Figure 9. All the available Spark ALS
API parameters settings have been considered to design the collaborative filtering
module in our system. In the given dataset, the preference of the user is inferred from the data. Hence, the implicit preference box should be checked or, in the case of sending direct JSON data, the field should contain the value true. Part
3A of Figure 9 depicts how to set the model parameters for CF application. The
model parameters are crucial in making a sound recommender system. The user
interface allows setting different parameters for training the model.
@Getter
@Setter
public static class Rating implements Serializable {
  private Integer userId;
  private Integer artistId;
  private Integer playcount;

  public Rating(Integer artistId, Integer playcount, Integer userId) {
    this.artistId = artistId;
    this.playcount = playcount;
    this.userId = userId;
  }

  public static Rating parseRating(String str) {
    String[] fields = str.split(" ");
    if (fields.length != 3) {
      throw new IllegalArgumentException("Each line must contain 3 fields");
    }
    Integer artistId = Integer.parseInt(fields[0]);
    Integer playcount = Integer.parseInt(fields[1]);
    Integer userId = Integer.parseInt(fields[2]);
    // Map misspelt artist IDs to the canonical artist ID.
    Integer finalArtistData = CollaborativeFiltering.artistAliasMap.getOrDefault(artistId, artistId);
    return new Rating(finalArtistData, playcount, userId);
  }
}
The code generation for training the model using ALS is done depending on the input parameters from the end-user. Table 2 shows the available options for the end-user, and Listing 5 shows how the parameters are set for the ALS algorithm in Spark internally.
// rest of the code
for (Integer rank : ranks) {
  for (Double regParam : regParams) {
    for (Double alpha : alphas) {
      // ...
Spark 2.4.4 has four variants of metric for ALS model evaluation, namely rmse (Root Mean Square Error), mse (Mean Square Error), mae (Mean Absolute Error) and r2 (coefficient of determination).
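For illustration, such a metric can be applied to the predictions of the trained model with Spark's RegressionEvaluator; the label and prediction column names below are assumptions matching the Rating class used in this use case:

import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class AlsEvaluationSketch {
  // Scores the model's predictions on held-out data with one of the metrics named above.
  public static double score(Dataset<Row> predictions, String metric) {
    RegressionEvaluator evaluator = new RegressionEvaluator()
        .setMetricName(metric)          // "rmse", "mse", "mae" or "r2"
        .setLabelCol("playcount")
        .setPredictionCol("prediction");
    return evaluator.evaluate(predictions);
  }
}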
8 Discussion
2. Target Frameworks: The second criterion is the target frameworks used by the
graphical tools that provide a high-level abstraction. The Deep learning studio
offers an interface to implement only deep learning models. Both the Microsoft Azure and Rapidminer visual platforms support end-to-end automated ML application generation, hiding the underlying technology used from the user. However, the Microsoft Azure HDInsight service allows PySpark code snippets to be used through the Jupyter notebook interface of the tool to run the code on a Spark cluster. Lemonade and StreamSets support high-level
abstraction of Spark ML to build ML models. Lemonade also uses the Keras
platform to generate models for deep learning applications. On the contrary,
the Streamanalytix tool uses multiple frameworks to create a ML model, such
as Spark MLlib, Spark ML, PMML, TensorFlow, and H2O. In our solution, we
support Spark ML as our target framework.
3. Code generation: The code generation capability is another feature to compare: whether or not the tool can create a native program in the target framework from the graphical flow created by the end-user. StreamSets and Microsoft Azure require a customised code snippet from the end-user to train the ML model. They do not generate any code, and the models are loaded directly in their environment. Lemonade and Streamanalytix generate native Apache Spark programs, while Deep learning studio generates Keras code from the graphical flow created by the user. The user can edit the generated code and
re-run the application. Our conceptual approach also generates Java source
code for the Apache Spark program from the graphical flow created by the
end-user.
4. Code snippet input from user: It is desirable to have a graphical tool that supports both experts and non-programmers in creating ML applications without having to understand the underlying technology. The fourth criterion is therefore whether the tool requires code snippets from the user for ML application creation. Mainly, a code snippet is required for some part of the application generation or for the customisation of the program. For example, the StreamSets tool provides an extension to add ML features by writing customised code in Scala or Python to the pipeline for generating the program. Tools like Rapid-
miner, Lemonade, Deep learning studio, and Streamanalytix do not require any
input code snippet from the user to create the ML application program. While
Microsoft Azure auto ML feature does not require any code-snippet from the
user, it explicitly asks for a code snippet to create models to run in the Spark
environment. The conceptual approach described in this manuscript does not
require the user to write any code for Spark ML application generation.
5. Data pre-processing: As we already know, the performance of the ML model
hugely depends on the quality of the input data. The collected data is usually
not structured, requiring a bit of processing before applying the ML algorithms
to the data. The manual preprocessing of this data is time-consuming and prone
to errors. Having a visual pre-processing feature in the ML tool saves much time for the users. All the tools except Deep learning studio and Lemonade have a data pre-processing step that supports data cleansing through the graphical interface. Our conceptual approach does not include a dedicated data pre-processing step, although it does support essential cleansing and feature extraction from the input data.
6. Ensemble Learning: The sixth criterion is to compare whether the tools pro-
vide the ensemble learning method. Ensemble learning is a machine learning technique that combines several base models to produce one optimal predictive model. The ultimate goal of this technique is to find the optimum model that best predicts the desired outcome. Tools like Streamanalytix, Rapidminer and Microsoft Azure auto ML support the ensemble learning method.
This is a limitation in our current approach as it cannot automatically combine
several base models to solve a specific use case.
Table 3 summarises the comparison of our conceptual approach with the ex-
isting solutions for the criteria discussed above.
the basic skeleton of the Spark application. A library called ‘SparFlo’ [23] was de-
veloped, which contained generic method implementation of various Spark APIs.
Code weaving was used to invoke these generic method implementations inside the basic skeleton of the target Spark program. This ensured that the SparFlo library, when supported by any graphical programming tool, would easily enable Spark programming. Nevertheless, any change or update in the Spark libraries would require the release of a new version of the SparFlo library containing the latest generic method implementations of the Spark APIs. Hence, in this work, we have relied only on the API-based code generation technique, which eliminates the need in our conceptual approach to develop a pre-packaged implementation of all Spark APIs. It also decouples our approach from a specific Spark version, as we can now independently parse a Spark version to generate the relevant target source code.
9 Conclusion
The field of data science is challenging in many ways. First, the datasets are usu-
ally messy, and most of the time, the data scientists go into pre-processing the
data and selecting features from the data. Second, ML algorithms apply multiple
iterations to the dataset to train the model. The process of preparing a model is
computationally expensive and time-consuming. Third, real-world applications of the generated model to new data, such as fraud detection, require that models become part of the production service in real time. The data scientists
engage most of their time in understanding and analysing the data. They would
want to try different ML applications and tweak the models to achieve the sought
accuracy in data analytics. The modelling of such ML applications adds another
level of difficulty. There should be a way to reuse the models while experiment-
ing with the existing designed systems. The Apache Spark framework combines distributed computing on clusters with a library for writing ML applications on top of it. Nevertheless, writing good Spark programs requires the user to understand
the Spark session, data abstractions, and transformations. Moreover, writing inde-
pendent code for each ML application adds code redundancy. This paper enables
end-users to create Spark ML applications by circumventing the tedious task of
learning the Spark programming framework through a flow-based programming
paradigm. Our main contributions include modelling the Java APIs of Spark ML (a popular ML library of Apache Spark) operating on DataFrames as composable components, and developing a conceptual approach to parse a ML flow created by connecting several such components. The conceptual approach has
been validated by designing three ML use-cases involving prediction using decision
trees, anomaly detection with K-means clustering, and collaborative filtering tech-
niques to develop a music recommender application. The use-cases demonstrate
how easily ML flows can be created graphically by connecting different compo-
nents at a higher level of abstraction, parameters to various components can be
configured with ease, automatic parsing of the user flow to give feedback to the
user if a component has been used in a wrong position in a flow and finally au-
tomatic generation of ML application without the end-user having to write any
code. In addition to this, our work lays the foundation for several future works.
This includes data visualisation techniques to make it more promising for the end-
users to work on ML problems. Automatic deployment of the ML model with one
26 Mahapatra and Banoo
click after the training phase would be another extension of our work. Design and
implementation of a flow validation mechanism based on input/output of the flow
component would make the system more flexible for future changes and generic
for all kinds of flow design.
List of abbreviations
Declarations
Not applicable
Not applicable
Competing interests
Funding
This work was supported by the German Research Foundation (DFG) and the
Technical University of Munich within the Open Access Publishing Funding Pro-
gramme.
Authors’ contributions
TM designed the conceptual approach, selected the use cases and wrote the entire
manuscript. SB implemented the conceptual approach and evaluated the work
against the use cases.
Acknowledgements
The authors are grateful to all their colleagues who actively provided insights,
reviews and comments leading to the materialisation of this work.
Authors’ information
SB holds a bachelor's degree in computer science from KIIT, India. She recently completed her master's in computer science at the Technical University of Munich (TUM). In her master's thesis, she worked extensively on building tools to prototype Machine Learning applications. TM is an Assistant Professor at the Department of Computer Science and Information Systems at the Birla Institute of Technology and Science, Pilani. He holds a bachelor's degree in Computer Engineering from NMiMS Mumbai, an M.Sc. degree in Software Systems Engineering from RWTH Aachen and a PhD degree in Computer Science from the Technical University of Munich. His research interest lies in supporting Big Data Analytics and ML via the graphical flow-based programming paradigm to foster widespread adoption of data science.
References
9. Culjak, I., Abram, D., Pribanic, T., Dzapo, H., Cifrek, M.: A brief introduction to opencv.
In: 2012 Proceedings of the 35th International Convention MIPRO, pp. 1725–1730 (2012)
10. Daniel, F., Matera, M.: Mashups: Concepts, Models and Architectures. Springer Berlin
Heidelberg, Berlin, Heidelberg (2014)
11. Demšar, J., Curk, T., Erjavec, A., Črt Gorup, Hočevar, T., Milutinovič, M., Možina,
M., Polajnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar,
J., Žitnik, M., Zupan, B.: Orange: Data mining toolbox in python. Journal of Machine
Learning Research 14, 2349–2353 (2013). URL http://jmlr.org/papers/v14/demsar13a.html
12. Escott, K.R., Noble, J.: Design patterns for angular hotdraw. In: Proceedings of the 24th
European Conference on Pattern Languages of Programs, EuroPLop ’19. Association for
Computing Machinery, New York, NY, USA (2019). DOI 10.1145/3361149.3361185. URL
https://doi.org/10.1145/3361149.3361185
13. Fowler, M.: Patterns of Enterprise Application Architecture. Addison-Wesley Longman
Publishing Co., Inc., USA (2002)
14. Freeman, A.: Pro Angular 6, 3rd edn. Apress, USA (2018)
15. Hajian, M.: Progressive Web Apps with Angular: Create Responsive, Fast and Reliable
PWAs Using Angular, 1st edn. APress (2019)
16. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka
data mining software: An update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009). DOI
10.1145/1656274.1656278. URL https://doi.org/10.1145/1656274.1656278
17. Jannach, D., Jugovac, M., Lerche, L.: Supporting the design of machine learning workflows
with a recommendation system. ACM Trans. Interact. Intell. Syst. 6(1) (2016). DOI
10.1145/2852082. URL https://doi.org/10.1145/2852082
18. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–44 (2015). DOI
10.1038/nature14539
19. M. A. Overton: The idar graph. Commun. ACM 60(7), 40–45 (2017). DOI 10.1145/
3079970. URL https://doi.org/10.1145/3079970
20. Mahapatra, T.: High-level Graphical Programming for Big Data Applications. Disserta-
tion, Technische Universität München, München (2019). URL http://mediatum.ub.tum.de/?id=1524977
21. Mahapatra, T., Gerostathopoulos, I., Prehofer, C., Gore, S.G.: Graphical Spark Program-
ming in IoT Mashup Tools. In: 2018 Fifth International Conference on Internet of Things:
Systems, Management and Security, pp. 163–170 (2018). DOI 10.1109/IoTSMS.2018.
8554665. URL https://doi.org/10.1109/IoTSMS.2018.8554665
22. Mahapatra, T., Prehofer, C.: aFlux: Graphical flow-based data analytics. Software Impacts
2, 100007 (2019). DOI 10.1016/j.simpa.2019.100007. URL https://doi.org/10.1016/j.simpa.2019.100007
23. Mahapatra, T., Prehofer, C.: Graphical Flow-based Spark Programming. Journal of Big
Data 7(1), 4 (2020). DOI 10.1186/s40537-019-0273-5. URL https://doi.org/10.1186/s40537-019-0273-5
24. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J.,
Tsai, D., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M.,
Talwalkar, A.: Mllib: Machine learning in apache spark. J. Mach. Learn. Res. 17(1),
1235–1241 (2016)
25. mljar: Machine Learning for Humans! Automated machine learning platform. https://mljar.com (2018). [Online; accessed 18-May-2020]
26. Morrison, J.P.: Flow-Based Programming, 2nd Edition: A New Approach to Application
Development. CreateSpace, Paramount, CA (2010)
27. Nguyen, G., Dlugolinsky, S., Bobák, M., Tran, V., López Garcı́a, Á., Heredia, I., Malı́k,
P., Hluchý, L.: Machine learning and deep learning frameworks and libraries for large-
scale data mining: a survey. Artificial Intelligence Review 52(1), 77–124 (2019). DOI
10.1007/s10462-018-09679-z. URL https://doi.org/10.1007/s10462-018-09679-z
28. Overton, M.A.: The idar graph. Queue 15(2), 29–48 (2017). DOI 10.1145/3084693.
3089807. URL https://doi.org/10.1145/3084693.3089807
29. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen,
T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De-
Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai,
J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning li-
brary. In: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,
R. Garnett (eds.) Advances in Neural Information Processing Systems 32, pp.