Mathematical Programming For Piecewise Linear Regression Analysis

Abstract
…demonstrate the efficiency of our proposed method. It is shown that our proposed piece-wise regression method can be solved to global optimality for datasets of thousands of samples, and that it consistently achieves higher prediction accuracy than a number of state-of-the-art regression methods. Another advantage of the proposed method is that the learned model can be conveniently expressed as a small number of if-then rules that are easily interpretable. Overall, this work proposes an efficient rule-based multivariate regression method based on piece-wise functions that achieves better prediction performance than state-of-the-art approaches. This novel method can benefit expert systems in various applications by automatically acquiring knowledge from databases to improve the quality of the knowledge base.
Keywords: regression analysis, surrogate model, piecewise linear function,
mathematical programming, optimisation
1. Introduction
…usually via some other intermediate variables, is known, yet is too complex and expensive to be evaluated comprehensively in feasible computational time. In this case, regression analysis is capable of approximating the overall system behaviour with much simpler functions that preserve a desired level of accuracy and can be evaluated much more cheaply (Caballero & Grossmann, 2008; Henao & Maravelias, 2011, 2010; Viana et al., 2014; Beck et al., 2012).
Over the past years, regression analysis has been established as a powerful tool in a wide range of applications, including: customer demand forecasting (Levis & Papageorgiou, 2005; Kone & Karwan, 2011), investigation of CO2 capture processes (Zhang & Sahinidis, 2013; Nuchitprasittichai & Cremaschi, 2013), optimisation of moving bed chromatography (Li et al., 2014b), forecasting of CO2 emissions (Pan et al., 2014), prediction of acidity constants of aromatic acids (Ghasemi et al., 2007), prediction of induction of apoptosis by different chemical components (Afantitis et al., 2006) and estimation of thermodynamic properties of ionic liquids (Chen et al., 2014; Wu et al., 2014).
Linear regression
Linear regression is one of the most classic types of regression analysis, which predicts the output variable as a linear combination of the input variables. The regression coefficients of the input variables are usually estimated using least squared error or least absolute error approaches, and the problems can be formulated as quadratic programming or linear programming problems, respectively, which can be solved efficiently. In cases where the estimated linear relationship fails to adequately describe the data, a variant of linear regression analysis, called polynomial regression, can be adopted to accommodate non-linearity (Khuri & Mukhopadhyay, 2010). In polynomial regression, higher degree polynomials of the original independent input variables are added as new input variables to the regression function, before the coefficients of the aggregated regression function are estimated. Second-degree polynomial functions have been used most frequently in the literature due to their robust performance and computational efficiency (Khayet et al., 2008; Minjares-Fuentes et al., 2014).
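As a concrete illustration of the linear programming route mentioned above, the sketch below casts least-absolute-error linear regression as an LP with SciPy. This is an illustration added for this text, not code from the paper; variable layout and solver choice are assumptions.

```python
# Least-absolute-error linear regression as a linear program.
# Decision variables are [w_1..w_M, b, d_1..d_S]; we minimise sum(d_s)
# subject to d_s >= y_s - (w.x_s + b) and d_s >= (w.x_s + b) - y_s.
import numpy as np
from scipy.optimize import linprog

def lad_regression(X, y):
    S, M = X.shape
    c = np.concatenate([np.zeros(M + 1), np.ones(S)])      # minimise sum of d_s

    # Inequalities in the form A_ub @ z <= b_ub
    A_pos = np.hstack([-X, -np.ones((S, 1)), -np.eye(S)])  # y - (Xw+b) <= d
    A_neg = np.hstack([X, np.ones((S, 1)), -np.eye(S)])    # (Xw+b) - y <= d
    A_ub = np.vstack([A_pos, A_neg])
    b_ub = np.concatenate([-y, y])

    bounds = [(None, None)] * (M + 1) + [(0, None)] * S    # errors non-negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w, b = res.x[:M], res.x[M]
    return w, b
```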
SVR
Support vector machine is a well-established statistical learning algorithm, which fits a hyperplane to the data in hand (Smola & Schölkopf, 2004). SVR minimises two terms in the objective function, one of which is the ε-insensitive loss function, i.e. only sample training errors greater than a user-specified threshold, ε, are considered in the loss function. The other term is the model complexity, expressed as the sum of squared regression coefficients. Controlling model complexity usually ensures model generalisation, i.e. high prediction accuracy on testing samples. Another user-specified trade-off parameter balances the significance of the two terms (Chang & Lin, 2011; Bermolen & Rossi, 2009). One of the most important features that contribute to the competitiveness of SVR is the kernel trick, which maps the dataset from the original space to a higher-dimensional inner product space, where a linear regression is equivalent to a non-linear regression function in the original space (Li et al., 2000). A number of kernel functions can be employed, e.g. polynomial functions, radial basis functions and Fourier series (Levis & Papageorgiou, 2005). Formulated as a convex quadratic programming problem, SVR can be solved to global optimality. Despite the simplicity and optimality of SVR, tuning its two parameters, i.e. the training error tolerance ε and the trade-off parameter balancing model complexity and accuracy, and selecting a suitable kernel still considerably affect its prediction accuracy (Lu et al., 2009; Cherkassky & Ma, 2004).
Kriging
Kriging is a spatial interpolation-based regression analysis methodology (Kleijnen & Beers, 2004). Given a query sample, kriging estimates its output as a weighted sum of the outputs of the known nearby samples. The weights of the samples are computed solely from the data by considering sample closeness and redundancy, instead of being given by an arbitrary decreasing function of distance (Kleijnen, 2009). The interpolation nature of kriging means that the derived interpolant passes through the given training data points, i.e. the error between predicted output and real output is zero for all training samples. Different variants of kriging have been developed in the literature, including the most popular ordinary kriging (Lloyd & Atkinson, 2002; Zhu & Lin, 2010) and universal kriging (Brus & Heuvelink, 2007; Sampson et al., 2013).
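An illustrative sketch, assuming scikit-learn: Gaussian process regression with a constant mean is mathematically equivalent to simple/ordinary kriging, so it can serve as a stand-in for the kriging variants cited above.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

kernel = ConstantKernel() * RBF(length_scale=1.0)
krig = GaussianProcessRegressor(kernel=kernel, alpha=1e-10)  # alpha~0: exact interpolation
# krig.fit(X_train, y_train) makes predictions pass (almost) exactly
# through the training points, matching the interpolation property above.
```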
MARS
MARS (Friedman, 1991) is another type of regression analysis that accommodates non-linearity and interactions between independent input variables in its functional relationship. Non-linearity is introduced into MARS in the form of so-called hinge functions, which are expressions with max operators of the form max(0, X − const). If the independent variable X is greater than the constant const, the hinge function equals X − const; otherwise it equals 0. The hinge functions create knots in the prediction surface of MARS. The functional form of MARS is a weighted sum of a constant, hinge functions and products of multiple hinge functions, which makes it suitable for modelling a wide range of non-linearities (Andrés et al., 2011).

The building of MARS usually consists of two steps: a forward addition step and a backward deletion step. In the forward addition step, MARS starts from a single intercept term/constant and iteratively adds the pair of hinge functions (i.e. max(0, X − const) and max(0, const − X)) that leads to the largest reduction in training error. Afterwards, a backward deletion step, which removes one by one those hinge functions contributing insignificantly to the model accuracy, is employed to improve the generalisation of the final model (Leathwick et al., 2006; Balshi et al., 2009). The presence of hinge functions also makes MARS a piece-wise regression method.
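A small sketch (illustrative, not the paper's implementation) of a MARS-style basis: a mirrored pair of hinge functions with a knot at `const`, plus a product term modelling an interaction; the weights `w0..w3` are assumed given.

```python
import numpy as np

def hinge(x, const):
    return np.maximum(0.0, x - const)        # max(0, X - const)

def mirrored_pair(x, const):
    return hinge(x, const), np.maximum(0.0, const - x)

# A MARS prediction is a weighted sum of such terms, e.g.:
# y_hat = w0 + w1*hinge(x1, 3.0) + w2*np.maximum(0, 3.0 - x1) \
#            + w3*hinge(x1, 3.0)*hinge(x2, 1.5)   # interaction term
```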
MLP
Multilayer perceptron is a feedforward artificial neural network whose structure is inspired by the organisation of biological neural networks (Hill et al., 1994). An MLP typically consists of an input layer of measurable features and an output layer of response variables, sandwiching multiple intermediate layers of neurons. The network is fully interconnected in the sense that the neurons in each layer are connected to all the neurons in the two neighbouring layers (Comrie, 1997; Gevrey et al., 2003). Each neuron in the intermediate layers takes a weighted linear combination of the outputs of all neurons in the previous layer as input and applies a non-linear transformation function before supplying its output to all neurons of the next layer. The use of non-linear transformation functions, including sigmoid, hyperbolic tangent and logarithmic functions, makes MLP suitable for modelling highly non-linear relationships (Gevrey et al., 2003; Rafiq et al., 2001).
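A hedged sketch, assuming scikit-learn, of the MLP described above: two intermediate layers with hyperbolic tangent activations (the layer sizes are illustrative choices, not the paper's architecture).

```python
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(hidden_layer_sizes=(32, 16),  # two intermediate layers
                   activation="tanh",            # non-linear transformation
                   max_iter=2000, random_state=0)
# Each run of a stochastic optimiser may reach a different local optimum,
# which is the non-determinism contrasted with OPLRA later in the paper.
```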
Random forest
Before introducing random forest we first describe the regression tree, which is a decision tree-based prediction model. Starting from the entire set of samples, a regression tree selects one independent input variable and performs a binary split into two child sets, under the condition that the two child nodes give increased purity of the data compared with their single parent node. Purity is often defined as the deviation from predicting with the mean value of the output variable. The process of binary splitting is recursively applied to each child node until a terminating criterion is satisfied. The nodes that are not further partitioned are called leaves. After growing a large tree, a pruning process is employed to remove the leaves contributing insignificantly to the purity improvement (Breiman et al., 1984; Loh, 2011). In order to improve model fit, a linear regression model can be fitted for each leaf (Quinlan, 1992).
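A short sketch, assuming scikit-learn: a random forest averages many regression trees, each grown as described above on a bootstrap sample with a random subset of features considered at each split (parameter values are illustrative).

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500,     # number of trees
                           max_features="sqrt",  # features tried per split
                           random_state=0)
# rf.fit(X_train, y_train); the prediction averages the trees' leaf values.
```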
KNN
KNN belongs to the category of lazy learning algorithms, as prediction is based directly on the stored instances without an explicit training phase of constructing a model, making it one of the simplest regression methods in the literature (Korhonen & Kangas, 1997). Given an enquiry sample, KNN first identifies the K closest instances in the training sample set, where the exact value of K is given a priori. The closeness of samples can be measured by different distance metrics, for example Euclidean and Manhattan distances (Scheuber, 2010; Eronen & Klapuri, 2010). The prediction is then taken as the weighted mean of the outputs of the K nearest neighbours, with the weights often defined as the inverse of distance (Papadopoulos et al., 2011). Despite its simplicity, KNN usually provides competitive prediction performance against much more sophisticated algorithms.
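A compact sketch (illustrative) of KNN regression with inverse-distance weighting, matching the description above: find the K nearest training samples and average their outputs, weighted by 1/distance.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, K=5):
    d = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    idx = np.argsort(d)[:K]                         # K nearest neighbours
    w = 1.0 / (d[idx] + 1e-12)                      # inverse-distance weights
    return np.sum(w * y_train[idx]) / np.sum(w)     # weighted mean output
```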
Previous work on piecewise regression
Piecewise functions have also been frequently studied in the literature. In (Toms & Lesperance, 2003), univariate piece-wise linear functions are used to fit ecological data and identify break-points that represent critical threshold values of a phenomenon. In (Strikholm, 2006), a method based on statistical testing is proposed to estimate the number of break-points of a univariate piece-wise linear function. Malash & El-Khaiary (2010) also apply piece-wise linear regression techniques to univariate experimental adsorption data, where the piece-wise function is determined by solving a non-linear programming model. SegReg (www.waterlog.info/segreg.htm) is free software that permits the estimation of piece-wise regression functions with up to two independent variables. For one independent variable, SegReg considers a series of candidate break-points and, for each one, fits a linear regression on either side of the break-point. The break-point corresponding to the largest statistical confidence is taken as the final solution. In the case of two independent variables, SegReg first determines the two-region piece-wise regression function between the dependent variable and the most significant input variable, before computing the relation between its residual/deviation and the second input variable.
Both Magnani & Boyd (2009) and Toriello & Vielma (2012) have published work on data fitting with a special family of piece-wise regression functions, called max-affine functions. A max-affine function is defined as the maximum of a series of linear functions, i.e. a sample is projected onto all linear functions, and the maximum projected value is taken as the final predicted value of the piece-wise function. The use of max-affine functions limits the fitted surface to be convex. In (Magnani & Boyd, 2009) a heuristic method is used to ease the difficulty of directly solving the highly non-linear max-affine fitting problem, while in (Toriello & Vielma, 2012) big-M constraints are used to reformulate the problem into a non-convex mixed integer non-linear programming model. However, computational complexity limits their applications to examples of small scale.
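A one-line sketch (illustrative) of max-affine prediction: project the sample onto every linear function and keep the maximum, which forces the fitted surface to be convex, as noted above. `W` is a (K, M) matrix of slopes and `b` a length-K vector of intercepts, assumed already fitted.

```python
import numpy as np

def max_affine_predict(W, b, x):
    return np.max(W @ x + b)   # maximum over the K linear functions
```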
More recently, Greene et al. (2015) apply piece-wise regression analysis to predict patients' post-treatment quality of life from the pre-treatment quality of life measure, identifying the segments where therapy benefits vary significantly. The analysis is performed using Segmented (Muggeo, 2008), a package written in R (R Development Core Team, 2008). Segmented formulates the problem using a non-linear model and requires the user to specify the segmented input variables, the number of break-points and an initial guess of each break-point. Starting from those user-supplied initial positions, Segmented iteratively moves around the neighbourhood of the initial guesses to search for break-points of better quality using local linearisation. However, it is difficult if not impossible to reasonably guess good starting points for real world multivariate problems with large numbers of samples and input variables, where visual examination cannot be performed. This makes it hard to identify quality solutions. Furthermore, Segmented only allows the partitioned input variables to have different regression coefficients across different segments, while the other input variables keep the same coefficients over their entire ranges, significantly restricting its flexibility.
In both (Xue et al., 2013) and (Li et al., 2014a), piece-wise regression functions were employed to detect vegetation changes; piece-wise linear regression was tackled using fuzzy logic to identify changes in patterns of vegetation greenness. Cavanaugh et al. (2014) employ piece-wise regression and find that the change in mangrove area over the last 20 years is a piece-wise function of latitude, with regions above and below a specific threshold latitude following two different patterns of mangrove growth. Moreover, Matthews et al. (2014) use 2-segment piece-wise functions to describe the relationship between species richness and fragment area of islands, with the critical break-point determined by simply sampling a number of candidate values and selecting the one giving the best model fit. Unfortunately, the above methods are all limited to modelling rather simple relationships between one output variable and one input variable, seriously limiting their usage in more complex problems.
The proposed piece-wise regression method can help construct expert systems in various application domains. Expert systems are computer programs designed to make decisions analogous to human experts. As an expert system is typically made up of an inference engine and a knowledge base, the quality and quantity of the information in the knowledge base directly affects the usefulness of the constructed expert system. Our proposed piece-wise regression method can help build expert systems more efficiently via automatic and efficient acquisition of knowledge. More specifically, the proposed piece-wise regression method can extract latent knowledge from large collections of databases curated by domain experts. The discovered knowledge is represented in the form of identified relationships between input and output variables of interest, which can be combined with expert knowledge to form the final expert system (Alonso et al., 2012). For example, the proposed piece-wise regression method can be used for building prognostic expert systems in medical applications. When presented with historical data of patients' clinical variables and survival length, piece-wise regression can induce domain knowledge by approximating the complex relationship between clinical variables and survival length. The induced knowledge can then be used to perform prognosis for current patients, imitating the end-behaviour of human experts, i.e. medical doctors.
Overall, the key contributions of our work are listed below:

• Given that neither the feature to be segmented nor the number of segments is typically known a priori, a heuristic solution procedure is also introduced that automatically identifies the key partition variable and the final number of segments.
• Our proposed regression method has the advantage of being easily understandable and interpretable, as the learned model can be conveniently represented as a small set of rules.
2. Method
A novel piecewise linear regression method is proposed in this work. The core idea of the proposed method is to identify a single input feature and separate the samples into complementary regions along this feature. A different linear regression function is fitted locally for each region. The sample partition and the calculation of the local regression coefficients are performed simultaneously within the proposed optimisation model so as to achieve the least absolute error.
The indices, parameters and variables associated with the proposed model are listed below:
Indices
s    sample, s = 1, 2, ..., S
m    feature/independent input variable, m = 1, 2, ..., M
r    region, r = 1, 2, ..., R
m*   the feature where sample partition takes place

Parameters
A_{sm}    numeric value of sample s on feature m
Y_s       output value of sample s
U', U''   arbitrarily large positive numbers

Continuous variables
W_m^r      regression coefficient for feature m in region r
B^r        intercept of regression function in region r
Pred_s^r   predicted output for sample s in region r
X_{m*}^r   break-point r on partition feature m*
D_s        training error between predicted output and real output for sample s

Binary variables
F_s^r      1 if sample s falls into region r; 0 otherwise
Assume first that both the partition feature m* and the number of regions R are given; the R−1 break-points are arranged in increasing order:

$$X_{m^*}^{r-1} \le X_{m^*}^{r} \qquad \forall\, r = 2, 3, ..., R \quad (1)$$
Binary variable $F_s^r$ indicates whether sample s falls into region r or not. Modelling of which sample belongs to which region is achieved with the following constraints:

$$X_{m^*}^{r-1} - U'(1 - F_s^r) \le A_{s,m^*} \qquad \forall\, s,\; r = 2, 3, ..., R \quad (2)$$

$$A_{s,m^*} \le X_{m^*}^{r} + U'(1 - F_s^r) \qquad \forall\, s,\; r = 1, 2, ..., R-1 \quad (3)$$

When sample s belongs to region r (i.e. $F_s^r = 1$), $A_{s,m^*}$ falls into the interval bounded by the two consecutive break-points $X_{m^*}^{r-1}$ and $X_{m^*}^{r}$ on feature m*; otherwise the two sets of constraints become redundant. A visualisation of break-points and regions is provided in Figure 1.
The following constraints restrict each sample to belong to one and only one region:

$$\sum_{r} F_s^r = 1 \qquad \forall\, s \quad (4)$$
For sample s, its predicted output value for region r, $Pred_s^r$, is given by:

$$Pred_s^r = \sum_{m} A_{sm} W_m^r + B^r \qquad \forall\, s, r \quad (5)$$
For any sample s, its training error $D_s$ equals the absolute deviation between the real output and the predicted output of the region r to which it belongs (i.e. $F_s^r = 1$); the absolute value is linearised by the following pair of constraints:

$$D_s \ge Y_s - Pred_s^r - U''(1 - F_s^r) \qquad \forall\, s, r \quad (6)$$

$$D_s \ge Pred_s^r - Y_s - U''(1 - F_s^r) \qquad \forall\, s, r \quad (7)$$

The objective is to minimise the total absolute training error over all samples:

$$\min \sum_{s} D_s \quad (8)$$
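The formulation above can be prototyped directly in an algebraic modelling layer. Below is a hedged sketch of constraints (1)–(8) using the PuLP library; the paper does not specify an implementation, so the solver (PuLP's default CBC), the variable names and the concrete big-M value `U` are all assumptions made for illustration.

```python
import pulp

def oplra(A, Y, m_star, R, U=1e4):
    """Solve OPLRA for a fixed partition feature m_star and region count R."""
    S, M = len(A), len(A[0])
    prob = pulp.LpProblem("OPLRA", pulp.LpMinimize)
    W = pulp.LpVariable.dicts("W", (range(R), range(M)))      # coefficients
    B = pulp.LpVariable.dicts("B", range(R))                  # intercepts
    X = pulp.LpVariable.dicts("X", range(1, R))               # break-points 1..R-1
    D = pulp.LpVariable.dicts("D", range(S), lowBound=0)      # training errors
    F = pulp.LpVariable.dicts("F", (range(S), range(R)), cat="Binary")

    prob += pulp.lpSum(D[s] for s in range(S))                # objective (8)
    for r in range(2, R):
        prob += X[r - 1] <= X[r]                              # ordering (1)
    for s in range(S):
        prob += pulp.lpSum(F[s][r] for r in range(R)) == 1    # one region (4)
        for r in range(R):                                    # r = 0 is paper's region 1
            pred = pulp.lpSum(A[s][m] * W[r][m] for m in range(M)) + B[r]  # (5)
            prob += D[s] >= Y[s] - pred - U * (1 - F[s][r])   # error bound (6)
            prob += D[s] >= pred - Y[s] - U * (1 - F[s][r])   # error bound (7)
            if r >= 1:                                        # lower break-point (2)
                prob += X[r] - U * (1 - F[s][r]) <= A[s][m_star]
            if r <= R - 2:                                    # upper break-point (3)
                prob += A[s][m_star] <= X[r + 1] + U * (1 - F[s][r])
    prob.solve()
    return W, B, X, pulp.value(prob.objective)
```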
The final model, named Optimal Piece-wise Linear Regression Analysis (OPLRA) in this work, consists of a linear objective function and linear constraints, and the presence of both binary and continuous variables defines an MILP problem, which can be solved to global optimality by standard solution algorithms, for example branch and bound. A heuristic solution procedure is also employed in this work to identify the partition feature and the number of regions, as described in Figure 2 below.
The heuristic procedure starts by solving a linear regression on the entire set of data with least absolute deviation. Subsequently, each input feature in turn serves as partition feature m* and the OPLRA model is solved allowing two regions (i.e. R = 2). The feature corresponding to the minimum training error is kept, and if its error represents a percentage reduction of more than β from the global linear regression without data partition, the procedure continues; otherwise it is decided that two-region piecewise linear regression does not provide a desirable improvement upon classic linear regression, and the initially derived linear regression function without sample partition is used for prediction. The parameter β, taking a value between 0 and 1, quantifies the percentage reduction in training error that justifies adding one more region.
Figure 2: Heuristic procedure to identify the partition feature and the number of regions
If two-region piecewise regression is accepted, the corresponding partition feature is retained for further analysis while the number of regions is iteratively increased, until the β training error reduction criterion is no longer satisfied between iterations.
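A schematic rendering of the procedure in Figure 2 may help; in the sketch below, `fit_lad_linear`, `oplra_error` and `oplra_model` are hypothetical helpers (e.g. wrappers around the MILP sketch above) and are not part of the paper.

```python
def heuristic_procedure(A, Y, features, beta=0.03):
    # Global least-absolute-deviation linear regression (no partition).
    err_lin, lin_model = fit_lad_linear(A, Y)          # hypothetical helper
    # Try every feature as the partition feature with R = 2 regions.
    err2, m_star = min((oplra_error(A, Y, m, R=2), m) for m in features)
    if err2 > (1 - beta) * err_lin:                    # reduction below beta
        return lin_model                               # keep plain regression
    R, err = 2, err2
    while True:
        err_next = oplra_error(A, Y, m_star, R + 1)    # add one more region
        if err_next > (1 - beta) * err:                # improvement too small
            return oplra_model(A, Y, m_star, R)        # stop at current R
        R, err = R + 1, err_next
```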
The constructed piecewise linear regression functions are then used to predict the output values of new samples. A testing sample is first assigned to one of the regions, and the regression coefficients of that region are used to estimate its output value.
In order to better illustrate the training of the proposed regression method, a simulation model is taken from the literature. In brief, the illustrative example (Palmer & Realff, 2002) describes the operation of a continuous stirred tank reactor, where a chain reaction A → B → C takes place. An inlet stream containing both reactants A and B enters the reactor, and the desirable output is component B. There are 4 independent input variables to the simulation model: the temperature of the reactor (T), the volume of the reactor (V), and the concentrations of A and B in the inlet stream (C_A,in and C_B,in). The output to be predicted is the production rate of B (P). The process and associated variables are described in Figure 3.
Figure 3: Illustrative example of a continuous stirred tank reactor
With the Latin hypercube sampling technique (Helton & Davis, 2003) employed to specify a set of data points, we run the simulation model and collect 300 samples. The goal of the regression analysis is to approximate the functional relationship between the output variable P and the input variables T, V, C_A,in and C_B,in using piece-wise linear functions. The step-wise description of the training procedure is presented in Table 1 below.
Initially, a linear regression function is fitted to the entire dataset without feature segmentation, which gives an absolute deviation of 1677.78. The second iteration of the method solves 4 independent OPLRA models allowing 2 regions each, respectively specifying T, V, C_A,in and C_B,in as the partition feature. The two-region piece-wise linear functions constructed while partitioning on T yield a lower training error (i.e. 1030.63) than the other 3, and this is therefore taken as the solution of iteration 2. It represents a significant improvement (i.e. 38.57%) on the initial global linear regression function. From iteration 3, the partition feature is fixed as T while one more region is allocated at each subsequent iteration. Iterations 3 and 4 lower the training error to 876.66 and 807.12, respectively. The iterative procedure terminates when the β criterion is no longer satisfied, e.g. if β = 20%, the iterative procedure terminates at the third iteration and the final regression function has 2 regions; if β = 10%, the final regression function has 3 regions.
...
Overall, the key features of our proposed piecewise linear regression method are summarised here: 1) our method identifies one key partition feature and separates the samples into multiple complementary regions along it, 2) each region has the flexibility of being fitted by its own linear regression function, with all input features allowed to have different regression coefficients across different regions, 3) there is only one tuning parameter, β, and 4) compared with algorithms like kernel-based SVR and MLP, the constructed regression function is easy to understand, as it exhibits linear relationships within the different regions.
It is noted here that the obtained relationship between input and output variables, presented as rules in Table 1, can be used to build an expert system for the above operation. Given the chain reaction A → B → C in the stirred tank reactor (Palmer & Realff, 2002), domain experts perform experiments to create a database of samples for different levels of temperature, reactor volume and reactant concentrations. Our proposed piece-wise regression method is then applied to automatically extract the rules that predict the production rate from temperature, reactor volume and reactant concentrations. Such rules would be difficult to provide directly even for chemical engineering experts, due to the complex nature of the reaction. Since the extracted rules can calculate a production rate value for any arbitrary values of temperature, tank volume and reactant concentrations, regardless of whether they obey physical laws (they must be positive) or are valid for the reaction of interest, expert knowledge should be incorporated to further refine the rules. For example, expert knowledge can be used to constrain the applicable temperature range, outside which the liquid phase will vaporise to gas or freeze to solid, making it impossible for the reaction to proceed as normal. The final expert system will allow users to query the likely outcome, as a production rate or no reaction, of any combination of values of temperature, reactor volume and reactant concentrations.
In the next section, a number of real world regression problems are employed
to benchmark the predictive performance of our proposed model.
A total of 7 real world datasets have been downloaded from the UCI machine learning repository (http://archive.ics.uci.edu/ml/) (Bache & Lichman, 2013) to test the prediction performance of our proposed method. The first regression problem, Yacht Hydrodynamics, predicts the hydrodynamic performance of sailing yachts from 7 features describing the hull dimensions and velocity of the boat, for 308 samples. Energy Efficiency (Tsanas & Xifara, 2012) collects data corresponding to 768 building shapes, described by 8 features including wall area, roof area and so on. The aims are to establish the relationship between either the heating load or the cooling load requirement and the 8 parameters of the building. The third example, Concrete Strength (Yeh, 1998), looks into the relationship between the compressive strength of concrete and 8 input variables, including water concentration and age, with 1030 samples of different concretes. The Airfoil dataset concerns how different airfoil blade designs, wind speeds and angles of attack affect the sound pressure level. The last 2 case studies, Red Wine Quality and White Wine Quality (Cortez et al., 2009), aim to predict experts' preference of red and white wine taste from 11 physicochemical features of the wines. Almost 1600 red wine and 4900 white wine samples have been obtained for analysis.
Figure 4: Sensitivity analysis of β. The numbers above points in each plot correspond to the
average numbers of final regions.
Figure 4 describes how the mean absolute error changes with β. The numbers attached to the points in each plot are the average numbers of final regions, which always go up as β decreases. For the Yacht Hydrodynamics example, setting β = 0.20 results in just over 4 final regions. Decreasing the β value to 0.15 slightly increases the prediction error with a marginally higher number of regions. Further decreasing β to 0.10 leads to the lowest mean prediction error of 0.648 with an average of 5 regions, before excessively low values of β over-fit, yielding much increased prediction error on the unseen testing samples. For the Energy Efficiency Heating case study, when β = 0.10, 0.15 and 0.20 our proposed regression method constructs piece-wise regression functions with an average of 3 regions, yielding an MAE of 0.907. Smaller values of β lead to about 5 regions, which are shown to predict the testing samples with higher accuracy (MAE around 0.810). In the Energy Efficiency Cooling and Concrete Strength examples, a similar phenomenon can be observed: when β takes overly high values (i.e. 0.20, 0.15), the proposed method terminates prematurely with only 2 regions and relatively high MAE. More regions are allowed by lowering β, which gives higher prediction accuracies. On the Airfoil case study, the proposed method outputs global multiple linear regression functions without data partition when β = 0.20. As β decreases, more regions are permitted, which predict unseen samples with better accuracy. With regards to the Red Wine Quality dataset, the optimal prediction occurs when β = 0.03. On the last example, White Wine Quality, the 2-region piece-wise regression functions achieved with β = 0.01, 0.03, 0.05 outperform the global multiple linear regressions obtained for higher values of β.
It can be seen from Figure 4 that values of β between 0.01 and 0.05 generally lead to smaller prediction errors than higher values. For all datasets except Yacht Hydrodynamics, the prediction errors for β = 0.01, 0.03 and 0.05 are evidently smaller than those for β = 0.10, 0.15 and 0.20. Within the range between 0.01 and 0.05 there is no clear optimal value for β, as different values have different effects on the accuracy. We instead seek to identify the most robust value for β, which gives consistently desirable prediction accuracy across a wide range of problems. For each dataset, we normalise the MAE of each β according to the formula:

$$\frac{MAE_\beta - \min_\beta MAE_\beta}{\min_\beta MAE_\beta}$$

For example, in Yacht Hydrodynamics, the original MAE for β = 0.01 is normalised from 0.7131 to (0.7131 − 0.6481)/0.6481 = 10.0%, where 0.6481 is the lowest MAE, achieved when β = 0.10. The normalised MAE of each β represents its actual deviation from the lowest error, and is averaged over all examples to reflect its overall competitiveness.
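A two-line sketch of the normalisation above, reusing the two MAE values quoted for Yacht Hydrodynamics (the other β values are omitted rather than invented):

```python
mae = {0.01: 0.7131, 0.10: 0.6481}                      # values from the text
best = min(mae.values())
norm = {b: (v - best) / best for b, v in mae.items()}   # normalised MAE per beta
print(norm[0.01])  # ~0.100, i.e. the 10.0% deviation reported above
```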
After identifying a robust value (i.e. 0.03) for the only tuning parameter β in our proposed regression method, we now compare the accuracy of the proposed method against some popular regression algorithms on the same set of 7 examples. The results of the comparison are available in Table 2 below.
In Table 2, for each tested dataset, the lowest prediction error achieved among all implemented regression methods is marked in bold. On the Yacht Hydrodynamics problem, the proposed method provides an MAE of 0.706, which is lower than that of any other competing algorithm. ALAMO, MLP and MARS follow with MAEs of 0.787, 0.809 and 1.011, respectively. The mean error rates of the rest of the methods are between 3 and 8. On Energy Efficiency Heating, MARS emerges as the most accurate algorithm with a mean absolute error of 0.796, closely matched by our proposed method and MLP. The mean prediction errors of the other approaches are almost all twice as large as that of MARS. On the Energy Efficiency Cooling dataset, the proposed method, MARS, random forest and MLP are the top 4 performers, with MAEs between 1.278 and 1.924. On Concrete Strength, our proposed approach and MARS, with MAEs of 4.870 and 4.871, again emerge as the leading methods ahead of random forest, kriging, MLP and the others. On the Airfoil example, all the competing algorithms achieve similar prediction accuracies, with KNN topping the chart with an MAE of 0.026. The proposed approach is merely 0.003 behind, with kriging and random forest a further 0.001 behind; a mere 0.011 separates the 10 methods. Lastly, on the two Wine Quality examples, our proposed approach is ranked as the 1st and 3rd best method, respectively.
As no single regression method can always outperform all others on all datasets, a desirable regression algorithm should demonstrate consistently competitive prediction accuracy. In order to compare the methods more comprehensively, the per-dataset results are aggregated into an overall score for each method, as shown in Figure 5.
Figure 5: Scoring of regression methods
According to this scoring, the proposed method emerges as the most accurate and robust regression algorithm among all, achieving a score of 9.43 out of a possible 10. Random forest and MARS are second and third in the ranking with scores of 8 and 7.43, followed by kriging, KNN, MLP, SVR, ALAMO, linear regression and PaceRegression in descending order. The advantages of the proposed regression method compared with the other implemented methods are quite obvious.
Lastly, we examine, for each dataset, the number of regions and the key partition feature determined by our proposed regression method. The results are summarised in Table 3. It is clear that the proposed segmented regression method provides good interpretability, as the number of regions is small (usually between 2 and 4, and at most 5). The partition feature may reveal important insights into the underlying system, as the output variable changes more dramatically across different ranges along this feature.
No regression method will be the best for all problems. In this section, we give a general discussion of the pros and cons of the proposed OPLRA piece-wise linear regression method and compare it against some other literature methods. OPLRA is inherently deterministic, which means the same solution is always obtained regardless of the number of runs executed. This is an advantage of OPLRA over stochastic methods, for example MLP, where each execution would typically end up at a different locally optimal solution. Moreover, OPLRA is intuitive and easy to interpret: it approximates the potentially highly non-linear relationship between output and input variables with piece-wise linear algebraic functions, a formalism that is easy to understand, interpret and use for users without sophisticated background knowledge. By contrast, the mechanisms of certain methods like SVR, MLP and kriging lack transparency, as the former two work as black-box techniques and the latter requires detailed knowledge of statistics. The small number of user-specified parameters involved in training OPLRA is another remarkable advantage: β is the only tuning parameter, and the predictive performance is robust with regard to varying values of β, as shown in the Results and Discussion section. Conversely, certain regression methods, including SVR, MLP and kriging, require tuning a large number of parameters, making it a challenging task to identify their optimal values. More importantly, OPLRA achieves more accurate and robust prediction performance than the other methods: it is shown to outperform popular state-of-the-art multivariate regression methods in terms of prediction accuracy, and does so consistently across a number of real world problems.
4. Concluding Remarks
To demonstrate the applicability and efficiency of the proposed piece-wise regression method, 7 real world problems covering a wide range of application domains have been employed. To benchmark the predictive capability of the proposed method, we have also implemented various popular regression methods from the literature for comparison, including support vector regression, artificial neural networks, MARS and K nearest neighbours. Computational experiments clearly indicate that our proposed piece-wise regression method achieves consistently high predictive accuracy, yielding the lowest prediction errors for 4 out of the 7 datasets, the second lowest errors for 2 datasets and the third lowest error for the remaining example. The results confirm our proposed method as a reliable alternative to traditional regression analysis methods. Another remarkable advantage of our proposed method is that the learned model can be conveniently expressed as a set of if-then rules that are compact and easily understandable. From Table 3, it is clear that the number of if-then rules identified by our method as the hidden patterns in the large scale databases (up to thousands of expert curated samples) is extremely small (usually 2 to 3, and at most 5). The interpretability of the proposed piece-wise regression model is a desirable advantage over black-box modelling techniques, for example support vector regression and neural networks.
With regard to the research contribution to expert and intelligent systems, the generic machine learning method proposed in this work can be used to construct a large number of automatic decision making or decision support systems for various domain applications. As the quality and coverage of the information contained in the knowledge base critically affects the efficiency of any expert and intelligent system, our proposed machine learning method can serve to automatically and more efficiently acquire knowledge from databases by approximating the relationship between output and input variables as rules. Subsequently, the discovered knowledge can be used to generate forecasts in response to users' enquiries.
To further improve the efficiency of the proposed piece-wise regression method, the following limitations can be considered for refinement. As the piece-wise regression method proposed in this work can only partition a single input variable, one potential improvement is to generalise the method to permit segmentation of multiple variables, so as to better capture the non-linearity in datasets. Secondly, as our proposed method can only handle continuous input variables, we plan to improve its applicability by generalising it to deal with categorical input variables having many distinct levels. In addition, the relationship between output and input variables is approximated as linear within each segment in the current method, which may not adequately model the underlying patterns. To overcome this, more complex non-linear basis functions, for example polynomial, exponential and logarithmic forms, can be added to allow more flexibility. Another limitation of our method is its relatively high computational cost, which may restrict its usage in certain online applications, where the learning speed of the method is considered more important than the actual prediction accuracy. To tackle this problem, we can explore more efficient heuristic solution procedures that, by estimating the possible break-point positions and restricting the solution space, converge more quickly to a quality solution.
In terms of practical future applications in expert and intelligent systems, the proposed piece-wise regression method can benefit many by automatically extracting knowledge from databases and generating accurate forecasts. As examples, we have identified the following directions as possible avenues worth investigating in the near future. First, our proposed method can be incorporated into the construction of a decision support expert system that continuously predicts the personalised risk of releasing prisoners with mental illness from jail, aiding clinicians in decision making (Constantinou et al., 2015). Other applications that can benefit from our work include intelligent drowsiness monitoring systems and stock price prediction. In drowsiness monitoring, the proposed regression model can be built into intelligent fatigue detection equipment, which records the dynamic physiological signals of drivers or medical staff and continuously predicts their levels of fatigue. A warning would be automatically issued when the model predicts the fatigue level of a subject to be above a pre-specified threshold (Chen et al., 2015). In the financial area, our method can help with the construction of an automatic system that forecasts stock prices based on the ever-changing variables quantifying the current performance of a company, including assets, liabilities and income, providing management with data support for better financial decisions (Ballings et al., 2015). Lastly, the proposed method can also find application in the airline industry, where managers and decision makers can benefit from a framework capable of predicting the level of customer satisfaction from various aspects of service, making it possible for them to carefully allocate resources to maximise customer loyalty (Leong et al., 2015).
5. Acknowledgements
Funding from the UK Engineering and Physical Sciences Research Council (to LY, SL and LGP through the EPSRC Centre for Innovative Manufacturing in Emergent Macromolecular Therapies), the UK Leverhulme Trust (to ST and LGP, RPG-2012-686), the European Union (to ST, HEALTH-F2-2011-261366), and the Centre for Process Systems Engineering (CPSE) at Imperial and University College London is gratefully acknowledged.
References
Afantitis, A., Melagraki, G., Sarimveis, H., Koutentis, P. A., Markopoulos, J., & Igglessi-Markopoulou, O. (2006). A novel QSAR model for predicting induction of apoptosis by 4-aryl-4H-chromenes. Bioorganic and Medicinal Chemistry, 14, 6686–6694.

Alonso, F., Martínez, L., Pérez, A., & Valente, J. P. (2012). Cooperation between expert knowledge and data mining discovered knowledge: Lessons learned. Expert Systems with Applications, 39, 7524–7535.

Andrés, J. D., Lorca, P., de Cos Juez, F. J., & Sánchez-Lasheras, F. (2011). Bankruptcy forecasting: A hybrid approach using fuzzy c-means clustering and multivariate adaptive regression splines (MARS). Expert Systems with Applications, 38, 1866–1875.

Bai, Y., Wang, P., Li, C., Xie, J., & Wang, Y. (2014). A multi-scale relevance vector regression approach for daily urban water demand forecasting. Journal of Hydrology, 517, 236–245.

Ballings, M., den Poel, D. V., Hespeels, N., & Gryp, R. (2015). Evaluating multiple classifiers for stock price direction prediction. Expert Systems with Applications, 42, 7046–7056.
Balshi, M. S., Mcguire, A. D., Duffy, P., Flannigan, M., Walsh, J., & Melillo, J. (2009). Assessing the response of area burned to changing climate in western boreal North America using a multivariate adaptive regression splines (MARS) approach. Global Change Biology, 15, 578–600.

Beck, J., Friedrich, D., Brandani, S., Guillas, S., & Fraga, E. (2012). Surrogate based optimisation for design of pressure swing adsorption systems. In Proceedings of the 22nd European Symposium on Computer Aided Process Engineering.

Bermolen, P., & Rossi, D. (2009). Support vector regression for link load prediction. Computer Networks, 53, 191–201.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.

Cavanaugh, K. C., Kellner, J. R., Forde, A. J., Gruner, D. S., Parker, J. D., Rodriguez, W., & Feller, I. C. (2014). Poleward expansion of mangroves is a threshold response to decreased frequency of extreme cold events. Proceedings of the National Academy of Sciences, 111, 723–727.

Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27.
Chen, L., Zhao, Y., Zhang, J., & Zou, J.-z. (2015). Automatic detection of alertness/drowsiness from physiological signals using wavelet-based nonlinear features and machine learning. Expert Systems with Applications, 42, 7344–7355.

Chen, Q.-L., Wu, K.-J., & He, C.-H. (2014). Thermal conductivity of ionic liquids at atmospheric pressure: Database, analysis, and prediction using a topological index method. Industrial and Engineering Chemistry Research, 53, 7224–7232.

Cherkassky, V., & Ma, Y. (2004). Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks, 17, 113–126.

Constantinou, A. C., Freestone, M., Marsh, W., Fenton, N., & Coid, J. (2015). Risk assessment and risk management of violent reoffending among prisoners. Expert Systems with Applications, 42, 7511–7529.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47, 547–553.

Cozad, A., Sahinidis, N. V., & Miller, D. C. (2014). Learning surrogate models for simulation-based optimization. AIChE Journal, 60, 2211–2227.

Davis, E., & Ierapetritou, M. (2008). A kriging-based approach to MINLP containing black-box models and noise. Industrial and Engineering Chemistry Research, 47, 6101–6125.

Demšar, J., Curk, T., Erjavec, A., Gorup, Č., Hočevar, T., Milutinovič, M., Možina, M., Polajnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar, J., Žitnik, M., & Zupan, B. (2013). Orange: Data mining toolbox in Python. Journal of Machine Learning Research, 14, 2349–2353.
Eronen, A., & Klapuri, A. (2010). Music tempo estimation with k-NN regression. Audio, Speech, and Language Processing, IEEE Transactions on, 18, 50–57.

Fanelli, G., Gall, J., & Van Gool, L. (2011). Real time head pose estimation with random regression forests. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on (pp. 617–624).

Genuer, R., Poggi, J.-M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31, 2225–2236.

Gevrey, M., Dimopoulos, I., & Lek, S. (2003). Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160, 249–264.

Ghasemi, J., Saaidpour, S., & Brown, S. D. (2007). QSPR study for estimation of acidity constants of some aromatic acids derivatives using multiple linear regression (MLR) analysis. Journal of Molecular Structure: THEOCHEM, 805, 27–32.

Greene, M., Rolfson, O., Garellick, G., Gordon, M., & Nemes, S. (2015). Improved statistical analysis of pre- and post-treatment patient-reported outcome measures (PROMs): the applicability of piecewise linear regression splines. Quality of Life Research, 24, 567–573.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11, 10–18.

Helton, J., & Davis, F. (2003). Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems. Reliability Engineering and System Safety, 81, 23–69.

Hill, T., Marquez, L., O'Connor, M., & Remus, W. (1994). Artificial neural network models for forecasting and decision making. International Journal of Forecasting, 10, 5–15.
Khuri, A. I., & Mukhopadhyay, S. (2010). Response surface methodology. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 128–149.

Leathwick, J., Elith, J., & Hastie, T. (2006). Comparative performance of generalized additive models and multivariate adaptive regression splines for statistical modelling of species distributions. Ecological Modelling, 199, 188–196.

Leong, L.-Y., Hew, T.-S., Lee, V.-H., & Ooi, K.-B. (2015). An SEM–artificial-neural-network analysis of the relationships between SERVPERF, customer satisfaction and loyalty among low-cost and full-service airline. Expert Systems with Applications, 42, 6620–6634.
Li, B., Zhang, L., Yan, Q., & Xue, Y. (2014a). Application of piecewise linear regression in the detection of vegetation greenness trends on the Tibetan Plateau. International Journal of Remote Sensing, 35, 1526–1539.

Li, S., Feng, L., Benner, P., & Seidel-Morgenstern, A. (2014b). Using surrogate models for efficient optimization of simulated moving bed chromatography. Computers and Chemical Engineering, 67, 121–132.

Li, Y., Gong, S., & Liddell, H. (2000). Support vector regression and classification based multi-view face detection and recognition. In Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on (pp. 300–305).

Lloyd, C. D., & Atkinson, P. M. (2002). Deriving DSMs from LiDAR data with kriging. International Journal of Remote Sensing, 23, 2519–2524.

Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 14–23.

Lu, C.-J., Lee, T.-S., & Chiu, C.-C. (2009). Financial time series forecasting using independent component analysis and support vector regression. Decision Support Systems, 47, 115–125.

Magnani, A., & Boyd, S. (2009). Convex piecewise-linear fitting. Optimization and Engineering, 10, 1–17.

Matthews, T. J., Steinbauer, M. J., Tzirkalli, E., Triantis, K. A., & Whittaker, R. J. (2014). Thresholds and the species–area relationship: a synthetic analysis of habitat island datasets. Journal of Biogeography, 41, 1018–1028. doi:10.1111/jbi.12286.
Miller, D. C., Syamlal, M., Mebane, D. S., Storlie, C., Bhattacharyya, D., Sahinidis, N. V., Agarwal, D., Tong, C., Zitney, S. E., Sarkar, A., Sun, X., Sundaresan, S., Ryan, E., Engel, D., & Dale, C. (2014). Carbon capture simulation initiative: A case study in multiscale modeling and new challenges. Annual Review of Chemical and Biomolecular Engineering, 5, 301–323.

Minjares-Fuentes, R., Femenia, A., Garau, M., Meza-Velázquez, J., Simal, S., & Rosselló, C. (2014). Ultrasound-assisted extraction of pectins from grape pomace using citric acid: A response surface methodology approach. Carbohydrate Polymers, 106, 179–189.

Paliwal, M., & Kumar, U. A. (2009). Neural networks and statistical techniques: A review of applications. Expert Systems with Applications, 36, 2–17.

Pan, J., Kung, P., Bretholt, A., & Lu, J. (2014). Prediction of energy's environmental impact using a three-variable time series model. Expert Systems with Applications, 41, 1031–1040.
Quinlan, J. R. (1992). Learning with continuous classes. In Proceedings of the Australian Joint Conference on Artificial Intelligence (pp. 343–348). World Scientific.

Rafiq, M., Bugmann, G., & Easterbrook, D. (2001). Neural network design for engineering applications. Computers and Structures, 79, 1541–1552.

Sampson, P. D., Richards, M., Szpiro, A. A., Bergen, S., Sheppard, L., Larson, T. V., & Kaufman, J. D. (2013). A regionalized national universal kriging model using partial least squares regression for estimating annual PM2.5 concentrations in epidemiology. Atmospheric Environment, 75, 383–392.

Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 273–282.

Toriello, A., & Vielma, J. P. (2012). Fitting piecewise linear continuous functions. European Journal of Operational Research, 219, 86–95.

Venkatesh, K., Ravi, V., Prinzie, A., & den Poel, D. V. (2014). Cash demand forecasting in ATMs by clustering and neural networks. European Journal of Operational Research, 232, 383–392.

Viana, F. A. C., Simpson, T. W., Balabanov, V., & Toropov, V. (2014). Metamodeling in Multidisciplinary Design Optimization: How Far Have We Really Come? AIAA Journal, 52, 670–690.

Wu, K.-J., Chen, Q.-L., & He, C.-H. (2014). Speed of sound of ionic liquids: Database, estimation, and its application for thermal conductivity prediction. AIChE Journal, 60, 1120–1131.

Xue, Y., Liu, S., Zhang, L., & Hu, Y. (2013). Integrating fuzzy logic with piecewise linear regression for detecting vegetation greenness change in the Yukon River Basin, Alaska. International Journal of Remote Sensing, 34, 4242–4263.

Zhang, J.-R., Zhang, J., Lok, T.-M., & Lyu, M. R. (2007). A hybrid particle swarm optimization–back-propagation algorithm for feedforward neural network training. Applied Mathematics and Computation, 185, 1026–1037.

Zhu, Q., & Lin, H. (2010). Comparing ordinary kriging and regression kriging for soil properties in contrasting landscapes. Pedosphere, 20, 594–606.