(ARTICLE) Evaluation of Network Traffic Prediction Based On Neural Networks With Multi-Task Learning and Multiresolution Decomposition

This document discusses evaluating network traffic prediction using neural networks with multi-task learning and multiresolution decomposition. Specifically, it compares predictions from neural networks to statistical time series models like ARMA and Holt-Winters. The neural networks are trained using multi-task learning and multiresolution learning based on wavelet decomposition. Experimental results on real-world network traffic traces show nonlinear prediction with neural networks provides better accuracy than linear forecasting models for the purpose of network traffic prediction.

Evaluation of Network Traffic Prediction Based on Neural Networks with Multi-task Learning and Multiresolution Decomposition

Melinda Barabas, Georgeta Boanea, Andrei B. Rus, Virgil Dobrota
Technical University of Cluj-Napoca
26-28 George Baritiu Street, 400027 Cluj-Napoca, Romania
Email: {Melinda.Barabas, Virgil.Dobrota}@com.utcluj.ro

Jordi Domingo-Pascual
Universitat Politècnica de Catalunya (UPC)
1-3 Jordi Girona Street, 08034 Barcelona, Spain
Email: [email protected]

Abstract: Network traffic exhibits strong correlations which make it suitable for prediction. Real-time forecasting of network traffic load accurately and in a computationally efficient manner is the key element of proactive network management and congestion control. This paper compares predictions produced by different types of neural networks (NN) with forecasts from statistical time series models (ARMA, ARAR, HW). The novelty of our approach is to predict aggregated Ethernet traffic with NNs employing multiresolution learning (MRL) which is based on wavelet decomposition. In addition, we introduce a new NN training paradigm, namely the combination of multi-task learning with MRL. The experimental results show that nonlinear prediction based on NNs is better suited for traffic prediction purposes than linear forecasting models. Moreover, MRL helps to exploit the correlation structures at lower resolutions of the traffic trace and improves the generalization capability of NNs.

Keywords: multiresolution learning, multi-task learning, neural networks, prediction

I. INTRODUCTION

The main purpose of forecasting is to use historical data in order to predict the behavior of a system, by modeling it as a black-box [1]. Traffic prediction plays an important role in guaranteeing Quality of Service (QoS) in IP networks due to the diversity of services and because of the increased volume of real-time network applications. Forecasting algorithms can be embedded into network management systems to improve the global performance of the network and to achieve a balanced utilization of the resources. Traffic prediction can be useful for dynamic routing, congestion control and prevention, autonomous traffic engineering, proactive management of the network, etc.

Upon the occurrence of congestion in the network, a traditional routing protocol cannot react immediately, resulting in packet loss, additional delay and jitter, as well as services with severely degraded quality [2]. Prediction can be used by a network device in the self-adaptation process for optimizing its own performance. Thus, proactive decision-making is possible based on the predicted evolution of traffic on certain links, as opposed to reacting to past events. Thanks to the early warning, a prediction-based approach will be faster, in terms of congestion identification and elimination, than reactive methods which detect congestion through measurements, only after it has significantly influenced the network operation.

The prediction of network traffic parameters is possible because they present a strong correlation between chronologically ordered values. Their predictability is mainly determined by their statistical characteristics. According to [3], network traffic is characterized by: self-similarity, multiscalarity, long-range dependence (LRD) and a highly nonlinear nature.

Several methods have been proposed in the literature for network traffic forecasting. These can be classified into two categories: linear prediction and nonlinear prediction. Choosing a specific forecasting technique is based on a compromise between the complexity of the solution, characteristics of the data and the desired prediction accuracy. The most widely used traditional linear prediction methods are: a) the ARMA/ARIMA model [1], [4], [5], [6], [7] and b) the Holt-Winters algorithm [1], etc. The most common nonlinear forecasting methods involve neural networks (NN) [1], [3], [4], [8]. NNs can be combined with: a) multi-task learning [9], [10] or b) multiresolution learning [2], [11], [12], [13], etc. Although some articles state that linear prediction models are unable to describe the characteristics of network traffic [4], other studies confirm the practical usability of linear predictors for real-time traffic prediction [7]. Thus, it remains unclear which predictors provide the best performance while being at the same time simple, adaptable and accurate.

In this paper we consider the problem of forecasting the transfer rate, i.e. given a set of transfer rates observed on a specific link, we try to predict its future values. We chose the prediction of this parameter because this is the basic QoS parameter, i.e. if the demands regarding the transfer rate are not met, the other QoS parameters (delay, jitter, packet drops) will be affected seriously. In the following, we demonstrate that the prediction of future network traffic load based on recent observations is possible, with a certain accuracy, in a computationally efficient manner.

The rest of this paper is organized as follows. Section II gives a brief introduction to traditional forecasting techniques. In Section III neural network traffic predictors with multi-task training and multiresolution learning approaches are described.
Section IV lists the performance metrics used for prediction accuracy evaluation. The experimental results are presented in Section V as a comparative study with various types of predictors applied to real-world network traffic traces. Finally, Section VI concludes the paper and discusses future work.

II. TRADITIONAL TIME SERIES FORECASTING MODELS

In this section we give a brief introduction to various predictors based on traditional statistical techniques, such as the ARMA (Autoregressive Moving Average), ARAR (Autoregressive Autoregressive) and HW (Holt-Winters) algorithms.

A. ARMA model

The family of ARMA processes is one of the most popular statistical methods used for modeling and forecasting linear time series. ARMA models rely on a linear combination of autoregressive (AR) and moving average (MA) components. The time series $\{X_t\}$ is called an ARMA(p, q) process if $\{X_t\}$ is stationary (i.e. its statistical properties do not change over time) and:

$$X_t - \phi_1 X_{t-1} - \dots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \dots + \theta_q Z_{t-q} \tag{1}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ is white noise with zero mean and variance $\sigma^2$ and the polynomials $\phi(z) = 1 - \phi_1 z - \dots - \phi_p z^p$ and $\theta(z) = 1 + \theta_1 z + \dots + \theta_q z^q$ have no common factors [14].

The identification of a zero-mean ARMA model which describes a specific dataset involves the following steps [14]: a) order selection (p, q); b) estimation of the mean value of the series in order to subtract it from the data; c) determination of the coefficients $\{\phi_i, i = 1, \dots, p\}$ and $\{\theta_i, i = 1, \dots, q\}$; d) estimation of the noise variance $\sigma^2$. After the validation of the model, predictions can be made recursively using:

$$\hat{X}_{n+1} = \begin{cases} \sum_{j=1}^{n} \theta_{nj}\,(X_{n+1-j} - \hat{X}_{n+1-j}), & \text{if } 1 \le n < m \\ \sum_{j=1}^{q} \theta_{nj}\,(X_{n+1-j} - \hat{X}_{n+1-j}) + \phi_1 X_n + \dots + \phi_p X_{n+1-p}, & \text{if } n \ge m \end{cases} \tag{2}$$

where $m = \max(p, q)$ and $\theta_{nj}$ is determined using the innovations algorithm.

To fit a model to a nonstationary time series we use ARIMA (Autoregressive Integrated Moving Average). Fitting an ARIMA model to the original nonstationary dataset is equivalent to determining the ARMA model for the differenced dataset. The ARIMA(p, q, d) process is described by:

$$\phi(B)(1 - B)^d X_t = \theta(B) Z_t, \tag{3}$$

where $\phi$ and $\theta$ are polynomials of degree p and q respectively, $\nabla = 1 - B$ represents the differencing operator, d indicates the level of differencing and B is the backward-shift operator, i.e. $B^j X_t = X_{t-j}$ [4].

B. ARAR algorithm

The ARAR algorithm applies memory-shortening transformations, followed by modeling the dataset as an AR(p) process: $X_t = \phi_1 X_{t-1} + \dots + \phi_p X_{t-p} + Z_t$.

The time series $\{Y_t\}$ of long-memory or moderately long-memory is processed until the transformed series can be declared to be short-memory and stationary:

$$S_t = \Psi(B) Y_t = Y_t + \psi_1 Y_{t-1} + \dots + \psi_k Y_{t-k}. \tag{4}$$

The autoregressive model fitted to the mean-corrected series $X_t = S_t - \bar{S}$, $t = k+1, \dots, n$, where $\bar{S}$ represents the sample mean for $S_{k+1}, \dots, S_n$, is given by:

$$\phi(B) X_t = Z_t, \tag{5}$$

where $\phi(B) = 1 - \phi_1 B - \phi_{l_1} B^{l_1} - \phi_{l_2} B^{l_2} - \phi_{l_3} B^{l_3}$, $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, while the coefficients $\phi_j$ and the variance $\sigma^2$ are calculated using the Yule-Walker equations described in [14].

From (4) and (5) we obtain the relationship:

$$\xi(B) Y_t = \phi(1)\bar{S} + Z_t, \tag{6}$$

where $\xi(B) = \Psi(B)\phi(B) = 1 + \xi_1 B + \dots + \xi_{k+l_3} B^{k+l_3}$. From the following recursion relation we can determine the linear predictors $P_n Y_{n+h}$ for $n > k + l_3$:

$$P_n Y_{n+h} = -\sum_{j=1}^{k+l_3} \xi_j P_n Y_{n+h-j} + \phi(1)\bar{S}, \quad h \ge 1, \tag{7}$$

with the initial condition $P_n Y_{n+h} = Y_{n+h}$ for $h \le 0$.

C. Holt-Winters algorithm

The Holt-Winters forecasting algorithm is an exponential smoothing method which uses a set of recursions to predict the future value of series containing a trend. The main advantages of this algorithm are its simplicity, the reduced computational demand and the accuracy of the forecasts [1].

If the time series has a trend, then the forecast function is:

$$\hat{Y}_{n+h} = P_n Y_{n+h} = \hat{a}_n + \hat{b}_n h, \tag{8}$$

where $\hat{a}_n$ and $\hat{b}_n$ are the estimates of the level of the trend function and the slope respectively. These are calculated using the following recursive equations:

$$\begin{aligned} \hat{a}_{n+1} &= \alpha Y_{n+1} + (1 - \alpha)(\hat{a}_n + \hat{b}_n) \\ \hat{b}_{n+1} &= \beta(\hat{a}_{n+1} - \hat{a}_n) + (1 - \beta)\hat{b}_n \end{aligned} \tag{9}$$

where $\hat{Y}_{n+1} = P_n Y_{n+1} = \hat{a}_n + \hat{b}_n$ represents the one-step forecast. The initial conditions are set to $\hat{a}_2 = Y_2$ and $\hat{b}_2 = Y_2 - Y_1$. The smoothing parameters $\alpha$ and $\beta$ can be chosen either randomly (between 0 and 1), or by minimizing the sum of squared one-step errors $\sum_{i=3}^{n} (Y_i - P_{i-1} Y_i)^2$ [14].

The Holt-Winters Seasonal (HWS) algorithm extends HW to predict data which is characterized both by trend and seasonal variation with period d. The forecast function can be expressed as

$$P_n Y_{n+h} = \hat{a}_n + \hat{b}_n h + \hat{c}_{n+h}, \tag{10}$$

where $\hat{a}_n$, $\hat{b}_n$ and $\hat{c}_n$ are the estimates of the trend level, trend slope and seasonal component, being given by the following recursions:

$$\begin{aligned} \hat{a}_{n+1} &= \alpha(Y_{n+1} - \hat{c}_{n+1-d}) + (1 - \alpha)(\hat{a}_n + \hat{b}_n) \\ \hat{b}_{n+1} &= \beta(\hat{a}_{n+1} - \hat{a}_n) + (1 - \beta)\hat{b}_n \\ \hat{c}_{n+1} &= \gamma(Y_{n+1} - \hat{a}_{n+1}) + (1 - \gamma)\hat{c}_{n+1-d} \end{aligned} \tag{11}$$
with the initial conditions $\hat{a}_{d+1} = Y_{d+1}$, $\hat{b}_{d+1} = (Y_{d+1} - Y_1)/d$ and $\hat{c}_i = Y_i - (Y_1 + \hat{b}_{d+1}(i - 1))$ for $i = 1, \dots, d+1$. The parameters $\alpha$, $\beta$ and $\gamma$ can take values in the range from 0 to 1 and are either chosen arbitrarily or obtained by minimizing the sum of squared one-step errors.

III. NEURAL NETWORKS FOR TRAFFIC PREDICTION

Neural Networks (NN) are widely used in the process of modeling and predicting network traffic because they can learn complex patterns through their strong self-learning and self-adaptive capabilities. NNs are able to estimate almost any function in an efficient and stable manner, when the underlying data relationships are unknown [10]. The NN model is a nonlinear, nonparametric, adaptive modeling approach which, unlike the techniques presented in Section II, relies on the observed data rather than on an analytical model [4]. The architecture and the parameters of the NN are determined solely by the dataset. NNs are characterized by nonlinear mapping and generalization ability, robustness, fault tolerance, adaptability, parallel processing ability, etc.

A neural network consists of interconnected nodes, called neurons, every connection being characterized by a weight. A NN comprises several layers of neurons: a) an input layer, b) one or more hidden layers and c) an output layer. The most popular NN architecture is feed-forward, in which the information travels through the network only in the forward direction: from the input layer towards the output layer, as illustrated in Fig. 1.

[Fig. 1. Neural Network: input layer (x1 ... xn), hidden layer, output layer (y1 ... yp); input-to-hidden weights wi, hidden-to-output weights wh]

Using a NN as a predictor involves two phases: a) the training phase and b) the prediction phase. In the training phase, the training set is presented at the input layer and the parameters of the NN are dynamically adjusted to achieve the desired output value for the input set. The most commonly used learning algorithm is the backpropagation algorithm, based on the backward propagation of the error, where the weights are changed continuously until the output error falls below a preset value. In this way, the NN can learn correlated patterns between input sets and the corresponding target values. The prediction phase represents the testing of the NN. A new input (not included in the training set) is presented to the NN and the output is calculated, thereby predicting the outcome of new input data.

The number of hidden layers and the number of nodes in each layer are usually chosen empirically. To be able to predict nonlinear values, the NN must have at least one hidden layer. Too many hidden layers slow down the training process and increase the complexity of the network. In order to improve the nonlinearity of the solution, the activation functions of neurons in the hidden layer are sigmoid functions, while the output nodes have linear transfer functions.

A. Multi-task Learning

NN predictors applying the traditional single-task learning (STL) approach have only one output node and they focus on a single main task, i.e. predicting $x_{t+1}$ based on $\{x_1, x_2, \dots, x_t\}$. In this way, the information hidden in other tasks is neglected, such as the relationship between the historical data and $x_{t+2}$, although both tasks belong to the same dataset. In order to improve the generalization performance of NNs, the multi-task learning (MTL) paradigm is introduced. This means that we have a main task which is trained simultaneously with extra tasks, sharing the hidden layer of the NN, as shown in Fig. 2. By learning multiple tasks simultaneously, the NN can achieve better prediction accuracy.

[Fig. 2. NN predictor with multi-task learning: inputs x1, x2, ..., xt; shared hidden layer; outputs xt (extra task), xt+1 (main task), xt+2 (extra task)]

We can have one or more extra tasks, depending on the assumed complexity of the NN topology. For time series forecasting through the MTL concept, usually two extra tasks are chosen, namely the prediction of $x_t$ and $x_{t+2}$, which are closely related to the main task $x_{t+1}$, as in [9] and [10].

B. Multiresolution Learning

In case of traditional learning, only a single representation of the training data is used in the learning process, namely the finest resolution of the original dataset. The multiresolution learning (MRL) paradigm applied to NNs exploits the correlation structures at lower resolutions of the training data. Thereby, the NN will have a better generalization capacity and the training process will be more efficient and robust, resulting in improved prediction accuracy.

Multiresolution decomposition is performed with the help of the wavelet transform. In this paper, the Haar wavelet is used. The signal $s^i$ can be decomposed into approximation $s^{i-1}$, using a low-pass filter L, and detail $d^{i-1}$, using a high-pass filter H:

$$s^{i-1} = L s^i, \qquad d^{i-1} = H s^i. \tag{12}$$
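As a minimal sketch of (12), assuming the averaging form of the Haar filters (the paper does not state which normalization it uses), the low-pass filter can be taken as the pairwise mean and the high-pass filter as the pairwise half-difference. The function names `haar_decompose` and `haar_reconstruct` are illustrative only, and the signal length is assumed to be even.

```python
def haar_decompose(s):
    """One Haar level: return (approximation, detail), each half as long as s.

    Assumption: averaging normalization (pairwise mean and half-difference).
    An orthonormal Haar filter would scale by 1/sqrt(2) instead.
    """
    if len(s) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(s[i] + s[i + 1]) / 2.0 for i in range(0, len(s), 2)]  # low-pass L
    detail = [(s[i] - s[i + 1]) / 2.0 for i in range(0, len(s), 2)]  # high-pass H
    return approx, detail


def haar_reconstruct(approx, detail):
    """Invert haar_decompose: each pair is (a + d, a - d)."""
    s = []
    for a, d in zip(approx, detail):
        s.extend([a + d, a - d])
    return s


signal = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 6.0, 2.0]
approx, detail = haar_decompose(signal)
restored = haar_reconstruct(approx, detail)
# Zeroing the detail keeps the original length but only the coarse structure.
smoothed = haar_reconstruct(approx, [0.0] * len(detail))
```

Passing a zero detail vector to `haar_reconstruct` yields a full-length but coarser version of the signal, which is the kind of input the lower-resolution training stages of MRL work with.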
The approximation $s^{i-1}$ contains half as many samples as $s^i$. Reconstructing $s^i$ means that $s^i = s^{i-1} \langle+\rangle d^{i-1} = L^{*} s^{i-1} + H^{*} d^{i-1}$, where $\langle+\rangle$ is the reconstruction operator, while $L^{*}$ and $H^{*}$ are low-pass and high-pass synthesis filters.

Using the decomposition algorithm described above, the original signal $s^m$ is decomposed into a low frequency approximation $s^{m-j}$ and high frequency details $d^{m-1}, \dots, d^{m-j}$ at different levels, where j represents the decomposition level. For example, Fig. 3 illustrates the decomposition of the original signal $s^m$ at approximation level 3, which can be expressed as $s^m = s^{m-3} \langle+\rangle d^{m-1} \langle+\rangle d^{m-2} \langle+\rangle d^{m-3}$.

[Fig. 3. Example of multiresolution decomposition: s^m splits into s^(m-1) and d^(m-1); s^(m-1) into s^(m-2) and d^(m-2); s^(m-2) into s^(m-3) and d^(m-3)]

MRL means that the traditional training is decomposed into several stages, each involving a dataset of a certain resolution. The basic idea is to train the NN with the coarsest resolution $s^{m-j}$, followed by finer ones, and finally to learn the original resolution $s^m$ of the dataset. The first training stage starts with learning the coarsest resolution, which represents the simplest learning activity. In this stage, the NN parameters are initialized randomly. Each following stage uses the weights obtained in the previous stage and recalculates them.

Because $s^i$ contains fewer samples than $s^m$, it has to be reconstructed so that both have the same length. Therefore, the training data used in each stage i is obtained by setting the details to zero and reconstructing $s^i$ as follows: $s^i \langle+\rangle 0^i \langle+\rangle 0^{i+1} \langle+\rangle \dots \langle+\rangle 0^{m-1}$. Only in this way can we ensure a smooth transition between the different learning stages, which allows the same NN with the same topology to be used in every stage.

In this paper, a decomposition level of j = 2 is used. The basic scheme of the corresponding NN predictor with multiresolution learning is shown in Fig. 4. The neural network can use either single-task or multi-task learning.

[Fig. 4. NN predictor employing multiresolution learning: the NN is first trained on s^(m-2) starting from initial random weights, then on s^(m-1) = s^(m-2) <+> d^(m-2) with recalculated weights, and finally on s^m = s^(m-1) <+> d^(m-1) with recalculated weights]

In the literature, neural networks with multiresolution learning are used only for predicting network traffic associated with a single variable-bit-rate (VBR) MPEG video stream, as in [11], [12], [13]. These traffic traces are characterized by a seasonal component due to the periodically sent intra-coded video frames (I-frames).

The novelty of this article is to evaluate the performance of this approach applied to aggregated Ethernet traffic traces, which lack periodic components and are known to be heavily nonlinear. In addition, we introduce a new NN training paradigm, namely the combination of multiresolution learning with multi-task training. We intend to investigate whether this combination improves the overall prediction accuracy of the NN predictor.

IV. PERFORMANCE METRICS

To quantitatively assess the overall performance of the analyzed prediction methods, the following performance metrics are used to estimate the prediction accuracy:

1) MSE (Mean Square Error) is a scale dependent metric which quantifies the difference between the forecasted values and the actual values of the quantity being predicted by computing the average sum of squared errors:

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \tag{13}$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value and N represents the total number of predictions.

2) NMSE (Normalized Mean Square Error) can be expressed as MSE divided by the variance of the time series being predicted:

$$NMSE = \frac{1}{\sigma^2} \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \tag{14}$$

where $\sigma^2$ denotes the variance of the observed values during the prediction interval and is given by (15), where $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$ represents the mean value.

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2 \tag{15}$$

For a perfect prediction we obtain NMSE = 0. If NMSE = 1, the predictor statistically forecasts the average value of the observed data. In case of NMSE > 1, the performance of the prediction is worse than forecasting the mean [11].

3) MAPE (Mean Absolute Percentage Error) is a metric widely used to evaluate prediction precision. MAPE calculates the prediction error as a percentage of the observed value. Expressed in percentage terms, it presents the advantage of being easy to interpret.

$$MAPE = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i} \times 100\% \tag{16}$$
4) Coefficient of correlation (r) indicates the degree of association between two variables, being a measure of linear dependence. The linear correlation coefficient is sometimes referred to as the Pearson product-moment correlation coefficient (PMCC) and is defined as:

$$r = \frac{COV(Y, \hat{Y})}{\sigma_Y \sigma_{\hat{Y}}}, \tag{17}$$

where $\sigma_Y$ and $\sigma_{\hat{Y}}$ indicate the standard deviation of the observed and the predicted values, given by (18); $COV(Y, \hat{Y})$ is the covariance between $Y$ and $\hat{Y}$.

$$\sigma_Y = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2} \tag{18}$$

The covariance is used to determine the relationship between two datasets and is obtained as follows:

$$COV(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}). \tag{19}$$

Values for the Pearson correlation coefficient range between -1 and 1. If r = 1, there is a perfect positive correlation between the actual and the predicted values, whereas r = -1 indicates a perfect negative correlation. If r = 0, we have a complete lack of correlation among the datasets.

5) Coefficient of efficiency (E):

$$E = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \tag{20}$$

The efficiency coefficient can take values in the domain $(-\infty, 1]$. If E = 1, we have a perfect fit between the observed and the forecasted data. A value of E = 0 occurs when the prediction corresponds to estimating the mean of the actual values. An efficiency less than zero, i.e. $-\infty < E < 0$, indicates that the average of the actual values is a better predictor than the analyzed forecasting method. The closer E is to 1, the more accurate the prediction is.

V. EXPERIMENTAL RESULTS

In this section the experimental results are presented and discussed. For illustration purposes, only the most interesting results will be described. The goal of the experiments is to evaluate and to compare the performance of the prediction approaches presented in Sections II and III. We intend to identify the best forecasting method for network traffic prediction, taking into account the accuracy but also the complexity of the solutions. To assess the prediction performance, the metrics described in Section IV are used.

The ARMA model and the ARAR, HW and HWS algorithms were simulated using ITSM 2000, version 7 (Student). The NN predictors were implemented and tested in Matlab using the Neural Networks Toolbox and Wavelet Toolbox.

A real-world time series of 200 consecutive traffic load measurements was used for modeling/training the predictor, and the subsequent 20 values (not included in the training set) were used for evaluating the traffic prediction performance. The quantity we are predicting is traffic load, given in bits per second [bps]. The measurements represent the average transfer rate on a 10GE (Gigabit Ethernet) link between Atlanta and Washington, measured every 10 seconds. The trace used here and others are publicly available at [15]. Fig. 5 illustrates the dataset used for training and testing, along with the lower resolutions of the training set used for MRL. For testing, only the finest resolution is needed.

The experiments were repeated for several other traffic traces, obtaining similar results. In the following, we only discuss the results obtained for this particular dataset, due to lack of space.

[Fig. 5. Training and testing data [Gbps]: (a) Original training data, (b) Decomposition level 1, (c) Decomposition level 2, (d) Testing data. Horizontal axes: sample index, 0 to 200 in (a)-(c) and 0 to 20 in (d)]

A. Traditional Predictors

The training data is modeled by the following ARMA(3, 5) process: $X(t) = 2.142X(t-1) - 2.038X(t-2) + 0.8036X(t-3) + Z(t) - 1.258Z(t-1) + 0.8077Z(t-2) + 0.3543Z(t-3) - 0.3987Z(t-4) + 0.1565Z(t-5)$, where the variance of the white noise with zero mean is $\sigma^2 = 0.050751$.

The ARAR algorithm determined the following AR(11) model: $X(t) = 0.0481X(t-1) - 0.204X(t-2) - 0.2266X(t-6) - 0.1856X(t-11)$.

The HW algorithm predicts the testing set using (8) and (9) with the smoothing parameters $\alpha = 1$ and $\beta = 0.04$. The HWS algorithm uses (10) and (11) for forecasting with the smoothing values $\alpha = 0.85$, $\beta = 0$ and $\gamma = 1$.

Table I compares the performance metrics of the above mentioned predictors. As can be observed from the table, the HWS algorithm presents the overall best performance among the linear predictors, having the lowest MSE and NMSE, and
the highest E. Fig. 6(a) illustrates the values predicted with the HWS algorithm, compared to the observed data. Although this linear predictor has the best accuracy, the predicted data does not follow the evolution of the target values. Fig. 6(b) shows the histogram of prediction errors. The histogram lets us investigate the distribution of errors (i.e. the difference between the actual and the forecasted values): the narrower it is, the better the prediction accuracy is. The errors are in the range $[-0.7, -0.2] \cup [0.1, 0.6]$. The Q-Q plot (Quantile versus Quantile) in Fig. 6(c) compares two probability distributions as parametric curves, the parameter being the interval for the quantile. The figure displays the quantiles of the observed values (on 0x) against the quantiles corresponding to the predicted values (on 0y). If the samples came from the same distribution, the plot would be linear. But this is not the case, thus we can affirm that the target values are not modeled with sufficient precision.

[Fig. 6. Prediction evaluation: Holt-Winters Seasonal algorithm. (a) Prediction (output vs. target), (b) Histogram of prediction errors, (c) Q-Q plot (X Quantiles vs. Y Quantiles)]

TABLE I
PERFORMANCE METRICS FOR TRADITIONAL FORECASTING MODELS

Method   MSE      NMSE     MAPE     r         E
ARMA     0.3864   1.2281   19.09%   -0.521    -0.293
ARAR     0.3068   0.975    20.08%   0.254     -0.0263
HW       0.269    0.855    20.73%   0.8935    0.2979
HWS      0.2112   0.671    19.69%   0.6933    0.3752

In order to quantify the prediction performance improvement from method a to method b in terms of MSE, the following metric is used, as in [10]:

$$\Delta_{a,b} = \frac{MSE_b - MSE_a}{MSE_b} \times 100\%. \tag{21}$$

We denote the prediction models with the following numbers: 1 ↦ ARMA; 2 ↦ ARAR; 3 ↦ HW; 4 ↦ HWS. Table II compares the performance improvements of the traditional predictors (in terms of MSE) using (21) for each pair of predictors. The values in each row of the table can be interpreted as follows: by how much the predictor of that row improved the prediction performance, compared to the predictors in the columns. Although the HWS algorithm presents a positive performance improvement, it still cannot be considered an efficient forecasting method because MAPE and NMSE are too high and the value of the efficiency coefficient E is too low.

TABLE II
PERFORMANCE IMPROVEMENTS BETWEEN TRADITIONAL PREDICTORS

      1.        2.         3.         4.
1.    0%        -25.95%    -43.64%    -82.95%
2.    2.06%     0%         -14.03%    -45.27%
3.    30.38%    12.32%     0%         -27.37%
4.    45.34%    31.16%     21.49%     0%

B. NN Predictors

Four types of NN predictors are compared: STL (Single-Task Learning), STL with MRL (Multiresolution Learning), MTL (Multi-Task Learning) and MTL with MRL. In addition, in the case of STL we compare the performance of single-step versus multi-step prediction.

In the experiments, we use NNs with a small topology in order to reduce the overall complexity of the predictor. The order of complexity for training a single epoch is $O(n_h n_o (n_i + 1))$ [4], where $n_h$, $n_i$ and $n_o$ represent the number of hidden, input and output nodes respectively. To find the appropriate NN architecture, $n_i$ and $n_h$ were varied between 3 and 10. We achieved the best results for a 4-5-$n_o$ structured feedforward NN with backpropagation algorithm. The notation indicates a three-layer NN predictor having 4 input nodes, 5 hidden neurons and $n_o$ output neurons. In the case of single-task learning we have $n_o = 1$, as can be seen in Fig. 7, whereas for multi-task learning we use $n_o = 3$, as presented in Fig. 8 (in both figures the inputs are separated by delay elements).

[Fig. 7. Simulated NN with STL: inputs x(t), x(t-1), x(t-2), x(t-3); output x(t+1)]
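The 4-5-n_o input/output layout described above can be illustrated with a small sketch that builds the training pairs from a traffic series. It assumes plain Python lists; the names `make_windows` and `epoch_complexity` and their parameters are hypothetical and not taken from the paper, whose implementation used Matlab.

```python
def make_windows(series, n_inputs=4, multi_task=False):
    """Build (inputs, targets) pairs for the STL or MTL predictor.

    Inputs are [x(t-3), x(t-2), x(t-1), x(t)] when n_inputs is 4.
    STL target:  [x(t+1)]
    MTL targets: [x(t), x(t+1), x(t+2)]  (extra task, main task, extra task)
    """
    horizon = 2 if multi_task else 1  # furthest future sample needed
    inputs, targets = [], []
    for t in range(n_inputs - 1, len(series) - horizon):
        inputs.append(series[t - n_inputs + 1 : t + 1])
        if multi_task:
            targets.append([series[t], series[t + 1], series[t + 2]])
        else:
            targets.append([series[t + 1]])
    return inputs, targets


def epoch_complexity(n_hidden, n_outputs, n_inputs):
    """Growth term n_h * n_o * (n_i + 1) of the per-epoch training cost."""
    return n_hidden * n_outputs * (n_inputs + 1)


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
stl_x, stl_y = make_windows(series)
mtl_x, mtl_y = make_windows(series, multi_task=True)
stl_cost = epoch_complexity(5, 1, 4)  # 4-5-1 network
mtl_cost = epoch_complexity(5, 3, 4)  # 4-5-3 network
```

Under these assumptions the MTL network shares the same inputs and hidden layer as the STL network, and its growth term is three times larger because of the two extra output nodes.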
x(t)
by the fact the this approach does not take into account updated
x(t) Extra input values. It uses the predicted output for a given step as
x(t-1)
task
an input for the next step and all other inputs are shifted back
x(t+1) Main one time unit. This means that after predicting the first value,
x(t-2)
task
the output in Fig. 7 is connected to the x(t) input. Thereby,
x(t+2) Extra
task
the prediction error is propagated and the prediction accuracy
x(t-3)
constantly deteriorates. Meanwhile, single-step prediction dy-
namically updates the input information and the prediction
accuracy is not reduced. In this case, the input node x(t) can
Fig. 8. Simulated NN with MTL
be connected to a software tool for measuring traffic load.
An optical comparison of the forecasted values of the
The learning algorithm is trainlm which is based on network traffic predictors is possible by investigating Fig.
the LevenbergMarquardt algorithm. It was chosen because 9. We can observe that the output values predicted with
it converges rapidly and offers a satisfying accuracy. The both predictors employing MRL are close to the observed
learning rate was set to 0.01. The training phase was started in values, while the predictor with STL provides the worst match
each case with identical initial weights which were obtained between output and target values. In Fig. 10 the histograms of
randomly. In case of multiresolution learning, the training was the NN predictors errors are shown. The narrowest histograms
conducted 100 iterations for each resolution. When no MRL are obtained for STL with MRL and MTL with MRL, the
was involved, the training lasted 300 epochs. We used this prediction error having values in the range [0.3, 0.6]. As
approach in order to ensure a comparable complexity of the desired, we find that most error values are around 0 (between
different NN predictors. 0.2 and 0.2), as opposed to STL and MTL. Fig. 11 illustrates
In Table III we can compare the performance metrics of the the different QQ plots. Employing MRL results in a more
NN with STL, in case of single-step and multi-step prediction. linear QQ plot, whereas the STL predictor presents a more
pronounced nonlinearity.
TABLE III Table IV contains the performance metrics for the four NN
P ERFORMANCE METRICS FOR NN WITH STL predictors analyzed in this paper, considering only single-step
prediction. The traditional NN predictor (with STL) has the
Method MSE NMSE MAPE r E
worst accuracy, but it is the simplest approach. The best results
Single-step 0.14462 0.45965 11.08% 0.75583 0.51616 are obtained for MRL (MAPE below 7%, NMSE close to 0
Multi-step 0.34069 1.0828 18.52% 0.10777 -0.13979 and E close to 1), although the MTL with MRL has similar
results, but its topology is more complex, thus it involves
The multi-step prediction process presents a poor perfor- more calculations. Repeating the simulations several times, the
mance, having NMSE 1 and E < 0. This can be explained results differ slightly but they are similar to this presented case.

4 3.6 3.6 3.6

output output output output
3.4 target 3.4 target 3.4 target
target
3.5 3.2 3.2 3.2

3 3 3

3 2.8 2.8 2.8

2.6 2.6 2.6

2.5 2.4 2.4 2.4

2.2 2.2 2.2

2 2 2 2

1.8 1.8 1.8

1.5 1.6 1.6 1.6

0 2 4 6 8 10 12 14 16 18 20 0 2 4 6 8 10 12 14 16 18 20 0 2 4 6 8 10 12 14 16 18 20 0 2 4 6 8 10 12 14 16 18 20

(a) STL (b) STL + MRL (c) MTL (d) MTL + MRL

Fig. 9. NN Prediction

Fig. 10. NN Histogram of prediction errors: (a) STL, (b) STL + MRL, (c) MTL, (d) MTL + MRL

Fig. 11. NN Q-Q plot (X Quantiles vs. Y Quantiles): (a) STL, (b) STL + MRL, (c) MTL, (d) MTL + MRL
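The pairwise improvements in Table V can be related to the MSE column of Table IV with a short sketch. It assumes the improvement of predictor i over predictor j is the relative MSE reduction (MSE_j - MSE_i) / MSE_j. This formula is not stated in this section; it is an assumption that reproduces the magnitudes printed in Table V to within rounding. The sign convention (positive when predictor i has the lower MSE) is likewise assumed, and the names `MSE` and `improvement` are hypothetical.

```python
# Single-step MSE values copied from Table IV.
MSE = {
    "STL": 0.14462,
    "STL + MRL": 0.049385,
    "MTL": 0.091859,
    "MTL + MRL": 0.050576,
}


def improvement(mse_i, mse_j):
    """Relative MSE reduction (in percent) of predictor i with respect to j.

    Assumed definition: positive when predictor i has the lower MSE.
    """
    return 100.0 * (mse_j - mse_i) / mse_j


# Pairwise matrix: rows are predictor i, columns are predictor j.
names = list(MSE)
matrix = [[improvement(MSE[i], MSE[j]) for j in names] for i in names]

for name, row in zip(names, matrix):
    print(f"{name:10s}", "  ".join(f"{value:8.2f}%" for value in row))
```

Under this assumed convention, the row for STL with MRL is the only one with no negative entries, which agrees with the remark that positive values are obtained for STL with MRL.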

TABLE IV
PERFORMANCE METRICS FOR NN PREDICTORS

Method     MSE       NMSE     MAPE    r        E
STL        0.14462   0.45965  11.08%  0.75583  0.51616
STL + MRL  0.049385  0.15696  6.92%   0.92009  0.83478
MTL        0.091859  0.29195  8.69%   0.86099  0.69269
MTL + MRL  0.050576  0.16074  7.53%   0.91699  0.8308

We use the following notation: 1. → STL; 2. → STL with MRL; 3. → MTL; and 4. → MTL with MRL. Table V compares the performance improvement brought by the different NN predictors. Positive values are obtained for STL with MRL. The additional computational complexity of MTL with MRL is not justified in terms of performance improvement. Compared to the best linear predictor, namely the HWS algorithm, the NN predictor involving STL with MRL brings a performance improvement of 76.62%.

TABLE V
PERFORMANCE IMPROVEMENT BETWEEN NN PREDICTORS

      1.      2.       3.      4.
1.    0%      192.84%  57.44%  185.95%
2.    65.85%  0%       46.24%  2.36%
3.    36.48%  86.01%   0%      81.63%
4.    65.03%  2.41%    44.94%  0%

VI. CONCLUSIONS AND FUTURE WORK

In this paper we demonstrated that traffic load prediction is possible with a certain degree of accuracy. The experimental results show that nonlinear traffic prediction based on NNs outperforms linear forecasting models (e.g. ARMA, ARAR, HW), which cannot meet the accuracy requirements. If we take into account both precision and complexity, the best results are obtained by the NN predictor with the multiresolution learning approach, the predicted traffic generally coinciding with the observed values. If low computational complexity is more important, then an NN predictor with multi-task learning offers a better solution because this approach is simpler and its performance is satisfactory.
As future work, we envisage integrating the chosen predictor into a network management system and evaluating it in real time. Foreseeing the immediate future by employing a prediction-based approach enables proactive management.

ACKNOWLEDGMENT

This work was partly supported by the PRODOC program "Project of Doctoral Studies Development in Advanced Technologies" (POSDRU/6/1.5/S/5 ID 7676). The authors would like to thank the Broadband Communications Research Group of the Universitat Politècnica de Catalunya for their support.

REFERENCES

[1] P. Cortez, M. Rio, M. Rocha, P. Sousa, "Internet Traffic Forecasting using Neural Networks", International Joint Conference on Neural Networks, pp. 2635–2642. Vancouver, Canada, 2006.
[2] Z. Li, R. Wang, J. Bi, "A Multipath Routing Algorithm Based on Traffic Prediction in Wireless Mesh Networks", Fifth International Conference on Natural Computation, Volume 6, pp. 115–119. Tianjin, China, August 2009.
[3] V. B. Dharmadhikari, J. D. Gavade, "An NN Approach for MPEG Video Traffic Prediction", 2nd International Conference on Software Technology and Engineering, pp. V1-57–V1-61. San Juan, USA, 2010.
[4] H. Feng, Y. Shu, "Study on Network Traffic Prediction Techniques", International Conference on Wireless Communications, Networking and Mobile Computing, pp. 1041–1044. Wuhan, China, 2005.
[5] G. Mao, "Real-Time Network Traffic Prediction Based on a Multiscale Decomposition", 4th International Conference on Networking, Reunion Island, France. Lecture Notes in Computer Science, Volume 3420, pp. 492–499. 2005.
[6] J. Dai, J. Li, "VBR MPEG Video Traffic Dynamic Prediction Based on the Modeling and Forecast of Time Series", Fifth International Joint Conference on INC, IMS and IDC, pp. 1752–1757. Seoul, Korea, 2009.
[7] L. Cai, J. Wang, C. Wang, L. Han, "A Novel Forwarding Algorithm over Multipath Network", International Conference on Computer Design and Applications, pp. V5-353–V5-357. Qinhuangdao, China, 2010.
[8] A. Abdennour, "Evaluation of neural network architectures for MPEG-4 video traffic prediction", IEEE Transactions on Broadcasting, Volume 52, No. 2, pp. 184–192. ISSN 0018-9316, 2006.
[9] S. Sun, "Traffic Flow Forecasting Based on Multitask Ensemble Learning", Proceedings of the first ACM/SIGEVO Summit on Genetic and Evolutionary Computation, pp. 961–964. Shanghai, China, 2009.
[10] J. Rodrigues, A. Nogueira, P. Salvador, "Improving the Traffic Prediction Capability of Neural Networks Using Sliding Window and Multi-task Learning Mechanisms", Second International Conference on Evolving Internet, pp. 1–8. Valencia, Spain, 2010.
[11] Y. Liang, "Real-Time VBR Video Traffic Prediction for Dynamic Bandwidth Allocation", IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Volume 34, No. 1, pp. 32–47. ISSN 1094-6977, 2004.
[12] Y. Liang, X. Liang, "Improving Signal Prediction Performance of Neural Networks Through Multiresolution Learning Approach", IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Volume 36, No. 2, pp. 341–352. ISSN 1083-4419, 2006.
[13] D.-C. Park, "Prediction of MPEG Traffic Data Using a Bilinear Recurrent Neural Network with Adaptive Training", International Conference on Computer Engineering and Technology, pp. 53–57. Singapore, 2009.
[14] P. J. Brockwell, R. A. Davis, Introduction to Time Series and Forecasting, Second Edition. Springer-Verlag, ISBN 0-387-95351-5, 2002.
[15] Traffic measurements, http://dc-snmp.wcc.grnoc.iu.edu/i2net/#