0% found this document useful (0 votes)
3 views

Wireless_Traffic_Prediction_With_Scalable_Gaussian_Process_Framework_Algorithms_and_Verification

This paper presents a scalable Gaussian process framework for wireless traffic prediction within cloud radio access networks (C-RANs), aiming to enhance spectrum and energy efficiency for 5G systems. The proposed method utilizes the alternating direction method of multipliers (ADMM) for hyper-parameter optimization and introduces a cross-validation-based strategy for fusing local predictions, achieving significant improvements in prediction accuracy and computational efficiency. Experimental results demonstrate that the scalable GP model outperforms existing approaches, making it a promising solution for large-scale wireless traffic management.

Uploaded by

sh1619513754
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views

Wireless_Traffic_Prediction_With_Scalable_Gaussian_Process_Framework_Algorithms_and_Verification

This paper presents a scalable Gaussian process framework for wireless traffic prediction within cloud radio access networks (C-RANs), aiming to enhance spectrum and energy efficiency for 5G systems. The proposed method utilizes the alternating direction method of multipliers (ADMM) for hyper-parameter optimization and introduces a cross-validation-based strategy for fusing local predictions, achieving significant improvements in prediction accuracy and computational efficiency. Experimental results demonstrate that the scalable GP model outperforms existing approaches, making it a promising solution for large-scale wireless traffic management.

Uploaded by

sh1619513754
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 37, NO.

6, JUNE 2019 1291

Wireless Traffic Prediction With Scalable Gaussian


Process: Framework, Algorithms, and Verification
Yue Xu, Student Member, IEEE, Feng Yin , Member, IEEE, Wenjun Xu , Senior Member, IEEE,
Jiaru Lin, Member, IEEE, and Shuguang Cui , Fellow, IEEE

Abstract— The cloud radio access network (C-RAN) is a I. I NTRODUCTION


promising paradigm to meet the stringent requirements of the
fifth generation (5G) wireless systems. Meanwhile, the wireless
traffic prediction is a key enabler for C-RANs to improve both
the spectrum efficiency and energy efficiency through load-aware
T HE fifth generation (5G) system is expected to provide
approximately 1000 times higher wireless capacity and
reduce up to 90 percent of energy consumption compared
network managements. This paper proposes a scalable Gaussian
process (GP) framework as a promising solution to achieve
with the current 4G system [1]. One promising solution to
large-scale wireless traffic prediction in a cost-efficient manner. reach such ambitious goals is the adoption of cloud radio
Our contribution is three-fold. First, to the best of our knowledge, access networks (C-RANs) [2], which have attracted intense
this paper is the first to empower GP regression with the research interests from both academia and industry in recent
alternating direction method of multipliers (ADMM) for parallel years [3]. A C-RAN is composed of two parts: the distributed
hyper-parameter optimization in the training phase, where such
a scalable training framework well balances the local estimation
remote radio heads (RRHs) with basic radio functionalities to
in baseband units (BBUs) and information consensus among provide coverage over a large area, and the centralized base-
BBUs in a principled way for large-scale executions. Second, band units (BBUs) pool with parallel BBUs to support joint
in the prediction phase, we fuse local predictions obtained processing and cooperative network management. The BBUs
from the BBUs via a cross-validation-based optimal strategy, can perform dynamic resource allocation in accordance with
which demonstrates itself to be reliable and robust for general
regression tasks. Moreover, such a cross-validation-based optimal
real-time network demands based on the virtualized resources
fusion strategy is built upon a well acknowledged probabilistic in cloud computing. One major feature for the C-RANs to
model to retain the valuable closed-form GP inference properties. enable high energy-efficient services is the fast adaptability
Third, we propose a C-RAN-based scalable wireless prediction to non-uniform traffic variations [1]–[4], e.g., the tidal effects.
architecture, where the prediction accuracy and the time con- Consequently, wireless traffic prediction techniques stand out
sumption can be balanced by tuning the number of the BBUs
according to the real-time system demands. The experimental
as the key enabler to realize such load-aware management and
results show that our proposed scalable GP model can outperform proactive control in C-RANs, e.g., the load-aware RRH on/off
the state-of-the-art approaches considerably, in terms of wireless operation [4]. However, the adoption of wireless traffic pre-
traffic prediction performance. diction techniques in C-RANs must satisfy the requirements
Index Terms— C-RANs, Gaussian processes, parallel process- on prediction accuracy, cost-efficiency, implementability, and
ing, ADMM, cross-validation, machine learning, wireless traffic. scalability for large-scale executions.

Manuscript received July 13, 2018; revised December 13, 2018; accepted
February 28, 2019. Date of publication March 11, 2019; date of current A. Related Works
version May 15, 2019. This work was supported in part by the National
Natural Science Foundation of China under Grant 61629101 and Grant In the literature, many statistical time series models and
61771066, in part by NSF under Grant DMS-1622433, Grant AST-1547436, analysis methods have been proposed for wireless traffic
and Grant ECCS-1659025, in part by Guangdong Province under Grant prediction. For instance, the linear autoregressive integrated
2017ZT07X152, and in part by the Shenzhen Fundamental Research Fund
under Grant KQTD2015033114415450, Grant ZDSYS201707251409055, moving average (ARIMA) model has been used to model the
and Grant JCYJ20170307155957688. (Corresponding authors: Feng Yin; short-term correlation in network traffic [5]. However, wireless
Wenjun Xu.) network traffic often shows a long-term correlation due to
Y. Xu, W. Xu, and J. Lin are with the Key Laboratory of Universal
Wireless Communications, Ministry of Education, Beijing University of Posts mobile user behaviors [6]. As an extension, Shu et al. [7]
and Telecommunications, Beijing 100876, China (e-mail: [email protected]; adopted the seasonal ARIMA (SARIMA) models to improve
[email protected]; [email protected]). the ARIMA-based models on long-term traffic correlation
F. Yin is with The Chinese University of Hong Kong, Hong Kong, and also
with the Shenzhen Research Institute of Big Data, Shenzhen 518172, China modeling. Wang et al. [8] proposed the sinusoid superposition
(e-mail: [email protected]). model to describe both the short-term and long-term traffic
S. Cui is with the Department of Electrical and Computer Engineer- patterns based on a liner combination of different frequency
ing, University of California at Davis, Davis, CA 95616 USA, and also
with the Shenzhen Research Institute of Big Data, School of Science and components, which are extracted by performing the fast
Engineering, the Chinese University of Hong Kong, Hong Kong (e-mail: Fourier transform (FFT) on traffic curves.
[email protected]). However, wireless network traffic is becoming more com-
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. plex with increasing non-linear patterns, which are difficult
Digital Object Identifier 10.1109/JSAC.2019.2904330 to be captured via linear models. Therefore, machine learning
0733-8716 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: SHANDONG UNIVERSITY. Downloaded on November 24,2024 at 05:16:41 UTC from IEEE Xplore. Restrictions apply.
1292 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 37, NO. 6, JUNE 2019

based models could be adopted to improve the prediction accu- which incurs extensive communication overheads in large-
racy, e.g., those based on deep neural networks. For instance, scale executions. Besides, in the prediction phase, rBCM uses
Nie et al. [9] exploited the deep belief network (DBN) and heuristic weights to combine the local inferences, which is
Gaussian model for both the low-pass and high-pass com- straightforward but not robust.
ponents of the wireless traffic respectively. Qiu et al. [10] Lastly but importantly, most of the existing literatures on
exploited the recurrent neural network (RNN) to predict wireless traffic prediction only focus on the algorithm design
wireless traffic based on certain spatial-temporal information. without considering their adaptation to practical wireless archi-
However, (deep) neural networks are also well-known for tectures. For example, the BBUs are aggregated in a few big
the difficulty of training, and their learned features packed rooms and can be selectively turned on or off according to the
in a black-box are usually hard to interpretate. Recently, actual network demands to curtail the operation expenditures,
the Gaussian process (GP) model, which is a class of Bayesian e.g., reducing the power consumption of air conditioning. In
nonparametric machine learning models, has achieved out- addition, how to exploit the parallel BBUs for joint processing
standing performance in various fields [11], [12]. Compar- to improve the prediction speed thus reducing the operation
ing with other machine learning methods, e.g., the neural time and how to accommodate the utilization of BBUs and the
networks, the GP model does not involve any black-box scalability of prediction models still remain to be explored.
operations. Instead, the GP encodes domain/expert knowledge
into the kernel function and optimizes the hyper-parameters
B. Contributions
explicitly based on the Bayes theorems to generate certain
explainable results. Therefore, it has a great potential in In this paper, we propose a scalable GP-based wireless
improving the interpretability and prediction accuracy. More- prediction framework based on a C-RAN enabled architecture.
over, GP could provide the posterior variance of the pre- The proposed framework leverages the parallel BBUs in
dictions; in other words, GP not only predicts the future C-RANs to perform predictions with computational complex-
traffic, but also provides a measure of uncertainty over the ity depending on the number of active BBUs. Therefore,
predicted results, which is of vital importance for robust the prediction accuracy and the time consumption can be well-
network managements, e.g., the network routing based on balanced by activating or deactivating the BBUs, which con-
traffic uncertainty [13], and the resource reservation for cell stitutes a cost-efficient solution for large-scale executions. The
on/off operations [14]. In addition, GP is also promising for main contributions of this work are summarized as follows.
modeling and analyzing spatial-temporal data [15]. • To the best of our knowledge, this work is the first to
Nevertheless, the bottleneck of standard GP model lies in apply GP for wireless traffic prediction with a tailored
the high computational complexity, which creates difficulties kernel function that can capture both the periodic trend
for large-scale executions in the C-RANs. This motivates the and the dynamic deviations observed in real 4G data. The
researchers to seek for low-complex GP methods that are obtained prediction accuracy reaches up to 97%, much
capable of achieving similar prediction performance as the higher than the existing methods. Besides, compared with
standard ones but with much lower computational complexity. the existing works that are typically based on 2G or
Among others, the following two methods are promising for 3G wireless traffic data, the proposed models are more
large-scale applications. The first is the sparse GP model with promising to be applied in 5G wireless networks.
the idea to approximate the distribution of the full dataset • This work proposes a C-RAN based wireless prediction
based on its subset [16]. However, the selection of such an architecture to depict a feasible realization with C-RAN
optimal subset is difficult and time-consuming. The other infrastructures, where the RRHs collect and deliver the
alternative is the distributed GP model, which splits the heavy local traffic data to the BBU pool, while the BBU
training burden to multiple parallel machines and then opti- pool performs on-demand wireless traffic predictions by
mally combine local inferences to generate an improved global readily changing the number of active BBUs. Such a
prediction. Distributed GP models comprise two operation C-RAN based architecture is promising for joint parallel
phases in sequence, i.e., the joint hyper-parameter learning in processing and large-scale cooperative optimization to
the training phase and the optimal fusion of local inferences realize intelligent load-aware network management.
in the prediction phase. The existing distributed GP models all • This paper proposes a scalable GP framework based on
come with certain limits in either of the two phases or both. the distributed GP with significant innovations in both
For example, the Bayesian committee machine (BCM) pro- the training phase and the prediction phase. Specifically,
posed by Tresp [17] approximates the likelihood function over 1) for the GP training phase: this work is the first to
the full dataset with a product of likelihood functions over its propose a scalable training framework based on the alter-
subsets. However, during the BCM’s training phase, each local nating direction method of multipliers (ADMM) algo-
GP model optimizes its own hyper-parameters independently rithm. The training framework provides an elegant and
without any interactions, which forbids information sharing principled route for performing hyper-parameter learning
from promoting joint improvements. The robust BCM (rBCM) and information exchanging among parallel computing
proposed by Deisenroth [18] is based on a product-of-experts units, i.e., BBUs, and moreover achieves excellent trade-
(PoE) framework to address the shortcomings of BCM. How- off between the communication overhead and training
ever, in the training phase, rBCM assumes that all local GP performance. For each BBU, the computational com-
models are trained jointly with complete information sharing, plexity of training phase can be reduced from O(N 3 )

Authorized licensed use limited to: SHANDONG UNIVERSITY. Downloaded on November 24,2024 at 05:16:41 UTC from IEEE Xplore. Restrictions apply.
XU et al.: WIRELESS TRAFFIC PREDICTION WITH SCALABLE GAUSSIAN PROCESS 1293

3
of the standard GP to O( N K 3 ) of our proposed scalable
GP, where N is the number of training points and K is
the number of parallel BBUs. 2) For the GP prediction
phase: this work is the first to fuse the prediction results
from local GP models elegantly via optimizing the fusion
weights based on a desired performance metric with
validation points. We prove that when the validation set
only contains a single data point, the weight optimization
problem can be cast into a convex problem with a global
optimal solution. When the validation set contains more
than one data points, the weight optimization problem can
be solved via mirror descent efficiently with convergence
guarantees. Moreover, we propose a simplified weighting
strategy based on the soft-max function for high-speed
executions particularly for real-time applications.
• In order to further decrease the computational complexity
of each BBU, we propose the structured GP model for
kernel matrices with a Toeplitz structure. This structure
arises when dealing with a dataset with regularly-spaced Fig. 1. A general architecture for wireless traffic prediction based on
input grid, concretely, in our case, the wireless traffic C-RANs.
recorded with regular time intervals. The structured GP
model can further reduce the training complexity from
3
N2
O( NK 3 ) to O( K 2 ) without sacrificing any prediction
structure, i.e., the RRHs deployed at remote sites, and the
accuracy. BBUs clustered as the BBU pool centrally. The RRHs are
The remainder of this paper is organized as follows. equipped with basic radio functions to monitor the local traffic
Section II presents the system model including the C-RAN data at different areas, and deliver them to the BBU pool
enabled architecture and the standard GP regression model. via the common public radio interface (CPRI) protocol [3],
Section III presents the scalable GP framework, where the where the BBU pool performs traffic predictions for all
ADMM-empowered scalable training framework and cross- RRHs accordingly based on the available BBUs. To support
validation based scalable fusion framework are established large-scale and real-time executions, we propose a scalable
respectively. Section IV demonstrates the experimental results. wireless traffic prediction framework that can well-balance
Finally, Section V concludes this paper. the prediction accuracy and time consumption by readily
changing the number of running BBUs. Specifically, each
II. S YSTEM M ODEL traffic prediction model is performed on an individual BBU
and trained based on its assigned subset split from the full
A. C-RAN Based Wireless Traffic Prediction Architecture training dataset. Consequently, higher prediction accuracy can
The C-RAN combines powerful cloud computing and be achieved when using less BBUs but with larger subsets, for
flexible virtualization techniques based on a centralized that more information can be preserved from dataset division.
architecture [2], [3], which makes it a decent paradigm to The extreme case is only using one BBU to learn from the
support large-scale wireless predictions with the following full dataset directly. On the contrary, less time consumption
advantages. First, the C-RAN can support on-demand comput- can be achieved by continuously increasing the number of
ing resource allocation for cost-efficient service commitments, running BBUs with smaller subsets, which however will bring
such as changing the number of active BBUs according to certain prediction performance loss. In this way, the prediction
real-time computing demands [3]. Second, the centralized and tasks triggered by different RRHs can be easily matched
virtualized hardware platform can better support internal infor- with appropriate computing resources for the best service
mation sharing, which lays the foundation for joint parallel commitment. In other words, the C-RAN can selectively acti-
processing with reduced communication overheads and rapid vate or deactivate the BBUs according to the actual network
data assignments. Third, the centralized C-RAN architecture demands in terms of, e.g., acceptable accuracy levels, delay
can also better support cooperative managements [19], such requirements, task priorities, system burdens, etc. Therefore,
that the wireless traffic prediction result can directly guide the our proposed architecture can dynamically and adaptively
load-aware network management to improve both the spectrum balance the energy consumption and prediction performance
efficiency and energy efficiency of wireless networks, which to provide a cost-effective solution.
contributes a candidate solution to realize adaptive resource The proposed traffic prediction architecture could greatly
allocation and proactive network control for future intelligent assist in traffic-aware managements and adaptive network
management. control to accomplish better network scheduling and resource
Motivated by the above benefits, we propose a C-RAN configuration in future C-RANs. For example, the proposed
based wireless traffic prediction architecture as shown architecture could well-guide the RRH on/off management to
in Fig. 1. The proposed architecture inherits the two-layer largely improve the energy efficiency, e.g., the traffic-aware

Authorized licensed use limited to: SHANDONG UNIVERSITY. Downloaded on November 24,2024 at 05:16:41 UTC from IEEE Xplore. Restrictions apply.
1294 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 37, NO. 6, JUNE 2019

RRH on/off operations in the 5G C-RAN [20], the load-aware


RRH on/off operations in the green heterogeneous C-RAN [4].
On the other hand, the GP-based traffic prediction model has
already been applied to guide the base station on/off operations
in wireless networks [14]. Other usage cases for the wireless
traffic prediction model include the traffic-aware resources
management in the 5G core network [21], the 5G traffic
management under the enhanced mobile broadband (eMBB),
massive machine type communications (mMTC), and ultra-
reliable low-latency communication (URLLC) scenarios [22],
etc. Moreover, the proposed C-RAN-based prediction archi-
tecture is also able to flexibly control the time consumption
over online traffic predictions by readily changing the number
of BBUs, which can therefore better support delay-sensitive
services and improve the user experience on latency. Mean-
while, since the GP prediction phase requires considerably
less time consumption than the GP training phase (which
Fig. 2. The PRB usage curves of three base stations.
will be discussed in the following sections), the C-RAN can
also choose to train the GP model offline, and use the well-
trained GP model to perform online predictions in accom- to visit the same locations [23]. Similar periodicities have
plishing ultra-fast traffic prediction responses.1 Additionally, already been observed in the 3G traffic [23] and the simulated
it is noteworthy to point out that the proposed scalable GP dense C-RAN system [20]. Hence, it is highly likely that the
framework is indeed a general-purpose regression framework, 5G traffic will also exhibit similar periodic patterns as long as
which is likely to find wide applications on other regression the nature of human mobility still remains. Actually, the 4G
tasks by changing the kernel function and the learning context, traffic already exhibits larger dynamic deviations than the 3G
e.g., building a GP-based fingerprint map with the SE kernel traffic due to the diversification of services and the evolution
and RSS measurements as the training data [11]. of network architectures, which implies further exaggerated
dynamic deviations in the future 5G traffic due to the ongoing
B. General Patterns of Wireless Traffic Data network developments. Therefore, in this paper, we aim to
propose a general prediction model that has the flexibility
We first analyze the general patterns in wireless traffic data
to fit both the periodic patterns and the dynamic deviations
as the basis to craft a decent prediction model. The dataset
with different magnitudes, thereby being widely applicable to
used in this paper contains hourly recorded downlink physical
general traffic datasets collected from the real 4G networks or
resource block (PRB) usage2 histories of about 3000 4G base
the future 5G networks.
stations. The three panels of Fig. 2 show the PRB usage curves
of three different base stations. The curve in the first panel
represents one base station in an office area, where the traffic C. GP-Based Wireless Traffic Prediction Model
trend shows a strong weekly periodic pattern in accordance 1) GP-Based Regression Model: A Gaussian process is an
with weekdays and weekends. The curve in the second panel important class of the kernel-based machine learning methods.
represents one base station in a residential area, which shows It is a collection of random variables, any finite number of
a strong daily pattern with higher demands in the daytime which have Gaussian distributions [24]. In this paper, we focus
and lower demands in the nighttime. The curve in the third on the real-valued Gaussian processes that are completely
panel represents one base station in a rural area, where the specified by a mean function and a kernel function. Specif-
weekly and daily patterns are not obvious. Apart from the ically, for the wireless traffic of each RRH in the C-RAN,
periodic patterns, all three curves show irregular dynamic we consider the following regression model:
deviations. To summarize, the 4G wireless traffic in our dataset
shows three general patterns: 1) weekly periodic pattern: y = f (x) + e, (1)
the variations in accordance with weekdays and weekends;
2) daily periodic pattern: the variations in accordance with where y ∈ R1 is a continuously valued scalar output; e
weekdays and weekends; 3) dynamic deviations: the variations is the independent noise, which is assumed to be Gaussian
on top of the above periodic trends. distributed with zero mean and variance σe2 ; f (x) is the
Generally, the periodic traffic variations are mainly caused regression function, which is described with a GP model as
by the aggregated user behaviors incorporating the 24-hour
f (x) ∼ GP(m(x), k(x, x ; θh )), (2)
periodic nature of human mobility and the periodical tendency
where m(x) is the mean function, often assumed to be zero
1 However, it is suggested to re-train the GP model periodically with the in practice, especially when no prior knowledge is available,
most recent traffic samples in order to retain the prediction accuracy.
2 The PRB usage can reflect wireless traffic flow changes and is therefore and k(x, x ; θh ) is the kernel function determined by the
chosen to be the prediction target. kernel hyper-parameters θh . Hence, the hyper-parameters to

Authorized licensed use limited to: SHANDONG UNIVERSITY. Downloaded on November 24,2024 at 05:16:41 UTC from IEEE Xplore. Restrictions apply.
XU et al.: WIRELESS TRAFFIC PREDICTION WITH SCALABLE GAUSSIAN PROCESS 1295

be learned3 for wireless traffic prediction can be denoted as Additionally, in order to eliminate human interventions,
θ  [θhT , σe2 ]T . automatic kernel learning/determination has become more
Generally, the prediction task can be summarized as and more fashionable in recent years. Representative results
follows. Given a training dataset D  {X, y}, where include [26]–[28] that proposed to design an universal kernel
y = [y1 , y2 , . . . , yn ]T is the training outputs and X = in the frequency domain and [29] that proposed to search for
[x1 , x2 , ..., xn ] is the training inputs, the aim is to predict a space of kernel structures built compositionally by adding
the output y∗ = [y∗,1 , y∗,2 , ..., y∗,n∗ ]T given the test inputs and multiplying a small number of elementary kernels. The
X∗ = [x∗,1 , x∗,2 , ..., x∗,n∗ ] based on the posterior distribu- drawback of these advanced kernels lies in the increased model
tion p(y∗ |D, X∗ ; θ). According to the definition of Gaussian and training complexity. In this paper, we prefer to use a
process given beforehand, the joint prior distribution of the relatively simpler linear kernel constructed based on our expert
training output y and test output y∗ can be written explicitly knowledge about the data generation process, not stretching to
as overfit the data for the robustness of a communication system.
    
y [c]K + σe2 In k∗ However, we note that automatic kernel learning is worth
∼ N 0, , (3)
y∗ k∗T k∗∗ trying for better adaptivity and full automation of a system,
especially when the training dataset is sufficiently large and
where
the computation power is abundant. In the sequel, as a concrete
• K = K(X, X; θ) is an n × n kernel matrix of correla-
solution, we select the following three kernels to model the
tions among training inputs; wireless traffic patterns observed in Section II-B.
• k∗ = K(X, X∗ ; θ) is an n × n∗ kernel matrix of
Then, we compose them as a tailored kernel function
correlations between the training inputs and test inputs; particularly for wireless traffic prediction. Specifically, 1) for
k∗T = K(X∗ , X; θ) = K(X, X∗ ; θ)T ;
the weekly periodic pattern, we select a periodic kernel with
• k∗∗ = K(X∗ , X∗ ; θ) is an n∗ × n∗ kernel matrix of
the periodic length set to be the number of data points of one
correlations among test inputs. week, which is defined as
By applying the results of conditional Gaussian ⎡ π(ti −tj )

distribution [24], we can derive the posterior distribution as sin2 λ1
k1 (ti , tj ) = σp21 exp ⎣− ⎦, (7)
p(y∗ |D, X∗ ; θ) ∼ N (μ̄, σ̄) , (4) lp21
where the posterior mean and variance are respectively given where λ1 is the periodic length, lp1 is the length scale
as determining how rapidly the function varies with time t, and
 −1
E [f (X∗ )] = μ̄ = k∗T K + σe2 In y, (5) σp21 is the variance determining the average distance of the
 2
−1 function away from its mean; 2) for the daily periodic pattern,
V [f (X∗ )] = σ̄ = k∗∗ − k∗ K + σe In
T
k∗ . (6)
we also select a periodic kernel, but with different hyper-
Note that the GP model is a Bayesian non-parametric model parameters, which is defined as
since the above posterior distribution can be refined as the ⎡ π(ti −tj )

number of observed data grows. sin2 λ2
2) Kernel Function Tailored for Wireless Traffic Prediction: k2 (ti , tj ) = σp22 exp ⎣− ⎦, (8)
lp22
Kernel function design is critical to GP, as it encodes the prior
information about the underlying process. A properly designed where the periodic length λ2 is set to be the number of data
kernel function can generate both high modeling accuracy points for one day, with length scale lp2 and output variance
and strong generalization ability. Generally, a kernel function σp22 also set differently from kernel k1 ; 3) for the dynamic
can be either stationary or non-stationary. A stationary kernel deviations, we select an SE kernel, which is defined as
depends on the relative position of the two inputs, i.e., K(τ )  
with τ = xi − xj and xi , xj being two different inputs. 2 (ti − tj )2
k3 (ti , tj ) = σlt exp − , (9)
A non-stationary kernel depends on the absolute position of 2ll2t
the two inputs, i.e., K(xi , xj ). However, modeling with a
stationary kernel is the preliminary requirement to generate a where σl2t is the magnitude of the correlated components and
Toeplitz structure in the kernel matrix, which is the basis for lt is its length scale. The above three elementary kernels can
the structured GP model proposed in Section III-C. Therefore, be added together without influencing the GP properties [24].
we mainly focus on the stationary kernels in this paper. Therefore, the tailored kernel function for general wireless
Different stationary kernels can generate data profiles with traffic prediction tasks could be written as
distinctly different characteristics, e.g., a periodic kernel can k(ti , tj ) = k1 (ti , tj ) + k2 (ti , tj ) + k3 (ti , tj ) (10)
generate structured periodic curves, while a squared exponen-
tial (SE) kernel can generate smooth and flexible curves. More- with the kernel hyper-parameters
over, multiple elementary kernels can be composed as a hybrid  T
θh = σp21 , σp22 , σl2t , lp21 , lp22 , ll2t . (11)
one while preserving their own particular characteristics.
Note that it is totally feasible to construct a new composite
3 Although σe2 can be estimated jointly with the kernel hyper-parameters
θh , in this paper we assume it is estimated independently by some other kernel function with other stationary kernels to predict traffic
estimation processes, e.g., the robust smoothing method [25]. with different primary patterns. For example, the rational

Authorized licensed use limited to: SHANDONG UNIVERSITY. Downloaded on November 24,2024 at 05:16:41 UTC from IEEE Xplore. Restrictions apply.
1296 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 37, NO. 6, JUNE 2019

quadratic (RQ) kernel with three hyper-parameters is more where η is the step size. The derivative of the i-th element in
flexible than the SE kernel to model the irregular dynamic θ is computed as [24]
deviations. Hence, the RQ kernel can be added into the com-  
∂l(θ)   ∂C(θ)
posite kernel function to better model the traffic with obvious = T r C −1 (θ) − γγ T , (15)
irregular variations, e.g., the future 5G traffic. Meanwhile, it is ∂θi ∂θi
noteworthy to point out that adding or deleting the stationary where T r(·) represents the matrix trace and γ  C −1 (θ)y.
elementary kernels will not influence the GP properties, such Note that C −1 (θ) has to be re-evaluated at each gradient
that our later proposed scalable GP framework still applies. step. Such a matrix inversion requires O(N 3 ) computations,
3) Learning Objectives: The kernel function design deter- where N is the number of training data, which dominates
mines the basic form of our GP regression model, where the computational complexity of the standard GP. Conse-
the predictive performance largely depends on the goodness quently, when the wireless traffic dataset is large, i.e., N is a
of the model parameters, i.e., the hyper-parameters of kernel large number, the time consumption of each execution grows
functions. exponentially, which prohibits its application from large-scale
Generally, the hyper-parameters can be initialized with a executions. Therefore, a scalable GP framework that can fully
set of universal values when predicting different data profiles. utilize the parallel computing resources in the BBU pool of
However, specifying the initial hyper-parameters according to C-RANs is desperately needed for large-scale applications.
the observed primary patterns may help the hyper-parameter
tuning start from a better initial point, thereby improving
III. S CALABLE GP F RAMEWORK W ITH
the learning efficiency. For example, when predicting similar
ADMM AND C ROSS -VALIDATION
curves to the one in the top panel of Fig. 2, which shows a
stronger weekly periodic pattern, the magnitude of the weekly In this section, we propose a scalable GP framework for the
periodic kernel σp21 could be initialized with a larger value C-RAN based architecture given in Section II-A. The proposed
than that of the other two kernels. After the initialization, framework can efficiently train the GP model via parallel
the dominant method for tuning the model parameters is to computations based on multiple BBUs, and then optimally
maximize the marginal likelihood function, which can be combine the local inferences from different BBUs to generate
written in a closed form as an improved global prediction. A detailed work-flow of the
scalable GP framework is presented in Fig. 3. Specifically,
log p(y; θ) the framework contains one central node and multiple local
1  nodes operate individually on parallel BBUs. At the training
= − log |C(θ)|+y T C −1 (θ)y+n log(2π) , (12) phase, each local node trains a GP model based on a subset
2
of data split from the full training set and communicates
where | · | is the matrix determinant and C(θ)  K(θh ) + with the central node for joint optimizations via the ADMM
σe2 In . The model hyper-parameters θ can be tuned equiv- algorithm. At the prediction phase, each local node performs
alently by minimizing the negative log-likelihood function. predictions on both validation points and test points based on
Therefore, the learning objective of our prediction model can its well-trained GP model. Afterwards, the central node first
be written as calculates the optimal fusion weights based on the prediction
performance on validations points, then combines the local
P0 :min l(θ) = y T C −1 (θ)y+log |C(θ)|, predictions on test points based on the calculated weights for
θ
s.t. θ ∈ Θ, Θ ⊆ Rp . (13) a better and robust global prediction.
Generally, the central node is mainly responsible for i) deliv-
It is noteworthy to remark that for most of kernel functions, ering the subsets of the full dataset to the non-central nodes at
no matter a standalone one or a composite one, problem the very beginning; ii) updating the global ADMM parameters
P0 is non-convex4 and often shows no favorable structures based on the local ADMM parameters from other non-central
in terms of the hyper-parameters. Consequently, the clas- nodes at the training phase; iii) deriving the fusion weights
sic gradient descent methods, such as the limited-memory based on the local predictions results from other non-central
Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm and nodes at the prediction phase. Hence, it is feasible to transfer
the conjugate gradient methods could be used for solving the functionalities of the central node to any of the non-
the hyper-parameters but with no guarantee that the global central nodes by simply re-directing the data flow from other
minimum of P0 could be found. The work flow of the gradient non-central nodes to the newly selected central node, and
descent based method is presented as follows. In each iteration, then, the global ADMM parameters and fusion weights can
the hyper-parameters are updated as be fully recovered. On the other hand, malfunction of one
non-central node will only cause the information loss of one
∂l(θ)  subset, instead of a total failure. However, it is noteworthy to
θir+1 = θir − η · , ∀i = 1, 2 · · · , p, (14) point out that the centralized scheme can also be turned into
∂θi θ=θr
a fully distributed scheme by letting each local node select
4 However, for multiple linear kernels, such as the ones proposed
a subset of data, update global parameters, and fuse weights,
in [28], [30], [31], P0 becomes a difference-of-convex problem and efficient separately. For example, i) each local node could select its
algorithms exist for solving the hyper-parameters. own subset from the full dataset individually; ii) each local

Authorized licensed use limited to: SHANDONG UNIVERSITY. Downloaded on November 24,2024 at 05:16:41 UTC from IEEE Xplore. Restrictions apply.
XU et al.: WIRELESS TRAFFIC PREDICTION WITH SCALABLE GAUSSIAN PROCESS 1297

Fig. 3. Scalable GP framework with ADMM and Cross-Validation. (a) Training framework. (b) Prediction framework.

node could broadcast its local ADMM parameters, such that As in the PoE model [32], we approximate the probability
each of them is able to compute the global ADMM parameter distribution of the full dataset by a product of the probability
individually; iii) each local node could broadcast its local distributions of all local subsets, concretely,
prediction result, such that each of them is able to output the
final prediction result individually. However, the information 
K
p(y|X; θ) ≈ pi (y (i) |X (i) ; θ), (16a)
broadcast in the fully distributed scheme may increase the i=1
communication overhead in the C-RAN architecture.
K
With the proposed scalable GP framework, the computa- log p(y|X; θ) ≈ log p(y (i) |X (i) ; θ). (16b)
tional complexity of training could be reduced from O(N 3 ) i=1
3
of a standard GP to O( NK 3 ) of the proposed scalable GP, where The standard parallel hyper-parameter training of the PoE can
N is the number of training points and K is the number of
be expressed as
parallel BBUs, such that the dominating time consumption of
GP regression could be largely decreased by simply increasing 
K
the number of BBUs. Moreover, the complexity could be P1 : min l(i) (θ),
2 θ
further reduced to O( N K 2 ) when the kernel matrix follows a
i=1
Toeplitz structure based on the structured GP model. On the s.t. θ ∈ Θ, (17)
other hand, the computational√ complexity of fusion weights where
optimization scales as O( log K), which makes the predic-
tion phase also implementable for large-scale executions. As l(i) (θ) = (y (i) )T (C (i) (θ))−1 y (i) + log |C (i) (θ)|, (18)
such, the dominant √ computational complexity of the central
node scales as O( log K), while the dominant computational Hence, each local GP model only needs to optimize its own
2
complexity of each non-central node scales as O( N K 2 ), which
cost function as in Eq. (18) w.r.t the hyper-parameters. Such an
makes the central node usually have a lower computational optimization only requires operations on the small covariance
load than any of the non-central nodes. However, the central matrix C (i) , whose size is ni × ni , where ni denotes the
node does require a higher throughput than the non-central number of data points in the subset D(i) with ni  n. The
nodes to support the data delivery and collection, which computational complexity for each local GP model is thus
should be considered in practical implementations. In the reduced to O(n3i ) with standard GP implementation. Note that
following statements, we present the scalable GP training we use equal subset sizes for all local GP models in this paper,
framework, scalable GP prediction framework, and structured namely, ni = N K , ∀i = 1, 2, · · · , K, which could be further
GP in sequence. optimized in our future work.
The philosophy behind PoE is to approximate the covariance
matrix of full dataset with a block-diagonal matrix5 of the
A. ADMM Empowered Scalable Training Framework same size. Each block is determined by the correspond-
ing subset. Hence, the well-trained local hyper-parameters
We aim to distribute the computation cost for solving P0
should ideally be the same, namely θi − θj = 0, ∀i, j,
evenly to a bunch of parallel BBUs for training speed-up.
to make the approximation from block-diagonal matrix consis-
Given the full training dataset D  {X, y}, we define a set of
tent with the origin full matrix. Therefore, existing PoE-based
K training subsets, denoted as S  {D(1) , D(2) , · · · , D(K) }.
methods [18], [32], [33] assumed that all local GP models
Each subset D(i)  {X (i) , y (i) } is sampled from the full
dataset D and each local GP model is trained based on its 5 If the local subsets are overlapped, the approximated matrix would not be
own subset D(i) . a block-diagonal matrix [32]

Authorized licensed use limited to: SHANDONG UNIVERSITY. Downloaded on November 24,2024 at 05:16:41 UTC from IEEE Xplore. Restrictions apply.
1298 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 37, NO. 6, JUNE 2019

are trained jointly and shared the same set of global hyper- where the local θi -minimization step in Eq. (21a) optimizes
parameters θ. However, such joint training can only be real- the local cost function and reduces the distance between the
ized via rigorous gradient consensus. Specifically, for each local and global estimates of the hyper-parameters at the
gradient step in Eq. (15), the local cost and local derivative same time. The ADMM-based consensus only requires ncons ∗
information should be collected and coordinated as the global (2 ∗ dim(θ) + 1) communication overheads, where ncons is the
cost and derivative. Such global information is then used number of ADMM iterations and we have ncons  ngrads .
to update a globally shared hyper-parameters θ, which is Another benefit of such a ADMM-based consensus is that,
transmitted back to each local GP model afterwards. However, if one local GP model is stuck at a bad local minimum,
the gradient-consensus steps, on the other hand, requires the global consensus may re-start at a more reasonable point
ngrads ∗ (dim(θ) + 1) communication overheads for local cost for better converged result next time.
and derivative collection, where ngrads is the number of gradi- The optimality conditions for the ADMM solution are
ent steps and dim(θ) is the number of hyper-parameters to be determined by the primal residuals Δp and dual residuals
optimized. Meanwhile, the rigorous synchronous requirement Δd [34]. They can be given for each local GP model as
per gradient step also restricts its practical application in real
i,p = θi
Δr+1 − z r+1 , i = 1, 2, · · · , K,
r+1
(22)
systems.
Therefore, we aim to empower the distributed GP model Δr+1
d = ρ(z r+1
− z ).
r
(23)
with the powerful ADMM algorithm to develop a principled The above two residuals will converge to zero as ADMM
parallel training framework. The ADMM-empowered frame- iterates.
 r  Hence, the stopping criteria for our problem comprises
work allows each distributed units, e.g., the BBU in C-RANs, Δp  ≤ pri and Δr ≤ dual , where pri and dual are the
2 d 2
to be trained independently and to be coordinated with much feasibility tolerance constants for the primal and dual residuals,
less communication overhead. The proposed training frame- respectively. As suggested in [34], they can be set as
work could lay foundations for future scalable GP system √
design.
pri
= p abs + rel max { θir 2 , z r 2 } , (24)

In general, ADMM takes the form of a decomposition- dual
= p abs + rel ρζ r 2 , (25)
coordination procedure, where the original large problem is
decomposed into small local subproblems that can be solved in where p denotes the dimension of θ in the l2 norm.
a coordinated way [34]. Based on ADMM, problem P1 can be The detailed ADMM-empowered GP training procedure is
recast equivalently with the newly introduced local variables summarized in Algorithm 1. It should be pointed out that
θi and a common global variable z as even for convex problems, the convergence of ADMM could
be very slow [34]. But fortunately, a few iterations is often

K
sufficient for ADMM to converge to an acceptable accuracy
P2 :min l(i) (θi ), level in practical applications.
θi
i=1
s.t. θi − z = 0, i = 1, 2, . . . , K,
Algorithm 1 ADMM-Empowered Scalable GP Training
θi ∈ Θ, i = 1, 2, . . . , K. (19)
1: Initialization: r = 0, K, θi0 , ζi0 , ρ, z 0 =
1  K 0 1 0
Note that the above two problems P1 and P2 are equivalent. i=1 θi + ρ ζi , tolerance
abs
K , rel .
But with the new formulation, each local GP model is free 2: Iteration:
to train its own local hyper-parameters θi based on the local 3: while ||Δri,p ||2 ≥ pri or ||Δrd ||2 ≥ dual do
subset D(i) , where the local hyper-parameters will eventually 4: r = r + 1.
converge to the global hyper-parameter z after a few ADMM 5: for i = 1, · · · , K do
iterations, as presented below. 6: Obtain the parameters for local GP i by Eq. (21a).
To solve problem P2 with ADMM, we formulate the 7: end for
augmented Lagrangian as 8: Obtain the global parameters by Eq. (21b).
9: Obtain the dual variable by Eq. (21c).
L(θ1 , · · · , θK , ζ1 , · · · , ζK , z) r+1
10: Calculate the primal residuals Δi,p and the dual resid-
 K
ρ r+1
uals Δd by Eq. (22) and Eq. (23), respectively.
 l(i) (θi ) + ζiT (θi − z) + θi − z 2
2 , (20)
2 11: Update the feasibility tolerance pri and dual by Eq. (24)
i=1
and Eq. (25).
where ζi is the dual variable and ρ > 0 is a fixed augmented 12: end while
Lagrangian parameter. The sequential update of ADMM para- 13: Output: Global parameter z.
meters in the (r + 1)-th iteration can be written as
ρ
θir+1: = argθi min l(i) (θi )+ζiT (θi −z)+ θi −z 22 ,
2
(21a) B. Cross-Validation Based Scalable Fusion Framework
 K   After having trained the GP model hyper-parameters,
1 1
z r+1 : = θir+1 + ζir , (21b) we need to fuse the local predictions from all BBUs to get a
K i=1 ρ
global prediction. In contrast to the existing fusion strategies
ζir+1 : = ζir + ρ(θir+1 − z r+1 ), (21c) that are based on empirical weights, e.g., using the entropy as

Authorized licensed use limited to: SHANDONG UNIVERSITY. Downloaded on November 24,2024 at 05:16:41 UTC from IEEE Xplore. Restrictions apply.
XU et al.: WIRELESS TRAFFIC PREDICTION WITH SCALABLE GAUSSIAN PROCESS 1299

weights directly in [18] and [33], we propose to optimize the


weights via cross-validation, which could provide a reliable
fusion quality with concrete theoretical analysis. Meanwhile,
as suggested in [33], we aim to achieve three desirable prop-
erties for the fusion process: 1) the predictions are combined
based on the prior or posterior information rather than being Fig. 4. The full training set for each RRH is split into a training set and
fixed, which gives the combined model more generalization a validation set, where the former is used in the hyper-parameters θ learning
while the latter is used in the fusion weights β optimization. The test set
power; 2) the combination should follow a valid probabilistic remains the same, which is the prediction target y∗ .
model, which helps preserve the distinct GP properties, e.g.,
the posterior variance for prediction uncertainty evaluation;
3) the combined prediction should alleviate the influence from guarantee a robust performance for general regression tasks,
bad local predictions, which makes the global prediction robust we first optimize the prediction performance on the validation
against bad local minimals. set, then use the optimized weights to combine local inferences
In the following statement, we first present the generalized for test set. Specifically, optimizing the prediction performance
PoE framework discussed in [18] and [33], which can assign on the validation set can be formulated as the minimization of
explicit weighting terms to local predictions while preserving the prediction residuals:
the probabilistic GP structure. Then we propose our cross-
validation based weighting model, and show that the opti- 
M

mization problem could be expressed in a convex form to min (ym − ỹm )2 ,


β
obtain the global optimal solution under certain conditions. m=1

In other cases, the optimization problem can be solved via the s.t. β ∈ Ω, (29)
mirror descent method for locally optimal solution efficiently
where M is the size of the validation set, ỹm is the
with convergence guarantees. Finally, we present a simplified
combined
 predictionon the validation point m and Ω =
weighting strategy that does not require information sharing
β ∈ RK + e β = 1
: T
restricts the weights β to be in a
and has a constant computational cost.
probability simplex.
The generalized PoE model proposes to bring in an explicit
Based on the joint posterior estimation from parallel BBUs
weight parameter β in the prediction phase, to balance the
given by Eq. (26) - (28), the combined prediction ỹm can be
importance among different local predictions. The revised
written as
predictive distribution is

K 
K

p(f∗ |x∗ , D) ≈ pβi i (f∗ |x∗ , D(i) ), (26) ỹm = argf̃m max pβk f˜m |μk (xm ), σk (xm ) (30a)
k=1
i=1

K
where βi is the weight for the i-th local GP model and the 2
= σm βi σi−2 (xm )μk (xm ) (30b)
corresponding posterior mean and variance are, respectively, i=1
K −2
 i=1 βi σi (xm )μk (xm )
K
μ∗ = (σ∗ )2 βi σi−2 (x∗ )μk (x∗ ), (27) = K −2
. (30c)
i=1 βi σi (xm )
i=1
 −1

K Therefore, the optimization problem proposed in Eq. (29) can
σ∗2 = βi σi−2 (x∗ ) . (28) be re-cast as
i=1  K 2
M
a (x )β
Consequently, the choice of β is vital to the prediction P3 : min f (β) = ym − i=1
i m i
,
K
i=1 bi (xm )βi
phase. Existing works, e.g., [18], [33], exploited the differ- β
m=1
ential entropy as the weights. However, the entropy-based s.t. β ∈ Ω, (31)
weights cannot guarantee a robust improvement on the pre-
diction accuracy. Besides, [33] also pointed out that the lack where
of change in entropy did not necessarily mean an irrelevant
prediction: e.g., the kernel could be mis-specified, which ai (xm ) = σi−2 (xm )μi (xm ), (32a)
indicates that the differential entropy may not always be bi (xm ) = σi−2 (xm ). (32b)
reliable.
For general regression tasks, the points with shorter input The convexity of problem P3 depends on the size of the
distances are usually deemed, with a higher probability, as hav- validation set. Specifically, when we use a single point for
ing similar target values, implying that training points that are validation, i.e., M = 1, problem P3 can be cast into a convex
closer to the desired test points should be more informative form, where the global optimal point can be obtained; when we
when performing predictions [24]. Hence, as shown in Fig. 4, use more than one points for validation, i.e., M > 1, problem
we divide the full training set into two parts, i.e., the training P3 will be non-convex, but can be solved rather efficiently
set and the validation set, where the validation set consists of for suboptimal solutions. Detailed analysis is conducted as
the training points that are closer to the test set. In order to follows.

Authorized licensed use limited to: SHANDONG UNIVERSITY. Downloaded on November 24,2024 at 05:16:41 UTC from IEEE Xplore. Restrictions apply.
1300 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 37, NO. 6, JUNE 2019

1) Global Optimal Solution With a Single Validation Point: We use the Kullback-Leibler (KL) divergence K for
When using a single point for validation, the objective function Dψ (β, β r ) in our problem, where ψ(x) = i=1 xi log xi
in problem P3 could be simplified as with i = 1, 2, · · · , K. The subproblem given by Eq. (38a)
 K 2 is a convex problem. By taking the gradient with respect to β,
a i (x∗ )β i we obtain the optimal condition of a specific local GP model
f (β) = y∗ − i=1 (33a)
K i as
i=1 bi (x∗ )βi
 2 1 r+ 1
K g(βir ) + r ∇ψ(βi 2 ) − ∇ψ(βir ) = 0, (39a)
= y∗ − ai (x∗ )ri , (33b) η
r+ 12
i=1 ⇐⇒ ∇ψ(βi ) = ∇ψ(βir ) − η r g(βir ), (39b)
K r+ 12 r+ 1
where rk = βk / i=1 bi (x∗ )βi , such that we can transform ⇐⇒ log(βi )= log(βi 2 ) −η r
g(βir ), (39c)
problem P3 into a classic quadratic programming (QP) prob- r+ 1
⇐⇒ βi 2 = βir exp {−η r gi } . (39d)
lem as
 2 The
K  subproblem given
 by Eq. (38b) over the simplex Ω =
P4 : min f (r) = y∗ − ai (x∗ )ri , β ∈ RK+ :e β = 1
T
is also a convex problem, where the
r Lagrangian function can be written as
i=1

K K 
K
βi K
r+ 12

s.t. bi (x∗ )ri = 1, L= βi log r+ 1 − βi −βi +λ βi −1 . (40)
i=1 i=1 βi 2 i=1 i=1
ri ≥ 0, i = 1, 2, . . . , K. (34)
For all i = 1, . . . , K, we have
The global optimal solution r ∗ for problem P4 can be easily ∂L βi
obtained.∗ Finally, the optimal weights can be calculated as = log r+ 1 + λ = 0. (41)

βi∗ = ir∗ .
r
i i
∂βi β 2 i
2) Local Optimal Solution With Multiple Validation Points: r+ 1
Thus, we have βi = γβi 2 , ∀i
= 1, 2, · · · , K. Given eT β =
When using multiple points for validation, we could use mirror 1
1, we have γ = T r+ 1 . Therefore, the projection with
descent to solve problem P3 for locally optimal solutions. The e β 2
mirror descent method is a first-order optimization procedure respect to the KL divergence in Eq. (38b) amounts to a simple
which provides an important generalization of the sub-gradient renormalization as
descent method towards non-Euclidean geometries [35], and r+ 12
βi
has been applied to many large-scale optimization problems βir+1 = 1 . (42)
in machine learning applications, e.g., online learning [36], eT β r+ 2
multi-kernel learning [37]. The main motivation for apply- Therefore, the update rules given in Eq. (38a) and Eq. (38b)
ing mirror descent to problem P3 is that, mirror descent become, respectively,
can outperform the regular projected sub-gradient method r+ 12
when dealing with optimization problems on high-dimensional βi = βir exp {−η r gir } , (43a)
r+ 1
spaces [35], [38], which in our case, refers to the number of βi 2
BBUs. βir+1 = 1 . (43b)
eT β r+ 2
We first approximate the prediction residual f (β) around
β r by the first-order Taylor expansion as The convergence analysis is presented as follows. Bounding
by ||gir ||22 ≤ G and ||Dψ (β ∗ , β 1 )|| ≤ R, ∀i = 1, . . . , K,
f (β) ≈ f (β r ) + g(β r ), β − β r . (35) the convergence rate of mirror descent with KL divergence
in Eq. (43) can be given by [35]
Then, we penalize such displacement with a Bregman diver- T
gence term, which makes the update of β as 2R2 + G2 r=1 (η r )2
min k ≤  , (44)
1≤r≤T 2 Tr=1 η r
β r+1 = argβ min ∈ Ωq(β r ), (36)
where k = f (β r ) − f (β ∗ ) and β ∗ is the global optimum
where point. The upper bound given in Eq. (44) is a convex and
1 symmetric function of η r , such that the optimal upper bound
q(β r ) = f (β r )+ g(β r ), β−β r + Dψ (β, β r ), (37)
be achieved when setting a constant step-size, i.e., η =
r
ηr can
R 2
with g(β r ) the gradient of f (β r ) and Dψ (β, β r ) = ψ(β) − G T , which reduces the upper bound to be

ψ(β r ) − ∇ψ(β r )T (β − β r ) the Bregman divergence. Consis- 


2
tent with the sub-gradient descent rule, the update of Eq. (36) min k ≤ RG , (45)
1≤r≤T T
can be resolved as
1 −1
1
β r+ 2 = argβ min q(β r ), (38a) K T∗ denotes∗the number of iterations.
where √ With βi = n and
1 i=1 βi = 1, βi > 0, we have R = log K. The detailed
β r+1 = argβ∈Ωmin Dψ (β, β r+ 2 ). (38b) procedures are presented in Algorithm 2.

Algorithm 2 Mirror Descent for Weight Optimization
 1: Initialization: r = 0, β^0, η^r = η^0, tolerance ε_mirror.
 2: Iteration:
 3: while |f(β^r) − f(β*)| ≥ ε_mirror do
 4:    r = r + 1.
 5:    for i = 1, . . . , K do
 6:       Optimize the unconstrained weights by Eq. (43a).
 7:       Re-normalize the weights by Eq. (43b).
 8:    end for
 9: end while
10: Output: Optimal fusion weights β.
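For concreteness, a minimal NumPy sketch of the multiplicative update in Eqs. (43a)-(43b) is given below. The residual and its gradient are illustrative stand-ins: f here is a plain least-squares residual over the validation points, with A holding per-BBU prediction coefficients, which is an assumption for demonstration rather than the paper's exact construction. Since f(β*) is unknown in practice, the sketch also stops when the iterate movement falls below the tolerance instead of the |f(β^r) − f(β*)| test in Algorithm 2.

```python
import numpy as np

def mirror_descent_weights(A, y, eta=0.1, tol=1e-8, max_iter=1000):
    """Exponentiated gradient (mirror descent with KL divergence) on the simplex.

    A: (n_val, K) matrix of local-model prediction coefficients (assumed given),
    y: (n_val,) validation targets. Minimizes f(beta) = ||y - A beta||^2 over
    the probability simplex via the updates of Eqs. (43a)-(43b).
    """
    n_val, K = A.shape
    beta = np.full(K, 1.0 / K)                   # beta^1 = 1/K (uniform start)
    for _ in range(max_iter):
        grad = -2.0 * A.T @ (y - A @ beta)       # gradient of the residual f
        beta_half = beta * np.exp(-eta * grad)   # Eq. (43a): multiplicative step
        beta_new = beta_half / beta_half.sum()   # Eq. (43b): renormalization
        if np.abs(beta_new - beta).sum() < tol:  # surrogate stopping rule
            return beta_new
        beta = beta_new
    return beta

# Example: K = 4 local models, 3 validation points (synthetic).
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4)); y = rng.normal(size=3)
print(mirror_descent_weights(A, y))
```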
3) Simplified Soft-Max Based Solution: Intuitively, the validation dataset should be carefully selected to mimic the test data profile so as to achieve overall better prediction performance. The computational complexity of mirror-descent-based weight optimization increases with the number of BBUs, which may fail to meet the response time requirement when using massive parallel BBUs for delay-sensitive applications. Hence, we propose a simplified fusion weight optimization method based on the soft-max function. Specifically, we set the weights to be inversely related to the average prediction error on the validation set, and we smooth such proportion with a soft-max function as

    β_k = exp(−e_k) / Σ_{j=1}^{K} exp(−e_j),                         (46)

where e_k is the averaged error (e.g., the root mean square error (RMSE)) obtained for the k-th local GP model on the validation dataset. Note that the soft-max based weighting strategy does not involve any joint optimization and has a constant computational complexity, but its fusion performance is slightly degraded compared with the preceding optimization-based methods. Detailed comparisons are presented in the experimental results.
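In code, Eq. (46) is essentially a one-liner; the sketch below assumes the per-BBU validation errors e_k have already been computed:

```python
import numpy as np

def softmax_fusion_weights(errors):
    """Eq. (46): weights inversely related to validation error, soft-max smoothed.

    errors: length-K array, where e_k is the averaged validation error
    (e.g., RMSE) of the k-th local GP model. Shifting by the maximum of
    -e_k improves numerical stability without changing the result.
    """
    z = -np.asarray(errors, dtype=float)
    z -= z.max()                      # stabilize exp() against overflow
    w = np.exp(z)
    return w / w.sum()

print(softmax_fusion_weights([0.8, 1.2, 0.9]))  # lower error -> larger weight
```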
C. Extra Complexity Reduction for Structured Kernel Matrices

As pointed out in Section II, GP regression models need to take the inverse and log-determinant of a kernel matrix at each iteration of the classic gradient descent method when learning the optimal hyper-parameters. The standard implementation of Eq. (15) relies on the Cholesky factorization [24], which is computationally demanding when dealing with a large dataset. However, when there is a certain favorable structure in the kernel matrix, the computational complexity can be further reduced. For instance, the regularly recorded wireless traffic data from RRHs form a regularly-spaced time series, which is common for standard wireless systems. Additionally, the tailored kernel function for wireless traffic prediction proposed in Section II-C2 is composed of stationary kernels. Together, these give rise to a Toeplitz structure in the kernel matrix, such that fast matrix operations can be applied to reduce the computational complexity. Consequently, each BBU can exploit the structured GP method to reduce the computational complexity from the aforementioned O(N³/K³) to O(N²/K²) for further speed acceleration. More details about the structured GP can be found in our previous work [39].
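As a minimal illustration of why the Toeplitz structure helps, the sketch below solves the kernel system Kα = y with scipy.linalg.solve_toeplitz (Levinson recursion, O(n²)) instead of a general dense solve (O(n³)). The squared-exponential kernel and the hyper-parameter values here are only stand-ins for a stationary kernel on a regular time grid; the paper's tailored kernel is likewise stationary, hence also Toeplitz in this setting.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

# On a regularly-spaced time grid, a stationary kernel k(t_i, t_j) = k(|t_i - t_j|)
# yields a symmetric Toeplitz kernel matrix fully described by its first column.
n = 500
lags = np.arange(n)                              # |t_i - t_j| on a unit (hourly) grid
ell, sigma_n = 24.0, 0.1                         # stand-in hyper-parameters
first_col = np.exp(-0.5 * (lags / ell) ** 2)     # squared-exponential (stationary) kernel
first_col[0] += sigma_n ** 2                     # noise variance on the diagonal

y = np.random.default_rng(2).normal(size=n)      # placeholder observations
alpha = solve_toeplitz((first_col, first_col), y)  # O(n^2) solve of K alpha = y

# Sanity check against the dense O(n^3) solve.
alpha_dense = np.linalg.solve(toeplitz(first_col), y)
print(np.max(np.abs(alpha - alpha_dense)))       # should be tiny
```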
Fig. 5. Each polygon represents an area served by a certain RRH cluster.

IV. EXPERIMENTAL RESULTS

A. Data Description and Pre-Processing

The experimental dataset contains hourly recorded downlink PRB usage histories of 3072 base stations in three southern cities of China, from September 1st to September 30th, 2015. Each traffic flow history profile on each day contains 24 data points corresponding to 24 hours. The base stations can be treated as the RRHs in the C-RANs. As the 4G networks had been commercially used for less than one year when the data was collected, each cell usually contains fewer than ten active 4G users per hour. To set a more realistic data profile, we group the RRHs (the cells) into 360 clusters according to their geographical distribution by the K-means clustering algorithm, and the result is shown in Fig. 5. Accordingly, each RRH cluster provides services for a certain area with an average coverage range of around 1 km. The PRB usages are aggregated among all RRHs within the same cluster to obtain the wireless traffic of its served area, which is our prediction target. The number of users in each RRH cluster per hour is thus increased by a rough order of one hundred. Consequently, as shown in Fig. 2, the aggregated traffic exhibits obvious periodic patterns along with strong dynamic deviations that are highly likely to appear in 5G scenarios, as discussed in Sec. II-B.

B. Performance Metrics

In this paper, we use the RMSE and the mean absolute percentage error (MAPE), averaged over multiple test points, as the performance metrics:

    e_RMSE = √( (1/n_*) Σ_{i=1}^{n_*} ( y_*(x_i) − y(x_i) )² ),      (47a)

    e_MAPE = (1/n_*) Σ_{i=1}^{n_*} | ( y_*(x_i) − y(x_i) ) / y(x_i) | × 100,   (47b)

where n_* is the number of test data points, y_*(x_i) is the posterior GP prediction for the test input x_i, and y(x_i) is the ground truth.
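The two metrics of Eq. (47) in a short NumPy form; y_pred plays the role of y_*(x_i) and y_true that of y(x_i):

```python
import numpy as np

def rmse(y_pred, y_true):
    """Eq. (47a): root mean square error over the n_* test points."""
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

def mape(y_pred, y_true):
    """Eq. (47b): mean absolute percentage error, in percent."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean(np.abs((y_pred - y_true) / y_true)) * 100.0

y_true = np.array([10.0, 12.0, 9.0]); y_pred = np.array([11.0, 11.5, 9.5])
print(rmse(y_pred, y_true), mape(y_pred, y_true))
```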
TABLE I
TIME CONSUMPTION FOR TRAINING PHASE

TABLE II
TIME CONSUMPTION FOR PREDICTION PHASE

C. Results Analysis

In the following experiments, we consider three baselines for comparing the GP against other prediction models: the seasonal ARIMA model [7], the sinusoid superposition model [8], and the recurrent neural network with long short-term memory (LSTM) based on deep learning techniques [10], referred to as SARIMA, SS, and LSTM for short. Meanwhile, we consider another three baselines for comparing our proposed scalable GP framework against other scalable GP models: the full GP, the rBCM model based on the distributed GP [18], and the subset-of-data (SOD) model based on the sparse GP [40]. The full GP operates on the full dataset. The rBCM operates on subsets generated by random data assignments, and its training operates in the gradient-consensus manner described in Section III-A to generate its best performance. The SOD uses a random subset of the full training data as the training set, which was identified as a good and robust sparse GP approximation in [18] and [40].

1) Operation Time: In Table I, we show that, for each execution, the training time of our scalable training framework can be largely reduced by increasing the number of running BBUs. Moreover, the structured GP (TPLZ) further reduces the time consumption of the standard GP (STD). In Table II, we show that the prediction time is much less than the training time, especially for the soft-max based weighting model. The simulations are performed on our 4G wireless traffic dataset from one RRH cluster with 700 data points. The BBU pool is simulated on a workstation with eight Intel Xeon E3 CPU cores at 3.50 GHz and 32 gigabytes of memory. The number of BBUs varies from two to sixteen with the same computing power, simulated by activating them on the workstation in rotation. In Table I, each BBU trains a local GP model on a subset of size N/K sampled from the full dataset, where K is the number of BBUs and N is the size of the full dataset, i.e., 700. To guarantee a fair comparison, we initialize all local GP models with the same hyper-parameters in all experiments, and average the results over 100 runs. In Table II, we use three validation points in the weight optimization for all simulations. The results also indicate that when using the mirror descent based fusion method, the time consumption increases slowly with an increased number of BBUs. For the soft-max based fusion method, however, the time consumption decreases slowly, because the fusion weight optimization no longer dominates the time consumption; instead, the posterior inference does, which becomes faster when executed on smaller subsets.

As for the time consumption of the competing schemes: i) as shown in Tables I and II, the gradient-consensus based training phase of the rBCM exhibits a time consumption similar to that of our scalable training framework with STD (without the acceleration from the structured GP), and the entropy-weighting based prediction phase of the rBCM exhibits a time consumption similar to our soft-max based prediction phase with cross-validation; ii) the SARIMA, SS, and LSTM methods require about 4 s, 2 s, and 270 s per single execution, respectively, using the full computing power of the same workstation. Noticeably, with the acceleration from the structured GP, our proposed scalable GP framework can run faster than the SARIMA, SS, and LSTM models when using more than two BBUs for parallel computation. Moreover, in what follows, we show that the prediction performance of our scheme also surpasses the competing schemes considerably.

2) Performance of Full GP: In the following experiments, we present the performance comparisons between the GP and other prediction models. All baseline models are trained with the full dataset, and their predictions are based on temporal information only, consistent with our GP model. We use one BBU to train our proposed GP-based model directly on the full dataset to generate the best prediction performance, referred to as the full GP for short. Note that the training and prediction phases of the full GP model are consistent with the standard GP model, which is centralized without the scalable framework.

In Fig. 6, we present an overview of the prediction results of three RRH clusters, which illustrates the prediction quality in accordance with the wireless traffic patterns presented in Section II-B. Specifically, for all models in Fig. 6(a), each time we use 300 data points as the training set and predict the next data point (one hour ahead). We then iteratively update the training set by removing the oldest data point and adding one new data point, and so on. Since the remaining test set contains 420 data points, we can repeat the predictions 420 times for each virtual base station.
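The rolling (sliding-window) evaluation protocol just described, as a short sketch; predict_one_step stands in for any of the compared models and is an assumed interface for illustration, not one from the paper:

```python
import numpy as np

def rolling_one_step_eval(series, train_len, predict_one_step):
    """Slide a fixed-length training window over the series, predicting
    one hour ahead at each step, as in the Fig. 6(a) protocol."""
    preds, truths = [], []
    for t in range(train_len, len(series)):
        window = series[t - train_len:t]         # most recent train_len points
        preds.append(predict_one_step(window))   # model-specific one-step forecast
        truths.append(series[t])
    return np.array(preds), np.array(truths)

# Example with a naive persistence forecaster as the stand-in model.
series = np.sin(np.arange(720) * 2 * np.pi / 24)       # synthetic hourly traffic
preds, truths = rolling_one_step_eval(series, 300, lambda w: w[-1])
print(np.sqrt(np.mean((preds - truths) ** 2)))         # RMSE of the naive baseline
```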
Fig. 6. One-hour look-ahead prediction of three RRH clusters. (a) One-hour look-ahead. (b) Twenty-four-hour look-ahead.

Fig. 6(a) illustrates the prediction results obtained by iteratively updating and predicting 400 times, where each panel corresponds to a different type of PRB curve, in accordance with the ones presented in Section II-B. Specifically, the top and middle panels show that the GP model, SARIMA, and LSTM can capture both daily and weekly patterns, whereas SS can only capture the daily pattern. The SARIMA model is more easily influenced by bursts in traffic and thus generates a sharp prediction curve, which can sometimes largely degrade the performance, as shown in the bottom panel. The LSTM sometimes overestimates or underestimates the general trend. The GP model usually generates a more robust prediction curve that fits the overall trend. In Fig. 6(b), we present the posterior variance obtained by the GP model for a twenty-four-hour-ahead prediction, where the gray area represents the 95% confidence interval (CI). The result shows that most of the wireless traffic variations fall into the predicted 95% confidence interval, which indicates that the posterior variance is promising for robust control in wireless systems.
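The 95% confidence interval in Fig. 6(b) follows directly from the Gaussian posterior; a one-line sketch, assuming the posterior mean mu and variance var have already been computed by the GP:

```python
import numpy as np

def gp_confidence_interval(mu, var, z=1.96):
    """95% CI of a Gaussian posterior: mu +/- 1.96 * sqrt(var)."""
    sd = np.sqrt(np.asarray(var))
    return np.asarray(mu) - z * sd, np.asarray(mu) + z * sd

lo, hi = gp_confidence_interval(mu=np.array([5.0, 6.2]), var=np.array([0.4, 0.9]))
print(lo, hi)
```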
Fig. 7. Averaged MAPE and RMSE of downlink PRB usage predictions.

Fig. 8. The prediction performance of the training phase.

In Fig. 7, we show the averaged RMSE and MAPE of all models with the prediction length varying from one hour to ten hours. We repeat the prediction 400 times for each cluster and further average the performance over 70 clusters to eliminate the influence of both temporal and spatial impacts; we therefore believe that the obtained performance presents a fair conclusion for general prediction tasks. Generally, both the RMSE and the MAPE curves show that our proposed GP model surpasses all the competitors. Specifically, the full GP model yields a prediction error of around 3.5% when predicting the one-hour-ahead traffic; the error gradually rises to 4.3% when the prediction length is extended to 10 hours, then stays stable. The SARIMA yields a prediction error of around 4% at the beginning, increasing to 8% at the end. The LSTM yields a prediction error of around 5.2% at the beginning and 6.8% at the end. The sinusoid superposition has a relatively stable but higher MAPE of around 6.2%. Note that although the prediction performance of SARIMA is close to that of our proposed model for one-hour-ahead prediction, it degrades quickly with an increased prediction scope. However, it is important to ensure the prediction accuracy on both the short-term and long-term traffic trends, in order to support real-world traffic-aware managements that aim at maximizing the cumulative utility over a long time range. For example, traffic-aware RRH on/off operations should be determined according to both the short-term and long-term traffic variations in the future, thereby preventing frequent on/off operations to reduce operational expenditure.

3) Performance of Scalable GP: In the following experiments, we present the comparison between our proposed scalable GP framework and other scalable GP models to demonstrate the superiority of our method. For each prediction, we use 600 data points as the full training set and predict the next 10 upcoming data points. We repeat the prediction 100 times for each RRH cluster and further average the performance over 100 RRH clusters to eliminate the influence from random data assignments and abnormal traffic variations.
Fig. 9. The performance of the prediction phase. 1) For our scheme, we set the validation set to be one-point-length, i.e., the closest training point to the test points. 2) For rBCM, we use the differential entropy between the prior and posterior data distributions as the fusion weights, as in [18]. 3) For SoD, we randomly choose one-third of the full dataset as the subset to train the sparse GP model, which does not require fusion weights.

Fig. 10. The one-hour-ahead prediction performance. (a) Performance without concatenation. (b) Performance with concatenation.

In Fig. 8, we show the performance of the training phase. The experiments are conducted on three BBUs. In order to compare the performance of the training phase solely, all schemes use the same process in the prediction phase as in the full GP, i.e., we use the full dataset for the posterior inference. The curves in Fig. 8 show that both our ADMM based GP and the gradient-consensus based rBCM schemes can approximate the training results of the full GP well, with about 9% performance loss. The result indicates that our proposed ADMM based training scheme achieves the same performance as the gradient-consensus based rBCM, but with much lower communication overhead. Specifically, as discussed in Section III-A, the communication overheads required by the gradient-consensus based rBCM and our scheme are n_grads · (dim(θ) + 1) and n_cons · (2 · dim(θ) + 1), respectively. For each prediction, training the GP model usually requires a few hundred gradient steps, i.e., n_grads ≥ 100, whereas only a few rounds of ADMM consensus, i.e., n_cons ≤ 10, are already sufficient for our scheme to converge to an acceptable accuracy level. Therefore, our scheme generally saves an order of magnitude of communication overhead compared with the gradient-consensus based rBCM.
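Plugging representative numbers into these overhead expressions makes the gap concrete; dim(θ) = 4 below is a placeholder for the number of kernel hyper-parameters, not a value taken from the paper:

```python
# Communication overhead (scalars exchanged) for the two training schemes.
dim_theta = 4               # assumed number of hyper-parameters (placeholder)
n_grads, n_cons = 100, 10   # representative counts quoted in the text

rbcm_overhead = n_grads * (dim_theta + 1)      # gradient-consensus rBCM
admm_overhead = n_cons * (2 * dim_theta + 1)   # ADMM-based scheme

print(rbcm_overhead, admm_overhead)  # 500 vs. 90; the gap widens as n_grads grows
```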
In Fig. 9, we show the performance of the prediction phase. The experiments are conducted on three parallel machines. In this experiment, we bring in the fusion algorithm of the prediction phase for all three schemes. Therefore, in the prediction phase, the scalable GP models no longer have the full dataset; instead, they need to merge the local predictions into the overall result. The figure shows that our optimization based algorithm best approximates the predictions of the full GP, with only 8% performance loss at the beginning, narrowing down to 5% at the end, almost the same as using the full dataset for prediction. The soft-max based algorithm is slightly worse than our optimization based algorithm on short-term predictions, but reaches the same level at the later points. However, both the soft-max based and optimization based fusion schemes surpass the rBCM and SoD.

In Fig. 10(a), we show the one-hour-ahead prediction performance when varying the number of validation points. The experiments are conducted on three parallel machines. The result shows that when the validation set contains only one point, i.e., the closest point to the test points, the prediction model reaches the best performance. The RMSE grows slowly as the validation set size increases. This trend may be due to the fact that our traffic dataset is hourly recorded, such that optimizing the weights on a larger validation set may cause the GP model to "over-fit" the previous traffic that is far away from the present traffic, while ignoring the most recent changes. Currently, we only use the short training set, as shown in Fig. 4, for posterior inference after determining the fusion weights. However, because the validation points contain valuable information about the recent changes close to the test set, the performance can be further improved by using the full training set to calculate the posterior mean and variance, which we refer to as the concatenating scheme. Specifically, in Fig. 10(b), we show that the concatenating scheme brings an obvious improvement for all numbers of validation points. Moreover, notice that when using one point for validation, the scalable GP with multiple BBUs can even surpass the full GP with one BBU. This is because, for certain RRHs, the wireless traffic has many abnormal variations. The full GP uses all the historical traffic data, including those abnormal points, to infer the future traffic; therefore, abnormal points close to the test set have a large influence on the prediction results, which may harm the final predictions. The scalable GP, however, can filter out bad local inferences based on cross-validation, i.e., by assigning lower weights to poor predictions on the most recent traffic, which results in an improved and robust prediction performance compared with the full GP. Additionally, it should be pointed out that the optimal validation set size may differ for different datasets. Setting the validation set to one-point-length is the optimal choice for our hourly recorded wireless traffic dataset, but may not be so in general.

Fig. 11. General prediction quality for different numbers of BBUs. The number of BBUs varies over 2, 3, 4, 8, and 16, with 300, 200, 150, 75, and 37 training points, respectively.

In Fig. 11, we present the general scalability performance by averaging the prediction performance from one-hour-ahead all the way up to ten-hour-ahead for all schemes.
The result shows that for all scalable GP models, the prediction performance becomes worse with an increasing number of parallel BBUs, or in other words, with a decreasing number of local training points. Moreover, the results also show that, either with or without the concatenating scheme, our proposed scalable GP model outperforms the rBCM and SoD with considerable performance gains.

V. CONCLUSIONS

In this paper, we proposed an ADMM and cross-validation empowered scalable GP framework, which can be executed on our proposed C-RAN based wireless traffic prediction architecture. Hence, the framework can exploit the parallel BBUs in C-RANs to meet the large-scale prediction requirements for massive RRHs in a cost-effective way. First, we proposed the standard GP model with the kernel function tailored for wireless traffic prediction. Second, we extended the standard GP model to a scalable framework. Specifically, in the scalable GP training phase, we trained the local GP models in parallel BBUs jointly with an ADMM algorithm to achieve a good tradeoff between the communication overhead and approximation accuracy. In the scalable GP prediction phase, we optimized the fusion weights based on cross-validation to guarantee a reliable and robust prediction performance. In addition, we proposed the structured GP model to leverage the Toeplitz structure in the kernel matrix to further reduce the computational complexity. The scalable GP framework can easily trade prediction accuracy against time consumption by simply activating or deactivating BBUs, such that C-RANs can perform cost-efficient prediction according to real-time system demands. Finally, the experimental results showed that 1) the proposed GP-based prediction model achieves better prediction accuracy than the existing prediction models, e.g., the seasonal ARIMA model, the sinusoid superposition model, and the recurrent neural network model; and 2) the proposed scalable GP model achieves the expected complexity reduction while mitigating the performance loss better than the existing distributed GP and sparse GP methods.
REFERENCES

[1] M. Peng, Y. Li, Z. Zhao, and C. Wang, "System architecture and key technologies for 5G heterogeneous cloud radio access networks," IEEE Netw., vol. 29, no. 2, pp. 6–14, Mar. 2015.
[2] K. Chen and R. Duan, "C-RAN: The road towards green RAN," China Mobile Res. Inst., Beijing, China, White Paper, Oct. 2011. [Online]. Available: https://pdfs.semanticscholar.org/eaa3/ca62c9d5653e4f2318aed9ddb8992a505d3c.pdf
[3] M. Peng, Y. Sun, X. Li, Z. Mao, and C. Wang, "Recent advances in cloud radio access networks: System architectures, key techniques, and open issues," IEEE Commun. Surveys Tuts., vol. 18, no. 3, pp. 2282–2308, Aug. 2016.
[4] Y. Li, T. Jiang, K. Luo, and S. Mao, "Green heterogeneous cloud radio access networks: Potential techniques, performance trade-offs, and challenges," IEEE Commun. Mag., vol. 55, no. 11, pp. 33–39, Nov. 2017.
[5] B. Zhou, D. He, and Z. Sun, "Traffic modeling and prediction using ARIMA/GARCH model," in Modeling and Simulation Tools for Emerging Telecommunication Networks. Boston, MA, USA: Springer, 2006, pp. 101–121.
[6] H. Feng, Y. Shu, S. Wang, and M. Ma, "SVM-based models for predicting WLAN traffic," in Proc. IEEE Int. Conf. Commun. (ICC), Istanbul, Turkey, Jun. 2006, pp. 597–602.
[7] Y. Shu, M. Yu, J. Liu, and O. W. W. Yang, "Wireless traffic modeling and prediction using seasonal ARIMA models," in Proc. IEEE Int. Conf. Commun. (ICC), Anchorage, AK, USA, May 2003, pp. 1675–1679.
[8] S. Wang, X. Zhang, J. Zhang, J. Feng, W. Wang, and K. Xin, "An approach for spatial-temporal traffic modeling in mobile cellular networks," in Proc. Int. Teletraffic Congr., Sep. 2015, pp. 203–209.
[9] L. Nie, D. Jiang, S. Yu, and H. Song, "Network traffic prediction based on deep belief network in wireless mesh backbone networks," in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), San Francisco, CA, USA, Mar. 2017, pp. 1–5.
[10] C. Qiu, Y. Zhang, Z. Feng, P. Zhang, and S. Cui, "Spatio-temporal wireless traffic prediction with recurrent neural network," IEEE Wireless Commun. Lett., vol. 7, no. 4, pp. 554–557, Aug. 2018.
[11] F. Yin and F. Gunnarsson, "Distributed recursive Gaussian processes for RSS map applied to target tracking," IEEE J. Sel. Topics Signal Process., vol. 11, no. 3, pp. 492–503, Apr. 2017.
[12] J.-B. Fiot and F. Dinuzzo, "Electricity demand forecasting by multi-task learning," IEEE Trans. Smart Grid, vol. 9, no. 2, pp. 544–551, Mar. 2018.
[13] S. Yang and F. A. Kuipers, "Traffic uncertainty models in network planning," IEEE Commun. Mag., vol. 52, no. 2, pp. 172–177, Feb. 2014.
[14] W. Xu, Y. Xu, C.-H. Lee, Z. Feng, P. Zhang, and J. Lin, "Data-cognition-empowered intelligent wireless networks: Data, utilities, cognition brain, and architecture," IEEE Wireless Commun., vol. 25, no. 1, pp. 56–63, Feb. 2018.
[15] R. Senanayake, S. O'Callaghan, and F. Ramos, "Predicting spatio-temporal propagation of seasonal influenza using variational Gaussian process regression," in Proc. AAAI Conf. Artif. Intell. (AAAI), Phoenix, AZ, USA, Feb. 2016, pp. 3901–3907.
[16] J. Quiñonero-Candela and C. E. Rasmussen, "A unifying view of sparse approximate Gaussian process regression," J. Mach. Learn. Res., vol. 6, pp. 1939–1959, Dec. 2005.
[17] V. Tresp, "A Bayesian committee machine," Neural Comput., vol. 12, no. 11, pp. 2719–2741, Nov. 2000.
[18] M. P. Deisenroth and J. W. Ng, "Distributed Gaussian processes," in Proc. Int. Conf. Mach. Learn. (ICML), Jul. 2015, pp. 1481–1490.
[19] M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong, "Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience," IEEE J. Sel. Areas Commun., vol. 35, no. 5, pp. 1046–1061, May 2017.
[20] N. Saxena, A. Roy, and H. Kim, "Traffic-aware cloud RAN: A key for green 5G networks," IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 1010–1021, Apr. 2016.
[21] I. Alawe, A. Ksentini, Y. Hadjadj-Aoul, and P. Bertin, "Improving traffic forecasting for 5G core network scalability: A machine learning approach," IEEE Netw., vol. 32, no. 6, pp. 42–49, Nov./Dec. 2018.
[22] Y. Fu, S. Wang, C. Wang, X. Hong, and S. McLaughlin, "Artificial intelligence to manage network traffic of 5G wireless networks," IEEE Netw., vol. 32, no. 6, pp. 58–64, Nov./Dec. 2018.
[23] U. Paul, A. P. Subramanian, M. M. Buddhikot, and S. R. Das, "Understanding traffic dynamics in cellular data networks," in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), Shanghai, China, Apr. 2011, pp. 882–890.
[24] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[25] D. Garcia, "Robust smoothing of gridded data in one and higher dimensions with missing values," Comput. Statist. Data Anal., vol. 54, no. 4, pp. 1167–1178, Apr. 2010.
[26] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal, "Sparse spectrum Gaussian process regression," J. Mach. Learn. Res., vol. 11, pp. 1865–1881, Aug. 2010.
[27] A. Wilson and R. P. Adams, "Gaussian process kernels for pattern discovery and extrapolation," in Proc. Int. Conf. Mach. Learn. (ICML), Atlanta, GA, USA, Jul. 2013, pp. 1067–1075.
[28] F. Yin, X. He, L. Pan, T. Chen, Z.-Q. Luo, and S. Theodoridis, "Sparse structure enabled grid spectral mixture kernel for temporal Gaussian process regression," in Proc. 21st Int. Conf. Inf. Fusion (FUSION), Cambridge, U.K., Jul. 2018, pp. 47–54.
[29] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani, "Structure discovery in nonparametric regression through compositional kernel search," in Proc. Int. Conf. Mach. Learn. (ICML), Atlanta, GA, USA, Jul. 2013, pp. 1166–1174.
[30] T. Chen, M. S. Andersen, L. Ljung, A. Chiuso, and G. Pillonetto, "System identification via sparse multiple kernel-based regularization using sequential convex optimization techniques," IEEE Trans. Autom. Control, vol. 59, no. 11, pp. 2933–2945, Nov. 2014.
[31] B. Mu, T. Chen, and L. Ljung, "On asymptotic properties of hyperparameter estimators for kernel-based regularization methods," Automatica, vol. 94, no. 1, pp. 381–395, Aug. 2018.
[32] J. W. Ng and M. P. Deisenroth, "Hierarchical mixture-of-experts model for large-scale Gaussian process regression," 2014. [Online]. Available: https://arxiv.org/abs/1412.3078
[33] Y. Cao and D. J. Fleet, "Generalized product of experts for automatic and principled fusion of Gaussian process predictions," 2014. [Online]. Available: https://arxiv.org/abs/1410.7827
[34] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, Jan. 2011.
[35] A. Beck and M. Teboulle, "Mirror descent and nonlinear projected subgradient methods for convex optimization," Oper. Res. Lett., vol. 31, no. 3, pp. 167–175, 2003.
[36] N. Srebro, K. Sridharan, and A. Tewari, "On the universality of online mirror descent," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Granada, Spain, Dec. 2011, pp. 2645–2653.
[37] S. N. Jagarlapudi, G. Dinesh, S. Raman, C. Bhattacharyya, A. Ben-Tal, and K. R. Ramakrishnan, "On the algorithmics and applications of a mixed-norm based kernel learning formulation," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Vancouver, BC, Canada, Dec. 2009, pp. 844–852.
[38] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari, "Composite objective mirror descent," in Proc. 23rd Annu. Conf. Learn. Theory (COLT), Haifa, Israel, Jun. 2010, pp. 14–26.
[39] Y. Xu, W. Xu, F. Yin, J. Lin, and S. Cui, "High-accuracy wireless traffic prediction: A GP-based machine learning approach," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Singapore, Dec. 2017, pp. 1–6.
[40] K. Chalupka, C. K. I. Williams, and I. Murray, "A framework for evaluating approximation methods for Gaussian process regression," J. Mach. Learn. Res., vol. 14, no. 1, pp. 333–350, Feb. 2013.

Yue Xu (S'19) received the B.S. degree from the Beijing University of Posts and Telecommunications (BUPT) in 2015, where he is currently working toward the Ph.D. degree. He has been a Visiting Researcher with the Department of Electrical and Computer Engineering, University of California, Davis, USA; The Chinese University of Hong Kong, Shenzhen, China; and the Shenzhen Research Institute of Big Data. His research interests include data-driven wireless network management, machine learning, large-scale data analytics, and system control.

Feng Yin (S'11–M'14) received the B.Sc. degree from Shanghai Jiao Tong University, Shanghai, China, in 2008, and the M.Sc. and Ph.D. degrees from TU Darmstadt, Germany, in 2011 and 2014, respectively. From 2014 to 2016, he was with Ericsson Research, Linkoping, Sweden. Since 2016, he has been with The Chinese University of Hong Kong, Shenzhen, China, and the Shenzhen Research Institute of Big Data. His research interests include statistical signal processing, machine learning, and sensory data fusion, with applications in wireless positioning and tracking. He was a recipient of the Chinese Government Award for Outstanding Self-Financed Students Abroad in 2013 and received the Marie Curie Scholarship from the European Union in 2014.

Wenjun Xu (M'10–SM'18) received the B.S. and Ph.D. degrees from the Beijing University of Posts and Telecommunications (BUPT), China, in 2003 and 2008, respectively. He is currently a Professor and Ph.D. supervisor with the School of Information and Communication Engineering, BUPT, and serves as a member of the Key Laboratory of Universal Wireless Communications, Ministry of Education, China. His research interests include AI-driven networks, UAV communications and networks, green communications and networking, and cognitive radio networks. He is currently an Editor of China Communications.

Jiaru Lin (M'12) received the B.S. and Ph.D. degrees from the School of Information Engineering, Beijing University of Posts and Telecommunications (BUPT), China, in 1987 and 2001, respectively. From 1991 to 1994, he studied at the Swiss Federal Institute of Technology, Zurich. He is currently a Professor and Ph.D. supervisor with the School of Information and Communication Engineering, BUPT. His research interests include wireless communication, personal communication, and communication networks.

Shuguang Cui (S'99–M'05–SM'12–F'14) received the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2005. Afterwards, he worked as an Assistant/Associate/Full/Chair Professor in electrical and computer engineering at the University of Arizona, Texas A&M University, UC Davis, and CUHK, Shenzhen. He has also been the Vice Director of the Shenzhen Research Institute of Big Data. His current research interests focus on data-driven large-scale system control and resource management, large data set analysis, IoT system design, energy-harvesting-based communication system design, and cognitive network optimization. He was a member of the IEEE ComSoc Emerging Technology Committee, an elected member of the IEEE Signal Processing Society SPCOM Technical Committee (2009–2014), and the elected Chair of the IEEE ComSoc Wireless Technical Committee (2017–2018). He was a recipient of the IEEE Signal Processing Society 2012 Best Paper Award. He has served as the General Co-Chair and TPC Co-Chair for many IEEE conferences, as the Area Editor for the IEEE Signal Processing Magazine, and as an Associate Editor for the IEEE Transactions on Big Data, the IEEE Transactions on Signal Processing, the IEEE JSAC Series on Green Communications and Networking, and the IEEE Transactions on Wireless Communications. He is a member of the Steering Committee of the IEEE Transactions on Big Data and the Chair of the Steering Committee of the IEEE Transactions on Cognitive Communications and Networking. He was elected as an IEEE ComSoc Distinguished Lecturer in 2014, and was selected as a Thomson Reuters Highly Cited Researcher and listed in the World's Most Influential Scientific Minds by ScienceWatch in 2014.