Robust Deep Learning For Wireless Network Optimization
Shuai Zhang∗ , Bo Yin∗ , Suyang Wang∗ , Yu Cheng∗
Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616
Abstract—Wireless optimization involves repeatedly solving difficult optimization problems, and data-driven deep learning techniques have great promise to alleviate this issue through their pattern matching capability: past optimal solutions can be used as the training data in a supervised learning paradigm so that the neural network can generate an approximate solution using a fraction of the computational cost, due to its high representation power and parallel implementation. However, making this approach practical in networking scenarios requires careful, domain-specific consideration, currently lacking in similar works. In this paper, we use deep learning in wireless network scheduling and routing to predict if subsets of the network links are going to be used, so that the effective problem scale is reduced. A real-world concern is the varying data importance: training samples are not equally important due to class imbalance or different label quality. To compensate for this fact, we develop an adaptive sample weighting scheme which dynamically weights the batch samples in the training process. In addition, we design a novel loss function that uses additional network-layer feature information to improve the solution quality. We also discuss a post-processing step that gives a good threshold value to balance the trade-off between prediction quality and problem scale reduction. By numerical simulations, we demonstrate that these measures improve both the prediction quality and scale reduction when training from data of varied importance.
Index Terms—multi-hop wireless mesh network, deep learning, network utility maximization

I. INTRODUCTION
Wireless optimization problems are challenging and ubiquitous in many applications. Network controllers have commonly followed a model-based approach: a mathematical model, parameterized by design variables, standard stipulations and control variables, is formed which gives a measure of the system utility or objective. Then given such a model, optimization problems are solved by a network entity, e.g., a base station or access point, and the optimal solutions are used as the control decision. Then this process repeats periodically to adapt to the time-varying conditions of the network. With the ever growing need for higher spectrum and power efficiency and lower delay, the next generation communication system will go through a fundamental change in its paradigm.
To tackle this challenge, studies on wireless network optimization in the past years have focused on the development of approximation algorithms [1]–[3]. Wireless networks are envisioned to be an integral part of many emerging applications, e.g., the Internet of Things (IoT), which are inherently dynamic and require adaptive control. In this way, the conventional approaches are becoming increasingly infeasible since problem instances are solved in a case-by-case manner. Inspired by the recent success of machine learning (ML) in other domains, data-driven approaches have received a lot of attention in the area of wireless network optimization.
One promising thread in wireless network optimization follows the paradigm of supervised learning [4]–[6]. Specifically, works in this thread aim to utilize the pattern matching capability of deep neural networks (DNNs) for distilling useful insight with respect to a specific optimization task from the historical problem instances that are solved by conventional algorithms. The trained neural network could generalize to new problem instances if the training data set is sufficiently large, and if the model has the representation power to characterize the relationship between the input and output data. It is possible to build an end-to-end learning framework in which the mapping from problem instances to solutions is approximated. However, for problems that involve highly sparse and high dimensional data, e.g., joint routing and scheduling decisions, directly learning the mapping is difficult in practice. Instead of building an ML model that outputs the solutions directly, the work in [7] circumvents this difficulty via training a DNN that learns meaningful properties of the solution space. More precisely, the approach proposed in [7] can identify the subspace in which an optimal or near-optimal solution is included with high probability. As the search space is narrowed down, the total amortized computation cost of the conventional approximation algorithm can be significantly reduced.
There are still several issues which hinder deep learning from gaining more adoption in the networking field. A commonly encountered problem is that not all training samples are of the same importance. This can be attributed to two sources. The first is class imbalance: the probability distribution to learn is severely skewed towards one type of outcome over another. For example, in a D2D link scheduling problem, most of the time links choose not to transmit. This can lead to severe bias in the final training result, as the model is likely to output “not transmit” even when it is not supposed to, because on average this matches the training data and this decision is easier to make. In this sense, another sample of not transmitting is not of much use for the training progress. The other reason samples should not be treated equally is that the labels’ quality can vary: the samples can be generated and collected under subtly different circumstances; labels can be close to, but not exactly at, the optimum of the problem to solve due to the implementation of the conventional algorithms.
Authorized licensed use limited to: University of the West Indies (UWI). Downloaded on July 03,2023 at 19:31:29 UTC from IEEE Xplore. Restrictions apply.
These minor factors in total can have an effect on the sample labels that is similar to random noise. It is entirely possible to train on these polluted data sources and capture spurious patterns that are not present in the testing or application. Moreover, if the output needed by the network layer is discrete, but the neural network provides a continuous variable, then a proper threshold value should be established. In most works, there is no principled approach to selecting a good threshold and the process is through simple trial-and-error, causing potential loss of performance.
Considering these issues, we propose a novel scheme to improve learning performance when the training samples are of varied importance. Our contribution can be summarized as follows:
• An adaptive sample weighting scheme. It learns the sample weights and can fit in any classification framework without a new weighting function. The weight is learned based on how much each sample contributes to the metric function on a separate, high quality dataset.
• An improved loss measure based on the cross entropy function. Rather than treating each link as an equal and independent label, we consider each link carrying a unit flow as an independent label. In this way the link flow information is also used to guide the learning process.
• A threshold value selection criterion to further improve the performance. Based on Bayesian estimation, we use an exponential family probabilistic model which gets updated with samples of threshold values to avoid the large overhead involved in evaluation.

II. SYSTEM MODELS AND PROBLEM SETTING
We are interested in a multi-hop single-radio single-channel wireless network, which is the general case of a D2D network. A set of communication nodes $\mathcal{N}$ is randomly distributed within a rectangular area. The time-slotted system has one given frequency band $W$, assuming synchronization between the nodes. A central node has access to the nodes’ location information, and makes scheduling and routing decisions. At a given time slot, a non-determined number of pairs of nodes need to transmit to each other. The set of communication links is denoted by $\mathcal{E}$. The transmission capacity of the link between transmitter $u$ and receiver $v$ is denoted by $c(u, v)$. The multi-commodity flow demands are denoted by $\mathcal{K}$, in which each flow $k \in \mathcal{K}$ is a source and destination tuple: $k = \{s_k, t_k\}$. The system model is illustrated in Fig. 1.

Fig. 1: System illustration

In order to maximize the throughput-based system performance, at each slot the decisions to be made are:
• scheduling, determining which subset of links to activate to avoid interference;
• flow allocation, specifying which type of flow and how much of it should be transported on each activated link such that the flow conservation and link capacity constraints are satisfied.
u to node v is successful only if no other node within the detection range of $v$ is simultaneously transmitting. Moreover, each node works in a half-duplex manner, which prohibits the concurrent activation of links that involve overlapping nodes. The conflict relations among all the links can be characterized by a conflict graph.
We adopt the pattern-based formulation in [9], where a transmission pattern is an independent set of the conflict graph. Formally, each pattern is characterized by a $|\mathcal{E}|$-dimensional vector, where the $j$th component equals the feasible capacity of the $j$th link in this pattern. More precisely, if link $(u, v)$ is active in pattern $m$, the value of its corresponding component, denoted by $p_m(u, v)$, is $c(u, v)$; otherwise, it equals 0. To ensure interference-free communications, all the transmission patterns are supposed to be scheduled in a TDMA manner. Let $\mathcal{M}$ be the set of all the transmission patterns and let $\alpha_m$ denote the fraction of a slot allocated to pattern $m$.
Let $f_k(u, v)$ denote the flow allocation variable with respect to commodity $k$ over link $(u, v)$. Generally, all the flow variables need to satisfy the following constraints.
• Link capacity: the sum of all the flows over a link does not exceed its capacity, i.e.,
$$ \sum_{k \in \mathcal{K}} f_k(u, v) \le c(u, v), \quad \forall (u, v) \in \mathcal{E} \tag{1} $$
• Flow conservation: for commodity $k$, the amount of flow entering an intermediate node equals the amount that exits the node, i.e.,
$$ \sum_{u \in \mathcal{N}} f_k(u, v) = \sum_{u \in \mathcal{N}} f_k(v, u), \quad \forall v \ne s_k, t_k;\ \forall k \in \mathcal{K} \tag{2} $$
The interference-free requirement introduces the following constraints.
$$ \sum_{k \in \mathcal{K}} f_k(u, v) \le \sum_{m \in \mathcal{M}} \alpha_m p_m(u, v), \quad \forall (u, v) \in \mathcal{E} \tag{3} $$
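As an illustration of constraints (1)–(3), the following minimal Python sketch checks a candidate flow allocation against them on a toy two-hop path. The function name `check_flow`, the dictionary-based data layout, and the toy numbers are hypothetical choices made for this example only and are not part of the original formulation.

```python
from collections import defaultdict


def check_flow(flows, capacity, patterns, alpha, commodities, tol=1e-9):
    """Check constraints (1)-(3) for a candidate flow allocation.

    flows:       {k: {(u, v): amount}}   flow of commodity k on link (u, v)
    capacity:    {(u, v): c(u, v)}       capacity of every link in E
    patterns:    list of {(u, v): p_m(u, v)}, one dict per pattern m
    alpha:       list of slot fractions alpha_m, aligned with patterns
    commodities: {k: (s_k, t_k)}
    """
    # Total flow on each link, summed over all commodities.
    load = defaultdict(float)
    for per_link in flows.values():
        for link, amount in per_link.items():
            load[link] += amount

    # (1) Link capacity.
    cap_ok = all(load[l] <= capacity[l] + tol for l in capacity)

    # (2) Flow conservation at every node other than s_k and t_k.
    nodes = {n for link in capacity for n in link}
    cons_ok = True
    for k, (s, t) in commodities.items():
        f_k = flows.get(k, {})
        for v in nodes - {s, t}:
            inflow = sum(a for (u, w), a in f_k.items() if w == v)
            outflow = sum(a for (u, w), a in f_k.items() if u == v)
            if abs(inflow - outflow) > tol:
                cons_ok = False

    # (3) Flow on a link is bounded by its time-shared pattern capacity.
    sched_ok = all(
        load[l] <= sum(a * p.get(l, 0.0) for a, p in zip(alpha, patterns)) + tol
        for l in capacity
    )
    return {"capacity": cap_ok, "conservation": cons_ok, "schedule": sched_ok}


# Toy instance: path s -> a -> t, where the two links conflict (half-duplex
# at node a), so each pattern activates exactly one of them.
capacity = {("s", "a"): 10.0, ("a", "t"): 10.0}
patterns = [{("s", "a"): 10.0}, {("a", "t"): 10.0}]
commodities = {"k1": ("s", "t")}
flows = {"k1": {("s", "a"): 5.0, ("a", "t"): 5.0}}
print(check_flow(flows, capacity, patterns, [0.5, 0.5], commodities))
```

In this toy instance each link is active for half of the slot, so the scheduled capacity in (3) is half of $c(u, v)$; a flow that satisfies (1) can therefore still violate (3).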
With the demand of commodity $k$ being expressed as
$$ d_k = \sum_{v \in \mathcal{N}} f_k(s_k, v) - \sum_{v \in \mathcal{N}} f_k(v, s_k), \tag{5} $$
In this section we explain how the supervised learning paradigm is used to alleviate the problem’s complexity and explore the aspects where deep learning could be more efficient.

A. Supervised Learning for Problem Scale Reduction
Considering the combinatorial nature of problem (6), we adopt the framework proposed in [7] to develop our ML model, in which intermediate characterizations of the problem instances, i.e., link usefulness, are learned to reduce the problem scale. Roughly speaking, a link is considered to be useful if it is activated to accommodate arbitrary commodity flows. It is shown in [7] that the computational complexity of the column generation based optimization can be amortized significantly through removing those links that are probably useless.
In this framework, the ML model is trained in a supervised learning fashion. Formally, the training dataset $D_{\mathrm{train}} \triangleq \{(x^{(i)}, y^{(i)})\}_i$ captures the optimization results of past problem instances. $x^{(i)}$ denotes the vector representation¹ of problem instance $i$ and $y^{(i)}$ characterizes the ground truth in terms of the link usefulness in the solution to this instance, i.e.,
¹ $x^{(i)}$ can be either meta-features of the problem instance or intrinsic structures that are extracted via representation learning approaches.

B. Adaptive Sample Weights
Recall that in supervised learning, given the training dataset $D_{\mathrm{train}} \triangleq \{(x^{(i)}, y^{(i)})\}_i$, the training process can be expressed by the optimization problem
$$ \underset{\theta}{\text{minimize}} \quad \frac{1}{|D_{\mathrm{train}}|} \sum_{i=1}^{|D_{\mathrm{train}}|} L(\hat{y}^{(i)}(\theta), y^{(i)}) \tag{9} $$
$$ \hat{y}^{(i)}(\theta) \triangleq \phi(x^{(i)}; \theta) \tag{10} $$
where $\hat{y}^{(i)}$ is the predicted result given the input $x^{(i)}$, produced by the neural network $\phi$ parameterized by $\theta$. After training, the parameterized neural function should minimize the expected loss measure $L$ across all sample points.
This scheme works well when all the training samples have consistent importance, so all the samples contribute equally
when calculating the gradients of the output with respect to the model parameters. However, if a large portion $N$ of the samples, denoted as $D \subset D_{\mathrm{train}}$, are labeled with corruptions, and only a small subset $D_{\mathrm{hq}} \subset D_{\mathrm{train}}$ of $M$ ($M \ll N$) samples can be verified to have high-quality labels, it is no longer appropriate to give them equal weight in the parameter updating. But it would also be a waste of data if one only trains with $D_{\mathrm{hq}}$, and the model quality could suffer due to a smaller dataset. The better approach would be to give samples individual weights according to their importance. It is possible to identify and manually lower the weight of the low quality sample points before the training; but this requires the user to supply a weighting scheme independent of the learning progress, and the selection of a good scheme is essentially optimizing new hyperparameters requiring careful tuning before it can be integrated into an existing workflow. Different from that, we propose to use a learning approach where a sample weighting scheme is generated from the high-quality data during the training process. Adapted from a meta-learning perspective [11], this approach fits into the existing supervised learning framework widely used in networking applications, without requiring significant changes to the steps in a typical learning process.
We hope to derive the sample-wise weight coefficients $w = \{w_i\}_i$, where the index $i$ is used in the dataset $D$ and each weight is a non-negative number. It is required to satisfy the following relationship:
$$ \theta'(w) = \arg\min_{\theta} \sum_{i=1}^{N} w_i L(\hat{y}^{(i)}(\theta), y^{(i)}) \tag{11} $$
$$ w^{*} = \arg\min_{w \ge 0} \frac{1}{M} \sum_{j=1}^{M} J(\hat{y}^{(j)}(\theta'(w)), y^{(j)}), \tag{12} $$
where $j$ indexes the samples in $D_{\mathrm{hq}}$ and $J$ is a continuous function that can be identical to or different from the loss $L$.
The above relationship comes from the following intuitive observation: for each sample weight vector $w$, when it is applied in a training step on the set $D$, there exists an updated set of model parameters $\theta'(w)$, obtained from the optimizer process, as a function of $w$. We can test this new model parameter’s performance by a weight penalty function $J$ on the high-quality dataset $D_{\mathrm{hq}}$, and since the labels can be trusted to be good, the $J$ function’s value should be a good indicator of the new parameter’s quality unaffected by any degradation from the data. Notice that the $J$ function can be non-differentiable, if its form permits finding a best $w$ efficiently. And because the $J$ value is a function of the sample weights, one can find the corresponding best $w$ which minimizes it.
However, although the mentioned best sample weights exist, practically it is not always feasible to attempt to find them directly. This is because the set of training samples can be very large, and solving the equations to accuracy would need several rounds through it. Besides, today’s learning process performs parameter updates with mini-batches of sample data rather than one sample at a time, so the above relationship should be adapted to accommodate it. Another relevant factor is that the new parameter is likely not a linear function of the gradient in many optimizer algorithms used today, e.g., Adam-like optimizers that make use of higher-order gradient statistics to regulate the effective learning rate, making it even more difficult to find the best sample weight.
To make the process practical in the presence of the above difficulties, we include two additional approximating assumptions:
• the new parameter $\theta'$ can be approximated by a linear function of the sample weights $w$, and the penalty function $J$ is differentiable w.r.t. $w$;
• instead of finding the best $w^{*}$, in each iteration use a fixed number of gradient descent steps to find a good enough $\tilde{w}$.
The above assumptions enable us to obtain differentiated weighting without incurring a large number of iterative updates to $w$, as we observed in the numerical experiments. With the first assumption, one can directly use gradient descent to improve the current $w$; the second assumption uses a fixed number of gradient steps to get a surrogate of the optimum $w$. We can observe the relationship between the gradients with respect to each sample weight to understand the behavior:
$$ \theta'(w) = \theta - \eta_{\mathrm{eff}} \frac{1}{N} \nabla_{\theta} \sum_{i=1}^{N} w_i L(\phi(x^{(i)}, \theta), y^{(i)}) \tag{13} $$
$$ \nabla_{w_i} \theta' = -\eta_{\mathrm{eff}} \nabla_{\theta} L(\phi(x^{(i)}, \theta), y^{(i)}) \tag{14} $$
$$ \nabla_{w_i} \frac{1}{M} \sum_{j=1}^{M} J = \frac{1}{M} \sum_{j=1}^{M} \nabla_1 J \cdot \nabla_{\theta} \phi \cdot \nabla_{w_i} \theta' = -\frac{1}{M} \sum_{j=1}^{M} \nabla_{\theta} J(\phi(x^{(j)}, \theta), y^{(j)}) \cdot \eta_{\mathrm{eff}} \nabla_{\theta} L(\phi(x^{(i)}, \theta), y^{(i)}) \tag{15} $$
Here the constant factor $1/N$ from (13) is absorbed into $\eta_{\mathrm{eff}}$ in (14) and (15), and $\nabla_1 J$ denotes the gradient of $J$ with respect to its first argument.
From Eq. (15), it is clear that when the two gradient expressions are similar, i.e., have a large product, the weight will become larger, since the weight is updated with the negative of the gradient. At the same time, the update is still regulated by the effective step length $\eta_{\mathrm{eff}}$, so the adjustable learning rate plays a role here, adjusting how much the sample weight is changed each time these steps are executed. In the context of network link evaluation, this means that each sample in the set $D$ is weighted according to how similar the gradient it causes to the loss $L$ is to the average gradient caused by the high-quality samples to the penalty function $J$.

C. Choice of Penalty Function J in link prediction
From the previous analysis we have seen that the weights are based on gradient similarity, so using $J = L$ is acceptable. However, if in the high quality set $D_{\mathrm{hq}}$ there is additional network layer information available, we can choose a different $J$ function to take advantage of that.
If cross entropy is chosen, then for any graph instance $g$, minimizing the cross-entropy from the predicted link usage
to the actual link usage is equivalent to maximizing the probability
$$ \prod_{i \in E_g} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}. \tag{16} $$
Such a probability measure contains an implicit assumption that whether the prediction is right or wrong contributes equally to how the model parameters are adjusted. This assumption is valid when the underlying objects in the predictions are largely unrelated and equally important; the outputs can be said to be “structure-less”. But if the outputs here correspond to the links in a network, it is no longer appropriate because the links have different topological importance due to the graph topology. For example, some links can act as bottlenecks that are part of a minimum cut, so wrong predictions at these links are likely to cause the prediction quality to suffer. In this case, the output labels contain a structure that should be reflected in the design choice of the penalty function.
Considering this, suppose that in $D_{\mathrm{hq}}$, in addition to the label, there exists another information vector $z$ with entries in $\mathbb{R}_+$ that is correlated to $y$ and holds information about the structural importance of each output label. The new penalty function can take the form
$$ \prod_{i \in E_g} \hat{y}_i^{\,(1 + \beta\,\mathrm{sigmoid}(z_i)) y_i} (1 - \hat{y}_i)^{(1 + \beta\,\mathrm{sigmoid}(z_i))(1 - y_i)}. \tag{17} $$
Now each link with importance $z_i$ is treated as $(1 + \beta\,\mathrm{sigmoid}(z_i))$ copies stacked together. This scheme favors the links that are structurally more important.

D. Threshold value optimization
The neural model as we have shown is still a continuous one. In the training phase, we could use continuous proxy measures such as cross-entropy that measure how good the decision is. But when the model is put to use, the only output that matters is the discrete decision variables. Converting from a continuous output to a discrete variable is typically done through the use of threshold values: if the final decision is one of 2 possibilities, then the decision is 1 if $\hat{y} \ge p$ and 0 otherwise, where $p$ is the threshold.
In our case, the threshold value has a large impact on the final system performance. There are two key performance metrics used:
• approximation ratio $r$, expressed as the ratio of the optimization objective value of the reduced problem instance to the value of the original problem.
• time cost reduction $d_t$. This measures the fraction of run time saved by solving a smaller problem instance. Note that it also considers the additional run time caused by the loss of connectivity: if the reduced problem instance does not have a feasible solution, then the algorithm will attempt to add back links that are not present, in the order of link score.
For easy comparison, we use their sum as the scalar value called solution merit: $\xi(\alpha) = \mathbb{E}[r(\alpha) + d_t(\alpha)]$, where $\alpha$ is the threshold value to be chosen and the expectation is taken over problem instances. Then finding a good threshold value is equivalent to maximizing the merit value. The relationship between the threshold value and the merit value is not a straightforward one. A lower threshold value includes more links in the solution, and it increases the approximation ratio and at the same time reduces the time cost reduction because fewer links have been cut off from the problem. Similarly, a higher threshold value prunes off links more aggressively, and this is likely to lower the solution quality; it may also fail to decrease the solution time, because the pruning may cause the system to become infeasible, in which case additional time is needed to revert the instance to a feasible one.
Typically a good threshold value is chosen through trial-and-error: one can use a grid search, where the neighborhood of valid threshold values is divided into small regions and a best region can be picked by evaluating them all. However, considering the inherent structure of this problem, we adopt a Bayesian approach for finding good threshold values. This method is a fitting one because the relationship between the merit and the threshold value is not explicitly expressible in closed form, which makes it hard to apply typical optimization techniques; moreover, it is expensive to evaluate even one point of this relationship, since we have to iterate over all the points in $D_{\mathrm{hq}}$ for every $\alpha$. The Bayesian approach maintains an internal probabilistic model whose mean approaches $\xi(\alpha)$ as more samples are observed, thus making it efficient to rapidly find a good $\alpha$ given limited knowledge of the function.
In this iterative algorithm, first we assume that without further information, $\alpha$ is described with a type of distribution $P_0(\alpha; \gamma)$ within a specified range. We start with a set of observation points in the form of $(\alpha_i, \xi(\alpha_i))_i$, then fit the distribution according to these observations. With the updated probabilistic model $P$ of $\alpha$, the next point $\alpha$ to sample is given by the expected improvement [12] method:
$$ \alpha = \arg\max_{\alpha} \mathbb{E}_{P} \max(\xi(\alpha) - \xi(\alpha^{+}), 0), \tag{18} $$
where $\xi(\alpha^{+})$ is the largest value observed so far. This step uses the probabilistic model to estimate the expected improvement over the current best value, and returns the currently unobserved point $\alpha$ to evaluate next. The algorithmic steps are summarized in Algorithm 1.

IV. NUMERICAL RESULTS
We implement this system with the existing software framework PyTorch [13] in Python, and conduct experiments to test its system performance. We consider the testing network to be located within a square area with sides of 1000 meters, with its illustration shown in Fig. 1 and the parameters listed in Table I. The communication nodes are randomly distributed with a minimum separation of 0.5 m. Two nodes are connected by a link if their distance is smaller than the system parameter transmit range, and links interfere with each other if their distance is smaller than the detection range. Once the
Algorithm 1: Threshold Generation
input : P0(α), range of α, ξ, MaxIter, n0
output: threshold value α∗
Sample and evaluate n0 points in the range of α randomly
Let i = 1
while i ≤ MaxIter do
    Let Pi be the updated distribution of α with all observed data
    αi = arg max_α ζ(α; Pi)
    Record (αi, ξ(αi))
    Let i = i + 1
end
Return the recorded α with the largest ξ(α) as α∗
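The loop in Algorithm 1 can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the probabilistic model is taken to be a Gaussian process with a squared-exponential kernel (the text only specifies a probabilistic model fitted to the observations), the acquisition function is the closed-form expected improvement for a Gaussian posterior, the arg max is taken over a fixed grid of candidates, and the merit function `toy_merit` is a made-up stand-in for ξ(α). All function names, the kernel length scale, and the jitter value are hypothetical.

```python
import math
import random


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalars (assumed kernel)."""
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def posterior(xs, ys, x, jitter=1e-4):
    """Gaussian-process posterior mean and standard deviation at point x."""
    mean_y = sum(ys) / len(ys)
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x) for a in xs]
    w = solve(K, [y - mean_y for y in ys])
    v = solve(K, k_star)
    mu = mean_y + sum(k * wi for k, wi in zip(k_star, w))
    var = max(1.0 - sum(k * vi for k, vi in zip(k_star, v)), 1e-12)
    return mu, math.sqrt(var)


def expected_improvement(mu, sigma, best):
    """Closed-form E[max(xi - best, 0)] for a Gaussian posterior."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf


def select_threshold(xi, lo, hi, n0=3, max_iter=10, grid=101, seed=0):
    """Return (best alpha, list of (alpha, xi(alpha)) observations)."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n0)]  # n0 random initial points
    ys = [xi(a) for a in xs]
    candidates = [lo + (hi - lo) * i / (grid - 1) for i in range(grid)]
    for _ in range(max_iter):
        best = max(ys)
        scores = [expected_improvement(*posterior(xs, ys, a), best)
                  for a in candidates]
        nxt = candidates[scores.index(max(scores))]  # arg max of acquisition
        xs.append(nxt)
        ys.append(xi(nxt))                           # record (alpha_i, xi)
    i_best = ys.index(max(ys))
    return xs[i_best], list(zip(xs, ys))


if __name__ == "__main__":
    toy_merit = lambda a: -(a - 0.3) ** 2  # hypothetical stand-in for xi
    alpha_star, history = select_threshold(toy_merit, 0.0, 1.0)
    print(alpha_star, len(history))
```

Each call to `xi` stands for one expensive evaluation of the solution merit over the high-quality set, which is why the sketch spends its effort on the cheap surrogate model rather than on additional merit evaluations.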
Fig. 3: The convergence of the Bayesian estimation to the best α value, marked by the vertical line. All the α values evaluated are marked with “x”. The other methods’ solutions are marked with red and black dots.

simulation results confirm that these measures combined result in a higher solution quality at the cost of additional processing. As data becomes a more pressing issue, future work is expected to explore techniques for the learning model to deal with the issues related to it.

REFERENCES
[1] D. Chafekar, V. A. Kumar, M. V. Marathe, S. Parthasarathy, and A. Srinivasan, “Approximation algorithms for computing capacity of wireless networks with SINR constraints,” in Proc. of IEEE INFOCOM, 2008, pp. 1166–1174.
[2] S. Misra, S. D. Hong, G. Xue, and J. Tang, “Constrained relay node placement in wireless sensor networks: Formulation and approximations,” IEEE/ACM Transactions on Networking, vol. 18, no. 2, pp. 434–447, 2010.
[3] R. Gandhi, Y.-A. Kim, S. Lee, J. Ryu, and P.-J. Wan, “Approximation algorithms for data broadcast in wireless networks,” IEEE Transactions on Mobile Computing, vol. 11, no. 7, pp. 1237–1248, 2012.
[4] M. A. Wijaya, K. Fukawa, and H. Suzuki, “Neural network based transmit power control and interference cancellation for MIMO small cell networks,” IEICE Transactions on Communications, vol. 99, no. 5, pp. 1157–1169, 2016.
[5] F. Tang, B. Mao, Z. M. Fadlullah, N. Kato, O. Akashi, T. Inoue, and K. Mizutani, “On removing routing protocol from future wireless networks: A real-time deep learning approach for intelligent traffic control,” IEEE Wireless Communications, vol. 25, no. 1, pp. 154–160, 2017.
[6] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438–5453, 2018.
[7] L. Liu, B. Yin, S. Zhang, X. Cao, and Y. Cheng, “Deep learning meets wireless network optimization: Identify critical links,” IEEE Transactions on Network Science and Engineering, 2018.
[8] P. Gupta and P. R. Kumar, “The capacity of wireless networks,” IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 388–404, 2000.
[9] Y. Cheng, X. Cao, X. S. Shen, D. M. Shila, and H. Li, “A systematic study of the delayed column generation method for optimizing wireless networks,” in Proceedings of ACM MobiHoc, 2014, pp. 23–32.
[10] D. Bertsimas and J. N. Tsitsiklis, Introduction to linear optimization. Athena Scientific Belmont, MA, 1997, vol. 6.
[11] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” arXiv preprint arXiv:1803.09050, 2018.
[12] D. R. Jones, M. Schonlau, and W. J. Welch, “Efficient global optimization of expensive black-box functions,” Journal of Global Optimization, vol. 13, no. 4, pp. 455–492, 1998.
[13] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS-W, 2017.
[14] L. Liu, X. Cao, Y. Cheng, and Z. Niu, “Energy-efficient sleep scheduling for delay-constrained applications over WLANs,” IEEE Transactions on Vehicular Technology, vol. 63, no. 5, pp. 2048–2058, Jun. 2014.
[15] C. E. Rasmussen, “Gaussian processes in machine learning,” in Summer School on Machine Learning, Springer, 2003, pp. 63–71.
[16] Z. C. Lipton, C. Elkan, and B. Naryanaswamy, “Optimal thresholding of classifiers to maximize F1 measure,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2014, pp. 225–239.