Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach

Ying He, Jingcheng Fang, F. Richard Yu, and Victor C. Leung
Abstract—With the increasing popularity of and demand for large language model applications on mobile devices, it is difficult for resource-limited mobile terminals to run large-model inference tasks efficiently. Traditional deep reinforcement learning (DRL) based approaches have been used to offload large language model (LLM) inference tasks to servers. However, existing DRL solutions suffer from data inefficiency, insensitivity to latency requirements, and non-adaptability to task load variations, which degrade the performance of LLMs. In this paper, we propose a novel approach based on active inference for LLMs inference task offloading and resource allocation in cloud-edge computing. Extensive simulation results show that our proposed method outperforms mainstream DRLs, improves data utilization efficiency, and is more adaptable to changing task load scenarios.

Index Terms—Active inference, cloud-edge computing, large language model, reinforcement learning, resource allocation, task offloading.

Manuscript received 9 November 2023; revised 20 February 2024; accepted 22 April 2024. Date of publication 9 July 2024; date of current version 5 November 2024. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62271324, Grant 62231020, and Grant 62002238, in part by the Shenzhen Science and Technology Program under Grant ZDSYS20220527171400002, and in part by the Open Research Fund from Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) under Grant GML-KF-22-26. Recommended for acceptance by E. Ngai. (Corresponding author: F. Richard Yu.) Ying He and Jingcheng Fang are with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China. F. Richard Yu is with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China, and also with the School of Information Technology, Carleton University, Ottawa, ON K1S 5B6, Canada. Victor C. Leung is with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver V6T 1Z4, Canada. Digital Object Identifier 10.1109/TMC.2024.3415661

I. INTRODUCTION

In recent years, OpenAI's GPT family (e.g., ChatGPT) has attracted a lot of attention with the development of large language models (LLMs). The main advantage of LLMs is their greater representational power and learning ability [1]. Models with more parameters are able to capture more complex patterns and associations, thus providing more accurate and richer predictions and generated results. However, LLMs also face challenges and limitations: they require huge computational resources and storage space for training and inference, which is a significant problem for resource-limited devices and environments. Although researchers have proposed some solutions, such as distributed training, model pruning, and model compression, which can reduce the storage and computational overhead of models, resource issues remain in scenarios where LLMs are served to endpoints at a large scale [2].

Cloud-edge computing is a computing model that integrates cloud computing and edge computing, aiming to take full advantage of both to provide more flexible, efficient, and resilient computing capabilities in large-scale network systems [3]. In cloud-edge computing, cloud computing represents centralized data centers and powerful computing resources that can handle large-scale data and complex computing tasks. Edge computing, on the other hand, extends computing power and data processing capabilities to locations close to data sources, such as edge devices, edge nodes, or edge gateways, to realize the advantages of low latency, real-time processing, and localized data processing. The core idea of cloud-edge computing is to assign computational tasks to the right location for processing based on their characteristics and needs [4]. Cloud-edge computing for large-model inference distributes large-model inference tasks to the cloud and the edge for processing. Offloading and resource allocation are the key concepts and techniques used to rationally allocate inference tasks to cloud and edge devices; they need to consider factors such as network bandwidth, latency, and the computational capacity of the processing ends. The best overall system performance can be achieved by a reasonable offloading strategy and resource allocation scheme.

Different from conventional task offloading optimization, the unique characteristics of LLMs pose non-trivial challenges to task offloading optimization in edge-cloud computing. Specifically, LLMs have huge parameter scales and computational resource requirements. As hardware computing power and datasets increase, researchers are beginning to design and train models with billions of parameters, and the number of trainable parameters in large models grows exponentially, often requiring huge computational resources and high-performance computing devices for training and inference [2]. Furthermore, the performance measures for LLMs usually differ from those of conventional tasks.
In this paper, we consider making the average latency of all LLMs inference tasks that need to be offloaded as small as possible and the accuracy of the model's prediction output as large as possible, while satisfying the bandwidth, computational, and graphics memory resource constraints.

Deep reinforcement learning (DRL) has been very successful in many decision-making application scenarios, such as games, robotics, and resource management [5], [6]. Thanks to the rapid development of DRL, mainstream DRL algorithms, such as Rainbow DQN, PPO, and SAC [7], [8], [9], are used in cloud-edge computing scenarios for task offloading and resource allocation [10], [11], [12]. However, traditional DRL-based strategies operating in different environments require different reward functions, which results in poor generalization. It is also difficult to define an explicit and appropriate reward function, and the transformation of human knowledge into numerical reward values is often subject to human cognitive biases [13].

In this paper, we propose a novel algorithm that uses rewardless guidance instead of the reward model in traditional DRL approaches. It enables agents to form higher-level cognition about the environment and reach the preferred state directly without defining reward functions, resulting in better generalization ability than traditional DRL approaches. In this way, the algorithm is able to actively select the actions that provide the most informative value to guide the inference process and reduce the uncertainty of the future state. The main contributions of this paper are as follows:
• With the recent advances in active inference [14], we propose a novel scheme to address the LLMs inference offloading and resource allocation problem. Compared to mainstream DRLs, our proposed method has better convergence and generalization performance.
• We present the system model and formulation for the GPT-J-6B LLM based on real experimental data in a server cluster. Both the training and inference phases are considered.
• Extensive simulation results show that our proposed scheme can train a better converged policy and has superior performance in LLMs inference compared with the mainstream DRL algorithms.

The rest of this paper is organized as follows. Section II reviews the related work. Section III presents the system model. The proposed scheme is presented in Section IV. Section V presents the experimental results. Finally, this paper is concluded in Section VI.

II. RELATED WORK

The number of trainable parameters in large models grows exponentially, often requiring huge computational resources and high-performance computing devices for training and inference.

The authors of [15] argue that distributed systems can provide excellent solutions for training and inference of large models, and that a distributed federation of multiple high-performance computing machines can effectively solve the resource dilemma faced by large models.

A large language model accelerator (LLMA) is proposed in [16] that losslessly accelerates LLMs inference with references. LLMA selects a text span from a reference, copies its tokens to the decoder, and then efficiently checks, in parallel within a decoding step, whether those tokens are appropriate as decoding results. LLMA achieves a 2x inference speedup and yields the same predicted output as greedy decoding in many scenarios.

Sheng et al. investigate how to use a single GPU for high-throughput LLMs inference with limited hardware resources. FlexGen, a high-throughput inference generation engine for LLMs running on a single consumer-grade GPU, is proposed; it flexibly configures LLMs inference tasks under various hardware resource constraints by aggregating memory from GPUs, CPUs, and disks [17].

An inference system with a multilevel inference engine, Tabi, is proposed in [18] that can serve inferential computations for the corresponding applications using small models and, optionally, LLMs. The idea of Tabi is that, due to the diminishing returns of adding more trainable parameters to LLMs, smaller models can make the same predictions as LLMs for most queries. In Tabi's multilevel inference, the non-generative LLM serving framework is optimized to use calibrated confidence scores to decide whether to return the small models' results extremely quickly or to reroute queries to the LLMs.

In the scenario of batch large-model inference tasks, Cheng et al. propose batch prompting, which enables LLMs to run inference in batches instead of one inference task at a time. This method reduces token and time costs while maintaining downstream performance, and the inference cost decreases inversely and linearly with the number of samples in each batch. The study suggests that this batch prompting approach can be applied to different batch large-model inference tasks [19].
Li et al. argue that running deep neural network (DNN)-based computationally intensive tasks on mobile devices with limited computing resources is challenging, and that traditional cloud-responsive DNN inference services are severely hampered by wide-area network latency, resulting in poor real-time performance and a low quality of user experience. The researchers propose a framework for collaborative DNN inference using edge computing through device-edge collaboration. Specifically, DNNs are partitioned and adaptively allocated between devices and edges for computation, coordinating powerful cloud resources with near-edge resources for real-time DNN inference. In addition, the DNN size is appropriately adjusted during the inference process to further reduce the computational latency by launching the inference earlier at an intermediate DNN layer [22].

The authors of [23] investigate inference acceleration using distributed convolutional neural networks (CNNs) in collaborative edge computing networks and propose a receptive-field-based partitioning to guarantee no loss of inference accuracy when dividing inference tasks. To reduce the computation time of inference and the communication overhead in the distributed system, the CNN model is partitioned into multiple task blocks using fused-layer parallelization, and the optimal partitioning of the CNN model is found using dynamic programming. A low-complexity search algorithm is used to select the best subset of edge servers for collaborative inference in the distributed inference system. Experimental results show that the framework can significantly improve inference speed compared to running pre-trained models.

Hu et al. argue that the computational structure of modern deep learning models involves directed acyclic graphs (DAGs), while most existing research assumes that deep learning models are constructed as a chain of layers and divides the models across edge devices in this way. This study proposes EdgeFlow, a novel distributed inference mechanism designed for generalized DAG-structured deep learning models, which uses a new progressive model partitioning algorithm to divide model layers into independent execution units and then assigns these near-optimal model partitioning units to inference computations on distributed edge devices. During the inference process, EdgeFlow coordinates the intermediate results flowing through these units to handle complex layer dependencies [24].

The authors of [25] study the problem of coordinating DNN model partitioning and task assignment for end devices and edge servers in heterogeneous edge computing systems. For the problem model, which is difficult to solve directly, dynamic programming and a greedy strategy are used to reduce the solution space while obtaining a good solution, and then an online GSPI algorithm is proposed to solve the problem.

C. Deep Reinforcement Learning Decision

Deep reinforcement learning control algorithms are a class of algorithms based on deep neural networks and reinforcement learning for solving decision and control problems. They combine the power of deep learning with the reinforcement learning framework to learn optimal decision strategies from interactions without a priori knowledge [6]. Some common deep reinforcement learning control algorithms include the following. Deep Q-Network (DQN) uses a deep neural network to approximate the value function and selects actions through greedy policies; DQN improves training stability through experience replay and target networks. Proximal Policy Optimization (PPO) updates the policy parameters by optimizing the objective function of the current policy; PPO uses importance sampling and clipping of the objective function to maintain training stability. The Actor-Critic approach combines policy optimization and value function estimation by simultaneously training a policy network (Actor) and a value function network (Critic), where the Actor selects actions based on the policy, while the Critic evaluates the current policy's value function [7], [8], [9], [26].
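To make the clipping mechanism mentioned above concrete, the following minimal PyTorch-style sketch shows the clipped surrogate objective of PPO [8]; the tensor names and the clip threshold of 0.2 are illustrative defaults, not values taken from this paper.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (illustrative sketch).

    log_probs_new: log pi_theta(a|s) under the current policy
    log_probs_old: log pi_theta_old(a|s) under the behaviour policy
    advantages:    advantage estimates for the sampled actions
    """
    # Importance-sampling ratio between the new and old policies.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two terms; the negative is returned
    # so the loss can be minimized with a standard optimizer.
    return -torch.min(surr1, surr2).mean()
```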
Liu et al. use a DQN approach for task offloading and resource allocation in a vehicular edge computing network architecture. In this vehicular edge computing network, vehicles act as mobile edge servers and provide computing services to nearby mobile users. The process is described as a Markov decision process and solved by DRL to obtain the optimal policy for computation offloading and resource allocation [27].

Tang et al. consider indivisible and latency-sensitive tasks and edge load dynamics in mobile edge computing systems and formulate a task offloading problem to minimize the expected long-term cost. The researchers combine long short-term memory (LSTM), dueling deep Q-network (Dueling DQN), and double DQN (Double DQN) techniques to propose a distributed algorithm based on model-free deep reinforcement learning. Simulation results show that the processing power of edge nodes can be better utilized and that the task loss rate and average latency can be significantly reduced [28].
D. Active Inference

Active inference is similar to reinforcement learning in that it uses the reinforcement learning paradigm, whereby strategies are trained during interaction with the environment. Active inference describes the properties of agents in an environment according to the free energy principle: by minimizing the free energy, the agent obtains the Bayesian inference of the optimal action in that environment [29]. The free energy principle is a physical and information-theoretic concept used to describe the stability and organization of systems; it originated in the theory of thermodynamics in statistical physics and was later applied to the fields of information theory and machine learning, where it is used to describe the optimization goals of learning and reasoning processes. The free energy is considered an upper bound on an agent's surprise, and the agent minimizes the free energy during learning and reasoning [14]. The expected energy here can be understood as the plausibility the system assigns to the observed data under the model parameters, so the free energy can be understood as the difference between the current model distribution and the distribution of the real data samples.
Position encoding: The position encoding matrix P is applied to the embedding vectors as a position mask operation to obtain the position-encoded embedding vectors PE = [pe_1, pe_2, ..., pe_n].

Attention and multi-head self-attention: According to [35], assuming that the query matrix is Q = PE × W^q, the key matrix is K = PE × W^k, and the value matrix is V = PE × W^v, then:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,   (1)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),   (2)

where W^q, W^k, and W^v are the parameter matrices, d_k is the dimension of K, and W_i^Q ∈ R^(d_model × d_q), W_i^K ∈ R^(d_model × d_k), W_i^V ∈ R^(d_model × d_v), and W^O ∈ R^(h d_v × d_model). The resulting self-attention layer output after the multi-head self-attention calculation is Z = MultiHead(Q, K, V). This process is repeated multiple times, depending on the number of layers of the model.
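As an illustration of (1) and (2), the following minimal NumPy sketch computes scaled dot-product attention and its multi-head combination; the weight shapes and the way the per-head projections are passed in are illustrative assumptions and do not reflect GPT-J-6B's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, Eq. (1): softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(PE, Wq, Wk, Wv, WQ, WK, WV, WO):
    # Q, K, V are shared projections of the position-encoded embeddings PE.
    Q, K, V = PE @ Wq, PE @ Wk, PE @ Wv
    # Eq. (2): each head applies its own projection, results are concatenated
    # and mapped back with W^O.
    heads = [attention(Q @ WQ_i, K @ WK_i, V @ WV_i)
             for WQ_i, WK_i, WV_i in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO
```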
Feedforward neural network: The attention output Z is fed into a multilayer perceptron (MLP) for the feedforward computation of the output Y = MLP(Z), where MLP(·) is a multilayer perceptron containing linear transformations and activation functions. The final predicted text can be obtained based on Y and the tokenizer.

The inference process described above takes place in Ser_j. The task model sends a task request T_t from terminal Dev_i, i.e., a packet of size PS_x containing the original input text, to Ser_j for inference, and then Ser_j sends a packet of size PS_y containing the predicted text back to the requesting terminal Dev_i.
C. Terminal Mobility Model

There are two types of nodes in the system model of this paper: immovable nodes (including the MECs, the CS, and fixed terminals D_i^unmo) and movable nodes (including mobile terminals D_i^mobi such as connected cars, smartphones, and UAVs). A fixed terminal D_i^unmo generates requests for LLMs inference tasks from time to time and remains stationary. A mobile terminal D_i^mobi moves at a given speed in a given direction and generates LLMs inference task requests from time to time during its movement. The distance between a mobile terminal D_i^mobi and the endpoints (MEC_i and CS) is needed for the calculation, and we compute it as the Euclidean distance: supposing the mobile terminal D_i^mobi has coordinates (x_1, y_1, z_1) and the endpoint has coordinates (x_2, y_2, z_2), the distance between them is d = sqrt((x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2).
D. Communication Model

The terminal Dev_i sends a task request T_t at moment t. The decision algorithm offloads it to the server Ser_j, which sends the processing result P_y back to the terminal Dev_i after the processing is completed; both transfers depend on the wireless communication channel between the sender and the receiver. Assuming that the relative distance d between the sender and the receiver remains constant during an effective communication, the data transmission rate follows the capacity formula from information theory [36]:

R = W log_2(1 + Power · G / N),   (3)

where W represents the bandwidth of the communication channel, Power represents the transmit power of the device, G represents the channel gain, and N represents the random thermal noise power of the channel. The channel gain G is related to the antenna gain g, the path loss PL, and the shadow fading X_σ. While the antenna gain g is related to the receiving device, the path loss PL and the shadow fading X_σ are both related to the channel type. X_σ is the shadow fading component, which is usually modeled as a zero-mean Gaussian random variable X_σ ~ N(0, σ^2); in general, we treat g and σ as constants. G is defined as follows:

G = g − PL − X_σ.   (4)

According to the system model, both the sender and the receiver may be on the ground or in the air, so the communication channel mainly consists of two types: Ground-to-Ground (G2G) and Ground-to-Air (G2A), where ground means ground equipment (less than 100 meters in height) and air means aerial or high-altitude equipment (more than 100 meters in height). It should be noted that MEC_i and CS are ground equipment.

Ground-to-Ground: In G2G channels, where both the sender and the receiver are terrestrial devices, the path loss PL_G2G is defined according to [37] as:

PL_G2G = 128.1 + 37.6 log(d),   (5)

where d is the Euclidean distance between the sender and the receiver.

Ground-to-Air: In the G2A channel, one of the sender and the receiver is a ground device and the other is an aerial device. According to [38], the path loss PL_G2A in the Ground-to-Air channel is defined as:

PL_G2A = 10 α log(d) + C,   (6)

where α is the path loss exponent, which is related to the environment in which the channel propagates (environmental factors include the density, type, and height of buildings and vegetation), d is the Euclidean distance between the sender and the receiver, and the constant C depends on several parameters such as the operating frequency of the device and the antenna gain.

Air-to-Air: Relay communication between UAVs is performed over Air-to-Air channels; specifically, the offloading of inference tasks and the return of results can be relayed via UAVs. According to [39], the path loss PL_A2A in A2A channels is defined as:

PL_A2A = 10 α log(d),   (7)

where α is the path loss exponent and d is the air distance between UAVs.
Importantly, in the design scenario of this paper, the relay communication between high-altitude UAVs is line-of-sight wireless transmission, so the path loss exponent α can be chosen to be a small value.
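The communication model of (3)-(7) can be summarized in the following Python sketch; the dB-to-linear conversion of the gain, the logarithm base, the distance units, and the default values of α and C are our assumptions for illustration, since the paper leaves these details implicit.

```python
import math

def channel_gain(g, PL, X_sigma):
    # Eq. (4): gain = antenna gain minus path loss minus shadow fading (in dB).
    return g - PL - X_sigma

def path_loss(channel, d, alpha=2.5, C=40.0):
    # Eqs. (5)-(7); base-10 logarithm and the default alpha/C are placeholders,
    # their real values depend on the propagation environment.
    if channel == "G2G":
        return 128.1 + 37.6 * math.log10(d)
    if channel == "G2A":
        return 10 * alpha * math.log10(d) + C
    if channel == "A2A":
        return 10 * alpha * math.log10(d)
    raise ValueError(f"unknown channel type: {channel}")

def transmission_rate(W, power, G_dB, N):
    # Eq. (3): Shannon-style rate; the dB gain is converted to a linear factor
    # here, an assumption the paper does not spell out.
    G = 10 ** (G_dB / 10)
    return W * math.log2(1 + power * G / N)
```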
E. Data Transfer Model

In the scenario designed in this paper, the data transfer includes the task offloading phase and the result return phase: the decision algorithm sends a packet of size PS_x for task T_t from terminal Dev_i to Ser_j, and returns the result packet of size PS_y from Ser_j to terminal Dev_i after processing. The task offloading phase goes through four periods: transmission, propagation, queuing, and computation. The result return phase has two periods: transmission and propagation.

In the task offloading phase, the data transmission rate R can be calculated according to (3). Suppose PS_x is the packet size of task T_t; then the transmission latency is PS_x / R. The distance d_1 between the sender and the receiver is determined according to the mobility model, and with wireless propagation the propagation latency is d_1 / c, where c represents the speed of light. When the server's task queue exceeds its parallel processing limit, the latest requests need to be queued, and the waiting time depends on the remaining time L_q of the soonest-to-finish task in the processing queue. The task processing time L_c varies depending on the server's computing power and whether or not it uses an accelerated inference framework.

In the result return phase, the processing result of task T_t is a packet of size PS_y, so the transmission latency is PS_y / R. The distance between terminal Dev_i and server Ser_j is d_2. Note that the distance d_2 may change while the server is processing the task, and the propagation latency is d_2 / c.

In summary, when a task T_t is offloaded successfully, its time delay L_Tt can be calculated as:

L_Tt = (PS_x + PS_y) / R + (d_1 + d_2) / c + L_q + L_c.   (8)

In addition, the maximum acceptable delay of every task request T_t is constrained to be t_max, which indicates the delay requirement of the requesting terminal. When the task processing result cannot be transmitted to the terminal within the maximum acceptable delay, i.e., L_Tt > t_max, task T_t is abandoned.
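A direct transcription of (8) and the abandonment rule into Python follows; variable names mirror the notation above.

```python
def task_latency(PS_x, PS_y, R, d1, d2, L_q, L_c, c=3e8):
    # Eq. (8): transmission + propagation + queuing + computation delay.
    transmission = (PS_x + PS_y) / R   # request packet and result packet
    propagation = (d1 + d2) / c        # distances to and from the serving node
    return transmission + propagation + L_q + L_c

def is_completed(L_Tt, t_max):
    # A task is abandoned when its delay exceeds the terminal's requirement t_max.
    return L_Tt <= t_max
```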
F. Problem Formulation

The objective of this paper is to find an optimal strategy for offloading resource-intensive large-model inference tasks to edge computing nodes or cloud computing nodes under limited endpoint resources. We consider an objective function that attempts to make the average latency of all LLMs inference tasks that need to be offloaded as small as possible and the accuracy of the model's prediction output as large as possible, while satisfying the bandwidth, computational, and graphics memory resource constraints of the MECs and the CS. Therefore, the total system utility is defined as:

U(L_T·, P_T·) = 1 / avg_{T_t}(L_Tt) + avg_{T_t}(P_Tt),   (9)

where avg(·) is the average function over all tasks and P_Tt is the prediction accuracy of task T_t. Thus, the final optimization objective can be expressed as maximizing the total system utility U(L_T·, P_T·):

maximize U(L_T·, P_T·)
s.t.  W_rest, C_rest, M_rest ≥ 0, ∀t,
      L_Tt ≤ t_max, ∀t,   (10)

where W_rest, C_rest, and M_rest denote the remaining bandwidth resources, computational resources, and graphics memory resources of each MEC and the CS at time t, respectively. Therefore, the optimal offloading strategy can be found by maximizing U(L_T·, P_T·).
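The utility in (9) and the feasibility conditions in (10) can be sketched as follows; reading avg(·) as the arithmetic mean over all offloaded tasks is our interpretation of the notation.

```python
def system_utility(latencies, accuracies):
    # Eq. (9): inverse of the average latency plus the average prediction accuracy.
    avg_latency = sum(latencies) / len(latencies)
    avg_accuracy = sum(accuracies) / len(accuracies)
    return 1.0 / avg_latency + avg_accuracy

def feasible(W_rest, C_rest, M_rest, L_Tt, t_max):
    # Constraints of (10): non-negative remaining resources and per-task deadline.
    return min(W_rest, C_rest, M_rest) >= 0 and L_Tt <= t_max
```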
In practical systems, price and risk are very important factors that should be considered for data offloading decision-making in edge computing systems [40]. These factors can be incorporated into our proposed scheme by revising the system utility function in (9) and the constraints in (10). This provides a potential avenue for enriching our scheme with price and risk awareness, which could further optimize offloading decisions by considering economic and reliability factors, leading to a more holistic approach to managing cloud-edge computing resources. In this paper, due to limited space, we focus on efficiency and adaptability through an active inference method without directly addressing price and risk.

IV. ACTIVE INFERENCE BASED OFFLOADING STRATEGY

In this section, we first describe the environment state representation and action representation of the agent. Then, we describe the rewardless guidance in active inference algorithms. Finally, we describe the complete algorithmic process for offloading and resource allocation of LLMs inference tasks in cloud-edge networks.

A. State and Action Representations

According to the description of the system model in Section III, in the environment of this paper, s'_j = [C_j, W_j, M_j]^T is defined to denote the state of the server Ser_j. s'_j includes the remaining computing resource C_j of Ser_j, the remaining bandwidth resource W_j, and the remaining graphics memory resource M_j. In addition, it is necessary to consider the distance matrix D between the terminal Dev_i and all servers Ser_j. Therefore, D and the state of each Ser_j are combined into the global state S_t = [D; s'_1; s'_2; ...; s'_{N_ser}] at time t.

According to the description of the system model in Section III, the actions of the agent should include offloading the LLMs inference task T_t to a server Ser_j, allocating computational resources, allocating channel bandwidth resources, and allocating graphics memory resources. Therefore, the action of the agent at time t is defined as a_t = [j, c_j, w_j, m_j], where j is the unique index of the server, c_j represents the computational resources allocated to task T_t by Ser_j, w_j represents the channel bandwidth resources allocated by Ser_j, and m_j represents the graphics memory resources allocated by Ser_j. It should be noted that the allocated resources must be less than or equal to the remaining resources of the server Ser_j; otherwise, the action is regarded as invalid and task T_t fails to run.
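A minimal sketch of the state and action encodings described above is given below; representing each server's remaining resources as a dictionary is an illustrative choice, not the paper's data structure.

```python
import numpy as np

def build_state(D, servers):
    # Global state S_t = [D; s'_1; ...; s'_N]: the distance matrix plus, for
    # every server j, its remaining compute C_j, bandwidth W_j and memory M_j.
    per_server = np.array([[s["C"], s["W"], s["M"]] for s in servers]).ravel()
    return np.concatenate([np.asarray(D, dtype=float).ravel(), per_server])

def is_valid_action(action, servers):
    # a_t = [j, c_j, w_j, m_j]; the allocation must not exceed what Ser_j has
    # left, otherwise the action is invalid and task T_t fails.
    j, c_j, w_j, m_j = action
    s = servers[j]
    return c_j <= s["C"] and w_j <= s["W"] and m_j <= s["M"]
```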
B. Rewardless Guidance in Active Inference

This paper uses an active inference-based algorithm as the agent's decision algorithm for task offloading and resource allocation. The key is the use of a simple and effective rewardless guidance in place of the reward model in the active inference decision algorithm; it plays the role of the environmental reward signal for the decision. Traditional reward models tend to use reward values from environmental feedback as the basis for model revision; these values are strongly correlated with a particular environment and generalize weakly in the presence of environmental changes. Using rewardless guidance that abstracts over multiple environments can solve this problem to some extent, because it does not directly require the environment's feedback reward values but rather summarizes multiple similar environments. In our proposed algorithm, it is important that the offloading decision ensures the task is completed with low latency and a high pass rate, so the rewardless guidance is defined as:

rg(s_t, a_t) = tc × (1 / L_Tt) + P_Tt,   (11)

where tc = 1 when task T_t is completed and tc = 0 otherwise, and P_Tt is the prediction accuracy of task T_t. In our proposed algorithm, a larger rg(s_t, a_t) indicates that selecting a_t in s_t is more consistent with the rewardless guidance, and the probability of selecting a_t is correspondingly greater. Therefore, one of the advantages of using rewardless guidance is that it does not require the real reward signal returned by the environment; it only needs to determine whether the selection of a_t in s_t conforms to the rewardless guidance.
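Equation (11) transcribes directly into Python; the grouping tc × (1/L_Tt) + P_Tt follows our reading of the typeset formula.

```python
def rewardless_guidance(completed, L_Tt, P_Tt):
    # Eq. (11): tc * (1 / L_Tt) + P_Tt, with tc = 1 only if the task finished
    # within its deadline; larger values mean the action agrees better with the
    # preference for low latency and high prediction accuracy.
    tc = 1.0 if completed else 0.0
    return tc * (1.0 / L_Tt) + P_Tt
```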
C. Active Inference Decision

The decision algorithm based on active inference is described next. In the task offloading and resource allocation environment, the agent maintains an active inference algorithm internally and interacts with the environment externally. The agent-environment interaction is not a fully observable process but a partially observable Markov decision process (POMDP) [41]. Suppose that at moment t−1 the state is represented as s_{t−1} and the agent takes an action a_{t−1} such that, with a certain probability, the state shifts to s_t at moment t; this probability is the environmental transition probability P(s_t | s_{t−1}, a_{t−1}). In a POMDP, the agent does not necessarily obtain the true state of the external environment; instead it obtains an observation o_t of its environment, o_t ~ P(o_t | s_t). In active inference, the agent internally maintains a generative model p(o, s, θ) for making predictions about the state of the external environment, which is a neural network model, and θ denotes the model parameters to be learned. In our proposed algorithm, instead of maintaining a reward model based on a large neural network to obtain preferences, we use rewardless guidance to obtain preferences, which influence action selection during agent planning.

In active inference algorithms, the free energy principle is an important concept. It was proposed by Friston et al. to describe how an agent minimizes its free energy by reasoning about the environment [14]. The free energy principle is based on Bayesian reasoning and information theory and aims to explain how an agent perceives and understands the environment and makes decisions based on its understanding. Free energy can be considered a measure that represents uncertainty about the state of the environment, and the goal of the agent is to reduce this uncertainty by reducing the free energy [30].

According to the free energy principle, the agent achieves free energy minimization through two processes. The first is the POMDP process described earlier, where the agent senses the environment to obtain external information o_t and forms an internal model p(o, s, θ) of the state of the environment. The second is the action planning process, where the agent takes optimal actions to reduce free energy based on the internal generative model and the objective function. Through this active inference process, the agent can better understand the environment, predict the future state, and take appropriate actions to achieve its goals. The optimization goal of active inference is to maximize the evidence of the agent's generative model, i.e., to minimize the free energy, and by setting the expected preferences, the agent's generative model p(o, s, θ) can be driven toward the preferred goal state.

The standard free energy is defined at a single moment t, whereas in the active inference algorithm used in this paper, the agent's optimization goal is to minimize the variational free energy F = D_KL(q(s, θ) || p(o, s, θ)), where q(s, θ) is the agent's belief about the future variables; the variational free energy F is also referred to as the (negative) evidence lower bound (ELBO) [42]. The agent chooses the strategy π that minimizes the variational free energy F [31]. According to [43], the free energy of the expected future used in this paper is defined as:

F̃ = D_KL(q(o_0:T, s_0:T, θ, π) || p(o_0:T, s_0:T, θ)),   (12)

where o_0:T represents the observation sequence of the agent over the time horizon 0:T, s_0:T represents the corresponding state sequence, q(o_0:T, s_0:T, θ, π) represents the agent's belief about the future variables, p(o_0:T, s_0:T, θ) is the agent's generative model, and θ denotes the parameters of the generative model's neural network. The target strategy π* can be found by minimizing the expected future free energy F̃. In practical calculations, the free energy is minimized by bringing the output distribution p(o_0:T, s_0:T, θ) of the generative model closer and closer to the belief distribution q(o_0:T, s_0:T, θ, π):

D_KL(q(o_0:T, s_0:T, θ, π) || p(o_0:T, s_0:T, θ)) = 0 ⇒ F̃ = 0.   (13)

Algorithm 1 gives the complete procedure for task offloading and resource allocation based on active inference.
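A rough sketch of how an agent could select actions by minimizing a Monte-Carlo estimate of the expected future free energy in (12) is shown below; the model.rollout interface, the horizon, the sample count, and the use of the rewardless guidance plus an information-gain term as the per-step score are all assumptions for illustration and are not taken from the paper's Algorithm 1, which is not reproduced here.

```python
def expected_free_energy(model, s_t, action, horizon=5, n_samples=8):
    # Monte-Carlo sketch of the expected-future free energy in (12): roll the
    # learned generative model forward and score imagined futures. The
    # `model.rollout` interface (returning steps with .guidance and .info_gain)
    # is assumed, not specified by the paper.
    total = 0.0
    for _ in range(n_samples):
        trajectory = model.rollout(s_t, action, horizon)   # imagined futures
        # Lower expected free energy <=> futures that match the preferences
        # (high rewardless guidance) and reduce uncertainty (high info gain).
        total += sum(-step.guidance - step.info_gain for step in trajectory)
    return total / n_samples

def select_action(model, s_t, candidate_actions):
    # The agent chooses the action whose imagined futures minimize F-tilde (13).
    return min(candidate_actions,
               key=lambda a: expected_free_energy(model, s_t, a))
```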
D. Complexity Analysis

The complexity of Algorithm 1 can be analyzed as follows. The outer loop runs for n_episodes iterations. Within each episode,
an existing scheme [45] that does not consider LLMs. They have already achieved great success in the current DRL field, with state-of-the-art performance in various application scenarios, both discrete and continuous. Regarding the scope of the reward function: in the DRL algorithms mentioned in this paper, the reward function is used for both training and performance evaluation, whereas in the algorithm proposed in this paper, the reward function has nothing to do with the selection of actions and is only used for performance evaluation, so as to make an effective comparison with the DRLs. According to the problem formulation, the overall system requires low latency and a high pass rate for LLMs inference tasks, so the reward function is defined as:

r = 1 / L_Tt + P_Tt.   (14)

Evaluation Metrics. In the experiments of this paper, the evaluation metrics include the sum reward, task completion rate, average latency, and average pass@100, each computed over one episode; they denote the total reward, the fraction of completed tasks, the average latency of all tasks, and the average pass@100 of all tasks, respectively. pass@100 is the pass rate over one hundred generated samples.
B. DRLs Comparison

This section analyzes the performance of the training phase of our proposed method and the existing schemes with the settings t_max = 15 and n_tasks = 100, which together ensure the validity of the subsequent experiments on t_max variation and task load. Fig. 3 shows the performance comparison between our proposed method and mainstream DRLs during the training phase, including sum reward, task completion rate, average latency, and average pass@100.

As shown in Fig. 3(a), our proposed method reaches convergence at about episode 200 and its convergence level is higher than that of mainstream DRLs, although our algorithm does not learn a better strategy in the first 50 episodes of the training phase, owing to the high complexity of the environment. Rainbow DQN converges fast in the early stage, but its final convergence level is lower than that of our proposed method, and SAC is clearly lower than the other three algorithms. PPO converges slowly in the early stage, but its convergence level in the late stage is slightly higher than that of Rainbow DQN and only lower than that of our proposed method. To summarize, our proposed method performs better than the three mainstream DRL algorithms in terms of both convergence speed and convergence level, indicating that it can train better strategies in the complex environment considered in this paper.

It can be seen from Fig. 3(b) that our proposed method can eventually achieve a task completion rate of about 99%, while the task completion rate of Rainbow DQN is about 85%, that of PPO is about 90%, and that of SAC is about 80%. In terms of task completion rate, our proposed method outperforms all three mainstream DRL algorithms. In addition, this shows that our proposed method balances all tasks while achieving the best convergence performance and does not give up individual tasks to improve the overall performance.

In Fig. 3(c), it can be seen that our proposed method, Rainbow DQN, and PPO can all eventually achieve an average task completion latency of about 8 s, while SAC only achieves about 10 s, and our proposed method is slightly lower than Rainbow DQN and PPO. It should be noted that all three mainstream DRL algorithms show average latency jitter during the training process, i.e., the shaded part of the corresponding curve in the figure, which
indicates that the strategies of mainstream DRLs produce unstable action outputs during the training process; in this regard, our proposed method clearly outperforms the mainstream DRLs.

In Fig. 3(d), it can be seen that the average pass@100 of our proposed method reaches around 0.175, the average pass@100 of Rainbow DQN and PPO reaches around 0.15, and the average pass@100 of SAC only reaches around 0.14. This shows that in the training phase, under the same setup, the strategy of our proposed method tends to offload tasks more to the edge, which has high inference accuracy, while the strategies trained by mainstream DRLs offload tasks more to the cloud. This suggests that mainstream DRLs tend to trade pass@100 for low latency, but combining this with Fig. 3(c) reveals that our proposed method balances both.

In summary, our proposed method can train a strategy that balances average task delay, average pass@100, and task completion rate. Rainbow DQN and PPO guarantee a lower average task delay but cannot balance the average pass@100 and task completion rate, and SAC has the worst overall performance.

C. Latency Variation

To explore the latency behavior of the algorithms, i.e., the performance of the strategies in scenarios where the maximum latency requirement of the task, t_max, varies, the four algorithmic strategies obtained from the above training are used for testing. The range of t_max is set to 1-15 s with an interval of 1 s, and n_tasks = 100. Since t_max = 15 is used in the training phase, the latency behavior of the algorithms can be compared when t_max < 15.

When t_max ≤ 2, it can be seen from Fig. 4(b) that no tasks can be completed regardless of which algorithm is used. Combined with Fig. 2, even when offloading to the cloud with the fastest inference, the shortest inference time takes more than 2 s, and with the wireless transmission time, a total latency greater than 2 s is inevitable.

When 3 ≤ t_max ≤ 9, according to Fig. 4(b), only our proposed method and Rainbow DQN can keep the task completion rate above 20%, while both PPO and SAC remain below 20%. According to Fig. 2, the minimum inference time at the edge is greater than 9 s, and the delay of wireless transmission also needs to be considered. Combined with the environment settings, the resource ratio of the cloud server to the four edge servers is 1:4, which corresponds exactly to a task completion rate of 20%. Therefore, it can be concluded that tasks can only be offloaded to the cloud in this range. According to Fig. 4(c), the curve declines compared to when t_max ≤ 2, because tasks can now be offloaded to the cloud.

When t_max ≥ 10, according to Fig. 4(b), only our proposed method can achieve a task completion rate of about 100%. It can also be seen that our proposed method is optimal on all four indicators.

To summarize, the four algorithmic strategies obtained from training are tested under different values of t_max, and our proposed method has the best overall performance.

D. Task Load Variation

To explore the effect of task load on the strategies, the strategies obtained from the above training are used for testing
by setting n_tasks to vary from 100 to 200, with t_max = 15. n_tasks denotes the number of tasks in an episode of the environment. Since no resource scheduling is needed when resources are sufficient (task load less than or equal to 100%), the task load is set to n_tasks ∈ [100, 200] for the experiment, which corresponds to a task load of 100%-200%.

It can be seen from Fig. 5(a) that the sum reward of our proposed method is always higher than that of mainstream DRLs. For the existing schemes, the sum reward increases as the load increases, but it cannot reach the maximum level. The reason the sum reward keeps increasing is that mainstream DRLs do not allocate all the resources to existing users, so whenever the number of users increases they still have resources left to allocate; they cannot reach the highest level precisely because they cannot allocate all the resources efficiently to the existing users. Our proposed method allocates all the resources efficiently at all times, so its sum reward always remains at the highest level. In addition, SAC eventually outperforms PPO as the task load increases, which indicates that SAC is more suitable than PPO in high-load environments, although it is still lower than our proposed method and Rainbow DQN.

In Fig. 5(b) and (d), it can be seen that the task completion rate and average pass@100 of the four algorithms decrease as the task load increases, which is reasonable because the total amount of resources becomes less and less able to satisfy the demand of all tasks. Nevertheless, our proposed method consistently outperforms mainstream DRLs.

According to Fig. 5(c), the average latency of our proposed method increases as the task load increases, which is reasonable because an increase in task load means that more tasks are requesting resources, which inevitably leads to an upward trend in average latency.

To summarize, since mainstream DRLs and the existing scheme cannot balance latency, task completion rate, and pass@100 well, their final performance is lower than that of our proposed method under increasing task load.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, in order to solve the resource dilemma of LLMs inference tasks, we proposed to use active inference with rewardless guidance in cloud-edge computing to solve the offloading and resource allocation problem for LLMs inference tasks. Specifically, by constructing a computationally powerful cloud-edge network system, the LLMs inference task request from the terminal is sent to a server for processing and the result is returned to the terminal. Extensive simulation results show that this scheme is effective: our proposed method outperforms the mainstream DRLs, both in terms of convergence performance in the training phase and in the maximum tolerable latency and task load experiments in the testing phase. In future work, we plan to apply this scheme to more complex environments, such as more diverse types of device terminals and distributed environments, as well as to use more advanced network systems, such as space-air-ground integrated networks, to schedule more diverse resources, and to work on further improving the algorithm's performance.

REFERENCES

[1] L. Fan, L. Li, Z. Ma, S. Lee, H. Yu, and L. Hemphill, "A bibliometric review of large language models research from 2017 to 2023," 2023, arXiv:2304.02020.
[2] C. Zhou et al., "A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT," 2023, arXiv:2302.09419.
[3] K. Cao, Y. Liu, G. Meng, and Q. Sun, "An overview on edge computing research," IEEE Access, vol. 8, pp. 85714-85728, 2020.
[4] H. Li, G. Shou, Y. Hu, and Z. Guo, "Mobile edge computing: Progress and challenges," in Proc. 4th IEEE Int. Conf. Mobile Cloud Comput., Serv., Eng., 2016, pp. 83-84.
[5] Y. He et al., "Deep-reinforcement-learning-based optimization for cache-enabled opportunistic interference alignment wireless networks," IEEE Trans. Veh. Technol., vol. 66, no. 11, pp. 10433-10445, Nov. 2017.
[6] Y. Li, "Deep reinforcement learning: An overview," 2017, arXiv:1701.07274.
[7] M. Hessel et al., "Rainbow: Combining improvements in deep reinforcement learning," in Proc. AAAI Conf. Artif. Intell., 2018.
[8] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
[9] T. Haarnoja et al., "Soft actor-critic algorithms and applications," 2018, arXiv:1812.05905.
[10] Y. He, N. Zhao, and H. Yin, "Integrated networking, caching, and computing for connected vehicles: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 67, no. 1, pp. 44-55, Jan. 2018.
[11] Y. He, F. R. Yu, N. Zhao, V. C. Leung, and H. Yin, "Software-defined networks with mobile edge computing and caching for smart cities: A Big Data deep reinforcement learning approach," IEEE Commun. Mag., vol. 55, no. 12, pp. 31-37, Dec. 2017.
[12] J. Wang, L. Zhao, J. Liu, and N. Kato, "Smart resource allocation for mobile edge computing: A deep reinforcement learning approach," IEEE Trans. Emerg. Topics Comput., vol. 9, no. 3, pp. 1529-1541, Jul.-Sep. 2021.
[13] Y. Hu et al., "Learning to utilize shaping rewards: A new approach of reward shaping," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 15931-15941.
[14] K. Friston, "A free energy principle for a particular physics," 2019, arXiv:1906.10184.
[15] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, "A survey on distributed machine learning," ACM Comput. Surv., vol. 53, no. 2, pp. 1-33, 2020.
[16] N. Yang et al., "Inference with reference: Lossless acceleration of large language models," 2023, arXiv:2304.04487.
[17] Y. Sheng et al., "High-throughput generative inference of large language models with a single GPU," 2023, arXiv:2303.06865.
[18] Y. Wang, K. Chen, H. Tan, and K. Guo, "Tabi: An efficient multi-level inference system for large language models," in Proc. 18th Eur. Conf. Comput. Syst., 2023, pp. 233-248.
[19] Z. Cheng, J. Kasai, and T. Yu, "Batch prompting: Efficient inference with large language model APIs," 2023, arXiv:2301.08721.
[20] L. Lin, X. Liao, H. Jin, and P. Li, "Computation offloading toward edge computing," Proc. IEEE, vol. 107, no. 8, pp. 1584-1607, Aug. 2019.
[21] P. Mach and Z. Becvar, "Mobile edge computing: A survey on architecture and computation offloading," IEEE Commun. Surveys Tuts., vol. 19, no. 3, pp. 1628-1656, Third Quarter 2017.
[22] E. Li, L. Zeng, Z. Zhou, and X. Chen, "Edge AI: On-demand accelerating deep neural network inference via edge computing," IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 447-457, Jan. 2020.
[23] N. Li, A. Iosifidis, and Q. Zhang, "Collaborative edge computing for distributed CNN inference acceleration using receptive field-based segmentation," Comput. Netw., vol. 214, 2022, Art. no. 109150.
[24] C. Hu and B. Li, "Distributed inference with deep learning models across heterogeneous edge devices," in Proc. IEEE Conf. Comput. Commun., 2022, pp. 330-339.
[25] L. Shi, Z. Xu, Y. Sun, Y. Shi, Y. Fan, and X. Ding, "A DNN inference acceleration algorithm combining model partition and task allocation in heterogeneous edge computing system," Peer-to-Peer Netw. Appl., vol. 14, no. 6, pp. 4031-4045, 2021.
[26] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26-38, Nov. 2017.
[27] Y. Liu, H. Yu, S. Xie, and Y. Zhang, "Deep reinforcement learning for offloading and resource allocation in vehicle edge computing and networks," IEEE Trans. Veh. Technol., vol. 68, no. 11, pp. 11158-11168, Nov. 2019.
[28] M. Tang and V. W. Wong, "Deep reinforcement learning for task offloading in mobile edge computing systems," IEEE Trans. Mobile Comput., vol. 21, no. 6, pp. 1985-1997, Jun. 2022.
[29] K. Friston, P. Schwartenbeck, T. FitzGerald, M. Moutoussis, T. Behrens, and R. J. Dolan, "The anatomy of choice: Dopamine and decision-making," Philos. Trans. Roy. Soc. B: Biol. Sci., vol. 369, no. 1655, 2014, Art. no. 20130481.
[30] K. Friston et al., "Active inference and learning," Neurosci. Biobehavioral Rev., vol. 68, pp. 862-879, 2016.
[31] K. Friston, F. Rigoli, D. Ognibene, C. Mathys, T. Fitzgerald, and G. Pezzulo, "Active inference and epistemic value," Cogn. Neurosci., vol. 6, no. 4, pp. 187-214, 2015.
[32] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[33] T. Parr and K. J. Friston, "Uncertainty, epistemics and active inference," J. Roy. Soc. Interface, vol. 14, no. 136, 2017, Art. no. 20170376.
[34] B. Wang and A. Komatsuzaki, "GPT-J-6B: A 6 billion parameter autoregressive language model," 2021. [Online]. Available: https://www.eleuther.ai/artifacts/gpt-j
[35] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017.
[36] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 3, pp. 379-423, 1948.
[37] S. Sun, T. A. Thomas, T. S. Rappaport, H. Nguyen, I. Z. Kovacs, and I. Rodriguez, "Path loss, shadow fading, and line-of-sight probability models for 5G urban macro-cellular scenarios," in Proc. IEEE Globecom Workshops, 2015, pp. 1-7.
[38] A. Al-Hourani and K. Gomez, "Modeling cellular-to-UAV path-loss for suburban environments," IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 82-85, Feb. 2018.
[39] A. A. Khuwaja, Y. Chen, N. Zhao, M.-S. Alouini, and P. Dobbins, "A survey of channel modeling for UAV communications," IEEE Commun. Surv. Tut., vol. 20, no. 4, pp. 2804-2821, Fourth Quarter 2018.
[40] G. Mitsis, E. E. Tsiropoulou, and S. Papavassiliou, "Price and risk awareness for data offloading decision-making in edge computing systems," IEEE Syst. J., vol. 16, no. 4, pp. 6546-6557, Dec. 2022.
[41] R. Xie, Q. Tang, C. Liang, F. R. Yu, and T. Huang, "Dynamic computation offloading in IoT fog systems with imperfect channel state information: A POMDP approach," IEEE Internet Things J., vol. 8, no. 1, pp. 345-356, Jan. 2021.
[42] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, "Variational inference: A review for statisticians," J. Amer. Stat. Assoc., vol. 112, no. 518, pp. 859-877, 2017.
[43] A. Tschantz, B. Millidge, A. K. Seth, and C. L. Buckley, "Reinforcement learning through active inference," 2020, arXiv:2002.12636.
[44] M. Chen et al., "Evaluating large language models trained on code," 2021, arXiv:2107.03374.
[45] J. Hou, M. Chen, H. Geng, R. Li, and J. Lu, "GP-NFSP: Decentralized task offloading for mobile edge computing with independent reinforcement learning," Future Gener. Comput. Syst., vol. 141, pp. 205-217, 2023.