Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach

Ying He, Jingcheng Fang, F. Richard Yu, and Victor C. Leung
Abstract—With the increasing popularity of and demand for large language model applications on mobile devices, it is difficult for resource-limited mobile terminals to run large-model inference tasks efficiently. Traditional deep reinforcement learning (DRL) based approaches have been used to offload large language model (LLM) inference tasks to servers. However, existing DRL solutions suffer from data inefficiency, insensitivity to latency requirements, and non-adaptability to task load variations, which degrade the performance of LLMs. In this paper, we propose a novel approach based on active inference for LLMs inference task offloading and resource allocation in cloud-edge computing. Extensive simulation results show that our proposed method outperforms mainstream DRLs, improves data utilization efficiency, and is more adaptable to changing task load scenarios.

Index Terms—Active inference, cloud-edge computing, large language model, reinforcement learning, resource allocation, task offloading.

Manuscript received 9 November 2023; revised 20 February 2024; accepted 22 April 2024. Date of publication 9 July 2024; date of current version 5 November 2024. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62271324, Grant 62231020, and Grant 62002238, in part by the Shenzhen Science and Technology Program under Grant ZDSYS20220527171400002, and in part by the Open Research Fund from Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) under Grant GML-KF-22-26. Recommended for acceptance by E. Ngai. (Corresponding author: F. Richard Yu.) Ying He and Jingcheng Fang are with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China. F. Richard Yu is with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China, and also with the School of Information Technology, Carleton University, Ottawa, ON K1S 5B6, Canada. Victor C. Leung is with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver V6T 1Z4, Canada. Digital Object Identifier 10.1109/TMC.2024.3415661

I. INTRODUCTION

In recent years, OpenAI's GPT family (e.g., ChatGPT) has attracted a lot of attention with the development of large language models (LLMs). The main advantage of LLMs is their greater representational power and learning ability [1]. Models with more parameters are able to capture more complex patterns and associations, thus providing more accurate and richer predictions and generated results. However, LLMs also face challenges and limitations: they require huge computational resources and storage space for training and inference, which is a significant problem for resource-limited devices and environments. Although researchers have proposed some solutions, such as distributed training, model pruning, and model compression, which can reduce the storage and computational overhead of models, resource issues remain in scenarios where LLMs are served to endpoints at a large scale [2].

Cloud-edge computing is a computing model that integrates cloud computing and edge computing, aiming to take full advantage of both to provide more flexible, efficient, and resilient computing capabilities in large-scale network systems [3]. In cloud-edge computing, cloud computing represents centralized data centers and powerful computing resources that can handle large-scale data and complex computing tasks. Edge computing, on the other hand, extends computing power and data processing capabilities to locations close to data sources, such as edge devices, edge nodes, or edge gateways, to realize the advantages of low latency, real-time processing, and localized data processing. The core idea of cloud-edge computing is to assign computational tasks to the right location for processing based on their characteristics and needs [4]. Cloud-edge computing for large-model inference distributes large-model inference tasks to the cloud and the edge for processing. Offloading and resource allocation are the key concepts and techniques used to rationally allocate inference tasks to cloud and edge devices; they need to consider factors such as network bandwidth, latency, and the computational capacity of the processing ends. The best overall system performance can be achieved by a reasonable offloading strategy and resource allocation scheme.

Different from conventional task offloading optimization, the unique characteristics of LLMs pose non-trivial challenges to task offloading optimization in edge-cloud computing. Specifically, LLMs have huge parameter scales and computational resource requirements. As hardware computing power and datasets increase, researchers are beginning to design and train models with billions of parameters, and the number of trainable parameters in large models grows exponentially, often requiring huge computational resources and high-performance computing devices for training and inference [2]. Furthermore, the performance measures for LLMs usually differ from those of conventional tasks.
In this paper, we consider making the average latency of all LLMs inference tasks that need to be offloaded as small as possible and the accuracy of the model's prediction output as large as possible, while satisfying the bandwidth, computational, and graphics memory resource constraints.

Deep reinforcement learning (DRL) has been very successful in many decision-making application scenarios, such as games, robotics, and resource management [5], [6]. Thanks to the rapid development of DRL, mainstream DRL algorithms, such as Rainbow DQN, PPO, and SAC [7], [8], [9], are used in cloud-edge computing scenarios for task offloading and resource allocation [10], [11], [12]. However, traditional DRL-based strategies operating in different environments require different reward functions, which results in poor generalization. It is also difficult to define an explicit and appropriate reward function, and the transformation of human knowledge into numerical reward values is often subject to human cognitive biases [13].

In this paper, we propose a novel algorithm that uses rewardless guidance instead of the reward model in traditional DRL approaches. It enables agents to form higher-level cognition about the environment and reach the preferred state directly without defining reward functions, resulting in better generalization ability than traditional DRL approaches. In this way, the algorithm is able to actively select the actions that provide the most informative value to guide the inference process and reduce the uncertainty of the future state. The main contributions of this paper are as follows:
• With the recent advances in active inference [14], we propose a novel scheme to address the LLMs inference offloading and resource allocation problem. Compared to mainstream DRLs, our proposed method has better convergence and generalization performance.
• We present the system model and formulation for the GPT-J-6B LLM based on real experimental data in a server cluster. Both the training and inference phases are considered.
• Extensive simulation results show that our proposed scheme can train a better converged policy and has superior performance in LLMs inference compared with the mainstream DRL algorithms.

The rest of this paper is organized as follows. Section II reviews the related work. Section III presents the system model. The proposed scheme is presented in Section IV. Section V presents the experimental results. Finally, this paper is concluded in Section VI.

II. RELATED WORK

The number of trainable parameters in large models grows exponentially, often requiring huge computational resources and high-performance computing devices for training and inference.

The authors of [15] argue that distributed systems can provide excellent solutions for training and inference of large models, and that a distributed federation of multiple high-performance computing machines can effectively solve the resource dilemma faced by large models.

A large language model accelerator (LLMA) is proposed in [16] that losslessly accelerates LLMs inference with references. LLMA selects a text span from a reference, copies its tokens to the decoder, and then efficiently checks, in parallel within a decoding step, whether those tokens are appropriate as decoding results. LLMA achieves a 2x inference speedup and yields the same predicted output as greedy decoding in many scenarios.

Sheng et al. investigate how to use a single GPU for high-throughput LLMs inference with limited hardware resources. FlexGen, a high-throughput inference generation engine for LLMs running on a single consumer-grade GPU, is proposed; it flexibly configures LLMs inference tasks under various hardware resource constraints by aggregating memory from GPUs, CPUs, and disks [17].

An inference system with a multilevel inference engine, Tabi, is proposed in [18] that can serve inferential computations for the corresponding applications using small models and, optionally, LLMs. The idea of Tabi is that, due to the diminishing returns of adding more trainable parameters to LLMs, smaller models can make the same predictions as LLMs for most queries. In Tabi's multilevel inference, the non-generative LLM serving framework is optimized to use calibrated confidence scores to decide whether to return the small models' results extremely quickly or to reroute queries to the LLMs.

In the scenario of batch large-model inference tasks, Cheng et al. propose batch prompting, which enables LLMs to run inference in batches instead of one inference task at a time. This method reduces token and time costs while maintaining downstream performance, and the inference cost decreases inversely and linearly with the number of samples in each batch. The study suggests that this batch prompting approach can be applied to different batch large-model inference tasks [19].
Li et al. argue that running deep neural network (DNN)-based computationally intensive tasks on mobile devices with limited computing resources is challenging, and that traditional cloud-responsive DNN inference services are severely hampered by wide-area network latency, resulting in poor real-time performance and a low quality of user experience. The researchers propose a framework for collaborative DNN inference using edge computing through device-edge collaboration. Specifically, DNNs are partitioned and adaptively allocated between devices and edges for computation, coordinating powerful cloud resources with near-edge resources for real-time DNN inference. In addition, the DNN size is appropriately adjusted during the inference process to further reduce the computational latency by launching the inference earlier at an intermediate DNN layer [22].

The authors of [23] investigate inference acceleration using distributed convolutional neural networks (CNNs) in collaborative edge computing networks and propose a receptive-field-based partitioning to guarantee no loss of inference accuracy when dividing inference tasks. To reduce the computation time of inference and the communication overhead in the distributed system, the CNN model is partitioned into multiple task blocks using fused-layer parallelization, and the optimal partitioning of the CNN model is found using dynamic programming. A low-complexity search algorithm is used to select the best subset of edge servers for collaborative inference in the distributed inference system. Experimental results show that the framework can significantly improve inference speed compared to running pre-trained models.

Hu et al. argue that the computational structure of modern deep learning models involves directed acyclic graphs (DAGs), while most existing research assumes that deep learning models are constructed as a chain of layers and divides the models across edge devices in this way. This study proposes EdgeFlow, a novel distributed inference mechanism designed for generalized DAG-structured deep learning models, which uses a new progressive model partitioning algorithm to divide model layers into independent execution units and then assigns these near-optimal model partitioning units to inference computations on distributed edge devices. During the inference process, EdgeFlow coordinates the intermediate results flowing through these units to handle complex layer dependencies [24].

The authors of [25] study the problem of coordinating DNN model partitioning and task assignment for end devices and edge servers in heterogeneous edge computing systems. For the problem model, which is difficult to solve directly, dynamic programming and a greedy strategy are used to reduce the solution space while obtaining a good solution, and then an online GSPI algorithm is proposed to solve the problem.

C. Deep Reinforcement Learning Decision

Deep reinforcement learning control algorithms are a class of algorithms based on deep neural networks and reinforcement learning for solving decision and control problems. They combine the power of deep learning with the reinforcement learning framework to learn optimal decision strategies from interactions without a priori knowledge [6]. Some common deep reinforcement learning control algorithms include the following. Deep Q-Network (DQN) uses a deep neural network to approximate the value function and selects actions through greedy policies; DQN improves training stability through experience replay and target networks. Proximal Policy Optimization (PPO) updates the policy parameters by optimizing the objective function of the current policy; PPO uses importance sampling and clipping of the objective function to maintain training stability. The Actor-Critic approach combines policy optimization and value function estimation by simultaneously training a policy network (Actor) and a value function network (Critic), where the Actor selects actions based on the policy, while the Critic evaluates the current policy's value function [7], [8], [9], [26].
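To make the clipping mechanism mentioned above concrete, the following minimal PyTorch-style sketch shows the clipped surrogate objective of PPO [8]; the tensor names and the clip threshold of 0.2 are illustrative defaults, not values taken from this paper.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (illustrative sketch).

    log_probs_new: log pi_theta(a|s) under the current policy
    log_probs_old: log pi_theta_old(a|s) under the behaviour policy
    advantages:    advantage estimates for the sampled actions
    """
    # Importance-sampling ratio between the new and old policies.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two terms; the negative is returned
    # so the loss can be minimized with a standard optimizer.
    return -torch.min(surr1, surr2).mean()
```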
Liu et al. use a DQN approach for task offloading and resource allocation in a vehicular edge computing network architecture. In this vehicular edge computing network, vehicles act as mobile edge servers and provide computing services to nearby mobile users. The process is described as a Markov decision process and solved by DRL to obtain the optimal policy for computation offloading and resource allocation [27].

Tang et al. consider indivisible and latency-sensitive tasks and edge load dynamics in mobile edge computing systems and formulate a task offloading problem to minimize the expected long-term cost. The researchers combine long short-term memory (LSTM), dueling deep Q-network (Dueling DQN), and double DQN (Double DQN) techniques to propose a distributed algorithm based on model-free deep reinforcement learning. Simulation results show that the processing power of edge nodes can be better utilized and that the task loss rate and average latency can be significantly reduced [28].
D. Active Inference

Active inference is similar to reinforcement learning in that it uses the reinforcement learning paradigm, whereby strategies are trained during interaction with the environment. Active inference describes the properties of agents in an environment according to the free energy principle: by minimizing the free energy, the agent obtains the Bayesian inference of the optimal action in that environment [29]. The free energy principle is a physical and information-theoretic concept used to describe the stability and organization of systems; it originated in the theory of thermodynamics in statistical physics and was later applied to the fields of information theory and machine learning, where it is used to describe the optimization goals of learning and reasoning processes. The free energy is considered an upper bound on an agent's surprise, and the agent minimizes the free energy during learning and reasoning [14]. The expected energy here can be understood as the plausibility the system assigns to the observed data under the model parameters, so the free energy can be understood as the difference between the current model distribution and the distribution of the real data samples.
Position encoding: The position encoding matrix P is applied to the embedding vectors as a position mask operation to obtain the position-encoded embedding vectors PE = [pe_1, pe_2, ..., pe_n].

Attention and multi-head self-attention: According to [35], assuming that the query matrix is Q = PE × W^q, the key matrix is K = PE × W^k, and the value matrix is V = PE × W^v, then:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,   (1)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),   (2)

where W^q, W^k, and W^v are the parameter matrices, d_k is the dimension of K, and W_i^Q ∈ R^(d_model × d_q), W_i^K ∈ R^(d_model × d_k), W_i^V ∈ R^(d_model × d_v), and W^O ∈ R^(h d_v × d_model). The resulting self-attention layer output after the multi-head self-attention calculation is Z = MultiHead(Q, K, V). This process is repeated multiple times, depending on the number of layers of the model.
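As an illustration of (1) and (2), the following minimal NumPy sketch computes scaled dot-product attention and its multi-head combination; the weight shapes and the way the per-head projections are passed in are illustrative assumptions and do not reflect GPT-J-6B's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, Eq. (1): softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(PE, Wq, Wk, Wv, WQ, WK, WV, WO):
    # Q, K, V are shared projections of the position-encoded embeddings PE.
    Q, K, V = PE @ Wq, PE @ Wk, PE @ Wv
    # Eq. (2): each head applies its own projection, results are concatenated
    # and mapped back with W^O.
    heads = [attention(Q @ WQ_i, K @ WK_i, V @ WV_i)
             for WQ_i, WK_i, WV_i in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO
```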
Feedforward neural network: The attention output Z is fed into a multilayer perceptron (MLP) for the feedforward computation of the output Y = MLP(Z), where MLP(·) is a multilayer perceptron containing linear transformations and activation functions. The final predicted text can be obtained based on Y and the tokenizer.

The inference process described above takes place in Ser_j. The task model sends a task request T_t from terminal Dev_i, i.e., a packet of size PS_x containing the original input text, to Ser_j for inference, and then Ser_j sends a packet of size PS_y containing the predicted text back to the requesting terminal Dev_i.
C. Terminal Mobility Model

There are two types of nodes in the system model of this paper: immovable nodes (including the MECs, the CS, and fixed terminals D_i^unmo) and movable nodes (including mobile terminals D_i^mobi such as connected cars, smartphones, and UAVs). A fixed terminal D_i^unmo generates requests for LLMs inference tasks from time to time and remains stationary. A mobile terminal D_i^mobi moves at a given speed in a given direction and generates LLMs inference task requests from time to time during its movement. The distance between a mobile terminal D_i^mobi and the endpoints (MEC_i and CS) is needed for the calculation, and we compute it as the Euclidean distance: supposing the mobile terminal D_i^mobi has coordinates (x_1, y_1, z_1) and the endpoint has coordinates (x_2, y_2, z_2), the distance between them is d = sqrt((x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2).
D. Communication Model

The terminal Dev_i sends a task request T_t at moment t. The decision algorithm offloads it to the server Ser_j, which sends the processing result P_y back to the terminal Dev_i after the processing is completed; both transfers depend on the wireless communication channel between the sender and the receiver. Assuming that the relative distance d between the sender and the receiver remains constant during an effective communication, the data transmission rate follows the capacity formula from information theory [36]:

R = W log_2(1 + Power · G / N),   (3)

where W represents the bandwidth of the communication channel, Power represents the transmit power of the device, G represents the channel gain, and N represents the random thermal noise power of the channel. The channel gain G is related to the antenna gain g, the path loss PL, and the shadow fading X_σ. While the antenna gain g is related to the receiving device, the path loss PL and the shadow fading X_σ are both related to the channel type. X_σ is the shadow fading component, which is usually modeled as a zero-mean Gaussian random variable X_σ ~ N(0, σ^2); in general, we treat g and σ as constants. G is defined as follows:

G = g − PL − X_σ.   (4)

According to the system model, both the sender and the receiver may be on the ground or in the air, so the communication channel mainly consists of two types: Ground-to-Ground (G2G) and Ground-to-Air (G2A), where ground means ground equipment (less than 100 meters in height) and air means aerial or high-altitude equipment (more than 100 meters in height). It should be noted that MEC_i and CS are ground equipment.

Ground-to-Ground: In G2G channels, where both the sender and the receiver are terrestrial devices, the path loss PL_G2G is defined according to [37] as:

PL_G2G = 128.1 + 37.6 log(d),   (5)

where d is the Euclidean distance between the sender and the receiver.

Ground-to-Air: In the G2A channel, one of the sender and the receiver is a ground device and the other is an aerial device. According to [38], the path loss PL_G2A in the Ground-to-Air channel is defined as:

PL_G2A = 10 α log(d) + C,   (6)

where α is the path loss exponent, which is related to the environment in which the channel propagates (environmental factors include the density, type, and height of buildings and vegetation), d is the Euclidean distance between the sender and the receiver, and the constant C depends on several parameters such as the operating frequency of the device and the antenna gain.

Air-to-Air: Relay communication between UAVs is performed over Air-to-Air channels; specifically, the offloading of inference tasks and the return of results can be relayed via UAVs. According to [39], the path loss PL_A2A in A2A channels is defined as:

PL_A2A = 10 α log(d),   (7)

where α is the path loss exponent and d is the air distance between UAVs.
Importantly, in the design scenario of this paper, the relay communication between high-altitude UAVs is line-of-sight wireless transmission, so the path loss exponent α can be chosen to be a small value.
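The communication model of (3)-(7) can be summarized in the following Python sketch; the dB-to-linear conversion of the gain, the logarithm base, the distance units, and the default values of α and C are our assumptions for illustration, since the paper leaves these details implicit.

```python
import math

def channel_gain(g, PL, X_sigma):
    # Eq. (4): gain = antenna gain minus path loss minus shadow fading (in dB).
    return g - PL - X_sigma

def path_loss(channel, d, alpha=2.5, C=40.0):
    # Eqs. (5)-(7); base-10 logarithm and the default alpha/C are placeholders,
    # their real values depend on the propagation environment.
    if channel == "G2G":
        return 128.1 + 37.6 * math.log10(d)
    if channel == "G2A":
        return 10 * alpha * math.log10(d) + C
    if channel == "A2A":
        return 10 * alpha * math.log10(d)
    raise ValueError(f"unknown channel type: {channel}")

def transmission_rate(W, power, G_dB, N):
    # Eq. (3): Shannon-style rate; the dB gain is converted to a linear factor
    # here, an assumption the paper does not spell out.
    G = 10 ** (G_dB / 10)
    return W * math.log2(1 + power * G / N)
```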
E. Data Transfer Model

In the scenario designed in this paper, the data transfer includes the task offloading phase and the result return phase: the decision algorithm sends a packet of size PS_x for task T_t from terminal Dev_i to Ser_j, and returns the result packet of size PS_y from Ser_j to terminal Dev_i after processing. The task offloading phase goes through four periods: transmission, propagation, queuing, and computation. The result return phase has two periods: transmission and propagation.

In the task offloading phase, the data transmission rate R can be calculated according to (3). Suppose PS_x is the packet size of task T_t; then the transmission latency is PS_x / R. The distance d_1 between the sender and the receiver is determined according to the mobility model, and with wireless propagation the propagation latency is d_1 / c, where c represents the speed of light. When the server's task queue exceeds its parallel processing limit, the latest requests need to be queued, and the waiting time depends on the remaining time L_q of the soonest-to-finish task in the processing queue. The task processing time L_c varies depending on the server's computing power and whether or not it uses an accelerated inference framework.

In the result return phase, the processing result of task T_t is a packet of size PS_y, so the transmission latency is PS_y / R. The distance between terminal Dev_i and server Ser_j is d_2. Note that the distance d_2 may change while the server is processing the task, and the propagation latency is d_2 / c.

In summary, when a task T_t is offloaded successfully, its time delay L_Tt can be calculated as:

L_Tt = (PS_x + PS_y) / R + (d_1 + d_2) / c + L_q + L_c.   (8)

In addition, the maximum acceptable delay of every task request T_t is constrained to be t_max, which indicates the delay requirement of the requesting terminal. When the task processing result cannot be transmitted to the terminal within the maximum acceptable delay, i.e., L_Tt > t_max, task T_t is abandoned.
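A direct transcription of (8) and the abandonment rule into Python follows; variable names mirror the notation above.

```python
def task_latency(PS_x, PS_y, R, d1, d2, L_q, L_c, c=3e8):
    # Eq. (8): transmission + propagation + queuing + computation delay.
    transmission = (PS_x + PS_y) / R   # request packet and result packet
    propagation = (d1 + d2) / c        # distances to and from the serving node
    return transmission + propagation + L_q + L_c

def is_completed(L_Tt, t_max):
    # A task is abandoned when its delay exceeds the terminal's requirement t_max.
    return L_Tt <= t_max
```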
F. Problem Formulation

The objective of this paper is to find an optimal strategy for offloading resource-intensive large-model inference tasks to edge computing nodes or cloud computing nodes under limited endpoint resources. We consider an objective function that attempts to make the average latency of all LLMs inference tasks that need to be offloaded as small as possible and the accuracy of the model's prediction output as large as possible, while satisfying the bandwidth, computational, and graphics memory resource constraints of the MECs and the CS. Therefore, the total system utility is defined as:

U(L_T·, P_T·) = 1 / avg_{T_t}(L_Tt) + avg_{T_t}(P_Tt),   (9)

where avg(·) is the average function over all tasks and P_Tt is the prediction accuracy of task T_t. Thus, the final optimization objective can be expressed as maximizing the total system utility U(L_T·, P_T·):

maximize U(L_T·, P_T·)
s.t.  W_rest, C_rest, M_rest ≥ 0, ∀t,
      L_Tt ≤ t_max, ∀t,   (10)

where W_rest, C_rest, and M_rest denote the remaining bandwidth resources, computational resources, and graphics memory resources of each MEC and the CS at time t, respectively. Therefore, the optimal offloading strategy can be found by maximizing U(L_T·, P_T·).
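The utility in (9) and the feasibility conditions in (10) can be sketched as follows; reading avg(·) as the arithmetic mean over all offloaded tasks is our interpretation of the notation.

```python
def system_utility(latencies, accuracies):
    # Eq. (9): inverse of the average latency plus the average prediction accuracy.
    avg_latency = sum(latencies) / len(latencies)
    avg_accuracy = sum(accuracies) / len(accuracies)
    return 1.0 / avg_latency + avg_accuracy

def feasible(W_rest, C_rest, M_rest, L_Tt, t_max):
    # Constraints of (10): non-negative remaining resources and per-task deadline.
    return min(W_rest, C_rest, M_rest) >= 0 and L_Tt <= t_max
```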
In practical systems, price and risk are very important factors that should be considered for data offloading decision-making in edge computing systems [40]. These factors can be incorporated into our proposed scheme by revising the system utility function in (9) and the constraints in (10). This provides a potential avenue for enriching our scheme with price and risk awareness, which could further optimize offloading decisions by considering economic and reliability factors, leading to a more holistic approach to managing cloud-edge computing resources. In this paper, due to limited space, we focus on efficiency and adaptability through an active inference method without directly addressing price and risk.

IV. ACTIVE INFERENCE BASED OFFLOADING STRATEGY

In this section, we first describe the environment state representation and action representation of the agent. Then, we describe the rewardless guidance in active inference algorithms. Finally, we describe the complete algorithmic process for offloading and resource allocation of LLMs inference tasks in cloud-edge networks.

A. State and Action Representations

According to the description of the system model in Section III, in the environment of this paper, s'_j = [C_j, W_j, M_j]^T is defined to denote the state of the server Ser_j. s'_j includes the remaining computing resource C_j of Ser_j, the remaining bandwidth resource W_j, and the remaining graphics memory resource M_j. In addition, it is necessary to consider the distance matrix D between the terminal Dev_i and all servers Ser_j. Therefore, D and the state of each Ser_j are combined into the global state S_t = [D; s'_1; s'_2; ...; s'_{N_ser}] at time t.

According to the description of the system model in Section III, the actions of the agent should include offloading the LLMs inference task T_t to a server Ser_j, allocating computational resources, allocating channel bandwidth resources, and allocating graphics memory resources. Therefore, the action of the agent at time t is defined as a_t = [j, c_j, w_j, m_j], where j is the unique index of the server, c_j represents the computational resources allocated to task T_t by Ser_j, w_j represents the channel bandwidth resources allocated by Ser_j, and m_j represents the graphics memory resources allocated by Ser_j. It should be noted that the allocated resources must be less than or equal to the remaining resources of the server Ser_j; otherwise, the action is regarded as invalid and task T_t fails to run.
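A minimal sketch of the state and action encodings described above is given below; representing each server's remaining resources as a dictionary is an illustrative choice, not the paper's data structure.

```python
import numpy as np

def build_state(D, servers):
    # Global state S_t = [D; s'_1; ...; s'_N]: the distance matrix plus, for
    # every server j, its remaining compute C_j, bandwidth W_j and memory M_j.
    per_server = np.array([[s["C"], s["W"], s["M"]] for s in servers]).ravel()
    return np.concatenate([np.asarray(D, dtype=float).ravel(), per_server])

def is_valid_action(action, servers):
    # a_t = [j, c_j, w_j, m_j]; the allocation must not exceed what Ser_j has
    # left, otherwise the action is invalid and task T_t fails.
    j, c_j, w_j, m_j = action
    s = servers[j]
    return c_j <= s["C"] and w_j <= s["W"] and m_j <= s["M"]
```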
B. Rewardless Guidance in Active Inference

This paper uses an active inference-based algorithm as the agent's decision algorithm for task offloading and resource allocation. The key is the use of a simple and effective rewardless guidance in place of the reward model in the active inference decision algorithm; it plays the role of the environmental reward signal for the decision. Traditional reward models tend to use reward values from environmental feedback as the basis for model revision; these values are strongly correlated with a particular environment and generalize weakly in the presence of environmental changes. Using rewardless guidance that abstracts over multiple environments can solve this problem to some extent, because it does not directly require the environment's feedback reward values but rather summarizes multiple similar environments. In our proposed algorithm, it is important that the offloading decision ensures the task is completed with low latency and a high pass rate, so the rewardless guidance is defined as:

rg(s_t, a_t) = tc × (1 / L_Tt) + P_Tt,   (11)

where tc = 1 when task T_t is completed and tc = 0 otherwise, and P_Tt is the prediction accuracy of task T_t. In our proposed algorithm, a larger rg(s_t, a_t) indicates that selecting a_t in s_t is more consistent with the rewardless guidance, and the probability of selecting a_t is correspondingly greater. Therefore, one of the advantages of using rewardless guidance is that it does not require the real reward signal returned by the environment; it only needs to determine whether the selection of a_t in s_t conforms to the rewardless guidance.
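Equation (11) transcribes directly into Python; the grouping tc × (1/L_Tt) + P_Tt follows our reading of the typeset formula.

```python
def rewardless_guidance(completed, L_Tt, P_Tt):
    # Eq. (11): tc * (1 / L_Tt) + P_Tt, with tc = 1 only if the task finished
    # within its deadline; larger values mean the action agrees better with the
    # preference for low latency and high prediction accuracy.
    tc = 1.0 if completed else 0.0
    return tc * (1.0 / L_Tt) + P_Tt
```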
C. Active Inference Decision

The decision algorithm based on active inference is described next. In the task offloading and resource allocation environment, the agent maintains an active inference algorithm internally and interacts with the environment externally. The agent-environment interaction is not a fully observable process but a partially observable Markov decision process (POMDP) [41]. Suppose that at moment t−1 the state is represented as s_{t−1} and the agent takes an action a_{t−1} such that, with a certain probability, the state shifts to s_t at moment t; this probability is the environmental transition probability P(s_t | s_{t−1}, a_{t−1}). In a POMDP, the agent does not necessarily obtain the true state of the external environment; instead it obtains an observation o_t of its environment, o_t ~ P(o_t | s_t). In active inference, the agent internally maintains a generative model p(o, s, θ) for making predictions about the state of the external environment, which is a neural network model, and θ denotes the model parameters to be learned. In our proposed algorithm, instead of maintaining a reward model based on a large neural network to obtain preferences, we use rewardless guidance to obtain preferences, which influence action selection during agent planning.

In active inference algorithms, the free energy principle is an important concept. It was proposed by Friston et al. to describe how an agent minimizes its free energy by reasoning about the environment [14]. The free energy principle is based on Bayesian reasoning and information theory and aims to explain how an agent perceives and understands the environment and makes decisions based on its understanding. Free energy can be considered a measure that represents uncertainty about the state of the environment, and the goal of the agent is to reduce this uncertainty by reducing the free energy [30].

According to the free energy principle, the agent achieves free energy minimization through two processes. The first is the POMDP process described earlier, where the agent senses the environment to obtain external information o_t and forms an internal model p(o, s, θ) of the state of the environment. The second is the action planning process, where the agent takes optimal actions to reduce free energy based on the internal generative model and the objective function. Through this active inference process, the agent can better understand the environment, predict the future state, and take appropriate actions to achieve its goals. The optimization goal of active inference is to maximize the evidence of the agent's generative model, i.e., to minimize the free energy, and by setting the expected preferences, the agent's generative model p(o, s, θ) can be driven toward the preferred goal state.

The standard free energy is defined at a single moment t, whereas in the active inference algorithm used in this paper, the agent's optimization goal is to minimize the variational free energy F = D_KL(q(s, θ) || p(o, s, θ)), where q(s, θ) is the agent's belief about the future variables; the variational free energy F is also referred to as the (negative) evidence lower bound (ELBO) [42]. The agent chooses the strategy π that minimizes the variational free energy F [31]. According to [43], the free energy of the expected future used in this paper is defined as:

F̃ = D_KL(q(o_0:T, s_0:T, θ, π) || p(o_0:T, s_0:T, θ)),   (12)

where o_0:T represents the observation sequence of the agent over the time horizon 0:T, s_0:T represents the corresponding state sequence, q(o_0:T, s_0:T, θ, π) represents the agent's belief about the future variables, p(o_0:T, s_0:T, θ) is the agent's generative model, and θ denotes the parameters of the generative model's neural network. The target strategy π* can be found by minimizing the expected future free energy F̃. In practical calculations, the free energy is minimized by bringing the output distribution p(o_0:T, s_0:T, θ) of the generative model closer and closer to the belief distribution q(o_0:T, s_0:T, θ, π):

D_KL(q(o_0:T, s_0:T, θ, π) || p(o_0:T, s_0:T, θ)) = 0 ⇒ F̃ = 0.   (13)

Algorithm 1 gives the complete procedure for task offloading and resource allocation based on active inference.
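A rough sketch of how an agent could select actions by minimizing a Monte-Carlo estimate of the expected future free energy in (12) is shown below; the model.rollout interface, the horizon, the sample count, and the use of the rewardless guidance plus an information-gain term as the per-step score are all assumptions for illustration and are not taken from the paper's Algorithm 1, which is not reproduced here.

```python
def expected_free_energy(model, s_t, action, horizon=5, n_samples=8):
    # Monte-Carlo sketch of the expected-future free energy in (12): roll the
    # learned generative model forward and score imagined futures. The
    # `model.rollout` interface (returning steps with .guidance and .info_gain)
    # is assumed, not specified by the paper.
    total = 0.0
    for _ in range(n_samples):
        trajectory = model.rollout(s_t, action, horizon)   # imagined futures
        # Lower expected free energy <=> futures that match the preferences
        # (high rewardless guidance) and reduce uncertainty (high info gain).
        total += sum(-step.guidance - step.info_gain for step in trajectory)
    return total / n_samples

def select_action(model, s_t, candidate_actions):
    # The agent chooses the action whose imagined futures minimize F-tilde (13).
    return min(candidate_actions,
               key=lambda a: expected_free_energy(model, s_t, a))
```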
D. Complexity Analysis

The complexity of Algorithm 1 can be analyzed as follows. The outer loop runs for n_episodes iterations. Within each episode,
an existing scheme [45] that does not consider LLMs. They have already achieved great success in the current DRL field, with state-of-the-art performance in various application scenarios, both discrete and continuous. Regarding the scope of the reward function: in the DRL algorithms mentioned in this paper, the reward function is used for both training and performance evaluation, whereas in the algorithm proposed in this paper, the reward function has nothing to do with the selection of actions and is only used for performance evaluation, so as to make an effective comparison with the DRLs. According to the problem formulation, the overall system requires low latency and a high pass rate for LLMs inference tasks, so the reward function is defined as:

r = 1 / L_Tt + P_Tt.   (14)

Evaluation Metrics. In the experiments of this paper, the evaluation metrics include the sum reward, task completion rate, average latency, and average pass@100, each computed over one episode; they denote the total reward, the fraction of completed tasks, the average latency of all tasks, and the average pass@100 of all tasks, respectively. pass@100 is the pass rate over one hundred generated samples.
B. DRLs Comparison

This section analyzes the performance of the training phase of our proposed method and the existing schemes with the settings t_max = 15 and n_tasks = 100, which together ensure the validity of the subsequent experiments on t_max variation and task load. Fig. 3 shows the performance comparison between our proposed method and mainstream DRLs during the training phase, including sum reward, task completion rate, average latency, and average pass@100.

As shown in Fig. 3(a), our proposed method reaches convergence at about episode 200 and its convergence level is higher than that of mainstream DRLs, although our algorithm does not learn a better strategy in the first 50 episodes of the training phase, owing to the high complexity of the environment. Rainbow DQN converges fast in the early stage, but its final convergence level is lower than that of our proposed method, and SAC is clearly lower than the other three algorithms. PPO converges slowly in the early stage, but its convergence level in the late stage is slightly higher than that of Rainbow DQN and only lower than that of our proposed method. To summarize, our proposed method performs better than the three mainstream DRL algorithms in terms of both convergence speed and convergence level, indicating that it can train better strategies in the complex environment considered in this paper.

It can be seen from Fig. 3(b) that our proposed method can eventually achieve a task completion rate of about 99%, while the task completion rate of Rainbow DQN is about 85%, that of PPO is about 90%, and that of SAC is about 80%. In terms of task completion rate, our proposed method outperforms all three mainstream DRL algorithms. In addition, this shows that our proposed method balances all tasks while achieving the best convergence performance and does not give up individual tasks to improve the overall performance.

In Fig. 3(c), it can be seen that our proposed method, Rainbow DQN, and PPO can all eventually achieve an average task completion latency of about 8 s, while SAC only achieves about 10 s, and our proposed method is slightly lower than Rainbow DQN and PPO. It should be noted that all three mainstream DRL algorithms show average latency jitter during the training process, i.e., the shaded part of the corresponding curve in the figure, which
indicates that the strategies of mainstream DRLs produce unstable action outputs during the training process; in this regard, our proposed method clearly outperforms the mainstream DRLs.

In Fig. 3(d), it can be seen that the average pass@100 of our proposed method reaches around 0.175, the average pass@100 of Rainbow DQN and PPO reaches around 0.15, and the average pass@100 of SAC only reaches around 0.14. This shows that in the training phase, under the same setup, the strategy of our proposed method tends to offload tasks more to the edge, which has high inference accuracy, while the strategies trained by mainstream DRLs offload tasks more to the cloud. This suggests that mainstream DRLs tend to trade pass@100 for low latency, but combining this with Fig. 3(c) reveals that our proposed method balances both.

In summary, our proposed method can train a strategy that balances average task delay, average pass@100, and task completion rate. Rainbow DQN and PPO guarantee a lower average task delay but cannot balance the average pass@100 and task completion rate, and SAC has the worst overall performance.

C. Latency Variation

To explore the latency behavior of the algorithms, i.e., the performance of the strategies in scenarios where the maximum latency requirement of the task, t_max, varies, the four algorithmic strategies obtained from the above training are used for testing. The range of t_max is set to 1-15 s with an interval of 1 s, and n_tasks = 100. Since t_max = 15 is used in the training phase, the latency behavior of the algorithms can be compared when t_max < 15.

When t_max ≤ 2, it can be seen from Fig. 4(b) that no tasks can be completed regardless of which algorithm is used. Combined with Fig. 2, even when offloading to the cloud with the fastest inference, the shortest inference time takes more than 2 s, and with the wireless transmission time, a total latency greater than 2 s is inevitable.

When 3 ≤ t_max ≤ 9, according to Fig. 4(b), only our proposed method and Rainbow DQN can keep the task completion rate above 20%, while both PPO and SAC remain below 20%. According to Fig. 2, the minimum inference time at the edge is greater than 9 s, and the delay of wireless transmission also needs to be considered. Combined with the environment settings, the resource ratio of the cloud server to the four edge servers is 1:4, which corresponds exactly to a task completion rate of 20%. Therefore, it can be concluded that tasks can only be offloaded to the cloud in this range. According to Fig. 4(c), the curve declines compared to when t_max ≤ 2, because tasks can now be offloaded to the cloud.

When t_max ≥ 10, according to Fig. 4(b), only our proposed method can achieve a task completion rate of about 100%. It can also be seen that our proposed method is optimal on all four indicators.

To summarize, the four algorithmic strategies obtained from training are tested under different values of t_max, and our proposed method has the best overall performance.

D. Task Load Variation

To explore the effect of task load on the strategies, the strategies obtained from the above training are used for testing
by setting n_tasks to vary from 100 to 200, with t_max = 15. n_tasks denotes the number of tasks in an episode of the environment. Since no resource scheduling is needed when resources are sufficient (task load less than or equal to 100%), the task load is set to n_tasks ∈ [100, 200] for the experiment, which corresponds to a task load of 100%-200%.

It can be seen from Fig. 5(a) that the sum reward of our proposed method is always higher than that of mainstream DRLs. For the existing schemes, the sum reward increases as the load increases, but it cannot reach the maximum level. The reason the sum reward keeps increasing is that mainstream DRLs do not allocate all the resources to existing users, so whenever the number of users increases they still have resources left to allocate; they cannot reach the highest level precisely because they cannot allocate all the resources efficiently to the existing users. Our proposed method allocates all the resources efficiently at all times, so its sum reward always remains at the highest level. In addition, SAC eventually outperforms PPO as the task load increases, which indicates that SAC is more suitable than PPO in high-load environments, although it is still lower than our proposed method and Rainbow DQN.

In Fig. 5(b) and (d), it can be seen that the task completion rate and average pass@100 of the four algorithms decrease as the task load increases, which is reasonable because the total amount of resources becomes less and less able to satisfy the demand of all tasks. Nevertheless, our proposed method consistently outperforms mainstream DRLs.

According to Fig. 5(c), the average latency of our proposed method increases as the task load increases, which is reasonable because an increase in task load means that more tasks are requesting resources, which inevitably leads to an upward trend in average latency.

To summarize, since mainstream DRLs and the existing scheme cannot balance latency, task completion rate, and pass@100 well, their final performance is lower than that of our proposed method under increasing task load.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, in order to solve the resource dilemma of LLMs inference tasks, we proposed to use active inference with rewardless guidance in cloud-edge computing to solve the offloading and resource allocation problem for LLMs inference tasks. Specifically, by constructing a computationally powerful cloud-edge network system, the LLMs inference task request from the terminal is sent to a server for processing and the result is returned to the terminal. Extensive simulation results show that this scheme is effective: our proposed method outperforms the mainstream DRLs, both in terms of convergence performance in the training phase and in the maximum tolerable latency and task load experiments in the testing phase. In future work, we plan to apply this scheme to more complex environments, such as more diverse types of device terminals and distributed environments, as well as to use more advanced network systems, such as space-air-ground integrated networks, to schedule more diverse resources, and to work on further improving the algorithm's performance.

REFERENCES

[1] L. Fan, L. Li, Z. Ma, S. Lee, H. Yu, and L. Hemphill, "A bibliometric review of large language models research from 2017 to 2023," 2023, arXiv:2304.02020.
[2] C. Zhou et al., "A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT," 2023, arXiv:2302.09419.
[3] K. Cao, Y. Liu, G. Meng, and Q. Sun, "An overview on edge computing research," IEEE Access, vol. 8, pp. 85714-85728, 2020.
[4] H. Li, G. Shou, Y. Hu, and Z. Guo, "Mobile edge computing: Progress and challenges," in Proc. 4th IEEE Int. Conf. Mobile Cloud Comput., Serv., Eng., 2016, pp. 83-84.
[5] Y. He et al., "Deep-reinforcement-learning-based optimization for cache-enabled opportunistic interference alignment wireless networks," IEEE Trans. Veh. Technol., vol. 66, no. 11, pp. 10433-10445, Nov. 2017.
[6] Y. Li, "Deep reinforcement learning: An overview," 2017, arXiv:1701.07274.
[7] M. Hessel et al., "Rainbow: Combining improvements in deep reinforcement learning," in Proc. AAAI Conf. Artif. Intell., 2018.
[8] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
[9] T. Haarnoja et al., "Soft actor-critic algorithms and applications," 2018, arXiv:1812.05905.
[10] Y. He, N. Zhao, and H. Yin, "Integrated networking, caching, and computing for connected vehicles: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 67, no. 1, pp. 44-55, Jan. 2018.
[11] Y. He, F. R. Yu, N. Zhao, V. C. Leung, and H. Yin, "Software-defined networks with mobile edge computing and caching for smart cities: A Big Data deep reinforcement learning approach," IEEE Commun. Mag., vol. 55, no. 12, pp. 31-37, Dec. 2017.
[12] J. Wang, L. Zhao, J. Liu, and N. Kato, "Smart resource allocation for mobile edge computing: A deep reinforcement learning approach," IEEE Trans. Emerg. Topics Comput., vol. 9, no. 3, pp. 1529-1541, Jul.-Sep. 2021.
[13] Y. Hu et al., "Learning to utilize shaping rewards: A new approach of reward shaping," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 15931-15941.
[14] K. Friston, "A free energy principle for a particular physics," 2019, arXiv:1906.10184.
[15] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, "A survey on distributed machine learning," ACM Comput. Surv., vol. 53, no. 2, pp. 1-33, 2020.
[16] N. Yang et al., "Inference with reference: Lossless acceleration of large language models," 2023, arXiv:2304.04487.
[17] Y. Sheng et al., "High-throughput generative inference of large language models with a single GPU," 2023, arXiv:2303.06865.
[18] Y. Wang, K. Chen, H. Tan, and K. Guo, "Tabi: An efficient multi-level inference system for large language models," in Proc. 18th Eur. Conf. Comput. Syst., 2023, pp. 233-248.
[19] Z. Cheng, J. Kasai, and T. Yu, "Batch prompting: Efficient inference with large language model APIs," 2023, arXiv:2301.08721.
[20] L. Lin, X. Liao, H. Jin, and P. Li, "Computation offloading toward edge computing," Proc. IEEE, vol. 107, no. 8, pp. 1584-1607, Aug. 2019.
[21] P. Mach and Z. Becvar, "Mobile edge computing: A survey on architecture and computation offloading," IEEE Commun. Surveys Tuts., vol. 19, no. 3, pp. 1628-1656, Third Quarter 2017.
[22] E. Li, L. Zeng, Z. Zhou, and X. Chen, "Edge AI: On-demand accelerating deep neural network inference via edge computing," IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 447-457, Jan. 2020.
[23] N. Li, A. Iosifidis, and Q. Zhang, "Collaborative edge computing for distributed CNN inference acceleration using receptive field-based segmentation," Comput. Netw., vol. 214, 2022, Art. no. 109150.
[24] C. Hu and B. Li, "Distributed inference with deep learning models across heterogeneous edge devices," in Proc. IEEE Conf. Comput. Commun., 2022, pp. 330-339.
[25] L. Shi, Z. Xu, Y. Sun, Y. Shi, Y. Fan, and X. Ding, "A DNN inference acceleration algorithm combining model partition and task allocation in heterogeneous edge computing system," Peer-to-Peer Netw. Appl., vol. 14, no. 6, pp. 4031-4045, 2021.
[26] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26-38, Nov. 2017.
[27] Y. Liu, H. Yu, S. Xie, and Y. Zhang, "Deep reinforcement learning for offloading and resource allocation in vehicle edge computing and networks," IEEE Trans. Veh. Technol., vol. 68, no. 11, pp. 11158-11168, Nov. 2019.
[28] M. Tang and V. W. Wong, "Deep reinforcement learning for task offloading in mobile edge computing systems," IEEE Trans. Mobile Comput., vol. 21, no. 6, pp. 1985-1997, Jun. 2022.
[29] K. Friston, P. Schwartenbeck, T. FitzGerald, M. Moutoussis, T. Behrens, and R. J. Dolan, "The anatomy of choice: Dopamine and decision-making," Philos. Trans. Roy. Soc. B: Biol. Sci., vol. 369, no. 1655, 2014, Art. no. 20130481.
[30] K. Friston et al., "Active inference and learning," Neurosci. Biobehavioral Rev., vol. 68, pp. 862-879, 2016.
[31] K. Friston, F. Rigoli, D. Ognibene, C. Mathys, T. Fitzgerald, and G. Pezzulo, "Active inference and epistemic value," Cogn. Neurosci., vol. 6, no. 4, pp. 187-214, 2015.
[32] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[33] T. Parr and K. J. Friston, "Uncertainty, epistemics and active inference," J. Roy. Soc. Interface, vol. 14, no. 136, 2017, Art. no. 20170376.
[34] B. Wang and A. Komatsuzaki, "GPT-J-6B: A 6 billion parameter autoregressive language model," 2021. [Online]. Available: https://www.eleuther.ai/artifacts/gpt-j
[35] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017.
[36] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 3, pp. 379-423, 1948.
[37] S. Sun, T. A. Thomas, T. S. Rappaport, H. Nguyen, I. Z. Kovacs, and I. Rodriguez, "Path loss, shadow fading, and line-of-sight probability models for 5G urban macro-cellular scenarios," in Proc. IEEE Globecom Workshops, 2015, pp. 1-7.
[38] A. Al-Hourani and K. Gomez, "Modeling cellular-to-UAV path-loss for suburban environments," IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 82-85, Feb. 2018.
[39] A. A. Khuwaja, Y. Chen, N. Zhao, M.-S. Alouini, and P. Dobbins, "A survey of channel modeling for UAV communications," IEEE Commun. Surv. Tut., vol. 20, no. 4, pp. 2804-2821, Fourth Quarter 2018.
[40] G. Mitsis, E. E. Tsiropoulou, and S. Papavassiliou, "Price and risk awareness for data offloading decision-making in edge computing systems," IEEE Syst. J., vol. 16, no. 4, pp. 6546-6557, Dec. 2022.
[41] R. Xie, Q. Tang, C. Liang, F. R. Yu, and T. Huang, "Dynamic computation offloading in IoT fog systems with imperfect channel state information: A POMDP approach," IEEE Internet Things J., vol. 8, no. 1, pp. 345-356, Jan. 2021.
[42] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, "Variational inference: A review for statisticians," J. Amer. Stat. Assoc., vol. 112, no. 518, pp. 859-877, 2017.
[43] A. Tschantz, B. Millidge, A. K. Seth, and C. L. Buckley, "Reinforcement learning through active inference," 2020, arXiv:2002.12636.
[44] M. Chen et al., "Evaluating large language models trained on code," 2021, arXiv:2107.03374.
[45] J. Hou, M. Chen, H. Geng, R. Li, and J. Lu, "GP-NFSP: Decentralized task offloading for mobile edge computing with independent reinforcement learning," Future Gener. Comput. Syst., vol. 141, pp. 205-217, 2023.