
Deep Reinforcement Learning for Smart Grid Operations: Algorithms, Applications, and Prospects

This article provides a detailed and well-organized overview of deep reinforcement learning (DRL) methodologies, which encompasses fundamental concepts and theoretical DRL principles, as well as the most sophisticated DRL techniques applied to power system operations.

By YUANZHENG LI, Senior Member IEEE, CHAOFAN YU, MOHAMMAD SHAHIDEHPOUR, Life Fellow IEEE, TAO YANG, Senior Member IEEE, ZHIGANG ZENG, Fellow IEEE, AND TIANYOU CHAI, Life Fellow IEEE

ABSTRACT | With the increasing penetration of renewable energy and flexible loads in smart grids, a more complicated power system with high uncertainty is gradually formed, which brings about great challenges to smart grid operations. Traditional optimization methods usually require accurate mathematical models and parameters and cannot deal well with the growing complexity and uncertainty. Fortunately, the widespread popularity of advanced meters makes it possible for smart grid to collect massive data, which offers opportunities for data-driven artificial intelligence methods to address the optimal operation and control issues. Therein, deep reinforcement learning (DRL) has attracted extensive attention for its excellent performance in operation problems with high uncertainty. To this end, this article presents a comprehensive literature survey on DRL and its applications in smart grid operations. First, a detailed overview of DRL, from fundamental concepts to advanced models, is conducted in this article. Afterward, we review various DRL techniques as well as their extensions developed to cope with emerging issues in the smart grid, including optimal dispatch, operational control, electricity market, and other emerging areas. In addition, an application-oriented survey of DRL in smart grid is presented to identify difficulties for future research. Finally, essential challenges, potential solutions, and future research directions concerning the DRL applications in smart grid are also discussed.

KEYWORDS | Deep reinforcement learning (DRL); electricity market; operational control; optimal dispatch; smart grid (SG).

Manuscript received 15 July 2022; revised 15 June 2023; accepted 1 August 2023. Date of publication 5 September 2023; date of current version 15 September 2023. This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0201300, in part by the National Natural Science Foundation of China under Grant 62073148, in part by the Key Project of National Natural Science Foundation of China under Grant 62233006, in part by the Smart Grid Joint Key Project of National Natural Science Foundation of China and the State Grid Corporation of China under Grant U2066202, in part by the Major Program of National Natural Science Foundation of China under Grant 61991400, and in part by the 2020 Science and Technology Major Project of Liaoning Province under Grant 2020JH1/10100008. (Corresponding author: Zhigang Zeng.)

Yuanzheng Li and Zhigang Zeng are with the School of Artificial Intelligence and Automation, Autonomous Intelligent Unmanned System Engineering Research Center, Key Laboratory of Image Processing and Intelligence Control, Ministry of Education of China, and the Hubei Key Laboratory of Brain-Inspired Intelligent Systems and the Belt and Road Joint Laboratory on Measurement and Control Technology, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]; [email protected]).

Chaofan Yu is with the China-EU Institute for Clean and Renewable Energy, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]).

Mohammad Shahidehpour is with the Robert W. Galvin Center for Electricity Innovation, Illinois Institute of Technology, Chicago, IL 60616 USA (e-mail: [email protected]).

Tao Yang and Tianyou Chai are with the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/JPROC.2023.3303358

NOMENCLATURE

Notations
A, a          Set of actions and action.
S, s          Set of all states and state.
P             Transition probability.
R             Set of all possible rewards.


Ω             Set of observations.
O             Observation probabilities.
L             Loss error between prediction and target.
R_t           Cumulative reward at time step t.
π, π*         Policy (decision-making rule) and optimal policy.
π(a|s)        Probability of taking action a in state s under stochastic policy π.
µ(s)          Action taken in state s under deterministic policy µ.
o, b(s)       Current observation and its belief about state s.
p(s′|s, a)    Probability of transitioning to state s′ from state s taking action a.
r(s, a)       Expected immediate reward from state s after action a.
V^π(s)        Value of state s under policy π (expected return).
V*(s)         Value of state s under the optimal policy.
Q^π(s, a)     Value of taking action a in state s under policy π.
Q*(s, a)      Value of taking action a in state s under the optimal policy.
V(s)          State-value function.
A(a)          State-dependent action advantage function.
δ_t           Temporal difference error at time step t.
θ             Parameter vector of target policy.
ψ             Parameter vector of the critic.
π_θ           Policy corresponding to parameter θ.
J(θ)          Performance measure for the policy π_θ.
ϵ             Probability of taking a random action in an ϵ-greedy policy.
α, β          Step-size parameters.
γ             Discount-rate parameter.
λ             Lagrange multiplier.
t             Discrete time step.
τ             Trajectory of state–action pairs.

Abbreviations
AC            Actor–critic.
A2C           Advantage actor–critic.
A3C           Asynchronous advantage actor–critic.
AGC           Automatic generation control.
AI            Artificial intelligence.
ANN           Artificial neural network.
ARS           Augmented random search.
AVC           Autonomous voltage control.
CHP           Combined heat and power.
CNN           Convolutional neural network.
CPO           Constrained policy optimization.
DDPG          Deep deterministic policy gradient.
DDQN          Double deep Q-network.
DER           Distributed energy resource.
DFF           Deep feedforward.
DFRL          Deep forest reinforcement learning.
DG            Diesel generator.
DIRL          Deep inverse reinforcement learning.
DL            Deep learning.
DN            Distribution network.
DNN           Deep neural network.
DNR           Distribution network reconfiguration.
DQN           Deep Q-network.
DRL           Deep reinforcement learning.
ESS           Energy storage system.
EV            Electric vehicle.
FDI           False data injection.
IES           Integrated energy system.
LFC           Load frequency control.
LSTM          Long short-term memory.
MAAC          Multi-actor-attention-critic.
MCDRL         Monte Carlo DRL.
MDP           Markov decision process.
P2P           Peer-to-peer.
PARS          Parallel ARS.
PPO           Proximal policy optimization.
RL            Reinforcement learning.
RNN           Recurrent neural network.
SAC           Soft actor–critic.
SARSA         State-action-reward-state-action.
SG            Smart grid.
SSA           Security situational awareness.
TD            Temporal difference.
TD3           Twin delayed DDPG.
TDAC          Three-network double-delay actor–critic.
TRPO          Trust region policy optimization.

I. INTRODUCTION

Energy signifies an important basis for the development and survival of human societies, where social developments are accompanied by a continuous increase in energy demand. However, the massive use of fossil energy in developed societies brings about a series of global problems pertaining to environmental pollution, ecological destruction, and global warming. In order to mitigate the expanding environmental concerns, attractive SG technologies have been developed in various parts of the world [1], [2], [3], [4], [5]. The essence of using SGs along with conventional power systems is itemized as follows.

1) The power generation profile is shifting from a controllable and continuous supply of coal-fired power generation to renewable energy (RE) with high uncertainty and weak controllability.
2) The load characteristic is shifting from the traditional rigid and purely captive type to a flexible and active type that combines both production and consumption of energy.
3) The power grid is transitioning from a traditional one-way flow to an energy Internet with a two-way flow that includes large-scale hybrid ac/dc subsystems, microgrids, and adjustable loads.
4) The power grid foundation is shifting from a traditional mechanical–electromagnetic system of synchronous generators with high inertia to a hybrid system dominated by low-inertia power electronic devices.


These characteristics increase the uncertainty and complexity, which brings great challenges to the secure and economic operations of SG [6]. To solve these problems, various approaches have been proposed for the optimal operations of SG. However, conventional optimization methods require accurate mathematical models and parameters, which makes it difficult to apply them to increasingly complex and distributed systems with multiple uncertain subsystems. Consequently, the applications of traditional methods are limited in practice, which calls for a more intelligent and efficient solution.

With the wide application of advanced sensors, smart meters, and monitoring systems, SG is producing massive data with mutual correlations [7], [8], [9], which also offers the data basis for data-driven AI methods, for instance, RL. Indeed, RL is one of the most important research topics of AI over the last two decades, due to its excellent ability of self-directed learning, adaptive adjustment, and optimal decision. Specifically, RL is a learning process that allows the agent to periodically make decisions, observe the results, and then automatically adjust its action to achieve the optimal policy. For instance, as one of the pioneering works in the application of RL to renewable power system operations, Liao et al. [10] proposed a multiobjective optimization algorithm based on learning automata for economic emission dispatching and voltage stability enhancement in SGs. Simulation results have demonstrated that the proposed method achieves accurate solutions and adapts effectively to dynamic fluctuations in wind power and load demand.

Despite all these advantages, RL is still unsuitable and inapplicable to complicated large-scale problem environments as it has to explore and gain knowledge from the entire system, which takes much time to obtain the best policy. In this situation, the applicability of RL has encountered serious challenges in the real world. Recently, the rapid development of DL has aroused great interest in industry and academia [11], [12]. The deep RL architectures result in a better data processing and representation learning capability, which provides a potential solution to overcome the RL limitations; that is, the combination of RL and DL has led to a breakthrough technique, named DRL, which integrates the decision-making capacity of RL and the DL perception capability [13]. More precisely, DRL improves the learning speed and performance of conventional RL, by virtue of the advantages of DNNs in the training process. Therefore, DRL has been introduced in various applications and achieved phenomenal success, such as games, robotics, natural language processing, computer vision, and SG operations [14].

In the field of SG, DRL has been intensively adopted to undertake various tasks, which stem from developing the optimal policy. As mentioned above, SG is one of the largest artificial systems, which is well known for its highly uncertain and nonlinear operating characteristics. Although several approaches have been developed for SG, they still suffer from great computational complexity and strong randomness. To this end, DRL has been regarded as an alternative solution to overcome these challenges. Generally speaking, the DRL methods provide the following advantages.

1) DRL can achieve the optimal solution of sophisticated grid optimization without using complete and accurate network information.
2) DRL allows grid entities to learn and build knowledge about the environment on the basis of historical data.
3) DRL offers autonomous decision-making with minimum information exchange, which not only reduces the computational burden but also improves the SG security and robustness.
4) DRL significantly enhances the learning capability in comparison to the traditional RL, especially in problems with numerous states and action spaces.

Although there exist some RL reviews, detailed discussions on DRL applications in SG are still lacking. Specifically, existing surveys have focused on the DRL applications to the Internet of Things, natural language processing, and computer vision [15], [16], [17], [18]. Indeed, there are some excellent reviews on RL applications to energy systems [19], [20], [21], [22], [23], [24], [25], [26]. However, they mainly concentrate on conventional RL methods or power systems, rather than presenting state-of-the-art DRL approaches to SG applications. A detailed comparison between our work and related surveys is presented in Table 1 to identify the unique aspects and novel perspectives that distinguish our work. It could be observed that there exist some reviews on DRL-based decision-making in conventional power systems and modern SG. For instance, Chen et al. [19] provided a comprehensive review of various DRL techniques and their potential applications in power systems, with a focus on three key applications: frequency regulation, voltage control, and energy management.

However, emerging energy solutions for improving the SG efficiency and ensuring its secure operations are not covered in [19]. Zhang et al. [20] and Glavic [21] covered multiple aspects of power system operations, which include optimal dispatch, operational control, electricity market, and others. Although Zhang et al. [20] and Glavic [21] provided summaries of typical DRL algorithms such as DQN, DDPG, and AC, they do not cover recently developed state-of-the-art DRL methods. To this end, Cao et al. [22] provided a comprehensive summary of advanced DRL algorithms and their SG applications, including value-based, policy-based, and AC solution methods. Although summaries of DRL algorithms are detailed, Cao et al. [22] did not convey emerging SG areas, including P2P trading markets and privacy preservation issues.

Yang et al. [23], Perera and Kamalaruban [24], and Yu et al. [25] classified DRL papers in the literature into seven categories according to their application fields. Their study reveals that about half of the publications use Q-learning, whereas some of the state-of-the-art DRL methods are not utilized in power system applications.


Table 1 Comparison Between Our Work and Related Surveys

Although Zhang et al. [26] attempted to provide an in-depth analysis of DRL application algorithms in SGs, it lacks a summary of advanced algorithms applied to power systems. With the significant AI applications in power systems, the number of publications that use DRL has also grown rapidly, where many state-of-the-art DRL algorithms have been proposed for SG operations. This major development calls for a more comprehensive analysis of potential DRL applications to SGs. Accordingly, this article is dedicated to presenting a relatively holistic overview of various DRL-based methods applied to SG operations.

The main contributions of this article are listed as follows.

1) A detailed and well-organized overview of DRL methodologies is provided, which encompasses fundamental concepts and theoretical DRL principles, as well as the most sophisticated DRL techniques applied to power system operations.
2) The SG operation issues are divided into four critical categories to illustrate DRL applications to modeling, design, solution, and numerical experimentation.
3) An in-depth understanding of challenges, potential solutions, and future directions for deploying DRL to SG problems is discussed, and the outlook for additional developments is presented.

Different from the excellent prior works about RL, this article attempts to conduct a relatively exhaustive review of DRL applications to SG operations, especially for the last few years. These reviews will encompass emerging topics such as optimal economic dispatch, distributed control, and electricity markets. First, with the increasing penetration of RE, the optimal dispatch of SG resources is confronted with unprecedented challenges, including massive operational uncertainty, lower system inertia and new dynamic phenomena, and highly nonlinear and complex power systems that cannot be effectively represented and constructed by existing mathematical tools [27]. Second, operational control is a critical SG task that involves device control and coordination, including generators, transformers, and capacitors. Traditional mathematical methods may be based on simplified models and linear control strategies, which may struggle to address the complexities of nonlinear and dynamic SG characteristics.

Third, electricity market operation is a complex optimization problem that involves multiple participants, variables, and uncertainties. Conventional mathematical methods may rely on simplified models and prevailing assumptions, which may not fully capture the complexity of real-world scenarios. Finally, widespread deployments of Internet-connected SG devices have significantly increased the vulnerability of power systems to cyberattacks. Cyberattack dynamics and complexities necessitate the implementation of responsive, adaptive, and scalable protection mechanisms in SGs. These requirements are difficult to achieve by typical operation methods that would rely on static security measures [28]. More importantly, through summarizing, highlighting, and analyzing the DRL characteristics and their SG applications, this survey article would highlight specific potential research directions for interested parties. The rest of this article is organized as follows. Section II introduces the evolution of DRL and discusses its state-of-the-art techniques as well as their extensions. In Section III, the detailed DRL applications in SG are presented. After that, Section IV discusses the prospects and challenges of DRL in SG operations in the future. Finally, the conclusion of this article is drawn in Section V.

II. DRL: AN OVERVIEW

In this section, the fundamental knowledge of MDP, RL, and DL techniques, which are crucial components of DRL, is introduced first. Then, the combination of DL and RL is presented, which results in the formation of DRL. Finally, advanced DRL models as well as their state-of-the-art extensions are reviewed.

A. Markov Decision Process

In mathematics, MDP is a discrete-time stochastic control process, which assumes that the future state is only related to the present state and is independent of the past states [29]. MDP provides a useful framework for modeling the decision-making problems in situations where the solutions are deemed to be partly random and uncontrollable. MDPs are popular for studying optimization problems solved by dynamic programming and RL approaches [30]. Generally, an MDP is defined as a tuple (S, A, P, R), where S is the set of finite states named the state space, A represents the set of actions called the action space, P is the transition probability from state s to state s′ after action a is executed, and R denotes the immediate reward received after the state transition from state s to state s′, due to the performance of action a.
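To make the tuple (S, A, P, R) concrete, the following minimal Python sketch encodes a toy MDP and samples one transition. All names, states, and probabilities are illustrative placeholders introduced for this example, not values taken from the article.

import random

# A toy MDP (S, A, P, R) with two states and two actions; the numbers
# below are illustrative placeholders only.
S = ["low_demand", "high_demand"]          # state space S
A = ["charge", "discharge"]                # action space A

# P[s][a] -> list of (next_state, probability); R[(s, a, s2)] -> reward
P = {
    "low_demand":  {"charge":    [("low_demand", 0.8), ("high_demand", 0.2)],
                    "discharge": [("low_demand", 0.5), ("high_demand", 0.5)]},
    "high_demand": {"charge":    [("low_demand", 0.3), ("high_demand", 0.7)],
                    "discharge": [("low_demand", 0.6), ("high_demand", 0.4)]},
}
R = {(s, a, s2): random.uniform(-1.0, 1.0) for s in S for a in A for s2 in S}

def step(s, a):
    """Sample s' ~ P(.|s, a) and return (s', r), i.e., one MDP transition."""
    next_states, probs = zip(*P[s][a])
    s2 = random.choices(next_states, weights=probs, k=1)[0]
    return s2, R[(s, a, s2)]

s_next, r = step("low_demand", "charge")
print(s_next, r)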


The reward function serves as a guide for the DRL agent, shaping its behavior and influencing its learning process. While there is no one-size-fits-all approach to design the reward function, there are certain principles that can be followed to enhance DRL effectiveness. One important principle is to ensure that the reward function effectively reflects the control goals of the DRL problem. Rewards should be aligned with the desired behavior or outcome that the agent is expected to achieve. This requires careful consideration of specific tasks and the objectives that need to be optimized. A policy function π is defined as the mapping from state space S to the action space A, which determines how the decision-maker selects actions. The illustration of MDP is shown in Fig. 1. In each epoch, the decision-maker chooses an action a_i based on its policy about the current state s_i, i.e., π(s_i). Then, the current state s_i transfers to the next state s_{i+1} with a probability of p(s_{i+1}|s_i, a_i) and obtains the immediate reward r(s_i, a_i, s_{i+1}).

Fig. 1. Illustration diagrams of completely observable and partially observable MDP.

1) Partially Observable Markov Decision Process: It is assumed that the system state is completely observable by the agent in a conventional MDP. However, the agent can only observe a part of the system state in many cases, and thus, a partially observable Markov decision process (POMDP) is proposed to establish the decision-making model while considering the uncertainty introduced by the partial observation. Actually, POMDP is a mathematical framework for modeling the decision-making situations where the decision-maker only has partial information about the state of the system. POMDP is an extension of MDP, which accounts for cases when some state data are missing or considered uncertain. In POMDP, the decision-maker receives an observation of the system's state, rather than the true state itself.

A typical POMDP is defined as a six-tuple (S, A, P, R, Ω, O), where (S, A, P, R) are denoted in the same way as in MDPs. Ω and O refer to the set of observations and their corresponding observation probabilities, respectively. The illustration of POMDP is presented in Fig. 1. At each time period, the agent chooses an action a_t ∈ A according to its belief about the current state s_t ∈ S, i.e., b(s_t). Then, the current state s_t transfers to the next state s_{t+1} with a probability of p(s_{t+1}|s_t, a_t). What distinguishes a POMDP from the completely observable MDP is that the agent now perceives an observation o_{t+1} ∈ Ω of the next state s_{t+1}, rather than the true state itself. Which observation is perceived depends on the next state s_{t+1} as well as the action a_t taken in state s_t, and it is drawn according to the observation function, i.e., O(o_{t+1}|s_t, a_t, s_{t+1}). Finally, the agent receives an immediate reward r_t(s_t, a_t) ∈ R and repeats the above process.

Based on the current belief b(s_t) and its observation o_{t+1}, the agent updates its belief about the next (unobserved) state s_{t+1}, i.e., b(s_{t+1}), which is stated as follows:

b(s_{t+1}) = \frac{O(o_{t+1}|s_t, a_t, s_{t+1}) \sum_{s_t \in S} p(s_{t+1}|s_t, a_t) b(s_t)}{\sum_{s_{t+1} \in S} O(o_{t+1}|s_t, a_t, s_{t+1}) \sum_{s_t \in S} p(s_{t+1}|s_t, a_t) b(s_t)}    (1)

where O(o_{t+1}|s_t, a_t, s_{t+1}) represents the probability that the agent perceives observation o_{t+1} after action a_t is executed in state s_t and it moves to the next state s_{t+1} with probability p(s_{t+1}|s_t, a_t). Similar to a fully observed MDP, the goal of the POMDP agent is also devoted to finding an optimal policy π*, in order to maximize the cumulative discounted reward \sum_{i=0}^{\infty} \gamma^{i} r_t(s_t, \pi^*(s_t)).

2) Multiagent Markov Decision Process: A single-agent MDP may suffer from limited exploration, in which the agent fails to explore the entire state space and can get stuck in suboptimal solutions. To this end, the multiagent Markov decision process (MMDP) is developed to deal with complex tasks that cannot be accomplished by a single agent. Specifically, MMDP generalizes the classical MDP modeling framework with the notion of multiple agents, each with its own state and action space, interacting in a shared environment to achieve a common goal. In general, a multiagent Markov decision process is defined by a five-tuple (I, S, {A^i}_{i∈I}, P, R), where I ≜ {1, 2, . . . , i, . . . , I} represents the finite set of agents and S ≜ {S^1, . . . , S^i, . . . , S^I} refers to the global state space of all agents, with S^i denoting the state space of agent i. {A^i}_{i∈I} indicates the set of joint action spaces, while A^i is the action space of agent i. P ≜ S × A^1 × · · · × A^I → [0, 1] represents the joint transition probability function of the whole system and R ≜ S × A^1 × · · · × A^I → R denotes the joint reward function.
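As a concrete illustration of the belief update in (1), the Python sketch below renormalizes a belief vector after one action–observation step. The array shapes and numbers are assumptions made for this example, and the observation model is simplified to depend only on (a_t, s_{t+1}), a common special case of the observation function above.

import numpy as np

def belief_update(b, a, o, P, O):
    """One POMDP belief update in the spirit of (1).

    b: belief over states, shape (|S|,)
    a: action index, o: observation index
    P: transition probabilities, P[s, a, s2] = p(s2 | s, a)
    O: observation probabilities, O[a, s2, o] = O(o | a, s2)
       (simplifying assumption: observation depends on (a, s2) only)
    """
    # Predicted distribution over next states: sum_s p(s2 | s, a) b(s)
    predicted = P[:, a, :].T @ b
    # Weight by the likelihood of the received observation, then normalize
    unnormalized = O[a, :, o] * predicted
    return unnormalized / unnormalized.sum()

# Tiny illustrative numbers: 2 states, 2 actions, 2 observations
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.7, 0.3]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, P=P, O=O))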


Intuitively, all MMDP agents try to find their individ- obtained as


ual optimal policies to maximize their own cumulative
P∞ i ∗
expected rewards, i.e., t=0 γt rt (st , πi (st ))∀i. The joint V ∗ (s) = max V π (s)
policy π induced by the set of individual policies {πi∗ }i∈I
∗ π

maps states to joint actions. In this way, an MMDP could Q∗ (s, a) = max Qπ (s, a). (6)
π
be regarded as a single-agent MDP where the agent takes
joint actions. On this basis, the MMDP goal is to find As for the state–action pair (s, a), it is observed that the
a policy that maximizes the expected total reward for optimal action-value function gives the expected return for
all agents by taking their interactions into account. The taking action a in state s and thereafter follows an optimal
MDP objective is to find a good policy that maximizes policy. Therefore, Q∗ (s, a) could also be written in terms of
the future reward function, which could be expressed by optimal state-value function V ∗ (s), which is expressed as
the following cumulative discounted reward:


X Q∗ (s, a) = E [Rt | st = s, at = a]
2 k
Rt = rt+1 + γrt+2 + γ rt+3 + · · · = γ rt+k+1 = E [rt+1 + γRt+1 | st = s, at = a]
k=0
= rt+1 + γ(rt+2 + γrt+3 + γ 2 rt+4 + · · · ) = E [rt+1 + γV ∗ (st+1 ) | st = s, at = a] . (7)
= rt+1 + γRt+1 (2)
Since the action is selected by the policy, an optimal
action at each state is found through the optimal policy
where Rt denotes the cumulative reward at time step t
as well as the optimal state-value function. In this way, the
and γ ∈ [0, 1] represents the discount factor. Here,
optimal state-value function is rewritten as follows:
γ determines the importance of future rewards compared
with the current one. If γ approaches one, it means
that the decision-maker regards the long-term reward as V ∗ (s) = max V π (s) = max Eπ∗ [Rt | st = s, at = a]
π a
important. On the contrary, the decision-maker prefers to = max Eπ∗ [rt+1 (s, a) + γV ∗ (st+1 ) | st = s, at = a]
a
maximize the current reward, while the discount factor γ
approaches zero. = max Q∗ (s, a). (8)
a
In order to find an optimal policy π ∗ : S → A for the
agent to maximize the long-term reward, the state-value Taking the expression of optimal action-value function
function V π : S → R is first defined in the RL that into account, the problem of optimal sate value is simpli-
denotes the expected value of current state s under policy fied to the optimal values of action function, i.e., Q∗ (s, a).
π . The state-value function V for the following policy π Intuitively, (8) indicates that the value of a state under
measures the quality of this policy through the discounted an optimal policy should be equal to the expected reward
MDP, which could be shown as follows: for the best action from that state, which is denoted by
"∞ # the Bellman optimality equation in MDP [31]. With the
π
X k definition of optimal value functions and policies, the rest
V (s) = Eπ [Rt |st = s] = Eπ γ rt+k+1 | st = s (3)
k=0
of the work would be to update the value function and
achieve the optimal policy, which can be accomplished by
where the state value is the expected reward for the RL approaches.
following policy π from state s.
Similarly, the value of taking action a in state s under B. Reinforcement Learning
policy π , i.e., action-value function Qπ (s, a), is defined as
As one of the machine learning paradigms, RL is con-
π cerned with a decision-maker’s action for maximizing
Q (s, a) = E [Rt | st = s, at = a]
"∞ # the notion of cumulative reward Rt [32]. In RL, the
decision-making process is executed by the agent, which
X k
=E γ rt+k+1 | st = s, at = a . (4)
k=0 learns the optimal policy by interacting with the envi-
ronment. Here, the agent first observes the current state
Since the purpose of RL is to find the optimal policy and then performs an action in the environment, which is
that achieves the largest cumulative reward in long-term, based on its policy. After that, the environment feeds its
we define the optimal policy π ∗ as immediate reward back into the agent and updates its new
state at the same time. The typical RL interactions between
environment and agent are shown in Fig. 2. The agent will
π ∗ = argmax V π (s). (5)
π constantly adjust its policy according to the observed infor-
mation, i.e., the received immediate reward and updated
In this situation, the optimal state-value function state. This adjustment process will be repeated until the
V ∗ (s) and the action-value function Q∗ (s, a) could be policy of agent approaches its optimum.


Fig. 2. Interaction of RL between the environment and agent.

Generally, RL methods are divided into on-policy and off-policy categories, which indicate whether the training samples are collected by following the target policy. The policy employed to generate samples by interacting with the environment is referred to as the behavior policy, whereas the policy that agents aim to learn and improve upon, based on the collected samples, is known as the target policy. As for on-policy methods, target and behavior policies are the same. This means that agents learn a policy (i.e., the target policy) and implement it to generate samples for the algorithm training. On the contrary, off-policy methods improve a policy based on samples collected from a different policy (i.e., the behavior policy). In the rest part of this section, we introduce the two classical RL methods, including the Q-learning (off-policy) and SARSA (on-policy) algorithms, which are also the most effective and widely used methods in the real world. Indeed, these two algorithms are categorized as tabular RL methods due to the fact that they use a tabular representation of Q values.

1) Q-Learning Algorithm: Q-learning is regarded as one of the early breakthroughs in RL, which finds an optimal policy in the sense of maximizing the expected value of the total reward over all successive steps [33]. In particular, the action-value function Q(s, a) is updated by the weighted average of the old value and the new information as

Q_{t+1}(s, a) = Q_t(s, a) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q_{t+1}(s', a') - Q_t(s, a) \right]    (9)

where r_{t+1} is the reward obtained when moving from the state s to the state s′ and α represents the learning rate. The core idea behind this update is to find the TD between the estimated action value r_{t+1} + \gamma \max_{a'} Q_{t+1}(s', a') and its old value Q(s, a). In (9), the learning rate α denotes the extent to which the newly acquired information overrides the old one. If the value of the learning rate approaches one, it means that the agent considers more recent information and ignores prior knowledge to explore possibilities. In contrast, a zero value makes the agent learn nothing from the current information, which exclusively exploits the prior knowledge. Usually, the learning rate is selected as a constant value in a deterministic environment; otherwise, it may be dynamically adjusted during the learning process for stochastic problems. The detailed framework of the Q-learning algorithm is presented in Algorithm 1. Before learning starts, Q(s, a) is initialized to an available arbitrary value. Then, in each episode t, the agent selects an action a according to the policy π and observes a reward r. Subsequently, the agent enters a new state s′, which may depend on both the previous state s and the selected action a. After that, the value of the Q-learning table is iteratively updated by (9) until the state s reaches the terminal. In summary, Q-learning is a model-free RL algorithm that learns the value of an action in a particular state. It does not require the mathematical model of the environment and has the capacity to handle problems with stochastic transitions and rewards. However, the standard Q-learning algorithm using a Q-table would only be applied to discrete action and limited state spaces, which is mainly due to the curse of dimensionality. In other words, this method falters with an increasing number of states/actions since the maintenance of this tremendous table is time-consuming and inefficient.

Algorithm 1 Q-Learning Algorithm
Input: Initialize Q(s, a), for all s ∈ S, a ∈ A, arbitrarily except that Q(terminal, ·) = 0
1 for each episode t do
2   From the current state–action pair (s, a), execute action a and receive the immediate reward r and the new state s′.
3   Select an action a′ with maximal Q-value from the new state s′ and then update the table entry for Q(s, a) as follows:
      Q_{t+1}(s, a) = Q_t(s, a) + \alpha [ r_{t+1} + \gamma \max_{a'} Q_{t+1}(s', a') - Q_t(s, a) ]
4   Replace s ← s′.
5 end
Output: π*(s) = \arg\max_a Q*(s, a)
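A minimal tabular implementation of the update in (9) and Algorithm 1 is sketched below in Python. The environment choice and hyperparameter values are assumptions made for illustration; replacing the max-term with the value of the action actually chosen by the same behavior policy would turn this into the on-policy SARSA update introduced next.

import numpy as np
import gymnasium as gym

# Tabular Q-learning on a small discrete environment (illustrative choice).
env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))           # Q-table, arbitrary initialization
alpha, gamma, epsilon, episodes = 0.1, 0.95, 0.1, 2000

for _ in range(episodes):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy behavior policy
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s2, r, terminated, truncated, _ = env.step(a)
        # Off-policy TD target uses the max over next actions, as in (9)
        td_target = r + gamma * np.max(Q[s2]) * (not terminated)
        Q[s, a] += alpha * (td_target - Q[s, a])
        s, done = s2, terminated or truncated

greedy_policy = np.argmax(Q, axis=1)          # pi*(s) = argmax_a Q(s, a)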


2) SARSA Algorithm: Even though Q-learning can find the optimal policy without the need of prior knowledge about the environment, it is an off-policy RL algorithm that obtains the optimal policy only after all Q values have converged. Thus, an alternative on-policy RL method, named SARSA, is introduced in this section to provide an on-policy learning pattern for the agent to approach the optimal policy. Different from the off-policy Q-learning algorithm, SARSA allows the agent to grasp the optimal policy and use the same one to act. As for the Q-learning algorithm, the target policy is updated according to the maximal reward of available actions rather than the behavior policy used for choosing actions, i.e., off-policy learning. On the contrary, the SARSA algorithm uses the same policy to update the Q values and select actions, i.e., on-policy learning. The details of the SARSA algorithm are provided in Algorithm 2, which illustrates that the SARSA agent interacts with the environment and updates the policy based on actions taken. Hence, it is regarded as an on-policy RL algorithm. In particular, the action-value function Q(s, a) is updated by the Q value of the next state s′ and the current policy's action a′ as

Q_{t+1}(s, a) = Q_t(s, a) + \alpha \left[ r_{t+1} + \gamma Q_{t+1}(s', a') - Q_t(s, a) \right].    (10)

Algorithm 2 SARSA Algorithm
Input: Initialize Q(s, a), for all s ∈ S, a ∈ A, arbitrarily except that Q(terminal, ·) = 0
1 for each episode t do
2   From the current state–action pair (s, a), execute action a and receive the immediate reward r and the new state s′.
3   Select an action a′ from the new state s′ using the same policy and then update the table entry for Q(s, a) as follows:
      Q_{t+1}(s, a) = Q_t(s, a) + \alpha [ r_{t+1} + \gamma Q_{t+1}(s', a') - Q_t(s, a) ]
4   Replace s ← s′; a ← a′.
5 end
Output: π*(s) = \arg\max_a Q*(s, a)

In conclusion, the state and action spaces in tabular RL methods are small enough to allow the Q values to be represented as a table. This is feasible when the number of states and actions is small. However, state and action spaces in many real-world applications are excessively large, which makes it impossible to represent the Q values in a table. In such cases, function approximation methods, such as DNNs, are used to approximate the Q values or the policy, which is introduced in the following.

C. Deep Learning

As mentioned before, RL is not suitable for handling complicated problems with large-scale environments and high uncertainty, which limits its application in SG operations. To this end, DL is introduced to assist RL in dealing with these challenges. To be specific, DL is a subset of machine learning based on DNNs, which attempts to simulate the human brain behavior and extract the important features from massive raw data. The adjective "deep" in DL refers to the use of multiple layers in a neural network, which enhances the perception capacity of DNN.

Fig. 3. Structure diagram of typical DFF neural network.

Fig. 3 shows the DFF neural network, which is considered the simplest type of DNN. It is observed that a DFF network contains multiple layers of interconnected nodes, i.e., artificial neurons, which are analogous to biological neurons in the brain. Each connection between neurons transmits a signal to other ones, and the receiving neuron processes this signal. Then, the receiving neuron activates downstream connected neurons. The signal within a connection is usually represented by a real number between 0 and 1, and the output of each neuron is computed by the weighted summation of its inputs as well as a nonlinear transformation through the activation function. This computation process from the neural network is named the forward propagation, which achieves the data processing during the computation from the input to the output. Typically, neurons are aggregated into layers, and different layers may perform different transformations on their inputs. It should be noted that signals travel from the first layer, named the input layer, to the last one, i.e., the output layer, possibly after traversing deeply hidden layers multiple times.

The DFF neural network shown in Fig. 3 is the simplest among DNNs. There are two classical DNN models, including the CNN [34] and RNN [35]. CNN is distinguished from other DNNs by its superior computer vision performance, and it comprises three main layers shown in Fig. 4, i.e., convolutional, pooling, and fully connected. The name CNN stems from the convolution operation that occurs in the convolutional layer, which converts the raw input data to numerical values and allows CNN to interpret and extract relevant features. Similar to the convolutional layer, the pooling layer derives its name from the pooling operation, and it conducts dimension reduction to decrease complexity. The pooling layers can improve the efficiency and reduce the risk of overfitting. Furthermore, the fully connected layer performs the task of classification based on the features learned through the previous convolutional and pooling layers, which maps the extracted features back to the final output.
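The forward propagation just described amounts to alternating weighted sums and nonlinear activations. The NumPy sketch below illustrates this for a small DFF network; the layer sizes and the sigmoid activation are illustrative assumptions rather than choices made in the article.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny DFF network: 4 inputs -> 8 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)

def forward(x):
    """Forward propagation: weighted summations followed by activations."""
    h = sigmoid(W1 @ x + b1)      # hidden layer
    y = sigmoid(W2 @ h + b2)      # output layer
    return y

print(forward(np.array([0.2, 0.4, 0.1, 0.9])))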


Fig. 4. Structure diagram of CNN.

Unlike CNN, which assumes that inputs and outputs are independent, RNN extracts the information from prior inputs to determine the current input and output, as shown in Fig. 5. Here, the RNN output depends not only on its immediate inputs but also on the neural state of previous layers within the sequence. In this way, RNN could utilize its internal state to process variable-length sequences of inputs, which makes it applicable to tasks such as speech recognition.

Fig. 5. Structure diagram of typical RNN.

As an RNN architecture, LSTM is employed extensively for DL. LSTM, shown in Fig. 6, stands out for its remarkable capability to capture and retain long-term dependencies by integrating memory cells and diverse gating mechanisms. The memory cells in LSTM allow the network to store and access information over long periods of time. Moreover, gating mechanisms, including the input gate, forget gate, and output gate, control the information flow and enable the network to selectively retain or forget the information based on its relevance.

Fig. 6. Illustration diagram of LSTM.

After mapping the DNN results to the input data, another challenge is to optimize the results. To this end, backpropagation is proposed to adjust the network parameters in the reverse direction, using the network output deviation from its actual values [36]. Here, the backpropagation uses algorithms such as gradient descent to calculate prediction errors and then refine the activation function weights and biases for training the DNN by moving backward through the layers. Combined forward and backpropagations allow DNN to make predictions and adjust its parameters in accordance with errors. As the training continues, DNN outputs will gradually become more accurate. To conclude, the excellent perception capacity of DNN provides RL with an opportunity to address the existing limitations. First, the strong feature extraction capacity of DNN could help RL avoid the manual feature design process, which is usually difficult to be represented by hand. Second, the outstanding prediction capacity of DNN rescues RL from the curse of dimensionality, which allows it to cope with scenarios with high-dimensional and continuous state/action spaces. Therefore, the combination of RL and DL brings about the formulation of DRL, and it will be discussed in the next section.

D. Deep Reinforcement Learning

As mentioned before, the states of MDP are high-dimensional and difficult to design, which would limit the RL applications in practical decision-making problems. To this end, DRL is introduced to overcome this drawback, which incorporates the DL technique to address the dimensional curse of RL. The DRL value functions and policy are usually parameterized by the DNN variables, rather than the Q-table in RL. Using the excellent DL ability in feature extraction, DRL could complete complex tasks without any prior knowledge. A detailed taxonomy of DRL algorithms is shown in Fig. 7. According to the policy optimization, various DRL approaches could be divided into two categories of value- and policy-based algorithms. The value-based DRL methods usually imply the optimization over the action-value function and further derive the optimal policy. Consequently, the value-based algorithm possesses a relatively higher sampling efficiency and smaller estimation variance of the value function and will not fall easily into a local optimum.

However, value-based DRL methods cannot deal with continuous action space problems, which would limit their use in SG. As for the policy-based DRL approaches, they directly optimize the policy and iteratively update the policy to maximize the accumulative return.
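To make the two categories concrete, the short PyTorch sketch below contrasts how a value-based method parameterizes Q(s, a) with a DNN against how a policy-based method parameterizes a stochastic policy π_θ(a|s) directly. The layer sizes and dimensions are illustrative assumptions, not code from the article.

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 8, 4   # illustrative dimensions

# Value-based parameterization: state -> one Q value per discrete action.
q_network = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)

# Policy-based parameterization: state -> probabilities over actions.
policy_network = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS), nn.Softmax(dim=-1),
)

s = torch.randn(1, STATE_DIM)
greedy_action = q_network(s).argmax(dim=-1)             # value-based: act greedily on Q
sampled_action = torch.distributions.Categorical(
    probs=policy_network(s)).sample()                   # policy-based: sample from pi_theta(a|s)
print(greedy_action.item(), sampled_action.item())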


Fig. 7. Taxonomy of DRL algorithms (boxes with thick borders denote different categories, while others indicate specific algorithms).

In this way, the policy-based algorithm offers a simple policy parameterization and fast convergence speed, which is suitable for problems with continuous or high-dimensional action spaces. Nevertheless, the policy-based methods also suffer from sampling inefficiency and overestimation. However, a combination of these two categories has conveniently given rise to the AC framework. In the rest of this section, we discuss several typical value- and policy-based DRL algorithms.

E. Value-Based DRL Algorithm

The RL goal is to improve its policy to acquire better rewards. As for the value-based algorithm, it tends to optimize the action-value function Q(s, a) for obtaining preferences for the action choice. Usually, value-based algorithms, such as Q-learning and SARSA, need to alternate between the value function estimation under the current policy and the policy improvement with the estimated value function, as shown in (8). However, it is not trivial to predict the accurate value of a complicated action-value function, especially when state and action spaces are continuous. The conventional tabular methods, such as Q-learning, cannot cope with these complex cases because of the limitation of computational resources. Also, state representations in practice would need to be manually designed with aligned data structures, which are also difficult to specify. To this end, the DL technique is introduced to assist RL methods to estimate the action-value function, which is the core concept of value-based DRL algorithms. Next, typical value-based DRL algorithms, including DQN and its variants, are depicted with detailed theories and explanations.

1) Deep Q-Network: As one of the breakthroughs in DRL, the DQN structure shown in Fig. 8 implements DNN as the function approximator for estimating Q*(s, a), instead of the Q-table. However, the value iteration is proved to be unstable and might even diverge when a nonlinear function approximator, e.g., a neural network, is used to represent the action-value function [37]. This instability is attributed to the fact that small updates of Q(s, a) might significantly change the agent policy. Therefore, the data distribution and the correlations between Q(s, a) and the target value r_{t+1} + \gamma \max_{a'} Q_{t+1}(s', a') are quite diverse. Two key ideas, which include experience replay and the fixed target Q-network, are adopted to address the instability issue as described in the following.

1) Experience Replay: In each time epoch t, DQN stores the experience of the agent (s_t, a_t, r_t, s_{t+1}) into the replay buffer and then draws a mini-batch of samples from this buffer randomly to train the DNN. Then, the Q values estimated by the trained DNN will be applied to generate new experiences, which will be appended into the replay buffer in an iterative way. The experience replay mechanism has several advantages over fitted Q-learning. First, both old and new experiences are used in the experience replay mechanism to learn the Q-function, which provides higher data efficiency. Second, the experience replay avoids the situation where samples used for DNN training are determined by previous parameters, which smooths out changes in the data distribution and removes correlations in the observation sequence.

2) Fixed Target Q-Network: To further improve the neural network stability, a separate target network is developed to generate Q-learning targets, instead of the desired Q-network. At times, the target network will be synchronized with the primary Q-network by copying directly (hard update) or by an exponentially decaying average (soft update). In this way, the target network is updated regularly but at a rate that is slower than the primary Q-network. This could significantly reduce the divergence and the correlation between the target and estimated Q values.
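A compact Python sketch of these two stabilization ideas is given below; the buffer capacity, batch size, and soft-update coefficient are illustrative assumptions.

import random
from collections import deque
import torch

class ReplayBuffer:
    """Fixed-capacity experience replay: store transitions, sample random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        # s and s_next are assumed to be torch tensors in this sketch
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (torch.stack(s), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
                torch.stack(s_next), torch.tensor(done, dtype=torch.float32))

def hard_update(target_net, primary_net):
    """Copy the primary network into the target network directly."""
    target_net.load_state_dict(primary_net.state_dict())

def soft_update(target_net, primary_net, tau=0.005):
    """Exponentially decaying average of the primary parameters."""
    for tp, pp in zip(target_net.parameters(), primary_net.parameters()):
        tp.data.copy_(tau * pp.data + (1.0 - tau) * tp.data)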


Fig. 8. Structure diagram of DRL.

The DQN algorithm with experience replay and fixed target Q-network is presented in Algorithm 3. Before learning starts, replay buffer D, primary network Q, and target network Q̂ are initialized with random parameters. Then, at each episode t, the agent selects an action a_t with the ϵ-greedy policy and observes reward r_t, to enter a new state s_{t+1}. After that, the transition (s_t, a_t, r_t, s_{t+1}) is stored in the replay buffer for further sampling. Stochastic gradient descent with respect to the network parameter θ is performed to optimize the DNN loss function, which is defined in (11) as the deviation between the target and primary networks. Finally, the target network parameters are updated by the primary network every certain number of steps until the epoch is terminated

L = \left[ r_j + \gamma \max_{a_{j+1}} \hat{Q}(s_{j+1}, a_{j+1}; \theta') - Q(s_j, a_j; \theta) \right]^2.    (11)

Algorithm 3 DQN Algorithm
Input: Initialize replay buffer D, the primary Q-network Q with stochastic weights θ, and the target Q-network Q̂ with stochastic weights θ′.
1 for each episode t do
2   With probability ϵ select a random action a_t, otherwise select a_t = \arg\max_a Q*(s, a; θ).
3   Execute action a_t and observe the immediate reward r_t and next state s_{t+1}.
4   Store transition (s_t, a_t, r_t, s_{t+1}) in the experience replay buffer D.
5   Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D.
6   Perform a gradient descent step with respect to the network parameter θ to minimize the loss: [ r_j + \gamma \max_{a_{j+1}} \hat{Q}(s_{j+1}, a_{j+1}; \theta') - Q(s_j, a_j; \theta) ]^2.
7   Synchronize Q̂ = Q every certain interval of steps.
8 end

In conclusion, DQN absorbs the advantages of both DL and RL techniques, which are critical for SG applications [38], [39], [40].

2) Double DQN: DQN, which has been implemented successfully, has struggled with large overestimations of action values, especially in noisy environments [41]. These overestimations stem from a positive deviation since Q-learning always selects the maximum action value as the approximation for the maximal expected reward, which is denoted by the Bellman equation in (8). Therefore, the next Q values are usually overestimated since samples are used to select the optimal action, i.e., with the largest expected reward, and the same samples are also utilized for evaluating the action value. To this end, a variant algorithm called DDQN is proposed to address the overestimation problem of DQN [42]. The central idea of DDQN is to decouple correlations in the action selection and value evaluation by using two different networks at these two stages. In particular, the target Q-network in the DQN architecture provides a natural candidate for the extra network. More specifically, the action selection is still executed by the primary network with parameters θ. In other words, DDQN still selects the action with the maximal estimated action value according to the current state, as denoted by θ. However, the value evaluation of the current policy is fairly performed by the extra network, i.e., the target network in DDQN with parameter θ′. Therefore, the DDQN loss function could be expressed as

\left[ r_j + \gamma \hat{Q}\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta'\big) - Q(s, a; \theta) \right]^2    (12)

where Q̂(s′, a′; θ′) and Q(s, a; θ) represent the target network with parameter θ′ for value evaluation and the primary network with parameter θ for action selection, respectively. In this way, the estimated value of the future expected reward is evaluated using a different policy, which could manage the overestimation issue and outperform the original DQN algorithm [43].

3) Dueling DQN: For certain states, different actions are not relevant to the expected reward, and there is no need to learn the effect of each action for such states. For instance, the values of the different actions are very similar in various states, and thus, the action taken would be less important. However, the conventional DQN could accurately estimate the Q value of such a state only when all data are collected for each discrete action. This could result in a slower learning speed as the algorithm is not concerned with the actions that are not taken. To address this issue, a network architecture called the dueling DQN is proposed, which explicitly separates the representation of the action-value function Q(s, a) into the state function V(s) and the state-dependent action advantages A(a) [44].
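The decoupled target in (12) differs from the DQN target in (11) only in how the next action is chosen. The PyTorch sketch below contrasts the two targets and the corresponding TD loss; the tensor shapes and the mse_loss choice are assumptions made for illustration.

import torch

def dqn_target(r, s_next, done, target_net, gamma=0.99):
    # (11): both action selection and value evaluation use the target network.
    next_q = target_net(s_next).max(dim=1).values
    return r + gamma * (1.0 - done) * next_q

def ddqn_target(r, s_next, done, primary_net, target_net, gamma=0.99):
    # (12): the primary network selects the action, the target network evaluates it.
    best_actions = primary_net(s_next).argmax(dim=1, keepdim=True)
    next_q = target_net(s_next).gather(1, best_actions).squeeze(1)
    return r + gamma * (1.0 - done) * next_q

def td_loss(primary_net, s, a, target):
    # Training step for either variant: minimize (target - Q(s, a; theta))^2
    q_sa = primary_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return torch.nn.functional.mse_loss(q_sa, target.detach())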


Accordingly, the Q value function of dueling DQN would be decoupled into state value and action advantage parts, where

Q(s, a) = V(s) + A(a).    (13)

On the one hand, the value part, i.e., the state-value function V(s), concentrates on estimating the importance of the current state s. On the other hand, the action advantage is denoted by the state-dependent advantage function A(a), which estimates the importance of choosing the action a compared with other actions. Intuitively, the dueling architecture could draw lessons from valuable states, without learning the effect of each state action.

However, it might not be suitable to directly separate the Q value function as shown in (13) since it might be unidentifiable in mathematics, that is, there might exist different combinations of V(s) and A(a) that all satisfy (13) for a given Q(s, a). To deal with the identifiability issue, the advantage function estimator is refined to have, by force, a zero advantage at the selected action

Q(s, a; \alpha, \beta) = V(s; \beta) + \left( A(s, a; \alpha) - \frac{1}{|A|} \sum_{a'} A(s, a'; \alpha) \right)    (14)

where α and β are parameters of the two estimators V(s; β) and A(s, a′; α), respectively. It should be noted that the subtraction in (14) helps with identifiability, which does not change the relative rank of the A values and preserves the original policy based on Q values from (13). In addition, the stability of policy optimization is enhanced since the advantages in (14) would only need to adapt to the average value, instead of pursuing an optimal solution. The training of dueling DQN requires more network layers compared with the standard DQN, which achieves a better policy evaluation in the presence of large action spaces.

F. Policy-Based DRL Algorithm

Different from the value-based algorithm, policy-based algorithms depend on the use of gradient descent for optimizing the parameterized policies with regard to the expected reward, instead of optimizing the action-value function. The abstract policies in DRL are called parameterized policies as they are represented by parametric neural networks. In particular, policy-based approaches would directly perform the learning of the parameterized policy of the agent in DRL without learning or estimating the action-value function. Accordingly, policy-based DRL algorithms do not suffer from specific concerns, which have been encountered with traditional RL methods. These concerns mainly consist of higher complexities that arise from continuous states and actions, the uncertainty stemming from the stochastic environment, and the inaccuracy of the estimated action value.

Another benefit of the policy-based algorithm is that the policy gradient methods could naturally model stochastic policies, while the value-based algorithms need to explicitly represent exploration, such as the ϵ-greedy scheme, to model the stochastic policies. Furthermore, gradient information is utilized to guide the optimization in policy-based algorithms, which contributes to the network training convergence. In general, the policy-based algorithms could be divided into stochastic and deterministic policies, according to their representation. Therefore, several popular policy-based algorithms are introduced here for both policies.

1) Stochastic Policy: As mentioned before, the basic idea of the policy-based algorithm is to represent the policy by a parametric neural network π_θ(a|s), where the agent randomly chooses an action a at state s according to parameter θ. Then, policy-based algorithms typically optimize the policy π_θ with respect to the goal J(π_θ), through sampling the policies and adjusting the policy parameters θ in the direction of more cumulative reward, which could be expressed as follows:

J(\pi_\theta) = E_{\tau \sim \pi_\theta}[R(\tau)] = E_{\tau \sim \pi_\theta}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \right]    (15)

where policy gradient-based optimization uses an estimator for the gradients on the expected return collected from samples to improve the policy with gradient ascent. Here, trajectory τ is a sequence of state–action pairs sampled by the current policy π_θ, which records how the agent interacts with the environment. Thus, the gradient regarding the policy parameter is defined as the policy gradient, which could be calculated as follows: ∆θ = α∇_θ J(π_θ). On this basis, the policy gradient theorem is proposed to denote the optimal ascent direction of the expected reward [45], as illustrated in the following equation:

\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau) R(\tau) \right] = E_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right].    (16)

In this way, policy-based algorithms are updated along the direction of ascent gradients, which is denoted as follows:

\theta = \theta + \Delta\theta = \theta + \alpha \nabla_\theta J(\pi_\theta).    (17)

Based on policy gradient, several typical policy-based DRL algorithms, including the TRPO and the PPO, are
1066 P ROCEEDINGS OF THE IEEE | Vol. 111, No. 9, September 2023


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

actor, it follows the principle of policy gradient method to


update its policy, which is stated as:

" T #
X
J(πθ ) = Eτ ∼πθ log πθ (at |st )δt
t=0
θ = θ + αθ ∇θ J(πθ ). (19)

The pseudocode of the complete network is summarized


in Algorithm 4. Before the learning starts, actor network
Fig. 9. Structure diagram of AC algorithm.
parameters θ , critic network parameters ψ , and hyper-
parameters, including learning rates and discount factor,
are initialized as random parameters. After that, the agent
selects an action at according to the current policy πθ , i.e.,
proposed. In recent years, it has witnessed the successful at ∼ πθ (·|s), shifts to the next state st+1 , and receives the
applications of these algorithms in both academy and immediate reward Rt . For each step, the TD error δt is first
industry. In the rest part of this section, we will introduce calculated for further actor network selections and critic
these policy-based algorithms as well as their variants in network evaluations. Finally, the policy gradient theorem
detail. is applied to update the parameters of both actor and critic
a) Actor–critic: It could be observed from (16) that networks, as shown in (18) and (19), respectively.
a straightforward gradient ascent is performed on the In conclusion, the AC algorithm is situated in the
policy parameters θ , in order to gradually improve the intersection of policy- and value-based methods, which
performance of policy πθ . Despite this concision, the con- is regarded as the breakthrough in DRL and derives a
ventional policy gradient method is considered to suffer series of state-of-the-art DRL algorithms, such as A2C, A3C
from a large variance while predicting the gradient [46]. [48], TRPO, PPO, and SAC. Furthermore, the architecture
Indeed, the complexity and randomness of reward Rt may inspires the deterministic policy methods, such as the
grow exponentially with the trajectory length, which is DDPG algorithm, which will be discussed in this section
difficult to handle. To this end, the AC architecture is pro- later.
posed to alleviate the large variance problem, which aims b) Trust region policy optimization: Although the AC
to take advantage of all the merits from both value- and method achieves a combination of policy- and value-based
policy-based methods while overcoming their drawbacks methods, it still suffers from the learning step-size pitfall
[47]. More specifically, the principal concept of architec-
ture is to split the model into two components, i.e., the
actor decides which action should be taken while the critic Algorithm 4 AC Algorithm
feeds the quality of its action back to the actor as well as Input: Initialize actor network parameters θ0 and
the suggestion of corresponding adjustments. critic network parameters ψ0 ; Initialize
In Fig. 9, the actor takes the state as input and outputs learning rates of actor and critic networks,
the optimal action, which essentially controls the behavior respectively.
of agent through learning the optimal policy. In compari- 1 for each episode t do

son, the critic evaluates the selected action by computing 2 Actor network selection:
hP i
T
the value function V π (s). Generally, the training of these J(πθ ) = Eτ ∼πθ t=0 log πθ (at |st )δt
two components is performed separately and gradient 3 Critic network evaluation:
ascent is adopted to update their parameters. Here, the hP
T 2
i
critic Vψπθ
is optimized to minimize the square of TD error
JV πθ (ψ) = Eτ ∼πθ t=0 δ t
ψ

δt , which is similar to the loss function of DQN as 4 Take action at and observe next state st+1 and
reward Rt according to current policy
π π πθ (·|s).
δt = Rt + γVψθ (st+1 ) − Vψθ (st )
" T # 5 Collect sample (at , st , Rt , st+1 ) into the
X 2 trajectory.
JV πθ (ψ) = Eτ ∼πθ δt
ψ
t=0
6 Calculate the TD error as follows:
ψ = ψ + αψ ∇ψ JV πθ (ψ) (18) δt = Rt + γVψπθ (st+1 ) − Vψπθ (st ).
ψ
7 Replace ψ = ψ + αψ ∇ψ JV πθ (ψ).
ψ

where ψ represents the parameters of the critic and αψ


8 Replace θ = θ + αθ ∇θ J(πθ ).
9 end
denotes the learning rate. It should be mentioned that the
accumulative return in (18) is substituted by the TD error,
Output: Parameters pair of the actor and critic
which further reduces the variance of gradient. As for the
(θ, ψ)

Vol. 111, No. 9, September 2023 | P ROCEEDINGS OF THE IEEE 1067


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

just like the standard gradient descent algorithm. Indeed, where C is a constant independent to πθ′ and C ·
the gradient ∇θ J(πθ ) only provides the local first-order DKLmax
(πθ ∥πθ′ ) represents the maximum Kullback-Leibler
information at current parameters θ , which completely (KL) divergence, which is a statistical distance measuring
ignores the curvature of the reward landscape. However, the difference between πθ′ and πθ . Therefore, it is reason-
the suitable adjustment of learning step is very impor- able to optimize Lπθ (πθ′ ) if DKL
max
(πθ ∥πθ′ ) is small, which is
tant for policy gradient methods. On the one hand, the actually the principle of TRPO. On this basis, the original
algorithm might suffer a performance collapse if the learn- problem is converted into an optimization problem, which
ing step αθ is large. On the other hand, if the step size is is stated as
set small, the learning would be conservative to converge.
What is more, the gradient ∇θ J(πθ ) in policy gradient max Lπθ πθ′


methods requires an estimation from samples collected by πθ

πθ ∥πθ′ ≤ ξ
 max 
the current policy πθ , which in turn affects the quality of s.t. E DKL (23)
the collected samples and makes the learning performance
more sensitive to the step-size selection.
where ξ is a predefined constant denoting the maximum
Another shortcoming of policy gradient method in the
allowable difference between πθ′ and πθ . Afterward, the
standard AC model is that the update occurs in the param-
first-order approximation for the objective function and the
eter space rather than the policy space. This makes it more
second-order approximation for constraints are adopted to
difficult to tune in the step size αθ since the same step size
solve this optimization problem. In fact, the gradient of
may correspond to totally different updated magnitudes
Lπθ (πθ′ ) at the current policy could be expressed by (24),
in the policy space, which is dependent on the current
which is similar to the AC
policy πθ . To this end, an algorithm, called the TRPO,
is developed, which is based on the concept of trust region
g = ∇θ Lπθ πθ′

for adjusting the step size more precisely in the policy gra- "∞ #
dient [49]. It should be noted that the goal of policy-based X
= Eτ ∼πθ ∇θ log πθ (at | st ) γ A (st , at ) . (24)
t πθ
method is to find an updated policy πθ′ that improves the
t=0
current policy πθ . Fortunately, the improvement from the
current policy to the updated one could be measured by the
Accordingly, the TRPO algorithm solves the approxi-
advantage function A πθ (s, a) [50], which was introduced
mated optimization problem at the current policy as
in the dueling DQN. It is illustrated that (20) provides
an insightful connection between the performances of πθ′
θ′ = arg max g⊤ θ′ − θ

and πθ θ′
⊤
s.t. θ′ − θ H θ′ − θ ≤ ξ

(25)
π
A πθ (s, a) = Qπθ (s, a) − Vψ θ (s)
"∞ #

X t π where H represents the Hessian matrix of E[DKL max
(πθ ∥πθ′ )].
J(πθ ) = J(πθ ) + Eτ ∼πθ′ γ A (at , st )
θ
(20)
t=0
It is illustrated by (25) that the gradients are calculated
in the first order and the constraint is depicted in the
where τ denotes the state–action trajectory sampled second order. This approximation problem can be analyt-
by updated policy πθ′ . Obviously, learning the opti- ically solved by the methods of Lagrangian duality [52],
mal policy is equivalent to optimizing the bonus term resulting in the following analytic form solution:
Eτ ∼πθ′ [ ∞
t=0 γ A
t πθ
P
(at , st )]. The above expectation is r
based on the updated policy πθ′ that is difficult to optimize ′
θ =θ+

H −1 g. (26)
directly. Thus, TRPO optimizes an approximation of this g ⊤ H −1 g
expectation, denoted by Lπθ (πθ′ ), which is stated as
In summary, TRPO trains the stochastic policy in an
"∞ # on-policy way, where it explores by sampling according
X πθ′ (at | st ) πθ
Lπθ πθ′ = Eτ ∼πθ γt A (st , at )

(21) to the newest version of its stochastic policy. During the
πθ (at | st )
t=0 training procedure, the policy usually becomes less uncer-
tain, progressively, since the update rule encourages it to
where πθ′ is directly approximated by πθ , which seems to exploit rewards that it has already obtained. Empirically,
be coarse, but its approximation error (22) is proved to the TRPO method performs well on previous problems that
be theoretically bounded and thus ensures its effectiveness require precise problem-specific hyperparameter tuning,
[51]. The bounded approximation error is presented as which are solvable with a set of reasonable parame-
ters. However, one challenge with the implementation of
J(πθ′ ) − J(πθ ) − Lπθ πθ′ max
πθ ∥πθ′ TRPO lies in calculating the estimation of KL divergence
 
≤ C · DKL (22)
between parameters, as it increases the complexity and the

1068 P ROCEEDINGS OF THE IEEE | Vol. 111, No. 9, September 2023


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

specialized clipping in the objective function to remove


incentives for the new policy to diverge from the old
one. To achieve this goal, the ratio of the new to the old
policy is represented as ℓt (θ′ ) = (πθ′ (at | st ))/(πθ (at | st )).
Then, the clipping mechanism is introduced as a regulator
to prevent the dramatic policy update from affecting the
learning performance of agents. In particular, PPO tends to
clip ℓt (θ′ ) within [1 − ∆, 1 + ∆] to ensure that the updated
policy πθ′ is adjacent to πθ . In other words, if ℓt (θ′ ) falls out-
Fig. 10. Diagram of clipping mechanism in the PPO algorithm. side the interval, the advantage function will be clipped,
as shown in Fig. 10. Finally, the minimum of clipped and
unclipped objectives is selected as the learning objective.
computation time of TRPO and thus limits it applicability Therefore, PPO would maximize the lower bound of the
in practical terms. To this end, some improvements and target objective while maintaining a controllable update
simplifications are developed to tackle this specific trouble, from πθ to πθ′
which will be discussed next.
c) Proximal policy optimization: TRPO is relatively
LPPO πθ′ = Eπθ min ℓt θ′ A πθ (st , at ),
  
complex and suffers from a computational burden when
clip(ℓt θ′ , 1−∆, 1+∆)A πθ (st , at ) .
 
calculating the conjugate gradients for constrained opti-
mization. The complexity of computing the second-order (28)
Hessian matrix H −1 reaches Ω(N 3 ), which is quite expen-
sive to undertake in the real world. Here, N denotes the
number of parameters. Therefore, another policy gradient
approach is developed in [53], using the PPO, to enforce a
simpler and more efficient solution for calculating the sim- Algorithm 5 PPO Algorithm With Penalty
ilarity between updated and current policies. Unlike TRPO Input: Initialize policy parameters θ and value
that tends to optimize (23) with a hard constraint, PPO function parameters ψ; Initialize reward
improves the objective function of TRPO by converting discount factor γ and KL penalty
the constraint into a penalty term. Indeed, the Lagrangian coefficient λ; Initialize adaptive parameters
duality theorem is applied to adjoin a constraint to the a = 1.5 and b = 2, respectively;
objective function through a multiplier. The dual problem 1 for each episode t do
after adjoining the constraint is mathematically equivalent 2 Take action at and observe next state st+1 and
to the primal formulation under a constraint qualification reward Rt according to current policy
condition [54]. Therefore, the objective function in PPO πθ (·|s).
is rewritten in (27) after adjoining the constraint to the 3 Collect sample (at , st , Rt , st+1 ) into the
primal objective trajectory.
4 Estimate the advantage function as follows:
max Lπθ πθ′ − λ · E DKL πθ ∥πθ′
  
(27) A πθ (s, a) = Qπθ (s, a) − Vψπθ (s).
′πθ 5 for m ∈ {1, 2, . . . , M } do
6 JPPOh (πθ ) =
π ′ (a |s )
i
where λ is the Lagrange multiplier associated with the E πθθ (att |stt ) At − λ · DKL (πθ ∥πθ′ )
inequality constraint. For each ξ in (23), there exists a cor-
7 Update θ by a gradient method w.r.t
responding constant λ, which provides (23) and (27) with
JPPO (πθ ).
the same optimal solutions. Thus, it is significant to adjust 8 end
the value of Lagrange multiplier adaptively. Here, the KL 9 for b ∈ {1, 2, . . . , B} do
divergence is particularly checked for adjusting λ. This 10 LBL (ψ) =
method, which is referred to as the PPO-penalty algorithm P∞ P 2
− t=1 τ −t γ
τ −t
Rτ − Vψ (st )
[55], is illustrated in Algorithm 5. The PPO-penalty
approximately solves a KL-constrained update, like TRPO,
11 end
by penalizing the KL divergence in the objective function
12 Compute d = E [DKL (πθ ∥πθ′ )].
and then adjusting the multiplier λ over the course of
13 if d < dtarget /a then
training so that it is scaled appropriately.
14 Update λ with λ ← λ/b;
Another variant method of PPO, i.e., PPO-clip, would
15 else if d > dtarget × a then
clip the objective value intuitively for the policy gradient,
16 Update λ with λ ← λ × b;
which brings about a more conservative update [56].
17 end
Here, the PPO-clip does not contain the KL-divergence
18 end
term in the objective or constraint. Instead, it depends on
Output: Parameters of the (θ, ψ) pair.

Vol. 111, No. 9, September 2023 | P ROCEEDINGS OF THE IEEE 1069


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

Hence, PPO is motivated to take the largest possible The overall pseudocode of DDPG presented in Algorithm
advantage of current data, without stepping out so far 6 initializes the replay buffer R and parameters of four net-
that could accidentally cause any performance collapse. works. Then, it selects the action according to the current
Unlike TRPO which tends to solve this problem with a policy and exploration noise as at = µ(st |θµ )+Nt , in order
complicated second-order method, PPO is a member of to enhance the DDPG exploration capacity. After execution,
the first-order approaches, which adopts clipping tricks to the action at and receive reward rt are transferred to the
maintain the proximity between old and new policies. The next state st+1 to store the transition (st , at , rt , st+1 ) in
PPO algorithm performs comparably or even better than buffer R. On this basis, DDPG simultaneously maintains
the other state-of-the-art methods, is significantly simpler two models, i.e., actor and critic, in order to manage
to implement, and has thus become the default DRL in the problems with continuous action spaces. As for the
many popular platforms due to its ease of use and good critic network, it aims to approximate the output of value
performance [57]. function, which uses the same structure as DQN, i.e.,
2) Deterministic Policy: The content described above a primary network and a target network. Then, the critic
belongs to the stochastic policy gradient, which aims to network updates its state by minimizing the loss as
optimize the stochastic policy π(a|s) and represent the
′ ′
action as a probabilistic distribution according to the yi = ri + γQ′ (si+1 , µ′ (si+1 |θµ )|θQ )
current state, where a ∼ π(·|s). On the contrary, the N 2
1 X
deterministic policy considers the action as a deterministic L= yi − Q(si , ai |θQ ) (30)
N
output of policy, i.e., a = µ(s), instead of sampling the i

probability from the given distribution. In addition, it is


derived that the deterministic policy (29) follows the policy where yi represents the estimated Q value of target net-

gradient theorem despite the fact that they have different work with parameter θQ and L denotes the loss error
explicit expressions [58]: between primary and target networks. It is found that the
critic network shares the same double network architecture
with DQN, which improves the accuracy of value estima-
∇θ J(µθ ) = E ∇θ µθ (s)∇a Qµ (s, a)|a=µθ (s)
 
(29)
tion and maintains the stability during training.
As for the actor network, DDPG tends to learn a deter-
where µ(s) denotes the deterministic policy rather than ministic policy µ(s) and select the action that maximizes
π(s), in order to eliminate the ambiguity in the distinc- the value function Q(s, a). Therefore, the gradient ascent
tion with stochastic policy π(a|s). It is illustrated in (29) method is performed to optimize the policy as
that the derivation of reward with respect to the action
is integrated, while the integration of policy is absent.
h i
∇θµ J = E ∇θµ Q(s, a|θQ )
This makes the deterministic policy easier to train in h i
high-dimensional action spaces when compared with the = E ∇a Q(s, a|θQ )∇θµ µ(s|θµ )
stochastic one. Nowadays, some methods that combine N
DQN with the deterministic policy are quite popular, which 1 X
≈ ∇a Q(s, a|θQ )∇θµ µ(s|θµ ) (31)
take advantage of both methods and perform well in most N
i
environments, especially with continuous action spaces.
Next, we will discuss several typical deterministic policy where the chain rule of mathematics is applied to calculate
gradient algorithms, including DDPG and its extensions, the expected return from the start distribution J . Unlike
with detailed theories and explanations. the target network which is regularly updated by the
a) Deep deterministic policy gradient: The DDPG primary network in DQN, DDPG develops a novel updating
approach is viewed as the combination of DNN and deter- mechanism called the soft update, which changes network
ministic policy gradient algorithm. DDPG is devoted to parameters by exponential smoothing rather than directly
addressing the problem with continuous action spaces that copying the parameters as follows:
DQN cannot tackle easily. Hence, DDPG is regarded as an
′ Q′
extension of DQN in the continuous action spaces, with θQ ← ρθQ + (1 − ρ)θθ
the help of deterministic policy gradient. More specifically, ′ µ′

DDPG adopts the AC architecture from the policy gradient θµ ← ρθµ + (1 − ρ)θθ (32)
framework, which maintains a deterministic policy func-
tion µ(s) (actor) as well as a value function Q(s, a) (critic). where ρ represents the update coefficient, which is far less
The policy gradient algorithm is used to optimize the policy than 1 so that the learning work is updated very slowly and
function assisted by the value function. The AC used in smoothly, thus promoting the learning stability.
DDPG is different from the previous one since this actor In summary, DDPG combines ideas from both DQN and
is a deterministic policy function. Nevertheless, the value AC techniques, which extends the Q-learning into the
function in DDPG is the same as that in DQN, which utilizes continuous action spaces and produces a lasting influence
the TD error to update itself. for subsequent DRL algorithms. On the one hand, the

1070 P ROCEEDINGS OF THE IEEE | Vol. 111, No. 9, September 2023


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

Algorithm 6 DDPG Algorithm two Q values, in order to form the targets in the Bellman
Input: Initialize replay buffer R. Randomly error loss function
initialize actor network parameters θµ and
critic network parameters θQ . Initialize Qθ1′ (s′ , a′ ) = Qθ1′ (s′ , µψ1 (s′ ))
target network Q′ and µ′ with parameters Qθ2′ (s′ , a′ ) = Qθ2′ (s′ , µψ2 (s′ ))
′ ′
θQ ← θQ , θµ ← θµ .
y1 = Ri + γ min Qθi′ (s′ , µψi (s′ )) (33)
1 for each episode do i=1,2

2 Initialize a random process N for action


exploration. where using the smaller Q value for the target helps
3 Receive initial observation state s1 mitigate the overestimation in the Q-function.
4 for t = 1, . . . , T do Second, it should be mentioned that the target network
5 Selection action as at = µ(st |θµ ) + Nt is an effective tool to improve the DRL stability. This is
according to the current policy and because the deep function approximator needs multiple
exploration noise. gradient updates to converge, and target networks can
6 Execute action at and observe reward rt as supply a stable objective during the learning process. In the
well as the new state st+1 . absence of a fixed target network, residual errors would
7 Store transition (st , at , rt , st+1 ) in the accumulate with each update. Therefore, target networks
replay buffer R. could be used to reduce the error of multiple gradient

8 Set yi = ri + γQ′ (si+1 , µ′ (si+1 |θµ )). updates, and the policy updates based on high-error states
9 Update critic by minimizing the loss: that would lead to a good divergence. Then, the policy
Q 2 network should have a lower updating frequency than the
L = N1
P 
i yi − Q(si , ai |θ ) .
10 Update the actor policy using the sampled value network to minimize the estimation error before
policy gradient: policy updates. Accordingly, the updating frequency of the
11 ∇θµ J ≈ policy network is reduced by applying the TD3 algorithm,
1
PN Q µ which is also the name delay stems from. Generally speak-
N i ∇a Q(s, a)|θ )∇θ µ µ(s|θ ).
12 Update the target networks: ing, the less frequent policy updates, the smaller variance
′ Q′ of Q value function is, which results in a higher quality
θQ ← ρθQ + (1 − ρ)θθ
′ µ′ policy.
θµ ← ρθµ + (1 − ρ)θθ Third, one of the issues with the deterministic policy is
13 end that such a method might be overfitted to narrow peaks
14 end in value estimation. In other words, if the Q-function
Output: Parameters pair of the actor and critic approximator produces an incorrect sharp peak for some
actions, the policy will quickly exploit this spike and then
bring about fragile or incorrect behavior. Hence, the target
policy smoothing regularization is developed to address
AC architecture is transformed into off-policy methods this issue by adding noise to the target action to smooth
because of DQN, and thus, networks are trained with sam- out the Q value function and avoid any overfitting
ples from replay buffer that further enhances the sample
efficiency. What is more, the introduction of replay buffer
∆ ∼ clip(N (0, σ), −c, c)
can also alleviate the correlations between samples, which
brings about more robust and stable learning performance. y = R + γQθ′ (s′ , µψ′ (s′ ) + ∆) (34)
On the other hand, DDPG can cope with high-dimensional
problems with continuous action spaces owing to the exis- where ∆ represents the truncated normal distribution
tence of AC model. In addition, the development of soft noise to each action as a regularization, which is clipped
target update trick not only smooths the network training into the valid range [−c, c].
but also impels other DRL algorithms such as TD3 and SAC. To conclude, TD3 is the successor of DDPG that aims
b) Twin delay DDPG: Even though DDPG achieves a to address the overestimation problem of critic network
great performance, it is still frequently beset with respect in the conventional DDPG. In particular, TD3 gains an
to hyperparameters tuning. A common shortcoming of improved performance over the baseline DDPG via intro-
traditional DDPG is the overestimation of Q values, which ducing three key tricks, including the clipped DDQN for
also appears in DQN as DDPG shares the same function AC, delayed policy and target updates, and target policy
approximator in the action section with DQN. To this end, smoothing regularization. However, as an extension of
an improved version of DDPG named TD3 is proposed to DDPG, TD3 is still sensitive to hyperparameter tuning,
address this issue, via introducing three critical tricks [59]. and different hyperparameter settings will lead to different
First, TD3 draws lessons from double-DQN that learn performances, which might be a potential direction of
two Q-functions instead of one, from which the name of future research. In summary, the fundamental knowledge
twin is originated. Then, it selects the smaller one in these of RL, DL, and DRL is presented in this section with a

Vol. 111, No. 9, September 2023 | P ROCEEDINGS OF THE IEEE 1071


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

detailed explanation. On this basis, various advanced DRL


techniques as well as their extensions are discussed to
analyze their advantages and drawbacks. It could be con-
cluded that different DRL algorithms are applicable to deal
with different problems in different scenarios. Therefore,
in Section III, applications of DRL in SG operations for
various problems are reviewed with discussion.

III. A P P L I C A T I O N S O F D R L I N
Fig. 11. Typical architecture of SG operation. This figure is cited
S G O P E R AT I O N
from [76].
The SG applications are devoted to achieving a sustain-
able, secure, reliable, and flexible energy delivery through
suppressing voltage fluctuations, and strengthening the
bidirectional power and information flows. SG applications
SG security and stability. Despite the applicability of tra-
possess the following features.
ditional power system methods, it is envisaged that they
1) SG offers a more efficient way to maintain the may be inadequate for SG applications characterized by
optimal dispatch with a lower generation cost and high levels of renewable and variable energy penetrations
higher power quality via the integration of distributed and increased human participation in load management.
sources and flexible loads, such as RE and EVs [60], Traditional optimization methods could struggle with iden-
[61], [62], [63], [64]. tifying the best solutions for these problems due to the
2) SG achieves the secure and stable operation of power high uncertainty in prevailing SG operations and the high
system via the deployment of effective operational dimensionality of distributed systems with coupled vari-
control technologies, including the AGC, AVC, and ables that are metastasizing in SGs.
LFC [65], [66], [67], [68]. In essence, it would be difficult to establish the corre-
3) SG provides a transaction platform for customers and sponding accurate models. Fortunately, DRL agents could
suppliers affiliated to different entities, thus enhanc- automatically learn the pertinent knowledge in such cases
ing the interactions between suppliers and customers, while interacting with the environment, which is indepen-
which facilitates the development of electricity mar- dent of the accurate environment model. However, the
ket [69], [70], [71]. purpose of applying DRL is not to completely replace con-
4) SG equipment encompasses numerous advanced ventional optimization methods. Instead, DRL can serve as
infrastructures, including sensors, meters, and con- a complement to existing approaches and enhance them in
trollers, which are also applied to emerging condi- a data-driven manner. In this way, DRL has the advantage
tions, such as network security and privacy preserva- of addressing such SG problems more effectively due to
tion [72], [73], [74], [75]. its data-driven and model-free nature. In the rest part
On this basis, a typical SG architecture is shown in of this section, the DRL applications to optimal dispatch,
Fig. 11, which illustrates that the SG operation involves operational control, electricity market, and other emerging
four fundamental segments, i.e., power generation, trans- areas are analyzed and investigated in detail.
mission, distribution, and customers. As for the generation,
traditional thermal energy is converted to electrical power, A. Optimal Dispatch
while large-scale RE integration is a promising trend in SG
Compared with traditional power systems, SG inte-
applications. After that, the electrical energy is delivered
grates more distributed RE to promote sustainability [76].
from the power plant to the power substations via the
Under this circumstance, the conventional centralized
high-voltage transmission lines. Then, substations lower
high-voltage power transmission might not be considered
the transmission voltage and distribute the energy to
an economic operation since the RE sources are usually
individual customers such as residential, commercial, and
distributed and closely located to load centers. As a result,
industrial loads. During the transmission and distribution
DN, self-sufficient microgrid, and IES are gradually becom-
stages, numerous smart meters are deployed in SGs to
ing more independent of transmission network operations,
ensure their secure and stable operations. In addition, such
which are also highlighted as a major developing trend in
advanced infrastructures bring about emerging concerns,
SG applications [77], [78], [79], [80]. In addition, it has
e.g., network security and privacy concern, that traditional
witnessed the rapid development of EVs in recent years,
power systems would seldom encounter.
which has already become a critical SG component [81],
In order to support SG operations, DRL applications
[82], [83], as shown in Fig. 12. To this end, applications
are also divided into four categories, including optimal
of DRL regarding optimal dispatch on DN, microgrid, IES,
dispatch, operational control, electricity market, and other
and EV are summarized as follows.
emerging issues, such as network security and privacy
concern. These problems usually have similar economic 1) Distribution Network: In recent years, DN opera-
and technical objectives for reducing operational costs, tions have faced significant challenges mainly due to the

1072 P ROCEEDINGS OF THE IEEE | Vol. 111, No. 9, September 2023


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

Table 2 DN Optimal Dispatch Based on DRL

increasing deployment of DERs and EVs. Specifically, the On the one hand, DRL could provide better flexible con-
uncertain RE output could impact the distribution and trol decisions to promote DN operations, including voltage
the direction of DN power flow, which may further lead regulation. For instance, Cao et al. [84] and Kou et al. [85]
to the increase of power loss and voltage fluctuations. proposed a multiagent DDPG (MADDPG)-based approach
Hence, traditional methods based on mathematical opti- for the DN voltage regulation with a high penetration of
mization methods might not deal effectively with this photovoltaics (PVs), which shows a better utilization of PV
highly uncertain environment. More importantly, these resources and control performance. A novel DRL algorithm
traditional methods significantly depend on the accurate named constrained SAC is proposed in [86] and [87]
DN parameters, which are difficult to acquire in practice. to solve Volt–Var control problems in a model-free man-
To address these limitations, DRL methods are applied ner. Comprehensive numerical studies demonstrate the
in DNs, which could provide more flexible control deci- efficiency and scalability of the proposed DRL algorithm,
sions and promote the operation of DN. Generally, the compared with state-of-the-art DRL and convectional opti-
reward can be designed to achieve certain goals, such mization algorithms. Sun and Qiu [88] and Yang et al. [89]
as minimizing power losses, improving voltage profile, proposed a two-stage real-time Volt–Var control method,
or maximizing RE utilization. The literature about the in which the model-based centralized optimization and
applications of DRL on the DN is listed in Table 2, which is the DQN algorithm are combined to mitigate the voltage
summarized from two aspects, i.e., management method violation of DN.
and solving algorithm. In addition, the performance of On the other hand, DRL algorithms are also applied
reviewed methods is analyzed from the perspectives of to determine the optimal network configuration of DN.
convergence, privacy protection, and scalability, where the For example, Li et al. [90] developed a many-objective
tick mark means an outstanding performance and blank DNR model to assess the tradeoff relationship for better
means the corresponding article that does not refer to the operations of DN, in which a DQN-assisted evolutionary
performance. algorithm (DQN-EA) is proposed to improve searching
efficiency. Similarly, an online DNR scheme based on deep
Q-learning is introduced in [91] to determine the optimal
network topology. Simulation results indicate that the com-
putation time of the proposed algorithm is low enough for
practical applications. In addition, Gao et al. [92] devel-
oped a data-driven batch-constrained SAC algorithm for
the dynamic DNR, which could learn the network recon-
figuration control policy from historical datasets without
interacting with the DN. In [93], the federated learning
and AC algorithm are combined to solve the demand
response problem in DN, which considers the privacy
protection, uncertainties, as well as power flow constrains
of DN simultaneously. In addition, a DRL framework based
on A2C algorithm is proposed in [94], which aims at
Fig. 12. Optimal dispatch issues of SG operation. This figure is cited enhancing the long-term resilience of DN using hardening
from [76]. strategies. Simulation results show its effectiveness and

Vol. 111, No. 9, September 2023 | P ROCEEDINGS OF THE IEEE 1073


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

scalability in promoting the resilience of DN compared


with traditional mathematical methods.

2) Microgrid Network: Microgrid is a local electric power


system with DERs, ESS, and flexible loads [95]. Various
objectives are proposed for the microgrid optimization dis-
patch, such as maximizing the operator’s revenues, mini-
mizing operational costs, promoting the users’ satisfaction,
reducing power delivery losses, increasing the RE utiliza-
Fig. 13. Microgrid optimal dispatch method. This figure is cited from
tion, and promoting the system stability. Indeed, decision [76].
variables of microgrid dispatch (35) mainly include the
electricity price, generation allocation strategies, device
availability, and operation state. Hence, for the mth micro-
grid where Pm load
(t) represents the power load of the mth
microgrid.
T
X X The managed objects of microgrid optimal dispatch
πdg Pkdg (t) + ηm π(t)Pm
grid

min (t) could be divided into three categories, i.e., DERs, ESS, and
t=1 k∈m
user loads, as shown in Fig. 13. The DER management,
+ ηess |SOC(t) − SOC(t − 1)|
Z
! which consists of PV, wind power, DG, fuel cell, and alike,
is primarily concerned with the DER generation dispatch
X z z z
+ πm qm (t)um (t) . (35)
z=1 [96]. As for ESS, the optimal management is achieved
by charge/discharge controls, in order to coordinate the
The first part of (35) represents the DER generation cost, microgrid supply and demand [97]. In addition, demand
in which πdg (Pkdg (t)) is the quadratic polynomial corre- response is widely applied to dispatch the user loads,
lated to electricity generation Pkdg . The second part is the which aims to reduce the operational cost and enhance
energy cost of microgrid for purchasing electricity from the service reliability [98]. However, it is difficult to opti-
the main grid where ηm denotes the loss coefficient of mize the microgrid dispatch, considering the variability
power delivery. π(t) and Pm grid
(t) represent the electricity of RE and uncertain loads. What is more, the existence
sale price and energy purchased from the main grid, of high-dimensional variables and nonlinear constraints
respectively. The third part denotes the ESS degradation makes the solving trouble. To this end, DRL is introduced to
cost where ηess is the degradation cost coefficient and SOC solve the microgrid optimal dispatch problem in some liter-
represents the ESS energy state. The last part of (35) is ature, which achieves a higher computational efficiency as
the internal cost of microgrid where πm z
denotes the cost well as a better scalability, simultaneously. In the rest part
z
of the z th load block. In particular, qm represents the z th of this topic, the detailed applications of DRL on microgrid
load block and uzm is the binary variable denoting the load are shown in Table 3.
participation status in demand response. a) Distributed energy resources: Typically, the reward
To ensure the microgrid secure operation, the following function can be designed to maximize the RE utilization
constraints should be considered. For example, the energy or minimize the energy cost. Dridi et al. [99] proposed a
stored at ESS during ∆t should be equal to the difference novel DRL approach based on deep LSTM for microgrid
of charged and discharged energy: energy management through the generation dispatch of
DERs and ESS, which shows better results compared to
ESS
Pdc Q-learning. The A3C algorithm is introduced in [100]
ESS
SOC(t) = SOC(t − 1) + ηch Pch ∆t − ∆t (36) to dispatch energy while considering prevailing risks.
ηdc
Experiment results denote a higher accuracy for energy
where (36) describes the energy balance of ESS, ηch and scheduling in the proposed risk-aware model than those
ηdc denote the charging and discharging coefficients of of traditional methods. In addition, a finite-horizon DDPG
ESS
ESS, respectively, and Pch ESS
and Pdc represent the energy (FH-DDPG)-based DRL algorithm is proposed in [101]
amount of ESS charging and discharging, respectively. for energy dispatch with DGs, PV panels, and ESS. The
In addition, ∆t is the duration of the tth time interval. case study using isolated microgrid data shows that the
In microgrid operations, active power supply and proposed approach can offer efficient decisions even with
demand should be balanced partially available state information.
b) Energy storage system: In the context of ESS, the
X Z
X reward function can be based on costs associated with
grid
Pm (t) + Pkdg (t) + Pdc
ESS
(t) + z
qm (t)uzm (t) energy usage, such as electricity prices or battery degrada-
k∈m z=1 tion costs. Sanchez Gorostiza and Gonzalez-Longatt [102]
load ESS
= Pm (t) + Pch (t) (37) introduced the DDPG algorithm to derive ESS energy

1074 P ROCEEDINGS OF THE IEEE | Vol. 111, No. 9, September 2023


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

Table 3 Microgrid Network Optimal Dispatch Based on DRL

dispatch policies without fully observable state infor- the randomness of EV charging behaviors. On the other
mation. The proposed algorithm has derived an energy hand, multistage optimization is introduced to handle
dispatch policy for ESS. A multiagent TD3 (MATD3) is the problem caused by high-dimensional variables. Nev-
developed in [103] for ESS energy management. Sim- ertheless, the optimization results of these methods are
ulation results demonstrate its efficiency and scalability dependent on the predictive accuracy. Accordingly, DRL
while handling high-dimensional problems with contin- is applied to deal with the EV optimal dispatch problem,
uous action space. In addition, the curriculum learning which is a data-driven method and to some extent insen-
is integrated into A2C to improve sample efficiency and sitive to prediction accuracy. The reward can incorporate
accelerate the training process in [104], which speeds up factors related to user comfort and convenience, such as
the convergence during the DRL training and increases the the queuing time waiting for charging, the available range
overall profits. of EV travel, or the ability to meet specific user preferences.
c) User loads: The reward can be formulated as the In the rest part of this section, we present the detailed
reduction in peak load or the cost savings achieved through DRL applications to the EV optimal dispatch, as shown in
demand response. In [105], a prioritized experience replay Table 4.
DDPG (PER-DDPG) is applied to the microgrid dispatch For instance, Zhang et al. [112] proposed a novel
model considering demand response. Simulation studies approach based on DQN to dispatch the EVs charging
indicate its advantage in reducing operational costs com- and recommend the appropriate traveling route for EVs.
pared with traditional dispatch methods. Du and Li [106] Simulation studies demonstrate its effectiveness in signif-
proposed an MCDRL approach for demand-side manage- icantly reducing the charging time and origin–destination
ment, which tends to have a strong exploration capabil- distance. In [113], a DRL approach with embedding and
ity and protect consumer privacy. In addition, the A2C attention mechanism is developed to handle the EV routing
algorithm is developed in [107] to address the demand problem with time windows. Numerical studies show that
response problem, which not only shows the superiority it is able to efficiently solve large-size problems, which
and flexibility of the proposed approach but also preserves are not solvable with other existing methods. In addition,
customer privacy. a charging control DDPG algorithm is introduced to learn
the optimal strategy for satisfying the requirements of
3) Electric Vehicles: The use of EVs has been growing users while minimizing the charging expense in [114].
rapidly across the globe, in particular within the past The SAC algorithm is applied to deal with the congestion
decade, which is mostly due to its low environmental control problem in [115], which proves to outperform
impacts [108], [109], [110], [111]. Specifically, reducing other decentralized feedback control algorithms in terms
the charging cost through dispatching the behaviors of of fairness and utilization.
charging and discharging is the hot spot of research. Due to Taking the security into account, Li et al. [116] pro-
the flexibility of EVs charging/discharging, some literature posed a CPO approach based on the safe DRL to minimize
focuses on the coordinated dispatch of EVs and RE, which the charging cost, which does not require any domain
is devoted to promoting the utilization of RE by EVs. knowledge about the randomness. Numerical experiments
However, the uncertainty of RE and user loads results in demonstrate that this method could adequately satisfy the
the difficulty of model construction. At the same time, the charging constraints and reduce the charging cost. A novel
proliferation of EVs makes it more difficult to optimize the MADDPG algorithm for traffic light control is proposed
solution of the operation problem, which is mostly due to to reduce the traffic congestion in [117]. Experimental
the large number of variables. results show that this method can significantly reduce
On the one hand, traditional methods tend to estimate congestion in various scenarios. Qian et al. [118] devel-
before optimization and decision-making while addressing oped a multiagent DQN (MA-DQN) method to model the

Vol. 111, No. 9, September 2023 | P ROCEEDINGS OF THE IEEE 1075


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

Table 4 EV Optimal Dispatch Based on DRL

pricing game in the transportation network and determine as well as heat load. Numerical simulations on a typi-
the optimal charging price for electric vehicle charging cal day scenario demonstrate that the developed method
station (EVCS). Case studies are conducted to verify the avoids dependence on uncertainty knowledge and has a
effectiveness and scalability of the proposed approach. In strong adaptability for inexperienced scenarios. In [123],
[119], a DQN-based EV charging navigation framework is a dynamic energy conversion and dispatch model for IES
proposed to minimize the total travel time and charging is developed based on DDPG, which takes the uncertainty
cost in the EVCS. Experimental results demonstrate the of demand as well as the flexibility of wholesale prices
necessity of the coordination of SG with an intelligent into account. Case studies illustrate that the proposed
transportation system. In addition, the continuous SAC algorithm can effectively improve the profit of system oper-
algorithm is applied to crack the EV charging dispatch ator and smooth the fluctuations of user loads. Similarly,
problem considering the dynamic user behaviors and elec- the optimal dispatch problem of IES with RE integrated is
tricity price in [120]. Simulation studies show that the first formulated as a discrete MDP in [123], which is solved
proposed SAC-based approach could learn the dynamics of by the proposed DRL method based on PPO subsequently.
electricity price and driver’s behavior in different locations. Finally, simulation results show that this method can dis-
tinctly minimize the operation cost of IES. In addition,
4) Integrated Energy System: In order to solve the prob- the IES economical optimization problem with wind power
lem of sustainable supply of energy and environmental and power-to-gas technology is discussed in [124], which
pollution, IES has attracted extensive attention all over develops a cycling decay learning rate DDPG to obtain the
the world. It regards the electric power system as the optimal operation strategy. Zhang et al. [125] investigated
core platform and integrates RE at the source side and the optimal energy management of IES considering solar
achieves the combined operation of cooling, heating, power, diesel generation, and ESS, which introduces the
as well as electric power at the load side [121]. However, PPO algorithm to solve the optimization problem and
the high penetration of RE and flexible loads make the realizes 14.17% of cost reduction in comparison with other
IES become a complicated dynamic system with strong methods.
uncertainty, which poses huge challenges to the secure On the other hand, DRL approaches are also introduced
and economic operation of IES. Moreover, conventional to deal with the IES optimal dispatch problem at the load
optimization methods often rely on accurate mathematical side [126]. For instance, Zhou et al. [127] established the
model and parameters, which are not suitable for IES constrained CHP dispatch problem as an MDP. Afterward,
optimal dispatch problem while considering strong ran- an improved policy gradient DRL algorithm named dis-
domness. Fortunately, DRL is introduced to address the IES tributed PPO is developed to handle the CHP economic
optimal dispatch problem, which is a model-free method dispatch problem. Simulation results demonstrate that the
and achieves a series of successful applications. When proposed algorithm could cope with different operation
applying DRL to IES, the design of the reward function can scenarios while obtaining a better optimization perfor-
vary depending on the specific objectives and constraints mance than other methods. In [128], a DRL algorithm
of the system. In the rest part of this section, compre- based on DQN is used to realize the dynamic selection
hensive reviews about DRL-based IES optimal dispatch are of optimal subsidy price for IES with regenerative electric
discussed as follows. heating, which aims to maximize the load aggregator prof-
On the one hand, DRL methods are applied to cope its while promoting demand response. Numerical studies
with optimal dispatch problem of IES at the source side. show that the power grid can save 56.6% of its invest-
For example, Yang et al. [122] proposed a DDPG-based ment and users save up to 8.7% of costs. In addition,
dynamic energy dispatch method for IES while considering a model-free and data-driven DRL method based on DDPG
the uncertainty of renewable generation, electric load, with prioritized experience replay strategy is proposed to

1076 P ROCEEDINGS OF THE IEEE | Vol. 111, No. 9, September 2023


Authorized licensed use limited to: Université de Strasbourg SCD. Downloaded on February 11,2025 at 06:54:59 UTC from IEEE Xplore. Restrictions apply.
Li et al.: DRL for Smart Grid Operations: Algorithms, Applications, and Prospects

address the IES energy management problem in [129],


which also illustrates its superior performance in reducing
the energy cost. In addition, Li et al. [130] constructed a
coordinated power dispatch framework, which is based
on the MADDPG for combining imitation learning and
curriculum learning simultaneously. Case studies verify the
effectiveness of the proposed algorithm in the dispatch
performance against renewable power fluctuations and
stochastic loads.
In conclusion, this section reviews the applications of
DRL for the optimal dispatch issues in SG operations, and
the reviewed methods are summarized along with the
references in Tables 2–4. It could be illustrated that the
optimal dispatch problems in SG operations are usually
established as an MDP. On this basis, RL could be uti-
lized to deal with the model-free optimization problem
with high uncertainty, while the curse of dimensionality
is handled by the deep depth of NN. Therefore, DRL is
capable of coping with such high-dimensional optimal dis-
patch problems with uncertainties, which achieves better performances than conventional methods. What is more, policy-based algorithms including both deterministic and stochastic methods receive more attention, compared with the value-based DRL approaches. In Section III-B, the adoption of DRL for the operational control of SG is discussed, which is confronted with more difficult challenges due to strict operating rules.

Fig. 14. Operational control issues in SG.

B. Operational Control

The operational control of SG aims at maintaining its secure and stable operation, as shown in Fig. 14. However, it has become more complicated and challenging with the prevalence of RE. Indeed, the problem of voltage and frequency fluctuations induced by the ever-increasing penetration of stochastic and intermittent RE becomes more serious, which threatens the secure and stable operation of SG.

To this end, conventional operational control methods are proposed to maintain the balance on active power, voltage, and frequency stability, such as AGC, AVC, and LFC. Unfortunately, traditional operation methods are not suitable for managing the variability of large-scale RE and loads in SG [131], [132], [133]. At present, DRL-based methods are widely applied in the field of operational control, owing to their self-learning, self-optimization, and decision-making merits [134]. In the rest of this section, detailed DRL applications to the operational control of SG are described as follows.

1) Automatic Generation Control: AGC is used for adjusting the power output of multiple generators at different power plants in response to changes on the load side. As conventional AGC methods are not adequate to handle the strong uncertainty induced by the ever-increasing penetration of RE, DRL is applied to deal with the above problems. Accordingly, the reward is generally defined as the minus of the total generation cost. For instance, Vijayshankar et al. [135] proposed a model-free DRL framework based on SARSA for real-time yaw control of wind farms to accurately track the power reference signal. Simulation studies indicate that a wind farm could achieve a better tracking performance with this control paradigm. In [136], an intelligent automatic control framework is proposed to address the coordination problems between AGC controllers in multiarea power systems, which adopts an imitation guided-exploration multiagent twin-delayed DDPG (IGE-MATD3) algorithm. As demonstrated by the simulation results, the intelligent AGC framework can improve dynamic control performance and reduce the regulation mileage payment in each area. In addition, Xi et al. [137] proposed a novel DRL algorithm (namely DPDPN) to allocate power order among the various generators, which combines RL and DNN to obtain the optimal coordinated control of source–grid–load. Experimental results demonstrate that the power system control performance and adaptability are improved using the proposed algorithm compared with conventional methods. In [138], a multiple experience pool replay twin delayed deep deterministic policy gradient (MEPR-TD3) algorithm is proposed to handle the AGC dispatch problem in the IES, which achieves the comprehensive optimum in control performance and economic profit. The control performance of AGC for wind power ramping based on DRL is investigated in [139], in which the DQN is used for AGC parameter fitting. Simulation results verify the feasibility and effectiveness of analyzing the relationship between AGC performance and wind power ramping, on the basis of the proposed AGC parameter fitting model. In addition, Li et al. [140] proposed a hierarchical multiagent deep deterministic policy gradient (HMA-DDPG) algorithm for AGC dispatch.
Numerical analysis verifies that the sectional AGC dispatch based on HMA-DDPG can adjust the AGC unit outputs with the changes in system state, thus guaranteeing an optimal economic, secure, and stable SG operation.

In [141], a swarm intelligence-based DDPG (SI-DDPG) algorithm is designed to acquire the control knowledge and implement a high-quality decision for AGC. Simulation results on a two-area SG validate the SI-DDPG effectiveness for improving the area control performance. In addition, a threshold solver based on TD3 is presented in [142] to dynamically update the thresholds of AGC, which is verified to be effective in maintaining the SG stability with a lower operation cost. A preventive strategy for the AGC application in SG operation is proposed in [143]. The strategy is based on DFRL, which achieves the highest control performance compared with ten other conventional AGC methods. Yang et al. [144] presented a DRL model for wind farm AGC to maximize the revenue of wind power producers, which utilizes the rainbow algorithm to train the wind farm controller against uncertainties.

In [145], an intelligent controller based on Q-learning for the AGC application in the SG operation is proposed to compensate the power balance between generation and load demand. Numerical simulations validate the feasibility of the SG controller with network-induced effects. In addition, a multiarea AGC scheme based on Q-learning is designed in [146] to dynamically allocate the AGC regulating commands among various AGC units. Comprehensive tests on practical data demonstrate the validity of the proposed method in minimizing the generation cost and regulating error. In addition, Hasanvand et al. [147] presented a reliable and optimal AGC method based on DQN to manage the generators in an electric ship. Real-time simulation is conducted to verify the performance and efficacy of the suggested AGC scheme for the electric ship.

Table 5 SG AGC Based on DRL

2) Autonomous Voltage Control: Voltage control is another critical aspect of SG operational control, which can maintain bus voltage magnitudes within a desirable range. Although most of the existing model-based AVC methods could mitigate voltage violations, they are significantly dependent on accurate SG knowledge data, which is often difficult to acquire in real time. Thus, the use of DRL allows controllers to learn the control strategy through interactions with a system-like simulation model, where the reward is defined as a penalty for the voltage deviation from its nominal value. Wang et al. [148] proposed a multiagent AVC algorithm based on MADDPG to mitigate voltage fluctuations, which could gradually learn and master the system operation rules from input and output data.

Fig. 15. Illustration diagram of MADDPG.

More specifically, MADDPG utilizes centralized training with decentralized execution, as presented in Fig. 15. During the training phase, MADDPG agents employ a centralized critic network that observes all agents' actions to estimate the value function. This enables them to learn coordination and collaboration among agents. During the execution phase, however, each agent acts independently and makes decisions based solely on its own observations. This decentralized execution allows agents to interact with the environment and make decisions autonomously, without a need for explicit coordination with other agents.
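To make this structure concrete, the following minimal PyTorch sketch illustrates a centralized critic that scores the joint observations and actions of all agents during training, while each decentralized actor selects its action from its own local observation at execution time. The two-agent setting, layer sizes, and observation/action dimensions are illustrative assumptions and are not the configuration used in [148].

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Decentralized actor: maps one agent's local observation to its action.
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class CentralCritic(nn.Module):
    # Centralized critic: during training it sees all agents' observations and actions.
    def __init__(self, total_obs_dim, total_act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(total_obs_dim + total_act_dim, 128),
                                 nn.ReLU(), nn.Linear(128, 1))
    def forward(self, all_obs, all_act):
        return self.net(torch.cat([all_obs, all_act], dim=-1))

# Two hypothetical voltage-control agents with 10-dimensional local observations.
obs_dim, act_dim, n_agents = 10, 2, 2
actors = [Actor(obs_dim, act_dim) for _ in range(n_agents)]
critic = CentralCritic(n_agents * obs_dim, n_agents * act_dim)

# Centralized training step: the critic evaluates the joint observation/action.
obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]
acts = [actor(o) for actor, o in zip(actors, obs)]
q_value = critic(torch.cat(obs, dim=-1), torch.cat(acts, dim=-1))

# Decentralized execution: each agent acts from its own observation only.
local_action = actors[0](obs[0])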
In [149], a DRL-based AVC scheme is developed for autonomous grid operation, which takes control actions to ensure secure SG operations for various randomly generated operating conditions. Numerical studies on a realistic 200-bus test system demonstrate the effectiveness and promising performance of the proposed method. In addition, a physical-model-free AVC approach based on DDPG is presented in [150], which can cope with fast voltage fluctuations. A model-free DRL control strategy based on DQN is proposed in [151], which aims to enhance the bus voltage regulation performance of converters. The comparison of simulation results indicates the efficiency of the proposed control strategy for managing large signal perturbations. Wang et al. [152] proposed a novel DRL-based voltage regulation scheme for unbalanced low-voltage DNs, which is devoted to minimizing the expected total daily voltage regulation cost while satisfying operational constraints. An attention-enabled MATD3 algorithm is designed in [153] for decentralized AVCs, which is demonstrated to be effective in dealing with uncertainties, reducing communication requirements, and achieving fast decision-making processes. In addition, a novel hierarchical DRL, referred to as the ARS algorithm, is proposed in [154] where the lower level DRL agents are trained in an areawise decentralized manner, and the higher level agent is trained to coordinate the actions executed by lower level agents. Numerical experiments verify the advantages and various intricacies of the hierarchical method applied to the IEEE 39-bus power system. Huang et al. [155] formulated a derivative-free PARS algorithm for AVC via load shedding, which can overcome the control problems of existing DRL algorithms, including computational inefficiency and poor scalability. Simulation results illustrate that the proposed method offers better computational efficiency, more robustness in learning, excellent scalability, and better generalization capacity, compared with other approaches.

In [156], a DDQN framework, which applies the graph convolutional network (GCN), referred to as GC-DDQN, is proposed to tackle topology changes in the AVC problem, where the GCN model assists the DRL algorithm to better capture topology changes and spatial correlations in nodal features. A model-free centralized training and decentralized execution multiagent SAC (MASAC) framework is designed in [157] for AVC with high penetration of PVs. Comparative simulation studies demonstrate the superiority of the proposed approach in reducing the communication requirements. Nguyen and Choi [158] presented a three-stage AVC framework in SG using the online safe SAC method to reduce voltage violations, mitigate peak loads, and manage active power losses by coordinating the three stages with different control timescales. Numerical simulations for the IEEE 123-bus system demonstrate the high efficiency and safety of the presented method for regulating voltages. In addition, a novel deep meta-reinforcement learning (DMRL) algorithm is developed in [159], which combines the meta-strategy optimization with PARS to maintain voltage stability. Experimental results show that the performance of the proposed method surpasses those of state-of-the-art DRL and model predictive control approaches.

Table 6 SG AVC Based on DRL

3) Load Frequency Control: LFC is also a complicated decision-making problem in SG applications. To this end, DRL is introduced for restoring the frequency and tie-line power flows to their nominal values after disturbances. Therefore, the reward could be defined as negative frequency and tie-line flow deviations. A novel control strategy for distributed LFCs is developed in [160], which is based on the multiagent DDQN with action discovery (DDQN-AD) algorithm. The approach shows a faster convergence speed and stronger learning ability compared with other traditional methods. In [161], a TDAC control strategy is proposed for LFC to deal with strong random disturbances caused by RE. Simulation studies show that TDAC has an excellent exploratory stability and learning capability, which improves the power system dynamic performance and achieves the regional optimal coordinated control. In addition, a multistep unified RL method is proposed in [162] for managing the LFC in multiarea interconnected power grids, which proves to outperform
other traditional algorithms in terms of convergence and dynamic performance. Yan and Xu [163] developed a data-driven LFC method based on DRL in the continuous action domain for minimizing the frequency deviations under uncertainties. Numerical simulations verify the effectiveness and advantages of the proposed method over other existing approaches.

A data-driven cooperative approach for LFC, which is based on MADDPG in a multiarea SG, is presented in [164]. The approach offers optimal coordinated control strategies for the LFC controller via centralized learning and decentralized implementation. Experimental results for a three-area SG demonstrate that the proposed algorithm can effectively minimize control errors against stochastic frequency variations. Khooban and Gheisarnejad [165] considered the DDPG to generate the supplementary control action for LFC, which is appraised for its systematic feasibility and applicability. In addition, a novel model-free LFC scheme is presented in [166], which adopts DDPG to learn the near-optimal strategies under various scenarios. Numerical simulations on benchmark systems verify the effectiveness of the proposed scheme in achieving satisfactory control performances. Yan et al. [169] developed a data-driven algorithm for distributed frequency control of island microgrids based on multiagent quantum DRL (MAQDRL). Numerical tests illustrate that the designed method can effectively regulate the frequency with better time delay tolerance. In [167], a DDPG-based data-driven approach for optimal control of ESS is proposed to support LFC. Simulation results in a three-area SG demonstrate the effectiveness of the proposed approach in supporting frequency regulation. In addition, the DDPG algorithm is combined with sensitivity analysis theory in [168], in order to learn the sparse coordinated LFC policy of multiple power grids. Numerical experiments verify that the proposed approach can obtain better performance of damping oscillation and robustness against wind power uncertainty.

Table 7 SG LFC Based on DRL

To conclude, this section reviews the applications of DRL for the operational control of SGs, which are inherently coupled with generation adjustment, voltage regulation, frequency stabilization, and so on. The reviewed methods are summarized along with adequate references in Tables 5–7. It could be observed that the most widely used DRL framework for the operational control of SG is centralized, while the decentralized manner is an irresistible trend with the prevalence of distributed generation. What is more, CNN is the most popular network architecture for the aforementioned DRL algorithms to extract features, while the novel GCN is gradually applied to capture the topology information of SG, which a typical CNN cannot. Despite the successful applications of DRL in operational control, these methods are still deemed to be computationally inefficient and offer poor scalability to a certain extent, according to the statements of some related literature. To this end, SG calls for more advanced DRL frameworks to support its secure and stable operation by offering robust strategies for its operational control. In Section III-C, the DRL adoption in SG markets is discussed, which involves multiple entities and complex relationships.
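Before moving on, the reward signals quoted in this section for the three operational control tasks, namely the minus of the total generation cost for AGC, a penalty on the voltage deviation from its nominal value for AVC, and negative frequency and tie-line flow deviations for LFC, can be sketched as simple functions. The weighting factors and the per-unit voltage convention below are illustrative assumptions rather than the settings of any cited work.

def agc_reward(generation_costs):
    # AGC: reward is the minus of the total generation cost.
    return -sum(generation_costs)

def avc_reward(bus_voltages, v_nominal=1.0):
    # AVC: penalize deviations of bus voltage magnitudes (p.u.) from the nominal value.
    return -sum(abs(v - v_nominal) for v in bus_voltages)

def lfc_reward(freq_deviation, tie_line_deviations, w_f=1.0, w_p=1.0):
    # LFC: negative frequency deviation plus negative tie-line power flow deviations.
    return -(w_f * abs(freq_deviation) + w_p * sum(abs(dp) for dp in tie_line_deviations))

# Example values for a small disturbance scenario (hypothetical numbers).
print(agc_reward([120.5, 98.2, 47.0]))
print(avc_reward([1.02, 0.97, 1.05]))
print(lfc_reward(0.03, [0.8, -0.5]))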
C. Electricity Market

The reform of the electricity power market has drawn much attention during the ongoing restructuring of modern power systems. The emerging electricity market is regarded as the potential solution for improving the power system efficiency and optimizing SG operations [170]. In this situation, electricity retailers have appeared in various liberalized electricity markets, as the intermediary between electricity power producers and consumers. However, the electricity market with retailers contains increasing uncertainties and complexities in both supply and retail sectors, which is a challenge that affects the decisions of participants. Indeed, the decision-making process of the electricity market is extremely complicated, as shown in Fig. 16, and mainly consists of energy bidding and retail pricing strategies [171]. On the one hand, the energy bidding process is a vital decision-making step for suppliers, which requires generality in different situations. On the other hand, the retail pricing strategy is the core challenge for retailers to promote profitability, which should have the adaptability to cope with a dynamic and complex environment.

Accordingly, conventional methods are proposed to promote the implementation of the electricity market, such as the equilibrium model, bilevel optimization, and game theory. Nevertheless, the traditional methods are not suitable to handle the decision-making problem under strong uncertainties [172]. Fortunately, the development of digitization in SG enables the application of data-driven algorithms, especially the DRL approach. At present, DRL is gradually becoming an effective tool for electricity participants to make decisions during the execution of energy transactions. In the rest of this section, detailed applications of DRL on the electricity power market are discussed as follows.

Fig. 16. Hierarchical electricity market model.

1) Optimal Bidding Strategy: Energy bidding is the critical step during the decision-making process in a wholesale electricity market. Although some methods have been developed to address the optimal bidding strategy, the strategy still encounters significant data acquisition and decision-making challenges. The DRL approach can provide an opportunity to handle this data-driven concern. Accordingly, the reward is defined based on payoffs obtained from successful bids. The energy bidding is formulated in [173] as an MDP with continuous state and action spaces, which is solved by DDPG for calculating the optimal bidding policy.

Du et al. [174] developed a model-free and data-driven approach, referred to as MADDPG, to approximate the Nash equilibrium in an electricity market with incomplete information. Simulation results demonstrate that the proposed algorithm could find a superior bidding strategy for all market participants with increased profit gains, compared with conventional bidding methods. A coordinated bidding and operation model, which is based on the DRL algorithm, is proposed in [175] and [176] to improve the real-time wholesale electricity market benefits for wind farms, which considers the cooperation of wind power bidding and ESS operation with uncertainty. The case study illustrates that the bidding policy learned by the DRL method can effectively improve the wind farm benefit while ensuring robustness. In addition, the DDPG algorithm is proposed in [177] to model generation bidding strategies, which is verified to be more accurate than the traditional RL algorithms and can converge to the Nash equilibrium even in an incomplete information environment.

In addition, Guo et al. [178] developed a data-driven bidding objective function identification framework with three procedures. First, the bidding decision process is formulated for participants as a standard MDP. Then, a DIRL method, which is based on maximum entropy, is introduced to identify individual reward functions. Finally, the DQN method is customized to simulate the individual bidding behaviors based on the obtained objective functions. The effectiveness and feasibility of the proposed framework and methods are verified by the real data from the Australian power market. A model-based DRL called MB-A3C is presented for the strategic energy bidding issues of wind farms in [179], by which the generated policies are verified to incur lower costs than those provided by previous model-based and model-free approaches. Also, Tao et al. [180] proposed a bidding strategy for EV aggregators, which is based on data analytics and the DDPG algorithm. The profit of the proposed approach is 63.3% higher than that of a random bidding strategy.

In addition, the SAC method is utilized in [181] to learn the optimal bidding strategy in a complex SG with incomplete information. Comprehensive simulations are conducted to verify the effectiveness of the proposed method in balancing supply and demand across the SG in a distributed manner. An asynchronous version of the fitted Q iteration algorithm is proposed for continuous intraday market bidding in [182], which is compared against a number of benchmark policies and outperforms them by 5% on average. Zhang et al. [183] considered a multi-DQN-based bidding strategy for EV owners to formulate the optimal bidding strategy and maximize the economic benefits, in which a target network and a value evaluation network are proposed for each agent to learn the optimal bidding strategy. Extensive experimental results demonstrate that the proposed strategy can achieve better economic benefits and assist EV owners in spending less time on charging in comparison to Q-learning-based methods. In addition, an electricity–gas joint bilateral bidding energy market model is constructed in [184], which employs the improved PPO algorithm to learn the optimal bidding policy. Comparative simulations illustrate that the innovative market model as well as the PPO method is able to obtain the multienergy collaborative optimization and improve the energy utilization efficiency.
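As a simple illustration of a reward "defined based on payoffs obtained from successful bids," the sketch below assumes a uniform-price clearing rule: a bid is cleared only if its offered price does not exceed the clearing price, and the payoff is the profit on the cleared quantity. The clearing rule and cost model are assumptions for illustration and do not correspond to a specific cited market design.

def bidding_reward(bid_price, bid_quantity, clearing_price, marginal_cost):
    # A bid is successful only if its price does not exceed the market clearing price.
    cleared_quantity = bid_quantity if bid_price <= clearing_price else 0.0
    # Payoff: revenue at the clearing price minus the cost of producing the cleared energy.
    return cleared_quantity * (clearing_price - marginal_cost)

# Successful bid: 50 MWh offered at 30 $/MWh, cleared at 42 $/MWh, marginal cost 25 $/MWh.
print(bidding_reward(30.0, 50.0, 42.0, 25.0))   # 850.0
# Rejected bid: offered above the clearing price, so the payoff is zero.
print(bidding_reward(45.0, 50.0, 42.0, 25.0))   # 0.0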
2) Optimal Pricing Strategy: Since the conventional pricing mechanism suffers from prevailing shortcomings to deal with dynamics and uncertainties [185], DRL has been adopted to offer a pricing strategy for electricity retailers. Accordingly, the reward could be defined as the market agent's total revenue for selling electricity. Liu et al. [186] established a quarter-hourly dynamic time-sharing pricing model, which is based on the DDPG algorithm. It considers various market factors such as peak-valley time-of-use tariffs, demand response, and balanced market deviations. Simulation results show that the proposed scheme with a higher daily pricing frequency could guide the users' charging behavior more effectively, tapping to a greater extent into the retail electricity market's economic potential and damping the load fluctuations in power grids. In [187], a novel DRL method is proposed to solve the EV pricing problem, which combines DDPG principles with a prioritized experience replay strategy. Numerical studies demonstrate that the proposed approach outperforms state-of-the-art RL methods in terms of both solution optimality and computational requirements. The optimal retail pricing problem is formulated as an MDP in [188], which is solved by the proposed DDPG method with a shared LSTM-based representation network. As indicated by simulation results, the proposed framework enhances the perception capacity, further improves the optimization performance, and provides a more profitable pricing strategy.

In addition, Lee and Choi [189] developed a pricing strategy for EVCSs based on the privacy-preserving distributed SAC algorithm to maximize the benefits of EVCSs integrated with PV and ESS. Numerical experiments show that the proposed method outperforms baseline methods in terms of convergence, sensitivity, and adaptability. In [190], a differentiated pricing mechanism based on the TD3 algorithm is proposed to motivate EV users to avoid over-utilization. Simulation studies demonstrate that the designed pricing scheme can maximize the utilization of the charging facility while ensuring the satisfaction of service quality. A DRL-based pricing strategy of an energy aggregator for profit maximization is presented in [191], which considers the behavior of opponents, uncertainty of RE, and varying bounds of the charging and discharging events in an unstable environment.

In addition, Lu et al. [192] proposed a Q-learning-based decision system for assisting the selection of electricity pricing plans, which extracts the hidden features behind the time-varying pricing plans from the continuous high-dimensional state space. Experimental studies demonstrate that the developed decision model can establish a precise predictive policy for individual users, effectively decreasing their energy consumption dissatisfaction and cost. In [193], a novel online pricing strategy is formulated to prevent power outages, which adopts multiagent DRL (MADRL) to control EV charging demands and support the grid stability and the EVCS profitability. Extensive simulation studies illustrate the significant improvement in the robustness and effectiveness of the developed solution in terms of revenue and energy saving. A real-time pricing strategy for a multienergy generation system is discussed in [194], which integrates a distributed online MARL algorithm to solve the MDP model without acquisition of the transition probabilities. As demonstrated by simulation, the proposed pricing approach shows a good performance in ensuring the revenue of both supply and demand sides.

Fig. 17. Typical architecture diagram of P2P energy trading market.

3) P2P Energy Trading: The abovementioned works are mainly based on the centralized collaborative method, which suffers from certain shortcomings, including privacy leakage and low efficiency [195], [196], [197]. With respect to these challenges, P2P energy trading is considered as the potential solution for distributed and flexible control of energy flow among peers, which allows direct energy transactions between producers and consumers [198]. The P2P energy trading market participants are described as prosumers, who can buy and sell electricity to achieve win–win market transactions [199], as shown in Fig. 17. Therefore, P2P participants could gain more revenue for trading energy in an electricity market. However, the decentralized electricity market involves complex decision-making processes with high-dimensional data and uncertainties, which are difficult to solve by traditional methods. With the development of AI, DRL is gradually applied in the P2P energy trading market due to its scalability and privacy protection features. One possible design is to reward agents for maximizing their individual profits or minimizing their costs. The P2P transactions in an electricity market are established as an MDP in [200] and solved by the SARSA algorithm with average discrete processing. The case study of a community with multiple users is conducted to verify the DRL's effectiveness, economy, and security for solving the P2P energy trading problem. Liu et al. [201] introduced the DQN for autonomous agents in the consumer-centric electricity market, which considers both local energy priority transactions and public shared energy facilities. Numerical studies verify that the proposed data-driven method can handle the P2P decision-making problem and promote the benefits of the whole community in electricity markets.
In addition, the P2P energy trading problem in a community market with many participating households is investigated in [202], which accounts for heterogeneity with respect to their DER portfolios. In order to address this problem, a novel DRL algorithm named MADDPG with parameter sharing (MADDPG-PS) is proposed in this article, which achieves significant operating cost and peak demand benefits. Samende et al. [203] presented an MADDPG-based algorithm for P2P electricity trading considering SG constraints. It minimizes the energy costs of prosumers who are participating in the P2P market. Numerical experiments on real-world datasets indicate that the proposed algorithm can reduce the energy cost while satisfying network constraints.

In [204], a distributed DQN-based method is developed to manage the energy trading between multiple virtual power plants through P2P and utility. Simulation results show that the designed method can adaptively adjust its action according to the available energy demand and uncertain environment. An improved MADDPG method-based double-side auction market is formulated in [205], in order to address the automated P2P energy trading problem among multiple consumers and prosumers. Case studies demonstrate that the proposed algorithm can promote the economic benefits of prosumers in P2P energy trading. In addition, Zhang et al. [206] developed an MADDPG-based P2P energy trading model among microgrids to improve the resource utilization and operational economy. Simulation results illustrate that the designed algorithm could reduce the operation cost of each microgrid by 0.09%–8.02%, compared to baselines.

Taking the privacy concern of P2P trading into account, Ye et al. [207] proposed a scalable and privacy-preserving P2P energy trading scheme based on the MAAC algorithm. Simulation studies, including a real-world, large-scale scenario with 300 residential participants, demonstrate that the proposed approach significantly outperforms the state-of-the-art MADRL algorithms in reducing the operation cost and peak demand. In addition, Wang et al. [208] provided a novel hybrid community P2P market framework for multienergy systems, where a data-driven market surrogate model-enabled DRL method is proposed to facilitate P2P transactions within constraints. Specifically, a market surrogate model based on a deep belief network is developed to characterize P2P transaction behaviors of peers in the market without disclosing their private data. In addition, an MADDPG-based energy trading algorithm is developed in [209] to formulate the optimal policy for each microgrid in the electricity market. Moreover, blockchain is adopted to guarantee the privacy of energy transaction data.

In summary, this section reviews the DRL applications in the SG electricity market, which mainly involves three actions, i.e., bidding, pricing, and P2P trading. DRL offers an effective tool for market participants to make optimal decisions, even without using the complete information about the electricity market. These approaches are summarized along with the references in Table 8. It is illustrated that most DRL frameworks for the electricity market only contain a single agent, while MADRL indicates a promising prospect with the development of the decentralized electricity market, e.g., the P2P trading market. First, the prevalence of the distributed electricity market calls for DRL algorithms with multiple agents, in which each agent is responsible for a local market. Second, the increasing concern about privacy leakage calls for MADRL approaches where multiple agents cooperatively train the model without the need of sharing datasets. Moreover, policy-based DRL methods are adopted more extensively than value-based ones in SG electricity market operations, due to the complexity in both supply and retail sectors. In Section III-D, we conduct a discussion on DRL applications in SG operations that will highlight future research trends.

Table 8 Applications of DRL on the Wholesale, Retail, and P2P Electricity Markets

D. Emerging Areas

In recent years, industry has witnessed the SG digitization and modernization via numerous deployments of advanced metering infrastructures. On this basis, SG will maintain secure, economic, and sustainable operations, compared with those in traditional power systems. Meanwhile, the widespread popularity of smart meters and RE also brings about some emerging issues that conventional power systems have seldom encountered, including network security and privacy concerns. Since these problems are rather new in SG operations, typical methods may not cope with them in an effective manner. To this end, data-driven DRL approaches are introduced in these emerging areas to assist SGs in tackling the aforementioned issues. In the rest of this section, detailed applications of DRL on network security and privacy preservation are depicted as follows.

1) Network Security: With rapid SG developments in active DNs, various sensing, communication, and control devices are deployed to maintain a secure SG operation. However, these cyber-physical components have also expanded the landscape of cyber threats, which has further resulted in SG vulnerabilities to malicious cyberattacks [210], [211], [212]. Even though regular defense strategies, such as intrusion prevention systems and firewalls, are provided in SGs, such methods might not be very effective while facing the many unknown vulnerabilities [213]. To this end, DRL is applied in SGs to offer additional defense strategies for mitigating the blackout risks during cyberattacks. Accordingly, the reward should be designed to incentivize actions that enhance network security and discourage actions that compromise it. For example, DRL is applied to assist SG operators in counteracting malicious cyberattacks in [214], which investigates the possibility of defending SG using a DQN agent. Simulation results not only demonstrate the effectiveness of the proposed DQN algorithm but also pave the way for defending the SG under a sophisticated cyberattack.
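One possible reward design in the spirit described above, rewarding actions that keep the grid secure during an attack while discouraging compromised states and overly costly defenses, is sketched below. The indicator terms and weights are assumptions chosen only for illustration.

def security_reward(compromised_buses, load_shed_mw, defense_cost,
                    w_comp=100.0, w_shed=1.0, w_cost=0.1):
    # Penalize compromised components, lost load, and the cost of the defense action itself,
    # so the agent is rewarded for actions that keep the system secure at low cost.
    return -(w_comp * compromised_buses + w_shed * load_shed_mw + w_cost * defense_cost)

# Example: a defense action isolates an attacked feeder, leaving no compromised buses,
# 12 MW of shed load, and a small switching cost.
print(security_reward(compromised_buses=0, load_shed_mw=12.0, defense_cost=3.0))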
Liu et al. [215] proposed a cybersecurity assessment approach based on DQN to determine the optimal attack transition policy. Numerical and real-time simulation experiments verify the performance of the developed algorithm without the need for full observation of the system. A DQN-based DRL algorithm is developed in [216] for the low-latency detection of cyberattacks in SGs, which aims at minimizing the detection delay while maintaining a high accuracy. Case studies verify that the DQN-based algorithm could achieve very low detection delays while ensuring a good performance. In addition, a DRL-based approach is proposed in [217] to detect data integrity attacks, which checks whether the system is currently under attack by introducing LSTM to extract state features of previous time steps. Simulation studies illustrate that the proposed detection approach outperforms the benchmarks on metrics including the delay error rate and false rate.

Moreover, Chen et al. [218] proposed a model-free defense strategy for SG secondary frequency control with the help of DRL, which proves to be effective through validation based on the IEEE benchmark systems. In [219], an MADDPG algorithm is proposed for SSA, which integrates DRL and edge computing to conduct efficient SSA deployment in SGs. In addition, a comprehensive risk assessment model of excessive traffic concentration in an SG is established in [220], which considers the link delay and load balancing simultaneously. Then, a DQN-based route planning algorithm is designed to find the optimal route, which not only meets the delay requirements but also enhances the resistivity of SG. To address the ever-increasing FDI attack in SG, Zhang et al. [221] proposed a resilient optimal defensive strategy with a distributed DRL method, which devotes itself to correcting false price information and making the optimal recovery strategy for fighting against the FDI attack. Numerical studies reveal that the distributed DRL algorithm provides a promising way for the optimal SG defense against cyberattacks. In [222], a DQN detection scheme is presented to defend against data integrity attacks in SG. Experimental results demonstrate that the developed method surpasses the existing DRL-based detection scheme in terms of accuracy and rapidity. In addition, an MADRL algorithm with prioritized experience replay is proposed to identify the critical lines under coordinated multistage cyberattacks, which contributes to deploying the limited defense resources optimally and mitigating the impact of cyberattacks.

2) Privacy Preservation: To an extent, the widespread deployment of advanced meters in SG has also raised serious concerns from the privacy perspective, which is
regarded as one of the main objections to SG modernization [223]. In fact, the fine-grained smart meter data carries sensitive information about consumers, posing a potential threat to privacy preservation. Traditional methods have been proposed for privacy preservation in SGs, such as data aggregation and encryption [224], data downsampling [225], and random noise addition [226]. However, these approaches may restrict the potential applications of SG data in an uncontrolled manner, e.g., time delay of fault detection and degradation of detection precision. In this regard, DRL is introduced to provide the optimal operational strategy while ensuring the privacy security of consumers.

When applying DRL to privacy protection in power systems, the design of the reward function can vary depending on the specific goals and requirements. For instance, Lee and Choi [227] proposed a privacy-preserving method based on federated RL for the energy management of smart homes with PV and ESS. It develops a novel distributed A2C model that contains a global server and local home energy management systems. First, A2C agents for local energy management systems construct and upload their models to the global server. After that, the global server aggregates the local models to update a global model and broadcasts it to the A2C agents. Finally, the A2C agents replace the previous local models with the global one and reconstruct their local models, iteratively. In this way, data sharing between local systems is prevented, thus preserving SG privacy. In [228], a distributed DRL algorithm is employed for devising the intelligent management of household power consumption. More specifically, the interactions of SGs and household appliances are established as a noncooperative game problem, which is addressed by the DPG algorithm considering privacy protection. In addition, a privacy-aware smart meter framework is investigated in [229] that utilizes the battery to hide the actual power consumption of a household. In detail, the problem of searching for the optimal charging/discharging policy that reduces information leakage with minimal additional energy cost is formulated as an MDP, which is handled by the DDQN with mutual information. As demonstrated by simulation studies, the developed algorithm achieves significant improvements over the state-of-the-art privacy-aware demand shaping approaches.

In [230], a novel federated learning framework is presented for privacy-preserving and communication-efficient energy data analysis in SG. On this basis, a DQN-based incentive algorithm with two layers is devised to offer optimal operational strategies. Extensive simulations validate that the designed scheme can significantly stimulate high-quality data sharing while preserving privacy. Wang et al. [231] proposed a data privacy-aware routing algorithm based on DDPG for communication issues in SGs, to realize latency reduction and load balancing. Experimental results show that the formulated privacy-aware routing protocol can effectively reduce the latency while maintaining excellent load balancing. A privacy-preserving Q-learning framework for SG energy management is formulated in [232], which is verified to be effective in energy management without privacy leakage. In addition, Zhang et al. [233] developed an intelligent demand response resource trading framework, in which the dueling DQN is constructed to simulate the bilevel Stackelberg game in a privacy-protecting way. Numerical experiments demonstrate that the designed approach has an outstanding performance in reducing energy cost as well as preserving privacy.

Liu et al. [234] presented a battery-based intermittently differential privacy scheme to realize privacy protection. Afterward, it develops a DDPG-based algorithm to offer the optimal battery control policy, in order to maintain the battery power level and realize cost saving. Case studies illustrate that the proposed method has a better performance in both cost saving and privacy preservation. A DQN-based technique is applied in [235] to keep the balance between privacy protection and knowledge discovery during SG data analysis. In [236], a hierarchical SAC-based energy trading scheme is presented in electricity markets, by which the prosumers' privacy concerns are tackled because the training process would only require the local observations. Extensive simulations validate that the proposed algorithm can effectively reduce the daily cost of prosumers without privacy leakage. In addition, a DDPG-based energy management approach is developed in [237] for integration in SG systems, which addresses the privacy issues via local data executions. Experimental results demonstrate that the proposed scheme can achieve good performances while preserving the data privacy.

Table 9 Applications of DRL on the Emerging Issues in SG

To conclude, this section reviews DRL applications to SG network security and privacy preservation. These methods are summarized in Table 9 along with the corresponding references. It is observed that the value-based DQN is the most popular DRL algorithm for managing network security, while policy-based DRL methods proposed for privacy preservation include both deterministic and stochastic policies. Furthermore, decentralized DRL frameworks for handling emerging SG issues receive more attention than other architectures, which is due to additional requirements for maintaining security and privacy. However, DRL applications are relatively inadequate for managing the SG emerging issues, which call for more investigation and exploration in the future. Although there have been numerous literature studies on DRL applications in SGs, many critical problems would still need to be addressed before their practical implementations. On the one hand, DRL applications to SG systems are still relatively new and require further research before maturity. On the other hand, it is necessary to reassess the DRL advantages and limitations in SG applications, which are among the most complex and critically engineered systems in the world. Although real-world DRL applications in SG operations are relatively limited, this technology holds great potential for SG applications, particularly in tackling complex decision-making and control problems.
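Returning to the federated RL workflow of [227] summarized earlier in this subsection (local agents upload their models, the server aggregates them into a global model, and the global model is broadcast back), a minimal sketch of the aggregation step is given below. The simple parameter averaging and the tiny stand-in policy network are assumptions for illustration, not the exact procedure of [227].

import copy
import torch
import torch.nn as nn

def make_local_policy():
    # Stand-in for a local home energy management network (e.g., the actor of an A2C agent).
    return nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))

def federated_average(local_models):
    # Server step: average the parameters of the uploaded local models into a global model.
    global_state = copy.deepcopy(local_models[0].state_dict())
    for key in global_state:
        stacked = torch.stack([m.state_dict()[key].float() for m in local_models])
        global_state[key] = stacked.mean(dim=0)
    return global_state

homes = [make_local_policy() for _ in range(3)]   # three local agents
# (local training on private household data would happen here; raw data never leaves the homes)
global_state = federated_average(homes)           # aggregation on the global server
for model in homes:                                # broadcast and replace the local models
    model.load_state_dict(global_state)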
Therefore, a comprehensive review of DRL applications in SG operations can help comprehend unsolved problems in this domain and provide guidance to promote its development, which is one of the intentions for drafting this survey article.

IV. CHALLENGES AND OPEN RESEARCH ISSUES

We have mentioned that the difficulty of SG operations mainly stems from strong uncertainty, the curse of dimensionality, and the lack of accurate models. As one of the model-free approaches, RL can deal with variable RE and uncertain load demand issues by interacting with the environment in the absence of sufficient knowledge data. In addition, the curse of dimensionality can be handled with DNN. Therefore, DRL shows great potential in addressing the pertinent SG operation issues. However, current DRL methods still have certain limitations, which are mainly due to their dependence on handcrafted reward functions. It is not easy to design a reward function that encourages the desirable behaviors. Furthermore, even the most reasonable reward function cannot avoid local optimality, which belongs to the typical exploration–exploitation dilemma and has puzzled DRL applications for a long time. Hence, a relatively comprehensive survey of DRL approaches, potential solutions, and future directions is discussed in this section.

A. Security Concerns

SGs are critical infrastructures in modern power systems, which can handle sustainable, secure, economic, and reliable power system operations. To this end, it is crucial for DRL algorithms to ensure secure decisions in the learned policy that would not lead to potentially catastrophic consequences in SGs. For instance, the control commands issued by DRL should not violate physical SG constraints that could possibly result in device failures, grid instability, or even system breakdown. At present, DRL studies can be divided into three categories, including modifications in optimization criteria, modifications in exploration processes [238], and offline DRL methods [239].

1) Modifying Optimization Criterion: In general, the purpose of DRL is primarily focused on maximizing long-term rewards without explicitly considering the potential harm caused by dangerous states to the agent. In other words, the objective function of traditional DRL does not incorporate a description of decision risks. Moreover, if the objective function is designed inadequately, the DRL agent may encounter safety issues. To this end, the transformation of the optimization criterion has been proposed to take the risk into account. This can be achieved through various approaches, such as directly penalizing infeasible solutions [240], penalizing the worst case scenario [241], or incorporating constrained optimality within the reward function [242]. For example, Qian et al. [110] incorporated constrained optimality within the reward function through using a Lagrangian function of power flow constraints.

2) Modifying Exploration Process: The unrestricted random exploration can potentially expose the agents to highly dangerous states. To prevent unforeseen and irreversible consequences, it is essential to evaluate the DRL agent security during training and deployment and restrict its exploration within permissible regions. Such methods can be categorized as the modification of exploration processes with a focus on ensuring security [243]. The modification can be achieved through various approaches, such as embedding external knowledge [244] and constraining exploration within a certified safe region [245]. Cui et al. [246] formulated the online preventive control problem for mitigating transmission overloads as a constrained MDP. The constrained MDP is then solved using the interior-point policy optimization, which promotes learning that can satisfy the pertinent constraints and improves the policy simultaneously.
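As a much simpler stand-in for the certified safe regions of [245], the sketch below restricts exploration by projecting (clipping) each exploratory action onto known operating limits before it is applied to the grid. The generator set-point limits used here are placeholders, not values from any cited study.

import numpy as np

def project_to_safe_region(action, lower_bounds, upper_bounds):
    # Clip the exploratory action elementwise onto the permissible operating range,
    # so the agent never issues a command outside known physical limits.
    return np.clip(action, lower_bounds, upper_bounds)

# Hypothetical limits on two generator set-point adjustments (MW).
lower, upper = np.array([-5.0, -5.0]), np.array([5.0, 5.0])
raw_action = np.array([7.3, -1.2]) + np.random.normal(0.0, 0.5, size=2)  # noisy exploration
safe_action = project_to_safe_region(raw_action, lower, upper)
print(safe_action)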
3) Offline DRL: The two categories for modifying the optimization criteria and exploration process are regarded as online DRL, where the agent learns how to perform tasks by continuously interacting with the environment. In contrast, offline DRL requires an agent that can learn solely from static offline datasets without exploration, thus ensuring the training safety from the perspective of data [247]. However, such approaches do not consider risk-related factors during policy deployment phases and, therefore, might not guarantee the security at the time of deployment [248].

In response to safety concerns, four related DRL variants are briefly introduced here, which include constrained DRL, adversarial DRL, robust DRL, and federated DRL as presented in the following.

1) Constrained DRL: It refers to the application of RL techniques to solve SG problems with explicit constraints. Generally, two types of constraints, soft and hard, are considered in the literature. Soft constraints allow for some degree of violation, whereas hard constraints must be strictly adhered to. On the one hand, there are common approaches to addressing soft constraints, including adjoining constraints to the reward through barrier or penalty functions and formulating constraints as chance constraints (i.e., setting a predefined threshold for the probability of constraint violation), or a budget constraint as follows [249]:

\max_{\pi} J(\pi), \quad \mathrm{s.t.} \quad J^{c}(\pi) \leq \bar{J}

where the agent's goal is to find a control policy \pi that maximizes the expected return with respect to the reward function J subject to a budget \bar{J} on the return with respect to the cost function J^{c} (a numerical sketch of this budget-constrained formulation is provided after item 4). However, constrained DRL methods that focus on soft constraints alone may not guarantee safe exploration during the training phase. In addition, even after training convergence, the control actions generated by the trained policy may not always be entirely safe [250]. On the other hand, the enforcement approach is to take conservative actions while dealing with hard constraints in constrained DRL [251]. Nevertheless, the enforcement approach usually results in significant conservatism and might have large errors for complex power networks.

2) Adversarial DRL: It involves training a DRL agent in the presence of adversarial agents or environments that actively try to disturb the learning process or achieve their own objectives [252]. Adversarial training has been applied to enhance DRL algorithms against adversarial attacks in managing SG cybersecurity. For instance, the attack and defense problems are formulated as an MDP and an adversarial MDP in [253], while the robust defense strategy is generated by adversarial training between attack and defense agents. In [254], a repeated game is formulated to mimic the real-world interactions between attackers and defenders in SGs. Furthermore, according to [255], it has been observed that a high-performing DRL agent, initially vulnerable to action perturbations, can be made more resilient against similar perturbations through the application of adversarial training. It is indeed worth mentioning that naively applying adversarial training may not be effective for all DRL tasks [256]. Adversarial training is a complex and challenging process that requires careful consideration and customization for each specific task.

3) Robust DRL: It incorporates robust optimization techniques to ensure that the learned policies remain effective even in the presence of uncertainties and perturbations, thereby improving the overall performance and stability of the DRL agent [257]. To be specific, robust DRL considers the worst case scenario or a min–max framework to learn a control policy that maximizes the reward with respect to the worst case scenario or outcome encountered during the learning process. By training against these worst case scenarios, the agent becomes more resilient and capable of making effective decisions even in the face of uncertainties or adversarial conditions. The utilization of the min–max structure in DRL algorithms has been a vibrant area of research. Previous studies primarily focus on addressing two types of uncertainties, i.e., inherent uncertainty stemming from the stochastic nature of the system and parameter uncertainty arising from incomplete knowledge about certain parameters of the MDP [258], [259]. While robust DRL has not yet received extensive attention in the context of SG, it holds significant potential as a future direction to tackle the diverse uncertainties present in the environment, such as model uncertainty, noise, and disturbances.

4) Federated DRL: The concern regarding SG security and privacy is one of the main obstacles in SG operations. However, most previous research on DRL applications in SG adopts a centralized method, which is vulnerable to cyberattacks and privacy leakage. To this end, federated learning is combined with DRL to meet the requirements of privacy preservation and network security [260]. By combining federated learning and DRL, federated DRL enables collaborative learning while preserving data privacy and reducing the communication overhead between the central server and distributed devices. For instance, Li et al. [261] proposed a federated MADRL algorithm via a physics-informed reward to solve the complex energy management of multiple microgrids with privacy concerns. Federated learning enables multiple agents to coordinately learn a shared decision model while keeping all the training data on device, thus preventing the risk of privacy leakage. What is more, the decentralized structure of federated learning offers a promising technique to reduce the pressure of centralized data storage. Therefore, it is meaningful to investigate the combination of federated learning and DRL in SG operations.
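As referenced in item 1), a common way to handle the budget-constrained objective in practice is a Lagrangian relaxation with dual ascent: the constraint cost is folded into the reward with a multiplier lambda, and lambda is increased whenever the estimated cost return exceeds the budget. The update rule and step size below are illustrative only and do not correspond to a particular cited algorithm.

def shaped_reward(reward, constraint_cost, lam):
    # Lagrangian relaxation: the agent maximizes r - lambda * c instead of r alone.
    return reward - lam * constraint_cost

def update_multiplier(lam, episode_cost_return, cost_budget, step_size=0.01):
    # Dual ascent: raise lambda when the cost return exceeds the budget J_bar,
    # and lower it (never below zero) when the constraint is satisfied.
    return max(0.0, lam + step_size * (episode_cost_return - cost_budget))

lam, cost_budget = 0.0, 10.0
for episode_cost_return in [14.0, 12.5, 9.0, 8.2]:   # cost returns measured over training episodes
    lam = update_multiplier(lam, episode_cost_return, cost_budget)
print(lam)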
data privacy and reducing the communication over- experts but also learn a generalized policy that can handle
head between the central server and distributed unseen situations. The combination of imitation learning
devices. For instance, Li et al. [261] proposed a fed- and RL is a very promising research field that has been
erated MADRL algorithm via the physics-informed extensively studied in recent years [262], [263]. It has
reward to solve the complex multiple microgrids been applied in various domains such as autonomous
energy management with privacy concern. Federated driving [264], quantitative trading [265], and the optimal
learning enables multiple agents to coordinately learn SG dispatch [266], to tackle the challenge of low learning
a shared decision model while keeping all the training efficiency in DRL. For example, Guo et al. [267] combined
data on device, thus preventing the risk of privacy DRL with imitation learning for cloud resource schedul-
leakage. What is more, the decentralized structure ing, where DRL is devoted to tackling the challenging
of federated learning offers a promising technique multiresource scheduling problem and imitation learning
to reduce the pressure of centralized data storage. enables an agent to learn an optimal policy more effi-
Therefore, it is meaningful to investigate a combina- ciently. In conclusion, the integration of imitation learning
tion of federated learning and DRL in SG operations. and DRL can provide a powerful learning framework that
enables fast learning, generalization, and effectiveness.
B. Sample Efficiency This combination is significant for addressing complex
Despite the success of DRL, it usually needs at least tasks and improving the learning capabilities of intelligent
thousands of samples to gradually learn some useful agents.
policies even for a simple task. However, the real-world
or real-time interactions between agent and environment
C. Learning Stability
are usually costly, and they still require time and energy
consumption even in the simulation platform. This brings Unlike the stable supervised learning, present DRL algo-
about a critical problem for DRL, i.e., how to design a rithms are volatile to a certain extent, which means that
more efficient algorithm to learn faster with fewer sam- there exist huge differences of the learning performances
ples. At present, most DRL algorithms are of such a low over time in horizontal comparisons across multiple runs.
learning efficiency that requires unbearable training time In specific, this learning instability over time generally
under current computational power. It is even worse for reflects as the large local variances or the nonmonotonicity
real-world interactions that potential problems of security on a single learning curve. As for unstable learning, it man-
concern, risks of failure cases, and time consumption all ifests a significant performance difference between differ-
put forward higher requirements on the learning efficiency ent runs during training, which leads to large variances
of DRL algorithms in practice. for horizontal comparisons. What is more, the endoge-
nous instability and unpredictability of DNN aggravate the
1) Model-Based DRL: Different from the aforementioned deviation of value function approximation, which further
model-free methods, model-based DRL generally indicates brings about noise in the gradient estimators and unstable
an agent not only learns a policy to estimate its action learning performance. Significant efforts have been ded-
but also learns a model of environment to assist its action icated to addressing the stability problem in DRL for a
planning, thus accelerating the speed of policy learn- considerable period of time. As mentioned in this article,
ing. Learning an accurate model of environment provides the utilization of a target network with delayed updates
additional pertinent information that could be helpful in and the incorporation of a replay buffer have been shown
evaluating the agent’s current policy, which can make the to mitigate the issue of unstable learning. In addition,
entire learning process more efficient. In principle, a good TRPO employs second-order optimization techniques to
model could handle a bunch of problems, as AlphaGo has provide more stable updates and comprehensive infor-
done. Therefore, it is meaningful to integrate model-based mation. It applies constraints to the updated policy to
and model-free DRLs and promote the sample efficiency ensure conservative yet stable improvements. However,
in SGs. On the one hand, model-based methods can be DRL remains sensitive to hyperparameters and initializa-
utilized as warm-starts or the nominal model, provid- tion even with the above works. This sensitivity poses a
ing initial information or serving as a foundation for significant challenge and highlights the need for further
model-free DRL methods. On the other hand, model-free research in this area to address these issues and improve
DRL algorithms can coordinate and fine-tune the param- the robustness and stability of DRL algorithms.
eters of existing model-based controllers to improve their
2) Imitation Learning Combined DRL: Imitation learning attempts to not only mimic the actions and choices of

C. Learning Stability

Unlike stable supervised learning, present DRL algorithms are volatile to a certain extent, which means that learning performance can differ greatly both over time and across multiple runs. Specifically, instability over time generally manifests as large local variances or as nonmonotonicity on a single learning curve, whereas instability across runs manifests as significant performance differences between training runs, which leads to large variances in horizontal comparisons. What is more, the endogenous instability and unpredictability of DNNs aggravate the deviation of the value function approximation, which further injects noise into the gradient estimators and degrades learning performance. Significant efforts have been dedicated to addressing the stability problem in DRL for a considerable period of time. As mentioned in this article, the utilization of a target network with delayed updates and the incorporation of a replay buffer have been shown to mitigate unstable learning. In addition, TRPO employs second-order optimization techniques to provide more stable and better-informed updates; it constrains the updated policy to ensure conservative yet stable improvements. However, DRL remains sensitive to hyperparameters and initialization even with the above works. This sensitivity poses a significant challenge and highlights the need for further research to improve the robustness and stability of DRL algorithms.
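As a concrete illustration of the two stabilizers just mentioned, the following DQN-style update sketch pairs a replay buffer, which breaks the correlation between consecutive samples, with a target network whose delayed synchronization keeps the regression target fixed between updates. It is a minimal sketch under assumed dimensions and hyperparameters, not the implementation of any particular surveyed work.

```python
# Illustrative DQN-style update showing the two stabilization mechanisms discussed
# above: (i) an experience replay buffer and (ii) a delayed target network.
# State/action sizes, network widths, and hyperparameters are assumptions.
import random, copy
from collections import deque
import numpy as np
import torch
import torch.nn as nn

state_dim, n_actions = 8, 5
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)                 # lagged copy of the online network
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=50_000)                     # transitions appended elsewhere as
gamma, batch_size, sync_every = 0.99, 64, 500     # buffer.append((s, a, r, s_next, float(done)))

def update(step):
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)     # decorrelated minibatch from replay
    s, a, r, s2, done = map(np.array, zip(*batch))
    s, s2 = torch.as_tensor(s, dtype=torch.float32), torch.as_tensor(s2, dtype=torch.float32)
    a, r = torch.as_tensor(a, dtype=torch.int64), torch.as_tensor(r, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)
    q = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)          # Q(s, a) for the taken actions
    with torch.no_grad():                                     # target net is held fixed here
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.smooth_l1_loss(q, target)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % sync_every == 0:                                # delayed (hard) synchronization
        target_net.load_state_dict(q_net.state_dict())
```

A soft update of the form target_param ← τ · online_param + (1 − τ) · target_param is a common alternative to the hard synchronization shown here.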


1) Multiagent DRL: With the development of DRL, MADRL has been proposed and has attracted much attention. In fact, MADRL is regarded as a promising direction worth exploring, since it provides a novel way to investigate unconventional DRL settings, including swarm intelligence, environments that are nonstationary from each agent's perspective, and the innovation of the agents themselves. MADRL not only makes it possible to explore distributed intelligence in multiagent environments but also contributes to learning near-optimal agent policies in large-scale SG applications. Overall, multiple agents and their interactions in MADRL can enhance the learning stability of DRL by promoting exploration, facilitating experience sharing, enabling policy coordination, and improving robustness to environmental changes. These characteristics make MADRL a promising approach for addressing the learning stability challenges in DRL.
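As one simple, concrete instance of the experience sharing noted above, the sketch below lets several agents act with a single shared policy network and pool their transitions in a common replay buffer. This is an illustrative fragment under toy assumptions (random observations, a made-up reward), not a full MADRL algorithm such as those built around centralized critics.

```python
# Parameter-sharing sketch for multiagent DRL: all agents query one shared policy and
# contribute their transitions to a single pooled buffer, so each update is informed by
# every agent's experience. Dimensions, rewards, and observations are toy assumptions.
import numpy as np
import torch
import torch.nn as nn
from collections import deque

n_agents, obs_dim, n_actions = 3, 6, 4
shared_policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
shared_buffer = deque(maxlen=100_000)             # experiences pooled across agents

def select_actions(observations, epsilon=0.1):
    """Each agent uses the same network; epsilon-greedy keeps their exploration diverse."""
    actions = []
    with torch.no_grad():
        for obs in observations:
            if np.random.rand() < epsilon:
                actions.append(np.random.randint(n_actions))
            else:
                q = shared_policy(torch.as_tensor(obs, dtype=torch.float32))
                actions.append(int(torch.argmax(q)))
    return actions

# One joint interaction step: every agent deposits its own transition into the shared
# buffer, which is the experience-sharing mechanism referred to in the text.
obs = [np.random.randn(obs_dim) for _ in range(n_agents)]
acts = select_actions(obs)
rewards = [-abs(a - 1) for a in acts]             # toy per-agent reward
next_obs = [np.random.randn(obs_dim) for _ in range(n_agents)]
for o, a, r, o2 in zip(obs, acts, rewards, next_obs):
    shared_buffer.append((o, a, r, o2))
print(len(shared_buffer), "pooled transitions after one joint step")
```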
D. Exploration

Different from the classical exploration-exploitation tradeoff in RL, exploration itself is another main challenge of DRL. The difficulty of exploration mainly stems from sparse reward functions, large action spaces, and unstable environments, as well as the security issues of exploring in the real world. First, sparse rewards may leave the value function and policy networks optimized over hypersurfaces that are nonconvex, nonsmooth, or even discontinuous. In such circumstances, the policy obtained after a one-step optimization may not effectively facilitate the exploration of higher reward regions, and the agent consequently struggles to discover trajectories that yield high rewards during its exploration phase. Second, a large action space also poses a challenge for exploration in DRL agents. For example, in electricity market bidding, the presence of a large action space makes it extremely difficult to explore an optimal policy. Third, an unstable environment likewise makes it difficult for agents to explore effectively. For instance, in multiplayer settings, opponents become part of the electricity market environment to some extent, which weakens the agent's exploration capacity. Finally, real-world security concerns in SG applications also raise exploration concerns for DRL. For instance, in the context of an SG controlled by an agent, it is crucial to learn from failure cases such as power outages, voltage fluctuations, and transmission congestion. However, collecting these failure cases directly from the SG is not feasible due to safety and operational concerns, and random exploratory actions are also not viable because they can potentially lead to catastrophic consequences. Therefore, alternative methods and strategies need to be employed to enable safe and effective exploration in these complex and high-stakes environments.
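One family of such strategies screens exploratory actions against a cheap feasibility model before they are applied, as in the illustrative sketch below; the per-action voltage effects and the 0.95-1.05 p.u. band are hypothetical placeholders for a proper operating-constraint model, not a method taken from the surveyed works.

```python
# Illustrative "safe exploration" sketch: candidate actions are filtered through an
# assumed feasibility screen, so even random exploration cannot issue commands that
# a simple surrogate predicts to be unsafe. All numbers below are assumptions.
import numpy as np

n_actions = 10                                          # e.g., discrete set-point adjustments
action_effects = np.linspace(-0.05, 0.05, n_actions)    # assumed per-action voltage shift (p.u.)

def is_safe(voltage_pu, action):
    """Reject actions whose predicted voltage leaves the assumed 0.95-1.05 p.u. band."""
    return 0.95 <= voltage_pu + action_effects[action] <= 1.05

def safe_epsilon_greedy(q_values, voltage_pu, epsilon=0.2, rng=np.random):
    safe = [a for a in range(n_actions) if is_safe(voltage_pu, a)]
    if not safe:                                        # fall back to the mildest action
        return int(np.argmin(np.abs(action_effects)))
    if rng.rand() < epsilon:                            # explore, but only inside the safe set
        return int(rng.choice(safe))
    masked = np.full(n_actions, -np.inf)
    masked[safe] = q_values[safe]                       # greedy choice restricted to safe actions
    return int(np.argmax(masked))

q_values = np.random.randn(n_actions)
print("selected action:", safe_epsilon_greedy(q_values, voltage_pu=1.04))
```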
E. Simulation to Reality

DRL has been successful in solving a wide range of optimization tasks in simulated environments and may even surpass human performance in some specific domains, such as the game of Go [268]. However, the challenges of applying DRL methods to real-world cases have not been fully addressed. As mentioned above, tasks involving hardware in the real world often have high demands for security and accuracy. Taking the SG as an example, a single operational error can result in catastrophic consequences. Moreover, most of the existing literature trains DRL policies solely on high-fidelity power system simulators and does not emphasize the gap between the simulators and real-world SG operations, i.e., the reality gap. Therefore, policies trained in simulators may not always exhibit reliable performance in real-world scenarios due to this reality gap. In general, methods for addressing the simulation-to-reality transfer fall into at least the two following approaches.
1) Meta-Learning-Based DRL: Meta-learning is devoted to improving learning performance by utilizing previous SG experience. More specifically, meta-learning leverages the acquired SG knowledge by extracting universal learning policies and knowledge to guide the learning process in a new task. In this way, it is viable to extend meta-learning to a new SG operation scenario, including the transfer from simulated environments to real-world systems. In addition, combining meta-learning with RL yields meta-RL methods, which can reduce the sensitivity to network parameters and enhance the robustness of the SG algorithm. On this basis, it would be quite desirable to extend meta-RL to its deep variants and apply them to SG operational problems.
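The central mechanism, learning an initialization that adapts to a new task within a few gradient steps, can be sketched with a first-order meta-learning loop in the style of Reptile; the toy regression tasks below merely stand in for distinct SG operating scenarios, and every size, learning rate, and step count is an illustrative assumption.

```python
# First-order meta-learning (Reptile-style) sketch: the meta-parameters are nudged toward
# weights adapted on each sampled task, yielding an initialization that fine-tunes quickly
# on a new task. Toy sinusoid-fitting tasks stand in for different SG scenarios.
import copy
import torch
import torch.nn as nn

def sample_task():
    """Each 'task' is a random sinusoid to fit, a stand-in for a new operating scenario."""
    amp, phase = torch.rand(1) * 2 + 0.5, torch.rand(1) * 3.14
    def batch(n=32):
        x = torch.rand(n, 1) * 10 - 5
        return x, amp * torch.sin(x + phase)
    return batch

meta_net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
meta_lr, inner_lr, inner_steps = 0.1, 1e-2, 10

for _ in range(200):                                  # outer loop over sampled tasks
    task = sample_task()
    fast_net = copy.deepcopy(meta_net)                # adapt a copy on the new task
    inner_opt = torch.optim.SGD(fast_net.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                      # inner-loop adaptation
        x, y = task()
        loss = nn.functional.mse_loss(fast_net(x), y)
        inner_opt.zero_grad(); loss.backward(); inner_opt.step()
    with torch.no_grad():                             # meta-update: move the initialization
        for p, p_fast in zip(meta_net.parameters(), fast_net.parameters()):
            p.add_(meta_lr * (p_fast - p))            # toward the task-adapted weights
```

In a meta-RL variant, the inner loop would perform policy updates on trajectories from the new scenario instead of the regression step shown here.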


2) Transfer Learning Combined DRL: Similar to the meta-learning approach, transfer learning emphasizes storing the knowledge acquired while solving one problem and applying it to a different but related one; its core is to find similarities between existing and new knowledge. Since learning the target domain from scratch is too expensive, transfer learning reuses existing knowledge to acquire new knowledge as quickly as possible. In particular, it can leverage knowledge from simulated tasks to speed up the learning process in real-world cases. Consequently, DRL can potentially be combined with transfer learning to handle the reality gap in SG operations.
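A common concrete form of this combination is to pretrain a policy network in the simulator and fine-tune only part of it on scarce real-world data, as in the sketch below; the architecture, the choice to freeze the feature layers, and the behavior-cloning-style objective on logged actions are all illustrative assumptions rather than a prescription from the surveyed works.

```python
# Illustrative transfer-learning sketch: copy weights from a simulator-trained policy,
# freeze its feature layers, and fine-tune only the output head on a small real-world
# dataset. The data here are synthetic placeholders; sizes are assumptions.
import torch
import torch.nn as nn

obs_dim, n_actions = 12, 6

class Policy(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                      nn.Linear(128, 64), nn.ReLU())
        self.head = nn.Linear(64, n_actions)
    def forward(self, x):
        return self.head(self.features(x))

sim_policy = Policy()                                  # assume: trained in the simulator
real_policy = Policy()
real_policy.load_state_dict(sim_policy.state_dict())  # transfer the learned knowledge
for p in real_policy.features.parameters():
    p.requires_grad = False                            # keep simulator-learned features fixed

opt = torch.optim.Adam(real_policy.head.parameters(), lr=1e-4)
real_obs = torch.randn(256, obs_dim)                   # placeholder for limited real data
logged_actions = torch.randint(0, n_actions, (256,))   # e.g., logged operator decisions
for _ in range(100):                                   # behavior-cloning-style fine-tuning
    loss = nn.functional.cross_entropy(real_policy(real_obs), logged_actions)
    opt.zero_grad(); loss.backward(); opt.step()
```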
V. CONCLUSION

In this article, we have presented a comprehensive review of DRL applications in SG operations. First, an overview of RL, DL, and DRL is provided. Then, various DRL techniques are divided into two categories according to the optimization policy, i.e., value-based and policy-based algorithms. On the one hand, typical value-based DRL algorithms, including DQN and its variants, are depicted with detailed theories. On the other hand, several popular policy-based algorithms, involving both stochastic and deterministic policies, are introduced with specific explanations. We then present exhaustive surveys, comparisons, and analyses of DRL approaches that address a variety of SG applications, covering optimal dispatch, operation control, electricity markets, and other emerging areas in power systems. Finally, essential challenges, potential solutions, and future research directions are discussed from the perspectives of safety concerns, sample efficiency, learning stability, exploration, simulation to reality, and so on. Furthermore, this article does not propose a dichotomy between DRL and conventional methods; instead, DRL can serve as a complement to existing approaches and enhance them in a data-driven manner. In summary, careful consideration should be devoted to identifying appropriate DRL application scenarios and utilizing DRL effectively in SG applications.

REFERENCES
[1] M. Liserre, M. A. Perez, M. Langwasser, [17] K. Arulkumaran, M. P. Deisenroth, M. Brundage, [35] M. Ravanelli, P. Brakel, M. Omologo, and
C. A. Rojas, and Z. Zhou, “Unlocking the hidden and A. A. Bharath, “Deep reinforcement learning: Y. Bengio, “Light gated recurrent units for speech
capacity of the electrical grid through smart A brief survey,” IEEE Signal Process. Mag., vol. 34, recognition,” IEEE Trans. Emerg. Topics Comput.
transformer and smart transmission,” Proc. IEEE, no. 6, pp. 26–38, Nov. 2017. Intell., vol. 2, no. 2, pp. 92–102, Apr. 2018.
vol. 111, no. 4, pp. 421–437, Apr. 2023. [18] N. Le, V. S. Rathour, K. Yamazaki, K. Luu, and [36] Y. LeCun et al., “Backpropagation applied to
[2] M. Chertkov and G. Andersson, “Multienergy M. Savvides, Deep Reinforcement Learning in handwritten zip code recognition,” Neural
systems,” Proc. IEEE, vol. 108, no. 9, Computer Vision: A Comprehensive Survey. Cham, Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989.
pp. 1387–1391, Sep. 2020. Switzerland: Springer, 2021. [37] J. N. Tsitsiklis and B. Van Roy, “An analysis of
[3] S. Geng, M. Vrakopoulou, and I. A. Hiskens, [19] X. Chen, G. Qu, Y. Tang, S. Low, and N. Li, temporal-difference learning with function
“Optimal capacity design and operation of energy “Reinforcement learning for selective key approximation,” IEEE Trans. Autom. Control,
hub systems,” Proc. IEEE, vol. 108, no. 9, applications in power systems: Recent advances vol. 42, no. 5, pp. 674–690, May 1997.
pp. 1475–1495, Sep. 2020. and future challenges,” IEEE Trans. Smart Grid, [38] J. Li, H. Wang, H. He, Z. Wei, Q. Yang, and P. Igic,
[4] M. Shahidehpour, M. Yan, P. Shikhar, vol. 13, no. 4, pp. 2935–2958, Jul. 2022. “Battery optimal sizing under a synergistic
S. Bahramirad, and A. Paaso, “Blockchain for [20] Z. Zhang, D. Zhang, and R. C. Qiu, “Deep framework with DQN-based power managements
peer-to-peer transactive energy trading in reinforcement learning for power system for the fuel cell hybrid powertrain,” IEEE Trans.
networked microgrids: Providing an effective and applications: An overview,” CSEE J. Power Energy Transport. Electrific., vol. 8, no. 1, pp. 36–47,
decentralized strategy,” IEEE Electrific. Mag., Syst., vol. 6, no. 1, pp. 213–225, Mar. 2020. Mar. 2022.
vol. 8, no. 4, pp. 80–90, Dec. 2020. [21] M. Glavic, “(Deep) reinforcement learning for [39] V. Mnih et al., “Human-level control through deep
[5] M. Shahidehpour, Z. Li, S. Bahramirad, Z. Li, and electric power system control and related reinforcement learning,” Nature, vol. 518,
W. Tian, “Networked microgrids: Exploring the problems: A short review and perspectives,” Annu. no. 7540, pp. 529–533, 2015.
possibilities of the IIT-Bronzeville grid,” IEEE Rev. Control, vol. 48, pp. 22–35, Jan. 2019. [40] A. Camacho, J. Varley, A. Zeng, D. Jain, A. Iscen,
Power Energy Mag., vol. 15, no. 4, pp. 63–71, [22] D. Cao et al., “Reinforcement learning and its and D. Kalashnikov, “Reward machines for
Jul. 2017. applications in modern power and energy vision-based robotic manipulation,” in Proc. IEEE
[6] S. Z. Tajalli, A. Kavousi-Fard, M. Mardaneh, systems: A review,” J. Modern Power Syst. Clean Int. Conf. Robot. Autom. (ICRA), May 2021,
A. Khosravi, and R. Razavi-Far, Energy, vol. 8, no. 6, pp. 1029–1042, Nov. 2020. pp. 14284–14290.
“Uncertainty-aware management of smart grids [23] T. Yang, L. Zhao, W. Li, and A. Y. Zomaya, [41] H. Hasselt, “Double Q-learning,” in Proc. Adv.
using cloud-based LSTM-prediction interval,” IEEE “Reinforcement learning in sustainable energy Neural Inf. Process. Syst., vol. 23, 2010,
Trans. Cybern., vol. 52, no. 10, pp. 9964–9977, and electric systems: A survey,” Annu. Rev. pp. 2613–2621.
Oct. 2022. Control, vol. 49, pp. 145–163, Jan. 2020. [42] H. Van Hasselt, A. Guez, and D. Silver, “Deep
[7] X. Xia, Y. Xiao, W. Liang, and J. Cui, “Detection [24] A. T. D. Perera and P. Kamalaruban, “Applications reinforcement learning with double Q-learning,”
methods in smart meters for electricity thefts: A of reinforcement learning in energy systems,” in Proc. AAAI Conf. Artif. Intell., vol. 30, no. 1,
survey,” Proc. IEEE, vol. 110, no. 2, pp. 273–319, Renew. Sustain. Energy Rev., vol. 137, Mar. 2021, 2016, pp. 1–13.
Feb. 2022. Art. no. 110618. [43] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and
[8] L. Duchesne, E. Karangelos, and L. Wehenkel, [25] L. Yu, S. Qin, M. Zhang, C. Shen, T. Jiang, and M. Bennis, “Optimized computation offloading
“Recent developments in machine learning for X. Guan, “A review of deep reinforcement learning performance in virtual edge computing systems
energy systems reliability management,” Proc. for smart building energy management,” IEEE via deep reinforcement learning,” IEEE Internet
IEEE, vol. 108, no. 9, pp. 1656–1676, Sep. 2020. Internet Things J., vol. 8, no. 15, Things J., vol. 6, no. 3, pp. 4005–4018, Jun. 2019.
[9] Y. Yuan et al., “Data driven discovery of cyber pp. 12046–12063, Aug. 2021. [44] Z. Wang, T. Schaul, M. Hessel, H. Hasselt,
physical systems,” Nature Commun., vol. 10, no. 1, [26] D. Zhang, X. Han, and C. Deng, “Review on the M. Lanctot, and N. Freitas, “Dueling network
p. 4894, Oct. 2019. research and practice of deep learning and architectures for deep reinforcement learning,” in
[10] H. L. Liao, Q. H. Wu, Y. Z. Li, and L. Jiang, reinforcement learning in smart grids,” CSEE J. Proc. Int. Conf. Mach. Learn., 2016,
“Economic emission dispatching with variations of Power Energy Syst., vol. 4, no. 3, pp. 362–370, pp. 1995–2003.
wind power and loads using multi-objective Sep. 2018. [45] R. S. Sutton, D. McAllester, S. Singh, and
optimization by learning automata,” Energy [27] L. Zeng, M. Sun, X. Wan, Z. Zhang, R. Deng, and Y. Mansour, “Policy gradient methods for
Convers. Manage., vol. 87, pp. 990–999, Y. Xu, “Physics-constrained vulnerability reinforcement learning with function
Nov. 2014. assessment of deep reinforcement learning-based approximation,” in Proc. Adv. Neural Inf. Process.
[11] W. Samek, G. Montavon, S. Lapuschkin, SCOPF,” IEEE Trans. Power Syst., vol. 38, no. 3, Syst., vol. 12, 1999, pp. 1057–1063.
C. J. Anders, and K.-R. Müller, “Explaining deep pp. 2690–2704, May 2023. [46] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,”
neural networks and beyond: A review of methods [28] T. T. Nguyen and V. J. Reddi, “Deep reinforcement in Proc. Adv. Neural Inf. Process. Syst., vol. 12,
and applications,” Proc. IEEE, vol. 109, no. 3, learning for cyber security,” IEEE Trans. Neural 1999, pp. 1008–1014.
pp. 247–278, Mar. 2021. Netw. Learn. Syst., vol. 34, no. 8, pp. 3779–3795, [47] A. G. Barto, R. S. Sutton, and C. W. Anderson,
[12] Y. Li et al., “Dense skip attention based deep Aug. 2023. “Neuronlike adaptive elements that can solve
learning for day-ahead electricity price [29] T. Ding, Z. Zeng, J. Bai, B. Qin, Y. Yang, and difficult learning control problems,” IEEE Trans.
forecasting,” IEEE Trans. Power Syst., vol. 38, M. Shahidehpour, “Optimal electric vehicle Syst. Man, Cybern., vol. SMC-13, no. 5,
no. 5, pp. 4308–4327, Sep. 2023. charging strategy with Markov decision process pp. 834–846, Sep. 1983.
[13] M. Lapan, Deep Reinforcement Learning Hands-On: and reinforcement learning technique,” IEEE [48] V. Mnih et al., “Asynchronous methods for deep
Apply Modern RL Methods to Practical Problems of Trans. Ind. Appl., vol. 56, no. 5, pp. 5811–5823, reinforcement learning,” in Proc. Int. Conf. Mach.
Chatbots, Robotics, Discrete Optimization, Web Sep. 2020. Learn., 2016, pp. 1928–1937.
Automation, and More. Birmingham, U.K.: Packt [30] H. Dong, Z. Ding, and S. Zhang, Deep [49] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and
Publishing Ltd, 2020. Reinforcement Learning. Cham, Switzerland: P. Moritz, “Trust region policy optimization,” in
[14] N. C. Luong et al., “Applications of deep Springer, 2020. Proc. Int. Conf. Mach. Learn., 2015,
reinforcement learning in communications and [31] S. Dreyfus, “Richard Bellman on the birth of pp. 1889–1897.
networking: A survey,” IEEE Commun. Surveys dynamic programming,” Oper. Res., vol. 50, no. 1, [50] S. Kakade and J. Langford, “Approximately
Tuts., vol. 21, no. 4, pp. 3133–3174, pp. 48–51, Feb. 2002. optimal approximate reinforcement learning,” in
4th Quart., 2019. [32] J. Hu, H. Niu, J. Carrasco, B. Lennox, and F. Arvin, Proc. 19th Int. Conf. Mach. Learn., 2002,
[15] W. Chen, X. Qiu, T. Cai, H.-N. Dai, Z. Zheng, and “Voronoi-based multi-robot autonomous pp. 267–274.
Y. Zhang, “Deep reinforcement learning for exploration in unknown environments via deep [51] J. Achiam, D. Held, A. Tamar, and P. Abbeel,
Internet of Things: A comprehensive survey,” IEEE reinforcement learning,” IEEE Trans. Veh. Technol., “Constrained policy optimization,” in Proc. Int.
Commun. Surveys Tuts., vol. 23, no. 3, vol. 69, no. 12, pp. 14413–14423, Dec. 2020. Conf. Mach. Learn., 2017, pp. 22–31.
pp. 1659–1692, 3rd Quart., 2021. [33] C. J. C. H. Watkins and P. Dayan, “Q-learning,” [52] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex
[16] Y. Keneshloo, T. Shi, N. Ramakrishnan, and Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, 1992. Optimization. Cambridge, U.K.: Cambridge Univ.
C. K. Reddy, “Deep reinforcement learning for [34] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, Press, 2004.
sequence-to-sequence models,” IEEE Trans. Neural “Gradient-based learning applied to document [53] J. Schulman, F. Wolski, P. Dhariwal, A. Radford,
Netw. Learn. Syst., vol. 31, no. 7, pp. 2469–2489, recognition,” Proc. IEEE, vol. 86, no. 11, and O. Klimov, “Proximal policy optimization
Jul. 2020. pp. 2278–2324, Nov. 1998. algorithms,” 2017, arXiv:1707.06347.


[54] R. I. Bot, S. M. Grad, and G. Wanka, Duality in pp. 3–19, Jan. 2021. Jun. 2022.
Vector Optimization. Cham, Switzerland: Springer, [73] Y. Ding, B. Wang, Y. Wang, K. Zhang, and [91] S. H. Oh, Y. T. Yoon, and S. W. Kim, “Online
2009. H. Wang, “Secure metering data aggregation with reconfiguration scheme of self-sufficient
[55] N. Heess et al., “Emergence of locomotion batch verification in industrial smart grid,” IEEE distribution network based on a reinforcement
behaviours in rich environments,” 2017, Trans. Ind. Informat., vol. 16, no. 10, learning approach,” Appl. Energy, vol. 280,
arXiv:1707.02286. pp. 6607–6616, Oct. 2020. Dec. 2020, Art. no. 115900.
[56] J. Booth, “PPO dash: Improving generalization in [74] K. Kaur, G. Kaddoum, and S. Zeadally, [92] Y. Gao, W. Wang, J. Shi, and N. Yu,
deep reinforcement learning,” 2019, “Blockchain-based cyber-physical security for “Batch-constrained reinforcement learning for
arXiv:1907.06704. electrical vehicle aided smart grid ecosystem,” dynamic distribution network reconfiguration,”
[57] C.-Y. Tang, C.-H. Liu, W.-K. Chen, and S. D. You, IEEE Trans. Intell. Transp. Syst., vol. 22, no. 8, IEEE Trans. Smart Grid, vol. 11, no. 6,
“Implementing action mask in proximal policy pp. 5178–5189, Aug. 2021. pp. 5357–5369, Nov. 2020.
optimization (PPO) algorithm,” ICT Exp., vol. 6, [75] M. B. Gough, S. F. Santos, T. AlSkaif, M. S. Javadi, [93] S. Bahrami, Y. C. Chen, and V. W. S. Wong, “Deep
no. 3, pp. 200–203, Sep. 2020. R. Castro, and J. P. S. Catalão, “Preserving privacy reinforcement learning for demand response in
[58] D. Silver, G. Lever, N. Heess, T. Degris, of smart meter data in a smart grid environment,” distribution networks,” IEEE Trans. Smart Grid,
D. Wierstra, and M. Riedmiller, “Deterministic IEEE Trans. Ind. Informat., vol. 18, no. 1, vol. 12, no. 2, pp. 1496–1506, Mar. 2021.
policy gradient algorithms,” in Proc. Int. Conf. pp. 707–718, Jan. 2022. [94] N. L. Dehghani, A. B. Jeddi, and A. Shafieezadeh,
Mach. Learn., 2014, pp. 387–395. [76] Y. Li, Y. Zhao, L. Wu, and Z. Zeng, Artificial “Intelligent hurricane resilience enhancement of
[59] S. Fujimoto, H. Hoof, and D. Meger, “Addressing Intelligence Enabled Computational Methods for power distribution systems via deep
function approximation error in actor-critic Smart Grid Forecast and Dispatch. Cham, reinforcement learning,” Appl. Energy, vol. 285,
methods,” in Proc. Int. Conf. Mach. Learn., 2018, Switzerland: Springer, 2023. Mar. 2021, Art. no. 116355.
pp. 1587–1596. [77] A. M. Fathabad, J. Cheng, K. Pan, and F. Qiu, [95] Y. Li et al., “Optimal operation of multimicrogrids
[60] A. Navas, J. S. Gómez, J. Llanos, E. Rute, D. Sáez, “Data-driven planning for renewable distributed via cooperative energy and reserve scheduling,”
and M. Sumner, “Distributed predictive control generation integration,” IEEE Trans. Power Syst., IEEE Trans. Ind. Informat., vol. 14, no. 8,
strategy for frequency restoration of microgrids vol. 35, no. 6, pp. 4357–4368, Nov. 2020. pp. 3459–3468, Aug. 2018.
considering optimal dispatch,” IEEE Trans. Smart [78] K. Utkarsh, D. Srinivasan, A. Trivedi, W. Zhang, [96] M. Mahmoodi, P. Shamsi, and B. Fahimi,
Grid, vol. 12, no. 4, pp. 2748–2759, Jul. 2021. and T. Reindl, “Distributed model-predictive “Economic dispatch of a hybrid microgrid with
[61] Z. Chen, J. Zhu, H. Dong, W. Wu, and H. Zhu, real-time optimal operation of a network of smart distributed energy storage,” IEEE Trans. Smart
“Optimal dispatch of WT/PV/ES combined microgrids,” IEEE Trans. Smart Grid, vol. 10, no. 3, Grid, vol. 6, no. 6, pp. 2607–2614, Nov. 2015.
generation system based on cyber-physical-social pp. 2833–2845, May 2019. [97] Y. Shi, S. Dong, C. Guo, Z. Chen, and L. Wang,
integration,” IEEE Trans. Smart Grid, vol. 13, [79] Y. Liu, L. Guo, and C. Wang, “A robust “Enhancing the flexibility of storage integrated
no. 1, pp. 342–354, Jan. 2022. operation-based scheduling optimization for smart power system by multi-stage robust dispatch,”
[62] T. Wu, C. Zhao, and Y. A. Zhang, “Distributed distribution networks with multi-microgrids,” IEEE Trans. Power Syst., vol. 36, no. 3,
AC–DC optimal power dispatch of VSC-based Appl. Energy, vol. 228, pp. 130–140, Oct. 2018. pp. 2314–2322, May 2021.
energy routers in smart microgrids,” IEEE Trans. [80] C. Guo, F. Luo, Z. Cai, and Z. Y. Dong, “Integrated [98] Y. Li et al., “Day-ahead risk averse market clearing
Power Syst., vol. 36, no. 5, pp. 4457–4470, energy systems of data centers and smart grids: considering demand response with data-driven
Sep. 2021. State-of-the-art and future opportunities,” Appl. load uncertainty representation: A Singapore
[63] Z. Zhang, C. Wang, H. Lv, F. Liu, H. Sheng, and Energy, vol. 301, Nov. 2021, Art. no. 117474. electricity market study,” Energy, vol. 254,
M. Yang, “Day-ahead optimal dispatch for [81] Z. J. Lee et al., “Adaptive charging networks: A Sep. 2022, Art. no. 123923.
integrated energy system considering framework for smart electric vehicle charging,” [99] A. Dridi, H. Afifi, H. Moungla, and J. Badosa,
power-to-gas and dynamic pipeline networks,” IEEE Trans. Smart Grid, vol. 12, no. 5, “A novel deep reinforcement approach for IIoT
IEEE Trans. Ind. Appl., vol. 57, no. 4, pp. 4339–4350, Sep. 2021. microgrid energy management systems,” IEEE
pp. 3317–3328, Jul. 2021. [82] C. Li, Z. Dong, G. Chen, B. Zhou, J. Zhang, and Trans. Green Commun. Netw., vol. 6, no. 1,
[64] Md. R. Islam, H. Lu, Md. R. Islam, M. J. Hossain, X. Yu, “Data-driven planning of electric vehicle pp. 148–159, Mar. 2022.
and L. Li, “An IoT-based decision support tool for charging infrastructure: A case study of Sydney, [100] Md. S. Munir, S. F. Abedin, N. H. Tran, Z. Han,
improving the performance of smart grids Australia,” IEEE Trans. Smart Grid, vol. 12, no. 4, E.-N. Huh, and C. S. Hong, “Risk-aware energy
connected with distributed energy sources and pp. 3289–3304, Jul. 2021. scheduling for edge computing with microgrid: A
electric vehicles,” IEEE Trans. Ind. Appl., vol. 56, [83] B. Zhou et al., “Optimal coordination of electric multi-agent deep reinforcement learning
no. 4, pp. 4552–4562, Jul. 2020. vehicles for virtual power plants with dynamic approach,” IEEE Trans. Netw. Service Manage.,
[65] X. Sun and J. Qiu, “Hierarchical voltage control communication spectrum allocation,” IEEE Trans. vol. 18, no. 3, pp. 3476–3497, Sep. 2021.
strategy in distribution networks considering Ind. Informat., vol. 17, no. 1, pp. 450–462, [101] L. Lei, Y. Tan, G. Dahlenburg, W. Xiang, and
customized charging navigation of electric Jan. 2021. K. Zheng, “Dynamic energy dispatch based on
vehicles,” IEEE Trans. Smart Grid, vol. 12, no. 6, [84] D. Cao, W. Hu, J. Zhao, Q. Huang, Z. Chen, and deep reinforcement learning in IoT-driven smart
pp. 4752–4764, Nov. 2021. F. Blaabjerg, “A multi-agent deep reinforcement isolated microgrids,” IEEE Internet Things J.,
[66] L. Xi, L. Zhang, Y. Xu, S. Wang, and C. Yang, learning based voltage regulation using vol. 8, no. 10, pp. 7938–7953, May 2021.
“Automatic generation control based on coordinated PV inverters,” IEEE Trans. Power Syst., [102] F. Sanchez Gorostiza and F. M. Gonzalez-Longatt,
multiple-step greedy attribute and multiple-level vol. 35, no. 5, pp. 4120–4123, Sep. 2020. “Deep reinforcement learning-based controller for
allocation strategy,” CSEE J. Power Energy Syst., [85] P. Kou, D. Liang, C. Wang, Z. Wu, and L. Gao, SOC management of multi-electrical energy
vol. 8, no. 1, pp. 281–292, Jan. 2022. “Safe deep reinforcement learning-based storage system,” IEEE Trans. Smart Grid, vol. 11,
[67] K. S. Xiahou, Y. Liu, and Q. H. Wu, “Robust load constrained optimal control scheme for active no. 6, pp. 5039–5050, Nov. 2020.
frequency control of power systems against distribution networks,” Appl. Energy, vol. 264, [103] T. Chen, S. Bu, X. Liu, J. Kang, F. R. Yu, and
random time-delay attacks,” IEEE Trans. Smart Apr. 2020, Art. no. 114772. Z. Han, “Peer-to-peer energy trading and energy
Grid, vol. 12, no. 1, pp. 909–911, Jan. 2021. [86] W. Wang, N. Yu, Y. Gao, and J. Shi, “Safe off-policy conversion in interconnected multi-energy
[68] K.-D. Lu, G.-Q. Zeng, X. Luo, J. Weng, Y. Zhang, deep reinforcement learning algorithm for microgrids using multi-agent deep reinforcement
and M. Li, “An adaptive resilient load frequency Volt-VAR control in power distribution systems,” learning,” IEEE Trans. Smart Grid, vol. 13, no. 1,
controller for smart grids with DoS attacks,” IEEE IEEE Trans. Smart Grid, vol. 11, no. 4, pp. 715–727, Jan. 2022.
Trans. Veh. Technol., vol. 69, no. 5, pp. 3008–3018, Jul. 2020. [104] H. Hua et al., “Data-driven dynamical control for
pp. 4689–4699, May 2020. [87] H. Liu and W. Wu, “Two-stage deep reinforcement bottom-up energy Internet system,” IEEE Trans.
[69] B. Hu, Y. Gong, C. Y. Chung, B. F. Noble, and learning for inverter-based Volt-VAR control in Sustain. Energy, vol. 13, no. 1, pp. 315–327,
G. Poelzer, “Price-maker bidding and offering active distribution networks,” IEEE Trans. Smart Jan. 2022.
strategies for networked microgrids in day-ahead Grid, vol. 12, no. 3, pp. 2037–2047, May 2021. [105] Y. Li, R. Wang, and Z. Yang, “Optimal scheduling
electricity markets,” IEEE Trans. Smart Grid, [88] X. Sun and J. Qiu, “Two-stage Volt/Var control in of isolated microgrids using automated
vol. 12, no. 6, pp. 5201–5211, Nov. 2021. active distribution networks with multi-agent reinforcement learning-based multi-period
[70] H. Haghighat, H. Karimianfard, and B. Zeng, deep reinforcement learning method,” IEEE Trans. forecasting,” IEEE Trans. Sustain. Energy, vol. 13,
“Integrating energy management of autonomous Smart Grid, vol. 12, no. 4, pp. 2903–2912, no. 1, pp. 159–169, Jan. 2022.
smart grids in electricity market operation,” IEEE Jul. 2021. [106] Y. Du and F. Li, “Intelligent multi-microgrid energy
Trans. Smart Grid, vol. 11, no. 5, pp. 4044–4055, [89] Q. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, management based on deep neural network and
Sep. 2020. and J. Sun, “Two-timescale voltage control in model-free reinforcement learning,” IEEE Trans.
[71] A. Paudel, L. P. M. I. Sampath, J. Yang, and distribution grids using deep reinforcement Smart Grid, vol. 11, no. 2, pp. 1066–1076,
H. B. Gooi, “Peer-to-peer energy trading in smart learning,” IEEE Trans. Smart Grid, vol. 11, no. 3, Mar. 2020.
grid considering power losses and network fees,” pp. 2313–2323, May 2020. [107] Z. Qin, D. Liu, H. Hua, and J. Cao, “Privacy
IEEE Trans. Smart Grid, vol. 11, no. 6, [90] Y. Li, G. Hao, Y. Liu, Y. Yu, Z. Ni, and Y. Zhao, preserving load control of residential microgrid
pp. 4727–4737, Nov. 2020. “Many-objective distribution network via deep reinforcement learning,” IEEE Trans.
[72] P. Zhuang, T. Zamir, and H. Liang, “Blockchain for reconfiguration via deep reinforcement learning Smart Grid, vol. 12, no. 5, pp. 4079–4089,
cybersecurity in smart grid: A comprehensive assisted optimization algorithm,” IEEE Trans. Sep. 2021.
survey,” IEEE Trans. Ind. Informat., vol. 17, no. 1, Power Del., vol. 37, no. 3, pp. 2230–2244, [108] Y. Li et al., “Coordinated scheduling for improving


uncertain wind power adsorption in electric [125] G. Zhang et al., “Data-driven optimal energy interconnected power grid with various renewable
vehicles—Wind integrated power systems by management for a wind-solar-diesel-battery- units,” IET Renew. Power Gener., vol. 16, no. 7,
multiobjective optimization approach,” IEEE reverse osmosis hybrid energy system using a pp. 1316–1335, May 2022.
Trans. Ind. Appl., vol. 56, no. 3, pp. 2238–2250, deep reinforcement learning approach,” Energy [142] Q. Zhang, H. Tang, Z. Wang, X. Wu, and K. Lv,
May 2020. Convers. Manage., vol. 227, Jan. 2021, “Flexible selection framework for secondary
[109] Y. Li, S. He, Y. Li, L. Ge, S. Lou, and Z. Zeng, Art. no. 113608. frequency regulation units based on learning
“Probabilistic charging power forecast of EVCS: [126] Y. Li, F. Bu, Y. Li, and C. Long, “Optimal scheduling optimisation method,” Int. J. Electr. Power Energy
Reinforcement learning assisted deep learning of island integrated energy systems considering Syst., vol. 142, Nov. 2022, Art. no. 108175.
approach,” IEEE Trans. Intell. Vehicles, vol. 8, multi-uncertainties and hydrothermal [143] L. Yin, L. Zhao, T. Yu, and X. Zhang, “Deep forest
no. 1, pp. 344–357, Jan. 2023. simultaneous transmission: A deep reinforcement reinforcement learning for preventive strategy
[110] T. Qian, C. Shao, X. Wang, Q. Zhou, and learning approach,” Appl. Energy, vol. 333, considering automatic generation control in
M. Shahidehpour, “Shadow-price DRL: A Mar. 2023, Art. no. 120540. large-scale interconnected power systems,” Appl.
framework for online scheduling of shared [127] S. Zhou et al., “Combined heat and power system Sci., vol. 8, no. 11, p. 2185, Nov. 2018.
autonomous EVs fleets,” IEEE Trans. Smart Grid, intelligent economic dispatch: A deep [144] J. J. Yang, M. Yang, M. X. Wang, P. J. Du, and
vol. 13, no. 4, pp. 3106–3117, Jul. 2022. reinforcement learning approach,” Int. J. Electr. Y. X. Yu, “A deep reinforcement learning method
[111] J. Zhang, Y. Guan, L. Che, and M. Shahidehpour, Power Energy Syst., vol. 120, Sep. 2020, for managing wind farm uncertainties through
“EV charging command fast allocation approach Art. no. 106016. energy storage system control and external
based on deep reinforcement learning with safety [128] S. Zhong et al., “Deep reinforcement learning reserve purchasing,” Int. J. Electr. Power Energy
modules,” IEEE Trans. Smart Grid, early access, framework for dynamic pricing demand response Syst., vol. 119, Jul. 2020, Art. no. 105928.
Jun. 5, 2023, doi: 10.1109/TSG.2023.3281782. of regenerative electric heating,” Appl. Energy, [145] V. P. Singh, N. Kishor, and P. Samuel, “Distributed
[112] C. Zhang, Y. Liu, F. Wu, B. Tang, and W. Fan, vol. 288, Apr. 2021, Art. no. 116623. multi-agent system-based load frequency control
“Effective charging planning based on deep [129] Y. Ye, D. Qiu, X. Wu, G. Strbac, and J. Ward, for multi-area power system in smart grid,” IEEE
reinforcement learning for electric vehicles,” IEEE “Model-free real-time autonomous control for a Trans. Ind. Electron., vol. 64, no. 6,
Trans. Intell. Transp. Syst., vol. 22, no. 1, residential multi-energy system using deep pp. 5151–5160, Jun. 2017.
pp. 542–554, Jan. 2021. reinforcement learning,” IEEE Trans. Smart Grid, [146] H. Wang, Z. Lei, X. Zhang, J. Peng, and H. Jiang,
[113] B. Lin, B. Ghaddar, and J. Nathwani, “Deep vol. 11, no. 4, pp. 3068–3082, Jul. 2020. “Multiobjective reinforcement learning-based
reinforcement learning for the electric vehicle [130] J. Li, T. Yu, and X. Zhang, “Coordinated load intelligent approach for optimization of activation
routing problem with time windows,” IEEE Trans. frequency control of multi-area integrated energy rules in automatic generation control,” IEEE
Intell. Transp. Syst., vol. 23, no. 8, system using multi-agent deep reinforcement Access, vol. 7, pp. 17480–17492, 2019.
pp. 11528–11538, Aug. 2022. learning,” Appl. Energy, vol. 306, Jan. 2022, [147] S. Hasanvand, M. Rafiei, M. Gheisarnejad, and
[114] F. Zhang, Q. Yang, and D. An, “CDDPG: A Art. no. 117900. M.-H. Khooban, “Reliable power scheduling of an
deep-reinforcement-learning-based approach for [131] B. Yang, X. Zhang, T. Yu, H. Shu, and Z. Fang, emission-free ship: Multiobjective deep
electric vehicle charging control,” IEEE Internet “Grouped grey wolf optimizer for maximum reinforcement learning,” IEEE Trans. Transport.
Things J., vol. 8, no. 5, pp. 3075–3087, Mar. 2021. power point tracking of doubly-fed induction Electrific., vol. 6, no. 2, pp. 832–843, Jun. 2020.
[115] A. A. Zishan, M. M. Haji, and O. Ardakanian, generator based wind turbine,” Energy Convers. [148] S. Wang et al., “A data-driven multi-agent
“Adaptive congestion control for electric vehicle Manage., vol. 133, pp. 427–443, Feb. 2017. autonomous voltage control framework using
charging in the smart grid,” IEEE Trans. Smart [132] Q. Sun, R. Fan, Y. Li, B. Huang, and D. Ma, deep reinforcement learning,” IEEE Trans. Power
Grid, vol. 12, no. 3, pp. 2439–2449, May 2021. “A distributed double-consensus algorithm for Syst., vol. 35, no. 6, pp. 4644–4654, Nov. 2020.
[116] H. Li, Z. Wan, and H. He, “Constrained EV residential We-Energy,” IEEE Trans. Ind. Informat., [149] J. Duan et al., “Deep-reinforcement-learning-
charging scheduling based on safe deep vol. 15, no. 8, pp. 4830–4842, Aug. 2019. based autonomous voltage control for power grid
reinforcement learning,” IEEE Trans. Smart Grid, [133] W. Fu, K. Wang, J. Tan, and K. Zhang, operations,” IEEE Trans. Power Syst., vol. 35, no. 1,
vol. 11, no. 3, pp. 2427–2439, May 2020. “A composite framework coupling multiple feature pp. 814–817, Jan. 2020.
[117] T. Wu et al., “Multi-agent deep reinforcement selection, compound prediction models and novel [150] D. Cao et al., “Model-free voltage control of active
learning for urban traffic light control in vehicular hybrid swarm optimizer-based synchronization distribution system with PVs using surrogate
networks,” IEEE Trans. Veh. Technol., vol. 69, optimization strategy for multi-step ahead model-based deep reinforcement learning,” Appl.
no. 8, pp. 8243–8256, Aug. 2020. short-term wind speed forecasting,” Energy Energy, vol. 306, Jan. 2022, Art. no. 117982.
[118] T. Qian, C. Shao, X. Li, X. Wang, Z. Chen, and Convers. Manage., vol. 205, Feb. 2020, [151] C. Cui, N. Yan, B. Huangfu, T. Yang, and C. Zhang,
M. Shahidehpour, “Multi-agent deep Art. no. 112461. “Voltage regulation of DC–DC buck converters
reinforcement learning method for EV charging [134] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and feeding CPLs via deep reinforcement learning,”
station game,” IEEE Trans. Power Syst., vol. 37, F. L. Lewis, “Optimal and autonomous control IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 69,
no. 3, pp. 1682–1694, May 2022. using reinforcement learning: A survey,” IEEE no. 3, pp. 1777–1781, Mar. 2022.
[119] T. Qian, C. Shao, X. Wang, and M. Shahidehpour, Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, [152] S. Wang, L. Du, X. Fan, and Q. Huang, “Deep
“Deep reinforcement learning for EV charging pp. 2042–2062, Jun. 2018. reinforcement scheduling of energy storage
navigation by coordinating smart grid and [135] S. Vijayshankar, P. Stanfel, J. King, E. Spyrou, and systems for real-time voltage regulation in
intelligent transportation system,” IEEE Trans. K. Johnson, “Deep reinforcement learning for unbalanced LV networks with high PV
Smart Grid, vol. 11, no. 2, pp. 1714–1723, automatic generation control of wind farms,” in penetration,” IEEE Trans. Sustain. Energy, vol. 12,
Mar. 2020. Proc. Amer. Control Conf. (ACC), May 2021, no. 4, pp. 2342–2352, Oct. 2021.
[120] L. Yan, X. Chen, J. Zhou, Y. Chen, and J. Wen, pp. 1796–1802. [153] D. Cao, J. Zhao, W. Hu, F. Ding, Q. Huang, and
“Deep reinforcement learning for continuous [136] J. Li, T. Yu, and X. Zhang, “Coordinated automatic Z. Chen, “Attention enabled multi-agent DRL for
electric vehicles charging control with dynamic generation control of interconnected power decentralized Volt-VAR control of active
user behaviors,” IEEE Trans. Smart Grid, vol. 12, system with imitation guided exploration distribution system using PV inverters and SVCs,”
no. 6, pp. 5124–5134, Nov. 2021. multi-agent deep reinforcement learning,” Int. J. IEEE Trans. Sustain. Energy, vol. 12, no. 3,
[121] E. A. M. Ceseña, E. Loukarakis, N. Good, and Electr. Power Energy Syst., vol. 136, Mar. 2022, pp. 1582–1592, Jul. 2021.
P. Mancarella, “Integrated electricity–heat–gas Art. no. 107471. [154] S. Mukherjee, R. Huang, Q. Huang, T. L. Vu, and
systems: Techno–Economic modeling, [137] L. Xi et al., “A deep reinforcement learning T. Yin, “Scalable voltage control using
optimization, and application to multienergy algorithm for the power order optimization structure-driven hierarchical deep reinforcement
districts,” Proc. IEEE, vol. 108, no. 9, allocation of AGC in interconnected power grids,” learning,” 2021, arXiv:2102.00077.
pp. 1392–1410, Sep. 2020. CSEE J. Power Energy Syst., vol. 6, no. 3, [155] R. Huang et al., “Accelerated derivative-free deep
[122] T. Yang, L. Zhao, W. Li, and A. Y. Zomaya, pp. 712–723, Sep. 2020. reinforcement learning for large-scale grid
“Dynamic energy dispatch strategy for integrated [138] J. Li, T. Yu, X. Zhang, F. Li, D. Lin, and H. Zhu, emergency voltage control,” IEEE Trans. Power
energy system based on improved deep “Efficient experience replay based deep Syst., vol. 37, no. 1, pp. 14–25, Jan. 2022.
reinforcement learning,” Energy, vol. 235, deterministic policy gradient for AGC dispatch in [156] R. R. Hossain, Q. Huang, and R. Huang, “Graph
Nov. 2021, Art. no. 121377. integrated energy system,” Appl. Energy, vol. 285, convolutional network-based topology embedded
[123] B. Zhang, W. Hu, D. Cao, Q. Huang, Z. Chen, and Mar. 2021, Art. no. 116386. deep reinforcement learning for voltage stability
F. Blaabjerg, “Deep reinforcement learning–based [139] D. Zhang et al., “Research on AGC performance control,” IEEE Trans. Power Syst., vol. 36, no. 5,
approach for optimizing energy conversion in during wind power ramping based on deep pp. 4848–4851, Sep. 2021.
integrated electrical and heating system with reinforcement learning,” IEEE Access, vol. 8, [157] D. Cao et al., “Data-driven multi-agent deep
renewable energy,” Energy Convers. Manage., pp. 107409–107418, 2020. reinforcement learning for distribution system
vol. 202, Dec. 2019, Art. no. 112199. [140] J. Li, T. Yu, H. Zhu, F. Li, D. Lin, and Z. Li, decentralized voltage control with high
[124] B. Zhang, W. Hu, D. Cao, Q. Huang, Z. Chen, and “Multi-agent deep reinforcement learning for penetration of PVs,” IEEE Trans. Smart Grid,
F. Blaabjerg, “Economical operation strategy of an sectional AGC dispatch,” IEEE Access, vol. 8, vol. 12, no. 5, pp. 4137–4150, Sep. 2021.
integrated energy system with wind power and pp. 158067–158081, 2020. [158] H. T. Nguyen and D.-H. Choi, “Three-stage
power to gas technology—A DRL-based [141] J. Li, J. Yao, T. Yu, and X. Zhang, “Distributed inverter-based peak shaving and Volt-VAR control
approach,” IET Renew. Power Gener., vol. 14, deep reinforcement learning for integrated in active distribution networks using online safe
no. 17, pp. 3292–3299, Dec. 2020. generation-control and power-dispatch of deep reinforcement learning,” IEEE Trans. Smart


Grid, vol. 13, no. 4, pp. 3266–3277, Jul. 2022. Apr. 2022. Smart Grid, vol. 12, no. 3, pp. 2176–2187,
[159] R. Huang et al., “Learning and fast adaptation for [176] X. Wei, Y. Xiang, J. Li, and X. Zhang, May 2021.
grid emergency control via deep meta “Self-dispatch of wind-storage integrated system: [193] V. Moghaddam, A. Yazdani, H. Wang, D. Parlevliet,
reinforcement learning,” IEEE Trans. Power Syst., A deep reinforcement learning approach,” IEEE and F. Shahnia, “An online reinforcement learning
vol. 37, no. 6, pp. 4168–4178, Nov. 2022. Trans. Sustain. Energy, vol. 13, no. 3, approach for dynamic pricing of electric vehicle
[160] L. Xi, L. Yu, Y. Xu, S. Wang, and X. Chen, “A novel pp. 1861–1864, Jul. 2022. charging stations,” IEEE Access, vol. 8,
multi-agent DDQN-AD method-based distributed [177] Y. Liang, C. Guo, Z. Ding, and H. Hua, pp. 130305–130313, 2020.
strategy for automatic generation control of “Agent-based modeling in electricity market using [194] L. Zhang, Y. Gao, H. Zhu, and L. Tao, “Bi-level
integrated energy systems,” IEEE Trans. Sustain. deep deterministic policy gradient algorithm,” stochastic real-time pricing model in multi-energy
Energy, vol. 11, no. 4, pp. 2417–2426, Oct. 2020. IEEE Trans. Power Syst., vol. 35, no. 6, generation system: A reinforcement learning
[161] L. Xi, J. Wu, Y. Xu, and H. Sun, “Automatic pp. 4180–4192, Nov. 2020. approach,” Energy, vol. 239, Jan. 2022,
generation control based on multiple neural [178] H. Guo, Q. Chen, Q. Xia, and C. Kang, “Deep Art. no. 121926.
networks with actor-critic strategy,” IEEE Trans. inverse reinforcement learning for objective [195] N. Z. Aitzhan and D. Svetinovic, “Security and
Neural Netw. Learn. Syst., vol. 32, no. 6, function identification in bidding models,” IEEE privacy in decentralized energy trading through
pp. 2483–2493, Jun. 2021. Trans. Power Syst., vol. 36, no. 6, pp. 5684–5696, multi-signatures, blockchain and anonymous
[162] L. Xi, L. Zhou, Y. Xu, and X. Chen, “A multi-step Nov. 2021. messaging streams,” IEEE Trans. Depend. Secure
unified reinforcement learning method for [179] M. Sanayha and P. Vateekul, “Model-based deep Comput., vol. 15, no. 5, pp. 840–852, Sep. 2018.
automatic generation control in multi-area reinforcement learning for wind energy bidding,” [196] J. Kang, R. Yu, X. Huang, S. Maharjan, Y. Zhang,
interconnected power grid,” IEEE Trans. Sustain. Int. J. Electr. Power Energy Syst., vol. 136, and E. Hossain, “Enabling localized peer-to-peer
Energy, vol. 12, no. 2, pp. 1406–1415, Apr. 2021. Mar. 2022, Art. no. 107625. electricity trading among plug-in hybrid electric
[163] Z. Yan and Y. Xu, “Data-driven load frequency [180] Y. Tao, J. Qiu, and S. Lai, “Deep reinforcement vehicles using consortium blockchains,” IEEE
control for stochastic power systems: A deep learning based bidding strategy for EVAs in local Trans. Ind. Informat., vol. 13, no. 6,
reinforcement learning method with continuous energy market considering information pp. 3154–3164, Dec. 2017.
action search,” IEEE Trans. Power Syst., vol. 34, asymmetry,” IEEE Trans. Ind. Informat., vol. 18, [197] R. Khalid, N. Javaid, A. Almogren, M. U. Javed,
no. 2, pp. 1653–1656, Mar. 2019. no. 6, pp. 3831–3842, Jun. 2022. S. Javaid, and M. Zuair, “A blockchain-based load
[164] Z. Yan and Y. Xu, “A multi-agent deep [181] A. Taghizadeh, M. Montazeri, and H. Kebriaei, balancing in decentralized hybrid P2P energy
reinforcement learning method for cooperative “Deep reinforcement learning-aided bidding trading market in smart grid,” IEEE Access, vol. 8,
load frequency control of a multi-area power strategies for transactive energy market,” IEEE pp. 47047–47062, 2020.
system,” IEEE Trans. Power Syst., vol. 35, no. 6, Syst. J., vol. 16, no. 3, pp. 4445–4453, Sep. 2022. [198] A. A. Al-Obaidi and H. E. Z. Farag, “Decentralized
pp. 4599–4608, Nov. 2020. [182] I. Boukas et al., “A deep reinforcement learning quality of service based system for energy trading
[165] M. H. Khooban and M. Gheisarnejad, “A novel framework for continuous intraday market among electric vehicles,” IEEE Trans. Intell. Transp.
deep reinforcement learning controller based bidding,” Mach. Learn., vol. 110, no. 9, Syst., vol. 23, no. 7, pp. 6586–6595, Jul. 2022.
type-II fuzzy system: Frequency regulation in pp. 2335–2387, Sep. 2021. [199] Y. Li, C. Yu, Y. Liu, Z. Ni, L. Ge, and X. Li,
microgrids,” IEEE Trans. Emerg. Topics Comput. [183] Y. Zhang, Z. Zhang, Q. Yang, D. An, D. Li, and “Collaborative operation between power network
Intell., vol. 5, no. 4, pp. 689–699, Aug. 2021. C. Li, “EV charging bidding by multi-DQN and hydrogen fueling stations with peer-to-peer
[166] C. Chen, M. Cui, F. Li, S. Yin, and X. Wang, reinforcement learning in electricity auction energy trading,” IEEE Trans. Transport. Electrific.,
“Model-free emergency frequency control based market,” Neurocomputing, vol. 397, pp. 404–414, vol. 9, no. 1, pp. 1521–1540, Mar. 2023.
on reinforcement learning,” IEEE Trans. Ind. Jul. 2020. [200] D. Wang, B. Liu, H. Jia, Z. Zhang, J. Chen, and
Informat., vol. 17, no. 4, pp. 2336–2346, [184] L. Yang, Q. Sun, N. Zhang, and Y. Li, “Indirect D. Huang, “Peer-to-peer electricity transaction
Apr. 2021. multi-energy transactions of energy Internet with decisions of the user-side smart energy system
[167] Z. Yan, Y. Xu, Y. Wang, and X. Feng, “Deep deep reinforcement learning approach,” IEEE based on the SARSA reinforcement learning,”
reinforcement learning-based optimal data-driven Trans. Power Syst., vol. 37, no. 5, pp. 4067–4077, CSEE J. Power Energy Syst., vol. 8, no. 3,
control of battery energy storage for power system Sep. 2022. pp. 826–837, May 2022.
frequency support,” IET Gener., Transmiss. Distrib., [185] C. Schlereth, B. Skiera, and F. Schulz, “Why do [201] Y. Liu, D. Zhang, C. Deng, and X. Wang, “Deep
vol. 14, no. 25, pp. 6071–6078, Dec. 2020. consumers prefer static instead of dynamic pricing reinforcement learning approach for autonomous
[168] G. Zhang, W. Hu, D. Cao, Q. Huang, Z. Chen, and plans? An empirical study for a better agents in consumer-centric electricity market,” in
F. Blaabjerg, “A novel deep reinforcement learning understanding of the low preferences for Proc. 5th IEEE Int. Conf. Big Data Anal. (ICBDA),
enabled sparsity promoting adaptive control time-variant pricing plans,” Eur. J. Oper. Res., May 2020, pp. 37–41.
method to improve the stability of power systems vol. 269, no. 3, pp. 1165–1179, Sep. 2018. [202] D. Qiu, Y. Ye, D. Papadaskalopoulos, and
with wind energy penetration,” Renew. Energy, [186] D. Liu, W. Wang, L. Wang, H. Jia, and M. Shi, G. Strbac, “Scalable coordinated management of
vol. 178, pp. 363–376, Nov. 2021. “Dynamic pricing strategy of electric vehicle peer-to-peer energy trading: A multi-cluster deep
[169] R. Yan, Y. Wang, Y. Xu, and J. Dai, “A multiagent aggregators based on DDPG reinforcement reinforcement learning approach,” Appl. Energy,
quantum deep reinforcement learning method for learning algorithm,” IEEE Access, vol. 9, vol. 292, Jun. 2021, Art. no. 116940.
distributed frequency control of islanded pp. 21556–21566, 2021. [203] C. Samende, J. Cao, and Z. Fan, “Multi-agent deep
microgrids,” IEEE Trans. Control Netw. Syst., [187] D. Qiu, Y. Ye, D. Papadaskalopoulos, and deterministic policy gradient algorithm for
vol. 9, no. 4, pp. 1622–1632, Dec. 2022. G. Strbac, “A deep reinforcement learning method peer-to-peer energy trading considering
[170] M. Shahidehpour, H. Yamin, and Z. Li, Market for pricing electric vehicles with discrete charging distribution network constraints,” Appl. Energy,
Operations in Electric Power Systems: Forecasting, levels,” IEEE Trans. Ind. Appl., vol. 56, no. 5, vol. 317, Jul. 2022, Art. no. 119123.
Scheduling, and Risk Management. Hoboken, NJ, pp. 5901–5912, Sep. 2020. [204] J. Li et al., “Energy trading of multiple virtual
USA: Wiley, 2002. [188] H. Xu, J. Wen, Q. Hu, J. Shu, J. Lu, and Z. Yang, power plants using deep reinforcement learning,”
[171] Y. Liu, D. Zhang, and H. B. Gooi, “Data-driven “Energy procurement and retail pricing of in Proc. Int. Conf. Power Syst. Technol.
decision-making strategies for electricity retailers: electricity retailers via deep reinforcement (POWERCON), Dec. 2021, pp. 892–897.
A deep reinforcement learning approach,” CSEE J. learning with long short-term memory,” CSEE J. [205] D. Qiu, J. Wang, J. Wang, and G. Strbac,
Power Energy Syst., vol. 7, no. 2, pp. 358–367, Power Energy Syst., vol. 8, no. 5, pp. 1338–1351, “Multi-agent reinforcement learning for
Mar. 2021. Sep. 2022. automated peer-to-peer energy trading in
[172] Y. Ye, D. Qiu, M. Sun, D. Papadaskalopoulos, and [189] S. Lee and D.-H. Choi, “Dynamic pricing and double-side auction market,” in Proc. 13th Int.
G. Strbac, “Deep reinforcement learning for energy management for profit maximization in Joint Conf. Artif. Intell., Aug. 2021,
strategic bidding in electricity markets,” IEEE multiple smart electric vehicle charging stations: pp. 2913–2920.
Trans. Smart Grid, vol. 11, no. 2, pp. 1343–1355, A privacy-preserving deep reinforcement learning [206] T. Zhang, D. Yue, L. Yu, C. Dou, and X. Xie, “Joint
Mar. 2020. approach,” Appl. Energy, vol. 304, Dec. 2021, energy and workload scheduling for fog-assisted
[173] H. Xu, H. Sun, D. Nikovski, S. Kitamura, K. Mori, Art. no. 117754. multimicrogrid systems: A deep reinforcement
and H. Hashimoto, “Deep reinforcement learning [190] A. Abdalrahman and W. Zhuang, “Dynamic pricing learning approach,” IEEE Syst. J., vol. 17, no. 1,
for joint bidding and pricing of load serving for differentiated PEV charging services using pp. 164–175, Mar. 2023.
entity,” IEEE Trans. Smart Grid, vol. 10, no. 6, deep reinforcement learning,” IEEE Trans. Intell. [207] Y. Ye, Y. Tang, H. Wang, X.-P. Zhang, and G. Strbac,
pp. 6366–6375, Nov. 2019. Transp. Syst., vol. 23, no. 2, pp. 1415–1427, “A scalable privacy-preserving multi-agent deep
[174] Y. Du, F. Li, H. Zandi, and Y. Xue, “Approximating Feb. 2022. reinforcement learning approach for large-scale
Nash equilibrium in day-ahead electricity market [191] Y.-C. Chuang and W.-Y. Chiu, “Deep reinforcement peer-to-peer transactive energy trading,” IEEE
bidding with multi-agent deep reinforcement learning based pricing strategy of aggregators Trans. Smart Grid, vol. 12, no. 6, pp. 5185–5200,
learning,” J. Modern Power Syst. Clean Energy, considering renewable energy,” IEEE Trans. Emerg. Nov. 2021.
vol. 9, no. 3, pp. 534–544, May 2021. Topics Comput. Intell., vol. 6, no. 3, pp. 499–508, [208] X. Wang, Y. Liu, J. Zhao, C. Liu, J. Liu, and J. Yan,
[175] X. Wei, Y. Xiang, J. Li, and J. Liu, “Wind power Jun. 2022. “Surrogate model enabled deep reinforcement
bidding coordinated with energy storage system [192] T. Lu, X. Chen, M. B. McElroy, C. P. Nielsen, Q. Wu, learning for hybrid energy community operation,”
operation in real-time electricity market: A and Q. Ai, “A reinforcement learning-based Appl. Energy, vol. 289, May 2021, Art. no. 116722.
maximum entropy deep reinforcement learning decision system for electricity pricing plan [209] Y. Xu, L. Yu, G. Bi, M. Zhang, and C. Shen, “Deep
approach,” Energy Rep., vol. 8, pp. 770–775, selection by smart grid end users,” IEEE Trans. reinforcement learning and blockchain for


peer-to-peer energy trading among microgrids,” in IEEE Trans. Comput., vol. 71, no. 11, Mar. 2020.
Proc. Int. Conferences Internet Things (iThings) pp. 2915–2926, Nov. 2022. [244] E. Marchesini, D. Corsi, and A. Farinelli,
IEEE Green Comput. Commun. (GreenCom) IEEE [227] S. Lee and D.-H. Choi, “Federated reinforcement “Exploring safer behaviors for deep reinforcement
Cyber, Phys. Social Comput. (CPSCom) IEEE Smart learning for energy management of multiple learning,” in Proc. AAAI Conf. Artif. Intell., vol. 36,
Data (SmartData) IEEE Congr. Cybermatics smart homes with distributed energy resources,” no. 7, 2022, pp. 7701–7709.
(Cybermatics), Nov. 2020, pp. 360–365. IEEE Trans. Ind. Informat., vol. 18, no. 1, [245] Y. Ye, H. Wang, P. Chen, Y. Tang, and G. Strbac,
[210] Y. Li, X. Wei, Y. Li, Z. Dong, and M. Shahidehpour, pp. 488–497, Jan. 2022. “Safe deep reinforcement learning for microgrid
“Detection of false data injection attacks in smart [228] H.-M. Chung, S. Maharjan, Y. Zhang, and energy management in distribution networks with
grid: A secure federated deep learning approach,” F. Eliassen, “Distributed deep reinforcement leveraged spatial–temporal perception,” IEEE
IEEE Trans. Smart Grid, vol. 13, no. 6, learning for intelligent load scheduling in Trans. Smart Grid, vol. 14, no. 5, pp. 3759–3775,
pp. 4862–4872, Nov. 2022. residential smart grids,” IEEE Trans. Ind. Informat., Sep. 2023.

ABOUT THE AUTHORS


Yuanzheng Li (Senior Member, IEEE) received the M.S. degree in electrical engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2011, and the Ph.D. degree in electrical engineering from the South China University of Technology (SCUT), Guangzhou, China, in 2015.

He is currently an Associate Professor with HUST. He has published several peer-reviewed articles in international journals. His current research interests include deep learning, reinforcement learning, smart grid operations, optimal power system/microgrid scheduling and decision-making, stochastic optimization considering large-scale integration of renewable energy into the power system, and multiobjective optimization.

Chaofan Yu received the B.S. degree in automation from Guangxi University (GXU), Nanning, China, in 2020. He is currently working toward the M.S. degree at the China-EU Institute for Clean and Renewable Energy, Huazhong University of Science and Technology (HUST), Wuhan, China.

His current research interests include electric vehicles, optimal scheduling of large-scale renewable energy integrated power systems, and artificial intelligence and its application in the smart grid.

Mohammad Shahidehpour (Life Fellow, IEEE) is currently a University Distinguished Professor, a Bodine Chair Professor of Electrical and Computer Engineering, and the Director of the Robert W. Galvin Center for Electricity Innovation, Illinois Institute of Technology (IIT), Chicago, IL, USA. He has 45 years of experience with electric power system operation and planning. His sponsored project on perfect power systems has converted the entire IIT campus to an islandable microgrid. He has coauthored six books and more than 800 technical articles on electric power system operation and planning.

Dr. Shahidehpour is a member of the National Academy of Engineering and a Fellow of the American Association for the Advancement of Science and the National Academy of Inventors. He received the IEEE Burke Hayes Award for his research on hydrokinetics, the IEEE Power and Energy Society (PES) Outstanding Power Engineering Educator Award, the IEEE/PES Ramakumar Family Renewable Energy Excellence Award, the IEEE/PES Douglas M. Staszesky Distribution Automation Award, and the Edison Electric Institute's Power Engineering Educator Award. He was the founding Editor-in-Chief of IEEE TRANSACTIONS ON SMART GRID.

Tao Yang (Senior Member, IEEE) received the Ph.D. degree in electrical engineering from Washington State University, Pullman, WA, USA, in 2012.

From August 2012 to August 2014, he was an ACCESS Postdoctoral Researcher with the ACCESS Linnaeus Centre, Royal Institute of Technology, Stockholm, Sweden. He then joined the Pacific Northwest National Laboratory, Richland, WA, USA, as a Postdoctoral Researcher, where he was promoted to a Scientist/Engineer II in 2015. He was an Assistant Professor with the Department of Electrical Engineering, University of North Texas, Denton, TX, USA, from 2016 to 2019. He is currently a Professor with the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China. His research interests include industrial artificial intelligence, integrated optimization and control, distributed control and optimization with applications to process industries, cyber-physical systems, networked control systems, and multiagent systems.

Dr. Yang received the Ralph E. Powe Junior Faculty Enhancement Award and the Best Student Paper Award (as an advisor) at several international conferences. He is an Associate Editor of IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY and IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS.


Zhigang Zeng (Fellow, IEEE) received the Ph.D. degree in systems analysis and integration from the Huazhong University of Science and Technology, Wuhan, China, in 2003.

He is currently a Professor with the School of Automation and the Key Laboratory of Image Processing and Intelligent Control of the Education Ministry of China, Huazhong University of Science and Technology. He has published more than 100 international journal articles. His current research interests include the theory of functional differential equations and differential equations with discontinuous right-hand sides and their applications to dynamics of neural networks, memristive systems, and control systems.

Dr. Zeng has been a member of the Editorial Board of Neural Networks since 2012, Cognitive Computation since 2010, and Applied Soft Computing since 2013. He was an Associate Editor of IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS from 2010 to 2011. He has been an Associate Editor of IEEE TRANSACTIONS ON CYBERNETICS since 2014 and IEEE TRANSACTIONS ON FUZZY SYSTEMS since 2016.

Tianyou Chai (Life Fellow, IEEE) received the Ph.D. degree in control theory and engineering from Northeastern University, Shenyang, China, in 1985.

He became a Professor at Northeastern University in 1988. He is the Founder and the Director of the Center of Automation, Northeastern University, which became the National Engineering and Technology Research Center and the State Key Laboratory. He was the Director of the Department of Information Science, National Natural Science Foundation of China, from 2010 to 2018. He has developed control technologies with applications to various industrial processes. He has published more than 320 peer-reviewed international journal articles. His current research interests include modeling, control, optimization, and integrated automation of complex industrial processes.

Dr. Chai is a member of the Chinese Academy of Engineering and a Fellow of the International Federation for Automatic Control (IFAC). His paper titled "Hybrid intelligent control for optimal operation of shaft furnace roasting process" was selected as one of the three best papers for the Control Engineering Practice Paper Prize for the term 2011–2013. For his contributions, he has won five prestigious awards of the National Natural Science, the National Science and Technology Progress, and the National Technological Innovation, the 2007 Industry Award for Excellence in Transitional Control Research from IEEE Multi-Conference on Systems and Control, and the 2017 Wook Hyun Kwon Education Award from the Asian Control Association.
