Deep Reinforcement Learning for Smart Grid Operations: Algorithms, Applications, and Prospects
ABSTRACT | With the increasing penetration of renewable energy and flexible loads in smart grids, a more complicated power system with high uncertainty is gradually formed, which brings about great challenges to smart grid operations. Traditional optimization methods usually require accurate mathematical models and parameters and cannot deal well with the growing complexity and uncertainty. Fortunately, the widespread popularity of advanced meters makes it possible for the smart grid to collect massive data, which offers opportunities for data-driven artificial intelligence methods to address the optimal operation and control issues. Therein, deep reinforcement learning (DRL) has attracted extensive attention for its excellent performance in operation problems with high uncertainty. To this end, this article presents a comprehensive literature survey on DRL and its applications in smart grid operations. First, a detailed overview of DRL, from fundamental concepts to advanced models, is conducted in this article. Afterward, we review various DRL techniques as well as their extensions developed to cope with emerging issues in the smart grid, including optimal dispatch, operational control, electricity market, and other emerging areas. In addition, an application-oriented survey of DRL in the smart grid is presented to identify difficulties for future research. Finally, essential challenges, potential solutions, and future research directions concerning the DRL applications in the smart grid are also discussed.

KEYWORDS | Deep reinforcement learning (DRL); electricity market; operational control; optimal dispatch; smart grid (SG).

Manuscript received 15 July 2022; revised 15 June 2023; accepted 1 August 2023. Date of publication 5 September 2023; date of current version 15 September 2023. This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0201300, in part by the National Natural Science Foundation of China under Grant 62073148, in part by the Key Project of National Natural Science Foundation of China under Grant 62233006, in part by the Smart Grid Joint Key Project of National Natural Science Foundation of China and the State Grid Corporation of China under Grant U2066202, in part by the Major Program of National Natural Science Foundation of China under Grant 61991400, and in part by the 2020 Science and Technology Major Project of Liaoning Province under Grant 2020JH1/10100008. (Corresponding author: Zhigang Zeng.)

Yuanzheng Li and Zhigang Zeng are with the School of Artificial Intelligence and Automation, Autonomous Intelligent Unmanned System Engineering Research Center, Key Laboratory of Image Processing and Intelligence Control, Ministry of Education of China, and the Hubei Key Laboratory of Brain-Inspired Intelligent Systems and the Belt and Road Joint Laboratory on Measurement and Control Technology, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]; [email protected]).

Chaofan Yu is with the China-EU Institute for Clean and Renewable Energy, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]).

Mohammad Shahidehpour is with the Robert W. Galvin Center for Electricity Innovation, Illinois Institute of Technology, Chicago, IL 60616 USA (e-mail: [email protected]).

Tao Yang and Tianyou Chai are with the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/JPROC.2023.3303358

NOMENCLATURE
Notations
A, a    Set of actions and action.
S, s    Set of all states and state.
P       Transition probability.
R       Set of all possible rewards.
These characteristics increase the uncertainty and complexity, which brings great challenges to the secure and economic operations of the SG [6]. To solve these problems, various approaches have been proposed for the optimal operations of the SG. However, conventional optimization methods require accurate mathematical models and parameters, which makes it difficult to apply them to increasingly complex and distributed systems with multiple uncertain subsystems. Consequently, the applications of traditional methods are limited in practice, which calls for a more intelligent and efficient solution.

With the wide application of advanced sensors, smart meters, and monitoring systems, the SG is producing massive data with mutual correlations [7], [8], [9], which also offers the data basis for data-driven AI methods, for instance, RL. Indeed, RL has been one of the most important research topics of AI over the last two decades, owing to its excellent ability of self-directed learning, adaptive adjustment, and optimal decision-making. Specifically, RL is a learning process that allows the agent to periodically make decisions, observe the results, and then automatically adjust its actions to achieve the optimal policy. For instance, as one of the pioneering works in the application of RL to renewable power system operations, Liao et al. [10] proposed a multiobjective optimization algorithm based on learning automata for economic emission dispatching and voltage stability enhancement in SGs. Simulation results have demonstrated that the proposed method achieves accurate solutions and adapts effectively to dynamic fluctuations in wind power and load demand.

Despite all these advantages, RL is still unsuitable and inapplicable to complicated large-scale problem environments, as it has to explore and gain knowledge from the entire system, which takes much time to obtain the best policy. In this situation, the applicability of RL has encountered serious challenges in the real world. Recently, the rapid development of DL has aroused great interest in industry and academia [11], [12]. Deep architectures provide better data processing and representation learning capability, which offers a potential solution to overcome the RL limitations. That is, the combination of RL and DL has led to a breakthrough technique, named DRL, which integrates the decision-making capacity of RL and the perception capability of DL [13]. More precisely, DRL improves the learning speed and performance of conventional RL by virtue of the advantages of DNNs in the training process. Therefore, DRL has been introduced in various applications and has achieved phenomenal success, such as games, robotics, natural language processing, computer vision, and SG operations [14].

In the field of SG, DRL has been intensively adopted to undertake various tasks, which stem from developing the optimal policy. As mentioned above, the SG is one of the largest artificial systems, and it is well known for its highly uncertain and nonlinear operating characteristics. Although several approaches have been developed for the SG, they still suffer from great computational complexity and strong randomness. To this end, DRL has been regarded as an alternative solution to overcome these challenges. Generally speaking, DRL methods provide the following advantages.

1) DRL can achieve the optimal solution of sophisticated grid optimization without using complete and accurate network information.
2) DRL allows grid entities to learn and build knowledge about the environment on the basis of historical data.
3) DRL offers autonomous decision-making with minimum information exchange, which not only reduces the computational burden but also improves SG security and robustness.
4) DRL significantly enhances the learning capability in comparison to traditional RL, especially in problems with numerous states and large action spaces.

Although there exist some RL reviews, detailed discussions on DRL applications in the SG are still lacking. Specifically, existing surveys have focused on DRL applications to the Internet of Things, natural language processing, and computer vision [15], [16], [17], [18]. Indeed, there are some excellent reviews on RL applications to energy systems [19], [20], [21], [22], [23], [24], [25], [26]. However, they mainly concentrate on conventional RL methods or the power system, rather than presenting state-of-the-art DRL approaches to SG applications. A detailed comparison between our work and related surveys is presented in Table 1 to identify the unique aspects and novel perspectives that distinguish our work.

It could be observed that there exist some reviews on DRL-based decision-making in conventional power systems and the modern SG. For instance, Chen et al. [19] provided a comprehensive review of various DRL techniques and their potential applications in power systems, with a focus on three key applications: frequency regulation, voltage control, and energy management. However, emerging energy solutions for improving SG efficiency and ensuring its secure operations are not covered in [19]. Zhang et al. [20] and Glavic [21] covered multiple aspects of power system operations, including optimal dispatch, operational control, the electricity market, and others. Although Zhang et al. [20] and Glavic [21] provided summaries of typical DRL algorithms such as DQN, DDPG, and AC, they do not cover recently developed state-of-the-art DRL methods. To this end, Cao et al. [22] provided a comprehensive summary of advanced DRL algorithms and their SG applications, including value-based, policy-based, and AC solution methods. Although their summaries of DRL algorithms are detailed, Cao et al. [22] did not cover emerging SG areas, including P2P trading markets and privacy preservation issues.

Yang et al. [23], Perera and Kamalaruban [24], and Yu et al. [25] classified DRL papers in the literature into seven categories according to their application fields. Their study reveals that about half of the publications use Q-learning, whereas some of the state-of-the-art DRL methods are
not utilized in power system applications. Although Zhang et al. [26] attempted to provide an in-depth analysis of DRL algorithms applied in SGs, it lacks a summary of advanced algorithms applied to power systems. With the significant AI applications in power systems, the number of publications that use DRL has also grown rapidly, and many state-of-the-art DRL algorithms have been proposed for SG operations. This major development calls for a more comprehensive analysis of potential DRL applications to SGs. Accordingly, this article is dedicated to presenting a relatively holistic overview of various DRL-based methods applied to SG operations.

The main contributions of this article are listed as follows.
1) A detailed and well-organized overview of DRL methodologies is provided, which encompasses fundamental concepts and theoretical DRL principles, as well as the most sophisticated DRL techniques applied to power system operations.
2) The SG operation issues are divided into four critical categories to illustrate DRL applications to modeling, design, solution, and numerical experimentation.
3) An in-depth understanding of challenges, potential solutions, and future directions for deploying DRL to SG problems is discussed, and the outlook for additional developments is presented.

Different from the excellent prior works about RL, this article attempts to conduct a relatively exhaustive review of DRL applications to SG operations, especially over the last few years. The review encompasses emerging topics such as optimal economic dispatch, distributed control, and electricity markets. First, with the increasing penetration of RE, the optimal dispatch of SG resources is confronted with unprecedented challenges, including massive operational uncertainty, lower system inertia and new dynamic phenomena, and highly nonlinear and complex power systems that cannot be effectively represented and constructed by existing mathematical tools [27]. Second, operational control is a critical SG task that involves device control and coordination, including generators, transformers, and capacitors. Traditional mathematical methods may be based on simplified models and linear control strategies, which may struggle to address the complexities of nonlinear and dynamic SG characteristics. Third, electricity market operation is a complex optimization problem that involves multiple participants, variables, and uncertainties. Conventional mathematical methods may rely on simplified models and assumptions, which may not fully capture the complexity of real-world scenarios. Finally, widespread deployments of Internet-connected SG devices have significantly increased the vulnerability of power systems to cyberattacks. Cyberattack dynamics and complexities necessitate the implementation of responsive, adaptive, and scalable protection mechanisms in SGs. These requirements are difficult to achieve by typical operation methods that rely on static security measures [28]. More importantly, through summarizing, highlighting, and analyzing the DRL characteristics and their SG applications, this survey article highlights specific potential research directions for interested parties.

The rest of this article is organized as follows. Section II introduces the evolution of DRL and discusses its state-of-the-art techniques as well as their extensions. In Section III, the detailed DRL applications in the SG are presented. After that, Section IV discusses the prospects and challenges of DRL in SG operations in the future. Finally, the conclusion of this article is drawn in Section V.

II. DRL: AN OVERVIEW
In this section, the fundamental knowledge of MDP, RL, and DL techniques, which are crucial components of DRL, is introduced first. Then, the combination of DL and RL is presented, which results in the formation of DRL. Finally, advanced DRL models as well as their state-of-the-art extensions are reviewed.

A. Markov Decision Process
In mathematics, an MDP is a discrete-time stochastic control process, which assumes that the future state is only related to the present state and is independent of the past states [29]. The MDP provides a useful framework for modeling decision-making problems in which outcomes are partly random and partly under the control of the decision-maker. MDPs are popular for studying optimization problems solved by dynamic programming and RL approaches [30]. Generally, an MDP is defined as a tuple (S, A, P, R), where S is the set of finite states named the state space, A represents the set of actions called the action space, P is the transition probability from state s to state s′ after action a is executed, and R denotes the immediate reward received after the state transition from state s to state s′, due to the performance of action a.
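To make the tuple (S, A, P, R) concrete, the following minimal sketch (an illustrative toy example, not code from the article) encodes a two-state, two-action MDP with NumPy arrays and samples one transition.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2          # S = {0, 1}, A = {0, 1}

# P[s, a, s'] = probability of moving to s' when taking action a in state s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])

# R[s, a] = immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def step(s, a):
    """Sample the next state and reward for a state-action pair."""
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a]

s_next, r = step(0, 1)
print(f"from s=0 with a=1: s'={s_next}, r={r}")
```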
In the multiagent setting, an MMDP extends this formulation with a joint policy that maps states to joint actions. In this way, an MMDP could be regarded as a single-agent MDP where the agent takes joint actions. On this basis, the MMDP goal is to find a policy that maximizes the expected total reward for all agents by taking their interactions into account.

The MDP objective is to find a good policy that maximizes the future reward, which could be expressed by the following cumulative discounted reward:

Rt = rt+1 + γrt+2 + γ²rt+3 + · · · = Σ_{k=0}^{∞} γ^k rt+k+1
   = rt+1 + γ(rt+2 + γrt+3 + γ²rt+4 + · · · )
   = rt+1 + γRt+1                                                        (2)

where Rt denotes the cumulative reward at time step t and γ ∈ [0, 1] represents the discount factor. Here, γ determines the importance of future rewards compared with the current one. If γ approaches one, it means that the decision-maker regards the long-term reward as important. On the contrary, the decision-maker prefers to maximize the current reward when the discount factor γ approaches zero.

In order to find an optimal policy π∗: S → A for the agent to maximize the long-term reward, the state-value function V^π: S → R is first defined in RL, which denotes the expected value of the current state s under policy π. The state-value function V of the following policy π measures the quality of this policy through the discounted MDP, which could be shown as follows:

V^π(s) = Eπ[Rt | st = s] = Eπ[Σ_{k=0}^{∞} γ^k rt+k+1 | st = s]            (3)

where the state value is the expected reward for following policy π from state s.

Similarly, the value of taking action a in state s under policy π, i.e., the action-value function Q^π(s, a), is defined as

Q^π(s, a) = E[Rt | st = s, at = a] = E[Σ_{k=0}^{∞} γ^k rt+k+1 | st = s, at = a].      (4)

Since the purpose of RL is to find the optimal policy that achieves the largest cumulative reward in the long term, we define the optimal policy π∗ as

π∗ = argmax_π V^π(s).                                                    (5)

In this situation, the optimal state-value function V∗(s) and the optimal action-value function Q∗(s, a) could be defined. In particular, the optimal action-value function is

Q∗(s, a) = max_π Q^π(s, a).                                              (6)

As for the state–action pair (s, a), it is observed that the optimal action-value function gives the expected return for taking action a in state s and thereafter following an optimal policy. Therefore, Q∗(s, a) could also be written in terms of the optimal state-value function V∗(s), which is expressed as

Q∗(s, a) = E[Rt | st = s, at = a]
         = E[rt+1 + γRt+1 | st = s, at = a]
         = E[rt+1 + γV∗(st+1) | st = s, at = a].                          (7)

Since the action is selected by the policy, an optimal action at each state is found through the optimal policy as well as the optimal state-value function. In this way, the optimal state-value function is rewritten as follows:

V∗(s) = max_π V^π(s) = max_a Eπ∗[Rt | st = s, at = a]
      = max_a Eπ∗[rt+1(s, a) + γV∗(st+1) | st = s, at = a]
      = max_a Q∗(s, a).                                                  (8)

Taking the expression of the optimal action-value function into account, the problem of the optimal state value is simplified to that of the optimal action values, i.e., Q∗(s, a). Intuitively, (8) indicates that the value of a state under an optimal policy should be equal to the expected reward for the best action from that state, which is known as the Bellman optimality equation of the MDP [31]. With the definition of optimal value functions and policies, the rest of the work is to update the value function and achieve the optimal policy, which can be accomplished by RL approaches.

B. Reinforcement Learning
As one of the machine learning paradigms, RL is concerned with a decision-maker's actions for maximizing the notion of cumulative reward Rt [32]. In RL, the decision-making process is executed by the agent, which learns the optimal policy by interacting with the environment. Here, the agent first observes the current state and then performs an action in the environment, based on its policy. After that, the environment feeds the immediate reward back to the agent and updates its new state at the same time. The typical RL interactions between environment and agent are shown in Fig. 2. The agent constantly adjusts its policy according to the observed information, i.e., the received immediate reward and the updated state. This adjustment process is repeated until the policy of the agent approaches its optimum.
In the Q-learning algorithm, the target policy is updated according to the maximal reward of the available actions rather than the behavior policy used for choosing actions, i.e., off-policy learning. On the contrary, the SARSA algorithm uses the same policy to update the Q values and select actions, i.e., on-policy learning. The details of the SARSA algorithm are provided in Algorithm 2, which illustrates that the SARSA agent interacts with the environment and updates the policy based on the actions actually taken. Hence, it is regarded as an on-policy RL algorithm. In particular, the action-value function Q(s, a) is updated by the Q value of the next state s′ and the current policy's action a′ as

Qt+1(s, a) = Qt(s, a) + α[rt+1 + γQt+1(s′, a′) − Qt(s, a)].        (10)

Algorithm 2 SARSA Algorithm
Input: Initialize Q(s, a), for all s ∈ S, a ∈ A, arbitrarily except that Q(terminal, ·) = 0
1 for each episode t do
2    From the current state–action pair (s, a), execute action a and receive the immediate reward r and the new state s′.
3    Select an action a′ from the new state s′ using the same policy and then update the table entry for Q(s, a) as follows:
     Qt+1(s, a) = Qt(s, a) + α[rt+1 + γQt+1(s′, a′) − Qt(s, a)]
4    Replace s ← s′; a ← a′.
5 end
Output: π∗(s) = argmax_a Q∗(s, a)

In conclusion, the state and action spaces in tabular RL methods must be small enough to allow the Q values to be represented as a table. This is feasible when the number of states and actions is small. However, state and action spaces in many real-world applications are excessively large, which makes it impossible to represent the Q values in a table. In such cases, function approximation methods, such as DNNs, are used to approximate the Q values or the policy, which is introduced in the following.

C. Deep Learning
As mentioned before, RL is not suitable for handling complicated problems with large-scale environments and high uncertainty, which limits its application in SG operations. To this end, DL is introduced to assist RL in dealing with these challenges. To be specific, DL is a subset of machine learning based on DNNs, which attempts to simulate human brain behavior and extract the important features from massive raw data. The adjective "deep" in DL refers to the use of multiple layers in a neural network, which enhances the perception capacity of the DNN.

Fig. 3 shows the DFF neural network, which is considered the simplest type of DNN. It is observed that a DFF network contains multiple layers of interconnected nodes, i.e., artificial neurons, which are analogous to biological neurons in the brain. Each connection between neurons transmits a signal, and the receiving neuron processes this signal and then activates its downstream connected neurons. The signal within a connection is usually represented by a real number between 0 and 1, and the output of each neuron is computed by the weighted summation of its inputs followed by a nonlinear transformation through the activation function. This computation process is named forward propagation, which achieves the data processing from the input to the output. Typically, neurons are aggregated into layers, and different layers may perform different transformations on their inputs. It should be noted that signals travel from the first layer, named the input layer, to the last one, i.e., the output layer, possibly after traversing deep hidden layers multiple times.

Fig. 3. Structure diagram of typical DFF neural network.

The DFF neural network shown in Fig. 3 is the simplest among DNNs. There are two other classical DNN models, including the CNN [34] and the RNN [35]. CNN is distinguished from other DNNs by its superior computer vision performance, and it comprises three main layers shown in Fig. 4, i.e., convolutional, pooling, and fully connected layers. The name CNN stems from the convolution operation in the convolutional layer, which converts the raw input data to numerical values and allows CNN to interpret and extract relevant features. Similar to the convolutional layer, the pooling layer derives its name from the pooling operation, and it conducts dimension reduction to decrease complexity. The pooling layers can improve efficiency and reduce the risk of overfitting. Furthermore, the fully connected layer performs the task of classification based on the features learned through the previous convolutional and pooling layers, which maps the extracted features to the final output.

Unlike CNN, which assumes that inputs and outputs are independent, RNN extracts the information from prior
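As a concrete illustration of the forward propagation described above, the following minimal NumPy sketch (an illustrative example with randomly initialized weights, not code from the article) passes an input through two fully connected layers, each computing a weighted sum of its inputs followed by a nonlinear activation.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes: 4 inputs -> 8 hidden neurons -> 2 outputs
W1, b1 = rng.normal(size=(8, 4)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)) * 0.1, np.zeros(2)

def forward(x):
    """Forward propagation: weighted sums plus nonlinear activations."""
    h = relu(W1 @ x + b1)       # hidden layer
    y = sigmoid(W2 @ h + b2)    # output layer
    return y

x = rng.normal(size=4)          # one input sample
print(forward(x))
```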
Fig. 7. Taxonomy of DRL algorithms (boxes with thick borders denote different categories, while others indicate specific algorithms).
Policy-based algorithms, by contrast with value-based ones, directly parameterize the policy and optimize it with respect to the expected return. In this way, the policy-based algorithm offers a simple policy parameterization and fast convergence speed, which is suitable for problems with continuous or high-dimensional action spaces. Nevertheless, the policy-based methods also suffer from sampling inefficiency and overestimation. However, a combination of these two categories has conveniently given rise to the AC framework. In the rest of this section, we discuss several typical value- and policy-based DRL algorithms.

E. Value-Based DRL Algorithm
The RL goal is to improve the policy to acquire better rewards. A value-based algorithm tends to optimize the action-value function Q(s, a) to obtain preferences for the action choice. Usually, value-based algorithms, such as Q-learning and SARSA, need to alternate between the value function estimation under the current policy and the policy improvement with the estimated value function, as shown in (8). However, it is not trivial to predict the accurate value of a complicated action-value function, especially when state and action spaces are continuous. The conventional tabular methods, such as Q-learning, cannot cope with these complex cases because of the limitation of computational resources. Also, state representations in practice would need to be manually designed with aligned data structures, which are difficult to specify. To this end, the DL technique is introduced to assist RL methods in estimating the action-value function, which is the core concept of value-based DRL algorithms. Next, typical value-based DRL algorithms, including DQN and its variants, are depicted with detailed theories and explanations.

1) Deep Q-Network: As one of the breakthroughs in DRL, the DQN structure shown in Fig. 8 implements a DNN as the function approximator for estimating Q∗(s, a), instead of the Q-table. However, the value iteration is proved to be unstable and might even diverge when a nonlinear function approximator, e.g., a neural network, is used to represent the action-value function [37]. This instability is attributed to the fact that small updates of Q(s, a) might significantly change the agent policy. Therefore, the data distribution and the correlations between Q(s, a) and the target value rt+1 + γ max_{a′} Qt+1(s′, a′) vary considerably during training. Two key ideas, experience replay and the fixed target Q-network, are adopted to address the instability issue, as described in the following.

1) Experience Replay: In each time epoch t, DQN stores the experience of the agent (st, at, rt, st+1) into the replay buffer and then draws a mini-batch of samples from this buffer randomly to train the DNN. Then, the Q values estimated by the trained DNN are applied to generate new experiences, which are appended to the replay buffer in an iterative way. The experience replay mechanism has several advantages over fitted Q-learning. First, both old and new experiences are used to learn the Q-function, which provides higher data efficiency. Second, experience replay avoids the situation where the samples used for DNN training are determined by the previous parameters, which smooths out changes in the data distribution and removes correlations in the observation sequence.

2) Fixed Target Q-Network: To further improve the neural network stability, a separate target network is developed to generate the Q-learning targets, instead of the primary Q-network. At times, the target network is synchronized with the primary Q-network by copying directly (hard update) or by an exponentially decaying average (soft update). In this way, the target network is updated regularly but at a rate that is slower than the primary Q-network. This could significantly reduce the divergence and the correlation between the target and estimated Q values.

The DQN algorithm with experience replay and fixed target Q-network is presented in Algorithm 3. Before learning starts, the replay buffer D, primary network Q, and target network Q̂ are initialized with random parameters. Then, at each episode t, the agent selects an action at with the ϵ-greedy policy and observes reward rt to enter a new state st+1. After that, the transition (st, at, rt, st+1) is stored in the replay buffer for further sampling. Stochastic gradient descent with respect to the network parameter θ is performed to optimize the DNN loss function, which is defined in (11) as the deviation between the target and primary networks. Finally, the target network parameters are updated from the primary network every certain number of steps until the epoch is terminated

L = [rj + γ max_{aj+1} Q̂(sj+1, aj+1; θ′) − Q(sj, aj; θ)]².          (11)

Algorithm 3 DQN Algorithm
Input: Initialize replay buffer D, the primary Q-network Q with stochastic weights θ, and the target Q-network Q̂ with stochastic weights θ′.
1 for each episode t do
2    With probability ϵ select a random action at, otherwise select at = argmax_a Q∗(s, a; θ).
3    Execute action at and observe the immediate reward rt and next state st+1.
4    Store transition (st, at, rt, st+1) in the experience replay buffer D.
5    Sample a random minibatch of transitions (sj, aj, rj, sj+1) from D.
6    Perform a gradient descent step with respect to the network parameter θ to minimize the loss: [rj + γ max_{aj+1} Q̂(sj+1, aj+1; θ′) − Q(sj, aj; θ)]².
7    Synchronize Q̂ = Q every certain interval of steps.
8 end

In conclusion, DQN absorbs the advantages of both DL and RL techniques, which are critical for SG applications [38], [39], [40].

2) Double DQN: DQN, which has been implemented successfully, has struggled with large overestimations of action values, especially in noisy environments [41]. These overestimations stem from a positive bias, since Q-learning always selects the maximum action value as the approximation of the maximal expected reward, as denoted by the Bellman equation in (8). Therefore, the next Q values are usually overestimated, since the samples used to select the optimal action, i.e., the one with the largest expected reward, are also utilized for evaluating the action value. To this end, a variant algorithm called DDQN is proposed to address the overestimation problem of DQN [42]. The central idea of DDQN is to decouple the correlations in action selection and value evaluation by using two different networks at these two stages. In particular, the target Q-network in DDQN evaluates the value of the action that is selected by the primary network, where Q̂(s′, a′; θ′) and Q(s, a; θ) represent the target network with parameter θ′ for value evaluation and the primary network with parameter θ for action selection, respectively. In this way, the estimated value of the future expected reward is evaluated using a different policy, which could manage the overestimation issue and outperform the original DQN algorithm [43].
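To ground the experience-replay and target-network mechanisms described above, here is a minimal PyTorch-style sketch (an illustrative example with random transitions standing in for real environment interaction; not the article's implementation): it samples a minibatch from a replay buffer, forms the target rj + γ max_{a′} Q̂(s′, a′; θ′), and takes one gradient step on the loss in (11).

```python
import random
from collections import deque

import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 3, 0.99

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())          # fixed target Q-network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=10000)
# Random transitions stand in for real interactions with an environment.
for _ in range(256):
    s, s_next = torch.randn(state_dim), torch.randn(state_dim)
    a, r = random.randrange(n_actions), random.random()
    replay.append((s, a, r, s_next))

batch = random.sample(list(replay), 32)                  # experience replay
s = torch.stack([b[0] for b in batch])
a = torch.tensor([b[1] for b in batch])
r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
s_next = torch.stack([b[3] for b in batch])

with torch.no_grad():                                    # targets come from the frozen network
    y = r + gamma * target_net(s_next).max(dim=1).values
    # For DDQN, one would instead evaluate target_net at the argmax of q_net(s_next).
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s_j, a_j; theta)
loss = ((y - q_sa) ** 2).mean()                          # loss in (11), averaged over the batch

optimizer.zero_grad()
loss.backward()
optimizer.step()
target_net.load_state_dict(q_net.state_dict())           # hard update; a soft update would blend parameters
```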
3) Dueling DQN: For certain states, different actions are not relevant to the expected reward, and there is no need to learn the effect of each action for such states. For instance, when the values of the different actions are very similar in various states, the action taken would be less important. However, the conventional DQN could accurately estimate the Q value of such a state only when data are collected for every discrete action. This could result in a slower learning speed, as the algorithm is not concerned with the actions that are not taken. To address this issue, a network architecture called the dueling DQN is proposed, which explicitly separates the representation of the action-value function Q(s, a) into the state-value function V(s) and the state-dependent action advantage A(a) [44]. Accordingly, the Q value function of the dueling DQN is decoupled into state-value and action-advantage parts, where

Q(s, a) = V(s) + A(a).                                               (13)

On the one hand, the value part, i.e., the state-value function V(s), concentrates on estimating the importance of the current state s. On the other hand, the action advantage is denoted by the state-dependent advantage function A(a), which estimates the importance of choosing action a compared with other actions. Intuitively, the dueling architecture could draw lessons from valuable states, without learning the effect of each state–action pair.

However, it might not be suitable to directly separate the Q value function as shown in (13), since it might be unidentifiable in mathematics; that is, there might exist different combinations of V(s) and A(a) that all satisfy (13) for a given Q(s, a). To deal with the identifiability issue, the advantage function estimator is refined to force a zero advantage at the selected action

Q(s, a; α, β) = V(s; β) + (A(s, a; α) − (1/|A|) Σ_{a′} A(s, a′; α))      (14)

where α and β are the parameters of the two estimators V(s; β) and A(s, a′; α), respectively. It should be noted that the subtraction in (14) helps with identifiability, and it does not change the relative rank of the A values, thus preserving the original policy based on the Q values from (13). In addition, the stability of policy optimization is enhanced, since the advantages in (14) only need to adapt to the average value, instead of pursuing an optimal solution. The training of the dueling DQN requires more network layers compared with the standard DQN, which achieves a better policy evaluation in the presence of large action spaces.

F. Policy-Based DRL Algorithm
Different from the value-based algorithms, policy-based algorithms depend on the use of gradient ascent for optimizing parameterized policies with respect to the expected reward, instead of optimizing the action-value function. The abstract policies in DRL are called parameterized policies, as they are represented by parametric neural networks. In particular, policy-based approaches directly learn the parameterized policy of the agent without learning or estimating the action-value function. Accordingly, policy-based DRL algorithms do not suffer from specific concerns that have been encountered with traditional RL methods. These concerns mainly consist of the higher complexities that arise from continuous states and actions, the uncertainty stemming from the stochastic environment, and the inaccuracy of the estimated action value.

Another benefit of the policy-based algorithm is that policy gradient methods could naturally model stochastic policies, while value-based algorithms need to explicitly represent their exploration, such as ϵ-greedy, to model stochastic policies. Furthermore, gradient information is utilized to guide the optimization in policy-based algorithms, which contributes to the convergence of network training. In general, policy-based algorithms could be divided into stochastic and deterministic policies, according to their representation. Therefore, several popular policy-based algorithms are introduced here for both policies.

1) Stochastic Policy: As mentioned before, the basic idea of the policy-based algorithm is to represent the policy by a parametric neural network πθ(a|s), where the agent randomly chooses an action a at state s according to parameter θ. Then, policy-based algorithms typically optimize the policy πθ with respect to the goal J(πθ), through sampling the policies and adjusting the policy parameters θ in the direction of more cumulative reward, which could be expressed as follows:

J(πθ) = E_{τ∼πθ}[R(τ)] = E_{τ∼πθ}[Σ_{k=0}^{∞} γ^k rt+k+1]              (15)

where policy gradient-based optimization uses an estimator for the gradients of the expected return collected from samples to improve the policy with gradient ascent. Here, the trajectory τ is a sequence of state–action pairs sampled by the current policy πθ, which records how the agent interacts with the environment. Thus, the gradient with respect to the policy parameter is called the policy gradient, and the corresponding update step is ∆θ = α∇θJ(πθ). On this basis, the policy gradient theorem is proposed to denote the optimal ascent direction of the expected reward [45], as illustrated in the following equation:

∇θJ(πθ) = E_{τ∼πθ}[Σ_{t=0}^{T} ∇θ log πθ(τ)R(τ)]
        = E_{τ∼πθ}[Σ_{t=0}^{T} ∇θ log πθ(at|st)R(τ)].                  (16)

In this way, policy-based algorithms are updated along the direction of ascending gradients, which is denoted as follows:

θ = θ + ∆θ = θ + α∇θJ(πθ).                                            (17)

Based on the policy gradient, several typical policy-based DRL algorithms, including the TRPO and the PPO, are introduced in the following.
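Before turning to these algorithms, the basic update in (16) and (17) can be illustrated with a short REINFORCE-style sketch (an illustrative PyTorch example in which a randomly generated episode stands in for real environment interaction; not the article's implementation): it accumulates log πθ(at|st) weighted by the trajectory return and ascends the resulting gradient.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 3, 0.99

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# One trajectory tau = (s_0, a_0, r_1, ..., s_T); random data stand in for an environment.
T = 20
states = torch.randn(T, state_dim)
rewards = torch.rand(T)

dist = torch.distributions.Categorical(logits=policy(states))
actions = dist.sample()                       # a_t ~ pi_theta(.|s_t)
log_probs = dist.log_prob(actions)            # log pi_theta(a_t|s_t)

# Discounted return R(tau) of the whole trajectory, as used in (16)
discounts = gamma ** torch.arange(T, dtype=torch.float32)
R_tau = (discounts * rewards).sum()

# Gradient ascent on J(pi_theta): minimize the negative surrogate objective
loss = -(log_probs * R_tau).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```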
In the AC framework, the actor selects actions, while the critic evaluates the selected action by computing the value function V^π(s). Generally, the training of these two components is performed separately, and gradient ascent is adopted to update their parameters. Here, the critic V^{πθ}_ψ is optimized to minimize the square of the TD error δt, which is similar to the loss function of DQN as

δt = Rt + γV^{πθ}_ψ(st+1) − V^{πθ}_ψ(st)
J_{V^{πθ}_ψ}(ψ) = E_{τ∼πθ}[Σ_{t=0}^{T} δt²]
ψ = ψ + αψ∇ψJ_{V^{πθ}_ψ}(ψ).                                       (18)

Correspondingly, the actor parameters are updated by ascending the gradient of the objective

J(πθ) = E_{τ∼πθ}[Σ_{t=0}^{T} log πθ(at|st) δt],   θ = θ + αθ∇θJ(πθ).       (19)

The corresponding AC pseudocode includes the following steps.
2    Actor network selection: J(πθ) = E_{τ∼πθ}[Σ_{t=0}^{T} log πθ(at|st) δt].
3    Critic network evaluation: J_{V^{πθ}_ψ}(ψ) = E_{τ∼πθ}[Σ_{t=0}^{T} δt²].
4    Take action at and observe the next state st+1 and reward Rt according to the current policy πθ(·|s).
5    Collect sample (at, st, Rt, st+1) into the trajectory.
6    Calculate the TD error as follows: δt = Rt + γV^{πθ}_ψ(st+1) − V^{πθ}_ψ(st).
7    Replace ψ = ψ + αψ∇ψJ_{V^{πθ}_ψ}(ψ).
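A minimal one-transition sketch of the AC update in (18) and (19) follows (illustrative PyTorch code with random tensors standing in for a real transition; not the article's implementation): the critic is regressed toward the TD target, and the actor ascends log πθ(at|st) weighted by the TD error δt.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 3, 0.99

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

# One transition (s_t, a_t, R_t, s_{t+1}); random stand-ins for environment data.
s, s_next, r = torch.randn(state_dim), torch.randn(state_dim), torch.tensor(1.0)
dist = torch.distributions.Categorical(logits=actor(s))
a = dist.sample()

# TD error delta_t = R_t + gamma * V(s_{t+1}) - V(s_t), cf. (18)
delta = r + gamma * critic(s_next).detach() - critic(s)

critic_loss = delta.pow(2).mean()                     # minimize the squared TD error
opt_critic.zero_grad()
critic_loss.backward()
opt_critic.step()

actor_loss = -(dist.log_prob(a) * delta.detach()).mean()   # ascend log pi * delta, cf. (19)
opt_actor.zero_grad()
actor_loss.backward()
opt_actor.step()
```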
Such policy gradient updates behave just like the standard gradient descent algorithm. Indeed, the gradient ∇θJ(πθ) only provides the local first-order information at the current parameters θ, which completely ignores the curvature of the reward landscape. However, a suitable adjustment of the learning step is very important for policy gradient methods. On the one hand, the algorithm might suffer a performance collapse if the learning step αθ is large. On the other hand, if the step size is set small, the learning would be conservative and slow to converge. What is more, the gradient ∇θJ(πθ) in policy gradient methods requires an estimation from samples collected by the current policy πθ, which in turn affects the quality of the collected samples and makes the learning performance more sensitive to the step-size selection.

Another shortcoming of the policy gradient method in the standard AC model is that the update occurs in the parameter space rather than the policy space. This makes it more difficult to tune the step size αθ, since the same step size may correspond to totally different update magnitudes in the policy space, depending on the current policy πθ. To this end, an algorithm called the TRPO is developed, which is based on the concept of a trust region for adjusting the step size more precisely in the policy gradient [49]. It should be noted that the goal of the policy-based method is to find an updated policy πθ′ that improves the current policy πθ. Fortunately, the improvement from the current policy to the updated one could be measured by the advantage function A^{πθ}(s, a) [50], which was introduced in the dueling DQN. It is illustrated that (20) provides an insightful connection between the performances of πθ′ and πθ

A^{πθ}(s, a) = Q^{πθ}(s, a) − V^{πθ}_ψ(s)
J(πθ′) = J(πθ) + E_{τ∼πθ′}[Σ_{t=0}^{∞} γ^t A^{πθ}(at, st)]               (20)

where τ denotes the state–action trajectory sampled by the updated policy πθ′. Obviously, learning the optimal policy is equivalent to optimizing the bonus term E_{τ∼πθ′}[Σ_{t=0}^{∞} γ^t A^{πθ}(at, st)]. The above expectation is based on the updated policy πθ′, which is difficult to optimize directly. Thus, TRPO optimizes an approximation of this expectation, denoted by Lπθ(πθ′), which is stated as

Lπθ(πθ′) = E_{τ∼πθ}[Σ_{t=0}^{∞} γ^t (πθ′(at|st)/πθ(at|st)) A^{πθ}(st, at)]    (21)

where πθ′ is directly approximated by πθ, which seems to be coarse, but its approximation error (22) is proved to be theoretically bounded and thus ensures its effectiveness [51]. The bounded approximation error is presented as

|J(πθ′) − J(πθ) − Lπθ(πθ′)| ≤ C · D^max_KL(πθ ∥ πθ′)                      (22)

where C is a constant independent of πθ′ and D^max_KL(πθ ∥ πθ′) represents the maximum Kullback–Leibler (KL) divergence, which is a statistical distance measuring the difference between πθ′ and πθ. Therefore, it is reasonable to optimize Lπθ(πθ′) if D^max_KL(πθ ∥ πθ′) is small, which is actually the principle of TRPO. On this basis, the original problem is converted into an optimization problem, which is stated as

max_{πθ′} Lπθ(πθ′)
s.t. E[D^max_KL(πθ ∥ πθ′)] ≤ ξ                                            (23)

where ξ is a predefined constant denoting the maximum allowable difference between πθ′ and πθ. Afterward, a first-order approximation of the objective function and a second-order approximation of the constraint are adopted to solve this optimization problem. In fact, the gradient of Lπθ(πθ′) at the current policy could be expressed by (24), which is similar to the AC

g = ∇θLπθ(πθ′) = E_{τ∼πθ}[Σ_{t=0}^{∞} ∇θ log πθ(at|st) γ^t A^{πθ}(st, at)].   (24)

Accordingly, the TRPO algorithm solves the approximated optimization problem at the current policy as

θ′ = argmax_{θ′} g⊤(θ′ − θ)
s.t. (θ′ − θ)⊤ H (θ′ − θ) ≤ ξ                                             (25)

where H represents the Hessian matrix of E[D^max_KL(πθ ∥ πθ′)]. It is illustrated by (25) that the gradient is calculated to first order and the constraint is depicted to second order. This approximation problem can be analytically solved by the method of Lagrangian duality [52], resulting in the following analytic-form solution:

θ′ = θ + √(2ξ/(g⊤H⁻¹g)) H⁻¹g.                                             (26)

In summary, TRPO trains the stochastic policy in an on-policy way, where it explores by sampling according to the newest version of its stochastic policy. During the training procedure, the policy usually becomes progressively less uncertain, since the update rule encourages it to exploit rewards that it has already obtained. Empirically, the TRPO method performs well on problems that would otherwise require precise problem-specific hyperparameter tuning, solving them with a set of reasonable parameters. However, one challenge with the implementation of TRPO lies in calculating the estimation of the KL divergence between parameters, as it increases the complexity and the computational burden.
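The quantities in (21)-(23) are straightforward to estimate from samples. The sketch below (illustrative NumPy code with made-up probabilities and advantages; not the article's implementation) computes the importance-ratio surrogate Lπθ(πθ′) and an empirical KL divergence that would be checked against the trust-region radius ξ. The discount weights γ^t are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
xi = 0.01                                              # trust-region radius

# Made-up per-step data from trajectories sampled under the current policy pi_theta
T = 64
advantages = rng.normal(size=T)                        # A^{pi_theta}(s_t, a_t) estimates
logp_old = np.log(rng.uniform(0.1, 0.9, size=T))       # log pi_theta(a_t|s_t)
logp_new = logp_old + rng.normal(scale=0.05, size=T)   # log pi_theta'(a_t|s_t)

ratio = np.exp(logp_new - logp_old)                    # pi_theta'(a|s) / pi_theta(a|s)
surrogate = np.mean(ratio * advantages)                # sample estimate of L_{pi_theta}(pi_theta')

# Simple sample-based estimate of the KL divergence between old and new policies
kl = np.mean(logp_old - logp_new)

print(f"surrogate={surrogate:.4f}, KL={kl:.5f}, inside trust region: {kl <= xi}")
```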
Hence, PPO is motivated to take the largest possible advantage of the current data, without stepping out so far that it could accidentally cause any performance collapse. Unlike TRPO, which tends to solve this problem with a complicated second-order method, PPO is a member of the first-order approaches, which adopts clipping tricks to maintain the proximity between old and new policies. The PPO algorithm performs comparably to or even better than other state-of-the-art methods, is significantly simpler to implement, and has thus become the default DRL algorithm on many popular platforms due to its ease of use and good performance [57].

2) Deterministic Policy: The content described above belongs to the stochastic policy gradient, which aims to optimize the stochastic policy π(a|s) and represent the action as a probability distribution conditioned on the current state, where a ∼ π(·|s). On the contrary, the deterministic policy considers the action as a deterministic output of the policy, i.e., a = µ(s), instead of sampling the action from a probability distribution.

DDPG adopts the AC architecture from the policy gradient framework, which maintains a deterministic policy function µ(s) (actor) as well as a value function Q(s, a) (critic). The policy gradient algorithm is used to optimize the policy function assisted by the value function. The AC used in DDPG is different from the previous one, since this actor is a deterministic policy function. Nevertheless, the value function in DDPG is the same as that in DQN, which utilizes the TD error to update itself.

Algorithm 6 DDPG Algorithm
Input: Initialize replay buffer R. Randomly initialize the actor network parameters θµ and the critic network parameters θQ. Initialize the target networks Q′ and µ′ with parameters θQ′ ← θQ, θµ′ ← θµ.
1 for each episode do
2    Select action at = µ(st|θµ) + Nt according to the current policy and exploration noise.
3    Execute action at, observe reward rt, and transfer to the next state st+1.
4    Store transition (st, at, rt, st+1) in buffer R.
5    Update the critic by minimizing the loss in (30) and update the actor with the policy gradient.
6    Softly update the target networks as in (32).
7 end

The overall pseudocode of DDPG presented in Algorithm 6 initializes the replay buffer R and the parameters of the four networks. Then, it selects the action according to the current policy and exploration noise as at = µ(st|θµ) + Nt, in order to enhance the DDPG exploration capacity. After executing the action at and receiving the reward rt, the agent transfers to the next state st+1 and stores the transition (st, at, rt, st+1) in buffer R. On this basis, DDPG simultaneously maintains two models, i.e., actor and critic, in order to manage the problems with continuous action spaces. As for the critic network, it aims to approximate the value function and uses the same structure as DQN, i.e., a primary network and a target network. The critic network then updates its parameters by minimizing the loss

yi = ri + γQ′(si+1, µ′(si+1|θµ′)|θQ′)
L = (1/N) Σ_i (yi − Q(si, ai|θQ))².                                    (30)

The target networks are then updated softly, e.g., for the actor,

θµ′ ← ρθµ + (1 − ρ)θµ′                                                 (32)

where ρ represents the update coefficient, which is far less than 1 so that the target networks are updated very slowly and smoothly, thus promoting the learning stability.

In summary, DDPG combines ideas from both DQN and AC techniques, which extends Q-learning into continuous action spaces and has produced a lasting influence on subsequent DRL algorithms. On the one hand, later extensions maintain two Q values, taking the smaller one, in order to form the targets in the Bellman error loss function

Qθ1′(s′, a′) = Qθ1′(s′, µψ1(s′))
Qθ2′(s′, a′) = Qθ2′(s′, µψ2(s′))
y1 = Ri + γ min_{i=1,2} Qθi′(s′, µψi(s′)).                              (33)
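The two mechanisms just described, i.e., the soft target update in (32) and the minimum over two target-critic estimates in (33), reduce to a few lines of code. Below is an illustrative NumPy sketch with made-up parameter vectors and critic outputs (not the article's implementation).

```python
import numpy as np

rng = np.random.default_rng(3)
rho, gamma = 0.005, 0.99

# Stand-in parameter vectors for the actor and its target network
theta_mu = rng.normal(size=8)
theta_mu_target = theta_mu.copy()

theta_mu = theta_mu + 0.1 * rng.normal(size=8)            # pretend one learning step happened
# Soft update, cf. (32): the target tracks the primary network slowly
theta_mu_target = rho * theta_mu + (1.0 - rho) * theta_mu_target

# Clipped double-Q target, cf. (33): take the smaller of two target-critic estimates
r = 1.0
q1_next = 10.0                                            # Q_{theta'_1}(s', mu(s'))
q2_next = 9.2                                             # Q_{theta'_2}(s', mu(s'))
y = r + gamma * min(q1_next, q2_next)
print("target y =", y)
```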
III. APPLICATIONS OF DRL IN SG OPERATION

Fig. 11. Typical architecture of SG operation. This figure is cited from [76].
The SG applications are devoted to achieving sustainable, secure, reliable, and flexible energy delivery through bidirectional power and information flows. SG applications possess the following features.
1) SG offers a more efficient way to maintain the optimal dispatch with a lower generation cost and higher power quality via the integration of distributed sources and flexible loads, such as RE and EVs [60], [61], [62], [63], [64].
2) SG achieves the secure and stable operation of the power system via the deployment of effective operational control technologies, including the AGC, AVC, and LFC [65], [66], [67], [68].
3) SG provides a transaction platform for customers and suppliers affiliated with different entities, thus enhancing the interactions between suppliers and customers, which facilitates the development of the electricity market [69], [70], [71].
4) SG equipment encompasses numerous advanced infrastructures, including sensors, meters, and controllers, which are also applied to emerging conditions, such as network security and privacy preservation [72], [73], [74], [75].

On this basis, a typical SG architecture is shown in Fig. 11, which illustrates that the SG operation involves four fundamental segments, i.e., power generation, transmission, distribution, and customers. As for the generation, traditional thermal energy is converted to electrical power, while large-scale RE integration is a promising trend in SG applications. After that, the electrical energy is delivered from the power plant to the power substations via the high-voltage transmission lines. Then, substations lower the transmission voltage and distribute the energy to individual customers such as residential, commercial, and industrial loads. During the transmission and distribution stages, numerous smart meters are deployed in SGs to ensure their secure and stable operations. In addition, such advanced infrastructures bring about emerging concerns, e.g., network security and privacy, that traditional power systems would seldom encounter.

In order to support SG operations, DRL applications are also divided into four categories, including optimal dispatch, operational control, electricity market, and other emerging issues, such as network security and privacy concerns. These problems usually have similar economic and technical objectives for reducing operational costs, suppressing voltage fluctuations, and strengthening SG security and stability. Despite the applicability of traditional power system methods, it is envisaged that they may be inadequate for SG applications characterized by high levels of renewable and variable energy penetration and increased human participation in load management. Traditional optimization methods could struggle with identifying the best solutions for these problems due to the high uncertainty in prevailing SG operations and the high dimensionality of distributed systems with coupled variables that are metastasizing in SGs.

In essence, it would be difficult to establish the corresponding accurate models. Fortunately, DRL agents could automatically learn the pertinent knowledge in such cases while interacting with the environment, which is independent of an accurate environment model. However, the purpose of applying DRL is not to completely replace conventional optimization methods. Instead, DRL can serve as a complement to existing approaches and enhance them in a data-driven manner. In this way, DRL has the advantage of addressing such SG problems more effectively due to its data-driven and model-free nature. In the rest of this section, the DRL applications to optimal dispatch, operational control, electricity market, and other emerging areas are analyzed and investigated in detail.

A. Optimal Dispatch
Compared with traditional power systems, the SG integrates more distributed RE to promote sustainability [76]. Under this circumstance, the conventional centralized high-voltage power transmission might not be considered an economic operation, since the RE sources are usually distributed and closely located to load centers. As a result, the DN, the self-sufficient microgrid, and the IES are gradually becoming more independent of transmission network operations, which is also highlighted as a major developing trend in SG applications [77], [78], [79], [80]. In addition, recent years have witnessed the rapid development of EVs, which have already become a critical SG component [81], [82], [83], as shown in Fig. 12. To this end, applications of DRL regarding optimal dispatch on the DN, microgrid, IES, and EV are summarized as follows.

1) Distribution Network: In recent years, DN operations have faced significant challenges mainly due to the
increasing deployment of DERs and EVs. Specifically, the uncertain RE output could impact the distribution and the direction of DN power flow, which may further lead to increased power losses and voltage fluctuations. Hence, traditional methods based on mathematical optimization might not deal effectively with this highly uncertain environment. More importantly, these traditional methods depend significantly on accurate DN parameters, which are difficult to acquire in practice. To address these limitations, DRL methods are applied in DNs, which could provide more flexible control decisions and promote the operation of the DN. Generally, the reward can be designed to achieve certain goals, such as minimizing power losses, improving the voltage profile, or maximizing RE utilization. The literature about the applications of DRL on the DN is listed in Table 2, which is summarized from two aspects, i.e., management method and solving algorithm. In addition, the performance of the reviewed methods is analyzed from the perspectives of convergence, privacy protection, and scalability, where a tick mark means an outstanding performance and a blank means that the corresponding article does not refer to that performance aspect.

Fig. 12. Optimal dispatch issues of SG operation. This figure is cited from [76].

On the one hand, DRL could provide better flexible control decisions to promote DN operations, including voltage regulation. For instance, Cao et al. [84] and Kou et al. [85] proposed a multiagent DDPG (MADDPG)-based approach for DN voltage regulation with a high penetration of photovoltaics (PVs), which shows a better utilization of PV resources and control performance. A novel DRL algorithm named constrained SAC is proposed in [86] and [87] to solve Volt–Var control problems in a model-free manner. Comprehensive numerical studies demonstrate the efficiency and scalability of the proposed DRL algorithm, compared with state-of-the-art DRL and conventional optimization algorithms. Sun and Qiu [88] and Yang et al. [89] proposed a two-stage real-time Volt–Var control method, in which model-based centralized optimization and the DQN algorithm are combined to mitigate the voltage violations of the DN.

On the other hand, DRL algorithms are also applied to determine the optimal network configuration of the DN. For example, Li et al. [90] developed a many-objective DNR model to assess the tradeoff relationships for better operations of the DN, in which a DQN-assisted evolutionary algorithm (DQN-EA) is proposed to improve the searching efficiency. Similarly, an online DNR scheme based on deep Q-learning is introduced in [91] to determine the optimal network topology. Simulation results indicate that the computation time of the proposed algorithm is low enough for practical applications. In addition, Gao et al. [92] developed a data-driven batch-constrained SAC algorithm for dynamic DNR, which could learn the network reconfiguration control policy from historical datasets without interacting with the DN. In [93], federated learning and the AC algorithm are combined to solve the demand response problem in the DN, which considers privacy protection, uncertainties, as well as the power flow constraints of the DN simultaneously. In addition, a DRL framework based on the A2C algorithm is proposed in [94], which aims at enhancing the long-term resilience of the DN using hardening strategies. Simulation results show its effectiveness and
dispatch policies without fully observable state information. The proposed algorithm has derived an energy dispatch policy for ESS. A multiagent TD3 (MATD3) is developed in [103] for ESS energy management. Simulation results demonstrate its efficiency and scalability while handling high-dimensional problems with continuous action space. In addition, the curriculum learning is integrated into A2C to improve sample efficiency and accelerate the training process in [104], which speeds up the convergence during the DRL training and increases the overall profits.

c) User loads: The reward can be formulated as the reduction in peak load or the cost savings achieved through demand response. In [105], a prioritized experience replay DDPG (PER-DDPG) is applied to the microgrid dispatch model considering demand response. Simulation studies indicate its advantage in reducing operational costs compared with traditional dispatch methods. Du and Li [106] proposed an MCDRL approach for demand-side management, which tends to have a strong exploration capability and protect consumer privacy. In addition, the A2C algorithm is developed in [107] to address the demand response problem, which not only shows the superiority and flexibility of the proposed approach but also preserves customer privacy.

3) Electric Vehicles: The use of EVs has been growing rapidly across the globe, in particular within the past decade, which is mostly due to their low environmental impacts [108], [109], [110], [111]. Specifically, reducing the charging cost by dispatching charging and discharging behaviors is a research hot spot. Due to the flexibility of EV charging/discharging, some literature focuses on the coordinated dispatch of EVs and RE, which is devoted to promoting the utilization of RE by EVs. However, the uncertainty of RE and user loads results in the difficulty of model construction. At the same time, the proliferation of EVs makes it more difficult to optimize the solution of the operation problem, which is mostly due to the large number of variables.

On the one hand, traditional methods tend to estimate uncertain quantities before optimization and decision-making while addressing the randomness of EV charging behaviors. On the other hand, multistage optimization is introduced to handle the problem caused by high-dimensional variables. Nevertheless, the optimization results of these methods are dependent on the predictive accuracy. Accordingly, DRL is applied to deal with the EV optimal dispatch problem, which is a data-driven method and to some extent insensitive to prediction accuracy. The reward can incorporate factors related to user comfort and convenience, such as the queuing time waiting for charging, the available range of EV travel, or the ability to meet specific user preferences. In the rest of this section, we present the detailed DRL applications to EV optimal dispatch, as shown in Table 4.

For instance, Zhang et al. [112] proposed a novel approach based on DQN to dispatch the EVs charging and recommend the appropriate traveling route for EVs. Simulation studies demonstrate its effectiveness in significantly reducing the charging time and origin–destination distance. In [113], a DRL approach with embedding and attention mechanism is developed to handle the EV routing problem with time windows. Numerical studies show that it is able to efficiently solve large-size problems, which are not solvable with other existing methods. In addition, a charging control DDPG algorithm is introduced to learn the optimal strategy for satisfying the requirements of users while minimizing the charging expense in [114]. The SAC algorithm is applied to deal with the congestion control problem in [115], which proves to outperform other decentralized feedback control algorithms in terms of fairness and utilization.

Taking security into account, Li et al. [116] proposed a CPO approach based on safe DRL to minimize the charging cost, which does not require any domain knowledge about the randomness. Numerical experiments demonstrate that this method could adequately satisfy the charging constraints and reduce the charging cost. A novel MADDPG algorithm for traffic light control is proposed to reduce traffic congestion in [117]. Experimental results show that this method can significantly reduce congestion in various scenarios.
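To make the reward design discussed above more concrete, the following is a minimal, illustrative sketch of a per-step reward for EV charging dispatch that combines the charging cost with user-comfort terms such as queuing time and remaining driving range. The weighting coefficients, the minimum-range threshold, and the inputs are hypothetical tuning choices for illustration only; they are not taken from the surveyed papers.

```python
def ev_charging_reward(energy_price, charge_kwh, queue_minutes,
                       range_km, min_range_km=80.0,
                       w_cost=1.0, w_queue=0.05, w_range=0.5):
    """Illustrative per-step reward for EV charging dispatch.

    Combines (negated) charging cost with penalties for queuing time and for
    dropping below a desired driving range. All weights and the minimum-range
    threshold are hypothetical parameters to be tuned per scenario.
    """
    cost_term = energy_price * charge_kwh           # electricity cost this step
    queue_term = queue_minutes                      # waiting time at the station
    range_term = max(0.0, min_range_km - range_km)  # shortfall below desired range
    return -(w_cost * cost_term + w_queue * queue_term + w_range * range_term)


# Example: 0.5 $/kWh price, 10 kWh charged, 12 min queued, 60 km of range left.
r = ev_charging_reward(0.5, 10.0, 12.0, 60.0)
```

In practice, such a reward would be evaluated at every control step of the charging environment, and the trade-off between cost and comfort is governed entirely by the chosen weights.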
Qian et al. [118] developed a multiagent DQN (MA-DQN) method to model the pricing game in the transportation network and determine the optimal charging price for an electric vehicle charging station (EVCS). Case studies are conducted to verify the effectiveness and scalability of the proposed approach. In [119], a DQN-based EV charging navigation framework is proposed to minimize the total travel time and charging cost in the EVCS. Experimental results demonstrate the necessity of coordinating the SG with an intelligent transportation system. In addition, the continuous SAC algorithm is applied to tackle the EV charging dispatch problem considering dynamic user behaviors and electricity prices in [120]. Simulation studies show that the proposed SAC-based approach could learn the dynamics of electricity price and driver's behavior in different locations.

4) Integrated Energy System: In order to address the problems of sustainable energy supply and environmental pollution, IES has attracted extensive attention all over the world. It regards the electric power system as the core platform, integrates RE at the source side, and achieves the combined operation of cooling, heating, as well as electric power at the load side [121]. However, the high penetration of RE and flexible loads makes the IES a complicated dynamic system with strong uncertainty, which poses huge challenges to the secure and economic operation of IES. Moreover, conventional optimization methods often rely on accurate mathematical models and parameters, which are not suitable for the IES optimal dispatch problem when strong randomness is considered. Fortunately, DRL is introduced to address the IES optimal dispatch problem, which is a model-free method and achieves a series of successful applications. When applying DRL to IES, the design of the reward function can vary depending on the specific objectives and constraints of the system. In the rest of this section, comprehensive reviews of DRL-based IES optimal dispatch are discussed as follows.

On the one hand, DRL methods are applied to cope with the optimal dispatch problem of IES at the source side. For example, Yang et al. [122] proposed a DDPG-based dynamic energy dispatch method for IES while considering the uncertainty of renewable generation, electric load, as well as heat load. Numerical simulations on a typical day scenario demonstrate that the developed method avoids dependence on uncertainty knowledge and has strong adaptability for inexperienced scenarios. In [123], a dynamic energy conversion and dispatch model for IES is developed based on DDPG, which takes the uncertainty of demand as well as the flexibility of wholesale prices into account. Case studies illustrate that the proposed algorithm can effectively improve the profit of the system operator and smooth the fluctuations of user loads. Similarly, the optimal dispatch problem of IES with RE integrated is first formulated as a discrete MDP in [123], which is subsequently solved by the proposed DRL method based on PPO. Finally, simulation results show that this method can distinctly minimize the operation cost of IES. In addition, the IES economical optimization problem with wind power and power-to-gas technology is discussed in [124], which develops a cycling decay learning rate DDPG to obtain the optimal operation strategy. Zhang et al. [125] investigated the optimal energy management of IES considering solar power, diesel generation, and ESS, which introduces the PPO algorithm to solve the optimization problem and realizes a 14.17% cost reduction in comparison with other methods.

On the other hand, DRL approaches are also introduced to deal with the IES optimal dispatch problem at the load side [126]. For instance, Zhou et al. [127] established the constrained CHP dispatch problem as an MDP. Afterward, an improved policy gradient DRL algorithm named distributed PPO is developed to handle the CHP economic dispatch problem. Simulation results demonstrate that the proposed algorithm could cope with different operation scenarios while obtaining a better optimization performance than other methods. In [128], a DRL algorithm based on DQN is used to realize the dynamic selection of the optimal subsidy price for IES with regenerative electric heating, which aims to maximize the load aggregator profits while promoting demand response. Numerical studies show that the power grid can save 56.6% of its investment and users save up to 8.7% of costs. In addition, a model-free and data-driven DRL method based on DDPG with prioritized experience replay strategy is proposed to
sectional AGC dispatch based on HMA-DDPG can adjust the AGC unit outputs with the changes in system state, thus guaranteeing an optimal economic, secure, and stable SG operation.

In [141], a swarm intelligence-based DDPG (SI-DDPG) algorithm is designed to acquire the control knowledge and implement high-quality decisions for AGC. Simulation results on a two-area SG validate the SI-DDPG effectiveness for improving the area control performance. In addition, a threshold solver based on TD3 is presented in [142] to dynamically update the thresholds of AGC, which is verified to be effective in maintaining the SG stability with a lower operation cost. A preventive strategy for the AGC application in SG operation is proposed in [143]. The strategy is based on DFRL, which achieves the highest control performance compared with ten other conventional AGC methods. Yang et al. [144] presented a DRL model for wind farm AGC to maximize the revenue of wind power producers, which utilizes the rainbow algorithm to train the wind farm controller against uncertainties.

In [145], an intelligent controller based on Q-learning for the AGC application in the SG operation is proposed to compensate for the power imbalance between generation and load demand. Numerical simulations validate the feasibility of the SG controller with network-induced effects. In addition, a multiarea AGC scheme based on Q-learning is designed in [146] to dynamically allocate the AGC regulating commands among various AGC units. Comprehensive tests on practical data demonstrate the validity of the proposed method in minimizing the generation cost and regulating error. In addition, Hasanvand et al. [147] presented a reliable and optimal AGC method based on DQN to manage the generators in an electric ship. Real-time simulation is conducted to verify the performance and efficacy of the suggested AGC scheme for the electric ship.

Although most of the existing model-based AVC methods could mitigate voltage violations, they are significantly dependent on accurate SG knowledge data, which is often difficult to acquire in real time. Thus, the use of DRL allows controllers to learn the control strategy through interactions with a system-like simulation model, where the reward is defined as a penalty for the voltage deviation from its nominal value. Wang et al. [148] proposed a multiagent AVC algorithm based on MADDPG to mitigate voltage fluctuations, which could learn gradually and master the system operation rules from input and output data.

More specifically, MADDPG utilizes a centralized training approach with decentralized execution, as presented in Fig. 15. During the training phase, MADDPG agents employ a centralized critic network that observes all agents' actions to estimate the value function. This enables them to learn coordination and collaboration among agents. During the execution phase, however, each agent acts independently and makes decisions based solely on its own observations. This decentralized execution allows agents to interact with the environment and make decisions autonomously, without a need for explicit coordination with other agents.
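A minimal sketch of this centralized-training, decentralized-execution structure is given below, assuming a PyTorch-style implementation. The network sizes, agent count, and tensor shapes are illustrative placeholders rather than the configuration used in [148]; the point is only that the critic consumes the joint observations and actions of all agents during training, whereas each actor maps its own local observation to its own action at execution time.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: maps one agent's own observation to its action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # e.g., normalized setpoints
        )

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Centralized Q-function: sees the observations and actions of all agents."""
    def __init__(self, n_agents, obs_dim, act_dim, hidden=128):
        super().__init__()
        joint_dim = n_agents * (obs_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(joint_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_acts):
        # all_obs: (batch, n_agents, obs_dim), all_acts: (batch, n_agents, act_dim)
        x = torch.cat([all_obs.flatten(1), all_acts.flatten(1)], dim=-1)
        return self.net(x)

n_agents, obs_dim, act_dim = 3, 10, 2
actors = [Actor(obs_dim, act_dim) for _ in range(n_agents)]
critic = CentralizedCritic(n_agents, obs_dim, act_dim)

# Training: the critic scores the joint state-action of all agents.
obs = torch.randn(32, n_agents, obs_dim)
acts = torch.stack([actors[i](obs[:, i]) for i in range(n_agents)], dim=1)
q_value = critic(obs, acts)

# Execution: each agent only needs its own local observation.
local_action = actors[0](torch.randn(obs_dim))
```

A full MADDPG implementation would additionally use per-agent critics, target networks, and a replay buffer; they are omitted here for brevity.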
In [149], a DRL-based AVC scheme is developed for autonomous grid operation, which takes control actions to ensure secure SG operations under various randomly generated operating conditions. Numerical studies on a realistic 200-bus test system demonstrate the effectiveness and promising performance of the proposed method. In addition, a physical-model-free AVC approach based on DDPG is presented in [150], which can cope with fast voltage fluctuations. A model-free DRL control strategy based on DQN is proposed in [151], which aims to enhance the bus voltage regulation performance of converters. The comparison of simulation results indicates the efficiency of the proposed control strategy for managing large signal perturbations. Wang et al. [152] proposed a novel DRL-based voltage regulation scheme for unbalanced low-voltage DNs, which is devoted to minimizing the expected total daily voltage regulation cost while satisfying operational constraints. An attention-enabled MATD3 algorithm is designed in [153] for decentralized AVC, which is demonstrated to be effective in dealing with uncertainties, reducing communication requirements, and achieving fast decision-making processes. In addition, a novel hierarchical DRL, referred to as the ARS algorithm, is proposed in [154], where the lower level DRL agents are trained in an areawise decentralized manner, and the higher level agent is trained to coordinate the actions executed by lower level agents. Numerical experiments verify the advantages and various intricacies of the hierarchical method applied to the IEEE 39-bus power system. Huang et al. [155] formulated a derivative-free PARS algorithm for AVC via load shedding, which can overcome the control problems of existing DRL algorithms, including computational inefficiency and poor scalability. Simulation results illustrate that the proposed method offers better computational efficiency, more robustness in learning, excellent scalability, and better generalization capacity, compared with other approaches.

In [156], a DDQN framework, which applies the graph convolutional network (GCN), referred to as GC-DDQN, is proposed to tackle topology changes in the AVC problem, where the GCN model assists the DRL algorithm to better capture topology changes and spatial correlations in nodal features. A model-free centralized training and decentralized execution multiagent SAC (MASAC) framework is designed in [157] for AVC with high penetration of PVs. Comparative simulation studies demonstrate the superiority of the proposed approach in reducing the communication requirements. Nguyen and Choi [158] presented a three-stage AVC framework in SG using the online safe SAC method to reduce voltage violations, mitigate peak loads, and manage active power losses by coordinating the three stages with different control timescales. Numerical simulations for the IEEE 123-bus system demonstrate the high efficiency and safety of the presented method for regulating voltages. In addition, a novel deep meta-reinforcement learning (DMRL) algorithm is developed in [159], which combines the meta-strategy optimization with PARS to maintain voltage stability. Experimental results show that the performance of the proposed method surpasses those of state-of-the-art DRL and model predictive control approaches.

3) Load Frequency Control: LFC is also a complicated decision-making problem in SG applications. To this end, DRL is introduced for restoring the frequency and tie-line power flows to their nominal values after disturbances. Therefore, the reward could be defined as negative frequency and tie-line flow deviations. A novel control strategy for distributed LFC is developed in [160], which is based on the multiagent DDQN with action discovery (DDQN-AD) algorithm. The approach shows a faster convergence speed and stronger learning ability compared with other traditional methods. In [161], a TDAC control strategy is proposed for LFC to deal with strong random disturbances caused by RE. Simulation studies show that TDAC has excellent exploratory stability and learning capability, which improves the power system dynamic performance and achieves the regional optimal coordinated control. In addition, a multistep unified RL method is proposed in [162] for managing the LFC in a multiarea interconnected power grid, which proves to outperform other traditional algorithms in terms of convergence and dynamic performance.
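As a concrete instance of the LFC reward described above, one simple and commonly used choice (a hedged illustration, not the exact form adopted in the cited works) is to penalize the frequency and tie-line power deviations of area i, or equivalently its area control error:

\mathrm{ACE}_i(t) = \Delta P_{\mathrm{tie},i}(t) + B_i\,\Delta f_i(t), \qquad r_i(t) = -\bigl(\alpha\,|\Delta f_i(t)| + \beta\,|\Delta P_{\mathrm{tie},i}(t)|\bigr) \quad \text{or} \quad r_i(t) = -\,\mathrm{ACE}_i(t)^2

where B_i is the frequency bias factor of area i, and alpha and beta are hypothetical weights balancing the two deviation terms.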
Yan and Xu [163] developed a data-driven LFC method based on DRL in the continuous action domain for minimizing the frequency deviations under uncertainties. Numerical simulations verify the effectiveness and advantages of the proposed method over other existing approaches.

A data-driven cooperative approach for LFC, which is based on MADDPG in a multiarea SG, is presented in [164]. The approach offers optimal coordinated control strategies for the LFC controller via centralized learning and decentralized implementation. Experimental results for a three-area SG demonstrate that the proposed algorithm can effectively minimize control errors against stochastic frequency variations. Khooban and Gheisarnejad [165] considered the DDPG to generate the supplementary control action for LFC, which is appraised for its systematic feasibility and applicability. In addition, a novel model-free LFC scheme is presented in [166], which adopts DDPG to learn near-optimal strategies under various scenarios. Numerical simulations on benchmark systems verify the effectiveness of the proposed scheme in achieving satisfactory control performances. Yan et al. [169] developed a data-driven algorithm for distributed frequency control of island microgrids based on multiagent quantum DRL (MAQDRL). Numerical tests illustrate that the designed method can effectively regulate the frequency with better time delay tolerance. In [167], a DDPG-based data-driven approach for optimal control of ESS is proposed to support LFC. Simulation results in a three-area SG demonstrate the effectiveness of the proposed approach in supporting frequency regulation. In addition, the DDPG algorithm is combined with sensitivity analysis theory in [168], in order to learn the sparse coordinated LFC policy of multiple power grids. Numerical experiments verify that the proposed approach can obtain better performance of damping oscillation and robustness against wind power uncertainty.

To conclude, this section reviews the applications of DRL for the operational control of SGs, which are inherently coupled with generation adjustment, voltage regulation, frequency stabilization, and so on. The reviewed methods are summarized along with adequate references in Tables 5–7. It could be observed that the most widely used DRL framework for the operational control of SG is centralized, while the decentralized manner is an irresistible trend with the prevalence of distributed generation. What is more, CNN is the most popular network architecture for the aforementioned DRL algorithms to extract features, while the novel GCN is gradually applied to capture the topology information of SG that a typical CNN cannot capture. Despite the successful applications of DRL in operational control, they are still deemed to be computationally inefficient and offer poor scalability to a certain extent, according to the statements of some related literature. To this end, SG calls for more advanced DRL frameworks to support its secure and stable operation via offering robust strategies for its operational control. In Section III-C, the DRL adoption in SG markets is discussed, which involves multiple entities and complex relationships.

C. Electricity Market

The reforming of the electricity power market has drawn much attention during the progressively undergoing restructuring of modern power systems. The emerging electricity market is regarded as a potential solution for improving the power system efficiency and optimizing SG operations [170]. In this situation, electricity retailers have appeared in various liberalized electricity markets as the intermediary between electricity power producers and consumers. However, the electricity market with retailers contains increasing uncertainties and complexities in both supply and retail sectors, which is a challenge that affects the decisions of participants. Indeed, the decision-making process of the electricity market is extremely complicated, as shown in Fig. 16, which mainly consists of energy bidding and retail pricing strategies [171]. On the one hand, the energy bidding process is a vital decision-making step for suppliers, which requires generality in different situations. On the other hand, the retail pricing strategy is the core challenge for retailers to promote profitability, which should have the adaptability to cope with a dynamic and complex environment.

Accordingly, conventional methods are proposed to promote the implementation of the electricity market, such as the
In addition, the P2P energy trading problem in a community market with many participating households is investigated in [202], which accounts for heterogeneity with respect to their DER portfolios. In order to address this problem, a novel DRL algorithm named MADDPG with parameter sharing (MADDPG-PS) is proposed in this article, which achieves significant operating cost and peak demand benefits. Samende et al. [203] presented an MADDPG-based algorithm for P2P electricity trading considering SG constraints. It minimizes the energy costs of prosumers who are participating in the P2P market. Numerical experiments on real-world datasets indicate that the proposed algorithm can reduce the energy cost while satisfying network constraints.

In [204], a distributed DQN-based method is developed to manage the energy trading between multiple virtual power plants through P2P and utility. Simulation results show that the designed method can adjust its action according to the available energy demand and uncertain environment adaptively. An improved MADDPG method-based double-side auction market is formulated in [205], in order to address the automated P2P energy trading problem among multiple consumers and prosumers. Case studies demonstrate that the proposed algorithm can promote the economic benefits of prosumers in P2P energy trading. In addition, Zhang et al. [206] developed an MADDPG-based P2P energy trading model among microgrids to improve the resource utilization and operational economy. Simulation results illustrate that the designed algorithm could reduce the operation cost of each microgrid by 0.09%–8.02%, compared to baselines.

Taking the privacy concern of P2P trading into account, Ye et al. [207] proposed a scalable and privacy-preserving P2P energy trading scheme based on the MAAC algorithm. Simulation studies, including a real-world, large-scale scenario with 300 residential participants, demonstrate that the proposed approach significantly outperforms the state-of-the-art MADRL algorithms in reducing the operation cost and peak demand. In addition, Wang et al. [208] provided a novel hybrid community P2P market framework for multienergy systems, where a data-driven market surrogate model-enabled DRL method is proposed to facilitate P2P transactions within constraints. Specifically, a market surrogate model based on a deep belief network is developed to characterize the P2P transaction behaviors of peers in the market without disclosing their private data. In addition, an MADDPG-based energy trading algorithm is developed in [209] to formulate the optimal policy for each microgrid in the electricity market. Moreover, blockchain is adopted to guarantee the privacy of energy transaction data.

In summary, this section reviews the DRL applications in the SG electricity market, which mainly involve three actions, i.e., bidding, pricing, and P2P trading. DRL offers an effective tool for market participants to make optimal decisions, even without using the complete information about the electricity market. These approaches are summarized along with the references in Table 8. It is illustrated that most DRL frameworks for the electricity market only contain a single agent, while MADRL indicates a promising prospect with the development of the decentralized electricity market, e.g., the P2P trading market. First, the prevalence of the distributed electricity market calls for DRL algorithms with multiple agents, in which each agent is responsible for a local market. Second, the increasing concern about privacy leakage calls for MADRL approaches where multiple agents cooperatively train the model without the need of sharing datasets. Moreover, policy-based DRL methods are adopted more extensively than value-based ones in SG electricity market operations, due to the complexity in both supply and retail sectors. In Section III-D, we conduct a discussion on DRL applications in SG operations that will highlight future research trends.

Table 8 Applications of DRL on the Wholesale, Retail, and P2P Electricity Markets

D. Emerging Areas

In recent years, industry has witnessed the SG digitization and modernization via the numerous deployments of advanced metering infrastructures. On this basis, SG will maintain secure, economic, and sustainable operations, compared with those in traditional power systems. Meanwhile, the widespread popularity of smart meters and RE also brings about some emerging issues that conventional power systems have seldom encountered, including network security and privacy concerns. Since these problems are rather new in SG operations, typical methods may not cope with them in an effective manner. To this end, data-driven DRL approaches are introduced in these emerging areas to assist SGs in tackling the aforementioned issues. In the rest of this section, detailed applications of DRL on network security and privacy preservation are depicted as follows.

1) Network Security: With rapid SG developments in active DNs, various sensing, communication, and control devices are deployed to maintain a secure SG operation. However, these cyber-physical components have also expanded the landscape of cyber threats, which have further resulted in SG vulnerabilities to malicious cyberattacks [210], [211], [212]. Even though regular defense strategies, such as intrusion prevention systems and firewalls, are provided in SGs, such methods might not be very effective while facing the many unknown vulnerabilities [213]. To this end, DRL is applied in SGs to offer additional defense strategies for mitigating the blackout risks during cyberattacks. Accordingly, the reward should be designed to incentivize actions that enhance network security and discourage actions that compromise it. For example, DRL is applied to assist SG operators in counteracting malicious cyberattacks in [214], which investigates the possibility of defending the SG using a DQN agent. Simulation results not only demonstrate the effectiveness of the proposed DQN algorithm but also pave the way for defending the SG under a sophisticated cyberattack.

Liu et al. [215] proposed a cybersecurity assessment approach based on DQN to determine the optimal
attack transition policy. Numerical and real-time simulation experiments verify the performance of the developed algorithm without the need for full observation of the system. A DQN-based DRL algorithm is developed in [216] for the low-latency detection of cyberattacks in SGs, which aims at minimizing the detection delay while maintaining a high accuracy. Case studies verify that the DQN-based algorithm could achieve very low detection delays while ensuring a good performance. In addition, a DRL-based approach is proposed in [217] to detect data integrity attacks, which checks whether the system is currently under attack by introducing LSTM to extract state features of previous time steps. Simulation studies illustrate that the proposed detection approach outperforms the benchmarked metrics, including the delay error rate and false rate.

Moreover, Chen et al. [218] proposed a model-free defense strategy for SG secondary frequency control with the help of DRL, which proves to be effective through validation based on the IEEE benchmark systems. In [219], an MADDPG algorithm is proposed for SSA, which integrates DRL and edge computing to conduct efficient SSA deployment in SGs. In addition, a comprehensive risk assessment model of excessive traffic concentration in an SG is established in [220], which considers the link delay and load balancing simultaneously. Then, a DQN-based route planning algorithm is designed to find the optimal route, which not only meets the delay requirements but also enhances the resilience of the SG. To address the ever-increasing FDI attacks in SG, Zhang et al. [221] proposed a resilient optimal defensive strategy with a distributed DRL method, which devotes itself to correcting false price information and making the optimal recovery strategy for fighting against the FDI attack. Numerical studies reveal that the distributed DRL algorithm provides a promising way for the optimal SG defense against cyberattacks. In [222], a DQN detection scheme is presented to defend against data integrity attacks in SG. Experimental results demonstrate that the developed method surpasses the existing DRL-based detection scheme in terms of accuracy and rapidity. In addition, an MADRL with prioritized experience replay algorithm is proposed to identify the critical lines under coordinated multistage cyberattacks, which contributes to deploying the limited defense resources optimally and mitigating the impact of cyberattacks.
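Prioritized experience replay, used in the last work above (and in the PER-DDPG dispatch model reviewed earlier in this article), replays stored transitions with probability proportional to their temporal-difference error rather than uniformly. The following is a minimal sketch of the proportional variant; the hyperparameters alpha and beta, the capacity, and the flat-array storage are illustrative simplifications rather than values or structures taken from the cited papers.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional prioritized experience replay (illustrative)."""
    def __init__(self, capacity=10000, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities, self.pos = [], [], 0

    def add(self, transition, td_error=1.0):
        priority = (abs(td_error) + self.eps) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(priority)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        probs = np.array(self.priorities)
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights = weights / weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```

A production implementation would typically store priorities in a sum-tree for logarithmic sampling; a flat array is used here only to keep the sketch readable.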
2) Privacy Preservation: To an extent, the widespread deployment of advanced meters in SG has also raised serious concerns from the privacy perspective, which is regarded as one of the main obstacles to SG modernization [223]. In fact, the fine-grained smart meter data carries sensitive information about consumers, posing a potential threat to privacy preservation. Traditional methods have been proposed for privacy preservation in SGs, such as data aggregation and encryption [224], data downsampling [225], and random noise addition [226]. However, these approaches may restrict the potential applications of SG data in an uncontrolled manner, e.g., time delay of fault detection and degradation of detection precision. In this regard, DRL is introduced to provide the optimal operational strategy while ensuring the privacy security of consumers.

When applying DRL to privacy protection in power systems, the design of the reward function can vary depending on the specific goals and requirements. For instance, Lee and Choi [227] proposed a privacy-preserving method based on federated RL for the energy management of smart homes with PV and ESS. It develops a novel distributed A2C model that contains a global server and local home energy management systems. First, the A2C agents for local energy management systems construct and upload their models to the global server. After that, the global server aggregates the local models to update a global model and broadcasts it to the A2C agents. Finally, the A2C agents replace the previous local models with the global one and reconstruct their local models, iteratively. In this way, data sharing between local systems is prevented, thus preserving SG privacy (a minimal sketch of this aggregation step is given at the end of this subsection). In [228], a distributed DRL algorithm is employed for devising the intelligent management of household power consumption. More specifically, the interactions of SGs and household appliances are established as a noncooperative game problem, which is addressed by the DPG algorithm considering privacy protection. In addition, a privacy-aware smart meter framework is investigated in [229] that utilizes the battery to hide the actual power consumption of a household. In detail, the problem of searching for the optimal charging/discharging policy for reducing information leakage with minimal additional energy cost is formulated as an MDP, which is handled by the DDQN with mutual information. As demonstrated by simulation studies, the performance of the developed algorithm achieves significant improvements over the state-of-the-art privacy-aware demand shaping approaches.

In [230], a novel federated learning framework is presented for privacy-preserving and communication-efficient energy data analysis in SG. On this basis, a DQN-based incentive algorithm with two layers is devised to offer optimal operational strategies. Extensive simulations validate that the designed scheme can significantly stimulate high-quality data sharing while preserving privacy. Wang et al. [231] proposed a data privacy-aware routing algorithm based on DDPG for communication issues in SGs, to realize latency reduction and load balancing. Experimental results show that the formulated privacy-aware routing protocol can effectively reduce the latency while maintaining excellent load balancing.

A privacy-preserving Q-learning framework for the SG energy management is formulated in [232], which is verified to be effective in energy management without privacy leakage. In addition, Zhang et al. [233] developed an intelligent demand response resource trading framework, in which the dueling DQN is constructed to simulate the bilevel Stackelberg game in a privacy-protecting way. Numerical experiments demonstrate that the designed approach has an outstanding performance in reducing energy cost as well as preserving privacy.

Liu et al. [234] presented a battery-based intermittently differential privacy scheme to realize privacy protection. Afterward, it develops a DDPG-based algorithm to offer the optimal battery control policy, in order to maintain the battery power level and realize cost saving. Case studies illustrate that the proposed method has a better performance in both cost saving and privacy preservation. A DQN-based technique is applied in [235] to keep the balance between privacy protection and knowledge discovery during SG data analysis. In [236], a hierarchical SAC-based energy trading scheme is presented in electricity markets, by which the prosumers' privacy concerns are tackled because the training process would only require the local observations. Extensive simulations validate that the proposed algorithm can effectively reduce the daily cost of prosumers without privacy leakage. In addition, a DDPG-based energy management approach is developed in [237] for integration in SG systems, which addresses the privacy issues via local data executions. Experimental results demonstrate that the proposed scheme can achieve good performances while preserving the data privacy.

To conclude, this section reviews DRL applications to SG network security and privacy preservation. These methods are summarized in Table 9 along with the corresponding references. It is observed that the value-based DQN is the most popular DRL algorithm for managing network security, while policy-based DRL methods proposed for privacy preservation include both deterministic and stochastic policies. Furthermore, decentralized DRL frameworks for handling emerging SG issues receive more attention than other architectures, which is due to the additional requirements for maintaining security and privacy. However, DRL applications are relatively inadequate for managing the SG emerging issues, which call for more investigation and exploration in the future. Although there have been numerous studies on DRL applications in SGs, many critical problems would still need to be addressed before their practical implementations. On the one hand, DRL applications to SG systems are still relatively new and require further research before maturity. On the other hand, it is necessary to reassess the DRL advantages and limitations in SG applications, which are among the most complex and critically engineered systems in the world. Although real-world DRL applications in SG operations are relatively limited, this technology holds great potential for SG applications, particularly in tackling complex decision-making and control problems.
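Referring back to the federated A2C scheme of [227] reviewed above, the sketch below illustrates only the server-side aggregation step in which local model parameters are averaged into a global model and broadcast back to the local agents. It is a generic federated-averaging sketch under our own simplifying assumptions (equal client weights, parameters stored as NumPy arrays), not the exact procedure of [227].

```python
import numpy as np

def aggregate_local_models(local_models):
    """Average parameter dictionaries uploaded by local agents (FedAvg-style)."""
    global_model = {}
    for name in local_models[0]:
        global_model[name] = np.mean([m[name] for m in local_models], axis=0)
    return global_model

def broadcast(global_model, local_models):
    """Each local agent replaces its parameters with the aggregated ones."""
    for m in local_models:
        for name in m:
            m[name] = global_model[name].copy()

# One federated round with three hypothetical home energy management agents.
locals_ = [{"w": np.random.randn(4, 4), "b": np.random.randn(4)} for _ in range(3)]
global_model = aggregate_local_models(locals_)
broadcast(global_model, locals_)
# Raw household data never leaves the local agents; only parameters are shared.
```

The privacy benefit comes from exchanging model parameters instead of raw smart meter data, at the cost of extra communication rounds between the server and the local agents.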
Therefore, a comprehensive review of DRL applications in SG operations can help comprehend unsolved problems in this domain and provide guidance to promote its development, which is one of the intentions for drafting this survey article.

IV. CHALLENGES AND OPEN RESEARCH ISSUES

We have mentioned that the difficulty of SG operations mainly stems from strong uncertainty, the curse of dimensionality, and the lack of accurate models. As one of the model-free approaches, RL can deal with variable RE and uncertain load demand issues by interacting with the environment in the absence of sufficient knowledge data. In addition, the curse of dimensionality can be handled with DNN. Therefore, DRL shows great potential in addressing the pertinent SG operation issues. However, current DRL methods still have a certain extent of limitation, which is mainly due to their dependence on handcrafted reward functions. It is not easy to design a reward function that encourages the desirable behaviors. Furthermore, even the most reasonable reward function cannot avoid local optimality, which belongs to the typical exploration–exploitation dilemma and has puzzled DRL applications for a long time. Hence, a relatively comprehensive survey of DRL approaches, potential solutions, and future directions is discussed in this section.

A. Security Concerns

SGs are critical infrastructures in modern power systems, which can handle sustainable, secure, economic, and reliable power system operations. To this end, it is crucial for DRL algorithms to ensure secure decisions in the learned policy that would not lead to potentially catastrophic consequences in SGs. For instance, the control commands issued by DRL should not violate physical SG constraints that could possibly result in device failures, grid instability, or even system breakdown. At present, DRL studies can be divided into three categories, including modifications in optimization criteria, modifications in exploration processes [238], and offline DRL methods [239].

1) Modifying Optimization Criterion: In general, the purpose of DRL is primarily focused on maximizing long-term rewards without explicitly considering the potential harm caused by dangerous states to the agent. In other words, the objective function of traditional DRL does not incorporate a description of decision risks. Moreover, if the objective function is designed inadequately, the DRL agent may encounter safety issues. To this end, the transformation of the optimization criterion has been proposed to take the risk into account. This can be achieved through various approaches, such as directly penalizing infeasible solutions [240], penalizing the worst case scenario [241], or incorporating constrained optimality within the reward function [242]. For example, Qian et al. [110] incorporated constrained optimality within the reward function through using a Lagrangian function of power flow constraints.

2) Modifying Exploration Process: Unrestricted random exploration can potentially expose the agents to highly dangerous states. To prevent unforeseen and irreversible consequences, it is essential to evaluate the DRL agent security during training and deployment and restrict their exploration within permissible regions. Such methods can be categorized as the modification of exploration processes with a focus on ensuring security [243].
The modification can be achieved through various approaches, such as embedding external knowledge [244] and constraining exploration within a certified safe region [245]. Cui et al. [246] formulated the online preventive control problem for mitigating transmission overloads as a constrained MDP. The constrained MDP is then solved using the interior-point policy optimization, which encourages the agent to satisfy the pertinent constraints while improving the policy simultaneously.

3) Offline DRL: The two categories for modifying the optimization criteria and exploration process are regarded as online DRL, where the agent learns how to perform tasks by continuously interacting with the environment. In contrast, offline DRL requires an agent that can learn solely from static offline datasets without exploration, thus ensuring the training safety from the perspective of data [247]. However, such approaches do not consider risk-related factors during policy deployment phases and, therefore, might not guarantee security at the time of deployment [248].

In response to safety concerns, four related DRL variants are briefly introduced here, which include constrained DRL, adversarial DRL, robust DRL, and federated DRL, as presented in the following.

1) Constrained DRL: It refers to the application of RL techniques to solve SG problems with explicit constraints. Generally, there are two types of constraints, soft and hard, considered in the literature. Soft constraints allow for some degree of violation, whereas hard constraints must be strictly adhered to. On the one hand, there are common approaches to addressing soft constraints, including adjoining constraints to the reward through barrier or penalty functions and formulating constraints as chance constraints (i.e., setting a predefined threshold for the probability of constraint violation), or a budget constraint as follows [249]:

\max_{\pi} J(\pi), \quad \text{s.t.} \; J_c(\pi) \leq \bar{J}

where the goal of the agent is to find a control policy \pi that maximizes the expected return with respect to the reward function J, subject to a budget \bar{J} on the return with respect to the cost function J_c (a minimal sketch of a Lagrangian treatment of this constraint follows this list). However, constrained DRL methods that focus on soft constraints alone may not guarantee safe exploration during the training phase. In addition, even after training convergence, the control actions generated by the trained policy may not always be entirely safe [250]. On the other hand, the enforcement approach is to take conservative actions while dealing with hard constraints in constrained DRL [251]. Nevertheless, the enforcement approach usually results in significant conservatism and might have large errors for complex power networks.

2) Adversarial DRL: It involves training a DRL agent in the presence of adversarial agents or environments that actively try to disturb the learning process or achieve their own objectives [252]. Adversarial training has been applied to enhance DRL algorithms against adversarial attacks in managing SG cybersecurity. For instance, the attack and defense problems are formulated as MDP and adversarial MDP in [253], while the robust defense strategy is generated by adversarial training between attack and defense agents. In [254], a repeated game is formulated to mimic the real-world interactions between attackers and defenders in SGs. Furthermore, according to [255], it has been observed that a high-performing DRL agent, initially vulnerable to action perturbations, can be made more resilient against similar perturbations through the application of adversarial training. It is indeed worth mentioning that naively applying adversarial training may not be effective for all DRL tasks [256]. Adversarial training is a complex and challenging process that requires careful consideration and customization for each specific task.

3) Robust DRL: It incorporates robust optimization techniques to ensure that the learned policies remain effective even in the presence of uncertainties and perturbations, thereby improving the overall performance and stability of the DRL agent [257]. To be specific, robust DRL considers the worst case scenario or min–max framework to learn a control policy that maximizes the reward with respect to the worst case scenario or outcome encountered during the learning process. By training against these worst case scenarios, the agent becomes more resilient and capable of making effective decisions even in the face of uncertainties or adversarial conditions. The utilization of the min–max structure in DRL algorithms has been a vibrant area of research. Previous studies primarily focus on addressing two types of uncertainties, i.e., inherent uncertainty stemming from the stochastic nature of the system and parameter uncertainty arising from incomplete knowledge about certain parameters of the MDP [258], [259]. While robust DRL has not yet received extensive attention in the context of SG, it holds significant potential as a future direction to tackle the diverse uncertainties present in the environment, such as model uncertainty, noise, and disturbances.

4) Federated DRL: The concern regarding SG security and privacy is one of the main obstacles in SG operations. However, extensive previous research on DRL applications in SG mainly belongs to the centralized method, which is vulnerable to cyberattack and privacy leakage. To this end, federated learning is combined with DRL to meet the requirements of privacy preservation and network security [260]. By combining federated learning and DRL, federated DRL enables collaborative learning while preserving data privacy and reducing the communication overhead between the central server and distributed devices. For instance, Li et al. [261] proposed a federated MADRL algorithm via the physics-informed reward to solve the complex multiple microgrids energy management with privacy concern. Federated learning enables multiple agents to coordinately learn a shared decision model while keeping all the training data on device, thus preventing the risk of privacy leakage. What is more, the decentralized structure of federated learning offers a promising technique to reduce the pressure of centralized data storage. Therefore, it is meaningful to investigate a combination of federated learning and DRL in SG operations.
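As promised in item 1) above, here is a minimal sketch of one common way to handle the budget constraint max_pi J(pi) s.t. J_c(pi) <= J_bar: relax it with a Lagrange multiplier, fold the cost into the reward, and update the multiplier by dual ascent on the observed constraint violation. The learning-rate value, the episode interface, and the class and function names are our own illustrative assumptions, not a prescription from [249] or from Qian et al. [110].

```python
class LagrangianRewardShaper:
    """Fold a budget constraint J_c(pi) <= J_bar into the reward via a multiplier."""
    def __init__(self, cost_budget, lambda_init=0.0, lambda_lr=0.01):
        self.cost_budget = cost_budget   # J_bar: allowed expected episode cost
        self.lmbda = lambda_init         # Lagrange multiplier (kept non-negative)
        self.lambda_lr = lambda_lr

    def shaped_reward(self, reward, cost):
        # The agent maximizes r - lambda * c, trading off return against cost.
        return reward - self.lmbda * cost

    def update_multiplier(self, episode_cost):
        # Dual ascent: raise lambda when the episode cost exceeds the budget,
        # relax it otherwise, never letting it go negative.
        self.lmbda = max(0.0, self.lmbda
                         + self.lambda_lr * (episode_cost - self.cost_budget))


# Hypothetical usage inside any standard DRL training loop:
shaper = LagrangianRewardShaper(cost_budget=5.0)
# r_shaped = shaper.shaped_reward(r, constraint_violation)   # per step
# shaper.update_multiplier(sum_of_episode_costs)             # per episode
```

The multiplier plays the role of an adaptive penalty weight: it grows while the policy is infeasible and shrinks once the constraint is comfortably satisfied.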
B. Sample Efficiency

Despite the success of DRL, it usually needs at least thousands of samples to gradually learn some useful policies, even for a simple task. However, real-world or real-time interactions between the agent and the environment are usually costly, and they still require time and energy consumption even in a simulation platform. This brings about a critical problem for DRL, i.e., how to design a more efficient algorithm that learns faster with fewer samples. At present, most DRL algorithms have such low learning efficiency that they require unbearable training time under current computational power. The situation is even worse for real-world interactions, where potential security concerns, risks of failure cases, and time consumption all put forward higher requirements on the learning efficiency of DRL algorithms in practice.

1) Model-Based DRL: Different from the aforementioned model-free methods, model-based DRL generally indicates that an agent not only learns a policy to estimate its action but also learns a model of the environment to assist its action planning, thus accelerating the speed of policy learning. Learning an accurate model of the environment provides additional pertinent information that could be helpful in evaluating the agent's current policy, which can make the entire learning process more efficient. In principle, a good model could handle a bunch of problems, as AlphaGo has done. Therefore, it is meaningful to integrate model-based and model-free DRL and promote the sample efficiency in SGs. On the one hand, model-based methods can be utilized as warm-starts or the nominal model, providing initial information or serving as a foundation for model-free DRL methods. On the other hand, model-free DRL algorithms can coordinate and fine-tune the parameters of existing model-based controllers to improve their adaptability while maintaining baseline performance guarantees. Although the amount of research in this area is currently limited, the integration of model-free DRL with existing model-based approaches is considered to be a promising direction for future research.

2) Imitation Learning Combined DRL: Imitation learning attempts to not only mimic the actions and choices of experts but also learn a generalized policy that can handle unseen situations. The combination of imitation learning and RL is a very promising research field that has been extensively studied in recent years [262], [263]. It has been applied in various domains such as autonomous driving [264], quantitative trading [265], and the optimal SG dispatch [266], to tackle the challenge of low learning efficiency in DRL. For example, Guo et al. [267] combined DRL with imitation learning for cloud resource scheduling, where DRL is devoted to tackling the challenging multiresource scheduling problem and imitation learning enables an agent to learn an optimal policy more efficiently. In conclusion, the integration of imitation learning and DRL can provide a powerful learning framework that enables fast learning, generalization, and effectiveness. This combination is significant for addressing complex tasks and improving the learning capabilities of intelligent agents.

C. Learning Stability

Unlike stable supervised learning, present DRL algorithms are volatile to a certain extent, which means that there exist huge differences in learning performance both over time and in horizontal comparisons across multiple runs. Specifically, learning instability over time generally manifests as large local variances or nonmonotonicity on a single learning curve. As for unstable learning across runs, it manifests as a significant performance difference between different runs during training, which leads to large variances in horizontal comparisons. What is more, the endogenous instability and unpredictability of DNN aggravate the deviation of value function approximation, which further brings about noise in the gradient estimators and unstable learning performance. Significant efforts have been dedicated to addressing the stability problem in DRL for a considerable period of time. As mentioned in this article, the utilization of a target network with delayed updates and the incorporation of a replay buffer have been shown to mitigate the issue of unstable learning. In addition, TRPO employs second-order optimization techniques to provide more stable updates and comprehensive information. It applies constraints to the updated policy to ensure conservative yet stable improvements. However, DRL remains sensitive to hyperparameters and initialization even with the above works. This sensitivity poses a significant challenge and highlights the need for further research in this area to address these issues and improve the robustness and stability of DRL algorithms.
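The target network with delayed updates mentioned above is often implemented as a Polyak (soft) update, in which the target parameters slowly track the online parameters with a small mixing coefficient. The sketch below shows this update for plain NumPy parameter dictionaries; the coefficient value is an illustrative default rather than a recommendation from the surveyed works.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for name in target_params:
        target_params[name] = (tau * online_params[name]
                               + (1.0 - tau) * target_params[name])
    return target_params

# Hypothetical usage after every gradient step of the online network:
online = {"w": np.random.randn(8, 8)}
target = {"w": online["w"].copy()}   # initialized identically
target = soft_update(target, online, tau=0.005)
```

The alternative delayed-update scheme is a hard update, i.e., copying the online weights into the target network every fixed number of steps, as in the original DQN.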
1) Multiagent DRL: With the development of DRL, MADRL is proposed and has attracted much attention. In fact, MADRL is regarded as a promising and worth exploring direction, which provides a novel way to investigate unconventional DRL situations, including swarm intelligence, unstable environments for each agent, and the innovation of the agent itself. MADRL not only makes it possible to explore distributed intelligence in multiagent environments but also contributes to learning a near-optimal agent policy in large-scale SG applications. Overall, multiple agents and their interactions in MADRL can enhance the learning stability of DRL by promoting exploration, facilitating experience sharing, enabling policy coordination, and improving robustness to environmental changes. These characteristics make MADRL a promising approach for addressing the learning stability challenges in DRL.

accuracy. Taking SG as an example, a single error in operation can result in catastrophic consequences. Moreover, most of the existing literature trains DRL policies solely on high-fidelity power system simulators, which do not emphasize the gap between the simulators and real-world SG operations, i.e., the reality gap. Therefore, policies trained in simulators may not always exhibit reliable performance in real-world scenarios due to the existence of the reality gap. In general, methods for addressing the simulation-to-reality gap can be categorized into at least the two following approaches.

safety concerns, sample efficiency, learning stability, exploration, simulation to reality, and so on. Furthermore, this article does not propose a dichotomy between DRL and conventional methods. Instead, DRL can serve as a complement to existing approaches and enhance them in a data-driven manner. In summary, careful consideration should be devoted to identifying appropriate DRL application scenarios and utilizing them effectively in SG applications.
deep reinforcement learning,” 2019, “Blockchain-based cyber-physical security for “Batch-constrained reinforcement learning for
arXiv:1907.06704. electrical vehicle aided smart grid ecosystem,” dynamic distribution network reconfiguration,”
[57] C.-Y. Tang, C.-H. Liu, W.-K. Chen, and S. D. You, IEEE Trans. Intell. Transp. Syst., vol. 22, no. 8, IEEE Trans. Smart Grid, vol. 11, no. 6,
“Implementing action mask in proximal policy pp. 5178–5189, Aug. 2021. pp. 5357–5369, Nov. 2020.
optimization (PPO) algorithm,” ICT Exp., vol. 6, [75] M. B. Gough, S. F. Santos, T. AlSkaif, M. S. Javadi, [93] S. Bahrami, Y. C. Chen, and V. W. S. Wong, “Deep
no. 3, pp. 200–203, Sep. 2020. R. Castro, and J. P. S. Catalão, “Preserving privacy reinforcement learning for demand response in
[58] D. Silver, G. Lever, N. Heess, T. Degris, of smart meter data in a smart grid environment,” distribution networks,” IEEE Trans. Smart Grid,
D. Wierstra, and M. Riedmiller, “Deterministic IEEE Trans. Ind. Informat., vol. 18, no. 1, vol. 12, no. 2, pp. 1496–1506, Mar. 2021.
policy gradient algorithms,” in Proc. Int. Conf. pp. 707–718, Jan. 2022. [94] N. L. Dehghani, A. B. Jeddi, and A. Shafieezadeh,
Mach. Learn., 2014, pp. 387–395. [76] Y. Li, Y. Zhao, L. Wu, and Z. Zeng, Artificial “Intelligent hurricane resilience enhancement of
[59] S. Fujimoto, H. Hoof, and D. Meger, “Addressing Intelligence Enabled Computational Methods for power distribution systems via deep
function approximation error in actor-critic Smart Grid Forecast and Dispatch. Cham, reinforcement learning,” Appl. Energy, vol. 285,
methods,” in Proc. Int. Conf. Mach. Learn., 2018, Switzerland: Springer, 2023. Mar. 2021, Art. no. 116355.
pp. 1587–1596. [77] A. M. Fathabad, J. Cheng, K. Pan, and F. Qiu, [95] Y. Li et al., “Optimal operation of multimicrogrids
[60] A. Navas, J. S. Gómez, J. Llanos, E. Rute, D. Sáez, “Data-driven planning for renewable distributed via cooperative energy and reserve scheduling,”
and M. Sumner, “Distributed predictive control generation integration,” IEEE Trans. Power Syst., IEEE Trans. Ind. Informat., vol. 14, no. 8,
strategy for frequency restoration of microgrids vol. 35, no. 6, pp. 4357–4368, Nov. 2020. pp. 3459–3468, Aug. 2018.
considering optimal dispatch,” IEEE Trans. Smart [78] K. Utkarsh, D. Srinivasan, A. Trivedi, W. Zhang, [96] M. Mahmoodi, P. Shamsi, and B. Fahimi,
Grid, vol. 12, no. 4, pp. 2748–2759, Jul. 2021. and T. Reindl, “Distributed model-predictive “Economic dispatch of a hybrid microgrid with
[61] Z. Chen, J. Zhu, H. Dong, W. Wu, and H. Zhu, real-time optimal operation of a network of smart distributed energy storage,” IEEE Trans. Smart
“Optimal dispatch of WT/PV/ES combined microgrids,” IEEE Trans. Smart Grid, vol. 10, no. 3, Grid, vol. 6, no. 6, pp. 2607–2614, Nov. 2015.
generation system based on cyber-physical-social pp. 2833–2845, May 2019. [97] Y. Shi, S. Dong, C. Guo, Z. Chen, and L. Wang,
integration,” IEEE Trans. Smart Grid, vol. 13, [79] Y. Liu, L. Guo, and C. Wang, “A robust “Enhancing the flexibility of storage integrated
no. 1, pp. 342–354, Jan. 2022. operation-based scheduling optimization for smart power system by multi-stage robust dispatch,”
[62] T. Wu, C. Zhao, and Y. A. Zhang, “Distributed distribution networks with multi-microgrids,” IEEE Trans. Power Syst., vol. 36, no. 3,
AC–DC optimal power dispatch of VSC-based Appl. Energy, vol. 228, pp. 130–140, Oct. 2018. pp. 2314–2322, May 2021.
energy routers in smart microgrids,” IEEE Trans. [80] C. Guo, F. Luo, Z. Cai, and Z. Y. Dong, “Integrated [98] Y. Li et al., “Day-ahead risk averse market clearing
Power Syst., vol. 36, no. 5, pp. 4457–4470, energy systems of data centers and smart grids: considering demand response with data-driven
Sep. 2021. State-of-the-art and future opportunities,” Appl. load uncertainty representation: A Singapore
[63] Z. Zhang, C. Wang, H. Lv, F. Liu, H. Sheng, and Energy, vol. 301, Nov. 2021, Art. no. 117474. electricity market study,” Energy, vol. 254,
M. Yang, “Day-ahead optimal dispatch for [81] Z. J. Lee et al., “Adaptive charging networks: A Sep. 2022, Art. no. 123923.
integrated energy system considering framework for smart electric vehicle charging,” [99] A. Dridi, H. Afifi, H. Moungla, and J. Badosa,
power-to-gas and dynamic pipeline networks,” IEEE Trans. Smart Grid, vol. 12, no. 5, “A novel deep reinforcement approach for IIoT
IEEE Trans. Ind. Appl., vol. 57, no. 4, pp. 4339–4350, Sep. 2021. microgrid energy management systems,” IEEE
pp. 3317–3328, Jul. 2021. [82] C. Li, Z. Dong, G. Chen, B. Zhou, J. Zhang, and Trans. Green Commun. Netw., vol. 6, no. 1,
[64] Md. R. Islam, H. Lu, Md. R. Islam, M. J. Hossain, X. Yu, “Data-driven planning of electric vehicle pp. 148–159, Mar. 2022.
and L. Li, “An IoT-based decision support tool for charging infrastructure: A case study of Sydney, [100] Md. S. Munir, S. F. Abedin, N. H. Tran, Z. Han,
improving the performance of smart grids Australia,” IEEE Trans. Smart Grid, vol. 12, no. 4, E.-N. Huh, and C. S. Hong, “Risk-aware energy
connected with distributed energy sources and pp. 3289–3304, Jul. 2021. scheduling for edge computing with microgrid: A
electric vehicles,” IEEE Trans. Ind. Appl., vol. 56, [83] B. Zhou et al., “Optimal coordination of electric multi-agent deep reinforcement learning
no. 4, pp. 4552–4562, Jul. 2020. vehicles for virtual power plants with dynamic approach,” IEEE Trans. Netw. Service Manage.,
[65] X. Sun and J. Qiu, “Hierarchical voltage control communication spectrum allocation,” IEEE Trans. vol. 18, no. 3, pp. 3476–3497, Sep. 2021.
strategy in distribution networks considering Ind. Informat., vol. 17, no. 1, pp. 450–462, [101] L. Lei, Y. Tan, G. Dahlenburg, W. Xiang, and
customized charging navigation of electric Jan. 2021. K. Zheng, “Dynamic energy dispatch based on
vehicles,” IEEE Trans. Smart Grid, vol. 12, no. 6, [84] D. Cao, W. Hu, J. Zhao, Q. Huang, Z. Chen, and deep reinforcement learning in IoT-driven smart
pp. 4752–4764, Nov. 2021. F. Blaabjerg, “A multi-agent deep reinforcement isolated microgrids,” IEEE Internet Things J.,
[66] L. Xi, L. Zhang, Y. Xu, S. Wang, and C. Yang, learning based voltage regulation using vol. 8, no. 10, pp. 7938–7953, May 2021.
“Automatic generation control based on coordinated PV inverters,” IEEE Trans. Power Syst., [102] F. Sanchez Gorostiza and F. M. Gonzalez-Longatt,
multiple-step greedy attribute and multiple-level vol. 35, no. 5, pp. 4120–4123, Sep. 2020. “Deep reinforcement learning-based controller for
allocation strategy,” CSEE J. Power Energy Syst., [85] P. Kou, D. Liang, C. Wang, Z. Wu, and L. Gao, SOC management of multi-electrical energy
vol. 8, no. 1, pp. 281–292, Jan. 2022. “Safe deep reinforcement learning-based storage system,” IEEE Trans. Smart Grid, vol. 11,
[67] K. S. Xiahou, Y. Liu, and Q. H. Wu, “Robust load constrained optimal control scheme for active no. 6, pp. 5039–5050, Nov. 2020.
frequency control of power systems against distribution networks,” Appl. Energy, vol. 264, [103] T. Chen, S. Bu, X. Liu, J. Kang, F. R. Yu, and
random time-delay attacks,” IEEE Trans. Smart Apr. 2020, Art. no. 114772. Z. Han, “Peer-to-peer energy trading and energy
Grid, vol. 12, no. 1, pp. 909–911, Jan. 2021. [86] W. Wang, N. Yu, Y. Gao, and J. Shi, “Safe off-policy conversion in interconnected multi-energy
[68] K.-D. Lu, G.-Q. Zeng, X. Luo, J. Weng, Y. Zhang, deep reinforcement learning algorithm for microgrids using multi-agent deep reinforcement
and M. Li, “An adaptive resilient load frequency Volt-VAR control in power distribution systems,” learning,” IEEE Trans. Smart Grid, vol. 13, no. 1,
controller for smart grids with DoS attacks,” IEEE IEEE Trans. Smart Grid, vol. 11, no. 4, pp. 715–727, Jan. 2022.
Trans. Veh. Technol., vol. 69, no. 5, pp. 3008–3018, Jul. 2020. [104] H. Hua et al., “Data-driven dynamical control for
pp. 4689–4699, May 2020. [87] H. Liu and W. Wu, “Two-stage deep reinforcement bottom-up energy Internet system,” IEEE Trans.
[69] B. Hu, Y. Gong, C. Y. Chung, B. F. Noble, and learning for inverter-based Volt-VAR control in Sustain. Energy, vol. 13, no. 1, pp. 315–327,
G. Poelzer, “Price-maker bidding and offering active distribution networks,” IEEE Trans. Smart Jan. 2022.
strategies for networked microgrids in day-ahead Grid, vol. 12, no. 3, pp. 2037–2047, May 2021. [105] Y. Li, R. Wang, and Z. Yang, “Optimal scheduling
electricity markets,” IEEE Trans. Smart Grid, [88] X. Sun and J. Qiu, “Two-stage Volt/Var control in of isolated microgrids using automated
vol. 12, no. 6, pp. 5201–5211, Nov. 2021. active distribution networks with multi-agent reinforcement learning-based multi-period
[70] H. Haghighat, H. Karimianfard, and B. Zeng, deep reinforcement learning method,” IEEE Trans. forecasting,” IEEE Trans. Sustain. Energy, vol. 13,
“Integrating energy management of autonomous Smart Grid, vol. 12, no. 4, pp. 2903–2912, no. 1, pp. 159–169, Jan. 2022.
smart grids in electricity market operation,” IEEE Jul. 2021. [106] Y. Du and F. Li, “Intelligent multi-microgrid energy
Trans. Smart Grid, vol. 11, no. 5, pp. 4044–4055, [89] Q. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, management based on deep neural network and
Sep. 2020. and J. Sun, “Two-timescale voltage control in model-free reinforcement learning,” IEEE Trans.
[71] A. Paudel, L. P. M. I. Sampath, J. Yang, and distribution grids using deep reinforcement Smart Grid, vol. 11, no. 2, pp. 1066–1076,
H. B. Gooi, “Peer-to-peer energy trading in smart learning,” IEEE Trans. Smart Grid, vol. 11, no. 3, Mar. 2020.
grid considering power losses and network fees,” pp. 2313–2323, May 2020. [107] Z. Qin, D. Liu, H. Hua, and J. Cao, “Privacy
IEEE Trans. Smart Grid, vol. 11, no. 6, [90] Y. Li, G. Hao, Y. Liu, Y. Yu, Z. Ni, and Y. Zhao, preserving load control of residential microgrid
pp. 4727–4737, Nov. 2020. “Many-objective distribution network via deep reinforcement learning,” IEEE Trans.
[72] P. Zhuang, T. Zamir, and H. Liang, “Blockchain for reconfiguration via deep reinforcement learning Smart Grid, vol. 12, no. 5, pp. 4079–4089,
cybersecurity in smart grid: A comprehensive assisted optimization algorithm,” IEEE Trans. Sep. 2021.
survey,” IEEE Trans. Ind. Informat., vol. 17, no. 1, Power Del., vol. 37, no. 3, pp. 2230–2244, [108] Y. Li et al., “Coordinated scheduling for improving
uncertain wind power adsorption in electric [125] G. Zhang et al., “Data-driven optimal energy interconnected power grid with various renewable
vehicles—Wind integrated power systems by management for a wind-solar-diesel-battery- units,” IET Renew. Power Gener., vol. 16, no. 7,
multiobjective optimization approach,” IEEE reverse osmosis hybrid energy system using a pp. 1316–1335, May 2022.
Trans. Ind. Appl., vol. 56, no. 3, pp. 2238–2250, deep reinforcement learning approach,” Energy [142] Q. Zhang, H. Tang, Z. Wang, X. Wu, and K. Lv,
May 2020. Convers. Manage., vol. 227, Jan. 2021, “Flexible selection framework for secondary
[109] Y. Li, S. He, Y. Li, L. Ge, S. Lou, and Z. Zeng, Art. no. 113608. frequency regulation units based on learning
“Probabilistic charging power forecast of EVCS: [126] Y. Li, F. Bu, Y. Li, and C. Long, “Optimal scheduling optimisation method,” Int. J. Electr. Power Energy
Reinforcement learning assisted deep learning of island integrated energy systems considering Syst., vol. 142, Nov. 2022, Art. no. 108175.
approach,” IEEE Trans. Intell. Vehicles, vol. 8, multi-uncertainties and hydrothermal [143] L. Yin, L. Zhao, T. Yu, and X. Zhang, “Deep forest
no. 1, pp. 344–357, Jan. 2023. simultaneous transmission: A deep reinforcement reinforcement learning for preventive strategy
[110] T. Qian, C. Shao, X. Wang, Q. Zhou, and learning approach,” Appl. Energy, vol. 333, considering automatic generation control in
M. Shahidehpour, “Shadow-price DRL: A Mar. 2023, Art. no. 120540. large-scale interconnected power systems,” Appl.
framework for online scheduling of shared [127] S. Zhou et al., “Combined heat and power system Sci., vol. 8, no. 11, p. 2185, Nov. 2018.
autonomous EVs fleets,” IEEE Trans. Smart Grid, intelligent economic dispatch: A deep [144] J. J. Yang, M. Yang, M. X. Wang, P. J. Du, and
vol. 13, no. 4, pp. 3106–3117, Jul. 2022. reinforcement learning approach,” Int. J. Electr. Y. X. Yu, “A deep reinforcement learning method
[111] J. Zhang, Y. Guan, L. Che, and M. Shahidehpour, Power Energy Syst., vol. 120, Sep. 2020, for managing wind farm uncertainties through
“EV charging command fast allocation approach Art. no. 106016. energy storage system control and external
based on deep reinforcement learning with safety [128] S. Zhong et al., “Deep reinforcement learning reserve purchasing,” Int. J. Electr. Power Energy
modules,” IEEE Trans. Smart Grid, early access, framework for dynamic pricing demand response Syst., vol. 119, Jul. 2020, Art. no. 105928.
Jun. 5, 2023, doi: 10.1109/TSG.2023.3281782. of regenerative electric heating,” Appl. Energy, [145] V. P. Singh, N. Kishor, and P. Samuel, “Distributed
[112] C. Zhang, Y. Liu, F. Wu, B. Tang, and W. Fan, vol. 288, Apr. 2021, Art. no. 116623. multi-agent system-based load frequency control
“Effective charging planning based on deep [129] Y. Ye, D. Qiu, X. Wu, G. Strbac, and J. Ward, for multi-area power system in smart grid,” IEEE
reinforcement learning for electric vehicles,” IEEE “Model-free real-time autonomous control for a Trans. Ind. Electron., vol. 64, no. 6,
Trans. Intell. Transp. Syst., vol. 22, no. 1, residential multi-energy system using deep pp. 5151–5160, Jun. 2017.
pp. 542–554, Jan. 2021. reinforcement learning,” IEEE Trans. Smart Grid, [146] H. Wang, Z. Lei, X. Zhang, J. Peng, and H. Jiang,
[113] B. Lin, B. Ghaddar, and J. Nathwani, “Deep vol. 11, no. 4, pp. 3068–3082, Jul. 2020. “Multiobjective reinforcement learning-based
reinforcement learning for the electric vehicle [130] J. Li, T. Yu, and X. Zhang, “Coordinated load intelligent approach for optimization of activation
routing problem with time windows,” IEEE Trans. frequency control of multi-area integrated energy rules in automatic generation control,” IEEE
Intell. Transp. Syst., vol. 23, no. 8, system using multi-agent deep reinforcement Access, vol. 7, pp. 17480–17492, 2019.
pp. 11528–11538, Aug. 2022. learning,” Appl. Energy, vol. 306, Jan. 2022, [147] S. Hasanvand, M. Rafiei, M. Gheisarnejad, and
[114] F. Zhang, Q. Yang, and D. An, “CDDPG: A Art. no. 117900. M.-H. Khooban, “Reliable power scheduling of an
deep-reinforcement-learning-based approach for [131] B. Yang, X. Zhang, T. Yu, H. Shu, and Z. Fang, emission-free ship: Multiobjective deep
electric vehicle charging control,” IEEE Internet “Grouped grey wolf optimizer for maximum reinforcement learning,” IEEE Trans. Transport.
Things J., vol. 8, no. 5, pp. 3075–3087, Mar. 2021. power point tracking of doubly-fed induction Electrific., vol. 6, no. 2, pp. 832–843, Jun. 2020.
[115] A. A. Zishan, M. M. Haji, and O. Ardakanian, generator based wind turbine,” Energy Convers. [148] S. Wang et al., “A data-driven multi-agent
“Adaptive congestion control for electric vehicle Manage., vol. 133, pp. 427–443, Feb. 2017. autonomous voltage control framework using
charging in the smart grid,” IEEE Trans. Smart [132] Q. Sun, R. Fan, Y. Li, B. Huang, and D. Ma, deep reinforcement learning,” IEEE Trans. Power
Grid, vol. 12, no. 3, pp. 2439–2449, May 2021. “A distributed double-consensus algorithm for Syst., vol. 35, no. 6, pp. 4644–4654, Nov. 2020.
[116] H. Li, Z. Wan, and H. He, “Constrained EV residential We-Energy,” IEEE Trans. Ind. Informat., [149] J. Duan et al., “Deep-reinforcement-learning-
charging scheduling based on safe deep vol. 15, no. 8, pp. 4830–4842, Aug. 2019. based autonomous voltage control for power grid
reinforcement learning,” IEEE Trans. Smart Grid, [133] W. Fu, K. Wang, J. Tan, and K. Zhang, operations,” IEEE Trans. Power Syst., vol. 35, no. 1,
vol. 11, no. 3, pp. 2427–2439, May 2020. “A composite framework coupling multiple feature pp. 814–817, Jan. 2020.
[117] T. Wu et al., “Multi-agent deep reinforcement selection, compound prediction models and novel [150] D. Cao et al., “Model-free voltage control of active
learning for urban traffic light control in vehicular hybrid swarm optimizer-based synchronization distribution system with PVs using surrogate
networks,” IEEE Trans. Veh. Technol., vol. 69, optimization strategy for multi-step ahead model-based deep reinforcement learning,” Appl.
no. 8, pp. 8243–8256, Aug. 2020. short-term wind speed forecasting,” Energy Energy, vol. 306, Jan. 2022, Art. no. 117982.
[118] T. Qian, C. Shao, X. Li, X. Wang, Z. Chen, and Convers. Manage., vol. 205, Feb. 2020, [151] C. Cui, N. Yan, B. Huangfu, T. Yang, and C. Zhang,
M. Shahidehpour, “Multi-agent deep Art. no. 112461. “Voltage regulation of DC–DC buck converters
reinforcement learning method for EV charging [134] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and feeding CPLs via deep reinforcement learning,”
station game,” IEEE Trans. Power Syst., vol. 37, F. L. Lewis, “Optimal and autonomous control IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 69,
no. 3, pp. 1682–1694, May 2022. using reinforcement learning: A survey,” IEEE no. 3, pp. 1777–1781, Mar. 2022.
[119] T. Qian, C. Shao, X. Wang, and M. Shahidehpour, Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, [152] S. Wang, L. Du, X. Fan, and Q. Huang, “Deep
“Deep reinforcement learning for EV charging pp. 2042–2062, Jun. 2018. reinforcement scheduling of energy storage
navigation by coordinating smart grid and [135] S. Vijayshankar, P. Stanfel, J. King, E. Spyrou, and systems for real-time voltage regulation in
intelligent transportation system,” IEEE Trans. K. Johnson, “Deep reinforcement learning for unbalanced LV networks with high PV
Smart Grid, vol. 11, no. 2, pp. 1714–1723, automatic generation control of wind farms,” in penetration,” IEEE Trans. Sustain. Energy, vol. 12,
Mar. 2020. Proc. Amer. Control Conf. (ACC), May 2021, no. 4, pp. 2342–2352, Oct. 2021.
[120] L. Yan, X. Chen, J. Zhou, Y. Chen, and J. Wen, pp. 1796–1802. [153] D. Cao, J. Zhao, W. Hu, F. Ding, Q. Huang, and
“Deep reinforcement learning for continuous [136] J. Li, T. Yu, and X. Zhang, “Coordinated automatic Z. Chen, “Attention enabled multi-agent DRL for
electric vehicles charging control with dynamic generation control of interconnected power decentralized Volt-VAR control of active
user behaviors,” IEEE Trans. Smart Grid, vol. 12, system with imitation guided exploration distribution system using PV inverters and SVCs,”
no. 6, pp. 5124–5134, Nov. 2021. multi-agent deep reinforcement learning,” Int. J. IEEE Trans. Sustain. Energy, vol. 12, no. 3,
[121] E. A. M. Ceseña, E. Loukarakis, N. Good, and Electr. Power Energy Syst., vol. 136, Mar. 2022, pp. 1582–1592, Jul. 2021.
P. Mancarella, “Integrated electricity–heat–gas Art. no. 107471. [154] S. Mukherjee, R. Huang, Q. Huang, T. L. Vu, and
systems: Techno–Economic modeling, [137] L. Xi et al., “A deep reinforcement learning T. Yin, “Scalable voltage control using
optimization, and application to multienergy algorithm for the power order optimization structure-driven hierarchical deep reinforcement
districts,” Proc. IEEE, vol. 108, no. 9, allocation of AGC in interconnected power grids,” learning,” 2021, arXiv:2102.00077.
pp. 1392–1410, Sep. 2020. CSEE J. Power Energy Syst., vol. 6, no. 3, [155] R. Huang et al., “Accelerated derivative-free deep
[122] T. Yang, L. Zhao, W. Li, and A. Y. Zomaya, pp. 712–723, Sep. 2020. reinforcement learning for large-scale grid
“Dynamic energy dispatch strategy for integrated [138] J. Li, T. Yu, X. Zhang, F. Li, D. Lin, and H. Zhu, emergency voltage control,” IEEE Trans. Power
energy system based on improved deep “Efficient experience replay based deep Syst., vol. 37, no. 1, pp. 14–25, Jan. 2022.
reinforcement learning,” Energy, vol. 235, deterministic policy gradient for AGC dispatch in [156] R. R. Hossain, Q. Huang, and R. Huang, “Graph
Nov. 2021, Art. no. 121377. integrated energy system,” Appl. Energy, vol. 285, convolutional network-based topology embedded
[123] B. Zhang, W. Hu, D. Cao, Q. Huang, Z. Chen, and Mar. 2021, Art. no. 116386. deep reinforcement learning for voltage stability
F. Blaabjerg, “Deep reinforcement learning–based [139] D. Zhang et al., “Research on AGC performance control,” IEEE Trans. Power Syst., vol. 36, no. 5,
approach for optimizing energy conversion in during wind power ramping based on deep pp. 4848–4851, Sep. 2021.
integrated electrical and heating system with reinforcement learning,” IEEE Access, vol. 8, [157] D. Cao et al., “Data-driven multi-agent deep
renewable energy,” Energy Convers. Manage., pp. 107409–107418, 2020. reinforcement learning for distribution system
vol. 202, Dec. 2019, Art. no. 112199. [140] J. Li, T. Yu, H. Zhu, F. Li, D. Lin, and Z. Li, decentralized voltage control with high
[124] B. Zhang, W. Hu, D. Cao, Q. Huang, Z. Chen, and “Multi-agent deep reinforcement learning for penetration of PVs,” IEEE Trans. Smart Grid,
F. Blaabjerg, “Economical operation strategy of an sectional AGC dispatch,” IEEE Access, vol. 8, vol. 12, no. 5, pp. 4137–4150, Sep. 2021.
integrated energy system with wind power and pp. 158067–158081, 2020. [158] H. T. Nguyen and D.-H. Choi, “Three-stage
power to gas technology—A DRL-based [141] J. Li, J. Yao, T. Yu, and X. Zhang, “Distributed inverter-based peak shaving and Volt-VAR control
approach,” IET Renew. Power Gener., vol. 14, deep reinforcement learning for integrated in active distribution networks using online safe
no. 17, pp. 3292–3299, Dec. 2020. generation-control and power-dispatch of deep reinforcement learning,” IEEE Trans. Smart
Grid, vol. 13, no. 4, pp. 3266–3277, Jul. 2022. Apr. 2022. Smart Grid, vol. 12, no. 3, pp. 2176–2187,
[159] R. Huang et al., “Learning and fast adaptation for [176] X. Wei, Y. Xiang, J. Li, and X. Zhang, May 2021.
grid emergency control via deep meta “Self-dispatch of wind-storage integrated system: [193] V. Moghaddam, A. Yazdani, H. Wang, D. Parlevliet,
reinforcement learning,” IEEE Trans. Power Syst., A deep reinforcement learning approach,” IEEE and F. Shahnia, “An online reinforcement learning
vol. 37, no. 6, pp. 4168–4178, Nov. 2022. Trans. Sustain. Energy, vol. 13, no. 3, approach for dynamic pricing of electric vehicle
[160] L. Xi, L. Yu, Y. Xu, S. Wang, and X. Chen, “A novel pp. 1861–1864, Jul. 2022. charging stations,” IEEE Access, vol. 8,
multi-agent DDQN-AD method-based distributed [177] Y. Liang, C. Guo, Z. Ding, and H. Hua, pp. 130305–130313, 2020.
strategy for automatic generation control of “Agent-based modeling in electricity market using [194] L. Zhang, Y. Gao, H. Zhu, and L. Tao, “Bi-level
integrated energy systems,” IEEE Trans. Sustain. deep deterministic policy gradient algorithm,” stochastic real-time pricing model in multi-energy
Energy, vol. 11, no. 4, pp. 2417–2426, Oct. 2020. IEEE Trans. Power Syst., vol. 35, no. 6, generation system: A reinforcement learning
[161] L. Xi, J. Wu, Y. Xu, and H. Sun, “Automatic pp. 4180–4192, Nov. 2020. approach,” Energy, vol. 239, Jan. 2022,
generation control based on multiple neural [178] H. Guo, Q. Chen, Q. Xia, and C. Kang, “Deep Art. no. 121926.
networks with actor-critic strategy,” IEEE Trans. inverse reinforcement learning for objective [195] N. Z. Aitzhan and D. Svetinovic, “Security and
Neural Netw. Learn. Syst., vol. 32, no. 6, function identification in bidding models,” IEEE privacy in decentralized energy trading through
pp. 2483–2493, Jun. 2021. Trans. Power Syst., vol. 36, no. 6, pp. 5684–5696, multi-signatures, blockchain and anonymous
[162] L. Xi, L. Zhou, Y. Xu, and X. Chen, “A multi-step Nov. 2021. messaging streams,” IEEE Trans. Depend. Secure
unified reinforcement learning method for [179] M. Sanayha and P. Vateekul, “Model-based deep Comput., vol. 15, no. 5, pp. 840–852, Sep. 2018.
automatic generation control in multi-area reinforcement learning for wind energy bidding,” [196] J. Kang, R. Yu, X. Huang, S. Maharjan, Y. Zhang,
interconnected power grid,” IEEE Trans. Sustain. Int. J. Electr. Power Energy Syst., vol. 136, and E. Hossain, “Enabling localized peer-to-peer
Energy, vol. 12, no. 2, pp. 1406–1415, Apr. 2021. Mar. 2022, Art. no. 107625. electricity trading among plug-in hybrid electric
[163] Z. Yan and Y. Xu, “Data-driven load frequency [180] Y. Tao, J. Qiu, and S. Lai, “Deep reinforcement vehicles using consortium blockchains,” IEEE
control for stochastic power systems: A deep learning based bidding strategy for EVAs in local Trans. Ind. Informat., vol. 13, no. 6,
reinforcement learning method with continuous energy market considering information pp. 3154–3164, Dec. 2017.
action search,” IEEE Trans. Power Syst., vol. 34, asymmetry,” IEEE Trans. Ind. Informat., vol. 18, [197] R. Khalid, N. Javaid, A. Almogren, M. U. Javed,
no. 2, pp. 1653–1656, Mar. 2019. no. 6, pp. 3831–3842, Jun. 2022. S. Javaid, and M. Zuair, “A blockchain-based load
[164] Z. Yan and Y. Xu, “A multi-agent deep [181] A. Taghizadeh, M. Montazeri, and H. Kebriaei, balancing in decentralized hybrid P2P energy
reinforcement learning method for cooperative “Deep reinforcement learning-aided bidding trading market in smart grid,” IEEE Access, vol. 8,
load frequency control of a multi-area power strategies for transactive energy market,” IEEE pp. 47047–47062, 2020.
system,” IEEE Trans. Power Syst., vol. 35, no. 6, Syst. J., vol. 16, no. 3, pp. 4445–4453, Sep. 2022. [198] A. A. Al-Obaidi and H. E. Z. Farag, “Decentralized
pp. 4599–4608, Nov. 2020. [182] I. Boukas et al., “A deep reinforcement learning quality of service based system for energy trading
[165] M. H. Khooban and M. Gheisarnejad, “A novel framework for continuous intraday market among electric vehicles,” IEEE Trans. Intell. Transp.
deep reinforcement learning controller based bidding,” Mach. Learn., vol. 110, no. 9, Syst., vol. 23, no. 7, pp. 6586–6595, Jul. 2022.
type-II fuzzy system: Frequency regulation in pp. 2335–2387, Sep. 2021. [199] Y. Li, C. Yu, Y. Liu, Z. Ni, L. Ge, and X. Li,
microgrids,” IEEE Trans. Emerg. Topics Comput. [183] Y. Zhang, Z. Zhang, Q. Yang, D. An, D. Li, and “Collaborative operation between power network
Intell., vol. 5, no. 4, pp. 689–699, Aug. 2021. C. Li, “EV charging bidding by multi-DQN and hydrogen fueling stations with peer-to-peer
[166] C. Chen, M. Cui, F. Li, S. Yin, and X. Wang, reinforcement learning in electricity auction energy trading,” IEEE Trans. Transport. Electrific.,
“Model-free emergency frequency control based market,” Neurocomputing, vol. 397, pp. 404–414, vol. 9, no. 1, pp. 1521–1540, Mar. 2023.
on reinforcement learning,” IEEE Trans. Ind. Jul. 2020. [200] D. Wang, B. Liu, H. Jia, Z. Zhang, J. Chen, and
Informat., vol. 17, no. 4, pp. 2336–2346, [184] L. Yang, Q. Sun, N. Zhang, and Y. Li, “Indirect D. Huang, “Peer-to-peer electricity transaction
Apr. 2021. multi-energy transactions of energy Internet with decisions of the user-side smart energy system
[167] Z. Yan, Y. Xu, Y. Wang, and X. Feng, “Deep deep reinforcement learning approach,” IEEE based on the SARSA reinforcement learning,”
reinforcement learning-based optimal data-driven Trans. Power Syst., vol. 37, no. 5, pp. 4067–4077, CSEE J. Power Energy Syst., vol. 8, no. 3,
control of battery energy storage for power system Sep. 2022. pp. 826–837, May 2022.
frequency support,” IET Gener., Transmiss. Distrib., [185] C. Schlereth, B. Skiera, and F. Schulz, “Why do [201] Y. Liu, D. Zhang, C. Deng, and X. Wang, “Deep
vol. 14, no. 25, pp. 6071–6078, Dec. 2020. consumers prefer static instead of dynamic pricing reinforcement learning approach for autonomous
[168] G. Zhang, W. Hu, D. Cao, Q. Huang, Z. Chen, and plans? An empirical study for a better agents in consumer-centric electricity market,” in
F. Blaabjerg, “A novel deep reinforcement learning understanding of the low preferences for Proc. 5th IEEE Int. Conf. Big Data Anal. (ICBDA),
enabled sparsity promoting adaptive control time-variant pricing plans,” Eur. J. Oper. Res., May 2020, pp. 37–41.
method to improve the stability of power systems vol. 269, no. 3, pp. 1165–1179, Sep. 2018. [202] D. Qiu, Y. Ye, D. Papadaskalopoulos, and
with wind energy penetration,” Renew. Energy, [186] D. Liu, W. Wang, L. Wang, H. Jia, and M. Shi, G. Strbac, “Scalable coordinated management of
vol. 178, pp. 363–376, Nov. 2021. “Dynamic pricing strategy of electric vehicle peer-to-peer energy trading: A multi-cluster deep
[169] R. Yan, Y. Wang, Y. Xu, and J. Dai, “A multiagent aggregators based on DDPG reinforcement reinforcement learning approach,” Appl. Energy,
quantum deep reinforcement learning method for learning algorithm,” IEEE Access, vol. 9, vol. 292, Jun. 2021, Art. no. 116940.
distributed frequency control of islanded pp. 21556–21566, 2021. [203] C. Samende, J. Cao, and Z. Fan, “Multi-agent deep
microgrids,” IEEE Trans. Control Netw. Syst., [187] D. Qiu, Y. Ye, D. Papadaskalopoulos, and deterministic policy gradient algorithm for
vol. 9, no. 4, pp. 1622–1632, Dec. 2022. G. Strbac, “A deep reinforcement learning method peer-to-peer energy trading considering
[170] M. Shahidehpour, H. Yamin, and Z. Li, Market for pricing electric vehicles with discrete charging distribution network constraints,” Appl. Energy,
Operations in Electric Power Systems: Forecasting, levels,” IEEE Trans. Ind. Appl., vol. 56, no. 5, vol. 317, Jul. 2022, Art. no. 119123.
Scheduling, and Risk Management. Hoboken, NJ, pp. 5901–5912, Sep. 2020. [204] J. Li et al., “Energy trading of multiple virtual
USA: Wiley, 2002. [188] H. Xu, J. Wen, Q. Hu, J. Shu, J. Lu, and Z. Yang, power plants using deep reinforcement learning,”
[171] Y. Liu, D. Zhang, and H. B. Gooi, “Data-driven “Energy procurement and retail pricing of in Proc. Int. Conf. Power Syst. Technol.
decision-making strategies for electricity retailers: electricity retailers via deep reinforcement (POWERCON), Dec. 2021, pp. 892–897.
A deep reinforcement learning approach,” CSEE J. learning with long short-term memory,” CSEE J. [205] D. Qiu, J. Wang, J. Wang, and G. Strbac,
Power Energy Syst., vol. 7, no. 2, pp. 358–367, Power Energy Syst., vol. 8, no. 5, pp. 1338–1351, “Multi-agent reinforcement learning for
Mar. 2021. Sep. 2022. automated peer-to-peer energy trading in
[172] Y. Ye, D. Qiu, M. Sun, D. Papadaskalopoulos, and [189] S. Lee and D.-H. Choi, “Dynamic pricing and double-side auction market,” in Proc. 13th Int.
G. Strbac, “Deep reinforcement learning for energy management for profit maximization in Joint Conf. Artif. Intell., Aug. 2021,
strategic bidding in electricity markets,” IEEE multiple smart electric vehicle charging stations: pp. 2913–2920.
Trans. Smart Grid, vol. 11, no. 2, pp. 1343–1355, A privacy-preserving deep reinforcement learning [206] T. Zhang, D. Yue, L. Yu, C. Dou, and X. Xie, “Joint
Mar. 2020. approach,” Appl. Energy, vol. 304, Dec. 2021, energy and workload scheduling for fog-assisted
[173] H. Xu, H. Sun, D. Nikovski, S. Kitamura, K. Mori, Art. no. 117754. multimicrogrid systems: A deep reinforcement
and H. Hashimoto, “Deep reinforcement learning [190] A. Abdalrahman and W. Zhuang, “Dynamic pricing learning approach,” IEEE Syst. J., vol. 17, no. 1,
for joint bidding and pricing of load serving for differentiated PEV charging services using pp. 164–175, Mar. 2023.
entity,” IEEE Trans. Smart Grid, vol. 10, no. 6, deep reinforcement learning,” IEEE Trans. Intell. [207] Y. Ye, Y. Tang, H. Wang, X.-P. Zhang, and G. Strbac,
pp. 6366–6375, Nov. 2019. Transp. Syst., vol. 23, no. 2, pp. 1415–1427, “A scalable privacy-preserving multi-agent deep
[174] Y. Du, F. Li, H. Zandi, and Y. Xue, “Approximating Feb. 2022. reinforcement learning approach for large-scale
Nash equilibrium in day-ahead electricity market [191] Y.-C. Chuang and W.-Y. Chiu, “Deep reinforcement peer-to-peer transactive energy trading,” IEEE
bidding with multi-agent deep reinforcement learning based pricing strategy of aggregators Trans. Smart Grid, vol. 12, no. 6, pp. 5185–5200,
learning,” J. Modern Power Syst. Clean Energy, considering renewable energy,” IEEE Trans. Emerg. Nov. 2021.
vol. 9, no. 3, pp. 534–544, May 2021. Topics Comput. Intell., vol. 6, no. 3, pp. 499–508, [208] X. Wang, Y. Liu, J. Zhao, C. Liu, J. Liu, and J. Yan,
[175] X. Wei, Y. Xiang, J. Li, and J. Liu, “Wind power Jun. 2022. “Surrogate model enabled deep reinforcement
bidding coordinated with energy storage system [192] T. Lu, X. Chen, M. B. McElroy, C. P. Nielsen, Q. Wu, learning for hybrid energy community operation,”
operation in real-time electricity market: A and Q. Ai, “A reinforcement learning-based Appl. Energy, vol. 289, May 2021, Art. no. 116722.
maximum entropy deep reinforcement learning decision system for electricity pricing plan [209] Y. Xu, L. Yu, G. Bi, M. Zhang, and C. Shen, “Deep
approach,” Energy Rep., vol. 8, pp. 770–775, selection by smart grid end users,” IEEE Trans. reinforcement learning and blockchain for
peer-to-peer energy trading among microgrids,” in IEEE Trans. Comput., vol. 71, no. 11, Mar. 2020.
Proc. Int. Conferences Internet Things (iThings) pp. 2915–2926, Nov. 2022. [244] E. Marchesini, D. Corsi, and A. Farinelli,
IEEE Green Comput. Commun. (GreenCom) IEEE [227] S. Lee and D.-H. Choi, “Federated reinforcement “Exploring safer behaviors for deep reinforcement
Cyber, Phys. Social Comput. (CPSCom) IEEE Smart learning for energy management of multiple learning,” in Proc. AAAI Conf. Artif. Intell., vol. 36,
Data (SmartData) IEEE Congr. Cybermatics smart homes with distributed energy resources,” no. 7, 2022, pp. 7701–7709.
(Cybermatics), Nov. 2020, pp. 360–365. IEEE Trans. Ind. Informat., vol. 18, no. 1, [245] Y. Ye, H. Wang, P. Chen, Y. Tang, and G. Strbac,
[210] Y. Li, X. Wei, Y. Li, Z. Dong, and M. Shahidehpour, pp. 488–497, Jan. 2022. “Safe deep reinforcement learning for microgrid
“Detection of false data injection attacks in smart [228] H.-M. Chung, S. Maharjan, Y. Zhang, and energy management in distribution networks with
grid: A secure federated deep learning approach,” F. Eliassen, “Distributed deep reinforcement leveraged spatial–temporal perception,” IEEE
IEEE Trans. Smart Grid, vol. 13, no. 6, learning for intelligent load scheduling in Trans. Smart Grid, vol. 14, no. 5, pp. 3759–3775,
pp. 4862–4872, Nov. 2022. residential smart grids,” IEEE Trans. Ind. Informat., Sep. 2023.
[211] Z. Li, M. Shahidehpour, and F. Aminifar, vol. 17, no. 4, pp. 2752–2763, Apr. 2021. [246] H. Cui, Y. Ye, J. Hu, Y. Tang, Z. Lin, and G. Strbac,
“Cybersecurity in distributed power systems,” [229] M. Shateri, F. Messina, P. Piantanida, and “Online preventive control for transmission
Proc. IEEE, vol. 105, no. 7, pp. 1367–1388, F. Labeau, “Privacy-cost management in smart overload relief using safe reinforcement learning
Jul. 2017. meters with mutual-information-based with enhanced spatial–temporal awareness,” IEEE
[212] M. Shahidehpour, F. Tinney, and Y. Fu, “Impact of reinforcement learning,” IEEE Internet Things J., Trans. Power Syst., early access, Mar. 15, 2023,
security on power systems operation,” Proc. IEEE, vol. 9, no. 22, pp. 22389–22398, Nov. 2022. doi: 10.1109/TPWRS.2023.3257259.
vol. 93, no. 11, pp. 2013–2025, Nov. 2005. [230] Z. Su et al., “Secure and efficient federated [247] R. F. Prudencio, M. R. O. A. Maximo, and
[213] Z. Zhang, S. Huang, Y. Chen, B. Li, and S. Mei, learning for smart grid with edge-cloud E. L. Colombini, “A survey on offline
“Cyber-physical coordinated risk mitigation in collaboration,” IEEE Trans. Ind. Informat., vol. 18, reinforcement learning: Taxonomy, review, and
smart grids based on attack-defense game,” IEEE no. 2, pp. 1333–1344, Feb. 2022. open problems,” IEEE Trans. Neural Netw. Learn.
Trans. Power Syst., vol. 37, no. 1, pp. 530–542, [231] X. Wang et al., “QoS and privacy-aware routing Syst., early access, Mar. 22, 2023, doi:
Jan. 2022. for 5G-enabled industrial Internet of Things: A 10.1109/TNNLS.2023.3250269.
[214] T. Bailey, J. Johnson, and D. Levin, “Deep federated reinforcement learning approach,” IEEE [248] H. Niu, Y. Qiu, M. Li, G. Zhou, J. Hu, and X. Zhan,
reinforcement learning for online distribution Trans. Ind. Informat., vol. 18, no. 6, “When to trust your simulator: Dynamics-aware
power system cybersecurity protection,” in Proc. pp. 4189–4197, Jun. 2022. hybrid offline-and-online reinforcement learning,”
IEEE Int. Conf. Commun., Control, Comput. [232] Z. Wang, Y. Liu, Z. Ma, X. Liu, and J. Ma, “LiPSG: in Proc. Adv. Neural Inf. Process. Syst., vol. 35,
Technol. Smart Grids (SmartGridComm), Lightweight privacy-preserving Q-learning-based 2022, pp. 36599–36612.
Oct. 2021, pp. 227–232. energy management for the IoT-enabled smart [249] Z. Yan and Y. Xu, “A hybrid data-driven method for
[215] X. Liu, J. Ospina, and C. Konstantinou, “Deep grid,” IEEE Internet Things J., vol. 7, no. 5, fast solution of security-constrained optimal
reinforcement learning for cybersecurity pp. 3935–3947, May 2020. power flow,” IEEE Trans. Power Syst., vol. 37,
assessment of wind integrated power systems,” [233] Y. Zhang, Q. Ai, and Z. Li, “Intelligent demand no. 6, pp. 4365–4374, Nov. 2022.
IEEE Access, vol. 8, pp. 208378–208394, 2020. response resource trading using deep [250] A. R. Sayed, C. Wang, H. Anis, and T. Bi,
[216] Y. Li and J. Wu, “Low latency cyberattack reinforcement learning,” CSEE J. Power Energy “Feasibility constrained online calculation for
detection in smart grids with deep reinforcement Syst., early access, Sep. 10, 2021, doi: 10.17775/ real-time optimal power flow: A convex
learning,” Int. J. Electr. Power Energy Syst., CSEEJPES.2020.05540. constrained deep reinforcement learning
vol. 142, Nov. 2022, Art. no. 108265. [234] X. Liu, H. Wang, G. Chen, B. Zhou, and approach,” IEEE Trans. Power Syst., early access,
[217] D. An, F. Zhang, Q. Yang, and C. Zhang, “Data A. U. Rehman, “Intermittently differential privacy Nov. 9, 2022, doi: 10.1109/TPWRS.2022.
integrity attack in dynamic state estimation of in smart meters via rechargeable batteries,” Electr. 3220799.
smart grid: Attack model and countermeasures,” Power Syst. Res., vol. 199, Oct. 2021, [251] T. L. Vu, S. Mukherjee, R. Huang, and Q. Huang,
IEEE Trans. Autom. Sci. Eng., vol. 19, no. 3, Art. no. 107410. “Barrier function-based safe reinforcement
pp. 1631–1644, Jul. 2022. [235] U. Ahmed, J. C. Lin, and G. Srivastava, learning for emergency control of power systems,”
[218] C. Chen, M. Cui, X. Fang, B. Ren, and Y. Chen, “5G-empowered drone networks in federated and in Proc. 60th IEEE Conf. Decis. Control (CDC),
“Load altering attack-tolerant defense strategy for deep reinforcement learning environments,” IEEE Dec. 2021, pp. 3652–3657.
load frequency control system,” Appl. Energy, Commun. Standards Mag., vol. 5, no. 4, pp. 55–61, [252] I. Ilahi et al., “Challenges and countermeasures
vol. 280, Dec. 2020, Art. no. 116015. Dec. 2021. for adversarial attacks on deep reinforcement
[219] W. Lei, H. Wen, J. Wu, and W. Hou, [236] L. Yan, X. Chen, Y. Chen, and J. Wen, learning,” IEEE Trans. Artif. Intell., vol. 3, no. 2,
“MADDPG-based security situational awareness “A hierarchical deep reinforcement learning-based pp. 90–109, Apr. 2022.
for smart grid with intelligent edge,” Appl. Sci., community energy trading scheme for a [253] Y. Wang and B. Pal, “Destabilizing attack and
vol. 11, no. 7, p. 3101, Mar. 2021. neighborhood of smart households,” IEEE Trans. robust defense for inverter-based microgrids by
[220] Z. Jin et al., “Cyber-physical risk driven routing Smart Grid, vol. 13, no. 6, pp. 4747–4758, adversarial deep reinforcement learning,” IEEE
planning with deep reinforcement-learning in Nov. 2022. Trans. Smart Grid, early access, Mar. 30, 2023,
smart grid communication networks,” in Proc. Int. [237] T. Li, Y. Xiao, and L. Song, “Integrating future doi: 10.1109/TSG.2023.3263243.
Wireless Commun. Mobile Comput. (IWCMC), smart home operation platform with demand side [254] S. Paul, Z. Ni, and C. Mu, “A learning-based
Jun. 2020, pp. 1278–1283. management via deep reinforcement learning,” solution for an adversarial repeated game in
[221] H. Zhang, D. Yue, C. Dou, and G. P. Hancke, IEEE Trans. Green Commun. Netw., vol. 5, no. 2, cyber-physical power systems,” IEEE Trans. Neural
“Resilient optimal defensive strategy of pp. 921–933, Jun. 2021. Netw. Learn. Syst., vol. 31, no. 11, pp. 4512–4523,
micro-grids system via distributed deep [238] J. García and F. Fernández, “A comprehensive Nov. 2020.
reinforcement learning approach against FDI survey on safe reinforcement learning,” J. Mach. [255] K. L. Tan, Y. Esfandiari, X. Y. Lee, and S. Sarkar,
attack,” IEEE Trans. Neural Netw. Learn. Syst., Learn. Res., vol. 16, no. 42, pp. 1437–1480, “Robustifying reinforcement learning agents via
early access, May 27, 2022, doi: 10.1109/TNNLS. Aug. 2015. action space adversarial training,” in Proc. Amer.
2022.3175917. [239] X. Wang, R. Wang, and Y. Cheng, “Safe Control Conf. (ACC), Jul. 2020, pp. 3959–3964.
[222] D. An, Q. Yang, W. Liu, and Y. Zhang, “Defending reinforcement learning: A survey,” Acta [256] H. Zhang et al., “Robust deep reinforcement
against data integrity attacks in smart grid: A Automatica Sinica, vol. 49, no. 9, pp. 1–23, learning against adversarial perturbations on state
deep reinforcement learning-based approach,” Sep. 2023. observations,” in Proc. Adv. Neural Inf. Process.
IEEE Access, vol. 7, pp. 110835–110845, 2019. [240] Z. Yi et al., “An improved two-stage deep Syst. (NIPS), vol. 33, 2020, pp. 21024–21037.
[223] Y. Wang, Q. Chen, T. Hong, and C. Kang, “Review reinforcement learning approach for regulation [257] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, and
of smart meter data analytics: Applications, service disaggregation in a virtual power plant,” S. Russell, “Robust multi-agent reinforcement
methodologies, and challenges,” IEEE Trans. Smart IEEE Trans. Smart Grid, vol. 13, no. 4, learning via minimax deep deterministic policy
Grid, vol. 10, no. 3, pp. 3125–3148, May 2019. pp. 2844–2858, Jul. 2022. gradient,” in Proc. AAAI Conf. Artif. Intell., vol. 33,
[224] A. Mohammadali and M. S. Haghighi, [241] Z. Zhu, K. W. Chan, S. Xia, and S. Bu, “Optimal 2019, pp. 4213–4220.
“A privacy-preserving homomorphic scheme with bi-level bidding and dispatching strategy between [258] H. Dong and X. Zhao, “Wind-farm power tracking
multiple dimensions and fault tolerance for active distribution network and virtual alliances via preview-based robust reinforcement learning,”
metering data aggregation in smart grid,” IEEE using distributed robust multi-agent deep IEEE Trans. Ind. Informat., vol. 18, no. 3,
Trans. Smart Grid, vol. 12, no. 6, pp. 5212–5220, reinforcement learning,” IEEE Trans. Smart Grid, pp. 1706–1715, Mar. 2022.
Nov. 2021. vol. 13, no. 4, pp. 2833–2843, Jul. 2022. [259] A. Roy, H. Xu, and S. Pokutta, “Reinforcement
[225] C. E. Kement, B. Tavli, H. Gultekin, and [242] M. M. Hosseini and M. Parvania, “On the learning under model mismatch,” in Proc. Adv.
H. Yanikomeroglu, “Holistic privacy for electricity, feasibility guarantees of deep reinforcement Neural Inf. Process. Syst., vol. 30, 2017,
water, and natural gas metering in next learning solutions for distribution system pp. 3046–3055.
generation smart homes,” IEEE Commun. Mag., operation,” IEEE Trans. Smart Grid, vol. 14, no. 2, [260] Y. Li, R. Wang, Y. Li, M. Zhang, and C. Long,
vol. 59, no. 3, pp. 24–29, Mar. 2021. pp. 954–964, Mar. 2023. “Wind power forecasting considering data privacy
[226] Z. Zheng, T. Wang, A. K. Bashir, M. Alazab, [243] Q. Huang, R. Huang, W. Hao, J. Tan, R. Fan, and protection: A federated deep reinforcement
S. Mumtaz, and X. Wang, “A decentralized Z. Huang, “Adaptive power system emergency learning approach,” Appl. Energy, vol. 329,
mechanism based on differential privacy for control using deep reinforcement learning,” IEEE Jan. 2023, Art. no. 120291.
privacy-preserving computation in smart grid,” Trans. Smart Grid, vol. 11, no. 2, pp. 1171–1182, [261] Y. Li, S. He, Y. Li, Y. Shi, and Z. Zeng, “Federated
multiagent deep reinforcement learning approach Neural Inf. Process. Syst., vol. 33, 2020, A. H. Gebremedhin, “Reinforcement learning for
via physics-informed reward for multimicrogrid pp. 18353–18363. battery energy storage dispatch augmented with
energy management,” IEEE Trans. Neural Netw. [264] H. Liu, Z. Huang, J. Wu, and C. Lv, “Improved model-based optimizer,” in Proc. IEEE Int. Conf.
Learn. Syst., early access, Jan. 3, 2023, doi: deep reinforcement learning with expert Commun., Control, Comput. Technol. Smart Grids
10.1109/TNNLS.2022.3232630. demonstrations for urban autonomous driving,” in (SmartGridComm), Oct. 2021, pp. 289–294.
[262] P. Rashidinejad, B. Zhu, C. Ma, J. Jiao, and Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2022, [267] W. Guo, W. Tian, Y. Ye, L. Xu, and K. Wu, “Cloud
S. Russell, “Bridging offline reinforcement pp. 921–928. resource scheduling with deep reinforcement
learning and imitation learning: A tale of [265] Y. Liu, Q. Liu, H. Zhao, Z. Pan, and C. Liu, learning and imitation learning,” IEEE Internet
pessimism,” in Proc. Adv. Neural Inf. Process. Syst., “Adaptive quantitative trading: An imitative deep Things J., vol. 8, no. 5, pp. 3576–3586, Mar. 2021.
vol. 34, 2021, pp. 11702–11716. reinforcement learning approach,” in Proc. AAAI [268] D. Silver et al., “Mastering the game of Go
[263] X. Chen, Z. Zhou, Z. Wang, C. Wang, Y. Wu, and Conf. Artif. Intell., vol. 34, no. 2, Apr. 2020, without human knowledge,” Nature, vol. 550,
K. Ross, “BAIL: Best-action imitation learning for pp. 2128–2135. no. 7676, pp. 354–359, Oct. 2017.
batch deep reinforcement learning,” in Proc. Adv. [266] G. Krishnamoorthy, A. Dubey, and
Zhigang Zeng (Fellow, IEEE) received the Ph.D. degree in systems analysis and integration from the Huazhong University of Science and Technology, Wuhan, China, in 2003.

He is currently a Professor with the School of Automation and the Key Laboratory of Image Processing and Intelligent Control of the Education Ministry of China, Huazhong University of Science and Technology. He has published more than 100 international journal articles. His current research interests include the theory of functional differential equations and differential equations with discontinuous right-hand sides and their applications to dynamics of neural networks, memristive systems, and control systems.

Dr. Zeng has been a member of the Editorial Board of Neural Networks since 2012, Cognitive Computation since 2010, and Applied Soft Computing since 2013. He was an Associate Editor of IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS from 2010 to 2011. He has been an Associate Editor of IEEE TRANSACTIONS ON CYBERNETICS since 2014 and IEEE TRANSACTIONS ON FUZZY SYSTEMS since 2016.

Tianyou Chai (Life Fellow, IEEE) received the Ph.D. degree in control theory and engineering from Northeastern University, Shenyang, China, in 1985.

He became a Professor at Northeastern University in 1988. He is the Founder and the Director of the Center of Automation, Northeastern University, which became the National Engineering and Technology Research Center and the State Key Laboratory. He was the Director of the Department of Information Science, National Natural Science Foundation of China, from 2010 to 2018. He has developed control technologies with applications to various industrial processes. He has published more than 320 peer-reviewed international journal articles. His current research interests include modeling, control, optimization, and integrated automation of complex industrial processes.

Dr. Chai is a member of the Chinese Academy of Engineering and a Fellow of the International Federation of Automatic Control (IFAC). His paper titled "Hybrid intelligent control for optimal operation of shaft furnace roasting process" was selected as one of the three best papers for the Control Engineering Practice Paper Prize for the term 2011–2013. For his contributions, he has won five prestigious awards of the National Natural Science, the National Science and Technology Progress, and the National Technological Innovation, as well as the 2007 Industry Award for Excellence in Transitional Control Research from the IEEE Multi-Conference on Systems and Control and the 2017 Wook Hyun Kwon Education Award from the Asian Control Association.