
Automatica 163 (2024) 111601

Contents lists available at ScienceDirect

Automatica
journal homepage: www.elsevier.com/locate/automatica

Brief paper

Optimal dynamic output feedback control of unknown linear continuous-time systems by adaptive dynamic programming✩

Kedi Xie a,b, Yiwei Zheng a, Yi Jiang c, Weiyao Lan a, Xiao Yu d,e,∗

a Department of Automation, Xiamen University, Xiamen 361005, China
b School of Automation, Beijing Institute of Technology, Beijing 100081, China
c Department of Electrical Engineering and Centre for Complexity and Complex Networks, City University of Hong Kong, Hong Kong SAR, China
d Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, China
e Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen 361005, China

Article info

Article history:
Received 16 April 2023
Received in revised form 30 September 2023
Accepted 31 January 2024
Available online 2 March 2024

Keywords:
Adaptive dynamic programming
Dynamic output feedback control
Linear quadratic regulation
Value iteration

Abstract

In this paper, we present an approximate optimal dynamic output feedback control learning algorithm to solve the linear quadratic regulation problem for unknown linear continuous-time systems. First, a dynamic output feedback controller is designed by constructing the internal state. Then, an adaptive dynamic programming based learning algorithm is proposed to estimate the optimal feedback control gain by accessing only the input and output data. By adding a constructed virtual observer error into the iterative learning equation, the proposed learning algorithm is immune to the observer error. In addition, the value iteration based learning equation is established without storing a series of past data, which could reduce the demands on memory storage. Besides, the proposed algorithm eliminates the requirement of repeated finite window integrals, which may reduce the computational load. Moreover, the convergence analysis shows that the estimated control policy converges to the optimal control policy. Finally, a physical experiment on an unmanned quadrotor is given to illustrate the effectiveness of the proposed approach.

© 2024 Elsevier Ltd. All rights reserved.

✩ The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Kyriakos G. Vamvoudakis under the direction of Editor Miroslav Krstic.
∗ Corresponding author at: Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, China.
E-mail addresses: [email protected] (K. Xie), [email protected] (Y. Zheng), [email protected] (Y. Jiang), [email protected] (W. Lan), [email protected] (X. Yu).

https://doi.org/10.1016/j.automatica.2024.111601
0005-1098/© 2024 Elsevier Ltd. All rights reserved.

1. Introduction

Adaptive dynamic programming (ADP) (Jiang & Jiang, 2017; Powell, 2004), which not only solves the ''curse of dimensionality'' problem in dynamic programming but also provides a way to resolve the ''curse of modeling'' problem in most model-based control approaches, is one of the intelligent control methods and has attracted extensive interest in both optimal control theory and applications (Liu et al., 2017, 2021). In the last decade, several ADP-based learning methods that aim to estimate the optimal control policy for uncertain or unknown systems without explicitly identifying the dynamics of the system were investigated by integrating the ADP algorithm with the reinforcement learning (RL) technique (Lewis et al., 2012; Sutton & Barto, 1998); see, for example, Bian and Jiang (2016), Chen et al. (2019), Gao and Jiang (2016, 2019), Jiang and Jiang (2012), Lewis and Vamvoudakis (2011), Lewis and Vrabie (2009), Modares et al. (2016), and Vamvoudakis and Lewis (2010). Although the explicit system identification process is skipped by using ADP/RL based learning algorithms to solve the optimal control problem, accurate state data is still required in Bian and Jiang (2016), Chen et al. (2019), Gao and Jiang (2018), Jiang and Jiang (2012), Jiang et al. (2018), and Luo et al. (2018) to establish the iterative learning equation.

In order to expand the application of ADP/RL based learning methods, several practical issues are further taken into account, for instance, control with output feedback (Gao et al., 2016; Lewis & Vamvoudakis, 2011; Modares et al., 2016; Peng et al., 2020; Rizvi & Lin, 2020b; Sun et al., 2019), disturbance rejection (Gao & Jiang, 2016, 2019; Gao et al., 2019; Luo et al., 2018), and uncertain system dynamics (Gao et al., 2016; Jiang & Jiang, 2013). In particular, a robust ADP-based learning algorithm with an output feedback (OPFB) controller was proposed in Gao et al. (2016) to solve the linear quadratic regulation (LQR) problem for linear systems with dynamics uncertainty, which was extended to solve the output regulation problem for discrete-time systems in Gao and Jiang (2019). Under the observability condition, different from discrete-time systems where the state can be represented with a linear
combination of a finite series of history data of the sampled input and output, the reconstructed state of continuous-time systems may involve the derivatives of the input and output signals. In the case of continuous-time systems, it is necessary to design some filters to avoid the derivatives, which makes the learning algorithm design more challenging. In Modares et al. (2016), an off-policy RL-based learning approach with the OPFB design method for continuous-time systems was presented, while estimation bias occurs due to the existence of exploration noise in the learning process. This issue was solved in Rizvi and Lin (2020b), where an integral reinforcement learning (IRL) based algorithm was proposed with a novel dynamic OPFB design method by constructing the internal state. The learning algorithm design method in Rizvi and Lin (2020b) has been extended to solve the zero-sum game with output feedback (Rizvi & Lin, 2020a) and the tracking problem under unmeasurable disturbances (Rizvi et al., 2022). In addition, a novel data-driven dynamic output feedback controller design with an internal model was proposed in Chen et al. (2023), which can solve the linear quadratic tracking problem for discrete-time systems. However, the existence of an observer error in the existing algorithms, for instance, Chen et al. (2023), Rizvi and Lin (2020b), and Rizvi et al. (2022), still influences the algorithm convergence performance. Furthermore, it should be pointed out that most of these learning algorithms require either the persistence of excitation (PE) condition (Luo et al., 2018; Modares et al., 2016; Peng et al., 2020) or the rank condition (Gao & Jiang, 2016; Gao et al., 2019, 2016; Rizvi & Lin, 2020b). Besides, a series of history data needs to be stored in memory stacks and processed by using finite window integrals (FWIs), which may limit the effectiveness and efficiency of the online implementation of these learning algorithms.

Lately, the concept of initial excitation was investigated in Jha et al. (2019) and Roy et al. (2018), which is less restrictive than both the PE condition and the rank condition. A two-layer filter structure estimator design was proposed in Roy et al. (2018) to identify the system parameters online, which was further developed in Jha et al. (2019) to solve the LQR problem for unknown continuous-time systems with a full state feedback controller. The data-driven algorithm with the two-layer filter design proposed in Jha et al. (2019) reduces the demands of memory storage for storing past data and even eliminates the need to use any FWIs. Nevertheless, the full state information is still required in Jha et al. (2019) to establish the learning algorithm, and the two-layer filter may lead to extra demands on the computational load.

In this paper, an ADP-based learning algorithm is developed to solve the LQR problem for a continuous-time system without using full-state feedback. The main contributions are summarized as follows. First, different from the learning algorithm proposed in Jha et al. (2019) with the two-layer filter structure using the state data, in this paper a one-layer filter structure with the constructed internal state dynamics is proposed to establish the new iterative learning equation without using the state data, which could reduce the design complexity. Second, unlike the learning algorithms with OPFB controllers proposed in Chen et al. (2023), Rizvi and Lin (2020b), and Rizvi et al. (2022), where the state observer error affects the convergence performance, by modifying the iterative learning equation with a constructed virtual observer error term, the proposed ADP-based data-driven algorithm is immune to the observer error. Third, compared to the existing ADP-based learning algorithms with the IRL method, for instance, Gao and Jiang (2016), Gao et al. (2018), Modares et al. (2016), and Rizvi and Lin (2020b), and the off-policy Q-learning algorithms, for instance, Luo et al. (2018), Peng et al. (2020), and Sun et al. (2019), a newly proposed matrix-formed iterative learning equation is established by directly using the internal state data and the obtainable derivative of the internal state, which makes the repeated FWIs not needed in this paper. This may lead to a reduction in the computational complexity of the proposed learning algorithm. Furthermore, the proposed one-layer filter structure iterative learning equation only needs to store the filter data at the instant t instead of a series of history data, which could reduce the usage of memory storage in the learning algorithm. Last but not least, a physical experimental validation on an unmanned quadrotor is presented to illustrate the effectiveness of the proposed ADP-based algorithm with OPFB control.

The remainder of this paper is organized as follows. Section 2 presents the preliminaries and problem formulation. In Section 3, the ADP-based learning algorithm with the VI scheme is developed with the proposed dynamic OPFB controller; the solvability and convergence analyses are also presented in that section. Finally, a physical experimental validation is given in Section 4, and Section 5 draws the conclusion.

Notations: For a square matrix A, σ(A) is the complex spectrum of A. A > 0 and A ≥ 0 represent that A is positive definite and positive semi-definite, respectively. For any given n, m ∈ Z+, In ∈ Rn×n is the identity matrix, and On,m ∈ Rn×m is the zero matrix. For column vectors x ∈ Rn and vi ∈ Rni, i = 1, 2, . . . , N, define col(v1, v2, . . . , vN) = [v1T, v2T, . . . , vNT]T ∈ Rn1+n2+···+nN, and dia(x) denotes the diagonal matrix X = [Xjk] ∈ Rn×n with Xjk = 0 for j ≠ k and Xjj = xj. For any given matrix B ∈ Rm×n and a symmetric matrix C ∈ Rn×n, define vec(B) = [b1T, b2T, . . . , bnT]T with bj ∈ Rm being the columns of B, and vech(C) = [c11, c12, . . . , c1n, c22, c23, . . . , cn−1,n, cnn]T ∈ Rn(n+1)/2 with cjk being the (j, k) element of matrix C. For any matrices Gi, i = 1, 2, . . . , n, G = block diag[G1, G2, . . . , Gn] is the block diagonal matrix with main diagonal blocks Gii = Gi. Pn denotes the normed space of all n-by-n real symmetric matrices, and Pn+ := {P ∈ Pn : P ≥ 0}. The symbol ⊗ denotes the Kronecker product. L(·) and L−1(·) represent the Laplace transform and the inverse Laplace transform, respectively.

2. Preliminaries and problem formulation

Consider a linear continuous-time system described by the following state-space equation:

ẋ = Ax + Bu,
y = Cx,    (1)

where x ∈ Rn, u ∈ Rm, and y ∈ Rp are the state, control input, and output, respectively. A, B, and C are constant system matrices with appropriate dimensions that satisfy the following assumption.

Assumption 1. The pair (A, B) is stabilizable, and the pair (A, C) is observable. ■

Define the infinite horizon quadratic value function as

V(x0) = ∫0∞ (yT(t)Qy(t) + uT(t)Ru(t)) dt    (2)

where x0 is the initial state, and Q = QT ∈ Rp×p > 0 and R = RT ∈ Rm×m > 0 are the weight matrices.

The aim of solving the LQR problem is to find an optimal control policy that minimizes the value function (2). Under Assumption 1, minimizing the value function (2) with respect to the policy u = −Kx gives the optimal control gain

K∗ = R−1BT P∗    (3)

where K∗ ∈ Rm×n satisfies σ(A − BK∗) ∈ C− and P∗ ∈ Rn×n is the solution to the following algebraic Riccati equation (ARE) (Lancaster & Rodman, 1995)

AT P∗ + P∗A + CT QC − P∗BR−1BT P∗ = 0.    (4)
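When the model is known, the ARE (4) and the gain (3) can be checked numerically. The sketch below is not from the paper: the system matrices are illustrative assumptions, and Kleinman's classical model-based policy iteration is used merely as a baseline solver (the paper's own data-driven scheme is developed in Section 3). Each step solves a Lyapunov equation via the Kronecker-product identity, and the result is verified against the residual of (4).

```python
import numpy as np

def lyap(Acl, W):
    # Solve Acl^T P + P Acl + W = 0 using
    # (I (x) Acl^T + Acl^T (x) I) vec(P) = -vec(W)  (column-major vec).
    n = Acl.shape[0]
    Lmat = np.kron(np.eye(n), Acl.T) + np.kron(Acl.T, np.eye(n))
    return np.linalg.solve(Lmat, -W.flatten(order="F")).reshape((n, n), order="F")

# Illustrative (assumed) system data; A is already Hurwitz, so K = 0 is stabilizing.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Q = np.array([[1.0]])   # output weight in (2)
R = np.array([[1.0]])   # input weight in (2)

K = np.zeros((1, 2))    # initial stabilizing gain
for _ in range(30):     # Kleinman iteration: quadratic convergence to P*
    P = lyap(A - B @ K, C.T @ Q @ C + K.T @ R @ K)
    K = np.linalg.solve(R, B.T @ P)   # K = R^{-1} B^T P, cf. (3)

# Verify: P solves the ARE (4) and A - BK* is Hurwitz.
residual = A.T @ P + P @ A + C.T @ Q @ C - P @ B @ np.linalg.solve(R, B.T @ P)
assert np.linalg.norm(residual) < 1e-8
assert max(np.linalg.eigvals(A - B @ K).real) < 0
```

Note that the state-weight matrix entering the ARE is CT QC rather than Q itself, because the cost (2) penalizes the output y = Cx.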
To further solve the LQR problem with output feedback, consider a state observer based OPFB controller in the form of

u = −Kx̂,
x̂˙ = (A − LC)x̂ + Bu + Ly,    (5)

where x̂ ∈ Rn is the estimate of the state x, K ∈ Rm×n is the control gain, and L ∈ Rn×p is the observer gain matrix. Under Assumption 1, by designing L and K satisfying σ(A − LC) ∈ C− and σ(A − BK) ∈ C−, respectively, the estimated state x̂ converges to x as t → ∞, and the closed-loop system is asymptotically stable.

In this paper, we consider the case with unknown system matrices and unmeasurable state. The control objective is to find an optimal output feedback controller for system (1) such that the closed-loop system is asymptotically stable and the value function (2) is minimized simultaneously. This setup greatly increases the difficulty of designing an adaptive optimal output feedback controller, since not only the solution to the ARE (4) but also the design of the OPFB controller (5) rely on prior knowledge of the system matrices.

Although an ADP-based learning algorithm with a novel dynamic OPFB controller design was proposed in Rizvi and Lin (2020b) by using the IRL technique, it still suffers from a large demand on the computational load of using the FWIs. Moreover, the effect of the observer error on the estimated optimal controller has not been discussed in Rizvi and Lin (2020b). In addition, in most existing ADP-based algorithms, e.g., Gao and Jiang (2016), Gao et al. (2018), Jiang and Jiang (2012), Lewis and Vamvoudakis (2011), Modares et al. (2016), Rizvi and Lin (2020b), and Vamvoudakis and Lewis (2010), the PE condition or the rank condition is usually needed to ensure that the collected learning data is rich enough.

To handle the aforementioned issues simultaneously for solving the LQR problem with output feedback and unknown system matrices, we are motivated to develop an ADP-based learning algorithm to estimate the optimal dynamic OPFB control policy, which is not only immune to the observer error but also relaxes the requirements on the following aspects: (1) a large demand on the computational load for computing the FWIs repeatedly; (2) an ''intelligent'' memory storage method for storing a series of history data; (3) the PE condition or the rank condition.

3. Main results

In this section, we first introduce a dynamic OPFB control design method. Then, an ADP-based learning algorithm with the value iteration (VI) scheme is developed to estimate the optimal control gain of the dynamic OPFB controller. Finally, the convergence analysis and the stability analysis of the proposed learning algorithm are given.

3.1. Design of dynamic OPFB controller

In general, the design of an output feedback controller with a state observer structure requires exact knowledge of the system matrices A, B, and C. To relax this requirement, we present a dynamic OPFB control design scheme in Lemma 1, where the internal state is established without using any prior knowledge of the system dynamics, and the optimal control gain can be estimated by the proposed ADP-based learning algorithm developed in Section 3.2.

Lemma 1. If Assumption 1 is satisfied, then there exists a dynamic OPFB controller

u = −Kz,
ż = G1z + G2ζ,    (6)

where ζ = col(u, y), K is the feedback control gain, and z is the internal state with the matrix pair (G1, G2) determined by the user-defined eigenvalues of the matrix AL = A − LC, such that the closed-loop system consisting of (1) and (6) is asymptotically stable.

Proof. As given in Rizvi and Lin (2020b), by considering the input u and the output y as the inputs to the observer system in (5) and setting x̂(0) = 0, using the Laplace transform, the estimated state x̂ can be rewritten as

x̂(t) = L−1[(sI − (A − LC))−1BU(s) + (sI − (A − LC))−1LY(s)]
     := L−1[Σmi=1 (sI − (A − LC))−1Bi Ui(s) + Σpk=1 (sI − (A − LC))−1Lk Yk(s)]
     = Σmi=1 Mui zui(t) + Σpk=1 Myk zyk(t)    (7)

where Mui ∈ Rn×n and Myk ∈ Rn×n are unknown constant matrices determined by the matrices A, B, C, and L; see Appendix A for the detailed derivation. Besides, zui(t) and zyk(t) in (7) are internal states given by

żui(t) = A zui(t) + b ui(t),  żyk(t) = A zyk(t) + b yk(t)    (8)

with

A = [ 0    1    0   ···  0
      0    0    1   ···  0
      ⋮    ⋮    ⋮    ⋱   ⋮
      0    0    0   ···  1
     −α0  −α1  −α2  ··· −αn−1 ],  b = [0, 0, . . . , 0, 1]T,    (9)

where A is determined by

Λ(s) = det(sI − AL) = sn + αn−1 sn−1 + · · · + α1 s + α0.    (10)

Under the observability condition, Λ(s) is a user-defined polynomial. By (7) and (8), the estimated state x̂ can be parameterized as

x̂(t) ≜ Mz(t)    (11)

where M = [Mu1, . . . , Mum, My1, . . . , Myp] ∈ Rn×nz with nz = n(m + p) is the constant parametrization matrix, and z = col(zu1, . . . , zum, zy1, . . . , zyp) ∈ Rnz is the augmented internal state given by

ż(t) = G1z(t) + G2ζ(t)    (12)

with G1 = block diag[A, . . . , A] and G2 = block diag[b, . . . , b] (each with m + p diagonal blocks), and ζ = col(u, y) being the augmented measurement data. It follows from (9) and (12) that the pair (G1, G2) only relies on the user-defined coefficients of Λ(s).

Combining K = KM and (11), the state observer based feedback controller (5) is equivalent to the dynamic OPFB controller (6) with the obtainable internal state z.

Substituting (5) into the system (1), we have

[ẋ ]   [ A    −BK     ] [x ]        [x ]
[x̂˙] = [ LC   AL − BK ] [x̂] := Ac [x̂].    (13)

It is obvious that σ(Ac) = σ(AL) ∪ σ(A − BK). Since there exist matrices L and K satisfying σ(AL) ∈ C− and σ(A − BK) ∈ C− under Assumption 1, the closed-loop system (13) is exponentially stable. That is, under Assumption 1, the system (1) can be stabilized by the dynamic OPFB controller (6). The proof is thus completed. ■
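The construction in (8)–(12) is purely mechanical once the observer poles (the eigenvalues of AL) are chosen. The following sketch is illustrative, not from the paper's experiment: the pole locations and dimensions are assumptions. It builds the companion pair (A, b) of (9) from Λ(s) and assembles (G1, G2) of (12), then checks that σ(G1) consists of the chosen poles, each repeated m + p times.

```python
import numpy as np

def companion_pair(obs_poles):
    # Lambda(s) = prod_i (s - p_i) = s^n + a_{n-1} s^{n-1} + ... + a_1 s + a_0, cf. (10).
    coeffs = np.real(np.poly(obs_poles))      # [1, a_{n-1}, ..., a_1, a_0]
    n = len(obs_poles)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)                # ones on the superdiagonal, cf. (9)
    A[-1, :] = -coeffs[:0:-1]                 # last row: [-a_0, -a_1, ..., -a_{n-1}]
    b = np.zeros((n, 1))
    b[-1, 0] = 1.0
    return A, b

def internal_state_pair(obs_poles, m, p):
    # G1 = block diag[A, ..., A], G2 = block diag[b, ..., b], m + p blocks, cf. (12).
    A, b = companion_pair(obs_poles)
    G1 = np.kron(np.eye(m + p), A)
    G2 = np.kron(np.eye(m + p), b)
    return G1, G2

# Example: n = 2, one input (m = 1), one output (p = 1), observer poles at -2 and -3.
G1, G2 = internal_state_pair([-2.0, -3.0], m=1, p=1)
assert np.allclose(np.sort(np.linalg.eigvals(G1).real), [-3.0, -3.0, -2.0, -2.0])
assert G2.shape == (4, 2)
```

This makes concrete the point of Remark 1 below: (G1, G2) depends only on the user-defined coefficients of Λ(s), so (z, ż) can be generated from (u, y) alone without any knowledge of A, B, or C.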
Combining (3)–(5) and (11) with Lemma 1, the optimal dynamic OPFB controller minimizing the value function (2) is

u = −K∗z,
ż = G1z + G2ζ,    (14)

where K∗ = K∗M is the optimal control gain with K∗ and M described as in (3) and (11), respectively.

Remark 1. Note that the internal state z is constructed without using any prior knowledge of the system dynamics. In addition, since the eigenvalues of AL can be user-defined under the observability condition, the dynamics of the internal state, i.e., (G1, G2), are known. This implies that the data of the pair (z, ż) is available by using the input and output data (u, y) of system (1), which are seen as the input of the augmented internal system (12). ■

3.2. ADP-based learning algorithm

In this subsection, by accessing only the data of the input u and the output y, an ADP-based learning algorithm with the VI scheme is proposed to approximate K∗ directly without any prior knowledge of the system matrices.

To begin with, define a family of bounded sets {Bi}∞i=0 with nonempty interiors and a deterministic sequence {ϵj}∞j=0 satisfying

Bi ⊂ Bi+1,  limi→∞ Bi = Pn+,  ϵj > 0,  Σ∞j=0 ϵj = ∞,  Σ∞j=0 ϵj² < ∞.    (16)

A model-based VI scheme proposed in Bian and Jiang (2016) is recalled in Algorithm 1.

Algorithm 1. Model-based VI scheme (Bian & Jiang, 2016)

Initialize: Choose P0 = P0T > 0 and a small threshold ε. Select the sets {Bi}∞i=0 and {ϵj}∞j=0 satisfying (16). Set i = 0 and j = 0.
loop:
    Compute P̃j+1 from the following equation:
        P̃j+1 ← Pj + ϵj(AT Pj + Pj A + CT QC − Pj BR−1BT Pj)    (15)
    if P̃j+1 ∉ Bi then
        Pj+1 = P0, i ← i + 1
    else if ∥P̃j+1 − Pj∥/ϵj < ε then
        return Pj as an estimate of P∗
        break
    else
        Pj+1 = P̃j+1
    end if
    j ← j + 1
end loop

Then, define Lj = d/dt (xT Pj x) + yT Qy. Evaluating Lj along the trajectories of system (1), we have

Lj = ẋT Pj x + xT Pj ẋ + yT Qy    (17)
   = (Ax + Bu)T Pj x + xT Pj (Ax + Bu) + yT Qy
   = xT Hj x + 2uT RKj x    (18)

where Hj = AT Pj + Pj A + CT QC and Kj = R−1BT Pj.

Due to the unmeasurable x(t), modify Lj as L̂j = d/dt (x̂T Pj x̂) + yT Qy. Then, according to the dynamics of x̂ in (5), we have

L̂j = x̂˙T Pj x̂ + x̂T Pj x̂˙ + yT Qy    (19)
   = (Ax̂ + Bu + LCεx)T Pj x̂ + x̂T Pj (Ax̂ + Bu + LCεx) + (x̂ + εx)T CT QC (x̂ + εx)
   = x̂T Hj x̂ + 2uT RKj x̂ + 2εxT (CT LT Pj + CT QC)x̂ + εxT CT QC εx    (20)

where εx(t) = x(t) − x̂(t) is the unavailable observer error. By the equivalence between (19) and (20), and combining (11), we have

zT H̄j z + 2uT RK̄j z + 2εxT ∆1 z + εxT ∆2 εx = 2żT P̄j z + yT Qy    (21)

where H̄j = MT Hj M, K̄j = Kj M, ∆1 = CT LT Pj M + CT QCM, ∆2 = CT QC, and P̄j = MT Pj M are unknown matrices required to be estimated. This indicates that K and M are not required to be approximated separately.

It follows from (5) that the dynamics of the observer error are

ε̇x = (A − LC)εx.    (22)

Note that the observer error εx is unobtainable, due to the unknown matrix (A − LC) and the unmeasurable initial state x(0). Instead of reconstructing the observer error εx, we find a way to generate a virtual vector signal that has a linear relationship with the unknown εx.

Lemma 2. Consider a system η̇ = Aη η, where η ∈ Rn is the state and Aη ∈ Rn×n is an unknown matrix with known characteristic polynomial Λη(s) = sn + µn−1 sn−1 + · · · + µ1 s + µ0. There always exist an available vector β ∈ Rn and a known matrix Aβ ∈ Rn×n such that

β̇(t) = Aβ β(t),  η(t) = Aη←β β(t),  ∀t ≥ 0    (23)

with an unknown constant matrix Aη←β ∈ Rn×n.

Proof. See Appendix B. ■

Combining Lemma 2 with the user-defined characteristic polynomial Λ(s) of (A − LC), there always exist an available vector v ∈ Rn and a known matrix Av ∈ Rn×n such that

v̇(t) = Av v(t),  εx(t) = Aεx←v v(t),  ∀t ≥ 0    (24)

with an unknown constant matrix Aεx←v ∈ Rn×n. To clarify the relationship between v and εx, we refer to v as the virtual observer error.

It follows from (24) that (21) can be rewritten with the available vector v as

zT H̄j z + 2uT RK̄j z + 2vT ∆̄1 z + vT ∆̄2 v = 2żT P̄j z + yT Qy    (25)

with ∆̄1 = ATεx←v ∆1 and ∆̄2 = ATεx←v ∆2 Aεx←v.

Note that (25) can be rearranged as the following vector-formed iterative learning equation:

f(t)T φj = ψ(t)T Φ(P̄j)    (26)

where

f(t) = [δz(t)T, 2δzu(t)T, 2δzv(t)T, δv(t)T]T ∈ Rκf,
φj = [vech(H̄j)T, vec(K̄j)T, vec(∆̄1)T, vech(∆̄2)T]T ∈ Rκf,
ψ(t) = [2δż(t)T, δy(t)T]T ∈ Rκψ,
Φ(P̄j) = [vech(P̄j)T, vech(Q)T]T ∈ Rκψ,

with δz(t) = vech(2zzT − dia(z)²), δzu(t) = z ⊗ Ru, δzv(t) = z ⊗ v, δv(t) = vech(2vvT − dia(v)²), δż(t) = vech(zżT + żzT − dia(z)dia(ż)), and δy(t) = vech(2yyT − dia(y)²). The dimensions are given by κf = [nz(nz + 2m + 2n + 1) + n(n + 1)]/2 and κψ = [nz(nz + 1) + p(p + 1)]/2.

Inspired by the filter system used in Jha et al. (2019), define the following low-pass filter system:

Ḟm(t) = −kFm(t) + f(t)f(t)T,  Fm(t0) = Oκf,κf,    (27)
Ψ̇m(t) = −kΨm(t) + f(t)ψ(t)T,  Ψm(t0) = Oκf,κψ,    (28)
where the matrices Fm ∈ Rκf×κf and Ψm ∈ Rκf×κψ are the states of the filter system, and k > 0 is a tunable gain to stabilize the filter system. Thus, integrating (26) with (27) and (28), we have the following matrix-formed iterative learning equation with the one-layer filter structure:

Fm(t)φj = Ψm(t)Φ(P̄j).    (29)

Finally, the ADP-based learning algorithm with the VI scheme for estimating the optimal control gain K∗ of the dynamic OPFB controller (14) is described in Algorithm 2.

Algorithm 2. VI-based learning algorithm for solving the LQR problem with the dynamic OPFB control

Initialize. Give an arbitrary initial control gain K0 for the dynamic OPFB controller

u0 = −K0z + ξ,
ż = G1z + G2ζ,

where the pair (G1, G2) is set by (12), and ξ is an exploration noise. Choose P̄0 = P̄0T > 0 and a positive tunable gain k > 0. Select sets {Bi}∞i=0 and {ϵj}∞j=0 satisfying (16) with appropriate dimension, and a small threshold ε > 0. Set i = 0 and j = 0.
Data Collection. Collect the online data of the input u0(t) and the output y(t) for t ∈ [t0, tl], where tl is obtained by checking det(Fm(tl)) > 0.
loop:
    Compute H̄j and K̄j from the following equation:
        col(vech(H̄j), vec(K̄j), vec(∆̄1), vech(∆̄2)) = Fm−1 Ψm Φ(P̄j)    (30)
    Update P̃j+1 as
        P̃j+1 ← P̄j + ϵj(H̄j − K̄jT RK̄j)    (31)
    if P̃j+1 ∉ Bi then
        P̄j+1 = P̄0, i ← i + 1
    else if ∥P̃j+1 − P̄j∥/ϵj < ε then
        return K̃∗ = K̄j as an estimate of K∗
        break
    else
        P̄j+1 = P̃j+1
    end if
    j ← j + 1
end loop
Update Controller. Update the control input as u = −K̃∗z.

Remark 2. Note that the observer error εx is equivalently included in the left-hand and right-hand sides of the learning equation (21) along the evolution of x̂ in (5), and so is the exploration noise ξ. Since ξ and εx are always independent, using a derivation similar to that in Rizvi and Lin (2020b, Theorem 4), it is easy to show that Algorithm 2 is immune to the observer error and the exploration bias problem. ■

Remark 3. In fact, since the derivative of the internal state z is obtainable by directly computing (12), the user-defined dynamics of the internal state (12) can simultaneously deal with the unmeasurable state and remove the need for the FWI operation. The one-layer filter structure consisting of (27) and (28) is used to relax the PE condition and avoid the ''intelligent'' memory storage issue, similar to the second-layer filter in Jha et al. (2019). ■

3.3. Convergence and stability analysis

To investigate the convergence of the proposed learning algorithm, we first present the definition of the initial excitation condition according to Roy et al. (2018). Then, the convergence analysis is given in Theorem 1.

Definition 1 (Initial Excitation Condition). The function f(x(t), u(t)) ∈ Rnf is called initially exciting (IE) with respect to system (1) if ∃α > 0 and tl > t0 ≥ 0 such that the corresponding solutions satisfy

∫t0tl f(x(τ), u(τ))f(x(τ), u(τ))T dτ ≥ αInf.    (32)

Theorem 1. If there exists an instant tl such that f(t) satisfies the initial excitation condition at the initial instant t0, then the control scheme estimated by employing Algorithm 2 converges to the optimal dynamic OPFB controller (14). ■

Proof. According to the definition of Fm in (27), the explicit solution is given as

Fm(t) = exp(−kt) ∫t0t exp(kτ)f(τ)f(τ)T dτ.

Combining (32) with exp(kτ) ≥ 1 and f(τ)f(τ)T ≥ 0, ∀t > 0, for any ∆t ≥ 0, we have the following inequality:

exp(−kt) ∫t0tl+∆t exp(kτ)f(τ)f(τ)T dτ ≥ exp(−kt)αIκf.

That is, for all t > tl, the square matrix Fm(t) > 0, i.e., Fm(t) is of full rank if f(t) is IE. This ensures the existence of a unique solution to (29).

As proven in Bian and Jiang (2016, Theorem 3.3), for any given arbitrary P0 = P0T > 0, the matrix Pj calculated from Algorithm 1 converges to P∗ as j → ∞. By Algorithm 2, for any given P̄0 = P̄0T > 0, substituting P̄j into (30) at every iteration j = 0, 1, . . ., (30) has a unique solution (H̄j, K̄j, ∆̄1, ∆̄2) when Fm is of full rank. Thus, P̄j+1 can also be recalculated uniquely from (31). It should be noted that (31) is transformed from (15). Therefore, P̄j+1 converges to P̄∗ as j → ∞. This implies that K̄j converges to K∗ as j → ∞. The proof is completed. ■

Finally, the stability of the closed-loop system employing Algorithm 2 is summarized in the following Theorem 2.

Theorem 2. Under Assumption 1, the LQR problem of system (1) is solvable by the estimated controller

u = −K̃∗z,
ż = G1z + G2ζ,    (33)

where the control gain K̃∗ is obtained by employing Algorithm 2.

Proof. As shown in Lemma 1, the closed-loop system is exponentially stable under the dynamic feedback controller (6) if the system matrix of z satisfies σ(A) ∈ C− and the feedback control gain is K = KM with K satisfying σ(A − BK) ∈ C−. Since A is a user-defined matrix, it is easy to choose A satisfying σ(A) ∈ C−. In addition, it follows from Theorem 1 that K̃∗ converges to the optimal control gain K∗ = K∗M satisfying σ(A − BK∗) ∈ C−, which minimizes the predefined cost function (2). Thus, the LQR problem is solved by employing the proposed Algorithm 2. ■

The instant tl is not a user-defined constant; it is determined by the instant when the IE condition of f(t) is met.
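The mechanism behind (27)–(29) can be illustrated with a small synthetic sketch (an assumed toy problem, not the paper's quadrotor experiment): if a constant parameter vector φ satisfies f(t)Tφ = ψ(t) pointwise, then the filter states satisfy Fm(t)φ = Ψm(t) for all t, and φ is recovered uniquely once det(Fm) > 0, i.e., once the IE condition has been met. Here a hand-picked three-component regressor plays the role of f(t), and Φ(P̄j) is absorbed into the scalar ψ(t).

```python
import numpy as np

# Hand-picked regressor with three independent components (stand-in for f(t)).
def f(t):
    return np.array([np.sin(t), np.cos(2 * t), np.sin(3 * t)])

phi_true = np.array([0.5, -1.2, 2.0])   # unknown parameter vector (role of phi_j)
k, dt, T = 1.0, 1e-3, 10.0              # filter gain and Euler step

F = np.zeros((3, 3))                    # F_m(t0) = 0, cf. (27)
Psi = np.zeros(3)                       # Psi_m(t0) = 0, cf. (28)
for step in range(int(T / dt)):
    ft = f(step * dt)
    psi = ft @ phi_true                 # "measurement" satisfying f^T phi = psi
    F += dt * (-k * F + np.outer(ft, ft))
    Psi += dt * (-k * Psi + ft * psi)

# Once the IE condition holds (det F_m > 0), (29) has a unique solution.
assert np.linalg.det(F) > 0
phi_hat = np.linalg.solve(F, Psi)
assert np.allclose(phi_hat, phi_true)
```

Because Fm and Ψm are driven by the same samples of f, the discrete-time recursion preserves the identity Ψm = Fmφ exactly, so the recovery is limited only by the conditioning (invertibility) of Fm, not by the Euler step size.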
As stated in Jiang and Jiang (2012, Remarks 8 and 9) and Roy et al. (2018), both the rank condition and the IE condition can be easily satisfied by adding an exploration noise in the data collection process, which has been utilized in most existing ADP-based learning algorithms, for instance, Gao and Jiang (2016), Jha et al. (2019), and Jiang and Jiang (2012). Besides, the exploration noise is cut off once the IE condition of f(t) is met at the instant tl. Although exploration noise is still needed to ensure that the IE condition is satisfied, the IE condition appears milder than the rank condition: it is shown in Jha et al. (2019, Proposition 2 and Remark 6) that a data regressor satisfying (32) is necessary but not sufficient for guaranteeing the rank condition.

Remark 4. The learning algorithm proposed in Rizvi and Lin (2020b) cannot be established until the real state observer error εx converges to zero, which implies that the existence of the observer error affects the implementation of that learning algorithm. In this paper, by constructing the virtual observer error v, the new iterative learning equation (29) can still be effectively established when the real observer error has not converged to zero. Theorem 1 shows that, by iteratively calculating the proposed learning equation, the estimated control gain converges to the optimal control gain; hence, the influence of the observer error on the learning algorithm is avoided. However, compared to Rizvi and Lin (2020b), it should be pointed out that the immunity to the observer error is achieved at the cost of an increased demand on computation, since the additional

Ψm(t) is κf(κf + κψ). Note that there is no requirement for storing a series of history data; only the filter data at the current instant t need to be stored.

Analysis of Algorithm 6 in Rizvi and Lin (2020b): It follows from (34) that the computational complexity of the six different components, i.e., terms 1–6, is O(ΣT(κf + κψ)) for each learning interval [t − T, t], where T is the learning interval length and ΣT is the total number of computations of the data regressors in terms 1–6 during the learning interval [t − T, t]. However, there should be at least κf learning intervals in the whole learning process to satisfy the rank condition. Then, to construct the matrices FR and ΨR, the total computational complexity of the learning algorithm in Rizvi and Lin (2020b) for the whole learning process is

C2 = O(ΣT κR(κf + κψ)).    (37)

Note that all the data computed by using FWIs at every learning interval [t − T, t] should be ''intelligently'' stored in the memory stacks. Thus, the memory usage for implementing the ADP-based learning algorithm in Rizvi and Lin (2020b) is κR(κf + κψ).

Hence, the usage of memory storage could be reduced by using the proposed learning algorithm, since κR ≥ κf. Furthermore, it should be noted that the IRL-based learning algorithm in Rizvi and Lin (2020b) requires repeated FWIs, which rely on the choice of the learning interval length T. The proposed learning algorithm eliminates the requirement on the choice of T, which
of an increased demand on computation since the additional
implies that the repeated FWIs are no longer needed. Note that
terms ∆ ¯ 1 and ∆¯ 2 in (25) need to be identified in the learning
ΣT > Σ holds in the case where the length T is set to T ≤ tl − t0 .
process. ■
In this case, the proposed method could reduce the demands on
the computational load of the learning algorithm.
3.4. Algorithm analysis
Remark 5. In fact, not only the learning method in Rizvi and
In this subsection, comparisons are made on the usage of
Lin (2020b) but also most existing model-free ADP-based learn-
memory storage and the computational complexity of the learn- ing algorithms with employing the IRL method to solve several
ing algorithms between the proposed Algorithm 2 and the ADP- optimal control problems, for instance, the LQR problem (Jiang &
based learning algorithm in Rizvi and Lin (2020b). Jiang, 2012; Modares et al., 2016), the optimal tracking control
To begin with, by introducing the virtual observer error v , the problem (Chen et al., 2019; Peng et al., 2020), the optimal output
iterative learning equation in Rizvi and Lin (2020b, Algorithm 6) regulation problem (Gao & Jiang, 2016; Gao et al., 2018), need to
with employing VI scheme and IRL method is modified as record a series of history continuous data and use the ‘‘intelligent’’
∫ t memory storage method to operate the FWIs for every learning
T T
z(t) P̄j z(t) − z (t − T )P̄j z(t − T ) + yT (τ )Qy(τ )dτ interval T . However, the requirements on recording a series of
t −T
continuous data and calculating a large number of repeated FWIs
  
term 1   
term 2 are not required in this paper. ■
∫ t ∫ t
= z T (τ )H̄j z(τ )dτ + 2 uT (τ )RK̄j z(τ )dτ 4. Experiment
t −T t −T
     
term 3 term 4 In this subsection, an experiment on an unmanned quadrotor
t t is given to illustrate the proposed approach. The experiment is
∫ ∫
+ 2 v T (τ ) ∆
¯ 1 z(τ )dτ + v T (τ ) ∆
¯ 2 v (τ )dτ . (34) implemented by a ZY-UAV-200 quadrotor shown in Fig. 1(a) and
t −T t −T
      a motion capture OptiTrack system shown in Fig. 1(b) used to
term 5 term 6 fetch the position of the quadrotor, where the control frequency
The matrix-formed iterative learning equation in Rizvi and Lin is set to 30 Hz. The proposed learning algorithm is employed to
(2020b) is given as estimate the optimal feedback control policy for the outer-loop
position control of the aircraft, and the PX4 open-source autopilot
FR φj = ΨR Φ (P̄j ) (35) flight controller is used for the inner-loop attitude control of the
κR ×κf κR ×κψ quadrotor.
where FR ∈ R and ΨR ∈ R with κR ≥ κf .
The objective of this experiment is to estimate the optimal
Analysis on Algorithm 2 in this paper: According to the def-
control policy that can track a reference orbit trajectory generated
inition in (27) and (28), the computational complexity of the by the following dynamic system:
one-layer filter is
−0.4 −0.825
[ ] [ ][ ] [ ] [ ]
ṙ1 (t) 0 r1 (t) r1 (t0 )
C1 = O Σ κf (κf + κψ )
( )
(36) = , =
ṙ2 (t) 0.4 0 r2 (t) r2 (t0 ) 0
where Σ is the sum number of implementing the computation of
(38)
the filters (27) and (28) in the whole learning process t ∈ [t0 , tN ],
and κf and κψ are defined in (26). In addition, the demands on with t0 being the take-off instant of the quadrotor. Correspond-
the number of memory stacks for storing the filter data Fm and ingly, to track the orbit given in (38), we set the desired position
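Assuming the entries of (38) as printed, the reference generator can be checked with a short sketch: the system matrix is skew-symmetric, so the orbit is a circle of radius 0.825 traversed at 0.4 rad/s, i.e., with period 2π/0.4 ≈ 15.7 s, matching the length of the snapshot window t ∈ [60.1, 75.8] s of the experiment.

```python
import numpy as np

# Reference generator (38): dr/dt = A r with A = [[0, -0.4], [0.4, 0]] and
# r(t0) = [-0.825, 0] (entries as printed in (38)). Since A = 0.4*J with J the
# planar rotation generator, expm(A*t) is a rotation by 0.4*t, used below.
A = np.array([[0.0, -0.4], [0.4, 0.0]])
r0 = np.array([-0.825, 0.0])

def r(t):
    # exact solution via the rotation form of the matrix exponential
    th = 0.4 * t
    R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    return R @ r0

# Consistency of the desired X-axis outputs used in the experiment section:
# pdX = r1 and vdX = -0.4*r2, which should equal d(r1)/dt.
for t in (0.0, 3.7, 12.1):
    dt = 1e-6
    pdX_rate = (r(t + dt)[0] - r(t)[0]) / dt           # numerical d(pdX)/dt
    vdX = -0.4 * r(t)[1]
    assert abs(pdX_rate - vdX) < 1e-5                  # velocity row matches dr1/dt
    assert abs(np.linalg.norm(r(t)) - 0.825) < 1e-12   # circular orbit of radius 0.825
```

The norm-conservation check reflects that rᵀA r = 0 for a skew-symmetric system matrix, so the reference trajectory stays on the circle for all t.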

Correspondingly, to track the orbit given in (38), we set the desired position p_X^d and p_Y^d and the desired velocity v_X^d and v_Y^d of the quadrotor along the X-axis and Y-axis as

  [p_X^d(t)]   [ 1    0  ] [r1(t)]        [p_Y^d(t)]   [ 0    1 ] [r1(t)]
  [v_X^d(t)] = [ 0  −0.4 ] [r2(t)],       [v_Y^d(t)] = [0.4   0 ] [r2(t)].

Define xX = [pX − p_X^d, vX − v_X^d]ᵀ and xY = [pY − p_Y^d, vY − v_Y^d]ᵀ, where pX, pY, vX, vY are the positions and velocities along the X-axis and Y-axis, respectively. The tracking problem can then be transformed into an LQR problem. The dynamics of the quadrotor's motion along the X-axis and Y-axis are described by

  ẋX = AX xX + BX uX,  eX = CX xX;    ẋY = AY xY + BY uY,  eY = CY xY,       (39)

where AX, AY, BX, BY, CX, and CY are unknown system matrices, eX and eY are the position tracking errors along the X-axis and Y-axis, respectively, and uX and uY denote the control inputs along the X-axis and Y-axis, respectively.

With Lemma 1, by setting both eigenvalues of the matrix AL to −1, the dynamic OPFB controllers of the quadrotor along the X-axis and Y-axis for solving the LQR problem are designed as

  uX = −KX zX,  żX = G1 zX + G2 ζX;    uY = −KY zY,  żY = G1 zY + G2 ζY,

where ζX = col(uX, eX), ζY = col(uY, eY), zX and zY are the internal states,

  G1 = [  A    O2,2 ]     G2 = [  b    O2,1 ]         [ 0    1 ]       [0]
       [ O2,2   A   ],         [ O2,1   b   ],    A = [−1   −2 ],  b = [1],

and KX and KY are the optimal control gains to be estimated by the proposed Algorithm 2. For both the X-axis and the Y-axis, set Av = A and v(0) = [0, 1]ᵀ for the virtual observer error system defined in (24). For the reference trajectory (38), the actual control inputs of the quadrotor along the X-axis and Y-axis for solving the tracking problem are u′X = uX + r̈1 and u′Y = uY + r̈2, respectively.

Fig. 1. Equipment for the physical experiment.

Choose the parameters in Algorithm 2 as P̄0 = 0.001 I4, K0 = O1,4, z(0) = O4,1, ε = 0.01, {Bi}∞i=0 = {P̄ > 0 | ∥P̄∥ < 2000(i + 1)}, and {ϵj}∞j=0 = 100/(1000 + j). The exploration noise ξ is chosen as a combination of five sinusoids with frequencies 0.2 Hz, 8 Hz, 13 Hz, 50 Hz, and 130 Hz. The weight matrices are set to QX = QY = 700, RX = 0.03, and RY = 0.02. For comparison, the unknown system matrices are supposed to be

  AX = AY = [0  1]     BX = BY = [0]
            [0  0],              [1],    CX = CY = [1  0],

which gives the optimal control gains KX∗ and KY∗ as

  KX∗ = [187.7099  17.4787  152.7525  322.9837],
  KY∗ = [225.7696  19.3434  187.0829  393.5091].

Note that the system matrices AX, AY, BX, BY, CX, and CY are given just for experimental analysis, while Algorithm 2 is established without using any prior knowledge of these matrices. In addition, only the position tracking errors eX and eY, which can be seen as the outputs of the systems (39), and the inputs uX and uY are used for implementing Algorithm 2.

Then, we employ the proposed Algorithm 2 to estimate the optimal control gains KX∗ and KY∗. The convergence of the proposed learning algorithm is shown in Fig. 2. The data collection process starts at t = 15.0 s, and it ends when the rank condition is satisfied. To be more specific, for the X-axis, the rank condition is satisfied at t = 16.1333 s, and the data-driven learning control law is updated at t = 16.2000 s after 104 iterations. For the Y-axis, the rank condition is satisfied at t = 16.1667 s, and the data-driven learning control law is updated at t = 16.2333 s after 169 iterations. The estimated optimal control gains K̃X∗ and K̃Y∗ are

  K̃X∗ = [187.8690  18.6789  153.5373  321.1720],
  K̃Y∗ = [214.8897  20.2890  187.2618  387.9858].

Fig. 2. The convergence of the proposed algorithm.

The reference orbit and the 3-D flight trajectory of the quadrotor during the whole experiment, with six highlighted 3-D positions at different instants, are shown in Fig. 4. The snapshots of the experiment at t = 16.2 s, t = 60.1 s, t = 64.0 s, t = 67.9 s, t = 71.8 s, and t = 75.8 s are shown in Fig. 3, which indicates that the trajectories during the period t ∈ [60.1, 75.8] s track the reference orbit. The trajectories of the error outputs eX and eY are depicted in Fig. 5, which shows that the errors of the quadrotor along both the X-axis and the Y-axis are close to zero.

The results of the experiment illustrate the effectiveness of the proposed approach. Please also find the demo video clip at https://xiaoyu.xmu.edu.cn/demos/UAV_OPFB_IE.mp4.
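Since the nominal matrices AX = AY, BX = BY, CX = CY and the weights QX = QY, RX, RY are listed above for comparison, the corresponding full-state LQR gains can be cross-checked offline with a model-based Kleinman policy iteration. This is a plain-numpy stand-in for a CARE solver, not the paper's data-driven Algorithm 2; the helper names `lyap` and `kleinman` and the stabilizing initial gain K0 = [1, 2] are assumptions made here for illustration.

```python
import numpy as np

# Model-based sanity check (NOT the paper's Algorithm 2): Kleinman policy
# iteration for the continuous-time algebraic Riccati equation, for the
# nominal double integrator A = [[0,1],[0,0]], B = [[0],[1]], C = [1,0]
# with Q = C^T * 700 * C and the input weights R used in the experiment.

def lyap(M, W):
    """Solve M^T P + P M = -W for symmetric P via the vec/Kronecker trick."""
    n = M.shape[0]
    L = np.kron(np.eye(n), M.T) + np.kron(M.T, np.eye(n))
    P = np.linalg.solve(L, -W.flatten(order="F")).reshape((n, n), order="F")
    return 0.5 * (P + P.T)

def kleinman(A, B, Q, R, K0, iters=30):
    """Policy iteration: policy evaluation (Lyapunov) + policy improvement."""
    K = K0
    for _ in range(iters):
        Ak = A - B @ K
        P = lyap(Ak, Q + K.T @ (R * K))   # evaluate current policy
        K = (B.T @ P) / R                 # improve (scalar input weight R)
    return K

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[700.0, 0.0], [0.0, 0.0]])  # C^T Q_y C with Q_y = 700
K0 = np.array([[1.0, 2.0]])               # stabilizing initial gain (assumed)

Kx = kleinman(A, B, Q, 0.03, K0)          # approx [[152.7525, 17.4787]]
Ky = kleinman(A, B, Q, 0.02, K0)          # approx [[187.0829, 19.3434]]
```

Notably, the computed state-feedback gains reappear as the third and second components of the reported KX∗ and KY∗; the remaining components depend on the internal-state parameterization of the dynamic OPFB controller and are not reproduced by this check.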
Fig. 3. Six snapshots of the quadrotor flight trajectories.

Fig. 4. The reference orbit and the 3-D flight trajectories of the quadrotor.

Fig. 5. The trajectories of the error outputs.
5. Conclusion

This paper aims to solve the LQR problem of unknown linear continuous-time systems without using full-state feedback. By only accessing the input and output data, a novel ADP-based learning algorithm with dynamic output feedback control is proposed without estimating the system state or using any prior knowledge of the system dynamics. Moreover, the influence of the observer error on the convergence performance of the proposed algorithm is avoided, and the key iterative learning scheme is established without using any finite window integrals or storing a series of historical data. For future work, we will consider system uncertainty and investigate model-free learning algorithms for solving robust regulation problems.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0112600, in part by the National Natural Science Foundation of China under Grants 62173283 and 62273285, in part by the Postdoctoral Fellowship Program of CPSF, Project No. GZC20233407, and in part by the fellowship award from the Research Grants Council of Hong Kong, Project No. CityU PDFS2324-1S02.

Appendix A. Derivation of Eq. (7)

To begin with, for the term Bi Ui(s)/(sI − (A − LC)), i = 1, 2, . . . , m, we have

  Bi Ui(s)/(sI − (A − LC))
    = [adj(sI − (A − LC)) Bi / det(sI − (A − LC))] Ui(s)
    = col( (a_{n−1}^{1i} s^{n−1} + a_{n−2}^{1i} s^{n−2} + ··· + a_1^{1i} s + a_0^{1i}) / Λ(s),
           (a_{n−1}^{2i} s^{n−1} + a_{n−2}^{2i} s^{n−2} + ··· + a_1^{2i} s + a_0^{2i}) / Λ(s),
           ...,
           (a_{n−1}^{ni} s^{n−1} + a_{n−2}^{ni} s^{n−2} + ··· + a_1^{ni} s + a_0^{ni}) / Λ(s) ) Ui(s)
    = [ a_0^{1i}  a_1^{1i}  ···  a_{n−1}^{1i} ]
      [ a_0^{2i}  a_1^{2i}  ···  a_{n−1}^{2i} ]  col( Ui(s)/Λ(s), sUi(s)/Λ(s), ..., s^{n−1}Ui(s)/Λ(s) )
      [    ⋮         ⋮       ⋱        ⋮      ]
      [ a_0^{ni}  a_1^{ni}  ···  a_{n−1}^{ni} ]
    := Mui(A, L, C, Bi) Zui(s)                                                (A.1)

with Λ(s) = s^n + α_{n−1}s^{n−1} + ··· + α_1 s + α_0, where Mui(A, L, C, Bi) is a constant matrix determined by the matrices A, L, C, and Bi, and det(sI − (A − LC)) and adj(sI − (A − LC)) denote the determinant and the adjoint matrix of (sI − (A − LC)), respectively. Ui(s) and Zui(s) are the Laplace transforms of the input signal ui(t) and the related internal state zui(t), respectively. Using the inverse Laplace transform for the term Zui(s), we can get the time-domain signals with the dynamics described by żui(t) = A zui(t) + b ui(t).
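A time-domain realization of this filter state can be sketched by taking (A, b) in companion (controllable canonical) form for the chosen characteristic polynomial Λ(s), consistent with the companion matrix Aβ displayed in Appendix B; the helper name `companion` is illustrative:

```python
import numpy as np

# Sketch of the companion-form realization behind the internal-state filters:
# for a monic polynomial Lambda(s) = s^n + a_{n-1}s^{n-1} + ... + a_0, the
# pair (A, b) below satisfies det(sI - A) = Lambda(s), so z' = A z + b u
# realizes the filtered signals U(s)/Lambda(s), ..., s^{n-1}U(s)/Lambda(s).

def companion(coeffs):
    """coeffs = [a_0, ..., a_{n-1}] of Lambda(s) = s^n + a_{n-1}s^{n-1} + ... + a_0."""
    n = len(coeffs)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)        # superdiagonal of ones
    A[-1, :] = -np.asarray(coeffs)    # last row: -a_0, ..., -a_{n-1}
    b = np.zeros((n, 1))
    b[-1, 0] = 1.0
    return A, b

# Lambda(s) = s^2 + 2s + 1: both roots at -1, matching the experiment's
# internal-state matrix A = [[0, 1], [-1, -2]].
A, b = companion([1.0, 2.0])
print(np.poly(A))              # characteristic polynomial, approx [1, 2, 1]
print(np.linalg.eigvals(A))    # both eigenvalues near -1 (stable filter)
```

Stability of this pair is what makes the internal state a bounded, implementable filter of the measured input and output channels.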

Along with the same derivation as in (A.1), for the term Lk Yk(s)/(sI − (A − LC)), k = 1, 2, . . . , p, we have

  L⁻¹( Lk Yk(s)/(sI − (A − LC)) ) = Myk zyk(t)                                (A.2)

with żyk(t) = A zyk(t) + b yk(t). Then, we have

  x̂(t) = Σ_{i=1}^{m} Mui zui(t) + Σ_{k=1}^{p} Myk zyk(t).                     (A.3)

Thus, the derivation is completed.

Appendix B. Proof of Lemma 2

For a system η̇ = Aη η with initial state η(0), using the same derivation as in (A.1), the explicit solution of η(t) can be written as

  η(t) = L⁻¹( η(0)/(sI − Aη) )
       = Mη(Aη, η(0)) L⁻¹( col( 1/Λη(s), s/Λη(s), ..., s^{n−1}/Λη(s) ) )      (B.1)

where Mη(Aη, η(0)) is an unknown constant matrix determined by Aη and η(0).

Choose a constant matrix Aβ ∈ R^{n×n} which has the same characteristic polynomial Λη(s). For example, set

  Aβ = [  0     1     0   ···    0   ]
       [  0     0     1   ···    0   ]
       [  ⋮     ⋮     ⋮    ⋱     ⋮   ]
       [  0     0     0   ···    1   ]
       [ −µ0   −µ1   −µ2  ···  −µn−1 ].

Define the system β̇ = Aβ β with initial state β(0) ∈ R^n. Using the same derivation as in (B.1), the explicit solution of β(t) can be written as

  β(t) = L⁻¹( β(0)/(sI − Aβ) )
       = Mβ(Aβ, β(0)) L⁻¹( col( 1/Λη(s), s/Λη(s), ..., s^{n−1}/Λη(s) ) )      (B.2)

where Mβ(Aβ, β(0)) is an accessible constant matrix determined by the user-defined Aβ and β(0).

By selecting Aβ and β(0) properly so that Mβ is of full rank, and combining (B.1) and (B.2), we have

  η(t) = Mη Mβ⁻¹ β(t) := Aη←β β(t)                                            (B.3)

where Aη←β is an unknown matrix since Mη is unavailable. This completes the proof.

References

Bian, T., & Jiang, Z. P. (2016). Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design. Automatica, 71, 348–360.
Chen, C., Modares, H., Xie, K., Lewis, F. L., Wan, Y., & Xie, S. (2019). Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics. IEEE Transactions on Automatic Control, 64(11), 4423–4438.
Chen, C., Xie, L., Jiang, Y., Xie, K., & Xie, S. (2023). Robust output regulation and reinforcement learning-based output tracking design for unknown linear discrete-time systems. IEEE Transactions on Automatic Control, 68(4), 2391–2398.
Gao, W., & Jiang, Z. P. (2016). Adaptive dynamic programming and adaptive optimal output regulation of linear systems. IEEE Transactions on Automatic Control, 61(12), 4164–4169.
Gao, W., & Jiang, Z. (2018). Learning-based adaptive optimal tracking control of strict-feedback nonlinear systems. IEEE Transactions on Neural Networks and Learning Systems, 29(6), 2614–2624.
Gao, W., & Jiang, Z. P. (2019). Adaptive optimal output regulation of time-delay systems via measurement feedback. IEEE Transactions on Neural Networks and Learning Systems, 30(3), 938–945.
Gao, W., Jiang, Y., & Davari, M. (2019). Data-driven cooperative output regulation of multi-agent systems via robust adaptive dynamic programming. IEEE Transactions on Circuits and Systems II: Express Briefs, 66(3), 447–451.
Gao, W., Jiang, Y., Jiang, Z. P., & Chai, T. (2016). Output-feedback adaptive optimal control of interconnected systems based on robust adaptive dynamic programming. Automatica, 72, 37–45.
Gao, W., Jiang, Z. P., & Lewis, F. L. (2018). Leader-to-formation stability of multi-agent systems: an adaptive optimal control approach. IEEE Transactions on Automatic Control, 63(10), 3581–3588.
Jha, S. K., Roy, S. B., & Bhasin, S. (2019). Initial excitation-based iterative algorithm for approximate optimal control of completely unknown LTI systems. IEEE Transactions on Automatic Control, 64(12), 5230–5237.
Jiang, Y., & Jiang, Z. P. (2012). Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica, 48(10), 2699–2704.
Jiang, Y., & Jiang, Z. P. (2013). Robust adaptive dynamic programming with an application to power systems. IEEE Transactions on Neural Networks and Learning Systems, 24(7), 1150–1156.
Jiang, Y., & Jiang, Z. P. (2017). Robust adaptive dynamic programming. Hoboken, NJ, USA: Wiley.
Jiang, H., Zhang, H., Zhang, K., & Cui, X. (2018). Data-driven adaptive dynamic programming schemes for non-zero-sum games of unknown discrete-time nonlinear systems. Neurocomputing, 275, 649–658.
Lancaster, P., & Rodman, L. (1995). Algebraic Riccati equations. New York, NY, USA: Oxford University Press Inc.
Lewis, F. L., & Vamvoudakis, K. G. (2011). Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 41(1), 14–25.
Lewis, F. L., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 9(3), 32–50.
Lewis, F. L., Vrabie, D. L., & Syrmos, V. L. (2012). Optimal control. John Wiley & Sons, Inc.
Liu, D., Wei, Q., Ding, W., Yang, X., & Li, H. (2017). Adaptive dynamic programming with applications in optimal control. Cham, Switzerland: Springer.
Liu, D., Xue, S., Zhao, B., Luo, B., & Wei, Q. (2021). Adaptive dynamic programming for control: A survey and recent advances. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 51(1), 142–160.
Luo, B., Yang, Y., & Liu, D. (2018). Adaptive Q-learning for data-based optimal output regulation with experience replay. IEEE Transactions on Cybernetics, 48(12), 3337–3348.
Modares, H., Lewis, F. L., & Jiang, Z. (2016). Optimal output-feedback control of unknown continuous-time linear systems using off-policy reinforcement learning. IEEE Transactions on Cybernetics, 46(11), 2401–2410.
Peng, Y., Meng, Q., & Sun, W. (2020). Adaptive output-feedback quadratic tracking control of continuous-time systems via value iteration with its application. IET Control Theory & Applications, 14(20), 3621–3631.
Powell, W. (2004). Approximate dynamic programming: solving the curse of dimensionality. New York, NY, USA: Wiley.
Rizvi, S. A. A., & Lin, Z. (2020a). Output feedback adaptive dynamic programming for linear differential zero-sum games. Automatica, 122, Article 109272.
Rizvi, S. A. A., & Lin, Z. (2020b). Reinforcement learning-based linear quadratic regulation of continuous-time systems using dynamic output feedback. IEEE Transactions on Cybernetics, 50(11), 4670–4679.
Rizvi, S. A. A., Pertzborn, A. J., & Lin, Z. (2022). Reinforcement learning based optimal tracking control under unmeasurable disturbances with application to HVAC systems. IEEE Transactions on Neural Networks and Learning Systems, 33(12), 7523–7533.
Roy, S. B., Bhasin, S., & Kar, I. N. (2018). Combined MRAC for unknown MIMO LTI systems with parameter convergence. IEEE Transactions on Automatic Control, 63(1), 283–290.
Sun, W., Zhao, G., & Peng, Y. (2019). Adaptive optimal output feedback tracking control for unknown discrete-time linear systems using a combined reinforcement Q-learning and internal model method. IET Control Theory & Applications, 13(18), 3075–3086.
Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning. Cambridge, MA, USA: MIT Press.
Vamvoudakis, K. G., & Lewis, F. L. (2010). Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46(5), 878–888.


Kedi Xie received the B.S. degree in automation and the M.E. degree in control theory and control engineering from Sichuan University, Chengdu, China, in 2014 and 2017, respectively, and the Ph.D. degree in control theory and control engineering from Xiamen University, Xiamen, China, in 2022. From 2021 to 2022, she was a visiting Ph.D. student with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. She is currently a Postdoctoral Fellow with the School of Automation, Beijing Institute of Technology, Beijing, China. Her current research interests include adaptive dynamic programming, optimal control, output regulation, and robotics.

Yiwei Zheng received the B.S. degree in Opto-Electronic Engineering from Huaqiao University, Quanzhou, China, in 2018. He is currently a Ph.D. candidate with the Department of Automation, Xiamen University, Xiamen, China. His current research interests include formal methods, opacity of discrete event systems, nonlinear systems, and mobile robotics.

Yi Jiang received the B.Eng. degree in automation and the M.S. and Ph.D. degrees in control theory and control engineering from the Information Science and Engineering College and the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, Liaoning, China, in 2014, 2016, and 2020, respectively. From January to July 2017, he was a Visiting Scholar with the UTA Research Institute, University of Texas at Arlington, TX, USA, and from March 2018 to March 2019, he was a Research Assistant with the University of Alberta, Edmonton, Canada. Currently, he is a Postdoctoral Fellow with the City University of Hong Kong, Hong Kong SAR, China. His research interests include networked control systems, industrial process operational control, reinforcement learning, and event-triggered control. Dr. Jiang is an Associate Editor for Advanced Control for Applications: Engineering and Industrial Systems, and the recipient of the Excellent Doctoral Dissertation Award from the Chinese Association of Automation (CAA) in 2021 and the Hong Kong Research Grants Council (RGC) Postdoctoral Fellowship Scheme (PDFS) 2023/24 award.

Weiyao Lan received his B.S. degree in precision instrument from Chongqing University, Chongqing, China, in 1995, the M.S. degree in control theory and control engineering from Xiamen University, Xiamen, China, in 1998, and the Ph.D. degree in automation and computer-aided engineering from the Chinese University of Hong Kong, Hong Kong SAR, China, in 2004. From 2004 to 2006, he was a research fellow in the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. Since December 2006, he has been with the Department of Automation, Xiamen University, Xiamen, China, where he is currently a professor. His research interests include nonlinear control theory and applications, intelligent control technology, and robust and optimal control. Dr. Lan is a member of the Technical Committee on Control Theory, Chinese Association of Automation, and the vice-president of the Fujian Association of Automation. He is also serving as an associate editor for the Transactions of the Institute of Measurement and Control.

Xiao Yu received the B.S. degree in electrical engineering and automation from Southwest Jiaotong University, Chengdu, China, in 2010, the M.E. degree in control engineering from Xiamen University, Xiamen, China, in 2013, and the Ph.D. degree in mechanical and biomedical engineering from City University of Hong Kong, Hong Kong SAR, China, in 2017. From 2018 to 2019, he was an Assistant Professor with the Department of Automation, Shanghai Jiao Tong University, Shanghai, China. He joined the Department of Automation, Xiamen University in Dec. 2019, and he is currently a Professor with the Institute of Artificial Intelligence, Xiamen University, Xiamen, China. His current research interests include multi-agent systems, learning control, mobile robotics, and intelligent systems.