
2024 American Control Conference (ACC)

July 8-12, 2024. Toronto, Canada

Model Free Difference Feedback Control of Stochastic Systems


Muhammad Hamad Zaheer¹ and Se Young Yoon¹

¹Muhammad Hamad Zaheer and Se Young Yoon are with the Department of Electrical and Computer Engineering, University of New Hampshire, Durham, NH 03824, USA [email protected], [email protected]

Abstract— This paper presents a model-free reinforcement learning (RL) algorithm for the output-difference feedback control (ODFC) of stochastic systems with measurement and process noise. A policy iteration algorithm is presented using a quadratic output-difference reward function to learn the optimal ODFC from noisy output difference measurements. It is proved that the proposed method trains stable dynamic control laws that approach the analytical optimal solution, even in the presence of process and measurement noise. Simulation results are presented to demonstrate the effectiveness of the proposed scheme.

I. INTRODUCTION

Reinforcement Learning (RL) based control methods combine the advantages of two distinct control fields, namely optimal control [7] and adaptive control [2]. By interacting with the environment, RL based methods learn the optimal control law using reward stimuli as feedback. Because of this, RL algorithms have been widely used to solve optimal control problems for uncertain systems [3], [12], [14].

RL based solutions in control theory solve the Bellman equation associated with the desired optimal control law using iterative techniques like Policy Iteration (PI) and Value Iteration (VI). When only the system output measurements are available for feedback, PI [8] and Q-learning [10] have been considered to learn optimal output feedback controllers. For linear stochastic systems with measurement and process noise, [4] considered model-free RL algorithms for state feedback and under state and control-dependent noise, assuming that the noise is measurable. A more general case of systems with both process and state measurement noise is analysed in [15], and theoretical proofs are presented that guarantee stability of the learned controllers. However, all these solutions still require full state measurement for training and control, which has obvious limitations for practical applications. Part of the contributions of the current work aim to overcome this limitation.

This paper explores RL-based methods for controlling systems with biased noise using output-difference feedback controllers (ODFCs). Difference feedback control methods have been considered in the literature for the control of chaotic systems [6], [5] and systems with uncertainty in their state equilibrium [9], [16]. The current work proposes a new model-free PI algorithm for training an optimal ODFC solution. The contributions of the proposed RL-based solution lie in that: 1) it relaxes the need for full state measurement by introducing an extended parameterized state observer, and 2) the training of the output feedback control policy and the stability of the closed-loop system can be guaranteed under nonzero-mean Gaussian measurement and process noise.

Theoretical analysis demonstrating the contributions of this work is presented in this paper, together with results from numerical studies that compare the proposed solution to other comparable RL-based algorithms.

The remainder of the paper is structured as follows. The control problem considered in this work is introduced in Section II, which discusses the output-difference feedback problem for a linear stochastic system. An analytical model-based solution to the problem is given in Section III, while the model-free PI algorithm for training the ODFC is presented in Section IV. In Section V, numerical simulations are presented to validate the theoretical results of the paper.

II. PROBLEM FORMULATION

Consider a linear Gaussian system given by

    x_{k+1} = A x_k + B u_k + w_k,
    y_k = C x_k + v_k,                                                        (1)

where x ∈ R^n, u ∈ R^m and y ∈ R^p are the system state, input and output vectors, respectively. The vectors w ∈ R^n and v ∈ R^p model the process and sensor noise. They are drawn i.i.d. (independent and identically distributed) from Gaussian distributions N(µ_w, W_w) and N(µ_v, W_v), respectively.

Without loss of generality, it is assumed that µ_w = 0. The case involving non-zero mean process noise, w_k ∼ N(µ_w, W_w), can be readily addressed by introducing a static shift in the system states.

A. State feedback optimal control

Consider a state feedback controller for (1) as

    u_k = −K x̂_k,                                                            (2)

where x̂_k ∼ N(x_k, Σ_ε) is some unbiased estimate of the system states x_k, and Σ_ε is its estimation error covariance. Define a quadratic cost function as

    r(y_k, u_k) = (y_{k+1} − y_k)^T Q_y (y_{k+1} − y_k) + u_k^T R u_k,        (3)

where Q_y ≥ 0 and R > 0 are weighting matrices. The average cost associated with a stabilizing gain K of controller (2) is defined as

    λ_K = lim_{N→∞} (1/N) E[ Σ_{i=1}^{N} r(y_i, −K x̂_i) ].                   (4)

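To make the setup concrete, the following Python sketch, a minimal illustration rather than anything from the paper, simulates the linear Gaussian system (1) under a fixed state feedback gain and estimates the average cost (4) by Monte Carlo with the output-difference reward (3). The numerical values of A, B, C, the noise parameters and the gain K are assumptions chosen only for the example, and the true state is used in place of the estimate x̂_k.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative placeholder matrices for (1); any suitable (A, B, C) works.
A = np.array([[1.1, 0.1], [1.0, 0.95]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Ww = 0.01 * np.eye(2)                 # process noise covariance W_w (zero mean)
mu_v, Wv = 0.5, 0.1 * np.eye(1)       # biased sensor noise N(mu_v, W_v)

Qy = np.eye(1)                        # output-difference weight Q_y >= 0
R = 0.1 * np.eye(1)                   # input weight R > 0
K = np.array([[5.8, 1.25]])           # a stabilizing (not optimal) gain for this A, B

def reward(y_next, y, u):
    """Quadratic output-difference reward (3)."""
    dy = y_next - y
    return (dy.T @ Qy @ dy + u.T @ R @ u).item()

def average_cost(K, N=50_000):
    """Monte Carlo estimate of the average cost (4); the true state is used in
    place of the estimate x_hat for simplicity (i.e., Sigma_eps = 0)."""
    x = np.zeros((2, 1))
    y = C @ x + rng.multivariate_normal([mu_v], Wv).reshape(1, 1)
    total = 0.0
    for _ in range(N):
        u = -K @ x
        x = A @ x + B @ u + rng.multivariate_normal(np.zeros(2), Ww).reshape(2, 1)
        y_next = C @ x + rng.multivariate_normal([mu_v], Wv).reshape(1, 1)
        total += reward(y_next, y, u)
        y = y_next
    return total / N

print("estimated average cost for K:", average_cost(K))

Because the reward (3) depends only on the output difference y_{k+1} − y_k, the constant sensor bias µ_v cancels from the first term, which is the property that motivates output-difference feedback under biased noise.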


A differential cost function associated with a stabilizing controller (2) is defined as

    V_K(x̂_k) = E[ Σ_{i=k}^{+∞} ( r(y_i, −K x̂_i) − λ_K ) | x̂_k ].             (5)

The corresponding Bellman equation for the cost function (5) is

    V_K(x̂_k) = E[ r(y_k, −K x̂_k) | x̂_k ] − λ_K + E[ V_K(x̂_{k+1}) | x̂_k ].    (6)

The following assumptions are made throughout the paper:

Assumption 1: The pair (A, B) is controllable and the pair (A, C) is observable.

Assumption 2: The pair (A, Q_x^{1/2}) is observable, where (Q_x^{1/2})^T Q_x^{1/2} = Q_x and Q_x = C^T Q_y C.

III. MODEL-BASED OUTPUT-DIFFERENCE FEEDBACK OPTIMAL CONTROL

A model-based method is initially investigated to design the optimal ODFC for (1). First assume that an observer, to be presented later, offers an unbiased estimate of the state

    x̂_k = x_k + ε_k,                                                          (7)

where ε_k ∼ N(0, Σ_ε).

Theorem 3.1: Consider the optimal control problem (2)-(5). The optimal state feedback controller gain K* is

    K* = (R̄ + B^T P* B)^{−1} (B^T P* A + N^T),                                (8)

where P* > 0 is the solution of the ARE

    A^T P* A − P* − (A^T P* B + N)(R̄ + B^T P* B)^{−1}(B^T P* A + N^T) + Q̄ = 0,   (9)

Q̄ = Ā^T Q_x Ā, R̄ = B^T Q_x B + R, N = Ā^T Q_x B and Ā = A − I. The average cost associated with K* is given by

    λ_{K*} = Tr(Ā^T Q_x Ā Σ_ε) + Tr(Q_x W_w) + 2 Tr(Q_y W_v)
             + Tr(K*^T B^T P* B K* W_v) + Tr(P* (W_w + Σ_ε))
             − Tr((A − BK*)^T P* (A − BK*) Σ_ε).                               (10)

Proof: A result similar to this theorem, but for linear stochastic systems and a state-dependent quadratic cost, is presented in [15]. The proof here follows a similar procedure, but the results apply to the case where the reward is a function of the output difference.

The results of this theorem can be arrived at by assuming a quadratic cost function V_K(x̂_k) = x̂_k^T P x̂_k for some P > 0, and substituting this into the Bellman equation (6). Then, with the system dynamics (1), minimizing the Bellman equation yields the optimal feedback gain K*. Finally, substituting the optimal feedback gain into (6) yields that the Bellman equation holds if (9) and (10) are satisfied.

Theorem 3.1 states that the solution to the optimal control problem can be obtained from the ARE (9). An iterative algorithm for solving (9) is presented in Theorem 3.2.

Theorem 3.2: Let K^0 be any stabilizing state feedback controller gain and P^i > 0 be the solution of the Lyapunov equation

    A_i^T P^i A_i − P^i + Q̄ + (K^i)^T R̄ K^i − (K^i)^T N^T − N K^i = 0,         (11)

where i = 0, 1, 2, . . . and A_i = A − BK^i. For K^{i+1} calculated as

    K^{i+1} = (R̄ + B^T P^i B)^{−1} (B^T P^i A + N^T),                          (12)

the following hold:
1) A − BK^{i+1} is Schur,
2) P* ≤ P^{i+1} ≤ P^i,
3) lim_{i→∞} P^i = P*, lim_{i→∞} K^i = K*.

Proof: The proof of this theorem follows similar arguments as in [17] (Thm 3.1) and it is omitted here.

Next, a parameterized observer is introduced, which estimates the system state x_k from the output difference measurement. This observer can be combined with (8) to offer an output-difference feedback solution to the optimal control problem (2)-(5).

Theorem 3.3: There exists a state parametrization

    x̄_k = Γ_u α_k + Γ_y β_k,                                                   (13)

that converges exponentially in mean to the state x_k as k → ∞ for an observable system (1), and the estimation error x̃_k ≜ x_k − x̄_k is given at steady-state by x̃_k ∼ N(0, Σ_ε), where Σ_ε is a bounded error covariance matrix. The matrices Γ_u and Γ_y are system-dependent matrices containing the system's transfer function coefficients, and

    α_k = [ (α_k^1)^T  (α_k^2)^T  . . .  (α_k^m)^T ]^T,                         (14)

    β_k = [ (β_k^1)^T  (β_k^2)^T  . . .  (β_k^p)^T ]^T,                         (15)

are given as

    α_{k+1}^i = 𝒜 α_k^i + ℬ u_k^i,  ∀i = 1, 2, . . . , m,                       (16)

and

    σ_{k+1}^i = 𝒜 σ_k^i + ℬ (y_k^i − y_{k−1}^i),
    β_k^i = 𝒞 σ_k^i + 𝒟 (y_k^i − y_{k−1}^i),  ∀i = 1, 2, . . . , p,              (17)

where u^i and y^i are the ith input and output, respectively. The matrix 𝒜 is any user-defined Schur matrix in companion form, ℬ = [0 0 . . . 1]^T, 𝒞 ∈ R^{n×n} is an identity matrix, and 𝒟 = [−1/λ_0 0 . . . 0]^T, where λ_0 is the coefficient of the constant term in the characteristic polynomial of 𝒜.

Proof: The existence of the parametrization (13), equivalent to a difference-feedback state observer

    x̄_{k+1} = (A − LCA + LC) x̄_k + (B − LCB) u_k + L (y_{k+1} − y_k),          (18)

with observer gain L, can be proven following the same procedure as in Theorem 1 in [11].

Equation (18) can now be used to determine the mean and covariance of the estimation error of the observer (13).

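The model-based iteration of Theorem 3.2 is straightforward to prototype. The sketch below, a minimal illustration with assumed system and weight matrices rather than the authors' implementation, iterates the Lyapunov equation (11) and the gain update (12) with SciPy and cross-checks the limit against the ARE (9), passing the cross term N through the s argument of solve_discrete_are.

import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

# Illustrative system and weights (assumed, not from the paper)
A = np.array([[1.1, 0.1], [1.0, 0.95]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Qy = np.eye(1)
R = 0.1 * np.eye(1)

Abar = A - np.eye(2)                 # Ā = A − I
Qx = C.T @ Qy @ C                    # Q_x = C^T Q_y C
Qbar = Abar.T @ Qx @ Abar            # Q̄ = Ā^T Q_x Ā
Rbar = B.T @ Qx @ B + R              # R̄ = B^T Q_x B + R
N = Abar.T @ Qx @ B                  # N = Ā^T Q_x B

K = np.array([[5.8, 1.25]])          # K^0: any stabilizing gain (assumed)
for i in range(50):
    Ai = A - B @ K
    # Lyapunov equation (11): Ai^T P Ai − P + Q̄ + K^T R̄ K − K^T N^T − N K = 0
    Qi = Qbar + K.T @ Rbar @ K - K.T @ N.T - N @ K
    P = solve_discrete_lyapunov(Ai.T, Qi)
    # Gain update (12)
    K_new = np.linalg.solve(Rbar + B.T @ P @ B, B.T @ P @ A + N.T)
    if np.linalg.norm(K_new - K) < 1e-10:
        K = K_new
        break
    K = K_new

# Cross-check against the ARE (9), with the cross-weight N passed as `s`
P_are = solve_discrete_are(A, B, Qbar, Rbar, s=N)
K_are = np.linalg.solve(Rbar + B.T @ P_are @ B, B.T @ P_are @ A + N.T)
print("PI gain :", K)
print("ARE gain:", K_are)

For the placeholder matrices chosen here the cross term N = Ā^T Q_x B happens to be zero because Q_x B = 0; the code keeps the general form.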
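To show how the extended state z_k = [α_k^T  β_k^T]^T of Theorem 3.3 can be formed from measured data alone, the following sketch builds the filters (16)-(17) for a single-input, single-output case. Everything in it is an assumption made for illustration: the plant order n, the chosen Schur companion matrix 𝒜 and the recorded sequences u_seq and y_seq. The mappings Γ_u and Γ_y never need to be formed explicitly, since the model-free algorithm of Section IV learns the combined gain K̄ = KΓ directly from z_k.

import numpy as np

rng = np.random.default_rng(2)

def companion_schur(coeffs):
    """Companion-form matrix with characteristic polynomial
    z^n + c_{n-1} z^{n-1} + ... + c_0 (roots assumed inside the unit circle)."""
    c = np.asarray(coeffs, dtype=float)      # [c_0, c_1, ..., c_{n-1}]
    n = c.size
    Acal = np.zeros((n, n))
    Acal[:-1, 1:] = np.eye(n - 1)
    Acal[-1, :] = -c
    return Acal

n = 2                                        # plant order (assumed known)
lam = [0.06, -0.5]                           # z^2 - 0.5 z + 0.06, eigenvalues {0.2, 0.3}
Acal = companion_schur(lam)                  # user-defined Schur matrix in (16)-(17)
Bcal = np.array([[0.0], [1.0]])              # ℬ = [0 ... 1]^T
Ccal = np.eye(n)                             # 𝒞 = I
Dcal = np.zeros((n, 1)); Dcal[0, 0] = -1.0 / lam[0]   # 𝒟 = [-1/λ_0 0 ... 0]^T

def build_z(u_seq, y_seq):
    """Build z_k = [α_k; β_k] from scalar input/output sequences per (16)-(17)."""
    alpha = np.zeros((n, 1))
    sigma = np.zeros((n, 1))
    z_traj = []
    for k in range(1, len(u_seq)):
        dy = y_seq[k] - y_seq[k - 1]
        beta = Ccal @ sigma + Dcal * dy            # β_k from (17)
        z_traj.append(np.vstack([alpha, beta]))    # z_k = [α_k; β_k]
        alpha = Acal @ alpha + Bcal * u_seq[k]     # α_{k+1} from (16)
        sigma = Acal @ sigma + Bcal * dy           # σ_{k+1} from (17)
    return z_traj

# Example: run the filters on placeholder recorded sequences.
u_seq = 0.1 * rng.standard_normal(100)
y_seq = rng.standard_normal(100)
print("z_k dimension:", build_z(u_seq, y_seq)[-1].shape)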
To determine the error dynamics of the observer (18), LC(x_{k+1} − x_k) is first added to and subtracted from the state equation (1) to obtain

    x_{k+1} = (A − LCA + LC) x_k + (B − LCB) u_k + (I − LC) w_k + LC (x_{k+1} − x_k).   (19)

Now subtracting (18) from (19), the error dynamics is found as

    x̃_{k+1} = (A − LCA + LC) x̃_k + (I − LC) w_k − L (v_{k+1} − v_k),           (20)

where x̃_k ≜ x_k − x̄_k. To calculate the mean and covariance of the error, one must first write the error dynamics in filter notation as

    x̃_k = Σ_{i=1}^{n} (W_i(z)/Δ(z)) [w_k^i] + Σ_{i=1}^{p} (V_i(z)/Δ(z)) [v_k^i − v_{k−1}^i]
          + (A − LCA + LC)^k [x̃_0],                                            (21)

where w^i and v^i are the ith process and sensor noise, respectively. Following the proof of Theorem 1 in [11], (W_i(z)/Δ(z))[w_k^i] and (V_i(z)/Δ(z))[v_k^i − v_{k−1}^i] can be expressed as

    (W_i(z)/Δ(z)) [w_k^i] = Γ_w^i γ_k^i,
    (V_i(z)/Δ(z)) [v_k^i − v_{k−1}^i] = Γ_v^i ξ_k^i,

for some system-dependent matrices Γ_w^i and Γ_v^i, and where γ_k^i and ξ_k^i are obtained from the following two equations

    γ_{k+1}^i = 𝒜 γ_k^i + ℬ w_k^i,  ∀i = 1, 2, . . . , n,                        (22)

and

    θ_{k+1}^i = 𝒜 θ_k^i + ℬ (v_k^i − v_{k−1}^i),
    ξ_k^i = 𝒞 θ_k^i + 𝒟 (v_k^i − v_{k−1}^i),  ∀i = 1, 2, . . . , p.               (23)

Since (A, C) is observable, there exists an L that makes A − LCA + LC Schur, and (A − LCA + LC)^k in (21) converges exponentially to zero. Therefore, x̃_k approaches

    x̃_k = Γ_w γ_k + Γ_v ξ_k,                                                    (24)

where Γ_w = [Γ_w^1, Γ_w^2, . . . , Γ_w^n] and Γ_v = [Γ_v^1, Γ_v^2, . . . , Γ_v^p] are system-dependent matrices.

Since 𝒜 is Schur, E[γ_k^i] = 0, E[θ_k^i] = 0 and E[ξ_k^i] = 0 hold as k → ∞. These results and (24) yield that E[x̃_k] = 0 as k → ∞, and x̃_k ∼ N(0, Σ_ε), where Σ_ε is the steady-state covariance of x̃_k. Using (24), the steady-state covariance Σ_ε is obtained as

    Σ_ε = Γ_w Σ_γ Γ_w^T + Γ_v Σ_ξ Γ_v^T,                                         (25)

where Σ_γ and Σ_ξ are the steady-state covariances of γ and ξ, respectively.

The observer states x̄_k now offer an unbiased estimate of the system states, and they can be used in the controller equation (2) to construct the ODFC

    u_k = −K x̄_k = −K̄ z_k,                                                     (26)

where K̄ = KΓ, Γ = [Γ_u  Γ_y] and z_k = [α_k^T  β_k^T]^T. The optimal ODFC gain is given as K̄* = K*Γ.

IV. MODEL-FREE OUTPUT-DIFFERENCE FEEDBACK OPTIMAL CONTROL

A. Policy Evaluation

The Bellman equation (6) can be rewritten in terms of z_k as

    z_k^T P̄ z_k = E[ r(y_k, −K̄ z_k) | z_k ] − λ_K + E[ z_{k+1}^T P̄ z_{k+1} | z_k ],   (27)

where P̄ = Γ^T P Γ. The Least Square Temporal Difference (LSTD) [13] approximation P̄̂ of P̄ in (27) is given by

    vec(P̄̂) = ( Σ_{i=0}^{τ−1} z̄_i (z̄_i − z̄_{i+1})^T )^{−1} ( Σ_{i=0}^{τ−1} z̄_i (r_i − λ̂_K) ),   (28)

where r_i = r(y_i, −K̄ z_i), z̄_i = z_i ⊗ z_i is the quadratic basis and vec(P̄̂) is the vectorization of the estimated cost function matrix P̄̂. Since the average cost λ_K cannot be determined without the system dynamics, λ̂_K is used in (28), which is calculated by approximating the expectation in (4) as

    λ̂_K = (1/τ) Σ_{i=1}^{τ} r(y_i, −K̄ z_i).                                     (29)

As noted in [1], the estimation error ||P̄ − P̄̂||_F can be made sufficiently small using a large number of samples τ.

B. Policy Improvement

In this section, Q-learning is used to estimate the improved policy. First, the estimate of the cost function matrix P̄̂ from (28) and the input/output data will provide an estimate of the Q-function, which is then used to estimate the improved policy.

The Q-function, denoted as Q_K(x̂_k, u_k), represents the expected cost of using an arbitrary input u_k at the current time k and thereafter using the feedback controller u_{k+1} = −K x̂_{k+1} for times greater than or equal to k + 1. We can also define the Q-function using the Bellman equation

    Q_K(x̂_k, u_k) = E[ r(y_k, u_k) | ζ_k ] − λ_K + E[ V_K(x̂_{k+1}) | ζ_k ],      (30)

where ζ_k = [x̂_k^T, u_k^T]^T. When the current input u_k is selected as −K x̂_k, the Q-function becomes

    Q_K(x̂_k, −K x̂_k) = V_K(x̂_k).                                                (31)

Theorem 4.1: The Q-function (30) can be calculated as

    Q_K(x̂_k, u_k) = [x̂_k^T  u_k^T] G [x̂_k^T  u_k^T]^T,                          (32)

where G is the unique solution of the equation

    [A  B]^T [I; −K]^T G [I; −K] [A  B] − G + [Q̄  N; N^T  R̄] = 0.               (33)

(Here [X; Y] denotes the vertical stacking of blocks X and Y, and [X Y; Z W] a 2 × 2 block matrix.)

Proof: Let the current input u_k = −K x̂_k + η_k at time k, for some given η_k. From (1),

    x̂_{k+1} + A_K ε_k = A_K x̂_k + B η_k + κ_k,                                  (34)
where ε_k is the state estimation error and κ_k = −BK ε_k + w_k + ε_{k+1}. Using (34), the following expectation can be calculated as

    E[ x̂_{k+1}^T P x̂_{k+1} | ζ_k ]
      = x̂_k^T A^T P A x̂_k − 2 x̂_k^T A^T P B K x̂_k + 2 x̂_k^T A^T P B η_k
        + x̂_k^T K^T B^T P B K x̂_k − 2 x̂_k^T K^T B^T P B η_k
        − Tr(A_K^T P A_K Σ_ε) + Tr(K^T B^T P B K Σ_ε)
        + η_k^T B^T P B η_k + Tr(P W_w) + Tr(P Σ_ε).

Substituting u_k = −K x̂_k + η_k in the above equation yields

    E[ x̂_{k+1}^T P x̂_{k+1} | ζ_k ]
      = [x̂_k^T  u_k^T] [A^T P A  A^T P B; B^T P A  B^T P B] [x̂_k^T  u_k^T]^T
        − Tr(A_K^T P A_K Σ_ε) + Tr(K^T B^T P B K Σ_ε)
        + Tr(P W_w) + Tr(P Σ_ε).                                                 (35)

Assume that the Q-function is quadratic in x̂_k and u_k, Q_K(x̂_k, u_k) = [x̂_k^T, u_k^T] G [x̂_k^T, u_k^T]^T, and the cost function is quadratic in x̂_k, V_K(x̂_{k+1}) = x̂_{k+1}^T P x̂_{k+1}. Then, with (35) and (1), the Bellman equation (30) holds if

    [A  B]^T P [A  B] − G + [Q̄  N; N^T  R̄] = 0,                                  (36)

and the average cost λ_K is

    λ_K = Tr(Ā^T Q_x Ā Σ_ε) + Tr(K^T B^T P B K Σ_ε)
          + Tr(Q_x W_w) + 2 Tr(Q_y W_v) + Tr(P (W_w + Σ_ε))
          − Tr((A − BK)^T P (A − BK) Σ_ε).                                        (37)

For the input u_k = −K x̂_k, (32) results in

    Q_K(x̂_k, −K x̂_k) = x̂_k^T [I; −K]^T G [I; −K] x̂_k,

and comparing with (31) yields that

    P = [I; −K]^T G [I; −K].                                                      (38)

Substituting the above relation in (36) results in (33).

The matrix G from Theorem 4.1 can be partitioned as

    G = [G_xx  G_xu; G_ux  G_uu],                                                 (39)

where

    G_xx = A^T P A + Q̄ ∈ R^{n×n},
    G_xu = A^T P B + N ∈ R^{n×m},
    G_ux = B^T P A + N^T ∈ R^{m×n},
    G_uu = B^T P B + R̄ ∈ R^{m×m},                                                 (40)

can be obtained using (33). Next, the output-difference feedback Q-function is constructed by incorporating the observer from Theorem 3.3 into the state feedback Q-function (32)

    Q_K(z_k, u_k) = [z_k^T  u_k^T] [Γ^T G_xx Γ  Γ^T G_xu; G_ux Γ  G_uu] [z_k^T  u_k^T]^T
                  = [z_k^T  u_k^T] [Ḡ_xx  Ḡ_xu; Ḡ_ux  Ḡ_uu] [z_k^T  u_k^T]^T
                  = ψ_k^T Ḡ ψ_k,                                                   (41)

where ψ_k = [z_k^T, u_k^T]^T. Similarly, the observer is incorporated into the Bellman equation (30) to obtain

    Q_K(z_k, u_k) = E[ r(y_k, u_k) | ψ_k ] − λ_K + E[ z_{k+1}^T P̄ z_{k+1} | ψ_k ].    (42)

Adding and subtracting z_{k+1}^T P̄ z_{k+1} from the above equation and employing (41) results in

    ψ_k^T Ḡ ψ_k = E[ r(y_k, u_k) | ψ_k ] − λ_K + n + z_{k+1}^T P̄ z_{k+1},           (43)

where

    n = E[ z_{k+1}^T P̄ z_{k+1} | ψ_k ] − z_{k+1}^T P̄ z_{k+1}.                      (44)

As described in [1], the LSTD estimate Ḡ̂ of Ḡ in (43) can be obtained as

    vec(Ḡ̂) = ( Σ_{i=1}^{τ'} ψ̄_i ψ̄_i^T )^{−1} ( Σ_{i=1}^{τ'} ψ̄_i c_i ),             (45)

where ψ̄_i = ψ_i ⊗ ψ_i and

    c_i = r(y_k, u_k) − λ̂_K + z_{k+1}^T P̄̂ z_{k+1},                                 (46)

where P̄̂ is the estimate of the cost function matrix obtained using (28). The error ||Ḡ − Ḡ̂||_F can be made sufficiently small using a large number of samples τ' and τ [1].

The iterative algorithm to estimate the optimal ODFC using input and output measurements is presented in Algorithm 1. It consists of two steps in each iteration: policy evaluation and policy update. The policy evaluation step computes the cost of the current controller using (28). The policy update step calculates the Q-function of the current controller using (45) and calculates the updated controller K̄^{i+1} using average Q-learning [15]. If the estimation errors ||P̄ − P̄̂||_F and ||Ḡ − Ḡ̂||_F are small, then the updated controller gain K̄^{i+1} is stabilizing (Theorem 4.2) and approaches the optimal controller gain K̄* (Theorem 4.3).

The policy update step in Algorithm 1 calculates the Q-function of the current controller and, following the average Q-learning procedure in [15], the improved control input u_k is selected to minimize the average of all the previously estimated Q-functions

    u_k = arg min_{u_k} (1/i^ρ) Σ_{j=1}^{i} Q̂^j(z_k, u_k)
        = −(1/i^ρ) Σ_{j=1}^{i} (Ḡ̂_uu^j)^{−1} (Ḡ̂_xu^j)^T z_k,                       (47)

where i is the current iteration number, ρ ≈ 1 is the rate factor, which is a design parameter, and

    Q̂^j(z_k, u_k) = ψ_k^T Ḡ̂^j ψ_k = [z_k^T  u_k^T] [Ḡ̂_xx^j  Ḡ̂_xu^j; Ḡ̂_ux^j  Ḡ̂_uu^j] [z_k^T  u_k^T]^T,   (48)

is the estimated Q-function corresponding to the controller gain K̄^j. The expression for the improved control input u_k
in (47) can be represented as u_k = −K̄^{i+1} z_k, where the value of the improved controller gain K̄^{i+1} is

    K̄^{i+1} = (1/i^ρ) Σ_{j=1}^{i} (Ḡ̂_uu^j)^{−1} (Ḡ̂_xu^j)^T.                        (49)

Algorithm 1: Model-Free Optimal ODFC Design
  Data: Initial stabilizing controller gain K̄^0, number of samples τ for evaluating (28), number of samples τ' for evaluating (45), convergence time length s, exploration signal covariance W_η.
  Result: Optimal controller gain K̄*.
  for i = 0, 1, . . . , N do
    Policy Evaluation: Execute u_k = −K̄^i z_k for τ time samples and estimate λ̂^i and P̄̂^i using (29) and (28), respectively.
    Policy Update: Z = CollectData(K̄^i, τ', s, W_η).
      Compute Ḡ̂^i from Z and P̄̂^i using (45).
      Obtain the updated policy K̄^{i+1} using (49).
  end

Algorithm 2: CollectData routine
  Data: Controller gain K̄, number of samples τ' for evaluating (45), convergence time length s, exploration signal covariance W_η.
  Result: Set Z of τ' data samples (z_{k+s}, u_{k+s}, z_{k+s+1}).
  k = 0, Z = {}.
  for i = 1, . . . , τ' do
    Execute the controller u_k = −K̄ z_k for s time samples.
    Sample η ∼ N(0, W_η) and execute u_{k+s} = −K̄ z_{k+s} + η.
    Add (z_{k+s}, u_{k+s}, z_{k+s+1}) to Z.
    k ← k + s + 1.
  end

To gather appropriate data to evaluate (45), the CollectData routine shown in Algorithm 2 is used, which ensures that the data samples are independent of each other, as required to minimize the estimation error [1]. Algorithm 2 implements the controller u_k = −K̄ z_k for s time samples and stores z_{k+s}. The sample size s is selected to be large enough so that the system states converge to the steady-state distribution, and the influence of any input prior to time k becomes negligible [1]. Then, the control input u_{k+s} = −K̄ z_{k+s} + η is executed, where η ∼ N(0, W_η) is the exploration signal, the next state z_{k+s+1} is saved, and the set (z_{k+s}, u_{k+s}, z_{k+s+1}) is added to the dataset Z. This process is repeated to collect τ' data samples.

Theorem 4.2: Assume that the estimation errors in (28) and (45) are small enough. Then, Algorithm 1 produces stabilizing policy gains K̄^i, i = 1, . . . , N.

Proof: Only a sketch of the proof is presented here for brevity. First introduce

    Ĝ(z_k, K̄, i) = (1/i^ρ) Σ_{j=1}^{i} Q̂^j(z_k, −K̄ z_k),                            (50)

where K̄ is some stabilizing ODFC gain. For a controller gain K̄^i obtained during the (i − 1)-th iteration of Algorithm 1, (50) yields

    Ĝ(z_k, K̄^i, i) = (1 − 1/i) Ĝ(z_k, K̄^i, i − 1) + (1/i^ρ) Q̂^i(z_k, −K̄^i z_k).     (51)

As K̄^{i+1} is obtained from

    K̄^{i+1} = arg min_{K̄} Ĝ(z_k, K̄, i),

we get

    Q̂^i(z_k, −K̄^{i+1} z_k) ≤ Q̂^i(z_k, −K̄^i z_k).                                    (52)

Let ε_1 be the estimation error of Q^i(z_k, −K̄^{i+1} z_k), and ε_3 be the estimation error of Q^i(z_k, −K̄^i z_k). From (52), there is an ε_2 ≥ 0 such that

    Q^i(z_k, −K̄^{i+1} z_k) ≤ Q^i(z_k, −K̄^i z_k) + ε_1 − ε_2 + ε_3.

From the observer equation in Theorem 3.3, it is obtained that

    Q^i(x̂_k, −K^{i+1} x̂_k) ≤ Q^i(x̂_k, −K^i x̂_k) + ε_1 − ε_2 + ε_3.

Furthermore, (31) yields

    Q^i(x̂_k, −K^{i+1} x̂_k) ≤ x̂_k^T P^i x̂_k + ε_1 − ε_2 + ε_3.                       (53)

Using (32), (39) and (40) one can write the Q-function as

    Q^i(x̂_k, −K^{i+1} x̂_k) = x̂_k^T [ (A − BK^{i+1})^T P^i (A − BK^{i+1}) + Q̄
                              + (K^{i+1})^T R̄ K^{i+1} − (K^{i+1})^T N^T − N K^{i+1} ] x̂_k.

Finally, the above equation and (53) are combined to obtain

    x̂_k^T [ (A − BK^{i+1})^T P^i (A − BK^{i+1}) − P^i ] x̂_k ≤
        − x̂_k^T [ (Ā − BK^{i+1})^T Q_x (Ā − BK^{i+1}) + (K^{i+1})^T R K^{i+1} ] x̂_k
        + ε_1 − ε_2 + ε_3.                                                            (54)

Then, by the Lyapunov stability theorem and for sufficiently small estimation errors ε_1 and ε_3, K^{i+1} is stabilizing.

Theorem 4.3: Assume that the estimation errors in (28) and (45) are small enough. Then the following hold:
1) P̄* ≤ P̄^{i+1} ≤ P̄^i,
2) lim_{i→∞} P̄^i = P̄*, lim_{i→∞} K̄^i = K̄*.

The proof of the above theorem follows the same procedure as the proof of Theorem 3 in [15], and so it is omitted.

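A minimal sketch of the data-collection step in Algorithm 2 is given below. It is not the authors' implementation: the closed-loop update of z_k is replaced by a stand-in linear map step_env, and the dimensions, gain and covariances are placeholders; in practice z_{k+1} would come from applying u_k to the plant and updating the filters (16)-(17).

import numpy as np

rng = np.random.default_rng(1)
nz, m = 4, 1          # dimensions of z_k and u_k (placeholders)

# Stand-in closed-loop map: a stable linear update with small noise.
F = np.array([[0.5, 0.1, 0.0, 0.0],
              [0.0, 0.4, 0.1, 0.0],
              [0.0, 0.0, 0.3, 0.1],
              [0.0, 0.0, 0.0, 0.2]])
Gz = np.array([[0.0], [0.0], [0.0], [1.0]])

def step_env(z, u):
    """Stand-in for one step of the true stochastic closed loop."""
    return F @ z + Gz @ u + 0.01 * rng.standard_normal((nz, 1))

def collect_data(K_bar, tau_prime, s, W_eta):
    """Algorithm 2 (CollectData): gather tau_prime approximately independent
    samples (z_{k+s}, u_{k+s}, z_{k+s+1})."""
    Z = []
    z = np.zeros((nz, 1))
    for _ in range(tau_prime):
        # Run u_k = -K_bar z_k for s steps so z settles near its stationary distribution
        for _ in range(s):
            z = step_env(z, -K_bar @ z)
        # Apply an exploratory input and record the transition
        eta = rng.multivariate_normal(np.zeros(m), W_eta).reshape(m, 1)
        u = -K_bar @ z + eta
        z_next = step_env(z, u)
        Z.append((z, u, z_next))
        z = z_next
    return Z

K_bar0 = np.zeros((m, nz))           # placeholder gain for the stand-in loop
data = collect_data(K_bar0, tau_prime=200, s=20, W_eta=0.1 * np.eye(m))
print(len(data), "samples collected")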
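The estimation steps of Algorithm 1 amount to two least-squares problems. The following sketch, again an illustration under assumed inputs rather than the paper's code, computes λ̂_K and P̄̂ from a logged closed-loop trajectory via (29) and (28), then computes Ḡ̂ from exploratory transitions via (45)-(46) and returns a gain from its blocks; the averaging over past iterations in (47) and (49) is omitted for brevity.

import numpy as np

def policy_evaluation(z_traj, r_traj):
    """LSTD estimate (28)-(29): average cost and cost matrix P_bar for the current
    gain, from one closed-loop run z_0, z_1, ... with rewards r_i = r(y_i, -K_bar z_i)."""
    lam_hat = float(np.mean(r_traj))                        # (29)
    nz = z_traj[0].size
    M = np.zeros((nz * nz, nz * nz))
    b = np.zeros((nz * nz, 1))
    for z, z_next, r in zip(z_traj[:-1], z_traj[1:], r_traj):
        zb = np.kron(z, z)                                  # quadratic basis z_i (x) z_i
        M += zb @ (zb - np.kron(z_next, z_next)).T
        b += zb * (r - lam_hat)
    P_bar = np.linalg.lstsq(M, b, rcond=None)[0].reshape(nz, nz)
    return lam_hat, 0.5 * (P_bar + P_bar.T)                 # symmetrized estimate

def policy_update(samples, rewards, P_bar, lam_hat, nz, m):
    """LSTD estimate (45)-(46) of G_bar and a gain from its blocks.
    samples: list of (z_k, u_k, z_{k+1}) tuples from the CollectData routine."""
    d = nz + m
    M = np.zeros((d * d, d * d))
    b = np.zeros((d * d, 1))
    for (z, u, z_next), r in zip(samples, rewards):
        psi = np.vstack([z, u])
        psib = np.kron(psi, psi)
        M += psib @ psib.T
        c = r - lam_hat + (z_next.T @ P_bar @ z_next).item()   # c_i, eq. (46)
        b += psib * c
    G_bar = np.linalg.lstsq(M, b, rcond=None)[0].reshape(d, d)
    G_bar = 0.5 * (G_bar + G_bar.T)
    G_uu = G_bar[nz:, nz:]
    G_xu = G_bar[:nz, nz:]
    # One-term version of the update (49); the control is applied as u = -K_bar_new z.
    K_bar_new = np.linalg.solve(G_uu, G_xu.T)
    return K_bar_new, G_bar

Repeating policy_evaluation and policy_update on fresh data, and averaging the gains over iterations as in (49), corresponds to one run of Algorithm 1.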
V. NUMERICAL SIMULATION

The proposed model-free ODFC method is numerically evaluated for the control of a pendulum system with process and measurement noise,

    x_{k+1} = [1.1  0.1; 1.0  0.95] x_k + [0; 1] u_k + w_k,
    y_k = [1  0] x_k + v_k,

where w_k ∼ N([10  5]^T, I) and v_k ∼ N(5, 0.1). The first and second states correspond to the angular position and velocity of the pendulum, respectively. The input u of the system is the torque actuation.

In order to evaluate our algorithm, a series of 100 experiments were carried out. Each experiment comprised a run of Algorithm 1 with N = 50. An initial stabilizing non-optimal controller gain was chosen as K = [−3.0614  1.9400  −2.7496  14.2954]. Out of the 100 experiments, the ones with stabilizing policies were collected to evaluate the convergence of the algorithm to the optimal values. The mean of the relative estimation error (MRE) of the ODFC gain, ||K̄* − K̄^i||_F / ||K̄*||_F, and of the ODFC cost function matrix, ||P̄* − P̄^i||_F / ||P̄*||_F, were used as the evaluation criteria. The number of data samples was chosen as τ = 1000, τ' = 1000, and the rate factor was chosen as ρ = 1.15. The mean relative errors for Algorithm 2 are compared with the mean relative errors of Algorithm 4 of [15] in Fig. 1. Due to noise bias, the relative error in the controller gain estimate does not converge to zero for Algorithm 4 of [15], as shown in Fig. 1. On the other hand, noise bias does not affect the convergence of Algorithm 2. The impact of the rate factor ρ on the convergence of the trained estimates is also seen in Fig. 1.

Fig. 1. Top Left: MRE in the ODFC and OFC gain estimate K̄^i; Top Right: MRE in the ODFC and OFC cost function matrix estimate P̄^i. Shaded regions display quartiles; Bottom Left: MRE in the ODFC gain estimate K̄^i for different values of ρ; Bottom Right: MRE in the ODFC cost function matrix estimate P̄^i for different values of ρ.

VI. CONCLUSION

In this paper, we presented a novel approach to designing ODFCs for stochastic systems using a model-free RL algorithm. The proposed algorithm builds upon current RL-based optimal control design methods by focusing on the design and convergence of ODFCs, which offers enhanced robustness to biased noise disturbances. The key contributions of this research are the relaxation of the training and control data requirement for stochastic systems to output measurements, and the training of an ODFC that is robust to bias in the process and measurement noise. Simulation results demonstrate the effectiveness of the proposed scheme.

REFERENCES

[1] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvari, "Model-free linear quadratic control via reduction to expert prediction," in Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, vol. 89, Apr. 2019, pp. 3108–3117.
[2] K. J. Åström and B. Wittenmark, Adaptive Control. Addison-Wesley, 1995.
[3] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA, USA: Athena Scientific, 1996.
[4] T. Bian and Z.-P. Jiang, "Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design," Automatica, vol. 71, pp. 348–360, 2016.
[5] H. Kokame, K. Hirata, K. Konishi, and T. Mori, "Difference feedback can stabilize uncertain steady states," IEEE Transactions on Automatic Control, vol. 46, no. 12, pp. 1908–1913, Dec. 2001.
[6] H. Kokame, K. Hirata, K. Konishi, and T. Mori, "State difference feedback for stabilizing uncertain steady states of non-linear systems," International Journal of Control, vol. 74, no. 6, pp. 537–546, 2001.
[7] F. L. Lewis and V. L. Syrmos, Optimal Control. John Wiley & Sons, 1995.
[8] F. L. Lewis and K. G. Vamvoudakis, "Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 1, pp. 14–25, 2011.
[9] M. Pai, Energy Function Analysis for Power System Stability. Amsterdam: Kluwer, 1989.
[10] S. A. A. Rizvi and Z. Lin, "Output feedback reinforcement Q-learning control for the discrete-time linear quadratic regulator problem," in 2017 IEEE 56th Annual Conference on Decision and Control (CDC), 2017, pp. 1311–1316.
[11] S. A. A. Rizvi and Z. Lin, "Reinforcement learning-based linear quadratic regulation of continuous-time systems using dynamic output feedback," IEEE Transactions on Cybernetics, vol. 50, no. 11, pp. 4670–4679, 2020.
[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[13] J. N. Tsitsiklis and B. Van Roy, "Average cost temporal-difference learning," Automatica, vol. 35, no. 11, pp. 1799–1808, 1999.
[14] H.-N. Wu and B. Luo, "Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 12, pp. 1884–1895, 2012.
[15] F. A. Yaghmaie, F. Gustafsson, and L. Ljung, "Linear quadratic control using model-free reinforcement learning," IEEE Transactions on Automatic Control, vol. 68, no. 2, pp. 737–752, 2023.
[16] S. Y. Yoon, Z. Lin, and P. Allaire, Control of Surge in Centrifugal Compressors by Active Magnetic Bearings: Theory and Implementation. London: Springer, 2012.
[17] M. H. Zaheer, S. Y. Yoon, and S. A. A. Rizvi, "Derivative feedback control using reinforcement learning," in Proceedings of the 2023 IEEE Conference on Decision and Control, 2023, to appear.
