Reinforcement Learning For Optimal Feedback Control
Rushikesh Kamalapurkar
Patrick Walters · Joel Rosenfeld
Warren Dixon
A Lyapunov-Based Approach
Communications and Control Engineering
Series editors
Alberto Isidori, Roma, Italy
Jan H. van Schuppen, Amsterdam, The Netherlands
Eduardo D. Sontag, Boston, USA
Miroslav Krstic, La Jolla, USA
Communications and Control Engineering is a high-level academic monograph
series publishing research in control and systems theory, control engineering and
communications. It has worldwide distribution to engineers, researchers, educators
(several of the titles in this series find use as advanced textbooks although that is not
their primary purpose), and libraries.
The series reflects the major technological and mathematical advances that have
a great impact in the fields of communication and control. The range of areas to
which control and systems theory is applied is broadening rapidly with particular
growth being noticeable in the fields of finance and biologically-inspired control.
Books in this series generally pull together many related research threads in more
mature areas of the subject than the highly-specialised volumes of Lecture Notes in
Control and Information Sciences. This series’s mathematical and control-theoretic
emphasis is complemented by Advances in Industrial Control which provides a
much more applied, engineering-oriented outlook.
Publishing Ethics: Researchers should conduct their research from research
proposal to publication in line with best practices and codes of conduct of relevant
professional bodies and/or national and international regulatory bodies. For more
details on individual ethics matters please see:
https://www.springer.com/gp/authors-editors/journal-author/journal-author-help-
desk/publishing-ethics/14214.
Rushikesh Kamalapurkar
Mechanical and Aerospace Engineering
Oklahoma State University
Stillwater, OK, USA

Joel Rosenfeld
Electrical Engineering
Vanderbilt University
Nashville, TN, USA
MATLAB® and Simulink® are registered trademarks of The MathWorks, Inc., 1 Apple Hill Drive,
Natick, MA 01760-2098, USA, http://www.mathworks.com.
Mathematics Subject Classification (2010): 49-XX, 34-XX, 46-XX, 65-XX, 68-XX, 90-XX, 91-XX,
93-XX
This Springer imprint is published by the registered company Springer International Publishing AG
part of Springer Nature
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my nurturing grandmother, Mangala
Vasant Kamalapurkar.
—Rushikesh Kamalapurkar
Preface

Making the best possible decision according to some desired set of criteria is always
difficult. Such decisions are even more difficult when there are time constraints and
can be impossible when there is uncertainty in the system model. Yet, the ability to
make such decisions can enable higher levels of autonomy in robotic systems and,
as a result, have dramatic impacts on society. Given this motivation, various
mathematical theories have been developed related to concepts such as optimality,
feedback control, and adaptation/learning. This book describes how such theories
can be used to develop optimal (i.e., the best possible) controllers/policies (i.e., the
decision) for a particular class of problems. Specifically, this book is focused on the
development of concurrent, real-time learning and execution of approximate opti-
mal policies for infinite-horizon optimal control problems for continuous-time
deterministic uncertain nonlinear systems.
The developed approximate optimal controllers are based on reinforcement
learning-based solutions, where learning occurs through an actor–critic-based
reward system. Detailed attention to control-theoretic concerns such as convergence
and stability differentiates this book from the large body of existing literature on
reinforcement learning. Moreover, both model-free and model-based methods are
developed. The model-based methods are motivated by the idea that a system can
be controlled better as more knowledge is available about the system. To account
for the uncertainty in the model, typical actor–critic reinforcement learning is
augmented with unique model identification methods. The optimal policies in this
book are derived from dynamic programming methods; hence, they suffer from the
curse of dimensionality. To address the computational demands of such an
approach, a unique function approximation strategy is provided to significantly
reduce the number of required kernels along with parallel learning through novel
state extrapolation strategies.
The material is intended for readers who have a basic understanding of nonlinear analysis tools such as Lyapunov-based methods. The development and results may help to support educators, practitioners, and researchers with nonlinear systems/control, optimal control, and intelligent/adaptive control interests working in aerospace engineering, computer science, electrical engineering, industrial engineering, and mechanical engineering.
1 Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 The Bolza Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.1 Necessary Conditions for Optimality . . . . . . . . . . . . . . . . 3
1.4.2 Sufficient Conditions for Optimality . . . . . . . . . . . . . . . . 5
1.5 The Unconstrained Affine-Quadratic Regulator . . . . . . . . . . . . . . . 5
1.6 Input Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Connections with Pontryagin’s Maximum Principle . . . . . . . . . . . 9
1.8 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8.1 Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8.2 Differential Games and Equilibrium Solutions . . . . . . . . . 11
1.8.3 Viscosity Solutions and State Constraints . . . . . . . . . . . . 12
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Approximate Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Exact Dynamic Programming in Continuous Time and Space . . . . 17
2.2.1 Exact Policy Iteration: Differential and Integral Methods . . . 18
2.2.2 Value Iteration and Associated Challenges . . . . . . . . . . . . 22
2.3 Approximate Dynamic Programming in Continuous Time and Space . . . 22
2.3.1 Some Remarks on Function Approximation . . . . . . . . . . . 23
2.3.2 Approximate Policy Iteration . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Development of Actor-Critic Methods . . . . . . . . . . . . . . . 25
2.3.4 Actor-Critic Methods in Continuous Time and Space . . . . 26
2.4 Optimal Control and Lyapunov Stability . . . . . . . . . . . . . . . . . . . 26
6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.2 Station-Keeping of a Marine Craft . . . . . . . . . . . . . . . . . . . . . . . . 196
6.2.1 Vehicle Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.2.2 System Identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.2.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
6.2.4 Approximate Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.2.5 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.2.6 Experimental Validation . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.3 Online Optimal Control for Path-Following . . . . . . . . . . . . . . . . . 213
6.3.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.3.2 Optimal Control and Approximate Solution . . . . . . . . . . . 215
6.3.3 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.3.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.3.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
6.4 Background and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 223
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7 Computational Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.2 Reproducing Kernel Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . . . 230
7.3 StaF: A Local Approximation Method . . . . . . . . . . . . . . . . . . . . . 232
7.3.1 The StaF Problem Statement . . . . . . . . . . . . . . . . . . . . . . 232
7.3.2 Feasibility of the StaF Approximation and the Ideal Weight Functions . . . . 233
7.3.3 Explicit Bound for the Exponential Kernel . . . . . . . . . . . 235
7.3.4 The Gradient Chase Theorem . . . . . . . . . . . . . . . . . . . . . 237
7.3.5 Simulation for the Gradient Chase Theorem . . . . . . . . . . 240
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning . . . . 242
7.4.1 StaF Kernel Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.4.2 StaF Kernel Functions for Online Approximate Optimal Control . . . . 243
7.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.4.4 Extension to Systems with Uncertain Drift Dynamics . . . . 252
7.4.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
7.5 Background and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 260
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Appendix A: Supplementary Lemmas and Definitions . . . . . . . . . . . . . . . 265
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
Symbols
Lists of abbreviations and symbols used in definitions, lemmas, theorems, and the
development in the subsequent chapters.
Chapter 1
Optimal Control

1.1 Introduction
The ability to learn behaviors from interactions with the environment is a desirable
characteristic of a cognitive agent. Typical interactions between an agent and its
environment can be described in terms of actions, states, and rewards (or penalties).
Actions executed by the agent affect the state of the system (i.e., the agent and the
environment), and the agent is presented with a reward (or a penalty). Assuming that
the agent chooses an action based on the state of the system, the behavior (or the
policy) of the agent can be described as a map from the state-space to the action-space.
Desired behaviors can be learned by adjusting the agent-environment interaction
through the rewards/penalties. Typically, the rewards/penalties are qualified by a cost.
For example, in many applications, the correctness of a policy is often quantified in
terms of the Lagrange cost and the Mayer cost. The Lagrange cost is the cumulative
penalty accumulated along a path traversed by the agent and the Mayer cost is the
penalty at the boundary. Policies with lower total cost are considered better and
policies that minimize the total cost are considered optimal. The problem of finding
the optimal policy that minimizes the total Lagrange and Mayer cost is known as the
Bolza optimal control problem.
1.2 Notation
Throughout the book, unless otherwise specified, the domain of all the functions is
assumed to be R≥0 . Function names corresponding to state and control trajectories are
reused to denote elements in the range of the function. For example, the notation u (·)
is used to denote the function u : R≥t0 → Rm , the notation u is used to denote an arbi-
trary element of Rm , and the notation u (t) is used to denote the value of the function
u (·) evaluated at time t. Unless otherwise specified, all the mathematical quanti-
ties are assumed to be time-varying, an equation of the form g (x) = f + h (y, t)
is interpreted as g (x (t)) = f (t) + h (y (t) , t) for all t ∈ R≥0 , and a definition of
the form g(x, y) ≜ f(y) + h(x) for functions g : A × B → C, f : B → C, and h : A → C is interpreted as g(x, y) = f(y) + h(x) for all x ∈ A and y ∈ B.

1.3 The Bolza Problem

Consider a dynamical system described by the differential equation

ẋ(t) = f(x(t), u(t), t),   (1.1)
where t0 is the initial time, x : R≥t0 → Rn denotes the system state and u : R≥t0 →
U ⊂ Rm denotes the control input, and U denotes the action-space.
To ensure local existence and uniqueness of Carathéodory solutions to (1.1), it is
assumed that the function f : Rn × U × R≥t0 → Rn is continuous with respect to
t and u, and continuously differentiable with respect to x. Furthermore, the control
signal, u (·), is restricted to be piecewise continuous. The assumptions stated here are
sufficient but not necessary to ensure local existence and uniqueness of Carathéodory
solutions to (1.1). For further discussion on existence and uniqueness of Carathéodory
solutions, see [1, 2]. Further restrictions on the dynamical system are stated, when
necessary, in subsequent chapters.
Consider a fixed final time optimal control problem where the optimality of a
control policy is quantified in terms of a cost functional
J(t0, x0, u(·)) = ∫_{t0}^{tf} L(x(t; t0, x0, u(·)), u(t), t) dt + Φ(xf),   (1.2)

where L : Rn × U × R≥t0 → R denotes the Lagrange cost, Φ : Rn → R denotes the Mayer cost, and tf and xf ≜ x(tf) denote the final time and state, respectively. In (1.2),
the notation x (t; t0 , x0 , u (·)) is used to denote a trajectory of the system in (1.1),
evaluated at time t, under the controller u (·), starting at the initial time t0 , and with
the initial state x0. Similarly, for a given policy φ : Rn → Rm, the short notation
x (t; t0 , x0 , φ (x (·))) is used to denote a trajectory under the feedback controller
u (t) = φ (x (t; t0 , x0 , u (·))). Throughout the book, the symbol x is also used to
denote generic initial conditions in Rn . Furthermore, when the controller, the initial
time, and the initial state are understood from the context, the shorthand x (·) is used
when referring to the entire trajectory, and the shorthand x (t) is used when referring
to the state of the system at time t.
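To make the notation concrete, the following sketch numerically evaluates a Bolza cost of the form (1.2) along a trajectory generated by a fixed control signal. The scalar dynamics, costs, and control signal below are illustrative assumptions, not taken from the text.

# Sketch: numerical evaluation of a Bolza cost of the form (1.2) for hypothetical
# scalar dynamics and costs (all functions below are illustrative assumptions).
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x, u, t: -x + u              # hypothetical dynamics x' = f(x, u, t)
L = lambda x, u, t: x**2 + u**2         # hypothetical Lagrange cost
Phi = lambda x: 10.0 * x**2             # hypothetical Mayer cost
u_of_t = lambda t: np.sin(t)            # a fixed open-loop control signal

def augmented(t, z):
    # z = [state, accumulated Lagrange cost]; integrate both simultaneously
    x, _ = z
    u = u_of_t(t)
    return [f(x, u, t), L(x, u, t)]

t0, tf, x0 = 0.0, 5.0, 1.0
sol = solve_ivp(augmented, (t0, tf), [x0, 0.0], rtol=1e-8)
x_f, running_cost = sol.y[0, -1], sol.y[1, -1]
J = running_cost + Phi(x_f)             # total Bolza cost J(t0, x0, u(.))
print(J)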
The two most popular approaches to solve Bolza problems are Pontryagin’s max-
imum principle and dynamic programming. The two approaches are independent,
both conceptually and in terms of their historic development. Both the approaches
are developed on the foundation of calculus of variations, which has its origins in
Newton’s Minimal Resistance Problem dating back to 1685 and Johann Bernoulli’s
Brachistochrone problem dating back to 1696. The maximum principle was devel-
oped by the Pontryagin school at the Steklov Institute in the 1950s [3]. The devel-
opment of dynamic programming methods was simultaneously but independently
initiated by Bellman at the RAND Corporation [4]. While Pontryagin’s maximum
principle results in optimal control methods that generate optimal state and control
trajectories starting from a specific state, dynamic programming results in methods
that generate optimal policies (i.e., they determine the optimal decision to be made
at any state of the system).
Barring some comparative remarks, the rest of this monograph will focus on the
dynamic programming approach to solve Bolza problems. The interested reader is
directed to the books by Kirk [5], Bryson and Ho [6], Liberzon [7], and Vinter [8]
for an in-depth discussion of Pontryagin’s maximum principle.
1.4 Dynamic Programming

In the dynamic programming approach, instead of a single Bolza problem, the family of Bolza problems with cost functionals

J(t, x, u(·)) = ∫_t^{tf} L(x(τ; t, x, u(·)), u(τ), τ) dτ + Φ(xf)   (1.3)

is solved, where t ∈ [t0, tf], tf ∈ R≥0, and x ∈ Rn. A solution to the family of Bolza
problems in (1.3) can be characterized using the optimal cost-to-go function (i.e.,
the optimal value function) V∗ : Rn × R≥0 → R, defined as

V∗(x, t) ≜ inf_{u[t,tf]} J(t, x, u(·)),

where the notation u[t,τ], for τ ≥ t ≥ t0, denotes the controller u(·) restricted to the time interval [t, τ].
Based on the principle of optimality, consider the function V defined as

V(x, t) ≜ inf_{u[t,t+Δt]} { ∫_t^{t+Δt} L(x(τ), u(τ), τ) dτ + V∗(x(t + Δt), t + Δt) }.

Using the definition of the optimal value function,

V(x, t) = inf_{u[t,t+Δt]} { ∫_t^{t+Δt} L(x(τ), u(τ), τ) dτ + inf_{u[t+Δt,tf]} J(t + Δt, x(t + Δt), u(·)) }.
Thus, V (x, t) ≤ V ∗ (x, t), which, along with (1.6), implies V (x, t) = V ∗ (x, t).
Under the assumption that V∗ ∈ C¹(Rn × [t0, tf], R), the optimal value function can be shown to satisfy

0 = −∇t V∗(x, t) − inf_{u∈U} { L(x, u, t) + ∇x V∗ᵀ(x, t) f(x, u, t) },
for all t ∈ [t0, tf] and all x ∈ Rn, with the boundary condition V∗(x, tf) = Φ(x),
for all x ∈ Rn . In fact, the Hamilton–Jacobi–Bellman equation along with a Hamil-
tonian maximization condition completely characterize the solution to the family of
Bolza problems.
Theorem 1.2 presents necessary and sufficient conditions for a function to be the
optimal value function.
Theorem 1.2 Let V∗ ∈ C¹(Rn × [t0, tf], R) denote the optimal value function. Then, a function V : Rn × [t0, tf] → R is the optimal value function (i.e., V(x, t) = V∗(x, t) for all (x, t) ∈ Rn × [t0, tf]) if and only if:
1. V ∈ C¹(Rn × [t0, tf], R) and V satisfies the Hamilton–Jacobi–Bellman equation

0 = −∇t V(x, t) − inf_{u∈U} { L(x, u, t) + ∇x Vᵀ(x, t) f(x, u, t) },   (1.7)

for all t ∈ [t0, tf] and all x ∈ Rn, with the boundary condition V(x, tf) = Φ(x), for all x ∈ Rn.
2. For all x ∈ Rn , there exists a controller u (·), such that the function V , the con-
troller u (·), and the trajectory x (·) of (1.1) under u (·) with the initial condition
x (t0 ) = x, satisfy the equation
for all t ∈ R≥t0 . Furthermore, the Hamiltonian minimization condition in (1.8) is sat-
isfied by the controller u (t) = u ∗ (x (t)) , where the policy u ∗ : Rn → Rm is defined
as
u∗(x) = −(1/2) R⁻¹ gᵀ(x) (∇x V∗(x))ᵀ.   (1.13)
Hence, assuming that an optimal controller exists, a complete characterization of the
solution to the optimal control problem can be obtained using the Hamilton–Jacobi–
Bellman equation.
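Once a value function is available, the feedback law in (1.13) is a direct computation. The sketch below illustrates this for the linear-quadratic special case, where V∗(x) = xᵀPx and P solves an algebraic Riccati equation; the system and cost matrices are illustrative assumptions.

# Sketch: optimal feedback u*(x) = -(1/2) R^{-1} g(x)^T (grad V*(x))^T for the
# linear-quadratic special case V*(x) = x^T P x. All matrices are illustrative.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -2.0]])    # f(x) = A x (assumed example)
B = np.array([[0.0], [1.0]])                # g(x) = B (assumed example)
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)        # V*(x) = x^T P x

def u_star(x):
    grad_V = 2.0 * P @ x                    # gradient of V* at x (column form)
    return -0.5 * np.linalg.solve(R, B.T @ grad_V)

x = np.array([1.0, -0.5])
print(u_star(x))                            # equals the familiar gain -R^{-1} B^T P x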
Remark 1.3 While infinite horizon optimal control problems naturally arise in feed-
back control applications where stability is of paramount importance, path planning
applications often involve finite-horizon optimal control problems. The method of
dynamic programming has extensively been studied for finite horizon problems [12–
20], although such problems are out of the scope of this monograph.
with the boundary condition V (0) = 0. Furthermore, the optimal controller can be
expressed as the state-feedback law u (t) = u ∗ (x (t)) .
Theorem 1.7. [3, 5, 7] Let x ∗ : R≥t0 → Rn and u ∗ : R≥t0 → U denote the opti-
mal state and control trajectories corresponding to the optimal control problem in
Sect. 1.5. Then there exists a trajectory p∗ : R≥t0 → Rn such that p∗(t) ≠ 0 for some
t ∈ R≥t0 and x ∗ and p ∗ satisfy the equations
ẋ∗(t) = (∇p H(x∗(t), u∗(t), p∗(t)))ᵀ,
ṗ∗(t) = −(∇x H(x∗(t), u∗(t), p∗(t)))ᵀ,
Under further assumptions on the state and the control trajectories, and on the
functions f, g, and r , the so-called natural transversality condition limt→∞ p (t) = 0
can be obtained (cf. [30–32]). The natural transversality condition does not hold in
general for infinite horizon optimal control problems. For some illustrative coun-
terexamples and further discussion, see [30–35].
A quick comparison of Eq. (1.14) and (1.18) suggests that the optimal costate
should satisfy
p∗(t) = −(∇x V(x∗(t)))ᵀ.   (1.19)

Differentiating the Hamilton–Jacobi–Bellman equation in (1.14) with respect to x and using the Hamiltonian minimization condition yields

(f(x∗) + g(x∗)u∗)ᵀ ∇x(∇x V(x∗))ᵀ = −∇x V(x∗) (∇x f(x∗) + ∇x g(x∗)u∗) − ∇x r(x∗, u∗).
Therefore, the expression of the costate in (1.19) satisfies Theorem 1.7. The relation-
ship in (1.19) implies that the costate is the sensitivity of the optimal value function to
changes in the system state trajectory. Furthermore, the Hamiltonian maximization
conditions in (1.8) and (1.17) are equivalent. Dynamic programming and Pontrya-
gin’s maximum principle methods are therefore closely related. However, there are
a few key differences between the two methods.
The solution in (1.13) obtained using dynamic programming is a feedback law.
That is, dynamic programming can be used to generate a policy that can be used to
close the control loop. Furthermore, once the Hamilton–Jacobi–Bellman equation is
solved, the resulting feedback law is guaranteed to be optimal for any initial condi-
tion of the dynamical system. On the other hand, Pontryagin’s maximum principle
generates the optimal state, costate, and control trajectories for a given initial condi-
tion. The controller must be implemented in an open-loop manner. Furthermore, if
the initial condition changes, the optimal solution is no longer valid and the optimal
control problem needs to be solved again.
Since dynamic programming generates a feedback law, it provides much more
information than the maximum principle. However, the added benefit comes at a
heavy computational cost. To generate the optimal policy, the Hamilton–Jacobi–
Bellman partial differential equation must be solved. In general, numerical methods
to solve the Hamilton–Jacobi–Bellman equation grow exponentially in numerical
complexity with increasing dimensionality. That is, dynamic programming suffers
from the so-called Bellman’s curse of dimensionality.
One way to develop optimal controllers for general nonlinear systems is to use
numerical methods [5]. A common approach is to formulate the optimal control
problem in terms of a Hamiltonian and then to numerically solve a two point boundary
value problem for the state and co-state equations [36, 37]. Another approach is to
cast the optimal control problem as a nonlinear programming problem via direct
transcription and then solve the resulting nonlinear program [30, 38–42]. Numerical
methods are offline, do not generally guarantee stability, or optimality, and are often
open-loop. These issues motivate the desire to find an analytical solution. Developing
analytical solutions to optimal control problems for linear systems is complicated
by the need to solve an algebraic Riccati equation or a differential Riccati equation.
Developing analytical solutions for nonlinear systems is even further complicated by
the sufficient condition of solving a Hamilton–Jacobi–Bellman partial differential
equation, where an analytical solution may not exist in general. If the nonlinear
dynamics are exactly known, then the problem can be simplified at the expense of
optimality by solving an algebraic Riccati equation through feedback-linearization
methods (cf. [43–47]).
Alternatively, some investigators temporarily assume that the uncertain system
could be feedback-linearized, solve the resulting optimal control problem, and then
use adaptive/learning methods to asymptotically learn the uncertainty [48–51] (i.e.,
asymptotically converge to the optimal controller). The nonlinear optimal control
problem can also be solved using inverse optimal control [52–61] by circumvent-
ing the need to solve the Hamilton–Jacobi–Bellman equation. By finding a control
Lyapunov function, which can be shown to also be a value function, an optimal
controller can be developed that optimizes a derived cost. However, since the cost is
derived rather than specified by mission/task objectives, this approach is not explored
in this monograph. Optimal control-based algorithms such as state dependent Ric-
cati equations [62–65] and model-predictive control [66–72] have been widely uti-
lized for control of nonlinear systems. However, both state dependent Riccati equa-
tions and model-predictive control are inherently model-based. Furthermore, due
to nonuniqueness of state dependent linear factorization in state dependent Riccati
equations-based techniques, and since the optimal control problem is solved over a
small prediction horizon in model-predictive control, they generally result in subopti-
mal policies. Furthermore, model-predictive control approaches are computationally
intensive, and closed-loop stability of state dependent Riccati equations-based meth-
ods is generally impossible to establish a priori and has to be established through
extensive simulation.
provide a secure set of strategies, in the sense that none of the players have an incentive
to diverge from their equilibrium policy. Hence, Nash equilibrium has been a widely
used solution concept in differential game-based control techniques. For an in-depth
discussion on Nash equilibrium solutions to differential game problems, see Chaps. 3
and 4.
Differential game theory is also employed in multi-agent optimal control, where
each agent has its own decentralized objective and may not have access to the entire
system state. In this case, graph theoretic models of the information structure are
utilized in a differential game framework to formulate coupled Hamilton–Jacobi
equations (c.f. [77]). Since the coupled Hamilton–Jacobi equations are difficult to
solve, reinforcement learning is often employed to get an approximate solution.
Results such as [77, 78] indicate that adaptive dynamic programming can be used
to generate approximate optimal policies online for multi-agent systems. For an in-
depth discussion on the use of graph theoretic models of information structure in a
differential game framework, see Chap. 5.
Reinforcement learning in continuous time and space typically relies on a differential [82] or integral [83] formulation of the temporal difference error (called the Bellman error).
The corresponding reinforcement learning algorithms are generally designed to min-
imize the Bellman error. Since such minimization yields estimates of generalized
solutions, but not necessarily viscosity solutions, to the Hamilton–Jacobi–Bellman
equation, reinforcement learning in continuous time and space for optimal control
problems with state constraints has largely remained an open area of research.
References
21. Ge SS, Zhang J (2003) Neural-network control of nonaffine nonlinear system with zero dynam-
ics by state and output feedback. IEEE Trans Neural Netw 14(4):900–918
22. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
23. Zhang X, Zhang H, Sun Q, Luo Y (2012) Adaptive dynamic programming-based optimal
control of unknown nonaffine nonlinear discrete-time systems with proof of convergence.
Neurocomputing 91:48–55
24. Liu D, Huang Y, Wang D, Wei Q (2013) Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int J Control 86(9):1554–
1566
25. Bian T, Jiang Y, Jiang ZP (2014) Adaptive dynamic programming and optimal control of
nonlinear nonaffine systems. Automatica 50(10):2624–2632
26. Yang X, Liu D, Wei Q, Wang D (2015) Direct adaptive control for a class of discrete-time
unknown nonaffine nonlinear systems using neural networks. Int J Robust Nonlinear Control
25(12):1844–1861
27. Kiumarsi B, Kang W, Lewis FL (2016) H-∞ control of nonaffine aerial systems using off-policy
reinforcement learning. Unmanned Syst 4(1):1–10
28. Song R, Wei Q, Xiao W (2016) Off-policy neuro-optimal control for unknown complex-valued
nonlinear systems based on policy iteration. Neural Comput Appl 46(1):85–95
29. Lyashevskiy S, Meyer AU (1995) Control system analysis and design upon the Lyapunov
method. In: Proceedings of the American control conference, vol 5, pp 3219–3223
30. Fahroo F, Ross IM (2008) Pseudospectral methods for infinite-horizon nonlinear optimal con-
trol problems. J Guid Control Dyn 31(4):927–936
31. Pickenhain S (2014) Hilbert space treatment of optimal control problems with infinite horizon.
In: Bock GH, Hoang PX, Rannacher R, Schlöder PJ (eds) Modeling, simulation and optimiza-
tion of complex processes - HPSC 2012: Proceedings of the fifth international conference on
high performance scientific computing, 5–9 March 2012, Hanoi, Vietnam. Springer Interna-
tional Publishing, Cham, pp 169–182
32. Tauchnitz N (2015) The Pontryagin maximum principle for nonlinear optimal control problems
with infinite horizon. J Optim Theory Appl 167(1):27–48
33. Halkin H (1974) Necessary conditions for optimal control problems with infinite horizons.
Econometrica pp 267–272
34. Aseev SM, Kryazhimskii A (2007) The Pontryagin maximum principle and optimal economic
growth problems. Proc Steklov Inst Math 257(1):1–255
35. Aseev SM, Veliov VM (2015) Maximum principle for infinite-horizon optimal control problems
under weak regularity assumptions. Proc Steklov Inst Math 291(1):22–39
36. von Stryk O, Bulirsch R (1992) Direct and indirect methods for trajectory optimization. Ann
Oper Res 37(1):357–373
37. Betts JT (1998) Survey of numerical methods for trajectory optimization. J Guid Control Dyn
21(2):193–207
38. Hargraves CR, Paris S (1987) Direct trajectory optimization using nonlinear programming and
collocation. J Guid Control Dyn 10(4):338–342
39. Huntington GT (2007) Advancement and analysis of a Gauss pseudospectral transcription for
optimal control. Ph.D thesis, Department of Aeronautics and Astronautics, MIT
40. Rao AV, Benson DA, Darby CL, Patterson MA, Francolin C, Huntington GT (2010) Algorithm
902: GPOPS, A MATLAB software for solving multiple-phase optimal control problems using
the Gauss pseudospectral method. ACM Trans Math Softw 37(2):1–39
41. Darby CL, Hager WW, Rao AV (2011) An hp-adaptive pseudospectral method for solving
optimal control problems. Optim Control Appl Methods 32(4):476–502
42. Garg D, Hager WW, Rao AV (2011) Pseudospectral methods for solving infinite-horizon opti-
mal control problems. Automatica 47(4):829–837
43. Freeman R, Kokotovic P (1995) Optimal nonlinear controllers for feedback linearizable sys-
tems. In: Proceedings of the American control conference, pp 2722–2726
66. Garcia CE, Prett DM, Morari M (1989) Model predictive control: theory and practice - a survey.
Automatica 25(3):335–348
67. Mayne D, Michalska H (1990) Receding horizon control of nonlinear systems. IEEE Trans
Autom Control 35(7):814–824
68. Morari M, Lee J (1999) Model predictive control: past, present and future. Comput Chem Eng
23(4–5):667–682
69. Allgöwer F, Zheng A (2000) Nonlinear model predictive control, vol 26. Springer
70. Mayne D, Rawlings J, Rao C, Scokaert P (2000) Constrained model predictive control: Stability
and optimality. Automatica 36:789–814
71. Camacho EF, Bordons C (2004) Model predictive control, vol 2. Springer
72. Grüne L, Pannek J (2011) Nonlinear model predictive control. Springer
73. Isaacs R (1999) Differential games: a mathematical theory with applications to warfare and
pursuit, control and optimization. Dover books on mathematics, Dover Publications
74. Tijs S (2003) Introduction to game theory. Hindustan Book Agency
75. Basar T, Olsder GJ (1999) Dynamic noncooperative game theory, 2nd edn. Classics in applied
mathematics, SIAM
76. Nash J (1951) Non-cooperative games. Ann Math 2:286–295
77. Vamvoudakis KG, Lewis FL (2011) Policy iteration algorithm for distributed networks and
graphical games. In: Proceedings of the IEEE conference decision control European control
conference, pp 128–135
78. Vamvoudakis KG, Lewis FL, Hudas GR (2012) Multi-agent differential graphical games: online
adaptive learning solution for synchronization with optimality. Automatica 48(8):1598–1611
79. Dolcetta IC (1983) On a discrete approximation of the Hamilton-Jacobi equation of dynamic
programming. Appl Math Optim 10(1):367–377
80. Sethian JA (1999) Level set methods and fast marching methods: evolving interfaces in com-
putational geometry, fluid mechanics, computer vision, and materials science. Cambridge Uni-
versity Press
81. Osher S, Fedkiw R (2006) Level set methods and dynamic implicit surfaces, vol 153. Springer
Science & Business Media
82. Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput
12(1):219–245
83. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
Chapter 2
Approximate Dynamic Programming
2.1 Introduction
A unifying characteristic of dynamic programming-based methods is the use of value functions. A state value function is a map from the state space to the real numbers that assigns each state a value (i.e., the total optimal cost-to-go when the system is started in that state). An action
value function (generally referred to as the Q−function) is a map from the Carte-
sian product of the state space and the action-space to positive real numbers. The
Q−function assigns each state-action pair, (s, a), a value (i.e., the total optimal cost
when the action a is performed in the state s, and the optimal policy is followed
thereafter). Another unifying characteristic of dynamic programming based meth-
ods is the interaction of policy evaluation and policy improvement. Policy evaluation
(also referred to as the prediction problem) refers to the problem of finding the (state
or action) value function for a given arbitrary policy. Policy improvement refers to
the problem of construction of a new policy that improves the original policy. The
family of approximate optimal control methods that can be viewed as an interaction
between policy evaluation and policy improvement is referred to as generalized pol-
icy iteration. Almost all dynamic programming-based approximate optimal control
methods can be described as generalized policy iteration [8].
For the Bolza problem in Sect. 1.5, policy evaluation amounts to finding a solution to the generalized Hamilton–Jacobi–Bellman equation (first introduced in [15])

∇x V(x) (f(x) + g(x) φ(x)) + r(x, φ(x)) = 0,   (2.1)

for a fixed admissible policy φ : Rn → Rm, and policy improvement amounts to computing an improved policy for a fixed value function V : Rn → R≥0. Since the system dynamics are affine in control, the policy improvement step reduces to the simple assignment

φ(x) = −(1/2) R⁻¹ gᵀ(x) ∇x Vᵀ(x).
The policy iteration algorithm, also known as the successive approximation algorithm,
alternates between policy improvement and policy evaluation. The policy iteration
algorithm was first developed by Bellman in [16], and a policy improvement theorem
was provided by Howard in [17]. In Algorithm 2.1, a version of the policy iteration
algorithm (cf. [15]) is presented for systems with continuous state space, where φ (0) :
Rn → Rm denotes an initial admissible policy (i.e., a policy that results in a finite
cost, starting from any initial condition), and V (i) : Rn → R≥0 and φ (i) : Rn → Rm
denote the value function and the policy obtained in the i th iteration. Provided the
initial policy is admissible, policy iteration generates a sequence of policies and
value functions that asymptotically approach the optimal policy and the optimal
value function. Furthermore, each policy in the sequence is at least as good as the
previous policy, which also implies that each policy in the sequence is admissible.
For a proof of convergence, see [18].
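To make the policy iteration loop concrete, the sketch below specializes it to a linear system with quadratic cost, where policy evaluation (the generalized Hamilton–Jacobi–Bellman equation for a fixed linear policy) reduces to a Lyapunov equation and policy improvement reduces to a gain update; the matrices and the initial stabilizing gain are illustrative assumptions.

# Sketch: policy iteration for xdot = A x + B u with cost integrand x^T Q x + u^T R u.
# Policy evaluation solves a Lyapunov equation (the generalized HJB for a linear
# policy); policy improvement sets u = -R^{-1} B^T P x. Matrices and the initial
# stabilizing gain are illustrative assumptions.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

K = np.array([[3.0, 3.0]])                   # initial admissible (stabilizing) policy gain
for i in range(50):
    Acl = A - B @ K
    # policy evaluation: Acl^T P + P Acl + Q + K^T R K = 0
    P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
    K_new = np.linalg.solve(R, B.T @ P)      # policy improvement
    if np.linalg.norm(K_new - K) < 1e-10:
        break
    K = K_new
print(P)   # approaches the algebraic Riccati equation solution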
Policy evaluation can also be performed without knowledge of the drift dynamics by using the integral form of the generalized Hamilton–Jacobi–Bellman equation, which, for a constant T ∈ R>0, reads

V(x) = ∫_τ^{τ+T} r(x(t), φ(x(t))) dt + V(x(τ + T)),   (2.2)
where the shorthand x (t) is utilized to denote x (t; τ, x, φ (x (·))), that is, a trajectory
of the system in (1.9) under the feedback controller u = φ (x) such that x (τ ) = x.
Note that since the dynamics, the policy, and the cost function are time-independent,
to establish Ṽ as a solution to the integral generalized Hamilton–Jacobi–Bellman
equation, it is sufficient to check (2.2) for τ = t0 (or for any other arbitrary value of
τ). Algorithm 2.2, first developed in [19], details a technique that utilizes the integral generalized Hamilton–Jacobi–Bellman equation to implement policy iteration without knowledge of the drift dynamics, f. In Algorithm 2.2, the shorthand x^(i)(t) is utilized to denote x(t; τ, x, φ^(i)(x(·))). The equivalence of differential and integral policy iteration is captured in the following theorem.
Proof The following proof is a slight modification of the argument presented in [19].
Under suitable smoothness assumptions and provided the policy φ is admissible, the
generalized Hamilton–Jacobi–Bellman equation in (2.1) admits a unique continu-
ously differentiable solution [15]. Let Ṽ ∈ C 1 (Rn , R) be the solution to the gener-
alized Hamilton–Jacobi–Bellman equation with the boundary condition Ṽ (0) = 0.
For an initial condition x ∈ Rn, differentiation of Ṽ(x(·)) with respect to time yields

Ṽ̇(x(t)) = ∇x Ṽ(x(t)) (f(x(t)) + g(x(t)) φ(x(t))) = −r(x(t), φ(x(t))).
Integrating the above expression over the interval [τ, τ + T ] for some τ ∈ R≥t0
yields the integral generalized Hamilton–Jacobi–Bellman equation in (2.2). Thus,
any solution to the generalized Hamilton–Jacobi–Bellman equation is also a solu-
tion to the integral generalized Hamilton–Jacobi–Bellman equation. To establish the
other direction, let Ṽ ∈ C 1 (Rn , R) be a solution to the generalized Hamilton–Jacobi–
Bellman equation and let V ∈ C 1 (Rn , R) be a different solution to the integral gener-
alized Hamilton–Jacobi–Bellman equation with the boundary conditions Ṽ (0) = 0
and V(0) = 0. Consider the time-derivative of the difference Ṽ − V:

Ṽ̇(x(t)) − V̇(x(t)) = −r(x(t), φ(x(t))) − ∇x V(x(t)) (f(x(t)) + g(x(t)) φ(x(t))).
Integrating the above expression over the interval [τ, τ + T] and using x(τ) = x,

∫_τ^{τ+T} ( Ṽ̇(x(t)) − V̇(x(t)) ) dt = −∫_τ^{τ+T} r(x(t), φ(x(t))) dt − ( V(x(τ + T)) − V(x) ).

Since V satisfies the integral generalized Hamilton–Jacobi–Bellman equation in (2.2), the right-hand side is zero; hence,

∫_τ^{τ+T} ( Ṽ̇(x(t)) − V̇(x(t)) ) dt = 0, ∀τ ∈ R≥t0.
Hence, for all x ∈ Rn and for all τ ∈ R≥t0 , Ṽ − V is a constant along the trajectory
x (t) , t ∈ [τ, τ + T ], with the initial condition x (τ ) = x. Hence, using the time-
independence of the dynamics, the policy, and the cost function, it can be concluded
that Ṽ − V is a constant along every trajectory x : R≥t0 → Rn of the system in
(1.9) under the controller u (t) = φ (x (t)). Since Ṽ (0) − V (0) = 0 it can be con-
cluded that Ṽ (x (t)) − V (x (t)) = 0, ∀t ∈ R≥t0 , provided the trajectory x (·) passes
through the origin (i.e., x(t) = 0 for some t ∈ R≥t0).
In general, a trajectory of a dynamical system need not pass through the origin. For
example, consider ẋ (t) = −x (t) , x (0) = 1. However, since the policy φ is admis-
sible, every trajectory of the system in (1.9) under the controller u (t) = φ (x (t))
asymptotically goes to zero, leading to the following claim.
Claim. Ṽ(x) = V(x), ∀x ∈ Rn.
To prove the claim, suppose, to the contrary, that Ṽ(x∗) ≠ V(x∗) for some x∗ ∈ Rn. Since Ṽ − V is constant along system trajectories, there exists ε ∈ R>0 such that

| Ṽ(x(t; t0, x∗, φ(x(·)))) − V(x(t; t0, x∗, φ(x(·)))) | > ε, ∀t ∈ R≥t0.

Since φ is admissible, lim_{t→∞} x(t; t0, x∗, φ(x(·))) = 0. Since Ṽ − V is a continuous function that vanishes at the origin, there exists T ∈ R≥t0 such that

| Ṽ(x(T; t0, x∗, φ(x(·)))) − V(x(T; t0, x∗, φ(x(·)))) | < ε,

which is a contradiction. Since x∗ ∈ Rn was arbitrary, the proof of the claim is complete.
The claim implies that the solutions to the integral generalized Hamilton–
Jacobi–Bellman equations are unique, and hence, the proof of the theorem is
complete.
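Because the drift dynamics do not appear in the integral relation (2.2), policy evaluation can also be carried out from recorded trajectory data alone. The sketch below fits the weights of a linear-in-the-parameters value function estimate by least squares over several evaluation windows; the scalar system, policy, cost, and basis are illustrative assumptions.

# Sketch: model-free policy evaluation using the integral relation (2.2),
# V(x(tau)) - V(x(tau + T)) = int_tau^{tau+T} r(x(t), phi(x(t))) dt,
# with V approximated as W^T sigma(x). The scalar system, policy, cost, and
# basis below are illustrative assumptions.
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x: -x + 0.25 * x**3            # drift (treated as unknown by the learner)
g = lambda x: 1.0
phi = lambda x: -2.0 * x                  # fixed admissible policy being evaluated
r = lambda x, u: x**2 + u**2
sigma = lambda x: np.array([x**2, x**4])  # assumed basis for the value function

def rollout(x0, T):
    # integrate the closed-loop state and the running cost over one window
    aug = lambda t, z: [f(z[0]) + g(z[0]) * phi(z[0]), r(z[0], phi(z[0]))]
    sol = solve_ivp(aug, (0.0, T), [x0, 0.0], rtol=1e-8)
    return sol.y[0, -1], sol.y[1, -1]

X, y, T = [], [], 0.2
for x0 in np.linspace(-2.0, 2.0, 40):
    xT, cost = rollout(x0, T)
    X.append(sigma(x0) - sigma(xT))       # regressor from the two endpoint states
    y.append(cost)                        # integral of the cost over the window
W = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)[0]
print(W)                                   # weights of the value function estimate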
The policy iteration algorithm and the integral policy iteration algorithm both
require an initial admissible policy. The requirement of an initial admissible pol-
icy can be circumvented using value iteration. Value iteration and its variants are
popular generalized policy iteration algorithms for discrete-time systems owing to
the simplicity of their implementation. In discrete time, value iteration algorithms
work by turning Bellman’s recurrence relation (the discrete time counterpart of
the Hamilton–Jacobi–Bellman equation) into an update rule [3, 8, 20–22]. One
example of a discrete-time value iteration algorithm is detailed in Algorithm 2.3.
In Algorithm 2.3, the system dynamics are described by the difference equation
x(k + 1) = f(x(k)) + g(x(k)) u(k), where the objective is to minimize the total cost J(x(·), u(·)) = Σ_{k=0}^{∞} r(x(k), u(k)), and V^(0) : Rn → R≥0 denotes an arbi-
trary initialization. A key strength of value iteration over policy iteration is that
the initialization V (0) does not need to be a Lyapunov function or a value func-
tion corresponding to any admissible policy. An arbitrary initialization such as
V (0) (x) = 0, ∀x ∈ Rn is acceptable. Hence, to implement value iteration, knowl-
edge of an initial admissible policy is not needed. As a result, unlike policy iteration,
the functions V (i) generated by value iteration are not guaranteed to be value func-
tions corresponding to admissible policies, and similarly, the policies φ (i) are not
guaranteed to be admissible. However, it can be shown that the sequences V (i) and
φ (i) converge to the optimal value function, V ∗ , and the optimal policy, u ∗ , respec-
tively [23–25]. An offline value iteration-like algorithm that relies on Pontryagin’s
maximum principle is developed in [26–28] where a single neural network is utilized
to approximate the relationship between the state and the costate variables. Value
iteration algorithms for continuous-time linear systems are presented in [29–32]. For
nonlinear systems, an implementation of Q-learning is presented in [33]; however,
closed-loop stability of the developed controller is not analyzed.
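As an illustration of the discrete-time value iteration idea described above, the sketch below implements a Bellman backup on a gridded scalar state space with a discretized action set; the system, cost, grids, and interpolation scheme are illustrative assumptions rather than a reproduction of Algorithm 2.3.

# Sketch: discrete-time value iteration V^{(i+1)}(x) = min_u { r(x, u) + V^{(i)}(f(x) + g(x) u) }
# on a gridded scalar state space. System, cost, grids, and interpolation are illustrative.
import numpy as np

f = lambda x: 0.9 * x + 0.1 * x**2
g = lambda x: 1.0
r = lambda x, u: x**2 + u**2

x_grid = np.linspace(-1.0, 1.0, 201)
u_grid = np.linspace(-1.0, 1.0, 41)
V = np.zeros_like(x_grid)                  # arbitrary initialization V^{(0)} = 0

for i in range(500):
    V_next = np.empty_like(V)
    for j, x in enumerate(x_grid):
        x_plus = np.clip(f(x) + g(x) * u_grid, x_grid[0], x_grid[-1])
        V_next[j] = np.min(r(x, u_grid) + np.interp(x_plus, x_grid, V))  # Bellman backup
    if np.max(np.abs(V_next - V)) < 1e-9:
        break
    V = V_next
print(V[np.argmin(np.abs(x_grid - 0.5))])  # approximate optimal cost-to-go at x = 0.5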
For systems with finite state and action-spaces, policy iteration and value iteration
are established as effective tools for optimal control synthesis. However, in con-
tinuous state-space systems, both policy iteration and value iteration suffer from the curse of dimensionality.

2.3.1 Some Remarks on Function Approximation
In this chapter and in the rest of the book, the value function is approximated using
a linear-in-the-parameters approximation scheme. The following characteristics of
the approximation scheme can be established using the Stone-Weierstrass Theorem
(see [49–51]).
Property 2.3 Let V ∈ C¹(Rn, R), let χ ⊂ Rn be compact, let ε ∈ R>0 be a constant, and let {σi ∈ C¹(Rn, R) | i ∈ N} be a set of countably many uniformly bounded basis functions (cf. [52, Definition 2.1]). Then, there exist L ∈ N, a set of basis functions {σi ∈ C¹(Rn, R) | i = 1, 2, · · · , L}, and a set of weights {wi ∈ R | i = 1, 2, · · · , L} such that sup_{x∈χ} ( |V(x) − Wᵀσ(x)| + ||∇x V(x) − Wᵀ∇x σ(x)|| ) ≤ ε, where σ ≜ [σ1 · · · σL]ᵀ and W ≜ [w1 · · · wL]ᵀ.
Property 2.3, also known as the Universal Approximation Theorem, states that a sin-
gle layer neural network can simultaneously approximate a function and its derivative
given a sufficiently large number of basis functions. Using Property 2.3, a continu-
ously differentiable function can be represented as V (x) = W T σ (x) + (x), where
: Rn → R denotes the function approximation error. The function approximation
error, along with its derivative can be made arbitrarily small by increasing the number
of basis functions used in the approximation.
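A minimal numerical illustration of the approximation scheme in Property 2.3: the weights W of a linear-in-the-parameters approximation Wᵀσ(x) are fit by least squares so that both the function and its derivative are matched on a compact set. The target function and polynomial basis are illustrative assumptions.

# Sketch: fitting a linear-in-the-parameters approximation W^T sigma(x) to a known
# function and its derivative on a compact set, then checking the sup error.
# The target function and polynomial basis are illustrative assumptions.
import numpy as np

V = lambda x: np.log(1.0 + x**2)                        # function to approximate
dV = lambda x: 2.0 * x / (1.0 + x**2)                   # its derivative
sigma = lambda x: np.array([x**2, x**4, x**6])          # basis sigma(x)
dsigma = lambda x: np.array([2*x, 4*x**3, 6*x**5])      # derivative of the basis

xs = np.linspace(-1.0, 1.0, 400)                        # compact set chi = [-1, 1]
A = np.vstack([np.array([sigma(x) for x in xs]),        # rows matching V on chi
               np.array([dsigma(x) for x in xs])])      # rows matching dV/dx on chi
b = np.concatenate([V(xs), dV(xs)])
W = np.linalg.lstsq(A, b, rcond=None)[0]

sup_err = max(np.max(np.abs(V(xs) - np.array([W @ sigma(x) for x in xs]))),
              np.max(np.abs(dV(xs) - np.array([W @ dsigma(x) for x in xs]))))
print(W, sup_err)   # the error shrinks as basis functions are added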
Similar to Algorithm 2.2, Algorithm 2.4 can be expressed in a model-free form using integration. However, the usefulness of Algorithm 2.4 (and its model-free form) is limited by the need to solve the minimization problem Ŵc^(i) = arg min_{Ŵc ∈ R^L} ∫_{x∈Rn} (δ_{φ^(i−1)}(x, Ŵc))² dx, which is often intractable due to computational and information constraints. A more useful implementation of approximate policy iteration is detailed in Algorithm 2.5 in the model-based form, where the minimization is carried out over a specific trajectory instead of the whole state space [19, 53]. In Algorithm 2.5, the set X^φ_{x0} ⊂ Rn is defined as

X^φ_{x0} ≜ { x ∈ Rn | x(t; t0, x0, φ(x(·))) = x for some t ∈ R≥t0 }.
24, 25, 49, 53]. However, the algorithm is iterative in nature, and unlike exact policy
iteration, the policies φ (i) cannot generally be shown to be stabilizing. Hence, the
approximate policy iteration algorithms, as stated, are not suitable for online learning
and online optimal feedback control. To ensure system stability during the learning
phase, a two-network approach is utilized, where in addition to the value function,
the policy, φ, is also approximated using a parametric approximation, û(x, Ŵa).
The critic learns the value of a policy by updating the weights Ŵc and the actor
improves the current policy by updating the weights Ŵa .
The actor-critic (also known as adaptive-critic) architecture is one of the most widely
used architectures to implement generalized policy iteration algorithms [1, 8, 54].
Actor-critic algorithms are pervasive in machine learning and are used to learn the
optimal policy online for finite-space discrete-time Markov decision problems [1,
3, 8, 14, 55]. The idea of learning with a critic (or a trainer) first appeared in [56,
57] where the state-space was partitioned to make the computations tractable. Critic-
based methods were further developed to learn optimal actions in sequential decision
problems in [54]. Actor-critic methods were first developed in [58] for systems
with finite state and action-spaces, and in [1] for systems with continuous state
and action-spaces using neural networks to implement the actor and the critic. An
analysis of convergence properties of actor-critic methods was presented in [47, 59]
for deterministic systems and in [14] for stochastic systems. For a detailed review of
actor-critic methods, see [60].
Several methods have been investigated to tune the actor and the critic networks
in the actor-critic methods described in the paragraph above. The actor can learn
to directly minimize the estimated cost-to-go, where the estimate of the cost-to-go
is obtained by the critic [1, 14, 55, 58, 60, 61]. The actor can also be tuned to
minimize the Bellman error (also known as the temporal-difference error) [62]. The
critic network can be tuned using the method of temporal differences [1, 2, 8, 11,
12, 14, 63] or using heuristic dynamic programming [3, 9, 20, 64–67] or its variants
[55, 68, 69].
The iterative nature of actor-critic methods makes them particularly suitable for
offline computation and for discrete-time systems, and hence, discrete-time approx-
imate optimal control has been a growing area of research over the past decade [24,
70–80]. The trajectory-based formulation in Algorithm 2.5 lends itself to an online
solution approach using asynchronous dynamic programming, where the parame-
ters are adjusted on-the-fly using input-output data. The concept of asynchronous
dynamic programming can be further exploited to apply actor-critic methods online
to continuous-time systems.
Obtaining an analytical solution to the Bolza problem is often infeasible if the system
dynamics are nonlinear. Many numerical solution techniques are available to solve
Bolza problems; however, numerical solution techniques require exact model knowl-
edge and are realized via open-loop implementation of offline solutions. Open-loop
implementations are sensitive to disturbances, changes in objectives, and changes in
the system dynamics; hence, online closed-loop solutions of optimal control prob-
lems are sought-after. Inroads to solve an optimal control problem online can be made
by looking at the value function. Under a given policy, the value function provides
a map from the state space to the set of real numbers that measures the quality of a
state. In other words, under a given policy, the value function evaluated at a given
state is the cost accumulated when starting in the given state and following the given
policy. Under general conditions, the policy that drives the system state along the
steepest negative gradient of the optimal value function turns out to be the optimal
policy; hence, online optimal control design relies on computation of the optimal
value function.
In online closed-loop approximate optimal control, the value function has an even
more important role to play. Not only does the value function provide the optimal
policy, but the value function is also a Lyapunov function that establishes global
asymptotic stability of the closed-loop system.
Theorem 2.4 Consider the affine dynamical system in (1.9). Let V ∗ : Rn → R≥0
be the optimal value function corresponding to the affine-quadratic optimal control
problem in (1.10). Assume further that f (0) = 0 and that the control effectiveness
matrix g (x) is full rank for all x ∈ Rn . Then, the closed-loop system under the
optimal controller u (t) = u ∗ (x (t)) is asymptotically stable.
Proof Since f (0) = 0, it follows that when x (t0 ) = 0, the controller that yields the
lowest cost is u (t) = 0, ∀t. Hence, V ∗ (0) = 0, and since the optimal controller is
given by u (t) = u ∗ (x (t)) = − 21 R −1 g T (x (t)) ∇x V ∗ (x (t)), and g is assumed to be
full rank, it can be concluded that ∇x V∗(0) = 0. Furthermore, if x ≠ 0, it follows that V∗(x) > 0. Hence, the function V∗ is a candidate Lyapunov function, and x = 0 is an equilibrium point of the closed-loop dynamics

ẋ(t) = f(x(t)) + g(x(t)) u∗(x(t)).

The time derivative of V∗ along the trajectories of the closed-loop dynamics satisfies

V̇∗(x) = ∇x V∗(x) (f(x) + g(x) u∗(x)) = −Q(x) − u∗ᵀ(x) R u∗(x) ≤ −Q(x).
Since the function Q is positive definite by design, [86, Theorem 4.2] can be invoked
to conclude that the equilibrium point x = 0 is asymptotically stable.
The utility of Theorem 2.4 as a tool to analyze optimal controllers is limited
because for nonlinear systems, analytical or exact numerical computation of the
optimal controller is often intractable. Hence, one often works with approximate
value functions and approximate optimal controllers. Theorem 2.4 provides a pow-
erful tool for the analysis of approximate optimal controllers because the optimal
policy is inherently robust to approximation (for an in-depth discussion regarding
robustness of the optimal policy, see [87, 88]). That is, the optimal value function can
also be used as a candidate Lyapunov function to establish practical stability (that
is, uniform ultimate boundedness) of the system in (1.9) under controllers that are
close to or asymptotically approach a neighborhood of the optimal controller. The
rest of the discussion in this chapter focuses on the methodology employed in the
rest of this monograph to generate an approximation of the optimal controller. Over
the years, many different approximate optimal control methods have been developed
for various classes of systems. For a brief discussion about alternative methods, see
Sect. 2.8.
In an approximate actor-critic-based solution, the optimal value function V∗ is replaced by a parametric estimate V̂(x, Ŵc) and the optimal policy u∗ by a parametric estimate û(x, Ŵa), where Ŵc ∈ R^L and Ŵa ∈ R^L denote vectors of estimates
of the ideal parameters.
Substituting the estimates V̂ and û for V ∗ and u ∗ in (1.14), respectively, a residual
error δ : Rn × R^L × R^L → R, called the Bellman error, is defined as

δ(x, Ŵc, Ŵa) ≜ ∇x V̂(x, Ŵc) (f(x) + g(x) û(x, Ŵa)) + r(x, û(x, Ŵa)).   (2.3)
The use of two separate sets of weight estimates Ŵa and Ŵc is motivated by the
fact that the Bellman error is linear with respect to the critic weight estimates and
nonlinear with respect to the actor weight estimates. Use of a separate set of weight
estimates for the value function facilitates least-squares-based adaptive updates.
To solve the optimal control problem, the critic aims to find a set of parameters Ŵc and the actor aims to find a set of parameters Ŵa such that δ(x, Ŵc, Ŵa) = 0 and û(x, Ŵa) = −(1/2) R⁻¹ gᵀ(x) ∇x V̂ᵀ(x, Ŵc), ∀x ∈ Rn. Since an exact basis for
value function approximation is generally not available, an approximate set of param-
eters that minimizes the Bellman error is sought. In particular, to ensure uniform
approximation of the value function and the policy over an operating domain D ⊂ Rn ,
it is desirable to find parameters that minimize the error E_s : R^L × R^L → R defined as

E_s(Ŵc, Ŵa) ≜ sup_{x∈D} | δ(x, Ŵc, Ŵa) |.

In an online implementation, the parameter estimates are updated to reduce this error while the system in (1.9) is being controlled using the control law u(t) = û(x(t), Ŵa(t)).
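The sketch below evaluates the Bellman error in (2.3) for linear-in-the-parameters critic and actor estimates on a scalar example; the dynamics, cost, and basis functions are illustrative assumptions.

# Sketch: Bellman error (2.3) for a linear-in-the-parameters critic, V_hat = Wc^T sigma(x),
# and an actor of the form u_hat = -(1/2) R^{-1} g(x)^T dsigma(x)^T Wa.
# The scalar system, cost, and basis are illustrative assumptions.
import numpy as np

f = lambda x: -x + 0.25 * x**3
g = lambda x: 1.0
Q, R = 1.0, 1.0
r = lambda x, u: Q * x**2 + R * u**2
sigma = lambda x: np.array([x**2, x**4])
dsigma = lambda x: np.array([2*x, 4*x**3])   # gradient of the basis (row form)

def u_hat(x, Wa):
    return -0.5 / R * g(x) * (dsigma(x) @ Wa)

def bellman_error(x, Wc, Wa):
    u = u_hat(x, Wa)
    return (dsigma(x) @ Wc) * (f(x) + g(x) * u) + r(x, u)   # delta(x, Wc, Wa)

Wc = np.array([0.5, 0.1])
Wa = np.array([0.5, 0.1])
print(bellman_error(1.2, Wc, Wa))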
Computation of the Bellman error in (2.3) and the integral error in (2.5) requires exact
model knowledge. Furthermore, computation of the integral error in (2.5) is generally
infeasible. Two prevalent approaches employed to render the control design robust to
uncertainties in the system drift dynamics are integral reinforcement learning (cf. [7,
19, 89–92]) and state derivative estimation (cf. [93, 94]). This section focuses on state
derivative estimation based methods. For further details on integral reinforcement
learning, see Sect. 2.2.1.
State derivative estimation-based techniques exploit the fact that if the system
model is uncertain, the critic can compute the Bellman error at each time instance
using the state-derivative ẋ (t) as
δt(t) ≜ ∇x V̂(x(t), Ŵc(t)) ẋ(t) + r(x(t), û(x(t), Ŵa(t))).   (2.6)
The critic then updates the value function weight estimates to minimize the cumulative squared Bellman error

E_t(t) ≜ ∫_0^t δt²(τ) dτ,   (2.7)
using a steepest descent update law. The use of the cumulative squared error is
motivated by the fact that in the presence of uncertainties, the Bellman error can only
be evaluated along the system trajectory; hence, E t (t) is the closest approximation
to E (t) in (2.5) that can be computed using available information.
Intuitively, for E t (t) to approximate E (t) over an operating domain, the state
trajectory x (t) needs to visit as many points in the operating domain as possible. This
intuition is formalized by the fact that the use of the approximation E t (t) to update
the critic parameter estimates is valid provided certain exploration conditions1 are
met. In reinforcement learning terms, the exploration conditions translate to the need
for the critic to gain enough experience to learn the value function. The exploration
1 The exploration conditions are detailed in the next section for a linear-in-the-parameters approxi-
mation of the value function.
conditions can be relaxed using experience replay (cf. [92]), where each evaluation
of the Bellman error δint is interpreted as gained experience, and these experiences
are stored in a history stack and are repeatedly used in the learning algorithm to
improve data efficiency; however, a finite amount of exploration is still required since
the values stored in the history stack are also constrained to the system trajectory.
Learning based on simulation of experience has also been investigated in results
such as [95–100] for stochastic model-based reinforcement learning; however, these
results solve the optimal control problem off-line in the sense that repeated learning
trials need to be performed before the algorithm learns the controller and system
stability during the learning phase is not analyzed.
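A minimal sketch of the history-stack idea behind experience replay: Bellman error data recorded along the trajectory are stored and reused at every update step. The data structure and update rule below are illustrative and do not reproduce the specific schemes in the cited references.

# Sketch: a history stack for experience replay. Regressor/target pairs recorded along
# the trajectory are stored and reused at every critic update. The data structure and
# the replayed update below are illustrative, not the exact scheme of the references.
import numpy as np

class HistoryStack:
    def __init__(self, size, dim):
        self.size = size
        self.dim = dim
        self.data = []                      # each entry: (regressor omega, target)

    def add(self, omega, target):
        if len(self.data) >= self.size:
            self.data.pop(0)                # simple FIFO; richer schemes keep the most exciting data
        self.data.append((np.asarray(omega, dtype=float), float(target)))

    def replay_gradient(self, Wc):
        # gradient of (1/2) * sum_k (omega_k^T Wc - target_k)^2 over stored experiences
        grad = np.zeros(self.dim)
        for omega, target in self.data:
            grad += omega * (omega @ Wc - target)
        return grad

stack = HistoryStack(size=20, dim=2)
stack.add([1.0, 0.5], 0.3)                  # one recorded experience (illustrative values)
Wc = np.array([0.2, 0.2])
Wc = Wc - 0.1 * stack.replay_gradient(Wc)   # replayed gradient step on the critic weights
print(Wc)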
While the estimates Ŵc (·) are being updated by the critic, the actor simultaneously
updates the parameter estimates Ŵa (·) using a gradient-based approach so that the
quantity || û(x(t), Ŵa(t)) + (1/2) R⁻¹ gᵀ(x(t)) ∇x V̂ᵀ(x(t), Ŵc(t)) || decreases.
The weight updates are performed online and in real-time while the system is being
controlled using the control law u (t) = û x (t) , Ŵa (t) . Naturally, it is difficult
to guarantee stability during the learning phase. In fact, the use of two different sets
of parameters to approximate the value function and the policy is required solely for
the purpose of maintaining stability during the learning phase.
For feasibility of analysis, the optimal value function is approximated using a linear-
in-the-parameters approximation
V̂(x, Ŵc) ≜ Ŵcᵀ σ(x),   (2.8)
The critic weight estimates are updated using the least-squares update law

Ŵ̇c(t) = −ηc Γ(t) (ω(t)/ρ(t)) δt(t),
Γ̇(t) = β Γ(t) − ηc Γ(t) (ω(t) ωᵀ(t)/ρ²(t)) Γ(t),   (2.10)

where ω(t) denotes the critic regressor vector, ρ(t) denotes a normalization term, Γ(t) denotes a least-squares gain matrix, and ηc, β ∈ R>0 are constant learning gains.
The constants ηa1, ηa2 ∈ R>0 are learning gains for the actor update law (2.11). A block diagram of the resulting
control architecture is in Fig. 2.1.
The stability analysis indicates that the sufficient exploration condition takes the
form of a persistence of excitation condition that requires the existence of positive
constants ψ and T such that the regressor vector satisfies
ψ I_L ≤ ∫_t^{t+T} ( ω(τ) ωᵀ(τ)/ρ(τ) ) dτ,   (2.12)
for all t ∈ R≥t0 . The regressor is defined here as a trajectory indexed by time. It should
be noted that different initial conditions result in different regressor trajectories;
hence, the constants T and ψ depend on the initial values of x (·) and Ŵa (·). Hence,
the final result is generally not uniform in the initial conditions.
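The excitation condition in (2.12) can be checked numerically along a recorded regressor trajectory by integrating the normalized outer product over sliding windows and examining the minimum eigenvalue; the regressor signal below is an illustrative stand-in for recorded data.

# Sketch: numerical check of the excitation condition (2.12). The minimum eigenvalue of
# int_t^{t+T} omega(tau) omega(tau)^T / rho(tau) dtau over every window of length T should
# exceed some psi > 0. The regressor signal is an illustrative stand-in for recorded data.
import numpy as np

dt, T_window = 0.01, 2.0
t = np.arange(0.0, 20.0, dt)
omega = np.vstack([np.sin(t), np.cos(2.0 * t)]).T      # assumed regressor trajectory
rho = 1.0 + np.sum(omega**2, axis=1)                   # assumed normalization term

n_win = int(T_window / dt)
min_eigs = []
for k in range(len(t) - n_win):
    M = sum(np.outer(omega[i], omega[i]) / rho[i] for i in range(k, k + n_win)) * dt
    min_eigs.append(np.linalg.eigvalsh(M).min())
psi = min(min_eigs)                                    # excitation level over all windows
print(psi > 0.0, psi)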
Let W̃c(t) ≜ W − Ŵc(t) and W̃a(t) ≜ W − Ŵa(t) denote the vectors of param-
eter estimation errors, where W ∈ R L denotes the constant vector of ideal parameters
(see Property 2.3). Provided (2.12) is satisfied, and under sufficient conditions on the
learning gains and the constants ψ and T , the candidate Lyapunov function
V_L(x, W̃c, W̃a, t) ≜ V∗(x) + (1/2) W̃cᵀ Γ⁻¹(t) W̃c + (1/2) W̃aᵀ W̃a
can be used to establish convergence of x (t), W̃c (t), and W̃a (t) to a neighborhood
of zero as t → ∞, when the system in (1.9) is controlled using the control law
u (t) = û x (t) , Ŵa (t) , (2.13)
and the parameter estimates Ŵc (·) and Ŵa (·) are updated using the update laws in
(2.10) and (2.11), respectively.
The use of the state derivative to compute the Bellman error in (2.6) is advantageous
because it is easier to obtain a dynamic estimate of the state derivative than it is to
identify the system dynamics. For example, consider the high-gain dynamic state
derivative estimator
where x̂˙(t) ∈ Rn is an estimate of the state derivative, x̃(t) ≜ x(t) − x̂(t) is the state estimation error, and k, α ∈ R>0 are identification gains. Using (2.14), the Bellman
error in (2.6) can be approximated by δ̂t as

    δ̂t(t) = ∇ₓV̂(x(t), Ŵc(t)) x̂˙(t) + r(x(t), û(x(t), Ŵa(t))).

The critic can then learn the critic weights by using an approximation of cumulative experience, quantified using δ̂t instead of δt in (2.10), that is,

    Ê_t(t) = ∫₀ᵗ δ̂t²(τ) dτ.  (2.15)
Under additional sufficient conditions on the gains k and α, the candidate Lyapunov
function
    V_L(x, W̃c, W̃a, x̃, x_f, t) ≜ V*(x) + ½ W̃cᵀ Γ⁻¹(t) W̃c + ½ W̃aᵀ W̃a + ½ x̃ᵀ x̃ + ½ x_fᵀ x_f,

where x_f(t) ≜ x̃˙(t) + α x̃(t), can be used to establish convergence of x(t), W̃c(t), W̃a(t), x̃(t), and x_f(t) to a neighborhood of zero, when the system in (1.9) is
[Fig. 2.2: Block diagram of the actor-critic-identifier architecture (actor, critic, identifier; state, action, Bellman error (BE)).]
controlled using the control law (2.13). The aforementioned extension of the actor-
critic method to handle uncertainties in the system dynamics using derivative esti-
mation is known as the actor-critic-identifier architecture. A block diagram of the
actor-critic-identifier architecture is presented in Fig. 2.2.
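Since the estimator in (2.14) is not reproduced here, the following sketch substitutes a generic first-order high-gain ("dirty-derivative") filter in its place, purely to illustrate how a state-derivative estimate can be combined with a value function approximation to form δ̂t; the scalar dynamics, basis, policy, filter constant, and gains are illustrative assumptions rather than the design of this section.

```python
import numpy as np

# Illustrative scalar system (used only to generate data; the estimator never
# evaluates f) with quadratic cost r(x, u) = x^2 + u^2.
f = lambda x: -x + 0.25 * x**3
g = 1.0
policy = lambda x: -0.5 * x                   # some fixed stabilizing policy

sigma = lambda x: np.array([x**2, x**4])      # value function basis (assumed)
dsigma = lambda x: np.array([2.0 * x, 4.0 * x**3])

dt, tau = 1e-3, 0.02                          # step size and filter constant
x, x_filt = 1.0, 1.0                          # state and low-pass-filtered state
Wc, eta_c = np.zeros(2), 0.5                  # critic weights and learning gain

for _ in range(5000):
    u = policy(x)
    xdot_hat = (x - x_filt) / tau             # high-gain derivative estimate
    x_filt += dt * xdot_hat
    # Approximate Bellman error formed with the derivative estimate.
    omega = dsigma(x) * xdot_hat
    delta_hat = Wc @ omega + x**2 + u**2
    Wc -= dt * eta_c * delta_hat * omega / (1.0 + omega @ omega)
    x += dt * (f(x) + g * u)                  # propagate the true state
```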
The condition in (2.12) cannot generally be verified a priori; hence, a probing signal is typically added to the control input during the online implementation. Moreover, the added probing signal causes large control effort expenditure, and there is no means to know when it is sufficient to remove the probing signal. Chap. 4 addresses the challenges associated with the satisfaction of the condition in (2.12) via simulated experience and cumulative experience collected along the system trajectory.
Approximate optimal control has been an active topic of research since the seminal
works of Bellman [106] and Pontryagin [107] in the 1950s. A comprehensive survey
and classification of all the results dealing with approximate optimal control is out
of the scope of this monograph. In the following, a brief (by no means exhaustive)
classification of techniques based on Bellman’s dynamic programming principle is
presented. For a recent survey of approximate dynamic programming in deterministic
systems, see [108]. Brief discussions on a few specific techniques directly related
to the methodology used in this book are also presented. For a brief description of
methods based on Pontryagin’s maximum principle refer back to Sect. 1.8.
On-Policy Versus Off-Policy Learning: A generalized policy iteration technique
is called on-policy if the data used to improve an estimate of the optimal policy is
required to be collected using the same estimate. A generalized policy iteration
technique is called off-policy if an estimate of the optimal policy can be improved
using data collected using another policy. For example, methods such as policy iteration, value iteration, heuristic dynamic programming, and adaptive-critic methods [1, 14, 55, 59, 61], as well as SARSA (cf. [109, 110]), are on-policy, whereas methods such as Q-learning [111] and R-learning [112] are off-policy. The distinction between
on-policy and off-policy methods is important because most online generalized pol-
icy iteration methods require exploration for convergence, whereas the on-policy
condition requires exploitation, hence leading to the exploration versus exploitation
conflict. Off-policy methods avoid the exploration versus exploitation conflict since
an arbitrary exploring policy can be used to facilitate learning.
Approximation of solutions of reward-maximization problems using indirect feed-
back generated by a critic network was first investigated in [57]. Critic-based methods
were further developed to solve a variety of optimal control problems [1, 4, 20, 54],
for example, heuristic dynamic programming [20], adaptive critic elements [1], and
Q-learning [111]. A common theme among the aforementioned techniques is the
use of two neuron-like elements, an actor element that is responsible for generating
control signals and a critic element that is responsible for evaluation of the control
signals generated by the actor (except Q-learning, which is implemented with just
one neuron-like element that combines the information about the policy and the value
function [4, 111]). The most useful feature of critic-based methods is that they can be implemented online in real time.
Policy Iteration, Value Iteration, and Policy Gradient: Dynamic program-
ming methods have traditionally been classified into three distinct schemes: policy
iteration, value iteration, and policy gradient. Policy iteration methods start with
a stabilizing policy, find the value function corresponding to that policy (i.e., pol-
icy evaluation), and then update the policy to exploit the value function (i.e., pol-
icy improvement). A large majority of dynamic programming algorithms can be
classified as policy iteration algorithms. For example, SARSA and the successive
approximation methods developed in results such as [15, 16, 18, 21, 45, 46, 49, 79,
113–119] are policy iteration algorithms. In value iteration, starting from an arbitrary
initial guess, the value function is directly improved by effectively combining the
evaluation and the improvement phases into one single update. For example, algorithms such as Q-learning [111], R-learning [112], heuristic dynamic programming, action-dependent heuristic dynamic programming, dual heuristic programming, and action-dependent dual heuristic programming [9], along with modern extensions of value iteration (see [6–8, 77] for a summary), fall into this class. Both policy iteration and value iteration are typically critic-only methods [60] and can be considered as special cases of generalized policy iteration [8, 21].
Policy gradient methods (also known as actor-only methods) are philosophically
different from policy iteration and value iteration. In policy gradient methods, instead
of approximating the value function, the policy is directly approximated by comput-
ing the gradient of the cost functional with respect to the unknown parameters in
the approximation of the policy [120–123]. Modern policy gradient methods uti-
lize an approximation of the value function to estimate the gradients, and are called
actor-critic methods [14, 60, 124].
Continuous-Time Versus Discrete-Time Methods: For deterministic systems,
reinforcement learning algorithms have been extended to solve finite- and infinite-horizon discounted and total-cost optimal regulation problems (cf. [24, 26, 48, 49,
70, 85, 91, 93, 105, 125]) under names such as adaptive dynamic programming or
adaptive critic algorithms. The discrete/iterative nature of the approximate dynamic
programming formulation lends itself naturally to the design of discrete-time opti-
mal controllers [24, 26, 27, 70–75, 79, 126], and the convergence of algorithms for
dynamic programming-based reinforcement learning controllers is studied in results
such as [47, 59, 61, 72]. Most prior work has focused on convergence analysis for
discrete-time systems, but some continuous-time examples are available [15, 19, 45, 47,
49, 81, 82, 84, 85, 105, 127–129]. For example, in [81] advantage updating was pro-
posed as an extension of the Q−learning algorithm which could be implemented in
continuous time and provided faster convergence. The result in [82] used a Hamilton–
Jacobi–Bellman framework to derive algorithms for value function approximation
and policy improvement, based on a continuous version of the temporal difference
error. A Hamilton–Jacobi–Bellman framework was also used in [47] to develop
a stepwise stable iterative approximate dynamic programming algorithm for con-
tinuous input-affine systems with an input-quadratic performance measure. Based
on the successive approximation method first proposed in [15], an adaptive optimal
control solution is provided in [45], where a Galerkin’s spectral method is used to
approximate the solution to the generalized Hamilton–Jacobi–Bellman equation. A
least-squares-based successive approximation solution to the generalized Hamilton–
Jacobi–Bellman equation is provided in [49], where a neural network is trained to approximate the value function.
References
1. Barto A, Sutton R, Anderson C (1983) Neuron-like adaptive elements that can solve difficult
learning control problems. IEEE Trans Syst Man Cybern 13(5):834–846
2. Sutton R (1988) Learning to predict by the methods of temporal differences. Mach Learn
3(1):9–44
3. Werbos P (1990) A menu of designs for reinforcement learning over time. Neural Netw
Control 67–95
4. Watkins C, Dayan P (1992) Q-learning. Mach Learn 8(3):279–292
5. Bellman RE (2003) Dynamic programming. Dover Publications, Inc, New York
6. Bertsekas D (2007) Dynamic programming and optimal control, vol 2, 3rd edn. Athena Sci-
entific, Belmont, MA
7. Lewis FL, Vrabie D, Syrmos VL (2012) Optimal control, 3rd edn. Wiley, Hoboken
8. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
9. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural mod-
eling. In: White DA, Sorge DA (eds) Handbook of intelligent control: neural, fuzzy, and
adaptive approaches, vol 15. Nostrand, New York, pp 493–525
10. Bertsekas D, Tsitsiklis J (1996) Neuro-dynamic programming. Athena Scientific, Nashua
11. Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function
approximation. IEEE Trans Autom Control 42(5):674–690
12. Tsitsiklis JN, Roy BV (1999) Average cost temporal-difference learning. Automatica
35(11):1799–1808
13. Tsitsiklis J (2003) On the convergence of optimistic policy iteration. J Mach Learn Res 3:59–
72
14. Konda V, Tsitsiklis J (2004) On actor-critic algorithms. SIAM J Control Optim 42(4):1143–
1166
15. Leake R, Liu R (1967) Construction of suboptimal control sequences. SIAM J Control 5:54
16. Bellman R (1957) Dynamic programming, 1st edn. Princeton University Press, Princeton
17. Howard R (1960) Dynamic programming and Markov processes. Technology Press of Mas-
sachusetts Institute of Technology (Cambridge)
18. Saridis G, Lee C (1979) An approximation theory of optimal control for trainable manipula-
tors. IEEE Trans Syst Man Cyber 9(3):152–159
19. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive
optimal control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
20. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. Gen Syst Yearb 22:25–38
21. Puterman ML, Shin MC (1978) Modified policy iteration algorithms for discounted Markov
decision problems. Manag Sci 24(11):1127–1137
22. Bertsekas DP (1987) Dynamic programming: deterministic and stochastic models. Prentice-
Hall, Englewood Cliffs
23. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260
24. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part
B Cybern 38:943–949
25. Heydari A (2014) Revisiting approximate dynamic programming and its convergence. IEEE
Trans Cybern 44(12):2733–2743
26. Padhi R, Unnikrishnan N, Wang X, Balakrishnan S (2006) A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw 19(10):1648–1660
27. Heydari A, Balakrishnan S (2013) Finite-horizon control-constrained nonlinear optimal con-
trol using single network adaptive critics. IEEE Trans Neural Netw Learn Syst 24(1):145–157
28. Heydari A, Balakrishnan SN (2013) Fixed-final-time optimal control of nonlinear systems
with terminal constraints. Neural Netw 48:61–71
29. Lee JY, Park JB, Choi YH (2013) On integral value iteration for continuous-time linear
systems. In: Proceedings of the American control conference, pp 4215–4220
30. Jha SK, Bhasin S (2014) On-policy q-learning for adaptive optimal control. In: Proceedings
of the IEEE symposium on adaptive dynamic programming and reinforcement learning, pp
1–6
31. Palanisamy M, Modares H, Lewis FL, Aurangzeb M (2015) Continuous-time q-learning
for infinite-horizon discounted cost linear quadratic regulator problems. IEEE Trans Cybern
45(2):165–176
32. Bian T, Jiang ZP (2015) Value iteration and adaptive optimal control for linear continuous-time
systems. In: Proceedings of the IEEE international conference on cybernetics and intelligent
systems, IEEE conference on robotics, automation and mechatronics, pp 53–58
33. Mehta P, Meyn S (2009) Q-learning and Pontryagin's minimum principle. In: Proceedings of
the IEEE conference on decision and control, pp 3598–3605
34. Al’Brekht E (1961) On the optimal stabilization of nonlinear systems. J Appl Math Mech
25(5):1254–1266
35. Lukes DL (1969) Optimal regulation of nonlinear dynamical systems. SIAM J Control
7(1):75–100
36. Nishikawa Y, Sannomiya N, Itakura H (1971) A method for suboptimal design of nonlinear
feedback systems. Automatica 7(6):703–712
37. Garrard WL, Jordan JM (1977) Design of nonlinear automatic flight control systems. Auto-
matica 13(5):497–505
38. Dolcetta IC (1983) On a discrete approximation of the Hamilton-Jacobi equation of dynamic
programming. Appl Math Optim 10(1):367–377
39. Falcone M, Ferretti R (1994) Discrete time high-order schemes for viscosity solutions of
Hamilton-Jacobi-Bellman equations. Numer Math 67(3):315–344
40. Bardi M, Dolcetta I (1997) Optimal control and viscosity solutions of Hamilton-Jacobi-
Bellman equations. Springer, Berlin
41. Gonzalez R (1985a) On deterministic control problems: an approximation procedure for the
optimal cost i. The stationary problem. SIAM J Control Optim 23(2):242–266
42. Gonzalez R, Rofman E (1985b) On deterministic control problems: an approximation proce-
dure for the optimal cost ii. The nonstationary case. SIAM J Control Optim 23(2):267–285
43. Falcone M (1987) A numerical approach to the infinite horizon problem of deterministic
control theory. Appl Math Optim 15(1):1–13
44. Kushner HJ (1990) Numerical methods for stochastic control problems in continuous time.
SIAM J Control Optim 28(5):999–1048
45. Beard R, Saridis G, Wen J (1997) Galerkin approximations of the generalized Hamilton-
Jacobi-Bellman equation. Automatica 33:2159–2178
46. Beard RW, Mclain TW (1998) Successive Galerkin approximation algorithms for nonlinear
optimal and robust control. Int J Control 71(5):717–743
47. Murray J, Cox C, Lendaris G, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern Part C Appl Rev 32(2):140–153
48. Abu-Khalaf M, Lewis FL (2002) Nearly optimal HJB solution for constrained input systems
using a neural network least-squares approach. In: Proceedings of the IEEE conference on
decision and control, Las Vegas, NV, pp 943–948
49. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
50. Hornik K, Stinchcombe M, White H (1990) Universal approximation of an unknown mapping
and its derivatives using multilayer feedforward networks. Neural Netw 3(5):551–560
51. Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural
Netw 4:251–257
52. Sadegh N (1993) A perceptron network for functional identification and control of nonlinear
systems. IEEE Trans Neural Netw 4(6):982–988
53. Bian T, Jiang Y, Jiang ZP (2014) Adaptive dynamic programming and optimal control of
nonlinear nonaffine systems. Automatica 50(10):2624–2632
54. Widrow B, Gupta N, Maitra S (1973) Punish/reward: Learning with a critic in adaptive thresh-
old systems. IEEE Trans Syst Man Cybern 3(5):455–465
55. Prokhorov DV, Wunsch IDC (1997) Adaptive critic designs. IEEE Trans Neural Netw 8:997–
1007
56. Fu KS (1964) Learning control systems. In: Tou JT, Wilcox RH (eds) Computing and infor-
mation science, collected papers on learning, adaptation and control in information systems.
Spartan Books, Washington, pp 318–343
57. Fu KS (1969) Learning control systems. In: Tou JT (ed) Advances in information systems
science, vol 1. Springer US, Boston, pp 251–292
58. Witten IH (1977) An adaptive optimal controller for discrete-time Markov environments. Inf
Control 34(4):286–295
59. Liu X, Balakrishnan S (2000) Convergence analysis of adaptive critic based optimal control.
In: Proceedings of the American control conference, vol 3
60. Grondman I, Buşoniu L, Lopes GA, Babuška R (2012) A survey of actor-critic reinforcement
learning: standard and natural policy gradients. IEEE Trans Syst Man Cybern Part C Appl
Rev 42(6):1291–1307
61. Prokhorov D, Santiago R, Wunsch D (1995) Adaptive critic designs: a case study for neuro-
control. Neural Netw 8(9):1367–1372
62. Fuselli D, De Angelis F, Boaro M, Squartini S, Wei Q, Liu D, Piazza F (2013) Action dependent
heuristic dynamic programming for home energy resource scheduling. Int J Electr Power
Energy Syst 48:148–160
63. Miller WT, Sutton R, Werbos P (1990) Neural networks for control. MIT Press, Cambridge
64. Werbos P (1987) Building and understanding adaptive systems: a statistical/numerical
approach to factory automation and brain research. IEEE Trans Syst Man Cybern 17(1):7–20
65. Werbos PJ (1989) Back propagation: past and future. Proceedings of the international con-
ference on neural network 1:1343–1353
66. Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE
78(10):1550–1560
67. Werbos P (2000) New directions in ACDs: keys to intelligent control and understanding the
brain. Proceedings of the IEEE-INNS-ENNS international joint conference on neural network
3:61–66
68. Si J, Wang Y (2001) On-line learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
69. Yang L, Enns R, Wang YT, Si J (2003) Direct neural dynamic programming. In: Stability and
control of dynamical systems with applications. Springer, Berlin, pp 193–214
70. Balakrishnan S (1996) Adaptive-critic-based neural networks for aircraft optimal control. J
Guid Control Dynam 19(4):893–898
71. Lendaris G, Schultz L, Shannon T (2000) Adaptive critic design for intelligent steering and
speed control of a 2-axle vehicle. In: International joint conference on neural network, pp
73–78
72. Ferrari S, Stengel R (2002) An adaptive critic global controller. Proc Am Control Conf 4:2665–
2670
73. Han D, Balakrishnan S (2002) State-constrained agile missile control with adaptive-critic-
based neural networks. IEEE Trans Control Syst Technol 10(4):481–489
74. He P, Jagannathan S (2007) Reinforcement learning neural-network-based controller for non-
linear discrete-time systems with input constraints. IEEE Trans Syst Man Cybern Part B
Cybern 37(2):425–436
75. Dierks T, Thumati B, Jagannathan S (2009) Optimal control of unknown affine nonlinear
discrete-time systems using offline-trained neural networks with proof of convergence. Neural
Netw 22(5–6):851–860
76. Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of
discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocom-
puting 78(1):14–22
77. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control algorithms
and stability. Communications and control engineering, Springer, London
78. Wei Q, Liu D (2013) Optimal tracking control scheme for discrete-time nonlinear systems
with approximation errors. In: Guo C, Hou ZG, Zeng Z (eds) Advances in neural networks -
ISNN 2013, vol 7952. Lecture notes in computer science. Springer, Berlin, pp 1–10
79. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
80. Yang X, Liu D, Wei Q, Wang D (2015) Direct adaptive control for a class of discrete-time
unknown nonaffine nonlinear systems using neural networks. Int J Robust Nonlinear Control
25(12):1844–1861
81. Baird L (1993) Advantage updating. Technical report, Wright Lab, Wright-Patterson Air
Force Base, OH
82. Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput
12(1):219–245
83. Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL (2009) Adaptive optimal control for
continuous-time linear systems based on policy iteration. Automatica 45(2):477–484
84. Hanselmann T, Noakes L, Zaknich A (2007) Continuous-time adaptive critics. IEEE Trans
Neural Netw 18(3):631–647
85. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
86. Khalil HK (2002) Nonlinear systems, 3rd edn. Prentice Hall, Upper Saddle River
87. Wang K, Liu Y, Li L (2014) Visual servoing trajectory tracking of nonholonomic mobile
robots without direct position measurement. IEEE Trans Robot 30(4):1026–1035
88. Wang D, Liu D, Zhang Q, Zhao D (2016) Data-based adaptive critic designs for nonlin-
ear robust optimal control with uncertain dynamics. IEEE Trans Syst Man Cybern Syst
46(11):1544–1555
89. Vamvoudakis KG, Vrabie D, Lewis FL (2009) Online policy iteration based algorithms to
solve the continuous-time infinite horizon optimal control problem. IEEE symposium on
adaptive dynamic programming and reinforcement learning, IEEE, pp 36–41
90. Vrabie D, Vamvoudakis KG, Lewis FL (2009) Adaptive optimal controllers based on gener-
alized policy iteration in a continuous-time framework. In: Proceedings of the mediterranean
conference on control and automation, IEEE, pp 1402–1409
91. Vrabie D, Lewis FL (2010) Integral reinforcement learning for online computation of feedback
nash strategies of nonzero-sum differential games. In: Proceedings of the IEEE Conference
on decision and control, pp 3066–3071
92. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
93. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlin-
ear systems. Automatica 49(1):89–92
94. Kamalapurkar R, Dinh H, Bhasin S, Dixon WE (2015) Approximate optimal trajectory track-
ing for continuous-time nonlinear systems. Automatica 51:40–48
95. Singh SP (1992) Reinforcement learning with a hierarchy of abstract models. In: AAAI
national conference on artificial intelligence 92:202–207
96. Atkeson CG, Schaal S (1997) Robot learning from demonstration. Int Conf Mach Learn
97:12–20
97. Abbeel P, Quigley M, Ng AY (2006) Using inaccurate models in reinforcement learning. In:
International conference on machine learning. ACM, New York, pp 1–8
98. Deisenroth MP (2010) Efficient reinforcement learning using Gaussian processes. KIT Sci-
entific Publishing
99. Mitrovic D, Klanke S, Vijayakumar S (2010) Adaptive optimal feedback control with learned
internal dynamics models. In: Sigaud O, Peters J (eds) From motor learning to interaction
learning in robots, vol 264. Studies in computational intelligence. Springer, Berlin, pp 65–84
100. Deisenroth MP, Rasmussen CE (2011) Pilco: a model-based and data-efficient approach to
policy search. In: International conference on machine learning, pp 465–472
101. Narendra KS, Annaswamy AM (1987) A new adaptive law for robust adaptive control without
persistent excitation. IEEE Trans Autom Control 32:134–145
102. Narendra K, Annaswamy A (1989) Stable adaptive systems. Prentice-Hall Inc, Upper Saddle
River
103. Sastry S, Bodson M (1989) Adaptive control: stability, convergence, and robustness. Prentice-
Hall, Upper Saddle River
104. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall, Upper Saddle River
105. Vrabie D, Abu-Khalaf M, Lewis F, Wang Y (2007) Continuous-time ADP for linear systems
with partially unknown dynamics. In: Proceedings of the IEEE international symposium on
approximate dynamic programming and reinforcement learning, pp 247–253
106. Bellman R (1954) The theory of dynamic programming. Technical report, DTIC Document
107. Pontryagin LS, Boltyanskii VG, Gamkrelidze RV, Mishchenko EF (1962) The mathematical
theory of optimal processes. Interscience, New York
108. Kiumarsi B, Vamvoudakis KG, Modares H, Lewis FL (to appear) Optimal and autonomous
control using reinforcement learning: a survey. IEEE Trans Neural Netw Learn Syst
109. Rummery GA, Niranjan M (1994) On-line Q-learning using connectionist systems. Technical
report. University of Cambridge, Department of Engineering
110. Sutton R (1996) Generalization in reinforcement learning: successful examples using sparse
coarse coding. In: Advances in neural information processing systems, pp 1038–1044
111. Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, University of Cambridge
England
112. Schwartz A (1993) A reinforcement learning method for maximizing undiscounted rewards.
Proc Int Conf Mach Learn 298:298–305
113. Bradtke S, Ydstie B, Barto A (1994) Adaptive linear quadratic control using policy iteration.
In: Proceedings of the American control conference, IEEE, pp 3475–3479
114. McLain T, Beard R (1998) Successive Galerkin approximations to the nonlinear optimal
control of an underwater robotic vehicle. In: Proceedings of the IEEE international conference
on robotics and automation
115. Lawton J, Beard R, Mclain T (1999) Successive Galerkin approximation of nonlinear optimal
attitude. Proc Am Control Conf 6:4373–4377
116. Lawton J, Beard R (1998) Numerically efficient approximations to the Hamilton–Jacobi–
Bellman equation. Proc Am Control Conf 1:195–199
117. Bertsekas D (2011) Approximate policy iteration: a survey and some new methods. J Control
Theory Appl 9:310–335
118. Modares H, Lewis FL, Naghibi-Sistani MB (2013) Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans Neural
Netw Learn Syst 24(10):1513–1525
119. Luo B, Wu HN, Huang T, Liu D (2014) Data-based approximate policy iteration for affine
nonlinear continuous-time optimal control design. Automatica
3.1 Introduction
The focus of this chapter is adaptive online approximate optimal control of uncertain
nonlinear systems. The state-derivative-based method summarized in Sect. 2.6 is fur-
ther developed in this chapter. In Sect. 3.2, a novel actor-critic-identifier architecture
is developed to obviate the need to know the system drift dynamics via simultaneous
learning of the actor, the critic, and the identifier. The actor-critic-identifier method
utilizes a persistence of excitation-based online learning scheme, and hence is an
indirect adaptive control approach to reinforcement learning. The idea is similar to
the heuristic dynamic programming algorithm [1], where Werbos suggested the use
of a model network along with the actor and critic networks. Because of the general-
ity of the considered system and objective function, the developed solution approach
can be used in a wide range of applications in different fields.
The actor and critic neural networks developed in this chapter use gradient and
least-squares-based update laws, respectively, to minimize the Bellman error. The
identifier dynamic neural network is a combination of a Hopfield-type [2] component
and a novel RISE (Robust Integral of Sign of the Error) component. The Hopfield
component of the dynamic neural network learns the system dynamics based on
online gradient-based weight tuning laws, while the RISE term robustly accounts for
the function reconstruction errors, guaranteeing asymptotic estimation of the state
and the state derivative. Online estimation of the state derivative allows the actor-
critic-identifier architecture to be implemented without knowledge of system drift
dynamics; however, knowledge of the input gain matrix is required to implement
the control policy. While the design of the actor and critic are coupled through the
Hamilton–Jacobi–Bellman equation, the design of the identifier is decoupled from
the actor and the critic components, and can be considered as a modular component
in the actor-critic-identifier architecture. Convergence of the actor-critic-identifier
algorithm and stability of the closed-loop system are analyzed using Lyapunov-
based adaptive control methods, and a persistence of excitation condition is used
to guarantee exponential convergence to a bounded region in the neighborhood of
the optimal control and uniformly ultimately bounded stability of the closed-loop
system.
The developed method is further extended to a nonzero-sum differential game problem. For an equivalent nonlinear system, previous research makes use of offline procedures or requires full knowledge of the system dynamics to determine
the Nash equilibrium. A Lyapunov-based stability analysis shows that uniformly ulti-
mately bounded tracking for the closed-loop system is guaranteed for the proposed
actor-critic-identifier architecture and a convergence analysis demonstrates that the
approximate control policies converge to a neighborhood of the optimal solutions.
The actor and the critic adjust the weights Ŵa (·) and Ŵc (·), respectively, to
minimize the approximate Bellman error. The identifier learns the derivatives x̂˙ (·)
to minimize the error between the true Bellman error and its approximation. The
following assumptions facilitate the development of update laws for the identifier,
the critic, and the actor.
Assumption 3.2 The input gain matrix g(x) is known and uniformly bounded for
all x ∈ Rn (i.e., 0 < g(x) ≤ ḡ, ∀x ∈ Rn , where ḡ is a known positive constant).
To facilitate the design of the identifier, the following restriction is placed on the
control input.
Assumption 3.3 The control input is bounded (i.e., u (·) ∈ L∞ ).
Using Assumption 3.2, Property 2.3, and the projection algorithm in (3.27), Assumption 3.3 holds for the control design u(t) = û(x(t), Ŵa(t)) in (2.9). Using Assumption 3.3, the dynamic system in (1.9), with control u(·), can be represented using a multi-layer neural network as
1 Parts of the text in this section are reproduced, with permission, from [9], © 2013, Elsevier.
    ẋ(t) = Fu(x(t), u(t)) ≜ Wfᵀ σf(Vfᵀ x(t)) + εf(x(t)) + g(x(t)) u(t),  (3.2)

where Wf ∈ R^{(Lf+1)×n} and Vf ∈ R^{n×Lf} are the unknown ideal neural network weights, σf : R^{Lf} → R^{Lf+1} is the neural network activation function, and εf : Rn → Rn is
the function reconstruction error. The following multi-layer dynamic neural network
identifier is used to approximate the system in (3.2)
    x̂˙(t) = F̂u(x(t), x̂(t), u(t)) ≜ Ŵfᵀ(t) σf(V̂fᵀ(t) x̂(t)) + g(x(t)) u(t) + μ(t),  (3.3)

where x̂ : R≥t0 → Rn is the dynamic neural network state, Ŵf : R≥t0 → R^{(Lf+1)×n} and V̂f : R≥t0 → R^{n×Lf} are weight estimates, and μ : R≥t0 → Rn denotes the RISE
feedback term defined as [10, 11]
where x̃(t) ≜ x(t) − x̂(t) ∈ Rn is the identification error, and v(t) ∈ Rn is a Filippov solution [12] to the initial value problem

where F̃u(x, x̂, u) ≜ Fu(x, u) − F̂u(x, x̂, u) ∈ Rn. A filtered identification error is defined as

    ef(t) ≜ x̃˙(t) + α x̃(t).  (3.6)
+ ε̇f (x (t) , ẋ (t)) − kef (t) − γ x̃ (t) − β1 sgn(x̃ (t)) + α x̃˙ (t) . (3.7)
Based on (3.7) and the subsequent stability analysis, the weight update laws for the dynamic neural network are designed as

    Ŵ̇f(t) = proj(Γwf ∇_{Vfᵀx}σf(V̂fᵀ(t) x̂(t)) V̂fᵀ(t) x̂˙(t) x̃ᵀ(t)),
    V̂̇f(t) = proj(Γvf x̂˙(t) x̃ᵀ(t) Ŵfᵀ(t) ∇_{Vfᵀx}σf(V̂fᵀ(t) x̂(t))),  (3.8)

where the projection operator is used to bound the weight estimates such that ‖Ŵf(t)‖, ‖V̂f(t)‖ ≤ W̄f, ∀t ∈ R≥t0, W̄f ∈ R>0 is a constant, and Γwf ∈ R^{(Lf+1)×(Lf+1)} and Γvf ∈ R^{n×n} are positive constant adaptation gain matrices.
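As a rough illustration of how tuning laws of the form (3.8) might be stepped forward in discrete time, consider the following sketch; the tanh activation, the bias handling, the simple norm clipping used in place of the smooth projection operator, and the Euler discretization are assumptions made here for illustration and are not part of the development above.

```python
import numpy as np

def clip_norm(W, bound):
    # Crude stand-in for the projection operator: rescale the estimate so that
    # its Frobenius norm never exceeds the prescribed bound.
    n = np.linalg.norm(W)
    return W if n <= bound else W * (bound / n)

def identifier_weight_step(Wf, Vf, x_hat, x_hat_dot, x_tilde,
                           Gamma_wf, Gamma_vf, bound, dt):
    """One Euler step of gradient-based tuning laws patterned after (3.8).
    Wf: (Lf+1, n) outer-layer weights, Vf: (n, Lf) inner-layer weights."""
    z = Vf.T @ x_hat                            # inner-layer input, shape (Lf,)
    s = np.tanh(z)                              # sigmoid-type activation
    ds = np.diag(1.0 - s**2)                    # derivative of the activation
    # Activation Jacobian, padded with a zero row for the constant (bias) term.
    grad_sigma = np.vstack([np.zeros((1, len(z))), ds])        # (Lf+1, Lf)
    Wf_dot = Gamma_wf @ grad_sigma @ (Vf.T @ x_hat_dot)[:, None] @ x_tilde[None, :]
    Vf_dot = Gamma_vf @ x_hat_dot[:, None] @ x_tilde[None, :] @ Wf.T @ grad_sigma
    return (clip_norm(Wf + dt * Wf_dot, bound),
            clip_norm(Vf + dt * Vf_dot, bound))
```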
The expression in (3.7) can be rewritten as

    ėf(t) = Ñ(t) + NB1(t) + N̂B2(t) − k ef(t) − γ x̃(t) − β1 sgn(x̃(t)),  (3.9)
where the auxiliary signals Ñ : R≥t0 → Rn, NB1 : R≥t0 → Rn, and N̂B2 : R≥t0 → Rn are defined as

    Ñ(t) ≜ α x̃˙(t) − Ŵ̇fᵀ(t) σf(V̂fᵀ(t) x̂(t)) − Ŵfᵀ(t) ∇_{Vfᵀx}σf(V̂fᵀ(t) x̂(t)) V̂̇fᵀ(t) x̂(t)
           + ½ Wfᵀ ∇_{Vfᵀx}σf(V̂fᵀ(t) x̂(t)) V̂fᵀ(t) x̃˙(t) + ½ Ŵfᵀ(t) ∇_{Vfᵀx}σf(V̂fᵀ(t) x̂(t)) Vfᵀ x̃˙(t),  (3.10)

    NB1(t) ≜ Wfᵀ ∇_{Vfᵀx}σf(Vfᵀ x(t)) Vfᵀ ẋ(t) − ½ Wfᵀ ∇_{Vfᵀx}σf(V̂fᵀ(t) x̂(t)) V̂fᵀ(t) ẋ(t)
             − ½ Ŵfᵀ(t) ∇_{Vfᵀx}σf(V̂fᵀ(t) x̂(t)) Vfᵀ ẋ(t) + ε̇f(x(t), ẋ(t)),  (3.11)

    N̂B2(t) ≜ ½ W̃fᵀ(t) ∇_{Vfᵀx}σf(V̂fᵀ(t) x̂(t)) V̂fᵀ(t) x̂˙(t) + ½ Ŵfᵀ(t) ∇_{Vfᵀx}σf(V̂fᵀ(t) x̂(t)) Ṽfᵀ(t) x̂˙(t),  (3.12)

where W̃f(t) ≜ Wf − Ŵf(t) and Ṽf(t) ≜ Vf − V̂f(t). To facilitate the subsequent stability analysis, an auxiliary term NB2 : R≥t0 → Rn is defined by replacing x̂˙(t) in N̂B2(t) by ẋ(t), and ÑB2 ≜ N̂B2 − NB2. The terms NB1 and NB2 are grouped as NB ≜ NB1 + NB2.
Provided x(t) ∈ χ, where χ ⊂ Rn is a compact set containing the origin, using Assumption 3.2, Property 2.3, (3.6), (3.8), (3.11), and (3.12), the following bounds can be obtained:

    ‖Ñ(t)‖ ≤ ρ1(‖z(t)‖) ‖z(t)‖,  (3.13)
    ‖NB1(t)‖ ≤ ζ1,  ‖NB2(t)‖ ≤ ζ2,  ‖ṄB(t)‖ ≤ ζ3 + ζ4 ρ2(‖z(t)‖) ‖z(t)‖,  (3.14)
    x̃˙ᵀ(t) ÑB2(t) ≤ ζ5 ‖x̃(t)‖² + ζ6 ‖ef(t)‖².  (3.15)
    Q(t) ≜ (α/4) (tr(W̃fᵀ(t) Γwf⁻¹ W̃f(t)) + tr(Ṽfᵀ(t) Γvf⁻¹ Ṽf(t))),
where the auxiliary function P : R≥t0 → R is the Filippov solution [12] to the differential equation

    Ṗ(t) = −L(z(t), t),  P(t0) = β1 Σ_{i=1}^{n} |x̃i(t0)| − x̃ᵀ(t0) NB(t0),  (3.17)

    L(z, t) ≜ efᵀ (NB1(t) − β1 sgn(x̃(t))) + x̃˙ᵀ(t) NB2(t) − β2 ρ2(‖z‖) ‖z‖ ‖x̃(t)‖,  (3.18)

where β1, β2 ∈ R are selected according to the sufficient conditions

    β1 > max(ζ1 + ζ2, ζ1 + ζ3/α),  β2 > ζ4,  (3.19)

to ensure P(t) ≥ 0, ∀t ∈ R≥t0 (see Appendix A.1.1).
Let D ⊂ R^{2n+2} be an open and connected set defined as D ≜ {y ∈ R^{2n+2} : ‖y‖ < inf ρ⁻¹([2√(λη), ∞))}, where λ and η are defined in Appendix A.1.2. Let D̄ be the compact set D̄ ≜ {y ∈ R^{2n+2} : ‖y‖ ≤ inf ρ⁻¹([2√(λη), ∞))}. Let VI : D̄ → R be a positive-definite, locally Lipschitz, regular function defined as

    VI(y) ≜ ½ efᵀ ef + ½ γ x̃ᵀ x̃ + P + Q.  (3.20)
The candidate Lyapunov function in (3.20) satisfies the inequalities

    U1(y) ≤ VI(y) ≤ U2(y),

where U1(y) ≜ ½ min(1, γ) ‖y‖² and U2(y) ≜ max(1, γ) ‖y‖².
Additionally, let S ⊂ D denote the set S ≜ {y ∈ D : ρ(√(2 U2(y))) < 2√(λη)}, and let

    ẏ(t) = h(y(t), t)  (3.22)

represent the closed-loop differential equations in (3.5), (3.8), (3.9), and (3.17), where h : R^{2n+2} × R≥t0 → R^{2n+2} denotes the right-hand side of the closed-loop error signals.
Theorem 3.4 For the system in (1.9), the identifier developed in (3.3) along with the weight update laws in (3.8) ensures asymptotic identification of the state and its derivative, in the sense that all Filippov solutions to (3.22) that satisfy y(t0) ∈ S are bounded and further satisfy

    lim_{t→∞} ‖x̃(t)‖ = 0,  lim_{t→∞} ‖x̃˙(t)‖ = 0,

provided the control gains k and γ are selected sufficiently large based on the initial conditions of the states and satisfy the sufficient conditions

    γ > ζ5/α,  k > ζ6,  (3.23)

where ζ5 and ζ6 are introduced in (3.15), and β1 and β2, introduced in (3.18), are selected according to the sufficient conditions in (3.19).
The critic weights are updated using a recursive least-squares update law with covariance resetting,

    Ŵ̇c(t) = −kc Γ(t) (ω(t)/(1 + ν ωᵀ(t) Γ(t) ω(t))) δ̂t(t),  (3.24)

where ω : R≥t0 → R^L, defined as ω(t) ≜ ∇ₓσ(x(t)) F̂u(x(t), x̂(t), û(x(t), Ŵa(t))), is the critic neural network regressor vector, ν, kc ∈ R are constant positive gains, and Γ : R≥t0 → R^{L×L} is a symmetric estimation gain matrix generated using the initial value problem

    Γ̇(t) = −kc Γ(t) (ω(t) ωᵀ(t)/(1 + ν ωᵀ(t) Γ(t) ω(t))) Γ(t);  Γ(tr⁺) = Γ(0) = γ̄ I_L,  (3.25)
where tr⁺ is the resetting time at which λmin{Γ(t)} ≤ γ̲, with γ̄ > γ̲ > 0. The covariance resetting ensures that Γ(·) is positive-definite for all time and prevents its value from becoming arbitrarily small in some directions, thus avoiding slow adaptation in those directions. From (3.25), it is clear that Γ̇(t) ≤ 0, which means that Γ(·) can be bounded as

    γ̲ I_L ≤ Γ(t) ≤ γ̄ I_L, ∀t ∈ R≥t0.  (3.26)
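A possible discrete-time implementation of the least-squares critic update with covariance resetting in (3.24)–(3.25) is sketched below; the class name, gain values, resetting thresholds, and the Euler discretization are illustrative assumptions.

```python
import numpy as np

class LeastSquaresCritic:
    """Normalized least-squares critic update with covariance resetting,
    patterned after (3.24)-(3.25)."""

    def __init__(self, L, kc=1.0, nu=0.1, gamma_bar=500.0, gamma_low=1.0):
        self.Wc = np.zeros(L)
        self.Gamma = gamma_bar * np.eye(L)          # Gamma(0) = gamma_bar * I_L
        self.kc, self.nu = kc, nu
        self.gamma_bar, self.gamma_low = gamma_bar, gamma_low

    def step(self, omega, delta_hat, dt):
        omega = np.asarray(omega, dtype=float)
        denom = 1.0 + self.nu * omega @ self.Gamma @ omega
        # Critic weight update (cf. (3.24)).
        self.Wc += -dt * self.kc * (self.Gamma @ omega) * delta_hat / denom
        # Covariance update (cf. (3.25)); Gamma is non-increasing between resets.
        self.Gamma += -dt * self.kc * (self.Gamma @ np.outer(omega, omega)
                                       @ self.Gamma) / denom
        # Covariance resetting keeps Gamma away from singularity.
        if np.linalg.eigvalsh(self.Gamma)[0] <= self.gamma_low:
            self.Gamma = self.gamma_bar * np.eye(len(self.Wc))
        return self.Wc
```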
The actor update, like the critic update in Sect. 3.2.2, is based on the minimization
of the Bellman error δ̂t . However, unlike the critic weights, the actor weights appear
nonlinearly in δ̂t , making it problematic to develop a least-squares update law. Hence,
a gradient update law is developed for the actor which minimizes the squared Bellman
error. The gradient-based update law for the actor neural network is given by
    Ŵ̇a(t) = proj{ −(2ka1/√(1 + ωᵀ(t) ω(t))) (Ŵcᵀ(t) ∇ₓσ(x(t)) (∂F̂u(x(t), x̂(t), û(x(t), Ŵa(t)))/∂û) (∂û(x(t), Ŵa(t))/∂Ŵa))ᵀ δ̂t(t)
        − (4ka1/√(1 + ωᵀ(t) ω(t))) (∂û(x(t), Ŵa(t))/∂Ŵa)ᵀ R û(x(t), Ŵa(t)) δ̂t(t) − ka2 (Ŵa(t) − Ŵc(t)) },  (3.27)

where G(x) ≜ g(x) R⁻¹ gᵀ(x), ka1, ka2 ∈ R are positive adaptation gains, 1/√(1 + ωᵀ(t) ω(t)) is the normalization term, and the last term in (3.27) is added for stability (based on the subsequent stability analysis). Using the identifier developed in (3.3), the actor weight update law can now be simplified as

    Ŵ̇a(t) = proj{ −(ka1/√(1 + ωᵀ(t) ω(t))) ∇ₓσ(x(t)) G ∇ₓσᵀ(x(t)) (Ŵa(t) − Ŵc(t)) δ̂t(t) − ka2 (Ŵa(t) − Ŵc(t)) }.  (3.28)

The projection operator ensures that ‖Ŵa(t)‖ ≤ W̄, ∀t ∈ R≥t0, where W̄ ∈ R>0 is a known positive constant.
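A corresponding sketch of one Euler step of the simplified actor law (3.28) is given below; the norm clip standing in for the projection operator and the argument names are assumptions introduced for illustration.

```python
import numpy as np

def actor_step(Wa, Wc, grad_sigma, G, delta_hat, omega, ka1, ka2, W_bar, dt):
    """One Euler step of an actor update of the form (3.28): a normalized
    gradient term driven by the approximate Bellman error plus a term pulling
    the actor weights toward the critic weights."""
    D = grad_sigma @ G @ grad_sigma.T            # grad_x sigma * G * grad_x sigma^T
    norm = np.sqrt(1.0 + omega @ omega)          # normalization term
    Wa_dot = -(ka1 / norm) * (D @ (Wa - Wc)) * delta_hat - ka2 * (Wa - Wc)
    Wa_new = Wa + dt * Wa_dot
    n = np.linalg.norm(Wa_new)                   # crude projection substitute
    return Wa_new if n <= W_bar else Wa_new * (W_bar / n)
```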
Remark 3.5 A recursive least-squares update law with covariance resetting is devel-
oped for the critic in (3.24), which exploits the fact that the critic weights appear
linearly in the Bellman error δ̂t (·). This is in contrast to the modified Levenberg–
Marquardt algorithm in [14] which is similar to the normalized gradient update law.
The actor update law in (3.27) also differs in the sense that the update law in [14] is
purely motivated by the stability analysis whereas the proposed actor update law is
based on the minimization of the Bellman error with an additional term for stability.
Heuristically, the difference in the update law development could lead to improved
performance in terms of faster convergence of the actor and critic weights, as seen
from the simulation results in Sect. 3.2.5.
    δ̂t = −W̃cᵀ ω − Wᵀ ∇ₓσ F̃û + ¼ W̃aᵀ ∇ₓσ G ∇ₓσᵀ W̃a − ¼ ∇ₓε G ∇ₓεᵀ − ∇ₓε Fu∗.  (3.29)
The dynamics of the critic weight estimation error W̃c can now be developed by substituting (3.29) into (3.24) as

    W̃̇c = −kc Γ ψ ψᵀ W̃c + kc Γ (ω/(1 + ν ωᵀ Γ ω)) (−Wᵀ ∇ₓσ F̃û + ¼ W̃aᵀ ∇ₓσ G ∇ₓσᵀ W̃a − ¼ ∇ₓε G ∇ₓεᵀ − ∇ₓε Fu∗),  (3.30)

where ψ ≜ ω/√(1 + ν ωᵀ Γ ω) ∈ R^N is the normalized critic regressor vector, bounded as

    ‖ψ(t)‖ ≤ 1/√(ν γ̲), ∀t ∈ R≥t0,  (3.31)
where γ̲ is introduced in (3.26). The error system in (3.30) can be represented by the following perturbed system

    W̃̇c = Ωnom + Δper,  (3.32)
    ∫_t^{t+δ} ψ(τ) ψ(τ)ᵀ dτ ≥ μ1 I_L, ∀t ≥ t0,
where ka1 , c3 , κ1 , κ2 are introduced in (3.27), (3.34), and (3.35), then the controller
in (1.9), the actor-critic weight update laws in (3.24)–(3.25) and (3.28), and the
identifier in (3.3) and (3.8), guarantee that the state of the system x, and the actor-
critic weight estimation errors W̃a and W̃c are uniformly ultimately bounded.
Proof To investigate the stability of (1.9) with control û, and the perturbed system
in (3.32), consider VL : X × RN × RN × [0, ∞) → R as the continuously differen-
tiable, positive-definite Lyapunov function candidate defined as
    V_L(x, W̃c, W̃a, t) ≜ V*(x) + Vc(W̃c, t) + ½ W̃aᵀ W̃a,

where V* (i.e., the optimal value function) is the candidate Lyapunov function for (1.9), and Vc is the Lyapunov function for the exponentially stable system in (3.33). Since V* is continuously differentiable and positive-definite, there exist class K functions α3 and α4 (see [18, Lemma 4.3]) such that

where z̃ ≜ [xᵀ W̃cᵀ W̃aᵀ]ᵀ ∈ R^{n+2N}. Taking the time derivative of V_L(·) yields

    V̇L = (∂V*/∂x) f + (∂V*/∂x) g û + ∂Vc/∂t + (∂Vc/∂W̃c) Ωnom + (∂Vc/∂W̃c) Δper − W̃aᵀ Ŵ̇a,  (3.38)
where the time derivative of V* is taken along the trajectories of the system (1.9) with control û, and the time derivative of Vc is taken along the trajectories of the perturbed system (3.32). To facilitate the subsequent analysis, the Hamilton–Jacobi–Bellman equation in (1.14) is rewritten as (∂V*/∂x) f = −(∂V*/∂x) g u* − Q − u*ᵀ R u*. Substituting for (∂V*/∂x) f in (3.38), using the fact that (∂V*/∂x) g = −2 u*ᵀ R from (1.13), and substituting the actor update law in (3.28) for Ŵ̇a yields

    V̇L = … + ka2 W̃aᵀ (Ŵa − Ŵc) + (ka1/√(1 + ωᵀ ω)) W̃aᵀ ∇ₓσ G ∇ₓσᵀ (Ŵa − Ŵc) δ̂t.  (3.39)
Substituting for u∗ , û, δ̂t , and Δper using (1.13), (2.9), (3.29), and (3.32), respec-
tively, and substituting (3.26) and (3.31) into (3.39), yields
    V̇L ≤ −Q − c3 ‖W̃c‖² − ka2 ‖W̃a‖² + ½ Wᵀ ∇ₓσ G ∇ₓεᵀ + ½ ∇ₓε G ∇ₓεᵀ + ½ Wᵀ ∇ₓσ G ∇ₓσᵀ W̃a + ½ ∇ₓε G ∇ₓσᵀ W̃a
        + (c4 kc γ̄/(2√(ν γ̲))) ‖−Wᵀ ∇ₓσ F̃û + ¼ W̃aᵀ ∇ₓσ G ∇ₓσᵀ W̃a − ¼ ∇ₓε G ∇ₓεᵀ − ∇ₓε Fu∗‖ ‖W̃c‖ + ka2 ‖W̃a‖ ‖W̃c‖
        + (ka1/√(1 + ωᵀ ω)) W̃aᵀ ∇ₓσ G ∇ₓσᵀ (W̃c − W̃a) (−W̃cᵀ ω − Wᵀ ∇ₓσ F̃û + ¼ W̃aᵀ ∇ₓσ G ∇ₓσᵀ W̃a − ¼ ∇ₓε G ∇ₓεᵀ − ∇ₓε Fu∗).  (3.40)
Using the bounds developed in (3.35), (3.40) can be further upper bounded as

    V̇L ≤ −Q − (c3 − ka1 κ1 κ2) ‖W̃c‖² − ka2 ‖W̃a‖² + ka1 κ1² κ2 κ3 + κ4
        + ((c4 kc γ̄/(2√(ν γ̲))) κ3 + ka1 κ1 κ2 κ3 + ka1 κ1² κ2 + ka2 κ1) ‖W̃c‖,

where 0 < θ < 1. Since Q is positive definite, [18, Lemma 4.3] indicates that there exist class K functions α5 and α6 such that

    α5(‖z̃‖) ≤ Q + (1 − θ)(c3 − ka1 κ1 κ2) ‖W̃c‖² + ka2 ‖W̃a‖² ≤ α6(‖z̃‖), ∀z̃ ∈ Bs.  (3.42)
where

    κ5 ≜ ((c4 kc γ̄/(2√(ν γ̲))) κ3 + ka1 κ1 κ2 κ3 + ka1 κ1² κ2 + ka2 κ1)² / (4 θ (c3 − ka1 κ1 κ2)).

Hence, V̇L(t) is negative whenever z̃(t) lies outside the compact set

    Ωz̃ ≜ {z̃ : ‖z̃‖ ≤ α5⁻¹(κ5 + ka1 κ1² κ2 κ3 + κ4)}.
Invoking [18, Theorem 4.18], it can be concluded that z̃ (·) is uniformly ultimately
bounded. The bounds in (3.35) depend on the actor neural network approximation error ∇ₓε, which can be reduced by increasing the number of neurons L, thereby reducing the size of the residual set Ωz̃. From Property 2.3, as the number of neurons of the actor and critic neural networks approaches infinity, ∇ₓε → 0.
Since c3 is a function of the critic adaptation gain kc , ka1 is the actor adaptation
gain, and κ1 , κ2 are known constants, the sufficient gain condition in (3.36) can be
easily satisfied.
Remark 3.7 Since the actor, the critic, and the identifier are continuously updated,
the developed reinforcement learning algorithm can be compared to fully optimistic
policy iteration in machine learning literature [19]. Unlike traditional policy iteration
where policy improvement is done after convergence of the policy evaluation step,
fully optimistic policy iteration carries out policy evaluation and policy improvement
after every state transition. Proving convergence of optimistic policy iteration is
complicated and is an active area of research in machine learning [19, 20]. By
considering an adaptive control framework, this result investigates the convergence
and stability behavior of fully optimistic policy iteration in continuous time.
Remark 3.8 The persistence of excitation condition in Theorem 2 is equivalent to the
exploration paradigm in reinforcement learning which ensures sufficient sampling
of the state-space and convergence to the optimal policy [21].
3.2.5 Simulation
where x(t) ≜ [x1(t) x2(t)]ᵀ ∈ R² and u(t) ∈ R. The state and control penalties are selected as

    Q(x) = xᵀ [1 0; 0 1] x,  R = 1.
The optimal value function and optimal control for the system in (3.43) are known and given by [14]

    V*(x) = ½ x1² + x2²,  u*(x) = −(cos(2x1) + 2) x2.
The activation function for the critic neural network is selected with three neurons as

    σ(x) = [x1², x1x2, x2²]ᵀ,
which yields the optimal weights W = [0.5 0 1]T . The activation function for the
identifier dynamic neural network is selected as a symmetric sigmoid with five neu-
rons in the hidden layer.
Remark 3.9 The choice of a good basis for the value function and control policy is
critical for convergence. For a general nonlinear system, choosing a suitable basis
can be a challenging problem without any prior knowledge about the system.
The identifier gains are selected as
and the gains for the actor-critic learning laws are selected as
The covariance matrix is initialized to Γ (0) = 5000I3 , all the neural network weights
are randomly initialized in [−1, 1], and the states are initialized to x(0) = [3, −1].
An implementation issue in using the developed algorithm is to ensure persis-
tence of excitation of the critic regressor vector. Unlike linear systems, where
persistence of excitation of the regressor translates to sufficient richness of the
external input, no verifiable method exists to ensure persistence of excitation in
nonlinear regulation problems. To ensure persistence of excitation qualitatively,
a small exploratory signal consisting of sinusoids of varying frequencies, n (t) =
sin2 (t) cos (t) + sin2 (2t) cos(0.1t) + sin2 (−1.2t) cos(0.5t) + sin5 (t), is added to
the control u (t) for the first 3 s [14]. The proposed control algorithm is implemented
using (2.9), (3.1), (3.3), (3.4), (3.24), (3.25), and (3.28). The evolution of states is
shown in Fig. 3.1. The identifier approximates the system dynamics, and the state
derivative estimation error is shown in Fig. 3.2.
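For reference, the exploratory signal described above can be implemented as a simple function of time and superimposed on the computed control; the Python form below assumes the expression and the 3 s cutoff exactly as stated.

```python
import numpy as np

def probing_signal(t):
    """Exploratory signal added to the control for the first 3 s to promote
    persistence of excitation (a sum of sinusoids of varying frequencies)."""
    if t > 3.0:
        return 0.0
    return (np.sin(t)**2 * np.cos(t)
            + np.sin(2.0 * t)**2 * np.cos(0.1 * t)
            + np.sin(-1.2 * t)**2 * np.cos(0.5 * t)
            + np.sin(t)**5)

# The applied control is then u(t) = u_hat(x(t), Wa(t)) + probing_signal(t).
```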
As compared to discontinuous sliding mode identifiers which require infinite
bandwidth and exhibit chattering, the RISE-based identifier in (3.3) is continuous,
and thus, mitigates chattering to a large extent, as seen in Fig. 3.2.
Persistence of excitation ensures that the weights converge close to their optimal
values (i.e., Ŵc = [0.5004 0.0005 0.9999]T ≈ Ŵa ) in approximately 2 s, as seen
from the evolution of actor-critic weights in Figs. 3.3 and 3.4. The improved actor-
critic weight update laws, based on minimization of the Bellman error, led to faster
convergence of weights as compared to [14]. Errors in approximating the optimal
value function and optimal control at steady state (t = 10 s) are plotted against the
states in Figs. 3.5 and 3.6, respectively.
[Fig. 3.1: System state trajectories x(t). Fig. 3.2: State derivative estimation error. Fig. 3.3: Evolution of the critic weight estimates Ŵc(t).]
[Fig. 3.4: Evolution of the actor weight estimates Ŵa(t). Figs. 3.5 and 3.6: Errors in approximating the optimal value function and the optimal control at steady state, plotted against the states x1 and x2 (reproduced with permission, © Elsevier).]
Assumption 3.11 The desired trajectory is bounded such that ‖xd(t)‖ ≤ d̄, ∀t ∈ R≥t0, and there exists a locally Lipschitz function hd : Rn → Rn such that ẋd(t) = hd(xd(t)) and g(xd(t)) g⁺(xd(t)) (hd(xd(t)) − f(xd(t))) = hd(xd(t)) − f(xd(t)), ∀t ∈ R≥t0.
Remark 3.12 Assumptions 3.10 and 3.11 can be eliminated if a discounted cost
optimal tracking problem is considered instead of the total cost problem considered
in this chapter. The discounted cost tracking problem considers a value function of the form

    V*(ζ) ≜ min_{u(τ)∈U, τ∈R≥t} ∫_t^∞ e^{κ(t−τ)} r(φ(τ, t, ζ, u(·)), u(τ)) dτ,

where κ ∈ R>0 is a constant discount factor, and the control effort u is minimized instead of the control error μ introduced in (3.48). The control effort required for a system to perfectly track a desired trajectory is generally nonzero even if the initial system state is on the desired trajectory. Hence, in general, the optimal value function for a
discounted cost problem does not satisfy V*(0) = 0. In fact, the origin may not even be an equilibrium point of the closed-loop system under the optimal policy.
2 Parts of the text in this section are reproduced, with permission, from [22], © 2015, Elsevier.
For notational brevity, let gd⁺(t) ≜ g⁺(xd(t)) and fd(t) ≜ f(xd(t)). To transform the time-varying optimal control problem into a time-invariant optimal control problem, a new concatenated state ζ : R≥t0 → R^{2n} is defined as [23]

    ζ ≜ [eᵀ, xdᵀ]ᵀ.  (3.46)

Based on (3.44) and Assumption 3.11, the time derivative of (3.46) can be expressed as

    ζ̇(t) = F(ζ(t)) + G(ζ(t)) μ(t),  (3.47)

where the functions F : R^{2n} → R^{2n}, G : R^{2n} → R^{2n×m}, and the control μ : R≥t0 → Rm are defined as

    F(ζ) ≜ [f(e + xd) − hd(xd) + g(e + xd) ud(xd) ; hd(xd)],
    G(ζ) ≜ [g(e + xd) ; 0],  μ(t) ≜ u(t) − ud(xd(t)).  (3.48)

Local Lipschitz continuity of f and g, the fact that f(0) = 0, and Assumption 3.11 imply that F(0) = 0 and F is locally Lipschitz.
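The construction of the concatenated dynamics in (3.46)–(3.48) is mechanical; the following sketch builds F and G from user-supplied callables f, g, hd, and ud, all of which are placeholders assumed here rather than objects defined in this section.

```python
import numpy as np

def concatenated_dynamics(f, g, hd, ud, n):
    """Return F(zeta) and G(zeta) as in (3.48) for the concatenated state
    zeta = [e; xd], where f, g, hd, ud are callables and n = dim(x)."""
    def F(zeta):
        e, xd = zeta[:n], zeta[n:]
        x = e + xd
        return np.concatenate([f(x) - hd(xd) + g(x) @ ud(xd), hd(xd)])

    def G(zeta):
        gx = g(zeta[:n] + zeta[n:])
        return np.vstack([gx, np.zeros_like(gx)])   # input enters only through e

    return F, G

# The transformed dynamics are then zeta_dot = F(zeta) + G(zeta) @ mu,
# with mu = u - ud(xd).
```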
The objective of the optimal control problem is to design a controller μ (·) that
minimizes the cost functional

    J(ζ, μ(·)) ≜ ∫_{t0}^∞ rt(ζ(τ; t0, ζ, μ(·)), μ(τ)) dτ,  (3.49)

subject to the dynamic constraints in (3.47), where rt : R^{2n} × Rm → R≥0 is the local cost defined as

    rt(ζ, μ) ≜ Qt(ζ) + μᵀ R μ.  (3.50)
Assuming that a minimizing policy exists and that the optimal value function V* : R^{2n} → R≥0, defined as

    V*(ζ) ≜ min_{μ(τ), τ∈R≥t} ∫_t^∞ rt(ζ(τ; t, ζ, μ(·)), μ(τ)) dτ,  (3.52)

is continuously differentiable, V* satisfies the corresponding Hamilton–Jacobi–Bellman equation for all ζ, with the boundary condition V*(0) = 0, where H* denotes the Hamiltonian and μ* : R^{2n} → Rm denotes the optimal policy. For the local cost in (3.50) and the dynamics in (3.47), the optimal controller is given by μ(t) = μ*(ζ(t)), where μ* is the optimal policy given by

    μ*(ζ) = −½ R⁻¹ Gᵀ(ζ) (∇ζV*(ζ))ᵀ.  (3.54)
Using Property 2.3, the value function V* can be represented using a neural network with L neurons as

    V*(ζ) = Wᵀ σ(ζ) + ε(ζ),  (3.55)

where W ∈ R^L is the constant ideal weight vector bounded above by a known positive constant W̄ ∈ R in the sense that ‖W‖ ≤ W̄, σ : R^{2n} → R^L is a bounded, continuously differentiable nonlinear activation function, and ε : R^{2n} → R is the function reconstruction error [24, 25]. Using (3.54) and (3.55), the optimal policy can be expressed as

    μ*(ζ) = −½ R⁻¹ Gᵀ(ζ) (∇ζσᵀ(ζ) W + ∇ζεᵀ(ζ)).  (3.56)
Based on (3.55) and (3.56), the neural network approximations to the optimal value function and the optimal policy are given by

    V̂(ζ, Ŵc) = Ŵcᵀ σ(ζ),
    μ̂(ζ, Ŵa) = −½ R⁻¹ Gᵀ(ζ) ∇ζσᵀ(ζ) Ŵa,  (3.57)

where Ŵc ∈ R^L and Ŵa ∈ R^L are estimates of the ideal neural network weights W. The controller for the concatenated system is then designed as μ(t) = μ̂(ζ(t), Ŵa(t)). The controller for the original system is obtained from (3.45), (3.48), and (3.57) as

    u(t) = −½ R⁻¹ Gᵀ(ζ(t)) ∇ζσᵀ(ζ(t)) Ŵa(t) + gd⁺(t) (hd(xd(t)) − fd(t)).  (3.58)

Using the approximations μ̂ and V̂ for μ* and V* in (3.53), respectively, the error between the approximate and the optimal Hamiltonian (i.e., the Bellman error δ : R^{2n} × R^L × R^L → R) is given in a measurable form by

    δ(ζ, Ŵc, Ŵa) ≜ ∇ζV̂(ζ, Ŵc) (F(ζ) + G(ζ) μ̂(ζ, Ŵa)) + rt(ζ, μ̂(ζ, Ŵa)).  (3.59)
The critic weights are updated to minimize ∫_0^t δt²(ρ) dρ using a normalized least-squares update law with an exponential forgetting factor as [26]

    Ŵ̇c(t) = −kc Γ(t) (ω(t)/(1 + ν ωᵀ(t) Γ(t) ω(t))) δt(t),  (3.60)
    Γ̇(t) = −kc (−λ Γ(t) + Γ(t) (ω(t) ωᵀ(t)/(1 + ν ωᵀ(t) Γ(t) ω(t))) Γ(t)),  (3.61)

where δt is the evaluation of the Bellman error along the system trajectories (i.e., δt(t) = δ(ζ(t), Ŵc(t), Ŵa(t))), ν, kc ∈ R are constant positive adaptation gains, ω : R≥0 → R^L is defined as ω(t) ≜ ∇ζσ(ζ(t)) (F(ζ(t)) + G(ζ(t)) μ̂(ζ(t), Ŵa(t))), and λ ∈ (0, 1) is the constant forgetting factor for the estimation gain matrix Γ ∈ R^{L×L}. The least-squares approach is motivated by faster convergence. With minor modifications to the stability analysis, the result can also be established for a gradient descent update law. The actor weights are updated to follow the critic weights as

    Ŵ̇a(t) = −ka1 (Ŵa(t) − Ŵc(t)) − ka2 Ŵa(t),  (3.62)
where ka1, ka2 ∈ R are constant positive adaptation gains. The least-squares approach cannot be used to update the actor weights because the Bellman error is a nonlinear function of the actor weights.
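A discrete-time sketch of one step of the update laws (3.60)–(3.62) is given below; the Euler discretization and the argument names are assumptions introduced for illustration.

```python
import numpy as np

def tracking_actor_critic_step(Wc, Wa, Gamma, omega, delta_t,
                               kc, nu, lam, ka1, ka2, dt):
    """One Euler step of (3.60)-(3.62): normalized least squares with an
    exponential forgetting factor for the critic, and an actor update that
    tracks the critic weights."""
    denom = 1.0 + nu * omega @ Gamma @ omega
    Wc_new = Wc - dt * kc * (Gamma @ omega) * delta_t / denom
    Gamma_new = Gamma - dt * kc * (-lam * Gamma
                                   + Gamma @ np.outer(omega, omega) @ Gamma / denom)
    Wa_new = Wa - dt * (ka1 * (Wa - Wc) + ka2 * Wa)
    return Wc_new, Wa_new, Gamma_new
```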
The following assumption facilitates the stability analysis using persistence of
excitation.
Assumption 3.13 The regressor ψ : R≥0 → R^L, defined as ψ(t) ≜ ω(t)/√(1 + ν ωᵀ(t) Γ(t) ω(t)), is persistently exciting (i.e., there exist T, ψ̲ > 0 such that ψ̲ I_L ≤ ∫_t^{t+T} ψ(τ) ψᵀ(τ) dτ).
where ϕ̲, ϕ̄ ∈ R are constants such that 0 < ϕ̲ < ϕ̄. Since the evolution of ψ is dependent on the initial values of ζ and Ŵa, the constants ϕ̲ and ϕ̄ depend on the initial values of ζ and Ŵa. Based on (3.63), the regressor can be bounded as

    ‖ψ(t)‖ ≤ 1/√(ν ϕ̲), ∀t ∈ R≥0.  (3.64)
Using (3.53), (3.59), and (3.60), an unmeasurable form of the Bellman error can be written as

    δt = −W̃cᵀ ω + ¼ W̃aᵀ Gσ W̃a + ¼ ∇ζε G ∇ζεᵀ + ½ Wᵀ ∇ζσ G ∇ζεᵀ − ∇ζε F,  (3.65)

where ψ ≜ ω/√(1 + ν ωᵀ Γ ω) ∈ R^L is the regressor vector.
Before stating the main result of the section, three supplementary technical lemmas are stated. To facilitate the discussion, let Y ⊂ R^{2n+2L} be a compact set, and let Z denote the projection of Y onto R^{n+2L}. Using the universal approximation property of neural networks, on the compact set defined by the projection of Y onto R^{2n}, the neural network approximation errors can be bounded such that sup |ε(ζ)| ≤ ε̄ and sup ‖∇ζε(ζ)‖ ≤ ε̄, where ε̄ ∈ R is a positive constant, and there exists a positive constant LF ∈ R such that ‖F(ζ)‖ ≤ LF ‖ζ‖. Instead of using the fact that locally Lipschitz functions on compact sets are Lipschitz, it is possible to bound the function F as ‖F(ζ)‖ ≤ ρ(‖ζ‖) ‖ζ‖, where ρ : R≥0 → R≥0 is non-decreasing. This
approach is feasible and results in additional gain conditions. To aid the subsequent
stability analysis, Assumptions 3.10 and 3.11 are used to develop the bounds

    ‖¼ ∇ζε G ∇ζεᵀ + ½ Wᵀ ∇ζσ G ∇ζεᵀ‖ + LF ‖xd‖ ≤ ι1,   ‖Gσ‖ ≤ ι2,
    ‖∇ζε G ∇ζεᵀ‖ ≤ ι3,   ‖½ Wᵀ Gσ + ½ ∇ζε G ∇ζσᵀ‖ ≤ ι4,   (3.67)
    ‖¼ ∇ζε G ∇ζεᵀ + ½ Wᵀ ∇ζσ G ∇ζεᵀ‖ ≤ ι5,
for all e ∈ Rn and for all t ∈ R≥0 . Specifically, the time-invariant form facilitates the
development of the approximate optimal policy, whereas the equivalent time-varying
form can be shown to be a positive definite and decrescent function of the tracking
error. In the following, Lemma 3.14 is used to prove that Vt∗ : Rn × R≥0 → R is
positive definite and decrescent, and hence, a candidate Lyapunov function.
Lemma 3.14 Let Ba denote a closed ball around the origin with radius a ∈ R>0. The optimal value function Vt* : Rn × R≥0 → R satisfies the following properties ∀t ∈ R≥0 and ∀e ∈ Ba, where v̲ : [0, a] → R≥0 and v̄ : [0, a] → R≥0 are class K functions.
Proof See Appendix A.1.4.
Lemmas 3.15 and 3.16 facilitate the stability analysis by establishing bounds on
the error signal.
Lemma 3.15 Let Z ≜ [eᵀ W̃cᵀ W̃aᵀ]ᵀ, and suppose that Z(τ) ∈ Z, ∀τ ∈ [t, t + T]. The neural network weights and the tracking errors satisfy
2
− inf e (τ )2 ≤ −0 sup e (τ )2 + 1 T 2 sup W̃a (τ ) + 2 ,
τ ∈[t,t+T ] τ ∈[t,t+T ] τ ∈[t,t+T ]
(3.70)
2 2 2
− inf W̃a (τ ) ≤ −3 sup W̃a (τ ) + 4 inf W̃c (τ ) L
τ ∈[t,t+T ] τ ∈[t,t+T ] τ ∈[t,t+T ]
t+T
2
t+T
t+T
4
T 2 2 2
− W̃c ψ dτ ≤ −ψ7 W̃c + 8 e dτ + 3ι2 W̃a (σ ) dσ + 9 T ,
t t t
∀Z ∈ Bb, ∀t ∈ R≥0, where v̲l : [0, b] → R≥0 and v̄l : [0, b] → R≥0 are class K functions.
To facilitate the discussion, define ka12 ka1 + ka2 , Z eT W̃cT W̃aT , ι
(ka2 W +ι4 ) + 2k (ι )2 + 1 ι , 6 ka12 +22 q+kc 9 + ι, and 1 min(k ψ ,
2
ka12 c 1 4 3 10 8 11 16 c 7
20 qT , 3 ka12 T ). Let Z0 ∈ R≥0 denote a known constant bound on the initial con-
dition such that Z (t0 ) ≤ Z0 , and let
10 T
Z vl −1 vl max Z0 , + ιT . (3.73)
11
The sufficient gain conditions for the subsequent Theorem 3.17 are given by
kc ι2 Z
ka12 > max ka1 ξ2 + , 3kc ι22 Z ,
4 νϕ
ka1 24 ka12
ξ1 > 2LF , kc > ,ψ> T,
λγ ξ2 kc 7
5 ka12 1
q > max , kc 8 , kc LF ξ1 ,
0 2
1 νϕ 1 ka12
T < min √ ,√ , √ , , (3.74)
6Lka12 6Lkc ϕ 2 nLF 6Lka12 + 8q1
    V̇L ≤ −(q/2) ‖e‖² − (kc/8) (W̃cᵀ ψ)² − (ka12/4) ‖W̃a‖² + ι.  (3.76)
2 8 4
The inequality in (3.76) is valid provided Z (t) ∈ Z.
Integrating (3.76), using the facts that −∫_t^{t+T} ‖e(τ)‖² dτ ≤ −T inf_{τ∈[t,t+T]} ‖e(τ)‖² and −∫_t^{t+T} ‖W̃a(τ)‖² dτ ≤ −T inf_{τ∈[t,t+T]} ‖W̃a(τ)‖², Lemmas 3.15 and 3.16, and the gain conditions in (3.74), yields
kc ψ7
2 0 qT
VL (Z (t + T ) , t + T ) − VL (Z (t) , t) ≤ − W̃c (t) − e (t)2 + 10 T
16 8
3 ka12 T
2
− W̃a (t) ,
16
3.3.4 Simulation
M q̈ + Vm q̇ + Fd q̇ + Fs = u, (3.77)
where q = [q1 q2]ᵀ and q̇ = [q̇1 q̇2]ᵀ are the angular positions in radians and the angular velocities in radians/s, respectively. In (3.77), M ∈ R^{2×2} denotes the inertia matrix and Vm ∈ R^{2×2} denotes the centripetal-Coriolis matrix, given by

    M ≜ [p1 + 2p3c2, p2 + p3c2; p2 + p3c2, p2],  Vm ≜ [−p3s2q̇2, −p3s2(q̇1 + q̇2); p3s2q̇1, 0],

where c2 = cos(q2), s2 = sin(q2), p1 = 3.473 kg·m², p2 = 0.196 kg·m², p3 = 0.242 kg·m², and Fd = diag([5.3, 1.1]) N·m·s and Fs(q̇) = [8.45 tanh(q̇1), 2.35 tanh(q̇2)]ᵀ N·m are the models for the dynamic and the static friction, respectively.
The objective is to find a policy μ that ensures that the state x ≜ [q1, q2, q̇1, q̇2]^T tracks the desired trajectory xd(t) = [0.5 cos(2t), 0.33 cos(3t), −sin(2t), −sin(3t)]^T, while minimizing the cost in (3.49), where Q = diag([10, 10, 2, 2]). Using (3.45)–(3.48) and the definitions

f(x) ≜ [ x3, x4, ( M^{-1}( (−Vm − Fd)[x3, x4]^T − Fs ) )^T ]^T,
g ≜ [ [0, 0]^T, [0, 0]^T, (M^{-1})^T ]^T,
gd^+ ≜ [ [0, 0]^T, [0, 0]^T, M(xd) ],
hd ≜ [ xd3, xd4, −4xd1, −9xd2 ]^T,   (3.78)

the optimal tracking problem can be transformed into the time-invariant form in (3.48).
In this effort, the basis selected for the value function approximation is a polynomial basis with 23 elements given by

σ(ζ) = [ (1/2)ζ1², (1/2)ζ2², ζ1ζ3, ζ1ζ4, ζ2ζ3, ζ2ζ4, ζ1²ζ2², ζ1²ζ5², ζ1²ζ6², ζ1²ζ7², ζ1²ζ8², ζ2²ζ5², ζ2²ζ6², ζ2²ζ7², ζ2²ζ8², ζ3²ζ5², ζ3²ζ6², ζ3²ζ7², ζ3²ζ8², ζ4²ζ5², ζ4²ζ6², ζ4²ζ7², ζ4²ζ8² ]^T.   (3.79)
The control gains are selected as ka1 = 5, ka2 = 0.001, kc = 1.25, λ = 0.001, and ν = 0.005. The initial conditions are x(0) = [1.8, 1.6, 0, 0]^T, Ŵc(0) = 10 × 1_{23×1}, Ŵa(0) = 6 × 1_{23×1}, and Γ(0) = 2000 × I_{23}. To ensure persistence of excitation, a probing signal

p(t) = [ 2.55 tanh(2t)( 20 sin(√232 πt) cos(√20 πt) + 6 sin(18e²t) + 20 cos(40t) cos(21t) ) ;
         0.01 tanh(2t)( 20 sin(√132 πt) cos(√10 πt) + 6 cos(8et) + 20 cos(10t) cos(11t) ) ]   (3.80)

is added to the control signal for the first 30 s of the simulation [14].
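The probing signal can be implemented as sketched below; the amplitudes and frequencies follow the expression in (3.80) as reproduced above, and the 30 s cutoff follows the text.

```python
import numpy as np

def probing_signal(t):
    """Probing signal p(t) of (3.80), applied only during the first 30 s of learning."""
    if t > 30.0:
        return np.zeros(2)
    p1 = 2.55*np.tanh(2*t)*(20*np.sin(np.sqrt(232)*np.pi*t)*np.cos(np.sqrt(20)*np.pi*t)
                            + 6*np.sin(18*np.e**2*t) + 20*np.cos(40*t)*np.cos(21*t))
    p2 = 0.01*np.tanh(2*t)*(20*np.sin(np.sqrt(132)*np.pi*t)*np.cos(np.sqrt(10)*np.pi*t)
                            + 6*np.cos(8*np.e*t) + 20*np.cos(10*t)*np.cos(11*t))
    return np.array([p1, p2])

# During the learning phase the applied input is the approximate policy plus p(t).
```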
It is clear from Figs. 3.7 and 3.8 that the system states are bounded during the learning phase and the algorithm converges to a stabilizing controller in the sense that the tracking errors go to zero when the probing signal is eliminated. Furthermore, Figs. 3.9 and 3.10 show that the weight estimates for the value function and the policy are bounded and that they converge. The neural network weights converge to the following values

Ŵc = Ŵa = [ 83.36, 2.37, 27.0, 2.78, −2.83, 0.20, 14.13, 29.81, 18.87, 4.11, 3.47, 6.69, 9.71, 15.58, 4.97, 12.42, 11.31, 3.29, 1.19, −1.99, 4.55, −0.47, 0.56 ]^T.   (3.81)
Note that the last sixteen weights that correspond to the terms containing the desired
trajectories ζ5 , . . . , ζ8 are non-zero. Thus, the resulting value function V and the
resulting policy μ depend on the desired trajectory, and hence, are time-varying
functions of the tracking error. Since the true weights are unknown, a direct compar-
ison of the weights in (3.81) with the true weights is not possible. Instead, to gauge
the performance of the presented technique, the state and the control trajectories
obtained using the estimated policy are compared with those obtained using Radau-
pseudospectral numerical optimal control computed using the GPOPS software [29].
3.3 Extension to Trajectory Tracking 71
Fig. 3.13 Control trajectories μ1 and μ2 obtained using the developed technique (ADP) and the numerical solution (GPOPS)

Fig. 3.14 Tracking error trajectories e1–e4 obtained using the developed technique (ADP) and the numerical solution (GPOPS)
The costates of the numerical solution that correspond to the desired trajectory states are nonzero. Since these costate variables represent the sensitivity of the cost with respect to the desired trajectories, this further supports the assertion that the optimal value function depends on the desired trajectory, and hence, is a time-varying function of the tracking error.
Figures 3.13 and 3.14 show the control and the tracking error trajectories obtained
from the developed technique (dashed lines) plotted alongside the numerical solution
obtained using GPOPS (solid lines). The trajectories obtained using the developed
technique are close to the numerical solution. The inaccuracies are a result of the facts
that the set of basis functions in (3.79) is not exact, and the proposed method attempts
to find the weights that generate the least total cost for the given set of basis functions.
The accuracy of the approximation can be improved by choosing a more appropriate
set of basis functions, or at an increased computational cost, by adding more basis
functions to the existing set in (3.79). The total cost obtained using the numerical
solution is found to be 75.42 and the total cost obtained using the developed method
is found to be 84.31. Note that from Figs. 3.13 and 3.14, it is clear that both the
tracking error and the control converge to zero after approximately 20 s, and hence,
the total cost obtained from the numerical solution is a good approximation of the
infinite-horizon cost.
3 Parts of the text in this section are reproduced, with permission, from [30], ©2015, IEEE.
ẋ(t) = f(x(t)) + Σ_{i=1}^{N} gi(x(t)) ui(t),   (3.82)

where x : R≥t0 → Rn is the state vector, ui : R≥t0 → R^{mi} are the control inputs, and f : Rn → Rn and gi : Rn → R^{n×mi} are the drift dynamics and the input matrices, respectively. Assume that g1, . . . , gN, and f are second-order differentiable, and that f(0) = 0 so that x = 0 is an equilibrium point for the uncontrolled dynamics in (3.82). Let

U ≜ { {φ1, . . . , φN} | φi : Rn → R^{mi}, i = 1, . . . , N, and {φ1, . . . , φN} is admissible for (3.82) }

be the set of all admissible tuples of feedback policies φi : Rn → R^{mi} (cf. [6]). Let Vi^{{φ1,...,φN}} : Rn → R≥0 denote the value function of the ith player with respect to the feedback policies {φ1, . . . , φN} ∈ U, defined as

Vi^{{φ1,...,φN}}(x) = ∫_t^∞ ri( x(τ; t, x), φ1(x(τ; t, x)), . . . , φN(x(τ; t, x)) ) dτ,   (3.83)

ui*(x) = −(1/2) Rii^{-1} gi^T(x) (∇x Vi*(x))^T,   (3.84)
where the value functions V1*, . . . , VN* satisfy the coupled Hamilton–Jacobi equations

0 = x^T Qi x + Σ_{j=1}^{N} (1/4) ∇x Vj*(x) Gij(x) ∇x Vj*^T(x) + ∇x Vi*(x) f(x) − (1/2) Σ_{j=1}^{N} ∇x Vi*(x) Gj(x) ∇x Vj*^T(x).   (3.85)

where

Fu(x, u1, . . . , uN) ≜ f(x) + Σ_{j=1}^{N} gj(x) uj ∈ Rn.   (3.87)
Replacing the optimal Jacobian ∇x Vi* and the optimal control policies ui* by parametric estimates ∇x V̂i(x, Ŵci) and ûi(x, Ŵai), respectively, where Ŵci and Ŵai are the estimates of the unknown parameters, yields the Bellman error

δi(x, Ŵci, Ŵa1, . . . , ŴaN) = ri( x, û1(x, Ŵa1), . . . , ûN(x, ŴaN) ) + ∇x V̂i(x, Ŵci) Fu( x, û1(x, Ŵa1), . . . , ûN(x, ŴaN) ).   (3.88)
where ui : R≥t0 → Rmi are the control inputs, and the state x : R≥t0 → Rn is assumed
to be available for feedback. The following assumptions about the system will be
used in the subsequent development.
Assumption 3.18 The input matrices g1 and g2 are known and bounded according to the inequalities ‖g1(x)‖ ≤ ḡ1 and ‖g2(x)‖ ≤ ḡ2, for all x ∈ Rn, where ḡ1 and ḡ2 are known positive constants.
Assumption 3.19 The control inputs u1 (·) and u2 (·) are bounded (i.e., u1 (·) ,
u2 (·) ∈ L∞ ). This assumption facilitates the design of the state-derivative estimator,
and is relaxed in Sect. 3.4.5.
Based on Property 2.3, the nonlinear system in (3.90) can be represented using a multi-layer neural network as

ẋ(t) = Wf^T σf( Vf^T x(t) ) + εf(x(t)) + g1(x(t)) u1(t) + g2(x(t)) u2(t) ≜ Fu(x(t), u1(t), u2(t)),   (3.91)

where Wf ∈ R^{(Lf+1)×n} and Vf ∈ R^{n×Lf} are unknown ideal neural network weight matrices with Lf ∈ N representing the number of neurons in the output layer. In (3.91), σf : R^{Lf} → R^{Lf+1} is the vector of basis functions, and εf : Rn → Rn is the function reconstruction error in approximating the function f. The proposed dynamic neural network used to identify the system in (3.90) is

x̂˙(t) = Ŵf^T(t) σf( V̂f^T(t) x̂(t) ) + g1(x(t)) u1(t) + g2(x(t)) u2(t) + μ(t) ≜ F̂u(x(t), x̂(t), u1(t), u2(t)),   (3.92)
where x̂ : R≥t0 → Rn is the state of the dynamic neural network, Ŵf : R≥t0 → R^{(Lf+1)×n} and V̂f : R≥t0 → R^{n×Lf} are the estimates of the ideal weights of the neural networks, and μ : R≥t0 → Rn denotes the RISE feedback term (cf. [34]), where k, α, γ, β1 ∈ R are positive constant gains and sgn(·) denotes a vector signum function.
The identification error dynamics are developed by taking the time derivative of (3.94) and substituting for (3.91) and (3.92) as

x̃˙(t) = Wf^T σf( Vf^T x(t) ) − Ŵf^T(t) σf( V̂f^T(t) x̂(t) ) + εf(x(t)) − μ(t).   (3.95)

+ ε̇f(x(t), ẋ(t)) − k ef(t) − γ x̃(t) − β1 sgn(x̃(t)) + α x̃˙(t).   (3.97)
The weight update laws for the dynamic neural network in (3.92) are developed based on the subsequent stability analysis as

Ŵ˙f(t) = proj( Γwf ∇σf( V̂f^T(t) x̂(t) ) V̂f^T(t) x̂˙(t) x̃^T(t) ),
V̂˙f(t) = proj( Γvf x̂˙(t) x̃^T(t) Ŵf^T(t) ∇σf( V̂f^T(t) x̂(t) ) ),   (3.98)

where proj(·) is a smooth projection operator [35, 36], and Γwf ∈ R^{(Lf+1)×(Lf+1)}, Γvf ∈ R^{n×n} are positive constant adaptation gain matrices. Adding and subtracting (1/2)Wf^T ∇σf( V̂f^T(t) x̂(t) ) V̂f^T(t) x̂˙(t) + (1/2)Ŵf^T(t) ∇σf( V̂f^T(t) x̂(t) ) Vf^T x̂˙(t), the error system in (3.97) can be expressed as
ėf(t) = Ñ(t) + NB1(t) + N̂B2(t) − k ef(t) − γ x̃(t) − β1 sgn(x̃(t)),   (3.99)

where the auxiliary signals Ñ, NB1, and N̂B2 : R≥t0 → Rn in (3.99) are defined as

Ñ(t) ≜ α x̃˙(t) − Ŵ˙f^T(t) σf( V̂f^T(t) x̂(t) ) − Ŵf^T(t) ∇σf( V̂f^T(t) x̂(t) ) V̂˙f^T(t) x̂(t)
  + (1/2) Wf^T ∇σf( V̂f^T(t) x̂(t) ) V̂f^T(t) x̃˙(t) + (1/2) Ŵf^T(t) ∇σf( V̂f^T(t) x̂(t) ) Vf^T x̃˙(t),   (3.100)

NB1(t) ≜ Wf^T ∇σf( Vf^T x(t) ) Vf^T ẋ(t) − (1/2) Wf^T ∇σf( V̂f^T(t) x̂(t) ) V̂f^T(t) ẋ(t)
  − (1/2) Ŵf^T(t) ∇σf( V̂f^T(t) x̂(t) ) Vf^T ẋ(t) + ε̇f(x(t), ẋ(t)),   (3.101)

N̂B2(t) ≜ (1/2) W̃f^T(t) ∇σf( V̂f^T(t) x̂(t) ) V̂f^T(t) x̂˙(t) + (1/2) Ŵf^T(t) ∇σf( V̂f^T(t) x̂(t) ) Ṽf^T(t) x̂˙(t).   (3.102)
where z(t) ≜ [x̃^T(t), ef^T(t)]^T ∈ R^{2n}, ∀t ∈ R≥t0, ρ1, ρ2 : R → R are positive, strictly increasing functions, and ζi ∈ R, i = 1, . . . , 6 are positive constants. To facilitate the subsequent stability analysis, let the auxiliary signal y : R≥t0 → R^{2n+2} be defined as

y(t) ≜ [ x̃^T(t), ef^T(t), √P(t), √Q(t) ]^T,   ∀t ∈ R≥t0,   (3.106)

where the auxiliary signal P : R≥t0 → R is the Filippov solution to the initial value problem [11]

Ṗ(t) = β2 ρ2(‖z(t)‖) ‖z(t)‖ ‖x̃(t)‖ − ef^T(t)( NB1(t) − β1 sgn(x̃(t)) ) − x̃˙^T(t) NB2(t),
P(t0) = β1 Σ_{i=1}^{n} |x̃i(t0)| − x̃^T(t0) NB(t0),   (3.107)

β1 > max( ζ1 + ζ2, ζ1 + ζ3/α ),   β2 > ζ4,   (3.108)
such that P(t) ≥ 0 for all t ∈ [0, ∞) (see Appendix A.1.1). The auxiliary function Q : R^{n(2Lf+1)} → R in (3.106) is defined as Q ≜ (α/4)[ tr(W̃f^T Γwf^{-1} W̃f) + tr(Ṽf^T Γvf^{-1} Ṽf) ]. Let D ⊂ R^{2n+2} be the open and connected set defined as D ≜ { y ∈ R^{2n+2} | ‖y‖ < inf( ρ^{-1}([2√(λη), ∞)) ) }, where λ and η are defined in Appendix A.1.7. Let D̄ be the compact set D̄ ≜ { y ∈ R^{2n+2} | ‖y‖ ≤ inf( ρ^{-1}([2√(λη), ∞)) ) }. Let VI : D → R be a positive-definite, locally Lipschitz, regular function defined as

VI(y) ≜ (1/2) ef^T ef + (1/2) γ x̃^T x̃ + P + Q.   (3.109)

The candidate Lyapunov function in (3.109) satisfies the inequalities U1(y) ≤ VI(y) ≤ U2(y), where U1(y) ≜ (1/2) min(1, γ)‖y‖² and U2(y) ≜ max(1, γ)‖y‖².
Additionally, let S ⊂ D denote a set defined as S ≜ { y ∈ D | ρ(√(2U2(y))) < 2√(λη) }, and let

ẏ(t) = h(y(t), t)   (3.111)

represent the closed-loop differential equations in (3.95), (3.98), (3.99), and (3.107), where h : R^{2n+2} × R≥t0 → R^{2n+2} denotes the right-hand side of the closed-loop error signals.
Theorem 3.20 For the system in (3.90), the identifier developed in (3.92) along with
the weight update laws in (3.98) ensures asymptotic identification of the state and
its derivative, in the sense that
lim_{t→∞} ‖x̃(t)‖ = 0,   lim_{t→∞} ‖x̃˙(t)‖ = 0,
provided Assumptions 3.18 and 3.19 hold, and the control gains k and γ are selected
sufficiently large based on the initial conditions of the states, and satisfy the following
sufficient conditions
αγ > ζ5 , k > ζ6 , (3.112)
Using Property 2.3 and (3.84), the optimal value function and the optimal controls can be represented by neural networks as

Vi*(x) = Wi^T σi(x) + εi(x),   ui*(x) = −(1/2) Rii^{-1} gi^T(x) ( ∇x σi^T(x) Wi + ∇x εi^T(x) ),   (3.113)

where Wi ∈ R^{Li} are unknown constant ideal neural network weights, Li is the number of neurons, σi = [σi1, σi2, . . . , σiLi]^T : Rn → R^{Li} are smooth neural network activation functions such that σi(0) = 0 and ∇xσi(0) = 0, and εi : Rn → R are the function reconstruction errors. Using Property 2.3, both Vi* and ∇xVi* can be uniformly approximated by the neural networks in (3.113) (i.e., as Li → ∞, the approximation errors εi, ∇xεi → 0 for i = 1, 2, respectively). The critic V̂ and the actor û approximate the optimal value function and the optimal controls in (3.113), and are given as
ûi(x, Ŵai) = −(1/2) Rii^{-1} gi^T(x) ∇x σi^T(x) Ŵai,
V̂i(x, Ŵci) = Ŵci^T σi(x),   (3.114)
where Ŵci : R≥t0 → RLi and Ŵai : R≥t0 → RLi are estimates of the ideal weights of
the critic and actor neural networks, respectively. The weight estimation errors for
the critic and actor are defined as W̃ci (t) Wi − Ŵci (t) and W̃ai (t) Wi − Ŵai (t)
for i = 1, 2, respectively.
Least-Squares Update for the Critic
The recursive formulation of the normalized least-squares algorithm is used to derive the update laws for the two critic weights as

Ŵ˙ci(t) = −kci Γci(t) ( ωi(t)/(1 + νi ωi^T(t) Γci(t) ωi(t)) ) δ̂ti(t),   (3.115)

where ωi : R≥t0 → R^{Li}, defined as ωi(t) ≜ ∇x σi(x(t)) F̂u( x(t), x̂(t), û1(x(t), Ŵa1(t)), û2(x(t), Ŵa2(t)) ) for i = 1, 2, is the critic neural network regressor vector, νi, kci ∈ R are constant positive gains, and δ̂ti : R≥t0 → R denotes the evaluation of the approximate Bellman error in (3.89) along the system trajectories, defined as δ̂ti(t) ≜ δ̂i( x(t), x̂˙(t), Ŵci(t), Ŵa1(t), Ŵa2(t) ). In (3.115), Γci : R≥t0 → R^{Li×Li} for i = 1, 2, are symmetric estimation gain matrices generated by

Γ˙ci(t) = −kci( −λi Γci(t) + Γci(t) ( ωi(t) ωi^T(t)/(1 + νi ωi^T(t) Γci(t) ωi(t)) ) Γci(t) ),   (3.116)

where λ1, λ2 ∈ (0, 1) are forgetting factors. The use of forgetting factors ensures that Γc1 and Γc2 are positive-definite for all time and prevents their values from becoming arbitrarily small in some directions, which would make adaptation in those directions very slow. Thus, the covariance matrices (Γc1, Γc2) can be bounded as

ϕ11 I_{L1} ≤ Γc1(t) ≤ ϕ01 I_{L1},   ϕ12 I_{L2} ≤ Γc2(t) ≤ ϕ02 I_{L2}.   (3.117)
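A discrete-time sketch of one step of (3.115) and (3.116) is given below; the explicit Euler discretization is an implementation choice, and the example gains echo those used in the simulation section.

```python
import numpy as np

def critic_ls_step(Wc, Gamma, omega, delta_t, kc, lam, nu, dt):
    """One Euler step of the normalized least-squares critic update (3.115) and of the
    covariance update with forgetting factor (3.116). Variable names are illustrative."""
    denom = 1.0 + nu * omega @ Gamma @ omega
    Wc_dot = -kc * (Gamma @ omega) * delta_t / denom
    Gamma_dot = -kc * (-lam * Gamma + Gamma @ np.outer(omega, omega) @ Gamma / denom)
    return Wc + dt * Wc_dot, Gamma + dt * Gamma_dot

# Example: a single update for a 3-neuron critic with the gains used in Sect. 3.4.6.
Wc = np.array([3.0, 3.0, 3.0])
Gamma = 5000.0 * np.eye(3)
Wc, Gamma = critic_ls_step(Wc, Gamma, omega=np.array([0.1, -0.2, 0.05]),
                           delta_t=0.4, kc=50.0, lam=0.03, nu=0.001, dt=0.001)
```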
The dynamics of the critic weight estimation errors W̃c1 and W̃c2 can be developed as

W̃˙c1 = kc1 Γc1 ( ω1/(1 + ν1 ω1^T Γc1 ω1) ) [ −W̃c1^T ∇xσ1 Fu* − W1^T ∇xσ1 F̃û − u1*^T R11 u1* − ε1v + û1^T R11 û1 + W1^T ∇xσ1( g1(û1 − u1*) + g2(û2 − u2*) ) − u2*^T R12 u2* + û2^T R12 û2 ],

and

W̃˙c2 = kc2 Γc2 ( ω2/(1 + ν2 ω2^T Γc2 ω2) ) [ −W̃c2^T ∇xσ2 Fu* − W2^T ∇xσ2 F̃û − u2*^T R22 u2* − ε2v + û2^T R22 û2 + W2^T ∇xσ2( g1(û1 − u1*) + g2(û2 − u2*) ) − u1*^T R21 u1* + û1^T R21 û1 ].   (3.119)
Substituting for u1*, u2* and û1, û2 from (3.113) and (3.114), respectively, in (3.119) yields

W̃˙c1 = −kc1 Γc1 ψ1 ψ1^T W̃c1 + kc1 Γc1 ( ω1/(1 + ν1 ω1^T Γc1 ω1) ) [ −W1^T ∇xσ1 F̃û + (1/4) W̃a2^T ∇xσ2 G12 ∇xσ2^T W̃a2 − (1/4) ∇xε2 G12 ∇xε2^T + (1/4) W̃a1^T ∇xσ1 G1 ∇xσ1^T W̃a1 + (1/2)( W̃a2^T ∇xσ2 + ∇xε2 )( G2 ∇xσ1^T W1 − G12 ∇xσ2^T W2 ) − (1/4) ∇xε1 G1 ∇xε1^T − ∇xε1 Fu* ],

W̃˙c2 = −kc2 Γc2 ψ2 ψ2^T W̃c2 + kc2 Γc2 ( ω2/(1 + ν2 ω2^T Γc2 ω2) ) [ −W2^T ∇xσ2 F̃û + (1/4) W̃a1^T ∇xσ1 G21 ∇xσ1^T W̃a1 − (1/4) ∇xε1 G21 ∇xε1^T + (1/4) W̃a2^T ∇xσ2 G2 ∇xσ2^T W̃a2 + (1/2)( W̃a1^T ∇xσ1 + ∇xε1 )( G1 ∇xσ2^T W2 − G21 ∇xσ1^T W1 ) − (1/4) ∇xε2 G2 ∇xε2^T − ∇xε2 Fu* ],   (3.120)
‖ψ1‖ ≤ 1/√(ν1 ϕ11),   ‖ψ2‖ ≤ 1/√(ν2 ϕ12),   (3.121)
where ϕ11 and ϕ12 are introduced in (3.117). The error systems in (3.120) can be
represented as the following perturbed systems
W̃˙c1 = Ω1 + Λ01 Δ1,   W̃˙c2 = Ω2 + Λ02 Δ2,   (3.122)

where Ωi(W̃ci, t) ≜ −ηci Γci(t) ψi(t) ψi^T(t) W̃ci ∈ R^{Li} denotes the nominal system, Λ0i ≜ ηci Γci ωi/(1 + νi ωi^T Γci ωi) denotes the perturbation gain, and the perturbations Δi ∈ R^{Li} are denoted as

Δi ≜ −Wi^T ∇xσi F̃û + (1/4) W̃ai^T ∇xσi Gi ∇xσi^T W̃ai − ∇xεi Fu* + (1/4) W̃ak^T ∇xσk Gik ∇xσk^T W̃ak − (1/4) ∇xεk Gik ∇xεk^T − (1/4) ∇xεi Gi ∇xεi^T + (1/2)( W̃ak^T ∇xσk + ∇xεk )( Gk ∇xσi^T Wi − Gik ∇xσk^T Wk ),

where i = 1, 2 and k = 3 − i. Using Theorem 2.5.1 in [15], it can be shown that the nominal systems

W̃˙c1 = −kc1 Γc1 ψ1 ψ1^T W̃c1,   W̃˙c2 = −kc2 Γc2 ψ2 ψ2^T W̃c2,   (3.123)

are exponentially stable if the bounded signals (ψ1(t), ψ2(t)) are uniformly persistently exciting over the compact set χ × D × B_{W1} × B_{W2}, i.e., [17]

μi1 I_{Li} ≤ ∫_{t0}^{t0+δi} ψi(τ) ψi^T(τ) dτ ≤ μi2 I_{Li},   ∀t0 ≥ 0,

where μi1, μi2, δi ∈ R are positive constants independent of the initial conditions.
Since Ωi is continuously differentiable in W̃ci and the Jacobian ∇W̃ci Ωi = −ηci Γci ψi
ψiT is bounded for the exponentially stable system (3.123) for i = 1, 2, the Converse
Lyapunov Theorem 4.14 in [18] can be used to show that there exists a function
Vc : R^{L1} × R^{L2} × [0, ∞) → R, which satisfies the following inequalities

c11 ‖W̃c1‖² + c12 ‖W̃c2‖² ≤ Vc(W̃c1, W̃c2, t) ≤ c21 ‖W̃c1‖² + c22 ‖W̃c2‖²,
∂Vc/∂t + (∂Vc/∂W̃c1) Ω1(W̃c1, t) + (∂Vc/∂W̃c2) Ω2(W̃c2, t) ≤ −c31 ‖W̃c1‖² − c32 ‖W̃c2‖²,
‖∂Vc/∂W̃c1‖ ≤ c41 ‖W̃c1‖,   ‖∂Vc/∂W̃c2‖ ≤ c42 ‖W̃c2‖,   (3.124)
for some positive constants c1i, c2i, c3i, c4i ∈ R for i = 1, 2. Using Property 2.3, Assumption 3.18, the projection bounds in (3.118), the fact that t → Fu( x(t), u1*(x(t)), u2*(x(t)) ) ∈ L∞ over compact sets (using (3.91)), and provided the conditions of Theorem 3.20 hold (required to prove that t → F̃û( x(t), x̂(t), û(x(t), Ŵa(t)) ) ∈ L∞), the following bounds are developed to facilitate the subsequent stability proof

ι1 ≥ ‖W̃a1‖,   ι2 ≥ ‖W̃a2‖,
ι3 ≥ ‖∇xσ1 G1 ∇xσ1^T‖,   ι4 ≥ ‖∇xσ2 G2 ∇xσ2^T‖,
ι5 ≥ ‖Δ1‖,   ι6 ≥ ‖Δ2‖,
ι7 ≥ (1/4)‖G1 − G21‖‖∇xV1*‖² + (1/4)‖G2 − G12‖‖∇xV2*‖² + (1/2)‖∇xV1* (G2 + G1) ∇xV2*^T‖,
ι8 ≥ ‖ −(1/2)( ∇xV1* − ∇xV2* )( G1 ∇xσ1^T Wa1 − G2 ∇xσ2^T Wa2 ) + (1/2)( ∇xV1* − ∇xV2* )( G1 ∇xσ1^T W̃a1 − G2 ∇xσ2^T W̃a2 ) ‖,
ι9 ≥ ‖∇xσ1 G21 ∇xσ1^T‖,   ι10 ≥ ‖∇xσ2 G1 ∇xσ1^T‖,
ι11 ≥ ‖∇xσ1 G2 ∇xσ2^T‖,   ι12 ≥ ‖∇xσ2 G12 ∇xσ2^T‖,   (3.125)
Theorem 3.21 If Assumptions 3.18 and 3.19 hold, the regressors ψi for i = 1, 2
are uniformly persistently exciting, and provided (3.108), (3.112), and the following
sufficient gain conditions are satisfied
where ka11 , ka21 , c31 , c32 , ι1 , ι2 , ι3 , and ι4 are introduced in (3.118), (3.124), and
(3.125), then the controller in (3.114), the actor-critic weight update laws in (3.115)–
(3.116) and (3.118), and the identifier in (3.92) and (3.98), guarantee that the state
of the system, x (·), and the actor-critic weight estimation errors, W̃a1 (·) , W̃a2 (·)
and W̃c1 (·) , W̃c2 (·) , are uniformly ultimately bounded.
Proof To investigate the stability of (3.90) with control inputs û1 and û2 , and the
perturbed system (3.122), consider VL : χ × RL1 × RL1 × RL2 × RL2 × [0, ∞) →
R as the continuously differentiable, positive-definite Lyapunov function candidate,
given as
VL( x, W̃c1, W̃c2, W̃a1, W̃a2, t ) ≜ V1*(x) + V2*(x) + Vc(W̃c1, W̃c2, t) + (1/2) W̃a1^T W̃a1 + (1/2) W̃a2^T W̃a2,
where Vi∗ for i = 1, 2 (the optimal value function for (3.90)), is the Lyapunov function
for (3.90), and Vc is the Lyapunov function for the exponentially stable system in
(3.123). Since V1∗ , V2∗ are continuously differentiable and positive-definite, [18,
Lemma 4.3] implies that there exist class K functions α1 and α2 defined on [0, r],
where Br ⊂ X , such that
α1(‖x‖) + c11 ‖W̃c1‖² + c12 ‖W̃c2‖² + (1/2)( ‖W̃a1‖² + ‖W̃a2‖² ) ≤ VL ≤ α2(‖x‖) + c21 ‖W̃c1‖² + c22 ‖W̃c2‖² + (1/2)( ‖W̃a1‖² + ‖W̃a2‖² ).
V̇L = ( ∇xV1* + ∇xV2* )( f + g1 û1 + g2 û2 ) + ∂Vc/∂t + (∂Vc/∂W̃c1)( Ω1 + Λ01 Δ1 ) + (∂Vc/∂W̃c2)( Ω2 + Λ02 Δ2 ) − W̃a1^T Ŵ˙a1 − W̃a2^T Ŵ˙a2,   (3.127)
where the time derivatives of Vi* for i = 1, 2 are taken along the trajectories of the system (3.90) with control inputs û1, û2, and the time derivative of Vc is taken along the trajectories of the perturbed system (3.122). Using (3.86), ∇xVi* f = −∇xVi*( g1 u1* + g2 u2* ) − Qi(x) − Σ_{j=1}^{2} uj*^T Rij uj* for i = 1, 2. Substituting for the ∇xVi* f terms in (3.127), using the fact that ∇xVi* gi = −2ui*^T Rii from (3.84), and using (3.118) and (3.124), (3.127) can be upper bounded as
V̇L ≤ −Q − u1*^T( R11 + R21 ) u1* − u2*^T( R22 + R12 ) u2* + 2u1*^T R11 ( u1* − û1 ) + 2u2*^T R22 ( u2* − û2 ) + ∇xV1* g2 ( û2 − u2* ) + ∇xV2* g1 ( û1 − u1* )
  + c41 ‖Λ01‖ ‖W̃c1‖ ‖Δ1‖ − c31 ‖W̃c1‖² + c42 ‖Λ02‖ ‖W̃c2‖ ‖Δ2‖ − c32 ‖W̃c2‖²
  + W̃a1^T [ ( ka11/(1 + ω1^T ω1) ) ∂Ea/∂Ŵa1 + ka12 ( Ŵa1 − Ŵc1 ) ]
  + W̃a2^T [ ( ka21/(1 + ω2^T ω2) ) ∂Ea/∂Ŵa2 + ka22 ( Ŵa2 − Ŵc2 ) ],   (3.128)
V̇L ≤ (1/4)‖G1 − G21‖‖∇xV1*‖² + (1/4)‖G2 − G12‖‖∇xV2*‖² + (1/2) ∇xV1* ( G1 + G2 ) ∇xV2*^T − Q
  − (1/2)( ∇xV1* − ∇xV2* )( G1 ∇xσ1^T Wa1 − G2 ∇xσ2^T Wa2 )
  + (1/2)( ∇xV1* − ∇xV2* )( G1 ∇xσ1^T W̃a1 − G2 ∇xσ2^T W̃a2 )
  + c41 ( kc1 ϕ01/(2√(ν1 ϕ11)) ) ‖Δ1‖ ‖W̃c1‖ − c31 ‖W̃c1‖²
  + c42 ( kc2 ϕ02/(2√(ν2 ϕ12)) ) ‖Δ2‖ ‖W̃c2‖ − c32 ‖W̃c2‖²
  + ka12 ‖W̃a1‖ ‖W̃c1‖ + ka22 ‖W̃a2‖ ‖W̃c2‖ − ka12 ‖W̃a1‖² − ka22 ‖W̃a2‖²
  + ( ka11/(1 + ω1^T ω1) ) [ W̃a1^T W̃c1 − W̃a1^T ∇xσ1 G1 ∇xσ1^T ( −W̃c1^T ω1 + Δ1 )
      + ( W̃a1^T ∇xσ1 G21 − W̃c2^T ∇xσ2 G2 ) ∇xσ1^T ( −W̃c2^T ω2 + Δ2 )
      + ( W1^T ∇xσ1 G21 − W2^T ∇xσ2 G1 ) ∇xσ1^T ( −W̃c2^T ω2 + Δ2 ) ]
  + ( ka21/(1 + ω2^T ω2) ) [ W̃a2^T W̃c2 − W̃a2^T ∇xσ2 G2 ∇xσ2^T ( −W̃c2^T ω2 + Δ2 )
      + ( W̃a2^T ∇xσ2 G12 − W̃c1^T ∇xσ1 G2 ) ∇xσ2^T ( −W̃c1^T ω1 + Δ1 )
      + ( W2^T ∇xσ2 G12 − W1^T ∇xσ1 G2 ) ∇xσ2^T ( −W̃c1^T ω1 + Δ1 ) ].   (3.129)
Using the bounds developed in (3.125), (3.129) can be further upper bounded as

V̇L ≤ −Q − ( c31 − ka11 ι1 ι3 − ka21 ι2 ι11 )‖W̃c1‖² − ka12 ‖W̃a1‖² − ( c32 − ka21 ι2 ι4 − ka11 ι1 ι10 )‖W̃c2‖² + σ2 ‖W̃c2‖ − ka22 ‖W̃a2‖² + σ1 ‖W̃c1‖
  + ka11 ι1 ( ι1( ι3 ι5 + ι6 ι9 ) + ι6( W̄1 ι9 + W̄2 ι10 ) ) + ka21 ι2 ( ι2( ι4 ι6 + ι5 ι12 ) + ι5( W̄1 ι11 + W̄2 ι12 ) ) + ι7 + ι8,
where

σ1 ≜ ( c41 kc1 ϕ01/(2√(ν1 ϕ11)) ) ι5 + ka11 ( ι1 ι3 ( ι1 + ι5 ) ) + ka21 ι2 ( ι11( ι5 + W̄1 ) + ι12( ι2 + W̄2 ) ) + ka12 ι1,
σ2 ≜ ( c42 kc2 ϕ02/(2√(ν2 ϕ12)) ) ι6 + ka21 ( ι2 ι4 ( ι2 + ι6 ) ) + ka11 ι1 ( ι9( ι1 + W̄1 ) + ι10( ι6 + W̄2 ) ) + ka22 ι2.
Provided c31 > ka11 ι1 ι3 + ka21 ι2 ι11 and c32 > ka21 ι2 ι4 + ka11 ι1 ι10, completion of the squares yields

V̇L ≤ −Q − ka22 ‖W̃a2‖² − ka12 ‖W̃a1‖²
  − (1 − θ1)( c31 − ka11 ι1 ι3 − ka21 ι2 ι11 )‖W̃c1‖²
  − (1 − θ2)( c32 − ka21 ι2 ι4 − ka11 ι1 ι10 )‖W̃c2‖²
  + ka11 ι1 ( ι1( ι3 ι5 + ι6 ι9 ) + ι6( W̄1 ι9 + W̄2 ι10 ) )
  + ka21 ι2 ( ι2( ι4 ι6 + ι5 ι12 ) + ι5( W̄1 ι11 + W̄2 ι12 ) )
  + σ1²/( 4θ1( c31 − ka11 ι1 ι3 − ka21 ι2 ι11 ) ) + ι7
  + σ2²/( 4θ2( c32 − ka21 ι2 ι4 − ka11 ι1 ι10 ) ) + ι8,   (3.130)
where

F(w) = Q + ka12 ‖W̃a1‖² + (1 − θ1)( c31 − ka11 ι1 ι3 − ka21 ι2 ι11 )‖W̃c1‖² + (1 − θ2)( c32 − ka21 ι2 ι4 − ka11 ι1 ι10 )‖W̃c2‖² + ka22 ‖W̃a2‖²,
Using (3.131), the expression in (3.130) can be further upper bounded as V̇L ≤ −α5(‖w‖) + Υ, where

Υ = ka11 ι1 ( ι1( ι3 ι5 + ι6 ι9 ) + ι6( W̄1 ι9 + W̄2 ι10 ) ) + σ1²/( 4θ1( c31 − ka11 ι1 ι3 − ka21 ι2 ι11 ) )
  + ka21 ι2 ( ι2( ι4 ι6 + ι5 ι12 ) + ι5( W̄1 ι11 + W̄2 ι12 ) ) + σ2²/( 4θ2( c32 − ka21 ι2 ι4 − ka11 ι1 ι10 ) )
  + ι7 + ι8,
3.4.6 Simulations
Fig. 3.15 Evolution of the system states, state derivative estimates, and control signals for the two-player nonzero-sum game, with persistently excited input for the first six seconds (reproduced with permission from [30], ©2015, IEEE)
f(x) = [ x2 − 2x1 ; −(1/2)x1 − x2 + (1/4)x2( cos(2x1) + 2 )² + (1/4)x2( sin(4x1²) + 2 )² ],
g1(x) = [ 0, cos(2x1) + 2 ]^T,   (3.132)
g2(x) = [ 0, sin(4x1²) + 2 ]^T.   (3.133)
Fig. 3.16 Convergence of actor and critic weights for player 1 and player 2 in the nonzero-sum game (reproduced with permission from [30], ©2015, IEEE)
function for the identifier dynamic neural network is selected as a symmetric sig-
moid with 5 neurons in the hidden layer. The identifier gains are selected as k = 300,
α = 200, γ = 5, β1 = 0.2, Γwf = 0.1I6 , and Γvf = 0.1I2 , and the gains of the actor-
critic learning laws are selected as ka11 = ka12 = 10, ka21 = ka22 = 20, kc1 = 50,
kc2 = 10, ν1 = ν2 = 0.001, and λ1 = λ2 = 0.03. The covariance matrix is initialized to Γ(0) = 5000I3, the neural network weights for the state derivative estimator are randomly initialized with values in [−1, 1], the weights for the actor and the critic are initialized to [3, 3, 3]^T, the state estimates are initialized to zero, and the states are initialized to x(0) = [3, −1]^T. Similar to results such as [9, 14, 32, 39,
40], a small amplitude exploratory signal (noise) is added to the control to excite the
states for the first six seconds of the simulation, as seen from the evolution of states
and control in Fig. 3.15. The identifier approximates the system dynamics, and the
state derivative estimation error is shown in Fig. 3.15. The time histories of the critic
neural network weights and the actor neural network weights are given in Fig. 3.16,
where solid lines denote the weight estimates and dotted lines denote the true values
of the weights. Persistence of excitation ensures that the weights converge to their
known ideal values in less than five seconds of simulation. The use of two separate
neural networks facilitates the design of least-squares-based update laws in (3.115).
The least-squares-based update laws result in a performance benefit over single neu-
ral network-based results such as [40], where the convergence of weights is obtained
after about 250 s of simulation.
Three Player Game
To demonstrate the performance of the developed technique in the multi-player case,
the two player simulation is augmented with another actor. The resulting dynamics
are ẋ = f (x) + g1 (x) u1 + g2 (x) u2 + g3 (x) u3 , where
f(x) = [ x2 − 2x1 ; −(1/2)x1 − x2 + (1/4)x2( cos(2x1) + 2 )² + (1/4)x2( sin(4x1²) + 2 )² + (1/4)x2( cos(4x1²) + 2 )² ],
g3(x) = [ 0, cos(4x1²) + 2 ]^T,   (3.134)
and g1 and g2 are the same as (3.133). Figure 3.17 demonstrates the convergence
of the actor and the critic weights. Since the feedback-Nash equilibrium solution is
unknown for the dynamics in (3.134), the obtained weights are not compared against
their true values. Figure 3.18 demonstrates the regulation of the system states and
the state derivative estimation error to the origin, and the boundedness of the control
signals.
n(t) = sin(5πt) + sin(et) + sin⁵(t) + cos⁵(20t) + sin²(−1.2t) cos(0.5t).
Fig. 3.17 Convergence of actor and critic weights for the three-player nonzero-sum game (reproduced with permission from [30], ©2015, IEEE)
Fig. 3.18 Evolution of the system states, state derivative estimates, and control signals for the three-player nonzero-sum game, with persistently excited input for the first six seconds (reproduced with permission from [30], ©2015, IEEE)
game in [32] and [40] for nonlinear continuous-time systems with known dynamics.
Furthermore, [75] presents a policy iteration method for an infinite-horizon two-
player zero-sum Nash game with unknown nonlinear continuous-time dynamics.
Recent results also focus on the development of data-driven approximate dynamic
programming methods for set-point regulation, trajectory tracking, and differential
games to relax the persistence of excitation conditions. These methods are surveyed
at the end of the next chapter.
References
1. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural modeling. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive approaches, vol 15. Van Nostrand Reinhold, New York, pp 493–525
2. Hopfield J (1984) Neurons with graded response have collective computational properties like
those of two-state neurons. Proc Nat Acad Sci USA 81(10):3088
3. Kirk D (2004) Optimal Control Theory: An Introduction. Dover, Mineola, NY
4. Lewis FL, Vrabie D, Syrmos VL (2012) Optimal Control, 3rd edn. Wiley, Hoboken
5. Case J (1969) Toward a theory of many player differential games. SIAM J Control 7:179–197
6. Starr A, Ho CY (1969) Nonzero-sum differential games. J Optim Theory App 3(3):184–206
7. Starr A, Ho CY (1969) Further properties of nonzero-sum differential games. J Optim Theory App 4:207–219
8. Friedman A (1971) Differential games. Wiley, Hoboken
9. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
10. Xian B, Dawson DM, de Queiroz MS, Chen J (2004) A continuous asymptotic tracking control
strategy for uncertain nonlinear systems. IEEE Trans Autom Control 49(7):1206–1211
11. Patre PM, MacKunis W, Kaiser K, Dixon WE (2008) Asymptotic tracking for uncertain
dynamic systems via a multilayer neural network feedforward and RISE feedback control
structure. IEEE Trans Autom Control 53(9):2180–2185
12. Filippov AF (1988) Differential equations with discontinuous right-hand sides. Kluwer Aca-
demic Publishers, Dordrecht
13. Kamalapurkar R, Rosenfeld JA, Klotz J, Downey RJ, Dixon WE (2014) Supporting lemmas
for RISE-based control methods. arXiv:1306.3432
14. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
15. Sastry S, Bodson M (1989) Adaptive control: stability, convergence, and robustness. Prentice-
Hall, Upper Saddle River
16. Panteley E, Loria A, Teel A (2001) Relaxed persistency of excitation for uniform asymptotic
stability. IEEE Trans Autom Control 46(12):1874–1886
17. Loría A, Panteley E (2002) Uniform exponential stability of linear time-varying systems:
revisited. Syst Control Lett 47(1):13–24
18. Khalil HK (2002) Nonlinear systems, 3rd edn. Prentice Hall, Upper Saddle River
19. Bertsekas D, Tsitsiklis J (1996) Neuro-dynamic programming. Athena Scientific, Belmont
20. Busoniu L, Babuska R, De Schutter B, Ernst D (2010) Reinforcement learning and dynamic
programming using function approximators. CRC Press, Boca Raton
21. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
22. Kamalapurkar R, Dinh H, Bhasin S, Dixon WE (2015) Approximate optimal trajectory tracking
for continuous-time nonlinear systems. Automatica 51:40–48
23. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy hdp iteration algorithm. IEEE Trans
Syst Man Cybern Part B Cybern 38(4):937–942
24. Hornik K, Stinchcombe M, White H (1990) Universal approximation of an unknown mapping
and its derivatives using multilayer feedforward networks. Neural Netw 3(5):551–560
25. Lewis FL, Selmic R, Campos J (2002) Neuro-fuzzy control of industrial systems with actuator
nonlinearities. Society for Industrial and Applied Mathematics, Philadelphia
26. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall, Upper Saddle River
27. Misovec KM (1999) Friction compensation using adaptive non-linear control with persistent
excitation. Int J Control 72(5):457–479
28. Narendra K, Annaswamy A (1986) Robust adaptive control in the presence of bounded distur-
bances. IEEE Trans Autom Control 31(4):306–315
29. Rao AV, Benson DA, Darby CL, Patterson MA, Francolin C, Huntington GT (2010) Algorithm
902: GPOPS, A MATLAB software for solving multiple-phase optimal control problems using
the Gauss pseudospectral method. ACM Trans Math Softw 37(2):1–39
30. Johnson M, Kamalapurkar R, Bhasin S, Dixon WE (2015) Approximate n-player nonzero-sum
game solution for an uncertain continuous nonlinear system. IEEE Trans Neural Netw Learn
Syst 26(8):1645–1658
31. Basar T, Olsder GJ (1999) Dynamic noncooperative game theory. Classics in applied mathe-
matics, 2nd edn. SIAM, Philadelphia
32. Vamvoudakis KG, Lewis FL (2011) Multi-player non-zero-sum games: online adaptive learn-
ing solution of coupled hamilton-jacobi equations. Automatica 47:1556–1569
33. Basar T, Bernhard P (2008) H ∞ -optimal control and related minimax design problems: a
dynamic game approach, 2nd edn. Modern Birkhäuser Classics, Birkhäuser, Boston
34. Patre PM, Dixon WE, Makkar C, Mackunis W (2006) Asymptotic tracking for systems with
structured and unstructured uncertainties. In: Proceedings of the IEEE conference on decision
and control, San Diego, California, pp 441–446
35. Dixon WE, Behal A, Dawson DM, Nagarkatti S (2003) Nonlinear control of engineering
systems: a lyapunov-based approach. Birkhauser, Boston
36. Krstic M, Kanellakopoulos I, Kokotovic PV (1995) Nonlinear and adaptive control design.
Wiley, New York
37. Nevistic V, Primbs JA (1996) Constrained nonlinear optimal control: a converse HJB approach.
Technical report. CIT-CDS 96-021, California Institute of Technology, Pasadena, CA 91125
38. Vamvoudakis KG, Lewis FL (2009) Online synchronous policy iteration method for optimal
control. In: Yu W (ed) Recent advances in intelligent control systems. Springer, Berlin, pp
357–374
39. Vamvoudakis KG, Lewis FL (2010) Online neural network solution of nonlinear two-player
zero-sum games using synchronous policy iteration. In: Proceedings of the IEEE conference
on decision and control
40. Zhang H, Cui L, Luo Y (2013) Near-optimal control for nonzero-sum differential games of
continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern 43(1):206–
216
41. Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput
12(1):219–245
42. Chen Z, Jagannathan S (2008) Generalized Hamilton-Jacobi-Bellman formulation -based
neural network control of affine nonlinear discrete-time systems. IEEE Trans Neural Netw
19(1):90–106
43. Dierks T, Thumati B, Jagannathan S (2009) Optimal control of unknown affine nonlinear
discrete-time systems using offline-trained neural networks with proof of convergence. Neural
Netw 22(5–6):851–860
44. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control algorithms
and stability. Communications and control engineering, Springer, London
45. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
46. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
47. Yang X, Liu D, Wang D (2014) Reinforcement learning for adaptive optimal control of unknown
continuous-time nonlinear systems with input constraints. Int J Control 87(3):553–566
48. Dierks T, Jagannathan S (2009) Optimal tracking control of affine nonlinear discrete-time
systems with unknown internal dynamics. In: Proceedings of the IEEE conference on decision
and control, Shanghai, CN, pp 6750–6755
49. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
50. Wei Q, Liu D (2013) Optimal tracking control scheme for discrete-time nonlinear systems with
approximation errors. In: Guo C, Hou ZG, Zeng Z (eds) Advances in neural networks - ISNN
2013, vol 7952. Lecture notes in computer science. Springer, Berlin, pp 1–10
51. Kiumarsi B, Lewis FL, Modares H, Karimpour A, Naghibi-Sistani MB (2014) Reinforcement
Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics.
Automatica 50(4):1167–1175
52. Qin C, Zhang H, Luo Y (2014) Online optimal tracking control of continuous-time linear
systems with unknown dynamics by using adaptive dynamic programming. Int J Control
87(5):1000–1009
53. Murray J, Cox C, Lendaris G, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern Part C Appl Rev 32(2):140–153
54. Beard R, Saridis G, Wen J (1997) Galerkin approximations of the generalized Hamilton-Jacobi-
Bellman equation. Automatica 33:2159–2178
55. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
56. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
57. Wang K, Liu Y, Li L (2014) Visual servoing trajectory tracking of nonholonomic mobile robots
without direct position measurement. IEEE Trans Robot 30(4):1026–1035
58. Wang D, Liu D, Zhang Q, Zhao D (2016) Data-based adaptive critic designs for nonlinear robust
optimal control with uncertain dynamics. IEEE Trans Syst Man Cybern Syst 46(11):1544–1555
59. Li H, Liu D, Wang D (2014) Integral reinforcement learning for linear continuous-time zero-
sum games with completely unknown dynamics. IEEE Trans Autom Sci Eng 11(3):706–714
60. Dierks T, Jagannathan S (2010) Optimal control of affine nonlinear continuous-time systems.
In: Proceedings of the American control conference, pp 1568–1573
61. Park YM, Choi MS, Lee KY (1996) An optimal tracking neuro-controller for nonlinear dynamic
systems. IEEE Trans Neural Netw 7(5):1099–1110
62. Luo Y, Liang M (2011) Approximate optimal tracking control for a class of discrete-time non-
affine systems based on GDHP algorithm. In: IWACI International Workshop on Advanced
Computational Intelligence, pp 143–149
63. Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of
discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocom-
puting 78(1):14–22
64. Modares H, Lewis FL (2014) Optimal tracking control of nonlinear partially-unknown
constrained-input systems using integral reinforcement learning. Automatica 50(7):1780–1792
65. Luo B, Liu D, Huang T, Wang D (2016) Model-free optimal tracking control via critic-only
q-learning. IEEE Trans Neural Netw Learn Syst 27(10):2134–2144
66. Yang X, Liu D, Wei Q, Wang D (2016) Guaranteed cost neural tracking control for a class of
uncertain nonlinear systems using adaptive dynamic programming. Neurocomputing 198:80–
90
67. Zhao B, Liu D, Yang X, Li Y (2017) Observer-critic structure-based adaptive dynamic pro-
gramming for decentralised tracking control of unknown large-scale nonlinear systems. Int J
Syst Sci 48(9):1978–1989
68. Wang D, Liu D, Zhang Y, Li H (2018) Neural network robust tracking control with adaptive
critic framework for uncertain nonlinear systems. Neural Netw 97:11–18
69. Vamvoudakis KG, Mojoodi A, Ferraz H (2017) Event-triggered optimal tracking control of
nonlinear systems. Int J Robust Nonlinear Control 27(4):598–619
70. Wei Q, Zhang H (2008) A new approach to solve a class of continuous-time nonlinear quadratic
zero-sum game using ADP. In: IEEE international conference on networking, sensing and
control, pp 507–512
71. Zhang H, Wei Q, Liu D (2010) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47:207–214
72. Zhang X, Zhang H, Luo Y, Dong M (2010) Iteration algorithm for solving the optimal strategies
of a class of nonaffine nonlinear quadratic zero-sum games. In: Proceedings of the IEEE
conference on decision and control, pp 1359–1364
73. Mellouk A (ed) (2011) Advances in reinforcement learning. InTech
74. Littman M (2001) Value-function reinforcement learning in markov games. Cogn Syst Res
2(1):55–66
75. Johnson M, Bhasin S, Dixon WE (2011) Nonlinear two-player zero-sum game approximate
solution using a policy iteration algorithm. In: Proceedings of the IEEE conference on decision
and control, pp 142–147
Chapter 4
Model-Based Reinforcement Learning
for Approximate Optimal Control
4.1 Introduction
instability during the learning phase. This chapter analyzes the stability of the closed-
loop system in the presence of the aforementioned additional approximation error.
The error between the actual steady-state controller and its estimate is included in
the stability analysis by examining the trajectories of the concatenated system under
the implemented control signal. In addition to estimating the desired steady-state
controller, the concurrent learning-based system identifier is also used to simulate
experience by evaluating the Bellman error over unexplored areas of the state-space
[14, 29, 30].
In Sect. 4.5 (see also, [16]), the model-based reinforcement learning method is
extended to obtain an approximate feedback-Nash equilibrium solution to an infinite-
horizon N -player nonzero-sum differential game online, without requiring persis-
tence of excitation, for a nonlinear control-affine system with uncertain linearly
parameterized drift dynamics. A system identifier is used to estimate the unknown
parameters in the drift dynamics. The solutions to the coupled Hamilton–Jacobi
equations and the corresponding feedback-Nash equilibrium policies are approxi-
mated using parametric universal function approximators. Based on estimates of the
unknown drift parameters, estimates for the Bellman errors are evaluated at a set of
pre-selected points in the state-space. The critic and the actor weights are updated
using a concurrent learning-based least-squares approach to minimize the instanta-
neous Bellman errors and the Bellman errors evaluated at pre-selected points. Simul-
taneously, the unknown parameters in the drift dynamics are updated using a history
stack of recorded data via a concurrent learning-based gradient descent approach. It
is shown that under a condition milder than persistence of excitation, uniformly ulti-
mately bounded convergence of the unknown drift parameters, the critic weights and
the actor weights to their true values can be established. Simulation results demon-
strate the effectiveness of the developed approximate solutions to infinite-horizon
optimal regulation and tracking problems online for inherently unstable nonlinear
systems with uncertain drift dynamics. The simulations also demonstrate that the
developed method can be used to implement reinforcement learning without the
addition of a probing signal.
Consider the control-affine nonlinear dynamical system in (1.9) and recall the expression for the Bellman error in (2.3)

δ( x, Ŵc, Ŵa ) ≜ ∇x V̂( x, Ŵc )( f(x) + g(x) û( x, Ŵa ) ) + r( x, û( x, Ŵa ) ).
To solve the optimal control problem, the critic aims to find a set of parameters Ŵc
and the actor aims to find a set of parameters Ŵa that minimize the integral error E,
introduced in (2.5). Computation of the error E requires evaluation of the Bellman
error over the entire domain D, which is generally infeasible. As a result, a derivative-
Given current parameter estimates Ŵc(t), Ŵa(t), and θ̂(t), the approximate Bellman error in (4.1) can be evaluated at any point xi ∈ Rn. This results in simulated experience quantified by the Bellman error δ̂ti(t) = δ̂( xi, Ŵc(t), Ŵa(t), θ̂(t) ). The simulated experience can then be used along with gained experience by the critic to learn the value function. The motivation behind using simulated experience is that by selecting multiple (say N) points, the error signal in (2.7) can be augmented to yield a heuristically better approximation Êt(t), given by

Êt(t) ≜ ∫_0^t ( δ̂t²(τ) + Σ_{i=1}^{N} δ̂ti²(τ) ) dτ,

for the desired error signal in (2.5). A block diagram of the simulation-based actor-critic-identifier architecture is presented in Fig. 4.1. For notational brevity, the dependence of all the functions on the system states and time is suppressed in the stability analysis subsections unless required for clarity of exposition.
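A sketch of the augmented error integrand is given below; the approximate Bellman error δ̂ is assumed to be available as a function of an arbitrary state and the current estimates, and all names are illustrative.

```python
import numpy as np

def extrapolated_squared_be(delta_hat, x_t, Wc, Wa, theta_hat, points):
    """Instantaneous squared Bellman error augmented with simulated experience: the
    integrand of the augmented error signal above. delta_hat evaluates the approximate
    Bellman error (4.1) at an arbitrary state; argument names are illustrative."""
    total = delta_hat(x_t, Wc, Wa, theta_hat)**2           # gained experience
    total += sum(delta_hat(xi, Wc, Wa, theta_hat)**2       # simulated experience
                 for xi in points)
    return total

# Accumulating this quantity over time (e.g., total_error += dt * value) approximates
# the augmented error signal used by the critic.
```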
4.3 Online Approximate Regulation 103
Fig. 4.1 The simulation-based actor-critic-identifier architecture
To facilitate online system identification, let f (x) = Y (x) θ denote the linear
parametrization of the function f , where Y : Rn → Rn× pθ is the regression matrix,
1 Parts of the text in this section are reproduced, with permission, from [18], ©2016, Elsevier.
v̲θ(‖θ̃‖) ≤ Vθ( θ̃, t ) ≤ v̄θ(‖θ̃‖),   (4.3)

∇θ̃ Vθ( θ̃, t )( −fθs( θ − θ̃, t ) ) + ∇t Vθ( θ̃, t ) ≤ −K ‖θ̃‖² + D ‖θ̃‖,   (4.4)
for all s ∈ N, t ∈ R≥t0, and θ̃ ∈ R^{pθ}, where v̲θ, v̄θ : R≥0 → R≥0 are class K functions, K ∈ R>0 is an adjustable parameter, and D ∈ R>0 is a positive constant.
The subsequent analysis in Sect. 4.3.4 indicates that when a system identifier that satisfies Assumption 4.1 is employed to facilitate online optimal control, the ratio D/K needs to be sufficiently small to establish set-point regulation and convergence to optimality. Using an estimate θ̂, the Bellman error in (2.3) can be approximated by δ̂ : R^{n+2L+p} → R as

δ̂( x, Ŵc, Ŵa, θ̂ ) ≜ ∇x V̂( x, Ŵc )( Y(x) θ̂ + g(x) û( x, Ŵa ) ) + r( x, û( x, Ŵa ) ).   (4.5)

In the following, the approximate Bellman error in (4.5) is used to obtain an approximate solution to the Hamilton–Jacobi–Bellman equation in (1.14).
Approximations to the optimal value function V* and the optimal policy u* are designed based on neural network-based representations. Given any compact set χ ⊂ Rn and a positive constant ε ∈ R, the universal function approximation property of neural networks can be exploited to represent the optimal value function V*
where Ŵc ∈ R L and Ŵa ∈ R L are the estimates of W . The use of two sets of weights
to estimate the same set of ideal weights is motivated by the stability analysis and
the fact that it enables a formulation of the Bellman error that is linear in the critic
weight estimates Ŵc , enabling a least-squares-based adaptive update law.
A least-squares update law for the critic weights is designed based on the subsequent stability analysis as

Ŵ˙c(t) = −ηc1 Γ(t) ( ω(t)/ρ(t) ) δ̂t(t) − (ηc2/N) Γ(t) Σ_{i=1}^{N} ( ωi(t)/ρi(t) ) δ̂ti(t),   (4.7)

Γ˙(t) = ( βΓ(t) − ηc1 Γ(t) ( ω(t) ω(t)^T/ρ²(t) ) Γ(t) ) 1_{‖Γ‖≤Γ̄},   (4.8)

where ηa1, ηa2 ∈ R are positive constant adaptation gains, Gσ(t) ≜ ∇xσ(x(t)) g(x(t)) R^{-1} g^T(x(t)) ∇xσ^T(x(t)), and Gσi ≜ ∇σi gi R^{-1} gi^T ∇σi^T ∈ R^{L×L}, where gi ≜ g(xi) and ∇σi ≜ ∇xσ(xi).
The update law in (4.7) ensures that the adaptation gain matrix is bounded such that

Γ̲ I_L ≤ Γ(t) ≤ Γ̄ I_L,   ∀t ∈ R≥t0.   (4.10)

Using the weight estimates Ŵa, the controller for the system in (1.9) is designed as

u(t) = û( x(t), Ŵa(t) ).   (4.11)
Since the rank condition in (4.12) depends on the estimates θ̂ and Ŵa, it is generally impossible to guarantee a priori. However, unlike the persistence of excitation condition in previous results such as [2, 6, 37–39], the condition in (4.12) can be verified online at each time t. Furthermore, the condition in (4.12) can be heuristically met by collecting redundant data (i.e., by selecting more points than the number of neurons by choosing N ≫ L).
The update law in (4.7) is fundamentally different from the concurrent
learning adaptive update in results such as [30, 31] in the sense that the points
{xi ∈ Rn | i = 1, . . . , N } are selected a priori based on information about the desired
behavior of the system. Given the system dynamics, or an estimate of the system
dynamics, the approximate Bellman error can be extrapolated to any desired point
in the state-space, whereas the prediction error, which is used as a metric in adaptive
control, can only be evaluated at observed data points along the state trajectory.
where Yi = Y(xi), ∇εi = ∇xε(xi), fi = f(xi), Gi ≜ gi R^{-1} gi^T ∈ R^{n×n}, Gεi ≜ ∇εi Gi ∇εi^T ∈ R, and Δi ≜ (1/2) W^T ∇σi Gi ∇εi^T + (1/4) Gεi − ∇εi fi ∈ R is a constant.
On any compact set χ ⊂ Rn the function Y is Lipschitz continuous, and hence, there exists a positive constant LY ∈ R such that

‖Y(x)‖ ≤ LY ‖x‖,   ∀x ∈ χ.   (4.15)

In (4.15), the Lipschitz property is exploited for clarity of exposition. The bound in (4.15) can be easily generalized to ‖Y(x)‖ ≤ LY(‖x‖)‖x‖, where LY : R≥0 → R≥0 is a positive, non-decreasing, and radially unbounded function.
sup_{t∈R≥t0} ‖ω(t)/ρ(t)‖ ≤ 1/( 2√(νΓ̲) ).   (4.16)
ϑ3 ≜ LY ηc1 ‖W‖ ‖∇xσ‖/( 4√(νΓ̲) ),   ϑ4 ≜ (1/4)‖Gε‖,
ϑ5 ≜ ηc1 ‖2W^T ∇xσ G ∇xε^T + Gε‖/( 8√(νΓ̲) ) + Σ_{i=1}^{N} ηc2 ‖ωi Δi‖/( N ρi ),
ϑ6 ≜ (1/2)‖W^T Gσ + ∇xε G ∇xσ^T‖ + (1/2) ϑ7 ‖W‖² + ηa2 ‖W‖,
ϑ7 ≜ ηc1 ‖Gσ‖/( 8√(νΓ̲) ) + Σ_{i=1}^{N} ηc2 ‖Gσi‖/( 8N√(νΓ̲) ),   q ≜ λmin{Q},
vl = (1/2) min( q/2, ηc2 c/3, (ηa1 + 2ηa2)/6, K/4 ),
ι = 3ϑ5²/( 4ηc2 c ) + 3ϑ6²/( 2(ηa1 + 2ηa2) ) + D²/( 2K ) + ϑ4,   (4.17)
where diam(Z) denotes the diameter of the set Z defined as diam(Z) ≜ sup{ ‖x − y‖ | x, y ∈ Z }. The main result of this section can now be stated as follows.
Theorem 4.3 Provided Assumptions 4.1 and 4.2 hold and the gains q, ηc2, ηa2, and K are selected large enough using Algorithm A.2, the controller in (4.11) along with the adaptive update laws in (4.7) and (4.9) ensure that x(·), W̃c(·), and W̃a(·) are uniformly ultimately bounded.
Proof Let VL : R^{n+2L+p} × R≥0 → R≥0 be a continuously differentiable positive definite candidate Lyapunov function defined as

VL(Z, t) ≜ V*(x) + (1/2) W̃c^T Γ^{-1}(t) W̃c + (1/2) W̃a^T W̃a + Vθ( θ̃, t ),   (4.20)

where V* is the optimal value function and Vθ was introduced in Assumption 4.1. Using the fact that V* is positive definite, (4.3), (4.10) and [40, Lemma 4.3] yield

v̲(‖Z‖) ≤ VL(Z, t) ≤ v̄(‖Z‖),   (4.21)

for all t ∈ R≥t0 and for all Z ∈ R^{n+2L+p}, where v̲, v̄ : R≥0 → R≥0 are class K functions.
Provided the gains are selected according to Algorithm A.2, substituting for the approximate Bellman errors from (4.13) and (4.14), using the bounds in (4.15) and (4.16), and using Young's inequality, the time derivative of (4.20) evaluated along the trajectory Z(·) can be upper-bounded as

∇Z VL(Z, t) h(Z, t) + ∇t VL(Z, t) ≤ −vl ‖Z‖²,   (4.22)

for all ‖Z‖ ≥ √(ι/vl) > 0, Z ∈ Z, and t ∈ R≥t0, where h : R^{n+2L+p} × R≥t0 → R^{n+2L+p}
is a concatenation of the vector fields in (1.9), (4.2), (4.7), and (4.9). Since Vθ is
a common Lyapunov function for the switched subsystem in (4.2), and the terms
introduced by the update law (4.8) do not contribute to the bound in (4.22), VL is a
common Lyapunov function for the complete error system.
Using (4.19), (4.21), and (4.22), [40, Theorem 4.18] can be invoked to conclude that Z(·) is uniformly ultimately bounded in the sense that lim sup_{t→∞} ‖Z(t)‖ ≤ v̲^{-1}( v̄( √(ι/vl) ) ). Furthermore, the concatenated state trajectories are bounded such that ‖Z(t)‖ ≤ Z̄, ∀t ∈ R≥t0. Since the estimates Ŵa(·) approximate the ideal weights W, the policy û approximates the optimal policy u*.
4.3.5 Simulation
This section presents two simulations to demonstrate the performance and the applicability of the developed technique. First, the developed technique is applied to an optimal control problem that has a known analytical solution. Based on the known solution, an exact polynomial basis is used for value function approximation.
The second simulation demonstrates the applicability of the developed technique in
the case where the analytical solution, and hence, an exact basis for value function
approximation is not known. In this case, since the optimal solution is unknown,
the optimal trajectories obtained using the developed technique are compared with
optimal trajectories obtained through numerical optimal control techniques.
Problem With a Known Basis
The performance of the developed controller is demonstrated by simulating a nonlinear control-affine system with a two-dimensional state x = [x1, x2]^T. The system dynamics are described by (1.9), where

f = [ x1, x2, 0, 0 ; 0, 0, x1, x2( 1 − (cos(2x1) + 2)² ) ] [a, b, c, d]^T,
g = [ 0, cos(2x1) + 2 ]^T.   (4.23)
[0.5, 0, 1]^T. The data points for the concurrent learning-based update law in (4.7) are selected to be on a 5 × 5 grid around the origin. The learning gains are selected as ηc1 = 1, ηc2 = 15, ηa1 = 100, ηa2 = 0.1, and ν = 0.005, and the gains for the system identifier developed in Appendix A.2.3 are selected as kx = 10I2, Γθ = 20I4, and kθ = 30. The actor and the critic weight estimates are initialized using a stabilizing set of initial weights as Ŵc(0) = Ŵa(0) = [1, 1, 1]^T and the least-squares gain is initialized as Γ(0) = 100I3. The initial condition for the system state is selected as x(0) = [−1, −1]^T, the state estimates x̂ are initialized to zero, the parameter estimates θ̂ are initialized to one, and the data stack for concurrent learning is recorded online.
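For example, the grid of extrapolation points can be generated as follows; the grid spacing below is illustrative, since the text specifies only the 5 × 5 arrangement around the origin.

```python
import numpy as np

# Preselected Bellman-error extrapolation points: a 5 x 5 grid around the origin for
# the two-state example. The +/- 1 range is an assumption made for illustration.
axis = np.linspace(-1.0, 1.0, 5)
extrapolation_points = np.array([[x1, x2] for x1 in axis for x2 in axis])   # 25 points
```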
Figures 4.2, 4.3, 4.4, 4.5 and 4.6 demonstrate that the system state is regulated
to the origin, the unknown parameters in the drift dynamics are identified, and the
value function and the actor weights converge to their true values. Furthermore,
unlike previous results, a probing signal to ensure persistence of excitation is not
required. Figures 4.7 and 4.8 demonstrate the satisfaction of Assumptions 4.2 and
A.2, respectively.
In (4.24), D ≜ diag([x3, x4, tanh(x3), tanh(x4)]) and the matrices M, Vm, Fd, Fs ∈ R^{2×2} are defined as

M ≜ [ p1 + 2p3c2, p2 + p3c2 ; p2 + p3c2, p2 ],   Fd ≜ [ fd1, 0 ; 0, fd2 ],
Vm ≜ [ −p3s2x4, −p3s2(x3 + x4) ; p3s2x3, 0 ],   Fs ≜ [ fs1 tanh(x3), 0 ; 0, fs2 tanh(x4) ],

where c2 = cos(x2), s2 = sin(x2), p1 = 3.473, p2 = 0.196, and p3 = 0.242. The positive constants fd1, fd2, fs1, fs2 ∈ R are the unknown parameters. The parameters are selected as fd1 = 5.3, fd2 = 1.1, fs1 = 8.45, and fs2 = 2.35. The control objective is to minimize the cost in (1.10) with Q = diag([10, 10, 1, 1]) and R = diag([1, 1]) while regulating the system state to the origin. The origin is a marginally stable equilibrium point of the unforced system ẋ = f(x).
The basis function σ : R4 → R10 for value function approximation is selected as σ(x) = [ x1x3, x2x4, x3x2, x4x1, x1x2, x4x3, x1², x2², x3², x4² ]^T. The data points for the concurrent learning-based update law in (4.7) are selected to be on a 3 × 3 × 3 × 3 grid around the origin, and the actor weights are updated using a projection-based update law. The learning gains are selected as ηc1 = 1, ηc2 = 30, ηa1 = 0.1, and ν = 0.0005. The gains for the system identifier developed in Appendix A.2.3 are selected as kx = 10I4, Γθ = diag([90, 50, 160, 50]), and kθ = 1.1. The least-squares gain is initialized as Γ(0) = 1000I10 and the policy and the critic weight estimates are initialized as Ŵc(0) = Ŵa(0) = [5, 5, 0, 0, 0, 0, 25, 0, 2, 2]^T. The initial condition for the system state is selected as x(0) = [1, 1, 0, 0]^T, the state estimates x̂ are initialized to zero, the parameter estimates θ̂ are initialized to one, and the data stack for concurrent learning is recorded online.
Figures 4.9, 4.10, 4.11, 4.12, 4.13 and 4.14 demonstrate that the system state is regulated to the origin, the unknown parameters in the drift dynamics are identified, and the value function and the actor weights converge. Figures 4.15 and 4.16 demonstrate the satisfaction of Assumptions 4.2 and A.2, respectively. The value function and the actor weights converge to the following values.

Ŵc* = Ŵa* = [ 24.7, 1.19, 2.25, 2.67, 1.18, 0.93, 44.34, 11.31, 3.81, 0.10 ]^T.   (4.25)
Since the true values of the critic weights are unknown, the weights in (4.25) cannot be compared to their true values. However, a measure of proximity of the weights in (4.25) to the ideal weights W can be obtained by comparing the system trajectories resulting from applying the feedback control policy û*(x) = −(1/2) R^{-1} g^T(x) ∇xσ^T(x) Ŵa* to the system, against numerically computed optimal
system trajectories. Figures 4.14 and 4.17 indicate that the weights in (4.25) generate
state and control trajectories that closely match the numerically computed optimal
trajectories. The numerical optimal solution is obtained using an infinite-horizon
Gauss-pseudospectral method (cf. [42]) using 45 collocation points.
Fig. 4.17 Control trajectories u1 and u2 obtained using the developed method (Proposed) and the numerical optimal solution (Numerical)
The control objective is to simultaneously synthesize and utilize a control signal μ (·)
to minimize the cost functional in (3.49) under the dynamic constraint in (3.47), while
tracking the desired trajectory, where the local cost rt : R2n × Rm → R≥0 is defined
as
2 Parts of the text in this section are reproduced, with permission, from [22], ©2017, IEEE.
with the initial condition V*(0) = 0, where the function Qt : R^{2n} → R is defined as Qt( [e^T, xd^T]^T ) ≜ Q(e), ∀e, xd ∈ Rn.
Similar to the development in Sect. 4.3, the Bellman error is extrapolated to unex-
plored areas of the state-space using a system identifier. In this section, a neural
network-based system identifier is employed.
λmin{ Σ_{j=1}^{M} σfj σfj^T } = σ̲θ > 0 and ‖x̄˙j − ẋj‖ < d̄, ∀j, is available a priori, where σfj ≜ σf( Y^T [1, xj^T]^T ), d̄ ∈ R is a known positive constant, and ẋj = f(xj) + g(xj) uj.
A priori availability of the history stack is used for ease of exposition, and is not necessary. Provided the system states are exciting over a finite time interval t ∈ [t0, t0 + t̄] (versus t ∈ [t0, ∞) as in traditional persistence of excitation-based approaches) the history stack can also be recorded online. The controller developed in [44] can be used over the time interval t ∈ [t0, t0 + t̄] while the history stack is being recorded, and the controller developed in this result can be used thereafter. The use of two different controllers results in a switched system with one switching event. Since there is only one switching event, the stability of the switched system follows from the stability of the individual subsystems.
The weight estimates θ̂ are updated using the concurrent learning-based update law
θ̂̇(t) = Γθ σf(Yᵀ[1, xᵀ(t)]ᵀ) x̃ᵀ(t) + kθ Γθ Σ_{j=1}^{M} σfj (x̄̇j − gj uj − θ̂ᵀ(t) σfj)ᵀ, (4.29)
where kθ ∈ R is a constant positive concurrent learning gain and Γθ ∈ R^{(p+1)×(p+1)} is a constant, diagonal, and positive definite adaptation gain matrix. Using the identifier, the Bellman error in (4.27) can be approximated as
δ̂(ζ, θ̂, Ŵc, Ŵa) = Qt(ζ) + μ̂ᵀ(ζ, Ŵa) R μ̂(ζ, Ŵa) + ∇ζ V̂(ζ, Ŵc)(Fθ(ζ, θ̂) + F1(ζ) + G(ζ) μ̂(ζ, Ŵa)), (4.30)
where
Fθ(ζ, θ̂) ≜ [(θ̂ᵀ σθ(x) − g(x) g⁺(xd) θ̂ᵀ σθ(xd))ᵀ, 0_{1×n}]ᵀ,
F1(ζ) ≜ [(−hd + g(e + xd) g⁺(xd) hd)ᵀ, hdᵀ]ᵀ.
Since V ∗ and μ∗ are functions of the augmented state ζ , the minimization problem
stated in Sect. 4.4.1 is intractable. To obtain a finite-dimensional minimization prob-
lem, the optimal value function is represented over any compact operating domain
C ⊂ R²ⁿ using a neural network as V*(ζ) = Wᵀ σ(ζ) + ε(ζ), where W ∈ R^L denotes a vector of unknown neural network weights, σ : R²ⁿ → R^L denotes a bounded neural network basis function, ε : R²ⁿ → R denotes the function reconstruction error, and L ∈ N denotes the number of neural network neurons. Using Property 2.3, for any compact set C ⊂ R²ⁿ, there exist constant ideal weights W and known positive constants W̄, ε̄, and ∇ε̄ ∈ R such that ‖W‖ ≤ W̄ < ∞, sup_{ζ∈C} ‖ε(ζ)‖ ≤ ε̄, and sup_{ζ∈C} ‖∇ζ ε(ζ)‖ ≤ ∇ε̄ [45].
A neural network representation of the optimal policy is obtained as μ*(ζ) = −(1/2) R⁻¹ Gᵀ(ζ)(∇ζ σᵀ(ζ) W + ∇ζ εᵀ(ζ)). Using estimates Ŵc and Ŵa for the ideal weights W, the optimal value function and the optimal policy are approximated as
V̂(ζ, Ŵc) ≜ Ŵcᵀ σ(ζ),   μ̂(ζ, Ŵa) ≜ −(1/2) R⁻¹ Gᵀ(ζ) ∇ζ σᵀ(ζ) Ŵa. (4.31)
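As a minimal sketch of the approximations in (4.31), the following Python helpers evaluate the value and policy estimates; the callables sigma, grad_sigma, and G are assumed to be supplied by the user and are not defined in the text.

```python
import numpy as np

def V_hat(sigma, zeta, W_c):
    """Value estimate V^(zeta, Wc) = Wc^T sigma(zeta), cf. (4.31)."""
    return W_c @ sigma(zeta)

def mu_hat(grad_sigma, G, R_inv, zeta, W_a):
    """Policy estimate mu^(zeta, Wa) = -1/2 R^{-1} G^T(zeta) grad_sigma(zeta)^T Wa."""
    return -0.5 * R_inv @ G(zeta).T @ grad_sigma(zeta).T @ W_a
```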
The optimal control problem is thus reformulated as the need to find a set of weights Ŵc and Ŵa online, to minimize the error Ê_θ̂(Ŵc, Ŵa) ≜ sup_{ζ∈χ} |δ̂(ζ, θ̂, Ŵc, Ŵa)|, for a given θ̂, while simultaneously improving θ̂ using (4.29), and ensuring stability of the system using the control law
u(t) = μ̂(ζ(t), Ŵa(t)) + ûd(ζ(t), θ̂(t)), (4.32)
where ûd(ζ, θ̂) ≜ gd⁺(t)(hd(t) − θ̂ᵀ(t) σθd(t)) and σθd(t) ≜ σθ([0_{1×n}, xdᵀ(t)]ᵀ). The error between ud and ûd is included in the stability analysis based on the fact that
the error trajectories generated by the system ė(t) = f(x(t)) + g(x(t)) u(t) − ẋd(t) under the controller in (4.32) are identical to the error trajectories generated by the system ζ̇(t) = F(ζ(t)) + G(ζ(t)) μ(t) under the control law μ(t) = μ̂(ζ(t), Ŵa(t)) + gd⁺(t) θ̃ᵀ(t) σθd(t) + gd⁺(t) εθd(t), where εθd(t) ≜ εθ(xd(t)).
The critic weight estimates, the least-squares gain matrix, and the actor weight estimates are updated according to
Ŵ̇c(t) = −kc1 Γ(t) (ω(t)/ρ(t)) δ̂t(t) − (kc2/N) Γ(t) Σ_{i=1}^{N} (ωi(t)/ρi(t)) δ̂ti(t), (4.33)
Γ̇(t) = (β Γ(t) − kc1 Γ(t) (ω(t) ωᵀ(t)/ρ²(t)) Γ(t)) 1{‖Γ‖≤Γ̄},  ‖Γ(t0)‖ ≤ Γ̄, (4.34)
Ŵ̇a(t) = −ka1 (Ŵa(t) − Ŵc(t)) − ka2 Ŵa(t) + (kc1 Gσᵀ(t) Ŵa(t) ωᵀ(t)/(4ρ(t))) Ŵc(t) + Σ_{i=1}^{N} (kc2 Gσiᵀ(t) Ŵa(t) ωiᵀ(t)/(4Nρi(t))) Ŵc(t), (4.35)
where ω(t) ≜ ∇ζ σ(ζ(t))(Fθ(ζ(t), θ̂(t)) + F1(ζ(t)) + G(ζ(t)) μ̂(ζ(t), Ŵa(t))), Γ ∈ R^{L×L} is the least-squares gain matrix, Γ̄ ∈ R denotes a positive saturation constant, β ∈ R denotes a constant forgetting factor, kc1, kc2, ka1, ka2 ∈ R denote constant positive adaptation gains, Gσ(t) ≜ ∇ζ σ(ζ(t)) G(ζ(t)) R⁻¹ Gᵀ(ζ(t)) ∇ζ σᵀ(ζ(t)), and ρ(t) ≜ 1 + ν ωᵀ(t) Γ(t) ω(t), where ν ∈ R is a positive normalization constant. In (4.33)–(4.35) and in the subsequent development, the notation ξi is defined as ξi ≜ ξ(ζi, ·) for any function ξ(ζ, ·), and the instantaneous Bellman errors δ̂t and δ̂ti are given by δ̂t(t) = δ̂(ζ(t), Ŵc(t), Ŵa(t), θ̂(t)) and δ̂ti(t) = δ̂(ζi, Ŵc(t), Ŵa(t), θ̂(t)).
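The following sketch shows one explicit Euler step of the update laws (4.33)–(4.35). It is not the authors' implementation: the parameter object p and its callbacks omega, delta, and G_sigma are hypothetical names standing in for user-supplied evaluations of the regressor, the Bellman error, and Gσ.

```python
import numpy as np

def adp_step(Wc, Wa, Gamma, zeta, zeta_i, theta_hat, p, dt):
    """One Euler step of (4.33)-(4.35); p bundles gains and model callbacks (assumed)."""
    N = len(zeta_i)

    def normalized(z):
        w = p.omega(z, theta_hat, Wa)
        return w, 1.0 + p.nu * w @ Gamma @ w

    w, rho = normalized(zeta)
    d = p.delta(zeta, theta_hat, Wc, Wa)

    # Critic update (4.33): on-trajectory term plus Bellman-error extrapolation.
    Wc_dot = -p.kc1 * Gamma @ (w / rho) * d
    for zi in zeta_i:
        wi, rhoi = normalized(zi)
        Wc_dot += -(p.kc2 / N) * Gamma @ (wi / rhoi) * p.delta(zi, theta_hat, Wc, Wa)

    # Least-squares gain update (4.34) with saturation.
    Gamma_dot = p.beta * Gamma - p.kc1 * Gamma @ np.outer(w, w) @ Gamma / rho**2
    if np.linalg.norm(Gamma) > p.Gamma_bar:
        Gamma_dot = np.zeros_like(Gamma)

    # Actor update (4.35).
    Gs = p.G_sigma(zeta)
    Wa_dot = (-p.ka1 * (Wa - Wc) - p.ka2 * Wa
              + p.kc1 * np.outer(Gs.T @ Wa, w) @ Wc / (4.0 * rho))
    for zi in zeta_i:
        wi, rhoi = normalized(zi)
        Wa_dot += p.kc2 * np.outer(p.G_sigma(zi).T @ Wa, wi) @ Wc / (4.0 * N * rhoi)

    return Wc + dt * Wc_dot, Wa + dt * Wa_dot, Gamma + dt * Gamma_dot
```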
If the state penalty function Qt were positive definite, then the optimal value function V* would be positive definite (cf. [2, 6, 46]) and would serve as a Lyapunov function for the concatenated system under the optimal control policy μ*; as a result, V* could be used as a candidate Lyapunov function for the closed-loop system under the policy μ̂. In this case, however, the function Qt, and hence the function V*, is only positive semidefinite. Therefore, the function V* is not a valid candidate Lyapunov function. However, the results in [44] can be used to show that a nonautonomous form of the optimal value function, denoted by Vt* : Rⁿ × R → R and defined as Vt*(e, t) ≜ V*([eᵀ, xdᵀ(t)]ᵀ), ∀e ∈ Rⁿ, t ∈ R, is positive definite and decrescent. Hence, Vt*(0, t) = 0, ∀t ∈ R, and there exist class K functions v̲ : R → R and v̄ : R → R such that v̲(‖e‖) ≤ Vt*(e, t) ≤ v̄(‖e‖), ∀e ∈ Rⁿ and ∀t ∈ R.
To facilitate the stability analysis, a concatenated state Z ∈ R^{2n+2L+n(p+1)} is defined as
Z ≜ [eᵀ, W̃cᵀ, W̃aᵀ, x̃ᵀ, vec(θ̃)ᵀ]ᵀ,
and a candidate Lyapunov function is defined as
VL(Z, t) ≜ Vt*(e, t) + (1/2) W̃cᵀ Γ⁻¹ W̃c + (1/2) W̃aᵀ W̃a + (1/2) x̃ᵀ x̃ + (1/2) tr(θ̃ᵀ Γθ⁻¹ θ̃). (4.36)
The saturated least-squares update law in (4.34) ensures that there exist positive constants γ̲, γ̄ ∈ R such that γ̲ ≤ ‖Γ⁻¹(t)‖ ≤ γ̄, ∀t ∈ R. Using the bounds on Γ and Vt* and the fact that tr(θ̃ᵀ Γθ⁻¹ θ̃) = vec(θ̃)ᵀ (Γθ⁻¹ ⊗ I_{p+1}) vec(θ̃), the candidate Lyapunov function in (4.36) can be bounded as
v̲l(‖Z‖) ≤ VL(Z, t) ≤ v̄l(‖Z‖). (4.37)
Let the positive constant ι ∈ R be defined as
ι ≜ ((kc1 + kc2)² ‖εθ‖²)/(4νΓ̄ kc2 c̲) + (‖εθ‖²)/(2kθ) + ‖∇ζ ε G gd⁺ εθd‖ + (1/2) Ḡε + (1/2) ‖Wᵀ ∇ζ σ Gr ∇ζ εᵀ‖ + ‖Wᵀ ∇ζ σ G gd⁺ εθd‖, (4.38)
where Gr ≜ G R⁻¹ Gᵀ and Ḡε ≜ ‖∇ζ ε Gr ∇ζ εᵀ‖. Let vl : R → R be a class K function such that
vl(‖Z‖) ≤ (q(‖e‖))/2 + (kc2 c̲/8) ‖W̃c‖² + ((ka1 + ka2)/6) ‖W̃a‖² + (kx/4) ‖x̃‖² + ((kθ σθ)/6) ‖vec(θ̃)‖². (4.39)
The sufficient gain conditions used in the subsequent Theorem 4.6 are
vl⁻¹(ι) < v̄l⁻¹(v̲l(ρ)), (4.40)
kc2 c̲ > (3 (kc2 + kc1)² ‖W‖² ‖∇ζ σ‖² σg²)/(4 kθ σθ ν Γ̄), (4.41)
(ka1 + ka2) > (3 (kc1 + kc2) ‖W‖ ‖Gσ‖)/(8 √(νΓ̄) c̲ kc2) + (3 (kc1 + kc2) ‖W‖ ‖Gσ‖)/(8 √(νΓ̄)) + ka1, (4.42)
where σg ≜ σθ + ‖g gd⁺‖ σθd. In (4.38)–(4.42), the notation ‖(·)‖ denotes sup_{y∈χl} ‖(·)(y)‖ for any function (·) : R^l → R, where l ∈ N and χl denotes the projection of χ onto R^l.
The sufficient condition in (4.40) requires the set χ to be large enough based on the
constant ι. Since the neural network approximation errors depend on the compact set
χ , in general, the constant ι increases with the size of the set χ for a fixed number of
neural network neurons. However, for a fixed set χ , the constant ι can be reduced by
reducing the function reconstruction errors (i.e., by increasing the number of neural network neurons) and by increasing the learning gains, provided σθ is large enough. Hence, a sufficient number of neural network neurons and extrapolation points are required to satisfy the condition in (4.40).
Theorem 4.6 Provided Assumptions 4.4 and 4.5 hold and L, c̲, and σθ are large enough to satisfy the sufficient gain conditions in (4.40)–(4.42), the controller in
(4.32) with the weight update laws (4.33)–(4.35), and the identifier in (4.28) with
the weight update law (4.29), ensure that the system states remain bounded, the
tracking error is ultimately bounded, and that the control policy μ̂ converges to a
neighborhood around the optimal control policy μ∗ .
Proof Using (3.47) and the fact that V̇t∗ (e (t) , t) = V̇ ∗ (ζ (t)) , ∀t ∈ R, the time-
derivative of the candidate Lyapunov function in (4.36) is
V̇L = ∇ζ V*(F + Gμ*) − W̃cᵀ Γ⁻¹ Ŵ̇c − (1/2) W̃cᵀ Γ⁻¹ Γ̇ Γ⁻¹ W̃c − W̃aᵀ Ŵ̇a + V̇0 + ∇ζ V* Gμ − ∇ζ V* Gμ*. (4.43)
where ι is a positive constant and χ ⊂ R^{2n+2L+n(p+1)} is a compact set. Using (4.37) and (4.44), [40, Theorem 4.18] can be invoked to conclude that every trajectory Z(·) satisfying ‖Z(t0)‖ ≤ v̄l⁻¹(v̲l(ρ)), where ρ is a positive constant, is bounded for all t ∈ R and satisfies lim sup_{t→∞} ‖Z(t)‖ ≤ v̲l⁻¹(v̄l(vl⁻¹(ι))).
can be decreased by increasing learning gains and the number of neurons in the neural
networks, provided the points in the history stack and Bellman error extrapolation
can be selected to increase σθ and c.
4.4.7 Simulation
Linear System
In the following, the developed technique is applied to solve a linear quadratic track-
ing problem. A linear system is selected because the optimal solution to the linear
quadratic tracking problem can be computed analytically and compared against the
solution generated by the developed technique. To demonstrate convergence to the ideal weights, the following linear system is simulated:
ẋ = [−1 1; −0.5 0.5] x + [0; 1] u.
The control objective is to follow a desired trajectory, which is the solution to the initial value problem
ẋd = [−1 1; −2 1] xd,  xd(0) = [0; 2],
while ensuring convergence of the estimated policy μ̂ to a neighborhood of the policy μ*, such that the control law μ(t) = μ*(ζ(t)) minimizes the cost ∫₀^∞ (eᵀ(t) diag([10, 10]) e(t) + μ²(t)) dt.
Since the system is linear, the optimal value function is known to be quadratic. Hence, the value function is approximated using the quadratic basis σ(ζ) = [e1², e2², e1e2, e1xd1, e2xd2, e1xd2, e2xd1]ᵀ, and the unknown drift dynamics are approximated using the linear basis σθ(x) = [x1, x2]ᵀ.
The linear system and the linear desired dynamics result in a linear time-invariant
concatenated system. Since the system is linear, the optimal tracking problem reduces
to an optimal regulation problem, which can be solved using the resulting Algebraic
Riccati Equation. The optimal value function is given by V ∗ (ζ ) = ζ T Pζ ζ , where
the matrix Pζ is given by
Pζ = [4.43 0.67 0 0; 0.67 2.91 0 0; 0 0 0 0; 0 0 0 0].
Using the matrix Pζ, the ideal weights corresponding to the selected basis can be computed as W = [4.43, 1.35, 0, 0, 2.91, 0, 0]ᵀ.
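For the linear example, the value function can also be recovered numerically. The sketch below is an illustration under the assumption that the error subsystem decouples (which holds for the matrices above), so the tracking problem reduces to an LQR problem on e solvable with scipy.linalg.solve_continuous_are; the result can then be compared against the Pζ reported above.

```python
# Illustrative computation of P_zeta for the linear tracking example.
import numpy as np
from scipy.linalg import solve_continuous_are

A  = np.array([[-1.0, 1.0], [-0.5, 0.5]])   # plant drift
B  = np.array([[0.0], [1.0]])               # plant input matrix
Ad = np.array([[-1.0, 1.0], [-2.0, 1.0]])   # desired-trajectory dynamics
Q  = np.diag([10.0, 10.0])
R  = np.array([[1.0]])

# With u_d = B^+ (Ad - A) x_d, the residual (I - B B^+)(A - Ad) vanishes here,
# so the error dynamics reduce to e_dot = A e + B mu and an ordinary ARE applies.
residual = (np.eye(2) - B @ np.linalg.pinv(B)) @ (A - Ad)
assert np.allclose(residual, 0.0)

P_e = solve_continuous_are(A, B, Q, R)      # error-block value function e^T P_e e
P_zeta = np.block([[P_e, np.zeros((2, 2))], [np.zeros((2, 2)), np.zeros((2, 2))]])
print(P_zeta)                                # compare against the matrix reported above
```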
Figures 4.18, 4.19, 4.20 and 4.21 demonstrate that the controller remains bounded,
the tracking error goes to zero, and the weight estimates Ŵc , Ŵa and θ̂ go to their
true values, establishing convergence of the approximate policy to the optimal policy.
Figures 4.22 and 4.23 demonstrate satisfaction of the rank conditions in Assumptions
4.4 and 4.5, respectively.
Nonlinear System
Effectiveness of the developed technique is demonstrated via numerical simulation on the nonlinear system ẋ = f(x) + (cos(2x) + 2)² u, x ∈ R, where f(x) = x² is assumed to be unknown. The control objective is to track the desired trajectory xd(t) = 2 sin(2t), while ensuring convergence of the estimated policy μ̂ to a neighborhood of the policy μ*, such that μ* minimizes the cost ∫₀^∞ (10 e²(t) + (1/10) μ²(t)) dt.
The value function is approximated using the polynomial basis σ(ζ) = [e², e⁴, e⁶, e²xd², e⁴xd², e⁶xd²]ᵀ, and f(x) is approximated using the polynomial basis σθ(x) = [x, x², x³]ᵀ. The higher order terms in σ(ζ) are used to compensate for the higher order terms in σθ.
The initial values for the state and the state estimate are selected to be x (0) = 1
and x̂ (0) = 0, respectively, and the initial values for the neural network weights for
the value function, the policy, and the drift dynamics are selected to be zero. Since
the selected system exhibits a finite escape time for any initial condition other than
zero, the initial policy μ̂ (ζ, 06×1 ) is not stabilizing. The stabilization demonstrated
in Fig. 4.24 is achieved via fast simultaneous learning of the system dynamics and
the value function.
Figures 4.24, 4.25, 4.26 and 4.27 demonstrate that the controller remains bounded,
the tracking error is regulated to the origin, and the neural network weights converge.
Figures 4.28 and 4.29 demonstrate satisfaction of the rank conditions in Assump-
tions 4.4 and 4.5, respectively. The rank condition on the history stack in Assumption
4.4 is ensured by selecting points using a singular value maximization algorithm [30],
and the condition in Assumption 4.5 is met via oversampling (i.e., by selecting fifty-
six points to identify six unknown parameters). Unlike previous results that rely on
the addition of an ad-hoc probing signal to satisfy the persistence of excitation condi-
tion, this result ensures sufficient exploration via Bellman error extrapolation. Since
an analytical solution to the nonlinear optimal tracking problem is not available, the
value function and the actor weights cannot be compared against the ideal values. However, a comparison between the learned weights and the optimal weights is possible for linear systems, provided the dynamics hd of the desired trajectory are also linear.
The learning gains, the basis functions for the neural networks, and the points for
Bellman error extrapolation are selected using a trial and error approach. Alterna-
tively, global optimization methods such as a genetic algorithm, or simulation-based
methods such as a Monte-Carlo simulation can be used to tune the gains.
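A minimal random-search sketch of such simulation-based tuning is given below. It is an assumption-level illustration, not the authors' procedure: simulate is a hypothetical callback that runs the closed loop for a candidate gain set and returns a scalar cost, and the bounds are placeholders.

```python
# Illustrative random-search gain tuner; `simulate` and the bounds are assumptions.
import random

def tune_gains(simulate, bounds, trials=100, seed=0):
    rng = random.Random(seed)
    best_gains, best_cost = None, float("inf")
    for _ in range(trials):
        gains = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        cost = simulate(gains)   # e.g., integral of tracking error plus control effort
        if cost < best_cost:
            best_gains, best_cost = gains, cost
    return best_gains, best_cost

# Example (placeholder) bounds: {"kc1": (0.1, 5), "kc2": (0.1, 5), "ka1": (0.1, 20)}
```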
3 Parts of the text in this section are reproduced, with permission, from [16], ©2014, IEEE.
Let f (x) = Y (x) θ be the linear parametrization of the drift dynamics, where Y :
Rn → Rn× pθ denotes the locally Lipschitz regression matrix, and θ ∈ R pθ denotes
the vector of constant unknown drift parameters. The system identifier is designed
as
x̂̇(t) = Y(x(t)) θ̂(t) + Σ_{i=1}^{N} gi(x(t)) ui(t) + kx x̃(t), (4.46)
where the measurable state estimation error x̃ is defined as x̃ (t) x (t) − x̂ (t),
k x ∈ Rn×n is a constant positive definite diagonal observer gain matrix, and θ̂ :
R≥t0 → R pθ denotes the vector of estimates of the unknown drift parameters. In
traditional adaptive systems, the estimates are updated to minimize the instanta-
neous state estimation error, and convergence of parameter estimates to their true
values can be established under a restrictive persistence of excitation condition. In
this result, a concurrent learning-based data-driven approach is developed to relax
the persistence of excitation condition to a weaker, verifiable rank condition.
Assumption 4.7 ([30, 31]) A history stack Hid containing state-action tuples {(xj, ûij) | i = 1, . . . , N, j = 1, . . . , Mθ}, recorded along the trajectories of (3.82), that satisfies
rank(Σ_{j=1}^{Mθ} Yjᵀ Yj) = pθ, (4.47)
is available a priori, where Yj = Y(xj), and pθ denotes the number of unknown parameters in the drift dynamics.
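The rank condition (4.47) can be verified numerically for a recorded history stack. The sketch below assumes Y is available as a callable returning the n × pθ regression matrix at a state; it is illustrative only.

```python
# Check the rank condition (4.47) for a recorded history stack.
import numpy as np

def satisfies_rank_condition(Y, history_states, p_theta, tol=1e-8):
    """Return True if sum_j Y_j^T Y_j has full rank p_theta."""
    S = sum(Y(xj).T @ Y(xj) for xj in history_states)
    return np.linalg.matrix_rank(S, tol=tol) == p_theta
```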
To facilitate the concurrent learning-based parameter update, numerical methods are used to compute the state derivative ẋj corresponding to (xj, ûj). The update law for the drift parameter estimates is designed as
θ̂̇(t) = Γθ Yᵀ(x(t)) x̃(t) + Γθ kθ Σ_{j=1}^{Mθ} Yjᵀ (ẋj − Σ_{i=1}^{N} gij uij − Yj θ̂(t)), (4.48)
where gij ≜ gi(xj), Γθ ∈ R^{pθ×pθ} is a constant positive definite adaptation gain matrix, and kθ ∈ R is a constant positive concurrent learning gain. The update law in (4.48)
requires the unmeasurable state derivative ẋ j . Since the state derivative at a past
recorded point on the state trajectory is required, past and future recorded values of
the state can be used along with accurate noncausal smoothing techniques to obtain
good estimates of ẋ j . In the presence of derivative estimation errors, the parameter
estimation errors can be shown to be uniformly ultimately bounded, where the size
of the ultimate bound depends on the error in the derivative estimate [31].
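As an illustration of the update law (4.48), the following Python sketch evaluates θ̂̇ from a recorded history stack; the derivative estimates xdot_hist are assumed to come from offline smoothing of the recorded trajectory, and all callable names are user-supplied assumptions.

```python
# Sketch of the concurrent learning update (4.48).
import numpy as np

def theta_hat_dot(theta_hat, x, x_tilde, x_hist, u_hist, xdot_hist, Y, g_list,
                  Gamma_theta, k_theta):
    """Y(x) -> n x p_theta regression matrix; g_list[i](x) -> n x m_i input matrix."""
    instantaneous = Y(x).T @ x_tilde
    cl = np.zeros_like(theta_hat)
    for xj, uj, xdotj in zip(x_hist, u_hist, xdot_hist):
        control_effect = sum(gi(xj) @ uij for gi, uij in zip(g_list, uj))
        cl += Y(xj).T @ (xdotj - control_effect - Y(xj) @ theta_hat)
    return Gamma_theta @ instantaneous + k_theta * Gamma_theta @ cl
```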
To incorporate new information, the history stack is updated with new data. Thus,
the resulting closed-loop system is a switched system. To ensure the stability of the
switched system, the history stack is updated using a singular value maximizing
algorithm (cf. [31]). Using (3.82), the state derivative can be expressed as
ẋj − Σ_{i=1}^{N} gij uij = Yj θ,
and hence, the update law in (4.48) yields the parameter estimation error dynamics
θ̃̇(t) = −Γθ Yᵀ(x(t)) x̃(t) − Γθ kθ (Σ_{j=1}^{Mθ} Yjᵀ Yj) θ̃(t), (4.49)
where θ̃ (t) θ − θ̂ (t) denotes the drift parameter estimation error. The closed-loop
dynamics of the state estimation error are given by
The following assumption, which in general is weaker than the persistence of excita-
tion assumption, is required for convergence of the concurrent learning-based critic
weight estimates.
Assumption 4.8 For each i ∈ {1, . . . , N}, there exists a finite set of Mxi points {xᵢᵏ ∈ Rⁿ | k = 1, . . . , Mxi} such that for all t ∈ R≥0,
c̲xi ≜ (1/Mxi) inf_{t∈R≥0} λmin( Σ_{k=1}^{Mxi} ωᵢᵏ(t)(ωᵢᵏ(t))ᵀ/ρᵢᵏ(t) ) > 0. (4.52)
The critic weight estimates and the least-squares gain matrices are updated using the concurrent learning-based update laws
Ŵ̇ci(t) = −kc1i Γi(t) (ωi(t)/ρi(t)) δ̂ti(t) − (kc2i Γi(t)/Mxi) Σ_{k=1}^{Mxi} (ωᵢᵏ(t)/ρᵢᵏ(t)) δ̂tiᵏ(t),
Γ̇i(t) = (βi Γi(t) − kc1i Γi(t) (ωi(t) ωiᵀ(t)/ρi²(t)) Γi(t)) 1{‖Γi‖≤Γ̄i},  ‖Γi(t0)‖ ≤ Γ̄i, (4.53)
where
δ̂ti(t) = δ̂i(x(t), Ŵci(t), Ŵa1(t), . . . , ŴaN(t), θ̂(t)),
δ̂tiᵏ(t) = δ̂i(xᵢᵏ, Ŵci(t), Ŵa1(t), . . . , ŴaN(t), θ̂(t)),
and the actor weight estimates are updated according to
Ŵ̇ai(t) = −ka1i (Ŵai(t) − Ŵci(t)) − ka2i Ŵai(t) + (kc1i/4) Σ_{j=1}^{N} ∇xσj(x(t)) Gij(x(t)) ∇xσjᵀ(x(t)) Ŵaj(t) (ωiᵀ(t)/ρi(t)) Ŵci(t) + (kc2i/(4Mxi)) Σ_{k=1}^{Mxi} Σ_{j=1}^{N} (∇xσj)ⁱᵏ Gijⁱᵏ (∇xσjᵀ)ⁱᵏ Ŵaj(t) (ωᵢᵏᵀ(t)/ρᵢᵏ(t)) Ŵci(t), (4.54)
where ka1i, ka2i ∈ R are positive constant adaptation gains. The forgetting factor βi, along with the saturation in the update law for the least-squares gain matrix in (4.53), ensures (cf. [47]) that the least-squares gain matrix Γi and its inverse are positive definite and bounded for all i ∈ {1, . . . , N}, and that the normalized regressor satisfies ‖ωi/ρi‖∞ ≤ 1/(2√(νi Γ̄i)).
Subtracting (3.85) from (4.51), the approximate Bellman error can be expressed in an unmeasurable form as
δ̂ti = ωiᵀ Ŵci + xᵀ Qi x + Σ_{j=1}^{N} (1/4) Ŵajᵀ ∇xσj Gij ∇xσjᵀ Ŵaj − ( xᵀ Qi x + Σ_{j=1}^{N} uj*ᵀ Rij uj* + ∇xVi* f + ∇xVi* Σ_{j=1}^{N} gj uj* ).
Substituting for V ∗ and u ∗ from (3.113) and using f = Y θ , the approximate Bellman
error can be expressed as
δ̂ti = ωiᵀ Ŵci + Σ_{j=1}^{N} (1/4) Ŵajᵀ ∇xσj Gij ∇xσjᵀ Ŵaj − Wiᵀ ∇xσi Y θ − ∇xεi Y θ
− Σ_{j=1}^{N} (1/4)( Wjᵀ ∇xσj Gij ∇xσjᵀ Wj + 2 ∇xεj Gij ∇xσjᵀ Wj + ∇xεj Gij ∇xεjᵀ )
+ (1/2) Σ_{j=1}^{N} ( Wiᵀ ∇xσi Gj ∇xσjᵀ Wj + ∇xεi Gj ∇xσjᵀ Wj + Wiᵀ ∇xσi Gj ∇xεjᵀ )
+ (1/2) Σ_{j=1}^{N} ∇xεi Gj ∇xεjᵀ.
Equivalently, in terms of the weight estimation errors,
δ̂ti = −ωiᵀ W̃ci + (1/4) Σ_{j=1}^{N} W̃ajᵀ ∇xσj Gij ∇xσjᵀ W̃aj − Wiᵀ ∇xσi Y θ̃
− (1/2) Σ_{j=1}^{N} ( Wiᵀ ∇xσi Gj − Wjᵀ ∇xσj Gij ) ∇xσjᵀ W̃aj − ∇xεi Y θ + Δi, (4.56)
where Δi ≜ (1/2) Σ_{j=1}^{N} ( Wiᵀ ∇xσi Gj − Wjᵀ ∇xσj Gij ) ∇xεjᵀ + (1/2) Σ_{j=1}^{N} Wjᵀ ∇xσj Gj ∇xεiᵀ + (1/2) Σ_{j=1}^{N} ∇xεi Gj ∇xεjᵀ − Σ_{j=1}^{N} (1/4) ∇xεj Gij ∇xεjᵀ. Similarly, the approximate Bellman error evaluated at the selected points can be expressed in an unmeasurable form as
δ̂tiᵏ = −ωᵢᵏᵀ W̃ci + (1/4) Σ_{j=1}^{N} W̃ajᵀ (∇xσj)ⁱᵏ Gijⁱᵏ (∇xσjᵀ)ⁱᵏ W̃aj + Δiⁱᵏ − (1/2) Σ_{j=1}^{N} ( Wiᵀ (∇xσi)ⁱᵏ Gjⁱᵏ − Wjᵀ (∇xσj)ⁱᵏ Gijⁱᵏ ) (∇xσjᵀ)ⁱᵏ W̃aj − Wiᵀ (∇xσi)ⁱᵏ Yⁱᵏ θ̃, (4.57)
To facilitate the stability analysis, consider the candidate Lyapunov function
VL ≜ Σ_{i=1}^{N} Vi* + (1/2) Σ_{i=1}^{N} W̃ciᵀ Γi⁻¹ W̃ci + (1/2) Σ_{i=1}^{N} W̃aiᵀ W̃ai + (1/2) x̃ᵀ x̃ + (1/2) θ̃ᵀ Γθ⁻¹ θ̃. (4.58)
Since Vi* are positive definite, the bound in (4.55) and [40, Lemma 4.3] can be used to bound the candidate Lyapunov function as
v̲(‖Z‖) ≤ VL(Z, t) ≤ v̄(‖Z‖), (4.59)
where Z = [xᵀ, W̃c1ᵀ, . . . , W̃cNᵀ, W̃a1ᵀ, . . . , W̃aNᵀ, x̃ᵀ, θ̃ᵀ]ᵀ ∈ R^{2n+2Σ_{i=1}^{N}Li+pθ} and v̲, v̄ : R≥0 → R≥0 are class K functions. For any compact set Z ⊂ R^{2n+2Σ_{i=1}^{N}Li+pθ}, define
ι1 ≜ max_{i,j} sup_{Z∈Z} ‖ (1/2) Wiᵀ ∇xσi Gj ∇xσjᵀ + (1/2) ∇xεi Gj ∇xσjᵀ ‖,
ι2 ≜ max_{i,j} sup_{Z∈Z} ( (kc1i ‖ωi‖)/(4ρi) ‖(3 Wjᵀ ∇xσj Gij − 2 Wiᵀ ∇xσi Gj) ∇xσjᵀ‖ + Σ_{k=1}^{Mxi} (kc2i ‖ωᵢᵏ‖)/(4 Mxi ρᵢᵏ) ‖(3 Wjᵀ (∇xσj)ⁱᵏ Gijⁱᵏ − 2 Wiᵀ (∇xσi)ⁱᵏ Gjⁱᵏ) (∇xσjᵀ)ⁱᵏ‖ ),
ι3 ≜ max_{i,j} sup_{Z∈Z} ‖ (1/2) Σ_{i,j=1}^{N} ( Wiᵀ ∇xσi + ∇xεi ) Gj ∇xεjᵀ − (1/4) Σ_{i,j=1}^{N} ( 2 Wjᵀ ∇xσj + ∇xεj ) Gij ∇xεjᵀ ‖,
ι8 ≜ ( Σ_{i=1}^{N} (kc1i + kc2i) ‖Wi‖ ι4 )/(8√(νi Γ̄i)),  ι9i ≜ ι1 N + (ka2i + ι8) ‖Wi‖,
ι10i ≜ ( kc1i LY ‖∇xεi‖ ‖θ‖ )/(2√(νi Γ̄i)),
vl ≜ min{ qi/2, (kc2i c̲xi)/4, kx, (2ka1i + ka2i)/8, (kθ y̲)/2 },
ι ≜ Σ_{i=1}^{N} ( (2 ι9i²)/(2ka1i + ka2i) + (ι10i²)/(kc2i c̲xi) ) + ι3. (4.60)
The sufficient gain conditions are
qi > 2 ι5i,
kc2i c̲xi > 2 ι5i + 2 ζ1 ι7i + ι2 ζ2 N + ka1i + 2 ζ3 ι6i Z̄,
2 ka1i + ka2i > 4 ι8 + (2 ι2 N)/ζ2,
kθ y̲ > (2 ι7i)/ζ1 + (2 ι6i/ζ3) Z̄, (4.61)
where Z̄ ≜ v̲⁻¹( v̄( max{ ‖Z(t0)‖, √(ι/vl) } ) ) and ζ1, ζ2, ζ3 ∈ R are known positive adjustable constants.
Since the neural network function approximation error and the Lipschitz constant
L Y depend on the compact set that contains the state trajectories, the compact set
needs to be established before the gains can be selected using (4.61). Based on
the subsequent stability analysis, an algorithm is developed in Appendix A.2.2 to
compute the required compact set (denoted by Z) based on the initial conditions.
Since the constants ι and vl depend on LY only through the products LY ε̄i and LY ζ3, Algorithm A.3 ensures that
√(ι/vl) ≤ (1/2) diam(Z), (4.62)
Proof The derivative of the candidate Lyapunov function in (4.58) along the trajec-
tories of (3.82), (4.49), (4.50), (4.53), and (4.54) is given by
V̇L = Σ_{i=1}^{N} ∇xVi* ( f + Σ_{j=1}^{N} gj uj ) + x̃ᵀ ( Y θ̃ − kx x̃ )
+ Σ_{i=1}^{N} W̃ciᵀ ( kc1i (ωi/ρi) δ̂ti + (kc2i/Mxi) Σ_{k=1}^{Mxi} (ωᵢᵏ/ρᵢᵏ) δ̂tiᵏ ) − (1/2) Σ_{i=1}^{N} W̃ciᵀ ( βi Γi⁻¹ − kc1i (ωi ωiᵀ/ρi²) ) W̃ci
+ θ̃ᵀ ( −Yᵀ x̃ − kθ ( Σ_{j=1}^{Mθ} Yjᵀ Yj ) θ̃ ) − Σ_{i=1}^{N} W̃aiᵀ ( −ka1i ( Ŵai − Ŵci ) − ka2i Ŵai )
+ Σ_{i=1}^{N} ( (kc1i/4) Σ_{j=1}^{N} W̃aiᵀ ∇xσj Gij ∇xσjᵀ Ŵaj (ωiᵀ/ρi) Ŵci + (kc2i/(4Mxi)) Σ_{k=1}^{Mxi} Σ_{j=1}^{N} W̃aiᵀ (∇xσj)ⁱᵏ Gijⁱᵏ (∇xσjᵀ)ⁱᵏ Ŵaj (ωᵢᵏᵀ/ρᵢᵏ) Ŵci ). (4.63)
Substituting the unmeasurable forms of the Bellman errors from (4.56) and (4.57)
into (4.63) and using the triangle inequality, the Cauchy–Schwarz inequality, and
Young’s inequality, the Lyapunov derivative in (4.63) can be bounded as
V̇L ≤ − Σ_{i=1}^{N} (qi/2) ‖x‖² − Σ_{i=1}^{N} (kc2i c̲xi/2) ‖W̃ci‖² − kx ‖x̃‖² − (kθ y̲/2) ‖θ̃‖² − Σ_{i=1}^{N} ((2ka1i + ka2i)/4) ‖W̃ai‖²
− Σ_{i=1}^{N} ( (kc2i c̲xi)/2 − ι5i − ζ1 ι7i − (1/2) ι2 ζ2 N − (1/2) ka1i − ζ3 ι6i ‖x‖ ) ‖W̃ci‖²
− Σ_{i=1}^{N} ( (kθ y̲)/2 − (ι7i)/ζ1 − (ι6i/ζ3) ‖x‖ ) ‖θ̃‖² + Σ_{i=1}^{N} ι9i ‖W̃ai‖ + Σ_{i=1}^{N} ι10i ‖W̃ci‖
− Σ_{i=1}^{N} ( (2ka1i + ka2i)/4 − ι8 − (ι2 N)/(2ζ2) ) ‖W̃ai‖² + ι3 − Σ_{i=1}^{N} ( (qi/2) − ι5i ) ‖x‖². (4.64)
Provided the gains are selected such that the conditions
(kc2i c̲xi)/2 > ι5i + ζ1 ι7i + (1/2) ι2 ζ2 N + (1/2) ka1i + ζ3 ι6i ‖x‖,
(kθ y̲)/2 > (ι7i)/ζ1 + (ι6i/ζ3) ‖x‖, (4.65)
hold for all Z ∈ Z, completing the squares in (4.64), the bound on the Lyapunov derivative can be expressed as
V̇L ≤ − Σ_{i=1}^{N} (qi/2) ‖x‖² − Σ_{i=1}^{N} ((kc2i c̲xi)/4) ‖W̃ci‖² − kx ‖x̃‖² − Σ_{i=1}^{N} ((2ka1i + ka2i)/8) ‖W̃ai‖² − ((kθ y̲)/2) ‖θ̃‖² + ι
≤ −vl ‖Z‖²,  ∀ ‖Z‖ > √(ι/vl), Z ∈ Z. (4.66)
The error between the feedback-Nash equilibrium policies and the approximate policies can be bounded as
‖ui* − ûi‖ ≤ (1/2) ‖Rii⁻¹‖ ḡi ( ‖∇xσi‖ ‖W̃ai‖ + ‖∇xεi‖ ),
for all i = 1, . . . , N, where ḡi ≜ sup_x ‖gi(x)‖. Since the weights W̃ai are uniformly
ultimately bounded, uniformly ultimately bounded convergence of the approximate
policies to the feedback-Nash equilibrium policies is obtained.
Remark 4.10 The closed-loop system analyzed using the candidate Lyapunov func-
tion in (4.58) is a switched system. The switching happens when the history stack
is updated and when the least-squares regression matrices Γi reach their saturation
bound. Similar to least-squares-based adaptive control (cf. [47]), (4.58) can be shown
to be a common Lyapunov function for the regression matrix saturation, and the use
of a singular value maximizing algorithm to update the history stack ensures that
(4.58) is a common Lyapunov function for the history stack updates (cf. [31]). Since
(4.58) is a common Lyapunov function, (4.59), (4.62), and (4.66) establish uniformly
ultimately bounded convergence of the switched system.
4.5.4 Simulation
To demonstrate the performance of the developed approach, a two-player nonzero-sum game subject to the control-affine dynamics ẋ = f(x) + g1(x) u1 + g2(x) u2 (4.67) is considered, where x ∈ R², u1, u2 ∈ R, and
f(x) = [ x2 − 2x1 ; −(1/2) x1 − x2 + (1/4) x2 (cos(2x1) + 2)² + (1/4) x2 (sin(4x1²) + 2)² ],
g1(x) = [ 0 ; cos(2x1) + 2 ],  g2(x) = [ 0 ; sin(4x1²) + 2 ].
The value function has the structure shown in (3.83) with the weights Q 1 = 2Q 2 =
2I2 and R11 = R12 = 2R21 = 2R22 = 2. The system identification protocol given in
Sect. 4.5.1 and the concurrent learning-based scheme given in Sect. 4.5.2 are imple-
mented simultaneously to provide an approximate online feedback-Nash equilibrium
solution to the given nonzero-sum two-player game.
The control-affine system in (4.67) is selected for this simulation because it is
constructed using the converse Hamilton–Jacobi approach [48] where the analytical
feedback-Nash equilibrium solution to the nonzero-sum game is
V1*(x) = [0.5, 0, 1] [x1², x1x2, x2²]ᵀ,  V2*(x) = [0.25, 0, 0.5] [x1², x1x2, x2²]ᵀ,
and the feedback-Nash equilibrium control policies for Player 1 and Player 2 are
u1*(x) = −(1/2) R11⁻¹ g1ᵀ(x) [2x1, 0; x2, x1; 0, 2x2]ᵀ [0.5, 0, 1]ᵀ,
u2*(x) = −(1/2) R22⁻¹ g2ᵀ(x) [2x1, 0; x2, x1; 0, 2x2]ᵀ [0.25, 0, 0.5]ᵀ.
Since the analytical solution is available, the performance of the developed method
can be evaluated by comparing the obtained approximate solution against the ana-
lytical solution.
The dynamics are linearly parameterized as f(x) = Y(x) θ, where
Y(x) = [ x2, x1, 0, 0, 0, 0 ; 0, 0, x1, x2, x2(cos(2x1) + 2)², x2(sin(4x1²) + 2)² ]
is known, and the constant vector of parameters θ = [1, −2, −1/2, −1, 1/4, 1/4]ᵀ is assumed to be unknown. The initial guess for θ is selected as θ̂(t0) = 0.5 × [1, 1, 1, 1, 1, 1]ᵀ. The system identification gains are selected as kx = 5, Γθ =
diag (20, 20, 100, 100, 60, 60), and kθ = 1.5. A history stack of thirty points is
selected using a singular value maximizing algorithm (cf. [31]) for the concurrent
learning-based update law in (4.48), and the state derivatives are estimated using a
fifth order Savitzky–Golay filter (cf. [41]). Based on the structure of the feedback-
Nash equilibrium value functions, the basis function for value function approximation
is selected as σ = [x12 , x1 x2 , x22 ]T , and the adaptive learning parameters and initial
conditions are shown for both players in Table 4.1. Twenty-five points lying on a
5 × 5 grid around the origin are selected for the concurrent learning-based update
laws in (4.53) and (4.54).
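State-derivative estimation of the kind described above can be done offline with scipy.signal.savgol_filter. The sketch below is illustrative: the sampling period, window length, and the placeholder trajectory are assumptions; only the fifth-order polynomial fit follows the text.

```python
# Estimating state derivatives from logged data with a Savitzky-Golay filter.
import numpy as np
from scipy.signal import savgol_filter

dt = 0.01                                             # sampling period (assumed)
t = np.arange(0.0, 10.0, dt)
x_log = np.column_stack([np.sin(t), np.cos(2 * t)])   # placeholder logged trajectory

# Fifth-order polynomial fit over a sliding window; deriv=1 returns dx/dt estimates.
x_dot_est = savgol_filter(x_log, window_length=21, polyorder=5, deriv=1,
                          delta=dt, axis=0)
```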
Figures 4.30, 4.31, 4.32 and 4.33 show the rapid convergence of the actor and
critic weights to the approximate feedback-Nash equilibrium values for both players,
resulting in the value functions and control policies
Table 4.1 Approximate dynamic programming learning gains and initial conditions

  Learning gains   Player 1   Player 2     Initial conditions   Player 1      Player 2
  ν                0.005      0.005        Ŵc(t0)               [3, 3, 3]ᵀ    [3, 3, 3]ᵀ
  kc1              1          1            Ŵa(t0)               [3, 3, 3]ᵀ    [3, 3, 3]ᵀ
  kc2              1.5        1            Γ(t0)                100 I₃        100 I₃
  ka1              10         10           x(t0)                [1, 1]ᵀ       [1, 1]ᵀ
  ka2              0.1        0.1          x̂(t0)                [0, 0]ᵀ       [0, 0]ᵀ
  β                3          3
  Γ̄                10,000     10,000
V̂1 = [0.5021, −0.0159, 0.9942] σ,  V̂2 = [0.2510, −0.0074, 0.4968] σ,
û1 = −(1/2) R11⁻¹ g1ᵀ [2x1, 0; x2, x1; 0, 2x2]ᵀ [0.4970, −0.0137, 0.9810]ᵀ,
û2 = −(1/2) R22⁻¹ g2ᵀ [2x1, 0; x2, x1; 0, 2x2]ᵀ [0.2485, −0.0055, 0.4872]ᵀ.
Figure 4.34 demonstrates that (without the injection of a persistently exciting signal)
the system identification parameters also approximately converged to the correct
values. The state and control signal trajectories are displayed in Figs. 4.35 and 4.36.
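Since the analytical weights are available for this example, the proximity of the learned value functions to the feedback-Nash value functions can be quantified directly. The sketch below uses the weights reported above; the sampling region is an assumption.

```python
# Compare learned value-function weights against the analytical solution.
import numpy as np

sigma = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])   # basis [x1^2, x1x2, x2^2]
W1_star, W2_star = np.array([0.5, 0.0, 1.0]), np.array([0.25, 0.0, 0.5])
W1_hat,  W2_hat  = np.array([0.5021, -0.0159, 0.9942]), np.array([0.2510, -0.0074, 0.4968])

X = np.random.default_rng(0).uniform(-1, 1, size=(100, 2))  # sample region (assumed)
err1 = max(abs((W1_hat - W1_star) @ sigma(x)) for x in X)
err2 = max(abs((W2_hat - W2_star) @ sigma(x)) for x in X)
print(f"max |V1_hat - V1*| = {err1:.4f}, max |V2_hat - V2*| = {err2:.4f}")
```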
Results such as [7, 10, 37, 44, 53–56] solve optimal tracking and differential game
problems for linear and nonlinear systems online, where persistence of excitation
of the error states is used to establish convergence. In general, it is impossible to
guarantee persistence of excitation a priori. As a result, a probing signal designed
using trial and error is added to the controller to ensure persistence of excitation.
However, the probing signal is typically not considered in the stability analysis.
Contemporary results on data-driven approximate dynamic programming meth-
ods include methods to solve set-point and output regulation [11, 57–61], trajectory
tracking [56, 62, 63], and differential game [64–69] problems.
References
1. Mehta P, Meyn S (2009) Q-learning and Pontryagin's minimum principle. In: Proceedings of IEEE conference on decision control, pp 3598–3605
2. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
3. Vrabie D (2010) Online adaptive optimal control for continuous-time systems. PhD thesis,
University of Texas at Arlington
4. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2007) Model-free q-learning designs for linear
discrete-time zero-sum games with application to H∞ control. Automatica 43:473–481
5. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
6. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
7. Vamvoudakis KG, Lewis FL (2011) Multi-player non-zero-sum games: online adaptive learning solution of coupled Hamilton–Jacobi equations. Automatica 47:1556–1569
8. Vamvoudakis KG, Lewis FL, Hudas GR (2012) Multi-agent differential graphical games:
Online adaptive learning solution for synchronization with optimality. Automatica 48(8):1598–
1611
9. Modares H, Lewis FL, Naghibi-Sistani MB (2013) Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans Neural Netw
Learn Syst 24(10):1513–1525
10. Kiumarsi B, Lewis FL, Modares H, Karimpour A, Naghibi-Sistani MB (2014) Reinforcement
Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics.
Automatica 50(4):1167–1175
11. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
12. Modares H, Lewis FL (2014) Optimal tracking control of nonlinear partially-unknown
constrained-input systems using integral reinforcement learning. Automatica 50(7):1780–1792
13. Kamalapurkar R, Walters P, Dixon WE (2013) Concurrent learning-based approximate optimal
regulation. In: Proceedings of IEEE conference on decision control, Florence, IT, pp 6256–6261
14. Kamalapurkar R, Andrews L, Walters P, Dixon WE (2014) Model-based reinforcement learning
for infinite-horizon approximate optimal tracking. In: Proceedings of IEEE conference on
decision control, Los Angeles, CA, pp 5083–5088
15. Kamalapurkar R, Klotz J, Dixon WE (2014) Model-based reinforcement learning for on-line
feedback-Nash equilibrium solution of N-player nonzero-sum differential games. In: Proceed-
ings of the American control conference, pp 3000–3005
39. Dierks T, Jagannathan S (2010) Optimal control of affine nonlinear continuous-time systems.
In: Proceedings of the American control conference, pp 1568–1573
40. Khalil HK (2002) Nonlinear systems, 3rd edn. Prentice Hall, Upper Saddle River
41. Savitzky A, Golay MJE (1964) Smoothing and differentiation of data by simplified least squares
procedures. Anal Chem 36(8):1627–1639
42. Garg D, Hager WW, Rao AV (2011) Pseudospectral methods for solving infinite-horizon opti-
mal control problems. Automatica 47(4):829–837
43. Kirk D (2004) Optimal control theory: an introduction. Dover, Mineola
44. Kamalapurkar R, Dinh H, Bhasin S, Dixon WE (2015) Approximate optimal trajectory tracking
for continuous-time nonlinear systems. Automatica 51:40–48
45. Lewis FL, Jagannathan S, Yesildirak A (1998) Neural network control of robot manipulators
and nonlinear systems. CRC Press, Philadelphia
46. Lewis FL, Vrabie D, Syrmos VL (2012) Optimal Control, 3rd edn. Wiley, Hoboken
47. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall, New Jersey
48. Nevistic V, Primbs JA (1996) Constrained nonlinear optimal control: a converse HJB approach.
Technical Report CIT-CDS 96-021, California Institute of Technology, Pasadena, CA 91125
49. Padhi R, Unnikrishnan N, Wang X, Balakrishnan S (2006) A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw 19(10):1648–1660
50. He P, Jagannathan S (2007) Reinforcement learning neural-network-based controller for non-
linear discrete-time systems with input constraints. IEEE Trans Syst Man Cybern Part B Cybern
37(2):425–436
51. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy hdp iteration algorithm. IEEE Trans
Syst Man Cybern Part B Cybern 38(4):937–942
52. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control algorithms
and stability. Communications and control engineering, Springer, London
53. Johnson M, Bhasin S, Dixon WE (2011) Nonlinear two-player zero-sum game approximate
solution using a policy iteration algorithm. In: Proceedings of conference on decision and
control, pp 142–147
54. Modares H, Lewis FL (2013) Online solution to the linear quadratic tracking problem of
continuous-time systems using reinforcement learning. In: Proceedings of conference on deci-
sion and control, Florence, IT, pp 3851–3856
55. Qin C, Zhang H, Luo Y (2014) Online optimal tracking control of continuous-time linear
systems with unknown dynamics by using adaptive dynamic programming. Int J Control
87(5):1000–1009
56. Modares H, Lewis FL, Jiang ZP (2015) H∞ tracking control of completely unknown
continuous-time systems via off-policy reinforcement learning. IEEE Trans Neural Netw Learn
Syst 26(10):2550–2562
57. Luo B, Wu HN, Huang T, Liu D (2014) Data-based approximate policy iteration for affine
nonlinear continuous-time optimal control design. Automatica 50:3281–3290
58. Yang X, Liu D, Wei Q (2014) Online approximate optimal control for affine non-linear systems
with unknown internal dynamics using adaptive dynamic programming. IET Control Theory
Appl 8(16):1676–1688
59. Jiang Y, Jiang ZP (2015) Global adaptive dynamic programming for continuous-time nonlinear
systems. IEEE Trans Autom Control 60(11):2917–2929
60. Bian T, Jiang ZP (2016) Value iteration and adaptive dynamic programming for data-driven
adaptive optimal control design. Automatica 71:348–360
61. Gao W, Jiang ZP (2016) Adaptive dynamic programming and adaptive optimal output regula-
tion of linear systems. IEEE Trans Autom Control 61(12):4164–4169
62. Xiao G, Luo Y, Zhang H, Jiang H (2016) Data-driven optimal tracking control for a class of
affine non-linear continuous-time systems with completely unknown dynamics. IET Control
Theory Appl 10(6):700–710
63. Gao W, Jiang ZP (to appear) Learning-based adaptive optimal tracking control of strict-
feedback nonlinear systems. IEEE Trans Neural Netw Learn Syst
64. Li H, Liu D, Wang D (2014) Integral reinforcement learning for linear continuous-time zero-
sum games with completely unknown dynamics. IEEE Trans Autom Sci Eng 11(3):706–714
65. Wei Q, Song R, Yan P (2016) Data-driven zero-sum neuro-optimal control for a class of
continuous-time unknown nonlinear systems with disturbance using ADP. IEEE Trans Neural
Netw Learn Syst 27(2):444–458
66. Song R, Lewis FL, Wei Q (2017) Off-policy integral reinforcement learning method to solve
nonlinear continuous-time multiplayer nonzero-sum games. IEEE Trans Neural Netw Learn
Syst 28(3):704–713
67. Song R, Wei Q, Song B (2017) Neural-network-based synchronous iteration learning method
for multi-player zero-sum games. Neurocomputing 242:73–82
68. Vamvoudakis KG, Modares H, Kiumarsi B, Lewis FL (2017) Game theory-based control
system algorithms with real-time reinforcement learning: How to solve multiplayer games
online. IEEE Control Syst 37(1):33–52
69. Wei Q, Liu D, Lin Q, Song R (2017) Adaptive dynamic programming for discrete-time zero-
sum games. IEEE Trans Neural Netw Learn Syst 29(4):957–969
Chapter 5
Differential Graphical Games
5.1 Introduction
Reinforcement learning techniques are valuable not only for optimization but also
for control synthesis in complex systems such as a distributed network of cogni-
tive agents. Combined efforts from multiple autonomous agents can yield tacti-
cal advantages including improved munitions effects, distributed sensing, detection,
and threat response, and distributed communication pipelines. While coordinating
behaviors among autonomous agents is a challenging problem that has received
mainstream focus, unique challenges arise when seeking optimal autonomous col-
laborative behaviors. For example, most collaborative control literature focuses on
centralized approaches that require all nodes to continuously communicate with a
central agent, yielding a heavy communication demand that is subject to failure due
to delays, and missing information. Furthermore, the central agent is required to
carry enough on-board computational resources to process the data and to generate
command signals. These challenges motivate the need to minimize communication
for guidance, navigation and control tasks, and to distribute the computational burden
among the agents. Since all the agents in a network have independent collaborative
or competitive objectives, the resulting optimization problem is a multi-objective
optimization problem.
In this chapter (see also, [1]), the objective is to obtain an online forward-in-
time feedback-Nash equilibrium solution (cf. [2–7]) to an infinite-horizon formation
tracking problem, where each agent desires to follow a mobile leader while the group
maintains a desired formation. The agents try to minimize cost functions that penalize
their own formation tracking errors and their own control efforts.
For multi-agent problems with decentralized objectives, the desired action by an
individual agent depends on the actions and the resulting trajectories of its neighbors;
hence, the error system for each agent is a complex nonautonomous dynamical sys-
tem. Nonautonomous systems, in general, have non-stationary value functions. Since
non-stationary functions are difficult to approximate using parameterized function
Consider a set of N autonomous agents moving in the state space Rn . The control
objective is for the agents to maintain a desired formation with respect to a leader.
The state of the leader is denoted by x0 ∈ Rn . The agents are assumed to be on a
network with a fixed communication topology modeled as a static directed graph
(i.e., digraph).
Each agent forms a node in the digraph. The set of all nodes excluding the leader
is denoted by N = {1, . . . N } and the leader is denoted by node 0. If node i can
receive information from node j then there exists a directed edge from the jth to
the ith node of the digraph, denoted by the ordered pair ( j, i). Let E denote the
set of all edges. Let there be a positive weight ai j ∈ R associated with each edge
(j, i). Note that aij ≠ 0 if and only if (j, i) ∈ E. The digraph is assumed to have no self-loops (i.e., (i, i) ∉ E, ∀i), which implies aii = 0, ∀i. The neighborhood sets of node i are denoted by N−i and Ni, defined as N−i ≜ {j ∈ N | (j, i) ∈ E} and Ni ≜ N−i ∪ {i}.
To streamline the analysis, an adjacency matrix A ∈ R^{N×N} is defined as A ≜ [aij | i, j ∈ N], a diagonal pinning gain matrix A0 ∈ R^{N×N} is defined as A0 ≜ diag([a10, . . . , aN0]), an in-degree matrix D ∈ R^{N×N} is defined as D ≜ diag(di), where di ≜ Σ_{j∈Ni} aij, and a graph Laplacian matrix L ∈ R^{N×N} is defined as L ≜ D − A. The graph is assumed to have a spanning tree (i.e., given any node i, there exists a directed path from the leader 0 to node i). A node j is said to be an extended neighbor of node i if there exists a directed path from node j to node i. The extended neighborhood set of node i, denoted by S−i, is defined as the set of all extended neighbors of node i. Formally, S−i ≜ {j ∈ N | j ≠ i ∧ ∃κ ≤ N, {j1, . . . , jκ} ⊂ N | {(j, j1), (j1, j2), . . . , (jκ, i)} ⊂ E}. Let Si ≜ S−i ∪ {i}, and let the edge weights
1 Parts of the text in this section are reproduced, with permission, from [1], ©2016, IEEE.
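The graph matrices defined above can be assembled directly from an edge list. The following sketch uses a small example topology that is purely an assumption (three followers in a chain pinned to the leader through node 1); it only illustrates the construction of A, A0, D, and L.

```python
# Constructing adjacency, pinning, in-degree, and Laplacian matrices for an
# example communication topology (topology and weights are assumptions).
import numpy as np

N = 3
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}   # (j, i): a_ij, node 0 is the leader

A  = np.zeros((N, N))            # adjacency among followers 1..N
A0 = np.zeros((N, N))            # diagonal pinning gains a_i0
for (j, i), w in edges.items():
    if j == 0:
        A0[i - 1, i - 1] = w
    else:
        A[i - 1, j - 1] = w
D = np.diag(A.sum(axis=1))       # in-degree matrix
L = D - A                        # graph Laplacian
```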
To facilitate the analysis, the error signal in (5.2) is expressed in terms of the unknown leader-relative desired positions as
ei(t) = Σ_{j∈{0}∪N−i} aij ( (xi(t) − xdi0) − (xj(t) − xdj0) ). (5.3)
Stacking the error signals in a vector E(t) ≜ [e1ᵀ(t), e2ᵀ(t), . . . , eNᵀ(t)]ᵀ ∈ R^{nN}, the equation in (5.3) can be expressed in a matrix form. The error dynamics are given by
ėi(t) = Σ_{j∈{0}∪N−i} aij ( fi(xi(t)) + gi(xi(t)) ui(t) − fj(xj(t)) − gj(xj(t)) uj(t) ).
Under the temporary assumption that each controller ui(·) is an error-feedback controller (i.e., ui(t) = ûi(ei(t), t)), the error dynamics are expressed as
ėi(t) = Σ_{j∈{0}∪N−i} aij ( fi(xi(t)) + gi(xi(t)) ûi(ei(t), t) − fj(xj(t)) − gj(xj(t)) ûj(ej(t), t) ).
Thus, the error trajectory {ei(t)}_{t=t0}^{∞}, where t0 denotes the initial time, depends on ûj(ej(t), t), ∀j ∈ Ni. Similarly, the error trajectory {ej(t)}_{t=t0}^{∞} depends on ûk(ek(t), t), ∀k ∈ Nj. Recursively, the trajectory {ei(t)}_{t=t0}^{∞} depends on ûj(ej(t), t), and hence on ej(t), ∀j ∈ Si. Thus, even if the controller for each agent is restricted to use local error feedback, the resulting error trajectories are interdependent. In particular, a change in the initial condition of one agent in the extended neighborhood causes a change in the error trajectories corresponding to all the extended neighbors. Consequently, the value function corresponding to an infinite-horizon optimal control problem where each agent tries to minimize ∫_{t0}^{∞} (Q(ei(τ)) + R(ui(τ))) dτ, where Q : Rⁿ → R and R : R^{mi} → R are positive definite functions, is dependent on the error states of all the extended neighbors.
Since the steady-state controllers required for formation tracking are generally
nonzero, quadratic total-cost optimal control problems result in infinite costs, and
hence, are infeasible. In the following section, relative steady-state controllers are
derived to facilitate the formulation of a feasible optimal control problem.
When the agents are perfectly tracking the desired trajectory in the desired formation,
even though the states of all the agents are different, the time-derivatives of the states
of all the agents are identical. Hence, in steady state, the control signal applied by
each agent must be such that the time derivatives of the states corresponding to
the set of extended neighbors are identical. In particular, the relative control signal
u i j : R≥t0 → Rm i that will keep node i in its desired relative position with respect
to node j ∈ S−i (i.e., xi (t) = x j (t) + xdi j ), must be such that the time derivative
of xi (·) is the same as the time derivative of x j (·). Using the dynamics of the agent
from (5.1) and substituting the desired relative position x j (·) + xdi j for the state
xi (·), the relative control signal u i j (·) must satisfy
fi(xj(t) + xdij) + gi(xj(t) + xdij) uij(t) = ẋj(t). (5.5)
The relative steady-state control signal can be expressed in an explicit form provided the following assumption is satisfied.
Assumption 5.1 The matrix gi(x) is full rank for all i ∈ N and for all x ∈ Rⁿ; furthermore, the relative steady-state control signal, expressed as uij(t) = fij(xj(t)) + gij(xj(t)) uj(t), satisfies (5.5) along the desired trajectory, where fij(xj) ≜ gi⁺(xj + xdij)(fj(xj) − fi(xj + xdij)) ∈ R^{mi} and gij(xj) ≜ gi⁺(xj + xdij) gj(xj) ∈ R^{mi×mj}, where the control effectiveness and the control input relative to the leader are understood to be g0(x) = 0, ∀x ∈ Rⁿ and ui0 ≡ 0, ∀i ∈ N, respectively, and gi⁺(x) denotes a pseudoinverse of the matrix gi(x), ∀x ∈ Rⁿ.
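As a minimal sketch of Assumption 5.1, the terms fij and gij can be evaluated with a Moore–Penrose pseudoinverse; the dynamics callbacks below are assumptions standing in for each agent's f and g.

```python
# Sketch of the relative steady-state control u_ij = f_ij(x_j) + g_ij(x_j) u_j.
import numpy as np

def relative_control_terms(f_i, g_i, f_j, g_j, x_j, x_dij):
    """f_i, g_i, f_j, g_j are callables for the agents' dynamics (assumed available)."""
    gi_pinv = np.linalg.pinv(g_i(x_j + x_dij))
    f_ij = gi_pinv @ (f_j(x_j) - f_i(x_j + x_dij))
    g_ij = gi_pinv @ g_j(x_j)
    return f_ij, g_ij

def u_ij(f_i, g_i, f_j, g_j, x_j, x_dij, u_j):
    f_ij, g_ij = relative_control_terms(f_i, g_i, f_j, g_j, x_j, x_dij)
    return f_ij + g_ij @ u_j
```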
To facilitate the formulation of an optimal formation tracking problem, define the control error μi : R≥t0 → R^{mi} as
μi(t) ≜ Σ_{j∈N−i∪{0}} aij ( ui(t) − uij(t) ). (5.6)
In the remainder of this section, the control errors {μi (·)} will be treated as
the design variables. To implement the controllers using the designed control
errors, it is essential to invert the relationship in (5.6). To facilitate the inver-
sion, let Siᵒ ≜ {1, . . . , si}, where si ≜ |Si|. Let λi : Siᵒ → Si be a bijective map such that λi(1) = i. For notational brevity, let (·)_{Si} denote the concatenated vector [(·)_{λi¹}ᵀ, (·)_{λi²}ᵀ, . . . , (·)_{λi^{si}}ᵀ]ᵀ, let (·)_{S−i} denote the concatenated vector [(·)_{λi²}ᵀ, . . . , (·)_{λi^{si}}ᵀ]ᵀ, let Σ_jⁱ denote Σ_{j∈N−i∪{0}}, let λiʲ denote λi(j), let Ei(t) ≜ [e_{Si}ᵀ(t), x_{λi¹}ᵀ(t)]ᵀ ∈ R^{n(si+1)}, and let E−i(t) ≜ [e_{S−i}ᵀ(t), x_{λi¹}ᵀ(t)]ᵀ ∈ R^{nsi}. Then, the control error vector μ_{Si}(t) ∈ R^{Σ_{k∈Si} mk} can be expressed as
where k, l = 1, 2, . . . , si, and Fi : R^{n(si+1)} → R^{Σ_{k∈Si} mk} is defined as
Fi(Ei) ≜ [ Σ_j^{λi¹} a_{λi¹ j} f_{λi¹ j}ᵀ(xj), . . . , Σ_j^{λi^{si}} a_{λi^{si} j} f_{λi^{si} j}ᵀ(xj) ]ᵀ.
Assumption 5.2 The matrix Lgi (Ei (t)) is invertible for all t ∈ R.
Assumption 5.2 is a controllability-like condition. Intuitively, Assumption 5.2
requires the control effectiveness matrices to be compatible to ensure the existence
of relative control inputs that allow the agents to follow the desired trajectory in the
desired formation.
Using Assumption 5.2, the control vector can be expressed as
u_{Si}(t) = Lgi⁻¹(Ei(t)) μ_{Si}(t) + Lgi⁻¹(Ei(t)) Fi(Ei(t)). (5.8)
Let Lgiᵏ denote the λi⁻¹(k)th block row of Lgi⁻¹. Then, the controller ui(·) can be implemented as
ui(t) = Lgiⁱ(Ei(t)) μ_{Si}(t) + Lgiⁱ(Ei(t)) Fi(Ei(t)), (5.9)
Using (5.9) and (5.10), the error and the state dynamics for the agents can be repre-
sented as
ėi (t) = Fi (Ei (t)) + Gi (Ei (t)) μSi (t) , (5.11)
and
ẋi (t) = Fi (Ei (t)) + Gi (Ei (t)) μSi (t) , (5.12)
where
Fi(Ei) ≜ Σ_jⁱ aij gi(xi) Lgiⁱ(Ei) Fi(Ei) − Σ_jⁱ aij gj(xj) Lgiʲ(Ei) Fi(Ei) + Σ_jⁱ aij fi(xi) − Σ_jⁱ aij fj(xj),
Gi(Ei) ≜ Σ_jⁱ aij ( gi(xi) Lgiⁱ(Ei) − gj(xj) Lgiʲ(Ei) ),
Fi(Ei) ≜ fi(xi) + gi(xi) Lgiⁱ(Ei) Fi(Ei),
Gi(Ei) ≜ gi(xi) Lgiⁱ(Ei).
Let h_{ei}(t; t0, Ei0, μi, μ_{S−i}) and h_{xi}(t; t0, Ei0, μi, μ_{S−i}) denote the trajectories of (5.11) and (5.12), respectively, with the initial time t0, initial condition Ei(t0) = Ei0, and policies μj : R^{n(sj+1)} → R^{mj}, j ∈ Si, and let Hi ≜ [(h_e)_{Si}ᵀ, h_{xλi¹}ᵀ]ᵀ. Define a cost functional
Ji(ei(·), μi(·)) ≜ ∫₀^∞ ri(ei(σ), μi(σ)) dσ, (5.13)
Vi(Ei; μi, μ_{S−i}) ≜ ∫_t^∞ ri( h_{ei}(σ; t, Ei, μi, μ_{S−i}), μi(Hi(σ; t, Ei, μi, μ_{S−i})) ) dσ, (5.14)
where Vi(Ei; μi, μ_{S−i}) denotes the total cost-to-go under the policies μ_{Si}, starting from the state Ei. Note that the value functions in (5.14) are time-invariant because the dynamical systems {ėj(t) = Fj(Ej(t)) + Gj(Ej(t)) μ_{Sj}(t)}_{j∈Si} and ẋi(t) = Fi(Ei(t)) + Gi(Ei(t)) μ_{Si}(t) together form an autonomous dynamical system.
A graphical feedback-Nash equilibrium solution within the subgraph Si is defined as the tuple of policies {μj* : R^{n(sj+1)} → R^{mj}}_{j∈Si} such that the value functions in (5.14) satisfy
Vj*(Ej) ≜ Vj(Ej; μj*, μ_{S−j}*) ≤ Vj(Ej; μj, μ_{S−j}*),
∀j ∈ Si, ∀Ei ∈ R^{n(si+1)}, and for all admissible policies μj. Provided a feedback-Nash equilibrium solution exists and the value functions (5.14) are continuously differentiable, the feedback-Nash equilibrium value functions can be characterized in terms of the following system of Hamilton–Jacobi equations:
Σ_{j∈Si} ∇_{ej} Vi*(Ei)( Fj(Ei) + Gj(Ei) μ_{Sj}*(Ei) ) + Qi(Ei) + μi*ᵀ(Ei) Ri μi*(Ei) + ∇_{xi} Vi*(Ei)( Fi(Ei) + Gi(Ei) μ_{Si}*(Ei) ) = 0, ∀Ei ∈ R^{n(si+1)}, (5.15)
Theorem 5.3 Provided a feedback-Nash equilibrium solution exists and that the
value functions in (5.14) are continuously differentiable, the system of Hamilton–
Jacobi equations in (5.15) constitutes a necessary and sufficient condition for
feedback-Nash equilibrium.
Proof Consider the cost functional in (5.13), and assume that all the extended neighbors of the ith agent follow their feedback-Nash equilibrium policies. The value function corresponding to any admissible policy μi can be expressed as
Vi([eiᵀ, E−iᵀ]ᵀ; μi, μ_{S−i}*) = ∫_t^∞ ri( h_{ei}(σ; t, Ei, μi, μ_{S−i}*), μi(Hi(σ; t, Ei, μi, μ_{S−i}*)) ) dσ.
∀ei ∈ Rⁿ and ∀t ∈ R≥0. Assuming that the optimal controller that minimizes (5.13) when all the extended neighbors follow their feedback-Nash equilibrium policies exists, and that the optimal value function V̄i* ≜ V̄i((·); μi*, μ_{S−i}*) exists and is continuously differentiable, optimal control theory for single objective optimization problems (cf. [10]) can be used to derive the following necessary and sufficient condition:
(∂V̄i*(ei, t)/∂ei)( Fi(Ei) + Gi(Ei) μ_{Si}*(Ei) ) + ∂V̄i*(ei, t)/∂t + Qi(ei) + μi*ᵀ(Ei) Ri μi*(Ei) = 0. (5.17)
Using (5.16), the partial derivative with respect to the state can be expressed as
∂V̄i*(ei, t)/∂ei = ∂Vi*(Ei)/∂ei, (5.18)
∀ei ∈ Rⁿ and ∀t ∈ R≥0. The partial derivative with respect to time can be expressed as
∂V̄i*(ei, t)/∂t = Σ_{j∈S−i} (∂Vi*(Ei)/∂ej)( Fj(Ei) + Gj(Ei) μ_{Sj}*(Ei) ) + (∂Vi*(Ei)/∂xi) Fi(Ei) + (∂Vi*(Ei)/∂xi) Gi(Ei) μ_{Si}*(Ei), (5.19)
∀ei ∈ Rⁿ and ∀t ∈ R≥0. Substituting (5.18) and (5.19) into (5.17) and repeating the process for each i, the system of Hamilton–Jacobi equations in (5.15) is obtained.
Minimizing the Hamilton–Jacobi equations using the stationary condition, the feedback-Nash equilibrium solution is expressed in the explicit form
μi*(Ei) = −(1/2) Ri⁻¹ Σ_{j∈Si} Gjiᵀ(Ei) ∇_{ej} Vi*ᵀ(Ei) − (1/2) Ri⁻¹ Giiᵀ(Ei) ∇_{xi} Vi*ᵀ(Ei), (5.20)
∀Ei ∈ R^{n(si+1)}, where Gji ≜ Gj ∂μ_{Sj}*/∂μi* and Gii ≜ Gi ∂μ_{Si}*/∂μi*. As it is generally infeasible to obtain an analytical solution to the system of the Hamilton–Jacobi equations in (5.15), the feedback-Nash value functions and the feedback-Nash policies are approximated using parametric approximation schemes as V̂i(Ei, Ŵci) and μ̂i(Ei, Ŵai), respectively, where Ŵci ∈ R^{Li} and Ŵai ∈ R^{Li} are parameter estimates. Substitution of the approximations V̂i and μ̂i in (5.15) leads to a set of Bellman errors δi defined as
δi(Ei, Ŵci, Ŵ_{a_{Si}}) ≜ μ̂iᵀ(Ei, Ŵai) Ri μ̂i(Ei, Ŵai) + Σ_{j∈Si} ∇_{ej} V̂i(Ei, Ŵci) Fj(Ej) + Σ_{j∈Si} ∇_{ej} V̂i(Ei, Ŵci) Gj(Ej) μ̂_{Sj}(Ej, Ŵ_{a_{Sj}}) + ∇_{xi} V̂i(Ei, Ŵci)( Fi(Ei) + Gi(Ei) μ̂_{Si}(Ei, Ŵ_{a_{Si}}) ) + Qi(ei). (5.21)
On any compact set χ ⊂ Rⁿ the function fi can be represented using a neural network as
fi(x) = θiᵀ σθi(x) + εθi(x), (5.22)
and the drift dynamics are approximated as f̂i(x, θ̂i) ≜ θ̂iᵀ σθi(x). Based on (5.22), an estimator for online identification of the drift dynamics is developed as
x̂̇i(t) = θ̂iᵀ(t) σθi(xi(t)) + gi(xi(t)) ui(t) + ki x̃i(t), (5.23)
where x̃i (t) xi (t) − x̂i (t) and ki ∈ R is a positive constant learning gain. The
following assumption facilitates concurrent learning-based system identification.
Assumption 5.5 ([12, 13]) A history stack containing recorded state-action pairs {(xᵢᵏ, uᵢᵏ)}_{k=1}^{Mθi}, along with numerically computed state derivatives {x̄̇ᵢᵏ}_{k=1}^{Mθi}, that satisfies
λmin( Σ_{k=1}^{Mθi} σθiᵏ (σθiᵏ)ᵀ ) = σ̲θi > 0,  ‖x̄̇ᵢᵏ − ẋᵢᵏ‖ < d̄i, ∀k, (5.24)
is available a priori. In (5.24), σθiᵏ ≜ σθi(xᵢᵏ), and d̄i, σ̲θi ∈ R are known positive constants.
The weight estimates θ̂i(·) are updated using the following concurrent learning-based update law:
θ̂̇i(t) = kθi Γθi Σ_{k=1}^{Mθi} σθiᵏ ( x̄̇ᵢᵏ − gᵢᵏ uᵢᵏ − θ̂iᵀ(t) σθiᵏ )ᵀ + Γθi σθi(xi(t)) x̃iᵀ(t), (5.25)
where gᵢᵏ ≜ gi(xᵢᵏ), kθi ∈ R is a constant positive concurrent learning gain, and Γθi ∈ R^{(Pi+1)×(Pi+1)} is a constant, diagonal, and positive definite adaptation gain matrix.
To facilitate the subsequent stability analysis, a candidate Lyapunov function V0i : Rⁿ × R^{(Pi+1)×n} → R is selected as
V0i(x̃i, θ̃i) ≜ (1/2) x̃iᵀ x̃i + (1/2) tr( θ̃iᵀ Γθi⁻¹ θ̃i ), (5.26)
where θ̃i(t) ≜ θi − θ̂i(t). Using (5.23)–(5.25), a bound on the time derivative of V0i is established as
V̇0i(x̃i, θ̃i) ≤ −ki ‖x̃i‖² − kθi σ̲θi ‖θ̃i‖²_F + ‖εθi‖ ‖x̃i‖ + kθi dθi ‖θ̃i‖_F, (5.27)
where dθi ≜ d̄i Σ_{k=1}^{Mθi} ‖σθiᵏ‖ + Σ_{k=1}^{Mθi} ‖εθiᵏ‖ ‖σθiᵏ‖. Using (5.26) and (5.27),
θi k=1 θi θi
a Lyapunov-based stability analysis can be used to show that θ̂i (·) converges expo-
nentially to a neighborhood around θi .
Remark 5.6 Using an integral formulation, the system identifier can also be imple-
mented without using state-derivative measurements (see, e.g., [14]).
Using the approximations fˆi for the functions f i , the Bellman errors in (5.21) can
be approximated as
δ̂i(Ei, Ŵci, Ŵ_{a_{Si}}, θ̂_{Si}) ≜ μ̂iᵀ(Ei, Ŵai) Ri μ̂i(Ei, Ŵai) + Σ_{j∈Si} ∇_{ej} V̂i(Ei, Ŵci) F̂j(Ej, θ̂_{Sj}) + ∇_{xi} V̂i(Ei, Ŵci)( F̂i(Ei, θ̂_{Si}) + Gi(Ei) μ̂_{Si}(Ei, Ŵ_{a_{Si}}) ) + Qi(ei) + Σ_{j∈Si} ∇_{ej} V̂i(Ei, Ŵci) Gj(Ej) μ̂_{Sj}(Ej, Ŵ_{a_{Sj}}). (5.28)
In (5.28),
F̂i(Ei, θ̂_{Si}) ≜ Σ_jⁱ aij ( f̂i(xi, θ̂i) − f̂j(xj, θ̂j) ) + Σ_jⁱ aij ( gi(xi) Lgiⁱ − gj(xj) Lgiʲ ) F̂i(Ei, θ̂_{Si}),
F̂i(Ei, θ̂_{Si}) ≜ θ̂iᵀ σθi(xi) + gi(xi) Lgiⁱ F̂i(Ei, θ̂_{Si}),
F̂i(Ei, θ̂_{Si}) ≜ [ Σ_j^{λi¹} a_{λi¹ j} f̂_{λi¹ j}(x_{λi¹}, θ̂_{λi¹}, xj, θ̂j) ; . . . ; Σ_j^{λi^{si}} a_{λi^{si} j} f̂_{λi^{si} j}(x_{λi^{si}}, θ̂_{λi^{si}}, xj, θ̂j) ],
f̂ij(xi, θ̂i, xj, θ̂j) ≜ gi⁺(xj + xdij) f̂j(xj, θ̂j) − gi⁺(xj + xdij) f̂i(xj + xdij, θ̂i).
The approximations F̂i, F̂i, and F̂i are related to the original unknown functions as F̂i(Ei, θ_{Si}) + Bi(Ei) = Fi(Ei), F̂i(Ei, θ_{Si}) + Bi(Ei) = Fi(Ei), and F̂i(Ei, θ_{Si}) + Bi(Ei) = Fi(Ei), where Bi, Bi, and Bi are O((ε̄θ)_{Si}) terms that denote bounded function approximation errors.
Using the approximations f̂i, an implementable form of the controllers in (5.8) is expressed as
u_{Si}(t) = Lgi⁻¹(Ei(t)) μ̂_{Si}(Ei(t), Ŵ_{a_{Si}}(t)) + Lgi⁻¹(Ei(t)) F̂i(Ei(t), θ̂_{Si}(t)). (5.29)
Using (5.7) and (5.29), an unmeasurable form of the virtual controllers for the systems (5.11) and (5.12) is given by
μ_{Si}(t) = μ̂_{Si}(Ei(t), Ŵ_{a_{Si}}(t)) − F̂i(Ei(t), θ̃_{Si}(t)) − Bi(Ei(t)). (5.30)
On any compact set χ ⊂ R^{n(si+1)}, the value functions can be represented as
Vi*(Ei) = Wiᵀ σi(Ei) + εi(Ei), (5.31)
where Wi ∈ R^{Li} are ideal neural network weights, σi : R^{n(si+1)} → R^{Li} are neural network basis functions, and εi : R^{n(si+1)} → R are function approximation errors. Using the universal function approximation property of single layer neural networks,
provided σi(Ei) forms a proper basis, there exist constant ideal weights Wi and positive constants W̄i, ε̄i, ∇ε̄i ∈ R such that ‖Wi‖ ≤ W̄i < ∞, sup_{Ei∈χ} ‖εi(Ei)‖ ≤ ε̄i, and sup_{Ei∈χ} ‖∇εi(Ei)‖ ≤ ∇ε̄i.
Using the neural network representation of the value functions, the feedback-Nash equilibrium policies can be expressed as
μi*(Ei) = −(1/2) Ri⁻¹ Gσi(Ei) Wi − (1/2) Ri⁻¹ Gεi(Ei),
The value functions and the policies are approximated using neural networks as
V̂i(Ei, Ŵci) ≜ Ŵciᵀ σi(Ei),  μ̂i(Ei, Ŵai) ≜ −(1/2) Ri⁻¹ Gσi(Ei) Ŵai, (5.32)
where Ŵci and Ŵai are estimates of the ideal weights Wi introduced in (5.31).
A consequence of Theorem 5.3 is that the Bellman error provides an indirect mea-
sure of how close the estimates Ŵci and Ŵai are to the ideal weights Wi . From a
reinforcement learning perspective, each evaluation of the Bellman error along the
system trajectory can be interpreted as experience gained by the critic, and each
evaluation of the Bellman error at points not yet visited can be interpreted as sim-
ulated experience. In previous results such as [15–19], the critic is restricted to the
experience gained (in other words Bellman errors evaluated) along the system state
trajectory. The development in [15–18] can be extended to employ simulated expe-
rience; however, the extension requires exact model knowledge. The formulation in
(5.28) employs the system identifier developed in Sect. 5.2.5 to facilitate approximate
evaluation of the Bellman error at off-trajectory points.
To simulate experience, a set of points {Eᵢᵏ}_{k=1}^{Mi} is selected corresponding to each agent i, and the instantaneous Bellman error in (5.21) is approximated at the current state and the selected points using (5.28). The approximation at the current state is denoted by δ̂ti and the approximation at the selected points is denoted by δ̂tiᵏ, where δ̂ti and δ̂tiᵏ are defined as
δ̂ti(t) ≜ δ̂i(Ei(t), Ŵci(t), Ŵ_{a_{Si}}(t), θ̂_{Si}(t)),  δ̂tiᵏ(t) ≜ δ̂i(Eᵢᵏ, Ŵci(t), Ŵ_{a_{Si}}(t), θ̂_{Si}(t)).
Note that once {ej}_{j∈Si} and xi are selected, the ith agent can compute the states of all the remaining agents in the sub-graph.
The critic uses simulated experience to update the critic weights using the least-squares-based update law
Ŵ̇ci(t) = −ηc1i Γi(t) (ωi(t)/ρi(t)) δ̂ti(t) − (ηc2i Γi(t)/Mi) Σ_{k=1}^{Mi} (ωᵢᵏ(t)/ρᵢᵏ(t)) δ̂tiᵏ(t),
Γ̇i(t) = ( βi Γi(t) − ηc1i Γi(t) (ωi(t) ωiᵀ(t)/ρi²(t)) Γi(t) ) 1{‖Γi‖≤Γ̄i},  ‖Γi(t0)‖ ≤ Γ̄i, (5.33)
where the notation φik indicates evaluation at Ei = Eik for a function φi (Ei , (·)) (i.e.,
φik (·) φi Eik , (·) ). The actor updates the actor weights using the following update
law derived based on a Lyapunov-based stability analysis:
5.2 Cooperative Formation Tracking Control of Heterogeneous Agents 163
ω T (t)
Ŵ˙ ai (t) = ηc1i G σT i (Ei (t)) Ri−1 G σ i (Ei (t)) Ŵai (t) i
1
Ŵci (t) − ηa2i Ŵai (t)
4 ρi (t)
k T
1 ηc2i k T −1 k
Mi
ω (t)
+ Gσ i Ri G σ i Ŵai (t) i k Ŵci (t) − ηa1i Ŵai (t) − Ŵci (t) ,
4
k=1
Mi ρi (t)
(5.34)
where ηa1i , ηa2i ∈ R are constant positive learning gains. The following assumption
facilitates simulation of experience.
Mi
Assumption 5.8 [13] For each i ∈ N , there exists a finite set of Mi points Eik k=1
such that # $
Mi ωik (t)(ωik (t))T
inf t∈R≥0 λmin k=1 ρik (t)
ρi > 0, (5.35)
Mi
To facilitate the stability analysis, the left hand side of (5.15) is subtracted from
(5.28) to express the Bellman error in terms of the weight estimation errors as
1
δ̂ti = −W̃ciT ωi − WiT ∇xi σi F̂i Ei , θ̃Si + W̃aiT G σT i Ri−1 G σ i W̃ai
4
1
− WiT ∇e j σi Fˆ j E j , θ̃S j + WiT ∇e j σi G j RS j W̃a
j∈Si
2 j∈Si
Sj
1 1
− WiT G σT i Ri−1 G σ i W̃ai + WiT ∇xi σi Gi RSi W̃a + i , (5.36)
2 2 Si
where ˜ (·) − (·),
(·) ˆ i = O (
)Si , ∇
Si , (
θ )Si , and RS j
diag Rλ−1 1 G
T
σ λ1
, . . . , R −1
sj G
T
sj is a block diagonal matrix. Consider a set of
j j λj σλj
extended neighbors S p corresponding to the pth agent. To analyze asymptotic prop-
erties of the agents in S p , consider the following candidate Lyapunov function
1 1
VL p Z p , t Vti eSi , t + W̃ciT i−1 W̃ci + T W̃ +
W̃ai ai V0i x̃i , θ̃i ,
2 2
i∈S p i∈S p i∈S p i∈S p
(5.37)
vec (·) denotes the vectorization operator, and Vti : Rnsi × R → R is defined as
T
Vti eSi , t Vi∗ eST i , xiT (t) , (5.38)
∀Z p ∈ R(2nsi +2L i si +n(Pi +1)si ) and ∀t ∈ R≥t0 , where vlp , vlp : R → R are class K func-
tions.
To facilitate the stability analysis, given any compact ball χ p ⊂ R2nsi +2L i si +n(Pi +1)si
of radius r p ∈ R centered at the origin, a positive constant ι p ∈ R is defined as
⎛ 2 ⎞ 2
⎜
θi 2 3 kθi dθi + Aiθ Biθ 5 (ηc1i + ηc2i )2 ωρii i
⎟
ιp ⎝ + ⎠+
i∈S
2ki 4σθi i∈S
4ηc2i ρi
p p
1
+ ∗
∇e j Vi (Ei ) G j RS j
S j
∗
∇xi Vi (Ei ) Gi RSi
Si +
i∈S p
2 j∈Si
+ ∇ V ∗
(E ) G B + ∇ V ∗
(E ) G B
ej i i j j xi i i i i
i∈S p j∈Si
2
Aia1 (ηc1i +ηc2i ) T ωi −1
3 i∈S p 2
+ η a2i W i + 4 W i ρi W i
T T
G R
σi i G σ i
+ ,
4 (ηa1i + ηa2i )
5.2 Cooperative Formation Tracking Control of Heterogeneous Agents 165
Theorem 5.9 Provided Assumptions 5.2–5.8 hold and the sufficient gain conditions
in (5.40)–(5.42) are satisfied, the controller in (5.32) along with the actor and critic
update laws in (5.33) and (5.34), and the system identifier in (5.23) along with the
weight update laws in (5.25) ensure that the local neighborhood tracking errors ei
are ultimately bounded and that the policies μ̂i converge to a neighborhood around
the feedback-Nash policies μi∗ , ∀i ∈ N .
Proof The time derivative of the candidate Lyapunov function in (5.37) is given by
1 T −1
V̇L p = V̇ti eSi , t − W̃ci i ˙ i i−1 W̃ci − W̃ciT i−1 Ŵ˙ ci − W̃aiT Ŵ˙ ai
2
i∈S p i∈S p i∈S p i∈S p
+ V̇0i x̃i , θ̃i . (5.43)
i∈S p
Using (5.15), (5.27), (5.30), and (5.36), the update laws in (5.33) and (5.34), and the
definition of Vti in (5.38), the derivative in (5.43) can be bounded as
ηc2i ρi
2 (ηa1i + ηa2i )
2
V̇L p ≤ − ci
W̃ − ai
W̃
i∈S p
5 3
ki kθi σθi
2
+ −qi (ei ) − x̃i −
2
θ̃i + ι p .
i∈S
2 3 F
p
1 1 ηc2i ρi
2 1 (η + η )
a2i
2
vlp Z p ≤
a1i
qi (ei ) + W̃ci + W̃ai
2 2 5 2 3
i∈S p i∈S p i∈S p
1 ki 1 kθi σθi
2
+ x̃i 2 + θ̃i , (5.44)
2 2 2 3 F
i∈S p i∈S p
5.2.10 Simulations
One-Dimensional Example
This section provides a simulation example to demonstrate the applicability of the
developed technique. The agents are assumed to have the communication topology
as shown in Fig. 5.1 with unit pinning gains and edge weights. Agent motion is
described by identical nonlinear one-dimensional dynamics of the form (5.1) where
f i (xi ) = θi1 xi + θi2 xi2 and gi (xi ) = (cos(2xi1 ) + 2), ∀i = 1, . . . , 5. The ideal val-
ues of the unknown parameters are selected to be θi1 = 0, 0, 0.1, 0.5, and 0.2,
and θi2 = 1, 0.5, 1, 1, and 1, for i = 1, . . . , 5, respectively. The agents start at
xi = 2, ∀i, and their final desired locations with respect to each other are given
by xd12 = 0.5, xd21 = −0.5, xd43 = −0.5, and xd53 = −0.5. The leader traverses an
exponentially decaying trajectory x0 (t) = e−0.1t . The desired positions of agents 1
and 3 with respect to the leader are xd10 = 0.75 and xd30 = 1, respectively.
For each agent i, five values of ei , three values of xi , and three values of errors cor-
responding to all the extended neighbors are selected for Bellman error extrapolation,
resulting in 5 × 3si total values of Ei . All agents estimate the unknown drift param-
eters using history stacks containing thirty points recorded online using a singular
value maximizing algorithm (cf. [22]), and compute the required state derivatives
5.2 Cooperative Formation Tracking Control of Heterogeneous Agents 167
3 1
5 4 2
Fig. 5.1 Communication topology: A network containing five agents (reproduced with permission
from [1],
2016,
c IEEE)
0.5
0
0 5 10 15 20 25 30
Time (s)
0.1
-0.1
-0.2
-0.3
0 5 10 15 20 25 30
Time (s)
using a fifth order Savitzky–Golay smoothing filter (cf. [23]). Figures 5.2, 5.3, 5.4
and 5.5 show the tracking error, the state trajectories compared with the desired
trajectories, and the control inputs for all the agents demonstrating convergence to
the desired formation and the desired trajectory. Note that Agents 2, 4, and 5 do
not have a communication link to the leader, nor do they know their desired relative
position from the leader. The convergence to the desired formation is achieved via
cooperative control based on decentralized objectives. Figures 5.6 and 5.7 show the
evolution and convergence of the critic weights and the parameters estimates for the
drift dynamics for Agent 1. See Table 5.1 for the optimal control problem parameters,
168 5 Differential Graphical Games
u(t)
-6
-8
-10
0 5 10 15 20 25 30
Time (s)
-40
-50
-60
-70
0 5 10 15 20 25 30
Time (s)
0.4
0.2
-0.2
-0.4
0 5 10 15 20 25 30
Time (s)
5.2 Cooperative Formation Tracking Control of Heterogeneous Agents 169
0.2
0
0 5 10 15 20 25 30
Time (s)
basis functions, and adaptation gains for all the agents. The errors between the ideal
drift parameters and their respective estimates are large; however, as demonstrated
by Fig. 5.3, the resulting dynamics are sufficiently close to the actual dynamics for
the developed technique to generate stabilizing policies. It is unclear whether the
value function and the actor weights converge to their ideal values.
Two-Dimensional Example
In this simulation, the dynamics of all the agents are assumed to be exactly known,
and are selected to be of the form (5.1) where for all i = 1, . . . , 5,
−xi1 + xi2
f i (xi ) = ,
−0.5xi1 − 0.5xi2 (1 − (cos(2xi1 ) + 2)2 )
sin(2xi1 ) + 2 0
gi (xi ) = .
0 cos(2xi1 ) + 2
The agents start at the origin, and their final desired relative positions are given by
xd12 = [−0.5, 1]T , xd21 = [0.5, −1]T , xd43 = [0.5, 1]T , and xd53 = [−1, 1]T .
The relative positions are designed such that the final desired formation is a pentagon
with the leader node at the center. The leader traverses a sinusoidal trajectory x0 (t) =
[2 sin(t), 2 sin(t) + 2 cos(t)]T . The desired positions of agents 1 and 3 with respect
to the leader are xd10 = [−1, 0]T and xd30 = [0.5, −1]T , respectively.
The optimal control problem parameters, basis functions, and adaptation gains for
the agents are available in Table 5.2. Nine values of ei , xi , and the errors corresponding
to all the extended neighbors are selected for Bellman error extrapolation for each
agent i on a uniform 3 × 3 grid in a 1 × 1 square around the origin, resulting in
9(si +1) total values of Ei . Figures 5.8, 5.9, 5.10 and 5.11 show the tracking error, the
state trajectories, and the control inputs for all the agents demonstrating convergence
to the desired formation and the desired trajectory. Note that Agents 2, 4, and 5 do
not have a communication link to the leader, nor do they know their desired relative
position from the leader. The convergence to the desired formation is achieved via
170
x2
0
around the leader is
represented by a dotted black −1
pentagon
−2
−3
−4
−3 −2 −1 0 1 2 3
x1
−0.5
−1
e2
−1.5
−2
−2.5
−3
−0.5 0 0.5 1 1.5
e1
cooperative control based on decentralized objectives. Figure 5.12 show the evolution
and convergence of the actor weights for all the agents.
Three Dimensional Example
To demonstrate the applicability of the developed method to nonholonomic systems,
simulations are performed on a five agent network of wheeled mobile robots. The
dynamics of the wheeled mobile robots are given by
⎡ ⎤
cos (xi3 ) 0
ẋi = g (xi ) u i , g (xi ) = ⎣ sin (xi3 ) 0⎦ ,
0 1
where xi j (t) denotes the jth element of the vector xi (t) ∈ R3 . The desired trajectory
is selected to be a circular trajectory that slowly comes to a halt after three rotations,
generated using the following dynamical system.
5.2 Cooperative Formation Tracking Control of Heterogeneous Agents 173
20 u11 20 u21
u12 u22
15 15
10 10
u1 (t)
u2 (t)
5 5
0 0
−5 −5
−10 −10
0 2 4 6 8 10 0 2 4 6 8 10
Time (s) Time (s)
20 u31 20 u41
u32 u42
15 15
10 10
u3 (t)
u4 (t)
5 5
0 0
−5 −5
−10 −10
0 2 4 6 8 10 0 2 4 6 8 10
Time (s) Time (s)
20 u51
u52
15
10
u5 (t)
−5
−10
0 2 4 6 8 10
Time (s)
Fig. 5.10 Trajectories of the control input for Agents 1–5 for the two-dimensional example
⎡ ⎤
Fr
Tp
T p2 − x03
2
cos (x03 )
⎢ 2 ⎥
ẋ0 = ⎢ T p − x03 sin (x03 ) ⎥
Fr
⎦,
2
⎣ Tp
2
Fr
Tp
T p − x03
2
174 5 Differential Graphical Games
10 µ11 3 µ21
µ12 µ22
8 2
1
6
0
u1 (t)
u2 (t)
4
−1
2
−2
0 −3
−2 −4
0 2 4 6 8 10 0 2 4 6 8 10
Time (s) Time (s)
8 µ31 20 µ41
µ32 µ42
6 15
4 10
u3 (t)
u4 (t)
2 5
0 0
−2 −5
0 2 4 6 8 10 0 2 4 6 8 10
Time (s) Time (s)
10 µ51
µ52
5
0
u5 (t)
−5
−10
−15
−20
0 2 4 6 8 10
Time (s)
Fig. 5.11 Trajectories of the relative control error for Agents 1–5 for the two-dimensional example
where x03 denotes the third element of x0 , and the parameters Fr and T p are selected to
be Fr = 0.1 and T p = 6 π . The desired formation and the communication topology
are shown in Fig. 5.13. For each agent, a random point is selected in the state space
each control iteration for Bellman error extrapolation. Figures 5.14, 5.15, 5.16 and
5.2 Cooperative Formation Tracking Control of Heterogeneous Agents 175
1.5 1.5
0.5 1
0
Ŵa1 (t)
Ŵa2 (t)
−0.5 0.5
−1
−1.5 0
−2
−2.5 −0.5
0 2 4 6 8 10 0 2 4 6 8 10
Time (s) Time (s)
2 5
1.5 4
3
1
Ŵa3 (t)
Ŵa4 (t)
2
0.5
1
0
0
−0.5 −1
−1 −2
0 2 4 6 8 10 0 2 4 6 8 10
Time (s) Time (s)
1
Ŵa5 (t)
−1
−2
−3
0 2 4 6 8 10
Time (s)
Fig. 5.12 actor weights for Agents 1–5 for the two-dimensional example
5.17 show the tracking error, the control inputs, and the actor weight estimates for
all the agents demonstrating convergence to the desired formation and the desired
trajectory. Note that Agents 2, 3 and 5 do not have a direct communication link to
the leader, nor do they know their desired relative position with respect to the leader.
176 5 Differential Graphical Games
Agent 5 Agent 4
Leader
Agent 1
Agent 3 Agent 2
x1
x0
2 x2
x0 + xd20
1 x3
x0 + xd30
xi2 (t)
x4
0 x0 + xd40
x5
-1 x0 + xd50
-2
2
0 40
30
-2 20
10
xi1 (t) -4 0 Time (s)
Fig. 5.14 State trajectories and desired state trajectories as a function of time
5.2 Cooperative Formation Tracking Control of Heterogeneous Agents 177
1 1
0.5
0.5
0
e1 (t)
e2 (t)
0
-0.5
-0.5
-1
-1 -1.5
0 10 20 30 40 0 10 20 30 40
Time (s) Time (s)
1 1
0.5
0.5
0
e3 (t)
e4 (t)
0
-0.5
-0.5
-1
-1 -1.5
0 10 20 30 40 0 10 20 30 40
Time (s) Time (s)
0.5
e5 (t)
-0.5
0 10 20 30 40
Time (s)
The convergence to the desired formation is achieved via cooperative control based
on decentralized objectives.
178 5 Differential Graphical Games
2.5 2.5
2 2
1.5 1.5
u1 (t)
u2 (t)
1 1
0.5 0.5
0 0
-0.5 -0.5
0 10 20 30 40 0 10 20 30 40
Time (s) Time (s)
3 4
2 3
1
2
u3 (t)
u4 (t)
0
1
-1
-2 0
-3 -1
0 10 20 30 40 0 10 20 30 40
Time (s) Time (s)
2.5
1.5
u5 (t)
0.5
-0.5
0 10 20 30 40
Time (s)
3 2
1.5
2
1
Ŵa1 (t)
Ŵa2 (t)
1
0.5
0
0
-1 -0.5
0 10 20 30 40 0 10 20 30 40
Time (s) Time (s)
2.5 2.5
2 2
1.5 1.5
Ŵa3 (t)
Ŵa4 (t)
1 1
0.5 0.5
0 0
-0.5 -0.5
0 10 20 30 40 0 10 20 30 40
Time (s) Time (s)
3
2
Ŵa5 (t)
-1
0 10 20 30 40
Time (s)
where xi : R≥0 → S is the state, the set S ⊂ Rn is the agents’ state space, f i : S →
Rn is a locally Lipschitz function, gi ∈ Rn×m is a known constant matrix, u i : R≥0 →
Rm is a pre-established stabilizing and synchronizing control input, and the derivative
of the leader state, ẋ0 , is continuous and bounded.
The monitoring objective applied at each agent is to cooperatively monitor the
network for satisfaction of its control objective, wherein the network may be affected
by input disturbances that cause suboptimal performance. Moreover, the monitoring
protocol should be decentralized and passive, i.e., only information from one-hop
neighbors should be used and the protocol should not interfere with the monitored
systems.
For typical synchronization techniques, such as model predictive control, inverse-
optimal, or approximate dynamic programming, a control law is developed based on
a cost functional of the form
tf
Ji (ei (·) , u i (·)) Q i (ei (τ )) + u iT (τ ) Ri u i (τ ) dτ, (5.47)
0
2 Parts of the text in this section are reproduced, with permission, from [8],
2015,
c IEEE.
5.3 Reinforcement Learning-Based Network Monitoring 181
where t f > 0 is the final time of the optimal control problem, Q i : Rn → R is a track-
ing error weighting function, Ri is a constant positive definite symmetric weighting
matrix, and ei : R≥0 → Rn is the neighborhood tracking error defined as
ei (t) ai j xi (t) − x j (t) + ai0 (xi (t) − x0 (t)) .
j∈Ni
which is minimized as
tf
Vi∗ (E, t) = min Q i (ei (τ ; t, ei , u i (·))) + u iT (τ ) Ri u i (τ ) dτ, (5.48)
u i ∈Ui t
where Ui is the set of admissible controllers for agent i. Because the minimization
of the value function Vi is inherently coupled with the minimization of other value
functions in the network, the value function Vi can naturally be a function of the error
signal e j , j ∈ V, if there exists a directed path from the leader to agent i that includes
agent j. Using techniques similar to Theorem 5.3, it can be shown that provided a
feedback-Nash equilibrium solution exists and that the value functions in (5.48) are
continuously differentiable, a necessary and sufficient condition for feedback-Nash
equilibrium is given by the system of Hamilton–Jacobi equations
∂ V ∗ (E, t)
i
a jk f j x j + g j u ∗j (E, t) − f k (xk ) − gk u ∗k (E, t)
∂e j
j∈V k∈N̄ j
∂ Vi∗ (E, t)
+ Q i (ei ) + u i∗T (E, t) Ri u i∗ (E, t) + ≡ 0. (5.49)
∂t
Thus, one method for monitoring the network’s operating conditions is to monitor
the expression in (5.49), which equals zero for the implementation of optimal control
efforts.
182 5 Differential Graphical Games
Because the optimal value function Vi∗ is often infeasible to solve analytically, an
approximate dynamic programming-based approximation scheme is subsequently
∂V ∗
developed so that the approximate value of ∂ti + Hi may be monitored. However,
the Hamilton–Jacobi equation for agent i in (5.49) is inherently coupled with the
state and control of every agent j ∈ V such that there exists a directed path from the
leader to agent i. Consequently, checking for satisfaction of the Hamilton–Jacobi
equations is seemingly unavoidably centralized in information communication. To
overcome this restriction, the developed approximate dynamic programming-based
approximation scheme is constructed such that only information concerning one-hop
neighbors’ states, one-hop neighbors’ control policies and time is necessary for value
function approximation.
To make the current problem tenable, it is assumed that authentic information is
exchanged between the agents, i.e., communication is not maliciously compromised;
rather, the agents are cooperatively monitoring each other’s performance. If neces-
sary, communication authentication algorithms such as in [27] or [28] can be used
to verify digitally communicated information.
To evaluate the expression in (5.49), knowledge of the drift dynamics f i is
required. The following section provides a method to estimate the function f i using
a data-based approach.
Assumption 5.10 The uncertain, locally Lipschitz drift dynamics, f i , are linear-in-
the-parameters, such that f i (xi ) = Yi (xi ) θi∗ , where Yi : Rn → Rn× pi is a known
regression matrix and θi∗ ∈ R pi is a vector of constant unknown parameters.
where x̃i (t) xi (t) − x̂i (t) is the state estimation error, and k xi ∈ Rn×n is a constant
positive definite diagonal gain matrix. The state identification error dynamics are
expressed using (5.46) and (5.50) as
where θ̃i (t) θ ∗ − θ̂ (t). The state estimator in (5.50) is used to develop a data-
driven concurrent learning-based update law for θ̂i (·) as
5.3 Reinforcement Learning-Based Network Monitoring 183
K
T
θ̂˙i (t) = θi (Yi (xi (t)))T x̃i (t) + θi kθi Yiξ ẋiξ − giξ u iξ − Yiξ θ̂i (t) ,
ξ =1
(5.52)
where θi ∈ R pi × pi is a constant positive definite symmetric gain matrix, kθi ∈ R>0
is a constant concurrent learning gain, and the superscript (·)ξ denotes evaluation
ξ
at one of the unique recorded values in the state data stack xi | ξ = 1, . . . , K or
ξ
corresponding control value data stack u i | ξ = 1, . . . , K . It is assumed that these
data stacks are recorded prior to use of the drift dynamics estimator. The following
assumption specifies the necessary data richness of the recorded data.
ξ
Assumption 5.11 [22] There exists a finite set of collected data xi | ξ =1, . . . , K
such that
⎛ ⎞
K T
ξ ξ
rank ⎝ Yi Yi ⎠ = pi . (5.53)
ξ =1
Note that persistence of excitation is not mentioned as a necessity for this identifica-
tion algorithm; instead of guaranteeing data richness by assuming that the dynamics
are persistently exciting for all time, it is only assumed that there exists a finite set of
data points that provide the necessary data richness. This also eliminates the common
requirement for injection of a persistent dither signal to attempt to ensure persistence
of excitation, which would interfere with the monitored systems. Furthermore, con-
trary to persistence of excitation-based approaches, the condition in (5.53) can be
verified. Note that, because (5.52) depends on the state derivative ẋiξ at a past value,
numerical techniques can be used to approximate ẋiξ using preceding and proceeding
recorded state information.
To facilitate an analysis of the performance of the identifier in (5.52), the identifier
dynamics are expressed in terms of estimation errors as
K
T
θ̃˙i (t) = −θi (Yi (xi (t)))T x̃i (t) − θi kθi
ξ ξ ξ ξ ξ
Yi ẋi − gi u i − Yi θ̂i (t) .
ξ =1
(5.54)
To describe the performance of the identification of θi∗ , consider the positive definite
continuously differentiable Lyapunov function Vθi : Rn+ pi → [0, ∞) defined as
1 T 1 −1
Vθi (z i ) x̃i x̃i + θ̃iT θi θ̃i , (5.55)
2 2
T
where z i x̃iT , θ̃iT . The expression in (5.55) satisfies the inequalities
K ξ T ξ
Note that because the matrix ξ =1 Yi Yi is symmetric and positive semi-
definite, its eigenvalues are real and greater than or equal to zero. Furthermore, by
ξ T ξ
Assumption 5.11, none of the eigenvalues of ξK=1 Yi Yi are equal to zero. Thus,
K ξ T ξ
all of the eigenvalues of the symmetric matrix ξ =1 Yi Yi are positive, and the
T
matrix ξK=1 Yiξ Yiξ is positive definite. Using this property and the inequalities
in (5.56), (5.57) is upper bounded as
ci
V̇θi (z i ) ≤ −ci z i 2 ≤ − Vθi (z i ) , (5.58)
V̄θi
# $
K ξ T ξ
where ci min λmin {k xi } , kθi λmin ξ =1 Yi Yi . The inequalities in (5.56)
and (5.58) can then be used to conclude that x̃i (t) , θ̃i (t) → 0 exponentially
fast. Thus, the drift dynamics f i (xi ) = Yi (xi ) θi∗ are exponentially identified.
Note that even with state derivative estimation errors, the parameter estimation
error θ̃i (·) can be shown to be uniformly ultimately bounded, where the magnitude
of the ultimate bound depends on the derivative estimation error [22].
Remark 5.12 Using an integral formulation, the system identifier can also be imple-
mented without using state-derivative measurements (see, e.g., [14]).
For the approximate value of (5.49) to be evaluated for monitoring purposes, the
unknown optimal value function Vi∗ needs to be approximated for each agent i.
Because the coupled Hamilton–Jacobi equations are typically difficult to solve ana-
lytically, this section provides an approach to approximate Vi∗ using neural networks.
Assuming the networked agents’ states remain bounded, the universal function
approximation property of neural networks can be used with wi neurons to equiva-
lently represent Vi∗ as
Vi∗ ei , t = WiT σi ei , t +
i ei , t , (5.59)
5.3 Reinforcement Learning-Based Network Monitoring 185
where t t
tf
is the normalized time, Wi ∈ Rwi is an unknown ideal neural net-
work weight vector bounded above by a known constant W̄i ∈ R>0 as Wi ≤ W̄i ,
σi : S × [0, 1] → Rwi is a bounded nonlinear continuously differentiable activation
function with the property σ (0, 0) = 0, where
i : S × [0, 1] → R is the unknown
function reconstruction error. From the universal function approximation prop-
erty, the reconstruction error
i satisfies the properties supρ∈S,ϕ∈[0,1] |
i (ρ, ϕ)| ≤
¯i ,
supρ∈S,ϕ∈[0,1] ∂|
i ∂ρ
(ρ,ϕ)|
≤
¯ei , and supρ∈S,ϕ∈[0,1] ∂|
i ∂ϕ
(ρ,ϕ)|
≤
¯ti , where
¯i ,
¯ei ,
¯ti ∈
R>0 are constant upper bounds.
Note that only the state of agent i, states of neighbors of agent i ( j ∈ N̄i ), and
time are used as arguments in the neural network representation of Vi∗ in (5.59),
instead of the states of all agents in the network. This is justified by treating the error
states of other agents simply as functions of time, the effect of which is accommo-
dated by including time in the basis function σi and function reconstruction error
i . Inclusion of time in the basis function is feasible due to the finite horizon of
the optimization problem in (5.47). Using state information from additional agents
(e.g., two-hop communication) in the network may increase the practical fidelity of
function reconstruction and may be done in an approach similar to that developed in
this section.
Using this neural network representation, Vi∗ is approximated for use in computing
the Hamiltonian as
V̂i ei , Ŵci , t ŴciT σi ei , t ,
Ĥi E i , X i , Ŵci , ω̂ai , t Q i (ei ) + û iT ei , Ŵai , t Ri û i ei , Ŵai , t
+ ŴciT σei ei , t ai j fˆi (xi ) + gi û i ei , Ŵai , t − fˆj x j − g j x j û j e j , Ŵa j , t
j∈Ni
+ ai0 fˆi (xi ) + gi û i ei , Ŵai , t − ẋ0 t , (5.62)
∂σi (ei ,t ) −1 T T
where σei ei , t ∂ei
, û i ei , Ŵai , t − 21 j∈N̄i ai j Ri gi σei ei , t
Ŵai is the approximated optimal control for agent i, and Ŵai is another estimate
of the ideal neural network weight vector Wi . Noting that the expression in (5.62) is
measurable (assuming that the leader state derivative is available to those communi-
cating with the leader), the Bellman error in (5.60) may be put into measurable form,
∂V ∗
after recalling that ∂ti + Hi E, X , U ∗ , Vi∗ , t ≡ 0, as
δi t ŴciT σti ei , t + Ĥi E i , X i , Ŵci , ω̂ai , t , (5.63)
which is the feedback to be used to train the neural network estimate Ŵci . The use
of the two neural network estimates Ŵci and Ŵai allows for least-squares based
adaptation for the feedback in (5.63), since only the use of Ŵci would result in
nonlinearity of Ŵci in (5.63).
The difficulty in making a non-interfering monitoring scheme with approximate
dynamic programming lies in obtaining sufficient data richness for learning. Contrary
to typical approximate dynamic programming-based control methods, the devel-
oped data-driven adaptive learning policy uses
concepts from concurrent + + learning
+ +
(cf. toprovide data richness. Let si ρl ∈ S | l = 1, . . . , N̄i + 1 ∪
[22, 29])
ϕ | ϕ ∈ 0, t f be a pre-selected sample point in the state space of agent i, its
neighbors, and time. Additionally, let Si sicl | cl = 1, . . . , L be a collection of
these unique sample points. The Bellman error will be evaluated over the set Si in
the neural network update policies to guarantee data richness. As opposed to the
common practice of injecting an exciting signal into a system’s control input to pro-
vide sufficient data richness for adaptive learning, this strategy evaluates the Bellman
error at preselected points in the state space and time to mimic exploration of the
state space. The following assumption specifies a sufficient condition on the set Si
for convergence of the subsequently defined update policies.
Assumption 5.13 For each agent i ∈ V, the set of sample points Si satisfies
T
1 L
χicl χicl
μi inf λmin > 0, (5.64)
L t∈[0,t f ]
c =1
γicl
l
φc2i χicl cl
L
χi
Ŵ˙ ci = −φc1i i δi − i δ , (5.65)
γi L γ cl i
c =1 i l
χi χiT
˙
i = βi i − φc1i i 2 i 1{i ≤¯ i } , (5.66)
γi
where φc1i , φc2i ∈ R>0 are constant adaptation gains, βi ∈ R>0 is a constant forget-
ting factor, ¯ i ∈ R>0 is a saturation constant, and i (0) is positive definite, symmet-
ric, and bounded such that i (0) ≤ ¯ i . The form of the least-squares gain matrix
update law in (5.66) is constructed such that i remains positive definite and
where i ∈ R>0 is constant [32]. The neural network estimate Ŵai is updated towards
the estimate Ŵci as
T
φc1i G σ i Ŵai χiT
L
φc2i G cσli Ŵai χicl
Ŵ˙ ai = −φa1i Ŵai − Ŵci −φa2i Ŵai + + Ŵci ,
4γi
cl =1
4Lγicl
(5.68)
2
where φa1i , φa2i ∈ R>0 are constant adaptation gains, G σ i j∈N̄i ai j σei G i σeiT
∈ Rwi , and G i gi Ri−1 giT .
188 5 Differential Graphical Games
1 1
VL W̃ciT i−1 W̃ci + W̃aiT W̃ai + Vθi (z i ) . (5.69)
i∈V
2 2
An upper bound of V̇L along the trajectories of (5.51), (5.54), (5.65), and (5.68)
can be obtained after expressing δi in terms of estimation errors, using the property
−1
d
dt
i = −i−1 ˙i i−1 , using the inequality χi ≤ √1 , applying the Cauchy–
γi 2 λi i
Schwarz and triangle inequalities, and performing nonlinear damping, such that V̇L
is bounded from above by an expression that is negative definite in terms of the state
of VL plus a constant upper-bounding term. Using [21, Theorem 4.18], it can shown
that the estimation errors are uniformly ultimately bounded.
With the estimation of the unknown neural network weight Wi by Ŵci from the previ-
ous section, the performance of an agent in satisfying the Hamilton–Jacobi–Bellman
optimality constraint can be monitored through use of V̂i = ŴciT σi , the approximation
∂V ∗
of Vi∗ . From (5.49), ∂ti + Hi E, X , U ∗ , Vi∗ , t ≡ 0 (where U ∗ denotes the optimal
control efforts). Let Mi ∈ R denote the signal to be monitored by agent i ∈ V, defined
as
++
Mi ei , X i , Ŵci , Ui , t = ++ŴciT σti + ŴciT σei ai j fˆi (xi ) + gi u i − fˆj x j − g j u j
j∈Ni
++
+ Q i (ei ) + u iT Ri u i + ai0 fˆi (xi ) + gi u i − ẋ0 ++,
5.3 Reinforcement Learning-Based Network Monitoring 189
Online real-time solutions to differential games are presented in results such as [15,
16, 33–35]; however, since these results solve problems with centralized objectives
(i.e., each agent minimizes or maximizes a cost function that penalizes the states of
all the agents in the network), they are not applicable for a network of agents with
independent decentralized objectives (i.e., each agent minimizes or maximizes a cost
function that penalizes only the error states corresponding to itself).
Various methods have been developed to solve formation tracking problems for
linear systems (cf. [36–40] and the references therein). An optimal control approach
is used in [41] to achieve consensus while avoiding obstacles. In [42], an optimal
controller is developed for agents with known dynamics to cooperatively track a
desired trajectory. In [43], an inverse optimal controller is developed for unmanned
190 5 Differential Graphical Games
References
26. Ratliff ND, Bagnell JA, Zinkevich MA (2006) Maximum margin planning. In: Proceedings of
the international conference on learning
27. Pang Z, Liu G (2012) Design and implementation of secure networked predictive control
systems under deception attacks. IEEE Trans Control Syst Technol 20(5):1334–1342
28. Clark A, Zhu Q, Poovendran R, Başar T (2013) An impact-aware defense against stuxnet. In:
Proceedings of the American control conference, pp 4146–4153
29. Kamalapurkar R, Walters P, Dixon WE (2013) Concurrent learning-based approximate optimal
regulation. In: Proceedings of the IEEE conference on decision and control, Florence, IT, pp
6256–6261
30. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
31. Vamvoudakis KG, Lewis FL (2009) Online synchronous policy iteration method for optimal
control. In: Yu W (ed) Recent advances in intelligent control systems, Springer, pp 357–374
32. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall
33. Johnson M, Hiramatsu T, Fitz-Coy N, Dixon WE (2010) Asymptotic stackelberg optimal
control design for an uncertain Euler-Lagrange system. In: Proceedings of the IEEE conference
on decision and control, Atlanta, GA, pp 6686–6691
34. Vamvoudakis KG, Lewis FL (2010) Online neural network solution of nonlinear two-player
zero-sum games using synchronous policy iteration. In: Proceedings of the IEEE conference
on decision and control
35. Vrabie D, Lewis FL (2010) Integral reinforcement learning for online computation of feedback
nash strategies of nonzero-sum differential games. In: Proceedings of the IEEE conference on
decision and control, pp 3066–3071
36. Lewis M, Tan K (1997) High precision formation control of mobile robots using virtual struc-
tures. Auton Robots 4(4):387–403
37. Balch T, Arkin R (1998) Behavior-based formation control for multirobot teams. IEEE Trans
Robot Autom 14(6):926–939
38. Das A, Fierro R, Kumar V, Ostrowski J, Spletzer J, Taylor C (2002) A vision-based formation
control framework. IEEE Trans Robot Autom 18(5):813–825
39. Fax J, Murray R (2004) Information flow and cooperative control of vehicle formations. IEEE
Trans Autom Control 49(9):1465–1476
40. Murray R (2007) Recent research in cooperative control of multivehicle systems. J Dyn Syst
Meas Control 129:571–583
41. Wang J, Xin M (2010) Multi-agent consensus algorithm with obstacle avoidance via optimal
control approach. Int J Control 83(12):2606–2621
42. Wang J, Xin M (2012) Distributed optimal cooperative tracking control of multiple autonomous
robots. Robot Auton Syst 60(4):572–583
43. Wang J, Xin M (2013) Integrated optimal formation control of multiple unmanned aerial
vehicles. IEEE Trans Control Syst Technol 21(5):1731–1744
44. Lin W (2014) Distributed uav formation control using differential game approach. Aerosp Sci
Technol 35:54–62
45. Semsar-Kazerooni E, Khorasani K (2008) Optimal consensus algorithms for cooperative team
of agents subject to partial information. Automatica 44(11):2766–2777
46. Shim DH, Kim HJ, Sastry S (2003) Decentralized nonlinear model predictive control of multiple
flying robots. Proceedings of the IEEE conference on decision and control 4:3621–3626
47. Magni L, Scattolini R (2006) Stabilizing decentralized model predictive control of nonlinear
systems. Automatica 42(7):1231–1236
48. Heydari A, Balakrishnan SN (2012) An optimal tracking approach to formation control of
nonlinear multi-agent systems. In: Proceedings of AIAA guidance, navigation and control
conference
49. Zhang H, Zhang J, Yang GH, Luo Y (2015) Leader-based optimal coordination control for the
consensus problem of multiagent differential games via fuzzy adaptive dynamic programming.
IEEE Trans Fuzzy Syst 23(1):152–163
References 193
6.1 Introduction
The nonlinear equations of motion for a marine craft, including the effects of irrota-
tional ocean current, are given by [9]
6.2 Station-Keeping of a Marine Craft 197
where x, y : R≥t0 → R, are the earth-fixed position vector components of the center
of mass, ψ : R≥t0 → R represents the yaw angle, u, v : R≥t0 → R are the body-
fixed translational velocities, and r : R≥t0 → R is the body-fixed angular velocity.
The irrotational current vector is defined as
T
νc u c vc 0 ,
Assumption 6.1 The marine craft is neutrally buoyant if submerged and the center
of gravity is located vertically below the center of buoyancy on the z axis if the
vehicle model includes roll and pitch.
1 The orientation of the vehicle may be represented as Euler angles, quaternions, or angular rates.
In this development, the use of Euler angles is assumed, see Sect. 7.5 in [9] for details regarding
other representations.
198 6 Applications
Assumption 6.1 simplifies the subsequent analysis and can often be met by trimming
the vehicle. For marine craft where this assumption cannot be met, an additional term
may be added to the controller, similar to how terms dependent on the irrotational
current are handled.
Since the hydrodynamic effects pertaining to a specific marine craft may be unknown,
an online system identifier is developed for the vehicle drift dynamics. Consider the
control-affine form of the vehicle model,
and known rigid body drift dynamics f 0 : R2n × Rn → R2n are modeled as
JE (η) ν
f 0 (ζ, ν̇c ) = ,
M −1 M A ν̇c − M −1 C R B (ν) ν − M −1 G (η)
An identifier is designed as
ζ̂˙ (t) = Y (ζ (t) , νc (t)) θ̂ (t) + f 0 (ζ (t) , ν̇c (t)) + gτb (t) + kζ ζ̃ (t) , (6.4)
⎛ ⎞
M
rank ⎝ Y jT Y j ⎠ = p, (6.5)
j=1
˙
ζ̄ j − ζ̇ j < d̄, ∀ j,
where Y j Y ζ j , νcj , f 0 j f 0 ζ j , ζ̇ j = Y j θ + f 0 j + gτbj , and d̄ ∈ [0, ∞) is a
constant.
In this development, it is assumed that a data set of state-action pairs is available a
priori. Experiments to collect state-action pairs do not necessarily need to be con-
ducted in the presence of a current (e.g., the data may be collected in a pool). Since
the current affects the dynamics only through the νr terms, data that is sufficiently
rich and satisfies Assumption 6.2 may be collected by merely exploring the ζ state-
space. Note, this is the reason the body-fixed current νc and acceleration ν̇c are not
considered a part of the state. If state-action data is not available for the given system
then it is possible to build the history stack in real-time (see Appendix A.2.3).
The parameter estimate update law is
M
θ̂˙ (t) = Γθ Y (ζ (t) , νc (t))T ζ̃ (t) + Γθ kθ Y jT ζ̄˙ j − f 0 j − gτbj − Y j θ̂ (t) , (6.6)
j=1
M
θ̂˙ (t) = Γθ Y (ζ (t) , νc (t))T ζ̃ (t) + Γθ kθ Y jT Y j θ̃ + d j ,
j=1
200 6 Applications
where d j = ζ̄˙ j − ζ̇ j .
To analyze the developed identifier, consider the candidate Lyapunov function
V P : R2n+ p → [0, ∞) defined as
1 T 1
V P (Z P ) ζ̃ ζ̃ + θ̃ T Γθ−1 θ̃ , (6.7)
2 2
where Z P ζ̃ T θ̃ T . The candidate Lyapunov function can be bounded as
1 1
min 1, γθ Z P 2 ≤ V P (Z P ) ≤ max {1, γθ } Z P 2 (6.8)
2 2
where γθ , γθ are the minimum and maximum eigenvalues of Γθ , respectively.
The time derivative of the candidate Lyapunov function in (6.7) is
M
M
V̇ P (t) = −ζ̃ T (t) kζ ζ̃ (t) − kθ θ̃ T (t) Y jT Y j θ̃ (t) − kθ θ̃ T (t) Y jT d j .
j=1 j=1
where kζ , y are the minimum eigenvalues of kζ and M T
j=1 Y j Y j , respectively, and
M
dθ = d̄ j=1 Y j . After completing the squares, (6.9) can be upper bounded as
2 kθ y 2 k d 2
θ θ
V̇ P (t) ≤ −kζ ζ̃ (t) − θ̃ (t) + ,
2 2y
current is introduced. The residual model is used in the development of the optimal
control problem in place of the original model. A disadvantage of this approach is
that the optimal policy is developed for the current-free model. In the case where
the earth-fixed current is constant, the effects of the current may be included in the
development of the optimal control problem as detailed in Appendix A.3.2.
The dynamic constraints can be written in a control-affine form as
τ̃c τc − τ̂c .
∞
J (t0 , ζ0 , u (·)) = r (ζ (τ ; t0 , ζ0 , u (·)) , u (τ )) dτ, (6.15)
t0
In (6.16), Q ∈ R2n×2n and R ∈ Rn×n are symmetric positive definite weighting matri-
2 2
ces. The matrix Q has the property q ξq ≤ ξqT Qξq ≤ q ξq , ∀ξq ∈ R2n where
q and q are positive constants. Assuming existence of the optimal controller, the
infinite-time scalar value function V : R2n → [0, ∞) for the optimal solution is writ-
ten as
∞
∗
V (ζ ) = min r (ζ (τ ; t, ζ, u (·)) , u (τ )) dτ, (6.17)
u [t,∞]
t
where ∂ V∂t(ζ ) = 0 since the value function is not an explicit function of time. After
substituting (6.16) into (6.18), the optimal policy is given by [14]
1 T
u ∗ (ζ ) = − R −1 g T ∇ζ V ∗ (ζ ) . (6.19)
2
6.2 Station-Keeping of a Marine Craft 203
V ∗ (ζ ) = W T σ (ζ ) + (ζ ) , (6.20)
where W ∈ Rl is the ideal weight vector bounded above by a known positive constant,
σ : R2n → Rl is a bounded continuously differentiable activation function, and :
R2n → R is the bounded continuously differential function reconstruction error.
Using (6.19) and (6.20), the optimal policy can be expressed as
1
u ∗ (ζ ) = − R −1 g T ∇ζ σ T (ζ ) W + ∇ζ T (ζ ) . (6.21)
2
Based on (6.20) and (6.21), neural network approximations of the value function and
the optimal policy are defined as
V̂ ζ, Ŵc = ŴcT σ (ζ ) , (6.22)
1
û ζ, Ŵa = − R −1 g T ∇ζ σ T (ζ ) Ŵa , (6.23)
2
where Ŵc , Ŵa : R≥t0 → Rl are estimates of the constant ideal weight vector W . The
weight estimation errors are defined as W̃c W − Ŵc and W̃a W − Ŵa .
Substituting (6.11), (6.22), and (6.23) into (6.18), the Bellman error, δ̂ : R2n ×
R × Rl × Rl → R, given as
p
δ̂ ζ, θ̂, Ŵc , Ŵa = r ζ, û ζ, Ŵa + ∇ζ V̂ ζ, Ŵc Yr es (ζ ) θ̂ + f 0r es (ζ ) + g û ζ, Ŵa
(6.24)
The Bellman error, evaluated along the system trajectories, can be expressed as
δ̂t (t) δ ζ (t) , θ̂ (t) , Ŵc (t) , Ŵa (t) = r ζ (t) , û ζ (t) , Ŵa (t) + ŴcT (t) ω (t) ,
βΓ (t) − kc1 Γ (t) ω(t)ω(t)
T
Γ, Γ (t) ≤ Γ
Γ˙ (t) = (t) , (6.27)
0 otherwise
where kc1 , kc2 ∈ R are positive adaptation gains, δ̂tk (t) δ̂ ζk , θ̂ (t) , Ŵc (t) ,
Ŵa (t) is the extrapolated approximate Bellman error, Γ (t0 ) = Γ0 ≤ Γ¯ is
the initial adaptation gain, Γ¯ ∈ R is a positive saturation gain, β ∈ R is a positive
forgetting factor, and ρ 1 + kρ ω T Γ ω is a normalization constant, where kρ ∈ R
is a positive gain. The update law in (6.26) and (6.27) ensures that
Γ ≤ Γ (t) ≤ Γ , ∀t ∈ [0, ∞) .
where ka ∈ R is an positive gain, and proj {·} is a smooth projection operator used
to bound the weight estimates. Using properties of the projection operator, the actor
6.2 Station-Keeping of a Marine Craft 205
neural network weight estimation error can be bounded above by positive constant.
See Sect. 4.4 in [11] or Remark 3.6 in [15] for details of smooth projection operators.
Using the definition in (6.13), the force and moment applied to the vehicle,
described in (6.3), is given in terms of the approximated optimal virtual control
(6.23) and the approximate compensation term in (6.14) as
τ̂b (t) = û ζ (t) , Ŵa (t) + τ̂c ζ (t) , θ̂ (t) , νc (t) , ν̇c (t) . (6.29)
An unmeasurable form of the Bellman error can be written using (6.18) and (6.24)
as
1 1
δ̂t = −W̃cT ω − W T ∇ζ σ Yr es θ̃ − ∇ζ Yr es θ + f 0r es + W̃aT G σ W̃a + ∇ζ G∇ζ σ T W
4 2
1
+ ∇ζ G∇ζ , (6.30)
4
Yr es (ζ ) ≤ L Yr es ζ , ∀ζ ∈ χ ,
f 0 (ζ ) ≤ L f ζ , ∀ζ ∈ χ ,
r es 0r es
1 T 1
VL (Z ) = V (ζ ) + W̃c Γ −1 W̃c + W̃aT W̃a + V P (Z P ) ,
2 2
T
where Z ζ T W̃cT W̃aT Z TP ∈ χ ∪ Rl × Rl × R p . Since the value function V in
(6.17) is positive definite, VL can be bounded by
using [16, Lemma 4.3] and (6.8), where υ L , υ L : [0, ∞) → [0, ∞) are class K
functions. Let β ⊂ χ ∪ Rl × Rl × R p be a compact set, and let ϕζ , ϕc , ϕa , ϕθ , κc ,
κa , κθ , and κ be constants as defined in Appendix A.3.1. When Assumptions 6.2 and
6.3, and the sufficient gain conditions in Appendix A.3.1 are satisfied, the constant
K ∈ R defined as
!
κc2 κ2 κ2 κ
K + a + θ +
2αϕc 2αϕa 2αϕθ α
is positive, where α 1
2
min ϕζ , ϕc , ϕa , ϕθ , 2kζ .
K < υ L −1 (υ L (r )) (6.33)
are satisfied, where r ∈ R is the radius of the compact set β, then the policy in (6.23)
with the neural network update laws in (6.26)–(6.28) guarantee uniformly ultimately
bounded regulation of the state ζ and uniformly ultimately bounded convergence of
the approximated policies û to the optimal policy u ∗ .
Proof The time derivative of the candidate Lyapunov function is
∂V ∂V
g û + τ̂c − W̃cT Γ −1 Ŵ˙ c − W̃cT Γ −1 Γ˙ Γ −1 W̃c − W̃aT Ŵ˙ a + V̇ P .
1
V̇L = (Y θ + f 0 ) +
∂ζ ∂ζ 2
(6.34)
∂V
Using (6.18), ∂ζ (Y θ + f 0 ) = − ∂∂ζV g (u ∗ + τc ) − r (ζ, u ∗ ). Then,
∂V ∂V ∗
g u + τc − r ζ, u ∗ − W̃cT Γ −1 Ŵ˙ c − W̃cT Γ −1 Γ˙ Γ −1 W̃c
1
V̇L = g û + τ̂c −
∂ζ ∂ζ 2
− W̃ T Ŵ˙ + V̇ .
a a P
Substituting (6.26) and (6.28) for Ŵ˙ c and Ŵ˙ a , respectively, yields
6.2 Station-Keeping of a Marine Craft 207
⎡ ⎤
kc2 ωk ⎦
N
∗T ∗ ∂V ∂V ∂V ∗ T ⎣ ω
V̇L = −ζ Qζ − u Ru +
T
g τ̃c + g û − gu + W̃c kc1 δ + δk
∂ζ ∂ζ ∂ζ ρ N ρ
j=1 k
1 " #
ωω T
+ W̃aT ka Ŵa − Ŵc − W̃cT Γ −1 βΓ − kc1 Γ Γ 1Γ ≤Γ Γ −1 W̃c + V̇ P .
2 ρ
Using Young’s inequality, (6.20), (6.21), (6.23), (6.30), and (6.31) the Lyapunov
derivative can be upper bounded as
2 2 2 2
V̇L ≤ −ϕζ ζ 2 − ϕc W̃c − ϕa W̃a − ϕθ θ̃ − kζ ζ̃ + κa W̃a + κc W̃c
+ κθ θ̃ + κ.
Completing the squares, the upper bound on the Lyapunov derivative may be written
as
ϕζ ϕc
2 ϕa 2 ϕθ 2
2
κ2 κ2
V̇L ≤ − ζ 2 − W̃c − W̃a − θ̃ − kζ ζ̃ + c + a
2 2 2 2 2ϕc 2ϕa
κ 2
+ θ + κ,
2ϕθ
Using (6.32), (6.33), and (6.35), [16, Theorem 4.18] is invoked to conclude that Z is
uniformly ultimately bounded, in the sense that lim supt→∞ Z (t) ≤υ L −1 (υ L (K )).
Based on the definition of Z and the inequalities in (6.32) and (6.35), ζ (·) ,
W̃c (·) , W̃a (·) ∈ L∞ . From the definition of W and the neural network weight esti-
mation errors, Ŵc (·) , Ŵa (·) ∈ L∞ . Usingthe actor update ˙
laws, Ŵa (·) ∈ L∞ . It fol-
lows that t → V̂ (x (t)) ∈ L∞ and t → û x (t) , Ŵa (t) ∈ L∞ . From the dynamics
in (6.12), ζ̇ (·) ∈ L∞ . By the definition in (6.24), δ̂t (·) ∈ L∞ . By the definition of
the normalized critic update law, Ŵ˙ c (·) ∈ L∞ .
sides of the central pressure vessel, two vessels house 44 Ah of batteries used for
propulsion and electronics.
The vehicle’s software runs within the Robot Operating System framework in
the central pressure vessel. For the experiment, three main software nodes were
used: navigation, control, and thruster mapping nodes. The navigation node receives
packaged navigation data from the navigation pressure vessel where an unscented
Kalman filter estimates the vehicle’s full state at 50 Hz. The desired force and moment
produced by the controller are mapped to the eight thrusters using a least-squares
minimization algorithm. The controller node contains the proposed controller and
system identifier.
The implementation of the proposed method may has three parts: system iden-
tification, value function iteration, and control iteration. Implementing the system
identifier requires (6.4), (6.6), and the data set alluded to in Assumption 6.2. The data
set in Assumption 6.2 was collected in a swimming pool. The vehicle was commanded
to track a data-rich trajectory with a RISE controller [18] while the state-action pairs
were recorded. The recorded data was trimmed to a subset of 40 sampled points
that were selected to maximize the minimum singular value of Y1 Y2 . . . Y j as in
Appendix A.2.3. The system identifier is updated at 50 Hz.
Equations (6.24) and (6.26) form the value function iteration. Evaluating the
extrapolated Bellman error (6.24) with each control iteration is computational expen-
sive. Due to the limited computational resources available on-board the autonomous
underwater vehicle, the update of the critic weights was selected to be calculated at
a different rate (5 Hz) than the main control loop.
For the experiments, the controller in (6.4), (6.6), (6.23), (6.24), (6.26), (6.28),
and (6.29) was restricted to three degrees of freedom (i.e., surge, sway, and yaw). The
RISE controller is used to regulate the remaining degrees-of-freedom (i.e., heave,
roll, and pitch), to maintain the implied assumption that roll and pitch remain at
zero and the depth remains constant. The RISE controller in conjunction with the
proposed controller is executed at 50 Hz.
The vehicle uses water profiling data from the Doppler velocity log to measure
the relative water velocity near the vehicle in addition to bottom tracking data for
the state estimator. Between the state estimator, water profiling data, and recorded
data, the equations used to implement the developed controller only contain known
or measurable quantities.
The vehicle was commanded to hold a station near the vent of Ginnie Spring.
T
An initial condition of ζ (t0 ) = 4 m 4 m [ π4 ] rad 0 m/s 0 ms 0 rad/s was given
to demonstrate the method’s ability to regulate the state. The optimal control weight-
ing matrices were selected to be Q = diag ([20, 50, 20, 10, 10, 10]) and R = I3 . The
system identifier adaptation gains were selected to be k x = 25 × I6 , kθ = 12.5, and
Γθ = diag ([187.5, 937.5, 37.5, 37.5, 37.5, 37.5, 37.5, 37.5]). The parameter esti-
mate was initialized with θ̂ (t0 ) = 08×1 . The neural network weights were initialized
to match the ideal values for the linearized optimal control problem, selected by solv-
ing the algebraic Riccati equation with the dynamics linearized about the station. The
actor adaptation gains were selected to be kc1 = 0.25 × I21 , kc2 = 0.5 × I21 , ka = I21 ,
210 6 Applications
-2
-3
-4
-5
0 50 100 150
Time (sec)
vehicle
0.2
0.1
-0.1
-0.2
0 50 100 150
Time (sec)
k p = 0.25, and β = 0.025. The adaptation matrix was initialized to Γ0 = 400 × I21 .
The Bellman error was extrapolated to 2025 points in a grid about the station.
Figures 6.2 and 6.3 illustrate the ability of the generated policy to regulate the
state in the presence of the spring’s current. Figure 6.4 illustrates the total control
effort applied to the body of the vehicle, which includes the estimate of the current
compensation term and approximate optimal control. Figure 6.5 illustrates the output
of the approximate optimal policy for the residual system. Figure 6.6 illustrates the
convergence of the parameters of the system identifier and Figs. 6.7 and 6.8 illustrate
convergence of the neural network weights representing the value function.
6.2 Station-Keeping of a Marine Craft 211
10
-5
-10
0 50 100 150
Time (sec)
-200
-300
-400
-500
0 50 100 150
Time (sec)
The anomaly seen at ∼70 s in the total control effort (Fig. 6.4) is attributed to
a series of incorrect current velocity measurements. The corruption of the current
velocity measurements is possibly due in part to the extremely low turbidity in the
spring and/or relatively shallow operating depth. Despite presence of unreliable cur-
rent velocity measurements the vehicle was able to regulate the vehicle to its station.
The results demonstrate the developed method’s ability to concurrently identify the
unknown hydrodynamic parameters and generate an approximate optimal policy
using the identified model. The vehicle follows the generated policy to achieve its
station keeping objective using industry standard navigation and environmental sen-
sors (i.e., inertial measurement unit, Doppler velocity log).
212 6 Applications
Parameters
Eq. (6.100) of [9]
-200
-300
-400
-500
0 50 100 150
Time (sec)
Fig.
6.7 Value function 1500
Ŵc neural network weight
estimates online convergence
1000
Ŵc
500
-500
0 50 100 150
Time (sec)
Fig. 6.8 Policy Ŵa 1500
neural network weight
estimates online convergence
1000
Ŵa
500
-500
0 50 100 150
Time (sec)
6.3 Online Optimal Control for Path-Following 213
where x (t) , y (t) ∈ R denote the planar position error between the vehicle and the
virtual target, θ (t) ∈ R denotes the rotational error between the vehicle heading
and the heading of the virtual target, v (t) ∈ R denotes the linear velocity of the
vehicle, w (t) ∈ R denotes the angular velocity of the vehicle, κ (t) ∈ R denotes the
path curvature evaluated at the virtual target, and s p (t) ∈ R denotes velocity of the
virtual target along the path. For a detailed derivation of the dynamics in (6.36) see
Appendix A.3.3.
Assumption 6.5 The desired path is regular and C 2 continuous; hence, the path
curvature κ is bounded and continuous.
As described in [6], the location of the virtual target is determined by
where k2 ∈ R>0 is an adjustable gain. From (6.37) and (6.38), the time derivative of
φ is
φ̇ (t) = k2 1 − φ 2 (t) (vdes (t) cos θ (t) + k1 x (t)) . (6.39)
2 Parts of the text in this section are reproduced, with permission, from [5], 2014,
c IEEE.)
214 6 Applications
Note that the path curvature and desired speed profile can be written as functions of
φ.
Based on (6.36) and (6.37), auxiliary control inputs ve , we ∈ R are designed as
where wss κvdes and vss vdes are computed based on the control input required
to remain on the path.
Substituting (6.37) and (6.40) into (6.36), and augmenting the system state with
(6.39), the closed-loop system is
ẋ (t) = κ (φ (t)) y (t) vdes (φ (t)) cos θ (t) + k1 κ x (t) y (t) − k1 x (t) + ve (t) cos θ (t) ,
ẏ = vdes (φ (t)) sin θ (t) − κ (φ (t)) x (t) vdes (φ (t)) cos θ (t) − k1 κ (φ (t)) x 2 (t)
+ ve (t) sin θ (t) ,
θ̇ = κ (φ (t)) vdes (φ (t)) − κ (φ (t)) (vdes (φ (t)) cos θ (t) + k1 x (t)) + we (t) ,
φ̇ = k2 1 − φ 2 (t) (vdes (φ (t)) cos θ (t) + k1 x (t)) . (6.41)
⎡ ⎤
cos (θ ) 0
⎢ sin (θ ) 0⎥
g (ζ ) ⎢
⎣ 0
⎥. (6.43)
1⎦
0 0
The cost functional for the optimal control problem is selected as (6.15), where
Q ∈ R4×4 is defined as
Q 03×1
Q ,
01×3 0
2
where Q ∈ R3×3 is a user-defined positive definite matrix such that q ξq ≤
2
ξqT Qξq ≤ q ξq , ∀ξq ∈ R3 , where q and q are positive constants.
The value function satisfies the Hamilton–Jacobi–Bellman equation [14]
0 = r ζ, u ∗ (ζ ) + ∇ζ V ∗ (ζ ) f (ζ ) + g (ζ ) u ∗ (ζ ) , (6.44)
Using (6.44) and the parametric approximation of the optimal value function and
the optimal policy from (6.22) and (6.23), respectively, the Bellman error, δ : R4 ×
R L × R L → R is defined as
δ ζ, Ŵc , Ŵa = r ζ, û ζ, Ŵa + ŴcT ∇ζ σ (ζ ) f (ζ ) + g (ζ ) û ζ, Ŵa . (6.45)
The adaptive update laws for the critic weights and the actor weights are given by
(6.26) and (6.28), respectively,
with the regressor ωk defined as ωk (t)
∇ζ σ (ζk ) f (ζk ) + g (ζk ) û ζk , Ŵa (t) ∈ R L and the normalization factor
&
defined as ρk (t) 1 + ωkT (t) ωk (t). The adaptation gain Γ is held constant and
it is assumed that the regressor satisfies the rank condition in (6.25).
f (ζ ) ≤ L f ζ , ∀ζ ∈ χ ,
Vt∗ (0, t) = 0,
1 1
VL (Z , t) = Vt∗ (e, t) + W̃cT Γ −1 W̃c + W̃aT Γ −1 W̃a . (6.49)
2 2
Using (6.48), the candidate Lyapunov function can be bounded as
where K is an auxiliary positive constant (see Appendix A.3.4) and r ∈ R is the
radius of a selected compact set β, then the controller u (t) = û ζ (t) , Ŵa (t) with
the update laws in (6.26), (6.27), and (6.28) guarantees uniformly ultimately bounded
convergence of the approximate policy to the optimal policy and of the vehicle to the
virtual target.
ϕc
2 ϕa 2 ι2 ι2
V̇L ≤ −ϕe e2 − W̃c − W̃a + c + a + ι,
2 2 2ϕc 2ϕa
To illustrate the ability of the proposed method to approximate the optimal solution,
a simulation is performed where the developed method’s policy and value function
neural network weight estimates are initialized to ideal weights identified on a previ-
ous trial. The true values of the ideal neural network weights are unknown. However,
initializing the actor and critic neural network weights to the ideal weights deter-
mined offline, the accuracy of the approximation can be compared to the optimal
solution. Since an analytical solution is not feasible for this problem, the simulation
results are directly compared to results obtained by the offline numerical optimal
solver GPOPS [19].
The simulation result utilize the kinematic model in (6.36) as the simulated mobile
robot. The vehicle is commanded to follow a figure eight path with a desired speed
of vdes = 0.25 ms . The virtual target is initially placed at the position corresponding
to an initial path parameter of s p (0) = 0 m, and the initial error state is selected
T
as e (0) = 0.5 m 0.5 m 2 πrad . Therefore, the initial augmented state is ζ (0) =
0.5 m 0.5 m 2 πrad 0 m . The basis for the value function approximation is selected
T
as
T
σ = ζ1 ζ2 , ζ1 ζ3 , ζ2 ζ3 , ζ12 , ζ22 , ζ32 , ζ42 .
The sampled data points are selected on a 5 × 5 × 3 × 3 grid about the origin. The
quadratic cost weighting matrices are selected as Q = diag ([2, 2, 0.25]) and R = I2 .
The learning gains are selected by trial and error as
Additionally, systematic gain tuning methods may be used (e.g., a genetic algorithm
approach similar to [20] may be used to minimize a desired performance criteria
such as weight settling time).
6.3 Online Optimal Control for Path-Following 219
The auxiliary gains in (6.37) and (6.39) are selected as k1 = 0.5 and k2 = 0.005.
Determined from a previous trial, the actor and critic neural network weight estimates
are initialized to
T
Ŵc (0) = 2.8 × 10−2 , −3.3 × 10−2 , 4.0, 1.2, 2.7, 2.9, 1.0
and
T
Ŵa (0) = 2.8 × 10−2 , −3.3 × 10−2 , 4.0, 1.2, 2.7, 2.9, 1.0
Figures 6.9 and 6.10 illustrate that the state and control trajectories approach
the solution found using the offline optimal solver, and Fig. 6.11 shows that the
neural network critic and actor weight estimates remain at their steady state values.
The system trajectories and control values obtained using the developed method
approximate the system trajectories and control value of the offline optimal solver. It
takes approximately 125 s for the mobile robot to traverse the desired path. However,
all figures with the exception of the vehicle trajectory are plotted only for 60 s to
provide clarity on the transient response. The steady-state response remains the same
after the initial transient (20 s).
-1
0 10 20 30 40 50 60
Time (sec)
2
-2
0 10 20 30 40 50 60
Time (sec)
220 6 Applications
0 10 20 30 40 50 60
Time (sec)
-1
0 10 20 30 40 50 60
Time (sec)
2
simulation
1
0
0 10 20 30 40 50 60
Time (sec)
5
4
3
Ŵa
2
1
0
0 10 20 30 40 50 60
Time (sec)
3 Thealgorithm complexity is linear with respect to the number of sampled points and cubic with
respect to the number of basis functions for each control iteration.
6.3 Online Optimal Control for Path-Following 221
As with the simulation results, the vehicle is commanded to follow a figure eight
path with a desired speed of vdes = 0.25 ms . The basis for the value function approx-
imation is selected as
T
σ = ζ1 ζ2 , ζ1 ζ3 , ζ1 ζ4 , ζ2 ζ3 , ζ2 ζ4 , ζ3 ζ4, ζ12 , ζ22 , ζ32 , ζ42 .
The sampled data points are selected on a 5 × 5 × 3 × 3 grid about the origin. The
quadratic cost weighting matrices are selected as Q = diag ([2, 2, 0.25]) and R = I2 .
The learning gains are selected by trial and error as
The auxiliary gains in (6.37) and (6.39) are selected as k1 = 0.5 and k2 = 0.005. The
T
initial augmented state is ζ (0) = −0.5 m −0.5 m 2 πrad 0 m . The actor and critic
neural network weight estimates are arbitrarily initialized to
and
0.5
-0.5
-1
0 20 40 60
Time (sec)
implemented on the 1
Turtlebot
0
0 10 20 30 40 50 60
Time (sec)
3
2
Ŵa
0 10 20 30 40 50 60
Time (sec)
For the given basis, the actor and critic neural network weight estimates may also be
initialized such that the value function approximation is equivalent to the solution
to the algebraic Riccati equation corresponding to the kinematic model linearized
about the initial conditions.
Figure 6.13 shows convergences of the error state to a ball about the origin. Figure
6.14 shows the neural network critic and actor weight estimates converge to steady
state values. The ability of the mobile robot to track the desired path is demonstrated
in Fig. 6.15.
6.4 Background and Further Reading 223
-2
-4
-6 -4 -2 0 2 4 6
solving the algebraic Riccati equation corresponding to the linearized error dynam-
ics about the desired heading. Nonlinear model-predictive control is used in [32]
to develop an optimal path-following controller over a finite time horizon. Dynamic
programming was applied to the path-following problem in [33] to numerically deter-
mine an optimal path-following feedback policy offline. The survey in [34] cites addi-
tional examples of model-predictive control and dynamic programming applied to
path-following. Unlike approximate dynamic programming, model-predictive con-
trol does not guarantee optimality of the implemented controller and dynamic pro-
gramming does not accommodate simultaneous online learning and utilization of the
feedback policy.
References
18. Fischer N, Hughes D, Walters P, Schwartz E, Dixon WE (2014) Nonlinear RISE-based control
of an autonomous underwater vehicle. IEEE Trans Robot 30(4):845–852
19. Rao AV, Benson DA, Darby CL, Patterson MA, Francolin C, Huntington GT (2010) Algorithm
902: GPOPS, a MATLAB software for solving multiple-phase optimal control problems using
the Gauss pseudospectral method. ACM Trans Math Softw 37(2):1–39
20. Otsuka A, Nagata F (2013) Application of genetic algorithms to fine-gain tuning of improved
the resolved acceleration controller. Procedia Comput Sci 22:50–59
21. Sørensen AJ (2011) A survey of dynamic positioning control systems. Annu Rev Control
35:123–136
22. Fossen T, Grovlen A (1998) Nonlinear output feedback control of dynamically positioned ships
using vectorial observer backstepping. IEEE Trans Control Syst Technol 6:121–128
23. Sebastian E, Sotelo MA (2007) Adaptive fuzzy sliding mode controller for the kinematic
variables of an underwater vehicle. J Intell Robot Syst 49(2):189–215
24. Tannuri E, Agostinho A, Morishita H, Moratelli L Jr (2010) Dynamic positioning systems: an
experimental analysis of sliding mode control. Control Eng Pract 18:1121–1132
25. Beard RW, Mclain TW (1998) Successive Galerkin approximation algorithms for nonlinear
optimal and robust control. Int J Control 71(5):717–743
26. Fannemel ÅV (2008) Dynamic positioning by nonlinear model predictive control. Master’s
thesis, Norwegian University of Science and Technology
27. Morro A, Sgorbissa A, Zaccaria R (2011) Path following for unicycle robots with an arbitrary
path curvature. IEEE Trans Robot 27(5):1016–1023
28. Morin P, Samson C (2008) Motion control of wheeled mobile robots. Springer handbook of
robotics. Springer, Berlin, pp 799–826
29. Dacic D, Nesic D, Kokotovic P (2007) Path-following for nonlinear systems with unstable zero
dynamics. IEEE Trans Autom Control 52(3):481–487
30. Kanjanawanishkul K, Zell A (2009) Path following for an omnidirectional mobile robot based
on model predictive control. In: Proceedings of the IEEE International Conference on Robotics
and Automation, pp 3341–3346
31. Ratnoo A, Pb S, Kothari M (2011) Optimal path following for high wind flights. In: IFAC
world congress, Milano, Italy 18:12985–12990
32. Faulwasser T, Findeisen R (2009) Nonlinear model predictive path-following control. In: Magni
L, Raimondo D, Allgöwer F (eds) Nonlinear model predictive control, vol 384. Springer, Berlin,
pp 335–343
33. da Silva JE, de Sousa JB (2011) A dynamic programming based path-following controller for
autonomous vehicles. Control Intell Syst 39:245–253
34. Sujit P, Saripalli S, Borges Sousa J (2014) Unmanned aerial vehicle path following: a survey
and analysis of algorithms for fixed-wing unmanned aerial vehicles. IEEE Control Syst Mag
34(1):42–59
Chapter 7
Computational Considerations
7.1 Introduction
Efficient methods for the approximation of the optimal value function are essential, since an increase in state dimension can lead to an exponential increase in the number of basis functions required to achieve an accurate approximation. This is known as the “curse of dimensionality”. To set the stage for the approximation methods of this chapter, the first half of the introduction outlines a problem that arises in the real-time application of optimal control theory. Sufficiently accurate
approximation of the value function over a sufficiently large neighborhood often
requires a large number of basis functions, and hence, introduces a large number
of unknown parameters. One way to achieve accurate function approximation with
fewer unknown parameters is to use prior knowledge about the system to determine
the basis functions. However, for general nonlinear systems, prior knowledge of
the features of the optimal value function is generally not available; hence, a large
number of generic basis functions is often the only feasible option.
For some problems, such as the linear quadratic regulator problem, the optimal value function takes a particular form which makes the choice of basis functions trivial. In the case of the linear quadratic regulator, the optimal value function is of the form Σ_{i,j=1}^n w_{i,j} x_i x_j (cf. [1, 2]), so basis functions of the form σ_{i,j} = x_i x_j will provide an accurate estimation of the optimal value function. However, in most cases, the form of the optimal value function is unknown, and generic basis functions are employed to parameterize the problem.
Often, kernel functions from reproducing kernel Hilbert spaces are used as generic basis functions, and the approximation problem is solved over a (preferably large) compact domain of Rⁿ [3–5]. An essential property of reproducing kernel Hilbert spaces is that, given a collection of basis functions in the Hilbert space, there is a unique set of weights that minimizes the error in the Hilbert space norm, the so-called ideal weights [6]. A common choice of kernel is the Gaussian radial basis function given by K(x, y) = exp(−‖x − y‖²/μ), where x, y ∈ Rⁿ and μ > 0 [5, 7]. For the approximation of a function over a large compact domain, a large number of basis functions is required, which leads to an intractable computational problem for online control applications.
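To make the scaling issue concrete, the following minimal NumPy sketch evaluates the Gaussian kernel above and counts the centers needed when they are placed on a uniform grid over a cube; the grid spacing, domain size, and dimensions are illustrative assumptions rather than values from the text.

```python
import numpy as np

def gaussian_kernel(x, y, mu=1.0):
    """Gaussian RBF kernel K(x, y) = exp(-||x - y||^2 / mu)."""
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(y)) ** 2 / mu)

def num_grid_centers(d, h, n):
    """Centers on a uniform grid with spacing h over the cube [-d, d]^n:
    the count grows like (2 d / h + 1)^n, i.e., exponentially in n."""
    return int(np.floor(2 * d / h) + 1) ** n

print(gaussian_kernel([0.0, 0.0], [0.5, 0.5]))
for n in (2, 4, 6):
    print(n, num_grid_centers(d=5.0, h=0.5, n=n))
```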
Thus, approximation methodologies for the reduction of the number of basis
functions required to achieve accurate function approximation are well motivated.
In particular, the aim of this chapter is the development of an efficient scheme for
the approximation of continuous functions via state and time varying basis functions
that maintain the approximation of a function in a local neighborhood of the state,
deemed the state following (StaF) method. The method developed in this chapter is
presented as a general strategy for function approximation, and can be implemented
in contexts outside of optimal control.
The particular basis functions that will be employed throughout this chapter are
derived from kernel functions corresponding to reproducing kernel Hilbert spaces.
In particular, the centers are selected to be continuous functions of the state variable
bounded by a predetermined value. That is, given a compact set D ⊂ Rⁿ, ε > 0, r > 0, and L ∈ N, ci(x) ≜ x + di(x), where di : Rⁿ → Rⁿ is continuously differentiable and sup_{x∈D} ‖di(x)‖ < r for i = 1, ..., L. The parameterization of a function V in terms of StaF kernel functions is given by

V̂(y; x(t), t) = Σ_{i=1}^L wi(t) K(y, ci(x(t))),
The maintenance of such an approximation is facilitated by the fact that the ideal weights are continuously differentiable functions of
the system state. To facilitate the proof of continuous differentiability, the StaF kernels
are selected from a reproducing kernel Hilbert space. Other function approximation
methods, such as radial basis functions, sigmoids, higher order neural networks, sup-
port vector machines, etc., can potentially be utilized in a state-following manner to
achieve similar results provided continuous differentiability of the ideal weights can
be established. An examination of smoothness properties of the ideal weights resulting
from a state-following implementation of the aforementioned function approximation
methods is out of the scope of this chapter.
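The following sketch illustrates how a StaF approximation V̂(y; x(t), t) = Σ wi(t)K(y, ci(x(t))) is evaluated with centers that move with the state. The exponential kernel matches the choice made later in the chapter, while the particular offset functions di and the weight values are placeholder assumptions.

```python
import numpy as np

def exp_kernel(y, c):
    """Exponential kernel K(y, c) = exp(y^T c) used as the StaF kernel."""
    return np.exp(np.dot(y, c))

def staf_value(y, x, w, offsets, kernel=exp_kernel):
    """Evaluate V_hat(y; x) = sum_i w_i K(y, c_i(x)) with c_i(x) = x + d_i(x)."""
    centers = [x + d(x) for d in offsets]
    return sum(wi * kernel(y, ci) for wi, ci in zip(w, centers))

# Placeholder offset functions d_i(x), bounded by r (illustrative choice only).
r = 0.1
offsets = [lambda x, s=s: r * s for s in (np.array([1.0, 0.0]),
                                          np.array([-0.5, 0.87]),
                                          np.array([-0.5, -0.87]))]
x = np.array([0.3, -0.2])            # current state
w = np.array([0.1, -0.05, 0.2])      # weight estimates at the current time
print(staf_value(x, x, w, offsets))  # approximation evaluated at y = x
```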
Another key contribution of this chapter is the observation that model-based re-
inforcement learning techniques can be implemented without storing any data if the
available model is used to simulate persistent excitation. In other words, an exci-
tation signal added to the simulated system, instead of the actual physical system,
can be used to learn the value function. Excitation via simulation is implemented
using Bellman error extrapolation (cf. [11–13]); however, instead of a large num-
ber of autonomous extrapolation functions employed in the previous chapters, a
single time-varying extrapolation function is selected, where the time-variation of
the extrapolation function simulates excitation. The use of a single extrapolation
point introduces a technical challenge since the Bellman error extrapolation matrix
is rank deficient at each time instance. The aforementioned challenge is addressed in
Sect. 7.4.3 by modifying the stability analysis to utilize persistent excitation of the ex-
trapolated regressor matrix. Simulation results including comparisons with state-of-
the-art model-based reinforcement learning techniques are presented to demonstrate
the effectiveness of the developed technique.
7.2 Reproducing Kernel Hilbert Spaces

A reproducing kernel Hilbert space, H, is a Hilbert space of functions f : X → F (where F = C or R), with inner product ⟨·, ·⟩_H, for which, given any x ∈ X, the functional E_x f := f(x) is bounded. By the Riesz representation theorem, for each x ∈ X there is a unique function k_x ∈ H for which ⟨f, k_x⟩_H = f(x). Each function k_x is called a reproducing kernel for the point x ∈ X. The function K(x, y) = ⟨k_y, k_x⟩_H is called the kernel function for H [7]. The norm corresponding to H will be denoted as ‖·‖_H, and the subscript will be suppressed when the Hilbert space is understood. The span of the kernel functions is dense in H under the reproducing kernel Hilbert space norm.
Kernel functions have the property that for each collection of points {x1, ..., xm} ⊂ X, the matrix (K(xi, xj))_{i,j=1}^m is positive semi-definite. The Aronszajn–Moore theorem states that there is a one-to-one correspondence between kernel functions with
this property and reproducing kernel Hilbert spaces. In fact, starting with a kernel
function having the positive semi-definite property, there is an explicit construction
for its reproducing kernel Hilbert space. Generally, the norm for the reproducing kernel Hilbert space is given by

‖f‖_H = sup{‖P_{c1,...,cM} f‖ : M ∈ N, c1, ..., cM ∈ X},   (7.1)
where P_{c1,...,cM} f is the projection of f onto the subspace of H spanned by the kernel functions K(·, ci) for i = 1, ..., M. P_{c1,...,cM} f is computed by interpolating the points (ci, f(ci)) for i = 1, ..., M with a function of the form Σ_{i=1}^M wi K(·, ci). The norm of the projection then becomes¹ ‖P_{c1,...,cM} f‖ = (Σ_{i,j=1}^M wi w̄j K(cj, ci))^{1/2}. In practice, the utility of computing the norm of f as in (7.1) is limited, and alternate forms of the norm are sought for specific reproducing kernel Hilbert spaces.
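As a concrete illustration of the projection described above, the sketch below interpolates the samples (ci, f(ci)) to obtain the weights of P_{c1,...,cM} f and evaluates ‖P_{c1,...,cM} f‖ = (Σ wi wj K(cj, ci))^{1/2} for a real-valued example; the Gaussian kernel, the centers, and the target f are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, mu=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / mu)

def project(f, centers, kernel):
    """Interpolation weights and RKHS norm of the projection of f onto
    span{K(., c_i)}: solve K w = f(c), then ||P f|| = sqrt(w^T K w)."""
    K = np.array([[kernel(ci, cj) for cj in centers] for ci in centers])
    fc = np.array([f(c) for c in centers])
    w = np.linalg.solve(K, fc)
    return w, np.sqrt(w @ K @ w)

centers = [np.array([t, 0.0]) for t in np.linspace(-1, 1, 5)]  # illustrative centers
f = lambda z: np.sin(z[0])                                     # illustrative target
w, norm_Pf = project(f, centers, gaussian_kernel)
print(w, norm_Pf)
```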
Unlike L 2 spaces, norm convergence in a reproducing kernel Hilbert space implies
pointwise convergence. This follows since if f n → f in the reproducing kernel
Hilbert space norm, then
|f(x) − fn(x)| = |⟨f − fn, kx⟩| ≤ ‖f − fn‖ ‖kx‖ = ‖f − fn‖ √K(x, x).

When K is continuous, the term √K(x, x) is bounded over compact
Therefore, the problem of establishing an accurate approximation in the supremum
norm of a function is often relaxed to determining an accurate approximation of a
function in the reproducing kernel Hilbert space norm.
Given a reproducing kernel Hilbert space H over a set X and Y ⊂ X , the space
HY obtained by restricting each function f ∈ H to the set Y is itself a reproducing
kernel Hilbert space where the kernel function is given by restricting the original
kernel function to the set Y × Y. The resulting Hilbert space norm is given by

‖g‖_{H_Y} = inf{‖f‖_H : f ∈ H, f|_Y = g}.

Therefore, the map f → f|_Y is norm decreasing from H to H_Y [7]. For the purposes
of this chapter, the norm obtained by restricting a reproducing kernel Hilbert space H over Rⁿ to a closed neighborhood Br(x), where r > 0 and x ∈ Rⁿ, will be denoted as ‖·‖_{r,x}.
1 For z ∈ C, the quantity Re(z) is the real part of z, and z̄ represents the complex conjugate of z.
7.3 StaF: A Local Approximation Method

Central problems to the StaF method are those of determining the basis functions
and the weight signals. When reproducing kernel Hilbert spaces are used for ba-
sis functions, (7.2) can be relaxed so that the supremum norm is replaced with the
Hilbert space norm. Since the Hilbert space norm of a reproducing kernel Hilbert
space dominates the supremum norm (cf. [7, Corollary 4.36]), (7.2) with the supre-
mum norm is simultaneously satisfied. Moreover, when using a reproducing
kernel Hilbert space, the basis functions can be selected to correspond to centers
placed in a moving neighborhood of the state. In particular, given a kernel function
K : Rn × Rn → R corresponding to a (universal) reproducing kernel Hilbert space,
H, and center functions ci : Rⁿ → Rⁿ for which ci(x) − x = di(x) is a continuous function bounded by r, the StaF problem becomes the determination of weight signals wi : R₊ → R for i = 1, ..., L such that

lim sup_{t→∞} ‖V(·) − Σ_{i=1}^L wi(t) K(·, ci(x(t)))‖_{r,x(t)} < ε,   (7.3)

where ‖·‖_{r,x(t)} is the norm of the reproducing kernel Hilbert space obtained by restricting functions in H to Br(x(t)) [7, 14].
Since (7.3) implies (7.2), the focus of this section is to demonstrate the feasibility
of satisfying (7.3). Theorem 7.1 demonstrates that under a certain continuity assump-
tion a bound on the number of kernel functions necessary for the maintenance of an
approximation throughout a compact set can be determined, and Theorem 7.3 shows
that a collection of continuous ideal weight functions can be determined to satisfy
(7.3). Theorem 7.3 justifies the use of weight update laws for the maintenance of an
accurate function approximation, and this is demonstrated by Theorem 7.5.
The choice of reproducing kernel Hilbert space for Sect. 7.3.5 is that which corresponds to the exponential kernel K(x, y) = exp(xᵀy), where x, y ∈ Rⁿ. The reproducing kernel Hilbert space will be denoted by F²(Rⁿ) since it is closely connected to the Bergmann–Fock space [15]. The reproducing kernel Hilbert space corresponding to the exponential kernel is a universal reproducing kernel
Hilbert space [7, 16], which means that given any compact set D ⊂ Rⁿ, ε > 0, and continuous function f : D → R, there exists a function f̂ ∈ F²(Rⁿ) for which sup_{x∈D} |f(x) − f̂(x)| < ε.
The first theorem concerning the StaF method demonstrates that if the state variable
is constrained to a compact subset of Rn , then there is a finite number of StaF basis
functions required to establish the accuracy of an approximation.
Proof Let ε > 0. For each neighborhood Br(x) with x ∈ D, there exists a finite number of centers c1, ..., cL ∈ Br(x) and weights w1, ..., wL ∈ C such that

‖V(·) − Σ_{i=1}^L wi K(·, ci)‖_{r,x} < ε.
Let L_{x,ε} be the minimum such number. The claim of the proposition is that the set Q ≜ {L_{x,ε} : x ∈ D} is bounded. Assume by way of contradiction that Q is unbounded, and take a sequence {xn} ⊂ D such that L_{xn,ε} is a strictly increasing sequence (i.e., an unbounded sequence of integers) and xn → x in D. It is always possible to find such a convergent sequence, since every compact subset of a metric space is sequentially compact. Let c1, ..., c_{L_{x,ε/2}} ∈ Br(x) and w1, ..., w_{L_{x,ε/2}} ∈ C be centers and weights for which

‖V(·) − Σ_{i=1}^{L_{x,ε/2}} wi K(·, ci)‖_{r,x} < ε/2.   (7.4)
Define

E(x) ≜ (‖V‖²_{r,x} − 2Re(Σ_{i=1}^{L_{x,ε/2}} w̄i V(x + di)) + Σ_{i,j=1}^{L_{x,ε/2}} wi w̄j K(x + di, x + dj))^{1/2}.
The assumption of the continuity of ‖V‖_{r,x} in Theorem 7.1 is well founded. There are several examples where the assumption is known to hold. For instance, if the reproducing kernel Hilbert space is a space of real entire functions, as it is for the exponential kernel, then ‖V‖_{r,x} is not only continuous, but constant. Using an argument similar to that in Theorem 7.1, the theorem can be shown to hold when the restricted Hilbert space norm is replaced by the supremum norm over Br(x). The proof of the following result can be found in [17].
Proposition 7.2 Let D be a compact subset of Rⁿ, V : Rⁿ → R be a continuous function, and K : Rⁿ × Rⁿ → R be a continuous and universal kernel function. For all ε, r > 0, there exists L ∈ N such that for each x ∈ D, there is a collection of centers c1, ..., cL ∈ Br(x) and weights w1, ..., wL ∈ R such that sup_{y∈Br(x)} |V(y) − Σ_{i=1}^L wi K(y, ci)| < ε.
Now that it has been demonstrated that only a finite number of moving centers is
required to maintain an accurate approximation, it will now be demonstrated that the
ideal weights corresponding to the moving centers change continuously or smoothly
with the corresponding change in centers. In traditional adaptive control applications,
it is assumed that there is a collection of constant ideal weights, and much of the
theory is in the demonstration of the convergence of approximate weights to the ideal
weights. Since the ideal weights are no longer constant, it is necessary to show that
the ideal weights change smoothly as the system progresses. The smooth change in centers will allow the proof of uniformly ultimately bounded results through the use of weight update laws. One such result is demonstrated in Sect. 7.3.4; in particular, a uniformly ultimately bounded result is proven in Theorem 7.5.
Theorem 7.3 Let H be a reproducing kernel Hilbert space over a set X ⊂ Rⁿ with a strictly positive kernel K : X × X → C such that K(·, c) ∈ C^m(Rⁿ) for all c ∈ X. Suppose that V ∈ H. Let C be an ordered collection of L distinct centers, C = (c1, c2, ..., cL) ∈ X^L, with the associated ideal weights

W_H(C) = arg min_{a∈C^L} ‖Σ_{i=1}^L ai K(·, ci) − V(·)‖_H.   (7.5)
Theorem 7.1 demonstrated a bound on the number of kernel functions required for
the maintenance of the accuracy of a moving local approximation. However, the
proof does not provide an algorithm to computationally determine the upper bound.
Indeed, even when the approximation with kernel functions is performed over a
fixed compact set, a general bound for the number of collocation nodes required for
accurate function approximation is unknown.
Thus, it is desirable to have a computationally determinable upper bound to the
number of StaF basis functions required for the maintenance of an accurate function
approximation. Theorem 7.4 provides a calculable bound on the number of exponen-
tial functions required for the maintenance of an approximation with respect to the
supremum norm.
While such error bounds have been computed for the exponential function before (cf. [18]), the current literature allows the “frequencies” or centers of the exponential kernel functions to be unconstrained. The contribution of Theorem 7.4 is the development of an error bound while constraining the size of the centers.
Theorem 7.4 Let K : Rⁿ × Rⁿ → R, given by K(x, y) = exp(xᵀy), be the exponential kernel function. Let D ⊂ Rⁿ be a compact set, V : D → R continuous, and ε, r > 0. For each x ∈ D, there exists a finite number of centers c1, ..., c_{L_{x,ε}} ∈ Br(x) and weights w1, w2, ..., w_{L_{x,ε}} ∈ R, such that

sup_{y∈Br(x)} |V(y) − Σ_{i=1}^{L_{x,ε}} wi K(y, ci)| < ε.
Proof For notational simplicity, the quantity ‖f‖_{D,∞} denotes the supremum norm of a function f : D → R over the compact set D throughout the proof of Theorem 7.4. First, consider the ball of radius r centered at the origin. The statement of the theorem can be proven by finding an approximation of monomials by a linear combination of exponential kernel functions.
Let α = (α1, α2, ..., αn) be a multi-index, and define |α| ≜ Σi αi. Note that

m^{|α|} Π_{i=1}^n (exp(yi/m) − 1)^{αi} = y1^{α1} y2^{α2} ··· yn^{αn} + O(1/m),   (7.6)

where the notation gm(y) = O(f(m)) means that for sufficiently large m, there is a constant C for which |gm(y)| < C f(m), ∀y ∈ Br(0). The big-oh constant indicated by O(1/m) can be computed in terms of the derivatives of the exponential function via Taylor’s Theorem. The centers corresponding to this approximation are of the form li/m, where li is a non-negative integer satisfying li ≤ αi. Hence, for m sufficiently large, the centers reside in Br(0).
To shift the centers so that they reside in Br(x), let x = (x1, x2, ..., xn)ᵀ ∈ Rⁿ, and multiply both sides of (7.6) by exp(yᵀx) to get

m^{|α|} Σ_{li≤αi, i=1,2,...,n} Π_{i=1}^n (αi choose li) (−1)^{|α| − Σi li} exp(Σ_{i=1}^n (li/m + xi) yi) = e^{yᵀx} y1^{α1} y2^{α2} ··· yn^{αn} + O(1/m).
For each multi-index α = (α1, α2, ..., αn), the centers for the approximation of the corresponding monomial are of the form xi + li/m for 0 ≤ li ≤ αi. Thus, by linear combinations of these kernel functions, a function of the form e^{yᵀx} g(y), with g a multivariate polynomial, can be uniformly approximated by exponential functions over Br(x). Moreover, if g is a polynomial of degree β, then this approximation can be a linear combination of (n+β choose β) kernel functions.
Let ε̄ > 0 and suppose that px is a polynomial of degree N_{x,ε̄} such that px(y) = V(y) + ε1(y), where |ε1(y)| < ε ‖e^{yᵀx}‖⁻¹_{D,∞}/2, ∀y ∈ Br(x). Let qx(y) be a polynomial in n variables of degree S_{x,ε̄} such that qx(y) = e^{−yᵀx} + ε2(y), where |ε2(y)| < ε ‖V‖⁻¹_{D,∞} ‖e^{yᵀx}‖⁻¹_{D,∞}/2, ∀y ∈ Br(x). Let Fm(y) denote the linear combination of exponential kernel functions constructed as above to approximate e^{yᵀx} qx(y) px(y), so that |Fm(y) − e^{yᵀx} qx(y) px(y)| = O(1/m) for all y ∈ Br(x). The degree of the polynomial qx, S_{x,ε̄}, can be uniformly bounded in terms of the modulus of continuity of e^{yᵀx} over D. Similarly, the uniform bound on the degree of px, N_{x,ε̄}, can be described in terms of the modulus of continuity of V over D. The number of centers required for Fm(y) is determined by the degree of the polynomial q · p (treating the x terms of q as constant), which is the sum of the two polynomial degrees. Finally, for m large enough and ε̄ small enough, |Fm(y) − V(y)| < ε, and the proof is complete.
Theorem 7.4 demonstrates an upper bound required for the accurate approxima-
tion of a function through the estimation of approximating polynomials. Moreover,
the upper bound is a function of the polynomial degrees. The exponential kernel will
be used for simulations in Sect. 7.3.5.
As mentioned before, the theory of adaptive control is centered around the concept
of weight update laws. Weight update laws are a collection of rules that the approx-
imating weights must obey which lead to convergence to the ideal weights. In the
case of the StaF approximation framework, the ideal weights are replaced with ideal
weight functions. Theorem 7.3 showed that if the moving centers of the StaF kernel
functions are selected in such a way that the centers adjust smoothly with respect to
the state x, then the ideal weight functions will also change smoothly with respect
to x. Thus, in this context, weight update laws of the StaF approximation framework
aim to achieve an estimation of the ideal weight function at the current state.
Theorem 7.5 provides an example of such weight update laws that achieve a uni-
formly ultimately bounded result. The theorem takes advantage of perfect samples of
a function in the reproducing kernel Hilbert space H corresponding to a real valued
kernel function. The proof of the theorem follows the standard proof for the conver-
gence of the gradient descent algorithm for a quadratic programming problem [19].
Theorem 7.5 Let H be a real valued reproducing kernel Hilbert space over Rⁿ with a continuously differentiable strictly positive definite kernel function K : Rⁿ × Rⁿ → R. Let V ∈ H, D ⊂ Rⁿ be a compact set, and x : R → Rⁿ be a state variable subject to the dynamical system ẋ = q(x, t), where q : Rⁿ × R₊ → Rⁿ is a bounded, locally Lipschitz continuous function. Further suppose that x(t) ∈ D, ∀t > 0. Let c : Rⁿ → RⁿL, where for each i = 1, ..., L, ci(x) = x + di(x) with di ∈ C¹(Rⁿ), and let a ∈ R^L. Consider the function

F(a, c) = ‖V − Σ_{i=1}^L ai K(·, ci(x))‖²_H.

At each time instance t > 0, there is a unique W(t) for which W(t) = arg min_{a∈R^L} F(a, c(x(t))). Given any ε > 0 and initial value a⁰, there is a frequency τ > 0 such that, if the gradient descent algorithm (with respect to a) is iterated at time steps Δt < τ⁻¹, then F(aᵏ, cᵏ) − F(wᵏ, cᵏ) will approach a neighborhood of radius ε as k → ∞.
where V(c) ≜ (V(c1), ..., V(cL))ᵀ and K(c) ≜ (K(ci, cj))_{i,j=1}^L is the symmetric, strictly positive kernel matrix corresponding to c. At each time iteration tᵏ, k = 0, 1, 2, ..., the corresponding centers and weights will be written as cᵏ ∈ RⁿL and aᵏ ∈ R^L, respectively. The ideal weights corresponding to cᵏ will be denoted by wᵏ. It can be shown that wᵏ = K(cᵏ)⁻¹V(cᵏ) and F(wᵏ, cᵏ) = ‖V‖²_H − V(cᵏ)ᵀK(cᵏ)⁻¹V(cᵏ). Theorem 7.3 ensures that the ideal weights change continuously with respect to the centers, which remain in the compact set D̃^L, where D̃ ≜ {x ∈ Rⁿ : dist(x, D) ≤ max_{i=1,...,L} sup_{x∈D} ‖di(x)‖}, so the collection of ideal weights is bounded. Let R > ε̄ be large enough so that B_R(0) contains both the initial value a⁰ and the set of ideal weights. To facilitate the subsequent analysis, consider the constants
and let Δt < τ⁻¹ ≜ ε̄(2(R0 + R3)(R1R4(R0 + R3) + R2 + 1))⁻¹. The proof aims to show that, by using the gradient descent law for choosing aᵏ, the approximation error satisfies |ε1(tᵏ)| < ε̄/2, ∀k. The quantity F(wᵏ⁺¹, cᵏ⁺¹) is continuously differentiable in both variables. Thus, by the multi-variable chain rule and another application of the mean value theorem, a bound of the form (7.8) is obtained, where A_{cᵏ} is the largest eigenvalue of K(cᵏ) and a_{cᵏ} is the smallest eigenvalue of K(cᵏ). The quantity on the right of (7.8) is continuous with respect to A_{cᵏ} and a_{cᵏ}. In turn, A_{cᵏ} and a_{cᵏ} are continuous with respect to K(cᵏ) (cf. [20, Exercise 4.1.6]), which is continuous with respect to cᵏ. Therefore, there is a largest value, δ, that the right-hand side of (7.8) attains on the compact set D̃, and this value is less than 1. Moreover, δ is independent of ε̄, so it may be declared that ε̄ = ε(1 − δ). Finally,
which is the dynamical system corresponding to a circular trajectory. The state de-
pendent function to be approximated is
for i = 1, 2, 3.
The initial values selected for the weights are a⁰ = [0, 0, 0]ᵀ. The gradient descent weight update law, given by (7.7), is applied for 10 iterations per time step, and the time steps are incremented every 0.01 s. Figures 7.1, 7.2, 7.3 and 7.4 present the results of the simulation.
Figure 7.4 demonstrates that the function approximation error is driven to a small
neighborhood of zero as the gradient chase theorem is implemented, which numeri-
cally validates the claim of the uniformly ultimately bounded result of Theorem 7.5.
Approximations of the ideal weight function, depicted in Fig. 7.3, are periodic and
smooth. Smoothness of the ideal weight function itself is given in Theorem 7.3, and
the periodicity of the approximation follows from the periodicity of the selected
dynamical system, Fig. 7.1. Finally, Fig. 7.2 shows that along the system trajectory,
the approximation V̂ rapidly converges to the true function V . Approximation of the
function is maintained as the system state moves through its domain as anticipated.
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning

In the following, Sect. 7.4.1 summarizes key results from the previous section in
the context of model-based reinforcement learning. In Sect. 7.4.2, the StaF-based
function approximation approach is used to approximately solve an optimal regula-
tion problem online using exact model knowledge via value function approximation.
Section 7.4.3 is dedicated to Lyapunov-based stability analysis of the developed tech-
nique. Section 7.4.4 extends the developed technique to systems with uncertain drift
dynamics and Sect. 7.4.5 presents comparative simulation results.
2 Parts of the text in this section are reproduced, with permission, from [21], © 2016 Elsevier.
[K(x, c1), ···, K(x, cL)]ᵀ. Then, there exists a unique set of weights W_H such that

W_H(C) = arg min_{a∈R^L} ‖aᵀσ(·, C) − V*‖_H,   (7.10)

where ‖·‖_H denotes the Hilbert space norm. Furthermore, for any given ε > 0, there exists a constant L ∈ N, a set of centers C ∈ χ^L, and a set of weights W_H ∈ R^L, such that ‖W_Hᵀσ(·, C) − V*‖_H ≤ ε. On compact sets, the Hilbert space norm corresponding to a Hilbert space with continuously differentiable kernels dominates the supremum norm of functions and their derivatives [7, Corollary 4.36]. Hence, the function can be approximated as well as its derivative; that is, there exist centers and weights for which ‖W_Hᵀσ(·, C) − V*‖_{χ,∞} < ε and ‖W_Hᵀ∇σ(·, C) − ∇V*‖_{χ,∞} < ε. The notation ∇f denotes the gradient of f with respect to the first argument, and the notation ‖f‖_{A,∞} denotes the supremum of the absolute value (or the pointwise norm, if f is vector-valued) of f over the set A.
Let H_{x,r} denote the restriction of the Hilbert space H to Br(x) ⊂ χ. Then, H_{x,r} is a Hilbert space with the restricted kernel K_{x,r} : Br(x) × Br(x) → R defined as K_{x,r}(y, z) = K(y, z), ∀(y, z) ∈ Br(x) × Br(x). The Weierstrass Theorem indicates that as r decreases, the degree N_{x,ε} of the polynomial needed to achieve the same error over Br(x) decreases [22]. Hence, by Theorem 7.4, approximation of a function over a smaller domain requires a smaller number of exponential kernels. Furthermore, provided the region of interest is small enough, the number of kernels required to approximate continuous functions with arbitrary accuracy can be reduced to (n+2 choose 2).
In the StaF approach, the centers are selected to follow the current state x (i.e.,
the locations of the centers are defined as a function of the system state). Since the
system state evolves in time, the ideal weights are not constant. To approximate the
ideal weights using gradient-based algorithms, it is essential that the weights change
smoothly with respect to the system state. Theorem 7.3 establishes differentiability of
the ideal weights as a function of the centers to facilitate implementation of gradient-
based update laws to learn the time-varying ideal weights in real-time.
Consider the Bolza problem introduced in Sect. 1.5 where the functions f and g
are assumed to be known and locally Lipschitz continuous. Furthermore, assume
that f (0) = 0 and that ∇ f : Rn → Rn×n is continuous. The selection of an optimal
regulation problem and the assumption that the system dynamics are known are
motivated by ease of exposition. Using the concurrent learning-based adaptive system
identifier and the state augmentation technique described in Sects. 3.3 and 4.4, the
approach developed in this section can be extended to a class of trajectory tracking
problems in the presence of uncertainties in the system drift dynamics. For a detailed
description of StaF-based online approximate optimal control under uncertainty, see
Sect. 7.4.4. Simulation results in Sect. 7.4.5 demonstrate the performance of such an
extension.
The expression for the optimal policy in (1.13) indicates that to compute the
optimal action when the system is at any given state x, one only needs to evaluate
the gradient ∇V ∗ at x. Hence, to compute the optimal policy at x, one only needs to
approximate the value function over a small neighborhood around x. Furthermore, as
established in Theorem 7.4, the number of basis functions required to approximate
the value function is smaller if the region for the approximation is smaller (with
respect to the ordering induced by set containment). Hence, in this result, the aim is
to obtain a uniform approximation of the value function over a small neighborhood
around the current system state.
StaF kernels are employed to achieve the aforementioned objective. To facilitate
the development, let x be in the interior of χ. Then, for all ε > 0, there exists a function V̄* ∈ H_{x,r} such that sup_{y∈Br(x)} |V*(y) − V̄*(y)| < ε, where H_{x,r} is a restriction of a universal reproducing kernel Hilbert space, H, introduced in Sect. 7.4.1, to Br(x).
the current state x is selected for value function approximation by selecting the cen-
ters C ∈ Br (x) such that C = c (x) for some continuously differentiable function
c : χ → χ L . Using StaF kernels centered at a point x, the value function can be
represented as

V*(y) = W(x)ᵀσ(y, c(x)) + ε(x, y),   (7.11)

where W : χ → R^L is the ideal StaF weight function and ε is the corresponding function approximation error.
The objective of the critic is to learn the ideal parameters W (x), and the objective
of the actor is to implement a stabilizing controller based on the parameters learned by
the critic. Motivated by the stability analysis, the actor and the critic maintain separate estimates Ŵa and Ŵc, respectively, of the ideal parameters W(x). Using the estimates V̂ and û for V* and u*, respectively, the Bellman error δ : Rⁿ × Rⁿ × R^L × R^L → R is computed as

δ(y, x, Ŵc, Ŵa) ≜ r(y, û(y, x, Ŵa)) + ∇V̂(y, x, Ŵc)(f(y) + g(y)û(y, x, Ŵa)).   (7.12)
To solve the optimal control problem, the critic aims to find a set of parameters
Ŵc and the actor aims to find a set of parameters Ŵa such that δ y, x, Ŵc , Ŵa =
0, ∀x ∈ Rn , ∀y ∈ Br (x). Since an exact basis for value function approximation is
generally not available, an approximate set of parameters that minimizes the Bellman
error is sought.
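The sketch below evaluates the Bellman error in (7.12) for a control-affine system. It assumes a quadratic local cost r(y, u) = yᵀQy + uᵀRu and uses the fact, noted in Sect. 7.5, that (1/2)R⁻¹gᵀ∇σᵀ serves as a basis for the controller, so that û(y, x, Ŵa) = −(1/2)R⁻¹g(y)ᵀ∇σ(y, c(x))ᵀŴa; the specific dynamics f and g, the cost weights, and the center offsets are illustrative assumptions.

```python
import numpy as np

# Illustrative two-state control-affine dynamics and quadratic cost (assumptions:
# the general development only requires the Bolza problem structure).
f = lambda y: np.array([-y[0] + y[1], -0.5 * y[1]])
g = lambda y: np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
r = lambda y, u: y @ Q @ y + u @ R @ u

def sigma_grad(y, centers):
    """Jacobian of the exponential StaF basis sigma_i(y) = exp(y^T c_i) - 1."""
    return np.array([np.exp(y @ c) * c for c in centers])        # shape (L, n)

def u_hat(y, centers, W_a):
    """Policy approximation u = -1/2 R^{-1} g(y)^T grad(sigma)^T W_a."""
    return -0.5 * np.linalg.solve(R, g(y).T @ sigma_grad(y, centers).T @ W_a)

def bellman_error(y, centers, W_c, W_a):
    """delta = r(y, u_hat) + grad(V_hat)(y) (f(y) + g(y) u_hat), cf. (7.12)."""
    u = u_hat(y, centers, W_a)
    grad_V = W_c @ sigma_grad(y, centers)
    return r(y, u) + grad_V @ (f(y) + g(y) @ u)

x = np.array([0.5, -0.5])                     # current state
centers = [x + d for d in (0.1 * np.array([1, 0]),
                           0.1 * np.array([-0.5, 0.87]),
                           0.1 * np.array([-0.5, -0.87]))]
W_c = W_a = 0.4 * np.ones(3)
print(bellman_error(x, centers, W_c, W_a))
```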
To learn the ideal parameters online, the critic evaluates a form δt : R≥t0 → R of the Bellman error at each time instance t as

δt(t) ≜ δ(x(t), x(t), Ŵc(t), Ŵa(t)),   (7.13)

where Ŵa(t) and Ŵc(t) denote the estimates of the actor and the critic weights, respectively, at time t, and the notation x(t) is used to denote the state of the system in (1.9) at time t, when starting from initial time t0 and initial state x0, under the feedback controller

u(t) = û(x(t), x(t), Ŵa(t)).   (7.14)
Since (1.14) constitutes a necessary and sufficient condition for optimality, the Bell-
man error serves as an indirect measure of how close the critic parameter estimates
Ŵc are to their ideal values; hence, in the context of reinforcement learning, each
evaluation of the Bellman error is interpreted as gained experience. Since the Bell-
man error in (7.13) is evaluated along the system trajectory, the experience gained is
along the system trajectory.
Learning based on simulation of experience is achieved by extrapolating the Bell-
man error to unexplored areas of the state-space. The critic selects a set of functions {xi : Rⁿ × R≥t0 → Rⁿ}_{i=1}^N such that each xi maps the current state x(t) to a point xi(x(t), t) ∈ Br(x(t)). The critic then evaluates a form δti : R≥t0 → R of the Bellman error for each xi as

δti(t) ≜ δ(xi(x(t), t), x(t), Ŵc(t), Ŵa(t)).   (7.15)
The critic then uses the Bellman errors from (7.13) and (7.15) to improve the estimate
Ŵc (t) using the recursive least-squares-based update law
Ŵ̇c(t) = −kc1 Γ(t) (ω(t)/ρ(t)) δt(t) − (kc2/N) Γ(t) Σ_{i=1}^N (ωi(t)/ρi(t)) δti(t),   (7.16)

where ρ(t) ≜ 1 + γ1 ωᵀ(t)ω(t), ρi(t) ≜ 1 + γ1 ωiᵀ(t)ωi(t),

ω(t) ≜ ∇σ(x(t), c(x(t))) (f(x(t)) + g(x(t)) û(x(t), x(t), Ŵa(t))),

ωi(t) ≜ ∇σ(xi(x(t), t), c(x(t))) (f(xi(x(t), t)) + g(xi(x(t), t)) û(xi(x(t), t), x(t), Ŵa(t))),

and kc1, kc2, γ1 ∈ R>0 are constant learning gains. In (7.16), Γ(t) denotes the least-squares learning gain matrix updated according to

Γ̇(t) = βΓ(t) − Γ(t) (kc1 (ω(t)ωᵀ(t)/ρ²(t)) + (kc2/N) Σ_{i=1}^N (ωi(t)ωiᵀ(t)/ρi²(t))) Γ(t),   (7.17)

where β ∈ R>0 is a constant forgetting factor.
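A minimal sketch of one Euler-discretized step of the critic update (7.16) and the gain update (7.17) with a single extrapolated regressor (N = 1) is given below. The learning gains follow the regulation simulation in Sect. 7.4.5, while the regressors, Bellman errors, and step size are placeholder assumptions; in the actual method, ω, ωi, δt, and δti are generated from the StaF basis, the model, and the current actor weights.

```python
import numpy as np

def critic_step(W_c, Gamma, omega, delta, omega_i, delta_i,
                kc1=0.001, kc2=0.25, gamma1=0.05, beta=0.003, dt=0.001):
    """Euler step of the critic update (7.16) and the least-squares gain
    update (7.17), using a single extrapolated regressor (N = 1)."""
    rho = 1.0 + gamma1 * omega @ omega
    rho_i = 1.0 + gamma1 * omega_i @ omega_i
    W_c_dot = (-kc1 * Gamma @ omega / rho * delta
               - kc2 * Gamma @ omega_i / rho_i * delta_i)
    S = (kc1 * np.outer(omega, omega) / rho ** 2
         + kc2 * np.outer(omega_i, omega_i) / rho_i ** 2)
    Gamma_dot = beta * Gamma - Gamma @ S @ Gamma
    return W_c + dt * W_c_dot, Gamma + dt * Gamma_dot

# Placeholder regressors and Bellman errors for one step (illustrative values).
W_c, Gamma = 0.4 * np.ones(3), 500.0 * np.eye(3)
omega, omega_i = np.array([0.2, -0.1, 0.3]), np.array([0.1, 0.4, -0.2])
delta, delta_i = 0.8, -0.3
W_c, Gamma = critic_step(W_c, Gamma, omega, delta, omega_i, delta_i)
print(W_c, np.linalg.eigvalsh(Gamma))
```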
7.4.3 Analysis
tional benefit. The computational cost grows linearly with the number of extrapola-
tion points (i.e., N ). If the points are selected using grid-based methods employed
in results such as [11], the number N increases geometrically with respect to the
state dimension, n. On the other hand, if the extrapolation points are selected to be
time varying, then even a single point is sufficient, provided the time-trajectory of
the point contains enough information to satisfy the subsequent Assumption 7.6.
In the following, Assumption 7.6 formalizes the conditions under which the tra-
jectories of the closed-loop system can be shown to be ultimately bounded, and
Lemma 7.7 facilitates the analysis of the closed-loop system when time-varying ex-
trapolation trajectories are utilized.
For notational brevity, time-dependence of all the signals is suppressed hereafter.
Let χ denote the projection of B_ζ onto Rⁿ. To facilitate the subsequent stability analysis, the Bellman errors in (7.13) and (7.15) are expressed in terms of the weight estimation errors W̃c ≜ W − Ŵc and W̃a ≜ W − Ŵa as

δt = −ωᵀW̃c + (1/4) W̃aᵀ Gσ W̃a + Δ(x),

δti = −ωiᵀW̃c + (1/4) W̃aᵀ Gσi W̃a + Δi(x),   (7.19)

where the functions Δ, Δi : Rⁿ → R are uniformly bounded over χ such that the bounds ‖Δ‖_{χ,∞} and ‖Δi‖_{χ,∞} decrease with decreasing ‖∇ε‖_{χ,∞} and ‖∇W‖_{χ,∞}. Let a candidate
Lyapunov function V_L : Rⁿ⁺²ᴸ × R≥t0 → R be defined as

V_L(Z, t) ≜ V*(x) + (1/2) W̃cᵀ Γ⁻¹(t) W̃c + (1/2) W̃aᵀ W̃a,

where V* is the optimal value function and Z = [xᵀ, W̃cᵀ, W̃aᵀ]ᵀ.
To facilitate learning, the system states x and the selected functions xi are assumed
to satisfy the following.
Assumption 7.6 There exist constants T ∈ R>0 and c1, c2, c3 ∈ R≥0, such that

c1 I_L ≤ ∫_t^{t+T} (ω(τ)ωᵀ(τ)/ρ²(τ)) dτ, ∀t ∈ R≥t0,

c2 I_L ≤ inf_{t∈R≥t0} (1/N) Σ_{i=1}^N (ωi(t)ωiᵀ(t)/ρi²(t)),

c3 I_L ≤ ∫_t^{t+T} (1/N) Σ_{i=1}^N (ωi(τ)ωiᵀ(τ)/ρi²(τ)) dτ, ∀t ∈ R≥t0,
Lemma 7.7 Provided Assumption 7.6 holds, the least-squares gain matrix Γ(t) satisfies Γ̲ I_L ≤ Γ(t) ≤ Γ̄ I_L, ∀t ∈ R≥t0, where

Γ̄ = 1 / (min{kc1 c1 + kc2 max{c2 T, c3}, λmin{Γ0⁻¹}} e^{−βT}),

Γ̲ = 1 / (λmax{Γ0⁻¹} + (kc1 + kc2)/(βγ1)).

Furthermore, Γ̲ > 0.
Proof The proof closely follows the proof of [23, Corollary 4.3.2]. The update law in (7.17) implies that

d/dt Γ⁻¹(t) = −βΓ⁻¹(t) + kc1 (ω(t)ωᵀ(t)/ρ²(t)) + (kc2/N) Σ_{i=1}^N (ωi(t)ωiᵀ(t)/ρi²(t)).
Hence,

Γ⁻¹(t) = kc1 ∫_0^t e^{−β(t−τ)} (ω(τ)ωᵀ(τ)/ρ²(τ)) dτ + (kc2/N) Σ_{i=1}^N ∫_0^t e^{−β(t−τ)} (ωi(τ)ωiᵀ(τ)/ρi²(τ)) dτ + e^{−βt} Γ0⁻¹.
Since the integrands are positive, it follows that if t ≥ T, then Γ⁻¹ can be bounded as

Γ⁻¹(t) ≥ kc1 ∫_{t−T}^t e^{−β(t−τ)} (ω(τ)ωᵀ(τ)/ρ²(τ)) dτ + (kc2/N) Σ_{i=1}^N ∫_{t−T}^t e^{−β(t−τ)} (ωi(τ)ωiᵀ(τ)/ρi²(τ)) dτ.
Therefore,

Γ⁻¹(t) ≥ kc1 e^{−βT} ∫_{t−T}^t (ω(τ)ωᵀ(τ)/ρ²(τ)) dτ + (kc2/N) e^{−βT} Σ_{i=1}^N ∫_{t−T}^t (ωi(τ)ωiᵀ(τ)/ρi²(τ)) dτ.
By Assumption 7.6,

(1/N) Σ_{i=1}^N ∫_{t−T}^t (ωi(τ)ωiᵀ(τ)/ρi²(τ)) dτ ≥ max{c2 T, c3} I_L,   ∫_{t−T}^t (ω(τ)ωᵀ(τ)/ρ²(τ)) dτ ≥ c1 I_L.
Provided Assumption 7.6 holds, the lower bound in (7.21) is strictly positive. Furthermore, using the facts that ω(t)ωᵀ(t)/ρ²(t) ≤ (1/γ1) I_L and ωi(t)ωiᵀ(t)/ρi²(t) ≤ (1/γ1) I_L, ∀t ∈ R≥t0,

Γ⁻¹(t) ≤ ∫_0^t e^{−β(t−τ)} (kc1 (1/γ1) + (kc2/N) Σ_{i=1}^N (1/γ1)) I_L dτ + e^{−βt} Γ0⁻¹
       ≤ (λmax{Γ0⁻¹} + (kc1 + kc2)/(βγ1)) I_L.
Since the inverses of the lower and upper bounds on Γ⁻¹ are the upper and lower bounds on Γ, respectively, the proof is complete.
Since the optimal value function is positive definite, (7.20) and [24, Lemma 4.3] can be used to show that the candidate Lyapunov function satisfies the following bounds:

v̲l(‖Z‖) ≤ V_L(Z, t) ≤ v̄l(‖Z‖),   (7.22)
c ≜ β/(2Γ̄kc2) + c2/2,   (7.23)

and vl is a class K function satisfying

vl(‖Z‖) ≤ Q(x)/2 + (kc2 c/6)‖W̃c‖² + ((ka1 + ka2)/8)‖W̃a‖².
The sufficient conditions for the subsequent Lyapunov-based stability analysis are given by

kc2 c/3 ≥ (‖G_{Wσ}‖/(2Γ̄) + (kc1 + kc2)‖WᵀGσ‖/(4√ν) + ka1)² / (ka1 + ka2),   (7.24)

(ka1 + ka2)/4 ≥ ‖G_{Wσ}‖/2 + (kc1 + kc2)‖W‖‖Gσ‖/(4√ν),   (7.25)

vl⁻¹(ι) < v̄l⁻¹(v̲l(ζ)).   (7.26)
The sufficient condition in (7.24) can be satisfied provided the points for Bellman error extrapolation are selected such that the minimum eigenvalue c, introduced in (7.23), is large enough. The sufficient condition in (7.25) can be satisfied without affecting (7.24) by increasing the gain ka2. The sufficient condition in (7.26) can be satisfied provided c, ka2, and the state penalty Q(x) are selected to be sufficiently large and the StaF kernels for value function approximation are selected such that ‖∇W‖, ‖ε‖, and ‖∇ε‖ are sufficiently small.
Similar to neural network-based approximation methods such as [25–32], the
function approximation error, ε, is unknown, and in general, infeasible to compute
for a given function, since the ideal neural network weights are unknown. Since a
bound on ε is unavailable, the gain conditions in (7.24)–(7.26) cannot be formally
verified. However, they can be met using trial and error by increasing the gain ka2 ,
the number of StaF basis functions, and c by selecting more points to extrapolate the
Bellman error.
To improve computational efficiency, the size of the domain around the current state where the StaF kernels provide a good approximation of the value function is desired to be small. A smaller approximation domain results in almost identical extrapolated points, which, in turn, results in a smaller c. Hence, the approximation domain cannot be selected to be arbitrarily small and needs to be large enough to meet the sufficient conditions in (7.24)–(7.26).
Theorem 7.8 Provided Assumption 7.6 holds and the sufficient gain conditions in
(7.24)–(7.26) are satisfied, the controller in (7.14) and the update laws in (7.16)–
(7.18) ensure that the state x and the weight estimation errors W̃c and W̃a are
ultimately bounded.
Using (7.16)–(7.19) and (7.27), the time derivative of the Lyapunov function is
expressed as
Provided the sufficient conditions in (7.24)–(7.26) hold, the time derivative of the
candidate Lyapunov function can be bounded as
Using (7.22), (7.26), and (7.28), [24, Theorem 4.18] can be invoked to conclude that Z is ultimately bounded, in the sense that

lim sup_{t→∞} ‖Z(t)‖ ≤ v̲l⁻¹(v̄l(vl⁻¹(ι))).
If the drift dynamics are uncertain, a parametric approximation of the dynamics can be employed for Bellman error extrapolation. On any compact set C ⊂ Rⁿ, the function f can be represented using a neural network as f(x) = θᵀσf(Yᵀx1(x)) + εθ(x), where x1(x) ≜ [1, xᵀ]ᵀ ∈ Rⁿ⁺¹, θ ∈ R^{(p+1)×n} and Y ∈ R^{(n+1)×p} denote the constant unknown output-layer and hidden-layer neural network weights, σf : Rᵖ → Rᵖ⁺¹ denotes a bounded neural network basis function, εθ : Rⁿ → Rⁿ denotes the function reconstruction error, and p ∈ N denotes the number of neural network neurons. Using the universal function approximation property of single-layer neural networks, given a constant matrix Y such that the rows of σf(Yᵀx1) form a proper basis (cf. [33]), there exist constant ideal weights θ and known constants θ̄, ε̄θ, and ε̄′θ ∈ R such that ‖θ‖ ≤ θ̄ < ∞, sup_{x∈C} ‖εθ(x)‖ ≤ ε̄θ, and sup_{x∈C} ‖∇x εθ(x)‖ ≤ ε̄′θ. Using
an estimate θ̂ ∈ R^{(p+1)×n} of the weight matrix θ, the function f can be approximated by the function f̂ : Rⁿ × R^{(p+1)×n} → Rⁿ defined as f̂(x, θ̂) ≜ θ̂ᵀσθ(x), where σθ : Rⁿ → Rᵖ⁺¹ is defined as σθ(x) ≜ σf(Yᵀ[1, xᵀ]ᵀ). Using f̂, the Bellman error in (7.12) can be approximated by δ̂ : Rⁿ × Rⁿ × R^L × R^L × R^{(p+1)×n} → R as

δ̂(y, x, Ŵc, Ŵa, θ̂) ≜ r(y, û(y, x, Ŵa)) + ∇V̂(y, x, Ŵc)(f̂(y, θ̂) + g(y)û(y, x, Ŵa)).   (7.29)
Using δ̂, the instantaneous Bellman errors in (7.13) and (7.15) are redefined as

δt(t) ≜ δ̂(x(t), x(t), Ŵc(t), Ŵa(t), θ̂(t)),   (7.30)

and

δti(t) ≜ δ̂(xi(x(t), t), x(t), Ŵc(t), Ŵa(t), θ̂(t)),   (7.31)
Proof The proof is a trivial combination of the proofs of Theorems 7.8 and 4.3.
7.4.5 Simulation
∫_0^∞ (xᵀ(τ)x(τ) + u²(τ)) dτ.   (7.35)
The system in (7.34) and the cost in (7.35) are selected because the corresponding optimal control problem has a known analytical solution. The optimal value function is V*(x) = (1/2)x1² + x2², and the optimal control policy is u*(x) = −(cos(2x1) + 2)x2.
To apply the developed technique to this problem, the value function is approximated using three exponential StaF kernels (i.e., σ(x, C) = [σ1(x, c1), σ2(x, c2), σ3(x, c3)]ᵀ). The kernels are selected to be σi(x, ci) = e^{xᵀci} − 1, i = 1, ..., 3. The centers ci are selected to be on the vertices of a shrinking equilateral triangle around the current state (i.e., ci = x + di(x), i = 1, ..., 3), where d1(x) = 0.7ν(x)·[0, 1]ᵀ, d2(x) = 0.7ν(x)·[0.87, −0.5]ᵀ, and d3(x) = 0.7ν(x)·[−0.87, −0.5]ᵀ, and ν(x) ≜ (xᵀx + 0.01)/(1 + γ2 xᵀx) denotes the shrinking function, where γ2 ∈ R>0 is a constant normalization gain. To ensure sufficient excitation, a single point for Bellman error extrapolation is selected at random from a uniform distribution over a 2.1ν(x(t)) × 2.1ν(x(t)) square centered at the current state x(t), so that the function xi is of the form xi(x, t) = x + ai(t) for some ai(t) ∈ R². For a general problem with an n-dimensional state, exponential kernels can be utilized with the centers placed at the vertices of an n-dimensional simplex with the current state as the centroid. The extrapolation point can be sampled at each iteration from a uniform distribution over an n-dimensional hypercube centered at the current state.
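The following sketch reproduces the geometric part of this construction: the shrinking function ν(x), the centers on the vertices of the shrinking equilateral triangle, the exponential StaF kernels, and the uniformly sampled extrapolation point. The value γ2 = 1 and the initial state match the values listed below; the random seed is an arbitrary choice.

```python
import numpy as np

gamma2 = 1.0                                   # normalization gain (gamma_2 = 1)

def nu(x):
    """Shrinking function nu(x) = (x^T x + 0.01) / (1 + gamma2 x^T x)."""
    return (x @ x + 0.01) / (1.0 + gamma2 * (x @ x))

def staf_centers(x):
    """Centers on the vertices of a shrinking equilateral triangle about x."""
    dirs = [np.array([0.0, 1.0]), np.array([0.87, -0.5]), np.array([-0.87, -0.5])]
    return [x + 0.7 * nu(x) * d for d in dirs]

def staf_basis(y, centers):
    """Exponential StaF kernels sigma_i(y, c_i) = exp(y^T c_i) - 1."""
    return np.array([np.exp(y @ c) - 1.0 for c in centers])

def extrapolation_point(x, rng):
    """Single Bellman-error extrapolation point sampled uniformly from a
    2.1 nu(x) by 2.1 nu(x) square centered at the current state."""
    return x + rng.uniform(-1.05 * nu(x), 1.05 * nu(x), size=2)

rng = np.random.default_rng(0)
x = np.array([-1.0, 1.0])                      # initial state x(0)
print(staf_basis(x, staf_centers(x)), extrapolation_point(x, rng))
```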
The system is initialized at t0 = 0, with the initial conditions and learning gains selected as x(0) = [−1, 1]ᵀ, Ŵc(0) = 0.4 × 1_{3×1}, Γ(0) = 500 I3, Ŵa(0) = 0.7 Ŵc(0), kc1 = 0.001, kc2 = 0.25, ka1 = 1.2, ka2 = 0.01, β = 0.003, γ1 = 0.05, and γ2 = 1.
Figure 7.5 shows that the developed StaF-based controller drives the system states
to the origin while maintaining system stability. Figure 7.6 shows the implemented
control signal compared with the optimal control signal. It is clear that the imple-
mented control converges to the optimal controller. Figures 7.7 and 7.8 show that the weight estimates for the StaF-based value function and policy approximation remain bounded and converge as the state converges to the origin. Since the ideal values of the weights are unknown, the weights cannot be compared directly with their ideal values. However, since the optimal solution is known, the value function estimate corresponding to the weights in Fig. 7.7 can be compared to the optimal value function at each time t. Figure 7.9 shows that the error between the optimal and the estimated value functions rapidly decays to zero.
Optimal Tracking Problem with Parametric Uncertainties in the Drift Dynamics
This simulation demonstrates the effectiveness of the extension developed in
Sect. 7.4.4. The drift dynamics in the two-state nonlinear dynamical system in (7.34)
are assumed to be linearly parameterized as
f(x) = θᵀσθ(x) = [θ1 θ2 θ3; θ4 θ5 θ6] [x1, x2, x2(cos(2x1) + 2)]ᵀ,
where θ ∈ R³ˣ² is the matrix of unknown parameters, and σθ is the known vector of basis functions. The ideal values of the unknown parameters are θ1 = −1, θ2 = 1, θ3 = 0, θ4 = −0.5, θ5 = 0, and θ6 = −0.5. Let θ̂ denote an estimate of the unknown matrix θ. The control objective is to drive the estimate θ̂ to the ideal matrix θ, and to drive the state x to follow a desired trajectory xd. The desired trajectory is selected to be the solution to the initial value problem

ẋd(t) = [−1 1; −2 1] xd(t),   xd(0) = [0, 1]ᵀ,   (7.36)
and the cost functional is selected to be ∫_0^∞ (eᵀ(t) diag(10, 10) e(t) + (μ(t))²) dt, where e(t) ≜ x(t) − xd(t), μ is an auxiliary controller designed using the developed method, and the tracking controller is designed as

u(t) = g⁺(xd(t)) ([−1 1; −2 1] xd(t) − θ̂ᵀσθ(xd(t))) + μ(t),
Fig. 7.11 Control signal generated using the proposed method for the trajectory tracking problem (reproduced with permission from [21], © 2016 Elsevier)
Fig. 7.12 Actor weight trajectories generated using the proposed method for the trajectory tracking problem. The weights do not converge to a steady-state value because the ideal weights are not constant; they are functions of the time-varying system state. Since an analytical solution to the optimal tracking problem is not available, the weights cannot be compared against their ideal values (reproduced with permission from [21], © 2016 Elsevier)
Figures 7.10 and 7.11 demonstrate that the controller remains bounded and the
tracking error is regulated to the origin. The neural network weights are functions of
the system state ζ . Since ζ converges to a periodic orbit, the neural network weights
also converge to a periodic orbit (within the bounds of the excitation introduced
by the Bellman error extrapolation signal), as demonstrated in Figs. 7.12 and 7.13.
Figure 7.14 demonstrates that the unknown parameters in the drift dynamics, repre-
sented by solid lines, converge to their ideal values, represented by dashed lines.
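For reference, the sketch below encodes the linearly parameterized drift, the basis σθ, the ideal parameter values, and the desired-trajectory dynamics (7.36) used in the tracking simulation above; the forward-Euler integration of the desired trajectory and the evaluation point are illustrative choices.

```python
import numpy as np

# Ideal parameters and basis for f(x) = theta^T sigma_theta(x).
theta = np.array([[-1.0, -0.5],
                  [ 1.0,  0.0],
                  [ 0.0, -0.5]])                      # theta in R^{3x2}
sigma_theta = lambda x: np.array([x[0], x[1], x[1] * (np.cos(2 * x[0]) + 2.0)])
f = lambda x: theta.T @ sigma_theta(x)

# Desired trajectory dynamics (7.36): xd_dot = A xd, xd(0) = [0, 1]^T.
A = np.array([[-1.0, 1.0],
              [-2.0, 1.0]])
xd, dt = np.array([0.0, 1.0]), 0.001
for _ in range(int(5.0 / dt)):                        # simple Euler integration
    xd = xd + dt * (A @ xd)
print(f(np.array([0.5, -0.5])), xd)
```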
The developed technique is compared with the model-based reinforcement learn-
ing method developed in [11] for regulation and [12] for tracking, respectively.
The simulations are performed in MATLAB Simulink at 1000 Hz on the same
Fig. 7.13 Critic weight trajectories generated using the proposed method for the trajectory tracking problem. The weights do not converge to a steady-state value because the ideal weights are not constant; they are functions of the time-varying system state. Since an analytical solution to the optimal tracking problem is not available, the weights cannot be compared against their ideal values (reproduced with permission from [21], © 2016 Elsevier)
machine. The simulations run for 100 s of simulated time. Since the objective is
to compare computational efficiency of the model-based reinforcement learning
method, exact knowledge of the system model is used. Table 7.1 shows that the
7.5 Background and Further Reading
Reinforcement learning has become a popular tool for determining online solutions of
optimal control problems for systems with finite state and action-spaces [30, 36–40].
Due to various technical and practical difficulties, implementation of reinforcement
learning-based closed-loop controllers on hardware platforms remains a challenge.
Approximate dynamic programming-based controllers are void of pre-designed sta-
bilizing feedback and are completely defined by the estimated parameters. Hence,
the error between the optimal and the estimated value function is required to decay to
a sufficiently small bound sufficiently fast to establish closed-loop stability. The size
of the error bound is determined by the selected basis functions, and the convergence
rate is determined by richness of the data used for learning.
Fast approximation of the value function over a large neighborhood requires suf-
ficiently rich data to be available for learning. In traditional approximate dynamic
programming methods such as [31, 41, 42], richness of data manifests itself as the
amount of excitation in the system. In experience replay-based techniques such as
[34, 43–45], richness of data is quantified by eigenvalues of a recorded history stack.
In model-based reinforcement learning techniques such as [11–13], richness of data
corresponds to the eigenvalues of a learning matrix. As the dimension of the system and the number of basis functions increase, richer data is required to achieve learning. In traditional approximate dynamic programming methods, the demand
for rich data is met by adding excitation signals to the controller, thereby causing
undesirable oscillations. In experience replay-based approximate dynamic program-
ming methods and in model-based reinforcement learning, the demand for richer
data causes exponential growth in the required data storage. Hence, experimental
implementations of traditional approximate dynamic programming techniques such
as [25–32, 41, 42, 46, 47] and data-driven approximate dynamic programming tech-
niques such as [11–13, 45, 48, 49] in high dimensional systems are scarcely found
in the literature.
The control design in (7.11) exploits the fact that, given a basis σ for approximation of the value function, the basis (1/2)R⁻¹gᵀ∇σᵀ approximates the optimal controller, provided the dynamics are control-affine. As a part of future research, possible extensions to nonaffine systems could potentially be explored by approximating the controller using an independent basis (cf. [50–57]).
References
16. Pinkus A (2004) Strictly positive definite functions on a real inner product space. Adv Comput
Math 20:263–271
17. Rosenfeld JA, Kamalapurkar R, Dixon WE (2015) State following (StaF) kernel functions for
function approximation part I: theory and motivation. In: Proceedings of the American control
conference, pp 1217–1222
18. Beylkin G, Monzon L (2005) On approximation of functions by exponential sums. Appl Com-
put Harmon Anal 19(1):17–48
19. Bertsekas DP (1999) Nonlinear programming. Athena Scientific, Belmont
20. Pedersen GK (1989) Analysis now, vol 118. Graduate texts in mathematics, Springer, New
York
21. Kamalapurkar R, Rosenfeld J, Dixon WE (2016) Efficient model-based reinforcement learning
for approximate online optimal control. Automatica 74:247–258
22. Lorentz GG (1986) Bernstein polynomials, 2nd edn. Chelsea Publishing Co., New York
23. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall, Upper Saddle River
24. Khalil HK (2002) Nonlinear systems, 3rd edn. Prentice Hall, Upper Saddle River
25. Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput
12(1):219–245
26. Padhi R, Unnikrishnan N, Wang X, Balakrishnan S (2006) A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw 19(10):1648–1660
27. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B
Cybern 38:943–949
28. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
29. Dierks T, Thumati B, Jagannathan S (2009) Optimal control of unknown affine nonlinear
discrete-time systems using offline-trained neural networks with proof of convergence. Neural
Netw 22(5–6):851–860
30. Mehta P, Meyn S (2009) Q-learning and Pontryagin's minimum principle. In: Proceedings of
the IEEE conference on decision and control, pp 3598–3605
31. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
32. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
33. Sadegh N (1993) A perceptron network for functional identification and control of nonlinear
systems. IEEE Trans Neural Netw 4(6):982–988
34. Chowdhary G, Yucelen T, Mühlegg M, Johnson EN (2013) Concurrent learning adaptive control
of linear systems with exponentially convergent bounds. Int J Adapt Control Signal Process
27(4):280–301
35. Savitzky A, Golay MJE (1964) Smoothing and differentiation of data by simplified least squares
procedures. Anal Chem 36(8):1627–1639
36. Bertsekas D, Tsitsiklis J (1996) Neuro-dynamic programming. Athena Scientific, Belmont
37. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
38. Konda V, Tsitsiklis J (2004) On actor-critic algorithms. SIAM J Control Optim 42(4):1143–
1166
39. Bertsekas D (2007) Dynamic programming and optimal control, vol 2, 3rd edn. Athena Scien-
tific, Belmont
40. Szepesvári C (2010) Algorithms for reinforcement learning. Synthesis lectures on artificial
intelligence and machine learning. Morgan & Claypool Publishers, San Rafael
41. Vamvoudakis KG, Lewis FL (2009) Online synchronous policy iteration method for optimal
control. In: Yu W (ed) Recent advances in intelligent control systems. Springer, Berlin, pp
357–374
References 263
42. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
43. Chowdhary G (2010) Concurrent learning for convergence in adaptive control without persis-
tency of excitation. Ph.D. thesis, Georgia Institute of Technology
44. Chowdhary G, Johnson E (2011) A singular value maximizing data recording algorithm for
concurrent learning. In: Proceedings of the American control conference, pp 3547–3552
45. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
46. Zhang H, Cui L, Luo Y (2013) Near-optimal control for nonzero-sum differential games of
continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern 43(1):206–
216
47. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control algorithms
and stability. Communications and control engineering, Springer, London
48. Luo B, Wu HN, Huang T, Liu D (2014) Data-based approximate policy iteration for affine
nonlinear continuous-time optimal control design. Automatica
49. Yang X, Liu D, Wei Q (2014) Online approximate optimal control for affine non-linear systems
with unknown internal dynamics using adaptive dynamic programming. IET Control Theory
Appl 8(16):1676–1688
50. Ge SS, Zhang J (2003) Neural-network control of nonaffine nonlinear system with zero dy-
namics by state and output feedback. IEEE Trans Neural Netw 14(4):900–918
51. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
52. Zhang X, Zhang H, Sun Q, Luo Y (2012) Adaptive dynamic programming-based optimal
control of unknown nonaffine nonlinear discrete-time systems with proof of convergence.
Neurocomputing 91:48–55
53. Liu D, Huang Y, Wang D, Wei Q (2013) Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int J Control 86(9):1554–
1566
54. Bian T, Jiang Y, Jiang ZP (2014) Adaptive dynamic programming and optimal control of
nonlinear nonaffine systems. Automatica 50(10):2624–2632
55. Yang X, Liu D, Wei Q, Wang D (2015) Direct adaptive control for a class of discrete-time
unknown nonaffine nonlinear systems using neural networks. Int J Robust Nonlinear Control
25(12):1844–1861
56. Kiumarsi B, Kang W, Lewis FL (2016) H-∞ control of nonaffine aerial systems using off-policy
reinforcement learning. Unmanned Syst 4(1):1–10
57. Song R, Wei Q, Xiao W (2016) Off-policy neuro-optimal control for unknown complex-valued
nonlinear systems based on policy iteration. Neural Comput Appl 46(1):85–95
Appendix A
Supplementary Lemmas and Definitions
$$\int_0^t L(\tau)\,d\tau = \int_0^t \Big[ r^T\big(N_{B1}(\tau) - \beta_1\,\mathrm{sgn}(\tilde{x})\big) + \dot{\tilde{x}}(\tau)^T N_{B2}(\tau) - \beta_2\,\rho_2(\|z\|)\,\|z\|\,\|\tilde{x}\| \Big]\, d\tau$$
$$= \tilde{x}^T N_B - \tilde{x}^T(0)\,N_B(0) - \int_0^t \tilde{x}^T \dot{N}_B\, d\tau + \beta_1 \sum_{i=1}^{n} |\tilde{x}_i(0)| - \beta_1 \sum_{i=1}^{n} |\tilde{x}_i(t)| + \int_0^t \alpha\,\tilde{x}^T\big(N_{B1} - \beta_1\,\mathrm{sgn}(\tilde{x})\big)\, d\tau - \int_0^t \beta_2\,\rho_2(\|z\|)\,\|z\|\,\|\tilde{x}\|\, d\tau,$$
where (3.9) is used. Using the fact that $\|\tilde{x}\| \leq \sum_{i=1}^{n} |\tilde{x}_i|$, and using the bounds in (3.14), yields
$$\int_0^t L(\tau)\,d\tau \leq \beta_1 \sum_{i=1}^{n} |\tilde{x}_i(0)| - \tilde{x}^T(0)\,N_B(0) - (\beta_1 - \zeta_1 - \zeta_2)\,\|\tilde{x}\| - \int_0^t \alpha\Big(\beta_1 - \zeta_1 - \frac{\zeta_3}{\alpha}\Big)\|\tilde{x}\|\, d\tau - \int_0^t (\beta_2 - \zeta_4)\,\rho_2(\|z\|)\,\|z\|\,\|\tilde{x}\|\, d\tau.$$
If the sufficient conditions in (3.19) are satisfied, then the following inequality holds:
$$\int_0^t L(\tau)\,d\tau \leq \beta_1 \sum_{i=1}^{n} |\tilde{x}_i(0)| - \tilde{x}^T(0)\,N_B(0) = P(0). \tag{A.1}$$
Proof Let $y(t)$ for $t \in [t_0, \infty)$ denote a Filippov solution to the differential equation in (3.22) that satisfies $y(t_0) \in \mathcal{S}$. Using Filippov's theory of differential inclusions [1, 2], the existence of solutions can be established for $\dot{y} \in K[h](y,t)$, where $K[h](y,t) \triangleq \bigcap_{\delta > 0} \bigcap_{\mu S_m = 0} \overline{\mathrm{co}}\, h\big(B_\delta(y) \setminus S_m, t\big)$, where $\bigcap_{\mu S_m = 0}$ denotes the intersection over all sets $S_m$ of Lebesgue measure zero and $\overline{\mathrm{co}}$ denotes convex closure [3, 4]. The time derivative of (3.20) along the Filippov trajectory $y(\cdot)$ exists almost everywhere (a.e.), and $\dot{V}_I \overset{a.e.}{\in} \dot{\tilde{V}}_I$, where
$$\dot{\tilde{V}}_I = \bigcap_{\xi \in \partial V_I(y)} \xi^T K\Big[\dot{e}_f^T \;\; \dot{\tilde{x}}^T \;\; \tfrac{1}{2} P^{-\frac{1}{2}} \dot{P} \;\; \tfrac{1}{2} Q^{-\frac{1}{2}} \dot{Q}\Big]^T. \tag{A.2}$$
Using the calculus for $K[\cdot]$ from [4] (Theorem 1, Properties 2, 5, 7), and substituting the dynamics from (3.9) and (3.17), yields the expression in (A.3), where (3.8), (3.13), and (3.15) are used and $K[\mathrm{sgn}](\tilde{x}) = \mathrm{SGN}(\tilde{x})$ [4], such that $\mathrm{SGN}(\tilde{x}_i) = \{1\}$ if $\tilde{x}_i > 0$, $[-1,1]$ if $\tilde{x}_i = 0$, and $\{-1\}$ if $\tilde{x}_i < 0$ (the subscript $i$ denotes the $i$th element).
The set in (A.3) reduces to the scalar inequality in (A.4) since the right-hand side is continuous almost everywhere (i.e., the right-hand side is continuous except for the Lebesgue-negligible set of times when $e_f^T K\big[\mathrm{sgn}\big](\tilde{x}) - e_f^T K\,\mathrm{sgn}(\tilde{x}) \neq \{0\}$). The set of times $\big\{ t \in [0,\infty) \mid e_f(t)^T K\big[\mathrm{sgn}\big](\tilde{x}(t)) - e_f(t)^T K\,\mathrm{sgn}(\tilde{x}(t)) \neq \{0\} \big\} \subset [0,\infty)$ is equivalent to the set of times $\{ t \mid \tilde{x}(t) = 0 \wedge e_f(t) \neq 0 \}$. From (3.16), this set can also be represented by $\{ t \mid \tilde{x}(t) = 0 \wedge \dot{\tilde{x}}(t) \neq 0 \}$. Provided $\tilde{x}$ is continuously differentiable, it can be shown that the set of time instances $\{ t \mid \tilde{x}(t) = 0 \wedge \dot{\tilde{x}}(t) \neq 0 \}$ is isolated, and thus, measure zero. This implies that the set is measure zero [6].
Substituting for $k \triangleq k_1 + k_2$ and $\gamma \triangleq \gamma_1 + \gamma_2$, and completing the squares, the expression in (A.4) can be upper bounded as
$$\dot{\tilde{V}}_I \overset{a.e.}{\leq} -(\alpha\gamma_1 - \zeta_5)\,\|\tilde{x}\|^2 - (k_1 - \zeta_6)\,\|e_f\|^2 + \frac{\rho_1(\|z\|)^2}{4 k_2}\,\|z\|^2 + \frac{\beta_2^2\,\rho_2(\|z\|)^2}{4\alpha\gamma_2}\,\|z\|^2. \tag{A.5}$$
Provided the sufficient conditions in (3.23) are satisfied, the expression in (A.5) can be rewritten as
$$\dot{\tilde{V}}_I \overset{a.e.}{\leq} -\lambda\,\|z\|^2 + \frac{\rho(\|z\|)^2}{4\eta}\,\|z\|^2 \overset{a.e.}{\leq} -U(y), \quad \forall y \in \mathcal{D}, \tag{A.6}$$
where $\lambda \triangleq \min\{\alpha\gamma_1 - \zeta_5,\; k_1 - \zeta_6\}$, $\eta \triangleq \min\big\{k_2,\; \tfrac{\alpha\gamma_2}{\beta_2^2}\big\}$, $\rho(\|z\|)^2 \triangleq \rho_1(\|z\|)^2 + \rho_2(\|z\|)^2$ is a positive, strictly increasing function, and $U(y) = c\,\|z\|^2$, for some positive constant $c$, is a continuous positive semi-definite function defined on the domain $\mathcal{D}$. The size of the domain $\mathcal{D}$ can be increased by increasing the gains $k$ and $\gamma$. Using the inequalities in (3.21) and (A.6), [7, Corollary 1] can be invoked to show that $y(\cdot) \in \mathcal{L}_\infty$, provided $y(0) \in \mathcal{S}$. Furthermore, $\|\tilde{x}(t)\|, \|\dot{\tilde{x}}(t)\|, \|e_f(t)\| \to 0$ as $t \to \infty$, provided $y(0) \in \mathcal{S}$.
Since $y(\cdot) \in \mathcal{L}_\infty$, $\tilde{x}(\cdot), e_f(\cdot) \in \mathcal{L}_\infty$. Using (3.6), standard linear analysis can be used to show that $\dot{\tilde{x}}(\cdot) \in \mathcal{L}_\infty$, and since $\dot{x}(\cdot) \in \mathcal{L}_\infty$, $\dot{\hat{x}}(\cdot) \in \mathcal{L}_\infty$. Since $\hat{W}_f(\cdot) \in \mathcal{L}_\infty$ from the use of projection in (3.8), $t \mapsto \sigma_f\big(\hat{V}_f^T(t)\,\hat{x}(t)\big) \in \mathcal{L}_\infty$ from Property 2.3, $u(\cdot) \in \mathcal{L}_\infty$ from Assumption 3.3, and $\mu(\cdot) \in \mathcal{L}_\infty$ from (3.3). Using the above bounds, it can be shown from (3.9) that $\dot{e}_f(\cdot) \in \mathcal{L}_\infty$.
Since the gains depend on the initial conditions, the compact sets used for function approximation, and the Lipschitz bounds, an iterative algorithm is developed to select the gains. In Algorithm A.1, the notation $\{\cdot\}_i$ for any parameter denotes the value of that parameter computed in the $i$th iteration. Algorithm A.1 ensures satisfaction of the sufficient condition in (3.75).
Proof Since $t \mapsto (x, t)$ is uniformly bounded for all $x \in \mathcal{D}$, $\sup_{t \in \mathbb{R}_{\geq 0}}\{(x,t)\}$ exists and is unique for all $x \in \mathcal{D}$. Let the function $\alpha: \mathcal{D} \to \mathbb{R}_{\geq 0}$ be defined as in (A.7). Then
$$d_{\mathcal{D}\times\mathbb{R}_{\geq 0}}\big((x,t),(y,t)\big) < \varsigma(x) \implies d_{\mathbb{R}_{\geq 0}}\big((x,t),(y,t)\big) < \varepsilon, \tag{A.8}$$
where $d_M(\cdot,\cdot)$ denotes the standard Euclidean metric on the metric space $M$. By the definition of $d_M(\cdot,\cdot)$, $d_{\mathcal{D}\times\mathbb{R}_{\geq 0}}\big((x,t),(y,t)\big) = d_{\mathcal{D}}(x,y)$. Using (A.8) and the fact that the function is positive, (A.9) implies $(x,t) < (y,t) + \varepsilon$ and $(y,t) < (x,t) + \varepsilon$, which from (A.7) implies $\alpha(x) < \alpha(y) + \varepsilon$ and $\alpha(y) < \alpha(x) + \varepsilon$; hence, from (A.9), $d_{\mathcal{D}}(x,y) < \varsigma(x) \implies |\alpha(x) - \alpha(y)| < \varepsilon$. Since the function is positive definite, (A.7) can be used to conclude that $\alpha(0) = 0$. Thus, the function is bounded above by a continuous positive definite function; hence, it is decrescent in $\mathcal{D}$.
Proof Based on the definitions in (3.51), (3.52) and (3.68), $V_t^*(e,t) > 0$, $\forall t \in \mathbb{R}_{\geq 0}$ and $\forall e \in B_a \setminus \{0\}$. The optimal value function $V^*\big(\big[0, x_d^T\big]^T\big)$ is the cost incurred when starting with $e = 0$ and following the optimal policy thereafter for an arbitrary desired trajectory $x_d$. Substituting $x(t_0) = x_d(t_0)$, $\mu(t_0) = 0$ and (3.45) in (3.47) indicates that $\dot{e}(t_0) = 0$. Thus, when starting from $e = 0$, a policy that is identically zero satisfies the dynamic constraints in (3.47). Furthermore, the optimal cost is $V^*\big(\big[0, x_d^T(t_0)\big]^T\big) = 0$, $\forall x_d(t_0)$, which, from (3.68), implies (3.69b). Since the optimal value function $V_t^*$ is strictly positive everywhere but at $e = 0$ and is zero at $e = 0$, $V_t^*$ is a positive definite function. Hence, [8, Lemma 4.3] can be invoked to conclude that there exists a class $\mathcal{K}$ function $\underline{v}: [0,a] \to \mathbb{R}_{\geq 0}$ such that $\underline{v}(\|e\|) \leq V_t^*(e,t)$, $\forall t \in \mathbb{R}_{\geq 0}$ and $\forall e \in B_a$.
Admissibility of the optimal policy implies that $V^*(\zeta)$ is bounded over all compact subsets $K \subset \mathbb{R}^{2n}$. Since the desired trajectory is bounded, $t \mapsto V_t^*(e,t)$ is uniformly bounded for all $e \in B_a$. To establish that $e \mapsto V_t^*(e,t)$ is continuous, uniformly in $t$, let $\chi_{e_o} \subset \mathbb{R}^n$ be a compact set containing $e_o$. Since $x_d$ is bounded, $x_d \in \chi_{x_d}$, where $\chi_{x_d} \subset \mathbb{R}^n$ is compact. Since $V^*: \mathbb{R}^{2n} \to \mathbb{R}_{\geq 0}$ is continuous, and $\chi_{e_o} \times \chi_{x_d} \subset \mathbb{R}^{2n}$ is compact, $V^*$ is uniformly continuous on $\chi_{e_o} \times \chi_{x_d}$. Thus, $\forall \varepsilon > 0$, $\exists \varsigma > 0$ such that $\forall \big[e_o^T, x_d^T\big]^T, \big[e_1^T, x_d^T\big]^T \in \chi_{e_o} \times \chi_{x_d}$, $d_{\chi_{e_o}\times\chi_{x_d}}\big(\big[e_o^T, x_d^T\big]^T, \big[e_1^T, x_d^T\big]^T\big) < \varsigma \implies d_{\mathbb{R}}\big(V^*\big(\big[e_o^T, x_d^T\big]^T\big), V^*\big(\big[e_1^T, x_d^T\big]^T\big)\big) < \varepsilon$. Thus, for each $e_o \in \mathbb{R}^n$, there
$$\iota_5 = \frac{18\, k_{a1}^2\, L\, k_c^2\, \varphi^2\, L_F^2\, T^2}{\nu\varphi \left(1 - 6L\left(k_c \varphi T\right)^2 / \nu\varphi\right)^2}, \qquad
\iota_6 = \frac{18\, L\, k_{a1}^2\, k_c^2\, \varphi^2 \left(L_F \overline{d} + \iota_5\right)^2 T^2}{\nu\varphi \left(1 - 6L\left(k_c \varphi T\right)^2 / \nu\varphi\right)^2} + 3 L\, k_{a2}^2\, \overline{W}^2\, T^2.$$
Using the definition of the controller in (3.57), the tracking error dynamics can be expressed as
$$\dot{e} = f + \frac{1}{2}\, g R^{-1} G^T \sigma'^T \tilde{W}_a + g g_d^+ (h_d - f_d) - \frac{1}{2}\, g R^{-1} G^T \sigma'^T W - h_d.$$
On any compact set, the tracking error derivative can be bounded above as
$$\|\dot{e}\| \leq L_F\, \|e\| + L_W\, \big\|\tilde{W}_a\big\| + L_e,$$
where $L_e = L_x + \big\| g g_d^+ (h_d - f_d) - \tfrac{1}{2}\, g R^{-1} G^T \sigma'^T W - h_d \big\|$ and $L_W = \tfrac{1}{2}\big\| g R^{-1} G^T \sigma'^T \big\|$. Using the fact that $e$ and $\tilde{W}_a$ are continuous functions of time, on the interval $[t, t+T]$, the time derivative of $e$ can be bounded as
$$\|\dot{e}\| \leq L_F \sup_{\tau \in [t,t+T]} \|e(\tau)\| + L_W \sup_{\tau \in [t,t+T]} \big\|\tilde{W}_a(\tau)\big\| + L_e.$$
Since the infinity norm is bounded by the 2-norm, the derivative of the $j$th component of $e$ is bounded as
$$|\dot{e}_j| \leq L_F \sup_{\tau \in [t,t+T]} \|e(\tau)\| + L_W \sup_{\tau \in [t,t+T]} \big\|\tilde{W}_a(\tau)\big\| + L_e.$$
Summing over $j$, and using the facts that $\sup_{\tau \in [t,t+T]} \|e(\tau)\|^2 \leq \sum_{j=1}^{n} \sup_{\tau \in [t,t+T]} e_j(\tau)^2$ and $\sum_{j=1}^{n} \inf_{\tau \in [t,t+T]} e_j(\tau)^2 \leq \inf_{\tau \in [t,t+T]} \|e(\tau)\|^2$,
$$- \inf_{\tau \in [t,t+T]} \big\|\tilde{W}_a(\tau)\big\|^2 \leq - \frac{1 - 6N(\eta_{a1} + \eta_{a2})^2 T^2}{2} \sup_{\tau \in [t,t+T]} \big\|\tilde{W}_a(\tau)\big\|^2 + 3N \eta_{a1}^2 \sup_{\tau \in [t,t+T]} \big\|\tilde{W}_c(\tau)\big\|^2 T^2 + 3N \eta_{a2}^2 \|W\|^2 T^2. \tag{A.10}$$
$$\cdots + \frac{6N T^2 \eta_{c2}^2\, \varphi^2\, \overline{d}^2 L_F^2}{\nu\varphi \Big(1 - \frac{6N \eta_{c2}^2 \varphi^2 T^2}{\nu\varphi^2}\Big)} \sup_{\tau \in [t,t+T]} \|e(\tau)\|^2 + \frac{6N T^2 \eta_{c2}^2\, \varphi^2\, \overline{d}^2 \big(L_{Fd} + \iota_5\big)^2}{\nu\varphi \Big(1 - \frac{6N \eta_{c2}^2 \varphi^2 T^2}{\nu\varphi^2}\Big)}. \tag{A.11}$$
Proof Let the constants $\iota_7$–$\iota_9$ be defined as $\iota_7 = \frac{\nu^2\varphi^2}{2\left(\nu^2\varphi^2 + k_c\varphi^2 T^2\right)}$, $\iota_8 = 3\,\overline{d}^2 L_F^2$, and $\iota_9 = 2\iota_5^2 + 2 L_F^2\,\overline{d}^2$. The integrand on the left-hand side can be written as
$$\tilde{W}_c^T(\tau)\,\psi(\tau) = \tilde{W}_c^T(t)\,\psi(\tau) + \big(\tilde{W}_c^T(\tau) - \tilde{W}_c^T(t)\big)\,\psi(\tau).$$
Substituting the dynamics for $\tilde{W}_c$ from (3.66) and using the persistence of excitation condition in Assumption 3.13,
$$\int_t^{t+T}\!\big(\tilde{W}_c^T(\tau)\,\psi(\tau)\big)^2 d\tau \geq \frac{1}{2}\,\underline{\psi}\,\tilde{W}_c^T(t)\,\tilde{W}_c(t) - 2\int_t^{t+T}\!\left(\int_t^{\tau}\!\eta_c\,\tilde{W}_c^T(\sigma)\,\psi(\sigma)\,\psi^T(\sigma)\,\Gamma(\sigma)\,\psi(\tau)\,d\sigma\right)^{\!2} d\tau - 6\int_t^{t+T}\!\left(\int_t^{\tau}\!\frac{\eta_c\,\psi^T(\sigma)\,\Gamma(\sigma)\,\psi(\tau)}{\sqrt{1+\nu\,\omega(\sigma)^T\Gamma(\sigma)\,\omega(\sigma)}}\,d\sigma\right)^{\!2} d\tau - \cdots.$$
Using the Cauchy–Schwarz inequality, the Lipschitz property, the fact that $\frac{1}{\sqrt{1+\nu\,\omega^T\Gamma\omega}} \leq 1$, and the bounds in (3.67),
$$\int_t^{t+T}\!\big(\tilde{W}_c^T(\tau)\,\psi(\tau)\big)^2 d\tau \geq \frac{1}{2}\,\underline{\psi}\,\tilde{W}_c^T(t)\,\tilde{W}_c(t) - 6\int_t^{t+T}\!\left(\int_t^{\tau}\frac{\eta_c\,\iota_5\,\varphi}{\nu\varphi}\,d\sigma\right)^{\!2} d\tau - 2\eta_c^2\int_t^{t+T}\!\left(\int_t^{\tau}\big(\tilde{W}_c^T(\sigma)\,\psi(\sigma)\big)^2 d\sigma\right)\!\left(\int_t^{\tau}\big(\psi^T(\sigma)\,\Gamma(\sigma)\,\psi(\tau)\big)^2 d\sigma\right) d\tau - 6\eta_c^2\,\iota_2^2\int_t^{t+T}\!\left(\int_t^{\tau}\big\|\tilde{W}_a(\sigma)\big\|^4 d\sigma\right)\!\left(\int_t^{\tau}\big(\psi^T(\sigma)\,\Gamma(\sigma)\,\psi(\tau)\big)^2 d\sigma\right) d\tau - 6\eta_c^2\,\overline{d}^2\int_t^{t+T}\!\left(\int_t^{\tau}\|F(\sigma)\|^2 d\sigma\right)\!\left(\int_t^{\tau}\big(\psi^T(\sigma)\,\Gamma(\sigma)\,\psi(\tau)\big)^2 d\sigma\right) d\tau.$$
Rearranging,
$$\int_t^{t+T}\!\big(\tilde{W}_c^T(\tau)\,\psi(\tau)\big)^2 d\tau \geq \frac{1}{2}\,\underline{\psi}\,\tilde{W}_c^T(t)\,\tilde{W}_c(t) - 3\eta_c^2 A^4\varphi^2\iota_5^2 T^3 - 3\eta_c^2 A^4\varphi^2\,\overline{d}^2 L_F^2\, T^3 - 2\eta_c^2 A^4\varphi^2\int_t^{t+T}\!\!\int_t^{\tau}(\tau - t)\,\big(\tilde{W}_c^T(\sigma)\,\psi(\sigma)\big)^2 d\sigma\,d\tau - 6\eta_c^2\,\iota_2^2 A^4\varphi^2\int_t^{t+T}\!\!\int_t^{\tau}(\tau - t)\,\big\|\tilde{W}_a(\sigma)\big\|^4 d\sigma\,d\tau - 6\eta_c^2\,\overline{d}^2 L_F^2 A^4\varphi^2\int_t^{t+T}\!\!\int_t^{\tau}(\tau - t)\,\|e\|^2\,d\sigma\,d\tau,$$
where $A = \frac{1}{\sqrt{\nu\varphi}}$. Changing the order of integration,
$$\int_t^{t+T}\!\big(\tilde{W}_c^T(\tau)\,\psi(\tau)\big)^2 d\tau \geq \frac{1}{2}\,\underline{\psi}\,\tilde{W}_c^T(t)\,\tilde{W}_c(t) - \eta_c^2 A^4\varphi^2 T^2\int_t^{t+T}\!\big(\tilde{W}_c^T(\sigma)\,\psi(\sigma)\big)^2 d\sigma - 3\eta_c^2\,\overline{d}^2 L_F^2 A^4\varphi^2 T^2\int_t^{t+T}\!\|e(\sigma)\|^2 d\sigma - 3\eta_c^2\,\iota_2^2 A^4\varphi^2 T^2\int_t^{t+T}\!\big\|\tilde{W}_a(\sigma)\big\|^4 d\sigma - 2\eta_c^2 A^4\varphi^2 T^3\big(\iota_5^2 + \overline{d}^2 L_F^2\big).$$
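To make the "changing the order of integration" step concrete, the following generic bound (stated here as a supplementary illustration, for any nonnegative integrable function $h$ on $[t, t+T]$) shows how the nested, $(\tau - t)$-weighted integrals collapse into single integrals scaled by $T^2$:
$$\int_t^{t+T}\!\!\int_t^{\tau}(\tau - t)\,h(\sigma)\,d\sigma\,d\tau
= \int_t^{t+T}\! h(\sigma)\int_{\sigma}^{t+T}(\tau - t)\,d\tau\,d\sigma
\leq \int_t^{t+T}\! h(\sigma)\,\frac{T^2}{2}\,d\sigma
= \frac{T^2}{2}\int_t^{t+T}\! h(\sigma)\,d\sigma.$$
Applying this bound term by term to the preceding inequality (with $h$ equal to each of the nonnegative integrands) produces the coefficients $\eta_c^2 A^4\varphi^2 T^2$ and $3\eta_c^2(\cdot)A^4\varphi^2 T^2$ shown above.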
Proof To facilitate the subsequent development, let the gains $k$ and $\gamma$ be split as $k \triangleq k_1 + k_2$ and $\gamma \triangleq \gamma_1 + \gamma_2$. Let $\lambda \triangleq \min\{\alpha\gamma_1 - \zeta_5,\; k_1 - \zeta_6\}$, $\rho(\|z\|)^2 \triangleq \rho_1(\|z\|)^2 + \rho_2(\|z\|)^2$, and $\eta \triangleq \min\big\{k_2,\; \tfrac{\alpha\gamma_2}{\beta_2^2}\big\}$. Let $y(t)$ for $t \in [t_0, \infty)$ denote a Filippov solution to the differential equation in (3.111) that satisfies $y(t_0) \in \mathcal{S}$. Using Filippov's theory of differential inclusions [1, 2], the existence of solutions can be established for $\dot{y} \in K[h](y,t)$, where $K[h](y,t) \triangleq \bigcap_{\delta>0} \bigcap_{\mu S_m = 0} \overline{\mathrm{co}}\, h\big(B_\delta(y)\setminus S_m, t\big)$, where $\bigcap_{\mu S_m = 0}$ denotes the intersection over all sets $S_m$ of Lebesgue measure zero [3, 4]. The time derivative of (3.109) along the Filippov trajectory $y(\cdot)$ exists almost everywhere (a.e.), and $\dot{V}_I \overset{a.e.}{\in} \dot{\tilde{V}}_I$, where
$$\dot{\tilde{V}}_I = \bigcap_{\xi \in \partial V_I(y)} \xi^T K\Big[\dot{e}_f^T \;\; \dot{\tilde{x}}^T \;\; \tfrac{1}{2} P^{-\frac{1}{2}} \dot{P} \;\; \tfrac{1}{2} Q^{-\frac{1}{2}} \dot{Q}\Big]^T \tag{A.12}$$
$$= \Big[ e_f^T \;\; \gamma \tilde{x}^T \;\; 2 P^{\frac{1}{2}} \;\; 2 Q^{\frac{1}{2}} \Big]\, K\Big[\dot{e}_f^T \;\; \dot{\tilde{x}}^T \;\; \tfrac{1}{2} P^{-\frac{1}{2}} \dot{P} \;\; \tfrac{1}{2} Q^{-\frac{1}{2}} \dot{Q}\Big]^T.$$
Using the calculus for $K[\cdot]$ from [4], and substituting the dynamics from (3.99) and (3.107), yields the expression in (A.13), where $K[\mathrm{sgn}](\tilde{x}) = \mathrm{SGN}(\tilde{x})$. Substituting (3.98), canceling common terms, and rearranging the expression yields
$$\dot{\tilde{V}}_I \overset{a.e.}{\leq} -(\alpha\gamma_1 - \zeta_5)\,\|\tilde{x}\|^2 - (k_1 - \zeta_6)\,\|e_f\|^2 + \frac{\rho_1(\|z\|)^2}{4k_2}\,\|z\|^2 + \frac{\beta_2^2\,\rho_2(\|z\|)^2}{4\alpha\gamma_2}\,\|z\|^2. \tag{A.15}$$
Provided the sufficient conditions in (3.112) are satisfied, the expression in (A.15) can be rewritten as
$$\dot{\tilde{V}}_I \overset{a.e.}{\leq} -\lambda\,\|z\|^2 + \frac{\rho(\|z\|)^2}{4\eta}\,\|z\|^2 \overset{a.e.}{\leq} -U(y), \quad \forall y \in \mathcal{D}. \tag{A.16}$$
Using Property 2.3, $u(\cdot) \in \mathcal{L}_\infty$ from Assumption 3.19, and (3.92), it can be concluded that $\mu(\cdot) \in \mathcal{L}_\infty$. Using (3.97) and the above bounds, it can be shown that $\dot{e}_f(\cdot) \in \mathcal{L}_\infty$. From (A.16), [7, Corollary 1] can be invoked to show that $y(\cdot) \in \mathcal{L}_\infty$, provided $y(0) \in \mathcal{S}$. Furthermore, $\|\tilde{x}(t)\|, \|\dot{\tilde{x}}(t)\|, \|e_f(t)\| \to 0$ as $t \to \infty$, provided $y(t_0) \in \mathcal{S}$.
In the following, the notation $\{\cdot\}_i$ for any parameter denotes the value of that parameter computed in the $i$th iteration.
$$\frac{\underline{\gamma}}{2}\,\big\|\tilde{\theta}\big\|^2 \leq V_\theta\big(\tilde{\theta}\big) \leq \frac{\overline{\gamma}}{2}\,\big\|\tilde{\theta}\big\|^2,$$
where $\underline{\gamma}, \overline{\gamma} \in \mathbb{R}$ denote the minimum and the maximum eigenvalues of the matrix $\Gamma_\theta^{-1}$. Using (A.19), the Lyapunov derivative can be expressed as
$$\dot{V}_\theta = -\tilde{\theta}^T \left( \frac{k_\theta}{M} \sum_{j=1}^{M} Y^T(a_j)\, Y(a_j) \right) \tilde{\theta} - \frac{k_\theta}{M}\, \tilde{\theta}^T \sum_{j=1}^{M} Y^T(a_j)\, d_j.$$
Let $\underline{y} \in \mathbb{R}$ be the minimum eigenvalue of $\frac{1}{M} \sum_{j=1}^{M} Y^T(a_j)\, Y(a_j)$. Since $\sum_{j=1}^{M} Y^T(a_j)\, Y(a_j)$ is symmetric and positive semi-definite, (A.17) can be used to conclude that it is also positive definite, and hence $\underline{y} > 0$. Hence, the Lyapunov derivative can be bounded as
$$\dot{V}_\theta \leq -\underline{y}\, k_\theta\, \big\|\tilde{\theta}\big\|^2 + k_\theta\, d_\theta\, \big\|\tilde{\theta}\big\|,$$
where $d_\theta = \overline{d}\,\overline{Y}$ and $\overline{Y} = \max_{j=1,\cdots,M} \big\| Y(a_j) \big\|$. Hence, $\tilde{\theta}$ exponentially decays to an ultimate bound as $t \to \infty$. If $H_{id}$ is updated with new data, the update law (A.19) forms a switched system. Provided (A.17) holds, and $H_{id}$ is updated using a singular value maximizing algorithm, $V_\theta$ is a common Lyapunov function for the switched system (cf. [14]). The concurrent learning-based system identifier satisfies Assumption 4.1 with $K = \underline{y}\, k_\theta$ and $D = k_\theta\, d_\theta$. To satisfy the last inequality in (4.18), the quantity $v\iota_l$ needs to be small. Based on the definitions in (4.17), the quantity $v\iota_l$ is proportional to $\frac{D^2}{K^2}$, which is proportional to $\frac{d_\theta^2}{\underline{y}^2}$. From the definitions of $d_\theta$ and $\underline{y}$,
$$\frac{d_\theta^2}{\underline{y}^2} = \overline{d}^2\, \frac{\left(\sum_{j=1}^{M} \big\| Y(a_j) \big\|\right)^2}{\lambda_{\min}\left( \sum_{j=1}^{M} Y^T(a_j)\, Y(a_j) \right)^2}.$$
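Since the ultimate bound grows with $d_\theta^2/\underline{y}^2$, it is beneficial to record data so that the minimum singular value of the stacked regressor (the square root of the minimum eigenvalue of the Gram matrix $\sum_j Y^T(a_j)Y(a_j)$) is as large as possible, which is what the singular-value-maximizing record-replacement rule described in the following paragraph aims to do. The snippet below is a minimal Python sketch of such a rule; the array shapes, the function names, and the use of NumPy are illustrative assumptions rather than the authors' implementation.

import numpy as np

def min_singular_value(stack):
    # stack: list of regressor matrices Y_j, each of shape (n, p)
    gram = sum(Y.T @ Y for Y in stack)            # Gram matrix sum_j Y_j^T Y_j  (p x p)
    return np.sqrt(np.linalg.eigvalsh(gram)[0])   # smallest singular value of the stacked regressor

def update_history_stack(stack, Y_new, max_size):
    """Record Y_new only if it increases the minimum singular value of the stack."""
    if len(stack) < max_size:
        stack.append(Y_new)                       # fill the stack first
        return stack
    best_stack, best_sv = stack, min_singular_value(stack)
    for i in range(max_size):
        candidate = stack[:i] + [Y_new] + stack[i + 1:]   # tentatively replace the i-th record
        sv = min_singular_value(candidate)
        if sv > best_sv:                          # keep the replacement that maximizes the minimum singular value
            best_stack, best_sv = candidate, sv
    return best_stack

A data point that does not improve the minimum singular value is simply discarded, so the recorded Gram matrix is monotonically improved once the stack is full.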
A data stack $H_{id}$ that satisfies the conditions in (A.17) can be collected online provided the controller in (4.6) results in the system states being sufficiently exciting over a finite time interval $[t_0, t_0 + t] \subset \mathbb{R}$. To collect the data stack, the first $M$ values of the state, the control, and the corresponding numerically computed state derivative are added to the data stack. Then, the existing values are progressively replaced with new values using a singular value maximization algorithm. During this finite time interval, since a data stack is not available, an adaptive update law that ensures fast convergence of $\tilde{\theta}$ to zero without persistence of excitation cannot be developed. Hence, the system dynamics cannot be directly estimated without persistence of excitation. Since extrapolation of the Bellman error to unexplored areas of the state-
$$\dot{\hat{x}}_f = g u + k_f \tilde{x}_f + \mu_f, \qquad \dot{\mu}_f = \big(k_f \alpha_f + 1\big)\,\tilde{x}_f, \tag{A.20}$$
where $\omega_f \in \mathbb{R}^L$ is the regressor vector defined as $\omega_f \triangleq \nabla\sigma(x)\,\dot{\hat{x}}_f$. During the interval $[t_0, t_0 + t]$, the value function and the actor weights can be learned based on the approximate Bellman error in (A.21), provided the system states are exciting (i.e., if the following assumption is satisfied).
Assumption A.3 There exists a time interval $[t_0, t_0 + t] \subset \mathbb{R}$ and positive constants $\underline{\psi}, T \in \mathbb{R}$ such that closed-loop trajectories of the system in (1.9) with the controller $u = \hat{u}\big(x, \hat{W}_{af}\big)$, along with the weight update laws
$$\dot{\hat{W}}_{cf} = -\eta_{cf}\,\Gamma_f\, \frac{\omega_f}{\rho_f}\, \delta_f, \qquad \dot{\Gamma}_f = \lambda_f \Gamma_f - \eta_{cf}\, \Gamma_f\, \frac{\omega_f \omega_f^T}{\rho_f}\, \Gamma_f, \qquad \dot{\hat{W}}_{af} = -\eta_{a1f}\big(\hat{W}_a - \hat{W}_c\big) - \eta_{a2f}\,\hat{W}_a, \tag{A.22}$$
satisfy
$$\underline{\psi}\, I_L \leq \int_t^{t+T} \psi_f(\tau)\,\psi_f(\tau)^T\, d\tau, \quad \forall t \in [t_0, t_0 + t], \tag{A.23}$$
$\forall x \in \chi$. The update laws in (A.22), along with the excitation condition in (A.23), ensure that the adaptation gain matrix is bounded such that
$$\underline{\Gamma}_f \leq \big\|\Gamma_f(t)\big\| \leq \overline{\Gamma}_f, \quad \forall t \in \mathbb{R}_{\geq t_0}. \tag{A.25}$$
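The excitation condition in (A.23) can be monitored numerically during the learning interval by integrating the outer product of the regressor over a sliding window and comparing the minimum eigenvalue of the result with $\underline{\psi}$. The sketch below is illustrative only; the sampling scheme, the trapezoidal quadrature, and the names used are assumptions rather than the book's implementation.

import numpy as np

def check_excitation(psi_samples, dt, psi_lower):
    """Check psi_lower * I <= int_t^{t+T} psi(tau) psi(tau)^T dtau over a window of samples.

    psi_samples: array of shape (K, L); row k is psi_f(tau_k), sampled every dt seconds.
    """
    L = psi_samples.shape[1]
    gram = np.zeros((L, L))
    # trapezoidal quadrature of the outer-product integral over the window
    for k in range(psi_samples.shape[0] - 1):
        outer_k = np.outer(psi_samples[k], psi_samples[k])
        outer_k1 = np.outer(psi_samples[k + 1], psi_samples[k + 1])
        gram += 0.5 * dt * (outer_k + outer_k1)
    min_eig = np.linalg.eigvalsh(gram)[0]     # smallest eigenvalue of the integrated Gram matrix
    return min_eig >= psi_lower, min_eig

# example usage with a synthetic, persistently exciting two-dimensional signal
t = np.arange(0.0, 2.0, 0.01)
psi = np.column_stack([np.sin(t), np.cos(2 * t)])
satisfied, margin = check_excitation(psi, 0.01, 0.1)
print(satisfied, margin)

If the returned margin drops toward zero, the window is not sufficiently exciting and the condition in (A.23) cannot be expected to hold for the chosen $\underline{\psi}$ and $T$.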
$$V_{Lf}\big(Z_f, t\big) \triangleq V^*(x) + \frac{1}{2}\,\tilde{W}_{cf}^T\, \Gamma_f^{-1}\, \tilde{W}_{cf} + \frac{1}{2}\,\tilde{W}_{af}^T\, \tilde{W}_{af} + \frac{1}{2}\,\tilde{x}_f^T\, \tilde{x}_f + \frac{1}{2}\, r^T r. \tag{A.28}$$
Using the fact that $V^*$ is positive definite, (A.25) and [8, Lemma 4.3] can be used to establish the bound
$$\underline{v}_{lf}\big(\big\|Z_f\big\|\big) \leq V_{Lf}\big(Z_f, t\big) \leq \overline{v}_{lf}\big(\big\|Z_f\big\|\big). \tag{A.29}$$
Theorem A.4 Provided the gains are selected to satisfy the sufficient conditions in (A.30), based on an algorithm similar to Algorithm A.2, the controller in (4.6), the weight update laws in (A.22), the state derivative estimator in (A.20), and the excitation condition in (A.23) ensure that the state trajectory $x$, the state estimation error $\tilde{x}_f$, and the parameter estimation errors $\tilde{W}_{cf}$ and $\tilde{W}_{af}$ remain bounded such that
$$\big\|Z_f(t)\big\| \leq \overline{Z}_f, \quad \forall t \in [t_0, t_0 + t].$$
Proof Using techniques similar to the proof of Theorem 4.3, the time derivative of the candidate Lyapunov function in (A.28) can be bounded as
$$\dot{V}_{Lf} \leq -v_{lf}\,\big\|Z_f\big\|^2, \quad \forall \big\|Z_f\big\| \geq \sqrt{\frac{\iota_f}{v_{lf}}}, \tag{A.32}$$
in the domain $\mathcal{Z}_f$. Using (A.29), (A.31), and (A.32), [8, Theorem 4.18] is used to show that $Z_f$ is uniformly ultimately bounded, and that $\big\|Z_f(t)\big\| \leq \overline{Z}_f$, $\forall t \in [t_0, t_0 + t]$.
During the interval $[t_0, t_0 + t]$, the controller in (4.6) is used along with the weight update laws in Assumption A.3. When enough data is collected in the data stack to satisfy the rank condition in (A.17), the update laws from Sect. 4.3.3 are used. The bound $\overline{Z}_f$ is used to compute gains for Theorem 4.3 using Algorithm A.2.
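The two-phase logic just described (use the excitation-based update laws of Assumption A.3 until the recorded data stack satisfies the rank condition in (A.17), then switch to the concurrent-learning update laws of Sect. 4.3.3) can be summarized in code form. The sketch below is a structural illustration under assumed, simplified interfaces; the update laws shown are generic placeholders, not the book's update laws.

import numpy as np

def stack_rank_ok(regressors, required_rank):
    # rank condition (A.17): the recorded regressors span the parameter space
    if not regressors:
        return False
    return np.linalg.matrix_rank(np.vstack(regressors)) >= required_rank

def excitation_phase_update(theta_hat, regressor, error, gain=1.0):
    # phase 1: update driven by the instantaneous (exciting) signal
    return theta_hat + gain * regressor.T @ error

def concurrent_learning_update(theta_hat, history_stack, gain=1.0):
    # phase 2: update driven by the recorded data stack (no persistent excitation needed)
    update = sum(Y.T @ (d - Y @ theta_hat) for Y, d in history_stack)
    return theta_hat + gain * update / len(history_stack)

def learning_step(theta_hat, regressor, error, history_stack, required_rank):
    # switch update laws once the rank condition on the data stack is met
    if stack_rank_ok([Y for Y, _ in history_stack], required_rank):
        return concurrent_learning_update(theta_hat, history_stack)
    return excitation_phase_update(theta_hat, regressor, error)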
$$\varphi_\zeta = q - \frac{k_{c1} \sup_{Z\in\beta}\big\|\nabla_\zeta\epsilon\big\|\big(L_{Y_{res}}\|\theta\| + L_{f_{0res}}\big)}{2} - \frac{L_{Y_c}\, g\, \overline{W} \sup_{Z\in\beta}\big\|\nabla_\zeta\sigma\big\| + \sup_{Z\in\beta}\big\|\nabla_\zeta\epsilon\big\|}{2},$$
$$\varphi_c = \frac{k_{c2}\,\underline{c}}{N} - \frac{k_a}{2} - \frac{k_{c1} \sup_{Z\in\beta}\big\|\nabla_\zeta\epsilon\big\|\big(L_{Y_{res}}\|\theta\| + L_{f_{0res}}\big)}{2} - \frac{k_{c1}\, L_Y \sup_{Z\in\beta}\|\zeta\| \sup_{Z\in\beta}\big\|\nabla_\zeta\sigma\big\|\,\|W\|}{2} - \frac{k_{c2} \sum_{j=1}^{N}\big\|Y_{res_j}\big\|\,\big\|\nabla_\zeta\sigma_j\big\|\,\|W\|}{2N},$$
$$\varphi_a = \frac{k_a}{2},$$
$$\varphi_\theta = k_\theta\,\underline{y} - \frac{k_{c2} \sum_{k=1}^{N}\big\|Y_{res_k}\big\|\,\big\|\nabla_\zeta\sigma_k\big\|\,\|W\|}{2N} - \frac{L_{Y_c}\, g\, \overline{W} \sup_{Z\in\beta}\big\|\nabla_\zeta\sigma\big\| + \sup_{Z\in\beta}\big\|\nabla_\zeta\epsilon\big\|}{2} - \frac{k_{c1}\, L_{Y_{res}}\,\|W\| \sup_{Z\in\beta}\|\zeta\| \sup_{Z\in\beta}\big\|\nabla_\zeta\sigma\big\|}{2},$$
$$\kappa_c = \sup_{Z\in\beta}\left[\frac{k_{c2}}{4N}\sum_{j=1}^{N}\big\|\tilde{W}_a^T G_{\sigma j}\tilde{W}_a\big\| + \frac{k_{c1}}{4}\big\|\tilde{W}_a^T G_{\sigma}\tilde{W}_a\big\| + k_{c1}\big\|\nabla_\zeta\epsilon\, G\, \nabla_\zeta\sigma^T W\big\| + \frac{k_{c1}}{4}\big\|\nabla_\zeta\epsilon\, G\, \nabla_\zeta\epsilon^T\big\| + \frac{k_{c2}}{N}\sum_{k=1}^{N} E_k \right],$$
$$\kappa_a = \sup_{Z\in\beta}\left\|\frac{1}{2}\, W^T G_\sigma + \frac{1}{2}\,\nabla_\zeta\epsilon\, G\, \nabla_\zeta\sigma\right\|, \qquad \kappa_\theta = k_\theta\, d_\theta, \qquad \kappa = \frac{1}{4}\sup_{Z\in\beta}\big\|\nabla_\zeta\epsilon\, G\, \nabla_\zeta\epsilon^T\big\|.$$
In the case where the earth-fixed current is constant, the effects of the current may be included in the development of the optimal control problem. The body-relative
where $\dot{\eta}_c \in \mathbb{R}^n$ is the known constant current velocity in the inertial frame. The functions $Y_{res}\theta$ and $f_{0res}$ in (6.11) can then be redefined as
$$Y_{res}\theta \triangleq \begin{bmatrix} 0 \\ -M^{-1} C_A(-\nu_c)\,\nu_c - M^{-1} D(-\nu_c)\,\nu_c - M^{-1} C_A(\nu_r)\,\nu_r - M^{-1} D(\nu_r)\,\nu_r \end{bmatrix}, \qquad
f_{0res} \triangleq \begin{bmatrix} J_E\,\nu \\ -M^{-1} C_{RB}(\nu)\,\nu - M^{-1} G(\eta) \end{bmatrix},$$
$$u = \tau_b - \tau_c,$$
where $\tau_c(\zeta) \in \mathbb{R}^n$ is the control effort required to keep the vehicle on station given the current and is redefined as
The geometry of the path-following problem is depicted in Fig. A.1. Let $I$ denote an inertial frame. Consider the coordinate system $i$ in $I$ with its origin and the basis vectors $i_1 \in \mathbb{R}^3$ and $i_2 \in \mathbb{R}^3$ in the plane of vehicle motion, and $i_3 \triangleq i_1 \times i_2$. The point $P(t) \in \mathbb{R}^3$ on the desired path represents the location of the virtual target at time $t$. The location of the virtual target is determined by the path parameter $s_p(t) \in \mathbb{R}$. In the controller development, the path parameter is defined as the arc length along the desired path from some arbitrary initial position on the path to the point $P(t)$. It is convenient to select the arc length as the path parameter for a mobile robot, since the desired speed can be defined as unit length per unit time. Let $F$ denote a frame fixed to the virtual target with the origin of the coordinate system $f$ fixed in $F$ at the point $P(t)$. The basis vectors $f_1(t), f_2(t) \in \mathbb{R}^3$ are the unit tangent and normal vectors of the path at $P(t)$, respectively, in the plane of vehicle motion, and $f_3(t) \triangleq f_1(t) \times f_2(t)$. Let $B$ denote a frame fixed to the vehicle with the origin of its coordinate system $b$ at the center of mass $Q(t) \in \mathbb{R}^3$. The basis vectors $b_1(t), b_2(t) \in \mathbb{R}^3$ are the unit tangent and normal vectors of the vehicle motion at $Q(t)$, and $b_3(t) \triangleq b_1(t) \times b_2(t)$. Note that the bases $\{i_1, i_2, i_3\}$, $\{f_1(t), f_2(t), f_3(t)\}$, and $\{b_1(t), b_2(t), b_3(t)\}$ form standard bases.
where $r_Q(t) \in \mathbb{R}^3$ and $r_P(t) \in \mathbb{R}^3$ are the position vectors of the points $Q$ and $P$ from the origin of the inertial coordinate system, respectively, at time $t$. The rate of change of $r_{Q/P}$ as viewed by an observer in $I$ and expressed in the coordinate system $f$ is
$$v^f_{Q/P}(t) = v^f_Q(t) - v^f_P(t), \tag{A.36}$$
where $\dot{s}_p : \mathbb{R}_{\geq t_0} \to \mathbb{R}$ is the velocity of the virtual target along the path. The velocity of the point $Q$ as viewed by an observer in $I$ and expressed in $f$ is
$$v^f_Q(t) = R^f_b\big(\theta(t)\big)\, v^b_Q(t), \tag{A.38}$$
where $\theta : \mathbb{R}_{\geq t_0} \to \mathbb{R}$ is the angle between $f_1$ and $b_1$ and $R^f_b : \mathbb{R} \to \mathbb{R}^{3\times 3}$ is a transformation from $b$ to $f$, defined as
$$R^f_b(\theta) \triangleq \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$
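As a numerical illustration of (A.36) and (A.38), the planar rotation $R^f_b(\theta)$ can be used to express the vehicle velocity in the target-fixed coordinate system and to form the relative velocity. The function and variable names below are illustrative assumptions; the sketch is not taken from the book's simulations.

import numpy as np

def R_f_b(theta):
    """Rotation from the body-fixed coordinate system b to the target-fixed system f."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def relative_velocity_f(v_Q_b, v_P_f, theta):
    """Compute v^f_{Q/P} = R^f_b(theta) v^b_Q - v^f_P, per (A.36) and (A.38)."""
    v_Q_f = R_f_b(theta) @ np.asarray(v_Q_b, dtype=float)
    return v_Q_f - np.asarray(v_P_f, dtype=float)

# example: vehicle moving at 1 m/s along its own tangent, target moving at 0.8 m/s along the path
v_rel = relative_velocity_f(v_Q_b=[1.0, 0.0, 0.0], v_P_f=[0.8, 0.0, 0.0], theta=np.pi / 6)
print(v_rel)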
$$v^f_{Q/P}(t) = \frac{{}^F d}{dt}\, r^f_{Q/P}(t) + {}^I\omega^F(t) \times r^f_{Q/P}(t). \tag{A.39}$$
The angular velocity of $F$ as viewed by an observer in $I$, expressed in $f$, is given as ${}^I\omega^F(t) = \big[0 \;\; 0 \;\; \kappa(t)\,\dot{s}_p(t)\big]^T$, where $\kappa : \mathbb{R}_{\geq t_0} \to \mathbb{R}$ is the path curvature, and the relative position of the vehicle with respect to the virtual target, expressed in $f$, is $r^f_{Q/P}(t) = \big[x(t) \;\; y(t) \;\; 0\big]^T$. Substituting (A.37)–(A.39) into (A.36), the planar positional error dynamics are given as
$$\varphi_e \triangleq q - \frac{\eta_{c1} \sup_{\zeta\in\chi} L_f}{2}, \qquad \varphi_c \triangleq \frac{\eta_{c2}\,\underline{c}}{N} - \frac{\eta_a}{2} - \frac{\eta_{c1} \sup_{\zeta\in\chi} L_f}{2},$$
$$\iota_c \triangleq \sup_{\zeta\in\chi}\left[\frac{\eta_{c2}}{4N}\sum_{j=1}^{N}\big\|\tilde{W}_a^T G_{\sigma j}\tilde{W}_a\big\| + \frac{\eta_{c1}}{4}\big\|\tilde{W}_a^T G_{\sigma}\tilde{W}_a\big\| + \frac{\eta_{c1}}{2}\big\|G_{\sigma} W\big\| + \frac{\eta_{c1}}{4}\big\|G_{\epsilon}\big\| + \frac{\eta_{c2}}{N}\sum_{j=1}^{N}\big\|E_j\big\| + \eta_{c1}\, L_f\right],$$
$$\iota_a \triangleq \sup_{\zeta\in\chi}\left\|\frac{1}{2}\, G_{\sigma}^T W + \frac{1}{2}\,\nabla\sigma\, G\,\nabla\epsilon^T\right\|, \qquad \varphi_a \triangleq \frac{\eta_a}{2}, \qquad \iota \triangleq \frac{1}{4}\sup_{\zeta\in\chi}\big\|G_{\epsilon}\big\|,$$
$$K \triangleq \frac{\iota_c^2}{2\alpha\varphi_c} + \frac{\iota_a^2}{2\alpha\varphi_a} + \frac{\iota}{\alpha}, \qquad \alpha \triangleq \frac{1}{2}\min\left\{\varphi_e,\; \frac{\varphi_c}{2},\; \frac{\varphi_a}{2}\right\}.$$
References
1. Filippov AF (1988) Differential equations with discontinuous right-hand sides. Kluwer Aca-
demic Publishers, Dordrecht
2. Aubin JP, Frankowska H (2008) Set-valued analysis. Birkhäuser
3. Shevitz D, Paden B (1994) Lyapunov stability theory of nonsmooth systems. IEEE Trans Autom
Control 39(9):1910–1914
4. Paden BE, Sastry SS (1987) A calculus for computing Filippov’s differential inclusion with
application to the variable structure control of robot manipulators. IEEE Trans Circuits Syst
34(1):73–82
5. Clarke FH (1990) Optimization and nonsmooth analysis. SIAM
6. Kamalapurkar R, Rosenfeld JA, Klotz J, Downey RJ, Dixon WE (2014) Supporting lemmas
for RISE-based control methods. arXiv:1306.3432
7. Fischer N, Kamalapurkar R, Dixon WE (2013) LaSalle-Yoshizawa corollaries for nonsmooth
systems. IEEE Trans Autom Control 58(9):2333–2338
8. Khalil HK (2002) Nonlinear systems, 3rd edn. Prentice Hall, Upper Saddle River
9. Sastry S, Bodson M (1989) Adaptive control: stability, convergence, and robustness. Prentice-
Hall, Upper Saddle River
10. Narendra K, Annaswamy A (1989) Stable adaptive systems. Prentice-Hall Inc, Upper Saddle
River
11. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall, Upper Saddle River
12. Chowdhary G (2010) Concurrent learning for convergence in adaptive control without persis-
tency of excitation. Ph.D. thesis, Georgia Institute of Technology
13. Chowdhary GV, Johnson EN (2011) Theory and flight-test validation of a concurrent-learning
adaptive controller. J Guid Control Dynam 34(2):592–607
14. Chowdhary G, Yucelen T, Mühlegg M, Johnson EN (2013) Concurrent learning adaptive control
of linear systems with exponentially convergent bounds. Int J Adapt Control Signal Process
27(4):280–301
15. Dierks T, Jagannathan S (2009) Optimal tracking control of affine nonlinear discrete-time
systems with unknown internal dynamics. In: Proceedings of the IEEE conference on decision
and control. Shanghai, CN, pp 6750–6755
16. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
17. Vrabie D, Lewis FL (2010) Integral reinforcement learning for online computation of feedback
Nash strategies of nonzero-sum differential games. In: Proceedings of the IEEE conference on
decision and control, pp 3066–3071
18. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
19. Zhang H, Cui L, Luo Y (2013) Near-optimal control for nonzero-sum differential games of
continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern 43(1):206–
216
20. Walters P, Kamalapurkar R, Andrews L, Dixon WE (2014) Online approximate optimal path-
following for a mobile robot. In: Proceedings of the IEEE conference on decision and control,
pp 4536–4541
Index
B
Bellman error, 24, 25, 28–33, 43, 45, 50, 51, 56, 62, 63, 73, 76, 81, 99–105, 107, 110, 118–120, 122, 125, 130, 131, 133, 135, 138, 144, 150, 157–159, 161–163, 185–187, 203–205, 209, 210, 215, 245, 247, 251, 252
Bellman error extrapolation, 166, 169, 174, 230, 248, 250, 252–254, 257, 258
Bellman's principle of optimality, 3
Bergmann–Fock space, 232
Bolza problem, 1–3, 5, 7, 18, 26, 243
Brachistochrone problem, 3

C
Carathéodory solutions, 2
Common Lyapunov function, 110, 140

D
Differential game, 11, 12, 44, 74, 75, 94, 189, 190
Differential game, closed-loop, 44
Differential game, graphical, 150, 190
Differential game, nonzero-sum, 44, 73, 75, 89, 90, 92–94, 101, 131, 140
Differential game, zero-sum, 94
Dynamic neural network, 43, 46, 56, 73, 75, 77, 78, 90

E
e−modification, 33
Existence, 2
Experience replay, 30, 99, 102, 260, 261
Exponential kernel function, 229, 232, 234–237, 240

F
Filippov solution, 46, 48, 49, 77, 79
Finite excitation, 102
Finite time horizon, 224
Finite-horizon, 71, 94, 185
Formation tracking, 149, 153, 154, 189

G
Galerkin's method, 223
Galerkin's spectral method, 35, 93
GPOPS, 70, 72, 73
Gram-Schmidt algorithm, 235
Graph, directed, 151, 180
Graph Laplacian, 151

H
Hamiltonian, 5, 6, 9, 10, 61, 62, 72, 76, 189
Hamilton–Jacobi–Bellman equation, 5–8, 10–13, 18–23, 27, 35, 36, 43, 44, 51, 53, 61, 93, 99, 104, 119, 181, 202, 215, 223
Hamilton–Jacobi equation, 75, 182, 184
Hamilton–Jacobi–Isaacs equation, 94
Heuristic dynamic programming, 25, 34–36, 43, 94
Hopfield, 43

I
Identifier, 43, 45, 46, 49, 50, 52, 55–57, 73, 76, 80, 85, 90
Infinite-horizon, 5, 35, 44, 45, 71, 73, 74, 94, 95, 100, 101, 117, 118, 131, 149, 153

K
Kalman filter, 209
Kantorovich inequality, 239

L
Least-squares, 28, 31, 33, 35, 43, 49–51, 62, 63, 73, 81, 91, 93, 100, 101, 105, 106, 111, 114, 134, 140, 162, 186, 187, 245, 246, 248
Levenberg-Marquardt, 51
Linear quadratic regulator, 227

M
Model-predictive control, 11, 36, 223, 224

N
Nash equilibrium, 11, 44, 45, 74, 89, 91, 101, 131, 138–141, 149, 150, 156–158, 161, 166, 181, 190
Nash policy, 165
Network, leader-follower, 188
Network systems, 149–152, 167, 172, 176, 180, 181, 184, 185, 189, 190
Nonholonomic system, 172
Nonholonomic vehicle, 223

P
Path-following, 196, 213, 223, 224
Persistence of excitation, 24, 31, 33, 43, 52, 55, 56, 63, 66–68, 83, 85, 89–91, 93, 99, 101–103, 107, 111, 118, 120, 131–133, 143, 145, 183, 185, 187, 190, 195, 199, 204, 230, 248
Policy evaluation, 18, 35, 55
Policy gradient, 35
Policy improvement, 18, 35, 55, 93
Policy iteration, 17–19, 22–25, 34–36, 55, 94, 95
Policy iteration, synchronous, 94
Pontryagin's maximum principle, 2, 3, 9, 10, 22, 34
Prediction error, 107
Projection operator, 50, 78, 82, 84, 114, 204, 205
Pseudospectral, Gauss, 117
Pseudospectral, Radau, 70

Q
Q-learning, 17, 22, 26, 34–36

R
Radial basis function, 228–230
Randomized stationary policy, 105
Receding horizon, 36
Reinforcement learning, 12, 13, 17, 29, 30, 33, 35–37, 43, 45, 55, 60, 91, 94, 99–103, 105, 118, 144, 149, 150, 158, 161, 195, 229, 230, 242, 245, 258–261
Reproducing kernel Hilbert space, 227, 228, 230–234, 238
Reproducing kernel Hilbert space, universal, 233, 242, 244
Riccati equation, 11
RISE, 43, 46, 56, 77, 209
R−learning, 34–36

S
Saddle point, 94
SARSA, 34–36
σ−modification, 33
Sigmoid, 230
Simulation of experience, 100–103, 105, 118, 122, 131, 161–163, 245
Single network adaptive critic, 36
Singular value maximization algorithm, 128, 133, 140, 141
Spanning tree, 151, 152, 164
State following (StaF), 228–230, 232, 233, 235, 237, 238, 242–244, 246, 250, 251, 253–258, 260
Station keeping, 195, 211, 223
Stone-Weierstrass Theorem, 23
Successive approximation, 18, 35, 36
Support vector machines, 230
Switched subsystem, 104, 110, 120, 133, 139, 140

T
Temporal difference, 13, 17, 25, 35, 93

U
Uniqueness, 2, 12
Universal Approximation, 23, 63, 101, 104, 111, 158, 160, 184, 185, 252

V
Value iteration, 17, 22, 23, 34–36
Viscosity solutions, 12

W
Weierstrass Theorem, 243
Wheeled mobile robot, 172, 196, 218–223