
Communications and Control Engineering

Rushikesh Kamalapurkar · Patrick Walters · Joel Rosenfeld · Warren Dixon

Reinforcement Learning for Optimal Feedback Control
A Lyapunov-Based Approach
Communications and Control Engineering

Series editors
Alberto Isidori, Roma, Italy
Jan H. van Schuppen, Amsterdam, The Netherlands
Eduardo D. Sontag, Boston, USA
Miroslav Krstic, La Jolla, USA
Communications and Control Engineering is a high-level academic monograph
series publishing research in control and systems theory, control engineering and
communications. It has worldwide distribution to engineers, researchers, educators
(several of the titles in this series find use as advanced textbooks although that is not
their primary purpose), and libraries.
The series reflects the major technological and mathematical advances that have
a great impact in the fields of communication and control. The range of areas to
which control and systems theory is applied is broadening rapidly with particular
growth being noticeable in the fields of finance and biologically-inspired control.
Books in this series generally pull together many related research threads in more
mature areas of the subject than the highly-specialised volumes of Lecture Notes in
Control and Information Sciences. This series’s mathematical and control-theoretic
emphasis is complemented by Advances in Industrial Control which provides a
much more applied, engineering-oriented outlook.
Publishing Ethics: Researchers should conduct their research from research
proposal to publication in line with best practices and codes of conduct of relevant
professional bodies and/or national and international regulatory bodies. For more
details on individual ethics matters please see:
https://www.springer.com/gp/authors-editors/journal-author/journal-author-helpdesk/publishing-ethics/14214.

More information about this series at http://www.springer.com/series/61


Rushikesh Kamalapurkar
Mechanical and Aerospace Engineering
Oklahoma State University
Stillwater, OK, USA

Patrick Walters
Naval Surface Warfare Center
Panama City, FL, USA

Joel Rosenfeld
Electrical Engineering
Vanderbilt University
Nashville, TN, USA

Warren Dixon
Department of Mechanical and Aerospace Engineering
University of Florida
Gainesville, FL, USA

ISSN 0178-5354    ISSN 2197-7119 (electronic)
Communications and Control Engineering
ISBN 978-3-319-78383-3    ISBN 978-3-319-78384-0 (eBook)
https://doi.org/10.1007/978-3-319-78384-0
Library of Congress Control Number: 2018936639

MATLAB® and Simulink® are registered trademarks of The MathWorks, Inc., 1 Apple Hill Drive, Natick, MA 01760-2098, USA, http://www.mathworks.com.

Mathematics Subject Classification (2010): 49-XX, 34-XX, 46-XX, 65-XX, 68-XX, 90-XX, 91-XX,
93-XX

© Springer International Publishing AG 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer International Publishing AG
part of Springer Nature
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my nurturing grandmother, Mangala
Vasant Kamalapurkar.
—Rushikesh Kamalapurkar

To my strong and caring grandparents.


—Patrick Walters

To my wife, Laura Forest Gruss Rosenfeld,


with whom I have set out on the greatest
journey of my life.
—Joel Rosenfeld

To my beautiful son, Isaac Nathaniel Dixon.


—Warren Dixon
Preface

Making the best possible decision according to some desired set of criteria is always
difficult. Such decisions are even more difficult when there are time constraints and
can be impossible when there is uncertainty in the system model. Yet, the ability to
make such decisions can enable higher levels of autonomy in robotic systems and,
as a result, have dramatic impacts on society. Given this motivation, various
mathematical theories have been developed related to concepts such as optimality,
feedback control, and adaptation/learning. This book describes how such theories
can be used to develop optimal (i.e., the best possible) controllers/policies (i.e., the
decision) for a particular class of problems. Specifically, this book is focused on the
development of concurrent, real-time learning and execution of approximate opti-
mal policies for infinite-horizon optimal control problems for continuous-time
deterministic uncertain nonlinear systems.
The developed approximate optimal controllers are based on reinforcement
learning-based solutions, where learning occurs through an actor–critic-based
reward system. Detailed attention to control-theoretic concerns such as convergence
and stability differentiates this book from the large body of existing literature on
reinforcement learning. Moreover, both model-free and model-based methods are
developed. The model-based methods are motivated by the idea that a system can
be controlled better as more knowledge is available about the system. To account
for the uncertainty in the model, typical actor–critic reinforcement learning is
augmented with unique model identification methods. The optimal policies in this
book are derived from dynamic programming methods; hence, they suffer from the
curse of dimensionality. To address the computational demands of such an
approach, a unique function approximation strategy is provided to significantly
reduce the number of required kernels along with parallel learning through novel
state extrapolation strategies.
The material is intended for readers that have a basic understanding of nonlinear
analysis tools such as Lyapunov-based methods. The development and results may
help to support educators, practitioners, and researchers with nonlinear
systems/control, optimal control, and intelligent/adaptive control interests working
in aerospace engineering, computer science, electrical engineering, industrial

engineering, mechanical engineering, mathematics, and process engineering disciplines/industries.
Chapter 1 provides a brief introduction to optimal control. Dynamic
programming-based solutions to optimal control problems are derived, and the
connections between the methods based on dynamic programming and the methods
based on the calculus of variations are discussed, along with necessary and suffi-
cient conditions for establishing an optimal value function. The chapter ends with a
brief survey of techniques to solve optimal control problems. Chapter 2 includes a
brief review of dynamic programming in continuous time and space. In particular,
traditional dynamic programming algorithms such as policy iteration, value itera-
tion, and actor–critic methods are presented in the context of continuous-time
optimal control. The role of the optimal value function as a Lyapunov function is
explained to facilitate online closed-loop optimal control. This chapter also high-
lights the problems and limitations of existing techniques, thereby motivating the
development in this book. The chapter concludes with some historic remarks and a
brief classification of the available dynamic programming techniques.
In Chap. 3, online adaptive reinforcement learning-based solutions are devel-
oped for infinite-horizon optimal control problems for continuous-time uncertain
nonlinear systems. A novel actor–critic–identifier structure is developed to
approximate the solution to the Hamilton–Jacobi–Bellman equation using three
neural network structures. The actor and the critic neural networks approximate the
optimal controller and the optimal value function, respectively, and a robust
dynamic neural network identifier asymptotically approximates the uncertain sys-
tem dynamics. An advantage of using the actor–critic–identifier architecture is that
learning by the actor, critic, and identifier is continuous and concurrent, without
requiring knowledge of system drift dynamics. Convergence is analyzed using
Lyapunov-based adaptive control methods. The developed actor–critic method is
extended to solve trajectory tracking problems under the assumption that the system
dynamics are completely known. The actor–critic–identifier architecture is also
extended to generate approximate feedback-Nash equilibrium solutions to N-player
nonzero-sum differential games. Simulation results are provided to demonstrate the
performance of the developed actor–critic–identifier method.
Chapter 4 introduces the use of an additional adaptation strategy called con-
current learning. Specifically, a concurrent learning-based implementation of
model-based reinforcement learning is used to solve approximate optimal control
problems online under a finite excitation condition. The development is based on
the observation that, given a model of the system, reinforcement learning can be
implemented by evaluating the Bellman error at any number of desired points in the
state space. By exploiting this observation, a concurrent learning-based parameter
identifier is developed to compensate for uncertainty in the parameters.
Convergence of the developed policy to a neighborhood of the optimal policy is
established using a Lyapunov-based analysis. Simulation results indicate that the
developed controller can be implemented to achieve fast online learning without the
addition of ad hoc probing signals as in Chap. 3. The developed model-based
reinforcement learning method is extended to solve trajectory tracking problems for
uncertain nonlinear systems and to generate approximate feedback-Nash equilibrium solutions to N-player nonzero-sum differential games.
Chapter 5 discusses the formulation and online approximate feedback-Nash
equilibrium solution for an optimal formation tracking problem. A relative control
error minimization technique is introduced to facilitate the formulation of a feasible
infinite-horizon total-cost differential graphical game. A dynamic programming-
based feedback-Nash equilibrium solution to the differential graphical game is
obtained via the development of a set of coupled Hamilton–Jacobi equations. The
developed approximate feedback-Nash equilibrium solution is analyzed using a
Lyapunov-based stability analysis to yield formation tracking in the presence of
uncertainties. In addition to control, this chapter also explores applications of dif-
ferential graphical games to monitoring the behavior of neighboring agents in a
network.
Chapter 6 focuses on applications of model-based reinforcement learning to
closed-loop control of autonomous vehicles. The first part of the chapter is devoted
to online approximation of the optimal station keeping strategy for a fully actuated
marine craft. The developed strategy is experimentally validated using an autono-
mous underwater vehicle, where the three degrees of freedom in the horizontal
plane are regulated. The second part of the chapter is devoted to online approxi-
mation of an infinite-horizon optimal path-following strategy for a unicycle-type
mobile robot. An approximate optimal guidance law is obtained through the
application of model-based reinforcement learning and concurrent learning-based
parameter estimation. Simulation results demonstrate that the developed method
learns an optimal controller which is approximately the same as an optimal con-
troller determined by an off-line numerical solver, and experimental results
demonstrate the ability of the controller to learn the approximate solution in real
time.
Motivated by computational issues arising in approximate dynamic program-
ming, a function approximation method is developed in Chap. 7 that aims to
approximate a function in a small neighborhood of a state that travels within a
compact set. The development is based on the theory of universal reproducing
kernel Hilbert spaces over the n-dimensional Euclidean space. Several theorems are
introduced that support the development of this State Following (StaF) method. In
particular, it is shown that there is a bound on the number of kernel functions
required for the maintenance of an accurate function approximation as a state moves
through a compact set. Additionally, a weight update law, based on gradient des-
cent, is introduced where good accuracy can be achieved provided the weight
update law is iterated at a high enough frequency. Simulation results are presented
that demonstrate the utility of the StaF methodology for the maintenance of accurate
function approximation as well as solving the infinite-horizon optimal regulation
problem. The results of the simulation indicate that fewer basis functions are
required to guarantee stability and approximate optimality than are required when a
global approximation approach is used.
The authors would like to express their sincere appreciation to a number of individuals whose support made the book possible. Numerous intellectual discus-
sions and research support were provided by all of our friends and colleagues in the
Nonlinear Controls and Robotics Laboratory at the University of Florida, with
particular thanks to Shubhendu Bhasin, Patryk Deptula, Huyen Dinh, Keith Dupree,
Nic Fischer, Marcus Johnson, Justin Klotz, and Anup Parikh. Inspiration and
insights for our work were provided, in part, through discussions with and/or
reading foundational literature by Bill Hager, Michael Jury, Paul Robinson, Frank
Lewis (the academic grandfather or great grandfather to several of the authors),
Derong Liu, Anil Rao, Kyriakos Vamvoudakis, Richard Vinter, Daniel Liberzon,
and Draguna Vrabie. The research strategies and breakthroughs described in this
book would also not have been possible without funding support provided from
research sponsors including: NSF award numbers 0901491 and 1509516, Office of
Naval Research Grants N00014-13-1-0151 and N00014-16-1-2091, Prioria
Robotics, and the Air Force Research Laboratory, Eglin AFB. Most importantly, we
are eternally thankful for our families who are unwavering in their love, support,
and understanding.

Stillwater, OK, USA    Rushikesh Kamalapurkar
Panama City, FL, USA    Patrick Walters
Nashville, TN, USA Joel Rosenfeld
Gainesville, FL, USA Warren Dixon
January 2018
Contents

1 Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 The Bolza Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.1 Necessary Conditions for Optimality . . . . . . . . . . . . . . . . 3
1.4.2 Sufficient Conditions for Optimality . . . . . . . . . . . . . . . . 5
1.5 The Unconstrained Affine-Quadratic Regulator . . . . . . . . . . . . . . . 5
1.6 Input Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Connections with Pontryagin’s Maximum Principle . . . . . . . . . . . 9
1.8 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8.1 Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8.2 Differential Games and Equilibrium Solutions . . . . . . . . . 11
1.8.3 Viscosity Solutions and State Constraints . . . . . . . . . . . . 12
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Approximate Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Exact Dynamic Programming in Continuous Time and Space . . . . 17
2.2.1 Exact Policy Iteration: Differential and Integral
Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Value Iteration and Associated Challenges . . . . . . . . . . . . 22
2.3 Approximate Dynamic Programming in Continuous Time
and Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Some Remarks on Function Approximation . . . . . . . . . . . 23
2.3.2 Approximate Policy Iteration . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Development of Actor-Critic Methods . . . . . . . . . . . . . . . 25
2.3.4 Actor-Critic Methods in Continuous Time and Space . . . . 26
2.4 Optimal Control and Lyapunov Stability . . . . . . . . . . . . . . . . . . . 26


2.5 Differential Online Approximate Optimal Control . . . . . . . . . . . . 28
2.5.1 Reinforcement Learning-Based Online
Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.2 Linear-in-the-Parameters Approximation
of the Value Function . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Uncertainties in System Dynamics . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Persistence of Excitation and Parameter Convergence . . . . . . . . . . 33
2.8 Further Reading and Historical Remarks . . . . . . . . . . . . . . . . . . . 34
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Excitation-Based Online Approximate Optimal Control . . . . . . . . . . 43
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Online Optimal Regulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 Identifier Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.2 Least-Squares Update for the Critic . . . . . . . . . . . . . . . . . 49
3.2.3 Gradient Update for the Actor . . . . . . . . . . . . . . . . . . . . 50
3.2.4 Convergence and Stability Analysis . . . . . . . . . . . . . . . . 51
3.2.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Extension to Trajectory Tracking . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.1 Formulation of a Time-Invariant Optimal Control
Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.2 Approximate Optimal Solution . . . . . . . . . . . . . . . . . . . . 61
3.3.3 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4 N-Player Nonzero-Sum Differential Games . . . . . . . . . . . . . . . . . . 73
3.4.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4.2 Hamilton–Jacobi Approximation Via
Actor-Critic-Identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4.3 System Identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.4 Actor-Critic Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.4.5 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.4.6 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.5 Background and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 91
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4 Model-Based Reinforcement Learning for Approximate
Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.2 Model-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . 101
4.3 Online Approximate Regulation . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.3.1 System Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.3.2 Value Function Approximation . . . . . . . . . . . . . . . . . . . . 104
4.3.3 Simulation of Experience Via Bellman Error
Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

4.3.4 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107


4.3.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.4 Extension to Trajectory Tracking . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.4.1 Problem Formulation and Exact Solution . . . . . . . . . . . . . 118
4.4.2 Bellman Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.4.3 System Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.4.4 Value Function Approximation . . . . . . . . . . . . . . . . . . . . 121
4.4.5 Simulation of Experience . . . . . . . . . . . . . . . . . . . . . . . . 122
4.4.6 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.4.7 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.5 N-Player Nonzero-Sum Differential Games . . . . . . . . . . . . . . . . . . 131
4.5.1 System Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.5.2 Model-Based Reinforcement Learning . . . . . . . . . . . . . . . 133
4.5.3 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.5.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.6 Background and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 144
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5 Differential Graphical Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.2 Cooperative Formation Tracking Control of Heterogeneous
Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.2.1 Graph Theory Preliminaries . . . . . . . . . . . . . . . . . . . . . . 151
5.2.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.2.3 Elements of the Value Function . . . . . . . . . . . . . . . . . . . 153
5.2.4 Optimal Formation Tracking Problem . . . . . . . . . . . . . . . 153
5.2.5 System Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.2.6 Approximation of the Bellman Error and the Relative
Steady-State Controller . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.2.7 Value Function Approximation . . . . . . . . . . . . . . . . . . . . 160
5.2.8 Simulation of Experience via Bellman Error
Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.2.9 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.2.10 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.3 Reinforcement Learning-Based Network Monitoring . . . . . . . . . . . 180
5.3.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.3.2 System Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.3.3 Value Function Approximation . . . . . . . . . . . . . . . . . . . . 184
5.3.4 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.3.5 Monitoring Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.4 Background and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 189
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.2 Station-Keeping of a Marine Craft . . . . . . . . . . . . . . . . . . . . . . . . 196
6.2.1 Vehicle Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.2.2 System Identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.2.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
6.2.4 Approximate Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.2.5 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.2.6 Experimental Validation . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.3 Online Optimal Control for Path-Following . . . . . . . . . . . . . . . . . 213
6.3.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.3.2 Optimal Control and Approximate Solution . . . . . . . . . . . 215
6.3.3 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.3.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.3.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
6.4 Background and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 223
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7 Computational Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.2 Reproducing Kernel Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . . . 230
7.3 StaF: A Local Approximation Method . . . . . . . . . . . . . . . . . . . . . 232
7.3.1 The StaF Problem Statement . . . . . . . . . . . . . . . . . . . . . . 232
7.3.2 Feasibility of the StaF Approximation
and the Ideal Weight Functions . . . . . . . . . . . . . . . . . . . . 233
7.3.3 Explicit Bound for the Exponential Kernel . . . . . . . . . . . 235
7.3.4 The Gradient Chase Theorem . . . . . . . . . . . . . . . . . . . . . 237
7.3.5 Simulation for the Gradient Chase Theorem . . . . . . . . . . 240
7.4 Local Approximation for Efficient Model-Based
Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.4.1 StaF Kernel Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.4.2 StaF Kernel Functions for Online Approximate
Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.4.4 Extension to Systems with Uncertain Drift
Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
7.4.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
7.5 Background and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 260
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Appendix A: Supplementary Lemmas and Definitions . . . . . . . . . . . . . . . 265
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
Symbols

Lists of abbreviations and symbols used in definitions, lemmas, theorems, and the
development in the subsequent chapters.

R                        Set of real numbers
R_{≥a} (R_{≤a})          Set of real numbers greater (less) than or equal to a
R_{>a} (R_{<a})          Set of real numbers strictly greater (less) than a
R^n                      n-dimensional real Euclidean space
R^{n×m}                  The space of n × m matrices of real numbers
C^n                      n-dimensional complex Euclidean space
C^n(D_1, D_2)            The space of n-times continuously differentiable functions with domain D_1 and codomain D_2; the domain and the codomain are suppressed when clear from the context
I_n                      n × n identity matrix
0_{n×n}                  n × n matrix of zeros
1_{n×n}                  n × n matrix of ones
diag{x_1, ..., x_n}      Diagonal matrix with x_1, ..., x_n on the diagonal
∈                        Belongs to
∀                        For all
⊂                        Subset of
≜                        Equals by definition
f : D_1 → D_2            A function f with domain D_1 and codomain D_2
→                        Approaches
↦                        Maps to
⇒                        Implies that
∗                        Convolution operator
|·|                      Absolute value
‖·‖                      Euclidean norm
‖·‖_F                    Frobenius norm, ‖h‖_F = √(tr(h^T h))
‖·‖_∞                    Induced infinity norm
λ_min                    Minimum eigenvalue
λ_max                    Maximum eigenvalue
ẋ, ẍ, ..., x^(i)         First, second, ..., ith time derivative of x
∂f(x, y, ...)/∂y         Partial derivative of f with respect to y
∇_y f(x, y, ...)         Gradient of f with respect to y
∇f(x, y, ...)            Gradient of f with respect to the first argument
B_r                      The ball {x ∈ R^n | ‖x‖ < r}
B_r(y)                   The ball {x ∈ R^n | ‖x − y‖ < r}
Ā                        Closure of a set A
int(A)                   Interior of a set A
∂(A)                     Boundary of a set A
1_A                      Indicator function of a set A
L_∞(D_1, D_2)            Space of uniformly essentially bounded functions with domain D_1 and codomain D_2; the domain and the codomain are suppressed when clear from the context
sgn(·)                   Vector and scalar signum function
tr(·)                    Trace of a matrix
vec(·)                   Stacks the columns of a matrix to form a vector
proj(·)                  A smooth projection operator
[·]_×                    Skew-symmetric cross product matrix
Chapter 1
Optimal Control

1.1 Introduction

The ability to learn behaviors from interactions with the environment is a desirable
characteristic of a cognitive agent. Typical interactions between an agent and its
environment can be described in terms of actions, states, and rewards (or penalties).
Actions executed by the agent affect the state of the system (i.e., the agent and the
environment), and the agent is presented with a reward (or a penalty). Assuming that
the agent chooses an action based on the state of the system, the behavior (or the
policy) of the agent can be described as a map from the state-space to the action-space.
Desired behaviors can be learned by adjusting the agent-environment interaction
through the rewards/penalties. Typically, the rewards/penalties are qualified by a cost.
For example, in many applications, the correctness of a policy is often quantified in
terms of the Lagrange cost and the Mayer cost. The Lagrange cost is the cumulative
penalty accumulated along a path traversed by the agent and the Mayer cost is the
penalty at the boundary. Policies with lower total cost are considered better and
policies that minimize the total cost are considered optimal. The problem of finding
the optimal policy that minimizes the total Lagrange and Mayer cost is known as the
Bolza optimal control problem.

1.2 Notation

Throughout the book, unless otherwise specified, the domain of all the functions is
assumed to be R≥0 . Function names corresponding to state and control trajectories are
reused to denote elements in the range of the function. For example, the notation u (·)
is used to denote the function u : R≥t0 → Rm , the notation u is used to denote an arbi-
trary element of Rm , and the notation u (t) is used to denote the value of the function
u (·) evaluated at time t. Unless otherwise specified, all the mathematical quanti-
ties are assumed to be time-varying, an equation of the form g (x) = f + h (y, t)
is interpreted as g (x (t)) = f (t) + h (y (t) , t) for all t ∈ R≥0 , and a definition of
the form g (x, y)  f (y) + h (x) for functions g : A × B → C, f : B → C and


h : A → C is interpreted as g(x, y) ≜ f(y) + h(x), ∀(x, y) ∈ A × B. The notation ‖h‖_χ denotes sup_{ξ∈χ} ‖h(ξ)‖, for a continuous function h : R^n → R^k and a compact set χ. When the compact set is clear from the context, the notation ‖h‖ is utilized.

1.3 The Bolza Problem

Consider a controlled dynamical system described by the initial value problem

ẋ (t) = f (x (t) , u (t) , t) , x (t0 ) = x0 , (1.1)

where t0 is the initial time, x : R≥t0 → Rn denotes the system state and u : R≥t0 →
U ⊂ Rm denotes the control input, and U denotes the action-space.
To ensure local existence and uniqueness of Carathéodory solutions to (1.1), it is
assumed that the function f : Rn × U × R≥t0 → Rn is continuous with respect to
t and u, and continuously differentiable with respect to x. Furthermore, the control
signal, u (·), is restricted to be piecewise continuous. The assumptions stated here are
sufficient but not necessary to ensure local existence and uniqueness of Carathéodory
solutions to (1.1). For further discussion on existence and uniqueness of Carathéodory
solutions, see [1, 2]. Further restrictions on the dynamical system are stated, when
necessary, in subsequent chapters.
Consider a fixed final time optimal control problem where the optimality of a
control policy is quantified in terms of a cost functional

J(t0, x0, u(·)) = ∫_{t0}^{tf} L(x(t; t0, x0, u(·)), u(t), t) dt + Φ(x_f),     (1.2)

where L : R^n × U × R≥0 → R is the Lagrange cost, Φ : R^n → R is the Mayer cost, and tf and x_f ≜ x(tf) denote the final time and state, respectively. In (1.2),
the notation x (t; t0 , x0 , u (·)) is used to denote a trajectory of the system in (1.1),
evaluated at time t, under the controller u (·), starting at the initial time t0 , and with
the initial state x0. Similarly, for a given policy φ : R^n → R^m, the short notation
x (t; t0 , x0 , φ (x (·))) is used to denote a trajectory under the feedback controller
u (t) = φ (x (t; t0 , x0 , u (·))). Throughout the book, the symbol x is also used to
denote generic initial conditions in Rn . Furthermore, when the controller, the initial
time, and the initial state are understood from the context, the shorthand x (·) is used
when referring to the entire trajectory, and the shorthand x (t) is used when referring
to the state of the system at time t.
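To make the problem data concrete, the sketch below evaluates the Bolza cost (1.2) for a given control signal by integrating the dynamics (1.1) together with the running cost; the dynamics, costs, and controller used here are hypothetical placeholders rather than examples from the book.

```python
# Minimal sketch (not from the book): numerically evaluating the Bolza cost (1.2)
# for an open-loop controller u(.) by integrating the dynamics (1.1) together with
# the accumulated Lagrange cost. All problem data below are placeholders.
import numpy as np
from scipy.integrate import solve_ivp

def f(x, u, t):                       # system dynamics, x_dot = f(x, u, t)
    return np.array([x[1], -x[0] + u[0]])

def L(x, u, t):                       # Lagrange (running) cost
    return x @ x + u @ u

def Phi(x):                           # Mayer (terminal) cost
    return 10.0 * (x @ x)

def u_of_t(t):                        # an arbitrary piecewise continuous controller
    return np.array([np.sin(t)])

def bolza_cost(x0, t0, tf):
    def aug_dynamics(t, z):           # z = [x; accumulated Lagrange cost]
        x, u = z[:-1], u_of_t(t)
        return np.append(f(x, u, t), L(x, u, t))
    sol = solve_ivp(aug_dynamics, (t0, tf), np.append(x0, 0.0), rtol=1e-8)
    x_f, running = sol.y[:-1, -1], sol.y[-1, -1]
    return running + Phi(x_f)         # J(t0, x0, u(.)) as defined in (1.2)

print(bolza_cost(np.array([1.0, 0.0]), t0=0.0, tf=5.0))
```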
The two most popular approaches to solve Bolza problems are Pontryagin’s max-
imum principle and dynamic programming. The two approaches are independent,
both conceptually and in terms of their historic development. Both the approaches
are developed on the foundation of calculus of variations, which has its origins in
Newton’s Minimal Resistance Problem dating back to 1685 and Johann Bernoulli’s
Brachistochrone problem dating back to 1696. The maximum principle was devel-
oped by the Pontryagin school at the Steklov Institute in the 1950s [3]. The devel-
opment of dynamic programming methods was simultaneously but independently
initiated by Bellman at the RAND Corporation [4]. While Pontryagin’s maximum
principle results in optimal control methods that generate optimal state and control
trajectories starting from a specific state, dynamic programming results in methods
that generate optimal policies (i.e., they determine the optimal decision to be made
at any state of the system).
Barring some comparative remarks, the rest of this monograph will focus on the
dynamic programming approach to solve Bolza problems. The interested reader is
directed to the books by Kirk [5], Bryson and Ho [6], Liberzon [7], and Vinter [8]
for an in-depth discussion of Pontryagin’s maximum principle.

1.4 Dynamic Programming

Dynamic programming methods generalize the Bolza problem. Instead of solving the fixed final time Bolza problem for particular values of t0, tf, and x, a family of Bolza problems characterized by the cost functionals

J(t, x, u(·)) = ∫_t^{tf} L(x(τ; t, x, u(·)), u(τ), τ) dτ + Φ(x_f)     (1.3)

is solved, where t ∈ [t0, tf], tf ∈ R≥0, and x ∈ R^n. A solution to the family of Bolza problems in (1.3) can be characterized using the optimal cost-to-go function (i.e., the optimal value function) V* : R^n × R≥0 → R, defined as

V*(x, t) ≜ inf_{u[t,tf]} J(t, x, u(·)),     (1.4)

where the notation u[t,τ] for τ ≥ t ≥ t0 denotes the controller u(·) restricted to the time interval [t, τ].

1.4.1 Necessary Conditions for Optimality

In the subsequent development, a set of necessary conditions for the optimality of the value function is developed based on Bellman's principle of optimality.

Theorem 1.1 [7, p. 160] The value function, V*, satisfies the principle of optimality. That is, for all (x, t) ∈ R^n × [t0, tf], and for all Δt ∈ (0, tf − t],

V*(x, t) = inf_{u[t,t+Δt]} { ∫_t^{t+Δt} L(x(τ), u(τ), τ) dτ + V*(x(t + Δt), t + Δt) }.     (1.5)
Proof Consider the function V : R^n × [t0, tf] → R defined as

V(x, t) ≜ inf_{u[t,t+Δt]} { ∫_t^{t+Δt} L(x(τ), u(τ), τ) dτ + V*(x(t + Δt), t + Δt) }.

Based on the definition in (1.4),

V(x, t) = inf_{u[t,t+Δt]} { ∫_t^{t+Δt} L(x(τ), u(τ), τ) dτ + inf_{u[t+Δt,tf]} J(t + Δt, x(t + Δt), u(·)) }.

Using (1.3) and combining the integrals,

V(x, t) = inf_{u[t,t+Δt]} inf_{u[t+Δt,tf]} J(t, x, u(·)) ≥ inf_{u[t,tf]} J(t, x, u(·)) = V*(x, t).     (1.6)
Thus, V(x, t) ≥ V*(x, t). On the other hand, by the definition of the infimum, for all ε > 0, there exists a controller u_ε(·) such that

V*(x, t) + ε ≥ J(t, x, u_ε(·)).

Let x_ε denote the trajectory corresponding to u_ε. Then,

J(t, x, u_ε) = ∫_t^{t+Δt} L(x_ε(τ), u_ε(τ), τ) dτ + J(t + Δt, x_ε(t + Δt), u_ε)
             ≥ ∫_t^{t+Δt} L(x_ε(τ), u_ε(τ), τ) dτ + V*(x_ε(t + Δt), t + Δt) ≥ V(x, t).

Thus, V(x, t) ≤ V*(x, t), which, along with (1.6), implies V(x, t) = V*(x, t).
Under the assumption that V* ∈ C^1(R^n × [t0, tf], R), the optimal value function can be shown to satisfy

0 = −∇_t V*(x, t) − inf_{u∈U} { L(x, u, t) + ∇x V*^T(x, t) f(x, u, t) },

for all t ∈ [t0, tf] and all x ∈ R^n, with the boundary condition V*(x, tf) = Φ(x), for all x ∈ R^n. In fact, the Hamilton–Jacobi–Bellman equation along with a Hamiltonian maximization condition completely characterize the solution to the family of Bolza problems.
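The recursion in (1.5) also suggests a direct numerical procedure: discretize time and state, replace the infimum with a minimum over a sampled control set, and march backward from the boundary condition V*(x, tf) = Φ(x). The sketch below is a hypothetical scalar example of this construction (not taken from the book); it only illustrates the principle of optimality, not the methods developed in later chapters.

```python
# Minimal sketch (hypothetical scalar example): approximating V* by applying the
# principle of optimality (1.5) backward in time on a grid, i.e.,
#   V(x, t) ~ min_u [ L(x, u, t) dt + V(x + f(x, u, t) dt, t + dt) ].
import numpy as np

f = lambda x, u, t: -x + u                 # scalar dynamics x_dot = f(x, u, t)
L = lambda x, u, t: x**2 + u**2            # Lagrange cost
Phi = lambda x: np.zeros_like(x)           # Mayer cost

t0, tf, dt = 0.0, 2.0, 0.01
xs = np.linspace(-2.0, 2.0, 201)           # state grid
us = np.linspace(-3.0, 3.0, 121)           # sampled control set U

V = Phi(xs)                                # boundary condition V(x, tf) = Phi(x)
for _ in range(int((tf - t0) / dt)):       # march backward from tf to t0
    x_next = xs[:, None] + f(xs[:, None], us[None, :], None) * dt
    candidates = L(xs[:, None], us[None, :], None) * dt + np.interp(x_next, xs, V)
    V = candidates.min(axis=1)             # minimize over the sampled controls

print(V[np.searchsorted(xs, 0.5)])         # approximate V*(0.5, t0)
```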

1.4.2 Sufficient Conditions for Optimality

Theorem 1.2 presents necessary and sufficient conditions for a function to be the optimal value function.

Theorem 1.2 Let V* ∈ C^1(R^n × [t0, tf], R) denote the optimal value function. Then, a function V : R^n × [t0, tf] → R is the optimal value function (i.e., V(x, t) = V*(x, t) for all (x, t) ∈ R^n × [t0, tf]) if and only if:

1. V ∈ C^1(R^n × [t0, tf], R) and V satisfies the Hamilton–Jacobi–Bellman equation

   0 = −∇_t V(x, t) − inf_{u∈U} { L(x, u, t) + ∇x V^T(x, t) f(x, u, t) },     (1.7)

   for all t ∈ [t0, tf] and all x ∈ R^n, with the boundary condition V(x, tf) = Φ(x), for all x ∈ R^n.
2. For all x ∈ R^n, there exists a controller u(·), such that the function V, the controller u(·), and the trajectory x(·) of (1.1) under u(·) with the initial condition x(t0) = x, satisfy the equation

   L(x(t), u(t), t) + ∇x V^T(x(t), t) f(x(t), u(t), t) = min_{û∈U} { L(x(t), û, t) + ∇x V^T(x(t), t) f(x(t), û, t) },     (1.8)

   for all t ∈ [t0, tf].

Proof See [7, Sect. 5.1.4].


1.5 The Unconstrained Affine-Quadratic Regulator

The focus of this monograph is on unconstrained infinite-horizon total cost Bolza problems for nonlinear systems that are affine in the controller and cost functions that
are quadratic in the controller. That is, optimal control problems where the system
dynamics are of the form

ẋ (t) = f (x (t)) + g (x (t)) u (t) , (1.9)

where f : R^n → R^n and g : R^n → R^{n×m} are locally Lipschitz functions, and the cost functional is of the form

J(t0, x0, u(·)) = ∫_{t0}^{∞} r(x(τ; t0, x0, u(·)), u(τ)) dτ,     (1.10)

where the local cost r : Rn × Rm → R is defined as

r(x, u) ≜ Q(x) + u^T R u,     (1.11)

where Q : R^n → R is a positive definite function and R ∈ R^{m×m} is a symmetric positive definite matrix.
To ensure that the optimal control problem is well-posed, the minimization prob-
lem is constrained to the set of admissible controllers (see [9, Definition 1]), and the
existence of at least one admissible controller is assumed. It is further assumed that
the optimal control problem has a continuously differentiable value function. This
assumption is valid for a large class of problems. For example, most unconstrained
infinite horizon optimal control problems with smooth data have smooth value func-
tions. However, there is a large class of relevant optimal control problems for which
the assumption fails. For example, problems with bounded controls and terminal
costs typically have nondifferentiable value functions. Dynamic programming-based
solutions to such problems are characterized by viscosity solutions to the correspond-
ing Hamilton–Jacobi–Bellman equation. For further details on viscosity solutions to
Hamilton–Jacobi–Bellman equations, the reader is directed to [10] and [11].
Provided the aforementioned assumptions hold, the optimal value function is time-independent. That is,

V*(x) ≜ inf_{u[t,∞]} J(t, x, u(·)),     (1.12)

for all t ∈ R≥t0 . Furthermore, the Hamiltonian minimization condition in (1.8) is sat-
isfied by the controller u (t) = u ∗ (x (t)) , where the policy u ∗ : Rn → Rm is defined
as
u*(x) = −(1/2) R^{-1} g^T(x) (∇x V*(x))^T.     (1.13)
Hence, assuming that an optimal controller exists, a complete characterization of the
solution to the optimal control problem can be obtained using the Hamilton–Jacobi–
Bellman equation.
Remark 1.3 While infinite-horizon optimal control problems naturally arise in feedback control applications where stability is of paramount importance, path planning applications often involve finite-horizon optimal control problems. The method of dynamic programming has been studied extensively for finite-horizon problems [12–20], although such problems are outside the scope of this monograph.

Remark 1.4 The control-affine model in (1.9) is applicable to a wide variety of electro-mechanical systems. In particular, any linear system and any Euler-Lagrange
nonlinear system that has a known and invertible inertia matrix can be modeled
using a control-affine model. Examples include industrial manipulators, fully actu-
ated autonomous underwater and air vehicles (where the range of operation does not
include singular configurations), kinematic wheels, etc. Computation of the policy
in (1.13) exploits the control-affine nature of the dynamics, and knowledge of the
control effectiveness function, g, is required to implement the policy. The meth-
ods detailed in this monograph can be extended to systems with uncertain control
effectiveness functions and to nonaffine systems (cf. [21–28]).
The following theorem fully characterizes solutions to optimal control problems
for affine systems.

Theorem 1.5 For a nonlinear system described by (1.9), V* ∈ C^1(R^n, R) is the optimal value function corresponding to the cost functional (1.10) if and only if it satisfies the Hamilton–Jacobi–Bellman equation

r(x, u*(x)) + ∇x V*(x)(f(x) + g(x) u*(x)) = 0,  ∀x ∈ R^n,     (1.14)

with the boundary condition V*(0) = 0. Furthermore, the optimal controller can be expressed as the state-feedback law u(t) = u*(x(t)).

Proof For each x ∈ R^n, we have

∂(r(x, u) + ∇x V*(x)(f(x) + g(x)u)) / ∂u = 2u^T R + ∇x V*(x) g(x).

Hence, u = −(1/2) R^{-1} g^T(x) (∇x V*(x))^T = u*(x) extremizes r(x, u) + ∇x V*(x)(f(x) + g(x)u). Furthermore, the Hessian

∂²(r(x, u) + ∇x V*(x)(f(x) + g(x)u)) / ∂u² = 2R

is positive definite. Hence, u = u*(x) minimizes r(x, u) + ∇x V*(x)(f(x) + g(x)u). Hence, Eq. (1.14) is equivalent to the conditions in (1.7) and (1.8).
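For the classical linear-quadratic special case, f(x) = Ax, g(x) = B, and Q(x) = x^T Q x, the optimal value function is the quadratic form V*(x) = x^T P x, where P is the stabilizing solution of the algebraic Riccati equation, and (1.13) reduces to u*(x) = −R^{-1} B^T P x. The sketch below (an illustration, not code from the book) checks numerically that this V* makes the left-hand side of (1.14) vanish.

```python
# Minimal sketch (standard linear-quadratic special case, not from the book):
# with f(x) = A x, g(x) = B, Q(x) = x^T Q x, the value function V*(x) = x^T P x,
# where P solves the algebraic Riccati equation, satisfies (1.14), and (1.13)
# gives u*(x) = -R^{-1} B^T P x.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[2.0]])

P = solve_continuous_are(A, B, Q, R)   # A^T P + P A - P B R^-1 B^T P + Q = 0

def u_star(x):                         # policy (1.13): -1/2 R^-1 g^T(x) (grad_x V*(x))^T
    return -0.5 * np.linalg.solve(R, B.T @ (2.0 * P @ x))

def hjb_residual(x):                   # left-hand side of (1.14)
    u = u_star(x)
    grad_V = 2.0 * P @ x               # (grad_x V*(x))^T for V*(x) = x^T P x
    return x @ Q @ x + u @ R @ u + grad_V @ (A @ x + B @ u)

x = np.array([1.3, -0.7])
print(hjb_residual(x))                 # ~0 up to numerical round-off
```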

1.6 Input Constraints

The Bolza problem detailed in the previous section is an unconstrained optimal control problem. In practice, actuators are limited in the amount of control effort they can produce. Let u_i denote the ith component of the control vector u. The affine-quadratic formulation can be extended to systems with actuator constraints of the form |u_i(t)| ≤ ū, ∀t ∈ R≥t0, ∀i = 1, ..., m, using a non-quadratic penalty function first introduced in [29].
Let ψ : R → R be a strictly monotonically increasing continuously differentiable function such that sgn(ψ(a)) = sgn(a), ∀a ∈ R, and |ψ(a)| ≤ ū (e.g., ψ(a) = tanh(a)). Consider a cost function of the form r(x, u) = Q(x) + U(u), where

U(u) ≜ 2 Σ_{i=1}^{m} r_i ∫_0^{u_i} ψ^{-1}(ξ) dξ,     (1.15)

and r_i denotes the ith diagonal element of the matrix R.

The following theorem characterizes the solutions to optimal control problems
for affine systems with actuation constraints.
Theorem 1.6 For a nonlinear system described by (1.9), V ∗ ∈ C 1 (Rn , R) is the
optimal value function corresponding to the cost functional in (1.10), with the control
penalty in (1.15), if and only if it satisfies the Hamilton–Jacobi–Bellman equation

r (x, φ (x)) + ∇x V ∗ (x) ( f (x) + g (x) φ (x)) = 0, ∀x ∈ Rn , (1.16)



with the boundary condition V*(0) = 0, where φ(x) ≜ −ψ((1/2) R^{-1} g^T(x) (∇x V*(x))^T). Furthermore, the optimal controller can be expressed as the state-feedback law u(t) = u*(x(t)), where

u*(x) ≜ −ψ((1/2) R^{-1} g^T(x) (∇x V*(x))^T).

Proof For each x ∈ R^n,

∂(r(x, u) + ∇x V*(x)(f(x) + g(x)u)) / ∂u = 2ψ^{-1}(u)^T R + ∇x V*(x) g(x).

Hence, u = −ψ((1/2) R^{-1} g^T(x) (∇x V*(x))^T) extremizes r(x, u) + ∇x V*(x)(f(x) + g(x)u). Furthermore, the Hessian is

∂²(r(x, u) + ∇x V*(x)(f(x) + g(x)u)) / ∂u² = 2R diag{∇_{u_1} ψ^{-1}(u_1), ..., ∇_{u_m} ψ^{-1}(u_m)}.

Provided the function ψ is strictly monotonically increasing, the Hessian is positive definite. Hence, u = −ψ((1/2) R^{-1} g^T(x) (∇x V*(x))^T) minimizes r(x, u) + ∇x V*(x)(f(x) + g(x)u).
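To make the constrained formulation concrete, the sketch below evaluates the penalty (1.15) by numerical quadrature for the choice ψ(a) = tanh(a) (so the actuator bound is ū = 1) and forms the saturated policy of Theorem 1.6; the value-function gradient and control effectiveness used here are hypothetical placeholders.

```python
# Minimal sketch (not from the book): the non-quadratic penalty (1.15) and the
# saturated policy of Theorem 1.6 for psi(a) = tanh(a), i.e., control bound 1.
# The value-function gradient and control effectiveness below are placeholders.
import numpy as np
from scipy.integrate import quad

R = np.diag([2.0, 1.0])                  # symmetric positive definite R (diagonal here)
psi, psi_inv = np.tanh, np.arctanh

def U(u):                                # (1.15): 2 * sum_i r_i * int_0^{u_i} psi^{-1}(xi) d xi
    return 2.0 * sum(R[i, i] * quad(psi_inv, 0.0, u[i])[0] for i in range(len(u)))

def u_star(x, grad_V, g):                # -psi( 1/2 R^-1 g^T(x) (grad_x V*(x))^T )
    return -psi(0.5 * np.linalg.solve(R, g(x).T @ grad_V(x)))

grad_V = lambda x: 2.0 * x               # hypothetical gradient, e.g., of V*(x) = ||x||^2
g = lambda x: np.eye(2)                  # hypothetical constant control effectiveness

x = np.array([0.8, -1.5])
u = u_star(x, grad_V, g)
print(u, U(u))                           # each |u_i| < 1 by construction; U(u) >= 0
```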

1.7 Connections with Pontryagin’s Maximum Principle

To apply Pontryagin's maximum principle to the unconstrained affine-quadratic regulator, define the Hamiltonian H : R^n × U × R^n → R as

H(x, u, p) = p^T (f(x) + g(x) u) − r(x, u).

Pontryagin's maximum principle provides the following necessary condition for optimality.

Theorem 1.7 [3, 5, 7] Let x* : R≥t0 → R^n and u* : R≥t0 → U denote the optimal state and control trajectories corresponding to the optimal control problem in Sect. 1.5. Then there exists a trajectory p* : R≥t0 → R^n such that p*(t) ≠ 0 for some t ∈ R≥t0, and x* and p* satisfy the equations

ẋ*(t) = (∇_p H(x*(t), u*(t), p*(t)))^T,
ṗ*(t) = −(∇x H(x*(t), u*(t), p*(t)))^T,

with the boundary condition x*(t0) = x0. Furthermore, the Hamiltonian satisfies

H(x*(t), u*(t), p*(t)) ≥ H(x*(t), u, p*(t)),     (1.17)

for all t ∈ R≥t0 and u ∈ U, and

H(x*(t), u*(t), p*(t)) = 0,     (1.18)

for all t ∈ R≥t0.

Proof See, e.g., [7, Sect. 4.2].


Under further assumptions on the state and the control trajectories, and on the
functions f, g, and r, the so-called natural transversality condition lim_{t→∞} p(t) = 0
can be obtained (cf. [30–32]). The natural transversality condition does not hold in
general for infinite horizon optimal control problems. For some illustrative coun-
terexamples and further discussion, see [30–35].
A quick comparison of Eq. (1.14) and (1.18) suggests that the optimal costate should satisfy

p*(t) = −(∇x V(x*(t)))^T.     (1.19)

Differentiation of (1.19) with respect to time yields

ṗ*(t) = −∇x(∇x V(x*(t)))^T (f(x*(t)) + g(x*(t)) u(t)).

Differentiation of (1.14) with respect to the state yields


(f(x*) + g(x*) u)^T ∇x(∇x V(x*))^T = −∇x V(x*)(∇x f(x*) + ∇x(g(x*) u)) − ∇x r(x*, u*).

Provided the second derivatives are continuous, ∇x(∇x V(x*))^T = (∇x(∇x V(x*))^T)^T. Hence, the time derivative of the costate can be computed as

ṗ*(t) = (∇x(f(x*(t)) + g(x*(t)) u(t)))^T (∇x V(x*(t)))^T + (∇x r(x*(t), u*(t)))^T
      = −(∇x H(x*(t), u*(t), p*(t)))^T.

Therefore, the expression of the costate in (1.19) satisfies Theorem 1.7. The relation-
ship in (1.19) implies that the costate is the sensitivity of the optimal value function to
changes in the system state trajectory. Furthermore, the Hamiltonian maximization
conditions in (1.8) and (1.17) are equivalent. Dynamic programming and Pontrya-
gin’s maximum principle methods are therefore closely related. However, there are
a few key differences between the two methods.
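Before turning to those differences, the relationship in (1.19) can be checked concretely for the linear-quadratic special case used in the earlier sketch (again a hypothetical illustration, not from the book): with V*(x) = x^T P x, the costate candidate is p*(t) = −2 P x*(t), and it satisfies the costate equation of Theorem 1.7 along the closed-loop optimal trajectory.

```python
# Minimal sketch (hypothetical linear-quadratic illustration): verifying that the
# costate candidate p*(t) = -(grad_x V*(x*(t)))^T = -2 P x*(t) from (1.19) satisfies
# p_dot = -(grad_x H)^T of Theorem 1.7, with H(x, u, p) = p^T (A x + B u) - x^T Q x - u^T R u.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[2.0]])
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)              # u*(x) = -K x, the feedback form of (1.13)

x = np.array([1.3, -0.7])                    # a point on the optimal trajectory
p = -2.0 * P @ x                             # costate candidate from (1.19)

x_dot = (A - B @ K) @ x                      # closed-loop dynamics under u*
p_dot_candidate = -2.0 * P @ x_dot           # time derivative of -2 P x*(t)
p_dot_pmp = -(A.T @ p - 2.0 * Q @ x)         # -(grad_x H)^T evaluated at (x, u*(x), p)

print(np.allclose(p_dot_candidate, p_dot_pmp))  # True: (1.19) is consistent with Theorem 1.7
```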
The solution in (1.13) obtained using dynamic programming is a feedback law.
That is, dynamic programming can be used to generate a policy that can be used to
close the control loop. Furthermore, once the Hamilton–Jacobi–Bellman equation is
solved, the resulting feedback law is guaranteed to be optimal for any initial condi-
tion of the dynamical system. On the other hand, Pontryagin’s maximum principle
generates the optimal state, costate, and control trajectories for a given initial condi-
tion. The controller must be implemented in an open-loop manner. Furthermore, if
the initial condition changes, the optimal solution is no longer valid and the optimal
control problem needs to be solved again.
Since dynamic programming generates a feedback law, it provides much more
information than the maximum principle. However, the added benefit comes at a
heavy computational cost. To generate the optimal policy, the Hamilton–Jacobi–
Bellman partial differential equation must be solved. In general, numerical methods
to solve the Hamilton–Jacobi–Bellman equation grow exponentially in numerical
complexity with increasing dimensionality. That is, dynamic programming suffers
from the so-called Bellman’s curse of dimensionality.

1.8 Further Reading

1.8.1 Numerical Methods

One way to develop optimal controllers for general nonlinear systems is to use
numerical methods [5]. A common approach is to formulate the optimal control
problem in terms of a Hamiltonian and then to numerically solve a two point boundary
value problem for the state and co-state equations [36, 37]. Another approach is to
cast the optimal control problem as a nonlinear programming problem via direct
transcription and then solve the resulting nonlinear program [30, 38–42]. Numerical
methods are offline, do not generally guarantee stability or optimality, and are often
open-loop. These issues motivate the desire to find an analytical solution. Developing
analytical solutions to optimal control problems for linear systems is complicated
by the need to solve an algebraic Riccati equation or a differential Riccati equation.
Developing analytical solutions for nonlinear systems is even further complicated by
the sufficient condition of solving a Hamilton–Jacobi–Bellman partial differential
equation, where an analytical solution may not exist in general. If the nonlinear
dynamics are exactly known, then the problem can be simplified at the expense of
optimality by solving an algebraic Riccati equation through feedback-linearization
methods (cf. [43–47]).
Alternatively, some investigators temporarily assume that the uncertain system
could be feedback-linearized, solve the resulting optimal control problem, and then
use adaptive/learning methods to asymptotically learn the uncertainty [48–51] (i.e.,
asymptotically converge to the optimal controller). The nonlinear optimal control
problem can also be solved using inverse optimal control [52–61] by circumvent-
ing the need to solve the Hamilton–Jacobi–Bellman equation. By finding a control
Lyapunov function, which can be shown to also be a value function, an optimal
controller can be developed that optimizes a derived cost. However, since the cost is
derived rather than specified by mission/task objectives, this approach is not explored
in this monograph. Optimal control-based algorithms such as state dependent Ric-
cati equations [62–65] and model-predictive control [66–72] have been widely uti-
lized for control of nonlinear systems. However, both state dependent Riccati equa-
tions and model-predictive control are inherently model-based. Furthermore, due
to nonuniqueness of state dependent linear factorization in state dependent Riccati
equations-based techniques, and since the optimal control problem is solved over a
small prediction horizon in model-predictive control, they generally result in subopti-
mal policies. Furthermore, model-predictive control approaches are computationally
intensive, and closed-loop stability of state dependent Riccati equations-based meth-
ods is generally impossible to establish a priori and has to be established through
extensive simulation.

1.8.2 Differential Games and Equilibrium Solutions

A multitude of relevant control problems can be modeled as multi-input systems, where each input is computed by a player, and each player attempts to influence the
system state to minimize its own cost function. In this case, the optimization problem
for each player is coupled with the optimization problem for other players. Hence,
in general, an optimal solution in the usual sense does not exist for such problems,
motivating the formulation of alternative optimality criteria.
Differential game theory provides solution concepts for many multi-player, multi-
objective optimization problems [73–75]. For example, a set of policies is called
a Nash equilibrium solution to a multi-objective optimization problem if none of
the players can improve their outcome by changing their policy while all the other
players abide by the Nash equilibrium policies [76]. Thus, Nash equilibrium solutions
provide a secure set of strategies, in the sense that none of the players have an incentive
to diverge from their equilibrium policy. Hence, Nash equilibrium has been a widely
used solution concept in differential game-based control techniques. For an in-depth
discussion on Nash equilibrium solutions to differential game problems, see Chaps. 3
and 4.
Differential game theory is also employed in multi-agent optimal control, where
each agent has its own decentralized objective and may not have access to the entire
system state. In this case, graph theoretic models of the information structure are
utilized in a differential game framework to formulate coupled Hamilton–Jacobi
equations (c.f. [77]). Since the coupled Hamilton–Jacobi equations are difficult to
solve, reinforcement learning is often employed to get an approximate solution.
Results such as [77, 78] indicate that adaptive dynamic programming can be used
to generate approximate optimal policies online for multi-agent systems. For an in-
depth discussion on the use of graph theoretic models of information structure in a
differential game framework, see Chap. 5.

1.8.3 Viscosity Solutions and State Constraints

A significant portion of optimal control problems of practical importance requires the solution to satisfy state constraints. For example, autonomous vehicles operating
in complex contested environments are required to observe strict static (e.g., due to
policy or mission objectives or known obstacles/structures in the environment) and
dynamic (e.g., unknown and then sensed obstacles, moving obstacles) no-entry zones.
The value functions corresponding to optimal control problems with state constraints
are generally not continuously differentiable, and may not even be differentiable
everywhere. Hence, for these problems, the Hamilton–Jacobi–Bellman equation fails
to admit classical solutions, and alternative solution concepts are required. A naive
generalization would be to require a function to satisfy the Hamilton–Jacobi–Bellman
equation almost everywhere. However, the naive generalization is not useful for
optimal control since such generalized solutions are often unrelated to the value
function of the corresponding optimal control problem.
An appropriate notion of generalized solutions to the Hamilton–Jacobi–Bellman
equation, called viscosity solutions, was developed in [10]. It has been established
that, provided the value function is continuous, it is a viscosity solution to
the Hamilton–Jacobi–Bellman equation. Some uniqueness results are also available
under further assumptions on the value function. For a detailed treatment of viscosity
solutions to Hamilton–Jacobi–Bellman equations, see [79].
Various methods have been developed to approximate viscosity solutions to
Hamilton–Jacobi–Bellman equations [79–81]; however, these methods are offline,
require knowledge of the system dynamics, and are computationally expensive.
Online computation of approximate classical solutions to the Hamilton–Jacobi–
Bellman equation is achieved through dynamic programming methods. Dynamic
programming methods in continuous state and time rely on a differential [82] or an
integral [83] formulation of the temporal difference error (called the Bellman error).
The corresponding reinforcement learning algorithms are generally designed to min-
imize the Bellman error. Since such minimization yields estimates of generalized
solutions, but not necessarily viscosity solutions, to the Hamilton–Jacobi–Bellman
equation, reinforcement learning in continuous time and space for optimal control
problems with state constraints has largely remained an open area of research.

References

1. Carathéodory C (1918) Vorlesungen über reelle Funktionen. Teubner


2. Coddington EA, Levinson N (1955) Theory of ordinary differential equations. McGraw-Hill
3. Pontryagin LS, Boltyanskii VG, Gamkrelidze RV, Mishchenko EF (1962) The mathematical
theory of optimal processes. Interscience, New York
4. Bellman R (1954) The theory of dynamic programming. Technical report, DTIC Document
5. Kirk D (2004) Optimal control theory: an introduction. Dover, Mineola
6. Bryson AE, Ho Y (1975) Applied optimal control: optimization, estimation, and control. Hemi-
sphere Publishing Corporation
7. Liberzon D (2012) Calculus of variations and optimal control theory: a concise introduction.
Princeton University Press
8. Vinter R (2010) Optimal control. Springer Science & Business Media
9. Beard R, Saridis G, Wen J (1997) Galerkin approximations of the generalized Hamilton–
Jacobi–Bellman equation. Automatica 33:2159–2178
10. Crandall M, Lions P (1983) Viscosity solutions of Hamilton–Jacobi equations. Trans Am Math
Soc 277(1):1–42
11. Bardi M, Dolcetta I (1997) Optimal control and viscosity solutions of Hamilton–Jacobi–
Bellman equations. Springer
12. Cimen T, Banks SP (2004) Global optimal feedback control for general nonlinear systems with
nonquadratic performance criteria. Syst Control Lett 53(5):327–346
13. Cheng T, Lewis FL, Abu-Khalaf M (2007) A neural network solution for fixed-final time
optimal control of nonlinear systems. Automatica 43(3):482–490
14. Cheng T, Lewis FL, Abu-Khalaf M (2007) Fixed-final-time-constrained optimal control of
nonlinear systems using neural network HJB approach. IEEE Trans Neural Netw 18(6):1725–
1737
15. Kar I, Adhyaru D, Gopal M (2009) Fixed final time optimal control approach for bounded
robust controller design using Hamilton–Jacobi–Bellman solution. IET Control Theory Appl
3(9):1183–1195
16. Wang F, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon optimal
control of discrete-time nonlinear systems with epsilon-error bound. IEEE Trans Neural Netw
22:24–36
17. Heydari A, Balakrishnan SN (2012) An optimal tracking approach to formation control of
nonlinear multi-agent systems. In: Proceedings of AIAA guidance, navigation and control
conference
18. Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of
discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocom-
puting 78(1):14–22
19. Zhao Q, Xu H, Jagannathan S (2015) Neural network-based finite-horizon optimal control
of uncertain affine nonlinear discrete-time systems. IEEE Trans Neural Netw Learn Syst
26(3):486–499
20. Li C, Liu D, Li H (2015) Finite horizon optimal tracking control of partially unknown linear
continuous-time systems using policy iteration. IET Control Theory Appl 9(12):1791–1801
21. Ge SS, Zhang J (2003) Neural-network control of nonaffine nonlinear system with zero dynam-
ics by state and output feedback. IEEE Trans Neural Netw 14(4):900–918
22. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
23. Zhang X, Zhang H, Sun Q, Luo Y (2012) Adaptive dynamic programming-based optimal
control of unknown nonaffine nonlinear discrete-time systems with proof of convergence.
Neurocomputing 91:48–55
24. Liu D, Huang Y, Wang D, Wei Q (2013) Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int J Control 86(9):1554–
1566
25. Bian T, Jiang Y, Jiang ZP (2014) Adaptive dynamic programming and optimal control of
nonlinear nonaffine systems. Automatica 50(10):2624–2632
26. Yang X, Liu D, Wei Q, Wang D (2015) Direct adaptive control for a class of discrete-time
unknown nonaffine nonlinear systems using neural networks. Int J Robust Nonlinear Control
25(12):1844–1861
27. Kiumarsi B, Kang W, Lewis FL (2016) H-∞ control of nonaffine aerial systems using off-policy
reinforcement learning. Unmanned Syst 4(1):1–10
28. Song R, Wei Q, Xiao W (2016) Off-policy neuro-optimal control for unknown complex-valued
nonlinear systems based on policy iteration. Neural Comput Appl 46(1):85–95
29. Lyashevskiy S, Meyer AU (1995) Control system analysis and design upon the Lyapunov
method. In: Proceedings of the American control conference, vol 5, pp 3219–3223
30. Fahroo F, Ross IM (2008) Pseudospectral methods for infinite-horizon nonlinear optimal con-
trol problems. J Guid Control Dyn 31(4):927–936
31. Pickenhain S (2014) Hilbert space treatment of optimal control problems with infinite horizon.
In: Bock GH, Hoang PX, Rannacher R, Schlöder PJ (eds) Modeling, simulation and optimiza-
tion of complex processes - HPSC 2012: Proceedings of the fifth international conference on
high performance scientific computing, 5–9 March 2012, Hanoi, Vietnam. Springer Interna-
tional Publishing, Cham, pp 169–182
32. Tauchnitz N (2015) The Pontryagin maximum principle for nonlinear optimal control problems
with infinite horizon. J Optim Theory Appl 167(1):27–48
33. Halkin H (1974) Necessary conditions for optimal control problems with infinite horizons.
Econometrica pp 267–272
34. Aseev SM, Kryazhimskii A (2007) The Pontryagin maximum principle and optimal economic
growth problems. Proc Steklov Inst Math 257(1):1–255
35. Aseev SM, Veliov VM (2015) Maximum principle for infinite-horizon optimal control problems
under weak regularity assumptions. Proc Steklov Inst Math 291(1):22–39
36. von Stryk O, Bulirsch R (1992) Direct and indirect methods for trajectory optimization. Ann
Oper Res 37(1):357–373
37. Betts JT (1998) Survey of numerical methods for trajectory optimization. J Guid Control Dyn
21(2):193–207
38. Hargraves CR, Paris S (1987) Direct trajectory optimization using nonlinear programming and
collocation. J Guid Control Dyn 10(4):338–342
39. Huntington GT (2007) Advancement and analysis of a Gauss pseudospectral transcription for
optimal control. Ph.D. thesis, Department of Aeronautics and Astronautics, MIT
40. Rao AV, Benson DA, Darby CL, Patterson MA, Francolin C, Huntington GT (2010) Algorithm
902: GPOPS, A MATLAB software for solving multiple-phase optimal control problems using
the Gauss pseudospectral method. ACM Trans Math Softw 37(2):1–39
41. Darby CL, Hager WW, Rao AV (2011) An hp-adaptive pseudospectral method for solving
optimal control problems. Optim Control Appl Methods 32(4):476–502
42. Garg D, Hager WW, Rao AV (2011) Pseudospectral methods for solving infinite-horizon opti-
mal control problems. Automatica 47(4):829–837
43. Freeman R, Kokotovic P (1995) Optimal nonlinear controllers for feedback linearizable sys-
tems. In: Proceedings of the American control conference, pp 2722–2726
44. Lu Q, Sun Y, Xu Z, Mochizuki T (1996) Decentralized nonlinear optimal excitation control.
IEEE Trans Power Syst 11(4):1957–1962
45. Nevistic V, Primbs JA (1996) Constrained nonlinear optimal control: a converse HJB approach.
Technical report CIT-CDS 96-021, California Institute of Technology, Pasadena, CA 91125
46. Primbs JA, Nevistic V (1996) Optimality of nonlinear design techniques: A converse HJB
approach. Technical report CIT-CDS 96-022, California Institute of Technology, Pasadena,
CA 91125
47. Sekoguchi M, Konishi H, Goto M, Yokoyama A, Lu Q (2002) Nonlinear optimal control applied
to STATCOM for power system stabilization. In: Proceedings of the IEEE/PES transmission
and distribution conference and exhibition, pp 342–347
48. Kim Y, Lewis FL (2000) Optimal design of CMAC neural-network controller for robot manip-
ulators. IEEE Trans Syst Man Cybern Part C Appl Rev 30(1):22–31
49. Kim Y, Lewis FL, Dawson D (2000) Intelligent optimal control of robotic manipulator using
neural networks. Automatica 36(9):1355–1364
50. Dupree K, Patre P, Wilcox Z, Dixon WE (2008) Optimal control of uncertain nonlinear systems
using RISE feedback. In: Proceedings of the IEEE conference on decision and control, Cancun,
Mexico, pp 2154–2159
51. Dupree K, Patre PM, Wilcox ZD, Dixon WE (2009) Optimal control of uncertain nonlinear
systems using a neural network and RISE feedback. In: Proceedings of the American control
conference, St. Louis, Missouri, pp 361–366
52. Freeman RA, Kokotovic PV (1996) Robust nonlinear control design: state-space and Lyapunov
techniques. Birkhäuser, Boston
53. Fausz J, Chellaboina VS, Haddad W (1997) Inverse optimal adaptive control for nonlinear
uncertain systems with exogenous disturbances. In: Proceedings of the IEEE conference on
decision and control, pp 2654–2659
54. Li ZH, Krstic M (1997) Optimal design of adaptive tracking controllers for nonlinear systems.
Automatica 33:1459–1473
55. Krstic M, Li ZH (1998) Inverse optimal design of input-to-state stabilizing nonlinear con-
trollers. IEEE Trans Autom Control 43(3):336–350
56. Krstic M, Tsiotras P (1999) Inverse optimal stabilization of a rigid spacecraft. IEEE Trans
Autom Control 44(5):1042–1049
57. Luo W, Chu YC, Ling KV (2005) Inverse optimal adaptive control for attitude tracking of
spacecraft. IEEE Trans Autom Control 50(11):1639–1654
58. Dupree K, Johnson M, Patre PM, Dixon WE (2009) Inverse optimal control of a nonlinear
Euler-Lagrange system, part ii: Output feedback. In: Proceedings of the IEEE conference on
decision and control, Shanghai, China, pp 327–332
59. Dupree K, Patre PM, Johnson M, Dixon WE (2009) Inverse optimal adaptive control of a
nonlinear Euler-Lagrange system: Part i. In: Proceedings of the IEEE conference on decision
and control, Shanghai, China, pp 321–326
60. Johnson M, Hu G, Dupree K, Dixon WE (2009) Inverse optimal homography-based visual
servo control via an uncalibrated camera. In: Proceedings of the IEEE conference on decision
and control, Shanghai, China, pp 2408–2413
61. Wang Q, Sharma N, Johnson M, Gregory CM, Dixon WE (2013) Adaptive inverse optimal
neuromuscular electrical stimulation. IEEE Trans Cybern 43:1710–1718
62. Cloutier JR (1997) State-dependent Riccati equation techniques: an overview. In: Proceedings
of the American control conference, 2:932–936
63. Çimen T (2008) State-dependent Riccati equation (SDRE) control: a survey. In: Proceedings
IFAC World Congress, pp 6–11
64. Cimen T (2010) Systematic and effective design of nonlinear feedback controllers via the
state-dependent Riccati equation (SDRE) method. Annu Rev Control 34(1):32–51
65. Yucelen T, Sadahalli AS, Pourboghrat F (2010) Online solution of state-dependent Riccati
equation for nonlinear system stabilization. In: Proceedings of the American control conference,
pp 6336–6341
66. Garcia CE, Prett DM, Morari M (1989) Model predictive control: theory and practice - a survey.
Automatica 25(3):335–348
67. Mayne D, Michalska H (1990) Receding horizon control of nonlinear systems. IEEE Trans
Autom Control 35(7):814–824
68. Morari M, Lee J (1999) Model predictive control: past, present and future. Comput Chem Eng
23(4–5):667–682
69. Allgöwer F, Zheng A (2000) Nonlinear model predictive control, vol 26. Springer
70. Mayne D, Rawlings J, Rao C, Scokaert P (2000) Constrained model predictive control: Stability
and optimality. Automatica 36:789–814
71. Camacho EF, Bordons C (2004) Model predictive control, vol 2. Springer
72. Grüne L, Pannek J (2011) Nonlinear model predictive control. Springer
73. Isaacs R (1999) Differential games: a mathematical theory with applications to warfare and
pursuit, control and optimization. Dover books on mathematics, Dover Publications
74. Tijs S (2003) Introduction to game theory. Hindustan Book Agency
75. Basar T, Olsder GJ (1999) Dynamic noncooperative game theory, 2nd edn. Classics in applied
mathematics, SIAM
76. Nash J (1951) Non-cooperative games. Ann Math 2:286–295
77. Vamvoudakis KG, Lewis FL (2011) Policy iteration algorithm for distributed networks and
graphical games. In: Proceedings of the IEEE conference decision control European control
conference, pp 128–135
78. Vamvoudakis KG, Lewis FL, Hudas GR (2012) Multi-agent differential graphical games: online
adaptive learning solution for synchronization with optimality. Automatica 48(8):1598–1611
79. Dolcetta IC (1983) On a discrete approximation of the Hamilton–Jacobi equation of dynamic
programming. Appl Math Optim 10(1):367–377
80. Sethian JA (1999) Level set methods and fast marching methods: evolving interfaces in com-
putational geometry, fluid mechanics, computer vision, and materials science. Cambridge Uni-
versity Press
81. Osher S, Fedkiw R (2006) Level set methods and dynamic implicit surfaces, vol 153. Springer
Science & Business Media
82. Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput
12(1):219–245
83. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
Chapter 2
Approximate Dynamic Programming

2.1 Introduction

Dynamic programming techniques based on the principle of optimality have been
extensively studied in the literature (cf. [1–7]). The applicability of classical dynamic
programming techniques like policy iteration and value iteration is limited by the
curse of dimensionality and the need for model knowledge. Simulation-based rein-
forcement learning techniques such as Q-learning [4] and temporal difference learn-
ing [2, 8] avoid the need for exact model knowledge. However, these techniques
require the states and the actions to be on finite sets. Even though the theory is
developed for finite state spaces of any size, the implementation of simulation-based
reinforcement learning techniques is feasible only if the size of the state space is
small. Extensions of simulation-based reinforcement learning techniques to general
state spaces or very large finite state spaces involve parametric approximation of
the policy, where the decision space is reduced to a finite dimensional vector space,
and only a few (finite) weights are tuned to obtain the value function. Such algo-
rithms have been studied in depth for systems with countable state and action-spaces
under the name of neuro-dynamic programming (cf. [6, 8–14] and the references
therein). Extensions of these techniques to general state spaces and continuous time-
domains are challenging and have recently become an active area of research. The rest
of this chapter focuses on the development of dynamic programming methods for
continuous-time systems with continuous state-spaces.

2.2 Exact Dynamic Programming in Continuous Time and Space

A unifying characteristic of dynamic programming-based methods is the use of a
(state or action) value function. A state value function, as defined in the previous
chapter, is a map from the state space to the reals that assigns each state its value
(i.e., the total optimal cost-to-go when the system is started in that state). An action
value function (generally referred to as the Q−function) is a map from the Carte-
sian product of the state space and the action-space to positive real numbers. The
Q−function assigns each state-action pair, (s, a), a value (i.e., the total optimal cost
when the action a is performed in the state s, and the optimal policy is followed
thereafter). Another unifying characteristic of dynamic programming based meth-
ods is the interaction of policy evaluation and policy improvement. Policy evaluation
(also referred to as the prediction problem) refers to the problem of finding the (state
or action) value function for a given arbitrary policy. Policy improvement refers to
the problem of construction of a new policy that improves the original policy. The
family of approximate optimal control methods that can be viewed as an interaction
between policy evaluation and policy improvement is referred to as generalized pol-
icy iteration. Almost all dynamic programming-based approximate optimal control
methods can be described as generalized policy iteration [8].
For the Bolza problem in Sect. 1.5, policy evaluation amounts to finding a solution
to the generalized Hamilton–Jacobi–Bellman equation (first introduced in [15])

r (x, φ (x)) + ∇x V (x) ( f (x) + g (x) φ (x)) = 0, (2.1)

for a fixed policy φ : Rn → Rm . The policy improvement step amounts to finding a
solution to the minimization problem

φ (x) = arg min_u (r (x, u) + ∇x V (x) ( f (x) + g (x) u)) ,

for a fixed value function V : Rn → R≥0 . Since the system dynamics are affine in
control, the policy improvement step reduces to the simple assignment
φ (x) = −(1/2) R⁻¹ gᵀ(x) ∇x V ᵀ(x) .

2.2.1 Exact Policy Iteration: Differential and Integral Methods

The policy iteration algorithm, also known as the successive approximation algorithm,
alternates between policy improvement and policy evaluation. The policy iteration
algorithm was first developed by Bellman in [16], and a policy improvement theorem
was provided by Howard in [17]. In Algorithm 2.1, a version of the policy iteration
algorithm (cf. [15]) is presented for systems with continuous state space, where φ (0) :
Rn → Rm denotes an initial admissible policy (i.e., a policy that results in a finite
cost, starting from any initial condition), and V (i) : Rn → R≥0 and φ (i) : Rn → Rm
denote the value function and the policy obtained in the i th iteration. Provided the
initial policy is admissible, policy iteration generates a sequence of policies and
value functions that asymptotically approach the optimal policy and the optimal
value function. Furthermore, each policy in the sequence is at least as good as the
previous policy, which also implies that each policy in the sequence is admissible.
For a proof of convergence, see [18].

Algorithm 2.1 Policy Iteration
while V (i) ≠ V (i−1) do
  solve r (x, φ (i−1) (x)) + ∇x V (i) (x) ( f (x) + g (x) φ (i−1) (x)) = 0 with V (i) (0) = 0 to
  compute V (i)
  φ (i) (x) ← −(1/2) R⁻¹ gᵀ(x) (∇x V (i) (x))ᵀ
  i ← i + 1
end while
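
For linear dynamics and a quadratic cost, the policy evaluation step of Algorithm 2.1 reduces to a Lyapunov matrix equation, so each iteration can be carried out exactly. The following sketch is a minimal illustration of Algorithm 2.1 under that assumption; the double-integrator system, cost weights, and initial stabilizing gain are not taken from the text and are chosen purely for illustration.

# Policy iteration (Algorithm 2.1) specialized to linear dynamics x_dot = A x + B u and
# quadratic cost r(x, u) = x'Qx + u'Ru, where V^(i)(x) = x' P_i x and policy evaluation
# reduces to the Lyapunov equation
#   (A - B K_{i-1})' P_i + P_i (A - B K_{i-1}) + Q + K_{i-1}' R K_{i-1} = 0.
# Hypothetical example system; an initial stabilizing gain K is required.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

A = np.array([[0.0, 1.0], [0.0, 0.0]])   # assumed double-integrator plant
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.array([[1.0, 1.0]])               # initial admissible (stabilizing) policy u = -K x
for i in range(20):
    A_cl = A - B @ K
    # Policy evaluation: solve the Lyapunov equation for P_i
    P = solve_continuous_lyapunov(A_cl.T, -(Q + K.T @ R @ K))
    # Policy improvement: u = -(1/2) R^{-1} g' grad V = -R^{-1} B' P x
    K_new = np.linalg.solve(R, B.T @ P)
    if np.linalg.norm(K_new - K) < 1e-9:
        break
    K = K_new

print("policy iteration gain:", K)
print("Riccati equation gain:", np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R)))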

Knowledge of the system dynamics (i.e., the functions f and g) is required to
implement policy iteration. Policy iteration can be implemented without the knowl-
edge of system dynamics using an integral approach. Since the term ∇x V (x) ( f (x) +
g (x) φ (x)) is the time derivative of V along the trajectories of the system (1.9) under
the policy φ, the generalized Hamilton–Jacobi–Bellman equation can be integrated
over the interval [τ, τ + T ], for some constant T ∈ R>0 , to yield

V (x) = V (x (τ + T )) + ∫_τ^{τ+T} r (x (t) , φ (x (t))) dt,    (2.2)

where the shorthand x (t) is utilized to denote x (t; τ, x, φ (x (·))), that is, a trajectory
of the system in (1.9) under the feedback controller u = φ (x) such that x (τ ) = x.

Definition 2.1 A function Ṽ : Rn → R>0 is called a solution to the integral gen-
eralized Hamilton–Jacobi–Bellman equation in (2.2) if ∀x ∈ Rn and ∀τ ∈ R≥t0 , Ṽ
satisfies (2.2).

Note that since the dynamics, the policy, and the cost function are time-independent,
to establish Ṽ as a solution to the integral generalized Hamilton–Jacobi–Bellman
equation, it is sufficient to check (2.2) for τ = t0 (or for any other arbitrary value of
τ ). Algorithm 2.2, first developed in [19], details a technique that utilizes the integral
generalized Hamilton–Jacobi–Bellman equation to implement policy iteration with-
out the knowledge of the drift dynamics, f . In Algorithm 2.2, the shorthand x (i) (t) is
utilized to denote x (t; τ, x, φ (i) (x (·))). The equivalence of differential and integral
policy iteration is captured in the following theorem.

Theorem 2.2 For a given admissible policy φ : Rn → Rm , if the integral general-
ized Hamilton–Jacobi–Bellman equation in (2.2) admits a continuously differentiable
solution, then the solutions to the generalized Hamilton–Jacobi–Bellman equation
in (2.1) and the integral generalized Hamilton–Jacobi–Bellman equation coincide
and are unique.
Algorithm 2.2 Integral Policy Iteration
while V (i) ≠ V (i−1) do
  solve V (i) (x) = ∫_τ^{τ+T} r (x (i−1) (t) , φ (i−1) (x (i−1) (t))) dt + V (i) (x (i−1) (τ + T )) with
  V (i) (0) = 0 to compute V (i)
  φ (i) (x) ← −(1/2) R⁻¹ gᵀ(x) (∇x V (i) (x))ᵀ
  i ← i + 1
end while

Proof The following proof is a slight modification of the argument presented in [19].
Under suitable smoothness assumptions and provided the policy φ is admissible, the
generalized Hamilton–Jacobi–Bellman equation in (2.1) admits a unique continu-
ously differentiable solution [15]. Let Ṽ ∈ C 1 (Rn , R) be the solution to the gener-
alized Hamilton–Jacobi–Bellman equation with the boundary condition Ṽ (0) = 0.
For an initial condition x ∈ Rn , differentiation of Ṽ (x (·)) with respect to time yields

Ṽ˙ (x (t)) = ∇x Ṽ (x (t)) ( f (x (t)) + g (x (t)) φ (x (t))) .

Using the generalized Hamilton–Jacobi–Bellman equation,

Ṽ˙ (x (t)) = −r (x (t) , φ (x (t))) .

Integrating the above expression over the interval [τ, τ + T ] for some τ ∈ R≥t0
yields the integral generalized Hamilton–Jacobi–Bellman equation in (2.2). Thus,
any solution to the generalized Hamilton–Jacobi–Bellman equation is also a solu-
tion to the integral generalized Hamilton–Jacobi–Bellman equation. To establish the
other direction, let Ṽ ∈ C 1 (Rn , R) be a solution to the generalized Hamilton–Jacobi–
Bellman equation and let V ∈ C 1 (Rn , R) be a different solution to the integral gener-
alized Hamilton–Jacobi–Bellman equation with the boundary conditions Ṽ (0) = 0
and V (0) = 0. Consider the time-derivative of the difference Ṽ − V :

Ṽ˙ (x (t)) − V˙ (x (t)) = ∇x Ṽ (x (t)) ( f (x (t)) + g (x (t)) φ (x (t))) − V˙ (x (t)) .

Since Ṽ is a solution to the generalized Hamilton–Jacobi–Bellman equation,

Ṽ˙ (x (t)) − V˙ (x (t)) = −r (x (t) , φ (x (t))) − V˙ (x (t)) .

Integrating the above expression over the interval [τ, τ + T ] and using x (τ ) = x,

∫_τ^{τ+T} (Ṽ˙ (x (t)) − V˙ (x (t))) dt = − ∫_τ^{τ+T} r (x (t) , φ (x (t))) dt − (V (x (τ + T )) − V (x)) .

Since V satisfies the integral generalized Hamilton–Jacobi–Bellman equation,

∫_τ^{τ+T} (Ṽ˙ (x (t)) − V˙ (x (t))) dt = 0, ∀τ ∈ R≥t0 .

Hence, for all x ∈ Rn and for all τ ∈ R≥t0 , Ṽ − V is a constant along the trajectory
x (t) , t ∈ [τ, τ + T ], with the initial condition x (τ ) = x. Consequently, using the time-
independence of the dynamics, the policy, and the cost function, it can be concluded
that Ṽ − V is a constant along every trajectory x : R≥t0 → Rn of the system in
(1.9) under the controller u (t) = φ (x (t)). Since Ṽ (0) − V (0) = 0, it can be con-
cluded that Ṽ (x (t)) − V (x (t)) = 0, ∀t ∈ R≥t0 , provided the trajectory x (·) passes
through the origin (i.e., x (t) = 0 for some t ∈ R≥t0 ).
In general, a trajectory of a dynamical system need not pass through the origin. For
example, consider ẋ (t) = −x (t) , x (0) = 1. However, since the policy φ is admis-
sible, every trajectory of the system in (1.9) under the controller u (t) = φ (x (t))
asymptotically goes to zero, leading to the following claim.

Claim Provided φ is admissible, Ṽ − V is a constant along the trajectories of the
system in (1.9) starting from every initial condition under the controller u (t) =
φ (x (t)), and the functions Ṽ and V are continuous, then

Ṽ (x) = V (x) , ∀x ∈ Rn .

Proof (Proof of Claim) For the sake of contradiction, let |Ṽ (x ∗ ) − V (x ∗ )| > ε for
some ε ∈ R>0 and some x ∗ ∈ Rn . Since Ṽ − V is a constant, it can be concluded
that

|Ṽ (x (t; t0 , x ∗ , φ (x (·)))) − V (x (t; t0 , x ∗ , φ (x (·))))| > ε, ∀t ∈ R≥t0 .

Since φ is admissible, lim_{t→∞} x (t; t0 , x ∗ , φ (x (·))) = 0. Since Ṽ − V is a contin-
uous function, lim_{t→∞} |Ṽ (x (t; t0 , x ∗ , φ (x (·)))) − V (x (t; t0 , x ∗ , φ (x (·))))| = 0.
Hence, there exists a constant T such that

|Ṽ (x (T ; t0 , x ∗ , φ (x (·)))) − V (x (T ; t0 , x ∗ , φ (x (·))))| < ε,

which is a contradiction. Since the constants ε and x ∗ were arbitrarily selected, the
proof of the claim is complete. □

The claim implies that the solutions to the integral generalized Hamilton–
Jacobi–Bellman equations are unique, and hence, the proof of the theorem is
complete. 
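
To make the data-driven character of Algorithm 2.2 concrete, the sketch below runs integral policy iteration on a scalar linear plant: the value of the current policy is identified from trajectory data through the integral generalized Hamilton–Jacobi–Bellman equation, without using the drift parameter, and the policy is then improved. The plant parameters, initial gain, and integration window are assumptions made for illustration.

# Integral policy iteration (Algorithm 2.2) for a scalar plant x_dot = a*x + b*u with
# policy u = -k*x and cost r = q*x^2 + R*u^2, so that V(x) = p*x^2.  The integral GHJB
# relation p*(x(t)^2 - x(t+T)^2) = int_t^{t+T} r dt identifies p from data without
# knowledge of the drift parameter a.  All numerical values are assumptions.
import numpy as np
from scipy.integrate import solve_ivp

a, b, q, R = 1.0, 1.0, 1.0, 1.0          # true plant (a is treated as unknown)
k, T = 2.0, 0.2                           # initial admissible gain and integration window

for it in range(10):
    # Simulate the closed-loop state and the running cost under the current policy
    def aug(t, z):
        x, _ = z
        u = -k * x
        return [a * x + b * u, q * x**2 + R * u**2]
    sol = solve_ivp(aug, [0.0, 4.0], [2.0, 0.0], dense_output=True, rtol=1e-9, atol=1e-9)

    # Least-squares data:  p * (x(t)^2 - x(t+T)^2) = cost accumulated over [t, t+T]
    ts = np.arange(0.0, 3.0, T)
    z0, z1 = sol.sol(ts), sol.sol(ts + T)
    A_ls = z0[0]**2 - z1[0]**2            # regressor
    b_ls = z1[1] - z0[1]                  # integral of r over each window
    p = float(A_ls @ b_ls / (A_ls @ A_ls))

    # Policy improvement: u = -(1/2) R^{-1} b dV/dx = -(b*p/R) x
    k = b * p / R

print("learned gain  k =", k)
print("optimal gain    =", (a + np.sqrt(a**2 + b**2 * q / R)) / b)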
2.2.2 Value Iteration and Associated Challenges

The policy iteration algorithm and the integral policy iteration algorithm both
require an initial admissible policy. The requirement of an initial admissible pol-
icy can be circumvented using value iteration. Value iteration and its variants are
popular generalized policy iteration algorithms for discrete-time systems owing to
the simplicity of their implementation. In discrete time, value iteration algorithms
work by turning Bellman’s recurrence relation (the discrete time counterpart of
the Hamilton–Jacobi–Bellman equation) into an update rule [3, 8, 20–22]. One
example of a discrete-time value iteration algorithm is detailed in Algorithm 2.3.
In Algorithm 2.3, the system dynamics are described by the difference equation
x (k + 1) = f (x (k)) + g (x (k)) u (k), where the objective is to minimize the total
cost J (x (·) , u (·)) = Σ_{k=0}^{∞} r (x (k) , u (k)), and V (0) : Rn → R≥0 denotes an arbi-
trary initialization. A key strength of value iteration over policy iteration is that
the initialization V (0) does not need to be a Lyapunov function or a value func-
tion corresponding to any admissible policy. An arbitrary initialization such as
V (0) (x) = 0, ∀x ∈ Rn is acceptable. Hence, to implement value iteration, knowl-
edge of an initial admissible policy is not needed. As a result, unlike policy iteration,
the functions V (i) generated by value iteration are not guaranteed to be value func-
tions corresponding to admissible policies, and similarly, the policies φ (i) are not
guaranteed to be admissible. However, it can be shown that the sequences V (i) and
φ (i) converge to the optimal value function, V ∗ , and the optimal policy, u ∗ , respec-
tively [23–25]. An offline value iteration-like algorithm that relies on Pontryagin’s
maximum principle is developed in [26–28] where a single neural network is utilized
to approximate the relationship between the state and the costate variables. Value
iteration algorithms for continuous-time linear systems are presented in [29–32]. For
nonlinear systems, an implementation of Q-learning is presented in [33]; however,
closed-loop stability of the developed controller is not analyzed.

Algorithm 2.3 Discrete-time value iteration
while V (i) ≠ V (i−1) do
  φ (i) (x) ← −(1/2) R⁻¹ gᵀ(x) (∇x V (i−1) (x))ᵀ
  V (i) (x) ← r (x, φ (i) (x)) + V (i−1) ( f (x) + g (x) φ (i) (x))
  i ← i + 1
end while
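
When the system is linear and the cost quadratic, carrying out the minimization in the value iteration update exactly turns Algorithm 2.3 into the familiar Riccati difference recursion. The sketch below iterates that recursion from the arbitrary initialization V (0) = 0; the system matrices are assumptions chosen for illustration.

# Discrete-time value iteration for x(k+1) = A x(k) + B u(k), r(x, u) = x'Qx + u'Ru,
# and V^(i)(x) = x' P_i x.  Starting from P_0 = 0, the exact greedy update gives
#   P_{i+1} = Q + A' P_i A - A' P_i B (R + B' P_i B)^{-1} B' P_i A.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed example (discretized double integrator)
B = np.array([[0.005], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

P = np.zeros((2, 2))                      # arbitrary initialization V^(0)(x) = 0
for i in range(2000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # greedy policy u = -K x
    P_next = Q + A.T @ P @ (A - B @ K)
    if np.max(np.abs(P_next - P)) < 1e-10:
        break
    P = P_next

print("value iteration stopped after", i, "iterations")
print("P =\n", P)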

2.3 Approximate Dynamic Programming in Continuous Time and Space

For systems with finite state and action-spaces, policy iteration and value iteration
are established as effective tools for optimal control synthesis. However, in con-
tinuous state-space systems, both policy iteration and value iteration suffer from
Bellman’s curse of dimensionality, (i.e., they become computationally intractable


as the size of the state space grows). The need for excessive computation can be
realistically sidestepped if one seeks to obtain an approximation to the optimal value
function instead of the exact optimal value function (i.e., approximate dynamic pro-
gramming). To obtain an approximation to the optimal value function using pol-
icy iteration, the generalized Hamilton–Jacobi–Bellman equation must be solved
approximately in each iteration. Several methods to approximate the solutions to
the generalized Hamilton–Jacobi–Bellman equation have been studied in the litera-
ture. The generalized Hamilton–Jacobi–Bellman equation can be solved numerically
using perturbation techniques [34–37], finite difference [38–40] and finite element
[41–44] techniques, or using approximation methods such as Galerkin projections
[45, 46]. This monograph focuses on approximate dynamic programming algorithms
that approximate the classical policy iteration and value iteration algorithms by using
a parametric approximation of the policy or the value function (cf. [47–49]). The cen-
tral idea is that if the policy or the value function can be parameterized with sufficient
accuracy using a small number of parameters, the optimal control problem reduces
to an approximation problem in the parameter space.

2.3.1 Some Remarks on Function Approximation

In this chapter and in the rest of the book, the value function is approximated using
a linear-in-the-parameters approximation scheme. The following characteristics of
the approximation scheme can be established using the Stone-Weierstrass Theorem
(see [49–51]).
Property 2.3 Let V ∈ C 1 (Rn , R), χ ⊂ Rn be compact, ε ∈ R>0 be a constant, and
let {σi ∈ C 1 (Rn , R) | i ∈ N} be a set of countably many uniformly bounded basis
functions (cf. [52, Definition 2.1]). Then, there exists L ∈ N, a set of basis functions
{σi ∈ C 1 (Rn , R) | i = 1, 2, · · · , L}, and a set of weights {wi ∈ R | i = 1, 2, · · · , L}
such that sup_{x∈χ} (|V (x) − W ᵀ σ (x)| + ∥∇x V (x) − W ᵀ ∇x σ (x)∥) ≤ ε, where
σ ≜ [σ1 · · · σ L ]ᵀ and W ≜ [w1 · · · w L ]ᵀ.
Property 2.3, also known as the Universal Approximation Theorem, states that a sin-
gle layer neural network can simultaneously approximate a function and its derivative
given a sufficiently large number of basis functions. Using Property 2.3, a continu-
ously differentiable function can be represented as V (x) = W ᵀ σ (x) + ε (x), where
ε : Rn → R denotes the function approximation error. The function approximation
error, along with its derivative can be made arbitrarily small by increasing the number
of basis functions used in the approximation.
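
A minimal numerical illustration of the linear-in-the-parameters approximation is given below: a polynomial basis is fitted to a known smooth function by least squares on a compact set, and the residuals of the function and its gradient are reported. The target function and basis are assumptions, not quantities from the text.

# Least-squares fit V(x) ≈ W' sigma(x) on a compact set, illustrating Property 2.3.
# Target function and polynomial basis are assumptions chosen for illustration.
import numpy as np

def sigma(X):                      # X has shape (N, 2); returns (N, L)
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1**2, x1*x2, x2**2, x1**3*x2, x1*x2**3, x1**2*x2**2], axis=1)

def grad_sigma(X):                 # returns (N, L, 2), the gradient of each basis function
    x1, x2 = X[:, 0], X[:, 1]
    z = np.zeros_like(x1)
    cols = [[2*x1, z], [x2, x1], [z, 2*x2],
            [3*x1**2*x2, x1**3], [x2**3, 3*x1*x2**2], [2*x1*x2**2, 2*x1**2*x2]]
    return np.stack([np.stack(c, axis=1) for c in cols], axis=1)

def V_true(X):                     # assumed smooth target with V(0) = 0
    x1, x2 = X[:, 0], X[:, 1]
    return 0.5*x1**2 + x2**2 + 0.2*x1**2*np.cos(x2)

def grad_V_true(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1 + 0.4*x1*np.cos(x2), 2*x2 - 0.2*x1**2*np.sin(x2)], axis=1)

g = np.linspace(-1, 1, 41)
X = np.array(np.meshgrid(g, g)).reshape(2, -1).T          # compact set chi = [-1, 1]^2
W, *_ = np.linalg.lstsq(sigma(X), V_true(X), rcond=None)  # ideal-weight estimate

e_fun = np.max(np.abs(sigma(X) @ W - V_true(X)))
e_grad = np.max(np.linalg.norm(np.einsum('l,nld->nd', W, grad_sigma(X)) - grad_V_true(X), axis=1))
print("sup function error ≈", e_fun, "  sup gradient error ≈", e_grad)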
2.3.2 Approximate Policy Iteration

An example of an approximate policy iteration method is detailed in Algorithm
2.4. In Algorithm 2.4, V̂ (i) : Rn+L → R≥0 denotes the parametric approximation of
the value function V , Wc ∈ R L denotes the vector of ideal values of the unknown
parameters, and Ŵc(i) denotes an estimate of Wc . The Bellman error corresponding
to the policy φ, denoted by δφ , is defined as

δφ (x, Ŵc ) = r (x, φ (x)) + ∇x V̂ (x, Ŵc ) ( f (x) + g (x) φ (x)) .

Algorithm 2.4 Approximate policy iteration
while Ŵc(i) ≠ Ŵc(i−1) do
  Ŵc(i) ← arg min_{Ŵc ∈ R L} ∫_{x∈Rn} (δφ (i−1) (σ, Ŵc ))² dσ
  φ (i) (x) ← −(1/2) R⁻¹ gᵀ(x) (∇x V (i) (x, Ŵc(i) ))ᵀ
  i ← i + 1
end while

Similar to Algorithm 2.2, Algorithm 2.4 can be expressed in a model-free form using inte-
gration. However, the usefulness of Algorithm 2.4 (and its model-free form) is
limited by the need to solve the minimization problem Ŵc(i) = arg min_{Ŵc ∈ R L} ∫_{x∈Rn}
(δφ (i−1) (σ, Ŵc ))² dσ , which is often intractable due to computational and infor-
mation constraints. A more useful implementation of approximate policy iteration is
detailed in Algorithm 2.5 in the model-based form, where the minimization is carried
out over a specific trajectory instead of the whole state space [19, 53]. In Algorithm
2.5, the set x0^φ ⊂ Rn is defined as

x0^φ ≜ {x ∈ Rn | x (t; t0 , x0 , φ (x (·))) = x, for some t ∈ R≥t0 }.

Algorithm 2.5 Approximate generalized policy iteration
while Ŵc(i) ≠ Ŵc(i−1) do
  Ŵc(i) ← arg min_{Ŵc ∈ R L} ∫_{x ∈ x0^φ} (δφ (i−1) (σ, Ŵc ))² dσ
  φ (i) (x) ← −(1/2) R⁻¹ gᵀ(x) (∇x V (i) (x, Ŵc(i) ))ᵀ
  i ← i + 1
end while
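
Because the Bellman error is linear in Ŵc when the value function estimate is linear in the parameters (as in the approximation introduced later in this chapter), the minimization in Algorithm 2.5 reduces to a linear least-squares problem over states sampled from the trajectory. The sketch below illustrates that policy evaluation step; the system, basis, and fixed policy are assumptions for illustration.

# Approximate policy evaluation along a trajectory (the minimization step of Algorithm 2.5)
# for V_hat(x, Wc) = Wc' sigma(x).  The Bellman error
#   delta_phi(x, Wc) = Wc' grad_sigma(x) (f(x) + g(x) phi(x)) + r(x, phi(x))
# is linear in Wc, so its squared sum over sampled states is minimized by least squares.
# System, basis, and policy are illustrative assumptions.
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x: np.array([-x[0] + x[1], -0.5*x[1]])        # assumed drift
g = lambda x: np.array([[0.0], [1.0]])                   # assumed control effectiveness
phi = lambda x: np.array([-x[1]])                        # fixed admissible policy
Q, R = np.eye(2), np.array([[1.0]])
sigma_grad = lambda x: np.array([[2*x[0], 0.0], [x[1], x[0]], [0.0, 2*x[1]]])  # grad of [x1^2, x1x2, x2^2]

# Sample states from the closed-loop trajectory (the set over which Algorithm 2.5 integrates)
sol = solve_ivp(lambda t, x: f(x) + (g(x) @ phi(x)), [0, 10], [2.0, -1.0],
                t_eval=np.linspace(0, 10, 200))
A_rows, b_rows = [], []
for x in sol.y.T:
    u = phi(x)
    A_rows.append(sigma_grad(x) @ (f(x) + g(x) @ u))     # coefficient of Wc in the Bellman error
    b_rows.append(-(x @ Q @ x + u @ R @ u))              # so that A Wc ≈ b
Wc, *_ = np.linalg.lstsq(np.array(A_rows), np.array(b_rows), rcond=None)
print("critic weights for the evaluated policy:", Wc)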

Under suitable persistence of excitation conditions, Algorithm 2.5 can be shown to
converge to a neighborhood of the optimal value function and the optimal policy [19,
24, 25, 49, 53]. However, the algorithm is iterative in nature, and unlike exact policy
iteration, the policies φ (i) cannot generally be shown to be stabilizing. Hence, the
approximate policy iteration algorithms, as stated, are not suitable for online learning
and online optimal feedback control. To ensure system stability during the learning
phase, a two-network approach is utilized, where in addition to the value function,
the policy, φ, is also approximated using a parametric approximation, û (x, Ŵa ).
The critic learns the value of a policy by updating the weights Ŵc and the actor
improves the current policy by updating the weights Ŵa .

2.3.3 Development of Actor-Critic Methods

The actor-critic (also known as adaptive-critic) architecture is one of the most widely
used architectures to implement generalized policy iteration algorithms [1, 8, 54].
Actor-critic algorithms are pervasive in machine learning and are used to learn the
optimal policy online for finite-space discrete-time Markov decision problems [1,
3, 8, 14, 55]. The idea of learning with a critic (or a trainer) first appeared in [56,
57] where the state-space was partitioned to make the computations tractable. Critic-
based methods were further developed to learn optimal actions in sequential decision
problems in [54]. Actor-critic methods were first developed in [58] for systems
with finite state and action-spaces, and in [1] for systems with continuous state
and action-spaces using neural networks to implement the actor and the critic. An
analysis of convergence properties of actor-critic methods was presented in [47, 59]
for deterministic systems and in [14] for stochastic systems. For a detailed review of
actor-critic methods, see [60].
Several methods have been investigated to tune the actor and the critic networks
in the actor-critic methods described in the paragraph above. The actor can learn
to directly minimize the estimated cost-to-go, where the estimate of the cost-to-go
is obtained by the critic [1, 14, 55, 58, 60, 61]. The actor can also be tuned to
minimize the Bellman error (also known as the temporal-difference error) [62]. The
critic network can be tuned using the method of temporal differences [1, 2, 8, 11,
12, 14, 63] or using heuristic dynamic programming [3, 9, 20, 64–67] or its variants
[55, 68, 69].
The iterative nature of actor-critic methods makes them particularly suitable for
offline computation and for discrete-time systems, and hence, discrete-time approx-
imate optimal control has been a growing area of research over the past decade [24,
70–80]. The trajectory-based formulation in Algorithm 2.5 lends itself to an online
solution approach using asynchronous dynamic programming, where the parame-
ters are adjusted on-the-fly using input-output data. The concept of asynchronous
dynamic programming can be further exploited to apply actor-critic methods online
to continuous-time systems.
2.3.4 Actor-Critic Methods in Continuous Time and Space

Baird [81] proposed advantage updating as an extension of the Q-learning algorithm.
Advantage updating can be implemented in continuous time and provides faster con-
vergence. A continuous-time formulation of actor-critic methods was first developed
by Doya in [82]. In [82], the actor and the critic weights are tuned continuously
using an adaptive update law designed as a differential equation. While no stability
or convergence results are provided in [82], the developed algorithms can be readily
utilized to simultaneously learn and utilize an approximate optimal feedback con-
troller in real-time for nonlinear systems. A sequential (one network is tuned at a
time) actor-critic method that does not require complete knowledge of the internal
dynamics of the system is presented in [83]. Convergence properties of actor-critic
methods for continuous-time systems where both the networks are concurrently tuned
are examined in [84], and a Lyapunov-based analysis that concurrently examines
convergence and stability properties of an online implementation of the actor-critic
method is developed in [85]. The methods developed in this monograph are inspired
by the algorithms in [82] and the analysis techniques in [85]. In the following section,
a basic structure of online continuous-time actor-critic methods is presented. Recent
literature on continuous-time actor-critic methods is cited throughout the following
sections in comparative remarks.

2.4 Optimal Control and Lyapunov Stability

Obtaining an analytical solution to the Bolza problem is often infeasible if the system
dynamics are nonlinear. Many numerical solution techniques are available to solve
Bolza problems; however, such techniques require exact model knowl-
edge and are realized via open-loop implementation of offline solutions. Open-loop
implementations are sensitive to disturbances, changes in objectives, and changes in
the system dynamics; hence, online closed-loop solutions of optimal control prob-
lems are sought-after. Inroads to solve an optimal control problem online can be made
by looking at the value function. Under a given policy, the value function provides
a map from the state space to the set of real numbers that measures the quality of a
state. In other words, under a given policy, the value function evaluated at a given
state is the cost accumulated when starting in the given state and following the given
policy. Under general conditions, the policy that drives the system state along the
steepest negative gradient of the optimal value function turns out to be the optimal
policy; hence, online optimal control design relies on computation of the optimal
value function.
In online closed-loop approximate optimal control, the value function has an even
more important role to play. Not only does the value function provide the optimal
policy, but the value function is also a Lyapunov function that establishes global
asymptotic stability of the closed-loop system.
Theorem 2.4 Consider the affine dynamical system in (1.9). Let V ∗ : Rn → R≥0
be the optimal value function corresponding to the affine-quadratic optimal control
problem in (1.10). Assume further that f (0) = 0 and that the control effectiveness
matrix g (x) is full rank for all x ∈ Rn . Then, the closed-loop system under the
optimal controller u (t) = u ∗ (x (t)) is asymptotically stable.
Proof Since f (0) = 0, it follows that when x (t0 ) = 0, the controller that yields the
lowest cost is u (t) = 0, ∀t. Hence, V ∗ (0) = 0, and since the optimal controller is
given by u (t) = u ∗ (x (t)) = −(1/2) R⁻¹ gᵀ(x (t)) ∇x V ∗ (x (t)), and g is assumed to be
full rank, it can be concluded that ∇x V ∗ (0) = 0. Furthermore, if x ≠ 0, it follows
that V ∗ (x) ≠ 0. Hence, the function V ∗ is a candidate Lyapunov function, and x = 0
is an equilibrium point of the closed-loop dynamics

ẋ (t) = f (x (t)) + g (x (t)) u ∗ (x (t)) .

The time derivative of V ∗ along the trajectories of (1.9) is given by

V̇ ∗ (t) = ∇x V ∗ (x (t)) ( f (x (t)) + g (x (t)) u (t)) .

When the optimal controller is used, u (t) = u ∗ (x (t)). Hence,

V̇ ∗ (t) = ∇x V ∗ (x (t)) ( f (x (t)) + g (x (t)) u ∗ (x (t))) .

Since the optimal value function satisfies the Hamilton–Jacobi–Bellman equation,
Theorem 1.2 implies that

V̇ ∗ (t) = −r (x (t) , u ∗ (x (t))) ≤ −Q (x (t)) .

Since the function Q is positive definite by design, [86, Theorem 4.2] can be invoked
to conclude that the equilibrium point x = 0 is asymptotically stable. 
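
A quick numerical check of this argument is possible in the linear-quadratic case, where V ∗ and u ∗ are available from the algebraic Riccati equation; the sketch below (with assumed system matrices) verifies that V̇ ∗ = −r (x, u ∗ ) ≤ −Q (x) along closed-loop trajectories.

# Numerical check of Theorem 2.4 in the linear-quadratic case: with V*(x) = x'Px from the
# algebraic Riccati equation and u* = -R^{-1}B'P x, the derivative of V* along closed-loop
# trajectories equals -r(x, u*) <= -x'Qx.  Matrices are assumptions for illustration.
import numpy as np
from scipy.linalg import solve_continuous_are
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [-1.0, 0.5]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)

sol = solve_ivp(lambda t, x: (A - B @ K) @ x, [0, 5], [1.0, -2.0],
                t_eval=np.linspace(0, 5, 6))
for x in sol.y.T:
    u = -K @ x
    Vdot = 2 * x @ P @ ((A - B @ K) @ x)      # grad V*(x) times x_dot
    r = x @ Q @ x + u @ R @ u
    print(f"V*dot = {Vdot: .6f},  -r(x,u*) = {-r: .6f}")
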
The utility of Theorem 2.4 as a tool to analyze optimal controllers is limited
because for nonlinear systems, analytical or exact numerical computation of the
optimal controller is often intractable. Hence, one often works with approximate
value functions and approximate optimal controllers. Theorem 2.4 provides a pow-
erful tool for the analysis of approximate optimal controllers because the optimal
policy is inherently robust to approximation (for an in-depth discussion regarding
robustness of the optimal policy, see [87, 88]). That is, the optimal value function can
also be used as a candidate Lyapunov function to establish practical stability (that
is, uniform ultimate boundedness) of the system in (1.9) under controllers that are
close to or asymptotically approach a neighborhood of the optimal controller. The
rest of the discussion in this chapter focuses on the methodology employed in the
rest of this monograph to generate an approximation of the optimal controller. Over
the years, many different approximate optimal control methods have been developed
for various classes of systems. For a brief discussion about alternative methods, see
Sect. 2.8.
2.5 Differential Online Approximate Optimal Control


In an approximate actor-critic-based solution, the optimal value function V ∗ is
replaced by a parametric estimate V̂ (x, Ŵc ) and the optimal policy u ∗ by a para-
metric estimate û (x, Ŵa ), where Ŵc ∈ R L and Ŵa ∈ R L denote vectors of estimates
of the ideal parameters.
Substituting the estimates V̂ and û for V ∗ and u ∗ in (1.14), respectively, a residual
error δ : Rn × R L × R L → R, called the Bellman error, is defined as

δ (x, Ŵc , Ŵa ) ≜ ∇x V̂ (x, Ŵc ) ( f (x) + g (x) û (x, Ŵa )) + r (x, û (x, Ŵa )) .    (2.3)
The use of two separate sets of weight estimates Ŵa and Ŵc is motivated by the
fact that the Bellman error is linear with respect to the critic weight estimates and
nonlinear with respect to the actor weight estimates. Use of a separate set of weight
estimates for the value function facilitates least-squares-based adaptive updates.
To solve the optimal control problem, the critic aims to find a set of parameters
Ŵc and the actor aims to find a set of parameters Ŵa such that δ (x, Ŵc , Ŵa ) = 0,
and û (x, Ŵa ) = −(1/2) R⁻¹ gᵀ(x) (∇x V̂ (x, Ŵc ))ᵀ, ∀x ∈ Rn . Since an exact basis for
value function approximation is generally not available, an approximate set of param-
eters that minimizes the Bellman error is sought. In particular, to ensure uniform
approximation of the value function and the policy over an operating domain D ⊂ Rn ,
it is desirable to find parameters that minimize the error E s : R L × R L → R defined
as

E s (Ŵc , Ŵa ) ≜ sup_{x∈D} |δ (x, Ŵc , Ŵa )| .    (2.4)

Hence, in an online implementation of the deterministic actor-critic method, it is
desirable to treat the parameter estimates Ŵc and Ŵa as time-varying and update them
online to minimize the instantaneous error E s (Ŵc (t) , Ŵa (t)) or the cumulative
instantaneous error

E (t) ≜ ∫_0^t E s (Ŵc (τ ) , Ŵa (τ )) dτ,    (2.5)

while the system in (1.9) is being controlled using the control law u (t) = û (x (t) , Ŵa (t)).
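
When the model is known, the quantities above can be evaluated numerically. The sketch below computes the Bellman error in (2.3) for fixed parameter estimates over a grid of sampled states and approximates E s by the maximum over the samples; the system, basis, and weight values are assumptions for illustration.

# Sampled approximation of E_s(Wc_hat, Wa_hat) in (2.4): evaluate the Bellman error (2.3)
# on a grid covering an operating domain D and take the maximum.  System, basis, and
# weight values are assumptions.
import numpy as np

f = lambda x: np.array([x[1], -x[0] - 0.5*x[1]])        # assumed drift dynamics
g = lambda x: np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
Rinv = np.linalg.inv(R)
sigma_grad = lambda x: np.array([[2*x[0], 0.0], [x[1], x[0]], [0.0, 2*x[1]]])

def bellman_error(x, Wc, Wa):
    u = -0.5 * Rinv @ g(x).T @ sigma_grad(x).T @ Wa      # parametric policy, cf. (2.9)
    dV = Wc @ sigma_grad(x)                              # gradient of the parametric value function
    return dV @ (f(x) + g(x) @ u) + x @ Q @ x + u @ R @ u

Wc_hat = np.array([0.6, 0.1, 1.1])                       # assumed current estimates
Wa_hat = np.array([0.5, 0.0, 1.0])

grid = np.linspace(-1, 1, 21)
E_s = max(abs(bellman_error(np.array([a, b]), Wc_hat, Wa_hat))
          for a in grid for b in grid)
print("sampled approximation of E_s:", E_s)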

2.5.1 Reinforcement Learning-Based Online Implementation

Computation of the Bellman error in (2.3) and the integral error in (2.5) requires exact
model knowledge. Furthermore, computation of the integral error in (2.5) is generally
infeasible. Two prevalent approaches employed to render the control design robust to
uncertainties in the system drift dynamics are integral reinforcement learning (cf. [7,
19, 89–92]) and state derivative estimation (cf. [93, 94]). This section focuses on state
derivative estimation based methods. For further details on integral reinforcement
learning, see Sect. 2.2.1.
State derivative estimation-based techniques exploit the fact that if the system
model is uncertain, the critic can compute the Bellman error at each time instance
using the state-derivative ẋ (t) as

δt (t) ≜ ∇x V̂ (x (t) , Ŵc (t)) ẋ (t) + r (x (t) , û (x (t) , Ŵa (t))) .    (2.6)

If the state-derivative is not directly measurable, an approximation of the Bellman
error can be computed using a dynamically generated estimate of the state-derivative.
Since (1.14) constitutes a necessary and sufficient condition for optimality, the Bell-
man error serves as an indirect measure of how close the critic parameter estimates
Ŵc (t) are to their ideal values; hence, in reinforcement learning literature, each eval-
uation of the Bellman error is interpreted as gained experience. In particular, the critic
receives state-derivative-action-reward tuples (x (t) , ẋ (t) , u (t) , r (x (t) , u (t)))
and computes the Bellman error using (2.6). The critic then performs a one-step
update to the parameter estimates Ŵc (t) based on either the instantaneous experi-
ence, quantified by the squared error δt2 (t), or the cumulative experience, quantified
by the integral squared error

E t (t) ≜ ∫_0^t δt² (τ ) dτ,    (2.7)

using a steepest descent update law. The use of the cumulative squared error is
motivated by the fact that in the presence of uncertainties, the Bellman error can only
be evaluated along the system trajectory; hence, E t (t) is the closest approximation
to E (t) in (2.5) that can be computed using available information.
Intuitively, for E t (t) to approximate E (t) over an operating domain, the state
trajectory x (t) needs to visit as many points in the operating domain as possible. This
intuition is formalized by the fact that the use of the approximation E t (t) to update
the critic parameter estimates is valid provided certain exploration conditions1 are
met. In reinforcement learning terms, the exploration conditions translate to the need
for the critic to gain enough experience to learn the value function. The exploration

1 The exploration conditions are detailed in the next section for a linear-in-the-parameters approxi-
mation of the value function.
conditions can be relaxed using experience replay (cf. [92]), where each evaluation
of the Bellman error δint is interpreted as gained experience, and these experiences
are stored in a history stack and are repeatedly used in the learning algorithm to
improve data efficiency; however, a finite amount of exploration is still required since
the values stored in the history stack are also constrained to the system trajectory.
Learning based on simulation of experience has also been investigated in results
such as [95–100] for stochastic model-based reinforcement learning; however, these
results solve the optimal control problem off-line in the sense that repeated learning
trials need to be performed before the algorithm learns the controller and system
stability during the learning phase is not analyzed.
While the estimates Ŵc (·) are being updated by the critic, the actor simultaneously
updates the parameter estimates Ŵa (·) using a gradient-based approach so that the
quantity ∥û (x (t) , Ŵa (t)) + (1/2) R⁻¹ gᵀ(x (t)) (∇x V̂ (x (t) , Ŵc (t)))ᵀ∥ decreases.
The weight updates are performed online and in real-time while the system is being
controlled using the control law u (t) = û (x (t) , Ŵa (t)). Naturally, it is difficult
to guarantee stability during the learning phase. In fact, the use of two different sets
of parameters to approximate the value function and the policy is required solely for
the purpose of maintaining stability during the learning phase.

2.5.2 Linear-in-the-Parameters Approximation of the Value Function

For feasibility of analysis, the optimal value function is approximated using a linear-
in-the-parameters approximation

V̂ (x, Ŵc ) ≜ Ŵcᵀ σ (x) ,    (2.8)

where σ : Rn → R L is a continuously differentiable nonlinear activation function
such that σ (0) = 0 and ∇x σ (0) = 0, and Ŵc ∈ R L , where L denotes the number of
unknown parameters in the approximation of the value function. Based on (1.13), the
optimal policy is approximated using the linear-in-the-parameters approximation

û (x, Ŵa ) ≜ −(1/2) R⁻¹ gᵀ(x) ∇x σᵀ(x) Ŵa .    (2.9)
The update law used by the critic to update the weight estimates is given by

Ŵ̇c (t) = −ηc Γ (t) (ω (t) / ρ (t)) δt (t) ,
Γ̇ (t) = β Γ (t) − ηc Γ (t) (ω (t) ωᵀ(t) / ρ²(t)) Γ (t) ,    (2.10)

Fig. 2.1 The actor-critic Reward


architecture. The critic
computes the Bellman error Environment State DerivaƟve
based on the state, the action, State
the reward, and the
time-derivative of the state.
The actor and the critic both State
improve their estimate of the AcƟon
value function using the
Bellman error

BE
Actor CriƟc
AcƟon

where ω (t) ≜ ∇σ (x (t)) ẋ (t) ∈ R L denotes the regressor vector, ρ (t) ≜ 1 +
ν ωᵀ(t) Γ (t) ω (t) ∈ R, ηc , β, ν ∈ R>0 are constant learning gains, Γ̄ ∈ R>0 is a
saturation constant, and Γ is the least-squares gain matrix. The update law used by
the actor to update the weight estimates is derived using a Lyapunov-based stability
analysis, and is given by

Ŵ̇a (t) = (ηc / (4ρ (t))) ∇x σ (x (t)) g (x (t)) R⁻¹ gᵀ(x (t)) ∇x σᵀ(x (t)) Ŵa (t) ωᵀ(t) Ŵc (t)
         − ηa1 (Ŵa (t) − Ŵc (t)) − ηa2 Ŵa (t) ,    (2.11)

where ηa1 , ηa2 ∈ R>0 are constant learning gains. A block diagram of the resulting
control architecture is in Fig. 2.1.
The stability analysis indicates that the sufficient exploration condition takes the
form of a persistence of excitation condition that requires the existence of positive
constants ψ and T such that the regressor vector satisfies

ψ I L ≤ ∫_t^{t+T} (ω (τ ) ωᵀ(τ ) / ρ (τ )) dτ,    (2.12)

for all t ∈ R≥t0 . The regressor is defined here as a trajectory indexed by time. It should
be noted that different initial conditions result in different regressor trajectories;
hence, the constants T and ψ depend on the initial values of x (·) and Ŵa (·), and
the final result is generally not uniform in the initial conditions.
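
In practice, the condition in (2.12) can be monitored along a recorded trajectory by integrating the normalized outer product of the regressor over a sliding window and checking the smallest eigenvalue of the result, as in the sketch below; the regressor signal used here is only a placeholder.

# Numerical check of the persistence of excitation condition (2.12): over each window
# [t, t+T], accumulate int omega(tau) omega(tau)'/rho(tau) dtau and inspect its minimum
# eigenvalue.  The regressor and normalization below are placeholders for illustration.
import numpy as np

dt, T = 0.01, 1.0
t = np.arange(0.0, 10.0, dt)
omega = np.stack([np.sin(t), np.cos(2*t), np.sin(0.5*t)**2], axis=1)   # placeholder regressor
rho = 1.0 + np.sum(omega**2, axis=1)                                    # stand-in normalization

win = int(T / dt)
min_eigs = []
for k in range(len(t) - win):
    M = sum(np.outer(omega[j], omega[j]) / rho[j] for j in range(k, k + win)) * dt
    min_eigs.append(np.min(np.linalg.eigvalsh(M)))
print("smallest windowed eigenvalue (estimate of psi):", min(min_eigs))
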
Let W̃c (t) ≜ W − Ŵc (t) and W̃a (t) ≜ W − Ŵa (t) denote the vectors of param-
eter estimation errors, where W ∈ R L denotes the constant vector of ideal parameters
(see Property 2.3). Provided (2.12) is satisfied, and under sufficient conditions on the
learning gains and the constants ψ and T , the candidate Lyapunov function

VL (x, W̃c , W̃a , t) ≜ V ∗ (x) + (1/2) W̃cᵀ Γ ⁻¹ (t) W̃c + (1/2) W̃aᵀ W̃a

can be used to establish convergence of x (t), W̃c (t), and W̃a (t) to a neighborhood
of zero as t → ∞, when the system in (1.9) is controlled using the control law

u (t) = û (x (t) , Ŵa (t)) ,    (2.13)

and the parameter estimates Ŵc (·) and Ŵa (·) are updated using the update laws in
(2.10) and (2.11), respectively.
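
The closed-loop learning scheme can be simulated by integrating the state, the critic and actor weights, and the least-squares gain matrix as one coupled ordinary differential equation. The sketch below implements (2.8)–(2.11) and (2.13) for an assumed two-state system with a quadratic basis; the dynamics, learning gains, and initial weights are illustrative assumptions, the gain-matrix saturation is omitted, and, consistent with the discussion in Sect. 2.7, the weights need not converge to their ideal values without sufficient excitation.

# Simulation sketch of the online actor-critic scheme in (2.8)-(2.11), (2.13): the state,
# critic weights Wc, actor weights Wa, and least-squares gain Gamma are integrated together.
# The system, basis sigma(x) = [x1^2, x1 x2, x2^2], gains, and initial conditions are
# illustrative assumptions; the Gamma saturation mentioned in the text is omitted.
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x: np.array([x[1], -x[0] - 0.5*x[1]])         # assumed drift dynamics
g = lambda x: np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
Rinv = np.linalg.inv(R)
sigma_grad = lambda x: np.array([[2*x[0], 0.0], [x[1], x[0]], [0.0, 2*x[1]]])

L = 3
eta_c, eta_a1, eta_a2, beta, nu = 1.0, 10.0, 0.1, 0.1, 0.005

def dynamics(t, z):
    x, Wc, Wa = z[:2], z[2:5], z[5:8]
    Gamma = z[8:].reshape(L, L)
    Gs = sigma_grad(x) @ g(x) @ Rinv @ g(x).T @ sigma_grad(x).T   # grad sigma g R^{-1} g' grad sigma'
    u = -0.5 * Rinv @ g(x).T @ sigma_grad(x).T @ Wa               # control law (2.9), (2.13)
    xdot = f(x) + g(x) @ u
    omega = sigma_grad(x) @ xdot                                  # regressor
    rho = 1.0 + nu * omega @ Gamma @ omega
    delta = Wc @ omega + x @ Q @ x + u @ R @ u                    # Bellman error, cf. (2.6)
    Wc_dot = -eta_c * Gamma @ omega / rho * delta                 # critic update, cf. (2.10)
    Gamma_dot = beta * Gamma - eta_c * Gamma @ np.outer(omega, omega) @ Gamma / rho**2
    Wa_dot = (eta_c / (4.0 * rho)) * Gs @ Wa * (omega @ Wc) \
             - eta_a1 * (Wa - Wc) - eta_a2 * Wa                   # actor update, cf. (2.11)
    return np.concatenate([xdot, Wc_dot, Wa_dot, Gamma_dot.ravel()])

z0 = np.concatenate([[1.0, -1.0], 0.5*np.ones(L), 0.5*np.ones(L), (10.0*np.eye(L)).ravel()])
sol = solve_ivp(dynamics, [0.0, 30.0], z0, max_step=0.01)
print("final critic weights:", sol.y[2:5, -1])
print("final actor weights: ", sol.y[5:8, -1])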

2.6 Uncertainties in System Dynamics

The use of the state derivative to compute the Bellman error in (2.6) is advantageous
because it is easier to obtain a dynamic estimate of the state derivative than it is to
identify the system dynamics. For example, consider the high-gain dynamic state
derivative estimator

x̂˙ (t) = g (x (t)) u (t) + k x̃ (t) + μ (t) ,
μ̇ (t) = (kα + 1) x̃ (t) ,    (2.14)

where x̂˙ (t) ∈ Rn is an estimate of the state derivative, x̃ (t) ≜ x (t) − x̂ (t) is the state
estimation error, and k, α ∈ R>0 are identification gains. Using (2.14), the Bellman
error in (2.6) can be approximated by δ̂t as

δ̂t (t) = ∇x V̂ (x (t) , Ŵc (t)) x̂˙ (t) + r (x (t) , û (x (t) , Ŵa (t))) .

The critic can then learn the critic weights by using an approximation of cumulative
experience, quantified using δ̂t instead of δt in (2.10), that is,

Ê t (t) = ∫_0^t δ̂t² (τ ) dτ.    (2.15)

Under additional sufficient conditions on the gains k and α, the candidate Lyapunov
function

VL (x, W̃c , W̃a , x̃, x f , t) ≜ V ∗ (x) + (1/2) W̃cᵀ Γ ⁻¹ (t) W̃c + (1/2) W̃aᵀ W̃a + (1/2) x̃ᵀ x̃ + (1/2) x fᵀ x f ,

where x f (t) ≜ x̃˙ (t) + α x̃ (t), can be used to establish convergence of x (t), W̃c (t),
W̃a (t), x̃ (t), and x f (t) to a neighborhood of zero, when the system in (1.9) is
Fig. 2.2 Actor-critic-identifier architecture. The critic uses estimates of the state derivative to
compute the Bellman error
controlled using the control law (2.13). The aforementioned extension of the actor-
critic method to handle uncertainties in the system dynamics using derivative esti-
mation is known as the actor-critic-identifier architecture. A block diagram of the
actor-critic-identifier architecture is presented in Fig. 2.2.
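
A sketch of the estimator in (2.14) running alongside a simulated plant is given below; the plant, input signal, and gains are assumptions for illustration, and the reconstructed derivative x̂˙ is compared with the true ẋ so that it could be substituted into δ̂t.

# Sketch of the high-gain state-derivative estimator (2.14): integrate the plant and the
# estimator together and compare the reconstructed derivative with the true one.
# Plant, input signal, and estimator gains are assumptions for illustration.
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x: np.array([x[1], -x[0] - 0.5*x[1]])          # drift, treated as unknown by the estimator
g = lambda x: np.array([[0.0], [1.0]])
u = lambda t: np.array([np.sin(t)])                        # assumed exploratory input
k, alpha = 50.0, 5.0                                       # identification gains

def dynamics(t, z):
    x, x_hat, mu = z[:2], z[2:4], z[4:6]
    x_tilde = x - x_hat
    x_dot = f(x) + g(x) @ u(t)                             # true plant
    x_hat_dot = g(x) @ u(t) + k * x_tilde + mu             # estimator (2.14); uses only g and u
    mu_dot = (k * alpha + 1.0) * x_tilde
    return np.concatenate([x_dot, x_hat_dot, mu_dot])

z0 = np.concatenate([[1.0, -1.0], [0.0, 0.0], [0.0, 0.0]])
sol = solve_ivp(dynamics, [0, 10], z0, max_step=1e-3)

x_end, x_hat_end, mu_end = sol.y[:2, -1], sol.y[2:4, -1], sol.y[4:6, -1]
true_deriv = f(x_end) + g(x_end) @ u(sol.t[-1])
est_deriv = g(x_end) @ u(sol.t[-1]) + k * (x_end - x_hat_end) + mu_end
print("true x_dot:     ", true_deriv)
print("estimated x_dot:", est_deriv)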

2.7 Persistence of Excitation and Parameter Convergence

In online implementations of reinforcement learning, the control policy derived from
the approximate value function is used to control the system; hence, obtaining a
good approximation of the value function is critical to the stability of the closed-
loop system. Obtaining a good approximation of the value function online requires
convergence of the unknown parameters to their ideal values. Hence, similar to
adaptive control, the sufficient exploration condition manifests itself as a persistence
of excitation condition when reinforcement learning is implemented online.
Parameter convergence has been a focus of research in adaptive control for sev-
eral decades. It is common knowledge that least-squares and gradient descent-based
update laws generally require persistence of excitation in the system state for con-
vergence of the parameter estimates. Modification schemes such as projection algo-
rithms, σ −modification, and e−modification are used to guarantee boundedness of
parameter estimates and overall system stability; however, these modifications do
not guarantee parameter convergence unless the persistence of excitation condition
is satisfied [101–104].
In general, the controller in (2.13) does not ensure the persistence of excitation
condition in (2.12). Thus, in an online implementation, an ad-hoc exploration signal
is often added to the controller (cf. [8, 33, 105]). Since the exploration signal is
not considered in the stability analysis, it is difficult to ensure stability of the
online implementation. Moreover, the added probing signal causes large control
effort expenditure, and there is no means to determine when it is safe to remove the
probing signal. Chap. 4 addresses the challenges associated with the satisfaction of
the condition in (2.12) via simulated experience and cumulative experience collected
along the system trajectory.

2.8 Further Reading and Historical Remarks

Approximate optimal control has been an active topic of research since the seminal
works of Bellman [106] and Pontryagin [107] in the 1950s. A comprehensive survey
and classification of all the results dealing with approximate optimal control is out
of the scope of this monograph. In the following, a brief (by no means exhaustive)
classification of techniques based on Bellman’s dynamic programming principle is
presented. For a recent survey of approximate dynamic programming in deterministic
systems, see [108]. Brief discussions on a few specific techniques directly related
to the methodology used in this book are also presented. For a brief description of
methods based on Pontryagin’s maximum principle refer back to Sect. 1.8.
On-Policy Versus Off-Policy Learning: A generalized policy iteration technique
is called on-policy if the data used to improve an estimate of the optimal policy is
required to be collected using the same estimate. A generalized policy iteration
technique is called off-policy if an estimate of the optimal policy can be improved
using data collected using another policy. For example, methods such as policy iter-
ation, value iteration, heuristic dynamic programming, adaptive-critic methods [1,
14, 55, 59, 61], SARSA (cf. [109, 110]) are on-policy, whereas methods such as
Q−learning [111] and R−learning [112] are off-policy. The distinction between
on-policy and off-policy methods is important because most online generalized pol-
icy iteration methods require exploration for convergence, whereas the on-policy
condition requires exploitation, hence leading to the exploration versus exploitation
conflict. Off-policy methods avoid the exploration versus exploitation conflict since
an arbitrary exploring policy can be used to facilitate learning.
Approximation of solutions of reward-maximization problems using indirect feed-
back generated by a critic network was first investigated in [57]. Critic-based methods
were further developed to solve a variety of optimal control problems [1, 4, 20, 54],
for example, heuristic dynamic programming [20], adaptive critic elements [1], and
Q-learning [111]. A common theme among the aforementioned techniques is the
use of two neuron-like elements, an actor element that is responsible for generating
control signals and a critic element that is responsible for evaluation of the control
signals generated by the actor (except Q-learning, which is implemented with just
one neuron-like element that combines the information about the policy and the value
function [4, 111]). The most useful feature of critic based methods is that they can
be implemented online in real time.
Policy Iteration, Value Iteration, and Policy Gradient: Dynamic program-
ming methods have traditionally been classified into three distinct schemes: policy

iteration, value iteration, and policy gradient. Policy iteration methods start with
a stabilizing policy, find the value function corresponding to that policy (i.e., pol-
icy evaluation), and then update the policy to exploit the value function (i.e., pol-
icy improvement). A large majority of dynamic programming algorithms can be
classified as policy iteration algorithms. For example, SARSA and the successive
approximation methods developed in results such as [15, 16, 18, 21, 45, 46, 49, 79,
113–119] are policy iteration algorithms. In value iteration, starting from an arbitrary
initial guess, the value function is directly improved by effectively combining the
evaluation and the improvement phases into one single update. For example, algo-
rithms such as Q−learning [111], R−learning [112], heuristic dynamic program-
ming, action-dependent heuristic dynamic programming, dual heuristic program-
ming, and action-dependent dual heuristic programming [9], as well as modern
extensions of value iteration (see [6–8, 77] for a summary), are value iteration
methods. Both policy iteration and value itera-
tion are typically critic-only methods [60] and can be considered as special cases of
generalized policy iteration [8, 21].
Policy gradient methods (also known as actor-only methods) are philosophically
different from policy iteration and value iteration. In policy gradient methods, instead
of approximating the value function, the policy is directly approximated by comput-
ing the gradient of the cost functional with respect to the unknown parameters in
the approximation of the policy [120–123]. Modern policy gradient methods uti-
lize an approximation of the value function to estimate the gradients, and are called
actor-critic methods [14, 60, 124].
Continuous-Time Versus Discrete-Time Methods: For deterministic systems,
reinforcement learning algorithms have been extended to solve finite- and infinite-
horizon discounted and total-cost optimal regulation problems (cf. [24, 26, 48, 49,
70, 85, 91, 93, 105, 125]) under names such as adaptive dynamic programming or
adaptive critic algorithms. The discrete/iterative nature of the approximate dynamic
programming formulation lends itself naturally to the design of discrete-time opti-
mal controllers [24, 26, 27, 70–75, 79, 126], and the convergence of algorithms for
dynamic programming-based reinforcement learning controllers is studied in results
such as [47, 59, 61, 72]. Most prior work has focused on convergence analysis for
discrete-time systems, but some continuous examples are available [15, 19, 45, 47,
49, 81, 82, 84, 85, 105, 127–129]. For example, in [81] advantage updating was pro-
posed as an extension of the Q−learning algorithm which could be implemented in
continuous time and provided faster convergence. The result in [82] used a Hamilton–
Jacobi–Bellman framework to derive algorithms for value function approximation
and policy improvement, based on a continuous version of the temporal difference
error. A Hamilton–Jacobi–Bellman framework was also used in [47] to develop
a stepwise stable iterative approximate dynamic programming algorithm for con-
tinuous input-affine systems with an input-quadratic performance measure. Based
on the successive approximation method first proposed in [15], an adaptive optimal
control solution is provided in [45], where a Galerkin’s spectral method is used to
approximate the solution to the generalized Hamilton–Jacobi–Bellman equation. A
least-squares-based successive approximation solution to the generalized Hamilton–
Jacobi–Bellman equation is provided in [49], where a neural network is trained
offline to learn the solution to the generalized Hamilton–Jacobi–Bellman equation.


Another continuous formulation is proposed in [84].
Online Versus Offline Learning: A generalized policy iteration technique is
called online if the learning laws are iterative in nature. That is, input-output data can
be sequentially utilized while the system is running, to incrementally update the opti-
mal policy. Convergence of the approximated policy to the optimal policy is typically
obtained asymptotically over time. In contrast, methods that require expensive batch
computational operations on large recorded datasets need to be implemented offline.
Convergence of the approximate policy to the optimal policy is typically obtained
in an iterative manner as the number of iterations goes to infinity. For example,
methods such as heuristic dynamic programming and dual heuristic programming
[20], adaptive critic methods [1], asynchronous dynamic programming [130, 131],
Q−learning [111], and R−learning [112] are online generalized policy iteration
methods, whereas methods such as successive approximation [15, 16, 18, 45, 132],
policy iteration [16, 17], value iteration [8, 21], and single network adaptive critic
[26, 27] are offline generalized policy iteration algorithms. It should be noted that
the distinction between online and offline generalized policy iteration algorithms is
becoming less pronounced as computers are getting faster and it is now possible to
execute in real time many algorithms that were thought to be computationally infea-
sible. The distinction between online and offline techniques is further blurred by the
receding horizon approach used in model-predictive control that enables techniques
that were previously classified as offline to be utilized for real-time feedback control.
Model-Based Versus Model-Free Methods: The ultimate objective of reinforce-
ment learning is to learn a controller for a system just by observing its behavior, and
the rewards that correspond to the behavior. Two different approaches are used to
attack the problem. Model-based (also known as indirect) approaches utilize the
observations about the behavior of the system (i.e., input and output measurements)
to build a model of the system, and then utilize the model to learn the controller. For
example, one of the early dynamic programming methods, the heuristic dynamic pro-
gramming algorithm developed by Werbos, is a model-based reinforcement learning
technique [20]. Different variations of model-based algorithms are developed in [95–
100, 133, 134], and in some cases, are shown to outperform model-free algorithms
[135]. Arguably the most successful implementation of model-based reinforcement
learning is the work of Ng et al. [97, 136] regarding acrobatic autonomous helicopter
maneuvers. However, apprenticeship and multiple iterations of offline training are
required to learn the maneuvers. Since Bellman’s recurrence relation in discrete-
time is inherently model-free, the bulk of research on reinforcement learning in
discrete-time systems has been focused on model-free reinforcement learning. For
example, while the classical formulation of dynamic programming (policy iteration
and value iteration) requires a model, reinforcement learning-based implementa-
tions of dynamic programming, such as Q−learning, R−learning, SARSA, action-
dependent heuristic dynamic programming and action-dependent dual heuristic pro-
gramming [9] are model-free.
In continuous-time systems, to solve the generalized Hamilton–Jacobi–Bellman
equation or the Hamilton–Jacobi–Bellman equation, either a model of the system
or measurements of the state-derivative are needed. Hence, early developments in


dynamic programming for continuous-time systems required exact model knowledge
[15, 45, 47, 49, 81, 82, 84, 85, 105]. As a result, the term model-based reinforcement
learning is sometimes erroneously used to describe algorithms that require exact
model knowledge. Motivated by the need to accommodate uncertainties, much of
the recent research on continuous-time reinforcement learning has focused on model-
free methods [19, 31, 53, 93, 119, 128, 129, 137–140].

References

1. Barto A, Sutton R, Anderson C (1983) Neuron-like adaptive elements that can solve difficult
learning control problems. IEEE Trans Syst Man Cybern 13(5):834–846
2. Sutton R (1988) Learning to predict by the methods of temporal differences. Mach Learn
3(1):9–44
3. Werbos P (1990) A menu of designs for reinforcement learning over time. Neural Netw
Control 67–95
4. Watkins C, Dayan P (1992) Q-learning. Mach Learn 8(3):279–292
5. Bellman RE (2003) Dynamic programming. Dover Publications, Inc, New York
6. Bertsekas D (2007) Dynamic programming and optimal control, vol 2, 3rd edn. Athena Sci-
entific, Belmont, MA
7. Lewis FL, Vrabie D, Syrmos VL (2012) Optimal control, 3rd edn. Wiley, Hoboken
8. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
9. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural mod-
eling. In: White DA, Sorge DA (eds) Handbook of intelligent control: neural, fuzzy, and
adaptive approaches, vol 15. Nostrand, New York, pp 493–525
10. Bertsekas D, Tsitsiklis J (1996) Neuro-dynamic programming. Athena Scientific, Nashua
11. Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function
approximation. IEEE Trans Autom Control 42(5):674–690
12. Tsitsiklis JN, Roy BV (1999) Average cost temporal-difference learning. Automatica
35(11):1799–1808
13. Tsitsiklis J (2003) On the convergence of optimistic policy iteration. J Mach Learn Res 3:59–
72
14. Konda V, Tsitsiklis J (2004) On actor-critic algorithms. SIAM J Control Optim 42(4):1143–
1166
15. Leake R, Liu R (1967) Construction of suboptimal control sequences. SIAM J Control 5:54
16. Bellman R (1957) Dynamic programming, 1st edn. Princeton University Press, Princeton
17. Howard R (1960) Dynamic programming and Markov processes. Technology Press of Mas-
sachusetts Institute of Technology (Cambridge)
18. Saridis G, Lee C (1979) An approximation theory of optimal control for trainable manipula-
tors. IEEE Trans Syst Man Cyber 9(3):152–159
19. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive
optimal control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
20. Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of
intelligence. Gen Syst Yearb 22:25–38
21. Puterman ML, Shin MC (1978) Modified policy iteration algorithms for discounted markov
decision problems. Manag Sci 24(11):1127–1137
22. Bertsekas DP (1987) Dynamic programming: deterministic and stochastic models. Prentice-
Hall, Englewood Cliffs
23. Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control
51(8):1249–1260

24. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part
B Cybern 38:943–949
25. Heydari A (2014) Revisiting approximate dynamic programming and its convergence. IEEE
Trans Cybern 44(12):2733–2743
26. Padhi R, Unnikrishnan N, Wang X, Balakrishnan S (2006) A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw 19(10):1648–1660
27. Heydari A, Balakrishnan S (2013) Finite-horizon control-constrained nonlinear optimal con-
trol using single network adaptive critics. IEEE Trans Neural Netw Learn Syst 24(1):145–157
28. Heydari A, Balakrishnan SN (2013) Fixed-final-time optimal control of nonlinear systems
with terminal constraints. Neural Netw 48:61–71
29. Lee JY, Park JB, Choi YH (2013) On integral value iteration for continuous-time linear
systems. In: Proceedings of the American control conference, pp 4215–4220
30. Jha SK, Bhasin S (2014) On-policy q-learning for adaptive optimal control. In: Proceedings
of the IEEE symposium on adaptive dynamic programming and reinforcement learning, pp
1–6
31. Palanisamy M, Modares H, Lewis FL, Aurangzeb M (2015) Continuous-time q-learning
for infinite-horizon discounted cost linear quadratic regulator problems. IEEE Trans Cybern
45(2):165–176
32. Bian T, Jiang ZP (2015) Value iteration and adaptive optimal control for linear continuous-time
systems. In: Proceedings of the IEEE international conference on cybernetics and intelligent
systems, IEEE conference on robotics, automation and mechatronics, pp 53–58
33. Mehta P, Meyn S (2009) Q-learning and pontryagin’s minimum principle. In: Proceedings of
the IEEE conference on decision and control, pp 3598–3605
34. Al’Brekht E (1961) On the optimal stabilization of nonlinear systems. J Appl Math Mech
25(5):1254–1266
35. Lukes DL (1969) Optimal regulation of nonlinear dynamical systems. SIAM J Control
7(1):75–100
36. Nishikawa Y, Sannomiya N, Itakura H (1971) A method for suboptimal design of nonlinear
feedback systems. Automatica 7(6):703–712
37. Garrard WL, Jordan JM (1977) Design of nonlinear automatic flight control systems. Auto-
matica 13(5):497–505
38. Dolcetta IC (1983) On a discrete approximation of the hamilton-jacobi equation of dynamic
programming. Appl Math Optim 10(1):367–377
39. Falcone M, Ferretti R (1994) Discrete time high-order schemes for viscosity solutions of
Hamilton-Jacobi-Bellman equations. Numer Math 67(3):315–344
40. Bardi M, Dolcetta I (1997) Optimal control and viscosity solutions of Hamilton-Jacobi-
Bellman equations. Springer, Berlin
41. Gonzalez R (1985a) On deterministic control problems: an approximation procedure for the
optimal cost i. The stationary problem. SIAM J Control Optim 23(2):242–266
42. Gonzalez R, Rofman E (1985b) On deterministic control problems: an approximation proce-
dure for the optimal cost ii. The nonstationary case. SIAM J Control Optim 23(2):267–285
43. Falcone M (1987) A numerical approach to the infinite horizon problem of deterministic
control theory. Appl Math Optim 15(1):1–13
44. Kushner HJ (1990) Numerical methods for stochastic control problems in continuous time.
SIAM J Control Optim 28(5):999–1048
45. Beard R, Saridis G, Wen J (1997) Galerkin approximations of the generalized Hamilton-
Jacobi-Bellman equation. Automatica 33:2159–2178
46. Beard RW, Mclain TW (1998) Successive Galerkin approximation algorithms for nonlinear
optimal and robust control. Int J Control 71(5):717–743
47. Murray J, Cox C, Lendaris G, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern Part C Appl Rev 32(2):140–153

48. Abu-Khalaf M, Lewis FL (2002) Nearly optimal HJB solution for constrained input systems
using a neural network least-squares approach. In: Proceedings of the IEEE conference on
decision and control, Las Vegas, NV, pp 943–948
49. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
50. Hornik K, Stinchcombe M, White H (1990) Universal approximation of an unknown mapping
and its derivatives using multilayer feedforward networks. Neural Netw 3(5):551–560
51. Hornick K (1991) Approximation capabilities of multilayer feedforward networks. Neural
Netw 4:251–257
52. Sadegh N (1993) A perceptron network for functional identification and control of nonlinear
systems. IEEE Trans Neural Netw 4(6):982–988
53. Bian T, Jiang Y, Jiang ZP (2014) Adaptive dynamic programming and optimal control of
nonlinear nonaffine systems. Automatica 50(10):2624–2632
54. Widrow B, Gupta N, Maitra S (1973) Punish/reward: Learning with a critic in adaptive thresh-
old systems. IEEE Trans Syst Man Cybern 3(5):455–465
55. Prokhorov DV, Wunsch IDC (1997) Adaptive critic designs. IEEE Trans Neural Netw 8:997–
1007
56. Fu KS (1964) Learning control systems. In: Tou JT, Wilcox RH (eds) Computing and infor-
mation science, collected papers on learning, adaptation and control in information systems.
Spartan Books, Washington, pp 318–343
57. Fu KS (1969) Learning control systems. In: Tou JT (ed) Advances in information systems
science, vol 1. Springer. US, Boston, pp 251–292
58. Witten IH (1977) An adaptive optimal controller for discrete-time markov environments. Inf
Control 34(4):286–295
59. Liu X, Balakrishnan S (2000) Convergence analysis of adaptive critic based optimal control.
In: Proceedings of the American control conference, vol 3
60. Grondman I, Buşoniu L, Lopes GA, Babuška R (2012) A survey of actor-critic reinforcement
learning: standard and natural policy gradients. IEEE Trans Syst Man Cybern Part C Appl
Rev 42(6):1291–1307
61. Prokhorov D, Santiago R, Wunsch D (1995) Adaptive critic designs: a case study for neuro-
control. Neural Netw 8(9):1367–1372
62. Fuselli D, De Angelis F, Boaro M, Squartini S, Wei Q, Liu D, Piazza F (2013) Action dependent
heuristic dynamic programming for home energy resource scheduling. Int J Electr Power
Energy Syst 48:148–160
63. Miller WT, Sutton R, Werbos P (1990) Neural networks for control. MIT Press, Cambridge
64. Werbos P (1987) Building and understanding adaptive systems: a statistical/numerical
approach to factory automation and brain research. IEEE Trans Syst Man Cybern 17(1):7–20
65. Werbos PJ (1989) Back propagation: past and future. Proceedings of the international con-
ference on neural network 1:1343–1353
66. Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE
78(10):1550–1560
67. Werbos P (2000) New directions in ACDs: keys to intelligent control and understanding the
brain. Proceedings of the IEEE-INNS-ENNS international joint conference on neural network
3:61–66
68. Si J, Wang Y (2001) On-line learning control by association and reinforcement. IEEE Trans
Neural Netw 12(2):264–276
69. Yang L, Enns R, Wang YT, Si J (2003) Direct neural dynamic programming. In: Stability and
control of dynamical systems with applications. Springer, Berlin, pp 193–214
70. Balakrishnan S (1996) Adaptive-critic-based neural networks for aircraft optimal control. J
Guid Control Dynam 19(4):893–898
71. Lendaris G, Schultz L, Shannon T (2000) Adaptive critic design for intelligent steering and
speed control of a 2-axle vehicle. In: International joint conference on neural network, pp
73–78

72. Ferrari S, Stengel R (2002) An adaptive critic global controller. Proc Am Control Conf 4:2665–
2670
73. Han D, Balakrishnan S (2002) State-constrained agile missile control with adaptive-critic-
based neural networks. IEEE Trans Control Syst Technol 10(4):481–489
74. He P, Jagannathan S (2007) Reinforcement learning neural-network-based controller for non-
linear discrete-time systems with input constraints. IEEE Trans Syst Man Cybern Part B
Cybern 37(2):425–436
75. Dierks T, Thumati B, Jagannathan S (2009) Optimal control of unknown affine nonlinear
discrete-time systems using offline-trained neural networks with proof of convergence. Neural
Netw 22(5–6):851–860
76. Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of
discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocom-
puting 78(1):14–22
77. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control algorithms
and stability. Communications and control engineering, Springer, London
78. Wei Q, Liu D (2013) Optimal tracking control scheme for discrete-time nonlinear systems
with approximation errors. In: Guo C, Hou ZG, Zeng Z (eds) Advances in neural networks -
ISNN 2013, vol 7952. Lecture notes in computer science. Springer, Berlin, pp 1–10
79. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
80. Yang X, Liu D, Wei Q, Wang D (2015) Direct adaptive control for a class of discrete-time
unknown nonaffine nonlinear systems using neural networks. Int J Robust Nonlinear Control
25(12):1844–1861
81. Baird L (1993) Advantage updating. Technical report, Wright Lab, Wright-Patterson Air
Force Base, OH
82. Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput
12(1):219–245
83. Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL (2009) Adaptive optimal control for
continuous-time linear systems based on policy iteration. Automatica 45(2):477–484
84. Hanselmann T, Noakes L, Zaknich A (2007) Continuous-time adaptive critics. IEEE Trans
Neural Netw 18(3):631–647
85. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
86. Khalil HK (2002) Nonlinear systems, 3rd edn. Prentice Hall, Upper Saddle River
87. Wang K, Liu Y, Li L (2014) Visual servoing trajectory tracking of nonholonomic mobile
robots without direct position measurement. IEEE Trans Robot 30(4):1026–1035
88. Wang D, Liu D, Zhang Q, Zhao D (2016) Data-based adaptive critic designs for nonlin-
ear robust optimal control with uncertain dynamics. IEEE Trans Syst Man Cybern Syst
46(11):1544–1555
89. Vamvoudakis KG, Vrabie D, Lewis FL (2009) Online policy iteration based algorithms to
solve the continuous-time infinite horizon optimal control problem. IEEE symposium on
adaptive dynamic programming and reinforcement learning, IEEE, pp 36–41
90. Vrabie D, Vamvoudakis KG, Lewis FL (2009) Adaptive optimal controllers based on gener-
alized policy iteration in a continuous-time framework. In: Proceedings of the mediterranean
conference on control and automation, IEEE, pp 1402–1409
91. Vrabie D, Lewis FL (2010) Integral reinforcement learning for online computation of feedback
nash strategies of nonzero-sum differential games. In: Proceedings of the IEEE Conference
on decision and control, pp 3066–3071
92. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
93. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlin-
ear systems. Automatica 49(1):89–92

94. Kamalapurkar R, Dinh H, Bhasin S, Dixon WE (2015) Approximate optimal trajectory track-
ing for continuous-time nonlinear systems. Automatica 51:40–48
95. Singh SP (1992) Reinforcement learning with a hierarchy of abstract models. In: AAAI
national conference on artificial intelligence 92:202–207
96. Atkeson CG, Schaal S (1997) Robot learning from demonstration. Int Conf Mach Learn
97:12–20
97. Abbeel P, Quigley M, Ng AY (2006) Using inaccurate models in reinforcement learning. In:
International conference on machine learning. ACM, New York, pp 1–8
98. Deisenroth MP (2010) Efficient reinforcement learning using Gaussian processes. KIT Sci-
entific Publishing
99. Mitrovic D, Klanke S, Vijayakumar S (2010) Adaptive optimal feedback control with learned
internal dynamics models. In: Sigaud O, Peters J (eds) From motor learning to interaction
learning in robots, vol 264. Studies in computational intelligence. Springer, Berlin, pp 65–84
100. Deisenroth MP, Rasmussen CE (2011) Pilco: a model-based and data-efficient approach to
policy search. In: International conference on machine learning, pp 465–472
101. Narendra KS, Annaswamy AM (1987) A new adaptive law for robust adaptive control without
persistent excitation. IEEE Trans Autom Control 32:134–145
102. Narendra K, Annaswamy A (1989) Stable adaptive systems. Prentice-Hall Inc, Upper Saddle
River
103. Sastry S, Bodson M (1989) Adaptive control: stability, convergence, and robustness. Prentice-
Hall, Upper Saddle River
104. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall, Upper Saddle River
105. Vrabie D, Abu-Khalaf M, Lewis F, Wang Y (2007) Continuous-time ADP for linear systems
with partially unknown dynamics. In: Proceedings of the IEEE international symposium on
approximate dynamic programming and reformulation learning, pp 247–253
106. Bellman R (1954) The theory of dynamic programming. Technical report, DTIC Document
107. Pontryagin LS, Boltyanskii VG, Gamkrelidze RV, Mishchenko EF (1962) The mathematical
theory of optimal processes. Interscience, New York
108. Kiumarsi B, Vamvoudakis KG, Modares H, Lewis FL (to appear) Optimal and autonomous
control using reinforcement learning: a survey. IEEE Trans Neural Netw Learn Syst
109. Rummery GA, Niranjan M (1994) On-line q-learning using connectionist systems, Technical
report. University of Cambridge, Department of Engineering
110. Sutton R (1996) Generalization in reinforcement learning: successful examples using sparse
coarse coding. In: Advances in neural information processing systems, pp 1038–1044
111. Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, University of Cambridge
England
112. Schwartz A (1993) A reinforcement learning method for maximizing undiscounted rewards.
Proc Int Conf Mach Learn 298:298–305
113. Bradtke S, Ydstie B, Barto A (1994) Adaptive linear quadratic control using policy iteration.
In: Proceedings of the American control conference, IEEE, pp 3475–3479
114. McLain T, Beard R (1998) Successive galerkin approximations to the nonlinear optimal
control of an underwater robotic vehicle. In: Proceedings of the IEEE international conference
on robotics and automation
115. Lawton J, Beard R, Mclain T (1999) Successive Galerkin approximation of nonlinear optimal
attitude. Proc Am Control Conf 6:4373–4377
116. Lawton J, Beard R (1998) Numerically efficient approximations to the Hamilton–Jacobi–
Bellman equation. Proc Am Control Conf 1:195–199
117. Bertsekas D (2011) Approximate policy iteration: a survey and some new methods. J Control
Theory Appl 9:310–335
118. Modares H, Lewis FL, Naghibi-Sistani MB (2013) Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans Neural
Netw Learn Syst 24(10):1513–1525
119. Luo B, Wu HN, Huang T, Liu D (2014) Data-based approximate policy iteration for affine
nonlinear continuous-time optimal control design. Automatica

120. Williams RJ (1988) Toward a theory of reinforcement-learning connectionist systems. Technical report NU-CCS-88-3, Northeastern University, College of Computer Science
121. Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist rein-
forcement learning. Mach Learn 8(3):229–256
122. Jaakkola T, Singh S, Jordan M (1995) Reinforcement learning algorithm for partially observ-
able Markov decision problems. In: Advances in neural information processing systems, pp
345–352
123. Kimura H, Miyazaki K, Kobayashi S (1997) Reinforcement learning in pomdps with function
approximation. Proc Int Conf Mach Learn 97:152–160
124. Sutton RS, McAllester DA, Singh SP, Mansour Y (2000) Policy gradient methods for rein-
forcement learning with function approximation. In: Solla SA, Leen TK, Müller K (eds)
Advances in neural information processing systems, vol 12, MIT Press, pp 1057–1063
125. Vamvoudakis KG, Lewis FL (2009) Online synchronous policy iteration method for optimal
control. In: Yu W (ed) Recent advances in intelligent control systems. Springer, Berlin, pp
357–374
126. Chen Z, Jagannathan S (2008) Generalized Hamilton-Jacobi-Bellman formulation -based
neural network control of affine nonlinear discrete-time systems. IEEE Trans Neural Netw
19(1):90–106
127. Bhasin S, Sharma N, Patre P, Dixon WE (2010) Robust asymptotic tracking of a class of
nonlinear systems using an adaptive critic based controller. In: Proceedings of the American
control conference, Baltimore, MD, pp 3223–3228
128. Jiang Y, Jiang ZP (2012) Computational adaptive optimal control for continuous-time linear
systems with completely unknown dynamics. Automatica 48(10):2699–2704
129. Yang X, Liu D, Wang D (2014) Reinforcement learning for adaptive optimal control of
unknown continuous-time nonlinear systems with input constraints. Int J Control 87(3):553–
566
130. Barto AG, Bradtke SJ, Singh SP (1991) Real-time learning and control using asynchronous
dynamic programming, Technical report. University of Massachusetts at Amherst, Depart-
ment of Computer and Information Science
131. Bertsekas DP, Tsitsiklis JN (1989) Parallel and distributed computation: numerical methods.
Prentice-Hall Inc, Englewood Cliffs
132. Bertsekas DP (1976) On error bounds for successive approximation methods. IEEE Trans
Autom Control 21(3):394–396
133. Sutton RS (1990) Integrated architectures for learning, planning, and reacting based on approx-
imating dynamic programming. In: Proceedings of the international conference on machine
learning, pp 216–224
134. Sutton RS (1991) Dyna, an integrated architecture for learning, planning, and reacting. ACM
SIGART Bull 2(4):160–163
135. Atkeson CG, Santamaria JC (1997) A comparison of direct and model-based reinforcement
learning. In: Proceedings of the international conference on robotics and automation, Citeseer
136. Abbeel P, Ng AY (2005) Exploration and apprenticeship learning in reinforcement learning.
In: Proceedings of the international conference on machine learning. ACM, pp 1–8
137. Yang X, Liu D, Wei Q (2014) Online approximate optimal control for affine non-linear systems
with unknown internal dynamics using adaptive dynamic programming. IET Control Theory
Appl 8(16):1676–1688
138. Bian T, Jiang Y, Jiang ZP (2015) Decentralized adaptive optimal control of large-scale systems
with application to power systems. IEEE Trans Ind Electron 62(4):2439–2447
139. Modares H, Lewis FL, Jiang ZP (2015) H∞ tracking control of completely unknown
continuous-time systems via off-policy reinforcement learning. IEEE Trans Neural Netw
Learn Syst
140. Song R, Lewis FL, Wei Q, Zhang HG, Jiang ZP, Levine D (2015) Multiple actor-critic struc-
tures for continuous-time optimal control using input-output data. IEEE Trans Neural Netw
Learn Syst 26(4):851–865
Chapter 3
Excitation-Based Online Approximate
Optimal Control

3.1 Introduction

The focus of this chapter is adaptive online approximate optimal control of uncertain
nonlinear systems. The state-derivative-based method summarized in Sect. 2.6 is fur-
ther developed in this chapter. In Sect. 3.2, a novel actor-critic-identifier architecture
is developed to obviate the need to know the system drift dynamics via simultaneous
learning of the actor, the critic, and the identifier. The actor-critic-identifier method
utilizes a persistence of excitation-based online learning scheme, and hence is an
indirect adaptive control approach to reinforcement learning. The idea is similar to
the heuristic dynamic programming algorithm [1], where Werbos suggested the use
of a model network along with the actor and critic networks. Because of the general-
ity of the considered system and objective function, the developed solution approach
can be used in a wide range of applications in different fields.
The actor and critic neural networks developed in this chapter use gradient and
least-squares-based update laws, respectively, to minimize the Bellman error. The
identifier dynamic neural network is a combination of a Hopfield-type [2] component
and a novel RISE (Robust Integral of Sign of the Error) component. The Hopfield
component of the dynamic neural network learns the system dynamics based on
online gradient-based weight tuning laws, while the RISE term robustly accounts for
the function reconstruction errors, guaranteeing asymptotic estimation of the state
and the state derivative. Online estimation of the state derivative allows the actor-
critic-identifier architecture to be implemented without knowledge of system drift
dynamics; however, knowledge of the input gain matrix is required to implement
the control policy. While the design of the actor and critic are coupled through the
Hamilton–Jacobi–Bellman equation, the design of the identifier is decoupled from
the actor and the critic components, and can be considered as a modular component
in the actor-critic-identifier architecture. Convergence of the actor-critic-identifier
algorithm and stability of the closed-loop system are analyzed using Lyapunov-
based adaptive control methods, and a persistence of excitation condition is used
to guarantee exponential convergence to a bounded region in the neighborhood of
the optimal control and uniformly ultimately bounded stability of the closed-loop
system.


In Sect. 3.3, the developed actor-critic-identifier architecture is extended to a class


of trajectory tracking problems. Approximate dynamic programming has been inves-
tigated and used as a tool to approximately solve optimal regulation problems. For
these problems, function approximation techniques can be used to approximate the
value function because it is a time invariant function. In tracking problems, the
tracking error, and hence, the value function, is a function of the state and an explicit
function of time. Approximation techniques like neural networks are commonly used
in approximate dynamic programming literature for value function approximation.
However, neural networks can only approximate functions on compact domains.
Since the time-interval in an infinite-horizon problem is not compact, temporal fea-
tures of the value function cannot be effectively identified using a neural network.
In Sect. 3.3, the tracking error and the desired trajectory both serve as inputs to the
neural network, leading to a different Hamilton–Jacobi–Bellman equation that yields
an optimal controller with a time-varying feedback component. In particular, this
chapter addresses the technical obstacles that result from the time-varying nature of
the optimal control problem by using a system transformation to convert the problem
into a time-invariant optimal control problem. The resulting value function is a time-
invariant function of the transformed states, and hence, lends itself to approximation
using a neural network. A Lyapunov-based analysis is used to establish uniformly
ultimately bounded tracking and approximate optimality of the controller. Simulation
results are presented to demonstrate the applicability of the developed technique. To
gauge the performance of the proposed method, a comparison with a numerical
optimal solution is also presented.
In Sect. 3.4, the actor-critic-identifier architecture is extended to solve an N -player
nonzero-sum infinite-horizon differential game subject to continuous-time uncertain
nonlinear dynamics. Classical optimal control problems in the Bernoulli form aim
to find a single control input that minimizes a single cost functional under boundary
constraints and dynamical constraints imposed by the system [3, 4]. Various control
problems can be modeled as multi-input systems, where each input is computed by
a player, and each player attempts to influence the system state to minimize its own
cost function. In this case, the objective is to find a Nash equilibrium solution to the
resulting differential game.
In general, Nash equilibria are not unique. For a closed-loop differential game
(i.e., the control is a function of the state and time) with perfect information (i.e.,
all the players know the complete state history), there can be infinitely many Nash
equilibria. If the policies are constrained to be feedback policies, the resulting equi-
libria are called (sub)game perfect Nash equilibria or feedback-Nash equilibria. The
value functions corresponding to feedback-Nash equilibria satisfy a coupled system
of Hamilton–Jacobi equations (see, e.g., [5–8]). In this chapter, N -actor and N -critic
neural network structures are used to approximate the optimal control laws and the
optimal value function set, respectively. The main traits of this online algorithm
involve the use of approximate dynamic programming techniques and adaptive the-
ory to determine the feedback-Nash equilibrium solution to the game in a manner
that does not require full knowledge of the system dynamics and approximately
solves the underlying set of coupled Hamilton–Jacobi–Bellman equations of the
3.1 Introduction 45

game problem. For an equivalent nonlinear system, previous research makes use of
offline procedures or requires full knowledge of the system dynamics to determine
the Nash equilibrium. A Lyapunov-based stability analysis shows that uniformly ulti-
mately bounded tracking for the closed-loop system is guaranteed for the proposed
actor-critic-identifier architecture and a convergence analysis demonstrates that the
approximate control policies converge to a neighborhood of the optimal solutions.

3.2 Online Optimal Regulation1

In this section, an online adaptive reinforcement learning-based solution is devel-
oped for the infinite-horizon optimal control problem for continuous-time uncertain
nonlinear systems. Consider the control-affine nonlinear system in (1.9). Recall from
Sect. 2.6 the approximation of the Bellman error given by
$$
\hat{\delta}_t(t) = \nabla_x \hat{V}\big(x(t), \hat{W}_c(t)\big)\,\dot{\hat{x}}(t) + r\big(x(t), \hat{u}(x(t), \hat{W}_a(t))\big). \tag{3.1}
$$
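As a minimal numerical illustration of (3.1) (added here; the quadratic basis, cost weights, and dimensions are assumptions made for this sketch, not prescriptions from the text), the approximate Bellman error can be evaluated from the critic weights, the identifier's state-derivative estimate, and the local cost:

```python
import numpy as np

def sigma(x):
    """Assumed quadratic critic basis sigma(x) = [x1^2, x1*x2, x2^2]."""
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def grad_sigma(x):
    """Jacobian of sigma with respect to x, shape (3, 2)."""
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    2*x[1]]])

def bellman_error(x, x_hat_dot, u_hat, W_c, Q, R):
    """delta_hat_t = grad_x V_hat(x, W_c) * x_hat_dot + r(x, u_hat),
    with V_hat = W_c^T sigma(x) and r = x^T Q x + u^T R u."""
    grad_V = W_c @ grad_sigma(x)                  # row vector d(V_hat)/dx
    local_cost = x @ Q @ x + u_hat @ R @ u_hat
    return float(grad_V @ x_hat_dot + local_cost)

# Example call with placeholder values.
delta = bellman_error(x=np.array([3.0, -1.0]), x_hat_dot=np.array([-4.0, 0.5]),
                      u_hat=np.array([0.1]), W_c=np.array([0.5, 0.0, 1.0]),
                      Q=np.eye(2), R=np.eye(1))
```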

The actor and the critic adjust the weights Ŵa (·) and Ŵc (·), respectively, to
minimize the approximate Bellman error. The identifier learns the derivatives x̂˙ (·)
to minimize the error between the true Bellman error and its approximation. The
following assumptions facilitate the development of update laws for the identifier,
the critic, and the actor.

Assumption 3.1 The functions f and g are twice continuously differentiable.

Assumption 3.2 The input gain matrix g(x) is known and uniformly bounded for
all x ∈ Rn (i.e., 0 < ‖g(x)‖ ≤ ḡ, ∀x ∈ Rn, where ḡ is a known positive constant).

3.2.1 Identifier Design

To facilitate the design of the identifier, the following restriction is placed on the
control input.
Assumption 3.3 The control input is bounded (i.e., u (·) ∈ L∞ ).
Using Assumption 3.2, Property 2.3, and the projection algorithm in (3.27), Assump-
tion 3.3 holds for the control design u(t) = û(x(t), Ŵa(t)) in (2.9). Using Assump-
tion 3.3, the dynamic system in (1.9), with control u (·), can be represented using a
multi-layer neural network as

1 Parts of the text in this section are reproduced, with permission, from [9], © 2013, Elsevier.
$$
\dot{x}(t) = F_u(x(t), u(t)) \triangleq W_f^T \sigma_f\big(V_f^T x(t)\big) + \varepsilon_f(x(t)) + g(x(t))\,u(t), \tag{3.2}
$$

where Wf ∈ R(Lf +1)×n , Vf ∈ Rn×Lf are the unknown ideal neural network weights,
σf : RLf → RLf +1 is the neural network activation function, and εf : Rn → Rn is
the function reconstruction error. The following multi-layer dynamic neural network
identifier is used to approximate the system in (3.2)
$$
\dot{\hat{x}}(t) = \hat{F}_u\big(x(t), \hat{x}(t), u(t)\big) \triangleq \hat{W}_f^T(t)\,\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big) + g(x(t))\,u(t) + \mu(t), \tag{3.3}
$$

where x̂ : R≥t0 → Rn is the dynamic neural network state, Ŵf : R≥t0 → RLf +1×n
and V̂f : R≥t0 → Rn×Lf are weight estimates, and μ : R≥t0 → Rn denotes the RISE
feedback term defined as [10, 11]

$$
\mu(t) \triangleq k\,\tilde{x}(t) - k\,\tilde{x}(0) + v(t), \tag{3.4}
$$

where x̃(t) ≜ x(t) − x̂(t) ∈ Rn is the identification error, and v(t) ∈ Rn is a Filippov
solution [12] to the initial value problem
$$
\dot{v}(t) = (k\alpha + \gamma)\,\tilde{x}(t) + \beta_1\,\mathrm{sgn}(\tilde{x}(t)), \qquad v(0) = 0,
$$

where k, α, γ , β1 ∈ R are positive constant control gains. The identification error


dynamics can be written as
$$
\dot{\tilde{x}}(t) = \tilde{F}_u\big(x(t), \hat{x}(t), u(t)\big) = W_f^T \sigma_f\big(V_f^T x(t)\big) - \hat{W}_f^T(t)\,\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big) + \varepsilon_f(x(t)) - \mu(t), \tag{3.5}
$$
where F̃u(x, x̂, u) ≜ Fu(x, u) − F̂u(x, x̂, u) ∈ Rn. A filtered identification error is
defined as
$$
e_f(t) \triangleq \dot{\tilde{x}}(t) + \alpha\,\tilde{x}(t). \tag{3.6}
$$

Taking the time derivative of (3.6) and using (3.5) yields
$$
\begin{aligned}
\dot{e}_f(t) ={}& W_f^T\nabla_{V_f^T x}\sigma_f\big(V_f^T x(t)\big)V_f^T\dot{x}(t) - \hat{W}_f^T(t)\,\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)\hat{V}_f^T(t)\dot{\hat{x}}(t) \\
&- \hat{W}_f^T(t)\,\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)\dot{\hat{V}}_f^T(t)\hat{x}(t) - \dot{\hat{W}}_f^T(t)\,\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big) \\
&+ \dot{\varepsilon}_f(x(t),\dot{x}(t)) - k e_f(t) - \gamma\tilde{x}(t) - \beta_1\,\mathrm{sgn}(\tilde{x}(t)) + \alpha\dot{\tilde{x}}(t).
\end{aligned}\tag{3.7}
$$

Based on (3.7) and the subsequent stability analysis, the weight update laws for
the dynamic neural network are designed as
$$
\dot{\hat{W}}_f(t) = \mathrm{proj}\Big(\Gamma_{wf}\,\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)\hat{V}_f^T(t)\dot{\hat{x}}(t)\,\tilde{x}^T(t)\Big),
$$
$$
\dot{\hat{V}}_f(t) = \mathrm{proj}\Big(\Gamma_{vf}\,\dot{\hat{x}}(t)\tilde{x}^T(t)\hat{W}_f^T(t)\,\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)\Big), \tag{3.8}
$$
where the projection operator is used to bound the weight estimates such that
‖Ŵf(t)‖, ‖V̂f(t)‖ ≤ W̄f, ∀t ∈ R≥t0, where W̄f ∈ R>0 is a constant, and Γwf ∈
R(Lf+1)×(Lf+1) and Γvf ∈ Rn×n are positive constant adaptation gain matrices. The
expression in (3.7) can be rewritten as

$$
\dot{e}_f(t) = \tilde{N}(t) + N_{B1}(t) + \hat{N}_{B2}(t) - k e_f(t) - \gamma \tilde{x}(t) - \beta_1\,\mathrm{sgn}(\tilde{x}(t)), \tag{3.9}
$$

where the auxiliary signals, Ñ : R≥t0 → Rn , NB1 : R≥t0 → Rn , and N̂B2 : R≥t0 →
Rn are defined as
$$
\begin{aligned}
\tilde{N}(t) \triangleq{}& \alpha\dot{\tilde{x}}(t) - \dot{\hat{W}}_f^T(t)\,\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big) - \hat{W}_f^T(t)\,\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)\dot{\hat{V}}_f^T(t)\hat{x}(t) \\
&+ \frac{1}{2}W_f^T\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)\hat{V}_f^T(t)\dot{\tilde{x}}(t) + \frac{1}{2}\hat{W}_f^T(t)\,\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)V_f^T\dot{\tilde{x}}(t),
\end{aligned}\tag{3.10}
$$
$$
\begin{aligned}
N_{B1}(t) \triangleq{}& W_f^T\nabla_{V_f^T x}\sigma_f\big(V_f^T x(t)\big)V_f^T\dot{x}(t) - \frac{1}{2}W_f^T\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)\hat{V}_f^T(t)\dot{x}(t) \\
&- \frac{1}{2}\hat{W}_f^T(t)\,\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)V_f^T\dot{x}(t) + \dot{\varepsilon}_f(x(t),\dot{x}(t)),
\end{aligned}\tag{3.11}
$$
$$
\hat{N}_{B2}(t) \triangleq \frac{1}{2}\tilde{W}_f^T(t)\,\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)\hat{V}_f^T(t)\dot{\hat{x}}(t) + \frac{1}{2}\hat{W}_f^T(t)\,\nabla_{V_f^T x}\sigma_f\big(\hat{V}_f^T(t)\hat{x}(t)\big)\tilde{V}_f^T(t)\dot{\hat{x}}(t),\tag{3.12}
$$

where W̃f(t) ≜ Wf − Ŵf(t) and Ṽf(t) ≜ Vf − V̂f(t). To facilitate the subsequent
stability analysis, an auxiliary term NB2 : R≥t0 → Rn is defined by replacing x̂˙(t)
in N̂B2(t) by ẋ(t), and ÑB2 ≜ N̂B2 − NB2. The terms NB1 and NB2 are grouped as
NB ≜ NB1 + NB2.
Provided x (t) ∈ χ , where χ ⊂ Rn is a compact set containing the origin, using
Assumption 3.2, Property 2.3, (3.6), (3.8), (3.11), and (3.12), the following bounds
can be obtained
$$
\big\|\tilde{N}(t)\big\| \le \rho_1(\|z(t)\|)\,\|z(t)\|, \tag{3.13}
$$
$$
\|N_{B1}(t)\| \le \zeta_1, \qquad \|N_{B2}(t)\| \le \zeta_2, \qquad \big\|\dot{N}_B(t)\big\| \le \zeta_3 + \zeta_4\,\rho_2(\|z(t)\|)\,\|z(t)\|, \tag{3.14}
$$
$$
\dot{\tilde{x}}^T(t)\,\tilde{N}_{B2}(t) \le \zeta_5\|\tilde{x}(t)\|^2 + \zeta_6\|e_f(t)\|^2, \tag{3.15}
$$
∀t ∈ R≥t0, where ρ1, ρ2 : R → R are positive strictly increasing functions arising
from the Mean Value Theorem (see [13]), z ≜ [x̃ᵀ efᵀ]ᵀ ∈ R2n, and ζi ∈ R, i =
1, . . . , 6 are positive constants. To facilitate the analysis, assume temporarily that
x(t) ∈ χ, ∀t ∈ R≥t0. Let the auxiliary signal y : R≥t0 → R2n+2 be defined as
$$
y(t) \triangleq \Big[\tilde{x}^T(t)\;\; e_f^T(t)\;\; \sqrt{P(t)}\;\; \sqrt{Q(t)}\Big]^T. \tag{3.16}
$$

In (3.16), the auxiliary signal Q : R≥t0 → R is defined as
$$
Q(t) \triangleq \frac{1}{4}\alpha\Big[\mathrm{tr}\big(\tilde{W}_f^T(t)\,\Gamma_{wf}^{-1}\tilde{W}_f(t)\big) + \mathrm{tr}\big(\tilde{V}_f^T(t)\,\Gamma_{vf}^{-1}\tilde{V}_f(t)\big)\Big],
$$
where the auxiliary function P : R≥t0 → R is the Filippov solution [12] to the dif-
ferential equation

$$
\dot{P}(t) = -L(z(t), t), \qquad P(t_0) = \beta_1\sum_{i=1}^{n}\big|\tilde{x}_i(t_0)\big| - \tilde{x}^T(t_0)\,N_B(t_0). \tag{3.17}
$$

In (3.17), the auxiliary function L : R2n × R≥t0 → R is defined as

$$
L(z, t) \triangleq e_f^T\big(N_{B1}(t) - \beta_1\,\mathrm{sgn}(\tilde{x}(t))\big) + \dot{\tilde{x}}^T(t)\,N_{B2}(t) - \beta_2\,\rho_2(\|z\|)\,\|z\|\,\|\tilde{x}(t)\|, \tag{3.18}
$$
where β1 , β2 ∈ R are selected according to the sufficient conditions

$$
\beta_1 > \max\!\Big(\zeta_1 + \zeta_2,\; \zeta_1 + \frac{\zeta_3}{\alpha}\Big), \qquad \beta_2 > \zeta_4, \tag{3.19}
$$
to ensure P (t) ≥ 0, ∀t ∈ R≥t0 (see Appendix A.1.1).
Let D ⊂ R2n+2 be an open and connected set defined as D ≜ {y ∈ R2n+2 | ‖y‖ <
inf ρ⁻¹([2√(λη), ∞))}, where λ and η are defined in Appendix A.1.2. Let D̄ be
the compact set D̄ ≜ {y ∈ R2n+2 | ‖y‖ ≤ inf ρ⁻¹([2√(λη), ∞))}. Let VI : D̄ → R
be a positive-definite, locally Lipschitz, regular function defined as
$$
V_I(y) \triangleq \frac{1}{2}e_f^T e_f + \frac{1}{2}\gamma\,\tilde{x}^T\tilde{x} + P + Q. \tag{3.20}
$$
The candidate Lyapunov function in (3.20) satisfies the inequalities

U1 (y) ≤ VI (y) ≤ U2 (y), (3.21)

where U1 (y), U2 (y) ∈ R are continuous positive definite functions defined as

$$
U_1(y) \triangleq \frac{1}{2}\min(1, \gamma)\,\|y\|^2, \qquad U_2(y) \triangleq \max(1, \gamma)\,\|y\|^2.
$$
Additionally, let S ⊂ D denote a set defined as S ≜ {y ∈ D | ρ(√(2U2(y))) < 2√(λη)}, and let
$$
\dot{y}(t) = h(y(t), t) \tag{3.22}
$$

represent the closed-loop differential equations in (3.5), (3.8), (3.9), and (3.17),
where h : R2n+2 × R≥t0 → R2n+2 denotes the right-hand side of the closed-loop error
signals.

Theorem 3.4 For the system in (1.9), the identifier developed in (3.3) along with
the weight update laws in (3.8) ensures asymptotic identification of the state and its
derivative, in the sense that all Filippov solutions to (3.22) that satisfy y (t0 ) ∈ S,
are bounded, and further satisfy
$$
\lim_{t\to\infty}\|\tilde{x}(t)\| = 0, \qquad \lim_{t\to\infty}\big\|\dot{\tilde{x}}(t)\big\| = 0,
$$

provided the control gains k and γ are selected sufficiently large based on the initial
conditions of the states and satisfy the sufficient conditions

$$
\gamma > \frac{\zeta_5}{\alpha}, \qquad k > \zeta_6, \tag{3.23}
$$
where ζ5 and ζ6 are introduced in (3.15), and β1 and β2, introduced in (3.18), are
selected according to the sufficient conditions in (3.19).

Proof See Appendix A.1.2. □
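To illustrate how the identifier in (3.3) and (3.4) and the weight update laws in (3.8) might be implemented in discrete time, a rough Euler-integration sketch is given below (the gains, the sigmoid basis, and the omission of the projection operator are simplifying assumptions made for this illustration only, not the book's implementation):

```python
import numpy as np

# Euler-discretized sketch of the RISE-based identifier (3.3), (3.4), (3.8).
n, L_f, dt = 2, 5, 1e-3                      # state dimension, hidden neurons, step
k, alpha, gamma_, beta1 = 800.0, 300.0, 5.0, 0.2
Gamma_wf, Gamma_vf = 0.1 * np.eye(L_f + 1), 0.1 * np.eye(n)

W_f = 0.1 * np.random.randn(L_f + 1, n)      # output-layer weight estimate
V_f = 0.1 * np.random.randn(n, L_f)          # inner-layer weight estimate
x_hat, v, x_tilde0 = np.zeros(n), np.zeros(n), None

def sigma_f(z):                              # symmetric sigmoid basis with a bias term
    return np.concatenate(([1.0], np.tanh(z)))

def dsigma_f(z):                             # Jacobian of sigma_f w.r.t. its argument
    return np.vstack((np.zeros(L_f), np.diag(1.0 - np.tanh(z)**2)))

def identifier_step(x, u, g_x):
    """One integration step; returns the state-derivative estimate and x_tilde."""
    global W_f, V_f, x_hat, v, x_tilde0
    x_tilde = x - x_hat
    if x_tilde0 is None:
        x_tilde0 = x_tilde.copy()
    mu = k * x_tilde - k * x_tilde0 + v                      # RISE term (3.4)
    z = V_f.T @ x_hat
    x_hat_dot = W_f.T @ sigma_f(z) + g_x @ u + mu            # identifier (3.3)
    # weight update laws (3.8); the projection operator is omitted in this sketch
    W_f += dt * Gamma_wf @ np.outer(dsigma_f(z) @ (V_f.T @ x_hat_dot), x_tilde)
    V_f += dt * Gamma_vf @ np.outer(x_hat_dot, x_tilde) @ W_f.T @ dsigma_f(z)
    v += dt * ((k * alpha + gamma_) * x_tilde + beta1 * np.sign(x_tilde))
    x_hat += dt * x_hat_dot
    return x_hat_dot, x_tilde
```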


3.2.2 Least-Squares Update for the Critic

For online implementation, a normalized recursive formulation of the least-squares
algorithm is developed for the critic update law as
$$
\dot{\hat{W}}_c(t) = -k_c\,\Gamma(t)\,\frac{\omega(t)}{1 + \nu\,\omega^T(t)\Gamma(t)\omega(t)}\,\hat{\delta}_t(t), \tag{3.24}
$$
where ω : R≥t0 → RL, defined as ω(t) ≜ ∇xσ(x(t)) F̂u(x(t), x̂(t), û(x(t), Ŵa(t))), is
the critic neural network regressor vector, ν, kc ∈ R are constant positive gains, and
Γ : R≥t0 → RL×L is a symmetric estimation gain matrix generated using the initial
value problem
$$
\dot{\Gamma}(t) = -k_c\,\Gamma(t)\,\frac{\omega(t)\,\omega^T(t)}{1 + \nu\,\omega^T(t)\Gamma(t)\omega(t)}\,\Gamma(t); \qquad \Gamma(t_r^+) = \Gamma(0) = \bar{\gamma}\, I_L, \tag{3.25}
$$

where $t_r^+$ is the resetting time at which $\lambda_{\min}\{\Gamma(t)\} \le \underline{\gamma}$, with $\bar{\gamma} > \underline{\gamma} > 0$. The covariance
resetting ensures that Γ(·) is positive-definite for all time and prevents its value
from becoming arbitrarily small in some directions, thus avoiding slow adaptation
in some directions. From (3.25), it is clear that Γ̇(t) ≤ 0, which means that Γ(·)
can be bounded as
$$
\underline{\gamma}\, I_L \le \Gamma(t) \le \bar{\gamma}\, I_L, \quad \forall t \in \mathbb{R}_{\ge t_0}. \tag{3.26}
$$
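In discrete time, the critic update (3.24) and the covariance dynamics (3.25) with resetting might be implemented as in the following sketch (Euler integration; the gains, dimensions, and reset thresholds are placeholders chosen for illustration):

```python
import numpy as np

k_c, nu, dt = 20.0, 0.005, 1e-3              # placeholder gains and step size
gamma_bar, gamma_lower = 5000.0, 1.0         # reset value and reset threshold
L = 3
W_c = np.random.uniform(-1.0, 1.0, L)        # critic weight estimate
Gamma = gamma_bar * np.eye(L)                # estimation gain (covariance) matrix

def critic_step(omega, delta_hat):
    """One step of the normalized recursive least-squares critic update."""
    global W_c, Gamma
    norm = 1.0 + nu * omega @ Gamma @ omega
    W_c = W_c - dt * k_c * (Gamma @ omega) * delta_hat / norm                 # (3.24)
    Gamma = Gamma - dt * k_c * Gamma @ np.outer(omega, omega) @ Gamma / norm  # (3.25)
    if np.min(np.linalg.eigvalsh(Gamma)) <= gamma_lower:                      # covariance reset
        Gamma = gamma_bar * np.eye(L)
    return W_c
```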

3.2.3 Gradient Update for the Actor

The actor update, like the critic update in Sect. 3.2.2, is based on the minimization
of the Bellman error δ̂t . However, unlike the critic weights, the actor weights appear
nonlinearly in δ̂t , making it problematic to develop a least-squares update law. Hence,
a gradient update law is developed for the actor which minimizes the squared Bellman
error. The gradient-based update law for the actor neural network is given by
$$
\begin{aligned}
\dot{\hat{W}}_a(t) = \mathrm{proj}\Bigg\{&-\frac{2k_{a1}}{\sqrt{1 + \omega^T(t)\omega(t)}}\left(\hat{W}_c^T(t)\,\nabla_x\sigma(x(t))\,\frac{\partial \hat{F}_u\big(x(t),\hat{x}(t),\hat{u}(x(t),\hat{W}_a(t))\big)}{\partial \hat{u}}\,\frac{\partial \hat{u}\big(x(t),\hat{W}_a(t)\big)}{\partial \hat{W}_a}\right)^{T}\hat{\delta}_t(t) \\
&-\frac{4k_{a1}}{\sqrt{1 + \omega^T(t)\omega(t)}}\left(\frac{\partial \hat{u}\big(x(t),\hat{W}_a(t)\big)}{\partial \hat{W}_a}\right)^{T} R\,\hat{u}\big(x(t),\hat{W}_a(t)\big)\,\hat{\delta}_t(t) - k_{a2}\big(\hat{W}_a(t) - \hat{W}_c(t)\big)\Bigg\},
\end{aligned}\tag{3.27}
$$
where G(x) ≜ g(x)R⁻¹g(x)ᵀ, ka1, ka2 ∈ R are positive adaptation gains, 1/√(1 + ωᵀ(t)ω(t))
is the normalization term, and the last term in (3.27) is added for stability (based
on the subsequent stability analysis). Using the identifier developed in (3.3), the
actor weight update law can now be simplified as
$$
\dot{\hat{W}}_a(t) = \mathrm{proj}\left\{-\frac{k_{a1}}{\sqrt{1 + \omega^T(t)\omega(t)}}\,\nabla_x\sigma(x(t))\,G\,\nabla_x\sigma^T(x(t))\big(\hat{W}_a(t) - \hat{W}_c(t)\big)\hat{\delta}_t(t) - k_{a2}\big(\hat{W}_a(t) - \hat{W}_c(t)\big)\right\}. \tag{3.28}
$$
The projection operator ensures that ‖Ŵa(t)‖ ≤ W̄, ∀t ∈ R≥t0, where W̄ ∈ R>0 is
a positive constant such that ‖W‖ ≤ W̄. For notational brevity, let BW̄ denote the
set {w ∈ RL | ‖w‖ ≤ 2W̄}.
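A corresponding sketch of the simplified actor update (3.28) is given below (Euler integration with the projection omitted; the policy form û = −½R⁻¹g(x)ᵀ∇xσ(x)ᵀŴa follows the control-affine policy structure used in the chapter (cf. (2.9)), and the gains and dimensions are placeholders):

```python
import numpy as np

k_a1, k_a2, dt, L = 10.0, 50.0, 1e-3, 3      # placeholder gains and dimensions
W_a = np.random.uniform(-1.0, 1.0, L)        # actor weight estimate

def actor_step(W_c, omega, grad_sigma_x, g_x, R_inv, delta_hat):
    """One step of (3.28). grad_sigma_x is d(sigma)/dx at x (L x n), g_x is g(x)."""
    global W_a
    G = g_x @ R_inv @ g_x.T                                  # G(x) = g R^{-1} g^T
    norm = np.sqrt(1.0 + omega @ omega)
    W_a_dot = (-k_a1 / norm) * grad_sigma_x @ G @ grad_sigma_x.T @ (W_a - W_c) \
              * delta_hat - k_a2 * (W_a - W_c)
    W_a = W_a + dt * W_a_dot                                 # projection omitted
    return W_a

def u_hat(W_a, grad_sigma_x, g_x, R_inv):
    """Assumed policy form u_hat = -(1/2) R^{-1} g^T (d sigma/dx)^T W_a."""
    return -0.5 * R_inv @ g_x.T @ grad_sigma_x.T @ W_a
```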

Remark 3.5 A recursive least-squares update law with covariance resetting is devel-
oped for the critic in (3.24), which exploits the fact that the critic weights appear
linearly in the Bellman error δ̂t (·). This is in contrast to the modified Levenberg–
Marquardt algorithm in [14] which is similar to the normalized gradient update law.
The actor update law in (3.27) also differs in the sense that the update law in [14] is
purely motivated by the stability analysis whereas the proposed actor update law is
based on the minimization of the Bellman error with an additional term for stability.
Heuristically, the difference in the update law development could lead to improved
performance in terms of faster convergence of the actor and critic weights, as seen
from the simulation results in Sect. 3.2.5.

3.2.4 Convergence and Stability Analysis

Using the Hamilton–Jacobi–Bellman equation, the unmeasurable form of the Bell-
man error can be written as
$$
\begin{aligned}
\hat{\delta}_t &= \hat{W}_c^T\omega - W^T\nabla_x\sigma\, F_{u^*} + \hat{u}^T R\,\hat{u} - u^{*T} R\, u^* - \nabla_x\varepsilon\, F_{u^*}, \\
&= -\tilde{W}_c^T\omega - W^T\nabla_x\sigma\,\tilde{F}_{\hat{u}} + \frac{1}{4}\tilde{W}_a^T\nabla_x\sigma\, G\,\nabla_x\sigma^T\tilde{W}_a - \frac{1}{4}\nabla_x\varepsilon\, G\,\nabla_x\varepsilon^T - \nabla_x\varepsilon\, F_{u^*},
\end{aligned}\tag{3.29}
$$

The dynamics of the critic weight estimation error W̃c can now be developed by
substituting (3.29) into (3.24) as
$$
\dot{\tilde{W}}_c = -k_c\Gamma\psi\psi^T\tilde{W}_c + k_c\Gamma\frac{\omega}{1 + \nu\omega^T\Gamma\omega}\left(-W^T\nabla_x\sigma\,\tilde{F}_{\hat{u}} + \frac{1}{4}\tilde{W}_a^T\nabla_x\sigma\, G\,\nabla_x\sigma^T\tilde{W}_a - \frac{1}{4}\nabla_x\varepsilon\, G\,\nabla_x\varepsilon^T - \nabla_x\varepsilon\, F_{u^*}\right), \tag{3.30}
$$
where $\psi \triangleq \frac{\omega}{\sqrt{1 + \nu\omega^T\Gamma\omega}} \in \mathbb{R}^N$ is the normalized critic regressor vector, bounded as
$$
\|\psi\| \le \frac{1}{\sqrt{\nu\underline{\gamma}}}, \quad \forall t \in \mathbb{R}_{\ge t_0}, \tag{3.31}
$$
where $\underline{\gamma}$ is introduced in (3.26). The error system in (3.30) can be represented by the
following perturbed system
W̃˙ c = Ωnom + Δper , (3.32)

where $\Omega_{nom}(\tilde{W}_c, t) \triangleq -k_c\Gamma(t)\,\psi(t)\psi^T(t)\tilde{W}_c \in \mathbb{R}^N$ denotes the nominal system, and
$\Delta_{per} \triangleq k_c\Gamma\frac{\omega}{1 + \nu\omega^T\Gamma\omega}\big({-W^T}\nabla_x\sigma\tilde{F}_{\hat{u}} + \frac{1}{4}\tilde{W}_a^T\nabla_x\sigma G\nabla_x\sigma^T\tilde{W}_a - \frac{1}{4}\nabla_x\varepsilon G\nabla_x\varepsilon^T - \nabla_x\varepsilon F_{u^*}\big) \in \mathbb{R}^N$
denotes the perturbation. Using Theorem 2.5.1 in [15], the nominal system

W̃˙ c = −kc Γ ψψ T W̃c (3.33)

is globally exponentially stable, if the bounded signal ψ is uniformly persistently
exciting [16, 17] over the compact set χ × D̄ × BW̄, such that
$$
\int_t^{t+\delta}\psi(\tau)\,\psi(\tau)^T\,d\tau \ge \mu_1 I_L \quad \forall t \ge t_0,
$$

for some positive constants μ1 , δ ∈ R. Since Ωnom is continuously differentiable and


the Jacobian ∇W̃c Ωnom = −kc Γ ψψ T is bounded for the exponentially stable system
in (3.33), the converse Lyapunov Theorem 4.14 in [18] can be used to show that there
exists a function Vc : RN × R≥t0 → R, which satisfies the following inequalities
$$
\begin{aligned}
c_1\big\|\tilde{W}_c\big\|^2 \le V_c(\tilde{W}_c, t) &\le c_2\big\|\tilde{W}_c\big\|^2, \\
\nabla_t V_c + \nabla_{\tilde{W}_c} V_c\,\Omega_{nom}(\tilde{W}_c, t) &\le -c_3\big\|\tilde{W}_c\big\|^2, \\
\big\|\nabla_{\tilde{W}_c} V_c\big\| &\le c_4\big\|\tilde{W}_c\big\|,
\end{aligned}\tag{3.34}
$$

for some positive constants c1 , c2 , c3 , c4 ∈ R. Using Property 2.3, Assumptions 3.2


and 3.3, the projection bounds in (3.27), the fact that Fu∗ is bounded over compact
sets, and provided the conditions of Theorem 3.4 hold (required to prove that
t ↦ F̃û(x(t), x̂(t), û(x(t), Ŵa(t))) ∈ L∞), the following bounds can be developed:
$$
\big\|\tilde{W}_a\big\| \le \kappa_1, \qquad \big\|\nabla_x\sigma\, G\,\nabla_x\sigma^T\big\| \le \kappa_2,
$$
$$
\Big\|\frac{1}{4}\tilde{W}_a^T\nabla_x\sigma\, G\,\nabla_x\sigma^T\tilde{W}_a - \frac{1}{4}\nabla_x\varepsilon\, G\,\nabla_x\varepsilon^T - W^T\nabla_x\sigma\,\tilde{F}_{\hat{u}} - \nabla_x\varepsilon\, F_{u^*}\Big\| \le \kappa_3,
$$
$$
\Big\|\frac{1}{2}W^T\nabla_x\sigma\, G\,\nabla_x\varepsilon^T + \frac{1}{2}\nabla_x\varepsilon\, G\,\nabla_x\varepsilon^T + \frac{1}{2}W^T\nabla_x\sigma\, G\,\nabla_x\sigma^T\tilde{W}_a + \frac{1}{2}\nabla_x\varepsilon\, G\,\nabla_x\sigma^T\tilde{W}_a\Big\| \le \kappa_4, \tag{3.35}
$$
where κ1, κ2, κ3, κ4 ∈ R are computable positive constants.
Theorem 3.6 If Assumptions 3.1–3.3 hold, the regressor $\psi \triangleq \frac{\omega}{\sqrt{1+\nu\omega^T\Gamma\omega}}$ is uniformly
persistently exciting, and provided (3.19), (3.23), and the following sufficient gain
condition is satisfied
$$
\frac{c_3}{k_{a1}} > \kappa_1\kappa_2, \tag{3.36}
$$

where ka1 , c3 , κ1 , κ2 are introduced in (3.27), (3.34), and (3.35), then the controller
in (1.9), the actor-critic weight update laws in (3.24)–(3.25) and (3.28), and the
identifier in (3.3) and (3.8), guarantee that the state of the system x, and the actor-
critic weight estimation errors W̃a and W̃c are uniformly ultimately bounded.

Proof To investigate the stability of (1.9) with control û, and the perturbed system
in (3.32), consider VL : X × RN × RN × [0, ∞) → R as the continuously differen-
tiable, positive-definite Lyapunov function candidate defined as

$$
V_L(x, \tilde{W}_c, \tilde{W}_a, t) \triangleq V^*(x) + V_c(\tilde{W}_c, t) + \frac{1}{2}\tilde{W}_a^T\tilde{W}_a,
$$
where V ∗ (i.e., the optimal value function) is the candidate Lyapunov function for
(1.9), and Vc is the Lyapunov function for the exponentially stable system in (3.33).
Since V ∗ is continuously differentiable and positive-definite, there exist class K
functions (see [18, Lemma 4.3]), such that

$$
\alpha_1(\|x\|) \le V^*(x) \le \alpha_2(\|x\|) \quad \forall x \in \mathbb{R}^n. \tag{3.37}
$$

Using (3.34) and (3.37), VL(x, W̃c, W̃a, t) can be bounded as
$$
\alpha_1(\|x\|) + c_1\big\|\tilde{W}_c\big\|^2 + \frac{1}{2}\big\|\tilde{W}_a\big\|^2 \le V_L(x, \tilde{W}_c, \tilde{W}_a, t) \le \alpha_2(\|x\|) + c_2\big\|\tilde{W}_c\big\|^2 + \frac{1}{2}\big\|\tilde{W}_a\big\|^2,
$$
which can be written as
$$
\alpha_3(\|\tilde{z}\|) \le V_L(\tilde{z}, t) \le \alpha_4(\|\tilde{z}\|) \quad \forall \tilde{z} \in \mathbb{R}^{n+2N},
$$
where z̃ ≜ [xᵀ W̃cᵀ W̃aᵀ]ᵀ ∈ Rn+2N, and α3 and α4 are class K functions. Taking the
time derivative of VL (·) yields

$$
\dot{V}_L = \frac{\partial V^*}{\partial x}f + \frac{\partial V^*}{\partial x}g\hat{u} + \frac{\partial V_c}{\partial t} + \frac{\partial V_c}{\partial \tilde{W}_c}\Omega_{nom} + \frac{\partial V_c}{\partial \tilde{W}_c}\Delta_{per} - \tilde{W}_a^T\dot{\hat{W}}_a, \tag{3.38}
$$

where the time derivative of V ∗ is taken along the trajectories of the system (1.9)
with control û, and the time derivative of Vc is taken along the trajectories
of the perturbed system (3.32). To facilitate the subsequent analysis, the Hamilton–
Jacobi–Bellman equation in (1.14) is rewritten as $\frac{\partial V^*}{\partial x}f = -\frac{\partial V^*}{\partial x}g\,u^* - Q - u^{*T}Ru^*$.
Substituting for $\frac{\partial V^*}{\partial x}f$ in (3.38), using the fact that $\frac{\partial V^*}{\partial x}g = -2u^{*T}R$ from (1.13), and
using (3.27) and (3.34), (3.38) can be upper bounded as
$$
\begin{aligned}
\dot{V}_L \le{}& -Q - u^{*T}Ru^* - c_3\big\|\tilde{W}_c\big\|^2 + c_4\big\|\tilde{W}_c\big\|\,\big\|\Delta_{per}\big\| + 2u^{*T}R(u^* - \hat{u}) \\
&+ k_{a2}\tilde{W}_a^T(\hat{W}_a - \hat{W}_c) + \frac{k_{a1}}{\sqrt{1 + \omega^T\omega}}\,\tilde{W}_a^T\nabla_x\sigma\, G\,\nabla_x\sigma^T(\hat{W}_a - \hat{W}_c)\,\hat{\delta}_t.
\end{aligned}\tag{3.39}
$$

Substituting for u∗, û, δ̂t, and Δper using (1.13), (2.9), (3.29), and (3.32), respec-
tively, and substituting (3.26) and (3.31) into (3.39), yields
$$
\begin{aligned}
\dot{V}_L \le{}& -Q - c_3\big\|\tilde{W}_c\big\|^2 - k_{a2}\big\|\tilde{W}_a\big\|^2 + \frac{1}{2}W^T\nabla_x\sigma\, G\,\nabla_x\varepsilon^T + \frac{1}{2}\nabla_x\varepsilon\, G\,\nabla_x\varepsilon^T \\
&+ \frac{1}{2}W^T\nabla_x\sigma\, G\,\nabla_x\sigma^T\tilde{W}_a + \frac{1}{2}\nabla_x\varepsilon\, G\,\nabla_x\sigma^T\tilde{W}_a \\
&+ c_4\frac{k_c\bar{\gamma}}{2\sqrt{\nu\underline{\gamma}}}\Big\|{-W^T}\nabla_x\sigma\,\tilde{F}_{\hat{u}} + \frac{1}{4}\tilde{W}_a^T\nabla_x\sigma\, G\,\nabla_x\sigma^T\tilde{W}_a - \frac{1}{4}\nabla_x\varepsilon\, G\,\nabla_x\varepsilon^T - \nabla_x\varepsilon\, F_{u^*}\Big\|\,\big\|\tilde{W}_c\big\| + k_{a2}\big\|\tilde{W}_a\big\|\,\big\|\tilde{W}_c\big\| \\
&+ \frac{k_{a1}}{\sqrt{1 + \omega^T\omega}}\,\tilde{W}_a^T\nabla_x\sigma\, G\,\nabla_x\sigma^T(\tilde{W}_c - \tilde{W}_a)\Big({-\tilde{W}_c^T\omega} - W^T\nabla_x\sigma\,\tilde{F}_{\hat{u}} + \frac{1}{4}\tilde{W}_a^T\nabla_x\sigma\, G\,\nabla_x\sigma^T\tilde{W}_a - \frac{1}{4}\nabla_x\varepsilon\, G\,\nabla_x\varepsilon^T - \nabla_x\varepsilon\, F_{u^*}\Big).
\end{aligned}\tag{3.40}
$$

Using the bounds developed in (3.35), (3.40) can be further upper bounded as
$$
\dot{V}_L \le -Q - (c_3 - k_{a1}\kappa_1\kappa_2)\big\|\tilde{W}_c\big\|^2 - k_{a2}\big\|\tilde{W}_a\big\|^2 + k_{a1}\kappa_1^2\kappa_2\kappa_3 + \kappa_4 + \left(\frac{c_4 k_c\bar{\gamma}}{2\sqrt{\nu\underline{\gamma}}}\,\kappa_3 + k_{a1}\kappa_1\kappa_2\kappa_3 + k_{a1}\kappa_1^2\kappa_2 + k_{a2}\kappa_1\right)\big\|\tilde{W}_c\big\|.
$$

Provided c3 > ka1κ1κ2, completion of the square yields
$$
\begin{aligned}
\dot{V}_L \le{}& -Q - (1-\theta)(c_3 - k_{a1}\kappa_1\kappa_2)\big\|\tilde{W}_c\big\|^2 - k_{a2}\big\|\tilde{W}_a\big\|^2 + k_{a1}\kappa_1^2\kappa_2\kappa_3 + \kappa_4 \\
&+ \frac{1}{4\theta(c_3 - k_{a1}\kappa_1\kappa_2)}\left(\frac{c_4 k_c\bar{\gamma}}{2\sqrt{\nu\underline{\gamma}}}\,\kappa_3 + k_{a1}\kappa_1\kappa_2\kappa_3 + k_{a1}\kappa_1^2\kappa_2 + k_{a2}\kappa_1\right)^{2},
\end{aligned}\tag{3.41}
$$

where 0 < θ < 1. Since Q is positive definite, [18, Lemma 4.3] indicates that there
exist class K functions α5 and α6 such that
$$
\alpha_5(\|\tilde{z}\|) \le Q + (1-\theta)(c_3 - k_{a1}\kappa_1\kappa_2)\big\|\tilde{W}_c\big\|^2 + k_{a2}\big\|\tilde{W}_a\big\|^2 \le \alpha_6(\|\tilde{z}\|) \quad \forall \tilde{z} \in B_s. \tag{3.42}
$$

Using (3.42), the expression in (3.41) can be further upper bounded as

$$
\dot{V}_L \le -\alpha_5(\|\tilde{z}\|) + k_{a1}\kappa_1^2\kappa_2\kappa_3 + \kappa_4 + \kappa_5,
$$
where
$$
\kappa_5 \triangleq \frac{1}{4\theta(c_3 - k_{a1}\kappa_1\kappa_2)}\left(\frac{c_4 k_c\bar{\gamma}}{2\sqrt{\nu\underline{\gamma}}}\,\kappa_3 + k_{a1}\kappa_1\kappa_2\kappa_3 + k_{a1}\kappa_1^2\kappa_2 + k_{a2}\kappa_1\right)^{2}.
$$

Hence, V̇L(t) is negative whenever z̃(t) lies outside the compact set
$$
\Omega_{\tilde{z}} \triangleq \Big\{\tilde{z} : \|\tilde{z}\| \le \alpha_5^{-1}\big(\kappa_5 + k_{a1}\kappa_1^2\kappa_2\kappa_3 + \kappa_4\big)\Big\}.
$$
Invoking [18, Theorem 4.18], it can be concluded that z̃(·) is uniformly ultimately
bounded. The bounds in (3.35) depend on the actor neural network approximation
error ∇xε, which can be reduced by increasing the number of neurons, L, thereby
reducing the size of the residual set Ωz̃. From Property 2.3, as the number of neurons
of the actor and critic neural networks approaches infinity, ‖∇xε‖ → 0. □

Since c3 is a function of the critic adaptation gain kc , ka1 is the actor adaptation
gain, and κ1 , κ2 are known constants, the sufficient gain condition in (3.36) can be
easily satisfied.
Remark 3.7 Since the actor, the critic, and the identifier are continuously updated,
the developed reinforcement learning algorithm can be compared to fully optimistic
policy iteration in machine learning literature [19]. Unlike traditional policy iteration
where policy improvement is done after convergence of the policy evaluation step,
fully optimistic policy iteration carries out policy evaluation and policy improvement
after every state transition. Proving convergence of optimistic policy iteration is
complicated and is an active area of research in machine learning [19, 20]. By
considering an adaptive control framework, this result investigates the convergence
and stability behavior of fully optimistic policy iteration in continuous-time.
Remark 3.8 The persistence of excitation condition in Theorem 2 is equivalent to the
exploration paradigm in reinforcement learning which ensures sufficient sampling
of the state-space and convergence to the optimal policy [21].

3.2.5 Simulation

The following nonlinear system is considered


   
−x1 + x2 0
ẋ = + u, (3.43)
−0.5x1 − 0.5x2 (1 − (cos(2x1 ) + 2)2 ) cos(2x1 ) + 2

where x (t)  [x1 (t) x2 (t)]T ∈ R2 and u (t) ∈ R. The state and control penalties are
selected as  
T 1 0
Q(x) = x x; R = 1.
01

The optimal value function and optimal control for the system in (3.43) are known,
and given by [14]

1 2
V ∗ (x) = x + x22 ; u∗ (x) = −(cos(2x1 ) + 2)x2 .
2 1
56 3 Excitation-Based Online Approximate Optimal Control

The activation function for the critic neural network is selected with three neurons
as
σ (x) = [x12 x1 x2 x22 ]T ,

which yields the optimal weights W = [0.5 0 1]T . The activation function for the
identifier dynamic neural network is selected as a symmetric sigmoid with five neu-
rons in the hidden layer.
Remark 3.9 The choice of a good basis for the value function and control policy is
critical for convergence. For a general nonlinear system, choosing a suitable basis
can be a challenging problem without any prior knowledge about the system.
The identifier gains are selected as

k = 800, α = 300, γ = 5, β1 = 0.2, Γwf = 0.1I6 , and Γvf = 0.1I2 ,

and the gains for the actor-critic learning laws are selected as

ka1 = 10, ka2 = 50, kc = 20, and ν = 0.005.

The covariance matrix is initialized to Γ (0) = 5000I3 , all the neural network weights
are randomly initialized in [−1, 1], and the states are initialized to x(0) = [3, −1].
An implementation issue in using the developed algorithm is to ensure persis-
tence of excitation of the critic regressor vector. Unlike linear systems, where
persistence of excitation of the regressor translates to sufficient richness of the
external input, no verifiable method exists to ensure persistence of excitation in
nonlinear regulation problems. To ensure persistence of excitation qualitatively,
a small exploratory signal consisting of sinusoids of varying frequencies, n (t) =
sin2 (t) cos (t) + sin2 (2t) cos(0.1t) + sin2 (−1.2t) cos(0.5t) + sin5 (t), is added to
the control u (t) for the first 3 s [14]. The proposed control algorithm is implemented
using (2.9), (3.1), (3.3), (3.4), (3.24), (3.25), and (3.28). The evolution of states is
shown in Fig. 3.1. The identifier approximates the system dynamics, and the state
derivative estimation error is shown in Fig. 3.2.
As compared to discontinuous sliding mode identifiers which require infinite
bandwidth and exhibit chattering, the RISE-based identifier in (3.3) is continuous,
and thus, mitigates chattering to a large extent, as seen in Fig. 3.2.
Persistence of excitation ensures that the weights converge close to their optimal
values (i.e., Ŵc = [0.5004 0.0005 0.9999]T ≈ Ŵa ) in approximately 2 s, as seen
from the evolution of actor-critic weights in Figs. 3.3 and 3.4. The improved actor-
critic weight update laws, based on minimization of the Bellman error, led to faster
convergence of weights as compared to [14]. Errors in approximating the optimal
value function and optimal control at steady state (t = 10 s) are plotted against the
states in Figs. 3.5 and 3.6, respectively.
3.2 Online Optimal Regulation 57

Fig. 3.1 System states x (t) 3


x1
with persistently excited
2.5 x2
input for the first 3 s
(reproduced with permission 2
from [9], 2013,
c Elsevier)
1.5

x
0.5

−0.5

−1

−1.5
0 2 4 6 8 10
Time (s)

Fig. 3.2 Error in estimating


the state derivative x̃˙ (t) by 0.01
the identifier (reproduced
with permission from [9], 0.005
2013,
c Elsevier)
0
x̃˙

−0.005

−0.01

−0.015

0 2 4 6 8 10
Time (s)

Fig. 3.3 Convergence of Wc1


1.5
critic weights Ŵc (t) W
c2
(reproduced with permission Wc3
from [9], 2013,
c Elsevier) 1

0.5
Wc

−0.5

−1
0 2 4 6 8 10
Time (s)
58 3 Excitation-Based Online Approximate Optimal Control

Fig. 3.4 Convergence of 1.5 Wa1


actor weights Ŵa (t) W
a2
(reproduced with permission Wa3
from [9], 2013,
c Elsevier) 1

0.5

Wa
0

−0.5

−1
0 2 4 6 8 10
Time (s)

Fig. 3.5 Error in


approximating the optimal x 10
−4

value function by the critic at 10


steady state (reproduced
with permission from [9],
2013,
V̂ − V ∗

c Elsevier) 5

−5
2
1 2
0 1
0
x2 −1 −1 x1
−2 −2

Fig. 3.6 Error in


approximating the optimal −4
x 10
control by the actor at steady 6
state (reproduced with
4
permission from [9], 2013,
c
Elsevier) 2
û − u∗

0
−2
−4
−6
2
1 2
0 1
x2 0
−1 −1 x1
−2 −2
3.3 Extension to Trajectory Tracking 59

3.3 Extension to Trajectory Tracking2

Approximate dynamic programming has been investigated and used as a method


to approximately solve optimal regulation problems. However, the extension of
this technique to optimal tracking problems for continuous-time nonlinear systems
requires additional attention. The control development in this section uses a system
transformation to convert the time-varying tracking problem into a time-invariant
optimal control problem.

3.3.1 Formulation of a Time-Invariant Optimal Control


Problem

The control objective in this section is to track a bounded continuously differentiable


signal xd : R≥t0 → Rn . To quantify this objective, a tracking error is defined as e (t) 
x (t) − xd (t). The open-loop tracking error dynamics can then be expressed as

ė (t) = f (x (t)) + g (x (t)) u (t) − ẋd (t) . (3.44)

The following assumptions are made to facilitate the formulation of an approxi-


mate optimal tracking controller.
Assumption 3.10 The function g is bounded, the matrix g (x) has full column rank
 −1 T
for all x ∈ Rn , and the function g + : Rn → Rm×n defined as g +  g T g g is
bounded and locally Lipschitz.

Assumption 3.11 The desired trajectory is bounded such that xd (t) ≤ d , ∀t ∈
R≥t0 , and there exists a locally Lipschitz function hd : Rn → Rn such that ẋd (t) =
hd (xd (t)) and g (xd (t)) g + (xd (t)) (hd (xd (t)) − f (xd (t))) = hd (xd (t)) −
f (xd (t)), ∀t ∈ R≥t0 .

Remark 3.12 Assumptions 3.10 and 3.11 can be eliminated if a discounted cost
optimal tracking problem is considered instead of the total cost problem considered
in this chapter. The discounted cost tracking problem considers a value function

of the form V ∗ (ζ )  minu(τ )∈U |τ ∈R≥t t eκ(t−τ ) r (φ (τ, t, ζ, u (·)) , u (τ )) dτ , where
κ ∈ R>0 is a constant discount factor, and the control effort u is minimized instead
of the control error μ, introduced in (3.48). The control effort required for a system
to perfectly track a desired trajectory is generally nonzero even if the initial system
state is on the desired trajectory. Hence, in general, the optimal value function for a
discounted cost problem does not satisfy V ∗ (0) = 0. In fact, the origin may not even

2 Parts of the text in this section are reproduced, with permission, from [22], 2015,
c Elsevier.
60 3 Excitation-Based Online Approximate Optimal Control

be a local minimum of the optimal value function. Online continuous-time reinforce-


ment learning techniques are generally analyzed using the optimal value function as
a candidate Lyapunov function. Since the origin is not necessarily a local minimum
of the optimal value function for a discounted cost problem, it can not be used as
a candidate Lyapunov function. Hence, to make the stability analysis tractable, a
total-cost optimal control problem is considered in this section.

The steady-state control policy ud : Rn → Rm corresponding to the desired tra-


jectory xd is
ud (xd ) = g + (xd ) (hd (xd ) − f (xd )) . (3.45)

For notational brevity, let gd+ (t)  g + (xd (t)) and fd (t)  f (xd (t)). To transform
the time-varying optimal control problem into a time-invariant optimal control prob-
lem, a new concatenated state ζ : R≥t0 → R2n is defined as [23]
T
ζ  eT , xdT . (3.46)

Based on (3.44) and Assumption 3.11, the time derivative of (3.46) can be expressed
as
ζ̇ (t) = F (ζ (t)) + G (ζ (t)) μ (t) , (3.47)

where the functions F : R2n → R2n , G : R2n → R2n×m , and the control μ : R≥t0 →
Rm are defined as
 
f (e + xd ) − hd (xd ) + g (e + xd ) ud (xd )
F (ζ )  ,
hd (xd )
 
g (e + xd )
G (ζ )  , μ (t)  u (t) − ud (xd (t)) . (3.48)
0

Local Lipschitz continuity of f and g, the fact that f (0) = 0, and Assumption 3.11
imply that F (0) = 0 and F is locally Lipschitz.
The objective of the optimal control problem is to design a controller μ (·) that
minimizes the cost functional
 ∞
J (ζ, μ (·))  rt (ζ (τ ; t, ζ, μ (·)) , μ (τ )) dτ, (3.49)
t0

subject to the dynamic constraints in (3.47) and rt : R2n × Rm → R≥0 is the local
cost defined as
rt (ζ, μ)  Qt (ζ ) + μT Rμ. (3.50)

In (3.50), R ∈ Rm×m is a positive definite symmetric matrix of constants. For ease of


exposition, let the function Qt : R2n → R≥0 in (3.50) be defined as Qt (ζ )  ζ T Qζ ,
where Q ∈ R2n×2n is a constant matrix defined as
3.3 Extension to Trajectory Tracking 61
 
Q 0n×n
Q . (3.51)
0n×n 0n×n

In (3.51), Q ∈ Rn×n is a constant positive definite symmetric matrix with a mini-


mum eigenvalue q ∈ R>0 . Similar to Sect. 4.4, the developed technique can be easily
generalized to local cost functions where Qt (ζ )  Q (e) for any continuous positive
definite function Q : Rn → R.

3.3.2 Approximate Optimal Solution

Assuming that a minimizing policy exists and that the optimal value function V ∗ :
R2n → R≥0 defined as

∞

V (ζ )  min rt (ζ (τ ; t, ζ, μ (·)) , μ (τ )) dτ (3.52)
μ(τ )|τ ∈R≥t
t

is continuously differentiable, the Hamilton–Jacobi–Bellman equation for the opti-


mal control problem can be written as
   
H ∗ = ∇ζ V ∗ (ζ ) F (ζ ) + G (ζ ) μ∗ (ζ ) + rt ζ, μ∗ (ζ ) = 0, (3.53)

for all ζ , with the boundary condition V ∗ (0) = 0, where H ∗ denotes the Hamiltonian,
and μ∗ : R2n → Rm denotes the optimal policy. For the local cost in (3.50) and the
dynamics in (3.47), the optimal controller is given by μ (t) = μ∗ (ζ (t)), where μ∗
is the optimal policy given by

1  T
μ∗ (ζ ) = − R−1 G T (ζ ) ∇ζ V ∗ (ζ ) . (3.54)
2
Using Property 2.3, the value function, V ∗ , can be represented using a neural
network with L neurons as

V ∗ (ζ ) = W T σ (ζ ) +  (ζ ) , (3.55)

where W ∈ RL is the constant ideal weight matrix bounded above by a known positive
constant W ∈ R in the sense that W  ≤ W , σ : R2n → RL is a bounded continu-
ously differentiable nonlinear activation function, and  : R2n → R is the function
reconstruction error [24, 25].
Using (3.54) and (3.55) the optimal policy can be expressed as

1  
μ∗ (ζ ) = − R−1 G T (ζ ) ∇ζ σ T (ζ ) W + ∇ζ  T (ζ ) . (3.56)
2
62 3 Excitation-Based Online Approximate Optimal Control

Based on (3.55) and (3.56), the neural network approximations to the optimal value
function and the optimal policy are given by
 
V̂ ζ, Ŵc = ŴcT σ (ζ ) ,
  1
μ̂ ζ, Ŵa = − R−1 G T (ζ ) ∇ζ σ T (ζ ) Ŵa , (3.57)
2

where Ŵc ∈ RL and Ŵa ∈ RL are estimates of the ideal neural network weights W .
The controller  for the concatenated system is then designed as μ (t) =
μ̂ ζ (t) , Ŵa (t) . The controller for the original system is obtained from (3.45),
(3.48), and (3.57) as

1
u (t) = − R−1 G T (ζ (t)) ∇ζ σ T (ζ (t)) Ŵa (t) + gd+ (t) (hd (xd (t)) − fd (t)) .
2
(3.58)
Using the approximations μ̂ and V̂ for μ∗ and V ∗ in (3.53), respectively, the
error between the approximate and the optimal Hamiltonian (i.e., the Bellman error,
δ : Rn × RL → R), is given in a measurable form by
        
δ ζ, Ŵc , Ŵa  ∇ V̂ ζ, Ŵc F (ζ ) + G (ζ ) μ̂ ζ, Ŵa + rt ζ, μ̂ ζ, Ŵa .
(3.59)
t
The critic weights are updated to minimize 0 δt2 (ρ) d ρ using a normalized least-
squares update law with an exponential forgetting factor as [26]

ω (t)
Ŵ˙ c (t) = −kc Γ (t) δt (t) , (3.60)
1 + νωT (t) Γ ω (t)
 
ω (t) ωT (t)
Γ˙ (t) = −kc −λΓ (t) + Γ (t) Γ (t) , (3.61)
1 + νωT (t) Γ (t) ω (t)

where δt is the  evaluation of the Bellman error along the system trajectories
(i.e., δt (t) = δ ζ (t) , Ŵa (t) ), ν, kc ∈ R are constant positive adaptation gains, ω :
  
R≥0 → RL is defined as ω (t)  ∇ζ σ (ζ (t)) F (ζ (t)) + G (ζ (t)) μ̂ ζ (t) , Ŵa (t) ,
and λ ∈ (0, 1) is the constant forgetting factor for the estimation gain matrix
Γ ∈ RL×L . The least-squares approach is motivated by faster convergence. With
minor modifications to the stability analysis, the result can also be established for
a gradient descent update law. The actor weights are updated to follow the critic
weights as
 
Ŵ˙ a (t) = −ka1 Ŵa (t) − Ŵc (t) − ka2 Ŵa (t) , (3.62)
3.3 Extension to Trajectory Tracking 63

where ka1 , ka2 ∈ R are constant positive adaptation gains. The least-squares approach
can not be used to update the actor weights because the Bellman error is a nonlinear
function of the actor weights.
The following assumption facilitates the stability analysis using persistence of
excitation.
ω(t)
Assumption 3.13 The regressor ψ : R≥0 →RL , defined as ψ (t)  √ ,
1+νωT (t)Γ (t)ω(t)
 t+T
is persistently exciting (i.e., there exist T , ψ>0 such that ψIL ≤ t ψ (τ ) ψ (τ )T dτ ).

Using Assumption 3.13 and [26, Corollary 4.3.2],

ϕIL ≤ Γ (t) ≤ ϕIL , ∀t ∈ R≥0 (3.63)

where ϕ, ϕ ∈ R are constants such that 0 < ϕ < ϕ. Since the evolution of ψ is
dependent on the initial values of ζ and Ŵa , the constants ϕ and ϕ depend on the
initial values of ζ and Ŵa . Based on (3.63), the regressor can be bounded as

1
ψ (t) ≤ √ , ∀t ∈ R≥0 . (3.64)
νϕ

3.3.3 Stability Analysis

Using (3.53), (3.59), and (3.60), an unmeasurable form of the Bellman error can be
written as
1 1 1
δt = −W̃cT ω + W̃aT Gσ W̃a + ∇ζ G∇ζ  T + W T ∇ζ σ G∇ζ  T − ∇ζ F, (3.65)
4 4 2

where G  GR−1 G T and Gσ  ∇ζ σ GR−1 G T ∇ζ σ T . The weight estimation errors


for the value function and the policy are defined as W̃c  W − Ŵc and W̃a  W −
Ŵa , respectively. Using (3.65), the weight estimation error dynamics for the value
function are
 
˙ ω W̃aT Gσ W̃a ∇ζ G∇ζ  T W T ∇ζ σ G∇ζ  T
W̃c = kc Γ + + − ∇ζ F
1 + νωT Γ ω 4 4 2
− kc Γ ψψ T W̃c , (3.66)

ω
where ψ  √1+νω TΓω
∈ RL is the regressor vector.
Before stating the main result of the section, three supplementary technical lem-
mas are stated. To facilitate the discussion, let Y ∈ R2n+2L be a compact set, and let
Z denote the projection of Y onto Rn+2L . Using the universal approximation prop-
erty of neural networks, on the compact set defined by the projection of Y onto R2n ,
64 3 Excitation-Based Online Approximate Optimal Control

the neural
 network  approximation errors can be bounded such that sup | (ζ )| ≤ 
and sup ∇ζ  (ζ ) ≤ , where  ∈ R is a positive constants, and there exists a posi-
tive constant LF ∈ R such that sup F (ζ ) ≤ LF ζ . Instead of using the fact that
locally Lipschitz functions on compact sets are Lipschitz, it is possible to bound the
function F as F (ζ ) ≤ ρ (ζ ) ζ , where ρ : R≥0 → R≥0 is non-decreasing. This
approach is feasible and results in additional gain conditions. To aid the subsequent
stability analysis, Assumptions 3.10 and 3.11 are used to develop the bounds
  
 ∇ζ  WT∇ σ 
 4 + 2 ζ G∇ζ  T  + LF xd  ≤ ι1 , Gσ  ≤ ι2 ,
   
∇ζ G∇ζ  T  ≤ ι3 ,  1 W T Gσ + 1 ∇ζ G∇ζ σ T  ≤ ι4 , (3.67)
1 2 2 
 ∇ζ G∇ζ  T + 1 W T ∇ζ σ G∇ζ  T  ≤ ι5 ,
4 2

on the compact set defined by the projection of Y onto R2n , where ι1 , ι2 , ι3 , ι4 , ι5 ∈ R


are positive constants.
Supporting Lemmas
The contribution in the previous section was the development of a transformation
that enables the optimal policy and the optimal value function to be expressed as a
time-invariant function of ζ . The use of this transformation presents a challenge in
the sense that the optimal value function, which is used as the Lyapunov function
for the stability analysis, is not a positive definite function of ζ , because the matrix
Q is positive semi-definite. In this section, this technical obstacle is addressed by
exploiting the fact that the time-invariant optimal value function V ∗ : R2n → R can
be interpreted as a time-varying map Vt∗ : Rn × R≥0 → R, such that
 
e
Vt∗ (e, t) = V ∗ (3.68)
xd (t)

for all e ∈ Rn and for all t ∈ R≥0 . Specifically, the time-invariant form facilitates the
development of the approximate optimal policy, whereas the equivalent time-varying
form can be shown to be a positive definite and decrescent function of the tracking
error. In the following, Lemma 3.14 is used to prove that Vt∗ : Rn × R≥0 → R is
positive definite and decrescent, and hence, a candidate Lyapunov function.
Lemma 3.14 Let Ba denote a closed ball around the origin with the radius a ∈ R>0 .
The optimal value function Vt∗ : Rn × R≥0 → R satisfies the following properties

Vt∗ (e, t) ≥ v (e) , (3.69a)


Vt∗ (0, t) = 0, (3.69b)
Vt∗ (e, t) ≤ v (e) , (3.69c)

∀t ∈ R≥0 and ∀e ∈ Ba where v : [0, a] → R≥0 and v : [0, a] → R≥0 are class K
functions.
Proof See Appendix A.1.4. 

3.3 Extension to Trajectory Tracking 65

Lemmas 3.15 and 3.16 facilitate the stability analysis by establishing bounds on
the error signal.
T
Lemma 3.15 Let Z  eT W̃cT W̃aT , and suppose that Z (τ ) ∈ Z, ∀τ ∈ [t, t + T ].
The neural network weights and the tracking errors satisfy
 2
 
− inf e (τ )2 ≤ −0 sup e (τ )2 + 1 T 2 sup W̃a (τ ) + 2 ,
τ ∈[t,t+T ] τ ∈[t,t+T ] τ ∈[t,t+T ]
(3.70)
 2  2  2
     
− inf W̃a (τ ) ≤ −3 sup W̃a (τ ) + 4 inf W̃c (τ ) L
τ ∈[t,t+T ] τ ∈[t,t+T ] τ ∈[t,t+T ]

+ 5 sup e (τ )2 + 6 , (3.71)


τ ∈[t,t+T ]

where the constants 0 − 6 are defined in Appendix A.1.5.


Proof See Appendix A.1.5. 

T T T T
Lemma 3.16 Let Z  e W̃c W̃a , and suppose that Z (τ ) ∈ Z, ∀τ ∈ [t, t + T ].
The critic weights satisfy


t+T
   2 
t+T  
t+T
4
 T 2   2 2  
− W̃c ψ  dτ ≤ −ψ7 W̃c  + 8 e dτ + 3ι2 W̃a (σ ) dσ + 9 T ,
t t t

where the constants 7 − 9 are defined in Appendix A.1.6.


Proof See Appendix A.1.6. .

Gain Conditions and Gain Selection
The following section details sufficient gain conditions derived based on a stability
analysis performed using the candidate Lyapunov function VL : Rn+2L × R≥0 → R
defined as VL (Z, t)  Vt∗ (e, t) + 21 W̃cT Γ −1 W̃c + 21 W̃aT W̃a . Using Lemma 3.14 and
(3.63),
vl (Z) ≤ VL (Z, t) ≤ vl (Z) , (3.72)

∀Z ∈ Bb , ∀t ∈ R≥0 , where vl : [0, b] → R≥0 and vl : [0, b] → R≥0 are class K func-
tions. T
To facilitate the discussion, define ka12  ka1 + ka2 , Z  eT W̃cT W̃aT , ι 
(ka2 W +ι4 ) + 2k (ι )2 + 1 ι ,   6 ka12 +22 q+kc 9 + ι, and   1 min(k ψ ,
2

ka12 c 1 4 3 10 8 11 16 c 7
20 qT , 3 ka12 T ). Let Z0 ∈ R≥0 denote a known constant bound on the initial con-
dition such that Z (t0 ) ≤ Z0 , and let
     
10 T
Z  vl −1 vl max Z0 , + ιT . (3.73)
11
66 3 Excitation-Based Online Approximate Optimal Control

The sufficient gain conditions for the subsequent Theorem 3.17 are given by
  
kc ι2 Z
ka12 > max ka1 ξ2 + , 3kc ι22 Z ,
4 νϕ
ka1 24 ka12
ξ1 > 2LF , kc > ,ψ> T,
λγ ξ2 kc 7
 
5 ka12 1
q > max , kc 8 , kc LF ξ1 ,
0 2
  
1 νϕ 1 ka12
T < min √ ,√ , √ , , (3.74)
6Lka12 6Lkc ϕ 2 nLF 6Lka12 + 8q1

where ξ1 , ξ2 ∈ R are known adjustable positive constants. Similar conditions on ψ


and T can be found in persistence of excitation-based adaptive control in the presence
of bounded or Lipschitz uncertainties (cf. [27, 28]). Furthermore, the compact set Z
satisfies the sufficient condition
Z ≤ r, (3.75)

where r  21 supz,y∈Z z − y denotes the radius of Z. Since the Lipschitz constant


and the bounds on neural network approximation error depend on the size of the
compact set Z, the constant Z depends on r. Hence, feasibility of the sufficient
condition in (3.75) is not apparent. Algorithm A.1 in the appendix details an iterative
gain selection process to ensure satisfaction of the sufficient condition in (3.75). The
main result of this section can now be stated as follows.
Theorem 3.17 Provided that the sufficient conditions in (3.74) and (3.75) are satis-
fied and Assumptions 3.10–3.13 hold, the controller in (3.58) and the update laws in
(3.60)–(3.62)
  guarantee  that the tracking
 error is ultimately bounded, and the error
 ∗ 
t → μ̂ ζ (t) , Ŵa (t) − μ (ζ (t)) is ultimately bounded.

Proof The time derivative of VL is

V̇L = ∇ζ V ∗ F + ∇ζ V ∗ G μ̂ + W̃cT Γ −1 W̃˙ c − W̃aT Ŵ˙ a − W̃cT Γ −1 Γ˙ Γ −1 W̃c .


1
2
Provided the sufficient conditions in (3.74) are satisfied, (3.60), (3.64)–(3.67), and
the facts that ∇ζ V ∗ F = −∇ζ V ∗ Gμ∗ − r (ζ, μ∗ ) and ∇ζ V ∗ G = −2μ∗T R yield

q 1  
2 k  2
 a12  
V̇L ≤ − e2 − kc W̃cT ψ  − W̃a  + ι. (3.76)
2 8 4
The inequality in (3.76) is valid provided Z (t) ∈ Z.
3.3 Extension to Trajectory Tracking 67
 t+T
Integrating (3.76), using the facts that − t e (τ )2 dτ ≤ − T inf τ ∈[t,t+T ]
 t+T 

2



2

e (τ )2 and − t W̃a (τ ) dτ ≤ −T inf τ ∈[t,t+T ] W̃a (τ ) , Lemmas 3.15 and
3.16, and the gain conditions in (3.74) yields
kc ψ7  
2 0 qT

VL (Z (t + T ) , t + T ) − VL (Z (t) , t) ≤ − W̃c (t) − e (t)2 + 10 T
16 8
3 ka12 T 

2

− W̃a (t) ,
16

provided Z (τ ) ∈ Z, ∀τ ∈ [t, t+T ]. Thus, VL (Z (t+T ) , t+T ) − VL (Z (t) , t) < 0


provided Z (t) > 1011T and Z (τ ) ∈ Z, ∀τ ∈ [t, t + T ]. The bounds on the Lya-
punov function in (3.72) yield VL (Z (t + T ) , t + T ) − VL (Z (t) , t) < 0 provided
10 T
VL (Z (t) , t) > vl 11
and Z (τ ) ∈ Z, ∀τ ∈ [t, t + T ].
Since Z (t0 ) ∈ Z, (3.76) can be used to conclude that V̇L (Z (t0 ) , t0 ) ≤ ι. The
sufficient condition in (3.75) ensures that vl −1 (VL (Z (t0 ) , t0 ) + ιT ) ≤ r; hence,
 
10 T
Z (t) ∈ Z, ∀t ∈ [t0 , t0 + T ]. If VL (Z (t0 ) , t0 ) > vl 11
, then Z (t) ∈ Z, ∀t ∈
[t0 , t0 + T ] implies VL (Z (t0 + T ) , t0 + T ) − VL (Z (t0 ) , t0 ) < 0; hence, vl −1 (VL
(Z (t0 + T ) , t0 + T ) + ιT ) ≤ r. Thus, Z (t) ∈ Z, ∀t ∈ [t0 + T , t0 + 2T ]. Using
mathematical induction, it can be shown that the system state is bounded such that
supt∈[0,∞) Z (t) ≤ r and ultimately bounded such that
   
−1 10 T
lim sup Z (t) ≤ vl vl + ιT .
t→∞ 11

If the regressor ψ satisfies a stronger u-persistence of excitation assumption (cf. [16,


17]), the tracking error and the weight estimation errors can be shown to be uniformly
ultimately bounded. 

3.3.4 Simulation

Simulations are performed on a two-link manipulator to demonstrate the ability of


the presented technique to approximately optimally track a desired trajectory. The
two link robot manipulator is modeled using Euler–Lagrange dynamics as

M q̈ + Vm q̇ + Fd q̇ + Fs = u, (3.77)
T T
where q = q1 q2 and q̇ = q̇1 q̇2 are the angular positions in radian and the
angular velocities in radian/s respectively. In (3.77), M ∈ R2×2 denotes the iner-
tia matrix, and Vm ∈ R2×2 denotes the centripetal-Coriolis matrix given by M 
68 3 Excitation-Based Online Approximate Optimal Control
   
p1 + 2p3 c2 p2 + p3 c2 −p3 s2 q̇2 −p3 s2 (q̇1 + q̇2 )
, Vm  , where c2 = cos (q2 ) ,
p2 + p3 c2 p2 p3 s2 q̇1 0
s2 = sin (q2 ), p1 = 3.473 kg.m2 , p2 = 0.196 kg.m2 , p3 = 0.242 kg.m2 , and Fd =
 T
diag 5.3, 1.1 Nm.s and Fs (q̇) = 8.45 tanh (q̇1 ) , 2.35 tanh (q̇2 ) Nm are the
models for the static and the dynamic friction, respectively.
T
The objective is to find a policy μ that ensures that the state x q1 , q2 , q̇1 , q̇2
T
tracks the desired trajectory xd (t) = 0.5 cos (2t) , 0.33 cos
(3t) , − sin (2t)
 , − sin (3t) ,
while minimizing the cost in (3.49), where Q = diag 10, 10, 2, 2 . Using (3.45)–
(3.48) and the definitions
    T T
x
f  x3 , x4 , M −1 (−Vm − Fd ) 3 − Fs ,
x4
     T
g  0, 0 T , 0, 0 T , M −1 T ,
   
gd+  0, 0 T , 0, 0 T , M (xd ) ,
T
hd  xd 3 , xd 4 , −4xd 1 , −9xd 2 , (3.78)

the optimal tracking problem can be transformed into the time-invariant form in
(3.48).
In this effort, the basis selected for the value function approximation is a polyno-
mial basis with 23 elements given by

1 2 2
σ (ζ ) = ζ ζ ζ1 ζ3 ζ1 ζ4 ζ2 ζ3 ζ2 ζ4 ζ12 ζ22 ζ12 ζ52
2 1 2
ζ12 ζ62 ζ12 ζ72 ζ12 ζ82 ζ22 ζ52 ζ22 ζ62 ζ22 ζ72 ζ22 ζ82 ζ32 ζ52
T
ζ32 ζ62 ζ32 ζ72 ζ32 ζ82 ζ42 ζ52 ζ42 ζ62 ζ42 ζ72 ζ42 ζ82 . (3.79)

The control gains are selected as ka1 = 5, ka2 = 0.001, kc = 1.25, λ = 0.001, and
T
ν = 0.005 The initial conditions are x (0) = 1.8 1.6 0 0 , Ŵc (0) = 10 × 123×1 ,
Ŵa (0) = 6 × 123×1 , and Γ (0) = 2000 × I23 . To ensure persistence of excitation, a
probing signal
⎡  √  √ ⎤
2.55tanh(2t) 20 sin 232π t cos 20π t
⎢  2  ⎥
⎢ +6 sin 18e ⎥
p (t) = ⎢  t + √20 cos (40t)
 (21t)
√ ⎥ (3.80)
⎢ ⎥
⎣ 0.01 tanh(2t) 20 sin 132π t cos 10π t ⎦
+6 cos (8et) + 20 cos (10t) cos (11t))

is added to the control signal for the first 30 s of the simulation [14].
It is clear from Figs. 3.7 and 3.8 that the system states are bounded during the
learning phase and the algorithm converges to a stabilizing controller in the sense
that the tracking errors go to zero when the probing signal is eliminated. Furthermore,
3.3 Extension to Trajectory Tracking 69

Fig. 3.7 State trajectories


with probing signal
(reproduced with permission
from [22], 2015,
c Elsevier)

Fig. 3.8 Error trajectories


with probing signal
(reproduced with permission
from [22], 2015,
c Elsevier)

Figs. 3.9 and 3.10 shows that the weight estimates for the value function and the policy
are bounded and they converge.
The neural network weights converge to the following values

Ŵc = Ŵa = 83.36 2.37 27.0 2.78 −2.83 0.20 14.13
29.81 18.87 4.11 3.47 6.69 9.71 15.58 4.97 12.42
T
11.31 3.29 1.19 −1.99 4.55 −0.47 0.56 . (3.81)
70 3 Excitation-Based Online Approximate Optimal Control

Fig. 3.9 Evolution of critic 100


weights (reproduced with
permission from [22],
80
2015,
c Elsevier)

60

40

20

−20

−40
0 10 20 30 40 50 60

Time (s)

Fig. 3.10 Evolution of actor 100


weights (reproduced with
permission from [22],
80
2015,
c Elsevier)

60

40

20

−20

−40
0 10 20 30 40 50 60
Time (s)

Note that the last sixteen weights that correspond to the terms containing the desired
trajectories ζ5 , . . . , ζ8 are non-zero. Thus, the resulting value function V and the
resulting policy μ depend on the desired trajectory, and hence, are time-varying
functions of the tracking error. Since the true weights are unknown, a direct compar-
ison of the weights in (3.81) with the true weights is not possible. Instead, to gauge
the performance of the presented technique, the state and the control trajectories
obtained using the estimated policy are compared with those obtained using Radau-
pseudospectral numerical optimal control computed using the GPOPS software [29].
3.3 Extension to Trajectory Tracking 71

Fig. 3.11 Hamiltonian of −3


x 10
the numerical solution
computed using GPOPS
(reproduced with permission 2
from [22], 2015,
c Elsevier)

−2

−4

−6

0 5 10 15 20
Time(s)

Fig. 3.12 Costate of the


numerical solution computed
using GPOPS (reproduced
with permission from [22],
2015,
c Elsevier)

Since an accurate numerical solution is difficult to obtain for an infinite-horizon opti-


mal control problem, the numerical optimal control problem is solved over a finite-
horizon ranging over approximately five times the settling time associated with the
slowest state variable. Based on the solution obtained using the proposed technique,
the slowest settling time is estimated to be approximately twenty seconds. Thus, to
approximate the infinite-horizon solution, the numerical solution is computed over
a 100 second time horizon using 300 collocation points.
72 3 Excitation-Based Online Approximate Optimal Control

Fig. 3.13 Control μ(ζ)


trajectories μ (t) obtained
from GPOPS and the 2
developed technique
(reproduced with permission 1
from [22], 2015,
c Elsevier)
0

−1

−2

−3 GPOPS
ADP
−4 μ1

−5
μ2

0 5 10 15 20
Time(s)

Fig. 3.14 Tracking error Tracking Error


trajectories e (t) obtained
from GPOPS and the
developed technique
1
(reproduced with permission
from [22], 2015,
c Elsevier) 0.5

−0.5
GPOPS
ADP
−1
e1
−1.5 e
2
e
3
−2 e
4

0 5 10 15 20
Time(s)

As seen in Fig. 3.11, the Hamiltonian of the numerical solution is approximately


zero. This supports the assertion that the optimal control problem is time-invariant.
Furthermore, since the Hamiltonian is close to zero, the numerical solution obtained
using GPOPS is sufficiently accurate as a benchmark to compare against the approx-
imate dynamic programming-based solution obtained using the proposed technique.
Note that in Fig. 3.12, the costate variables corresponding to the desired trajectories
3.3 Extension to Trajectory Tracking 73

are nonzero. Since these costate variables represent the sensitivity of the cost with
respect to the desired trajectories, this further supports the assertion that the opti-
mal value function depends on the desired trajectory, and hence, is a time-varying
function of the tracking error.
Figures 3.13 and 3.14 show the control and the tracking error trajectories obtained
from the developed technique (dashed lines) plotted alongside the numerical solution
obtained using GPOPS (solid lines). The trajectories obtained using the developed
technique are close to the numerical solution. The inaccuracies are a result of the facts
that the set of basis functions in (3.79) is not exact, and the proposed method attempts
to find the weights that generate the least total cost for the given set of basis functions.
The accuracy of the approximation can be improved by choosing a more appropriate
set of basis functions, or at an increased computational cost, by adding more basis
functions to the existing set in (3.79). The total cost obtained using the numerical
solution is found to be 75.42 and the total cost obtained using the developed method
is found to be 84.31. Note that from Figs. 3.13 and 3.14, it is clear that both the
tracking error and the control converge to zero after approximately 20 s, and hence,
the total cost obtained from the numerical solution is a good approximation of the
infinite-horizon cost.

3.4 N-Player Nonzero-Sum Differential Games3

In this section, an approximate online equilibrium solution is developed for an


N −player nonzero-sum game subject to continuous-time nonlinear unknown dynam-
ics and an infinite-horizon quadratic cost. A novel actor-critic-identifier structure is
used, wherein a robust dynamic neural network is used to asymptotically identify
the uncertain system with additive disturbances, and a set of critic and actor neu-
ral networks are used to approximate the value functions and equilibrium policies,
respectively. The weight update laws for the actor neural networks are generated
using a gradient-descent method, and the critic neural networks are generated by
least-squares regression, which are both based on the modified Bellman error that
is independent of the system dynamics. A Lyapunov-based stability analysis shows
that uniformly ultimately bounded tracking is achieved and a convergence analysis
demonstrates that the approximate control policies converge to a neighborhood of the
optimal solutions. The actor, the critic, and the identifier structures are implemented
in real-time, continuously, and simultaneously. Simulations on two and three player
games illustrate the performance of the developed method.

3 Parts of the text in this section are reproduced, with permission, from [30], 2015,
c IEEE.
74 3 Excitation-Based Online Approximate Optimal Control

3.4.1 Problem Formulation

Consider a class of control-affine multi-input systems

N
ẋ (t) = f (x (t)) + gi (x (t)) ui (t) , (3.82)
i=1

where x : R≥t0 → Rn is the state vector, ui : R≥t0 → Rmi are the control inputs, and
f : Rn → Rn and gi : Rn → Rn×mj are the drift and input matrices, respectively.
Assume that g1 , . . . , gN , and f are second order differentiable, and that f (0) = 0 so
that x = 0 is an equilibrium point for the uncontrolled dynamics in (3.82). Let
' (
U φi : Rn → Rmi , i = 1, . . . , N | {φi , . . . , φN } is admissible for (3.82)

be the set of all admissible tuples of feedback policies φi : Rn → Rmi (cf. [6]). Let
{φ ,...,φN }
Vi i : Rn → R≥0 denote the value function of the ith player with respect to the
feedback policies {φ1 , . . . , φN } ∈ U , defined as

∞
{u }
Vi 1 ,...,uN (x) = ri (x (τ ; t, x) , φi (x (τ ; t, x)) , . . . , φN (x (τ ; t, x))) d τ, (3.83)
t

where x (τ ; t, x) for τ ∈ [t, ∞) denotes the trajectory of (3.82) evaluated at time τ


obtained using the controllers ui (τ ) = φi (x (τ ; t, x)), starting from the initial time
t and the initial condition x. In (3.83), ri : Rn × Rm1 × · · · × R )N →TR≥0 denotes
mN

the instantaneous cost defined as ri (x, ui , . . . , uN )  Qi (x) + j=1 uj Rij uj , where


Qi ∈ Rn×n and Rij ∈ Rn×n are positive definite matrices. The control objective is
to find an approximate feedback-Nash equilibrium solution to the infinite-horizon
regulation
∗ differential
game online. A feedback-Nash equilibrium solution is a tuple
u1 , . . . , uN∗ ∈ U such that for all i ∈ {1, . . . , N }, for all x ∈ Rn , the corresponding
value functions satisfy

{u∗ ,u∗ ,...,u∗ ,...,uN∗ } {u∗ ,u∗ ,...,φi ,...,uN∗ }


Vi∗ (x)  Vi 1 2 i (x) ≤ Vi 1 2 (x)

for all φi such that u1∗ , u2∗ , . . . , φi , . . . , uN∗ ∈ U .
The exact closed-loop feedback-Nash equilibrium solution ui∗ , . . . , uN∗ can be
expressed in terms of the value functions as [6, 7, 31, 32]

1  T
ui∗ (x) = − R−1 g T (x) ∇x Vi∗ (x) , (3.84)
2 ii i
3.4 N -Player Nonzero-Sum Differential Games 75

where the value functions V1∗ , . . . , VN∗ satisfy the coupled Hamilton–Jacobi
equations

N
1  T
0 = x Qi x +
T
∇x Vj∗ (x) G ij (x) ∇x Vj∗ (x) + ∇x Vi∗ (x) f (x)
j=1
4

1
N  T
− ∇x Vi∗ (x) G j (x) ∇x Vj∗ (x) . (3.85)
2 j=1

In (3.85), G j (x)  gj (x) R−1 −1 −1 T


jj gj (x) and G ij (x)  gj (x) Rjj Rij Rjj gj (x).
T

Computation of an analytical solution to the coupled nonlinear Hamilton–Jacobi


equations in (3.85) is, in general, infeasible. Hence, an approximate solution is
sought. Although nonzero-sum games contain non-cooperative components, the
solution to each player’s coupled Hamilton–Jacobi equation in (3.85) requires knowl-
edge of all the other player’s strategies in (3.84). The underlying assumption of
rational opponents [33] is characteristic of differential game theory problems and it
implies that the players share information, yet they agree to adhere to the equilibrium
policy determined from the Nash game.

3.4.2 Hamilton–Jacobi Approximation Via


Actor-Critic-Identifier

In this section, an actor-critic-identifier [9, 30] approximation architecture is used to


solve the coupled nonlinear Hamilton–Jacobi equations in (3.85). The actor-critic-
identifier architecture eliminates the need for exact model knowledge by using a
dynamic neural network to robustly identify the system, a critic neural network to
approximate the value function, and an actor neural network to find a control policy
which minimizes the value functions. The following development focuses on the
solution to a two player nonzero-sum game. The approach can easily be extended to
the N −player game presented in Sect. 3.4.1. This section introduces the actor-critic-
identifier architecture, and subsequent sections provide details of the design for the
two player nonzero-sum game solution.
The optimal policies in (3.84) and the associated value functions Vi∗ satisfy the
Hamilton–Jacobi equations
   
ri x, u1∗ (x) , . . . , uN∗ (x) + ∇x Vi∗ (x) Fu x, u1∗ (x) , . . . , uN∗ (x) = 0, (3.86)

where

N
Fu (x, u1 , . . . , uN )  f (x) + gj (x) uj ∈ Rn . (3.87)
j=1
76 3 Excitation-Based Online Approximate Optimal Control

∗ ∗
Replacing the optimal
 Jacobian
 ∇
 x Vi and
 optimal control policies ui by parametric
estimates ∇x V̂i x, Ŵci and ûi x, Ŵai , respectively, where Ŵci and Ŵai are the
estimates of the unknown parameters, yields the Bellman error
      
δi x, Ŵci , Ŵa1 , . . . , ŴaN = ri x, û1 x, Ŵa1 , . . . , ûN x, ŴaN
      
+ ∇x V̂i x, Ŵci Fu x, û1 x, Ŵa1 , . . . , ûN x, ŴaN .
(3.88)

The approximate Hamiltonian in (3.88) is dependent on Fu , and hence, complete


knowledge of the system. To overcome this limitation, an online system identifier
replaces the system dynamics Fu with a parametric estimate F̂u , defined as F̂u (t) 
x̂˙ (t) where x̂ (·) is an estimate of the state, x (·). Hence, the Bellman error in (3.88)
is approximated at each time instance as
        
˙ Ŵci , Ŵa1 , . . . , ŴaN = ri x, û1 x, Ŵa1 , . . . , ûN x, ŴaN + ∇x V̂i x, Ŵci x̂.
δ̂i x, x̂, ˙
(3.89)
The objective is to update the actors, ûi , the critics, V̂i , and the identifier, F̂u , simulta-
neously, based on the minimization of the Bellman residual errors δ̂i . All together, the
actors, the critics, and the identifier constitute the actor-critic-identifier architecture.
The update laws for the actors, the critics, and the identifiers are designed based on
a Lyapunov-based analysis to ensure stability of the closed-loop system during the
learning phase.

3.4.3 System Identifier

Consider the two-player case for the dynamics given in (3.82) as

ẋ (t) = f (x (t)) + g1 (x (t)) u1 (t) + g2 (x (t)) u2 (t) , x (t0 ) = x0 , (3.90)

where ui : R≥t0 → Rmi are the control inputs, and the state x : R≥t0 → Rn is assumed
to be available for feedback. The following assumptions about the system will be
used in the subsequent development.
Assumption 3.18 The input matrices g1 and g2 are known and bounded according
to the inequalities g1 (x) ≤ ḡ1 and g2 (x) ≤ ḡ2 , for all x ∈ Rn , where ḡ1 and ḡ2
are known positive constants.

Assumption 3.19 The control inputs u1 (·) and u2 (·) are bounded (i.e., u1 (·) ,
u2 (·) ∈ L∞ ). This assumption facilitates the design of the state-derivative estimator,
and is relaxed in Sect. 3.4.5.
3.4 N -Player Nonzero-Sum Differential Games 77

Based on Property 2.3, the nonlinear system in (3.90) can be represented using a
multi-layer neural network as
 
ẋ (t) = WfT σf VfT x (t) + f (x (t)) + g1 (x (t)) u1 (t) + g2 (x (t)) u2 (t) ,
 Fu (x (t) , u1 (t) , u2 (t)) , (3.91)

where Wf ∈ RLf +1×n and Vf ∈ Rn×Lf are unknown ideal neural network weight
matrices with Lf ∈ N representing the neurons in the output layers. In (3.91),
σf : RLf → RLf +1 is the vector of basis functions, and f : Rn → Rn is the func-
tion reconstruction error in approximating the function f . The proposed dynamic
neural network used to identify the system in (3.90) is

 
x̂˙ (t) = ŴfT (t) σf V̂fT (t) x̂ (t) + g1 (x (t)) u1 (t) + g2 (x (t)) u2 (t) + μ (t) ,
 
 F̂u x (t) , x̂ (t) , u1 (t) , u2 (t) , (3.92)

where x̂ : R≥t0 → Rn is the state of the dynamic neural network, Ŵf : R≥t0 →
RLf +1×n , V̂f : R≥t0 → Rn×Lf are the estimates of the ideal weights of the neural
networks, and μ : R≥t0 → Rn denotes the RISE feedback term (cf. [34]) defined as

μ (t)  k (x̃ (t) − x̃ (t0 )) + ν (t) , (3.93)

where the measurable identification error x̃ : R≥t0 → Rn is defined as

x̃ (t)  x (t) − x̂ (t) , (3.94)

and ν : R≥t0 → Rn is a Filippov solution to the initial value problem

ν̇ (t) = (kα + γ ) x̃ (t) + β1 sgn (x̃ (t)) , ν (t0 ) = 0,

where k, α, γ β ∈ R are positive constant gains, and sgn (·) denotes a vector signum
function.
The identification error dynamics are developed by taking the time derivative of
(3.94) and substituting for (3.91) and (3.92) as
   
x̃˙ = WfT σf VfT x (t) − ŴfT σf V̂fT (t) x̂ (t) + f (x (t)) − μ (t) . (3.95)

To facilitate the subsequent analysis an auxiliary identification error is defined as

ef (t)  x̃˙ (t) + α x̃ (t) . (3.96)

Taking the time derivative of (3.96) and using (3.95) yields


78 3 Excitation-Based Online Approximate Optimal Control
   
ėf (t) = WfT ∇V T x σf VfT x (t) VfT ẋ (t) − Ŵ˙ fT (t) σf V̂fT (t) x̂ (t)
f
   
− ŴfT (t) ∇V T x σf V̂fT (t) x̂ (t) V̂˙fT (t) x̂ (t) − ŴfT (t) ∇V T x σf V̂fT (t) x̂ (t) V̂fT (t) x̂˙ (t)
f f

+ ˙f (x (t) , ẋ (t)) − kef (t) − γ x̃ (t) − β1 sgn (x̃ (t)) + α x̃˙ (t) . (3.97)

The weight update laws for the dynamic neural network in (3.92) are developed
based on the subsequent stability analysis as

   
Ŵ˙ f (t) = proj Γwf ∇VfT x σf V̂fT (t) x̂ (t) V̂fT (t) x̂˙ (t) x̃T (t) ,
  
V̂˙f (t) = proj Γvf x̂˙ (t) x̃T (t) ŴfT (t) ∇VfT x σf V̂fT (t) x̂ (t) , (3.98)

where proj (·) is a smooth projection operator [35, 36], and Γwf ∈RLf +1×Lf +1 , Γvf ∈
Rn×n are positive
 constant
 adaptation gain matrices. Adding and  subtracting
1
W ∇VfT x σf V̂f (t) x̂ (t) V̂f (t) x̂ (t)+ 2 Ŵf (t) ∇VfT x σf V̂f (t) x̂ (t) VfT x̂˙ (t),
2 f
T T T ˙ 1 T T

and grouping similar terms, the expression in (3.97) can be rewritten as

ėf (t) = Ñ (t) + NB1 (t) + N̂B2 (t) − kef (t) − γ x̃ (t) − β1 sgn (x̃ (t)) , (3.99)

where the auxiliary signals, Ñ , NB1 , and N̂B2 : R≥t0 → Rn in (3.99) are defined as
   
Ñ (t)  α x̃˙ (t) − Ŵ˙ fT (t) σf V̂fT (t) x̂ (t) − ŴfT (t) ∇V T x σf V̂fT (t) x̂ (t) V̂˙fT (t) x̂ (t)
f
1 T  
+ Wf ∇V T x σf V̂f (t) x̂ (t) V̂f (t) x̃˙ (t)
T T
2 f
1 T  
+ Ŵf (t) ∇V T x σf V̂fT (t) x̂ (t) VfT x̃˙ (t) , (3.100)
2 f
  1  
NB1 (t)  WfT ∇V T x σf VfT x (t) VfT ẋ (t) − WfT ∇V T x σf V̂fT (t) x̂ (t) V̂fT (t) ẋ (t)
f 2 f
1 T  
T T
− Ŵf (t) ∇V T x σf V̂f (t) x̂ (t) Vf ẋ (t) + ˙f (x (t) , ẋ (t)) , (3.101)
2 f
1  
N̂B2 (t)  W̃fT (t) ∇V T x σf V̂fT (t) x̂ (t) V̂fT (t) x̂˙ (t)
2 f
1 T  
+ Ŵf (t) ∇V T x σf V̂fT (t) x̂ (t) ṼfT (t) x̂˙ (t) . (3.102)
2 f

To facilitate the subsequent stability analysis, an auxiliary term NB2 : R≥t0 → Rn is


defined by replacing x̂˙ (t) in N̂B2 (t) by ẋ (t) , and the mismatch between NB2 and
N̂B2 is defined as ÑB2  N̂B2 − NB2 . The terms NB1 and NB2 are grouped as NB 
NB1 + NB2 . Using Property 2.3, Assumption 3.18, (3.96), (3.98), (3.101), and (3.102)
the following bounds can be obtained over the set χ × R2n × R(Lf +1)×n × Rn×Lf
3.4 N -Player Nonzero-Sum Differential Games 79
 
 
Ñ (t) ≤ ρ1 (z (t)) z (t) , NB1 (t) ≤ ζ1 , NB2 (t) ≤ ζ2 , (3.103)
 
ṄB (t) ≤ ζ3 + ζ4 ρ2 (z (t)) z (t) , (3.104)
   
˙T 
x̃ (t) ÑB2 (t) ≤ ζ5 x̃ (t) + ζ6 ef (t) ,
2 2
(3.105)

 T
where z (t)  x̃T (t) efT (t) ∈ R2n , ∀t ∈ R≥t0 and ρ1 , ρ2 : R → R are positive,
strictly increasing functions, and ζi ∈ R, i = 1, . . . , 6 are positive constants. To
facilitate the subsequent stability analysis, let the auxiliary signal y : R≥t0 → R2n+2
be defined as
 
y (t)  x̃T (t) ef T (t) P (t) Q (t) T , ∀t ∈ R≥t0 (3.106)

where the auxiliary signal P :∈ R≥t0 → R is the Filippov solution to the initial value
problem [11]

Ṗ (t) = β2 ρ2 (z (t)) z (t) x̃ (t) − efT (t) (NB1 (t) − β1 sgn (x̃ (t))) − x̃˙ T (t) NB2 (t) ,

n
P(t0 ) = β1 |x̃i (t0 )| − x̃T (t0 ) NB (t0 ), (3.107)
i=1

where β1 , β2 ∈ R are selected according to the sufficient conditions

ζ3
β1 > max(ζ1 + ζ2 , ζ1 + ), β2 > ζ4 , (3.108)
α
such that P (t) ≥ 0 for all t ∈ [0, ∞) (see Appendix
 A.1.1). The auxiliary function 
Q : Rn(2Lf +1) →R in (3.106) is defined as Q 41 α tr(W̃fT Γwf−1 W̃f )+tr(ṼfT Γvf−1 Ṽf ) .

Let D ⊂ R 2n+2
 and connected set defined as D  y ∈ R
 −1  be√ the open
2n+2

| y < inf ρ 2 λη, ∞ , where λ and η are defined in Appendix A.1.7.
  √ 
Let D be the compact set D  y ∈ R2n+2 | y ≤ inf ρ −1 2 λη, ∞ . Let
VI : D → R be a positive-definite, locally Lipschitz, regular function defined as

1 T 1
VI (y)  ef ef + γ x̃T x̃ + P + Q. (3.109)
2 2
The candidate Lyapunov function in (3.109) satisfies the inequalities

U1 (y) ≤ VI (y) ≤ U2 (y) , (3.110)

where U1 (y), U2 (y) ∈ R are continuous positive definite functions defined as

1
U1  min(1, γ ) y2 U2  max(1, γ ) y2 .
2
80 3 Excitation-Based Online Approximate Optimal Control
√ 
Additionally,
√ let S ⊂ D denote a set defined as S  y ∈ D | ρ 2U2 (y)
< 2 λη , and let
ẏ (t) = h(y (t) , t) (3.111)

represent the closed-loop differential equations in (3.95), (3.98), (3.99), and (3.107),
where h(y, t) : R2n+2 × R≥t0 → R2n+2 denotes the right-hand side of the the closed-
loop error signals.
Theorem 3.20 For the system in (3.90), the identifier developed in (3.92) along with
the weight update laws in (3.98) ensures asymptotic identification of the state and
its derivative, in the sense that
 
 
lim x̃ (t) = 0, lim x̃˙ (t) = 0,
t→∞ t→∞

provided Assumptions 3.18 and 3.19 hold, and the control gains k and γ are selected
sufficiently large based on the initial conditions of the states, and satisfy the following
sufficient conditions
αγ > ζ5 , k > ζ6 , (3.112)

where ζ5 and ζ6 are introduced in (3.105), and β1 , β2 introduced in (3.107), are


selected according to the sufficient conditions in (3.108).

Proof See Appendix A.1.7. 


3.4.4 Actor-Critic Design

Using Property 2.3 and (3.84), the optimal value function and the optimal controls
can be represented by neural networks as

Vi∗ (x) = WiT σi (x) + i (x) ,


1  
ui∗ (x) = − R−1ii gi (x) ∇x σi (x) Wi + ∇x i (x) ,
T T T
(3.113)
2

where Wi ∈ RLi are unknown constant ideal neural network weights, Li is the number
of neurons, σi = [σi1 σi2 . . . σiL ]T : Rn → RLi are smooth neural network activation
functions, such that σi (0) = 0 and ∇x σi (0) = 0, and i : Rn → R are the function
reconstruction errors.
Using Property 2.3, both Vi∗ and ∇x Vi∗ can be uniformly approximated by neural
networks in (3.113) (i.e., as Li → ∞, the approximation errors i , ∇x i → 0 for i =
1, 2, respectively). The critic V̂ and the actor û approximate the optimal value function
and the optimal controls in (3.113), and are given as
3.4 N -Player Nonzero-Sum Differential Games 81

  1
ûi x, Ŵai = − R−1 ii gi (x) ∇x σi (x) Ŵai ,
T T

  2
V̂i x, Ŵci = ŴciT σi (x) , (3.114)

where Ŵci : R≥t0 → RLi and Ŵai : R≥t0 → RLi are estimates of the ideal weights of
the critic and actor neural networks, respectively. The weight estimation errors for
the critic and actor are defined as W̃ci (t)  Wi − Ŵci (t) and W̃ai (t)  Wi − Ŵai (t)
for i = 1, 2, respectively.
Least-Squares Update for the Critic
The recursive formulation of the normalized least-squares algorithm is used to derive
the update laws for the two critic weights as

ωi (t)
Ŵ˙ ci (t) = −kci Γci (t) δ̂ti (t) , (3.115)
1+ νi ωiT (t) Γci (t) ωi (t)

where ωi : R≥t0 → RLi , defined as ωi (t)  ∇x σi (x (t)) F̂u x (t) , x̂ (t),
   
û1 x (t) , Ŵa1 (t) , u2 x (t) , Ŵa2 (t) for i = 1, 2, is the critic neural network
regressor vector, νi , kci ∈ R are constant positive gains and δ̂ti : R≥t0 → R denotes
evaluation of the approximate
 Bellman error in (3.89), along  the system trajectories,
defined as δ̂ti  δ̂ x (t) , x̂˙ (t) , Ŵci (t) , Ŵa1 (t) , Ŵa2 (t) . In (3.115), Γci : R≥t0 →
RLi ×Li for i = 1, 2, are symmetric estimation gain matrices generated by

 
ωi (t) ωi T (t)
Γ˙ci (t) = −kci −λi Γci (t) + Γci (t) Γ ci (t) , (3.116)
1 + νi ωiT (t) Γci (t) ωi (t)

where λ1 , λ2 ∈ (0, 1) are forgetting factors. The use of forgetting factors ensures
that Γc1 and Γc2 are positive-definite for all time and prevents arbitrarily small val-
ues in some directions, making adaptation in those directions very slow. Thus, the
covariance matrices (Γc1 , Γc2 ) can be bounded as

ϕ11 IL1 ≤ Γc1 (t) ≤ ϕ01 IL1 , ϕ12 IL2 ≤ Γc2 (t) ≤ ϕ02 IL2 . (3.117)

Gradient Update for the Actor


The actor update, like the critic update, is based on the minimization of the Bellman
error. However, unlike the critic weights, the actor weights appear nonlinearly in
the Bellman error, making it problematic to develop a least-squares update law.
Hence, a gradient update law is developed for  the actor which
 minimizes the squared 
˙ Ŵc1 , Ŵc2 , Ŵa1 , Ŵa2  )2 δ̂i x, x̂,
Bellman error Ea x, x̂, ˙ Ŵci , Ŵa1 , Ŵa2 . The
i=1
actor neural networks are updated as
82 3 Excitation-Based Online Approximate Optimal Control
⎧ ⎫
· ⎨ kai1 ⎬
Ŵ ai (t) = proj − Eai (t) − kai2 (Ŵai (t) − Ŵci (t)) , (3.118)
⎩ ⎭
1+ ωiT (t) ωi (t)
 
˙ Ŵc1 (t),Ŵc2 (t),Ŵa1 (t),Ŵa2 (t)
∂Ea x(t),x̂(t),
where Eai (t)  , kai1 , kai2 ∈ R are positive adapta-
∂ Ŵai  
 
tion gains, and the smooth projection operator ensures that Ŵia (t) ≤ W , ∀t ∈
R≥t0 , i = 1, 2, where W ∈ R>0 is a positive constant such that W  ≤ W [35, 36].
The first term in (3.118) is normalized and the last term is added as feedback for
stability (based on the subsequent stability analysis). For notational brevity, let BW i

denote the set w ∈ RLi | w ≤ 2W .

3.4.5 Stability Analysis

The dynamics of the critic weight estimation errors W̃c1 and W̃c2 can be developed as
ω1 
W̃˙ c1 = kc1 Γc1 ω1 − W1T ∇x σ1 F̃û − u1∗ R11 u1∗ − 1v 
T
−W̃c1T
Fu∗ + û1T R11 û1
1 + ν1 ω1T Γc1 ω1
  
+W1T ∇x σ1 g1 (û1 − u1∗ ) + g2 (û2 − u2∗ ) − u2∗ R12 u2∗ + û2T R12 û2 ,
T

and
ω2 
W̃˙ c2 = kc2 Γc2 ω2 − W2T ∇x σ2 F̃û − u2∗ R22 u2∗ − 2v 
T
−W̃c2T
Fu∗ + û2T R22 û2
1 + ν2 ω2 Γc2 ω2
T
  
+W2T ∇x σ2 g1 (û1 − u1∗ ) + g2 (û2 − u2∗ ) − u1∗ R21 u1∗ + û1T R21 û1 .
T
(3.119)
   
Substituting for u1∗ , u2∗ and û1 , û2 from (3.113) and (3.114), respectively, in
(3.119) yields
ω1 
W̃˙ c1 = −kc1 Γc1 ψ1 ψ1T W̃c1 + kc1 Γc1 −W1T ∇x σ1 F̃û
1 + ν1 ω1 Γc1 ω1
T

1 T 1 1 T
+ W̃a2 ∇x σ2 G 12 ∇x σ2T W̃a2 − ∇x 2 G 12 ∇x 2T + W̃a1 ∇x σ1 G 1 ∇x σ1T W̃a1
4 4 4
1  
+ W̃a2 ∇x σ2 + ∇x 2T G 2 ∇x σ1T W1 − G 12 ∇x σ2T W2
2 
1
− ∇x 1 G 1 ∇x 1 − ∇x 1 Fu∗ ,
T
4
3.4 N -Player Nonzero-Sum Differential Games 83

ω2 
W̃˙ c2 = −kc2 Γc2 ψ2 ψ2T W̃c2 + kc2 Γc2 −W2T ∇x σ2 F̃û
1 + ν2 ω2T Γc2 ω2
1 T 1 1 T
+ W̃a1 ∇x σ1 G 21 ∇x σ1T W̃a1 − ∇x 1 G 21 ∇x 1T + W̃a2 ∇x σ2 G 2 ∇x σ2T W̃a2
4 4 4
1   1
+ W̃a1 ∇x σ1 + ∇x 1T G 1 ∇x σ2T W2 − G 21 ∇x σ1T W1 − ∇x 2 G 2 ∇x 2T
2 4
−∇x 2 Fu ] ,
∗ (3.120)

where ψi (t)  √1+ν ω ω(t)Γ


i (t)
∈ RLi are the normalized critic regressor vectors for
i i ci (t)ωi (t)
i = 1, 2, respectively, bounded as

1 1
ψ1  ≤ √ , ψ2  ≤ √ , (3.121)
ν1 ϕ11 ν2 ϕ12

where ϕ11 and ϕ12 are introduced in (3.117). The error systems in (3.120) can be
represented as the following perturbed systems
·
W̃˙ c1 = Ω1 + Λ01 Δ1 , W̃ c2 = Ω2 + Λ02 Δ2 , (3.122)

where Ωi (W̃ci , t)  −ηci Γci (t) ψi (t) ψiT (t) W̃ci ∈ RLi denotes the nominal system,
Λ0i  1+νηciωΓTciΓωi ω denotes the perturbation gain, and the perturbations Δi ∈ RLi are
i i ci i
denoted as

1 T
Δi  −WiT ∇x σi F̃û + W̃aiT ∇x σi G i ∇x σi W̃ai − ∇x i Fu∗
4
1 T 1
+ W̃ak ∇x σk G ik ∇x σkT W̃ak − ∇x k G ik ∇x kT
4 4 
1 1  
− ∇ x  i G i ∇x  i +
T
W̃ak ∇x σk + ∇x kT G k ∇x σiT Wi − G ik ∇x σkT Wk ,
4 2

where i = 1, 2 and k = 3 − i. Using Theorem 2.5.1 in [15], it can be shown that the
nominal systems
·
W̃˙ c1 = −kc1 Γc1 ψ1 ψ1T W̃c1 , W̃ c2 = −kc2 Γc2 ψ2 ψ2T W̃c2 , (3.123)

are exponentially stable if the bounded signals (ψ1 (t) , ψ2 (t)) are uniformly persis-
tently exciting over the compact set χ × D × BW 1 × BW 2 , as [17]

t
0 +δi

μi2 ILi ≥ ψi (τ )ψi (τ )T d τ ≥ μi1 ILi ∀t0 ≥ 0, i = 1, 2,


t0
84 3 Excitation-Based Online Approximate Optimal Control

where μi1 , μi2 , δi ∈ R are positive constants independent of the initial conditions.
Since Ωi is continuously differentiable in W̃ci and the Jacobian ∇W̃ci Ωi = −ηci Γci ψi
ψiT is bounded for the exponentially stable system (3.123) for i = 1, 2, the Converse
Lyapunov Theorem 4.14 in [18] can be used to show that there exists a function
Vc : RLi × RLi × [0, ∞) → R, which satisfies the following inequalities
 2  2
   
c11 W̃c1  + c12 W̃c2  ≤ Vc (W̃c1 , W̃c2 , t),
 2  2
   
Vc (W̃c1 , W̃c2 , t) ≤ c21 W̃c1  + c22 W̃c2  ,
 2  2 ∂ V ∂ Vc ∂ Vc
    c
−c31 W̃c1  − c32 W̃c2  ≥ + Ω1 (W̃c1 , t) + Ω2 (W̃c2 , t),
∂t ∂ W̃c1 ∂ W̃c2
   
 ∂ Vc 
  ≤ c41 
 W̃

,
 ∂ W̃  c1
 c1
  
 ∂ Vc 
  ≤ c42 
 W̃

, (3.124)
 ∂ W̃  c2
c2

for some positive constants c1i , c2i , c3i , c4i ∈ R for i = 1, 2. Using Property
 2.3,
Assumption 3.18, the projection bounds in (3.118), the fact that t → Fu x (t) ,

u1∗ (x (t)) , u2∗ (x (t)) ∈ L∞ over compact sets (using (3.91)), and provided the condi-

tions of Theorem 1 hold (required to prove that t → F̃û x (t) , x̂ (t) ,
 
û x (t) , Ŵa (t) ∈ L∞ ), the following bounds are developed to facilitate the sub-
sequent stability proof
   
   
ι1 ≥ W̃a1  , ι2 ≥ W̃a2  ,
   
ι3 ≥ ∇x σ1 G 1 ∇x σ1T  , ι4 ≥ ∇x σ2 G 2 ∇x σ2T  ,
ι5 ≥ Δ1  ; ι6 ≥ Δ2  ,
1  2 1  2
ι7 ≥ G 1 − G 21  ∇x V1∗  + G 2 − G 12  ∇x V2∗ 
4 4
1 
+ ∇x V1 (G 2 + G 1 ) ∇x V2  ,
∗ ∗T

2
 1  
ι8 ≥  ∗ ∗
− 2 ∇x V1 − ∇x V2 G 1 ∇x σ1 Wa1 − G 2 ∇x σ2 Wa2
T T

1  

+ ∇x V1 − ∇x V2 G 1 ∇x σ1 W̃a1 − G 2 ∇x σ2 W̃a2 
∗ ∗ T T
,
2
   
ι9 ≥ ∇x σ1 G 21 ∇x σ1T  , ι10 ≥ ∇x σ2 G 1 ∇x σ1T  ,
   
ι11 ≥ ∇x σ1 G 2 ∇x σ T  ,
2 ι12 ≥ ∇x σ2 G 12 ∇x σ T  ,
2 (3.125)

where ιj ∈ R for j = 1, . . . , 12 are computable positive constants.


3.4 N -Player Nonzero-Sum Differential Games 85

Theorem 3.21 If Assumptions 3.18 and 3.19 hold, the regressors ψi for i = 1, 2
are uniformly persistently exciting, and provided (3.108), (3.112), and the following
sufficient gain conditions are satisfied

c31 > ka11 ι1 ι3 + ka21 ι2 ι11 ,


c32 > ka21 ι2 ι4 + ka11 ι1 ι10 ,

where ka11 , ka21 , c31 , c32 , ι1 , ι2 , ι3 , and ι4 are introduced in (3.118), (3.124), and
(3.125), then the controller in (3.114), the actor-critic weight update laws in (3.115)–
(3.116) and (3.118), and the identifier in (3.92) and (3.98), guarantee  that the state

of the system, x (·), and the actor-critic weight estimation errors, W̃a1 (·) , W̃a2 (·)
 
and W̃c1 (·) , W̃c2 (·) , are uniformly ultimately bounded.

Proof To investigate the stability of (3.90) with control inputs û1 and û2 , and the
perturbed system (3.122), consider VL : χ × RL1 × RL1 × RL2 × RL2 × [0, ∞) →
R as the continuously differentiable, positive-definite Lyapunov function candidate,
given as
  1 T 1 T
VL x, W̃c1 , W̃c2 , W̃a1 , W̃a2 , t  V1∗ (x) + V2∗ (x) + Vc (W̃c1 , W̃c2 , t) + W̃a1 W̃a1 + W̃a2 W̃a2 ,
2 2

where Vi∗ for i = 1, 2 (the optimal value function for (3.90)), is the Lyapunov function
for (3.90), and Vc is the Lyapunov function for the exponentially stable system in
(3.123). Since V1∗ , V2∗ are continuously differentiable and positive-definite, [18,
Lemma 4.3] implies that there exist class K functions α1 and α2 defined on [0, r],
where Br ⊂ X , such that

α1 (x) ≤ V1∗ (x) + V2∗ (x) ≤ α2 (x) , ∀x ∈ Br . (3.126)

Using (3.124) and (3.126), VL can be bounded as

 2  2 1  2  2 
       
α1 (x) + c11 W̃c1  + c12 W̃c2  + W̃a1  + W̃a2  ≤ VL
2
 2  2 1  2  2 
       
≤ α2 (x) + c21 W̃c1  + c22 W̃c2  + W̃a1  + W̃a2  .
2

which can be written as α3 (w) ≤ VL (w, t) ≤ α4 (w), ∀w ∈ Bs , where w 


[xT W̃c1
T T
W̃c2 T
W̃a1 T T
W̃a2 ] , α3 and α4 are class K functions defined on [0, s], where
Bs ⊂ χ × R × R × RL2 × RL2 is a ball of radius s centered at the origin. Taking
L1 L1

the time derivative of VL yields


86 3 Excitation-Based Online Approximate Optimal Control

   ∂ Vc ∂ Vc ∂ Vc ∂ Vc
V̇L = ∇x V1∗ + ∇x V2∗ f + g1 û1 + g2 û2 + + Ω1 + Λ01 Δ1 + Ω2
∂t ∂ W̃c1 ∂ W̃c1 ∂ W̃c2
∂ Vc T ˙ T ˙
+ Λ02 Δ2 − W̃a1 Ŵa1 − W̃a2 Ŵa2 , (3.127)
∂ W̃c2

where the time derivatives of Vi∗ for i = 1, 2, are taken along the trajectories
of the system (3.90) with control inputs û1 , û2 and the time derivative of Vc
is taken along the along the trajectories of the perturbed system (3.122). Using
  )
2
(3.86), ∇x Vi∗ f = −∇x Vi∗ g1 u1∗ + g2 u2∗ − Qi (x) − uj∗T Rij uj∗ for i = 1, 2. Substi-
j=1
tuting for the ∇x Vi∗ f terms in (3.127), using the fact that ∇x Vi∗ gi = −2ui∗ Rii from
T

(3.84), and using (3.118) and (3.124), (3.127) can be upper bounded as

V̇L ≤ −Q − u1∗ (R11 + R21 ) u1∗ − u2∗ (R22 + R12 ) u2∗ + 2u1∗ R11 (u1∗ − û1 )
T T T

   
+ 2u2∗ R22 (u2∗ − û2 ) + ∇x V1∗ g2 û2 − u2∗ + ∇x V2∗ g1 û1 − u1∗
T

   2    2
       
+ c41 Λ01 W̃c1  Δ1  − c31 W̃c1  + c42 Λ02 W̃c2  Δ2  − c32 W̃c2 
⎡ ⎤
T ⎣ k ∂E
+ ka12 (Ŵa1 − Ŵc1 )⎦
a11 a
+ W̃a1
1 + ω1 ω1T ∂ Ŵ a1
⎡ ⎤
T ⎣ k ∂E
+ ka22 (Ŵa2 − Ŵc2 )⎦ ,
a21 a
+ W̃a2 (3.128)
1 + ω2 ω2T ∂ Ŵ a2

where Q  Q1 + Q2 . Substituting for ui∗ , ûi , and Δi for i = 1, 2 using (3.84),


(3.114), (3.119), and (3.122), respectively, and using (3.117) and (3.121) in (3.128),
yields

1  2 1  2
V̇L ≤ G 1 − G 21  ∇x V1∗  + G 2 − G 12  ∇x V2∗ 
4 4
1 
+ ∇x V1∗ (G 1 + G 2 ) ∇x V2∗T  − Q
2
1  
− ∇x V1∗ − ∇x V2∗ G 1 ∇x σ1T Wa1 − G 2 ∇x σ2T Wa2
2
1  
+ ∇x V1∗ − ∇x V2∗ G 1 ∇x σ1T W̃a1 − G 2 ∇x σ2T W̃a2
2
kc1 ϕ01    2
   
+ c41 √ Δ1  W̃c1  − c31 W̃c1 
2 ν1 ϕ11
kc2 ϕ02    2
   
+ c42 √ Δ2  W̃c2  − c32 W̃c2 
2 ν2 ϕ12
       2  2
         
+ ka12 W̃a1  W̃c1  + ka22 W̃a2  W̃c2  − ka12 W̃a1  − ka22 W̃a2 
3.4 N -Player Nonzero-Sum Differential Games 87

ka11  T  
+ T
W̃a1 W̃c1 − W̃a1 ∇x σ1 G 1 ∇x σ1T −W̃c1
T
ω1 + Δ1
1 + ω1T ω1
   
+ W̃a1
T
∇x σ1 G 21 − W̃c2
T
∇x σ2 G 2 ∇x σ1T −W̃c2T
ω2 + Δ2
   
+ W1T ∇x σ1 G 21 − W2T ∇x σ2 G 1 ∇x σ1T −W̃c2T
ω2 + Δ2
ka21  T  
+ T
W̃a2 W̃c2 − W̃a2 ∇x σ2 G 2 ∇x σ2T −W̃c2T
ω2 + Δ2
1 + ω2T ω2
   
+ W̃a2
T
∇x σ2 G 12 − W̃c1
T
∇x σ1 G 2 ∇x σ2T −W̃c1T
ω1 + Δ1
   
+ W2T ∇x σ2 G 12 − W1T ∇x σ1 G 2 ∇x σ2T −W̃c1T
ω1 + Δ1 . (3.129)

Using the bounds developed in (3.125), (3.129) can be further upper bounded as
 2  2
   
V̇L ≤ −Q − (c31 − ka11 ι1 ι3 − ka21 ι2 ι11 ) W̃c1  − ka12 W̃a1 
 2    2
     
− (c32 − ka21 ι2 ι4 − ka11 ι1 ι10 ) W̃c2  + σ2 W̃c2  − ka22 W̃a2 
    
 
+ σ1 W̃c1  + ka11 ι1 ι1 (ι3 ι5 + ι6 ι9 ) + ι6 W̄1 ι9 + W̄2 ι10
  
+ ka21 ι2 ι2 (ι4 ι6 + ι5 ι12 ) + ι5 W̄1 ι11 + W̄2 ι12 + ι7 + ι8 ,

where
c41 kc1 ϕ01     
σ1  √ ι5 + ka11 (ι1 ι3 (ι1 + ι5 )) + ka21 ι2 ι11 ι5 + W̄1 + ι12 ι2 + W̄2
2 ν1 ϕ11
+ ka12 ι1 ,
c42 kc2 ϕ02     
σ2  √ ι6 + ka21 (ι2 ι4 (ι2 + ι6 )) + ka11 ι1 ι9 ι1 + W̄1 + ι10 ι6 + W̄2
2 ν2 ϕ12
+ ka22 ι2 .

Provided c31 > ka11ι1ι3 + ka21ι2ι11 and c32 > ka21ι2ι4 + ka11ι1ι10, completion of the squares yields

V̇L ≤ −Q − ka22‖W̃a2‖² − ka12‖W̃a1‖²
  − (1 − θ1)(c31 − ka11ι1ι3 − ka21ι2ι11)‖W̃c1‖²
  − (1 − θ2)(c32 − ka21ι2ι4 − ka11ι1ι10)‖W̃c2‖²
  + ka11ι1[ι1(ι3ι5 + ι6ι9) + ι6(W̄1ι9 + W̄2ι10)]
  + ka21ι2[ι2(ι4ι6 + ι5ι12) + ι5(W̄1ι11 + W̄2ι12)]
  + σ1²/(4θ1(c31 − ka11ι1ι3 − ka21ι2ι11)) + ι7
  + σ2²/(4θ2(c32 − ka21ι2ι4 − ka11ι1ι10)) + ι8,    (3.130)

where θ1, θ2 ∈ (0, 1) are adjustable parameters. Since Q is positive definite, according to [18, Lemma 4.3], there exist class K functions α5 and α6 such that

α5(‖w‖) ≤ F(w) ≤ α6(‖w‖), ∀w ∈ Bs,    (3.131)

where

F(w) = Q + ka12‖W̃a1‖² + (1 − θ1)(c31 − ka11ι1ι3 − ka21ι2ι11)‖W̃c1‖² + (1 − θ2)(c32 − ka21ι2ι4 − ka11ι1ι10)‖W̃c2‖² + ka22‖W̃a2‖².

Using (3.131), the expression in (3.130) can be further upper bounded as V̇L ≤ −α5(‖w‖) + Υ, where

Υ = ka11ι1[ι1(ι3ι5 + ι6ι9) + ι6(W̄1ι9 + W̄2ι10)] + σ1²/(4θ1(c31 − ka11ι1ι3 − ka21ι2ι11))
  + ka21ι2[ι2(ι4ι6 + ι5ι12) + ι5(W̄1ι11 + W̄2ι12)] + σ2²/(4θ2(c32 − ka21ι2ι4 − ka11ι1ι10))
  + ι7 + ι8,

which proves that V̇L is negative whenever w lies outside the compact set Ωw ≜ {w : ‖w‖ ≤ α5⁻¹(Υ)}, and hence, w(·) is uniformly ultimately bounded, according to [18, Theorem 4.18].

3.4.6 Simulations

Two-Player Game with a Known Feedback-Nash Equilibrium Solution


The following two-player nonzero-sum game considered in [32, 37–39] has a
known analytical solution, and hence is utilized in this section to demonstrate
the performance of the developed technique. The system dynamics are given by
ẋ = f (x) + g1 (x) u1 + g2 (x) u2 , where

Fig. 3.15 Evolution of the system states, state derivative estimates, and control signals for the two-player nonzero-sum game, with persistently excited input for the first six seconds (reproduced with permission from [30], ©2015, IEEE)
f(x) = [ x2 − 2x1 ;  −(1/2)x1 − x2 + (1/4)x2(cos(2x1) + 2)² + (1/4)x2(sin(4x1²) + 2)² ],
g1(x) = [0, cos(2x1) + 2]^T,    (3.132)
g2(x) = [0, sin(4x1²) + 2]^T.    (3.133)

The objective is to design u1 and u2 to find a feedback-Nash equilibrium solution to the optimal control problem described by (3.83), where the local cost is given by ri(x, ui, uj) = x^T Qi x + ui^T Rii ui + uj^T Rij uj, i = 1, 2, and j = 3 − i, where R11 = 2R22 = 2, R12 = 2R21 = 2, Q1 = 2I2, and Q2 = I2. The known analytical solutions for the optimal value functions of player 1 and player 2 are V1*(x) = (1/2)x1² + x2² and V2*(x) = (1/4)x1² + (1/2)x2², and the corresponding optimal control inputs are u1*(x) = −(cos(2x1) + 2)x2 and u2*(x) = −(1/2)(sin(4x1²) + 2)x2.
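For reference, the dynamics in (3.132)–(3.133) and the stated analytical solution can be encoded directly for simulation and for checking weight convergence. The following Python sketch is purely illustrative (the function names are not from the original implementation):

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([x2 - 2 * x1,
                     -0.5 * x1 - x2
                     + 0.25 * x2 * (np.cos(2 * x1) + 2) ** 2
                     + 0.25 * x2 * (np.sin(4 * x1 ** 2) + 2) ** 2])

def g1(x):
    return np.array([0.0, np.cos(2 * x[0]) + 2])

def g2(x):
    return np.array([0.0, np.sin(4 * x[0] ** 2) + 2])

# Known analytical solution, used as ground truth for the weight estimates
def V1_star(x): return 0.5 * x[0] ** 2 + x[1] ** 2
def V2_star(x): return 0.25 * x[0] ** 2 + 0.5 * x[1] ** 2
def u1_star(x): return -(np.cos(2 * x[0]) + 2) * x[1]
def u2_star(x): return -0.5 * (np.sin(4 * x[0] ** 2) + 2) * x[1]
```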
To implement the developed technique, the activation functions for the critic neural networks are selected as σi(x) = [x1², x1x2, x2²]^T, i = 1, 2, while the activation

Fig. 3.16 Convergence of actor and critic weights for player 1 and player 2 in the nonzero-sum game (reproduced with permission from [30], ©2015, IEEE)

function for the identifier dynamic neural network is selected as a symmetric sig-
moid with 5 neurons in the hidden layer. The identifier gains are selected as k = 300,
α = 200, γ = 5, β1 = 0.2, Γwf = 0.1I6 , and Γvf = 0.1I2 , and the gains of the actor-
critic learning laws are selected as ka11 = ka12 = 10, ka21 = ka22 = 20, kc1 = 50,
kc2 = 10, ν1 = ν2 = 0.001, and λ1 = λ2 = 0.03. The covariance matrix is initial-
ized to Γ(0) = 5000I3, the neural network weights for the state derivative estimator are randomly initialized with values between [−1, 1], the weights for the actor and the critic are initialized to [3, 3, 3]^T, the state estimates are initialized to zero, and the states are initialized to x(0) = [3, −1]^T. Similar to results such as [9, 14, 32, 39,
40], a small amplitude exploratory signal (noise) is added to the control to excite the
states for the first six seconds of the simulation, as seen from the evolution of states
and control in Fig. 3.15. The identifier approximates the system dynamics, and the
state derivative estimation error is shown in Fig. 3.15. The time histories of the critic
neural network weights and the actor neural network weights are given in Fig. 3.16,
where solid lines denote the weight estimates and dotted lines denote the true values
of the weights. Persistence of excitation ensures that the weights converge to their

known ideal values in less than five seconds of simulation. The use of two separate
neural networks facilitates the design of least-squares-based update laws in (3.115).
The least-squares-based update laws result in a performance benefit over single neu-
ral network-based results such as [40], where the convergence of weights is obtained
after about 250 s of simulation.
Three-Player Game
To demonstrate the performance of the developed technique in the multi-player case, the two-player simulation is augmented with another actor. The resulting dynamics are ẋ = f(x) + g1(x)u1 + g2(x)u2 + g3(x)u3, where
f(x) = [ x2 − 2x1 ;  −(1/2)x1 − x2 + (1/4)x2(cos(2x1) + 2)² + (1/4)x2(sin(4x1²) + 2)² + (1/4)x2(cos(4x1²) + 2)² ],
g3(x) = [0, cos(4x1²) + 2]^T,    (3.134)

and g1 and g2 are the same as (3.133). Figure 3.17 demonstrates the convergence
of the actor and the critic weights. Since the feedback-Nash equilibrium solution is
unknown for the dynamics in (3.134), the obtained weights are not compared against
their true values. Figure 3.18 demonstrates the regulation of the system states and
the state derivative estimation error to the origin, and the boundedness of the control
signals.

Remark 3.22 An implementation issue in using the developed algorithm as well


as results such as [9, 14, 32, 39, 40] is to ensure persistence of excitation of the
critic regressor vector. Unlike linear systems, where persistence of excitation of
the regressor translates to the sufficient richness of the external input, no verifiable
method exists to ensure persistence of excitation in nonlinear systems. In this simu-
lation, a small amplitude exploratory signal consisting of a sum of sines and cosines
of varying frequencies is added to the control to ensure persistence of excitation
qualitatively, and convergence of critic weights to their optimal values is achieved.
The exploratory signal n (t), designed using trial and error, is present in the first six
seconds of the simulation and is given by

n(t) = sin(5πt) + sin(et) + sin⁵(t) + cos⁵(20t) + sin²(−1.2t) cos(0.5t).
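For completeness, this exploratory signal can be implemented as in the following Python sketch (the six-second cutoff is applied explicitly; the function name is illustrative):

```python
import numpy as np

def probing_signal(t):
    """Exploratory signal n(t), added to the control for the first six seconds
    to qualitatively induce persistence of excitation (trial-and-error design)."""
    if t > 6.0:
        return 0.0
    return (np.sin(5 * np.pi * t) + np.sin(np.e * t) + np.sin(t) ** 5
            + np.cos(20 * t) ** 5 + np.sin(-1.2 * t) ** 2 * np.cos(0.5 * t))
```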

3.5 Background and Further Reading

Reinforcement learning-based techniques have been developed to approximately


solve optimal control problems for continuous-time and discrete-time deterministic
systems in results such as [9, 14, 41–47] for set-point regulation, and [23, 48–52]

Fig. 3.17 Convergence of actor and critic weights for the three-player nonzero-sum game (reproduced with permission from [30], ©2015, IEEE)

Fig. 3.18 Evolution of the system states, state derivative estimates, and control signals for the three-player nonzero-sum game, with persistently excited input for the first six seconds (reproduced with permission from [30], ©2015, IEEE)

for trajectory tracking. Extension of approximate dynamic programming to systems


with continuous time and state was pioneered by Doya [41] who used a Hamilton–
Jacobi–Bellman framework to derive algorithms for value function approximation
and policy improvement, based on a continuous-time version of the temporal differ-
ence error. Murray et al. [53] also used the Hamilton–Jacobi–Bellman framework to
develop a stepwise stable iterative approximate dynamic programming algorithm for
continuous-time input-affine systems with an input quadratic performance measure.
In Beard et al. [54], Galerkin’s spectral method is used to approximate the solution
to the generalized Hamilton–Jacobi–Bellman equation, and the solution is used to
compute a stabilizing feedback controller offline. In [55], Abu-Khalaf and Lewis
propose a least-squares successive approximation solution, where a neural network
is trained offline to learn the generalized Hamilton–Jacobi–Bellman solution.
All of the aforementioned approaches are offline and/or require complete knowl-
edge of system dynamics. One of the contributions in [56] is that only partial knowl-

edge of the system dynamics is required, and a hybrid continuous-time/discrete-time


sampled data controller is developed based on policy iteration, where the feedback
control operation of the actor occurs at faster time scale than the learning process of
the critic. Vamvoudakis and Lewis [14] extended the idea by designing a model-based
online algorithm called synchronous policy iteration, which involved
synchronous, continuous-time adaptation of both actor and critic neural networks.
Over the past few years, research has focused on the development of robust [57,
58] and off-policy [59] approximate dynamic programming methods for near-optimal
control of nonlinear systems.
For trajectory tracking, approximate dynamic programming approaches are pre-
sented in results such as [49, 60], where the value function, and the controller pre-
sented are time-varying functions of the tracking error. For discrete time systems,
several approaches have been developed to address the tracking problem. Park et al.
[61] use generalized back-propagation through time to solve a finite-horizon tracking
problem that involves offline training of neural networks. An approximate dynamic
programming-based approach is presented in [48] to solve an infinite-horizon opti-
mal tracking problem where the desired trajectory is assumed to depend on the
system states. Greedy heuristic dynamic programming based algorithms are pre-
sented in results such as [23, 62, 63] which transform the nonautonomous system
into an autonomous system, and approximate convergence of the sequence of value
functions to the optimal value function is established.
Recent results on near-optimal trajectory tracking include integral reinforcement
learning [64], Q-learning [65], guaranteed cost [66], decentralized [67], robust [68],
and event driven [69] approximate dynamic programming methods.
Generalization of reinforcement learning controllers to differential game prob-
lems is investigated in results such as [14, 32, 38, 39, 70–73]. Techniques utilizing
Q-learning algorithms have been developed for a zero-sum game in [74]. An approx-
imate dynamic programming procedure that provides a solution to the Hamilton–
Jacobi-Isaacs equation associated with the two-player zero-sum nonlinear differen-
tial game is introduced in [70]. The approximate dynamic programming algorithm
involves two iterative cost functions finding the upper and lower performance indices
as sequences that converge to the saddle point solution to the game. The actor-critic
structure required for learning the saddle point solution is composed of four action
networks and two critic networks. The iterative approximate dynamic programming
solution in [71] considers solving zero-sum differential games under the condition
that the saddle point does not exist, and a mixed optimal performance index func-
tion is obtained under a deterministic mixed optimal control scheme when the saddle
point does not exist. Another approximate dynamic programming iteration technique
is presented in [72], in which the nonlinear quadratic zero-sum game is transformed
into an equivalent sequence of linear quadratic zero-sum games to approximate an
optimal saddle point solution. In [73], an integral reinforcement learning method is
used to determine an online solution to the two player nonzero-sum game for a linear
system without complete knowledge of the dynamics.
The synchronous policy iteration method in [38] is further generalized to solve
the two-player zero-sum game problem in [39] and a multi-player nonzero-sum

game in [32] and [40] for nonlinear continuous-time systems with known dynamics.
Furthermore, [75] presents a policy iteration method for an infinite-horizon two-
player zero-sum Nash game with unknown nonlinear continuous-time dynamics.
Recent results also focus on the development of data-driven approximate dynamic
programming methods for set-point regulation, trajectory tracking, and differential
games to relax the persistence of excitation conditions. These methods are surveyed
at the end of the next chapter.

References

1. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural model-
ing. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive
approaches, vol 15. Nostrand, New York, pp 493–525
2. Hopfield J (1984) Neurons with graded response have collective computational properties like
those of two-state neurons. Proc Nat Acad Sci USA 81(10):3088
3. Kirk D (2004) Optimal Control Theory: An Introduction. Dover, Mineola, NY
4. Lewis FL, Vrabie D, Syrmos VL (2012) Optimal Control, 3rd edn. Wiley, Hoboken
5. Case J (1969) Toward a theory of many player differential games. SIAM J Control 7:179–197
6. Starr A, Ho CY (1969) Nonzero-sum differential games. J Optim Theory App 3(3):184–206
7. Starr A, Ho CY (1969) Further properties of nonzero-sum differential games. J Optim Theory App
4:207–219
8. Friedman A (1971) Differential games. Wiley, Hoboken
9. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
10. Xian B, Dawson DM, de Queiroz MS, Chen J (2004) A continuous asymptotic tracking control
strategy for uncertain nonlinear systems. IEEE Trans Autom Control 49(7):1206–1211
11. Patre PM, MacKunis W, Kaiser K, Dixon WE (2008) Asymptotic tracking for uncertain
dynamic systems via a multilayer neural network feedforward and RISE feedback control
structure. IEEE Trans Autom Control 53(9):2180–2185
12. Filippov AF (1988) Differential equations with discontinuous right-hand sides. Kluwer Aca-
demic Publishers, Dordrecht
13. Kamalapurkar R, Rosenfeld JA, Klotz J, Downey RJ, Dixon WE (2014) Supporting lemmas
for RISE-based control methods. arXiv:1306.3432
14. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
15. Sastry S, Bodson M (1989) Adaptive control: stability, convergence, and robustness. Prentice-
Hall, Upper Saddle River
16. Panteley E, Loria A, Teel A (2001) Relaxed persistency of excitation for uniform asymptotic
stability. IEEE Trans Autom Control 46(12):1874–1886
17. Loría A, Panteley E (2002) Uniform exponential stability of linear time-varying systems:
revisited. Syst Control Lett 47(1):13–24
18. Khalil HK (2002) Nonlinear systems, 3rd edn. Prentice Hall, Upper Saddle River
19. Bertsekas D, Tsitsiklis J (1996) Neuro-dynamic programming. Athena Scientific, Belmont
20. Busoniu L, Babuska R, De Schutter B, Ernst D (2010) Reinforcement learning and dynamic
programming using function approximators. CRC Press, Boca Raton
21. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
22. Kamalapurkar R, Dinh H, Bhasin S, Dixon WE (2015) Approximate optimal trajectory tracking
for continuous-time nonlinear systems. Automatica 51:40–48
96 3 Excitation-Based Online Approximate Optimal Control

23. Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a
class of discrete-time nonlinear systems via the greedy hdp iteration algorithm. IEEE Trans
Syst Man Cybern Part B Cybern 38(4):937–942
24. Hornik K, Stinchcombe M, White H (1990) Universal approximation of an unknown mapping
and its derivatives using multilayer feedforward networks. Neural Netw 3(5):551–560
25. Lewis FL, Selmic R, Campos J (2002) Neuro-fuzzy control of industrial systems with actuator
nonlinearities. Society for Industrial and Applied Mathematics, Philadelphia
26. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall, Upper Saddle River
27. Misovec KM (1999) Friction compensation using adaptive non-linear control with persistent
excitation. Int J Control 72(5):457–479
28. Narendra K, Annaswamy A (1986) Robust adaptive control in the presence of bounded distur-
bances. IEEE Trans Autom Control 31(4):306–315
29. Rao AV, Benson DA, Darby CL, Patterson MA, Francolin C, Huntington GT (2010) Algorithm
902: GPOPS, A MATLAB software for solving multiple-phase optimal control problems using
the Gauss pseudospectral method. ACM Trans Math Softw 37(2):1–39
30. Johnson M, Kamalapurkar R, Bhasin S, Dixon WE (2015) Approximate n-player nonzero-sum
game solution for an uncertain continuous nonlinear system. IEEE Trans Neural Netw Learn
Syst 26(8):1645–1658
31. Basar T, Olsder GJ (1999) Dynamic noncooperative game theory. Classics in applied mathe-
matics, 2nd edn. SIAM, Philadelphia
32. Vamvoudakis KG, Lewis FL (2011) Multi-player non-zero-sum games: online adaptive learn-
ing solution of coupled hamilton-jacobi equations. Automatica 47:1556–1569
33. Basar T, Bernhard P (2008) H ∞ -optimal control and related minimax design problems: a
dynamic game approach, 2nd edn. Modern Birkhäuser Classics, Birkhäuser, Boston
34. Patre PM, Dixon WE, Makkar C, Mackunis W (2006) Asymptotic tracking for systems with
structured and unstructured uncertainties. In: Proceedings of the IEEE conference on decision
and control, San Diego, California, pp 441–446
35. Dixon WE, Behal A, Dawson DM, Nagarkatti S (2003) Nonlinear control of engineering
systems: a lyapunov-based approach. Birkhauser, Boston
36. Krstic M, Kanellakopoulos I, Kokotovic PV (1995) Nonlinear and adaptive control design.
Wiley, New York
37. Nevistic V, Primbs JA (1996) Constrained nonlinear optimal control: a converse HJB approach.
Technical report. CIT-CDS 96-021, California Institute of Technology, Pasadena, CA 91125
38. Vamvoudakis KG, Lewis FL (2009) Online synchronous policy iteration method for optimal
control. In: Yu W (ed) Recent advances in intelligent control systems. Springer, Berlin, pp
357–374
39. Vamvoudakis KG, Lewis FL (2010) Online neural network solution of nonlinear two-player
zero-sum games using synchronous policy iteration. In: Proceedings of the IEEE conference
on decision and control
40. Zhang H, Cui L, Luo Y (2013) Near-optimal control for nonzero-sum differential games of
continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern 43(1):206–
216
41. Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput
12(1):219–245
42. Chen Z, Jagannathan S (2008) Generalized Hamilton-Jacobi-Bellman formulation -based
neural network control of affine nonlinear discrete-time systems. IEEE Trans Neural Netw
19(1):90–106
43. Dierks T, Thumati B, Jagannathan S (2009) Optimal control of unknown affine nonlinear
discrete-time systems using offline-trained neural networks with proof of convergence. Neural
Netw 22(5–6):851–860
44. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control algorithms
and stability. Communications and control engineering, Springer, London
45. Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-
time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634
References 97

46. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
47. Yang X, Liu D, Wang D (2014) Reinforcement learning for adaptive optimal control of unknown
continuous-time nonlinear systems with input constraints. Int J Control 87(3):553–566
48. Dierks T, Jagannathan S (2009) Optimal tracking control of affine nonlinear discrete-time
systems with unknown internal dynamics. In: Proceedings of the IEEE conference on decision
and control, Shanghai, CN, pp 6750–6755
49. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
50. Wei Q, Liu D (2013) Optimal tracking control scheme for discrete-time nonlinear systems with
approximation errors. In: Guo C, Hou ZG, Zeng Z (eds) Advances in neural networks - ISNN
2013, vol 7952. Lecture notes in computer science. Springer, Berlin, pp 1–10
51. Kiumarsi B, Lewis FL, Modares H, Karimpour A, Naghibi-Sistani MB (2014) Reinforcement
Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics.
Automatica 50(4):1167–1175
52. Qin C, Zhang H, Luo Y (2014) Online optimal tracking control of continuous-time linear
systems with unknown dynamics by using adaptive dynamic programming. Int J Control
87(5):1000–1009
53. Murray J, Cox C, Lendaris G, Saeks R (2002) Adaptive dynamic programming. IEEE Trans
Syst Man Cybern Part C Appl Rev 32(2):140–153
54. Beard R, Saridis G, Wen J (1997) Galerkin approximations of the generalized Hamilton-Jacobi-
Bellman equation. Automatica 33:2159–2178
55. Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
56. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
57. Wang K, Liu Y, Li L (2014) Visual servoing trajectory tracking of nonholonomic mobile robots
without direct position measurement. IEEE Trans Robot 30(4):1026–1035
58. Wang D, Liu D, Zhang Q, Zhao D (2016) Data-based adaptive critic designs for nonlinear robust
optimal control with uncertain dynamics. IEEE Trans Syst Man Cybern Syst 46(11):1544–1555
59. Li H, Liu D, Wang D (2014) Integral reinforcement learning for linear continuous-time zero-
sum games with completely unknown dynamics. IEEE Trans Autom Sci Eng 11(3):706–714
60. Dierks T, Jagannathan S (2010) Optimal control of affine nonlinear continuous-time systems.
In: Proceedings of the American control conference, pp 1568–1573
61. Park YM, Choi MS, Lee KY (1996) An optimal tracking neuro-controller for nonlinear dynamic
systems. IEEE Trans Neural Netw 7(5):1099–1110
62. Luo Y, Liang M (2011) Approximate optimal tracking control for a class of discrete-time non-
affine systems based on GDHP algorithm. In: IWACI International Workshop on Advanced
Computational Intelligence, pp 143–149
63. Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of
discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocom-
puting 78(1):14–22
64. Modares H, Lewis FL (2014) Optimal tracking control of nonlinear partially-unknown
constrained-input systems using integral reinforcement learning. Automatica 50(7):1780–1792
65. Luo B, Liu D, Huang T, Wang D (2016) Model-free optimal tracking control via critic-only
q-learning. IEEE Trans Neural Netw Learn Syst 27(10):2134–2144
66. Yang X, Liu D, Wei Q, Wang D (2016) Guaranteed cost neural tracking control for a class of
uncertain nonlinear systems using adaptive dynamic programming. Neurocomputing 198:80–
90
67. Zhao B, Liu D, Yang X, Li Y (2017) Observer-critic structure-based adaptive dynamic pro-
gramming for decentralised tracking control of unknown large-scale nonlinear systems. Int J
Syst Sci 48(9):1978–1989
98 3 Excitation-Based Online Approximate Optimal Control

68. Wang D, Liu D, Zhang Y, Li H (2018) Neural network robust tracking control with adaptive
critic framework for uncertain nonlinear systems. Neural Netw 97:11–18
69. Vamvoudakis KG, Mojoodi A, Ferraz H (2017) Event-triggered optimal tracking control of
nonlinear systems. Int J Robust Nonlinear Control 27(4):598–619
70. Wei Q, Zhang H (2008) A new approach to solve a class of continuous-time nonlinear quadratic
zero-sum game using ADP. In: IEEE international conference on networking, sensing and
control, pp 507–512
71. Zhang H, Wei Q, Liu D (2010) An iterative adaptive dynamic programming method for solving
a class of nonlinear zero-sum differential games. Automatica 47:207–214
72. Zhang X, Zhang H, Luo Y, Dong M (2010) Iteration algorithm for solving the optimal strategies
of a class of nonaffine nonlinear quadratic zero-sum games. In: Proceedings of the IEEE
conference on decision and control, pp 1359–1364
73. Mellouk A (ed) (2011) Advances in reinforcement learning. InTech
74. Littman M (2001) Value-function reinforcement learning in markov games. Cogn Syst Res
2(1):55–66
75. Johnson M, Bhasin S, Dixon WE (2011) Nonlinear two-player zero-sum game approximate
solution using a policy iteration algorithm. In: Proceedings of the IEEE conference on decision
and control, pp 142–147
Chapter 4
Model-Based Reinforcement Learning
for Approximate Optimal Control

4.1 Introduction

In reinforcement learning-based approximate online optimal control, the Hamilton–


Jacobi–Bellman equation along with an estimate of the state derivative (cf. [1, 2]),
or an integral form of the Hamilton–Jacobi–Bellman equation (cf. [3]) is utilized to
approximately evaluate the Bellman error along the system trajectory. The Bellman
error, evaluated at a point, provides an indirect measure of the quality of the esti-
mated value function evaluated at that point. Therefore, the unknown value function
parameters are updated based on evaluation of the Bellman error along the system
trajectory. Such weight update strategies create two challenges for analyzing con-
vergence. The system states need to satisfy the persistence of excitation condition,
and the system trajectory needs to visit enough points in the state-space to generate
a good approximation of the value function over the entire domain of operation. As
in Chap. 3, these challenges are typically addressed in the related literature (cf. [2,
4–12]) by adding an exploration signal to the control input to ensure sufficient explo-
ration of the domain of operation. However, no analytical methods exist to compute
the appropriate exploration signal when the system dynamics are nonlinear.
The aforementioned challenges arise from the restriction that the Bellman error
can only be evaluated along the system trajectories. In particular, the integral Bellman
error is meaningful as a measure of the quality of the estimated value function only if
it is evaluated along the system trajectories, and state derivative estimators can only
generate numerical estimates of the state derivative along the system trajectories.
Recently, [11] demonstrated that experience replay can be used to improve data effi-
ciency in online approximate optimal control by reuse of recorded data. However,
since the data needs to be recorded along the system trajectory, the system trajec-
tory under the designed approximate optimal controller needs to provide enough
excitation for learning. In general, such excitation is not available. As a result, the
simulation results in [11] are generated using an added probing signal.


In this chapter and in results such as [13–22], a different approach is used to


improve data efficiency by observing that if the system dynamics are known, the
state derivative, and hence, the Bellman error can be evaluated at any desired point in
the state-space. Unknown parameters in the value function can therefore be adjusted
based on least-squares minimization of the Bellman error evaluated at any number
of arbitrary points in the state-space. For example, in an infinite-horizon regulation
problem, the Bellman error can be computed at points uniformly distributed in a
neighborhood around the origin of the state-space. The results of this chapter indicate
that convergence of the unknown parameters in the value function is guaranteed
provided the selected points satisfy a rank condition. Since the Bellman error can
be evaluated at any desired point in the state-space, sufficient exploration can be
achieved by appropriately selecting the points to cover the domain of operation.
If the system dynamics are partially unknown, an approximation to the Bellman
error can be evaluated at any desired point in the state-space based on an estimate
of the system dynamics. If each new evaluation of the Bellman error along the
system trajectory is interpreted as gaining experience via exploration, the use of a
model to evaluate the Bellman error at an unexplored point in the state-space can be
interpreted as a simulation of the experience. Learning based on simulation of expe-
rience has been investigated in results such as [23–28] for stochastic model-based
reinforcement learning; however, these results solve the optimal control problem
off-line in the sense that repeated learning trials need to be performed before the
algorithm learns the controller, and system stability during the learning phase is not
analyzed. This chapter and the results in [13–22] further the state of the art for non-
linear control-affine plants with linearly parameterizable uncertainties in the drift
dynamics by providing an online solution to deterministic infinite-horizon optimal
regulation problems. In this chapter, a concurrent learning-based parameter estima-
tor is developed to exponentially identify the unknown parameters in the system
model, and the parameter estimates are used to implement simulation of experience
by extrapolating the Bellman error.
In Sect. 4.4 (see also, [14, 22]), the model-based reinforcement learning method
is extended to solve trajectory tracking problems for uncertain nonlinear systems.
The technical challenges associated with the nonautonomous nature of the trajec-
tory tracking problem are addressed in Chap. 3, where it is established that under a
matching condition on the desired trajectory, the optimal trajectory tracking problem
can be reformulated as a stationary optimal control problem. Since the value func-
tion associated with a stationary optimal control problem is time-invariant, it can be
approximated using traditional function approximation techniques. The reformula-
tion developed in Chap. 3 requires computation of the steady-state tracking controller,
which depends on the system model; therefore, the development in Chap. 3 requires
exact model knowledge. Obtaining an accurate estimate of the desired steady-state
controller, and injecting the resulting estimation error in the stability analysis are the
major technical challenges in extending the work in Chap. 3 to uncertain systems.
In Sect. 4.4, an estimate of the steady-state controller is generated using concurrent
learning-based system identifiers. The use of an estimate instead of the true steady-
state controller results in additional approximation errors that can potentially cause

instability during the learning phase. This chapter analyzes the stability of the closed-
loop system in the presence of the aforementioned additional approximation error.
The error between the actual steady-state controller and its estimate is included in
the stability analysis by examining the trajectories of the concatenated system under
the implemented control signal. In addition to estimating the desired steady-state
controller, the concurrent learning-based system identifier is also used to simulate
experience by evaluating the Bellman error over unexplored areas of the state-space
[14, 29, 30].
In Sect. 4.5 (see also, [16]), the model-based reinforcement learning method is
extended to obtain an approximate feedback-Nash equilibrium solution to an infinite-
horizon N -player nonzero-sum differential game online, without requiring persis-
tence of excitation, for a nonlinear control-affine system with uncertain linearly
parameterized drift dynamics. A system identifier is used to estimate the unknown
parameters in the drift dynamics. The solutions to the coupled Hamilton–Jacobi
equations and the corresponding feedback-Nash equilibrium policies are approxi-
mated using parametric universal function approximators. Based on estimates of the
unknown drift parameters, estimates for the Bellman errors are evaluated at a set of
pre-selected points in the state-space. The critic and the actor weights are updated
using a concurrent learning-based least-squares approach to minimize the instanta-
neous Bellman errors and the Bellman errors evaluated at pre-selected points. Simul-
taneously, the unknown parameters in the drift dynamics are updated using a history
stack of recorded data via a concurrent learning-based gradient descent approach. It
is shown that under a condition milder than persistence of excitation, uniformly ulti-
mately bounded convergence of the unknown drift parameters, the critic weights and
the actor weights to their true values can be established. Simulation results demon-
strate the effectiveness of the developed approximate solutions to infinite-horizon
optimal regulation and tracking problems online for inherently unstable nonlinear
systems with uncertain drift dynamics. The simulations also demonstrate that the
developed method can be used to implement reinforcement learning without the
addition of a probing signal.

4.2 Model-Based Reinforcement Learning

Consider the control-affine nonlinear dynamical system in (1.9) and recall the expression for the Bellman error in (2.3),

δ(x, Ŵc, Ŵa) ≜ ∇xV̂(x, Ŵc)(f(x) + g(x)û(x, Ŵa)) + r(x, û(x, Ŵa)).

To solve the optimal control problem, the critic aims to find a set of parameters Ŵc
and the actor aims to find a set of parameters Ŵa that minimize the integral error E,
introduced in (2.5). Computation of the error E requires evaluation of the Bellman
error over the entire domain D, which is generally infeasible. As a result, a derivative-

based evaluation of E along the system trajectories, denoted by E t and introduced


in (2.7), is utilized for learning in Chap. 3. Intuitively, the state trajectory, x, needs
to visit as many points in the operating domain as possible for E t to approximate E
over an operating domain. This intuition is formalized by the fact that techniques in
Chap. 3 require persistence of excitation to achieve convergence.
The persistence of excitation condition is relaxed in [11] to a finite excitation con-
dition by using integral reinforcement learning along with experience replay, where
each evaluation of the Bellman error along the system trajectory is interpreted as
gained experience. These experiences are stored in a history stack and are repeatedly
used in the learning algorithm to improve data efficiency. In this chapter, a different
approach is used to circumvent the persistence of excitation condition. Given a model
of the system and the current parameter estimates Ŵc (t) and Ŵa (t), the Bellman
error in (2.3) can be evaluated at any point xi ∈ Rn. That is, the critic can gain experience on how well the value function is estimated at any arbitrary point xi in the state-space without actually visiting xi. In other words, given a fixed state xi and a corresponding planned action û(xi, Ŵa), the critic can use the dynamic model to simulate a visit to xi by computing the state derivative at xi. This results in simulated experience quantified by the Bellman error δti(t) = δ(xi, Ŵc(t), Ŵa(t)).
In the case where the drift dynamics, f, are uncertain, a parametric approximation f̂(x, θ̂) of the function f, where θ̂ denotes the matrix of parameter estimates, can be utilized to approximate the Bellman error in (2.3) as

δ̂(x, Ŵc, Ŵa, θ̂) ≜ ∇xV̂(x, Ŵc)(f̂(x, θ̂) + g(x)û(x, Ŵa)) + r(x, û(x, Ŵa)).    (4.1)

Given current parameter estimates Ŵc(t), Ŵa(t), and θ̂(t), the approximate Bellman error in (4.1) can be evaluated at any point xi ∈ Rn. This results in simulated experience quantified by the Bellman error δ̂ti(t) = δ̂(xi, Ŵc(t), Ŵa(t), θ̂(t)). The simulated experience can then be used along with gained experience by the critic to learn the value function. Motivation behind using simulated experience is that by selecting multiple (say N) points, the error signal in (2.7) can be augmented to yield a heuristically better approximation Êti(t), given by

Êti(t) ≜ ∫₀ᵗ ( δ̂t²(τ) + Σ_{i=1}^{N} δ̂ti²(τ) ) dτ,

for the desired error signal in (2.5). A block diagram of the simulation-based actor-
critic-identifier architecture is presented in Fig. 4.1. For notational brevity, the depen-
dence of all the functions on the system states and time is suppressed in the stability
analysis subsections unless required for clarity of exposition.
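To make the idea concrete, the following Python sketch evaluates the approximate Bellman error (4.1) at a set of user-selected states using the model estimate in place of a measured state derivative. All names are illustrative assumptions, and the local cost is supplied as a callable rather than fixed to a particular form:

```python
import numpy as np

def bellman_error_hat(x, grad_V_hat, u_hat, f_hat, g, local_cost):
    """Approximate Bellman error (4.1) at a single state x, using the model
    estimate f_hat in place of a measured state derivative."""
    u = u_hat(x)                      # planned action at x
    xdot_hat = f_hat(x) + g(x) @ u    # simulated state derivative
    return grad_V_hat(x) @ xdot_hat + local_cost(x, u)

def simulated_experience(points, grad_V_hat, u_hat, f_hat, g, local_cost):
    """Bellman error extrapolated to each preselected state x_i (the
    'simulated experience' used to augment the on-trajectory error)."""
    return np.array([bellman_error_hat(xi, grad_V_hat, u_hat, f_hat, g, local_cost)
                     for xi in points])
```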

Fig. 4.1 Simulation-based actor-critic-identifier architecture. In addition to the state-action measurements, the critic also utilizes states, actions, and the corresponding state-derivatives to learn the value function

4.3 Online Approximate Regulation1

This section develops a data-driven implementation of model-based reinforcement


learning to solve approximate optimal regulation problems online under a persis-
tence of excitation-like rank condition. The development is based on the observation
that, given a model of the system, reinforcement learning can be implemented by
evaluating the Bellman error at any number of desired points in the state-space. In
this result, a parametric system model is considered, and a concurrent learning-based
parameter identifier is developed to compensate for uncertainty in the parameters.
Uniformly ultimately bounded regulation of the system states to a neighborhood of
the origin, and convergence of the developed policy to a neighborhood of the optimal
policy are established using a Lyapunov-based analysis, and simulation results are
presented to demonstrate the performance of the developed controller.
Online implementation of simulation of experience requires uniform online estimation of the function f using the parametric approximation f̂(x, θ̂) (i.e., the parameter estimates θ̂ need to converge to their true values θ). In the following,
parameter estimates θ̂ need to converge to their true values θ ). In the following,
a general Lyapunov-based characterization of a system identifier that achieves uni-
form approximation of f is developed based on recent ideas on data-driven parameter
convergence in adaptive control (cf. [29–34]).

4.3.1 System Identification

To facilitate online system identification, let f (x) = Y (x) θ denote the linear
parametrization of the function f , where Y : Rn → Rn× pθ is the regression matrix,

1 Parts of the text in this section are reproduced, with permission, from [18], ©2016, Elsevier.

and θ ∈ R pθ is the vector of constant unknown parameters. Let θ̂ ∈ R pθ be an esti-


mate of the unknown parameter vector θ . The following development assumes that
an adaptive system identifier that satisfies conditions detailed in Assumption 4.1
is available. For completeness, a concurrent learning-based system identifier that
satisfies Assumption 4.1 is presented in Appendix A.2.3.
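For intuition, identifiers of this kind are typically driven by prediction errors computed from a history stack of recorded state, control, and state-derivative triplets. The Python sketch below is a simplified illustration under that assumption; the gain names and the exact form of the update law in Appendix A.2.3 may differ:

```python
import numpy as np

def cl_identifier_update(theta_hat, history, Y, g, Gamma_theta, k_theta):
    """Concurrent learning-style gradient contribution to the update of θ̂,
    using recorded (x_j, u_j, ẋ_j) triplets: each triplet yields the
    prediction error ẋ_j − g(x_j)u_j − Y(x_j)θ̂, which is linear in θ̂."""
    update = np.zeros_like(theta_hat)
    for x_j, u_j, xdot_j in history:
        error = xdot_j - g(x_j) @ u_j - Y(x_j) @ theta_hat
        update += Y(x_j).T @ error
    return Gamma_theta @ (k_theta * update)
```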

Assumption 4.1 A compact set Θ ⊂ R^pθ such that θ ∈ Θ is known a priori. The estimates θ̂ : R≥t0 → R^pθ are updated based on a switched update law of the form

θ̂˙(t) = fθs(θ̂(t), t),    (4.2)

θ̂(t0) = θ̂0 ∈ Θ, where s ∈ N denotes the switching index and {fθs : R^pθ × R≥0 → R^pθ}_{s∈N} denotes a family of continuously differentiable functions. The dynamics of the parameter estimation error θ̃ : R≥t0 → R^pθ, defined as θ̃(t) ≜ θ − θ̂(t), can be expressed as θ̃˙(t) = −fθs(θ − θ̃(t), t). Furthermore, there exists a continuously differentiable function Vθ : R^pθ × R≥0 → R≥0 that satisfies

v̲θ(‖θ̃‖) ≤ Vθ(θ̃, t) ≤ v̄θ(‖θ̃‖),    (4.3)

∇θ̃Vθ(θ̃, t)(−fθs(θ − θ̃, t)) + ∇tVθ(θ̃, t) ≤ −K‖θ̃‖² + D‖θ̃‖,    (4.4)

for all s ∈ N, t ∈ R≥t0, and θ̃ ∈ R^pθ, where v̲θ, v̄θ : R≥0 → R≥0 are class K functions, K ∈ R>0 is an adjustable parameter, and D ∈ R>0 is a positive constant.

The subsequent analysis in Sect. 4.3.4 indicates that when a system identifier that satisfies Assumption 4.1 is employed to facilitate online optimal control, the ratio D/K needs to be sufficiently small to establish set-point regulation and convergence to optimality. Using an estimate θ̂, the Bellman error in (2.3) can be approximated by δ̂ : R^{n+2L+p} → R as

δ̂(x, Ŵc, Ŵa, θ̂) ≜ ∇xV̂(x, Ŵc)(Y(x)θ̂ + g(x)û(x, Ŵa)) + r(x, û(x, Ŵa)).    (4.5)

In the following, the approximate Bellman error in (4.5) is used to obtain an approx-
imate solution to the Hamilton–Jacobi–Bellman equation in (1.14).

4.3.2 Value Function Approximation

Approximations to the optimal value function V ∗ and the optimal policy u ∗ are
designed based on neural network-based representations. Given any compact set
χ ⊂ Rn and a positive constant  ∈ R, the universal function approximation property
of neural networks can be exploited to represent the optimal value function V ∗

as V*(x) = W^Tσ(x) + ε(x), for all x ∈ χ, where W ∈ R^L is the ideal weight matrix bounded above by a known positive constant W̄ in the sense that ‖W‖ ≤ W̄, σ : Rn → R^L is a continuously differentiable nonlinear activation function such that σ(0) = 0 and σ′(0) = 0, L ∈ N is the number of neurons, and ε : Rn → R is the function reconstruction error such that sup_{x∈χ}|ε(x)| ≤ ε̄ and sup_{x∈χ}‖∇xε(x)‖ ≤ ε̄. Based on the neural network representation of the value function, a neural network representation of the optimal controller is derived as u*(x) = −(1/2)R⁻¹g^T(x)(∇xσ^T(x)W + ∇xε^T(x)). The neural network approximations V̂ : Rn × R^L → R and û : Rn × R^L → Rm are defined as

V̂(x, Ŵc) ≜ Ŵc^Tσ(x),
û(x, Ŵa) ≜ −(1/2)R⁻¹g^T(x)∇xσ^T(x)Ŵa,    (4.6)

where Ŵc ∈ R L and Ŵa ∈ R L are the estimates of W . The use of two sets of weights
to estimate the same set of ideal weights is motivated by the stability analysis and
the fact that it enables a formulation of the Bellman error that is linear in the critic
weight estimates Ŵc , enabling a least-squares-based adaptive update law.
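As a concrete illustration of (4.6), the following Python sketch evaluates V̂ and û for an example polynomial basis; the basis shown is an illustrative choice, not a prescribed one:

```python
import numpy as np

def sigma(x):
    # Example polynomial basis σ(x) = [x1^2, x1 x2, x2^2]^T (L = 3, n = 2)
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

def grad_sigma(x):
    # ∇x σ(x): L x n Jacobian of the basis
    return np.array([[2 * x[0], 0.0],
                     [x[1], x[0]],
                     [0.0, 2 * x[1]]])

def V_hat(x, W_c):
    # V̂(x, Ŵc) = Ŵc^T σ(x)
    return W_c @ sigma(x)

def u_hat(x, W_a, g, R):
    # û(x, Ŵa) = -(1/2) R^{-1} g^T(x) ∇σ^T(x) Ŵa
    return -0.5 * np.linalg.solve(R, g(x).T @ grad_sigma(x).T @ W_a)
```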

4.3.3 Simulation of Experience Via Bellman Error Extrapolation

In traditional reinforcement learning-based algorithms, the value function estimate


and the policy estimate are updated based on observed data. The use of observed
data to learn the value function naturally leads to a sufficient exploration condi-
tion which demands sufficient richness in the observed data. In stochastic systems,
this is achieved using a randomized stationary policy (cf. [1, 35, 36]), whereas in
deterministic systems, a probing noise is added to the derived control law (cf. [2, 6,
37–39]).
The technique developed in this result implements simulation of experience in a model-based reinforcement learning scheme by using Yθ̂ as an estimate of the uncertain drift dynamics f to extrapolate the approximate Bellman error to a predefined set of points {xi ∈ Rn | i = 1, . . . , N} in the state-space. In the following, δ̂t : R≥t0 → R denotes the approximate Bellman error in (4.5) evaluated along the trajectories of (1.9), (4.2), (4.7), and (4.9) as δ̂t(t) ≜ δ̂(x(t), Ŵc(t), Ŵa(t), θ̂(t)), and δ̂ti : R≥t0 → R denotes the approximate Bellman error extrapolated to the points {xi ∈ Rn | i = 1, . . . , N} along the trajectories of (4.2), (4.7), and (4.9) as δ̂ti ≜ δ̂(xi, Ŵc(t), Ŵa(t), θ̂(t)).



A least-squares update law for the critic weights is designed based on the subsequent stability analysis as

Ŵ˙c(t) = −ηc1 Γ(t) (ω(t)/ρ(t)) δ̂t(t) − (ηc2/N) Γ(t) Σ_{i=1}^{N} (ωi(t)/ρi(t)) δ̂ti(t),    (4.7)

Γ˙(t) = ( βΓ(t) − ηc1 (Γ(t)ω(t)ω^T(t)Γ(t))/ρ²(t) ) 1_{‖Γ‖≤Γ̄},    (4.8)

where Γ : R≥t0 → R^{L×L} is a time-varying least-squares gain matrix, ‖Γ(t0)‖ ≤ Γ̄, ω(t) ≜ ∇xσ(x(t))(Y(x(t))θ̂(t) + g(x(t))û(x(t), Ŵa(t))), ωi(t) ≜ ∇xσ(xi)(Y(xi)θ̂(t) + g(xi)û(xi, Ŵa(t))), ρ(t) ≜ 1 + νω^T(t)Γ(t)ω(t), and ρi(t) ≜ 1 + νωi^T(t)Γ(t)ωi(t), ν ∈ R is a constant positive normalization gain, Γ̄ > 0 ∈ R is a saturation constant, β > 0 ∈ R is a constant forgetting factor, and ηc1, ηc2 > 0 ∈ R are constant adaptation gains.
The actor weights are updated based on the subsequent stability analysis as

Ŵ˙a(t) = −ηa1(Ŵa(t) − Ŵc(t)) − ηa2Ŵa(t) + (ηc1 Gσ^T(t) Ŵa(t) ω^T(t))/(4ρ(t)) Ŵc(t)
       + Σ_{i=1}^{N} (ηc2 Gσi^T Ŵa(t) ωi^T(t))/(4Nρi(t)) Ŵc(t),    (4.9)

where ηa1, ηa2 ∈ R are positive constant adaptation gains, Gσ(t) ≜ ∇xσ(x(t)) g(x(t)) R⁻¹ g^T(x(t)) ∇xσ^T(x(t)), and Gσi ≜ σi′ gi R⁻¹ gi^T σi′^T ∈ R^{L×L}, where gi ≜ g(xi) and σi′ ≜ ∇xσ(xi).
The update law in (4.7) ensures that the adaptation gain matrix is bounded such that

Γ̲ ≤ ‖Γ(t)‖ ≤ Γ̄, ∀t ∈ R≥t0.    (4.10)

Using the weight estimates Ŵa, the controller for the system in (1.9) is designed as

u(t) = û(x(t), Ŵa(t)).    (4.11)
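For illustration, one forward-Euler evaluation of the update laws (4.7)–(4.9) can be sketched as follows in Python, assuming the quadratic local cost r(x, u) = xᵀQx + uᵀRu and the linear-in-the-parameters forms defined above. The variable names, the discretization, and the use of the Frobenius norm in the gain-saturation check are simplifying assumptions rather than the original implementation:

```python
import numpy as np

def weight_update_step(x, points, W_c, W_a, Gamma, theta_hat, Y, g, grad_sigma,
                       Q, R, eta_c1, eta_c2, eta_a1, eta_a2, nu, beta, Gamma_bar, dt):
    """One forward-Euler step of the critic law (4.7), gain law (4.8),
    and actor law (4.9), using Bellman errors extrapolated to `points`."""
    def regressor(z):
        u = -0.5 * np.linalg.solve(R, g(z).T @ grad_sigma(z).T @ W_a)
        omega = grad_sigma(z) @ (Y(z) @ theta_hat + g(z) @ u)
        rho = 1.0 + nu * omega @ Gamma @ omega
        delta = W_c @ omega + z @ Q @ z + u @ R @ u   # approximate Bellman error
        return omega, rho, delta

    omega, rho, delta = regressor(x)
    G_sigma = grad_sigma(x) @ g(x) @ np.linalg.solve(R, g(x).T) @ grad_sigma(x).T
    W_c_dot = -eta_c1 * Gamma @ omega * delta / rho
    W_a_dot = (-eta_a1 * (W_a - W_c) - eta_a2 * W_a
               + eta_c1 * G_sigma.T @ W_a * (omega @ W_c) / (4 * rho))
    N = len(points)
    for xi in points:
        om_i, rho_i, delta_i = regressor(xi)
        G_sigma_i = grad_sigma(xi) @ g(xi) @ np.linalg.solve(R, g(xi).T) @ grad_sigma(xi).T
        W_c_dot += -(eta_c2 / N) * Gamma @ om_i * delta_i / rho_i
        W_a_dot += (eta_c2 / (4 * N * rho_i)) * G_sigma_i.T @ W_a * (om_i @ W_c)
    Gamma_dot = (beta * Gamma - eta_c1 * Gamma @ np.outer(omega, omega) @ Gamma / rho ** 2) \
                if np.linalg.norm(Gamma) <= Gamma_bar else np.zeros_like(Gamma)
    return W_c + dt * W_c_dot, W_a + dt * W_a_dot, Gamma + dt * Gamma_dot
```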

The following rank condition facilitates the subsequent stability analysis.

Assumption 4.2 There exists a finite set of fixed points {xi ∈ Rn | i = 1, . . . , N} such that, ∀t ∈ R≥t0,

0 < c̲ ≜ (1/N) inf_{t∈R≥t0} λmin{ Σ_{i=1}^{N} ωi(t)ωi^T(t)/ρi(t) }.    (4.12)

Since the rank condition in (4.12) depends on the estimates θ̂ and Ŵa , it is generally
impossible to guarantee a priori. However, unlike the persistence of excitation condi-
tion in previous results such as [2, 6, 37–39], the condition in (4.12) can be verified
online at each time t. Furthermore, the condition in (4.12) can be heuristically met by
collecting redundant data (i.e., by selecting more points than the number of neurons
by choosing N ≫ L).
The update law in (4.7) is fundamentally different from the concurrent
learning adaptive update in results such as [30, 31] in the sense that the points
{xi ∈ Rn | i = 1, . . . , N } are selected a priori based on information about the desired
behavior of the system. Given the system dynamics, or an estimate of the system
dynamics, the approximate Bellman error can be extrapolated to any desired point
in the state-space, whereas the prediction error, which is used as a metric in adaptive
control, can only be evaluated at observed data points along the state trajectory.
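Since the extrapolation points are chosen by the designer, the quantity in (4.12) can be monitored directly during operation. A minimal sketch of such a check, with ωi and ρi supplied as callables, is:

```python
import numpy as np

def rank_condition_margin(points, omega_of, rho_of):
    """Smallest eigenvalue of (1/N) * sum_i ω_i ω_i^T / ρ_i; the rank
    condition (4.12) requires this to stay above a positive constant."""
    N = len(points)
    M = sum(np.outer(omega_of(xi), omega_of(xi)) / rho_of(xi) for xi in points) / N
    return np.linalg.eigvalsh(M).min()
```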

4.3.4 Stability Analysis

To facilitate the subsequent stability analysis, the approximate Bellman error is


expressed in terms of the weight estimation errors W̃c ≜ W − Ŵc and W̃a ≜ W −
Ŵa . Subtracting (1.16) from (4.5), an unmeasurable form of the instantaneous
Bellman error can be expressed as

δ̂t = −ω^T W̃c − W^T∇xσ Y θ̃ + (1/4) W̃a^T Gσ W̃a + (1/4) Gε − ∇xε f + (1/2) W^T∇xσ G ∇xε^T,    (4.13)

where G ≜ gR⁻¹g^T ∈ R^{n×n} and Gε ≜ ∇xε G ∇xε^T ∈ R. Similarly, the approximate Bellman error evaluated at the sampled states {xi | i = 1, . . . , N} can be expressed as

δ̂ti = −ωi^T W̃c + (1/4) W̃a^T Gσi W̃a − W^Tσi′ Yi θ̃ + Δi,    (4.14)

where Yi = Y(xi), εi′ = ∇xε(xi), fi = f(xi), Gi ≜ giR⁻¹gi^T ∈ R^{n×n}, Gεi ≜ εi′ Gi εi′^T ∈ R, and Δi ≜ (1/2) W^Tσi′ Gi εi′^T + (1/4) Gεi − εi′ fi ∈ R is a constant.
On any compact set χ ⊂ Rn the function Y is Lipschitz continuous, and hence, there exists a positive constant LY ∈ R such that

‖Y(x)‖ ≤ LY ‖x‖, ∀x ∈ χ.    (4.15)

In (4.15), the Lipschitz property is exploited for clarity of exposition. The bound in (4.15) can be easily generalized to ‖Y(x)‖ ≤ LY(‖x‖)‖x‖, where LY : R → R is a positive, non-decreasing, and radially unbounded function.

Using (4.10), the normalized regressor ω/ρ can be bounded as

sup_{t∈R≥t0} ‖ω(t)/ρ(t)‖ ≤ 1/(2√(νΓ̲)).    (4.16)
For brevity of notation the following positive constants are defined:

ϑ1 ≜ (ηc1 LY ‖θ‖)/(4√(νΓ̲)),    ϑ2 ≜ (ηc2/(4N√(νΓ̲))) Σ_{i=1}^{N} ‖σi′ Yi‖‖W‖,

ϑ3 ≜ (LY ηc1 ‖W‖‖∇xσ‖)/(4√(νΓ̲)),    ϑ4 ≜ (1/4)‖Gε‖,

ϑ5 ≜ (ηc1/(8√(νΓ̲))) ‖2W^T∇xσ G ∇xε^T + Gε‖ + Σ_{i=1}^{N} ‖ηc2 ωi Δi/(Nρi)‖,

ϑ6 ≜ ‖(1/2) W^T Gσ + (1/2) ∇xε G ∇xσ^T‖ + ϑ7‖W‖² + ηa2‖W‖,

ϑ7 ≜ (ηc1 ‖Gσ‖)/(8√(νΓ̲)) + Σ_{i=1}^{N} (ηc2 ‖Gσi‖)/(8N√(νΓ̲)),    q ≜ λmin{Q},

vl ≜ (1/2) min{ q/2, (ηc2 c̲)/3, (ηa1 + 2ηa2)/6, K/4 },

ι ≜ (3ϑ5²)/(4ηc2 c̲) + (3ϑ6²)/(2(ηa1 + 2ηa2)) + D²/(2K) + ϑ4,    (4.17)

where (·) ≜ sup_{x∈χ} ‖(·)‖.


 T
Let Z : R≥t0 → Rn+2L+ p be defined as Z (t)  x T (t) , W̃cT (t) , W̃aT (t) , θ̃ T (t) ,
where x (·), W̃c (·), W̃a (·), and θ̃ (·) denote the solutions of the differential equa-
tions in (1.9), (4.2), (4.7), and (4.9), respectively, with appropriate initial conditions.
The sufficient conditions for ultimate boundedness of Z (·) are derived based on the
subsequent stability analysis as

(ηa1 + 2ηa2)/6 > ((2ζ2 + 1)/(2ζ2)) ϑ7‖W‖,
K/4 > (ϑ2 + ζ1ζ3ϑ3 Z̄)/ζ1,
ηc2/3 > (ζ2ϑ7‖W‖ + ηa1 + 2(ϑ1 + ζ1ϑ2 + (ϑ3/ζ3)Z̄))/(2c̲),
q/2 > ϑ1,    (4.18)
    
where Z̄ ≜ v̲⁻¹(v̄(max{‖Z(t0)‖, √(ι/vl)})), ζ1, ζ2, ζ3 ∈ R are known positive adjustable constants, and v̲ and v̄ are subsequently defined class K functions. The
Lipschitz constants in (4.15) and the neural network function approximation errors
depend on the underlying compact set; hence, given a bound on the initial condition
Z (t0 ) for the concatenated state Z (·), a compact set that contains the concatenated
state trajectory needs to be established before adaptation gains satisfying the con-
ditions in (4.18) can be selected. Based on the subsequent stability analysis, an
algorithm to compute the required compact set, denoted by Z ⊂ R2n+2L+ p , is devel-
oped in Appendix A.2.1. Since the constants ι and vl depend on L Y only through the
products LY ε̄ and LY/ζ3, Algorithm A.2 ensures that

√(ι/vl) ≤ (1/2) diam(Z),    (4.19)

where diam(Z) denotes the diameter of the set Z defined as diam(Z) ≜ sup{‖x − y‖ | x, y ∈ Z}. The main result of this section can now be stated as fol-
lows.
Theorem 4.3 Provided Assumptions 4.1 and 4.2 hold and gains q, ηc2 , ηa2 , and K
are selected large enough using Algorithm A.2, the controller in (4.11) along with
the adaptive update laws in (4.7) and (4.9) ensure that x(·), W̃c(·), and W̃a(·)
are uniformly ultimately bounded.
Proof Let VL : R^{n+2L+p} × R≥0 → R≥0 be a continuously differentiable positive definite candidate Lyapunov function defined as

VL(Z, t) ≜ V*(x) + (1/2) W̃c^T Γ⁻¹(t) W̃c + (1/2) W̃a^T W̃a + Vθ(θ̃, t),    (4.20)

where V* is the optimal value function and Vθ was introduced in Assumption 4.1. Using the fact that V* is positive definite, (4.3), (4.10), and [40, Lemma 4.3] yield

v̲(‖Z‖) ≤ VL(Z, t) ≤ v̄(‖Z‖),    (4.21)

for all t ∈ R≥t0 and for all Z ∈ R^{n+2L+p}, where v̲, v̄ : R≥0 → R≥0 are class K functions.

Provided the gains are selected according to Algorithm A.2, substituting for the
approximate Bellman errors from (4.13) and (4.14), using the bounds in (4.15) and
(4.16), and using Young’s inequality, the time derivative of (4.20) evaluated along
the trajectory Z (·) can be upper-bounded as

∇ZVL(Z, t) h(Z, t) + ∇tVL(Z, t) ≤ −vl ‖Z‖²,    (4.22)

for all ‖Z‖ ≥ √(ι/vl) > 0, Z ∈ Z, and t ∈ R≥t0, where h : R^{n+2L+p} × R≥t0 → R^{n+2L+p}
is a concatenation of the vector fields in (1.9), (4.2), (4.7), and (4.9). Since Vθ is
a common Lyapunov function for the switched subsystem in (4.2), and the terms
introduced by the update law (4.8) do not contribute to the bound in (4.22), VL is a
common Lyapunov function for the complete error system.
Using (4.19), (4.21), and (4.22), [40, Theorem 4.18] can be invoked to conclude that Z(·) is uniformly ultimately bounded in the sense that lim sup_{t→∞} ‖Z(t)‖ ≤ v̲⁻¹(v̄(√(ι/vl))). Furthermore, the concatenated state trajectories are bounded such that ‖Z(t)‖ ≤ Z̄, ∀t ∈ R≥t0. Since the estimates Ŵa(·) approximate the ideal weights W, the policy û approximates the optimal policy u*. □

4.3.5 Simulation

This section presents two simulations to demonstrate the performance and the appli-
cability of the developed technique. First, the developed technique is used to generate an approximate solution to an optimal control problem that has a known analytical solution. Based on the
known solution, an exact polynomial basis is used for value function approximation.
The second simulation demonstrates the applicability of the developed technique in
the case where the analytical solution, and hence, an exact basis for value function
approximation is not known. In this case, since the optimal solution is unknown,
the optimal trajectories obtained using the developed technique are compared with
optimal trajectories obtained through numerical optimal control techniques.
Problem With a Known Basis
The performance of the developed controller is demonstrated by simulating a non-
linear control-affine system with a two dimensional state x = [x1 , x2 ]T . The system
dynamics are described by (1.9), where
f = [ x1  x2  0  0 ;  0  0  x1  x2(1 − (cos(2x1) + 2)²) ] [a, b, c, d]^T,
g = [0, cos(2x1) + 2]^T.    (4.23)

In (4.23), a, b, c, d ∈ R are unknown constant parameters. The parameters are selected


as a = −1, b = 1, c = −0.5, and d = −0.5. The control objective is to minimize
the cost in (1.10) while regulating the system state to the origin. The origin is an
unstable equilibrium point of the unforced system ẋ = f (x). The weighting matrices
in the cost function are selected as Q = I2 and R = 1. The optimal value function
and optimal control for the system in (4.23) are given by V*(x) = (1/2)x1² + x2² and
u ∗ (x) = −(cos(2x1 ) + 2)x2 , respectively (cf. [6]).
Thirty data points are recorded using a singular value maximizing algorithm (cf.
[31]) for the concurrent learning-based adaptive update law in (A.18). The state
derivative at each recorded data point is computed using a fifth order Savitzky–Golay
smoothing filter (cf. [41]).
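Such derivative estimates can be produced with a standard smoothing-filter routine, for example SciPy's Savitzky–Golay filter; the window length and sampling period below are placeholder values:

```python
import numpy as np
from scipy.signal import savgol_filter

# x_hist: recorded states, shape (num_samples, n), sampled every dt seconds
def estimate_state_derivatives(x_hist, dt, window_length=11, polyorder=5):
    """Fifth-order Savitzky-Golay differentiation of recorded state data."""
    return savgol_filter(x_hist, window_length=window_length,
                         polyorder=polyorder, deriv=1, delta=dt, axis=0)
```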
The basis function σ : R² → R³ for value function approximation is selected as σ(x) = [x1², x1x2, x2²]^T. Based on the analytical solution, the ideal weights are W =
2 2

[0.5, 0, 1]T . The data points for the concurrent learning-based update law in (4.7)
are selected to be on a 5 × 5 grid around the origin. The learning gains are selected
as ηc1 = 1, ηc2 = 15, ηa1 = 100, ηa2 = 0.1, and ν = 0.005 and gains for the system
identifier developed in Appendix A.2.3 are selected as k x = 10I2 , Γθ = 20I4 , and
kθ = 30. The actor and the critic weight estimates are initialized using a stabilizing
set of initial weights as Ŵc (0) = Ŵa (0) = [1, 1, 1]T and the least-squares gain is
initialized as Γ (0) = 100I3 . The initial condition for the system state is selected as
x (0) = [−1, −1]T , the state estimates x̂ are initialized to be zero, the parameter
estimates θ̂ are initialized to be one, and the data stack for concurrent learning is
recorded online.
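For this example, the regression matrix implied by (4.23) and a 5 × 5 grid of Bellman-error extrapolation points can be constructed as in the following sketch; the ±1 grid extent is an assumption for illustration, since the exact extents are not specified above:

```python
import numpy as np

def Y(x):
    # Regression matrix from (4.23): f(x) = Y(x) [a, b, c, d]^T
    x1, x2 = x
    return np.array([[x1, x2, 0.0, 0.0],
                     [0.0, 0.0, x1, x2 * (1 - (np.cos(2 * x1) + 2) ** 2)]])

# 5 x 5 grid of extrapolation points around the origin (extent assumed)
grid = np.linspace(-1.0, 1.0, 5)
extrapolation_points = [np.array([x1, x2]) for x1 in grid for x2 in grid]
```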
Figures 4.2, 4.3, 4.4, 4.5 and 4.6 demonstrate that the system state is regulated
to the origin, the unknown parameters in the drift dynamics are identified, and the
value function and the actor weights converge to their true values. Furthermore,
unlike previous results, a probing signal to ensure persistence of excitation is not
required. Figures 4.7 and 4.8 demonstrate the satisfaction of Assumptions 4.2 and
A.2, respectively.

Fig. 4.2 System state


trajectories generated using
the developed technique
(reproduced with permission
from [18], ©2016, Elsevier)

Fig. 4.3 Control trajectory


generated using the
developed technique
(reproduced with permission
from [18], ©2016, Elsevier)

Fig. 4.4 Critic weight


estimates generated using the
developed technique, and
compared to the analytical
solution (reproduced with
permission from [18],
©2016, Elsevier)


Fig. 4.5 Actor weight


estimates generated using the
developed technique, and
compared to the analytical
solution (reproduced with
permission from [18],
©2016, Elsevier)

Fig. 4.6 Estimates of the


unknown plant parameters
generated using the
developed technique, and
compared to the ideal values
(reproduced with permission
from [18], ©2016, Elsevier)

Fig. 4.7 Satisfaction of


Assumptions 4.2 and A.2 for
the simulation with known
basis (reproduced with
permission from [18],
©2016, Elsevier)

Fig. 4.8 Satisfaction of


Assumptions 4.2 and A.2 for
the simulation with known
basis (reproduced with
permission from [18],
©2016, Elsevier)
114 4 Model-Based Reinforcement Learning for Approximate Optimal Control

Problem with an Unknown Basis


To demonstrate the applicability of the developed controller, a nonlinear control-
affine system with a four dimensional state x = [x1 , x2 , x3 , x4 ]T is simulated. The
system dynamics are described by
⎡ ⎤ ⎡ ⎤
x3 ⎡ ⎤ f d1
⎢ 0, 0, 0, 0
x4   ⎥ ⎢ f d2 ⎥
f =⎢

⎥ + ⎣ 0, 0, 0, 0 ⎦ ⎢
⎦ ⎣
⎥,
x   f s1 ⎦
−M −1 Vm 3 M −1 , M −1 D
x4 f s2
    T
g = 0, 0 T , 0, 0 T , M −1 T . (4.24)

In (4.24), D  diag [x3 , x4 , tanh (x3 ) , tanh (x4 )] and  the matrices M, Vm , Fd ,
p1 + 2 p3 c2 , p2 + p3 c2 f d1 , 0
Fs ∈ R are defined as M 
2×2
, Fd  , Vm 
p2 + p3 c2 , p2 0, f d2
   
− p3 s2 x4 , − p3 s2 (x3 + x4 ) f s1 tanh (x3 ) , 0
, and Fs  , where
p 3 s2 x 3 , 0 0, f s2 tanh (x3 )
c2 = cos (x2 ), s2 = sin (x2 ), p1 = 3.473, p2 = 0.196, and p3 = 0.242. The posi-
tive constants f d1 , f d2 , f s1 , f s2 ∈ R are the unknown parameters. The parameters are
selected as f d1 = 5.3, f d2 = 1.1, f s1 = 8.45, and f s2 = 2.35. The control objective is
to minimize the cost in (1.10) with Q = diag ([10, 10, 1, 1]) and R = diag ([1, 1])
while regulating the system state to the origin. The origin is a marginally stable equi-
librium point of the unforced system ẋ = f (x).
The basis
 function σ : R4 → R10 for value function approximation  is selected as
σ (x) = x1 x3 , x2 x4 , x3 x2 , x4 x1 , x1 x2 , x4 x3 , x12 , x22 , x32 , x42 . The data points for
the concurrent learning-based update law in (4.7) are selected to be on a 3 × 3 ×
3 × 3 grid around the origin, and the actor weights are updated using a projection-
based update law. The learning gains are selected as ηc1 = 1, ηc2 = 30, ηa1 = 0.1,
and ν = 0.0005. The gains for the system identifier developed in Appendix A.2.3
are selected as k x = 10I4 , Γθ = diag([90, 50, 160, 50]), and kθ = 1.1. The least-
squares gain is initialized as Γ (0) = 1000I10 and the policy and the critic weight esti-
mates are initialized as Ŵc (0) = Ŵa (0) = [5, 5, 0, 0, 0, 0, 25, 0, 2 , 2]T . The
initial condition for the system state is selected as x (0) = [1, 1, 0, 0]T , the state
estimates x̂ are initialized to be zero, the parameter estimates θ̂ are initialized to be
one, and the data stack for concurrent learning is recorded online.
Figures 4.9, 4.10, 4.11, 4.12, 4.13 and 4.14 demonstrates that the system state is
regulated to the origin, the unknown parameters in the drift dynamics are identified,
and the value function and the actor weights converge. Figures 4.15 and 4.16 demon-
strate the satisfaction of Assumptions 4.2 and A.2, respectively. The value function
and the actor weights converge to the following values.

Ŵc∗ = Ŵa∗ = [24.7, 1.19, 2.25, 2.67, 1.18, 0.93, 44.34, 11.31, 3.81 , 0.10]T . (4.25)
4.3 Online Approximate Regulation 115

Fig. 4.9 State trajectories


generated using the
developed technique
(reproduced with permission
from [18], ©2016, Elsevier)

Fig. 4.10 Control


trajectories generated using
the developed technique
(reproduced with permission
from [18], ©2016, Elsevier)

Fig. 4.11 Critic weights


generated using the
developed technique
(reproduced with permission
from [18], ©2016, Elsevier)
116 4 Model-Based Reinforcement Learning for Approximate Optimal Control

Fig. 4.12 Actor weights


generated using the
developed technique
(reproduced with permission
from [18], ©2016, Elsevier)

Fig. 4.13 Drift parameter


estimates generated using the
developed technique
compared to the actual drift
parameters represented by
dashed lines (reproduced
with permission from [18],
©2016, Elsevier)

Fig. 4.14 State trajectories


generated using feedback
policy û ∗ (x) compared to a
numerical optimal solution
(reproduced with permission
from [18], ©2016, Elsevier)
4.3 Online Approximate Regulation 117

Fig. 4.15 Satisfaction of


Assumption A.2 for the
simulation with unknown
basis (reproduced with
permission from [18],
©2016, Elsevier)

Fig. 4.16 Satisfaction of


Assumption 4.2 for the
simulation with unknown
basis (reproduced with
permission from [18],
©2016, Elsevier)

Since the true values of the critic weights are unknown, the weights in (4.25)
can not be compared to their true values. However, a measure of proximity of
the weights in (4.25) to the ideal weights W can be obtained by comparing the
system trajectories resulting from applying the feedback control policy û ∗ (x) =
− 21 R −1 g T (x) ∇x σ T (x) Ŵa∗ to the system, against numerically computed optimal
system trajectories. Figures 4.14 and 4.17 indicate that the weights in (4.25) generate
state and control trajectories that closely match the numerically computed optimal
trajectories. The numerical optimal solution is obtained using an infinite-horizon
Gauss-pseudospectral method (cf. [42]) using 45 collocation points.
118 4 Model-Based Reinforcement Learning for Approximate Optimal Control

Fig. 4.17 Control Control Trajectory


1
trajectories generated using
feedback policy û ∗ (x)
compared to a numerical 0
optimal solution
(reproduced with permission −1
from [18], ©2016, Elsevier)

u(t)
−2

−3
u1 − Proposed

−4 u2 − Proposed
u1 − Numerical
u − Numerical
2
−5
0 5 10 15 20 25 30
Time (s)

4.4 Extension to Trajectory Tracking2

This section provides an approximate online adaptive solution to the infinite-


horizon optimal tracking problem for control-affine continuous-time nonlinear sys-
tems with unknown drift dynamics. To relax the persistence of excitation condition,
model-based reinforcement learning is implemented using a concurrent learning-
based system identifier to simulate experience by evaluating the Bellman error over
unexplored areas of the state-space. Tracking of the desired trajectory and conver-
gence of the developed policy to a neighborhood of the optimal policy are established
via Lyapunov-based stability analysis. Simulation results demonstrate the effective-
ness of the developed technique.

4.4.1 Problem Formulation and Exact Solution

The control objective in this section is to optimally track a time-varying desired


trajectory xd : R≥t0 → Rn . Using the transformation in Sect. 3.3.1, the error system
dynamics can be expressed in the autonomous form

ζ̇ (t) = F (ζ (t)) + G (ζ (t)) μ (t) .

The control objective is to simultaneously synthesize and utilize a control signal μ (·)
to minimize the cost functional in (3.49) under the dynamic constraint in (3.47), while
tracking the desired trajectory, where the local cost rt : R2n × Rm → R≥0 is defined
as

2 Parts of the text in this section are reproduced, with permission, from [22], ©2017, IEEE.
4.4 Extension to Trajectory Tracking 119

rt (ζ, μ)  Q (e) + μT Rμ,

where R ∈ Rm×m is a positive definite symmetric matrix of constants, and Q : Rn →


R is a continuous positive definite function.
Assuming that an optimal policy exists, the optimal policy can be characterized
in terms of the value function V ∗ : R2n → R defined as
∞
V ∗ (ζ )  min rt (ζ (τ ; t, ζ, μ (·)) , μ (τ )) dτ,
μ(τ )∈U |τ ∈R≥t
t

where U ⊆ Rm is the action-space. Assuming that a minimizing policy exists and


that V ∗ is continuously differentiable, a closed-form solution for the optimal policy
T
can be obtained as [43] μ∗ (ζ ) = − 21 R −1 G T (ζ ) ∇ζ V ∗ (ζ ) . The optimal policy
and the optimal value function satisfy the Hamilton–Jacobi–Bellman equation [43]

∇ ζ V ∗ (ζ ) F (ζ ) + G (ζ ) μ∗ (ζ ) + Q t (ζ ) + μ∗T (ζ ) Rμ∗ (ζ ) = 0, (4.26)

with the
 initial condition
 V ∗ (0) = 0, where the function Q t : R2n → R is defined
T T
as Q t e , xd
T
= Q (e) , ∀e, xd ∈ Rn .

4.4.2 Bellman Error


 
The optimal value function V ∗ is replaced by a parametric estimate V̂ ζ, Ŵc and the
 
optimal policy μ∗ by a parametric estimate μ̂ ζ, Ŵa , where Ŵc ∈ R L and Ŵa ∈ R L
denote vectors of estimates of the ideal parameters. Substituting the estimates V̂ and
μ̂ for V ∗ and μ∗ in the Hamilton–Jacobi–Bellman equation, respectively, yields the
Bellman error
     
δ ζ, Ŵc , Ŵa = Q t (ζ ) + μ̂T ζ, Ŵa R μ̂ ζ, Ŵa
   
+ ∇ζ V̂ ζ, Ŵc F (ζ ) + G (ζ ) μ̂ ζ, Ŵa . (4.27)

Similar to the development in Sect. 4.3, the Bellman error is extrapolated to unex-
plored areas of the state-space using a system identifier. In this section, a neural
network-based system identifier is employed.
120 4 Model-Based Reinforcement Learning for Approximate Optimal Control

4.4.3 System Identification

On a compact set C ⊂ Rn the function f is represented using a neural network as


 T
f (x) = θ T σ f Y T x1 + θ (x), where x1  1, x T ∈ Rn+1 , θ ∈ R p+1×n and Y ∈
Rn+1× p denote the constant unknown output-layer and hidden-layer neural network
weights, σ f : R p → R p+1 denotes a bounded neural network basis function, θ :
Rn → Rn denotes the function reconstruction error, and p ∈ N denotes the number of
neural network neurons. Let θ and θ be known constants such that θ F ≤ θ < ∞,
supx∈C θ (x) ≤ θ , and supx∈C ∇x θ (x) ≤ θ . Using an estimate θ̂ ∈ R p+1×n
of the weight matrix θ , the function f  can be approximated by the function fˆ :
R2n × R p+1×n → Rn defined as fˆ ζ, θ̂  θ̂ T σθ (ζ ), where σθ : R2n → R p+1 is
  T 
defined as σθ (ζ ) = σ f Y T 1, e T + xdT .
An estimator for online identification of the drift dynamics is developed as

x̂˙ (t) = θ̂ T (t) σθ (ζ (t)) + g (x (t)) u (t) + k x̃ (t) , (4.28)

where x̃  x − x̂, and k ∈ R is a positive constant learning gain.


Assumption 4.4 ([30]) A history stack containing recorded state-action pairs
 M  M
x j , u j j=1 along with numerically computed state derivatives x̄˙ j j=1 that satisfies
 !

λmin M
x̄˙ j − ẋ j
< d, ∀ j is available a priori, where
j=1 σ f j σ f j = σθ > 0,
T
  T 
σ f j  σ f Y T 1, x Tj , d ∈ R is a known positive constant, and ẋ j = f x j +

g xj u j.
A priori availability of the history stack is used for ease of exposition, and isnot neces-
sary. Provided the system states are exciting over a finite time interval t ∈ t0 , t0 + t
(versus t ∈ [t0 , ∞) as in traditional persistence of excitation-based approaches) the
history stack can also be recorded  online.  The controller developed in [44] can be
used over the time interval t ∈ t0 , t0 + t while the history stack is being recorded,
and the controller developed in this result can be used thereafter. The use of two
different controllers results in a switched system with one switching event. Since
there is only one switching event, the stability of the switched system follows from
the stability of the individual subsystems.
The weight estimates θ̂ are updated using the concurrent learning-based update
law


M  T
θ̂˙ (t) = Γθ σ f Y T x1 (t) x̃ T (t) + kθ Γθ σ f j x̄˙ j − g j u j − θ̂ T (t) σ f j ,
j=1
(4.29)
where kθ ∈ R is a constant positive concurrent learning gain and Γθ ∈ R p+1× p+1 is a
constant, diagonal, and positive definite adaptation gain matrix. Using the identifier,
the Bellman error in (4.27) can be approximated as
4.4 Extension to Trajectory Tracking 121
     
δ̂ ζ, θ̂ , Ŵc , Ŵa = Q t (ζ ) + μ̂T ζ, Ŵa R μ̂ ζ, Ŵa
     
+ ∇ ζ V̂ ζ, Ŵa Fθ ζ, θ̂ + F1 (ζ ) + G (ζ ) μ̂ ζ, Ŵa ,
(4.30)

where
⎡  ⎤
  + 0n×1
⎢ θ̂ σθ (ζ ) − g (x) g (xd ) θ̂ σθ
T T

Fθ ζ, θ̂  ⎣ xd ⎦,
0n×1
 T T
F1 (ζ )  −h d + g (e + xd ) g + (xd ) h d , h dT .

4.4.4 Value Function Approximation

Since V ∗ and μ∗ are functions of the augmented state ζ , the minimization problem
stated in Sect. 4.4.1 is intractable. To obtain a finite-dimensional minimization prob-
lem, the optimal value function is represented over any compact operating domain
C ⊂ R2n using a neural network as V ∗ (ζ ) = W T σ (ζ ) +  (ζ ), where W ∈ R L
denotes a vector of unknown neural network weights, σ : R2n → R L denotes a
bounded neural network basis function,  : R2n → R denotes the function recon-
struction error, and L ∈ N denotes the number of neural network neurons. Using
Property 2.3, for any compact set C ⊂ R2n , there exist constant ideal weights W and
known positive
constants

W , and  such that W ≤ W < ∞, supζ ∈C  (ζ ) ≤ ,
and supζ ∈C
∇ζ  (ζ )
≤  [45].

A neural network
representation of the optimal policy is obtained as μ (ζ ) =
− 21 R −1 G T (ζ ) ∇ζ σ T (ζ ) W + ∇ζ  T (ζ ) . Using estimates Ŵc and Ŵa for the ideal
weights W , the optimal value function and the optimal policy are approximated as
    1
V̂ ζ, Ŵc  ŴcT σ (ζ ) , μ̂ ζ, Ŵa  − R −1 G T (ζ ) ∇ζ σ T (ζ ) Ŵa . (4.31)
2
The optimal control problem is thus reformulated
 as 
the need to find
"  a set of weights
"
" "
Ŵc and Ŵa online, to minimize the error Ê θ̂ Ŵc , Ŵa  supζ ∈χ "δ̂ ζ, θ̂ , Ŵc , Ŵa ",
for a given θ̂ , while simultaneously improving θ̂ using (4.29), and ensuring stability
of the system using the control law
   
u (t) = μ̂ ζ (t) , Ŵa (t) + û d ζ (t) , θ̂ (t) , (4.32)
     T 
where û d ζ, θ̂  gd+ (t) h d (t) − θ̂ T σθd (t) and σθd (t)  σθ 01×n xdT (t) .
The error between u d and û d is included in the stability analysis based on the fact that
122 4 Model-Based Reinforcement Learning for Approximate Optimal Control

the error trajectories generated by the system ė (t) = f (x (t)) + g (x (t)) u (t) −
ẋd (t) under the controller in (4.32) are identical to the error trajectories gener-
ated
 by the system ζ̇ (t) = F (ζ (t)) + G (ζ (t)) μ (t) under the control law μ (t) =
μ̂ ζ (t) , Ŵa (t) + gd+ (t) θ̃ T (t) σθd (t) + gd+ (t) θd (t), where θd (t)  θ (xd (t)).

4.4.5 Simulation of Experience

Since computation of the supremum in Ê θ̂ is intractable in general, simulation of


experience is implemented by minimizing a squared sum of Bellman errors over
finitely many points in the state-space. The following assumption facilitates the
aforementioned approximation.

 set ofpoints{ζi ∈ C | iT !


Assumption 4.5 ([14]) There exists a finite = 1, . . . , N } and
N ωi ωi
a constant c ∈ R such that 0 < c  N inf t∈R≥t0 λmin
1
i=1 ρi , where ρi 
    
1 + νωiT Γ ωi ∈ R, and ωi (t)  ∇ζ σ (ζi ) Fθ ζi , θ̂ (t) + F1 (ζi ) + G (ζi ) μ̂ ζi , Ŵa (t) .

Using Assumption 4.5, simulation of experience is implemented by the weight update


laws

ω (t)  ωi (t)
N
Ŵ˙ c (t) = −kc1 Γ
kc2
δ̂t (t) − Γ (t) δ̂ti (t) , (4.33)
ρ (t) N ρ (t)
i=1 i

ω (t) ω T (t)
Γ˙ (t) = βΓ (t) − kc1 Γ (t) Γ (t) 1{ Γ ≤Γ } , Γ (t0 ) ≤ Γ ,
ρ 2 (t)
(4.34)
 
˙
Ŵa (t) = −ka1 Ŵa (t) − Ŵc (t) − ka2 Ŵa (t)
 
kc1 G σT (t) Ŵa (t) ω T (t)  kc2 G σT i (t) Ŵa (t) ωiT (t)
N
+ + Ŵc (t) ,
4ρ (t) i=1
4Nρi (t)
(4.35)
    
where ω (t)  ∇ ζ σ (ζ (t)) Fθ ζ (t) , θ̂ (t) + F1 (ζ (t)) + G (ζ (t)) μ̂ ζ (t) , Ŵa (t) , Γ ∈
R L×L is the least-squares gain matrix, Γ ∈ R denotes a positive saturation constant,
β ∈ R denotes a constant forgetting factor, kc1 , kc2 , ka1 , ka2 ∈ R denote constant pos-
itive adaptation gains, G σ (t)  ∇ζ σ (ζ (t)) G (ζ (t)) R −1 G T (ζ (t)) ∇ζ σ T (ζ (t)),
and ρ (t)  1 + νω T (t) Γ (t) ω (t), where ν ∈ R is a positive normalization con-
stant. In (4.33)–(4.35) and in the subsequent development, the notation ξi , is
defined as ξi  ξ (ζi , ·) for any function  ξ (ζ, ·) and the instantaneous  Bellman
errors δ̂t and δ̂ti are given by δ̂t (t) = δ̂ ζ (t) , Ŵc (t) , Ŵa (t) , θ̂ (t) and δ̂ti (t) =
 
δ̂ ζi , Ŵc (t) , Ŵa (t) , θ̂ (t) .
4.4 Extension to Trajectory Tracking 123

4.4.6 Stability Analysis

If the state penalty function Q t is positive definite, then the optimal value function V ∗
is positive definite (cf. [2, 6, 46]), and serves as a Lyapunov function for the concate-
nated system under the optimal control policy μ∗ . As a result, V ∗ is used as a candidate
Lyapunov function for the closed-loop system under the policy μ̂. In this case, the
function Q t , and hence, the function V ∗ are positive semidefinite. Therefore, the func-
tion V ∗ is not a valid candidate Lyapunov function. However, the results in [44] can
be used to show that a nonautonomous form of the optimal value function denoted by
∗ ∗ ∗
 T T T 
Vt : R × R → R, defined as Vt (e, t) = V
n
e , xd (t) , ∀e ∈ Rn , t ∈ R, is

positive definite and decrescent. Hence, Vt (0, t) = 0, ∀t ∈ R and there exist class
K functions v : R → R and v : R → R such that v ( e ) ≤ Vt∗ (e, t) ≤ v ( e ),
∀e ∈ Rn and ∀t ∈ R.
To facilitate the stability analysis, a concatenated state Z ∈ R2n+2L+n( p+1) is
defined as
   T T
Z  e T W̃cT W̃aT x̃ T vec θ̃ ,

and a candidate Lyapunov function is defined as

1 1 1 1  
VL (Z , t)  Vt∗ (e, t) + W̃cT Γ −1 W̃c + W̃aT W̃a + x̃ T x̃ + tr θ̃ T Γθ−1 θ̃ .
2 2 2 2
(4.36)
The saturated least-squares update
law in
(4.34) ensures that there exist positive
constants γ , γ ∈ R such that γ ≤
Γ −1 (t)
≤ γ , ∀t ∈ R. Using the bounds on Γ
    T   
and Vt∗ and the fact that tr θ̃ T Γθ−1 θ̃ = vec θ̃ Γθ−1 ⊗ I p+1 vec θ̃ , the
candidate Lyapunov function in (4.36) can be bounded as

vl ( Z ) ≤ VL (Z , t) ≤ vl ( Z ) , (4.37)

∀Z ∈ R2n+2L+n( p+1) and ∀t ∈ R, where vl : R → R and vl : R → R are class K


functions.
Given any compact set χ ⊂ R2n+2L+n( p+1) containing an open ball of radius ρ ∈ R
centered at the origin, a positive constant ι ∈ R is defined as
2
2
)W G σ
(kc1 +kc2√ (W T G σ +
G r σ
T )
3 16 νΓ
+ 4
+ ka2 W
2
ι
(ka1 + ka2 )




 2
3
W T σ
Ggd+
+

Ggd+
σg + kθ dθ
+
4kθ σθ
124 4 Model-Based Reinforcement Learning for Approximate Optimal Control

2
(kc1 + kc2 )2  θ 2

+ + +

Ggd+ θd

4νΓ kc2 c 2k




1

1

+
G
+
W σ Gr 




T

T

+
W T σ
Ggd+ θd
, (4.38)
2 2
T
where G r  G R −1 G T and G   
G r 
. Let vl : R → R be a class K function
such that

q ( e ) kc2 c




2 (ka1 + ka2 )

2 k k θ σθ

 
2

vl ( Z ) ≤ +
W̃c
+
W̃a
+ x̃ 2 +
vec θ̃
. (4.39)
2 8 6 4 6

The sufficient gain conditions used in the subsequent Theorem 4.6 are

vl−1 (ι) < vl −1 vl (ρ) , (4.40)
2 2
3 (kc2 + kc1 )2 W σ
σg 2
kc2 c > , (4.41)
4kθ σθ νΓ
 2
3 (kc1 + kc2 ) W G σ 3 (kc1 + kc2 ) W G σ
(ka1 + ka2 ) > √ + √ + ka1 ,
8 νΓ ckc2 8 νΓ
(4.42)

where σg  σθ +
ggd+
σθd . In (4.38)–(4.42), the notation  denotes
sup y∈χl  (y) for any function  : Rl → R, where l ∈ N and χl denotes the pro-
jection of χ onto Rl .
The sufficient condition in (4.40) requires the set χ to be large enough based on the
constant ι. Since the neural network approximation errors depend on the compact set
χ , in general, the constant ι increases with the size of the set χ for a fixed number of
neural network neurons. However, for a fixed set χ , the constant ι can be reduced by
reducing function reconstruction errors (i.e., by increasing number of neural network
neurons) and by increasing the learning gains provided σθ is large enough. Hence a
sufficient number of neural network neurons and extrapolation points are required
to satisfy the condition in (4.40).
Theorem 4.6 Provided Assumptions 4.4 and 4.5 hold and L, c, and σ θ are large
enough to satisfy the sufficient gain conditions in (4.40)–(4.42), the controller in
(4.32) with the weight update laws (4.33)–(4.35), and the identifier in (4.28) with
the weight update law (4.29), ensure that the system states remain bounded, the
tracking error is ultimately bounded, and that the control policy μ̂ converges to a
neighborhood around the optimal control policy μ∗ .

Proof Using (3.47) and the fact that V̇t∗ (e (t) , t) = V̇ ∗ (ζ (t)) , ∀t ∈ R, the time-
derivative of the candidate Lyapunov function in (4.36) is
4.4 Extension to Trajectory Tracking 125


V̇L = ∇ζ V ∗ F + Gμ∗ − W̃cT Γ −1 Ŵ˙ c − W̃cT Γ −1 Γ˙ Γ −1 W̃c
1
2
T ˙
− W̃ Ŵ + V̇ + ∇ V Gμ − ∇ V Gμ∗ .
a a 0

ζ

ζ (4.43)

Under sufficient gain conditions in (4.40)–(4.42), using (4.26), (4.28)–(4.31), and


the update laws in (4.33)–(4.35) the expression in (4.43) can be bounded as

V̇L ≤ −vl ( Z ) , ∀ Z ≥ vl−1 (ι) , ∀Z ∈ χ , (4.44)

where ι is a positive constant and χ ⊂ R2n+2L+n( p+1) is a compact set. Using (4.37)
can be invoked to conclude that every trajectory Z (·)
and (4.44), [40, Theorem 4.18]
satisfying Z (t0 ) ≤ vl −1 vl (ρ) , where ρ is a positive constant, is bounded for

all t ∈ R and satisfies lim supt→∞ Z (t) ≤ vl −1 vl vl−1 (ι) . The ultimate bound
can be decreased by increasing learning gains and the number of neurons in the neural
networks, provided the points in the history stack and Bellman error extrapolation
can be selected to increase σθ and c. 

4.4.7 Simulation

Linear System
In the following, the developed technique is applied to solve a linear quadratic track-
ing problem. A linear system is selected because the optimal solution to the linear
quadratic tracking problem can be computed analytically and compared against the
solution generated by the developed technique. To demonstrate  convergence
 tothe
−1 1 0
ideal weights, the following linear system is simulated: ẋ = x+ u.
−0.5 0.5 1
The control objective is to followa desired trajectory,   which is the solution to the
−1 1 0
initial value problem ẋd = x , xd (0) = , while ensuring convergence
−2 1 d 2

of the estimated policy μ̂ to a neighborhood # ∞ ofT the policy μ , such that the control
law

μ (t) = μ (ζ (t)) minimizes the cost 0 e (t) diag ([10, 10]) e (t) + μ2 (t) dt.
Since the system is linear, the optimal value function is known to be quadratic.
Hence, the value function is approximated using the quadratic basis σ (ζ ) =
[e12 , e22 , e1 e2 , e1 xd1 , e2 xd2 , e1 xd2 , e2 xd1 ]T , and the unknown drift dynamics is
approximated using the linear basis σθ (x) = [x1 , x2 ]T .
The linear system and the linear desired dynamics result in a linear time-invariant
concatenated system. Since the system is linear, the optimal tracking problem reduces
to an optimal regulation problem, which can be solved using the resulting Algebraic
Riccati Equation. The optimal value function is given by V ∗ (ζ ) = ζ T Pζ ζ , where
the matrix Pζ is given by
126 4 Model-Based Reinforcement Learning for Approximate Optimal Control
⎡ ⎤
4.43 0.67 02×2
Pζ = ⎣0.67 2.91 ⎦.
02×2 02×2

Using the matrix Pζ , the ideal weighs corresponding to the selected basis can be
computed as W = [4.43, 1.35, 0, 0, 2.91, 0, 0]T .
Figures 4.18, 4.19, 4.20 and 4.21 demonstrate that the controller remains bounded,
the tracking error goes to zero, and the weight estimates Ŵc , Ŵa and θ̂ go to their
true values, establishing convergence of the approximate policy to the optimal policy.
Figures 4.22 and 4.23 demonstrate satisfaction of the rank conditions in Assumptions
4.4 and 4.5, respectively.

Fig. 4.18 State trajectories


generated using the
developed method for the
linear system (reproduced
with permission from [22],
©2017, IEEE)

Fig. 4.19 Control trajectory


generated using the
developed method for the
linear system (reproduced
with permission from [22],
©2017, IEEE).eps
4.4 Extension to Trajectory Tracking 127

Fig. 4.20 Actor weight


estimates generated using the
developed method for the
linear system. Dashed lines
denote the ideal values
(reproduced with permission
from [22], ©2017, IEEE)

Fig. 4.21 Drift parameter


estimates generated using the
developed method for the
linear system. Dashed lines
denote the ideal values
(reproduced with permission
from [22], ©2017, IEEE)

Fig. 4.22 Evolution of


minimum singular value of
the concurrent learning
history stack for the linear
system (reproduced with
permission from [22],
©2017, IEEE)
128 4 Model-Based Reinforcement Learning for Approximate Optimal Control

Fig. 4.23 Satisfaction of


Assumptions 4.4 and 4.5 for
the linear system
(reproduced with permission
from [22], ©2017, IEEE)

Nonlinear System
Effectiveness of the developed technique is demonstrated via numerical simulation
on the nonlinear system ẋ = f (x) + (cos (2x) + 2)2 u, x ∈ R, where f (x) = x 2
is assumed to be unknown. The control objective is to track the desired trajectory
xd (t) = 2 sin (2t), while ensuring convergence of the estimated μ̂ to a neigh-
policy
#∞
borhood of the policy μ∗ , such that μ∗ minimizes the cost 0 10e2 (t) + 10 μ
1 2

(t)) dt.
Since the system is linear, the optimal value function is known to be
quadratic. Hence, the value function is approximated using the quadratic basis
σ (ζ ) = [e12 , e22 , e1 e2 , e1 xd1 , e2 xd2 , e1 xd2 , e2 xd1 ]T , and the unknown drift dynam-
ics is approximated using the linear basis σθ (x) = [x1 , x2 ]T .
The value function is approximated  using the polynomial basis
σ (ζ ) = e2 , e4 , e6 , e2 xd2 , e4 xd2, e6 xd2 , and f (x) is approximated using the polyno-
mial basis σθ (x) = x, x 2 , x 3 . The higher order terms in σ (ζ ) are used to com-
pensate for the higher order terms in σθ .
The initial values for the state and the state estimate are selected to be x (0) = 1
and x̂ (0) = 0, respectively, and the initial values for the neural network weights for
the value function, the policy, and the drift dynamics are selected to be zero. Since
the selected system exhibits a finite escape time for any initial condition other than
zero, the initial policy μ̂ (ζ, 06×1 ) is not stabilizing. The stabilization demonstrated
in Fig. 4.24 is achieved via fast simultaneous learning of the system dynamics and
the value function.
Figures 4.24, 4.25, 4.26 and 4.27 demonstrate that the controller remains bounded,
the tracking error is regulated to the origin, and the neural network weights converge.
Figures 4.28 and 4.29 demonstrate satisfaction of the rank conditions in Assump-
tions 4.4 and 4.5, respectively. The rank condition on the history stack in Assumption
4.4 is ensured by selecting points using a singular value maximization algorithm [30],
and the condition in Assumption 4.5 is met via oversampling (i.e., by selecting fifty-
six points to identify six unknown parameters). Unlike previous results that rely on
4.4 Extension to Trajectory Tracking 129

Fig. 4.24 State trajectory


for the nonlinear system
generated using the
developed method

Fig. 4.25 Control trajectory


for the nonlinear system
generated using the
developed method

Fig. 4.26 Critic weight


estimates for the nonlinear
system generated using the
developed method
130 4 Model-Based Reinforcement Learning for Approximate Optimal Control

Fig. 4.27 Drift parameter


estimates for the nonlinear
system generated using the
developed method. Dashed
lines represent true values of
the drift parameters

Fig. 4.28 Evolution of


minimum singular value of
the concurrent learning
history stack for the
nonlinear system

Fig. 4.29 Evolution of


minimum singular value of
the Bellman error
extrapolation matrix for the
nonlinear system
4.4 Extension to Trajectory Tracking 131

the addition of an ad-hoc probing signal to satisfy the persistence of excitation condi-
tion, this result ensures sufficient exploration via Bellman error extrapolation. Since
an analytical solution to the nonlinear optimal tracking problem is not available, the
value function and the actor weights can not be compared against the ideal values.
However, a comparison between the learned weights and the optimal weights is pos-
sible for linear systems provided the dynamics h d of the desired trajectory are also
linear.
The learning gains, the basis functions for the neural networks, and the points for
Bellman error extrapolation are selected using a trial and error approach. Alterna-
tively, global optimization methods such as a genetic algorithm, or simulation-based
methods such as a Monte-Carlo simulation can be used to tune the gains.

4.5 N-Player Nonzero-Sum Differential Games3

This section presents a concurrent learning-based actor-critic-identifier architecture


to obtain an approximate feedback-Nash equilibrium solution to an infinite-horizon
N -player nonzero-sum differential game. The solution is obtained online for a non-
linear control-affine system with uncertain linearly parameterized drift dynamics. It
is shown that under a condition milder than persistence of excitation, uniformly ulti-
mately bounded convergence of the developed control policies to the feedback-Nash
equilibrium policies can be established. Simulation results are presented to demon-
strate the performance of the developed technique without an added excitation signal.
Consider the class of control-affine multi-input systems introduced in (3.82). In
this section, the unknown function f : Rn → Rn is assumed to be linearly param-
eterizable, the functions gi : Rn → Rn×m i are assumed to be known and uniformly
bounded, the functions f and gi are assumed to be locally Lipschitz, and f (0) = 0.
Recall the Bellman errors introduced in (3.88):
⎛ ⎞
    
N  
δi x, Ŵci , Ŵa1 , . . . , Ŵa N = ∇x V̂i x, Ŵci ⎝ f (x) + g j (x) û j x, Ŵa j ⎠
j=1
    
+ ri x, û 1 x, Ŵa1 , . . . , û N x, Ŵa N . (4.45)

To obtain a feedback-Nash equilibrium solution to the N −player differential


game, the estimates Ŵci and Ŵai are recursively improved to drive the Bellman
errors to zero. The computation of the Bellman errors in (4.45) requires knowledge
of the drift dynamics f . To eliminate this requirement and to facilitate simulation of
experience, a concurrent learning-based system identifier that satisfies Assumption
4.1 is developed in the following section.

3 Parts of the text in this section are reproduced, with permission, from [16], ©2014, IEEE.
132 4 Model-Based Reinforcement Learning for Approximate Optimal Control

4.5.1 System Identification

Let f (x) = Y (x) θ be the linear parametrization of the drift dynamics, where Y :
Rn → Rn× pθ denotes the locally Lipschitz regression matrix, and θ ∈ R pθ denotes
the vector of constant unknown drift parameters. The system identifier is designed
as

N
x̂˙ (t) = Y (x (t)) θ̂ (t) + gi (x (t)) u i (t) + k x x̃ (t) , (4.46)
i=1

where the measurable state estimation error x̃ is defined as x̃ (t)  x (t) − x̂ (t),
k x ∈ Rn×n is a constant positive definite diagonal observer gain matrix, and θ̂ :
R≥t0 → R pθ denotes the vector of estimates of the unknown drift parameters. In
traditional adaptive systems, the estimates are updated to minimize the instanta-
neous state estimation error, and convergence of parameter estimates to their true
values can be established under a restrictive persistence of excitation condition. In
this result, a concurrent learning-based data-driven approach is developed to relax
the persistence of excitation condition to a weaker, verifiable rank condition.
Assumption
 4.7 ([30, 31]) A history stack Hid containing state-action tuples
x j , û i j | i = 1, . . . , N , j = 1, . . . , Mθ recorded along the trajectories of (3.82)
that satisfies
⎛ ⎞
Mθ
rank ⎝ Y jT Y j ⎠ = pθ , (4.47)
j=1


is available a priori, where Y j = Y x j , and pθ denotes the number of unknown
parameters in the drift dynamics.
To facilitate the concurrent learning-based parameter update,
numerical
methods are
used to compute the state derivative ẋ j corresponding to x j , û j . The update law
for the drift parameter estimates is designed as
 

Mθ 
N
θ̂˙ (t) = Γθ Y T (x (t)) x̃ (t) + Γθ kθ Y jT ẋ j − gi j u i j − Y j θ̂ (t) , (4.48)
j=1 i=1


where gi j  gi x j , Γθ ∈ R p× p is a constant positive definite adaptation gain matrix,
and kθ ∈ R is a constant positive concurrent learning gain. The update law in (4.48)
requires the unmeasurable state derivative ẋ j . Since the state derivative at a past
recorded point on the state trajectory is required, past and future recorded values of
the state can be used along with accurate noncausal smoothing techniques to obtain
good estimates of ẋ j . In the presence of derivative estimation errors, the parameter
estimation errors can be shown to be uniformly ultimately bounded, where the size
of the ultimate bound depends on the error in the derivative estimate [31].
4.5 N-Player Nonzero-Sum Differential Games 133

To incorporate new information, the history stack is updated with new data. Thus,
the resulting closed-loop system is a switched system. To ensure the stability of the
switched system, the history stack is updated using a singular value maximizing
algorithm (cf. [31]). Using (3.82), the state derivative can be expressed as


N
ẋ j − gi j u i j = Y j θ,
i=1

and hence, the update law in (4.48) yields the parameter estimation error dynamics
⎛ ⎞
˙θ̃ (t) = −Γ Y T (x (t)) x̃ (t) − Γ k ⎝ Y T Y ⎠ θ̃ (t) ,

θ θ θ j j (4.49)
j=1

where θ̃ (t)  θ − θ̂ (t) denotes the drift parameter estimation error. The closed-loop
dynamics of the state estimation error are given by

x̃˙ (t) = Y (x (t)) θ̃ (t) − k x x̃ (t) . (4.50)

4.5.2 Model-Based Reinforcement Learning

Based on (4.46), measurable approximations to the Bellman errors in (4.45) are


developed as
⎛ ⎞
    
N  
δ̂i x, Ŵci , Ŵa1 , . . . , Ŵa N , θ̂ = ∇x V̂i x, Ŵci ⎝Y (x) θ̂ + g j (x) û j x, Ŵa j ⎠
j=1
    
+ ri x, û 1 x, Ŵa1 , . . . , û N x, Ŵa N . (4.51)

The following assumption, which in general is weaker than the persistence of excita-
tion assumption, is required for convergence of the concurrent learning-based critic
weight estimates.

Assumption 4.8 For each i ∈ {1, . . . , N }, there exists a finite set of Mxi points
xi j ∈ Rn | j = 1, . . . , Mxi such that for all t ∈ R≥0 ,
( )
Mxi ωi (t)(ωi (t))
k k T
inf t∈R≥0 λmin k=1 ρi (t)
k

c xi  > 0, (4.52)
Mxi

where c xi ∈ R is a positive constant. In (4.52), ωik (t)  σi


ik Y ik θ̂ (t) − 21 Nj=1 σi
ik
 T
G ikj σ j
ik Ŵa j (t), where σ j
ik  ∇x σ j (xik ), the superscript (·)ik indicates that the
134 4 Model-Based Reinforcement Learning for Approximate Optimal Control
T
term is evaluated at x = xik , and ρik  1 + νi ωik Γi ωik , where νi ∈ R>0 is the
normalization gain, and Γi : R≥t0 → R L i ×L i is the adaptation gain matrix.
The concurrent learning-based least-squares update law for the critic weights is
designed as

kc2i Γi (t) 
Mxi
ωi (t) ωik (t) k
Ŵ˙ ci (t) = −kc1i Γi (t) δ̂ti (t) − δ̂ (t) ,
ρi (t) Mxi ρ k (t) ti
k=1 i

ωi (t) ωiT (t)
˙
Γi (t) = βi Γi (t) − kc1i Γi (t) Γi (t) 1{ Γi ≤Γ i } , Γi (t0 ) ≤ Γ i ,
ρi2 (t)
(4.53)

where
 
δ̂ti (t) = δ̂i x (t) , Ŵci (t) , Ŵa1 (t) , . . . , Ŵa N (t) , θ̂ (t) ,
 
δ̂tik (t) = δ̂i xik , Ŵci (t) , Ŵa1 (t) , . . . , Ŵa N (t) , θ̂ (t) ,

ωi (t)  ∇x σi (x (t)) Y (x (t)) θ̂ (t) − 21 Nj=1 ∇x σi (x (t)) G j (x (t)) ∇x σ jT (x (t))


Ŵa j (t), ρi (t)  1 + νi ωiT (t) Γi (t) ωi (t), Γ i > 0 ∈ R is the saturation constant,
βi ∈ R is the constant positive forgetting factor, and kc1i , kc2i ∈ R are constant pos-
itive adaptation gains.
The actor weight update laws are designed based on the subsequent stability
analysis as
 
Ŵ˙ ai (t) = −ka1i Ŵai (t) − Ŵci (t) − ka2i Ŵai (t)

1
N
ω T (t) T
+ kc1i ∇x σ j (x (t)) G i j (x (t)) ∇x σ jT (x (t)) ŴaTj (t) i Ŵ (t)
4 j=1 ρi (t) ci
k T
1
Mxi 
kc2i
ik ik
ik T T
N
ω (t)
+ σ j Gi j σ j Ŵa j (t) i k ŴciT (t) , (4.54)
4 k=1 j=1 Mxi ρi (t)

where ka1i , ka2i ∈ R are positive constant adaptation gains. The forgetting factor
βi along with the saturation in the update law for the least-squares gain matrix in
(4.53) ensure (cf. [47]) that the least-squares gain matrix Γi and its inverse is positive
definite and bounded for all i ∈ {1, . . . , N } as

Γ i ≤ Γi (t) ≤ Γ i , ∀t ∈ R≥0 , (4.55)

where Γ i ∈ R is a positive constant, and the normalized regressor is bounded as




ωi



≤ *1 .

ρ
2 νi Γ i
i ∞
4.5 N-Player Nonzero-Sum Differential Games 135

4.5.3 Stability Analysis

Subtracting (3.85) from (4.51), the approximate Bellman error can be expressed in
an unmeasurable form as


N
1
δ̂ti = ωiT Ŵci + x T Q i x + ŴaTj ∇x σ j G i j ∇x σ jT Ŵa j
j=1
4
⎛ ⎞

N 
N
− ⎝x T Q i x + u ∗T ∗ ∗
j Ri j u j + ∇x Vi f + ∇x Vi

g j u ∗j ⎠ .
j=1 j=1

Substituting for V ∗ and u ∗ from (3.113) and using f = Y θ , the approximate Bellman
error can be expressed as


N
1
δ̂ti = ωiT Ŵci + ŴaTj ∇x σ j G i j ∇x σ jT Ŵa j − WiT ∇x σi Y θ − ∇x i Y θ
j=1
4


N
1 T
− W j ∇x σ j G i j ∇x σ jT W j + 2∇x  j G i j ∇x σ jT W j + ∇x  j G i j ∇x  Tj
j=1
4

1  T
N
+ Wi ∇x σi G j ∇x σ jT W j + ∇x i G j ∇x σ jT W j + WiT ∇x σi G j ∇x  Tj
2 j=1

1
N
+ ∇x i G j ∇x  Tj .
2 j=1

Adding and subtracting 41 ŴaTj ∇x σ j G i j ∇x σ jT W j + ωiT Wi yields

1 T
N
δ̂ti = −ωiT W̃ci + W̃ ∇x σ j G i j ∇x σ jT W̃a j − WiT ∇x σi Y θ̃
4 j=1 a j

1  T
N
− Wi ∇x σi G j − W jT ∇x σ j G i j ∇x σ jT W̃a j − ∇x i Y θ + i , (4.56)
2 j=1

 
N N
where i  1
2 j=1 WiT ∇x σi G j − W jT ∇x σ j G i j ∇x  Tj + 1
2 j=1 W jT ∇x σ j
N N
G j ∇x iT + 21 ∇x i G j ∇x  Tj −
j=1 j=1 4 ∇x  j G i j ∇x  j .
1
Similarly, the approxi- T

mate Bellman error evaluated at the selected points can be expressed in an unmea-
surable form as
136 4 Model-Based Reinforcement Learning for Approximate Optimal Control

1  T
ik ik
ik T
N
δ̂tik = −ωikT W̃ci + W̃ σ G σ W̃a j + ik
4 j=1 a j j i j j

1  T
ik ik
ik T
N
− Wi σi G j − W jT σ j
ik G ik
ij σj W̃a j − WiT σi
ik Y ik θ̃ , (4.57)
2 j=1

where the constant ik ∈ R is defined as ik  −i


ik Y ik θ + ik
i . To facilitate the
stability analysis, a candidate Lyapunov function is defined as


N
1  T −1
N
1 T
N
1 1
VL  Vi∗ + W̃ci Γi W̃ci + W̃ai W̃ai + x̃ T x̃ + θ̃ T Γθ−1 θ̃ .
i=1
2 i=1 2 i=1 2 2
(4.58)
Since Vi∗ are positive definite, the bound in (4.55) and [40, Lemma 4.3] can be used
to bound the candidate Lyapunov function as

v ( Z ) ≤ VL (Z , t) ≤ v ( Z ) , (4.59)
 T
L i + pθ
where Z = x T , W̃c1
T
, . . . , W̃cN
T
, W̃a1
T
, . . . , W̃aTN , x̃, θ̃ ∈ R2n+2N i and v, v :
L i + pθ
R≥0 → R≥0 are class K functions. For any compact set Z ⊂ R 2n+2N i , define




1 T 1

ι1  max sup
Wi ∇x σi G j ∇x σ j + ∇x i G j ∇x σ j
,
T
i, j Z ∈Z 2 2


k ω

c1i i
ι2  max sup
3W j ∇x σ j G i j − 2WiT ∇x σi G j ∇x σ jT
i, j Z ∈Z 4ρi
 kc2i ω
M xi k
T
ik ik T
ik ik

ik T


+ i
3W σ
j j G ij − 2W σ
i i G j σ j
,
k=1
4Mxi ρik


1  N
T

ι3  max sup
Wi ∇x σi + ∇x i G j ∇x  Tj
i, j Z ∈Z 2 i, j=1

1 
N



− 2W jT ∇x σ j + ∇x  j G i j ∇x  Tj
,
4 i, j=1
4.5 N-Player Nonzero-Sum Differential Games 137



kc1i L Y  i θ

T

ι4  max sup ∇x σ j G i j ∇x σ j , ι5i  * ,


i, j Z ∈Z 4 νi Γ i

kc1i L Y W i σ i kc2i maxk


σi
ik Y ik
W i
ι6i  * , ι7i  * ,
4 νi Γ i 4 νi Γ i

N
(kc1i + kc2i ) W i ι4
ι8  * , ι9i  ι1 N + (ka2i + ι8 ) W i ,
i=1
8 νi Γ i

kc1i sup Z ∈Z i + kc2i maxk


ik

ι10i  * ,
2 νi Γ i


qi kc2i c xi 2ka1i + ka2i kθ y
vl  min , , kx , , ,
2 4 8 2

N
 
2ι29i ι2
ι + 10i + ι3 , (4.60)
i=1
2ka1i + ka2i kc2i c xi

where qi denotes the minimum eigenvalue of Q i , y denotes the minimum eigenvalue


of M θ T
j=1 Y j Y j , k x denotes the minimum eigenvalue of k x , and the suprema exist
ωi
since ρi is uniformly bounded for all Z , and the functions G i , G i j , σi , and ∇x i are
continuous. In (4.60), L Y ∈ R≥0 denotes the Lipschitz constant such that Y ( ) ≤
L Y  , ∀ ∈ Zn , where Zn denotes the projection of Z on Rn . The sufficient
conditions for uniformly ultimately bounded convergence are derived based on the
subsequent stability analysis as

qi > 2ι5i ,
kc2i c xi > 2ι5i + 2ζ1 ι7i + ι2 ζ2 N + ka1i + 2ζ3 ι6i Z ,
2ι2 N
2ka1i + ka2i > 4ι8 + ,
ζ2
2ι7i ι6i
kθ y > + 2 Z, (4.61)
ζ1 ζ3
   
where Z  v −1 v max Z (t0 ) , vιl and ζ1 , ζ2 , ζ3 ∈ R are known positive
adjustable constants.
138 4 Model-Based Reinforcement Learning for Approximate Optimal Control

Since the neural network function approximation error and the Lipschitz constant
L Y depend on the compact set that contains the state trajectories, the compact set
needs to be established before the gains can be selected using (4.61). Based on
the subsequent stability analysis, an algorithm is developed in Appendix A.2.2 to
compute the required compact set (denoted by Z) based on the initial conditions.
Since the constants ι and vl depend on L Y only through the products L Y  i and L Y ζ3 ,
Algorithm A.3 ensures that
ι 1
≤ diam (Z) , (4.62)
vl 2

where diam (Z) denotes the diameter of the set Z.


Theorem 4.9 Provided Assumptions 4.7–4.8 hold and the control gains satisfy the
sufficient conditions in (4.61), where the constants in (4.60) are computed based on
the compact set Z selected using Algorithm A.3, the system identifier in (4.46) along
with the adaptive update law in (4.48) and the controllers in (3.114) along with
the adaptive update laws in (4.53) and (4.54) ensure that the state x (·), the state
estimation error x̃ (·), the critic weight estimation errors W̃ci (·), and the actor weight
estimation errors W̃ai (·) are uniformly ultimately bounded, resulting in uniformly
ultimately bounded convergence of the policies û i to the feedback-Nash equilibrium
policies u i∗ .

Proof The derivative of the candidate Lyapunov function in (4.58) along the trajec-
tories of (3.82), (4.49), (4.50), (4.53), and (4.54) is given by
⎛ ⎛ ⎞⎞

N 
N  
V̇L = ⎝∇x Vi∗ ⎝ f + g j u j ⎠⎠ + x̃ T Y θ̃ − k x x̃
i=1 j=1
⎛ ⎞  

N
k ω k 
Mxi
ω k
1 T
N
ωi ωiT
T ⎝ c1i i c2i i k⎠ −1
+ W̃ci δ̂ti + δ̂ti − W̃ci βi Γi − kc1i 2 W̃ci
ρi Mxi ρk 2 ρi
i=1 i=1 i i=1
⎛ ⎛ ⎞ ⎞
Mθ N  
T ⎝
+ θ̃ −Y x̃ − kθ ⎝
T
Y j Y j ⎠ θ̃ ⎠ −
T
W̃aiT − ka1i ŴaiT − ŴciT − ka2i ŴaiT
j=1 i=1

 1 kc2i T ωik T
ik ik 
ik T
N Mxi N
1 ωi T
+ kc1i ŴciT Ŵa j ∇x σ j G i j ∇x σ jT + Ŵci k Ŵa j σ j G i j σ j .
4 ρi 4 Mxi ρi
j=1 k=1 j=1
(4.63)

Substituting the unmeasurable forms of the Bellman errors from (4.56) and (4.57)
into (4.63) and using the triangle inequality, the Cauchy–Schwarz inequality, and
Young’s inequality, the Lyapunov derivative in (4.63) can be bounded as
4.5 N-Player Nonzero-Sum Differential Games 139

N 

  kc2i c xi

2 kθ y



2  2ka1i + ka2i


2
N q N
i

V̇ ≤ − x 2 −
W̃ci
− k x x̃ 2 −
θ̃

W̃ai

2 2 2 4
i=1 i=1 i=1
N
 

2
kc2i c xi 1 1

− − ι5i − ζ1 ι7i − ι2 ζ2 N − ka1i − ζ3 ι6i x


W̃ci

2 2 2
i=1
N k y
 







2 

N N
θ ι ι

+ − 7i − 6i x
θ̃i
+ ι9i
W̃ai
+ ι10i
W̃ci

2 ζ1 ζ3
i=1 i=1 i=1
N 
ι2 N

2
 N  
2ka1i + ka2i

qi
+ − ι8 −
W̃ai
+ ι3 − − ι5i x 2 . (4.64)
4 2ζ2 2
i=1 i=1

Provided the sufficient conditions in (4.61) hold and the conditions

kc2i c xi 1 1
> ι5i + ζ1 ι7i + ι2 ζ2 N + ka1i + ζ3 ι6i x ,
2 2 2
kθ y ι7i ι6i
> + x , (4.65)
2 ζ1 ζ3

hold for all Z ∈ Z, completing the squares in (4.64), the bound on the Lyapunov
derivative can be expressed as



2 N 
   2ka1i + ka2i

2
N N
qi kc2i c


V̇ ≤ − x − 2 xi

W̃ci
− k x x̃ −
2

W̃ai

i=1
2 i=1
4 i=1
8
kθ y


θ̃
+ ι
2
ι
≤ −vl Z , ∀ Z > , Z ∈ Z. (4.66)
vl

Using (4.59), (4.62), and (4.66),


 [40,
 Theorem 4.18] can be invoked to conclude
−1 ι
that lim supt→∞ Z (t) ≤ v v vl . Furthermore, the system trajectories are
bounded as Z (t) ≤ Z for all t ∈ R≥0 . Hence, the conditions in (4.61) are sufficient
for the conditions in (4.65) to hold for all t ∈ R≥0 .
The error between the feedback-Nash equilibrium policy and the approximate
policy can be bounded above as







u − û i
≤ 1 Rii gi σ i

+ 
¯

i
W̃ ai
i ,
2

for all i = 1, . . . , N , where gi  supx gi (x) . Since the weights W̃ai are uniformly
ultimately bounded, uniformly ultimately bounded convergence of the approximate
policies to the feedback-Nash equilibrium policies is obtained. 

Remark 4.10 The closed-loop system analyzed using the candidate Lyapunov func-
tion in (4.58) is a switched system. The switching happens when the history stack
140 4 Model-Based Reinforcement Learning for Approximate Optimal Control

is updated and when the least-squares regression matrices Γi reach their saturation
bound. Similar to least-squares-based adaptive control (cf. [47]), (4.58) can be shown
to be a common Lyapunov function for the regression matrix saturation, and the use
of a singular value maximizing algorithm to update the history stack ensures that
(4.58) is a common Lyapunov function for the history stack updates (cf. [31]). Since
(4.58) is a common Lyapunov function, (4.59), (4.62), and (4.66) establish uniformly
ultimately bounded convergence of the switched system.

4.5.4 Simulation

To portray the performance of the developed approach, the concurrent learning-based


adaptive technique is applied to the nonlinear control-affine system

ẋ = f (x) + g1 (x) u 1 + g2 (x) u 2 , (4.67)

where x ∈ R2 , u 1 , u 2 ∈ R, and
⎡ ⎤
x2 − 2x1

f = ⎣ − 21 x1 − x2 + 41 x2 (cos (2x1 ) + 2) ⎦ ,
2
2 2
+ 4 x2 sin 4x1 + 2
1
   
0 0
g1 = , g2 = .
cos (2x1 ) + 2 sin 4x12 + 2

The value function has the structure shown in (3.83) with the weights Q 1 = 2Q 2 =
2I2 and R11 = R12 = 2R21 = 2R22 = 2. The system identification protocol given in
Sect. 4.5.1 and the concurrent learning-based scheme given in Sect. 4.5.2 are imple-
mented simultaneously to provide an approximate online feedback-Nash equilibrium
solution to the given nonzero-sum two-player game.
The control-affine system in (4.67) is selected for this simulation because it is
constructed using the converse Hamilton–Jacobi approach [48] where the analytical
feedback-Nash equilibrium solution to the nonzero-sum game is
⎡ ⎤T ⎡ 2 ⎤ ⎡ ⎤T ⎡ 2 ⎤
0.5 x1 0.25 x1
V1∗ = ⎣ 0 ⎦ ⎣ x1 x2 ⎦ , V2∗ = ⎣ 0 ⎦ ⎣ x1 x2 ⎦ ,
1 x22 0.5 x22

and the feedback-Nash equilibrium control policies for Player 1 and Player 2 are
⎡ ⎤T ⎡ ⎤ ⎡ ⎤T ⎡ ⎤
2x 0 0.5 2x1 0 0.25
1 −1 T ⎣ 1 1
u ∗1 = − R11 g1 x2 x1 ⎦ ⎣ 0 ⎦ , u 2 = − R22 g2 ⎣ x2 x1 ⎦ ⎣ 0 ⎦ .
∗ −1 T
2 0 2x2 1 2 0 2x2 0.5
4.5 N-Player Nonzero-Sum Differential Games 141

Since the analytical solution is available, the performance of the developed method
can be evaluated by comparing the obtained approximate solution against the ana-
lytical solution.
The dynamics are linearly parameterized as f = Y (x) θ , where
⎡ ⎤T
x2 0
⎢ x1 0 ⎥
⎢ ⎥
⎢0 x1 ⎥
Y (x) = ⎢
⎢0 x


⎢ 2 ⎥
⎣ 0 x2 (cos (2x1 ) + 2)2 ⎦
2 2
0 x2 sin 4x1 + 2
 T
is known and the constant vector of parameters θ = 1, −2, − 12 , −1, 14 , − 14 is
assumed to be unknown. The initial guess for θ is selected as θ̂ (t0 ) = 0.5 ∗
[1, 1, 1, 1, 1, 1]T . The system identification gains are selected as k x = 5, Γθ =
diag (20, 20, 100, 100, 60, 60), and kθ = 1.5. A history stack of thirty points is
selected using a singular value maximizing algorithm (cf. [31]) for the concurrent
learning-based update law in (4.48), and the state derivatives are estimated using a
fifth order Savitzky–Golay filter (cf. [41]). Based on the structure of the feedback-
Nash equilibrium value functions, the basis function for value function approximation
is selected as σ = [x12 , x1 x2 , x22 ]T , and the adaptive learning parameters and initial
conditions are shown for both players in Table 4.1. Twenty-five points lying on a
5 × 5 grid around the origin are selected for the concurrent learning-based update
laws in (4.53) and (4.54).
Figures 4.30, 4.31, 4.32 and 4.33 show the rapid convergence of the actor and
critic weights to the approximate feedback-Nash equilibrium values for both players,
resulting in the value functions and control policies

Table 4.1 Approximate dynamic programming learning gains and initial conditions
Learning Gains Initial Conditions
Player 1 Player 2 Player 1 Player 2
v 0.005 0.005 +
Wc (t0 ) [3, 3, 3] T [3, 3, 3]T
kc1 1 1 +
Wa (t0 ) [3, 3, 3] T [3, 3, 3]T
kc2 1.5 1 Γ (t0 ) 100I3 100I3
ka1 10 10 x(t0 ) [1, 1]T [1, 1]T
ka2 0.1 0.1 x̂(t0 ) [0, 0]T [0, 0]T
β 3 3
Γ¯ 10,000 10,000
142 4 Model-Based Reinforcement Learning for Approximate Optimal Control

Fig. 4.30 Player 1 critic


weight estimates. Dashed
lines indicate the ideal values
(reproduced with permission
from [16], ©2014, IEEE)

Fig. 4.31 Player 1 actor


weight estimates. Dashed
lines indicate the ideal values
(reproduced with permission
from [16], ©2014, IEEE)

Fig. 4.32 Player 2 critic


weight estimates. Dashed
lines indicate the ideal values
(reproduced with permission
from [16], ©2014, IEEE)
4.5 N-Player Nonzero-Sum Differential Games 143

Fig. 4.33 Player 2 actor


weight estimates. Dashed
lines indicate the ideal values
(reproduced with permission
from [16], ©2014, IEEE)

Fig. 4.34 Drift parameter


estimates. Dashed lines
indicate the ideal values
(reproduced with permission
from [16], ©2014, IEEE)

⎡ ⎤T ⎡ ⎤T
0.5021 0.2510
V̂1 = ⎣ −0.0159 ⎦ σ, V̂2 = ⎣ −0.0074 ⎦ σ,
0.9942 0.4968
⎡ ⎤T ⎡ ⎤
2x 0 0.4970
1 −1 T ⎣ 1
û 1 = − R11 g1 x2 x1 ⎦ ⎣ −0.0137 ⎦ ,
2 0 2x2 0.9810
⎡ ⎤T ⎡ ⎤
2x1 0 0.2485
1 −1 T ⎣
û 2 = − R22 g2 x2 x1 ⎦ ⎣ −0.0055 ⎦ .
2 0 2x 0.4872
2

Figure 4.34 demonstrates that (without the injection of a persistently exciting signal)
the system identification parameters also approximately converged to the correct
values. The state and control signal trajectories are displayed in Figs. 4.35 and 4.36.
144 4 Model-Based Reinforcement Learning for Approximate Optimal Control

Fig. 4.35 State trajectory


convergence to the origin
(reproduced with permission
from [16], ©2014, IEEE)

Fig. 4.36 Control


trajectories of Player 1 and
Player 2 (reproduced with
permission from [16],
©2014, IEEE)

4.6 Background and Further Reading

Online implementation of reinforcement learning is comparable to adaptive control


(cf., [2, 6, 49–52] and the references therein). In adaptive control, the estimates for
the uncertain parameters in the plant model are updated using the tracking error as
a performance metric; whereas, in online reinforcement learning-based techniques,
estimates for the uncertain parameters in the value function are updated using the
Bellman error as a performance metric. Typically, to establish regulation or tracking,
adaptive control methods do not require the adaptive estimates to convergence to the
true values. However, convergence of the reinforcement learning-based controller
to a neighborhood of the optimal controller requires convergence of the parameter
estimates to a neighborhood of their ideal values.
4.6 Background and Further Reading 145

Results such as [7, 10, 37, 44, 53–56] solve optimal tracking and differential game
problems for linear and nonlinear systems online, where persistence of excitation
of the error states is used to establish convergence. In general, it is impossible to
guarantee persistence of excitation a priori. As a result, a probing signal designed
using trial and error is added to the controller to ensure persistence of excitation.
However, the probing signal is typically not considered in the stability analysis.
Contemporary results on data-driven approximate dynamic programming meth-
ods include methods to solve set-point and output regulation [11, 57–61], trajectory
tracking [56, 62, 63], and differential game [64–69] problems.

References

1. Mehta P, Meyn S (2009) Q-learning and pontryagin’s minimum principle. In: Proceedings of
IEEE conference on decision control, pp 3598–3605
2. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
3. Vrabie D (2010) Online adaptive optimal control for continuous-time systems. PhD thesis,
University of Texas at Arlington
4. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2007) Model-free q-learning designs for linear
discrete-time zero-sum games with application to H∞ control. Automatica 43:473–481
5. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
6. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
7. Vamvoudakis KG, Lewis FL (2011) Multi-player non-zero-sum games: Online adaptive learn-
ing solution of coupled hamilton-jacobi equations. Automatica 47:1556–1569
8. Vamvoudakis KG, Lewis FL, Hudas GR (2012) Multi-agent differential graphical games:
Online adaptive learning solution for synchronization with optimality. Automatica 48(8):1598–
1611
9. Modares H, Lewis FL, Naghibi-Sistani MB (2013) Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans Neural Netw
Learn Syst 24(10):1513–1525
10. Kiumarsi B, Lewis FL, Modares H, Karimpour A, Naghibi-Sistani MB (2014) Reinforcement
Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics.
Automatica 50(4):1167–1175
11. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
12. Modares H, Lewis FL (2014) Optimal tracking control of nonlinear partially-unknown
constrained-input systems using integral reinforcement learning. Automatica 50(7):1780–1792
13. Kamalapurkar R, Walters P, Dixon WE (2013) Concurrent learning-based approximate optimal
regulation. In: Proceedings of IEEE conference on decision control, Florence, IT, pp 6256–6261
14. Kamalapurkar R, Andrews L, Walters P, Dixon WE (2014) Model-based reinforcement learning
for infinite-horizon approximate optimal tracking. In: Proceedings of IEEE conference on
decision control, Los Angeles, CA, pp 5083–5088
15. Kamalapurkar R, Klotz J, Dixon WE (2014) Model-based reinforcement learning for on-line
feedback-Nash equilibrium solution of N-player nonzero-sum differential games. In: Proceed-
ings of the American control conference, pp 3000–3005
146 4 Model-Based Reinforcement Learning for Approximate Optimal Control

16. Kamalapurkar R, Klotz J, Dixon WE (2014) Concurrent learning-based online approximate


feedback Nash equilibrium solution of N-player nonzero-sum differential games. IEEE/CAA
J Autom Sin 1(3):239–247
17. Kamalapurkar R, Rosenfeld JA, Dixon WE (2015) State following (StaF) kernel functions
for function approximation Part II: Adaptive dynamic programming. In: Proceedings of the
American control conference, pp 521–526
18. Kamalapurkar R, Walters P, Dixon WE (2016) Model-based reinforcement learning for approx-
imate optimal regulation. Automatica 64:94–104
19. Kamalapurkar R (2014) Model-based reinforcement learning for online approximate optimal
control. PhD thesis, University of Florida
20. Kamalapurkar R, Rosenfeld J, Dixon WE (2016) Efficient model-based reinforcement learning
for approximate online optimal control. Automatica 74:247–258
21. Kamalapurkar R, Klotz JR, Walters P, Dixon WE (2018) Model-based reinforcement learning
Chapter 5
Differential Graphical Games

5.1 Introduction

Reinforcement learning techniques are valuable not only for optimization but also
for control synthesis in complex systems such as a distributed network of cogni-
tive agents. Combined efforts from multiple autonomous agents can yield tacti-
cal advantages including improved munitions effects, distributed sensing, detection,
and threat response, and distributed communication pipelines. While coordinating
behaviors among autonomous agents is a challenging problem that has received
mainstream focus, unique challenges arise when seeking optimal autonomous col-
laborative behaviors. For example, most collaborative control literature focuses on
centralized approaches that require all nodes to continuously communicate with a
central agent, yielding a heavy communication demand that is subject to failure due
to delays, and missing information. Furthermore, the central agent is required to
carry enough on-board computational resources to process the data and to generate
command signals. These challenges motivate the need to minimize communication
for guidance, navigation and control tasks, and to distribute the computational burden
among the agents. Since all the agents in a network have independent collaborative
or competitive objectives, the resulting optimization problem is a multi-objective
optimization problem.
In this chapter (see also, [1]), the objective is to obtain an online forward-in-
time feedback-Nash equilibrium solution (cf. [2–7]) to an infinite-horizon formation
tracking problem, where each agent desires to follow a mobile leader while the group
maintains a desired formation. The agents try to minimize cost functions that penalize
their own formation tracking errors and their own control efforts.
For multi-agent problems with decentralized objectives, the desired action by an
individual agent depends on the actions and the resulting trajectories of its neighbors;
hence, the error system for each agent is a complex nonautonomous dynamical sys-
tem. Nonautonomous systems, in general, have non-stationary value functions. Since
non-stationary functions are difficult to approximate using parameterized function


approximation schemes such as neural networks, designing approximate optimal


policies for nonautonomous systems is challenging.
Since the external influence from neighbors renders the dynamics of each agent
nonautonomous, optimization in a network of agents presents challenges similar to
optimal tracking problems. Using insights gained from the development in Chap. 4,
this chapter develops a model-based reinforcement learning technique to generate
feedback-Nash equilibrium policies online for agents in a network with coopera-
tive or competitive objectives. In particular, the network of agents is separated into
autonomous subgraphs, and the differential game is solved separately on each sub-
graph.
In addition to control, this chapter also explores applications of differential graph-
ical games in monitoring and intent detection (see also, [8]). Implementing a network
of cooperating agents (e.g., flocks of unmanned air vehicles, teams of ground vehi-
cles) helps to ensure mission completion and provides more advanced tactical capa-
bilities. Networks containing agents enacting decentralized control policies, wherein
only information from neighboring agents is used to internally make decisions, ben-
efit from autonomy: each agent is encoded with a (possibly disaggregated) tactical
mission objective and has no need to maintain contact with a mission coordinator.
However, networked systems must be cognizant of the reliability of neighbors’
influence. Network neighbors may be unreliable due to input disturbances, faulty
dynamics, or network subterfuge, such as cyber-attacks. Existing monitoring results use an agent's state trajectory to judge its performance. An issue with using only a neighbor's state to judge performance is that, owing to nonlinearities in the dynamics, a small deviation in the control may cause a large deviation away from the expected state of the system at the following time step. Thus, if judging only by the trajectory of a dynamical system, minimally deviant behavior may be exaggerated during monitoring, or significantly deviant behavior may not be noticed. Thus, motivation exists
to examine more information than just the state when judging an agent’s behavior.
The intuition behind considering both state errors and control effort is clear upon
recalling that both state errors and control effort are used in common cost functions,
such as that in linear quadratic regulators.
One of the contributions of this chapter is the development of a novel metric, based
on the Bellman error, which provides a condition for determining if a network neigh-
bor with uncertain nonlinear dynamics is behaving near optimally; furthermore, this
monitoring procedure only requires neighbor communication and may be imple-
mented online. The contribution is facilitated by the use of approximate dynamic
programming and concurrent learning to approximately determine how close opti-
mality conditions are to being satisfied.
5.2 Cooperative Formation Tracking Control of Heterogeneous Agents¹

5.2.1 Graph Theory Preliminaries

Consider a set of N autonomous agents moving in the state space Rn . The control
objective is for the agents to maintain a desired formation with respect to a leader.
The state of the leader is denoted by x0 ∈ Rn . The agents are assumed to be on a
network with a fixed communication topology modeled as a static directed graph
(i.e., digraph).
Each agent forms a node in the digraph. The set of all nodes excluding the leader is denoted by N = {1, . . . , N}, and the leader is denoted by node 0. If node i can receive information from node j, then there exists a directed edge from the jth to the ith node of the digraph, denoted by the ordered pair (j, i). Let E denote the set of all edges. Let there be a positive weight a_ij ∈ R associated with each edge (j, i), so that a_ij = 0 if and only if (j, i) ∉ E. The digraph is assumed to have no self-loops (i.e., (i, i) ∉ E, ∀i), which implies a_ii = 0, ∀i. The neighborhood sets of node i are denoted by N_{-i} and N_i, defined as N_{-i} ≜ {j ∈ N | (j, i) ∈ E} and N_i ≜ N_{-i} ∪ {i}.
To streamline the analysis, an adjacency matrix A ∈ R^{N×N} is defined as A ≜ [a_ij | i, j ∈ N], a diagonal pinning gain matrix A_0 ∈ R^{N×N} is defined as A_0 ≜ diag([a_{10}, . . . , a_{N0}]), an in-degree matrix D ∈ R^{N×N} is defined as D ≜ diag(d_i), where d_i ≜ \sum_{j∈N_i} a_ij, and a graph Laplacian matrix L ∈ R^{N×N} is defined as L ≜ D − A. The graph is assumed to have a spanning tree (i.e., given any node i, there exists a directed path from the leader 0 to node i). A node j is said to be an extended neighbor of node i if there exists a directed path from node j to node i. The extended neighborhood set of node i, denoted by S_{-i}, is defined as the set of all extended neighbors of node i. Formally, S_{-i} ≜ {j ∈ N | j ≠ i ∧ ∃κ ≤ N, {j_1, . . . , j_κ} ⊂ N | {(j, j_1), (j_1, j_2), . . . , (j_κ, i)} ⊂ E}. Let S_i ≜ S_{-i} ∪ {i}, and let the edge weights be normalized such that \sum_j a_ij = 1, ∀i ∈ N. Note that the sub-graphs are nested in the sense that S_j ⊆ S_i for all j ∈ S_i.
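To make these graph-theoretic quantities concrete, the following sketch (in Python/NumPy, not part of the original development) builds A, A_0, D, and L for a small illustrative digraph and computes an extended neighborhood set by following directed paths; the edge set, weights, and pinning gains below are assumptions chosen only for illustration.

```python
import numpy as np

N = 5
# Hypothetical edge set: (j, i) means agent i receives information from agent j (1-indexed).
edges = [(1, 2), (2, 1), (3, 4), (3, 5)]
pinned = [1, 3]                                   # agents that receive the leader's state

A = np.zeros((N, N))                              # adjacency matrix, A[i-1, j-1] = a_ij
for j, i in edges:
    A[i - 1, j - 1] = 1.0
A0 = np.zeros((N, N))                             # diagonal pinning gain matrix
for i in pinned:
    A0[i - 1, i - 1] = 1.0
D = np.diag(A.sum(axis=1))                        # in-degree matrix, d_i = sum_j a_ij
L = D - A                                         # graph Laplacian

def extended_neighbors(A, i):
    """S_{-i}: all agents j != i with a directed path from j to agent i (0-indexed)."""
    reached, frontier = set(), {i}
    while frontier:
        new = {j for k in frontier for j in range(A.shape[0])
               if A[k, j] > 0 and j != i and j not in reached}
        reached |= new
        frontier = new
    return reached

S_minus_1 = extended_neighbors(A, 0)              # extended neighbors of agent 1
```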

5.2.2 Problem Formulation

The state xi : R≥t0 → Rn of each agent evolves according to the control-affine


dynamics
ẋi (t) = f i (xi (t)) + gi (xi (t)) u i (t) , (5.1)

where u i : R≥t0 → Rm i denotes the control input, and f i : Rn → Rn and gi : Rn →


Rn×m i are locally Lipschitz continuous functions. The dynamics of the leader are

¹ Parts of the text in this section are reproduced, with permission, from [1], © 2016 IEEE.

assumed to be autonomous (i.e., ẋ0 = f 0 (x0 ), where f 0 : Rn → Rn is a locally


Lipschitz continuous function). The function f 0 and the initial condition x0 (t0 ) are
selected such that the trajectory x0 (·) is uniformly bounded.
The control objective is for the agents to maintain a predetermined formation
(with respect to an inertial reference frame) around the leader while minimizing
their own cost functions. For all i ∈ N , the ith agent is aware of its constant desired
relative position xdi j ∈ Rn with respect to all its neighbors j ∈ N−i , such that the
desired formation is realized when xi (t) − x j (t) → xdi j , ∀i, j ∈ N . The vectors
xdi j are assumed to be fixed in an inertial reference frame (i.e., the final desired
formation is rigid and its motion in an inertial reference frame can be described as
pure translation).
To facilitate the control design, the formation is expressed in terms of a set of
constant vectors {xdi0 ∈ Rn }i∈N where each xdi0 denotes the constant final desired
position of agent i with respect to the leader. The vectors {xdi0 }i∈N are unknown to the
agents not connected to the leader, and the known desired inter agent relative position
can be expressed in terms of {xdi0 }i∈N as xdi j = xdi0 − xd j0 . The control objective
is thus satisfied when xi (t) → xdi0 + x0 (t), ∀i ∈ N . To quantify the objective, a
local neighborhood tracking error signal is defined as

e_i(t) = \sum_{j ∈ \{0\} ∪ N_{-i}} a_{ij} \left( x_i(t) − x_j(t) − x_{dij} \right).  (5.2)

To facilitate the analysis, the error signal in (5.2) is expressed in terms of the unknown leader-relative desired positions as

e_i(t) = \sum_{j ∈ \{0\} ∪ N_{-i}} a_{ij} \left( (x_i(t) − x_{di0}) − (x_j(t) − x_{dj0}) \right).  (5.3)

Stacking the error signals in a vector E(t) ≜ [e_1^T(t), e_2^T(t), . . . , e_N^T(t)]^T ∈ R^{nN}, the equation in (5.3) can be expressed in the matrix form

E(t) = ((L + A_0) ⊗ I_n) (X(t) − X_d − X_0(t)),  (5.4)

where X(t) = [x_1^T(t), x_2^T(t), . . . , x_N^T(t)]^T ∈ R^{nN}, X_d = [x_{d10}^T, x_{d20}^T, . . . , x_{dN0}^T]^T ∈ R^{nN}, and X_0(t) = [x_0^T(t), x_0^T(t), . . . , x_0^T(t)]^T ∈ R^{nN}. Using (5.4), it can be concluded that, provided the matrix ((L + A_0) ⊗ I_n) ∈ R^{nN×nN} is nonsingular, E(t) → 0 implies x_i(t) → x_{di0} + x_0(t), ∀i, and hence the satisfaction of the control objective. The matrix ((L + A_0) ⊗ I_n) is nonsingular provided the graph has
a spanning tree with the leader at the root [9]. To facilitate the formulation of an
optimization problem, the following section explores the functional dependence of
the state value functions for the network of agents.
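As a quick numerical illustration of (5.2)–(5.4), the following sketch evaluates the stacked error E = ((L + A_0) ⊗ I_n)(X − X_d − X_0) for placeholder states; the graph matrices repeat the illustrative topology from the previous sketch, and the rank check mirrors the spanning-tree condition.

```python
import numpy as np

def stacked_error(L, A0, X, Xd, X0, n):
    """E = ((L + A0) kron I_n)(X - Xd - X0), cf. (5.4); inputs stacked as (N*n,) vectors."""
    return np.kron(L + A0, np.eye(n)) @ (X - Xd - X0)

# Placeholder 5-agent scalar-state (n = 1) data; L and A0 repeat the illustrative topology above.
A  = np.zeros((5, 5)); A[1, 0] = A[0, 1] = A[3, 2] = A[4, 2] = 1.0
A0 = np.diag([1.0, 0.0, 1.0, 0.0, 0.0])
L  = np.diag(A.sum(axis=1)) - A

X  = np.array([2.0, 2.0, 2.0, 2.0, 2.0])          # current agent states
Xd = np.array([0.75, 0.25, 1.0, 0.5, 0.5])        # leader-relative offsets x_di0 (illustrative)
X0 = np.ones(5)                                   # leader state repeated N times

# Nonsingularity of (L + A0) kron I_n reflects the spanning-tree condition;
# when it holds, E -> 0 implies x_i -> x_di0 + x_0.
E = stacked_error(L, A0, X, Xd, X0, n=1)
assert np.linalg.matrix_rank(np.kron(L + A0, np.eye(1))) == 5
```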

5.2.3 Elements of the Value Function

The dynamics for the open-loop neighborhood tracking error are

\dot{e}_i(t) = \sum_{j ∈ \{0\} ∪ N_{-i}} a_{ij} \left( f_i(x_i(t)) + g_i(x_i(t)) u_i(t) − f_j(x_j(t)) − g_j(x_j(t)) u_j(t) \right).

Under the temporary assumption that each controller u_i(·) is an error-feedback controller (i.e., u_i(t) = \hat{u}_i(e_i(t), t)), the error dynamics are expressed as

\dot{e}_i(t) = \sum_{j ∈ \{0\} ∪ N_{-i}} a_{ij} \left( f_i(x_i(t)) + g_i(x_i(t)) \hat{u}_i(e_i(t), t) − f_j(x_j(t)) − g_j(x_j(t)) \hat{u}_j(e_j(t), t) \right).

Thus, the error trajectory {e_i(t)}_{t=t_0}^{∞}, where t_0 denotes the initial time, depends on \hat{u}_j(e_j(t), t), ∀j ∈ N_i. Similarly, the error trajectory {e_j(t)}_{t=t_0}^{∞} depends on \hat{u}_k(e_k(t), t), ∀k ∈ N_j. Recursively, the trajectory {e_i(t)}_{t=t_0}^{∞} depends on \hat{u}_j(e_j(t), t), and hence, on e_j(t), ∀j ∈ S_i. Thus, even if the controller for each agent is restricted to use local error feedback, the resulting error trajectories are interdependent. In particular, a change in the initial condition of one agent in the extended neighborhood causes a change in the error trajectories corresponding to all the extended neighbors. Consequently, the value function corresponding to an infinite-horizon optimal control problem where each agent tries to minimize \int_{t_0}^{∞} \left( Q(e_i(τ)) + R(u_i(τ)) \right) dτ, where Q : R^n → R and R : R^{m_i} → R are positive definite functions, is dependent on the error states of all the extended neighbors.
Since the steady-state controllers required for formation tracking are generally
nonzero, quadratic total-cost optimal control problems result in infinite costs, and
hence, are infeasible. In the following section, relative steady-state controllers are
derived to facilitate the formulation of a feasible optimal control problem.

5.2.4 Optimal Formation Tracking Problem

When the agents are perfectly tracking the desired trajectory in the desired formation,
even though the states of all the agents are different, the time-derivatives of the states
of all the agents are identical. Hence, in steady state, the control signal applied by
each agent must be such that the time derivatives of the states corresponding to
the set of extended neighbors are identical. In particular, the relative control signal
u i j : R≥t0 → Rm i that will keep node i in its desired relative position with respect
to node j ∈ S−i (i.e., xi (t) = x j (t) + xdi j ), must be such that the time derivative
of xi (·) is the same as the time derivative of x j (·). Using the dynamics of the agent
from (5.1) and substituting the desired relative position x j (·) + xdi j for the state
xi (·), the relative control signal u i j (·) must satisfy
f_i(x_j(t) + x_{dij}) + g_i(x_j(t) + x_{dij}) u_{ij}(t) = \dot{x}_j(t).  (5.5)

The relative steady-state control signal can be expressed in an explicit form provided the following assumption is satisfied.

Assumption 5.1 The matrix g_i(x) is full rank for all i ∈ N and for all x ∈ R^n; furthermore, the relative steady-state control signal, expressed as u_{ij}(t) = f_{ij}(x_j(t)) + g_{ij}(x_j(t)) u_j(t), satisfies (5.5) along the desired trajectory, where f_{ij}(x_j) ≜ g_i^{+}(x_j + x_{dij}) \left( f_j(x_j) − f_i(x_j + x_{dij}) \right) ∈ R^{m_i} and g_{ij}(x_j) ≜ g_i^{+}(x_j + x_{dij}) g_j(x_j) ∈ R^{m_i×m_j}, where the control effectiveness and the control input relative to the leader are understood to be g_0(x) = 0, ∀x ∈ R^n, and u_{i0} ≡ 0, ∀i ∈ N, respectively, and g_i^{+}(x) denotes a pseudoinverse of the matrix g_i(x), ∀x ∈ R^n.
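A minimal sketch of the relative steady-state quantities in Assumption 5.1 is given below; the drift and control-effectiveness functions are placeholders, and np.linalg.pinv stands in for the pseudoinverse g_i^+.

```python
import numpy as np

def relative_steady_state_terms(f_i, g_i, f_j, g_j, x_j, x_dij):
    """Return (f_ij(x_j), g_ij(x_j)) from Assumption 5.1:
       f_ij = g_i^+(x_j + x_dij) (f_j(x_j) - f_i(x_j + x_dij)),
       g_ij = g_i^+(x_j + x_dij) g_j(x_j)."""
    gi_pinv = np.linalg.pinv(g_i(x_j + x_dij))
    f_ij = gi_pinv @ (f_j(x_j) - f_i(x_j + x_dij))
    g_ij = gi_pinv @ g_j(x_j)
    return f_ij, g_ij

# Placeholder scalar example (n = m = 1); the dynamics below are illustrative only.
f1 = lambda x: 0.1 * x + x ** 2
f2 = lambda x: 0.5 * x ** 2
g1 = lambda x: np.atleast_2d(np.cos(2.0 * x) + 2.0)
g2 = lambda x: np.atleast_2d(np.cos(2.0 * x) + 2.0)

f_12, g_12 = relative_steady_state_terms(f1, g1, f2, g2, x_j=np.array([1.0]), x_dij=np.array([0.5]))
# The relative control that holds x_i at x_j + x_dij is then u_ij = f_ij + g_ij u_j.
```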
To facilitate the formulation of an optimal formation tracking problem, define the control error μ_i : R_{≥t_0} → R^{m_i} as

μ_i(t) ≜ \sum_{j ∈ N_{-i} ∪ \{0\}} a_{ij} \left( u_i(t) − u_{ij}(t) \right).  (5.6)

In the remainder of this section, the control errors {μ_i(·)} will be treated as the design variables. To implement the controllers using the designed control errors, it is essential to invert the relationship in (5.6). To facilitate the inversion, let S_i^o ≜ {1, . . . , s_i}, where s_i ≜ |S_i|. Let λ_i : S_i^o → S_i be a bijective map such that λ_i(1) = i. For notational brevity, let (·)_{S_i} denote the concatenated vector [(·)^T_{λ_i(1)}, (·)^T_{λ_i(2)}, . . . , (·)^T_{λ_i(s_i)}]^T, let (·)_{S_{-i}} denote the concatenated vector [(·)^T_{λ_i(2)}, . . . , (·)^T_{λ_i(s_i)}]^T, let \sum_j^i denote \sum_{j ∈ N_{-i} ∪ \{0\}}, let λ_i^j denote λ_i(j), and let E_i(t) ≜ [e^T_{S_i}(t), x^T_{λ_i^1}(t)]^T ∈ R^{n(s_i+1)} and E_{-i}(t) ≜ [e^T_{S_{-i}}(t), x^T_{λ_i^1}(t)]^T ∈ R^{n s_i}. Then, the control error vector μ_{S_i}(t) ∈ R^{\sum_{k∈S_i} m_k} can be expressed as

μ_{S_i}(t) = L_{g_i}(E_i(t)) u_{S_i}(t) − F_i(E_i(t)),  (5.7)

where the matrix L_{g_i} : R^{n(s_i+1)} → R^{\sum_{k∈S_i} m_k × \sum_{k∈S_i} m_k} is defined blockwise by

[L_{g_i}(E_i)]_{kl} = \begin{cases} −a_{λ_i^k λ_i^l}\, g_{λ_i^k λ_i^l}(x_{λ_i^l}), & l ≠ k, \\ \sum_j^{λ_i^k} a_{λ_i^k j}\, I_{m_{λ_i^k}}, & l = k, \end{cases}

where k, l = 1, 2, . . . , s_i, and F_i : R^{n(s_i+1)} → R^{\sum_{k∈S_i} m_k} is defined as

F_i(E_i) ≜ \left[ \sum_j^{λ_i^1} a_{λ_i^1 j} f^T_{λ_i^1 j}(x_j), \; . . . , \; \sum_j^{λ_i^{s_i}} a_{λ_i^{s_i} j} f^T_{λ_i^{s_i} j}(x_j) \right]^T.

Assumption 5.2 The matrix Lgi (Ei (t)) is invertible for all t ∈ R.
Assumption 5.2 is a controllability-like condition. Intuitively, Assumption 5.2
requires the control effectiveness matrices to be compatible to ensure the existence
of relative control inputs that allow the agents to follow the desired trajectory in the
desired formation.
Using Assumption 5.2, the control vector can be expressed as

u_{S_i}(t) = L_{g_i}^{-1}(E_i(t))\, μ_{S_i}(t) + L_{g_i}^{-1}(E_i(t))\, F_i(E_i(t)).  (5.8)

Let L_{g_i}^{k} denote the (λ_i^{-1}(k))th block row of L_{g_i}^{-1}. Then, the controller u_i(·) can be implemented as

u_i(t) = L_{g_i}^{i}(E_i(t))\, μ_{S_i}(t) + L_{g_i}^{i}(E_i(t))\, F_i(E_i(t)),  (5.9)

and for any j ∈ N_{-i},

u_j(t) = L_{g_i}^{j}(E_i(t))\, μ_{S_i}(t) + L_{g_i}^{j}(E_i(t))\, F_i(E_i(t)).  (5.10)
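Numerically, recovering the agent inputs from the designed control errors in (5.8)–(5.10) is a single linear solve; the sketch below uses random placeholder data only to exercise the shapes, with L_{g_i} assumed invertible as in Assumption 5.2.

```python
import numpy as np

def recover_controls(Lg, F, mu):
    """u_Si = Lg^{-1} (mu_Si + F_i), cf. (5.8); each agent applies its own block of u_Si."""
    return np.linalg.solve(Lg, mu + F)

# Placeholder data: s_i = 3 agents in the subgraph, each with a scalar input.
rng = np.random.default_rng(3)
Lg  = np.eye(3) + 0.1 * rng.standard_normal((3, 3))   # stands in for L_gi(E_i) (invertible per Assumption 5.2)
F   = rng.standard_normal(3)                          # stands in for F_i(E_i)
mu  = rng.standard_normal(3)                          # designed control errors mu_Si
u_S = recover_controls(Lg, F, mu)                     # u_S[0] is agent i's own input (lambda_i(1) = i)
```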

Using (5.9) and (5.10), the error and the state dynamics for the agents can be represented as

\dot{e}_i(t) = \mathcal{F}_i(E_i(t)) + \mathcal{G}_i(E_i(t))\, μ_{S_i}(t),  (5.11)

and

\dot{x}_i(t) = \mathbf{F}_i(E_i(t)) + \mathbf{G}_i(E_i(t))\, μ_{S_i}(t),  (5.12)

where

\mathcal{F}_i(E_i) ≜ \sum_j^i a_{ij}\, g_i(x_i) L_{g_i}^{i}(E_i) F_i(E_i) − \sum_j^i a_{ij}\, g_j(x_j) L_{g_i}^{j}(E_i) F_i(E_i) + \sum_j^i a_{ij} f_i(x_i) − \sum_j^i a_{ij} f_j(x_j),

\mathcal{G}_i(E_i) ≜ \sum_j^i a_{ij} \left( g_i(x_i) L_{g_i}^{i}(E_i) − g_j(x_j) L_{g_i}^{j}(E_i) \right),

\mathbf{F}_i(E_i) ≜ f_i(x_i) + g_i(x_i) L_{g_i}^{i}(E_i) F_i(E_i),

\mathbf{G}_i(E_i) ≜ g_i(x_i) L_{g_i}^{i}(E_i).

Let h_i^e(t; t_0, E_{i0}, μ_i, μ_{S_{-i}}) and h_i^x(t; t_0, E_{i0}, μ_i, μ_{S_{-i}}) denote the trajectories of (5.11) and (5.12), respectively, with initial time t_0, initial condition E_i(t_0) = E_{i0}, and policies μ_j : R^{n(s_j+1)} → R^{m_j}, j ∈ S_i, and let H_i ≜ [(h^e)^T_{S_i}, (h^x_{λ_i^1})^T]^T. Define a cost functional

J_i(e_i(·), μ_i(·)) ≜ \int_0^{∞} r_i(e_i(σ), μ_i(σ))\, dσ  (5.13)

where ri : Rn × Rm i → R≥0 denotes the local cost defined as ri (ei , μi )  Q i (ei ) +


μiT Ri μi , where Q i : Rn → R≥0 is a positive definite function, and Ri ∈ Rm i ×m i is
a constant positive definite matrix. The objective of each agent is to minimize the
cost functional in (5.13). To facilitate the definition of a feedback-Nash equilibrium
solution, define value functions Vi : Rn(si +1) → R≥0 as

∞
       
Vi Ei ; μi , μS−i  ri h ei σ ; t, Ei , μi , μS−i , μi Hi σ ; t, Ei , μi , μS−i dσ,
t
  (5.14)
where Vi Ei ; μi , μS−i denotes the total cost-to-go under the policies μSi , starting
from the state Ei . Note that

the value functions in (5.14) are time-invariant
because
the dynamical systems ė j (t) = F j (Ei (t)) + G j (Ei (t)) μS j (t) j∈Si and ẋi (t) =
Fi (Ei (t)) + Gi (Ei (t)) μSi (t) together form an autonomous dynamical system.
A graphical feedback-Nash
  within the subgraph Si is defined
equilibrium solution
∗ n (s j +1)
as the tuple of policies μ j : R →Rmj
such that the value functions in
j∈Si
(5.14) satisfy
   
V j∗ E j  V j E j ; μ∗j , μ∗S− j ≤ V j E j ; μ j , μ∗S− j ,

∀ j ∈ Si , ∀Ei ∈ Rn(si +1) , and for all admissible policies μ j . Provided a feedback-
Nash equilibrium solution exists and the value functions (5.14) are continuously
differentiable, the feedback-Nash equilibrium value functions can be characterized
in terms of the following system of Hamilton–Jacobi equations:

 
∇e j Vi∗ (Ei ) F j (Ei ) + G j (Ei ) μ∗S j (Ei ) + Q i (Ei ) + μi∗T (Ei ) Ri μi∗ (Ei )
j∈Si
 
+ ∇xi Vi∗ (Ei ) Fi (Ei ) + Gi (Ei ) μ∗Si (Ei ) = 0, ∀Ei ∈ Rn(si +1) , (5.15)

where Q i : Rn(si +1) → R is defined as Q i (Ei )  Q i (ei ).

Theorem 5.3 Provided a feedback-Nash equilibrium solution exists and that the
value functions in (5.14) are continuously differentiable, the system of Hamilton–
Jacobi equations in (5.15) constitutes a necessary and sufficient condition for
feedback-Nash equilibrium.

Proof Consider the cost functional in (5.13), and assume that all the extended neigh-
bors of the ith agent follow their feedback-Nash equilibrium policies. The value
function corresponding to any admissible policy μi can be expressed as

 T  ∞    
Vi ei , E−i ; μi , μS−i = ri h ei σ ; t, Ei , μi , μ∗S−i , μi Hi σ ; t, Ei , μi , μ∗S−i
T T ∗
dσ.
t

Treating the dependence on E−i as explicit time dependence define


   T
V i ei , t; μi , μ∗S−i  Vi eiT , E−i
T
(t) ; μi , μ∗S−i , (5.16)

∀ei ∈ Rn and ∀t ∈ R≥0 . Assuming that the optimal controller that minimizes (5.13)
when all the extended neighbors follow their feedback-Nash
 equilibrium
policies
∗ ∗ ∗
exists, and that the optimal value function V i  V i (·) ; μi , μS−i exists and is
continuously differentiable, optimal control theory for single objective optimization
problems (cf. [10]) can be used to derive the following necessary and sufficient
condition
∗ ∗
∂ V i (ei , t)   ∂ V i (ei , t)
Fi (Ei ) + Gi (Ei ) μ∗Si (Ei ) + + Q i (ei ) + μi∗T (Ei ) Ri μi∗ (Ei ) = 0.
∂ei ∂t
(5.17)
Using (5.16), the partial derivative with respect to the state can be expressed as

∂ V i (ei , t) ∂ Vi∗ (Ei )
= , (5.18)
∂ei ∂ei

∀ei ∈ Rn and ∀t ∈ R≥0 . The partial derivative with respect to time can be expressed
as

∂ V i (ei , t)  ∂ V ∗ (Ei )  ∂ V ∗ (E )
F j (Ei ) + G j (Ei ) μ∗S j (Ei ) + i
i
= i
Fi (Ei ) ,
∂t j∈S
∂e j ∂ x i
−i

∂ V ∗ (Ei )
+ i Gi (Ei ) μ∗Si (Ei ) , (5.19)
∂ xi

∀ei ∈ Rn and ∀t ∈ R≥0 . Substituting (5.18) and (5.19) into (5.17) and repeat-
ing the process for each i, the system of Hamilton–Jacobi equations in (5.15) is
obtained. 
Minimizing the Hamilton–Jacobi equations using the stationary condition, the
feedback-Nash equilibrium solution is expressed in the explicit form

1  T  T 1  T  T
μi∗ (Ei ) = − Ri−1 G ji (Ei ) ∇e j Vi∗ (Ei ) − Ri−1 Gii (Ei ) ∇xi Vi∗ (Ei ) ,
2 2
j∈Si
(5.20)
∂μ∗S ∂μ∗
n(si +1)
∀Ei ∈ R , where G ji
 G j ∂μ∗ 
j
and Gii Gi ∂μS∗i
. As it is generally infeasible to
i i
obtain an analytical solution to the system of the Hamilton–Jacobi equations in (5.15),
the feedback-Nash value functions and the feedback-Nash
 policiesare approximated

using parametric approximation schemes as V̂i Ei , Ŵci and μ̂i Ei , Ŵai , respec-
tively, where Ŵci ∈ R L i and Ŵai ∈ R L i are parameter estimates. Substitution of the
approximations V̂i and μ̂i in (5.15) leads to a set of Bellman errors δi defined as
        
δi Ei , Ŵci , Ŵa  μ̂iT Ei , Ŵai R μ̂i Ei , Ŵai + ∇e j V̂i Ei , Ŵci F j E j
Si
j∈Si
    
 
+ ∇e j V̂i Ei , Ŵci G j E j μ̂S j E j , Ŵa
Sj
j∈Si
    
+ ∇xi V̂i Ei , Ŵci Fi (Ei ) + Gi (Ei ) μ̂Si Ei , Ŵa + Q i (ei ) .
Si
(5.21)

Approximate feedback-Nash equilibrium control is realized by tuning the estimates


V̂i and μ̂i so as to minimize the Bellman errors δi . However, computation of δi and
that of u i j in (5.6) requires exact model knowledge. In the following, a concurrent
learning-based system identifier is developed to relax the exact model knowledge
requirement and to facilitate the implementation of model-based reinforcement learn-
ing via Bellman error extrapolation (cf. [11]). In particular, the developed controllers
do not require knowledge of the system drift functions f i .

5.2.5 System Identification

On any compact set χ ⊂ R^n, the function f_i can be represented using a neural network as

f_i(x) = θ_i^T σ_{θi}(x) + ε_{θi}(x),  (5.22)

∀x ∈ R^n, where θ_i ∈ R^{(P_i+1)×n} denotes the unknown output-layer neural network weights, σ_{θi} : R^n → R^{P_i+1} denotes a bounded neural network basis function, ε_{θi} : R^n → R^n denotes the function reconstruction error, and P_i ∈ N denotes the number of neural network neurons. Using the universal function approximation property of single layer neural networks, provided the rows of σ_{θi}(x) form a proper basis, there exist constant ideal weights θ_i and positive constants \overline{θ}_i ∈ R and \overline{ε}_{θi} ∈ R such that ‖θ_i‖_F ≤ \overline{θ}_i < ∞ and sup_{x∈χ} ‖ε_{θi}(x)‖ ≤ \overline{ε}_{θi}, where ‖·‖_F denotes the Frobenius norm.

Assumption 5.4 The bounds \overline{θ}_i and \overline{ε}_{θi} are known for all i ∈ N.

Using an estimate \hat{θ}_i : R_{≥t_0} → R^{(P_i+1)×n} of the weight matrix θ_i, the function f_i can be approximated by the function \hat{f}_i : R^n × R^{(P_i+1)×n} → R^n defined by \hat{f}_i(x, \hat{θ}) ≜ \hat{θ}^T σ_{θi}(x). Based on (5.22), an estimator for online identification of the drift dynamics
is developed as

x̂˙i (t) = θ̂iT (t) σθi (xi (t)) + gi (xi (t)) u i (t) + ki x̃i (t) , (5.23)

where x̃i (t)  xi (t) − x̂i (t) and ki ∈ R is a positive constant learning gain. The
following assumption facilitates concurrent learning-based system identification.

Assumption 5.5 [12, 13] A history stack containing recorded state–action pairs {x_i^k, u_i^k}_{k=1}^{M_{θi}}, along with numerically computed state derivatives {\bar{\dot{x}}_i^k}_{k=1}^{M_{θi}}, that satisfies

λ_{min}\left( \sum_{k=1}^{M_{θi}} σ_{θi}^k (σ_{θi}^k)^T \right) = \underline{σ}_{θi} > 0,  \quad \| \bar{\dot{x}}_i^k − \dot{x}_i^k \| < \overline{d}_i, \; ∀k  (5.24)

is available a priori. In (5.24), σ_{θi}^k ≜ σ_{θi}(x_i^k), and \overline{d}_i, \underline{σ}_{θi} ∈ R are known positive constants.
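One way to populate the numerically computed derivatives required by Assumption 5.5 is to smooth and differentiate recorded state samples, e.g., with a Savitzky–Golay filter as done in the simulation section of this chapter; the sketch below is an illustration on synthetic data, with the window length, polynomial order, and test signal chosen arbitrarily.

```python
import numpy as np
from scipy.signal import savgol_filter

dt = 0.01
t = np.arange(0.0, 5.0, dt)
x = np.sin(t) + 0.01 * np.random.default_rng(1).standard_normal(t.size)  # noisy state samples

# Fifth-order polynomial, 21-sample window; deriv=1 returns the smoothed first derivative.
x_dot_est = savgol_filter(x, window_length=21, polyorder=5, deriv=1, delta=dt)

# Triples (x^k, u^k, x_dot_est^k) recorded at informative time instants form the history stack
# used by the concurrent-learning update; the derivative error is bounded by the filter accuracy.
```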
The weight estimates \hat{θ}_i(·) are updated using the following concurrent learning-based update law:

\dot{\hat{θ}}_i(t) = k_{θi} Γ_{θi} \sum_{k=1}^{M_{θi}} σ_{θi}^k \left( \bar{\dot{x}}_i^k − g_i^k u_i^k − \hat{θ}_i^T(t) σ_{θi}^k \right)^T + Γ_{θi}\, σ_{θi}(x_i(t))\, \tilde{x}_i^T(t),  (5.25)

where g_i^k ≜ g_i(x_i^k), k_{θi} ∈ R is a constant positive concurrent learning gain, and Γ_{θi} ∈ R^{(P_i+1)×(P_i+1)} is a constant, diagonal, and positive definite adaptation gain matrix.
To facilitate the subsequent stability analysis, a candidate Lyapunov function V_{0i} : R^n × R^{(P_i+1)×n} → R is selected as

V_{0i}(\tilde{x}_i, \tilde{θ}_i) ≜ \frac{1}{2} \tilde{x}_i^T \tilde{x}_i + \frac{1}{2} tr\left( \tilde{θ}_i^T Γ_{θi}^{-1} \tilde{θ}_i \right),  (5.26)

where \tilde{θ}_i(t) ≜ θ_i − \hat{θ}_i(t). Using (5.23)–(5.25), a bound on the time derivative of V_{0i} is established as

\dot{V}_{0i}(\tilde{x}_i, \tilde{θ}_i) ≤ −k_i \|\tilde{x}_i\|^2 − k_{θi} \underline{σ}_{θi} \|\tilde{θ}_i\|_F^2 + \overline{ε}_{θi} \|\tilde{x}_i\| + k_{θi} d_{θi} \|\tilde{θ}_i\|_F,  (5.27)

where d_{θi} ≜ \overline{d}_i \sum_{k=1}^{M_{θi}} \|σ_{θi}^k\| + \overline{ε}_{θi} \sum_{k=1}^{M_{θi}} \|σ_{θi}^k\|. Using (5.26) and (5.27), a Lyapunov-based stability analysis can be used to show that \hat{θ}_i(·) converges exponentially to a neighborhood around θ_i.
Remark 5.6 Using an integral formulation, the system identifier can also be imple-
mented without using state-derivative measurements (see, e.g., [14]).
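A discrete-time sketch of the concurrent-learning update (5.25) for a single agent is given below; the forward-Euler step, the data containers, and the numerical values are assumptions made purely for illustration.

```python
import numpy as np

def cl_update_step(theta_hat, x, x_tilde, history, sigma, g, k_theta, Gamma_theta, dt):
    """One forward-Euler step of the concurrent-learning law (5.25).
    theta_hat: (P+1, n) weight estimate; history: list of (x_k, u_k, xdot_k) tuples."""
    # Concurrent-learning term: sum over recorded data of sigma_k (xdot_k - g_k u_k - theta_hat^T sigma_k)^T.
    cl_term = np.zeros_like(theta_hat)
    for x_k, u_k, xdot_k in history:
        sig_k = sigma(x_k)                                    # (P+1,)
        resid = xdot_k - g(x_k) @ u_k - theta_hat.T @ sig_k   # (n,)
        cl_term += np.outer(sig_k, resid)
    # Instantaneous term driven by the state-estimation error x_tilde = x - x_hat.
    inst_term = np.outer(sigma(x), x_tilde)
    theta_dot = k_theta * Gamma_theta @ cl_term + Gamma_theta @ inst_term
    return theta_hat + dt * theta_dot

# Placeholder problem data (n = 1, two basis functions), purely illustrative.
sigma = lambda x: np.array([x[0], x[0] ** 2])
g     = lambda x: np.array([[np.cos(2.0 * x[0]) + 2.0]])
history = [(np.array([xk]), np.array([0.0]), np.array([0.1 * xk + xk ** 2]))
           for xk in (0.5, 1.0, 1.5, 2.0)]
theta_hat = np.zeros((2, 1))
theta_hat = cl_update_step(theta_hat, np.array([1.0]), np.array([0.05]),
                           history, sigma, g, k_theta=30.0, Gamma_theta=np.eye(2), dt=0.001)
```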

5.2.6 Approximation of the Bellman Error and the Relative


Steady-State Controller

Using the approximations fˆi for the functions f i , the Bellman errors in (5.21) can
be approximated as
       
δ̂i Ei ,Ŵci , Ŵa , θ̂Si  μ̂iT Ei , Ŵai Ri μ̂i Ei , Ŵai + ∇e j V̂i Ei , Ŵci Fˆ j E j , θ̂S j
Si
j∈Si
     
+ ∇xi V̂i Ei , Ŵci F̂i Ei , θ̂Si + Gi (Ei ) μ̂Si Ei , Ŵa + Q i (ei )
Si
    
 
+ ∇e j V̂i Ei , Ŵci G j E j μ̂S j E j , Ŵa . (5.28)
Sj
j∈Si

In (5.28),
 i        j 
Fˆi Ei , θ̂Si  ai j fˆi xi , θ̂i − fˆj x j , θ̂ j + iai j gi (xi ) L gii −g j x j L gi F̂i Ei , θ̂Si ,
 
F̂i Ei , θ̂Si  θ̂iT σθi (xi ) + gi (xi ) L gii F̂i Ei , θ̂Si ,
⎡  1  ⎤
λi
aλ1 j fˆλ1 j xλ1 , θ̂λ1 , x j , θ̂ j
 ⎢

i i i i ⎥

.
F̂i Ei , θ̂Si  ⎢ ⎢ . ⎥,

⎣  si . ⎦
λi
aλsi j fˆλsi j xλsi , θ̂λsi , x j , θ̂ j
i i i i
      
fˆi j xi , θ̂i , x j , θ̂ j  gi+ x j + xdi j fˆj x j , θ̂ j − gi+ x j + xdi j fˆi x j + xdi j , θ̂i .

The approximations F̂i , Fˆi , and F̂i are related to the original unknown function as
     
F̂i Ei , θSi + Bi (Ei ) = Fi (Ei ), Fˆi Ei , θSi + B i (Ei ) = Fi (Ei ), and F̂i Ei , θSi
+ Bi (Ei ) = Fi (Ei ), where Bi , Bi , and Bi are O (
θ )Si terms that denote bounded
function approximation errors.
Using the approximations fˆi , an implementable form of the controllers in (5.8) is
expressed as
   
u Si (t) = Lgi−1 (Ei (t)) μ̂Si Ei (t) , Ŵa (t) + Lgi−1 (Ei (t)) F̂i Ei (t) , θ̂Si (t) .
Si
(5.29)
Using (5.7) and (5.29), an unmeasurable form of the virtual controllers for the systems
(5.11) and (5.12) is given by
   
μSi (t) = μ̂Si Ei (t) , Ŵa (t) − F̂i Ei (t) , θ̃Si (t) − Bi (Ei (t)) . (5.30)
Si

5.2.7 Value Function Approximation

On any compact set χ ⊂ R^{n(s_i+1)}, the value functions can be represented as

V_i^*(E_i) = W_i^T σ_i(E_i) + ε_i(E_i),  ∀E_i ∈ R^{n(s_i+1)},  (5.31)

where W_i ∈ R^{L_i} are ideal neural network weights, σ_i : R^{n(s_i+1)} → R^{L_i} are neural network basis functions, and ε_i : R^{n(s_i+1)} → R are function approximation errors. Using the universal function approximation property of single layer neural networks, provided σ_i(E_i) forms a proper basis, there exist constant ideal weights W_i and positive constants \overline{W}_i, \overline{ε}_i, \overline{∇ε}_i ∈ R such that ‖W_i‖ ≤ \overline{W}_i < ∞, sup_{E_i∈χ} ‖ε_i(E_i)‖ ≤ \overline{ε}_i, and sup_{E_i∈χ} ‖∇ε_i(E_i)‖ ≤ \overline{∇ε}_i.

Assumption 5.7 The constants \overline{ε}_i, \overline{∇ε}_i, and \overline{W}_i are known for all i ∈ N.

Using (5.20) and (5.31), the feedback-Nash equilibrium policies are

μ_i^*(E_i) = −\frac{1}{2} R_i^{-1} G_{σi}(E_i) W_i − \frac{1}{2} R_i^{-1} G_{εi}(E_i),  ∀E_i ∈ R^{n(s_i+1)},

where

G_{σi}(E_i) ≜ \sum_{j∈S_i} \left( G_{ji}(E_i) \right)^T \left( ∇_{e_j} σ_i(E_i) \right)^T + \left( G_{ii}(E_i) \right)^T \left( ∇_{x_i} σ_i(E_i) \right)^T,

G_{εi}(E_i) ≜ \sum_{j∈S_i} \left( G_{ji}(E_i) \right)^T \left( ∇_{e_j} ε_i(E_i) \right)^T + \left( G_{ii}(E_i) \right)^T \left( ∇_{x_i} ε_i(E_i) \right)^T.

The value functions and the policies are approximated using neural networks as

\hat{V}_i(E_i, \hat{W}_{ci}) ≜ \hat{W}_{ci}^T σ_i(E_i),  \quad  \hat{μ}_i(E_i, \hat{W}_{ai}) ≜ −\frac{1}{2} R_i^{-1} G_{σi}(E_i) \hat{W}_{ai},  (5.32)

where \hat{W}_{ci} and \hat{W}_{ai} are estimates of the ideal weights W_i introduced in (5.21).
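Evaluating (5.32) for one agent amounts to a few matrix–vector products once σ_i and G_{σi} are available; the sketch below illustrates this with an assumed two-element basis and scalar control, so the particular basis and dimensions are placeholders.

```python
import numpy as np

def value_and_policy(W_c_hat, W_a_hat, sigma, G_sigma, R_inv, E):
    """V_hat = W_c_hat^T sigma(E);  mu_hat = -0.5 R^{-1} G_sigma(E) W_a_hat, cf. (5.32)."""
    V_hat = W_c_hat @ sigma(E)
    mu_hat = -0.5 * R_inv @ (G_sigma(E) @ W_a_hat)
    return V_hat, mu_hat

# Placeholder basis and effectiveness-weighted gradient, for illustration only (L_i = 2, m_i = 1).
sigma   = lambda E: np.array([E[0] ** 2, E[0] ** 2 * E[1] ** 2])
G_sigma = lambda E: np.array([[2.0 * E[0], 2.0 * E[0] * E[1] ** 2]])   # stands in for G_sigma_i(E_i)
R_inv   = np.array([[1.0 / 0.1]])

V_hat, mu_hat = value_and_policy(np.array([1.0, 0.5]), np.array([1.0, 0.5]),
                                 sigma, G_sigma, R_inv, E=np.array([0.4, 1.2]))
```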

5.2.8 Simulation of Experience via Bellman Error


Extrapolation

A consequence of Theorem 5.3 is that the Bellman error provides an indirect mea-
sure of how close the estimates Ŵci and Ŵai are to the ideal weights Wi . From a
reinforcement learning perspective, each evaluation of the Bellman error along the
system trajectory can be interpreted as experience gained by the critic, and each
evaluation of the Bellman error at points not yet visited can be interpreted as sim-
ulated experience. In previous results such as [15–19], the critic is restricted to the
experience gained (in other words Bellman errors evaluated) along the system state
trajectory. The development in [15–18] can be extended to employ simulated expe-
rience; however, the extension requires exact model knowledge. The formulation in
(5.28) employs the system identifier developed in Sect. 5.2.5 to facilitate approximate
evaluation of the Bellman error at off-trajectory points.

Mi
To simulate experience, a set of points Eik k=1 is selected corresponding to each
agent i, and the instantaneous Bellman error in (5.21) is approximated at the current
state and the selected points using (5.36). The approximation at the current state is
denoted by δ̂ti and the approximation at the selected points is denoted by δ̂tik , where
δ̂ti and δ̂tik are defined as
   
δ̂ti (t)  δ̂i Ei (t) , Ŵci (t) , Ŵa (t) , θ̂ (t) ,
Si Si
   
δ̂tik (t)  δ̂i Eik , Ŵci (t) , Ŵa (t) , θ̂ (t) .
Si Si



Note that once e j j∈Si and xi are selected, the ith agent can compute the states of
all the remaining agents in the sub-graph.
The critic uses simulated experience to update the critic weights using the least-
squares-based update law

ηc2i i (t) 
Mi
ωi (t) ωik (t) k
Ŵ˙ ci (t) = −ηc1i i (t) δ̂ti (t) − δ̂ (t) ,
ρi (t) Mi ρ k (t) ti
k=1 i
 
ωi (t) ωiT (t)
˙ i (t) = βi i (t) − ηc1i i (t) i 1{ i ≤ i } ,  i (t0 ) ≤ i ,
ρi2 (t)
(5.33)

where ρi (t)  1 + νi ωiT (t) i ωi (t), i ∈ R L i ×L i denotes the time-varying least-


squares learning gain, i ∈ R denotes the saturation constant, and ηc1i , ηc2i , βi , νi ∈
R are constant positive learning gains. In (5.33),
     
 
ωi (t)  ∇e j σi (Ei (t)) Fˆ j E j (t) , θ̂S j (t) + G j E j (t) μ̂S j E j (t) , Ŵa (t)
Sj
j∈Si
    
+ ∇xi σi (Ei (t)) F̂i Ei (t) , θ̂Si (t) + Gi (Ei (t)) μ̂Si Ei (t) , Ŵa (t) ,
Si
    
ωik (t)  ∇e j σik Fˆ jk θ̂S j (t) + G jk μ̂kS j Ŵa (t)
Sj
j∈Si

  
+ ∇xi σik F̂ik θ̂S j (t) + Gik μ̂kSi Ŵa (t) ,
Si

where the notation φik indicates evaluation at Ei = Eik for a function φi (Ei , (·)) (i.e.,
 
φik (·)  φi Eik , (·) ). The actor updates the actor weights using the following update
law derived based on a Lyapunov-based stability analysis:

ω T (t)
Ŵ˙ ai (t) = ηc1i G σT i (Ei (t)) Ri−1 G σ i (Ei (t)) Ŵai (t) i
1
Ŵci (t) − ηa2i Ŵai (t)
4 ρi (t)
 k T
1 ηc2i  k T −1 k 
Mi
ω (t)
+ Gσ i Ri G σ i Ŵai (t) i k Ŵci (t) − ηa1i Ŵai (t) − Ŵci (t) ,
4
k=1
Mi ρi (t)
(5.34)

where ηa1i , ηa2i ∈ R are constant positive learning gains. The following assumption
facilitates simulation of experience.

Assumption 5.8 [13] For each i ∈ N, there exists a finite set of M_i points {E_i^k}_{k=1}^{M_i} such that

\underline{ρ}_i ≜ \frac{ \inf_{t∈R_{≥0}} λ_{min}\left( \sum_{k=1}^{M_i} \frac{ω_i^k(t) (ω_i^k(t))^T}{ρ_i^k(t)} \right) }{M_i} > 0,  (5.35)

where \underline{ρ}_i ∈ R is a positive constant.
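The sketch below shows the structure of one integration step of the critic update in (5.33), combining the on-trajectory Bellman error with the M_i extrapolated Bellman errors of Assumption 5.8; the actor update (5.34) is implemented analogously and is omitted here, and all gains, regressors, and the Euler discretization are illustrative assumptions.

```python
import numpy as np

def critic_step(W_c, Gamma, omega, delta, omega_ext, delta_ext,
                eta_c1, eta_c2, beta, nu, Gamma_bar, dt):
    """One Euler step of the least-squares critic law (5.33).
    omega, delta: on-trajectory regressor (L,) and Bellman error (scalar);
    omega_ext, delta_ext: the same quantities at the M extrapolation points."""
    rho = 1.0 + nu * omega @ Gamma @ omega
    W_c_dot = -eta_c1 * Gamma @ omega * (delta / rho)
    M = len(omega_ext)
    for om_k, de_k in zip(omega_ext, delta_ext):
        rho_k = 1.0 + nu * om_k @ Gamma @ om_k
        W_c_dot += -(eta_c2 / M) * Gamma @ om_k * (de_k / rho_k)
    # Covariance-like gain update, frozen once the gain norm exceeds the saturation bound.
    Gamma_dot = beta * Gamma - eta_c1 * Gamma @ np.outer(omega, omega) @ Gamma / rho ** 2
    if np.linalg.norm(Gamma) > Gamma_bar:
        Gamma_dot = np.zeros_like(Gamma)
    return W_c + dt * W_c_dot, Gamma + dt * Gamma_dot

# Illustrative call with random placeholder regressors (L = 4, M = 3 extrapolation points).
rng = np.random.default_rng(2)
W_c, Gamma = np.ones(4), 500.0 * np.eye(4)
W_c, Gamma = critic_step(W_c, Gamma, rng.standard_normal(4), 0.3,
                         [rng.standard_normal(4) for _ in range(3)], [0.2, -0.1, 0.4],
                         eta_c1=0.1, eta_c2=10.0, beta=0.1, nu=0.005, Gamma_bar=1e4, dt=0.001)
```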

5.2.9 Stability Analysis

To facilitate the stability analysis, the left hand side of (5.15) is subtracted from
(5.28) to express the Bellman error in terms of the weight estimation errors as
 1
δ̂ti = −W̃ciT ωi − WiT ∇xi σi F̂i Ei , θ̃Si + W̃aiT G σT i Ri−1 G σ i W̃ai
4
  1  
− WiT ∇e j σi Fˆ j E j , θ̃S j + WiT ∇e j σi G j RS j W̃a
j∈Si
2 j∈Si
Sj

1 1 
− WiT G σT i Ri−1 G σ i W̃ai + WiT ∇xi σi Gi RSi W̃a + i , (5.36)
2 2 Si

  
where ˜  (·) − (·),
(·) ˆ i = O (
)Si , ∇
Si , (
θ )Si , and RS j 
 
diag Rλ−1 1 G
T
σ λ1
, . . . , R −1
sj G
T
sj is a block diagonal matrix. Consider a set of
j j λj σλj
extended neighbors S p corresponding to the pth agent. To analyze asymptotic prop-
erties of the agents in S p , consider the following candidate Lyapunov function

     1  1  
VL p Z p , t  Vti eSi , t + W̃ciT i−1 W̃ci + T W̃ +
W̃ai ai V0i x̃i , θ̃i ,
2 2
i∈S p i∈S p i∈S p i∈S p
(5.37)

where Z p ∈ R(2nsi +2L i si +n(Pi +1)si ) is defined as


  T  T  T T
Z p  eS p , W̃c
T
, W̃a , x̃S p , vec θ̃S p
T
,
Sp Sp

vec (·) denotes the vectorization operator, and Vti : Rnsi × R → R is defined as
   T
Vti eSi , t  Vi∗ eST i , xiT (t) , (5.38)

∀eSi ∈ Rnsi and ∀t ∈ R≥t0 .


Since Vti∗ depends on t only through uniformly bounded leader trajectories,
Lemma 1 from [20] can be used to show that Vti is a positive definite and decres-
cent function. Since the graph has a spanning tree, the mapping between the errors
and the states is invertible. Hence, the state of an agent can be expressed as
xi = hi eSi , x0 for some
 function
 hi . Thus, the value function can be expressed
as Vi∗ eSi , x0 = Vi∗ eSi 
, h eSi , x0 . Then, Vti∗ can be alternatively defined as
  eSi
Vti eSi , t  Vi∗ . Since x0 is a uniformly bounded function of t by
x0 (t)
assumption, [20, Lemma 1] can be used to conclude that Vti is a positive definite and
decrescent function.
Thus, using [21, Lemma 4.3], the following bounds on the candidate Lyapunov
function in (5.37) are established
     
vlp  Z p  ≤ VL p Z p , t ≤ vlp  Z p  , (5.39)

∀Z p ∈ R(2nsi +2L i si +n(Pi +1)si ) and ∀t ∈ R≥t0 , where vlp , vlp : R → R are class K func-
tions.
To facilitate the stability analysis, given any compact ball χ p ⊂ R2nsi +2L i si +n(Pi +1)si
of radius r p ∈ R centered at the origin, a positive constant ι p ∈ R is defined as
⎛     2 ⎞  2
 
 ⎜
θi 2 3 kθi dθi +  Aiθ  Biθ   5 (ηc1i + ηc2i )2  ωρii i 

ιp  ⎝ + ⎠+
i∈S
2ki 4σθi i∈S
4ηc2i ρi
p p

 
 
 1  
+  ∗
∇e j Vi (Ei ) G j RS j
S j 

∇xi Vi (Ei ) Gi RSi
Si + 
i∈S p  
2 j∈Si
 
 


+  ∇ V ∗
(E ) G B + ∇ V ∗
(E ) G B 
 ej i i j j xi i i i i
i∈S p  j∈Si 
  2
  Aia1  (ηc1i +ηc2i )  T ωi −1 
3 i∈S p 2
+ η a2i W i + 4  W i ρi W i
T T
G R
σi i G σ i 
+ ,
4 (ηa1i + ηa2i )

where for any function  : Rl → R, l ∈ N, the notation   denotes sup y∈χ pl


 (y), χ pl denotes the projection of χ p onto Rl , and Aiθ , Biθ , and Aia1 are uni-
formly bounded state-dependent terms. Based on the subsequent stability analysis,
the following sufficient gain conditions can be developed to facilitate Theorem 5.9:
 2  2
2  1aθ   1aθ 
ηc2i ρi (η
 p j∈Si c1i
3s 1 + η c2i )  A ij   Bij 
> , (5.40)
5 j∈S p
4kθ j σθ j
 2 

2

(ηa1i + ηa2i )  5s p 1i∈S j ηc1 j + ηc2 j A1ac
ji  2
5ηa1i
> +
3 j∈S p
16ηc2 j ρ j 4ηc2i ρi
  
 
(ηc1i + ηc2i ) Wi  ωρii  G σT i Ri−1 G σ i 
+ , (5.41)
   4
−1
 
vlp ι p < vlp −1 vlp r p , (5.42)

where Ai1aθ 1aθ 1ac


j , Bi j , and A ji are uniformly bounded state-dependent terms.

Theorem 5.9 Provided Assumptions 5.2–5.8 hold and the sufficient gain conditions
in (5.40)–(5.42) are satisfied, the controller in (5.32) along with the actor and critic
update laws in (5.33) and (5.34), and the system identifier in (5.23) along with the
weight update laws in (5.25) ensure that the local neighborhood tracking errors ei
are ultimately bounded and that the policies μ̂i converge to a neighborhood around
the feedback-Nash policies μi∗ , ∀i ∈ N .

Proof The time derivative of the candidate Lyapunov function in (5.37) is given by

   1  T −1  
V̇L p = V̇ti eSi , t − W̃ci i ˙ i i−1 W̃ci − W̃ciT i−1 Ŵ˙ ci − W̃aiT Ŵ˙ ai
2
i∈S p i∈S p i∈S p i∈S p
 
+ V̇0i x̃i , θ̃i . (5.43)
i∈S p

Using (5.15), (5.27), (5.30), and (5.36), the update laws in (5.33) and (5.34), and the
definition of Vti in (5.38), the derivative in (5.43) can be bounded as
  ηc2i ρi  
 2 (ηa1i + ηa2i ) 
 2 

V̇L p ≤ −  ci 
W̃ −  ai 

i∈S p
5 3
 ki kθi σθi 
 2
 
+ −qi (ei ) − x̃i  −
2
θ̃i  + ι p .
i∈S
2 3 F
p

Let vlp : R → R be a class K function such that



  1  1  ηc2i ρi 

2 1  (η + η ) 
 a2i 
2

vlp  Z p  ≤
a1i
qi (ei ) + W̃ci  + W̃ai 
2 2 5 2 3
i∈S p i∈S p i∈S p
1  ki 1  kθi σθi 
 2

+ x̃i 2 + θ̃i  , (5.44)
2 2 2 3 F
i∈S p i∈S p

where qi : R → R are class K functions such that qi (e) ≤ Q i (e) , ∀e ∈ Rn , ∀i ∈


N . Then, the Lyapunov derivative can be bounded as
 
V̇L p ≤ −vlp  Z p  (5.45)
   
∀Z p such that Z p ∈ χ p and  Z p  ≥ vlp −1
ι p . Using the bounds in (5.39), the
sufficient conditions in (5.40)–(5.42), and the inequality in (5.45), [21,  Theorem 
4.18] can be invoked to conclude that every trajectory Z (·) satisfying  Z p (t0 ) ≤
  p
 
vlp −1 vlp r p , is bounded for all t ∈ R≥t0 and satisfies lim supt→∞  Z p (t) ≤
   
−1
vlp −1 vlp vlp ιp .
Since the choice of the subgraph S p was arbitrary, the neighborhood tracking
errors ei (·) are ultimately bounded for all i ∈ N . Furthermore, the weight estimates
Ŵai (·) converge to a neighborhood of the ideal weights Wi ; hence, invoking Theorem
5.3, the policies μ̂i converge to a neighborhood of the feedback-Nash equilibrium
policies μi∗ , ∀i ∈ N .

5.2.10 Simulations

One-Dimensional Example
This section provides a simulation example to demonstrate the applicability of the
developed technique. The agents are assumed to have the communication topology
as shown in Fig. 5.1 with unit pinning gains and edge weights. Agent motion is
described by identical nonlinear one-dimensional dynamics of the form (5.1) where
f_i(x_i) = θ_{i1} x_i + θ_{i2} x_i² and g_i(x_i) = cos(2x_i) + 2, ∀i = 1, . . . , 5. The ideal val-
ues of the unknown parameters are selected to be θi1 = 0, 0, 0.1, 0.5, and 0.2,
and θi2 = 1, 0.5, 1, 1, and 1, for i = 1, . . . , 5, respectively. The agents start at
xi = 2, ∀i, and their final desired locations with respect to each other are given
by xd12 = 0.5, xd21 = −0.5, xd43 = −0.5, and xd53 = −0.5. The leader traverses an
exponentially decaying trajectory x0 (t) = e−0.1t . The desired positions of agents 1
and 3 with respect to the leader are xd10 = 0.75 and xd30 = 1, respectively.
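A minimal sketch of the one-dimensional agent dynamics and leader trajectory used in this example, wired to a generic forward-Euler rollout, is given below; the controller argument is a placeholder hook for the learned policy, and the zero-input call at the end is only meant to exercise the code.

```python
import numpy as np

theta1 = np.array([0.0, 0.0, 0.1, 0.5, 0.2])      # theta_i1 for agents 1..5
theta2 = np.array([1.0, 0.5, 1.0, 1.0, 1.0])      # theta_i2 for agents 1..5

def f(i, x):                                      # drift f_i(x) = theta_i1 x + theta_i2 x^2
    return theta1[i] * x + theta2[i] * x ** 2

def g(i, x):                                      # control effectiveness g_i(x) = cos(2x) + 2
    return np.cos(2.0 * x) + 2.0

def leader(t):                                    # leader trajectory x_0(t) = exp(-0.1 t)
    return np.exp(-0.1 * t)

def simulate(controller, dt=0.005, T=30.0):
    """Forward-Euler rollout; controller(i, t, x) is a placeholder for the learned policy."""
    t_grid = np.arange(0.0, T, dt)
    x = 2.0 * np.ones(5)                          # all agents start at x_i = 2
    traj = np.zeros((t_grid.size, 5))
    for k, t in enumerate(t_grid):
        traj[k] = x
        u = np.array([controller(i, t, x) for i in range(5)])
        x = x + dt * np.array([f(i, x[i]) + g(i, x[i]) * u[i] for i in range(5)])
    return t_grid, traj

# Example: zero-input rollout (substitute the approximate feedback-Nash policy in practice).
t_grid, traj = simulate(lambda i, t, x: 0.0)
```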
For each agent i, five values of ei , three values of xi , and three values of errors cor-
responding to all the extended neighbors are selected for Bellman error extrapolation,
resulting in 5 × 3si total values of Ei . All agents estimate the unknown drift param-
eters using history stacks containing thirty points recorded online using a singular
value maximizing algorithm (cf. [22]), and compute the required state derivatives
using a fifth order Savitzky–Golay smoothing filter (cf. [23]).
[Fig. 5.1 Communication topology: a network containing five agents (reproduced with permission from [1], © 2016 IEEE).]
[Fig. 5.2 State trajectories for the five agents for the one-dimensional example; the dotted lines show the desired state trajectories (reproduced with permission from [1], © 2016 IEEE).]
[Fig. 5.3 Tracking error trajectories for the agents for the one-dimensional example (reproduced with permission from [1], © 2016 IEEE).]
Figures 5.2, 5.3, 5.4
and 5.5 show the tracking error, the state trajectories compared with the desired
trajectories, and the control inputs for all the agents demonstrating convergence to
the desired formation and the desired trajectory. Note that Agents 2, 4, and 5 do
not have a communication link to the leader, nor do they know their desired relative
position from the leader. The convergence to the desired formation is achieved via
cooperative control based on decentralized objectives. Figures 5.6 and 5.7 show the
evolution and convergence of the critic weights and the parameter estimates for the drift dynamics for Agent 1. See Table 5.1 for the optimal control problem parameters, basis functions, and adaptation gains for all the agents.
[Fig. 5.4 Trajectories of the control input for all agents for the one-dimensional example (reproduced with permission from [1], © 2016 IEEE).]
[Fig. 5.5 Trajectories of the relative control error for all agents for the one-dimensional example (reproduced with permission from [1], © 2016 IEEE).]
[Fig. 5.6 Critic weight estimates for Agent 1 for the one-dimensional example (reproduced with permission from [1], © 2016 IEEE).]
[Fig. 5.7 Drift dynamics parameter estimates for Agent 1 for the one-dimensional example; the dotted lines are the ideal values of the drift parameters (reproduced with permission from [1], © 2016 IEEE).]
The errors between the ideal
drift parameters and their respective estimates are large; however, as demonstrated
by Fig. 5.3, the resulting dynamics are sufficiently close to the actual dynamics for
the developed technique to generate stabilizing policies. It is unclear whether the
value function and the actor weights converge to their ideal values.
Two-Dimensional Example
In this simulation, the dynamics of all the agents are assumed to be exactly known,
and are selected to be of the form (5.1) where, for all i = 1, . . . , 5,

f_i(x_i) = \begin{bmatrix} −x_{i1} + x_{i2} \\ −0.5 x_{i1} − 0.5 x_{i2} \left( 1 − (\cos(2 x_{i1}) + 2)^2 \right) \end{bmatrix},  \quad  g_i(x_i) = \begin{bmatrix} \sin(2 x_{i1}) + 2 & 0 \\ 0 & \cos(2 x_{i1}) + 2 \end{bmatrix}.

The agents start at the origin, and their final desired relative positions are given by
xd12 = [−0.5, 1]T , xd21 = [0.5, −1]T , xd43 = [0.5, 1]T , and xd53 = [−1, 1]T .
The relative positions are designed such that the final desired formation is a pentagon
with the leader node at the center. The leader traverses a sinusoidal trajectory x0 (t) =
[2 sin(t), 2 sin(t) + 2 cos(t)]T . The desired positions of agents 1 and 3 with respect
to the leader are xd10 = [−1, 0]T and xd30 = [0.5, −1]T , respectively.
The optimal control problem parameters, basis functions, and adaptation gains for
the agents are available in Table 5.2. Nine values of ei , xi , and the errors corresponding
to all the extended neighbors are selected for Bellman error extrapolation for each
agent i on a uniform 3 × 3 grid in a 1 × 1 square around the origin, resulting in 9^{(s_i+1)} total values of E_i. Figures 5.8, 5.9, 5.10 and 5.11 show the tracking error, the
state trajectories, and the control inputs for all the agents demonstrating convergence
to the desired formation and the desired trajectory. Note that Agents 2, 4, and 5 do
not have a communication link to the leader, nor do they know their desired relative
position from the leader. The convergence to the desired formation is achieved via
cooperative control based on decentralized objectives.

Table 5.1 Simulation parameters for the one-dimensional example

              Agent 1     Agent 2     Agent 3     Agent 4     Agent 5
Q_i           10          10          10          10          10
R_i           0.1         0.1         0.1         0.1         0.1
x_i(0)        2           2           2           2           2
x̂_i(0)        0           0           0           0           0
Ŵ_ci(0)       1_{4×1}     1_{4×1}     1_{4×1}     1_{5×1}     3 × 1_{8×1}
Ŵ_ai(0)       1_{4×1}     1_{4×1}     1_{4×1}     1_{5×1}     3 × 1_{8×1}
θ̂_i(0)        0_{2×1}     0_{2×1}     0_{2×1}     0_{2×1}     0_{2×1}
Γ_i(0)        500 I_4     500 I_4     500 I_4     500 I_5     500 I_8
η_c1i         0.1         0.1         0.1         0.1         0.1
η_c2i         10          10          10          10          10
η_a1i         5           5           5           5           5
η_a2i         0.1         0.1         0.1         0.1         0.1
ν_i           0.005       0.005       0.005       0.005       0.005
Γ_θi          I_2         0.8 I_2     I_2         I_2         I_2
k_i           500         500         500         500         500
k_θi          30          30          25          20          30

Basis functions σ_i(E_i):
Agent 1: ½ [e_1², ½ e_1⁴, e_1² x_1², e_2²]^T
Agent 2: ½ [e_2², ½ e_2⁴, e_2² x_2², e_1²]^T
Agent 3: ½ [e_3², ½ e_3⁴, e_3² x_3², ½ e_3⁴ x_3²]^T
Agent 4: ½ [e_4², ½ e_4⁴, e_3² e_4², e_4² x_4², e_2³]^T
Agent 5: ½ [e_5², ½ e_5⁴, e_4² e_5², e_3² e_5², e_5² x_5², e_3² e_4², e_3², e_4²]^T
Table 5.2 Simulation parameters for the two-dimensional example

              Agent 1      Agent 2      Agent 3      Agent 4       Agent 5
Q_i           10 I_2       10 I_2       10 I_2       10 I_2        10 I_2
R_i           I_2          I_2          I_2          I_2           I_2
x_i(0)        0_{2×1}      0_{2×1}      0_{2×1}      0_{2×1}       0_{2×1}
Ŵ_ci(0)       1_{10×1}     1_{10×1}     2 × 1_{7×1}  5 × 1_{10×1}  3 × 1_{13×1}
Ŵ_ai(0)       1_{10×1}     1_{10×1}     2 × 1_{7×1}  5 × 1_{10×1}  3 × 1_{13×1}
Γ_i(0)        500 I_10     500 I_10     500 I_4      500 I_5       500 I_8
η_c1i         0.1          0.1          0.1          0.1           0.1
η_c2i         2.5          5            2.5          2.5           2.5
η_a1i         2.5          0.5          2.5          2.5           2.5
η_a2i         0.01         0.01         0.01         0.01          0.01
ν_i           0.005        0.005        0.005        0.005         0.005

Basis functions σ_i(E_i):
Agent 1: ½ [2e_11², 2e_11 e_12, 2e_12², e_21², 2e_21 e_22, e_22², e_11² x_11², e_12² x_11², e_11² x_12², e_12² x_12²]^T
Agent 2: ½ [2e_21², 2e_21 e_22, 2e_22², e_11², 2e_11 e_12, e_12², e_21² x_21², e_22² x_21², e_21² x_22², e_22² x_22²]^T
Agent 3: ½ [2e_31², 2e_31 e_32, 2e_32², e_31² x_31², e_32² x_31², e_31² x_32², e_32² x_32²]^T
Agent 4: ½ [2e_41², 2e_41 e_42, 2e_42², e_31², 2e_31 e_32, e_32², e_41² x_41², e_42² x_41², e_41² x_42², e_42² x_42²]^T
Agent 5: ½ [2e_51², 2e_51 e_52, 2e_52², e_41², 2e_41 e_42, e_42², e_31², 2e_31 e_32, e_32², e_51² x_51², e_52² x_51², e_51² x_52², e_52² x_52²]^T
[Fig. 5.8 Phase portrait in the state space for the two-dimensional example. The actual pentagonal formation is represented by a solid black pentagon, and the desired pentagonal formation around the leader is represented by a dotted black pentagon.]
[Fig. 5.9 Phase portrait of all agents in the error space for the two-dimensional example.]
Figure 5.12 shows the evolution and convergence of the actor weights for all the agents.
Three-Dimensional Example
To demonstrate the applicability of the developed method to nonholonomic systems, simulations are performed on a five-agent network of wheeled mobile robots. The dynamics of the wheeled mobile robots are given by

\dot{x}_i = g(x_i) u_i,  \quad  g(x_i) = \begin{bmatrix} \cos(x_{i3}) & 0 \\ \sin(x_{i3}) & 0 \\ 0 & 1 \end{bmatrix},

where xi j (t) denotes the jth element of the vector xi (t) ∈ R3 . The desired trajectory
is selected to be a circular trajectory that slowly comes to a halt after three rotations,
generated using the following dynamical system.
[Fig. 5.10 Trajectories of the control input for Agents 1–5 for the two-dimensional example.]
[Fig. 5.11 Trajectories of the relative control error for Agents 1–5 for the two-dimensional example.]

\dot{x}_0 = \begin{bmatrix} \frac{F_r}{T_p} \left( T_p^2 − x_{03}^2 \right) \cos(x_{03}) \\ \frac{F_r}{T_p} \left( T_p^2 − x_{03}^2 \right) \sin(x_{03}) \\ \frac{F_r}{T_p} \left( T_p^2 − x_{03}^2 \right) \end{bmatrix},
where x_{03} denotes the third element of x_0, and the parameters F_r and T_p are selected to be F_r = 0.1 and T_p = 6π. The desired formation and the communication topology are shown in Fig. 5.13. For each agent, a random point is selected in the state space at each control iteration for Bellman error extrapolation.
[Fig. 5.12 Actor weights for Agents 1–5 for the two-dimensional example.]
Figures 5.14, 5.15, 5.16 and 5.17 show the tracking error, the control inputs, and the actor weight estimates for all the agents, demonstrating convergence to the desired formation and the desired trajectory. Note that Agents 2, 3 and 5 do not have a direct communication link to the leader, nor do they know their desired relative position with respect to the leader.
[Fig. 5.13 Network topology.]
[Fig. 5.14 State trajectories and desired state trajectories as a function of time.]
[Fig. 5.15 Tracking error trajectories for the five agents.]
The convergence to the desired formation is achieved via cooperative control based
on decentralized objectives.
[Fig. 5.16 Control trajectories for the five agents.]
[Fig. 5.17 Actor weight estimates for the five agents.]

5.3 Reinforcement Learning-Based Network Monitoring²

5.3.1 Problem Description

Consider a network of N agents with a communication topology described by the directed graph G = {V, E}, where V = {1, 2, . . . , N} is the set of agents and E ⊆ V × V is the set of corresponding communication links. The set E contains an ordered pair (j, i) such that (j, i) ∈ E if agent j communicates information to agent i. The neighborhood of agent i is defined as N_i ≜ {j ∈ V | (j, i) ∈ E}, the set of all agents which communicate to i. It is assumed that the graph is simple, i.e., there are no self-loops: (i, i) ∉ E. Each communication link is weighted by a constant a_ij ∈ R, where a_ij > 0 if (j, i) ∈ E and a_ij = 0 otherwise. The graph adjacency matrix A ∈ R^{N×N} is constructed from these weights as A ≜ [a_ij | i, j ∈ V].
The interaction of the network leader with the other agents is described by the graph Ḡ = {V̄, Ē}, which is a supergraph of G that includes the leader, denoted by 0, such that V̄ = V ∪ {0}. The communication link set Ē is constructed such that Ē ⊃ E and (0, i) ∈ Ē if the leader communicates to i. The leader-included neighborhood is accordingly defined as N̄_i ≜ {j ∈ V̄ | (j, i) ∈ Ē}. Leader connections are weighted by the pinning matrix A_0 ∈ R^{N×N}, where A_0 ≜ diag(a_{i0}) | i ∈ V, and a_{i0} > 0 if (0, i) ∈ Ē and a_{i0} = 0 otherwise.
The objective of the follower agents is to synchronize in state towards the leader’s
(possibly time-varying) state. The dynamics for agent i ∈ V are

ẋi (t) = f i (xi (t)) + gi (t) u i (t) , (5.46)

where xi : R≥0 → S is the state, the set S ⊂ Rn is the agents’ state space, f i : S →
Rn is a locally Lipschitz function, gi ∈ Rn×m is a known constant matrix, u i : R≥0 →
Rm is a pre-established stabilizing and synchronizing control input, and the derivative
of the leader state, ẋ0 , is continuous and bounded.
The monitoring objective applied at each agent is to cooperatively monitor the
network for satisfaction of its control objective, wherein the network may be affected
by input disturbances that cause suboptimal performance. Moreover, the monitoring
protocol should be decentralized and passive, i.e., only information from one-hop
neighbors should be used and the protocol should not interfere with the monitored
systems.
For typical synchronization techniques, such as model predictive control, inverse-
optimal, or approximate dynamic programming, a control law is developed based on
a cost functional of the form
J_i(e_i(·), u_i(·)) ≜ ∫_0^{t_f} [ Q_i(e_i(τ)) + u_i^T(τ) R_i u_i(τ) ] dτ,    (5.47)

² Parts of the text in this section are reproduced, with permission, from [8], © 2015 IEEE.

where t f > 0 is the final time of the optimal control problem, Q i : Rn → R is a track-
ing error weighting function, Ri is a constant positive definite symmetric weighting
matrix, and e_i : R_{≥0} → R^n is the neighborhood tracking error defined as

e_i(t) ≜ ∑_{j∈N_i} a_ij ( x_i(t) − x_j(t) ) + a_i0 ( x_i(t) − x_0(t) ).

For notational brevity, let E denote the collection e1 , . . . , e N .


Even if a controller is not developed based on a cost function, such as in robust
and adaptive control, techniques exist which can be used to develop an expression
for a meaningful cost functional in the form of (5.47) for a given control policy
(cf. [24–26]). The following monitoring approach uses the cost functional in (5.47)
to observe how well the networked dynamical systems are satisfying optimality
conditions; specifically, satisfaction of the Hamilton–Jacobi–Bellman equation will
be monitored to determine how “closely” to optimal the networked systems are
operating. Assuming that an optimal controller exists, an equivalent representation
of the optimal control problem in (5.47) can be developed in terms of the value
function V_i : R^n × R → R, defined as

V_i(E, t) ≜ ∫_t^{t_f} [ Q_i(e_i(τ; t, e_i, u_i(·))) + u_i^T(τ) R_i u_i(τ) ] dτ,

which is minimized as

V_i^*(E, t) = min_{u_i ∈ U_i} ∫_t^{t_f} [ Q_i(e_i(τ; t, e_i, u_i(·))) + u_i^T(τ) R_i u_i(τ) ] dτ,    (5.48)

where Ui is the set of admissible controllers for agent i. Because the minimization
of the value function Vi is inherently coupled with the minimization of other value
functions in the network, the value function Vi can naturally be a function of the error
signal e j , j ∈ V, if there exists a directed path from the leader to agent i that includes
agent j. Using techniques similar to Theorem 5.3, it can be shown that provided a
feedback-Nash equilibrium solution exists and that the value functions in (5.48) are
continuously differentiable, a necessary and sufficient condition for feedback-Nash
equilibrium is given by the system of Hamilton–Jacobi equations

∑_{j∈V} ( ∂V_i^*(E, t)/∂e_j ∑_{k∈N̄_j} a_jk [ f_j(x_j) + g_j u_j^*(E, t) − f_k(x_k) − g_k u_k^*(E, t) ] )
  + Q_i(e_i) + u_i^{*T}(E, t) R_i u_i^*(E, t) + ∂V_i^*(E, t)/∂t ≡ 0.    (5.49)
Thus, one method for monitoring the network’s operating conditions is to monitor
the expression in (5.49), which equals zero for the implementation of optimal control
efforts.
182 5 Differential Graphical Games

Because the optimal value function V_i^* is often infeasible to solve analytically, an approximate dynamic programming-based approximation scheme is subsequently developed so that the approximate value of ∂V_i^*/∂t + H_i may be monitored. However, the Hamilton–Jacobi equation for agent i in (5.49) is inherently coupled with the state and control of every agent j ∈ V such that there exists a directed path from the leader to agent i. Consequently, checking for satisfaction of the Hamilton–Jacobi equations is seemingly unavoidably centralized in information communication. To overcome this restriction, the developed approximate dynamic programming-based approximation scheme is constructed such that only information concerning one-hop neighbors' states, one-hop neighbors' control policies, and time is necessary for value function approximation.
To make the current problem tenable, it is assumed that authentic information is
exchanged between the agents, i.e., communication is not maliciously compromised;
rather, the agents are cooperatively monitoring each other’s performance. If neces-
sary, communication authentication algorithms such as in [27] or [28] can be used
to verify digitally communicated information.
To evaluate the expression in (5.49), knowledge of the drift dynamics f i is
required. The following section provides a method to estimate the function f i using
a data-based approach.

5.3.2 System Identification

Assumption 5.10 The uncertain, locally Lipschitz drift dynamics, f i , are linear-in-
the-parameters, such that f i (xi ) = Yi (xi ) θi∗ , where Yi : Rn → Rn× pi is a known
regression matrix and θi∗ ∈ R pi is a vector of constant unknown parameters.

The function f̂_i : R^n × R^{p_i} → R^n is an estimate of the uncertain drift dynamics f_i and is defined as f̂_i(x_i, θ̂_i) ≜ Y_i(x_i) θ̂_i, where θ̂_i ∈ R^{p_i} is an estimate of the unknown vector θ_i^*. Estimation of θ_i^* is facilitated by construction of the identifier

x̂˙_i(t) = f̂_i(x_i(t), θ̂_i(t)) + g_i u_i(t) + k_xi x̃_i(t),    (5.50)

where x̃_i(t) ≜ x_i(t) − x̂_i(t) is the state estimation error, and k_xi ∈ R^{n×n} is a constant positive definite diagonal gain matrix. The state identification error dynamics are expressed using (5.46) and (5.50) as

x̃˙_i(t) = Y_i(x_i(t)) θ̃_i(t) − k_xi x̃_i(t),    (5.51)

where θ̃_i(t) ≜ θ_i^* − θ̂_i(t). The state estimator in (5.50) is used to develop a data-driven concurrent learning-based update law for θ̂_i(·) as

θ̂˙_i(t) = Γ_θi (Y_i(x_i(t)))^T x̃_i(t) + Γ_θi k_θi ∑_{ξ=1}^{K} (Y_i^ξ)^T ( ẋ_i^ξ − g_i u_i^ξ − Y_i^ξ θ̂_i(t) ),    (5.52)

where Γ_θi ∈ R^{p_i×p_i} is a constant positive definite symmetric gain matrix, k_θi ∈ R_{>0} is a constant concurrent learning gain, and the superscript (·)^ξ denotes evaluation at one of the unique recorded values in the state data stack {x_i^ξ | ξ = 1, . . . , K} or the corresponding control value data stack {u_i^ξ | ξ = 1, . . . , K}. It is assumed that these data stacks are recorded prior to use of the drift dynamics estimator. The following assumption specifies the necessary data richness of the recorded data.

Assumption 5.11 ([22]) There exists a finite set of collected data {x_i^ξ | ξ = 1, . . . , K} such that

rank( ∑_{ξ=1}^{K} (Y_i^ξ)^T Y_i^ξ ) = p_i.    (5.53)

Note that persistence of excitation is not mentioned as a necessity for this identifica-
tion algorithm; instead of guaranteeing data richness by assuming that the dynamics
are persistently exciting for all time, it is only assumed that there exists a finite set of
data points that provide the necessary data richness. This also eliminates the common
requirement for injection of a persistent dither signal to attempt to ensure persistence
of excitation, which would interfere with the monitored systems. Furthermore, con-
trary to persistence of excitation-based approaches, the condition in (5.53) can be
verified. Note that, because (5.52) depends on the state derivative ẋ_i^ξ at a past time instant, numerical techniques can be used to approximate ẋ_i^ξ from the preceding and subsequent recorded state information.
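To illustrate how the condition in (5.53) can be checked from recorded data, the following sketch (hypothetical Python/NumPy, with a made-up two-state regressor standing in for Y_i and illustrative recorded states) accumulates (Y_i^ξ)^T Y_i^ξ over a data stack and tests its rank:

import numpy as np

def regressor(x):
    # Placeholder regressor Y_i(x) in R^{n x p_i}; the true Y_i is system specific.
    return np.array([[x[0], x[1], x[0] * x[1]],
                     [x[1], x[0] ** 2, 1.0]])

def stack_is_rich(states, p_i):
    # Concurrent learning rank condition (5.53): rank( sum_xi Y_i^T Y_i ) = p_i.
    S = sum(regressor(x).T @ regressor(x) for x in states)
    return np.linalg.matrix_rank(S) == p_i, float(np.linalg.eigvalsh(S).min())

# A small recorded data stack of two-dimensional states (illustrative values only).
stack = [np.array([0.5, -0.2]), np.array([1.0, 0.3]), np.array([-0.7, 0.9])]
rich, min_eig = stack_is_rich(stack, p_i=3)
print(f"rank condition (5.53) satisfied: {rich}, smallest eigenvalue: {min_eig:.3f}")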
To facilitate an analysis of the performance of the identifier in (5.52), the identifier dynamics are expressed in terms of estimation errors as

θ̃˙_i(t) = −Γ_θi (Y_i(x_i(t)))^T x̃_i(t) − Γ_θi k_θi ∑_{ξ=1}^{K} (Y_i^ξ)^T ( ẋ_i^ξ − g_i u_i^ξ − Y_i^ξ θ̂_i(t) ).    (5.54)
To describe the performance of the identification of θ_i^*, consider the positive definite continuously differentiable Lyapunov function V_θi : R^{n+p_i} → [0, ∞) defined as

V_θi(z_i) ≜ (1/2) x̃_i^T x̃_i + (1/2) θ̃_i^T Γ_θi^{−1} θ̃_i,    (5.55)

where z_i ≜ [x̃_i^T, θ̃_i^T]^T. The expression in (5.55) satisfies the inequalities

V̲_θi ‖z_i‖² ≤ V_θi(z_i) ≤ V̄_θi ‖z_i‖²,    (5.56)



where V̲_θi ≜ (1/2) min{1, λ_min(Γ_θi^{−1})} and V̄_θi ≜ (1/2) max{1, λ_max(Γ_θi^{−1})} are positive known constants. Using the dynamics in (5.51) and (5.54), the time derivative of (5.55) is

V̇_θi(z_i) = −x̃_i^T k_xi x̃_i − θ̃_i^T k_θi ( ∑_{ξ=1}^{K} (Y_i^ξ)^T Y_i^ξ ) θ̃_i.    (5.57)

Note that because the matrix ∑_{ξ=1}^{K} (Y_i^ξ)^T Y_i^ξ is symmetric and positive semi-definite, its eigenvalues are real and greater than or equal to zero. Furthermore, by Assumption 5.11, none of the eigenvalues of ∑_{ξ=1}^{K} (Y_i^ξ)^T Y_i^ξ are equal to zero. Thus, all of the eigenvalues of the symmetric matrix ∑_{ξ=1}^{K} (Y_i^ξ)^T Y_i^ξ are positive, and the matrix is positive definite. Using this property and the inequalities in (5.56), (5.57) is upper bounded as

V̇_θi(z_i) ≤ −c_i ‖z_i‖² ≤ −(c_i / V̄_θi) V_θi(z_i),    (5.58)

where c_i ≜ min{ λ_min{k_xi}, k_θi λ_min{ ∑_{ξ=1}^{K} (Y_i^ξ)^T Y_i^ξ } }. The inequalities in (5.56) and (5.58) can then be used to conclude that ‖x̃_i(t)‖, ‖θ̃_i(t)‖ → 0 exponentially fast. Thus, the drift dynamics f_i(x_i) = Y_i(x_i) θ_i^* are exponentially identified.
Note that even with state derivative estimation errors, the parameter estimation
error θ̃i (·) can be shown to be uniformly ultimately bounded, where the magnitude
of the ultimate bound depends on the derivative estimation error [22].
Remark 5.12 Using an integral formulation, the system identifier can also be imple-
mented without using state-derivative measurements (see, e.g., [14]).

5.3.3 Value Function Approximation

For the approximate value of (5.49) to be evaluated for monitoring purposes, the
unknown optimal value function Vi∗ needs to be approximated for each agent i.
Because the coupled Hamilton–Jacobi equations are typically difficult to solve ana-
lytically, this section provides an approach to approximate Vi∗ using neural networks.
Assuming the networked agents' states remain bounded, the universal function approximation property of neural networks can be used with w_i neurons to equivalently represent V_i^* as

V_i^*(e_i, t′) = W_i^T σ_i(e_i, t′) + ε_i(e_i, t′),    (5.59)

where t′ ≜ t/t_f is the normalized time, W_i ∈ R^{w_i} is an unknown ideal neural network weight vector bounded above by a known constant W̄_i ∈ R_{>0} as ‖W_i‖ ≤ W̄_i, σ_i : S × [0, 1] → R^{w_i} is a bounded nonlinear continuously differentiable activation function with the property σ_i(0, 0) = 0, and ε_i : S × [0, 1] → R is the unknown function reconstruction error. From the universal function approximation property, the reconstruction error ε_i satisfies sup_{ρ∈S, ϕ∈[0,1]} |ε_i(ρ, ϕ)| ≤ ε̄_i, sup_{ρ∈S, ϕ∈[0,1]} |∂ε_i(ρ, ϕ)/∂ρ| ≤ ε̄_ei, and sup_{ρ∈S, ϕ∈[0,1]} |∂ε_i(ρ, ϕ)/∂ϕ| ≤ ε̄_ti, where ε̄_i, ε̄_ei, ε̄_ti ∈ R_{>0} are constant upper bounds.
Note that only the state of agent i, the states of the neighbors of agent i (j ∈ N̄_i), and time are used as arguments in the neural network representation of V_i^* in (5.59), instead of the states of all agents in the network. This is justified by treating the error states of other agents simply as functions of time, the effect of which is accommodated by including time in the basis function σ_i and the function reconstruction error ε_i. Inclusion of time in the basis function is feasible due to the finite horizon of the optimization problem in (5.47). Using state information from additional agents (e.g., two-hop communication) in the network may increase the practical fidelity of function reconstruction and may be done in an approach similar to that developed in this section.
Using this neural network representation, V_i^* is approximated for use in computing the Hamiltonian as

V̂_i(e_i, Ŵ_ci, t′) ≜ Ŵ_ci^T σ_i(e_i, t′),

where Ŵ_ci is an estimate of the ideal neural network weight vector W_i.


To facilitate the development of a feedback-based update policy to drive Ŵ_ci(·) towards W_i, the Bellman error for agent i is defined as

δ_i(t′) ≜ Ŵ_ci^T σ_ti(e_i, t′) + Ĥ_i(E_i, X_i, Ŵ_ci, ω̂_ai, t′) − ( ∂V_i^*(E, t)/∂t + H_i(E, X, U^*, V_i^*, t) ),    (5.60)

where σ_ti(e_i, t′) ≜ ∂σ_i(e_i, t′)/∂t′, E_i ≜ {e_j | j ∈ {i} ∪ N̄_i} is the set of error states of agent i and the neighbors of agent i, X_i ≜ {x_j | j ∈ {i} ∪ N̄_i} is the set of states of agent i and the neighbors of agent i, ω̂_ai ≜ {Ŵ_aj | j ∈ {i} ∪ N̄_i} is the set of actor weights of agent i and the neighbors of agent i, and H_i is the Hamiltonian defined as

H_i(E, X, U^*, V_i^*, t) ≜ Q_i(e_i) + u_i^{*T}(E, t) R_i u_i^*(E, t)
  + ∑_{j∈V} ( ∂V_i^*(E, t)/∂e_j ∑_{k∈N̄_j} a_jk [ f_j(x_j) + g_j u_j^*(E, t) − f_k(x_k) − g_k u_k^*(E, t) ] ),    (5.61)

where the sets X and U are defined as X ≜ {x_i | i ∈ V} and U ≜ {u_i | i ∈ V}, and Ĥ_i is the approximate Hamiltonian defined as

Ĥ_i(E_i, X_i, Ŵ_ci, ω̂_ai, t′) ≜ Q_i(e_i) + û_i^T(e_i, Ŵ_ai, t′) R_i û_i(e_i, Ŵ_ai, t′)
  + Ŵ_ci^T σ_ei(e_i, t′) [ ∑_{j∈N_i} a_ij ( f̂_i(x_i) + g_i û_i(e_i, Ŵ_ai, t′) − f̂_j(x_j) − g_j û_j(e_j, Ŵ_aj, t′) )
  + a_i0 ( f̂_i(x_i) + g_i û_i(e_i, Ŵ_ai, t′) − ẋ_0(t′) ) ],    (5.62)

where σ_ei(e_i, t′) ≜ ∂σ_i(e_i, t′)/∂e_i, û_i(e_i, Ŵ_ai, t′) ≜ −(1/2) ∑_{j∈N̄_i} a_ij R_i^{−1} g_i^T σ_ei^T(e_i, t′) Ŵ_ai is the approximated optimal control for agent i, and Ŵ_ai is another estimate of the ideal neural network weight vector W_i. Noting that the expression in (5.62) is measurable (assuming that the leader state derivative is available to those communicating with the leader), the Bellman error in (5.60) may be put into measurable form, after recalling that ∂V_i^*/∂t + H_i(E, X, U^*, V_i^*, t) ≡ 0, as

δ_i(t′) ≜ Ŵ_ci^T σ_ti(e_i, t′) + Ĥ_i(E_i, X_i, Ŵ_ci, ω̂_ai, t′),    (5.63)

which is the feedback to be used to train the neural network estimate Ŵci . The use
of the two neural network estimates Ŵci and Ŵai allows for least-squares based
adaptation for the feedback in (5.63), since only the use of Ŵci would result in
nonlinearity of Ŵci in (5.63).
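To make the structure of (5.62) and (5.63) concrete, the following sketch (hypothetical Python/NumPy with scalar agent states, an assumed polynomial-in-(e_i, t′) basis, and illustrative graph weights, gains, and dynamics estimates, none of which come from the original development) evaluates the measurable Bellman error for one agent with a single neighbor and a leader connection:

import numpy as np

# Assumed basis sigma_i(e, t') with w_i = 3, and its partial derivatives.
sigma_e = lambda e, tp: np.array([2*e, 2*e*tp, tp])   # d sigma / d e
sigma_t = lambda e, tp: np.array([0.0, e**2, e])      # d sigma / d t'

# Placeholder scalar dynamics estimates f_hat(x) = Y(x) theta_hat with g = 1.
Y = lambda x: np.array([x, x**2])
theta_hat_i, theta_hat_j = np.array([-1.0, 0.1]), np.array([-0.8, 0.0])
g_i = g_j = 1.0
Q_i, R_i = 2.0, 1.0
a_ij, a_i0 = 1.0, 1.0                                  # neighbor and leader weights

def u_hat(e, tp, W_a):
    # Approximate policy: -(1/2) * sum_{j in Nbar_i} a_ij * R^-1 g^T sigma_e^T W_a.
    return -0.5 * (a_ij + a_i0) / R_i * g_i * (sigma_e(e, tp) @ W_a)

def bellman_error(e_i, e_j, x_i, x_j, x0_dot, tp, W_ci, W_ai, W_aj):
    # Measurable Bellman error (5.63): W_ci^T sigma_t + H_hat_i from (5.62).
    u_i, u_j = u_hat(e_i, tp, W_ai), u_hat(e_j, tp, W_aj)
    f_i, f_j = Y(x_i) @ theta_hat_i, Y(x_j) @ theta_hat_j
    e_i_dot = a_ij * (f_i + g_i*u_i - f_j - g_j*u_j) + a_i0 * (f_i + g_i*u_i - x0_dot)
    H_hat = Q_i*e_i**2 + R_i*u_i**2 + (W_ci @ sigma_e(e_i, tp)) * e_i_dot
    return W_ci @ sigma_t(e_i, tp) + H_hat

delta = bellman_error(e_i=0.4, e_j=-0.2, x_i=1.0, x_j=0.8, x0_dot=0.1,
                      tp=0.3, W_ci=np.ones(3), W_ai=np.ones(3), W_aj=np.ones(3))
print(f"delta_i = {delta:.4f}")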
The difficulty in making a non-interfering monitoring scheme with approximate dynamic programming lies in obtaining sufficient data richness for learning. Contrary to typical approximate dynamic programming-based control methods, the developed data-driven adaptive learning policy uses concepts from concurrent learning (cf. [22, 29]) to provide data richness. Let s_i ≜ {ρ_l ∈ S | l = 1, . . . , |N̄_i| + 1} ∪ {ϕ | ϕ ∈ [0, t_f]} be a pre-selected sample point in the state space of agent i, its neighbors, and time. Additionally, let S_i ≜ {s_i^{c_l} | c_l = 1, . . . , L} be a collection of these unique sample points. The Bellman error will be evaluated over the set S_i in the neural network update policies to guarantee data richness. As opposed to the common practice of injecting an exciting signal into a system's control input to provide sufficient data richness for adaptive learning, this strategy evaluates the Bellman error at preselected points in the state space and time to mimic exploration of the state space. The following assumption specifies a sufficient condition on the set S_i for convergence of the subsequently defined update policies.
for convergence of the subsequently defined update policies.
Assumption 5.13 For each agent i ∈ V, the set of sample points Si satisfies
  T 
1  L
χicl χicl
μi  inf λmin > 0, (5.64)
L t∈[0,t f ]
c =1
γicl
l

where (·)  indicated agent, χ i 


cl th
denotes evaluation at the cl sample 
point
for the
σti +σei ˆ ˆ ˆ
j∈Ni ai j f i (x i ) +gi û i − f j x j − g j û j + ai0 f i (x i ) + gi û i − ẋ 0
5.3 Reinforcement Learning-Based Network Monitoring 187

is a regressor vector in the developed neural network update law, γi  1 + λi (χi )T


i χi provides normalization to the developed neural network update law, λi ∈ R>0
is a constant normalization gain, and i ∈ Rwi ×wi is a subsequently defined least-
squares positive definite gain matrix.
Note that when evaluating expressions at the pre-selected concurrent learning sample points, the current values of the parameter estimates are used since they are approximating constant values. In general, similar to the selection of a dither signal in persistence of excitation-based approaches (as in [30, 31]), the satisfaction of (5.64) cannot be guaranteed a priori. However, this strategy benefits from the ability to accommodate this condition by selecting more information than theoretically necessary (i.e., selecting sample points such that L ≫ w_i). Additionally, satisfaction of the condition in (5.64) up until the current time can be verified online.
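Because satisfaction of (5.64) can be verified online, one may simply track the minimum eigenvalue of the sampled outer-product sum; a minimal sketch is shown below (hypothetical Python/NumPy, with the regressor vectors χ_i^{c_l} and normalizations γ_i^{c_l} assumed to have been computed as in Assumption 5.13 and replaced here by random placeholders):

import numpy as np

def mu_estimate(chis, gammas):
    # lambda_min of (1/L) * sum_l chi_l chi_l^T / gamma_l, cf. (5.64).
    L = len(chis)
    S = sum(np.outer(c, c) / g for c, g in zip(chis, gammas)) / L
    return float(np.linalg.eigvalsh(S).min())

# Example with L = 5 randomly generated regressors of dimension w_i = 3.
rng = np.random.default_rng(0)
chis = [rng.standard_normal(3) for _ in range(5)]
gammas = [1.0 + c @ c for c in chis]          # placeholder normalization values
print(f"estimated mu_i = {mu_estimate(chis, gammas):.4f} (condition (5.64) requires mu_i > 0)")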
Using the measurable form of the Bellman error in (5.63) as feedback, a concurrent learning-based least-squares update policy is developed to approximate W_i as [29]

Ŵ˙_ci = −φ_c1i Γ_i (χ_i / γ_i) δ_i − (φ_c2i / L) Γ_i ∑_{c_l=1}^{L} (χ_i^{c_l} / γ_i^{c_l}) δ_i^{c_l},    (5.65)

Γ̇_i = ( β_i Γ_i − φ_c1i Γ_i (χ_i χ_i^T / γ_i²) Γ_i ) 1_{‖Γ_i‖ ≤ Γ̄_i},    (5.66)

where φ_c1i, φ_c2i ∈ R_{>0} are constant adaptation gains, β_i ∈ R_{>0} is a constant forgetting factor, Γ̄_i ∈ R_{>0} is a saturation constant, and Γ_i(0) is positive definite, symmetric, and bounded such that ‖Γ_i(0)‖ ≤ Γ̄_i. The form of the least-squares gain matrix update law in (5.66) is constructed such that Γ_i remains positive definite and

Γ̲_i ≤ ‖Γ_i(t)‖ ≤ Γ̄_i, ∀t ∈ R_{≥0},    (5.67)

where Γ̲_i ∈ R_{>0} is constant [32]. The neural network estimate Ŵ_ai is updated towards the estimate Ŵ_ci as

Ŵ˙_ai = −φ_a1i ( Ŵ_ai − Ŵ_ci ) − φ_a2i Ŵ_ai + ( φ_c1i G_σi^T Ŵ_ai χ_i^T / (4 γ_i) + ∑_{c_l=1}^{L} φ_c2i (G_σi^{c_l})^T Ŵ_ai (χ_i^{c_l})^T / (4 L γ_i^{c_l}) ) Ŵ_ci,    (5.68)

where φ_a1i, φ_a2i ∈ R_{>0} are constant adaptation gains, G_σi ≜ (∑_{j∈N̄_i} a_ij)² σ_ei G_i σ_ei^T ∈ R^{w_i×w_i}, and G_i ≜ g_i R_i^{−1} g_i^T.
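The update laws (5.65)–(5.68) can be integrated with any standard ODE solver; the following sketch (hypothetical Python/NumPy for a single agent, with the instantaneous and sampled regressors, normalizations, Bellman errors, and G_σi terms supplied as inputs and with illustrative gain values) shows one explicit-Euler step of the critic, least-squares gain, and actor updates:

import numpy as np

def critic_actor_step(W_c, W_a, Gamma, chi, gamma, delta,
                      chi_s, gamma_s, delta_s, G_sig, G_sig_s, dt,
                      phi_c1=1.0, phi_c2=1.0, phi_a1=1.0, phi_a2=0.1,
                      beta=0.3, Gamma_bar=1e3):
    # One Euler step of (5.65)-(5.68); the *_s arguments are lists over L sample points.
    L = len(chi_s)
    # Critic update (5.65): instantaneous term plus concurrent learning sum.
    W_c_dot = -phi_c1 * Gamma @ chi * (delta / gamma) \
              - (phi_c2 / L) * Gamma @ sum(c * (d / g) for c, g, d in zip(chi_s, gamma_s, delta_s))
    # Least-squares gain update (5.66), switched off once the gain saturates.
    if np.linalg.norm(Gamma) <= Gamma_bar:
        Gamma_dot = beta * Gamma - phi_c1 * Gamma @ np.outer(chi, chi) @ Gamma / gamma**2
    else:
        Gamma_dot = np.zeros_like(Gamma)
    # Actor update (5.68): driven toward the critic weights plus correction terms.
    corr = (phi_c1 / (4.0 * gamma)) * (G_sig.T @ W_a) * (chi @ W_c)
    corr = corr + sum((phi_c2 / (4.0 * L * g)) * (Gs.T @ W_a) * (c @ W_c)
                      for c, g, Gs in zip(chi_s, gamma_s, G_sig_s))
    W_a_dot = -phi_a1 * (W_a - W_c) - phi_a2 * W_a + corr
    return W_c + dt * W_c_dot, W_a + dt * W_a_dot, Gamma + dt * Gamma_dot

# Illustrative call with w_i = 3 and L = 2 sample points (all data made up).
rng = np.random.default_rng(1)
w = 3
W_c, W_a, Gamma = np.ones(w), np.ones(w), np.eye(w)
chi, gamma, delta = rng.standard_normal(w), 1.5, 0.2
chi_s = [rng.standard_normal(w) for _ in range(2)]
gamma_s, delta_s = [1.2, 1.4], [0.1, -0.3]
G_sig, G_sig_s = np.eye(w), [np.eye(w), np.eye(w)]
W_c, W_a, Gamma = critic_actor_step(W_c, W_a, Gamma, chi, gamma, delta,
                                    chi_s, gamma_s, delta_s, G_sig, G_sig_s, dt=0.01)
print(W_c, W_a)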

5.3.4 Stability Analysis

The following theorem summarizes the stability properties of the leader-follower


network.
Theorem 5.14 For every agent i ∈ V, the identifier in (5.52) along with the adaptive update laws in (5.65)–(5.68) guarantees that the estimation errors W̃_ci ≜ W_i − Ŵ_ci and W̃_ai ≜ W_i − Ŵ_ai uniformly converge in a finite time T_W to a ball B_r, provided Assumptions 5.10–5.13 hold, the adaptation gains are selected sufficiently large, and the concurrent learning sample points are selected to produce a sufficiently large μ_i, where T_W and r can be made smaller by selecting more concurrent learning sample points to increase the value of μ_i and increasing the gains k_xi, k_θi, φ_a1i, φ_a2i, and φ_c2i.

Proof (Sketch) Consider the Lyapunov function

V_L ≜ ∑_{i∈V} ( (1/2) W̃_ci^T Γ_i^{−1} W̃_ci + (1/2) W̃_ai^T W̃_ai + V_θi(z_i) ).    (5.69)

An upper bound of V̇_L along the trajectories of (5.51), (5.54), (5.65), and (5.68) can be obtained after expressing δ_i in terms of estimation errors, using the property (d/dt) Γ_i^{−1} = −Γ_i^{−1} Γ̇_i Γ_i^{−1}, using the inequality ‖χ_i‖/γ_i ≤ 1/(2√(λ_i Γ̲_i)), applying the Cauchy–Schwarz and triangle inequalities, and performing nonlinear damping, such that V̇_L is bounded from above by an expression that is negative definite in terms of the state of V_L plus a constant upper-bounding term. Using [21, Theorem 4.18], it can be shown that the estimation errors are uniformly ultimately bounded.

5.3.5 Monitoring Protocol

With the estimation of the unknown neural network weight W_i by Ŵ_ci from the previous section, the performance of an agent in satisfying the Hamilton–Jacobi–Bellman optimality constraint can be monitored through use of V̂_i = Ŵ_ci^T σ_i, the approximation of V_i^*. From (5.49), ∂V_i^*/∂t + H_i(E, X, U^*, V_i^*, t) ≡ 0 (where U^* denotes the optimal control efforts). Let M_i ∈ R denote the signal to be monitored by agent i ∈ V, defined as

M_i(e_i, X_i, Ŵ_ci, U_i, t′) = | Ŵ_ci^T σ_ti + Ŵ_ci^T σ_ei [ ∑_{j∈N_i} a_ij ( f̂_i(x_i) + g_i u_i − f̂_j(x_j) − g_j u_j ) + a_i0 ( f̂_i(x_i) + g_i u_i − ẋ_0 ) ] + Q_i(e_i) + u_i^T R_i u_i |,

which differs from (5.63) in that the measured values of the control efforts are used, where U_i ≜ {u_j | j ∈ {i} ∪ N_i}. Because the identification of f_i is performed exponentially fast and the uniform convergence of W̃_ci to a ball around the origin occurs in the finite learning time T_W, the monitored signal M_i satisfies the relationship | ∂V_i^*/∂t + H_i(E, X, U^*, V_i^*, t) − M_i(e_i, X_i, Ŵ_ci, U_i^*, t′) | < ς for all t ≥ T_W, where ς ∈ R_{>0} is a bounded constant that can be made smaller by selecting larger gains and more concurrent learning sample points. In other words, when using the neural network approximation of V_i^* with appropriate gains and sample points, M_i(e_i, X_i, Ŵ_ci, U_i^*, t′) ≈ 0, where U_i^* ≜ {u_j^* | j ∈ {i} ∪ N_i}. In this manner, due to continuity of the Hamiltonian, observing large values of M_i(e_i, X_i, Ŵ_ci, U_i, t′) indicates significant deviation away from optimal operating conditions. Let M̄ ∈ R_{>0} be a constant threshold used for the monitoring process which satisfies M̄ > ς, where the value of M̄ can be increased if a greater tolerance for sub-optimal performance is acceptable. The monitoring protocol, which is separated into a learning phase and a monitoring phase, can be summarized as follows.

Algorithm 5.1 Monitoring protocol.

Learning phase:
  For each agent i ∈ V, use the update policies in (5.50), (5.52), (5.65), (5.66), and (5.68) to train θ̂_i and Ŵ_ci, the estimates of θ_i^* and W_i.
Monitoring phase:
  At t = T_W, terminate the updates to θ̂_i and Ŵ_ci.
  For each agent i ∈ V, monitor the value of M_i(e_i, X_i, Ŵ_ci, U_i, t′) using θ̂_i, Ŵ_ci, the neighbor communication (x_j, u_j, f̂_j, g_j), and ẋ_0 if (0, i) ∈ Ē.
  If M_i > M̄, then undesirable performance has been observed.
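A minimal sketch of the threshold test in the monitoring phase of Algorithm 5.1 is given below (hypothetical Python/NumPy; the monitored-signal history and the threshold are illustrative numbers only):

import numpy as np

def monitor(M_values, M_bar):
    # Return indices of samples at which the monitored signal exceeds the threshold M_bar.
    return np.flatnonzero(np.asarray(M_values) > M_bar)

# Illustrative monitored-signal history after the learning phase (made-up numbers).
M_history = [0.02, 0.05, 0.04, 0.90, 1.10, 0.03]
print("undesirable performance observed at samples:", monitor(M_history, M_bar=0.5))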

5.4 Background and Further Reading

Online real-time solutions to differential games are presented in results such as [15,
16, 33–35]; however, since these results solve problems with centralized objectives
(i.e., each agent minimizes or maximizes a cost function that penalizes the states of
all the agents in the network), they are not applicable for a network of agents with
independent decentralized objectives (i.e., each agent minimizes or maximizes a cost
function that penalizes only the error states corresponding to itself).
Various methods have been developed to solve formation tracking problems for
linear systems (cf. [36–40] and the references therein). An optimal control approach
is used in [41] to achieve consensus while avoiding obstacles. In [42], an optimal
controller is developed for agents with known dynamics to cooperatively track a
desired trajectory. In [43], an inverse optimal controller is developed for unmanned

aerial vehicles to cooperatively track a desired trajectory while maintaining a desired


formation. In [44], a differential game-based approach is developed for unmanned
aerial vehicles to achieve distributed Nash strategies. In [45], an optimal consensus
algorithm is developed for a cooperative team of agents with linear dynamics using
only partial information. A value function approximation based approach is presented
in [17] for cooperative synchronization in a strongly connected network of agents
with known linear dynamics.
For nonlinear systems, model-predictive control-based approaches ([46, 47]) and
approximate dynamic programming-based approaches ([17, 48]) have been pro-
posed. A model-predictive control-based approach is presented in [46]; however, no
stability or convergence analysis is presented. A stable distributed model-predictive
control-based approach is presented in [47] for nonlinear discrete-time systems with
known nominal dynamics. Asymptotic stability is proved without any interaction
between the nodes; however, a nonlinear optimal control problem needs to be solved
at every iteration to implement the controller. An optimal tracking approach for
formation control is presented in [48] using single network adaptive critics where
the value function is learned offline. Online feedback-Nash equilibrium solution of
differential graphical games in a network of agents with continuous-time uncertain
nonlinear dynamics has remained an open problem. Recently, a leader-based consensus algorithm was developed in [49], where an exact model of the system dynamics is utilized and convergence to optimality is obtained under a persistence of excitation condition. The model-predictive control-based controllers require extensive
numerical computations and lack stability and optimality guarantees. The approx-
imate dynamic programming-based approaches either require offline computations
or are suboptimal because not all the inter-agent interactions are considered in the
value function.
Efforts have been made to structure the layout of a network such that the impact
of network corruption can be abated [50, 51]. Researchers have also investigated
the ability to allay types of network subterfuge by creating control algorithms which
are resilient to attacks on sensors and actuators [52]. Other efforts seek to have
agents detect undesired performance in their network neighbors. The results in [53]
and [54] provide methods to detect “sudden” faulty behavior which is modeled
as a step function multiplied by fault dynamics. Other works develop procedures
to detect generally undesired behavior in networks of linear systems [27, 55] and
unpredictable state trajectories of nonlinear systems using neural networks [56, 57].
Adaptive thresholds used for determining if the state of a neighboring agent is within
an acceptable tolerance are developed in [58] and [59].
Contemporary results on online near-optimal control of multi-agent systems
include disturbance rejection methods [60], off-policy methods [61], and Q-learning
methods [62]. For a complete description of recent developments on online methods
to solve multiplayer games, see [63].

References

1. Kamalapurkar R, Klotz JR, Walters P, Dixon WE (2018) Model-based reinforcement learning


for differential graphical games. IEEE Trans Control Netw Syst 5:423–433
2. Case J (1969) Toward a theory of many player differential games. SIAM J Control 7:179–197
3. Starr A, Ho CY (1969) Nonzero-sum differential games. J Optim Theory Appl 3(3):184–206
4. Starr A, Ho CY (1969) Further properties of nonzero-sum differential games. J Optim Theory
Appl 4:207–219
5. Friedman A (1971) Differential games. Wiley
6. Bressan A, Priuli FS (2006) Infinite horizon noncooperative differential games. J Differ Equ
227(1):230–257
7. Bressan A (2011) Noncooperative differential games. Milan J Math 79(2):357–427
8. Klotz J, Andrews L, Kamalapurkar R, Dixon WE (2015) Decentralized monitoring of leader-
follower networks of uncertain nonlinear systems. In: Proceedings of the American control
conference, pp 1393–1398
9. Khoo S, Xie L (2009) Robust finite-time consensus tracking algorithm for multirobot systems.
IEEE/ASME Trans Mechatron 14(2):219–228
10. Liberzon D (2012) Calculus of variations and optimal control theory: a concise introduction.
Princeton University Press
11. Kamalapurkar R, Andrews L, Walters P, Dixon WE (2017) Model-based reinforcement learn-
ing for infinite-horizon approximate optimal tracking. IEEE Trans Neural Netw Learn Syst
28(3):753–758
12. Chowdhary GV, Johnson EN (2011) Theory and flight-test validation of a concurrent-learning
adaptive controller. J Guid Control Dyn 34(2):592–607
13. Kamalapurkar R, Walters P, Dixon WE (2016) Model-based reinforcement learning for approx-
imate optimal regulation. Automatica 64:94–104
14. Bell Z, Parikh A, Nezvadovitz J, Dixon WE (2016) Adaptive control of a surface marine craft
with parameter identification using integral concurrent learning. In: Proceedings of the IEEE
conference on decision and control, pp 389–394
15. Vamvoudakis KG, Lewis FL (2011) Multi-player non-zero-sum games: online adaptive learn-
ing solution of coupled hamilton-jacobi equations. Automatica 47:1556–1569
16. Johnson M, Bhasin S, Dixon WE (2011) Nonlinear two-player zero-sum game approximate
solution using a policy iteration algorithm. In: Proceedings of the IEEE conference on decision
and control, pp 142–147
17. Vamvoudakis KG, Lewis FL, Hudas GR (2012) Multi-agent differential graphical games: online
adaptive learning solution for synchronization with optimality. Automatica 48(8):1598–1611
18. Kamalapurkar R, Dinh HT, Walters P, Dixon WE (2013) Approximate optimal cooperative
decentralized control for consensus in a topological network of agents with uncertain nonlinear
dynamics. In: Proceedings of the American control conference, Washington, DC, pp 1322–1327
19. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
20. Kamalapurkar R, Dinh H, Bhasin S, Dixon WE (2015) Approximate optimal trajectory tracking
for continuous-time nonlinear systems. Automatica 51:40–48
21. Khalil HK (2002) Nonlinear Systems, 3rd edn. Prentice Hall, Upper Saddle River, NJ
22. Chowdhary G, Yucelen T, Mühlegg M, Johnson EN (2013) Concurrent learning adaptive control
of linear systems with exponentially convergent bounds. Int J Adapt Control Signal Process
27(4):280–301
23. Savitzky A, Golay MJE (1964) Smoothing and differentiation of data by simplified least squares
procedures. Anal Chem 36(8):1627–1639
24. Krstic M, Li ZH (1998) Inverse optimal design of input-to-state stabilizing nonlinear con-
trollers. IEEE Trans Autom Control 43(3):336–350
25. Mombaur K, Truong A, Laumond JP (2010) From human to humanoid locomotion - an inverse
optimal control approach. Auton Robots 28(3):369–383

26. Ratliff ND, Bagnell JA, Zinkevich MA (2006) Maximum margin planning. In: Proceedings of
the international conference on machine learning
27. Pang Z, Liu G (2012) Design and implementation of secure networked predictive control
systems under deception attacks. IEEE Trans Control Syst Technol 20(5):1334–1342
28. Clark A, Zhu Q, Poovendran R, Başar T (2013) An impact-aware defense against stuxnet. In:
Proceedings of the American control conference, pp 4146–4153
29. Kamalapurkar R, Walters P, Dixon WE (2013) Concurrent learning-based approximate optimal
regulation. In: Proceedings of the IEEE conference on decision and control, Florence, IT, pp
6256–6261
30. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
31. Vamvoudakis KG, Lewis FL (2009) Online synchronous policy iteration method for optimal
control. In: Yu W (ed) Recent advances in intelligent control systems, Springer, pp 357–374
32. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall
33. Johnson M, Hiramatsu T, Fitz-Coy N, Dixon WE (2010) Asymptotic stackelberg optimal
control design for an uncertain Euler-Lagrange system. In: Proceedings of the IEEE conference
on decision and control, Atlanta, GA, pp 6686–6691
34. Vamvoudakis KG, Lewis FL (2010) Online neural network solution of nonlinear two-player
zero-sum games using synchronous policy iteration. In: Proceedings of the IEEE conference
on decision and control
35. Vrabie D, Lewis FL (2010) Integral reinforcement learning for online computation of feedback
nash strategies of nonzero-sum differential games. In: Proceedings of the IEEE conference on
decision and control, pp 3066–3071
36. Lewis M, Tan K (1997) High precision formation control of mobile robots using virtual struc-
tures. Auton Robots 4(4):387–403
37. Balch T, Arkin R (1998) Behavior-based formation control for multirobot teams. IEEE Trans
Robot Autom 14(6):926–939
38. Das A, Fierro R, Kumar V, Ostrowski J, Spletzer J, Taylor C (2002) A vision-based formation
control framework. IEEE Trans Robot Autom 18(5):813–825
39. Fax J, Murray R (2004) Information flow and cooperative control of vehicle formations. IEEE
Trans Autom Control 49(9):1465–1476
40. Murray R (2007) Recent research in cooperative control of multivehicle systems. J Dyn Syst
Meas Control 129:571–583
41. Wang J, Xin M (2010) Multi-agent consensus algorithm with obstacle avoidance via optimal
control approach. Int J Control 83(12):2606–2621
42. Wang J, Xin M (2012) Distributed optimal cooperative tracking control of multiple autonomous
robots. Robot Auton Syst 60(4):572–583
43. Wang J, Xin M (2013) Integrated optimal formation control of multiple unmanned aerial
vehicles. IEEE Trans Control Syst Technol 21(5):1731–1744
44. Lin W (2014) Distributed uav formation control using differential game approach. Aerosp Sci
Technol 35:54–62
45. Semsar-Kazerooni E, Khorasani K (2008) Optimal consensus algorithms for cooperative team
of agents subject to partial information. Automatica 44(11):2766–2777
46. Shim DH, Kim HJ, Sastry S (2003) Decentralized nonlinear model predictive control of multiple
flying robots. Proceedings of the IEEE conference on decision and control 4:3621–3626
47. Magni L, Scattolini R (2006) Stabilizing decentralized model predictive control of nonlinear
systems. Automatica 42(7):1231–1236
48. Heydari A, Balakrishnan SN (2012) An optimal tracking approach to formation control of
nonlinear multi-agent systems. In: Proceedings of AIAA guidance, navigation and control
conference
49. Zhang H, Zhang J, Yang GH, Luo Y (2015) Leader-based optimal coordination control for the
consensus problem of multiagent differential games via fuzzy adaptive dynamic programming.
IEEE Trans Fuzzy Syst 23(1):152–163

50. Sundaram S, Revzen S, Pappas G (2012) A control-theoretic approach to disseminating values


and overcoming malicious links in wireless networks. Automatica 48(11):2894–2901
51. Abbas W, Egerstedt M (2012) Securing multiagent systems against a sequence of intruder
attacks. In: Proceedings of the American control conference, pp 4161–4166
52. Fawzi H, Tabuada P, Diggavi S (2012) Security for control systems under sensor and actuator
attacks. In: Proceedings of the IEEE conference on decision and control, pp 3412–3417
53. Jung D, Selmic RR (2008) Power leader fault detection in nonlinear leader-follower networks.
In: Proceedings of the IEEE conference on decision and control, pp 404–409
54. Zhang X (2010) Decentralized fault detection for a class of large-scale nonlinear uncertain
systems. In: Proceedings of the American control conference, pp 5650–5655
55. Li X, Zhou K (2009) A time domain approach to robust fault detection of linear time-varying
systems. Automatica 45(1):94–102
56. Potula K, Selmic RR, Polycarpou MM (2010) Dynamic leader-followers network model of
human emotions and their fault detection. In: Proceedings of the IEEE conference on decision
and control, pp 744–749
57. Ferdowsi H, Raja DL, Jagannathan S (2012) A decentralized fault prognosis scheme for nonlin-
ear interconnected discrete-time systems. In: Proceedings of the American control conference,
pp 5900–5905
58. Luo X, Dong M, Huang Y (2006) On distributed fault-tolerant detection in wireless sensor
networks. IEEE Trans Comput 55(1):58–70
59. Fernández-Bes J, Cid-Sueiro J (2012) Decentralized detection with energy-aware greedy selec-
tive sensors. In: International workshop on cognitive information process, pp 1–6
60. Jiao Q, Modares H, Xu S, Lewis FL, Vamvoudakis KG (2016) Multi-agent zero-sum differential
graphical games for disturbance rejection in distributed control. Automatica 69:24–34
61. Li J, Modares H, Chai T, Lewis FL, Xie L (2017) Off-policy reinforcement learning for synchro-
nization in multiagent graphical games. IEEE Trans Neural Netw Learn Syst 28(10):2434–2445
62. Vamvoudakis KG (2017) Q-learning for continuous-time graphical games on large net-
works with completely unknown linear system dynamics. Int J Robust Nonlinear Control
27(16):2900–2920
63. Vamvoudakis KG, Modares H, Kiumarsi B, Lewis FL (2017) Game theory-based control
system algorithms with real-time reinforcement learning: How to solve multiplayer games
online. IEEE Control Syst 37(1):33–52
Chapter 6
Applications

6.1 Introduction

This chapter is dedicated to applications of model-based reinforcement learning


to closed-loop control of autonomous ground and marine vehicles. Marine craft,
which include ships, floating platforms, autonomous underwater vehicles, etc, play
a vital role in commercial, military, and recreational objectives. Marine craft are
often required to remain on a station for an extended period of time, e.g., floating oil
platforms, support vessels, and autonomous underwater vehicles acting as a commu-
nication link for multiple vehicles or persistent environmental monitors. The success
of the vehicle often relies on the vehicle’s ability to hold a precise station (e.g., station
keeping near structures or underwater features). The cost of holding that station is
correlated to the energy expended for propulsion through consumption of fuel and
wear on mechanical systems, especially when station keeping in environments with
a persistent current. Therefore, by reducing the energy expended for station keeping
objectives, the cost of holding a station can be reduced.
In this chapter, an optimal station keeping policy that captures the desire to bal-
ance the need to accurately hold a station and the cost of holding that station through
a quadratic performance criterion is generated for a fully actuated marine craft (see
also, [1]). The developed controller differs from results such as [2, 3] in that it
tackles the challenges associated with the introduction of a time-varying irrotational
current. Since the hydrodynamic parameters of a marine craft are often difficult
to determine, a concurrent learning system identifier is developed. As outlined in
[4], concurrent learning uses additional information from recorded data to remove
the persistence of excitation requirement associated with traditional system identi-
fiers. Due to a unique structure, the proposed model-based approximate dynamic
programming method generates the optimal station keeping policy using a combina-
tion of on-policy and off-policy data, eliminating the need for physical exploration
of the state-space. A Lyapunov-based stability analysis is presented which guar-
antees uniformly ultimately bounded convergence of the marine craft to its station


and uniformly ultimately bounded convergence of the approximated policy to the


optimal policy. The developed strategy is validated for planar motion of an
autonomous underwater vehicle. The experiments are conducted in a second-
magnitude spring located in central Florida.
Advances in sensing and computational capabilities have enabled autonomous
ground vehicles to become vital assets across multiple disciplines. This surge of
interest over the last few decades has drawn considerable attention to motion control
of mobile robots. As the technology matures, there is a desire to improve the per-
formance (e.g., minimum control effort, time, distance) of mobile robots to better
achieve their objectives.
Motivated by the desire for optimal path-following, an approximate dynamic
programming-based controller is developed for a unicycle-type mobile robot where
the optimal policy is parameterized by a neural network (see also, [5]). By simul-
taneously identifying and utilizing the feedback policy, the approximate dynamic
programming-based controller does not need offline training for new desired paths or
performance criteria. Path-following is achieved by tracking a virtual target placed
on the desired path. The motion of the virtual target is described by a predefined
state-dependent ordinary differential equation (cf. [6–8]). The state associated with
the virtual target’s location along the path is unbounded due to the infinite time hori-
zon of the guidance law, which presents several challenges related to the use of a
neural network. In addition, the vehicle requires a constant control effort to remain
on the path; therefore, any policy that results in path-following also results in infinite
cost, rendering the associated control problem ill-defined.
In this chapter, the motion of the virtual target is redefined to facilitate the use
of the neural network, and a modified control input is developed to render feasible
optimal policies. The cost function is formulated in terms of the modified control
and redefined virtual target motion, a unique challenge not addressed in previous
approximate dynamic programming literature. A Lyapunov-based stability analysis
is presented to establish uniformly ultimately bounded convergence of the approxi-
mate policy to the optimal policy and the vehicle state to the path while maintaining
a desired speed profile. Simulation results compare the policy obtained using the
developed technique to an offline numerical optimal solution. These results demon-
strate that the controller approximates the optimal solution with similar accuracy
as an offline numerical approach. Experimental results on a two-wheel differential
drive mobile robot demonstrate the ability of the controller to learn the approximate
optimal policy in real-time.

6.2 Station-Keeping of a Marine Craft

6.2.1 Vehicle Model

The nonlinear equations of motion for a marine craft, including the effects of irrota-
tional ocean current, are given by [9]

η̇(t) = J_E(η(t)) ν(t),    (6.1)

τ_b(t) = M_RB ν̇(t) + C_RB(ν(t)) ν(t) + M_A ν̇_r(t) + C_A(ν_r(t)) ν_r(t) + D_A(ν_r(t)) ν_r(t) + G(η(t)),    (6.2)

where ν : R≥t0 → Rn is the body-fixed translational and angular velocity vector,


νc : R≥t0 → Rn is the body-fixed irrotational current velocity vector, νr = ν − νc is
the relative body-fixed translational and angular fluid velocity vector, η : R≥t0 → Rn
is the earth-fixed position and orientation vector, JE : Rn → Rn×n is the coordinate
transformation between the body-fixed and earth-fixed coordinates,1 M R B ∈ Rn×n is
the constant rigid body inertia matrix, C R B : Rn → Rn×n is the rigid body centripetal
and Coriolis matrix, M A ∈ Rn×n is the constant hydrodynamic added mass matrix,
C A : Rn → Rn×n is the unknown hydrodynamic centripetal and Coriolis matrix, D A :
Rn → Rn×n is the unknown hydrodynamic damping and friction matrix, G : Rn →
Rn is the gravitational and buoyancy force and moment vector, and τb : R≥t0 → Rn
is the body-fixed force and moment control input.
For a three degree-of-freedom planar model with orientation represented as Euler angles, the state vectors in (6.1) and (6.2) are defined as

η ≜ [x y ψ]^T,    ν ≜ [u v r]^T,

where x, y : R_{≥t0} → R are the earth-fixed position vector components of the center of mass, ψ : R_{≥t0} → R represents the yaw angle, u, v : R_{≥t0} → R are the body-fixed translational velocities, and r : R_{≥t0} → R is the body-fixed angular velocity. The irrotational current vector is defined as

ν_c ≜ [u_c v_c 0]^T,

where u_c, v_c : R_{≥t0} → R are the body-fixed current translational velocities. The coordinate transformation J_E is given as

J_E(η) = ⎡ cos(ψ)  −sin(ψ)  0 ⎤
         ⎢ sin(ψ)   cos(ψ)  0 ⎥
         ⎣   0        0     1 ⎦.
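For the planar model, the kinematics in (6.1) reduce to a rotation of the body-fixed velocities into the earth-fixed frame; a short sketch (hypothetical Python/NumPy, with illustrative numbers) of J_E(η) and one evaluation of η̇ = J_E(η) ν is:

import numpy as np

def J_E(eta):
    # Planar body-to-earth coordinate transformation for eta = [x, y, psi].
    psi = eta[2]
    return np.array([[np.cos(psi), -np.sin(psi), 0.0],
                     [np.sin(psi),  np.cos(psi), 0.0],
                     [0.0,          0.0,         1.0]])

eta = np.array([1.0, -2.0, np.deg2rad(30.0)])   # earth-fixed pose [x, y, psi]
nu  = np.array([0.5,  0.1, 0.05])               # body-fixed velocities [u, v, r]
print("eta_dot =", J_E(eta) @ nu)               # kinematics (6.1)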

Assumption 6.1 The marine craft is neutrally buoyant if submerged and the center
of gravity is located vertically below the center of buoyancy on the z axis if the
vehicle model includes roll and pitch.

1 The orientation of the vehicle may be represented as Euler angles, quaternions, or angular rates.
In this development, the use of Euler angles is assumed, see Sect. 7.5 in [9] for details regarding
other representations.

Assumption 6.1 simplifies the subsequent analysis and can often be met by trimming
the vehicle. For marine craft where this assumption cannot be met, an additional term
may be added to the controller, similar to how terms dependent on the irrotational
current are handled.

6.2.2 System Identifier

Since the hydrodynamic effects pertaining to a specific marine craft may be unknown,
an online system identifier is developed for the vehicle drift dynamics. Consider the
control-affine form of the vehicle model,

ζ̇ (t) = Y (ζ (t) , νc (t)) θ + f 0 (ζ (t) , ν̇c (t)) + gτb (t) , (6.3)


where ζ ≜ [η^T ν^T]^T : R_{≥t0} → R^{2n} is the state vector. The unknown hydrodynamics are linear-in-the-parameters with p unknown parameters, where Y : R^{2n} × R^n → R^{2n×p} is the regression matrix and θ ∈ R^p is the vector of unknown parameters. The unknown hydrodynamic effects are modeled as

Y(ζ, ν_c) θ = ⎡ 0 ⎤
              ⎣ −M⁻¹ C_A(ν_r) ν_r − M⁻¹ D_A(ν_r) ν_r ⎦,

and the known rigid body drift dynamics f_0 : R^{2n} × R^n → R^{2n} are modeled as

f_0(ζ, ν̇_c) = ⎡ J_E(η) ν ⎤
              ⎣ M⁻¹ M_A ν̇_c − M⁻¹ C_RB(ν) ν − M⁻¹ G(η) ⎦,

where M ≜ M_RB + M_A, and the body-fixed current velocity ν_c and acceleration ν̇_c are assumed to be measurable. The body-fixed current velocity ν_c may be trivially measured using sensors commonly found on marine craft, such as a Doppler velocity log, while the current acceleration ν̇_c may be determined using numerical differentiation and smoothing. The known constant control effectiveness matrix g ∈ R^{2n×n} is defined as

g ≜ ⎡ 0 ⎤
    ⎣ M⁻¹ ⎦.

An identifier is designed as

ζ̂˙ (t) = Y (ζ (t) , νc (t)) θ̂ (t) + f 0 (ζ (t) , ν̇c (t)) + gτb (t) + kζ ζ̃ (t) , (6.4)

where ζ̃ ≜ ζ − ζ̂ is the measurable state estimation error, and k_ζ ∈ R^{2n×2n} is a constant positive definite, diagonal gain matrix. Subtracting (6.4) from (6.3) yields

ζ̃˙(t) = Y(ζ(t), ν_c(t)) θ̃(t) − k_ζ ζ̃(t),

where θ̃ ≜ θ − θ̂ is the parameter identification error.


Traditional adaptive control techniques require persistence of excitation to ensure
the parameter estimates θ̂ converge to their true values θ (cf. [10, 11]). Persistence
of excitation often requires an excitation signal to be applied to the vehicle’s input
resulting in unwanted deviations in the vehicle state. These deviations are often in
opposition to the vehicle’s control objectives. Alternatively, a concurrent learning-
based system identifier can be developed (cf. [12, 13]). The concurrent learning-based
system identifier relaxes the persistence of excitation requirement through the use of
a prerecorded history stack of state-action pairs.
Assumption 6.2 There exists a prerecorded data set of sampled data points {(ζ_j, ν_cj, ν̇_cj, τ_bj) ∈ χ | j = 1, 2, . . . , M} with numerically calculated state derivatives ζ̄˙_j at each recorded state-action pair such that, ∀t ∈ [0, ∞),

rank( ∑_{j=1}^{M} Y_j^T Y_j ) = p,    (6.5)

‖ζ̄˙_j − ζ̇_j‖ < d̄, ∀j,

where Y_j ≜ Y(ζ_j, ν_cj), f_0j ≜ f_0(ζ_j, ν̇_cj), ζ̇_j = Y_j θ + f_0j + g τ_bj, and d̄ ∈ [0, ∞) is a constant.
In this development, it is assumed that a data set of state-action pairs is available a
priori. Experiments to collect state-action pairs do not necessarily need to be con-
ducted in the presence of a current (e.g., the data may be collected in a pool). Since
the current affects the dynamics only through the νr terms, data that is sufficiently
rich and satisfies Assumption 6.2 may be collected by merely exploring the ζ state-
space. Note, this is the reason the body-fixed current νc and acceleration ν̇c are not
considered a part of the state. If state-action data is not available for the given system
then it is possible to build the history stack in real-time (see Appendix A.2.3).
The parameter estimate update law is

θ̂˙(t) = Γ_θ Y(ζ(t), ν_c(t))^T ζ̃(t) + Γ_θ k_θ ∑_{j=1}^{M} Y_j^T ( ζ̄˙_j − f_0j − g τ_bj − Y_j θ̂(t) ),    (6.6)

where Γ_θ is a positive definite diagonal gain matrix and k_θ is a positive scalar gain. To facilitate the stability analysis, the parameter estimate update law is expressed in the advantageous form

θ̂˙(t) = Γ_θ Y(ζ(t), ν_c(t))^T ζ̃(t) + Γ_θ k_θ ∑_{j=1}^{M} Y_j^T ( Y_j θ̃ + d_j ),

where d_j = ζ̄˙_j − ζ̇_j.
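One explicit-Euler integration step of the concurrent learning update law (6.6) can be sketched as follows (hypothetical Python/NumPy; the regressor values, the recorded history stack, and the gains are random placeholders, and the numerically differentiated state derivatives are assumed to be stored with the stack):

import numpy as np

def theta_hat_step(theta_hat, zeta_tilde, Y_now, stack, Gamma_theta, k_theta, dt):
    # One Euler step of (6.6); stack entries are (Y_j, zeta_bar_dot_j, f0_j, g_tau_bj).
    cl_term = sum(Yj.T @ (zdot_j - f0_j - gtau_j - Yj @ theta_hat)
                  for Yj, zdot_j, f0_j, gtau_j in stack)
    theta_hat_dot = Gamma_theta @ Y_now.T @ zeta_tilde + k_theta * Gamma_theta @ cl_term
    return theta_hat + dt * theta_hat_dot

# Illustrative dimensions: 2n = 4 states, p = 2 unknown parameters.
rng = np.random.default_rng(2)
Y_now = rng.standard_normal((4, 2))
stack = [(rng.standard_normal((4, 2)), rng.standard_normal(4),
          rng.standard_normal(4), rng.standard_normal(4)) for _ in range(5)]
theta_hat = theta_hat_step(np.zeros(2), zeta_tilde=rng.standard_normal(4),
                           Y_now=Y_now, stack=stack,
                           Gamma_theta=np.eye(2), k_theta=1.0, dt=0.01)
print("theta_hat =", theta_hat)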
To analyze the developed identifier, consider the candidate Lyapunov function V_P : R^{2n+p} → [0, ∞) defined as

V_P(Z_P) ≜ (1/2) ζ̃^T ζ̃ + (1/2) θ̃^T Γ_θ^{−1} θ̃,    (6.7)

where Z_P ≜ [ζ̃^T, θ̃^T]^T. The candidate Lyapunov function can be bounded as

(1/2) min{1, γ̲_θ} ‖Z_P‖² ≤ V_P(Z_P) ≤ (1/2) max{1, γ̄_θ} ‖Z_P‖²,    (6.8)

where γ̲_θ and γ̄_θ are the minimum and maximum eigenvalues of Γ_θ, respectively.
The time derivative of the candidate Lyapunov function in (6.7) is

V̇_P(t) = −ζ̃^T(t) k_ζ ζ̃(t) − k_θ θ̃^T(t) ∑_{j=1}^{M} Y_j^T Y_j θ̃(t) − k_θ θ̃^T(t) ∑_{j=1}^{M} Y_j^T d_j.

The time derivative can be upper bounded as

V̇_P(t) ≤ −k̲_ζ ‖ζ̃(t)‖² − k_θ y̲ ‖θ̃(t)‖² + k_θ d_θ ‖θ̃(t)‖,    (6.9)

where k̲_ζ and y̲ are the minimum eigenvalues of k_ζ and ∑_{j=1}^{M} Y_j^T Y_j, respectively, and d_θ = d̄ ∑_{j=1}^{M} ‖Y_j‖. After completing the squares, (6.9) can be upper bounded as

V̇_P(t) ≤ −k̲_ζ ‖ζ̃(t)‖² − (k_θ y̲ / 2) ‖θ̃(t)‖² + k_θ d_θ² / (2 y̲),

which may be further upper bounded as

V̇_P(t) ≤ −α_P ‖Z_P(t)‖², ∀ ‖Z_P(t)‖ ≥ K_P > 0,    (6.10)

where α_P ≜ (1/2) min{2 k̲_ζ, k_θ y̲} and K_P ≜ √( k_θ d_θ² / (2 α_P y̲) ). Using (6.8) and (6.10), ζ̃ and θ̃ can be shown to exponentially decay to an ultimate bound as t → ∞.

6.2.3 Problem Formulation

The presence of a time-varying irrotational current yields unique challenges in the


formulation of the optimal regulation problem. Since the current renders the system
non-autonomous, a residual model that does not include the effects of the irrotational

current is introduced. The residual model is used in the development of the optimal
control problem in place of the original model. A disadvantage of this approach is
that the optimal policy is developed for the current-free model. In the case where
the earth-fixed current is constant, the effects of the current may be included in the
development of the optimal control problem as detailed in Appendix A.3.2.
The dynamic constraints can be written in a control-affine form as

ζ̇ (t) = Yr es (ζ (t)) θ + f 0r es (ζ (t)) + gu (t) , (6.11)

where the unknown hydrodynamics are linear-in-the-parameters with p unknown parameters, Y_res : R^{2n} → R^{2n×p} is a regression matrix, the function f_0res : R^{2n} → R^{2n} is the known portion of the dynamics, and u : R_{≥t0} → R^n is the control vector. The drift dynamics, defined as f_res(ζ) ≜ Y_res(ζ) θ + f_0res(ζ), satisfy f_res(0) = 0 when Assumption 6.1 is satisfied.
The drift dynamics in (6.11) are modeled as

Y_res(ζ) θ = ⎡ 0 ⎤
             ⎣ −M⁻¹ C_A(ν) ν − M⁻¹ D(ν) ν ⎦,

f_0res(ζ) = ⎡ J_E(η) ν ⎤
            ⎣ −M⁻¹ C_RB(ν) ν − M⁻¹ G(η) ⎦,    (6.12)

and the virtual control vector u is defined as

u = τb − τc (ζ, νc , ν̇c ) , (6.13)

where τc : R2n × Rn × Rn → Rn is a feedforward term to compensate for the effect


of the variable current, which includes cross-terms generated by the introduction of
the residual dynamics and is given as

τc (ζ, νc , ν̇c ) = C A (νr ) νr + D (νr ) νr − M A ν̇c − C A (ν) ν − D (ν) ν.

The current feedforward term is represented in the advantageous form

τc (ζ, νc , ν̇c ) = −M A ν̇c + Yc (ζ, νc ) θ,

where Yc : R2n × Rn → R2n× p is the regression matrix and

Y_c(ζ, ν_c) θ = C_A(ν_r) ν_r + D(ν_r) ν_r − C_A(ν) ν − D(ν) ν.

Since the parameters are unknown, an approximation of the compensation term τc


given by

τ̂_c(ζ, ν_c, ν̇_c, θ̂) = −M_A ν̇_c + Y_c(ζ, ν_c) θ̂    (6.14)

is implemented, and the approximation error is defined by

τ̃_c ≜ τ_c − τ̂_c.
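The compensation term in (6.14) only requires the added-mass matrix, the measured current acceleration, the regression matrix, and the parameter estimate; a sketch with placeholder three degree-of-freedom values and illustrative dimensions (hypothetical Python/NumPy) is:

import numpy as np

def tau_c_hat(M_A, nu_c_dot, Y_c, theta_hat):
    # Approximate current feedforward (6.14): -M_A * nu_c_dot + Y_c * theta_hat.
    return -M_A @ nu_c_dot + Y_c @ theta_hat

M_A = np.diag([5.0, 7.0, 1.5])                          # placeholder added-mass matrix
nu_c_dot = np.array([0.02, -0.01, 0.0])                 # measured current acceleration
Y_c = np.array([[0.3, 0.0], [0.0, 0.4], [0.1, 0.1]])    # placeholder regressor evaluation
theta_hat = np.array([1.2, 0.8])                        # current parameter estimate
print("tau_c_hat =", tau_c_hat(M_A, nu_c_dot, Y_c, theta_hat))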

The performance index for the optimal regulation problem is selected as

J(t_0, ζ_0, u(·)) = ∫_{t_0}^{∞} r( ζ(τ; t_0, ζ_0, u(·)), u(τ) ) dτ,    (6.15)

where ζ(τ; t_0, ζ_0, u(·)) denotes a solution to (6.11), evaluated at t = τ, under the controller u(·), with the initial condition ζ_0 ≜ ζ(t_0), and r : R^{2n} × R^n → [0, ∞) is the local cost defined as

r(ζ, u) ≜ ζ^T Q ζ + u^T R u.    (6.16)

In (6.16), Q ∈ R^{2n×2n} and R ∈ R^{n×n} are symmetric positive definite weighting matrices. The matrix Q has the property q̲ ‖ξ_q‖² ≤ ξ_q^T Q ξ_q ≤ q̄ ‖ξ_q‖², ∀ξ_q ∈ R^{2n}, where q̲ and q̄ are positive constants. Assuming existence of the optimal controller, the infinite-time scalar value function V : R^{2n} → [0, ∞) for the optimal solution is written as

V(ζ) = min_{u(τ), τ∈[t,∞)} ∫_t^{∞} r( ζ(τ; t, ζ, u(·)), u(τ) ) dτ,    (6.17)

where the minimization is performed over the set of admissible controllers.


The objective of the optimal control problem is to find the optimal policy
u ∗ : R2n → Rn that minimizes the performance index (6.15) subject to the dynamic
constraints in (6.11). Assuming that a minimizing policy exists and the value func-
tion is continuously differentiable, the Hamilton–Jacobi–Bellman equation is given
as [14]

0 = ∇_t V^*(ζ) + r(ζ, u^*(ζ)) + ∇_ζ V^*(ζ) ( Y_res(ζ) θ + f_0res(ζ) + g u^*(ζ) ),    (6.18)

where ∇_t V^*(ζ) = 0 since the value function is not an explicit function of time. After substituting (6.16) into (6.18), the optimal policy is given by [14]

u^*(ζ) = −(1/2) R^{−1} g^T ( ∇_ζ V^*(ζ) )^T.    (6.19)

6.2.4 Approximate Policy

The subsequent development is based on a neural network approximation of the value function and optimal policy. Over any compact domain χ ⊂ R^{2n}, the optimal value function V^* : R^{2n} → [0, ∞) can be represented by a single-layer neural network with l neurons as

V^*(ζ) = W^T σ(ζ) + ε(ζ),    (6.20)

where W ∈ R^l is the ideal weight vector bounded above by a known positive constant, σ : R^{2n} → R^l is a bounded continuously differentiable activation function, and ε : R^{2n} → R is the bounded continuously differentiable function reconstruction error. Using (6.19) and (6.20), the optimal policy can be expressed as

u^*(ζ) = −(1/2) R^{−1} g^T ( ∇_ζ σ^T(ζ) W + ∇_ζ ε^T(ζ) ).    (6.21)

Based on (6.20) and (6.21), neural network approximations of the value function and the optimal policy are defined as

V̂(ζ, Ŵ_c) = Ŵ_c^T σ(ζ),    (6.22)

û(ζ, Ŵ_a) = −(1/2) R^{−1} g^T ∇_ζ σ^T(ζ) Ŵ_a,    (6.23)

where Ŵ_c, Ŵ_a : R_{≥t0} → R^l are estimates of the constant ideal weight vector W. The weight estimation errors are defined as W̃_c ≜ W − Ŵ_c and W̃_a ≜ W − Ŵ_a.
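As a concrete instance of (6.22) and (6.23), the following sketch (hypothetical Python/NumPy with an assumed polynomial basis for a two-dimensional state and a scalar control; all numbers are illustrative) evaluates the approximate value function and policy for given weight estimates:

import numpy as np

# Assumed polynomial basis sigma(zeta) = [z1^2, z1 z2, z2^2] and its Jacobian (l = 3).
sigma      = lambda z: np.array([z[0]**2, z[0]*z[1], z[1]**2])
grad_sigma = lambda z: np.array([[2*z[0], 0.0],
                                 [z[1],   z[0]],
                                 [0.0,    2*z[1]]])     # d sigma / d zeta, shape (l, 2n)

g = np.array([[0.0], [1.0]])        # control effectiveness (2n x n with 2n = 2, n = 1)
R_inv = np.array([[1.0]])

def V_hat(z, W_c):
    # Approximate value function (6.22).
    return W_c @ sigma(z)

def u_hat(z, W_a):
    # Approximate policy (6.23): -(1/2) R^{-1} g^T grad_sigma(z)^T W_a.
    return -0.5 * R_inv @ g.T @ grad_sigma(z).T @ W_a

z = np.array([0.5, -1.0])
W_c = W_a = np.array([1.0, 0.2, 0.5])
print("V_hat =", V_hat(z, W_c), " u_hat =", u_hat(z, W_a))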
Substituting (6.11), (6.22), and (6.23) into (6.18), the Bellman error, δ̂ : R^{2n} × R^p × R^l × R^l → R, is given as

δ̂(ζ, θ̂, Ŵ_c, Ŵ_a) = r(ζ, û(ζ, Ŵ_a)) + ∇_ζ V̂(ζ, Ŵ_c) ( Y_res(ζ) θ̂ + f_0res(ζ) + g û(ζ, Ŵ_a) ).    (6.24)

The Bellman error, evaluated along the system trajectories, can be expressed as

δ̂_t(t) ≜ δ̂(ζ(t), θ̂(t), Ŵ_c(t), Ŵ_a(t)) = r(ζ(t), û(ζ(t), Ŵ_a(t))) + Ŵ_c^T(t) ω(t),

where ω : R_{≥t0} → R^l is given by

ω(t) = ∇_ζ σ(ζ(t)) ( Y_res(ζ(t)) θ̂(t) + f_0res(ζ(t)) + g û(ζ(t), Ŵ_a(t)) ).

The Bellman error may be extrapolated to unexplored regions of the state-space since it depends solely on the approximated system model and the neural network weight estimates. In Sect. 6.2.5, Bellman error extrapolation is employed to establish uniformly ultimately bounded convergence of the approximate policy to the optimal policy without requiring persistence of excitation, provided the following assumption is satisfied.

Assumption 6.3 There exists a positive constant c and a set of states {ζ_k ∈ χ | k = 1, 2, . . . , N} such that

inf_{t∈[0,∞)} λ_min{ ∑_{k=1}^{N} ω_k(t) ω_k^T(t) / ρ_k(t) } = c,    (6.25)

where ω_k(t) ≜ ∇_ζ σ(ζ_k) ( Y_res(ζ_k) θ̂(t) + f_0res(ζ_k) + g û(ζ_k, Ŵ_a(t)) ) and ρ_k ≜ 1 + k_ρ ω_k^T Γ ω_k.
In general, the condition in (6.25) cannot be guaranteed to hold a priori; however, heuristically, the condition can be met by sampling redundant data (i.e., N ≫ l).
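Bellman error extrapolation can be sketched as follows (hypothetical Python/NumPy, reusing the style of the previous sketch with a placeholder residual model and basis): the fragment evaluates ω_k and the extrapolated Bellman error at preselected states and reports the minimum eigenvalue appearing in (6.25):

import numpy as np

# Assumed two-dimensional state, scalar control, polynomial basis (illustrative only).
grad_sigma = lambda z: np.array([[2*z[0], 0.0], [z[1], z[0]], [0.0, 2*z[1]]])
f_res_hat  = lambda z, th: np.array([z[1], th[0]*z[0] + th[1]*z[1]])   # placeholder model
g = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
k_rho = 0.1

def extrapolate(zs, theta_hat, W_c, W_a, Gamma):
    # Return extrapolated Bellman errors and lambda_min of sum_k w_k w_k^T / rho_k, cf. (6.25).
    S, deltas = np.zeros((3, 3)), []
    for z in zs:
        u = (-0.5 * np.linalg.inv(R) @ g.T @ grad_sigma(z).T @ W_a).item()   # policy (6.23)
        omega = grad_sigma(z) @ (f_res_hat(z, theta_hat) + (g * u).ravel())
        rho = 1.0 + k_rho * omega @ Gamma @ omega
        deltas.append(z @ Q @ z + u * R[0, 0] * u + W_c @ omega)             # Bellman error (6.24)
        S += np.outer(omega, omega) / rho
    return deltas, float(np.linalg.eigvalsh(S).min())

zs = [np.array([x, y]) for x in (-1.0, 0.0, 1.0) for y in (-1.0, 1.0)]
deltas, lam = extrapolate(zs, theta_hat=np.array([-1.0, -0.5]),
                          W_c=np.ones(3), W_a=np.ones(3), Gamma=np.eye(3))
print("minimum eigenvalue for (6.25):", round(lam, 4))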
The value function least-squares update law based on minimization of the Bellman error is given by

Ŵ˙_c(t) = −Γ(t) ( k_c1 (ω(t)/ρ(t)) δ̂_t(t) + (k_c2/N) ∑_{k=1}^{N} (ω_k(t)/ρ_k(t)) δ̂_tk(t) ),    (6.26)

Γ̇(t) = { β Γ(t) − k_c1 Γ(t) ( ω(t) ω^T(t) / ρ(t) ) Γ(t),   ‖Γ(t)‖ ≤ Γ̄,
       { 0,   otherwise,    (6.27)

where k_c1, k_c2 ∈ R are positive adaptation gains, δ̂_tk(t) ≜ δ̂(ζ_k, θ̂(t), Ŵ_c(t), Ŵ_a(t)) is the extrapolated approximate Bellman error, Γ(t_0) = Γ_0 with ‖Γ_0‖ ≤ Γ̄ is the initial adaptation gain, Γ̄ ∈ R is a positive saturation gain, β ∈ R is a positive forgetting factor, and ρ ≜ 1 + k_ρ ω^T Γ ω is a normalization term, where k_ρ ∈ R is a positive gain. The update laws in (6.26) and (6.27) ensure that

Γ̲ ≤ ‖Γ(t)‖ ≤ Γ̄, ∀t ∈ [0, ∞),

The actor neural network update law is given by

Ŵ˙_a(t) = proj{ −k_a ( Ŵ_a(t) − Ŵ_c(t) ) },    (6.28)

where k_a ∈ R is a positive gain and proj{·} is a smooth projection operator used to bound the weight estimates. Using properties of the projection operator, the actor neural network weight estimation error can be bounded above by a positive constant. See Sect. 4.4 in [11] or Remark 3.6 in [15] for details of smooth projection operators.
Using the definition in (6.13), the force and moment applied to the vehicle,
described in (6.3), is given in terms of the approximated optimal virtual control
(6.23) and the approximate compensation term in (6.14) as

τ̂_b(t) = û(ζ(t), Ŵ_a(t)) + τ̂_c(ζ(t), θ̂(t), ν_c(t), ν̇_c(t)).    (6.29)

6.2.5 Stability Analysis

An unmeasurable form of the Bellman error can be written using (6.18) and (6.24) as

δ̂_t = −W̃_c^T ω − W^T ∇_ζσ Y_res θ̃ − ∇_ζε ( Y_res θ + f_0res ) + (1/4) W̃_a^T G_σ W̃_a + (1/2) ∇_ζε G ∇_ζσ^T W + (1/4) ∇_ζε G ∇_ζε^T,    (6.30)

where G ≜ g R^{−1} g^T ∈ R^{2n×2n} and G_σ ≜ ∇_ζσ G ∇_ζσ^T ∈ R^{l×l} are symmetric, positive semi-definite matrices. Similarly, the Bellman error at the sampled data points can be written as

δ̂_tk = −W̃_c^T ω_k − W^T ∇_ζσ_k Y_resk θ̃ + (1/4) W̃_a^T G_σk W̃_a + E_k,    (6.31)

where

E_k ≜ (1/2) ∇_ζε_k G ∇_ζσ_k^T W + (1/4) ∇_ζε_k G ∇_ζε_k^T − ∇_ζε_k ( Y_resk θ + f_0resk ) ∈ R

is a constant at each data point, and the notation F_k denotes the function F(ζ, ·) evaluated at the sampled state (i.e., F_k(·) = F(ζ_k, ·)).
The functions Y_res and f_0res are Lipschitz continuous on the compact set χ and can be bounded by

‖Y_res(ζ)‖ ≤ L_Yres ‖ζ‖, ∀ζ ∈ χ,
‖f_0res(ζ)‖ ≤ L_f0res ‖ζ‖, ∀ζ ∈ χ,

respectively, where L_Yres and L_f0res are positive constants.


To facilitate the subsequent stability analysis, consider the candidate Lyapunov function V_L : R^{2n} × R^l × R^l × R^p → [0, ∞) given by

V_L(Z) = V(ζ) + (1/2) W̃_c^T Γ^{−1} W̃_c + (1/2) W̃_a^T W̃_a + V_P(Z_P),

where Z ≜ [ζ^T W̃_c^T W̃_a^T Z_P^T]^T ∈ χ × R^l × R^l × R^p. Since the value function V in (6.17) is positive definite, V_L can be bounded by

υ̲_L(‖Z‖) ≤ V_L(Z) ≤ ῡ_L(‖Z‖),    (6.32)

using [16, Lemma 4.3] and (6.8), where υ̲_L, ῡ_L : [0, ∞) → [0, ∞) are class K functions. Let β ⊂ χ × R^l × R^l × R^p be a compact set, and let ϕ_ζ, ϕ_c, ϕ_a, ϕ_θ, κ_c, κ_a, κ_θ, and κ be constants as defined in Appendix A.3.1. When Assumptions 6.2 and 6.3 and the sufficient gain conditions in Appendix A.3.1 are satisfied, the constant K ∈ R defined as

K ≜ √( κ_c²/(2αϕ_c) + κ_a²/(2αϕ_a) + κ_θ²/(2αϕ_θ) + κ/α )

is positive, where α ≜ (1/2) min{ϕ_ζ, ϕ_c, ϕ_a, ϕ_θ, 2 k̲_ζ}.

Theorem 6.4 If Assumptions 6.1–6.3, the sufficient conditions (A.33)–(A.35), and
$$K < \overline{\upsilon}_L^{-1}\left(\underline{\upsilon}_L(r)\right) \qquad (6.33)$$
are satisfied, where $r \in \mathbb{R}$ is the radius of the compact set $\beta$, then the policy in (6.23) with the neural network update laws in (6.26)–(6.28) guarantees uniformly ultimately bounded regulation of the state $\zeta$ and uniformly ultimately bounded convergence of the approximate policy $\hat{u}$ to the optimal policy $u^*$.
Proof The time derivative of the candidate Lyapunov function is
$$\dot{V}_L = \frac{\partial V}{\partial\zeta}\left(Y\theta + f_0\right) + \frac{\partial V}{\partial\zeta}g\left(\hat{u} + \hat{\tau}_c\right) - \tilde{W}_c^T\Gamma^{-1}\dot{\hat{W}}_c - \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\dot{\Gamma}\Gamma^{-1}\tilde{W}_c - \tilde{W}_a^T\dot{\hat{W}}_a + \dot{V}_P. \qquad (6.34)$$
Using (6.18), $\frac{\partial V}{\partial\zeta}\left(Y\theta + f_0\right) = -\frac{\partial V}{\partial\zeta}g\left(u^* + \tau_c\right) - r\left(\zeta, u^*\right)$. Then,
$$\dot{V}_L = \frac{\partial V}{\partial\zeta}g\left(\hat{u} + \hat{\tau}_c\right) - \frac{\partial V}{\partial\zeta}g\left(u^* + \tau_c\right) - r\left(\zeta, u^*\right) - \tilde{W}_c^T\Gamma^{-1}\dot{\hat{W}}_c - \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\dot{\Gamma}\Gamma^{-1}\tilde{W}_c - \tilde{W}_a^T\dot{\hat{W}}_a + \dot{V}_P.$$
Substituting (6.26) and (6.28) for $\dot{\hat{W}}_c$ and $\dot{\hat{W}}_a$, respectively, yields
$$\dot{V}_L = -\zeta^T Q\zeta - u^{*T}Ru^* + \frac{\partial V}{\partial\zeta}g\tilde{\tau}_c + \frac{\partial V}{\partial\zeta}g\hat{u} - \frac{\partial V}{\partial\zeta}gu^* + \tilde{W}_c^T\left(k_{c1}\frac{\omega}{\rho}\hat{\delta}_t + \frac{k_{c2}}{N}\sum_{k=1}^{N}\frac{\omega_k}{\rho_k}\hat{\delta}_{tk}\right)$$
$$\qquad + \tilde{W}_a^T k_a\left(\hat{W}_a - \hat{W}_c\right) - \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\left(\beta\Gamma - k_{c1}\Gamma\frac{\omega\omega^T}{\rho}\Gamma\right)\mathbf{1}_{\left\{\|\Gamma\|\leq\overline{\Gamma}\right\}}\Gamma^{-1}\tilde{W}_c + \dot{V}_P.$$

Using Young's inequality, (6.20), (6.21), (6.23), (6.30), and (6.31), the Lyapunov derivative can be upper bounded as
$$\dot{V}_L \leq -\varphi_\zeta\|\zeta\|^2 - \varphi_c\|\tilde{W}_c\|^2 - \varphi_a\|\tilde{W}_a\|^2 - \varphi_\theta\|\tilde{\theta}\|^2 - k_\zeta\|\tilde{\zeta}\|^2 + \kappa_a\|\tilde{W}_a\| + \kappa_c\|\tilde{W}_c\| + \kappa_\theta\|\tilde{\theta}\| + \kappa.$$
Completing the squares, the upper bound on the Lyapunov derivative may be written as
$$\dot{V}_L \leq -\frac{\varphi_\zeta}{2}\|\zeta\|^2 - \frac{\varphi_c}{2}\|\tilde{W}_c\|^2 - \frac{\varphi_a}{2}\|\tilde{W}_a\|^2 - \frac{\varphi_\theta}{2}\|\tilde{\theta}\|^2 - k_\zeta\|\tilde{\zeta}\|^2 + \frac{\kappa_c^2}{2\varphi_c} + \frac{\kappa_a^2}{2\varphi_a} + \frac{\kappa_\theta^2}{2\varphi_\theta} + \kappa,$$
which can be further upper bounded as
$$\dot{V}_L \leq -\alpha\|Z\|^2, \quad \forall\, \|Z\| \geq K > 0. \qquad (6.35)$$

Using (6.32), (6.33), and (6.35), [16, Theorem 4.18] is invoked to conclude that $Z$ is uniformly ultimately bounded, in the sense that $\limsup_{t\to\infty}\|Z(t)\| \leq \underline{\upsilon}_L^{-1}\left(\overline{\upsilon}_L(K)\right)$. Based on the definition of $Z$ and the inequalities in (6.32) and (6.35), $\zeta(\cdot), \tilde{W}_c(\cdot), \tilde{W}_a(\cdot) \in \mathcal{L}_\infty$. From the definition of $W$ and the neural network weight estimation errors, $\hat{W}_c(\cdot), \hat{W}_a(\cdot) \in \mathcal{L}_\infty$. Using the actor update laws, $\dot{\hat{W}}_a(\cdot) \in \mathcal{L}_\infty$. It follows that $t \mapsto \hat{V}(x(t)) \in \mathcal{L}_\infty$ and $t \mapsto \hat{u}\left(x(t), \hat{W}_a(t)\right) \in \mathcal{L}_\infty$. From the dynamics in (6.12), $\dot{\zeta}(\cdot) \in \mathcal{L}_\infty$. By the definition in (6.24), $\hat{\delta}_t(\cdot) \in \mathcal{L}_\infty$. By the definition of the normalized critic update law, $\dot{\hat{W}}_c(\cdot) \in \mathcal{L}_\infty$. □

6.2.6 Experimental Validation

The performance of the controller is demonstrated with experiments conducted at


Ginnie Springs in High Springs, FL. Ginnie Springs is a second-magnitude spring
discharging 142 million liters of freshwater daily with a spring pool measuring 27.4 m
in diameter and 3.7 m deep [17]. Ginnie Springs was selected because of its relatively
high flow rate and clear waters for vehicle observation. For clarity of exposition and
to remain within the vehicle’s depth limitations, the proposed method is implemented

Fig. 6.1 SubjuGator 7 autonomous underwater vehicle operating at Ginnie Springs, FL

on three degrees of freedom of an autonomous underwater vehicle (i.e., surge, sway,


yaw). The number of basis functions and weights required to support a six degree-
of-freedom model greatly increases from the set required for the three degree-of-
freedom model. The vehicle’s Doppler velocity log has a minimum height over
bottom of approximately 3 m that is required to measure water velocity. A minimum
depth of approximately 0.5 m is required to remove the vehicle from surface effects.
With the depth of the spring nominally 3.7 m, a narrow window of about 20 cm is left in which to operate the vehicle in heave.
Experiments were conducted on an autonomous underwater vehicle, SubjuGator
7, developed at the University of Florida. The autonomous underwater vehicle, shown
in Fig. 6.1, is a small, two-person-portable autonomous underwater vehicle with a mass
of 40.8 kg. The vehicle is over-actuated with eight bidirectional thrusters.
Designed to be modular, the vehicle has multiple specialized pressure vessels that
house computational capabilities, sensors, batteries, and mission specific payloads.
The central pressure vessel houses the vehicle’s motor controllers, network infras-
tructure, and core computing capability. The core computing capability services the
vehicle's environmental sensors (e.g., visible light cameras, scanning sonar, etc.), the vehicle's high-level mission planning, and low-level command and control software. A standard small form factor computer makes up the computing capability and utilizes a 2.13 GHz server-grade quad-core processor. Located near the front of the vehicle, the navigation vessel houses the vehicle's basic navigation sensors. The suite of navigation sensors includes an inertial measurement unit, a Doppler velocity log, a
depth sensor, and a digital compass. The navigation vessel also includes an embed-
ded 720 MHz processor for preprocessing and packaging navigation data. Along the

sides of the central pressure vessel, two vessels house 44 Ah of batteries used for
propulsion and electronics.
The vehicle’s software runs within the Robot Operating System framework in
the central pressure vessel. For the experiment, three main software nodes were
used: navigation, control, and thruster mapping nodes. The navigation node receives
packaged navigation data from the navigation pressure vessel where an unscented
Kalman filter estimates the vehicle’s full state at 50 Hz. The desired force and moment
produced by the controller are mapped to the eight thrusters using a least-squares
minimization algorithm. The controller node contains the proposed controller and
system identifier.
The implementation of the proposed method has three parts: system iden-
tification, value function iteration, and control iteration. Implementing the system
identifier requires (6.4), (6.6), and the data set alluded to in Assumption 6.2. The data
set in Assumption 6.2 was collected in a swimming pool. The vehicle was commanded
to track a data-rich trajectory with a RISE controller [18] while the state-action pairs
were recorded. The recorded data was trimmed to a subset of 40 sampled points
that were selected to maximize the minimum singular value of Y1 Y2 . . . Y j as in
Appendix A.2.3. The system identifier is updated at 50 Hz.
Equations (6.24) and (6.26) form the value function iteration. Evaluating the
extrapolated Bellman error (6.24) with each control iteration is computationally expensive. Due to the limited computational resources available on-board the autonomous
underwater vehicle, the update of the critic weights was selected to be calculated at
a different rate (5 Hz) than the main control loop.
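A sketch of the resulting multi-rate structure is shown below; the callables `identifier_step`, `control_step`, and `critic_step` are hypothetical placeholders for the 50 Hz identifier/control updates and the 5 Hz critic update described above.

```python
def run_controller(duration_s, identifier_step, control_step, critic_step,
                   f_ctrl=50.0, f_critic=5.0):
    """Multi-rate loop: identifier/control at f_ctrl, critic update at f_critic."""
    dt = 1.0 / f_ctrl
    decimation = int(round(f_ctrl / f_critic))   # 10 control steps per critic step
    n_steps = int(duration_s * f_ctrl)
    for k in range(n_steps):
        identifier_step(dt)                # update the parameter estimates at 50 Hz
        control_step(dt)                   # evaluate the policy and command thrusters
        if k % decimation == 0:
            critic_step(decimation * dt)   # extrapolated Bellman error update at 5 Hz
```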
For the experiments, the controller in (6.4), (6.6), (6.23), (6.24), (6.26), (6.28),
and (6.29) was restricted to three degrees of freedom (i.e., surge, sway, and yaw). The
RISE controller is used to regulate the remaining degrees-of-freedom (i.e., heave,
roll, and pitch), to maintain the implied assumption that roll and pitch remain at
zero and the depth remains constant. The RISE controller in conjunction with the
proposed controller is executed at 50 Hz.
The vehicle uses water profiling data from the Doppler velocity log to measure
the relative water velocity near the vehicle in addition to bottom tracking data for
the state estimator. Between the state estimator, water profiling data, and recorded
data, the equations used to implement the developed controller only contain known
or measurable quantities.
The vehicle was commanded to hold a station near the vent of Ginnie Spring. An initial condition of $\zeta(t_0) = \begin{bmatrix} 4\text{ m} & 4\text{ m} & \frac{\pi}{4}\text{ rad} & 0\text{ m/s} & 0\text{ m/s} & 0\text{ rad/s}\end{bmatrix}^T$ was given
to demonstrate the method’s ability to regulate the state. The optimal control weight-
ing matrices were selected to be Q = diag ([20, 50, 20, 10, 10, 10]) and R = I3 . The
system identifier adaptation gains were selected to be k x = 25 × I6 , kθ = 12.5, and
Γθ = diag ([187.5, 937.5, 37.5, 37.5, 37.5, 37.5, 37.5, 37.5]). The parameter esti-
mate was initialized with θ̂ (t0 ) = 08×1 . The neural network weights were initialized
to match the ideal values for the linearized optimal control problem, selected by solv-
ing the algebraic Riccati equation with the dynamics linearized about the station. The
actor adaptation gains were selected to be kc1 = 0.25 × I21 , kc2 = 0.5 × I21 , ka = I21 ,

Fig. 6.2 Inertial position error η of the autonomous underwater vehicle (pose error [m, m, rad] versus time [s])

Fig. 6.3 Body-fixed velocity error ν of the autonomous underwater vehicle (velocity error [m/s, m/s, rad/s] versus time [s])

k p = 0.25, and β = 0.025. The adaptation matrix was initialized to Γ0 = 400 × I21 .
The Bellman error was extrapolated to 2025 points in a grid about the station.
Figures 6.2 and 6.3 illustrate the ability of the generated policy to regulate the
state in the presence of the spring’s current. Figure 6.4 illustrates the total control
effort applied to the body of the vehicle, which includes the estimate of the current
compensation term and approximate optimal control. Figure 6.5 illustrates the output
of the approximate optimal policy for the residual system. Figure 6.6 illustrates the
convergence of the parameters of the system identifier and Figs. 6.7 and 6.8 illustrate
convergence of the neural network weights representing the value function.

Fig. 6.4 Body-fixed total control effort τ̂b commanded about the center of mass of the vehicle (control [N, N, Nm] versus time [s])

Fig. 6.5 Body-fixed optimal control effort û commanded about the center of mass of the vehicle (versus time [s])

The anomaly seen at ∼70 s in the total control effort (Fig. 6.4) is attributed to
a series of incorrect current velocity measurements. The corruption of the current
velocity measurements is possibly due in part to the extremely low turbidity in the
spring and/or relatively shallow operating depth. Despite presence of unreliable cur-
rent velocity measurements the vehicle was able to regulate the vehicle to its station.
The results demonstrate the developed method’s ability to concurrently identify the
unknown hydrodynamic parameters and generate an approximate optimal policy
using the identified model. The vehicle follows the generated policy to achieve its
station keeping objective using industry standard navigation and environmental sen-
sors (i.e., inertial measurement unit, Doppler velocity log).

Fig. 6.6 Identified system parameters determined for the vehicle online (parameters versus time [s]). The parameter definitions may be found in Example 6.2 and Eq. (6.100) of [9]

Fig. 6.7 Value function (Ŵc) neural network weight estimates online convergence (Ŵc versus time [s])
Fig. 6.8 Policy (Ŵa) neural network weight estimates online convergence (Ŵa versus time [s])

6.3 Online Optimal Control for Path-Following2

6.3.1 Problem Description

Path-following refers to a class of problems where the control objective is to con-


verge to and remain on a desired geometric path. The desired path is not necessarily
parameterized by time, but by some convenient parameter (e.g., path length). The
path-following method in this section utilizes a virtual target that moves along the
desired path. The error dynamics are defined kinematically between the virtual target
and vehicle as [6]

ẋ (t) = v (t) cos θ (t) + ṡ p (t) (κ (t) y (t) − 1) ,


ẏ (t) = v (t) sin θ (t) − x (t) κ (t) ṡ p (t) ,
θ̇ (t) = w (t) − κ (t) ṡ p (t) , (6.36)

where x (t) , y (t) ∈ R denote the planar position error between the vehicle and the
virtual target, θ (t) ∈ R denotes the rotational error between the vehicle heading
and the heading of the virtual target, v (t) ∈ R denotes the linear velocity of the
vehicle, w (t) ∈ R denotes the angular velocity of the vehicle, κ (t) ∈ R denotes the
path curvature evaluated at the virtual target, and s p (t) ∈ R denotes velocity of the
virtual target along the path. For a detailed derivation of the dynamics in (6.36) see
Appendix A.3.3.
Assumption 6.5 The desired path is regular and C 2 continuous; hence, the path
curvature κ is bounded and continuous.
As described in [6], the location of the virtual target is determined by

ṡ p (t)  vdes (t) cos θ (t) + k1 x (t) , (6.37)

where vdes : R → R is a desired positive, bounded and time-invariant speed profile,


and k1 ∈ R>0 is an adjustable gain.
To facilitate the subsequent control development, an auxiliary function $\phi : \mathbb{R} \to (-1, 1)$ is defined as
$$\phi\left(s_p\right) \triangleq \tanh\left(k_2 s_p\right), \qquad (6.38)$$
where $k_2 \in \mathbb{R}_{>0}$ is an adjustable gain. From (6.37) and (6.38), the time derivative of $\phi$ is
$$\dot{\phi}(t) = k_2\left(1 - \phi^2(t)\right)\left(v_{des}(t)\cos\theta(t) + k_1 x(t)\right). \qquad (6.39)$$

2 Parts of the text in this section are reproduced, with permission, from [5], © 2014 IEEE.

Note that the path curvature and desired speed profile can be written as functions of
φ.
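The sketch below (with hypothetical callables `kappa_fn` and `v_des_fn` for the path curvature and desired speed profile evaluated at the virtual target) illustrates how the right-hand sides of (6.36)–(6.39) could be evaluated in simulation.

```python
import numpy as np

def error_dynamics(x, y, theta, s_p, v, w, kappa_fn, v_des_fn, k1, k2):
    """Right-hand sides of the path-following error dynamics (6.36)-(6.39)."""
    kappa, v_des = kappa_fn(s_p), v_des_fn(s_p)
    s_p_dot = v_des * np.cos(theta) + k1 * x                 # virtual target speed (6.37)
    x_dot = v * np.cos(theta) + s_p_dot * (kappa * y - 1.0)  # (6.36)
    y_dot = v * np.sin(theta) - x * kappa * s_p_dot
    theta_dot = w - kappa * s_p_dot
    phi = np.tanh(k2 * s_p)                                  # (6.38)
    phi_dot = k2 * (1.0 - phi**2) * (v_des * np.cos(theta) + k1 * x)  # (6.39)
    return x_dot, y_dot, theta_dot, s_p_dot, phi, phi_dot
```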
Based on (6.36) and (6.37), auxiliary control inputs ve , we ∈ R are designed as

ve (t)  v (t) − vss (φ (t)) ,


we (t)  w (t) − wss (φ (t)) , (6.40)

where wss  κvdes and vss  vdes are computed based on the control input required
to remain on the path.
Substituting (6.37) and (6.40) into (6.36), and augmenting the system state with
(6.39), the closed-loop system is

$$\dot{x}(t) = \kappa(\phi(t))\,y(t)\,v_{des}(\phi(t))\cos\theta(t) + k_1\kappa(\phi(t))\,x(t)y(t) - k_1 x(t) + v_e(t)\cos\theta(t),$$
$$\dot{y}(t) = v_{des}(\phi(t))\sin\theta(t) - \kappa(\phi(t))\,x(t)\,v_{des}(\phi(t))\cos\theta(t) - k_1\kappa(\phi(t))\,x^2(t) + v_e(t)\sin\theta(t),$$
$$\dot{\theta}(t) = \kappa(\phi(t))\,v_{des}(\phi(t)) - \kappa(\phi(t))\left(v_{des}(\phi(t))\cos\theta(t) + k_1 x(t)\right) + w_e(t),$$
$$\dot{\phi}(t) = k_2\left(1 - \phi^2(t)\right)\left(v_{des}(\phi(t))\cos\theta(t) + k_1 x(t)\right). \qquad (6.41)$$

The closed-loop system in (6.41) can be rewritten in the following control-affine


form

$$\dot{\zeta}(t) = f(\zeta(t)) + g(\zeta(t))u(t), \qquad (6.42)$$
where $\zeta = \begin{bmatrix} x & y & \theta & \phi\end{bmatrix}^T : \mathbb{R}_{\geq t_0} \to \mathbb{R}^4$ is the state vector, $u = \begin{bmatrix} v_e & w_e\end{bmatrix}^T : \mathbb{R}_{\geq t_0} \to \mathbb{R}^2$ is the control vector, and the locally Lipschitz functions $f : \mathbb{R}^4 \to \mathbb{R}^4$ and $g : \mathbb{R}^4 \to \mathbb{R}^{4\times 2}$ are defined as
$$f(\zeta) \triangleq \begin{bmatrix}\kappa(\phi)\,y\,v_{des}(\phi)\cos\theta + k_1\kappa(\phi)\,xy - k_1 x\\ v_{des}(\phi)\sin\theta - \kappa(\phi)\,x\,v_{des}(\phi)\cos\theta - k_1\kappa(\phi)\,x^2\\ \kappa(\phi)\,v_{des}(\phi) - \kappa(\phi)\left(v_{des}(\phi)\cos\theta + k_1 x\right)\\ k_2\left(1 - \phi^2\right)\left(v_{des}(\phi)\cos\theta + k_1 x\right)\end{bmatrix},$$
$$g(\zeta) \triangleq \begin{bmatrix}\cos\theta & 0\\ \sin\theta & 0\\ 0 & 1\\ 0 & 0\end{bmatrix}. \qquad (6.43)$$
To facilitate the subsequent stability analysis, a subset of the state, denoted by $e \in \mathbb{R}^3$, is defined as $e \triangleq \begin{bmatrix} x & y & \theta\end{bmatrix}^T \in \mathbb{R}^3$.
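For reference, a direct transcription of (6.43) into code might look as follows; `kappa_fn` and `v_des_fn` are hypothetical callables returning the path curvature and desired speed as functions of $\phi$.

```python
import numpy as np

def f(zeta, kappa_fn, v_des_fn, k1, k2):
    """Drift dynamics in (6.43)."""
    x, y, theta, phi = zeta
    kappa, v_des = kappa_fn(phi), v_des_fn(phi)
    return np.array([
        kappa * y * v_des * np.cos(theta) + k1 * kappa * x * y - k1 * x,
        v_des * np.sin(theta) - kappa * x * v_des * np.cos(theta) - k1 * kappa * x**2,
        kappa * v_des - kappa * (v_des * np.cos(theta) + k1 * x),
        k2 * (1.0 - phi**2) * (v_des * np.cos(theta) + k1 * x),
    ])

def g(zeta):
    """Control effectiveness matrix in (6.43)."""
    theta = zeta[2]
    return np.array([[np.cos(theta), 0.0],
                     [np.sin(theta), 0.0],
                     [0.0,           1.0],
                     [0.0,           0.0]])
```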

6.3.2 Optimal Control and Approximate Solution

The cost functional for the optimal control problem is selected as (6.15), where $\overline{Q} \in \mathbb{R}^{4\times 4}$ is defined as
$$\overline{Q} \triangleq \begin{bmatrix} Q & 0_{3\times 1}\\ 0_{1\times 3} & 0\end{bmatrix},$$
where $Q \in \mathbb{R}^{3\times 3}$ is a user-defined positive definite matrix such that $\underline{q}\|\xi_q\|^2 \leq \xi_q^T Q\xi_q \leq \overline{q}\|\xi_q\|^2$, $\forall \xi_q \in \mathbb{R}^3$, where $\underline{q}$ and $\overline{q}$ are positive constants.
The value function satisfies the Hamilton–Jacobi–Bellman equation [14]
$$0 = r\left(\zeta, u^*(\zeta)\right) + \nabla_\zeta V^*(\zeta)\left(f(\zeta) + g(\zeta)u^*(\zeta)\right). \qquad (6.44)$$
Using (6.44) and the parametric approximation of the optimal value function and the optimal policy from (6.22) and (6.23), respectively, the Bellman error $\delta : \mathbb{R}^4 \times \mathbb{R}^L \times \mathbb{R}^L \to \mathbb{R}$ is defined as
$$\delta\left(\zeta, \hat{W}_c, \hat{W}_a\right) = r\left(\zeta, \hat{u}\left(\zeta, \hat{W}_a\right)\right) + \hat{W}_c^T\nabla_\zeta\sigma(\zeta)\left(f(\zeta) + g(\zeta)\hat{u}\left(\zeta, \hat{W}_a\right)\right). \qquad (6.45)$$

The adaptive update laws for the critic weights and the actor weights are given by (6.26) and (6.28), respectively, with the regressor $\omega_k$ defined as $\omega_k(t) \triangleq \nabla_\zeta\sigma(\zeta_k)\left(f(\zeta_k) + g(\zeta_k)\hat{u}\left(\zeta_k, \hat{W}_a(t)\right)\right) \in \mathbb{R}^L$ and the normalization factor defined as $\rho_k(t) \triangleq \sqrt{1 + \omega_k^T(t)\omega_k(t)}$. The adaptation gain $\Gamma$ is held constant, and it is assumed that the regressor satisfies the rank condition in (6.25).

6.3.3 Stability Analysis

To facilitate the subsequent stability analysis, an unmeasurable form of the Bellman error can be written as
$$\delta = -\tilde{W}_c^T\omega - \nabla_\zeta\epsilon\, f + \frac{1}{2}\nabla_\zeta\epsilon\, G\nabla_\zeta\sigma^T W + \frac{1}{4}\tilde{W}_a^T G_\sigma\tilde{W}_a + \frac{1}{4}\nabla_\zeta\epsilon\, G\nabla_\zeta\epsilon^T, \qquad (6.46)$$
where $G \triangleq gR^{-1}g^T \in \mathbb{R}^{4\times 4}$ and $G_\sigma \triangleq \nabla_\zeta\sigma\, G\nabla_\zeta\sigma^T \in \mathbb{R}^{L\times L}$ are symmetric positive semi-definite matrices. Similarly, at the sampled points the Bellman error can be written as
$$\delta_j = -\tilde{W}_c^T\omega_j + \frac{1}{4}\tilde{W}_a^T G_{\sigma j}\tilde{W}_a + E_j, \qquad (6.47)$$
where $E_j \triangleq \frac{1}{2}\nabla_\zeta\epsilon_j\, G_j\nabla_\zeta\sigma_j^T W + \frac{1}{4}\nabla_\zeta\epsilon_j\, G_j\nabla_\zeta\epsilon_j^T - \nabla_\zeta\epsilon_j\, f_j \in \mathbb{R}$, $\nabla_\zeta\epsilon_j \triangleq \nabla_\zeta\epsilon\left(\zeta_j\right)$, and $\nabla_\zeta\sigma_j \triangleq \nabla_\zeta\sigma\left(\zeta_j\right)$.

The function $f$ is Lipschitz continuous on any compact set $\chi \subset \mathbb{R}^4$, and can be bounded by
$$\|f(\zeta)\| \leq L_f\|\zeta\|, \quad \forall \zeta \in \chi,$$
where $L_f$ is the positive Lipschitz constant. Furthermore, the normalized regressor can be upper bounded by $\left\|\frac{\omega}{\rho}\right\| \leq 1$.
The augmented equations of motion in (6.41) present a unique challenge with
respect to the value function V which is utilized as a Lyapunov function in the
stability analysis. To prevent penalizing the vehicle progression along the path, the path parameter $\phi$ is removed from the cost function with the introduction of the positive semi-definite state weighting matrix $\overline{Q}$. However, since $\overline{Q}$ is positive semi-definite, efforts are required to ensure the value function is positive definite. To address this challenge, the fact that the value function can be interpreted as a time-invariant map $V^* : \mathbb{R}^4 \to [0,\infty)$ or a time-varying map $V_t^* : \mathbb{R}^3 \times [0,\infty) \to [0,\infty)$, defined as $V_t^*(e, t) \triangleq V^*\left(\begin{bmatrix} e\\ \phi(t)\end{bmatrix}\right)$, is exploited. Lemma 3.14 is used to show that the time-varying map is a positive definite and decrescent function for use as a Lyapunov
varying map is a positive definite and decrescent function for use as a Lyapunov
function. Hence, on any compact set χ the optimal value function Vt∗ satisfies the
following properties

$$V_t^*(0, t) = 0,$$
$$\underline{\upsilon}(\|e\|) \leq V_t^*(e, t) \leq \overline{\upsilon}(\|e\|), \qquad (6.48)$$
$\forall t \in [0, \infty)$ and $\forall e \in \chi$, where $\underline{\upsilon} : [0, \infty) \to [0, \infty)$ and $\overline{\upsilon} : [0, \infty) \to [0, \infty)$ are class $\mathcal{K}$ functions.
To facilitate the subsequent stability analysis, consider the candidate Lyapunov function $V_L : \mathbb{R}^{4+2L} \times [0, \infty) \to [0, \infty)$ given as
$$V_L(Z, t) = V_t^*(e, t) + \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\tilde{W}_c + \frac{1}{2}\tilde{W}_a^T\Gamma^{-1}\tilde{W}_a. \qquad (6.49)$$
Using (6.48), the candidate Lyapunov function can be bounded as
$$\underline{\upsilon}_L(\|Z\|) \leq V_L(Z, t) \leq \overline{\upsilon}_L(\|Z\|), \qquad (6.50)$$
where $\underline{\upsilon}_L, \overline{\upsilon}_L : [0, \infty) \to [0, \infty)$ are class $\mathcal{K}$ functions and $Z \triangleq \begin{bmatrix} e^T & \tilde{W}_c^T & \tilde{W}_a^T\end{bmatrix}^T \in \beta \subset \chi \times \mathbb{R}^{2L}$.
Theorem 6.6 If Assumptions 6.3 and 6.5 are satisfied with the regressor $\omega_k$ redefined as $\omega_k(t) \triangleq \nabla_\zeta\sigma(\zeta_k)\left(f(\zeta_k) + g(\zeta_k)\hat{u}\left(\zeta_k, \hat{W}_a(t)\right)\right)$ and the normalization factor defined as $\rho_k(t) \triangleq \sqrt{1 + \omega_k^T(t)\omega_k(t)}$, the learning gains $k_{c1}$ and $k_{c2}$ in (6.26) are selected sufficiently small and large, respectively, and
$$K < \overline{\upsilon}_L^{-1}\left(\underline{\upsilon}_L(r)\right), \qquad (6.51)$$
where $K$ is an auxiliary positive constant (see Appendix A.3.4) and $r \in \mathbb{R}$ is the radius of a selected compact set $\beta$, then the controller $u(t) = \hat{u}\left(\zeta(t), \hat{W}_a(t)\right)$ with the update laws in (6.26), (6.27), and (6.28) guarantees uniformly ultimately bounded convergence of the approximate policy to the optimal policy and of the vehicle to the virtual target.

Proof The time derivative of the candidate Lyapunov function in (6.49) is
$$\dot{V}_L = \nabla_\zeta V^* f + \nabla_\zeta V^* g\hat{u} - \tilde{W}_c^T\Gamma^{-1}\dot{\hat{W}}_c - \tilde{W}_a^T\dot{\hat{W}}_a.$$
Substituting (6.44), (6.26), and (6.28) yields
$$\dot{V}_L = -e^T Qe - u^{*T}Ru^* + \nabla_\zeta V^* g\hat{u} - \nabla_\zeta V^* gu^* + \tilde{W}_c^T\left(k_{c1}\frac{\omega}{\rho}\delta + \frac{k_{c2}}{N}\sum_{j=1}^{N}\frac{\omega_j}{\rho_j}\delta_j\right) + \tilde{W}_a^T k_a\left(\hat{W}_a - \hat{W}_c\right).$$

Using Young's inequality,
$$\dot{V}_L \leq -\varphi_e\|e\|^2 - \varphi_c\|\tilde{W}_c\|^2 - \varphi_a\|\tilde{W}_a\|^2 + \iota_c\|\tilde{W}_c\| + \iota_a\|\tilde{W}_a\| + \iota, \qquad (6.52)$$
where $\varphi_e$, $\varphi_c$, $\varphi_a$, $\iota_c$, $\iota_a$, $\iota$ are auxiliary positive constants (see Appendix A.3.4). Completing the squares, (6.52) can be upper bounded by
$$\dot{V}_L \leq -\varphi_e\|e\|^2 - \frac{\varphi_c}{2}\|\tilde{W}_c\|^2 - \frac{\varphi_a}{2}\|\tilde{W}_a\|^2 + \frac{\iota_c^2}{2\varphi_c} + \frac{\iota_a^2}{2\varphi_a} + \iota,$$
which can be further upper bounded as
$$\dot{V}_L \leq -\alpha\|Z\|^2, \quad \forall\, \|Z\| \geq K > 0 \qquad (6.53)$$

∀Z ∈ β, where α is an auxiliary positive constant (see Appendix A.3.4).


Using (6.50), (6.51), and (6.53), [16, Theorem 4.18] is invoked to conclude that
Z is uniformly ultimately bounded. The sufficient condition in (6.51) requires the
compact set β to be large enough based on the constant K . The constant K for a
given β can be reduced to satisfy the sufficient condition by reducing the function
approximation error in (6.20) and (6.21). The function approximation error can be
decreased by increasing the number of neurons in the neural network. 

To demonstrate the performance of the developed approximate dynamic


programming-based guidance law, simulation and experimental results are presented.
The simulation allows the developed method to be compared to a numerical offline
optimal solution, whereas experimental results demonstrate the real-time optimal
performance.

6.3.4 Simulation Results

To illustrate the ability of the proposed method to approximate the optimal solution,
a simulation is performed where the developed method’s policy and value function
neural network weight estimates are initialized to ideal weights identified on a previ-
ous trial. The true values of the ideal neural network weights are unknown. However,
initializing the actor and critic neural network weights to the ideal weights deter-
mined offline, the accuracy of the approximation can be compared to the optimal
solution. Since an analytical solution is not feasible for this problem, the simulation
results are directly compared to results obtained by the offline numerical optimal
solver GPOPS [19].
The simulation results utilize the kinematic model in (6.36) as the simulated mobile robot. The vehicle is commanded to follow a figure eight path with a desired speed of $v_{des} = 0.25$ m/s. The virtual target is initially placed at the position corresponding to an initial path parameter of $s_p(0) = 0$ m, and the initial error state is selected as $e(0) = \begin{bmatrix} 0.5\text{ m} & 0.5\text{ m} & \frac{\pi}{2}\text{ rad}\end{bmatrix}^T$. Therefore, the initial augmented state is $\zeta(0) = \begin{bmatrix} 0.5\text{ m} & 0.5\text{ m} & \frac{\pi}{2}\text{ rad} & 0\text{ m}\end{bmatrix}^T$. The basis for the value function approximation is selected as
$$\sigma = \begin{bmatrix}\zeta_1\zeta_2, & \zeta_1\zeta_3, & \zeta_2\zeta_3, & \zeta_1^2, & \zeta_2^2, & \zeta_3^2, & \zeta_4^2\end{bmatrix}^T.$$

The sampled data points are selected on a 5 × 5 × 3 × 3 grid about the origin. The
quadratic cost weighting matrices are selected as Q = diag ([2, 2, 0.25]) and R = I2 .
The learning gains are selected by trial and error as

Γ = diag ([1, 2.5, 1, 0.125, 2.5, 25.0.5]) ,

kc1 = 1, kc2 = 1, ka = 1.25.

Additionally, systematic gain tuning methods may be used (e.g., a genetic algorithm
approach similar to [20] may be used to minimize a desired performance criteria
such as weight settling time).

The auxiliary gains in (6.37) and (6.39) are selected as k1 = 0.5 and k2 = 0.005.
Determined from a previous trial, the actor and critic neural network weight estimates are initialized to
$$\hat{W}_c(0) = \begin{bmatrix} 2.8\times 10^{-2}, & -3.3\times 10^{-2}, & 4.0, & 1.2, & 2.7, & 2.9, & 1.0\end{bmatrix}^T$$
and
$$\hat{W}_a(0) = \begin{bmatrix} 2.8\times 10^{-2}, & -3.3\times 10^{-2}, & 4.0, & 1.2, & 2.7, & 2.9, & 1.0\end{bmatrix}^T.$$
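A sketch of this simulation setup (basis vector, extrapolation grid, and cost weights) is shown below; the grid half-widths are illustrative placeholders, since the text only specifies the number of points per dimension.

```python
import numpy as np

def sigma(zeta):
    """Polynomial basis used for the value function approximation."""
    z1, z2, z3, z4 = zeta
    return np.array([z1*z2, z1*z3, z2*z3, z1**2, z2**2, z3**2, z4**2])

# 5 x 5 x 3 x 3 grid of Bellman-error extrapolation states about the origin.
grid_axes = [np.linspace(-1.0, 1.0, 5), np.linspace(-1.0, 1.0, 5),
             np.linspace(-0.5, 0.5, 3), np.linspace(-0.5, 0.5, 3)]
zeta_points = np.array(np.meshgrid(*grid_axes)).reshape(4, -1).T   # shape (225, 4)

Q = np.diag([2.0, 2.0, 0.25])   # state error weighting
R = np.eye(2)                   # control weighting
```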

Figures 6.9 and 6.10 illustrate that the state and control trajectories approach
the solution found using the offline optimal solver, and Fig. 6.11 shows that the
neural network critic and actor weight estimates remain at their steady state values.
The system trajectories and control values obtained using the developed method
approximate the system trajectories and control value of the offline optimal solver. It
takes approximately 125 s for the mobile robot to traverse the desired path. However,
all figures with the exception of the vehicle trajectory are plotted only for 60 s to
provide clarity on the transient response. The steady-state response remains the same
after the initial transient (20 s).

Fig. 6.9 The error state trajectory generated by the developed method (ADP) compared to an offline numerical optimal solver (GPOPS); each error component is plotted versus time [s]

Fig. 6.10 The control trajectory generated by the developed method (ADP) compared to an offline numerical optimal solver (GPOPS); each control component is plotted versus time [s]

Fig. 6.11 The estimated neural network weight trajectories (Ŵc, top; Ŵa, bottom) generated by the developed method in simulation, plotted versus time [s]

6.3.5 Experiment Results

Experimental results demonstrate the ability of the developed controller to simulta-


neously identify and implement an approximate optimal controller. The approximate
dynamic programming-based guidance law is implemented on a Turtlebot wheeled
mobile robot depicted in Fig. 6.12. Computation of the optimal guidance law takes
place on an on-board ASUS Eee PC netbook with 1.8 GHz Intel Atom processor.3 The
Turtlebot is provided velocity commands from the guidance law where the existing
low-level controller on the Turtlebot minimizes the velocity tracking error.

3 The algorithm complexity is linear with respect to the number of sampled points and cubic with respect to the number of basis functions for each control iteration.

Fig. 6.12 The Turtlebot


wheeled mobile robot

As with the simulation results, the vehicle is commanded to follow a figure eight path with a desired speed of $v_{des} = 0.25$ m/s. The basis for the value function approximation is selected as
$$\sigma = \begin{bmatrix}\zeta_1\zeta_2, & \zeta_1\zeta_3, & \zeta_1\zeta_4, & \zeta_2\zeta_3, & \zeta_2\zeta_4, & \zeta_3\zeta_4, & \zeta_1^2, & \zeta_2^2, & \zeta_3^2, & \zeta_4^2\end{bmatrix}^T.$$

The sampled data points are selected on a 5 × 5 × 3 × 3 grid about the origin. The
quadratic cost weighting matrices are selected as Q = diag ([2, 2, 0.25]) and R = I2 .
The learning gains are selected by trial and error as

Γ = diag ([1, 2.5, 2.5, 1, 0.25, 1, 0.125, 2.5, 7.5, 0.5]) ,

kc1 = 1, kc2 = 1, ka = 1.25.

The auxiliary gains in (6.37) and (6.39) are selected as $k_1 = 0.5$ and $k_2 = 0.005$. The initial augmented state is $\zeta(0) = \begin{bmatrix} -0.5\text{ m} & -0.5\text{ m} & \frac{\pi}{2}\text{ rad} & 0\text{ m}\end{bmatrix}^T$. The actor and critic
neural network weight estimates are arbitrarily initialized to

Ŵc (0) = [0, 0, 0, 0.5, 0, 0, 0.5, 0, 1, 0]T

and

Ŵa (0) = [0, 0, 0, 0.5, 0, 0, 0.5, 0, 1, 0]T .



Fig. 6.13 The error state trajectory generated by the developed method implemented on the Turtlebot (pose error [m, m, rad] versus time [s])

Fig. 6.14 The estimated neural network weight trajectories (Ŵc, top; Ŵa, bottom) generated by the developed method implemented on the Turtlebot, plotted versus time [s]

For the given basis, the actor and critic neural network weight estimates may also be
initialized such that the value function approximation is equivalent to the solution
to the algebraic Riccati equation corresponding to the kinematic model linearized
about the initial conditions.
Figure 6.13 shows convergence of the error state to a ball about the origin. Figure
6.14 shows the neural network critic and actor weight estimates converge to steady
state values. The ability of the mobile robot to track the desired path is demonstrated
in Fig. 6.15.

Fig. 6.15 The planar trajectory achieved by the developed method implemented on the Turtlebot (desired path, actual path, and start position)

6.4 Background and Further Reading

Precise station keeping of a marine craft is challenging because of nonlinearities


in the dynamics of the vehicle. A survey of station keeping for autonomous surface
vehicles can be found in [21]. Common approaches employed to control a marine craft
include robust and adaptive control methods [18, 22–24]. These methods provide
robustness to disturbances and/or model uncertainty; however, they do not explicitly
attempt to reduce energy expenditure. Motivated by the desire to balance energy
expenditure and the accuracy of the vehicle’s station, optimal control methods where
the performance criterion is a function of the total control effort (energy expended)
and state error (station accuracy) are examined. Because of the difficulties associated
with finding closed-form analytical solutions to optimal control problems for marine
craft, [25] numerically approximates the solution to the Hamilton–Jacobi–Bellman
equation using an iterative application of Galerkin’s method, and [26] implements a
model-predictive control policy.
Guidance laws of a mobile robot are typically divided into three categories: point
regulation, trajectory tracking, and path-following. Path-following refers to a class
of problems where the control objective is to converge to and remain on a desired
geometric path without the requirement of temporal constraints (cf. [6, 27, 28]). Path-
following is ideal for applications intolerant of spatial error (e.g., navigating clut-
tered environments, executing search patterns). Heuristically, path-following yields
smoother convergence to a desired path and reduces the risk of control saturation.
A path-following control structure can also alleviate difficulties in the control of
nonholonomic vehicles (cf. [28, 29]).
Additional optimal control techniques have been applied to path-following to
improve performance. In [30], model-predictive control is used to develop a controller
for an omnidirectional robot with dynamics linearized about the desired path. In
[31], an adaptive optimal path-following feedback policy is determined by iteratively

solving the algebraic Riccati equation corresponding to the linearized error dynam-
ics about the desired heading. Nonlinear model-predictive control is used in [32]
to develop an optimal path-following controller over a finite time horizon. Dynamic
programming was applied to the path-following problem in [33] to numerically deter-
mine an optimal path-following feedback policy offline. The survey in [34] cites addi-
tional examples of model-predictive control and dynamic programming applied to
path-following. Unlike approximate dynamic programming, model-predictive con-
trol does not guarantee optimality of the implemented controller and dynamic pro-
gramming does not accommodate simultaneous online learning and utilization of the
feedback policy.

References

1. Walters P, Kamalapurkar R, Voight F, Schwartz E, Dixon WE (to appear) Online approximate


optimal station keeping of a marine craft in the presence of an irrotational current. IEEE Trans
Robot
2. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
3. Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal
control for partially unknown nonlinear systems. Neural Netw 22(3):237–246
4. Chowdhary G, Johnson E (2010) Concurrent learning for convergence in adaptive control
without persistency of excitation. In: Proceedings of the IEEE Conference on Decision and
Control, pp 3674–3679
5. Walters P, Kamalapurkar R, Andrews L, Dixon WE (2014) Online approximate optimal path-
following for a mobile robot. In: Proceedings of the IEEE Conference on Decision and Control,
pp 4536–4541
6. Lapierre L, Soetanto D, Pascoal A (2003) Non-singular path-following control of a unicycle in
the presence of parametric modeling uncertainties. Int J Robust Nonlinear Control 16:485–503
7. Egerstedt M, Hu X, Stotsky A (2001) Control of mobile platforms using a virtual vehicle
approach. IEEE Trans Autom Control 46(11):1777–1782
8. Dixon WE, Dawson DM, Zergeroglu E, Behal A (2000) Nonlinear control of wheeled mobile
robots, vol 262. Lecture notes in control and information sciences, Springer, London
9. Fossen TI (2011) Handbook of marine craft hydrodynamics and motion control. Wiley, New
York
10. Sastry S, Isidori A (1989) Adaptive control of linearizable systems. IEEE Trans Autom Control
34(11):1123–1131
11. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall, Upper Saddle River
12. Chowdhary GV, Johnson EN (2011) Theory and flight-test validation of a concurrent-learning
adaptive controller. J Guid Control Dyn 34(2):592–607
13. Chowdhary G, Yucelen T, Mühlegg M, Johnson EN (2013) Concurrent learning adaptive control
of linear systems with exponentially convergent bounds. Int J Adapt Control Signal Process
27(4):280–301
14. Kirk D (2004) Optimal control theory: an introduction. Dover, Mineola
15. Dixon WE, Behal A, Dawson DM, Nagarkatti S (2003) Nonlinear control of engineering
systems: a Lyapunov-based approach. Birkhauser, Boston
16. Khalil HK (2002) Nonlinear systems, 3rd edn. Prentice Hall, Upper Saddle River
17. Schmidt W (2004) Springs of Florida. Bulletin 66, Florida Geological Survey

18. Fischer N, Hughes D, Walters P, Schwartz E, Dixon WE (2014) Nonlinear RISE-based control
of an autonomous underwater vehicle. IEEE Trans Robot 30(4):845–852
19. Rao AV, Benson DA, Darby CL, Patterson MA, Francolin C, Huntington GT (2010) Algorithm
902: GPOPS, a MATLAB software for solving multiple-phase optimal control problems using
the Gauss pseudospectral method. ACM Trans Math Softw 37(2):1–39
20. Otsuka A, Nagata F (2013) Application of genetic algorithms to fine-gain tuning of improved
the resolved acceleration controller. Procedia Comput Sci 22:50–59
21. Sørensen AJ (2011) A survey of dynamic positioning control systems. Annu Rev Control
35:123–136
22. Fossen T, Grovlen A (1998) Nonlinear output feedback control of dynamically positioned ships
using vectorial observer backstepping. IEEE Trans Control Syst Technol 6:121–128
23. Sebastian E, Sotelo MA (2007) Adaptive fuzzy sliding mode controller for the kinematic
variables of an underwater vehicle. J Intell Robot Syst 49(2):189–215
24. Tannuri E, Agostinho A, Morishita H, Moratelli L Jr (2010) Dynamic positioning systems: an
experimental analysis of sliding mode control. Control Eng Pract 18:1121–1132
25. Beard RW, Mclain TW (1998) Successive Galerkin approximation algorithms for nonlinear
optimal and robust control. Int J Control 71(5):717–743
26. Fannemel ÅV (2008) Dynamic positioning by nonlinear model predictive control. Master’s
thesis, Norwegian University of Science and Technology
27. Morro A, Sgorbissa A, Zaccaria R (2011) Path following for unicycle robots with an arbitrary
path curvature. IEEE Trans Robot 27(5):1016–1023
28. Morin P, Samson C (2008) Motion control of wheeled mobile robots. Springer handbook of
robotics. Springer, Berlin, pp 799–826
29. Dacic D, Nesic D, Kokotovic P (2007) Path-following for nonlinear systems with unstable zero
dynamics. IEEE Trans Autom Control 52(3):481–487
30. Kanjanawanishkul K, Zell A (2009) Path following for an omnidirectional mobile robot based
on model predictive control. In: Proceedings of the IEEE International Conference on Robotics
and Automation, pp 3341–3346
31. Ratnoo A, Pb S, Kothari M (2011) Optimal path following for high wind flights. In: IFAC
world congress, Milano, Italy 18:12985–12990
32. Faulwasser T, Findeisen R (2009) Nonlinear model predictive path-following control. In: Magni
L, Raimondo D, Allgöwer F (eds) Nonlinear model predictive control, vol 384. Springer, Berlin,
pp 335–343
33. da Silva JE, de Sousa JB (2011) A dynamic programming based path-following controller for autonomous vehicles. Control Intell Syst 39:245–253
34. Sujit P, Saripalli S, Borges Sousa J (2014) Unmanned aerial vehicle path following: a survey
and analysis of algorithms for fixed-wing unmanned aerial vehicles. IEEE Control Syst Mag
34(1):42–59
Chapter 7
Computational Considerations

7.1 Introduction

Efficient methods for the approximation of the optimal value function are essential, since an increase in dimension can lead to an exponential increase in the number of basis functions required to achieve an accurate approximation. This is known as the "curse of dimensionality". To set the stage for the approximation
methods of this chapter, the first half of the introduction outlines a problem that
arises in the real time application of optimal control theory. Sufficiently accurate
approximation of the value function over a sufficiently large neighborhood often
requires a large number of basis functions, and hence, introduces a large number
of unknown parameters. One way to achieve accurate function approximation with
fewer unknown parameters is to use prior knowledge about the system to determine
the basis functions. However, for general nonlinear systems, prior knowledge of
the features of the optimal value function is generally not available; hence, a large
number of generic basis functions is often the only feasible option.
For some problems, such as the linear quadratic regulator problem, the optimal
value function takes a particular form which makes the choice of basis functions trivial. In the case of the linear quadratic regulator, the optimal value function is of the form $\sum_{i,j=1}^{n} w_{i,j} x_j x_i$ (cf. [1, 2]), so basis functions of the form $\sigma_{i,j} = x_j x_i$ will provide an accurate estimation of the optimal value function. However, in most cases, the form of the optimal value function is unknown, and generic basis functions are employed to parameterize the problem.
Often, kernel functions from reproducing kernel Hilbert spaces are used as generic basis functions, and the approximation problem is solved over a (preferably large) compact domain of $\mathbb{R}^n$ [3–5]. An essential property of reproducing kernel Hilbert spaces is that, given a collection of basis functions in the Hilbert space, there is a unique set of weights that minimize the error in the Hilbert space norm, the so-called ideal


weights [6]. The model choice of kernel is the Gaussian radial basis function given by $K(x, y) = \exp\left(-\|x - y\|^2/\mu\right)$, where $x, y \in \mathbb{R}^n$ and $\mu > 0$ [5, 7]. For the approximation of a function over a large compact domain, a large number of basis functions is required, which leads to an intractable computational problem for online control applications.
Thus, approximation methodologies for the reduction of the number of basis
functions required to achieve accurate function approximation are well motivated.
In particular, the aim of this chapter is the development of an efficient scheme for
the approximation of continuous functions via state and time varying basis functions
that maintain the approximation of a function in a local neighborhood of the state,
deemed the state following (StaF) method. The method developed in this chapter is
presented as a general strategy for function approximation, and can be implemented
in contexts outside of optimal control.
The particular basis functions that will be employed throughout this chapter are
derived from kernel functions corresponding to reproducing kernel Hilbert spaces.
In particular, the centers are selected to be continuous functions of the state variable
bounded by a predetermined value. That is, given a compact set D ⊂ Rn ,  > 0, r >
0, and L ∈ N, ci (x)  x + di (x), where di : Rn → Rn is continuously differentiable
and supx∈D di (x) < r for i = 1, . . . , L. The parameterization of a function V in
terms of StaF kernel functions is given by


L
V̂ (y; x(t), t) = wi (t)K (y, ci (x(t))),
i=1

where $w_i(t)$ is a weight selected to satisfy
$$\limsup_{t\to\infty} E_r(x(t), t) < \epsilon,$$
where $E_r$ is a measure of the accuracy of an approximation in a neighborhood of $x(t)$, such as that of the supremum norm:
$$E_r(x(t), t) = \sup_{y\in B_r(x(t))}\left|V(y) - \hat{V}(y; x(t), t)\right|.$$

The goal of the StaF method is to establish and maintain an approximation of a


function in a neighborhood of the state. Justification for this approach stems from
the observation that the optimal controller only requires the value of the estimation
of the optimal value function to be accurate at the current system state. Thus, when
computational resources are limited, computational efforts should be focused on
improving the accuracy of approximations near the system state.
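As a concrete illustration of the state-following parameterization described above, the following sketch evaluates $\hat{V}(y; x(t), t)$ with centers that move with the state; the Gaussian kernel, the offset functions, and the weight values are illustrative placeholders rather than choices made later in the chapter.

```python
import numpy as np

def gaussian_kernel(y, c, mu=1.0):
    """Gaussian radial basis kernel K(y, c) = exp(-||y - c||^2 / mu)."""
    return np.exp(-np.sum((y - c)**2) / mu)

def staf_value(y, x, weights, offsets, kernel=gaussian_kernel):
    """Evaluate V_hat(y; x, t) = sum_i w_i K(y, c_i(x)) with c_i(x) = x + d_i(x)."""
    return sum(w * kernel(y, x + d(x)) for w, d in zip(weights, offsets))

# Example: three centers placed on a small circle of radius r that follows x.
r = 0.1
offsets = [lambda x, k=k: r * np.array([np.cos(2*np.pi*k/3), np.sin(2*np.pi*k/3)])
           for k in range(3)]
x_t = np.array([0.5, -0.2])            # current state
w_t = np.array([1.0, -0.5, 0.25])      # current weight estimates (placeholders)
print(staf_value(x_t, x_t, w_t, offsets))
```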

Previous nonlinear approximation methods focus on adjusting the centers of radial basis functions (cf. [8–10]). These efforts are focused on offline techniques, which allow the selection of optimal centers for a global approximation. Since computational resources are considered to be limited in the present context, global approximations become less feasible as the dimension of the system increases.
Section 7.3 lays the foundation for the establishment and maintenance of a real-time moving local approximation of a continuous function. Section 7.3.1 of this chapter frames the particular approximation problem of the StaF method. Section 7.3.2 demonstrates that the achievement of an accurate approximation with a fixed number of moving basis functions is possible (Theorem 7.1), and also demonstrates the existence of an ideal weight function (Theorem 7.3). Section 7.3.3 demonstrates an explicit bound on the number of required StaF basis functions for the case of the exponential kernel function. Section 7.3.4 provides a proof of concept demonstrating the existence of weight update laws to maintain an accurate approximation of a function in a local neighborhood, ultimately establishing a uniformly ultimately bounded result. The remaining sections demonstrate the developed method through numerical experiments. Section 7.3.5 gives the results of a gradient chase algorithm.
In Sect. 7.4, a novel model-based reinforcement learning technique is developed to
achieve sufficient excitation without causing undesirable oscillations and expenditure
of control effort like traditional approximate dynamic programming techniques and
at a lower computational cost than state-of-the-art data-driven approximate dynamic
programming techniques. Motivated by the fact that the computational effort required
to implement approximate dynamic programming and the data-richness required to
achieve convergence both decrease with decreasing number of basis functions, this
chapter focuses on reduction of the number of basis functions used for value function
approximation.
A key contribution of this chapter is the observation that online implementation of
an approximate dynamic programming-based approximate optimal controller does
not require an estimate of the optimal value function over the entire domain of
operation of the system. Instead, only an estimate of the slope of the value function
evaluated at the current state is required for feedback. Hence, estimation of the
value function over a small neighborhood of the current state should be sufficient
to implement an approximate dynamic programming-based approximate optimal
controller. Since it is reasonable to postulate that approximation of the value function
over a local domain would require fewer basis functions than approximation over
the entire domain of operation, reduction of the size of the approximation domain is
motivated.
Unlike traditional value function approximation, where the unknown parameters
are constants, the unknown parameters corresponding to the StaF kernels are functions
of the system state. The Lyapunov-based stability analysis presented in Sect. 7.4.3 is
230 7 Computational Considerations

facilitated by the fact that the ideal weights are continuously differentiable functions of
the system state. To facilitate the proof of continuous differentiability, the StaF kernels
are selected from a reproducing kernel Hilbert space. Other function approximation
methods, such as radial basis functions, sigmoids, higher order neural networks, support vector machines, etc., can potentially be utilized in a state-following manner to
achieve similar results provided continuous differentiability of the ideal weights can
be established. An examination of smoothness properties of the ideal weights resulting
from a state-following implementation of the aforementioned function approximation
methods is out of the scope of this chapter.
Another key contribution of this chapter is the observation that model-based reinforcement learning techniques can be implemented without storing any data if the available model is used to simulate persistent excitation. In other words, an excitation signal added to the simulated system, instead of the actual physical system, can be used to learn the value function. Excitation via simulation is implemented using Bellman error extrapolation (cf. [11–13]); however, instead of a large number of autonomous extrapolation functions employed in the previous chapters, a single time-varying extrapolation function is selected, where the time-variation of the extrapolation function simulates excitation. The use of a single extrapolation point introduces a technical challenge since the Bellman error extrapolation matrix is rank deficient at each time instance. The aforementioned challenge is addressed in Sect. 7.4.3 by modifying the stability analysis to utilize persistent excitation of the extrapolated regressor matrix. Simulation results including comparisons with state-of-the-art model-based reinforcement learning techniques are presented to demonstrate the effectiveness of the developed technique.

7.2 Reproducing Kernel Hilbert Spaces

A reproducing kernel Hilbert space, $H$, is a Hilbert space with inner product $\langle\cdot,\cdot\rangle_H$ of functions $f : X \to \mathbb{F}$ (where $\mathbb{F} = \mathbb{C}$ or $\mathbb{R}$) for which, given any $x \in X$, the functional $E_x f := f(x)$ is bounded. By the Riesz representation theorem, for each $x \in X$ there is a unique function $k_x \in H$ for which $\langle f, k_x\rangle_H = f(x)$. Each function $k_x$ is called a reproducing kernel for the point $x \in X$. The function $K(x, y) = \langle k_y, k_x\rangle_H$ is called the kernel function for $H$ [7]. The norm corresponding to $H$ will be denoted as $\|\cdot\|_H$, and the subscript will be suppressed when the Hilbert space is understood. Kernel functions are dense in $H$ under the reproducing kernel Hilbert space norm.
Kernel functions have the property that for each collection of points $\{x_1, \ldots, x_m\} \subset X$, the matrix $\left(K(x_i, x_j)\right)_{i,j=1}^{m}$ is positive semi-definite. The Aronszajn–Moore theorem states that there is a one to one correspondence between kernel functions with

this property and reproducing kernel Hilbert spaces. In fact, starting with a kernel function having the positive semi-definite property, there is an explicit construction for its reproducing kernel Hilbert space. Generally, the norm for the reproducing kernel Hilbert space is given by
$$\|f\|_H := \sup\left\{\|P_{c_1,\ldots,c_M} f\|_H : M \in \mathbb{N} \text{ and } c_1, \ldots, c_M \in X\right\}, \qquad (7.1)$$
where $P_{c_1,\ldots,c_M} f$ is the projection of $f$ onto the subspace of $H$ spanned by the kernel functions $K(\cdot, c_i)$ for $i = 1, \ldots, M$. $P_{c_1,\ldots,c_M} f$ is computed by interpolating the points $(c_i, f(c_i))$ for $i = 1, \ldots, M$ with a function of the form $\sum_{i=1}^{M} w_i K(\cdot, c_i)$. The norm of the projection then becomes¹ $\|P_{c_1,\ldots,c_M} f\| = \left(\sum_{i,j=1}^{M} w_i\overline{w_j} K(c_j, c_i)\right)^{1/2}$. In practice, the utility of computing the norm of $f$ as (7.1) is limited, and alternate forms of the norm are sought for specific reproducing kernel Hilbert spaces.
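The projection and its norm can be computed directly from the Gram matrix of the centers, as in the following sketch (for real-valued data the complex conjugates in the norm formula reduce to plain products); the exponential kernel and the sample values used in the example are illustrative.

```python
import numpy as np

def project_onto_kernels(f_vals, centers, kernel):
    """Interpolation weights of P_{c_1,...,c_M} f and the RKHS norm of the projection."""
    M = len(centers)
    G = np.array([[kernel(centers[i], centers[j]) for j in range(M)]
                  for i in range(M)])            # Gram matrix K(c_i, c_j)
    w = np.linalg.solve(G, f_vals)               # interpolating weights
    norm = np.sqrt(w @ G @ w)                    # (sum_{i,j} w_i w_j K(c_j, c_i))^(1/2)
    return w, norm

# Example with the exponential kernel K(x, y) = exp(x^T y); note that the Gram
# matrix can be ill-conditioned when the centers are closely spaced.
expo = lambda x, y: np.exp(np.dot(x, y))
centers = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
f_vals = np.array([1.0, 1.1, 0.9])               # samples of some f at the centers
print(project_onto_kernels(f_vals, centers, expo))
```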
Unlike L 2 spaces, norm convergence in a reproducing kernel Hilbert space implies
pointwise convergence. This follows since if f n → f in the reproducing kernel
Hilbert space norm, then

| f (x) − f n (x)| = | f − f n , k x | ≤

 f − f n k x  =  f − f n  K (x, x).

When K is a continuous function of X , the term K (x, x) is bounded over compact
sets, and thus, norm convergence implies uniform convergence over compact sets.
Therefore, the problem of establishing an accurate approximation in the supremum
norm of a function is often relaxed to determining an accurate approximation of a
function in the reproducing kernel Hilbert space norm.
Given a reproducing kernel Hilbert space H over a set X and Y ⊂ X , the space
HY obtained by restricting each function f ∈ H to the set Y is itself a reproducing
kernel Hilbert space where the kernel function is given by restricting the original
kernel function to the set Y × Y . The resulting Hilbert space norm is given by

g HY = inf{ f  H : f ∈ H and f |Y = g}.

Therefore, the map f → f |Y is norm decreasing from H to HY [7]. For the purposes
of this paper, the norm obtained by restricting a reproducing kernel Hilbert space H
over Rn to a closed neighborhood Br (x) where r > 0 and x ∈ Rn will be denoted
as  · r,x .

1 For $z \in \mathbb{C}$, the quantity $\mathrm{Re}(z)$ is the real part of $z$, and $\overline{z}$ represents the complex conjugate of $z$.

7.3 StaF: A Local Approximation Method

7.3.1 The StaF Problem Statement

Given a continuous function $V : \mathbb{R}^n \to \mathbb{R}$, $\epsilon, r > 0$, and a dynamical system $\dot{x}(t) = f(x(t), u(t))$, where $f$ is regular enough for the system to be well defined, the goal of the StaF approximation method is to select a number, say $L \in \mathbb{N}$, of state and time varying basis functions $\sigma_i : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}$ for $i = 1, 2, \ldots, L$ and weight signals $w_i : \mathbb{R}_+ \to \mathbb{R}$ for $i = 1, 2, \ldots, L$ such that
$$\limsup_{t\to\infty}\ \sup_{y\in B_r(x(t))}\left|V(y) - \sum_{i=1}^{L} w_i(t)\sigma_i(y; x(t), t)\right| < \epsilon. \qquad (7.2)$$

Central problems to the StaF method are those of determining the basis functions and the weight signals. When reproducing kernel Hilbert spaces are used for basis functions, (7.2) can be relaxed so that the supremum norm is replaced with the Hilbert space norm. Since the Hilbert space norm of a reproducing kernel Hilbert space dominates the supremum norm (cf. [7, Corollary 4.36]), (7.2) with the supremum norm is simultaneously satisfied. Moreover, when using a reproducing kernel Hilbert space, the basis functions can be selected to correspond to centers placed in a moving neighborhood of the state. In particular, given a kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ corresponding to a (universal) reproducing kernel Hilbert space, $H$, and center functions $c_i : \mathbb{R}^n \to \mathbb{R}^n$ for which $c_i(x) - x = d_i(x)$ is a continuous function bounded by $r$, then the StaF problem becomes the determination of weight signals $w_i : \mathbb{R}_+ \to \mathbb{R}$ for $i = 1, \ldots, L$ such that
$$\limsup_{t\to\infty}\left\|V(\cdot) - \sum_{i=1}^{L} w_i(t)K(\cdot, c_i(x(t)))\right\|_{r,x(t)} < \epsilon, \qquad (7.3)$$
where $\|\cdot\|_{r,x(t)}$ is the norm of the reproducing kernel Hilbert space obtained by restricting functions in $H$ to $B_r(x(t))$ [7, 14].
Since (7.3) implies (7.2), the focus of this section is to demonstrate the feasibility of satisfying (7.3). Theorem 7.1 demonstrates that under a certain continuity assumption a bound on the number of kernel functions necessary for the maintenance of an approximation throughout a compact set can be determined, and Theorem 7.3 shows that a collection of continuous ideal weight functions can be determined to satisfy (7.3). Theorem 7.3 justifies the use of weight update laws for the maintenance of an accurate function approximation, and this is demonstrated by Theorem 7.5.
The choice of reproducing kernel Hilbert space for Sect. 7.3.5 is that which corresponds to the exponential kernel $K(x, y) = \exp\left(x^T y\right)$, where $x, y \in \mathbb{R}^n$. The reproducing kernel Hilbert space will be denoted by $F^2(\mathbb{R}^n)$ since it is closely connected to the Bergmann–Fock space [15]. The reproducing kernel Hilbert space corresponding to the exponential kernel is a universal reproducing kernel

Hilbert space [7, 16], which means that given any compact set $D \subset \mathbb{R}^n$, $\epsilon > 0$, and continuous function $f : D \to \mathbb{R}$, there exists a function $\hat{f} \in F^2(\mathbb{R}^n)$ for which $\sup_{x\in D}|f(x) - \hat{f}(x)| < \epsilon$.

7.3.2 Feasibility of the StaF Approximation and the Ideal


Weight Functions

The first theorem concerning the StaF method demonstrates that if the state variable
is constrained to a compact subset of Rn , then there is a finite number of StaF basis
functions required to establish the accuracy of an approximation.

Theorem 7.1 Suppose that $K : X \times X \to \mathbb{C}$ is a continuous kernel function corresponding to a reproducing kernel Hilbert space, $H$, over a set $X$ equipped with a metric topology. If $V \in H$, $D$ is a compact subset of $X$ with infinite cardinality, $r > 0$, and $\|V\|_{x,r}$ is continuous with respect to $x$, then for all $\epsilon > 0$ there is an $L \in \mathbb{N}$ such that for each $x \in D$ there are centers $c_1, c_2, \ldots, c_L \in B_r(x)$ and weights $w_i \in \mathbb{C}$ such that
$$\left\|V(\cdot) - \sum_{i=1}^{L} w_i K(\cdot, c_i)\right\|_{r,x} < \epsilon.$$

Proof Let $\epsilon > 0$. For each neighborhood $B_r(x)$ with $x \in D$, there exists a finite number of centers $c_1, \ldots, c_L \in B_r(x)$, and weights $w_1, \ldots, w_L \in \mathbb{C}$, such that
$$\left\|V(\cdot) - \sum_{i=1}^{L} w_i K(\cdot, c_i)\right\|_{r,x} < \epsilon.$$
Let $L_{x,\epsilon}$ be the minimum such number. The claim of the proposition is that the set $Q_\epsilon \triangleq \{L_{x,\epsilon} : x \in D\}$ is bounded. Assume by way of contradiction that $Q_\epsilon$ is unbounded, and take a sequence $\{x_n\} \subset D$ such that $L_{x_n,\epsilon}$ is a strictly increasing sequence (i.e., an unbounded sequence of integers) and $x_n \to x$ in $D$. It is always possible to find such a convergent sequence, since every compact subset of a metric space is sequentially compact. Let $c_1, \ldots, c_{L_{x,\epsilon/2}} \in B_r(x)$ and $w_1, \ldots, w_{L_{x,\epsilon/2}} \in \mathbb{C}$ be centers and weights for which
$$\left\|V(\cdot) - \sum_{i=1}^{L_{x,\epsilon/2}} w_i K(\cdot, c_i)\right\|_{r,x} < \epsilon/2. \qquad (7.4)$$
For convenience, let each $c_i \in B_r(x)$ be expressed as $x + d_i$ for $d_i \in B_r(0)$. The norm in (7.4), which will be denoted by $E(x)$, can be written as
$$E(x) \triangleq \left(\|V\|_{r,x}^2 - 2\,\mathrm{Re}\left(\sum_{i=1}^{L_{x,\epsilon/2}} w_i V(x + d_i)\right) + \sum_{i,j=1}^{L_{x,\epsilon/2}} w_i\overline{w_j} K(x + d_i, x + d_j)\right)^{1/2}.$$
By the hypothesis, $K$ is continuous with respect to $x$, which implies that $V$ is continuous [3], and $\|V\|_{r,x}$ is continuous with respect to $x$ by the hypothesis. Hence, there exists $\eta > 0$ for which $|E(x) - E(x_n)| < \epsilon/2$, $\forall x_n \in B_\eta(x)$. Thus $E(x_n) < E(x) + \epsilon/2 < \epsilon$ for sufficiently large $n$. By minimality, $L_{x_n,\epsilon} < L_{x,\epsilon/2}$ for sufficiently large $n$. This is a contradiction. □

The assumption of the continuity of $\|V\|_{r,x}$ in Theorem 7.1 is well founded. There are several examples where the assumption is known to hold. For instance, if the reproducing kernel Hilbert space is a space of real entire functions, as it is for the exponential kernel, then $\|V\|_{r,x}$ is not only continuous, but constant.
Using a similar argument as that in Theorem 7.1, the theorem can be shown to hold when the restricted Hilbert space norm is replaced by the supremum norm over $B_r(x)$. The proof of the following theorem can be found in [17].
Proposition 7.2 Let $D$ be a compact subset of $\mathbb{R}^n$, $V : \mathbb{R}^n \to \mathbb{R}$ be a continuous function, and $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ be a continuous and universal kernel function. For all $\epsilon, r > 0$, there exists $L \in \mathbb{N}$ such that for each $x \in D$, there is a collection of centers $c_1, \ldots, c_L \in B_r(x)$ and weights $w_1, \ldots, w_L \in \mathbb{R}$ such that $\sup_{y\in B_r(x)}\left|V(y) - \sum_{i=1}^{L} w_i K(y, c_i)\right| < \epsilon$.

Now that it has been demonstrated that only a finite number of moving centers is
required to maintain an accurate approximation, it will now be demonstrated that the
ideal weights corresponding to the moving centers change continuously or smoothly
with the corresponding change in centers. In traditional adaptive control applications,
it is assumed that there is a collection of constant ideal weights, and much of the
theory is in the demonstration of the convergence of approximate weights to the ideal
weights. Since the ideal weights are no longer constant, it is necessary to show that
the ideal weights change smoothly as the system progresses. The smooth change in
centers will allow the proof of uniform ultimately bounded results through the use of
weight update laws. One such result will be demonstrated in Sect. 7.3.4, in particular
a uniformly ultimately bounded result is proven in Theorem 7.5.
Theorem 7.3 Let $H$ be a reproducing kernel Hilbert space over a set $X \subset \mathbb{R}^n$ with a strictly positive kernel $K : X \times X \to \mathbb{C}$ such that $K(\cdot, c) \in C^m(\mathbb{R}^n)$ for all $c \in X$. Suppose that $V \in H$. Let $C$ be an ordered collection of $L$ distinct centers, $C = (c_1, c_2, \ldots, c_L) \in X^L$, with the associated ideal weights
$$W_H(C) = \arg\min_{a\in\mathbb{C}^L}\left\|\sum_{i=1}^{L} a_i K(\cdot, c_i) - V(\cdot)\right\|_H. \qquad (7.5)$$
The function $W_H$ is $m$-times continuously differentiable with respect to each component of $C$.

Proof The determination of $W_H(C)$ is equivalent to computing the projection of $V$ onto the space $Y = \mathrm{span}\{K(\cdot, c_i) : i = 1, \ldots, L\}$. To compute the projection, a Gram-Schmidt algorithm may be employed. The Gram-Schmidt algorithm is most easily expressed in its determinant form. Let $D_0 = 1$ and $D_m = \det\left(K(c_j, c_i)\right)_{i,j=1}^{m}$, then for $m = 1, \ldots, L$ the functions
$$u_m(x) := \frac{1}{\sqrt{D_{m-1}D_m}}\det\begin{pmatrix} K(c_1, c_1) & K(c_1, c_2) & \cdots & K(c_1, c_m)\\ K(c_2, c_1) & K(c_2, c_2) & \cdots & K(c_2, c_m)\\ \vdots & \vdots & \ddots & \vdots\\ K(c_{m-1}, c_1) & K(c_{m-1}, c_2) & \cdots & K(c_{m-1}, c_m)\\ K(x, c_1) & K(x, c_2) & \cdots & K(x, c_m)\end{pmatrix}$$
constitute an orthonormal basis for $Y$. Since $K$ is strictly positive definite, $D_m$ is positive for each $m$ and every $C$. The coefficient for each $K(x, c_l)$ with $l = 1, \ldots, m$ in $u_m$ is a sum of products of the terms $K(c_i, c_j)$ for $i, j = 1, \ldots, m$. Each such coefficient is $m$-times differentiable with respect to each $c_i$, $i = 1, \ldots, L$. Finally, each term in $W_H(C)$ is a linear combination of the coefficients determined by $u_m$ for $m = 1, \ldots, L$, and thus is continuously differentiable with respect to each $c_i$ for $i = 1, \ldots, L$. □

7.3.3 Explicit Bound for the Exponential Kernel

Theorem 7.1 demonstrated a bound on the number of kernel functions required for
the maintenance of the accuracy of a moving local approximation. However, the
proof does not provide an algorithm to computationally determine the upper bound.
Indeed, even when the approximation with kernel functions is performed over a
fixed compact set, a general bound for the number of collocation nodes required for
accurate function approximation is unknown.
Thus, it is desirable to have a computationally determinable upper bound to the
number of StaF basis functions required for the maintenance of an accurate function
approximation. Theorem 7.4 provides a calculable bound on the number of exponen-
tial functions required for the maintenance of an approximation with respect to the
supremum norm.
While such error bounds have been computed for the exponential function before
(cf. [18]), current literature lets the “frequencies” or centers of the exponential kernel
functions to be unconstrained. The contribution of Theorem 7.4 is the development
of an error bound while constraining the size of the centers.

Theorem 7.4 Let K : Rn × Rn → R given by K (x, y) = exp x T y be the expo-
nential kernel function. Let D ⊂ Rn be a compact set, V : D → R continuous, and
, r > 0. For each x ∈ D, there exists a finite number of centers c1 , . . . , c L x, ∈ Br (x)
and weights w1 , w2 , . . . , w L x, ∈ R, such that
236 7 Computational Considerations
 
 
 
L x,


sup V (y) − wi K (y, ci ) < .
y∈Br (x)  i=1 

If p is an approximating polynomial that achieves the same accuracy over Br (x)


with degree N x, , then an asymptotically  similar bound can be found with L x, kernel
x, +Sx,
functions, where L x, < n+N N x, +Sx,
for some constant Sx, . Moreover, N x, and Sx,
can be bounded uniformly over D, and thus, so can L x, .

Proof For notational simplicity, the quantity  f  D,∞ denotes the supremum norm of
a function f : D → R over the compact set D throughout the proof of Theorem 7.4.
First, consider the ball of radius r centered at the origin. The statement of the
theorem can be proven by finding an approximation of monomials by a linear com-
bination of exponential kernel functions. 
Let α = (α1 , α2 , . . . , αn ) be a multi-index, and define |α| = αi . Note that


n  
1
m |α| αi
(exp (yi /m) − 1) = y1α1 y2α2 ··· ynαn +O
i=1
m

which leads to the sum


      n 
 α1 α2 αn    li 
|α| |α|− i li
m ··· (−1) exp yi
li ≤αi ,i=1,2,...,n
l1 l2 ln i=1
m
 
1
= y1α1 y2α2 · · · ynαn + O , (7.6)
m

where the notation gm (x) = O( f (m)) means that for sufficiently large m, there is a
constant C for which gm (x) < C f (m), ∀y ∈ Br (0). The big-oh constant indicated by
O(1/m) can be computed in terms of the derivatives of the exponential function via
Taylor’s Theorem. The centers corresponding to this approximation are of the form
li /m where li is a non-negative integer satisfying li < αi . Hence, for m sufficiently
large, the centers reside in Br (0).
Tin Br (y), let x = (x1 , x2 , . . . , xn ) ∈ R ,
T n
To shift the centers so that they reside
and multiply both sides of (7.6) by exp y x to get

      n 
 α1 α2 αn    li
|α| |α|− i li
m ··· (−1) exp yi + xi
li ≤αi ,i=1,2,...,n
l1 l2 ln i=1
m
 
yT x
α1 α2 αn
 1
=e y1 y2 · · · yn + O .
m

For each multi-index, α = (α1 , α2 , . . . , αn ), the centers for the approximation of the
corresponding monomial are of the form xi + li /m for 0 ≤ li ≤ αi . Thus, by linear
7.3 StaF: A Local Approximation Method 237

T
combinations of these kernel functions, a function of the form e y x g(y), with g a
multivariate polynomial, can be uniformly approximated by exponential functions
over Br (x). Moreover if g is a polynomial
 of degree β, then this approximation can
be a linear combination of n+β β
kernel functions.
Let  > 0 and suppose that px is polynomial with degree N x, such that
px (y) = V (y) + 1 (y) where |1 (y)| < e y x −1
T
D,∞  /2 ∀y ∈ Br (x). Let q x (y) be
a polynomial in R variables of degree Sx, such that qx (y) = e−y x + 2 (y) where
T
n
−1 −1
2 (y) < V  D,∞ e y x  D,∞  /2, ∀y ∈ Br (x).
T

The above construction indicates that there is a sequence of linear combinations


of exponential kernel functions, Fm (y), with a fixed number of centers inside Br (x),
for which
 
T 1
Fm (y) = e y x qx (y) px (y) + O
m
 T   
1
= e y x e−y x + 2 (y) (V (y) + 1 (y)) + O
T
.
m

After multiplication and an application of the triangle inequality, the following is


established:
   

V −1
D,∞ e
y T x −1
 D,∞ 1
|Fm (y) − V (y)| <  +  2 + O
4 m

for all y ∈ Br (x). The degree of the polynomial qx , Sx, , can be uniformly bounded
T
in terms of the modulus of continuity of e y x over D. Similarly, the uniform bound
on the degree of px , N x, , can be described in terms of the modulus of continuity
of V over D. The number of centers required for Fm (y) is determined by the degree
of the polynomial q · p (treating the x terms of q as constant), which is sum of the
two polynomial degrees. Finally for m large enough and  small enough, |Fm (y) −
V (y)| < , and the proof is complete. 

Theorem 7.4 demonstrates an upper bound required for the accurate approxima-
tion of a function through the estimation of approximating polynomials. Moreover,
the upper bound is a function of the polynomial degrees. The exponential kernel will
be used for simulations in Sect. 7.3.5.

7.3.4 The Gradient Chase Theorem

As mentioned before, the theory of adaptive control is centered around the concept
of weight update laws. Weight update laws are a collection of rules that the approx-
imating weights must obey which lead to convergence to the ideal weights. In the
case of the StaF approximation framework, the ideal weights are replaced with ideal
238 7 Computational Considerations

weight functions. Theorem 7.3 showed that if the moving centers of the StaF kernel
functions are selected in such a way that the centers adjust smoothly with respect to
the state x, then the ideal weight functions will also change smoothly with respect
to x. Thus, in this context, weight update laws of the StaF approximation framework
aim to achieve an estimation of the ideal weight function at the current state.
Theorem 7.5 provides an example of such weight update laws that achieve a uni-
formly ultimately bounded result. The theorem takes advantage of perfect samples of
a function in the reproducing kernel Hilbert space H corresponding to a real valued
kernel function. The proof of the theorem follows the standard proof for the conver-
gence of the gradient descent algorithm for a quadratic programming problem [19].

Theorem 7.5 Let H be a real valued reproducing kernel Hilbert space over Rn with
a continuously differentiable strictly positive definite kernel function K : Rn × Rn →
R. Let V ∈ H , D ⊂ Rn be a compact set, and x : R → Rn be a state variable
subject to the dynamical system ẋ = q(x, t), where q : Rn × R+ → Rn is a bounded
locally Lipschitz continuous function. Further suppose that x(t) ∈ D ∀t > 0. Let
c : Rn → R L , where for each i = 1, . . . , L, ci (x) = x + di (x) where di ∈ C 1 (Rn ),
and let a ∈ R L . Consider the function
 2
 
L 
 
F(a, c) = V − ai K (·, ci (x)) .
 
i=1 H

At each time instance t > 0, there is a unique W (t) for which W (t) = arg mina∈RL
F(a, c(x(t))). Given any  > 0 and initial value a 0 , there is a frequency τ > 0,
where if the gradient descent algorithm (with respect to a) is iterated at time steps
Δt < τ −1 , then F(a k , ck ) − F(wk , ck ) will approach a neighborhood of radius  as
k → ∞.

Proof Let ¯ > 0. By the Hilbert space structure of H :

F(a, c) = V 2H − 2V (c)T a + a T K (c)a,

where V (c) = (V (c1 ), . . . , V (c L ))T and K (c) = (K (ci , c j ))i,L j=1 is the symmet-
ric strictly positive kernel matrix corresponding to c. At each time iteration t k ,
k = 0, 1, 2, . . ., the corresponding centers and weights will be written as ck ∈ Rn L
and a k ∈ R L , respectively. The ideal weights corresponding to ck will be denoted
by wk . It can be shown that wk = K (ck )−1 V (ck ) and F(wk , ck ) = V 2H − V (ck )T
K (ck )V (ck ). Theorem 7.3 ensures that the ideal weights change continuously with
respect to the centers which remain in a compact set D̃ L , where D̃ = {x ∈ R L :
x − D ≤ maxi=1,...,L supx∈D |di (x)| }, so the collection of ideal weights is bound-
ed. Let R > ¯ be large enough so that B R (0) contains both the initial value a 0 and
the set of ideal weights. To facilitate the subsequent analysis, consider the constants
7.3 StaF: A Local Approximation Method 239

R0 = max |q(x, t)|, R1 = max |∇a F(a, c)|,


x∈D,t>0 a∈Br (0),c∈ D̃

R2 = max |∇c F(w(c), c)|, R3 = max |ḋi (x(t)|,


c∈ D̃ c∈ D̃
 
d 

R4 = max  w(c) ,
c∈ D̃ dc

and let Δt < τ −1  ¯ (2(R0 + R3 )(R1 R4 (R0 + R3 ) + R2 + 1))−1 . The proof aims
to show that by using the gradient descent law for choosing a k , the inequality

F(a k+1 , ck+1 ) − F(wk+1 , ck+1 ) ¯


<δ+
F(a k , ck ) − F(wk , ck ) F(a k , ck ) − F(wk , ck )

can be achieved for some 0 < δ < 1. Set

a k+1 = a k + λg, (7.7)

where g = −∇a F(a k , ck ) = 2V (ck ) − 2K (ck )a k and λ is selected so that the


 + λg, c ) is minimized. The λ that minimizes this quantity is λ =
k k
quantity F(a

gT g (g T g)2
2g T K (ck )g
which yields F(a ,
k+1 k
c ) = F(a k
, c k
) − 4g T K (ck )g
. Since F(a k+1 , ck+1 )
is continuously differentiable in the second variable, we have F(a k+1 , ck+1 ) =
F(a k+1 , ck ) + ∇c F(a k+1 , η) · (ck+1 − ck ). Since |ċ(x(t))| < R0 + R3 , an applica-
tion of the mean value theorem demonstrates that ck+1 − ck  < (R0 + R3 )Δt. Thus

F(a k+1 , ck+1 ) = F(a k+1 , ck ) + 1 (t k ),

where |1 (t k )| < ¯ /2, ∀k. The quantity F(wk+1 , ck+1 ) is continuously differentiable
in both variables. Thus, by the multi-variable chain rule and another application of
the mean value theorem

F(wk+1 , ck+1 ) = F(wk , ck ) + 2 (t k ),

for |2 (t k )| < ¯ /2 ∀k. Therefore, the following is established:

F(a k+1 , ck+1 ) − F(wk+1 , ck+1 ) F(a k+1 , ck ) − F(wk , ck ) + (1 (t k ) − 2 (t k ))


=
F(a k , ck ) − F(wk , ck ) F(a k , ck ) − F(wk , ck )
(g g)
T 2
1 (t k ) − 2 (t k )
=1− T + .
(g K (ck )g)(g T K (ck )−1 g) F(a k , ck ) − F(wk , ck )

The Kantorovich inequality [19, p. 77] yields


 2
(g T g)2 Ack /ack − 1
1− T ≤ , (7.8)
(g K (ck )g)(g T K (ck )−1 g) Ack /ack + 1
240 7 Computational Considerations

where Ack is the largest eigenvalue of K (ck ) and ack is the smallest eigenvalue of
K (ck ). The quantity on the right of (7.8) is continuous with respect to Ack and ack .
In turn, Ack and ack are continuous with respect to K (ck ) (c.f. Exercise 4.1.6 [20])
which is continuous with respect to ck . Therefore, there is a largest value, δ, that the
right hand side of (7.8) obtains on the compact set D̃ and this value is less than 1.
Moreover, δ is independent of ¯ , so it may be declared that ¯ = (1 − δ). Finally,

F(a k+1 , ck+1 ) − F(wk+1 , ck+1 ) (1 (t k ) − 2 (t k ))


≤ δ + .
F(a k , ck ) − F(wk , ck ) F(a k , ck ) − F(wk , ck )

Therefore, setting e(k) = F(a k , ck ) − F(wk , ck ), it can be shown that e(k + 1) ≤


δe(k) + (1 − δ) and the conclusion of the theorem follows. 

7.3.5 Simulation for the Gradient Chase Theorem

To demonstrate the effectiveness of the gradient chase theorem, a simulation is per-


formed on a two-dimensional linear system is presented below. The system dynamics
are given by     
ẋ1 0 1 x1
= ,
ẋ2 −1 0 x2

which is the dynamical system corresponding to a circular trajectory. The state de-
pendent function to be approximated is

V (x1 , x2 ) = x12 + 5x22 + tanh(x1 x2 ), (7.9)



where exponential kernels, K (x, y) = exp x T y , are used to approximate (7.9). The
centers are arranged in an equilateral triangle centered about the state. In particular,
each center resides on a circle of radius 0.1 centered at the state as
 
sin((i − 1)2π/3)
ci (x) = x + 0.1
cos((i − 1)2π/3)

for i = 1, 2, 3.
The initial values selected for the weights are a 0 = [0 0 0]T . The gradient descent
weight update law, given by (7.7), is applied 10 iterations per time-step and the time-
steps incremented every 0.01 s. Figures 7.1, 7.2, 7.3 and 7.4 present the results of the
simulation.
Figure 7.4 demonstrates that the function approximation error is driven to a small
neighborhood of zero as the gradient chase theorem is implemented, which numeri-
cally validates the claim of the uniformly ultimately bounded result of Theorem 7.5.
Approximations of the ideal weight function depicted in Fig. 7.3, are periodic and
7.3 StaF: A Local Approximation Method 241

Fig. 7.1 Trajectory of the Phase Portrait of the Dynamical System


state vector
1

0.5

−0.5

−1

−1 0 1

Fig. 7.2 Comparison of V Actual and Estimated Function


and the approximation V̂ 12

10

0
0 2 4 6 8
Time (s)

Fig. 7.3 The values of the Weight Estimates


weight function estimates 6

−2

−4

−6
0 2 4 6 8
Time (s)
242 7 Computational Considerations

Fig. 7.4 Error committed by Function Estimation Error


the approximation 8

−2
0 2 4 6 8
Time (s)

smooth. Smoothness of the ideal weight function itself is given in Theorem 7.3, and
the periodicity of the approximation follows from the periodicity of the selected
dynamical system, Fig. 7.1. Finally, Fig. 7.2 shows that along the system trajectory,
the approximation V̂ rapidly converges to the true function V . Approximation of the
function is maintained as the system state moves through its domain as anticipated.

7.4 Local Approximation for Efficient Model-Based


Reinforcement Learning2

In the following, Sect. 7.4.1 summarizes key results from the previous section in
the context of model-based reinforcement learning. In Sect. 7.4.2, the StaF-based
function approximation approach is used to approximately solve an optimal regula-
tion problem online using exact model knowledge via value function approximation.
Section 7.4.3 is dedicated to Lyapunov-based stability analysis of the developed tech-
nique. Section 7.4.4 extends the developed technique to systems with uncertain drift
dynamics and Sect. 7.4.5 presents comparative simulation results.

7.4.1 StaF Kernel Functions

Let H be a universal reproducing kernel Hilbert space over a compact set χ ⊂ Rn


with a continuously differentiable positive definite kernel K : χ × χ → R. Let
∗ ∗
V : χ → R be a function such that V ∈ H . Let C  [c1 , c2 , · · · c L ]T ∈ χ L be
a set of distinct centers, and let σ : χ × χ L → R L be defined as σ (x, C) 

2 Parts of the text in this section are reproduced, with permission, from [21], 2016,
c Elsevier.
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning 243

[K (x, c1 ) , · · · , K (x, c L )]T . Then, there exists a unique set of weights W H such
that  
 ∗
W H (C) = arg min a T σ (·, C) − V  , (7.10)
a∈R L H

where · H denotes the Hilbert space norm. Furthermore, for any given  > 0,
there exists a constant
  of centers, C ∈ χ , and a set of weights, W H ∈
L ∈ N, a set L
 ∗ 
R L , such that W HT σ (·, C) − V  ≤ . On compact sets, the Hilbert space norm
H
corresponding to a Hilbert space with continuously differentiable kernels dominates
the supremum norm of functions and their derivatives [7, Corollary 4.36]. Hence, the
function can be approximated
 as well asits derivative, that is, there exists centers
 and
 T ∗  T ∗
weights for which, W H σ (·, C) − V  <  and W H ∇σ (·, C) − ∇V  <
χ,∞ χ,∞
. The notation ∇ f denotes the gradient of f with respect to the first argument and
the notation  f  A,∞ denotes the supremum of the absolute value (or the pointwise
norm, if f is vector-valued) of f over the set A.
Let Hx,r denote the restriction of the Hilbert space H to Br (x) ⊂ χ . Then, Hx,r
is a Hilbert space with the restricted kernel K x,r : Br (x) × Br (x) → R defined as
K x,r (y, z) = K (y, z) , ∀ (y, z) ∈ Br (x) × Br (x). The Weierstrass Theorem indi-
cates that as r decreases, the degree N x, of the polynomial needed to achieve the
same error  over Br (x) decreases [22]. Hence, by Theorem 7.4, approximation of
a function over a smaller domain requires a smaller number of exponential kernels.
Furthermore, provided the region of interest is small enough, the number of kernels
required
 to approximate continuous functions with arbitrary accuracy can be reduced
to n+22
.
In the StaF approach, the centers are selected to follow the current state x (i.e.,
the locations of the centers are defined as a function of the system state). Since the
system state evolves in time, the ideal weights are not constant. To approximate the
ideal weights using gradient-based algorithms, it is essential that the weights change
smoothly with respect to the system state. Theorem 7.3 establishes differentiability of
the ideal weights as a function of the centers to facilitate implementation of gradient-
based update laws to learn the time-varying ideal weights in real-time.

7.4.2 StaF Kernel Functions for Online Approximate


Optimal Control

Consider the Bolza problem introduced in Sect. 1.5 where the functions f and g
are assumed to be known and locally Lipschitz continuous. Furthermore, assume
that f (0) = 0 and that ∇ f : Rn → Rn×n is continuous. The selection of an optimal
regulation problem and the assumption that the system dynamics are known are
motivated by ease of exposition. Using the concurrent learning-based adaptive system
identifier and the state augmentation technique described in Sects. 3.3 and 4.4, the
approach developed in this section can be extended to a class of trajectory tracking
244 7 Computational Considerations

problems in the presence of uncertainties in the system drift dynamics. For a detailed
description of StaF-based online approximate optimal control under uncertainty, see
Sect. 7.4.4. Simulation results in Sect. 7.4.5 demonstrate the performance of such an
extension.
The expression for the optimal policy in (1.13) indicates that to compute the
optimal action when the system is at any given state x, one only needs to evaluate
the gradient ∇V ∗ at x. Hence, to compute the optimal policy at x, one only needs to
approximate the value function over a small neighborhood around x. Furthermore, as
established in Theorem 7.4, the number of basis functions required to approximate
the value function is smaller if the region for the approximation is smaller (with
respect to the ordering induced by set containment). Hence, in this result, the aim is
to obtain a uniform approximation of the value function over a small neighborhood
around the current system state.
StaF kernels are employed to achieve the aforementioned objective. To facilitate
the development, let x be in the interior of χ . Then, for all  > 0, there exists a
∗  ∗ 
function V ∈ Hx,r such that sup y∈Br (x) V ∗ (y) − V (y) < , where Hx,r is a re-
striction of a universal reproducing kernel Hilbert space, H , introduced in Sect. 7.4.1,
to Br (x). In the developed StaF-based method, a small compact set Br (x) around
the current state x is selected for value function approximation by selecting the cen-
ters C ∈ Br (x) such that C = c (x) for some continuously differentiable function
c : χ → χ L . Using StaF kernels centered at a point x, the value function can be
represented as

V ∗ (y) = W (x)T σ (y, c (x)) + ε (x, y) , y ∈ Br (x)

where ε (x, y) denotes the function approximation error.


Since the centers of the kernel functions change as the system state changes, the
ideal weights also change as the system state changes. The state-dependent nature of
the ideal weights differentiates this approach from state-of-the-art approximate dy-
namic programming methods in the sense that the stability analysis needs to account
for changing ideal weights. Based on Theorem 7.3, it can be established that the ideal
weight function W : χ → R L defined as W (x)  W Hx,r (c (x)) , where W Hx,r was
introduced in (7.10), is continuously differentiable, provided the functions σ and c
are continuously differentiable.
The approximate value function V̂ : Rn × Rn × R L → R and the approximate
policy û : Rn × Rn × R L → Rm , evaluated at a point y ∈ Br (x), using StaF kernels
centered at x, can then be expressed as
 
V̂ y, x, Ŵc  ŴcT σ (y, c (x)) ,
  1
û y, x, Ŵa  − R −1 g T (y) ∇σ (y, c (x))T Ŵa , (7.11)
2
where σ denotes the vector of basis functions, introduced in Sect. 7.4.1.
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning 245

The objective of the critic is to learn the ideal parameters W (x), and the objective
of the actor is to implement a stabilizing controller based on the parameters learned by
the critic . Motivated by the stability analysis, the actor and the critic maintain separate
estimates Ŵa and Ŵc , respectively, of the ideal parameters W (x). Using the estimates
V̂ and û for V ∗ and u ∗ , respectively, the Bellman error is δ : Rn × Rn × R L × R L →
R, is computed as
        
δ y, x, Ŵc , Ŵa  r y, û y, x, Ŵa + ∇ V̂ y, x, Ŵc f (y) + g (y) û y, x, Ŵa .
(7.12)
To solve the optimal control problem, the critic aims to find a set of parameters
Ŵc and the actor aims to find a set of parameters Ŵa such that δ y, x, Ŵc , Ŵa =
0, ∀x ∈ Rn , ∀y ∈ Br (x). Since an exact basis for value function approximation is
generally not available, an approximate set of parameters that minimizes the Bellman
error is sought.
To learn the ideal parameters online, the critic evaluates a form δt : R≥t0 → R of
the Bellman error at each time instance t as
 
δt (t)  δ x (t) , x (t) , Ŵc (t) , Ŵa (t) , (7.13)

where Ŵa (t) and Ŵc (t) denote the estimates of the actor and the critic weights,
respectively, at time t, and the notation x (t) is used to denote the state the system
in (1.9), at time t, when starting from initial time t0 , initial state x0 , and under the
feedback controller  
u (t) = û x (t) , x (t) , Ŵa (t) . (7.14)

Since (1.14) constitutes a necessary and sufficient condition for optimality, the Bell-
man error serves as an indirect measure of how close the critic parameter estimates
Ŵc are to their ideal values; hence, in the context of reinforcement learning, each
evaluation of the Bellman error is interpreted as gained experience. Since the Bell-
man error in (7.13) is evaluated along the system trajectory, the experience gained is
along the system trajectory.
Learning based on simulation of experience is achieved by extrapolating the Bell-
man error to unexplored areas of the state-space. The critic selects a set of functions
 N
xi : Rn × R≥t0 → Rn i=1 such that each xi maps the current state x (t) to a point
xi (x (t) , t) ∈ Br (x (t)). The critic then evaluates a form δti : R≥t0 → R of the Bell-
man error for each xi as
 
δti (t) = δ xi (x (t) , t) , x (t) , Ŵc (t) , Ŵa (t) . (7.15)

The critic then uses the Bellman errors from (7.13) and (7.15) to improve the estimate
Ŵc (t) using the recursive least-squares-based update law
246 7 Computational Considerations

ω (t)  ωi (t) N
Ŵ˙ c (t) = −kc1 Γ (t)
kc2
δt (t) − Γ (t) δti (t) , (7.16)
ρ (t) N ρ (t)
i=1 i

 
where ρi (t)  1 + γ1 ωiT (t) ωi (t), ρ (t)  1 + γ1 ω T (t) ω (t),
 
ω (t)  ∇σ (x (t) , c (x (t))) f (x (t))+∇σ (x (t) , c (x (t))) g (x (t)) û x (t) , x (t) , Ŵa (t) ,
 
ωi (t)  ∇σ (xi (x (t)) , c (x (t))) g (xi (x (t) , t)) û xi (x (t) , t) , x (t) , Ŵa (t)
+ ∇σ (xi (x (t)) , c (x (t))) f (xi (x (t) , t)) ,

and kc1 , kc2 , γ1 ∈ R>0 are constant learning gains. In (7.16), Γ (t) denotes the least-
squares learning gain matrix updated according to

ω (t) ω T (t) kc2  ωi (t) ω T (t)


N
Γ˙ (t) = βΓ (t) − kc1 Γ (t) Γ (t) − Γ (t) i
Γ (t) ,
ρ (t)
2 N i=1
ρi
2
(t)
Γ (t0 ) = Γ0 , (7.17)

where β ∈ R>0 is a constant forgetting factor. Motivated by a Lyapunov-based sta-


bility analysis, the update law for the actor is designed as

  kc1 G σT (t) Ŵa (t) ω (t)T


Ŵ˙ a (t) = −ka1 Ŵa (t) − Ŵc (t) − ka2 Ŵa (t) + Ŵc (t)
4ρ (t)
N
kc2 G σT i (t) Ŵa (t) ωiT (t)
+ Ŵc (t) , (7.18)
i=1
4Nρi (t)

where

G σ (t)  ∇σ (x (t) , c (x (t))) g (x (t)) R −1 g T (x (t)) ∇σ T (x (t) , c (x (t))) ,


G σ i (t)  ∇σ (xi (x (t) , t) , c (x (t))) g (xi (x (t) , t)) R −1 g T (xi (x (t) , t))
· ∇σ T (xi (x (t) , t) , c (x (t))) ,

and ka1 , ka2 ∈ R>0 are learning gains.

7.4.3 Analysis

The computational cost associated


with the implementation of the developed method
can be computed as O N L 3 + mn L + Lm 2 + n 2 + m 2 . Since local approxima-
tion is targeted, the StaF kernels result in a reduction in the number of required basis
functions (i.e., L). Since the computational cost has a cubic relationship with the
number of basis functions, the StaF methodology results in a significant computa-
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning 247

tional benefit. The computational cost grows linearly with the number of extrapola-
tion points (i.e., N ). If the points are selected using grid-based methods employed
in results such as [11], the number N increases geometrically with respect to the
state dimension, n. On the other hand, if the extrapolation points are selected to be
time varying, then even a single point is sufficient, provided the time-trajectory of
the point contains enough information to satisfy the subsequent Assumption 7.6.
In the following, Assumption 7.6 formalizes the conditions under which the tra-
jectories of the closed-loop system can be shown to be ultimately bounded, and
Lemma 7.7 facilitates the analysis of the closed-loop system when time-varying ex-
trapolation trajectories are utilized.
For notational brevity, time-dependence of all the signals is suppressed hereafter.
Let χ denote the projection of Bζ onto Rn . To facilitate the subsequent stability
analysis, the Bellman errors in (7.13) and (7.15) are expressed in terms of the weight
estimation errors W̃c  W − Ŵc and W̃a = W − Ŵa as
1
δt = −ω T W̃c + W̃a G σ W̃a + Δ (x) ,
4
1
δti = −ωi W̃c + W̃aT G σ i W̃a + Δi (x) ,
T
(7.19)
4
where the functions Δ, Δi : Rn → R are uniformly bounded over χ such that the
bounds Δ and Δi  decrease with decreasing ∇ε and ∇W . Let a candidate
Lyapunov function VL : Rn+2L × R≥t0 → R be defined as
1 1
VL (Z , t)  V ∗ (x) + W̃cT Γ −1 (t) W̃c + W̃aT W̃a ,
2 2
where V ∗ is the optimal value function, and
 T
Z = x T , W̃cT , W̃aT .

To facilitate learning, the system states x and the selected functions xi are assumed
to satisfy the following.
Assumption 7.6 There exist constants T ∈ R>0 and c1 , c2 , c3 ∈ R≥0 , such that

t+T  
ω (τ ) ω T (τ )
c1 I L ≤ dτ, ∀t ∈ R≥t0 ,
ρ 2 (τ )
t
 
1  ωi (t) ωiT (t)
N
c2 I L ≤ inf ,
t∈R≥t0 N i=1 ρi2 (t)
t+T  N

1 ωi (τ ) ωiT (τ )
c3 I L ≤ dτ, ∀t ∈ R≥t0 ,
N i=1
ρi2 (τ )
t

where at least one of the constants c1 , c2 , and c3 is strictly positive.


248 7 Computational Considerations

Unlike typical approximate dynamic programming literature that assumes ω is per-


sistently exciting, Assumption 7.6 only requires either the regressor ω or the regressor
ωi to be persistently exciting. The regressor ω is completely determined by the sys-
tem state x, and the weights Ŵa . Hence, excitation in ω vanishes as the system states
and the weights converge. Hence, in general, it is unlikely that c1 > 0. However, the
regressor ωi depends on xi , which can be designed independent of the system state x.
Hence, c3 can be made strictly positive if the signal xi contains enough frequencies,
and c2 can be made strictly positive by selecting a sufficient number of extrapolation
functions.
Intuitively, selection of a single time-varying Bellman error extrapolation function
results in virtual excitation. That is, instead of using input-output data from a per-
sistently excited system, the dynamic model is used to simulate persistent excitation
to facilitate parameter convergence. The performance of the developed extrapola-
tion method is demonstrated using comparative simulations in Sect. 7.4.5, where it
is demonstrated that the developed method using a single time-varying extrapolation
point results in improved computational efficiency when compared to a large number
of fixed extrapolation functions.
The following lemma facilitates the stability analysis by establishing upper and
lower bounds on the eigenvalues of the least-squares learning gain matrix, Γ .
 
Lemma 7.7 Provided Assumption 7.6 holds and λmin Γ0−1 > 0, the update law in
(7.17) ensures that the least-squares gain matrix satisfies
Γ I L ≤ Γ (t) ≤ Γ I L , (7.20)

where
1
Γ =      ,
min kc1 c1 + kc2 max c2 T, c3 , λmin Γ0−1 e−βT
1
Γ =  −1  (kc1 +kc2 ) .
λmax Γ0 + βγ1

Furthermore, Γ > 0.

Proof The proof closely follows the proof of [23, Corollary 4.3.2]. The update law
in (7.17) implies that
ω (t) ω T (t) kc2  ωi (t) ωiT (t)
N
d −1
Γ (t) = −βΓ −1 (t) + kc1 + .
dt ρ 2 (t) N i=1 ρi2 (t)

Hence,

t t N
−1 ω (τ ) ω T (τ ) kc2 ωi (τ ) ωiT (τ )
Γ (t) = kc1 e−β(t−τ ) dτ + e−β(t−τ ) dτ
ρ 2 (τ ) N i=1
ρi2 (τ )
0 0

+ e−βt Γ0−1 .
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning 249

To facilitate the proof, let t < T . Then,


 
Γ −1 (t) ≥ e−βt Γ0−1 ≥ e−βT Γ0−1 ≥ λmin Γ0−1 e−βT I L .

Since the integrands are positive, it follows that if t ≥ T, then Γ −1 can be bounded
as

t t N
−1 ω (τ ) ω T (τ ) kc2 ωi (τ ) ωiT (τ )
Γ (t) ≥ kc1 e−β(t−τ ) dτ + e −β(t−τ )
dτ.
ρ 2 (τ ) N i=1
ρi2 (τ )
t−T t−T

Therefore,

t t N
−1 −βT ω (τ ) ω T (τ ) kc2 −βT ωi (τ ) ωiT (τ )
Γ (t) ≥ kc1 e dτ + e dτ.
ρ 2 (τ ) N i=1
ρi2 (τ )
t−T t−T

Using Assumption 7.6,

t  t
1
N
ωi (τ ) ωiT (τ )   ω (τ ) ω T (τ )
dτ ≥ max c2 T, c3 I L , dτ ≥ c1 I L .
N i=1
ρi (τ )
2 ρ 2 (τ )
t−T t−T

A lower bound for Γ −1 is thus obtained as,


    
Γ −1 (t) ≥ min kc1 c1 + kc2 max c2 T, c3 , λmin Γ0−1 e−βT I L . (7.21)

Provided Assumption 7.6 holds, the lower bound in (7.21) is strictly positive. Fur-
ω (t)ω T (t)
thermore, using the facts that ω(t)ω (t)
T

ρ 2 (t)
≤ γ11 and i ρ 2 (t)i ≤ γ11 , ∀t ∈ R≥t0 ,
i

t  
kc2  1
N
1
Γ −1
(t) ≤ e −β(t−τ )
kc1 + I L dτ + e−βt Γ0−1
γ1 N i=1 γ1
0
 
 −1  (kc1 + kc2 )
≤ λmax Γ0 + IL .
βγ1

Since the inverse of the lower and upper bounds on Γ −1 are the upper and lower
bounds on Γ , respectively, the proof is complete. 

Since the optimal value function is positive definite, (7.20) and [24, Lemma 4.3]
can be used to show that the candidate Lyapunov function satisfies the following
bounds
vl (Z ) ≤ VL (Z , t) ≤ vl (Z ) , (7.22)
250 7 Computational Considerations

∀t ∈ R≥t0 and ∀Z ∈ R2+2L . In (7.22), vl , vl : R≥0 → R≥0 are class K functions. To


facilitate the analysis, let c ∈ R>0 be a constant defined as

β c2
c + , (7.23)
2Γ kc2 2

and let ι ∈ R>0 be a constant defined as


 2
(kc1 +k
√c2 )Δ
∇W f  Γ −1 G W σ W 
3 v
+ Γ
+ 2 1 1
ι + G V W σ  + G V ε 
4kc2 c 2 2
 2 2
G W σ W +G V σ 
+ ka2 W  + ∇W f  + √ σ W 
(kc1 +kc2 )G
2 4 v
+ ,
(ka1 + ka2 )

where G W σ  ∇W G∇σ T , G V σ  ∇V ∗ G∇σ T , G V W  ∇V ∗ G∇W T , and G V  


∇V ∗ G∇ T . Let vl : R≥0 → R≥0 be a class K function such that

Q (x) kc2 c  2 (k + k )  2
  a1 a2  
vl (Z ) ≤ + W̃c  + W̃a  .
2 6 8
The sufficient conditions for the subsequent Lyapunov-based stability analysis are
given by
 2
G W σ  (kc1 +kc2 )W T G σ 

+ √
4 v
+ ka1
kc2 c
≥ , (7.24)
3 (ka1 + ka2 )
 
(ka1 + ka2 ) G W σ  (kc1 +kc2 )W G σ 
≥ + √ , (7.25)
4 2 4 v

vl−1 (ι) < vl −1 vl (ζ ) . (7.26)

The sufficient condition in (7.24) can be satisfied provided the points for Bellman
error extrapolation are selected such that the minimum eigenvalue c, introduced in
(7.23) is large enough. The sufficient condition in (7.25) can be satisfied without
affecting (7.24) by increasing the gain ka2 . The sufficient condition in (7.26) can be
satisfied provided c, ka2 , and the state penalty Q (x) are selected to be sufficiently
large and the StaF kernels for value function approximation are selected such that
∇W , ε, and ∇ε are sufficiently small.
Similar to neural network-based approximation methods such as [25–32], the
function approximation error, ε, is unknown, and in general, infeasible to compute
for a given function, since the ideal neural network weights are unknown. Since a
bound on ε is unavailable, the gain conditions in (7.24)–(7.26) cannot be formally
verified. However, they can be met using trial and error by increasing the gain ka2 ,
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning 251

the number of StaF basis functions, and c by selecting more points to extrapolate the
Bellman error.
To improve computational efficiency, the size of the domain around the current
state where the StaF kernels provide good approximation of the value function is
desired to be small. Smaller approximation domain results in almost identical extrap-
olated points, which in turn, results in smaller c. Hence, the approximation domain
cannot be selected to be arbitrarily small and needs to be large enough to meet the
sufficient conditions in (7.24)–(7.26).
Theorem 7.8 Provided Assumption 7.6 holds and the sufficient gain conditions in
(7.24)–(7.26) are satisfied, the controller in (7.14) and the update laws in (7.16)–
(7.18) ensure that the state x and the weight estimation errors W̃c and W̃a are
ultimately bounded.

Proof The time-derivative of the Lyapunov function is given by


  1  
V̇L = V̇ ∗ + W̃cT Γ −1 Ẇ − Ŵ˙ c + W̃cT Γ˙ −1 W̃c + W̃aT Ẇ − Ŵ˙ a .
2
Using Theorem 7.3, the time derivative of the ideal weights can be expressed as

Ẇ = ∇W (x) ( f (x) + g (x) u) . (7.27)

Using (7.16)–(7.19) and (7.27), the time derivative of the Lyapunov function is
expressed as

V̇L = ∇V ∗ (x) ( f (x) + g (x) u) + W̃cT Γ −1 ∇W (x) ( f (x) + g (x) u)


  
ω 1
− W̃cT Γ −1 −kc1 Γ −ω T W̃c + W̃a G σ W̃a + Δ (x)
ρ 4
 
kc2  ωi 1 T
N
T −1
− W̃c Γ − Γ W̃ G σ i W̃a
N i=1 i
ρ 4 a
 
kc2  ωi  T 
N
T −1
− W̃c Γ − Γ −ωi W̃c + Δi (x)
N i=1 i
ρ
 
1 ωω T
− W̃cT Γ −1 βΓ − kc1 Γ Γ Γ −1 W̃c
2 ρ
 
kc2  ωi ωiT
N
1 T −1
− W̃c Γ − Γ Γ Γ −1 W̃c
2 N i=1
ρ i
 
+ W̃a ∇W (x) ( f (x) + g (x) u) − Ŵ˙ a .
T

Provided the sufficient conditions in (7.24)–(7.26) hold, the time derivative of the
candidate Lyapunov function can be bounded as
252 7 Computational Considerations

V̇L ≤ −vl (Z ) , ∀ζ > Z  > vl−1 (ι) . (7.28)

Using (7.22), (7.26), and (7.28), [24, Theorem 4.18] can be invoked to conclude that
Z is ultimately bounded, in the sense that

lim sup Z (t) ≤ vl −1 vl vl−1 (ι) .
t→∞

∈ L∞, x (·) ∈ L∞ , W̃a (·) ∈ L∞ , and W̃c (·) ∈ L∞ . Since x (·) ∈ L∞


Since Z (·)
and W ∈ C 0 χ , R L , t → W (x (t)) ∈ L∞ . Hence, Ŵa (·) ∈ L∞ and Ŵc (·) ∈ L∞ ,
which implies u (·) ∈ L∞ . 

7.4.4 Extension to Systems with Uncertain Drift Dynamics

If the drift dynamics are uncertain, a parametric approximation of the dynamics can be
employed for Bellman error extrapolation. On any compact set C ⊂ Rn the  function
f can be represented using a neural network as f (x) = θ T σ f Y T x1 (x) + εθ (x) ,
 T
where x1 (x)  1, x T ∈ Rn+1 , θ ∈ R p+1×n , and Y ∈ Rn+1× p denote the constant
unknown output-layer and hidden-layer neural network weights, σ f : R p → R p+1
denotes a bounded neural network basis function, εθ : Rn → Rn denotes the function
reconstruction error, and p ∈ N denotes the number of neural network neurons.
Using the universal function approximation property of single layer neural networks,
given a constant matrix Y such that the rows of σ f Y T x1 form a proper basis (cf.
[33]), there exist constant ideal weights θ and known constants θ , εθ , and εθ ∈ R
such that θ  ≤ θ < ∞, supx∈C εθ (x) ≤ εθ , and supx∈C ∇x εθ (x) ≤ εθ . Using
an estimate θ̂ ∈ R p+1×n of the weight matrix θ, the function
 fcan be approximated
ˆ
by the function f : R × Rn p+1×n ˆ
→ R defined as f x, θ̂  θ̂ T σθ (x) , where
n
  T 
σθ : Rn → R p+1 is defined as σθ (x)  σ f Y T 1, x T . Using fˆ, the Bellman
error in (7.12) can be approximated by δ̂ : Rn × Rn × R L × R L × R p+1×n → Rn as

          
δ̂ y, x, Ŵc , Ŵa , θ̂  r y, û y, x, Ŵa + ∇ V̂ y, x, Ŵc fˆ y, θ̂ + g (y) û y, x, Ŵa .
(7.29)
Using δ̂, the instantaneous Bellman errors in (7.13) and (7.15) are redefined as
 
δt (t)  δ̂ x (t) , x (t) , Ŵc (t) , Ŵa (t) , θ̂ (t) , (7.30)

and  
δti (t)  δ̂ xi (x (t) , t) , x (t) , Ŵc (t) , Ŵa (t) , θ̂ (t) , (7.31)
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning 253

respectively, where ω and ωi are redefined as


 
ω (t)  ∇σ (x (t) , c (x (t))) fˆ x (t) , θ̂ (t)
 
+ ∇σ (x (t) , c (x (t))) g (x (t)) û x (t) , x (t) , Ŵa (t) , (7.32)
 
ωi (t)  ∇σ (xi (x (t) , t) , c (x (t))) f xi (x (t) , t) , θ̂ (t)
 
+ ∇σ (xi (x (t) , t) , c (x (t))) g (xi (x (t) , t)) · û xi (x (t) , t) , x (t) , Ŵa (t) .
(7.33)

Assumption 4.1 describes the characteristic of a parameter estimator required to


achieve closed-loop stability. The main result for uncertain drift dynamics is sum-
marized in the following theorem.
Theorem 7.9 Provided a parameter estimator that satisfies Assumption 4.1 is avail-
able, the StaF kernels and the basis functions for system identification are selected
such that ∇W and the approximation errors ε, ∇ε, εθ , and ∇εθ are sufficiently small,
and provided the points for Bellman error extrapolation are selected such that the
minimum eigenvalue c, introduced in (7.23) is sufficiently large, then the update laws
given by (7.16)–(7.18), with the renewed definitions in (7.29)–(7.33) ensure that the
state x and the weight estimation errors θ̃, W̃c , and W̃a are ultimately bounded.

Proof The proof is a trivial combination of the proofs of Theorems 7.8 and 4.3. 

7.4.5 Simulation

Optimal Regulation Problem with Exact Model Knowledge


To demonstrate the effectiveness of the StaF kernels, simulations are performed on
a two-state nonlinear dynamical system. The system dynamics are given by (1.9),
where x = [x1 , x2 ]T ,
!
−x1 + x2
f (x) = ,
− 21 x1 − 21 x2 (cos (2x1 ) + 2)2
!
0
g (x) = . (7.34)
cos (2x1 ) + 2

The control objective is to minimize the cost

∞

x T (τ ) x (τ ) + u 2 (τ ) dτ. (7.35)
0
254 7 Computational Considerations

The system in (7.34) and the cost in (7.35) are selected because the corresponding op-
timal control problem has a known analytical solution. The optimal value function is
V ∗ (x) = 21 x1o2 + x2o2 , and the optimal control policy is u ∗ (x) = − (cos(2x1 ) + 2)x2 .
To apply the developed technique to this problem, the value function is approxi-
mated using three exponential StaF kernels (i.e, σ (x, C) = [σ1 (x, c1 ) , σ2 (x, c2 ) ,
T
σ3 (x, c3 )]T ). The kernels are selected to be σi (x, ci ) = e x ci − 1, i = 1, · · · , 3. The
centers ci are selected to be on the vertices of a shrinking equilateral triangle around
the current state (i.e., ci = x + di (x) i = 1, · · · , 3), where d1 (x) = 0.7ν o (x) ·
[0, 1]T , d2 (x)=0.7ν o (x) · [0.87, −0.5] , and d3 (x) = 0.7ν (x) · [−0.87, −0.5] ,
T o T
T
and ν o (x)  x1+γ x+0.01
T
2x x
denotes the shrinking function, where γ2 ∈ R>0 is a con-
stant normalization gain. To ensure sufficient excitation, a single point for Bell-
man error extrapolation is selected at random from a uniform distribution over a
2.1ν o (x (t)) × 2.1ν o (x (t)) square centered at the current state x (t) so that the
function xi is of the form xi (x, t) = x + ai (t) for some ai (t) ∈ R2 . For a general
problem with an n−dimensional state, exponential kernels can be utilized with the
centers placed at the vertices of an n−dimensional simplex with the current state as
the centroid. The extrapolation point can be sampled at each iteration from a uniform
distribution over an n−dimensional hypercube centered at the current state.
The system is initialized at t0 = 0 and the initial conditions

x (0) = [−1, 1]T , Ŵc (0) = 0.4 × 13×1 , Γ (0) = 500I3 , Ŵa (0) = 0.7Ŵc (0) ,

and the learning gains are selected as

kc1 = 0.001, kc2 = 0.25, ka1 = 1.2, ka2 = 0.01, β = 0.003, γ1 = 0.05, γ2 = 1.

Figure 7.5 shows that the developed StaF-based controller drives the system states
to the origin while maintaining system stability. Figure 7.6 shows the implemented
control signal compared with the optimal control signal. It is clear that the imple-

Fig. 7.5 State trajectories 1


generated using StaF
kernel-based approximate
dynamic programming 0.5
(reproduced with permission
from [21], 2016,
c Elsevier)
0

-0.5

-1
0 5 10
Time (s)
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning 255

Fig. 7.6 kernel-based 0


approximate dynamic
programming compared with
the optimal control trajectory -0.5
(reproduced with permission
from [21], 2016,
c Elsevier)
-1

-1.5

-2
0 1 2 3 4 5
Time (s)

Fig. 7.7 Trajectories of the 1.5


estimates of the unknown
parameters in the value
function generated using 1
StaF kernel-based
approximate dynamic
programming. The ideal 0.5
weights are unknown and
time-varying; hence, the
obtained weights can not be 0
compared with their ideal
weights (reproduced with
permission from [21], -0.5
0 5 10
2016,
c Elsevier)
Time (s)

mented control converges to the optimal controller. Figures 7.7 and 7.8 shows that
the weight estimates for the StaF-based value function and policy approximation
remain bounded and converge as the state converges to the origin. Since the ideal
values of the weights are unknown, the weights can not directly be compared with
their ideal values. However, since the optimal solution is known, the value function
estimate corresponding to the weights in Fig. 7.7 can be compared to the optimal
value function at each time t. Figure 7.9 shows that the error between the optimal
and the estimated value functions rapidly decays to zero.
Optimal Tracking Problem with Parametric Uncertainties in the Drift Dynamics
This simulation demonstrates the effectiveness of the extension developed in
Sect. 7.4.4. The drift dynamics in the two-state nonlinear dynamical system in (7.34)
are assumed to be linearly parameterized as
256 7 Computational Considerations

Fig. 7.8 Trajectories of the 1.5


estimates of the unknown
parameters in the policy
generated using StaF 1
kernel-based approximate
dynamic programming. The
ideal weights are unknown 0.5
and time-varying; hence, the
obtained weights can not be
compared with their ideal 0
weights (reproduced with
permission from [21],
2016,
c Elsevier) -0.5
0 5 10
Time (s)

Fig. 7.9 The error between 2


the optimal and the estimated
value function (reproduced 0
with permission from [21],
2016,
c Elsevier)
-2

-4

-6

-8
0 5 10
Time (s)

⎡ ⎤
! x1
θ1 θ2 θ3 ⎣ ⎦,
f (x) = x2
θ4 θ5 θ6
" #$ % 2 x (cos (2x 1 ) + 2)
θT
" #$ %
σθ (x)

where θ ∈ R3×2 is the matrix of unknown parameters, and σθ is the known vector of
basis functions. The ideal values of the unknown parameters are θ1 = −1, θ2 = 1,
θ3 = 0, θ4 = −0.5, θ5 = 0, and θ6 = −0.5. Let θ̂ denote an estimate of the unknown
matrix θ. The control objective is to drive the estimate θ̂ to the ideal matrix θ , and to
drive the state x to follow a desired trajectory xd . The desired trajectory is selected
to be the solution to the initial value problem
! !
−1 1 0
ẋd (t) = x (t) , xd (0) = , (7.36)
−2 1 d 1
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning 257

Fig. 7.10 Tracking error 1


trajectories generated using
the proposed method for the
trajectory tracking problem 0.5
(reproduced with permission
from [21], 2016,
c Elsevier)
0

-0.5

-1
0 10 20 30 40
Time (s)

*∞ 
and the cost functional is selected to be 0 e T (t) diag (10, 10) e (t) + (μ (t))2 dt,
where e (t) = x (t) − xd (t) , μ is an auxiliary controller designed using the devel-
oped method, and the tracking controller is designed as
 ! 
−1 1
u (t) = g + (xd (t)) xd (t) − θ̂ T σθ (xd (t)) + μ (t) ,
−2 1

where g + (x) denotes the pseudoinverse of g (x).


 T
The value function is a function of the concatenated state ζ  e T xdT ∈ R4 .
The value function is approximated using five exponential StaF kernels given by
σi (ζ o , C), where the five centers are selected according to ci = ζ o + di (ζ o ) to form
a regular five dimensional simplex around the current state with ν o (ζ o ) ≡ 1. Learning
gains for system identification and value function approximation are selected as

kc1 = 0.001, kc2 = 2, ka1 = 2, ka2 = 0.001, β = 0.01, γ1 = 0.1, γ2 = 1, k = 500,


Γθ = I3 , Γ (0) = 50I5 , kθ = 20.

Sufficient excitation is ensured by selecting a single state trajectory ζi (ζ o , t) 


ζ o + ai (t) for Bellman error extrapolation, where ai (t) is sampled at each t from
a uniform distribution over the a 2.1 × 2.1 × 2.1 × 2.1 hypercube centered at the
origin. The history stack required for concurrent learning contains ten points, and is
recorded online using a singular value maximizing algorithm (cf. [34]), and the re-
quired state derivatives are computed using a fifth order Savitzky–Golay smoothing
filter (cf. [35]).
The initial values for the state and the state estimate are selected to be x (0) =
[0, 0]T and x̂ (0) = [0, 0]T , respectively. The initial values for the neural network
weights for the value function, the policy, and the drift dynamics are selected to be
0.025 × 15×1 , 0.025 × 15×1 , and 03×2 , respectively. Since the system in (7.34) has
no stable equilibria, the initial policy μ̂ (ζ, 03×2 ) is not stabilizing. The stabilization
demonstrated in Fig. 7.10 is achieved via fast simultaneous learning of the system
dynamics and the value function.
258 7 Computational Considerations

-1

-2
0 10 20 30 40
Time (s)

Fig. 7.11 Control signal generated using the proposed method for the trajectory tracking problem
(reproduced with permission from [21], 2016,
c Elsevier)

0.5

-0.5
0 10 20 30 40
Time (s)

Fig. 7.12 Actor weight trajectories generated using the proposed method for the trajectory tracking
problem. The weights do not converge to a steady-state value because the ideal weights are not
constant, they are functions of the time-varying system state. Since an analytical solution to the
optimal tracking problem is not available, weights cannot be compared against their ideal values
(reproduced with permission from [21], 2016,
c Elsevier)

Figures 7.10 and 7.11 demonstrate that the controller remains bounded and the
tracking error is regulated to the origin. The neural network weights are functions of
the system state ζ . Since ζ converges to a periodic orbit, the neural network weights
also converge to a periodic orbit (within the bounds of the excitation introduced
by the Bellman error extrapolation signal), as demonstrated in Figs. 7.12 and 7.13.
Figure 7.14 demonstrates that the unknown parameters in the drift dynamics, repre-
sented by solid lines, converge to their ideal values, represented by dashed lines.
The developed technique is compared with the model-based reinforcement learn-
ing method developed in [11] for regulation and [12] for tracking, respectively.
The simulations are performed in MATLAB Simulink at 1000 Hz on the same
7.4 Local Approximation for Efficient Model-Based Reinforcement Learning 259

0.5

-0.5
0 10 20 30 40
Time (s)

Fig. 7.13 Critic function weight trajectories generated using the proposed method for the trajectory
tracking problem. The weights do not converge to a steady-state value because the ideal weights
not constant, they are functions of the time-varying system state. Since an analytical solution to the
optimal tracking problem is not available, weights cannot be compared against their ideal values
(reproduced with permission from [21], 2016,
c Elsevier)

Fig. 7.14 Trajectories of the Drift Dynamics NN Weights


unknown parameters in the 1.5
system drift dynamics for the
trajectory tracking problem. 1
The dotted lines represent the
true values of the parameters
0.5
(reproduced with permission
from [21], 2016,
c Elsevier)
0

-0.5

-1
0 10 20 30 40
Time (s)

Table 7.1 Simulation results for 2, 3, and 4 dimensional nonlinear systems


Problem description Regulation (2-states) Regulation (3-states) Tracking (4-states)
Controller StaF [11] StaF [11] StaF [12]
Running time (seconds) 6.5 17 9.5 62 12 260
Total cost 2.8 1.8 9.3 12.3 3.9 3.4
RMS steady-state error 2.5E − 6 0 4.3E − 6 4.5E − 6 3.5E − 4 2.5E − 4

machine. The simulations run for 100 s of simulated time. Since the objective is
to compare computational efficiency of the model-based reinforcement learning
method, exact knowledge of the system model is used. Table 7.1 shows that the
260 7 Computational Considerations

developed controller requires significantly fewer computational resources than the


controllers from [11, 12]. Furthermore, as the system dimension increases, the de-
veloped controller significantly outperforms the controllers from [11, 12] in terms
of computational efficiency.
Since the optimal solution for the regulation problem is known to be quadratic, the
model-based reinforcement learning method from [11] is implemented using three
quadratic basis functions. Since the basis used is exact, the method from [11] yields
a smaller steady-state error than the developed method, which uses three generic
StaF kernels. For the three-state regulation problem and the tracking problem, the
methods from [11, 12] are implemented using polynomial basis functions selected
based on a trial-and-error approach. The developed technique is implemented using
generic StaF kernels. In this case, since the optimal solution is unknown, both the
methods use generic basis functions, resulting in similar steady-state errors.
The two main advantages of StaF kernels are that they are universal, in the sense
that they can be used to approximate a large class of value functions, and that they
target local approximation, resulting in a smaller number of required basis functions.
However, the StaF kernels trade optimality for universality and computational ef-
ficiency. The kernels are generic, and the weight estimates need to be continually
adjusted based on the system trajectory. Hence, as shown in Table 7.1, the developed
technique results in a higher total cost than state-of-the-art model-based reinforce-
ment learning techniques.

7.5 Background and Further Reading

Reinforcement learning has become a popular tool for determining online solutions of
optimal control problems for systems with finite state and action-spaces [30, 36–40].
Due to various technical and practical difficulties, implementation of reinforcement
learning-based closed-loop controllers on hardware platforms remains a challenge.
Approximate dynamic programming-based controllers are void of pre-designed sta-
bilizing feedback and are completely defined by the estimated parameters. Hence,
the error between the optimal and the estimated value function is required to decay to
a sufficiently small bound sufficiently fast to establish closed-loop stability. The size
of the error bound is determined by the selected basis functions, and the convergence
rate is determined by richness of the data used for learning.
Fast approximation of the value function over a large neighborhood requires suf-
ficiently rich data to be available for learning. In traditional approximate dynamic
programming methods such as [31, 41, 42], richness of data manifests itself as the
amount of excitation in the system. In experience replay-based techniques such as
[34, 43–45], richness of data is quantified by eigenvalues of a recorded history stack.
In model-based reinforcement learning techniques such as [11–13], richness of data
corresponds to the eigenvalues of a learning matrix. As the dimension of the system
and the number of basis functions increases, the richer data is required to achieve
learning. In traditional approximate dynamic programming methods, the demand
7.5 Background and Further Reading 261

for rich data is met by adding excitation signals to the controller, thereby causing
undesirable oscillations. In experience replay-based approximate dynamic program-
ming methods and in model-based reinforcement learning, the demand for richer
data causes exponential growth in the required data storage. Hence, experimental
implementations of traditional approximate dynamic programming techniques such
as [25–32, 41, 42, 46, 47] and data-driven approximate dynamic programming tech-
niques such as [11–13, 45, 48, 49] in high dimensional systems are scarcely found
in the literature.
The control design in (7.11) exploits the fact that given a basis σ for approximation
of the value function, the basis 21 R −1 g T ∇σ T approximates the optimal controller,
provided the dynamics control-affine. As a part of future research, possible extensions
to nonaffine systems could potentially be explored by approximating the controller
using an independent basis (cf. [50–57]).

References

1. Kirk D (2004) Optimal control theory: an introduction. Dover, Mineola


2. Liberzon D (2012) Calculus of variations and optimal control theory: a concise introduction.
Princeton University Press, Princeton
3. Christmann A, Steinwart I (2010) Universal kernels on non-standard input spaces. In: Advances
in neural information processing, pp 406–414
4. Micchelli CA, Xu Y, Zhang H (2006) Universal kernels. J Mach Learn Res 7:2651–2667
5. Park J, Sanberg I (1991) Universal approximation using radial-basis-function networks. Neural
Comput 3(2):246–257
6. Folland GB (1999) Real analysis: modern techniques and their applications, 2nd edn. Pure and
applied mathematics, Wiley, New York
7. Steinwart I, Christmann A (2008) Support vector machines. Information science and statistics,
Springer, New York
8. Gaggero M, Gnecco G, Sanguineti M (2013) Dynamic programming and value-function ap-
proximation in sequential decision problems: error analysis and numerical results. J Optim
Theory Appl 156
9. Gaggero M, Gnecco G, Sanguineti M (2014) Approximate dynamic programming for stochastic
n-stage optimization with application to optimal consumption under uncertainty. Comput Optim
Appl 58(1):31–85
10. Zoppoli R, Sanguineti M, Parisini T (2002) Approximating networks and extended Ritz method
for the solution of functional optimization problems. J Optim Theory Appl 112(2):403–440
11. Kamalapurkar R, Walters P, Dixon WE (2013) Concurrent learning-based approximate optimal
regulation. In: Proceedings of the IEEE conference on decision and control, Florence, IT, pp
6256–6261
12. Kamalapurkar R, Andrews L, Walters P, Dixon WE (2014) Model-based reinforcement learning
for infinite-horizon approximate optimal tracking. In: Proceedings of the IEEE conference on
decision and control, Los Angeles, CA, pp 5083–5088
13. Kamalapurkar R, Klotz J, Dixon WE (2014) Concurrent learning-based online approximate
feedback Nash equilibrium solution of N-player nonzero-sum differential games. IEEE/CAA
J Autom Sin 1(3):239–247
14. Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404
15. Zhu K (2012) Analysis on fock spaces, vol 263. Graduate texts in mathematics, Springer, New
York
262 7 Computational Considerations

16. Pinkus A (2004) Strictly positive definite functions on a real inner product space. Adv Comput
Math 20:263–271
17. Rosenfeld JA, Kamalapurkar R, Dixon WE (2015) State following (StaF) kernel functions for
function approximation part I: theory and motivation. In: Proceedings of the American control
conference, pp 1217–1222
18. Beylkin G, Monzon L (2005) On approximation of functions by exponential sums. Appl Com-
put Harmon Anal 19(1):17–48
19. Bertsekas DP (1999) Nonlinear programming. Athena Scientific, Belmont
20. Pedersen GK (1989) Analysis now, vol 118. Graduate texts in mathematics, Springer, New
York
21. Kamalapurkar R, Rosenfeld J, Dixon WE (2016) Efficient model-based reinforcement learning
for approximate online optimal control. Automatica 74:247–258
22. Lorentz GG (1986) Bernstein polynomials, 2nd edn. Chelsea Publishing Co., New York
23. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall, Upper Saddle River
24. Khalil HK (2002) Nonlinear systems, 3rd edn. Prentice Hall, Upper Saddle River
25. Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput
12(1):219–245
26. Padhi R, Unnikrishnan N, Wang X, Balakrishnan S (2006) A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw 19(10):1648–1660
27. Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using
approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B
Cybern 38:943–949
28. Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst Mag 9(3):32–50
29. Dierks T, Thumati B, Jagannathan S (2009) Optimal control of unknown affine nonlinear
discrete-time systems using offline-trained neural networks with proof of convergence. Neural
Netw 22(5–6):851–860
30. Mehta P, Meyn S (2009) Q-learning and pontryagin’s minimum principle. In: Proceedings of
the IEEE conference on decision and control, pp 3598–3605
31. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
32. Zhang H, Cui L, Zhang X, Luo Y (2011) Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming method.
IEEE Trans Neural Netw 22(12):2226–2236
33. Sadegh N (1993) A perceptron network for functional identification and control of nonlinear
systems. IEEE Trans Neural Netw 4(6):982–988
34. Chowdhary G, Yucelen T, Mühlegg M, Johnson EN (2013) Concurrent learning adaptive control
of linear systems with exponentially convergent bounds. Int J Adapt Control Signal Process
27(4):280–301
35. Savitzky A, Golay MJE (1964) Smoothing and differentiation of data by simplified least squares
procedures. Anal Chem 36(8):1627–1639
36. Bertsekas D, Tsitsiklis J (1996) Neuro-dynamic programming. Athena Scientific, Belmont
37. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
38. Konda V, Tsitsiklis J (2004) On actor-critic algorithms. SIAM J Control Optim 42(4):1143–
1166
39. Bertsekas D (2007) Dynamic programming and optimal control, vol 2, 3rd edn. Athena Scien-
tific, Belmont
40. Szepesvári C (2010) Algorithms for reinforcement learning. Synthesis lectures on artificial
intelligence and machine learning. Morgan & Claypool Publishers, San Rafael
41. Vamvoudakis KG, Lewis FL (2009) Online synchronous policy iteration method for optimal
control. In: Yu W (ed) Recent advances in intelligent control systems. Springer, Berlin, pp
357–374
References 263

42. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
43. Chowdhary G (2010) Concurrent learning for convergence in adaptive control without persis-
tency of excitation. Ph.D. thesis, Georgia Institute of Technology
44. Chowdhary G, Johnson E (2011) A singular value maximizing data recording algorithm for
concurrent learning. In: Proceedings of the American control conference, pp 3547–3552
45. Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and expe-
rience replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1):193–202
46. Zhang H, Cui L, Luo Y (2013) Near-optimal control for nonzero-sum differential games of
continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern 43(1):206–
216
47. Zhang H, Liu D, Luo Y, Wang D (2013) Adaptive dynamic programming for control algorithms
and stability. Communications and control engineering, Springer, London
48. Luo B, Wu HN, Huang T, Liu D (2014) Data-based approximate policy iteration for affine
nonlinear continuous-time optimal control design. Automatica
49. Yang X, Liu D, Wei Q (2014) Online approximate optimal control for affine non-linear systems
with unknown internal dynamics using adaptive dynamic programming. IET Control Theory
Appl 8(16):1676–1688
50. Ge SS, Zhang J (2003) Neural-network control of nonaffine nonlinear system with zero dy-
namics by state and output feedback. IEEE Trans Neural Netw 14(4):900–918
51. Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832
52. Zhang X, Zhang H, Sun Q, Luo Y (2012) Adaptive dynamic programming-based optimal
control of unknown nonaffine nonlinear discrete-time systems with proof of convergence.
Neurocomputing 91:48–55
53. Liu D, Huang Y, Wang D, Wei Q (2013) Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int J Control 86(9):1554–
1566
54. Bian T, Jiang Y, Jiang ZP (2014) Adaptive dynamic programming and optimal control of
nonlinear nonaffine systems. Automatica 50(10):2624–2632
55. Yang X, Liu D, Wei Q, Wang D (2015) Direct adaptive control for a class of discrete-time
unknown nonaffine nonlinear systems using neural networks. Int J Robust Nonlinear Control
25(12):1844–1861
56. Kiumarsi B, Kang W, Lewis FL (2016) H-∞ control of nonaffine aerial systems using off-policy
reinforcement learning. Unmanned Syst 4(1):1–10
57. Song R, Wei Q, Xiao W (2016) Off-policy neuro-optimal control for unknown complex-valued
nonlinear systems based on policy iteration. Neural Comput Appl 46(1):85–95
Appendix A
Supplementary Lemmas and Definitions

A.1 Chapter 3 Supplementary Material

A.1.1 Derivation of the Sufficient Conditions in (3.19)

Integrating (3.18) yields

t
t  T   
L(τ )dτ = ˙ )T N B2 (τ ) − β2 ρ2 (z) z x̃ dτ
N B1 (τ ) − β1 sgn(x̃)) + x̃(τ
0 r
0
t 
n 
n
= x̃ T N B − x̃ T (0)N B (0) − 0 x̃ T Ṅ B dτ + β1 |x̃i (0)| − β1 |x̃i (t)|
i=1 i=1
t t
+ 0 αx̃ T (N B1 − β1 sgn(x̃))dτ − 0 β2 ρ2 (z) z x̃ dτ ,


n
where (3.9) is used. Using the fact that x̃2 ≤ |x̃i | , and using the bounds in
i=1
(3.14), yields

t
n
L(τ )dτ ≤ β1 |x̃i (0)| − x̃ T (0)N B (0) − (β1 − ζ1 − ζ2 ) x̃
0 i=1

t t
ζ3
− α(β1 − ζ1 − ) x̃ dτ − (β2 − ζ4 ) ρ2 (z) z x̃ dτ .
α
0 0

If the sufficient conditions in (3.19) are satisfied, then the following inequality holds

t
n
L(τ )dτ ≤ β1 |x̃i (0)| − x̃ T (0)N B (0) = P(0). (A.1)
0 i=1

© Springer International Publishing AG 2018 265


R. Kamalapurkar et al., Reinforcement Learning for Optimal
Feedback Control, Communications and Control Engineering,
https://doi.org/10.1007/978-3-319-78384-0
266 Appendix A: Supplementary Lemmas and Definitions

Using (3.17) and (A.1), it can be shown that P(t) ≥ 0.

A.1.2 Proof of Theorem 3.4

Proof Let y (t) for t ∈ [t0 , ∞) denote a Filippov solution to the differential equa-
tion in (3.22) that satisfies y (t0 ) ∈ S. Using Filippov’s theory of differential inclu-
sions [1, 2], the
existence


for ẏ ∈ K [h] (y, t), where
of solutions can be established
K [h] (y, t)  coh (Bδ (y) \Sm , t), where denotes the intersection of
δ>0 μSm =0 μSm =0
all sets Sm of Lebesgue measure zero, co denotes convex closure [3, 4]. The time
derivative of (3.20) along the Filippov trajectory y (·) exists almost everywhere (a.e.),
and V˙I ∈ V˙˜I where
a.e.

T
V˙˜I =
1 1
ξ T K e˙f T x̃˙ T P − 2 Ṗ Q − 2 Q̇ ,
1 1
(A.2)
2 2
ξ∈∂VI (y)

where ∂VI is the generalized gradient of VI [5]. Since VI is continuously differen-


tiable, (A.2) can be simplified as [3]
T
˙
˜
VI = ∇VI K e˙f x̃
T T ˙ T 1 − 21 1 −1
P Ṗ Q Q̇ 2
2 2
  T
1 1
= e Tf γ x̃ T 2P 2 2Q 2 K e˙f T x̃˙ T P − 2 Ṗ Q − 2 Q̇ .
1 1 1 1

2 2

Using the calculus for K[·] from [4] (Theorem 1, Properties 2, 5, 7), and substituting
the dynamics from (3.9) and (3.17), yields

V˙˜I ⊂ e Tf ( Ñ + N B1 + N̂ B2 − ke f − β1 K[sgn] (x̃) − γ x̃) − e Tf (N B1 − β1 K[sgn] (x̃))


1  
+ γ x̃ T (e f − αx̃) − x̃˙ T N B2 + β2 ρ2 (z) z x̃ − α tr(W̃ Tf w−1f Ŵ˙ f ) + tr(Ṽ fT v−1f V̂˙ f )
2
1  
m
−1 ˙ −1 ˙
− α tr(W̃giT wgi Ŵgi ) + tr(ṼgiT vgi V̂gi ) . (A.3)
2
i=1
a.e.  2    2
≤ −αγ x̃2 − k e f  + ρ1 (z) z e f  + ζ5 x̃2 + ζ6 e f 
+ β2 ρ2 (z) z x̃ , (A.4)

where (3.8), (3.13), and (3.15) are used, K[sgn] (x̃) = SGN (x̃) [4], such that
SGN (x̃i ) = {1} if x̃i > 0, [−1, 1] if x̃i = 0, and {−1} if x̃i < 0 (the subscript i
denotes the ith element).
The set in (A.3) reduces to the scalar inequality in (A.4) since the right hand side
is continuous almost everywhere (i.e., the right
 hand
 side is continuous
  except for the
Lebesgue negligible set of times when e Tf K sgn (x̃) − e Tf K sgn (x̃) = 0). The set
Appendix A: Supplementary Lemmas and Definitions 267
     
of times   t ∈ [0, ∞) | e f (t)T K sgn (x̃ (t)) − e f (t)T K sgn (x̃ (t)) = 0 ⊂
[0, ∞) is equivalent to the set oftimes t | x̃ (t) = 0 ∧ e f(t) = 0 . From (3.16),
this set can also be represented by t | x̃ (t) = 0 ∧ x̃˙ (t) = 0 . Provided x̃ is continu-
ously
 differentiable, it can be shown that the set of time instances
t | x̃ (t) = 0 ∧ x̃˙ (t) = 0 is isolated, and thus, measure zero. This implies that
the set  is measure zero [6].
Substituting for k  k1 + k2 and γ  γ1 + γ2 , and completing the squares, the
expression in (A.4) can be upper bounded as

 2 ρ1 (z)2 β 2 ρ2 (z)2
V˙˜I ≤ −(αγ1 − ζ5 ) x̃2 − (k1 − ζ6 ) e f  +
a.e.
z2 + 2 z2 .
4k2 4αγ2
(A.5)
Provided the sufficient conditions in (3.23) are satisfied, the expression in (A.5) can
be rewritten as

ρ(z)2
V˙˜I ≤ −λ z2 +
a.e.
z2

a.e.
≤ −U (y) ∀y ∈ D, (A.6)
αγ2
where λ  min{αγ1 − ζ5 , k1 − ζ6 }, η  min{k2 , β22
}, ρ(z)2  ρ1 (z)2 +
ρ2 (z)2 is a positive strictly increasing function, and U (y) = c z2 , for some pos-
itive constant c, is a continuous positive semi-definite function defined on the domain
D. The size of the domain D can be increased by increasing the gains k and γ. Using
the inequalities in (3.21) and (A.6), [7, Corollary 1]can be
 invoked to show that y (·) ∈
   
L∞ , provided y(0) ∈ S. Furthermore, x̃ (t) , x̃˙ (t) , e f (t) → 0 as t → ∞
provided y(0) ∈ S.
Since y (·) ∈ L∞ , x̃ (·) , e f (·) ∈ L∞ . Using (3.6), standard linear analysis can
be used to show that x̃˙ (·) ∈ L∞ , and since ẋ (·)∈ L∞ , x̂˙ (·) ∈ L∞ . Since Ŵ f (·) ∈
L∞ from the use of projection in (3.8), t → σ f V̂ fT (t) x̂ (t) ∈ L∞ from Property
2.3, u (·) ∈ L∞ from Assumption 3.3, and μ (·) ∈ L∞ from (3.3). Using the above
bounds, it can be shown from (3.9) that ė f (·) ∈ L∞ . 

A.1.3 Algorithm for Selection of Neural Network


Architecture and Learning Gains

Since the gains depend on the initial conditions, the compact sets used for function
approximation, and the Lipschitz bounds, an iterative algorithm is developed to select
the gains. In Algorithm A.1, the notation {}i for any parameter  denotes the value
of  computed in the ith iteration. Algorithm A.1 ensures satisfaction of the sufficient
condition in (3.75).
268 Appendix A: Supplementary Lemmas and Definitions

Algorithm A.1 Gain Selection


First iteration:  
Given Z 0 ∈ R≥0 such that Z (t0 ) < Z 0 , let Z1 = ρ ∈ Rn+2{L}1 | ρ ≤ β1 vl −1 (vl (Z 0 )) for
some β1 > 1.  Using
 Z1 , compute the bounds in (3.67) and (3.73), and select the gains according
to (3.74). If Z 1 ≤ β1 vl −1 (vl (Z 0 )) , set Z = Z1 and terminate.
Second
  iteration:    
If Z 1 > β1 vl −1 (vl (Z 0 )) , let Z2  ρ ∈ Rn+2{L}1 | ρ ≤ β2 Z 1 . Using Z2 , compute the
   
bounds in (3.67) and (3.73) and select the gains according to (3.74). If Z 2 ≤ Z 1 , set Z = Z2
and terminate.
Third
 iteration:
 
If Z 2 > Z 1 , increase the number of neural network neurons to {L}3 to yield a lower func-
tion approximation error {}3 such that {L F }2 {}3 ≤ {L F }1 {}1 . The increase in the number
of neural network neurons ensures that {ι}3 ≤ {ι}1 . Furthermore, the assumption that the per-
sistence of excitation interval {T }3 is small enough
  such that {L F}2 {T }3 ≤ {T } {L F }1 and
10 10  1
{L}3 {T }3 ≤ {T }1 {L}1 ensures that 11 ≤ 11 , and hence, Z 3 ≤ β2 Z 1 . Set Z =
    3 1
ρ ∈ Rn+2{L}3 | ρ ≤ β2 Z 1 and terminate.

A.1.4 Proof of Lemma 3.14

The following supporting technical lemma is used to prove Lemma 3.14.


Lemma A.1 Let D ⊆ Rn contain the origin and let  : D × R≥0 → R≥0 be pos-
itive definite. If t −→  (x, t) is uniformly bounded for all x ∈ D and if x −→
 (x, t) is continuous, uniformly in t, then  is decrescent in D.

Proof Since t −→  (x, t) is uniformly bounded, for all x ∈ D, supt∈R≥0 { (x, t)}
exists and is unique for all x ∈ D. Let the function α : D → R≥0 be defined as

α (x)  sup { (x, t)} . (A.7)


t∈R≥0

Since x →  (x, t) is continuous, uniformly in t, ∀ε > 0, ∃ς (x) > 0 such that ∀y ∈


D,

d D×R≥0 ((x, t) , (y, t)) < ς (x) =⇒ dR≥0 ( (x, t) ,  (y, t)) < ε, (A.8)

where d M (·, ·) denotes the standard Euclidean metric on the metric space M. By the
definition of d M (·, ·), d D×R≥0 ((x, t) , (y, t)) = d D (x, y) . Using (A.8),

d D (x, y) < ς (x) =⇒ | (x, t) −  (y, t)| < ε. (A.9)


Appendix A: Supplementary Lemmas and Definitions 269

Given the fact that  is positive, (A.9) implies  (x, t) <  (y, t) + ε and
 (y, t) <  (x, t) + ε which from (A.7) implies α (x) < α (y) + ε and α (y) <
α (x) + ε, and hence, from (A.9), d D (x, y) < ς (x) =⇒ |α (x) − α (y)| < ε. Since
 is positive definite, (A.7) can be used to conclude α (0) = 0. Thus,  is bounded
above by a continuous positive definite function; hence,  is decrescent in D. 

Proof Based on the definitions in (3.51), (3.52) and (3.68), Vt∗(e, t) > 0, ∀t ∈ R≥0
 T
and ∀e ∈ Ba \ {0}. The optimal value function V ∗ 0, xdT is the cost incurred
when starting with e = 0 and following the optimal policy thereafter for an arbitrary
desired trajectory xd . Substituting x (t0 ) = xd (t0 ), μ (t0 ) = 0 and (3.45) in (3.47)
indicates that ė (t0 ) = 0. Thus, when starting from e = 0, a policy that is identically
zero satisfies the dynamic constraints in (3.47). Furthermore, the optimal cost is
 T
V ∗ 0, xdT (t0 ) = 0, ∀xd (t0 ) which, from (3.68), implies (3.69b). Since the opti-
mal value function Vt∗ is strictly positive everywhere but at e = 0 and is zero at e = 0,
Vt∗ is a positive definite function. Hence, [8, Lemma 4.3] can be invoked to conclude
that there exists a class K function v : [0, a] → R≥0 such that v (e) ≤ Vt∗ (e, t),
∀t ∈ R≥0 and ∀e ∈ Ba .
Admissibility of the optimal policy implies that V ∗ (ζ) is bounded over all compact
subsets K ⊂ R2n . Since the desired trajectory is bounded, t → Vt∗ (e, t) is uniformly
bounded for all e ∈ Ba . To establish that e → Vt∗ (e, t) is continuous, uniformly in
t, let χeo ⊂ Rn be a compact set containing eo . Since xd is bounded, xd ∈ χxd , where
χxd ⊂ Rn is compact. Since V ∗ : R2n → R≥0 is continuous, and χeo × χxd ⊂ R2n
is compact, V ∗ is uniformly continuous on χeo × χxd . Thus, ∀ε > 0, ∃ς > 0,
 T T T  T T T  T T T  T T T 
such that ∀ eo , xd , e1 , xd ∈ χeo × χxd , dχeo ×χxd eo , xd , e1 , xd <
    ∗  T T T 
∗ T T
ς =⇒ dR V eo , xd
T
,V e1 , xd < ε. Thus, for each eo ∈ R , there
n

exists a ς > 0 independent of xd , that


 establishes the continuity of
 T  T 
e → V ∗ e T , xdT at eo . Thus, e → V ∗ e T , xdT is continuous, uniformly
in xd , and hence, using (3.68), e → Vt∗ (e, t) is continuous, uniformly in t. Using
Lemma A.1, (3.69a), and (3.69b), there exists a positive definite function α :
Rn → R≥0 such that Vt∗ (e, t) < α (e) , ∀ (e, t) ∈ Rn × R≥0 . Using [8, Lemma 4.3]
it can be shown that there exists a class K function v : [0, a] → R≥0 such that
α (e) ≤ v (e), which implies (3.69c). 

A.1.5 Proof of Lemma 3.15

Proof Let the constants 0 − 6 be defined as


270 Appendix A: Supplementary Lemmas and Definitions
 
1 − 6nT 2 L 2F
0 = ,
2
3n  2
1 = sup g R −1 G T ∇ζ σ T  ,
4 t
  2
3n 2 T 2 d L F + supt ggd+ (h d − f d ) − 21 g R −1 G T ∇ζ σ T W − h d 
2 = ,
  n
1 − 6L (ka1 + ka2 )2 T 2
3 = ,
2
6Lka1 T 2
4 =   2  ,
1 − 6L (kc ϕT ) / νϕ
2

 2
18 ka1 Lkc ϕL F T 2
5 =   2  ,
νϕ 1 − 6L (kc ϕT ) / νϕ
2

 2
18 Lka1 kc ϕ (L F d + ι5 ) T 2  2
6 =   2  + 3L ka2 W T .
νϕ 1 − 6L (kc ϕT )2 / νϕ

Using the definition of the controller in (3.57), the tracking error dynamics can be
expressed as

1 1
ė = f + g R −1 G T σ T W̃a + ggd+ (h d − f d ) − g R −1 G T σ T W − h d .
2 2
On any compact set, the tracking error derivative can be bounded above as
 
 
ė ≤ L F e + L W W̃a  + L e ,
 
where L = L x  + ggd+ (h d − f d ) − 21 g R −1 G T σ T W − h d  and L W = 21
 −1 Te T  F d
g R G σ . Using the fact that e and W̃a are continuous functions of time, on
the interval [t, t + T ], the time derivative of e can be bounded as
 
 
ė ≤ L F sup e (τ ) + L W sup W̃a (τ ) + L e .
τ ∈[t,t+T ] τ ∈[t,t+T ]

Since the infinity norm is less than the 2-norm, the derivative of the jth component
of ė is bounded as
 
 
ė j ≤ L F sup e (τ ) + L W sup W̃a (τ ) + L e .
τ ∈[t,t+T ] τ ∈[t,t+T ]

Thus, the maximum and the minimum value of e j are related as


Appendix A: Supplementary Lemmas and Definitions 271
   
sup e j (τ ) ≤ inf e j (τ )
τ ∈[t,t+T ] τ ∈[t,t+T ]
 
 
 
+ LF sup e (τ ) + L W sup W̃a (τ ) + L e T.
τ ∈[t,t+T ] τ ∈[t,t+T ]

Squaring the above expression and using the inequality (x + y)2 ≤ 2x 2 + 2y 2


   
sup e j (τ )2 ≤ 2 inf e j (τ )2
τ ∈[t,t+T ] τ ∈[t,t+T ]
 2
 
 
+ 2 LF sup e (τ ) + L W sup W̃a (τ ) + L e T 2.
τ ∈[t,t+T ] τ ∈[t,t+T ]

n
Summing over j, and using the the facts that supτ ∈[t,t+T ] e (τ )2 ≤ supτ ∈[t,t+T ]
    j=1
e j (τ )2 and inf τ ∈[t,t+T ] n e j (τ )2 ≤ inf τ ∈[t,t+T ] e (τ )2 ,
j=1

sup e (τ )2 ≤ 2 inf e (τ )2


τ ∈[t,t+T ] τ ∈[t,t+T ]
 2
 2
 
+ 2 LF sup e (τ ) + L W
2
sup W̃a (τ ) + L e nT 2 .
τ ∈[t,t+T ] τ ∈[t,t+T ]

Using the inequality (x + y + z)2 ≤ 3x 2 + 3y 2 + 3z 2 , (3.70) is obtained.


Using a similar procedure on the dynamics for W̃a ,

 2    2
  1 − 6N (ηa1 + ηa2 )2 T 2  
− inf W̃a (τ ) ≤ − sup W̃a (τ )
τ ∈[t,t+T ] 2 τ ∈[t,t+T ]
 2
 
+ 3N ηa1
2
sup W̃c (τ ) T 2 + 3N ηa2 2
W 2 T 2 . (A.10)
τ ∈[t,t+T ]

Similarly, the dynamics for W̃c yield


 2  2
  2  
sup W̃c (τ ) ≤   inf
6N ηc2 ϕ2 T 2 τ ∈[t,t+T ]
 W̃ c (τ ) 
τ ∈[t,t+T ] 1 − ν 2 ϕ2

6N T 2 ηc2 ϕ2 ¯ L 2F
2
+   sup e (τ )2
6N η 2 ϕ2 T 2 τ ∈[t,t+T ]
νϕ 1 − ν 2c ϕ2
 2
6N T 2 ηc2 ϕ2 ¯ L F d + ι5
+   . (A.11)
6N η 2 ϕ2 T 2
νϕ 1 − ν 2c ϕ2

Substituting (A.11) into (A.10), (3.71) can be obtained. 


272 Appendix A: Supplementary Lemmas and Definitions

A.1.6 Proof of Lemma 3.16

ν 2 ϕ2
Proof Let the constants 7 − 9 be defined as 7 = 2 ν 2 ϕ2 +k ϕ2 T 2 , 8 = 32 L 2F ,
  ( c )
and 9 = 2 ι25 + 2 L 2F d 2 . The integrand on the left hand side can be written as
 
W̃cT (τ ) ψ (τ ) = W̃cT (t) ψ (τ ) + W̃cT (τ ) − W̃cT (t) ψ (τ ) .

Using the inequality (x + y)2 ≥ 21 x 2 − y 2 and integrating,


⎛ t+T ⎞
t+T  2 
1 T  
W̃cT (τ ) ψ (τ ) dτ ≥ W̃c (t) ⎝ ψ (τ ) ψ (τ )T dτ ⎠ W̃c (t)
2
t t
⎛⎛ ⎞T ⎞2
t+T τ
⎜⎝ ⎟
− ⎝ W̃˙ c (σ) dτ ⎠ ψ (τ )⎠ dτ .
t t

Substituting the dynamics for W̃c from (3.66) and using the persistence of excitation
condition in Assumption 3.13,

t+T  2 1
W̃cT (τ ) ψ (τ ) dτ ≥ ψ W̃cT (t) W̃c (t)
2
t
t+Tτ 
ηc  (σ) ψ (σ)  (σ)
− "
1 + νω (σ)T  (σ) ω (σ)
t t

−ηc  (σ) ψ (σ) ψ T (σ) W̃c (σ)


ηc  (σ) ψ (σ) W̃aT Gσ W̃a
+ "
4 1 + νω (σ)T  (σ) ω (σ)
  T 2
ηc  (σ) ψ (σ)  (σ) F (σ)
−" dσ ψ (τ ) ,
1 + νω (σ)T  (σ) ω (σ)

where   41  GT + 21 W T σ  GT . Using the inequality (x + y + w − z)2 ≤ 2x 2 +


6y 2 + 6w2 + 6z 2 ,

t+T  2 1
W̃cT (τ ) ψ (τ ) dτ ≥ ψ W̃cT (t) W̃c (t)
2
t
Appendix A: Supplementary Lemmas and Definitions 273
⎛ ⎞2
t+T τ
− 2⎝ ηc W̃cT (σ) ψ (σ) ψ T (σ)  T (σ) ψ (τ ) dσ ⎠ dτ
t t
⎛ ⎞2
t+T τ
⎝ η  T
(σ) ψ T
(σ)  T
(σ) ψ (τ )
dσ ⎠ dτ
c
−6 "
1 + νω (σ)T  (σ) ω (σ)
t t
⎛ τ ⎞2

t+T  

⎝ ηc F " (σ)  (σ) ψ (σ)  (σ) ψ (τ ) ⎠


T T T T
−6 dσ dτ
1 + νω (σ)T  (σ) ω (σ)
t t
⎛ ⎞2
t+T τ
η W̃ T
(σ) G σ (σ) W̃ (σ) ψ T
(σ)  T
(σ) ψ (τ )
−6 ⎝ dσ⎠ dτ .
c a
a
"
1 + νω (σ)T  (σ) ω (σ)
t t

Using the Cauchy–Schwarz inequality, the Lipschitz property, the fact that
√ 1
1+νω T ω
≤ 1, and the bounds in (3.67),

⎛ ⎞2
t+T  2 t+T τ
1 ⎝ η ι ϕ
dσ ⎠ dτ
c 5
W̃cT (τ ) ψ (τ ) dτ ≥ ψ W̃cT (t) W̃c (t) − 6
2 νϕ
t t t
⎛ τ ⎞
t+T   2 τ
 
2ηc2 ⎝ ψ T (σ)  T (σ) ψ (τ ) dσ ⎠ dτ
2
− W̃cT (σ) ψ (σ) dσ
t t t
⎛ τ ⎞
t+T   4 τ
   T 
6ηc2 ι22 ⎝ ψ (σ)  T (σ) ψ (τ ) dσ ⎠ dτ
2
− W̃a (σ) dσ
t t t
⎛ τ ⎞
t+T  τ
 
6ηc2 ¯ ⎝ F (σ)2 dσ ψ T (σ)  T (σ) ψ (τ ) dσ ⎠ dτ .
2 2

t t t

Rearranging,


t+T
 2 1
W̃cT (τ ) ψ (τ ) dτ ≥ ψ W̃cT (t) W̃c (t) − 3ηc2 A4 ϕ2 ι25 T 3
2
t

t+T τ  2
dσdτ − 3ηc2 A4 ϕ2 ¯ L 2F d 2 T 3
2
−2ηc2 A4 ϕ2 (τ − t) W̃cT (σ) ψ (σ)
t t

t+T τ  4 
t+T τ
 
(τ − t) W̃a (σ) dσdτ − 6ηc2 ¯ L 2F A4 ϕ2
2
−6ηc2 ι22 A4 ϕ2 (τ − t) e2 dσdτ ,
t t t t
274 Appendix A: Supplementary Lemmas and Definitions

where A = 1
√ .
νϕ
Changing the order of integration,

t+T  2 t+T  2
1
W̃cT (τ ) ψ (τ ) dτ ≥ ψ W̃cT (t) W̃c (t) − ηc2 A4 ϕ2 T 2 W̃cT (σ) ψ (σ) dσ
2
t t
t+T t+T  4
 
−3ηc2 A4 ϕ2 ¯ L 2F T 2
2
e (σ) dσ −
2
3ηc2 ι22 A4 ϕ2 T 2 W̃a (σ) dσ
t t
 
−2ηc2 A4 ϕ2 T 3 ι25 + ¯ L 2F d 2 .
2

Reordering the terms, the inequality in Lemma 3.16 is obtained. 

A.1.7 Proof of Theorem 3.20

Proof To facilitate the subsequent development, let the gains k and γ be split as k 
k1 + k2 and γ  γ1 + γ2 . Let λ  min{αγ1 − ζ5 , k1 − ζ6 }, ρ (z)2  ρ1 (z)2 +
ρ2 (z)2 , and η  min{k2 , αγ β22
2
}. Let y (t) for t ∈ [t0 , ∞) denote a Filippov solution
to the differential equation in (3.111) that satisfies y (t0 ) ∈ S. Using Filippov’s theory
of differential inclusions [1, 2], the
existence


for ẏ ∈
of solutions can be established
K [h] (y, t), where K [h] (y, t)  coh (Bδ (y) \Sm , t), where denotes
δ>0 μSm =0 μSm =0
the intersection of all sets Sm of Lebesgue measure zero [3, 4]. The time derivative
of (3.109) along the Filippov trajectory y (·) exists almost everywhere (a.e.), and
V˙I ∈ V˙˜I where
a.e.

T
V˙˜I =
1 1
x̃˙ T P − 2 Ṗ Q − 2 Q̇
1 1
ξ K e˙f
T T
, (A.12)
2 2
ξ∈∂VI (y)

where ∂VI is the generalized gradient of VI [5]. Since VI is continuously differen-


tiable, (A.12) can be simplified as [3]
T
V˙˜I = ∇x VIT K e˙f T x̃˙ T P − 2 Ṗ Q − 2 Q̇
1 1 1 1

2 2
  T
1 1
T ˙ T 1 − 21 1 −1
= e f γ x̃ 2P 2Q K e˙f x̃
T T 2 2 P Ṗ Q Q̇ .
2
2 2

Using the calculus for K[·] from [4], and substituting the dynamics from (3.99) and
(3.107), yields
Appendix A: Supplementary Lemmas and Definitions 275

Ṽ˙ I ⊂ e Tf ( Ñ + N B1 + N̂ B2 − ke f − β1 K [sgn] (x̃) − γ x̃) + γ x̃ T (e f − α x̃)


− e Tf (N B1 − β1 K [sgn] (x̃)) − x̃˙ T N B2 + β2 ρ2 (z) z x̃
1  
− α tr(W̃ Tf w−1f Ŵ˙ f ) + tr(Ṽ fT v−1f V̂˙ f ) , (A.13)
2

where K [sgn] (x̃) = SGN (x̃). Substituting (3.98), canceling common terms, and
rearranging the expression yields

Ṽ˙ I ≤ −αγ x̃ T x̃ − ke Tf e f + e Tf Ñ + α x̃ T W̃ Tf ∇x σ̂ f V̂ fT x̂˙ + α x̃ T Ŵ Tf ∇x σ̂ f Ṽ fT x̂˙


a.e. 1 1
2 2
1
+ x̃˙ T ( N̂ B2 − N B2 ) + β2 ρ2 (z) z x̃ − αtr(W̃ Tf ∇x σ̂ f V̂ fT x̂˙ x̃ T )
2
1
− αtr(Ṽ fT x̂˙ x̃ T Ŵ Tf ∇x σ̂ f ), (A.14)
2
 
where σ̂ f  σ f V̂ fT x̂ . The set inclusion in (A.13) reduces to the scalar inequal-
ity in (A.14) because the right hand side of (A.13) is set valued  only on the
Lebesgue negligible set of times when e Tf K sgn (x̃) − e Tf K sgn (x̃) = 0. The set
     
of times   t ∈ [0, ∞) | e f (t)T K sgn (x̃ (t)) − e f (t)T K sgn (x̃ (t)) = 0 ⊂
[0, ∞) is equivalent to the set oftimes t | x̃ (t) = 0 ∧ e f(t) = 0 . From (3.96),
this set can also be represented by t | x̃ (t) = 0 ∧ x̃˙ (t) = 0 . Provided x̃ is continu-
ously
 differentiable, it can be shown that the set of time instances
t | x̃ (t) = 0 ∧ x̃˙ (t) = 0 is isolated, and thus, measure zero. This implies that
the set  is measure zero. [6]
Substituting for k = k1 + k2 and γ = γ1 + γ2 , using (3.98), (3.103), and (3.105),
and completing the squares, the expression in (A.14) can be upper bounded as

 2 ρ1 (z)2 β 2 ρ2 (z)2
Ṽ˙ I ≤ −(αγ1 − ζ5 ) x̃2 − (k1 − ζ6 ) e f  +
a.e.
z2 + 2 z2 .
4k2 4αγ2
(A.15)
Provided the sufficient conditions in (3.112) are satisfied, the expression in (A.15)
can be rewritten as

ρ (z)2
Ṽ˙ I ≤ −λ z2 +
a.e. a.e.
z2 ≤ −U (y) , ∀y ∈ D. (A.16)

In (A.16), U (y) = c z2 is a continuous positive semi-definite function defined on


D, where c is a positive constant.
The inequalities in (3.110) and (A.16) can be used to show that t → VI (y (t)) ∈
L∞ ; hence, x̃ (·) , r (·) ∈ L∞ . Using (3.96), standard linear analysis can be used to
show that x̃˙ (·) ∈ L∞ , and since ẋ (·) ∈ L∞ , x̂˙ (·)
 ∈ L∞ . Since
 Ŵ f (·) , V̂ f (·) ∈ L∞
from the use of projection in (3.98), t → σ f V̂ fT (t) x̂ (t) ∈ L∞ from Property
276 Appendix A: Supplementary Lemmas and Definitions

2.3, and u (·) ∈ L∞ from Assumption 3.19, (3.92) can be used to conclude that
μ (·) ∈ L∞ . Using (3.97) and the above bounds it can be shown that ė f (·) ∈ L∞ .
From (A.16), [7, Corollary 1] can be invoked to show that y (·) ∈ L∞ , provided
y(0) ∈ S. Furthermore,
   
 
x̃ (t) , x̃˙ (t) , e f (t) → 0 as t → ∞,

provided y (t0 ) ∈ S. 

A.2 Chapter 4 Supplementary Material

A.2.1 Algorithm for Gain Selection

In the following, the notation {}i for any parameter  denotes the value of 
computed in the ith iteration.

Algorithm A.2 Gain Selection


First iteration:  
Given z ∈ R≥0 such that Z (t0 ) < z, let Z1  ξ ∈ R2n+2L+ p | ξ ≤ v −1 (v (z)) . Using Z1 ,
# 
ι
compute the bounds in (4.17) and select the gains according to (4.18). If vl ≤ z, set Z = Z1
1
and terminate.
Second iteration: $  #  %
# 
ι 2n+2L+ p | ξ ≤ v −1 v ι
If z < vl , let Z2  ξ ∈ R vl . Using Z2 , compute the
1   1 
bounds in (4.17) and select the gains according to (4.17). If vιl ≤ vιl , set Z = Z2 and termi-
2 1
nate.
Third
 iteration:
 
If vιl > vιl , increase the number of neural network neurons to {L}3 to ensure {L Y }2 {}3 ≤
2 1
{L Y }2 {}2 , ∀i = 1, .., N , increase the constant ζ3 to ensure {L Y }2 {L Y }2
{ζ3 }3 ≤ {ζ3 }2 , and increase
the gains K and ηa1 to satisfy the gain conditions in (4.18). Provided the constant c
is
$ large enough and D is small #enough, these adjustments ensure {ι}3 ≤ {ι}2 . Set Z =
 %
ι
ξ ∈ R2n+2L+ p | ξ ≤ v−1 v vl and terminate.
2

A.2.2 Algorithm for Gain Selection - N-Player Game

In the following, the notation {}i for any parameter  denotes the value of 
computed in the ith iteration.
Appendix A: Supplementary Lemmas and Definitions 277

Algorithm A.3 Gain Selection


First iteration:  

Given z ∈ R≥0 such that Z (t0 ) < z, let Z1  ξ ∈ R2n+2N i {L i }1 + pθ | ξ ≤ v −1 (v (z)) .
 
Using Z1 , compute the bounds in (4.60) and select the gains according to (4.61). If vιl ≤ z,
1
set Z = Z1 and terminate.
Seconditeration:
     

If z < vιl , let Z2  ξ ∈ R2n+2N i {L i }1 + pθ | ξ ≤ v −1 v vιl . Using Z2 , compute
1   1  
ι
the bounds in (4.60) and select the gains according to (4.61). If vl ≤ vιl , set Z = Z2 and
2 1
terminate.
Third
 iteration:
 
If vιl > vιl , increase the number of neural network neurons to { pW i }3 to ensure {L Y }2 {i }3 ≤
2 1
{L Y }2 {i }2 , ∀i = 1, .., N , decrease the constant ζ3 to ensure {L Y }2 {ζ3 }3 ≤ {L Y }2 {ζ3 }2 , and
increase the
 gain kθ to  satisfy the gain conditions in (4.61).
    These adjustments ensure {ι}3 ≤ {ι}2 .
Set Z = ξ ∈ R2n+2N i {L i }3 + pθ | ξ ≤ v −1 v vιl and terminate.
2

A.2.3 System Identification

Concurrent learning-based parameter update


In traditional adaptive control, convergence of the estimates θ̂ to their true values
θ is ensured by assuming that a persistent excitation condition is satisfied [9–11].
To ensure convergence under a finite excitation condition, this result employs a
concurrent learning-based approach to update the parameter estimates using recorded
input-output data [12–14].
 
Assumption A.2 ([13, 14]) A collection Hid of triplets a j , b j , c j | a j ∈ Rn , b j ∈
M
Rn , c j ∈ Rm j=1 that satisfies
⎛ ⎞
M
   
rank ⎝ Y T a j Y a j ⎠ = p,
j=1
     
b j − f a j + g a j c j  < d, ∀ j, (A.17)

is available a priori, where d ∈ R≥0 is a positive constant. Since θ ∈ , where  is


a compact set, the assumption that d is independent of θ is justified.
To satisfy Assumption A.2, data recorded in a previous run of the system can be
utilized, or the data stack can be recorded by running the system using a different
known stabilizing controller for a finite amount of time until the recorded data satisfies
the rank condition (A.17).
In some cases, a data stack may not be available a priori. For such applications,
the data stack can be recorded online  a j and c j can be recorded along
  (i.e., the points
the system trajectory as a j = x t j and c j = u t j for some t j ∈ R≥t0 ). Provided
278 Appendix A: Supplementary Lemmas and Definitions
 
the system states are exciting over a finite time interval t ∈ t0 , t0 + t (versus t ∈
[t0 , ∞) as in traditional persistence of excitation-based approaches) until the data
stack satisfies (A.17), then a modified form
 of the controller developed in Sect. A.2.4
can be used over the time interval t ∈ t0 , t0 + t , and the controller developed in
Sect. 4.3.3 can be used thereafter.
Based on Assumption A.2, the update law for the parameter estimates is designed
as
   
˙ θ k θ T    
M
θ̂ = Y a j b j − g a j c j − Y a j θ̂ , (A.18)
M j=1

where θ ∈ R p× p is a constant positive definite adaptation gain matrix and kθ ∈ R


is a constant positive concurrent learning gain. From (1.9) and  the definition
  of
θ̃, the bracketed term in (A.18), can be expressed as b j − g a j c j − Y a j θ̂ =
     
Y a j θ̃ + d j , where d j  b j − f a j + g a j c j ∈ Rn , and the parameter update
law in (A.18) can be expressed in the advantageous form
⎛ ⎞
˙ θ k θ ⎝
M
    θ k θ T  
M
θ̂ = Y T a j Y a j ⎠ θ̃ + Y aj dj. (A.19)
M j=1
M j=1

The rate of convergence of the parameter estimates to a neighborhood of their ideal


values is directly (and the ultimate
 bound  is inversely)
  proportional to the minimum
singular value of the matrix M Y T
a Y a ; hence, the performance of the
j=1 j
 j 
estimator can be improved online if a triplet a j , b j , c j in Hid is replaced with an
    
updated triplet (ak , bk , ck ) that increases the singular value of M j=1 Y
T
aj Y aj .
The stability analysis in Sect. 4.3.4 allows for this approach through the use of a
singular value maximizing algorithm (cf. [12, 14]).
Convergence analysis
Let Vθ : Rn+ p → R≥0 be a positive definite continuously differentiable candidate
Lyapunov function defined as
  1
Vθ θ̃  θ̃ T θ−1 θ̃.
2
The following bounds on the Lyapunov function can be established:

γ 
 2
  γ  2
 
θ̃ ≤ Vθ θ̃ ≤ θ̃ ,
2 2
where γ, γ ∈ R denote the minimum and the maximum eigenvalues of the matrix
θ−1 . Using (A.19), the Lyapunov derivative can be expressed as
Appendix A: Supplementary Lemmas and Definitions 279
⎛ ⎞
k θ
M
    kθ T  
M
V̇θ = −θ̃ T ⎝ Y T a j Y a j ⎠θ̃ − θ̃ T Y aj dj.
M j=1 M j=1

     
M
Let y ∈ R be the minimum eigenvalue of M1 j=1 Y
T
a j Y a j . Since
    
M T
j=1 Y a j Y a j is symmetric and positive semi-definite, (A.17) can be used
to conclude that it is also positive definite, and hence y > 0. Hence, the Lyapunov
derivative can be bounded as
 2  
   
V̇0 ≤ −ykθ θ̃ + kθ dθ θ̃ ,

    
 
where dθ = dY , Y = max j=1,··· ,M Y a j  . Hence, θ̃ exponentially decays to
an ultimate bound as t → ∞. If Hid is updated with new data, the update law
(A.19) forms a switched system. Provided (A.17) holds, and Hid is updated using
a singular value maximizing algorithm, Vθ is a common Lyapunov function for the
switched system (cf. [14]). The concurrent learning-based system identifier satisfies
Assumption 4.1 with K = ykθ and D = kθ dθ . To satisfy the last inequality in (4.18),
the quantity vιl needs to be small. Based on the definitions in (4.17), the quantity vιl
D2 dθ2
is proportional to K2
, which is proportional to y2
. From the definitions of dθ and y,

   2
M  
dθ2 2 j=1 Y a j
= d      2 .
y2 M
λmin j=1 Y T a Y a
j j

Thus, in general, a small d (i.e., accurate numerical differentiation) is required to


obtain the result in Theorem 4.3.

A.2.4 Online Data Collection for System Identification

A data stack Hid that satisfies conditions in (A.17) can be collected online provided
the controller in (4.6)
 resultsin the system states being sufficiently exciting over a
finite time interval t0 , t0 + t ⊂ R. To collect the data stack, the first M values of
the state, the control, and the corresponding numerically computed state derivative
are added to the data stack. Then, the existing values are progressively replaced with
new values using a singular value maximization algorithm. During this finite time
interval, since a data stack is not available, an adaptive update law that ensures fast
convergence of θ̃ to zero without persistence of excitation can not be developed.
Hence, the system dynamics can not be directly estimated without persistence of
excitation. Since extrapolation of the Bellman error to unexplored areas of the state-
280 Appendix A: Supplementary Lemmas and Definitions

space requires estimates of the system dynamics, withoutpersistence  of excitation,


such extrapolation is not feasible during the time interval t0 , t0 + t .
However, evaluation of the Bellman error along the system trajectories does not
explicitly depend on the parameters θ. Estimation of the state derivative is enough to
evaluate the Bellman error along system trajectories. This motivates the development
of the following state derivative estimator

x̂˙ f = gu + k f x̃ f + μ f ,
 
μ̇ f = k f α f + 1 x̃ f , (A.20)

where x̂ f ∈ Rn is an estimate of the state x, x̃ f  x − x̂ f , and k f , α f , γ f ∈ R>0 are


constant estimation gains. To facilitate the stability analysis, define a filtered error
signal r ∈ Rn as r  x̃˙ f + α f x̃ f , where x̃˙ f  ẋ − x̂˙ f . Using (1.9) and (A.20), the
dynamics of the filtered error signal can be expressed as ṙ = −k f r + x̃ f + f  f +
f  gu + α x̃˙ f . The instantaneous Bellman error in (2.3) can be approximated along
the state trajectory using the state derivative estimate as
   
δ̂ f = ω Tf Ŵc f + x T Qx + û T x, Ŵa f R û x, Ŵa f , (A.21)

 ˙
where ω f ∈ R is the regressor vector defined as ω f  σ (x) x̂ f . During the interval
L

t0 , t0 + t , the value function and the actor weights can be learned based on the
approximate Bellman error in (A.21) provided the system states are exciting (i.e., if
the following assumption is satisfied).
 
Assumption A.3 There exists a time interval t0 , t0 + t ⊂ R and positive constants
ψ, T ∈ R such that closed-loop trajectories of the system in (1.9) with the controller
 
u = û T x, Ŵa f along with the weight update laws

ωf ω f ω Tf
Ŵ˙ c f = −ηc f  f δ f , ˙ f = λ f  f − ηc f  f f,
ρf ρf
·  
Ŵ a f = −ηa1 f Ŵa − Ŵc − ηa2 f Ŵa , (A.22)

and the state derivative estimator in (A.20) satisfy

t+T
 
ψI L ≤ ψ f (τ ) ψ f (τ )T dτ , ∀t ∈ t0 , t0 + t , (A.23)
t

where ρ f  1 + ν f ω Tf ω f is the normalization term, ηa1 f , ηa2 f , ηc f , ν f ∈ R are


constant positive gains,  f ∈ R L×L is the least-squares gain matrix, and ψ f 
√ ω f T ∈ R N is the regressor vector. Furthermore, there exists a set of time
1+ν f ω f ω f
Appendix A: Supplementary Lemmas and Definitions 281
 
instances {t1 · · · t M } ⊂ t0 , t0 + t such that the data stack Hid containing the val-
ues of state-action pairs and the corresponding numerical derivatives recorded at
{t1 · · · t M } satisfies the conditions in Assumption A.2.
Conditions similar to (A.23) are ubiquitous in online approximate optimal control
literature. In fact, Assumption A.3 requires the regressor ψ f to be exciting over a finite
time interval, whereas the persistence of excitation conditions used in related results
such as [15–19] require similar regressor vectors to be exciting over all t ∈ R≥t0 .
On any compact set χ ⊂ Rn the function f is Lipschitz continuous; hence, there
exist positive constants L f , L d f ∈ R such that
 
 f (x) ≤ L f x and  f  (x) ≤ L d f , (A.24)

∀x ∈ χ. The update laws in (A.22) along with the excitation condition in (A.23)
ensure that the adaptation gain matrix is bounded such that
 
 f ≤  f  ≤  f , ∀t ∈ R≥t0 , (A.25)

where (cf. [11, Proof of Corollary 4.3.2])


  
 f = min ηc f ψT, λmin  f (t0 ) e−λ f T . (A.26)

The following positive constants are defined for brevity of notation.


 
Ld f   2W T σ  GT + G  
ϑ8  g R −1 g T σ T , ϑ10  ,
2 4
 
W T G σ + 1  G T σ T 
ϑ9  2
+ ηa2 f W ,
2
 
2W T σ  GT + G  
ϑ10  ,
4
2
3ϑ9 5ϑ2 W
ι f  2ηc f ϑ10 +   + ϑ4 + 8 ,
4 ηa1 f + ηa2 f 4k f
   
1 q β f ηa1 f + ηa2 f α f k f
vl f = min , , , , . (A.27)
2 2 4 3 3 5

To facilitate the stability analysis, let VL f : R3n+2L × R≥0 → R≥0 be a continuously


differentiable positive definite candidate Lyapunov function defined as

  1 1 T 1 T 1 T
VL f Z f , t  V ∗ (x) + W̃cTf  −1
f W̃c f + W̃a f W̃a f + x̃ f x̃ f + r r. (A.28)
2 2 2 2
282 Appendix A: Supplementary Lemmas and Definitions

Using the fact that V ∗ is positive definite, (A.25) and [8, Lemma 4.3] can be used to
establish the bound
     
vl f  Z f  ≤ VL f Z f , t ≤ vl f  Z f  , (A.29)

∀t ∈ R≥t0 and ∀Z f ∈ R3n+2L . In (A.29), vl f , vl f : R≥0 → R≥0 are class K functions


 T
and Z  x T , W̃cTf , W̃aTf , x̃ Tf , r T . The sufficient conditions for uniformly ulti-
mately bounded convergence are derived based on the subsequent stability analysis
as

  3ηa1 f 3ϑ8 ζ5 3ηc f G σ 


ηa1 f + ηa2 f > − − " Zf
2ζ4 2 4 νff
 
ϑ8 2 2 3α3f
k f > 5 max + α f + 2ηc f W σ   ,
2ζ5 4
 
5L 2d f
q > 2L 2f 2ηc f ¯2 +
4k f
1 2 2
> 6ηc f W σ   , β f > 2ηa1 f ζ4 , (A.30)
αf
    # ι f 
where Z f  v −1 v f max  Z (t
f 0 ) , and ζ4 , ζ5 ∈ R are known positive
f vl f
adjustable constants. An algorithm similar to Algorithm A.2 is employed to select
the gains and a compact set Z f ⊂ R3n+2L such that
&
ιf 1  
≤ diam Z f . (A.31)
vl f 2

Theorem A.4 Provided the gains are selected to satisfy the sufficient conditions
in (A.30) based on an algorithm similar to Algorithm A.2, the controller in (4.6),
the weight update laws in (A.22), the state derivative estimator in (A.20), and the
excitation condition in (A.23) ensure that the state trajectory x, the state estimation
error x̃ f , and the parameter estimation errors W̃c f , and W̃a f remain bounded such
that    
 Z f (t) ≤ Z f , ∀t ∈ t0 , t0 + t .

Proof Using techniques similar to the proof of Theorem 4.3, the time derivative of
the candidate Lyapunov function in (A.28) can be bounded as

 2   & ιf
V̇L f ≤ −vl f  Z f  , ∀  Z f  ≥ , (A.32)
vl f
1.2 Chapter 4 Supplementary Material 283

in the domain Z f . Using (A.29), (A.31), and (A.32), [8, Theorem  4.18] is used
to show that Z is uniformly ultimately bounded, and that  Z (t) ≤ Z f , ∀t ∈
  f f
t0 , t 0 + t . 
 
During the interval t0 , t0 + t , the controller in (4.6) is used along with the weight
update laws in Assumption A.3. When enough data is collected in the data stack to
satisfy the rank condition in (A.17), the update laws from Sect. 4.3.3 are used. The
bound Z f is used to compute gains for Theorem 4.3 using Algorithm A.2.

A.3 Chapter 6 Supplementary Material

A.3.1 Auxiliary Constants and Sufficient Gain Conditions


for the Station-Keeping Problem

The constants ϕζ , ϕc , ϕa , ϕθ , κc , κa , κθ , and κ are defined as

  
kc1 sup Z ∈β ∇ζ  L Yr es θ + L f0r es
ϕζ = q −
  2  
L Yc g W  sup Z ∈β ∇ζ σ  + sup Z ∈β ∇ζ 
− ,
2

  
kc2 ka kc1 sup Z ∈β ∇ζ  L Yr es θ + L f0r es
ϕc = c− −
N 2 2  
kc1 L Y sup Z ∈β ζ sup Z ∈β ∇ζ σ  W 

2 
kc2 n  
N j=1 Y r es j
σ j  W 
− ,
2
ka
ϕa = ,
2

 N  
kc2 Yr es σ   W 
k=1 k k
ϕθ = k θ y − N

   2  
L Yc g W  sup Z ∈β ∇ζ σ  + sup Z ∈β ∇ζ 

2  
kc1 L Yr es W  sup Z ∈β ζ sup Z ∈β ∇ζ σ 
− ,
2
284 Appendix A: Supplementary Lemmas and Definitions


 kc2 N
kc1 T
κc = sup   4N W̃aT G σ j W̃a + W̃ G σ W̃a
Z ∈β  j=1
4 a

kc2 
N
kc1 
+kc1 ∇ζ G∇ζ σ W +
T
∇ζ G∇ζ  + Ek  ,
4 N 
k=1

 
1 T 1 

κa = sup  W G σ + ∇ζ G∇ζ σ 
,
Z ∈β 2 2

κθ = kθ dθ ,

 
1 
κ = sup  ∇
4 ζ G∇ 
ζ .

Z ∈β

The sufficient gain conditions utilized in Theorem 6.4 are


  
kc1 sup Z ∈β ∇ζ  L Yr es θ + L f0r es
q> ,
 2    
L Yc g W  sup Z ∈β ∇ζ σ  + sup Z ∈β ∇ζ 
+ , (A.33)
 2
  
N  
kc1 sup Z ∈β ∇ζ  L Yr es θ + L f0r es ka
c> +
kc2 2 2
   N   
kc1 L Y sup Z ∈β ζ sup Z ∈β ∇ζ σ  W  kc2
k=1
Yr es σ   W 
k k
+ + N
,
2 2
(A.34)
 k  N  
1 c2 Yr es σ   W 
k=1 k k
y> N
kθ 2
    
L Yc g W  sup Z ∈β ∇ζ σ  + sup Z ∈β ∇ζ 
+
2
 
kc1 L Yr es W  sup Z ∈β ζ sup Z ∈β ∇ζ σ 
+ , (A.35)
2

A.3.2 Extension to Constant Earth-Fixed Current

In the case where the earth-fixed current is constant, the effects of the current may
be included in the development of the optimal control problem. The body-relative
Appendix A: Supplementary Lemmas and Definitions 285

current velocity νc (ζ) is state dependent and may be determined from



cos (ψ) − sin (ψ)
η̇c = νc ,
sin (ψ) cos (ψ)

where η̇c ∈ Rn is the known constant current velocity in the inertial frame. The
functions Yr es θ and f 0r es in (6.11) can then be redefined as
⎡ ⎤
0
Yr es θ  ⎣ −M −1 C A (−νc ) νc − M −1 D (−νc ) νc . . . ⎦ ,
−M −1 C A (νr ) νr − M −1 D (νr ) νr

JE ν
f 0r es  ,
−M −1 C R B (ν) ν − M −1 G (η)

respectively. The control vector u is

u = τb − τc

where τc (ζ) ∈ Rn is the control effort required to keep the vehicle on station given
the current and is redefined as

τc  −M A ν̇c − C A (−νc ) νc − D (−νc ) νc .

A.3.3 Derivation of Path Following Error Dynamics

The geometry of the path-following problem is depicted in Fig. A.1. Let I denote
an inertial frame. Consider the coordinate system i in I with its origin and the basis
vectors i 1 ∈ R3 and i 2 ∈ R3 in the plane of vehicle motion and i 3  i 1 × i 2 . The point
P (t) ∈ R3 on the desired path represents the location of the virtual target at time t.
The location of the virtual target is determined by the path parameter s p (t) ∈ R. In
the controller development, the path parameter is defined as the arc length along the
desired path from some arbitrary initial position on the path to the point P (t). It is
convenient to select the arc length as the path parameter for a mobile robot, since the
desired speed can be defined as unit length per unit time. Let F denote a frame fixed to
the virtual target with the origin of the coordinate system f fixed in F at point P (t).
The basis vectors f 1 (t) , f 2 (t) ∈ R3 are the unit tangent and normal vectors of the
path at P (t), respectively, in the plane of vehicle motion and f 3 (t)  f 1 (t) × f 2 (t).
Let B denote a frame fixed to the vehicle with the origin of its coordinate system
b at the center of mass Q (t) ∈ R3 . The basis vectors b1 (t) , b2 (t) ∈ R3 are the
unit tangent and normal vectors of the vehicle motion at Q (t), and b3 (t)  b1 (t) ×
b2 (t). Note, the bases {i 1 , i 2 , i 3 } , { f 1 (t) , f 2 (t) , f 3 (t)} , and {b1 (t) , b2 (t) , b3 (t)}
form standard bases.
286 Appendix A: Supplementary Lemmas and Definitions

Fig. A.1 The frame F is


attached to the virtual target
at a distance of s p on the
desired path. The frame B is
fixed to the mobile robot, and
the frame I is inertially fixed
(reproduced with permission
from [20], 2014,
c IEEE)

Consider the following vector equation from Fig. A.1,

r Q (t) = r Q (t) − r P (t) ,


P

where r Q (t) ∈ R3 and r P (t) ∈ R3 are the position vectors of points Q and P, from
the origin of the inertial coordinate system, respectively, at time t. The rate of change
of r Q as viewed by an observer in I and expressed in the coordinate system f is
P

f f f
v Q (t) = v Q (t) − v P (t) . (A.36)
P

The velocity of point P as viewed by an observer in I and expressed in f is


f  
v P (t) = s˙p (t) 0 0 ,T (A.37)

where ṡ p : R≥t0 → R is the velocity of the virtual target along the path. The velocity
of point Q as viewed by an observer in I and expressed in f is
f f
v Q (t) = Rb (θ (t)) vbQ (t) , (A.38)

f
where θ : R≥t0 → R is the angle between f 1 and b1 and Rb : R → R3×3 is a trans-
formation from b to f , defined as
Appendix A: Supplementary Lemmas and Definitions 287
⎡ ⎤
cos θ − sin θ 0
Rb (θ)  ⎣ sin θ cos θ 0⎦.
f

0 0 1

The velocity of the vehicle as viewed by an observer in I expressed in b is vbQ (t) =


 T
v (t) 0 0 where v : R≥t0 → R is the velocity of the vehicle. The velocity between
points P and Q as viewed by an observer in I and expressed in f is

F d f
r Q (t) +I →F (t) × r Q/P (t) .
f f
v Q/P (t) = (A.39)
dt /P
The angular velocity of F as viewed by an observer in I expressed in f is given
 T
as I ω F (t) = 0 0 κ (t) ṡ p (t) where κ : R≥t0 → R is the path curvature, and the
relative position of the vehicle with respect to the virtual target expressed in f is
f  T
r Q (t) = x (t) y (t) 0 . Substituting (A.37)–(A.39) into (A.36) the planar posi-
P
tional error dynamics are given as

ẋ (t) = v (t) cos θ (t) + (κ (t) y (t) − 1) ṡ p (t)


ẏ (t) = v (t) sin θ (t) − κ (t) x (t) ṡ p (t) .

The angular velocity of B as viewed by an observer in F is


F
ω B (t) = F
ω I (t) +I ω B (t). (A.40)

From (A.40), the planar rotational error dynamic expressed in f is

θ̇ (t) = w (t) − κ (t) ṡ p (t) ,

where w : R≥t0 → R is the angular velocity of the vehicle.

A.3.4 Auxiliary Signals

The constants ϕe , ϕc , ϕa , ιc , ιa , ι, K ∈ R are defined as

   
ηc1 supζ∈χ   L f ηc2 ηa ηc1 supζ∈χ   L f
ϕe  q − , ϕc  c− − ,
2 N 2 2
288 Appendix A: Supplementary Lemmas and Definitions


 ηc2 N
ηc1 T ηc1  T ηc1  T
ιc  sup 
 4N W̃aT G σ j W̃a + W̃a G σ W̃a +  Gσ W +  G
ζ∈χ  j=1
4 2 4


ηc2 
N
+ E j + ηc1  L f 

,
N j=1 
   
1 1  T  ηa  1  T 
ιa  sup 
 G σ W + σ G 
 , ϕa  , ι  sup 
  G ,

ζ∈χ 2 2 2 ζ∈χ 4

+
ι2c ι2 ι 1  ϕc ϕa 
K  + a + , α  min ϕe , , .
2αϕc 2αϕa α 2 2 2

When Assumption 6.3 and the sufficient gain conditions


2q
ηc1 < , (A.41)
supζ∈χ   L f
 
N ηa N ηc1 supζ∈χ   L f
ηc2 > + (A.42)
2c 2c

are satisfied, the constants ϕe , ϕc , ϕa , ιc , ιa , ι, K ∈ R are positive.

References

1. Filippov AF (1988) Differential equations with discontinuous right-hand sides. Kluwer Aca-
demic Publishers, Dordrecht
2. Aubin JP, Frankowska H (2008) Set-valued analysis. Birkhäuser
3. Shevitz D, Paden B (1994) Lyapunov stability theory of nonsmooth systems. IEEE Trans Autom
Control 39(9):1910–1914
4. Paden BE, Sastry SS (1987) A calculus for computing Filippov’s differential inclusion with
application to the variable structure control of robot manipulators. IEEE Trans Circuits Syst
34(1):73–82
5. Clarke FH (1990) Optimization and nonsmooth analysis. SIAM
6. Kamalapurkar R, Rosenfeld JA, Klotz J, Downey RJ, Dixon WE (2014) Supporting lemmas
for RISE-based control methods. arXiv:1306.3432
7. Fischer N, Kamalapurkar R, Dixon WE (2013) LaSalle-Yoshizawa corollaries for nonsmooth
systems. IEEE Trans Autom Control 58(9):2333–2338
8. Khalil HK (2002) Nonlinear systems, 3rd edn. Prentice Hall, Upper Saddle River
9. Sastry S, Bodson M (1989) Adaptive control: stability, convergence, and robustness. Prentice-
Hall, Upper Saddle River
10. Narendra K, Annaswamy A (1989) Stable adaptive systems. Prentice-Hall Inc, Upper Saddle
River
11. Ioannou P, Sun J (1996) Robust adaptive control. Prentice Hall, Upper Saddle River
Appendix A: Supplementary Lemmas and Definitions 289

12. Chowdhary G (2010) Concurrent learning for convergence in adaptive control without persis-
tency of excitation. PhD thesis, Georgia Institute of Technology
13. Chowdhary GV, Johnson EN (2011) Theory and flight-test validation of a concurrent-learning
adaptive controller. J Guid Control Dynam 34(2):592–607
14. Chowdhary G, Yucelen T, Mühlegg M, Johnson EN (2013) Concurrent learning adaptive control
of linear systems with exponentially convergent bounds. Int J Adapt Control Signal Process
27(4):280–301
15. Dierks T, Jagannathan S (2009) Optimal tracking control of affine nonlinear discrete-time
systems with unknown internal dynamics. In: Proceedings of the IEEE conference on decision
and control. Shanghai, CN, pp 6750–6755
16. Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5):878–888
17. Vrabie D, Lewis FL (2010) Integral reinforcement learning for online computation of feedback
nash strategies of nonzero-sum differential games. In: Proceedings of the IEEE conference on
decision and control, pp 3066–3071
18. Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A
novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear
systems. Automatica 49(1):89–92
19. Zhang H, Cui L, Luo Y (2013) Near-optimal control for nonzero-sum differential games of
continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern 43(1):206–
216
20. Walters P, Kamalapurkar R, Andrews L, Dixon WE (2014) Online approximate optimal path-
following for a mobile robot. In: Proceedings of the IEEE conference on decision and control,
pp 4536–4541
Index

A Concurrent learning, 100, 101, 103, 104,


Actor, xv, 25, 26, 28, 30, 31, 34, 35, 43–45, 107, 111, 114, 118, 120, 127, 130–
50, 51, 55, 58, 73, 75, 76, 80, 81, 90– 134, 140, 141, 150, 158, 159, 182,
92, 94, 162, 165, 185, 219, 258 183, 186–189, 195, 199, 243
Actor-critic, 25, 26, 28, 31, 33, 35, 52, 56, Converse Lyapunov Theorem, 84
85, 90, 94 Cooperative control, 150, 167, 172, 177, 189,
Actor-critic-identifier, 33, 43–45, 73, 75, 76, 190
102, 103, 131 Cost, Lagrange, 1, 2
Adjacency matrix, 151, 180 Cost, Mayer, 1, 2
Advantage updating, 26, 35 Cost-to-go, 3, 18, 25
Algebraic Riccati equation, 125, 209, 222, Critic, 25, 26, 28–35, 43–45, 49–51, 55–58,
224 62, 65, 73, 75, 76, 80–83, 89–92, 94,
Autonomous surface vehicles, 223 101, 102, 141, 161, 162, 165, 219,
Autonomous underwater vehicle, 195, 196, 245, 259
208–210

D
B
Differential game, 11, 12, 44, 74, 75, 94, 189,
Bellman error, 24, 25, 28–33, 43, 45, 50,
190
51, 56, 62, 63, 73, 76, 81, 99–105,
Differential game, closed-loop, 44
107, 110, 118–120, 122, 125, 130,
Differential game, graphical, 150, 190
131, 133, 135, 138, 144, 150, 157–
Differential game, nonzero-sum, 44, 73, 75,
159, 161–163, 185–187, 203–205,
89, 90, 92–94, 101, 131, 140
209, 210, 215, 245, 247, 251, 252
Differential game, zero-sum, 94
Bellman error extrapolation, 166, 169, 174,
230, 248, 250, 252–254, 257, 258 Dynamic neural network, 43, 46, 56, 73, 75,
Bellman’s principle of optimality, 3 77, 78, 90
Bergmann–Fock space, 232
Bolza problem, 1–3, 5, 7, 18, 26, 243
Brachistochrone problem, 3 E
e−modification, 33
Existence, 2
C Experience replay, 30, 99, 102, 260, 261
Carathéodory solutions, 2 Exponential kernel function, 229, 232, 234–
Common Lyapunov function, 110, 140 237, 240

© Springer International Publishing AG 2018 291


R. Kamalapurkar et al., Reinforcement Learning for Optimal
Feedback Control, Communications and Control Engineering,
https://doi.org/10.1007/978-3-319-78384-0
292 Index

F N
Filippov solution, 46, 48, 49, 77, 79 Nash equilibrium, 11, 44, 45, 74, 89, 91,
Finite excitation, 102 101, 131, 138–141, 149, 150, 156–
Finite time horizon, 224 158, 161, 166, 181, 190
Finite-horizon, 71, 94, 185 Nash policy, 165
Formation tracking, 149, 153, 154, 189 Network, leader-follower, 188
Network systems, 149–152, 167, 172, 176,
180, 181, 184, 185, 189, 190
G Nonholonomic system, 172
Galerkin’s method, 223 Nonholonomic vehicle, 223
Galerkin’s spectral method, 35, 93
GPOPS, 70, 72, 73
Gram-Schmidt algorithm, 235 P
Graph, directed, 151, 180 Path-following, 196, 213, 223, 224
Graph Laplacian, 151 Persistence of excitation, 24, 31, 33, 43, 52,
55, 56, 63, 66–68, 83, 85, 89–91,
93, 99, 101–103, 107, 111, 118, 120,
H 131–133, 143, 145, 183, 187, 190,
Hamiltonian, 5, 6, 9, 10, 61, 62, 72, 76, 185, 195, 199, 204, 230, 248
189 Policy evaluation, 18, 35, 55
Hamilton–Jacobi–Bellman equation, 5–8, Policy gradient, 35
10–13, 18–23, 27, 35, 36, 43, 44, 51, Policy improvement, 18, 35, 55, 93
53, 61, 93, 99, 104, 119, 181, 202, Policy iteration, 17–19, 22–25, 34–36, 55,
215, 223 94, 95
Hamilton–Jacobi equation, 75, 182, 184 Policy iteration, synchronous, 94
Hamilton–Jacobi–Isaacs equation, 94 Pontryagin’s maximum principle, 2, 3, 9, 10,
Heuristic dynamic programming, 25, 34–36, 22, 34
43, 94 Prediction error, 107
Hopfield, 43 Projection operator, 50, 78, 82, 84, 114, 204,
205
Pseudospectral, Gauss, 117
I Pseudospectral, Radau, 70
Identifier, 43, 45, 46, 49, 50, 52, 55–57, 73,
76, 80, 85, 90
Infinite-horizon, 5, 35, 44, 45, 71, 73, 74, 94, Q
95, 100, 101, 117, 118, 131, 149, 153 Q-learning, 17, 22, 26, 34–36, 94

K R
Kalman filter, 209 Radial basis function, 228–230
Kantorovich inequality, 239 Randomized stationary policy, 105
Receding horizon, 36
Reinforcement learning, 12, 13, 17, 29, 30,
L 33, 35–37, 43, 45, 55, 60, 91, 94, 99–
Least-squares, 28, 31, 33, 35, 43, 49–51, 62, 103, 105, 118, 144, 149, 150, 158,
63, 73, 81, 91, 93, 100, 101, 105, 106, 161, 195, 229, 230, 242, 245, 258–
111, 114, 134, 140, 162, 186, 187, 261
245, 246, 248 Reproducing kernel Hilbert space, 227, 228,
Levenberg-Marquardt, 51 230–234, 238
Linear quadratic regulator, 227 Reproducing kernel Hilbert space, universal,
233, 242, 244
Riccati equation, 11
M RISE, 43, 46, 56, 77, 209
Model-predictive control, 11, 36, 223, 224 R−learning, 34–36
Index 293

S T
Saddle point, 94 Temporal difference, 13, 17, 25, 35, 93
SARSA, 34–36
σ−modification, 33
Sigmoid, 230
Simulation of experience, 100–103, 105, U
118, 122, 131, 161–163, 245 Uniqueness, 2, 12
Single network adaptive critic, 36 Universal Approximation, 23, 63, 101, 104,
Singular value maximization algorithm, 111, 158, 160, 184, 185, 252
128, 133, 140, 141
Spanning tree, 151, 152, 164
State following (StaF), 228–230, 232, 233, V
235, 237, 238, 242–244, 246, 250, Value iteration, 17, 22, 23, 34–36
251, 253–258, 260 Viscosity solutions, 12
Station keeping, 195, 211, 223
Stone-Weierstrass Theorem, 23
Successive approximation, 18, 35, 36
Support vector machines, 230 W
Switched subsystem, 104, 110, 120, 133, Weierstrass Theorem, 243
139, 140 Wheeled mobile robot, 172, 196, 218–223

You might also like