0% found this document useful (0 votes)

14 views20 pages

Cessna 2

Uploaded by

Janki Kaushal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views20 pages

Cessna 2

Uploaded by

Janki Kaushal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

AIAA SciTech Forum 10.2514/6.

2020-1844
6-10 January 2020, Orlando, FL
AIAA Scitech 2020 Forum

Online Adaptive Incremental Reinforcement Learning Flight

Control for a CS-25 Class Aircraft

S. Heyer ∗ , D. Kroezen † , E. van Kampen ‡

Delft University of Technology, P.O. Box 5058, 2600GB Delft, The Netherlands

In recent years Adaptive Critic Designs (ACDs) have been applied to adaptive flight control
of uncertain, nonlinear systems. However, these algorithms often rely on representative models
as they require an offline training stage. Therefore, they have limited applicability to a system
for which no accurate system model is available, nor readily identifiable. Inspired by recent
work on Incremental Dual Heuristic Programming (IDHP), this paper derives and analyzes
a Reinforcement Learning (RL) based framework for adaptive flight control of a CS-25 class
fixed-wing aircraft. The proposed framework utilizes Artificial Neural Networks (ANNs) and
includes an additional network structure to improve learning stability. The designed learning
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

controller is implemented to control a high-fidelity, six-degree-of-freedom simulation of the

Cessna 550 Citation II PH-LAB research aircraft. It is demonstrated that the proposed
framework is able to learn a near-optimal control policy online without a priori knowledge of
the system dynamics nor an offline training phase. Furthermore, it is able to generalize and
operate the aircraft in not previously encountered flight regimes as well as identify and adapt
to unforeseen changes to the aircraft’s dynamics.

Nomenclature

s, s R, a = state, reference state, and action vectors

r, g = reward and return
γ, τ, κ = discount, mixing, and forgetting factors
f (s, a), r(s R, s) = state-transition and reward functions
π(s, s R ), π ∗ (s, s R ) = policy and optimal policy
π̂(s, s R, w A) = parametric approximation of the policy
v(s, s R ) = state-value function
λ(s, s R ) = state-value partial derivative w.r.t. the state vector
λ̂(s, s R, w C ) = parametric approximation of state-value partial derivative w.r.t. state vector
0
λ̂ 0(s, s R, w C ) = parametric approximation of state-value partial derivative w.r.t. state vector
P, Q = Boolean selection and symmetric weight matrices
t, ∆t, T = time, sample time, and simulation time duration
F, F̂ = state matrix and state matrix estimate
G, Ĝ = input matrix and input matrix estimate
0
w A, w C , w C = actor, critic, and target critic parameter vectors
∆s, ∆a, ∆ ŝ = state and action vector increment and state vector increment prediction
e = partial derivative of the Temporal Difference (TD) error w.r.t. the state vector
L A, L C = actor and critic losses
η A, ηC = actor and critic learning rates
Θ̂, Λ, X = Recursive Least Squares (RLS) parameter, covariance, and measurement matrices
= innovation vector
δe, δa, δr = elevator, aileron, and rudder deflections
p, q, r, α, β, φ, θ = roll rate, pitch rate, yaw rate, angle of attack, sideslip angle, roll angle, and pitch angle
p R, q R, β R, φ R, H R, γ R = roll rate, pitch rate, sideslip angle, roll angle, altitude, and flight path angle reference values
Vt as, H, γ, n = true airspeed, altitude, flight path angle, and load factor
∗ M.Sc., Faculty of Aerospace Engineering, Control and Simulation Division, Delft University of Technology
† M.Sc., Faculty of Aerospace Engineering, Control and Simulation Division, Delft University of Technology
‡ Assistant Professor, Faculty of Aerospace Engineering, Control and Simulation Division, Delft University of Technology, AIAA Member

Copyright © 2020 by Delft University of Technology. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.
I. Introduction
n recent years, the aerospace domain has experienced an unprecedented increase in interest in autonomous operations.
I Autonomous systems, especially when operated in complex urban environments, need to be able to adapt to
sudden, unexpected changes in the environment [1] and to changes in their dynamics, also referred to as fault tolerance.
Additionally to the requirement of online operation, model-independence is important as for many aerospace applications,
no accurate system model is available, nor readily identifiable [2, 3].
Current flight control systems for passenger aircraft predominantly rely on classical control techniques, which make
use of multiple linear controllers. Gain scheduling methods are then applied to switch between the numerous linear
controllers, each designed for a specific operating point [4]. These gains are determined offline, in advance, by means of
a model of the system. Their non-adaptive nature makes them unsuitable for autonomous systems.
Since the 1960s, adaptive control has been an active research field, developing control strategies that adapt to
changing system conditions online. Many, such as Nonlinear Dynamic Inversion (NDI) [5, 6] and Backstepping
(BS) [7], with the focus on dealing with system nonlinearities. Although successfully applied [8–12], these methods
strongly rely on an accurate model of the system dynamics. Recently developed incremental control methods, such as
Incremental Nonlinear Dynamic Inversion (INDI), Incremental Backstepping (IBS) and incremental adaptive sliding
mode control have decreased the model-dependence, in exchange for the need of high sample rate measurements [13–18].
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

Furthermore, major steps have been made in INDI through the first stability and robustness analysis in the presence of
external disturbances [19] and a first successful demonstration on an CS-25 certified aircraft [20].
Originally inspired by the idea of replicating biological learning mechanisms [21], Reinforcement Learning (RL) is
a field of Machine Learning (ML) that is best characterized by learning from interaction [22]. Ever since RL has been
studied and applied to the field of adaptive and optimal control. Traditional RL methods were formulated for discrete
state and actions spaces, which were sufficiently small such that approximate value functions could be represented as
tables. Continuous and high-dimensional spaces prevalent in control applications would lead to an exponential growth
in computational complexity known as the curse of dimensionality [23]. The curse of dimensionality was mitigated
with the introduction of function approximators, such as Artificial Neural Networks (ANNs), characterizing the field of
Approximate Dynamic Programming (ADP) [24].
With Adaptive Critic Designs (ACDs), a class of ADP methods, several applications were successfully explored in
the 2000s, including adaptive flight control for a missile system [25], business jet [26], helicopter [27] and military
aircraft [28]. However, these methods often need an extra structure to approximate the system dynamics. Furthermore,
when applied online, a preceding offline learning phase is required, mainly due to non-trivial identification of the
system dynamics. The offline identification phase itself requires a representative simulation model. In [29–31] novel
frameworks, named Incremental Heuristic Dynamic Programming (IHDP) and Incremental Dual Heuristic Programming
(IDHP) have been proposed to improve online adaptability and most importantly, eliminate the current need of an offline
learning phase, by identifying an incremental model of the system in real-time. However, these novel frameworks have
yet to be applied to and validated on complex, high-dimensional aerospace models and real systems.
The main contribution of this paper is to present the design and analysis of a RL based adaptive flight controller
for a CS-25 class fixed-wing research aircraft, that can learn a near-optimal control policy online without a priori
knowledge of the system dynamics nor an offline training phase. In this work, a novel learning algorithm, based on
IDHP, is proposed and applied to a six-degree-of-freedom, high-fidelity, nonlinear simulation model of a Cessna 550
Citation II PH-LAB research aircraft. Through simulation, it is shown that the designed controller, is able to learn
to control the aircraft, without a priori system knowledge, during a short online training phase. However, learning
instability inherent to IDHP can lead to failures, which are unacceptable during operation in real systems. Therefore the
proposed algorithm utilizes a separate target critic network to improve learning stability during Temporal Difference
(TD) backups. Furthermore, it is shown that the controller is able to generalize and operate the aircraft in not previously
encountered flight regimes and identify and adapt to unforeseen changes to the aircraft’s dynamics.
The remainder of this paper is structured as follows. In Section II the control problem is formulated and the
proposed learning framework is derived. Section III introduces the high-fidelity simulation model and presents the
implementation of the learning framework into the controller design. Subsequently, in Section IV, the controller is
tested and evaluated in three distinct cases: (1) online training, (2) online operation and (3) online adaption. Lastly, this
paper is concluded in Section V.

2
Fig. 1 CS-25 class Cessna 550 Citation II PH-LAB research aircraft operated by the Delft University of
Technology.

II. Foundations
This section starts by formulating the flight control task as a RL problem. Subsequently, the proposed learning
framework is introduced, followed by its update rules and training strategy.
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

A. Problem Formulation
In the framework of RL the state-transition and reward function are commonly defined as processes of the
environment, whose mechanics are hidden from the agent [22]. This paper regards them as separate entities, where the
reward function is not a hidden process of the environment, but a designed functional. The state-transition function
f (s, a) ∈ Rm×1 characterizes the discrete-time, deterministic, nonlinear plant as in Eq. (1) with the state vector s ∈ Rm×1 ,
the action vector a ∈ Rn×1 and the assumption of synchronous, high-frequency sampling.
The flight control problem is a variable set-point tracking task with the reference state vector s R ∈ R p×1 . Therefore,
the goal is to learn a deterministic policy π(s, s R ) ∈ Rn×1 that maximizes the scalar return g, defined by Eq. (2) and
Eq. (3), respectively. The return, defined by the state-value function v(s, s R ), represents a discounted sum of future
scalar rewards r, where the scalar discount factor γ ∈ [0, 1] is a property of the agent. Equation (4) is the reward function.
It is defined as the negative, weighted, squared state tracking error, with the Boolean selection matrix P ∈ R p×m and
the symmetric weight matrix Q ∈ R p×p . Furthermore, the reward function is differentiable as required for the update
operations of Dual Heuristic Programming (DHP) frameworks. Its partial derivative with respect to the state vector
∂r
∂s ∈ R
1×m is defined in Eq. (5).

st+1 = f (st , at ) (1)

at = π(st , stR ) (2)

∞
Õ
gt = v(st , stR ) = γ k rt+1+k (3)
k=0
T
rt+1 = r(stR, st+1 ) = − P st+1 − stR Q P st+1 − stR

(4)

∂rt+1 ∂r(stR, st+1 ) T

= = −2 P st+1 − stR QP

(5)
∂ st+1 ∂ st+1

B. Learning Framework
A schematic of the feed-forward signal flow of the proposed learning framework is presented in Fig. 2. The
schematic includes both processes and parametric structures. The proposed learning framework is derived from the
IDHP framework as presented in [30], with its three parametric structures: the actor π̂(s, s R, w A) ∈ Rn×1 , the critic
λ̂(s, s R, w C ) ∈ R1×m , and the incremental model of the plant. Whereas the actor approximates the control policy, the
critic approximates the partial derivative of the state-value function with respect to the state, with the parameter vectors
w A and w C , respectively.

3
R
st
rt+1
Reward
st+1

R
s
t+1 ′
λ
t+1
Target Critic
st+1

R R
st s
t+1
λt λt+1
Critic Critic
st st+1

st st −
st+1 Δst+1 1
Δst Δst
Actor Plant ̂
F t−1
z
+
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

R
st at at
Δst+1
Incremental
Plant Model
Estimator ̂
+ Gt−1
at at−1 Δat
1

z
−

Fig. 2 Schematic of the feed-forward signal flow of the learning framework, where the different parametric
structures and processes are illustrated by distinct colors.

1. Target Network
The proposed learning framework is an on-policy approach and therefore does not appertain to the deadly triad∗ .
Nevertheless, any form of learning instability is undesirable during online operation. To improve learning stability
0
a separate target critic λˆ0(s, s R, w C ) ∈ R1×m , inspired by [32, 33], is proposed for the TD backups. The proposed
approach slows down learning as the propagation of the state-value derivatives is delayed, but this is outweighed by the
0
improved learning stability. Analogous to the actor and critic, the target critic has a parameter vector w C .

2. Incremental Model
By means of a first-order Taylor series expansion of the discrete-time, nonlinear plant at the operating point [s0, a0 ] a
linear approximation of the system is established as in Eq. (6), with the partial derivatives of the state-transition function,
F(s0, a0 ) = ∂ f (s
∂s0
0,a0 )
∈ Rm×m and G(s0, a0 ) = ∂ f (s
∂a0
0,a0 )
∈ Rm×n , also referred to as the state matrix and input matrix,
respectively. By choosing the operating point [x0, a0 ] = [xt−1, at−1 ] and rearranging Eq. (6), the incremental form of the
discrete-time, linear approximation of the system is obtained, as in Eq. (7), with ∆st+1 = st+1 − st , ∆at+1 = at+1 − at ,
Ft−1 = F(st−1, at−1 ), and Gt−1 = G(st−1, at−1 ).

st+1 ≈ f (s0, a0 ) + F(s0, a0 )[st − s0 ] + G(s0, a0 )[at − a0 ] (6)

∆st+1 ≈ Ft−1 ∆st + Gt−1 ∆at (7)

Assuming a high sampling frequency and a slow-varying system, the incremental model as in Eq. (7), provides
a linear, time-varying approximation of the nonlinear system [15, 34]. An online system identification algorithm is
utilized to generate estimates of the time-varying state and input matrix, F̂t−1 ≈ Ft−1 and Ĝt−1 ≈ Gt−1
∗ The deadly triad refers to the instability and divergent behavior that methods which combine off-policy learning, bootstrapping, and function

approximation, exhibit [22].

4
3. Network Topology
In this paper single-hidden-layer, fully-connected Multilayer Perceptrons (MLPs) ANNs are chosen as parametric
structures for the actor, critic, and target critic, as they: (1) easily manage dimensional large input and output spaces (2)
support batch or incremental learning methods (3) are differentiable (4) can approximate any nonlinear function on a
compact space arbitrarily well (5) support flexible design (6) are widely applied in intelligent flight control applications
[26–28, 30, 35–41]. The hyperbolic tangent activation function is utilized and the hidden layers consist of 10 neurons.
In [26, 27] an additional, offline-trained, trim network is employed to provide a global mapping between the nominal
control positions and the system’s operating condition. The actor in the framework proposed in this paper is able to
learn a notion of a local flight-condition-dependent tracking policy, eliminating the need for a pretrained trim network.
Consequently, the input of the actor, critic, and target critic networks include both the state vector and reference state
h iT
tracking error s P s − s R ∈ R(m+p)×1 .
Whereas the input layers of the actor, critic, and target critic have the same structure, their output layers are different.
The actor’s output layer utilizes a hyperbolic tangent function, which is scaled according to the individual saturation
limits of the control surfaces of the PH-LAB research aircraft as presented in Section III. In the case of the elevator for
which the saturation limits are asymmetric, the limit with the largest absolute magnitude is utilized as scaling constant.
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

The critic and target critic have a linear output layer. The topology of the neural networks is illustrated in Fig. 3.

s0 s0
tanh scaled tanh tanh linear
.. ..
. .
a0 ∂v(s,sR )
∂s0
sm sm
.. .. .. ..
. . . .
(P s)0 − sR
0 (P s)0 − sR
0
an ∂v(s,sR )
.. .. ∂sm
. .

(P s)p − sR
p (P s)p − sR
p

Fig. 3 Neural network topology used for actor (left) and network topology used for critic and target critic
(right), with input, hidden, and output layers colored in green, blue, red, respectively. Layers can have either a
(scaled) hyperbolic tangent or a linear activation function. The hidden layers have 10 neurons.

C. Update Rules
The parametric structures are updated at each time step as new information becomes available through an interaction
with the plant. The leaning process is governed by the update rules, as presented in this section.

1. Critic
The critic is updated through a bootstrapping, TD backup operation. Equation (8) defines the mean squared error
loss of the critic L C to be minimized, with the error e ∈ R1×m . The error e is defined as the partial derivative of the TD
error with respect to the state vector, as in Eq. (9).
1 T
LtC = et e (8)
2 t
h i
∂ r(stR, st+1 ) + γv(st+1, st+1
R ) − v(s , s R )
t t
et = −
∂ st (9)
∂r(stR, st+1 ) C 0 ∂ s t+1

=− + γ λ̂ (st+1, st+1, wt )
0 R
+ λ̂(st , st , wt )
R C
∂ st+1 ∂ st

∂r(s tR ,s t +1 ) R , w C 0 ) ∂s t +1 is computed by means of an evaluation of the reward
In Eq. (9) the target ∂s t +1 + γ λ̂ 0 (s , s
t+1 t+1 t ∂s t

5
process, a forward pass through the target critic, a backward pass through the actor, and the current state and input matrix
estimates of the incremental model. The latter becomes evident through the expansion of the term ∂s∂st +1 t
as presented in
Eq. (10). As ANNs are inherently differentiable, the differentiability requirement of the actor in Eq. (10) is met.

∂ st+1 ∂ f (st , at ) ∂ f (st , at ) ∂ at

= +
∂ st ∂ st ∂ at ∂ st
(10)
∂ π̂(st , stR, wtA)
= F̂t−1 + Ĝt−1
∂ st

Equation (11) defines the gradient of the critic’s loss with respect to its parameter vector. † The critic is updated by
stepping in opposite direction to the gradient of the loss with learning rate ηC , as defined in Eq. (12).

∂LtC ∂LtC ∂ λ̂(st , stR, wtC ) ∂ λ̂(st , stR, wtC )

= = et (11)
∂wtC ∂ λ̂(st , stR, wtC ) ∂wtC ∂wtC
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

∂LtC
C
wt+1 = wtC − ηtC (12)
∂wtC

2. Target Critic
For the target critic, soft target updates are utilized as first introduced in [33]. The target critic is initialized as a
copy of the critic and is subsequently updated according to a weighted sum as defined by Eq. (13). With the scalar
mixing factor τ 1. The use of the target critic improves the learning stability as the target state-value derivatives are
constrained to change slowly.
0 0
C
wt+1 = τwt+1
C
+ [1 − τ]wtC (13)

3. Actor
The actor is updated towards an optimal policy π ∗ (s, s R ) ∈ Rn×1 , as defined in Eq. (14), which maximizes the
state-value function v(s, s R ). Consequently, the loss of the actor L A and its partial derivative with respect to its parameter
vector are defined by Eq. (15) and Eq. (16), respectively. Accordingly, the update procedure of the actor involves, the
reward process, the critic, and the input matrix estimate of the incremental model. Equation (17) defines the gradient
step with learning rate η A.

π ∗ (st , stR ) = arg max v(st stR ) (14)

at
h i
LtA = −v(st , stR ) = − r(stR, st+1 ) + γv(st+1, st+1
R
) (15)

h i
∂LtA ∂ r(stR, st+1 ) + γv(st+1, st+1
R )

=−
∂wtA ∂wtA
∂r(stR, st+1 ) ∂v(st+1, st+1
R )
∂ st+1 ∂ at

=− +γ (16)
∂ st+1 ∂ st+1 ∂ at ∂wtA
∂r(stR, st+1 ) ∂ π̂(st , stR, wtA)

=− + γ λ̂(st+1, st+1, wt+1 ) Ĝt−1
R C
∂ st+1 ∂wtA

∂LtA
A
wt+1 = wtA − ηtA (17)
∂wtA
† While the target term is also dependent on w tC through the update rules of the target critic, this is generally neglected [33, 42]

6
4. Incremental Model
The estimates of the state and input matrices are represented by the parameter matrix Θ̂ ∈ R(m+n)×m , as presented in
Eq. (18). The parameter matrix is accompanied by a covariance matrix Λ ∈ R(m+n)×(m+n) , which expresses a measure of
confidence of the estimates. The update process of the Recursive Least Squares (RLS) estimator starts with a prediction
of state increments ∆ ŝ based on the latest measurements X ∈ R(m+n)×1 and the current parameter estimates Θ̂, as
defined in Eq. (19) and Eq. (20). Subsequently, the prediction error ∈ R1×m , named innovation, is computed, as
defined in Eq. (21). Conclusively, the parameter estimates and covariance matrix are updated based on Eq. (22) and
Eq. (23), with the scalar forgetting factor κ ∈ [0, 1], which exponentially weights older measurements.
" #
T
F̂t−1
Θ̂t−1 = (18)
ĜTt−1
" #
∆st
Xt = (19)
∆at

∆ ŝTt+1 = XtT Θ̂t−1

Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

(20)

t = ∆sTt+1 − ∆ ŝTt+1 (21)

Λt−1 Xt
Θ̂t = Θ̂t−1 + t (22)
κ + XtT Λt−1 Xt
Λt−1 Xt XtT Λt−1

1
Λt = Λt−1 − (23)
κ κ + XtT Λt−1 Xt

D. Training Strategy
Algorithm 1 outlines the training strategy of the agent. The proposed strategy targets minimal model-dependency
and computational complexity. In [24, 26] the agent’s update operations are (partially) conducted on state predictions
computed with a model of the plant. Although this approach enables the use of additional optimization schemes at
each time step, it has limited applicability to time-critical, online operations, such as the PH-LAB research aircraft.
Furthermore, these methods strongly rely on both the model’s prediction performance and its capability to estimate the
state-transition derivatives.
The method proposed in this paper utilizes current state measurements for the update operations instead. This reduces
the model-dependency to only the state-transition derivatives, making the controller’s performance less vulnerable to
imperfect models. In addition, the computational complexity of the learning algorithm is improved, as it requires less
(re)evaluations of the actor, critic, and target critic.

III. Controller Design

This section starts by introducing the simulation model of the PH-LAB research aircraft utilized in this paper.
Subsequently, the flight controller design is presented, followed by the hyperparameters.

A. High-Fidelity Simulation Model

Operated by the Faculty of Aerospace Engineering of the Delft University of Technology, the Cessna 550 Citation II
PH-LAB depicted in Fig. 1 is a multipurpose research platform. In this paper, a high-fidelity, nonlinear, six-degrees-
of-freedom model of a Cessna 500 Citation I, developed with the Delft University Aircraft Simulation Model and
Analysis Tool (DASMAT), is utilized. Despite the differences of the aircraft in fuselage size, engine power, and wing
size, the model is still representative of the PH-LAB research aircraft [43]. Additionally, the simulation model includes
engine dynamics, actuator dynamics, and sensor models. The sensor models are not used, as clean measurements
are assumed in this paper. The actuator dynamics are modeled with a first order actuator model and control surface
deflection saturation. The limits are listed in Table 1. Furthermore, the aircraft’s yaw damper is disabled to provide
the agent with full control authority over the control surfaces. The engine’s thrust setting is controlled by an internal

7
Algorithm 1 Learning Framework
Require:
simulation parameters ∆t, T
agent parameters γ, η A, ηC , τ, κ
differentiable deterministic policy parameterization π̂(s, s R, w A)
differentiable state-value function derivative parameterization λ̂(s, s R, w C )
0
differentiable target state-value function derivative parameterization λ̂ 0(s, s R, w C )
reward function derivative ∂r(s∂s ,s)
R

Initialize:
w0A, w0C , Θ̂0, Λ0, s0
0
w0C ← w0C
Compute:
1: for i = 0 to int ∆t T
− 1 do
get siR
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

2:
3: if i = 1 then
4: wiA ← wi−1 A
C C
5: wi ← wi−1
0 C0
6: wiC ← wi−1
7: end if
8: if i" > 1 then
#
T
F̂i−2
9: ← Θ̂i−2
ĜTi−2
∂s i ∂ π̂(s i−1,s i−1
R ,w A )

∂s i−1 ← F̂i−2 + Ĝi−2

i−1
10: ∂s i−1
∂ri ∂r(s i−1,s i )
R
11: ∂s i ← ∂s i
12: λ̂i−1 ← λ̂(si−1, si−1 R , wC )
i−1
13: λˆi ← λ̂(si, si , wi−1
R C
)
C0
14: λ̂i0 ← λ̂ 0(si, siR, wi−1 )
∂ π̂(s i−1,s i−1
R ,w A )
h i
∂ri
15: ∆wi ← η ∂s i + γ λ̂i Ĝi−2
A A
∂w i−1
A
i−1

h h i i
∂ri ∂s i ∂λ̂ i−1
16: ∆wiC ← −ηC − ∂s i
+ γ λ̂ 0
i ∂s i−1 + λ̂ i−1 ∂w i−1
C

17: wi ← wi−1 + ∆wi

A A A

18: wiC ← wi−1 C

+ ∆wiC
0 C0
19: wiC ← τwiC + [1 − τ]wi−1
20: ∆si−1 ← si−1 − si−2
21: ∆ai−1 ← ai−1 − ai−2
22: ∆si ← si" − si−1#
∆si−1
23: Xi−1 ←
∆ai−1
24: ∆ ŝTi ← Xi−1
T Θ̂
i−2
25: i−1 ← ∆sTi − ∆ ŝTi
26: Θ̂i−1 ← Θ̂i−2 + κ+XΛTi−2ΛXi−1X i−1
i−1 i−2 i−1
h T Λ
Λi−2 X i−1 X i−1
i
1 i−2
27: Λi−1 ← κ Λi−2 − κ+X T Λ X
i−1 i−2 i−1
28: end if
29: ai ← π̂(si, siR, wiA)
30: get si+1 by taking action ai
31: end for

8
airspeed controller. The simulation model is run with a sampling frequency of 50 Hz. It contains and accepts a large
number of states and control inputs, respectively. A subset of it spans the agent-plant interface as defined by Eq. (24).

Table 1 Aerodynamic control surface saturation limits utilized in the actuator dynamics model of the Cessna
500 Citation I simulation model.

Control Surface Saturation Limits

δe [−20.05, 14.90] deg
δa [−37.24, 37.24] deg
δr [−21.77, 21.77] deg

h iT h iT
s= p q r Vt as α β φ θ H a = δe δa δr (24)
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

B. Flight Controller
The adaptive learning framework is applied to angular rate control of the pitch and roll rate, augmented with an outer
control loop, as illustrated in Fig. 4. Rate control exhibits the lowest learning complexity due to the direct dynamic
relation between the angular rates and control surfaces. The outer control loop consists of conventional PID controllers
and provides a higher-level control interface, which enables reference tracking of an altitude and roll angle profile.
Under the assumption of a symmetric aircraft, a decoupled controller design is employed utilizing separate longitudinal
and lateral learning controllers, with the state, reference state, and actions vectors as in Eq. (25) and Eq. (26). The
resulting selection and weight matrices are presented in Eq. (27) and Eq. (28). For large roll angles, there is limited
control of the flight path angle through the pitch rate. On the other hand, using separate controllers with each their own
learning framework instance reduces the complexity of the problem to be learned and simplifies the online incremental
model identification. As separate critics are used, there is no need to specify relative weights between the longitudinal
and lateral tracking errors.

Learning
ϕ Controllers
R
- p
R R
p ,β
R
PID Lateral δa , δr
ϕ +
β
R Controller
p, r, β, ϕ
δe , δa , δr p, q, r, Vtas , α, β, ϕ, θ, H
Plant
R R R
H γ + q
PID PID Longitudinal
+ - -
γ Controller δe
H α - q, α, θ

+
θ

Fig. 4 Schematic of flight controller with individual longitudinal and lateral learning controllers.

h iT lon
h iT h iT
s lon = q α θ sR = qR alon = δe (25)

h iT l at
h iT h iT
s lat = p r β φ sR = pR βR alat = δa δr (26)
h i lon
h i
Plon = 1 0 0 QR = 1 (27)
" # " #
1 0 0 0 R l at 1 0
Plat = Q = (28)
0 0 1 0 0 100

9
As can be observed from Eq. (25) the airspeed is not provided to the longitudinal controller. The exclusion of the
airspeed is motivated by the fact that the online identification of the airspeed related parameters of the incremental model
is nontrivial due to time scale separation and relatively small local variations in airspeed. The latter is exacerbated by
the internal airspeed controller. Additional to a reference roll rate, the lateral learning controller receives a zero degree
reference for the sideslip angle, such that it can learn to minimize it, with the goal of flying coordinated turns.

C. Hyperparameters
The hyperparameters utilized in the remainder of this paper are summarized in Table 2. These parameters are
utilized for both the longitudinal and lateral controllers and are determined empirically by experimentation and common
guidelines proposed in [24]. In Section IV a distinction is made between an online training, online operation, and online
adaption phase. As the name suggests, the training phase focuses on the learning process of the controllers. During the
operation phase, the agent is employed to fly representative maneuvers. In the adaption phase, changes are applied to
the plant and the agent during flight. In all phases, the agent is constantly learning through interaction. During the
training phase, a faster learning process is preferred over a lower chance of converging to a local minimum. Accordingly,
the learning rates during the training phase are higher, than in the operation phase. The utilization of the target critic
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

depends on the experiment being conducted in Section IV. For the training phase the RLS incremental model estimator,
the actor, and the critic are initialized according to the values as listed in Table 2. During the operation phase, the
learning framework is initialized at its pretrained state.
The forgetting factor κ of the RLS estimator is set to one. This is done to ensure consistency of the estimator. During
periods of poor or no excitation (which is the case for the majority of time during operation), the covariance matrix
increases exponentially when κ < 1, also referred to as estimator windup. Once, the system is excited again, abrupt
changes in the estimate occur despite no changes in the actual system. In addition to poor excitation, non-uniformly
distributed information over all parameters and time-scale separation in the variation of parameters also lead to estimator
windup. Many approaches to mitigate these issues have been proposed. In [44–46] non-uniformly distributed information
is dealt with by selective amplification of the covariance matrix, called directional forgetting. Similarly, [47, 48] propose
vector-type or selective forgetting, were individual parameter dependent forgetting factors are used, to deal with the
effects of time-scale separation. The design of an advanced online parameter identification algorithm for the PH-LAB is
out of the scope of this research.
Setting the forgetting factor to one has the disadvantage that the estimator does not forget older data. Consequently,
the estimator becomes less adaptive over time. In this paper, this problem is mitigated by resetting the covariance matrix
to its initial value when a large change in the system is detected through the estimator’s innovation term, as elaborated in
Section IV.

Table 2 Hyperparameters of learning framework instances of both lateral and longitudinal controllers.

Parameter Value
training phase 5, 10, 0.01 or 5, 10, 1
η A, ηC , τ
operation phase 1, 2, 0.01
w0A, w0C Ntrunc (µ = 0, σ = 0.05)
F̂0, Ĝ0, Λ0 I, 0, I · 108
κ 1.0
γ 0.8

IV. Results and Discussion

In this section, the results of the conducted experiments are presented and discussed. Three distinct cases are
evaluated. First, the framework’s capability to train online without a priori knowledge of the system dynamics is
evaluated. In addition, the online identification of the incremental model, as well as, the effect of the target critic on
the learning stability are discussed. Subsequently, experiments are conducted to demonstrate that the agent is able to
operate the aircraft during representative maneuvers at a variety of flight conditions. Last but not least, the framework’s
capability of dealing with unforeseen changes, such as changes in the aircraft’s dynamics, is analyzed.

10
A. Online Training Phase
In this section, the framework’s capability of online learning without a priori knowledge of the plant is demonstrated.
The longitudinal and lateral controllers are trained separately. During training of the longitudinal controller, the control
surfaces associated with lateral motion are kept at their initial trim value and vice versa.
A sine function with an amplitude of 5 degrees per second and a frequency of 0.2 Hertz is utilized as pitch and roll
rate reference. As elaborated in the previous section, the sideslip angle reference is zero. Different signals commonly
used for (aircraft) system identification [49], such as frequency sweeps, doublets, 3211 doublets, and sinusoidal functions
are good reference signal candidates. The proposed sinusoidal reference allows for a short training phase (under 60
seconds) with an acceptable load factor range, as depicted in Fig. 5. Persistent Excitation (PE) is essential to both
state-space exploration in the learning process and dynamic excitation in the system identification process of the
incremental model [22, 24, 49]. Exponentially decaying, sinusoidal excitation is applied to the elevator and ailerons
to excite the system during the initial training phase. As the agent learns to track the dynamic reference signals, the
excitation on the elevator and ailerons is reduced. Otherwise, the agent would learn to compensate for the applied
excitation over time. Through the aircraft’s dutch roll eigenmode, its yaw motion is also excited through the ailerons.
Therefore no additional excitation is applied to the rudder.
As illustrated in Fig. 5 both the longitudinal and lateral controller is able to follow the reference signals after less
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

than 30 seconds of training. The same observation is made from the actor and critic parameters as depicted in Fig. 6,
as an overall convergence of the parameters can be observed. As a result of the motion in the longitudinal controller
training phase, the airspeed oscillates. The agent perceives changes in airspeed as changes in the environment and
constantly adapts to it. This can be observed in Fig. 6, where the actor and critic parameters of the longitudinal controller
show minor oscillatory behavior.
From Fig. 5 it can be observed that it takes longer to learn to control the sideslip angle than the angular rates. This
can be attributed to two main factors: (1) the control surfaces are more directly related to the angular rates than to the
sideslip angle (2) the rudder is not directly excited. The consequence of the latter can also be observed in the online
identification process of the incremental models.

5 5
]

0 0
deg

deg
s

s
[

[
q

5 5
0.5
5.0
[deg]

0.0
deg
s

2.5
[

0.5
r

10 0.1
[deg]
[deg]

0.0
5 0.1
90
]

[deg]

0
m
s
Vtas [

88 5
2200 1
[deg]
H [m]

0
a

2000 1
0.0 0.25
[deg]

[deg]

2.5 0.00
0.25
e

1.5
1.00
n [g]

n [g]

1.0
0.5 0.95
0 10 20 30 40 50 60 0 10 20 30 40 50 60
t [s] t [s]

Fig. 5 Online training procedures of longitudinal (left) and lateral (right) controller, starting at trimmed
operation condition (Vt as, H) = (90 ms , 2 km), with (τ, η A, ηC ) = (1, 5, 10). Reference and excitation signals are
illustrated by black dashed and blue dotted lines, respectively.

11
1. Online Identification of Incremental Model
Figure 6 depicts the online identified state and input matrices for both the longitudinal and lateral incremental models.
As elaborated in Section II, the state and input matrices play an important role in the update operations of both the
actor and critic. For meaningful updates of the actor and critic, the online identified model parameters estimates should
have the correct sign and to some extent correct relative magnitude. To allow for an evaluation of the identification
performance, reference parameters values are provided by dotted lines. These reference values are derived from a
linearization of the nonlinear plant at the initial trim condition at (Vt as, H) = (90 ms , 2 km). These constant values are
not exact as: (1) they do not take in to account the actuator dynamics (2) nor the airspeed controller (3) do not take into
account time-varying changes in the plant (4) the linearized model contains different states than the incremental model’s
regression matrix. Nonetheless, these reference values provide a good approximation to evaluate the performance of the
online identified estimates.
It can be observed that both the parameters of the state and input matrices are identified in less than 10 seconds,
∂p
providing the agent with local information about the plant, with the exception of one term. The term ∂δ r
of the input
matrix of the lateral incremental model, is initially incorrectly identified and slowly moves towards its correct final
value. This behavior is partially a consequence of the fact that the rudder is not initially directly excited.
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

p p r
p r p
q q p r r
q q r p r
q p r
q r p

1 1.5
F [ ], 1s , [s]

F [ ], 1s , [s]

0 1.0
]

]
[

0.5
1
0.0
2
0
0.0
1
]

]
1

1
s

s
[

0.2
G [ ],

G [ ],

p r
2 a a a
q p r
3 e e e 0.4 r r r

0.2
0.5
0.0
wA [ ]

wA [ ]

0.0 0.2

0.5 0.4

4
1
2
wC [ ]

wC [ ]

0
0
1
2
0 10 20 30 40 50 60 0 10 20 30 40 50 60
t [s] t [s]

Fig. 6 State and input matrix estimates, and actor and critic parameters during online training of longitudinal
(left) and lateral (right) controller, starting at trimmed operation condition (Vt as, H) = (90 ms , 2 km), with
(τ, η A, ηC ) = (1, 5, 10) and (F̂0, Ĝ0, Θ̂0 ) = (I, 0, I ·108 ). Estimates derived from a plant linearization are represented
by dotted lines.

12
2. Online Training with Untrimmed Initialization
The online training phase as presented in the previous section is initiated at a trimmed condition. In practice,
however, it is challenging to find an initial trimmed state without a priori knowledge of the plant. Therefore, the proposed
framework should also be able to deal with untrimmed initial conditions. The effect of an untrimmed initialization on
the training phase is therefore examined for the longitudinal controller by superimposing a random uniformly distributed
elevator deflection offset around its trimmed state, in addition to the small random offset originating from the random
initialization of the actor’s parameters. Furthermore, the effects of the proposed utilization of a target critic on the
learning speed and stability are examined, by comparing the results attained from experiments with and without a target
critic. 500 runs are simulated for each case.
In Fig. 7 the area between the 25th and 75th percentile of the 500 runs are presented in red and blue, for a mixing
factor of τ = 1.0 and τ = 0.01, respectively. In both cases the same learning rates are utilized (η A, ηC ) = (5, 10). It can
be observed that the interquartile range of the agent without target critic narrows within the first 22 seconds of training.
In comparison, for the agent with a target critic, this is observed 20 seconds later. This implies that it takes longer for
the majority of the runs to converge to an equivalent final policy. On the other hand, in the first 5 seconds of training
the interquartile range of the agent with a target critic spans a smaller range of experienced pitch rate values than its
counterpart. This implies that for the majority of the 500 runs, the initial training phase is less dynamic and erroneous,
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

assisting the training stability. The latter is confirmed by the failure rates. The sample mean failure rate for 500 runs for
the agent with and without target critic (τ = 0.01 and τ = 1.0) is 0% and 5.2%, respectively. Failure is defined as the
occurrence of a float overflow in the learning process due to numerical instability.
Most of the time an unstable or diverging learning behavior leads to fast growth of the critic’s parameters. As
a consequence, the actor’s loss gradients become very large leading to aggressive and large updates from which the
agent cannot recover. The utilization of the proposed target critic stabilizes the critic’s learning and as a consequence,
no failures are encountered within 500 runs. As a comparison, without the proposed target critic, the training of the
longitudinal controller has a failure rate of 5.2%. Safety of learning can be further improved with dedicated safe learning
algorithms[50].
Consistent with findings in [33] the target critic slows down and stabilizes the learning process. Especially in an
application in the PH-LAB aircraft, no failure is acceptable. Therefore, the augmentation of the current IDHP framework
with the proposed target critic is vital to a successful implementation of the RL controller to the PH-LAB.

5.0 0
2.5 1
[deg]
]

0.0
deg

2
qs
[

2.5
e

3
5.0 4
0 10 20 30 40 50 60 0 10 20 30 40 50 60
t [s] t [s]

Fig. 7 Interquartile range of pitch rate and elevator deflection for 500 runs of the online training phase of the
longitudinal controller with random untrimmed initialization around the operating point (Vt as, H) = (90 ms , 2 km).
The reference signal is represented by a black, dashed line. The agent with and without a target critic are
represented in blue and red, respectively.

B. Online Operation Phase

Subsequent to a successful online training phase of both the lateral and longitudinal controllers, these can be utilized
to operate the aircraft. In this experiment, the goal is to demonstrate that the agent is able to operate the aircraft,
successfully and reliably during representative maneuvers at a variety of flight conditions. For each of the four initial
flight conditions as listed in Table 3, the agent follows an altitude and bank angle flight profile, which consists of a
prolonged climb and a subsequent descent (with 5 ms ), combined with left and right turns (of 25 deg and 20 deg), as
depicted in Fig. 8 and Fig. 9. For each flight condition 100 runs are simulated, where the plant is initialized at a trimmed
state and the agent in its final state of the online training phase of the previous section.
Figure 8 illustrates all 100 flight maneuvers for flight condition FC0, starting at the same condition in which the
previous online training phase was conducted. For all states and actions, as well as, the reference pitch and roll rate, the
figure depicts the area between the minimum and maximum bounds of all 100 runs. The bounds emphasize that none of

13
the 100 runs experience a failure, as listed in Table 3. It can be observed that the agent flies the commanded height and
bank angle profile, by following the reference pitch and roll rate, as commanded by the PID controllers. In addition, the
agent succeeds to minimize the sideslip angle and conduct a well-coordinated turn, by actuating the rudder.
Although the online pretrained agent already succeeds in flying the aircraft, its performance continues to improve.
From the min-max bound of the time history of the sideslip angle and rudder deflection, a minor oscillatory behavior is
observed in the first two turns. As the agent carries out the first two turns it gains more experience and improves its
performance. Consequently, the oscillatory behavior vanishes and min-max bound narrows for the two subsequent turns.
From the time history of the pitch rate and airspeed, it can be observed that the agent’s pitch rate reference tracking
performance temporarily degrades during quick changes in airspeed. Due to the separation between the airspeed
controller and the longitudinal and lateral controllers, as elaborated in Section III, the learning controllers do not
conceive a notion of the airspeed as a distinct state, but can only experience it as an external, temporary change in the
perceived plant dynamics. Although the agent’s performance temporary degrades, due to quick changes in airspeed, the
agent still adapts to different airspeed regimes and their influence on the plant’s dynamics.
This is demonstrated in Fig. 9, where the agent (pretrained at (Vt as, H) = (90 ms , 2 km)) is utilized to operate at the
flight condition FC3 with (Vt as, H) = (140 ms , 5 km). Despite the different flight condition, which the agent has not
previously experienced, the agent is able to follow the flight profile in all 100 runs without failure, as depicted in by
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

Fig. 9 and Table 3. Similarly, to the previous flight condition, the agent is still constantly adapting and improving upon
new experiences as can be observed for example from the decrease in width and magnitude of the min-max bound of the
time history of the sideslip angle.

Table 3 Description and sample failure rate of different flight conditions each simulated 100 times during
the online operation phase. For all cases the agent is initialized at its final state of the online training phase
conducted at (Vt as, H) = (90 ms , 2 km).

Flight Condition ID H0 m Vt as0 ms Failure Rate

FC0 2000 90 0%
FC1 2000 140 0%
FC2 5000 90 0%
FC3 5000 140 0%

C. Online Adaption
During operation, the controller needs to be able to adapt to uncertainties in the system, such as unexpected failures
or time-varying components [8, 13, 51, 52]. Conventional ACDs that require an initial offline training phase, are unable
to quickly adapt online to these changes [30]. In this section, the adaptability of the proposed framework is validated in
two experiments. In the first step, a failure is introduced to the aileron to demonstrate the framework’s capability to
identify changes in the plant. Subsequently, another experiment is conducted where a disturbance is introduced directly
in the actor’s parameters. In both cases, the agent is initiated at its online pretrained state. The controller is commanded
to fly two right rate-one turns while holding the initial altitude. The disturbances are induced at the start of the second
turn at 110 seconds.

1. Adaptive Control in the Presence of Aileron Failure

To simulate the failure of one aileron, the aileron deflection as commanded by the actor is halved before it is passed
as input to the plant. As a result, the perceived control effectiveness of the ailerons is halved and the controller has to
double its aileron command to fly the turn. In this experiment, the covariance matrix of the lateral incremental model is
reinitialized at I · 108 once a fault is detected. Many methods for fault detection have been proposed in [3, 13, 51]. Here,
a sudden increase of the innovation term, as presented in Eq. (21), is utilized to monitor for changes in the plant. As
illustrated in Fig. 10 the controller is able to detect the change in the plant and complete the turn, despite the failure in
the ailerons. From the estimates of the input matrix of the lateral incremental model, it can be observed that the online
identification perceives the change in the plant’s dynamics. For example the incremental parameter that corresponds to
∂p
the ∂δ a
term, changes from the initial value of 0.24 1s to 0.12 1s .
Consequently, the agent commands a larger aileron deflection, as illustrated in Fig. 10, to complete the right turn. In

14
2 2.5

]
deg
0.0

deg
s

s
[
0

[
q

p
2.5
2.5
[deg]

]
0.0

deg
rs [
3 2.5
0.1
5

[deg]
[deg]

0.0
0
90 20
]

[deg]
m
s

0
[

89
Vtas

20
0.5
2500

[deg]
H [m]

0.0
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

a
2000 0.5
1.5 0.5
[deg]

[deg]
2.0 0.0
2.5
e

r 0.5
0 50 100 150 200 250 0 50 100 150 200 250
t [s] t [s]

Fig. 8 Min-max bounds over 100 runs of the online operation phase, starting at trimmed operation condition
FC0 (Vt as, H) = (90 ms , 2 km) and pretrained agent, with (τ, η A, ηC ) = (0.01, 1, 2). The agent was pretrained
online at (Vt as, H) = (90 ms , 2 km). Reference signals are illustrated by black, dashed lines.

1 2.5
]

]
deg

deg

0.0
s

0
[

[
q

1 2.5
2 2
[deg]

0
deg
s [

1
r

0.5
2
[deg]
[deg]

0.0
0

140 20
]

[deg]
m
s

139 0
[
Vtas

138 20

0.25
5500
[deg]
H [m]

0.00
0.25
a

5000
0.0 0.5
[deg]

[deg]

0.0
0.5
0.5
e

1.0
0 50 100 150 200 250 0 50 100 150 200 250
t [s] t [s]

Fig. 9 Min-max bounds over 100 runs of the online operation phase, starting at trimmed operation condition
FC3 (Vt as, H) = (140 ms , 5 km) and pretrained agent, with (τ, η A, ηC ) = (0.01, 1, 2). The agent was pretrained
online at (Vt as, H) = (90 ms , 2 km). Reference signals are illustrated by black, dashed lines.

15
this case, the control adaption of the agent interacts with the outer loop PID controller. Therefore, in the next experiment,
the adaptability of the agent is demonstrated by injecting a disturbance directly on the parameters of the actor.

1 2.5
]

]
0.0
deg

deg
0
s

s
[

[
q

p
1 2.5

90
2
]

]
m
s

deg
[

s
Vtas

r[
0

2002.5 0.05

[deg]
H [m]

2000.0 0.00
1997.5 0.05

1.5 20
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

[deg]

[deg]
2.0
e

0
0 20 40 60 80 100 120 140 160
0.25 p r
1
]

[deg]
1
s [

a a a
0.00 0
G [ ],

p r
r
a
r r
0.25 1
, [s2]

108 0.0
p r
[deg]
]

r
s2
1

105 a
[
Var [ ],

0.5
0 20 40 60 80 100 120 140 160 0 20 40 60 80 100 120 140 160
t [s] t [s]

Fig. 10 Control adaption and online identification of the lateral incremental model during an aileron failure
induced at 110 seconds. The plant is initiated at a trimmed state at (Vt as, H) = (90 ms , 2 km) and the longitudinal
and lateral controller at their state after the initial online training phase. Reference signals are represented by
black, dashed lines.

2. Adaptive Control in the Presence of Control Disturbance

Similarly to the previous experiment, the agent is commanded to fly two right rate-one turns, while maintaining its
initial altitude. At 110 seconds the actor’s parameters are disturbed by a zero-mean Gaussian noise with a standard
deviation of 0.2. Consequently, the current policy is disturbed and the agent has to adapt to regain control over the
aircraft and complete the flight maneuver.
As can be observed from Fig. 11, subsequent to the disturbance at 110 seconds the actor’s parameters adapt towards
a new near-optimal policy within 10 seconds. This adaption is also observed from the peak in the actor’s loss gradients
as depicted in the lower left box of Fig. 11. Furthermore, the actor’s parameters converge to a different distribution, than
its initial one. This implies that there are more than one near-optimal policies and that a near-optimal policy is not
uniquely defined by one set of the actor’s parameters.
From the time history of the sideslip angle and roll rate, it can be observed that the disturbance is detrimental to the
agent’s performance. More specifically, both the overall magnitude of the sideslip angle as well as roll rate tracking
error increases during the initiation of the second right turn. Nonetheless, the agent still manages to follow the reference
roll angle profile. As the agent exits the second turn at 150 seconds it has already recovered and improved upon its
original performance.

16
1 2.5
]

]
deg

deg
0 0.0
s

s
[

[
q

p
1 2.5
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

90
2
]

]
m
s

deg
[

s
Vtas

r [
0

2002.5 0.25
[deg]
H [m]

2000.0 0.00
1997.5
0.25

1.5 20
[deg]

[deg]

2.0
e

0
0 20 40 60 80 100 120 140 160
0.5 0.5
[deg]

0.0
wA [ ]

0.0
0.5 0.5
a

0.001 0.0
[deg]
[ ]

0.000 0.5
wA
LA

0.001
0 20 40 60 80 100 120 140 160 0 20 40 60 80 100 120 140 160
t [s] t [s]

Fig. 11 Control adaption of the lateral controller during a Gaussian disturbance to the actor’s parameters
at 110 seconds. The plant is initiated at a trimmed state at (Vt as, H) = (90 ms , 2 km) and the longitudinal and
lateral controller at their state after the initial online training phase. Reference signals are represented by black,
dashed lines.

17
V. Conclusion
The design and analysis of a RL adaptive flight controller for a CS-25 class aircraft are presented. The adaptive,
learning controller is successfully implemented for a full-scale, high-fidelity aircraft simulation. It is demonstrated that
the proposed framework is able to learn a near-optimal control policy online without a priori knowledge of the system
dynamics nor an offline training phase. The results reveal that the proposed target critic increases the learning stability,
which is vital for the reliable operation of the aircraft. Through the simulation of representative flight profiles, the
results indicate that the learning controller is able to generalize and operate the aircraft in not previously encountered
flight regimes as well as identify and adapt to unforeseen changes to the aircraft’s dynamics. The framework proposed
in this study mitigates the current limitation of ACDs that require an offline training phase expanding the application
of such algorithms to applications for which no accurate system model is available, nor readily identifiable. This is
especially interesting for future autonomous applications. Further investigation is recommended into the design of
information-based online system identification and the execution of a flight test with the proposed framework on the
PH-LAB research aircraft.

References
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

[1] Woods, D. D., “The risks of autonomy: Doyle’s catch,” Journal of Cognitive Engineering and Decision Making, Vol. 10, No. 2,
2016, pp. 131–133.

[2] Sghairi, M., Bonneval, A., Crouzet, Y., Aubert, J. J., and Brot, P., “Challenges in Building Fault-Tolerant Flight Control System
for a Civil Aircraft,” IAENG International Journal of Computer Science, Vol. 35, No. 4, 2008.

[3] Lombaerts, T., Oort, E. V., Chu, Q. P., Mulder, J. A., and Joosten, D., “Online Aerodynamic Model Structure Selection and
Parameter Estimation for Fault Tolerant Control,” Journal of Guidance, Control, and Dynamics, Vol. 33, No. 3, 2010, pp.
707–723.

[4] Balas, G. J., “Flight Control Law Design: An Industry Perspective,” European Journal of Control, Vol. 9, No. 2-3, 2003, pp.
207–226.

[5] Lane, S. H., and Stengel, R. F., “Flight control design using non-linear inverse dynamics,” Automatica, Vol. 24, No. 4, 1988, pp.
471–483.

[6] da Costa, R. R., Chu, Q. P., and Mulder, J. A., “Reentry Flight Controller Design Using Nonlinear Dynamic Inversion,” Journal
of Spacecraft and Rockets, Vol. 40, No. 1, 2003, pp. 64–71.

[7] Sonneveldt, L., van Oort, E. R., Chu, Q. P., De Visser, C. C., Mulder, J. A., and Breeman, J. H., “Lyapunov-based Fault Tolerant
Flight Control Designs for a Modern Fighter Aircraft Model,” AIAA Guidance, Navigation, and Control Conference and Exhibit,
Chicago, Illinois, 2009.

[8] Farrell, J., Sharma, M., and Polycarpou, M., “Backstepping-Based Flight Control with Adaptive Function Approximation,”
Journal of Guidance, Control, and Dynamics, Vol. 28, No. 6, 2005, pp. 1089–1102.

[9] Sonneveldt, L., Chu, Q. P., and Mulder, J. A., “Nonlinear Flight Control Design Using Constrained Adaptive Backstepping,”
Journal of Guidance, Control, and Dynamics, Vol. 30, No. 2, 2007, pp. 322–336.

[10] Sonneveldt, L., van Oort, E. R., Chu, Q. P., and Mulder, J. A., “Comparison of Inverse Optimal and Tuning Functions Designs
for Adaptive Missile Control,” Journal of Guidance, Control, and Dynamics, Vol. 31, No. 4, 2008, pp. 1176–1182.

[11] Sonneveldt, L., van Oort, E. R., Chu, Q. P., and Mulder, J. A., “Nonlinear Adaptive Trajectory Control Applied to an F-16
Model,” Journal of Guidance, Control, and Dynamics, Vol. 32, No. 1, 2009, pp. 25–39.

[12] van Oort, E. R., Sonneveldt, L., Chu, Q. P., and Mulder, J. A., “Full-Envelope Modular Adaptive Control of a Fighter Aircraft
Using Orthogonal Least Squares,” Journal of Guidance, Control, and Dynamics, Vol. 33, No. 5, 2010, pp. 1461–1472.

[13] Lu, P., van Kampen, E., de Visser, C., and Chu, Q. P., “Aircraft fault-tolerant trajectory control using Incremental Nonlinear
Dynamic Inversion,” Control Engineering Practice, Vol. 57, 2016, pp. 126–141.

[14] Sieberling, S., Chu, Q. P., and Mulder, J. A., “Robust Flight Control Using Incremental Nonlinear Dynamic Inversion and
Angular Acceleration Prediction,” Journal of Guidance, Control, and Dynamics, Vol. 33, No. 6, 2010, pp. 1732–1742.

[15] Simplício, P., Pavel, M. D., van Kampen, E., and Chu, Q. P., “An acceleration measurements-based approach for helicopter
nonlinear flight control using incremental Nonlinear Dynamic Inversion,” Control Engineering Practice, Vol. 21, No. 8, 2013,
pp. 1065–1077.

18
[16] Acquatella, P., Falkena, W., Van Kampen, E., and Chu, Q. P., “Robust Nonlinear Spacecraft Attitude Control using Incremental
Nonlinear Dynamic Inversion,” AIAA Guidance, Navigation, and Control Conference, Minneapolis, Minnesota, 2012.

[17] Smeur, E. J. J., Chu, Q. P., and de Croon, G. C. H. E., “Adaptive Incremental Nonlinear Dynamic Inversion for Attitude Control
of Micro Air Vehicles,” Journal of Guidance, Control, and Dynamics, Vol. 39, No. 3, 2016, pp. 450–461.

[18] Acquatella, P., van Kampen, E., and Chu, Q. P., “Incremental Backstepping for Robust Nonlinear Flight Control,” Proceedings
of the EuroGNC, Delft, The Netherlands, 2013.

[19] Wang, X., van Kampen, E., Chu, Q. P., and Lu, P., “Stability Analysis for Incremental Nonlinear Dynamic Inversion Control,”
AIAA Guidance, Navigation, and Control Conference, Kissimmee, Florida, 2018.

[20] Grondman, F., Looye, G., Kuchar, R. O., Chu, Q. P., and van Kampen, E., “Design and Flight Testing of Incremental Nonlinear
Dynamic Inversion-based Control Laws for a Passenger Aircraft,” AIAA Guidance, Navigation, and Control Conference,
Kissimmee, Florida, 2018.

[21] Schultz, W., Dayan, P., and Montague, P. R., “A Neural Substrate of Prediction and Reward,” Science, Vol. 275, No. 5306,
1997, pp. 1593–1599.
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

[22] Sutton, R. S., and Barto, A. G., Reinforcement learning: An introduction, 2nd ed., A Bradford Book, 2018.

[23] Powell, W. B., Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley & Sons, 2007.

[24] Si, J., Barto, A. G., Powell, W. B., and Wunsch, D., Handbook of Learning and Approximate Dynamic Programming,
Wiley-IEEE Press, 2004.

[25] Bertsekas, D. P., Homer, M. L., Logan, D. A., Patek, S. D., and Sandell, N. R., “Missile defense and interceptor allocation by
neuro-dynamic programming,” IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, Vol. 30,
No. 1, 2000, pp. 42–51.

[26] Ferrari, S., and Stengel, R. F., “Online Adaptive Critic Flight Control,” Journal of Guidance, Control, and Dynamics, Vol. 27,
No. 5, 2004, pp. 777–786.

[27] Enns, R., and Si, J., “Helicopter trimming and tracking control using direct neural dynamic programming,” IEEE Transactions
on Neural Networks, Vol. 14, No. 4, 2003, pp. 929–939.

[28] van Kampen, E., Chu, Q. P., and Mulder, J. A., “Continuous Adaptive Critic Flight Control Aided with Approximated
Plant Dynamics,” AIAA Guidance, Navigation, and Control Conference and Exhibit, American Institute of Aeronautics and
Astronautics, Reston, Virigina, 2006.

[29] Zhou, Y., van Kampen, E., and Chu, Q. P., “Launch Vehicle Adaptive Flight Control with Incremental Model Based Heuristic
Dynamic Programming,” 68th International Astronautical Congress (IAC), Adelaide, Australia, 2017.

[30] Zhou, Y., van Kampen, E., and Chu, Q. P., “Incremental model based online dual heuristic programming for nonlinear adaptive
control,” Control Engineering Practice, Vol. 73, 2018, pp. 13–25.

[31] Zhou, Y., van Kampen, E., and Chu, Q., “Nonlinear adaptive flight control using incremental approximate dynamic programming
and output feedback,” Journal of Guidance, Control, and Dynamics, Vol. 40, No. 2, 2018, pp. 493–496.

[32] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K.,
Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis,
D., “Human-level control through deep reinforcement learning,” Nature, Vol. 518, No. 7540, 2015, pp. 529–533.

[33] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D., “Continuous Control with
Deep Reinforcement Learning,” International Conference on Learning Representations (ICLR), 2016.

[34] Zhou, Y., van Kampen, E., and Chu, Q. P., “Nonlinear Adaptive Flight Control Using Incremental Approximate Dynamic
Programming and Output Feedback,” Journal of Guidance, Control, and Dynamics, Vol. 40, No. 2, 2017, pp. 493–496.

[35] Sadhukhan, D., and Feteih, S., “F8 neurocontroller based on dynamic inversion,” Journal of Guidance, Control, and Dynamics,
Vol. 19, No. 1, 1996, pp. 150–156.

[36] Napolitano, M. R., and Kincheloe, M., “On-line learning neural-network controllers for autopilot systems,” Journal of Guidance,
Control, and Dynamics, Vol. 18, No. 5, 1995, pp. 1008–1015.

19
[37] Kim, B. S., and Calise, A. J., “Nonlinear Flight Control Using Neural Networks,” Journal of Guidance, Control, and Dynamics,
Vol. 20, No. 1, 1997, pp. 26–33.

[38] Calise, A. J., “Neural networks in nonlinear aircraft flight control,” IEEE Aerospace and Electronic Systems Magazine, Vol. 11,
No. 7, 1996, pp. 5–10.

[39] Ha, C. M., “Neural networks approach to AIAA aircraft control design challenge,” Journal of Guidance, Control, and Dynamics,
Vol. 18, No. 4, 1995, pp. 731–739.

[40] Balakrishnan, S. N., and Biega, V., “Adaptive-Critic-Based Neural Networks for Aircraft Optimal Control,” Journal of Guidance,
Control, and Dynamics, Vol. 19, No. 4, 1996, pp. 893–898.

[41] Prokhorov, D. V., and Wunsch, D. C., “Adaptive critic designs,” IEEE Transactions on Neural Networks, Vol. 8, No. 5, 1997,
pp. 997–1007.

[42] Srouji, M., Zhang, J., and Salakhutdinov, R., “Structured Control Nets for Deep Reinforcement Learning,” arXiv preprint
arXiv:1802.08311, 2018.

[43] van den Hoek, M. A., de Visser, C. C., and Pool, D. M., “Identification of a Cessna Citation II Model Based on Flight Test Data,”
Downloaded by TU DELFT on January 22, 2023 | http://arc.aiaa.org | DOI: 10.2514/6.2020-1844

Advances in Aerospace Guidance, Navigation and Control, Springer International Publishing, Cham, 2018, pp. 259–277.

[44] Hägglund, T., “Recursive Estimation of Slowly Time-Varying Parameters,” IFAC Proceedings Volumes, Vol. 18, No. 5, 1985,
pp. 1137–1142.

[45] Kulhavý, R., “Restricted exponential forgetting in real-time identification,” Automatica, Vol. 23, No. 5, 1987, pp. 589–600.

[46] Cao, L., and Schwartz, H. M., “A novel recursive algorithm for directional forgetting,” Proceedings of the 1999 American
Control Conference (Cat. No. 99CH36251), Vol. 2, IEEE, 1999, pp. 1334–1338.

[47] Saelid, S., and Foss, B., “Adaptive controllers with a vector variable forgetting factor,” The 22nd IEEE Conference on Decision
and Control, IEEE, 1983, pp. 1488–1494.

[48] Parkum, J. E., Poulsen, N. K., and Holst, J., “Selective Forgetting in Adaptive Procedures,” IFAC Proceedings Volumes, Vol. 23,
No. 8, 1990, pp. 137–142.

[49] Klein, V., and Morelli, E. A., Aircraft System Identification: Theory and Practice, American Institute of Aeronautics and
Astronautics, 2006.

[50] Mannucci, T., van Kampen, E., de Visser, C., and Chu, Q., “Safe exploration algorithms for reinforcement learning controllers,”
IEEE Transactions on Neural Networks and Learning Systems, Vol. 29, No. 4, 2018, pp. 1069–1081.

[51] Lu, P., Van Eykeren, L., van Kampen, E., de Visser, C., and Chu, Q., “Double-model adaptive fault detection and diagnosis
applied to real flight data,” Control Engineering Practice, Vol. 36, 2015, pp. 39–57.

[52] Looye, G., and Joos, H., “Design of robust dynamic inversion control laws using multi-objective optimization,” AIAA Guidance,
Navigation, and Control Conference and Exhibit, American Institute of Aeronautics and Astronautics, Reston, Virigina, 2001,
p. 4285.

Neurocontrol of An Aircraft: Application To Windshear: Pergamon 08957177 (95) 00101-8
No ratings yet
Neurocontrol of An Aircraft: Application To Windshear: Pergamon 08957177 (95) 00101-8
16 pages
Learning To Fly Thesis
No ratings yet
Learning To Fly Thesis
111 pages
A Deep Reinforcement Learning Control Approach For High-Performance Aircraft
No ratings yet
A Deep Reinforcement Learning Control Approach For High-Performance Aircraft
41 pages
IIST Controls
No ratings yet
IIST Controls
75 pages
A Reinforcement Learning Approach To Spacecraft Trajectory Optimi
No ratings yet
A Reinforcement Learning Approach To Spacecraft Trajectory Optimi
81 pages
Reinforced Learning-Based Robust Control Design For Unmanned Aerial Vehicle
No ratings yet
Reinforced Learning-Based Robust Control Design For Unmanned Aerial Vehicle
16 pages
20-IEEE-NNLS-Event-Triggered Optimal Control With Performance Guarantees Using Adaptive Dynamic Programming
No ratings yet
20-IEEE-NNLS-Event-Triggered Optimal Control With Performance Guarantees Using Adaptive Dynamic Programming
13 pages
Adaptive Control and Intersections With Reinforcement Learning
No ratings yet
Adaptive Control and Intersections With Reinforcement Learning
31 pages
Linear Quadratic Control Using Model-Free Reinforcement Learning
No ratings yet
Linear Quadratic Control Using Model-Free Reinforcement Learning
16 pages
Reinforcement Learning For UAV Attitude Control: William Koch, Renato Mancuso, Richard West, and Azer Bestavros
No ratings yet
Reinforcement Learning For UAV Attitude Control: William Koch, Renato Mancuso, Richard West, and Azer Bestavros
21 pages
Text of Aero
No ratings yet
Text of Aero
9 pages
Incorporating Recurrent Reinforcement Learning Into Model Predictive
No ratings yet
Incorporating Recurrent Reinforcement Learning Into Model Predictive
7 pages
Predictive Sampling: Real-Time Behaviour Synthesis With MuJoCo
No ratings yet
Predictive Sampling: Real-Time Behaviour Synthesis With MuJoCo
14 pages
Neural Network Based Online Simultaneous Policy Update Algorithm For Solving The HJI Equation in Nonlinear H Control
No ratings yet
Neural Network Based Online Simultaneous Policy Update Algorithm For Solving The HJI Equation in Nonlinear H Control
12 pages
Maneuver Decision-Making Through Automatic Curriculum Reinforcement Learning Without Handcrafted Reward Functions
No ratings yet
Maneuver Decision-Making Through Automatic Curriculum Reinforcement Learning Without Handcrafted Reward Functions
15 pages
Adaptive Optimal Output Regulation of Unknown Linear Continuous-Time Systems by Dynamic Output Feedback and Value Iteration
No ratings yet
Adaptive Optimal Output Regulation of Unknown Linear Continuous-Time Systems by Dynamic Output Feedback and Value Iteration
11 pages
Robust Control Theory Workshop Rev2
100% (1)
Robust Control Theory Workshop Rev2
3 pages
Biomimetics 08 00434
No ratings yet
Biomimetics 08 00434
26 pages
Automatica 1354
No ratings yet
Automatica 1354
10 pages
Reinforcement Learning For Robust Missile Autopilot Design
No ratings yet
Reinforcement Learning For Robust Missile Autopilot Design
10 pages
! Boss
No ratings yet
! Boss
17 pages
2024 Ouput Feedback Linear System Based On ADP 1
No ratings yet
2024 Ouput Feedback Linear System Based On ADP 1
10 pages
Reinforcement Learning Control of An Aerial Robot Based On A Tuned Proximal Policy Optimization in Takeoff and Hover Phases
No ratings yet
Reinforcement Learning Control of An Aerial Robot Based On A Tuned Proximal Policy Optimization in Takeoff and Hover Phases
7 pages
Q.15. Derive Expression For Ratio of Tension On Tight Side and Slack Side
100% (1)
Q.15. Derive Expression For Ratio of Tension On Tight Side and Slack Side
8 pages
Adaptive Learning Feedback Linearization
No ratings yet
Adaptive Learning Feedback Linearization
9 pages
Symmetric Actor-Critic Deep Reinforcement Learning For Cascade Quadrotor
No ratings yet
Symmetric Actor-Critic Deep Reinforcement Learning For Cascade Quadrotor
13 pages
1reinforcement Learning-Based Model Predictive Control For Discrete-Time Systems
No ratings yet
1reinforcement Learning-Based Model Predictive Control For Discrete-Time Systems
13 pages
Learning-Based Control of Continuous-Time Systems Using Output Feedback
No ratings yet
Learning-Based Control of Continuous-Time Systems Using Output Feedback
8 pages
Data-Enabled Predictive Control: in The Shallows of The Deepc
No ratings yet
Data-Enabled Predictive Control: in The Shallows of The Deepc
8 pages
Hansen 2022
No ratings yet
Hansen 2022
20 pages
Good
No ratings yet
Good
10 pages
Learning Model Predictive Control For Quadrotors
No ratings yet
Learning Model Predictive Control For Quadrotors
7 pages
Trajectory-Control Using Deep System Identification and Model Predictive Control For Drone Control Under Uncertain Load
No ratings yet
Trajectory-Control Using Deep System Identification and Model Predictive Control For Drone Control Under Uncertain Load
6 pages
Autonomous Decision-Making Generation of UAV Based On Soft Actor-Critic Algorithm-1
No ratings yet
Autonomous Decision-Making Generation of UAV Based On Soft Actor-Critic Algorithm-1
6 pages
Guidance Mechanism For Flexible-Wing Aircraft Using Measurement-Interfaced Machine-Learning Platform
No ratings yet
Guidance Mechanism For Flexible-Wing Aircraft Using Measurement-Interfaced Machine-Learning Platform
12 pages
Algorithms For Reinforcement Learning - Szepesvari
No ratings yet
Algorithms For Reinforcement Learning - Szepesvari
98 pages
Applsci 13 13181
No ratings yet
Applsci 13 13181
21 pages
Shi 2021
No ratings yet
Shi 2021
11 pages
Advances in Adaptive Control Theory Grad
No ratings yet
Advances in Adaptive Control Theory Grad
165 pages
Data-Assimilated Model-Informed Reinforcement Learning
No ratings yet
Data-Assimilated Model-Informed Reinforcement Learning
31 pages
Adaptive Nonlinear Control of Agile Antiair Missiles Using Neural Networks
No ratings yet
Adaptive Nonlinear Control of Agile Antiair Missiles Using Neural Networks
8 pages
Schedule Risk Analysis
No ratings yet
Schedule Risk Analysis
40 pages
Electronics
No ratings yet
Electronics
18 pages
ACODS 2014 GAndrade
No ratings yet
ACODS 2014 GAndrade
7 pages
RLAlgs in MDPs
No ratings yet
RLAlgs in MDPs
98 pages
Strength of Materials University Question Paper
No ratings yet
Strength of Materials University Question Paper
2 pages
从LQR角度看RL和控制
No ratings yet
从LQR角度看RL和控制
28 pages
Using Q-Learning To Automatically Tune Quadcopter PID Controller Online For Fast Altitude Stabilization
No ratings yet
Using Q-Learning To Automatically Tune Quadcopter PID Controller Online For Fast Altitude Stabilization
6 pages
Pan 2022 J. Phys. Conf. Ser. 2203 012058
No ratings yet
Pan 2022 J. Phys. Conf. Ser. 2203 012058
7 pages
Neurocomputing: Xiaofeng Li, Lei Xue, Changyin Sun
No ratings yet
Neurocomputing: Xiaofeng Li, Lei Xue, Changyin Sun
8 pages
A New Neural Architecture For Homing Missile Guidance
No ratings yet
A New Neural Architecture For Homing Missile Guidance
6 pages
Tac 232
No ratings yet
Tac 232
7 pages
Discussion On Backstepping Controller de
No ratings yet
Discussion On Backstepping Controller de
2 pages
Alg RLearning Ejemplo
No ratings yet
Alg RLearning Ejemplo
99 pages
Engineering Maths Mid Sem 1st Year
No ratings yet
Engineering Maths Mid Sem 1st Year
3 pages
A Reinforcement Learning Agent For UAV Control Mathematical Foundations, Implementation, and Human Vs AI Benchmarking
No ratings yet
A Reinforcement Learning Agent For UAV Control Mathematical Foundations, Implementation, and Human Vs AI Benchmarking
13 pages
Algorithms For Reinforced Learning
No ratings yet
Algorithms For Reinforced Learning
98 pages
Model-Based Adaptive Critic Designs: Editor's Summary
No ratings yet
Model-Based Adaptive Critic Designs: Editor's Summary
31 pages
Helicopter Flight Control Design Using A Learning Control Approach1
No ratings yet
Helicopter Flight Control Design Using A Learning Control Approach1
6 pages
BBMbook Cambridge Newstyle
No ratings yet
BBMbook Cambridge Newstyle
373 pages
Variation of Velocity and Acceleration in Suction and Delivery Pipes Due To Acceleration of Piston
100% (1)
Variation of Velocity and Acceleration in Suction and Delivery Pipes Due To Acceleration of Piston
9 pages
CV Module 1
No ratings yet
CV Module 1
166 pages
Lesson 9 5 Multiplication Division of Radical Expressions
100% (1)
Lesson 9 5 Multiplication Division of Radical Expressions
17 pages
PLAXIS - 3D2018 Tutorial Lesson 09 PDF
No ratings yet
PLAXIS - 3D2018 Tutorial Lesson 09 PDF
14 pages
Characterization PDF
No ratings yet
Characterization PDF
80 pages
Adaptive Control: Belen, Carla Cadiz, Nino Celemin, Carlvinn Evangelista. Dale Hong, Jericho SNG Daryl
No ratings yet
Adaptive Control: Belen, Carla Cadiz, Nino Celemin, Carlvinn Evangelista. Dale Hong, Jericho SNG Daryl
7 pages
Class 11 Maths Sets-2021
No ratings yet
Class 11 Maths Sets-2021
12 pages
1LANG algMERGED PDF
No ratings yet
1LANG algMERGED PDF
12 pages
Lineer Cebir Kitap
No ratings yet
Lineer Cebir Kitap
27 pages
Automata Theory and Computability 18Cs54
No ratings yet
Automata Theory and Computability 18Cs54
73 pages
Recommended System
No ratings yet
Recommended System
33 pages
UT 1 Science Class 10
No ratings yet
UT 1 Science Class 10
6 pages
Rynaski ADA353795
No ratings yet
Rynaski ADA353795
89 pages
DesignXplorer 17.0 M01 Introduction
No ratings yet
DesignXplorer 17.0 M01 Introduction
32 pages
Pythagoras Theorem
No ratings yet
Pythagoras Theorem
8 pages
Kolmanovsky Ilic EditorialSpecialIssueControlofAerospaceSystems
No ratings yet
Kolmanovsky Ilic EditorialSpecialIssueControlofAerospaceSystems
3 pages
Matlab Stock Modelling
No ratings yet
Matlab Stock Modelling
16 pages
JKoschorke Advanced Flight Control Design and Evaluation 120611
No ratings yet
JKoschorke Advanced Flight Control Design and Evaluation 120611
246 pages
2016 09 29 Final Thesis VantVeld
No ratings yet
2016 09 29 Final Thesis VantVeld
106 pages
1 s2.0 S2405896319309978 Main
No ratings yet
1 s2.0 S2405896319309978 Main
6 pages
Presentation On Research Paper "The Peak Pool Boiling Heat Flux On Horizontal Cylinders"
No ratings yet
Presentation On Research Paper "The Peak Pool Boiling Heat Flux On Horizontal Cylinders"
19 pages
Mansor 2015
No ratings yet
Mansor 2015
7 pages
Practical AFCS Design 2021-2022
No ratings yet
Practical AFCS Design 2021-2022
31 pages
Design Linear Feedback and LQR Controller For Lateral Flight Dynamics of F-16 Aircraft
No ratings yet
Design Linear Feedback and LQR Controller For Lateral Flight Dynamics of F-16 Aircraft
5 pages
Effect of Axial Gap Between Inlet Nozzle and Impeller On Efficieny
No ratings yet
Effect of Axial Gap Between Inlet Nozzle and Impeller On Efficieny
12 pages
DBInt4 QueryOpt
No ratings yet
DBInt4 QueryOpt
68 pages
1 s2.0 S2405896319309954 Main
No ratings yet
1 s2.0 S2405896319309954 Main
6 pages
1 s2.0 S2405896319312418 Main
No ratings yet
1 s2.0 S2405896319312418 Main
6 pages
1 s2.0 S240589631930998X Main
No ratings yet
1 s2.0 S240589631930998X Main
6 pages
BASTRCSX Learning-Activity-2 With Answers
No ratings yet
BASTRCSX Learning-Activity-2 With Answers
4 pages
Data Analysis Exercises
No ratings yet
Data Analysis Exercises
4 pages
Normal Modes - Rigid Element Analysis With RBE2 and CONM2
No ratings yet
Normal Modes - Rigid Element Analysis With RBE2 and CONM2
22 pages
2 Mesh Analysis
No ratings yet
2 Mesh Analysis
16 pages
Bischoff 1983
No ratings yet
Bischoff 1983
6 pages
Terbilang
No ratings yet
Terbilang
3 pages
Longitudinal Stability and Control Augmentation With Robustness and Handling Qualities Requirements Using The Two Degree of Freedom Controller
No ratings yet
Longitudinal Stability and Control Augmentation With Robustness and Handling Qualities Requirements Using The Two Degree of Freedom Controller
11 pages
Autodesk Constraints
No ratings yet
Autodesk Constraints
16 pages
Formule Pt/Uv
No ratings yet
Formule Pt/Uv
6 pages
Yuxing 2018
No ratings yet
Yuxing 2018
4 pages
Computation of Turbulent Buoyant Ows in Enclosures With Low-Reynolds-Number K-X Models
No ratings yet
Computation of Turbulent Buoyant Ows in Enclosures With Low-Reynolds-Number K-X Models
13 pages
7TH Semester Syllabus
No ratings yet
7TH Semester Syllabus
9 pages

Cessna 2

Uploaded by

Cessna 2

Uploaded by

AIAA SciTech Forum 10.2514/6.

Online Adaptive Incremental Reinforcement Learning Flight

S. Heyer ∗ , D. Kroezen † , E. van Kampen ‡

controller is implemented to control a high-fidelity, six-degree-of-freedom simulation of the

s, s R, a = state, reference state, and action vectors

st+1 = f (st , at ) (1)

at = π(st , stR ) (2)

∂rt+1 ∂r(stR, st+1 ) T

st+1 ≈ f (s0, a0 ) + F(s0, a0 )[st − s0 ] + G(s0, a0 )[at − a0 ] (6)

∆st+1 ≈ Ft−1 ∆st + Gt−1 ∆at (7)

approximation, exhibit [22].

∂ st+1 ∂ f (st , at ) ∂ f (st , at ) ∂ at

∂LtC ∂LtC ∂ λ̂(st , stR, wtC ) ∂ λ̂(st , stR, wtC )

π ∗ (st , stR ) = arg max v(st stR ) (14)

∆ ŝTt+1 = XtT Θ̂t−1

 t = ∆sTt+1 − ∆ ŝTt+1 (21)

III. Controller Design

A. High-Fidelity Simulation Model

∂s i−1 ← F̂i−2 + Ĝi−2

17: wi ← wi−1 + ∆wi

18: wiC ← wi−1 C

Control Surface Saturation Limits

IV. Results and Discussion

B. Online Operation Phase

Flight Condition ID H0 m Vt as0 ms Failure Rate

1. Adaptive Control in the Presence of Aileron Failure

2. Adaptive Control in the Presence of Control Disturbance

You might also like

t = ∆sTt+1 − ∆ ŝTt+1 (21)