UCAV Mission Execution Reinforcement Learning Paper
Department of Industrial Engineering, Yonsei University, Seoul, Korea. Correspondence to: Chang Ouk Kim <[email protected]>.

Abstract

This paper proposes a reinforcement learning (RL) algorithm that enhances exploration by amplifying the imitation effect (AIE). This algorithm consists of the self-imitation learning and random network distillation algorithms. We argue that these two algorithms complement each other and that combining them can amplify the imitation effect for exploration. In addition, by adding an intrinsic penalty reward to the states that the RL agent frequently visits and by using replay memory to learn the feature states when an exploration bonus is used, the proposed approach leads to deep exploration and deviates from the current converged policy. We verified the exploration performance of the algorithm through experiments in a two-dimensional grid environment. In addition, we applied the algorithm to a simulated environment of unmanned combat aerial vehicle (UCAV) mission execution, and the empirical results show that AIE is very effective for finding the UCAV's shortest flight path while avoiding an enemy's missiles.

1. Introduction

Reinforcement learning (RL) aims to learn an optimal policy of the agent for a control problem by maximizing the expected return. RL shows high performance in dense reward environments such as games (Mnih et al., 2013). However, in many real-world problems, rewards are extremely sparse, and in this case it is necessary to explore the environment. The RL literature suggests exploration methods to solve this challenge, such as count-based exploration (Bellemare et al., 2016; Ostrovski et al., 2017), entropy-based exploration (Haarnoja et al., 2017; Ziebart, 2010) and curiosity-based exploration (Silvia, 2012; Pathak et al., 2017; Burda et al., 2018a; Haber et al., 2018). In recent years, many researchers have designed curiosity-driven intrinsic rewards as the prediction error between the predicted next state and the actual next state. The intrinsic reward is very efficient for exploration because the network for predicting the next state drives the agent to behave unexpectedly.

This paper focuses on combining self-imitation learning (SIL) (Oh et al., 2018) and random network distillation (RND) (Burda et al., 2018b). SIL is an algorithm that indirectly leads to deep exploration by exploiting only the good decisions of the past, whereas RND solves the problem of hard exploration by giving an exploration bonus through a deterministic prediction error. The RND bonus is the deterministic prediction error of a neural network predicting features of the observations, and the authors have shown significant performance in some hard exploration Atari games. In hard exploration environments, it is difficult for SIL to find good past decisions to exploit; in other words, SIL requires an intrinsic reward. Meanwhile, in RND, catastrophic forgetting can occur during learning because the predictor network learns about the states that the agent visited recently. Consequently, the prediction error, and hence the exploration bonus, increases for previously visited states. We describe this phenomenon in detail in Section 4.3.

This paper introduces amplifying the imitation effect (AIE) by combining SIL and RND to drive deep exploration. In addition, we introduce techniques that can enhance the strength of the proposed network. Adding an intrinsic penalty reward to the states that the agent continuously visits leads to deviation from the current converged policy. Moreover, to avoid catastrophic forgetting, we use a pool of stored samples to update the predictor network during imitation learning so that the predictor network learns the visited states uniformly. We have experimentally demonstrated that these techniques lead to deep exploration.

We verify our algorithm on unmanned combat aerial vehicle (UCAV) mission execution. Some studies have applied RL to UCAV maneuvers (Liu & Ma, 2017; Zhang et al., 2018; Minglang et al., 2018). However, those studies simply defined the state and action and experimented in a dense reward environment. We constructed the experimental environment by simulating the flight maneuvers of the UCAV in a three-dimensional (3D) space.
The objective of the RL agent is to learn the maneuvers by which the UCAV reaches a target point while avoiding missiles from the enemy air defense network. The main contributions of this paper are as follows:

• We show that SIL and RND are complementary and that combining these two algorithms is very efficient for exploration.

• We present several techniques to amplify the imitation effect.

• The RL approach performs well on the UCAV control problem: the learning method outputs reasonable UCAV maneuvers in the sparse reward environment.

2. Problem Definition

We overlapped the air defense network as in an actual battlefield environment, and the UCAV must learn to reach the target from the starting point within a limited time period while avoiding missiles. For the UCAV dynamics, we applied the following equations of motion of a 3-degrees-of-freedom point mass model (Kim & Kim, 2007):

\[
\begin{aligned}
\dot{x} &= V \cos\gamma \cos\psi \\
\dot{y} &= V \cos\gamma \sin\psi \\
\dot{z} &= V \sin\gamma \\
\dot{V} &= \frac{T - D}{m} - g \sin\gamma \\
\dot{\psi} &= \frac{g\, n \sin\phi}{V \cos\gamma} \\
\dot{\gamma} &= \frac{g}{V}\left(n \cos\phi - \cos\gamma\right)
\end{aligned}
\tag{1}
\]

where (x, y, z) is the position of the UCAV, V is the velocity, ψ is the heading angle, and γ is the flight path angle. T, n and ϕ are the control inputs of the UCAV, denoting the engine thrust, load factor and bank angle, respectively; D is the drag force, m is the mass of the UCAV, and g is the gravitational acceleration. We use these control inputs as the action of our RL framework. Figure 1 shows the UCAV's bank angle, flight path angle, and heading angle. The engine thrust affects the velocity of the UCAV. The bank angle and load factor affect the heading angle and flight path angle.

Figure 1. Bank angle, flight path angle and heading angle of the UCAV.
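To make the dynamics concrete, the sketch below integrates Equation (1) with a simple forward-Euler step. The drag value, mass, and time step are illustrative assumptions (the paper does not specify them); only the equations of motion themselves follow Equation (1).

import math

G = 9.81  # gravitational acceleration [m/s^2]

def ucav_step(state, thrust, load_factor, bank_angle, dt=0.1,
              mass=10000.0, drag=20000.0):
    """One forward-Euler step of the 3-DOF point-mass model in Equation (1).

    state = (x, y, z, V, psi, gamma); thrust in N, angles in rad.
    mass, drag and dt are illustrative values, not taken from the paper.
    """
    x, y, z, V, psi, gamma = state

    x_dot = V * math.cos(gamma) * math.cos(psi)
    y_dot = V * math.cos(gamma) * math.sin(psi)
    z_dot = V * math.sin(gamma)
    V_dot = (thrust - drag) / mass - G * math.sin(gamma)
    psi_dot = G * load_factor * math.sin(bank_angle) / (V * math.cos(gamma))
    gamma_dot = (G / V) * (load_factor * math.cos(bank_angle) - math.cos(gamma))

    return (x + x_dot * dt, y + y_dot * dt, z + z_dot * dt,
            V + V_dot * dt, psi + psi_dot * dt, gamma + gamma_dot * dt)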
For the missile, we applied proportional navigation guidance to chase the UCAV (Moran & Altilar, 2005). We assume that if the distance between the UCAV and the missile is less than 0.5 km, then the UCAV is unable to avoid the missile.

2.1. State

In general, in an environment such as an Atari game, the image of the game is preprocessed and used as the state, and a convolutional neural network is employed as the structure of the network. In this study, however, the UCAV's coordinate information and the UCAV's radar information for detecting missiles are vectorized as the state of the UCAV control problem. A multilayer perceptron is therefore more appropriate for this problem than a convolutional neural network, which is generally adopted for representing an image as the state of an arcade game.

2.1.1. Coordinate Representation

In a coordinate system, the coordinate points do not have a linear relationship. For example, the two-dimensional (2D) coordinate (10, 10) is not ten times more valuable than the coordinate (1, 1). Therefore, placing raw real-valued coordinates into the state is not reasonable and causes learning instability. One way to represent the coordinates in the learning environment is to use a one-hot encoding vector. However, one-hot encoding increases the dimension of the vector as the range of coordinates increases and is only possible for integer coordinates. In this study, we introduce a method to efficiently represent the coordinate system.

The proposed method converts the coordinates into a one-hot encoding vector for each axis and then concatenates the vectors of the axes. The one-hot encoding method requires 40,000 rows (200x200) to represent (1, 1) when x and y range from 1 to 200, but using this method, c(1,1) = [(1, 0, · · · , 0)(1, 0, · · · , 0)]′ is possible with 400 rows (200+200). We additionally extended this method to the real coordinate system. Real coordinates are represented by introducing weights within the vector. For example, 1.3 is 70% close to 1 and 30% close to 2; in other words, the number 1.3 is a number with a weight of 70% on 1 and 30% on 2. Thus, 1.3 can be represented as c(1.3) = (0.7, 0.3, · · · , 0)′ (200 rows). Moreover, the resulting vector can be reduced to a smaller dimension. We reduced this coordinate representation by a factor of 10. Consequently, the number 1.3 can be represented as c(1.3) = (0.13, 0, · · · , 0)′ (20 rows). This method efficiently represents real coordinates within a limited dimension. We call this method the efficient coordinate vector (ECV).
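The following sketch implements the ECV idea described above: each axis value is spread over two adjacent bins in proportion to its distance from them, the per-axis vectors are concatenated, and an optional reduction factor shrinks the dimension. The factor of 10 matches the example in the text; the exact reduction rule is our reading of the c(1.3) = (0.13, 0, ...)′ example and is therefore an assumption.

import numpy as np

def ecv_axis(value, size):
    """Weighted one-hot ECV for a single axis value in [0, size).

    A real value is split between its two neighboring integer bins,
    e.g. 1.3 -> 0.7 at index 1 and 0.3 at index 2, as in the text.
    """
    vec = np.zeros(size)
    lo = int(np.floor(value))
    frac = value - lo
    vec[lo] = 1.0 - frac
    if lo + 1 < size:
        vec[lo + 1] = frac
    return vec

def ecv_axis_reduced(value, size, factor=10):
    """Reduced ECV: one bin per `factor` units (our reading of the
    c(1.3) example; the paper does not spell out the exact rule)."""
    vec = np.zeros(size)
    vec[int(value // factor)] = (value % factor) / factor
    return vec

def ecv_coordinates(coords, size_per_axis, reduced=True):
    """Concatenate per-axis ECVs: a 2D point on a 200x200 grid becomes
    a 2 * 20 = 40 dimensional vector instead of a 40,000-dim one-hot."""
    encode = ecv_axis_reduced if reduced else ecv_axis
    n = size_per_axis // 10 if reduced else size_per_axis
    return np.concatenate([encode(v, n) for v in coords])

# Example: the point (1.3, 57.0) on a 200x200 grid.
state_vec = ecv_coordinates((1.3, 57.0), size_per_axis=200)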
2.1.2. Angle Representation

Representing an angle as a state is also difficult in RL because an angle circulates around 360°. For example, suppose that we change the angle from 10° to 350°. Even if we use a real value or the ECV method, the agent will perceive the result as a 340° change, although the difference is also only 20° in the other direction. That is, this angle representation confuses the RL agent. We solve this problem with the polar coordinate system and the ECV. Using polar coordinates, r and θ can be transformed into the Cartesian coordinates x and y through trigonometric functions, and these coordinates can then be represented as a state through the ECV. In other words, the angle is converted into a position on a circle using the polar coordinate system, and that position is then represented as a state through the ECV. For example, as shown in Figure 2, the point on the circle corresponding to 17° can be represented as c(17°) = (0, · · · , 0.71, 0.29, 0, · · · , 0, 0.302, 0.698)′ (20 rows) through the ECV.

Figure 2. Example of angle representation.
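A minimal sketch of this angle encoding is given below: the angle is mapped to a point on a circle, and each Cartesian component is then encoded with the ECV. The circle radius and the offset that shifts the point into a non-negative range are illustrative assumptions; the paper's c(17°) example fixes only the overall 20-row format, not these constants.

import math
import numpy as np

def ecv_axis(value, size):
    """Weighted one-hot ECV for one axis (same helper as in Section 2.1.1)."""
    vec = np.zeros(size)
    lo = int(np.floor(value))
    frac = value - lo
    vec[lo] = 1.0 - frac
    if lo + 1 < size:
        vec[lo + 1] = frac
    return vec

def encode_angle(angle_deg, bins_per_axis=10, radius=4.5, offset=4.5):
    """Encode an angle as the ECV of a point on a circle.

    radius/offset map cos/sin from [-1, 1] into [0, bins_per_axis - 1];
    both constants are illustrative, not taken from the paper.
    """
    theta = math.radians(angle_deg)
    x = offset + radius * math.cos(theta)
    y = offset + radius * math.sin(theta)
    return np.concatenate([ecv_axis(x, bins_per_axis),
                           ecv_axis(y, bins_per_axis)])

# 10 deg and 350 deg map to nearby points on the circle,
# unlike the raw angle values (10 vs. 350).
near_10 = encode_angle(10.0)
near_350 = encode_angle(350.0)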
2.1.3. Final State

We finally used the following information as the state of the UCAV control problem (a sketch of how these components can be assembled into a single state vector follows the list):

– Flight path consisting of the five most recent steps of the UCAV

– Flight path angle, heading angle and bank angle for the two most recent steps of the UCAV

– Velocity and load factor of the UCAV

– Distance between the UCAV and the missile

– Horizontal and vertical angles between the UCAV and the missile
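The sketch below shows one way to assemble these components into a single state vector using the ECV helpers from the previous sections. The dimensions, value ranges, and the choice to append the scalar components as raw values are illustrative assumptions; the paper lists the components but not their exact encoding sizes.

import numpy as np

def build_state(flight_path, angles, velocity, load_factor,
                missile_distance, missile_angles,
                encode_coord, encode_angle):
    """Concatenate the state components listed above into one vector.

    flight_path: five most recent (x, y, z) positions of the UCAV.
    angles: (flight path angle, heading angle, bank angle) for two recent steps.
    encode_coord / encode_angle: the ECV encoders sketched in Sections 2.1.1
    and 2.1.2. Scalar components are appended as raw values here for brevity,
    which is an assumption, not necessarily the paper's exact choice.
    """
    parts = []
    for position in flight_path:                       # 5 x 3D coordinates
        parts.append(encode_coord(position, size_per_axis=200))
    for step_angles in angles:                         # 2 x 3 angles
        parts.extend(encode_angle(a) for a in step_angles)
    parts.append(np.array([velocity, load_factor, missile_distance]))
    parts.extend(encode_angle(a) for a in missile_angles)  # horizontal, vertical
    return np.concatenate(parts)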
2.2. Action

The action is a combination of the control inputs in Equation (1): engine thrust, bank angle and load factor. Each input has three choices: increase, hold, and decrease. In addition, we added an action that resets all inputs to their default values (bank angle: 0°, load factor: 1 G, and engine thrust: 50 kN). This action allows the UCAV to cruise. The total number of actions is 28.

2.3. Reward

The default reward is zero, except in the following specific situations:

– The result of a missile skirmish

– Whether the UCAV has arrived at its target point

– The cruise condition

A reward is associated with the cruise condition because the UCAV cannot maintain the maximum speed for cruising: we impose a penalty of -0.01 whenever the speed reaches the maximum speed.
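As a concrete reading of Sections 2.2 and 2.3, the sketch below enumerates the discrete action set (3 x 3 x 3 = 27 thrust/bank/load-factor combinations plus one reset action, giving 28) and outlines the reward function. The reward magnitudes for the missile and target events are placeholders; the paper only states that the default reward is zero and that the cruise penalty is -0.01.

from itertools import product

# 27 combinations of (thrust, bank angle, load factor) adjustments + 1 reset = 28.
DELTAS = ("increase", "hold", "decrease")
ACTIONS = [dict(zip(("thrust", "bank", "load_factor"), combo))
           for combo in product(DELTAS, repeat=3)]
ACTIONS.append("reset_to_default")   # bank 0 deg, load factor 1 G, thrust 50 kN
assert len(ACTIONS) == 28

def reward(hit_by_missile, reached_target, at_max_speed,
           missile_penalty=-1.0, target_reward=1.0):
    """Sparse reward of Section 2.3. Only the -0.01 cruise penalty is stated in
    the paper; missile_penalty and target_reward are illustrative placeholders."""
    r = 0.0
    if hit_by_missile:          # distance to the missile fell below 0.5 km
        r += missile_penalty
    if reached_target:
        r += target_reward
    if at_max_speed:            # cruise condition: cannot hold maximum speed
        r += -0.01
    return r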
3. Related Work

Experience replay. Experience replay (Lin, 1992) is a technique for exploiting past experiences, and the Deep Q-Network (DQN) has exhibited human-level performance in Atari games using this technique (Mnih et al., 2013; 2015). Prioritized experience replay (Schaul et al., 2015) is a method for sampling prior experience based on the temporal difference error. ACER (Wang et al., 2016) and Reactor (Gruslys et al., 2017) utilize a replay memory in the actor-critic algorithm (Sutton et al., 2000; Konda & Tsitsiklis, 2000). However, this approach might not be efficient if the past policy is too different from the current policy (Oh et al., 2018). SIL is immune to this disadvantage because it exploits only past experiences that had higher returns than the current value.

Exploration. Exploration has been the main challenging issue for RL, and many studies have proposed methods to enhance exploration. The count-based exploration bonus (Strehl & Littman, 2008) is an intuitive and effective exploration method in which an agent receives a bonus if it visits a novel state, and the bonus decreases if it visits a frequently visited state. Some studies estimate the density of a state to provide a bonus in a large state space (Bellemare et al., 2016; Ostrovski et al., 2017; Fox et al., 2018; Machado et al., 2018). Recent studies have introduced a prediction error (curiosity), which is the difference between the predicted next state and the actual next state, for exploration (Silvia, 2012; Stadie et al., 2015; Pathak et al., 2017; Burda et al., 2018a; Haber et al., 2018). These studies designed the prediction error as an exploration bonus (i_t) to give the agent more reward when it performs unexpected behaviors.

However, the prediction error has a stochastic characteristic because the target function is stochastic. In addition, the architecture of the predictor network is too limited to generalize the state of the environment. To solve these problems, RND (Burda et al., 2018b) proposed that the target network be made deterministic by fixing the network with randomized weights and that the predictor network have the same architecture as the target network. Other methods for efficient exploration include adding parameter noise within the network (Strehl & Littman, 2008; Plappert et al., 2017), maximizing entropy policies (Haarnoja et al., 2017; Ziebart, 2010), adversarial self-play (Sukhbaatar et al., 2017) and learning diverse policies (Eysenbach et al., 2018; Gangwani et al., 2018).

Self-Imitation Learning. SIL can indirectly lead to deep exploration by imitating the good decisions of the past (Oh et al., 2018). To exploit past decisions, the authors used a replay buffer D = {(s_t, a_t, R_t)}, where s_t and a_t are the state and action at step t, and R_t = Σ_{k=t}^{∞} γ^{k−t} r_k is the discounted sum of rewards at step t with a discount factor γ. The authors proposed the following off-policy actor-critic loss:

\[
\mathcal{L}^{sil} = \mathbb{E}_{s,a,R \in D}\left[\mathcal{L}^{sil}_{policy} + \beta^{sil} \mathcal{L}^{sil}_{value}\right] \tag{2}
\]
\[
\mathcal{L}^{sil}_{policy} = -\log \pi_\theta(a|s)\,\big(R - V_\theta(s)\big)_+ \tag{3}
\]
\[
\mathcal{L}^{sil}_{value} = \frac{1}{2}\,\big\| \big(R - V_\theta(s)\big)_+ \big\|^2 \tag{4}
\]

where (·)_+ = max(·, 0), π_θ and V_θ(s) are the policy (i.e., actor) and the value function parameterized by θ, and β^{sil} ∈ R+ is a hyperparameter for the value loss. Intuitively, for the same state, if the past return is greater than the current value (R > V_θ), then the past behavior can be regarded as a good decision, and imitating it is desirable. However, if the past return is less than the current value (R < V_θ), then imitating the behavior is not desirable. The authors combined SIL with the advantage actor-critic (A2C) (Mnih et al., 2016) and showed significant performance in experiments with hard exploration Atari games.
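The sketch below implements the SIL loss of Equations (2)-(4) in PyTorch. The network interface (a policy returning log-probabilities and a value estimate) and the hyperparameter value are assumptions for illustration; only the loss structure follows the equations above.

import torch

def sil_loss(log_probs, values, returns, beta_sil=0.01):
    """Self-imitation loss of Eqs. (2)-(4).

    log_probs: log pi_theta(a|s) for the sampled (s, a) pairs, shape [B].
    values:    V_theta(s), shape [B].
    returns:   discounted returns R stored in the replay buffer D, shape [B].
    beta_sil:  value-loss weight (0.01 is an illustrative choice).
    """
    # (R - V(s))_+ : only transitions whose past return beats the current value
    # contribute, i.e. only good past decisions are imitated.
    advantage = torch.clamp(returns - values, min=0.0)

    policy_loss = -(log_probs * advantage.detach()).mean()   # Eq. (3)
    value_loss = 0.5 * (advantage ** 2).mean()                # Eq. (4)
    return policy_loss + beta_sil * value_loss                # Eq. (2)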
Random Network Distillation. RND (Burda et al., 2018b) proposes a fixed target network f with randomized weights and a predictor network f̂, which is trained on the output of the target network. The predictor network is trained by gradient descent to minimize the expected mean squared error ‖f̂(x; θ) − f(x)‖², and this error is used as the exploration bonus (i_t). Intuitively, the prediction error will be large for a novel state and small for a state that has been frequently visited. However, if the agent converges to a local policy, the prediction error (i_t) may no longer arise. Furthermore, using RND can cause catastrophic forgetting: the predictor network learns about the states that the agent constantly visits, such that the network forgets about previously visited states. Consequently, the prediction error increases for past states, and the agent may return to a past policy.

4. AIE

4.1. Combining SIL and RND

In this section, we explain why combining RND and SIL can amplify the imitation effect and lead to deep exploration. SIL updates only when the past return R is greater than the current value V_θ and imitates past decisions. Intuitively, if we combine SIL and RND, the (R − V_θ) value is larger than with SIL alone because of the exploration bonus. In the process of optimizing the actor-critic network to maximize R_t = Σ_{k=t}^{∞} γ^{k−t} (i_k + e_k), where i is the intrinsic reward and e is the extrinsic reward, the increase in i_t produced by the predictor network causes R to increase. That is, the learning progresses by weighting the good decisions of the past.

Algorithm 1 Amplifying the Imitation Effect (AIE)
  Initialize the A2C network parameters θ_a2c
  Initialize the predictor/target network parameters θ_p, θ_t
  Initialize the replay buffer D ← ∅
  Initialize the episode buffer E ← ∅
  Initialize the feature buffer F ← ∅
  for episode = 1, M do
    for each step do
      Execute an action a_t ∼ π_θ(a_t|s_t) and observe s_t, a_t, r_t, s_{t+1}
      Extract the feature of s_{t+1} into φ_{s_{t+1}}
      Calculate the intrinsic reward i_t
      if i_t < penalty condition threshold then
        i_t ← λ log(i_t)
      end if
      r_t ← r_t + i_t
      Store the transition E ← E ∪ {(s_t, a_t, r_t)}
      F ← F ∪ {(φ_{s_{t+1}}, f_{θ_t}(φ_{s_{t+1}}))}
    end for
    if s_{t+1} is terminal then
      Compute the returns R_t = Σ_k γ^{k−t} r_k for all t in E
      D ← D ∪ {(s_t, a_t, R_t)}
      Clear the episode buffer E ← ∅
    end if
    # Optimize the actor-critic network
    θ_a2c ← θ_a2c − η ∇_{θ_a2c} L_a2c
    # Perform self-imitation learning
    for k = 1, M do
      Sample a minibatch {(s, a, R)} from D
      θ_a2c ← θ_a2c − η ∇_{θ_a2c} L_sil
      Sample a minibatch {(φ_{s_{t+1}}, f_{θ_t}(φ_{s_{t+1}}))} from F
      θ_p ← θ_p − η ∇_{θ_p} L_p
    end for
  end for
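Below is a minimal PyTorch sketch of the intrinsic-reward computation used in Algorithm 1: an RND-style predictor/target pair produces the bonus i_t, and the bonus is converted into a penalty via the λ log(i_t) transform when it falls below a threshold. The network sizes, λ, and the threshold are illustrative assumptions; only the structure (fixed random target, trained predictor, log penalty on small bonuses) follows the algorithm above.

import math
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """RND-style intrinsic reward with the AIE penalty transform."""

    def __init__(self, feature_dim, hidden=128, lam=0.1, threshold=0.01):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.target = mlp()        # fixed, randomly initialized f
        self.predictor = mlp()     # trained f_hat
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.lam = lam             # lambda in i_t <- lambda * log(i_t)
        self.threshold = threshold # penalty condition threshold (assumed value)
        self.optim = torch.optim.Adam(self.predictor.parameters(), lr=1e-4)

    def intrinsic_reward(self, phi_next):
        """Bonus i_t = ||f_hat(phi) - f(phi)||^2, turned into a penalty
        lambda * log(i_t) for states the predictor already fits well."""
        with torch.no_grad():
            err = (self.predictor(phi_next) - self.target(phi_next)).pow(2).mean()
        i_t = err.item()
        if i_t < self.threshold:
            i_t = self.lam * math.log(i_t)   # log of a small error is negative
        return i_t

    def update(self, phi_batch):
        """Predictor update from a minibatch of the feature buffer F (Algorithm 1)."""
        loss = (self.predictor(phi_batch) - self.target(phi_batch)).pow(2).mean()
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()
        return loss.item()

In Algorithm 1, this bonus is simply added to the extrinsic reward before the transition is stored in the episode buffer, and the predictor is updated from the feature buffer during the self-imitation phase.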
[Figure 4 panels: columns AIE1, AIE2 and AIE3; episode intervals 1~5,000, 5,001~10,000 and 10,001~15,000.]

Figure 4. Visualization of the path of the agent and the loss of all coordinate states for each algorithm in the no-reward 2D grid environment. The color changes from blue to red in the agent's path figure to indicate where the agent visits more frequently. The color changes from blue to yellow in the loss figure to indicate where the loss is larger.
[Figure: mission completion versus episode number (10k-60k) for ASIL, AIE1, AIE2 and AIE3, each with its worst run, shown for episodes 1~10,000 and 10,000~20,000.]
References

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Oh, J., Guo, Y., Singh, S., and Lee, H. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.

Zhang, Y., Zu, W., Gao, Y., Chang, H., et al. Research on autonomous maneuvering decision of UCAV based on deep reinforcement learning. 2018.

Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.