
Amplifying the Imitation Effect for Reinforcement Learning of UCAV's Mission Execution

Gyeong Taek Lee 1   Chang Ouk Kim 1

1 Department of Industrial Engineering, Yonsei University, Seoul, Korea. Correspondence to: Chang Ouk Kim <[email protected]>.

arXiv:1901.05856v1 [cs.LG] 17 Jan 2019

Abstract

This paper proposes a new reinforcement learning (RL) algorithm that enhances exploration by amplifying the imitation effect (AIE). The algorithm combines self-imitation learning and random network distillation. We argue that these two algorithms complement each other and that combining them can amplify the imitation effect for exploration. In addition, by adding an intrinsic penalty reward to the states that the RL agent frequently visits and by using a replay memory for learning the feature states used for the exploration bonus, the proposed approach leads to deep exploration and deviates from the current converged policy. We verified the exploration performance of the algorithm through experiments in a two-dimensional grid environment. In addition, we applied the algorithm to a simulated environment of unmanned combat aerial vehicle (UCAV) mission execution, and the empirical results show that AIE is very effective for finding the UCAV's shortest flight path while avoiding an enemy's missiles.

1. Introduction

Reinforcement learning (RL) aims to learn an optimal policy of the agent for a control problem by maximizing the expected return. RL shows high performance in dense reward environments such as games (Mnih et al., 2013). However, in many real-world problems, rewards are extremely sparse, and in this case, it is necessary to explore the environment. The RL literature suggests exploration methods to solve this challenge, such as count-based exploration (Bellemare et al., 2016; Ostrovski et al., 2017), entropy-based exploration (Haarnoja et al., 2017; Ziebart, 2010) and curiosity-based exploration (Silvia, 2012; Pathak et al., 2017; Burda et al., 2018a; Haber et al., 2018). In recent years, many researchers have added an exploration bonus, often called curiosity or intrinsic reward, which is the difference between the predicted next state and the actual next state. The intrinsic reward is very efficient for exploration because the network for predicting the next state drives the agent to behave unexpectedly.

This paper focuses on combining self-imitation learning (SIL) (Oh et al., 2018) and random network distillation (RND) (Burda et al., 2018b). SIL is an algorithm that indirectly leads to deep exploration by exploiting only good decisions of the past, whereas RND addresses hard exploration by giving an exploration bonus through a deterministic prediction error. The RND bonus is the deterministic prediction error of a neural network predicting features of the observations, and the authors have shown significant performance in some hard exploration Atari games. In hard exploration environments, it does not make sense for SIL to exploit a good decision of the past; in other words, SIL requires an intrinsic reward. Meanwhile, in RND, catastrophic forgetting can occur during learning because the predictor network learns only about the states that the agent visited recently. Consequently, the prediction error increases, and the exploration bonus increases for previously visited states. We describe this phenomenon in detail in Section 4.3.

This paper introduces amplifying the imitation effect (AIE) by combining SIL and RND to drive deep exploration. In addition, we introduce techniques that can enhance the strength of the proposed network. Adding an intrinsic penalty reward to the states that the agent continuously visits leads to deviation from the current converged policy. Moreover, to avoid catastrophic forgetting, we use a pool of stored samples to update the predictor network during imitation learning so that the predictor network learns the visited states uniformly. We have experimentally demonstrated that these techniques lead to deep exploration.

We verify our algorithm using unmanned combat aerial vehicle (UCAV) mission execution. Some studies have applied RL to UCAV maneuvers (Liu & Ma, 2017; Zhang et al., 2018; Minglang et al., 2018). However, those studies simply defined the state and action and experimented in a dense reward environment. We constructed the experimental environment by simulating the flight maneuvers of the UCAV in a three-dimensional (3D) space.

The objective of the RL agent is to learn the maneuvers by which the UCAV reaches a target point while avoiding missiles from the enemy air defense network. The main contributions of this paper are as follows:

• We show that SIL and RND are complementary and that combining these two algorithms is very efficient for exploration.

• We present several techniques to amplify the imitation effect.

• The performance of RL applied to the UCAV control problem is excellent. The learning method outputs reasonable UCAV maneuvers in the sparse reward environment.

2. Problem Definition

We overlapped the air defense networks as in an actual battlefield environment, and we aimed to learn maneuvers by which the UCAV reaches the target from the starting point while avoiding missiles within a limited time period. For the UCAV dynamics, we applied the following equations of motion of a 3-degrees-of-freedom point mass model (Kim & Kim, 2007):

ẋ = V cos γ cos ψ
ẏ = V cos γ sin ψ
ż = V sin γ
V̇ = (T − D)/m − g sin γ
ψ̇ = g n sin ϕ / (V cos γ)
γ̇ = (g/V)(n cos ϕ − cos γ)          (1)

where (x, y, z) is the position of the UCAV, V is the velocity, ψ is the heading angle, and γ is the flight path angle. T, n and ϕ are the control inputs of the UCAV and denote the engine thrust, load factor and bank angle, respectively. We use these control inputs as the action of our RL framework. Figure 1 shows the UCAV's bank angle, flight path angle, and heading angle. The engine thrust affects the velocity of the UCAV. The bank angle and load factor affect the heading angle and flight path angle.

Figure 1. Bank angle, flight path angle and heading angle of the UCAV.

For the missile, we applied proportional navigation guidance to chase the UCAV (Moran & Altilar, 2005). We assume that if the distance between the UCAV and the missile is less than 0.5 km, then the UCAV is unable to avoid the missile.

2.1. State

In general, in an environment such as Atari games, the image of the game is preprocessed and used as the state, and a convolutional neural network is employed as the structure of the network. In this study, however, the UCAV's coordinate information and the UCAV's radar information for detecting missiles are vectorized as the state of the UCAV control problem. A multilayer perceptron is more appropriate for this problem than a convolutional neural network, which is generally adopted for representing an image as the state of an arcade game.

2.1.1. Coordinate Representation

In a coordinate system, the coordinate points do not have a linear relationship. For example, the two-dimensional (2D) coordinate (10, 10) is not ten times more valuable than the coordinate (1, 1). Therefore, placing coordinates into a state as real numbers is not reasonable and causes learning instability. One way to represent the coordinates in the learning environment is to use a one-hot encoding vector. However, one-hot encoding increases the dimension of the vector as the range of coordinates increases and is only possible for integer coordinates. In this study, we introduce a method to efficiently represent the coordinate system.

The proposed method converts the coordinates into a one-hot encoding vector for each axis and then concatenates the vectors of the axes. The one-hot encoding method requires 40,000 (200×200) rows to represent (1, 1) when x and y range from 1 to 200, but using this method, c(1,1) = [(1, 0, · · · , 0)(1, 0, · · · , 0)]′ is possible with 400 rows (200+200). We additionally extended this method to the real coordinate system. Real coordinates are represented by introducing weights within the vector. For example, 1.3 is 70% close to 1 and 30% close to 2; in other words, the number 1.3 is a number with a weight of 70% on 1 and 30% on 2. Thus, 1.3 can be represented as c(1.3) = (0.7, 0.3, · · · , 0)′ (200 rows). Moreover, the resulting vector can be reduced to a small dimension. We reduced the coordinates by a factor of 10, so the number 1.3 can be represented as c(1.3) = (0.13, 0, · · · , 0)′ (20 rows). This method efficiently represents real coordinates within a limited dimension. We call this method the efficient coordinate vector (ECV).
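As an illustration of the ECV construction, the following Python sketch encodes a real-valued axis coordinate with the weighted two-bin scheme described above and concatenates the per-axis vectors. It is our minimal reading of the description, not the authors' code: the function names, the use of NumPy, and the 1-indexed bins (as in the c(1.3) example) are assumptions, and the additional reduction to 20 rows is omitted.

import numpy as np

def ecv_axis(value, size):
    # Weighted one-hot encoding of one axis: 1.3 -> 70% weight on bin 1
    # and 30% weight on bin 2 (bins are 1-indexed as in the paper's example).
    vec = np.zeros(size)
    lo = int(np.floor(value))
    frac = value - lo
    vec[lo - 1] = 1.0 - frac
    if frac > 0.0 and lo < size:
        vec[lo] = frac
    return vec

def ecv_point(coords, size_per_axis):
    # Concatenate the per-axis encodings of a coordinate tuple.
    return np.concatenate([ecv_axis(c, size_per_axis) for c in coords])

print(ecv_axis(1.3, 200)[:4])            # [0.7 0.3 0.  0. ], matching c(1.3) in the text
print(ecv_point((1.0, 1.0), 200).shape)  # (400,) instead of a 40,000-row one-hot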

2.1.2. Angle Representation

Representing an angle as a state is also difficult in RL because the angle circulates around 360°. For example, suppose that we change the angle from 10° to 350°. Even if we use a real value or the ECV method, the agent will perceive the result as a 340° change, although the difference (340°) is at the same time a 20° change. That is, this angle representation confuses the RL agent. We solve this problem with the polar coordinate system and the ECV. Using polar coordinates, r and θ can be transformed into Cartesian coordinates x and y with trigonometric functions, and these coordinates can then be represented as a state through the ECV. In other words, the angle is converted into a position on the circle using the polar coordinate system, and this position is represented as a state through the ECV. For example, as shown in Figure 2, the point on the circle corresponding to 17° can be represented as c(17°) = (0, · · · , 0.71, 0.29, 0, · · · , 0, 0.302, 0.698)′ (20 rows) through the ECV.

Figure 2. Example of angle representation.

2.1.3. Final State

We finally used the following information as the state of the UCAV control problem.

– Flight path consisting of the five most recent steps of the UCAV
– Path angle, heading angle and bank angle for the two most recent steps of the UCAV
– Velocity and load factor of the UCAV
– Distance between the UCAV and the missile
– Horizontal and vertical angles between the UCAV and the missile

2.2. Action

The action is an input combination of the engine thrust, bank angle and load factor in Equation 1. Each input has three choices: increase, hold, and decrease. In addition, we added an action that resets all inputs to their default values (bank angle: 0°, load factor: 1G, and engine thrust: 50 kN). This action allows the UCAV to cruise. The total number of actions is 28.

2.3. Reward

The default reward is zero, except in the following specific situations:

– The result of a missile skirmish
– Whether the UCAV has arrived at its target point
– The cruise condition

The cruise condition is rewarded because the UCAV cannot maintain the maximum speed for cruising. We impose a penalty of -0.01 if the speed reaches the maximum speed.

3. Related Work

Experience replay. Experience replay (Lin, 1992) is a technique for exploiting past experiences, and the Deep Q-Network (DQN) has exhibited human-level performance in Atari games using this technique (Mnih et al., 2013; 2015). Prioritized experience replay (Schaul et al., 2015) is a method for sampling prior experience based on the temporal difference error. ACER (Wang et al., 2016) and Reactor (Gruslys et al., 2017) utilize a replay memory in the actor-critic algorithm (Sutton et al., 2000; Konda & Tsitsiklis, 2000). However, this approach might not be efficient if the past policy is too different from the current policy (Oh et al., 2018). SIL is immune to this disadvantage because it exploits only past experiences that had higher returns than the current value.

Exploration. Exploration has been the main challenging issue for RL, and many studies have proposed methods to enhance exploration. The count-based exploration bonus (Strehl & Littman, 2008) is an intuitive and effective exploration method in which an agent receives a bonus if it visits a novel state, and the bonus decreases as the agent visits a state more frequently. Some studies estimate the density of a state to provide a bonus in a large state space (Bellemare et al., 2016; Ostrovski et al., 2017; Fox et al., 2018; Machado et al., 2018). Recent studies have introduced a prediction error (curiosity), which is the difference between the predicted next state and the actual next state, for exploration (Silvia, 2012; Stadie et al., 2015; Pathak et al., 2017; Burda et al., 2018a; Haber et al., 2018). These studies designed the prediction error as an exploration bonus (it) to give the agent more reward when it performs unexpected behaviors.
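To make the two families of bonuses above concrete, here is a small, self-contained Python illustration (ours, not code from the cited works) of a tabular count-based bonus, which shrinks as visits accumulate, and a curiosity-style bonus computed as the squared error of a forward model trained online. The linear model and the coefficient beta = 0.1 are assumptions made for the sake of the example.

import numpy as np
from collections import defaultdict

class CountBonus:
    # Tabular count-based bonus: large for rarely visited states,
    # shrinking as the visit count N(s) grows.
    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def __call__(self, state):
        self.counts[state] += 1
        return self.beta / np.sqrt(self.counts[state])

class PredictionErrorBonus:
    # Curiosity-style bonus: squared error of a forward model that is
    # trained online to predict the next state from the current state.
    def __init__(self, dim, lr=0.01):
        self.W = np.zeros((dim, dim))
        self.lr = lr

    def __call__(self, state, next_state):
        pred = self.W @ state
        err = next_state - pred
        self.W += self.lr * np.outer(err, state)   # one SGD step on the model
        return float(err @ err)                    # bonus = prediction error

count_bonus = CountBonus()
print(count_bonus((0, 0)), count_bonus((0, 0)))    # the second visit earns a smaller bonus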

However, the prediction error has a stochastic characteristic because the target function is stochastic. In addition, the architecture of the predictor network is too limited to generalize the state of the environment. To solve these problems, RND (Burda et al., 2018b) proposed that the target network be made deterministic by fixing the network with randomized weights and that the predictor network have the same architecture as the target network. Other methods for efficient exploration include adding parameter noise within the network (Strehl & Littman, 2008; Plappert et al., 2017), maximizing entropy policies (Haarnoja et al., 2017; Ziebart, 2010), adversarial self-play (Sukhbaatar et al., 2017) and learning diverse policies (Eysenbach et al., 2018; Gangwani et al., 2018).

Self-Imitation Learning. SIL can indirectly lead to deep exploration by imitating the good decisions of the past (Oh et al., 2018). To exploit past decisions, the authors used a replay buffer D = {(st, at, Rt)}, where st and at are the state and action at step t, and Rt = Σ∞k=t γ^(k−t) rk is the discounted sum of rewards at step t with a discount factor γ. The authors proposed the following off-policy actor-critic loss:

Lsil = E_{s,a,R∈D} [ Lsil_policy + β_sil Lsil_value ]          (2)
Lsil_policy = − log πθ(a|s) (R − Vθ(s))+                       (3)
Lsil_value = (1/2) ∥ (R − Vθ(s))+ ∥²                           (4)

where (·)+ = max(·, 0), and πθ and Vθ(s) are the policy (i.e., actor) and the value function parameterized by θ. β_sil ∈ R+ is a hyperparameter for the value loss. Intuitively, for the same state, if the past return is greater than the current value (R > Vθ), then the past behavior was a good decision and imitating it is desirable. However, if the past return is less than the current value (R < Vθ), then imitating the behavior is not desirable. The authors focused on combining SIL with advantage actor-critic (A2C) (Mnih et al., 2016) and showed significant performance in experiments with hard exploration Atari games.

Random Network Distillation. The authors proposed a fixed target network f with randomized weights and a predictor network f̂, which is trained using the output of the target network. The predictor network is trained by gradient descent to minimize the expected mean squared error ∥ f̂(x; θ) − f(x) ∥², and this error is used as the exploration bonus (it). Intuitively, the prediction error will be high for a novel state and low for a state that has been frequently visited. However, if the agent converges to a local policy, the prediction error (it) may no longer occur. Furthermore, using RND can cause catastrophic forgetting: the predictor network learns about the states that the agent constantly visits, such that the network forgets about previously visited states. Consequently, the prediction error increases for past states, and the agent may return to a past policy.

Algorithm 1 Amplifying the Imitation Effect (AIE)

Initialize A2C network parameters θa2c
Initialize predictor/target network parameters θp, θt
Initialize replay buffer D ← ∅
Initialize episode buffer E ← ∅
Initialize feature buffer F ← ∅
for episode = 1, M do
    for each step do
        Execute an action: st, at, rt, st+1 ∼ πθ(at|st)
        Extract the feature of st+1 into ϕst+1
        Calculate the intrinsic reward it
        if it < penalty condition threshold then
            it ← λ log(it)
        end if
        rt ← rt + it
        Store the transition: E ← E ∪ {(st, at, rt)}
        F ← F ∪ {(ϕst+1, fθt(ϕst+1))}
    end for
    if st+1 is terminal then
        Compute returns Rt = Σ∞k=t γ^(k−t) rk for all t in E
        D ← D ∪ {(st, at, Rt)}
        Clear episode buffer E ← ∅
    end if
    # Optimize the actor-critic network
    θa2c ← θa2c − η ∇θa2c La2c
    # Perform self-imitation learning
    for k = 1, M do
        Sample a minibatch {(s, a, R)} from D
        θa2c ← θa2c − η ∇θa2c Lsil
        Sample a minibatch {(ϕst+1, fθt(ϕst+1))} from F
        θp ← θp − η ∇θp Lp
    end for
end for
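Algorithm 1 combines the SIL loss of Equations 2-4 with the RND bonus, the intrinsic penalty detailed in Section 4.2 below, and the feature buffer of Section 4.3 below. The following PyTorch-style sketch is our illustrative rendering of those individual steps, not the authors' released code; the network sizes, the λ, α and β_sil values, and the use of the 33-row coordinate feature (Section 5.1) are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque

feat_dim, hidden = 33, 64   # 33-row coordinate feature (Section 5.1); hidden size assumed

# Fixed, randomly initialized target network and trainable predictor network (RND)
target = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
predictor = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
for p in target.parameters():
    p.requires_grad_(False)

recent_bonuses = deque(maxlen=1000)     # the past N intrinsic rewards (N assumed)

def intrinsic_reward(phi, lam=0.1, alpha=0.1):
    # RND bonus: prediction error of the predictor on the state feature.
    # When the bonus falls below the alpha-quantile of recent bonuses, it is
    # replaced by the intrinsic penalty lam * log(i_t) (Section 4.2).
    with torch.no_grad():
        raw = F.mse_loss(predictor(phi), target(phi)).item()
    penalized = raw
    if recent_bonuses and raw < torch.quantile(torch.tensor(list(recent_bonuses)), alpha):
        penalized = lam * torch.log(torch.tensor(raw)).item()  # log of a small error is negative
    recent_bonuses.append(raw)
    return penalized

def sil_loss(policy_logits, values, actions, returns, beta_sil=0.01):
    # Equations 2-4: imitate only transitions whose return R exceeds the
    # current value estimate V(s); (R - V)+ clips the advantage at zero.
    adv = (returns - values).clamp(min=0).detach()
    logp = F.log_softmax(policy_logits, dim=-1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(logp_a * adv).mean()
    value_loss = 0.5 * (returns - values).clamp(min=0).pow(2).mean()
    return policy_loss + beta_sil * value_loss

def predictor_loss(phi_batch, target_out_batch):
    # Predictor update on a minibatch drawn from the feature buffer F.
    return F.mse_loss(predictor(phi_batch), target_out_batch)

In terms of Algorithm 1, intrinsic_reward corresponds to the "calculate intrinsic reward" and penalty steps, sil_loss is the objective minimized in the self-imitation loop, and predictor_loss is the predictor update drawn from the feature buffer F.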

4. AIE

4.1. Combining SIL and RND

In this section, we explain why combining RND and SIL can amplify the imitation effect and lead to deep exploration. SIL updates only when the past return R is greater than the current value Vθ and imitates past decisions. Intuitively, if we combine SIL and RND, the (R − Vθ) value is larger than with SIL alone because of the exploration bonus. In the process of optimizing the actor-critic network to maximize Rt = Σ∞k=t γ^(k−t)(ik + ek), where ik is the intrinsic reward and ek is the extrinsic reward, the increase in it produced by the predictor network causes R to increase. That is, learning progresses by weighting the good decisions of the past, and this type of learning thoroughly reviews the learning history. If the policy starts to converge as learning progresses, it will be lower for states that were frequently visited. One might think that learning can become slower because (Rt − Vθ) > (Rt+k − Vθ), where k > 0, for the same state as it decreases. However, SIL exploits past good decisions and leads to deep exploration, and by adding an exploration bonus, the agent can further explore novel states. Consequently, the exploration bonus is likely to continue to occur. In addition, when using prioritized experience replay (Schaul et al., 2015), the sampling probability is determined by (R − Vθ); thus, there is a high probability that SIL will exploit a previous transition even if it decreases. In other words, the two algorithms are complementary, and SIL is immune to the phenomenon in which the prediction error (it) no longer occurs.

4.2. Intrinsic Penalty Reward

Adding an exploration bonus to a novel state that the agent visits is clearly an effective exploration method. However, when the policy and predictor networks converge, there is no longer an exploration bonus for the novel state. In other words, the exploration bonus method provides a reward when the agent itself performs an unexpected action, not when the agent is induced to take an unexpected action. Therefore, an exploration method that entices the agent to take unexpected behavior is necessary. We propose a method that provides an intrinsic penalty reward when the agent frequently visits the same state rather than rewarding the agent when it makes an unexpected action. The intrinsic penalty reward allows the agent to escape from the converged local policy and helps it experience diverse policies. Specifically, we provide a penalty by transforming the current intrinsic reward into λ log(it), where λ is a penalty weight parameter, if the current intrinsic reward is less than the α-quantile of the past N intrinsic rewards. This reward mechanism prevents the agent from staying in the same policy. In addition, adding a penalty to the intrinsic reward indirectly amplifies the imitation effect. Since (Rt − Vθ) becomes smaller due to the penalty, the probability of sampling a penalized transition from the replay memory is relatively smaller than that of a non-penalized transition, so SIL updates are more likely to exploit non-penalized transitions. Even if (Rt − Vθ) < 0 due to a penalty, it does not affect SIL, because such transitions are not used for updates under the SIL objective in Equation 4. In other words, the intrinsic penalty reward allows the policy network to deviate from the states that the agent constantly visits and indirectly amplifies the imitation effect for SIL.

4.3. Catastrophic Forgetting in RND

The predictor network in RND mainly learns about the states that the agent recently visited, which is similar to the catastrophic forgetting in continual task learning, where knowledge of previous tasks is forgotten. If the prediction error increases for a state that the agent has visited before, the agent may recognize the previous state as a novel state. Consequently, the agent cannot explore effectively. Our method for mitigating this phenomenon is simple but effective. We store the output of the target network and the state feature as the memory of the predictor network, just as a replay memory is used to reduce the correlation between samples (Mnih et al., 2013), and we train the predictor network in a batch mode. Using the predictor memory reduces the prediction error of states that the agent previously visited, which is why the agent is more likely to explore novel states. Even if the agent returns to a past policy, the prediction error of the states visited by that policy is low, an intrinsic penalty is given to those states, and the probability of escaping from them is high.

Figure 3. Path visualization for each algorithm in the 2D grid environment. The color changes from blue to red where the agent visits more frequently.

5. Experiment

5.1. Conversion of State to Coordinate Feature

An exploration bonus is given for a state feature x through ∥ f̂(x; θ) − f(x) ∥², where f is a fixed target network and f̂ is a predictor network. However, the state of our experimental environment contains various information, such as the path and direction information of the UCAV and the relationship information between the UCAV and the missile. The high-dimensional state space makes the convergence of the policy network slow. Thus, we limited the state used for the exploration bonus to the current coordinates of the UCAV (33 rows). Consequently, the convergence rate of the policy network increased, and the role of the exploration bonus changes clearly from 'inducing the agent to move to a novel feature state' to 'inducing the agent to move to novel coordinates'.
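The predictor memory of Section 4.3 (the feature buffer F of Algorithm 1, which in this paper stores the 33-row coordinate feature described above) can be sketched as follows. This is again an illustrative sketch under assumed names and capacities rather than the authors' implementation; it reuses a predictor network such as the one defined in the earlier sketch and keeps an approximately uniform sample of visited features so that batch updates do not forget old coordinates.

import random
import torch
import torch.nn.functional as F

class PredictorMemory:
    # Feature buffer F of Algorithm 1: stores (feature, target-network output)
    # pairs so the predictor can be refreshed on all visited states rather
    # than only the most recent ones.
    def __init__(self, capacity=100_000):
        self.data = []
        self.capacity = capacity

    def add(self, phi, target_out):
        if len(self.data) >= self.capacity:
            self.data.pop(random.randrange(len(self.data)))  # keep a roughly uniform sample
        self.data.append((phi.detach(), target_out.detach()))

    def sample(self, batch_size):
        batch = random.sample(self.data, min(batch_size, len(self.data)))
        phis, outs = zip(*batch)
        return torch.stack(phis), torch.stack(outs)

def update_predictor(predictor, memory, optimizer, batch_size=128):
    # One batch-mode predictor update, performed inside the self-imitation loop.
    phis, target_outs = memory.sample(batch_size)
    loss = F.mse_loss(predictor(phis), target_outs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Here optimizer would be, for example, torch.optim.Adam(predictor.parameters()); calling update_predictor inside the self-imitation loop corresponds to the "sample a minibatch from F" step of Algorithm 1.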

5.2. Test Algorithms

ASIL denotes the combination of A2C and SIL. We used this model as the baseline method for the performance comparison. In this study, we propose three RL algorithms. Amplifying the imitation effect (AIE1) is the first proposed algorithm, which combines ASIL and RND. The second is the addition of the intrinsic penalty reward to ASIL + RND (AIE2), and the third is AIE2 with the addition of the replay memory for the predictor network (AIE3), as described in Algorithm 1.

5.3. Hard Exploration in 2D Environment

5.3.1. Sparse Reward Setting

We conducted a simple experiment to see how effective the proposed algorithms are for exploration. We constructed a 2D grid world in which the agent learns a sequence of movements that begins from a starting point and reaches a goal point using simple movement steps (up, down, left, and right). The reward was set to zero except when reaching the target point (reward of 30) or leaving the environment (reward of -30). RL was performed for a total of 10,000 episodes for each algorithm. Figure 3 visualizes the movement paths of the agent. Since the reward is too sparse, ASIL failed to reach the target point. In contrast, all of the proposed algorithms successfully reached the target point because of the exploration bonus. For AIE1, the result showed that the agent quickly reached the target point. However, we find that AIE2 and AIE3, which consider the intrinsic penalty reward, performed a deeper exploration than AIE1: the two algorithms arrived at the target point via more diverse paths than AIE1.

5.3.2. No-Reward Setting

We experimented with the same environment but without a target point; the agent performs only exploration in each episode. We argue that catastrophic forgetting is harmful for exploration when using an exploration bonus, because the agent has less chance of searching a novel state if the prediction error remains high for previously searched states. Furthermore, we argue that using the replay memory for the predictor network (AIE3) is more efficient for exploration because the memory mitigates catastrophic forgetting.

Figure 4 visualizes the movement paths of the agent for each 5,000-episode interval (left) and the losses of the predictor network at all coordinates (right). We observed that the loss of the area explored by the agent is lower than that of other areas. As the episodes increase, the agent explores a novel space with a high prediction error. At this point, we can observe that the loss of the area that the agent explored in an interval increased compared to its loss in the preceding interval. However, AIE3 showed that the loss of the previously explored space remained relatively low compared to the other two algorithms.

In the sparse reward environment, ASIL explored a small area, circulating throughout that area as the episodes increased, but the three proposed algorithms explored many areas. Table 1 shows the score of how uniformly each algorithm explored the four quadrants of the 2D grid space during 30,000 episodes. The formula for the score is

score = mean(EQq) × σEQ × 100          (5)

where EQq is the explored portion of the total area of each quadrant. We confirmed that the proposed algorithms (particularly AIE3) were very effective for exploration.

Table 1. Exploration area score of each algorithm in the two-dimensional no-reward grid environment. We averaged the area explored by the agent over 30 repeated experiments.

Algorithm    Exploration area
ASIL         11.2 ± 1.25
AIE1         40.5 ± 2.06
AIE2         43.2 ± 2.36
AIE3         46.7 ± 2.19

5.4. Experiment for UCAV Mission Execution

We performed an experiment to investigate UCAV control in a sparse reward environment and compared the performance of the algorithms. In addition, we analyzed how the UCAV manages to avoid missiles. First, since our experimental environment has a sparse reward structure, DQN, prioritized experience replay DQN, A2C and ACER failed to converge to the desired policy that generates the shortest path from the origin to the target point while avoiding the enemy's missiles. Figure 5 (left) shows the performance of ASIL and the three proposed algorithms over an experiment consisting of 60,000 episodes. The light colors and normal colors represent the worst and average performance of the compared algorithms, respectively. The result is that AIE2 and AIE3 succeeded in converging to the desired policy, while ASIL and AIE1 fell into a local minimum once in two trials and once in three trials, respectively. In particular, AIE3 outperformed the other algorithms, as shown in Figure 5. Similar to the previous exploration experiment, we confirmed that the performance of the three proposed algorithms was better than that of ASIL (the baseline model) in the UCAV control environment.

Figure 6 presents snapshots of learning (an animation is available here¹). In early episodes of learning, the UCAV took random actions and occasionally left the battlefield. As the episodes increased, it tended to move forward gradually but was shot down by a missile. This result can be confirmed by the cumulative shot probability plot (Figure 5 (right)).

¹ https://youtu.be/7R5lZAsCs2c

[Figure 4: agent paths and predictor-network loss maps for AIE1, AIE2 and AIE3 over the episode intervals 1~5,000, 5,001~10,000 and 10,001~15,000.]

Figure 4. Visualization of the path of the agent and the loss of all coordinate states for each algorithm in the no-reward 2D grid environment. The color changes from blue to red in the agent's path figure to indicate where the agent visits more frequently. The color changes from blue to yellow in the loss figure to indicate where the loss is larger.

Figure 5. (Left) Learning curves in the UCAV mission execution environment. The x and y axes represent the episode number and the average reward, respectively. The plot is the average reward over the results of 10 experiments for each algorithm. The light color represents the worst performance result of each algorithm. (Right) Cumulative probability of being shot down by a missile.

[Figure 6: 3D snapshots of the UCAV's trajectories for episodes 1~10,000, 10,000~20,000 and 20,000 onward.]

Figure 6. 3D view of the UCAV's learning process. The red circle represents the air defense network, the black solid line represents the movement path of the UCAV, and the red dotted line represents the missile's movement path.

As the episodes continue, the UCAV learned how to avoid missiles and began to move to new coordinates (attempting to increase the intrinsic reward). The UCAV attempted to reach the target point through various paths.

Figure 7 is a 3D representation of the path through which the UCAV reached the target while avoiding the missiles. When the UCAV entered the center of an air defense network, the probability of being shot down by a missile increased. Therefore, the UCAV learned a safe path that passed through the overlapped areas of the air defense networks at a low altitude.

Figure 7. The UCAV's path after learning, shown in a 3D view. The UCAV passes through the overlapping air defense networks, avoiding the missiles and reaching the target point.

6. Conclusion

In this paper, we proposed AIE by combining SIL and RND. In addition, we proposed AIE2 and AIE3, which can lead to efficient deep exploration. AIE2 gives an intrinsic penalty reward to states that the agent frequently visits, which prevents the agent from falling into a locally optimal policy. AIE3 adopts a replay memory to mitigate the catastrophic forgetting of the predictor network. These two algorithms amplify the imitation effect, leading to deep exploration and thereby enabling the policy network to quickly converge to the desired policy. We experimentally demonstrated that the AIEs successfully explored wide areas of the 2D grid environment. In addition, for the UCAV control problem, we observed that the proposed algorithms quickly converged to the desired policy. In future work, it is necessary to discuss the configuration of the replay memory, because the replay memory for the predictor network has limited storage and it is inefficient to insert a feature at every learning step.

Acknowledgments

This research was supported by the Agency for Defense Development (UD170043JD).

References

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471-1479, 2016.

Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018a.

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018b.

Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

Fox, L., Choshen, L., and Loewenstein, Y. DORA the explorer: Directed outreaching reinforcement action-selection. 2018.

Gangwani, T., Liu, Q., and Peng, J. Learning self-imitating diverse policies. arXiv preprint arXiv:1805.10309, 2018.

Gruslys, A., Azar, M. G., Bellemare, M. G., and Munos, R. The Reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

Haber, N., Mrowca, D., Fei-Fei, L., and Yamins, D. L. Learning to play with intrinsically-motivated self-aware agents. arXiv preprint arXiv:1802.07442, 2018.

Kim, S. and Kim, Y. Three dimensional optimum controller for multiple UAV formation flight using behavior-based decentralized approach. In Control, Automation and Systems, 2007. ICCAS'07. International Conference on, pp. 1387-1392. IEEE, 2007.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008-1014, 2000.

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293-321, 1992.

Liu, P. and Ma, Y. A deep reinforcement learning based intelligent decision method for UCAV air combat. In Asian Simulation Conference, pp. 274-286. Springer, 2017.

Machado, M. C., Bellemare, M. G., and Bowling, M. Count-based exploration with the successor representation. arXiv preprint arXiv:1807.11622, 2018.

Minglang, C., Haiwen, D., Zhenglei, W., and QingPeng, S. Maneuvering decision in short range air combat for unmanned combat aerial vehicles. In 2018 Chinese Control And Decision Conference (CCDC), pp. 1783-1788. IEEE, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

Moran, I. and Altilar, T. Three plane approach for 3D true proportional navigation. In AIAA Guidance, Navigation, and Control Conference and Exhibit, pp. 6457, 2005.

Oh, J., Guo, Y., Singh, S., and Lee, H. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.

Ostrovski, G., Bellemare, M. G., Oord, A. v. d., and Munos, R. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.

Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Silvia, P. J. Curiosity and motivation. The Oxford Handbook of Human Motivation, pp. 157-166, 2012.

Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

Strehl, A. L. and Littman, M. L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309-1331, 2008.

Sukhbaatar, S., Lin, Z., Kostrikov, I., Synnaeve, G., Szlam, A., and Fergus, R. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057-1063, 2000.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.

Zhang, Y., Zu, W., Gao, Y., Chang, H., et al. Research on autonomous maneuvering decision of UCAV based on deep reinforcement learning. 2018.

Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
