UCAV Mission Execution Reinforcement Learning Paper
Department of Industrial Engineering, Yonsei University, Seoul, Korea. Correspondence to: Chang Ouk Kim <[email protected]>.

Abstract

This paper proposes a reinforcement learning (RL) algorithm that enhances exploration by amplifying the imitation effect (AIE). This algorithm consists of the self-imitation learning and random network distillation algorithms. We argue that these two algorithms complement each other and that combining them can amplify the imitation effect for exploration. In addition, by adding an intrinsic penalty reward to the states that the RL agent frequently visits and by using replay memory to learn the feature states when an exploration bonus is used, the proposed approach leads to deep exploration and deviates from the current converged policy. We verified the exploration performance of the algorithm through experiments in a two-dimensional grid environment. In addition, we applied the algorithm to a simulated environment of unmanned combat aerial vehicle (UCAV) mission execution, and the empirical results show that AIE is very effective for finding the UCAV's shortest flight path while avoiding an enemy's missiles.

1. Introduction

Reinforcement learning (RL) aims to learn an optimal policy of the agent for a control problem by maximizing the expected return. RL shows high performance in dense reward environments such as games (Mnih et al., 2013). However, in many real-world problems, rewards are extremely sparse, and in this case it is necessary to explore the environment. The RL literature suggests exploration methods to solve this challenge, such as count-based exploration (Bellemare et al., 2016; Ostrovski et al., 2017), entropy-based exploration (Haarnoja et al., 2017; Ziebart, 2010) and curiosity-based exploration (Silvia, 2012; Pathak et al., 2017; Burda et al., 2018a; Haber et al., 2018). In recent years, many researchers have designed curiosity-driven intrinsic rewards as the prediction error between the predicted next state and the actual next state. The intrinsic reward is very efficient for exploration because the network for predicting the next state drives the agent to behave unexpectedly.

This paper focuses on combining self-imitation learning (SIL) (Oh et al., 2018) and random network distillation (RND) (Burda et al., 2018b). SIL is an algorithm that indirectly leads to deep exploration by exploiting only the good decisions of the past, whereas RND solves the problem of hard exploration by giving an exploration bonus through a deterministic prediction error. The RND bonus is the deterministic prediction error of a neural network predicting features of the observations, and the authors have shown significant performance in some hard exploration Atari games. In hard exploration environments, it is difficult for SIL to find good past decisions to exploit; in other words, SIL requires an intrinsic reward. Meanwhile, in RND, catastrophic forgetting can occur during learning because the predictor network learns about the states that the agent visited recently. Consequently, the prediction error, and hence the exploration bonus, increases for previously visited states. We describe this phenomenon in detail in Section 4.3.

This paper introduces amplifying the imitation effect (AIE) by combining SIL and RND to drive deep exploration. In addition, we introduce techniques that can enhance the strength of the proposed network. Adding an intrinsic penalty reward to the states that the agent continuously visits leads to deviation from the current converged policy. Moreover, to avoid catastrophic forgetting, we use a pool of stored samples to update the predictor network during imitation learning so that the predictor network learns the visited states uniformly. We have experimentally demonstrated that these techniques lead to deep exploration.

We verify our algorithm on unmanned combat aerial vehicle (UCAV) mission execution. Some studies have applied RL to UCAV maneuvers (Liu & Ma, 2017; Zhang et al., 2018; Minglang et al., 2018). However, those studies simply defined the state and action and experimented in a dense reward environment. We constructed the experimental environment by simulating the flight maneuvers of the UCAV in a three-dimensional (3D) space.
The objective of the RL agent is to learn the maneuvers by which the UCAV reaches a target point while avoiding missiles from the enemy air defense network. The main contributions of this paper are as follows:

• We show that SIL and RND are complementary and that combining these two algorithms is very efficient for exploration.

• We present several techniques to amplify the imitation effect.

• The RL approach performs well on the UCAV control problem: the learning method outputs reasonable UCAV maneuvers in the sparse reward environment.

2. Problem Definition

We overlapped the air defense network as in an actual battlefield environment, and the UCAV must learn to reach the target from the starting point within a limited time period while avoiding missiles. For the UCAV dynamics, we applied the following equations of motion of a 3-degrees-of-freedom point mass model (Kim & Kim, 2007):

\[
\begin{aligned}
\dot{x} &= V \cos\gamma \cos\psi \\
\dot{y} &= V \cos\gamma \sin\psi \\
\dot{z} &= V \sin\gamma \\
\dot{V} &= \frac{T - D}{m} - g \sin\gamma \\
\dot{\psi} &= \frac{g\, n \sin\phi}{V \cos\gamma} \\
\dot{\gamma} &= \frac{g}{V}\left(n \cos\phi - \cos\gamma\right)
\end{aligned}
\tag{1}
\]

where (x, y, z) is the position of the UCAV, V is the velocity, ψ is the heading angle, and γ is the flight path angle. T, n and ϕ are the control inputs of the UCAV, denoting the engine thrust, load factor and bank angle, respectively; D is the drag force, m is the mass of the UCAV, and g is the gravitational acceleration. We use these control inputs as the action of our RL framework. Figure 1 shows the UCAV's bank angle, flight path angle, and heading angle. The engine thrust affects the velocity of the UCAV. The bank angle and load factor affect the heading angle and flight path angle.

Figure 1. Bank angle, flight path angle and heading angle of the UCAV.
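To make the dynamics concrete, the sketch below integrates Equation (1) with a simple forward-Euler step. The drag value, mass, and time step are illustrative assumptions (the paper does not specify them); only the equations of motion themselves follow Equation (1).

import math

G = 9.81  # gravitational acceleration [m/s^2]

def ucav_step(state, thrust, load_factor, bank_angle, dt=0.1,
              mass=10000.0, drag=20000.0):
    """One forward-Euler step of the 3-DOF point-mass model in Equation (1).

    state = (x, y, z, V, psi, gamma); thrust in N, angles in rad.
    mass, drag and dt are illustrative values, not taken from the paper.
    """
    x, y, z, V, psi, gamma = state

    x_dot = V * math.cos(gamma) * math.cos(psi)
    y_dot = V * math.cos(gamma) * math.sin(psi)
    z_dot = V * math.sin(gamma)
    V_dot = (thrust - drag) / mass - G * math.sin(gamma)
    psi_dot = G * load_factor * math.sin(bank_angle) / (V * math.cos(gamma))
    gamma_dot = (G / V) * (load_factor * math.cos(bank_angle) - math.cos(gamma))

    return (x + x_dot * dt, y + y_dot * dt, z + z_dot * dt,
            V + V_dot * dt, psi + psi_dot * dt, gamma + gamma_dot * dt)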
For the missile, we applied proportional navigation guidance to chase the UCAV (Moran & Altilar, 2005). We assume that if the distance between the UCAV and the missile is less than 0.5 km, then the UCAV is unable to avoid the missile.

2.1. State

In general, in an environment such as an Atari game, the image of the game is preprocessed and used as the state, and a convolutional neural network is employed as the structure of the network. In this study, however, the UCAV's coordinate information and the UCAV's radar information for detecting missiles are vectorized as the state of the UCAV control problem. A multilayer perceptron is therefore more appropriate for this problem than a convolutional neural network, which is generally adopted for representing an image as the state of an arcade game.

2.1.1. Coordinate Representation

In a coordinate system, the coordinate points do not have a linear relationship. For example, the two-dimensional (2D) coordinate (10, 10) is not ten times more valuable than the coordinate (1, 1). Therefore, placing raw real-valued coordinates into the state is not reasonable and causes learning instability. One way to represent the coordinates in the learning environment is to use a one-hot encoding vector. However, one-hot encoding increases the dimension of the vector as the range of coordinates increases and is only possible for integer coordinates. In this study, we introduce a method to efficiently represent the coordinate system.

The proposed method converts the coordinates into a one-hot encoding vector for each axis and then concatenates the vectors of the axes. The one-hot encoding method requires 40,000 rows (200x200) to represent (1, 1) when x and y range from 1 to 200, but using this method, c(1,1) = [(1, 0, · · · , 0)(1, 0, · · · , 0)]′ is possible with 400 rows (200+200). We additionally extended this method to the real coordinate system. Real coordinates are represented by introducing weights within the vector. For example, 1.3 is 70% close to 1 and 30% close to 2; in other words, the number 1.3 is a number with a weight of 70% on 1 and 30% on 2. Thus, 1.3 can be represented as c(1.3) = (0.7, 0.3, · · · , 0)′ (200 rows). Moreover, the resulting vector can be reduced to a smaller dimension. We reduced this coordinate representation by a factor of 10. Consequently, the number 1.3 can be represented as c(1.3) = (0.13, 0, · · · , 0)′ (20 rows). This method efficiently represents real coordinates within a limited dimension. We call this method the efficient coordinate vector (ECV).
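The following sketch implements the ECV idea described above: each axis value is spread over two adjacent bins in proportion to its distance from them, the per-axis vectors are concatenated, and an optional reduction factor shrinks the dimension. The factor of 10 matches the example in the text; the exact reduction rule is our reading of the c(1.3) = (0.13, 0, ...)′ example and is therefore an assumption.

import numpy as np

def ecv_axis(value, size):
    """Weighted one-hot ECV for a single axis value in [0, size).

    A real value is split between its two neighboring integer bins,
    e.g. 1.3 -> 0.7 at index 1 and 0.3 at index 2, as in the text.
    """
    vec = np.zeros(size)
    lo = int(np.floor(value))
    frac = value - lo
    vec[lo] = 1.0 - frac
    if lo + 1 < size:
        vec[lo + 1] = frac
    return vec

def ecv_axis_reduced(value, size, factor=10):
    """Reduced ECV: one bin per `factor` units (our reading of the
    c(1.3) example; the paper does not spell out the exact rule)."""
    vec = np.zeros(size)
    vec[int(value // factor)] = (value % factor) / factor
    return vec

def ecv_coordinates(coords, size_per_axis, reduced=True):
    """Concatenate per-axis ECVs: a 2D point on a 200x200 grid becomes
    a 2 * 20 = 40 dimensional vector instead of a 40,000-dim one-hot."""
    encode = ecv_axis_reduced if reduced else ecv_axis
    n = size_per_axis // 10 if reduced else size_per_axis
    return np.concatenate([encode(v, n) for v in coords])

# Example: the point (1.3, 57.0) on a 200x200 grid.
state_vec = ecv_coordinates((1.3, 57.0), size_per_axis=200)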
2.1.2. Angle Representation

Representing an angle as a state is also difficult in RL because an angle circulates around 360°. For example, suppose that we change the angle from 10° to 350°. Even if we use a real value or the ECV method, the agent will perceive the result as a 340° change, although the difference is also only 20° in the other direction. That is, this angle representation confuses the RL agent. We solve this problem with the polar coordinate system and the ECV. Using polar coordinates, r and θ can be transformed into the Cartesian coordinates x and y through trigonometric functions, and these coordinates can then be represented as a state through the ECV. In other words, the angle is converted into a position on a circle using the polar coordinate system, and that position is then represented as a state through the ECV. For example, as shown in Figure 2, the point on the circle corresponding to 17° can be represented as c(17°) = (0, · · · , 0.71, 0.29, 0, · · · , 0, 0.302, 0.698)′ (20 rows) through the ECV.

Figure 2. Example of angle representation.
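A minimal sketch of this angle encoding is given below: the angle is mapped to a point on a circle, and each Cartesian component is then encoded with the ECV. The circle radius and the offset that shifts the point into a non-negative range are illustrative assumptions; the paper's c(17°) example fixes only the overall 20-row format, not these constants.

import math
import numpy as np

def ecv_axis(value, size):
    """Weighted one-hot ECV for one axis (same helper as in Section 2.1.1)."""
    vec = np.zeros(size)
    lo = int(np.floor(value))
    frac = value - lo
    vec[lo] = 1.0 - frac
    if lo + 1 < size:
        vec[lo + 1] = frac
    return vec

def encode_angle(angle_deg, bins_per_axis=10, radius=4.5, offset=4.5):
    """Encode an angle as the ECV of a point on a circle.

    radius/offset map cos/sin from [-1, 1] into [0, bins_per_axis - 1];
    both constants are illustrative, not taken from the paper.
    """
    theta = math.radians(angle_deg)
    x = offset + radius * math.cos(theta)
    y = offset + radius * math.sin(theta)
    return np.concatenate([ecv_axis(x, bins_per_axis),
                           ecv_axis(y, bins_per_axis)])

# 10 deg and 350 deg map to nearby points on the circle,
# unlike the raw angle values (10 vs. 350).
near_10 = encode_angle(10.0)
near_350 = encode_angle(350.0)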
2.1.3. Final State

We finally used the following information as the state of the UCAV control problem (a sketch of how these components can be assembled into a single state vector follows the list):

– Flight path consisting of the five most recent steps of the UCAV

– Flight path angle, heading angle and bank angle for the two most recent steps of the UCAV

– Velocity and load factor of the UCAV

– Distance between the UCAV and the missile

– Horizontal and vertical angles between the UCAV and the missile
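The sketch below shows one way to assemble these components into a single state vector using the ECV helpers from the previous sections. The dimensions, value ranges, and the choice to append the scalar components as raw values are illustrative assumptions; the paper lists the components but not their exact encoding sizes.

import numpy as np

def build_state(flight_path, angles, velocity, load_factor,
                missile_distance, missile_angles,
                encode_coord, encode_angle):
    """Concatenate the state components listed above into one vector.

    flight_path: five most recent (x, y, z) positions of the UCAV.
    angles: (flight path angle, heading angle, bank angle) for two recent steps.
    encode_coord / encode_angle: the ECV encoders sketched in Sections 2.1.1
    and 2.1.2. Scalar components are appended as raw values here for brevity,
    which is an assumption, not necessarily the paper's exact choice.
    """
    parts = []
    for position in flight_path:                       # 5 x 3D coordinates
        parts.append(encode_coord(position, size_per_axis=200))
    for step_angles in angles:                         # 2 x 3 angles
        parts.extend(encode_angle(a) for a in step_angles)
    parts.append(np.array([velocity, load_factor, missile_distance]))
    parts.extend(encode_angle(a) for a in missile_angles)  # horizontal, vertical
    return np.concatenate(parts)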
2.2. Action

The action is a combination of the control inputs in Equation (1): engine thrust, bank angle and load factor. Each input has three choices: increase, hold, and decrease. In addition, we added an action that resets all inputs to their default values (bank angle: 0°, load factor: 1 G, and engine thrust: 50 kN). This action allows the UCAV to cruise. The total number of actions is 28.

2.3. Reward

The default reward is zero, except in the following specific situations:

– The result of a missile skirmish

– Whether the UCAV has arrived at its target point

– The cruise condition

A reward is associated with the cruise condition because the UCAV cannot maintain the maximum speed for cruising: we impose a penalty of -0.01 whenever the speed reaches the maximum speed.
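As a concrete reading of Sections 2.2 and 2.3, the sketch below enumerates the discrete action set (3 x 3 x 3 = 27 thrust/bank/load-factor combinations plus one reset action, giving 28) and outlines the reward function. The reward magnitudes for the missile and target events are placeholders; the paper only states that the default reward is zero and that the cruise penalty is -0.01.

from itertools import product

# 27 combinations of (thrust, bank angle, load factor) adjustments + 1 reset = 28.
DELTAS = ("increase", "hold", "decrease")
ACTIONS = [dict(zip(("thrust", "bank", "load_factor"), combo))
           for combo in product(DELTAS, repeat=3)]
ACTIONS.append("reset_to_default")   # bank 0 deg, load factor 1 G, thrust 50 kN
assert len(ACTIONS) == 28

def reward(hit_by_missile, reached_target, at_max_speed,
           missile_penalty=-1.0, target_reward=1.0):
    """Sparse reward of Section 2.3. Only the -0.01 cruise penalty is stated in
    the paper; missile_penalty and target_reward are illustrative placeholders."""
    r = 0.0
    if hit_by_missile:          # distance to the missile fell below 0.5 km
        r += missile_penalty
    if reached_target:
        r += target_reward
    if at_max_speed:            # cruise condition: cannot hold maximum speed
        r += -0.01
    return r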
3. Related Work

Experience replay. Experience replay (Lin, 1992) is a technique for exploiting past experiences, and the Deep Q-Network (DQN) has exhibited human-level performance in Atari games using this technique (Mnih et al., 2013; 2015). Prioritized experience replay (Schaul et al., 2015) is a method for sampling prior experience based on the temporal difference error. ACER (Wang et al., 2016) and Reactor (Gruslys et al., 2017) utilize a replay memory in the actor-critic algorithm (Sutton et al., 2000; Konda & Tsitsiklis, 2000). However, this approach might not be efficient if the past policy is too different from the current policy (Oh et al., 2018). SIL is immune to this disadvantage because it exploits only past experiences that had higher returns than the current value.

Exploration. Exploration has been the main challenging issue for RL, and many studies have proposed methods to enhance exploration. The count-based exploration bonus (Strehl & Littman, 2008) is an intuitive and effective exploration method in which an agent receives a bonus if it visits a novel state, and the bonus decreases if it visits a frequently visited state. Some studies estimate the density of a state to provide a bonus in a large state space (Bellemare et al., 2016; Ostrovski et al., 2017; Fox et al., 2018; Machado et al., 2018). Recent studies have introduced a prediction error (curiosity), which is the difference between the predicted next state and the actual next state, for exploration (Silvia, 2012; Stadie et al., 2015; Pathak et al., 2017; Burda et al., 2018a; Haber et al., 2018). These studies designed the prediction error as an exploration bonus (i_t) to give the agent more reward when it performs unexpected behaviors.

However, the prediction error has a stochastic characteristic because the target function is stochastic. In addition, the architecture of the predictor network is too limited to generalize the state of the environment. To solve these problems, RND (Burda et al., 2018b) proposed that the target network be made deterministic by fixing the network with randomized weights and that the predictor network have the same architecture as the target network. Other methods for efficient exploration include adding parameter noise within the network (Strehl & Littman, 2008; Plappert et al., 2017), maximizing entropy policies (Haarnoja et al., 2017; Ziebart, 2010), adversarial self-play (Sukhbaatar et al., 2017) and learning diverse policies (Eysenbach et al., 2018; Gangwani et al., 2018).

Self-Imitation Learning. SIL can indirectly lead to deep exploration by imitating the good decisions of the past (Oh et al., 2018). To exploit past decisions, the authors used a replay buffer D = {(s_t, a_t, R_t)}, where s_t and a_t are the state and action at step t, and R_t = Σ_{k=t}^{∞} γ^{k−t} r_k is the discounted sum of rewards at step t with a discount factor γ. The authors proposed the following off-policy actor-critic loss:

\[
\mathcal{L}^{sil} = \mathbb{E}_{s,a,R \in D}\left[\mathcal{L}^{sil}_{policy} + \beta^{sil} \mathcal{L}^{sil}_{value}\right] \tag{2}
\]
\[
\mathcal{L}^{sil}_{policy} = -\log \pi_\theta(a|s)\,\big(R - V_\theta(s)\big)_+ \tag{3}
\]
\[
\mathcal{L}^{sil}_{value} = \frac{1}{2}\,\big\| \big(R - V_\theta(s)\big)_+ \big\|^2 \tag{4}
\]

where (·)_+ = max(·, 0), π_θ and V_θ(s) are the policy (i.e., actor) and the value function parameterized by θ, and β^{sil} ∈ R+ is a hyperparameter for the value loss. Intuitively, for the same state, if the past return is greater than the current value (R > V_θ), then the past behavior can be regarded as a good decision, and imitating it is desirable. However, if the past return is less than the current value (R < V_θ), then imitating the behavior is not desirable. The authors combined SIL with the advantage actor-critic (A2C) (Mnih et al., 2016) and showed significant performance in experiments with hard exploration Atari games.
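The sketch below implements the SIL loss of Equations (2)-(4) in PyTorch. The network interface (a policy returning log-probabilities and a value estimate) and the hyperparameter value are assumptions for illustration; only the loss structure follows the equations above.

import torch

def sil_loss(log_probs, values, returns, beta_sil=0.01):
    """Self-imitation loss of Eqs. (2)-(4).

    log_probs: log pi_theta(a|s) for the sampled (s, a) pairs, shape [B].
    values:    V_theta(s), shape [B].
    returns:   discounted returns R stored in the replay buffer D, shape [B].
    beta_sil:  value-loss weight (0.01 is an illustrative choice).
    """
    # (R - V(s))_+ : only transitions whose past return beats the current value
    # contribute, i.e. only good past decisions are imitated.
    advantage = torch.clamp(returns - values, min=0.0)

    policy_loss = -(log_probs * advantage.detach()).mean()   # Eq. (3)
    value_loss = 0.5 * (advantage ** 2).mean()                # Eq. (4)
    return policy_loss + beta_sil * value_loss                # Eq. (2)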
Random Network Distillation. RND (Burda et al., 2018b) proposes a fixed target network f with randomized weights and a predictor network f̂, which is trained on the output of the target network. The predictor network is trained by gradient descent to minimize the expected mean squared error ‖f̂(x; θ) − f(x)‖², and this error is used as the exploration bonus (i_t). Intuitively, the prediction error will be large for a novel state and small for a state that has been frequently visited. However, if the agent converges to a local policy, the prediction error (i_t) may no longer arise. Furthermore, using RND can cause catastrophic forgetting: the predictor network learns about the states that the agent constantly visits, such that the network forgets about previously visited states. Consequently, the prediction error increases for past states, and the agent may return to a past policy.

4. AIE

4.1. Combining SIL and RND

In this section, we explain why combining RND and SIL can amplify the imitation effect and lead to deep exploration. SIL updates only when the past return R is greater than the current value V_θ and imitates past decisions. Intuitively, if we combine SIL and RND, the (R − V_θ) value is larger than with SIL alone because of the exploration bonus. In the process of optimizing the actor-critic network to maximize R_t = Σ_{k=t}^{∞} γ^{k−t} (i_k + e_k), where i is the intrinsic reward and e is the extrinsic reward, the increase in i_t produced by the predictor network causes R to increase. That is, the learning progresses by weighting the good decisions of the past.

Algorithm 1 Amplifying the Imitation Effect (AIE)
  Initialize the A2C network parameters θ_a2c
  Initialize the predictor/target network parameters θ_p, θ_t
  Initialize the replay buffer D ← ∅
  Initialize the episode buffer E ← ∅
  Initialize the feature buffer F ← ∅
  for episode = 1, M do
    for each step do
      Execute an action a_t ∼ π_θ(a_t|s_t) and observe s_t, a_t, r_t, s_{t+1}
      Extract the feature of s_{t+1} into φ_{s_{t+1}}
      Calculate the intrinsic reward i_t
      if i_t < penalty condition threshold then
        i_t ← λ log(i_t)
      end if
      r_t ← r_t + i_t
      Store the transition E ← E ∪ {(s_t, a_t, r_t)}
      F ← F ∪ {(φ_{s_{t+1}}, f_{θ_t}(φ_{s_{t+1}}))}
    end for
    if s_{t+1} is terminal then
      Compute the returns R_t = Σ_k γ^{k−t} r_k for all t in E
      D ← D ∪ {(s_t, a_t, R_t)}
      Clear the episode buffer E ← ∅
    end if
    # Optimize the actor-critic network
    θ_a2c ← θ_a2c − η ∇_{θ_a2c} L_a2c
    # Perform self-imitation learning
    for k = 1, M do
      Sample a minibatch {(s, a, R)} from D
      θ_a2c ← θ_a2c − η ∇_{θ_a2c} L_sil
      Sample a minibatch {(φ_{s_{t+1}}, f_{θ_t}(φ_{s_{t+1}}))} from F
      θ_p ← θ_p − η ∇_{θ_p} L_p
    end for
  end for
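Below is a minimal PyTorch sketch of the intrinsic-reward computation used in Algorithm 1: an RND-style predictor/target pair produces the bonus i_t, and the bonus is converted into a penalty via the λ log(i_t) transform when it falls below a threshold. The network sizes, λ, and the threshold are illustrative assumptions; only the structure (fixed random target, trained predictor, log penalty on small bonuses) follows the algorithm above.

import math
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """RND-style intrinsic reward with the AIE penalty transform."""

    def __init__(self, feature_dim, hidden=128, lam=0.1, threshold=0.01):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.target = mlp()        # fixed, randomly initialized f
        self.predictor = mlp()     # trained f_hat
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.lam = lam             # lambda in i_t <- lambda * log(i_t)
        self.threshold = threshold # penalty condition threshold (assumed value)
        self.optim = torch.optim.Adam(self.predictor.parameters(), lr=1e-4)

    def intrinsic_reward(self, phi_next):
        """Bonus i_t = ||f_hat(phi) - f(phi)||^2, turned into a penalty
        lambda * log(i_t) for states the predictor already fits well."""
        with torch.no_grad():
            err = (self.predictor(phi_next) - self.target(phi_next)).pow(2).mean()
        i_t = err.item()
        if i_t < self.threshold:
            i_t = self.lam * math.log(i_t)   # log of a small error is negative
        return i_t

    def update(self, phi_batch):
        """Predictor update from a minibatch of the feature buffer F (Algorithm 1)."""
        loss = (self.predictor(phi_batch) - self.target(phi_batch)).pow(2).mean()
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()
        return loss.item()

In Algorithm 1, this bonus is simply added to the extrinsic reward before the transition is stored in the episode buffer, and the predictor is updated from the feature buffer during the self-imitation phase.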
[Figure 4 panels: columns AIE1, AIE2 and AIE3; episode intervals 1~5,000, 5,001~10,000 and 10,001~15,000.]

Figure 4. Visualization of the path of the agent and the loss of all coordinate states for each algorithm in the no-reward 2D grid environment. The color changes from blue to red in the agent's path figure to indicate where the agent visits more frequently. The color changes from blue to yellow in the loss figure to indicate where the loss is larger.
[Figure: mission completion versus episode number (10k-60k) for ASIL, AIE1, AIE2 and AIE3, each with its worst run, shown for episodes 1~10,000 and 10,000~20,000.]
References

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Oh, J., Guo, Y., Singh, S., and Lee, H. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.

Zhang, Y., Zu, W., Gao, Y., Chang, H., et al. Research on autonomous maneuvering decision of UCAV based on deep reinforcement learning. 2018.

Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.