
Vision-and-Language Navigation Based on

Gated Recurrent BERT⋆

First Author1[0000-1111-2222-3333], Second Author2,3[1111-2222-3333-4444], and Third Author3[2222-3333-4444-5555]

1 Princeton University, Princeton NJ 08544, USA
2 Springer Heidelberg, Tiergartenstr. 17, 69121 Heidelberg, Germany
[email protected]
http://www.springer.com/gp/computer-science/lncs
3 ABC Institute, Rupert-Karls-University Heidelberg, Heidelberg, Germany
{abc,lncs}@uni-heidelberg.de

⋆ Supported by organization x.

Abstract. The vision-and-language navigation (VLN) task requires an agent to navigate to a target location in novel 3D environments under the guidance of natural language instructions given by humans. We mainly study the vision-and-language navigation task in indoor scenes. It is difficult to obtain a large amount of labeled training data, so traditional multi-modal methods perform poorly in this data-scarce setting. We introduce a pre-trained BERT model into the vision-and-language navigation task to extract environment features, and we reduce computation by fixing the instruction features during navigation. Meanwhile, we design a Gated Recurrent Vision-and-Language BERT to effectively memorize and transmit historical information, alleviating the memory loss problem during navigation. In addition, we conduct reinforcement learning and imitation learning simultaneously to address the exposure bias of imitation learning. Empirically, we conduct experiments in a discrete environment and analyze the results to verify the effectiveness of our proposed method.

Keywords: Vision-and-language navigation · Pre-trained model · Reinforcement learning · Imitation learning

1 Introduction
It has always been a human dream to build a general robot that can complete tasks assigned by humans while communicating with them in natural language. With the great breakthroughs in computer vision and natural language processing, this dream is gradually becoming a reality. Vision-and-language navigation is a foundation for realizing such a general robot.
Vision-and-language navigation is an action decision-making problem in 3D scenes with incomplete environment information, guided by navigation instructions given by humans. The most critical and difficult part of the task is how an intelligent agent can better perceive the environment and make correct decisions. During navigation, the agent needs to align the textual instruction with the visual environment observation through multi-modal fusion and then make action decisions by reasoning and grounding over the fused information. The agent therefore needs strong perception, memory, and reasoning capabilities. In addition, navigation agents trained with neural networks often achieve good performance in seen environments but show a large performance gap between seen and unseen environments. Hence, it is crucial to improve the generalization ability of the model so that the agent's performance is insensitive to different environments.
Previous methods use traditional encoder-decoder models for cross-modal fusion, which need large-scale, high-quality annotated data. However, it is hard to collect adequate high-quality data in 3D environments for training in the vision-and-language navigation field. We therefore introduce a pre-trained BERT model to address the lack of data, and we revise the self-attention calculation in the BERT model according to the characteristics of VLN to further reduce and speed up computation. Vision-and-language navigation can be modeled as a partially observable Markov decision-making process, in which future observations depend on the current state and action of the agent. Meanwhile, only partial sub-instructions are relevant to the current observation and decision at each navigation step. The model thus needs to localize the relevant partial sub-instructions according to the navigation progress along the trajectory history, which requires it to memorize historical information. Therefore, we append a gated recurrent neural network with a residual connection to the BERT model to handle the memory loss problem. Furthermore, we incorporate reinforcement learning with imitation learning to enhance the navigation agent's generalization capacity.

2 Related Work

As an emerging and challenging research direction, vision-and-language navigation has attracted the interest of many researchers.
Some works focus on multi-modal information fusion. Landi et al. [111] propose an architecture that uses the history of previous actions for different modalities. Magassouba et al. [112] present the Cross-modal Masked Path Transformer, which encodes linguistic and environment state features to generate actions. Wu et al. [113] and Mao et al. [114] use multi-head attention on visual and textual input, and Hong et al. [115] devise a language and visual entity relationship graph model. Xia et al. [116] propose Learn from Everyone (LEO), which utilizes multiple language instructions for the same path. Qi et al. [117] distinguish object and action information in language instructions, and Wang et al. [118] present Structured Scene Memory (SSM), which allows the agent to access its past perception. These approaches allow the robot to ground language instructions in the environment, reason over long horizons, and make global decisions.
Pre-trained models now play a great role in multi-modal fusion. Su et al. [136] propose VL-BERT for visual-linguistic tasks. Hao et al. [?] first pre-train a Transformer model for vision-and-language navigation tasks: they train the model on a large number of image-text-action triplets with self-supervised learning and significantly improve the model's performance.
Other works utilize Reinforcement Learning (RL) and Imitation Learning (IL) for Vision-and-Language Navigation (VLN) models. RL allows the agent to explore the state-action space beyond the demonstration path and is used to balance exploitation and exploration when learning to navigate. To improve generalization, Wang et al. [122] integrated model-based and model-free RL methods, while Wang et al. [119] used reinforced cross-modal matching (RCM) with extrinsic and intrinsic rewards. Zhou and Small [125] devised an adversarial inverse reinforcement learning method to learn a language-conditioned policy and reward function.

3 Vision-and-Language Navigation Based on Pre-training

One of the major challenges of vision-and-language navigation tasks is the acquisition of large-scale, high-quality annotated datasets. Pre-trained BERT models can reduce the demand for annotated data. However, because BERT models have many parameters, they consume substantial training resources and hinder the transmission of historical information, so the application of pre-trained BERT to vision-and-language navigation has been limited.
To address the problem of large resource consumption, we revise the BERT structure according to the characteristics of vision-and-language navigation tasks to reduce computation and enhance the model's performance. For the difficulty of memorizing historical information, we add gated recurrent neural networks with residual connections to the BERT model to memorize and transmit historical information, which alleviates the memory loss problem. Meanwhile, to handle the exposure bias of imitation learning, we simultaneously conduct imitation learning and reinforcement learning to enhance the generalization ability of the agent.

3.1 Pre-trained Multi-modal BERT Model for VLN

Deep learning algorithms are now widely used in various artificial intelligence tasks. The quality and scale of the labeled dataset substantially determine the performance of the trained model. However, labeled datasets are difficult for researchers to obtain or annotate and must often be rebuilt for different tasks.
Pre-training refers to training on large-scale in-domain data to learn general knowledge of the domain, and then fine-tuning on the labeled data of a specific downstream task to transfer and reuse that general knowledge, thereby enhancing the performance of the model.
It is necessary to preprocess the images and text separately before inputting them into the Transformer. Text is usually represented by one-hot vectors and then passed through a word embedding network to obtain token embeddings. We encode the position of each word as a position embedding and input the concatenation of position and word embeddings into the pre-trained model to provide position information.

PE_{pos,2i} = \sin(pos / 10000^{2i/d_{model}})  (1)

PE_{pos,2i+1} = \cos(pos / 10000^{2i/d_{model}})  (2)
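As a concrete illustration, the following minimal NumPy sketch computes these sinusoidal position embeddings (Eqs. 1-2); the sequence length and model dimension are illustrative values, not settings taken from the paper.

import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    # Returns a (max_len, d_model) matrix; row pos holds PE_pos.
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i), Eq. 1
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i+1), Eq. 2
    return pe

pe = sinusoidal_position_encoding(max_len=80, d_model=768)  # e.g. an 80-token instruction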


We initialize parameters with the Prevalent pre-trained model [?], the first pre-trained model focusing on the vision-and-language navigation task. Prevalent has a two-stream multi-modal BERT structure. It is a self-supervised pre-training model that uses language instructions and visual observations to extract features and then combines them to improve navigation. It is trained with two pre-training tasks: Masked Language Modeling (MLM) and Action Prediction (AP).

3.2 Recurrent BERT Model for Vision-and-Language Navigation


Previous multi-modal BERT pre-trained models are mainly used in static environments. We propose the Gated Recurrent Vision-and-Language Navigation BERT model (GRVLN-BERT), which uses an extra recurrent neural network to pass historical information. The model fixes the language features and only uses them as keys and values in the attention calculation, reducing computation during navigation. The model structure is shown in Fig. 1; it consists of instruction feature initialization, image feature processing, and multi-modal information fusion.

Navigation Instruction Initialization Based on the BERT Model In a vision-and-language navigation task, assume the vocabulary set is D. The navigation instruction U = {w_0, w_1, ..., w_n}, w_i ∈ D, is given to the agent at the beginning, and U remains unchanged during the whole navigation episode. Therefore, at the initial state s_{t=0}, the start token [CLS] and the separation token [SEP] are added to the instruction: U' = {[CLS], w_0, w_1, ..., w_n, [SEP]}. Then U' is input into the pre-trained BERT model:

s_0, X = GRVLN-BERT(U')  (3)


As shown in Eq. 3, s_0 is the initial state: we regard the output embedding of the [CLS] token as the initial state representation, and X is the instruction feature encoded by the GRVLN-BERT model. The navigation instruction U' does not change in the subsequent navigation steps t > 0; therefore, X is directly reused as the instruction feature input without update in subsequent steps. We localize the relevant partial sub-instruction at the current step by calculating attention scores over the image features and historical information.
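The encode-once pattern can be sketched as follows in PyTorch; the toy encoder below is an illustrative stand-in for the language stream of GRVLN-BERT, not the actual implementation.

import torch
import torch.nn as nn

class InstructionEncoder(nn.Module):
    # Toy stand-in: one Transformer layer over token embeddings.
    def __init__(self, vocab_size=1000, d_model=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)

    def forward(self, token_ids):
        h = self.layer(self.embed(token_ids))
        return h[:, 0], h        # s_0 = [CLS] output, X = all token features

encoder = InstructionEncoder()
token_ids = torch.randint(0, 1000, (1, 20))  # dummy ids for [CLS] w_0 ... w_n [SEP]
s, X = encoder(token_ids)                    # Eq. 3: computed once at t = 0
# For all steps t > 0, X is reused unchanged as the instruction feature.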

Image Feature Processing Based on the BERT Model The navigation agent continually receives local observations at its current position as it moves through the environment. In the Room-to-Room environment, the visual input of the agent is a

Fig. 1. The Structure of Gated Recurrent Vision-and-language Navigation BERT

360-degree panorama composed of 36 images, obtained at 12 horizontal angles and 3 vertical angles. To capture both the visual and directional features of each image, each image feature consists of a 2048-dimensional visual feature s_v extracted by a pre-trained ResNet and a 128-dimensional directional feature s_p composed of 32 repetitions of [\sin\psi; \cos\psi; \sin\omega; \cos\omega]. Therefore, the feature of each image is a 2176-dimensional vector s_i = [s_{iv}, s_{ip}], and we transform it into a dimension suitable for BERT input through a fully-connected layer.

h = \mathrm{LayerNorm}(W_e s_i + b_e)  (4)


where W_e ∈ R^{768×2176} is the weight matrix of the fully-connected layer and b_e ∈ R^{768} is the bias.
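A short PyTorch sketch of this feature construction (Eq. 4); the angles here are random placeholders for the actual heading psi and elevation omega of each view.

import torch
import torch.nn as nn

d_visual, d_dir, d_bert = 2048, 128, 768
proj = nn.Sequential(nn.Linear(d_visual + d_dir, d_bert), nn.LayerNorm(d_bert))

s_v = torch.randn(36, d_visual)                     # placeholder ResNet features
psi = torch.rand(36, 1) * 6.28                      # heading angle per view
omega = torch.rand(36, 1) * 6.28                    # elevation angle per view
block = torch.cat([psi.sin(), psi.cos(), omega.sin(), omega.cos()], dim=1)
s_p = block.repeat(1, 32)                           # 32 repetitions -> (36, 128)
s_i = torch.cat([s_v, s_p], dim=1)                  # (36, 2176) per-view feature
h = proj(s_i)                                       # Eq. 4: (36, 768) BERT input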
The BERT model extracts features through the self-attention mechanism, which calculates attention scores over all values. As shown in Eq. 5, if the output of the previous layer is h_{t-1}, then the Query, Key, and Value for self-attention are all h_{t-1}; in this way, each feature of h_t is extracted from all features of the previous layer h_{t-1}.

h_t = \mathrm{softmax}(Q K^T / \sqrt{d_h}) V, with Q = K = V = h_{t-1}  (5)
In vision-and-language navigation tasks, h_t = [s_t, X, V_t]. The instruction does not change throughout the navigation episode, and the model places emphasis on matching and understanding the language instruction in terms of the current observation. Therefore, it is not necessary to use the instruction to extract the image features, which saves a lot of computation. The purpose of vision-and-language navigation is to make action decisions according to the currently observed panorama. As a long-horizon problem, the current local observation only relates to a partial sub-instruction, so we refine the features of the relevant sub-instruction by calculating attention scores over the instruction features with the image features. Furthermore, the state s_t is the carrier of historical information, so its feature is also extracted from all features of the previous layer to memorize the trajectory history:

X = X  (6)

V_{t+1} = \mathrm{softmax}(Q K^T / \sqrt{d_h}) V, with Q = V_t, K = h_t, V = h_t  (7)

s_{t+1} = \mathrm{softmax}(Q K^T / \sqrt{d_h}) V, with Q = s_t, K = h_t, V = h_t  (8)
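The asymmetry of Eqs. 6-8 can be made concrete with a simplified single-head sketch in PyTorch (real BERT layers are multi-headed with learned projections; those details are omitted here):

import torch
import torch.nn.functional as F

def attend(q, k, v):
    # Scaled dot-product attention as in Eq. 5.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

d = 768
s_t = torch.randn(1, d)                # state token
X = torch.randn(20, d)                 # instruction features, frozen (Eq. 6)
V_t = torch.randn(36, d)               # panorama view features

h_t = torch.cat([s_t, X, V_t], dim=0)  # h_t = [s_t, X, V_t]
V_next = attend(V_t, h_t, h_t)         # Eq. 7: vision queries all features
s_next = attend(s_t, h_t, h_t)         # Eq. 8: state queries all features
# X is never recomputed: it serves only as keys and values.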
As shown in Fig. 2, instruction feature X only serves as Key and Value
for attention calculation to extract visual feature and state feature, while its
own value will not change in the GRVLN-BERT model. The navigational action

Fig. 2. Attention Calculation

decision of the agent is not a low-level motion such as going ahead or turning left; instead, the agent selects the image with the highest probability in the panorama comprised of 36 images and moves towards it. This form of action decision-making is more semantically informative and interpretable. Moreover, the attention layer of the multi-modal BERT model calculates attention weights over each image composing the panorama, which matches the form of outputting selection probabilities for each image. Therefore, we directly use the attention weights on each image at the last multi-head attention layer as the decision probabilities.
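In code, this decision rule amounts to reusing the final state-to-vision attention weights as a distribution over the 36 views; the single-head form below is a simplified sketch of that idea.

import torch
import torch.nn.functional as F

d = 768
s_t = torch.randn(1, d)               # state query at the last attention layer
V_t = torch.randn(36, d)              # one feature per panorama view

scores = (s_t @ V_t.T) / d ** 0.5     # state attends over the 36 views
probs = F.softmax(scores, dim=-1)     # attention weights = action probabilities
action = torch.argmax(probs, dim=-1)  # greedy choice: move toward this view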

Historical Information Transmission Based on Recurrent Neural Network In vision-and-language navigation tasks, the historical state is a summary of the navigation instruction, the historical visual trajectory, the action decision sequence, and so on. Because navigation is a partially observable Markov decision-making process, recording historical information is critical for predicting the current navigation progress and correctly extracting the key sub-instructions for decision-making. Using only the current image input to extract the related sub-instruction, without historical state information, degrades the inference ability of the navigation agent. When encountering repetitive or similar scenes, the agent cannot distinguish them and thus makes the same decisions at different stages of navigation. For instance, suppose an indoor scene contains two similar rooms, and the navigation instruction tells the agent to turn left in the first room and turn right in the second. Without the assistance of historical information, the agent cannot determine its current position and has difficulty choosing the correct navigation action.
We propose a method for recording and transferring historical state information in the BERT model. In BERT, the output corresponding to the [CLS] token is generally used for downstream tasks such as classification. We treat the output corresponding to the [CLS] token as the state information s. Due to the self-attention mechanism of BERT, s carries features of the image input and the navigation instruction at each navigation time step.
Previous methods directly use the output of BERT at the last time step as the new input s_t, so the model cannot distinguish important information to memorize from noise to discard. Furthermore, when the historical state information is fed into BERT, some historical information is lost after the calculation of multiple self-attention layers. Therefore, it is difficult for a long-horizon sequential problem like vision-and-language navigation to preserve historical information for a long time.
We add a Gated Recurrent Unit (GRU) [?] on the output state s of the BERT model to address the historical information loss problem. The GRU uses several control gates to decide whether to preserve or discard parts of the state information, rather than directly transmitting the hidden state like a plain recurrent neural network. Hence, the GRU handles long-term dependence well.

z_t = \sigma(W_z x_t + U_z h_{t-1})  (9)

r_t = \sigma(W_r x_t + U_r h_{t-1})  (10)

\hat{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1}))  (11)

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t  (12)



where z_t ∈ [0, 1] is the update gate and determines whether to memorize or discard information, r_t is the reset gate and decides how much previous information is used for computing the output, \hat{h}_t is the candidate activation, and h_t is the hidden state passed forward. These gate mechanisms ensure that early historical information can be preserved in long-horizon sequential problems.
Inspired by the Residual Neural Network (ResNet) [?], we further utilize a skip connection to address the memory loss issue. As shown in Fig. 3, we directly connect the state input to the hidden-state input of the GRU and preserve historical state information through the skip connection.

Fig. 3. Skip Connection of State
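A minimal PyTorch sketch of this gated update with the residual path of Fig. 3; the exact placement of the skip connection is our reading of the figure, so treat it as illustrative.

import torch
import torch.nn as nn

d = 768
gru = nn.GRUCell(input_size=d, hidden_size=d)  # implements Eqs. 9-12

s_prev = torch.randn(1, d)  # state carried over from the previous step
s_bert = torch.randn(1, d)  # [CLS] output of GRVLN-BERT at the current step

# Gated update plus skip connection: adding s_prev back helps early
# historical information survive many navigation steps.
s_next = gru(s_bert, s_prev) + s_prev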

In summary, the algorithm flow of our proposed Gated Recurrent Vision-and-Language Navigation BERT model is shown in Algorithm 1:

3.3 Training with Reinforcement Learning and Imitation Learning


Since vision-and-language navigation is a long-horizon, partially observable Markov decision-making process, the exposure bias problem arises. The root cause of exposure bias is the inconsistency between the sample spaces of the expert demonstration and the navigation agent's policy. We propose a training method combining reinforcement learning and imitation learning, which conducts both simultaneously. This combination makes full use of the faster convergence of imitation learning and the better generalization of reinforcement learning.
Imitation learning uses simple behavior cloning, which fits the expert's policy by minimizing the cross-entropy loss:

L_{IL} = -\lambda \sum_t a_t^* \log(p_t^a)  (13)

where a_t^* is the expert's action and p_t^a is the probability that the agent takes action a at step t.
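Since the expert action indexes one of the 36 panorama views, this behavior-cloning term reduces to a standard cross-entropy; a minimal PyTorch sketch with illustrative shapes:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 36)                  # scores over 36 views for 8 steps
expert_actions = torch.randint(0, 36, (8,))  # a*_t along the demonstration path

# Eq. 13 summed over the trajectory; the weight lambda is applied later,
# when the losses are combined (Eq. 15).
loss_il = F.cross_entropy(logits, expert_actions, reduction="sum")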

Algorithm 1 Gated Recurrent Vision-and-Language Navigation BERT

Input: environment ENV, navigation model GRVLN-BERT
repeat
  Start a navigation episode and obtain the navigation instruction I
  Initialize time step t = 0 and starting position pos_0
  Extract initial state and instruction features: s_0, X = GRVLN-BERT([I])
  repeat
    O_t = ENV(pos_t)                  // receive observation at position pos_t
    a_t = GRVLN-BERT([s_t, X, O_t])   // predict action
    s_{t+1}, pos_{t+1} = ENV(a_t)     // perform action
    t ← t + 1
  until the agent takes the STOP action or t exceeds a pre-defined threshold
until training of navigation model GRVLN-BERT ends
Output: navigation model GRVLN-BERT
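A Python rendering of this loop is sketched below; the env and model interfaces (reset, observe, move, encode_instruction, step, STOP) are assumptions introduced for illustration, not an actual simulator API.

import torch

def run_episode(env, model, max_steps=15):
    # Mirrors Algorithm 1 for a single navigation episode.
    instruction, pos = env.reset()                # new episode and start position
    s, X = model.encode_instruction(instruction)  # s_0, X = GRVLN-BERT([I])
    for t in range(max_steps):
        obs = env.observe(pos)                    # O_t at the current position
        probs, s = model.step(s, X, obs)          # predict action, update state
        action = torch.argmax(probs, dim=-1)
        if action.item() == env.STOP:             # agent chooses to stop
            break
        pos = env.move(action)                    # perform the action
    return pos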

Reinforcement learning makes the agent explore the environment and update itself based on the rewards the environment gives, which effectively improves the agent's generalization. We use the Proximal Policy Optimization (PPO) algorithm, whose objective is to maximize the cumulative reward:

J(\theta) = V_{\pi_\theta}(s_0) = E_{\tau \sim \pi_\theta}[G(\tau)] = \int \pi_\theta(\tau) G(\tau) d\tau  (14)

where \pi_\theta is the policy function, \theta denotes the network parameters, and G(\tau) is the return of trajectory \tau.
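The paper states only that PPO maximizes this objective; for reference, the usual clipped surrogate that PPO optimizes in practice looks like the sketch below (the clipping constant is the common default, not a value from the paper).

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    # Standard PPO clipped surrogate, written as a loss to minimize.
    ratio = torch.exp(log_probs - old_log_probs)   # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()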


Reinforcement learning and imitation learning share the navigation agent's network parameters. During training we sample expert demonstrations and agent trajectories separately, calculate the reinforcement learning loss and imitation learning loss individually, and then update the agent's parameters with gradient descent over the weighted total loss. This approach is equivalent to using imitation learning to regularize reinforcement learning, accelerating the convergence of the reinforcement learning algorithm.

L = L_{RL} + \lambda_{IL} L_{IL}  (15)

where \lambda_{IL} is the weight of the imitation learning loss.
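In training code this amounts to a single backward pass over the weighted sum; the sketch below uses toy scalar losses and an illustrative lambda, standing in for the PPO and behavior-cloning losses above.

import torch

loss_rl = torch.tensor(1.3, requires_grad=True)  # stand-in for the RL loss
loss_il = torch.tensor(0.7, requires_grad=True)  # stand-in for the IL loss
lambda_il = 0.2                                  # illustrative weight

loss = loss_rl + lambda_il * loss_il             # Eq. 15: weighted total loss
loss.backward()                                  # shared parameters receive
                                                 # gradients from both terms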

4 Experiments
4.1 Dataset and Simulator for Indoor Scene Vision-and-Language Navigation
Matterport3D Simulator [?] is a software framework for various computer vision tasks, built on the Matterport3D panoramic RGB-D image dataset, as shown in Fig. 4.
Fig. 4. Matterport Indoor Scene and Viewpoints for Panorama Capture (green dots) [?]

Room-to-Room is the first vision-and-language navigation dataset consisting of navigation instructions and trajectories; it was constructed by annotating navigation instruction-trajectory pairs in the Matterport3D simulator. Each trajectory is a sequence of viewpoints in the Matterport3D simulator and corresponds to 3 instructions.

4.2 Evaluation Metrics

We use several popular evaluation metrics for vision-and-language navigation tasks to measure the performance of the navigation agent:

• Trajectory Length (TL): the total length of the trajectory in meters.
• Navigation Error (NE): the average distance in meters between the target location and the agent's final position.
• Success Rate (SR): the ratio of episodes whose endpoint lies within a radius of 3 meters of the target location.
• Oracle Success Rate (OSR): the ratio of episodes in which the point on the agent's path nearest to the target lies within a radius of 3 meters of it.
• Success rate weighted by inverse Path Length (SPL):

SPL = \frac{1}{N} \sum_{i=1}^{N} S_i \frac{l_i}{\max(p_i, l_i)}  (16)

As shown in Eq. 16, S_i = 1 if the navigation agent succeeds and S_i = 0 if it fails; l_i is the shortest path length from the starting point to the target, and p_i is the length of the trajectory the agent actually travels.
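The metric is straightforward to compute; a small self-contained example:

def spl(successes, shortest_lengths, path_lengths):
    # Success weighted by Path Length (Eq. 16) over N episodes.
    n = len(successes)
    return sum(s * l / max(p, l)
               for s, l, p in zip(successes, shortest_lengths, path_lengths)) / n

# 2 of 3 episodes succeed; the first took a longer-than-optimal path.
print(spl([1, 1, 0], [10.0, 8.0, 12.0], [12.5, 8.0, 20.0]))  # ~0.60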

4.3 Experiment Analysis

The validation set of Room-to-Room contains 18 indoor scenes and is split into a val-seen set and a val-unseen set. The val-seen set contains indoor scenes that appear in the training set, while the val-unseen set contains indoor scenes that do not appear in the training set and that the agent has never seen before. The test set of Room-to-Room only contains navigation instructions; the Room-to-Room maintainers provide an evaluation server. After running the navigation agent locally to collect the trajectories corresponding to the test instructions, we upload these trajectories to the server for evaluation.
We initialize the parameters of GRVLN-BERT with the Prevalent pre-trained model and simultaneously conduct imitation learning and reinforcement learning. We set the learning rate to 10^{-5} and use the AdamW optimizer to train the navigation agent for 300,000 epochs in total. Table 1 shows the experimental results on the val-seen validation set. The SPL of the GRVLN-BERT model improves from 0.65 to 0.71, and the success rate (SR) increases from 0.70 to 0.75. Under the single-run condition without auxiliary data augmentation, the navigation agent achieves state-of-the-art performance on the NE, SR, and SPL metrics. Therefore, the memorization and transfer of historical information greatly enhances the model's reasoning and grounding ability and helps the agent make correct navigation decisions.

Table 1. Comparison of experimental results in the val-seen validation environments

Model TL NE SR SPL
Random 9.58 9.45 0.16 -
Seq2Seq[?] 11.33 6.01 0.39 -
Speak-follower[?] - 3.36 0.66 -
SMNA[?] - 3.22 0.67 0.58
RCM+SIL[?] 10.65 3.53 0.67 -
PRESS[?] 10.57 4.39 0.58 0.55
FAST-Shot[?] - - - -
EnvDrop[?] 11 3.99 0.62 0.59
AuxRN[?] - 3.33 0.7 0.67
PREVALENT[?] 10.32 3.67 0.69 0.65
RelGraph[?] 10.13 3.47 0.67 0.65
VLNBERT[?] 11.17 3.16 0.70 0.65
GRVLN-BERT(ours) 11.08 2.58 0.75 0.71

Table 2 shows the experimental results in the val-unseen validation environments. The navigation agent improves the success rate by 1 point and reaches a lower navigation error. Meanwhile, we find that the GRVLN-BERT model has a performance gap between val-seen and val-unseen environments, and we intend to enhance the generalization ability of the GRVLN-BERT model in future research.

Table 2. Comparison of experimental results in the val-unseen validation environments

Model TL NE SR SPL
Random 9.77 9.23 0.16 -
Seq2Seq[?] 8.39 7.81 0.22 -
Speak-follower[?] - 6.62 0.35 -
SMNA[?] - 5.52 0.45 0.32
RCM+SIL[?] 11.46 6.09 0.43 -
PRESS[?] 10.36 5.28 0.49 0.45
FAST-Shot[?] 21.17 4.97 0.56 0.43
EnvDrop[?] 10.7 5.22 0.52 0.48
AuxRN[?] - 5.28 0.55 0.50
PREVALENT[?] 10.19 4.71 0.58 0.53
RelGraph[?] 9.99 4.73 0.57 0.53
VLNBERT[?] 11.63 4.13 0.61 0.56
GRVLN-BERT(ours) 12.49 3.81 0.62 0.56

Finally, the GRVLN-BERT agent performs navigation in the test environments given only the instructions. We collect the agent's trajectories corresponding to the given instructions and upload the instruction-trajectory pairs to the aforementioned test server to measure the agent's performance. As shown in Table 3, the GRVLN-BERT model performs better than previous models on every indicator. In particular, it surpasses the previous best model by 2 points on the success rate indicator.

4.4 Ablation Study

To verify the effects of different components of the GRVLN-BERT model, we conduct detailed ablation studies. We study the impact of using the instruction feature only as Key and Value in the cross-modal attention calculation (abbreviated insKV) on the navigation agent's performance and computation speed. When both variants are trained for 300,000 epochs, the GRVLN-BERT model takes 39.7 hours with the insKV module and 43.2 hours without it. As shown in Table 4, where GRVLN-BERT-noninsKV refers to the GRVLN-BERT model without the insKV module, the insKV module has a great impact on the performance of the navigation agent: in val-unseen scenarios, SPL decreases by 0.11 without it. This indicates that the instruction features extracted by the BERT module at the initial navigation stage carry rich semantic information.
We study the effect of historical information on the model's grounding ability. As shown in Table 5, GRVLN-BERT refers to our proposed model, GRVLN-BERT(normal) is the model using the conventional state information transfer method

Table 3. Comparison of experimental results in the test environments

Model TL NE SR SPL
Random 9.89 9.79 0.13 0.12
Seq2Seq[?] 8.13 7.85 0.20 0.18
Speak-follower[?] 14.82 6.62 0.35 0.28
SMNA[?] 18.04 5.67 0.48 0.35
RCM+SIL[?] 11.97 6.12 0.43 0.38
PRESS[?] 10.77 5.49 0.49 0.45
FAST-Shot[?] 22.08 5.14 0.54 0.41
EnvDrop[?] 11.66 5.23 0.51 0.47
AuxRN[?] - 5.15 0.55 0.51
PREVALENT[?] 10.51 5.3 0.54 0.51
RelGraph[?] 10.29 4.75 0.55 0.52
VLNBERT[?] 11.68 4.35 0.61 0.57
GRVLN-BERT(ours) 12.78 3.96 0.63 0.57

Table 4. Results for the insKV module (val-seen, val-unseen)

Model TL NE OSR SR SPL
GRVLN-BERT 11.08,12.49 2.58,3.81 0.79,0.70 0.75,0.62 0.71,0.56
GRVLN-BERT-noninsKV 14.57,15.27 3.98,4.80 0.66,0.60 0.60,0.52 0.55,0.45

without the GRU module, and GRVLN-BERT(nonhistory) is the model without historical memory. We find that historical information effectively improves the model's performance. Furthermore, our proposed RNN-based historical information transfer enhances the memory capacity and inference ability of the model.

Table 5. Results for the memory module (val-seen, val-unseen)

Model TL NE OSR SR SPL
GRVLN-BERT 11.08,12.49 2.58,3.81 0.79,0.70 0.75,0.62 0.71,0.56
GRVLN-BERT(normal) 11.17,11.63 3.16,4.13 0.75,0.67 0.70,0.61 0.65,0.56
GRVLN-BERT(nonhistory) 10.76,10.05 3.64,4.91 0.73,0.60 0.68,0.57 0.63,0.53

We further analyze the effects of different training methods. As shown in Table 6, GRVLN-BERT is the model trained with our proposed method combining reinforcement learning and imitation learning, GRVLN-BERT(IL) is the model trained only with imitation learning, and GRVLN-BERT(RL) is the model
trained only with reinforcement learning. In val-seen scenes, the performance gap between GRVLN-BERT(IL) and GRVLN-BERT is slight, but it becomes larger in val-unseen scenes, which indicates that imitation learning alone generalizes poorly. Although the performance of GRVLN-BERT(RL) is clearly lower than that of the other two models in val-seen scenes, its relative performance improves in val-unseen scenes. In addition, the path length of GRVLN-BERT(RL) is significantly longer than that of GRVLN-BERT(IL), which suggests that GRVLN-BERT(RL) tends to explore the environment, leading to better generalization. The method incorporating imitation learning and reinforcement learning combines the advantages of both, achieving excellent performance in val-seen scenarios and preferable generalization in val-unseen scenarios.

Table 6. Results for different training methods (val-seen, val-unseen)

Model TL NE OSR SR SPL
GRVLN-BERT 11.08,12.49 2.58,3.81 0.79,0.70 0.75,0.62 0.71,0.56
GRVLN-BERT(IL) 9.62,9.27 2.92,4.88 0.77,0.60 0.72,0.54 0.70,0.52
GRVLN-BERT(RL) 14.21,14.01 3.80,4.62 0.71,0.64 0.63,0.56 0.57,0.49

5 Conclusion
In this paper, we introduce the pre-trained BERT model into the vision-and-language navigation task. We revise the input and attention calculation of the BERT model according to the characteristics of vision-and-language navigation to improve the model's reasoning and grounding efficiency. To memorize and transmit historical trajectory information in the BERT model, we utilize a gated recurrent neural network with a residual connection to transmit the hidden state and address the memory loss problem. To solve the exposure bias issue of training the navigation agent with imitation learning, we propose a training method combining reinforcement learning and imitation learning that effectively enhances the model's generalization ability. Finally, through experiments and analysis we find that the GRVLN-BERT model achieves state-of-the-art results on several evaluation metrics, which demonstrates the effectiveness of our proposed method.

