Vision-and-Language Navigation Based on Gated Recurrent BERT
1 Introduction
It has long been a human dream to build a general-purpose robot that completes
tasks assigned by humans while communicating with them in natural language.
With the recent breakthroughs in computer vision and natural language
processing, this dream is gradually becoming a reality. Vision-and-language
navigation is a foundation for realizing such general-purpose robots.
Vision-and-language navigation is an action decision-making problem in which
an agent follows human navigation instructions in 3D scenes with incomplete
environment information. The most critical and difficult part of the task is how an
intelligent agent can better perceive the environment and make correct deci-
sions. During navigation, the agent needs to align the textual instruction with
visual observations of the environment through multi-modal fusion and then
make action decisions by reasoning and grounding over the fused information.
The agent therefore needs strong perception, memory, and reasoning capabilities.
In addition, navigation agents trained with neural networks often achieve good
performance in seen environments, but there is typically a large performance gap
between seen and unseen environments. Hence, it is crucial to improve the
generalization ability of the model so that the navigation agent's performance
is insensitive to different environments.
Previous methods use traditional encoder-decoder models for cross-modal fu-
sion, which require large-scale, high-quality annotated data. However, it is
hard to collect adequate high-quality data in 3D environments for training
in the vision-and-language navigation field. We therefore introduce a pre-trained
BERT model to address the lack of data, and we revise the self-attention
calculation in the BERT model according to the characteristics of VLN to
further reduce computation and speed up inference. Vision-and-language
navigation can be modeled as a partially observable Markov decision process,
in which future observations depend on the current state and action of the agent.
Meanwhile, only part of the instruction is relevant to the current observation and
decision at each navigation step, so the model needs to localize the relevant
sub-instructions according to the navigation progress recorded in the trajectory
history, which requires the model to memorize historical information. Therefore,
we append a gated recurrent neural network with a residual connection to the
BERT model to handle the memory-loss problem (see the sketch below).
Furthermore, we incorporate reinforcement learning with imitation learning to
enhance the navigation agent's generalization capacity.
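As a preview, the following is a minimal sketch of such a gated recurrent state update with a residual connection. It is not the paper's exact implementation; the module choice (a GRU cell) and the hidden size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualGRUState(nn.Module):
    """Sketch: carry a navigation state vector across decision steps.

    The state produced by the BERT model at step t is fused with the
    previous state through a GRU cell; a residual connection preserves
    earlier history against memory loss. Sizes are assumptions.
    """

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.cell = nn.GRUCell(hidden_size, hidden_size)

    def forward(self, bert_state: torch.Tensor,
                prev_state: torch.Tensor) -> torch.Tensor:
        # gated update plus a residual path from the previous state
        return prev_state + self.cell(bert_state, prev_state)

# Usage at every navigation step: state = update(step_output, state)
```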
2 Related Work
Deep learning algorithms are now widely used in various artificial intelligence
tasks. The quality and scale of the labeled dataset substantially determine the
performance of the trained model. However, labeled datasets are difficult for
researchers to obtain or annotate and must be rebuilt for each new task.

Pre-training refers to training on large-scale in-domain data to learn general
knowledge of the domain, and then fine-tuning on the labeled data of a specific
downstream task to transfer and reuse that general knowledge, thereby
enhancing the performance of the model.
To input images and text into the Transformer, each modality must be
preprocessed separately. Text is usually represented as one-hot vectors and fed
into a word embedding network to obtain token embeddings. We encode the
position of each word as a position embedding and then feed the combined
token and position embeddings into the Transformer.
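A minimal sketch of this text preprocessing, assuming BERT-base sizes (vocabulary 30,522, hidden size 768); the middle token ids are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768  # BERT-base assumptions

token_emb = nn.Embedding(vocab_size, hidden)  # word embedding network
pos_emb = nn.Embedding(max_len, hidden)       # learned position encoding

token_ids = torch.tensor([[101, 2000, 3000, 102]])        # [CLS] ... [SEP]
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # 0 .. L-1

# BERT combines the two by summation before the Transformer encoder
inputs = token_emb(token_ids) + pos_emb(positions)        # (1, L, hidden)
```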
The decision of the agent is not a specific low-level action, such as going ahead
or turning left, but the selection of the image with the highest probability in the
panorama comprised of 36 images, toward which the agent then moves. This form
of action decision-making is more semantically informative and interpretable.
Moreover, the attention layer of the multi-modal BERT model calculates
attention weights over each image composing the panorama, which is consistent
with outputting the probabilities of selecting each image. Therefore, we directly
use the attention weights on each image at the last multi-head attention layer
as the decision probabilities.
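The following is a minimal sketch of this decision rule under our reading (a single-head dot-product form; tensor names are assumptions, and a real model would additionally mask non-navigable candidates):

```python
import torch

def action_probs_from_attention(state_q: torch.Tensor,
                                view_k: torch.Tensor) -> torch.Tensor:
    """Treat attention over the 36 panorama views as the policy.

    state_q: (batch, hidden)      query from the state token
    view_k:  (batch, 36, hidden)  keys of the candidate view features
    """
    d = state_q.size(-1)
    # scaled dot-product attention scores over the candidate views
    scores = torch.einsum("bh,bvh->bv", state_q, view_k) / d ** 0.5
    # the normalized attention weights double as decision probabilities
    return scores.softmax(dim=-1)

# argmax over the 36 views selects the direction to move toward
```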
The imitation-learning loss is the cross-entropy between the agent's action distribution and the expert demonstration:

$$\mathcal{L}_{IL} = -\sum_t a_t^* \log p_t^a$$

where $a_t^*$ is the (one-hot) expert action and $p_t^a$ is the probability that the agent takes action $a$ at step $t$.
Reinforcement learning lets the agent explore the environment and update itself
based on the reward given by the environment, which effectively improves the
generalization of the agent. We use the Proximal Policy Optimization (PPO)
algorithm, whose objective is to maximize the cumulative reward:
$$J(\theta) = V_{\pi_\theta}(s_0) = \mathbb{E}_{\tau \sim \pi_\theta}\big[G(\tau)\big] = \int \pi_\theta(\tau)\, G(\tau)\, d\tau \quad (14)$$
The final training objective combines the two losses, $\mathcal{L} = \mathcal{L}_{RL} + \lambda_{IL}\,\mathcal{L}_{IL}$, where $\lambda_{IL}$ is the parameter that balances the weight of imitation learning.
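A compact sketch of how such a combined objective can be implemented (a simplified single-step form; the tensor shapes, the clipping constant, and $\lambda_{IL} = 0.2$ are illustrative assumptions, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, old_log_probs, actions, advantages,
                  expert_actions, lambda_il=0.2, clip_eps=0.2):
    """PPO clipped surrogate loss plus a weighted imitation term."""
    log_probs = F.log_softmax(logits, dim=-1)          # (batch, n_actions)
    new_lp = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # PPO: clip the probability ratio to keep the policy update proximal
    ratio = (new_lp - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    rl_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # imitation learning: cross-entropy against the expert actions
    il_loss = F.nll_loss(log_probs, expert_actions)

    return rl_loss + lambda_il * il_loss
```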
4 Experiments
4.1 Dataset and Simulator for Indoor Scene Vision-and-Language
Navigation
The Matterport3D Simulator [?] is a software framework for various computer
vision tasks, constructed from the Matterport3D panoramic RGB-D image
dataset, as shown in Fig. 4.
Room-to-Room is the first vision-and-language navigation dataset, consisting of
navigation instructions and trajectories constructed by annotating navigation
trajectories in Matterport3D indoor scenes.
Fig. 4. Matterport3D indoor scene and viewpoints for panorama capture (green dots) [?]
Navigation performance is evaluated mainly by Success weighted by Path Length (SPL):

$$SPL = \frac{1}{N}\sum_{i=1}^{N} S_i\, \frac{l_i}{\max(p_i, l_i)} \quad (16)$$

where $N$ is the number of episodes, $S_i$ is a binary success indicator for episode $i$, $l_i$ is the shortest-path distance from the start to the goal, and $p_i$ is the length of the executed path.
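For concreteness, here is a minimal sketch implementing Eq. (16); the function name and input layout are our own:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, Eq. (16).

    successes:        list of 0/1 success indicators S_i
    shortest_lengths: list of shortest-path distances l_i
    path_lengths:     list of executed trajectory lengths p_i
    """
    n = len(successes)
    return sum(s * l / max(p, l)
               for s, l, p in zip(successes, shortest_lengths,
                                  path_lengths)) / n

# Example: spl([1, 0, 1], [10.0, 8.0, 12.0], [12.5, 9.0, 12.0]) == 0.60
```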
The validation set of Room-to-Room contains 18 indoor scenes and is split into
a val-seen set and a val-unseen set. The val-seen set contains indoor scenes
that appear in the training set, while the val-unseen set contains indoor scenes
that do not appear in the training set and that the agent has never seen before.
The test set of Room-to-Room contains only navigation instructions. The
organizers of the Room-to-Room dataset provide an evaluation server: after
running the navigation agent locally to collect the trajectories corresponding
to the test-set instructions, we upload these trajectories to the server for
evaluation.
We initialize the parameters of GRVLN-BERT with the PREVALENT pre-trained
model and conduct imitation learning and reinforcement learning simultaneously.
We set the learning rate to $10^{-5}$ and use the AdamW optimizer to train the
navigation agent for 300,000 iterations in total. Table 1 shows the experimental
results on the val-seen validation set. The SPL of the GRVLN-BERT model
improves from 0.65 to 0.71, and the success rate (SR) increases from 0.70 to
0.75. Under the single-run condition without auxiliary data augmentation, the
navigation agent achieves state-of-the-art performance on the NE, SR, and SPL
metrics. Therefore, the memorization and transfer of historical information
greatly enhances the model's reasoning and grounding ability, helping the agent
make correct navigation decisions.
Table 1. Results on the Room-to-Room val-seen split. TL: trajectory length (m); NE: navigation error (m); SR: success rate; SPL: success weighted by path length.

Method               TL     NE    SR    SPL
Random               9.58   9.45  0.16  -
Seq2Seq[?]           11.33  6.01  0.39  -
Speak-follower[?]    -      3.36  0.66  -
SMNA[?]              -      3.22  0.67  0.58
RCM+SIL[?]           10.65  3.53  0.67  -
PRESS[?]             10.57  4.39  0.58  0.55
FAST-Shot[?]         -      -     -     -
EnvDrop[?]           11.00  3.99  0.62  0.59
AuxRN[?]             -      3.33  0.70  0.67
PREVALENT[?]         10.32  3.67  0.69  0.65
RelGraph[?]          10.13  3.47  0.67  0.65
VLNBERT[?]           11.17  3.16  0.70  0.65
GRVLN-BERT (ours)    11.08  2.58  0.75  0.71
Table 2. Results on the Room-to-Room val-unseen split.

Method               TL     NE    SR    SPL
Random               9.77   9.23  0.16  -
Seq2Seq[?]           8.39   7.81  0.22  -
Speak-follower[?]    -      6.62  0.35  -
SMNA[?]              -      5.52  0.45  0.32
RCM+SIL[?]           11.46  6.09  0.43  -
PRESS[?]             10.36  5.28  0.49  0.45
FAST-Shot[?]         21.17  4.97  0.56  0.43
EnvDrop[?]           10.70  5.22  0.52  0.48
AuxRN[?]             -      5.28  0.55  0.50
PREVALENT[?]         10.19  4.71  0.58  0.53
RelGraph[?]          9.99   4.73  0.57  0.53
VLNBERT[?]           11.63  4.13  0.61  0.56
GRVLN-BERT (ours)    12.49  3.81  0.62  0.56
Table 3. Results on the Room-to-Room test split.

Method               TL     NE    SR    SPL
Random               9.89   9.79  0.13  0.12
Seq2Seq[?]           8.13   7.85  0.20  0.18
Speak-follower[?]    14.82  6.62  0.35  0.28
SMNA[?]              18.04  5.67  0.48  0.35
RCM+SIL[?]           11.97  6.12  0.43  0.38
PRESS[?]             10.77  5.49  0.49  0.45
FAST-Shot[?]         22.08  5.14  0.54  0.41
EnvDrop[?]           11.66  5.23  0.51  0.47
AuxRN[?]             -      5.15  0.55  0.51
PREVALENT[?]         10.51  5.30  0.54  0.51
RelGraph[?]          10.29  4.75  0.55  0.52
VLNBERT[?]           11.68  4.35  0.61  0.57
GRVLN-BERT (ours)    12.78  3.96  0.63  0.57
Table 4. Ablation results (each cell: val-seen / val-unseen). OSR: oracle success rate.

Method                 TL            NE          OSR        SR         SPL
GRVLN-BERT             11.08/12.49   2.58/3.81   0.79/0.70  0.75/0.62  0.71/0.56
GRVLN-BERT-noninsKV    14.57/15.27   3.98/4.80   0.66/0.60  0.60/0.52  0.55/0.45
Table 5. Ablation results for history memory (each cell: val-seen / val-unseen).

Method                    TL            NE          OSR        SR         SPL
GRVLN-BERT                11.08/12.49   2.58/3.81   0.79/0.70  0.75/0.62  0.71/0.56
GRVLN-BERT (normal)       11.17/11.63   3.16/4.13   0.75/0.67  0.70/0.61  0.65/0.56
GRVLN-BERT (nonhistory)   10.76/10.05   3.64/4.91   0.73/0.60  0.68/0.57  0.63/0.53
GRVLN-BERT(IL) performs well in val-seen scenes but degrades sharply in
val-unseen scenes, which indicates that imitation learning has poor generaliza-
tion ability. Although the performance of GRVLN-BERT(RL) is clearly lower
than that of the other two models in val-seen scenes, it becomes better in
val-unseen scenes. In addition, the path length of GRVLN-BERT(RL) is
significantly longer than that of GRVLN-BERT(IL), which means that GRVLN-
BERT(RL) tends to explore the environment, leading to better generalization
ability. The method incorporating imitation learning and reinforcement learning
combines the advantages of both, so that it achieves excellent performance in
val-seen scenarios and preferable generalization ability in val-unseen scenarios.
Table 6. Ablation results for the training strategy (each cell: val-seen / val-unseen).

Method            TL            NE          OSR        SR         SPL
GRVLN-BERT        11.08/12.49   2.58/3.81   0.79/0.70  0.75/0.62  0.71/0.56
GRVLN-BERT(IL)    9.62/9.27     2.92/4.88   0.77/0.60  0.72/0.54  0.70/0.52
GRVLN-BERT(RL)    14.21/14.01   3.80/4.62   0.71/0.64  0.63/0.56  0.57/0.49
5 Conclusion
In this paper, we introduce the pre-trained BERT model into vision-and-language
navigation task. We revise the input and attention calculation of the BERT
model according to the characteristics of vision-and-language navigation task
to improve the model’s reasoning and grounding efficiency. In order to mem-
orize and transmit historical trajectory information in the BERT model, we
utilize gated recurrent neural network with residual connection to transmit hid-
den state and address the problem of memory loss. In order to solve the expo-
sure bias issue of training navigation agent with imitation learning, a training
method combining reinforcement learning and imitation learning is proposed to
effectively enhance the model’s generalization ability. In the end, we find that
GRVLN-BERT model achieve state-of-the-art result in some evaluation metrics
through experiments and analysis, which proves the effectiveness of our proposed
method.