

VINS-Mono: A Robust and Versatile Monocular


Visual-Inertial State Estimator
Tong Qin, Peiliang Li, and Shaojie Shen

arXiv:1708.03852v1 [cs.RO] 13 Aug 2017

Abstract—A monocular visual-inertial system (VINS), consisting of a camera and a low-cost inertial measurement unit (IMU), forms the minimum sensor suite for metric six degrees-of-freedom (DOF) state estimation. However, the lack of direct distance measurement poses significant challenges in terms of IMU processing, estimator initialization, extrinsic calibration, and nonlinear optimization. In this work, we present VINS-Mono: a robust and versatile monocular visual-inertial state estimator. Our approach starts with a robust procedure for estimator initialization and failure recovery. A tightly-coupled, nonlinear optimization-based method is used to obtain high accuracy visual-inertial odometry by fusing pre-integrated IMU measurements and feature observations. A loop detection module, in combination with our tightly-coupled formulation, enables relocalization with minimum computation overhead. We additionally perform four degrees-of-freedom pose graph optimization to enforce global consistency. We validate the performance of our system on public datasets and real-world experiments and compare against other state-of-the-art algorithms. We also perform onboard closed-loop autonomous flight on the MAV platform and port the algorithm to an iOS-based demonstration. We highlight that the proposed work is a reliable, complete, and versatile system that is applicable for different applications that require high accuracy localization. We open source our implementations for both PCs1 and iOS mobile devices2.

Index Terms—Monocular visual-inertial systems, state estimation, sensor fusion, simultaneous localization and mapping

Fig. 1. Outdoor experimental results of the proposed monocular visual-inertial state estimator. Data is collected by a hand-held monocular camera-IMU setup under normal walking conditions. The path includes two complete circles inside the field and two semicircles on the nearby driveway. Total trajectory length is 2.5 km. A video of the experiment can be found in the multimedia attachment. (a) Trajectory (blue) and feature locations (red). (b) Trajectory overlaid with Google Map for visual comparison.

I. INTRODUCTION

State estimation is undoubtedly the most fundamental module for a wide range of applications, such as robotic navigation, autonomous driving, virtual reality (VR), and
augmented reality (AR). Approaches that use only a monocular camera have gained significant interest in the community due to their small size, low cost, and easy hardware setup [1]–[5]. However, monocular vision-only systems are incapable of recovering the metric scale, therefore limiting their usage in real-world robotic applications. Recently, we have seen a growing trend of assisting the monocular vision system with a low-cost inertial measurement unit (IMU). The primary advantage of this monocular visual-inertial system (VINS) is that it makes the metric scale, as well as roll and pitch angles, observable. This enables navigation tasks that require metric state estimates. In addition, the integration of IMU measurements can dramatically improve motion tracking performance by bridging the gap between losses of visual tracks due to illumination change, texture-less areas, or motion blur. In fact, the monocular VINS is not only widely available on mobile robots, drones, and mobile devices, it is also the minimum sensor setup for sufficient self and environmental perception.

However, all these advantages come with a price. For monocular VINS, it is well known that acceleration excitation is needed to make the metric scale observable. This implies that monocular VINS estimators cannot start from a stationary condition, but rather have to be launched from an unknown moving state. Also, recognizing the fact that visual-inertial systems are highly nonlinear, we see significant challenges in terms of estimator initialization. The existence of two sensors also makes camera-IMU extrinsic calibration critical. Finally, in order to eliminate long-term drift within an acceptable processing window, a complete system that includes visual-inertial odometry, loop detection, relocalization, and global optimization has to be developed.

T. Qin, P. Li, and S. Shen are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology. e-mail: {tqinab, pliap}@connect.ust.hk, [email protected]
1 https://github.com/HKUST-Aerial-Robotics/VINS-Mono
2 https://github.com/HKUST-Aerial-Robotics/VINS-Mobile

To address all these issues, we propose VINS-Mono, a robust and versatile monocular visual-inertial state estimator. Our solution starts with on-the-fly estimator initialization. The same initialization module is also used for failure recovery. The core of our solution is a robust monocular visual-inertial odometry (VIO) based on tightly-coupled sliding window nonlinear optimization. The monocular VIO module not only provides accurate local pose, velocity, and orientation estimates, it also performs camera-IMU extrinsic calibration and IMU bias correction in an online fashion. Loops are detected using DBoW2 [6]. Relocalization is done in a tightly-coupled setting by feature-level fusion with the monocular VIO. This enables robust and accurate relocalization with minimum computation overhead. Finally, geometrically verified loops are added into a pose graph, and thanks to the observable roll and pitch angles from the monocular VIO, a four degrees-of-freedom (DOF) pose graph optimization is performed to ensure global consistency.

VINS-Mono combines and improves upon our previous works on monocular visual-inertial fusion [7]–[10]. It is built on top of our tightly-coupled, optimization-based formulation for monocular VIO [7], [8], and incorporates the improved initialization procedure introduced in [9]. The first attempt at porting to mobile devices was given in [10]. Further improvements of VINS-Mono compared to our previous works include improved IMU pre-integration with bias correction, tightly-coupled relocalization, global pose graph optimization, extensive experimental evaluation, and a robust and versatile open-source implementation.

The whole system is complete and easy to use. It has been successfully applied to small-scale AR scenarios, medium-scale drone navigation, and large-scale state estimation tasks. Superior performance has been shown against other state-of-the-art methods. To this end, we summarize our contributions as follows:
• A robust initialization procedure that is able to bootstrap the system from unknown initial states.
• A tightly-coupled, optimization-based monocular visual-inertial odometry with camera-IMU extrinsic calibration and IMU bias estimation.
• Online loop detection and tightly-coupled relocalization.
• Four DOF global pose graph optimization.
• Real-time performance demonstration for drone navigation, large-scale localization, and mobile AR applications.
• Open-source release for both the PC version that is fully integrated with ROS, as well as the iOS version that runs on iPhone6s or above.

The rest of the paper is structured as follows. In Sect. II, we discuss relevant literature. We give an overview of the complete system pipeline in Sect. III. Preprocessing steps for both visual and pre-integrated IMU measurements are presented in Sect. IV. In Sect. V, we discuss the estimator initialization procedure. A tightly-coupled, self-calibrating, nonlinear optimization-based monocular VIO is presented in Sect. VI. Tightly-coupled relocalization and global pose graph optimization are presented in Sect. VII and Sect. VIII respectively. Implementation details and experimental results are shown in Sect. IX. Finally, the paper is concluded with a discussion and possible future research directions in Sect. X.

II. RELATED WORK

Scholarly works on monocular vision-based state estimation/odometry/SLAM are extensive. Noticeable approaches include PTAM [1], SVO [2], LSD-SLAM [3], DSO [5], and ORB-SLAM [4]. It is obvious that any attempt to give a full relevant review would be incomplete. In this section, however, we skip the discussion on vision-only approaches, and only focus on the most relevant results on monocular visual-inertial state estimation.

The simplest way to deal with visual and inertial measurements is loosely-coupled sensor fusion [11], [12], where the IMU is treated as an independent module to assist vision-only pose estimates obtained from visual structure from motion. Fusion is usually done by an extended Kalman filter (EKF), where the IMU is used for state propagation and the vision-only pose is used for the update. Further on, tightly-coupled visual-inertial algorithms are either based on the EKF [13]–[15] or graph optimization [7], [8], [16], [17], where camera and IMU measurements are jointly optimized from the raw measurement level. A popular EKF-based VIO approach is MSCKF [13], [14]. MSCKF maintains several previous camera poses in the state vector, and uses visual measurements of the same feature across multiple camera views to form a multi-constraint update. SR-ISWF [18], [19] is an extension of MSCKF. It uses the square-root form [20] to achieve single-precision representation and avoid poor numerical properties. This approach employs the inverse filter for iterative re-linearization, making it equivalent to optimization-based algorithms. Batch graph optimization or bundle adjustment techniques maintain and optimize all measurements to obtain the optimal state estimates. To achieve constant processing time, popular graph-based VIO methods [8], [16], [17] usually optimize over a bounded-size sliding window of recent states by marginalizing out past states and measurements. Due to the high computational demands of iteratively solving nonlinear systems, few graph-based methods can achieve real-time performance on resource-constrained platforms, such as mobile phones.

For visual measurement processing, algorithms can be categorized into either direct or indirect methods according to the definition of the visual residual model. Direct approaches [2], [3], [21] minimize photometric error, while indirect approaches [8], [14], [16] minimize geometric displacement. Direct methods require a good initial guess due to their small region of attraction, while indirect approaches consume extra computational resources on extracting and matching features. Indirect approaches are more frequently found in real-world engineering deployments due to their maturity and robustness. However, direct approaches are easier to extend for dense mapping as they operate directly on the pixel level.

In practice, IMUs usually acquire data at a much higher rate than the camera. Different methods have been proposed to handle the high-rate IMU measurements. The most straightforward approach is to use the IMU for state propagation in EKF-based approaches [11], [13]. In a graph optimization formulation, an efficient technique called IMU pre-integration is developed in

order to avoid repeated IMU re-integration. This technique was first introduced in [22], which parametrizes the rotation error using Euler angles. An on-manifold rotation formulation for IMU pre-integration was developed in our previous work [7]. This work derived the covariance propagation using continuous-time IMU error-state dynamics. However, IMU biases were ignored. The pre-integration theory was further improved in [23] by adding posterior IMU bias correction.

Accurate initial values are crucial to bootstrap any monocular VINS. A linear estimator initialization method that leverages relative rotations from short-term IMU pre-integration was proposed in [8], [24]. However, this method does not model gyroscope bias, and fails to model sensor noise in raw projection equations. In real-world applications, this results in unreliable initialization when visual features are far away from the sensor suite. A closed-form solution to the monocular visual-inertial initialization problem was introduced in [25]. Later, an extension to this closed-form solution by adding gyroscope bias calibration was proposed in [26]. These approaches fail to model the uncertainty in inertial integration since they rely on the double integration of IMU measurements over an extended period of time. In [27], a re-initialization and failure recovery algorithm based on SVO [2] was proposed. This is a practical method based on a loosely-coupled fusion framework. However, an additional downward-facing distance sensor is required to recover the metric scale. An initialization algorithm built on top of the popular ORB-SLAM [4] was introduced in [17]. An initial estimation of the scale, gravity direction, velocity, and IMU biases is computed for the visual-inertial full BA given a set of keyframes from ORB-SLAM. However, it is reported that the time required for scale convergence can be longer than 10 seconds. This can pose problems for robotic navigation tasks that require scale estimates right at the beginning.

VIO approaches, regardless of the underlying mathematical formulation that they rely on, suffer from long-term drift in global translation and orientation. To this end, loop closure plays an important role for long-term operations. ORB-SLAM [4] is able to close loops and reuse the map, which takes advantage of Bag-of-Words [6]. A 7 DOF [28] (position, orientation, and scale) pose graph optimization is performed following loop detection. In contrast, for monocular VINS, thanks to the addition of the IMU, drift only occurs in 4 DOF, which is the 3D translation and the rotation around the gravity direction (yaw angle). Therefore, in this paper, we choose to optimize the pose graph with loop constraints in the minimum 4 DOF setting.

Fig. 2. A block diagram illustrating the full pipeline of the proposed monocular visual-inertial state estimator.

III. OVERVIEW

The structure of the proposed monocular visual-inertial state estimator is shown in Fig. 2. The system starts with measurement preprocessing (Sect. IV), in which features are extracted and tracked, and IMU measurements between two consecutive frames are pre-integrated. The initialization procedure (Sect. V) provides all necessary values, including pose, velocity, gravity vector, gyroscope bias, and 3D feature location, for bootstrapping the subsequent nonlinear optimization-based VIO. The VIO (Sect. VI) with relocalization (Sect. VII) modules tightly fuses pre-integrated IMU measurements, feature observations, and re-detected features from loop closure. Finally, the pose graph optimization module (Sect. VIII) takes in geometrically verified relocalization results, and performs global optimization to eliminate drift. The VIO, relocalization, and pose graph optimization modules run concurrently in a multi-thread setting. Each module has different running rates and real-time guarantees to ensure reliable operation at all times.

We now define notations and frame definitions that we use throughout the paper. We consider (·)^w as the world frame. The direction of gravity is aligned with the z axis of the world frame. (·)^b is the body frame, which we define to be the same as the IMU frame. (·)^c is the camera frame. We use both rotation matrices R and Hamilton quaternions q to represent rotation. We primarily use quaternions in state vectors, but rotation matrices are also used for convenient rotation of 3D vectors. q^w_b and p^w_b are the rotation and translation from the body frame to the world frame. b_k is the body frame while taking the k-th image. c_k is the camera frame while taking the k-th image. ⊗ represents the multiplication operation between two quaternions. g^w = [0, 0, g]^T is the gravity vector in the world frame. Finally, we denote \hat{(·)} as the noisy measurement or estimate of a certain quantity.
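To make the notation concrete, the following small sketch (our illustration, not part of the paper) implements the Hamilton product ⊗ and shows how q^w_b rotates a body-frame vector into the world frame. The [w, x, y, z] storage order is an assumption of this sketch.

```python
import numpy as np

def quat_mult(q, p):
    """Hamilton product q ⊗ p, with quaternions stored as [w, x, y, z]."""
    qw, qx, qy, qz = q
    pw, px, py, pz = p
    return np.array([
        qw*pw - qx*px - qy*py - qz*pz,
        qw*px + qx*pw + qy*pz - qz*py,
        qw*py - qx*pz + qy*pw + qz*px,
        qw*pz + qx*py - qy*px + qz*pw])

def quat_conj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def rotate(q, v):
    """Rotate a 3D vector v by quaternion q: q ⊗ [0, v] ⊗ q*."""
    qv = np.concatenate(([0.0], v))
    return quat_mult(quat_mult(q, qv), quat_conj(q))[1:]

# Example: with g^w = [0, 0, g]^T in the world frame, gravity expressed in the
# body frame of an IMU with orientation q_wb is obtained with the inverse rotation.
g_w = np.array([0.0, 0.0, 9.81])
q_wb = np.array([1.0, 0.0, 0.0, 0.0])   # body-to-world rotation (identity here)
g_b = rotate(quat_conj(q_wb), g_w)      # gravity in the body frame
print(g_b)
```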

IV. MEASUREMENT PREPROCESSING

This section presents preprocessing steps for both inertial and monocular visual measurements. For visual measurements, we track features between consecutive frames and detect new features in the latest frame. For IMU measurements, we pre-integrate them between two consecutive frames. Note that the measurements of the low-cost IMU that we use are affected by both bias and noise. We therefore especially take bias into account in the IMU pre-integration process.

A. Vision Processing Front-end

For each new image, existing features are tracked by the KLT sparse optical flow algorithm [29]. Meanwhile, new corner features are detected [30] to maintain a minimum number (100-300) of features in each image. The detector enforces a uniform feature distribution by setting a minimum separation of pixels between two neighboring features. 2D features are first undistorted and then projected to a unit sphere after passing outlier rejection. Outlier rejection is performed using RANSAC with a fundamental matrix model [31].

Keyframes are also selected in this step. We have two criteria for keyframe selection. The first one is the average parallax apart from the previous keyframe. If the average parallax of tracked features between the current frame and the latest keyframe is beyond a certain threshold, we treat the frame as a new keyframe. Note that not only translation but also rotation can cause parallax. However, features cannot be triangulated in rotation-only motion. To avoid this situation, we use a short-term integration of gyroscope measurements to compensate for rotation when calculating parallax. Note that this rotation compensation is only used for keyframe selection, and is not involved in rotation calculation in the VINS formulation. To this end, even if the gyroscope contains large noise or is biased, it will only result in suboptimal keyframe selection results, and will not directly affect the estimation quality. The other criterion is tracking quality. If the number of tracked features goes below a certain threshold, we treat this frame as a new keyframe. This criterion is to avoid complete loss of feature tracks.
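The following sketch illustrates the kind of front-end loop described in Sect. IV-A, using OpenCV's KLT tracker, corner detector, and RANSAC fundamental-matrix test. It is our illustration rather than the reference implementation; all thresholds (feature count, minimum separation, RANSAC threshold) are assumptions, and the rotation-compensated parallax check for keyframe selection is omitted for brevity.

```python
import cv2
import numpy as np

MAX_FEATURES = 150          # assumed target feature count
MIN_SEPARATION = 30         # assumed minimum pixel separation between features
RANSAC_THRESHOLD = 1.0      # assumed RANSAC reprojection threshold in pixels

def track_frame(prev_img, cur_img, prev_pts):
    """prev_pts: Nx1x2 float32 array of tracked corners from the previous image."""
    cur_pts = np.empty((0, 1, 2), dtype=np.float32)
    if prev_pts is not None and len(prev_pts) > 0:
        # 1) KLT sparse optical flow for existing features.
        cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_img, cur_img, prev_pts, None)
        ok = status.reshape(-1).astype(bool)
        prev_pts, cur_pts = prev_pts[ok], cur_pts[ok]

        # 2) Outlier rejection with a RANSAC fundamental-matrix model.
        if len(cur_pts) >= 8:
            _, inliers = cv2.findFundamentalMat(prev_pts, cur_pts,
                                                cv2.FM_RANSAC, RANSAC_THRESHOLD, 0.99)
            if inliers is not None:
                keep = inliers.reshape(-1).astype(bool)
                cur_pts = cur_pts[keep]

    # 3) Detect new corners away from existing ones to keep a uniform distribution.
    mask = np.full(cur_img.shape[:2], 255, dtype=np.uint8)
    for x, y in cur_pts.reshape(-1, 2):
        cv2.circle(mask, (int(x), int(y)), MIN_SEPARATION, 0, -1)
    need = MAX_FEATURES - len(cur_pts)
    if need > 0:
        new_pts = cv2.goodFeaturesToTrack(cur_img, need, 0.01, MIN_SEPARATION, mask=mask)
        if new_pts is not None:
            cur_pts = np.vstack([cur_pts, new_pts.astype(np.float32)])
    return cur_pts
```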

Fig. 3. An illustration of the sliding window monocular VIO with relocalization. It is a tightly-coupled formulation with IMU, visual, and loop measurements.

B. IMU Pre-integration

IMU pre-integration was first proposed in [22], which parametrized the rotation error in Euler angles. An on-manifold rotation formulation for IMU pre-integration was developed in our previous work [7]. This work derived the covariance propagation using continuous-time IMU error-state dynamics. However, IMU biases were ignored. The pre-integration theory was further improved in [23] by adding posterior IMU bias correction. In this paper, we extend the IMU pre-integration proposed in our previous work [7] by incorporating IMU bias correction.

The raw gyroscope and accelerometer measurements from the IMU, \hat{\omega} and \hat{a}, are given by:

\hat{a}_t = a_t + b_{a_t} + R_w^t g^w + n_a, \qquad \hat{\omega}_t = \omega_t + b_{w_t} + n_w.    (1)

IMU measurements, which are measured in the body frame, combine the force countering gravity and the platform dynamics, and are affected by acceleration bias b_a, gyroscope bias b_w, and additive noise. We assume that the additive noise in acceleration and gyroscope measurements is Gaussian, n_a \sim \mathcal{N}(0, \sigma_a^2), n_w \sim \mathcal{N}(0, \sigma_w^2). Acceleration bias and gyroscope bias are modeled as random walks, whose derivatives are Gaussian, n_{b_a} \sim \mathcal{N}(0, \sigma_{b_a}^2), n_{b_w} \sim \mathcal{N}(0, \sigma_{b_w}^2):

\dot{b}_{a_t} = n_{b_a}, \qquad \dot{b}_{w_t} = n_{b_w}.    (2)

Given two time instants that correspond to image frames b_k and b_{k+1}, the position, velocity, and orientation states can be propagated by inertial measurements during the time interval [t_k, t_{k+1}] in the world frame:

p^w_{b_{k+1}} = p^w_{b_k} + v^w_{b_k} \Delta t_k + \iint_{t \in [t_k, t_{k+1}]} \left( R^w_t (\hat{a}_t - b_{a_t} - n_a) - g^w \right) dt^2
v^w_{b_{k+1}} = v^w_{b_k} + \int_{t \in [t_k, t_{k+1}]} \left( R^w_t (\hat{a}_t - b_{a_t} - n_a) - g^w \right) dt
q^w_{b_{k+1}} = q^w_{b_k} \otimes \int_{t \in [t_k, t_{k+1}]} \frac{1}{2} \Omega(\hat{\omega}_t - b_{w_t} - n_w) \, q^{b_k}_t \, dt,    (3)

where

\Omega(\omega) = \begin{bmatrix} -\lfloor\omega\rfloor_\times & \omega \\ -\omega^T & 0 \end{bmatrix}, \qquad \lfloor\omega\rfloor_\times = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix},    (4)

and \Delta t_k is the duration of the time interval [t_k, t_{k+1}].

It can be seen that the IMU state propagation requires the rotation, position, and velocity of frame b_k. When these starting states change, we need to re-propagate the IMU measurements. Especially in the optimization-based algorithm, every time we adjust poses, we would need to re-propagate IMU measurements between them. This propagation strategy is computationally demanding. To avoid re-propagation, we adopt the pre-integration algorithm.

After changing the reference frame from the world frame to the local frame b_k, we can pre-integrate only the parts which are related to linear acceleration \hat{a} and angular velocity \hat{\omega} as follows:

R^{b_k}_w p^w_{b_{k+1}} = R^{b_k}_w \left( p^w_{b_k} + v^w_{b_k} \Delta t_k - \frac{1}{2} g^w \Delta t_k^2 \right) + \alpha^{b_k}_{b_{k+1}}
R^{b_k}_w v^w_{b_{k+1}} = R^{b_k}_w \left( v^w_{b_k} - g^w \Delta t_k \right) + \beta^{b_k}_{b_{k+1}}
q^{b_k}_w \otimes q^w_{b_{k+1}} = \gamma^{b_k}_{b_{k+1}},    (5)

where

\alpha^{b_k}_{b_{k+1}} = \iint_{t \in [t_k, t_{k+1}]} R^{b_k}_t (\hat{a}_t - b_{a_t} - n_a) \, dt^2
\beta^{b_k}_{b_{k+1}} = \int_{t \in [t_k, t_{k+1}]} R^{b_k}_t (\hat{a}_t - b_{a_t} - n_a) \, dt
\gamma^{b_k}_{b_{k+1}} = \int_{t \in [t_k, t_{k+1}]} \frac{1}{2} \Omega(\hat{\omega}_t - b_{w_t} - n_w) \, \gamma^{b_k}_t \, dt.    (6)

It can be seen that the pre-integration terms (6) can be obtained solely from IMU measurements by taking b_k as the reference frame. \alpha^{b_k}_{b_{k+1}}, \beta^{b_k}_{b_{k+1}}, \gamma^{b_k}_{b_{k+1}} are only related to IMU biases instead of other states in b_k and b_{k+1}. When the estimate of the bias changes, if the change is small, we adjust \alpha^{b_k}_{b_{k+1}}, \beta^{b_k}_{b_{k+1}}, \gamma^{b_k}_{b_{k+1}} by their first-order approximations with respect to the bias; otherwise we do re-propagation. This strategy saves a significant amount of computational resources for optimization-based algorithms, since we do not need to propagate IMU measurements repeatedly.

For the discrete-time implementation, different numerical integration methods such as Euler, mid-point, and RK4 integration can be applied. Here Euler integration is chosen to demonstrate the procedure for easy understanding (we use mid-point integration in the implementation code).

At the beginning, \alpha^{b_k}_{b_k} and \beta^{b_k}_{b_k} are 0, and \gamma^{b_k}_{b_k} is the identity quaternion. The mean of \alpha, \beta, \gamma in (6) is propagated step by step as follows. Note that the additive noise terms n_a, n_w are unknown, and are treated as zero in the implementation. This results in estimated values of the pre-integration terms, marked by \hat{(·)}:

\hat{\alpha}^{b_k}_{i+1} = \hat{\alpha}^{b_k}_i + \hat{\beta}^{b_k}_i \delta t + \frac{1}{2} R(\hat{\gamma}^{b_k}_i)(\hat{a}_i - b_{a_i}) \delta t^2
\hat{\beta}^{b_k}_{i+1} = \hat{\beta}^{b_k}_i + R(\hat{\gamma}^{b_k}_i)(\hat{a}_i - b_{a_i}) \delta t
\hat{\gamma}^{b_k}_{i+1} = \hat{\gamma}^{b_k}_i \otimes \begin{bmatrix} 1 \\ \frac{1}{2}(\hat{\omega}_i - b_{w_i}) \delta t \end{bmatrix},    (7)

where i is the discrete moment corresponding to an IMU measurement within [t_k, t_{k+1}], and \delta t is the time interval between two IMU measurements i and i+1.

Then we deal with the covariance propagation. Since the four-dimensional rotation quaternion \gamma^{b_k}_t is over-parameterized, we define its error term as a perturbation around its mean:

\gamma^{b_k}_t \approx \hat{\gamma}^{b_k}_t \otimes \begin{bmatrix} 1 \\ \frac{1}{2} \delta\theta^{b_k}_t \end{bmatrix},    (8)

where \delta\theta^{b_k}_t is a three-dimensional small perturbation.

We can derive the continuous-time linearized dynamics of the error terms of (6):

\begin{bmatrix} \delta\dot{\alpha}^{b_k}_t \\ \delta\dot{\beta}^{b_k}_t \\ \delta\dot{\theta}^{b_k}_t \\ \delta\dot{b}_{a_t} \\ \delta\dot{b}_{w_t} \end{bmatrix} =
\begin{bmatrix} 0 & I & 0 & 0 & 0 \\ 0 & 0 & -R^{b_k}_t \lfloor \hat{a}_t - b_{a_t} \rfloor_\times & -R^{b_k}_t & 0 \\ 0 & 0 & -\lfloor \hat{\omega}_t - b_{w_t} \rfloor_\times & 0 & -I \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} \delta\alpha^{b_k}_t \\ \delta\beta^{b_k}_t \\ \delta\theta^{b_k}_t \\ \delta b_{a_t} \\ \delta b_{w_t} \end{bmatrix}
+ \begin{bmatrix} 0 & 0 & 0 & 0 \\ -R^{b_k}_t & 0 & 0 & 0 \\ 0 & -I & 0 & 0 \\ 0 & 0 & I & 0 \\ 0 & 0 & 0 & I \end{bmatrix}
\begin{bmatrix} n_a \\ n_w \\ n_{b_a} \\ n_{b_w} \end{bmatrix}
= F_t \, \delta z^{b_k}_t + G_t \, n_t.    (9)

P^{b_k}_{b_{k+1}} can be computed recursively by the first-order discrete-time covariance update with the initial covariance P^{b_k}_{b_k} = 0:

P^{b_k}_{t+\delta t} = (I + F_t \delta t) P^{b_k}_t (I + F_t \delta t)^T + (G_t \delta t) Q (G_t \delta t)^T, \quad t \in [k, k+1],    (10)

where Q is the diagonal covariance matrix of the noise (\sigma_a^2, \sigma_w^2, \sigma_{b_a}^2, \sigma_{b_w}^2).

Meanwhile, the first-order Jacobian matrix J_{b_{k+1}} of \delta z^{b_k}_{b_{k+1}} with respect to \delta z^{b_k}_{b_k} can also be computed recursively with the initial Jacobian J_{b_k} = I:

J_{t+\delta t} = (I + F_t \delta t) J_t, \quad t \in [k, k+1].    (11)

Using this recursive formulation, we get the covariance matrix P^{b_k}_{b_{k+1}} and the Jacobian J_{b_{k+1}}. The first-order approximation of \alpha^{b_k}_{b_{k+1}}, \beta^{b_k}_{b_{k+1}}, \gamma^{b_k}_{b_{k+1}} with respect to the biases can be written as:

\alpha^{b_k}_{b_{k+1}} \approx \hat{\alpha}^{b_k}_{b_{k+1}} + J^\alpha_{b_a} \delta b_{a_k} + J^\alpha_{b_w} \delta b_{w_k}
\beta^{b_k}_{b_{k+1}} \approx \hat{\beta}^{b_k}_{b_{k+1}} + J^\beta_{b_a} \delta b_{a_k} + J^\beta_{b_w} \delta b_{w_k}
\gamma^{b_k}_{b_{k+1}} \approx \hat{\gamma}^{b_k}_{b_{k+1}} \otimes \begin{bmatrix} 1 \\ \frac{1}{2} J^\gamma_{b_w} \delta b_{w_k} \end{bmatrix},    (12)

where J^\alpha_{b_a} is the sub-block matrix in J_{b_{k+1}} whose location corresponds to \delta\alpha^{b_k}_{b_{k+1}} and \delta b_{a_k}. The same meaning also applies to J^\alpha_{b_w}, J^\beta_{b_a}, J^\beta_{b_w}, and J^\gamma_{b_w}. When the estimate of the bias changes slightly, we use (12) to correct the pre-integration results approximately instead of re-propagation.
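The sketch below (our illustration, using the Euler integration of (7); the released code uses mid-point integration) shows how the pre-integration means are accumulated between two image frames and how the first-order bias correction of (12) is applied. The dictionary keys for the Jacobian sub-blocks are hypothetical; in practice they come from the recursion in (11).

```python
import numpy as np

def quat_mult(q, p):
    qw, qx, qy, qz = q; pw, px, py, pz = p
    return np.array([qw*pw - qx*px - qy*py - qz*pz,
                     qw*px + qx*pw + qy*pz - qz*py,
                     qw*py - qx*pz + qy*pw + qz*px,
                     qw*pz + qx*py - qy*px + qz*pw])

def quat_to_rot(q):
    w, x, y, z = q
    return np.array([[1-2*(y*y+z*z), 2*(x*y-w*z),   2*(x*z+w*y)],
                     [2*(x*y+w*z),   1-2*(x*x+z*z), 2*(y*z-w*x)],
                     [2*(x*z-w*y),   2*(y*z+w*x),   1-2*(x*x+y*y)]])

def preintegrate(accs, gyrs, dt, ba, bw):
    """Accumulate alpha, beta, gamma between two image frames (noise set to zero)."""
    alpha, beta = np.zeros(3), np.zeros(3)
    gamma = np.array([1.0, 0.0, 0.0, 0.0])
    for a_hat, w_hat in zip(accs, gyrs):
        acc = quat_to_rot(gamma) @ (a_hat - ba)   # rotate bias-corrected acceleration
        alpha += beta * dt + 0.5 * acc * dt**2    # uses beta from the previous step
        beta  += acc * dt
        dq = np.concatenate(([1.0], 0.5 * (w_hat - bw) * dt))
        gamma = quat_mult(gamma, dq)
        gamma /= np.linalg.norm(gamma)            # keep the quaternion normalized
    return alpha, beta, gamma

def correct_bias(alpha, beta, gamma, J, dba, dbw):
    """First-order correction (12) when the bias estimate moves by (dba, dbw);
    J holds the relevant sub-blocks of the pre-integration Jacobian."""
    alpha_c = alpha + J['alpha_ba'] @ dba + J['alpha_bw'] @ dbw
    beta_c  = beta  + J['beta_ba']  @ dba + J['beta_bw']  @ dbw
    dq = np.concatenate(([1.0], 0.5 * (J['gamma_bw'] @ dbw)))
    gamma_c = quat_mult(gamma, dq / np.linalg.norm(dq))
    return alpha_c, beta_c, gamma_c
```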

Now we are able to write down the IMU measurement model with its corresponding covariance P^{b_k}_{b_{k+1}}:

\begin{bmatrix} \hat{\alpha}^{b_k}_{b_{k+1}} \\ \hat{\beta}^{b_k}_{b_{k+1}} \\ \hat{\gamma}^{b_k}_{b_{k+1}} \\ 0 \\ 0 \end{bmatrix} =
\begin{bmatrix} R^{b_k}_w \left( p^w_{b_{k+1}} - p^w_{b_k} + \frac{1}{2} g^w \Delta t_k^2 - v^w_{b_k} \Delta t_k \right) \\ R^{b_k}_w \left( v^w_{b_{k+1}} + g^w \Delta t_k - v^w_{b_k} \right) \\ (q^w_{b_k})^{-1} \otimes q^w_{b_{k+1}} \\ b_{a_{b_{k+1}}} - b_{a_{b_k}} \\ b_{w_{b_{k+1}}} - b_{w_{b_k}} \end{bmatrix}.    (13)

V. ESTIMATOR INITIALIZATION

Monocular tightly-coupled visual-inertial odometry is a highly nonlinear system. Since the scale is not directly observable from a monocular camera, it is hard to directly fuse these two measurements without good initial values. One may assume a stationary initial condition to start the monocular VINS estimator. However, this assumption is inappropriate, as initialization under motion is frequently encountered in real-world applications. The situation becomes even more complicated when IMU measurements are corrupted by large bias. In fact, initialization is usually the most fragile step for monocular VINS. A robust initialization procedure is needed to ensure the applicability of the system.

We adopt a loosely-coupled sensor fusion method to get initial values. We find that vision-only SLAM, or Structure from Motion (SfM), has a good property for initialization. In most cases, a visual-only system can bootstrap itself from initial values derived from relative motion methods, such as the eight-point [32] or five-point [33] algorithms, or from estimating homography matrices. By aligning metric IMU pre-integration with the visual-only SfM results, we can roughly recover scale, gravity, velocity, and even bias. This is sufficient for bootstrapping a nonlinear monocular VINS estimator, as shown in Fig. 4.

Fig. 4. An illustration of the visual-inertial alignment process for estimator initialization.

In contrast to [17], which simultaneously estimates gyroscope and accelerometer bias during the initialization phase, we choose to ignore accelerometer bias terms in the initial step. Accelerometer bias is coupled with gravity, and due to the large magnitude of the gravity vector compared to platform dynamics, and the relatively short duration of the initialization phase, these bias terms are hard to observe. A detailed analysis of accelerometer bias calibration is presented in our previous work [34].

A. Sliding Window Vision-Only SfM

The initialization procedure starts with a vision-only SfM to estimate a graph of up-to-scale camera poses and feature positions.

We maintain a sliding window of frames for bounded computational complexity. First, we check feature correspondences between the latest frame and all previous frames. If we can find stable feature tracking (more than 30 tracked features) and sufficient parallax (more than 20 rotation-compensated pixels) between the latest frame and any other frame in the sliding window, we recover the relative rotation and up-to-scale translation between these two frames using the five-point algorithm [33]. Otherwise, we keep the latest frame in the window and wait for new frames. If the five-point algorithm succeeds, we arbitrarily set the scale and triangulate all features observed in these two frames. Based on these triangulated features, a perspective-n-point (PnP) method [35] is performed to estimate the poses of all other frames in the window. Finally, a global full bundle adjustment [36] is applied to minimize the total reprojection error of all feature observations. Since we do not yet have any knowledge about the world frame, we set the first camera frame (·)^{c_0} as the reference frame for SfM. All frame poses (\bar{p}^{c_0}_{c_k}, q^{c_0}_{c_k}) and feature positions are represented with respect to (·)^{c_0}. Suppose we have a rough measure of the extrinsic parameters (p^b_c, q^b_c) between the camera and the IMU; we can translate poses from the camera frame to the body (IMU) frame:

q^{c_0}_{b_k} = q^{c_0}_{c_k} \otimes (q^b_c)^{-1}
s \bar{p}^{c_0}_{b_k} = s \bar{p}^{c_0}_{c_k} - R^{c_0}_{b_k} p^b_c,    (14)

where s is the scaling parameter that aligns the visual structure to the metric scale. Solving this scaling parameter is the key to achieving successful initialization.

B. Visual-Inertial Alignment

1) Gyroscope Bias Calibration: Consider two consecutive frames b_k and b_{k+1} in the window; we have the rotations q^{c_0}_{b_k} and q^{c_0}_{b_{k+1}} from the visual SfM, as well as the relative constraint \hat{\gamma}^{b_k}_{b_{k+1}} from IMU pre-integration. We linearize the IMU pre-integration term with respect to the gyroscope bias and minimize the following cost function:

\min_{\delta b_w} \sum_{k \in \mathcal{B}} \left\| (q^{c_0}_{b_{k+1}})^{-1} \otimes q^{c_0}_{b_k} \otimes \gamma^{b_k}_{b_{k+1}} \right\|^2
\gamma^{b_k}_{b_{k+1}} \approx \hat{\gamma}^{b_k}_{b_{k+1}} \otimes \begin{bmatrix} 1 \\ \frac{1}{2} J^\gamma_{b_w} \delta b_w \end{bmatrix},    (15)

where \mathcal{B} indexes all frames in the window. We have the first-order approximation of \hat{\gamma}^{b_k}_{b_{k+1}} with respect to the gyroscope bias using the bias Jacobian derived in Sect. IV-B. In this way, we get an initial calibration of the gyroscope bias b_w. Then we re-propagate all IMU pre-integration terms \hat{\alpha}^{b_k}_{b_{k+1}}, \hat{\beta}^{b_k}_{b_{k+1}}, and \hat{\gamma}^{b_k}_{b_{k+1}} using the new gyroscope bias.
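As a concrete illustration of (15), the sketch below (ours, not the paper's implementation) accumulates a small 3×3 normal equation over all consecutive frame pairs and solves for the gyroscope bias increment; the residual is the small-angle reading of the quaternion mismatch between the SfM relative rotation and the pre-integrated \gamma.

```python
import numpy as np

def quat_mult(q, p):
    qw, qx, qy, qz = q; pw, px, py, pz = p
    return np.array([qw*pw - qx*px - qy*py - qz*pz,
                     qw*px + qx*pw + qy*pz - qz*py,
                     qw*py - qx*pz + qy*pw + qz*px,
                     qw*pz + qx*py - qy*px + qz*pw])

def quat_conj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def calibrate_gyro_bias(frame_pairs):
    """frame_pairs: list of (q_c0_bk, q_c0_bk1, gamma_hat, J_gamma_bw) tuples,
    one per consecutive frame pair in the window. Returns the bias increment."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for q_c0_bk, q_c0_bk1, gamma_hat, J_gamma_bw in frame_pairs:
        # relative rotation between the two body frames from the vision-only SfM
        q_rel = quat_mult(quat_conj(q_c0_bk), q_c0_bk1)
        # rotation error between pre-integration and SfM, read as a small angle
        q_err = quat_mult(quat_conj(gamma_hat), q_rel)
        residual = 2.0 * q_err[1:]               # ~ J_gamma_bw @ delta_bw
        A += J_gamma_bw.T @ J_gamma_bw
        b += J_gamma_bw.T @ residual
    return np.linalg.solve(A, b)                 # delta_bw added to the bias guess
```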

2) Velocity, Gravity Vector, and Metric Scale Initialization: After the gyroscope bias is initialized, we move on to initializing the other essential states for navigation, namely velocity, gravity vector, and metric scale:

\mathcal{X}_I = \left[ v^{b_0}_{b_0}, v^{b_1}_{b_1}, \cdots, v^{b_n}_{b_n}, g^{c_0}, s \right],    (16)

where v^{b_k}_{b_k} is the velocity in the body frame while taking the k-th image, g^{c_0} is the gravity vector in the c_0 frame, and s scales the monocular SfM to metric units.

Consider two consecutive frames b_k and b_{k+1} in the window; then (5) can be written as:

\alpha^{b_k}_{b_{k+1}} = R^{b_k}_{c_0} \left( s(\bar{p}^{c_0}_{b_{k+1}} - \bar{p}^{c_0}_{b_k}) + \frac{1}{2} g^{c_0} \Delta t_k^2 - R^{c_0}_{b_k} v^{b_k}_{b_k} \Delta t_k \right)
\beta^{b_k}_{b_{k+1}} = R^{b_k}_{c_0} \left( R^{c_0}_{b_{k+1}} v^{b_{k+1}}_{b_{k+1}} + g^{c_0} \Delta t_k - R^{c_0}_{b_k} v^{b_k}_{b_k} \right).    (17)

We can combine (14) and (17) into the following linear measurement model:

\hat{z}^{b_k}_{b_{k+1}} = \begin{bmatrix} \hat{\alpha}^{b_k}_{b_{k+1}} - p^b_c + R^{b_k}_{c_0} R^{c_0}_{b_{k+1}} p^b_c \\ \hat{\beta}^{b_k}_{b_{k+1}} \end{bmatrix} = H^{b_k}_{b_{k+1}} \mathcal{X}_I + n^{b_k}_{b_{k+1}},    (18)

where

H^{b_k}_{b_{k+1}} = \begin{bmatrix} -I \Delta t_k & 0 & \frac{1}{2} R^{b_k}_{c_0} \Delta t_k^2 & R^{b_k}_{c_0} (\bar{p}^{c_0}_{c_{k+1}} - \bar{p}^{c_0}_{c_k}) \\ -I & R^{b_k}_{c_0} R^{c_0}_{b_{k+1}} & R^{b_k}_{c_0} \Delta t_k & 0 \end{bmatrix}.    (19)

It can be seen that R^{b_k}_{c_0}, R^{b_{k+1}}_{c_0}, \bar{p}^{c_0}_{c_k}, and \bar{p}^{c_0}_{c_{k+1}} are obtained from the up-to-scale monocular visual SfM. \Delta t_k is the time interval between two consecutive frames. By solving this linear least-squares problem:

\min_{\mathcal{X}_I} \sum_{k \in \mathcal{B}} \left\| \hat{z}^{b_k}_{b_{k+1}} - H^{b_k}_{b_{k+1}} \mathcal{X}_I \right\|^2,    (20)

we can get the body frame velocities for every frame in the window, the gravity vector in the visual reference frame (·)^{c_0}, as well as the scale parameter.

3) Gravity Refinement: The gravity vector obtained from the previous linear initialization step can be refined by constraining its magnitude. In most cases, the magnitude of the gravity vector is known. This results in only 2 DOF remaining for the gravity vector. We therefore re-parameterize the gravity with two variables on its tangent space. Our parameterization represents the gravity vector as g \cdot \hat{\bar{g}} + w_1 b_1 + w_2 b_2, where g is the known magnitude of gravity, \hat{\bar{g}} is a unit vector representing the gravity direction, and b_1 and b_2 are two orthogonal bases spanning the tangent plane, as shown in Fig. 5. w_1 and w_2 are the corresponding displacements towards b_1 and b_2, respectively. We can find one set of b_1, b_2 by cross-product operations using Algorithm 1. Then we substitute g in (17) by g \cdot \hat{\bar{g}} + w_1 b_1 + w_2 b_2, and solve for w_1 and w_2 together with the other state variables. This process iterates until \hat{g} converges.

Fig. 5. Illustration of the 2 DOF parameterization of gravity. Since the magnitude of gravity is known, g lies on a sphere with radius g ≈ 9.81 m/s^2. The gravity is parameterized around the current estimate as g \cdot \hat{\bar{g}} + w_1 b_1 + w_2 b_2, where b_1 and b_2 are two orthogonal bases spanning the tangent space.

Algorithm 1: Finding b_1 and b_2
  if \hat{\bar{g}} ≠ [1, 0, 0] then
    b_1 ← normalize(\hat{\bar{g}} × [1, 0, 0]);
  else
    b_1 ← normalize(\hat{\bar{g}} × [0, 0, 1]);
  end
  b_2 ← \hat{\bar{g}} × b_1;

4) Completing Initialization: After refining the gravity vector, we can get the rotation q^w_{c_0} between the world frame and the camera frame c_0 by rotating the gravity to the z-axis. We then rotate all variables from the reference frame (·)^{c_0} to the world frame (·)^w. The body frame velocities will also be rotated to the world frame. Translational components from the visual SfM will be scaled to metric units. At this point, the initialization procedure is completed and all these metric values will be fed to the subsequent tightly-coupled monocular VIO.
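The sketch below (our illustration) mirrors the tangent-space parameterization used in the gravity refinement and Algorithm 1; the function names and the 9.81 m/s² constant are assumptions, and the degenerate-direction test is written the other way around while producing the same basis.

```python
import numpy as np

G_MAG = 9.81  # known gravity magnitude, m/s^2

def tangent_basis(g_hat):
    """Return b1, b2 orthogonal to the unit gravity direction g_hat (cf. Algorithm 1)."""
    g_bar = g_hat / np.linalg.norm(g_hat)
    ref = np.array([1.0, 0.0, 0.0])
    if np.allclose(g_bar, ref):                 # degenerate case: pick another axis
        ref = np.array([0.0, 0.0, 1.0])
    b1 = np.cross(g_bar, ref)
    b1 /= np.linalg.norm(b1)
    b2 = np.cross(g_bar, b1)
    return b1, b2

def reparameterized_gravity(g_hat, w1, w2):
    """g ≈ G_MAG * g_bar + w1*b1 + w2*b2, with (w1, w2) solved in the linear system."""
    g_bar = g_hat / np.linalg.norm(g_hat)
    b1, b2 = tangent_basis(g_bar)
    return G_MAG * g_bar + w1 * b1 + w2 * b2

# In the refinement loop, (w1, w2) replace the free 3D gravity in (17)-(20),
# the linear system is re-solved, and g_hat is updated until it converges.
```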

VI. TIGHTLY-COUPLED MONOCULAR VIO

After estimator initialization, we proceed with a sliding window-based tightly-coupled monocular VIO for high-accuracy and robust state estimation. An illustration of the sliding window formulation is shown in Fig. 3.

A. Formulation

The full state vector in the sliding window is defined as:

\mathcal{X} = \left[ x_0, x_1, \cdots, x_n, x^b_c, \lambda_0, \lambda_1, \cdots, \lambda_m \right]
x_k = \left[ p^w_{b_k}, v^w_{b_k}, q^w_{b_k}, b_a, b_g \right], \quad k \in [0, n]
x^b_c = \left[ p^b_c, q^b_c \right],    (21)

where x_k is the IMU state at the time the k-th image is captured. It contains the position, velocity, and orientation of the IMU in the world frame, and the acceleration bias and gyroscope bias in the IMU body frame. n is the total number of keyframes, and m is the total number of features in the sliding window. \lambda_l is the inverse depth of the l-th feature from its first observation.

We use a visual-inertial bundle adjustment formulation. We minimize the sum of the prior and the Mahalanobis norms of all measurement residuals to obtain a maximum a posteriori estimation:

\min_{\mathcal{X}} \left\{ \left\| r_p - H_p \mathcal{X} \right\|^2 + \sum_{k \in \mathcal{B}} \left\| r_B(\hat{z}^{b_k}_{b_{k+1}}, \mathcal{X}) \right\|^2_{P^{b_k}_{b_{k+1}}} + \sum_{(l,j) \in \mathcal{C}} \rho\!\left( \left\| r_C(\hat{z}^{c_j}_l, \mathcal{X}) \right\|^2_{P^{c_j}_l} \right) \right\},    (22)

where the Huber norm [37] is defined as:

\rho(s) = \begin{cases} s, & s \leq 1 \\ 2\sqrt{s} - 1, & s > 1. \end{cases}    (23)

r_B(\hat{z}^{b_k}_{b_{k+1}}, \mathcal{X}) and r_C(\hat{z}^{c_j}_l, \mathcal{X}) are the residuals for IMU and visual measurements respectively. Detailed definitions of the residual terms will be presented in Sect. VI-B and Sect. VI-C. \mathcal{B} is the set of all IMU measurements, and \mathcal{C} is the set of features which have been observed at least twice in the current sliding window. \{r_p, H_p\} is the prior information from marginalization. Ceres Solver [38] is used for solving this nonlinear problem.

B. IMU Measurement Residual

Consider the IMU measurements within two consecutive frames b_k and b_{k+1} in the sliding window. According to the IMU measurement model defined in (13), the residual for the pre-integrated IMU measurement can be defined as:

r_B(\hat{z}^{b_k}_{b_{k+1}}, \mathcal{X}) =
\begin{bmatrix} \delta\alpha^{b_k}_{b_{k+1}} \\ \delta\beta^{b_k}_{b_{k+1}} \\ \delta\theta^{b_k}_{b_{k+1}} \\ \delta b_a \\ \delta b_g \end{bmatrix} =
\begin{bmatrix}
R^{b_k}_w \left( p^w_{b_{k+1}} - p^w_{b_k} + \frac{1}{2} g^w \Delta t_k^2 - v^w_{b_k} \Delta t_k \right) - \hat{\alpha}^{b_k}_{b_{k+1}} \\
R^{b_k}_w \left( v^w_{b_{k+1}} + g^w \Delta t_k - v^w_{b_k} \right) - \hat{\beta}^{b_k}_{b_{k+1}} \\
2 \left[ (q^w_{b_k})^{-1} \otimes q^w_{b_{k+1}} \otimes (\hat{\gamma}^{b_k}_{b_{k+1}})^{-1} \right]_{xyz} \\
b_{a_{b_{k+1}}} - b_{a_{b_k}} \\
b_{w_{b_{k+1}}} - b_{w_{b_k}}
\end{bmatrix},    (24)

where [·]_{xyz} extracts the vector part of a quaternion for error-state representation, and \delta\theta^{b_k}_{b_{k+1}} is the three-dimensional error-state representation of the quaternion. [\hat{\alpha}^{b_k}_{b_{k+1}}, \hat{\beta}^{b_k}_{b_{k+1}}, \hat{\gamma}^{b_k}_{b_{k+1}}]^T are the pre-integrated IMU measurement terms using only noisy accelerometer and gyroscope measurements within the time interval between two consecutive image frames. Accelerometer and gyroscope biases are also included in the residual terms for online correction.

C. Visual Measurement Residual

In contrast to traditional pinhole camera models that define the reprojection error on a generalized image plane, we define the camera measurement residual on a unit sphere. The optics for almost all types of cameras, including wide-angle, fisheye or omnidirectional cameras, can be modeled as unit rays connecting to the surface of a unit sphere. Consider the l-th feature that is first observed in the i-th image; the residual for the feature observation in the j-th image is defined as:

r_C(\hat{z}^{c_j}_l, \mathcal{X}) = \left[ b_1 \;\; b_2 \right]^T \cdot \left( \hat{\bar{P}}^{c_j}_l - \frac{P^{c_j}_l}{\| P^{c_j}_l \|} \right)
\hat{\bar{P}}^{c_j}_l = \pi_c^{-1} \!\left( \begin{bmatrix} \hat{u}^{c_j}_l \\ \hat{v}^{c_j}_l \end{bmatrix} \right)
P^{c_j}_l = R^c_b \left( R^{b_j}_w \left( R^w_{b_i} \left( R^b_c \, \frac{1}{\lambda_l} \pi_c^{-1} \!\left( \begin{bmatrix} u^{c_i}_l \\ v^{c_i}_l \end{bmatrix} \right) + p^b_c \right) + p^w_{b_i} - p^w_{b_j} \right) - p^b_c \right),    (25)

where [u^{c_i}_l, v^{c_i}_l] is the first observation of the l-th feature in the i-th image, [\hat{u}^{c_j}_l, \hat{v}^{c_j}_l] is the observation of the same feature in the j-th image, and \pi_c^{-1} is the back-projection function which turns a pixel location into a unit vector using the camera intrinsic parameters. Since the vision residual has two degrees of freedom, we project the residual vector onto the tangent plane. b_1 and b_2 are two arbitrarily selected orthogonal bases that span the tangent plane of \hat{\bar{P}}^{c_j}_l, as shown in Fig. 6. We can find one set of b_1, b_2 easily, as shown in Algorithm 1. P^{c_j}_l, as used in (22), is the standard covariance of a fixed length in the tangent space.

Fig. 6. An illustration of the visual residual on a unit sphere. \hat{\bar{P}}^{c_j}_l is the unit vector for the observation of the l-th feature in the j-th frame. P^{c_j}_l is the predicted feature measurement on the unit sphere, obtained by transforming its first observation in the i-th frame to the j-th frame. The residual is defined on the tangent plane of \hat{\bar{P}}^{c_j}_l.
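The following sketch (our illustration) evaluates the unit-sphere visual residual of (25) for one feature: back-project the first observation with its inverse depth, transform it through body and world frames into camera j, and compare it with the measured bearing on the tangent plane of the measurement.

```python
import numpy as np

def tangent_basis(n):
    """Two orthogonal bases spanning the tangent plane of unit vector n."""
    ref = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 0.0, 1.0])
    b1 = np.cross(n, ref); b1 /= np.linalg.norm(b1)
    b2 = np.cross(n, b1)
    return b1, b2

def visual_residual(bearing_i, bearing_j, inv_depth,
                    R_wb_i, p_wb_i, R_wb_j, p_wb_j, R_bc, p_bc):
    """bearing_i / bearing_j: unit back-projected observations in cameras i / j.
    (R_wb, p_wb): body-to-world pose; (R_bc, p_bc): camera-to-body extrinsic."""
    # feature in camera i, then IMU i, world, IMU j, and finally camera j
    f_ci = bearing_i / inv_depth
    f_bi = R_bc @ f_ci + p_bc
    f_w  = R_wb_i @ f_bi + p_wb_i
    f_bj = R_wb_j.T @ (f_w - p_wb_j)
    f_cj = R_bc.T @ (f_bj - p_bc)

    pred = f_cj / np.linalg.norm(f_cj)           # predicted bearing on the unit sphere
    b1, b2 = tangent_basis(bearing_j)
    err = bearing_j - pred
    return np.array([b1 @ err, b2 @ err])        # 2-DOF residual on the tangent plane
```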

D. Marginalization

In order to bound the computational complexity of our optimization-based VIO, marginalization is incorporated. We selectively marginalize out IMU states x_k and features \lambda_l from the sliding window, and meanwhile convert measurements corresponding to marginalized states into a prior.

As shown in Fig. 7, when the second latest frame is a keyframe, it will stay in the window, and the oldest frame is marginalized out with its corresponding measurements. Otherwise, if the second latest frame is a non-keyframe, we throw away its visual measurements and keep the IMU measurements that connect to this non-keyframe. We do not marginalize out all measurements for non-keyframes in order to maintain the sparsity of the system. Our marginalization scheme aims to keep spatially separated keyframes in the window. This ensures sufficient parallax for feature triangulation, and maximizes the probability of maintaining accelerometer measurements with large excitation.

Fig. 7. An illustration of our marginalization strategy. If the second latest frame is a keyframe, we will keep it in the window, and marginalize the oldest frame and its corresponding visual and inertial measurements. Marginalized measurements are turned into a prior. If the second latest frame is not a keyframe, we will simply remove the frame and all its corresponding visual measurements. However, pre-integrated inertial measurements are kept for non-keyframes, and the pre-integration process is continued towards the next frame.

The marginalization is carried out using the Schur complement [39]. We construct a new prior based on all marginalized measurements related to the removed state. The new prior is added onto the existing prior.

We do note that marginalization results in early fixation of linearization points, which may result in suboptimal estimation results. However, since small drift is acceptable for VIO, we argue that the negative impact caused by marginalization is not critical.

E. Motion-only Visual-Inertial Bundle Adjustment for Camera-Rate State Estimation

For devices with low computational power, such as mobile phones, the tightly-coupled monocular VIO cannot achieve camera-rate outputs due to the heavy computation demands of the nonlinear optimization. To this end, we employ a lightweight motion-only visual-inertial bundle adjustment to boost the state estimation to camera rate (~30 Hz).

The cost function for the motion-only visual-inertial bundle adjustment is the same as the one for monocular VIO in (22). However, instead of optimizing all states in the sliding window, we only optimize the poses and velocities of a fixed number of the latest IMU states. We treat feature depth, extrinsic parameters, bias, and old IMU states that we do not want to optimize as constant values. We do use all visual and inertial measurements for the motion-only bundle adjustment. This results in much smoother state estimates than single-frame PnP methods. An illustration of the proposed strategy is shown in Fig. 8. In contrast to the full tightly-coupled monocular VIO, which may take more than 50 ms on state-of-the-art embedded computers, the motion-only visual-inertial bundle adjustment only takes about 5 ms to compute. This enables low-latency camera-rate pose estimation that is particularly beneficial for drone and AR applications.

Fig. 8. An illustration of motion-only bundle adjustment for camera-rate outputs.

F. IMU Forward Propagation for IMU-Rate State Estimation

IMU measurements come at a much higher rate than visual measurements. Although the frequency of our VIO is limited by the image capture frequency, we can still directly propagate the latest VIO estimate with the set of most recent IMU measurements to achieve IMU-rate performance. The high-frequency state estimates can be utilized as state feedback for closed-loop control. An autonomous flight experiment utilizing these IMU-rate state estimates is presented in Sect. IX-D.

G. Failure Detection and Recovery

Although our tightly-coupled monocular VIO is robust to various challenging environments and motions, failure is still unavoidable due to violent illumination change or severely aggressive motion. An active failure detection and recovery strategy can improve the practicability of the proposed system. Failure detection is an independent module that detects unusual output from the estimator. We are currently using the following criteria for failure detection (a small sketch of these checks is given after this list):
• The number of features being tracked in the latest frame is less than a certain threshold;
• Large discontinuity in position or rotation between the last two estimator outputs;
• Large change in the bias or extrinsic parameter estimates.
Once a failure is detected, the system switches back to the initialization phase. Once the monocular VIO is successfully initialized, a new and separate segment of the pose graph will be created.
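A minimal sketch of the failure-detection checks listed above; all thresholds are assumptions chosen for illustration, not the values used by VINS-Mono.

```python
import numpy as np

MIN_TRACKED_FEATURES = 20
MAX_POSITION_JUMP = 1.0        # meters between consecutive estimator outputs
MAX_YAW_JUMP = np.deg2rad(30)  # radians between consecutive estimator outputs
MAX_ACC_BIAS = 2.5             # m/s^2
MAX_GYRO_BIAS = 1.0            # rad/s

def is_failure(num_tracked, p_prev, p_cur, yaw_prev, yaw_cur, acc_bias, gyro_bias):
    if num_tracked < MIN_TRACKED_FEATURES:
        return True                                    # feature tracking broke down
    if np.linalg.norm(p_cur - p_prev) > MAX_POSITION_JUMP:
        return True                                    # position discontinuity
    if abs((yaw_cur - yaw_prev + np.pi) % (2 * np.pi) - np.pi) > MAX_YAW_JUMP:
        return True                                    # rotation discontinuity
    if np.linalg.norm(acc_bias) > MAX_ACC_BIAS or np.linalg.norm(gyro_bias) > MAX_GYRO_BIAS:
        return True                                    # bias estimate diverged
    return False

# When is_failure(...) returns True, the estimator is re-initialized and a new
# segment of the pose graph is started, as described in Sect. VI-G.
```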

VII. RELOCALIZATION

Our sliding window and marginalization scheme bounds the computation complexity, but it also introduces accumulated drift for the system. To be more specific, drift occurs in the global 3D position (x, y, z) and the rotation around the gravity direction (yaw). To eliminate drift, a tightly-coupled relocalization module that seamlessly integrates with the monocular VIO is proposed. The relocalization process starts with a loop detection module that identifies places that have already been visited. Feature-level connections between loop closure candidates and the current frame are then established. These feature correspondences are tightly integrated into the monocular VIO module, resulting in drift-free state estimates with minimum computation overhead. Multiple observations of multiple features are directly used for relocalization, resulting in higher accuracy and better state estimation smoothness. A graphical illustration of the relocalization procedure is shown in Fig. 9(a).

Fig. 9. A diagram illustrating the relocalization and pose graph optimization procedure. Fig. 9(a) shows the relocalization procedure. It starts with VIO-only pose estimates (blue). Past states are recorded (green). If a loop is detected for the newest keyframe (Sect. VII-A), as shown by the red line in the second plot, a relocalization occurs. Note that due to the use of feature-level correspondences for relocalization, we are able to incorporate loop closure constraints from multiple past keyframes (Sect. VII-C), as indicated in the last three plots. The pose graph optimization is illustrated in Fig. 9(b). A keyframe is added into the pose graph when it is marginalized out from the sliding window. If there is a loop between this keyframe and any other past keyframe, the loop closure constraints, formulated as 4-DOF relative rigid body transforms, will also be added to the pose graph. The pose graph is optimized using all relative pose constraints (Sect. VIII-B) in a separate thread, and the relocalization module always runs with respect to the newest pose graph configuration. (a) Relocalization. (b) Global pose graph optimization.

A. Loop Detection

We utilize DBoW2 [6], a state-of-the-art bag-of-words place recognition approach, for loop detection. In addition to the corner features that are used for the monocular VIO, 500 more corners are detected and described by the BRIEF descriptor [40]. The additional corner features are used to achieve a better recall rate on loop detection. Descriptors are treated as the visual words to query the visual database. DBoW2 returns loop closure candidates after temporal and geometrical consistency checks. We keep all BRIEF descriptors for feature retrieval, but discard the raw image to reduce memory consumption.

We note that our monocular VIO is able to render roll and pitch angles observable. As such, we do not need to rely on rotation-invariant features, such as the ORB feature used in ORB-SLAM [4].

B. Feature Retrieval

When a loop is detected, the connection between the local sliding window and the loop closure candidate is established by retrieving feature correspondences. Correspondences are found by BRIEF descriptor matching. Direct descriptor matching may cause a large number of outliers. To this end, we use a two-step geometric outlier rejection, as illustrated in Fig. 10:
• 2D-2D: fundamental matrix test with RANSAC [31]. We use the 2D observations of retrieved features in the current image and the loop closure candidate image to perform the fundamental matrix test.
• 3D-2D: PnP test with RANSAC [35]. Based on the known 3D positions of features in the local sliding window and their 2D observations in the loop closure candidate image, we perform the PnP test.
When the number of inliers is beyond a certain threshold, we treat this candidate as a correct loop detection and perform relocalization.

Fig. 10. Descriptor matching and outlier removal for feature retrieval during loop closure. (a) BRIEF descriptor matching results. (b) First step: 2D-2D outlier rejection results. (c) Second step: 3D-2D outlier rejection results.
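The sketch below (our illustration, built on OpenCV) mirrors this two-step geometric verification: a 2D-2D fundamental-matrix test with RANSAC followed by a 3D-2D PnP test with RANSAC. The inlier threshold and RANSAC parameters are assumptions for the sketch.

```python
import cv2
import numpy as np

MIN_INLIERS = 25  # assumed acceptance threshold

def verify_loop_candidate(pts_cur_2d, pts_old_2d, pts_cur_3d, K):
    """pts_cur_2d / pts_old_2d: Nx2 matched pixels in the current / candidate image.
    pts_cur_3d: Nx3 feature positions known from the local sliding window.
    K: 3x3 camera intrinsic matrix. Returns (accepted, inlier_indices)."""
    pts_cur_2d = np.asarray(pts_cur_2d, dtype=np.float32)
    pts_old_2d = np.asarray(pts_old_2d, dtype=np.float32)
    pts_cur_3d = np.asarray(pts_cur_3d, dtype=np.float32)
    if len(pts_cur_2d) < 8:
        return False, None

    # Step 1: 2D-2D fundamental-matrix test with RANSAC.
    _, mask_f = cv2.findFundamentalMat(pts_cur_2d, pts_old_2d, cv2.FM_RANSAC, 3.0, 0.99)
    if mask_f is None:
        return False, None
    keep = mask_f.reshape(-1).astype(bool)
    if keep.sum() < 6:
        return False, None

    # Step 2: 3D-2D PnP test with RANSAC on the surviving matches.
    ok, _, _, inliers = cv2.solvePnPRansac(pts_cur_3d[keep], pts_old_2d[keep], K, None,
                                           reprojectionError=8.0)
    if not ok or inliers is None:
        return False, None
    return len(inliers) >= MIN_INLIERS, inliers
```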

C. Tightly-Coupled Relocalization

The relocalization process effectively aligns the current sliding window maintained by the monocular VIO (Sect. VI) to the graph of past poses. During relocalization, we treat the poses of all loop closure frames as constants. We jointly optimize the sliding window using all IMU measurements, local visual measurements, and retrieved feature correspondences from loop closure. We can easily write the visual measurement model for retrieved features observed by a loop closure frame v to be the same as that for visual measurements in VIO, as shown in (25). The only difference is that the pose (\hat{q}^w_v, \hat{p}^w_v) of the loop closure frame, which is taken from the pose graph (Sect. VIII), or directly from past odometry output (if this is the first relocalization), is treated as a constant. To this end, we can slightly modify the nonlinear cost function in (22) with additional loop terms:

\min_{\mathcal{X}} \left\{ \left\| r_p - H_p \mathcal{X} \right\|^2 + \sum_{k \in \mathcal{B}} \left\| r_B(\hat{z}^{b_k}_{b_{k+1}}, \mathcal{X}) \right\|^2_{P^{b_k}_{b_{k+1}}} + \sum_{(l,j) \in \mathcal{C}} \rho\!\left( \left\| r_C(\hat{z}^{c_j}_l, \mathcal{X}) \right\|^2_{P^{c_j}_l} \right) + \sum_{(l,v) \in \mathcal{L}} \rho\!\left( \left\| r_C(\hat{z}^v_l, \mathcal{X}, \hat{q}^w_v, \hat{p}^w_v) \right\|^2_{P^{c_v}_l} \right) \right\},    (26)

where \mathcal{L} is the set of observations of retrieved features in the loop closure frames, and (l, v) means the l-th feature observed in the loop closure frame v. Note that although the cost function is slightly different from (22), the dimension of the states to be solved remains the same, as the poses of loop closure frames are considered as constants. When multiple loop closures are established with the current sliding window, we optimize using all loop closure feature correspondences from all frames at the same time. This gives multi-view constraints for relocalization, resulting in higher accuracy and better smoothness. Note that the global optimization of past poses and loop closure frames happens after relocalization, and will be discussed in Sect. VIII.

VIII. GLOBAL POSE GRAPH OPTIMIZATION

After relocalization, the local sliding window shifts and aligns with past poses. Utilizing the relocalization results, this additional pose graph optimization step is developed to ensure that the set of past poses is registered into a globally consistent configuration.

Since our visual-inertial setup renders roll and pitch angles fully observable, the accumulated drift only occurs in four degrees-of-freedom (x, y, z and the yaw angle). To this end, we avoid estimating the drift-free roll and pitch states, and only perform 4-DOF pose graph optimization.

A. Adding Keyframes into the Pose Graph

When a keyframe is marginalized out from the sliding window, it will be added to the pose graph. This keyframe serves as a vertex in the pose graph, and it connects with other vertices by two types of edges:

1) Sequential Edge: a keyframe establishes several sequential edges to its previous keyframes. A sequential edge represents the relative transformation between two keyframes in the local sliding window, whose value is taken directly from VIO. Considering a newly marginalized keyframe i and one of its previous keyframes j, the sequential edge only contains the relative position \hat{p}^i_{ij} and yaw angle \hat{\psi}_{ij}:

\hat{p}^i_{ij} = (\hat{R}^w_i)^{-1} (\hat{p}^w_j - \hat{p}^w_i)
\hat{\psi}_{ij} = \hat{\psi}_j - \hat{\psi}_i.    (27)

2) Loop Closure Edge: if the newly marginalized keyframe has a loop connection, it will be connected with the loop closure frame by a loop closure edge in the pose graph. Similarly, the loop closure edge only contains a 4-DOF relative pose transform that is defined the same as (27). The value of the loop closure edge is obtained using results from relocalization.

B. 4-DOF Pose Graph Optimization

We define the residual of the edge between frames i and j minimally as:

r_{i,j}(p^w_i, \psi_i, p^w_j, \psi_j) = \begin{bmatrix} R(\hat{\phi}_i, \hat{\theta}_i, \psi_i)^{-1} (p^w_j - p^w_i) - \hat{p}^i_{ij} \\ \psi_j - \psi_i - \hat{\psi}_{ij} \end{bmatrix},    (28)

where \hat{\phi}_i, \hat{\theta}_i are the estimated roll and pitch angles, which are obtained directly from the monocular VIO.

The whole graph of sequential edges and loop closure edges is optimized by minimizing the following cost function:

\min_{p, \psi} \left\{ \sum_{(i,j) \in \mathcal{S}} \| r_{i,j} \|^2 + \sum_{(i,j) \in \mathcal{L}} \rho(\| r_{i,j} \|^2) \right\},    (29)

where \mathcal{S} is the set of all sequential edges and \mathcal{L} is the set of all loop closure edges. Although the tightly-coupled relocalization already helps with eliminating wrong loop closures, we add another Huber norm \rho(·) to further reduce the impact of any possible wrong loops. In contrast, we do not use any robust norms for sequential edges, as these edges are extracted from VIO, which already contains sufficient outlier rejection mechanisms.

Pose graph optimization and relocalization (Sect. VII-C) run asynchronously in two separate threads. This enables immediate use of the most optimized pose graph for relocalization whenever it becomes available. Similarly, even if the current pose graph optimization is not completed yet, relocalization can still take place using the existing pose graph configuration. This process is illustrated in Fig. 9(b).

C. Pose Graph Management

The size of the pose graph may grow unbounded as the travel distance increases, limiting the real-time performance of the system in the long run. To this end, we implement a downsampling process to maintain the pose graph database at a limited size. All keyframes with loop closure constraints will be kept, while other keyframes that are either too close or have very similar orientations to their neighbors may be removed. The probability of a keyframe being removed is proportional to its spatial density relative to its neighbors.
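To make (27)-(28) concrete, the sketch below (our illustration) computes a sequential-edge measurement and the corresponding 4-DOF residual, holding roll and pitch of frame i at their VIO estimates. The Z-Y-X Euler convention is an assumption of the sketch.

```python
import numpy as np

def rot_zyx(yaw, pitch, roll):
    """Rotation matrix R(roll, pitch, yaw) with the Z-Y-X (yaw-pitch-roll) convention."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def edge_measurement(R_i, p_i, p_j, yaw_i, yaw_j):
    """Sequential-edge measurement (27), taken from VIO at marginalization time."""
    p_ij = R_i.T @ (p_j - p_i)
    psi_ij = yaw_j - yaw_i
    return p_ij, psi_ij

def edge_residual(p_i, yaw_i, p_j, yaw_j, roll_i, pitch_i, p_ij_meas, psi_ij_meas):
    """4-DOF residual (28); roll_i and pitch_i are fixed values from the VIO."""
    R_i = rot_zyx(yaw_i, pitch_i, roll_i)
    r_p = R_i.T @ (p_j - p_i) - p_ij_meas
    r_psi = (yaw_j - yaw_i - psi_ij_meas + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    return np.concatenate([r_p, [r_psi]])
```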

Trajectory in MH_03_median Trajectory in MH_05_difficult


8 12
VINS VINS
VINS_loop 10 VINS_loop
6 OKVIS_mono OKVIS_mono
OKVIS_stereo 8
OKVIS_stereo
Groud Truth Groud Truth
4 6

y [m]
y [m]

2
2

0 0

-2
-2
-4

-4 -6
-2 0 2 4 6 8 10 12 14 -5 0 5 10 15 20
x [m] x [m]

Fig. 11. Trajectory in MH 03 median, compared with OKVIS. Fig. 13. Trajectory in MH 05 difficult, compared with OKVIS..

VINS
Error in MH_03_median VINS_loop Error in MH_05_difficult VINS
1
2 VINS_loop
OKVIS_mono OKVIS_mono

Tranlation error [m]


OKVIS_stereo 1.5 OKVIS_stereo
x error [m]

0.5 1

0.5

0 0
0 20 40 60 80 100 120 140 0 10 20 30 40 50 60 70 80 90 100
time [s] distance [m]
4
1.5

°]
3
Rotation error [
y error [m]

1
2
0.5
1

0 0
0 20 40 60 80 100 120 140 0 10 20 30 40 50 60 70 80 90 100
time [s] distance [m]

0.4
Fig. 14. Translation error and rotation error plot in MH 05 difficult.
z error [m]

0.3
0.2
0.1
0
0 20 40 60 80 100 120 140
time [s]
in Fig. 11. We only compare translation error since rotated
1.5 motion is negligible in this sequence. The x, y, z error versus
Tranlation error [m]

1 time, and the translation error versus distance are shown in


0.5
Fig. 12. In the error plot, VINS-Mono with loop closure has
0
the smallest translation error. We observe similar results in
0 20 40 60
distance [m]
80 100 120 140
MH 05 difficult. The proposed method with loop function
has the smallest translation error. The translation and rotation
Fig. 12. Translation error plot in MH 03 median. errors are shown in Fig. 14. Since the movement is smooth
without much yaw angle change in this sequence, only position
drift occurs. Obviously, the loop closure capability efficiently
In these experiments, we compare VINS-Mono with bound the accumulated drifts. OKVIS performs better in roll
OKVIS [16], a state-of-the-art VIO that works with monoc- and pitch angle estimation. A possible reason is that VINS-
ular and stereo cameras. OKVIS is another optimization- Mono uses the pre-integration technique which is the first-
based sliding-window algorithm. Our algorithm is different order approximation of IMU propagation to save computation
with OKVIS in many details, as presented in the technical resource.
sections. Our system is complete with robust initialization
and loop closure. We use two sequences, MH 03 median VINS-Mono performs well in all EuRoC datasets, even
and MH 05 difficult, to show the performance of proposed in the most challenging sequence, V1 03 difficult, the one
method. To simplify the notation, we use VINS to denote includes aggressive motion, texture-less area, and signifi-
our approach with only monocular VIO, and VINS loop to cant illumination change. The proposed method can initialize
denote the complete version with relocalization and pose graph quickly in V1 03 difficult, due to the dedicated initialization
optimization. We use OKVIS mono and OKVIS stereo to procedure.
denote the OKVIS’s results using monocular and stereo images For pure VIO, both VINS-Mono and OKVIS have similar
respectively. For the fair comparison, we throw the first 100 accuracy, it is hard to distinguish which one is better. However,
outputs, and use the following 150 outputs to align with the VINS-Mono outperforms OKVIS at the system level. It is a
ground truth, and compare the remaining estimator outputs. complete system with robust initialization and loop closure
For the sequence MH 03 median, the trajectory is shown function to assist the monocular VIO.
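As a side note on the evaluation protocol above, one way to realize the alignment of the estimates to the ground truth before computing errors is a 4-DOF (yaw plus translation) least-squares fit between time-associated position pairs. The sketch below is our own illustration under that assumption, not part of the released code; the function name is hypothetical.

    // Minimal 4-DOF trajectory alignment: finds the yaw angle and translation
    // that best map the estimated positions onto the ground truth positions.
    #include <Eigen/Dense>
    #include <vector>
    #include <cmath>

    void AlignYawTranslation(const std::vector<Eigen::Vector3d>& est,
                             const std::vector<Eigen::Vector3d>& gt,
                             double* yaw, Eigen::Vector3d* t) {
      // Centroids of both associated point sets.
      Eigen::Vector3d mu_est = Eigen::Vector3d::Zero();
      Eigen::Vector3d mu_gt  = Eigen::Vector3d::Zero();
      for (size_t i = 0; i < est.size(); ++i) { mu_est += est[i]; mu_gt += gt[i]; }
      mu_est /= static_cast<double>(est.size());
      mu_gt  /= static_cast<double>(gt.size());

      // Only the x-y correlation terms matter because the rotation is
      // constrained to the z-axis (yaw).
      double a = 0.0, b = 0.0;
      for (size_t i = 0; i < est.size(); ++i) {
        const Eigen::Vector3d e = est[i] - mu_est;
        const Eigen::Vector3d g = gt[i] - mu_gt;
        a += g.x() * e.x() + g.y() * e.y();
        b += g.y() * e.x() - g.x() * e.y();
      }
      *yaw = std::atan2(b, a);
      // Translation that maps the rotated estimated centroid onto the ground truth.
      const Eigen::AngleAxisd Rz(*yaw, Eigen::Vector3d::UnitZ());
      *t = mu_gt - Rz * mu_est;
    }

After this step, an aligned estimate Rz(yaw)·p_est + t can be compared directly with the ground truth to produce the error plots.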
Fig. 15. The device we used for the indoor experiment. It contains one forward-looking global shutter camera (MatrixVision mvBlueFOX-MLC200w) with 752×480 resolution. We use the built-in IMU (ADXL278 and ADXRS290, 100 Hz) of the DJI A3 flight controller.

Fig. 16. Sample images for the challenging indoor experiment: (a) pedestrians; (b) texture-less area; (c) low light condition; (d) glass and reflection.

Fig. 17. Results of the indoor experiment with comparison against OKVIS: (a) trajectory of OKVIS; (b) trajectory of VINS-Mono without loop closure; (c) trajectory of VINS-Mono with relocalization and loop closure. Red lines indicate loop detection.

B. Indoor Experiment

In the indoor experiment, we choose our laboratory environment as the experiment area. The sensor suite we use is shown in Fig. 15. It contains a monocular camera (20 Hz) and an IMU (100 Hz) inside the DJI A3 controller³. We hold the sensor suite by hand and walk at a normal pace in the laboratory. We encounter pedestrians, low light conditions, texture-less areas, glass and reflections, as shown in Fig. 16. Videos can be found in the multimedia attachment.

³ http://www.dji.com/a3

We compare our result with OKVIS, as shown in Fig. 17. Fig. 17(a) is the VIO output from OKVIS. Fig. 17(b) is the VIO-only result from the proposed method without loop closure. Fig. 17(c) is the result of the proposed method with relocalization and loop closure. Noticeable VIO drifts occurred when we circled indoors. Both OKVIS and the VIO-only version of VINS-Mono accumulate significant drifts in x, y, z, and yaw angle. Our relocalization and loop closure modules efficiently eliminate all these drifts.

C. Large-scale Environment

1) Go out of the lab: We test VINS-Mono in a mixed indoor and outdoor setting. The sensor suite is the same as the one shown in Fig. 15. We started from a seat in the laboratory and went around the indoor space. Then we went down the stairs and walked around the playground outside the building. Next, we went back to the building and went upstairs. Finally, we returned to the same seat in the laboratory. The whole trajectory is more than 700 meters long and lasts approximately ten minutes. A video of the experiment can be found in the multimedia attachment.

Fig. 18. The estimated trajectory of the mixed indoor and outdoor experiment aligned with Google Map. The yellow line is the final estimated trajectory from VINS-Mono after loop closure. Red lines indicate loop closure.

Fig. 19. Estimated trajectories for the mixed indoor and outdoor experiment: (a) estimated trajectory from OKVIS; (b) estimated trajectory from VINS-Mono with loop closure disabled; (c) estimated trajectory from VINS-Mono with loop closure. In Fig. 19(a), results from OKVIS went bad due to tracking loss in the texture-less region (staircase). Figs. 19(b) and 19(c) show results from VINS-Mono without and with loop closure. Red lines indicate loop closures. The spiral-shaped blue line shows the trajectory when going up and down the stairs. We can see that VINS-Mono performs well (subject to acceptable drift) even without loop closure.

The trajectory is shown in Fig. 19. Fig. 19(a) is the trajectory from OKVIS. When we went up the stairs, OKVIS showed unstable feature tracking, resulting in bad estimation; we cannot see the shape of the stairs in the red block. The VIO-only result from VINS-Mono is shown in Fig. 19(b). The trajectory with loop closure is shown in Fig. 19(c). The shape of the stairs is clear for the proposed method. The closed-loop trajectory is aligned with Google Map to verify its accuracy, as shown in Fig. 18.

The final drift for OKVIS is [13.80, -5.26, 7.23] m in the x, y and z axes. The final drift of VINS-Mono without loop closure is [-5.47, 2.76, -0.29] m, which amounts to 0.88% of the total trajectory length, smaller than OKVIS's 2.36%. With loop correction, the final drift is bounded to [-0.032, 0.09, -0.07] m, which is trivial compared to the total trajectory length. Although we do not have ground truth, we can still visually inspect that the optimized trajectory is smooth and can be precisely aligned with the satellite map.
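The drift percentages reported above are consistent with dividing the Euclidean norm of the final position drift by the (approximately 700 m) path length; the small sanity check below is our own illustration under exactly that assumption.

    // Reproduces the relative-drift figures from the reported final drifts,
    // assuming drift percentage = ||final drift|| / path length.
    #include <cmath>
    #include <cstdio>

    int main() {
      const double path_length = 700.0;              // approximate trajectory length [m]
      const double vins[3]  = {-5.47, 2.76, -0.29};  // VINS-Mono (no loop) final drift [m]
      const double okvis[3] = {13.80, -5.26, 7.23};  // OKVIS final drift [m]
      const double vins_pct  = 100.0 * std::sqrt(vins[0] * vins[0] + vins[1] * vins[1] + vins[2] * vins[2]) / path_length;
      const double okvis_pct = 100.0 * std::sqrt(okvis[0] * okvis[0] + okvis[1] * okvis[1] + okvis[2] * okvis[2]) / path_length;
      std::printf("VINS-Mono: %.2f%%, OKVIS: %.2f%%\n", vins_pct, okvis_pct);  // ~0.88%, ~2.35%
      return 0;
    }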
2) Go around campus: This very large-scale dataset, which goes around the whole HKUST campus, was recorded with a handheld VI-Sensor⁴. The dataset covers ground that is around 710 m in length and 240 m in width, with 60 m in height changes. The total path length is 5.62 km. The data contains 25 Hz images and 200 Hz IMU measurements lasting 1 hour and 34 minutes. It is a very significant experiment to test the stability and durability of VINS-Mono.

⁴ http://www.skybotix.com/

In this large-scale test, we set the keyframe database size to 2000 in order to provide sufficient loop information and achieve real-time performance. We run this dataset with an Intel i7-4790 CPU running at 3.60 GHz. Timing statistics are shown in Table I. The estimated trajectory is aligned with Google Map in Fig. 20. Compared with Google Map, we can see our results are almost drift-free in this very long-duration test.

TABLE I
TIMING STATISTICS

    Thread   Modules                    Time (ms)   Rate (Hz)
    1        Feature detector           15          25
             KLT tracker                5           25
    2        Window optimization        50          10
    3        Loop detection             100
             Pose graph optimization    130

Fig. 20. The estimated trajectory of the very large-scale environment aligned with Google Map. The yellow line is the estimated trajectory from VINS-Mono. Red lines indicate loop closure.

D. Application I: Feedback Control of an Aerial Robot

We apply VINS-Mono for autonomous feedback control of an aerial robot, as shown in Fig. 21(a). We use a forward-looking global shutter camera (MatrixVision mvBlueFOX-MLC200w) with 752×480 resolution, equipped with a 190-degree fisheye lens. A DJI A3 flight controller is used both for IMU measurements and for attitude stabilization control. The onboard computation resource is an Intel i7-5500U CPU running at 3.00 GHz. The traditional pinhole camera model is not suitable for such a large-FOV camera, so we use the MEI model [42] for this camera, calibrated by the toolkit introduced in [43].

In this experiment, we test the performance of autonomous trajectory tracking using state estimates from VINS-Mono. Loop closure is disabled for this experiment. The quadrotor is commanded to track a figure-eight pattern with each circle being 1.0 meter in radius, as shown in Fig. 21(b). Four obstacles are put around the trajectory to verify the accuracy of VINS-Mono without loop closure. The quadrotor follows this trajectory four times continuously during the experiment. The 100 Hz onboard state estimates (Sect. VI-F) enable real-time feedback control of the quadrotor.

Ground truth is obtained using OptiTrack⁵. The total trajectory length is 61.97 m. The final drift is [0.08, 0.09, 0.13] m, resulting in 0.29% position drift. Details of the translation and rotation as well as their corresponding errors are shown in Fig. 23.

⁵ http://optitrack.com/

Fig. 21. Fig. 21(a): The self-developed aerial robot with a forward-looking fisheye camera (MatrixVision mvBlueFOX-MLC200w, 190° FOV) and a DJI A3 flight controller (ADXL278 and ADXRS290, 100 Hz). Fig. 21(b): The testing environment and the designed trajectory. Four known obstacles are placed. The yellow line is the predefined figure-eight pattern which the aerial robot should follow. The robot follows the trajectory four times with loop closure disabled. A video of the experiment can be found in the multimedia attachment.

Fig. 22. The trajectory of loop closure-disabled VINS-Mono on the MAV platform and its comparison against the ground truth (OptiTrack). The robot follows the trajectory four times. VINS-Mono estimates are used as real-time position feedback for the controller. Total length is 61.97 m. Final drift is 0.18 m.

Fig. 23. Position, orientation and their corresponding errors of loop closure-disabled VINS-Mono compared with OptiTrack. A sudden jump in pitch error at 60 s is caused by aggressive braking at the end of the designed trajectory and by possible time misalignment between VINS-Mono and OptiTrack.
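For illustration only, one common way to generate such a figure-eight reference at the controller rate is a 1:2 Lissajous-type curve whose two lobes are roughly circular with the stated 1.0 m radius. The sketch below is our own example; the period, flight height, and sampling rate are assumptions, not the values used onboard.

    // Samples a figure-eight (1:2 Lissajous) position reference for the controller.
    #include <cmath>
    #include <cstdio>

    int main() {
      const double kPi = 3.141592653589793;
      const double r = 1.0;        // approximate lobe radius [m]
      const double period = 12.0;  // time for one full figure eight [s]
      const double rate = 100.0;   // sampling rate, matching the 100 Hz estimates [Hz]
      for (double t = 0.0; t < period; t += 1.0 / rate) {
        const double w = 2.0 * kPi * t / period;
        const double x = 2.0 * r * std::sin(w);  // sweeps from one lobe to the other
        const double y = r * std::sin(2.0 * w);  // doubled frequency crosses the center
        const double z = 1.0;                    // constant flight height [m]
        std::printf("%.2f %.3f %.3f %.3f\n", t, x, y, z);
      }
      return 0;
    }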
E. Application II: Mobile Device

We port VINS-Mono to mobile devices and present a simple AR application to showcase its accuracy and robustness. We name our mobile implementation VINS-Mobile⁶, and we compare it against the Google Tango device⁷, which is one of the best commercially available augmented reality solutions on mobile platforms.

⁶ https://github.com/HKUST-Aerial-Robotics/VINS-Mobile
⁷ http://shopap.lenovo.com/hk/en/tango/

VINS-Mono runs on an iPhone7 Plus. We use 30 Hz images with 640×480 resolution captured by the iPhone, and IMU data at 100 Hz obtained by the built-in InvenSense MP67B 6-axis gyroscope and accelerometer. As Fig. 24 shows, we mount the iPhone together with a Tango-enabled Lenovo Phab 2 Pro. The Tango device uses a global shutter fisheye camera and a synchronized IMU for state estimation. Firstly, we insert a virtual cube on a plane which is extracted from estimated visual features, as shown in Fig. 25(a). Then we hold these two devices and walk inside and outside the room at a normal pace. When loops are detected, we use the 4-DOF pose graph optimization (Sect. VIII-B) to eliminate x, y, z and yaw drifts.

Interestingly, when we open a door, Tango's yaw estimation jumps by a large angle, as shown in Fig. 25(b). The reason may be an estimator crash caused by unstable feature tracking or by active failure detection and recovery. However, VINS-Mono still works well in this challenging case. After traveling about 264 m, we return to the start location. The final result can be seen in Fig. 25(c): the trajectory of Tango drifts in the last lap, while our VINS returns to the start point. The drift over the total trajectory is eliminated due to the 4-DOF pose graph optimization. This is also evidenced by the fact that the cube is registered to the same place in the image as at the beginning.

Admittedly, Tango is more accurate than our implementation, especially for local state estimates. However, this experiment shows that the proposed method can run on general mobile devices and has the potential to compete with specially engineered devices. The robustness of the proposed method is also proved in this experiment. A video can be found in the multimedia attachment.
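One simple way to extract such a plane from the estimated feature positions, so that the virtual cube can be anchored on it, is a least-squares fit using the centroid and the smallest-eigenvalue direction of the scatter matrix. The sketch below is our own illustration, not the VINS-Mobile source; the function name is hypothetical.

    // Fits a plane to a set of 3-D feature positions: the plane passes through
    // the centroid, and its normal is the eigenvector of the scatter matrix
    // associated with the smallest eigenvalue.
    #include <Eigen/Dense>
    #include <vector>

    void FitPlane(const std::vector<Eigen::Vector3d>& pts,
                  Eigen::Vector3d* centroid, Eigen::Vector3d* normal) {
      *centroid = Eigen::Vector3d::Zero();
      for (const auto& p : pts) *centroid += p;
      *centroid /= static_cast<double>(pts.size());

      Eigen::Matrix3d scatter = Eigen::Matrix3d::Zero();
      for (const auto& p : pts) {
        const Eigen::Vector3d d = p - *centroid;
        scatter += d * d.transpose();
      }
      // Eigen sorts eigenvalues in increasing order, so column 0 is the normal.
      Eigen::SelfAdjointEigenSolver<Eigen::Matrix3d> es(scatter);
      *normal = es.eigenvectors().col(0);
      if (normal->z() < 0.0) *normal = -*normal;  // keep the normal pointing upward
    }

A RANSAC wrapper around this fit would make it robust to features that do not lie on the dominant plane.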
Fig. 24. A simple holder that we used to mount the Google Tango device (left) and the iPhone7 Plus (right) that runs our VINS-Mobile implementation.

Fig. 25. From left to right in each row: AR image from VINS-Mobile, estimated trajectory from VINS-Mono, estimated trajectory from Tango. Fig. 25(a): Beginning: both VINS-Mobile and Tango are initialized at the start location and a virtual box is inserted on the plane extracted from estimated features. Fig. 25(b): Door opening: a challenging case in which the camera is facing a moving door. The drift of the Tango trajectory is highlighted. Fig. 25(c): End: the final trajectory of both VINS-Mobile and Tango. The total length is about 264 m.

X. CONCLUSION AND FUTURE WORK

In this paper, we propose a robust and versatile monocular visual-inertial estimator. Our approach features both state-of-the-art and novel solutions to IMU pre-integration, estimator initialization and failure recovery, online extrinsic calibration, tightly-coupled visual-inertial odometry, relocalization, and efficient global optimization. We show superior performance by comparing against both state-of-the-art open-source implementations and highly optimized industry solutions. We open source both PC and iOS implementations for the benefit of the community.

Although feature-based VINS estimators have already reached the maturity of real-world deployment, we still see many directions for future research. Monocular VINS may reach weakly observable or even degenerate conditions depending on the motion and the environment. We are most interested in online methods to evaluate the observability properties of monocular VINS, and in online generation of motion plans to restore observability. Another research direction concerns the
mass deployment of monocular VINS on a large variety of consumer devices, such as mobile phones. This application requires online calibration of almost all sensor intrinsic and extrinsic parameters, as well as the online identification of calibration qualities. Finally, we are interested in producing dense maps given results from monocular VINS. Our first results on monocular visual-inertial dense mapping with application to drone navigation were presented in [44]. However, extensive research is still necessary to further improve the system accuracy and robustness.

REFERENCES

[1] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on. IEEE, 2007, pp. 225–234.
[2] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Hong Kong, China, May 2014.
[3] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in European Conference on Computer Vision. Springer International Publishing, 2014, pp. 834–849.
[4] R. Mur-Artal, J. Montiel, and J. D. Tardos, "ORB-SLAM: a versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[5] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[6] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, October 2012.
[7] S. Shen, N. Michael, and V. Kumar, "Tightly-coupled monocular visual-inertial fusion for autonomous flight of rotorcraft MAVs," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Seattle, WA, May 2015.
[8] Z. Yang and S. Shen, "Monocular visual–inertial state estimation with online initialization and camera–IMU extrinsic calibration," IEEE Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 39–51, 2017.
[9] T. Qin and S. Shen, "Robust initialization of monocular visual-inertial estimation on aerial robots," in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst., Vancouver, Canada, 2017, accepted.
[10] P. Li, T. Qin, B. Hu, F. Zhu, and S. Shen, "Monocular visual-inertial state estimation for mobile augmented reality," in Proc. of the IEEE Int. Sym. on Mixed and Augmented Reality, Nantes, France, 2017, accepted.
[11] S. Weiss, M. W. Achtelik, S. Lynen, M. Chli, and R. Siegwart, "Real-time onboard visual-inertial state estimation and self-calibration of MAVs in unknown environments," in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 957–964.
[12] S. Lynen, M. W. Achtelik, S. Weiss, M. Chli, and R. Siegwart, "A robust and modular multi-sensor fusion approach applied to MAV navigation," in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst. IEEE, 2013, pp. 3923–3929.
[13] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Roma, Italy, Apr. 2007, pp. 3565–3572.
[14] M. Li and A. Mourikis, "High-precision, consistent EKF-based visual-inertial odometry," Int. J. Robot. Research, vol. 32, no. 6, pp. 690–711, May 2013.
[15] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, "Robust visual inertial odometry using a direct EKF-based approach," in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst. IEEE, 2015, pp. 298–304.
[16] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual-inertial odometry using nonlinear optimization," Int. J. Robot. Research, vol. 34, no. 3, pp. 314–334, Mar. 2014.
[17] R. Mur-Artal and J. D. Tardos, "Visual-inertial monocular SLAM with map reuse," arXiv preprint arXiv:1610.05949, 2016.
[18] K. Wu, A. Ahmed, G. A. Georgiou, and S. I. Roumeliotis, "A square root inverse filter for efficient vision-aided inertial navigation on mobile devices," in Robotics: Science and Systems, 2015.
[19] M. K. Paul, K. Wu, J. A. Hesch, E. D. Nerurkar, and S. I. Roumeliotis, "A comparative analysis of tightly-coupled monocular, binocular, and stereo VINS," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Singapore, May 2017.
[20] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, "iSAM2: Incremental smoothing and mapping using the Bayes tree," The International Journal of Robotics Research, vol. 31, no. 2, pp. 216–235, 2012.
[21] V. Usenko, J. Engel, J. Stückler, and D. Cremers, "Direct visual-inertial odometry with stereo cameras," in Proc. of the IEEE Int. Conf. on Robot. and Autom. IEEE, 2016, pp. 1885–1892.
[22] T. Lupton and S. Sukkarieh, "Visual-inertial-aided navigation for high-dynamic motion in built environments without initial conditions," IEEE Trans. Robot., vol. 28, no. 1, pp. 61–76, Feb. 2012.
[23] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, "IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation," in Proc. of Robot.: Sci. and Syst., Rome, Italy, Jul. 2015.
[24] S. Shen, Y. Mulgaonkar, N. Michael, and V. Kumar, "Initialization-free monocular visual-inertial estimation with application to autonomous MAVs," in Proc. of the Int. Sym. on Exp. Robot., Marrakech, Morocco, Jun. 2014.
[25] A. Martinelli, "Closed-form solution of visual-inertial structure from motion," Int. J. Comput. Vis., vol. 106, no. 2, pp. 138–152, 2014.
[26] J. Kaiser, A. Martinelli, F. Fontana, and D. Scaramuzza, "Simultaneous state initialization and gyroscope bias calibration in visual inertial aided navigation," IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 18–25, 2017.
[27] M. Faessler, F. Fontana, C. Forster, and D. Scaramuzza, "Automatic re-initialization and failure recovery for aggressive flight with a monocular vision-based quadrotor," in Proc. of the IEEE Int. Conf. on Robot. and Autom. IEEE, 2015, pp. 1722–1729.
[28] H. Strasdat, J. Montiel, and A. J. Davison, "Scale drift-aware large scale monocular SLAM," Robotics: Science and Systems VI, 2010.
[29] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. of the Intl. Joint Conf. on Artificial Intelligence, Vancouver, Canada, Aug. 1981, pp. 24–28.
[30] J. Shi and C. Tomasi, "Good features to track," in Computer Vision and Pattern Recognition, 1994. Proceedings CVPR'94., 1994 IEEE Computer Society Conference on. IEEE, 1994, pp. 593–600.
[31] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[32] A. Heyden and M. Pollefeys, "Multiple view geometry," Emerging Topics in Computer Vision, 2005.
[33] D. Nistér, "An efficient solution to the five-point relative pose problem," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 756–770, 2004.
[34] T. Liu and S. Shen, "Spline-based initialization of monocular visual-inertial state estimators at high altitude," IEEE Robotics and Automation Letters, 2017, accepted.
[35] V. Lepetit, F. Moreno-Noguer, and P. Fua, "EPnP: An accurate O(n) solution to the PnP problem," International Journal of Computer Vision, vol. 81, no. 2, pp. 155–166, 2009.
[36] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment - a modern synthesis," in International Workshop on Vision Algorithms. Springer, 1999, pp. 298–372.
[37] P. Huber, "Robust estimation of a location parameter," Annals of Mathematical Statistics, vol. 35, no. 2, pp. 73–101, 1964.
[38] S. Agarwal, K. Mierle, and others, "Ceres solver," http://ceres-solver.org.
[39] G. Sibley, L. Matthies, and G. Sukhatme, "Sliding window filter with application to planetary landing," J. Field Robot., vol. 27, no. 5, pp. 587–608, Sep. 2010.
[40] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," Computer Vision–ECCV 2010, pp. 778–792, 2010.
[41] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, "The EuRoC micro aerial vehicle datasets," The International Journal of Robotics Research, 2016.
[42] C. Mei and P. Rives, "Single view point omnidirectional camera calibration from planar grids," in Robotics and Automation, 2007 IEEE International Conference on. IEEE, 2007, pp. 3945–3950.
[43] L. Heng, B. Li, and M. Pollefeys, "CamOdoCal: Automatic intrinsic and extrinsic calibration of a rig with multiple generic cameras and odometry," in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 1793–1800.
[44] Y. Lin, F. Gao, T. Qin, W. Gao, T. Liu, W. Wu, Z. Yang, and S. Shen, "Autonomous aerial navigation using monocular visual-inertial fusion," J. Field Robot., vol. 00, pp. 1–29, 2017.
