To address all these issues, we propose VINS-Mono, a robust and versatile monocular visual-inertial state estimator. Our solution starts with on-the-fly estimator initialization. The same initialization module is also used for failure recovery. The core of our solution is a robust monocular visual-inertial odometry (VIO) based on tightly-coupled sliding window nonlinear optimization. The monocular VIO module not only provides accurate local pose, velocity, and orientation estimates, it also performs camera-IMU extrinsic calibration and IMU bias correction in an online fashion. Loops are detected using DBoW2 [6]. Relocalization is done in a tightly-coupled setting by feature-level fusion with the monocular VIO. This enables robust and accurate relocalization with minimum computation overhead. Finally, geometrically verified loops are added into a pose graph, and thanks to the observable roll and pitch angles from the monocular VIO, a four degrees-of-freedom (DOF) pose graph optimization is performed to ensure global consistency.

VINS-Mono combines and improves on our previous works on monocular visual-inertial fusion [7]–[10]. It is built on top of our tightly-coupled, optimization-based formulation for monocular VIO [7], [8], and incorporates the improved initialization procedure introduced in [9]. The first attempt at porting to mobile devices was given in [10]. Further improvements of VINS-Mono compared to our previous works include improved IMU pre-integration with bias correction, tightly-coupled relocalization, global pose graph optimization, extensive experimental evaluation, and a robust and versatile open-source implementation.

The whole system is complete and easy to use. It has been successfully applied to small-scale AR scenarios, medium-scale drone navigation, and large-scale state estimation tasks. Superior performance has been shown against other state-of-the-art methods. To this end, we summarize our contributions as follows:
• A robust initialization procedure that is able to bootstrap the system from unknown initial states.
• A tightly-coupled, optimization-based monocular visual-inertial odometry with camera-IMU extrinsic calibration and IMU bias estimation.
• Online loop detection and tightly-coupled relocalization.
• Four-DOF global pose graph optimization.
• Real-time performance demonstration for drone navigation, large-scale localization, and mobile AR applications.
• Open-source release of both the PC version, which is fully integrated with ROS, and the iOS version, which runs on the iPhone 6s or above.

The rest of the paper is structured as follows. In Sect. II, we discuss relevant literature. We give an overview of the complete system pipeline in Sect. III. Preprocessing steps for both visual and pre-integrated IMU measurements are presented in Sect. IV. In Sect. V, we discuss the estimator initialization procedure. A tightly-coupled, self-calibrating, nonlinear optimization-based monocular VIO is presented in Sect. VI. Tightly-coupled relocalization and global pose graph optimization are presented in Sect. VII and Sect. VIII, respectively. Implementation details and experimental results are shown in Sect. IX. Finally, the paper is concluded with a discussion and possible future research directions in Sect. X.

II. RELATED WORK

Scholarly works on monocular vision-based state estimation/odometry/SLAM are extensive. Notable approaches include PTAM [1], SVO [2], LSD-SLAM [3], DSO [5], and ORB-SLAM [4]. Any attempt at a full review of relevant work would inevitably be incomplete. In this section, however, we skip the discussion of vision-only approaches, and only focus on the most relevant results on monocular visual-inertial state estimation.

The simplest way to deal with visual and inertial measurements is loosely-coupled sensor fusion [11], [12], where the IMU is treated as an independent module to assist vision-only pose estimates obtained from visual structure from motion. Fusion is usually done by an extended Kalman filter (EKF), where the IMU is used for state propagation and the vision-only pose is used for the update. Further on, tightly-coupled visual-inertial algorithms are either based on the EKF [13]–[15] or graph optimization [7], [8], [16], [17], where camera and IMU measurements are jointly optimized from the raw measurement level. A popular EKF-based VIO approach is MSCKF [13], [14]. MSCKF maintains several previous camera poses in the state vector, and uses visual measurements of the same feature across multiple camera views to form multi-constraint updates. SR-ISWF [18], [19] is an extension of MSCKF. It uses a square-root form [20] to achieve single-precision representation and avoid poor numerical properties. This approach employs the inverse filter for iterative re-linearization, making it equivalent to optimization-based algorithms. Batch graph optimization or bundle adjustment techniques maintain and optimize all measurements to obtain the optimal state estimates. To achieve constant processing time, popular graph-based VIO methods [8], [16], [17] usually optimize over a bounded-size sliding window of recent states by marginalizing out past states and measurements. Due to the high computational demands of iteratively solving nonlinear systems, few graph-based methods can achieve real-time performance on resource-constrained platforms, such as mobile phones.

For visual measurement processing, algorithms can be categorized into either direct or indirect methods, according to the definition of the visual residual model. Direct approaches [2], [3], [21] minimize photometric error, while indirect approaches [8], [14], [16] minimize geometric displacement. Direct methods require a good initial guess due to their small region of attraction, while indirect approaches consume extra computational resources on extracting and matching features. Indirect approaches are more frequently found in real-world engineering deployments due to their maturity and robustness. However, direct approaches are easier to extend for dense mapping, as they operate directly at the pixel level.

In practice, IMUs usually acquire data at a much higher rate than the camera. Different methods have been proposed to handle the high-rate IMU measurements. The most straightforward approach is to use the IMU for state propagation in EKF-based approaches [11], [13]. In a graph optimization formulation, an efficient technique called IMU pre-integration is developed
in order to avoid repeated IMU re-integration. This technique was first introduced in [22], which parametrizes rotation errors using Euler angles. An on-manifold rotation formulation for IMU pre-integration was developed in our previous work [7], which derived the covariance propagation using continuous-time IMU error-state dynamics. However, IMU biases were ignored. The pre-integration theory was further improved in [23] by adding posterior IMU bias correction.

Accurate initial values are crucial to bootstrap any monocular VINS. A linear estimator initialization method that leverages relative rotations from short-term IMU pre-integration was proposed in [8], [24]. However, this method does not model gyroscope bias, and fails to model sensor noise in raw projection equations. In real-world applications, this results in unreliable initialization when visual features are far away from the sensor suite. A closed-form solution to the monocular visual-inertial initialization problem was introduced in [25]. Later, an extension to this closed-form solution by adding gyroscope bias calibration was proposed in [26]. These approaches fail to model the uncertainty in inertial integration, since they rely on the double integration of IMU measurements over an extended period of time. In [27], a re-initialization and failure recovery algorithm based on SVO [2] was proposed. This is a practical method based on a loosely-coupled fusion framework. However, an additional downward-facing distance sensor is required to recover the metric scale. An initialization algorithm built on top of the popular ORB-SLAM [4] was introduced in [17]. An initial estimation of the scale, gravity direction, velocity, and IMU biases is computed for the visual-inertial full bundle adjustment given a set of keyframes from ORB-SLAM. However, it is reported that the time required for scale convergence can be longer than 10 seconds. This can pose problems for robotic navigation tasks that require scale estimates right at the beginning.

VIO approaches, regardless of the underlying mathematical formulation that they rely on, suffer from long-term drift in global translation and orientation. To this end, loop closure plays an important role for long-term operations. ORB-SLAM [4] is able to close loops and reuse the map, taking advantage of the bag-of-words approach [6]. A 7-DOF [28] (position, orientation, and scale) pose graph optimization is performed following loop detection. In contrast, for monocular VINS, thanks to the addition of the IMU, drift only occurs in 4 DOF: the 3D translation and the rotation around the gravity direction (yaw angle). Therefore, in this paper, we choose to optimize the pose graph with loop constraints in the minimal 4-DOF setting.

Fig. 2. A block diagram illustrating the full pipeline of the proposed monocular visual-inertial state estimator.

III. OVERVIEW

The structure of the proposed monocular visual-inertial state estimator is shown in Fig. 2. The system starts with measurement preprocessing (Sect. IV), in which features are extracted and tracked, and IMU measurements between two consecutive frames are pre-integrated. The initialization procedure (Sect. V) provides all necessary values, including pose, velocity, gravity vector, gyroscope bias, and 3D feature locations, for bootstrapping the subsequent nonlinear optimization-based VIO. The VIO (Sect. VI) and relocalization (Sect. VII) modules tightly fuse pre-integrated IMU measurements, feature observations, and re-detected features from loop closures. Finally, the pose graph optimization module (Sect. VIII) takes in geometrically verified relocalization results, and performs global optimization to eliminate drift. The VIO, relocalization, and pose graph optimization modules run concurrently in a multi-thread setting. Each module has a different running rate and real-time guarantee to ensure reliable operation at all times.
We now define the notations and frame definitions that we use throughout the paper. We consider $(\cdot)^w$ as the world frame. The direction of gravity is aligned with the $z$ axis of the world frame. $(\cdot)^b$ is the body frame, which we define to be the same as the IMU frame. $(\cdot)^c$ is the camera frame. We use both rotation matrices $\mathbf{R}$ and Hamilton quaternions $\mathbf{q}$ to represent rotation. We primarily use quaternions in state vectors, but rotation matrices are also used for convenient rotation of 3D vectors. $\mathbf{q}^w_b$, $\mathbf{p}^w_b$ are the rotation and translation from the body frame to the world frame. $b_k$ is the body frame while taking the $k$-th image. $c_k$ is the camera frame while taking the $k$-th image. $\otimes$ represents the multiplication operation between two quaternions. $\mathbf{g}^w = [0, 0, g]^T$ is the gravity vector in the world frame. Finally, $\hat{(\cdot)}$ denotes the noisy measurement or estimate of a certain quantity.
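To make these conventions concrete, the following minimal sketch (illustrative only, not from the paper's released implementation) shows the Hamilton product $\otimes$ and the rotation of a 3D vector, with quaternions stored as $[w, x, y, z]$ and $\mathbf{q}^w_b$ mapping body-frame vectors into the world frame:

```python
import numpy as np

def quat_mul(q1, q2):
    """Hamilton product q1 (x) q2, with quaternions stored as [w, x, y, z]."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_to_rot(q):
    """Rotation matrix R(q); for q = q_wb, R @ v_body is the world-frame vector."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])
```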
IV. MEASUREMENT PREPROCESSING

This section presents the preprocessing steps for both inertial and monocular visual measurements. For visual measurements, we track features between consecutive frames and detect new features in the latest frame. For IMU measurements, we pre-integrate them between two consecutive frames. Note that the measurements of the low-cost IMU that we use are affected by both bias and noise. We therefore especially take bias into account in the IMU pre-integration process.
A. Vision Processing Front-end

For each new image, existing features are tracked by the KLT sparse optical flow algorithm [29]. Meanwhile, new corner features are detected [30] to maintain a minimum number (100-300) of features in each image. The detector enforces a uniform feature distribution by setting a minimum separation between two neighboring features. 2D features are first undistorted and then projected onto a unit sphere after passing outlier rejection. Outlier rejection is performed using RANSAC with a fundamental matrix model [31].

Keyframes are also selected in this step. We have two criteria for keyframe selection. The first one is the average parallax apart from the previous keyframe. If the average parallax of tracked features between the current frame and the latest keyframe is beyond a certain threshold, we treat the frame as a new keyframe. Note that not only translation but also rotation can cause parallax. However, features cannot be triangulated under rotation-only motion. To avoid this situation, we use short-term integration of gyroscope measurements to compensate for rotation when calculating parallax. Note that this rotation compensation is only used for keyframe selection, and is not involved in rotation calculation in the VINS formulation. As a result, even if the gyroscope contains large noise or is biased, it will only result in suboptimal keyframe selection, and will not directly affect the estimation quality. Another criterion is tracking quality: if the number of tracked features goes below a certain threshold, we treat the frame as a new keyframe. This criterion avoids complete loss of feature tracks.
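As an illustration of this front-end pipeline (KLT tracking, corner replenishment with enforced minimum separation, and RANSAC outlier rejection), the following sketch uses OpenCV; the parameter values are assumptions for illustration, not the paper's exact settings:

```python
import cv2
import numpy as np

MAX_FEATURES, MIN_SEPARATION = 200, 30  # keep 100-300 features, enforce spacing

def track_frame(prev_img, cur_img, prev_pts):
    """prev_pts: (N, 1, 2) float32 feature positions in the previous image."""
    # 1. Track existing features with KLT sparse optical flow [29].
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_img, cur_img, prev_pts, None, winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    p0, p1 = prev_pts[good], cur_pts[good]

    # 2. Reject outliers with a RANSAC fundamental-matrix test [31].
    if len(p1) >= 8:
        _, inliers = cv2.findFundamentalMat(p0, p1, cv2.FM_RANSAC, 1.0, 0.99)
        if inliers is not None:
            keep = inliers.ravel() == 1
            p0, p1 = p0[keep], p1[keep]

    # 3. Detect new corners [30] to keep a minimum feature count, masking out
    #    neighborhoods of existing features for a uniform distribution.
    if len(p1) < MAX_FEATURES:
        mask = np.full(cur_img.shape[:2], 255, np.uint8)
        for x, y in p1.reshape(-1, 2):
            cv2.circle(mask, (int(x), int(y)), MIN_SEPARATION, 0, -1)
        new = cv2.goodFeaturesToTrack(cur_img, MAX_FEATURES - len(p1),
                                      qualityLevel=0.01,
                                      minDistance=MIN_SEPARATION, mask=mask)
        if new is not None:
            p1 = np.vstack([p1, new.astype(np.float32)])
    return p1
```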
B. IMU Pre-integration

IMU pre-integration was first proposed in [22], which parametrized rotation errors in Euler angles. An on-manifold rotation formulation for IMU pre-integration was developed in our previous work [7], which derived the covariance propagation using continuous-time IMU error-state dynamics. However, IMU biases were ignored. The pre-integration theory was further improved in [23] by adding posterior IMU bias correction. In this paper, we extend the IMU pre-integration proposed in our previous work [7] by incorporating IMU bias correction.

The raw gyroscope and accelerometer measurements from the IMU, $\hat{\boldsymbol{\omega}}$ and $\hat{\mathbf{a}}$, are given by:
$$\hat{\mathbf{a}}_t = \mathbf{a}_t + \mathbf{b}_{a_t} + \mathbf{R}^t_w \mathbf{g}^w + \mathbf{n}_a, \qquad \hat{\boldsymbol{\omega}}_t = \boldsymbol{\omega}_t + \mathbf{b}_{w_t} + \mathbf{n}_w. \tag{1}$$

IMU measurements, which are measured in the body frame, combine the force countering gravity and the platform dynamics, and are affected by acceleration bias $\mathbf{b}_a$, gyroscope bias $\mathbf{b}_w$, and additive noise. We assume that the additive noise in acceleration and gyroscope measurements is Gaussian: $\mathbf{n}_a \sim \mathcal{N}(0, \boldsymbol{\sigma}^2_a)$, $\mathbf{n}_w \sim \mathcal{N}(0, \boldsymbol{\sigma}^2_w)$. Acceleration bias and gyroscope bias are modeled as random walks, whose derivatives are Gaussian, $\mathbf{n}_{b_a} \sim \mathcal{N}(0, \boldsymbol{\sigma}^2_{b_a})$, $\mathbf{n}_{b_w} \sim \mathcal{N}(0, \boldsymbol{\sigma}^2_{b_w})$:
$$\dot{\mathbf{b}}_{a_t} = \mathbf{n}_{b_a}, \qquad \dot{\mathbf{b}}_{w_t} = \mathbf{n}_{b_w}. \tag{2}$$

Given two time instants that correspond to image frames $b_k$ and $b_{k+1}$, position, velocity, and orientation states can be propagated by inertial measurements during the time interval $[t_k, t_{k+1}]$ in the world frame:
$$\begin{aligned}
\mathbf{p}^w_{b_{k+1}} &= \mathbf{p}^w_{b_k} + \mathbf{v}^w_{b_k}\Delta t_k + \iint_{t\in[t_k,t_{k+1}]} \left(\mathbf{R}^w_t(\hat{\mathbf{a}}_t - \mathbf{b}_{a_t} - \mathbf{n}_a) - \mathbf{g}^w\right) dt^2 \\
\mathbf{v}^w_{b_{k+1}} &= \mathbf{v}^w_{b_k} + \int_{t\in[t_k,t_{k+1}]} \left(\mathbf{R}^w_t(\hat{\mathbf{a}}_t - \mathbf{b}_{a_t} - \mathbf{n}_a) - \mathbf{g}^w\right) dt \\
\mathbf{q}^w_{b_{k+1}} &= \mathbf{q}^w_{b_k} \otimes \int_{t\in[t_k,t_{k+1}]} \frac{1}{2}\,\Omega(\hat{\boldsymbol{\omega}}_t - \mathbf{b}_{w_t} - \mathbf{n}_w)\,\mathbf{q}^{b_k}_t\, dt,
\end{aligned} \tag{3}$$
where
$$\Omega(\boldsymbol{\omega}) = \begin{bmatrix} -\lfloor\boldsymbol{\omega}\rfloor_\times & \boldsymbol{\omega} \\ -\boldsymbol{\omega}^T & 0 \end{bmatrix}, \qquad \lfloor\boldsymbol{\omega}\rfloor_\times = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix}, \tag{4}$$
and $\Delta t_k$ is the duration of the time interval $[t_k, t_{k+1}]$.

It can be seen that the IMU state propagation requires the rotation, position, and velocity of frame $b_k$. When these starting states change, we need to re-propagate IMU measurements. Especially in the optimization-based algorithm, every time we adjust poses, we need to re-propagate the IMU measurements between them. This propagation strategy is computationally demanding. To avoid re-propagation, we adopt the pre-integration algorithm.

After changing the reference frame from the world frame to the local frame $b_k$, we can pre-integrate only the parts that are related to linear acceleration $\hat{\mathbf{a}}$ and angular velocity $\hat{\boldsymbol{\omega}}$, as follows:
$$\begin{aligned}
\mathbf{R}^{b_k}_w \mathbf{p}^w_{b_{k+1}} &= \mathbf{R}^{b_k}_w \left(\mathbf{p}^w_{b_k} + \mathbf{v}^w_{b_k}\Delta t_k - \tfrac{1}{2}\mathbf{g}^w\Delta t_k^2\right) + \boldsymbol{\alpha}^{b_k}_{b_{k+1}} \\
\mathbf{R}^{b_k}_w \mathbf{v}^w_{b_{k+1}} &= \mathbf{R}^{b_k}_w \left(\mathbf{v}^w_{b_k} - \mathbf{g}^w\Delta t_k\right) + \boldsymbol{\beta}^{b_k}_{b_{k+1}} \\
\mathbf{q}^{b_k}_w \otimes \mathbf{q}^w_{b_{k+1}} &= \boldsymbol{\gamma}^{b_k}_{b_{k+1}}.
\end{aligned} \tag{5}$$
Fig. 3. An illustration of the sliding window monocular VIO with relocalization. It is a tightly-coupled formulation with IMU, visual, and loop measurements.
… these two measurements without good initial values. One may assume a stationary initial condition to start the monocular VINS estimator. However, this assumption is inappropriate, as initialization under motion is frequently encountered in real-world applications. The situation becomes even more complicated when IMU measurements are corrupted by large bias. In fact, initialization is usually the most fragile step for monocular VINS. A robust initialization procedure is needed to ensure the applicability of the system.

We adopt a loosely-coupled sensor fusion method to get initial values. We find that vision-only SLAM, or Structure from Motion (SfM), has good initialization properties. In most cases, a visual-only system can bootstrap itself using initial values derived from relative motion methods, such as the eight-point [32] or five-point [33] algorithms, or homography estimation. By aligning the metric IMU pre-integration with the visual-only SfM results, we can roughly recover scale, gravity, velocity, and even bias. This is sufficient for bootstrapping a nonlinear monocular VINS estimator, as shown in Fig. 4.

In contrast to [17], which simultaneously estimates gyroscope and accelerometer bias during the initialization phase, we choose to ignore accelerometer bias terms in the initial step. Accelerometer bias is coupled with gravity, and due to the large magnitude of the gravity vector compared to the platform dynamics, and the relatively short duration of the initialization phase, these bias terms are hard to observe. A detailed analysis of accelerometer bias calibration is presented in our previous work [34].

…
$$s\bar{\mathbf{p}}^{c_0}_{b_k} = s\bar{\mathbf{p}}^{c_0}_{c_k} - \mathbf{R}^{c_0}_{b_k}\mathbf{p}^b_c, \tag{14}$$
where $s$ is the scaling parameter that aligns the visual structure to the metric scale. Solving this scaling parameter is the key to achieving successful initialization.

B. Visual-Inertial Alignment

1) Gyroscope Bias Calibration: Consider two consecutive frames $b_k$ and $b_{k+1}$ in the window; we have the rotations $\mathbf{q}^{c_0}_{b_k}$ and $\mathbf{q}^{c_0}_{b_{k+1}}$ from the visual SfM, as well as the relative constraint $\hat{\boldsymbol{\gamma}}^{b_k}_{b_{k+1}}$ from IMU pre-integration. We linearize the IMU pre-integration term with respect to gyroscope bias and minimize the following cost function:
$$\min_{\delta \mathbf{b}_w} \sum_{k\in\mathcal{B}} \left\| {\mathbf{q}^{c_0}_{b_{k+1}}}^{-1} \otimes \mathbf{q}^{c_0}_{b_k} \otimes \boldsymbol{\gamma}^{b_k}_{b_{k+1}} \right\|^2, \qquad \boldsymbol{\gamma}^{b_k}_{b_{k+1}} \approx \hat{\boldsymbol{\gamma}}^{b_k}_{b_{k+1}} \otimes \begin{bmatrix} 1 \\ \frac{1}{2} \mathbf{J}^{\gamma}_{b_w} \delta \mathbf{b}_w \end{bmatrix}, \tag{15}$$
where $\mathcal{B}$ indexes all frames in the window. We have the first-order approximation of $\hat{\boldsymbol{\gamma}}^{b_k}_{b_{k+1}}$ with respect to the gyroscope bias using the bias Jacobian derived in Sect. IV-B. In this way, we obtain an initial calibration of the gyroscope bias $\mathbf{b}_w$. Then we re-propagate all IMU pre-integration terms $\hat{\boldsymbol{\alpha}}^{b_k}_{b_{k+1}}$, $\hat{\boldsymbol{\beta}}^{b_k}_{b_{k+1}}$, and $\hat{\boldsymbol{\gamma}}^{b_k}_{b_{k+1}}$ using the new gyroscope bias.
analysis of accelerometer bias calibration is presented in our 2) Velocity, Gravity Vector and Metric Scale Initialization:
previous work [34]. After the gyroscope bias is initialized, we move on to initialize
$r_\mathcal{B}(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathcal{X})$ and $r_\mathcal{C}(\hat{\mathbf{z}}^{c_j}_l, \mathcal{X})$ are residuals for IMU and visual measurements respectively. Detailed definitions of the residual terms will be presented in Sect. VI-B and Sect. VI-C. $\mathcal{B}$ is the set of all IMU measurements; $\mathcal{C}$ is the set of features that have been observed at least twice in the current sliding window. $\{r_p, H_p\}$ is the prior information from marginalization. Ceres Solver [38] is used for solving this nonlinear problem.

[Figure: an illustration of the visual residual, defined on the tangent plane of the unit sphere.]
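The full sliding-window objective is only partially reproduced here, but its structure, a marginalization prior $\{r_p, H_p\}$ plus IMU terms over $\mathcal{B}$ and robustified visual terms over $\mathcal{C}$, can be sketched as follows; the residual functions and the robust norm are passed in as placeholders, and the Mahalanobis (covariance) weighting is omitted. The actual system minimizes this with Ceres Solver [38]:

```python
import numpy as np

def sliding_window_cost(X, r_p, H_p, imu_terms, visual_terms, r_B, r_C, rho):
    """Structural sketch of the sliding-window VIO objective.
    r_B, r_C: IMU and visual residual functions (Sect. VI-B / VI-C);
    rho: robust norm applied to visual terms."""
    cost = np.linalg.norm(r_p - H_p @ X) ** 2            # marginalization prior
    for z, states in imu_terms:                          # all k in B
        cost += np.linalg.norm(r_B(z, states)) ** 2      # IMU residual
    for z, states in visual_terms:                       # all (l, j) in C
        cost += rho(np.linalg.norm(r_C(z, states)) ** 2) # robust visual residual
    return cost
```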
… the known 3D positions of features in the local sliding window, and 2D observations in the loop closure candidate …

… where $\mathcal{L}$ is the set of observations of retrieved features in the loop closure frame $v$, and $(l, v)$ means the $l$-th feature observed in the loop closure frame. Note that although the cost function is slightly different from (22), the dimension of the states to be solved remains the same, as the poses of loop closure frames are considered constants. When multiple loop closures are established with the current sliding window, we optimize using all loop closure feature correspondences from all frames at the same time. This gives multi-view constraints for relocalization, resulting in higher accuracy and better smoothness. Note that the global optimization of past poses and loop closure frames happens after relocalization, and will be discussed in Sect. VIII.

VIII. GLOBAL POSE GRAPH OPTIMIZATION

After relocalization, the local sliding window shifts and aligns with past poses. Utilizing the relocalization results, this additional pose graph optimization step is developed to ensure that the set of past poses is registered into a globally consistent configuration.

Since our visual-inertial setup renders roll and pitch angles fully observable, the accumulated drift only occurs in four degrees-of-freedom ($x$, $y$, $z$, and yaw angle). To this end, we leave the drift-free roll and pitch states out of the optimization, and only perform 4-DOF pose graph optimization.

A. Adding Keyframes into the Pose Graph

When a keyframe is marginalized out from the sliding window, it is added to the pose graph. The keyframe serves as a vertex in the pose graph, and it connects with other vertices by two types of edges:

1) Sequential Edge: A keyframe establishes several sequential edges to its previous keyframes. A sequential edge represents the relative transformation between two keyframes in the local sliding window, whose value is taken directly from VIO. Considering a newly marginalized keyframe $i$ and one of its previous keyframes $j$, the sequential edge only contains the relative position $\hat{\mathbf{p}}^i_{ij}$ and yaw angle $\hat{\psi}_{ij}$:
$$\hat{\mathbf{p}}^i_{ij} = {\hat{\mathbf{R}}^w_i}{}^{-1}(\hat{\mathbf{p}}^w_j - \hat{\mathbf{p}}^w_i), \qquad \hat{\psi}_{ij} = \hat{\psi}_j - \hat{\psi}_i. \tag{27}$$

2) Loop Closure Edge: If the newly marginalized keyframe has a loop connection, it is connected with the loop closure frame by a loop closure edge in the pose graph. Similarly, the loop closure edge only contains a 4-DOF relative pose transform, defined the same as (27). The value of the loop closure edge is obtained using results from relocalization.

B. 4-DOF Pose Graph Optimization

We define the residual of the edge between frames $i$ and $j$ minimally as:
$$\mathbf{r}_{i,j}(\mathbf{p}^w_i, \psi_i, \mathbf{p}^w_j, \psi_j) = \begin{bmatrix} \mathbf{R}(\hat{\phi}_i, \hat{\theta}_i, \psi_i)^{-1}(\mathbf{p}^w_j - \mathbf{p}^w_i) - \hat{\mathbf{p}}^i_{ij} \\ \psi_j - \psi_i - \hat{\psi}_{ij} \end{bmatrix}, \tag{28}$$
where $\hat{\phi}_i$, $\hat{\theta}_i$ are the estimated roll and pitch angles, which are obtained directly from the monocular VIO.

The whole graph of sequential edges and loop closure edges is optimized by minimizing the following cost function:
$$\min_{\mathbf{p},\psi} \left\{ \sum_{(i,j)\in\mathcal{S}} \|\mathbf{r}_{i,j}\|^2 + \sum_{(i,j)\in\mathcal{L}} \rho(\|\mathbf{r}_{i,j}\|^2) \right\}, \tag{29}$$
where $\mathcal{S}$ is the set of all sequential edges and $\mathcal{L}$ is the set of all loop closure edges. Although the tightly-coupled relocalization already helps with eliminating wrong loop closures, we add a Huber norm $\rho(\cdot)$ to further reduce the impact of any possible wrong loops. In contrast, we do not use any robust norms for sequential edges, as these edges are extracted from VIO, which already contains sufficient outlier rejection mechanisms.

Pose graph optimization and relocalization (Sect. VII-C) run asynchronously in two separate threads. This enables immediate use of the most optimized pose graph for relocalization whenever it becomes available. Similarly, even if the current pose graph optimization is not completed yet, relocalization can still take place using the existing pose graph configuration. This process is illustrated in Fig. 9(b).

C. Pose Graph Management

The size of the pose graph may grow unbounded as the travel distance increases, limiting the real-time performance of the system in the long run. To this end, we implement a downsampling process to keep the pose graph database at a limited size. All keyframes with loop closure constraints are kept, while other keyframes that are either too close to or have very similar orientations as their neighbors may be removed. The probability of a keyframe being removed is proportional to its spatial density relative to its neighbors.
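A sketch of the 4-DOF edge residual (28) and the robustified pose graph cost (29): roll and pitch enter as fixed values from VIO, so only positions and yaw are variables; rot_zyx and the edge containers are illustrative names:

```python
import numpy as np

def rot_zyx(roll, pitch, yaw):
    """Rotation matrix from yaw-pitch-roll (Z-Y-X) Euler angles."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def edge_residual(p_i, yaw_i, p_j, yaw_j, roll_i, pitch_i, p_ij_hat, yaw_ij_hat):
    """4-DOF residual of (28); roll_i, pitch_i are fixed VIO estimates."""
    R_i = rot_zyx(roll_i, pitch_i, yaw_i)
    r_p = R_i.T @ (p_j - p_i) - p_ij_hat   # relative translation error
    r_yaw = yaw_j - yaw_i - yaw_ij_hat     # yaw error
    return np.append(r_p, r_yaw)

def huber(s, delta=1.0):
    """Huber norm rho(.) on the squared residual, applied to loop edges only."""
    return s if s <= delta**2 else 2.0 * delta * np.sqrt(s) - delta**2

def pose_graph_cost(seq_edges, loop_edges):
    """Cost (29): plain squares over S, Huber-robustified squares over L.
    Each edge is a tuple of edge_residual's arguments."""
    cost = sum(np.sum(edge_residual(*e)**2) for e in seq_edges)
    cost += sum(huber(np.sum(edge_residual(*e)**2)) for e in loop_edges)
    return cost
```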
IX. EXPERIMENTAL RESULTS

We perform three experiments and two applications to evaluate the proposed VINS-Mono system. In the first experiment, we compare the proposed algorithm with another state-of-the-art algorithm on public datasets, and perform a numerical analysis to show the accuracy of our system. We then test our system in an indoor environment to evaluate its performance in repetitive scenes. A large-scale experiment is carried out to illustrate the long-term practicability of our system. Additionally, we apply the proposed system to two applications. For the aerial robot application, we use VINS-Mono for position feedback to control a drone to follow a pre-defined trajectory. We then port our approach onto an iOS mobile device and compare against Google Tango.

A. Dataset Comparison

We evaluate the proposed VINS-Mono using the EuRoC MAV Visual-Inertial Datasets [41]. The datasets are collected onboard a micro aerial vehicle, which contains stereo images (Aptina MT9V034 global shutter, WVGA monochrome, 20 FPS), synchronized IMU measurements (ADIS16448, 200 Hz), and ground truth states (VICON and Leica MS50). We only use images from the left camera. Large IMU bias and illumination changes are observed in these datasets.
Fig. 11. Trajectory in MH_03_median, compared with OKVIS.

Fig. 13. Trajectory in MH_05_difficult, compared with OKVIS.

[Figure: translation and rotation error in MH_03_median; curves: VINS, VINS_loop, OKVIS_mono; errors plotted against time in seconds.]

Fig. 14. Translation error and rotation error plot in MH_05_difficult.

… in Fig. 11. We only compare translation error, since rotational motion is negligible in this sequence. The x, y, z error versus …
Fig. 17. Results of the indoor experiment with comparison against OKVIS.
Fig. 20. The estimated trajectory of the very large-scale environment aligned with a Google map. The yellow line is the estimated trajectory from VINS-Mono. Red lines indicate loop closures.
[Figure: 3D trajectory plot, VINS vs. OptiTrack, axes x, y, z in meters; panel (a): aerial robot testbed.]
Fig. 21. Fig. 21(a): The self-developed aerial robot with a forward-looking fisheye camera (MatrixVision mvBlueFOX-MLC200w, 190° FOV) and a DJI A3 flight controller (ADXL278 and ADXRS290, 100 Hz). Fig. 21(b): The designed trajectory. Four known obstacles are placed. The yellow line is the predefined figure-eight pattern which the aerial robot should follow. The robot follows the trajectory four times with loop closure disabled. A video of the experiment can be found in the multimedia attachment.

… two devices and walk inside and outside the room at a normal pace. When loops are detected, we use the 4-DOF pose graph optimization (Sect. VIII-B) to eliminate x, y, z, and yaw drifts. Interestingly, when we open a door, Tango's yaw estimation jumps by a big angle, as shown in Fig. 25(b). The reason may be an estimator crash caused by unstable feature tracking, or active failure detection and recovery. However, VINS-Mono still works well in this challenging case. After traveling about 264 m, we return to the start location. The final result can be seen in Fig. 25(c): the trajectory of Tango drifts in the last lap, while our VINS returns to the start point. The drift in the total trajectory is eliminated by the 4-DOF pose graph optimization. This is also evidenced by the fact that the cube is registered to the same place in the image as at the beginning.
Fig. 23. Position, orientation, and their corresponding errors of loop closure-disabled VINS-Mono compared with OptiTrack. A sudden jump in pitch error at around 60 s is caused by aggressive braking at the end of the designed trajectory, and possible time misalignment between VINS-Mono and OptiTrack.
Fig. 24. A simple holder that we used to mount the Google Tango device (left) and the iPhone7 Plus (right) that runs our VINS-Mobile implementation.

[Fig. 25 panel labels: (a) beginning; (b) door opening.]
Admittedly, Tango is more accurate than our implementation, especially for local state estimates. However, this experiment shows that the proposed method can run on general mobile devices and has the potential to compete with specially engineered devices. The robustness of the proposed method is also demonstrated in this experiment. A video can be found in the multimedia attachment.
… mass deployment of monocular VINS on a large variety of consumer devices, such as mobile phones. This application requires online calibration of almost all sensor intrinsic and extrinsic parameters, as well as online identification of calibration quality. Finally, we are interested in producing dense maps given results from monocular VINS. Our first results on monocular visual-inertial dense mapping with application to drone navigation were presented in [44]. However, extensive research is still necessary to further improve the system's accuracy and robustness.

REFERENCES

[1] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on. IEEE, 2007, pp. 225–234.
[2] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Hong Kong, China, May 2014.
[3] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in European Conference on Computer Vision. Springer International Publishing, 2014, pp. 834–849.
[4] R. Mur-Artal, J. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[5] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[6] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, Oct. 2012.
[7] S. Shen, N. Michael, and V. Kumar, "Tightly-coupled monocular visual-inertial fusion for autonomous flight of rotorcraft MAVs," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Seattle, WA, May 2015.
[8] Z. Yang and S. Shen, "Monocular visual-inertial state estimation with online initialization and camera-IMU extrinsic calibration," IEEE Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 39–51, 2017.
[9] T. Qin and S. Shen, "Robust initialization of monocular visual-inertial estimation on aerial robots," in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst., Vancouver, Canada, 2017, accepted.
[10] P. Li, T. Qin, B. Hu, F. Zhu, and S. Shen, "Monocular visual-inertial state estimation for mobile augmented reality," in Proc. of the IEEE Int. Sym. on Mixed and Augmented Reality, Nantes, France, 2017, accepted.
[11] S. Weiss, M. W. Achtelik, S. Lynen, M. Chli, and R. Siegwart, "Real-time onboard visual-inertial state estimation and self-calibration of MAVs in unknown environments," in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 957–964.
[12] S. Lynen, M. W. Achtelik, S. Weiss, M. Chli, and R. Siegwart, "A robust and modular multi-sensor fusion approach applied to MAV navigation," in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst. IEEE, 2013, pp. 3923–3929.
[13] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Roma, Italy, Apr. 2007, pp. 3565–3572.
[14] M. Li and A. Mourikis, "High-precision, consistent EKF-based visual-inertial odometry," Int. J. Robot. Research, vol. 32, no. 6, pp. 690–711, May 2013.
[15] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, "Robust visual inertial odometry using a direct EKF-based approach," in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst. IEEE, 2015, pp. 298–304.
[16] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual-inertial odometry using nonlinear optimization," Int. J. Robot. Research, vol. 34, no. 3, pp. 314–334, Mar. 2014.
[17] R. Mur-Artal and J. D. Tardós, "Visual-inertial monocular SLAM with map reuse," arXiv preprint arXiv:1610.05949, 2016.
[18] K. Wu, A. Ahmed, G. A. Georgiou, and S. I. Roumeliotis, "A square root inverse filter for efficient vision-aided inertial navigation on mobile devices," in Robotics: Science and Systems, 2015.
[19] M. K. Paul, K. Wu, J. A. Hesch, E. D. Nerurkar, and S. I. Roumeliotis, "A comparative analysis of tightly-coupled monocular, binocular, and stereo VINS," in Proc. of the IEEE Int. Conf. on Robot. and Autom., Singapore, May 2017.
[20] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, "iSAM2: Incremental smoothing and mapping using the Bayes tree," The International Journal of Robotics Research, vol. 31, no. 2, pp. 216–235, 2012.
[21] V. Usenko, J. Engel, J. Stückler, and D. Cremers, "Direct visual-inertial odometry with stereo cameras," in Proc. of the IEEE Int. Conf. on Robot. and Autom. IEEE, 2016, pp. 1885–1892.
[22] T. Lupton and S. Sukkarieh, "Visual-inertial-aided navigation for high-dynamic motion in built environments without initial conditions," IEEE Trans. Robot., vol. 28, no. 1, pp. 61–76, Feb. 2012.
[23] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, "IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation," in Proc. of Robot.: Sci. and Syst., Rome, Italy, Jul. 2015.
[24] S. Shen, Y. Mulgaonkar, N. Michael, and V. Kumar, "Initialization-free monocular visual-inertial estimation with application to autonomous MAVs," in Proc. of the Int. Sym. on Exp. Robot., Marrakech, Morocco, Jun. 2014.
[25] A. Martinelli, "Closed-form solution of visual-inertial structure from motion," Int. J. Comput. Vis., vol. 106, no. 2, pp. 138–152, 2014.
[26] J. Kaiser, A. Martinelli, F. Fontana, and D. Scaramuzza, "Simultaneous state initialization and gyroscope bias calibration in visual inertial aided navigation," IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 18–25, 2017.
[27] M. Faessler, F. Fontana, C. Forster, and D. Scaramuzza, "Automatic re-initialization and failure recovery for aggressive flight with a monocular vision-based quadrotor," in Proc. of the IEEE Int. Conf. on Robot. and Autom. IEEE, 2015, pp. 1722–1729.
[28] H. Strasdat, J. Montiel, and A. J. Davison, "Scale drift-aware large scale monocular SLAM," Robotics: Science and Systems VI, 2010.
[29] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. of the Intl. Joint Conf. on Artificial Intelligence, Vancouver, Canada, Aug. 1981, pp. 24–28.
[30] J. Shi and C. Tomasi, "Good features to track," in Computer Vision and Pattern Recognition, 1994. Proceedings CVPR'94., 1994 IEEE Computer Society Conference on. IEEE, 1994, pp. 593–600.
[31] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[32] A. Heyden and M. Pollefeys, "Multiple view geometry," Emerging Topics in Computer Vision, 2005.
[33] D. Nistér, "An efficient solution to the five-point relative pose problem," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 756–770, 2004.
[34] T. Liu and S. Shen, "Spline-based initialization of monocular visual-inertial state estimators at high altitude," IEEE Robotics and Automation Letters, 2017, accepted.
[35] V. Lepetit, F. Moreno-Noguer, and P. Fua, "EPnP: An accurate O(n) solution to the PnP problem," International Journal of Computer Vision, vol. 81, no. 2, pp. 155–166, 2009.
[36] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment: A modern synthesis," in International Workshop on Vision Algorithms. Springer, 1999, pp. 298–372.
[37] P. Huber, "Robust estimation of a location parameter," Annals of Mathematical Statistics, vol. 35, no. 2, pp. 73–101, 1964.
[38] S. Agarwal, K. Mierle, and others, "Ceres solver," http://ceres-solver.org.
[39] G. Sibley, L. Matthies, and G. Sukhatme, "Sliding window filter with application to planetary landing," J. Field Robot., vol. 27, no. 5, pp. 587–608, Sep. 2010.
[40] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," Computer Vision–ECCV 2010, pp. 778–792, 2010.
[41] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, "The EuRoC micro aerial vehicle datasets," The International Journal of Robotics Research, 2016.
[42] C. Mei and P. Rives, "Single view point omnidirectional camera calibration from planar grids," in Robotics and Automation, 2007 IEEE International Conference on. IEEE, 2007, pp. 3945–3950.
[43] L. Heng, B. Li, and M. Pollefeys, "CamOdoCal: Automatic intrinsic and extrinsic calibration of a rig with multiple generic cameras and odometry," in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 1793–1800.
[44] Y. Lin, F. Gao, T. Qin, W. Gao, T. Liu, W. Wu, Z. Yang, and S. Shen, "Autonomous aerial navigation using monocular visual-inertial fusion," J. Field Robot., vol. 00, pp. 1–29, 2017.