Large displacement optical flow
Abstract
1
978-1-4244-3991-1/09/$25.00 ©2009 IEEE 41
Authorized licensed use limited to: ULAKBIM UASL - Hacettepe Universitesi. Downloaded on January 23,2025 at 14:00:39 UTC from IEEE Xplore. Restrictions apply.
is smaller than its displacement relative to the motion of the larger scale structure in the background. Such cases are very common with articulated objects.

If one is interested only in very few correspondences, descriptor matching is a widespread methodology to estimate arbitrarily large displacement vectors. Only a few points are selected for matching. Selected points should have good discriminative properties and there should be a high probability that the same point is selected in both images [17]. Quite some effort is put into the descriptors of the keypoints such that they are invariant to likely transformations of the surrounding patches. Due to their small number and their informative descriptors, e.g. SIFT [9], keypoints can be matched globally using a nearest neighbor criterion. In return, other disadvantages are present. Firstly, there is no geometric relationship per se enforced between matched keypoints. A counterpart to the smoothness assumption in optical flow is missing. Thus, outliers are very likely to appear. Secondly, correspondences are very sparse. Turning the sparse set into a dense flow field by interpolation leads to very inaccurate results missing most of the details.

In some applications, the dense 2D matching problem can be circumvented by making use of specific assumptions. If the scene is static and all motion in the image is due to the camera, the problem can be simplified by estimating the epipolar geometry from very few correspondences (established, e.g., by descriptor matching and some outlier removal procedure such as RANSAC) and then converting the 2D optical flow problem into a 1D disparity estimation problem. While the complexity of combinatorial optimization including geometric constraints in 2D is exponential, it becomes polynomial for some 1D problems. Consequently, large displacements are much less of a problem in typical stereo or structure-from-motion tasks, where dense disparity maps can be estimated via graph cut methods or similar techniques.

Unfortunately, this does not work any more as soon as objects besides the observer are moving. If the focus is on drawing information from the object motion rather than its static shape, there is no way around optical flow estimation, and although the image motion caused by moving objects in the scene is usually much smaller than that caused by a moving camera, displacements can still be too large for contemporary methods. This holds especially true as it is difficult to separate the egomotion of the camera from the object motion as long as both are not known.

For this reason we elaborate in the present paper on optical flow estimation with large displacements. The main idea is to direct a variational technique using correspondences from sparse descriptor matching. This aims at avoiding the local optimization to get stuck in a local minimum underestimating the true flow.

A recent work called SIFT Flow goes a step even further and tries to establish dense correspondences between different scenes [8]. The work is related to ours in the sense that rich descriptors are used in combination with geometric regularization. An approximative discrete optimization method from [15] is used to achieve this goal. The problem of this method in the context of motion estimation is due to the bad localization of the SIFT descriptor. Another strongly related work is the one by Wills and Belongie which allows for large displacements by using edge correspondences in a thin-plate-spline model [18].

In principle, any sparse matching technique can be used to find initial matches. However, it is important that the descriptor matching establishes correspondences also for the smaller scale structures missed by the coarse-to-fine optical flow. Here we propose to use regions from a hierarchical segmentation of the image. This has several advantages. Firstly, regions are more likely to coincide with separately moving structures than commonly used corners or blobs. Secondly, regions allow for estimates of affine patch deformations. Additionally, the hierarchical segmentation provides a good coverage of the whole image. This avoids missing some moving parts because there is no region detected in the area. There is another region-based detector [13], which has the first two properties but does not provide a hierarchy of regions. Another reasonable strategy is to enforce consistent segmentations between frames as suggested in [20].

An important issue is the combination of the keypoint matches and the raw image data within the variational approach. The straightforward way to initialize the variational method with the interpolated keypoint matches gives large influence to outliers. Moreover, it raises the question of which scale to initialize the variational method. The optimum scale is likely to vary from image to image. Therefore, we integrate the keypoint correspondences directly into the variational approach. This allows us to make use of all the image information (not only the keypoints) already at coarse levels, and smoothly scales down the influence of the keypoints as the grid gets finer. Moreover, we integrate multiple matching hypotheses into the variational energy. This allows us to postpone an important hard decision, namely which particular candidate region is the best match, to the variational optimization where geometric constraints are available. Thanks to this formulation, outliers are treated in a proper way, without the need to tune threshold parameters.

2. Region matching

2.1. Region computation

For creating regions in the image, we rely on the segmentation method proposed in Arbelaez et al. [1]. The segmentation is based on the boundary detector gPb from [11]. The
Figure 2. Left: Segmentation of an image. A region hierarchy
is obtained by successively splitting regions at an edge of certain
relevance. Dark edges are inserted first. Right: Zoom into the
hand region of two successive images.
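The splitting idea in the Figure 2 caption can be illustrated with a toy example. The sketch below is not the segmentation method of [1]; it assumes a hypothetical list of boundaries between small initial regions, each with a made-up strength value, and shows how lowering a threshold on those strengths produces nested regions, with the strongest boundaries appearing first. The names `regions_at_level`, `boundaries`, and `count` are invented for illustration.

```python
def regions_at_level(num_regions, boundaries, threshold):
    """Merge initial regions across every boundary weaker than `threshold`.

    boundaries: list of (region_a, region_b, strength) tuples (hypothetical input).
    Returns a list that maps each initial region to a representative label.
    """
    parent = list(range(num_regions))

    def find(r):
        # Follow parent links to the representative, compressing the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b, strength in boundaries:
        if strength < threshold:
            parent[find(a)] = find(b)
    return [find(r) for r in range(num_regions)]


def count(labels):
    """Number of distinct regions in a labeling."""
    return len(set(labels))


# Four initial regions in a row, separated by boundaries of varying strength.
boundaries = [(0, 1, 0.9), (1, 2, 0.2), (2, 3, 0.5)]

coarse = regions_at_level(4, boundaries, threshold=1.0)  # no boundary is kept
medium = regions_at_level(4, boundaries, threshold=0.7)  # only the 0.9 boundary is kept
fine = regions_at_level(4, boundaries, threshold=0.3)    # the 0.9 and 0.5 boundaries are kept
```

Each lower threshold only splits regions that existed at the previous level, which is what makes the result a hierarchy rather than a set of unrelated segmentations.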
to integrate several potential correspondences into the variational approach. For this purpose, a good confidence measure is of great importance. We found that the distance of patches globally separates good and bad matches much better than the above descriptors. The main problem with direct patch comparison (classical block matching) is the sensitivity to small shifts or deformations. With the deformation corrected, the Euclidean distance of patches is very informative, particularly when considering only pixels from within the region.¹

¹ In contrast to tracking and motion estimation, this probably does not hold for object class detection.

The optimum shift and deformation needed to match two patches can be estimated by minimizing the following cost function:

$$E(u, v) = \int \big(P_2(x + u, y + v) - P_1(x, y)\big)^2 \, dx\,dy + \alpha \int \big(|\nabla u|^2 + |\nabla v|^2\big) \, dx\,dy, \tag{3}$$

where $P_1$ and $P_2$ are the two patches, $u(x, y)$, $v(x, y)$ denote the deformation field to be estimated, and $\alpha = 10000$ is a tuning parameter that steers the relative importance of the deformation smoothness. The energy is a non-linearized, large displacement version of the Horn and Schunck energy and sufficient for this purpose. The regularizer gets a very high weight in this case, as without regularization every patch can be made sufficiently similar to any other.

As the patches are very small and a simple quadratic regularizer is applied, the estimation is quite efficient. Nevertheless, it would be a computational burden to estimate the deformation for each region pair. Therefore, we preselect the 10 nearest neighbors for each patch using the distance from the previous section and compute the deformation only for these candidates. The five nearest neighbors according to the patch distance are then integrated into the variational approach described in the next section. Each potential match $j = 1, ..., 5$ of a region $i$ comes with a confidence

$$c_j(i) := \begin{cases} \dfrac{\bar{d}^2(i) - d^2(i, j)}{d^2(i, j)} & \bar{d}^2(i) > 0 \\ 0 & \text{else} \end{cases} \tag{4}$$

where $d^2(i, j)$ is the Euclidean distance between the two patches after deformation correction and $\bar{d}^2(i)$ is the average Euclidean distance among the 10 nearest neighbors. This measure takes the absolute fit as well as the descriptiveness into account. We restrict the distance to be computed only on patch positions within the region. Hence the changing background of a moving object part would not destroy similarity of a correct match.

Fig. 4 depicts the nearest neighbors of a sample region. Simple block matching is clearly inferior compared to SIFT and color because the high frequency information is not correctly aligned. However, computing distances on distortion corrected patches is advantageous for our task. Not only does the ranking improve in this particular case; the distance is also, in general, more valuable as a confidence measure since it marks bad matches more clearly.

Figure 4. Nearest neighbors and their distances using different descriptors. Top: SIFT and color. Center: Patch within region. Bottom: Patch within region after distortion correction.

3. Variational flow

Although most of the correspondences in Fig. 3 are correct, the flow field derived from these by interpolation, as shown in Fig. 5, is far from being accurate. This is because selecting the nearest neighbor is a hard decision. Moreover, a lot of image information is neglected and substituted by a smoothness prior. In order to obtain a more accurate, dense flow field, we integrate the matching hypotheses into a variational approach, which combines them with local information from the raw image data and a smoothness prior.

3.1. Energy

The energy we optimize is similar to the one in [4] except for an additional data constraint that integrates the correspondence information:

$$\begin{aligned}
E(\mathbf{w}(\mathbf{x})) ={}& \int \Psi\big(|I_2(\mathbf{x} + \mathbf{w}(\mathbf{x})) - I_1(\mathbf{x})|^2\big) \, d\mathbf{x} \\
&+ \gamma \int \Psi\big(|\nabla I_2(\mathbf{x} + \mathbf{w}(\mathbf{x})) - \nabla I_1(\mathbf{x})|^2\big) \, d\mathbf{x} \\
&+ \beta \sum_{j=1}^{5} \int \rho_j(\mathbf{x}) \, \Psi\big((u(\mathbf{x}) - u_j(\mathbf{x}))^2 + (v(\mathbf{x}) - v_j(\mathbf{x}))^2\big) \, d\mathbf{x} \\
&+ \alpha \int \Psi\big(|\nabla u(\mathbf{x})|^2 + |\nabla v(\mathbf{x})|^2 + g(\mathbf{x})^2\big) \, d\mathbf{x}
\end{aligned} \tag{5}$$

Here, $I_1$ and $I_2$ are the two input images, $\mathbf{w} := (u, v)$ is the sought optical flow field, and $\mathbf{x} := (x, y)$ denotes a point in the image. $(u_j, v_j)(\mathbf{x})$ is one of the motion vectors derived at position $\mathbf{x}$ by region matching ($j$ indexing the 5 nearest neighbors). If there is no correspondence at this position, $\rho_j(\mathbf{x}) = 0$. Otherwise, $\rho_j(\mathbf{x}) = c_j$, where $c_j$ is the distance based confidence in (4). $\alpha = 100$, $\beta = 25$, and $\gamma = 5$ are tuning parameters, which steer the importance of smoothness, region correspondences, and gradient constancy, respectively.
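The confidence in (4), which enters the energy as the weight of each matching hypothesis, can be sketched for a single region as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `confidences` and the sample distances are hypothetical, the input is assumed to already hold the deformation-corrected patch distances to the preselected neighbors, and the guard against a zero distance is an addition to avoid division by zero. The value is not clamped, since (4) as stated does not clamp it.

```python
def confidences(distances, num_hypotheses=5):
    """Confidence c_j(i) as in (4) for the closest candidates of one region i.

    distances: patch distances d^2(i, j) to the preselected nearest neighbors
               (10 in the text), measured after deformation correction.
    Returns (candidate_index, confidence) pairs for the closest candidates.
    """
    mean_dist = sum(distances) / len(distances)  # average distance, \bar{d}^2(i)
    ranked = sorted(range(len(distances)), key=lambda j: distances[j])
    result = []
    for j in ranked[:num_hypotheses]:
        # The check distances[j] > 0 is an assumption added for numerical safety.
        if mean_dist > 0 and distances[j] > 0:
            c = (mean_dist - distances[j]) / distances[j]
        else:
            c = 0.0
        result.append((j, c))
    return result


# Hypothetical distances for the 10 preselected neighbors of one region.
sample = [2.0, 4.0, 8.0, 10.0, 10.0, 12.0, 12.0, 14.0, 14.0, 14.0]
hypotheses = confidences(sample)
```

A candidate that is much closer than the average neighbor receives a large weight, whereas a candidate that is no better than average receives a weight of zero, which reflects the remark that the measure accounts for both absolute fit and descriptiveness.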
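To get a feeling for how the correspondence term in (5) handles competing hypotheses, the following sketch minimizes only that term at a single pixel, that is, the sum over j of ρ_j Ψ((u − u_j)² + (v − v_j)²). It assumes the robust function Ψ(s²) = √(s² + 10⁻⁶) from [4], drops the image terms and the smoothness term purely for illustration, and uses a lagged fixed-point iteration in which the derivative Ψ' is frozen, the resulting linear problem is solved, and Ψ' is then updated. The names `psi_prime` and `robust_consensus` and the toy hypotheses are hypothetical.

```python
import math

EPS = 1e-6  # small constant inside the robust function


def psi_prime(s2):
    """Derivative of Psi(s^2) = sqrt(s^2 + EPS) with respect to s^2."""
    return 0.5 / math.sqrt(s2 + EPS)


def robust_consensus(hypotheses, weights, iterations=50):
    """Minimize sum_j weights[j] * Psi(|w - w_j|^2) over one flow vector w = (u, v)."""
    total = sum(weights)
    # Start from the weighted mean, which is the squared-error solution.
    u = sum(w * h[0] for w, h in zip(weights, hypotheses)) / total
    v = sum(w * h[1] for w, h in zip(weights, hypotheses)) / total
    for _ in range(iterations):
        # Freeze the robust factors at the current estimate ...
        factors = [w * psi_prime((u - h[0]) ** 2 + (v - h[1]) ** 2)
                   for w, h in zip(weights, hypotheses)]
        # ... then solve the now linear optimality condition in closed form.
        norm = sum(factors)
        u = sum(f * h[0] for f, h in zip(factors, hypotheses)) / norm
        v = sum(f * h[1] for f, h in zip(factors, hypotheses)) / norm
    return u, v


# Three consistent hypotheses near (10, 0) and two outliers (toy values).
hypotheses = [(10.0, 0.0), (10.5, 0.2), (9.8, -0.1), (-30.0, 15.0), (40.0, -20.0)]
weights = [1.0, 1.0, 1.0, 1.0, 1.0]

mean_u = sum(h[0] for h in hypotheses) / len(hypotheses)
robust_u, robust_v = robust_consensus(hypotheses, weights)
```

In this toy setting the robust estimate stays with the consistent cluster, while the plain mean is pulled away by the two outliers. The full method differs in that the image terms and the smoothness prior also enter the linear system.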
As in [4], we use the robust function $\Psi(s^2) = \sqrt{s^2 + 10^{-6}}$ in order to deal with outliers in the data as well as in the smoothness assumption. We also integrate the boundary map $g(\mathbf{x})$ from [1] (see Fig. 2) in order to avoid smoothing across strong region boundaries.

The robust function further reduces the influence of bad correspondences and leads to the selection of the most consistent match among the five nearest neighbors. Note that each potential match has its own robust function. Spatial consistency is enforced by the smoothness prior, which integrates correspondences from the neighborhood. Many good matches in the neighborhood will outnumber mismatches, which is not the case when using a squared error measure. With $\alpha = 0$ the optimum result would simply be the weighted median of the hypotheses, but with $\alpha > 0$ additional matches from the surroundings are taken into account.

Rather than a straightforward three step procedure with (i) interpolation of the region correspondences, (ii) removal of outliers not fitting the interpolated flow field, and (iii) optical flow estimation initialized by the interpolated inlier correspondences, the above energy combines all three steps in a single optimization problem.

3.2. Minimization

The energy is non-convex and can only be optimized locally. We can compute the Euler-Lagrange equations, which state a necessary condition for a local optimum:

$$\begin{aligned}
&\Psi'(I_z^2)\, I_z I_x + \gamma\, \Psi'(I_{xz}^2 + I_{yz}^2)\,(I_{xx} I_{xz} + I_{xy} I_{yz}) \\
&\quad + \beta \sum_j \rho_j\, \Psi'\big((u - u_j)^2 + (v - v_j)^2\big)\,(u - u_j) \\
&\quad - \alpha\, \mathrm{div}\Big(\Psi'\big(|\nabla u|^2 + |\nabla v|^2 + g(\mathbf{x})^2\big)\, \nabla u\Big) = 0 \\
&\Psi'(I_z^2)\, I_z I_y + \gamma\, \Psi'(I_{xz}^2 + I_{yz}^2)\,(I_{xy} I_{xz} + I_{yy} I_{yz}) \\
&\quad + \beta \sum_j \rho_j\, \Psi'\big((u - u_j)^2 + (v - v_j)^2\big)\,(v - v_j) \\
&\quad - \alpha\, \mathrm{div}\Big(\Psi'\big(|\nabla u|^2 + |\nabla v|^2 + g(\mathbf{x})^2\big)\, \nabla v\Big) = 0,
\end{aligned} \tag{6}$$

where $\Psi'(s^2)$ is the first derivative of $\Psi(s^2)$ with respect to $s^2$, and we define

$$\begin{aligned}
I_x &:= \partial_x I_2(\mathbf{x} + \mathbf{w}) & I_{xy} &:= \partial_{xy} I_2(\mathbf{x} + \mathbf{w}) \\
I_y &:= \partial_y I_2(\mathbf{x} + \mathbf{w}) & I_{yy} &:= \partial_{yy} I_2(\mathbf{x} + \mathbf{w}) \\
I_z &:= I_2(\mathbf{x} + \mathbf{w}) - I_1(\mathbf{x}) & I_{xz} &:= \partial_x I_z \\
I_{xx} &:= \partial_{xx} I_2(\mathbf{x} + \mathbf{w}) & I_{yz} &:= \partial_y I_z.
\end{aligned} \tag{7}$$

Although we have the region correspondences involved in these equations, their influence would be too local to effectively drive a large displacement solution. However, we can make use of the same coarse-to-fine strategy as used in optical flow warping schemes. This has two effects. Firstly, downsampled large scale structures drive the optical flow to a large displacement solution. Secondly, the influence of region correspondences is much larger at coarser levels as they cover larger parts of the discrete domain. $\rho(\mathbf{x}) \neq 0$ for the same number of grid points, but the total number of grid points at coarser levels is much smaller. As a consequence, they dominate the optical flow at coarse levels, pushing the local optimization in the right direction. At finer levels, their influence decreases (and is actually zero in the true continuous case). While correct matches will be in line with the optical flow, outliers will be outnumbered by the growing number of grid points indicating a different flow field.

We can use the same nested fixed point iterations as proposed in [4] to solve (6). We initialize $\mathbf{w}^0 := (0, 0)$ at the coarsest grid and iteratively compute updates $\mathbf{w}^{k+1} = \mathbf{w}^k + d\mathbf{w}^k$, where $d\mathbf{w}^k := (du^k, dv^k)$ is the solution of

$$\begin{aligned}
0 ={}& \Psi'_1\, I_x^k \big(I_z^k + I_x^k du^k + I_y^k dv^k\big) + \beta \sum_j \rho_j\, \Psi'_{2,j}\,(u - u_j) - \alpha\, \mathrm{div}\big(\Psi'_3\, \nabla(u^k + du^k)\big) \\
0 ={}& \Psi'_1\, I_y^k \big(I_z^k + I_x^k du^k + I_y^k dv^k\big) + \beta \sum_j \rho_j\, \Psi'_{2,j}\,(v - v_j) - \alpha\, \mathrm{div}\big(\Psi'_3\, \nabla(v^k + dv^k)\big)
\end{aligned} \tag{8}$$

with

$$\begin{aligned}
\Psi'_1 &:= \Psi'\big((I_z^k + I_x^k du^k + I_y^k dv^k)^2\big) \\
\Psi'_{2,j} &:= \Psi'\big((u^k + du^k - u_j)^2 + (v^k + dv^k - v_j)^2\big) \\
\Psi'_3 &:= \Psi'\big(|\nabla(u^k + du^k)|^2 + |\nabla(v^k + dv^k)|^2 + g^2\big).
\end{aligned} \tag{9}$$

Here $u$ and $v$ in the correspondence term of (8) stand for the updated values $u^k + du^k$ and $v^k + dv^k$, matching the argument of $\Psi'_{2,j}$ in (9). We skipped the gradient constancy term in the notation to have shorter equations. The reader is referred to [4] for the gradient constancy part. In order to solve (8), an inner fixed point iteration over $l$ is employed, where the robust functions in (9) are set constant for fixed $du^{k,l}$, $dv^{k,l}$ and are iteratively updated. The equations are then linear in $du^{k,l}$, $dv^{k,l}$ and can be solved by standard iterative methods after proper discretization.

4. Experiments

We evaluated the new method on several real images showing large displacements, particularly articulated motion of humans. Fig. 5 depicts results for the example from the previous sections. The fast motion of the person's right hand, missed by current state-of-the-art optical flow, is correctly captured when integrating point correspondences from descriptor matching. This clearly shows the improvement we aimed at. In areas without large displacements, we cannot expect the flow to be more accurate, since descriptor matching is not as precise as variational flow. However, the result is also not much spoiled by imprecise and bad matches. We quantitatively confirmed this by running [4] and the large displacement flow on five sequences of the Middlebury dataset with public ground truth [2]. There are no large displacements in any of these sequences. We optimized the parameters of both approaches but kept the
Figure 5. Left: Flow field obtained by interpolating the region correspondences (nearest neighbor). Accuracy is low and several outliers
can be observed. Center left: Result with the optical flow method from [4]. The motion is mostly very accurate, but the hand motion is not
captured well. Center right: Result with the proposed method. Most of the accuracy of the optical flow framework is preserved and the
fast moving hands are captured as well. We see some degradations in the background due to outliers and too little structure to correct this.
Right: Result of SIFT Flow [8] running the code provided by the authors. Since the histograms in SIFT lack good localization properties,
the accuracy of the flow field is much lower.
Figure 6. Evolving flow field from coarse (left) to fine (right). The region correspondences dominate the estimate at the beginning. Outliers
are removed over time as more and more data from the image is taken into account.
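The effect described in the Figure 6 caption follows from simple counting: the correspondences occupy roughly the same number of grid points at every level, while the grid itself shrinks toward the coarse end. The sketch below makes this concrete. The image size 530 × 380 is taken from the timing experiment in the text; the number of correspondence points (2000) and the downsampling factor per level (0.5) are assumptions chosen only for illustration, and `correspondence_share` is a hypothetical helper.

```python
def correspondence_share(num_corr_points, width, height, scale, level):
    """Fraction of grid points at a pyramid level that carry a correspondence.

    level 0 is the full resolution; each further level shrinks both image
    dimensions by `scale` (assumed value, not taken from the paper).
    """
    w = max(1, round(width * scale ** level))
    h = max(1, round(height * scale ** level))
    # A grid cannot hold more correspondence points than it has cells.
    return min(1.0, num_corr_points / (w * h))


# Shares from the finest level (0) to a coarse level (4).
shares = [correspondence_share(2000, 530, 380, 0.5, level) for level in range(5)]
```

With these assumed numbers, under one percent of the grid points carry a correspondence at full resolution, whereas at the coarsest levels most or all of them do, so the correspondence term dominates there and fades as the grid is refined.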
Figure 7. Left: Input images. The camera was rotated and moved into the scene. Center left: Interpolated region correspondences.
Center right: Result with the optical flow method from [4]. Clearly, only the smaller displacements in the center and those of regions
with appropriate scale can be estimated. Right: Result with the proposed method. Aside from the unstructured and occluded areas near
the image boundaries, the flow field is estimated well.
parameter β (which steers the influence of the point correspondences) at the same value as in the other experiments. The average angular error of the large displacement version increased on average by 27%. This means it still yields a good accuracy while being able to capture larger motion.

Fig. 5 also demonstrates the huge improvement over descriptor matching followed by interpolation to derive a dense flow field. Clearly, the weakly descriptive information in the image apart from the keypoints should not be ignored when more than a few correspondences are needed. A comparison to [8] using their code indicates that we get a better localization of the motion, which is quite natural as [8] was designed to match between different scenes.

Fig. 6 shows the evolving flow field over multiple scales. It can be seen that the influence of wrong region matches decreases as the flow field includes more and more information from the image and geometrically inconsistent matches are ignored as outliers.

Fig. 7 depicts an experiment with a static scene and a moving camera. This problem would actually be better solved by estimating the fundamental matrix from few correspondences and then computing the disparity map with global combinatorial optimization. We show this experiment to demonstrate that good results can be obtained even without exploiting the knowledge of a static scene (which may not always be true in many realistic tasks). The flow in non-occluded areas is well estimated despite huge displacements. Neither classical optical flow nor interpolated descriptor matching can produce these results.

Another potential field of application of our technique is human motion analysis. Fig. 8 shows two frames from the HumanEva-II benchmark at Brown University. The original sequence was captured with a 120fps highspeed camera. We skipped four frames to simulate the 30fps of a consumer camera. Again we can see that the large motion of some body parts is missed with previous optical flow techniques,
Figure 8. Left: Two overlaid images of a running person. The images are from the HumanEva-II benchmark on human tracking. Center
left: Interpolated region correspondences. Center: Result with optical flow from [4]. The motion of the right leg is too fast to be captured
by the coarse-to-fine scheme alone. Center right: Result with the proposed model. Region correspondences guide the optical flow towards
the fast motion of the leg. Right: Image warped according to the estimated flow. The ideal result should look like the first image apart
from occluded areas. The motion of the foot tip is underestimated, but the motion of the lower leg and the rest of the body is fine.
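The warped image in the rightmost panel of Figure 8 can be produced by sampling the second image at the positions the flow points to, so that a correct flow reproduces the first image wherever nothing is occluded. The sketch below is a minimal pure-Python version under stated assumptions: grayscale images stored as lists of rows, bilinear interpolation, and clamping at the image border. The helper names `bilinear` and `warp_backward` are hypothetical, and the paper does not state which interpolation was used.

```python
def bilinear(image, x, y):
    """Sample a grayscale image (list of rows) at real-valued (x, y), clamped to the border."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom


def warp_backward(image2, flow_u, flow_v):
    """Return the image with value I2(x + u(x, y), y + v(x, y)) at every pixel (x, y)."""
    h, w = len(image2), len(image2[0])
    return [[bilinear(image2, x + flow_u[y][x], y + flow_v[y][x])
             for x in range(w)] for y in range(h)]


# Toy example: a bright column moves one pixel to the right between the frames.
image1 = [[0, 0, 9, 0, 0] for _ in range(3)]
image2 = [[0, 0, 0, 9, 0] for _ in range(3)]
flow_u = [[1.0] * 5 for _ in range(3)]  # horizontal flow of one pixel
flow_v = [[0.0] * 5 for _ in range(3)]  # no vertical motion

warped = warp_backward(image2, flow_u, flow_v)
```

In the toy example the warped second image equals the first one. In real sequences, differences between the two reveal where the flow is wrong or where occlusion makes a match impossible.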
while it is captured much better when integrating descriptor matching. The warped image reveals that the motion of the foot tip is still underestimated, but the rest of the body including the lower leg and the arms is tracked correctly. The doubled structures near object boundaries are due to occlusion and indicate the correct filling of the background's zero motion.

Finally, Figs. 9 and 10 show results from a tennis sequence. The entire sequence and the corresponding flow are available in the supplementary material. The sequence was recorded with a 25fps hand held consumer camera and is very difficult due to very fast motion of the tennis player, little structure on the ground, and highly repetitive structures at the fence. The latter produce many outliers when matching regions. The video shows that most of the outliers are ignored in the course of variational optimization and that large parts of the fast motion are captured correctly. Jittering of the camera is indicated by the changing color in the background (showing changing motion directions). Even the motion of the ball is estimated in some frames. The motion of the racket and the hands is missed from time to time due to motion blur and weakly discriminative regions. Nevertheless, the results are very promising to serve as a cue in action recognition.

Computation of the flow for given segmentations took 37s on an Intel Xeon 2.33GHz for images of size 530 × 380 pixels. Most of the time is spent on the deformation of the patches and the variational flow, both of which are potentially available in real-time using the GPU [19]. A GPU implementation of the segmentation takes 5s per frame.

5. Conclusions

We have shown that optical flow can benefit from sparse point correspondences from descriptor matching. The local optimization involved in optical flow methods fails to capture large motions even with coarse-to-fine strategies if small subparts move considerably faster than their surroundings. Point correspondences obtained from global nearest neighbor matching using strong descriptors can guide the local optimization to the correct large displacement. Conversely, we have also shown that weakly descriptive information, as is thrown away when selecting keypoints, contains valuable information and should not be ignored. The flow field obtained by exploiting all image information is much more accurate than the interpolated point correspondences. Moreover, outliers can be avoided by integrating multiple hypotheses into the variational approach and making use of the smoothness prior to select the most consistent one.

This work extends the applicability of optical flow to fields with larger displacements, particularly to tasks where large displacements are due to object rather than camera motion. We expect good results in action recognition when using the dense flow as a dynamic orientation feature, analogous to orientation histograms in static image recognition. However, with larger displacements there also appear new challenges such as occlusions, which we mainly ignored here. Future work should transfer the rich knowledge on occlusion handling in disparity estimation to the more general field of optical flow.

References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. From contours to regions: an empirical evaluation. Proc. CVPR, 2009.
[2] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. Proc. ICCV, 2007.
[3] M. J. Black and P. Anandan. The robust estimation of multiple motions: parametric and piecewise smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, 1996.
[4] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. Proc. ECCV, Springer LNCS 3024, 25–36, 2004.
Figure 9. Left: Two overlaid images of a tennis player in action. Center left: Region correspondences. Center right: Result with optical
flow from [4]. The motion of the right leg is too fast to be estimated. Right: The proposed method captures the motion of the leg.
Figure 10. One frame of a tennis sequence obtained with a 25fps hand held consumer camera. Despite extremely fast movements, most of the interesting motion is correctly captured. Even the ball motion (red) is estimated here. The entire video is available as supplementary material.
[5] A. Efros, A. Berg, G. Mori, and J. Malik. Recognizing action at a distance. Proc. ICCV, 726–733, 2003.
[6] B. Horn and B. Schunck. Determining optical flow. Artificial Intelligence, 17:185–203, 1981.
[7] I. Laptev. Local Spatio-Temporal Image Features for Motion Interpretation. PhD thesis, Computational Vision and Active Perception Laboratory, KTH Stockholm, Sweden, 2004.
[8] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT flow: dense correspondence across different scenes. Proc. ECCV, Springer LNCS 5304, 28–42, 2008.
[9] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[10] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. Proc. Seventh International Joint Conference on Artificial Intelligence, 674–679, 1981.
[11] M. Maire, P. Arbelaez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. Proc. CVPR, 2008.
[12] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):530–549, 2004.
[13] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. Proc. British Machine Vision Conference, 2002.
[14] E. Mémin and P. Pérez. Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Transactions on Image Processing, 7(5):703–719, 1998.
[15] A. Shekhovtsov, I. Kovtun, and V. Hlaváč. Efficient MRF deformation model for non-rigid image matching. Proc. CVPR, 2007.
[16] C. Strecha, R. Fransens, and L. Van Gool. A probabilistic approach to large displacement optical flow and occlusion detection. Statistical Methods in Video Processing, Springer LNCS 3247, 71–82, 2004.
[17] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: a survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2008.
[18] J. Wills and S. Belongie. A feature based method for determining dense long range correspondences. Proc. ECCV, Springer LNCS 3023, 170–182, 2004.
[19] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. Pattern Recognition - Proc. DAGM, Springer LNCS 4713, 214–223, 2007.
[20] C. L. Zitnick, N. Jojic, and S. B. Kang. Consistent segmentation for optical flow estimation. Proc. ICCV, 2005.