
Large Displacement Optical Flow∗

Thomas Brox¹        Christoph Bregler²        Jitendra Malik¹

¹ University of California, Berkeley, Berkeley, CA 94720, USA, {brox,malik}@eecs.berkeley.edu
² Courant Institute, New York University, New York, NY 10003, USA, [email protected]

∗ This work was funded by the German Academic Exchange Service (DAAD) and the ONR-MURI program.

Abstract

The literature currently provides two ways to establish point correspondences between images with moving objects. On one side, there are energy minimization methods that yield very accurate, dense flow fields, but fail as displacements get too large. On the other side, there is descriptor matching that allows for large displacements, but correspondences are very sparse, have limited accuracy, and due to missing regularity constraints there are many outliers. In this paper we propose a method that can combine the advantages of both matching strategies. A region hierarchy is established for both images. Descriptor matching on these regions provides a sparse set of hypotheses for correspondences. These are integrated into a variational approach and guide the local optimization to large displacement solutions. The variational optimization selects among the hypotheses and provides dense and subpixel accurate estimates, making use of geometric constraints and all available image information.

Figure 1. Top row: Image of a sequence where the person is stepping forward and moving his hands. The optical flow estimated with the method from [4] is quite accurate for the main body and the legs, but the hands are not accurately captured. Bottom row, left: Overlay of two successive frames showing the motion of one of the hands. Center: The arm motion is still good, but the hand has a smaller scale than its displacement, leading to a local minimum. Right: Color map used to visualize flow fields in this paper. Smaller vectors are darker and color indicates the direction.
1. Introduction

Optical flow estimation has been declared as a solved problem several times. For restricted cases this is true, but in more general cases, we are still far from a satisfactory solution. For instance, estimating a dense flow field of people with fast limb motions cannot yet be achieved reliably with state-of-the-art techniques. This is of importance for many applications, like long range tracking, motion segmentation, or flow based action recognition techniques [5, 7].

Most contemporary optical flow techniques are based on two important ingredients: the energy minimization framework of Horn and Schunck [6], and the concept of coarse-to-fine image warping introduced by Lucas and Kanade [10] to overcome large displacements. Both approaches have been extended by robust statistics, which allow the treatment of outliers in either the matching or the smoothness assumption, particularly due to occlusions or motion discontinuities [3, 14]. The technique in [4] further introduced gradient constancy as a constraint which is robust to illumination changes and proposed a numerical scheme that allows for a very high accuracy, provided the displacements are not too large.

The reason why differential techniques can deal with displacements larger than a few pixels at all is that they initialize the flow by estimates from coarser image scales, where displacements are small enough to be estimated by local optimization. Unfortunately, the downsampling not only smoothes the way to the global optimum, but also removes information that may be vital for establishing the correct matches. Consequently, the method cannot refine the flow of structures that are smaller than their displacement, simply because the structure is smoothed away just at the level when its flow is small enough to be estimated in the variational setting. The resulting flow is then close to the motion of the larger scale structure. This still works well if the motion varies smoothly with the scale of the structures, and even precise 3D reconstruction of buildings becomes possible [16]. Figure 1, however, shows an example where the hand motion is not estimated correctly because the hand is smaller than its displacement relative to the motion of the larger scale structure in the background. Such cases are very common with articulated objects.

If one is interested only in very few correspondences, descriptor matching is a widespread methodology to estimate arbitrarily large displacement vectors. Only a few points are selected for matching. Selected points should have good discriminative properties and there should be a high probability that the same point is selected in both images [17]. Quite some effort is put into the descriptors of the keypoints such that they are invariant to likely transformations of the surrounding patches. Due to their small number and their informative descriptors, e.g. SIFT [9], keypoints can be matched globally using a nearest neighbor criterion. In return, other disadvantages are present. Firstly, there is no geometric relationship per se enforced between matched keypoints. A counterpart to the smoothness assumption in optical flow is missing. Thus, outliers are very likely to appear. Secondly, correspondences are very sparse. Turning the sparse set into a dense flow field by interpolation leads to very inaccurate results missing most of the details.

In some applications, the dense 2D matching problem can be circumvented by making use of specific assumptions. If the scene is static and all motion in the image is due to the camera, the problem can be simplified by estimating the epipolar geometry from very few correspondences (established, e.g., by descriptor matching and some outlier removal procedure such as RANSAC) and then converting the 2D optical flow problem into a 1D disparity estimation problem. While the complexity of combinatorial optimization including geometric constraints in 2D is exponential, it becomes polynomial for some 1D problems. Consequently, large displacements are much less of a problem in typical stereo or structure-from-motion tasks, where dense disparity maps can be estimated via graph cut methods or similar techniques.

Unfortunately, this does not work any more as soon as objects besides the observer are moving. If the focus is on drawing information from the object motion rather than its static shape, there is no way around optical flow estimation, and although the image motion caused by moving objects in the scene is usually much smaller than that caused by a moving camera, displacements can still be too large for contemporary methods. This holds especially true as it is difficult to separate the egomotion of the camera from the object motion as long as both are not known.

For this reason we elaborate in the present paper on optical flow estimation with large displacements. The main idea is to direct a variational technique using correspondences from sparse descriptor matching. This aims at avoiding the local optimization to get stuck in a local minimum underestimating the true flow.

A recent work called SIFT Flow goes a step even further and tries to establish dense correspondences between different scenes [8]. The work is related to ours in the sense that rich descriptors are used in combination with geometric regularization. An approximative discrete optimization method from [15] is used to achieve this goal. The problem of this method in the context of motion estimation is due to the bad localization of the SIFT descriptor. Another strongly related work is the one by Wills and Belongie which allows for large displacements by using edge correspondences in a thin-plate-spline model [18].

In principle, any sparse matching technique can be used to find initial matches. However, it is important that the descriptor matching establishes correspondences also for the smaller scale structures missed by the coarse-to-fine optical flow. Here we propose to use regions from a hierarchical segmentation of the image. This has several advantages. Firstly, regions are more likely to coincide with separately moving structures than commonly used corners or blobs. Secondly, regions allow for estimates of affine patch deformations. Additionally, the hierarchical segmentation provides a good coverage of the whole image. This avoids missing some moving parts because there is no region detected in the area. There is another region-based detector [13], which has the first two properties but does not provide a hierarchy of regions. Another reasonable strategy is to enforce consistent segmentations between frames as suggested in [20].

An important issue is the combination of the keypoint matches and the raw image data within the variational approach. The straightforward way to initialize the variational method with the interpolated keypoint matches gives large influence to outliers. Moreover, it raises the question of which scale to initialize the variational method. The optimum scale is likely to vary from image to image. Therefore, we integrate the keypoint correspondences directly into the variational approach. This allows us to make use of all the image information (not only the keypoints) already at coarse levels, and smoothly scales down the influence of the keypoints as the grid gets finer. Moreover, we integrate multiple matching hypotheses into the variational energy. This allows us to postpone an important hard decision, namely which particular candidate region is the best match, to the variational optimization where geometric constraints are available. Thanks to this formulation, outliers are treated in a proper way, without the need to tune threshold parameters.

2. Region matching

2.1. Region computation

For creating regions in the image, we rely on the segmentation method proposed in Arbelaez et al. [1]. The segmentation is based on the boundary detector gPb from [11].

Figure 2. Left: Segmentation of an image. A region hierarchy is obtained by successively splitting regions at an edge of certain relevance. Dark edges are inserted first. Right: Zoom into the hand region of two successive images.

The advantage of this boundary detector over simple edge detection is that it takes texture into account. Boundaries due to repetitive structures are damped whereas strong changes in texture create additional boundaries. Consequently, boundaries are more likely to correspond to objects or parts of objects. This is beneficial for our task, as it increases the stability of the regions to be matched.

The method returns a boundary map g(x) as shown in Fig. 2. Strong edges correspond to more likely object boundaries. It further returns a hierarchy of regions created from this map. Regions with weak edges are merged first, while separations due to strong edges persist for many levels in the hierarchy. We generally take the regions from all the levels in the hierarchy into account. From the regions of the first image, however, we only keep the most stable ones, i.e., those which exist in at least 5 levels of the hierarchy. Unstable regions are usually arbitrary subparts of large regions. They are likely to change their shape between images. We also ignore extremely small regions (with less than 50 pixels) from both images. These regions are usually too small to build a descriptor discriminative enough for reliable matching.
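To make this selection step concrete, here is a minimal sketch of the filtering just described (stability over at least 5 hierarchy levels for the first image, a minimum size of 50 pixels for both images). The Region container and the way a region is identified across levels are illustrative assumptions; the actual hierarchy of [1] is derived from a nested boundary map rather than a flat list of regions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Region:
    # Hypothetical flat container, only for illustration: the hierarchy of [1] is
    # derived from a nested boundary map, not stored like this.
    region_id: int            # identity of the region across hierarchy levels
    levels: set               # hierarchy levels at which the region exists
    mask: np.ndarray          # boolean pixel mask of the region in the image

def select_regions(regions, first_image, min_levels=5, min_pixels=50):
    """Keep the regions used for matching (Sec. 2.1): drop extremely small regions
    from both images and unstable regions from the first image."""
    kept = []
    for r in regions:
        if r.mask.sum() < min_pixels:                   # too small for a discriminative descriptor
            continue
        if first_image and len(r.levels) < min_levels:  # unstable subpart of a larger region
            continue
        kept.append(r)
    return kept
```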
2.2. Region descriptor and matching

To each region we fit an ellipse and normalize the area around the centroid of each region to a 32 × 32 patch. The normalized patch then serves as the basis for a descriptor. We build two descriptors S and C in each region. S consists of 16 orientation histograms with 8 bins, like in SIFT [9]. C comprises the mean RGB color of the same 16 subparts as the SIFT descriptor. While the orientation histograms consider the whole patch to take also the shape of the region into account, the color descriptor is computed only from parts of the patch that belong to the region.
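A rough sketch of the two descriptors, assuming the region has already been normalized to a 32 × 32 gray patch with an accompanying 32 × 32 × 3 color patch and a binary mask of the region pixels. The binning is simplified; the interpolation and windowing details of a real SIFT histogram are omitted, and the final normalization of S is an assumption, not specified in the text.

```python
import numpy as np

def region_descriptors(gray32, rgb32, mask32, n_bins=8):
    """Build S (16 x 8 orientation histograms, SIFT-like) and C (mean RGB of the
    same 16 cells, restricted to region pixels) from a normalized 32x32 patch."""
    gy, gx = np.gradient(gray32.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)          # orientation in [0, 2*pi)

    S, C = [], []
    for by in range(4):                                   # 4 x 4 grid of 8 x 8 cells
        for bx in range(4):
            sl = (slice(8 * by, 8 * by + 8), slice(8 * bx, 8 * bx + 8))
            hist, _ = np.histogram(ang[sl], bins=n_bins, range=(0, 2 * np.pi),
                                   weights=mag[sl])       # gradient-magnitude weighted
            S.append(hist)
            m = mask32[sl]
            cell_rgb = rgb32[sl]
            # mean color of the region pixels inside the cell (zeros if the cell has none)
            C.append(cell_rgb[m].mean(axis=0) if m.any() else np.zeros(3))

    S = np.concatenate(S)                                 # 128-dimensional, like SIFT
    C = np.concatenate(C)                                 # 48-dimensional color part
    S = S / (np.linalg.norm(S) + 1e-12)                   # normalization is an assumption here
    return S, C
```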
Correspondences between regions are computed by nearest neighbor matching. We compute the Euclidean distances of both descriptors separately and normalize them by the sum over all distances:

$$ d^2(S_i, S_j) = \frac{\|S_i - S_j\|_2^2}{\frac{1}{N}\sum_{k,l}\|S_k - S_l\|_2^2}, \qquad d^2(C_i, C_j) = \frac{\|C_i - C_j\|_2^2}{\frac{1}{N}\sum_{k,l}\|C_k - C_l\|_2^2}, \qquad (1) $$

where N is the total number of combinations i, j. This normalization allows to combine the distances such that both parts in average have equal influence:

$$ d^2(i, j) = \tfrac{1}{2}\left(d^2(S_i, S_j) + d^2(C_i, C_j)\right). \qquad (2) $$

We can exclude potential pairs by adding high costs to their distance. We do this for correspondences with a displacement larger than 15% of the image size or with a change in scale that is larger than factor 3. Depending on the needs of the application, these numbers can be adapted. Smaller values obviously produce fewer false matches, but restrict the allowed image transformations.
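The normalized distances of Eqs. (1)-(2) and the candidate search with the two exclusion rules can be sketched as follows. Descriptor matrices, centroids, and region scales are assumed to be given; the use of the image diagonal as the reference for the 15% displacement limit and all variable names are illustrative choices, not taken from a reference implementation.

```python
import numpy as np

def combined_distances(S1, C1, S2, C2):
    """Normalized descriptor distances, Eqs. (1)-(2).
    S1, C1: descriptors of the regions in image 1 (n1 x dim);
    S2, C2: descriptors of the regions in image 2 (n2 x dim)."""
    dS = ((S1[:, None, :] - S2[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    dC = ((C1[:, None, :] - C2[None, :, :]) ** 2).sum(-1)
    n = dS.size                                              # number of combinations (i, j)
    dS = dS / (dS.sum() / n)                                 # Eq. (1): divide by the mean distance
    dC = dC / (dC.sum() / n)
    return 0.5 * (dS + dC)                                   # Eq. (2)

def match_regions(d, centroids1, centroids2, scales1, scales2, image_diag,
                  max_disp_frac=0.15, max_scale_change=3.0, n_candidates=10):
    """Rank candidate matches per region of image 1, excluding implausible pairs
    by adding a high cost (Sec. 2.2)."""
    d = d.copy()
    disp = np.linalg.norm(centroids1[:, None, :] - centroids2[None, :, :], axis=-1)
    scale_ratio = np.maximum(scales1[:, None] / scales2[None, :],
                             scales2[None, :] / scales1[:, None])
    d[disp > max_disp_frac * image_diag] = 1e9               # displacement too large
    d[scale_ratio > max_scale_change] = 1e9                  # change in scale too large
    return np.argsort(d, axis=1)[:, :n_candidates]           # 10 nearest neighbors per region
```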
Figure 3. Displacement vectors of the matched regions drawn at their centroids. Many matches are good, but there are also outliers from regions that are not descriptive enough or their counterpart in the other image is missing.

2.3. Hypotheses refinement by deformed patches

Fig. 3 demonstrates successful matching of many regions, but also reveals outliers. This is not surprising as some of the regions are quite small and not very descriptive. Moreover, the affine transformation estimated from the region shape is not always correct as the extracted regions may not be exactly the same in both images. Finally, the above descriptors are well suited to establish a ranking of potential matches, but the descriptor distance often performs badly when used as a confidence measure since good matches and bad matches have very similar distances.
Rather than deciding on a fixed correspondence at each keypoint, which could possibly be an outlier, we propose to integrate several potential correspondences into the variational approach. For this purpose, a good confidence measure is of great importance. We found that the distance of patches globally separates good and bad matches much better than the above descriptors. The main problem with direct patch comparison (classical block matching) is the sensitivity to small shifts or deformations. With the deformation corrected, the Euclidean distance of patches is very informative, particularly when considering only pixels from within the region¹.

¹ In contrast to tracking and motion estimation, this probably does not hold for object class detection.

The optimum shift and deformation needed to match two patches can be estimated by minimizing the following cost function:

$$ E(u, v) = \int \left(P_2(x + u, y + v) - P_1(x, y)\right)^2 dx\,dy + \alpha \int \left(|\nabla u|^2 + |\nabla v|^2\right) dx\,dy, \qquad (3) $$

where P1 and P2 are the two patches, u(x, y), v(x, y) denotes the deformation field to be estimated, and α = 10000 is a tuning parameter that steers the relative importance of the deformation smoothness. The energy is a non-linearized, large displacement version of the Horn and Schunck energy and sufficient for this purpose. The regularizer gets a very high weight in this case, as without regularization every patch can be made sufficiently similar to any other.
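As a simplified illustration of this deformation estimation, the sketch below performs a single linearization of the data term in an energy of the form of Eq. (3) and solves it with classical Horn-Schunck style iterations; the paper's energy is the non-linearized version, which would additionally require warping and incremental updates for larger shifts. α = 10000 follows the text; the iteration count and the wrap-around boundary handling of np.roll are arbitrary simplifications.

```python
import numpy as np

def estimate_patch_deformation(P1, P2, alpha=10000.0, n_iter=200):
    """One linearized Horn-Schunck pass for the deformation between two patches,
    a simplified stand-in for minimizing an energy like Eq. (3)."""
    P1 = P1.astype(np.float64)
    P2 = P2.astype(np.float64)
    Iy, Ix = np.gradient(P2)                  # spatial derivatives of the second patch
    It = P2 - P1                              # temporal difference (linearization at zero shift)
    u = np.zeros_like(P1)
    v = np.zeros_like(P1)
    for _ in range(n_iter):
        # local averages over the 4-neighborhood (np.roll wraps at the borders; sketch only)
        u_avg = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                        np.roll(u, 1, 1) + np.roll(u, -1, 1))
        v_avg = 0.25 * (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                        np.roll(v, 1, 1) + np.roll(v, -1, 1))
        # Jacobi update of the linearized energy; the factor 4 comes from the
        # 4-neighborhood discretization of the Laplacian
        num = Ix * u_avg + Iy * v_avg + It
        den = 4.0 * alpha + Ix ** 2 + Iy ** 2
        u = u_avg - Ix * num / den
        v = v_avg - Iy * num / den
    return u, v
```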
As the patches are very small and a simple quadratic regularizer is applied, the estimation is quite efficient. Nevertheless, it would be a computational burden to estimate the deformation for each region pair. To this end, we preselect the 10 nearest neighbors for each patch using the distance from the previous section and compute the deformation only for these candidates. The five nearest neighbors according to the patch distance are then integrated into the variational approach described in the next section. Each potential match j = 1, ..., 5 of a region i comes with a confidence

$$ c_j(i) := \begin{cases} \dfrac{\bar{d}^2(i) - d^2(i,j)}{d^2(i,j)} & \text{if } \bar{d}^2(i) > 0 \\[4pt] 0 & \text{else,} \end{cases} \qquad (4) $$

where d²(i, j) is the Euclidean distance between the two patches after deformation correction and d̄²(i) is the average Euclidean distance among the 10 nearest neighbors. This measure takes the absolute fit as well as the descriptiveness into account. We restrict the distance to be computed only on patch positions within the region. Hence the changing background of a moving object part would not destroy similarity of a correct match.
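A small sketch of this confidence computation, assuming the deformation-corrected squared patch distances d²(i, j) of the 10 preselected neighbors of region i are already available. The epsilon guard against a zero distance and the clamping of negative values to zero are assumptions added for numerical robustness; they are not stated in the text.

```python
import numpy as np

def hypothesis_confidences(patch_dists, n_hypotheses=5, eps=1e-12):
    """Confidences c_j(i) in the spirit of Eq. (4) for one region i.
    patch_dists: squared patch distances d^2(i, j) of the 10 preselected
    nearest neighbors, after deformation correction."""
    patch_dists = np.asarray(patch_dists, dtype=np.float64)
    d_bar = patch_dists.mean()                       # average distance among the 10 neighbors
    if d_bar <= 0.0:                                 # indistinctive region: no confidence
        return np.zeros(n_hypotheses), np.arange(n_hypotheses)
    order = np.argsort(patch_dists)[:n_hypotheses]   # keep the 5 best-fitting candidates
    d = patch_dists[order]
    c = (d_bar - d) / np.maximum(d, eps)             # Eq. (4); eps only guards a perfect match
    return np.maximum(c, 0.0), order                 # clamping is an added safeguard
```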
Figure 4. Nearest neighbors and their distances using different descriptors. Top: SIFT and color. Center: Patch within region. Bottom: Patch within region after distortion correction.

Fig. 4 depicts the nearest neighbors of a sample region. Simple block matching is clearly inferior compared to SIFT and color because the high frequency information is not correctly aligned. However, computing distances on distortion corrected patches is advantageous for our task. Not only the ranking improves in this particular case, the distance is in general also more valuable as a confidence measure since it marks bad matches more clearly.

3. Variational flow

Although most of the correspondences in Fig. 3 are correct, the flow field derived from these by interpolation, as shown in Fig. 5, is far from being accurate. This is because we have a hard decision when selecting the nearest neighbor. Moreover, a lot of image information is neglected and substituted by a smoothness prior. In order to obtain a more accurate, dense flow field, we integrate the matching hypotheses into a variational approach, which combines them with local information from the raw image data and a smoothness prior.

3.1. Energy

The energy we optimize is similar to the one in [4] except for an additional data constraint that integrates the correspondence information:

$$
\begin{aligned}
E(w(x)) = {} & \int \Psi\big(|I_2(x + w(x)) - I_1(x)|^2\big)\, dx \\
& + \gamma \int \Psi\big(|\nabla I_2(x + w(x)) - \nabla I_1(x)|^2\big)\, dx \\
& + \beta \int \sum_{j=1}^{5} \rho_j(x)\, \Psi\big((u(x) - u_j(x))^2 + (v(x) - v_j(x))^2\big)\, dx \\
& + \alpha \int \Psi\big(|\nabla u(x)|^2 + |\nabla v(x)|^2 + g(x)^2\big)\, dx. \qquad (5)
\end{aligned}
$$

Here, I1 and I2 are the two input images, w := (u, v) is the sought optical flow field, and x := (x, y) denotes a point in the image. (uj, vj)(x) is one of the motion vectors derived at position x by region matching (j indexing the 5 nearest neighbors). If there is no correspondence at this position, ρj(x) = 0. Otherwise, ρj(x) = cj, where cj is the distance based confidence in (4). α = 100, β = 25, and γ = 5 are tuning parameters, which steer the importance of smoothness, region correspondences, and gradient constancy, respectively.

Like in [4], we use the robust function Ψ(s²) = √(s² + 10⁻⁶) in order to deal with outliers in the data as well as in the smoothness assumption. We also integrate the boundary map g(x) from [1] (see Fig. 2) in order to avoid smoothing across strong region boundaries.

The robust function further reduces the influence of bad correspondences and leads to the selection of the most consistent match among the five nearest neighbors. Note that each potential match has its own robust function. Spatial consistency is enforced by the smoothness prior, which integrates correspondences from the neighborhood. Many good matches in the neighborhood will outnumber mismatches, which is not the case when using a squared error measure. With α = 0 the optimum result would simply be the weighted median of the hypotheses, but with α > 0 additional matches from the surroundings are taken into account.
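For concreteness, the robust function and its derivative, which acts as a per-term weight in the fixed point iterations of Sec. 3.2 below, can be written as follows; the numerical values at the end are made up and only illustrate how a far-off (outlier) hypothesis is downweighted relative to hypotheses consistent with the current flow.

```python
import numpy as np

def psi(s2):
    """Robust function Psi(s^2) = sqrt(s^2 + 1e-6), an approximation of the L1 penalty."""
    return np.sqrt(s2 + 1e-6)

def psi_prime(s2):
    """Derivative of Psi with respect to its argument s^2; used as a weight in the
    lagged-nonlinearity (fixed point) iterations of Sec. 3.2."""
    return 0.5 / np.sqrt(s2 + 1e-6)

# Illustration: one grid point with current flow (u, v) and five hypotheses (u_j, v_j).
u, v = 2.0, 1.0
hypotheses = np.array([[2.1, 0.9], [1.8, 1.2], [2.0, 1.0], [15.0, -7.0], [2.3, 0.8]])
s2 = (u - hypotheses[:, 0]) ** 2 + (v - hypotheses[:, 1]) ** 2
print(psi_prime(s2))   # the fourth (outlier) hypothesis receives a tiny weight
```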
Rather than a straightforward three step procedure with (8)
(i) interpolation of the region correspondences, (ii) removal 0 = Ψ1 Iyk (Izk + Ixk duk + Iyk dv k )
  
of outliers not fitting the interpolated flow field (iii) optical +β j ρj Ψ2,j (v − vj ) − αdiv Ψ3 ∇(v k + dv k )
flow estimation initialized by the interpolated inlier corre-
spondences, the above energy combines all three steps in a with
single optimization problem.  
Ψ1 := Ψ (Izk + Ixk duk + Iyk dv k )2
 
3.2. Minimization Ψ2,j := Ψ (uk +duk −uj )2 + (v k +dv k −vj )2 (9)
 
 k 2 k 2 2

Ψ3 := Ψ |∇(u + du )| + |∇(v + dv )| + g .
k k
The energy is non-convex and can only be optimized
locally. We can compute the Euler-Lagrange equations, We skipped the gradient constancy term in the notation to
which state a necessary condition for a local optimum: have shorter equations. The reader is referred to [4] for the
   2  gradient constancy part. In order to solve (8), an inner fixed
Ψ Iz2 Iz Ix + γΨ Ixz 2
+ Iyz (Ixx Ixz + Ixy Iyz )
   point iteration over l is employed, where the robust func-
+β j ρj Ψ (u − uj ) + (v − vj )2 (u − uj )
 2
tions in (9) are set constant for fixed duk,l , dv k,l and are
   
−αdiv Ψ |∇u|2 + |∇v|2 + g(x)2 ∇u = 0 iteratively updated. The equations are then linear in duk,l ,
   2  (6) dv k,l and can be solved by standard iterative methods after
Ψ Iz2 Iz Iy + γΨ Ixz 2
+ Iyz (Ixy Ixz + Iyy Iyz ) proper discretization.
  
+β j ρj Ψ (u − uj )2 + (v − vj )2 (v − vj )
   
−αdiv Ψ |∇u|2 + |∇v|2 + g(x)2 ∇v = 0, 4. Experiments
where Ψ (s2 ) is the first derivative of Ψ(s2 ) with respect to We evaluated the new method on several real images
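The overall structure of this minimization can be summarized by the following skeleton. Pyramid construction, flow rescaling, warping, and the handling of the hypothesis fields follow the description above, but the inner solver is deliberately left as a stub: actually solving Eq. (8) requires the inner fixed point iterations and a linear solver such as SOR, which are omitted here. The helper names, the downsampling factor, and the number of levels are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import zoom, map_coordinates

def resize(img, shape):
    """Bilinear resampling of an image or flow component to (approximately) a given shape."""
    return zoom(img, (shape[0] / img.shape[0], shape[1] / img.shape[1]), order=1)

def solve_increment(I1, I2_warped, u, v, hyp, conf, alpha, beta, gamma):
    """Stub for the solution of Eq. (8) at one level. The real method runs the inner
    fixed point (lagged nonlinearity) iterations and a linear solver such as SOR here."""
    return np.zeros_like(u), np.zeros_like(v)

def coarse_to_fine_flow(I1, I2, hyp, conf, n_levels=6, eta=0.9,
                        alpha=100.0, beta=25.0, gamma=5.0):
    """Skeleton of the nested coarse-to-fine scheme of Sec. 3.2 (illustrative only)."""
    shapes = [tuple(max(1, int(round(s * eta ** k))) for s in I1.shape)
              for k in reversed(range(n_levels))]             # coarsest grid first
    u = np.zeros(shapes[0])
    v = np.zeros(shapes[0])                                    # w^0 := (0, 0) at the coarsest grid
    for shape in shapes:
        scale = shape[1] / u.shape[1]                          # rescale flow vectors with the grid
        u, v = resize(u, shape) * scale, resize(v, shape) * scale
        I1s, I2s = resize(I1, shape), resize(I2, shape)
        # hypothesis fields (u_j, v_j) and confidences rho_j are brought to the same grid;
        # at coarse levels the few grid points carrying matches dominate the energy
        hyp_s = [(resize(hu, shape) * scale, resize(hv, shape) * scale) for hu, hv in hyp]
        conf_s = [resize(c, shape) for c in conf]
        # warp the second image by the current flow (the compensation step of warping schemes)
        yy, xx = np.mgrid[0:shape[0], 0:shape[1]].astype(np.float64)
        I2_warped = map_coordinates(I2s, [yy + v, xx + u], order=1, mode='nearest')
        du, dv = solve_increment(I1s, I2_warped, u, v, hyp_s, conf_s, alpha, beta, gamma)
        u, v = u + du, v + dv                                  # w^{k+1} = w^k + dw^k
    return u, v
```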
4. Experiments

We evaluated the new method on several real images showing large displacements, particularly articulated motion of humans. Fig. 5 depicts results for the example from the previous sections. The fast motion of the person's right hand missed by current state-of-the-art optical flow is correctly captured when integrating point correspondences from descriptor matching. This clearly shows the improvement we aimed at. In areas without large displacements, we cannot expect the flow to be more accurate, since descriptor matching is not as precise as variational flow. However, the result is also not much spoiled by imprecise and bad matches. We quantitatively confirmed this by running [4] and the large displacement flow on five sequences of the Middlebury dataset with public ground truth [2]. There are no large displacements in any of these sequences.

Figure 5. Left: Flow field obtained by interpolating the region correspondences (nearest neighbor). Accuracy is low and several outliers
can be observed. Center left: Result with the optical flow method from [4]. The motion is mostly very accurate, but the hand motion is not
captured well. Center right: Result with the proposed method. Most of the accuracy of the optical flow framework is preserved and the
fast moving hands are captured as well. We see some degradations in the background due to outliers and too little structure to correct this.
Right: Result of SIFT Flow [8] running the code provided by the authors. Since the histograms in SIFT lack good localization properties,
the accuracy of the flow field is much lower.

Figure 6. Evolving flow field from coarse (left) to fine (right). The region correspondences dominate the estimate at the beginning. Outliers
are removed over time as more and more data from the image is taken into account.

Figure 7. Left: Input images. The camera was rotated and moved into the scene. Center left: Interpolated region correspondences.
Center right: Result with the optical flow method from [4]. Clearly, only the smaller displacements in the center and those of regions
with appropriate scale can be estimated. Right: Result with the proposed method. Aside from the unstructured and occluded areas near
the image boundaries, the flow field is estimated well.

We optimized the parameters of both approaches but kept the parameter β (which steers the influence of the point correspondences) at the same value as in the other experiments. The average angular error of the large displacement version increased on average by 27%. This means it still yields good accuracy while being able to capture larger motion.
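For reference, the average angular error used in this comparison is the standard measure of [2], the angle between the spatio-temporal vectors (u, v, 1) and (u_gt, v_gt, 1); a minimal implementation:

```python
import numpy as np

def average_angular_error(u, v, u_gt, v_gt):
    """Average angular error (AAE) in degrees between an estimated flow (u, v)
    and the ground truth (u_gt, v_gt), following the standard definition used in [2]."""
    num = 1.0 + u * u_gt + v * v_gt
    den = np.sqrt(1.0 + u ** 2 + v ** 2) * np.sqrt(1.0 + u_gt ** 2 + v_gt ** 2)
    ang = np.arccos(np.clip(num / den, -1.0, 1.0))   # angle between (u, v, 1) and (u_gt, v_gt, 1)
    return np.degrees(ang).mean()
```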
Fig. 5 also demonstrates the huge improvement over descriptor matching succeeded by interpolation to derive a dense flow field. Clearly, the weakly descriptive information in the image aside from the keypoints should not be ignored when more than a few correspondences are needed. A comparison to [8] using their code indicates that we get a better localization of the motion, which is quite natural as [8] was designed to match between different scenes.

Fig. 6 shows the evolving flow field over multiple scales. It can be seen that the influence of wrong region matches decreases as the flow field includes more and more information from the image and geometrically inconsistent matches are ignored as outliers.

Fig. 7 depicts an experiment with a static scene and a moving camera. This problem would actually be better solved by estimating the fundamental matrix from few correspondences and then computing the disparity map with global combinatorial optimization. We show this experiment to demonstrate that good results can be obtained even without exploiting the knowledge of a static scene (which may not always be true in many realistic tasks). The flow in non-occluded areas is well estimated despite huge displacements. Neither classical optical flow nor interpolated descriptor matching can produce these results.

Another potential field of application of our technique is human motion analysis. Fig. 8 shows two frames from the HumanEva-II benchmark at Brown University. The original sequence was captured with a 120fps highspeed camera. We skipped four frames to simulate the 30fps of a consumer camera.

Figure 8. Left: Two overlaid images of a running person. The images are from the HumanEva-II benchmark on human tracking. Center
left: Interpolated region correspondences. Center: Result with optical flow from [4]. The motion of the right leg is too fast to be captured
by the coarse-to-fine scheme alone. Center right: Result with the proposed model. Region correspondences guide the optical flow towards
the fast motion of the leg. Right: Image warped according to the estimated flow. The ideal result should look like the first image apart
from occluded areas. The motion of the foot tip is underestimated, but the motion of the lower leg and the rest of the body is fine.

Again we can see that the large motion of some body parts is missed with previous optical flow techniques, while it is captured much better when integrating descriptor matching. The warped image reveals that the motion of the foot tip is still underestimated, but the rest of the body including the lower leg and the arms is tracked correctly. The doubles near object boundaries are due to occlusion and indicate the correct filling of the background's zero motion.

Finally, Figs. 9 and 10 show results from a tennis sequence. The entire sequence and the corresponding flow is available in the supplementary material. The sequence was recorded with a 25fps hand held consumer camera and is very difficult due to very fast motion of the tennis player, little structure on the ground, and highly repetitive structures at the fence. The latter produce many outliers when matching regions. The video shows that most of the outliers are ignored in the course of variational optimization and also large parts of the fast motion are captured correctly. Jittering of the camera is indicated by the changing color in the background (showing changing motion directions). Even the motion of the ball is estimated in some frames. The motion of the racket and the hands is missed from time to time due to motion blur and weakly discriminative regions. Nevertheless, the results are very promising to serve as a cue in action recognition.

Computation of the flow for given segmentations took 37s on an Intel Xeon 2.33GHz for images of size 530 × 380 pixels. Most of the time is spent for the deformation of the patches and the variational flow, which are both potentially available in real-time using the GPU [19]. A GPU implementation of the segmentation takes 5s per frame.

5. Conclusions

We have shown that optical flow can benefit from sparse point correspondences from descriptor matching. The local optimization involved in optical flow methods fails to capture large motions even with coarse-to-fine strategies if small subparts move considerably faster than their surroundings. Point correspondences obtained from global nearest neighbor matching using strong descriptors can guide the local optimization to the correct large displacement. Conversely, we have also shown that weakly descriptive information, as is thrown away when selecting keypoints, contains valuable information and should not be ignored. The flow field obtained by exploiting all image information is much more accurate than the interpolated point correspondences. Moreover, outliers can be avoided by integrating multiple hypotheses into the variational approach and making use of the smoothness prior to select the most consistent one.

This work extends the applicability of optical flow to fields with larger displacements, particularly to tasks where large displacements are due to object rather than camera motion. We expect good results in action recognition when using the dense flow as a dynamic orientation feature, analogously to orientation histograms in static image recognition. However, with larger displacements there also appear new challenges such as occlusions, which we mainly ignored here. Future works should transfer the rich knowledge on occlusion handling in disparity estimation to the more general field of optical flow.

References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. From contours to regions: an empirical evaluation. Proc. CVPR, 2009.
[2] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. Proc. ICCV, 2007.
[3] M. J. Black and P. Anandan. The robust estimation of multiple motions: parametric and piecewise smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, 1996.
[4] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. Proc. ECCV, Springer LNCS 3024, 25–36, 2004.
47
Authorized licensed use limited to: ULAKBIM UASL - Hacettepe Universitesi. Downloaded on January 23,2025 at 14:00:39 UTC from IEEE Xplore. Restrictions apply.
Figure 9. Left: Two overlaid images of a tennis player in action. Center left: Region correspondences. Center right: Result with optical
flow from [4]. The motion of the right leg is too fast to be estimated. Right: The proposed method captures the motion of the leg.

Figure 10. One figure of a tennis sequence obtained with a 25fps hand held consumer camera. Despite extremely fast movements most part
of the interesting motion is correctly captured. Even the ball motion (red) is estimated here. The entire video is available as supplementary
material.

[5] A. Efros, A. Berg, G. Mori, and J. Malik. Recognizing action at a distance. Proc. ICCV, 726–733, 2003.
[6] B. Horn and B. Schunck. Determining optical flow. Artificial Intelligence, 17:185–203, 1981.
[7] I. Laptev. Local Spatio-Temporal Image Features for Motion Interpretation. PhD thesis, Computational Vision and Active Perception Laboratory, KTH Stockholm, Sweden, 2004.
[8] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT flow: dense correspondence across different scenes. Proc. ECCV, Springer LNCS 5304, 28–42, 2008.
[9] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[10] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. Proc. Seventh International Joint Conference on Artificial Intelligence, 674–679, 1981.
[11] M. Maire, P. Arbelaez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. Proc. CVPR, 2008.
[12] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):530–549, 2004.
[13] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. Proc. British Machine Vision Conference, 2002.
[14] E. Mémin and P. Pérez. Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Transactions on Image Processing, 7(5):703–719, 1998.
[15] A. Shekhovtsov, I. Kovtun, and V. Hlaváč. Efficient MRF deformation model for non-rigid image matching. Proc. CVPR, 2007.
[16] C. Strecha, R. Fransens, and L. Van Gool. A probabilistic approach to large displacement optical flow and occlusion detection. Statistical Methods in Video Processing, Springer LNCS 3247, 71–82, 2004.
[17] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: a survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2008.
[18] J. Wills and S. Belongie. A feature based method for determining dense long range correspondences. Proc. ECCV, Springer LNCS 3023, 170–182, 2004.
[19] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. Pattern Recognition - Proc. DAGM, Springer LNCS 4713, 214–223, 2007.
[20] C. L. Zitnick, N. Jojic, and S. B. Kang. Consistent segmentation for optical flow estimation. Proc. ICCV, 2005.
