Review of Serial and Parallel Min-Cut/Max-Flow Algorithms For Computer Vision
Abstract—Minimum cut/maximum flow (min-cut/max-flow) algorithms solve a variety of problems in computer vision, and thus significant effort has been put into developing fast min-cut/max-flow algorithms. As a result, it is difficult to choose an ideal algorithm for a given problem. Furthermore, parallel algorithms have not been thoroughly compared. In this paper, we evaluate the state-of-the-art serial and parallel min-cut/max-flow algorithms on the largest set of computer vision problems yet. We focus on generic algorithms, i.e., for unstructured graphs, but also compare with the specialized GridCut implementation. When applicable, GridCut performs best. Otherwise, the two pseudoflow algorithms, Hochbaum pseudoflow and excesses incremental breadth first search, achieve the overall best performance. The most memory-efficient implementation tested is the Boykov-Kolmogorov algorithm. Amongst generic parallel algorithms, we find the bottom-up merging approach by Liu and Sun to be best, but no method is dominant. Of the generic parallel methods, only the parallel preflow push-relabel algorithm is able to efficiently scale with many processors across problem sizes, and no generic parallel method consistently outperforms serial algorithms. Finally, we provide and evaluate strategies for algorithm selection to obtain good expected performance. We make our dataset and implementations publicly available for further research.
Index Terms—Algorithms, computer vision, graph algorithms, graph-theoretic methods, parallel algorithms, performance evaluation of
algorithms and systems
1 INTRODUCTION

to dynamic problems). Finally, for parallel algorithms, we do not consider whether the algorithm works well in a distributed setting, but focus on the shared memory case where the complete graph can be loaded into the memory of one machine.

The goal is that our experimental results can help researchers understand the strengths and weaknesses of the current state-of-the-art min-cut/max-flow algorithms and help practitioners when choosing a min-cut/max-flow algorithm to use for a given problem.

1.1 Related Work

Serial Algorithms  Several papers [17, 27, 36, 95] provide comparisons of different serial min-cut/max-flow algorithms on a variety of standard benchmark problems. However, many of these benchmark problems are small w.r.t. the scale of min-cut/max-flow problems that can be solved today — especially when it comes to grid graphs. Also, graphs in which nodes are not based on an image grid are severely underrepresented. Furthermore, [17, 27, 95] do not include all current state-of-the-art algorithms, while other papers do not include initialization times for the min-cut computation. As shown by Verma and Batra [95], it is important for practical use to include the initialization time, as algorithm implementations may spend as much time on initialization as on the min-cut computation. Additionally, existing papers only compare reference implementations (i.e., the implementation released by the authors) of algorithms — the exception being that an optimized version of the BK algorithm is sometimes included, e.g., in [36]. However, as implementation details — i.e., choices that are left unspecified by the algorithm description — can significantly impact performance [95], a systematic investigation of their effect is also important. Finally, existing comparisons focus on determining the overall best algorithm, even though, as we show in this work, the best algorithm depends on the features of the given graph.

Parallel Algorithms  To our knowledge, parallel min-cut/max-flow algorithms have not been systematically compared. Papers introducing parallel algorithms only compare with serial algorithms [75, 91, 103] or a single parallel algorithm [5]. The most comprehensive comparison so far was made by Shekhovtsov and Hlaváč [89], who included a generic and a grid-based parallel algorithm. However, no paper compares with the approach by Liu and Sun [75], as no public implementation is available, even though it is expected to be the fastest [89, 91]. Additionally, all papers use the same set of computer vision problems used to benchmark serial algorithms. This is not ideal, as the set lacks larger problems, which we expect to benefit the most from parallelization [56]. Therefore, how big the performance benefits of parallelization are, and when to expect them, is still to be determined.

1.2 Contributions

We evaluate current state-of-the-art generic serial and parallel min-cut/max-flow algorithms on the largest set of computer vision problems so far. We compare the algorithms on a wide range of graph problems including commonly used benchmark problems, as well as many new problem instances from recent papers — some of which are significantly larger than previous problems and expose weaknesses in the algorithms not seen with previous datasets. Since the performance of the algorithms varies between problems, we also provide concrete strategies on algorithm selection and evaluate the expected performance of these.

For the serial algorithms, we evaluate the reference implementations of the Hochbaum pseudoflow (HPF) [45, 46], the preflow push-relabel (PPR) [35], and the GridCut [51] algorithms. Moreover, to reduce the influence of implementation details, we evaluate different versions (including our own) of the Excesses Incremental Breadth First Search (EIBFS) [36] and the Boykov-Kolmogorov (BK) [11] algorithm. We chose these for an extended evaluation, as EIBFS is the most recent min-cut/max-flow algorithm and BK is still widely used in the computer vision community.

For the parallel algorithms, we provide the first comprehensive comparison of all major approaches. This includes our own implementation of the bottom-up merging algorithm by Liu and Sun [75], our own version of the dual decomposition algorithm by Strandmark and Kahl [91], the reference implementation of the region discharge algorithm by Shekhovtsov and Hlaváč [89], an implementation of the parallel preflow push-relabel algorithm by Baumstark et al. [5], and the parallel implementation of GridCut (P-GridCut) [51]. In our comparison, we evaluate not just the run time — including both the initialization time and the time for the min-cut/max-flow computations — but also the memory use of the implementations. Memory usage has not received much attention in the literature, despite it often being a limiting factor when working with large problems. Finally, we show that the current parallel algorithm implementations have unpredictable performance and unfortunately often perform worse than serial algorithms.

All tested C++ implementations (except GridCut [51]), including our new implementations of several algorithms, are available at https://github.com/patmjen/maxflow_algorithms and are archived at DOI:10.5281/zenodo.4903945 [54]. We also provide Python wrapper packages for several of the algorithms (including BK and HPF), which can be found at https://github.com/skielex/shrdr. All of our benchmark problems are available at DOI:10.11583/DTU.17091101.

2 MIN-CUT/MAX-FLOW ALGORITHMS FOR COMPUTER VISION

To illustrate the use of min-cut/max-flow, we will sketch how a vision problem, image segmentation, can be solved using min-cut/max-flow. We start by introducing our notation and defining the min-cut/max-flow problem.

We define a directed graph G = (V, E) by a set of nodes, V, and a set of directed arcs, E. We let n and m refer to the number of nodes and arcs, respectively. Each arc (i, j) ∈ E is assigned a non-negative capacity cij. For min-cut/max-flow problems, we define two special terminal nodes, s and t, which are referred to as the source and sink, respectively. The source has only outgoing arcs, while the sink has only incoming arcs. Arcs to and from the terminal nodes are known as terminal arcs.

A feasible flow in the graph G is an assignment of non-negative numbers (flows), fij, to each arc (i, j) ∈ E. A feasible flow must satisfy the following two types of constraints: capacity constraints, fij ≤ cij, and conservation constraints, Σi|(i,j)∈E fij = Σk|(j,k)∈E fjk for all nodes j ∈ V \ {s, t}. Capacity constraints ensure that the flow along an arc does not exceed its capacity. Conservation constraints ensure that the flow going into a node equals the flow coming out. See Fig. 1(a) for an example of the graph and a feasible flow. The value of the flow is the total flow out of the source or, equivalently, into the sink, and the maximum flow problem refers to finding a feasible flow that maximizes the flow value.
An s-t cut is a partition of the nodes into two disjoint sets S and T such that s ∈ S and t ∈ T. The sets S and T are referred to as the source and sink set, respectively. The capacity of the cut is the sum of capacities of the arcs going from S to T, and the minimum cut problem refers to finding a cut that minimizes the cut capacity. Often, this partition of the nodes is all that is needed for computer vision applications. Therefore, some algorithms only compute the minimum cut, and an additional step would be needed to extract the flow value for every arc.

Finally, the max-flow min-cut theorem states that the value of the maximum flow is exactly the capacity of the minimum cut. This can be shown by formulating both problems as linear programs, which reveals that max-flow is the strong dual of min-cut. Fig. 1(b) shows min-cut/max-flow on a small graph.

Fig. 1: Graph basics and serial algorithms. (a) An example of the graph and a feasible (non-maximal) flow. The flow and capacity for each arc is written as fij/cij, and (to reduce clutter) zero-values of the flow are omitted. The flow is 8, which is not maximal, so no s-t cut is evident. (b) The min-cut/max-flow with a value of 18, which all min-cut/max-flow algorithms will eventually arrive at. (c) Residual graph for the flow from (a). (d) An intermediate flow while running the AP algorithm. In the first iteration, 10 units are pushed along the path highlighted in orange and red, saturating two terminal arcs (red). In the next iteration, flow is pushed along the residual path highlighted in blue. (e) A preflow at an intermediate stage of a PPR algorithm. Nodes with excess are shown in red, and a label in green is attached to every node. (f) A pseudoflow at an intermediate stage of the HPF algorithm. Nodes with surplus/deficit are shown in red/blue, a label is attached to every node, and arcs of the tree structure are highlighted in green.

Fig. 2: Some possibilities for associating graph nodes with entities used for segmentation. Graph nodes (gray dots) associated with (a) image pixels, (b) superpixels, (c) positions in the image, (d) mesh faces, (e) mesh vertices.

Fig. 3: Some typical segmentation models. Terminal arcs are shown only for the first example. Arcs drawn in purple have infinite capacity. (a) A classical MRF segmentation with 4-connected grid graph. (b) A multi-column graph used for segmenting layered structures. (c) Two-object segmentation with inclusion constraint. (d) Three-object segmentation with mutual exclusion using QPBO.

2.1 Image Segmentation

When formulating segmentation as a min-cut/max-flow problem, one modeling choice involves deciding which structures to represent as graph nodes. Often, nodes of the graph represent individual image pixels, but various other entities may be associated with graph nodes, some of which are illustrated in Fig. 2.

The energy formulation (1) is convenient when min-cut/max-flow algorithms are used to optimize MRFs. Here, each unary energy term is a likelihood energy (negative log likelihood) of a pixel being labeled 0 or 1. Likelihood terms are typically computed directly from image data. The pairwise terms are defined for pairs of pixels, so-called neighbors, and for 2D images, the neighborhood structure is usually given by a 4- or 8-connectivity.

The typical pairwise energy terms used in (1) are

Eij(0, 0) = Eij(1, 1) = 0  and  Eij(0, 1) = Eij(1, 0) = βij.  (3)

These terms penalize neighboring pixels having different labels by a fixed amount, βij, thus encouraging smoothness of the segmentation. In this case, the construction of the s-t graph which exactly represents the energy function is straightforward: the node set is V = P ∪ {s, t}, where P is the set of pixels. For terminal arc capacities, csi and cit, we use the unary terms Ei(0) and Ei(1), respectively. Meanwhile, pairwise energy terms correspond to non-terminal arc capacities, such that cij = cji = βij. Fig. 3(a) shows this construction for a 4-connected grid graph. The binary segmentation of the image corresponds directly to the binary labeling given by the minimum cut. Put another way, the sets S and T give the optimal labeling of the nodes, and because we have a 1-to-1 mapping between non-terminal nodes and pixels, the node labeling is the segmentation. However, there are many more advanced ways to formulate image segmentation using binary energy optimization and s-t graphs, and ways to formulate other computer vision problems as well [65].

An example closely related to image segmentation is surface fitting, where [74] uses arcs of infinite capacity (i.e., infinite pairwise energy terms) to impose a structure on the optimal solution. In Fig. 3(b), downward-pointing arcs ensure that if a pixel is in the source set, the column of pixels below it is also in the source set — so the optimal solution has to be a layer. The slanted arcs impose the smoothness of this layer.
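To make the MRF graph construction above concrete, the following is a minimal sketch that builds the 4-connected grid graph and reads back the segmentation using the API of the BK reference implementation (graph.h from http://pub.ist.ac.at/~vnk/software.html). The function name and the inputs W, H, unary0, unary1, and beta are our own illustrative choices, not taken from the paper or the library:

```cpp
#include <vector>
#include "graph.h"  // BK reference implementation

typedef Graph<int, int, int> GraphType;

// Segment a W x H image with unary terms E_i(0), E_i(1) and pairwise weight beta.
int segment_grid(int W, int H, const std::vector<int>& unary0,
                 const std::vector<int>& unary1, int beta,
                 std::vector<bool>& in_source_set) {
    GraphType g(W * H, 2 * W * H);  // upper bounds on nodes and non-terminal arcs
    g.add_node(W * H);
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            const int i = y * W + x;
            g.add_tweights(i, unary0[i], unary1[i]);          // c_si = E_i(0), c_it = E_i(1)
            if (x + 1 < W) g.add_edge(i, i + 1, beta, beta);  // c_ij = c_ji = beta_ij
            if (y + 1 < H) g.add_edge(i, i + W, beta, beta);
        }
    }
    const int flow = g.maxflow();  // min-cut/max-flow value
    in_source_set.assign(W * H, false);
    for (int i = 0; i < W * H; ++i)  // S/T partition is the binary segmentation
        in_source_set[i] = (g.what_segment(i) == GraphType::SOURCE);
    return flow;
}
```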
It is also possible to formulate multi-label/multi-object segmentation problems that can be solved with a single s-t cut [22, 50, 57, 74], or by iteratively changing and computing the cut [8, 49]. For the single-cut Ishikawa method [50], it is common to duplicate the graph for each label, i.e., having a sub-graph per label. For example, in Fig. 3, each pixel is represented by two nodes: one for object A and one for object B, so a pixel may be segmented as belonging to A, B, both, or neither. The submodular interaction between the objects may be achieved by adding arcs between the sub-graphs. Fig. 3(c) shows submodular interaction, where arcs with infinite capacity ensure that if a pixel belongs to object A, this pixel and all its neighbors also belong to object B. This is known as inclusion or containment with a minimum margin of one.

In the examples covered so far, the arcs between the graph nodes correspond to submodular energy terms, which means the energy is lower when the nodes belong to the same set (S or T). Mutual exclusion, in the general case, requires non-submodular energies, which are not directly translatable to arcs in the graphs shown so far. An alternative is to use QPBO [63], as illustrated in Fig. 3(d), which can handle any energy function of the form in (1) — submodular or not. When using QPBO, we construct two sub-graphs for each object: one representing the object and another representing its complement. The exclusion of two objects, say A and B, is then achieved by adding inclusion arcs from A to the complement of B and from B to the complement of A. However, there is no guarantee that the min-cut/max-flow solution yields a complete segmentation of the object, as the object and its complement may disagree on the labeling of some nodes, leaving them "unlabeled". The number of unlabeled nodes depends on the non-submodularity of the system. Extensions to QPBO, such as QPBO-P and QPBO-I [88], may be used to iteratively assign labels to the nodes that QPBO failed to label.

3 SERIAL MIN-CUT/MAX-FLOW ALGORITHMS

All min-cut/max-flow algorithms find the solution by iteratively updating a flow that satisfies the capacity constraints. Such a flow induces a residual graph with the set of residual arcs, R, given by

R = {(i, j) ∈ V×V | (i, j) ∈ E, fij < cij, or (j, i) ∈ E, fji > 0}.  (4)

Each of the residual arcs has a residual capacity given by c′ij = cij − fij if (i, j) ∈ E or c′ij = fji if (j, i) ∈ E. In other words, residual arcs tell us how much flow on the original arc we can increase or decrease, see Fig. 1(c). If the graph contains bidirectional arcs, both conditions from (4) may be met, and the residual capacity then equals the sum of the two contributions.

Serial min-cut/max-flow algorithms can be divided into three families: augmenting paths, preflow push-relabel, and pseudoflow algorithms. In this section, we provide an overview of how algorithms from each family work.

3.1 Augmenting Paths

The augmenting paths (AP) family of min-cut/max-flow algorithms is the oldest of the three families and was introduced with the Ford-Fulkerson algorithm [28]. An algorithm from the AP family always maintains a feasible flow. It works by repeatedly finding so-called augmenting paths, which are paths from s to t in the residual graph. When an augmenting path is found, a flow is pushed along the path. Pushing flow means increasing flow for each forward arc along the path, and decreasing flow for each reverse arc. To maintain the capacity constraints, the flow that is pushed equals the minimum residual capacity along the path. Conservation constraints are maintained as the algorithm only updates complete paths from s to t. The algorithm terminates when no augmenting paths can be found. Fig. 1(d) shows an intermediate stage of an AP algorithm.

The primary difference between various AP algorithms lies in how the augmenting paths are found. For computer vision applications, the most popular AP algorithm is the Boykov-Kolmogorov (BK) algorithm [11], which works by building search trees from both the source and sink nodes to find augmenting paths and uses a heuristic that favors shorter augmenting paths. The BK algorithm performs well on many computer vision problems, but its theoretical run time bound is worse than other algorithms [95]. In terms of performance, the BK algorithm has been surpassed by the Incremental Breadth First Search (IBFS) algorithm by Goldberg et al. [37]. The main difference between the two algorithms is that IBFS maintains the source and sink search trees as breadth-first search trees, which results in both better theoretical run time and better practical performance [36, 37].

3.2 Preflow Push-Relabel

The second family of algorithms are the preflow push-relabel (PPR) algorithms, which were introduced by Goldberg and Tarjan [35]. These algorithms use a so-called preflow, which satisfies capacity constraints but allows nodes to have more incoming than outgoing flow, thus violating conservation constraints. The difference between the incoming and outgoing flows for a node, i, is denoted as its excess, ei ≥ 0.

The PPR algorithms work by repeatedly pushing flow along individual arcs. To determine which arcs admit flow, the algorithms maintain an integer labeling (so-called height), di, for every node. The labeling provides a lower bound on the distance from the node to the sink and has a no steep drop property, meaning d(i) − d(j) ≤ 1 for any residual arc (i, j).

An algorithm from the PPR family starts by saturating the source arcs and raising the source to d(s) = n. The algorithm then works by repeatedly selecting a node with excess (after selection called a selected node) and applying one of two actions [19, 35]: push or relabel. If there is an arc in the residual graph leading from the selected node to a lower-labeled node, push is performed. This pushes excess along the arc, until all excess is pushed or the arc is saturated. If no residual arc leads to a lower node, the relabel operation is used to lift the selected node (increase its label) by one. Fig. 1(e) shows an intermediate step of a PPR algorithm.

When there are no nodes with excess left, the preflow is the maximum flow. It is possible to terminate the algorithm earlier, when no nodes with excess have a label di < n. At this point, the minimum s-t cut can be extracted by inspecting the node labels. If di ≥ n, then i ∈ S, otherwise i ∈ T. Extracting the maximum flow requires an extra step of pushing all excess back to the source. However, this work generally only represents a small part of the run time [95] and, for computer vision applications, we are typically only interested in the minimum cut anyway.

The difference between various PPR algorithms lies in the order in which the push and relabel operations are performed. Early variants used simple heuristics, such as always pushing flow from the node with the highest label or using a first-in-first-out queue to keep track of nodes with positive excess [17]. More recent versions [3, 33, 34] use sophisticated heuristics and a mix of local and global operations to obtain significant performance improvements over early PPR algorithms.
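To make the push and relabel operations concrete, the following is a minimal serial FIFO push-relabel sketch on a dense capacity matrix. It is a toy illustration of the generic scheme described above, not one of the tested implementations, which use far more sophisticated selection heuristics and global relabeling:

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// cap is a dense capacity matrix; s and t are the terminal node indices.
long long push_relabel_maxflow(std::vector<std::vector<long long>> cap, int s, int t) {
    const int n = (int)cap.size();
    std::vector<std::vector<long long>> f(n, std::vector<long long>(n, 0));
    std::vector<long long> excess(n, 0);
    std::vector<int> d(n, 0);                 // integer labels ("heights")
    std::queue<int> active;                   // FIFO queue of nodes with excess
    d[s] = n;                                 // raise the source to d(s) = n
    for (int v = 0; v < n; ++v) {             // saturate all source arcs
        if (cap[s][v] > 0) {
            f[s][v] = cap[s][v]; f[v][s] = -cap[s][v];
            excess[v] += cap[s][v];
            if (v != t) active.push(v);
        }
    }
    while (!active.empty()) {
        int u = active.front(); active.pop();
        while (excess[u] > 0) {
            bool pushed = false;
            for (int v = 0; v < n && excess[u] > 0; ++v) {
                // push: admissible residual arc to a lower-labeled node
                if (cap[u][v] - f[u][v] > 0 && d[u] == d[v] + 1) {
                    long long delta = std::min(excess[u], cap[u][v] - f[u][v]);
                    f[u][v] += delta; f[v][u] -= delta;
                    excess[u] -= delta; excess[v] += delta;
                    if (v != s && v != t && excess[v] == delta) active.push(v);
                    pushed = true;
                }
            }
            if (!pushed) d[u] += 1;           // relabel: lift the node by one
        }
    }
    return excess[t];                         // excess collected at t = max-flow value
}
```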
Unlike other serial algorithms, the algorithms from the PPR family operate locally on nodes and arcs. This, as we shall discuss later, has resulted in a whole family of parallel PPR algorithms.

3.3 Pseudoflow

The most recent family of min-cut/max-flow algorithms is the pseudoflow family, which was introduced with the Hochbaum pseudoflow (HPF) algorithm [45, 46]. These algorithms use a so-called pseudoflow, which satisfies capacity constraints but not the conservation constraints, as it has no constraints on the difference between incoming and outgoing flow. As with preflow, we refer to the difference between incoming and outgoing flow for a node as its excess, ei. A positive excess is referred to as a surplus and a negative excess as a deficit.

During operation, HPF algorithms maintain two auxiliary structures: the forest of trees and a node labeling function. Only one node in every tree, the root, is allowed to have an excess. The algorithm works by repeatedly pushing the flow along the paths connecting the trees, and growing the trees.

A generic algorithm from the HPF family is initialized by saturating all terminal arcs. At this point, each graph node is a singleton tree in the forest. The algorithm then selects a tree with surplus containing at least one node with a label less than n (the number of nodes in the graph). In this tree, i denotes the node with the lowest label. If there are no residual arcs from i to a node with a lower label, the label of i is incremented. If there is a residual arc (i, j) that leads to a node j with a lower label, a merge is performed. This operation involves pushing surplus along the path from the root of the tree containing i, over i, over j, and to the root of the tree containing j. If the arc capacities along this path allow it, the entire surplus will be pushed and the trees will be merged with j as the root. If the flow along the path saturates an arc (i′, j′), a surplus will be collected in i′, and a new tree rooted in i′ will be created. In contrast to the AP algorithms, the only restrictions on how much flow to push are the individual arc capacities, not the path capacity.

The algorithm terminates when no selection can be made, at which point nodes labeled with n constitute the source set. Additional processing is needed to recover the maximum feasible flow. Fig. 1(f) shows an intermediate step of the HPF algorithm.

There are two main algorithms in this family: HPF and Excesses Incremental Breadth First Search (EIBFS) [36]. The main differences are the order in which they scan through nodes when looking for an arc connecting two trees in the forest, and how they push flow along the paths. Both have sophisticated heuristics for these choices, which make use of many of the same ideas developed for PPR algorithms.

3.4 Implementation Details

As stressed by [95], the implementation details can significantly affect the measured performance of a given min-cut/max-flow algorithm. In this section, we will highlight the trends of modern implementations and how they differ.

3.4.1 Data Structures

Modern implementations represent the graph with Node and Arc structures stored in arrays. The exact contents vary between implementations, but an Arc structure typically stores a pointer to the node it points to, a pointer to the next outgoing arc for the node it points from, a pointer to its reverse arc, and a residual capacity. For algorithms implemented with computer vision applications in mind (e.g., BK, IBFS, and EIBFS), the terminal arcs are stored as a single combined terminal capacity for each Node, instead of using the Arc structures. Other implementations simply keep track of the source and sink nodes and use Arc structures for all arcs. The HPF implementation uses a bidirectional Arc structure with a capacity, a flow, and a direction. It is also common to store auxiliary values such as excesses, labels, or more.

As a result of these differences, the memory footprint varies between implementations, as shown in Table 1. The footprint also depends heavily on the data types used to store the data, in particular references to nodes and arcs, as we discuss in the next subsection. For storing arc capacities, integers are common because they are computationally efficient and may use as little as 1 byte. However, some graph constructions involve large capacities to model hard constraints, and here some care must be taken to avoid overflow issues. With floats, this can be modeled using infinite capacity. However, floats are less efficient, and some algorithms are not guaranteed to terminate with floats due to numerical errors.

As the size of the data structures influences how much the CPU can store in its caches, which has a large effect on performance, it is generally beneficial to keep the data structures small. Note that some compilers do not pack data structures densely by default, which may significantly increase the size of the Arc and Node data structures.

3.4.2 Indices vs. Pointers

One way to reduce the size of the Arc and Node data structures on 64-bit system architectures is to use indices instead of pointers to reference nodes and arcs. As long as the indices can be stored using unsigned 32-bit integers, we can halve the size of arc and node references by using unsigned 32-bit integers instead of pointers (which are 64-bit). This approach can significantly reduce the size of the Arc and Node data structures, as the majority of the structures consist of references to other arcs and nodes [52]. As the performance of min-cut/max-flow algorithms is mainly limited by memory speed, smaller data structures can often lead to improved performance. The downside of indices is that extra computations may be needed for every look-up, although this depends on the exact assembly instructions the compiler chooses to use.

Some grid-based algorithms [52] use 32-bit indices to reduce the size of their data structure. The generic algorithms we have investigated in this work all use pointers to store references between nodes and arcs. Some implementations avoid the extra memory requirement by compiling with 32-bit pointers. However, 32-bit pointers limit the size of the graph much more than 32-bit indices. The reason is that the 32-bit pointers only have 4 GiB of address space, and the Node and Arc structures they point to take up many bytes. For example, the smallest Arc structure we have tested, cf. Table 1, uses 24 bytes, meaning that an implementation based on 32-bit indices could handle graphs with 24 times more arcs than an implementation based on 32-bit pointers.
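As a rough sketch of the space savings discussed above, compare a pointer-based and an index-based Arc structure. These are our own illustrative definitions, not taken from any of the tested implementations:

```cpp
#include <cstdint>

struct NodePtr;         // forward declaration; contents omitted

// Pointer-based arc: three 8-byte pointers plus a 4-byte capacity.
struct ArcPtr {
    NodePtr* head;      // node the arc points to
    ArcPtr* next;       // next outgoing arc of the tail node
    ArcPtr* sister;     // reverse arc
    int32_t r_cap;      // residual capacity
};                      // 28 bytes of data, usually padded to 32 bytes

// Index-based arc: 32-bit indices into the node and arc arrays.
struct ArcIdx {
    uint32_t head;      // index of the node the arc points to
    uint32_t next;      // index of the next outgoing arc
    uint32_t sister;    // index of the reverse arc
    int32_t r_cap;      // residual capacity
};                      // 16 bytes, but limited to 2^32 - 1 arcs

static_assert(sizeof(ArcIdx) == 16, "expected dense packing");
```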
3.4.3 Arc and Node Packing

Arc packing stores all outgoing arcs from the same node adjacent in memory, which improves cache efficiency during traversal. However, as arcs may be added to the graph in any order, packing the arcs usually incurs an overhead from maintaining the correct ordering or reordering all arcs as an extra step before computing the min-cut/max-flow. Similar to arc packing, node packing may improve performance. However, this is not done in practice, as opposed to arc packing.

Of the serial reference implementations that we examined, only HI-PR [19], IBFS, and EIBFS implement arc packing. These all implement it as an extra step, where arcs are reordered after building the graph but before the min-cut/max-flow computations start. None of the examined implementations use node packing.

3.4.4 Arc Merging

In practice, it is not uncommon that multiple arcs between the same pair of nodes are added to the graph. Merging these arcs into a single arc with a capacity equal to the sum of capacities of the merged arcs may reduce the graph size significantly. As this decreases both the memory footprint of the graph and the number of arcs to be processed, it can provide substantial performance benefits [52, 89]. However, as redundant arcs can usually be avoided by careful graph construction and they should have approximately the same performance impact on all algorithms, we have not investigated the effects of this further.

4 PARALLEL MIN-CUT/MAX-FLOW

Like serial algorithms, parallel algorithms for min-cut/max-flow problems can be split into families based on shared characteristics. A key characteristic is whether the algorithms parallelize over individual graph nodes (node-based parallelism) or split the graph into sub-graphs that are then processed in parallel (block-based parallelism). Other important algorithmic traits include whether the algorithm is distributed, which we do not consider in this paper, and the guarantees in terms of convergence, optimality, and completeness provided by the algorithm.

We should note that since many (but not all) min-cut/max-flow problems in computer vision are defined on grid graphs, several algorithms [51, 52, 86, 96] have exploited this structure to create very efficient parallel implementations. However, many important computer vision problems are not defined on grid graphs, so in this paper we focus on generic min-cut/max-flow algorithms.

The category of node-based parallel algorithms is generally dominated by parallel versions of PPR algorithms. In the block-based category, we have identified three main approaches, which we investigate: adaptive bottom-up merging, dual decomposition, and region discharge. In the following sections, we give an overview of each approach and briefly discuss its merits and limitations.

4.1 Parallel Preflow Push-Relabel

PPR algorithms have been the target of most parallelization efforts [2, 4, 5, 22, 32, 47, 96], since both push and relabel are local operations, which makes them well suited for parallelization. Because the operations are local, the algorithms generally parallelize over each node — performing pushes and relabels concurrently. To avoid data races during these operations, PPR algorithms use either locking [2] or atomic operations [47]. As new excesses are created, the corresponding nodes are added to a queue from which threads can poll them. In [5], a different approach is applied, where pushes are performed in parallel, but excesses and labels are updated later in a separate step, rather than immediately after the push.

Since parallel PPR algorithms parallelize over every node, they can achieve good speed-ups and scale well to modern multi-core processors [5], or even GPUs [96]. However, these algorithms have not achieved dominance outside of large grid graphs for min-cut/max-flow problems [103]. Since GPU hardware has advanced considerably in recent years, it is unclear whether GPU methods should remain restricted to grid graphs, but this question is not within the scope of this paper.

4.2 Adaptive Bottom-Up Merging

The adaptive bottom-up merging approach introduced by Liu and Sun [75] uses block-based parallelism and has two phases, which are summarized in Fig. 4. In phase one, the graph is partitioned into a number of disjoint sets (blocks), and arcs between blocks have their capacities set to 0 — effectively removing them from the graph. For each pair of blocks connected by arcs, we store a list of the connecting arcs (with capacities now set to 0) along with their original capacities. Disregarding s and t, the nodes in each block now belong to disjoint sub-graphs, and we can compute the min-cut/max-flow solution for each sub-graph in parallel. The min-cut/max-flow computations are done with the BK algorithm — although one could in theory use any min-cut/max-flow algorithm.

Fig. 4: Illustration of the adaptive bottom-up merging approach for parallel min-cut/max-flow. Terminal nodes and arcs are not shown. Note that the underlying graph does not have to be a grid graph. Phase one: (a) The graph is split into blocks and the min-cut/max-flow is computed for each block in parallel. Phase two: (b) The topmost blocks are locked, merged, and the min-cut/max-flow recomputed. (c) As the topmost block is locked, the next thread works on the bottom-most blocks (in parallel). (d) The last two blocks are merged and the min-cut/max-flow recomputed to achieve the globally optimal solution.

In phase two, we merge the blocks to obtain the complete, globally optimal min-cut/max-flow. To merge two blocks, we restore the arc capacities for the connecting arcs and then recompute the min-cut/max-flow for the combined graph. This step makes use of the fact that the BK algorithm can efficiently recompute the min-cut/max-flow when small changes are made to the residual graph for a min-cut/max-flow solution [62].

For merges in phase two to be performed in parallel, the method marks the blocks being merged as locked. The computational threads then scan the list of block pairs, which were originally connected by arcs, until they find a pair of unlocked blocks. The thread then locks both blocks, performs the merge, and unlocks the new combined block. To avoid two threads trying to lock the same block, a global lock prevents more than one thread from scanning the list of block pairs at a time.

As the degree of parallelism decreases towards the end of phase two — since there are few blocks left to merge — performance increases when computationally expensive merges are performed early in phase two.
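The phase-two merge loop just described can be summarized with the following sketch. Block, BlockPair, restore_arcs, and resolve are hypothetical placeholders; the actual implementation maintains each block with a BK-style solver that supports incremental recomputation:

```cpp
#include <mutex>
#include <vector>

struct Block { std::mutex lock; /* sub-graph and solver state */ };
struct BlockPair { Block* a; Block* b; bool merged = false; };

std::mutex scan_lock;  // global lock: one thread scans the pair list at a time

void merge_worker(std::vector<BlockPair>& pairs) {
    while (true) {
        BlockPair* job = nullptr;
        {
            std::lock_guard<std::mutex> guard(scan_lock);
            // The list is assumed ordered by the merge cost heuristic (see below).
            for (auto& p : pairs) {
                if (p.merged) continue;
                if (!p.a->lock.try_lock()) continue;                   // block busy
                if (!p.b->lock.try_lock()) { p.a->lock.unlock(); continue; }
                p.merged = true;  // claim the pair with both blocks locked
                job = &p;
                break;
            }
        }
        if (!job) return;  // no unlocked pair found; this sketch simply stops
        // restore_arcs(*job);  // re-insert connecting arcs with saved capacities
        // resolve(*job);       // recompute min-cut/max-flow on the combined block
        job->b->lock.unlock();  // unlock the new combined block
        job->a->lock.unlock();
    }
}
```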
To estimate the cost of merging two blocks, [75] uses a heuristic based on the potential for new augmenting paths to be formed by merging two blocks. This heuristic determines the merging order of the blocks.

By using block-based, rather than node-based parallelism, adaptive bottom-up merging avoids much of the synchronization overhead that the parallel PPR algorithms suffer from. However, its performance depends on the majority of the work being performed in phase one and in the beginning of phase two, where the degree of parallelism is high.

4.3 Dual Decomposition

The dual decomposition (DD) approach was introduced by Strandmark and Kahl [91] and later refined by Yu et al. [103]. The approach was originally designed to allow for distributed computing, such that it is never necessary to keep the full graph in memory. Their algorithm works as follows: first, the nodes of the graph are divided into a set of overlapping blocks (see Fig. 5(a)). The graph is then split into disjoint blocks, where the nodes in the overlapping regions are duplicated in each block (see Fig. 5(b)). It is important that the blocks overlap such that if node i is connected to node j in block bj and node k in block bk, then i is also in both blocks bj and bk.

(Figure panels: Split graph, Solve blocks, Overlap disagrees, Overlap agrees.)
(Figure panels: Split graph, Sync. borders, Re-solve blocks.)

The blocks are then solved independently, and the duplicated nodes are pushed towards agreeing on their labeling by iteratively updating the terminal capacities of the duplicates; however, this process is not guaranteed to converge. Yu et al. [103] therefore introduced a new version with a simple strategy that guarantees convergence: if the duplicated nodes in two blocks do not belong to the same set, S or T, after a fixed number of iterations, the blocks are merged and the algorithm continues. This trivially guarantees convergence since, in the worst case, all blocks will be merged, at which point the global solution will be computed serially. However, performance significantly drops when merging is needed for the algorithm to converge, as merging only happens after a fixed number of iterations and all blocks may (in the worst case) have to be merged for convergence.

4.4 Region Discharge

The region discharge (RD) approach was introduced by Delong and Boykov [22] and later generalized by Shekhovtsov and Hlaváč [89]. The idea builds on the vertex discharge operation introduced for PPR in [35]. Similarly to DD by Strandmark and Kahl, RD was designed to allow for distributed computing. The method first partitions the graph into a set of blocks (called regions in [89] following the terminology of [22]). Each block R has an associated boundary defined as the set of nodes

BR = {v ∈ V | v ∉ R, (u, v) ∈ E, u ∈ R, v ≠ s, t}.  (5)

Capacities for arcs going from a boundary node to a block node are set to zero. This means that flow can be pushed out of the block into the boundary, but not vice versa. Furthermore, each node is allowed to have an excess.
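As a small illustration of the boundary definition in (5), the following computes BR for a given block. This is our own illustrative code, not taken from the reference implementation:

```cpp
#include <unordered_set>
#include <utility>
#include <vector>

// B_R: nodes outside R that are pointed to by an arc starting in R,
// excluding the terminals s and t, cf. Eq. (5).
std::unordered_set<int> boundary(const std::vector<std::pair<int, int>>& arcs,
                                 const std::unordered_set<int>& R,
                                 int s, int t) {
    std::unordered_set<int> B;
    for (const auto& uv : arcs) {
        const int u = uv.first, v = uv.second;
        if (R.count(u) && !R.count(v) && v != s && v != t)
            B.insert(v);
    }
    return B;
}
```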
Each block is then repeatedly discharged, pushing excess flow inside the block towards its boundary, after which the flow collected on the boundary arcs is synchronized with the neighboring blocks. This may create additional excesses in some blocks, since boundary nodes overlap with another block. The discharge and synchronization process is repeated until no new excesses are created, at which point the algorithm terminates. It is proved in [89] that this process terminates in at most 2n² iterations of discharge and synchronization when using PPR and 2nB² + 1 when using AP, where nB is the total number of boundary nodes.

The guarantee of convergence, without having to merge blocks, is beneficial, as it means that the algorithm can maintain a high degree of parallelism while computing the min-cut/max-flow solution. However, because flow must be synchronized between blocks, the practical performance of the method still depends on well-chosen blocks and may be limited by synchronization overhead. For details on the heuristics used in the algorithm, which are also important for its practical performance, see [89].

5 PERFORMANCE COMPARISON

We now compare the performance of the algorithms discussed in the previous sections. For all experiments, the source code was compiled with the GCC C++ compiler version 9.2.0 with -O3 optimizations on a 64-bit Linux-based operating system with kernel release 3.10. Experiments were run on a dual socket NUMA (Non-Uniform Memory Access) system with two Intel Xeon Gold 6226R processors with 16 cores each and HTT (Hyper-Threading Technology) disabled, for a total of 32 parallel CPU threads. The system has 756 GB of RAM, and for all experiments all data were kept in memory. All resources were provided by the DTU Computing Center [23].

For all parallel benchmarks, we prefer local CPU core and memory allocation. This means that for all parallel benchmarks with up to 16 threads, all cores are allocated on the same CPU/NUMA node. If the data fits in the local memory of the active node, we use this memory exclusively. If the data cannot fit in the local memory of one node, memory of both NUMA nodes is used. For benchmarks with more than 16 threads, both CPUs and their memory pools are used.

Run time was measured as the minimum time over three runs, and no other processes (apart from the OS) were running during the benchmarks. We split our measured run time into two distinct phases: build time and solve time. Build time refers to the construction of the graph and any additional data structures used by an algorithm. If the algorithm performs arc packing or similar steps, this is included in the build time. To ensure that the build time is a fair representation of the time used by a given algorithm, we precompute a list of nodes and arcs and load these lists fully into memory before starting the timer. Solve time refers to the time required to compute the min-cut/max-flow. For the pseudoflow, PPR, and region discharge algorithms (cf. Table 1), which only compute a minimum cut, we do not include the time to extract the full feasible maximum flow solution. The reason for this is that for most computer vision applications the minimum cut is of principal interest. Furthermore, converting to a maximum flow solution usually only adds a small overhead [95].

5.1 Datasets

We test the algorithms on the following benchmark datasets:
(1) The commonly used University of Waterloo [98] benchmark problems. Specifically, we use 6 stereo [14, 64], 36 3D voxel segmentation [9, 12, 10], 2 multi-view reconstruction [72, 13], and 1 surface fitting [71] problems.
(2) The 4 super resolution [30, 88], 4 texture restoration [88], 2 deconvolution [88], 78 decision tree field (DTF) [82], and 3 automatic labelling environment (ALE) [25, 67, 68, 69] datasets from Verma's and Batra's survey [95].
(3) New problems that use anisotropic MRFs [38] to segment blood vessels in large voxel volumes from [87]. We include 3 problems where the segmentation is applied directly to the image data and 3 to the output of a trained V-Net [79].
(4) New problems that use MRFs to clean 3D U-Net [20] segmentations of prostate images from [90]. We contribute 4 benchmark problems.
(5) New problems on mesh segmentation based on [76]. We contribute 8 benchmark problems. The original paper uses α-expansion and αβ-swaps [9, 11] to handle the multi-class segmentation problem. For our benchmarks, we instead use QPBO to obtain the segmentation with a single min-cut, which may lead to different results compared with the referenced method.
(6) New problems using the recent Deep LOGISMOS [42] to segment prostate images from [90]. We contribute 8 problems.
(7) New problems performing multi-object image segmentation via surface fitting from two recent papers [53, 57]. We contribute 9 problems using [53] and 8 using [57].
(8) New problems performing graph matching from the recent paper [48]. The original matching problems can be found at https://vislearn.github.io/libmpopt/iccv2021. For each matching, several QPBO sub-problems are solved. We contribute the QPBO sub-problems (300 per matching problem) for each of the 316 matching problems.

In total, our benchmark includes 495 problems covering a variety of different computer vision applications. Note that some datasets consist of many small sub-problems that must be run in sequence. Here, we report the accumulated times. All the benchmark problems are available at DOI:10.11583/DTU.17091101 [55].

For the parallel algorithm benchmarks, we only include a subset of all datasets. This is because parallelization is mainly of interest for large problems with long solve times. For the block-based algorithms, we split the graph into blocks in one of the following ways: For graphs based on an underlying image grid, we define blocks by recursively splitting the image grid along its longest axis. For the surface-based segmentation methods [53, 57], we define blocks such that nodes associated with a surface are in their own block. For mesh segmentation, we compute the geodesic distance between face centers and then use agglomerative clustering to divide the nodes associated with each face into blocks. For bottom-up merging, we use 64 blocks for the following datasets: the grid graphs, the mesh segmentation, and the cells, foam, and simcells. For the NT32 tomo data, we use two blocks per object. For 4Dpipe, we use a block per 2D slice. For P-GridCut, we use the same blocks as for bottom-up merging. For dual decomposition and region discharge, we use one and two blocks per thread, respectively.

5.2 Tested Implementations

All tested implementations (except GridCut [51]) are available at https://github.com/patmjen/maxflow_algorithms and are archived at DOI:10.5281/zenodo.4903945 [54]. Beware that the implementations are published under different licenses — some open and some restrictive. See the links above for more information.
In the following, typewriter font refers to a specific implementation of a given algorithm. We use this for BK and EIBFS, where we test more than one implementation of each algorithm, e.g., BK refers to the algorithm, BK is the reference implementation, and MBK is one of our implementations.

BK [11]  We test the reference implementation (BK) of the Boykov-Kolmogorov algorithm from http://pub.ist.ac.at/~vnk/software.html. Furthermore, we test our own implementation of BK (MBK), which contains several optimizations. Most notably, our version uses indices instead of pointers to reduce the memory footprint of the Node and Arc data structures. Finally, we test a second version (MBK-R), which reorders arcs so that all outgoing arcs from a node are adjacent in memory. This increases cache efficiency, but uses more memory (see Table 1) and requires an extra initialization step. The memory overhead from reordering could be reduced by ordering the arcs in-place; however, this may negatively impact performance. Therefore, we opt for the same sorting strategy as EIBFS, where arcs are copied during reordering.

EIBFS [36]  We test a slightly modified version [49] (EIBFS) of the excesses incremental breadth first search algorithm originally implemented by [36], available from https://github.com/sydbarrett/AlphaPathMoves. This version uses slightly larger data structures to support non-integer arc capacities and larger graphs, compared to the implementation tested in [36]. Although these changes may slightly decrease performance, we think it is reasonable to use the modified version, as several of the other algorithms have made similar sacrifices in terms of performance. Additionally, we test our own modified version of EIBFS (EIBFS-I), which replaces pointers with indices to reduce the memory footprint. Finally, since both EIBFS and EIBFS-I perform arc reordering during initialization, we also test a version without arc reordering (EIBFS-I-NR) to better compare with other algorithms.

HPF [45]  We test the reference implementation of Hochbaum pseudoflow (HPF) from https://riot.ieor.berkeley.edu/Applications/Pseudoflow/maxflow.html. This implementation has four different configurations that we test:
1) Highest label with FIFO buckets (HPF-H-F).
2) Highest label with LIFO buckets (HPF-H-L).
3) Lowest label with FIFO buckets (HPF-L-F).
4) Lowest label with LIFO buckets (HPF-L-L).

HI-PR [19]  We test the implementation of the preflow push-relabel algorithm from https://cmp.felk.cvut.cz/~shekhovt/d_maxflow/index.html.

P-ARD [89]  We test the implementation of parallel augmenting paths region discharge (P-ARD) from https://cmp.felk.cvut.cz/~shekhovt/d_maxflow/index.html. P-ARD is an example of the region discharge approach. It uses BK as the base solver. Note that, as the implementation is designed for distributed computing, it makes use of disk storage during initialization, which increases the build time.

Liu-Sun [75]  Since no public reference implementation is available, we test our own implementation of the adaptive bottom-up merging approach based on the paper by Liu and Sun [75]. Our implementation uses MBK as the base solver.

P-PPR [5]  We test the implementation of a recent parallel preflow push-relabel algorithm from https://github.com/niklasb/pbbs-maxflow.

Strandmark-Kahl [91]  We test our own implementation of the Strandmark-Kahl dual decomposition algorithm based on the implementation at https://cmp.felk.cvut.cz/~shekhovt/d_maxflow/index.html. The original implementation can only handle grid graphs with rectangular blocks, while our implementation can handle arbitrary graphs and arbitrary blocks at the cost of some additional overhead during graph construction. Our implementation uses MBK as the base solver. Note that our implementation does not implement the merging strategy proposed by [103] and, therefore, is not guaranteed to converge. We only include results for cases where the algorithm does converge.

GridCut [51, 52]  We test both the serial and parallel versions of the highly optimized commercial GridCut implementation from https://gridcut.com. The primary goal is to show how much performance can be gained by using an implementation optimized for grid graphs. GridCut is only tested on problems with graph structures that are supported by the reference implementation, i.e., 4- and 8-connected neighbor grids in 2D, and 6- and 26-connected (serial only) neighbor grids in 3D.

Table 1 lists the tested implementations along with their type and memory footprint. The memory footprint can be calculated based on the number of nodes and arcs in the graph and will be discussed further in Section 7.

TABLE 1: Summary of the tested implementations including their memory footprint. The table shows the bytes required as a function of the number of nodes, n, number of terminal arcs, mT, and number of neighbor arcs, mN. We assume the common case of 32-bit capacities and 32-bit indices, which is also what we use for all of our experiments. Since HPF stores undirected arcs, we give all sizes as undirected arcs, i.e., for implementations using directed arcs the size per arc reported here is doubled. Note that the numbers depend on, but are not the same as, the Node and Arc structure sizes, as the footprint reported includes all stored data (connectivity, capacity, and any auxiliary data).

Serial algorithms            Algorithm type          Memory footprint
HI-PR (a) [19]               Preflow push-relabel    40n + 40mT + 40mN
HPF (b) [45]                 Pseudoflow              104n + 48mT + 48mN
EIBFS [36]:
  EIBFS (c)                  Pseudoflow              72n + 72mN
  EIBFS-I (*, i)             Pseudoflow              29n + 50mN
  EIBFS-I-NR (*, i)          Pseudoflow              49n + 24mN
BK [11]:
  BK (d)                     Augmenting path         48n + 64mN
  MBK (*, i)                 Augmenting path         23n + 24mN
  MBK-R (*, i)               Augmenting path         23n + 48mN

Parallel algorithms          Algorithm type          Memory footprint
P-PPR (e, i) [5]             Parallel PPR            48n + 68mT + 68mN
Liu-Sun (*, i) [75]          Ada. bot.-up merging†   25n + 24mN
Strandmark-Kahl (*, i) [91]  Dual decomposition†     29n + 24mN
P-ARD (a) [89]               Region discharge†       40n + 32mN

† Uses BK (augmenting path)
* Implemented or updated by us: https://github.com/patmjen/maxflow_algorithms
i Assuming 32-bit indices
a https://cmp.felk.cvut.cz/~shekhovt/d_maxflow/index.html
b https://riot.ieor.berkeley.edu/Applications/Pseudoflow/maxflow.html
c https://github.com/sydbarrett/AlphaPathMoves
d http://pub.ist.ac.at/~vnk/software.html
e https://github.com/niklasb/pbbs-maxflow
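As a usage note, the Table 1 formulas make it easy to estimate whether a graph fits in memory. The code below is our own illustration; the problem size is roughly that of bone.n26c100 from Table 2, assuming its arc count matches the table's undirected convention:

```cpp
#include <cstdint>
#include <cstdio>

// Footprint formulas from Table 1 (bytes; mN counts undirected neighbor arcs).
uint64_t bytes_mbk(uint64_t n, uint64_t mN)     { return 23 * n + 24 * mN; }
uint64_t bytes_eibfs_i(uint64_t n, uint64_t mN) { return 29 * n + 50 * mN; }

int main() {
    const uint64_t n = 7000000, mN = 202000000;  // ~7 M nodes, ~202 M arcs
    std::printf("MBK:     %.1f GiB\n", bytes_mbk(n, mN) / double(1ull << 30));
    std::printf("EIBFS-I: %.1f GiB\n", bytes_eibfs_i(n, mN) / double(1ull << 30));
    return 0;
}
```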
(Fig. 8 panels: (a) BK, comparing MBK and MBK-R; (b) EIBFS, comparing EIBFS-I and EIBFS-I-NR; (c) HPF, comparing HPF-H-L, HPF-L-F, and HPF-L-L. Vertical axis: speed-up; legend: total time and solve time.)
Fig. 8: Performance comparison of serial algorithm variants. The solve time and total time are compared against the times for the chosen reference algorithm for each dataset. The violin plots show a Gaussian kernel density estimate of the data, and the horizontal bars indicate — from top to bottom — the maximum, median, and minimum. The values were re-sampled as described in Fig. 7.
(Fig. 10 panels shown: (c) Strandmark-Kahl and (d) P-ARD; axes: speed-up vs. number of threads, 1 to 32.)

Fig. 10: Speed-up of the parallel algorithms compared to their single-threaded performance. For each number of threads, the distribution of the speed-ups over all datasets is shown. The values were re-sampled as described in Fig. 7.

Some of the parallel implementations provide good speed-ups for some datasets. Strandmark-Kahl comes off the worst, as it rarely beats the best serial algorithm. Finally, Fig. 10 shows the speed-up distribution of the parallel algorithms compared to their single-threaded performance. Only P-PPR improves consistently as more threads are added. Liu-Sun and P-ARD only show consistent improvements when looking at the maximum speed-up, and for over half of the datasets they have issues scaling beyond 12 threads.

6 ALGORITHM SELECTION

As the previous section shows, the performance of the individual algorithms varies between problems, so the best choice depends on what is known about the problem beforehand.

Scenario 1: No Information  If nothing is known about the problem, GridCut should be used when applicable; with this choice, one is, in expectation, no more than 36% slower than the best option. Otherwise, the best option is HPF-H-L, in which case the expected performance is 64% of the optimal. Another good option is EIBFS-I due to its high mean and high minimum RP scores. All implementations, except EIBFS, have a maximum RP of 1, meaning that they outperformed all other implementations on at least one problem instance.

For the parallel algorithms, GridCut again dominates when applicable. Otherwise, the best parallel option is Liu-Sun, which is slightly better than P-PPR. Surprisingly, using the best serial algorithm for a dataset is the overall best option, although we should note that comparing to the best serial algorithm gives some advantage to the serial algorithms. If one compares to a single serial algorithm, the parallel algorithms do give an improvement — although the mean RP is only 1.8x higher in the best case.

Scenario 2: Known Problem Family  If one knows from which problem family the graph to be solved comes, a good strategy is to select the algorithm that performs well on that problem family. This could, for example, be established beforehand by running a set of benchmarks on example graphs.

Table 5 shows the best performing serial algorithm for each problem family. Note that, as opposed to Table 2, we split graph matching into sub-groups, as papers use different energy functions.
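The selection rule from Scenario 1 can be summarized in a few lines. This is our own sketch; the enum and function are illustrative and not part of any tested implementation:

```cpp
#include <string>

// Graph families with specialized support in serial GridCut.
enum class GraphKind { Grid2D4, Grid2D8, Grid3D6, Grid3D26, Generic };

// Scenario 1 rule: GridCut when the graph is a supported grid;
// otherwise HPF-H-L, with EIBFS-I as a close second choice.
std::string pick_serial_algorithm(GraphKind kind) {
    switch (kind) {
        case GraphKind::Grid2D4:
        case GraphKind::Grid2D8:
        case GraphKind::Grid3D6:
        case GraphKind::Grid3D26:
            return "GridCut";
        default:
            return "HPF-H-L";
    }
}
```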
TABLE 2: Performance comparison of serial algorithms based on both their solve and total (build + solve) times. We show a
representative subset of the datasets, which have been grouped according to their problem family. For each problem family we only show
the fastest variant of each algorithm measured in total time. The fastest solve time for each dataset has been underlined and the fastest
total time has been marked with bold face. Datasets which contain many sub-problems are marked with (s).
Dataset Nodes Arcs Solve Total Solve Total Solve Total Solve Total Solve Total
3D segmentation: voxel-based MBK-R [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
adhead.n26c100 [9, 12, 10] 12 M 327 M 65.81 s 92.57 s 22.60 s 33.93 s 24.29 s 29.03 s 225.38 s 424.67 s 25.19 s 27.79 s
adhead.n6c100 [9, 12, 10] 12 M 75 M 23.88 s 28.03 s 13.23 s 15.85 s 14.13 s 15.87 s 59.65 s 102.84 s 6.98 s 7.31 s
babyface.n26c100 [9, 12, 10] 5M 131 M 82.29 s 92.87 s 30.13 s 34.74 s 54.47 s 56.71 s 183.60 s 228.09 s 53.21 s 54.11 s
babyface.n6c100 [9, 12, 10] 5M 30 M 7.78 s 9.44 s 5.56 s 6.61 s 11.56 s 12.24 s 57.28 s 69.66 s 2.88 s 3.00 s
bone.n26c100 [9, 12, 10] 7M 202 M 9.01 s 25.62 s 9.18 s 16.28 s 4.24 s 7.16 s 68.39 s 173.75 s 4.52 s 5.88 s
bone.n6c100 [9, 12, 10] 7M 46 M 4.09 s 6.65 s 2.74 s 4.35 s 2.30 s 3.36 s 23.66 s 46.71 s 0.91 s 1.12 s
bone subx.n6c100 [9, 12, 10] 3M 23 M 4.10 s 5.36 s 2.38 s 3.11 s 1.28 s 1.81 s 10.34 s 21.49 s 1.34 s 1.44 s
bone subx.n26c100 [9, 12, 10] 3M 101 M 7.70 s 15.78 s 4.74 s 8.23 s 2.14 s 3.61 s 25.51 s 75.15 s 3.69 s 4.45 s
liver.n26c100 [9, 12, 10] 4M 108 M 11.78 s 20.41 s 10.49 s 14.20 s 5.72 s 6.50 s 71.88 s 131.00 s 5.62 s 6.21 s
liver.n6c100 [9, 12, 10] 4M 25 M 10.08 s 11.40 s 5.82 s 6.57 s 5.70 s 6.24 s 30.49 s 42.71 s 3.87 s 3.99 s
3D segmentation: oriented MRF MBK [11] EIBFS-I-NR [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
vessel.orimrf.256 [11, 38, 87] 16 M 66 M 1.84 s 2.95 s 1.13 s 2.03 s 3.19 s 6.80 s 4.11 s 30.99 s 0.40 s 1.04 s
vessel.orimrf.512 [11, 38, 87] 134 M 536 M 12.44 s 21.40 s 7.95 s 15.39 s 25.29 s 55.32 s 32.16 s 321.75 s 2.43 s 7.73 s
vessel.orimrf.900 [11, 38, 87] 688 M 2B 75.23 s 121.82 s 48.13 s 88.09 s 147.22 s 300.79 s 177.38 s 1774.65 s 15.97 s 44.70 s
3D U-Net segmentation cleaning MBK [11] EIBFS-I-NR [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
clean.orimrf.256 [11, 38, 87] 16 M 66 M 0.97 s 2.09 s 0.69 s 1.61 s 3.21 s 6.89 s 3.93 s 30.83 s 0.13 s 0.77 s
clean.orimrf.512 [11, 38, 87] 134 M 536 M 7.87 s 17.03 s 5.51 s 13.51 s 27.10 s 58.22 s 31.40 s 320.87 s 0.91 s 6.27 s
clean.orimrf.900 [11, 38, 87] 688 M 2B 35.83 s 81.92 s 25.96 s 64.22 s 130.22 s 280.73 s 163.88 s 1755.87 s 3.90 s 31.43 s
unet mrfclean 2 [11] 8 M 32 M 0.47 s 1.01 s 0.29 s 0.74 s 3.55 s 5.36 s 9.80 s 22.82 s 62 ms 0.36 s
unet mrfclean 3 [11] 15 M 63 M 0.82 s 1.88 s 0.52 s 1.37 s 5.68 s 9.14 s 20.59 s 46.69 s 0.11 s 0.68 s
unet mrfclean 8 [11] 4 M 19 M 0.48 s 0.81 s 0.24 s 0.50 s 2.39 s 3.46 s 6.55 s 13.89 s 0.11 s 0.28 s
Surface fitting MBK [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
LB07-bunny-lrg [71] 49 M 300 M 15.40 s 21.17 s 6.38 s 15.25 s 21.87 s 32.13 s 610.24 s 820.64 s 2.36 s 3.75 s
3D segmentation: sparse layered graphs (SLG) MBK-R [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
4Dpipe small [57] 14 M 124 M 6.03 h 6.03 h 2.06 s 15.55 s 17.91 s 28.85 s 202.49 s 266.01 s - -
4Dpipe big [57] 143 M 1B - - 20.59 s 195.41 s 222.09 s 332.06 s 2611.65 s 3436.43 s - -
NT32 tomo3 .raw 3 [57] 7 M 49 M 15.42 s 18.69 s 24.22 s 27.19 s 15.87 s 18.33 s 176.11 s 200.29 s - -
NT32 tomo3 .raw 10 [57] 22 M 154 M 52.86 s 63.15 s 50.82 s 60.01 s 36.46 s 44.14 s 645.33 s 741.98 s - -
NT32 tomo3 .raw 30 [57] 67 M 462 M 145.23 s 176.37 s 194.79 s 221.90 s 179.82 s 202.73 s 2939.04 s 3260.63 s - -
NT32 tomo3 .raw 100 [57] 183 M 1 B 778.39 s 860.71 s 553.50 s 627.08 s 520.26 s 583.76 s 9732.34 s 2.95 h - -
3D segmentation: separating surfaces MBK-R [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
cells.sd3 [53] 13 M 126 M 48.23 s 59.03 s 35.24 s 40.66 s 15.52 s 21.84 s 98.25 s 167.56 s - -
foam.subset.r160.h210 [53] 15 M 205 M 6.05 s 22.02 s 3.21 s 12.52 s 17.14 s 26.18 s 15.16 s 145.58 s - -
foam.subset.r60.h210 [53] 1 M 24 M 0.62 s 2.58 s 0.39 s 1.49 s 1.98 s 3.01 s 1.85 s 12.82 s - -
simcells.sd3 [53] 3 M 27 M 9.93 s 12.10 s 2.94 s 4.12 s 3.23 s 4.60 s 21.57 s 33.89 s - -
Deep LOGISMOS MBK [11] EIBFS-I [36] HPF-H-F [45] HI-PR [19] GridCut [52, 51]
deeplogismos.2 [42] 511 K 4M 0.15 s 0.25 s 28 ms 0.21 s 0.12 s 0.31 s 0.16 s 1.29 s - -
deeplogismos.3 [42] 707 K 5M 0.18 s 0.31 s 41 ms 0.30 s 0.18 s 0.45 s 0.24 s 1.90 s - -
deeplogismos.7 [42] 989 K 7M 0.34 s 0.54 s 0.26 s 0.66 s 0.29 s 0.69 s 0.36 s 2.86 s - -
Super resolution BK [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
super res-E1 [30, 88] 10 K 62 K 2 ms 2 ms 1 ms 2 ms 2 ms 3 ms 1 ms 7 ms - -
super res-E2 [30, 88] 10 K 103 K 4 ms 5 ms 2 ms 3 ms 2 ms 3 ms 2 ms 12 ms - -
super res-Paper1 [30, 88] 10 K 62 K 2 ms 3 ms 1 ms 2 ms 2 ms 3 ms 1 ms 7 ms - -
superres graph [30, 88] 43 K 742 K 62 ms 78 ms 10 ms 26 ms 7 ms 12 ms 19 ms 0.16 s - -
Texture MBK-R [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
texture-Cremer [88] 44 K 783 K 1.54 s 1.58 s 0.35 s 0.37 s 0.17 s 0.19 s 42 ms 0.19 s - -
texture-OLD-D103 [88] 43 K 742 K 0.60 s 0.65 s 0.19 s 0.21 s 73 ms 92 ms 41 ms 0.19 s - -
texture-Paper1 [88] 43 K 742 K 0.65 s 0.69 s 0.19 s 0.21 s 76 ms 95 ms 36 ms 0.17 s - -
texture-Temp [88] 14 K 239 K 0.22 s 0.23 s 30 ms 34 ms 9 ms 15 ms 6 ms 32 ms - -
Automatic labelling environment (ALE) MBK-R [11] EIBFS-I-NR [36] HPF-L-L [45] HI-PR [19] GridCut [52, 51]
graph 1 (s) [68, 69, 26, 67] 185 K 5M 16.80 s 18.52 s 0.35 s 0.79 s 1.00 s 1.60 s 1.58 s 10.60 s - -
graph 2 (s) [68, 69, 26, 67] 175 K 3M 7.38 s 10.47 s 0.83 s 1.64 s 2.25 s 3.55 s 2.91 s 20.87 s - -
graph 3 (s) [68, 69, 26, 67] 179 K 7M 27.68 s 35.55 s 2.69 s 4.51 s 4.63 s 6.96 s 6.49 s 43.73 s - -
Multi-view MBK-R [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
BL06-camel-lrg [13] 18 M 93 M 107.53 s 111.42 s 28.55 s 31.54 s 24.44 s 28.82 s 291.71 s 337.91 s - -
BL06-gargoyle-lrg [13] 17 M 86 M 238.08 s 241.65 s 33.76 s 36.57 s 26.51 s 30.61 s 208.27 s 251.10 s - -
TABLE 2: Continued
Dataset Nodes Arcs Solve Total Solve Total Solve Total Solve Total Solve Total
Deconvolution MBK-R [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
graph3x3 [88] 2K 47 K 9 ms 11 ms 3 ms 3 ms 1 ms 1 ms 1 ms 5 ms - -
graph5x5 [88] 2K 139 K 62 ms 67 ms 6 ms 9 ms 3 ms 4 ms 2 ms 15 ms - -
Stereo 1 BK [11] EIBFS-I-NR [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
BVZ-sawtooth (s) [14] 164 K 796 K 0.91 s 1.16 s 0.58 s 0.69 s 1.39 s 1.85 s 7.89 s 12.27 s - -
BVZ-tsukuba (s) [14] 110 K 513 K 0.49 s 0.58 s 0.35 s 0.41 s 0.66 s 0.84 s 4.69 s 6.64 s - -
BVZ-venus (s) [14] 166 K 795 K 1.72 s 2.03 s 1.30 s 1.44 s 1.94 s 2.46 s 15.00 s 20.11 s - -
Stereo 2 BK [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
KZ2-sawtooth (s) [64] 294 K 1M 2.59 s 3.40 s 1.14 s 2.02 s 3.30 s 4.66 s 23.79 s 36.55 s - -
KZ2-tsukuba (s) [64] 199 K 1M 1.41 s 1.84 s 0.71 s 1.12 s 1.92 s 2.55 s 20.95 s 27.14 s - -
KZ2-venus (s) [64] 301 K 2M 3.98 s 4.89 s 2.18 s 3.16 s 4.70 s 6.21 s 41.63 s 55.60 s - -
Decision tree field (DTF) MBK-R [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
printed graph1 [82] 20 K 1M 0.63 s 0.73 s 0.13 s 0.17 s 40 ms 51 ms 51 ms 0.25 s - -
printed graph16 [82] 11 K 683 K 0.24 s 0.29 s 44 ms 62 ms 16 ms 22 ms 25 ms 0.12 s - -
Graph matching: small BK [11] EIBFS-I-NR [36] HPF-L-L [45] HI-PR [19] GridCut [52, 51]
atlas1.dd (s) [58, 48] 1K 5K 37 ms 59 ms 34 ms 52 ms 21 ms 39 ms 23 ms 0.18 s - -
car1.dd (s) [25, 73, 48] 38 131 1 ms 1 ms 0 ms 1 ms 1 ms 1 ms 0 ms 3 ms - -
hassan1.dd (s) [1, 92, 48] 120 2K 17 ms 30 ms 5 ms 25 ms 2 ms 6 ms 4 ms 65 ms - -
matching1.dd (s) [66, 59, 48] 38 380 10 ms 12 ms 5 ms 8 ms 2 ms 4 ms 6 ms 14 ms - -
Graph matching: big MBK-R [11] EIBFS-I [36] HPF-H-L [45] HI-PR [19] GridCut [52, 51]
pair1.dd (s) [48] 1K 58 K 1.42 s 1.97 s 0.70 s 0.96 s 92 ms 0.13 s 0.82 s 1.68 s - -
Mesh segmentation MBK-R [11] EIBFS-I [36] HPF-H-F [45] HI-PR [19] GridCut [52, 51]
bunny.segment [76] 97 K 536 K 0.12 s 0.14 s 63 ms 75 ms 68 ms 91 ms 0.20 s 0.30 s - -
bunnybig.segment [76] 2M 13 M 1.01 s 1.59 s 0.62 s 1.23 s 1.43 s 2.12 s 4.99 s 9.91 s - -
candle.segment [76] 159 K 959 K 87 ms 0.13 s 49 ms 83 ms 0.11 s 0.15 s 0.29 s 0.53 s - -
candlebig.segment [76] 1M 5M 0.51 s 0.72 s 0.26 s 0.44 s 0.60 s 0.91 s 2.03 s 3.70 s - -
chair.segment [76] 305 K 1M 0.76 s 0.88 s 0.31 s 0.39 s 0.27 s 0.37 s 0.86 s 1.37 s - -
chairbig.segment [76] 3M 26 M 1.62 s 2.89 s 1.02 s 2.45 s 3.23 s 4.59 s 9.98 s 20.92 s - -
handbig.segment [76] 248 K 1M 0.15 s 0.19 s 71 ms 0.11 s 0.13 s 0.18 s 0.35 s 0.63 s - -
handsmall.segment [76] 15 K 69 K 4 ms 5 ms 2 ms 3 ms 4 ms 6 ms 10 ms 16 ms - -
Table 5 shows the best performing serial algorithm for each problem family. Note that, as opposed to Table 2, we split graph matching into sub-groups, as papers use different energy functions for the matching. For all but four problem families, the best algorithm achieves a mean relative performance of 95% or higher. Furthermore, for most problem families, one algorithm is always the best. This indicates that the problem family is a strong predictor of algorithm performance. The problem family where this strategy performs the worst is 3D segmentation with sparse layered graphs (SLG). Here, the mean RP is only 81%, which is likely due to the large variation in graph size in this problem family.
Table 6 shows the best performing parallel algorithm for each problem family. For the 6-connected graphs, the parallel GridCut algorithm is clearly superior, but otherwise, the different families appear to favor different algorithms.
Scenario 3: Known Graph. Finally, we consider a strategy where the graph is known, but the problem family is not. Here, our strategy is to train a simple decision tree to predict the best algorithm given a feature vector that describes the graph to be solved. Although a single decision tree is not the strongest classifier, it has the benefit of being easily interpretable.
The first components of our feature vector consist of the number of nodes, the number of terminal arcs, the number of neighbor arcs, and whether the graph is a grid graph. Then we include the mean, standard deviation, and standard deviation of non-zero values for a number of arc and node properties. For arc properties, we use: source, sink, terminal (source and sink combined), and neighbor capacities. Finally, for node properties, we use: sum of in-going neighbor capacities, sum of out-going neighbor capacities, sum of neighbor capacities, degrees, out degrees, and in degrees, where the degree counts include only non-zero arcs. Note that these statistics can be computed efficiently during graph construction. We normalize all capacity statistics by the mean over all arc capacities. In total, our feature vector has 31 entries per graph. Fig. 11 shows a UMAP embedding [78] of the feature vectors for all benchmark datasets. Similar problem families cluster together, despite UMAP receiving no information on this. This suggests the feature vectors provide a good description of the graphs.
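As an illustration, the capacity-based part of this feature vector could be computed as in the sketch below. It assumes the graph is available as flat NumPy arrays of source, sink, and neighbor arc capacities; the node-degree statistics are omitted for brevity, and all function and argument names are ours, not taken from a published implementation.

```python
import numpy as np

def capacity_stats(cap, scale):
    """Mean, std., and std. over non-zero entries, normalized by `scale`."""
    nonzero = cap[cap != 0]
    nz_std = nonzero.std() if nonzero.size else 0.0
    return [cap.mean() / scale, cap.std() / scale, nz_std / scale]

def graph_features(source_cap, sink_cap, neighbor_cap, is_grid):
    """Assemble the capacity-based entries of the per-graph feature vector."""
    terminal_cap = np.concatenate([source_cap, sink_cap])
    # Normalize all capacity statistics by the mean over all arc capacities.
    scale = np.concatenate([terminal_cap, neighbor_cap]).mean()
    features = [
        source_cap.size,                 # number of nodes
        np.count_nonzero(terminal_cap),  # number of (non-zero) terminal arcs
        neighbor_cap.size,               # number of neighbor arcs
        float(is_grid),                  # grid graph indicator
    ]
    for cap in (source_cap, sink_cap, terminal_cap, neighbor_cap):
        features += capacity_stats(cap, scale)
    return np.asarray(features, dtype=np.float64)
```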
We train the decision tree using Scikit-learn [83] version 0.23.1. We use Gini impurity as the split criterion and reduce the tree using minimal cost-complexity pruning [15]. The optimal amount of pruning is determined with 5-fold cross validation. We split each problem family evenly into the folds (if it contains at least 5 datasets). When fitting, each dataset is weighted by one over the number of datasets in its problem family. When evaluating, we oversample the validation data, so that each problem family has the same number of entries. This indicates how well the decision tree will perform with representative training data. We also perform an additional evaluation where we hold out one problem family, fit on the rest, and then evaluate on the held-out family. This indicates how well the decision tree will perform for a problem family that it has not yet encountered. We use the mean RP as the validation metric.
We first train a decision tree for the serial algorithms; the result is shown in Fig. 12. It achieves a mean RP of 0.82 and 0.82 in the two evaluations, respectively.
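A rough sketch of this training procedure is given below, assuming hypothetical inputs X (the feature matrix), y (the index of the fastest algorithm per dataset), and families (the problem-family label per dataset). It simplifies our setup in two ways: every family is assumed to contain at least five datasets, and model selection uses plain accuracy rather than the mean RP.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Weigh each dataset by one over the number of datasets in its problem family.
family_size = Counter(families)
weights = np.array([1.0 / family_size[f] for f in families])

# Spread each problem family evenly across the 5 folds (assumes at least 5
# datasets per family; smaller families would need separate handling).
folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
             .split(X, families))

# Gini impurity splits; the strength of minimal cost-complexity pruning
# (ccp_alpha) is chosen by cross validation over a small grid.
search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=0),
    param_grid={"ccp_alpha": np.logspace(-5, -1, 20)},
    cv=folds,
)
search.fit(X, y, sample_weight=weights)
pruned_tree = search.best_estimator_
```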
TABLE 3: Performance of parallel algorithms based on build and solve times. We show a representative subset of the datasets grouped
according to their problem family. See Table 2 for the number of nodes and arcs. The algorithms were run with 1, 2, 4, 6, 8, 12, 16, 24,
and 32 threads. Only the best time is shown along with the thread count for that run. For comparison, the solve time for the fastest serial
algorithm is also included. All times are in seconds. The fastest solve time for each dataset has been marked with bold face.
Liu-Sun [75] P-PPR [5] Strandmark-Kahl [91] P-ARD [89] P-GridCut [52, 51] Best serial
Dataset Build Best solve Build Best solve Build Best solve Build Best solve Build Best solve Algo. Solve
3D segmentation: voxel-based
adhead.n26c10 [9, 12, 10] 6.90 17.41 8T 42.71 14.37 12T 26.14 25.40 2T 78.20 35.17 4T - - - GridCut 13.21
adhead.n26c100 [9, 12, 10] 6.86 20.87 8T 41.79 8.41 32T 27.65 19.30 6T 75.37 42.91 4T - - - EIBFS 22.60
babyface.n26c100 [9, 12, 10] 2.89 72.26 32T 15.99 7.93 32T 9.15 40.86 4T 32.01 61.20 32T - - - EIBFS 30.13
bone.n26c100 [9, 12, 10] 4.36 3.68 32T 25.63 4.01 32T 16.38 5.04 8T 48.90 11.21 4T - - - HPF 3.48
bone subx.n26c100 [9, 12, 10] 2.30 4.04 16T 12.48 2.34 32T 8.03 4.43 8T 24.07 11.71 16T - - - HPF 2.14
liver.n26c10 [9, 12, 10] 2.37 10.79 6T 14.42 7.91 32T 0.89 3.49 1T 21.93 14.95 1T - - - GridCut 2.95
liver.n26c100 [9, 12, 10] 2.36 18.24 6T 13.94 5.45 32T 0.86 6.69 1T 27.07 25.74 24T - - - GridCut 5.62
liver.n6c100 [9, 12, 10] 0.53 7.62 6T 4.78 3.68 16T 0.51 7.24 1T 6.18 7.74 32T 0.11 2.70 6T GridCut 3.87
adhead.n6c100 [9, 12, 10] 1.59 11.17 8T 14.49 4.72 32T 2.52 7.17 4T 15.46 14.41 2T 0.31 3.83 4T GridCut 6.98
babyface.n6c10 [9, 12, 10] 0.66 2.70 32T 5.30 3.09 24T 0.73 3.55 1T 8.87 5.35 1T 0.12 0.88 32T GridCut 1.52
babyface.n6c100 [9, 12, 10] 0.66 5.34 32T 6.53 3.63 24T 0.98 5.24 4T 7.29 7.28 32T 0.12 1.66 16T GridCut 2.88
bone.n6c100 [9, 12, 10] 0.99 0.79 24T 8.30 2.66 32T 1.23 2.01 2T 11.18 2.08 4T 0.20 0.17 12T GridCut 0.91
3D segmentation: oriented MRF
vessel.orimrf.256 [11] 1.22 0.69 6T 14.67 1.69 32T 1.89 1.04 2T 16.24 0.82 32T 0.52 0.14 12T GridCut 0.40
vessel.orimrf.512 [11] 9.72 4.92 8T - - - 15.08 6.49 2T 100.95 5.54 16T 4.26 0.43 32T GridCut 2.43
vessel.orimrf.900 [11] 49.62 28.66 8T - - - 79.82 38.45 2T 599.09 24.02 32T 21.98 2.49 32T GridCut 15.97
3D U-Net segmentation cleaning
clean.orimrf.256 [11] 1.23 0.39 12T 16.50 2.48 24T 2.39 0.48 4T 15.96 0.84 8T 0.52 0.06 32T GridCut 0.13
clean.orimrf.512 [11] 9.73 2.45 16T - - - 18.26 3.81 4T 131.62 4.81 8T 4.37 0.27 32T GridCut 0.91
clean.orimrf.900 [11] 50.52 12.36 16T - - - 85.77 18.66 4T 578.09 20.17 16T 23.64 0.82 32T GridCut 3.90
unet mrfclean 3 [11] 1.15 0.27 32T - - - 2.15 0.45 4T 12.66 0.53 4T 0.46 0.04 32T GridCut 0.11
unet mrfclean 8 [11] 0.37 0.21 8T - - - 0.64 0.23 4T 4.01 0.28 2T 0.14 0.04 16T GridCut 0.11
Surface fitting
LB07-bunny-lrg [71] 6.14 1.86 16T 55.88 24.27 32T 7.73 4.14 4T 72.02 4.31 16T 1.24 0.32 24T GridCut 2.36
3D segmentation: sparse layered graphs (SLG)
4Dpipe small [57] 9.93 9.39 12T - - - - - - 47.53 421.60 24T - - - EIBFS 2.06
4Dpipe big [57] 122.43 86.18 16T - - - - - - 570.44 7485.11 4T - - - EIBFS 20.59
NT32 tomo3 .raw 10 [57] 4.99 18.58 12T 34.90 14.39 32T 22.02 85.49 1T 43.95 15.40 12T - - - HPF 36.46
NT32 tomo3 .raw 30 [57] 14.93 45.70 16T 111.70 59.81 32T 66.56 363.06 1T 132.41 36.13 32T - - - BK 145.23
NT32 tomo3 .raw 100 [57] 38.93 95.24 32T - - 1T 170.38 1189.78 1T 365.60 158.92 24T - - - HPF 498.94
3D segmentation: separating surfaces
cells.sd3 [53] 4.18 10.33 16T 25.93 9.47 32T 23.90 76.40 1T 21.84 44.98 1T - - - HPF 15.52
foam.subset.r160.h210 [53] 5.91 8.59 32T 37.08 3.61 32T 52.11 17.24 1T 33.72 7.32 1T - - - EIBFS 3.21
simcells.sd3 [53] 0.69 2.28 16T 5.60 1.99 32T 1.49 2.82 32T 4.96 0.89 32T - - - EIBFS 2.94
Multi-view
BL06-camel-lrg [13] 3.99 57.41 8T - - - 1.87 75.56 1T 13.76 95.40 1T - - - HPF 24.44
BL06-gargoyle-lrg [13] 3.70 29.28 16T - - - 1.70 190.07 1T 12.20 102.31 2T - - - HPF 26.51
Mesh segmentation
bunnybig.segment [76] 0.30 0.37 12T 2.81 1.15 32T 1.12 1.89 1T 3.66 0.41 32T - - - EIBFS 0.62
chairbig.segment [76] 0.60 0.64 24T 6.10 1.76 32T 2.62 3.29 1T 6.89 0.57 32T - - - EIBFS 1.02
handbig.segment [76] 0.02 0.11 8T 0.25 0.25 16T 0.04 0.16 1T 0.40 0.11 32T - - - EIBFS 0.07
This means that the tree is significantly better than naively choosing the overall best algorithm, but not as good as knowing the best algorithm for a problem family.
Next, we train a decision tree for the parallel algorithms. We include a category ‘Serial’, which means that choosing a serial algorithm would be faster. For simplicity, we do not specify which serial algorithm to choose in this scenario. The result is shown in Fig. 13. The decision tree achieves a mean RP of 0.56 and 0.57 in the two evaluations, respectively. Thus, the tree is slightly better than simply choosing the overall best algorithm. However, the best option is to choose the best algorithm for a given category.

7 DISCUSSION
In this section, we discuss the most interesting findings from our experiments.

7.1 Serial Algorithms
Our results clearly show that GridCut is superior to the other tested algorithms for min-cut/max-flow problems with fixed neighborhood grids. This is not surprising, since GridCut has been designed and optimized specifically for this type of graph. However, as shown in Table 2, the performance benefit of GridCut decreases significantly for graphs with larger (26-connected) neighborhoods.
TABLE 4: Summary of relative performance (RP) scores for each of the min-cut/max-flow algorithm variants. The best score (higher is better) in each column has been marked with bold face. Results were oversampled as described in Fig. 7. We only include results where the algorithm ran to completion.

Serial algorithms    Mean RP ± Std. RP   Min RP   Max RP
EIBFS-I              0.59 ± 0.28         0.1309   1.00
EIBFS-I-NR           0.56 ± 0.32         0.0535   1.00
EIBFS                0.47 ± 0.23         0.1288   0.94
HI-PR                0.16 ± 0.17         0.0046   1.00
HPF-H-F              0.59 ± 0.33         0.0279   1.00
HPF-H-L              0.64 ± 0.36         0.0393   1.00
HPF-L-F              0.49 ± 0.29         0.0313   1.00
HPF-L-L              0.53 ± 0.31         0.0312   1.00
MBK-R                0.27 ± 0.20         0.0006   1.00
BK                   0.27 ± 0.24         0.0005   1.00
MBK                  0.28 ± 0.22         0.0005   1.00
GridCut∗             0.99 ± 0.03         0.6419   1.00
Parallel algorithms
Liu-Sun              0.48 ± 0.30         0.0667   1.00
P-PPR                0.46 ± 0.38         0.0133   1.00
Strandmark-Kahl      0.23 ± 0.16         0.0667   0.85
P-ARD                0.35 ± 0.32         0.0028   1.00
P-GridCut∗           1.00 ± 0.00         1.0000   1.00
Best serial          0.59 ± 0.33         0.1365   1.00
∗ Only grid graphs included (6- and 26-conn. for serial, 6-conn. for parallel).

TABLE 5: Relative performance (RP) scores for the best serial algorithm variant for each problem family. Almost all problem families have one dominant algorithm.

Problem family                              Algorithm    Mean RP
3D segmentation: SLG [57]                   HPF-H-L      0.81
Multi-view [13]                             HPF-H-L      1.00
Surface fitting [71]                        GridCut      1.00
3D segmentation: voxel-based [9, 12, 10]    GridCut      0.98
Mesh segmentation [76]                      EIBFS-I      0.95
3D segmentation: sep. surfaces [53]         EIBFS-I      0.92
3D MRF [11]                                 GridCut      1.00
Deep LOGISMOS [42]                          EIBFS-I-NR   0.96
Deconvolution [88]                          HPF-H-L      0.96
DTF [82]                                    HPF-H-L      1.00
Super resolution [30, 88]                   EIBFS-I      0.87
Stereo 1 [14]                               EIBFS-I      0.99
Stereo 2 [64]                               EIBFS-I      1.00
ALE [68, 69, 26]                            EIBFS-I-NR   1.00
Graph matching: small [58, 48]              HPF-L-L      1.00
Graph matching: small [25, 73, 48]          EIBFS-I-NR   0.91
Graph matching: small [1, 92, 48]           HPF-L-F      1.00
Graph matching: small [94, 16, 48]          HPF-L-L      1.00
Graph matching: small [66, 59, 48]          HPF-L-F      1.00
Graph matching: big [48]                    HPF-H-L      1.00
Mean ± std.                                              0.97 ± 0.05

TABLE 6: Relative performance (RP) scores for the best parallel algorithm for each problem family. Since the parallel GridCut implementation can only handle 6-connected graphs, ‘3D segmentation: voxel-based’ has been split into two subgroups: 6-connected graphs and 26-connected graphs. If an algorithm did not run to completion on a dataset, we count the RP as 0.

Problem family                                Algorithm   Mean RP
3D segmentation: SLG [57]                     Liu-Sun     0.63
Multi-view [13]                               Serial      1.00
Surface fitting [71]                          P-GridCut   1.00
3D seg.: voxel-based [9, 12, 10] (26-conn.)   Serial      0.86
3D seg.: voxel-based [9, 12, 10] (6-conn.)    P-GridCut   1.00
Mesh segmentation [76]                        P-ARD       0.88
3D segmentation: sep. surfaces [53]           P-PPR       0.74
3D MRF [11]                                   P-GridCut   1.00
Mean ± std.                                               0.89 ± 0.14

[Fig. 11 appears here: a scatter plot of the benchmark datasets over the UMAP-1/UMAP-2 axes, with points colored by problem family (3D seg.: SLG, Multi-view, Stereo 1, Stereo 2, Surface fitting, 3D seg.: voxel-based, Mesh segmentation, 3D seg.: sep. surfaces, 3D MRF, Deep LOGISMOS, ALE, DTF, Deconvolution, Graph match.: small, Graph match.: big, Super resolution).]
Fig. 11: UMAP embedding [78] of the extracted graph features. Each point corresponds to a benchmark dataset and is colored according to its problem family. When a benchmark consists of multiple sub-problems we use the mean feature vector. Notice that points from the same problem family tend to cluster together.

[Fig. 12 appears here: a decision tree with a root split on ‘Grid graph’, internal splits on ‘Sink cap. std. ≤ 2.151’, ‘Sink num. nonzero ≤ 43630’, and ‘Num. arcs ≤ 29289’, and the leaves GridCut, EIBFS-I-NR, EIBFS-I, HPF-L-L, and HPF-H-L.]
Fig. 12: Decision tree trained to select the best serial algorithm. Note that capacity statistics are normalized, cf. Section 6.
Moreover, the performance of the individual algorithm variants varies, and the choice of variant can significantly affect the run time. Optimizing for cache efficiency seems to be of particular importance, since optimizations such as arc packing and smaller data structures have large effects on the solve times for both BK and EIBFS.
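As a toy illustration of such packing, the sketch below stores arcs in a forward-star layout with 32-bit fields, so that each node's outgoing arcs are contiguous in memory. This is our own illustration of the general idea under assumed array inputs, not the layout used by any of the tested implementations.

```python
import numpy as np

class PackedArcs:
    """Forward-star arc storage: arcs grouped by tail node, 32-bit fields.

    The arcs leaving node v occupy the contiguous index range
    first_arc[v]:first_arc[v + 1], which makes traversal cache-friendly,
    while 32-bit fields halve the footprint of 64-bit pointers."""

    def __init__(self, n_nodes, tails, heads, capacities):
        order = np.argsort(tails, kind="stable")  # group arcs by tail node
        self.head = np.asarray(heads)[order].astype(np.uint32)
        self.residual = np.asarray(capacities)[order].astype(np.int32)
        counts = np.bincount(tails, minlength=n_nodes)
        self.first_arc = np.concatenate(([0], np.cumsum(counts))).astype(np.uint32)

    def out_arcs(self, v):
        """Indices of the arcs leaving node v."""
        return range(int(self.first_arc[v]), int(self.first_arc[v + 1]))
```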
As shown in Section 6, for non-grid problems, the best algorithm most often comes down to a choice between EIBFS or HPF. From Fig. 12, it seems that HPF is faster when the sink (or, more likely, terminal) arc capacities vary a lot. As expected, EIBFS-I-NR is preferred for small graphs, while the preferred HPF variant for small graphs appears to be HPF-L-L, which aligns with the results in Table 5. However, the best strategy is to test several algorithms on a set of problems from the family at hand.
7.2 Parallel Algorithms
P-GridCut provides the best performance of the parallel algorithms for 6-connected grid graph problems and scales well with many threads. Of the other parallel algorithms, Liu-Sun is overall the best, closely followed by P-PPR, which aligns with previous results [75] and expectations [89, 91]. However, all the block-based algorithms only scale well for large graphs. For small to medium problems, they do not scale to many threads, but seem to peak at 8-12 threads, cf. Fig. 10. This also means that choosing an optimal thread count may be difficult. Only P-PPR scaled consistently with up to 32 threads. In addition, all parallel algorithms were often outperformed by a serial algorithm except on large graphs. In fact, as Table 4 shows, selecting a good serial algorithm has better expected performance than selecting any of the parallel algorithms.
For practical use, only the Liu-Sun, P-PPR, and P-ARD algorithms seem to be relevant as is. However, the block-based algorithms have the additional challenge of dividing the graph into blocks, the result of which significantly affects the run time of the algorithms. This was also shown in [89], where it was noticed that the multi-view problems would scale better with more processors when partitioned on vertex numbers rather than on the grid. While the graphs tested in this work have a natural way to be split, this may not always be the case. Meanwhile, even though this problem is avoided with P-PPR, it does not perform as well as the block-based algorithms overall, as shown in Fig. 9.
Finally, while all parallel algorithms had datasets where they were best, selecting the best parallel algorithm is difficult (except for 6-connected grid graphs). No algorithm showed dominant performance, neither globally nor per problem family. Furthermore, using the decision tree only gives a small improvement over selecting the best overall algorithm. Fig. 13 indicates that, for grid graphs, P-GridCut is the natural choice.

8 CONCLUSION

8.1 Serial Algorithms
For the serial min-cut/max-flow algorithms, we have tested a total of 12 different variants across five of the fastest and most popular algorithms: PPR, BK, EIBFS, HPF, and GridCut. These include representatives for the three families of min-cut/max-flow algorithms: augmenting paths, push-relabel, and pseudoflow.
Our results clearly show that, for simple grid graphs, GridCut has the best performance. In most other cases, the two pseudoflow algorithms, EIBFS and HPF, are significantly faster than the other algorithms and thus should be the first choice for anyone looking for a fast serial min-cut/max-flow algorithm for static computer vision problems. For dynamic problems, we refer to [36].
Contrary to existing literature, we recommend the HPF algorithm in the H-LIFO configuration as the default, since it has the best overall performance. However, the EIBFS algorithm (EIBFS-I implementation) is a very close contender and can easily replace HPF with little impact on performance; indeed, it may perform better on some problem families. If memory usage is of chief concern, the MBK and EIBFS-I-NR implementations are both good options, as they use significantly less memory than the reference EIBFS and HPF implementations.
Furthermore, we think significant performance improvements may be gained from further improving the algorithm implementations, especially with a focus on memory use and cache efficiency. In particular, faster and more memory efficient methods for arc (and node) packing could result in significant benefits, since the extra initialization step incurs a large memory and run time overhead. We would like to see a reimplementation of HPF with a half-arc data structure and arc packing.
Finally, we found significant gains through automatic algorithm selection. Based on our results, it seems likely that one could train a robust classifier for selecting the appropriate algorithm based on the min-cut/max-flow problem to be solved. By selecting the right algorithm for the job, run time could in many cases be significantly reduced without the need for new algorithms or implementations. In general, we find it unlikely that a single algorithm will ever be dominant for all types of graphs.

8.2 Parallel Algorithms
We tested five different parallel algorithms for min-cut/max-flow problems: parallel PPR (P-PPR), adaptive bottom-up merging (Liu-Sun), dual decomposition (Strandmark-Kahl), region discharge (P-ARD), and parallel GridCut (P-GridCut).
If the graph is a simple grid, P-GridCut significantly outperforms all other algorithms. For other graphs, we found adaptive
bottom-up merging, as proposed by Liu and Sun [75], to be the best overall parallel approach. However, each parallel algorithm had an area in which it was the best, and it is difficult to predict the best parallel algorithm for a graph (except for 6-connected grid graphs).
Of the parallel algorithms, only P-GridCut and P-PPR improved consistently with more threads. All block-based algorithms failed to scale beyond 12 threads, except on large graphs. Furthermore, except for P-GridCut, all parallel algorithms were often outperformed by a serial algorithm, and consistent improvements over serial algorithms were obtained only for large graphs. These issues reveal a major deficiency in the state of current parallel min-cut/max-flow algorithms and deserve further study. While providing good scaling on any type of graph may be unreachable, as min-cut/max-flow is P-complete and therefore hard to parallelize [39], computer vision graphs often come with additional structure. Therefore, it seems highly likely that further improvements in practical performance can be achieved. However, at this time, we only recommend using a parallel algorithm for graphs with more than 5 M nodes or where a serial algorithm uses at least 5 seconds.
To improve the parallel min-cut/max-flow algorithms, one could try to replace BK, which is currently used in all the tested block-based parallel algorithms, with a pseudoflow algorithm. However, this may not be trivial. In [56], results for a Liu-Sun implementation using EIBFS instead of BK showed a significant performance decrease compared to serial EIBFS. Still, given the superior performance of pseudoflow algorithms, this is an important area to investigate. Furthermore, parallelized graph construction is currently only available for P-GridCut. As the build time is a significant part of the total time, reducing build time will be important, especially as solve time decreases.
Finally, choosing an optimal blocking strategy remains an open problem. Generally, when nodes correspond to spatial positions (e.g., pixels or mesh vertices), we find that grouping based on spatial distance works well; see the sketch below. However, we recommend that practitioners experiment with different blocking strategies, since the blocking can significantly affect the performance. Furthermore, a general method that only considers the graph structure would be of high interest, as this would also make the algorithms more accessible to the average user. An alternative would be to focus on P-PPR algorithms that do not rely on blocking. Further improvements in these areas could also open the door to GPU-based implementations for solving general min-cut/max-flow problems.
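For a voxel grid, one simple instance of such spatial grouping is to cut the volume into slabs along one axis, as in the sketch below. The function and its interface are hypothetical; real implementations may prefer other block shapes or balance blocks by node count.

```python
import numpy as np

def slab_blocks(shape, n_blocks):
    """Assign each node of a voxel grid with the given (z, y, x) shape to one
    of n_blocks slabs along the z-axis. Returns one block id per node
    (flattened in C order), e.g. for use by a block-based parallel solver."""
    z = np.arange(shape[0])
    slab_of_z = z * n_blocks // shape[0]  # slab index of each z-slice
    block_id = np.broadcast_to(slab_of_z[:, None, None], shape)
    return block_id.reshape(-1)
```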
8.3 Min-Cut/Max-Flow in Modern Computer Vision
It is no secret that the field of computer vision is currently dominated by deep learning. In this context, it is highly relevant to consider the future role of traditional computer vision tools, such as min-cut/max-flow algorithms.
For 3D images used in medical imaging and materials science research [77], it is common to have images where no relevant training data are available. Here, segmentation methods based on min-cut/max-flow continue to play an important role, as they work without training data and allow geometric prior knowledge to be incorporated. Furthermore, while modern 3D images can already be very large (many GB per image), dynamic imaging (3D + time) with high acquisition rates is now also possible [31, 81]. Computational efficiency is paramount to be able to process this ever increasing amount of data, and for this, parallel min-cut/max-flow algorithms could prove particularly useful.
Finally, as mentioned in [80], there is agreement that the performance of deep learning-based segmentation methods has started to plateau, and investigating how to integrate CNNs with ‘classical’ approaches should be pursued. Already, combinations with active contours have shown promising results [41, 85, 99], and a combination of CNNs and min-cut/max-flow methods could lead to new advances. As deep learning involves repeated forward and backward passes through a model, it is crucial that the min-cut/max-flow algorithms are fast and efficient. While not the focus of this work, this is also an area where dynamic min-cut/max-flow algorithms can be of great importance, as they are effective at handling repeated solves of graphs where capacities do not change drastically between successive solves.

REFERENCES
[1] Hassan Abu Alhaija et al. “Graphflow–6D large displacement scene flow via graph matching”. In: German Conference on Pattern Recognition. Springer. 2015, pp. 285–296.
[2] Richard Anderson and Joao C. Setubal. “A parallel implementation of the push-relabel algorithm for the maximum flow problem”. In: Journal of Parallel and Distributed Computing 29.1 (1995), pp. 17–26.
[3] Chetan Arora et al. “An efficient graph cut algorithm for computer vision problems”. In: European Conference on Computer Vision. 2010, pp. 552–565.
[4] David A Bader and Vipin Sachdeva. A cache-aware parallel implementation of the push-relabel network flow algorithm and experimental evaluation of the gap relabeling heuristic. Tech. rep. Georgia Institute of Technology, 2006.
[5] Niklas Baumstark, Guy Blelloch, and Julian Shun. “Efficient implementation of a synchronous parallel push-relabel algorithm”. In: European Symposium on Algorithms. 2015, pp. 106–117.
[6] Endre Boros, Peter L Hammer, and Xiaorong Sun. Network flows and minimization of quadratic pseudo-Boolean functions. Tech. rep. 17-1991, RUTCOR, 1991.
[7] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[8] Y. Boykov, O. Veksler, and R. Zabih. “Fast approximate energy minimization via graph cuts”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 23.11 (2001), pp. 1222–1239.
[9] Yuri Y Boykov and M-P Jolly. “Interactive graph cuts for optimal boundary & region segmentation of objects in ND images”. In: International Conference on Computer Vision. Vol. 1. 2001, pp. 105–112.
[10] Yuri Boykov and Gareth Funka-Lea. “Graph cuts and efficient ND image segmentation”. In: International Journal of Computer Vision 70 (2006), pp. 109–131.
[11] Yuri Boykov and Vladimir Kolmogorov. “An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 26.9 (2004), pp. 1124–1137.
[12] Yuri Boykov and Vladimir Kolmogorov. “Computing geodesics and minimal surfaces via graph cuts.” In: International Conference on Computer Vision. Vol. 3. 2003, pp. 26–33.
[13] Yuri Boykov and Victor S Lempitsky. “From Photohulls to Photoflux Optimization.” In: British Machine Vision Conference. Vol. 3. 2006, p. 27.
[14] Yuri Boykov, Olga Veksler, and Ramin Zabih. “Markov random fields with efficient approximations”. In: IEEE Conference on Computer Vision and Pattern Recognition. 1998, pp. 648–655.
[15] Leo Breiman et al. Classification and regression trees. Routledge, 2017.
[16] Tibério S Caetano et al. “Learning graph matching”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009), pp. 1048–1058.
[17] Bala G Chandran and Dorit S Hochbaum. “A computational study of the pseudoflow and push-relabel algorithms for the maximum flow problem”. In: Operations Research 57.2 (2009), pp. 358–376.
[18] Xinjian Chen and Lingjiao Pan. “A survey of graph cuts/graph search based medical image segmentation”. In: IEEE Reviews in Biomedical Engineering (RBME) 11 (2018), pp. 112–124.
[19] Boris V Cherkassky and Andrew V Goldberg. “On implementing the push-relabel method for the maximum flow problem”. In: Algorithmica 19.4 (1997), pp. 390–410.
[20] Özgün Çiçek et al. “3D U-Net: learning dense volumetric segmentation from sparse annotation”. In: International Conference on Medical Image Computing and Computer Assisted Intervention. 2016, pp. 424–432.
[21] Thomas H Cormen et al. Introduction to algorithms. MIT Press, 2009.
[22] Andrew Delong and Yuri Boykov. “A scalable graph-cut algorithm for ND grids”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2008, pp. 1–8.
[23] DTU Computing Center. DTU Computing Center resources. 2021. DOI: 10.48714/DTU.HPC.0001. URL: https://doi.org/10.48714/DTU.HPC.0001.
[24] Jan Egger et al. “Nugget-cut: a segmentation scheme for spherically- and elliptically-shaped 3D objects”. In: Joint Pattern Recognition Symposium. 2010, pp. 373–382.
[25] M. Everingham et al. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[26] M. Everingham et al. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.
[27] B. Fishbain, Dorit S. Hochbaum, and Stefan Mueller. “A competitive study of the pseudoflow algorithm for the minimum s–t cut problem in vision applications”. In: Journal of Real-Time Image Processing 11.3 (2016), pp. 589–609.
[28] Lester Randolph Ford Jr and Delbert Ray Fulkerson. Flows in networks. Princeton University Press, 1962.
[29] Daniel Freedman and Petros Drineas. “Energy minimization via graph cuts: Settling what is possible”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2005, pp. 939–946.
[30] William T Freeman, Egon C Pasztor, and Owen T Carmichael. “Learning low-level vision”. In: International Journal of Computer Vision 40 (2000), pp. 25–47.
[31] Francisco García-Moreno et al. “Using X-ray tomoscopy to explore the dynamics of foaming metal”. In: Nature Communications 10.1 (2019), pp. 1–9.
[32] Andrew V Goldberg. “Processor-efficient implementation of a maximum flow algorithm”. In: Information Processing Letters 38.4 (1991), pp. 179–185.
[33] Andrew V Goldberg. “The partial augment–relabel algorithm for the maximum flow problem”. In: European Symposium on Algorithms. 2008, pp. 466–477.
[34] Andrew V Goldberg. “Two-level push-relabel algorithm for the maximum flow problem”. In: International Conference on Algorithmic Applications in Management. 2009, pp. 212–225.
[35] Andrew V Goldberg and Robert E Tarjan. “A new approach to the maximum-flow problem”. In: Journal of the ACM 35.4 (1988), pp. 921–940.
[36] Andrew V Goldberg et al. “Faster and More Dynamic Maximum Flow by Incremental Breadth-First Search”. In: European Symposium on Algorithms. 2015, pp. 619–630.
[37] Andrew V Goldberg et al. “Maximum Flows by Incremental Breadth-First Search”. In: European Symposium on Algorithms. 2011, pp. 457–468.
[38] Vicente Grau, J Crawford Downs, and Claude F Burgoyne. “Segmentation of trabeculated structures using an anisotropic Markov random field: application to the study of the optic nerve head in glaucoma”. In: IEEE Transactions on Medical Imaging 25 (2006), pp. 245–255.
[39] Raymond Greenlaw, H. James Hoover, and Walter L. Ruzzo. Limits to parallel computation: P-completeness theory. Oxford University Press on Demand, 1995.
[40] D. M. Greig, B. T. Porteous, and A. H. Seheult. “Exact Maximum A Posteriori Estimation for Binary Images”. In: Journal of the Royal Statistical Society. Series B (Methodological) 51.2 (1989), pp. 271–279.
[41] Lihong Guo et al. “Learned snakes for 3D image segmentation”. In: Signal Processing 183 (2021), p. 108013.
[42] Zhihui Guo et al. “Deep LOGISMOS: deep learning graph-based 3D segmentation of pancreatic tumors on CT scans”. In: IEEE International Symposium on Biomedical Imaging. 2018, pp. 1230–1233.
[43] Felix Halim, Roland HC Yap, and Yongzheng Wu. “A MapReduce-based maximum-flow algorithm for large small-world network graphs”. In: International Conference on Distributed Computing Systems. 2011, pp. 192–202.
[44] Peter L Hammer, Pierre Hansen, and Bruno Simeone. “Roof duality, complementation and persistency in quadratic 0–1 optimization”. In: Mathematical Programming 28.2 (1984), pp. 121–155.
[45] Dorit S. Hochbaum. “The pseudoflow algorithm: A new algorithm for the maximum-flow problem”. In: Operations Research 56.4 (2008), pp. 992–1009.
[46] Dorit S. Hochbaum and James B. Orlin. “Simplifications and speedups of the pseudoflow algorithm”. In: Networks 61.1 (2013), pp. 40–57.
[47] Bo Hong and Zhengyu He. “An asynchronous multi-threaded algorithm for the maximum network flow problem with nonblocking global relabeling heuristic”. In: IEEE Transactions on Parallel and Distributed Systems 22.6 (2010), pp. 1025–1033.
[48] Lisa Hutschenreiter et al. “Fusion Moves for Graph Matching”. In: International Conference on Computer Vision. 2021, pp. 6270–6279.
[49] Hossam Isack et al. “Efficient optimization for hierarchically-structured interacting segments (HINTS)”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 1445–1453.
[50] Hiroshi Ishikawa. “Exact optimization for Markov random fields with convex priors”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 25.10 (2003), pp. 1333–1336.
[51] Ondřej Jamriška and Daniel Sýkora. GridCut. Version 1.3. 2015. https://gridcut.com. Accessed 2020-06-12.
[52] Ondřej Jamriška, Daniel Sýkora, and Alexander Hornung. “Cache-efficient Graph Cuts on Structured Grids”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2012, pp. 3673–3680.
[53] Patrick M. Jensen, Anders B. Dahl, and Vedrana A. Dahl. “Multi-Object Graph-Based Segmentation With Non-Overlapping Surfaces”. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2020, pp. 976–977.
[54] Patrick M. Jensen and Niels Jeppesen. Max-Flow/Min-Cut Algorithms. DOI: 10.5281/zenodo.4903945. URL: https://doi.org/10.5281/zenodo.4903945. Accessed 2021-06-08.
[55] Patrick M. Jensen et al. Min-Cut/Max-Flow Problem Instances for Benchmarking. DOI: 10.11583/DTU.17091101. URL: https://doi.org/10.11583/DTU.17091101. Accessed 2021-11-29.
[56] Niels Jeppesen et al. “Faster Multi-Object Segmentation Using Parallel Quadratic Pseudo-Boolean Optimization”. In: International Conference on Computer Vision. Oct. 2021, pp. 6260–6269.
[57] Niels Jeppesen et al. “Sparse Layered Graphs for Multi-Object Segmentation”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2020, pp. 12777–12785.
[58] Dagmar Kainmueller et al. “Active graph matching for automatic joint segmentation and annotation of C. elegans”. In: International Conference on Medical Image Computing and Computer Assisted Intervention. 2014, pp. 81–88.
[59] Jörg H Kappes et al. “A comparative study of modern inference techniques for structured discrete energy minimization problems”. In: International Journal of Computer Vision 115 (2015), pp. 155–184.
[60] S Kashyap, H Zhang, and M Sonka. “Accurate Fully Automated 4D Segmentation of Osteoarthritic Knee MRI”. In: Osteoarthritis and Cartilage 25 (2017), S227–S228.
[61] Anna Khoreva et al. “Simple does it: Weakly supervised instance and semantic segmentation”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 876–885.
[62] Pushmeet Kohli and Philip H.S. Torr. “Dynamic graph cuts for efficient inference in Markov random fields”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 29.12 (2007), pp. 2079–2088.
[63] Vladimir Kolmogorov and Carsten Rother. “Minimizing nonsubmodular functions with graph cuts-a review”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 29.7 (2007), pp. 1274–1279.
[64] Vladimir Kolmogorov and Ramin Zabih. “Computing visual correspondence with occlusions using graph cuts”. In: International Conference on Computer Vision. Vol. 2. 2001, pp. 508–515.
[65] Vladimir Kolmogorov and Ramin Zabin. “What energy functions can be minimized via graph cuts?” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 26.2 (2004), pp. 147–159.
[66] Nikos Komodakis and Nikos Paragios. “Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles”. In: European Conference on Computer Vision. 2008, pp. 806–820.
[67] L’ubor Ladický and Philip HS Torr. The automatic labelling environment. https://www.robots.ox.ac.uk/~phst/ale.htm. Accessed 2021-11-24.
[68] L’ubor Ladický et al. “Associative hierarchical crfs for object class image segmentation”. In: International Conference on Computer Vision. 2009, pp. 739–746.
[69] L’ubor Ladický et al. “Graph cut based inference with co-occurrence statistics”. In: European Conference on Computer Vision. 2010, pp. 239–253.
[70] Kyungmoo Lee et al. “Multiresolution LOGISMOS graph search for automated choroidal layer segmentation of 3D macular OCT scans”. In: Medical Imaging 2020: Image Processing. Vol. 11313. International Society for Optics and Photonics. 2020, 113130B.
[71] Victor Lempitsky and Yuri Boykov. “Global optimization for shape fitting”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2007, pp. 1–8.
[72] Victor Lempitsky, Yuri Boykov, and Denis Ivanov. “Oriented visibility for multiview reconstruction”. In: European Conference on Computer Vision. 2006, pp. 226–238.
[73] Marius Leordeanu, Rahul Sukthankar, and Martial Hebert. “Unsupervised learning for graph matching”. In: International Journal of Computer Vision 96 (2012), pp. 28–45.
[74] Kang Li et al. “Optimal surface segmentation in volumetric images-a graph-theoretic approach”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 28.1 (2005), pp. 119–134.
[75] Jiangyu Liu and Jian Sun. “Parallel Graph-cuts by Adaptive Bottom-up Merging”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2010, pp. 2181–2188.
[76] Lei Liu et al. “Graph cut based mesh segmentation using feature points and geodesic distance”. In: Proceedings of the International Conference on Cyberworlds (CW). 2015, pp. 115–120.
[77] Eric Maire and Philip John Withers. “Quantitative X-ray tomography”. In: International Materials Reviews 59.1 (2014), pp. 1–43.
[78] Leland McInnes, John Healy, and James Melville. “Umap: Uniform manifold approximation and projection for dimension reduction”. In: arXiv:1802.03426 (2018).
[79] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. “V-net: Fully convolutional neural networks for volumetric medical image segmentation”. In: International Conference on 3D Vision. 2016, pp. 565–571.
[80] Shervin Minaee et al. “Image segmentation using deep learning: A survey”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[81] Rajmund Mokso et al. “GigaFRoST: the gigabit fast readout system for tomography”. In: Journal of Synchrotron Radiation 24.6 (2017), pp. 1250–1259.
[82] Sebastian Nowozin et al. “Decision tree fields”. In: International Conference on Computer Vision. 2011, pp. 1668–1675.
[83] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[84] Bo Peng, Lei Zhang, and David Zhang. “A survey of graph theoretical approaches to image segmentation”. In: Pattern Recognition 46.3 (2013), pp. 1020–1038.
[85] Sida Peng et al. “Deep snake for real-time instance segmentation”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2020, pp. 8533–8542.
[86] Yi Peng et al. “JF-Cut: A parallel graph cut approach for large-scale image and video”. In: IEEE Transactions on Image Processing 24.2 (2015), pp. 655–666.
[87] Marius Reichardt et al. “3D virtual Histopathology of Cardiac Tissue from Covid-19 Patients based on Phase-Contrast X-ray Tomography”. In: eLife (2021).
[88] Carsten Rother et al. “Optimizing binary MRFs via extended roof duality”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2007, pp. 1–8.
[89] Alexander Shekhovtsov and Václav Hlaváč. “A distributed mincut/maxflow algorithm combining path augmentation and push-relabel”. In: International Journal of Computer Vision 104.3 (2013), pp. 315–342.
[90] Amber L Simpson et al. “A large annotated medical image dataset for the development and evaluation of segmentation algorithms”. In: arXiv:1902.09063 (2019).
[91] Petter Strandmark and Fredrik Kahl. “Parallel and Distributed Graph Cuts by Dual Decomposition”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2010, pp. 2085–2092.
[92] Paul Swoboda et al. “A study of lagrangean decompositions and dual ascent solvers for graph matching”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 1607–1616.
[93] Nima Tajbakhsh et al. “Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation”. In: Medical Image Analysis 63 (2020), p. 101693.
[94] Lorenzo Torresani, Vladimir Kolmogorov, and Carsten Rother. “Feature correspondence via graph matching: Models and global optimization”. In: European Conference on Computer Vision. 2008, pp. 596–609.
[95] Tanmay Verma and Dhruv Batra. “MaxFlow Revisited: An Empirical Comparison of Maxflow Algorithms for Dense Vision Problems”. In: British Machine Vision Conference. 2012, pp. 1–12.
[96] Vibhav Vineet and P J Narayanan. “CUDA cuts: Fast graph cuts on the GPU”. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2008, pp. 1–8.
[97] Yao Wang and Reinhard Beichel. “Graph-based segmentation of lymph nodes in CT data”. In: International Symposium on Visual Computing. 2010, pp. 312–321.
[98] University of Waterloo. Max-flow problem instances in vision. https://vision.cs.uwaterloo.ca/data/maxflow. Accessed 2021-02-05.
[99] Udaranga Wickramasinghe et al. “Voxel2mesh: 3d mesh model generation from volumetric data”. In: International Conference on Medical Image Computing and Computer Assisted Intervention. Springer. 2020, pp. 299–308.
[100] Xiaodong Wu and Danny Z Chen. “Optimal net surface problems with applications”. In: International Colloquium on Automata, Languages, and Programming. 2002, pp. 1029–1042.
[101] Yin Yin et al. “LOGISMOS—layered optimal graph image segmentation of multiple objects and surfaces: cartilage segmentation in the knee joint”. In: IEEE Transactions on Medical Imaging 29.12 (2010), pp. 2023–2037.
[102] Miao Yu, Shuhan Shen, and Zhanyi Hu. “Dynamic Graph Cuts in Parallel”. In: IEEE Transactions on Image Processing 26.8 (2017).
[103] Miao Yu, Shuhan Shen, and Zhanyi Hu. “Dynamic Parallel and Distributed Graph Cuts”. In: IEEE Transactions on Image Processing 25.12 (2015), pp. 5511–5525.

Patrick M. Jensen was born in Copenhagen, Denmark, in 1994. He received his B.Sc.Eng degree in 2017 and M.Sc.Eng degree in 2019, both in applied mathematics, at the Technical University of Denmark (DTU), Kgs. Lyngby, Denmark. He is currently pursuing a Ph.D. in 3D image analysis at the Visual Computing group at the Department of Applied Mathematics and Computer Science, Technical University of Denmark. His research interests lie in 3D image segmentation.

Niels Jeppesen is an image analysis and machine learning specialist at FORCE Technology with a Ph.D. degree in image analysis of 3D structures from the Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU), Kgs. Lyngby, Denmark. His research interests lie in min-cut/max-flow algorithms and quantitative analysis of structures in 3D images. He applies these methods for automated quality control of structures and materials, in particular, in the wind turbine industry.

Anders Bjorholm Dahl is a professor in 3D image analysis and head of the Section for Visual Computing at the Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU), Kgs. Lyngby, Denmark. He is heading the Center for Quantification of Imaging Data from MAX IV, focusing on quantitative analysis of 3D images. His research is focused on image segmentation and its applications.

Vedrana Andersen Dahl is an associate professor at the Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU), Kgs. Lyngby, Denmark. Her primary research interest is in the use of geometric models for the analysis of volumetric data. This includes volumetric segmentation and methods based on deformable meshes. She has developed analysis tools with applications in material science, industrial inspection, and biomedicine.