
FastDOG: Fast Discrete Optimization on GPU

Ahmed Abbas Paul Swoboda


Max Planck Institute for Informatics, Saarland Informatics Campus

Abstract

We present a massively parallel Lagrange decomposition method for solving 0–1 integer linear programs occurring in structured prediction. We propose a new iterative update scheme for solving the Lagrangean dual and a perturbation technique for decoding primal solutions. For representing subproblems we follow [40] and use binary decision diagrams (BDDs). Our primal and dual algorithms require little synchronization between subproblems and optimization over BDDs needs only elementary operations without complicated control flow. This allows us to exploit the parallelism offered by GPUs for all components of our method. We present experimental results on combinatorial problems from MAP inference for Markov Random Fields, quadratic assignment and cell tracking for developmental biology. Our highly parallel GPU implementation improves upon the running times of the algorithms from [40] by up to an order of magnitude. In particular, we come close to or outperform some state-of-the-art specialized heuristics while being problem agnostic. Our implementation is available at https://github.com/LPMP/BDD.

[Figure 1 schematic: solvers arranged along two axes, general purpose vs. specialized and CPU vs. GPU; FastDOG sits in the general purpose GPU quadrant, specialized GPU solvers [1, 48, 58, 65] and several hundred specialized CPU works occupy the specialized side.]

Figure 1. Qualitative comparison of ILP solvers for structured prediction. Our solver (FastDOG) is faster than Gurobi [23] and comparable to specialized CPU solvers, but outperformed by specialized GPU solvers. FastDOG is applicable to a diverse set of applications obviating the human effort for developing solvers for new problem classes.
1. Introduction

Solving integer linear programs (ILP) efficiently on parallel computation devices is an open research question. Done properly it would enable more practical usage of many ILP problems from structured prediction in computer vision and machine learning. Currently, state-of-the-art generally applicable ILP solvers tend not to benefit much from parallelism [45]. In particular, linear program (LP) solvers for computing relaxations benefit modestly (interior point) or not at all (simplex) from multi-core architectures. In particular, generally applicable solvers are not amenable for execution on GPUs. To our knowledge there exists no practical and general GPU-based optimization routine and only a few solvers for narrow problem classes have been made GPU-compatible, e.g. [1, 48, 58, 65]. This, and the superlinear runtime complexity of general ILP solvers, has hindered the application of ILPs in large structured prediction problems, necessitating either restriction to at most medium problem sizes or difficult and time-consuming development of specialized solvers, as observed for the special case of MAP-MRF [33].

We argue that work on speeding up general purpose ILP solvers has had only limited success so far due to complicated control flow and computation interdependencies. We pursue an overall different approach and do not base our work on the typically used components of ILP solvers. Our approach is designed from the outset to only use operations that offer sufficient parallelism for implementation on GPUs. We argue that our approach sits on a sweet spot between general applicability and efficiency for problems in structured prediction, as shown in Figure 1. Similar to general purpose ILP solvers [15, 23], there is little or no effort to adapt these problems for solving them with our approach. On the other hand we outperform general purpose ILP solvers in terms of execution speed for large problems from structured prediction and achieve runtimes comparable to hand-crafted specialized CPU solvers. We are only significantly outperformed by specialized GPU solvers. However, development of fast specialized solvers, especially on GPU, is time-consuming and needs to be repeated for every new problem class.

Our work builds upon [40], in which the authors proposed a Lagrange decomposition into subproblems represented by binary decision diagrams (BDDs). The authors proposed sequential algorithms as well as parallel extensions for solving the Lagrange decomposition. We improve upon their solver by proposing massively parallelizable, GPU-amenable routines for both dual optimization and primal rounding. This results in significant runtime improvements as compared to their approach.

2. Related Work

General Purpose ILP Solvers & Parallelism The most efficient implementations of general purpose ILP solvers [15, 23] provided by commercial vendors typically benefit only moderately from parallelism. A recent survey in this direction is given in [45]. The main ways parallelism is utilized in ILP solvers are:

Multiple Independent Executions State-of-the-art solvers [15, 23] offer the option of running multiple algorithms (dual/primal simplex, interior point, different parameters) solving the same problem in parallel until one finds a solution. While easy and worthwhile for problems for which the best algorithms and parameter configurations are not known, such a simple approach can deliver parallelization speedups only to a limited degree.

Parallel Branch-and-Bound Tree Traversal While appealing at first glance, it has been observed [46] that the order in which a branch-and-bound tree is traversed is crucial due to exploitation of improved lower and upper bounds and generated cuts. Consequently, it seems hard to obtain significant parallelization speedups and many recent improvements rely on a sequential execution. A separate line of work [50] exploited GPU parallelism for domain propagation, allowing to decrease the size of the branch-and-bound tree.

Parallel LP Solvers Interior point methods rely on computing a sequence of solutions of linear systems. This linear algebra can be parallelized for speeding up the optimization [20, 49]. However, for sparse problems sequential simplex solvers still outperform parallelized interior point methods. Also, a crossover step is needed to obtain a suitable basis for the simplex method when reoptimizing for primal rounding and in branch-and-bound searches, limiting the speedup obtainable by this sequential bottleneck. The simplex method is less straightforward to parallelize. The work [28] reports a parallel implementation, however current state-of-the-art commercial solvers outperform it with sequentially executed implementations.

Machine Learning Methods Recently, deep learning based methods have been proposed for choosing variables to branch on [17, 43] and for directly computing some easy-to-guess variables of a solution [43] or improving a given one [51]. While parallelism is not the goal of these works, the underlying deep networks are executed on GPUs and hence the overall computation-heavy approach is fast and brings speedups. Still, these parallel components do not replace the sequential parts of the solution process but work in conjunction with them, limiting the overall speedup attainable.

A shortcoming of the above methods in the application to very large structured prediction problems in machine learning and computer vision is that they still do not scale well enough to solve problems with more than a million variables in a few seconds.

Parallel Combinatorial Solvers For specialized combinatorial problem classes, highly parallel algorithms for GPU have been developed. For Maximum-A-Posteriori inference in Markov Random Fields, [48, 65] proposed dual block coordinate ascent algorithms for sparse and [58] for dense graphs. For multicut, a primal-dual algorithm has been proposed in [1]. Max-flow GPU implementations have been investigated in [60, 64]. While some parts of the above specialized algorithms can potentially be generalized, other key components cannot, limiting their applicability to new problem classes and requiring time-consuming design of algorithms whenever attempting to solve a different problem class.

Specialized CPU Solvers There is a large literature of specialized CPU solvers for specific problem classes in structured prediction. For an overview of pursued algorithmic techniques for the special case of MRFs we refer to the overview article [33]. Most related to our approach are the so-called dual block coordinate ascent (a.k.a. message passing) algorithms which optimize a Lagrange decomposition. Solvers have been developed for MRFs [19, 30, 31, 36, 37, 42, 47, 58, 59, 61, 62], graph matching [54, 55, 66], multicut [1, 39, 52], multiple object tracking [27] and cell tracking [24]. Most of the above algorithms require a sequential computation of update steps.

Optimization with Binary Decision Diagrams Our work builds upon [40]. The authors proposed a Lagrange decomposition of ILPs that can be optimized via a sequential dual block coordinate ascent method or a decomposition-based approach that can utilize multiple CPU cores.
The works [5, 6, 41] similarly consider decompositions into multiple BDDs and solve the resulting problem with general purpose ILP solvers. The work [7] investigates optimization of Lagrange decompositions with multi-valued decision diagrams with subgradient methods. An extension for job sequencing was proposed in [26] and in [14] for routing problems. Hybrid solvers using mixed integer programming solvers were investigated in [21, 22, 56]. The works [3, 8, 9] consider stable set and max-cut and propose optimizing (i) a relaxation to get lower bounds [3] or (ii) a restriction to generate approximate solutions [8, 9].

In contrast to previous BDD-based optimization methods we propose a highly parallelizable and problem agnostic approach that is amenable to GPU computation.

3. Method

We first introduce the optimization problem and its Lagrange decomposition. Next we elaborate our parallel update scheme for optimizing the Lagrangean dual, followed by our parallel primal rounding algorithm. For the problem decomposition and dualization we follow [40]. Our notation is summarized for reference in Table 1.

x_i        Optimization variable i ∈ [n]
X_j        Feasible set of constraint j ∈ [m]
I_j        Set of variables in constraint j ∈ [m]
J_i        Set of constraints containing variable i ∈ [n]
m^β_{ij}   Min-marginal for variable i taking value β in subproblem j ∈ [m]
λ^j_i      Lagrange multiplier for variable i in subproblem j

Table 1. Notation of symbols used in our problem decomposition.

Definition 1 (Binary Program). Consider a linear objective c ∈ R^n and m variable subsets I_j ⊂ [n] of constraints with feasible sets X_j ⊂ {0,1}^{I_j} for j ∈ [m]. The corresponding binary program is defined as

\min_{x \in \{0,1\}^n} c^\top x \quad \text{s.t.} \quad x_{\mathcal{I}_j} \in \mathcal{X}_j \quad \forall j \in [m],   (BP)

where x_{I_j} is the restriction to variables in I_j.

Example 1 (ILP). Consider the 0–1 integer linear program

\min \; c^\top x \quad \text{s.t.} \quad Ax \leq b, \; x \in \{0,1\}^n.   (ILP)

The system of linear constraints Ax ≤ b may be split into m blocks, each block representing a single (or multiple) rows of the system. For instance, let a_j^\top x \leq b_j denote the j-th row of Ax ≤ b; then the problem can be written in the form (BP) by setting I_j = \{i \in [n] : a_{ji} \neq 0\} and \mathcal{X}_j = \{x \in \{0,1\}^{I_j} : \sum_{i \in I_j} a_{ji} x_i \leq b_j\}.

3.1. Lagrangean Dual

While (BP) is NP-hard to solve, optimization over a single constraint is typically easier, for example by using Binary Decision Diagrams. To make use of this possibility we dualize the original problem using Lagrange decomposition similarly to [40]. This allows us to solve the Lagrangean dual of the full problem (BP) by solving only the subproblems.

Definition 2 (Lagrangean dual problem). Define the set of subproblems that constrain variable x_i as J_i = \{j \in [m] \mid i \in I_j\}. Let the energy for subproblem j ∈ [m] w.r.t. Lagrangean dual variables \lambda^j \in \mathbb{R}^{I_j} be

E^j(\lambda^j) = \min_{x \in \mathcal{X}_j} x^\top \lambda^j.   (1)

Then the Lagrangean dual problem is defined as

\max_\lambda \; \sum_{j \in [m]} E^j(\lambda^j) \quad \text{s.t.} \quad \sum_{j \in \mathcal{J}_i} \lambda^j_i = c_i \quad \forall i \in [n].   (D)

If the optima of the individual subproblems E^j(λ^j) agree with each other, then the consensus vector obtained from stitching together individual subproblem solutions solves the original problem (BP). In general, (D) is a lower bound on (BP). A formal derivation of (D) is given in [40].
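To make the decomposition concrete, the following Python sketch (our own illustration, not the authors' implementation; all names are hypothetical) splits the rows of a small (ILP) into the subproblems of (BP) and evaluates the dual bound of (D) by brute-force enumeration, which the paper replaces with BDDs:

```python
import itertools
import numpy as np

def split_rows(A, b):
    """One subproblem per row j of Ax <= b: the index set I_j together with
    the row data that defines X_j implicitly."""
    return [(np.flatnonzero(A[j]), A[j, np.flatnonzero(A[j])], b[j])
            for j in range(A.shape[0])]

def subproblem_energy(coeffs, rhs, lam):
    """E^j(lambda^j) = min_{x in X_j} <x, lambda^j>, by enumerating all 0/1
    assignments of the variables in I_j (feasible only for tiny toy problems)."""
    return min(np.dot(x, lam)
               for x in itertools.product((0, 1), repeat=len(lam))
               if np.dot(coeffs, x) <= rhs)

def dual_lower_bound(subproblems, lambdas):
    """Value of (D) for dual variables with sum_{j in J_i} lambda^j_i = c_i."""
    return sum(subproblem_energy(c, r, lam)
               for (I, c, r), lam in zip(subproblems, lambdas))
```

Splitting the objective evenly, λ^j_i = c_i/|J_i| (the initialization used in Section 3.3), satisfies the constraint in (D), so the resulting sum is already a valid lower bound on (BP).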
3.2. Min-Marginals

To optimize the dual problem (D) and also to obtain a primal solution we use min-marginals [40], defined as follows.

Definition 3 (Min-marginals). For i ∈ [n], j ∈ J_i and β ∈ {0,1} let

m^\beta_{ij} = \min_{x \in \mathcal{X}_j} x^\top \lambda^j \quad \text{s.t.} \quad x_i = \beta   (MM)

denote the min-marginal w.r.t. primal variable i, subproblem j and β.

Definition 4 (Min-marginal differences). For notational convenience let us also define

M_{ij} = m^1_{ij} - m^0_{ij},   (MD)

which denotes the min-marginal difference computed through (MM).

If M_{ij} > 0 then assigning a value of 0 to variable i has a lower cost than assigning a 1 in subproblem j, and vice versa. Thus, the quantity |M_{ij}| indicates by how much E^j(λ^j) increases if x_i is fixed to 1 (if M_{ij} > 0), respectively to 0 (if M_{ij} < 0).

Min-marginals have been used in various ways to design dual block coordinate ascent algorithms [1, 4, 19, 24, 27, 30, 31, 36, 37, 40, 42, 47, 52, 54, 55, 58, 59, 61–63, 66].
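Continuing the toy sketch from Section 3.1 (again illustrative only; Section 5 shows how the paper computes these quantities with BDD shortest paths instead of enumeration), min-marginals and their differences follow directly from the definitions:

```python
def min_marginal(coeffs, rhs, lam, i, beta):
    """m^beta_{ij} from (MM): cheapest feasible assignment with x_i = beta,
    where i indexes a position within this subproblem's variable set I_j."""
    return min(np.dot(x, lam)
               for x in itertools.product((0, 1), repeat=len(lam))
               if x[i] == beta and np.dot(coeffs, x) <= rhs)

def min_marginal_difference(coeffs, rhs, lam, i):
    """M_{ij} = m^1_{ij} - m^0_{ij} from (MD); positive means x_i = 0 is cheaper."""
    return (min_marginal(coeffs, rhs, lam, i, 1)
            - min_marginal(coeffs, rhs, lam, i, 0))
```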

[Figure 2 shows two weighted BDDs for the example problem
min_{x ∈ {0,1}^4} −5x_1 + x_2 + 4x_3 + 3x_4  s.t.  X_1 : x_1 + x_2 + x_3 ≤ 2,  X_2 : x_2 + x_3 − x_4 = 0,
with Lagrange multipliers λ^1 = (−5, 0.5, 2) for (x_1, x_2, x_3) ∈ X_1 and λ^2 = (0.5, 2, 3) for (x_2, x_3, x_4) ∈ X_2.]

Figure 2. Example decomposition of a binary program into two subproblems, one for each constraint. Each subproblem is represented by a weighted BDD where solid arcs model the cost λ of assigning a 1 to the variable and dashed arcs have 0 cost, which models assigning a 0. All r−⊤ paths in BDDs encode feasible variable assignments of the corresponding subproblems (and r−⊥ paths infeasible ones). Optimal assignments w.r.t. the current (non-optimal) λ are highlighted in green, i.e. x_1 = 1, x_2 = x_3 = 0 for X_1 and x_2 = x_3 = x_4 = 0 for X_2. Our dual update scheme processes multiple variables in parallel, which are indicated in the same color (e.g. x_1, x_2 in X_1, X_2 resp.).

3.3. Parallel Deferred Min-Marginal Averaging

To exploit GPU parallelism in solving the dual problem (D) we would like to update multiple dual variables in parallel. However, conventional dual update schemes are not friendly for parallelization. For example, the dual update scheme of [40] for variable i in subproblem j is

\lambda_i^j \leftarrow \lambda_i^j - M_{ij} + \underbrace{\frac{1}{|\mathcal{J}_i|} \sum_{k \in \mathcal{J}_i} M_{ik}}_{\text{min-marginal averaging}},   (2)

where M_{ij} is defined in (MD). This update scheme (2) requires communication between all subproblems J_i containing variable i for the min-marginal averaging step and thus requires synchronization. To overcome this limitation we propose a novel dual optimization procedure which performs this averaging step on the min-marginal differences \overline{M} from the previous iteration as follows:

\lambda_i^j \leftarrow \lambda_i^j - \omega M_{ij} + \frac{\omega}{|\mathcal{J}_i|} \sum_{k \in \mathcal{J}_i} \overline{M}_{ik}.   (3)

Since \overline{M} was computed in the previous iteration, the above dual updates can be performed in parallel for all subproblems without requiring synchronization. Following [63] we use a damping factor ω ∈ (0,1) (0.5 in our experiments) to obtain better final solutions.

Algorithm 1: Parallel Deferred Min-Marginal Averaging
  Input: Lagrange variables λ^j_i ∈ R ∀i ∈ [n], j ∈ J_i,
         constraint sets X_j ⊂ {0,1}^{I_j} ∀j ∈ [m],
         damping factor ω ∈ (0,1]
  1 Initialize deferred min-marginal diff. \overline{M} = 0
  2 while stopping criterion not met do
  3   for j ∈ [m] in parallel do
  4     for i ∈ I_j in ascending order do
  5       Compute min-marginal diff. M_{ij} (MD)
  6       Update dual variables λ^j_i via (3)
  7   Update deferred min-marginal diff. \overline{M} ← M
  8   Repeat lines 3–6 in descending order of I_j
  9 for j ∈ [m], i ∈ I_j do
 10   Add deferred min-marginal differences: λ^j_i += ω \overline{M}_{ij}

Our proposed scheme is given in Algorithm 1. We iterate in parallel over each subproblem j. For each subproblem, variables are visited in order and min-marginals are computed and stored for updates in the next iteration (lines 4–5). The current min-marginal difference is subtracted and the one from the previous iteration is added (line 6) by distributing it equally among the subproblems J_i. At termination (line 10) we perform a min-marginal averaging step to account for the deferred update from the last iteration. As stopping criterion we use the relative change in dual objective between two subsequent iterations. We initialize the input Lagrange variables by λ^j_i = c_i/|J_i|, ∀i ∈ [n], j ∈ J_i.
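The following Python sketch illustrates the deferred update (3). It is our own simplification, not the paper's CUDA code: it assumes for brevity that every subproblem contains every variable (so the multipliers form an m×n array and |J_i| = m), and it recomputes all M_{ij} at once, Jacobi-style, instead of sweeping variables in ascending and descending order as Algorithm 1 does; `compute_M` stands for min-marginal computation, done on BDDs in the paper.

```python
import numpy as np

def deferred_min_marginal_averaging(lambdas, compute_M, num_itr=100, omega=0.5):
    """lambdas: (m, n) array of lambda^j_i, one row per subproblem.
    compute_M(lambdas): (m, n) array of current min-marginal differences M_ij."""
    m = lambdas.shape[0]
    M_deferred = np.zeros_like(lambdas)
    for _ in range(num_itr):
        M = compute_M(lambdas)
        # every row j is updated independently -> no synchronization needed,
        # since the averaging term only reads last iteration's M_deferred
        lambdas += -omega * M + (omega / m) * M_deferred.sum(axis=0, keepdims=True)
        M_deferred = M
    # account for the last deferred update (line 10 of Algorithm 1)
    lambdas += omega * M_deferred
    return lambdas
```

Note how the averaging term is a plain column sum of last iteration's differences: all subproblems can apply their updates concurrently without exchanging freshly computed min-marginals.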
Proposition 1. In each dual iteration the Lagrange multipliers along with the deferred min-marginals can be used to satisfy dual feasibility, and the dual lower bound (D) is non-decreasing.

Similar to other dual block coordinate ascent schemes, Algorithm 1 can get stuck in suboptimal points, see [62, 63]. As seen in our experiments, these are usually not far away from the optimum, however.

In Section 5 we will explain how we can incrementally compute min-marginals, reusing previous computations, if we represent subproblems as Binary Decision Diagrams. This saves us from computing min-marginals from scratch, leading to greater efficiency.

4. Primal Rounding

In order to obtain a primal solution to (BP) from an approximate dual solution to (D) we propose a GPU friendly primal rounding scheme based on cost perturbation. We iteratively change costs in a way that variable assignments across subproblems agree with each other. If all variables agree by favoring a single assignment, we can reconstruct a primal solution (not necessarily the optimal one). Instead of only using variable assignments of all subproblems, we use min-marginal differences (MD), as they additionally indicate how strongly a variable favours a particular assignment.

Algorithm 2: Perturbation Primal Rounding
  Input: Lagrange variables λ^j_i ∈ R ∀i ∈ [n], j ∈ J_i,
         constraint sets X_j ⊂ {0,1}^{I_j} ∀j ∈ [m],
         initial perturbation strength δ ∈ R_+,
         perturbation growth rate α
  Output: Feasible labeling x ∈ {0,1}^n
  1 Compute min-marginal differences M_{ij} ∀i, j (MD)
  2 while ∃i ∈ [n] and j ≠ k ∈ J_i s.t. sign(M_{ij}) ≠ sign(M_{ik}) do
  3   for i = 1, ..., n in parallel do
  4     Sample r uniformly from [−δ, δ]
  5     if M_{ij} > 0 ∀j ∈ J_i then
  6       λ^j_i += δ ∀j ∈ J_i
  7     else if M_{ij} < 0 ∀j ∈ J_i then
  8       λ^j_i −= δ ∀j ∈ J_i
  9     else if M_{ij} = 0 ∀j ∈ J_i then
 10       λ^j_i += r · δ ∀j ∈ J_i
 11     else
 12       Compute total min-marginal difference: M_i = Σ_{j ∈ J_i} M_{ij}
 13       λ^j_i += sign(M_i) · |r| · δ ∀j ∈ J_i
 14   Increase perturbation: δ ← δ · α
 15   Reoptimize perturbed λ via Algorithm 1
 16   Recompute M_{ij} ∀i, j w.r.t. optimized λ

Algorithm 2 details our method. We iterate over all variables in parallel and check min-marginal differences. If for a variable i all min-marginal differences indicate that the optimal solution is 0 (resp. 1), the Lagrange variables λ are increased (resp. decreased), leaving even more certain min-marginal differences for these variables. This step imitates variable fixation as done in branch-and-bound, however we only perform soft fixation implicitly through cost perturbation. In case the min-marginal differences are equal we randomly perturb the corresponding dual costs. Lastly, if min-marginal differences indicate conflicting solutions, we compute the total min-marginal difference and decide accordingly. In the last two cases we add more perturbation to force towards non-conflicts. For faster convergence we increase the perturbation magnitude after each iteration.

Note that the modified λ variables produced by Alg. 2 need not be feasible for the dual problem (D). Although our primal rounding algorithm is not guaranteed to terminate, in our experiments a solution was always found in less than 100 iterations.

Remark. The primal rounding scheme in [40] and typical primal ILP heuristics [10] are sequential and build upon sequential operations such as variable propagation. Our primal rounding lends itself to parallelism since we perturb costs on all variables simultaneously and reoptimize via Algorithm 1; a code sketch of this loop is given after Definition 5 below.

5. Binary Decision Diagrams

We use Binary Decision Diagrams (BDDs) to represent the feasibility sets X_j, j ∈ [m] and compute their min-marginals (MM). BDDs are in essence directed acyclic graphs whose paths between two special nodes (root and terminal) encode all feasible solutions. Specifically, we use reduced ordered Binary Decision Diagrams [12] as in [40].

Definition 5 (BDD). Let an ordered variable set I = {w_1, ..., w_k} ⊂ [n] corresponding to a constraint be given. A corresponding BDD is a directed acyclic graph D = (V, A) with

Special nodes: root node r, terminals ⊥ and ⊤.

Outgoing arcs: each node v ∈ V\{⊤, ⊥} has exactly two successors s^0(v), s^1(v) with outgoing arcs vs^0(v) ∈ A (the zero arc) and vs^1(v) ∈ A (the one arc).

Partition: the node set V is partitioned by {P_1, ..., P_k}, ∪̇_i P_i = V\{⊤, ⊥}. Each partition holds all the nodes corresponding to a single variable, e.g. P_i corresponds to variable w_i. It holds that P_1 = {r}, i.e. it only contains the root node.

Partition ordering: when v ∈ P_i then s^0(v), s^1(v) ∈ P_{i+1} ∪ {⊥} for i < k and s^0(v), s^1(v) ∈ {⊥, ⊤} for v ∈ P_k.
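Returning to the primal rounding of Section 4, the following Python sketch (our own, in the dense setting of the earlier sketches, with hypothetical names) mirrors the case distinction of Algorithm 2; `reoptimize` stands for a few iterations of Algorithm 1 and `compute_M` for min-marginal computation:

```python
import numpy as np

def perturbation_rounding(lambdas, compute_M, reoptimize,
                          delta=1.0, alpha=1.2, max_itr=100, rng=None):
    """Sketch of Algorithm 2 for (m, n) arrays lambdas and M = compute_M(lambdas)."""
    rng = rng or np.random.default_rng()
    M = compute_M(lambdas)
    for _ in range(max_itr):
        sign = np.sign(M)
        if np.all(sign.min(axis=0) == sign.max(axis=0)):   # all subproblems agree
            return (M.sum(axis=0) < 0).astype(int)         # consistent labeling
        r = rng.uniform(-delta, delta, size=M.shape[1])
        all_pos = np.all(M > 0, axis=0)    # every subproblem prefers x_i = 0
        all_neg = np.all(M < 0, axis=0)    # every subproblem prefers x_i = 1
        all_zero = np.all(M == 0, axis=0)  # ties: perturb randomly
        conflict = ~(all_pos | all_neg | all_zero)
        total = M.sum(axis=0)              # total min-marginal difference M_i
        update = (all_pos * delta - all_neg * delta
                  + all_zero * r * delta
                  + conflict * np.sign(total) * np.abs(r) * delta)
        lambdas += update[None, :]         # same perturbation in every subproblem
        delta *= alpha                     # grow perturbation for faster agreement
        lambdas = reoptimize(lambdas)
        M = compute_M(lambdas)
    raise RuntimeError("no consistent primal solution found")
```

Every variable receives its perturbation independently, which is what makes the rounding loop embarrassingly parallel across variables.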

Definition 6 (Constraint Set Correspondence). Each BDD defines a constraint set X via the relation

x \in \mathcal{X} \;\Leftrightarrow\; \exists (v_1, \ldots, v_k, v_{k+1}) \in \text{Paths}(V, A) \text{ s.t. } v_1 = r,\; v_{k+1} = \top,\; v_{i+1} = s^{x_i}(v_i) \;\forall i \in [k].   (4)

Thus each path between the root r and the terminal ⊤ in the BDD corresponds to some feasible variable assignment x ∈ X. Figures 2 and 3 illustrate BDD encodings of feasible sets of linear (in)equalities.

Remark. In the literature [12, 35] BDDs have additional requirements, mainly that there are no isomorphic subgraphs. This allows for some additional canonicity properties like uniqueness and minimality. While all the BDDs in our algorithms satisfy the additional canonicity properties, only what is required in Definition 5 is needed for our purposes, so we keep this simpler setting.

[Figure 3 shows a weighted BDD with node partitions {a}, {b_1, b_2}, {c_1, c_2, c_3}, {d_1, d_2} and terminals ⊤, ⊥, annotated with shortest path costs SP(a, ·) from the root and SP(·, ⊤) to the terminal.]

Figure 3. Weighted BDD of a subproblem containing variables I = {a, b, c, d} with costs (λ) 2, 3, 1, 4 resp. and constraint a − b − c + d = 0. Shortest path costs from the root node a and to the terminal node ⊤ are shown for each node. Here P_1 = {a}, P_2 = {b_1, b_2}, P_3 = {c_1, c_2, c_3}, P_4 = {d_1, d_2}, s^0(c_2) = d_1 and s^1(c_2) = ⊥. Dashed arcs have cost 0 as they model assigning a 0 value to the corresponding variable.

5.1. Efficient Min-Marginal Computation

In order to compute min-marginals for subproblems we need to consider weighted BDDs. For notational convenience we will drop the dependence on subproblem j in the upcoming text, e.g. we will use λ_i instead of λ^j_i.

Definition 7 (Weighted BDD). A weighted BDD is a BDD with arc costs. Let a function f(x) be defined as

f(x) = \begin{cases} x^\top \lambda & x \in \mathcal{X} \\ \infty & \text{otherwise} \end{cases}.   (5)

The weighted BDD represents f if it satisfies Def. 6 for the given X and the arc costs for i ∈ [k], v ∈ P_i, vw ∈ A are set as

\begin{cases} 0 & w = s^0(v) \\ \lambda_i & w = s^1(v) \end{cases}.

Min-marginals for variable i ∈ I of a subproblem can be computed on its weighted BDD by calculating shortest path distances from r to all nodes in P_i and shortest path distances from all nodes in P_{i+1} to ⊤. We use SP(v, w) to denote the shortest path distance between nodes v and w of a weighted BDD. An example shortest path calculation is shown in Figure 3. The min-marginals as defined in (MM) can be computed as

m^\beta_i = \min_{v \in \mathcal{P}_i,\; vs^\beta(v) \in A} \left[ \mathrm{SP}(r, v) + \beta \cdot \lambda_i + \mathrm{SP}(s^\beta(v), \top) \right].   (6)

For efficient min-marginal computation in Algorithm 1 we reuse the shortest path distances used in (6). Specifically, for computing min-marginals in lines 4–5 of Alg. 1 we use Alg. 3 for ascending variable order and Alg. 4 for descending variable order in line 8.

Algorithm 3: Forward Pass Min-Marginal Computation
  1 for v ∈ P_i do
  2   SP(r, v) = min { min_{u : s^0(u) = v} SP(r, u),  min_{u : s^1(u) = v} SP(r, u) + λ_{i−1} }
  3 Compute m^β_i via (6)

Algorithm 4: Backward Pass Min-Marginal Computation
  1 for v ∈ P_{i+1} do
  2   SP(v, ⊤) = min { SP(s^0(v), ⊤),  SP(s^1(v), ⊤) + λ_{i+1} }
  3 Compute m^β_i via (6)

Efficient GPU Implementation In addition to solving all subproblems in parallel, we also exploit parallelism within each subproblem during shortest path updates. Specifically, in Alg. 3 we parallelize over all v ∈ P_i and perform the min operation atomically. Similarly, in Alg. 4 we parallelize over all v ∈ P_{i+1}, but without requiring atomic updates.

To enable fast GPU memory access via memory coalescing we arrange BDD nodes in the following fashion. First, all nodes within a BDD which belong to the same partition P (thus corresponding to the same variable) are laid out consecutively. Secondly, across different BDDs, nodes are ordered w.r.t. increasing hop distance from their corresponding root nodes. Such an arrangement for the ILP in Figure 2 is shown in Figure 4.
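To illustrate (6) and the two passes, here is a compact Python sketch (our own, not the paper's CUDA implementation) of a weighted BDD stored partition by partition, with forward/backward shortest path sweeps and min-marginal extraction:

```python
import math

class WeightedBDD:
    """Partitions P_1..P_k of nodes; each node maps to (zero_succ, one_succ),
    where successors are node ids in the next partition, 'TOP' or 'BOT'.
    lam[i] is the cost of every one-arc leaving partition i (Definition 7)."""
    def __init__(self, partitions, lam):
        self.P, self.lam = partitions, lam

    def forward(self):
        """SP(r, v) for all nodes, processing partitions in ascending order."""
        dist = [{v: math.inf for v in part} for part in self.P]
        dist[0] = {v: 0.0 for v in self.P[0]}            # P_1 = {r}
        for i, part in enumerate(self.P[:-1]):
            for v, (s0, s1) in part.items():
                for succ, cost in ((s0, 0.0), (s1, self.lam[i])):
                    if succ not in ('TOP', 'BOT'):       # relax arc into P_{i+1}
                        dist[i + 1][succ] = min(dist[i + 1][succ],
                                                dist[i][v] + cost)
        return dist

    def backward(self):
        """SP(v, T) for all nodes, processing partitions in descending order."""
        dist = [dict() for _ in self.P]
        for i in range(len(self.P) - 1, -1, -1):
            for v, (s0, s1) in self.P[i].items():
                def to_top(succ, cost):
                    if succ == 'TOP': return cost
                    if succ == 'BOT': return math.inf    # infeasible continuation
                    return dist[i + 1][succ] + cost
                dist[i][v] = min(to_top(s0, 0.0), to_top(s1, self.lam[i]))
        return dist

    def min_marginal(self, i, beta, fwd, bwd):
        """m^beta_i via (6): cheapest r-T path using a beta-arc out of P_i."""
        best = math.inf
        for v, (s0, s1) in self.P[i].items():
            succ = s1 if beta else s0
            cost = self.lam[i] if beta else 0.0
            if succ == 'TOP':
                best = min(best, fwd[i][v] + cost)
            elif succ != 'BOT':
                best = min(best, fwd[i][v] + cost + bwd[i + 1][succ])
        return best
```

For the BDD of Figure 3 one would set lam = [2, 3, 1, 4]. One forward and one backward sweep make every m^β_i available, which is what Algorithms 3 and 4 exploit incrementally; the GPU version additionally parallelizes the loop over each partition's nodes and stores nodes sorted by hop distance for coalesced access, as described above.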

[Figure 4 shows the BDD nodes of the two subproblems X_1 and X_2 from Figure 2 laid out in one linear array, traversed left to right for the ascending order and right to left for the descending order.]

Figure 4. Arrangement of BDD nodes in GPU memory for the ILP in Figure 2. For ascending order in Alg. 1 we proceed from root to terminal nodes and vice versa for descending order.

6. Experiments

We show the effectiveness of our solver against a state-of-the-art ILP solver [23], the general purpose BDD-based solver [40] and specialized CPU solvers for specific problem classes. We have chosen some of the largest structured prediction ILPs we are aware of in the literature that are publicly available. Our results are computed on a single NVIDIA Volta V100 (16GB) GPU unless stated otherwise. For CPU solvers we use an AMD EPYC 7702 CPU.

Datasets Our benchmark problems obtained from [53] can be categorized as follows.

Cell tracking: Instances from [24] which we partition into small and large instances, as also done in [40].

Graph matching (GM): Quadratic assignment problems (often called graph matching in the literature) for correspondence in computer vision [57] (hotel, house) and developmental biology [32] (worms).

Markov Random Field (MRF): Several datasets from the OpenGM [33] benchmark, containing both small and large instances with varying topologies and numbers of labels. We have chosen the datasets color-seg, color-seg-n4, color-seg-n8 and object-seg.

QAPLib: The widely used benchmark dataset for quadratic assignment problems used in the combinatorial optimization community [13]. We partition QAPLib instances into small (up to 50 vertices) and large (up to 128 vertices) instances. For large instances we use an NVIDIA RTX 8000 (48GB) GPU.

Algorithms We compare results of the following algorithms.

Gurobi: The commercial ILP solver [23] as reported in [40]. The barrier method is used for QAPLib and dual simplex for all other datasets.

BDD-CPU: The BDD-based min-marginal averaging approach of [40]. The algorithm runs on CPU with 16 threads for parallelization. Primal solutions are rounded using their BDD-based depth-first search scheme.

Specialized solvers: State-of-the-art problem specific solvers for each dataset. For cell tracking we use the solver from [24], the AMP solver for graph matching proposed in [55] and TRWS for MRF [36].

FastDOG: Our approach, where for the GPU implementation we use the CUDA [44] and Thrust [25] programming frameworks. For rounding primal solutions with Algorithm 2 we set δ = 1.0 and α = 1.2. For constructing BDDs out of linear (in)equalities we use the same approach as for BDD-CPU.

For MRF, parallel algorithms such as [58] exist, however TRWS is faster on the sparse problems we consider. While we are aware of even faster purely primal heuristics [11, 38] for MRF and e.g. [29] for graph matching, they do not optimize a convex relaxation and hence do not provide lower bounds. Hence, we have chosen TRWS [36] for MRF and AMP [55] for graph matching which, similar to FastDOG, optimize an equivalent resp. similar Lagrange decomposition and hence can be directly compared.

Results In Table 2 we show aggregated results over all instances of each specific benchmark dataset. Runtimes are taken w.r.t. computation of both primal and dual bounds. A more detailed table with results for each instance is given in the Appendix.

In Figure 5 we show averaged convergence plots for various solvers. In general we offer very good anytime performance, producing better lower bounds than our baselines at most times and especially in the beginning.

Discussion In general, we are always faster (up to a factor of 10) than BDD-CPU [40] and, except on worms, we achieve similar or better lower bounds. In comparison to the respective hand-crafted specialized CPU solvers we also achieve comparable runtimes with comparable lower and upper bounds. While Gurobi achieves better lower bounds and primal solutions if given unlimited time, our FastDOG solver outperforms it on the larger instances when we abort Gurobi early after hitting a time limit. We argue that we outperform Gurobi on larger instances due to its superlinear iteration complexity.

When comparing the number of dual iterations to BDD-CPU, we need roughly 3 times as many to reach the same lower bound. Nonetheless, as we can perform more iterations per second, this still leads to an overall faster algorithm.

Since we are solving a relaxation, the lower bounds and the quality of primal solutions depend on the tightness of this relaxation. For all datasets except QAPLib our (and also the baselines') lower and upper bounds are fairly close, reflecting the nature of commonly occurring structured prediction problems.

Cell tracking Graph matching MRF QAPLib
Small Large Hotel House Worms C-seg C-seg-n4 C-seg-n8 Obj-seg Small Large
# instances 10 5 105 105 30 3 9 9 5 105 29
nmax 1.2M 10M 0.3M 0.3M 1.5M 3.3M 1.2M 1.4M 681k 3M 49M
mmax 0.2M 2.3M 52k 52k 0.2M 13.6M 4.2M 8.3M 2.2M 245k 2M
Dual objective (lower bound) ↑
Gurobi [23] −4.382e6 −1.545e8 −4.293e3 −3.778e3 −4.849e4 3.085e8 1.9757e4 1.9729e4 3.1311e4 2.913e6 4.512e4
BDD-CPU [40] −4.387e6 −1.549e8 −4.293e3 −3.778e3 −4.878e4 3.085e8 1.9643e4 1.9631e4 3.1248e4 3.675e6 8.172e6
Specialized −4.385e6 −1.551e8 −4.293e3 −3.778e3 −4.847e4 3.085e8 2.0012e4 1.9991e4 3.1317e4 - -
FastDOG −4.387e6 −1.549e8 −4.293e3 −3.778e3 −4.893e4 3.085e8 2.0011e4 1.9990e4 3.1317e4 3.747e6 8.924e6
Primal objective (upper bound) ↓
Gurobi [23] −4.382e6 −1.524e8 −4.293e3 −3.778e3 −4.842e4 3.085e8 2.8464e4 2.7829e4 1.4981e5 5.186e7 1.431e8
BDD-CPU [40] −4.337e6 −1.515e8 −4.293e3 −3.778e3 −4.783e4 3.086e8 2.1781e4 2.2338e4 3.1525e4 5.239e7 1.452e8
Specialized −4.361e6 −1.531e8 −4.293e3 −3.778e3 −4.845e4 3.085e8 2.0012e4 1.9991e4 3.1317e4 - -
FastDOG −4.376e6 −1.541e8 −4.293e3 −3.778e3 −4.831e4 3.085e8 2.0016e4 1.9995e4 3.1322e4 4.330e7 1.376e8
Runtimes [s] ↓
Gurobi [23] 1 1584 4 7 1048 132 980 1337 1506 3948 6742
BDD-CPU [40] 14 216 6 12 528 70 107 218 232 357 5952
Specialized 1.5 90 3 3 214 155 9 30 3 - -
FastDOG 13 110 0.2 0.4 54 14 9 13 39 137 6928

Table 2. Results comparison on all datasets where the values are averaged within a dataset. For each dataset, the results on corresponding
specialized solvers are computed using [24, 34, 55]. Numbers in bold highlight the best performance. nmax , mmax : Maximum number of
variables, constraints in the category.

[Figure 5: three panels of primal and dual objective vs. time for Gurobi, BDD-CPU, Specialized and FastDOG: (a) Cell tracking: Large, (b) Graph Matching: Worms, (c) MRF: Color-seg-n8.]

Figure 5. Convergence plots averaged over all instances of a dataset. Lower curves depict increasing lower bounds while markers denote objectives of rounded primal solutions. The x-axis is plotted logarithmically.

7. Conclusion

We have proposed a massively parallelizable generic algorithm that can solve a wide variety of ILPs on GPU. Our results indicate that the performance of specialized efficient CPU solvers can be matched or even surpassed by a completely generic GPU solver. Our implementation is a first prototype and we conjecture that more speedups can be gained by elaborate implementation techniques, e.g. compression of the BDD representation, better memory layout for better memory coalescing, multi-GPU support etc. We argue that future improvements in optimization algorithms for structured prediction can be made by developing GPU friendly problem specific solvers and with improvements in our or other generic GPU solvers that can benefit many problem classes simultaneously. Another future avenue is optimization of ILPs from other domains, e.g. on the MIPLib benchmark [18]. These problems include constraints that are harder to represent as BDDs, and additional encoding techniques are needed [2, 16].

8. Acknowledgments

We would like to thank all reviewers, especially Reviewer 1 for valuable feedback, and Jan-Hendrik Lange for insightful discussions.

References

[1] Ahmed Abbas and Paul Swoboda. RAMA: A rapid multicut algorithm on GPU. arXiv preprint arXiv:2109.01838, 2021.
[2] Ignasi Abío, Robert Nieuwenhuis, Albert Oliveras, Enric Rodríguez-Carbonell, and Valentin Mayer-Eichberger. A new look at BDDs for pseudo-Boolean constraints. Journal of Artificial Intelligence Research, 45:443–480, 2012.
[3] Henrik Reif Andersen, Tarik Hadzic, John N. Hooker, and Peter Tiedemann. A constraint store based on multivalued decision diagrams. In International Conference on Principles and Practice of Constraint Programming, pages 118–132. Springer, 2007.
[4] Chetan Arora and Amir Globerson. Higher order matching for consistent multiple target tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 177–184, 2013.
[5] David Bergman and Andre A. Cire. Decomposition based on decision diagrams. In Claude-Guy Quimper, editor, Integration of AI and OR Techniques in Constraint Programming, pages 45–54, Cham, 2016. Springer International Publishing.
[6] David Bergman and Andre A. Cire. Discrete nonlinear optimization by state-space decompositions. Management Science, 64(10):4700–4720, 2018.
[7] David Bergman, Andre A. Cire, and Willem-Jan van Hoeve. Lagrangian bounds from decision diagrams. Constraints, 20(3):346–361, 2015.
[8] David Bergman, Andre A. Cire, Willem-Jan van Hoeve, and John Hooker. Decision Diagrams for Optimization, volume 1. Springer, 2016.
[9] David Bergman, Andre A. Cire, Willem-Jan van Hoeve, and John N. Hooker. Discrete optimization with decision diagrams. INFORMS Journal on Computing, 28(1):47–66, 2016.
[10] Timo Berthold. Primal heuristics for mixed integer programs. 2006.
[11] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
[12] Randal E. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, 100(8):677–691, 1986.
[13] Rainer E. Burkard, Stefan E. Karisch, and Franz Rendl. QAPLIB – a quadratic assignment problem library. Journal of Global Optimization, 10(4):391–403, 1997.
[14] Margarita P. Castro, Andre A. Cire, and J. Christopher Beck. An MDD-based Lagrangian approach to the multicommodity pickup-and-delivery TSP. INFORMS Journal on Computing, 32(2):263–278, 2020.
[15] Cplex, IBM ILOG. CPLEX optimization studio 12.10, 2019.
[16] M. Fujita, Y. Lu, E. Clarke, and J. Jain. Efficient variable ordering using abdd based sampling. In Design Automation Conference, pages 687–692, Los Alamitos, CA, USA, June 2000. IEEE Computer Society.
[17] Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact combinatorial optimization with graph convolutional neural networks. arXiv preprint arXiv:1906.01629, 2019.
[18] Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp M. Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, Marco Lübbecke, Hans D. Mittelmann, Derya Ozyurt, Ted K. Ralphs, Domenico Salvagnin, and Yuji Shinano. MIPLIB 2017: Data-driven compilation of the 6th mixed-integer programming library. Mathematical Programming Computation, 2021.
[19] Amir Globerson and Tommi S. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In Advances in Neural Information Processing Systems, pages 553–560, 2008.
[20] Jacek Gondzio and Robert Sarkissian. Parallel interior-point solver for structured linear programs. Mathematical Programming, 96(3):561–584, 2003.
[21] Jaime E. González, Andre A. Cire, Andrea Lodi, and Louis-Martin Rousseau. BDD-based optimization for the quadratic stable set problem. Discrete Optimization, page 100610, 2020.
[22] Jaime E. González, Andre A. Cire, Andrea Lodi, and Louis-Martin Rousseau. Integrated integer programming and decision diagram search tree with an application to the maximum independent set problem. Constraints, pages 1–24, 2020.
[23] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2021.
[24] Stefan Haller, Mangal Prakash, Lisa Hutschenreiter, Tobias Pietzsch, Carsten Rother, Florian Jug, Paul Swoboda, and Bogdan Savchynskyy. A primal-dual solver for large-scale tracking-by-assignment. In AISTATS, 2020.
[25] Jared Hoberock and Nathan Bell. Thrust: A parallel template library, 2010. Version 1.7.0.
[26] John N. Hooker. Improved job sequencing bounds from decision diagrams. In Thomas Schiex and Simon de Givry, editors, Principles and Practice of Constraint Programming, pages 268–283, Cham, 2019. Springer International Publishing.
[27] Andrea Hornakova, Timo Kaiser, Paul Swoboda, Michal Rolinek, Bodo Rosenhahn, and Roberto Henschel. Making higher order MOT scalable: An efficient approximate solver for lifted disjoint paths. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6330–6340, 2021.
[28] Qi Huangfu and J. A. J. Hall. Parallelizing the dual revised simplex method. Mathematical Programming Computation, 10(1):119–142, 2018.
[29] Lisa Hutschenreiter, Stefan Haller, Lorenz Feineis, Carsten Rother, Dagmar Kainmüller, and Bogdan Savchynskyy. Fusion moves for graph matching. 2021.
[30] Jeremy Jancsary and Gerald Matz. Convergent decomposition solvers for tree-reweighted free energies. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 388–398, 2011.
[31] Jason K. Johnson, Dmitry M. Malioutov, and Alan S. Willsky. Lagrangian relaxation for MAP estimation in graphical models. arXiv preprint arXiv:0710.0013, 2007.
[32] Dagmar Kainmueller, Florian Jug, Carsten Rother, and Gene Myers. Active graph matching for automatic joint segmentation and annotation of C. elegans. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 81–88. Springer, 2014.
[33] Jörg H. Kappes, Björn Andres, Fred A. Hamprecht, Christoph Schnörr, Sebastian Nowozin, Dhruv Batra, Sungwoong Kim, Bernhard X. Kausler, Thorben Kröger, Jan Lellmann, Nikos Komodakis, Bogdan Savchynskyy, and Carsten Rother. A comparative study of modern inference techniques for structured discrete energy minimization problems. International Journal of Computer Vision, 115(2):155–184, 2015.
[34] Jörg Hendrik Kappes, Markus Speth, Gerhard Reinelt, and Christoph Schnörr. Towards efficient and exact MAP-inference for large scale discrete computer vision problems via combinatorial optimization. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1752–1758, 2013.
[35] Donald E. Knuth. The Art of Computer Programming, volume 4A: Combinatorial Algorithms, Part 1. Pearson Education India, 2011.
[36] Vladimir Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006.
[37] Vladimir Kolmogorov. A new look at reweighted message passing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(5):919–930, 2014.
[38] Nikos Komodakis and Georgios Tziritas. Approximate labeling via graph cuts based on linear programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(8):1436–1453, 2007.
[39] Jan-Hendrik Lange, Andreas Karrenbauer, and Bjoern Andres. Partial optimality and fast lower bounds for weighted correlation clustering. In International Conference on Machine Learning, pages 2892–2901. PMLR, 2018.
[40] Jan-Hendrik Lange and Paul Swoboda. Efficient message passing for 0–1 ILPs with binary decision diagrams. In International Conference on Machine Learning, pages 6000–6010. PMLR, 2021.
[41] Leonardo Lozano, David Bergman, and J. Cole Smith. On the consistent path problem. Optimization Online e-prints, 2018.
[42] Talya Meltzer, Amir Globerson, and Yair Weiss. Convergent message passing algorithms: a unifying view. arXiv preprint arXiv:1205.2625, 2012.
[43] Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O'Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, et al. Solving mixed integer programs using neural networks. arXiv preprint arXiv:2012.13349, 2020.
[44] NVIDIA, Péter Vingelmann, and Frank H.P. Fitzek. CUDA, release 11.2, 2021.
[45] Kalyan Perumalla and Maksudul Alam. Design Considerations for GPU-Based Mixed Integer Programming on Parallel Computing Platforms. Association for Computing Machinery, New York, NY, USA, 2021.
[46] Ted Ralphs, Yuji Shinano, Timo Berthold, and Thorsten Koch. Parallel solvers for mixed integer linear optimization. In Handbook of Parallel Constraint Reasoning, pages 283–336. Springer, 2018.
[47] B. Savchynskyy, S. Schmidt, Jörg H. Kappes, and Christoph Schnörr. Efficient MRF energy minimization via adaptive diminishing smoothing. In UAI, pages 746–755, 2012.
[48] Alexander Shekhovtsov, Christian Reinbacher, Gottfried Graber, and Thomas Pock. Solving dense image matching in real-time using discrete-continuous optimization. In Proceedings of the 21st Computer Vision Winter Workshop (CVWW), page 13, 2016.
[49] Edmund Smith, Jacek Gondzio, and Julian Hall. GPU acceleration of the matrix-free interior point method. In Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, and Jerzy Waśniewski, editors, Parallel Processing and Applied Mathematics, pages 681–689, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
[50] Boro Sofranac, Ambros Gleixner, and Sebastian Pokutta. Accelerating domain propagation: an efficient GPU-parallel algorithm over sparse matrices. In 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), pages 1–11. IEEE, 2020.
[51] Nicolas Sonnerat, Pengming Wang, Ira Ktena, Sergey Bartunov, and Vinod Nair. Learning a large neighborhood search algorithm for mixed integer programs. arXiv preprint arXiv:2107.10201, 2021.
[52] Paul Swoboda and Bjoern Andres. A message passing algorithm for the minimum cost multicut problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1617–1626, 2017.
[53] Paul Swoboda, Andrea Hornakova, Paul Roetzer, and Ahmed Abbas. Structured prediction problem archive. arXiv preprint arXiv:2202.03574, 2022.
[54] Paul Swoboda, Ashkan Mokarian, Christian Theobalt, Florian Bernard, et al. A convex relaxation for multi-graph matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11156–11165, 2019.
[55] Paul Swoboda, Carsten Rother, Hassan Abu Alhaija, Dagmar Kainmuller, and Bogdan Savchynskyy. A study of Lagrangean decompositions and dual ascent solvers for graph matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1607–1616, 2017.
[56] Christian Tjandraatmadja and Willem-Jan van Hoeve. Incorporating bounds from decision diagrams into integer programming. Mathematical Programming Computation, pages 1–32, 2020.
[57] Lorenzo Torresani, Vladimir Kolmogorov, and Carsten Rother. Feature correspondence via graph matching: Models and global optimization. In European Conference on Computer Vision, pages 596–609. Springer, 2008.
[58] Siddharth Tourani, Alexander Shekhovtsov, Carsten Rother, and Bogdan Savchynskyy. MPLP++: Fast, parallel dual block-coordinate ascent for dense graphical models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 251–267, 2018.
[59] Siddharth Tourani, Alexander Shekhovtsov, Carsten Rother, and Bogdan Savchynskyy. Taxonomy of dual block-coordinate ascent methods for discrete energy minimization. In AISTATS, 2020.
[60] Vibhav Vineet and P. J. Narayanan. CUDA cuts: Fast graph cuts on the GPU. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8. IEEE, 2008.
[61] Huayan Wang and Daphne Koller. Subproblem-tree calibration: A unified approach to max-product message passing. In ICML (2), pages 190–198, 2013.
[62] Tomas Werner. A linear programming approach to max-sum problem: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1165–1179, 2007.
[63] Tomáš Werner, Daniel Průša, and Tomáš Dlask. Relative interior rule in block-coordinate descent. In Proceedings of the IEEE International Conference on Computer Vision, 2020. To appear.
[64] Jiadong Wu, Zhengyu He, and Bo Hong. Chapter 5: Efficient CUDA algorithms for the maximum network flow problem. In Wen-mei W. Hwu, editor, GPU Computing Gems Jade Edition, Applications of GPU Computing Series, pages 55–66. Morgan Kaufmann, Boston, 2012.
[65] Zhiwei Xu, Thalaiyasingam Ajanthan, and Richard Hartley. Fast and differentiable message passing on pairwise Markov random fields. In Proceedings of the Asian Conference on Computer Vision, 2020.
[66] Zhen Zhang, Qinfeng Shi, Julian McAuley, Wei Wei, Yanning Zhang, and Anton Van Den Hengel. Pairwise matching through max-weight bipartite belief propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1202–1210, 2016.
