Flow Priors for Linear Inverse Problems via Iterative Corrupted Trajectory Matching
Yasi Zhang1∗ Peiyu Yu1 Yaxuan Zhu1 Yingshan Chang2 Feng Gao3†
1 University of California, Los Angeles   2 Carnegie Mellon University   3 Amazon   4 California Institute of Technology
Abstract
Generative models based on flow matching have attracted significant attention for
their simplicity and superior performance in high-resolution image synthesis. By
leveraging the instantaneous change-of-variables formula, one can directly compute
image likelihoods from a learned flow, making them enticing candidates as priors
for downstream tasks such as inverse problems. In particular, a natural approach
would be to incorporate such image probabilities in a maximum-a-posteriori (MAP)
estimation problem. A major obstacle, however, lies in the slow computation of
the log-likelihood, as it requires backpropagating through an ODE solver, which
can be prohibitively slow for high-dimensional problems. In this work, we propose
an iterative algorithm to approximate the MAP estimator efficiently to solve a
variety of linear inverse problems. Our algorithm is mathematically justified by
the observation that the MAP objective can be approximated by a sum of N “local
MAP” objectives, where N is the number of function evaluations. By leveraging
Tweedie’s formula, we show that we can perform gradient steps to sequentially
optimize these objectives. We validate our approach for various linear inverse
problems, such as super-resolution, deblurring, inpainting, and compressed sensing,
and demonstrate that we can outperform other methods based on flow matching.
1 Introduction
Linear inverse problems are ubiquitous across many imaging domains, pervading areas such as
astronomy [1, 2], medical imaging [3, 4], and seismology [5, 6]. In these problems the goal is to
reconstruct an unknown image x∗ ∈ Rn from observed measurements y ∈ Rm of the form:
y = A(x∗ ) + noise, (1)
where $A : \mathbb{R}^n \to \mathbb{R}^m$ with $m \le n$ is a linear operator that degrades the clean image $x^*$, and
the additive noise is drawn from a known distribution. In this work, we assume the noise follows
$\mathcal{N}(0, \sigma_y^2 I)$. Due to the under-constrained nature of such problems, they are typically ill-posed, i.e.,
there are infinitely many undesirable images that fit the observed measurements. Hence, one
requires further structural information about the underlying images, which constitutes our prior.
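To make the measurement model in Eq. (1) concrete, the following sketch simulates y = A(x*) + noise for a random-mask inpainting operator; the 70% mask ratio and σy = 0.01 match the experimental setup in Section 4, while the tensor shapes and helper name are illustrative.

```python
import torch

def inpainting_measurement(x_clean, mask_ratio=0.7, sigma_y=0.01):
    """Simulate y = A(x*) + noise, where A drops a random subset of pixels.

    x_clean: clean image tensor of shape (C, H, W) with values in [0, 1].
    mask_ratio: fraction of pixels removed (70% random mask, as in Section 4).
    sigma_y: standard deviation of the Gaussian measurement noise.
    """
    # A is linear: elementwise multiplication with a fixed binary mask.
    mask = (torch.rand(x_clean.shape[-2:]) > mask_ratio).to(x_clean)
    y = mask * x_clean + sigma_y * torch.randn_like(x_clean)
    return y, mask
```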
With the advent of large generative models [7–12], there has been a surge of interest in exploiting
generative models as priors to solve inverse problems. Given a pretrained generator to sample from
∗ Correspondence to: [email protected]
† This work is not related to the author’s position at Amazon.
MAP estimation provides a single, most probable point estimate of the posterior distribution, making
it simple and interpretable. This deterministic approach ensures consistency and reproducibility,
which are essential in applications requiring reliable outcomes, particularly in compressed sensing
tasks such as Computed Tomography (CT) [36] and Magnetic Resonance Imaging (MRI) [37]. While
posterior sampling methods can offer diverse reconstructions to quantify uncertainty, they can be
prohibitively slow in high dimensions [38]. Hence, in this work, we propose to integrate flow priors
to solve linear inverse problems by MAP estimation.
A significant challenge in employing flow priors for MAP estimation lies in the slow computation
of the image probabilities, as it requires backpropagating through an ODE solver [39–41]. In this
work, we show how one can address this challenge via Iterative Corrupted Trajectory Matching
(ICTM), a novel algorithm to approximate the MAP solution in a computationally efficient manner. In
particular, we show how one can approximately find an MAP solution by sequentially optimizing a
novel, simpler auxiliary objective that approximates the true MAP objective in the limit of infinite
function evaluations. For finite evaluations, we demonstrate that this approximation is sufficient
to optimize by showcasing strong empirical performance for flow priors across a variety of linear
inverse problems. We summarize our contributions as follows:
2. Theoretically, we demonstrate that the auxiliary objective converges to the true MAP
objective as the number of function evaluations (NFEs) goes to infinity. We validate the correctness of our algorithm in
finding the MAP solution on a denoising problem.
3. We demonstrate the utility of ICTM on a wide variety of linear inverse problems on both
natural and scientific image datasets, with problems including denoising, inpainting, super-
resolution, deblurring, and compressed sensing. Extensive results show that ICTM is
computationally efficient and obtains high-quality reconstructions, outperforming other
reconstruction algorithms based on flow priors.
2 Background
Notation We follow the convention for flow-based models, where Gaussian noise is sampled at
timestep 0, and the clean image corresponds to timestep 1. Note that this is the opposite of diffusion
models. For t ∈ [0, 1], we denote xt (x0 ) as the point at time t whose initial condition is x0 . In this
work, we use x and x1 interchangeably, i.e., x1 (x0 ) = x(x0 ).
We consider generative models that map samples x0 from a noise distribution p(x0 ), e.g., Gaussian,
to samples x1 of a data distribution p(x1 ) using an ordinary differential equation (ODE):
dxt = vθ (xt , t) dt, (3)
where the velocity field v is a θ-parameterized neural network, e.g., using a UNet [27, 26, 42] or
Transformer [29, 43] architecture. Generative models based on flow matching [27, 26] can be seen as
a simulation-free approach to learning the velocity field. This approach involves pre-determining
paths that the ODE should follow by specifying the interpolation curve xt , rather than relying on the
MLE algorithm to implicitly discover them [31]. To construct such a path, which is not necessarily
Markovian, one can define a differentiable nonlinear interpolation between x0 and x1 :
xt = αt x1 + βt x0 , x0 ∼ N (0, I), (4)
where both αt and βt are differentiable functions with respect to t satisfying α0 = 0, β0 = 1, and
α1 = 1, β1 = 0. This ensures that xt is transported from a standard Gaussian distribution to the
natural image manifold from time 0 to time 1. In contrast, the diffusion process [9, 44, 45] induces a
non-differentiable trajectory due to the diffusion term in the SDE formulation.
The idea behind flow matching is to utilize the power of deep neural networks to efficiently predict
the velocity field at each timestep. To achieve this, we can train the neural network by minimizing an
L2 loss between the sampled velocity and the one predicted by the neural network:
L(θ) = Et,p(x1 ),p(x0 ) ∥vθ (xt , t) − (α̇t x1 + β̇t x0 )∥2 . (5)
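As a concrete illustration, the sketch below computes one stochastic estimate of the loss in Eq. (5) for the straight interpolation α_t = t, β_t = 1 − t (so that α̇_t x_1 + β̇_t x_0 = x_1 − x_0); the `velocity_net(x_t, t)` interface is an assumption, not the interface of any particular released model.

```python
import torch

def flow_matching_loss(velocity_net, x1_batch):
    """One Monte Carlo estimate of Eq. (5) for alpha_t = t, beta_t = 1 - t."""
    x0 = torch.randn_like(x1_batch)                            # x0 ~ N(0, I)
    t = torch.rand(x1_batch.shape[0], device=x1_batch.device)  # t ~ Uniform[0, 1]
    t_ = t.view(-1, *([1] * (x1_batch.dim() - 1)))             # broadcast over image dims
    xt = t_ * x1_batch + (1.0 - t_) * x0                       # interpolant of Eq. (4)
    target = x1_batch - x0                                     # dot(alpha_t) x1 + dot(beta_t) x0
    pred = velocity_net(xt, t)                                 # v_theta(x_t, t)
    return ((pred - target) ** 2).flatten(1).sum(dim=1).mean()
```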
We denote the optimal (not necessarily unique) solution to $\arg\min_\theta \mathcal{L}(\theta)$ as $\hat\theta$. The optimal velocity
field $v_{\hat\theta}$ can be derived in closed form and is the expected velocity at state $x_t$:
$$v_{\hat\theta}(x_t, t) = \mathbb{E}\big[\dot\alpha_t x_1 + \dot\beta_t x_0 \mid x_t\big]. \qquad (6)$$
Denote the probability of xt in Eq. (3) as p(xt ) dependent on time. Assuming that vθ is uniformly
Lipschitz continuous in xt and continuous in t, the change in log probability also follows a differential
equation [31, 32]:
$$\frac{\partial \log p(x_t)}{\partial t} = -\mathrm{tr}\left( \frac{\partial}{\partial x} v_\theta(x_t, t) \right). \qquad (7)$$
One can additionally obtain the likelihood of the trajectory via integrating Eq. (7) across time:
$$\log p(x_t) = \log p(x_\tau) - \int_\tau^t \mathrm{tr}\left( \frac{\partial}{\partial x} v_\theta(x_s, s) \right) ds, \qquad 0 \le \tau < t \le 1. \qquad (8)$$
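For intuition, the brute-force route suggested by Eq. (8) can be written down directly: integrate the divergence of the velocity field along the flow with an Euler discretization. The sketch below uses an exact autograd Jacobian trace and is therefore only feasible for very small n; a Hutchinson-Skilling estimator (Section 3) is the usual substitute in high dimensions. The `velocity_net` interface is assumed.

```python
import math
import torch

def log_likelihood_via_flow(velocity_net, x0, n_steps=100):
    """Estimate log p(x_1) by integrating Eq. (7)-(8) along the flow with Euler steps.

    Assumes x0 ~ N(0, I), so log p(x_0) is the standard Gaussian log-density.
    Uses an exact Jacobian trace (O(n^2) memory): only practical for small n.
    """
    n = x0.numel()
    log_p = -0.5 * (x0 ** 2).sum() - 0.5 * n * math.log(2 * math.pi)
    xt, dt = x0.clone(), 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor(i * dt, device=x0.device)
        jac = torch.autograd.functional.jacobian(lambda x: velocity_net(x, t), xt)
        log_p = log_p - torch.trace(jac.reshape(n, n)) * dt  # accumulate -tr(dv/dx) dt
        xt = xt + velocity_net(xt, t) * dt                   # Euler step of Eq. (3)
    return xt, log_p                                         # sample x_1 and its log-density
```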
3 Method
In this work, we aim to solve the MAP estimation problem in Eq. (2) where p(x) is given by a
pretrained flow prior. We first discuss in Section 3.1 how the MAP problem could, in principle, be
solved via a latent-space optimization problem. As we will see, this problem is challenging to solve
computationally due to the need to backpropagate through an ODE solver. To overcome this, we
show in Section 3.2 that the ideal MAP problem can be approximated by a weighted sum of “local
MAP” optimization problems, which operates by partitioning the flow’s trajectory to a reconstructed
solution. We then introduce our ICTM algorithm to sequentially optimize this auxiliary objective.
Finally, in Section 3.3, we experimentally validate that our algorithm finds a solution that is faithful
to the MAP estimate in a simplified setting where the globally optimal MAP solution is known.
Given a pretrained flow prior, one can compute the log-likelihood of x generated from an initial noise
sample x0 via Eq. (8). Hence, to find the MAP estimate, one could equivalently optimize the initial
point of the trajectory x0 and return x1 (x0 ) where x0 is found by solving
$$\min_{x_0 \in \mathbb{R}^n} \ \underbrace{\frac{1}{2\sigma_y^2} \big\| y - A(x_1(x_0)) \big\|^2}_{\text{data likelihood}} \ + \ \underbrace{\frac{1}{2}\|x_0\|^2 + \int_0^1 \mathrm{tr}\left( \frac{\partial}{\partial x} v_\theta(x_t, t) \right) dt}_{\text{prior}}, \qquad (9)$$
where xt := xt (x0 ) denotes the intermediate state xt generated from x0 . Intuitively, this loss
encourages finding an initial point x0 such that the reconstruction x1 := x1 (x0 ) fits the observed
measurements, but is also likely to be generated by the flow.
In practice, $x_1$ and the prior term can be approximated by an ODE solver. The trajectory
$x_t = x_0 + \int_0^t v_\theta(x_s, s)\, ds$ can be approximated by an ODE sampler, i.e., ODESolve($x_0$, 0, $t$, $v_\theta$), where
$x_0$ is the initial point, and the second and third arguments represent the starting and ending times,
respectively. For example, with an Euler sampler, we iterate $x_{t+\Delta t} = x_t + v_\theta(x_t, t)\Delta t$,
where $\Delta t = 1/N$ and $N$ is the pre-determined number of function evaluations (NFEs). After acquiring the optimal $\hat{x}_0$ by optimizing
Eq. (9), we obtain the MAP solution $x_1$ via ODESolve($\hat{x}_0$, 0, 1, $v_\theta$).
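A minimal sketch of the Euler-discretized ODESolve described above (the `velocity_net(x, t)` interface is assumed rather than taken from the released code):

```python
import torch

@torch.no_grad()
def ode_solve_euler(velocity_net, x0, t_start=0.0, t_end=1.0, n_steps=100):
    """Euler approximation of ODESolve(x0, t_start, t_end, v_theta):
    iterate x_{t + dt} = x_t + v_theta(x_t, t) * dt from t_start to t_end."""
    dt = (t_end - t_start) / n_steps
    xt = x0.clone()
    for i in range(n_steps):
        t = torch.tensor(t_start + i * dt, device=x0.device)
        xt = xt + velocity_net(xt, t) * dt
    return xt
```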
The global flow-based MAP objective Eq. (9) is tractable for low-dimensional problems. The
challenge for high-dimensional problems, however, is that optimizing Eq. (9) is simulation-based,
and thus each update iteration requires full forward and backward propagation through an ODE
solver, resulting in prohibitive memory and runtime costs that make the objective hard to optimize
[31, 39–41].
As a way to address this, we prove a result in Theorem 1 that shows that the MAP objective can be
approximated by a weighted sum of N local posterior objectives. These objectives are “local” in
the sense that they mainly depend on likelihoods and probabilities of intermediate trajectory points $x_t$
and $x_t + v_\theta(x_t, t)\Delta t$ for $t = 0, \Delta t, \dots, (N-1)\Delta t$, where $\Delta t := 1/N$. Given an initial noise input $x_0$,
each local posterior objective depends on a non-Markovian auxiliary path $y_t = \alpha_t y + \beta_t A(x_0)$
obtained by connecting the points between $y$ and $A x_0$. We prove this result for straight paths $\alpha_t = t$ and
βt = 1 − t for simplicity, but other interpolation paths can be used. The proof is in Section B.2.
Theorem 1. For $N \ge 1$, set $\gamma_i := (\tfrac{1}{2})^{N-i+1}$ and $\Delta t = 1/N$. Suppose $y = A(x^*) + \epsilon$ where
$x^* = x_1(x_0)$ with $x_0$ being the solution to Eq. (9), $\epsilon \sim \mathcal{N}(0, \sigma_y^2 I)$, and $x_t$ exactly follows the straight
path $x_t = t x + (1-t) x_0$ for any timestep $t \in [0, 1]$. Suppose the velocity field $v_\theta : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n$
satisfies $\sup_{z \in \mathbb{R}^n, s \in [0,1]} \big|\mathrm{tr}\, \frac{\partial}{\partial x} v_\theta(z, s)\big| \le C_1$ for some universal constant $C_1$. Then, there exists a
constant $c(N)$ that does not depend on $x_0$ such that
$$\lim_{N \to \infty} \Big( \log p(x(x_0) \mid y) - \sum_{i=1}^N \gamma_i \hat{J}_i + c(N) \Big) = 0, \qquad (10)$$
where $\hat{J}_i = \log p(x_{(i-1)\Delta t}) - \mathrm{tr}\Big( \frac{\partial v_\theta(x_{(i-1)\Delta t},\, (i-1)\Delta t)}{\partial x} \Big) \Delta t + \log p(y_{i\Delta t} \mid x_{i\Delta t})$.
This result shows that the true MAP objective evaluated at the optimal solution can be approximated
by a weighted sum of objectives that depend locally at a time t for the trajectory {xt : t ∈ [0, 1]}. The
intuition regarding $\hat{J}_i$ arises from the fact that $\hat{J}_i \approx J_i$, where $J_i$ is the local posterior objective
$$J_i = \log p(y_{i\Delta t} \mid x_{i\Delta t}(x_{(i-1)\Delta t})) + \log p(x_{i\Delta t}). \qquad (11)$$
Optimizing each of these local posterior distributions in a sequential fashion captures the fact that we
would like each intermediate point in our trajectory xi∆t to be likely and fit to our measurements,
ideally resulting in a final reconstruction x1 that satisfies this as well. The benefit of Jˆi , as we will
show in the sequel, is that it is efficient to optimize.
Discussion of assumptions: We assume that the trajectory {xt }t exactly follows the predefined
interpolation path {αt x + βt x0 }t . In Section C of the appendix, we analyze this assumption and
show that we can bound the deviation from the predefined interpolation path to the learned path
via a path compliance measure. Moreover, we impose a regularity assumption on the velocity field
vθ , effectively requiring a uniform bound on the spectrum of the Jacobian of vθ . This can be easily
satisfied with neural networks using Lipschitz continuous and differentiable activation functions.
As we see in Theorem 1, one can approximate the true MAP objective via a sum of local objectives
of the form
$$\hat{J}_i := \underbrace{\log p(y_{i\Delta t} \mid x_{i\Delta t})}_{\text{local data likelihood}} + \underbrace{\log p(x_{(i-1)\Delta t}) - \mathrm{tr}\Big( \frac{\partial v_\theta(x_{(i-1)\Delta t},\, (i-1)\Delta t)}{\partial x} \Big) \Delta t}_{\text{local prior}}. \qquad (12)$$
At first glance, Jˆi still appears challenging to optimize, but there are additional insights we can
exploit for computation. We discuss each term in Jˆi below.
Local data likelihood: The intuition behind ICTM is that we aim to match a corrupted trajectory
{ut }t with an auxiliary path {yt }t specified by an interpolation between our measurements y and
A(x0 ) for each timestep t, defined by yt := αt y + βt A(x0 ). The corrupted trajectory ut :=
A(xt ) follows the corrupted flow ODE dut = A(vθ (xt , t))dt. To optimize the above “local MAP”
objectives, we must understand the distribution of p(yt |xt ). Generally speaking, this distribution
is intractable. However, by assuming exact compliance of the trajectory generated by flow to the
predefined interpolation path (as done in Theorem 1), we can show that $y_t \mid x_t \sim \mathcal{N}(u_t, \alpha_t^2 \sigma_y^2 I)$. This
is proven in Lemma 3 in the appendix. While exact compliance of the trajectory may not hold for
learned flow matching models, we show empirically that making this assumption leads to strong
performance in practice. We further analyze this notion of compliance in Section C of the appendix.
Local prior: The approximation in Eq. (12) addresses one of the main concerns of MAP estimation in that the
intensive integral computation is circumvented with a simpler Riemann sum. This approximation
holds for small time increments $\Delta t$: $\int_t^{t+\Delta t} \mathrm{tr}\big( \frac{\partial}{\partial x} v_\theta(x_s, s) \big)\, ds \approx \mathrm{tr}\big( \frac{\partial}{\partial x} v_\theta(x_t, t) \big)\, \Delta t$. Note that
one can additionally improve the efficiency of this term by employing a Hutchinson-Skilling estimate
[46, 47] for the trace of the Jacobian matrix. However, at first glance, it appears we have simply
shifted the problem to the computation of the prior at timestep (i − 1)∆t. Fortunately, it is possible
to derive a formula for the gradient of log p(xt ) for all timesteps t ∈ [0, 1] using Tweedie’s formula
[48]. This allows us to optimize each objective Jˆi using gradient-based optimizers. The following
result gives a precise characterization of ∇xt log p(xt ), proven in Section B.1.
Proposition 1. Let $\lambda_t = \alpha_t / \beta_t$ denote the signal-to-noise ratio. The relationship between the score
function $\nabla_{x_t} \log p(x_t)$ and the velocity field $v_\theta(x_t, t)$ is given by:
$$\nabla_{x_t} \log p(x_t) = \frac{1}{\beta_t^2} \left[ \Big( \frac{d \log \lambda_t}{dt} \Big)^{-1} \Big( v_\theta(x_t, t) - \frac{d \log \beta_t}{dt} x_t \Big) - x_t \right]. \qquad (13)$$
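For the straight path α_t = t, β_t = 1 − t used in our experiments, d log λ_t/dt = 1/(t(1 − t)) and d log β_t/dt = −1/(1 − t), so Eq. (13) reduces to ∇_{x_t} log p(x_t) = (t v_θ(x_t, t) − x_t)/(1 − t). The sketch below implements this simplified score together with the Hutchinson-Skilling trace estimate mentioned above; the function names and `velocity_net` interface are ours.

```python
import torch

def score_from_velocity(velocity_net, xt, t):
    """Score via Eq. (13) for alpha_t = t, beta_t = 1 - t:
    grad log p(x_t) = (t * v_theta(x_t, t) - x_t) / (1 - t)."""
    v = velocity_net(xt, torch.tensor(t, device=xt.device))
    return (t * v - xt) / (1.0 - t)

def hutchinson_trace(velocity_net, xt, t, n_probes=1):
    """Hutchinson-Skilling estimate of tr(dv_theta/dx) at (x_t, t),
    as used for the local prior term in Eq. (12)."""
    xt = xt.requires_grad_(True)
    v = velocity_net(xt, torch.tensor(t, device=xt.device))
    est = 0.0
    for _ in range(n_probes):
        eps = torch.randn_like(xt)                     # random probe vector
        (vjp,) = torch.autograd.grad(v, xt, grad_outputs=eps,
                                     retain_graph=True, create_graph=True)
        est = est + (vjp * eps).sum()                  # eps^T (dv/dx)^T eps, unbiased for the trace
    return est / n_probes
```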
In summary, we have derived an efficient approximation to the MAP objective. For our algorithm, we
iteratively optimize each term $\hat{J}_i$ sequentially for $t = 0, \Delta t, \dots, (N-1)\Delta t$, fitting our current iterate
$x_t$ to induce an increment $x_{t+\Delta t}$ such that $A(x_{t+\Delta t})$ fits our auxiliary corrupted path $y_{t+\Delta t}$ while
being likely under our local prior. We call this approach Iterative Corrupted Trajectory Matching
(ICTM). Our algorithm is summarized in Algo. 1. In lines 7 and 12, instead of weighting the local
data likelihood by its exact coefficient, we introduce λ as a new hyperparameter to tune. We find that a constant λ works
well in practice.
We experimentally validate that the reconstruction found via ICTM is close to the optimal MAP
solution in a simplified denoising problem where the MAP solution can be obtained in closed-form.
Algorithm 1 Iterative Corrupted Trajectory Matching (ICTM) with Euler Sampler
Input: measurement y, matrix A, pretrained flow-based model θ, NFEs N, interpolation coefficients {αt}t and {βt}t, step size η, guidance weight λ, and iteration number K
Output: recovered clean image x1
1: Initialize ϵ ∼ N(0, I), x0 ← ϵ, t ← 0, ∆t ← 1/N
2: Generate an auxiliary path ys = αs y + βs (A x0) for s ∈ (0, 1)
3: while t < 1 do
4:   xt+∆t ← xt + vθ(xt, t)∆t
5:   if t = 0 then
6:     for k = 1, · · · , K do
7:       xt ← xt − η ∇xt [ λ‖A(xt+∆t(xt)) − yt+∆t‖² + ½‖xt‖² + tr(∂vθ(xt, t)/∂x) ∆t ]
8:     end for
9:   else
10:    for k = 1, · · · , K do
11:      # use Eq. (13) to obtain the gradient of log p(xt)
12:      xt ← xt − η ∇xt [ λ‖A(xt+∆t(xt)) − yt+∆t‖² − log p(xt) + tr(∂vθ(xt, t)/∂x) ∆t ]
13:    end for
14:  end if
15:  xt+∆t ← xt + vθ(xt, t)∆t
16:  t ← t + ∆t
17: end while
18: return x1
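A condensed PyTorch-style sketch of Algorithm 1 for the straight path α_t = t, β_t = 1 − t (so y_s = s·y + (1 − s)·A x_0 and the score in Eq. (13) reduces to (t v_θ − x_t)/(1 − t)). It inlines a Hutchinson trace estimate, treats `A` as a callable linear operator returning a tensor of the same shape as its input, and uses Adam as in our experiments; it is a sketch under these assumptions, not the released implementation.

```python
import torch

def _trace_estimate(v, x):
    """Hutchinson estimate of tr(dv/dx), given v = velocity_net(x, t) with x requiring grad."""
    eps = torch.randn_like(x)
    (vjp,) = torch.autograd.grad(v, x, grad_outputs=eps, retain_graph=True, create_graph=True)
    return (vjp * eps).sum()

def ictm(velocity_net, A, y, shape, n_steps=100, K=1, lr=1e-2, lam=1e4):
    """Iterative Corrupted Trajectory Matching (Algorithm 1) with an Euler sampler, OT path."""
    x0 = torch.randn(shape)                            # line 1: x_0 <- eps ~ N(0, I)
    Ax0 = A(x0).detach()
    dt, t = 1.0 / n_steps, 0.0
    xt = x0.clone()
    for i in range(n_steps):                           # lines 3-17: while t < 1
        t_next = t + dt
        y_next = t_next * y + (1.0 - t_next) * Ax0     # line 2: auxiliary path y_{t+dt}
        xt = xt.detach().requires_grad_(True)
        opt = torch.optim.Adam([xt], lr=lr)            # Adam optimizer, as in Section 4
        for _ in range(K):                             # lines 6-8 / 10-13: K inner updates
            opt.zero_grad()
            v = velocity_net(xt, torch.tensor(t, device=xt.device))
            x_next = xt + v * dt                       # line 4: tentative Euler step
            data_term = lam * ((A(x_next) - y_next) ** 2).sum()
            trace_term = _trace_estimate(v, xt) * dt
            if i == 0:                                 # line 7: Gaussian prior on x_0
                prior_term = 0.5 * (xt ** 2).sum()
            else:                                      # line 12: -log p(x_t) via Eq. (13)
                score = ((t * v - xt) / (1.0 - t)).detach()
                prior_term = -(score * xt).sum()       # surrogate whose gradient is -score
            (data_term + prior_term + trace_term).backward()
            opt.step()
        with torch.no_grad():                          # line 15: advance the trajectory
            xt = xt + velocity_net(xt, torch.tensor(t, device=xt.device)) * dt
        t = t_next
    return xt                                          # line 18: return x_1
```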
Figure 1: Results of a toy example modeling 1,000 FFHQ faces as a Gaussian distribution. Subfigure (a) shows qualitative results of our method (rows: MAP and Ours); Subfigure (b) presents the histogram of the differences between ours and the true MAP; Subfigure (c) displays the MSE values as the NFEs vary.
Specifically, we fit a Gaussian distribution N (µ, Σ) using 1,000 samples from the FFHQ dataset.
Consider a denoising problem y = x + ϵ where x ∼ N (µ, Σ) and ϵ ∼ N (0, σy2 I). In this case, the
analytical solution to the MAP estimation problem (Eq. (2)) is x∗ = (Σ−1 +σy−2 I)−1 (Σ−1 µ+σy−2 y).
We set σy = 0.1. Then, we train a flow-based model on 10,000 samples from the true Gaussian
distribution and showcase the deviation of our reconstruction found via ICTM to the closed-form
MAP solution x∗ in Fig. 1. We see that ICTM can obtain a faithful estimate of the MAP solution
across many samples.
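For reference, the closed-form MAP estimate used in this sanity check can be computed directly; a minimal sketch (μ and Σ are the fitted Gaussian parameters, flattened to a vector and a matrix):

```python
import torch

def gaussian_map_denoise(y, mu, Sigma, sigma_y=0.1):
    """Closed-form MAP for y = x + eps with x ~ N(mu, Sigma), eps ~ N(0, sigma_y^2 I):
    x* = (Sigma^{-1} + sigma_y^{-2} I)^{-1} (Sigma^{-1} mu + sigma_y^{-2} y)."""
    n = mu.numel()
    Sigma_inv = torch.linalg.inv(Sigma)
    lhs = Sigma_inv + torch.eye(n, dtype=mu.dtype) / sigma_y ** 2
    rhs = Sigma_inv @ mu + y / sigma_y ** 2
    return torch.linalg.solve(lhs, rhs)
```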
4 Experiments
In our experimental setting, we use optimal transport interpolation coefficients, i.e., αt = t and
βt = 1 − t. We test our algorithm on both natural and medical imaging datasets. For natural images,
we utilize the pretrained checkpoint from the official Rectified Flow repository3 and evaluate our
3 https://github.com/gnobitab/RectifiedFlow
Figure 2: Quantitative comparison results in terms of PSNR and SSIM on the CelebA-HQ dataset.
Our algorithm surpasses all other baselines across all tasks.
Figure 3: Qualitative comparison results on the CelebA-HQ dataset (columns: Input, OT-ODE, DPS-ODE, Ours, Groundtruth). The reconstructions generated by our method align more faithfully with the ground truth and exhibit a higher degree of refinement.
approach on the CelebA-HQ dataset [49, 50]. We address four common linear inverse problems:
super-resolution, inpainting with a random mask, Gaussian deblurring, and inpainting with a box mask.
For the medical application, we train a flow-based model from scratch on the Human Connectome
Project (HCP) dataset [51] and test our algorithm specifically for compressed sensing at different
compression rates. Our algorithm focuses on the reconstruction faithfulness of generated images,
so we employ PSNR and SSIM [52] as evaluation metrics.
Baselines We compare our method with three baselines. 1) OT-ODE [33]. To our knowledge,
this is the only baseline that applies flow-based models to inverse problems. They incorporate a
prior gradient correction at each sampling step based on conditional Optimal Transport (OT) paths.
For a fair comparison, we follow their implementation of Algorithm 1, providing detailed ablations
on initialization time t′ in Appendix F.3. 2) DPS-ODE. Inspired by DPS [18], we replace the
velocity field with a conditional one, i.e., v(xt |y) = v(xt ) + ζt ∇xt log p(y|x̂1 (xt )), where ζt is
a hyperparameter to tune. Following the hyperparameter instruction in DPS, we provide detailed
ablations on ζt in Appendix F.3. 3) Ours without local prior. To examine the local prior term’s
effectiveness in our optimization algorithm, we drop the local prior term as defined in Eq. (12) in our
algorithm.
Experimental setup We evaluate our algorithm using 100 images from the CelebA-HQ validation
set with a resolution of 256×256, normalizing all images to the [0, 1] range for quantitative analysis.
All experiments incorporate Gaussian measurement noise with σy = 0.01. We address the following
linear inverse problems: (1) 4× super-resolution using bicubic downsampling, (2) inpainting with a
random mask covering 70% of missing values, (3) Gaussian deblurring with a 61×61 kernel and a
standard deviation of 3.0, and (4) box inpainting with a centered 128×128 mask.
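Hedged sketches of the four measurement operators described above (bicubic 4× downsampling, a 70% random mask, a 61×61 Gaussian blur with standard deviation 3.0, and a centered 128×128 box mask); the exact kernels and interpolation settings of the operators taken from [18] may differ.

```python
import torch
import torch.nn.functional as F

def super_resolution_4x(x):                      # x: (B, C, 256, 256) -> (B, C, 64, 64)
    return F.interpolate(x, scale_factor=0.25, mode="bicubic", align_corners=False)

def random_mask(x, ratio=0.7):                   # drop 70% of pixels
    mask = (torch.rand(x.shape[-2:]) > ratio).to(x)
    return x * mask

def gaussian_blur(x, ksize=61, sigma=3.0):       # depthwise 61x61 Gaussian convolution
    ax = torch.arange(ksize, dtype=x.dtype) - ksize // 2
    g = torch.exp(-ax ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = (g[:, None] * g[None, :]).expand(x.shape[1], 1, ksize, ksize)
    return F.conv2d(x, kernel.to(x), padding=ksize // 2, groups=x.shape[1])

def box_mask(x, size=128):                       # zero out a centered size x size box
    masked = x.clone()
    c = x.shape[-1] // 2
    masked[..., c - size // 2:c + size // 2, c - size // 2:c + size // 2] = 0
    return masked
```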
Table 1: Results of compressed sensing with varying compression rate ν on the HCP T2w dataset.
Note that compressed sensing is more challenging due to the complexity of the forward operator,
as evidenced by the poor performance of OT-ODE, which assumes a Gaussian distribution of
measurement y given xt .
Method      ν = 1/2 PSNR     ν = 1/2 SSIM     ν = 1/4 PSNR     ν = 1/4 SSIM
OT-ODE      18.71 ± 1.02     0.422 ± 0.17     18.16 ± 1.06     0.271 ± 0.07
DPS-ODE     31.06 ± 3.91     0.765 ± 0.08     25.01 ± 1.87     0.608 ± 0.08
Ours        32.72 ± 1.53     0.878 ± 0.05     27.03 ± 1.77     0.733 ± 0.04
Figure 4: Qualitative comparison results on compressed sensing. Our method produces more faithful
reconstructions with fewer artifacts, ensuring higher accuracy and clarity in the details.
We present the quantitative and qualitative results of all the methods in Fig. 2 and Fig. 3, respectively.
In Fig. 2, our method surpasses all other baselines across all tasks. For more challenging tasks such
as Gaussian deblurring and box inpainting, our method significantly outperforms others in terms
of SSIM. Because it is based on the MAP framework, as shown in Fig. 3, our method favors faithful and
artifact-free reconstructions, whereas other methods trade faithfulness for perceptual quality. We note that there is
an unavoidable tradeoff between perceptual quality and restoration faithfulness [53]. Overall, our
method presents a higher degree of refinement. The comparison between ours and ours (w/o prior)
indicates the effectiveness of the local prior term in enhancing the accuracy of the reconstructions, as
evidenced by the increases in both PSNR and SSIM.
HCP T2w dataset We utilize images from the publicly available Human Connectome Project
(HCP) [51] T2-weighted (T2w) images dataset for the task of compressed sensing, which contains
brain images from 47 patients. The HCP dataset includes cross-sectional images of the brain taken at
different levels and angles.
Compressed sensing We train a flow-based model from scratch on 10,000 randomly sampled
images, utilizing the ncsnpp architecture [9] with minor adaptations for grayscale images. We employ
compression rates ν ∈ {1/2, 1/4}, meaning m = νn. The measurement operator is given by a
subsampled Fourier matrix, whose sign patterns are randomly selected. We evaluate our reconstruction
algorithm’s performance on 200 randomly sampled test images.
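A sketch of one way to build a subsampled Fourier measurement operator with a random sign pattern, as described above; the exact construction in the CompressedSensingOperator of [55] (used in our experiments) may differ, and the image size is illustrative.

```python
import torch

def make_cs_operator(n, ratio=0.5, seed=0):
    """y = P F (d * x): random +/-1 signs d, orthonormal FFT F, and a random row subset P."""
    g = torch.Generator().manual_seed(seed)
    signs = torch.randint(0, 2, (n,), generator=g).float() * 2.0 - 1.0
    rows = torch.randperm(n, generator=g)[: int(ratio * n)]  # keep m = ratio * n frequencies

    def A(x_flat):
        z = torch.fft.fft(signs * x_flat, norm="ortho")      # F (d * x)
        return torch.view_as_real(z[rows]).flatten()         # subsampled rows, real view
    return A

# Usage (illustrative): A = make_cs_operator(n=128 * 128, ratio=0.5); y = A(x.flatten())
```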
We present the quantitative and qualitative results of compressed sensing in Tab. 1 and Fig. 4,
respectively. As shown in Tab. 1, our method consistently achieves the best performance across
varying compression rates ν. In Fig. 4, our method produces reconstructions that are more faithful to
the original images, with fewer artifacts, leading to higher accuracy and clearer details.
Figure 5: Ablation results of step size η and guidance weight λ. The choice of hyperparameters
for our algorithm is fairly consistent across all tasks. We choose η = 10−2 for all experiments on
CelebA-HQ. For λ, we choose λ = 103 for Gaussian deblurring and λ = 104 for the other tasks.
Figure 6: Ablation results of iteration number K on different tasks. For super-resolution and the
other three tasks, K = 1 is sufficient to achieve the best performance with the optimal step size η and
guidance weight λ. However, for compressed sensing, it is necessary to increase K to obtain the best
performance. We hypothesize that this is due to the increased complexity of the compressed sensing
operator, which requires more iteration steps to ensure the correct optimization direction.
We use the Adam optimizer [54] for our optimization steps due to its effectiveness in neural network
computations. For all tasks, we utilize N = 100 steps.
Step size η and Guidance weight λ The use of the Adam optimizer ensures that the choice of
hyperparameters, particularly the step size η and the guidance weight λ, remains consistent across
various tasks, as illustrated in Fig. 5. Specifically, a step size of η = 10−2 is optimal for Inpainting
(random), Inpainting (box), and Super-resolution in terms of SSIM. For PSNR, Gaussian deblurring
also achieves optimal performance at η = 10−2 . Consequently, we employ η = 10−2 for all tasks.
Based on the results shown in the right two subfigures of Fig. 5, we select λ = 103 for Gaussian
deblurring and λ = 104 for the other tasks. This consistency extends to the compressed sensing
experiments, where we set λ = 103 and η = 10−2 for all experiments involving medical images.
Iteration number K We present ablation results of the iteration number K on different tasks in Fig.
6. We focus on the behavior of K in super-resolution and compressed sensing, since the other three tasks
behave similarly to super-resolution. With the optimal choice of η and λ in super-resolution,
i.e., η = 10−2 and λ = 103 , K = 1 provides superior performance on CelebA-HQ. A decreased step
size, e.g., η = 10−3 , can help performance as K increases, but it fails to exceed the performance
achieved with the optimal parameters at K = 1. However, for compressed sensing, it is necessary
to increase K to achieve the best performance. Consequently, we set K = 10 for all compressed
sensing experiments. We hypothesize that the complexity of the compressed sensing operator directly
determines the number of iterations required for optimal performance.
5 Conclusion
In this work, we have introduced a novel iterative algorithm to incorporate flow priors to solve linear
inverse problems. By addressing the computational challenges associated with the slow log-likelihood
calculations inherent in flow matching models, our approach leverages the decomposition of the
MAP objective into multiple "local MAP" objectives. This decomposition, combined with the
application of Tweedie’s formula, enables effective sequential optimization through gradient steps.
Our method has been rigorously validated on both natural and scientific images across various linear
inverse problems, including super-resolution, deblurring, inpainting, and compressed sensing. The
empirical results indicate that our algorithm consistently outperforms existing techniques based on
flow matching, highlighting its potential as a powerful tool for high-resolution image synthesis and
related downstream tasks. We discuss limitations and future work in Section A of the appendix.
References
[1] François Roddier. Interferometric imaging in optical astronomy. Physics Reports, 170(2):
97–166, 1988.
[2] Peter A Jansson. Deconvolution of images and spectra. Courier Corporation, 2014.
[3] Saiprasad Ravishankar, Jong Chul Ye, and Jeffrey A Fessler. Image reconstruction: From
sparsity to data-adaptive methods and machine learning. Proceedings of the IEEE, 108(1):
86–109, 2019.
[4] Paul Suetens. Fundamentals of medical imaging. Cambridge university press, 2017.
[5] Guust Nolet. A breviary of seismic tomography. Cambridge University Press, 2008.
[6] Nicholas Rawlinson, Andreas Fichtner, Malcolm Sambridge, and Mallory K Young. Seismic
tomography and the assessment of uncertainty. Advances in geophysics, 55:1–76, 2014.
[7] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications
of the ACM, 63(11):139–144, 2020.
[9] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv
preprint arXiv:2011.13456, 2020.
[10] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In
International conference on machine learning, pages 1530–1538. PMLR, 2015.
[11] Yingshan Chang, Yasi Zhang, Zhiyuan Fang, Yingnian Wu, Yonatan Bisk, and Feng Gao. Skews
in the phenomenon space hinder generalization in text-to-image generation, 2024.
[12] Yasi Zhang, Peiyu Yu, and Ying Nian Wu. Object-conditioned energy-based attention map
alignment in text-to-image diffusion models. arXiv preprint arXiv:2404.07389, 2024.
[13] Gregory Ongie, Ajil Jalal, Christopher A. Metzler, Richard G. Baraniuk, Alexandros G. Dimakis,
and Rebecca Willett. Deep learning techniques for inverse problems in imaging. IEEE Journal
on Selected Areas in Information Theory, 1(1):39–56, 2020. doi: 10.1109/JSAIT.2020.2991563.
[14] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros Dimakis. Compressed sensing using
generative models. International Conference on Machine Learning, 2017.
[15] Sachit Menon, Alex Damian, McCourt Hu, Nikhil Ravi, and Cynthia Rudin. Pulse: Self-
supervised photo upsampling via latent space exploration of generative models. In The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[16] Muhammad Asim, Max Daniels, Oscar Leong, Ali Ahmed, and Paul Hand. Invertible generative
models for inverse problems: mitigating representation error and dataset bias. Proceedings of
the 37th International Conference on Machine Learning, 2020.
[17] Jay Whang, Erik M. Lindgren, and Alexandros G. Dimakis. Composing normalizing flows
for inverse problems. Proceedings of the 38th International Conference on Machine Learning,
2021.
[18] Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye.
Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International
Conference on Learning Representations, ICLR 2023. The International Conference on Learning
Representations, 2023.
[19] Litu Rout, Negin Raoof, Giannis Daras, Constantine Caramanis, Alex Dimakis, and Sanjay
Shakkottai. Solving linear inverse problems provably via posterior sampling with latent diffusion
models. Advances in Neural Information Processing Systems, 36, 2024.
[20] Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun
Zhu, and Ying Nian Wu. Latent diffusion energy-based model for interpretable text modeling.
arXiv preprint arXiv:2206.05895, 2022.
[21] Peiyu Yu, Yaxuan Zhu, Sirui Xie, Xiaojian Shawn Ma, Ruiqi Gao, Song-Chun Zhu, and
Ying Nian Wu. Learning energy-based prior model with diffusion-amortized mcmc. Advances
in Neural Information Processing Systems, 36, 2024.
[22] Peiyu Yu, Sirui Xie, Xiaojian Ma, Yixin Zhu, Ying Nian Wu, and Song-Chun Zhu. Unsupervised
foreground extraction via deep region competition. Advances in Neural Information Processing
Systems, 34:14264–14279, 2021.
[23] Yilue Qian, Peiyu Yu, Ying Nian Wu, Wei Wang, and Lifeng Fan. Learning concept-based visual
causal transition and symbolic reasoning for visual planning. arXiv preprint arXiv:2310.03325,
2023.
[24] Peiyu Yu, Yongming Rao, Jiwen Lu, and Jie Zhou. P2GNet: Pose-guided point cloud generating
networks for 6-DoF object pose estimation. arXiv preprint arXiv:1912.09316, 2019.
[25] Hengzhi He, Peiyu Yu, Junpeng Ren, Ying Nian Wu, and Guang Cheng. Watermarking
generative tabular data. arXiv preprint arXiv:2405.14018, 2024.
[26] Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer
data with rectified flow. In The Eleventh International Conference on Learning Representations,
2022.
[27] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow
matching for generative modeling. In The Eleventh International Conference on Learning
Representations, 2022.
[28] Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is
enough for high-quality diffusion-based text-to-image generation. In International Conference
on Learning Representations, 2024.
[29] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini,
Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform-
ers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
[30] Hanshu Yan, Xingchao Liu, Jiachun Pan, Jun Hao Liew, Qiang Liu, and Jiashi Feng. Perflow:
Piecewise rectified flow as universal plug-and-play accelerator. arXiv preprint arXiv:2405.07510,
2024.
[31] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary
differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran
Associates, Inc., 2018.
[32] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud.
Ffjord: Free-form continuous dynamics for scalable reversible generative models. In Interna-
tional Conference on Learning Representations, 2019.
[33] Ashwini Pokle, Matthew J Muckley, Ricky TQ Chen, and Brian Karrer. Training-free linear
image inversion via flows. arXiv preprint arXiv:2310.04432, 2023.
[34] Martin Burger and Felix Lucka. Maximum a posteriori estimates in linear inverse problems
with log-concave priors are proper bayes estimators. Inverse Problems, 30(11):114004, 2014.
[35] Tapio Helin and Martin Burger. Maximum a posteriori probability estimates in infinite-
dimensional bayesian inverse problems. Inverse Problems, 31(8):085009, 2015.
[36] Thorsten M Buzug. Computed tomography. In Springer handbook of medical technology, pages
311–342. Springer, 2011.
[37] Marinus T Vlaardingerbroek and Jacques A Boer. Magnetic resonance imaging: theory and
practice. Springer Science & Business Media, 2013.
[38] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of markov chain
monte carlo. CRC press, 2011.
[39] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training
of score-based diffusion models. Advances in neural information processing systems, 34:
1415–1428, 2021.
[40] Berthy T Feng, Jamie Smith, Michael Rubinstein, Huiwen Chang, Katherine L Bouman, and
William T Freeman. Score-based diffusion models as principled priors for inverse imaging. In
International Conference on Computer Vision (ICCV). IEEE, 2023.
[41] Berthy T Feng and Katherine L Bouman. Efficient bayesian computational imaging with a
surrogate score-based prior. arXiv preprint arXiv:2309.01949, 2023.
[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks
for biomedical image segmentation. In Medical image computing and computer-assisted
intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9,
2015, proceedings, part III 18, pages 234–241. Springer, 2015.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
[44] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper-
vised learning using nonequilibrium thermodynamics. In International conference on machine
learning, pages 2256–2265. PMLR, 2015.
[45] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances
in neural information processing systems, 33:6840–6851, 2020.
[46] John Skilling. The eigenvalues of mega-dimensional matrices. Maximum Entropy and Bayesian
Methods: Cambridge, England, 1988, pages 455–466, 1989.
[47] Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian
smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–
1076, 1989.
[48] Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical
Association, 106(496):1602–1614, 2011.
[49] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in
the wild. In Proceedings of the IEEE international conference on computer vision, pages
3730–3738, 2015.
[50] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for
improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[51] David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub,
Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The wu-minn human connectome project: an
overview. Neuroimage, 80:62–79, 2013.
[52] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from
error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612,
2004. doi: 10.1109/TIP.2003.819861.
[53] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 6228–6237, 2018.
[54] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[55] Zhenghan Fang, Sam Buchanan, and Jeremias Sulam. What’s in a prior? learned proximal
networks for inverse problems. In International Conference on Learning Representations, 2024.
Appendix
A Limitations and Future Work
While our algorithm has demonstrated promising results, there are certain limitations that suggest
avenues for future research. First, our theoretical framework, built on optimal transport interpolation
paths, is currently limited and cannot be applied to solve the general interpolation between Gaussian
and data distributions. Additionally, in order to broaden the applicability of flow priors for inverse
problems, it is important to generalize our approach to handle nonlinear forward models. Moreover,
the algorithm currently lacks the capability to quantify the uncertainty of the generated images,
an aspect crucial for many scientific applications. It would be interesting to consider approaches
to post-process our solutions to understand the uncertainty inherent in our reconstruction. These
limitations highlight important directions for future work to enhance the robustness and applicability
of our method.
B Proof
Before we dive into the proof, we provide the following three lemmas.
Lemma 1. Consider a vector-valued function $f : [0,1] \to \mathbb{R}^n$. Then for any $t \in [0,1]$, we have that
$$\Big\| \int_0^t f(s)\, ds \Big\|^2 \le \int_0^t \|f(s)\|^2\, ds. \qquad (14)$$
Proof. For each $s \in [0,1]$, let $f_i(s) \in \mathbb{R}$ denote the $i$-th component of $f(s)$. Recall Jensen's
inequality: for any convex function $g : \mathbb{R} \to \mathbb{R}$ and integrable function $h : [0,1] \to \mathbb{R}$, we have
$$g\left( \int_a^b h(t)\, dt \right) \le \int_a^b g(h(t))\, dt.$$
Using convexity of the function $t \mapsto t^2$ and applying Jensen's inequality, we see that
$$\Big\| \int_0^t f(s)\, ds \Big\|^2 = \sum_{i=1}^n \left( \int_0^t f_i(s)\, ds \right)^2 \qquad (15)$$
$$\le \sum_{i=1}^n \int_0^t f_i(s)^2\, ds \qquad (16)$$
$$= \int_0^t \sum_{i=1}^n f_i(s)^2\, ds \qquad (17)$$
$$= \int_0^t \|f(s)\|^2\, ds. \qquad (18)$$
Lemma 2 (Tweedie's Formula [48]). If $\mu \sim g(\cdot)$, $z \mid \mu \sim \mathcal{N}(\alpha\mu, \sigma^2 I)$, and therefore $z \sim f(\cdot)$, we have
$$\mathbb{E}[\mu \mid z] = \frac{1}{\alpha}\big[ z + \sigma^2 \nabla_z \log f(z) \big]. \qquad (19)$$
Lemma 3. Suppose $y = A(x^*) + \epsilon$ where $x^* = x_1(x_0)$ with $x_0$ being the solution to Eq. (9),
$A : \mathbb{R}^n \to \mathbb{R}^m$ is linear, $\epsilon \sim \mathcal{N}(0, \sigma_y^2 I)$, and $x_t$ exactly follows the path $x_t = \alpha_t x + \beta_t x_0$ for any
time $t \in [0,1]$. Then we have
$$p(y_t \mid x_t) = \mathcal{N}(A x_t, \alpha_t^2 \sigma_y^2 I), \qquad (20)$$
and hence
$$\log p(y \mid x(x_0)) = \log p(y_t \mid x_t) + \frac{m}{2} \log(\alpha_t^2), \qquad \forall t. \qquad (21)$$
Proof. Recall that the generated auxiliary path is $y_t = \alpha_t y + \beta_t A x_0$. By assumption, we have
$A(x_t) = A(\alpha_t x + \beta_t x_0) = \alpha_t A(x(x_0)) + \beta_t A x_0$. By subtracting these two equations, we have
$$y_t - A(x_t) = \alpha_t \big( y - A(x(x_0)) \big). \qquad (22)$$
As $y \mid x(x_0) \sim \mathcal{N}(Ax, \sigma_y^2 I)$, we have $y_t \mid x_t \sim \mathcal{N}(A x_t, \alpha_t^2 \sigma_y^2 I)$. This proves Eq. (20). Next,
we examine the log probability as follows:
$$\log p(y_t \mid x_t) = -\frac{\|y_t - A x_t\|^2}{2 \alpha_t^2 \sigma_y^2} - \frac{m}{2} \log(2\pi \alpha_t^2 \sigma_y^2) \qquad (23)$$
$$= -\frac{\|\alpha_t (y - A(x(x_0)))\|^2}{2 \alpha_t^2 \sigma_y^2} - \frac{m}{2} \log(2\pi \alpha_t^2 \sigma_y^2) \qquad (24)$$
$$= -\frac{\|y - A(x(x_0))\|^2}{2 \sigma_y^2} - \frac{m}{2} \log(2\pi \sigma_y^2) - \frac{m}{2} \log(\alpha_t^2) \qquad (25)$$
$$= \log p(y \mid x(x_0)) - \frac{m}{2} \log(\alpha_t^2). \qquad (26)$$
Trained with the objective defined in Eq. (5), the optimal velocity field is
$$v_\theta(x_t, t) = \mathbb{E}\big[ \dot\alpha_t x_1 + \dot\beta_t x_0 \mid x_t \big] \qquad (27)$$
$$= \mathbb{E}\Big[ \dot\alpha_t x_1 + \dot\beta_t \frac{x_t - \alpha_t x}{\beta_t} \,\Big|\, x_t \Big] \qquad \text{(given } x_t,\ x_0 = \tfrac{x_t - \alpha_t x}{\beta_t}\text{)} \qquad (28)$$
$$= \Big( \dot\alpha_t - \frac{\alpha_t \dot\beta_t}{\beta_t} \Big) \mathbb{E}[x_1 \mid x_t] + \frac{\dot\beta_t}{\beta_t} x_t \qquad (29)$$
$$= \Big( \dot\alpha_t - \frac{\alpha_t \dot\beta_t}{\beta_t} \Big) \Big[ \frac{1}{\alpha_t} \big( x_t + \beta_t^2 \nabla_{x_t} \log p(x_t) \big) \Big] + \frac{\dot\beta_t}{\beta_t} x_t. \qquad \text{(Lemma 2, Tweedie's formula)} \qquad (30)$$
By defining the signal-to-noise ratio as $\lambda_t = \alpha_t / \beta_t$ and rearranging the equation above, we get
exactly Eq. (13), which we display again below:
$$\nabla_{x_t} \log p(x_t) = \frac{1}{\beta_t^2} \left[ \Big( \frac{d \log \lambda_t}{dt} \Big)^{-1} \Big( v_\theta(x_t, t) - \frac{d \log \beta_t}{dt} x_t \Big) - x_t \right]. \qquad (31)$$
where the decomposition of the second term uses the Riemann discretization of the integral, and that
of the third term uses the result in Lemma 3, so that $c_i = \frac{m}{2} \log(\alpha_{i\Delta t}^2)$. By the property of limits, i.e.,
$\lim_{\Delta t \to 0} \big( \sum_{i=1}^N \gamma_i \big)\big( \sum_{i=1}^N \Delta p_i \big) = \lim_{\Delta t \to 0} \big( \sum_{i=1}^N \gamma_i \big) \lim_{\Delta t \to 0} \big( \sum_{i=1}^N \Delta p_i \big) = \lim_{\Delta t \to 0} \sum_{i=1}^N \Delta p_i$,
we can further decompose the second term in Eq. (35) into $\lim_{\Delta t \to 0} \big( \sum_{i=1}^N \gamma_i \big)\big( \sum_{i=1}^N \Delta p_i \big)$.
By extracting the limit out in Eq. (35), the equation becomes
$$\lim_{\Delta t \to 0} \Big\{ \gamma_1 \big[ \log p(x_0) + \Delta p_1 + \log p(y_{\Delta t} \mid x_{\Delta t}) + c_1 \big]$$
$$+ \gamma_2 \big[ \log p(x_0) + \Delta p_1 + \Delta p_2 + \log p(y_{2\Delta t} \mid x_{2\Delta t}) + c_2 \big]$$
$$+ \cdots$$
$$+ \gamma_N \big[ \log p(x_0) + \Delta p_1 + \Delta p_2 + \cdots + \Delta p_N + \log p(y_{N\Delta t} \mid x_{N\Delta t}) + c_N \big]$$
$$+ \big[ \gamma_1 \Delta p_2 + (\gamma_1 + \gamma_2)\Delta p_3 + \cdots + (\gamma_1 + \gamma_2 + \cdots + \gamma_{N-1})\Delta p_N \big] - \log p(y) \Big\} \qquad (36)$$
$$:= \lim_{\Delta t \to 0} \left( \sum_{i=1}^N \gamma_i \tilde{J}_i + \sum_{j=2}^N \Big( \sum_{i=1}^{j-1} \gamma_i \Big) \Delta p_j + \sum_{i=1}^N \gamma_i c_i - \log p(y) \right), \qquad (37)$$
where $\tilde{J}_i := \log p(x_0) + \sum_{j=1}^i \Delta p_j + \log p(y_{i\Delta t} \mid x_{i\Delta t})$. We further define $c(N) := \sum_{i=1}^N \gamma_i c_i - \log p(y)$.
Recall that $\hat{J}_i = \log p(x_{(i-1)\Delta t}) - \mathrm{tr}\big( \frac{\partial v_\theta(x_{(i-1)\Delta t},\, (i-1)\Delta t)}{\partial x} \big) \Delta t + \log p(y_{i\Delta t} \mid x_{i\Delta t})$. By the triangle
inequality, we have
$$\Big| \log p(x(x_0) \mid y) - \sum_{i=1}^N \gamma_i \hat{J}_i - c(N) \Big| \qquad (38)$$
$$\le \Big| \log p(x(x_0) \mid y) - \sum_{i=1}^N \gamma_i \tilde{J}_i - c(N) \Big| + \Big| \sum_{i=1}^N \gamma_i \hat{J}_i - \sum_{i=1}^N \gamma_i \tilde{J}_i \Big|. \qquad (39)$$
In the following, we analyze the two terms on the right-hand side one by one. For the first term:
as $|\cdot| : \mathbb{R} \to \mathbb{R}$ is a continuous function, the first term on the right-hand side is equal to
$$\Big| \log p(x(x_0) \mid y) - \lim_{\Delta t \to 0} \sum_{i=1}^N \gamma_i \tilde{J}_i - c(N) \Big| \qquad (42)$$
$$= \Big| \lim_{\Delta t \to 0} \sum_{j=2}^N \Big( \sum_{i=1}^{j-1} \gamma_i \Big) \Delta p_j \Big| \qquad (43)$$
$$= \Big| \lim_{\Delta t \to 0} \sum_{j=2}^N \Big( \frac{1}{2^{N-j+1}} - \frac{1}{2^N} \Big) \Delta p_j \Big| \qquad (44)$$
$$\le \lim_{\Delta t \to 0} \sum_{j=2}^N \frac{1}{2^{N-j+1}} |\Delta p_j| + \lim_{\Delta t \to 0} \sum_{j=2}^N \frac{1}{2^N} |\Delta p_j|, \qquad (45)$$
where the first equation is derived by subtracting the first term in Eq. (37) from Eq. (33). As the
velocity field $v_\theta : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n$ satisfies $\sup_{z \in \mathbb{R}^n, s \in [0,1]} \big|\mathrm{tr}\, \frac{\partial}{\partial x} v_\theta(z, s)\big| \le C_1$ for some universal
constant $C_1$, we have $|\Delta p_j| \le C_1 \Delta t$. The first term in (45) would be
$$\sum_{j=2}^N \frac{1}{2^{N-j+1}} |\Delta p_j| \le \sum_{j=2}^N \frac{1}{2^{N-j+1}} C_1 \Delta t \le C_1 \Delta t = O(\Delta t). \qquad (46)$$
Similarly, the second term in (45) would be
$$\sum_{j=2}^N \frac{1}{2^N} |\Delta p_j| \le \sum_{j=2}^N \frac{1}{2^N} C_1 \Delta t = \frac{N-1}{2^N} C_1 \Delta t = O(\Delta t). \qquad (47)$$
Combining the results in Eq. (46) and Eq. (47), we get
$$\Big| \log p(x(x_0) \mid y) - \lim_{\Delta t \to 0} \sum_{i=1}^N \gamma_i \tilde{J}_i - c(N) \Big| = 0. \qquad (48)$$
For the second term: intuitively, the error between the integral and its Riemann discretization
goes to 0 as $\Delta t$ tends to 0. Rigorously,
$$\lim_{\Delta t \to 0} \Big| \sum_{i=1}^N \gamma_i \hat{J}_i - \sum_{i=1}^N \gamma_i \tilde{J}_i \Big| = \lim_{\Delta t \to 0} \Big| \sum_{i=1}^N \gamma_i (\hat{J}_i - \tilde{J}_i) \Big| \qquad (49)$$
$$= \lim_{\Delta t \to 0} \Big| \sum_{i=1}^N \gamma_i \Big( \int_0^{t-\Delta t} \mathrm{tr}\Big( \frac{\partial v_\theta(x_s, s)}{\partial x} \Big) ds - \sum_{j=1}^{i-1} \Delta p_j \Big) \Big| \qquad (50)$$
$$\le \lim_{\Delta t \to 0} \sum_{i=1}^N \gamma_i \Big| \int_0^{t-\Delta t} \mathrm{tr}\Big( \frac{\partial v_\theta(x_s, s)}{\partial x} \Big) ds - \sum_{j=1}^{i-1} \Delta p_j \Big| = 0. \qquad (51)$$
Combining the results for the first term and the second term completes the proof of Theorem 1.
C Compliance of Trajectory
To quantify our deviation from the assumption of having $x_t$ exactly follow the interpolation path
$\alpha_t x + \beta_t x_0$, we define the following: given a differentiable process $\{z_t\}$ and an interpolation path
specified by $\alpha := \{\alpha_t\}$ and $\beta := \{\beta_t\}$, we define the trajectory's compliance $S_{\alpha,\beta}(\{z_t\})$ to the
interpolation path as
$$S_{\alpha,\beta}(\{z_t\}) := \int_0^1 \mathbb{E}_{p(z_0),\,p(z_1)}\Big[ \|\dot{z}_t - (\dot\alpha_t z_1 + \dot\beta_t z_0)\|^2 \Big]\, dt. \qquad (52)$$
This generalizes the definition of straightness in [26] to general interpolation paths. We recover their
definition by setting αt = t and βt = 1 − t. In certain cases, we have exact compliance with the
predefined interpolation path. For example, when {zt } is generated by vθ and αt = t and βt = 1 − t,
note that Sα,β ({zt }) = 0 is equivalent to vθ (zt , t) = c where c is a constant, almost everywhere.
This ensures that z1 = z0 + c. In this case, when generating the trajectory through an ODE solver
with starting point x0 and endpoint xt , we have xt = αt x + βt x0 , ∀t. When Sα,β ({zt }) is not equal
to 0, we show in Proposition 2 that we can bound the deviation of our trajectory from the interpolation
path using this compliance measure. When specifying our result to Rectified Flow, we can obtain an
additional bound showing that when using L-Rectified Flow, the deviation of the learned trajectory
from the straight trajectory is bounded by O(1/L).
Proposition 2. Consider a differentiable interpolation path specified by $\alpha := \{\alpha_t\}$ and $\beta := \{\beta_t\}$.
Then the expected distance between the learned trajectory $z_t = z_0 + \int_0^t v_\theta(z_s, s)\, ds$ and the predefined
trajectory $\hat{z}_t = z_0 + \int_0^t (\dot\alpha_s z_1 + \dot\beta_s z_0)\, ds$ can be bounded as
$$\mathbb{E}_{p(z_0),\,p(z_1)}\big[ \|\hat{z}_t - z_t\|^2 \big] \le S_{\alpha,\beta}(\{z_t\}). \qquad (53)$$
If the differentiable process $\{z_t\}$ is specified by $L$-Rectified Flow and $\alpha_t = t$ and $\beta_t = 1 - t$ for all
$t \in [0,1]$, then we additionally have
$$\mathbb{E}_{p(z_0),\,p(z_1)}\big[ \|\hat{z}_t - z_t\|^2 \big] \le O\Big( \frac{1}{L} \Big). \qquad (54)$$
Proof. At time $t$, we are interested in the distance between the real trajectory $z_t = z_0 + \int_0^t v_\theta(z_s, s)\, ds$
and the preferred trajectory $\hat{z}_t = z_0 + \int_0^t (\dot\alpha_s z_1 + \dot\beta_s z_0)\, ds$. Using the result in Lemma 1, the distance
can be bounded by
$$\|\hat{z}_t - z_t\|^2 = \Big\| \int_0^t \big[ v_\theta(z_s, s) - (\dot\alpha_s z_1 + \dot\beta_s z_0) \big]\, ds \Big\|^2 \qquad (55)$$
$$\le \int_0^t \big\| v_\theta(z_s, s) - (\dot\alpha_s z_1 + \dot\beta_s z_0) \big\|^2\, ds. \qquad (56)$$
Therefore,
$$\mathbb{E}_{p(z_0),\,p(z_1)}\big[ \|\hat{z}_t - z_t\|^2 \big] \le \mathbb{E}_{p(z_0),\,p(z_1)}\Big[ \int_0^t \big\| v_\theta(z_s, s) - (\dot\alpha_s z_1 + \dot\beta_s z_0) \big\|^2\, ds \Big] \qquad (57)$$
$$= \int_0^t \mathbb{E}_{p(z_0),\,p(z_1)}\big[ \big\| v_\theta(z_s, s) - (\dot\alpha_s z_1 + \dot\beta_s z_0) \big\|^2 \big]\, ds \qquad (58)$$
$$\le \int_0^1 \mathbb{E}_{p(z_0),\,p(z_1)}\big[ \big\| v_\theta(z_s, s) - (\dot\alpha_s z_1 + \dot\beta_s z_0) \big\|^2 \big]\, ds \qquad (59)$$
$$=: S_{\alpha,\beta}(\{z_t\}). \qquad (60)$$
If $\{z_t, t \in [0,1]\}$ is a learned $L$-Rectified Flow, i.e., $\alpha_t = t$ and $\beta_t = 1 - t$, where $L$ is the number of
times the flow has been rectified, then by Theorem 3.7 in [26] we have $S_{\alpha,\beta}(\{z_t\}) = O(1/L)$ and thus
$$\mathbb{E}_{p(z_0),\,p(z_1)}\big[ \|\hat{z}_t - z_t\|^2 \big] = O(1/L). \qquad (61)$$
Empirically, [28, 26] found L = 2 generates nearly straight trajectories for high-quality one-step
generation. Hence, while this result gives us a simple upper bound, in practice the trajectories may
comply more faithfully with the predefined interpolation path than this result suggests.
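The compliance measure in Eq. (52) can be estimated empirically for a trained model; a hedged Monte Carlo sketch for the straight path (α_t = t, β_t = 1 − t), where the target velocity is z_1 − z_0 and {z_t} is generated by the learned flow:

```python
import torch

@torch.no_grad()
def estimate_compliance(velocity_net, z0_batch, n_steps=100):
    """Monte Carlo estimate of S_{alpha,beta}({z_t}) in Eq. (52) for the straight path,
    with {z_t} generated by the learned flow and zdot_t = v_theta(z_t, t)."""
    dt = 1.0 / n_steps
    zt = z0_batch.clone()
    velocities = []
    for i in range(n_steps):                   # simulate the learned trajectory
        t = torch.tensor(i * dt, device=zt.device)
        v = velocity_net(zt, t)
        velocities.append(v)
        zt = zt + v * dt
    target = zt - z0_batch                     # dot(alpha_t) z1 + dot(beta_t) z0 = z1 - z0
    return sum(((v - target) ** 2).flatten(1).sum(1).mean() for v in velocities) * dt
```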
D Additional Results
We first provide raw values of Fig. 2 in Tab. 2.
Table 2: Quantitative comparison results (raw values) on the CelebA-HQ dataset.
NFEs N We first refer to Fig. 1(c) for a preliminary ablation on N using a toy example. Next, we
show PSNR and SSIM scores for varying N in the task of super-resolution. We find that N = 100 is
the best trade-off between time and performance. The ablation results are shown in Fig. 8.
E Computational Efficiency
In Tab. 3, we present the computational efficiency comparison results. Note that OT-ODE is
the slowest, as it requires inverting the matrix $r_t^2 A A^T + \sigma_y^2 I$ at each update step. Our
method requires taking the gradient of an estimated trace of the Jacobian matrix, which slows the
computation.
Figure 7: Ablation results of K in terms of SSIM on different tasks.
Table 3: Computational time comparison. We compare the time required to recover 100 images for
the super-resolution task on a single GPU.
F Implementation Details
Experiments were conducted on a Linux-based system with CUDA 12.2, equipped with 4 Nvidia
R9000 GPUs, each with 48GB of memory.
Operators For all the experiments on the CelebA-HQ dataset, we use the operators from [18]. For
all the experiments on compressed sensing, we use the operator CompressedSensingOperator defined
in the official repository of [55]4.
Evaluation Metrics are implemented with different Python packages. PSNR is calculated using
basic PyTorch operations, and SSIM is computed using the pytorch_msssim package.
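For reference, a minimal sketch of how these two metrics can be computed with basic PyTorch operations and the pytorch_msssim package, as stated above (the reduction settings are our choice):

```python
import torch
from pytorch_msssim import ssim

def psnr(x_hat, x_true, max_val=1.0):
    """PSNR in dB for images normalized to [0, max_val]."""
    mse = torch.mean((x_hat - x_true) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def ssim_score(x_hat, x_true, max_val=1.0):
    """Structural similarity via pytorch_msssim; inputs shaped (B, C, H, W)."""
    return ssim(x_hat, x_true, data_range=max_val)
```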
The workflow begins with using 1,000 FFHQ images at a resolution of 1024×1024. These images
are then downscaled to 16×16 using bicubic resizing. A Gaussian Mixture model is applied to fit the
downsampled images, resulting in mean and covariance parameters. The mean values are transformed
from the original range of [0,1] to [-1,1]. Subsequently, 10,000 samples are generated from this
distribution to facilitate training a score-based model resembling the architecture of CIFAR10
DDPM++. The training process involves 10,000 iterations, each with a batch size of 64, and
utilizes the Adam optimizer [54] with a learning rate of 2e-4 and a warmup phase lasting 100 steps.
Notably, convergence is achieved within approximately 200 steps. Lastly, the estimated log-likelihood
computation for a batch size of 128 takes around 4 minutes and 30 seconds. We show uncurated samples
generated from the trained models in Fig. 9.
4 https://github.com/Sulam-Group/learned-proximal-networks/tree/main
Figure 9: Generated samples from the flow trained on 10,000 Gaussian samples.
In this setting, σy = 0.001. We use the ncsnpp architecture, training from scratch on 10k images for
100k iterations with a batch size of 50. We set the learning rate to 1 × 10−2. We observed sudden
convergence during training. We use 2000 warmup steps. Uncurated generated images are
presented in Fig. 10.
Figure 10: Generated samples from the flow trained on 10,000 HCP T2w images.
OT-ODE As OT-ODE [33] has not released code or pretrained checkpoints, we reproduce
their method with the same architecture as in [26]. We follow their setting and find that the initialization time
t′ has a great impact on performance. We use the y-init method from their paper. Specifically, the
starting point is
xt′ = t′ y + (1 − t′ )ϵ, ϵ ∼ N (0, I), (62)
where t′ is the init time. Note that in the super-resolution task we upscale y with bicubic first. We
follow the guidance in the paper and show the ablation results in Fig. 11 and Fig. 12.
Figure 11: Hyperparameter t′ selection results for OT-ODE on the CelebA-HQ dataset. We select t′ =
0.2, 0.1, 0.2, 0.2 for super-resolution, inpainting(random), Gaussian deblurring, and inpainting(box),
respectively.
Figure 12: Hyperparameter t′ selection results for OT-ODE on the HCP T2w dataset. We select
t′ = 0.1 for all the experiments.
DPS-ODE We use the following formula to update each step of the flow:
$$v(x_t, y) = v(x_t) + \zeta_t \big( -\nabla_{x_t} \|y - A\hat{x}_1\|^2 \big),$$
where $\zeta_t$ is the step size to tune. We refer to DPS for the method of choosing $\zeta_t$ and set
$\zeta_t = \frac{\eta}{2\|y - A\hat{x}_1(x_t)\|}$. We demonstrate the ablation of $\eta$ for this baseline in Fig. 13 and Fig. 14. Note that
there is a significant divergence in PSNR and SSIM for the task of inpainting (box). As we observe
that artifacts are likely to appear when $\eta \ge 100$, we choose $\eta = 75$ for the best tradeoff.
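A hedged sketch of one guided Euler step of this baseline for the OT path, using the estimate x̂_1 = x_t + (1 − t) v_θ(x_t, t) obtained by rearranging the velocity identity; the original DPS-ODE implementation details may differ.

```python
import torch

def dps_ode_step(velocity_net, A, y, xt, t, dt, eta):
    """One step of DPS-ODE: v(x_t, y) = v(x_t) - zeta_t * grad ||y - A x_hat_1||^2,
    with zeta_t = eta / (2 ||y - A x_hat_1(x_t)||)."""
    xt = xt.detach().requires_grad_(True)
    v = velocity_net(xt, torch.tensor(t, device=xt.device))
    x1_hat = xt + (1.0 - t) * v                       # posterior-mean estimate of x_1
    residual = y - A(x1_hat)
    (grad,) = torch.autograd.grad((residual ** 2).sum(), xt)
    zeta = eta / (2.0 * residual.detach().norm())
    with torch.no_grad():
        return xt + (v - zeta * grad) * dt            # Euler step with the conditional velocity
```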
Figure 13: Hyperparameter η selection results for DPS-ODE. We select η = 1000, 750, 200, 75 for
super-resolution, inpainting(random), Gaussian deblurring, and inpainting(box), respectively.
Figure 14: Hyperparameter η selection results for DPS-ODE on the HCP T2w dataset. We select
η = 200 for all the experiments.
[Additional pages of uncaptioned qualitative comparison figures (columns: OT-ODE, DPS-ODE, Ours, GT).]