
Bayesian Nonparametric Dictionary Learning for Compressed Sensing MRI
Yue Huang† , John Paisley† , Qin Lin, Xinghao Ding‡ , Xueyang Fu and Xiao-ping Zhang, Senior Member, IEEE

arXiv:1302.2712v3 [cs.CV] 26 Jul 2014

This work was supported by the National Natural Science Foundation of China (Nos. 30900328, 61172179, 61103121, 81301278), the Fundamental Research Funds for the Central Universities (Nos. 2011121051, 2013121023) and the Natural Science Foundation of Fujian Province of China (No. 2012J05160). Y. Huang, Q. Lin, X. Ding and X. Fu are with the Department of Communications Engineering at Xiamen University in Xiamen, Fujian, China. J. Paisley is with the Department of Electrical Engineering at Columbia University in New York, NY, USA. X.-P. Zhang is with the Department of Electrical and Computer Engineering at Ryerson University in Toronto, Canada. † Equal contributions. ‡ Corresponding author: [email protected]

Abstract—We develop a Bayesian nonparametric model for reconstructing magnetic resonance images (MRI) from highly undersampled k-space data. We perform dictionary learning as part of the image reconstruction process. To this end, we use the beta process as a nonparametric dictionary learning prior for representing an image patch as a sparse combination of dictionary elements. The size of the dictionary and the patch-specific sparsity pattern are inferred from the data, in addition to other dictionary learning variables. Dictionary learning is performed directly on the compressed image, and so is tailored to the MRI being considered. In addition, we investigate a total variation penalty term in combination with the dictionary learning model, and show how the denoising property of dictionary learning removes dependence on regularization parameters in the noisy setting. We derive a stochastic optimization algorithm based on Markov Chain Monte Carlo (MCMC) for the Bayesian model, and use the alternating direction method of multipliers (ADMM) for efficiently performing total variation minimization. We present empirical results on several MR images, which show that the proposed regularization framework can improve reconstruction accuracy over other methods.

Index Terms—compressed sensing, magnetic resonance imaging, Bayesian nonparametrics, dictionary learning

I. INTRODUCTION

Magnetic resonance imaging (MRI) is a widely used technique for visualizing the structure and functioning of the body. A limitation of MRI is its slow scan speed during data acquisition. Therefore, methods for accelerating the MRI process have been heavily researched. Recent advances in signal reconstruction from measurements sampled below the Nyquist rate, called compressed sensing (CS) [1], [2], have had a major impact on MRI [3]. CS-MRI allows for significant undersampling in the Fourier measurement domain of MR images (called k-space), while still outputting a high-quality image reconstruction. While image reconstruction using this undersampled data is a case of an ill-posed inverse problem, compressed sensing theory has shown that it is possible to reconstruct a signal from significantly fewer measurements than mandated by traditional Nyquist sampling if the signal is sparse in a particular transform domain.

Motivated by the need to find a sparse domain for representing the MR signal, a large body of literature now exists on reconstructing MRI from significantly undersampled k-space data. Existing improvements in CS-MRI mostly focus on (i) seeking sparse domains for the image, such as contourlets [4], [5]; (ii) using approximations of the ℓ0 norm for better reconstruction performance with fewer measurements, for example ℓ1, FOCUSS, ℓp quasi-norms with 0 < p < 1, or using smooth functions to approximate the ℓ0 norm [6], [7]; and (iii) accelerating image reconstruction through more efficient optimization techniques [8], [10], [29]. In this paper we present a modeling framework that is similarly motivated.

CS-MRI reconstruction algorithms tend to fall into two categories: those which enforce sparsity directly within some image transform domain [3]–[8], [10], [11], [12], and those which enforce sparsity in some underlying latent representation of the image, such as an adaptive dictionary-based representation [9], [14]. Most CS-MRI reconstruction algorithms belong to the first category. For example Sparse MRI [3], the leading study in CS-MRI, performs MR image reconstruction by enforcing sparsity in both the wavelet domain and the total variation (TV) of the reconstructed image. Algorithms with image-level sparsity constraints such as Sparse MRI typically employ an off-the-shelf basis, which can usually capture only one feature of the image. For example, wavelets recover point-like features, while contourlets recover curve-like features. Since MR images contain a variety of underlying features, such as edges and textures, using a basis not adapted to the image can be considered a drawback of these algorithms.

Finding a sparse basis that is suited to the image at hand can benefit MR image reconstruction, since CS theory shows that the required number of measurements is linked to the sparsity of the signal in the selected transform domain. Using a standard basis not adapted to the image under consideration will likely not provide a representation that can compete in sparsity with an adapted basis. To this end, dictionary learning, which falls in the second group of algorithms, learns a sparse basis on image subregions called patches that is adapted to the image class of interest. Recent studies in the image processing literature have shown that dictionary learning is an effective means for finding a sparse, patch-level representation of an image [19], [20], [25]. These algorithms learn a patch-level dictionary by exploiting structural similarities between patches extracted from images within a class of interest. Among these approaches, adaptive dictionary learning—where the dictionary is learned directly from the image being considered—based on patch-level sparsity constraints usually outperforms analytical dictionary approaches in denoising,
super-resolution reconstruction, interpolation, inpainting, classification and other applications, since the adaptively learned dictionary suits the signal of interest [19]–[22].

Dictionary learning has previously been applied to CS-MRI to learn a sparse basis for reconstruction, e.g., [14]. With these methods, parameters such as the dictionary size and patch sparsity are preset, and the algorithms considered are non-Bayesian. In this paper, we consider a new dictionary learning algorithm for CS-MRI that is motivated by Bayesian nonparametric statistics. Specifically, we consider a nonparametric dictionary learning model called BPFA [23] that uses the beta process to learn the sparse representation necessary for CS-MRI reconstruction. The beta process is an effective prior for nonparametric learning of latent factor models; in this case the latent factors correspond to dictionary elements. While the dictionary size is therefore infinite in principle, through posterior inference the beta process learns a suitably compact dictionary in which the signal can be sparsely represented.

We organize the paper as follows. In Section II we review CS-MRI inversion methods and the beta process for dictionary learning. In Section III, we describe the proposed regularization framework and algorithm. We derive a Markov Chain Monte Carlo (MCMC) sampling algorithm for stochastic optimization of the dictionary variables in the objective function. In addition, we consider including a sparse total variation (TV) penalty, for which we perform efficient optimization using the alternating direction method of multipliers (ADMM). We then show the advantages of the proposed Bayesian nonparametric regularization framework on several CS-MRI problems in Section IV.

II. BACKGROUND AND RELATED WORK

We use the following notation: Let x ∈ C^N be a √N × √N MR image in vectorized form. Let F_u ∈ C^{u×N}, u ≪ N, be the undersampled Fourier encoding matrix and y = F_u x ∈ C^u represent the sub-sampled set of k-space measurements. The goal is to estimate x from the small fraction of k-space measurements y. For dictionary learning, let R_i be the ith patch extraction matrix. That is, R_i is a P × N matrix of all zeros except for a one in each row that extracts a vectorized √P × √P patch from the image, R_i x ∈ C^P for i = 1, . . . , N. We use overlapping image patches with a shift of one pixel and allow a patch to wrap around the image at the boundaries for mathematical convenience [15], [22]. All norms are extended to complex vectors when necessary, $\|a\|_p = (\sum_i |a_i|^p)^{1/p}$, where |a_i| is the modulus of the complex number a_i.

A. Two approaches to CS-MRI inversion

We focus on single-channel CS-MRI inversion via optimizing an unconstrained function of the form

$\arg\min_x \ h(x) + \frac{\lambda}{2}\|F_u x - y\|_2^2,$   (1)

where $\|F_u x - y\|_2^2$ is a data fidelity term, λ > 0 is a parameter and h(x) is a regularization function that controls properties of the image we want to reconstruct. As discussed in the introduction, the function h can take several forms, but tends to fall into one of two categories according to whether image-level or patch-level information is considered. We next review these two approaches.

1) Image-level sparse regularization: CS-MRI with an image-level, or global regularization function h_g(x) is one in which sparsity is enforced within a transform domain defined on the entire image. For example, in Sparse MRI [3] the regularization function is

$h_g(x) = \|Wx\|_1 + \mu\, TV(x),$   (2)

where W is the wavelet basis and TV(x) is the total variation (spatial finite differences) of the image. Regularizing with this function requires that the image be sparse in the wavelet domain, as measured by the ℓ1 norm of the wavelet coefficients $\|Wx\|_1$, which acts as a surrogate for ℓ0 [1], [2]. The total variation term enforces homogeneity within the image by encouraging neighboring pixels to have similar values while allowing for sudden high frequency jumps at edges. The parameter µ > 0 controls the trade-off between the two terms. A variety of other image-level regularization approaches have been proposed along these lines, e.g., [4], [5], [7].

2) Patch-level sparse regularization: An alternative to the image-level sparsity constraint h_g(x) is a patch-level, or local regularization function h_l(x), which enforces that patches (square sub-regions of the image) have a sparse representation according to a dictionary. One possible general form of such a regularization function is

$h_l(x) = \sum_{i=1}^{N} \frac{\gamma}{2}\|R_i x - D\alpha_i\|_2^2 + f(\alpha_i, D),$   (3)

where the dictionary matrix is D ∈ C^{P×K} and α_i is a K-dimensional vector in R^K. An important difference between h_l(x) and h_g(x) is the additional function f(α_i, D). While image-level sparsity constraints fall within a predefined transform domain, such as the wavelet basis, the sparse dictionary domain can be unknown for patch-level regularization and learned from data. The function f enforces sparsity by learning a D for which α_i is sparse.¹ For example, [9] uses K-SVD to learn D off-line, and then approximately optimizes the objective function

$\arg\min_{\alpha_{1:N}} \sum_{i=1}^{N} \|R_i x - D\alpha_i\|_2^2 \ \text{subject to} \ \|\alpha_i\|_0 \le T, \ \forall i,$   (4)

using orthogonal matching pursuits (OMP) [21]. In this case, the ℓ0 penalty on the additional parameters α_i makes this a non-convex problem. Using this definition of h_l(x) in (1), a local optimal solution can be found by an alternating minimization procedure [32]: First solve the least squares solution for x using the current values of α_i and D, and then update α_i and D, or only α_i if D is learned off-line.

(Footnote 1) The dependence of h_l(x) on α and D is implied in our notation.
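As an illustration of the sparse coding step in (4), the following is a minimal NumPy sketch of a textbook greedy OMP routine applied patch by patch against a fixed dictionary. It is not the authors' implementation or the exact routine of [21]; the helper names and the sparsity level T are ours.

```python
import numpy as np

def omp(D, y, T):
    """Greedy orthogonal matching pursuit:
    approximately solve  min_a ||y - D a||_2^2  s.t.  ||a||_0 <= T."""
    P, K = D.shape
    support, r = [], y.copy()
    a = np.zeros(K, dtype=D.dtype)
    for _ in range(T):
        k = int(np.argmax(np.abs(D.conj().T @ r)))   # atom most correlated with residual
        if k not in support:
            support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)  # refit on current support
        a[:] = 0
        a[support] = coef
        r = y - D @ a                                 # update residual
    return a

def sparse_code_all(D, patches, T=5):
    """Sparse-code every vectorized patch R_i x (columns of `patches`, a P x N array)."""
    return np.stack([omp(D, patches[:, i], T) for i in range(patches.shape[1])], axis=1)
```

In the alternating scheme described above, a call like `alpha = sparse_code_all(D, patches, T)` would supply the α_i used in the subsequent least-squares update of x.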
B. Dictionary learning with beta process factor analysis

Typical dictionary learning approaches require a predefined dictionary size and, for each patch, the setting of either a
sparsity level T, or an error threshold ε to determine how many dictionary elements are used. In both cases, if the settings do not agree with ground truth, the performance can significantly degrade. Instead, we consider a Bayesian nonparametric method called beta process factor analysis (BPFA) [23], which has been shown to successfully infer both of these values, as well as have competitive performance with algorithms in several application areas [23]–[26], and see [33]–[36] for related algorithms. The beta process is driven by an underlying Poisson process, and so its properties as a Bayesian nonparametric prior are well understood [27]. Originally used for survival analysis in the statistics literature, its use for latent factor modeling has been significantly increasing within the machine learning community [23]–[26], [28], [33]–[36].

Algorithm 1 Dictionary learning with BPFA
1) Construct a dictionary D = [d_1, . . . , d_K], with
   d_k ∼ N(0, P^{-1} I_P), k = 1, . . . , K.
2) Draw a probability π_k ∈ [0, 1] for each d_k:
   π_k ∼ Beta(cγ/K, c(1 − γ/K)), k = 1, . . . , K.
3) Draw precision values for noise and each weight:
   γ_ε ∼ Gam(g_0, h_0), γ_s ∼ Gam(e_0, f_0).
4) For the ith patch in x:
   a) Draw the vector s_i ∼ N(0, γ_s^{-1} I_K).
   b) Draw the binary vector z_i with z_{ik} ∼ Bern(π_k).
   c) Define α_i = s_i ◦ z_i by an element-wise product.
   d) Sample noisy patch R_i x ∼ N(Dα_i, γ_ε^{-1} I_P).
5) Construct the image x as the average of all R_i x that overlap on a given pixel.

1) Generative model: We give the original hierarchical prior structure of the BPFA model in Algorithm 1, extending this to complex-valued dictionaries in Section III-A. With this approach, the model constructs a dictionary matrix D ∈ R^{P×K} (C^{P×K} below) of i.i.d. random variables, and assigns probability π_k to vector d_k. The parameters for these probabilities are set such that most of the π_k are expected to be small, with a few large. In Algorithm 1 we use an approximation to the beta process.² Under this parameterization, each patch R_i x extracted from the image x is modeled as a sparse weighted combination of the dictionary elements, as determined by the element-wise product of z_i ∈ {0, 1}^K with the Gaussian vector s_i. What makes the model nonparametric is that for many values of k, the values of z_{ik} will equal zero for all i since π_k will be very small; the model learns the number of these unused dictionary elements and their index values from the data. Therefore, the value of K should be set to a large number that is more than the expected size of the dictionary. It can be shown that, under the assumptions of this prior, in the limit K → ∞, the number of dictionary elements used by a patch is Poisson(γ) distributed and the total number of dictionary elements used by the data grows like cγ ln N, where N is the number of patches [28]. The parameters of the model include c, γ, e_0, f_0, g_0, h_0 and K; we discuss setting these values in Section IV.

(Footnote 2) For a finite c > 0 and γ > 0, the random measure $H = \sum_{k=1}^{K} \pi_k \delta_{d_k}$ converges weakly to a beta process as K → ∞ [27], [24].
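To make the generative structure of Algorithm 1 concrete, the following is a small NumPy sketch of the finite-K BPFA prior for real-valued patches (the complex-valued extension used later is omitted). Function and variable names are illustrative; the Gam(a, b) draws are treated as shape/rate, which is an assumption about the intended parameterization.

```python
import numpy as np

def bpfa_generate(N, P, K, c=1.0, gamma=1.0, e0=1.0, f0=1.0, g0=1.0, h0=1.0, seed=None):
    """Draw N patches of dimension P from the finite-K approximation of the BPFA prior."""
    rng = np.random.default_rng(seed)
    D = rng.normal(0.0, np.sqrt(1.0 / P), size=(P, K))           # 1) d_k ~ N(0, P^-1 I_P)
    pi = rng.beta(c * gamma / K, c * (1.0 - gamma / K), size=K)  # 2) atom probabilities pi_k
    gamma_eps = rng.gamma(g0, 1.0 / h0)                          # 3) noise precision (shape/rate assumed)
    gamma_s = rng.gamma(e0, 1.0 / f0)                            #    weight precision
    S = rng.normal(0.0, np.sqrt(1.0 / gamma_s), size=(N, K))     # 4a) weights s_i
    Z = rng.binomial(1, pi, size=(N, K))                         # 4b) binary usage z_i, z_ik ~ Bern(pi_k)
    alpha = S * Z                                                # 4c) alpha_i = s_i o z_i
    noise = rng.normal(0.0, np.sqrt(1.0 / gamma_eps), size=(N, P))
    patches = alpha @ D.T + noise                                # 4d) R_i x ~ N(D alpha_i, gamma_eps^-1 I_P)
    return patches, D, alpha, pi

# e.g. 5000 patches of size 6x6 (P = 36) with the K = 108 truncation used in the paper
patches, D, alpha, pi = bpfa_generate(N=5000, P=36, K=108)
```

Because most π_k drawn from Beta(cγ/K, c(1 − γ/K)) are close to zero, most columns of Z are entirely zero, which is exactly the mechanism by which the model prunes unused dictionary elements.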
TABLE I: Peak signal-to-noise ratio (PSNR) for an image denoised by BPFA, compared with K-SVD using the correct (Match) and incorrect (Mismatch) noise parameter.

σ        K-SVD Match   K-SVD Mismatch   BPFA Results   BPFA Learned noise
20/255   32.28         28.94            32.88          20.43/255
25/255   31.08         28.60            31.81          25.46/255
30/255   29.99         28.35            30.94          30.47/255

[Figure 1: (a) Noisy image, (b) denoising by BPFA, (c) dictionary probabilities, (d) dictionary elements per patch.]
Fig. 1. (a)-(b) An example of denoising by BPFA (image scaled to [0,1]). (c) Shows the final probabilities of the dictionary elements and (d) shows a distribution on the number of dictionary elements used per patch.

2) Relationship to K-SVD: Another widely used dictionary learning method is K-SVD [20]. Though they are models for the same problem, BPFA and K-SVD have some significant differences that we briefly discuss. K-SVD learns the sparsity pattern of the coding vector α_i using the OMP algorithm [21] for each i. Holding the sparsity pattern fixed, it then updates each dictionary element and dimension of α jointly by a rank one approximation to the residual. Unlike BPFA, it learns as many dictionary elements as are given to it, so K should be set wisely. BPFA on the other hand automatically prunes unneeded elements, and updates the sparsity pattern by using the posterior distribution of a Bernoulli process, which is significantly different from OMP. It updates the weights and the dictionary from their Gaussian posteriors as well. Because of this probabilistic structure, we derive a sampling algorithm for these variables that takes advantage of marginalization, and naturally learns the auxiliary variables γ_ε and γ_s.

3) Example denoising problem: As we will see, the relationship of dictionary learning to CS-MRI is essentially as a denoising step. To this end, we briefly illustrate BPFA on a denoising problem. Denoising of an image using dictionary
learning proceeds by first learning the dictionary representation of each patch, R_i x ≈ Dα_i. The denoised reconstruction of x using BPFA is then $x_{\mathrm{BPFA}} = \frac{1}{P}\sum_i R_i^T D\alpha_i$.

We show an example using 6 × 6 patches extracted from the noisy 512 × 512 image shown in Figure 1(a). In Figure 1(b) we show the resulting denoised image. For this problem we truncated the dictionary size to K = 108 and set all other model parameters to one. In Figures 1(c) and 1(d) we show some statistics from dictionary learning. For example, Figure 1(c) shows the values of π_k sorted, where we see that fewer than 100 elements are used by the data, many of which are very sparsely used. Figure 1(d) shows the empirical distribution of the number of elements used per patch. We see the ability of the model to adapt the sparsity to the complexity of the patch.

In Table I we show PSNR results for three noise variance levels. For K-SVD, we consider the case when the error parameter matches the ground truth, and when it mismatches it by a magnitude of five. As expected, when K-SVD does not have an appropriate parameter setting the performance suffers. BPFA on the other hand adaptively infers this value, which helps improve the denoising.

III. CS-MRI WITH BPFA AND TV PENALTY

We next present our approach for reconstructing single-channel MR images from highly undersampled k-space data. In reference to the discussion in Section II, we consider a sparsity constraint of the form

$\arg\min_{x,\varphi}\ \lambda_g h_g(x) + h_l(x) + \frac{\lambda}{2}\|F_u x - y\|_2^2,$   (5)

$h_g(x) := TV(x), \qquad h_l(x) := \sum_{i=1}^{N} \frac{\gamma_\varepsilon}{2}\|R_i x - D\alpha_i\|_2^2 + f(\varphi_i).$

For the local regularization function h_l(x) we use BPFA as given in Algorithm 1 in Section II-B. The parameters to be optimized for this penalty are contained in the set φ_i = {D, s_i, z_i, γ_ε, γ_s, π}, and are defined in Algorithm 1. We note that only s_i and z_i vary in i, while the rest are shared by all patches. The regularization term γ_ε is a model variable that corresponds to an inverse variance parameter of the multivariate Gaussian likelihood. This likelihood is equivalently viewed as the squared error penalty term in h_l(x) in (5). This term acts as the sparse basis for the image and also aids in producing a denoised reconstruction, as discussed in Sections II-B, III-B and IV-B. For the global regularization function h_g(x) we use the total variation of the image. This term encourages homogeneity within contiguous regions of the image, while still allowing for sharp jumps in pixel value at edges due to the underlying ℓ1 penalty. The regularization parameters λ_g, γ_ε and λ control the trade-off between the terms in this optimization. Since we sample a new value of γ_ε with each iteration of the algorithm discussed shortly, this trade-off is adaptively changing.

For the total variation penalty TV(x) we use the isotropic TV model. Let ψ_i be the 2 × N difference operator for pixel i. Each row of ψ_i contains a 1 centered on pixel i, and the first row also has a −1 on the pixel directly above pixel i, while the second has a −1 corresponding to the pixel to the right, and zeros elsewhere. Let Ψ = [ψ_1^T, . . . , ψ_N^T]^T be the resulting 2N × N difference matrix for the entire image. The TV coefficients are β = Ψx ∈ C^{2N}, and the isotropic TV penalty is $TV(x) = \sum_i \|\psi_i x\|_2 = \sum_i \sqrt{|\beta|_{2i-1}^2 + |\beta|_{2i}^2}$, where i ranges over the pixels in the MR image. For optimization we use the alternating direction method of multipliers (ADMM) [31], [30]. ADMM works by performing dual ascent on the augmented Lagrangian objective function introduced for the total variation coefficients. For completeness, we give a brief review of ADMM in the appendix.

A. Algorithm

We present an algorithm for finding a local optimal solution to the non-convex objective function given in (5). We can write this objective as

$L(x, \varphi) = \lambda_g \sum_i \|\psi_i x\|_2 + \sum_i \frac{\gamma_\varepsilon}{2}\|R_i x - D\alpha_i\|_2^2 + \sum_i f(\varphi_i) + \frac{\lambda}{2}\|F_u x - y\|_2^2.$   (6)

We seek to minimize this function with respect to x and the dictionary learning variables φ_i = {D, s_i, z_i, γ_ε, γ_s, π}.

Our first step is to put the objective into a more suitable form. We begin by defining the TV coefficients for the ith pixel as β_i := [β_{2i−1}, β_{2i}]^T = ψ_i x. We introduce the vector of Lagrange multipliers η_i, and then split β_i from ψ_i x by relaxing the equality via an augmented Lagrangian. This results in the objective function

$L(x, \beta, \eta, \varphi) = \sum_{i=1}^{N} \Big[\lambda_g\|\beta_i\|_2 + \eta_i^T(\psi_i x - \beta_i) + \frac{\rho}{2}\|\psi_i x - \beta_i\|_2^2\Big] + \sum_{i=1}^{N} \Big[\frac{\gamma_\varepsilon}{2}\|R_i x - D\alpha_i\|_2^2 + f(\varphi_i)\Big] + \frac{\lambda}{2}\|F_u x - y\|_2^2.$   (7)

From the ADMM theory [32], this objective will have (local) optimal values β*_i and x* with β*_i = ψ_i x*, and so the equality constraints will be satisfied [31].³ Optimizing this function can be split into three separate sub-problems: one for TV, one for BPFA and one for updating the reconstruction x. Following the discussion of ADMM in the appendix, we define u_i = (1/ρ)η_i and complete the square in the first line of (7). We then cycle through the following three sub-problems,

(P1) $\beta_i' = \arg\min_\beta\ \lambda_g\|\beta\|_2 + \frac{\rho}{2}\|\psi_i x - \beta + u_i\|_2^2$, and $u_i' = u_i + \psi_i x - \beta_i'$, i = 1, . . . , N,

(P2) $\varphi' = \arg\min_\varphi\ \sum_i \frac{\gamma_\varepsilon}{2}\|R_i x - D\alpha_i\|_2^2 + f(\varphi_i)$,

(P3) $x' = \arg\min_x\ \sum_i \frac{\rho}{2}\|\psi_i x - \beta_i' + u_i'\|_2^2 + \sum_i \frac{\gamma_\varepsilon'}{2}\|R_i x - D'\alpha_i'\|_2^2 + \frac{\lambda}{2}\|F_u x - y\|_2^2$.

(Footnote 3) For a fixed D, α_{1:N} and x the solution is also globally optimal.

Solutions for sub-problems P1 and P3 are globally optimal (conditioned on the most recent values of all other parameters). We cannot solve P2 analytically since the optimal values for the set of all BPFA variables do not
have a closed form solution. Our approach for P2 is to use stochastic optimization by Gibbs sampling each variable of BPFA conditioned on current values of all other variables. We next present the updates for each sub-problem. We give an outline in Algorithm 2.

Algorithm 2 Outline of algorithm
Input: y – undersampled k-space data
Output: x – reconstructed MR image
Initialize: x = F_u^H y and each u_i = 0. Sample D from the prior.
Step 1. P1: Optimize each β_i via shrinkage.
Step 2. Update Lagrange multiplier vectors u_i.
Step 3. P2: Gibbs sample BPFA variables once.
Step 4. P3: Solve for x using the Fourier domain.
If not converged, return to Step 1.

1) Algorithm for P1 (total variation): We can solve for β_i exactly for each pixel i = 1, . . . , N by using a generalized shrinkage operation [31],

$\beta_i' = \max\Big\{\|\psi_i x + u_i\|_2 - \frac{\lambda_g}{\rho},\ 0\Big\}\cdot\frac{\psi_i x + u_i}{\|\psi_i x + u_i\|_2}.$   (8)

We recall that β_i corresponds to the 2-dimensional TV coefficients for pixel i, with differences in one direction vertically and horizontally. We then update the corresponding Lagrange multiplier, u_i' = u_i + ψ_i x − β_i'.

2) Algorithm for P2 (BPFA): We update the parameters of BPFA using Gibbs sampling. We are therefore stochastically optimizing (7), but only for this sub-problem. With reference to Algorithm 1, the P2 sub-problem entails sampling new values for the complex dictionary D, the binary vectors z_i and real-valued weights s_i (with which we construct α_i = s_i ◦ z_i through the element-wise product), the precisions γ_ε and γ_s, and the probabilities π_{1:K}, with π_k giving the probability that z_{ik} = 1. In principle, there is no limit to the number of samples that can be made, with the final sample giving the updates used in the other sub-problems. We found that a single sample is sufficient in practice and leads to a faster algorithm. We describe the sampling procedure below.

a) Sample dictionary D: We define the P × N matrix X = [R_1 x, . . . , R_N x], which is a complex matrix of all vectorized patches extracted from the image x. We also define the K × N matrix α = [α_1, . . . , α_N] containing the dictionary weight coefficients for the corresponding columns in X such that Dα is an approximation of X to which we add noise from a circularly-symmetric complex normal distribution. The update for the dictionary D is

$D = X\alpha^T\big(\alpha\alpha^T + (P/\gamma_\varepsilon)I_K\big)^{-1} + E, \qquad E_{p,:} \overset{\mathrm{ind}}{\sim} \mathcal{CN}\big(0,\ (\gamma_\varepsilon\,\alpha\alpha^T + P I_K)^{-1}\big),\ p = 1, \ldots, P,$   (9)

where E_{p,:} is the pth row of E. To sample this, we can first draw E_{p,:} from a multivariate Gaussian distribution with this covariance structure, followed by an i.i.d. uniform rotation of each value in the complex plane. We note that the first term in Equation (9) is the ℓ2-regularized least squares solution for D. The addition of correlated Gaussian noise in the complex plane generates the sample from the conditional posterior of D. Since both the number of pixels and γ_ε will tend to be very large, the variance of the noise is small and the mean term dominates the update for D.

b) Sample sparse coding α_i: Sampling α_i entails sampling s_{ik} and z_{ik} for each k. We sample these values using block sampling. We recall that to block sample two variables from their joint conditional posterior distribution, (s, z) ∼ p(s, z|−), one can first sample z from the marginal distribution, z ∼ p(z|−), and then sample s|z ∼ p(s|z, −) from the conditional distribution. The other sampling direction is possible as well, but for our problem sampling z → s|z is more efficient for finding a mode of the objective function.

We define r_{i,−k} to be the residual error in approximating the ith patch with the current values from BPFA minus the kth dictionary element, $r_{i,-k} = R_i x - \sum_{j \ne k}(s_{ij} z_{ij})d_j$. We then sample z_{ik} from its conditional posterior Bernoulli distribution z_{ik} ∼ p_{ik}δ_1 + q_{ik}δ_0, where following a simplification,

$p_{ik} \propto \pi_k\big(1 + (\gamma_\varepsilon/\gamma_s)\, d_k^H d_k\big)^{-\frac{1}{2}} \exp\Big\{\frac{\gamma_\varepsilon}{2}\,\mathrm{Re}(d_k^H r_{i,-k})^2 \big/ (\gamma_s/\gamma_\varepsilon + d_k^H d_k)\Big\},$   (10)

$q_{ik} \propto 1 - \pi_k.$   (11)

The symbol H denotes the conjugate transpose. The probabilities can be obtained by dividing both of these terms by their sum. We observe that the probability that z_{ik} = 1 takes into account how well dictionary element d_k correlates with the residual r_{i,−k}. After sampling z_{ik} we sample the corresponding weight s_{ik} from its conditional posterior Gaussian distribution,

$s_{ik}\,|\,z_{ik} \sim \mathcal{N}\Big(z_{ik}\,\frac{\mathrm{Re}(d_k^H r_{i,-k})}{\gamma_s/\gamma_\varepsilon + d_k^H d_k},\ \frac{1}{\gamma_s + \gamma_\varepsilon z_{ik}\, d_k^H d_k}\Big).$   (12)

When z_{ik} = 1, the mean of s_{ik} is the regularized least squares solution and the variance will be small if γ_ε is large. When z_{ik} = 0, s_{ik} is sampled from the prior, but does not factor in the model in this case.

c) Sample γ_ε and γ_s: We next sample from the conditional gamma posterior distribution of the noise precision and weight precision,

$\gamma_\varepsilon \sim \mathrm{Gam}\Big(g_0 + \tfrac{1}{2}PN,\ h_0 + \tfrac{1}{2}\sum_i \|R_i x - D\alpha_i\|_2^2\Big),$   (13)

$\gamma_s \sim \mathrm{Gam}\Big(e_0 + \tfrac{1}{2}\sum_{i,k} z_{ik},\ f_0 + \tfrac{1}{2}\sum_{i,k} z_{ik} s_{ik}^2\Big).$   (14)

The expected value of each variable is the first term of the distribution divided by the second, which is close to the inverse of the average empirical error for γ_ε.

d) Sample π_k: Sample each π_k from its conditional beta posterior distribution,

$\pi_k \sim \mathrm{Beta}\Big(a_0 + \sum_{i=1}^{N} z_{ik},\ b_0 + \sum_{i=1}^{N} (1 - z_{ik})\Big).$   (15)

The parameters to the beta distribution include counts of how many times dictionary element d_k was used by a patch.
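The following is a compact NumPy sketch of one Gibbs sweep over the per-patch variables and hyperparameters, Equations (10)–(15); the dictionary update (9) is omitted and the variable names, numerical guards and the shape/rate convention for the gamma draws are our own assumptions, not the authors' code.

```python
import numpy as np

def bpfa_gibbs_sweep(Xp, D, S, Z, pi, g_eps, g_s,
                     a0=1.0, b0=1.0, e0=1.0, f0=1.0, g0=1.0, h0=1.0, seed=None):
    """One sweep over (z_ik, s_ik), gamma_eps, gamma_s and pi_k given the dictionary D.
    Xp: P x N complex patch matrix [R_1 x, ..., R_N x]; D: P x K; S, Z: N x K (real, 0/1)."""
    rng = np.random.default_rng(seed)
    P, N = Xp.shape
    K = D.shape[1]
    dHd = np.sum(np.abs(D) ** 2, axis=0)                  # d_k^H d_k for every atom
    A = S * Z                                             # current coefficients alpha
    R = Xp - D @ A.T                                      # residual over all patches (P x N)
    for k in range(K):
        Rk = R + np.outer(D[:, k], A[:, k])               # residual with atom k removed: r_{i,-k}
        proj = np.real(D[:, k].conj() @ Rk)               # Re(d_k^H r_{i,-k}) for every patch
        denom = g_s / g_eps + dHd[k]
        # Eq. (10)-(11): Bernoulli posterior for z_ik
        log_p1 = (np.log(pi[k] + 1e-16)
                  - 0.5 * np.log1p((g_eps / g_s) * dHd[k])
                  + 0.5 * g_eps * proj ** 2 / denom)
        log_p0 = np.log(1.0 - pi[k] + 1e-16)
        prob1 = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
        Z[:, k] = rng.random(N) < prob1
        # Eq. (12): Gaussian posterior for s_ik (a prior draw when z_ik = 0)
        mean = np.where(Z[:, k] == 1, proj / denom, 0.0)
        var = 1.0 / (g_s + g_eps * Z[:, k] * dHd[k])
        S[:, k] = rng.normal(mean, np.sqrt(var))
        A[:, k] = S[:, k] * Z[:, k]
        R = Rk - np.outer(D[:, k], A[:, k])               # restore full residual with new alpha_k
    # Eq. (13)-(14): precisions (Gam(a, b) treated as shape/rate)
    err = np.sum(np.abs(Xp - D @ A.T) ** 2)
    g_eps = rng.gamma(g0 + 0.5 * P * N, 1.0 / (h0 + 0.5 * err))
    g_s = rng.gamma(e0 + 0.5 * Z.sum(), 1.0 / (f0 + 0.5 * np.sum(Z * S ** 2)))
    # Eq. (15): atom probabilities
    pi = rng.beta(a0 + Z.sum(axis=0), b0 + N - Z.sum(axis=0))
    return S, Z, pi, g_eps, g_s
```

As noted above, one such sweep per outer iteration is sufficient in practice; the sampled values are then held fixed while the P1 and P3 sub-problems are solved.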
3) Algorithm for P3 (MRI reconstruction): The final sub-problem is to reconstruct the image x. Our approach takes advantage of the Fourier domain similar to other methods, e.g. [14], [30]. The corresponding objective function is

$x' = \arg\min_x\ \sum_{i=1}^{N} \frac{\rho}{2}\|\psi_i x - \beta_i + u_i\|_2^2 + \sum_{i=1}^{N} \frac{\gamma_\varepsilon}{2}\|R_i x - D\alpha_i\|_2^2 + \frac{\lambda}{2}\|F_u x - y\|_2^2.$

Since this is a least squares problem, x has a closed form solution that satisfies

$\big(\rho\Psi^T\Psi + \gamma_\varepsilon \textstyle\sum_i R_i^T R_i + \lambda F_u^H F_u\big)\, x = \rho\Psi^T(\beta - u) + \gamma_\varepsilon P\, x_{\mathrm{BPFA}} + \lambda F_u^H y.$   (16)

We recall that Ψ is the matrix of stacked ψ_i. The vector β is also obtained by stacking each β_i and u is the vector formed by stacking u_i. The vector x_BPFA is the denoised reconstruction from BPFA using the current D and α_{1:N}, which results from the definition $x_{\mathrm{BPFA}} = \frac{1}{P}\sum_i R_i^T D\alpha_i$.

We observe that inverting the left N × N matrix is computationally prohibitive since N is the number of pixels in the image. Fortunately, given the form of the matrix in Equation (16) we can use the procedure described in [14] and simplify the problem by working in the Fourier domain. This allows for element-wise updates in k-space, followed by an inverse Fourier transform. We represent x as x = F^H θ, where θ is the Fourier transform of x. We then take the Fourier transform of each side of Equation (16) to give

$F\big(\rho\Psi^T\Psi + \gamma_\varepsilon \textstyle\sum_i R_i^T R_i + \lambda F_u^H F_u\big)F^H \theta = \rho F\Psi^T(\beta - u) + \gamma_\varepsilon F P\, x_{\mathrm{BPFA}} + \lambda F F_u^H y.$   (17)

The left-hand matrix simplifies to a diagonal matrix,

$F\big(\rho\Psi^T\Psi + \gamma_\varepsilon \textstyle\sum_i R_i^T R_i + \lambda F_u^H F_u\big)F^H = \rho\Lambda + \gamma_\varepsilon P I_N + \lambda I_N^u.$   (18)

Term-by-term this results as follows: The product of the finite difference operator matrix Ψ with itself yields a circulant matrix, which has the rows of the Fourier matrix F as its eigenvectors and eigenvalues equal to Λ = FΨ^TΨF^H. The matrix R_i^T R_i is a matrix of all zeros, except for ones on the diagonal entries that correspond to the indices of x associated with the ith patch. Since each pixel appears in P patches, the sum over i gives P I_N, and the Fourier product cancels. The final diagonal matrix I_N^u also contains all zeros, except for ones along the diagonal corresponding to the indices in k-space that are measured, which results from F F_u^H F_u F^H.

Since the left matrix is diagonal we can perform element-wise updating of the Fourier coefficients θ,

$\theta_i = \frac{\rho F_i\Psi^T(\beta - u) + \gamma_\varepsilon P\, F_i x_{\mathrm{BPFA}} + \lambda F_i F_u^H y}{\rho\Lambda_{ii} + \gamma_\varepsilon P + \lambda F_i F_u^H \mathbf{1}}.$   (19)

We observe that the rightmost term in the numerator and denominator equals zero if i is not a measured k-space location. We invert θ via the inverse Fourier transform F^H to obtain the reconstructed MR image x'.

[Figure 2: (a) Random 25%, (b) Cartesian 30%, (c) Radial 25%.]
Fig. 2. The three masks considered for a given sampling percentage.

B. Discussion on λ

In noise-free compressed sensing, the fidelity term λ can tend to infinity giving an equality constraint for the measured k-space values [1]. However, when y is noisy the setting of λ is critical for most CS-MRI algorithms since this parameter controls the level of denoising in the reconstructed image. We note that a feature of dictionary learning CS-MRI approaches is that λ can still be set to a very large value, and so parameter selection isn't necessary here. This is because a denoised version of the image is obtained through dictionary learning (x_BPFA in this paper) and can be taken as the denoised reconstruction. In Equation (19), we observe that by setting λ to a large value, we are effectively fixing the measured k-space values and using the k-space projection of BPFA and TV to fill in the missing values. The reconstruction x will be noisy, but have artifacts due to sub-sampling removed. The output image x_BPFA is a denoised version of x using BPFA in essentially the same manner as in Section II-B3. Therefore, the quality of our algorithm depends largely on the quality of BPFA as an image denoising algorithm [25]. We show examples of this using synthetic and clinical data in Sections IV-B and IV-E.

IV. EXPERIMENTAL RESULTS

We evaluate the proposed algorithm on real-valued and complex-valued MRI, and on a synthetic phantom. We consider three sampling masks: 2D random sampling, Cartesian sampling with random phase encodes (1D random), and pseudo radial sampling.⁴ We show an example of each mask in Figure 2. We consider a variety of sampling rates for each mask. As a performance measure we use PSNR, and also consider SSIM [37]. We compare with three other algorithms: Sparse MRI [3]⁵, which as discussed above is a combination of wavelets and total variation, DLMRI [14]⁶, which is a dictionary learning method based on K-SVD, and PBDW [15]⁷, which is a patch-based method that uses directional wavelets and therefore places greater restrictions on the dictionary. We use the publicly available code for these algorithms indicated above and used the built-in parameter settings, or those indicated in the relevant papers. We also compare with the BPFA algorithm without using total variation by setting λ_g = 0.

(Footnote 4) We used codes referenced in [3], [8], [10] to generate these masks.
(Footnote 5) http://www.eecs.berkeley.edu/~mlustig/Software.html
(Footnote 6) http://www.ifp.illinois.edu/~yoram/DLMRI-Lab/Documentation.html
(Footnote 7) http://www.quxiaobo.org/index publications.html
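As a summary of the reconstruction algorithm evaluated below, the following is a minimal NumPy sketch of the P1 shrinkage step (8) together with the element-wise Fourier-domain P3 solve (19) for an n × n image; the P2 sweep is assumed to be supplied separately (for example, the Gibbs sweep sketched in Section III-A). The finite-difference convention, the placement of the measured k-space values on the grid and the FFT normalization are our assumptions, so this is an illustration of the update structure rather than the authors' released code.

```python
import numpy as np

def tv_forward(x):
    """Psi x: per-pixel vertical and horizontal forward differences with wrap-around."""
    return np.stack([x - np.roll(x, -1, axis=0), x - np.roll(x, -1, axis=1)])

def tv_adjoint(b):
    """Psi^T b for b of shape (2, n, n)."""
    return (b[0] - np.roll(b[0], 1, axis=0)) + (b[1] - np.roll(b[1], 1, axis=1))

def p1_shrink(x, u, lam_g, rho):
    """Eq. (8): isotropic shrinkage of the per-pixel 2-vector psi_i x + u_i."""
    v = tv_forward(x) + u
    norm = np.sqrt(np.sum(np.abs(v) ** 2, axis=0)) + 1e-12
    beta = np.maximum(norm - lam_g / rho, 0.0) * v / norm
    u_new = u + tv_forward(x) - beta                     # dual (Lagrange multiplier) update
    return beta, u_new

def p3_update(beta, u, x_bpfa, y_full, mask, rho, g_eps, P, lam):
    """Eq. (19): element-wise k-space solve. `y_full` holds the measured k-space values
    on the sampled grid locations (zero elsewhere); `mask` is 1 on those locations."""
    n = x_bpfa.shape[0]
    kernel = np.zeros((n, n))                            # convolution kernel of Psi^T Psi
    kernel[0, 0] = 4.0
    kernel[0, 1] = kernel[1, 0] = kernel[0, -1] = kernel[-1, 0] = -1.0
    Lam = np.real(np.fft.fft2(kernel))                   # eigenvalues of the circulant Psi^T Psi
    num = (rho * np.fft.fft2(tv_adjoint(beta - u))
           + g_eps * P * np.fft.fft2(x_bpfa)
           + lam * y_full)
    den = rho * Lam + g_eps * P + lam * mask
    return np.fft.ifft2(num / den)                       # reconstructed image x'
```

A full iteration of Algorithm 2 would then call `p1_shrink`, run one BPFA Gibbs sweep on the patches of the current x to form x_BPFA, and finish with `p3_update`; with a very large λ the update leaves the measured k-space values essentially fixed, as discussed in Section III-B.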


A. Set-up

For all images, we extract 6 × 6 patches where each pixel defines the upper left corner of a patch and wrap around the image at the boundaries; we investigate different patch sizes later to show that this is a reasonable size. We initialize x by zero-filling in k-space. We use a dictionary with K = 108 initial dictionary elements, recalling that the final number of dictionary elements will be smaller due to the sparse BPFA prior. If 108 is found to be too small, K can be increased with the result being a slower inference algorithm.⁸ We ran 1000 iterations and use the results of the last iteration.

For regularization parameters, we set the data fidelity term λ = 10^100. We are therefore effectively requiring equality with the measured values of k-space and allowing BPFA to fill in the missing values, as well as give a denoised reconstruction, as discussed in Section III-B and highlighted below in Sections IV-B and IV-E. After trying several values, we also found λ_g = 10 and ρ = 1000 to give good results. We set the BPFA hyperparameters as c = γ = e_0 = f_0 = g_0 = h_0 = 1. These settings result in a relatively non-informative prior given the amount of data we have. However, we note that our algorithm was robust to these values, since the data overwhelms these prior values when calculating posterior distributions.

B. Experiments on a GE phantom

We consider a noisy synthetic example to highlight the advantage of dictionary learning for CS-MRI. In Figure 3 we show results on a 256 × 256 GE phantom with additive noise having standard deviation σ = 0.1. In this experiment we use BPFA without TV to reconstruct the original image using 30% Cartesian sampling. We show the reconstruction using zero-filling in Figure 3(a). Since λ = 10^100, we see in Figure 3(b) that BPFA essentially helps reconstruct the underlying noisy image for x. However, using the denoising property of the BPFA model shown in Figure 1, we obtain the denoised reconstruction of Figure 3(c) by focusing on x_BPFA from Equation (16). This is in contrast with the best result we could obtain with TV in Figure 3(d), which places the TV penalty on the reconstructed image. As discussed, for TV the setting of λ relative to λ_g is important. We set λ = 1 and swept through λ_g ∈ (0, 0.15), showing the result with highest PSNR in Figure 3(d). Similar to Figure 1 we show statistics from the BPFA model in Figures 3(e)-(g). We see that roughly 80 dictionary elements were used (the unused noisy elements in Figure 3(e) are draws from the prior). We note that 2.28 elements were used on average by a patch given that at least one was used, which discounts the black regions.

[Figure 3: (a) Zero filling, (b) BPFA reconstruction (x), (c) BPFA denoising (x_BPFA), (d) Total variation reconstruction, (e) Dictionary (magnitude), (f) Dictionary probabilities, (g) Dictionary elements per patch.]
Fig. 3. GE data with noise (σ = 0.1) and 30% Cartesian sampling. BPFA (b) reconstructs the original noisy image, and (c) denoises the reconstruction simultaneously. (d) TV denoises as part of the reconstruction. Also shown are the dictionary learning variables sorted by π_k: (e) the dictionary, (f) the distribution on the dictionary, π_k, and (g) the normalized histogram of the number of dictionary elements used per patch.

C. Experiments on real-valued (synthetic) MRI

For our synthetic MRI experiments, we consider two publicly available real-valued 512 × 512 MRI⁹ of a shoulder and lumbar. We construct these problems by applying the relevant sampling mask to the projection of real-valued MRI into k-space. Though using such real-valued MRI data may not reflect clinical reality, we include this idealized setting to provide a complete set of experiments similar to other papers [3], [14], [15]. We evaluate the performance of our algorithm using PSNR and compare with Sparse MRI [3], DLMRI [14] and PBDW [15]. Although the original data is real-valued, we learn complex dictionaries since the reconstructions are complex. We consider our algorithm with and without the total variation penalty, denoted BPFA+TV and BPFA, respectively.

(Footnote 8) As discussed in Section II-B, in theory K can be infinitely large.
(Footnote 9) www3.americanradiology.com/pls/web1/wwimggal.vmg/wwimggal.vmg
TABLE II: PSNR results for the real-valued lumbar MRI as a function of sampling percentage and mask (Cartesian with random phase encodes, 2D random and pseudo radial).

Mask    Samp%   BPFA+TV   BPFA    DLMRI   SparseMRI   PBDW
Cart.   10      32.48     32.03   31.02   30.24       31.74
        20      36.07     35.84   33.92   33.44       35.19
        25      38.78     38.53   36.56   35.50       37.43
        30      41.08     40.12   38.87   35.57       39.23
        35      41.05     40.96   38.85   37.66       39.24
Rand.   10      42.82     40.81   38.25   25.87       37.09
        20      44.35     41.80   40.11   27.80       37.86
        25      48.11     47.09   43.51   37.22       43.65
        30      49.36     48.55   44.93   38.72       45.50
        35      50.20     49.19   45.87   41.70       46.85
Rad.    10      35.16     33.33   32.91   29.35       31.46
        20      41.69     41.18   38.38   35.69       38.01
        25      43.75     43.40   40.29   38.59       40.25
        30      45.22     44.95   41.86   37.37       42.11
        35      46.85     46.45   43.09   39.74       43.72

TABLE III: PSNR results for the real-valued shoulder MRI as a function of sampling percentage and mask (Cartesian with random phase encodes, 2D random and pseudo radial).

Mask    Samp%   BPFA+TV   BPFA    DLMRI   SparseMRI   PBDW
Cart.   10      32.65     30.79   31.02   27.65       28.88
        20      36.96     35.77   34.52   30.64       32.10
        25      38.45     37.97   35.69   32.44       34.12
        30      41.43     41.22   38.11   34.26       36.73
        35      41.33     41.14   38.44   34.50       36.76
Rand.   10      41.00     39.96   38.18   30.72       36.48
        20      43.53     42.40   39.38   32.08       39.39
        25      45.43     45.44   42.58   40.81       41.31
        30      46.89     46.86   44.03   43.47       43.12
        35      47.95     47.87   45.01   44.89       44.45
Rad.    10      34.30     33.88   33.27   29.18       31.60
        20      39.41     39.47   38.06   35.50       36.38
        25      41.40     41.52   39.73   38.70       38.21
        30      43.14     43.45   41.20   39.98       40.30
        35      44.69     44.99   42.58   39.11       41.72

[Figure 4: panels (a) BPFA+TV, (b) PBDW, (c) DLMRI, (d) Sparse MRI.]
Fig. 4. Absolute errors for 30% Cartesian sampling of the synthetic lumbar MRI.

[Figure 5: panels (a) BPFA+TV, (b) PBDW, (c) DLMRI, (d) Sparse MRI.]
Fig. 5. Absolute errors for 20% radial sampling of the shoulder MRI.

We present the PSNR results for all sampling masks and rates in Tables II and III. From these values we see the competitive performance of the proposed dictionary learning algorithm. We also see a slight improvement by the addition of the TV penalty. As expected, we observe that 2D random sampling produced the best results, followed by pseudo-radial sampling and Cartesian sampling, which is due to their decreasing level of incoherence, with greater incoherence producing artifacts that are more noise-like [3]. Since BPFA is good at denoising images, the algorithm naturally performs well in this setting. In Figures 4 and 5 we show the absolute value of the residuals of different algorithms using one experiment from each MRI. We see an improvement using the proposed method, which has more noise-like errors.

D. Experiments on complex-valued MRI

We also consider two clinically obtained complex-valued MRI: We use the T2-weighted brain MRI from [4], which is a 256 × 256 MRI of a healthy volunteer from a 3T Siemens Trio Tim MRI scanner using the T2-weighted turbo spin echo sequence (TR/TE = 6100/99 ms, 220 × 220 mm field of view, 3 mm slice thickness). We also use an MRI scan of a lemon obtained from the Research Center of Magnetic Resonance and Medical Imaging at Xiamen University (TE = 32 ms, size = 256 × 256, spin echo sequence, TR/TE = 10000/32 ms, FOV = 70 × 70 mm², 2-mm slice thickness). This MRI is from a 7T/160mm bore Varian MRI system (Agilent Technologies, Santa Clara, CA, USA) using a quadrature-coil probe.

For the brain MRI experiment we use both PSNR and
TABLE IV: PSNR/SSIM results for the complex-valued brain MRI as a function of sampling percentage. Sampling masks include Cartesian sampling with random phase encodes, 2D random sampling and pseudo radial sampling.

Mask       Sample %   BPFA+TV         BPFA            DLMRI           Sparse MRI      PBDW            Zero-filling
Cartesian  25         35.62 / 0.951   34.86 / 0.948   29.90 / 0.812   25.29 / 0.696   34.69 / 0.935   24.13 / 0.591
           30         38.64 / 0.968   37.70 / 0.965   31.54 / 0.849   26.16 / 0.745   37.24 / 0.957   24.55 / 0.614
           35         39.36 / 0.972   38.87 / 0.971   32.35 / 0.863   27.35 / 0.795   37.90 / 0.963   24.94 / 0.616
           40         41.09 / 0.977   40.45 / 0.976   33.60 / 0.876   29.82 / 0.845   39.23 / 0.969   26.28 / 0.667
Random     10         31.57 / 0.923   31.24 / 0.920   29.38 / 0.821   24.85 / 0.756   31.15 / 0.921   23.23 / 0.536
           15         36.49 / 0.963   35.44 / 0.961   30.16 / 0.774   22.68 / 0.651   34.22 / 0.942   21.18 / 0.493
           20         38.83 / 0.962   38.38 / 0.964   31.62 / 0.804   26.28 / 0.672   36.29 / 0.960   23.52 / 0.504
           25         40.75 / 0.979   40.00 / 0.973   32.83 / 0.862   31.16 / 0.934   37.62 / 0.968   26.58 / 0.582
           30         42.70 / 0.984   42.24 / 0.984   34.09 / 0.887   31.90 / 0.965   39.38 / 0.976   27.67 / 0.630
Radial     10         30.76 / 0.914   30.68 / 0.914   27.78 / 0.680   19.79 / 0.482   30.78 / 0.886   19.06 / 0.367
           15         34.00 / 0.949   33.79 / 0.950   29.49 / 0.734   22.07 / 0.640   33.99 / 0.937   20.87 / 0.498
           20         36.92 / 0.967   36.60 / 0.967   30.78 / 0.768   24.22 / 0.739   36.34 / 0.958   22.57 / 0.537
           25         39.72 / 0.977   39.37 / 0.977   31.91 / 0.794   26.64 / 0.797   38.38 / 0.970   24.34 / 0.574
           30         41.81 / 0.982   41.54 / 0.982   32.77 / 0.807   28.20 / 0.827   39.74 / 0.975   25.43 / 0.600

[Figure 6: panels (a) Original, (b) BPFA+TV, (c) BPFA, (d) PBDW, (e) DLMRI, (f) PSNR vs iteration, (g) BPFA+TV error, (h) BPFA error, (i) PBDW error, (j) DLMRI error.]
Fig. 6. Reconstruction results for 25% pseudo radial sampling of a complex-valued MRI of the brain.

SSIM as performance measures. We show these values in Table IV for each algorithm, sampling mask and sampling rate. As with the synthetic MRI, we see that our algorithm performs competitively with the state-of-the-art. We also see the significant improvement of all algorithms over zero-filling. Example reconstructions are shown for each MRI dataset in Figures 6 and 7. Also in Figure 7 are PSNR values for the lemon MRI. We see from the absolute error residuals for these experiments that the BPFA algorithm learns a slightly finer detail structure compared with other algorithms, with the errors being more noise-like. We also show the PSNR of BPFA+TV and BPFA as a function of iteration. As is evident, the algorithm does not necessarily need all 1000 iterations, but performs competitively even in half that number.

[Figure 8: PSNR curves for the BPFA reconstruction and BPFA denoising as a function of λ.]
Fig. 8. PSNR vs λ in the noisy setting (σ = 0.03) for the complex-valued brain MRI with 30% 2D random sampling.

E. Experiments in the noisy setting

The MRI we have considered thus far have been essentially noiseless. For some MRI machines this may be an unrealistic assumption. We continue our evaluation of noisy MRI begun with the toy GE phantom in Section IV-B by evaluating how
[Figure 7: panels (a) Original, (b) BPFA+TV: PSNR = 39.64, (c) BPFA: PSNR = 38.21, (d) PBDW: PSNR = 37.89, (e) DLMRI: PSNR = 35.05, (f) PSNR vs iteration, (g) BPFA+TV error, (h) BPFA error, (i) PBDW error, (j) DLMRI error.]
Fig. 7. Reconstruction results for 35% 2D random sampling of a complex-valued MRI of a lemon.

our model performs on clinically obtained MRI with additive noise. We show BPFA results without TV to highlight the dictionary learning features, but note that results with TV provide a slight improvement in terms of PSNR and SSIM. We again consider the brain MRI and use additive complex white Gaussian noise having standard deviation σ = 0.01, 0.02, 0.03. For all experiments we use the original noise-free MRI as the ground truth.

As discussed in Section III-B and illustrated in Section IV-B, dictionary learning allows us to consider two possible reconstructions: the actual reconstruction x, and the denoised BPFA reconstruction $x_{\mathrm{BPFA}} = \frac{1}{P}\sum_i R_i^T D\alpha_i$. As detailed in these sections, as λ becomes larger the reconstruction will be noisier, but with the artifacts from sub-sampling removed. However, for all values of λ, x_BPFA produces a denoised version that essentially doesn't change. We see this clearly in Figure 8, where we show the PSNR of each reconstruction as a function of λ. When λ is small, the performance degrades for both algorithms since too much smoothing is done by dictionary learning on x. As λ increases, both improve, but eventually the reconstruction of x degrades again because near equality to the noisy y is being more strictly enforced. The denoised reconstruction however levels off and does not degrade. We show PSNR values in Table V as a function of noise level.¹⁰ Example reconstructions that parallel those given in Figure 3 are also shown in Figure 9. These results highlight the robustness of our approach to λ in the noisy setting, and we note that we encountered no stability issues using extremely large values of λ.

TABLE V: PSNR for 35% Cartesian sampling of the complex-valued brain MRI for various noise standard deviations (λ = 10^100).

Reconstruction method    σ = 0    σ = 0.01   σ = 0.02   σ = 0.03
BPFA–reconstruction      38.87    37.25      33.77      31.08
BPFA–denoising           37.99    37.19      34.43      32.39
DLMRI                    32.35    32.12      31.61      30.65

[Figure 9: (a) Zero filling, (b) BPFA reconstruction (x), (c) BPFA denoising (x_BPFA), (d) DLMRI.]
Fig. 9. The denoising properties of dictionary learning on noisy complex-valued MRI with 35% Cartesian sampling and σ = 0.03.

(Footnote 10) We are working with a different scaling of the MRI than in [14] and made the appropriate modifications. Also, since DLMRI is a dictionary learning method it can output "xKSVD", though it was not originally motivated this way. Issues discussed in Sections II-B2 and II-B3 apply in this case.
TABLE VI: PSNR as a function of patch size for a real-valued and a complex-valued brain MRI with Cartesian sampling.

                      4×4     5×5     6×6     7×7     8×8
Synthetic brain 25%   37.86   38.29   38.33   38.26   38.24
Complex brain 40%     40.53   40.84   41.09   41.11   41.15

TABLE VII: Total runtime in minutes (seconds/iteration). We ran 1000 iterations of BPFA, 100 of DLMRI and 10 of Sparse MRI.

Sampling %   BPFA+TV       BPFA          DLMRI         Sparse MRI
10%          52.4 (3.15)   50.5 (3.03)   27.6 (16.5)   1.63 (9.78)
20%          51.3 (3.08)   49.5 (2.97)   38.3 (23.0)   1.59 (9.54)
30%          51.2 (3.07)   48.3 (2.90)   45.7 (27.4)   1.60 (9.60)

[Figure 10: (a) Dictionary (magnitude) for 10% sampling, (b) dictionary (magnitude) for 20% sampling, (c) dictionary (magnitude) for 30% sampling, (d) BPFA weights (cumulative), (e) dictionary elements per patch.]
Fig. 10. Radial sampling for the brain MRI. (a)-(c) The learned dictionary for various sampling rates. The noisy elements towards the end of each were unused and are samples from the prior. (d) The cumulative function of the sorted π_k from BPFA for each sampling rate. This gives information on sparsity and average usage of the dictionary. (e) The distribution on the number of elements used per patch for each sampling rate.

F. Dictionary learning and further discussion

We investigate the model learned by BPFA. In Figure 10 we show dictionary learning results learned by BPFA+TV for radial sampling of the complex brain MRI. In the top portion, we show the dictionaries learned for 10%, 20% and 30% sampling. We see that they are similar in their shape, but the number of elements increases as the sampling percentage increases since more complex information about the image is contained in the k-space measurements. We again note that unused elements are represented by draws from the prior. In Figure 10(d) we show the cumulative sum of the ordered π_k from BPFA. We can read off the average number of elements used per patch by looking at the right-most value. We see that more elements are used per patch as the fraction of observed k-space increases. We also see that for 10%, 20% and 30% sampling, roughly 70, 80 and 95, respectively, of the 108 total dictionary elements were significantly used, as indicated by the leveling off of these functions. This highlights the adaptive property of the nonparametric beta process prior. In Figure 10(e) we show the empirical distribution on the number of dictionary elements used per patch for each sampling rate. We see that there are two modes, one for the empty background and one for the foreground, and the second mode tends to increase as the sampling rate increases. The adaptability of this value to each patch is another characteristic of the beta process model.

We also performed an experiment with varying patch sizes and show our results in Table VI. We see that the results are not very sensitive to this setting and that comparisons using 6 × 6 patches are meaningful. We also compare the runtime for different algorithms in Table VII, showing both the total runtime of each algorithm and the per-iteration times using an Intel Xeon CPU E5-1620 at 3.60 GHz with 16.0 GB RAM. However, we note that we arguably ran more iterations than necessary for these algorithms; the BPFA algorithms generally produced high quality results in half the number of iterations, as did DLMRI (the authors of [14] recommend 20 iterations), while Sparse MRI uses 5 iterations as default and the performance didn't improve beyond 10 iterations. We note that the speed-up over DLMRI arises from the lack of the OMP algorithm, which in Matlab is much slower than our sparse coding update.¹¹ We note that inference for the BPFA model is easily parallelizable—as are the other dictionary learning algorithms—which can speed up processing time.

The proposed method has several advantages, which we believe lead to the improvement in performance. A significant advantage is the adaptive learning of the dictionary size and per-patch sparsity level using a nonparametric stochastic process that is naturally suited for this problem. Several other dictionary learning parameters such as the noise variance and the variances of the score weights are adjusted as well through a natural MCMC sampling approach. These benefits have been investigated in other applications of this model [25], and naturally translate here since CS-MRI with BPFA is closely related to image denoising as we have shown.

Another advantage of our model is the Markov Chain Monte Carlo inference algorithm itself. In highly non-convex Bayesian models (or similar models with a Bayesian interpretation), it is often observed by the statistics community that MCMC sampling can outperform deterministic methods, and

(Footnote 11) BPFA is significantly faster than K-SVD in Matlab because it requires fewer loops. This difference may not be as large with other coding languages.
rarely performs worse [38]. Given that BPFA is a Bayesian R EFERENCES


model, such sampling techniques are readily derived, as we
showed in Section III-A. [1] E. Candés, J. Romberg and T. Tao, “Robust uncertainty principles: Exact
signal reconstruction from highly incomplete frequency information,”
IEEE Trans. on Information Theory, vol. 52, no. 2, pp. 489-509, 2006.
V. CONCLUSION

We have presented an algorithm for CS-MRI reconstruction that uses Bayesian nonparametric dictionary learning. Our Bayesian approach uses a model called beta process factor analysis (BPFA) for in situ dictionary learning. Through this hierarchical generative structure, we can learn the dictionary size, sparsity pattern and additional regularization parameters. We also considered a total variation penalty term for additional constraints on image smoothness. We presented an optimization algorithm using the alternating direction method of multipliers (ADMM) and MCMC Gibbs sampling for all BPFA variables. Experimental results on real and complex-valued MRI showed that our proposed regularization framework compares favorably with other algorithms for various sampling trajectories and rates. We also showed the natural ability of dictionary learning to handle noisy MRI without dependence on the measurement fidelity parameter λ. To this end, we showed that the model can enforce a near equality constraint to the noisy measurements and use the dictionary learning result as a denoised output of the noisy MRI.

VI. APPENDIX

We give a brief review of the ADMM algorithm [32]. We start with the convex optimization problem

    \min_x \|Ax - b\|_2^2 + h(x),    (20)

where h is a non-smooth convex function, such as an \ell_1 penalty. ADMM decouples the smooth squared error term from this penalty by introducing a second vector v such that

    \min_x \|Ax - b\|_2^2 + h(v)  subject to  v = x.    (21)

This is followed by a relaxation of the equality v = x via an augmented Lagrangian term

    L(x, v, \eta) = \|Ax - b\|_2^2 + h(v) + \eta^T (x - v) + \tfrac{\rho}{2}\|x - v\|_2^2.    (22)

A minimax saddle point is found with the minimization taking place over both x and v and dual ascent for \eta.

Another way to write the objective in (22) is to define u = (1/\rho)\eta and combine the last two terms. The result is an objective that can be optimized by cycling through the following updates for x, v and u,

    x' = \arg\min_x \|Ax - b\|_2^2 + \tfrac{\rho}{2}\|x - v + u\|_2^2,    (23)
    v' = \arg\min_v h(v) + \tfrac{\rho}{2}\|x' - v + u\|_2^2,    (24)
    u' = u + x' - v'.    (25)

This algorithm simplifies the optimization since the objective for x is quadratic and thus has a simple analytic solution, while the update for v is a proximity operator of h with penalty \rho, the difference being that v is not pre-multiplied by a matrix as x is in (20). Such objective functions tend to be easier to optimize. For example, when h is the TV penalty the solution for v is analytical.
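To make the update cycle (23)-(25) concrete, the following short Python/NumPy sketch (our own illustration, not part of the original algorithm or its Matlab implementation) applies ADMM to the simple case h(x) = \lambda\|x\|_1 mentioned after (20), whose proximity operator is elementwise soft thresholding. The function names admm_l1 and soft_threshold and all parameter settings are hypothetical choices for illustration; in the CS-MRI setting of this paper, A would involve the undersampled Fourier operator and h the TV penalty, with the corresponding proximity operator replacing soft thresholding.

    import numpy as np

    def soft_threshold(z, t):
        # Proximity operator of t * ||.||_1: elementwise shrinkage toward zero.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def admm_l1(A, b, lam, rho=1.0, n_iter=100):
        # Minimize ||Ax - b||_2^2 + lam * ||x||_1 by cycling through (23)-(25).
        m, n = A.shape
        x = np.zeros(n)
        v = np.zeros(n)
        u = np.zeros(n)                      # scaled dual variable, u = eta / rho
        M = 2.0 * A.T @ A + rho * np.eye(n)  # fixed system matrix for the quadratic x-update
        Atb = A.T @ b
        for _ in range(n_iter):
            # x-update (23): solve (2 A^T A + rho I) x = 2 A^T b + rho (v - u)
            x = np.linalg.solve(M, 2.0 * Atb + rho * (v - u))
            # v-update (24): proximity operator of h with penalty rho
            v = soft_threshold(x + u, lam / rho)
            # u-update (25): dual ascent on the scaled dual variable
            u = u + x - v
        return x

    # Small synthetic usage example: recover a sparse vector from noisy measurements.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 100))
    x_true = np.zeros(100)
    x_true[:5] = 1.0
    b = A @ x_true + 0.01 * rng.standard_normal(50)
    x_hat = admm_l1(A, b, lam=0.5)

Note how the x-update is a single linear solve, as stated above, while the v-update only requires evaluating the proximity operator of h; this separation is what makes the splitting attractive when h is non-smooth.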
REFERENCES

[1] E. Candès, J. Romberg and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Trans. on Information Theory, vol. 52, no. 2, pp. 489-509, 2006.
[2] D. Donoho, "Compressed sensing," IEEE Trans. on Information Theory, vol. 52, no. 4, pp. 1289-1306, 2006.
[3] M. Lustig, D. Donoho and J. M. Pauly, "Sparse MRI: The application of compressed sensing for rapid MR imaging," Magnetic Resonance in Medicine, vol. 58, no. 6, pp. 1182-1195, 2007.
[4] X. Qu, W. Zhang, D. Guo, C. Cai, S. Cai and Z. Chen, "Iterative thresholding compressed sensing MRI based on contourlet transform," Inverse Problems Sci. Eng., Jun. 2009.
[5] X. Qu, X. Cao, D. Guo, C. Hu and Z. Chen, "Combined sparsifying transforms for compressed sensing MRI," Electronics Letters, vol. 46, no. 2, pp. 121-123, 2010.
[6] J. Trzasko and A. Manduca, "Highly undersampled magnetic resonance image reconstruction via homotopic L0-minimization," IEEE Trans. on Medical Imaging, vol. 28, no. 1, pp. 106-121, 2009.
[7] H. Jung, K. Sung, K. S. Nayak, E. Y. Kim and J. C. Ye, "k-t FOCUSS: A general compressed sensing framework for high resolution dynamic MRI," Magnetic Resonance in Medicine, vol. 61, pp. 103-116, 2009.
[8] J. Yang, Y. Zhang and W. Yin, "A fast alternating direction method for TVL1-L2 signal reconstruction from partial Fourier data," IEEE J. Sel. Topics in Signal Processing, vol. 4, no. 2, pp. 288-297, 2010.
[9] Y. Chen and X. Ye, "A novel method and fast algorithm for MR image reconstruction with significantly under-sampled data," Inverse Problems and Imaging, vol. 4, no. 2, pp. 223-240, 2010.
[10] J. Huang, S. Zhang and D. Metaxas, "Efficient MR image reconstruction for compressed MR imaging," Medical Image Analysis, vol. 15, no. 5, pp. 670-679, 2011.
[11] S. Ji, Y. Xue and L. Carin, "Bayesian compressive sensing," IEEE Trans. on Signal Processing, vol. 56, no. 6, pp. 2346-2356, 2008.
[12] X. Ye, Y. Chen, W. Lin and F. Huang, "Fast MR image reconstruction for partially parallel imaging with arbitrary k-space trajectories," IEEE Trans. on Medical Imaging, vol. 30, no. 3, pp. 575-585, 2011.
[13] M. Akcakaya, T. A. Basha, B. Goddu, L. A. Goepfert, K. V. Kissinger, V. Tarokh, W. J. Manning and R. Nezafat, "Low-dimensional-structure self-learning and thresholding: Regularization beyond compressed sensing for MRI reconstruction," Magnetic Resonance in Medicine, vol. 66, pp. 756-767, 2011.
[14] S. Ravishankar and Y. Bresler, "MR image reconstruction from highly undersampled k-space data by dictionary learning," IEEE Trans. on Medical Imaging, vol. 30, no. 5, pp. 1028-1041, 2011.
[15] X. Qu, D. Guo, B. Ning, Y. Hou, Y. Lin, S. Cai and Z. Chen, "Undersampled MRI reconstruction with the patch-based directional wavelets," Magnetic Resonance Imaging, vol. 30, no. 7, pp. 964-977, 2012.
[16] Z. Yang and M. Jacob, "Robust non-local regularization framework for motion compensated dynamic imaging without explicit motion estimation," IEEE Int. Symp. Biomedical Imaging, pp. 1056-1059, 2012.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro and A. Zisserman, "Non-local sparse models for image restoration," in Int. Conf. on Computer Vision, pp. 2272-2279, 2009.
[18] K. Dabov, A. Foi, V. Katkovnik and K. Egiazarian, "Image denoising by sparse 3D transform-domain collaborative filtering," IEEE Trans. on Image Processing, vol. 16, no. 8, pp. 2080-2095, 2007.
[19] K. Engan, S. O. Aase and J. H. Husoy, "Method of optimal directions for frame design," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 2443-2446, 1999.
[20] M. Aharon, M. Elad, A. Bruckstein and Y. Katz, "K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representation," IEEE Trans. on Signal Processing, vol. 54, pp. 4311-4322, 2006.
[21] Y. C. Pati, R. Rezaiifar and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," 27th Conference on Signals, Systems and Computers, pp. 40-44, 1993.
[22] M. Protter and M. Elad, "Image sequence denoising via sparse and redundant representations," IEEE Trans. on Image Processing, vol. 18, no. 1, pp. 27-36, 2009.
[23] J. Paisley and L. Carin, "Nonparametric factor analysis with beta process priors," in International Conference on Machine Learning, 2009.
[24] J. Paisley, D. Blei and M. Jordan, "Stick-breaking beta processes and the Poisson process," in International Conference on Artificial Intelligence and Statistics, 2012.
[25] M. Zhou, H. Chen, J. Paisley, L. Ren, L. Li, Z. Xing, D. Dunson, G. Sapiro and L. Carin, "Nonparametric Bayesian dictionary learning for analysis of noisy and incomplete images," IEEE Trans. on Image Processing, vol. 21, no. 1, pp. 130-144, 2012.
[26] X. Ding, L. He and L. Carin, "Bayesian robust principal component analysis," IEEE Trans. on Image Processing, vol. 20, no. 12, pp. 3419-3430, 2011.
[27] N. Hjort, "Nonparametric Bayes estimators based on beta processes in models for life history data," Annals of Statistics, vol. 18, pp. 1259-1294, 1990.
[28] R. Thibaux and M. Jordan, "Hierarchical beta processes and the Indian buffet process," in International Conference on Artificial Intelligence and Statistics, San Juan, Puerto Rico, 2007.
[29] D. Gabay and B. Mercier, "A dual algorithm for the solution of nonlinear variational problems via finite-element approximations," Computers and Mathematics with Applications, vol. 2, pp. 17-40, 1976.
[30] W. Yin, S. Osher, D. Goldfarb and J. Darbon, "Bregman iterative algorithms for L1 minimization with applications to compressed sensing," SIAM Journal on Imaging Sciences, vol. 1, no. 1, pp. 143-168, 2008.
[31] T. Goldstein and S. Osher, "The split Bregman method for L1 regularized problems," SIAM Journal on Imaging Sciences, vol. 2, no. 2, pp. 323-343, 2009.
[32] S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2010.
[33] J. Paisley, L. Carin and D. Blei, "Variational inference for stick-breaking beta process priors," in International Conference on Machine Learning, Bellevue, WA, 2011.
[34] D. Knowles and Z. Ghahramani, "Infinite sparse factor analysis and infinite independent components analysis," in Independent Component Analysis and Signal Separation, Springer, pp. 381-388, 2007.
[35] E. Fox, E. Sudderth, M. I. Jordan and A. S. Willsky, "Sharing features among dynamical systems with beta processes," in Advances in Neural Information Processing Systems, Vancouver, B.C., 2011.
[36] T. Griffiths and Z. Ghahramani, "Infinite latent feature models and the Indian buffet process," in Advances in Neural Information Processing Systems, MIT Press, pp. 475-482, 2006.
[37] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[38] C. Faes, J. T. Ormerod and M. P. Wand, "Variational Bayesian inference for parametric and nonparametric regression with missing data," Journal of the American Statistical Association, vol. 106, pp. 959-971, 2011.

Yue Huang received the B.S. degree in Electrical Engineering from Xiamen University in 2005, and the Ph.D. degree in Biomedical Engineering from Tsinghua University in 2010. Since 2010 she has been an Assistant Professor in the School of Information Science and Engineering at Xiamen University. Her main research interests include image processing, machine learning, and biomedical engineering.

John Paisley is an assistant professor in the Department of Electrical Engineering at Columbia University. Prior to that he was a postdoctoral researcher in the Computer Science departments at UC Berkeley and Princeton University. He received the B.S., M.S. and Ph.D. degrees in Electrical and Computer Engineering from Duke University in 2004, 2007 and 2010. His research is in the area of statistical machine learning and focuses on probabilistic modeling and inference techniques, Bayesian nonparametric methods, and text and image processing.

Qin Lin is currently a graduate student in the Department of Communication Engineering at Xiamen University. His research interests include computer vision, machine learning and data mining.

Xinghao Ding was born in Hefei, China in 1977. He received the B.S. and Ph.D. degrees from the Department of Precision Instruments at Hefei University of Technology in Hefei, China in 1998 and 2003, respectively. From September 2009 to March 2011, he was a postdoctoral researcher in the Department of Electrical and Computer Engineering at Duke University in Durham, NC. Since 2011 he has been a Professor in the School of Information Science and Engineering at Xiamen University. His main research interests include image processing, sparse signal representation, and machine learning.

Xueyang Fu is currently a graduate student in the Department of Communication Engineering at Xiamen University. His research interests include image processing, sparse representation and machine learning.

Xiao-Ping Zhang (M'97, SM'02) received the B.S. and Ph.D. degrees from Tsinghua University, in 1992 and 1996, respectively, both in Electronic Engineering. He holds an MBA in Finance, Economics and Entrepreneurship with Honors from the University of Chicago Booth School of Business, Chicago, IL. Since Fall 2000, he has been with the Department of Electrical and Computer Engineering, Ryerson University, where he is now Professor and Director of the Communication and Signal Processing Applications Laboratory (CASPAL). He has served as Program Director of Graduate Studies. He is cross appointed to the Finance Department at the Ted Rogers School of Management at Ryerson University. Prior to joining Ryerson, he was a Senior DSP Engineer at SAM Technology, Inc., San Francisco, and a consultant at San Francisco Brain Research Institute. He held research and teaching positions at the Communication Research Laboratory, McMaster University, and worked as a postdoctoral fellow at the Beckman Institute, the University of Illinois at Urbana-Champaign, and the University of Texas, San Antonio. His research interests include statistical signal processing, multimedia content analysis, sensor networks and electronic systems, computational intelligence, and applications in bioinformatics, finance, and marketing. He is a frequent consultant for biotech companies and investment firms. He is cofounder and CEO of EidoSearch, an Ontario-based company offering a content-based search and analysis engine for financial data.
Dr. Zhang is a registered Professional Engineer in Ontario, Canada, a Senior Member of IEEE and a member of the Beta Gamma Sigma Honor Society. He is the general chair for MMSP'15, publicity chair for ICME'06 and program chair for ICIC'05 and ICIC'10. He served as guest editor for Multimedia Tools and Applications and the International Journal of Semantic Computing. He is a tutorial speaker at ACMMM2011, ISCAS2013, ICIP2013 and ICASSP2014. He is currently an Associate Editor for IEEE Transactions on Signal Processing, IEEE Transactions on Multimedia, IEEE Signal Processing Letters and the Journal of Multimedia.
