
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2022.3161934, IEEE Transactions on Pattern Analysis and Machine Intelligence.

Recurrent Neural Networks for Snapshot Compressive Imaging

Ziheng Cheng, Ruiying Lu, Zhengjue Wang, Hao Zhang, Bo Chen, Senior Member, IEEE, Ziyi Meng, and Xin Yuan, Senior Member, IEEE

Abstract—Conventional high-speed and spectral imaging systems are expensive and they usually consume a significant amount of
memory and bandwidth to save and transmit the high-dimensional data. By contrast, snapshot compressive imaging (SCI), where
multiple sequential frames are coded by different masks and then summed to a single measurement, is a promising idea to use a
2-dimensional camera to capture 3-dimensional scenes. In this paper, we consider the reconstruction problem in SCI, i.e., recovering a
series of scenes from a compressed measurement. Specifically, the measurement and modulation masks are fed into our proposed
network, dubbed BIdirectional Recurrent Neural networks with Adversarial Training (BIRNAT) to reconstruct the desired frames.
BIRNAT employs a deep convolutional neural network with residual blocks and self-attention to reconstruct the first frame, based on
which a bidirectional recurrent neural network is utilized to sequentially reconstruct the following frames. Moreover, we build an
extended BIRNAT-color algorithm for color videos aiming at joint reconstruction and demosaicing. Extensive results on both video and
spectral data, in simulation and on real data from three SCI cameras, demonstrate the superior performance of BIRNAT.

Index Terms—Snapshot compressive imaging, compressive sensing, deep learning, convolutional neural networks, recurrent neural
network, attention, adversarial training, coded aperture compressive temporal imaging (CACTI), coded aperture snapshot spectral
imaging (CASSI).

1 INTRODUCTION

Recent advances in artificial intelligence and robotics have resulted in an unprecedented demand for computationally-efficient machine vision systems capable of high-dimensional data capture and processing. To meet this requirement, computational imaging [1], [2] has become a promising tool to build new imaging systems aiming to capture high-dimensional data. This work focuses on the challenge of high-speed video and spectral computational imaging [3], [4], [5], [6].

Videos and spectral images are typically sequential images (frames) at different timestamps or spectral wavelengths with extremely high inter-frame correlations. Due to the high redundancy among frames, a video codec [7] can achieve a high (>100) compression rate for a high-definition video. Two potential problems exist in this conventional sampling-plus-compression framework: i) the long sequential data has to be captured and saved, which requires a significant amount of memory and power; ii) the codec, though efficient, introduces latency for the subsequent transmission and processing. These problems preclude the development of this conventional "capture first and process afterwards" imaging pipeline for the next generation of machine vision systems.

Bearing the above concerns in mind, one novel idea to address these challenges is to build an optical encoder, i.e., to compress these sequential data during capture. Inspired by compressive sensing (CS) [8], [9], snapshot compressive imaging (SCI) [3], [5], [6], [10], [11], [12], [13], [14], [15], [16] was proposed, providing a promising realization of this optical encoder. In video SCI, the underlying principle is to modulate the video frames at a higher speed than the capture rate of the camera. With knowledge of the modulation, the high-speed video frames can be reconstructed from each single measurement using advanced algorithms. It has been shown that 148 frames can be recovered from a snapshot measurement in the coded aperture compressive temporal imaging (CACTI) system [3]. In spectral SCI, the single-disperser coded aperture snapshot spectral imaging (CASSI) system [5] encodes different wavelength images by a coded aperture (physical mask), after which a disperser shifts different wavelengths to different positions; these modulated frames at different wavelengths are then integrated by a single 2D detector.

With this optical encoder in hand, another challenge, namely an efficient decoder, is also critically important to make the SCI system practical. Previous algorithms are usually based on iterative optimization, which often needs a long time (even hours [17]) to provide a good result. This precludes the wide application of SCI in daily life. Inspired by deep learning, some researchers have attempted to employ deep neural networks (DNNs) to reconstruct sequential frames from the corresponding SCI measurements [12], [14], [18], [19], [20], [21], [22], [23]. Due to the lack of spatial and sequential correlation considerations, though achieving satisfactory testing speed (tens of milliseconds), none of them shows superior reconstruction quality, especially for video reconstruction.

• Z. Cheng, R. Lu, Z. Wang, H. Zhang and B. Chen are with the National Laboratory of Radar Signal Processing, Xidian University, Xi'an, 710071 China. E-mails: [email protected], {ruiyinglu xidian, zhengjuewang, zhanghao xidian}@163.com, [email protected].
• Z. Meng is with Kuaishou Technology, Beijing, 100083, China. E-mail: [email protected].
• X. Yuan is with Westlake University, Hangzhou 310024, China. E-mail: [email protected].
• Corresponding authors: B. Chen and X. Yuan.
Manuscript updated February 23, 2022.

Fig. 1. Principle of video SCI (top-left), spectral SCI (bottom-left), and the proposed BIRNAT for reconstruction (middle). For the video SCI system, a dynamic scene, shown as a sequence of images (either grayscale or color) at different timestamps ([t1, t2, ..., tB], top-left), passes through a dynamic aperture (bottom-left), which imposes individual coding patterns. The coded frames after the aperture are then integrated over time on a grayscale or color camera, forming a single-frame compressed measurement (middle). Different from the video SCI system, the spectral SCI system detects measurements comprising tens of spectral channels coded by the mask and dispersed by a disperser. These measurements, along with the dynamic masks, are fed into our BIRNAT to reconstruct the series (right) of the 3D scene, which can be grayscale video, color video, or hyperspectral images.

Although the compression processes of video and spectral SCI systems are similar and both systems capture sequential data, current deep learning-based reconstruction algorithms are all independently designed for these SCI systems.

In this paper, we address the challenge of the decoder by developing an efficient network to reconstruct high-quality images for both video and spectral SCI. In particular, we consider the decoder as a sequential generator of frames, where the correlations among frames are specially investigated. Specifically, we investigate the spatial correlation via an attention-based CNN with residual blocks (AttRes-CNN), and the sequential correlation via a bidirectional recurrent neural network. Moreover, we take one step further by extending the proposed algorithm to color video SCI systems for joint reconstruction and demosaicing, which makes our algorithm more practical.

1.1 Snapshot Compressive Imaging

SCI aims to capture and compress a high-dimensional (≥3) original scene as a spatial two-dimensional (2D) measurement, as shown in Fig. 1 left. Specifically, in video SCI, the original dynamic scene is considered as a time series of 2D images. These images pass through a dynamic aperture and are coded by timestamp-specified masks, which are then integrated over time as a compressed measurement. It should be noted that each timestamp-specified spatial coding is imposed by a random pattern, and thus the spatial codings of any two timestamps differ from each other (a shifting binary pattern was used in [3]). The coded frames after the aperture are then integrated over time on a camera, forming a compressed coded measurement. Based on this idea, various video SCI systems have been built. The modulation approaches can be categorized into spatial light modulators (SLM) (including the digital micromirror device (DMD)) [10], [11], [12], [24] and physical masks [3], [4]. Further, to extend the SCI system to capture RGB color video, a snapshot Bayer measurement is detected by a mosaic charge-coupled device (CCD) with a Bayer filter after modulation [4]. In spectral SCI, the original spectral data (x, y, λ) first passes through a statically coded aperture, then the modulated data cube is shifted by a disperser (prism), and finally a 2D camera collects the shifted cube and integrates the light across the spectral dimension. Given the coding pattern of each frame, the series of the original scene can be reconstructed from the compressed measurement through iterative optimization based algorithms, which have been developed extensively before.

However, one common bottleneck precluding the wide application of SCI is that current reconstruction methods cannot balance speed and quality well. Recently, the DeSCI algorithm proposed in [17] has achieved state-of-the-art reconstruction quality in both video and spectral SCI. However, its speed is too slow due to the inherent iterative strategy; it needs about 2 hours to reconstruct eight frames of size 256×256 pixels from a snapshot measurement, which makes it impractical for real applications.

Motivated by recent advances in deep learning, one conventional way is to train an end-to-end network for SCI inversion with an off-the-shelf structure like the U-net [25], which has been used as the backbone of the design for several inverse problems [12], [14], [22], [26], [27]. Though achieving a high testing speed, we notice that a single U-net [12], [14], [22] cannot lead to good results since it fails to consider the inherent sequential (temporal or spectral) correlation within adjacent frames in SCI. Aiming to fill this research gap, in this paper we propose a recurrent neural network (RNN) based model dubbed BIdirectional Recurrent Neural networks with Adversarial Training (BIRNAT) for both video and spectral SCI reconstruction.

1.2 Related Work

SCI reconstruction is an extremely ill-posed problem. The established algorithms can be divided into model-based optimization methods, data-driven deep learning methods, and deep unfolding methods.


In model-based methods, different priors are used as regularization terms. For example, total variation (TV), a common regularizer in image restoration used in GAP-TV [28] and TwIST [29], can remove noise to a certain level and lead to sharp edges; Gaussian mixture models [30], [31] dig out the relation of spatiotemporal patches and recover video from a posterior distribution; DeSCI [17], the state-of-the-art algorithm for SCI, applies a nonlocal low-rank prior and then uses weighted nuclear norm minimization [32] on the frames within the alternating direction method of multipliers (ADMM) [33] regime.

Since deep learning has demonstrated its competence in image restoration [34], [35], [36], researchers have started using DNNs to learn an end-to-end mapping in computational imaging [12], [18], [22], [37], [38], [39], [40]. A deep fully-connected neural network was used for video CS in [18]. The coding patterns used in [18] are a repeated pattern of a small block, which is not practical in real optical imaging systems, and only simulation results were shown therein. A joint optimization and reconstruction network was trained in [41] for video CS, but the quality of the reconstruction results is relatively low on real data. U-net-based methods [12], [14], [22] directly learn a mapping from 2D compressed measurements to the higher-dimensional data cube (videos or spectral images).

Deep unfolding methods, combining deep neural networks and model-based optimization methods, achieve a moderate performance in both speed and quality. They unfold the iterative optimization and replace some steps with networks. For the video SCI problem, Deep Tensor ADMM-net [19] and Tensor FISTA-net [42] employ the deep unfolding technique [43], [44], [45]. An end-to-end network [46] unfolds the iterative optimization into a series of networks and jointly optimizes masks and networks. Plug-and-play algorithms [21], [47] use a pre-trained deep denoising network as a prior and embed it in the optimization. Although they can handle large-scale (UHD) video, the results cannot compete with DeSCI. Most recently, Deep-GSM [23] formulates the spectral compressive imaging problem as a maximum a posteriori estimation problem inspired by the Gaussian scale mixture prior, and then unfolds the estimation into several deep CNNs.

In summary, DeSCI can provide high-quality reconstruction but with an extremely low speed. Though enjoying fast inference, the reconstruction quality of recently developed deep learning and unfolding methods is not as good as DeSCI in video reconstruction. To fill this research gap, this paper proposes an RNN-based network to achieve high-quality and high-speed video or spectral image reconstruction. Intuitively, both the desired high-speed temporal frames and the spectrally adjacent frames are strongly correlated, and a network fully exploiting this correlation could improve the quality of the reconstructed sequential frames. RNNs, originally developed to capture temporal correlations in text and speech, e.g., [48], [49], are becoming increasingly popular for video or spectral tasks, such as video deblurring [50], super-resolution [51], [52], object segmentation [53], and hyperspectral image classification [54] and fusion [55]. Although these works achieve high performance on their tasks, how to use RNNs to build a unified structure for SCI problems still remains challenging.

Additionally, for color imaging, common devices usually first capture pixels through a color filter array (each pixel only sampling one color) and then apply an interpolation algorithm to obtain a color (usually RGB) image. This process is called demosaicing. Recently, some researchers have used deep neural networks for demosaicing [56], [57], [58], i.e., developing an end-to-end network to directly obtain a color image from the raw captured image. Inspired by this, we combine demosaicing and SCI reconstruction to extend our proposed BIRNAT to the color SCI system.

1.3 Contributions and Organization of This Paper

In this paper, we build a new reconstruction framework (BIRNAT) for SCI. The specific contributions are summarized as follows:

1) We build an end-to-end deep learning based reconstruction regime for SCI and use an RNN to exploit the sequential correlation.
2) A CNN with residual blocks (ResBlocks) [59] is proposed to reconstruct the first frame as a reference for the reconstruction of the following frames by the RNN. Considering that convolution in a CNN only extracts local dependencies, we equip it with a self-attention module to capture the global (non-local) spatial dependencies, resulting in AttRes-CNN.
3) Given the reconstruction of the first frame, a bidirectional RNN is developed to sequentially infer the following frames, where the backward RNN refines the results of the forward RNN to improve the quality of the reconstructed frames. This dual-stage framework is jointly trained by combining a mean square error (MSE) loss and adversarial training [60] to achieve good results.
4) This is the first attempt to exploit the sequential properties of both video and spectral SCI reconstruction using the same neural network structure. Furthermore, we make certain changes to extend the framework to color video SCI reconstruction by considering the physical characteristics of RGB video imaging; we further develop a joint framework of reconstruction and demosaicing.
5) Extensive experimental results on simulated and real datasets containing grayscale videos, RGB videos and spectral images show the superior performance: competitive with, and sometimes higher than, previous state-of-the-art methods, but with only hundreds of milliseconds of inference time.

A preliminary conference version of this work was presented in [61]. The present version builds on the previous version and adds significant features to be more powerful in the following aspects.

i) We adjust the original framework to adapt to spectral compressive imaging, considering the similar sequential correlation in videos and spectral images. Surprisingly, our proposed BIRNAT achieves over 4 dB PSNR improvement on the spectral SCI task [14].
ii) We fine-tune the network structure, motivated by the fact that different levels of visual information should be extracted by different feature extractors, with details described in Sec. 3.
iii) We extend BIRNAT to the RGB video SCI problem to jointly learn SCI reconstruction and demosaicing, while the previous algorithms mainly focused on grayscale SCI videos.


iv) We add more experiments to analyze the effectiveness of BIRNAT, especially for spectral reconstruction on both simulated and real datasets.
v) Lastly, under the assumption that the diversity and the amount of scenes are helpful to the generalization ability of the model, we train our model on another public training set and achieve better results than the ones reported in the preliminary conference version [61].

The rest of this paper is organized as follows. Sec. 2 presents the mathematical model of SCI. The proposed BIRNAT for different SCI systems (including color video and spectral SCI) is developed in Sec. 3. Simulation and real data results are reported in Sec. 4, and Sec. 5 concludes the entire paper.

2 MATHEMATICAL MODEL OF SCI

Recalling Fig. 1, we assume that B sequential frames {X_k}^B_{k=1} ∈ R^{nx×ny} are modulated by the coding patterns {C_k}^B_{k=1} ∈ R^{nx×ny}, correspondingly. The measurement Y ∈ R^{nx×ny} is given by

Y = Σ_{k=1}^{B} X_k ⊙ C_k + G,   (1)

where ⊙ denotes the Hadamard (element-wise) product and G represents the noise. In video SCI systems such as CACTI [3], [12], the coding patterns are implemented by a shifting physical mask or a DMD. In spectral SCI systems such as CASSI [5], the modulated frames {X_k ⊙ C_k}^B_{k=1} are the shifted data cube of the original scene after the disperser. For all B pixels (in the B frames) at position (i, j), i = 1, ..., nx; j = 1, ..., ny, they are collapsed to form one pixel in the snapshot measurement as

y_{i,j} = Σ_{k=1}^{B} c_{i,j,k} x_{i,j,k} + g_{i,j}.   (2)

Define x = [x_1^T, ..., x_B^T]^T, where x_k = vec(X_k), and let D_k = diag(vec(C_k)) for k = 1, ..., B, where vec(·) vectorizes the matrix inside (·) by stacking its columns and diag(·) places the enclosed vector on the diagonal of a diagonal matrix. We thus have the vector formulation of the sensing process of SCI:

y = Φx + g,   (3)

where Φ ∈ R^{n×nB} is the sensing matrix with n = nx·ny, x ∈ R^{nB} is the desired signal, and g ∈ R^n again denotes the vectorized noise. Unlike traditional CS [8], the sensing matrix considered here is not a dense matrix. In SCI, the matrix Φ in (3) has a very special structure and can be written as

Φ = [D_1, ..., D_B],   (4)

where {D_k}^B_{k=1} are diagonal matrices. Therefore, the compressive sampling rate in SCI is equal to 1/B. It has recently been proved that high quality reconstruction is achievable when B > 1 [62], [63].

In terms of the color video SCI system, we consider a Bayer pattern filter sensor, where each pixel only captures the red (R), green (G) or blue (B) channel in a spatial layout such as 'RGGB'. Note that two green channels are used due to the sensitivity of the human eye. In this case, X_k is a mosaic frame, and since neighbouring pixels sample different color components, their values are not necessarily continuous. To cope with this issue, previous studies [4], [21] usually divide the original measurement Y into four-channel sub-measurements {Y^r, Y^{g1}, Y^{g2}, Y^b} ∈ R^{(nx/2)×(ny/2)} corresponding to the Bayer filter, for the R, G1, G2 and B components. Similarly, the mask and the desired signal are also divided into four components. They reconstruct each sub-signal separately using the corresponding measurement and mask, and then perform demosaicing (using off-the-shelf tools) on the recovered sub-videos to generate the final color (RGB) video.

In the following, we first describe the proposed BIRNAT for gray-scale SCI reconstruction in Sec. 3.1 to Sec. 3.3, then extend it to color video SCI for joint reconstruction and demosaicing in Sec. 3.4.2, and also provide the similar model to the gray-scale version for spectral compressive imaging in Sec. 3.4.3.
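To make the sensing model of Eqs. (1)-(4) concrete, the following PyTorch-style sketch simulates a video SCI measurement from B frames and B binary masks. The tensor shapes and the function name are illustrative assumptions for this paper's model, not part of any released BIRNAT code.

```python
import torch

def simulate_sci_measurement(frames: torch.Tensor, masks: torch.Tensor,
                             noise_std: float = 0.0) -> torch.Tensor:
    """Forward model of Eq. (1): Y = sum_k X_k * C_k + G.

    frames: (B, nx, ny) high-speed frames X_k
    masks:  (B, nx, ny) coding patterns C_k
    returns (nx, ny) snapshot measurement Y
    """
    assert frames.shape == masks.shape
    measurement = (frames * masks).sum(dim=0)      # Hadamard product, then sum over B
    if noise_std > 0:
        measurement = measurement + noise_std * torch.randn_like(measurement)
    return measurement

# Example: B = 8 frames of size 256 x 256 and random binary masks,
# i.e., a compressive sampling rate of 1/B as noted above.
B, nx, ny = 8, 256, 256
frames = torch.rand(B, nx, ny)
masks = (torch.rand(B, nx, ny) > 0.5).float()
Y = simulate_sci_measurement(frames, masks)
```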


Fig. 2. Left: the proposed preprocessing approach to normalizing the measurement. We feed the concatenation of the normalized measurement Ȳ and {Ȳ ⊙ C_k}^B_{k=1} into the proposed BIRNAT. Middle: the specific structure of BIRNAT, including i) the attention-based CNN (AttRes-CNN) to reconstruct the first frame X̂^f_1; ii) the forward RNN to recurrently reconstruct the following frames {X̂^f_k}^B_{k=2}; iii) the backward RNN to perform the reverse-order reconstruction {X̂^b_k}^1_{k=B−1}. Right: details of AttRes-CNN and the RNN cell. The circled C denotes concatenation along the channel dimension. F• (where • can be x, r, y or h) denotes a non-linear module composed of several convolutional layers.

3 BIRNAT FOR SCI RECONSTRUCTION

Provided the measurement Y and the coding patterns {C_k}^B_{k=1}, BIRNAT is developed to predict the sequential frames {X̂_k}^B_{k=1}, which are regarded as the reconstructions of the real sequential frames {X_k}^B_{k=1}. In this section, we introduce the details and motivation of each module of the proposed BIRNAT, including a novel measurement preprocessing method in Sec. 3.1, an attentional-ResBlock-based CNN to reconstruct the first (reference) frame in Sec. 3.2, a bidirectional RNN to sequentially reconstruct the following frames in Sec. 3.3, and different applications in Sec. 3.4. Combining adversarial training and the MSE loss, BIRNAT is trained end-to-end as described in Sec. 3.5. As mentioned before, the network described below is slightly different from the previous conference version [61]; the main difference lies in the RNN structure.

3.1 Measurement Energy Normalization

Recapping the definition of the measurement Y in (1), it is a weighted ({C_k}^B_{k=1}) summation of the sequential frames {X_k}^B_{k=1}. As a result, Y is usually a non-energy-normalized image. For example, some pixels in Y may gather only one- or two-pixel energy from {X_k}^B_{k=1}, while others may gather B−1 or B due to the random coding patterns. Thus, it is not suitable to directly feed Y into a network, which motivates us to develop the measurement energy normalization method depicted in Fig. 2 (left).

To be concrete, the count matrix of the coding patterns {C_k}^B_{k=1} for each pixel during one snapshot can be obtained by summing all {C_k}^B_{k=1} as

C' = Σ_{k=1}^{B} C_k,   (5)

where each element in C' describes how many corresponding pixels of {X_k}^B_{k=1} are integrated into the measurement Y. We then normalize the measurement Y by C' to obtain the energy-normalized measurement Ȳ as

Ȳ = Y ⊘ C',   (6)

where ⊘ denotes matrix dot (element-wise) division. From Fig. 2 and the definition of Ȳ, it can be clearly observed that Ȳ carries more visual information than Y, which avoids the imbalanced energy distribution caused by the random coding patterns. In addition, for video SCI, Ȳ can be regarded as an approximate average of the high-speed frames {X_k}^B_{k=1}, preserving motionless information such as the background as well as motion-trail information.

3.2 AttRes-CNN

In order to utilize an RNN to sequentially reconstruct frames, a first (reference) frame is required. Towards this end, we propose a ResBlock [59] based deep CNN for the first frame (X̂_1) reconstruction. Aiming to fuse all the visual information in hand, including our proposed normalized measurement Ȳ and the coding patterns {C_k}^B_{k=1}, we take the concatenation

E = [Ȳ, Ȳ ⊙ C_1, Ȳ ⊙ C_2, ..., Ȳ ⊙ C_B]_3,   (7)

where [·]_3 denotes concatenation along the 3rd dimension and E ∈ R^{nx×ny×(B+1)}. Note that {Ȳ ⊙ C_k}^B_{k=1} are used here to approximate the real mask-modulated frames {X_k ⊙ C_k}^B_{k=1}. After this, E is fed into a deep CNN (Fig. 2 top-right) consisting of two four-layer sub-CNNs (F_CNN1 and F_CNN2), one three-layer ResBlock (F_resblock1), and one self-attention module [64] (F_atten) as

X̂_1 = F_CNN2(L_3), L_3 = F_atten(L_2), L_2 = F_resblock1(L_1), L_1 = F_CNN1(E),   (8)

where F_CNN1 is used to fuse the different visual information in E to obtain the feature L_1; F_resblock1 is employed to further capture the spatial correlation when going deeper, and also to alleviate the gradient vanishing problem; F_CNN2, whose structure is mirror-symmetric to F_CNN1, is used to reconstruct the first frame X̂_1 of the desired sequential images; and F_atten is developed to capture long-range dependencies (e.g., non-local similarity), discussed as follows.

Note that a traditional CNN is only able to capture local dependencies since the convolution operator has a local receptive field, while in images/videos non-local similarity [65] is widely used to improve restoration performance. To exploit the non-local information in networks, we employ a self-attention module [64] to capture the long-range dependencies [22] among regions to assist our first frame reconstruction.

Fig. 3. Self-attention module.

As shown in Fig. 3, we perform self-attention over the pixels of the feature map output by F_resblock1, denoted by L_2 ∈ R^{hx×hy×b}, where hx, hy and b represent the length, width and number of channels of the feature map L_2, respectively. By imposing 1×1 convolutions on L_2, we obtain the query Q, key K and value V matrices as

Q = w_1 ∗ L_2, K = w_2 ∗ L_2, V = w_3 ∗ L_2,   (9)

where {w_1, w_2} ∈ R^{1×1×b×b'} and w_3 ∈ R^{1×1×b×b}, with the fourth dimension representing the number of filters (b' for {w_1, w_2} and b for w_3), {Q, K} ∈ R^{hx×hy×b'}, V ∈ R^{hx×hy×b}, and ∗ represents the convolution operator. Q, K and V are then reshaped to Q' ∈ R^{hxy×b'}, K' ∈ R^{hxy×b'} and V' ∈ R^{hxy×b}, where hxy = hx × hy, which means we treat each pixel in the feature map L_2 as a "token" whose feature is of size 1×b'.
0162-8828 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Universidad Industrial de Santander. Downloaded on April 25,2022 at 15:28:26 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2022.3161934, IEEE
Transactions on Pattern Analysis and Machine Intelligence
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 6

treat each pixel in the feature map L2 as a “token”, whose over the consecutive frames, such as the background. The
f
feature is 1×b0 . In our experiments, we set b0 = 8b . After that reference image at the k th frame, Rk , is acquired by
we construct the attention map A ∈ Rhxy ×hxy with element
k−1 B
aj,i defined by
Rfk = Y − bf −
X X
Ct X t Ct Y. (14)
exp(si,j ) t=1 t=k+1
aj,i = Phxy , (10)
j=1 exp(si,j ) Recalling the definition of measurement Y in (1), which is
f
> the sum of B frames modulated by the mask, Rk can be
where si,j is the element in the matrix S = Q0 K0 ∈ seen as an approximation of Ck Xk , for the reason that
Rhxy ×hxy . Here aj,i represents that the extent of the model the predicted frames X b f before k and our proposed normal-
t
depends on the ith location when generating the j th region. ization measurement Y after k are used to approximate the
Having obtained the attention map A, we can impose it on corresponding real frames Xk . Basically, considering that
the value matrix V0 to achieve the self-attention feature map b f should be more accurate than Y ,
the approximation of X
L03 as f
k
Rk is going closer to the real Ck Xk . This is one of
L03 = reshape(AV0 ) ∈ Rhx ×hy ×b , (11)
the motivations that we build the backward RNN in the
0
where reshape() reshapes the 2D matrix AV0 ∈ Rhxy ×b to following subsection. Note that in the previous version [61],
the 3D matrix. Lastly, we multiply the self-attention feature only one extractor is used to extract features from reference
f f
map L03 by a scale learnable parameter λ and add it back to frame Rk and normalization measurement Y , although Rk
the input feature map L2 [22], leading to the final result (an approximation of modulated frame) and Y (an average
frame) are extremely different in visual.
L3 = L2 + λL03 . (12) f f
For z i,k , it is concatenated with the features z h,k ex-
Recapping the reconstruction process of the first frame f
tracted from the hidden units hk−1 (we initialize h1 with
in (8), it can be regarded as a nonlinear combination of Y zeros), to get the fused features g k
f
and {Y Ck }B k=1 . After obtaining the first frame X1 , we
b
use it as a base to reconstruct the following frames by our g fk = [z fi,k , z fh,k ]3 , z fh,k = Fh (hfk−1 ), (15)
next proposed sequential model. Therefore, it is important
where Fh is another CNN-based feature extractor. After
to build the ResBlock based CNN to obtain a good reference f
that, g k is fed into a two-layer ResBlock to achieve the
frame. f
hidden units hk at frame k as

3.3 Bidirectional Recurrent Reconstruction Network hfk = Fresblock2 (g fk ), (16)


After getting the first frame X
b 1 via the AttRes-CNN, we
which is then used to generate the forward reconstruction
now propose a bidirectional RNN to perform the recon- b f by a CNN as
b k }B in a sequential X k
struction of the following frames {X k=2 b f = Frec (hf ),
X k k (17)
manner. The overall structure of BIRNAT is described in
Fig. 2, and we give a further detailed discussion below. where Frec contains six convolutional layers and it recon-
structs the current frame from the feature fusing various
3.3.1 The Forward RNN information.
The forward RNN takes X b 1 as the initial input, fusing As a result, the current reconstructed frame X b f and hid-
k
different visual information at corresponding frames to f
den units hk are transported to the same cell to sequentially
sequentially output the forward reconstruction of other generate the next frame, until we pick up the last recon-
frames {X b f }B (the superscript f denotes ‘forward’). For structed frame X b f . Finally, we can get the reconstruction of
k k=2 B
simplicity, in the following description, we take the frame k forward RNN {X b f }B (we regard the construction of first
k k=1
as an example to describe the RNN cell, which is naturally frame X b 1 from CNN in (8) as X b f ).
1
extended to each frame. Although the forward RNN is able to achieve appealing
Specifically, at frame k where k = 2, · · · , B , a fusion results (refer to Table 1), it ignores the sequential infor-
block, including three parallel six convolutional layers Fx , mation in a reverse order, which has been widely used in
Fr and Fy as feature extractors, is used to fuse the visual natural language processing [66].
information of the reconstruction at the (k − 1)th frame Besides, we observe that the performance of forward
Xb f , a calculated reference image at the k th frame Rk and
k−1 RNN improves as k goes from 1 to B . We attribute it to
normalization measurement Y as the following two reasons: i) the latter frame uses more
h i
z fi,k = z fx,k , z fr,k , z fy , information from reconstructed frames; ii) the approxima-
3 (13) tion of the second item in (14) is more accurate. Based on
z fx,k = Fx (X b f ), z f = Fr (Rf ), z f = Fy (Y),
k−1 r,k k y these observations, we add the backward RNN to improve
the performance of reconstruction further, especially for the
where X b f , Rf and Y are fed into each CNN-based
k−1 k front frames.
f f f
feature extractor to achieve z x,k , z r,k and z f respectively,
f
which are then concatenated as the fused image feature z i,k . 3.3.2 The Backward RNN
Y can be viewed as an approximate average frame so that The reconstruction procedure in the forward RNN is re-
its feature can provide more consistent visual information covering the next frame by the previous frame, reference


3.3 Bidirectional Recurrent Reconstruction Network

After getting the first frame X̂_1 via the AttRes-CNN, we now propose a bidirectional RNN to perform the reconstruction of the following frames {X̂_k}^B_{k=2} in a sequential manner. The overall structure of BIRNAT is described in Fig. 2, and we give a detailed discussion below.

3.3.1 The Forward RNN

The forward RNN takes X̂_1 as the initial input, fusing different visual information at the corresponding frames to sequentially output the forward reconstructions of the other frames {X̂^f_k}^B_{k=2} (the superscript f denotes 'forward'). For simplicity, in the following description we take frame k as an example to describe the RNN cell, which naturally extends to each frame.

Specifically, at frame k, where k = 2, ..., B, a fusion block, including three parallel six-convolutional-layer feature extractors F_x, F_r and F_y, is used to fuse the visual information of the reconstruction at the (k−1)-th frame X̂^f_{k−1}, a calculated reference image at the k-th frame R^f_k, and the normalized measurement Ȳ as

z^f_{i,k} = [z^f_{x,k}, z^f_{r,k}, z^f_y]_3, z^f_{x,k} = F_x(X̂^f_{k−1}), z^f_{r,k} = F_r(R^f_k), z^f_y = F_y(Ȳ),   (13)

where X̂^f_{k−1}, R^f_k and Ȳ are fed into their own CNN-based feature extractors to obtain z^f_{x,k}, z^f_{r,k} and z^f_y respectively, which are then concatenated as the fused image feature z^f_{i,k}. Ȳ can be viewed as an approximate average frame, so its feature provides visual information that is consistent over the consecutive frames, such as the background. The reference image at the k-th frame, R^f_k, is acquired by

R^f_k = Ȳ − Σ_{t=1}^{k−1} C_t ⊙ X̂^f_t − Σ_{t=k+1}^{B} C_t ⊙ Ȳ.   (14)

Recalling the definition of the measurement Y in (1), which is the sum of B frames modulated by the masks, R^f_k can be seen as an approximation of C_k ⊙ X_k, because the predicted frames X̂^f_t before k and our proposed normalized measurement Ȳ after k are used to approximate the corresponding real frames X_k. Considering that the approximation by X̂^f should be more accurate than that by Ȳ, R^f_k comes closer to the real C_k ⊙ X_k as k grows. This is one of the motivations for the backward RNN built in the following subsection. Note that in the previous version [61], only one extractor was used to extract features from the reference frame R^f_k and the normalized measurement Ȳ, although R^f_k (an approximation of a modulated frame) and Ȳ (an average frame) are visually very different.

z^f_{i,k} is then concatenated with the features z^f_{h,k} extracted from the hidden units h^f_{k−1} (we initialize h^f_1 with zeros) to get the fused features g^f_k:

g^f_k = [z^f_{i,k}, z^f_{h,k}]_3, z^f_{h,k} = F_h(h^f_{k−1}),   (15)

where F_h is another CNN-based feature extractor. After that, g^f_k is fed into a two-layer ResBlock to obtain the hidden units h^f_k at frame k as

h^f_k = F_resblock2(g^f_k),   (16)

which are then used to generate the forward reconstruction X̂^f_k by a CNN as

X̂^f_k = F_rec(h^f_k),   (17)

where F_rec contains six convolutional layers and reconstructs the current frame from the fused feature. As a result, the current reconstructed frame X̂^f_k and the hidden units h^f_k are passed to the same cell to sequentially generate the next frame, until we obtain the last reconstructed frame X̂^f_B. Finally, we get the reconstruction of the forward RNN {X̂^f_k}^B_{k=1} (we regard the reconstruction of the first frame X̂_1 from the CNN in (8) as X̂^f_1).

Although the forward RNN is able to achieve appealing results (refer to Table 1), it ignores the sequential information in the reverse order, which has been widely used in natural language processing [66]. Besides, we observe that the performance of the forward RNN improves as k goes from 1 to B. We attribute this to the following two reasons: i) the later frames use more information from reconstructed frames; ii) the approximation in the second term of (14) is more accurate. Based on these observations, we add the backward RNN to further improve the reconstruction performance, especially for the front frames.

3.3.2 The Backward RNN

The reconstruction procedure in the forward RNN recovers the next frame from the previous frame, the reference information, and the shared Ȳ. In the backward RNN, we use a similar procedure but in reverse order. It is worth noting that this reverse reconstruction is also reasonable in SCI, because the frame order is only controlled by the order of the masks.

The backward RNN takes X̂^f_B and h^f_B as input to sequentially output the backward reconstruction of each frame {X̂^b_k}^1_{k=B−1} (the superscript b denotes 'backward'). At frame k, the structure of the backward RNN cell is similar to the forward one, with a slight difference in the inputs of each cell. Referring to Fig. 2 and the description of the forward RNN above, in the following we only discuss the differences between the backward and forward RNN.

The first difference is the second item in (13). Due to the opposite order to the forward RNN, at frame k the backward RNN uses the reconstruction of frame k+1. The corresponding networks of (13) for the backward RNN are thus changed to

z^b_{i,k} = [z^b_{x,k}, z^b_{r,k}, z^b_y]_3, z^b_{x,k} = F_x(X̂^b_{k+1}), z^b_{r,k} = F_r(R^b_k), z^b_y = F_y(Ȳ).   (18)

The second difference is the definition of the backward reference image R^b_k. In the definition of R^f_k in (14) at frame k, since the reconstructions of the frames after k are not yet obtained, we have to use the normalized measurement Ȳ to approximate them. In the backward RNN, it is natural to use each reconstruction from the forward RNN {X̂^f_k}^B_{k=1} directly, as

R^b_k = Ȳ − Σ_{t=1, t≠k}^{B} C_t ⊙ X̂^f_t,   (19)

where the frames used here are the reconstructions from the forward RNN. This has two benefits: 1) R^b_k is more accurate than R^f_k thanks to the estimated frames from the forward RNN; 2) when training uses the back-propagation algorithm, it provides a significant number of gradient-passing paths, which helps to jointly optimize the forward and backward reconstruction.

The networks used in the forward and backward RNN do not share parameters but have the same structure. Another important difference is that the initial hidden units h^f_1 are set to zeros in the forward RNN, while the initial hidden units h^b_B are set to h^f_B in the backward RNN. This change builds a closer connection between the forward and backward RNN and provides more information for the backward RNN.
between forward and backward RNN and provides more
information for backward RNN. nx ny
where E ∈ R 2 × 2 ×4(B+1) includes four color channel
modulation information and its superscripts r, g and b
3.4 BIRNAT for Different SCI systems denote the red, green and blue channels, respectively. Using
As mentioned above, we proposed a bidirectional recurrent the same feature extraction and reconstruction processing
reconstruction framework for SCI. Here, the gap between as AttRes-CNN in (8), the first RGB frame X b 1 is recovered
the framework and different SCI systems (e.g., monochro- directly without an additional demosaic operation.
matic and color video, and spectral SCI systems) will be After attaining the first RGB frame X1 , we then perform
filled. a forward and backward RNN in order to reconstruct the
remaining frames. Considering that each part can be re-
3.4.1 Monochromatic Video SCI System garded as a separate image, in the recurrent reconstruction
For the monochromatic snapshot compressive imaging sys- as shown in Fig. 4, we use four RNN cells corresponding to
tem, the detector directly captures the accumulated mod- four color components with the specific mask and then add
ulated illuminance during the exposure time. It is easy a set of convolution layers for demosaicing to achieve the
to obtain the masks of the system and given masks and full resolution RGB image.
measurements, BIRNAT will sequentially recover the video For the four parallel RNN cells, each of them individ-
frames under the above framework. ually reconstructs the specific color component. Different

0162-8828 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Universidad Industrial de Santander. Downloaded on April 25,2022 at 15:28:26 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2022.3161934, IEEE
Transactions on Pattern Analysis and Machine Intelligence
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 8

After obtaining the first RGB frame X̂_1, we then perform a forward and a backward RNN to reconstruct the remaining frames. Considering that each color component can be regarded as a separate image, in the recurrent reconstruction shown in Fig. 4 we use four RNN cells corresponding to the four color components with their specific masks, and then add a set of convolution layers for demosaicing to achieve the full-resolution RGB image.

For the four parallel RNN cells, each of them individually reconstructs its specific color component. Different from Sec. 3.3, at each timestamp the RNN cell fuses the sub-sampled version of a specific color channel to reconstruct the next frame by

z^{∗,f}_{i,k} = [z^{∗,f}_{x,k}, z^{∗,f}_{r,k}, z^{∗,f}_y]_3, z^{∗,f}_{x,k} = F_x(X̂^{∗,f}_{k−1}), z^{∗,f}_{r,k} = F_r(R^{∗,f}_k), z^{∗,f}_y = F_y(Ȳ^∗),   (21)

g^{∗,f}_k = [z^{∗,f}_{i,k}, z^{∗,f}_{h,k}]_3, z^{∗,f}_{h,k} = F_h(h^{∗,f}_{k−1}),   (22)

h^{∗,f}_k = F_resblock2(g^{∗,f}_k), X̂^{∗,f}_k = F_rec(h^{∗,f}_k),   (23)

where the superscript ∗ denotes one of the color components r, g1, g2 or b. Note that X̂^{∗,f}_{k−1}, R^{∗,f}_k and Ȳ^∗ are sub-sampled from X̂^f_{k−1}, R^f_k and Ȳ, respectively. In order to comply with the raw imaging format, the reference R^f_k is calculated by

R^f_k = Ȳ − Σ_{t=1}^{k−1} C_t ⊙ mosaic(X̂^f_t) − Σ_{t=k+1}^{B} C_t ⊙ Ȳ,   (24)

where mosaic(·) denotes the operation converting an RGB image to the raw mosaic image.

After obtaining the four current sub-images X̂^{r,f}_k, X̂^{g1,f}_k, X̂^{g2,f}_k and X̂^{b,f}_k from the four RNNs, we concatenate them and feed them into the demosaicing module to fuse these different color sub-images and obtain the full-resolution RGB image X̂^f_k by

X̂^f_k = F_demosaic([X̂^{r,f}_k, X̂^{g1,f}_k, X̂^{g2,f}_k, X̂^{b,f}_k]_3).   (25)

Equations (21) to (25) form a color forward recurrent reconstruction module including four RNN cells and a demosaicing layer. In order to further capture the reverse temporal information to improve performance, a backward reconstruction module is also employed to reconstruct in reverse order. Since the procedure is similar to the forward RNN, we omit the detailed description here.

3.4.3 Spectral SCI System

As shown in Fig. 1 bottom-left, the mask-modulated spatial-spectral data cube is shifted by a disperser, and due to the linear modulation, the original imaging model

Y = Σ_{k=1}^{B} shifting(X_k ⊙ C)_k,   (26)

is equal to the translated form

Y = Σ_{k=1}^{B} shifting(X_k)_k ⊙ shifting(C)_k,   (27)

where X_k ∈ R^{nx×ny} is the original scene, shifting(X_k) ∈ R^{nx×(ny+d_k)} is the modulated and shifted scene (d_k denotes the spatial shift of the k-th spectral channel), and the operation shifting(·)_k denotes the effect of spatial shifting due to the disperser. Thus the spectral compressive measurement has the same size as the shifted scene shifting(X_k)_k and the shifted mask shifting(C)_k, the normalized measurement can easily be calculated by Eq. (6), and for each spectral channel reconstruction the reference information from the gap between the real measurement and the previous frames is also easily computed. Note that in (27) we did not consider measurement noise. Based on this equivalent conversion, BIRNAT directly reconstructs the shifted scene shifting(X_k). After an invertible shifting operation, the original data cube is recovered.
noise. (33)


3.5 Optimization

BIRNAT contains four modules: i) the measurement energy normalization, ii) the AttRes-CNN, iii) the forward RNN and iv) the backward RNN. Except for i), each module has its own parameters. Specifically, all learnable parameters in BIRNAT are denoted by

Θ = {W_c, W_f, W_b},   (28)

where

W_c = {W^c_{CNN1}, W^c_{CNN2}, W^c_{resblock1}, W^c_{atten}},   (29)

are the parameters of the AttRes-CNN;

W_f = {W^f_x, W^f_r, W^f_y, W^f_h, W^f_{rec}, W^f_{resblock2}},   (30)

are the parameters of the forward RNN; and

W_b = {W^b_x, W^b_r, W^b_y, W^b_h, W^b_{rec}, W^b_{resblock2}},   (31)

are the parameters of the backward RNN. Especially for BIRNAT-color, Θ contains the additional parameters of the demosaicing module. In the following, we introduce how to jointly learn these parameters at the training stage and how to use the well-learned parameters at the testing stage.

3.5.1 Learning Parameters at the Training Stage

At the training stage, besides the measurements and the coding patterns {Y_n, {C_{n,k}}^B_{k=1}}^N_{n=1} for N training samples, the real frames {{X_{n,k}}^B_{k=1}}^N_{n=1} are also provided as the supervision signal. In order to minimize the reconstruction error of all the frames, the mean square error is used as the loss function

L_MSE = Σ_{n=1}^{N} (α L^f_n + L^b_n),
L^f_n = (1/(B n_x n_y)) Σ_{k=1}^{B} ||X̂^f_{n,k} − X_{n,k}||^2_2,   (32)
L^b_n = (1/((B−1) n_x n_y)) Σ_{k=1}^{B−1} ||X̂^b_{n,k} − X_{n,k}||^2_2,

where L^f_n and L^b_n represent the MSE losses of the forward and backward RNN, respectively, α is a trade-off parameter, which is set to 1 in our experiments, and n_x and n_y are the width and height of X.

To further improve the quality of each reconstructed frame and make the generated video smoother, we introduce adversarial training [67] in addition to the MSE loss in (32). To be more specific, the input video frames {X_{n,k}}^{N,B}_{n=1,k=1} are treated as "real" samples, while the reconstructed frames [{X̂^b_{n,k}}^{N,B−1}_{n=1,k=1}, {X̂^f_{n,B}}^N_{n=1}]_3, generated from the previous networks, are assumed to be the "fake" samples. The adversarial training loss can be formulated as

L_g = E_X[log D(X)] + E_Y[log(1 − D(G(Y, {C_k}^B_{k=1})))],   (33)

where G is the generator which outputs the reconstructed video frames, and D is the discriminator, having the same structure as in [68]. As a result, the final loss function of our model is

L = Σ_{n=1}^{N} (α L^f_n + L^b_n) + β L_g,   (34)

where β is a trade-off parameter. In the experiments, β is set to 0.001.
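A compact version of the training objective of Eqs. (32)-(34) could look like the following; the `discriminator`, the frame tensors and the log-loss form are schematic placeholders mirroring Eq. (33), not the exact BIRNAT training code.

```python
import torch
import torch.nn.functional as F

def birnat_loss(forward_frames, backward_frames, gt_frames,
                discriminator=None, alpha: float = 1.0, beta: float = 0.001):
    """Combined loss of Eqs. (32)-(34) for one training sample.

    forward_frames:  (B, nx, ny) forward-RNN reconstructions
    backward_frames: (B-1, nx, ny) backward-RNN reconstructions (frames 1..B-1)
    gt_frames:       (B, nx, ny) ground-truth frames
    """
    loss_f = F.mse_loss(forward_frames, gt_frames)        # L^f in Eq. (32)
    loss_b = F.mse_loss(backward_frames, gt_frames[:-1])  # L^b in Eq. (32)
    loss = alpha * loss_f + loss_b
    if discriminator is not None:                          # adversarial term of Eq. (33)
        fake = torch.cat([backward_frames, forward_frames[-1:]], dim=0)
        loss = loss + beta * torch.log(1.0 - discriminator(fake) + 1e-8).mean()
    return loss
```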
3.5.2 Performing SCI Reconstruction at the Testing Stage

During testing, we are only provided the measurement Y and the coding patterns {C_k}^B_{k=1}. With the well-learned network parameters Θ, we can obtain the frames {X̂^f_k}^B_{k=1} and {X̂^b_k}^1_{k=B−1}. Considering the advantage of the backward RNN, which uses the good visual features generated by the forward RNN, we use the reconstructed frames 1 to B−1 from the backward RNN and frame B from the forward RNN to construct the final reconstruction of our system, that is, [{X̂^b_k}^{B−1}_{k=1}, X̂^f_B]_3.
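At inference, the final output described above is just a concatenation of the backward results with the last forward frame; a minimal sketch (tensor names assumed) is:

```python
import torch

@torch.no_grad()
def assemble_final(forward_frames: torch.Tensor, backward_frames: torch.Tensor) -> torch.Tensor:
    """Final reconstruction of Sec. 3.5.2: backward frames 1..B-1 plus forward frame B."""
    return torch.cat([backward_frames, forward_frames[-1:]], dim=0)   # (B, nx, ny)
```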
4 EXPERIMENTS

In this section, we compare BIRNAT with several state-of-the-art methods on both simulation and real datasets in order to illustrate its performance and efficiency.

4.1 Datasets and Experimental Settings

4.1.1 Video Datasets

Considering the following two reasons: i) the video SCI reconstruction task does not have a specific training set; and ii) SCI imaging is suitable for any scene, we choose the datasets DAVIS2017 [69] and YouTube VOS [70], originally used for video object segmentation, and train BIRNAT on each with the same number of iterations. DAVIS2017 is a relatively small set that consists of 90 scenes including 6,208 frames, and YouTube VOS is a large one that consists of 3,471 scenes with 469,873 frames for training. Following the setting in [17], eight (B = 8) sequential frames are modulated by the shifting binary masks {C_k}^B_{k=1} and then collapsed into a single measurement Y. We randomly crop eight-frame patches of size 256 × 256 from the original scenes in DAVIS2017 and YouTube VOS, obtaining 26,000 and 150,000 training data pairs, respectively, with data augmentation.

We first evaluate BIRNAT on six grayscale simulation datasets, including Kobe, Runner, Drop, Traffic [17], Aerial and Vehicle [19]. Then we verify BIRNAT-color on six color simulation datasets, including Beauty, Bosphorus, Jockey, Runner, ShakeNDry and Traffic [47]. After that, we also evaluate BIRNAT on several real datasets captured by real video SCI cameras [3], [4], [12].

4.1.2 Spectral Datasets

For the sake of fair comparison, we train BIRNAT on the modified version of the CAVE [71] dataset, which consists of 32 hyperspectral images of spatial size 512 × 512 × 28, following the previous works [14], [23], and test on 10 simulated scenes of size 256 × 256 extracted from the KAIST [72] dataset. Besides, we apply BIRNAT to the real CASSI system [14], which captures 5 real scenes with 28 wavelengths from 450nm to 650nm. Following the same setting as previous works [14], [23], we use the real mask to generate the simulated measurements for training.

4.1.3 Implementation Details of BIRNAT

For a fair comparison of the performance on the different training sets, we train BIRNAT for the same number of back-propagations (i.e., 8.6 × 10^5) with batch size 3. Starting with an initial learning rate of 3 × 10^-4, we reduce the learning rate by 10% every 4 × 10^4 back-propagations, and it takes about 3 days to train the entire network. The Adam optimizer [73] is employed for the optimization. All experiments are run on an NVIDIA RTX 8000 GPU based on PyTorch.

4.1.4 Counterparts and Performance Metrics

As introduced above, various methods have been proposed for SCI reconstruction. Here we compare our model with several competitive counterparts for video SCI: the model-based optimization methods GAP-TV [28] and DeSCI [17], the plug-and-play methods [21], [47], and the deep learning-based methods E2E-CNN [12] and a 3D-Unet model reproduced from [40]. For spectral SCI, we compare our model with GAP-TV [28] and DeSCI [17], and several deep models: HSSP [74], λ-net [19], TSA-net [14], Deep-GSM [23] and GAP-net [75].

For the simulation datasets, both peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [76] are used as metrics to evaluate the performance. Besides, to see whether the methods can be applied to a real-time video system, we report the running time of reconstructing the frames at the testing stage.

Fig. 5. Reconstructed frames of GAP-TV, DeSCI, PnP-FFDNet, E2E-CNN, 3D-Unet and BIRNAT on six simulated video SCI datasets (Kobe #8, Traffic #16, Runner #4, Drop #8, Aerial #3, Vehicle #12). Please see the full results in the supplementary material for details.

TABLE 1
The average results of PSNR in dB (left entry) and SSIM (right entry) and running time per measurement/shot in seconds by
different algorithms on 6 grayscale simulation datasets and the GPU memory footprint (GB) during training (left entry) and testing
(right entry) for one sample. Best results are in bold, second best results are underlined.

Dataset Kobe Traffic Runner Drop Aerial Vehicle Average Time Memory
GAP-TV [28] 26.45, 0.8448 20.89, 0.7148 28.81, 0.9092 34.74, 0.9704 25.05, 0.8281 24.82, 0.8383 26.79, 0.8576 4.2 -
DeSCI [17] 33.25, 0.9518 28.72, 0.9250 38.76, 0.9693 43.22, 0.9925 25.33, 0.8603 27.04, 0.9094 32.72, 0.9347 6180 -
PnP-FFDNet [21] 30.50, 0.9256 24.18, 0.8279 32.15, 0.9332 40.70, 0.9892 25.27, 0.8291 25.42, 0.8493 29.70, 0.8924 3.0 -, 1.4
PnP-FastDVDnet [47] 32.73, 0.9466 27.95, 0.9321 36.29, 0.9619 41.82, 0.9892 27.98, 0.8966 27.32, 0.9253 32.35, 0.9420 0.1 -, 1.3
E2E-CNN [12] 27.79, 0.8071 24.62, 0.8403 34.12, 0.9471 36.56, 0.9494 27.18, 0.8690 26.43, 0.8817 29.45, 0.8824 0.0312 1.2, 0.4
3D-Unet [40] 29.00, 0.8868 25.44, 0.8699 34.32, 0.9522 37.26, 0.9676 27.57, 0.8800 26.60, 0.8949 30.03, 0.9086 0.0942 4.1, 1.4
BIRNAT-base [61] 32.71, 0.9504 29.33, 0.9422 38.70, 0.9760 42.28, 0.9918 28.99, 0.9166 27.84, 0.9274 33.31, 0.9507 0.1647 17.7, 3.7
BIRNAT (DAVIS2017) 33.22, 0.9549 29.53, 0.9460 38.99, 0.9767 42.76, 0.9923 29.18, 0.9185 27.99, 0.9334 33.61, 0.9536 0.1872 17.9, 3.8
BIRNAT (YouTube VOS) 33.18, 0.9554 29.80, 0.9482 39.46, 0.9799 42.94, 0.9925 29.19, 0.9214 28.05, 0.9365 33.77, 0.9556 0.1872 17.9, 3.8
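The PSNR/SSIM entries in Table 1 are averages over the reconstructed frames of each dataset. A minimal sketch of this evaluation, assuming ground-truth and reconstructed frames normalized to [0, 1] and using scikit-image's reference implementations, is given below; it is illustrative and not the authors' exact evaluation script.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def average_psnr_ssim(gt_frames, rec_frames):
    # gt_frames, rec_frames: arrays of shape (B, H, W) with values in [0, 1]
    psnrs, ssims = [], []
    for gt, rec in zip(gt_frames, rec_frames):
        psnrs.append(peak_signal_noise_ratio(gt, rec, data_range=1.0))
        ssims.append(structural_similarity(gt, rec, data_range=1.0))
    return float(np.mean(psnrs)), float(np.mean(ssims))

# Example with random placeholders standing in for one eight-frame measurement.
gt = np.random.rand(8, 256, 256)
rec = np.clip(gt + 0.01 * np.random.randn(8, 256, 256), 0, 1)
print(average_psnr_ssim(gt, rec))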

TABLE 2
The average results of PSNR in dB (left entry) and SSIM (right entry) and running time per measurement/shot in seconds by
different algorithms on 6 color simulation datasets. Best results are in bold, second best results are underlined.

Dataset Beauty Bosphorus Jockey Runner ShakeNDry Traffic Average Time


GAP-TV [28] 33.08, 0.9639 29.48, 0.9144 29.48, 0.8874 29.10, 0.8780 29.59, 0.8928 19.84, 0.6448 28.46, 0.8635 10.8
DeSCI [17] 34.66, 0.9711 32.88, 0.9518 34.14, 0.9382 36.16, 0.9489 30.94, 0.9049 24.62, 0.8387 32.23, 0.9256 92640
PnP-FFDNet [21] 33.21, 0.9629 28.43, 0.9046 32.30, 0.9182 30.83, 0.8875 27.87, 0.8606 21.03, 0.7113 28.93, 0.8742 25.8
PnP-FastDVDnet-gray [47] 33.01, 0.9628 33.01, 0.9628 33.51, 0.9279 32.82, 0.9004 29.92, 0.8920 22.81, 0.7764 30.50, 0.8989 52.2
PnP-FastDVDnet-color [47] 35.27, 0.9719 37.24, 0.9781 35.63, 0.9495 38.22, 0.9648 33.71, 0.9685 27.49, 0.9147 34.60, 0.9546 57
BIRNAT-color 36.08, 0.9750 38.30, 0.9817 36.51, 0.9561 39.65, 0.9728 34.26, 0.9505 28.03, 0.9151 35.47, 0.9585 0.98
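The running times in Tables 1 and 2 are measured per measurement at the testing stage. For a GPU network, a fair timing requires synchronizing the device before reading the clock; a small PyTorch sketch is shown below, where the model and its call signature are placeholders for any trained reconstruction network, not the actual BIRNAT interface.

import time
import torch

def time_per_measurement(model, measurement, masks):
    # model: any trained reconstruction network taking (measurement, masks);
    # this call signature is an assumption for illustration only.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    measurement, masks = measurement.to(device), masks.to(device)
    with torch.no_grad():
        if device == "cuda":
            torch.cuda.synchronize()       # make sure previous GPU work is done
        start = time.time()
        _ = model(measurement, masks)
        if device == "cuda":
            torch.cuda.synchronize()       # wait for the forward pass to finish
    return time.time() - start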

4.2 Results on Simulation Datasets

4.2.1 Grayscale Simulation Video
The performance comparisons on the six benchmark grayscale datasets are given in Table 1, using different algorithms, i.e., GAP-TV [28], DeSCI [17], PnP-FFDNet [21], PnP-FastDVDnet [47], E2E-CNN [12], 3D-Unet [40], our previous version BIRNAT [61] (BIRNAT-base), and the current BIRNAT trained on different datasets. Fig. 5 plots selected reconstructed frames of BIRNAT on these six datasets compared with GAP-TV, DeSCI, PnP-FFDNet, E2E-CNN and 3D-Unet. We can observe that while DeSCI smooths out the details in the reconstructed video, BIRNAT provides sharper borders and finer details, owing to a better exploration of both spatial and temporal information extracted by the CNN and the bidirectional RNN. We list the observations from Table 1 and Fig. 5 as follows.
i) BIRNAT outperforms the previous state-of-the-art method DeSCI on Traffic (1.08dB), Runner (0.7dB), Aerial (3.86dB) and Vehicle (1.01dB) in terms of PSNR. Obviously, BIRNAT can provide superior performance on the datasets with complex backgrounds, owing to the non-local features obtained with self-attention and the sequential dependencies constructed by the RNN. If the scene has complicated structures such as Aerial, the non-local similarity based method DeSCI degrades, while deep learning methods such as 3D-Unet and our BIRNAT can provide sharper tree branch edges due to the strong representation capability learned from the training data.
ii) DeSCI only improves a little over BIRNAT on Kobe (0.07dB) and Drop (0.28dB), since these videos have more similar patches across frames, which are helpful for DeSCI. Despite this, BIRNAT provides finer details in Kobe; for instance, the number '24' is closer to the ground truth than in other results. The Deep tensor ADMM-net [19] uses different training sets for different test sets and only reports results on Kobe (30.15dB), Aerial (26.85dB) and Vehicle (23.62dB), which are inferior to those of BIRNAT. In general, BIRNAT achieves the leading average performance on these six datasets for both PSNR and SSIM.
iii) Due to the efficient feedforward network used during testing, E2E-CNN, 3D-Unet and BIRNAT (after training) are significantly faster than the iterative optimization-based algorithms. Although E2E-CNN is faster than BIRNAT, BIRNAT provides better reconstruction in less than 200 milliseconds and achieves a 30,000 times speedup over the previous state-of-the-art method DeSCI. Although BIRNAT consumes more memory during training (partly from the discriminative network of adversarial training), its footprint is not large during testing.
iv) Compared with BIRNAT-base, the current BIRNAT yields a 0.3dB increase in average PSNR, which benefits from separating the approximated modulated frame and the normalized measurement; this is described in Sec. 4.3.
v) With the same number of back-propagations (iterations), training on the larger dataset YouTube VOS increases PSNR by 0.16dB. This means richer scenes are more suitable for deep neural networks.

To further explore the influence of the attention mechanism, the attention map is illustrated in Fig. 6, where we plot the attended active areas (represented by the highlighted red color) of a randomly selected pixel (represented by the yellow point). It can be seen that those non-local regions in red correspond to highly semantically related areas. These attention-aware features provide long-range spatial dependencies among pixels, which are helpful for the first frame reconstruction.


In Fig. 7, we plot the frame-wise PSNR and SSIM curves of BIRNAT and other algorithms on Traffic and Vehicle. Recall that the final outputs of BIRNAT are [{X̂_k^b}_{k=1}^{B-1}, X̂_B^f], where the first B − 1 frames are from the backward RNN and the last frame is from the forward RNN; the first frame in the forward RNN is from AttRes-CNN. In Fig. 7, the frames reconstructed by the forward RNN and by BIRNAT within one measurement (every eight frames) show an uptrend and a downtrend, respectively, due to the recurrent reconstruction. This means that it is beneficial to recover the next frame from a more accurate reference and a better reconstructed previous frame. Our final BIRNAT mitigates these trends by using a bidirectional RNN. In addition, to further dig out the benefit of frame-by-frame recurrent reconstruction, we fine-tune the trained BIRNAT, i.e., after the frames are reconstructed, we re-input the first frame reconstructed by the backward RNN to the Bi-RNN to reconstruct all frames again, and apply the MSE loss function to the two reconstructed videos. Under this fine-tuning (an additional 2.5 × 10^5 iterations), the performance of the second reconstruction is higher (by 0.4dB), and the reconstruction performance does not improve significantly (about 0.1dB) as the number of reconstruction loops increases further.

Fig. 6. Selected attention maps of the first frame. Yellow points denote the pixels randomly selected from each image, and red areas denote the active places.

Fig. 7. Frame-wise reconstruction quality of BIRNAT and other algorithms on Traffic (a-b) and Vehicle (c-d).

4.2.2 Color Simulation Video
In order to evaluate the performance of different methods on color SCI reconstruction, we use six mid-scale benchmark color datasets [47] (some frames are plotted in Fig. 8) of size 512 × 512 × 3 × 32, where 3 denotes the RGB channels and B = 8 video frames are compressed into one measurement. To generate the measurement, we use the same shifting binary masks as used in [47] to modulate the mosaic video of size 512 × 512. We only train BIRNAT-color on the DAVIS2017 set for speed consideration.
The reconstruction results of color SCI are given in Table 2, using different algorithms, i.e., GAP-TV [28], DeSCI [17], PnP-FFDNet [21], PnP-FastDVDnet [47] (gray and color) and BIRNAT-color. Fig. 8 depicts several selected color reconstruction frames of BIRNAT-color and the other methods. Except for BIRNAT-color, the other algorithms first reconstruct the mosaic video from the mosaic measurement and then demosaic¹ it to obtain the final color (RGB) video. For BIRNAT-color, we directly obtain the color video because the demosaicing process has been integrated into the network.
It can be seen in Table 2 that our proposed BIRNAT-color achieves a 0.9dB gain in PSNR over the previous state-of-the-art PnP-FastDVDnet-color. From Fig. 8, we can observe that for scenes with complex structures such as ShakeNDry, BIRNAT-color provides sharper hairs with fewer artifacts than the others. For scenes with simple structures and motion such as Jockey and Runner, the BIRNAT-color results include more pleasant details (e.g., accurate stripes of the shoes). Although optimization-based methods such as GAP-TV and DeSCI perform well in grayscale SCI reconstruction, they cannot jointly optimize SCI reconstruction and demosaicing. This gap degrades the quality of color video SCI reconstruction.

Fig. 8. Reconstructed frames of GAP-TV, DeSCI, PnP-FFDNet, PnP-FastDVDnet (gray and color) and BIRNAT-color on six color simulated video SCI datasets (Beauty #6, Bosphorus #8, Jockey #16, Runner #12, ShakeNDry #4, Traffic #5).

1. We use the MATLAB function demosaic [77] for these algorithms after they reconstruct each channel separately. It is possible for these algorithms to get better results if a joint procedure is used.
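The baseline pipeline described above (reconstruct the Bayer-mosaic video, then demosaic each frame offline) can be sketched as follows. The RGGB layout, the OpenCV conversion code and the array names are illustrative assumptions; BIRNAT-color instead learns this mapping jointly inside the network.

import cv2
import numpy as np

def rgb_to_bayer(rgb):
    # Build a single-channel RGGB mosaic from an RGB frame (H, W, 3) in [0, 1];
    # the RGGB pattern is an assumption for illustration.
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w), dtype=rgb.dtype)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]   # R
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]   # G
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]   # G
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]   # B
    return mosaic

# A reconstructed mosaic frame (random placeholder) is demosaiced offline, as the
# compared algorithms do with MATLAB's demosaic; OpenCV's Bayer conversion plays
# the same role here (the exact Bayer code depends on the sensor layout).
rec_mosaic = (np.random.rand(512, 512) * 255).astype(np.uint8)
rgb_frame = cv2.cvtColor(rec_mosaic, cv2.COLOR_BayerBG2BGR)
print(rgb_frame.shape)  # (512, 512, 3)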


TABLE 3
The average results of PSNR in dB (left entry) and SSIM (right entry) by different algorithms on 10 spectral images simulation
datasets. Best results are in bold, second best results are underlined.

Algorithm GAP-TV [28] DeSCI [17] HSSP [74] λ-net [19] TSA-net [14] Deep-GSM [23] GAP-net [75] BIRNAT
Scene1 25.13, 0.724 27.15, 0.794 31.07, 0.852 30.82, 0.880 31.26, 0.887 32.62, 0.920 33.03, 0.921 36.05, 0.956
Scene2 20.67, 0.630 22.26, 0.694 26.30, 0.798 26.30, 0.846 26.88, 0.855 27.65, 0.892 29.52, 0.903 33.93, 0.960
Scene3 23.19, 0.757 26.56, 0.877 29.00, 0.875 29.42, 0.916 30.03, 0.921 30.46, 0.925 33.04, 0.940 38.31, 0.975
Scene4 35.13, 0.870 39.00, 0.965 38.24, 0.926 37.37, 0.962 39.90, 0.964 39.65, 0.970 41.59, 0.972 46.71, 0.990
Scene5 22.31, 0.674 24.80, 0.778 27.98, 0.827 27.84, 0.866 28.89, 0.878 28.64, 0.894 30.95, 0.924 34.63, 0.967
Scene6 22.90, 0.635 23.55, 0.753 29.16, 0.823 30.69, 0.886 31.30, 0.895 33.28, 0.938 32.88, 0.927 35.88, 0.964
Scene7 17.98, 0.670 20.03, 0.772 24.11, 0.851 24.20, 0.875 25.16, 0.887 25.73, 0.898 27.60, 0.921 31.18, 0.957
Scene8 23.00, 0.624 20.29, 0.740 27.94, 0.831 28.86, 0.880 29.69, 0.887 31.82, 0.932 30.17, 0.904 34.43, 0.962
Scene9 23.36, 0.717 23.98, 0.818 29.14, 0.822 29.32, 0.902 30.03, 0.903 30.11, 0.925 32.74, 0.927 38.09, 0.973
Scene10 23.70, 0.551 25.94, 0.666 26.44, 0.740 27.66, 0.843 28.32, 0.848 30.97, 0.933 29.73, 0.901 32.23, 0.945
Average 23.73, 0.683 25.86, 0.785 28.93, 0.834 29.25, 0.886 30.15, 0.893 31.09, 0.923 32.13, 0.924 36.14, 0.965

Fig. 9. Reconstructed frames of GAP-TV, DeSCI, HSSP, λ-net, TSA-net, Deep-GSM, GAP-net and BIRNAT on simulated spectral SCI datasets: (a) reconstruction of Scene 7 and (b) reconstruction of Scene 9 at selected wavelengths (476.5nm, 522.7nm, 575.3nm, 648.1nm), together with the average spectral density curves of two selected regions and their correlation (corr) with the ground truth. Please see the full results in the supplementary material for details.

4.2.3 Snapshot Compressive Hyperspectral Image
As mentioned in Sec. 3.4.3, we have applied the proposed BIRNAT to snapshot compressive hyperspectral imaging reconstruction. Reconstruction results of BIRNAT and the comparison methods, e.g., GAP-TV [28], DeSCI [17], HSSP [74], λ-net [19], TSA-net [14], Deep-GSM [23] and GAP-net [75], on both simulated data from KAIST and real data from the CASSI system are presented. We give the results of the simulated data on the 10 scenes in Table 3 and Fig. 9 here. As shown in Table 3, BIRNAT significantly outperforms both model-based and deep learning-based methods (by over 4dB in average PSNR and 0.041 in average SSIM) due to modeling the sequential correlation by RNNs and the sequential reconstruction.
In Fig. 9, the results of BIRNAT show two obvious advantages: i) clearer edges and fewer artifacts, e.g., in Fig. 9 (a), the reconstructed frames at 575.3nm and 648.1nm include clean patterns and lines; ii) more accurate spectra, e.g., in Fig. 9 (a) 522.7nm and (b) 522.7nm, the results of BIRNAT have more consistent spectral curves with the ground truth in the zoomed area, where other methods cannot recover such accurate frames in different spectra. On the left of Fig. 9, we plot the average density of all spectra in the selected areas. It can be seen that BIRNAT has the highest correlation value with the ground truth.
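The correlation values reported alongside Fig. 9 compare the spectral curve of a reconstructed region with the ground truth. A minimal sketch of that comparison (region averaging followed by a Pearson correlation) is given below; the region, the cube sizes and the arrays are placeholders, not the actual evaluation data.

import numpy as np

def spectral_correlation(gt_cube, rec_cube, region):
    # gt_cube, rec_cube: (H, W, L) hyperspectral cubes; region: (row_slice, col_slice).
    rows, cols = region
    gt_curve = gt_cube[rows, cols, :].mean(axis=(0, 1))    # average spectrum of the region
    rec_curve = rec_cube[rows, cols, :].mean(axis=(0, 1))
    return float(np.corrcoef(gt_curve, rec_curve)[0, 1])   # Pearson correlation

gt = np.random.rand(256, 256, 28)
rec = np.clip(gt + 0.05 * np.random.randn(256, 256, 28), 0, 1)
print(spectral_correlation(gt, rec, (slice(100, 140), slice(100, 140))))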


TABLE 4
Ablation study of BIRNAT on the 6 grayscale and 10 spectral-image simulation datasets; the average PSNR and SSIM are shown.

Variant | AttRes-CNN: self-attention | RNN: X̂, separate R and Y, concat R and Y, bidirectional | Training: adversarial | Video: PSNR, SSIM | Spectral images: PSNR, SSIM
RNN w/o r ! 29.84 0.899 31.57 0.902
RNN w/o x ! 31.06 0.923 33.21 0.931
RNN-c ! ! 31.86 0.934 34.28 0.938
RNN-s ! ! 31.97 0.935 34.43 0.938
Bi-RNN ! ! ! 32.98 0.938 35.72 0.953
BIRNAT w/o SA ! ! ! ! 33.11 0.946 35.78 0.957
BIRNAT w/o AT ! ! ! ! 33.18 0.949 35.89 0.960
BIRNAT-base [61] ! ! ! ! ! 33.31 0.950 35.98 0.962
BIRNAT ! ! ! ! ! 33.61 0.953 36.14 0.965
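The 'bidirectional' column in Table 4 corresponds to the output-assembly rule described in Sec. 4.2.1: the forward RNN produces frames 1 to B (frame 1 coming from AttRes-CNN), the backward RNN then refines frames B−1 down to 1, and the released video keeps the backward frames 1 to B−1 plus the forward frame B. The sketch below only illustrates this bookkeeping; the cell functions and their signatures are placeholders, not the actual BIRNAT modules.

def birnat_outputs(forward_cell, backward_cell, first_frame, B=8):
    # first_frame: frame 1 reconstructed by AttRes-CNN (placeholder object).
    forward = [first_frame]
    for k in range(1, B):                       # forward pass: frames 2 ... B
        forward.append(forward_cell(forward[-1], k))
    backward = [None] * B
    backward[B - 1] = forward[B - 1]            # frame B is taken from the forward RNN
    for k in range(B - 2, -1, -1):              # backward pass: frames B-1 ... 1
        backward[k] = backward_cell(backward[k + 1], forward[k], k)
    # Final output: backward frames 1 ... B-1 followed by the forward frame B.
    return backward[:B - 1] + [forward[B - 1]]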

TABLE 5
Comparison of grayscale video reconstruction quality (average PSNR in dB over the six grayscale simulation datasets) of the four methods at different measurement noise levels.

Noise Level 0.001 0.005 0.01 0.05
BIRNAT 33.60 31.59 29.24 24.17
GAP-TV 26.68 26.02 24.89 17.81
PnP-FFDNet 29.10 27.25 24.79 15.42
PnP-FastDVDnet 32.19 29.92 27.14 17.27

TABLE 6
Comparison of grayscale video reconstruction quality (average PSNR in dB over the six grayscale simulation datasets) of the four methods at different mask error levels.

Noise Level 0.001 0.005 0.01 0.05
BIRNAT 33.76 33.35 33.24 30.89
GAP-TV 25.82 25.34 25.21 24.77
PnP-FFDNet 29.20 27.57 26.14 24.66
PnP-FastDVDnet 32.33 28.96 27.34 25.41

Fig. 10. Real data Wheel (256 × 256 pixels, frames #1–#14): results of GAP-TV, DeSCI, E2E-CNN, PnP-FFDNet and BIRNAT.
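The robustness protocols behind Tables 5 and 6, detailed in the ablation study below, reduce to adding zero-mean Gaussian perturbations either to the normalized measurement or to the binary masks before running the pretrained network. A small sketch under those stated assumptions, with random placeholder data:

import numpy as np

def perturb_measurement(measurement, sigma):
    # Normalize the measurement to [0, 1], then add Gaussian noise (Table 5 protocol).
    y = measurement / (measurement.max() + 1e-8)
    return y + sigma * np.random.randn(*y.shape)

def perturb_masks(masks, sigma):
    # Corrupt the binary masks with Gaussian noise (Table 6 protocol); the
    # pretrained network is then evaluated with these corrupted masks.
    return masks + sigma * np.random.randn(*masks.shape)

masks = (np.random.rand(8, 256, 256) > 0.5).astype(np.float32)
frames = np.random.rand(8, 256, 256)
y = np.sum(masks * frames, axis=0)
print(perturb_measurement(y, 0.01).shape, perturb_masks(masks, 0.01).shape)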

4.3 Ablation Study
To verify the contribution of each module in BIRNAT, we conduct experiments with partial components of BIRNAT, listed in Table 4, on the 6 grayscale (training on DAVIS2017) and 10 spectral-image simulation datasets. The different variants of BIRNAT are shown in the first column of Table 4, and '!' indicates that the corresponding module is included in the variant.
• 'RNN w/o x', 'RNN w/o r', 'RNN-c', and 'RNN-s' are variants of the basic RNN cell used to verify its structure. They have the same AttRes-CNN without self-attention and only include a forward RNN.
• 'RNN w/o r' and 'RNN w/o x' only contain the input X̂ or the concatenation of R and Y, respectively. It can be seen that 'RNN w/o r' only provides limited results. This means that predicting the next frame just based on the previous frame is challenging, like the video prediction task.
• On the other hand, 'RNN w/o x' shows that the approximated modulated frame and the normalized measurement carry more valuable information.
• Naturally, considering X̂, R, and Y together provides better results than 'RNN w/o x' and 'RNN w/o r'.
• In BIRNAT-base [61], the concatenation of R and Y is utilized as the reference information and a single CNN is employed to extract the feature, but it seems that a shared extractor cannot lead to satisfying results. In this paper, we separate R and Y and use two networks to extract their features, respectively. As the results of 'RNN-c' and 'RNN-s' in Table 4 show, separating them indeed produces better results.
• 'Bi-RNN' includes the forward and backward RNNs, and we can see that adding the backward RNN is beneficial to the reconstruction.
• 'BIRNAT w/o SA' and 'BIRNAT w/o AT' are partial versions of BIRNAT-base [61], and their results indicate that the self-attention mechanism is beneficial to the reconstruction performance. We can observe that the long-range dependencies explored among regions during the first frame reconstruction and the adversarial training can further improve the quality of the reconstructed frames.
The final version of BIRNAT achieves the best result with all the above useful parts.
To verify the robustness of BIRNAT under noise, comparisons of BIRNAT, GAP-TV, PnP-FFDNet and PnP-FastDVDnet under measurement noise are given in Table 5. To add simulated noise, we first normalize the measurement to 0-1 and then add Gaussian noise with different standard deviation levels to the measurement. The experiments show that these methods suffer a certain degree of performance degradation under noise, and BIRNAT is more robust to noise than the other methods. To quantify the sensitivity of the DNN to errors in the masks, we verify the performance of the pretrained BIRNAT on corrupted masks (adding Gaussian noise with different standard deviation levels to the shifted binary masks used for training). The results are given in Table 6. Although the performance degradation of GAP-TV is not obvious as the noise level increases, its degradation is already evident at low noise levels. BIRNAT is robust to errors of the mask thanks to the recurrent reconstruction, which leads the errors to decrease as the timestamp increases, especially at low error levels.
Fig. 11. Real data Domino (512 × 512 pixels, frames #1–#10): results of GAP-TV, DeSCI, PnP-FFDNet, E2E-CNN and BIRNAT.

Fig. 12. Real data Water Balloon (512 × 512 pixels, frames #1–#10): results of GAP-TV, DeSCI, PnP-FFDNet, E2E-CNN and BIRNAT.

4.4 Results on Real SCI Data
We now apply BIRNAT to real data captured by SCI cameras [3], [4], [12] to verify its robustness. Note that in the simulation data, we consider the measurement to be noise free. However, in real data, noise is unavoidable and thus SCI reconstruction for real data is more challenging. Since the performance of the trained network depends on the mask, we train a corresponding network for each specific SCI system, using simulated measurements synthesized from the real masks and the original video frames from DAVIS2017, respectively.

4.4.1 Grayscale High-Speed Video
The Wheel [3] snapshot measurement of size 256 × 256 pixels encodes 14 (B = 14) grayscale high-speed frames. The mask is a shifting random mask with the pixel shifts determined by the pre-set translation of the printed film. The reconstruction results of Wheel are shown in Fig. 10. It can be seen clearly that GAP-TV [28], DeSCI [17], E2E-CNN [12], and PnP-FFDNet [21] have some unpleasant artifacts. By contrast, BIRNAT provides clear boundaries of the letter 'D' and has few artifacts around 'D'. Compared with the other methods, the brightness of the frames reconstructed by BIRNAT is more consistent.
The grayscale snapshot measurements of Domino and Water Balloon with size 512 × 512 pixels encode 10 frames, in which the mask is controlled by a DMD [12]. The reconstruction results of Domino and Water Balloon are shown in Fig. 11 and Fig. 12, respectively.


Fig. 13. Real data Hammer (512 × 512 pixels, frames #1–#22): results of GAP-TV, DeSCI, PnP-FFDNet and BIRNAT-color.

DeSCI and PnP-FFDNet preserve a clean background but over-smooth the letters on the dominoes in Fig. 11. However, BIRNAT can provide clear and distinct letters, and we also notice that E2E-CNN can provide clearer letters than DeSCI and PnP-FFDNet. For Water Balloon, similar to the other scenes, BIRNAT reconstructs clear boundaries of the ball with fewer artifacts (shown in the green box in Fig. 12).

4.4.2 Color High-Speed Video
Regarding the color real data, the Hammer [4] mosaic snapshot measurement of size 512 × 512 pixels encodes 22 frames by a shifting random mask. The reconstruction results of Hammer are shown in Fig. 13. Since the previously trained E2E-CNN cannot handle the color scene [12], we only compare BIRNAT with GAP-TV [28], DeSCI [17] and PnP-FFDNet [21]. It can be clearly seen that DeSCI and PnP-FFDNet recover a smoother background, but BIRNAT provides sharper edges of the hammer and the red apple. In summary, the video reconstructed by BIRNAT shows finer and more complete details compared with the other methods, with a significant reduction in reconstruction time during testing compared to DeSCI. This indicates the applicability and the efficiency of our algorithm in real applications.

4.4.3 Hyperspectral Image
We evaluate the proposed BIRNAT on 5 real scenes from the SD-CASSI [14] system, which captures 28 spectral channels from 450nm to 650nm and has 54 pixels of dispersion (each channel disperses 2 pixels) in the horizontal dimension. Therefore, we retrain BIRNAT using the real mask of size 660 × 714 on the CAVE dataset for this system and compare BIRNAT with 5 competitive methods, e.g., GAP-TV [28], λ-net [22], TSA-net [14], Deep-GSM [23] and GAP-net [75].
Fig. 14 shows the full 28-channel reconstruction of BIRNAT. The boundary and details of the red Strawberry in the longer wavelength channels from 604.2nm to 648.1nm can be clearly seen. Visual comparison results of the different methods on two real scenes at two selected spectra are shown in Fig. 15. As we can observe, the proposed method recovers finer details of the textures and more consistent spectral curves, especially in the right scene of Fig. 15, where the reconstruction of BIRNAT shows clearer plant textures than the others.
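As a rough illustration of the SD-CASSI measurement described above (28 channels, each dispersed by 2 pixels, 54 pixels in total), the sketch below shifts each masked spectral channel horizontally before summation; the mask, the cube and the shift origin are simplifying assumptions rather than the exact system model.

import numpy as np

def cassi_measurement(cube, mask, step=2):
    # cube: (H, W, L) hyperspectral scene; mask: (H, W) coded aperture.
    h, w, L = cube.shape
    y = np.zeros((h, w + step * (L - 1)))            # 27 shifts * 2 pixels = 54 extra columns
    for l in range(L):
        y[:, step * l:step * l + w] += mask * cube[:, :, l]
    return y

cube = np.random.rand(256, 256, 28)
mask = (np.random.rand(256, 256) > 0.5).astype(np.float64)
print(cassi_measurement(cube, mask).shape)           # (256, 310)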

5 CONCLUSIONS
In this paper, we have proposed a recurrent reconstruction framework consisting of an attention-residual-block-based convolutional neural network and a bidirectional recurrent neural network for snapshot compressive imaging. The proposed model, BIRNAT, reconstructs the first frame through the AttRes-CNN, and then the following frames are sequentially inferred by a bidirectional RNN. Due to its powerful learning capability, BIRNAT achieves state-of-the-art performance on simulated and real, grayscale, color and spectral benchmark datasets. Thanks to the efficiency of neural networks, BIRNAT can recover the sequential frames from a measurement in less than one second. The high reconstruction quality and fast inference time of BIRNAT will help practical end-to-end SCI systems be deployed in our daily life.
Although BIRNAT has achieved satisfactory results, there are still several gaps between practical applications and current research. Most current research mainly focuses on simulated data, which uses an ideal binary mask and a noise-free setting, whereas for real optical compressive imaging systems mask errors and noise are inevitable.

Fig. 14. Reconstructed frames of BIRNAT on the Strawberry real dataset across all 28 spectral channels (453.3nm–648.1nm), together with the RGB reference and the measurement.

Fig. 15. Reconstructed frames of GAP-TV, λ-net, TSA-net, Deep-GSM, GAP-net and BIRNAT on two real spectral SCI datasets Lego and Plants (at 567.5nm/648.1nm and 604.2nm/614.4nm), with references and measurements. Please check the full results in the supplementary material for details.

On the other hand, the flexibility of compressive imaging systems and reconstruction algorithms is also a promising research direction: different imaging parameters (coding patterns and compression ratios) would be adaptively chosen for different scenarios, and these scenarios would be recovered by a flexible algorithm without additional training.

ACKNOWLEDGMENTS
Bo Chen acknowledges the support of NSFC (U21B2006 and 61771361), Shaanxi Youth Innovation Team Project, the 111 Project (No. B18039) and the Program for Oversea Talent by Chinese Central Government.

REFERENCES
[1] Y. Altmann, S. McLaughlin, M. J. Padgett, V. K. Goyal, A. O. Hero, and D. Faccio, "Quantum-inspired computational imaging," Science, vol. 361, no. 6403, p. eaat2298, 2018.
[2] J. N. Mait, G. W. Euliss, and R. A. Athale, "Computational imaging," Adv. Opt. Photon., vol. 10, no. 2, pp. 409–483, Jun 2018.
[3] P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady, "Coded aperture compressive temporal imaging," Optics Express, vol. 21, no. 9, pp. 10526–10545, 2013.
[4] X. Yuan, P. Llull, X. Liao, J. Yang, D. J. Brady, G. Sapiro, and L. Carin, "Low-cost compressive sensing for color video and depth," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3318–3325.
[5] A. Wagadarikar, R. John, R. Willett, and D. Brady, "Single disperser design for coded aperture snapshot spectral imaging," Applied Optics, vol. 47, no. 10, pp. B44–B51, 2008.
[6] M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz, "Single-shot compressive spectral imaging with a dual-disperser architecture," Optics Express, vol. 15, no. 21, pp. 14013–14027, 2007.
[7] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, "DVC: An end-to-end deep video compression framework," CVPR, 2019.
[8] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
[9] E. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, February 2006.
[10] Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, and S. K. Nayar, "Video from a single coded exposure photograph using a learned overcomplete dictionary," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 287–294.
[11] D. Reddy, A. Veeraraghavan, and R. Chellappa, "P2C2: Programmable pixel compressive camera for high speed imaging," in CVPR 2011. IEEE, 2011, pp. 329–336.
[12] M. Qiao, Z. Meng, J. Ma, and X. Yuan, "Deep learning for video compressive sensing," APL Photonics, vol. 5, no. 3, p. 030801, 2020. [Online]. Available: https://doi.org/10.1063/1.5140721
[13] X. Yuan, D. J. Brady, and A. K. Katsaggelos, "Snapshot compressive imaging: Theory, algorithms, and applications," IEEE Signal Processing Magazine, vol. 38, no. 2, pp. 65–88, 2021.
[14] Z. Meng, J. Ma, and X. Yuan, "End-to-end low cost compressive spectral imaging with spatial-spectral self-attention," in European Conference on Computer Vision (ECCV), August 2020.
[15] X. Lin, Y. Liu, J. Wu, and Q. Dai, "Spatial-spectral encoded compressive hyperspectral imaging," ACM Transactions on Graphics (TOG), vol. 33, no. 6, pp. 1–11, 2014.
[16] D. S. Jeon, S.-H. Baek, S. Yi, Q. Fu, X. Dun, W. Heidrich, and M. H. Kim, "Compact snapshot hyperspectral imaging with diffracted rotation," 2019.
[17] Y. Liu, X. Yuan, J. Suo, D. Brady, and Q. Dai, "Rank minimization for snapshot compressive imaging," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 12, pp. 2990–3006, Dec 2019.
[18] M. Iliadis, L. Spinoulas, and A. K. Katsaggelos, "Deep fully-connected networks for video compressive sensing," Digital Signal Processing, vol. 72, pp. 9–18, 2018.
[19] J. Ma, X. Liu, Z. Shou, and X. Yuan, "Deep tensor ADMM-Net for snapshot compressive imaging," in IEEE/CVF Conference on Computer Vision (ICCV), 2019.


[20] M. Qiao, X. Liu, and X. Yuan, "Snapshot spatial–temporal compressive imaging," Opt. Lett., vol. 45, no. 7, pp. 1659–1662, Apr 2020.
[21] X. Yuan, Y. Liu, J. Suo, and Q. Dai, "Plug-and-play algorithms for large-scale snapshot compressive imaging," in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[22] X. Miao, X. Yuan, Y. Pu, and V. Athitsos, "λ-net: Reconstruct hyperspectral images from a snapshot measurement," in IEEE/CVF Conference on Computer Vision (ICCV), 2019.
[23] T. Huang, W. Dong, X. Yuan, J. Wu, and G. Shi, "Deep Gaussian scale mixture prior for spectral compressive imaging," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16216–16225.
[24] Y. Sun, X. Yuan, and S. Pang, "Compressive high-speed stereo imaging," Opt. Express, vol. 25, no. 15, pp. 18182–18190, 2017.
[25] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI), ser. LNCS, vol. 9351. Springer, 2015, pp. 234–241 (available on arXiv:1505.04597 [cs.CV]). [Online]. Available: http://lmb.informatik.uni-freiburg.de/Publications/2015/RFB15a
[26] G. Barbastathis, A. Ozcan, and G. Situ, "On the use of deep learning for computational imaging," Optica, vol. 6, no. 8, pp. 921–943, 2019.
[27] Z. Meng, M. Qiao, J. Ma, Z. Yu, K. Xu, and X. Yuan, "Snapshot multispectral endomicroscopy," Opt. Lett., vol. 45, no. 14, pp. 3897–3900, Jul 2020.
[28] X. Yuan, "Generalized alternating projection based total variation minimization for compressive sensing," in 2016 IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 2539–2543.
[29] J. Bioucas-Dias and M. Figueiredo, "A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration," IEEE Transactions on Image Processing, vol. 16, no. 12, pp. 2992–3004, December 2007.
[30] J. Yang, X. Yuan, X. Liao, P. Llull, G. Sapiro, D. J. Brady, and L. Carin, "Video compressive sensing using Gaussian mixture models," IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4863–4878, November 2014.
[31] J. Yang, X. Liao, X. Yuan, P. Llull, D. J. Brady, G. Sapiro, and L. Carin, "Compressive sensing by learning a Gaussian mixture model from measurements," IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 106–119, January 2015.
[32] S. Gu, L. Zhang, W. Zuo, and X. Feng, "Weighted nuclear norm minimization with application to image denoising," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2862–2869.
[33] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, January 2011.
[34] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 341–349. [Online]. Available: http://papers.nips.cc/paper/4686-image-denoising-and-inpainting-with-deep-neural-networks.pdf
[35] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, July 2017.
[36] G. Ongie, A. Jalal, C. A. Metzler, R. G. Baraniuk, A. G. Dimakis, and R. Willett, "Deep learning techniques for inverse problems in imaging," IEEE Journal on Selected Areas in Information Theory, 2020.
[37] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, "ReconNet: Non-iterative reconstruction of images from compressively sensed random measurements," in CVPR, 2016.
[38] K. Xu and F. Ren, "CSVideoNet: A real-time end-to-end learning framework for high-frame-rate video compressive sensing," arXiv:1612.05203, Dec 2016.
[39] X. Yuan and Y. Pu, "Parallel lensless compressive imaging via deep convolutional neural networks," Optics Express, vol. 26, no. 2, pp. 1962–1977, Jan 2018.
[40] J. N. Martel, L. K. Mueller, S. J. Carey, P. Dudek, and G. Wetzstein, "Neural sensors: Learning pixel exposures for HDR imaging and video compressive sensing with programmable sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 7, pp. 1642–1653, 2020.
[41] M. Yoshida, A. Torii, M. Okutomi, K. Endo, Y. Sugiyama, R.-i. Taniguchi, and H. Nagahara, "Joint optimization for compressive video sensing and reconstruction under hardware constraints," in The European Conference on Computer Vision (ECCV), September 2018.
[42] X. Han, B. Wu, Z. Shou, X.-Y. Liu, Y. Zhang, and L. Kong, "Tensor FISTA-Net for real-time snapshot compressive imaging," in AAAI, 2020.
[43] J. R. Hershey, J. Le Roux, and F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures," 2014.
[44] Y. Yang, J. Sun, H. Li, and Z. Xu, "Deep ADMM-Net for compressive sensing MRI," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 10–18.
[45] J. Zhang and B. Ghanem, "ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1828–1837.
[46] Y. Li, M. Qi, R. Gulve, M. Wei, R. Genov, K. N. Kutulakos, and W. Heidrich, "End-to-end video compressive sensing using Anderson-accelerated unrolled networks," in 2020 IEEE International Conference on Computational Photography (ICCP). IEEE, 2020, pp. 1–12.
[47] X. Yuan, Y. Liu, J. Suo, F. Durand, and Q. Dai, "Plug-and-play algorithms for video snapshot compressive imaging," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
[48] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649.
[49] Y. Huang, W. Wang, and L. Wang, "Video super-resolution via bidirectional recurrent convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 1015–1028, April 2018.
[50] S. Nah, S. Son, and K. M. Lee, "Recurrent neural networks with intra-frame iterations for video deblurring," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[51] M. Haris, G. Shakhnarovich, and N. Ukita, "Recurrent back-projection network for video super-resolution," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[52] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, vol. 2, 2010, p. 3.
[53] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and X. Giro-i Nieto, "RVOS: End-to-end recurrent network for video object segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[54] X. Mei, E. Pan, Y. Ma, X. Dai, J. Huang, F. Fan, Q. Du, H. Zheng, and J. Ma, "Spectral-spatial attention networks for hyperspectral image classification," Remote Sensing, vol. 11, no. 8, p. 963, 2019.
[55] R. Lu, B. Chen, Z. Cheng, and P. Wang, "RAFnet: Recurrent attention fusion network of hyperspectral and multispectral images," Signal Processing, vol. 177, p. 107737, 2020.
[56] F. Kokkinos and S. Lefkimmiatis, "Deep image demosaicking using a cascade of convolutional residual denoising networks," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 303–319.
[57] C. Chen, Q. Chen, J. Xu, and V. Koltun, "Learning to see in the dark," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3291–3300.
[58] L. Liu, X. Jia, J. Liu, and Q. Tian, "Joint demosaicing and denoising with self guidance," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[59] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[60] I. Goodfellow, "NIPS 2016 tutorial: Generative adversarial networks," arXiv preprint arXiv:1701.00160, 2016.
[61] Z. Cheng, R. Lu, Z. Wang, H. Zhang, B. Chen, Z. Meng, and X. Yuan, "BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging," in European Conference on Computer Vision (ECCV), August 2020.
[62] S. Jalali and X. Yuan, "Snapshot compressed sensing: Performance bounds and algorithms," IEEE Transactions on Information Theory, vol. 65, no. 12, pp. 8005–8024, Dec 2019.


[63] S. Jalali and X. Yuan, "Compressive imaging via one-shot measurements," in IEEE International Symposium on Information Theory (ISIT), 2018.
[64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5998–6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[65] A. Buades, B. Coll, and J.-M. Morel, "A non-local algorithm for image denoising," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2. IEEE, 2005, pp. 60–65.
[66] H. Jaeger, Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. GMD-Forschungszentrum Informationstechnik Bonn, 2002, vol. 5.
[67] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS'14, 2014, pp. 2672–2680.
[68] L. Mescheder, S. Nowozin, and A. Geiger, "Which training methods for GANs do actually converge?" in International Conference on Machine Learning (ICML), 2018.
[69] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbelaez, A. Sorkine-Hornung, and L. V. Gool, "The 2017 DAVIS challenge on video object segmentation," CoRR, vol. abs/1704.00675, 2017. [Online]. Available: http://arxiv.org/abs/1704.00675
[70] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang, "YouTube-VOS: A large-scale video object segmentation benchmark," arXiv preprint arXiv:1809.03327, 2018.
[71] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, "Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum," IEEE Transactions on Image Processing, vol. 19, no. 9, pp. 2241–2253, 2010.
[72] I. Choi, D. S. Jeon, G. Nam, D. Gutierrez, and M. H. Kim, "High-quality hyperspectral reconstruction using a spectral prior," ACM Transactions on Graphics (Proc. SIGGRAPH Asia 2017), vol. 36, no. 6, pp. 218:1–13, 2017. [Online]. Available: http://dx.doi.org/10.1145/3130800.3130810
[73] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[74] L. Wang, C. Sun, Y. Fu, M. H. Kim, and H. Huang, "Hyperspectral image reconstruction using a deep spatial-spectral prior," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8032–8041.
[75] Z. Meng, S. Jalali, and X. Yuan, "GAP-net for snapshot compressive imaging," arXiv preprint arXiv:2012.08364, 2020.
[76] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli et al., "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[77] H. S. Malvar, L.-w. He, and R. Cutler, "High-quality linear interpolation for demosaicing of Bayer-patterned color images," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3. IEEE, 2004, pp. iii–485.

Ziheng Cheng received the B.S. degree from Xidian University in 2019. He is currently pursuing the Ph.D. degree at Xidian University. His research interests include deep learning and computational imaging.

Ruiying Lu received the B.S. degree in telecommunication engineering from Xidian University, Xi'an, China, in 2016. She is currently pursuing the Ph.D. degree with Xidian University. Her research interests include deep learning for image processing and natural language processing.

Zhengjue Wang received the B.S. and M.S. degrees in electronic engineering from Xidian University, Xi'an, China, in 2013 and 2016, respectively. She is currently pursuing the Ph.D. degree at Xidian University. Her research interests include probabilistic models and deep learning, and their applications in image super-resolution, hyperspectral image fusion, and natural language processing.

Hao Zhang received the B.S. and Ph.D. degrees in electronic engineering from Xidian University, Xi'an, China, in 2012 and 2019, respectively. From 2019 to 2020, he worked as a postdoctoral researcher in Electrical and Computer Engineering, Duke University, Durham, NC, USA. He is now working as a postdoctoral researcher in Weill Cornell Medicine, Cornell University, NY, USA. His research interests include statistical machine learning and its combination with deep learning, and natural language processing.

Bo Chen received the B.S., M.S., and Ph.D. degrees from Xidian University, Xi'an, China, in 2003, 2006, and 2008, respectively, all in electronic engineering. He became a Post-Doctoral Fellow, a Research Scientist, and a Senior Research Scientist with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA, from 2008 to 2012. Since 2013, he has been a Professor with the National Laboratory for Radar Signal Processing, Xidian University. He received the Honorable Mention for the 2010 National Excellent Doctoral Dissertation Award. His current research interests include statistical machine learning, statistical signal processing and radar automatic target detection and recognition.

Ziyi Meng received the B.S. degree from the Huazhong University of Science and Technology in 2015 and the Ph.D. degree from Beijing University of Posts and Telecommunications in 2021. He was a visiting student at the New Jersey Institute of Technology in 2020. His research interests include computational imaging and deep learning.

Xin Yuan (SM'16) received the BEng and MEng degrees from Xidian University, in 2007 and 2009, respectively, and the PhD from the Hong Kong Polytechnic University, in 2012. He is currently an Associate Professor at Westlake University. He was a video analysis and coding lead researcher at Bell Labs, Murray Hill, NJ, USA from 2015 to 2021. Prior to this, he was a Post-Doctoral Associate in the Department of Electrical and Computer Engineering, Duke University from 2012 to 2015. His research interests are in signal processing, computational imaging and machine learning. He has been the Associate Editor of Pattern Recognition (2019-), International Journal of Pattern Recognition and Artificial Intelligence (2020-) and Chinese Optics Letters (2021-). He is leading the special issue of "Deep Learning for High Dimensional Sensing" in the IEEE Journal of Selected Topics in Signal Processing in 2021.