Enhanced Standard Compatible Image Compression
Abstract—Recent deep neural network-based research to enhance image compression performance can be divided into three categories: learnable codecs, postprocessing networks, and compact representation networks. The learnable codec has been designed for end-to-end learning beyond the conventional compression modules. The postprocessing network increases the quality of decoded images using example-based learning. The compact representation network is learned to reduce the capacity of an input image, reducing the bit rate while maintaining the quality of the decoded image. However, these approaches are not compatible with existing codecs or are not optimal for increasing coding efficiency. Specifically, it is difficult to achieve optimal learning in previous studies using a compact representation network due to the inaccurate consideration of the codecs. In this paper, we propose a novel standard compatible image compression framework based on auxiliary codec networks (ACNs). The ACNs are designed to imitate the image degradation operations of the existing codec, which delivers more accurate gradients to the compact representation network. Therefore, the compact representation and postprocessing networks can be learned effectively and optimally. We demonstrate that the proposed framework based on the JPEG and High Efficiency Video Coding standards substantially outperforms existing image compression algorithms in a standard compatible manner.

Index Terms—Image compression, deep neural networks, compact representation, JPEG, High Efficiency Video Coding.

I. INTRODUCTION

… compression standards, such as JPEG2000 [2], H.264/AVC [3], and High Efficiency Video Coding (HEVC) [4]. Recent video coding standards [3], [4] have adopted prediction-based coding methods to reduce the spatial and temporal redundancy of input video. Prediction-based coding increases the complexity of the compression algorithm but produces much better compression performance.

On the other hand, compression frameworks with end-to-end trainable deep neural networks [5]–[14] (learnable codecs in this paper) have been proposed based on the rapid development of deep learning. The approaches use trainable networks to produce bitstreams and reconstruct the original image (Fig. 1 (a)). Although these kinds of approaches structurally consider the compression ratio and reconstruction quality, their performance is still undesirable and incompatible with standard codecs, which decreases the algorithm's utility.

It is easy to propose a method to restore an image after the compression process to improve the compression performance while being compatible with standard codecs. Following the developments of convolutional neural networks (CNN), such as ResNet [15], DenseNet [16], and attention networks [17], [18], the CNN-based image postprocessing algorithms [19]–[28] have drastically improved the performance of image restoration. These kinds of approaches are designated as a postprocessing network (PPNet) in this paper. Although … do not consider the degradation process through the standard codec (Fig. 1 (c)) because the standard codec, including the quantization process, is a nondifferentiable module.

Fig. 1. Conceptual comparison between frameworks. (c) Standard compatible frameworks based on the compact representation network (CRNet). (d) Proposed framework based on the auxiliary codec network (ACN). Green and red arrows indicate the forward pass (or inference) and backward pass (or gradients) to train the CRNet, respectively. Gray modules indicate a nondifferentiable module or a standard codec. Blue modules indicate a differentiable network.

In this paper, we propose a novel standard compatible end-to-end image compression framework based on auxiliary codec networks (ACNs). The ACNs are designed to imitate the forward image degradation process of existing codecs in differentiable networks to provide the correct backward gradients for training the CRNet (Fig. 1 (d)). These gradients allow the compact represented image to consider both the degradation process by the ACN and the reconstruction process by the PPNet. Based on ACNs, both the CRNet and PPNet are learned together to achieve better image compression performance in a standard compatible manner. In addition, a bit estimation network (BENet) is proposed for training as a regularization function to prevent undesired bit-rate increments. As recent CRNet-based methods [35]–[39] generate models at a single level, the proposed framework is also trained and optimized per codec and rate.

The contributions of this paper are summarized as follows:

• We propose a novel CNN architecture called the ACN, based on the prior of the image compression process, to effectively and precisely train the CRNet.
• Based on the ACN, we propose an enhanced compression framework with a collaborative learning scheme between the ACN, PPNet, and CRNet. Furthermore, the BENet facilitates training using a proper bit prediction, preventing undesirable artifacts in compactly represented images.
• The framework is compatible with compression algorithms from standard codecs to learning-based codecs and any off-the-shelf image restoration networks. Based on the highly accurate ACNs for two standard codecs, JPEG and HEVC, our framework exhibits state-of-the-art results compared to other image compression algorithms, including standards and learnable codecs.

II. RELATED WORK

A. Compression Frameworks Based on End-to-end Trainable Networks (Learnable Codecs)

As deep learning has been successful in the field of image processing, Toderici et al. [5], [6] first proposed an end-to-end deep neural network-based approach to image compression. An input image with dimensions reduced through an autoencoder is stored as a binary vector for a given compression rate and is optimized for minimum distortion.

As the strong modeling capacity of neural networks was revealed, many follow-up studies have been conducted. Theis et al. [7] proposed a compressive autoencoder based on a residual neural network [15] and used a Laplace-smoothed histogram as the entropy model. Ballé et al. [8] jointly optimized the entire model for rate-distortion performance using a generalized divisive normalization transform. Further, Ballé et al. [9] proposed a hyperprior to effectively capture spatial redundancy in the latent encoding. In addition, Johnston et al. [10] proposed a priming technique and a spatially adaptive entropy model for image compression. Moreover, Li et al. [11] proposed a content-weighted method based on spatially adaptive importance map learning. Mentzer et al. [12] proposed a model that concurrently trains a context model with the encoder and used three-dimensional convolutional networks. Minnen et al. [13] and Lee et al. [14] combined a context-adaptive entropy model and a hyperprior, producing substantial performance improvements.

The primary difficulties of deep image compression algorithms include making the nondifferentiable quantization process end-to-end trainable, designing an entropy model that predicts the bitstream generated from coefficients, and enabling compression considering both the bit rate and distortion. However, although many deep network-based algorithms have been developed, it is challenging to replace conventional compression schemes due to compatibility. Furthermore, although the state-of-the-art approaches outperform even the Better Portable Graphics (BPG) [40] codec, which is designed based on the intra mode of HEVC, a significant performance improvement has not been demonstrated.
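The first of these difficulties, the nondifferentiable quantizer, is commonly handled in learnable codecs by replacing hard rounding with additive uniform noise during training, as in Ballé et al. [8]. A minimal NumPy sketch of why the proxy works (the variables below are illustrative, not taken from any cited implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_hard(y):
    """Real quantizer: piecewise constant, so its gradient is zero almost everywhere."""
    return np.round(y)

def quantize_train_proxy(y):
    """Training-time proxy: additive uniform noise in [-0.5, 0.5), which keeps the
    pipeline differentiable while matching the quantizer's error statistics."""
    return y + rng.uniform(-0.5, 0.5, size=y.shape)

y = rng.normal(0.0, 4.0, size=100_000)
err_hard = y - quantize_hard(y)
err_proxy = y - quantize_train_proxy(y)
# Both error distributions are approximately uniform on [-0.5, 0.5),
# so both standard deviations are close to 1/sqrt(12) ≈ 0.289.
print(round(err_hard.std(), 3), round(err_proxy.std(), 3))
```

At test time the hard quantizer is used as-is; the noise proxy only replaces it inside the training graph.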
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. XX, NO. X, X XXXX 3
Fig. 2. (a) Illustration of the end-to-end learning pipeline for image compression. (b)-(d) Detailed structures of the ACN and BENet.
codec. However, the codec-related functions Φ and φ have a nondifferentiable quantization operation that creates problems for the backpropagation algorithm. The codec-related functions were replaced with differentiable neural networks, h and p, to overcome this problem, as follows:

θ_f^*, θ_g^* ≈ argmin_{θ_f, θ_g} δ(x, g(h(f(x)))) + λ p(f(x)),   (4)

where θ_f and θ_g are the parameters of the functions f and g, respectively. In (4), we can reach the ideally optimal solution θ_f^* and θ_g^* if the two neural networks, h and p, are perfectly modeled as the real codec modules Φ and φ. All parts of the objective function are composed of learnable neural networks, enabling backpropagation in the end-to-end learning scheme. We define h as the ACN and p as the BENet in this paper. Both the ACN and BENet have fixed parameters in the process of optimizing (4). The overall pipeline of the proposed compression framework is illustrated in detail in Fig. 2 (a). The original image passes through the CRNet and is expressed as a compact image to reduce the amount of information. Next, the BENet calculates the number of predicted bits, and the ACN generates an imitated decoded image from the compact image. Finally, the PPNet performs restoration of the original image. The codec imitation module consisting of the ACN and BENet is used only for training and serves as a gradient path for training the CRNet. In the testing phase, the CRNet and PPNet are used with existing codecs as conventional preprocessing and postprocessing modules.

B. Auxiliary Codec Network

The parameter values θ_f^* and θ_g^* were obtained by optimizing the objective function in (4) using the approximation (imitation) functions of the codec modules Φ and φ. However, the obtained parameters may be different when the actual codec is applied. The output of h should be as close as possible to the actual codec module Φ to reduce these differences and perform optimal learning. In this section, we propose novel CNN architectures of the ACN that closely approximate two typical standard codecs, JPEG and HEVC intra coding (HEVC-intra). The objective function to train the ACNs is as follows:

θ_h^* = argmin_{θ_h} ‖h(x) − Φ(x)‖_2^2.   (5)

The architectures of the networks imitating JPEG and HEVC-intra are depicted in Fig. 2 (b) and (c). In the JPEG codec,
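To make the surrogate objective in (4) concrete, the following NumPy sketch evaluates it with toy stand-ins for the four modules. The pooling, smoothing, and upsampling operators here are illustrative placeholders, not the paper's actual CRNet, ACN, PPNet, or BENet architectures:

```python
import numpy as np

rng = np.random.default_rng(0)

# f: CRNet stand-in, maps the image to a compact representation (2x average pooling)
def f(x):
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

# h: ACN stand-in, a differentiable imitation of codec degradation (mild smoothing)
def h(y):
    return 0.25 * (np.roll(y, 1, 0) + np.roll(y, -1, 0) + np.roll(y, 1, 1) + np.roll(y, -1, 1))

# g: PPNet stand-in, restores the original resolution (nearest-neighbor upsampling)
def g(y):
    return y.repeat(2, axis=0).repeat(2, axis=1)

# p: BENet stand-in, predicts the bit cost of the compact image (entropy-like proxy)
def p(y):
    return float(np.mean(np.abs(y)))

def surrogate_loss(x, lam=0.01):
    """Evaluate delta(x, g(h(f(x)))) + lambda * p(f(x)), with MSE as delta."""
    recon = g(h(f(x)))
    distortion = float(np.mean((x - recon) ** 2))
    return distortion + lam * p(f(x))

x = rng.random((8, 8))
loss = surrogate_loss(x)
print(f"surrogate rate-distortion loss: {loss:.4f}")
```

Because every stand-in is a smooth function, gradients of this loss with respect to f's parameters would exist everywhere, which is exactly the property the real codec Φ lacks.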
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. XX, NO. X, X XXXX 5
PPNet through pretraining. The initial state is defined in the following equations:

θ_f^0 = argmin_{θ_f} ‖f(x) − F_s(x)‖_2^2,   (11)

θ_g^0 = argmin_{θ_g} ‖g(Φ(F_s(x))) − x‖_2^2,   (12)

… degraded bicubic downsampled image to the original image. The pretraining strategy for all networks provides a good initialization point, making the optimal parameter closer to the ideal and obtaining a faster convergence rate. This result is verified using a comparative experiment in Section V-B6.

2) Iterative Fine-tuning Updating: As the fine-tuning of the end-to-end model progresses, the CRNet learns in the direction of optimizing the objective function. However, as mentioned, the use of the approximation function affects the learning of the entire model. The ACN and BENet, which have fixed weights, gradually decrease in approximation accuracy to the standard codec because the unseen compact image from the CRNet is input as the whole model is trained. Therefore,

Fig. 3. Comparison of the learning process of compact representation network (CRNet)-based methods. (b) Alternate learning with a virtual codec neural network (VCNN) [39]. (c) Simultaneous learning with the proposed auxiliary codec network (ACN). The blue module indicates the status updated in each step, and the yellow module indicates the fixed status. Green and red arrows indicate a forward pass (or inference) and a backward pass (or gradients) to train the module, respectively. The yellow double arrow indicates the argument of the loss function for the backward pass.
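The initialization in (11) can be illustrated with a linear stand-in for the CRNet: fitting a matrix, by least squares, to reproduce a fixed downsampler F_s. This is only an analogy (average pooling replaces bicubic downsampling, and a linear map replaces the network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for F_s: 2x average pooling as a simple downsampler
# (the paper uses bicubic downsampling; pooling keeps this sketch dependency-free).
def downsample(x):
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

# "Pretrain" a linear CRNet stand-in W by least squares:
#   W* = argmin_W  sum_i ||x_i W - F_s(x_i)||^2,  the linear analogue of (11).
n, size = 200, 8
X = rng.random((n, size * size))                        # vectorized training images
Y = np.stack([downsample(x.reshape(size, size)).ravel() for x in X])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)               # solves X @ W ≈ Y

x_test = rng.random(size * size)
pred = x_test @ W
target = downsample(x_test.reshape(size, size)).ravel()
print("max abs error:", np.abs(pred - target).max())
```

Since average pooling is itself linear, the least-squares fit recovers it essentially exactly; a real CRNet is nonlinear and instead reaches this initialization by gradient descent on (11).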
[Fig. 4. Rate-distortion (PSNR and SSIM) comparison on the LIVE1 dataset: (a)-(d) the proposed method versus bicubic downsampling with postprocessing and JPEG at scale factors 0.5 and 0.75; (e), (f) performance according to the scale factor (0.5, 0.75, and 1).]

The CRNet and PPNet are updated for each cycle of the minibatch, and the ACN and BENet are updated using the output value … proposed algorithm, the parameters of the CRNet and PPNet … and θ_g^*.

3) Comparison with the other CRNet-based Methods: Recent CRNet-based papers [38], [39] proposed a learning strategy that bypasses the nondifferentiable codec, adopting … passed through the codec and PPNet. In both methods, the CRNet and PPNet are trained alternately, which is an incomplete optimization process.

In this paper, we adopted a method of simultaneously learning an end-to-end network so that the CRNet and PPNet …

V. EXPERIMENTAL RESULTS

A. Setting

Previous studies based on the CRNet and PPNet [38], [39] have displayed performance improvement in various image codecs, such as JPEG, JPEG2000 [2], and BPG [40]. In this …
Fig. 5. Visual comparison of different codec-mimicking network structures on (a) the Lighthouse image of the LIVE1 dataset at a quality factor of 10, and (b) Kimono of the HEVC Test Sequences [51] at an HEVC quantization parameter of 42. The result of the JPEG-based ACN has blocking artifacts and ringing artifacts around edges similar to the image compressed with JPEG. The result of the ACN with the prediction image, unlike the other structures, is less blurry and has compression artifacts similar to HEVC.
TABLE I
Quantitative Peak Signal-to-Noise Ratio (PSNR; dB) Comparison of JPEG Imitation Performance by Different Codec-Mimicking Network Structures on Set14 and LIVE1 Datasets

                                  QF = 10         QF = 20         QF = 40         QF = 80
vs JPEG                         Set14   LIVE1   Set14   LIVE1   Set14   LIVE1   Set14   LIVE1
Original image                  27.49   27.03   29.85   29.30   32.20   31.62   37.00   36.32
VCNN [39], depth = 6            29.89   29.56   32.15   31.60   34.29   33.60   37.98   37.49
VCNN [39], depth = 20           29.93   29.68   32.33   31.77   34.37   33.67   37.98   37.49
JPEG-based ACN, depth N = 9     43.85   44.50   39.82   39.82   40.34   40.29   39.53   38.97
JPEG-based ACN, depth N = 11    45.24   45.41   41.78   41.75   40.33   40.25   39.76   39.25
JPEG-based ACN, depth N = 12    46.07   46.27   42.32   42.28   41.27   41.25   43.08   42.61
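The imitation scores in Tables I and II are PSNR values; for reference, the standard definition can be computed as follows (the example images are arbitrary):

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((4, 4), 128.0)
deg = ref + 4.0                   # uniform error of 4 gray levels -> mse = 16
print(round(psnr(ref, deg), 2))   # 10*log10(255^2 / 16) ≈ 36.09
```

Higher values in the tables therefore mean the imitation network's output is numerically closer to the real codec's decoded image.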
TABLE II
Quantitative Peak Signal-to-Noise Ratio (PSNR; dB) Comparison of HEVC Imitation Performance by Different Codec-Mimicking Network Structures on HEVC Test Sequences [51]

vs HEVC                         QP = 32   QP = 37   QP = 42   QP = 47
Original image                   36.07     33.06     30.13     27.40
VCNN [39]                        38.54     36.17     33.94     32.07
ACN without prediction image     39.12     36.63     34.37     32.36
ACN with prediction image        40.87     39.09     37.53     36.06
facilitate performance improvement; thus, λbit and λreg are set to 5 × 10−5 and 1, respectively, for all QPs. The HEVC-intra-based model is implemented in the HEVC reference software HM 16.20 [48], [49] with the all-intra configuration and tested based on the test conditions, configurations, and sequences proposed by the Joint Collaborative Team on Video Coding [51]. The test sequences can be divided into Classes A, B, C, D, and E according to the spatial resolution. When evaluating the compression performance of HEVC-intra-based models, the results are expressed in terms of the Bjøntegaard delta (BD) rate [57] reductions for the luma component. In both codec model test situations, we adopted the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [58] as image quality evaluation metrics.

B. Ablation Study

In this section, we present the evaluation of the contribution of each network of the proposed framework. We also performed ablation studies to analyze the importance of each loss term. In addition, we tested the effects of the pretraining and the iterative update algorithm proposed in Section IV-B. All experiments for the ablation studies were tested on the LIVE1 dataset and evaluated on the rate-distortion planes.

1) Compact Representation Network: To confirm the effect of the proposed CRNet, we compared the CRNet to frameworks that simply downsample and restore [29]–[31]. As illustrated in Fig. 4 (a)-(d), the CRNet outperformed the bicubic downsampling preprocessing at two scale factors: 0.5 and 0.75. When the scale factor was 0.5, a higher performance improvement was obtained because the CRNet and PPNet have a larger capacity to compress and restore spatially as the scale factor decreases.

Fig. 4 (e) and (f) exhibit the performance analysis according to the scale factor. A smaller scale factor results in a greater information loss for the original image; thus, the compression efficiency is improved only at a low bit rate. In contrast, in a high bit-rate environment, it is better to maintain the original scale. An efficient scale factor exists according to the bit rate, suggesting that the scale factor should be determined adaptively according to the target rate.

2) Auxiliary Codec Network: To prove the superiority of the architecture of the proposed ACN, we conducted a performance comparison by imitating real codecs with several ACN structures. First, we compared the JPEG-based structure with the residual block-based CNN structure of the VCNN [39]. The results in Fig. 5 (a) and Table I reveal that the JPEG-based ACN mimics JPEG decoded images better than the VCNN. In particular, when the depth N of the JPEG-based ACN is 12, the imitation performance is over 40 dB at all QFs. The residual block-based CNN structure cannot follow the behavior of the JPEG codec regardless of the depth of the network.

In contrast, the proposed JPEG-based ACN generates a decoded image similar to the output of JPEG. The learned ACN expresses contouring and ringing artifacts, which are typical compression artifacts of JPEG. As the proposed ACN method closely follows the codec operation, backpropagation can be performed for the CRNet with a small error.

In addition, we conducted a comparative experiment on the HEVC-intra-based ACN. We compared the HEVC-based ACN with the VCNN, and we also compared using the original image alone versus together with a prediction image as the input of the HEVC-intra-based ACN. The results in Fig. 5 (b) and Table II indicate that the proposed HEVC-intra-based ACN structure has superior HEVC imitation performance compared with the VCNN with a Resblock-based CNN structure. Furthermore, we improved the ACN to take the prediction image as input with the original image, dramatically improving the imitation performance.

3) Bit Estimation Network: To prove the superiority of the structure of the proposed BENet, we compared it with ResNet [15], a representative network architecture for regression. As a result of the experiment in Table III, the BENet demonstrated better bit prediction performance than ResNet. In particular, the BENet is a more advantageous structure in that the number of parameters is relatively small.

4) Simultaneous Learning Strategy: We conducted a comparative experiment with the learning strategies of recent CRNet-based papers [38], [39]. In practice, the performance of the gradient backpropagation should be compared to know how well the learning strategies mimic the codec role. However, real nondifferentiable codecs have no ground truth for gradient propagation. Therefore, we analyzed the mimicking ability of the proposed method through the final compression performance after optimizing the CRNet and PPNet.

In Fig. 6, we compared the compression methods using preprocessing and postprocessing networks. The methodologies selected for comparison are as follows: first, the method for learning the CRNet and PPNet alternately by approximating the codec as an identity function as in [38] (green line) and using the VCNN [39] (magenta line), and second, the method for simultaneous optimization by directly connecting the CRNet and PPNet (cyan line), using the VCNN (purple line) and the proposed JPEG-based ACN (blue line). Furthermore, we additionally performed alternate learning of the CRNet and PPNet with the JPEG-based ACN (red line) to
TABLE III
Quantitative Percent (%) Error and Number of Parameters Comparison of Bits per Pixel (bpp) Estimation by Different Bit Estimation Network (BENet) Structures on Set14 and LIVE1 Datasets

                  QF = 10           QF = 20           QF = 40           QF = 80
                  Set14    LIVE1    Set14    LIVE1    Set14    LIVE1    Set14    LIVE1    Avg.      Params
ResNet-18 [15]    1.598%   0.999%   1.687%   1.535%   1.571%   1.591%   1.134%   1.327%   1.430%    15.199M
ResNet-50 [15]    1.873%   1.934%   1.520%   1.364%   0.945%   1.180%   1.705%   2.005%   1.566%    29.287M
BENet             1.001%   1.456%   1.221%   1.305%   1.072%   1.909%   1.211%   1.661%   1.355%     2.098M
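A plausible reading of the percent-error metric in Table III is the mean absolute percent error between estimated and actual bits per pixel; a small sketch with made-up numbers:

```python
import numpy as np

def bpp_percent_error(bpp_pred, bpp_true):
    """Mean absolute percent error of bits-per-pixel estimates."""
    bpp_pred = np.asarray(bpp_pred, dtype=np.float64)
    bpp_true = np.asarray(bpp_true, dtype=np.float64)
    return float(np.mean(np.abs(bpp_pred - bpp_true) / bpp_true) * 100.0)

# Hypothetical estimates vs. actual codec output (illustrative numbers only)
true_bpp = [0.25, 0.40, 0.62]
pred_bpp = [0.252, 0.395, 0.628]
print(f"{bpp_percent_error(pred_bpp, true_bpp):.3f}%")   # -> 1.113%
```

Under this reading, the ~1.4% errors in Table III mean the BENet's bit prediction deviates from the real codec's bit count by only a few hundredths of a bit per pixel at typical rates.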
compare the influence of alternate learning and simultaneous learning. For a fair comparison, all experiments used the fixed network structure of the CRNet and PPNet and the same training database as described in Section V-A.

In Fig. 6 (a), the proposed JPEG-based ACN with simultaneous learning outperforms the other methods. In learning the CRNet and PPNet alternately by approximating the codec as an identity function, the codec characteristics are repeatedly reflected in the PPNet to demonstrate good performance. However, an error occurs because the codec is assumed to be an identity function when learning the CRNet. Additionally, limitations exist in the VCNN, as it is difficult to sufficiently transfer the codec characteristics to the CRNet because of the poor approximation of the codec.

In the case of simultaneous learning by directly connecting the CRNet and PPNet and with the Resblock-based VCNN, the compression performances are significantly worse than the others. Table I demonstrates that the VCNN is closer to the JPEG decoded image than the original image (identity function). However, as illustrated in the visual comparison in Fig. 5 (a), compression mimicking images generated from the VCNN have almost no observed JPEG compression noise patterns and unpredictable noise. End-to-end learning with the VCNN, which is updated through iterative learning and is changeable, unlike an identity function that generates a constant value, leads to worse optimization. Therefore, the VCNN can generate even greater error propagation than the identity function approximation.

In alternate learning, the PPNet is continuously trained from a decoded image obtained from the real codec to compensate for approximation errors. However, for simultaneous learning, if a sufficient codec approximation is not satisfied, error propagation continues to the CRNet and PPNet during training, resulting in performance degradation. The JPEG-based ACN exhibited better performance in simultaneous learning than in alternating learning. This result reveals that the proposed learning method, in which the CRNet and PPNet are optimized simultaneously, is more effective if good mimicking performance is guaranteed. In Fig. 6 (b), the experimental results demonstrated that, as the imitation performance gradually decreased with a small value of N, the overall compression performance also gradually decreased. The experiments indicate that the imitation performance is proportional to the overall compression performance and that better mimicking performance improves the compression performance through a more precise optimization of the preprocessing network.

5) Bit and Regularization Loss: The rate-distortion performance was evaluated according to the two loss weights, λbit and λreg, to determine the effectiveness of the bit and regularization losses. The rate-distortion performance results are expressed as traces according to the training epoch to analyze the change in performance as training progresses. As displayed in Fig. 7 (a), when learning the network without considering the bit loss (λbit = 0), the final reconstructed image becomes closer to the original image, but the number of bits generated during compression increases significantly, resulting in poor coding efficiency. In contrast, using a proper λbit prevents these problems. Fig. 7 (b) indicates that the network has a better training procedure with stable convergence with the regularization loss by preserving the structure of natural images. If the influence of the regularization loss increases, the learning of the CRNet is restricted. In this case, no significant change in performance exists from the initial state. Fig. 7 (c) indicates that the coding efficiency is better when learning with the regularization loss (λreg = 0.1) than learning without the regularization loss. A standard codec is designed to compress natural images; thus, the regularization loss prevents a decrease in coding efficiency from compressing images with unnatural patterns.

6) Pretraining Strategy: We analyzed the effect of the pretraining strategy as described in Section IV-B1. Fig. 8 presents the performance comparison according to whether pretraining occurred. The experimental results reveal faster convergence with better coding efficiency when performing pretraining on the CRNet and PPNet. The absence of pretraining means that the parameters of the CRNet and PPNet are initialized with random values. In this case, it is challenging to converge in the desired direction because the ACN and BENet do not work properly at the beginning of the training process.

7) Iterative Updating Strategy: The iterative updating process is proposed in Section IV-B2 to reduce errors due to the approximation functions. The fixed ACN is compared with the repeatedly updated ACN to demonstrate the effectiveness of the proposed training scheme. Fig. 9 (a) depicts the comparison of the codec imitation performance according to the training epochs, and Fig. 9 (b) displays the change in coding efficiency as training progresses. When the ACN is fixed, the imitation performance of the ACN decreases gradually as the epoch increases. In contrast, when the ACN is continuously updated, the imitation performance does not deteriorate, and it converges. Guaranteeing the performance of the ACN helps learning to increase the coding efficiency of the entire framework with less approximation error.
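The update schedule above can be sketched as a training skeleton; the module updates are stubs standing in for real gradient steps and re-fitting, and the names and refresh interval are illustrative:

```python
# Skeleton of the iterative updating schedule: CRNet/PPNet are updated every
# minibatch, while the ACN/BENet are periodically refreshed on the CRNet's
# current outputs so that their codec imitation does not go stale.
# All update actions here are illustrative stubs, not the paper's code.

def train(num_epochs=3, batches_per_epoch=4, refresh_every=2):
    log = []
    for epoch in range(num_epochs):
        for step in range(batches_per_epoch):
            # Gradient step on CRNet+PPNet through the frozen ACN/BENet surrogates
            log.append("update CRNet+PPNet")
        if (epoch + 1) % refresh_every == 0:
            # Re-fit ACN/BENet on compact images produced by the *current* CRNet,
            # using the real codec's decoded images and bit counts as targets.
            log.append("refresh ACN+BENet")
    return log

schedule = train()
print(schedule.count("update CRNet+PPNet"), schedule.count("refresh ACN+BENet"))
```

The key point of the schedule is that the surrogates are re-fitted on the distribution the CRNet currently produces, which is exactly what the fixed-ACN baseline lacks.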
Fig. 6. Rate-distortion curve comparison of compression methods using preprocessing and postprocessing networks. (a) Comparison results according to network architectures and learning methods (simultaneous learning (S. Learning) or alternating learning (A. Learning)). (b) Comparison result of the codec-mimicking networks with simultaneous learning.

Fig. 7. Rate-distortion performance comparison according to the weight of the loss function. Rate-distortion point of the compressed image (JPEG QF = 80) with increasing learning epochs according to (a) the bit loss λbit and (b) the regularization loss λreg, and (c) rate-distortion performance comparison with λreg = 0.1 or without the regularization loss after the training process is complete.

Fig. 8. Performance comparison with and without pretraining. (a) Change in the peak signal-to-noise ratio (PSNR) between the original and reconstructed images (JPEG QF = 80) in the end-to-end model according to the training epochs and (b) comparison of the rate-distortion performance after the training process is complete.
40.5 31 38
Updated ACN 39
40
Fixed ACN 30.5 36
39.5 37
30 35 34
PSNR (dB)
PSNR (dB)
PSNR (dB)
39
PSNR (dB)
Proposed Proposed
38.5 Training 33 Li [36] Li [36]
29.5 32
Lee [14] Lee [14]
38 31 Minnen [13] Minnen [13]
Updated ACN Balle [9]
29 Balle [9] 30
37.5 29 BPG BPG
Epoch=0 Fixed ACN
HEVC-intra HEVC-intra
37 28.5 27 28
0 5 10 15 20 25 30 35 40 45 50 0.5 0.55 0.6 0.65 0 20000 40000 60000 0 20000 40000 60000
Epochs Bits per pixel (bpp) Bit rate (kbps) Bit rate (kbps)
PSNR (dB)
PSNR (dB)
37
33 Proposed Proposed
Li [36] Li [36]
Lee [14] 35 Lee [14]
34 34 31
Minnen [13] Minnen [13]
Balle [9] 33 Balle [9]
32 32 29
BPG BPG
HEVC-intra HEVC-intra
30 27 31
30
0 5000 10000 0 5000 10000
Bit rate (kbps) Bit rate (kbps)
PSNR (dB)
PSNR (dB)
28 28
Proposed Proposed (c) (d)
26 26
Zhao [39] Zhao [39] Fig. 11. Rate-distortion curves of several HEVC test sequences [51]: (a)
Jiang [38] Jiang [38]
PeopleOnStreet (2160×1440), (b) Cactus (1920×1080), (c) BasketballDrill
24 24
(832 × 480), and (d) Johnny (1280 × 720).
JPEG JPEG
22 22
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
Bits per pixel (bpp) Bits per pixel (bpp)
0.75
SSIM
0.75
[Fig. 10 plots: rate-distortion curves (SSIM versus bits per pixel, 0.1 to 0.9 bpp) comparing the proposed method, Zhao [39], Jiang [38], and JPEG; panels (c) and (d).]

Fig. 10. Rate-distortion performance per peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) of different compression algorithms on test image datasets: (a), (c) Set14 and (b), (d) LIVE1.

C. Comparison with State-of-the-art Methods

1) JPEG-based Model: We compared the coding efficiency of the proposed JPEG-based ACN with depth N of 12 against JPEG and other standard compatible compact representation frameworks, such as those by Jiang [38] and Zhao [39]. We used the experimental results reported in [39]; however, [38] does not report the compression rate and distortion results for high QF, so we newly trained the model from [38] and used these results.

Considering that the CRNet is a useful tool under a low [...] Moreover, the proposed method exhibits the best performance on all test datasets. The difference between the proposed algorithm and the existing methods is the learning method for the CRNet. From the experimental results, the well-trained CRNet improves compression performance.

2) HEVC-intra-based Model: To apply the proposed method to a recent standard codec and demonstrate state-of-the-art performance, it was compared with conventional codecs, learnable codecs, and a competing standard compatible algorithm. For the conventional codecs, the BPG image codec was also selected. The image compression frameworks proposed by Ballé [9], Minnen [13], and Lee [14] were selected as the learnable codecs, and the work by Li [36] was selected as the competing standard compatible algorithm. The rate-distortion performance comparison on a typical sequence of each class is illustrated in Fig. 11.

The performance of the learnable codec algorithms does not exceed that of HEVC on the test sequences. In contrast, the standard compatible frameworks are designed to boost the performance of the existing codec and exhibit better coding efficiency than the other frameworks. In particular, our method reached state-of-the-art performance.
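As a concrete reference for how the standard-codec anchor points in such rate-distortion comparisons are obtained, the sketch below measures a (bpp, PSNR) point for one JPEG quality factor using an off-the-shelf JPEG implementation (Pillow). This is an illustration only, not the evaluation pipeline used in this paper; the image and function names are hypothetical.

```python
import io

import numpy as np
from PIL import Image


def jpeg_rd_point(img, quality):
    """Encode an RGB uint8 array with JPEG at the given quality factor
    and return (bits per pixel, PSNR in dB) against the original."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=quality)
    # Rate: total compressed bits divided by the number of pixels.
    bpp = 8.0 * buf.getbuffer().nbytes / (img.shape[0] * img.shape[1])
    buf.seek(0)
    dec = np.asarray(Image.open(buf).convert("RGB"), dtype=np.float64)
    # Distortion: PSNR over all color channels (guard against mse == 0).
    mse = np.mean((dec - img.astype(np.float64)) ** 2)
    psnr = 10.0 * np.log10(255.0 ** 2 / max(mse, 1e-12))
    return bpp, psnr


# Example: a smooth horizontal gradient, which is cheap for JPEG to code.
row = np.linspace(0, 255, 256, dtype=np.uint8)
img = np.stack([np.tile(row, (256, 1))] * 3, axis=-1)
for qf in (10, 30, 50):
    bpp, psnr = jpeg_rd_point(img, qf)
    print(f"QF={qf}: {bpp:.3f} bpp, {psnr:.2f} dB")
```

Sweeping the quality factor over a set of values yields one rate-distortion curve per codec, which is how plots such as Fig. 10 are typically populated.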
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. XX, NO. X, X XXXX 13
TABLE IV
BD-Rate Comparison of Standard Compatible Frameworks on HEVC Common Test Conditions [51] with All Intra Main Configurations
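The BD-rate values in Table IV follow Bjontegaard's method [57]: each codec's rate-distortion points are fitted with a cubic polynomial of quality in the log-rate domain, and the fitted curves are integrated over the overlapping quality range. A minimal sketch of the computation (the function and the toy data are illustrative, not the evaluation code used here):

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate: average bitrate difference in percent
    between two rate-distortion curves. Rates are fitted as cubic
    polynomials of PSNR in the log-rate domain and integrated over the
    overlapping PSNR range. Negative values mean bitrate savings."""
    fit_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    fit_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(np.min(psnr_anchor), np.min(psnr_test))
    hi = min(np.max(psnr_anchor), np.max(psnr_test))
    # Integrate each fitted log-rate curve over the common PSNR interval.
    int_a = np.polyval(np.polyint(fit_a), hi) - np.polyval(np.polyint(fit_a), lo)
    int_t = np.polyval(np.polyint(fit_t), hi) - np.polyval(np.polyint(fit_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0


# Toy check: a codec that always spends 15% fewer bits at equal PSNR
# should report a BD-rate of about -15%.
rate = np.array([100.0, 200.0, 400.0, 800.0])   # illustrative rates
psnr = np.array([30.0, 33.0, 36.0, 39.0])       # dB
print(bd_rate(rate, psnr, 0.85 * rate, psnr))
```

The same routine applies whether the quality axis is PSNR or SSIM, which is how both BD-rate columns reported below are produced.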
For a detailed performance comparison between the standard compatible frameworks, the BD-rate performance was compared on the HEVC test sequences. Considering that the standard compatible framework is effective in a low bit-rate environment, the QP of the reference HEVC was set to 32, 37, 42, and 47. The BD-rate results are summarized in Table IV. The results reveal that the proposed method achieves an average BD-rate reduction of 15.2% over all classes on the PSNR metric and 22.9% on the SSIM metric. The proposed method also outperforms Li's frame-level and block-level schemes [36] on both the PSNR and SSIM metrics. In addition, [36] assumed that the standard codec is an identity function, similar to [38]; therefore, the CRNet and PPNet are directly connected, and the CRNet is learned through joint training of the entire network. In contrast, the proposed framework reduces errors in learning the CRNet by modeling the standard codec with the ACN, thereby improving the coding efficiency.

VI. CONCLUSION

In this paper, we proposed a standard compatible deep neural network-based framework for image compression. Within this framework, image compression is performed optimally through the existing off-the-shelf standard codecs, the CRNet, and the PPNet. The ACN was proposed for optimal learning of the entire network and is designed to imitate the forward degradation processes of existing codecs, such as JPEG and HEVC. Proper training strategies were proposed to minimize errors due to the approximated objective function. The experimental results reveal that this approach outperforms the existing codecs and end-to-end learnable image compression algorithms. For future work, we will extend this work to video compression, which is more challenging to model because of its high complexity and temporal dynamics.

REFERENCES

[1] G. K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
[2] M. Rabbani, "JPEG2000: Image compression fundamentals, standards and practice," Journal of Electronic Imaging, vol. 11, no. 2, p. 286, 2002.
[3] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
[4] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[5] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," in International Conference on Learning Representations, 2016.
[6] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314.
[7] L. Theis, W. Shi, A. Cunningham, and F. Huszár, "Lossy image compression with compressive autoencoders," in International Conference on Learning Representations, 2017.
[8] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," in International Conference on Learning Representations, 2017.
[9] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," in International Conference on Learning Representations, 2018.
[10] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, "Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4385–4393.
[11] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning convolutional networks for content-weighted image compression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3214–3223.
[12] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, "Conditional probability models for deep image compression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4394–4402.
[13] D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in Advances in Neural Information Processing Systems, 2018, pp. 10771–10780.
[14] J. Lee, S. Cho, and S.-K. Beack, "Context-adaptive entropy model for end-to-end optimized image compression," in International Conference on Learning Representations, 2019.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[17] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[18] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[19] J. Kim, J. Kwon Lee, and K. Mu Lee, "Accurate image super-resolution using very deep convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.
[20] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[21] T. Tong, G. Li, X. Liu, and Q. Gao, "Image super-resolution using dense skip connections," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4799–4807.
[22] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, "Non-local recurrent network for image restoration," in Advances in Neural Information Processing Systems, 2018, pp. 1673–1682.
[23] L. Cavigelli, P. Hager, and L. Benini, "CAS-CNN: A deep convolutional neural network for image compression artifact suppression," in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 752–759.
[24] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, "Deep generative adversarial compression artifact removal," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4826–4835.
[25] Y. Tai, J. Yang, X. Liu, and C. Xu, "MemNet: A persistent memory network for image restoration," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4539–4547.
[26] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang, "D3: Deep dual-domain based fast restoration of JPEG-compressed images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2764–2772.
[27] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.
[28] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 286–301.
[29] A. M. Bruckstein, M. Elad, and R. Kimmel, "Down-scaling for better transform compression," IEEE Transactions on Image Processing, vol. 12, no. 9, pp. 1132–1144, 2003.
[30] W. Lin and L. Dong, "Adaptive downsampling to improve image compression at low bit rates," IEEE Transactions on Image Processing, vol. 15, no. 9, pp. 2513–2521, 2006.
[31] X. Wu, X. Zhang, and X. Wang, "Low bit-rate image compression via adaptive down-sampling and constrained least squares upconversion," IEEE Transactions on Image Processing, vol. 18, no. 3, pp. 552–561, 2009.
[32] M. Afonso, F. Zhang, and D. R. Bull, "Video compression based on spatio-temporal resolution adaptation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 275–280, 2018.
[33] Y. Li, D. Liu, H. Li, L. Li, F. Wu, H. Zhang, and H. Yang, "Convolutional neural network-based block up-sampling for intra frame coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2316–2330, 2018.
[34] Y. Zhang, D. Zhao, J. Zhang, R. Xiong, and W. Gao, "Interpolation-dependent image downsampling," IEEE Transactions on Image Processing, vol. 20, no. 11, pp. 3291–3296, 2011.
[35] H. Kim, M. Choi, B. Lim, and K. Mu Lee, "Task-aware image downscaling," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 399–414.
[36] Y. Li, D. Liu, H. Li, L. Li, Z. Li, and F. Wu, "Learning a convolutional neural network for image compact-resolution," IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1092–1107, 2018.
[37] W. Sun and Z. Chen, "Learned image downscaling for upscaling using content adaptive resampler," IEEE Transactions on Image Processing, vol. 29, pp. 4027–4040, 2020.
[38] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, "An end-to-end compression framework based on convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 3007–3018, 2017.
[39] L. Zhao, H. Bai, A. Wang, and Y. Zhao, "Learning a virtual codec based on deep convolutional neural network to compress image," Journal of Visual Communication and Image Representation, vol. 63, p. 102589, 2019.
[40] F. Bellard, "BPG image format," https://bellard.org/bpg, 2015.
[41] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2015.
[42] C. Dong, Y. Deng, C. Change Loy, and X. Tang, "Compression artifacts reduction by a deep convolutional network," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 576–584.
[43] X. Zhang, W. Yang, Y. Hu, and J. Liu, "DMCNN: Dual-domain multi-scale convolutional neural network for compression artifacts removal," in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 390–394.
[44] B. Zheng, Y. Chen, X. Tian, F. Zhou, and X. Liu, "Implicit dual-domain convolutional network for robust color image compression artifact reduction," IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[45] T. Kim, H. Lee, H. Son, and S. Lee, "SF-CNN: A fast compression artifacts removal via spatial-to-frequency convolutional neural networks," in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3606–3610.
[46] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, "Dynamic filter networks," in Advances in Neural Information Processing Systems, 2016, pp. 667–675.
[47] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[48] K. McCann, B. Bross, W. Han, I. Kim, K. Sugimoto, and G. Sullivan, "High efficiency video coding (HEVC) test model 16 (HM 16) encoder description," JCT-VC, Doc. JCTVC-N1002, 2014.
[49] HM16.20 reference software. [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.20/
[50] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[51] F. Bossen et al., "Common test conditions and software reference configurations," JCTVC-L1100, vol. 12, p. 7, 2013.
[52] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang, "NTIRE 2017 challenge on single image super-resolution: Methods and results," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 114–125.
[53] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[54] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
[55] R. Zeyde, M. Elad, and M. Protter, "On single image scale-up using sparse-representations," in International Conference on Curves and Surfaces. Springer, 2010, pp. 711–730.
[56] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms," IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, 2006.
[57] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," VCEG-M33, 2001.
[58] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.