

Enhanced Standard Compatible Image Compression Framework based on Auxiliary Codec Networks

Hanbin Son, Taeoh Kim, Hyeongmin Lee, and Sangyoun Lee, Member, IEEE

Abstract—Recent deep neural network-based research to enhance image compression performance can be divided into three categories: learnable codecs, postprocessing networks, and compact representation networks. The learnable codec is designed for end-to-end learning beyond the conventional compression modules. The postprocessing network increases the quality of decoded images using example-based learning. The compact representation network is learned to reduce the capacity of an input image, reducing the bit rate while maintaining the quality of the decoded image. However, these approaches are either incompatible with existing codecs or not optimal for increasing coding efficiency. Specifically, it is difficult to achieve optimal learning in previous studies using a compact representation network because they do not accurately account for the codec. In this paper, we propose a novel standard compatible image compression framework based on auxiliary codec networks (ACNs). ACNs are designed to imitate the image degradation operations of the existing codec, delivering more accurate gradients to the compact representation network. Therefore, the compact representation and postprocessing networks can be learned effectively and optimally. We demonstrate that the proposed framework, based on the JPEG and High Efficiency Video Coding (HEVC) standards, substantially outperforms existing image compression algorithms in a standard compatible manner.

Index Terms—Image compression, deep neural networks, compact representation, JPEG, High Efficiency Video Coding.

This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government, Ministry of Science and ICT (MSIT): No. 2021-0-00172, The development of human Re-identification and masked face recognition based on CCTV camera, and No. 2016-0-00197, Development of the high-precision natural 3D view generation technology using smart-car multi sensors and deep learning. H. Son, T. Kim, H. Lee, and S. Lee are with the School of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea (e-mail: [email protected]). Corresponding author: Sangyoun Lee.

I. INTRODUCTION

With the development of media technology, the demand for live streaming and communication using high-resolution visual data has increased, requiring better performance from image and video compression algorithms. Standard algorithms have been carefully developed and released so that the encoders and decoders of compression algorithms remain compatible across users.

The JPEG standard [1], a traditional image compression algorithm, has been the most widely used in still-image compression because of its simplicity and compatibility. Its scheme of block partitioning, transform, quantization, and entropy coding has strongly influenced many other image and video compression standards, such as JPEG2000 [2], H.264/AVC [3], and High Efficiency Video Coding (HEVC) [4]. Recent video coding standards [3], [4] have adopted prediction-based coding methods to reduce the spatial and temporal redundancy of input video. Prediction-based coding increases the complexity of the compression algorithm but produces much better compression performance.

On the other hand, compression frameworks with end-to-end trainable deep neural networks [5]-[14] (learnable codecs in this paper) have been proposed based on the rapid development of deep learning. These approaches use trainable networks to produce bitstreams and reconstruct the original image (Fig. 1 (a)). Although they structurally consider the compression ratio and reconstruction quality, their performance is still undesirable, and they are incompatible with standard codecs, which decreases their utility.

A straightforward way to improve compression performance while remaining compatible with standard codecs is to restore the image after the compression process. Following the developments of convolutional neural networks (CNNs), such as ResNet [15], DenseNet [16], and attention networks [17], [18], CNN-based image postprocessing algorithms [19]-[28] have drastically improved the performance of image restoration. These approaches are designated as postprocessing networks (PPNets) in this paper. Although PPNets perform well in reconstruction and are compatible with standard codecs, they only increase the visual quality of the reconstructed image and do not consider the compression ratio (Fig. 1 (b)).

Preprocessing- and postprocessing-based coding strategies have been applied to consider both image quality and compression ratio. These approaches place a spatially downsampled image into the codec and then upsample [29]-[31] or postprocess [32], [33] the decoded output. Generally, reducing the spatial size of an image can increase the compression ratio. However, these approaches only work in a low bit-rate setting [29]-[31], and the predefined downsampling operations degrade detailed information and increase the ratio of high-frequency components, causing an increment in the bitstream.

The content-adaptive downsampling algorithm [34] and learning-based downsampling algorithms [35]-[39] have been proposed to overcome these problems. These algorithms train a content-adaptive downsampling operation (called compact representation in [38] and abbreviated to CRNet in this paper) using the reconstruction loss after the PPNet. Such approaches can achieve both a high compression ratio and better reconstruction quality with two networks.
Fig. 1. Conceptual comparison between frameworks: (a) learnable codecs, (b) postprocessing networks (PPNet), (c) standard compatible frameworks based on the compact representation network (CRNet), and (d) the proposed framework based on the auxiliary codec network (ACN). Green and red arrows indicate the forward pass (or inference) and the backward pass (or gradients) used to train the CRNet, respectively. Gray modules are nondifferentiable or standard codecs; blue modules are differentiable networks.

However, the backward gradients from the loss function to the CRNet do not consider the degradation process through the standard codec (Fig. 1 (c)) because the standard codec, including its quantization process, is a nondifferentiable module.

In this paper, we propose a novel standard compatible end-to-end image compression framework based on auxiliary codec networks (ACNs). The ACNs are designed to imitate the forward image degradation process of existing codecs with differentiable networks, providing correct backward gradients for training the CRNet (Fig. 1 (d)). These gradients allow the compactly represented image to account for both the degradation process modeled by the ACN and the reconstruction process of the PPNet. Based on ACNs, the CRNet and PPNet are learned together to achieve better image compression performance in a standard compatible manner. In addition, a bit estimation network (BENet) is proposed as a regularization function during training to prevent undesired bit-rate increments. As recent CRNet-based methods [35]-[39] generate models at a single operating point, the proposed framework is likewise trained and optimized per codec and rate.

The contributions of this paper are summarized as follows:

• We propose a novel CNN architecture called the ACN, based on priors of the image compression process, to effectively and precisely train the CRNet.
• Based on the ACN, we propose an enhanced compression framework built on a collaborative learning scheme between the ACN, PPNet, and CRNet. Furthermore, the BENet facilitates training through proper bit prediction, preventing undesirable artifacts in compactly represented images.
• The framework is compatible with compression algorithms ranging from standard codecs to learning-based codecs and with any off-the-shelf image restoration network. Based on highly accurate ACNs for two standard codecs, JPEG and HEVC, our framework exhibits state-of-the-art results compared with other image compression algorithms, including standards and learnable codecs.

II. RELATED WORK

A. Compression Frameworks Based on End-to-end Trainable Networks (Learnable Codecs)

As deep learning has been successful in the field of image processing, Toderici et al. [5], [6] first proposed an end-to-end deep neural network-based approach to image compression. An input image whose dimensions are reduced through an autoencoder is stored as a binary vector for a given compression rate and is optimized for minimum distortion.

As the strong modeling capacity of neural networks became apparent, many follow-up studies were conducted. Theis et al. [7] proposed a compressive autoencoder based on a residual neural network [15] and used a Laplace-smoothed histogram as the entropy model. Ballé et al. [8] jointly optimized the entire model for rate-distortion performance using a generalized divisive normalization transform. Further, Ballé et al. [9] proposed a hyperprior to effectively capture spatial redundancy in the latent encoding. In addition, Johnston et al. [10] proposed a priming technique and a spatially adaptive entropy model for image compression. Moreover, Li et al. [11] proposed a content-weighted method based on spatially adaptive importance map learning. Mentzer et al. [12] proposed a model that concurrently trains a context model with an encoder and used three-dimensional convolutional networks. Minnen et al. [13] and Lee et al. [14] combined a context-adaptive entropy model with a hyperprior, producing substantial performance improvements.

The primary difficulties of deep image compression algorithms include making the nondifferentiable quantization process end-to-end trainable, designing an entropy model that predicts the bitstream generated from coefficients, and enabling compression that considers both bit rate and distortion. Although many deep network-based algorithms have been developed, it remains challenging for them to replace conventional compression schemes because of compatibility. Furthermore, although state-of-the-art approaches outperform even the Better Portable Graphics (BPG) [40] codec, which is designed based on the intra mode of HEVC, a significant performance improvement has not been demonstrated.
B. Postprocessing Networks

Following the success of deep learning-based approaches to high-level vision problems, methods for low-level vision problems, such as image super-resolution and compression artifact removal, have improved progressively. In single-image super-resolution, Dong et al. [41] first proposed a CNN, the super-resolution CNN (SRCNN), to learn an end-to-end mapping from downsampled images to high-resolution images. In addition, Kim et al. [19] proposed a deeper network architecture with a residual skip connection. Ledig et al. [20] and Lim et al. [27] proposed networks based on ResNet [15], and Tong et al. [21] proposed a network based on DenseNet [16]. Building on the attention networks [17], [18] from the recognition area, Liu et al. [22] and Zhang et al. [28] proposed attention-based restoration networks.

For the removal of compression artifacts, Dong et al. [42] proposed a network slightly deeper than SRCNN [41] to reduce artifacts in the intermediate feature maps. As with super-resolution networks, deeper networks [23]-[25] have been proposed. Some researchers have proposed frequency-based networks [26], [43]-[45] to restore images compressed by frequency transform-based algorithms (e.g., JPEG). Although these approaches can recover a decoded image after compression or recover the original resolution if an image was downsampled before compression, they can only treat already compressed images and have no access to any bit-rate-related module.

C. Standard Compatible Frameworks based on the Compact Representation Network

Learning-based image downsampling methods have been less actively researched than upsampling methods. These methods can produce a compact representation of an image without losing important structures and without aliasing effects. Kim et al. [35] proposed a task-aware downscaling (TAD) network, and Li et al. [36] proposed a compact representation network called CNN-CR that downsamples adaptively using joint learning of the downsampling and super-resolution networks. To preserve essential structures in an input image, both adopted regularization constraints between the input and compact images. Sun et al. [37] proposed a content-adaptive downsampling network that applies adaptive sampling to the input image to prevent significant changes in image structure, based on a dynamic filter network [46], and it does not require a regularization loss.
In addition, several approaches reduce bit rates through downsampling followed by postprocessing in an image compression framework. For example, Afonso et al. [32] and Li et al. [33] proposed compressing images with handcrafted downsampling and restoring them with deep learning-based super-resolution. Jiang et al. [38] proposed an end-to-end framework that consists of three parts: a compact CNN (ComCNN), an image codec, and a reconstruction CNN (RecCNN). The ComCNN produces a compact representation of the input image, and the RecCNN restores the image degraded by compression. The two CNNs are configured around the codec in a compression pipeline to increase coding efficiency. Jiang et al. also proposed an iterative optimization algorithm because the end-to-end framework includes a standard codec with nondifferentiable operations. The CRNet (ComCNN) learns in an end-to-end fashion by connecting directly to the pretrained PPNet (RecCNN). The standard codec is approximated as an identity function in this procedure, which is not optimal in the inference phase.

Unlike [38], Zhao et al. [39] proposed a virtual codec neural network (VCNN) to propagate the gradient from the postprocessed image to a preprocessing network, called a feature description neural network, corresponding to the CRNet in this paper. However, the VCNN and the postprocessing neural network, corresponding to the PPNet, are trained alternately because they comprise different pipelines. In addition, the VCNN is designed without careful consideration of the codec structure, making it difficult to guarantee that the correct gradient is propagated through it. Because the VCNN is used in the training phase while the postprocessing neural network is used when encoding and decoding images in the testing phase, an approximation error of the VCNN causes a train-test discrepancy. In Section IV-B3, the CRNet-based compression algorithms are analyzed and compared in detail.

III. PROPOSED STANDARD COMPATIBLE IMAGE COMPRESSION FRAMEWORK

A. Problem Statement

We propose a novel image compression framework that combines the advantages of the existing standard codec and deep learning-based image processing networks. An important goal in constructing a compression framework is to increase the coding efficiency, which means reducing the distortion of the output image while lowering the bit rate of the compressed bitstream. To achieve the optimal parameters of an end-to-end network, we minimize the rate-distortion cost J = D + λR, where D is the distortion, R is the bit rate, and λ weights the relative importance between D and R. The distortion term D measures how different the reconstructed image is from the original image x:

D = δ(x, g(Φ(f(x)))), (1)

where f denotes the CRNet, g denotes the PPNet, and Φ is the function that generates a reconstructed image from the codec. Moreover, δ is a metric that measures the distortion between two images. The bit-rate term R is defined as

R = φ(f(x)), (2)

where φ denotes the function that returns the number of bits of the compressed bitstream from the codec. We then define the following objective function to train f and g to minimize the rate-distortion cost:

J = δ(x, g(Φ(f(x)))) + λφ(f(x)). (3)

Using this objective function, we jointly optimize the CRNet and PPNet to improve the coding efficiency of the codec.
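As a concrete illustration, the cost in (3) can be evaluated with a real codec in a few lines. The following is a minimal sketch assuming Pillow's JPEG implementation for (Φ, φ) and mean squared error for δ, with f and g standing in for the CRNet and PPNet; all names and the λ value are illustrative. Because the codec call is nondifferentiable, this can only measure J, which is exactly the problem addressed next.

    import io
    import numpy as np
    from PIL import Image

    def rd_cost(x, f, g, quality=20, lam=1e-4):
        # Evaluate J = D + lambda * R from (3) with a real JPEG codec
        # standing in for (Phi, phi). x: HxWx3 uint8 array; f, g:
        # callables for the CRNet and PPNet over uint8 arrays.
        compact = f(x)                                   # f(x): compact image
        buf = io.BytesIO()
        Image.fromarray(compact).save(buf, format="JPEG", quality=quality)
        bits = 8 * buf.getbuffer().nbytes                # R = phi(f(x))
        buf.seek(0)
        decoded = np.asarray(Image.open(buf))            # Phi(f(x))
        recon = g(decoded)                               # g(Phi(f(x)))
        mse = np.mean((x.astype(np.float64) - recon.astype(np.float64)) ** 2)
        return mse + lam * bits                          # J = D + lambda * R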
However, the codec-related functions Φ and φ involve a nondifferentiable quantization operation that creates problems for the backpropagation algorithm. To overcome this problem, the codec-related functions are replaced with differentiable neural networks h and p as follows:

θf*, θg* ≈ argmin_{θf, θg} δ(x, g(h(f(x)))) + λp(f(x)), (4)

where θf and θg are the parameters of the functions f and g, respectively. In (4), we can reach the ideally optimal solutions θf and θg if the two neural networks h and p perfectly model the real codec modules Φ and φ. All parts of the objective function are then composed of learnable neural networks, enabling backpropagation in the end-to-end learning scheme. We define h as the ACN and p as the BENet in this paper.

Both the ACN and BENet have fixed parameters while (4) is being optimized. The overall pipeline of the proposed compression framework is illustrated in detail in Fig. 2 (a). The original image passes through the CRNet and is expressed as a compact image to reduce the amount of information. Next, the BENet calculates the number of predicted bits, and the ACN generates an imitated decoded image from the compact image. Finally, the PPNet restores the original image. The codec imitation module, consisting of the ACN and BENet, is used only for training and serves as a gradient path for training the CRNet. In the testing phase, the CRNet and PPNet are used with existing codecs as conventional preprocessing and postprocessing modules.

Fig. 2. (a) Overall framework: the end-to-end learning pipeline for image compression, in which the CRNet output passes through the BENet (loss L_BEN) and the ACN (loss L_Rec) in the train phase and through the real codec in the test phase. (b) JPEG-based auxiliary codec network (ACN): N stacked blocks of 1 × 1 convolutions with LeakyReLU between Space2Depth and Depth2Space reshapes. (c) HEVC-intra-based ACN: the original image concatenated with the prediction image is fed to a codec-mimicking network with a U-Net, producing a decoded residual that is added to the prediction image. (d) Bit estimation network (BENet): M stacked blocks of 1 × 1 convolution, batch normalization, and LeakyReLU, followed by average pooling.

B. Auxiliary Codec Network

The parameter values θf* and θg* are obtained by optimizing the objective function in (4) using approximation (imitation) functions of the codec modules Φ and φ. However, the obtained parameters may differ from the ideal ones when the actual codec is applied. To reduce these differences and perform optimal learning, the output of h should be as close as possible to that of the actual codec module Φ. In this section, we propose novel CNN architectures for the ACN that closely approximate two typical standard codecs, JPEG and HEVC intra coding (HEVC-intra). The objective function to train the ACNs is

θh* = argmin_{θh} ||h(x) − Φ(x)||₂². (5)

The architectures of the networks imitating JPEG and HEVC-intra are depicted in Fig. 2 (b) and (c).
In the JPEG codec, an original image is divided into 8 × 8 nonoverlapping blocks, and each block is processed independently. Reflecting this characteristic of JPEG compression, the input image, divided into fixed 8 × 8 blocks, has its axes rearranged through the Space2Depth and Depth2Space operations proposed in [47]. The Space2Depth operation converts data from the blockwise spatial axes to the channel axis. For the JPEG-based ACN, the pixel data in each 8 × 8 block are rearranged and converted to 64 channels. For example, if an input image with C color channels, height H, and width W is represented as a data tensor of H × W × C, it is reshaped into a tensor of H/8 × W/8 × 64C through the Space2Depth operation. The pixels in each block then form a full connection independently while passing through 1 × 1 convolutional layers with skip connections, which prevent a large loss of information as the input image passes through the deep network.
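This blockwise axis change corresponds to PyTorch's pixel_unshuffle/pixel_shuffle pair; a small sketch for an 8 × 8 block size:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 3, 256, 256)     # (N, C, H, W) input image
    b = F.pixel_unshuffle(x, 8)         # Space2Depth: (1, 192, 32, 32), i.e., 64C channels
    # 1x1 convolutions on b act as fully connected layers within each 8x8 block.
    y = F.pixel_shuffle(b, 8)           # Depth2Space: back to (1, 3, 256, 256)
    assert torch.equal(x, y)            # the reshape pair is lossless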
In the HEVC intra-encoding process, more complex encoding is performed using intra-prediction methods, and the size of the coding block is variable. Because the prediction image that reduces the redundancy of a block is generated not inside the block itself but from other blocks, it is difficult to approximate a prediction-based codec with only the block-based Space2Depth operation above, which forms connections only within blocks.

Therefore, a more complex structure is proposed for the HEVC-intra-based ACN, as illustrated in Fig. 2 (c). The original image and a prediction image are concatenated along the channel axis and provided as input. The prediction image is generated by the real codec and lets the HEVC-intra-based ACN focus only on generating the coded prediction residual. Continuously generating the prediction image with the real codec during end-to-end training does not affect the actual encoding and decoding processes because the ACN is not used in the inference phase.

For the HEVC-intra-based ACN, we also use a block size of 8 × 8, the same as for the JPEG-based ACN, because the minimum CU size is set to 8 in the default profile of the HEVC reference software HM 16.20 [48], [49]. However, HEVC performs transform and quantization with variable block sizes, so the HEVC-intra-based ACN combines a JPEG-based ACN with additional CNNs based on the U-Net structure [50]. This structure overcomes the limitation of the JPEG-based ACN, whose weights are connected only within fixed-size blocks, by enabling connections between the divided blocks. Finally, the imitated decoded image of the HEVC-intra-based ACN is obtained as the sum of the residual image generated by the combined CNNs and the input prediction image.

C. Bit Estimation Network

Because all parts of the objective function should comprise differentiable functions, we approximate the bit-rate term, expressed as the size of the bitstream generated by the existing codec encoder φ, with the deep learning network function p. The objective function to train the BENet is

θp* = argmin_{θp} ||p(x) − φ(x)||₂². (6)

We propose the BENet architecture to predict the number of bits for end-to-end learning, considering the bit-rate term in (4). The structure of the BENet is presented in Fig. 2 (d). The depth M of the BENet was set to 10, which yields a sufficiently small error for end-to-end learning, based on the ablation study in Section V-B. The input image of the BENet is rearranged on the channel axis in the same way as in the ACN, and the network generates a predicted value through 1 × 1 convolutions and global average pooling. The BENet can thus predict the size of the bitstream generated by encoding the input image. Because the BENet comprises differentiable operations, the gradient of the bit-rate term in the objective function can be backpropagated through it, and a gradient-based optimization algorithm can be applied to optimize the objective function (4).
residual. Continuously generating the prediction image using
the real codec in the end-to-end training process does not affect A. Loss Function
the actual encoding and decoding processes because the ACN The goal of end-to-end network training is to improve the
is not used at the inference phase. coding efficiency; that is, the reconstructed image through
For the HEVC-intra-based ACN, we also divided the block the PPNet should be closer to the original image x, and the
size of 8 × 8 with the same size as JPEG-based ACN because number of bits generated by the standard encoder φ(x) should
the default value of the minimum CU unit was set to 8 in the be reduced. The distortion is defined as the mean squared
default profile of the HEVC reference software HM 16.20 [48], error of the original image x and reconstructed image x̂.
[49]. However, the HEVC performs transform and quantization The result obtained using the distortion calculation is defined
with variable block sizes, so the structure of the HEVC- as the reconstruction loss Lrec as presented in the following
intra-based ACN comprises the combination of a JPEG-based equation:
ACN and additional CNNs based on the U-Net structure [50].
2
The proposed structure of the HEVC-intra-based ACN with Lrec = kx − x̂k2 . (7)
additional CNNs overcomes the limitations of the JPEG-based
Next, (8) defines the bit loss Lbit to reduce the number of
ACN, in which the weights were connected only in blocks
bits:
of a fixed size by enabling a connection between divided
blocks. Finally, an imitated decoded image of the HEVC-intra-
Lbit = p(f (x)). (8)
based ACN was obtained from the sum of the residual image
generated by the combined CNNs and the input prediction The end-to-end deep learning model does not analyze the
image. characteristics of the image with human intuition but only
updates the weight in a direction that reduces the objective
C. Bit Estimation Network function. Therefore, it is difficult to determine the pattern
As all parts of the objective function should comprise of intermediate products of each module in a pipeline. As
differentiable functions, we approximated the bit-rate term the parameter of f is updated as the training progresses, the
expressed as the size of a bitstream generated from the existing approximation performance of h and p gradually decreases
codec encoder φ using the deep learning network function p. because the unseen compact image is input into the pretrained
The objective function to train the BENet is as follows: h and p. The regularization loss proposed in [36], which
The regularization loss proposed in [36], called guide loss in [35], mitigates this degradation of the approximation accuracy:

Lreg = ||f(x) − Fs(x)||₂², (9)

where Fs denotes the bicubic downsampling function with scale factor s. The regularization loss Lreg helps the compact image generated by f preserve the statistical characteristics of natural images and the low-frequency information of the original image. The total loss, combining the above three loss terms, is

Ltotal = Lrec + λbit Lbit + λreg Lreg, (10)

where λbit and λreg are trade-off weights balancing the loss terms. The compression efficiency for different λbit and λreg is discussed in more detail in Section V-B5.
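A sketch of (7)-(10) in PyTorch form, assuming bicubic interpolation for Fs; f, g, h, and p denote the CRNet, PPNet, ACN, and BENet, and the default weights are only examples:

    import torch.nn.functional as F

    def total_loss(x, f, g, h, p, scale=0.5, lam_bit=1e-4, lam_reg=0.1):
        compact = f(x)                                 # compact image f(x)
        x_hat = g(h(compact))                          # reconstruction via the ACN
        l_rec = F.mse_loss(x_hat, x)                   # (7)
        l_bit = p(compact).mean()                      # (8)
        ref = F.interpolate(x, scale_factor=scale,     # F_s(x), bicubic downsampling
                            mode="bicubic", align_corners=False)
        l_reg = F.mse_loss(compact, ref)               # (9)
        return l_rec + lam_bit * l_bit + lam_reg * l_reg   # (10)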
B. Training Strategy

1) Pretraining: The approximation networks, the ACN and BENet, are auxiliary networks used only for end-to-end learning and only perform gradient backpropagation. These networks have the fixed roles of codec imitation and bit prediction, regardless of the learning direction of the end-to-end model. Therefore, the two networks should be pretrained: the ACN with (5) and the BENet with (6).

The CRNet processes the input image before the standard codec, and the PPNet processes the decoded image generated by the standard codec. The proposed compression framework aims to increase the coding efficiency by lowering the resolution of the input and output images. Because it is already known that the CRNet and PPNet should be able to change the resolution of an image and remove compression artifacts, it is beneficial to define the initial states of the CRNet and PPNet through pretraining:

θf⁰ = argmin_{θf} ||f(x) − Fs(x)||₂², (11)

θg⁰ = argmin_{θg} ||g(Φ(Fs(x))) − x||₂². (12)

That is, we pretrain the CRNet to output a bicubic downsampled image and pretrain the PPNet to restore a degraded bicubic downsampled image to the original image. The pretraining strategy provides a good initialization point for all networks, bringing the optimized parameters closer to the ideal and yielding a faster convergence rate. This is verified by a comparative experiment in Section V-B6.
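The initialization targets in (11) and (12) in code form; a sketch again assuming bicubic resizing for Fs, with codec_decode a placeholder for the real codec round trip Φ:

    import torch.nn.functional as F

    def crnet_pretrain_loss(f, x, scale=0.5):
        # Eq. (11): the CRNet starts out as a bicubic downsampler.
        ref = F.interpolate(x, scale_factor=scale, mode="bicubic",
                            align_corners=False)
        return F.mse_loss(f(x), ref)

    def ppnet_pretrain_loss(g, x, codec_decode, scale=0.5):
        # Eq. (12): the PPNet starts out restoring codec-degraded bicubic images.
        down = F.interpolate(x, scale_factor=scale, mode="bicubic",
                             align_corners=False)
        return F.mse_loss(g(codec_decode(down)), x)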
2) Iterative Fine-tuning Updating: As fine-tuning of the end-to-end model progresses, the CRNet learns in the direction that optimizes the objective function. However, as mentioned, the use of approximation functions affects the learning of the entire model. If the weights of the ACN and BENet were kept fixed, their approximation accuracy with respect to the standard codec would gradually decrease because unseen compact images from the CRNet are input as the whole model is trained. Therefore, the ACN and BENet are updated through an iterative learning process, as presented in Algorithm 1. The parameters of the CRNet and PPNet are updated for each minibatch cycle, and the ACN and BENet are then updated using the output of the CRNet. By correcting the approximation errors with the proposed algorithm, the parameters of the CRNet and PPNet can be learned closer to the ideal optimum values θf* and θg*.

Algorithm 1: Training algorithm of the proposed compression framework
Input: Original image: x; Batch size: K
Pretrain h and p by optimizing Eqs. (5) and (6)
Initialize θf⁰ and θg⁰ by optimizing Eqs. (11) and (12)
for each minibatch iteration do
    Update θf and θg via joint learning with fixed h and p, referring to the objective function (10)
    for i = 1 to K do
        Generate the training data group containing a compact image f(x), a decoded image Φ(f(x)), and the number of bits φ(f(x)) from the codec
    end
    Update the parameters of h and p using the grouped data {f(x), Φ(f(x)), φ(f(x))} with Eqs. (5) and (6)
end
return θf, θg

Fig. 3. Comparison of the learning processes of compact representation network (CRNet)-based methods: (a) alternate learning with identity function approximation [38], (b) alternate learning with a virtual codec neural network (VCNN) [39], and (c) simultaneous learning with the proposed auxiliary codec network (ACN). Blue modules are updated in each step, and yellow modules are fixed. Green and red arrows indicate a forward pass (or inference) and a backward pass (or gradients) to train the module, respectively. The yellow double arrow indicates the argument of the loss function for the backward pass.
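Algorithm 1 in condensed code form; a PyTorch-style sketch that reuses total_loss from the earlier sketch, where codec is a placeholder returning the real decoded image Φ(f(x)) and bit count φ(f(x)), and all optimizers are assumptions:

    import torch
    import torch.nn.functional as F

    def train(loader, f, g, h, p, opt_fg, opt_h, opt_p, codec, epochs=1):
        for _ in range(epochs):
            for x in loader:
                # 1) Joint CRNet/PPNet update with h and p fixed (objective (10)).
                opt_fg.zero_grad()
                total_loss(x, f, g, h, p).backward()
                opt_fg.step()
                # 2) Regenerate codec ground truth for the current compact images.
                with torch.no_grad():
                    compact = f(x)
                    decoded, bits = codec(compact)     # Phi(f(x)), phi(f(x))
                # 3) Correct the surrogates with Eqs. (5) and (6).
                opt_h.zero_grad()
                F.mse_loss(h(compact), decoded).backward()
                opt_h.step()
                opt_p.zero_grad()
                F.mse_loss(p(compact).squeeze(1), bits).backward()
                opt_p.step()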
3) Comparison with Other CRNet-based Methods: Recent CRNet-based papers [38], [39] proposed learning strategies that bypass the nondifferentiable codec by adopting alternate learning of the CRNet and PPNet. In Fig. 3 (a), which illustrates the learning method of [38], the CRNet is learned by connecting it directly to the PPNet, assuming the codec to be an identity function. In Fig. 3 (b), which illustrates the learning method of [39], the CRNet is trained using the VCNN, which generates the image passed through the codec and PPNet. In both methods, the CRNet and PPNet are trained alternately, which is an incomplete optimization process.

In this paper, we adopt simultaneous learning of the end-to-end network so that the CRNet and PPNet obtain weights closer to the ideal optimal values. Rather than alternate learning, the CRNet and PPNet are optimized simultaneously: as illustrated in Fig. 3 (c), they are updated together with a fixed ACN. Comparative experiments on alternate and simultaneous learning are discussed in more detail in Section V-B4.

V. EXPERIMENTAL RESULTS

A. Setting

Previous studies based on the CRNet and PPNet [38], [39] have displayed performance improvements for various image codecs, such as JPEG, JPEG2000 [2], and BPG [40]. In this paper, we experiment with the traditional and widely used JPEG image codec to demonstrate the effectiveness of the proposed method. Furthermore, the HEVC standard, based on the HEVC reference software HM 16.20 [48], [49], is additionally used for codec modeling to demonstrate the possibility of extension to general codecs.

The CRNet adopts the TAD structure [35], and the PPNet adopts the enhanced deep super-resolution network (EDSR) [27] as the baseline structure. Because the CRNet performs downsampling and the PPNet performs upsampling, TAD and EDSR are representative convolutional networks for resizing the input image.

The DIV2K dataset [52], whose training set consists of 800 high-resolution images, was used to train the neural networks of the proposed framework. Additionally, we used 123,403 images from the COCO 2017 dataset [53] to pretrain the ACN and BENet, allowing them to learn more diverse patterns and increasing the approximation accuracy of the codec imitation module. We divided the original images into 128 × 128 patches for training and used the Adam optimizer [54] with β1 = 0.9 and β2 = 0.99. We pretrained each network module with a minibatch size of 16 and a learning rate of 1 × 10⁻⁴. After pretraining, the learning rate was set to 5 × 10⁻⁵, and for the ACN and BENet updates, it was set to 5 × 10⁻⁶.

The two loss weights, λbit and λreg, were determined empirically. We trained one model per JPEG quality factor (QF) from 10 to 80. A higher QF indicates a higher bit rate, and a larger λbit yields good performance in a high bit-rate environment. Specifically, λbit is set to 2 × 10⁻⁴ when the QF is 10, 1 × 10⁻⁴ when the QF is 20, and 3 × 10⁻⁵ when the QF is 40 or 80. The details of setting the weights are described in the experimental results of the ablation study. For JPEG-based model testing, two benchmark datasets, Set14 [55] and LIVE1 [56], were used.

In the HEVC-intra-based model, adapting the weights to the quantization parameter (QP) does not greatly facilitate performance improvement; thus, λbit and λreg are set to 5 × 10⁻⁵ and 1, respectively, for all QPs.

Fig. 4. Rate-distortion curves (PSNR and SSIM versus bits per pixel) of the proposed CRNet and bicubic downsampling at two downsampling scale factors, and a comparison across scale factors: (a), (b) scale factor 0.5; (c), (d) scale factor 0.75; and (e), (f) all scale factors.
Fig. 5. Visual comparison of different codec-mimicking network structures on (a) the Lighthouse image of the LIVE1 dataset at a JPEG quality factor of 10 (original image, JPEG QF = 10, Resblock VCNN with depth 6, Resblock VCNN with depth 20, and JPEG-based ACN) and (b) Kimono of the HEVC test sequences [51] at an HEVC quantization parameter of 42 (original image, HEVC QP = 42, Resblock VCNN, ACN without prediction image, and ACN with prediction image). The JPEG-based ACN reproduces blocking and ringing artifacts around edges similar to the JPEG-compressed image. The ACN with a prediction image, unlike the other structures, is less blurry and shows compression artifacts similar to HEVC.

TABLE I
QUANTITATIVE PEAK SIGNAL-TO-NOISE RATIO (PSNR; dB) COMPARISON OF JPEG IMITATION PERFORMANCE BY DIFFERENT CODEC-MIMICKING NETWORK STRUCTURES ON THE SET14 AND LIVE1 DATASETS

                                  QF = 10        QF = 20        QF = 40        QF = 80
vs JPEG                         Set14  LIVE1   Set14  LIVE1   Set14  LIVE1   Set14  LIVE1
Original image                  27.49  27.03   29.85  29.30   32.20  31.62   37.00  36.32
VCNN [39], depth = 6            29.89  29.56   32.15  31.60   34.29  33.60   37.98  37.49
VCNN [39], depth = 20           29.93  29.68   32.33  31.77   34.37  33.67   37.98  37.49
JPEG-based ACN, depth N = 9     43.85  44.50   39.82  39.82   40.34  40.29   39.53  38.97
JPEG-based ACN, depth N = 11    45.24  45.41   41.78  41.75   40.33  40.25   39.76  39.25
JPEG-based ACN, depth N = 12    46.07  46.27   42.32  42.28   41.27  41.25   43.08  42.61
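For reference, the imitation scores in Tables I and II are PSNR values between a network's output and the real decoder's output; a minimal sketch of that measurement:

    import numpy as np

    def psnr(a, b, peak=255.0):
        # PSNR in dB between two images of the same shape (e.g., uint8 arrays).
        mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
        return 10.0 * np.log10(peak ** 2 / mse)

    # e.g., imitation quality = psnr(acn_output, jpeg_decoded_image)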
TABLE II
QUANTITATIVE PEAK SIGNAL-TO-NOISE RATIO (PSNR; dB) COMPARISON OF HEVC IMITATION PERFORMANCE BY DIFFERENT CODEC-MIMICKING NETWORK STRUCTURES ON THE HEVC TEST SEQUENCES [51]

vs HEVC                        QP = 32   QP = 37   QP = 42   QP = 47
Original image                   36.07     33.06     30.13     27.40
VCNN [39]                        38.54     36.17     33.94     32.07
ACN without prediction image     39.12     36.63     34.37     32.36
ACN with prediction image        40.87     39.09     37.53     36.06

The HEVC-intra-based model is implemented in the HEVC reference software HM 16.20 [48], [49] with the all-intra configuration and tested using the test conditions, configurations, and sequences proposed by the Joint Collaborative Team on Video Coding [51]. The test sequences are divided into Classes A, B, C, D, and E according to spatial resolution. When evaluating the compression performance of the HEVC-intra-based models, the results are expressed as Bjøntegaard delta (BD) rate [57] reductions for the luma component. For both codecs, we adopt the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [58] as image quality evaluation metrics.

B. Ablation Study

In this section, we evaluate the contribution of each network of the proposed framework. We also performed ablation studies to analyze the importance of each loss term. In addition, we tested the effects of the pretraining and iterative update algorithms proposed in Section IV-B. All ablation experiments were run on the LIVE1 dataset and evaluated on the rate-distortion plane.

1) Compact Representation Network: To confirm the effect of the proposed CRNet, we compared the CRNet to frameworks that simply downsample and restore [29]-[31]. As illustrated in Fig. 4 (a)-(d), the CRNet outperformed bicubic downsampling preprocessing at two scale factors, 0.5 and 0.75. The improvement was larger at a scale factor of 0.5 because the CRNet and PPNet have more capacity to compress and restore spatially as the scale factor decreases.

Fig. 4 (e) and (f) show the performance analysis according to the scale factor. A smaller scale factor results in greater information loss from the original image; thus, the compression efficiency is improved only at a low bit rate. In contrast, in a high bit-rate environment, it is better to maintain the original scale. An efficient scale factor therefore exists for each bit rate, suggesting that the scale factor should be determined adaptively according to the target rate.

2) Auxiliary Codec Network: To prove the superiority of the proposed ACN architecture, we compared how well several ACN structures imitate real codecs. First, we compared the JPEG-based structure with the residual block-based CNN structure of the VCNN [39]. The results in Fig. 5 (a) and Table I reveal that the JPEG-based ACN mimics JPEG-decoded images better than the VCNN. In particular, when the depth N of the JPEG-based ACN is 12, the imitation performance is over 40 dB at all QFs. The residual block-based CNN structure cannot follow the behavior of the JPEG codec regardless of the depth of the network.

In contrast, the proposed JPEG-based ACN generates a decoded image similar to the output of JPEG. The learned ACN expresses the contouring and ringing artifacts typical of JPEG compression. As the proposed ACN closely follows the codec operation, backpropagation can be performed for the CRNet with a small error.

In addition, we conducted a comparative experiment on the HEVC-intra-based ACN. We compared the HEVC-based ACN with the VCNN and compared using the original image alone versus together with a prediction image as the input. The results in Fig. 5 (b) and Table II indicate that the proposed HEVC-intra-based ACN has superior HEVC imitation performance compared with the Resblock-based VCNN. Furthermore, taking the prediction image as input along with the original image dramatically improves the imitation performance.

3) Bit Estimation Network: To prove the superiority of the proposed BENet structure, we compared it with ResNet [15], a representative network architecture for regression. As shown in Table III, the BENet demonstrated better bit prediction performance than ResNet. In particular, the BENet is advantageous in that its number of parameters is relatively small.

4) Simultaneous Learning Strategy: We conducted a comparative experiment with the learning strategies of recent CRNet-based papers [38], [39]. Ideally, the quality of the backpropagated gradients should be compared directly to determine how well each learning strategy mimics the codec role. However, real nondifferentiable codecs provide no ground truth for gradient propagation. Therefore, we analyzed the mimicking ability of each method through the final compression performance after optimizing the CRNet and PPNet.

In Fig. 6, we compare compression methods that use preprocessing and postprocessing networks. The methodologies selected for comparison are as follows: first, learning the CRNet and PPNet alternately by approximating the codec as an identity function as in [38] (green line) or using the VCNN [39] (magenta line); and second, simultaneous optimization by directly connecting the CRNet and PPNet (cyan line), using the VCNN (purple line), or using the proposed JPEG-based ACN (blue line).
TABLE III
QUANTITATIVE PERCENT (%) ERROR AND NUMBER-OF-PARAMETERS COMPARISON OF BITS-PER-PIXEL (BPP) ESTIMATION BY DIFFERENT BIT ESTIMATION NETWORK (BENET) STRUCTURES ON THE SET14 AND LIVE1 DATASETS

                   QF = 10          QF = 20          QF = 40          QF = 80
Structure        Set14   LIVE1    Set14   LIVE1    Set14   LIVE1    Set14   LIVE1    Average   Parameters
ResNet-18 [15]   1.598%  0.999%   1.687%  1.535%   1.571%  1.591%   1.134%  1.327%   1.430%    15.199M
ResNet-50 [15]   1.873%  1.934%   1.520%  1.364%   0.945%  1.180%   1.705%  2.005%   1.566%    29.287M
BENet            1.001%  1.456%   1.221%  1.305%   1.072%  1.909%   1.211%  1.661%   1.355%    2.098M

Furthermore, we additionally performed alternate learning of the CRNet and PPNet with the JPEG-based ACN (red line) to compare the influence of alternate and simultaneous learning. For a fair comparison, all experiments used the same fixed network structures for the CRNet and PPNet and the same training database described in Section V-A.

In Fig. 6 (a), the proposed JPEG-based ACN with simultaneous learning outperforms the other methods. When the CRNet and PPNet are learned alternately with the codec approximated as an identity function, the codec characteristics are repeatedly absorbed by the PPNet, which yields good performance; however, an error occurs because the codec is assumed to be an identity function when learning the CRNet. Limitations also exist for the VCNN, as its poor approximation of the codec makes it difficult to sufficiently transfer the codec characteristics to the CRNet.

In the case of simultaneous learning by directly connecting the CRNet and PPNet, or with the Resblock-based VCNN, the compression performance is significantly worse than the others. Table I shows that the VCNN output is closer to the JPEG-decoded image than the original image (identity function) is. However, as illustrated in the visual comparison in Fig. 5 (a), the codec-mimicking images generated by the VCNN exhibit almost none of the JPEG compression noise patterns and instead contain unpredictable noise. End-to-end learning with the VCNN, which is updated through iterative learning and is therefore changeable, unlike an identity function that produces a constant mapping, leads to worse optimization. The VCNN can therefore cause even greater error propagation than the identity function approximation.

In alternate learning, the PPNet is continuously trained on decoded images obtained from the real codec, which compensates for approximation errors. In simultaneous learning, by contrast, if a sufficiently good codec approximation is not available, error propagation continues into the CRNet and PPNet during training, degrading performance. The JPEG-based ACN exhibited better performance with simultaneous learning than with alternate learning. This result reveals that the proposed learning method, in which the CRNet and PPNet are optimized simultaneously, is more effective when good mimicking performance is guaranteed. In Fig. 6 (b), the experimental results show that as the imitation performance gradually decreases with smaller values of N, the overall compression performance also gradually decreases. The experiments indicate that the imitation performance is proportional to the overall compression performance and that better mimicking improves compression through more precise optimization of the preprocessing network.

5) Bit and Regularization Loss: The rate-distortion performance was evaluated for the two loss weights, λbit and λreg, to determine the effectiveness of the bit and regularization losses. The rate-distortion results are plotted as traces over training epochs to analyze how performance changes as training progresses. As displayed in Fig. 7 (a), when the network is trained without the bit loss (λbit = 0), the final reconstructed image becomes closer to the original, but the number of bits generated during compression increases significantly, resulting in poor coding efficiency. In contrast, a proper λbit prevents these problems. Fig. 7 (b) indicates that training converges more stably with the regularization loss, which preserves the structure of natural images. If the influence of the regularization loss is too large, however, the learning of the CRNet is restricted, and no significant change in performance occurs from the initial state. Fig. 7 (c) shows that the coding efficiency is better when learning with the regularization loss (λreg = 0.1) than without it. A standard codec is designed to compress natural images; thus, the regularization loss prevents the loss of coding efficiency that results from compressing images with unnatural patterns.

6) Pretraining Strategy: We analyzed the effect of the pretraining strategy described in Section IV-B1. Fig. 8 presents the performance comparison with and without pretraining. The experimental results reveal faster convergence and better coding efficiency when the CRNet and PPNet are pretrained. Without pretraining, the parameters of the CRNet and PPNet are initialized with random values; in this case, it is challenging to converge in the desired direction because the ACN and BENet do not work properly at the beginning of the training process.

7) Iterative Updating Strategy: The iterative updating process proposed in Section IV-B2 reduces errors due to the approximation functions. A fixed ACN is compared with a repeatedly updated ACN to demonstrate the effectiveness of the proposed training scheme. Fig. 9 (a) depicts the codec imitation performance over training epochs, and Fig. 9 (b) displays the change in coding efficiency as training progresses. When the ACN is fixed, its imitation performance decreases gradually as the epochs increase. In contrast, when the ACN is continuously updated, the imitation performance does not deteriorate, and it converges. Guaranteeing the performance of the ACN helps training increase the coding efficiency of the entire framework with less approximation error.
Fig. 6. Rate-distortion curve comparison of compression methods using preprocessing and postprocessing networks. (a) Comparison according to network architecture and learning method (simultaneous learning (S. Learning) or alternating learning (A. Learning)); (b) comparison of the codec-mimicking networks under simultaneous learning.

Fig. 7. Rate-distortion performance comparison according to the weights of the loss function. Rate-distortion points of the compressed image (JPEG QF = 80) over training epochs according to (a) the bit loss weight λbit and (b) the regularization loss weight λreg, and (c) rate-distortion performance with (λreg = 0.1) and without the regularization loss after training is complete.

Fig. 8. Performance comparison with and without pretraining. (a) Change in the peak signal-to-noise ratio (PSNR) between the original and reconstructed images (JPEG QF = 80) in the end-to-end model over training epochs and (b) comparison of the rate-distortion performance after training is complete.
Fig. 9. Performance comparison with and without an iterative update in the simultaneous learning process. (a) Change in the peak signal-to-noise ratio (PSNR) between the output of the JPEG-based auxiliary codec network (ACN) and the decoded image from the JPEG decoder (QF = 80) over training epochs and (b) change in the rate-distortion point as training progresses.

Fig. 10. Rate-distortion performance in terms of peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) of different compression algorithms on test image datasets: (a), (c) Set14 and (b), (d) LIVE1.

Fig. 11. Rate-distortion curves of several HEVC test sequences [51]: (a) PeopleOnStreet (2160 × 1440), (b) Cactus (1920 × 1080), (c) BasketballDrill (832 × 480), and (d) Johnny (1280 × 720).

C. Comparison with State-of-the-art Methods

1) JPEG-based Model: We compared the coding efficiency of the proposed JPEG-based model (ACN depth N = 12) with JPEG and other standard compatible compact representation frameworks, namely those of Jiang [38] and Zhao [39]. We used the experimental results described in [39]; however, because [38] does not report compression rate and distortion results for high QFs, we newly trained the model from [38] and used those results.

Considering that the CRNet is most useful in a low bit-rate environment, the QF of JPEG was set from 2 to 40 for the comparison. In the proposed method, we adaptively determined the scale factor of the CRNet and PPNet according to the bit rate: the scale factors are 0.5, 0.75, and 1 in three bit-rate ranges from low to high. As depicted in Fig. 10, the proposed method outperforms the methods of Jiang and Zhao in all bits-per-pixel (bpp) ranges. Moreover, the proposed method exhibits the best performance on all test datasets. The key difference between the proposed algorithm and the existing methods is the learning method for the CRNet; the experimental results show that a well-trained CRNet improves compression performance.

2) HEVC-intra-based Model: The proposed method was compared with conventional codecs, learnable codecs, and a competing standard compatible algorithm to apply the proposed method to a recent standard codec and demonstrate state-of-the-art performance. Among conventional codecs, the BPG image codec was also selected. The image compression frameworks proposed by Ballé [9], Minnen [13], and Lee [14] were selected as learnable codecs, and the work by Li [36] was selected as the competing standard compatible algorithm. The rate-distortion performance comparison for typical sequences of each class is illustrated in Fig. 11.

The performance of the learnable codec algorithms does not exceed that of HEVC on the test sequences. In contrast, the standard compatible frameworks are designed to boost the performance of the existing codec and exhibit better coding efficiency than the other frameworks. In particular, our method reached state-of-the-art performance.

TABLE IV
BD-RATE COMPARISON OF STANDARD COMPATIBLE FRAMEWORKS ON HEVC COMMON TEST CONDITIONS [51] WITH ALL INTRA MAIN CONFIGURATIONS

Class    Sequence         | Li (Frame level) [36] | Li (Block level) [36] | Proposed
                          |   PSNR      SSIM      |   PSNR      SSIM      |   PSNR      SSIM
Class A  Traffic          |  -12.4%    -17.1%     |  -11.8%    -14.8%     |  -15.0%    -22.4%
         PeopleOnStreet   |   -9.5%    -14.4%     |  -11.7%    -14.5%     |  -15.4%    -23.4%
Class B  Kimono           |  -13.0%    -14.5%     |   -9.0%    -10.8%     |  -11.7%    -20.6%
         ParkScene        |   -8.8%    -15.5%     |   -8.3%    -13.0%     |  -11.6%    -19.4%
         Cactus           |   -7.1%    -20.0%     |   -8.5%    -12.9%     |  -14.9%    -22.6%
         BasketballDrive  |    7.0%    -19.7%     |   -8.0%    -11.3%     |  -11.3%    -23.8%
         BQTerrace        |    6.2%    -20.8%     |   -4.8%    -13.0%     |  -14.1%    -20.7%
Class C  BasketballDrill  |  -10.7%    -24.8%     |   -7.5%    -10.5%     |  -19.2%    -25.0%
         BQMall           |   17.2%    -23.4%     |   -3.9%     -8.0%     |   -7.3%    -19.9%
         PartyScene       |    8.9%    -26.2%     |   -1.9%     -6.5%     |   -9.3%    -20.2%
         RaceHorsesC      |   -6.8%    -18.5%     |   -8.2%    -13.0%     |  -14.2%    -19.1%
Class D  BasketballPass   |    6.5%    -22.0%     |   -4.6%     -9.1%     |  -20.1%    -24.4%
         BQSquare         |    7.8%    -26.6%     |   -1.8%     -3.6%     |  -15.4%    -22.8%
         BlowingBubbles   |    3.8%    -18.8%     |   -4.2%     -8.9%     |  -19.7%    -24.1%
         RaceHorsesD      |  -13.5%    -17.7%     |  -13.0%    -18.0%     |  -18.5%    -21.2%
Class E  FourPeople       |   -3.9%    -18.3%     |   -9.1%    -14.5%     |  -15.2%    -26.1%
         Johnny           |   -8.1%    -12.9%     |  -10.2%    -11.8%     |  -20.9%    -28.9%
         KristenAndSara   |   -0.7%    -20.0%     |   -7.9%    -14.0%     |  -15.7%    -22.8%
Summary  Class A          |  -11.0%    -15.8%     |  -11.8%     -8.7%     |  -15.2%    -22.9%
         Class B          |   -3.1%    -18.1%     |   -7.7%    -13.1%     |  -12.7%    -21.4%
         Class C          |    2.2%    -23.2%     |   -5.4%    -11.5%     |  -12.5%    -21.1%
         Class D          |    1.2%    -21.3%     |   -5.9%    -14.8%     |  -18.4%    -23.1%
         Class E          |   -4.2%    -17.1%     |   -9.1%    -11.9%     |  -17.2%    -25.9%
         Overall          |   -3.0%    -19.1%     |   -8.0%    -12.0%     |  -15.2%    -22.9%

For a detailed performance comparison between the standard compatible frameworks, the BD-rate performance was compared on the HEVC test sequences. Considering that the standard compatible framework is effective in a low bit-rate environment, the QP of the reference HEVC was set to 32, 37, 42, and 47. The BD-rate results are summarized in Table IV.
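For reference, the BD-rate values in Table IV follow the standard Bjontegaard calculation [57]. The sketch below is a generic implementation of that procedure with made-up example points; it is not the exact evaluation script used in these experiments.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard-delta bit rate [57]: average bit-rate difference (%)
    between two rate-distortion curves over their common PSNR range.
    Each curve is given as one (rate, PSNR) point per QP, e.g. for QP
    in {32, 37, 42, 47}; a negative value means the test codec needs
    fewer bits than the anchor at equal quality."""
    log_ra = np.log10(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log10(np.asarray(rate_test, dtype=float))
    # Fit log-rate as a cubic polynomial of PSNR for each curve.
    p_a = np.polyfit(psnr_anchor, log_ra, 3)
    p_t = np.polyfit(psnr_test, log_rt, 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    # Average log-rate difference, converted to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0

# Made-up rate-distortion points, one per QP, for illustration only:
anchor_r, anchor_d = [1000, 600, 350, 200], [38.0, 36.0, 34.0, 32.0]
test_r, test_d = [850, 500, 300, 175], [38.1, 36.2, 34.1, 32.2]
print(f"BD-rate: {bd_rate(anchor_r, anchor_d, test_r, test_d):.1f}%")
```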
The results reveal that the proposed method achieves an average BD-rate reduction of 15.2% over all classes on the PSNR metric and 22.9% on the SSIM metric. The proposed method also outperforms Li's frame-level and block-level schemes [36] on both the PSNR and SSIM metrics. In addition, [36] assumed that the standard codec is an identity function, similar to the assumption in [38]; therefore, the CRNet and PPNet are directly connected, and the CRNet is learned through joint learning of the entire network. The proposed framework instead reduces the errors in learning the CRNet by modeling the standard codec with the ACN, thereby improving coding efficiency.
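This distinction can be sketched in a few lines of Python; the function and module names below are illustrative stand-ins rather than the authors' implementation.

```python
# Conceptual sketch (illustrative names, not the authors' code) of the
# two training schemes for the compact representation network (CRNet).

def joint_step_identity(crnet, ppnet, x, loss_fn):
    # Scheme of [36], [38]: the codec is assumed to be an identity
    # function, so the CRNet output feeds the PPNet directly and the
    # gradients never reflect the codec's actual degradation.
    restored = ppnet(crnet(x))
    return loss_fn(restored, x)

def joint_step_acn(crnet, acn, ppnet, x, loss_fn):
    # Proposed scheme: a differentiable ACN that imitates the codec's
    # forward degradation sits between CRNet and PPNet, so the CRNet
    # is optimized against an approximation of the real coding errors.
    restored = ppnet(acn(crnet(x)))
    return loss_fn(restored, x)
```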
VI. CONCLUSION

In this paper, we proposed a standard compatible deep neural network-based framework for image compression. Within this framework, image compression is performed optimally through the existing off-the-shelf standard codecs, the CRNet, and the PPNet. The ACN was proposed for optimal learning of the entire network and is designed to imitate the forward degradation processes of existing codecs, such as JPEG and HEVC. Proper training strategies were proposed to minimize the errors introduced by the approximated objective function. The experimental results reveal that this approach outperforms the existing codecs and end-to-end learnable image compression algorithms. For future work, we will extend this framework to video compression tasks, which are more challenging to model because of their higher complexity and temporal dynamics.

REFERENCES

[1] G. K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
[2] M. Rabbani, "JPEG2000: Image compression fundamentals, standards and practice," Journal of Electronic Imaging, vol. 11, no. 2, p. 286, 2002.
[3] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
[4] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[5] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," in International Conference on Learning Representations, 2016.
[6] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314.
[7] L. Theis, W. Shi, A. Cunningham, and F. Huszár, "Lossy image compression with compressive autoencoders," in International Conference on Learning Representations, 2017.
[8] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," in International Conference on Learning Representations, 2017.

[9] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," in International Conference on Learning Representations, 2018.
[10] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, "Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4385–4393.
[11] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning convolutional networks for content-weighted image compression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3214–3223.
[12] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, "Conditional probability models for deep image compression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4394–4402.
[13] D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in Advances in Neural Information Processing Systems, 2018, pp. 10771–10780.
[14] J. Lee, S. Cho, and S.-K. Beack, "Context-adaptive entropy model for end-to-end optimized image compression," in International Conference on Learning Representations, 2019.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[17] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[18] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[19] J. Kim, J. Kwon Lee, and K. Mu Lee, "Accurate image super-resolution using very deep convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.
[20] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[21] T. Tong, G. Li, X. Liu, and Q. Gao, "Image super-resolution using dense skip connections," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4799–4807.
[22] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, "Non-local recurrent network for image restoration," in Advances in Neural Information Processing Systems, 2018, pp. 1673–1682.
[23] L. Cavigelli, P. Hager, and L. Benini, "CAS-CNN: A deep convolutional neural network for image compression artifact suppression," in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 752–759.
[24] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, "Deep generative adversarial compression artifact removal," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4826–4835.
[25] Y. Tai, J. Yang, X. Liu, and C. Xu, "MemNet: A persistent memory network for image restoration," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4539–4547.
[26] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang, "D3: Deep dual-domain based fast restoration of JPEG-compressed images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2764–2772.
[27] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.
[28] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 286–301.
[29] A. M. Bruckstein, M. Elad, and R. Kimmel, "Down-scaling for better transform compression," IEEE Transactions on Image Processing, vol. 12, no. 9, pp. 1132–1144, 2003.
[30] W. Lin and L. Dong, "Adaptive downsampling to improve image compression at low bit rates," IEEE Transactions on Image Processing, vol. 15, no. 9, pp. 2513–2521, 2006.
[31] X. Wu, X. Zhang, and X. Wang, "Low bit-rate image compression via adaptive down-sampling and constrained least squares upconversion," IEEE Transactions on Image Processing, vol. 18, no. 3, pp. 552–561, 2009.
[32] M. Afonso, F. Zhang, and D. R. Bull, "Video compression based on spatio-temporal resolution adaptation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 275–280, 2018.
[33] Y. Li, D. Liu, H. Li, L. Li, F. Wu, H. Zhang, and H. Yang, "Convolutional neural network-based block up-sampling for intra frame coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2316–2330, 2018.
[34] Y. Zhang, D. Zhao, J. Zhang, R. Xiong, and W. Gao, "Interpolation-dependent image downsampling," IEEE Transactions on Image Processing, vol. 20, no. 11, pp. 3291–3296, 2011.
[35] H. Kim, M. Choi, B. Lim, and K. Mu Lee, "Task-aware image downscaling," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 399–414.
[36] Y. Li, D. Liu, H. Li, L. Li, Z. Li, and F. Wu, "Learning a convolutional neural network for image compact-resolution," IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1092–1107, 2018.
[37] W. Sun and Z. Chen, "Learned image downscaling for upscaling using content adaptive resampler," IEEE Transactions on Image Processing, vol. 29, pp. 4027–4040, 2020.
[38] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, "An end-to-end compression framework based on convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 3007–3018, 2017.
[39] L. Zhao, H. Bai, A. Wang, and Y. Zhao, "Learning a virtual codec based on deep convolutional neural network to compress image," Journal of Visual Communication and Image Representation, vol. 63, p. 102589, 2019.
[40] F. Bellard, "BPG image format," https://bellard.org/bpg, 2015.
[41] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2015.
[42] C. Dong, Y. Deng, C. Change Loy, and X. Tang, "Compression artifacts reduction by a deep convolutional network," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 576–584.
[43] X. Zhang, W. Yang, Y. Hu, and J. Liu, "DMCNN: Dual-domain multi-scale convolutional neural network for compression artifacts removal," in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 390–394.
[44] B. Zheng, Y. Chen, X. Tian, F. Zhou, and X. Liu, "Implicit dual-domain convolutional network for robust color image compression artifact reduction," IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[45] T. Kim, H. Lee, H. Son, and S. Lee, "SF-CNN: A fast compression artifacts removal via spatial-to-frequency convolutional neural networks," in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3606–3610.
[46] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, "Dynamic filter networks," in Advances in Neural Information Processing Systems, 2016, pp. 667–675.
[47] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[48] K. McCann, B. Bross, W. Han, I. Kim, K. Sugimoto, and G. Sullivan, "High Efficiency Video Coding (HEVC) Test Model 16 (HM 16) encoder description," JCT-VC, Doc. JCTVC-N1002, 2014.
[49] HM16.20 reference software. [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.20/
[50] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[51] F. Bossen et al., "Common test conditions and software reference configurations," JCTVC-L1100, vol. 12, p. 7, 2013.
[52] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang, "NTIRE 2017 challenge on single image super-resolution: Methods and results," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 114–125.
[53] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[54] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
[55] R. Zeyde, M. Elad, and M. Protter, "On single image scale-up using sparse-representations," in International Conference on Curves and Surfaces. Springer, 2010, pp. 711–730.
[56] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms," IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, 2006.
[57] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," VCEG-M33, 2001.
[58] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.

Hanbin Son received the B.S. degree in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 2016, where he is currently pursuing the Ph.D. degree in electrical and electronic engineering. His current research interests include video compression and image processing via deep learning.

Taeoh Kim received his B.S. degree in Electrical and Electronic Engineering from Yonsei University, Seoul, South Korea, in 2015, where he is currently pursuing the Ph.D. degree. His current research interests include image/video restoration, face recognition, and video recognition.

Hyeongmin Lee is a Ph.D. student in Electrical and Electronic Engineering at Yonsei University, Seoul, South Korea, where he received his B.S. degree in 2018. His research interests include computer vision, computational photography, and video processing.

Sangyoun Lee (M'04) received his B.S. and M.S. degrees in Electrical and Electronic Engineering from Yonsei University, Seoul, South Korea, in 1987 and 1989, respectively, and his Ph.D. degree in Electrical and Computer Engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 1999. He is currently a Professor of Electrical and Electronic Engineering with the Graduate School, and the Head of the Image and Video Pattern Recognition Laboratory, Yonsei University. His research interests include all aspects of computer vision, with a special focus on pattern recognition for face detection and recognition, advanced driver-assistance systems, and video codecs.
