
Yu_SoftCollage_A_Differentiable_Probabilistic_Tree_Generator_for_Image_Collage_CVPR_2022_paper


SoftCollage: A Differentiable Probabilistic Tree Generator for Image Collage

Jiahao Yu¹, Li Chen¹*, Mingrui Zhang¹, Mading Li²

¹ School of Software, BNRist, Tsinghua University, Beijing, China
² Kuaishou Technology, Beijing, China
{yujh21,zmr20}@mails.tsinghua.edu.cn [email protected] [email protected]

Abstract

The image collage task aims to create an informative and visually aesthetic summarization for an image collection. While several recent works exploit tree-based algorithms to preserve image content better, all of them resort to hand-crafted adjustment rules to optimize the collage tree structure, and therefore fail to fully explore the structure space of the collage tree. Our key idea is to soften the discrete tree structure space into a continuous probability space. We propose SoftCollage, a novel method that employs a neural-based differentiable probabilistic tree generator to produce the probability distribution of a correlation-preserving collage tree conditioned on deep image feature, aspect ratio and canvas size. The differentiable characteristic allows us to formulate the tree-based collage generation as a differentiable process and directly exploit the gradient to optimize the collage layout at the level of probability space in an end-to-end manner. To facilitate image collage research, we propose AIC, a large-scale publicly available annotated dataset for image collage evaluation. Extensive experiments on the introduced dataset demonstrate the superior performance of the proposed method. Data and codes are available at https://github.com/ChineseYjh/SoftCollage.

[Figure 1 diagram. Top row, "Conventional Tree-Based Collage": image collection, initialize, collage tree, collage, criterion, hand-crafted adjustment rules. Bottom row, "SoftCollage": image collection, tree generator, tree probability distribution $\tau_\theta$, sample, sampled collage set, map, criterion loss, back propagation.]

Figure 1. The optimization paradigms of the conventional methods and the proposed SoftCollage. We formulate the tree-based collage generation as a differentiable process via softening the discrete tree structure $\tau$ into a probability space for the first time. Instead of the hand-crafted adjustment scheme, we directly exploit the gradient of the criterion loss to optimize the tree probability distribution $\tau_\theta$, which facilitates the tree structure exploration.

* Corresponding author. This research was partially supported by the National Natural Science Foundation of China (Grant Nos. 61972221, 62021002, 61572274) and Tsinghua-Kuaishou Institute of Future Media Data. We thank Xingjia Pan for preparing some comparison results.

1. Introduction

Image collage aims to create a visual summarization with rich information and high aesthetic quality for a group of images. Because this task requires professional collage knowledge, amateurs have a huge demand for automatic image collage tools [16]. Therefore, many research efforts have tried to automate the process of image collage. While many works [4, 12, 18, 19, 25, 26, 33, 41] have achieved a certain level of success in improving visual perception of collage results, they brought about image artifacts [18, 25, 26, 41] and image overlapping [19, 33, 36, 40]. To tackle these defects, some tree-based algorithms [3, 8, 16, 23, 37, 38] were developed to preserve image content better. A tree-based collage is encoded as a binary tree which leads to a recursive partition of the canvas, as illustrated in Fig. 2. In the tree, each leaf node corresponds to an image and each interior node corresponds to a bounding box, whose designation as a horizontal (“H”) or vertical (“V”) cut corresponds to dividing the box into two child boxes [3]. The existing tree-based methods design a two-stage procedure, where images are arranged in a standard collage tree in the first stage and the tree is mapped to the collage via a specific bijective mapping function in the second stage. Accordingly, the collage layout optimization is cast as an optimal tree structure search problem.

[Figure 2 diagram: a standard collage tree with "V" and "H" interior nodes, and the corresponding tree-based collage with its horizontal and vertical cuts.]

Figure 2. An example of the mapping from a standard collage tree to the tree-based collage.

[Figure 3 images: (a) VSM [23], (b) Ours.]

Figure 3. Due to the failure of fully exploring the structure space of collage tree, the collage generated by the state-of-the-art method (a) still contains images suffering severe aspect ratio distortion (red dotted rectangle) and fails to place similar images together (blue dotted ellipse). Our result (b) preserves aspect ratio and content correlation better.

However, all the existing works only resort to heuristic hand-crafted adjustment rules when searching for the optimal tree structure, and therefore fail to fully explore the structure space of the collage tree (Fig. 3). Deep learning provides a promising way to learn a high-quality collage tree. Unfortunately, the two-stage tree-based collage generation process is undifferentiable because both stages include discrete operations that prevent back propagation. Although recent tree-based advances [16, 23] utilized learning strategies, they only applied them to yield semantic features in the first stage so that images with similar features are clustered together. These works achieve much improvement because placing correlated images together can facilitate collage informativeness [18, 38, 41]. However, these methods still employed hand-crafted schemes to refine the tree structure and failed to fully explore the solution space (Fig. 3). Recently, although Pan et al. [23] introduced back propagation for the first time to fine-tune aspect ratio and splitting ratio, they still failed to propagate the gradients back to optimize the collage tree structure due to the undifferentiable characteristic of the tree-based process.

In this paper, we attack the key problem of differentiating the overall two-stage tree-based collage generation process (Fig. 1). Specifically, we first propose a novel neural-based differentiable probabilistic tree generator to model the first stage of the tree-based procedure. Our tree generator exploits deep image feature and embedded information including aspect ratio and canvas size to construct a correlation-preserving probabilistic collage tree (PCtree), which builds a probability space via modeling the node type distribution (the cut type of the node is horizontal (“H”) or vertical (“V”)) and the edge connection distribution (the child node is on the left (“L”) or right (“R”)) (Fig. 5). Secondly, we formulate the tree generator optimization as an end-to-end framework resorting to the policy gradient technique [30], which naturally overcomes the differentiation difficulty in the second stage of the tree-based procedure. Instead of the hand-crafted adjustment scheme at the instance level, our optimization paradigm directly utilizes the gradient of the collage criteria loss to optimize the collage tree structure at the level of probability space, which facilitates the exploration of the optimal collage structure.

Furthermore, this field lacks a benchmark dataset with sufficient labels for quantitative evaluation. To facilitate image collage research, we propose AIC, a large-scale publicly available annotated dataset for image collage evaluation.

The major contributions can be summarized as follows.

• We propose a novel neural-based probabilistic tree generator which constructs a “soft” probabilistic tree structure to build a probability space of correlation-preserving collage trees conditioned on the deep image feature, aspect ratio and canvas size.
• We formulate the tree-based collage generation procedure as a differentiable process for the first time, and introduce an end-to-end learning strategy to perform gradient-based structure optimization.
• We provide a large-scale publicly available annotated benchmark dataset for the evaluation of image collage methods.
• We conduct extensive experiments and a user study, and show that our model outperforms the state-of-the-art methods.

2. Related Work

Previous works on image collage mainly fall into two categories, i.e. parametric methods and partitioning-based methods. Our tree-based method belongs to the latter.

Parametric methods parameterize a collage with variables including position, scale, orientation and layer index of each image and design well-defined objective functions to solve the optimal variables directly [4, 9, 12, 19, 25–27, 33, 36, 40]. These works either modeled the problem via a probabilistic graphical framework [19, 25, 26, 33, 36, 40] or solved the collage parameters in a heuristic manner [4, 9, 12, 27]. To preserve correlation among images, some methods exploited a feature space to acquire the correlation and projected the images into a visualization space [1, 13, 20, 21, 29, 39]. However, these methods introduce image overlapping and artifact problems.

Partitioning-based methods partition the canvas and assign each image a corresponding region to compose a collage [3, 8, 10, 16, 18, 23, 28, 31, 37, 38, 41]. Some works utilized Voronoi tessellation [31] and packing algorithms [18, 41] to allocate canvas space for the irregular salient region of each image, which brought about image artifacts when blending image boundaries. Hence, tree-based collage was developed to preserve image content better [3, 8, 16, 23, 28, 37, 38]. Atkins [3] first introduced tree-based collage and solved tree structure in a beam-search

[Figure 4 diagram. Panel labels: Feature Extractor (Backbone and InfoEmbed, shared weights), Fusion Module (AttFusion, shared weights), Node Classifier (FC, outputs H / V), Edge Classifier (Siam FC, shared weights, outputs L / R); arrows labelled "Forward" and "Forward via Nearest Neighbor Policy".]

Figure 4. The pipeline of our tree generator. Here the image collection size is four, and our feature extractor initially extracts the feature of each image. Subsequently the NNP and fusion module iteratively select child nodes to yield a parent feature node in a bottom-up manner until the root feature node of the probabilistic collage tree is acquired. Finally, the edge classifier and node classifier generate $p_e$ and $p_n$ respectively. σ is the softmax activation.
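The bottom-up construction described in the Figure 4 caption can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: `mean_fusion` is a hypothetical stand-in for the learned fusion module, the edge and node classifiers are omitted, and the closest pair is found by exhaustive search rather than with a priority queue. The names `nearest_pair`, `mean_fusion` and `build_tree` are not from the paper.

```python
import math
from itertools import combinations


def nearest_pair(features):
    """Return the two node keys whose features have the smallest Euclidean distance."""
    return min(
        combinations(list(features), 2),
        key=lambda pair: math.dist(features[pair[0]], features[pair[1]]),
    )


def mean_fusion(fx, fy):
    """Hypothetical placeholder for the learned fusion module (symmetric mean)."""
    return [(a + b) / 2.0 for a, b in zip(fx, fy)]


def build_tree(leaf_features, fuse=mean_fusion):
    """Greedily merge the closest pair of nodes until a single root remains.

    leaf_features maps a leaf id to its feature vector.
    The result is a nested tuple that records the merge order.
    """
    features = dict(leaf_features)            # node key -> feature vector
    while len(features) > 1:
        x, y = nearest_pair(features)
        fx, fy = features.pop(x), features.pop(y)
        features[(x, y)] = fuse(fx, fy)       # the parent replaces its two children
    (root,) = features
    return root


leaves = {0: [0.0, 0.0], 1: [0.1, 0.0], 2: [5.0, 5.0], 3: [5.2, 5.0]}
tree = build_tree(leaves)
```

With these toy features, the two pairs of nearby leaves are merged first and the two resulting parents are merged last. A learned generator would additionally attach an edge distribution and a node type distribution to each merge.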

manner. Fan [8] employed a genetic algorithm to improve [3] via designing genetic operators of the collage tree. Wu and Aizawa [38] initialized the tree structure in a greedy manner and adjusted the layout iteratively according to a hand-crafted distortion threshold. These tree-based methods all designed heuristic hand-crafted rules to adjust the tree structure, and thus failed to fully explore the solution space. Recently, Pan et al. [23] utilized back propagation to refine the aspect ratio and splitting ratio of the region box in [38]. However, the gradient in [23] still fails to flow back to optimize the tree structure due to the undifferentiable characteristic of the tree-based collage generation process. Different from the prior work, we attack the key problem of differentiating the process via softening the discrete structure of the collage tree, and hence our gradient can directly update all the structural details of the collage tree.

[Figure 5 diagram. Panel labels: Probabilistic collage tree (nodes with H / V distributions, edges with L / R distributions), Image collection, Standard collage tree; arrows labelled "Forward" and "Arrange".]

Figure 5. Our probabilistic collage tree softens the standard collage tree structure via modeling the node type distribution as $p_n$ and the edge connection distribution as $p_e$.

3. Approach

Problem formulation. According to the literature, a high-quality collage should satisfy the following criteria: 1) Compact. The collage should fully utilize canvas space by minimizing blank space. 2) Ratio-preserving. Images in the collage should suffer low aspect ratio distortion to retain the aesthetics. 3) Content-preserving. Image content, especially the salient region, should not be occluded, because image overlapping decreases the representativeness and aesthetics of the collage [23]. 4) Correlation-preserving. Recent works show that placing correlated images together facilitates informativeness of the collage [18, 23, 38, 41].

Therefore, given an image collection $\{I_i\}$, canvas width $w$ and height $h$, we aim to design a tree generator $G$. This generator constructs a collage tree $\tau$ in the first stage and the tree is mapped to the final collage $C$ via a mapping function $g$ in the second stage. Supposing we integrate the above four criteria into one criterion function $F$, our goal is to solve the optimal tree generator $G^* = \arg\max_G F\big(g\big(G(w, h, \{I_i\})\big)\big)$.

Overview. To solve the above two-stage problem in an end-to-end manner, we first propose a “soft” probabilistic collage tree (PCtree) and design a differentiable tree generator to construct the PCtree. Secondly, we approximate the gradient of the criterion loss to optimize our generator via back propagation. These two steps tackle the differentiation problem of the two stages respectively. In the following parts, we first present the PCtree, our tree generator and the tree generation algorithm in Sec. 3.1. Afterwards we introduce the model architecture of our neural generator in Sec. 3.2. Finally we present our gradient-based optimization paradigm in Sec. 3.3.

3.1. Probabilistic Collage Tree Generation

Probabilistic collage tree. The standard collage tree represents collage layout using discrete structural parameters including edge connection and node type [3], while the proposed probabilistic collage tree (PCtree) softens the parameters via modeling the node type distribution (the cut type of the node is designated as horizontal (“H”) or vertical (“V”)) as $p_n$ and the edge connection distribution (the first child node in the child list is designated as the left (“L”) or right (“R”) child node) as $p_e$, as shown in Fig. 5. The nodes in the PCtree and the standard collage tree are in one-to-one correspondence. Thus, given an interior node $\tilde{n}$ in a PCtree with child nodes $\tilde{n}_i$ and $\tilde{n}_j$, and the nodes $n$, $n_i$ and $n_j$ (corresponding to $\tilde{n}$, $\tilde{n}_i$ and $\tilde{n}_j$ respectively) in a standard collage tree, we define $p_n, p_e \in \mathbb{R}^2$ as

$$p_n^{(0)}(\tilde{n}) = p\big(c_n = \text{“H”} \mid \tau_\theta(\tilde{n}_i), \tau_\theta(\tilde{n}_j)\big) \tag{1}$$

$$p_n^{(1)}(\tilde{n}) = p\big(c_n = \text{“V”} \mid \tau_\theta(\tilde{n}_i), \tau_\theta(\tilde{n}_j)\big) \tag{2}$$

$$p_e^{(0)}(\tilde{n}_i, \tilde{n}_j) = p\big(l_n = n_i, r_n = n_j \mid \tau_\theta(\tilde{n}_i), \tau_\theta(\tilde{n}_j)\big) \tag{3}$$

$$p_e^{(1)}(\tilde{n}_i, \tilde{n}_j) = p\big(l_n = n_j, r_n = n_i \mid \tau_\theta(\tilde{n}_i), \tau_\theta(\tilde{n}_j)\big) \tag{4}$$

where $p_n^{(i)}$ and $p_e^{(i)}$ denote the $i$-th ($i \in \{0, 1\}$) component of $p_n$ and $p_e$ respectively, $c_n$ is the cut type of $n$, $l_n$ is the left child node of $n$, $r_n$ is the right child node of $n$, and $\tau_\theta(x)$ denotes the subtree of PCtree $\tau_\theta$ rooted at node $x$. Through softening the parameters, we build a probability space for the collage tree, and the likelihood of a standard collage tree $\tau$ given the PCtree $\tau_\theta$ can be calculated as

$$p(\tau \mid \tau_\theta) = \prod_{n \in N(\tau)} p_n^{(\mathbb{1}\{c_n = \text{“V”}\})}(\tilde{n}) \times p_e^{(0)}(\tilde{l}_n, \tilde{r}_n) \tag{5}$$

where $N(\tau)$ is the interior node set of $\tau$; $\tilde{n}$, $\tilde{l}_n$ and $\tilde{r}_n$ denote the nodes in the PCtree corresponding to $n$, $l_n$ and $r_n$ respectively; and $\mathbb{1}\{\cdot\}$ is the indicator function (the value is 1 when the condition is true, otherwise it is 0).

Generator components. To generate the PCtree, we design four learnable components, i.e. feature extractor, fusion module, edge classifier and node classifier, as shown in Fig. 4. The feature extractor extracts image semantic features to learn correlation among images and embeds aspect ratio and canvas information to learn layout adjustment. The fusion module fuses the features of child nodes to yield the parent node feature for the bottom-up tree construction. The edge classifier determines the edge connection distribution between child nodes and parent node. The node classifier predicts the cut type distribution of interior nodes.

Tree construction algorithm. To preserve correlation among images, we adopt a nearest neighbor policy (NNP) to conduct the tree construction in a greedy manner. Given a list of features, our NNP finds the pair of features with the closest Euclidean distance. The tree construction process is described in Algo. 1, where $f_n$ denotes the feature of node $n$. The time complexity of this algorithm is $O(N^2 \log N)$ with the use of a priority queue and a hash table, where $N$ is the size of the image collection.

Algorithm 1: Tree construction process
Input: w, h, {I_i}
  {f_i} ← {FeatureExtractor(I_i)};
  N ← size({f_i});
  repeat
    f_nx, f_ny ← NNP({f_i});
    f_nz ← FusionModule(f_nx, f_ny);
    p_e(n_x, n_y) ← EdgeClassifier(f_nx, f_ny);
    p_n(n_z) ← NodeClassifier(f_nz);
    Remove f_nx, f_ny from {f_i} and add f_nz into {f_i};
    N ← N − 1;
  until N = 1;

3.2. Model Architecture

In this section, we elaborate on the network architecture of our four generator components.

Feature extractor. This component is composed of two-path feature extractors, as shown in Fig. 4. One path employs a pre-trained backbone network to extract the content feature $f_{bb}^{(i)}(\theta_{bb})$ from each image $I_i$, and the network parameter $\theta_{bb}$ is fine-tuned during training. Another path introduces information embeddings $e_{d_w}$, $e_{d_h}$, $e_{d_{ar}}$ to inject canvas size and image aspect ratio signals, and these signals are fused via a fully connected layer and the ReLU activation function [11] as

$$f_{inf}^{(i)} = \mathrm{ReLU}\big(W_1 [w \cdot e_{d_w},\; h \cdot e_{d_h},\; ar_i \cdot e_{d_{ar}}]^T + b_1\big) \tag{6}$$

Here, $ar_i$ is the aspect ratio of image $I_i$, and we denote the dimensions of $f_{inf}^{(i)}$ and $f_{bb}^{(i)}(\theta_{bb})$ as $d_{inf}$ and $d_{bb}$ respectively. The elements in the embedding row vectors $e_{d_w}$, $e_{d_h}$, $e_{d_{ar}}$ are all initialized to one and they are fine-tuned during training. $W_1$ and $b_1$ are also learnable parameters. $d_w$, $d_h$, $d_{ar}$, $d_{bb}$ and $d_{inf}$ are hyperparameters.

Because the signals from these two paths are independent, the leaf node feature $f_{n_i}$ of image $I_i$ is obtained via concatenating these two feature vectors:

$$f_{n_i} = \mathrm{concat}\big(f_{bb}^{(i)}(\theta_{bb}),\; f_{inf}^{(i)}\big) \tag{7}$$

Fusion module. This module should obtain the parent feature node via symmetry-invariant transforms of the two given child nodes, i.e. $f_{fus}(f_{n_i}, f_{n_j}) = f_{fus}(f_{n_j}, f_{n_i})$, where $f_{fus}$ denotes the fusion module. Our idea is to use the self-attentive weighted sum of the two child features to satisfy symmetry invariance. To obtain the weight vectors, we utilize the self-attentive embedding technique [17] to design Eq. (10), which injects an additive operation into the aspect ratio information fusion process. Moreover, we utilize the self-attention mechanism [32] to pre-process the input features for injecting a multiplicative signal (Eq. (9)). Benefiting from the two-stage transformation, the fusion module is able to memorize a variety of subtree structure schemes, which boosts the learning ability of the model.

$$f_{(i,j)} = [f_{n_i}, f_{n_j}]^T \tag{8}$$

$$f'_{(i,j)} = \mathrm{Attention}\big(f_{(i,j)} W_Q,\; f_{(i,j)} W_K,\; f_{(i,j)} W_V\big) \tag{9}$$

$$A = \mathrm{softmax}\big(W_{s2} \tanh\big(W_{s1} f'_{(i,j)}\big)\big) \tag{10}$$

$$f_{n_p} = W_{s3}\, \mathrm{flatten}\big(A f'_{(i,j)}\big) + b_2 \tag{11}$$

Here, $W_Q \in \mathbb{R}^{d \times d_Q}$, $W_K \in \mathbb{R}^{d \times d_K}$, $W_V \in \mathbb{R}^{d \times d_V}$, $W_{s1} \in \mathbb{R}^{d_1 \times d_V}$, $W_{s2} \in \mathbb{R}^{d_2 \times d_1}$, $W_{s3} \in \mathbb{R}^{d \times d_2 d_V}$ and $b_2$ are all learnable parameters, where $d$ is the dimension of the node feature. $d_Q$, $d_K$, $d_V$, $d_1$ and $d_2$ are all hyperparameters. Eq. (9) is the scaled dot-product attention parameterized by $d_K$ [32].

Node classifier. A fully connected layer is utilized to model this component as

$$p_n(n) = \mathrm{softmax}(W_2 f_n + b_3) \tag{12}$$

where $W_2$ and $b_3$ are learnable parameters.

Edge classifier. Different from $p_n$, the binary function $p_e$ has the property that $p_e^{(0)}(n_i, n_j) + p_e^{(0)}(n_j, n_i) = p_e^{(0)}(n_i, n_j) + p_e^{(1)}(n_i, n_j) = 1$, as shown in Fig. 4. Thus, a siamese network architecture [5] is employed to model this component as

$$f''_{(i,j)} = W_3\, \mathrm{concat}(f_{n_i}, f_{n_j}) + b_4 \tag{13}$$

$$f''_{(j,i)} = W_3\, \mathrm{concat}(f_{n_j}, f_{n_i}) + b_4 \tag{14}$$

$$p_e(n_i, n_j) = \mathrm{softmax}\big(f''_{(i,j)},\; f''_{(j,i)}\big) \tag{15}$$

where $W_3$ and $b_4$ are learnable parameters.

3.3. Gradient-Based Optimization Paradigm

Through building the probability space of the collage tree, the tree-based collage generation problem formulation can be modified as solving $\theta$ subject to $G_{\theta^*} = \arg\max_\theta \mathbb{E}_{\tau \sim L(\tau;\theta,\pi)}\big[F\big(g(\tau)\big)\big]$, where $\pi$ denotes our NNP, $L(\tau; \theta, \pi) = p\big(\tau \mid w, h, \{I_i\}; \theta, \pi\big) = p(\tau \mid \tau_\theta)$ and $\theta$ is the parameter of the tree generator.

Loss function. We define $\mathbb{E}_{\tau \sim L(\tau;\theta,\pi)}[F(g(\tau))]$ as $F_\theta(\tau; \pi)$ and approximate the gradient as

$$\nabla_\theta F_\theta(\tau; \pi) \approx \nabla_\theta \left( \frac{1}{M} \sum_{i=1}^{M} F\big(g(\tau_i)\big) \log p(\tau_i \mid \tau_\theta) \right) \tag{16}$$

where $M$ is the number of samples $\tau_i$. Therefore, we define the loss function as

$$L(\theta) = -\frac{1}{M} \sum_{i=1}^{M} F\big(g(\tau_i)\big) \log p(\tau_i \mid \tau_\theta) \tag{17}$$

In terms of the mapping function $g$, we initially utilize an efficient mapping algorithm [8] to generate a collage with canvas blank loss $r_b$, i.e. the canvas blank space ratio, and we stretch the overall collage to fit the canvas in post-processing. Our approach avoids canvas blank space by introducing a small aspect ratio distortion. The reason is that canvas blank loss has a significantly worse impact on the user’s visual experience than aspect ratio loss, provided that the magnitudes of both losses are similarly small. Moreover, our mapping function benefits from [8] in preventing image content occlusion.

With respect to criterion $F$, we mainly focus on the ratio preservation criterion because our NNP and mapping function already consider the other three criteria. For this part, we design a reward shaping function $R$ for the canvas blank loss $r_b$ as

$$R(r_b) = \begin{cases} -R_0, & r_3 < r_b \\ \dfrac{R_0 (r_b - r_2)}{r_2 - r_3}, & r_2 < r_b \le r_3 \\ \dfrac{R_0 (\log_{10} r_b - \log_{10} r_2)}{\log_{10} r_1 - \log_{10} r_2}, & r_1 < r_b \le r_2 \\ R_0, & r_b \le r_1 \end{cases} \tag{18}$$

where $R_0$ is the bound of the reward value, and $r_1$, $r_2$ and $r_3$ are specific blank loss values. The shape design of Eq. (18) is based on the observation that the difficulty of decreasing $r_b$ may be linear in the ratio interval from $r_2$ to $r_3$ and may increase exponentially when $r_b$ is below $r_2$. $R_0$, $r_1$, $r_2$ and $r_3$ are hyperparameters. The aesthetics property $F_{aes}$ proposed in [23] is also included in $F$. Moreover, we design the area penalty $F_p$ to prevent the model from shrinking some images too much as

$$F_p(C) = -R_0 \times \mathbb{1}\{\exists I \in C,\ \min(h_I, w_I) \le s_p\} \tag{19}$$

where $I$ is an image in collage $C$ and $s_p$ is a hyperparameter. Therefore, criterion $F$ is defined as

$$F(C) = \lambda_r R\big(r_b(C)\big) + \lambda_a F_{aes}(C) + \lambda_p F_p(C) \tag{20}$$

where $\lambda_r$, $\lambda_a$ and $\lambda_p$ are hyperparameters.

Optimization. Different from the hand-crafted adjustment scheme, our optimization paradigm exploits $\nabla_\theta L(\theta)$ to optimize the collage tree probability distribution $p(\tau \mid \tau_\theta)$ in an end-to-end manner. Algo. 2 shows the optimization paradigm of our model, where $T_m$ is the maximum number of iterations and $\alpha$ is the learning rate. At the inference stage, the optimal collage tree $\tau^*$ is determined with the maximum likelihood method as

$$\tau^* = \arg\max_\tau p(\tau \mid \tau_\theta) \tag{21}$$

The detailed derivations in this section are presented in the supplementary materials.

Algorithm 2: Optimization procedure of our model
Input: w, h, {I_i}
  Initialize θ randomly;
  t ← 0;
  repeat
    Construct probabilistic collage tree τ_θ via θ and π in accordance with Algo. 1;
    Sample {τ_i}^M from p(τ | τ_θ);
    Compute L(θ) via Eq. (17);
    θ ← θ − α × ∇_θ L(θ);
    t ← t + 1;
  until t ≥ T_m;

4. Experiments

Baselines. We select three representative tree-based methods as baselines, where one is the state-of-the-art method [23], which is also the work most related to ours, and the other two are widely used commercial software tools [2, 6].

Metrics. We introduce five quantitative metrics to analyze collage results, which are commonly used in state-of-the-art works. Among them, three metrics, i.e. compactness $M_c$, ratio preservation $M_r$ and nonoverlapping constraint $M_o$, are defined identically to [23]. The other two metrics are described as follows.

• Correlation preservation $M_n$. Gathering correlated images can facilitate informativeness [18, 23, 38, 41]. Although Pan et al. [23] considered this end, their metric is actually both an athlete and referee due to the lack of groundtruth labels. To tackle this problem, we collected an annotated dataset in Sec. 4.1. Thus, we define $M_n = \frac{1}{N} \sum_I \lVert P_I - P_{c_I} \rVert_2$, where $N$ is the collection size, $P_I$ is the position vector of image $I$ and $P_{c_I}$ is the centroid position vector of category label $c_I$ of image $I$. All position coordinates are normalized by $w$ and $h$.
• Saliency loss $M_s$. This metric measures saliency preservation ability. The collage mask is obtained by replacing each image in the collage with the corresponding saliency mask. We define $M_s = 1 - \big|\bigcup_I S_I\big| \big/ \big(\sum_I |S_I|\big)$, where $S_I$ is the saliency mask of image $I$, $\bigcup_I S_I$ is the collage mask and the $|\cdot|$ operator calculates the saliency area of a mask.

Method     Backbone        Mr     Mn     Ms
SHP [6]    -               1.522  0.376  0.239
CLT [2]    -               1.517  0.377  0.232
VSM [23]   -               1.095  0.335  0
Ours       ResNet-50 [14]  1.086  0.284  0

Table 2. Quantitative metric results on the train set of AIC.

Method             Mr      Mn
Ours w/o Backbone  1.107   0.379
Ours w/o Info      1.254   0.252
Ours w/o Fusion    30.721  0.212
Ours w/o SA        1.503   0.278
Ours (full)        1.086   0.284

Table 3. Ablation analysis of our method on the train set of AIC. The first method removes the backbone network and sets $d_{ar} = 1024$, $d_w = d_h = 32$, $d_{inf} = 1024$. The second method removes the information embedding in the extractor. The third method replaces the fusion module with a feature average operation. The fourth method removes the scaled dot-product attention (Eq. (9)). More results are shown in the supplementary materials.

Method  Pre-trained  Identical theme  Size  Mr     Mn
Ours    ✓            ✗                =     1.155  0.273
Ours    ✓            ✓                <     1.311  0.251
Ours    ✓            ✓                >     1.164  0.254
Ours    ✓            ✓                =     1.091  0.256
Ours    ✗            ✓                =     1.083  0.249

Table 4. Generalization study of our method on the test set of AIC. The model is directly trained on the test set when not pre-trained. ‘>’, ‘<’ and ‘=’ represent cases where the model is pre-trained on a collection of larger, smaller and identical size respectively.

4.1. Annotated Image Collection Dataset

Collage results for unlabeled image collections cannot support the calculation of $M_n$ and $M_s$. To encourage research works in this field to compete fairly, we collect an annotated image collection dataset, namely AIC, based on the saliency detection dataset DUTS [34], which is partially collected from ImageNet [7] and has high generalization ability [35].

Firstly we select 3402 images from DUTS to build the image collection sampling source, namely ICSS, which covers 72 categories, with at least 10 images under each category. Subsequently we divide the 72 categories into 11 themes manually (Tab. 1). The aspect ratio of images in ICSS ranges from 0.4625 to 1.9048. Each image in ICSS is labeled with category, theme and saliency mask, so an image collection sampled from ICSS is able to support the calculation of $M_n$ and $M_s$. With the idea of five-fold cross-validation, we divide the images at a ratio of 4:1 in each category into a train set and a test set, and both sets have a near-identical distribution.

Theme           Percentage (%)
Animals         3.85
Food            11.73
Fruits          23.22
Transportation  12.76
Sports          4.94
Office          6.44
Baby            4.29
Clothes         18.34
Houseware       9.38
Instrument      1.79
Makeup          3.26

Table 1. The percentage of image number under each theme of ICSS.

Finally, we develop an image collection sampling framework to generate AIC from ICSS. This framework requires that each image in one collection is sampled from one identical theme of ICSS. Moreover, each collection should include images from at least two categories and each category in a collection should have at least two images in order to acquire an effective $M_n$ value. Additionally, the category distribution of each collection conforms to a uniform distribution and is not biased by the prior category distribution in ICSS. This framework samples the train set and test set of AIC respectively from the train set and test set of ICSS. As a result, AIC includes image collections with sizes of 10, 15, 20, 25, 30, 50 and 100. The train set has 562 image collections

5-scale    Excellent (4)  Good (3)  Borderline (2)  Poor (1)  Bad (0)  Score  Kappa
SHP [6]    17.5%          50.8%     27.5%           4.2%      0.0%     2.816  0.82
CLT [2]    16.7%          51.2%     28.8%           3.3%      0.0%     2.813  0.80
VSM [23]   29.2%          52.9%     15.4%           2.5%      0.0%     3.088  0.76
Ours       34.2%          51.7%     12.0%           2.1%      0.0%     3.180  0.80

Side-by-side         Wins   Equally Good  Equally Borderline  Equally Poor  Losses  ∆      Kappa
Ours vs. SHP [6]     60.6%  28.1%         11.3%               0.0%          0.0%    60.6%  0.75
Ours vs. CLT [2]     63.1%  26.3%         10.6%               0.0%          0.0%    63.1%  0.71
Ours vs. VSM [23]    26.9%  57.5%         9.4%                0.0%          6.2%    20.7%  0.67

Table 5. 5-scale human evaluation along with side-by-side human evaluation of collage results on the AIC. The score in the 5-scale evaluation is the weighted average. ∆ in the side-by-side evaluation denotes the gap between the win rate and the loss rate.
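As a quick check of how the Score and ∆ columns of Table 5 follow from the percentages, the sketch below recomputes them. It assumes the weights are the scale values shown in the column headers (4 for Excellent down to 0 for Bad), which is how the caption's "weighted average" reads; the function and variable names are illustrative only.

```python
def weighted_score(shares):
    """Weighted average of a 5-scale rating.

    shares: percentages for Excellent (4), Good (3), Borderline (2), Poor (1), Bad (0).
    """
    weights = (4, 3, 2, 1, 0)
    return sum(w * s for w, s in zip(weights, shares)) / 100.0


def win_loss_gap(wins, losses):
    """Gap between the win rate and the loss rate, in percentage points."""
    return wins - losses


# Percentages copied from Table 5.
five_scale = {
    "SHP": (17.5, 50.8, 27.5, 4.2, 0.0),
    "CLT": (16.7, 51.2, 28.8, 3.3, 0.0),
    "VSM": (29.2, 52.9, 15.4, 2.5, 0.0),
    "Ours": (34.2, 51.7, 12.0, 2.1, 0.0),
}
scores = {name: round(weighted_score(s), 3) for name, s in five_scale.items()}
gap_vs_vsm = round(win_loss_gap(26.9, 6.2), 1)
```

Under that assumption the recomputed scores match the reported Score column, and the gap against VSM matches the reported ∆.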

including 18535 images and the test set has 62 image collections including 1260 images. The framework is detailed in the supplementary materials.

4.2. Experiment Settings

Experimental data. We use the train set* of AIC for the baseline comparison experiment and the ablation analysis, and the test set for the generalization study. The user studies are conducted with the collage results on the train set of AIC.

* We learn a specific generator for each image collection in the train set. Thus, our train set is different from the definition of train set in the traditional deep learning context.

Implementation details. We implement the proposed framework using the PyTorch toolbox [24] on one GeForce RTX 3090 GPU. We adopt the ResNet-50 [14] pre-trained on ImageNet [7] as the backbone network in our feature extractor and use the Adam optimizer [15] to train our model for each image collection. The other implementation details are presented in the supplementary materials.

4.3. Quantitative Experiments

Comparison to baseline methods. Compared to the baseline methods, our method achieves similar or better $M_r$, $M_n$ and $M_s$ metric results on the AIC, as shown in Tab. 2. As for $M_c$ and $M_o$, the baseline methods and our model all achieve the optimal zero value due to the advantage of the tree-based structure, and thus they are not included in Tab. 2 for conciseness. Fig. 7 shows some comparison results. More results are presented in the supplementary materials.

Ablation analysis. To show the detailed contributions of the components in our model, we conduct ablation experiments on the AIC (Tab. 3). Only the $M_r$ and $M_n$ metrics are reported because the other metrics do not change in the ablation. It is shown that the backbone network and the information embedding in the feature extractor are effective in preserving correlation and reducing ratio distortion respectively. The results also demonstrate that the fusion module is critical to the learning ability of our model, without which our model can only yield vertical cuts in the collage (see Fig. 6) and thus produces bad results. Moreover, the self-attention mechanism improves our model considerably due to the injection of the multiplicative operation on aspect ratio information.

Figure 6. Replacing the fusion module of our generator with a feature average operation results in collages with only vertical cuts.

Generalization study. Different from the prior work, our generator can learn layout knowledge during optimization and generalize to other collections without training. To study the bottleneck data factors that impact the model generalization ability, we conduct an analysis by controlling the variables of theme and collection size, shown in Tab. 4. The pre-trained collections are randomly selected as long as they satisfy the corresponding conditions. Tab. 4 shows that the size of the pre-trained collection has a more significant impact on model generalization ability than its theme.

Method     Recall  Precision  Accuracy  F1-Score
SHP [6]    0.723   0.625      0.555     0.658
CLT [2]    0.735   0.631      0.564     0.663
VSM [23]   0.808   0.703      0.618     0.745
Ours       0.865   0.771      0.669     0.810

Table 6. The results of the information conveying test. We investigate four indicators, i.e. recall, precision, accuracy and F1-score, to evaluate the information conveying ability of collages.

4.4. User Studies

Besides the quantitative measures, we conducted two user studies to evaluate the effectiveness of our method. We select 16 image collections for this stage, which cover all sizes and themes of the collections in the AIC. Each user study was conducted with a different group of participants via a different questionnaire, and collages in each questionnaire were ordered in a random way to avoid biasing judges.

Human evaluation. Firstly we carried out the 5-scale evaluation. To measure the gain of our method over the baselines, we also conducted the side-by-side evaluation. This comparative task is easier for humans than the 5-scale rating task and thus can produce more reliable results. Additionally, Fleiss’ Kappa score is used to gauge the reliability of the agreement between evaluators. The details of these evaluations are presented in the supplementary materials. The

Figure 7. Comparison of the collage results generated by different methods on the AIC. Columns: SHP [6], CLT [2], VSM [23], Ours w/o F_p, Ours. SHP [6] and CLT [2] both introduce content occlusion (red dotted rectangle) into the images in the collage. Although VSM [23] circumvents this defect, its results still contain images suffering from high aspect ratio distortion (red dotted rectangle), particularly when the image collection size is large. In contrast, our method takes advantage of the probability space to produce results closer to the global optimum. Notably, training our model with a loss function without F_p of Eq. (19) leads to a drastic imbalance in image area assignment in the collages.

results, illustrated in Tab. 5, suggest that our method is substantially superior to all baselines in producing high-quality collages from a human perspective. The high Kappa scores imply a high level of agreement among the evaluators.

Information conveying test. We further validate the effectiveness of our NNP via the information conveying test, following [22, 23]. Twenty subjects participated in the test and were divided equally into four groups, each corresponding to one collage method. For each image collection, we showed participants the corresponding collage for 20 s and then asked them to perform a binary classification test, namely selecting the images that they had seen in the collage from an image set containing five ground-truth images and five negative samples (sharing the same theme as the ground-truth images). Tab. 6 shows the test results. Our collage benefits from the NNP and thus outperforms the baselines. We find that the images selected by participants account for approximately 72%, which implies that participants are inclined to choose more images as remembered, leading to a higher recall than precision.

5. Conclusion

In this paper, we present SoftCollage, a novel tree-based collage method. Our key idea is to soften the discrete tree structure into a probability space. By modeling the conditional probability distribution of the collage tree via the proposed tree generator, we can formulate collage generation as a differentiable process and optimize the layout with the gradient of the criterion loss instead of a hand-crafted adjustment scheme. We demonstrate the effectiveness of our method via extensive experiments on the proposed large-scale dataset AIC. Currently, the GPU memory consumption of our model is high when the image collection is large. Because our method is extensible in its model architecture design, in the future we will explore lightweight designs and knowledge distillation for our model.
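The Fleiss' Kappa score mentioned in the human evaluation of Sec. 4.4 can be illustrated with a short sketch. The paper does not give its computation details (they are deferred to the supplementary materials), so the function below is only the textbook definition of Fleiss' Kappa, and the name `fleiss_kappa` and the rating table are hypothetical.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Textbook Fleiss' Kappa.

    counts[i][j] is the number of raters who put item i into category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        # All ratings fall into one category: kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical 5-scale ratings: 4 collages, 6 raters each.
ratings = [
    [0, 0, 1, 2, 3],
    [0, 1, 4, 1, 0],
    [3, 3, 0, 0, 0],
    [0, 0, 0, 1, 5],
]
print(f"Fleiss' Kappa: {fleiss_kappa(ratings):.3f}")
```

A value of 1 means perfect agreement, and values near 0 mean agreement no better than chance.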

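The recall and precision argument in the information conveying test of Sec. 4.4 can be made concrete with a small sketch. It assumes that the reported 72% is measured against the ten-image candidate set (five ground-truth images and five negatives), which the text does not state explicitly. The image identifiers and the single trial below are hypothetical, not data from the paper.

```python
from typing import Set, Tuple


def precision_recall(selected: Set[str], ground_truth: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a participant's selection in one trial."""
    true_positives = len(selected & ground_truth)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical trial: g0..g4 appeared in the collage, n0..n4 did not.
ground_truth = {f"g{i}" for i in range(5)}
# The participant marks 7 of the 10 candidates (70%, close to the reported 72%).
selected = {"g0", "g1", "g2", "g3", "n0", "n1", "n2"}

precision, recall = precision_recall(selected, ground_truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Both measures share the same numerator (correctly selected images), so whenever a participant selects more images than there are ground-truth images and gets at least one right, recall is necessarily higher than precision.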
References

[1] Similarity preserving snippet-based visualization of web search results. IEEE transactions on visualization and computer graphics, 20(3):457–470, 2014. 2
[2] Collageit. online, 2019. https://www.collageitfree.com/. 6, 7, 8
[3] C Brian Atkins. Blocked recursive image composition. In Proceedings of the 16th ACM international conference on Multimedia, pages 821–824, 2008. 1, 2, 3, 4
[4] Simone Bianco and Gianluigi Ciocca. User preferences modeling and learning for pleasing photo collage generation. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 12(1):1–23, 2015. 1, 2
[5] Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993. 5
[6] V. Cheung. Shape collage. online, 2013. http://www.shapecollage.com/. 6, 7, 8
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009. 6, 7
[8] Jian Fan. Photo layout with a fast evaluation method and genetic algorithm. In 2012 IEEE International Conference on Multimedia and Expo Workshops, pages 308–313. IEEE, 2012. 1, 2, 3, 5
[9] Yuan Gan, Yan Zhang, Zhengxing Sun, and Hao Zhang. Qualitative photo collage by quartet analysis and active learning. Computers & Graphics, 88:35–44, 2020. 2
[10] J. Geigel, A. Loui, and E. Loui. Automatic page layout using genetic algorithms for electronic albuming. Proceedings of SPIE - The International Society for Optical Engineering, pages 79–90, 2001. 2
[11] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. Journal of Machine Learning Research, 15:315–323, 2011. 4
[12] Stas Goferman, Ayellet Tal, and Lihi Zelnik-Manor. Puzzle-like collage. In Computer graphics forum, volume 29, pages 459–468. Wiley Online Library, 2010. 1, 2
[13] E. Gomez-Nieto, W. Casaca, D. Motta, I. Hartmann, G. Taubin, and L. G. Nonato. Dealing with multiple requirements in geometric arrangements. IEEE Transactions on Visualization & Computer Graphics, 22(3):1223–1235, 2016. 2
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 6, 7
[15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. 7
[16] Yuan Liang, Xiting Wang, Song-Hai Zhang, Shi-Min Hu, and Shixia Liu. Photorecomposer: Interactive photo recomposition by cropping. IEEE transactions on visualization and computer graphics, 24(10):2728–2742, 2017. 1, 2
[17] Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. 4
[18] Lingjie Liu, Hongjie Zhang, Guangmei Jing, Yanwen Guo, Zhonggui Chen, and Wenping Wang. Correlation-preserving photo collage. IEEE transactions on visualization and computer graphics, 24(6):1956–1968, 2017. 1, 2, 3, 6
[19] Tie Liu, Jingdong Wang, Jian Sun, Nanning Zheng, Xiaoou Tang, and Heung-Yeung Shum. Picture collage. IEEE Transactions on Multimedia, 11(7):1225–1239, 2009. 1, 2
[20] G. P. Nguyen and M. Worring. Interactive access to large image collections using similarity-based visualization. Journal of Visual Languages & Computing, 19(2):203–224, 2008. 2
[21] E. G. Nieto, W. Casaca, L. G. Nonato, and G. Taubin. Mixed integer optimization for layout arrangement. In Graphics, Patterns & Images, 2013. 2
[22] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision, 42(3):145–175, 2001. 8
[23] Xingjia Pan, Fan Tang, Weiming Dong, Chongyang Ma, Yiping Meng, Feiyue Huang, Tong-Yee Lee, and Changsheng Xu. Content-based visual summarization for image collections. IEEE transactions on visualization and computer graphics, 2019. 1, 2, 3, 5, 6, 7, 8
[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32:8026–8037, 2019. 7
[25] Carsten Rother, Lucas Bordeaux, Youssef Hamadi, and Andrew Blake. Autocollage. ACM transactions on graphics (TOG), 25(3):847–852, 2006. 1, 2
[26] Carsten Rother, Sanjiv Kumar, Vladimir Kolmogorov, and Andrew Blake. Digital tapestry [automatic image synthesis]. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 589–596. IEEE, 2005. 1, 2
[27] M. Shuang and W. C. Chang. Automatic creation of magazine-page-like social media visual summary for mobile browsing. In 2016 IEEE International Conference on Image Processing (ICIP), 2016. 2
[28] Yu Song, Fan Tang, Weiming Dong, Feiyue Huang, Tong-Yee Lee, and Changsheng Xu. Balance-aware grid collage for small image collections. IEEE Transactions on Visualization and Computer Graphics, 2021. 2
[29] Hendrik Strobelt, Marc Spicker, Andreas Stoffel, Daniel Keim, and Oliver Deussen. Rolled-out wordles: A heuristic method for overlap removal of 2d data representatives. In Computer Graphics Forum, volume 31, pages 1135–1144. Wiley Online Library, 2012. 2
[30] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000. 2
[31] Li Tan, Yangqiu Song, Shixia Liu, and Lexing Xie. Imagehive: Interactive content-aware image summarization. IEEE computer graphics and applications, 32(1):46–55, 2011. 2
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017. 4, 5
[33] Jingdong Wang, Long Quan, Jian Sun, Xiaoou Tang, and Heung-Yeung Shum. Picture collage. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 1, pages 347–354. IEEE, 2006. 1, 2
[34] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 136–145, 2017. 6
[35] Wenguan Wang, Qiuxia Lai, Huazhu Fu, Jianbing Shen, Haibin Ling, and Ruigang Yang. Salient object detection in the deep learning era: An in-depth survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 6
[36] Yichen Wei, Yasuyuki Matsushita, and Yingzhen Yang. Efficient optimization of photo collage. Microsoft Research, Redmond, WA, USA, MSRTR-2009-59, 2009. 1, 2
[37] Zhipeng Wu and Kiyoharu Aizawa. Picwall: Photo collage on-the-fly. In 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pages 1–10. IEEE, 2013. 1, 2
[38] Zhipeng Wu and Kiyoharu Aizawa. Very fast generation of content-preserved photo collage under canvas size constraint. Multimedia Tools and Applications, 75(4):1813–1841, 2016. 1, 2, 3, 6
[39] Xintong Han, Chongyang Zhang, Weiyao Lin, Mingliang Xu, and Bin Sheng. Tree-based visualization and optimization for image collection. IEEE transactions on cybernetics, 46(6):1286–1300, 2016. 2
[40] Yingzhen Yang, Yichen Wei, Chunxiao Liu, Qunsheng Peng, and Yasuyuki Matsushita. An improved belief propagation method for dynamic collage. The Visual Computer, 25(5):431–439, 2009. 1, 2
[41] Zongqiao Yu, Lin Lu, Yanwen Guo, Rongfei Fan, Mingming Liu, and Wenping Wang. Content-aware photo collage using circle packing. IEEE transactions on visualization and computer graphics, 20(2):182–195, 2013. 1, 2, 3, 6
