Image Cropping with Composition and Saliency Aware Aesthetic Score Map
Yi Tu,1 Li Niu,1∗ Weijie Zhao,2 Dawei Cheng,1 Liqing Zhang1∗
1 MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering
Shanghai Jiao Tong University, Shanghai, China
{tuyi1991, ustcnewly, dawei.cheng}@sjtu.edu.cn, zhang-lq@cs.sjtu.edu.cn
2 Versa Inc, Shanghai, China
weijie.zhao@versa-ai.com
Abstract
Aesthetic image cropping is a practical but challenging task
which aims at finding the best crops with the highest aesthetic
quality in an image. Recently, many deep learning methods
have been proposed to address this problem, but they did not
reveal the intrinsic mechanism of aesthetic evaluation. In this
paper, we propose an interpretable image cropping model to
unveil the mystery. For each image, we use a fully convo-
lutional network to produce an aesthetic score map, which
is shared among all candidate crops during crop-level aes-
thetic evaluation. Then, we require the aesthetic score map
to be both composition-aware and saliency-aware. In particular,
the same region is assigned different aesthetic scores
depending on its relative position in each crop. Moreover,
a visually salient region is expected to have more sensitive
aesthetic scores so that our network can learn to place
salient objects at more proper positions. Such an aesthetic
score map can be used to localize aesthetically important re-
gions in an image, which sheds light on the composition rules
learned by our model. We show the competitive performance
of our model in the image cropping task on several bench-
mark datasets, and also demonstrate its generality in real-
world applications.
1 Introduction
Given an image, the image cropping task aims at finding the
crops with the best aesthetic quality. It is an important task
that can be widely used in many downstream applications,
e.g., photo post-processing (Chen et al. 2017b), view rec-
ommendation (Li et al. 2018; Wei et al. 2018), and image
thumbnailing (Esmaeili, Singh, and Davis 2017). In order to
find the best crop, an image cropping model will first gen-
erate a large number of candidate crops and then determine
the best crop based on crop-level aesthetic evaluation. So an
image cropping model is usually composed of two stages,
candidate generation and aesthetic evaluation. A good image
crop is achieved by selecting important contents and placing
them with a good composition. The required knowledge for
such a task can be categorized into two parts, i.e., content
preference and composition preference. Therefore, a good
image cropping model should be able to learn and leverage
such preferences when searching for the best crop.
∗ Corresponding author.
Copyright © 2020, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Figure 1: Image crops based on composition rules. The orange
box in each image denotes a good crop found using
human-defined composition rules. The white dotted lines denote
the auxiliary lines used in these composition rules.
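The two-stage structure described above (candidate generation followed by crop-level aesthetic evaluation) can be sketched in a few lines. This is only an illustrative outline under stated assumptions: the sliding-window grid, the integer percentage scales, the `(x1, y1, x2, y2)` crop convention, and the names `generate_candidates`, `best_crop`, and `score_fn` are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical crop convention: (x1, y1, x2, y2) in pixels, x2/y2 exclusive.
Crop = Tuple[int, int, int, int]


def generate_candidates(width: int, height: int,
                        scales_pct=(50, 70, 90), steps: int = 5) -> List[Crop]:
    """Stage 1: enumerate sliding-window crops at a few scales.

    The grid is an assumption for illustration; `steps` must be >= 2.
    """
    crops = []
    for pct in scales_pct:
        w, h = width * pct // 100, height * pct // 100
        for i in range(steps):
            for j in range(steps):
                x1 = (width - w) * i // (steps - 1)
                y1 = (height - h) * j // (steps - 1)
                crops.append((x1, y1, x1 + w, y1 + h))
    return crops


def best_crop(width: int, height: int,
              score_fn: Callable[[Crop], float]) -> Crop:
    """Stage 2: score every candidate and keep the highest-scoring one."""
    return max(generate_candidates(width, height), key=score_fn)
```

Here `score_fn` stands in for whatever aesthetic evaluator is available; a learned model would replace it, while the surrounding search loop stays the same.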
Early methods achieve this goal by explicitly utilizing
photography knowledge such as human-defined composition
rules, e.g., Rule of Thirds and Rule of Central (see
Figure 1). With the development of deep learning, recent
work learns image cropping in a data-driven manner,
and many aesthetic datasets have been constructed to encode the
aesthetic preferences of humans. Recent methods (Wang and
Shen 2017; Chen et al. 2017b; Wei et al. 2018; Lu et al.
2019c) treat it as an object detection task. They use aesthetic
datasets to train an aesthetic evaluation model and apply
it to compare candidate crops. Owing to the power of
deep learning, these methods have advanced the
field, but the intrinsic mechanism remains unrevealed.
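As a toy illustration of what a hand-crafted composition rule might look like, the sketch below scores a crop by how close a subject point lies to one of the four Rule of Thirds intersection points. It is not taken from any of the cited methods: the function name `rule_of_thirds_score`, the single-point subject, and the linear normalization are all assumptions made for this example.

```python
import math


def rule_of_thirds_score(crop, subject_xy):
    """Toy score in [0, 1]: 1 when the subject sits on a thirds intersection.

    crop is (x1, y1, x2, y2); subject_xy is the (x, y) subject center.
    Both conventions are assumptions for this illustration.
    """
    x1, y1, x2, y2 = crop
    # Subject position in crop-relative coordinates, each in [0, 1] if inside.
    u = (subject_xy[0] - x1) / (x2 - x1)
    v = (subject_xy[1] - y1) / (y2 - y1)
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        return 0.0  # subject falls outside the crop
    # Distance to the nearest of the four intersection points.
    d = min(math.hypot(u - px, v - py)
            for px in (1 / 3, 2 / 3) for py in (1 / 3, 2 / 3))
    # The farthest any inside point can be is a crop corner: sqrt(2) / 3.
    return 1.0 - d / (math.sqrt(2) / 3)
```

A rule like this is fully interpretable but encodes only one fixed preference, which is part of why data-driven approaches replaced such heuristics.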
In this paper, we propose an interpretable image cropping
model to produce both composition-aware and saliency-
aware Aesthetic Score Maps, called ASM-Net. Our ap-
proach was first inspired by the Class-Activation-Map
(CAM) method (Zhou et al. 2016), which uses a class activa-
tion map to localize the most discriminative image regions
in image classification tasks. Similarly, we expect to use an
aesthetic score map to localize aesthetically important image
regions. The aesthetic score of a region can be obtained via
average pooling and the regions with larger aesthetic scores
are of higher aesthetic quality. However, direct application
of CAM has been proven ineffective because the aesthetic
evaluation task is more complicated than classification and
one region cannot be simply represented by a single score.
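To make the naive CAM-style idea concrete, the sketch below scores a region by average pooling a single score map, using a summed-area table so that every candidate crop is evaluated in constant time. This is the simple baseline that the paragraph above describes as insufficient, not the proposed ASM-Net; the helper names `integral_image` and `crop_score` and the `(x1, y1, x2, y2)` convention are assumptions for illustration.

```python
import numpy as np


def integral_image(score_map: np.ndarray) -> np.ndarray:
    """Summed-area table of a 2D map, with a zero row and column prepended."""
    padded = np.pad(score_map, ((1, 0), (1, 0)), mode="constant")
    return padded.cumsum(axis=0).cumsum(axis=1)


def crop_score(integral: np.ndarray, x1: int, y1: int, x2: int, y2: int) -> float:
    """Average of the score map over rows y1:y2 and columns x1:x2."""
    total = (integral[y2, x2] - integral[y1, x2]
             - integral[y2, x1] + integral[y1, x1])
    return float(total) / ((x2 - x1) * (y2 - y1))
```

Note that with a single map each pixel contributes the same value to every crop that contains it, regardless of where it falls inside that crop, which illustrates why one score per region cannot express composition.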
arXiv:1911.10492v1 [cs.CV] 24 Nov 2019