performed worse than traditional machine learning methods, owing to the lack of training
samples in the early days. The PASCAL Visual Object Classes (VOC) Challenge [12] used real
object area annotations to provide a standard annotated image dataset and a standard evaluation
system for detection algorithms and learning performance. The ImageNet dataset [13] has been
widely applied in deep learning for images, including image classification, localization, and
detection, and the error rate on its visual task was lower than that of human vision in ILSVRC2017.
However, scene recognition remains challenging, and the Places Challenge has only just started.
Ariadna Quattoni proposed the Indoor67 dataset in [14] to evaluate work on indoor scene
recognition. SUN, a wide-ranging scene understanding dataset that defines the concept of a
scene, was proposed in [15]. Bolei Zhou proposed the Places dataset [16], which became the
largest scene dataset in the world. In addition, Bolei Zhou released ADE20K, a densely annotated
dataset that provides a benchmark platform for scene analysis [17]. In Section 2 we compare
scene datasets with object datasets; we validate our method on Places, the largest scene
dataset, and test it on the indoor scene dataset Indoor67.
1.2 Scene Recognition Method
Besides deep learning, hand-crafted features were a popular approach to image processing
tasks and have also been applied to scene recognition. Bag of words [18] is the most commonly
used method in image research, and spatial pyramid matching [19] was proposed to incorporate
spatial layout into the bag-of-words representation for scene recognition. Gist [20] is a
well-known scene recognition feature that captures spatial layout and is highly efficient in
scene recognition; other feature representations are described in [21-22].
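
To make the spatial pyramid idea concrete, the following is a minimal sketch of how a bag-of-words histogram can be extended with spatial layout. It assumes keypoints have already been quantized into visual words and that coordinates are normalized to [0, 1]; the function name, argument names, and the default of three pyramid levels are illustrative assumptions, not details taken from [19] or from this paper.

```python
def spatial_pyramid_histogram(points, words, vocab_size, levels=2):
    """Build a spatial-pyramid bag-of-words vector (illustrative sketch).

    points: list of (x, y) keypoint positions, normalized to [0, 1].
    words: visual-word index (0 <= w < vocab_size) for each keypoint.
    levels: finest pyramid level L; level l splits the image into 2^l x 2^l cells.
    """
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Pyramid-match weights: 1/2^L for level 0, 1/2^(L-l+1) for level l >= 1,
        # so finer levels (stricter spatial agreement) count more.
        if level == 0:
            weight = 1.0 / 2 ** levels
        else:
            weight = 1.0 / 2 ** (levels - level + 1)
        # One visual-word histogram per grid cell at this level.
        hist = [[0.0] * vocab_size for _ in range(cells * cells)]
        for (x, y), w in zip(points, words):
            col = min(int(x * cells), cells - 1)  # clamp x == 1.0 into the last cell
            row = min(int(y * cells), cells - 1)
            hist[row * cells + col][w] += weight
        # Concatenate the cell histograms of this level onto the feature vector.
        for cell in hist:
            feature.extend(cell)
    return feature
```

With a vocabulary of size V and finest level L, the resulting vector has V(4^(L+1) - 1)/3 entries; level 0 alone reduces to the plain bag-of-words histogram, which discards all layout information.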
Since AlexNet won ILSVRC2012, more and more research has focused on using CNNs for image
processing tasks, including scene recognition. Bolei Zhou proposed a new scene-centric
dataset, Places [16], to eliminate dataset bias, and demonstrated in [23] the object detection
ability of CNNs in the scene recognition task. Wang et al. proposed multi-resolution CNNs for
scene recognition [24]. Luis Herranz et al. studied how CNNs can effectively combine
scene-centric and object-centric knowledge [25]. Unlike previous studies, our proposed method
captures feature information at different scales in a scene and reduces dataset bias.
Moreover, we not only extract features at different scales in the feature extraction stage,
but also configure the optimal feature combination for each scene category in the classifier.
2 Object and Scene
Deep learning has achieved excellent results in object classification, and scene
recognition is in some respects similar to object classification, so we seek a method to
improve scene recognition. In this section, we first explore the differences between the
datasets used for the two tasks, then describe the impact of objects in an image on scene
recognition, and finally propose an improvement scheme.
2.1 Data Difference
Training a CNN requires massive amounts of data, and understanding the differences between
the datasets used for scene recognition and for object classification helps explain why
performance differs between the two tasks. Datasets commonly used for object classification
include Pascal VOC and ImageNet, while scene recognition datasets are represented by MIT
Indoor67 and Places. We found that the main difference between these datasets lies in the
distribution of objects, which is characterized by the number of objects and the scale of objects.
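
As an illustration of how these two quantities might be measured, the sketch below summarizes the number of objects per image and the relative object scale from bounding-box annotations. The annotation layout (the keys "width", "height", and "boxes") is hypothetical, and defining scale as the ratio of box area to image area is an assumption made for this example; the text does not specify the exact measure used.

```python
from statistics import mean


def object_distribution(annotations):
    """Summarize object count and relative scale over annotated images.

    annotations: list of dicts with hypothetical keys
        "width", "height": image size in pixels
        "boxes": list of (xmin, ymin, xmax, ymax) object bounding boxes
    """
    counts = []  # number of annotated objects in each image
    scales = []  # box area divided by image area, one entry per object
    for ann in annotations:
        image_area = ann["width"] * ann["height"]
        counts.append(len(ann["boxes"]))
        for xmin, ymin, xmax, ymax in ann["boxes"]:
            # Degenerate boxes contribute zero area instead of a negative one.
            box_area = max(0, xmax - xmin) * max(0, ymax - ymin)
            scales.append(box_area / image_area)
    return {
        "mean_objects_per_image": mean(counts) if counts else 0.0,
        "mean_relative_scale": mean(scales) if scales else 0.0,
    }
```

Running the same function over an object-centric dataset and a scene-centric dataset would give two comparable summaries; under the assumption above, many small objects per image would show up as a high mean count together with a low mean relative scale.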