Fig. 2. Comparison of objects detected by (a) YOLOv3 and (b) our proposed model.
Faster Region-based Convolutional Network (Faster R-CNN)
[20], Mask R-CNN [21] and You Only Look Once
(YOLO) [22], [23]. Trained on datasets in which all objects are labeled, these models can detect every object appearing in a traffic scene in real time, including cars, cyclists, traffic signs/lights, roads, pedestrians, and sky. Some image segmentation methods have also been deployed in commercial intelligent driving vehicles to detect and identify all objects and areas appearing in the driving environment.
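For concreteness, the following minimal sketch (ours, not from the cited works) runs an off-the-shelf pretrained detector on a single driving frame. It assumes a recent torchvision installation; the file name and score threshold are illustrative placeholders.

```python
import torch
import torchvision
from PIL import Image
from torchvision.transforms.functional import to_tensor

# Load an off-the-shelf Faster R-CNN pretrained on MS COCO (80 common classes).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# "traffic_frame.jpg" is a placeholder path for one driving-scene image.
frame = to_tensor(Image.open("traffic_frame.jpg").convert("RGB"))

with torch.no_grad():
    out = model([frame])[0]  # dict with "boxes", "labels", "scores"

# A general-purpose detector keeps everything above a confidence threshold,
# whether or not the object matters for the current driving task.
keep = out["scores"] > 0.5  # illustrative threshold
boxes, labels = out["boxes"][keep], out["labels"][keep]
```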
However, not all objects in driving scenes are critical and necessary for driving safety. As shown in Fig. 2(a), YOLOv3 detected all of the static cars parked on the side of the road, the pedestrians walking on the sidewalk, and some unrelated objects. We argue that these static or unrelated objects may be redundant information for driving safety, and an assisted or intelligent driving system may be distracted when too many redundant objects are present. For example, both the static cars parked along the sidewalk in the second/third rows and the moving cars in the opposite lanes in the third/fourth rows of Fig. 2(a) are irrelevant to the current driving task; they are redundant and distracting information for driving decision-making. By comparison, detecting the critical objects that are closely related to the current driving situation is more valuable, as our proposed model does in Fig. 2(b).
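The proposed model learns which objects are critical; purely to illustrate the gap it addresses, the hypothetical heuristic below filters a detector's boxes to those intersecting a central "ego corridor". The function name, corridor width, and image size are all assumptions for illustration, not part of any cited method.

```python
def in_ego_corridor(box, img_w, img_h, corridor_frac=0.4):
    """Hypothetical relevance test: keep a detection only if its box
    intersects a central corridor covering the lane ahead and lies in
    the lower (near-road) half of the frame. box = (x1, y1, x2, y2)."""
    left = img_w * (1 - corridor_frac) / 2
    right = img_w * (1 + corridor_frac) / 2
    x1, y1, x2, y2 = box
    return x2 > left and x1 < right and y2 > img_h / 2

# Reusing `boxes` from the previous sketch, with an assumed 1920x1080 frame.
critical = [b for b in boxes.tolist() if in_ego_corridor(b, 1920, 1080)]
```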
Furthermore, other state-of-the-art detection works have been proposed to detect salient objects in natural images. These works output binary saliency maps rather than bounding boxes, so they can be considered a form of image segmentation. Guo et al. [24] proposed a method to detect salient object regions in video via object proposals. A deep learning model was proposed to efficiently detect salient regions in video [25]. Wang et al. [26] presented a video salient object detection model based on geodesic distance and applied it to unsupervised video segmentation. In their follow-up work, the authors introduced an attentive saliency network (ASNet) [27] that learned to detect salient objects from fixations. Song et al. [28] proposed a fast video salient object detection model built on a novel recurrent network, the pyramid dilated bidirectional ConvLSTM (PDB-ConvLSTM). Guo et al. [29] proposed a computationally efficient method for spatiotemporally consistent salient object detection in videos. Hu et al. [30] explored ways to use visual attention (saliency) for object detection and tracking, although their method detects only vehicles. However, these salient object detection models for natural scenes are not suitable for traffic driving scenarios.
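To make the segmentation-versus-detection distinction concrete: a binary saliency map carries no boxes, and recovering them requires an extra connected-components pass, as in the sketch below (assuming OpenCV is available; the file name and area threshold are placeholders).

```python
import cv2
import numpy as np

# saliency_map: H x W float array in [0, 1] produced by a salient object
# detection model; the file name here is a placeholder.
saliency_map = np.load("saliency_map.npy")

# Threshold into the binary mask these models actually output...
mask = (saliency_map > 0.5).astype(np.uint8)

# ...then recover bounding boxes only as a separate post-processing step.
num, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
boxes = [
    (x, y, x + w, y + h)
    for x, y, w, h, area in stats[1:]  # row 0 is the background component
    if area > 50                       # illustrative minimum-area filter
]
```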
C. Saliency Attention and Object Detection Datasets
Many image saliency datasets have been released in the
past few years, improving the understanding of human visual
attention and pushing computational models forward. The
statistics of saliency attention and object detection datasets
are summarized in Table I. There are some natural-scene saliency image/video datasets, such as the MIT benchmark [31], the SALICON dataset [32], and Action in the Eye [38], but they do not contain specific driving sequences. Wang et al. [35], [39] built a large-scale benchmark called Dynamic Human Fixation 1K (DHF1K) for predicting human fixations during free viewing of dynamic natural scenes. DHF1K includes 1K video sequences annotated by 17 observers with an eye-tracking device. In addition, the authors proposed a novel video saliency model called the attentive CNN-LSTM network (ACLNet). Each video in DHF1K was manually annotated with a category label drawn from seven main categories: daily activity, sport, social activity, artistic performance, animal, artifact, and scenery. However, there are no traffic driving scenarios in this dataset.
On the other hand, many state-of-the-art datasets have been published for object detection tasks, including ImageNet [40], Pascal VOC [33], and MS COCO [34]. All objects present in the images are labeled in these datasets, and these objects are important for detection and tracking in daily life.
In the field of driving attention dataset research,
Xia et al. [36] proposed an in-lab driver attention dataset
named Berkeley DeepDrive Attention (BDD-A), which was
built upon braking event videos selected from a large-scale,
crowd-sourced driving video dataset. Recently, Fang et al. built a dataset for predicting driver attention in driving accident scenarios (DADA) [37] and designed a semantic context-induced attentive fusion network (SCAFNet). Alletto et al. recorded drivers' eye-tracking videos during actual driving and built a publicly available video dataset, DR(eye)VE [9]. DR(eye)VE is a good public dataset consisting of 74 videos collected from eight drivers.