1692 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 16, NO. 4, AUGUST 2015
emergency lane generated by the proposed system provides an
alternative option of PCS with steering for more stopping room
or less speed loss.
The rest of this paper is organized as follows. Section II
discusses the related work in detail. Section III presents a detailed
description of the proposed HMM-based road detection approach.
Section IV briefly introduces the adopted vehicle detection.
In Section V, the proposed contextual cor-
relation for both low-level detection improvement and high-
level road structure estimation is described. Section VI gives
the experimental results on a variety of typical but challenging
road scenarios, which demonstrate the effectiveness and
robustness of the proposed system. Finally, Section VII concludes
the paper and discusses future work.
II. RELATED WORK
ADAS is one of the fastest growing areas in automotive elec-
tronics. Since high-quality cameras are now available at very low
cost, many camera-based ADAS systems have been deployed
[5], [6]. In the proposed system, we also aim at a camera-based
solution for the LKA, ACC, and PCS functions in unmarked
urban scenarios, which requires robust detection of the road
[7]–[22] and vehicles [23]–[30] at the low level, and rational
estimation of road structures [33]–[35] at the high level.
1) Road Detection: The problem of vision-based road de-
tection has been studied for several decades. Some methods
used a monocular camera to extract the road region by em-
ploying specific features based on the road appearance [7]–[12].
Such appearance-based methods can work very well in certain
environments, even under adverse conditions [12]. However,
they lose effectiveness in cases where the roads do not
sufficiently correspond to the models of the a priori defined
features. Some other methods worked on a
sequence of temporally consecutive monocular images of the
scene, and made use of the displacement of pixels between two
consecutive images [13], [14]. These motion-based methods
can provide generic detection of the drivable roads and give
information about the displacement of the target and structure/
depth of the scene. However, they cannot work well on chaotic
roads when the camera is unstable and the estimation of the
optical flow is not robust enough.
Stereovision-based methods are also widely used for road
detection. Generally, they are more robust than monocular-
based ones, since they can triangulate feature points in 3-D
and cope better with scale loss and dynamic vehicle
movements. Given a stereo image pair, stereo
matching-based methods extract the 3-D structure of the scene
by solving the correspondence problem and computing the
disparity map. For example, 3-D urban reconstruction has been
demonstrated in [15] and [16]. Compared with these rather
holistic methods, dedicated terrain traversability estimation
methods [17]–[21] showed a better classification performance
with respect to the vehicle driving. The approach proposed here
belongs to this line of systems. Previously, we also developed
a road detection system in a Markov random field (MRF) by
finding the correspondences of the road pixels between the
image pairs based on the homography induced by the ground
plane [22]. Compared with the plane-induced homography, the
disparity map can provide more detailed information, particu-
larly for low-textured scenarios, so that greater accuracy and
robustness of road detection can be expected.
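As a concrete illustration of the correspondence problem mentioned above, the following sketch computes a dense disparity map for a synthetic stereo pair using naive SAD (sum of absolute differences) block matching. It is a minimal toy under invented parameters (image size, window size, disparity range), not the stereo pipeline of [15]–[21]:

```python
import numpy as np

def box3(a):
    """Aggregate a per-pixel cost image over a 3 x 3 window (box filter)."""
    p = np.pad(a, 1, mode="edge")
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

def sad_disparity(left, right, max_disp):
    """Naive stereo matching: for every pixel, pick the horizontal shift d
    that minimizes the aggregated SAD between left and right patches."""
    h, w = left.shape
    costs = np.full((max_disp, h, w), np.inf, dtype=np.float32)
    for d in range(max_disp):
        # Compare left column (d + i) against right column i for shift d.
        diff = np.abs(left[:, d:] - right[:, : w - d])
        costs[d, :, d:] = box3(diff)
    return np.argmin(costs, axis=0)  # disparity map (larger = closer)

# Synthetic pair: the right view is the left view shifted by a known amount,
# so the matcher has a ground-truth disparity to recover.
rng = np.random.default_rng(0)
true_disp = 6
left = rng.integers(0, 256, (64, 96)).astype(np.float32)
right = np.zeros_like(left)
right[:, : 96 - true_disp] = left[:, true_disp:]

disp = sad_disparity(left, right, max_disp=16)
print(int(np.median(disp[:, true_disp:])))  # recovers the known shift: 6
```

Real systems replace the box filter with more sophisticated cost aggregation and add subpixel refinement and consistency checks, but the core search over horizontal shifts is the same.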
2) Vehicle Detection: Vehicle detection is a topic of great
interest. Both the academic community and the
automobile industry have contributed to the development of
different types of detection systems in order to improve traffic
safety with respect to vehicle-to-vehicle collisions. For ex-
ample, Sun et al. [23] gave a comprehensive review of vehicle
detection. In the early works, the symmetry and edge informa-
tion were used for detecting vehicles in the image [24], [25].
However, such methods failed in more challenging scenarios,
where vehicles present dramatic appearance changes depending
on the camera viewpoint and environmental conditions, and also
exhibit intraclass variability. In order to tackle these challenges,
two common solutions have been developed in recent years.
One is to employ robust features, since overall performance
of the system depends on the discriminative power of features
used in the detection algorithm. For example, the HOG feature
[26] has been considered one of the strongest features; it
captures the shape information of an object and is robust to
local variations. The other is to establish part-based models for
the target. Rather than trying to capture a global pattern of an
object with one template, part-based models focus on parts of
an object and, in consequence, provide more flexible and robust
representations. Recently, Felzenszwalb et al. demonstrated a
DPM that outperformed the single template model by using
a latent support vector machine (SVM) formulation in com-
bination with a variation of HOG features [4]. This approach
works very well when the nearby vehicles are fully visible.
However, vehicles are sometimes far from the host vehicle and,
consequently, the visual evidence is very weak. Furthermore,
vehicles are frequently occluded by other objects in traffic
scenes. In this case, some part models are not visible, yet
they still count toward the overall detection score. Thus, the
low scores of the occluded parts result in a low summed
score, thereby generating false negatives.
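The effect of occlusion on a summed part score can be seen in a toy calculation. All numbers below are invented for illustration; a real DPM additionally optimizes part placements with deformation costs and a bias term:

```python
# Toy DPM-style scoring: detection score = root filter response plus the
# part filter responses (deformation costs folded in; numbers invented).
root = 0.9
threshold = 1.5

parts_visible = [0.5, 0.6, 0.4, 0.5]     # all parts match well
parts_occluded = [0.5, 0.6, -1.2, -1.0]  # two occluded parts score strongly negative

score_visible = root + sum(parts_visible)    # above threshold: detected
score_occluded = root + sum(parts_occluded)  # below threshold: false negative

print(score_visible > threshold, score_occluded > threshold)  # True False
```

Because every part contributes to the sum regardless of visibility, two occluded parts are enough to pull an otherwise confident detection below the acceptance threshold.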
It is hard to solve these problems with the aforementioned
generic methods, since the observations of the targets them-
selves are weak. In the vision community, researchers have
attempted to improve object detection by correlating the con-
textual information in the image. For example, Torralba [27] ex-
tracted the semantic categories of the image, such as a coastline,
a landscape, or a room, and learned the average positions of
objects of interest within the image. Such positions could then
be used as a prior for object detection. Hoiem et al. [28] also clas-
sified the image into three main spatial classes, namely, ground,
vertical, and sky, and then trained a classifier using AdaBoost
for object detection with a coarse viewpoint prior derived from
the spatial context. Galleguillos et al. [29] incorporated two
types of context, i.e., co-occurrence and relative location, for
object categorization by maximizing the object label agreement
in a conditional random field. These methods showed obvious
detection improvement by using additional context information.
However, a major problem of these methods is that the context
of the objects of interest is learned from labeled databases
comprising images shot in a limited set of compositions.
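The use of a spatial prior can be illustrated with a toy example in the spirit of [27]: a detector score is reweighted by a prior on the normalized image row, so that a weak detection in an implausible location is suppressed. The prior values here are hand-invented for illustration; in practice they are learned from labeled data:

```python
# Toy spatial-context prior (values invented; real priors are learned).
def vehicle_location_prior(y_norm):
    """Hypothetical prior on vehicle presence vs. normalized image row
    (0.0 = top of the image, 1.0 = bottom)."""
    return 0.05 if y_norm < 0.4 else 0.8

appearance_score = 0.4  # the same weak appearance-only detector response

# Combining the detector evidence with the location prior keeps the
# plausible detection (low in the image) and suppresses the sky one.
score_on_road = appearance_score * vehicle_location_prior(0.7)
score_in_sky = appearance_score * vehicle_location_prior(0.2)
print(score_on_road > score_in_sky)  # True
```

The limitation noted above applies directly to this sketch: such a prior is only as good as the compositions represented in the training images it was learned from.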