Remote Sens. 2021, 13, 1828 4 of 19
designed to estimate the object’s motion. The output of the camera model was in turn
utilized to calculate the measurement matrix of the EKF. The matrix was designed to map between
the position measurements of the objects in the image domain and the corresponding
state vector in the real world. Hayakawa, et al. [13] predicted 2D flow by PWC-Net and
detected the surrounding vehicles’ 3D bounding box using a multi-scale network. The
ego-motion was extracted from the 2D flow using the projection matrix and the ground plane
corrected by depth information. A similar approach was used for the estimation of the
relative velocity of surrounding vehicles. The absolute velocity was derived from the
combination of the ego-motion and the relative velocity. The position and orientation of
surrounding vehicles were calculated by projecting the 3D bounding box into the ground
plane. Min and Huang [14] proposed a method of detecting moving objects from the
difference between the mixed flow (caused by both camera motion and object motion) and
the ego-motion flow (evoked by the moving camera). They established the mathematical
relationship between optical flow, depth, and camera ego-motion. Accordingly, a visual
odometer was implemented for the estimation of ego-motion parameters by using ground
points as feature points. The ego-motion flow was calculated from the estimated ego-
motion parameters. The mixed flow was obtained from the correspondence matching
between consecutive images. Zhang, et al. [21] presented a framework to simultaneously
track the camera and multiple objects. The 6-DoF motions of the objects, as well as the
camera, are optimized jointly with the optical flow in a unified formulation. The object
velocity was calculated using the rotation and translation part of the motion of points in the
global reference frame. The proposed framework detected moving objects via combining
Mask R-CNN object segmentation [22] and scene flow, and tracked them over frames using
optical flow.
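The flow-difference idea in [14] can be illustrated with a minimal sketch. The function below is an assumption for illustration, not the authors' implementation: it subtracts the flow predicted from camera ego-motion from the observed (mixed) flow and thresholds the residual magnitude to flag candidate moving-object pixels. The threshold value is an assumed tuning parameter.

```python
import numpy as np

def detect_moving_pixels(mixed_flow, ego_flow, threshold=1.0):
    """Flag pixels whose observed (mixed) flow deviates from the flow
    predicted by camera ego-motion alone.

    mixed_flow, ego_flow: (H, W, 2) arrays of per-pixel (u, v) flow.
    threshold: assumed residual-magnitude cutoff in pixels.
    Returns a boolean (H, W) mask of candidate moving-object pixels.
    """
    residual = np.linalg.norm(mixed_flow - ego_flow, axis=-1)
    return residual > threshold

# Toy example: a 4x4 scene in which one 2x2 patch moves independently.
ego = np.ones((4, 4, 2)) * 0.5           # flow induced by camera motion only
mixed = ego.copy()
mixed[1:3, 1:3] += np.array([3.0, 0.0])  # an object adds its own motion
mask = detect_moving_pixels(mixed, ego)
print(mask.sum())  # 4 moving pixels
```

In practice, the ego-motion flow would be rendered from the estimated ego-motion parameters and per-pixel depth rather than assumed, and the residual would be regularized before thresholding.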
Different from the first two categories of the methods, the learning-based
method [8,15,23,24] does not require a specific mathematical estimation model but relies
on machine learning and the ability of neural network regression to estimate the
motion parameters. Jain, et al. [8] used Farneback’s algorithm to calculate optical flow and
the DeepSort algorithm to track vehicles detected from the YOLO-v3. The optical flow
and the tracking information of the vehicle were then treated as input for two different
networks. The features extracted from the two networks were stacked to create a new
input for a lightweight Multilayer Perceptron architecture which finally predicts positions
and velocities. Cao, et al. [15] presented a network for learning motion parameters from
stereo videos. The network masked object instances and predicted specific 3D scene flow
maps, from which the motion direction and speed for each object can be derived. The
network took the 3D geometry of the problem into account, which allowed it to correlate
the input images. Kim, et al. [23] developed a deep neural network that exploits different
levels of semantic information to perform the motion estimation. The network used a
multi-context pooling layer that integrates both object and global features, and adopted a
cyclic ordinal regression scheme using binary classifiers for effective motion classification.
In the detection stage, they ran the YOLO-v3 detector to obtain the bounding boxes.
Song, et al. [24] presented an end-to-end deep neural network for the estimation of inter-vehicle
distance and relative velocity. The network integrated multiple visual clues provided by
two time-consecutive frames, including a deep feature clue, a scene geometry clue, and a
temporal optical flow clue. It also used a vehicle-centric sampling mechanism to
alleviate the effect of perspective distortion in the motion field.
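The feature-stacking step described for [8] can be sketched as follows. This is a hedged illustration, not the published architecture: the feature sizes, weights, and layer widths are assumptions, chosen only to show how two separately extracted feature vectors are concatenated and fed to a small multilayer perceptron that regresses position and velocity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature sizes: 64-d optical-flow features and 32-d tracking
# features per vehicle, produced by two separate upstream networks.
flow_feat = rng.standard_normal(64)
track_feat = rng.standard_normal(32)

# Stack the two feature vectors into one input for the MLP.
x = np.concatenate([flow_feat, track_feat])  # shape (96,)

# Minimal two-layer MLP with random (untrained) weights, regressing
# four outputs: position (x, y) and velocity (vx, vy).
W1, b1 = rng.standard_normal((32, 96)) * 0.1, np.zeros(32)
W2, b2 = rng.standard_normal((4, 32)) * 0.1, np.zeros(4)
h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
pos_vel = W2 @ h + b2             # (x, y, vx, vy)
print(pos_vel.shape)              # (4,)
```

The design choice worth noting is that each modality keeps its own feature extractor, so the lightweight MLP only has to fuse already-compact representations.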
Moving object detection is a prerequisite for motion estimation. Most of the existing
methods use bounding boxes as object proposals, which affects the accuracy of motion
estimation in the latter two stages. In this study, we leverage region-level segmentation to
accurately locate object regions for tracking and parameter estimation. Therefore, we review
here relevant segmentation works in comparison with our segmentation method. PSPNet [25]
is a pyramid scene parsing network based on the fully convolutional network [26], which
exploits the capability of global context information by different-region-based context
aggregation. PSPNet can provide a pixel-level prediction for the scene parsing task. Mask