Object Tracking: A Survey 9
Fig. 3. Mixture of Gaussian modeling for background subtraction. (a) Image from a sequence
in which a person is walking across the scene. (b) The means of the highest-weighted Gaussians
at each pixel's position. These means represent the most temporally persistent per-pixel color
and hence should represent the stationary background. (c) The means of the Gaussians with
the second-highest weight; these means represent colors that are observed less frequently. (d)
Background subtraction result. The foreground consists of the pixels in the current frame that
matched a low-weighted Gaussian.
4.2. Background Subtraction
Object detection can be achieved by building a representation of the scene called the
background model and then finding deviations from the model for each incoming frame.
Any significant change in an image region from the background model signifies a moving
object. The pixels constituting the regions undergoing change are marked for further
processing. Usually, a connected component algorithm is applied to obtain connected
regions corresponding to the objects. This process is referred to as background
subtraction.
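The steps above (build a background model, threshold per-pixel deviations, then group the changed pixels into connected regions) can be sketched as follows. This is a minimal illustration, not any particular paper's method: the per-pixel median background model, the threshold value, and the function names are all assumptions made for the example.

```python
import numpy as np

def background_subtract(frames, current, thresh=20.0):
    """Build a background model as the per-pixel median of past frames
    and mark pixels of `current` that deviate by more than `thresh`."""
    background = np.median(np.stack(frames).astype(float), axis=0)
    return np.abs(current.astype(float) - background) > thresh

def connected_components(mask):
    """Label 4-connected foreground regions of a boolean mask via
    iterative flood fill; returns (label image, number of regions)."""
    labels = np.zeros(mask.shape, dtype=int)
    n = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue  # pixel already assigned to a region
        n += 1
        stack = [seed]
        while stack:
            y, x = stack.pop()
            if (0 <= y < mask.shape[0] and 0 <= x < mask.shape[1]
                    and mask[y, x] and labels[y, x] == 0):
                labels[y, x] = n
                stack += [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
    return labels, n
```

Each nonzero label in the output then corresponds to one detected moving region, ready for the further processing described above.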
Frame differencing of temporally adjacent frames has been well studied since the
late 1970s [Jain and Nagel 1979]. However, background subtraction became popular fol-
lowing the work of Wren et al. [1997]. In order to learn gradual changes in time, Wren
et al. propose modeling the color of each pixel, I(x, y), of a stationary background
with a single 3D (Y, U, and V color space) Gaussian, I(x, y) ∼ N(μ(x, y), Σ(x, y)). The
model parameters, the mean μ(x, y) and the covariance Σ(x, y), are learned from the
color observations in several consecutive frames. Once the background model is de-
rived, for every pixel (x, y) in the input frame, the likelihood of its color coming from
N(μ(x, y), Σ(x, y)) is computed, and the pixels that deviate from the background model
are labeled as the foreground pixels. However, a single Gaussian is not a good model for
outdoor scenes [Gao et al. 2000] since multiple colors can be observed at a certain loca-
tion due to repetitive object motion, shadows, or reflectance. A substantial improvement
in background modeling is achieved by using multimodal statistical models to describe
per-pixel background color. For instance, Stauffer and Grimson [2000] use a mixture
of Gaussians to model the pixel color. In this method, a pixel in the current frame is
checked against the background model by comparing it with every Gaussian in the
model until a matching Gaussian is found. If a match is found, the mean and vari-
ance of the matched Gaussian are updated; otherwise, a new Gaussian with mean
equal to the current pixel color and some initial variance is introduced into the mix-
ture. Each pixel is classified based on whether the matched distribution represents the
background process. Moving regions detected using this approach, along with the
background models, are shown in Figure 3.
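This per-pixel matching-and-update loop can be sketched as follows, simplified to scalar intensities. The learning rate alpha, the 2.5-standard-deviation match test, and the background-weight threshold T are common choices, but using a constant update rate (in place of the posterior-weighted rate of Stauffer and Grimson) is a simplifying assumption made here for brevity.

```python
class PixelMixture:
    """Adaptive mixture of Gaussians for one pixel, in the spirit of
    Stauffer and Grimson [2000], simplified to scalar intensities."""

    def __init__(self, max_gaussians=3, alpha=0.05, init_var=100.0, T=0.7):
        self.max_g = max_gaussians
        self.alpha = alpha        # learning rate for weights/means/variances
        self.init_var = init_var  # variance assigned to new Gaussians
        self.T = T                # cumulative weight defining the background
        self.means, self.variances, self.weights = [], [], []

    def observe(self, value):
        """Match `value` against the mixture, update the model, and
        return True if the pixel is classified as background."""
        for i, (m, v) in enumerate(zip(self.means, self.variances)):
            if abs(value - m) <= 2.5 * v ** 0.5:  # matched this Gaussian
                # Decay all weights, then reinforce the matched one.
                self.weights = [w * (1 - self.alpha) for w in self.weights]
                self.weights[i] += self.alpha
                self.means[i] += self.alpha * (value - self.means[i])
                self.variances[i] += self.alpha * (
                    (value - self.means[i]) ** 2 - self.variances[i])
                return self._is_background(i)
        # No match: introduce a new Gaussian centered on the pixel color,
        # replacing the lowest-weighted one if the mixture is full.
        if len(self.means) >= self.max_g:
            j = min(range(len(self.weights)), key=self.weights.__getitem__)
            del self.means[j], self.variances[j], self.weights[j]
        self.means.append(value)
        self.variances.append(self.init_var)
        self.weights.append(self.alpha)
        self._normalize()
        return False  # unmatched observations are foreground

    def _normalize(self):
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

    def _is_background(self, i):
        # The highest-weighted Gaussians whose cumulative weight first
        # exceeds T model the background; check whether i is among them.
        self._normalize()
        order = sorted(range(len(self.weights)),
                       key=self.weights.__getitem__, reverse=True)
        cumulative = 0.0
        for k in order:
            cumulative += self.weights[k]
            if k == i:
                return True
            if cumulative > self.T:
                return False
        return False
```

A full background model maintains one such mixture per pixel; a persistently observed color accumulates weight and is eventually classified as background, while a novel color spawns a fresh low-weight Gaussian and is reported as foreground.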
Another approach is to incorporate region-based (spatial) scene information instead
of only using color-based information. Elgammal and Davis [2000] use nonparamet-
ric kernel density estimation to model the per-pixel background. During the sub-
traction process, the current pixel is matched not only to the corresponding pixel in
the background model, but also to the nearby pixel locations. Thus, this method can
handle camera jitter or small movements in the background. Li and Leung [2002]
fuse the texture and color features to perform background subtraction over blocks of
ACM Computing Surveys, Vol. 38, No. 4, Article 13, Publication date: December 2006.