Visual Tracking via Adaptive Spatially-Regularized Correlation Filters
dkn2014@mail.dlut.edu.cn, {wdice,lhchuan}@dlut.edu.cn, waynecool@mail.dlut.edu.cn, jianhual@dlut.edu.cn
1 School of Information and Communication Engineering, Dalian University of Technology, China
2 Peng Cheng Laboratory
3 Tencent Youtu Lab

Abstract

In this work, we propose a novel Adaptive Spatially-Regularized Correlation Filters (ASRCF) model that simultaneously optimizes the filter coefficients and the spatial regularization weights. First, this adaptive spatial regularization scheme learns an effective spatial weight for a specific object and its appearance variations, and therefore yields more reliable filter coefficients during tracking. Second, our ASRCF model can be optimized effectively via the alternating direction method of multipliers, where each subproblem has a closed-form solution. Third, our tracker applies two CF models to estimate the location and scale respectively. The location CF model exploits ensembles of shallow and deep features to accurately determine the optimal position. The scale CF model works on multi-scale shallow features to efficiently estimate the optimal scale. Extensive experiments on five recent benchmarks show that our tracker performs favorably against many state-of-the-art algorithms, with real-time performance at 28 fps.

1. Introduction

Visual tracking [36, 25, 24] is a fundamental computer vision problem with many real-world applications, including video surveillance, behavior analysis, and so on. Despite much effort, designing a robust and efficient tracker remains a difficult task due to challenging foreground and background variations. Recently, correlation filter (CF) based tracking algorithms have achieved top-ranked performance and attracted increasing attention. Typically, CF-based trackers [18, 12, 11, 15, 8] exploit a large number of circularly shifted samples for learning and convert the correlation operation in the spatial domain into element-wise multiplication in the frequency domain, which significantly reduces the computational complexity and improves the tracking speed.

∗ Corresponding author: Dr. Wang.

Figure 1. Visualization of the different spatial regularizations of the (a) SRDCF [11] and (b) ASRCF methods. For SRDCF, the spatial regularization has a negative Gaussian shape, is almost identical for different objects, and remains unchanged during tracking. In contrast, our ASRCF method attempts to learn an adaptive spatial regularization that is flexible for different objects at different times. As shown in (b), the ASRCF model learns an effective spatial regularization that imposes higher penalties on noisy parts and lower penalties on reliable parts.

However, early CF-based methods suffer from two main drawbacks. First, the circularly shifted sampling process always suffers from periodic repetitions at boundary positions, so the CF model is trained with a portion of unreal samples. This dilemma has been alleviated to some extent by adding predefined spatial constraints on the filter coefficients [11, 15]. However, these constraints are usually fixed for different objects and do not change during tracking, and thus cannot fully exploit the diverse information of different objects at different times. Second, object localization and scale estimation are usually conducted in the same feature space, which requires extracting multi-scale feature maps during tracking. When the tracker exploits powerful but complicated features (such as features extracted from deep networks), this strategy significantly increases the computational load and slows the tracker down. This is why top-ranked CF trackers usually run very slowly (e.g., DeepSRDCF [10], C-COT [13], DRT [32] and RPCF [33]).

2. Related Work

CF-based trackers have achieved great success in recent years. We briefly review some related work to highlight our motivation. The MOSSE [3] method is one of the earliest CF-based trackers and uses only grayscale samples to train the filter. The CSK [17] tracker introduces the kernel trick into the CF formulation. By exploiting circularly shifted samples, the filter coefficients can be optimized efficiently in the frequency domain. Based on CSK [17], the KCF [18] method exploits multi-channel HOG [7] features to enhance the representation ability and significantly improves the tracking performance. Similarly, color name features are introduced to achieve robust tracking on color videos [12]. The DSST [9], SAMF [26] and IBCCF [23] trackers use multi-scale search strategies to handle the scale adaptation problem. Conventional CF methods rely on the periodic assumption of the training and detection samples, which produces unexpected boundary effects and makes the tracker trained and applied on a portion of unreal samples. To address this problem, Danelljan et al. [11] introduce a spatial regularization term into the CF formulation to penalize the filter coefficients near the boundary regions. Galoogahi et al. [15] directly multiply the filter by a binary matrix to generate real positive and negative samples for model training. The above two spatial constraints have been widely used in subsequent works [8, 13, 22, 32]. These spatial constraints are usually fixed for different objects and do not change during tracking; thus, they cannot fully exploit the diverse information of different objects in different frames. In this work, we propose a novel adaptive spatial regularization term that allows the tracker to learn more reliable filter coefficients during tracking.

Recently, many researchers have attempted to combine CF models with deep visual features, making CF-based trackers achieve state-of-the-art performance [29, 8, 13, 22, 32]. Ma et al. [29] exploit three layers of CNN features pre-trained for classification to generate feature maps for training CF models. Danelljan et al. [13] use continuous convolution filters to combine feature maps with different spatial resolutions. However, these CF-based trackers no longer have a speed advantage due to the complicated deep features. In particular, their scale estimation strategies require extracting multi-scale deep features, which is extremely expensive and makes the tracker very slow. In this work, we exploit two kinds of CF models to estimate the location and scale separately. Accurate object localization is obtained with one CF model using only single-scale robust deep features, while efficient scale estimation is conducted with the other CF model using multi-scale shallow features.

3. Adaptive Spatially-Regularized Correlation Filters (ASRCF)

3.1. Objective Function of Our ASRCF Model

Original Correlation Filters (CF): The original multi-channel CF model in the spatial domain aims to minimize the following objective function [18]:

  E(H) = (1/2) ‖y − Σ_{k=1}^K x_k ⋆ h_k‖₂² + (λ/2) Σ_{k=1}^K ‖h_k‖₂²,   (1)

where x_k ∈ R^T and h_k ∈ R^T denote the k-th channel of the vectorized image and filter respectively, and K is the total channel number. The vector y ∈ R^{T×1} is the desired response (i.e., the Gaussian-shaped ground truth), ⋆ denotes the spatial correlation operator and λ is a regularization constant. H = [h1, h2, ..., hK] is the matrix representing the filters from all K channels.
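To make the frequency-domain machinery behind equation (1) concrete, the following minimal NumPy sketch trains a multi-channel filter with the standard per-frequency ridge-regression solution ĥ_k = conj(x̂_k) ⊙ ŷ / (Σ_j x̂_j ⊙ conj(x̂_j) + λ) and applies it to a new patch. It illustrates only the baseline CF model of equation (1), not ASRCF; the function names, patch size and Gaussian width are our own illustrative choices, not taken from the authors' implementation.

```python
import numpy as np

def train_cf(x, y, lam=1e-2):
    """Closed-form multi-channel CF of eq. (1) in the Fourier domain.

    x : (K, H, W) real feature channels of the training patch
    y : (H, W) Gaussian-shaped desired response
    Returns the filter in the Fourier domain, shape (K, H, W)."""
    xf = np.fft.fft2(x, axes=(-2, -1))
    yf = np.fft.fft2(y)
    # per-frequency ridge regression: h_k = conj(x_k) * y / (sum_j |x_j|^2 + lam)
    denom = np.sum(xf * np.conj(xf), axis=0).real + lam
    return np.conj(xf) * yf[None] / denom[None]

def detect(hf, z):
    """Correlate the learned filter with a new patch z (K, H, W) and return
    the spatial response map; its peak indicates the object position."""
    zf = np.fft.fft2(z, axes=(-2, -1))
    return np.real(np.fft.ifft2(np.sum(hf * zf, axis=0)))

# toy usage: the response peak of the training patch lies at the label centre
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 64, 64))
yy, xx = np.mgrid[0:64, 0:64]
y = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / (2 * 4.0 ** 2))
resp = detect(train_cf(x, y), x)
print(np.unravel_index(resp.argmax(), resp.shape))   # approximately (32, 32)
```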
The original CF model suffers from periodic repetitions at boundary positions caused by the circularly shifted samples, which inevitably degrades the tracking performance. To solve this problem, several spatial constraints have been introduced to alleviate the unexpected boundary effects. Representative methods include spatially regularized discriminative correlation filters (SRDCF) [11] and background-aware correlation filters (BACF) [15]. Their basic ideas are presented as follows.

SRDCF: The SRDCF method [11] introduces a spatial regularization to penalize the filter coefficients according to their spatial locations, and modifies the objective function to

  E(H) = (1/2) ‖y − Σ_{k=1}^K x_k ⋆ h_k‖₂² + (λ/2) Σ_{k=1}^K ‖w̃ ⊙ h_k‖₂²,   (2)

where w̃ denotes the predefined spatial weight (with a negative Gaussian shape, see Figure 1(a)).

BACF: The BACF method [15] proposes a background-aware CF and introduces the following objective function:

  E(H) = (1/2) ‖y − Σ_{k=1}^K x_k ⋆ (P⊤h_k)‖₂² + (λ/2) Σ_{k=1}^K ‖h_k‖₂²,   (3)

where P ∈ R^{T×T} is a diagonal binary matrix that makes the correlation operator act directly on real foreground and background samples.

The constraints in equations (2) and (3) are fixed during tracking and identical for different objects, and therefore cannot well reflect the characteristics and appearance changes of a specific object. Thus, it is reasonable to introduce an adaptive spatial regularization into the CF model:

  E(H, w) = (1/2) ‖y − Σ_{k=1}^K x_k ⋆ (P⊤h_k)‖₂² + (λ1/2) Σ_{k=1}^K ‖w ⊙ h_k‖₂² + (λ2/2) ‖w − w^r‖₂².   (4)

In equation (4), the first term is a ridge-regression term that correlates the training data X = [x1, x2, ..., xK] with the filters H = [h1, h2, ..., hK] to fit the Gaussian-shaped ground truth y. The second term introduces an adaptive spatial regularization on the filters H, where the spatial weight w needs to be optimized. The third term encourages the adaptive spatial weight w to be similar to a reference weight w^r. This constraint introduces prior information about w and avoids model degradation.¹ λ1 and λ2 are the regularization parameters of the second and third terms, respectively. We note that the proposed ASRCF is a general CF model, and the well-known KCF, SRDCF and BACF algorithms are all special cases of our model (as shown in Table 1).

¹ If there were no third term, the solution of w would degenerate, i.e., w = 0.

Table 1. The generalization ability of our ASRCF model.
  Method | P     | w
  KCF    | P = I | w = 1, λ2 = 0
  SRDCF  | P = I | w = w̃, λ2 = 0
  BACF   | –     | w = 1, λ2 = 0
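To spell out the quantities entering equation (4), the sketch below constructs the Gaussian-shaped label y, the diagonal p of the cropping matrix P, and evaluates the ASRCF objective for given filters and weights, with the spatial correlation realized as a circular FFT-based product. It is an illustrative NumPy rendering under our own naming and parameter choices (window size, sigma), not the authors' code.

```python
import numpy as np

def gaussian_label(shape, sigma=4.0):
    """Gaussian-shaped desired response y, peaked at the patch centre
    (sigma is a placeholder value)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    return np.exp(-((yy - h // 2) ** 2 + (xx - w // 2) ** 2) / (2 * sigma ** 2))

def crop_vector(shape, target):
    """Diagonal p of the binary cropping matrix P: one inside the central
    target-sized window, zero elsewhere."""
    h, w = shape
    th, tw = target
    p = np.zeros(shape)
    p[(h - th) // 2:(h + th) // 2, (w - tw) // 2:(w + tw) // 2] = 1.0
    return p

def asrcf_objective(x, h, w, wr, p, y, lam1, lam2):
    """Value of the ASRCF objective (4) for given filters h and weight w.

    x, h : (K, H, W) feature channels and filters; w, wr, p, y : (H, W).
    Circular correlation is evaluated through FFT products."""
    hp = p[None] * h                                   # P^T h_k (crop the filter)
    xf = np.fft.fft2(x, axes=(-2, -1))
    hf = np.fft.fft2(hp, axes=(-2, -1))
    resp = np.real(np.fft.ifft2(np.sum(np.conj(xf) * hf, axis=0)))  # sum_k x_k * (P^T h_k)
    data = 0.5 * np.sum((y - resp) ** 2)               # ridge-regression term
    reg1 = 0.5 * lam1 * np.sum((w[None] * h) ** 2)     # adaptive spatial regularization
    reg2 = 0.5 * lam2 * np.sum((w - wr) ** 2)          # keep w close to the reference w^r
    return data + reg1 + reg2
```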
3.2. Optimization of Our ASRCF Model

Inspired by previous works [11, 15], correlation filters are usually learned in the frequency domain for efficient training and testing. Thus, we express the objective function (4) in the frequency domain (using Parseval's theorem) and convert it into the equality-constrained optimization form:

  E(H, Ĝ, w) = (1/2) ‖ŷ − Σ_{k=1}^K x̂_k ⊙ ĝ_k‖₂² + (λ1/2) Σ_{k=1}^K ‖w ⊙ h_k‖₂² + (λ2/2) ‖w − w^r‖₂²
  s.t. ĝ_k = √T F P⊤h_k, k = 1, ..., K,   (5)

where Ĝ = [ĝ1, ĝ2, ..., ĝK] (ĝ_k = √T F P⊤h_k, k = 1, ..., K) is an auxiliary variable matrix. In equation (5), the symbol ˆ denotes the discrete Fourier transform of a given signal, and F is the orthonormal T × T matrix of complex basis vectors that maps any T-dimensional vectorized signal into the Fourier domain (e.g., â = √T F a, a ∈ R^{T×1}). The model in equation (5) is bi-convex and can be minimized to a local optimum using the alternating direction method of multipliers (ADMM) [4]. The augmented Lagrangian form of equation (5) can be formulated as

  L(H, Ĝ, w, V̂) = E(H, Ĝ, w) + Σ_{k=1}^K v̂_k⊤ (ĝ_k − √T F P⊤h_k) + (μ/2) Σ_{k=1}^K ‖ĝ_k − √T F P⊤h_k‖₂²,   (6)

where V = [v1, v2, ..., vK] ∈ R^{T×K} is the Lagrange multiplier and V̂ = [v̂1, v̂2, ..., v̂K] ∈ R^{T×K} is the corresponding Fourier transform. By introducing ŝ_k = (1/μ) v̂_k (k = 1, 2, ..., K), the optimization of equation (6) is equivalent to solving equation (7):

  L(H, Ĝ, w, Ŝ) = (1/2) ‖ŷ − Σ_{k=1}^K x̂_k ⊙ ĝ_k‖₂² + (λ1/2) Σ_{k=1}^K ‖w ⊙ h_k‖₂² + (λ2/2) ‖w − w^r‖₂² + (μ/2) Σ_{k=1}^K ‖ĝ_k − √T F P⊤h_k + ŝ_k‖₂²,   (7)

where Ŝ = [ŝ1, ŝ2, ..., ŝK] ∈ R^{T×K}. Then the ADMM algorithm is adopted by alternately solving the following subproblems.

Subproblem H: If Ĝ, w and Ŝ are given, the optimal H* can be obtained as

  h_k* = argmin_{h_k} { (λ1/2) ‖w ⊙ h_k‖₂² + (μ/2) ‖ĝ_k − √T F P⊤h_k + ŝ_k‖₂² }
       = (λ1 W⊤W + μT P⊤P)^{-1} μT P (s_k + g_k)
       = μT p ⊙ (s_k + g_k) / (λ1 (w ⊙ w) + μT p),   (8)

where W = diag(w) ∈ R^{T×T} represents the diagonal weight matrix, p = [P11, P22, ..., PTT]⊤ is the column vector composed of the diagonal elements of the cropping matrix P (for P, we also have P⊤P = P), and the division in the last line is element-wise. Equation (8) shows that solving for h_k merely requires element-wise multiplication and the inverse fast Fourier transform (i.e., s_k = (1/√T) F⊤ŝ_k and g_k = (1/√T) F⊤ĝ_k). Thus, the computational complexities of solving h_k and the whole H are O(T log T) and O(KT log T), respectively.

Subproblem Ĝ: If the other variables in equation (7) are fixed, the optimal Ĝ* can be estimated by solving the optimization problem

  Ĝ* = argmin_{Ĝ} { (1/2) ‖ŷ − Σ_{k=1}^K x̂_k ⊙ ĝ_k‖₂² + (μ/2) Σ_{k=1}^K ‖ĝ_k − √T F P⊤h_k + ŝ_k‖₂² }.   (9)

However, problem (9) is difficult to optimize directly due to its high computational complexity. Thus, we process all channels of each pixel jointly and reformulate problem (9) as

  V_j(Ĝ)* = argmin_{V_j(Ĝ)} { (1/2) (ŷ_j − V_j(X̂)⊤ V_j(Ĝ))² + (μ/2) ‖V_j(Ĝ) + V_j(M̂)‖₂² },
  V_j(M̂) = V_j(Ŝ) − V_j(√T F P⊤H),   (10)

where V_j(Ĝ) ∈ R^{K×1} denotes the values of all K channels of Ĝ at pixel j. Then, the analytic solution of equation (10) can be obtained as

  V_j(Ĝ)* = (1/μ) ( I − V_j(X̂) V_j(X̂)⊤ / (μ + V_j(X̂)⊤ V_j(X̂)) ) ( ŷ_j V_j(X̂) + μ V_j(√T F P⊤H) − μ V_j(Ŝ) ).   (11)

The derivation exploits the Sherman-Morrison formula (A + uv⊤)^{-1} = A^{-1} − A^{-1}uv⊤A^{-1} / (1 + v⊤A^{-1}u), where u and v are two column vectors and uv⊤ is a rank-one matrix.

Solving w: If H, Ĝ and Ŝ are fixed, the closed-form solution of w can be determined as

  w* = argmin_w { (λ1/2) Σ_{k=1}^K ‖N_k w‖₂² + (λ2/2) ‖w − w^r‖₂² }   (12)
     = (λ1 Σ_{k=1}^K N_k⊤N_k + λ2 I)^{-1} λ2 w^r = λ2 w^r / (λ1 Σ_{k=1}^K h_k ⊙ h_k + λ2 1),   (13)

where N_k = diag(h_k) ∈ R^{T×T} and the division in the last expression is element-wise. In practice, we exploit an additional ADMM solver to obtain a better-converged weight w*. Figure 2 shows some representative examples of the learned weights. From this figure, we can see that the learned adaptive spatial regularization imposes larger penalties on unreliable regions, thereby encouraging the learned filter to focus more on the reliable regions of the tracked object in the next iteration.

Figure 2. Visualization of the adaptive spatial regularization. For each pixel, a larger value of the adaptive spatial regularization imposes a larger learning penalty on the filter at that pixel. Best viewed in color and zoomed in for details.

Lagrangian Multiplier Update: We update the Lagrangian multipliers as

  Ŝ^{(i+1)} = Ŝ^{(i)} + Ĝ^{(i+1)} − Ĥ^{(i+1)},   (14)

where Ŝ^{(i)} denotes the Fourier transform of the (scaled) Lagrange multiplier at iteration i, and Ĝ^{(i+1)} and Ĥ^{(i+1)} are the current solutions of the two subproblems at iteration i + 1. The regularization constant μ is typically updated as μ^{(i+1)} = min(μ_max, β μ^{(i)}) [4]. Therefore, the optimization proceeds by iteratively applying the above four steps: (1) solving H, (2) solving Ĝ, (3) solving w, and (4) updating the Lagrangian multiplier. After convergence, the optimal filter parameters H* (with Fourier-domain counterpart Ĝ*) and the spatial regularization weight w* are obtained.
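Putting the four steps together, a compact sketch of the ADMM loop might look as follows. It folds the √T/T DFT constants into μ, uses the complex least-squares (conjugated) form of the per-bin solution rather than the transposed notation of equation (11), and is meant only to show that every update reduces to element-wise or per-bin rank-one operations; it is not the released implementation, and the default parameter values are placeholders.

```python
import numpy as np

def update_h(gf, sf, w, p, mu, lam1):
    """Closed-form filter update, cf. eq. (8); DFT constants folded into mu."""
    g = np.real(np.fft.ifft2(gf, axes=(-2, -1)))
    s = np.real(np.fft.ifft2(sf, axes=(-2, -1)))
    num = mu * p[None] * (g + s)
    den = lam1 * (w * w)[None] + mu * p[None] + 1e-8   # eps avoids 0/0 outside the crop
    return num / den

def update_g_hat(xf, yf, hf_crop, sf, mu):
    """Per-frequency closed-form update of g, cf. eqs. (9)-(11), vectorized over
    all bins via the Sherman-Morrison identity (conjugated least-squares form)."""
    q = np.conj(xf) * yf[None] + mu * (hf_crop - sf)    # right-hand side, (K, H, W)
    xtq = np.sum(xf * q, axis=0)                        # x^T q per bin
    xtx = np.sum(xf * np.conj(xf), axis=0).real         # |x|^2 summed over channels
    return (q - np.conj(xf) * (xtq / (mu + xtx))[None]) / mu

def update_w(h, wr, lam1, lam2):
    """Closed-form update of the adaptive spatial weight, cf. eq. (13)."""
    return lam2 * wr / (lam1 * np.sum(h * h, axis=0) + lam2)

def asrcf_admm(x, y, p, wr, lam1=1.0, lam2=1.0, mu=1.0,
               beta=10.0, mu_max=1e3, iters=4):
    """A few ADMM passes over the four steps: solve H, solve G, solve w,
    update the multipliers (sketch only)."""
    xf = np.fft.fft2(x, axes=(-2, -1))
    yf = np.fft.fft2(y)
    h = np.zeros_like(x)
    w = wr.copy()
    gf = np.zeros_like(xf)
    sf = np.zeros_like(xf)
    for _ in range(iters):
        h = update_h(gf, sf, w, p, mu, lam1)                  # subproblem H
        hf_crop = np.fft.fft2(p[None] * h, axes=(-2, -1))     # F(P^T h_k)
        gf = update_g_hat(xf, yf, hf_crop, sf, mu)            # subproblem G
        w = update_w(h, wr, lam1, lam2)                       # subproblem w
        sf = sf + gf - hf_crop                                # multiplier update, eq. (14)
        mu = min(mu_max, beta * mu)                           # penalty schedule
    return h, w, gf
```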
4. Object Localization and Scale Estimation

4.1. Object Localization

For tracking, the location of the tracked object can be determined in the Fourier domain by

  r̂ = Σ_{k=1}^K x̂_k ⊙ ĝ_k,   (15)

where r and r̂ denote the response map and its Fourier transform, respectively. In this work, we adopt ensembles of deep and shallow features for object localization (see the implementation details in Section 5). After obtaining the response map, the optimal position is determined by the maximum response.

4.2. Model Update

The appearance model X is updated online by linear interpolation,

  X_model^new = (1 − η) X_model^old + η X*,   (16)

where η is the learning rate.

4.3. Scale Estimation

For scale estimation, previous CF-based trackers [18, 11, 15] usually apply the learned filter to multiple resolutions of the search area to estimate scale changes, and then select the optimal scale with the maximum response. This manner leads to two imperfections for a CF-based model with deep features: (1) it is very time-consuming to extract multi-scale deep visual features; and (2) it is difficult to estimate an accurate scale based on deep CNN features, since the pooling layers make the feature descriptions lose some detailed information.

In this work, we learn two CF models (one location CF for object localization and one scale CF for scale estimation). The location CF model is trained on ensembles of deep and shallow features. Although the feature extraction for this model is time-consuming, it merely needs to be performed on a single-scale search region during tracking. The scale CF model is trained on efficient shallow features (HOG features in this work). During tracking, we apply this CF model to five scaled search regions and obtain their response maps. The best scale is then determined by the scale corresponding to the maximum score among the five response maps. The effectiveness of our designed scale estimation scheme is verified in Section 5.2.

In every frame, the overall framework (Figure 3) first estimates the position using the location CF model with complicated features, and then applies the scale CF model to refine the scale based on five-scale HOG feature maps.

Figure 3. Overall framework: a single-scale fusion feature map is fed to the CF model for predicting the position, and five-scale HOG feature maps are fed to the CF model for estimating the scale, producing the best position and the best scale.
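The two-model scheme described above can be summarized by the per-frame sketch below. The interfaces loc_cf.respond, scale_cf.respond, extract_fused and extract_hog, as well as the five scale factors, are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

SCALE_FACTORS = np.array([0.96, 0.98, 1.00, 1.02, 1.04])   # illustrative values

def track_frame(frame, pos, size, loc_cf, scale_cf, extract_fused, extract_hog):
    """One tracking step with the two CF models (sketch).

    loc_cf / scale_cf expose a respond(features) method returning a response
    map; extract_fused / extract_hog crop a search region around pos and
    compute single-scale fused (deep + shallow) and HOG features."""
    # 1) localization: single-scale fused features -> response map -> best position
    resp = loc_cf.respond(extract_fused(frame, pos, size))
    dy, dx = np.unravel_index(resp.argmax(), resp.shape)
    pos = (pos[0] + dy - resp.shape[0] // 2,
           pos[1] + dx - resp.shape[1] // 2)

    # 2) scale refinement: five-scale HOG features -> pick the scale whose
    #    response map has the maximum score
    scores = [scale_cf.respond(
                  extract_hog(frame, pos, (size[0] * s, size[1] * s))).max()
              for s in SCALE_FACTORS]
    best = SCALE_FACTORS[int(np.argmax(scores))]
    size = (size[0] * best, size[1] * best)
    return pos, size
```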
5. Experiments

Our tracker is implemented on the MATLAB 2017a platform with the MatConvNet toolbox, and runs on a PC with an Intel i7 8700 CPU, 32GB RAM and a single NVIDIA GTX 1080Ti GPU with 11GB of memory.

OTB2015 Dataset. Among the top-ranked methods, our tracker achieves almost the best accuracy and the fastest speed (it is the only tracker with real-time performance).

Figure 4. Precision and success plots of OPE on OTB2015 [35]. The legend contains the average distance precision score at 20 pixels and the area-under-the-curve (AUC) score for each tracker; our tracker ranks first in AUC (0.692) and second in precision (0.922, just below LSART at 0.923).

Table 2. Accuracy and speed comparisons of the top-5 trackers on the OTB2015 dataset.
             C-COT   LSART   MDNet   ECO     Ours
  Success    0.667   0.672   0.678   0.687   0.692
  Precision  0.896   0.923   0.909   0.909   0.922
  GPU/CPU    CPU     GPU     GPU     GPU     GPU
  FPS        0.7     1.3     1.7     17.9    28.0

Figure 6 illustrates the overlap success plots of different trackers for six attributes (such as background clutter, deformation, occlusion, scale variation and so on). We can see that our tracker achieves almost the best performance on these attributes. First, our tracker performs well under background clutter, deformation and occlusion conditions, and obtains gains of 1.6%, 1.6% and 0.8%, respectively, over the second-best tracker (ECO [8]). This is mainly owed to the proposed adaptive spatial regularization, which makes the learned filter focus on the reliable features of the tracked object and alleviates the effects of unexpected noise within the object region. In addition, our tracker handles scale variation well thanks to the designed scale estimation scheme using multi-scale shallow features.

TC128 Dataset. We perform comparisons on the TC128 [27] dataset, which consists of 128 challenging color sequences. We compare our tracker with 8 state-of-the-art trackers, including ECO [8], C-COT [13], SRDCF [11], SRDCFdecon, DeepSRDCF [10], MCCT [34], BACF [15] and MCPF [37], and 32 more default trackers in TC128. The results of the top 15 trackers are reported in Figure 7, from which we can see that the proposed tracker performs best in terms of both the precision and success criteria.

Figure 7. Performance evaluation on the TC128 dataset in terms of success and precision plots; our tracker ranks first with an AUC score of 0.603 and a precision score of 0.825.

VOT2016 Dataset. We also perform comparisons on the VOT2016 dataset [19], which contains 60 challenging sequences. During the test phase, the tracker is reset if there is no overlap between the prediction and the ground truth. The expected average overlap (EAO), which considers both bounding-box overlap (accuracy) and the number of resets (robustness), serves as the major evaluation metric on VOT2016. In Table 3(a), we compare our method with the top-10 trackers, including C-COT, TCNN, SSAT, MLDF, Staple, DDC, EBT, SRBT, STAPLE+ and DNT. Table 3 shows that our tracker achieves the best EAO and R scores; furthermore, our tracker is much faster than the second-best tracker (C-COT).

VOT2017 Dataset. The VOT2017 [20] dataset contains 60 challenging sequences (replacing some simple sequences in VOT2016 with more difficult ones) and provides more accurate ground-truth annotations.
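For reference, the distance precision and AUC numbers quoted in Figure 4, Table 2 and Figure 7 follow the standard OTB evaluation protocol. The sketch below shows a generic way to compute such per-sequence scores from predicted and ground-truth bounding boxes; it adopts the common 21-point overlap-threshold sampling and is not the benchmark's own toolkit.

```python
import numpy as np

def otb_scores(pred, gt):
    """OTB-style distance precision (centre error <= 20 px) and AUC success
    score for one sequence; pred and gt are (N, 4) arrays of (x, y, w, h)."""
    # distance precision at the 20-pixel threshold
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    dist = np.linalg.norm(pc - gc, axis=1)
    precision_20 = float(np.mean(dist <= 20.0))

    # intersection-over-union per frame for the success plot
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    iou = inter / np.maximum(union, 1e-12)

    # success rate over 21 overlap thresholds; AUC is their mean
    thresholds = np.linspace(0.0, 1.0, 21)
    auc = float(np.mean([np.mean(iou > t) for t in thresholds]))
    return precision_20, auc
```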