From Open Set to Closed Set: Counting Objects by Spatial Divide-and-Conquer
Haipeng Xiong†, Hao Lu‡, Chengxin Liu†, Liang Liu†, Zhiguo Cao†, Chunhua Shen‡
† Huazhong University of Science and Technology, China  ‡ The University of Adelaide, Australia
{hpxiong, zgcao}@hust.edu.cn, hao.lu@adelaide.edu.au

Abstract

Visual counting, a task that predicts the number of objects from an image/video, is an open-set problem by nature, i.e., the count value can vary in [0, +∞) in theory. In reality, however, the collected images and labeled count values are limited, which means that only a small closed set is observed. Existing methods typically model this task in a regression manner, and they are likely to suffer in unseen scenes with counts outside the closed set. In fact, counting is decomposable: a dense region can always be divided until the sub-region counts fall within the previously observed closed set. Inspired by this idea, we propose a simple but effective approach, the Spatial Divide-and-Conquer Network (S-DCNet). S-DCNet learns only from a closed set but can generalize well to open-set scenarios via S-DC. S-DCNet is also efficient: to avoid repeatedly computing the convolutional features of sub-regions, S-DC is executed on the feature map rather than on the input image. S-DCNet achieves state-of-the-art performance on three crowd counting datasets (ShanghaiTech, UCF_CC_50 and UCF-QNRF), a vehicle counting dataset (TRANCOS) and a plant counting dataset (MTC). Compared to the previous best methods, S-DCNet brings relative improvements of 20.2% on ShanghaiTech Part B, 20.9% on UCF-QNRF, 22.5% on TRANCOS and 15.1% on MTC. Code is available at https://github.com/xhp-hust-2018-2011/S-DCNet.

1. Introduction

Visual counting in computer vision aims to infer the number of objects (people, cars, maize tassels, etc.) from an image/video. It has a wide range of applications, such as automated crowd surveillance [15, 16, 17, 37, 38], traffic monitoring [14, 25] and crop yield estimation [10, 13, 23], and has received much attention in recent years.

(Haipeng Xiong and Hao Lu contributed equally to this work. Zhiguo Cao is the corresponding author.)

Figure 1. Histogram of the count values of 64 × 64 local regions on the test set of the ShanghaiTech Part A dataset [38]. The orange curve denotes the relative mean absolute error (rMAE) of CSRNet [20] on local regions.

Counting is an open-set problem by nature, because the count value can theoretically vary from 0 to +∞, so it is usually modeled in a regression manner. Building on the success of convolutional neural networks (CNNs), state-of-the-art deep counting networks typically adopt multi-branch architectures to enhance the robustness of features for dense regions [2, 4, 38]. In practice, however, the patterns observed in a dataset are limited, which means the network can only learn from a closed set. Can these counting networks still produce accurate predictions when the number of objects goes beyond the scope of the closed set? Meanwhile, the observed local counts exhibit a long-tailed distribution, as shown in Figure 1: extremely dense regions are rare, while sparse regions account for the majority. One can observe that the rMAE increases significantly with local density. Is it necessary to set the working range of a CNN-based regressor to the maximum observed count value, even though most samples are so sparse that the regressor works poorly over this range?

In fact, counting has a unique property: it is spatially decomposable. The issues above can be largely alleviated by the idea of spatial divide-and-conquer (S-DC). Suppose a network has been trained to accurately predict a closed set of counts, say 0∼20. When facing an image with extremely dense objects, one can keep dividing the image into sub-images until all sub-region counts are less than 20. Then the network can accurately count these sub-images and sum over all local counts to obtain the global image count. Figure 2 graphically depicts the idea of S-DC.

Figure 2. An illustration of spatial divisions. Suppose that the closed set of counts is [0, 20]. In this example, dividing the image once is inadequate to ensure that all sub-region counts are within the closed set; the top-left sub-region needs a further division.

Figure 3. Spatial divisions on the input image (left) and the feature map (right). Spatially dividing the input image is straightforward: the image is upsampled and fed to the same network to infer the counts of local areas. The orange dashed line connects the local feature map, the local count and the sub-image. S-DC on the feature map avoids redundant computations and is achieved by upsampling, decoding and dividing the high-resolution feature map.

A follow-up question is how to spatially divide the count. A naive way is to upsample the input image, divide it into sub-images and process the sub-images with the same network. This way, however, is likely to blur the image and leads to exponentially increased computation cost and memory consumption when repeatedly extracting the feature map. Inspired by RoI pooling [12], we show that it is feasible to achieve S-DC on the feature map, as conceptually illustrated in Figure 3. By decoding and upsampling the feature map, the later prediction layers can focus on the features of local areas and predict sub-region counts accordingly.

To realize the above idea, we propose a simple but effective Spatial Divide-and-Conquer Network (S-DCNet). S-DCNet learns from a closed set of count values but is able to generalize to open-set scenarios. Specifically, S-DCNet adopts a VGG16 [30]-based encoder and a UNet [27]-like decoder to generate multi-resolution feature maps. All feature maps share the same counting predictor.
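The divide-until-within-the-closed-set idea can be illustrated with a toy recursion on ground-truth local counts. This is only a sketch under assumptions: the grid, the bound `cmax` and the function names are hypothetical, and the "predictor" here is a simple clip, whereas S-DCNet divides feature maps and predicts counts with a learned classifier.

```python
def clipped_predict(true_count, cmax):
    """Stand-in for a counter trained only on the closed set [0, cmax]:
    denser regions are under-predicted as cmax."""
    return min(true_count, cmax)

def sdc_count(grid, cmax):
    """Keep dividing a region into quadrants until its count falls
    within the closed set (or it is a single cell), then sum the
    per-region predictions."""
    total = sum(sum(row) for row in grid)
    if total <= cmax or len(grid) == 1:
        return clipped_predict(total, cmax)
    h, w = len(grid) // 2, len(grid[0]) // 2
    quadrants = [
        [row[:w] for row in grid[:h]], [row[w:] for row in grid[:h]],
        [row[:w] for row in grid[h:]], [row[w:] for row in grid[h:]],
    ]
    return sum(sdc_count(q, cmax) for q in quadrants)

# A 2x2 grid of true local counts with one very dense cell:
grid = [[30, 2], [1, 4]]            # true total = 37
direct = clipped_predict(37, 20)    # 20: capped by the closed set
divided = sdc_count(grid, 20)       # 27: only the dense cell is capped
```

Dividing recovers most of the error of the capped predictor; the residual error on the indivisible dense cell is the systematic error that the interval classifier of Section 3.1 further mitigates.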
Inspired by [19], in contrast to conventional density map regression, we discretize continuous count values into a set of intervals and design the counting predictor to be a classifier. Further, a division decider is designed to decide which sub-regions should be divided and to merge the different levels of sub-region counts into the global image count. We show through a controlled toy experiment that, even given a closed training set, S-DCNet effectively generalizes to the open test set. The effectiveness of S-DCNet is further demonstrated on three crowd counting datasets (ShanghaiTech [38], UCF_CC_50 [15] and UCF-QNRF [16]), a vehicle counting dataset (TRANCOS [14]), and a plant counting dataset (MTC [23]). Results show that S-DCNet exhibits a clear advantage over other competitors and sets the new state of the art across all five datasets.

The main contribution of this work is that we propose to transform open-set counting into a closed-set problem. We show through extensive experiments that a model learned in a closed set can effectively generalize to the open set with the idea of S-DC.

2. Related Work

Current CNN-based counting approaches are mainly built upon the framework of local regression. According to their regression targets, they fall into two categories: density map regression and local count regression. We first review these two types of regression. Since S-DCNet learns to classify counts, works that reformulate regression problems are also discussed.

Density Map Regression. The concept of the density map was introduced in [18]. A density map encodes the spatial distribution of objects and thus can be regressed smoothly. Zhang et al. [37] first adopted a CNN to regress local density maps, and almost all subsequent counting networks followed this idea. Among them, a typical network architecture is multi-branch. MCNN [38] and Switching-CNN [2] used three columns of CNNs with varying receptive fields to depict objects of different scales.
SANet [4] adopted Inception-like [34] modules to integrate extra branches. CP-CNN [32] added two extra density-level prediction branches to combine global and local contextual information. ACSCP [28] inserted a child branch to enforce cross-scale consistency and an adversarial branch to attenuate the blurring effect of the density map. ic-CNN [26] incorporated two branches to generate high-quality density maps in a coarse-to-fine manner. IG-CNN [1] and D-ConvNet [29] drew inspiration from ensemble learning and trained a series of networks or regressors to tackle different scenes. DecideNet [21] attempted to selectively fuse the results of density map estimation and object detection for different scenes. Unlike multi-branch approaches, Idrees et al. [16] employed a composition loss and simultaneously solved several counting-related tasks to assist counting. CSRNet [20] benefited from dilated convolutions, which effectively expand the receptive field to capture contextual information.

Existing deep counting networks aim to generate high-quality density maps. However, density maps are actually in the open set as well. A detailed discussion of the open-set problem in density maps is provided in the Supplement.

Local Count Regression. Local count regression directly predicts the count values of local image patches. This idea first appeared in [7], where a multi-output regression model was used to regress region-wise local counts simultaneously. [9] and [23] introduced this idea into deep counting: local patches were first densely sampled in a sliding-window manner with overlaps, a local count was then assigned to each patch by the network, and the inferred redundant local counts were finally normalized and fused into the global count. Stahl et al. [33] regressed the counts of object proposals generated by Selective Search [36] and combined local counts using an inclusion-exclusion principle.
Inspired by subitizing, the ability of humans to quickly count a few objects at a glance, Chattopadhyay et al. [5] shifted their focus to counting objects in everyday scenes, where the main challenge becomes large intra-class variance rather than the occlusions and perspective distortions of crowded scenes.

While some of the methods above [5, 33] also leverage the idea of spatial divisions, they still regress open-set counts. Although local-region patterns are easier to model than the whole image, the observed local patches are still limited. Since only finite local patterns (a closed set) can be observed, new scenes in reality are likely to include objects out of this range (an open set). Moreover, dense regions with large count values are rare (Figure 1), so these networks may suffer from sample imbalance. In this paper, we show that a counting network can learn from a closed set with a certain range of counts, say 0∼20, and then generalize to an open set (including counts > 20) via S-DC.

Beyond Naive Regression. Regression is a natural way to estimate continuous variables such as age and depth. However, some literature suggests reformulating regression as an ordinal regression or classification problem, which enhances performance and benefits optimization [6, 11, 19, 24]. Ordinal regression is usually implemented by modifying well-studied classification algorithms and has been applied to age estimation [24] and monocular depth prediction [11]. Li et al. [19] further showed that directly reformulating regression as classification is also a good choice. Since count values share similar properties with age and depth, we follow such a reformulation: S-DCNet follows [19] to discretize local counts and classify count intervals. Indeed, we observe in experiments that classification with S-DC works better than direct regression.

  classifier                    division decider
  2 × 2 AvgPool, s 2            2 × 2 AvgPool, s 2
  1 × 1 Conv, 512, s 1          1 × 1 Conv, 512, s 1
  1 × 1 Conv, class num, s 1    1 × 1 Conv, 1, s 1
  −                             Sigmoid

Table 1. The architecture of the classifier and the division decider. AvgPool denotes average pooling. Convolutional layers are defined in the format: Conv size × size, output channels, s stride. Each convolutional layer is followed by a ReLU except the last one. In particular, a sigmoid function is employed at the end of the division decider to generate soft division masks.

3. Spatial Divide-and-Conquer Network

In this section, we describe the transformation from quantity to interval, which transfers count values into a closed set, and explain our proposed S-DCNet in detail.

3.1. From Quantity to Interval

Instead of regressing an open set of count values, we follow [19] to discretize local counts and classify count intervals. Specifically, we define an interval partition of [0, +∞) as {0}, (0, C1], (C1, C2], ..., (CM−1, CM] and (CM, +∞). These M + 2 sub-intervals are labeled as the 0-th to the (M+1)-th classes, respectively. For example, if a count value falls within (C1, C2], it is labeled as the 2nd class. In practice, CM should be no greater than the maximum local count observed in the training set.

The median of each sub-interval is adopted when recovering a count from its interval. Note that, for the last sub-interval (CM, +∞), CM is used as the count value whenever a region is classified into it. Adopting CM for the last class clearly causes a systematic error, but this error can be mitigated via S-DC, as we will show in experiments.

3.2. Single-Stage Spatial Divide-and-Conquer

As shown in Figure 4, S-DCNet consists of a VGG16 [30] feature encoder, a UNet [27]-like decoder, a count-interval classifier and a division decider. The structures of the classifier and the division decider are shown in Table 1.
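The quantity-to-interval mapping of Section 3.1 and its inverse can be sketched in a few lines. This is a sketch under assumptions: the function names and the edge values are illustrative, and the labeling assumes the partition {0}, (0, C1], (C1, C2], ..., (CM−1, CM], (CM, +∞) with `edges = [C1, ..., CM]`.

```python
import bisect

def count_to_class(count, edges):
    """Label a local count: 0 -> {0}, i -> (C_{i-1}, C_i] for
    1 <= i <= M, and M + 1 -> (C_M, +inf)."""
    if count == 0:
        return 0
    # bisect_left maps counts in the half-open interval (C_{i-1}, C_i]
    # to index i - 1, so shifting by one yields the class label.
    return bisect.bisect_left(edges, count) + 1

def class_to_count(label, edges):
    """Recover a count from a label: 0 for {0}, the interval median for
    bounded intervals, and C_M for the open last interval."""
    M = len(edges)
    if label == 0:
        return 0.0
    if label == M + 1:
        return float(edges[-1])  # systematic error, mitigated by S-DC
    lo = 0.0 if label == 1 else edges[label - 2]
    return (lo + edges[label - 1]) / 2.0
```

For instance, with `edges = [0.5, 1.0, 2.0, 4.0]`, a local count of 0.7 falls in (0.5, 1] and gets class 2, which is recovered as the median 0.75; any count above 4.0 gets the last class and is recovered as 4.0.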
Note that the first average pooling layer in the classifier has a stride of 2, so the final prediction has an output stride of 64.

Figure 4. The architecture of S-DCNet (left) and a two-stage S-DC process (right). S-DCNet adopts all convolutional layers of VGG16 [30]; the first two convolutional blocks are simplified as Conv in the figure. A UNet [27]-like decoder is employed to upsample and divide the feature map as per Figure 3. A shared classifier and a division decider receive the divided feature maps and generate, respectively, division counts Ci and division masks Wi, for i = 1, 2, ... After obtaining these results, Ci and Wi are merged into the i-th division count DIVi, shown in the right sub-figure. In particular, each count at low resolution is averaged into the corresponding 2 × 2 area at high resolution before merging (avg in the figure). "◦" denotes the Hadamard product. Note that the 64 × 64 local patch is only used as an example to illustrate the pipeline of S-DCNet. Since S-DCNet is a fully convolutional network, it can process images of arbitrary size M × N and returns DIV2 of size M/64 × N/64. The structures of the classifier and the division decider are presented in Table 1.

The feature encoder removes the fully-connected layers from the pre-trained VGG16. Suppose that the input patch is of size 64 × 64. Given the feature map F0 (extracted from the Conv5 layer) with 1/32 the resolution of the input image, the classifier predicts the class label of the count interval, CLS0, conditioned on F0. The local count C0, which denotes the count value of the 64 × 64 input patch, can be recovered from CLS0. Note that C0 is the local count without S-DC, which is also the final output of previous approaches [5, 9, 23].

We execute the first-stage S-DC on the fused feature map F1. F1 is divided and sent to the shared classifier to produce the division count C1 ∈ R2×2. Concretely, F0 is upsampled by ×2 in a UNet-like manner to F1.
Given F1, the classifier fetches the local features that correspond to the spatially divided sub-regions and predicts the first-level division counts C1. Each of the 2 × 2 elements in C1 denotes the sub-count of the corresponding 32 × 32 sub-region.

With the local counts C0 and C1, the next question is where to divide. We learn such decisions with another network module, the division decider, as depicted in the right part of Figure 4. At the first stage of S-DC, the division decider generates a soft division mask W1 of the same size as C1, conditioned on F1, such that w ∈ [0, 1] for any w ∈ W1. w = 0 means that no division is required at this position, so the value in C0 is used; w = 1 implies that the initial prediction should be replaced with the division count in C1. Since W1 and C1 are both 2 times larger than C0, C0 is upsampled by ×2 to ˆC0, and each count is averaged into the corresponding 2 × 2 local area in ˆC0. The first-stage division result DIV1 can thus be computed as

    DIV1 = (✶ − W1) ◦ avg(C0) + W1 ◦ C1 ,    (1)

where ✶ denotes a matrix filled with ones, of the same size as W1, "◦" denotes the Hadamard product, and avg is an averaging r

Algorithm 1: Multi-Stage S-DC
  Input: image I and division time N
  Output: image count C
   1  Extract F0 from I;
   2  Generate CLS0 given F0 with the classifier, and recover C0 from CLS0;
   3  Initialize DIV0 = C0;
   4  for i ← 1 to N do
   5      Decode Fi−1 to Fi;
   6      Process Fi with the classifier and the division decider to obtain CLSi and the division mask Wi;
   7      Recover Ci from CLSi;
   8      Update DIVi as per Eq. 2;
   9  Integrate over DIVN to obtain the image count C;
  10  return C
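The merge of Eq. (1), repeated across stages by Algorithm 1, can be sketched in plain Python. This is a sketch under assumptions: in S-DCNet the mask W1 and counts C1 are predicted by learned modules, whereas here they are given by hand, and all names are illustrative.

```python
def upsample_avg(counts):
    """The avg operator in Eq. (1): x2 upsampling that spreads each
    count evenly over its 2 x 2 footprint, preserving the total."""
    out = []
    for row in counts:
        up = []
        for c in row:
            up += [c / 4.0, c / 4.0]
        out.append(up)
        out.append(list(up))
    return out

def merge_division(prev_div, div_counts, mask):
    """DIVi = (1 - Wi) o avg(DIV_{i-1}) + Wi o Ci, with o the
    element-wise (Hadamard) product."""
    up = upsample_avg(prev_div)
    return [[(1.0 - w) * a + w * c for w, a, c in zip(wr, ar, cr)]
            for wr, ar, cr in zip(mask, up, div_counts)]

# One stage: keep the initial patch count C0 = 20 everywhere except
# the top-left sub-region, where the division count replaces it.
C0 = [[20.0]]
C1 = [[30.0, 2.0], [1.0, 4.0]]      # hypothetical 2x2 division counts
W1 = [[1.0, 0.0], [0.0, 0.0]]       # divide only the top-left region
DIV1 = merge_division(C0, C1, W1)   # [[30.0, 5.0], [5.0, 5.0]]
```

A second stage would upsample DIV1 to 4 × 4 via the same avg operator and merge it with C2 and W2, exactly as the update step of Algorithm 1; integrating over the final DIVN yields the image count.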