Hierarchically Robust Representation Learning: Learning More Generic Deep Features via Hierarchically Robust Optimization for Visual Tasks
Qi Qian¹   Juhua Hu²   Hao Li¹
¹Alibaba Group   ²School of Engineering and Technology, University of Washington, Tacoma, USA
{qi.qian, lihao.lh}@alibaba-inc.com, juhuah@uw.edu

Abstract

With the tremendous success of deep learning on visual tasks, the representations extracted from intermediate layers of a learned model, known as deep features, have attracted much attention from researchers. Previous empirical analyses show that these features can carry appropriate semantic information, so features extracted from a model trained on a large-scale benchmark data set (e.g., ImageNet) can perform well on other tasks. In this work, we investigate this phenomenon and demonstrate that deep features can be suboptimal, since they are learned by minimizing the empirical risk. When the data distribution of the target task differs from that of the benchmark data set, the performance of deep features can degrade. Hence, we propose a hierarchically robust optimization method to learn more generic features. Considering example-level and concept-level robustness simultaneously, we formulate the problem as a distributionally robust optimization problem with a Wasserstein ambiguity set constraint, and propose an efficient algorithm that is compatible with the conventional training pipeline. Experiments on benchmark data sets demonstrate the effectiveness of the robust deep representations.

1. Introduction

Extracting appropriate representations is essential for visual recognition. In the past decades, various hand-crafted features have been developed to capture the semantics of images, e.g., SIFT [16], HOG [7], etc. The conventional pipeline consists of two stages. In the first stage, representations are extracted from each image with a given pattern. Then, a specific model (e.g., SVM [6]) is trained with these features for the target task. Since hand-crafted features are task-independent, the performance of this pipeline can be suboptimal. Deep learning integrates these stages by training an end-to-end convolutional neural network. Unlike SIFT [16], which requires explicit feature design, deep learning learns the task-dependent representation through multiple layers, and a fully connected layer is attached at the end as a linear classifier for recognition. Benefiting from this coherent structure, deep learning promotes the performance on visual tasks dramatically, e.g., categorization [15], detection [21], etc. Despite the success of deep learning on large-scale data sets, deep neural networks (DNNs) easily overfit small data sets due to their large number of parameters. Besides, DNNs require GPUs for efficient training, which is expensive.

Researchers have therefore attempted to leverage pre-trained DNNs to improve the feature design mechanism. Surprisingly, it is observed that the features extracted from the last few layers perform well on generic tasks when the model is pre-trained on a large-scale benchmark data set, e.g., ImageNet [22]. Deep features, which are outputs from intermediate layers of a deep model, have become popular as a light-weight substitute for training deep models. Systematic comparisons show that these deep features outperform existing hand-crafted features by a large margin [8, 17, 20].

The objectives of learning deep models for specific tasks and deep features for generic tasks can be different, but little effort has been devoted to further investigating deep features. When learning a deep model, the objective focuses on optimizing the performance on the current training data set. In contrast, deep features should be learned for generic tasks rather than for a single data set. In applications of deep features, it has also been observed that deep features can fail when the data distribution of the generic task differs from that of the benchmark data set [28]. By investigating the objective of learning a model for a given task, we find that it is a standard empirical risk minimization (ERM) problem optimized over a uniform distribution of examples. It is well known that a model obtained by ERM can generalize well on data from the same distribution as the training data [3].

However, the data distribution in real applications can be significantly different from that of the benchmark data set, which can degrade the performance when representations learned with ERM are adopted. The difference can come from at least two aspects. First, the distribution of examples within each class can differ between the generic task and the benchmark data set, which is referred to as the example-level distribution difference in this work. Taking the 7th image from ImageNet and the 2nd image from CIFAR-10 in Figure 1 as an example, they differ in resolution and pose, yet both belong to the car class. This issue has attracted much attention recently, and several methods have been developed to optimize the worst-case performance [5, 18, 24]. Second, the distribution of concepts in an application can also differ from that in the benchmark data set. Note that each concept can contain multiple classes, e.g., bulldog, beagle, etc. under the concept "dog". This concept-level distribution difference is less studied, but it is more critical for deploying deep features, since the concepts in a real application may be only a subset of, or partially overlap with, those in the benchmark data set. For example, the concepts in SOP are very different from those in ImageNet, as illustrated in Figure 1.

Figure 1. Examples from ImageNet, CIFAR-10, and SOP. Within the same class (e.g., car), the example-level distribution difference in aspects such as resolution and pose can be observed from the 7th image of ImageNet and the 2nd image of CIFAR-10. The concept-level distribution difference between ImageNet and SOP is significant: ImageNet contains many classes from the "animal" concept, while SOP only contains classes from the "artifact" concept.

In this work, we propose to consider the differences between examples and the differences between concepts simultaneously, and to learn hierarchically robust representations from DNNs. Compared with ERM, our algorithm is more consistent with the objective of learning generic deep features. For the example-level robustness, we adopt a Wasserstein ambiguity set [24] to encode the uncertainty in examples for efficient optimization. Our theoretical analysis also illustrates that appropriate data augmentation is better than regularization, since the former provides a tighter approximation to the optimization problem. For the concept-level robustness, we formulate it as an adversarial game between the deep model and the different concepts, so as to optimize the worst-case performance over concepts. By learning deep features with the adversarial distribution, the worst-case performance over concepts can be improved. Finally, to keep the training pipeline simple, we develop an algorithm that re-weights the gradients obtained with the standard random sampling strategy at each iteration for an unbiased estimation. This step may increase the variance of the gradients, and we reduce the variance by carefully setting the learning rate. We show that the adversarial distribution converges at a rate of O(log(T)/T), where T denotes the total number of iterations. We take ImageNet as the benchmark data set for learning deep features, and the empirical study on real data sets verifies the effectiveness of our method.

The rest of this paper is organized as follows: Section 2 reviews related work; Section 3 introduces the proposed method; Section 4 conducts experiments on benchmark data sets; and Section 5 concludes this work and discusses future directions.
2. Related Work

Deep Features: Deep learning has become popular since ImageNet ILSVRC12, and various DNN architectures have been proposed, e.g., AlexNet [15], VGG [23], GoogLeNet [27], and ResNet [12]. Besides the success on image categorization, features extracted from the last few layers have been applied to generic tasks. [8] adopts the deep features from the last two layers of AlexNet and shows impressive performance on visual recognition across different applications. After that, [20] applies deep features to distance metric learning and achieves overwhelming performance compared with hand-crafted features on fine-grained visual categorization. [17] compares deep features from different neural networks, and ResNet shows the best results. Besides models pre-trained on ImageNet, [28] proposes to learn deep features with a large-scale scene data set to improve the performance on the scene recognition task. All of these works directly extract features from a model learned with ERM as the objective. In contrast, we develop an algorithm that is tailored to learning robust deep representations. Note that deep features can be extracted from multiple layers of deep models; we focus on the layer before the final fully-connected layer in this work.

Robust Optimization: Recently, distributionally robust optimization, which aims to optimize the worst-case performance, has attracted much attention [5, 18, 24]. [18] proposes to optimize the performance under the worst-case distribution over examples derived from the empirical distribution. [5] extends the problem to non-convex loss functions, but requires a near-optimal oracle for the non-convex problem to learn the robust model. [24] introduces an adversarial perturbation on each example for robustness. Most of these algorithms only consider the example-level robustness. In contrast, we propose hierarchically robust optimization, which considers the example-level and concept-level robustness simultaneously, to learn generic deep representations for real applications.

3. Hierarchical Robustness

3.1. Problem Formulation

Let x_i denote an image and y_i ∈ {1, ..., C} its corresponding label for a C-class classification problem. Given a benchmark data set {x_i, y_i}, i = 1, ..., N, the parameter θ of a deep neural network can be learned by solving the optimization problem

$$\min_\theta \frac{1}{N}\sum_i \ell(x_i, y_i; \theta) \qquad (1)$$

where ℓ(·) is a non-negative loss function (e.g., the cross-entropy loss). By decomposing the parameter θ as θ = {δ, ω}, where ω denotes the parameters of the final fully-connected layer and δ denotes the parameters of the other layers, which can be regarded as a feature extraction function f(·), we can rewrite the original problem as

$$\min_\theta \frac{1}{N}\sum_i \ell(f(x_i), y_i; \omega)$$

Considering that ω parameterizes a linear classifier, which is consistent with the classifiers applied in real-world applications (e.g., SVM), the decomposition shows that the problem of learning generic deep features f(x) can be addressed by learning a robust deep model on the benchmark data set.
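As a concrete illustration of this decomposition, the sketch below (assuming PyTorch and torchvision; variable names are illustrative, not the authors' code) splits a standard network into the feature extraction function f(·) with parameters δ and a final linear classifier with parameters ω, so that the empirical risk in Eqn. 1 is computed on the deep features f(x).

```python
# A minimal sketch of the decomposition theta = {delta, omega}: delta parameterizes
# the feature extractor f(.) and omega parameterizes the final fully-connected layer,
# i.e., a linear classifier on top of the deep features.
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

num_classes = 1000  # C

resnet = models.resnet50(pretrained=True)
feature_extractor = nn.Sequential(*list(resnet.children())[:-1])   # f(.), parameters delta
classifier = nn.Linear(resnet.fc.in_features, num_classes)         # linear classifier, parameters omega

def empirical_risk(images, labels):
    """Mini-batch estimate of Eqn. 1, written as l(f(x_i), y_i; omega)."""
    features = feature_extractor(images).flatten(1)   # deep features: layer before the final FC layer
    logits = classifier(features)
    return F.cross_entropy(logits, labels)
```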
The original problem in Eqn. 1 is an empirical risk minimization (ERM) problem, which can be inappropriate for learning generic representations. In the following, we explore hierarchical robustness to obtain robust deep representations for generic tasks.

First, we consider the example-level robustness. Unlike ERM, a robust model minimizes the loss under the worst-case distribution derived from the empirical distribution. The optimization problem can be cast as a game between the prediction model and the adversarial distribution,

$$\min_\theta \max_i \{\ell(x_i, y_i; \theta)\}$$

which is equivalent to

$$\min_\theta \max_{p\in\mathbb{R}^N;\, p\in\Delta} \sum_i p_i\, \ell(x_i, y_i; \theta)$$

where p is the adversarial distribution over training examples and Δ is the simplex, Δ = {p | Σ_i p_i = 1; ∀i, p_i ≥ 0}. When p is the uniform distribution, the distributionally robust optimization reduces to ERM.

Without any constraints, the adversarial distribution is sensitive to outliers and can be arbitrarily far away from the empirical distribution, which leads to a large variance from the selected examples. Therefore, we introduce a regularizer to constrain the space of the adversarial distribution, which provides a trade-off between the bias (i.e., toward the empirical distribution) and the variance of the adversarial distribution. The problem can be written as

$$\min_\theta \max_{p\in\mathbb{R}^N;\, p\in\Delta} \sum_i p_i\, \ell(x_i, y_i; \theta) - \lambda_e D(p\|p_0) \qquad (2)$$

where p_0 is the empirical distribution and D(·) measures the distance between the learned adversarial distribution and the empirical distribution. We apply the squared L2 distance in this work, i.e., D(p||p_0) = ||p - p_0||_2^2. The regularizer guarantees that the generated adversarial distribution is not too far away from the empirical distribution. It implies that the adversarial distribution comes from the ambiguity set

$$p \in \{p : D(p\|p_0) \le \epsilon\}$$

where ε is determined by λ_e.

Besides the example-level robustness, concept-level robustness is more important for learning generic features. A desired model should perform consistently well over different concepts. Assuming that there are K concepts in the training set and each concept consists of N_k examples, the concept-robust optimization problem is

$$\min_\theta \max_k \Big\{\frac{1}{N_k}\sum_i^{N_k} \ell(x_i^k, y_i^k; \theta)\Big\}$$

With an analysis similar to that for the example-level robustness and an appropriate regularizer, the problem becomes

$$\min_\theta \max_{q\in\mathbb{R}^K;\, q\in\Delta} \sum_k \frac{q_k}{N_k}\sum_i^{N_k} \ell(x_i^k, y_i^k; \theta) - \lambda_c D(q\|q_0) \qquad (3)$$

where q_0 can be set as q_0^k = N_k/N.
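For a fixed θ, the inner maximization in Eqn. 2 (and likewise in Eqn. 3) has a simple form: it is a quadratic program over the simplex whose maximizer is the Euclidean projection of p_0 + ℓ/(2λ_e) onto the simplex, where ℓ stacks the per-example (or per-concept) losses. The sketch below, assuming PyTorch and the sorting-based projection referenced as [9] in Section 3.3, is only meant to illustrate how the regularizer trades off between the empirical distribution and the worst-case re-weighting; it is not the stochastic procedure developed later.

```python
import torch

def project_onto_simplex(v):
    """Euclidean projection of v onto {p : sum(p) = 1, p >= 0} (sorting-based, as in [9])."""
    u, _ = torch.sort(v, descending=True)
    css = torch.cumsum(u, dim=0)
    idx = torch.arange(1, v.numel() + 1, dtype=v.dtype)
    rho = torch.nonzero(u - (css - 1.0) / idx > 0).max()
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return torch.clamp(v - theta, min=0.0)

def worst_case_distribution(losses, p0, lam):
    """argmax_{p in simplex} p^T losses - lam * ||p - p0||_2^2 = proj(p0 + losses / (2 * lam))."""
    return project_onto_simplex(p0 + losses / (2.0 * lam))

# Example: three concepts with average losses and empirical frequencies q0_k = N_k / N.
losses = torch.tensor([0.8, 2.5, 1.1])
q0 = torch.tensor([0.5, 0.2, 0.3])
q = worst_case_distribution(losses, q0, lam=1.0)   # mass shifts toward the hardest concept
```

As λ grows, the maximizer collapses to the empirical distribution p_0; as λ shrinks, the weight concentrates on the examples or concepts with the largest losses.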
Combined with the example-level robustness, the hierarchically robust optimization problem becomes

$$\min_\theta \max_{p\in\mathbb{R}^N, p\in\Delta;\; q\in\mathbb{R}^K, q\in\Delta} \sum_k \frac{q_k}{N_k}\sum_i^{N_k} p_i\, \ell(x_i^k, y_i^k; \theta) - \lambda_e D(p\|p_0) - \lambda_c D(q\|q_0)$$

In this formulation, each example is associated with the parameters p_i and q_k. The high dimensionality together with this coupling structure makes efficient optimization challenging. Since K ≪ N, we decouple the hierarchical robustness with an alternative formulation for the example-level robustness, as follows.

3.2. Wasserstein Ambiguity Set

In Eqn. 2, the ambiguity set is defined by the distance to the uniform distribution over the training set. It introduces the adversarial distribution by re-weighting each example, which couples its parameters with those of the concept-level problem. To simplify the optimization, we instead generate the ambiguity set for the adversarial distribution with the Wasserstein distance [24]. The properties of the Wasserstein distance help decouple the example-level robustness from the concept-level robustness.

Assume that P is a data-generating distribution over the data space and P_0 is the empirical distribution from which the training set is generated, x ∼ P_0. The ambiguity set for the distribution P can be defined as

$$\{P : W(P, P_0) \le \epsilon\}$$

where $W(P, P_0) = \inf_{M\in\Pi(P, P_0)} \mathbb{E}_M[d(\hat{x}, x)]$ is the Wasserstein distance between distributions [24], x̂ denotes an example generated from P, and d(·, ·) is the transportation cost between examples.

The problem of example-level robustness can then be written as

$$\min_\theta \max_P \mathbb{E}_P[\ell(\hat{x}, y; \theta)] - \frac{\lambda_w}{2} W(P, P_0)$$

According to the definition of the Wasserstein distance [24], and taking the cost function to be the squared Euclidean distance, the problem is equivalent to

$$\min_\theta \max_{\hat{x}\in\mathcal{X}} \sum_i \ell(\hat{x}_i, y_i; \theta) - \frac{\lambda_w}{2}\sum_i \|\hat{x}_i - x_i\|_F^2$$

where X is the data space. In [24], the optimal x̂_i is obtained by solving the subproblem for each example at each iteration. To accelerate the optimization, we propose to minimize an upper bound of the subproblem, which also provides an insight into the comparison between regularization and augmentation.

The main theoretical results are stated in the following theorems; their proofs can be found in the supplementary material. First, we give the definition of smoothness.

Definition 1. A function f is L_z-smooth in z w.r.t. a norm ||·|| if there is a constant L_z such that for any two values z′ and z″ of z,

$$f(z'') \le f(z') + \langle \nabla f(z'), z'' - z' \rangle + \frac{L_z}{2}\|z'' - z'\|^2$$

Theorem 1. Assuming that ℓ(·) is L_x-smooth in x and ∇_x ℓ is L_θ-Lipschitz continuous in θ, we have

$$\max_{\hat{x}_i\in\mathcal{X}} \ell(\hat{x}_i, y_i; \theta) - \frac{\lambda_w}{2}\|\hat{x}_i - x_i\|_F^2 \le \ell(x_i, y_i; \theta) + \frac{\gamma}{2}\|\theta\|_F^2$$

where λ_w is sufficiently large such that λ_w > L_x, and γ = L_θ² / (λ_w − L_x).

Theorem 2. With the same assumptions as in Theorem 1, and considering an additive augmentation of the original image with z,

$$\tilde{x}_i = x_i + \tau z_i$$

we have

$$\max_{\hat{x}_i\in\mathcal{X}} \ell(\hat{x}_i, y_i; \theta) - \frac{\lambda_w}{2}\|\hat{x}_i - x_i\|_F^2 \le \ell(\tilde{x}_i, y_i; \theta) + \frac{\gamma}{2}\|\theta\|_F^2 - \alpha$$

where

$$\tau = \frac{\langle \nabla_{x_i}\ell, z_i\rangle}{3 L_x \|z_i\|_F^2}$$

and α is a non-negative constant,

$$\alpha = \frac{\lambda_w}{\lambda_w - L_x}\cdot\frac{\langle \nabla_{x_i}\ell, z_i\rangle^2}{6 L_x \|z_i\|_F^2}$$

Theorem 1 shows that learning the model on the original examples with a regularization of the model complexity, e.g., weight decay with coefficient γ, can make the learned model robust to examples from the ambiguity set. A similar result has been observed in conventional robust optimization [1]. However, regularization alone is not sufficient to train good enough DNNs, and many optimization algorithms have to rely on augmented examples to obtain models with better generalization performance.

Theorem 2 interprets this phenomenon by analyzing a specific augmentation that adds a patch z to the original image, and shows that augmented examples can provide a tighter bound on the loss of the examples in the ambiguity set. Moreover, the augmented patch z_i corresponds to the gradient of the original example x_i: to make the approximation tight, it should be aligned with the direction of the gradient. We therefore set z_i = ∇_{x_i}ℓ / ||∇_{x_i}ℓ||_F, which is similar to adversarial training [11].
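Theorem 2 thus suggests augmenting each image along the direction of its own loss gradient. The sketch below (assuming PyTorch and NCHW image tensors) constructs x̃ = x + τz with z_i the normalized input gradient; since z_i then has unit norm, τ reduces to ||∇_{x_i}ℓ||_F / (3 L_x). The smoothness constant L_x is not available in practice, so it is treated here as a hyperparameter.

```python
import torch
import torch.nn.functional as F

def augment_with_gradient_direction(model, images, labels, L_x=1.0):
    """Return x_tilde = x + tau * z, with z the normalized per-example input gradient."""
    images = images.clone().detach().requires_grad_(True)
    # Summed loss so that the input gradient of each example is not rescaled by the batch size.
    loss = F.cross_entropy(model(images), labels, reduction='sum')
    grad = torch.autograd.grad(loss, images)[0]
    # Per-example gradient norm (Frobenius norm over channel and pixel dimensions).
    grad_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
    z = grad / grad_norm               # z_i = grad_i / ||grad_i||_F, so ||z_i||_F = 1
    tau = grad_norm / (3.0 * L_x)      # tau = <grad_i, z_i> / (3 * L_x * ||z_i||^2) = ||grad_i|| / (3 * L_x)
    return (images + tau * z).detach()
```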
Combining this with the concept-level robustness in Eqn. 3, we obtain the final objective for learning hierarchically robust representations:

$$\min_\theta \max_{q\in\mathbb{R}^K;\, q\in\Delta} L(q, \theta) = \sum_k \frac{q_k}{N_k}\sum_i \ell(\tilde{x}_i^k, y_i^k; \theta) + \frac{\gamma}{2}\|\theta\|_F^2 - \frac{\lambda}{2}\|q - q_0\|_2^2 \qquad (4)$$

3.3. Efficient Optimization

The problem in Eqn. 4 can be solved efficiently by stochastic gradient descent (SGD). In the standard training pipeline for ERM in Eqn. 1, a mini-batch of examples is randomly sampled at each iteration and the model is updated by gradient descent,

$$\theta_{t+1} = \theta_t - \eta_\theta \frac{1}{m}\sum_i^m \nabla_\theta \ell(x_i, y_i; \theta_t)$$

where m is the size of the mini-batch.

For the problem in Eqn. 4, each example has a weight q_k/N_k, and the gradient has to be re-weighted for an unbiased estimation,

$$\theta_{t+1} = \theta_t - \eta_\theta \Big(\frac{1}{m}\sum_i^m \frac{N}{N_k} q_k \nabla_\theta \ell(\tilde{x}_i^k, y_i^k; \theta_t) + \gamma\theta_t\Big) \qquad (5)$$

For the adversarial distribution q, each concept has a weight q_k, and a straightforward way is to sample a mini-batch of examples from each concept to estimate the gradient of the distribution. However, the number of concepts varies, and it can be larger than the size of a mini-batch. Besides, it would result in different sampling strategies for computing the gradient of the deep model and that of the adversarial distribution, which increases the complexity of the training system. To address this issue, we adopt the same random sampling pipeline and update the distribution by weighted gradient ascent,

$$\hat{q}_{t+1}^k = q_t^k + \eta_q^t\Big(\frac{1}{m}\sum_j^{m_k} \frac{N}{N_k}\ell(\tilde{x}_j^k, y_j^k; \theta_t) - \lambda(q_t^k - q_0^k)\Big); \qquad q_{t+1} = P_\Delta(\hat{q}_{t+1}) \qquad (6)$$

where m_k is the number of examples from the k-th concept in the mini-batch, Σ_k m_k = m, and P_Δ(·) projects the vector onto the simplex as in [9].

The re-weighting strategy makes the gradient unbiased but introduces additional variance. Since batch normalization [13] is inapplicable to the parameters of the adversarial distribution, which lie on the simplex, we develop a learning strategy to reduce the variance of the gradients. First, to illustrate the issue, let δ_1 and δ_2 be two binary random variables with

$$\Pr\{\delta_1 = 1\} = \frac{1}{N_k}; \qquad \Pr\{\delta_2 = 1\} = \frac{1}{N}$$

Obviously, we have E[δ_1] = 1/N_k and E[N δ_2 / N_k] = 1/N_k. It demonstrates that the gradient after re-weighting is unbiased.
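Putting Eqn. 5 and Eqn. 6 together, one training iteration on a randomly sampled mini-batch can be sketched as follows (assuming PyTorch; `project_onto_simplex` is the routine sketched earlier, `concepts` holds the concept index of each example, and `counts[k] = N_k`; these names are illustrative rather than the authors' implementation).

```python
import torch
import torch.nn.functional as F

def hierarchically_robust_step(model, optimizer, images, labels, concepts,
                               q, q0, counts, N, lr_q, lam):
    """One iteration of the weighted updates in Eqn. 5 (model) and Eqn. 6 (distribution q)."""
    # Example-level robustness could be added here by replacing `images` with the
    # augmented x_tilde from `augment_with_gradient_direction` (previous sketch).
    logits = model(images)
    per_example_loss = F.cross_entropy(logits, labels, reduction='none')   # l(x_i^k, y_i^k; theta)

    # Eqn. 5: weight each example by (N / N_k) * q_k for an unbiased gradient estimate;
    # the gamma * theta term corresponds to weight decay, handled here by the optimizer.
    weights = (N / counts[concepts]) * q[concepts]
    loss = (weights * per_example_loss).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Eqn. 6: weighted gradient ascent on q, then projection onto the simplex.
    with torch.no_grad():
        ascent = torch.zeros_like(q)
        ascent.index_add_(0, concepts, (N / counts[concepts]) * per_example_loss.detach())
        ascent = ascent / images.size(0) - lam * (q - q0)
        q = project_onto_simplex(q + lr_q * ascent)
    return q
```

Here `lr_q` plays the role of η_q^t, which the paper sets carefully to control the additional variance introduced by re-weighting.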