Unsupervised Domain Adaptation Methods and Advances in Multi-Level UDA
Gotta Adapt 'Em All: Joint Pixel and Feature-Level Domain Adaptation for Recognition in the Wild

Luan Tran1∗  Kihyuk Sohn2  Xiang Yu2  Xiaoming Liu1  Manmohan Chandraker2,3
1Michigan State University  2NEC Labs America  3UC San Diego

Abstract

Recent developments in deep domain adaptation have allowed knowledge transfer from a labeled source domain to an unlabeled target domain at the level of intermediate features or input pixels. We propose that advantages may be derived by combining them, in the form of different insights that lead to a novel design and complementary properties that result in better performance. At the feature level, inspired by insights from semi-supervised learning, we propose a classification-aware domain adversarial neural network that brings target examples into more classifiable regions of the source domain. Next, we posit that computer vision insights are more amenable to injection at the pixel level. In particular, we use 3D geometry and image synthesis based on a generalized appearance flow to preserve identity across pose transformations, while using an attribute-conditioned CycleGAN to translate a single source into multiple target images that differ in lower-level properties such as lighting. Besides standard UDA benchmarks, we validate on a novel and apt problem of car recognition in unlabeled surveillance images using labeled images from the web, handling explicitly specified, nameable factors of variation through pixel-level adaptation and implicit, unspecified factors through feature-level adaptation.

1. Introduction

Deep learning has made an enormous impact on many applications in computer vision such as generic object recognition [22, 44, 48, 17], fine-grained categorization [59, 21, 41], object detection [26, 27, 28, 36, 37], semantic segmentation [6, 42] and 3D reconstruction [53, 52]. Much of its success is attributed to the availability of large-scale labeled training data [8, 15].
However, this is hardly true in many practical scenarios: since annotation is expensive, most data remains unlabeled. Consider the car recognition problem from surveillance images, where factors such as camera angle, distance, lighting or weather condition differ across locations. It is not feasible to exhaustively annotate all these images. Meanwhile, there exists abundant labeled data from the web domain [21, 62, 12], but with very different image characteristics that preclude direct transfer of discriminative CNN-based classifiers. For instance, web images might be from catalog magazines with professional lighting and ground-level camera poses, while surveillance images can originate from cameras atop traffic lights with challenging lighting and weather conditions.

∗This work was done while L. Tran was an intern at NEC Labs America.

[Figure in Table 1: initially source ≠ target; a semi-supervised learning insight aligns the domains at the feature level and vision insights align them at the pixel level, so that source ≈ target.]

Feature \ Pixel   | –    | CycleGAN | MKF+AC-CGAN (ours)
–                 | 55.0 | 64.3     | 79.7
DANN              | 60.4 | 64.8     | 78.0
DANN-CA (ours)    | 75.8 | 77.7     | 84.2

Table 1: Our framework for unsupervised domain adaptation at multiple semantic levels: at the feature level, we bring insights from semi-supervised learning to obtain highly discriminative domain-invariant representations; at the pixel level, we leverage complementary domain-specific vision insights, e.g., geometry and attributes. Our joint pixel- and feature-level DA demonstrates significant improvement over the individual adaptation counterparts as well as other competing methods such as CyCADA (CycleGAN+DANN) [18] on car recognition in the surveillance domain under the UDA setting. Please see Section 5 for complete experimental analysis.

Unsupervised domain adaptation (UDA) is a promising tool to overcome the lack of labeled training data in target domains. Several approaches aim to match distributions between source and target domains at different levels of representation, such as the feature [57, 56, 11, 45, 31] or pixel level [49, 43, 66, 3].
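One classical instantiation of feature-level distribution matching is the kernel-based maximum mean discrepancy (MMD) revisited in Section 2. As an illustration only (not this paper's method; the function names are mine), a minimal numpy sketch of the biased squared-MMD estimate with an RBF kernel:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel values k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2).
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2(xs, xt, gamma=1.0):
    # Biased estimate of squared MMD between source and target feature sets:
    # mean k(xs, xs) + mean k(xt, xt) - 2 * mean k(xs, xt).
    return (rbf_kernel(xs, xs, gamma).mean()
            + rbf_kernel(xt, xt, gamma).mean()
            - 2.0 * rbf_kernel(xs, xt, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0.0, 1.0, (200, 8)), rng.normal(0.0, 1.0, (200, 8)))
shifted = mmd2(rng.normal(0.0, 1.0, (200, 8)), rng.normal(3.0, 1.0, (200, 8)))
```

The estimate is near zero when source and target features follow the same distribution and grows with the domain shift, which is the quantity MMD-based DA methods minimize.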
Certain adaptation challenges are better handled in the feature space, but feature-level DA is a black-box algorithm for which adding domain-specific insights during adaptation is more difficult than in pixel space. On the contrary, pixel space is much higher-dimensional and the optimization problem is under-determined. How to effectively combine them has become an open challenge.

In this work we address this challenge by leveraging complementary tools that are better suited at each level (see the figure in Table 1). Specifically, we posit that feature-level DA is more amenable to techniques from semi-supervised learning (SSL), while pixel-level DA allows domain-specific insights from computer vision. In Section 3, we present our feature-level DA method, called classification-aware domain adversarial neural network (DANN-CA), which jointly parameterizes the classifier and domain discriminator, inspired by an instance of an SSL algorithm [40]. We show this to be a generalization of DANN [11] that incorporates constraints (Fig. 1) guiding the discriminator to easily find the major modes corresponding to classes in the feature space, and in turn pull target examples into more classifiable regions via the adversarial loss.

A challenge for pixel-level DA is to simultaneously transform source image properties at multiple semantic levels. In Section 4, we present pixel-level DA by image transformations that make use of vision concepts to deal with different factors of variation, such as photometric or geometric transformations (Fig. 2),¹ for recognition in the surveillance domain. To handle low-level transformations, we propose an attribute-conditioned CycleGAN (AC-CGAN) that extends [66] to generate multiple target images with different attributes. To handle high-level identity-preserving pose transformations, we use an appearance flow (AF) [65], a warping-based image synthesis tool.
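An appearance flow specifies, for each pixel of the synthesized view, a sampling coordinate in the source image; the output is obtained by bilinear interpolation at those coordinates. A minimal numpy sketch of this warping step on a grayscale image (the flow field here is hand-crafted for illustration; in [65] a network predicts it):

```python
import numpy as np

def warp_bilinear(img, flow_x, flow_y):
    # Sample img at the per-pixel coordinates (flow_x, flow_y) with bilinear
    # interpolation -- the core operation of appearance-flow view synthesis.
    h, w = img.shape
    x0 = np.clip(np.floor(flow_x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(flow_y).astype(int), 0, h - 2)
    ax = np.clip(flow_x - x0, 0.0, 1.0)   # fractional offsets
    ay = np.clip(flow_y - y0, 0.0, 1.0)
    return ((1 - ay) * (1 - ax) * img[y0, x0]
            + (1 - ay) * ax * img[y0, x0 + 1]
            + ay * (1 - ax) * img[y0 + 1, x0]
            + ay * ax * img[y0 + 1, x0 + 1])

img = np.arange(16.0).reshape(4, 4)
ys, xs = np.mgrid[0:4, 0:4].astype(float)
shifted = warp_bilinear(img, xs + 1.0, ys)  # shifts content one pixel left
```

Because every output pixel copies appearance from the source image rather than generating it, this kind of warping is attractive for identity-preserving pose transformation.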
To reduce the semantic gap between synthetic and real images, we propose a generalization of AF with 2D keypoints [25] as a domain bridge.

In Section 5, we evaluate our framework on car recognition in surveillance images from the comprehensive cars (CompCars) dataset [62]. We define an experimental protocol with web images as the labeled source domain and surveillance images as the unlabeled target domain. We explicitly handle nameable factors of variation such as pose and lighting through pixel-level DA, while other nuisance factors are handled by feature-level DA. As shown in Table 1, we achieve 84.20% accuracy, reducing the error by 64.9% relative to a model trained only on the source domain. We present ablation studies to demonstrate the importance of each adaptation component by extensively evaluating performance with various mixtures of components. We further validate the effectiveness of our proposed feature-level DA method on standard UDA benchmarks, namely digits and traffic signs [11] and Office-31 [38], achieving state-of-the-art recognition performance.

¹Our framework is unsupervised DA in the sense that we do not require recognition labels from the target domain for training, but it uses side annotations to inject insights from vision concepts for pixel-level adaptation.

In summary, the contributions of our work are:
• A novel UDA framework that adapts at multiple semantic levels from feature to pixel, with complementary insights for each type of adaptation.
• For feature-level DA, a connection of DANN to a semi-supervised variant, motivating a novel regularization via the classification-aware domain adversarial neural network.
• For pixel-level DA, an attribute-conditioned CycleGAN to translate a source image into multiple target images with different attributes, along with warping-based image synthesis for identity-preserving pose translation via a keypoint-based appearance flow.
• A new experimental protocol on car recognition in the surveillance domain, with detailed analysis of
various modules and efficacy of our UDA framework.
• State-of-the-art performance on standard UDA benchmarks, such as Office-31 and the digits and traffic signs adaptation tasks, with our feature-level DA method.

Due to the large volume of our work, we put additional detail in Sections S1–S6 of the supplementary material at www.nec-labs.com/˜mas/jointDA.

2. Related Work

Unsupervised domain adaptation. Following the theoretical development of domain adaptation [2, 1], a major challenge is to define a proper metric measuring the discrepancy between domains. Maximum mean discrepancy [29, 57, 9, 56, 47], which measures the discrepancy based on kernels, and domain adversarial neural networks [11, 4, 3, 45, 46], which measure it with a discriminator, have been successful. Noting the similarity between the problem settings of UDA and SSL, there have been attempts to incorporate ideas from SSL; for example, entropy minimization [14] has been used in addition to the domain adversarial loss [30, 31]. Our feature-level DA builds on DANN by resolving the difficulty the discriminator has in discovering modes in the feature space. Our formulation is also closely tied to SSL, and we explain why entropy minimization is necessary for DANN.

Perspective transformation. Previous works [61, 23, 51] proposed encoder-decoder networks to generate output images at target viewpoints. Adversarial learning of perspective transformation [54, 55, 63] has shown good performance in disentangling viewpoint from other appearance factors, but still suffers from switching of concepts (e.g., class labels) in the unpaired setting. Instead of learning the output distribution, [65, 34] proposed warping-based view synthesis by estimating a pixel-level flow field. We extend it with domain-invariant representations such as 2D keypoints [25] to improve generalization from synthetic to real images.

Image-to-image translation. With the success of GANs in image generation [13, 35], variants of conditional GANs [32] have been successfully applied to image-to-image translation problems in both paired [19] and unpaired [43, 49, 66] training settings. Our model extends the work of [66] on image translation in the unpaired setting to generate multiple outputs using control variables, or visual attributes [60].

Multi-level UDA. Combining pixel- and feature-level adaptation was attempted in [18], but we differ in several important aspects. Specifically, we further leverage insights from SSL for a novel regularization of feature-level DA, while utilizing 3D geometry and attribute-based conditioning in GANs to simultaneously handle high-level pose and low-level lighting variations. Our experiments include detailed studies of the complementary benefits as well as the effectiveness of the various adaptation modules. While [18] considers problems such as semantic segmentation, we study a car recognition problem that highlights the need for adaptation at all levels. We also demonstrate state-of-the-art results on standard UDA benchmarks.

3. Domain Adversarial Feature Learning

This section describes a classification-aware domain adversarial neural network (Fig. 1(b)) that improves upon the domain adversarial neural network [11] by jointly parameterizing the classifier and the discriminator.

Notation. Let X_S, X_T ⊂ X be the source and target domain datasets, and Y = {1, ..., N} the set of class labels. Let f : X → R^K be a feature generator, e.g., a CNN, with parameters θ_f, that maps an input x ∈ X to a K-dimensional vector.

3.1. Revisiting the Domain Adversarial Neural Network

The goal of domain adversarial training [11] is to adapt a classifier learned on the labeled source domain to the unlabeled target domain by making the feature distributions of the two domains indistinguishable. This is realized with a domain discriminator D : R^K → (0, 1) that tells whether features from the two domains are still distinguishable; f is then trained to confuse D while classifying the source data correctly:

max_{θ_c} {L_C = E_{X_S} log C(f, y)}                                (1)
max_{θ_d} {L_D = E_{X_S} log(1 − D(f)) + E_{X_T} log D(f)}           (2)
max_{θ_f} {L_F = L_C + λ E_{X_T} log(1 − D(f))}                      (3)

where C : R^K × Y → (0, 1) is a class score function that outputs the probability of an input x belonging to class y among the N classes, i.e., C(f(x), y) = P(y | f(x); θ_c), and λ balances the classification and domain adversarial losses. The parameters {θ_c, θ_d} and {θ_f} are updated alternately using stochastic gradient descent.

3.2. Classification-Aware Adversarial Learning

We note that the problem setting of unsupervised domain adaptation is no different from that of semi-supervised learning once we remove the notion of domains. Motivated by semi-supervised learning formulations of GANs [40, 7], we propose a new domain adversarial learning objective that jointly parameterizes the classifier and the discriminator as follows:

max_{θ_c} {L_C = E_{X_S} log C(f, y) + E_{X_T} log C(f, N+1)}        (4)
max_{θ_f} {L_F = E_{X_S} log C(f, y | Y) + λ E_{X_T} log(1 − C(f, N+1))}   (5)

where C is now an (N+1)-way class score function in which class N+1 plays the role of the domain discriminator, and C(f, y | Y) denotes the class probability renormalized over the N source classes.

[Figure 1: (a) DANN (baseline): a shared CNN feeds an N-way classifier and a separate binary domain discriminator (D = 2). (b) DANN-CA (ours): a shared CNN feeds a single (N+1)-way classifier whose extra class N+1 acts as the domain discriminator.]
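For concreteness, the per-example objectives of Eqs. (4)–(5) can be written directly on top of an (N+1)-way softmax. The following is a minimal numpy sketch (names are illustrative; the feature network, mini-batching, and the alternating SGD updates of Section 3.1 are omitted):

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over all N+1 classes.
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def classifier_objective(src_logits, y, tgt_logits):
    # Eq. (4): maximize log C(f, y) on source data and log C(f, N+1) on
    # target data; the last logit corresponds to the extra class N+1.
    return log_softmax(src_logits)[y] + log_softmax(tgt_logits)[-1]

def feature_objective(src_logits, y, tgt_logits, lam=0.1):
    # Eq. (5): the source term uses C(f, y | Y), i.e. the probability
    # renormalized over the N real classes (drop the last logit); the
    # target term pushes target features away from class N+1.
    src_term = log_softmax(src_logits[:-1])[y]
    p_extra = np.exp(log_softmax(tgt_logits)[-1])
    return src_term + lam * np.log(1.0 - p_extra)

src_logits = np.array([2.0, 0.1, -1.0, -0.5])  # N = 3 real classes + class N+1
tgt_logits = np.array([0.3, 0.2, 0.1, 1.5])
L_C = classifier_objective(src_logits, 0, tgt_logits)
L_F = feature_objective(src_logits, 0, tgt_logits)
```

Both objectives are sums of log-probabilities (hence ≤ 0) to be maximized: θ_c ascends L_C while θ_f ascends L_F, mirroring the alternating updates of Eqs. (1)–(3).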