Overcoming Multi-Model Forgetting in One-Shot NAS with Diversity Maximization

Miao Zhang 1,2, Huiqi Li 1*, Shirui Pan 3, Xiaojun Chang 3, Steven Su 2
1 Beijing Institute of Technology  2 University of Technology Sydney  3 Monash University
Miao.Zhang-2@student.uts.edu.au, huiqili@bit.edu.cn, steven.su@uts.edu.au, {Shirui.Pan, Xiaojun.Chang}@monash.edu
* Corresponding author

Abstract

One-Shot neural architecture search (NAS) dramatically improves computational efficiency through weight sharing. However, this approach introduces multi-model forgetting during supernet training (the architecture search phase): the performance of previous architectures degrades when new architectures with partially shared weights are trained sequentially. To overcome this catastrophic forgetting, the state-of-the-art method assumes that the shared weights are optimal when jointly optimizing the posterior probability. In practice, however, this strict assumption does not necessarily hold for One-Shot NAS. In this paper, we formulate supernet training in One-Shot NAS as a constrained optimization problem of continual learning: learning the current architecture should not degrade the performance of previous architectures. We propose a Novelty Search based Architecture Selection (NSAS) loss function and prove that the posterior probability can be computed without the strict assumption when the diversity of the selected constraints is maximized. We design a greedy novelty search method to find the most representative subset of architectures to regularize supernet training. We apply the proposed method to two One-Shot NAS baselines, random sampling NAS (RandomNAS) and gradient-based sampling NAS (GDAS). Extensive experiments demonstrate that our method enhances the predictive ability of the supernet in One-Shot NAS and achieves remarkable performance and efficiency on CIFAR-10, CIFAR-100, and PTB.

1. Introduction

One-Shot neural architecture search (NAS) has recently attracted broad interest for automating neural network design, since it not only finds state-of-the-art architectures but also dramatically reduces search time through weight sharing. Early NAS methods train a large number of independent architectures from scratch and use evolutionary algorithms (EA) or reinforcement learning (RL) to find the most promising architecture based on validation accuracy [12, 38, 29], which is extremely time-consuming and out of reach for most machine learning practitioners. Several approaches have been proposed to address this efficiency issue [3, 6, 34]. In particular, weight sharing, also known as One-Shot NAS [4, 28], is a promising direction. One-Shot NAS defines the search space as a supernet that subsumes all candidate architectures, and evaluates candidate architectures through weights inherited from the supernet. Instead of training numerous independent architectures from scratch, One-Shot NAS trains the supernet only once, which dramatically reduces the search cost.

One-Shot NAS relies on a key assumption that the validation accuracy of an architecture with inherited weights should be close to, or highly predictive of, its test accuracy after retraining. Although Bender et al. [4] observed a strong correlation between validation accuracy and test accuracy when the supernet is trained with random path dropout, Sciuto et al. [30] reached the opposite conclusion for ENAS, where the weights of a single path (one architecture) of the supernet are trained in each step. This single-path training scheme is also adopted by state-of-the-art One-Shot NAS methods [10, 19, 13, 7] and is the scenario considered in this paper. Adam et al. [1] showed that the RNN controller in One-Shot NAS does not depend on past sampled architectures, making its performance identical to random search. Similarly, Singh et al. [31] found that, during the architecture search phase of ENAS, the architectures generated by the controller show no clear progress in retraining performance, and architectures with more shared weights generally perform worse according to the supernet trained in ENAS. Benyahia et al. [5] defined this phenomenon as multi-model forgetting, which occurs when multiple models with partially shared weights are trained sequentially on a single task. Suppose a large supernet contains multiple models (architectures) sharing weights among them, and these models are trained sequentially on a single task; models with more shared weights then lose more accuracy when another model is trained [5, 30].

[Figure 1: Validation accuracy of 4 different architectures during supernet training of RandomNAS [19] and GDAS [11] (x-axis: supernet training epoch; y-axis: validation accuracy). Solid lines ("Arch1"-"Arch4") show validation accuracy with weights inherited from the supernet; dashed lines ("Arch1-R"-"Arch4-R") show validation accuracy after retraining.]

The multi-model forgetting problem is also illustrated in Figure 1, which plots the validation accuracy of four different architectures with inherited weights during supernet training. We can observe that weight sharing causes large fluctuations in validation accuracy. Worse, architectures with inherited weights may perform progressively worse as supernet training proceeds, which makes the supernet-based architecture ranking unreliable. Clearly, although weight sharing greatly reduces computation time, it also introduces multi-model forgetting into supernet training, which deteriorates the predictive ability of the supernet.

To overcome the multi-model forgetting problem in One-Shot NAS and enhance the predictive ability of the supernet, we formulate supernet training as a constrained optimization problem of continual learning, which prevents the learning of the current architecture from degrading the performance of previous architectures. Unlike existing work that only considers the performance drop of the last architecture [5] or keeps the shared parameters fixed [21], this paper aims to find the most representative subset of previous architectures to regularize the learning of the current architecture during supernet training. We adopt an efficient greedy novelty search method with diversity maximization for constraint selection, and implement our method on two baselines, RandomNAS [19] and GDAS [11]. Experimental results show that our algorithm significantly reduces multi-model forgetting in their supernet training. Our contributions are summarized as follows.

• First, we formulate supernet training in One-Shot NAS as a constrained optimization problem of continual learning, where learning the current architecture should not degrade the performance of previous architectures with partially shared weights.

• Second, we design an efficient greedy novelty search method that selects the most representative subset of constraints to approximate the feasible region formed by all previous architectures.

• Third, we apply the proposed method to two One-Shot NAS baselines, RandomNAS [19] and GDAS [11], to reduce multi-model forgetting in their supernet training. Extensive experimental results demonstrate the effectiveness of our method, which reduces multi-model forgetting and enhances the predictive ability of the supernet.

2. Background

2.1. Weight-Sharing Neural Architecture Search

One-Shot NAS was proposed by [28] to greatly reduce search time through weight sharing. Instead of training a large number of independent architectures, One-Shot NAS encodes the search space A as a supernet W_A, and each candidate architecture α directly inherits its weights from the supernet as W_A(α). Since One-Shot NAS trains the supernet only once during the architecture search phase, it greatly reduces the search time. One-Shot NAS searches for the most promising architecture α* based on the validation performance with weights inherited from the supernet:

\min_{\alpha \in A} L_{val}(W_A^*(\alpha)) \quad \text{s.t.} \quad W_A^*(\alpha) = \arg\min_W L_{train}(W_A(\alpha))    (1)

Eq. (1) is a challenging bilevel optimization problem, and the discrete nature of the architecture space makes it impossible to solve directly with gradient-based methods; ENAS [28] uses an LSTM controller to sample architectures. Differently, [13] and [19] train the supernet with a uniform sampling strategy and adopt random search or evolutionary methods to find the best-performing architecture from the trained supernet. Several state-of-the-art One-Shot methods utilize continuous relaxation to transform the discrete architecture space into a continuous space A_θ with parameters θ to further improve efficiency [23, 11, 33, 26]. The supernet weights and architecture parameters can then be optimized jointly:

(\alpha_\theta^*, W_{A_\theta}(\alpha_\theta^*)) = \arg\min_{\alpha_\theta, W} L_{train}(W_{A_\theta}(\alpha_\theta))    (2)

This makes it possible to apply continuous optimization methods to architecture search, and the best architecture α* can be sampled from the continuous architecture representation α*_θ. However, Eq. (2) assumes the whole supernet is trained in each step, which entails a much higher memory requirement. Compared with ENAS, GDAS [11] further introduces a gradient-based sampler into supernet training to sample a single path (i.e., one architecture) in each step. The architecture distribution and the supernet weights can be optimized simultaneously, while the memory requirement equals that of training a single architecture. Rather than continuous relaxation, NAO [24] utilizes an LSTM-based autoencoder to map discrete neural architectures into continuous representations and then performs gradient-based search in the continuous space.
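To make the single-path, weight-sharing training scheme above concrete, the sketch below shows one supernet update step in a RandomNAS-style setting: one architecture is sampled uniformly, only its path is executed in the forward pass, so only the inherited weights W_A(α_t) (plus the always-shared stem and classifier) receive gradients. This is a minimal illustrative sketch; the class and function names (MixedEdge, Supernet, sample_arch) are our own assumptions, not the authors' implementation.

```python
# Minimal single-path supernet training step (PyTorch); illustrative only.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedEdge(nn.Module):
    """One supernet edge holding all candidate ops; a sampled path uses one."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # candidate op 0
            nn.Conv2d(channels, channels, 5, padding=2),  # candidate op 1
            nn.Identity(),                                # candidate op 2 (skip)
        ])

    def forward(self, x, op_idx):
        return self.ops[op_idx](x)  # only the sampled op runs / gets gradients

class Supernet(nn.Module):
    def __init__(self, channels=16, num_edges=4, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.edges = nn.ModuleList([MixedEdge(channels) for _ in range(num_edges)])
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x, arch):
        x = self.stem(x)
        for edge, op_idx in zip(self.edges, arch):
            x = F.relu(edge(x, op_idx))
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.head(x)

def sample_arch(num_edges=4, num_ops=3):
    """Uniform single-path sampling, as in RandomNAS-style supernet training."""
    return [random.randrange(num_ops) for _ in range(num_edges)]

supernet = Supernet()
opt = torch.optim.SGD(supernet.parameters(), lr=0.025, momentum=0.9)
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))  # dummy batch

arch_t = sample_arch()                                # one architecture per step
loss = F.cross_entropy(supernet(x, arch_t), y)
opt.zero_grad(); loss.backward(); opt.step()          # only W_A(α_t) is updated
```

Because each step updates only the sampled path, weights shared with previously sampled paths are overwritten without any constraint, which is exactly where the multi-model forgetting discussed next arises.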
2.2. Multi-Model Forgetting in One-Shot NAS

Catastrophic forgetting is a common phenomenon in artificial intelligence and multi-task learning: a model typically loses information about previous tasks after being trained on a new task [14, 18, 27]. Given a model with optimal parameters θ*_A on dataset D_A, its performance on D_A drops sharply after the model is trained on another dataset D_B. Methods addressing this problem are known as continual learning. Learning without Forgetting (LwF) [22] adds the responses of the old task to the model as a regularization term to prevent catastrophic forgetting. Elastic Weight Consolidation (EWC) [17] proposes to maximize the likelihood of the conditional probability p(θ | D), where D consists of two independent datasets D_A and D_B, and D_A is unavailable when training on D_B.

Multi-model forgetting refers to the phenomenon that arises when multiple models are trained on a single dataset. Instead of training one model on several tasks sequentially, One-Shot NAS applies different models, e.g., θ_a = (θ_pa, θ_s) and θ_b = (θ_pb, θ_s), to one dataset D, where θ_s are the shared weights, and θ_pa and θ_pb are the private weights. Wang et al. [32] showed that single-mode networks always outperform multi-mode networks when trained on a single task, and that interactions between networks degrade the performance of the whole network. [30] and [20] also observed catastrophic forgetting in One-Shot NAS, where the performance of previous architectures degrades after new architectures are trained in the supernet. Benyahia et al. [5] defined this as the multi-model forgetting problem and proposed a weight plasticity loss (WPL) to reduce forgetting in One-Shot NAS by maximizing the posterior probability p(θ_pa, θ_pb, θ_s | D), which is computed as:

p(\theta \mid D) = \frac{p(\theta_{pa}, \theta_{pb}, \theta_s, D)}{p(D)}
= \frac{p(\theta_{pa} \mid \theta_{pb}, \theta_s, D)\, p(\theta_{pb}, \theta_s, D)}{p(D)}
= \frac{p(\theta_{pa}, \theta_s \mid D)\, p(D \mid \theta_{pb}, \theta_s)\, p(\theta_{pb}, \theta_s)}{p(\theta_s, D)}
= \frac{p(\theta_{pa}, \theta_s \mid D)\, p(D \mid \theta_{pb}, \theta_s)\, p(\theta_{pb}, \theta_s)}{\int p(D \mid \theta_{pa}, \theta_s)\, p(\theta_{pa}, \theta_s)\, d\theta_{pa}}
= \frac{p(\theta_a \mid D)\, p(D \mid \theta_b)\, p(\theta_b)}{\int p(D \mid \theta_{pa}, \theta_s)\, p(\theta_{pa}, \theta_s)\, d\theta_{pa}}    (3)

The loss function for maximizing the likelihood of p(θ_pa, θ_pb, θ_s | D) is then:

L_{WPL}(\theta_b) = L_c(\theta_b) + \frac{\eta}{2}\left(\|\theta_{pb}\|^2 + \|\theta_s\|^2\right) + \sum_{\theta_{s_i} \in \theta_s} \frac{\varepsilon}{2} F_{\theta_{s_i}} \left(\theta_{s_i} - \theta^*_{s_i}\right)^2    (4)

where L_c is the cross-entropy loss, F_{θ_si} is the diagonal element of the Fisher information matrix corresponding to parameter θ_si, estimated under the assumption that the parameters (θ_pa, θ_pb) are independent, and θ*_s is the shared parameter θ_s after training the previous model, which is assumed to be at an optimum. A detailed derivation of Eq. (4) can be found in [5].

Limitations: WPL considers only one previous architecture in each step of supernet training, and it assumes the shared weights are optimal. However, both assumptions hardly hold in the supernet training of One-Shot NAS, since numerous architectures share weights with the current one, and the shared weights are usually far from an optimum. To address these issues, we formulate supernet training in One-Shot NAS as a constrained optimization problem, where learning the current architecture should not degrade the performance of previously visited architectures. We treat a subset of previous architectures as constraints to regularize the learning of the current architecture, and prove that the loss function for the posterior probability p(θ_pa, θ_pb, θ_s | D) can be derived without assuming optimal shared weights when the diversity of the selected architectures is maximized.
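For concreteness, the quadratic Fisher term of Eq. (4) can be written in a few lines of PyTorch. This is a minimal sketch, assuming the diagonal Fisher values are approximated by squared gradients (a standard practice in EWC-style methods) and that snapshots θ*_s are stored after the previous model's training; the helper names are our own, not from [5].

```python
# Sketch of the EWC-style penalty in Eq. (4); names and Fisher estimate are
# illustrative assumptions, not the implementation of [5].
import torch

def wpl_penalty(shared_params, anchor_params, fisher_diag, eps=1.0):
    """sum_i (eps/2) * F_i * (theta_si - theta*_si)^2 over the shared weights.

    shared_params: current shared-weight tensors (theta_s)
    anchor_params: snapshots theta*_s taken after the previous model's training
    fisher_diag:   diagonal Fisher estimates, one tensor per shared parameter
    """
    penalty = torch.zeros(())
    for theta, theta_star, f in zip(shared_params, anchor_params, fisher_diag):
        penalty = penalty + 0.5 * eps * (f * (theta - theta_star) ** 2).sum()
    return penalty

def fisher_from_grads(params):
    """Common diagonal-Fisher approximation: squared gradients of the loss
    (call after loss.backward() on the previous model)."""
    return [p.grad.detach() ** 2 if p.grad is not None else torch.zeros_like(p)
            for p in params]
```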
3. Methodology

3.1. Problem Modeling

One-Shot NAS trains multiple architectures sequentially, each for only a few epochs. This implies that the model weights θ_a of previous architectures are far from optimal, while the weights of the current architecture are usually partly shared with previous architectures. Unlike WPL [5], which jointly optimizes the posterior probability under the assumption that θ_a is close to optimal, or Learn to Grow [21], which keeps the shared weights fixed, we formulate supernet training in One-Shot NAS as a constrained optimization problem. Specifically, we enforce that each architecture with weights inherited from the supernet at the current step performs better than at the previous step, i.e., attains a smaller training loss. Without loss of generality, we consider the typical case where only one architecture of the supernet is trained in each step, and define the constrained optimization problem as:

W_A^t = \arg\min_{\theta \in W_A(\alpha_t)} L_{train}(W_A(\alpha_t))
\text{s.t.} \quad L_{train}(W_A^t(\alpha_i)) \le L_{train}(W_A^{t-1}(\alpha_i)), \; \forall i \in \{0, ..., t-1\}    (5)

where α_t is the current architecture at step t, W_A^t denotes all supernet weights at step t, W_A(α_t) are the weights of the architecture inherited from the supernet, and only W_A(α_t) is optimized in each step t.

3.2. Constraint Selection based on Novelty Search

The constraints in Eq. (5) prevent the learning of the current architecture from degrading the performance of previous architectures, thereby overcoming multi-model forgetting in One-Shot NAS. However, the number of constraints in Eq. (5) grows linearly with the step count, which makes it intractable to consider all constraints in the optimization. In practice, we select a subset of M constraints from the previous architectures such that the feasible region formed by the subset is as close as possible to the original feasible region. Intuitively, maximizing the diversity of the subset is an efficient way to find the most representative samples from the previous architectures. Based on this observation, and motivated by [2], we propose a surrogate for constraint selection:

\text{maximize}_{\mathcal{M}} \sum_{\alpha_i, \alpha_j \in \mathcal{M}} dis(\alpha_i, \alpha_j)
\text{s.t.} \quad \mathcal{M} \subset \{\alpha_1, ..., \alpha_{t-1}\}; \; |\mathcal{M}| = M    (6)

where dis(α_i, α_j) computes the distance between two architectures. Solving Eq. (6) exactly is an NP-hard problem, but a heuristic method can achieve the same goal [2]. In this paper, we propose a greedy novelty search method to maximize the diversity of the subset. Before the archive is full, every newly sampled architecture is added to the subset. Once it is full, the archived architecture most similar to the current one is replaced by the candidate that maximizes the novelty score of the archive. We adopt a simple and standard novelty measure for architectures, defined as N(α, M), which computes the mean distance from α to its k-nearest neighbors in M:

N(\alpha, \mathcal{M}) = \frac{1}{|S|} \sum_{\alpha_j \in S} dis(\alpha, \alpha_j), \quad S = kNN(\alpha, \mathcal{M}) = \{\alpha_1, \alpha_2, ..., \alpha_k\}    (7)

In this paper, we only measure the difference of the input edges for each node in an architecture, since the order of nodes is fixed. The replacement step of this greedy procedure is summarized in Algorithm 1:

Algorithm 1 Greedy replacement with diversity maximization
...
4: if N(α_r, M) > N(α_m, M) then
5:   replace α_m with α_r;
6: end if
7: end for
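The novelty score of Eq. (7) and the greedy replacement of Algorithm 1 are straightforward to implement. Below is a minimal self-contained sketch, assuming each architecture is encoded as a tuple of input-edge choices per node and using a simple mismatch count as dis(·, ·); the function names and the handling of replacement candidates are our own assumptions, since the opening steps of Algorithm 1 are not shown above.

```python
# Sketch of Eq. (6)/(7) and the greedy archive update; encodings, distance,
# and candidate handling are illustrative assumptions.
def dis(a, b):
    """Architecture distance: count mismatched input-edge choices per node."""
    return sum(x != y for x, y in zip(a, b))

def novelty(alpha, archive, k=3):
    """Eq. (7): mean distance from alpha to its k nearest neighbors in archive."""
    dists = sorted(dis(alpha, other) for other in archive)
    knn = dists[:k]
    return sum(knn) / len(knn)

def update_archive(archive, alpha_t, candidates, max_size):
    """Add alpha_t while the archive is not full; otherwise greedily replace
    the archived architecture most similar to alpha_t whenever a candidate
    alpha_r raises the archive's novelty (lines 4-6 of Algorithm 1)."""
    if len(archive) < max_size:
        return archive + [alpha_t]
    alpha_m = min(archive, key=lambda a: dis(alpha_t, a))  # most similar to alpha_t
    rest = [a for a in archive if a is not alpha_m]
    for alpha_r in candidates:
        if novelty(alpha_r, rest) > novelty(alpha_m, rest):
            alpha_m = alpha_r          # keep the more novel architecture
    return rest + [alpha_m]

# usage: archive of tuples of input-edge indices, capacity 3
archive = [(0, 1, 2), (1, 1, 0), (2, 0, 1)]
archive = update_archive(archive, (0, 2, 2), candidates=[(0, 2, 2)], max_size=3)
```

Keeping the archive diverse this way approximates the full constraint set of Eq. (5) with only M representative architectures, so the per-step overhead stays constant rather than growing with t.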
From Weight Plasticity Loss (WPL) to NSAS. WPL [5] regularizes the learning of the current architecture by maximizing the posterior probability p(θ_pa, θ_pb, θ_s | D), where θ_a = {θ_pa, θ_s} are the weights of the last architecture, θ_b = {θ_pb, θ_s} are the weights of the current architecture, and θ_s are their shared weights. Different from WPL, which considers only one previous architecture, we consider a subset of previously visited architectures θ_a = {θ_1, ..., θ_M} = {(θ_p1, θ_s1), ..., (θ_pM, θ_sM)}, where θ_pi are the private weights and θ_si are the weights shared with the current architecture. When we maximize the diversity of the subset, the following two assumptions should hold: (1) the architectures in this subset cover all operations in the search space; (2) there are no shared weights between these architectures. Therefore, θ_pb = ∅, since all weights of the current architecture are shared by previous architectures, and θ_i and θ_j are independent, since different architectures are trained independently. The posterior probability can now be written as:

p(\theta \mid D) = \frac{p(\theta_{p1}, ..., \theta_{pM}, \theta_{s1}, ..., \theta_{sM}, D)}{p(D)} = \frac{p(\theta_1, ..., \theta_M, D)}{p(D)}
= \frac{p(\theta_1 \mid \theta_2, ..., \theta_M, D)\, p(\theta_2, ..., \theta_M, D)}{p(D)}
= \prod_{i=1}^{M} p(\theta_i \mid D) \propto \prod_{i=1}^{M} p(D \mid \theta_i)\, p(\theta_i)
= p(\theta) \prod_{i=1}^{M} p(D \mid \theta_i) = p(\theta_t) \prod_{i=1}^{M} p(D \mid \theta_i)    (9)

where θ_i are the weights of architecture α_i in the constraints. Since only architecture α_t is trained, p(θ) = p(θ_t), where θ_t are the weights of the current architecture α_t and θ denotes all considered weights. Eq. (9) obtains the posterior probability without assuming that θ_s from the previous step is optimal. The weight plasticity loss can now be computed, without the optimal-shared-weights assumption, over a diversity-maximized subset of previously visited architectures as:

L_{WPL}(W_A(\alpha_t)) = \epsilon R(W_A(\alpha_t)) + \sum_{i=1}^{M} L_c(W_A(\alpha_i))    (10)

where ε is again a trade-off coefficient. The proposed NSAS loss function can then also be expressed through the weight plasticity loss of Eq. (10) as:

L_N(W_A(\alpha_t)) = L_c(W_A(\alpha_t)) + \lambda R(W_A(\alpha_t)) + \beta \sum_{i=1}^{M} \left[ L_c(W_A(\alpha_i)) + \lambda R(W_A(\alpha_i)) \right]
= L_c(W_A(\alpha_t)) + \beta L_{WPL}(W_A(\alpha_t))    (11)

From Eq. (11), we can see that the proposed loss function attempts not only to optimize the WPL but also to optimize the learning of the current architecture (and hence the shared weights). This is because the shared weights are usually far from the optimal point in One-Shot NAS, so we should not only overcome forgetting but also push the shared weights toward the optimum.

Algorithm 2 One-Shot NAS-NSAS
Input: D_train, D_val, W, constraints archive M = ∅, archive size M, batch size b, supernet training iterations T
1: for t = 1, 2, ..., (T * size(D_train)/b) do
2:   if size(M) < M then
3:     sample α_t based on gradient search or random search, update the weights W_A(α) with the normal loss function, and add architecture α into M;
4:   else
5:     sample α_t based on gradient search or random search, select the architecture α_m that is most similar to α_t from M, and replace α_m with α_r to maximize the diversity of M based on Algorithm 1. Update the weights W_A(α) with our proposed loss function in Eq. (8) or a replay buffer;
6:   end if
7: end for
8: Obtain α* based on Eq. (1) (RandomNAS-NSAS) or Eq. (2) (GDAS-NSAS).

3.4. One-Shot NAS with Novelty Search based Architecture Selection

Our loss function is applied to two popular One-Shot NAS methods: RandomNAS [19] and GDAS [11]. As in the most common weight-sharing NAS, we train only a single path in each step of the architecture search phase. It is easy to incorporate the proposed loss function into random sampling based NAS (RandomNAS), since it also trains a single path in each step. However, most gradient-based NAS methods, like DARTS [23] and SNAS [33], train the whole supernet in each step during supernet training, which violates the assumption of this paper. We therefore adopt GDAS [11] as the gradient-based sampling NAS baseline, which utilizes the Gumbel-Max trick [15, 25, 33] to relax the discrete architecture distribution into a continuous and differentiable one. The argmax function is applied to the re-parameterized architecture distribution to sample an architecture in each step of the forward pass during supernet training, while the softmax function is used in the backward pass for architecture learning. Algorithm 2 presents One-Shot NAS with our proposed NSAS loss function, termed One-Shot NAS-NSAS.
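The sketch below shows how lines 4-5 of Algorithm 2 might look for a single update: the current path's cross-entropy loss is regularized by the losses of the archived constraint architectures, as in Eq. (10)/(11). This is a minimal sketch of one possible implementation: the R(·) regularizer is omitted for brevity, the replay-buffer variant is not shown, and `supernet` is assumed to be any callable mapping (inputs, architecture) to logits, as in the earlier sketches.

```python
# One NSAS supernet update (cf. Algorithm 2, lines 4-5); illustrative only.
import torch.nn.functional as F

def nsas_step(supernet, optimizer, batch, arch_t, archive, beta=0.5):
    x, y = batch
    loss = F.cross_entropy(supernet(x, arch_t), y)       # L_c of current path
    for arch_i in archive:                               # constraint subset M
        loss = loss + beta * F.cross_entropy(supernet(x, arch_i), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```

Forwarding the archived architectures on the same batch is what makes the method evaluate more architectures per step than the baselines, which explains the moderately higher search cost reported in Table 1.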
4. Experiments and Results

To evaluate the effectiveness of the proposed algorithm, we apply our method to both RandomNAS [19] and GDAS [11] on the CIFAR-10, CIFAR-100, and Penn Treebank (PTB) datasets. All experimental designs follow the settings in [19, 23] for a fair comparison. Our new methods are denoted RandomNAS-NSAS and GDAS-NSAS. We compare our methods with state-of-the-art One-Shot NAS methods and evaluate the supernet predictive ability of our approach against the baselines.

4.1. Architecture Search for Convolutional Cells

We conduct comparative experiments for convolutional neural architecture search on CIFAR-10. The search space and hyperparameter settings follow [23, 19] for a fair comparison. This search space searches for micro cell structures, which are stacked in series to form the final network. In the supernet training (architecture search) stage, we stack only 8 cells to build the supernet, with 16 initial channels and a batch size of 64. After supernet training and obtaining the promising cells, we stack 20 cells to form the final architecture and train it with a batch size of 96. The comparison results are shown in Table 1 and can be summarized as follows:

| Method | Test Error (%) CIFAR-10 | Test Error (%) CIFAR-100 | Parameters (M) | Search Cost (GPU Days) | Memory Consumption | Supernet Optimization |
|---|---|---|---|---|---|---|
| NAO-WS [24] | 3.53 | - | 2.5 | - | Single path | Gradient |
| ENAS [28] | 2.89 | 18.91† [11] | 4.6 | 0.5 | Single path | RL |
| SNAS [33] | 2.85±0.02 | 20.09* | 2.8 | 1.5 | Whole supernet | Gradient |
| PARSEC [8] | 2.86±0.06 | - | 3.6 | 0.6 | Single path | Gradient |
| BayesNAS [37] | 2.81±0.04 | - | 3.40 | 0.2 | Whole supernet | Gradient |
| RENAS [9] | 2.88±0.02 | - | 3.5 | 6 | - | RL&EA |
| MdeNAS [36] | 2.40 | - | 4.06 | 0.16 | Single path | MDL |
| MdeNAS* [36] | 2.87* | 17.61* | 3.78* | 0.16 | Single path | MDL |
| DSO-NAS [35] | 2.84±0.07 | - | 3.0 | 1 | Whole supernet | Gradient |
| WPL [5] | 3.81 | - | - | - | Single path | RL |
| Random baseline [23] | 3.29±0.15 | - | 3.2 | 4 | - | Random |
| DARTS (1st) [23] | 2.94 | - | 2.9 | 1.5 | Whole supernet | Gradient |
| DARTS (2nd) [23] | 2.76±0.09 | 17.54† [11] | 3.4 | 4 | Whole supernet | Gradient |
| RandomNAS [19] | 2.85±0.08 | 17.63* | 4.3* | 2.7 | Single path | Random |
| RandomNAS-NSAS | 2.64 (2.50) | 17.56 (16.85) | 3.08 | 0.7 | Single path | Random |
| GDAS [11] | 2.93 | 18.38 | 3.4 | 0.21 | Single path | Gradient |
| GDAS-NSAS | 2.73 | 18.02 | 3.54 | 0.4 | Single path | Gradient |

Table 1: Test errors on CIFAR-10 and CIFAR-100, compared with state-of-the-art NAS approaches. "*" indicates results reproduced from the best reported cell structures under the same experimental setting as ours. "†" indicates results reported in [11]. We do not reproduce the methods marked "-" in the CIFAR-100 experiment, since they use different search spaces or do not report their best structures. All models are trained for 600 epochs, and we additionally train our best found model (RandomNAS-NSAS) for 1000 epochs to obtain the state-of-the-art results. The best models obtained by all One-Shot NAS methods are trained with cutout.

• Compared with RandomNAS and GDAS, which both employ the normal cross-entropy loss function, the proposed NSAS loss function greatly enhances the search results, yielding a 5.6% improvement for RandomNAS and a 4.8% improvement for GDAS. These results also demonstrate the effectiveness of the proposed loss function, which relieves the ranking disorder incurred by weight sharing and improves the predictive ability of the supernet.

• Compared with other NAS methods, our RandomNAS-NSAS achieves a competitive result, with a 2.64% test error on CIFAR-10 and a 2.50% test error after 1000 training epochs. This is an encouraging result that validates our design for overcoming multi-model forgetting.

• Since the proposed method needs to evaluate more architectures during supernet training, its search cost is slightly higher than the baselines'. Nevertheless, the method remains very efficient: supernet training costs only 0.7 GPU days for RandomNAS-NSAS and 0.4 GPU days for GDAS-NSAS.