Propagation Regularizer for Semi-supervised Learning with Extremely Scarce Labeled Samples


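Noo-ri Kim, Jee-Hyong Lee*
Department of Electrical and Computer Engineering, Sungkyunkwan University
2066 Seobu-ro, Jangan-gu, Suwon, Gyeonggi-do 16419, Korea
{pd99j, john}@skku.edu

* Corresponding author.

Abstract

Semi-supervised learning (SSL) improves model performance by using a large amount of easily acquired unlabeled data together with a small amount of labeled data that is expensive to obtain. Most existing SSL studies focus on cases with a sufficient number of labeled samples, tens to hundreds per class, which still requires substantial labeling cost. In this paper, we focus on the SSL setting with extremely scarce labeled samples, only 1 or 2 per class, where most existing methods fail to learn. We propose a propagation regularizer that achieves efficient and effective learning with extremely scarce labeled samples by suppressing confirmation bias. In addition, for realistic model selection in the absence of a validation dataset, we propose a model selection method based on the propagation regularizer. With just one labeled sample per class, the proposed methods achieve 70.9%, 30.3%, and 78.9% accuracy on the CIFAR-10, CIFAR-100, and SVHN datasets, respectively, an improvement of 8.9% to 120.2% over existing methods. The proposed methods also perform well on STL-10, a dataset with higher resolution.

1. Introduction

Semi-supervised learning (SSL) is a machine learning technique that trains a model with a small amount of labeled data and a large amount of unlabeled data. It has drawn increasing attention from researchers because it can reach performance comparable to supervised learning. SSL techniques have shown remarkable performance in various fields such as image segmentation [5, 8, 31], object detection [1, 11, 19], text classification [4, 10, 20], graph embedding [27, 29], and image classification [17, 27].

Most SSL methods [2, 3, 12, 17, 27, 32] are based on consistency regularization [14, 26] and pseudo-labeling [15, 21]. Consistency regularization rests on the assumption that a prediction should not change significantly when a slight perturbation is applied to the sample. Pseudo-labeling is a special case of self-training [25, 33] that trains the model using its own predictions on unlabeled samples as pseudo-labels.

In SSL, it is important to make full use of unlabeled samples, but it is equally important to learn from a small number of labeled samples, because labeled samples are usually expensive. However, only a few studies have focused on learning with scarce labeled samples. We need to study how SSL works and how to improve its performance when labels are scarce. MixMatch [3], Unsupervised Data Augmentation for consistency training (UDA) [32], and ReMixMatch [2] show good performance on the CIFAR-10 [13] and SVHN [22] datasets with, for example, 25, 50, 100, 200, and 400 labeled examples per class. More recent methods such as FixMatch [27], SelfMatch [12], FlexMatch [35], and CoMatch [17] consider label-scarce cases. FixMatch, SelfMatch, and FlexMatch use at least 4 labeled examples per class, and CoMatch uses at least 2. However, these methods are unstable and perform poorly with so few labeled samples.

One of the serious problems in the scarce label setting is confirmation bias [16, 18, 30], which can occur in label propagation [9]. Confirmation bias is the phenomenon in which a model learns its own wrong predictions on unlabeled data, so that the confidence of the wrong predictions increases and the model becomes resistant to new (correct) information that could correct them. If there is enough labeled data, the propagation of wrong information can be cancelled out by the correct information around it, and SSL can avoid confirmation bias. On the other hand, if the amount of labeled data is very small, wrong predictions can propagate widely, and the probability that proper corrective information is never received increases. Confirmation bias has a significant negative effect on model training in SSL. The problem is more severe in the extremely scarce label setting for methods based on hard pseudo-labels, such as UDA, FixMatch, SelfMatch, and FlexMatch.

Model selection is another serious problem in the extremely scarce label setting. As in supervised learning, the stopping condition is very important in SSL. In supervised learning, a validation dataset is usually used to check the stopping condition, but in SSL, and especially in the scarce label setting, there are not enough labeled samples for validation. However, previous SSL methods [2, 3, 12, 17, 27, 32] neglect the stopping condition or model selection. For performance evaluation, they simply take the median performance of the last 20 models [3, 28]. In the scarce label setting, SSL training can be very unstable because of confirmation bias, and a low training loss does not guarantee good test accuracy.

We propose the propagation regularizer to improve SSL performance in the extremely scarce label setting, with only 1 or 2 labeled samples per class. The propagation regularizer suppresses the wrong predictions that can propagate when very little labeled data is available, so that SSL training proceeds stably. We also propose a model selection method based on the propagation regularizer loss to select a well-trained model in the extremely scarce label setting. These methods add very little computational cost and are easy to apply to existing SSL methods.

We show, with toy examples and the CIFAR-10 dataset, that confirmation bias occurs easily in the extremely scarce label setting and negatively affects model training. We propose a propagation regularizer and a model selection method. We demonstrate state-of-the-art performance in the extremely scarce label setting with only 1 or 2 labeled samples per class.

2. Confirmation Bias in the Extremely Scarce Label Setting

Most SSL methods suffer from confirmation bias, and the problem is more severe in the extremely scarce label setting. In this section, we evaluate the negative effect of confirmation bias on the SSL training process in this setting. We use FixMatch [27] as a representative pseudo-labeling SSL method and run experiments on three datasets: the moon dataset, the star dataset, and CIFAR-10 [13]. The experiments confirm that confirmation bias occurs easily in the scarce label setting and has a significant effect on performance.

Figure 1. Class boundaries generated by FixMatch. Colored points are labeled samples and gray points are unlabeled samples. Each crescent is a class in the moon dataset, and each wing is a class in the star dataset. Panels: (a) moon dataset with Rand2, (b) moon dataset with Exp2, (c) moon dataset with Rand20, (d) star dataset with Rand2, (e) star dataset with Exp2, (f) star dataset with Rand20.

2.1. Analysis with Toy Examples

To examine whether confirmation bias occurs in SSL with extremely scarce labeled samples, we train a three-layer neural network with FixMatch on the two-dimensional moon and star datasets. The moon dataset contains two classes with 1k unlabeled samples. The star dataset consists of 5 classes generated from Gaussian distributions, with 200 unlabeled samples per class. For each dataset, three sets of labeled samples are given, Rand2, Exp2, and Rand20, to verify that the initial labeled samples strongly influence confirmation bias during training. Rand2 contains two labeled samples per class, randomly chosen from the unlabeled samples. Exp2 contains two labeled samples per class, chosen by an expert so that they represent the distribution of the unlabeled data well. Rand20 contains 20 labeled samples per class, randomly chosen from the unlabeled samples. For FixMatch, Gaussian noise of different strengths is used as the weak and strong augmentations.

Figures 1a to 1c show the results of FixMatch on the moon dataset, and Figs. 1d to 1f show the results on the star dataset. When only 2 randomly selected labeled samples per class are given, the class boundaries do not match the data distribution, as shown in Figs. 1a and 1d. When more labeled samples are given, the class boundaries are generated correctly, as shown in Figs. 1c and 1f. When the labeled samples are chosen carefully, confirmation bias has less effect on the model, as shown in Figs. 1b and 1e.

| Method | Fold | Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Entropy | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FixMatch | Fold 1 | 0.11 | 0.09 | 0.02 | 0.00 | 0.45 | 0.00 | 0.09 | 0.01 | 0.11 | 0.11 | 0.72 | 62.29 |
| FixMatch | Fold 2 | 0.11 | 0.10 | 0.02 | 0.08 | 0.26 | 0.00 | 0.10 | 0.09 | 0.10 | 0.13 | 0.90 | 67.18 |
| FixMatch | Fold 3 | 0.22 | 0.01 | 0.08 | 0.00 | 0.36 | 0.00 | 0.07 | 0.08 | 0.00 | 0.17 | 0.72 | 53.05 |
| FixMatch | Fold 4 | 0.00 | 0.09 | 0.01 | 0.00 | 0.19 | 0.01 | 0.36 | 0.10 | 0.11 | 0.11 | 0.77 | 51.31 |
| FixMatch | Fold 5 | 0.10 | 0.09 | 0.01 | 0.00 | 0.01 | 0.17 | 0.13 | 0.30 | 0.10 | 0.09 | 0.84 | 66.23 |

Table 1. Class ratio and entropy of pseudo-labels for the CIFAR-10 dataset with 10 labeled samples.

As shown in Figs. 1a and 1d, if the labeled samples do not represent the distribution of each class well, label propagation proceeds in a skewed way during SSL training and confirmation bias can be intensified.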
The label propagation process is prone to bias because there are only two labeled samples per class and the distributions of unlabeled and labeled samples do not match each other.

Even when the number of labeled samples is very small, confirmation bias can be suppressed if the labeled samples represent the class distribution, as shown in Figs. 1b and 1e. However, a few randomly selected samples cannot usually be expected to represent the data distribution properly. As seen in Figs. 1c and 1f, if there are many randomly chosen labeled samples, they can represent the class distribution, and the class boundaries are learned properly.

2.2. Analysis with CIFAR-10 Dataset

To verify that real-world datasets are also prone to the confirmation bias problem, we conduct an experiment with the CIFAR-10 dataset. In this experiment, we use 1 labeled sample per class and train FixMatch with the Wide Residual Network 28-2 model [34]. The experiment is performed in 5 folds and, in each fold, the labeled examples are randomly selected from the training data.

The performance is the median accuracy of the last 20 models, averaged over the 5 folds. The average accuracy is 60.01%. The highest accuracy among the 5 folds is 67.18% and the lowest is 51.31%, showing a large variance in performance.

To show that the models of the 5 folds are degraded by confirmation bias, we observe the class ratio of pseudo-labels in Tab. 1. If a model is trained well without confirmation bias on the CIFAR-10 dataset, the ratio of each class in the pseudo-labels will be 0.1, and the entropy of the ratios will be 1.0. The entropy is defined as follows:

$$\text{Entropy} = -\sum_{i=1}^{c} r_i \log_c r_i \tag{1}$$

where $r_i$ is the ratio of class $i$ and $c$ is the number of classes.

We notice that the class ratios in each fold are not balanced in Tab. 1. In detail, in the first fold, unlabeled samples are never pseudo-labeled as Class 3 or 5 and only a small number of unlabeled samples are pseudo-labeled as Class 7, which results in a low entropy of 0.72. This tendency is observed in all 5 folds. Even in Fold 2, with the highest entropy of 0.90, the ratio of Class 3 is 0. The average of the 5-fold entropy for the pseudo-labels of FixMatch on CIFAR-10 with 10 labeled samples is 0.79. Interestingly, the entropy and the accuracy of the model have a strong correlation of 0.69. Higher entropy means smaller confirmation bias; thus, we surmise that confirmation bias has a significant effect on a model's performance. We also run the same experiment with 25 labeled samples per class. The average entropy is 0.99, which means that there is little confirmation bias.

Through the experiment, we confirm that confirmation bias also occurs easily in real-world datasets, that the smaller the number of labeled samples, the stronger the effect of confirmation bias, and that the strength of confirmation bias, measured as the entropy of pseudo-class ratios, is strongly associated with model performance.

Figure 2. Training loss, entropy of pseudo-labels and test accuracy of FixMatch on CIFAR-10 with 10 labeled samples (Fold 3). Horizontal axis: epoch (0 to 1000); vertical axis: 0 to 1; curves: loss, entropy, accuracy.

We also observe the test accuracy, the training loss, and the entropy of pseudo-labels by epoch. Figure 2 shows these three quantities for Fold 3. We see that the training is very unstable. At the beginning of training, the entropy of pseudo-labels increases, and the test accuracy also increases. This shows that consistency regularization works beneficially together with pseudo-labeling and that the SSL model is being trained well. However, the test accuracy drops sharply around 600 epochs, and at the same time the entropy also drops sharply. Afterward, the performance of the model recovers to some extent, but it hardly regains the previous best performance.

| Pearson's Correlation Coefficient | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Average |
|---|---|---|---|---|---|---|
| Training Loss-Accuracy | 0.175 | 0.364 | 0.747 | 0.257 | 0.406 | 0.390 |
| Entropy-Accuracy | 0.835 | 0.807 | 0.950 | 0.841 | 0.846 | 0.856 |

Table 2. Pearson's correlation coefficient of training loss-test accuracy and entropy-test accuracy during FixMatch training on CIFAR-10 with 10 labeled samples.

From the observation, we notice two things. First, the performance of the model does not gradually improve as training progresses. The learning of the model is unstable: it shows the best performance in the middle of training and then drops rapidly at some point. Therefore, choosing the last updated model or the model with the lowest training loss does not guarantee the best model.

The second is more important. The correlation between test accuracy and pseudo-label class entropy is much higher than that between test accuracy and training loss. Table 2 shows the correlation coefficients between test accuracy and entropy, and between test accuracy and training loss. This observation gives us a hint on how to suppress confirmation bias in training and how to better select a good model.

3. Proposed Method

In the pseudo-labeling process, the model learns from its own output, i.e., it repeatedly learns its own erroneous predictions, resulting in confirmation bias [18, 30]. This phenomenon can be amplified especially in an extremely scarce label scenario with one or two labeled examples in each class.

Based on our experimental observations, we propose a propagation regularizer to suppress confirmation bias in the extremely scarce label setting, and a model selection method that selects the best model, without validation data, among the models generated during training.

3.1. Propagation Regularizer

In the pseudo-labeling process, incorrect predictions of the model can be used for the next round of model training, which causes confirmation bias.
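As a quick check of Eq. (1), the entropy values in Tab. 1 and the entropy-accuracy correlation of 0.69 reported in Section 2.2 can be recomputed from the published numbers. The sketch below is only an illustration, assuming NumPy; the function name `normalized_entropy` and the variable names are ours and do not come from the authors' implementation. Because the tabulated ratios are rounded to two decimals, the recomputed values match the reported ones only approximately.

```python
import numpy as np


def normalized_entropy(ratios):
    """Eq. (1): entropy of class ratios with logarithm base c (number of classes)."""
    r = np.asarray(ratios, dtype=float)
    c = r.size
    nonzero = r[r > 0]  # convention: 0 * log(0) is treated as 0
    return float(-np.sum(nonzero * np.log(nonzero)) / np.log(c))


# Pseudo-label class ratios of Fold 1 in Tab. 1 (rounded values from the table).
fold1_ratios = [0.11, 0.09, 0.02, 0.00, 0.45, 0.00, 0.09, 0.01, 0.11, 0.11]

# Per-fold entropy and accuracy columns of Tab. 1.
fold_entropy = [0.72, 0.90, 0.72, 0.77, 0.84]
fold_accuracy = [62.29, 67.18, 53.05, 51.31, 66.23]

# Pearson correlation between entropy and accuracy across the 5 folds.
pearson_r = float(np.corrcoef(fold_entropy, fold_accuracy)[0, 1])

print("uniform ratios:", normalized_entropy([0.1] * 10))
print("Fold 1 ratios :", normalized_entropy(fold1_ratios))
print("entropy-accuracy correlation:", pearson_r)
```

Perfectly balanced ratios give an entropy of 1.0, while the skewed Fold 1 ratios give a value close to the reported 0.72, which is why the entropy can serve as a proxy for the strength of confirmation bias.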
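We may infer that the pseudo-labels need to be kept balanced, based on our observation that the correlation between test accuracy and the entropy of pseudo-classes is high, as shown in Tab. 2. Learning from imbalanced pseudo-labeled samples will amplify confirmation bias.

For example, consider SSL with two classes, A and B. If a model in the middle of SSL training produces more pseudo-labels of class A than of class B, the imbalanced pseudo-labeled samples are used for the next round of model training. The next model is then easily biased toward class A, and the confirmation bias is inflated.

To solve this problem, a regularization term is designed so that the pseudo-labels of the unlabeled samples are balanced across classes:

$$L_{pr} = 1 - \left(-P_U \cdot \log_c(P_U)\right) \tag{2}$$

where $c$ is the number of classes and $P_U$ is the masked average probability distribution of the unlabeled samples $U$ in a batch, defined as follows:

$$P_U = \frac{1}{|U|} \sum_{u \in U} \mathbb{1}\left(\max(p(u)) \ge \tau\right) p(u) \tag{3}$$

where $\tau$ is the confidence threshold for pseudo-labeling and $p(u)$ is the softmax output of an unlabeled example $u$.

In Eq. (3), the predictions are averaged over the samples in a batch of unlabeled examples whose maximum predicted probability is greater than or equal to the threshold $\tau$. To convert this into a minimization form, the entropy of $P_U$ is subtracted from 1. If the pseudo-labels of unlabeled examples are evenly distributed, the value of $L_{pr}$ converges to 0. By simply adding this regularization term to the SSL loss, class-balanced pseudo-labeling can be achieved. Through this, we can alleviate confirmation bias in the extremely scarce label scenario.

3.2. Model Selection Based on Propagation Regularizer and Utilization Metric

Model selection is crucial in SSL. As observed in Section 2, the performance of a model is unstable during training because it is strongly affected by confirmation bias. If we had a validation dataset, we could choose the best model as in supervised learning. In our setting, however, there are not enough labeled samples for validation. Some SSL methods [14, 21, 24, 30, 32] simply choose the last model. Many recent SSL studies [2, 3, 12, 17, 27, 32] do not propose a model selection method; they evaluate a method by the median performance of the last 20 models. This may be an acceptable way to compare model performance in simple settings [3, 28]. However, as shown in Fig. 2, model performance is very unstable in the extremely scarce label setting. This makes model selection crucial for SSL and hinders the use of SSL methods in real-world applications.

| Method | Fold | Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Entropy | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FixMatch + Sel + Reg | Fold 1 | 0.10 | 0.12 | 0.07 | 0.08 | 0.18 | 0.11 | 0.12 | 0.03 | 0.11 | 0.09 | 0.97 | 68.60 |
| FixMatch + Sel + Reg | Fold 2 | 0.09 | 0.12 | 0.06 | 0.06 | 0.09 | 0.09 | 0.18 | 0.10 | 0.10 | 0.11 | 0.98 | 59.47 |
| FixMatch + Sel + Reg | Fold 3 | 0.07 | 0.17 | 0.04 | 0.09 | 0.13 | 0.08 | 0.12 | 0.03 | 0.09 | 0.17 | 0.95 | 72.89 |
| FixMatch + Sel + Reg | Fold 4 | 0.17 | 0.11 | 0.02 | 0.05 | 0.10 | 0.15 | 0.13 | 0.07 | 0.12 | 0.08 | 0.95 | 78.81 |
| FixMatch + Sel + Reg | Fold 5 | 0.12 | 0.11 | 0.01 | 0.10 | 0.03 | 0.10 | 0.12 | 0.21 | 0.12 | 0.08 | 0.94 | 74.58 |

Table 3. Class ratio and entropy of pseudo-labels for the CIFAR-10 dataset with 10 labeled samples. The proposed model selection and propagation regularizer are applied.

To select a suitable model, we propose a metric based on confirmation bias and the utilization of unlabeled samples. A good SSL model should use as many unlabeled samples as possible and be little affected by confirmation bias. To select such a model, we propose a measure of the utilization of unlabeled samples and a measure of the effect of confirmation bias. For the utilization of unlabeled samples, we propose the following:

$$T_U = \frac{1}{|U|} \sum_{u \in U} \mathbb{1}\left(\max(p(u)) \ge \tau\right) \tag{4}$$

Equation (4) gives the proportion of unlabeled samples in the batch whose prediction satisfies the pseudo-label confidence threshold $\tau$. $T_U$ is 1 if the model is trained with all unlabeled samples in the batch, and 0 if no unlabeled sample is used at all. To measure the effect of confirmation bias, we use Eq. (2), the proposed propagation regularizer. Combining Eqs. (2) and (4), we define a metric for model selection:

$$\text{Sel} = (1 - L_{pr}) + T_U \tag{5}$$

A good SSL model uses most unlabeled samples and is little affected by confirmation bias, so this value is maximized. During training, we evaluate Sel at every epoch and select the model with the largest Sel as the final model. The proposed model selection method uses no additional validation dataset, so a suitable SSL model can be selected in the scarce label setting without validation data.

4. Experiments

To validate the proposed propagation regularizer and model selection method, we combine them with UDA [32] and FixMatch [27] and run SSL image classification benchmarks. We compare the performance with the current state-of-the-art methods CoMatch [17] and FlexMatch [35] on SVHN [22], CIFAR-10, and CIFAR-100 [13]. In addition, we run FixMatch with our proposed methods on STL-10 [6], a dataset with higher resolution. We evaluate the SSL methods on these datasets with various numbers of labeled examples, including the extremely scarce label case. All experiments follow the SSL evaluation protocol [2, 3, 23]. The experimental results show the superiority of the proposed methods: they perform best in the extremely scarce label setting.

Figure 3. Datasets and class boundaries of FixMatch with the proposed methods. Colored samples are labeled and gray samples are unlabeled. In the moon dataset each crescent is a class, and in the star dataset each wing is a class. Panels: (a) moon dataset with Rand2, (b) star dataset with Rand2.

4.1. Propagation Regularizer with Toy Examples and the CIFAR-10 Dataset

To confirm the effectiveness of the proposed propagation regularizer, we apply it to the experiments of Section 2. Figure 3 shows the results on the moon and star datasets. In Figs. 1a and 1d, the learned class distribution does not represent the class distribution of the data well, because confirmation bias is intensified. When the proposed propagation regularizer is applied to FixMatch, the class distribution is learned correctly, as shown in Figs. 3a and 3b. Table 3 shows the class ratio and entropy of the pseudo-labels of unlabeled examples, together with the accuracy, when FixMatch with the proposed methods is applied to the CIFAR-10 dataset. In Tab. 3, the class ratios in each fold are more balanced than in Tab. 1, and the accuracy is also improved. The average entropy increases from 0.79 to 0.96, and the average accuracy increases from 60.01% to 70.87%. This shows that the proposed method works effectively in the extremely scarce label setting.

| Method | CIFAR-10, 10 labels | CIFAR-10, 20 labels | CIFAR-10, 40 labels | CIFAR-100, 100 labels | CIFAR-100, 200 labels | CIFAR-100, 400 labels | SVHN, 10 labels | SVHN, 20 labels | SVHN, 40 labels |
|---|---|---|---|---|---|---|---|---|---|
| UDA | 51.82 ± 8.51 | 83.53 ± 8.05 | 90.06 ± 4.37 | 24.96 ± 2.22 | 37.76 ± 0.74 | 48.98 ± 1.73 | 24.02 ± 19.68 | 67.84 ± 11.05 | 96.51 ± 2.54 |
| FixMatch | 60.01 ± 7.41 | 75.50 ± 10.93 | 85.57 ± 5.21 | 24.11 ± 1.50 | 35.83 ± 1.63 | 46.04 ± 1.41 | 35.84 ± 10.12 | 55.79 ± 30.83 | 86.73 ± 21.70 |
| CoMatch | 65.10 ± 7.81 | 88.26 ± 8.29 | 92.16 ± 4.97 | 24.19 ± 0.98 | 32.51 ± 1.15 | 41.72 ± 2.04 | 25.36 ± 4.64 | 45.62 ± 7.12 | 76.07 ± 9.31 |
| FlexMatch | 59.06 ± 19.80 | 94.62 ± 0.15 | 94.86 ± 0.05 | 4.47 ± 0.81 | 30.59 ± 1.69 | 46.11 ± 2.83 | 11.02 ± 1.89 | 34.93 ± 36.00 | 77.04 ± 23.16 |
| UDA + Sel | 59.71 ± 16.01 | 83.60 ± 8.09 | 90.12 ± 4.35 | 24.95 ± 2.23 | 37.76 ± 0.84 | 49.11 ± 1.77 | 78.91 ± 12.30 | 97.78 ± 0.27 | 96.48 ± 3.24 |
| UDA + Sel + Reg | 69.87 ± 9.96 | 84.33 ± 7.23 | 91.54 ± 2.53 | 30.30 ± 0.99 | 42.34 ± 1.54 | 50.61 ± 1.48 | 77.98 ± 32.06 | 96.01 ± 3.71 | 97.46 ± 0.45 |
| FixMatch + Sel | 65.73 ± 10.32 | 79.24 ± 10.00 | 89.87 ± 4.96 | 24.17 ± 1.55 | 35.78 ± 1.58 | 46.05 ± 1.28 | 51.40 ± 26.66 | 91.90 ± 5.77 | 96.41 ± 3.06 |
| FixMatch + Sel + Reg | 70.87 ± 7.35 | 88.20 ± 4.29 | 91.52 ± 2.81 | 27.97 ± 1.12 | 38.96 ± 1.42 | 48.01 ± 1.72 | 69.61 ± 24.33 | 96.26 ± 2.86 | 97.61 ± 0.33 |

Table 4. Accuracy comparison on CIFAR-10, CIFAR-100, and SVHN over 5 different folds, with 1, 2, and 4 labeled samples per class.

| Method | STL-10, 10 labels | STL-10, 20 labels | STL-10, 40 labels |
|---|---|---|---|
| FixMatch | 30.82 ± 6.73 | 43.24 ± 6.32 | 60.92 ± 5.60 |
| FixMatch + Sel | 30.07 ± 5.82 | 45.45 ± 3.24 | 63.93 ± 9.65 |
| FixMatch + Sel + Reg | 37.91 ± 6.66 | 61.00 ± 15.87 | 74.45 ± 13.50 |

Table 5. Accuracy comparison on STL-10 over 5 different folds, with 1, 2, and 4 labeled samples per class.

4.2. Datasets and Implementation Details

We conduct experiments on CIFAR-10, CIFAR-100, SVHN, and STL-10. Images in CIFAR-10/100 and SVHN have 3 channels of size 32×32, while images in STL-10 have 3 channels of size 96×96. CIFAR-10, SVHN, and STL-10 contain 10 classes, and CIFAR-100 contains 100 classes. CIFAR-10 contains 50,000 training images and 10,000 test images. CIFAR-100 also contains 50,000 training images and 10,000 test images. SVHN contains 73,257 training images, 26,032 test images, and 531,131 extra images. STL-10 consists of 5,000 training images, 100,000 unlabeled images, and 8,000 test images.

In CIFAR-10 and CIFAR-100, the images used as labeled data are randomly selected from the training images, uniformly over the classes, and the remaining training images are treated as unlabeled data. In SVHN and STL-10, the images used as labeled data are selected in the same way from the training and extra images, and the remaining images are treated as unlabeled data. Unlike CIFAR-10/100 and STL-10, SVHN is not a class-balanced dataset: the number of samples per class ranges from 6.47% to 17.28% of the total data.

We set λ_U = 1, η = 0.03, β = 0.9, τ = 0.95, µ = 7, and B = 64 for FixMatch, and λ_cls = 1, η = 0.03, τ = 0.95, µ = 7, B = 64, α = 0.9, τ = 0.2, K = 2560, T = 0.8, and λ_ctr = 1 for CoMatch. These hyperparameters follow the original papers [17, 27]. For UDA, we adopt the same values used by Sohn et al. [27]: λ_U = 1, η = 0.03, temperature τ = 1, confidence threshold β = 0.9, µ = 7, and B = 64. We use RandAugment [7] as the strong augmentation for CoMatch, and CTAugment [2] as the augmentation for FixMatch and UDA. The weight of the propagation regularizer L_pr is set to 1.0 on CIFAR-100 and to 0.4 on CIFAR-10, SVHN, and STL-10. On CIFAR-10/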
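The propagation regularizer of Eqs. (2) and (3) and the selection score of Eqs. (4) and (5) in Section 3 can be sketched in a few lines. This is a minimal NumPy illustration under our own assumptions, not the authors' implementation: the function names, the small `eps` added inside the logarithm for numerical safety, and the hypothetical helper `confident_batch` that builds toy softmax outputs are all ours. In actual training the same computation would be written in an automatic differentiation framework so that $L_{pr}$ can be added to the SSL loss.

```python
import numpy as np


def masked_mean_probs(probs, tau=0.95):
    """Eq. (3): batch average of softmax outputs, keeping only confident samples."""
    probs = np.asarray(probs, dtype=float)
    mask = probs.max(axis=1) >= tau  # indicator 1(max(p(u)) >= tau)
    return (probs * mask[:, None]).sum(axis=0) / len(probs)


def propagation_regularizer(probs, tau=0.95, eps=1e-12):
    """Eq. (2): 1 minus the base-c entropy of P_U (eps is our numerical safeguard)."""
    probs = np.asarray(probs, dtype=float)
    p_u = masked_mean_probs(probs, tau)
    c = probs.shape[1]
    entropy = -np.sum(p_u * np.log(p_u + eps)) / np.log(c)
    return float(1.0 - entropy)


def utilization(probs, tau=0.95):
    """Eq. (4): fraction of unlabeled samples that pass the confidence threshold."""
    probs = np.asarray(probs, dtype=float)
    return float(np.mean(probs.max(axis=1) >= tau))


def selection_score(probs, tau=0.95):
    """Eq. (5): Sel = (1 - L_pr) + T_U, evaluated once per epoch for model selection."""
    return (1.0 - propagation_regularizer(probs, tau)) + utilization(probs, tau)


def confident_batch(labels, num_classes, conf=0.97):
    """Hypothetical helper: rows that put probability `conf` on the given label."""
    rest = (1.0 - conf) / (num_classes - 1)
    probs = np.full((len(labels), num_classes), rest)
    probs[np.arange(len(labels)), labels] = conf
    return probs


balanced = confident_batch([0, 1, 2, 3], num_classes=4)   # one pseudo-label per class
collapsed = confident_batch([0, 0, 0, 0], num_classes=4)  # every sample labeled class 0
unsure = np.full((4, 4), 0.25)                            # nothing passes the threshold

for name, batch in [("balanced", balanced), ("collapsed", collapsed), ("unsure", unsure)]:
    print(name, propagation_regularizer(batch), utilization(batch), selection_score(batch))
```

In this toy setting the balanced batch yields an $L_{pr}$ near 0 and the highest Sel, the collapsed batch is penalized by a large $L_{pr}$ even though every sample is used, and the batch with no confident prediction scores lowest because both terms vanish. Note that, as Eq. (3) is written, $P_U$ is divided by the full batch size $|U|$, so its entries sum to $T_U$ rather than to 1 whenever some samples fall below the threshold.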
