
A Novel Contrast Co-Learning Framework For Generating High Quality Training
Data
Zeyu Zheng*
School of EECS
Peking University, Beijing
P. R. China, 100871
perhaps.zzy@gmail.com
Jun Yan
Microsoft Research Asia
Beijing
P. R. China, 100190
junyan@microsoft.com
Shuicheng Yan
National University of Singapore
4 Engineering Drive 3
Singapore, 117576
eleyans@nus.edu.sg
Ning Liu
Microsoft Research Asia
Beijing
P. R. China, 100190
ningl@microsoft.com
Zheng Chen
Microsoft Research Asia
Beijing
P. R. China, 100190
zhengc@microsoft.com
Ming Zhang
School of EECS
Peking University, Beijing
P. R. China, 100871
mzhang@net.pku.edu
Abstract—The good performance of most classical learning algorithms generally relies on high quality training data, i.e., data that are clean and unbiased. However, such data are becoming increasingly difficult to obtain in many real world problems because of the difficulty of collecting large scale unbiased data and labeling them precisely for training. In this paper, we propose a general Contrast Co-learning (CCL) framework to refine biased and noisy training data when an unbiased yet unlabeled data pool is available. CCL starts with multiple sets of probably biased and noisy training data and trains a set of classifiers individually. Then, under the assumption that confidently classified data samples have a higher probability of being correctly classified, CCL iteratively and automatically filters out possible noise and adds confidently classified samples from the unlabeled data pool to correct the bias. Through this process, we can generate a cleaner and unbiased training dataset with theoretical guarantees. Extensive experiments on two public text datasets clearly show that CCL consistently improves classification performance on biased and noisy training data compared with several state-of-the-art classical algorithms.
Keywords—Noisy training data, Training data bias, Contrast Classifier, Co-learning
I. INTRODUCTION
In many classical machine learning problems, the training datasets are assumed to be clean and to exactly reflect the true data distribution. However, this assumption is increasingly difficult to satisfy in real world applications. For example, with the rapid growth of the World Wide Web, the problem of categorizing Web objects such as Web pages, search queries and Internet users into meaningful classes is attracting increasing attention from both academia and industry. However, the exponentially increasing data scale makes it very difficult to construct unbiased training datasets that reflect the true data distribution. On the other hand, as pointed out by [20], in some real world application scenarios such as user search intent classification [22], it is difficult or even impossible to precisely label data samples for training purposes. These difficulties in data labeling generally result in highly noisy training datasets. In this situation, if we directly apply classical classification models to such biased and noisy training data, the classification results will not be as good as those obtained with clean and unbiased training data.
In this paper, we propose a novel Contrast Co-learning (CCL) framework to generate high quality training data from noisy and biased training data, so that many traditional classification models can be applied with better performance than when they are trained directly on the low quality data. The proposed CCL framework starts with multiple biased and noisy training datasets, which could be datasets labeled by different editors or datasets automatically generated by different rules [5]. In addition, the multiple training datasets could be subsets divided from a single larger training dataset. To reduce the bias and noise in the training data, CCL follows an iterative procedure that filters out noisy samples and adds new automatically labeled training samples. Each iteration of CCL consists of the following three components (a sketch of one round is given after this list):
• Noise Filtering Component (NFC) for noise
detection and filtering;
• Bias Detection Component (BDC) for identifying the underrepresented samples, namely those which have the potential to reduce the bias of the training data if they are labeled;
• Automatic Labeling Component (ALC) for
automatically assigning labels to those
underrepresented data.
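The following minimal Python sketch illustrates how one CCL round could tie these three components together. The helper train_classifier, the concrete confidence thresholds, and the scikit-learn style predict_proba/classes_ interface are all illustrative assumptions rather than the authors' implementation, which is specified in the following sections.

def ccl_round(training_sets, unlabeled_pool, train_classifier,
              noise_conf=0.9, add_conf=0.9):
    """Refine each biased/noisy training set with one NFC + BDC + ALC pass."""
    refined_sets = []
    for data in training_sets:                  # data: list of (x, y) pairs
        clf = train_classifier(data)            # train an individual classifier

        # Noise Filtering Component (NFC): drop a training sample when the
        # classifier confidently predicts a label that disagrees with the
        # given one (one plausible reading of the CCL assumption).
        kept = []
        for x, y in data:
            proba = clf.predict_proba([x])[0]
            pred = clf.classes_[proba.argmax()]
            if pred == y or proba.max() < noise_conf:
                kept.append((x, y))

        # Bias Detection (BDC) + Automatic Labeling (ALC): pull confidently
        # classified samples from the unbiased, unlabeled pool and assign
        # them the predicted label to correct the training-set bias.
        added = []
        for x in unlabeled_pool:
            proba = clf.predict_proba([x])[0]
            if proba.max() >= add_conf:
                added.append((x, clf.classes_[proba.argmax()]))

        refined_sets.append(kept + added)
    return refined_sets

Here train_classifier could, for example, wrap a scikit-learn LogisticRegression fitted on the (x, y) pairs; since CCL is model-agnostic, any base classifier that exposes prediction confidences fits this skeleton, and the refined sets produced by repeated rounds replace the original training data.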
In detail, given a particular classification model to be used by the CCL framework and multiple biased and noisy training datasets, we first use each training dataset to train an individual classifier. Then we use each classifier to classify its corresponding training dataset. Driven by the core assumption of CCL, i.e., "the confidently classified data
*This work was done when the first author was visiting Microsoft
Research Asia.