
A Novel Contrast Co-Learning Framework For Generating High Quality Training
Data
Zeyu Zheng*
School of EECS
Peking University, Beijing
P. R. China, 100871
perhaps.zzy@gmail.com
Jun Yan
Microsoft Research Asia
Beijing
P. R. China, 100190
junyan@microsoft.com
Shuicheng Yan
National University of Singapore
4 Engineering Drive 3
Singapore, 117576
eleyans@nus.edu.sg
Ning Liu
Microsoft Research Asia
Beijing
P. R. China, 100190
ningl@microsoft.com
Zheng Chen
Microsoft Research Asia
Beijing
P. R. China, 100190
zhengc@microsoft.com
Ming Zhang
School of EECS
Peking University, Beijing
P. R. China, 100871
mzhang@net.pku.edu
Abstract—The good performance of most classical learning algorithms generally relies on high quality training data, i.e., data that are clean and unbiased. However, such data are becoming increasingly difficult to obtain in many real world problems because of the difficulty of collecting large scale unbiased data and labeling them precisely for training. In this paper, we propose a general Contrast Co-learning (CCL) framework to refine biased and noisy training data when an unbiased yet unlabeled data pool is available. CCL starts with multiple sets of probably biased and noisy training data and trains a set of classifiers individually. Then, under the assumption that confidently classified data samples have a higher probability of being correctly classified, CCL iteratively and automatically filters out possible noise and adds confidently classified samples from the unlabeled data pool to correct the bias. Through this process, we can generate a cleaner and unbiased training dataset with theoretical guarantees. Extensive experiments on two public text datasets clearly show that CCL consistently improves classification performance on biased and noisy training data compared with several state-of-the-art classical algorithms.
Keywords—Noisy training data, Training data bias, Contrast Classifier, Co-learning
I. INTRODUCTION
In many classical machine learning problems, the training datasets are assumed to be clean and to exactly reflect the true data distribution. However, this assumption is increasingly difficult to satisfy in real world applications. For example, with the rapid growth of the World Wide Web, the problem of categorizing Web objects such as Web pages, search queries and Internet users into meaningful classes is attracting increasing attention from both academia and industry. However, the exponentially increasing data scale makes it very difficult to construct unbiased training datasets that reflect the true data distribution. On the other hand, as pointed out by [20], in some real world application scenarios such as user search intent classification [22], it is difficult or even impossible to precisely label data samples for training purposes. These difficulties in data labeling generally result in highly noisy training datasets. In this situation, if we directly apply classical classification models to such biased and noisy training data, the classification results will not be as good as those obtained with clean and unbiased training data.
In this paper, we propose a novel Contrast Co-learning (CCL) framework to generate high quality training data from noisy and biased training data, so that many traditional classification models can be applied with better performance than when they are trained directly on the low quality data. The proposed CCL framework starts with multiple biased and noisy training datasets, which could be datasets labeled by different editors or datasets automatically generated by different rules [5]. In addition, the multiple training datasets could be subsets divided from a single larger training dataset. To reduce the bias and noise in the training data, CCL follows an iterative procedure that filters out noisy samples and adds new automatically labeled training samples. Each iteration of CCL consists of the following three components (a sketch of one round is given after this list):
• Noise Filtering Component (NFC) for noise
detection and filtering;
• Bias Detection Component (BDC) for identifying the underrepresented samples, namely those which have the potential to reduce the bias of the training data if they are labeled;
• Automatic Labeling Component (ALC) for
automatically assigning labels to those
underrepresented data.
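The following minimal Python sketch illustrates how one CCL round could tie these three components together. The helper train_classifier, the concrete confidence thresholds, and the scikit-learn style predict_proba/classes_ interface are all illustrative assumptions rather than the authors' implementation, which is specified in the following sections.

def ccl_round(training_sets, unlabeled_pool, train_classifier,
              noise_conf=0.9, add_conf=0.9):
    """Refine each biased/noisy training set with one NFC + BDC + ALC pass."""
    refined_sets = []
    for data in training_sets:                  # data: list of (x, y) pairs
        clf = train_classifier(data)            # train an individual classifier

        # Noise Filtering Component (NFC): drop a training sample when the
        # classifier confidently predicts a label that disagrees with the
        # given one (one plausible reading of the CCL assumption).
        kept = []
        for x, y in data:
            proba = clf.predict_proba([x])[0]
            pred = clf.classes_[proba.argmax()]
            if pred == y or proba.max() < noise_conf:
                kept.append((x, y))

        # Bias Detection (BDC) + Automatic Labeling (ALC): pull confidently
        # classified samples from the unbiased, unlabeled pool and assign
        # them the predicted label to correct the training-set bias.
        added = []
        for x in unlabeled_pool:
            proba = clf.predict_proba([x])[0]
            if proba.max() >= add_conf:
                added.append((x, clf.classes_[proba.argmax()]))

        refined_sets.append(kept + added)
    return refined_sets

Here train_classifier could, for example, wrap a scikit-learn LogisticRegression fitted on the (x, y) pairs; since CCL is model-agnostic, any base classifier that exposes prediction confidences fits this skeleton, and the refined sets produced by repeated rounds replace the original training data.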
In detail, given a particular classification model to be used by the CCL framework and multiple biased and noisy training datasets, we first use each training dataset to train an individual classifier. Then we use each classifier to classify its corresponding training dataset. Driven by the core assumption of CCL, i.e., "the confidently classified data
*This work was done when the first author was visiting Microsoft
Research Asia.