size of GBDT to make it suitable for XMLC. CRAFTML [77]
uses fast partitioning strategies and exploits a random forest
algorithm. CRAFTML first randomly projects the feature and
label vectors into lower-dimensional spaces. A k-means
algorithm is then applied to the projected labels to partition the
instances into k temporary subsets. At prediction time, a test
instance follows a root-to-leaf path in each tree and retrieves the
average label vector stored in the leaf node; the forest then
aggregates these label vectors across trees. GBDT-SPARSE and
CRAFTML also open the way to parallelization.
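To make the partitioning step concrete, the following sketch (an illustration under our own assumptions, not the authors' implementation) shows one CRAFTML-style node split: features and labels are randomly projected, and k-means on the projected labels assigns instances to k child subsets whose average label vectors would be stored at the leaves.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_node(X, Y, k=10, feat_dim=128, label_dim=64, seed=0):
    """One CRAFTML-style node split (illustrative sketch).

    X: (n, d) feature matrix, Y: (n, L) binary label matrix.
    Returns the k instance subsets, their mean label vectors,
    and the random projections used.
    """
    rng = np.random.default_rng(seed)
    # Random projections of features and labels into low-dimensional spaces.
    P_x = rng.standard_normal((X.shape[1], feat_dim)) / np.sqrt(feat_dim)
    P_y = rng.standard_normal((Y.shape[1], label_dim)) / np.sqrt(label_dim)
    Xp, Yp = X @ P_x, Y @ P_y

    # k-means on the projected *labels* partitions the instances into k subsets.
    assign = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(Yp)
    subsets = [np.where(assign == c)[0] for c in range(k)]

    # Each leaf stores the average label vector of its instances; at test time
    # a point is routed towards the closest child (in the full method, using
    # the projected feature space) until it reaches a leaf.
    leaf_means = [Y[idx].mean(axis=0) if len(idx) else np.zeros(Y.shape[1])
                  for idx in subsets]
    return subsets, leaf_means, (P_x, P_y)
```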
2.3 One-vs-all Methods
One-vs-all methods are among the most popular strategies
for multi-label classification: they independently train a
binary classifier for each label. However, this technique
suffers from two major limitations in XMLC: 1) training one-vs-all
classifiers for XMLC problems with off-the-shelf solvers
such as Liblinear can be infeasible in both computation and
memory; 2) the model size for an XMLC data set can be
extremely large, which leads to slow prediction. Recently,
many works have been developed to address these
issues of one-vs-all methods in XMLC.
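As a point of reference, the naive baseline below makes the linear dependence on the number of labels explicit: it fits one independent binary classifier per label with an off-the-shelf solver and linearly scans all labels at prediction time. This is an illustrative sketch only (the solver choice and function names are ours), not the setup of any particular paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, Y):
    """Naive one-vs-all: one binary classifier per label (O(L) fits)."""
    models = []
    for l in range(Y.shape[1]):
        clf = LogisticRegression(max_iter=200)
        clf.fit(X, Y[:, l])          # training cost grows linearly with L
        models.append(clf)
    return models

def predict_top_k(models, x, k=5):
    """Score every label (a linear scan over L models) and return the top-k."""
    scores = np.array([m.decision_function(x.reshape(1, -1))[0] for m in models])
    return np.argsort(-scores)[:k]
```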
By exploiting the sparsity of the data, several sub-linear
algorithms have been proposed to adapt one-vs-all methods
to the extreme classification setting. For example, PD-Sparse
[74] minimizes the separation ranking loss [78] with an
ℓ1 penalty in an Empirical Risk Minimization (ERM)
framework for XMLC. The separation ranking loss penalizes
the prediction on an instance by the highest response from
the set of negative labels minus the lowest response from the
set of positive labels. PD-Sparse obtains an extremely sparse
solution in both the primal and the dual with sub-linear time
cost, while yielding higher accuracy than SLEEC, FastXML
and several other XMLC methods. By introducing separable
loss functions, PPDSparse [7] parallelizes PD-Sparse with
sub-linear algorithms to scale out training. PPDSparse
also reduces the memory cost of PD-Sparse by orders of
magnitude, since training is separated across labels.
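In symbols, writing f_l(x) for the score of label l on instance x, and P and N for the positive and negative label sets of x, the separation ranking loss described above takes the hinge form (the unit margin is the standard convention and is our addition here):

```latex
L_{\mathrm{sep}}(x) \;=\; \max\Bigl(0,\; 1 + \max_{l \in N} f_l(x) \;-\; \min_{l \in P} f_l(x)\Bigr)
```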
DiSMEC [6] also presents a sparse model with a parameter-thresholding
strategy, and employs a double layer of parallelization
to scale one-vs-all methods to problems involving
hundreds of thousands of labels. ProXML [79] proposes an
ℓ1-regularized Hamming loss to address the tail-label issue,
and shows, using arguments from graph theory, that minimizing
a one-vs-all objective based on the Hamming loss works well
for tail-label prediction in XMLC.
PD-Sparse, PPDSparse, DiSMEC and ProXML achieve high
prediction accuracy with small model sizes.
However, they still train a separate linear classifier for each
label and linearly scan every label at test time to decide whether
it is relevant. Thus the training and testing costs of
these methods grow linearly with the number of labels.
Several more advanced methods address this issue.
For example, to reduce the linear prediction cost of one-vs-all
methods, [75] proposes to predict on a small candidate set of
labels, generated by projecting a test instance onto a
filtering line and retaining only the labels that have training
instances in the vicinity of this projection. The candidate
label set should retain most of the true labels of the test
instances while being as small as possible; the label
filters are trained by optimizing these two objectives as a mixed-integer
problem. Label filters can reduce the testing time of existing
XMLC classifiers by orders of magnitude while yielding
comparable prediction accuracy. [75] thus offers an interesting
technique for finding a small number of potentially relevant
labels instead of scanning a very long label list.
How to use label filters to speed up training is left
as an open problem.
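The prediction-time idea can be sketched as follows (a simplified illustration; the actual filters of [75] learn the projection direction and the label intervals jointly via a mixed-integer formulation, whereas here the direction w is assumed given): each label keeps the interval of projections of its positive training instances on the filtering line, and a test instance is scored only against labels whose interval contains its own projection.

```python
import numpy as np

def build_label_filter(X, Y, w, slack=0.0):
    """For each label, store the [low, high] interval of projections w.x
    over its positive training instances (simplified stand-in for the
    learned filters of [75])."""
    proj = X @ w
    intervals = []
    for l in range(Y.shape[1]):
        p = proj[Y[:, l] == 1]
        if len(p) == 0:
            intervals.append((np.inf, -np.inf))   # label never active
        else:
            intervals.append((p.min() - slack, p.max() + slack))
    return np.array(intervals)

def candidate_labels(x, w, intervals):
    """Return the (hopefully small) set of labels whose interval
    contains the projection of the test instance."""
    z = float(x @ w)
    lo, hi = intervals[:, 0], intervals[:, 1]
    return np.where((lo <= z) & (z <= hi))[0]
```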
Parabel [8] reduces the training time of one-vs-all methods
from O(ndL) to O((nd log L)/L) by learning balanced binary
label trees over an efficient and informative label
representation. It also introduces a probabilistic hierarchical
multi-label model that generalizes hierarchical softmax to
the multi-label setting, together with a logarithmic-time
prediction algorithm for XMLC. Experiments show
that Parabel can be orders of magnitude faster at training
and prediction than state-of-the-art one-vs-all
extreme classifiers. However, Parabel is less accurate on low-dimensional
data sets, because it cannot guarantee that
similar labels are placed in the same group, and errors
propagate down its deep trees. To reduce this
error propagation, Bonsai [80] uses a shallow k-ary label
tree structure with a generalized label representation. A novel
negative sampling technique is presented in Slice [9] to
improve prediction accuracy on low-dimensional dense
feature representations. Slice cuts the training cost of
one-vs-all methods from linear in L to O(nd log L)
by training each classifier on only O((n/L) log L) of the most
confusing negative examples rather than on all n training points.
Slice employs a generative model to estimate the O((n/L) log L)
negative examples for each label via approximate
nearest neighbour search (ANNS) in time O((n+L)d log L),
and predicts on only the O(log L) most probable
labels for each test point. Slice is up to 15% more accurate
than Parabel, and scales to 100 million labels and
240 million training points. The experiments in [9] show
that negative sampling is a powerful tool in XMLC, and
the performance gains from more advanced negative sampling
techniques deserve further study.
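The following sketch conveys the flavour of ANNS-based negative sampling. It is our own simplification: an exact k-nearest-neighbour search over label centroids stands in for Slice's generative model and HNSW index, and all names are illustrative. Each label's one-vs-all classifier would then be trained on its positives plus the returned hard negatives instead of on all n points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hardest_negatives(X, Y, n_neighbors=50):
    """Slice-style negative sampling (illustrative sketch).

    For each label, the 'most confusing' negatives are approximated by the
    training points closest to that label's positives in feature space.
    Slice itself uses a generative model plus an HNSW ANNS index; here an
    exact k-NN search around the label centroid stands in for that step.
    """
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    negatives = {}
    for l in range(Y.shape[1]):
        pos = np.where(Y[:, l] == 1)[0]
        if len(pos) == 0:
            negatives[l] = np.array([], dtype=int)
            continue
        # Neighbours of the label centroid approximate the confusing region.
        centroid = X[pos].mean(axis=0, keepdims=True)
        _, idx = nn.kneighbors(centroid)
        cand = idx[0]
        negatives[l] = cand[Y[cand, l] == 0]   # keep only true negatives
    return negatives
```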
The training and testing time complexities of some XMLC
methods are summarized in Table 1. From Table 1, we
can see that tree-based methods usually have much lower
training and testing time complexity than embedding
and one-vs-all methods. The testing time of one-vs-all
methods can be reduced from linear to logarithmic in
the number of labels by combining tree structures with
negative sampling. In the future, more advanced
techniques are needed to further reduce the time cost of XMLC.
3 MULTI-LABEL LEARNING WITH LIMITED SUPERVISION
Collecting fully supervised data is usually hard and expensive,
and is thus a critical bottleneck in real-world classification
tasks. In MLC problems, there are many ground-truth
labels and the output space can be very large, which further
aggravates the difficulty of precise annotation. To mitigate
this problem, many works have studied different settings
of MLC with limited supervision. How to model label
dependencies and handle incomplete supervision pose two