88 G. Wu et al. / Neurocomputing 289 (2018) 86–100
globally by employing Laplacian regularizations. MIFS [12] proposes multi-label informed feature selection by embedding the label space into a low-dimensional space to exploit label correlation. LIFT [15] and LLSF [16] utilize label-specific features to improve
the performance of multi-label learning. In summary, on one hand, many algorithms extending BR use SVM as the base binary classifier and train each classifier individually, because traditional BR decomposes the problem into many independent binary classifiers and learns each one separately, which makes it hard to extend. On the other hand, when extending BR by training many binary classifiers jointly, most of these methods use the least squares loss function, which is better suited to a regression task than to a classification task. Furthermore, there is no unified framework of BR for various loss functions. To tackle the above issues, we propose a unified framework implementing linear Binary Relevance for various loss functions, which is easy to extend. We also analyze the influence of different loss functions to find those suitable for multi-label learning.
Multi-class classification can be viewed as a special case of multi-label learning in which the number of labels per instance is restricted to one. The one-vs-all algorithm [36] can also be viewed as a special case of the BR algorithm with a particular decision function. In one-vs-all multi-class classification, an instance is predicted as the label with the maximum classification score among all the corresponding binary classifiers. In contrast, in BR for multi-label learning, an instance is predicted as the label set containing every label whose corresponding binary classifier outputs a positive prediction.
3. Model
3.1. Notations
For an arbitrary matrix $A$, $A^\top$ denotes its transpose, $a^i$ and $a_j$ denote the $i$th row and $j$th column of $A$, $\|a^i\|_2$ denotes the vector $\ell_2$-norm, $\|A\|_1$ denotes the matrix $\ell_1$-norm and $\|A\|_F$ is the Frobenius norm. For two matrices $A$ and $B$, $A \circ B$ denotes the Hadamard (element-wise) product. For a function $g: \mathbb{R} \to \mathbb{R}$, $g'$ is the corresponding derivative function, and for an arbitrary matrix $A \in \mathbb{R}^{n \times m}$, $g(A): \mathbb{R}^{n \times m} \to \mathbb{R}^{n \times m}$ with $(g(A))_{ij} = g(A_{ij})$. $I$ denotes the matrix of corresponding size where each element equals 1.
Given a training dataset $D = \{x_i, y_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ is a real-valued instance vector and $y_i \in \{-1, 1\}^c$ is the label vector of $x_i$, $n$ is the number of samples, $m$ is the feature dimension of an instance and $c$ is the number of class labels. Therefore, $y_{ij}$ denotes the $j$th label of the $i$th instance, and $y_{ij} = 1$ (or $-1$) means the $j$th label is relevant (or irrelevant). Denote the dataset by matrices $D = (X, Y)$, where $X \in \mathbb{R}^{n \times m}$ and $Y \in \{-1, 1\}^{n \times c}$.
The task of multi-label learning is to find a multi-label classifier $H: \mathbb{R}^m \to \{-1, 1\}^c$. In BR [4], $H$ can be decomposed into $c$ independent binary classifiers, one per label, so $H = \{h_1, h_2, \ldots, h_c\}$ and $h_j(x_i)$ denotes the prediction of $y_{ij}$. A multi-label predictor $F: \mathbb{R}^m \to \mathbb{R}^c$ is first learned, and it can also be decomposed into $c$ independent predictors $\{f_1, f_2, \ldots, f_c\}$ on each label, where $f_j(x_i)$ denotes the predicted real value of $y_{ij}$. Then $H$ can be induced from $F$ by a thresholding function. In this paper, for multi-label learning, $h_j(x_i) = [\![\, f_j(x_i) > t(x_i) \,]\!]$, where the thresholding function is $t(x_i) = 0$ for all instances and $[\![\, \pi \,]\!]$ equals 1 when the proposition $\pi$ holds and 0 otherwise. When $f_j(x_i) < 0$ for all $j = 1, \ldots, c$, the predicted label set of $x_i$ is empty. In this case, we choose the so-called T-Criterion [4] to predict the label with the largest predictor value.
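The decision rule above (zero-threshold BR with the T-Criterion fallback for empty predictions) can be sketched in a few lines of numpy; the random data and weights here are purely illustrative:

```python
import numpy as np

def br_predict(X, W):
    """Binary Relevance decision rule: threshold each predictor at 0;
    if no label fires for an instance, the T-Criterion assigns the label
    with the largest predictor value."""
    F = X @ W                        # n x c real-valued predictions f_j(x_i)
    H = np.where(F > 0, 1, -1)       # h_j(x_i) = [[ f_j(x_i) > 0 ]]
    empty = ~(H == 1).any(axis=1)    # instances whose predicted label set is empty
    if empty.any():
        top = F[empty].argmax(axis=1)            # T-Criterion fallback
        H[np.flatnonzero(empty), top] = 1
    return H

# toy example: 2 instances, 3 features, 4 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3))
W = rng.normal(size=(3, 4))
print(br_predict(X, W))
```

Note that after the fallback every instance is guaranteed at least one relevant label, matching the T-Criterion's purpose.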
Note that in multi-class classification each instance has exactly one class. Thus, the decision function of one-vs-all multi-class classification differs from that of BR for multi-label learning, while the training process is the same: in one-vs-all multi-class classification, an instance $x_i$ is predicted as the class label with the largest predictor value.
In this paper, we focus on BR [4] and one-vs-all multi-class classification [36] where the base binary classifier is a linear model; the linear model can be easily extended to a non-linear kernel model [37] by the kernel trick, which we do not discuss in this paper. Thus, our goal is to find a coefficient matrix $W = [w_1, w_2, \ldots, w_c] \in \mathbb{R}^{m \times c}$, which maps the instance space to the label space.
3.2. Unified model
In the traditional implementation of BR [4] for multi-label learning or one-vs-all for multi-class classification [36], the problem is decomposed into $c$ independent binary classifiers, and each classifier is trained independently.
For the $j$th binary classifier, corresponding to the $j$th label, the model is as follows:
$$\min_{w_j} \; \sum_{i=1}^{n} \mathrm{loss}\big(y_{ij}, f_j(x_i)\big) + \lambda_j R(w_j) \tag{1}$$
where $\mathrm{loss}(y_{ij}, f_j(x_i))$ is a loss function measuring the risk between the true label $y_{ij}$ and the predicted value $f_j(x_i)$, $R(w_j)$ is a regularizer which controls the complexity of the model, and $\lambda_j$ is a tradeoff parameter. In this paper, we mainly discuss five popular loss functions for linear binary classification and set $R(w_j) = \|w_j\|_2^2$. Because the model is linear, we can write $f_j(x_i) = x_i w_j + b_j$, where $b = [b_1, b_2, \ldots, b_c] \in \mathbb{R}^c$ is the bias. For simplicity, the bias $b_j$ can be absorbed into $w_j$ by appending the constant value 1 to each instance $x_i$ as an additional feature.
Thus, the model can be written as
$$\min_{w_j} \; \sum_{i=1}^{n} \mathrm{loss}\big(y_{ij}, x_i w_j\big) + \lambda_j \|w_j\|_2^2 \tag{2}$$
Thus, $w_j$, $j = 1, 2, \ldots, c$, is learned independently to compose $W = [w_1, w_2, \ldots, w_c]$. This is equivalent to solving the following problem:
$$\min_{w_1, w_2, \ldots, w_c} \; \sum_{j=1}^{c} \sum_{i=1}^{n} \mathrm{loss}\big(y_{ij}, x_i w_j\big) + \sum_{j=1}^{c} \lambda_j \|w_j\|_2^2 \tag{3}$$
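To illustrate the per-label decomposition in Eq. (3): with the squared loss, each subproblem is a ridge regression with the closed-form solution $w_j = (X^\top X + \lambda_j I)^{-1} X^\top y_j$, so $W$ can be assembled one column at a time. The sketch below uses synthetic data, and the squared loss is chosen only because it admits a closed form; the framework itself is loss-agnostic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, c = 20, 5, 3
X = rng.normal(size=(n, m))
Y = np.where(rng.normal(size=(n, c)) > 0, 1.0, -1.0)
lam = np.array([0.1, 1.0, 10.0])   # one tradeoff parameter lambda_j per label

# Eq. (3) with the squared loss: c independent ridge problems, one per label
W = np.column_stack([
    np.linalg.solve(X.T @ X + lam[j] * np.eye(m), X.T @ Y[:, j])
    for j in range(c)
])
print(W.shape)   # (5, 3): W maps the feature space to the label space
```

Each column satisfies the first-order optimality condition of its own subproblem, which is exactly why the joint objective splits into $c$ independent pieces.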
Set $\Lambda = [\lambda_1 e, \lambda_2 e, \ldots, \lambda_c e] \in \mathbb{R}^{m \times c}$, where $e$ is the column vector of corresponding size whose elements all equal 1. Because $\mathrm{loss}(y_{ij}, x_i w_j) \geq 0$ always holds due to the property of the loss function, this problem can be written as
$$\min_{W} \; \|\mathrm{loss}(Y, XW)\|_1 + \|\Lambda^{\frac{1}{2}} \circ W\|_F^2 \tag{4}$$
where $\Lambda^{\frac{1}{2}}$ denotes the element-wise square root of $\Lambda$.
where $(\mathrm{loss}(Y, XW))_{ij} = \mathrm{loss}(y_{ij}, x_i w_j)$. Denote $F(W) = \|\mathrm{loss}(Y, XW)\|_1 + \|\Lambda^{\frac{1}{2}} \circ W\|_F^2$. If the $\lambda_j$ of the binary classifiers are all set equal to a common $\lambda$, this problem becomes
$$\min_{W} \; \|\mathrm{loss}(Y, XW)\|_1 + \lambda \|W\|_F^2 \tag{5}$$
which is the form we mainly extend. We solve problem Eq. (4) using a gradient (or subgradient) descent algorithm in this paper. Next we discuss the gradient (or subgradient) of the objective function w.r.t. $W$.
Generally speaking, the loss function can be written as $\mathrm{loss}(y, xw) = g(y(xw))$, where $g: \mathbb{R} \to \mathbb{R}^{+}$. Thus, $\mathrm{loss}(Y, XW) = g(Y \circ (XW))$.
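As a concrete instance of this margin form, the sketch below uses the logistic loss $g(z) = \log(1 + e^{-z})$ (chosen here purely for illustration; the paper covers several such losses) and confirms that the matrix expression $g(Y \circ (XW))$ matches the elementwise definition:

```python
import numpy as np

def g(z):
    # logistic loss on the margin z = y * f(x); logaddexp is a stable log(1 + e^{-z})
    return np.logaddexp(0.0, -z)

rng = np.random.default_rng(0)
n, m, c = 6, 4, 3
X = rng.normal(size=(n, m))
W = rng.normal(size=(m, c))
Y = np.where(rng.normal(size=(n, c)) > 0, 1.0, -1.0)

L = g(Y * (X @ W))                 # loss(Y, XW) = g(Y o (XW)), an n x c matrix

# matches the elementwise definition loss(y_ij, x_i w_j) at any entry, e.g. (2, 1)
i, j = 2, 1
print(np.isclose(L[i, j], g(Y[i, j] * (X[i] @ W[:, j]))))   # prints True
```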
Proposition 1. The gradient (or subgradient) of the objective function Eq. (4) w.r.t. $W$ is as follows:
$$\nabla_W F(W) = X^\top \big(g'(Z) \circ Y\big) + 2\,\Lambda \circ W \tag{6}$$
where $Z = Y \circ (XW)$.
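Proposition 1 can be sanity-checked numerically before working through the proof. The sketch below again assumes the logistic loss $g(z) = \log(1 + e^{-z})$, with $g'(z) = -1/(1 + e^{z})$, and compares the analytic gradient (loss term plus $2\Lambda \circ W$ from the regularizer) against a central finite difference:

```python
import numpy as np

def g(z):           # logistic loss on the margin
    return np.logaddexp(0.0, -z)

def g_prime(z):     # derivative of g
    return -1.0 / (1.0 + np.exp(z))

def objective(W, X, Y, Lam):
    Z = Y * (X @ W)
    return g(Z).sum() + (Lam * W**2).sum()            # ||loss||_1 + regularizer

def gradient(W, X, Y, Lam):
    Z = Y * (X @ W)
    return X.T @ (g_prime(Z) * Y) + 2.0 * Lam * W     # Proposition 1

rng = np.random.default_rng(0)
n, m, c = 8, 4, 3
X = rng.normal(size=(n, m))
Y = np.where(rng.normal(size=(n, c)) > 0, 1.0, -1.0)
Lam = np.tile(np.array([0.1, 0.5, 1.0]), (m, 1))      # Lambda_{pq} = lambda_q
W = rng.normal(size=(m, c))

# finite-difference check of one entry (p, q) of the gradient
eps, p, q = 1e-6, 1, 2
Wp, Wm = W.copy(), W.copy()
Wp[p, q] += eps
Wm[p, q] -= eps
numeric = (objective(Wp, X, Y, Lam) - objective(Wm, X, Y, Lam)) / (2 * eps)
print(np.isclose(gradient(W, X, Y, Lam)[p, q], numeric, atol=1e-4))  # prints True
```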
Proof. Take the partial derivative of the loss term $\|\mathrm{loss}(Y, XW)\|_1$ w.r.t. $w_{pq}$ according to the chain rule:
$$\frac{\partial \|\mathrm{loss}(Y, XW)\|_1}{\partial w_{pq}} = \frac{\partial \left[ \sum_{i=1}^{n} g\big(y_{iq}(x_i w_q)\big) \right]}{\partial w_{pq}}$$