3.1 Learning Resources: Instances and Rules
Our approach allows neural networks to learn from both specific examples and general rules. Here we describe the setup of these “learning resources”.
Assume we have input variable $x \in \mathcal{X}$ and target variable $y \in \mathcal{Y}$. For clarity, we focus on $K$-way classification, where $\mathcal{Y} = \Delta^K$ is the $K$-dimensional probability simplex and $y \in \{0, 1\}^K \subset \mathcal{Y}$ is a one-hot encoding of the class label. However, our method specification can straightforwardly be applied to other contexts such as regression and sequence learning (e.g., NER tagging, which is a sequence of classification decisions). The training data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ is a set of instantiations of $(x, y)$.
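For example, with $K = 3$ classes, a one-hot target and a valid soft prediction on the simplex might look as follows (a minimal sketch with illustrative values):

```python
import numpy as np

K = 3
y = np.array([0.0, 1.0, 0.0])      # one-hot target for class 2: y in {0,1}^K
sigma = np.array([0.2, 0.7, 0.1])  # soft prediction lying on the simplex Delta^K
assert y.sum() == 1.0 and np.isclose(sigma.sum(), 1.0)
```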
Further consider a set of first-order logic (FOL) rules with confidences, denoted as $\mathcal{R} = \{(R_l, \lambda_l)\}_{l=1}^{L}$, where $R_l$ is the $l$th rule over the input-target space $(\mathcal{X}, \mathcal{Y})$, and $\lambda_l \in [0, \infty]$ is the confidence level, with $\lambda_l = \infty$ indicating a hard rule, i.e., all groundings are required to be true ($=1$). Here a grounding is the logic expression with all variables being instantiated. Given a set of examples $(X, Y) \subset (\mathcal{X}, \mathcal{Y})$ (e.g., a minibatch from $\mathcal{D}$), the set of groundings of $R_l$ is denoted as $\{r_{lg}(X, Y)\}_{g=1}^{G_l}$. In practice a rule grounding is typically relevant to only a single example or a subset of examples, though here we give the most general form over the entire set.
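To make the notation concrete, one possible in-code representation of the rule set is sketched below; the `Rule` container and `GroundingFn` signature are illustrative assumptions, not part of the formulation:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A grounding function maps a set of examples (X, Y) to the soft truth
# values {r_lg(X, Y)} in [0, 1], one per grounding of the rule.
GroundingFn = Callable[[Sequence, Sequence], List[float]]

@dataclass
class Rule:
    """One FOL rule R_l paired with its confidence lambda_l."""
    groundings: GroundingFn  # computes the truth values of all groundings on (X, Y)
    confidence: float        # lambda_l in [0, inf]; float("inf") marks a hard rule

# The rule set R = {(R_l, lambda_l)}_{l=1}^L
rules: List[Rule] = []
```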
We encode the FOL rules using soft logic (Bach et al., 2015) for flexible encoding and stable optimization. Specifically, soft logic allows continuous truth values from the interval $[0, 1]$ instead of $\{0, 1\}$, and the Boolean logic operators are reformulated as:
$$
\begin{aligned}
A \,\&\, B &= \max\{A + B - 1, 0\} \\
A \vee B &= \min\{A + B, 1\} \\
A_1 \wedge \cdots \wedge A_N &= \sum\nolimits_i A_i / N \\
\neg A &= 1 - A
\end{aligned}
\tag{1}
$$
Here $\&$ and $\wedge$ are two different approximations to logical conjunction (Foulds et al., 2015): $\&$ is useful as a selection operator (e.g., $A \,\&\, B = B$ when $A = 1$, and $A \,\&\, B = 0$ when $A = 0$), while $\wedge$ is an averaging operator.
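For illustration, the operators in Eq. (1) translate directly into code. The following minimal Python sketch (ours, not part of the original formulation) operates on truth values in $[0, 1]$:

```python
def soft_and_select(a: float, b: float) -> float:
    """Lukasiewicz conjunction A & B = max{A + B - 1, 0} (selection operator)."""
    return max(a + b - 1.0, 0.0)

def soft_or(a: float, b: float) -> float:
    """Disjunction A v B = min{A + B, 1}."""
    return min(a + b, 1.0)

def soft_and_avg(values: list) -> float:
    """Averaging conjunction A_1 ^ ... ^ A_N = (sum_i A_i) / N."""
    return sum(values) / len(values)

def soft_not(a: float) -> float:
    """Negation: 1 - A."""
    return 1.0 - a

# '&' acts as a selector: A & B = B when A = 1, and A & B = 0 when A = 0.
assert soft_and_select(1.0, 0.5) == 0.5
assert soft_and_select(0.0, 0.5) == 0.0
```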
3.2 Rule Knowledge Distillation
A neural network defines a conditional probability $p_\theta(y|x)$ by using a softmax output layer that produces a $K$-dimensional soft prediction vector denoted as $\sigma_\theta(x)$. The network is parameterized by weights $\theta$. Standard neural network training iteratively updates $\theta$ to produce the correct labels of training instances. To integrate the information encoded in the rules, we propose to train the network to also imitate the outputs of a rule-regularized projection of $p_\theta(y|x)$, denoted as $q(y|x)$, which explicitly includes the rule constraints as regularization terms. In each iteration, $q$ is constructed by projecting $p_\theta$ into a subspace constrained by the rules, and thus has desirable properties; we present the construction in the next section. The prediction behavior of $q$ reveals the information of the regularized subspace and structured rules, so emulating the outputs of $q$ serves to transfer this knowledge into $p_\theta$. The new objective is then formulated as a balance between imitating the soft predictions of $q$ and predicting the true hard labels:
$$
\theta^{(t+1)} = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{n=1}^{N} (1 - \pi)\, \ell\big(y_n, \sigma_\theta(x_n)\big) + \pi\, \ell\big(s_n^{(t)}, \sigma_\theta(x_n)\big),
\tag{2}
$$
where $\ell$ denotes the loss function selected according to specific applications (e.g., the cross-entropy loss for classification); $s_n^{(t)}$ is the soft prediction vector of $q$ on $x_n$ at iteration $t$; and $\pi$ is the imitation parameter calibrating the relative importance of the two objectives.
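As a concrete illustration of Eq. (2) for classification with cross-entropy loss, here is a minimal NumPy sketch; the array names and shapes are our assumptions:

```python
import numpy as np

def distillation_loss(y_true, s_teacher, sigma_student, pi):
    """Eq. (2): (1 - pi) * CE(y_n, sigma_theta(x_n)) + pi * CE(s_n, sigma_theta(x_n)),
    averaged over the batch.

    y_true:        (N, K) one-hot labels
    s_teacher:     (N, K) soft predictions s^(t) of the teacher q at iteration t
    sigma_student: (N, K) student softmax outputs sigma_theta(x)
    pi:            imitation parameter in [0, 1]
    """
    eps = 1e-12  # numerical stability for the log
    ce_true = -np.sum(y_true * np.log(sigma_student + eps), axis=1)
    ce_soft = -np.sum(s_teacher * np.log(sigma_student + eps), axis=1)
    return np.mean((1.0 - pi) * ce_true + pi * ce_soft)
```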
A similar imitation procedure has been used in other settings such as model compression (Buciluǎ et al., 2006; Hinton et al., 2015), where the process is termed distillation. Following them, we call $p_\theta(y|x)$ the “student” and $q(y|x)$ the “teacher”. This can be intuitively explained by analogy to human education, where a teacher who is aware of systematic general rules instructs students by providing her solutions to particular questions (i.e., the soft predictions). An important difference from previous distillation work, where the teacher is obtained beforehand and the student is trained thereafter, is that our teacher and student are learned simultaneously during training.
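This simultaneous learning can be sketched as the schematic loop below, where `predict`, `project`, and `update` are hypothetical callables standing in for the student forward pass, the rule-constrained projection presented in the next section, and a gradient step on Eq. (2), respectively:

```python
from typing import Callable, Iterable, Tuple

def distill_train(theta,
                  batches: Iterable[Tuple[list, list]],
                  predict: Callable,  # (theta, X) -> student predictions sigma_theta(X)
                  project: Callable,  # (p, X, Y) -> teacher q's soft predictions s^(t)
                  update: Callable,   # (theta, X, Y, s, pi) -> theta after a step on Eq. (2)
                  pi: float):
    """Iterative rule knowledge distillation (schematic): the teacher q is rebuilt
    from the current student p_theta at every iteration, so both evolve together."""
    for x_batch, y_batch in batches:
        p = predict(theta, x_batch)       # student soft predictions
        s = project(p, x_batch, y_batch)  # rule-regularized projection (teacher)
        theta = update(theta, x_batch, y_batch, s, pi)
    return theta
```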
Though it is possible to combine a neural network with rule constraints by projecting the network to the rule-regularized subspace after it is fully trained as before with only data-label instances, or by optimizing the projected network directly, we found that our iterative teacher-student distillation approach provides much superior performance, as shown in the experiments. Moreover, since $p_\theta$ distills the rule information into the