3.1 Learning Resources: Instances and Rules
Our approach allows neural networks to learn from both specific examples and general rules. Here we describe the setup of these “learning resources”.
Assume we have input variable $x \in \mathcal{X}$ and target variable $y \in \mathcal{Y}$. For clarity, we focus on $K$-way classification, where $\mathcal{Y} = \Delta^K$ is the $K$-dimensional probability simplex and $y \in \{0, 1\}^K \subset \mathcal{Y}$ is a one-hot encoding of the class label. However, our method specification can straightforwardly be applied to other contexts such as regression and sequence learning (e.g., NER tagging, which is a sequence of classification decisions). The training data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ is a set of instantiations of $(x, y)$.
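For example, with $K = 3$ classes, a one-hot target and a valid soft prediction on the simplex might look as follows (a minimal sketch with illustrative values):

```python
import numpy as np

K = 3
y = np.array([0.0, 1.0, 0.0])      # one-hot target for class 2: y in {0,1}^K
sigma = np.array([0.2, 0.7, 0.1])  # soft prediction lying on the simplex Delta^K
assert y.sum() == 1.0 and np.isclose(sigma.sum(), 1.0)
```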
Further consider a set of first-order logic (FOL) rules with confidences, denoted as $\mathcal{R} = \{(R_l, \lambda_l)\}_{l=1}^{L}$, where $R_l$ is the $l$th rule over the input-target space $(\mathcal{X}, \mathcal{Y})$, and $\lambda_l \in [0, \infty]$ is the confidence level, with $\lambda_l = \infty$ indicating a hard rule, i.e., all groundings are required to be true ($=1$). Here a grounding is the logic expression with all variables being instantiated. Given a set of examples $(X, Y) \subset (\mathcal{X}, \mathcal{Y})$ (e.g., a minibatch from $\mathcal{D}$), the set of groundings of $R_l$ is denoted as $\{r_{lg}(X, Y)\}_{g=1}^{G_l}$. In practice a rule grounding is typically relevant to only a single example or a subset of examples, though here we give the most general form over the entire set.
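To make the notation concrete, one possible in-code representation of the rule set is sketched below; the `Rule` container and `GroundingFn` signature are illustrative assumptions, not part of the formulation:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A grounding function maps a set of examples (X, Y) to the soft truth
# values {r_lg(X, Y)} in [0, 1], one per grounding of the rule.
GroundingFn = Callable[[Sequence, Sequence], List[float]]

@dataclass
class Rule:
    """One FOL rule R_l paired with its confidence lambda_l."""
    groundings: GroundingFn  # computes the truth values of all groundings on (X, Y)
    confidence: float        # lambda_l in [0, inf]; float("inf") marks a hard rule

# The rule set R = {(R_l, lambda_l)}_{l=1}^L
rules: List[Rule] = []
```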
We encode the FOL rules using soft logic (Bach et al., 2015) for flexible encoding and stable optimization. Specifically, soft logic allows continuous truth values from the interval $[0, 1]$ instead of $\{0, 1\}$, and the Boolean logic operators are reformulated as:
$$
\begin{aligned}
A \,\&\, B &= \max\{A + B - 1, 0\} \\
A \vee B &= \min\{A + B, 1\} \\
A_1 \wedge \cdots \wedge A_N &= \sum\nolimits_i A_i / N \\
\neg A &= 1 - A
\end{aligned}
\tag{1}
$$
Here $\&$ and $\wedge$ are two different approximations to logical conjunction (Foulds et al., 2015): $\&$ is useful as a selection operator (e.g., $A \,\&\, B = B$ when $A = 1$, and $A \,\&\, B = 0$ when $A = 0$), while $\wedge$ is an averaging operator.
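For illustration, the operators in Eq. (1) translate directly into code. The following minimal Python sketch (ours, not part of the original formulation) operates on truth values in $[0, 1]$:

```python
def soft_and_select(a: float, b: float) -> float:
    """Lukasiewicz conjunction A & B = max{A + B - 1, 0} (selection operator)."""
    return max(a + b - 1.0, 0.0)

def soft_or(a: float, b: float) -> float:
    """Disjunction A v B = min{A + B, 1}."""
    return min(a + b, 1.0)

def soft_and_avg(values: list) -> float:
    """Averaging conjunction A_1 ^ ... ^ A_N = (sum_i A_i) / N."""
    return sum(values) / len(values)

def soft_not(a: float) -> float:
    """Negation: 1 - A."""
    return 1.0 - a

# '&' acts as a selector: A & B = B when A = 1, and A & B = 0 when A = 0.
assert soft_and_select(1.0, 0.5) == 0.5
assert soft_and_select(0.0, 0.5) == 0.0
```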
3.2 Rule Knowledge Distillation
A neural network defines a conditional probability $p_\theta(y|x)$ by using a softmax output layer that produces a $K$-dimensional soft prediction vector denoted as $\sigma_\theta(x)$. The network is parameterized by weights $\theta$. Standard neural network training iteratively updates $\theta$ to produce the correct labels of training instances. To integrate the information encoded in the rules, we propose to train the network to also imitate the outputs of a rule-regularized projection of $p_\theta(y|x)$, denoted as $q(y|x)$, which explicitly includes the rule constraints as regularization terms. In each iteration, $q$ is constructed by projecting $p_\theta$ into a subspace constrained by the rules, and thus has desirable properties; we present the construction in the next section. The prediction behavior of $q$ reveals the information of the regularized subspace and structured rules, so emulating the outputs of $q$ serves to transfer this knowledge into $p_\theta$. The new objective is then formulated as a balance between imitating the soft predictions of $q$ and predicting the true hard labels:
$$
\theta^{(t+1)} = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{n=1}^{N} (1 - \pi)\, \ell\big(y_n, \sigma_\theta(x_n)\big) + \pi\, \ell\big(s_n^{(t)}, \sigma_\theta(x_n)\big),
\tag{2}
$$
where $\ell$ denotes the loss function selected according to specific applications (e.g., the cross-entropy loss for classification); $s_n^{(t)}$ is the soft prediction vector of $q$ on $x_n$ at iteration $t$; and $\pi$ is the imitation parameter calibrating the relative importance of the two objectives.
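As a concrete illustration of Eq. (2) for classification with cross-entropy loss, here is a minimal NumPy sketch; the array names and shapes are our assumptions:

```python
import numpy as np

def distillation_loss(y_true, s_teacher, sigma_student, pi):
    """Eq. (2): (1 - pi) * CE(y_n, sigma_theta(x_n)) + pi * CE(s_n, sigma_theta(x_n)),
    averaged over the batch.

    y_true:        (N, K) one-hot labels
    s_teacher:     (N, K) soft predictions s^(t) of the teacher q at iteration t
    sigma_student: (N, K) student softmax outputs sigma_theta(x)
    pi:            imitation parameter in [0, 1]
    """
    eps = 1e-12  # numerical stability for the log
    ce_true = -np.sum(y_true * np.log(sigma_student + eps), axis=1)
    ce_soft = -np.sum(s_teacher * np.log(sigma_student + eps), axis=1)
    return np.mean((1.0 - pi) * ce_true + pi * ce_soft)
```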
A similar imitation procedure has been used in other settings such as model compression (Buciluǎ et al., 2006; Hinton et al., 2015), where the process is termed distillation. Following them, we call $p_\theta(y|x)$ the “student” and $q(y|x)$ the “teacher”. This can be intuitively explained by analogy to human education, where a teacher who is aware of systematic general rules instructs students by providing her solutions to particular questions (i.e., the soft predictions). An important difference from previous distillation work, where the teacher is obtained beforehand and the student is trained thereafter, is that our teacher and student are learned simultaneously during training.
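This simultaneous learning can be sketched as the schematic loop below, where `predict`, `project`, and `update` are hypothetical callables standing in for the student forward pass, the rule-constrained projection presented in the next section, and a gradient step on Eq. (2), respectively:

```python
from typing import Callable, Iterable, Tuple

def distill_train(theta,
                  batches: Iterable[Tuple[list, list]],
                  predict: Callable,  # (theta, X) -> student predictions sigma_theta(X)
                  project: Callable,  # (p, X, Y) -> teacher q's soft predictions s^(t)
                  update: Callable,   # (theta, X, Y, s, pi) -> theta after a step on Eq. (2)
                  pi: float):
    """Iterative rule knowledge distillation (schematic): the teacher q is rebuilt
    from the current student p_theta at every iteration, so both evolve together."""
    for x_batch, y_batch in batches:
        p = predict(theta, x_batch)       # student soft predictions
        s = project(p, x_batch, y_batch)  # rule-regularized projection (teacher)
        theta = update(theta, x_batch, y_batch, s, pi)
    return theta
```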
Though it is possible to combine a neural network with rule constraints by projecting the network to the rule-regularized subspace after it is fully trained as before with only data-label instances, or by optimizing the projected network directly, we found that our iterative teacher-student distillation approach provides much superior performance, as shown in the experiments. Moreover, since $p_\theta$ distills the rule information into the