580 X. Wang, Y. Yan and P. Tang et al. / Information Sciences 504 (2019) 578–588
features should be updated, which is time-consuming. Thus, we propose a decoupled training scheme: we first train an MI-Net to obtain all instance features; then, we fix the neural features of reference instances and update only the features of target instances.
In summary, the contributions of this study are as follows:
• We propose a learnable bag similarity representation for MIL. To the best of our knowledge, this is the first study that
integrates similarity learning with multi-instance neural networks.
• To solve bag similarity learning problems, we propose a novel bag similarity network that takes (N + 1) × M streams as input. For effective training, we propose a decoupled training scheme.
• The proposed BSN method has achieved state-of-the-art performance on several different MIL tasks.
2. Related work
2.1. Multi-instance learning
MIL has long been an active research topic owing to its ability to handle weakly labeled data. Utilizing weakly la-
beled data is highly important, because labeling for big data is costly. MIL has been applied in various computer vision
[17,18,28,33,41] and medical image analysis problems [16,35]. For example, in object detection, Wang et al. [30] formulated the problem of weakly supervised object detection as an MIL problem and proposed a relaxed MIL solution that uses deep-learning features as the instance representation. Cinbis et al. [5] proposed a multi-fold MIL procedure to avoid poor local optima.
Tang et al. [26] proposed a bag-in-bag formulation for modeling contextual information around objects. Investigating new
MIL methods is essential for understanding weakly labeled data.
In MIL, we are given a set of bags $X = \{X_1, X_2, \ldots, X_N\}$. Each bag $X_i$ can be represented by distinct instances $X_i = \{x_{i1}, x_{i2}, \ldots, x_{im_i}\}$, where $x_{ij}$ denotes the $j$-th instance in bag $X_i$ and $m_i$ denotes the number of instances in this bag. We assume that $Y_i \in \{0, 1\}$ and $y_{ij} \in \{0, 1\}$ represent the label of bag $X_i$ and the label of instance $x_{ij}$, respectively. During the training phase, only bag labels are available, whereas instance labels are unknown. There are two standard MIL constraints regarding bag and instance labels: if $Y_i = 0$, then all instances in the corresponding bag $X_i$ are negative; otherwise, at least one instance $x_{ij} \in X_i$ is positive.
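The standard MIL assumption above amounts to a logical OR over instance labels: a bag is positive if and only if it contains at least one positive instance. A minimal sketch (the toy bags and their instance labels $y_{ij}$ below are illustrative, not from the paper):

```python
def bag_label(instance_labels):
    """Standard MIL assumption: Y_i = 1 iff at least one
    instance label y_ij in bag X_i equals 1."""
    return int(any(y == 1 for y in instance_labels))

# Toy bags (hypothetical instance labels y_ij).
negative_bag = [0, 0, 0]   # Y_i = 0: all instances negative
positive_bag = [0, 0, 1]   # Y_i = 1: at least one positive instance

print(bag_label(negative_bag))  # 0
print(bag_label(positive_bag))  # 1
```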
2.2. Multi-instance neural network
In recent years, neural networks have become the most effective method for addressing MIL problems. Ilse et al. [13] added an attention module to multi-instance neural networks for instance selection and obtained impressive results for cancer detection in histopathology images. Even in the multi-label setting, Feng et al. [8] confirmed that deep neural networks are effective.
MI-Net [29] is a typical multi-instance neural network that focuses on MIL problems. MI-Net contains $L$ fully connected (FC) layers and one MIL pooling layer (generally, $L$ is equal to 4). The first $L - 1$ FC layers are followed by a non-linear transformation such as the rectified linear unit (ReLU) [10], which learns the representations of all instances in the corresponding bag. Here, $x_{ij}^{\ell}$ denotes the $\ell$-th layer output of the $j$-th instance $x_{ij}$ in bag $X_i$. The MIL pooling layer is used to map all instance-level features to obtain bag-level representations. Three widely used pooling schemes $M(x_{ij}^{L-1}\,|_{j=1 \ldots m_i})$ are mentioned in [29]: 1) max pooling $M(x_{ij}^{L-1}\,|_{j=1 \ldots m_i}) = \max_j x_{ij}^{L-1}$; 2) mean pooling $M(x_{ij}^{L-1}\,|_{j=1 \ldots m_i}) = \frac{1}{m_i} \sum_{j=1}^{m_i} x_{ij}^{L-1}$; and 3) log-sum-exp (LSE) pooling $M(x_{ij}^{L-1}\,|_{j=1 \ldots m_i}) = \frac{1}{r} \log\left[\frac{1}{m_i} \sum_{j=1}^{m_i} \exp\left(r \cdot x_{ij}^{L-1}\right)\right]$, where $r$ is a parameter controlling the smoothness of the approximation to the max function. Thus, regardless of the number of input instances, the MIL pooling layer aggregates them into a bag-level representation. Finally, the probability of a bag being positive can be calculated by an FC layer with only one neuron and sigmoid activation, and then the bag label is predicted.
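The three pooling schemes can be sketched in NumPy as follows. This is a minimal illustration: the matrix `h` is a toy stand-in for the layer-$(L-1)$ outputs $x_{ij}^{L-1}$ of one bag, and the LSE computation is written in the standard numerically stable form (subtracting the per-dimension maximum before exponentiating), which is algebraically equivalent to the formula above:

```python
import numpy as np

# Instance features of one toy bag: m_i = 3 instances, 2 feature dimensions.
h = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 0.5]])

def max_pool(h):
    # Element-wise maximum over the instances of the bag.
    return h.max(axis=0)

def mean_pool(h):
    # Element-wise average over the instances of the bag.
    return h.mean(axis=0)

def lse_pool(h, r=1.0):
    # Log-sum-exp pooling in a numerically stable form.
    # Larger r -> closer to max pooling; r -> 0 recovers mean pooling.
    z = r * h
    zmax = z.max(axis=0)
    return (zmax + np.log(np.mean(np.exp(z - zmax), axis=0))) / r

# As r grows, LSE pooling approaches max pooling.
assert np.allclose(lse_pool(h, r=1000.0), max_pool(h), atol=1e-2)
```

All three operators map the variable-size set of instance features to a single fixed-length vector, which is what lets the network handle bags with different numbers of instances.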
The proposed BSN is also based on neural networks. However, unlike previous multi-instance networks that learn a bag
embedding without considering the bag’s relation to other bags, BSN learns a bag embedding by comparing the bag with
the other bags. Furthermore, BSN is different from traditional bag similarity methods that use fixed bag similarity metrics,
as it learns bag similarity using neural networks.
In addition, BSN can be regarded as a special instantiation of memory-augmented neural networks [23] , which are widely
used in meta-learning. Here, memory refers to external memory and is different from the internal memory in long short-
term memory (LSTM) networks [11] . The reference training bags with their feature extraction networks can be considered
external memory in BSN.
3. Bag similarity network for MIL
Unlike traditional methods, the proposed method addresses MIL problems from the new perspective of bag similarity learning.
In the proposed design, each bag is represented by a vector of its similarities to other bags in the training set, and these
similarities are treated as a bag-level representation, hence the term bag similarity network. Fig. 2 shows the overall architecture of BSN, where it can be seen that to avoid the complications of updating parameters and reduce computational