Compact Bilinear Pooling

Yang Gao¹, Oscar Beijbom¹, Ning Zhang²∗, Trevor Darrell¹†
¹EECS, UC Berkeley   ²Snapchat Inc.
{yg, obeijbom, trevor}@eecs.berkeley.edu, {ning.zhang}@snapchat.com
Abstract
Bilinear models have been shown to achieve impressive performance on a wide range of visual tasks, such as semantic segmentation, fine-grained recognition and face recognition. However, bilinear features are high-dimensional, typically on the order of hundreds of thousands to a few million, which makes them impractical for subsequent analysis. We propose two compact bilinear representations with the same discriminative power as the full bilinear representation but with only a few thousand dimensions. Our compact representations allow back-propagation of classification errors, enabling end-to-end optimization of the visual recognition system. The compact bilinear representations are derived through a novel kernelized analysis of bilinear pooling, which provides insights into the discriminative power of bilinear pooling and a platform for further research in compact pooling methods. Experiments illustrate the utility of the proposed representations for image classification and few-shot learning across several datasets.
1. Introduction
Encoding and pooling of visual features is an integral
part of semantic image analysis methods. Before the in-
fluential 2012 paper of Krizhevsky et al. [17] rediscovering
the models pioneered by [19] and related efforts, such meth-
ods typically involved a series of independent steps: feature
extraction, encoding, pooling and classification; each thor-
oughly investigated in numerous publications as the bag of
visual words (BoVW) framework. Notable contributions in-
clude HOG [9] and SIFT [24] descriptors, Fisher encoding [26], bilinear pooling [3] and spatial pyramids [18],
each significantly improving the recognition accuracy.
Recent results have shown that end-to-end back-propagation of gradients in a convolutional neural network
∗This work was done while Ning Zhang was at Berkeley.
†Prof. Darrell was supported in part by DARPA; AFRL; DoD MURI award N000141110688; NSF awards IIS-1212798, IIS-1427425, and IIS-1536003; and the Berkeley Vision and Learning Center.
Figure 1: We propose a compact bilinear pooling method
for image classification. Our pooling method is learned
through end-to-end back-propagation and enables a low-
dimensional but highly discriminative image representation.
The top pipeline shows the Tensor Sketch projection applied to the activation at a single spatial location, with ∗ denoting circular convolution. The bottom pipeline shows how a global compact descriptor is obtained by sum pooling.
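The pipeline in Figure 1 can be sketched in NumPy as follows. This is an illustrative implementation of the Tensor Sketch idea (two Count Sketches combined by circular convolution via the FFT, then sum-pooled over spatial locations); the channel count c, sketch dimension d, feature-map size, and random seed are placeholder values, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
c, d = 512, 8192  # input channels and sketch dimension (illustrative)

# Fixed random hash indices and signs, one pair per Count Sketch.
h = [rng.integers(0, d, size=c) for _ in range(2)]
s = [rng.choice([-1.0, 1.0], size=c) for _ in range(2)]

def count_sketch(x, h_k, s_k):
    """Project x (length c) to a d-dimensional Count Sketch."""
    y = np.zeros(d)
    np.add.at(y, h_k, s_k * x)  # scatter-add signed entries into d bins
    return y

def tensor_sketch(x):
    """Approximate the flattened outer product x ⊗ x in d dimensions.
    Circular convolution of the two sketches is computed via the FFT."""
    p1 = np.fft.fft(count_sketch(x, h[0], s[0]))
    p2 = np.fft.fft(count_sketch(x, h[1], s[1]))
    return np.real(np.fft.ifft(p1 * p2))

# Sum-pool the per-location sketches to get a global compact descriptor.
X = rng.standard_normal((14 * 14, c))  # stand-in for a conv feature map
phi = np.sum([tensor_sketch(loc) for loc in X], axis=0)

# In expectation the sketch preserves the bilinear (polynomial) kernel:
# <TS(x), TS(y)> ≈ <x, y>^2, so linear classifiers on phi behave like
# classifiers on the full (c*c)-dimensional bilinear feature.
x = rng.standard_normal(c)
```

Note that d is a few thousand here, versus c² (here 262,144) for the full bilinear feature, which is the compression the abstract refers to.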
(CNN) enables joint optimization of the whole pipeline, re-
sulting in significantly higher recognition accuracy. While
the distinction of the steps is less clear in a CNN than in a
BoVW pipeline, one can view the first several convolutional
layers as a feature extractor and the later fully connected
layers as a pooling and encoding mechanism. This has been
explored recently in methods combining the feature extrac-
tion architecture of the CNN paradigm, with the pooling &
encoding steps from the BoVW paradigm [23, 8]. Notably,
Lin et al. recently replaced the fully connected layers with
bilinear pooling achieving remarkable improvements for
fine-grained visual recognition [23]. However, their final
representation is very high-dimensional; in their paper the encoded feature dimension, d, is more than 250,000. Such a representation is impractical for several reasons: (1) if used
with a standard one-vs-rest linear classifier for k classes,
the number of model parameters becomes kd, which for
e.g. k = 1000 means > 250 million model parameters, (2)
for retrieval or deployment scenarios which require features
to be stored in a database, the storage becomes expensive;
storing a million samples requires 2TB of storage at dou-
arXiv:1511.06062v2 [cs.CV] 12 Apr 2016