1438 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 40, NO. 6, DECEMBER 2010
Multiview Spectral Embedding
Tian Xia, Dacheng Tao, Member, IEEE, Tao Mei, Member, IEEE, and Yongdong Zhang, Member, IEEE
Abstract—In computer vision and multimedia search, it is common to use multiple features from different views to represent an object. For example, to characterize a natural scene image well, it is essential to find a set of visual features that represent its color, texture, and shape information and to encode each feature into a vector. We therefore have a set of vectors in different spaces to represent the image. Conventional spectral-embedding algorithms cannot deal with such data directly, so we have to concatenate these vectors into a single new vector. This concatenation is not physically meaningful because each feature has a specific statistical property. Therefore, we develop a new spectral-embedding algorithm, namely, multiview spectral embedding (MSE), which can encode different features in different ways, to achieve a physically meaningful embedding. In particular, MSE finds a low-dimensional embedding wherein the distribution of each view is sufficiently smooth, and MSE explores the complementary property of different views. Because there is no closed-form solution for MSE, we derive an alternating-optimization-based iterative algorithm to obtain the low-dimensional embedding. Empirical evaluations on image retrieval, video annotation, and document clustering demonstrate the effectiveness of the proposed approach.
Index Terms—Dimensionality reduction, multiple views, spectral embedding.
I. INTRODUCTION

IN COMPUTER vision and multimedia search [5], [6], objects are usually represented in several different ways. This kind of data is termed multiview data. A typical example is a color image, which has different views from different modalities, e.g., color, texture, and shape. Different views form different feature spaces, each with its own particular statistical properties.
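As a concrete (hypothetical) illustration of such multiview data, the sketch below encodes one toy image as three feature vectors that live in spaces of different dimensionalities; the specific descriptors (a color histogram, a gradient-magnitude "texture" histogram, and intensity-profile "shape" features) are placeholders chosen for brevity, not the features used in the paper's experiments.

```python
import numpy as np

# Hypothetical illustration: one image, three "views" (color, texture, shape),
# each encoded as a vector in its own feature space with its own dimensionality.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # toy RGB image with values in [0, 1]

# View 1: color histogram, 8 bins per channel -> a 24-D vector.
color = np.concatenate(
    [np.histogram(image[..., c], bins=8, range=(0, 1))[0] for c in range(3)]
).astype(float)

# View 2: a crude "texture" descriptor, a 16-bin gradient-magnitude histogram.
gray = image.mean(axis=2)
gx, gy = np.gradient(gray)
texture = np.histogram(np.hypot(gx, gy), bins=16)[0].astype(float)

# View 3: a crude "shape" descriptor: row and column intensity profiles (64-D).
shape_feat = np.concatenate([gray.mean(axis=0), gray.mean(axis=1)])

# The three views describe the same image but live in R^24, R^16, and R^64.
print(color.shape, texture.shape, shape_feat.shape)
```

The point of the sketch is only that the per-view vectors have different dimensionalities, scales, and statistics, which is why treating them as one homogeneous vector is problematic.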
Manuscript received May 14, 2009; revised August 31, 2009 and November 18, 2009; accepted December 6, 2009. Date of publication February 17, 2010; date of current version November 17, 2010. This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2007CB311100; by the National High-Technology Research and Development Program of China (863 Program) under Grant 2007AA01Z416; by the National Natural Science Foundation of China under Grants 60873165, 60802028, and 60902090; by the Beijing New Star Project on Science and Technology under Grant 2007B071; by the Co-building Program of Beijing Municipal Education Commission; by the Nanyang Technological University Nanyang SUG Grant under Project M58020010; by the Microsoft Operations PTE LTD-NTU Joint R&D under Grant M48020065; and by the K. C. Wong Education Foundation Award. This paper was recommended by Associate Editor S. Sarkar.

T. Xia and Y. Zhang are with the Center for Advanced Computing Technology Research, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China (e-mail: txia@ict.ac.cn; zhyd@ict.ac.cn).

D. Tao is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798 (e-mail: dctao@ntu.edu.sg).

T. Mei is with Microsoft Research Asia, Beijing 100190, China (e-mail: tmei@microsoft.com).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSMCB.2009.2039566
Because of the prevalence of multiview data in practical applications, particularly in the multimedia domain, learning from multiview data, also known as multiple-view learning, has attracted increasing attention. Although a great deal of effort has been devoted to multiview data learning [1], including classification [21], clustering [4], [19], and feature selection [20], little progress has been made in dimensionality reduction, even though dimensionality reduction has many applications in multimedia [28], e.g., image retrieval and video annotation. Multimedia data generally have multiple modalities, and each modality is usually represented in a high-dimensional feature space, which frequently leads to the "curse of dimensionality" problem. In this case, multiview dimensionality reduction provides an effective way to solve, or at least alleviate, this problem.
In this paper, we consider the problem of spectral embedding for multiple-view data based on our previous patch alignment framework [29]. The major challenge is learning a low-dimensional embedding that effectively explores the complementary nature of the multiple views of a data set. The learned low-dimensional embedding should be better than any low-dimensional embedding learned from a single view of the data set.
Existing spectral-embedding algorithms assume that samples are drawn from a single vector space and thus cannot deal with multiview data directly. A possible solution is to concatenate the vectors from different views into a new vector and then apply a spectral-embedding algorithm directly to the concatenated vector. However, this concatenation is not physically meaningful because each view has a specific statistical property. It ignores the diversity of the views and thus cannot efficiently exploit their complementary nature. Another solution is the distributed spectral embedding (DSE) proposed in [3]. DSE performs a spectral-embedding algorithm on each view independently and then, based on the learned low-dimensional representations, learns a common low-dimensional embedding that is as "close" to each representation as possible. Although DSE allows selecting different spectral-embedding algorithms for different views, the original multiple-view data are invisible to the final learning process, and thus, DSE cannot fully explore the complementary nature of the views. Moreover, its computational cost is high because it runs a spectral-embedding algorithm on each view independently.
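The two baselines discussed above can be sketched as follows. This is an illustrative toy, not the paper's MSE: the spectral-embedding step is plain Laplacian eigenmaps with an assumed heat-kernel affinity, and the per-view embeddings are combined by simple averaging, whereas DSE instead solves for a common embedding that is jointly closest to all per-view representations.

```python
import numpy as np

def laplacian_eigenmaps(X, dim=2, sigma=1.0):
    """Illustrative Laplacian-eigenmaps embedding of the rows of X."""
    # Pairwise squared distances and heat-kernel affinities (an assumption).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W                                # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:dim + 1]                # skip the trivial constant eigenvector

rng = np.random.default_rng(0)
n = 50
# Two toy views of the same n samples, deliberately on very different scales.
views = [rng.random((n, 24)), rng.random((n, 16)) * 100.0]

# Baseline 1: concatenate the views and embed once; the large-scale view
# dominates the distances, which is the "not physically meaningful" issue.
concat_emb = laplacian_eigenmaps(np.hstack(views))

# Baseline 2 (DSE-flavored): embed each view independently, then combine the
# low-dimensional representations (averaging here is a simplification).
per_view = [laplacian_eigenmaps(V) for V in views]
combined = np.mean(per_view, axis=0)

print(concat_emb.shape, combined.shape)      # both are n x 2 embeddings
```

In the second baseline, note that the original high-dimensional views never reach the combination step, which is exactly the limitation of DSE that the text points out.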
To effectively and efficiently learn the complementary nature of different views, we propose a new algorithm, i.e., multiview spectral embedding (MSE), which learns a low-dimensional and sufficiently smooth embedding over all views simultaneously. Empirical evaluations based on image retrieval, video