Modeling continuous visual features for semantic image annotation and retrieval
Zhixin Li a,b,*, Zhiping Shi a, Xi Liu a, Zhongzhi Shi a

a Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
b College of Computer Science and Information Technology, Guangxi Normal University, Guilin 541004, China
Article history: Received 13 October 2009; available online 17 November 2010. Communicated by H.H.S. Ip.

Keywords: Automatic image annotation; Continuous PLSA; Latent aspect model; Semantic gap; Image retrieval

Abstract
Automatic image annotation has become an important and challenging problem due to the existence of the semantic gap. In this paper, we first extend probabilistic latent semantic analysis (PLSA) to model continuous quantities, and derive the corresponding Expectation–Maximization (EM) algorithm to determine the model parameters. Furthermore, in order to handle data of different modalities according to their characteristics, we present a semantic annotation model which employs continuous PLSA and standard PLSA to model visual features and textual words, respectively. The model learns the correlation between these two modalities through an asymmetric learning approach and can then predict semantic annotations precisely for unseen images. Finally, we compare our approach with several state-of-the-art approaches on the Corel5k and Corel30k datasets. The experimental results show that our approach performs more effectively and accurately.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
Content-based image retrieval (CBIR) has been studied and explored for decades. Its performance, however, is far from satisfactory due to the notorious semantic gap (Smeulders et al., 2000). CBIR retrieves images in terms of their visual features, while users often prefer intuitive text-based image searching. Since manual image annotation is expensive and difficult to extend to large image databases, automatic image annotation has emerged as a crucial and challenging problem (Datta et al., 2008).
The state-of-the-art techniques for automatic image annotation can be roughly categorized into two different schools of thought. The first defines auto-annotation as a traditional supervised classification problem (Chang et al., 2003; Li and Wang, 2003; Cusano et al., 2004; Carneiro et al., 2007): it treats each word (or semantic concept) as an independent class and builds a separate classifier for every word. This approach computes similarity at the visual level and annotates a new image by propagating the corresponding words. The second perspective takes a different stand and treats images and texts as equivalent data. It attempts to discover the correlation between visual features and textual words on an unsupervised basis by estimating the joint distribution of features and words, thus posing annotation as statistical inference in a graphical model. Under this perspective, images are treated as bags of words and features, each of which is assumed to be generated by a hidden variable. Various approaches differ in the definition of the states of the hidden variable: some associate them with images in the database (Jeon et al., 2003; Lavrenko et al., 2003; Feng et al., 2004), while others associate them with image clusters (Duygulu et al., 2002; Barnard et al., 2003) or latent aspects (topics) (Blei and Jordan, 2003; Monay and Gatica-Perez, 2007; Zhang et al., 2005).
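To make the shared structure of these generative models concrete, the joint distribution they estimate can be written schematically as (the notation is ours and is introduced only for illustration; the cited approaches differ in how the hidden states and conditional distributions are actually defined)

$$P(w, x) \;=\; \sum_{z} P(z)\, P(w \mid z)\, P(x \mid z),$$

where $w$ denotes a textual word, $x$ a visual feature, and $z$ the hidden variable whose states are identified with training images, image clusters, or latent aspects, depending on the approach.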
As latent aspect models, PLSA (Hofmann, 2001) and latent Dirichlet allocation (LDA) (Blei et al., 2003) have been successfully applied to annotate and retrieve images. PLSA-WORDS (Monay and Gatica-Perez, 2007) is a representative approach, which achieves the annotation task by constraining the latent space to ensure its consistency in words. However, since standard PLSA can only handle discrete quantities (such as textual words), this approach quantizes feature vectors into discrete visual words for PLSA modeling. Consequently, its annotation performance is sensitive to the clustering granularity. In the field of automatic image annotation, it is generally believed that using continuous feature vectors leads to better performance (Lavrenko et al., 2003; Blei and Jordan, 2003; Zhang et al., 2005; Li et al., 2010). To model image data precisely, PLSA must therefore be extended to deal with continuous quantities.
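As a rough sketch in our own notation (the precise formulation is developed later in the paper), standard PLSA factorizes the word occurrences of a document $d$ as

$$P(w \mid d) \;=\; \sum_{z} P(z \mid d)\, P(w \mid z),$$

with a multinomial aspect-conditional distribution $P(w \mid z)$, whereas a continuous extension replaces this per-aspect multinomial by a Gaussian density over feature vectors $x$,

$$p(x \mid d) \;=\; \sum_{z} P(z \mid d)\, \mathcal{N}(x;\, \mu_z, \Sigma_z),$$

so that every image is effectively modeled as an aspect-weighted mixture of Gaussians.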
This paper proposes continuous PLSA, which assumes that the feature vectors in an image are governed by a Gaussian distribution under a given latent aspect, rather than by a multinomial one. In addition, the corresponding EM algorithm is derived to estimate the parameters. Under this model, each image is then treated as a mixture of Gaussians. Furthermore, based on the continuous PLSA and the standard PLSA, we present a semantic
* Corresponding author at: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China. Tel.: +86 10 62600506; fax: +86 10 82610254.
E-mail addresses: lizx@ics.ict.ac.cn, lizx@gxnu.edu.cn (Z. Li), shizp@ics.ict.ac.cn (Z. Shi), liux@ics.ict.ac.cn (X. Liu), shizz@ics.ict.ac.cn (Z. Shi).