J Intell Inf Syst (2014) 43:247–269 251
retrieval techniques. We design an experiment to evaluate the effectiveness of our approach
in improving search engine retrieval precision by maintaining consistency between semantic
and image content.
1.2 Background and related work
There are two popular approaches to image retrieval: CBIR and keyword-based search methods.
In CBIR (Wang et al. 2001; Veltkamp and Tanase 2000; Natsev et al. 2004), images are
retrieved based on their visual content without using external metadata such as annotations.
Even though CBIR for general-purpose image databases is still a highly challenging
problem, due to uncontrolled imaging conditions and the difficulty of understanding
images, it has shown great promise in automating the process of interpreting images, which
is one of the reasons we incorporate CBIR in our system. In order to capture the visual
features of each image concept in terms of the objects it contains, we apply a region-based
image content representation in which regions of an image are obtained through an automatic
image segmentation process (Vu et al. 2003; Carson et al. 2002). The regions sharing
similar low-level features such as color and textures may represent a certain object or a scene
in the image. Region-based retrieval methods are a widely used type of CBIR (Wang et al.
2001; Chen and Wang 2002; Natsev et al. 2004), and they perform well in handling complex
images. One drawback of these methods is that they take query images instead of a textual
query (Wang et al. 2001; Tsai 2009; Carson et al. 2002), which is inconvenient for users.
We extract the color and texture features from each region and apply a clustering
algorithm for image segmentation. Instead of k-means, we use Gaussian mixture
clustering, because EM for a mixture of Gaussians does not require hand-tuning, whereas in
k-means the selection of initial centroids can influence the clustering result. To represent
the image, unlike Santos et al. (2008), who represent an image by aligning regions according
to their area size, we leverage the approach from DDSVM (Chen and Wang 2004) and
apply it to build the image classifier.
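As an illustrative sketch only (not the authors' implementation), segmenting an image by clustering per-pixel features with a Gaussian mixture fit by EM might look as follows. The feature choice here (raw color channels of a toy image) and the function name `segment_image` are assumptions for illustration; the actual system uses richer region-level color and texture features.

```python
# Sketch: region segmentation via Gaussian-mixture (EM) clustering of
# per-pixel feature vectors. Names and feature choices are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def segment_image(pixels, n_regions=3, seed=0):
    """Cluster per-pixel feature vectors into regions with EM.

    pixels: (H, W, C) float array of features (e.g. color channels).
    Returns an (H, W) array of integer region labels.
    """
    h, w, c = pixels.shape
    feats = pixels.reshape(-1, c)
    # EM fits the means, covariances, and mixing weights jointly; unlike
    # plain k-means, there are no hand-picked initial centroids to tune
    # (initialization is still randomized, hence the fixed seed).
    gmm = GaussianMixture(n_components=n_regions,
                          covariance_type="full",
                          random_state=seed)
    labels = gmm.fit_predict(feats)
    return labels.reshape(h, w)

# Toy 2-region "image": dark top half, bright bottom half, plus noise.
rng = np.random.default_rng(0)
img = np.zeros((8, 8, 3))
img[4:] = 1.0
img += 0.01 * rng.standard_normal(img.shape)
regions = segment_image(img, n_regions=2)
```

In practice the pixels (or precomputed region descriptors) would carry color and texture features rather than raw intensities, and the number of mixture components would be chosen per image.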
In comparison, keyword-based image retrieval methods are based on textual descriptions
about the pictures, and have been employed in commercial search engines. However, these
methods suffer from low precision, especially for complex queries. The more information
a complex query contains, the harder it is to determine the user's main interest and
subsequently retrieve images whose contents are relevant to the query. Take Google as an
example: the precision of Google's image search engine is reported to be only 39 % (Schroff
et al. 2007). The keywords used by Google image search are mainly based on the image’s
filename, the link text pointing to the image, and surrounding text (Schroff et al. 2007).
When we search for “US destroyer shells Polish shore” in Google, we expect the retrieved
images to include a “destroyer”; however, seven of the top ten images returned (on Nov
7, 2009) only partially matched the textual information in the query, and their contents were
not even close (e.g. “cars” or “houses”); the content of these images is simply not
consistent with the query. Moreover, keyword-based methods are primarily useful to a user who
knows what keywords should be used to index the images. However, when the user does not
have a clear idea of which keywords to pick, this can become problematic, for example
when the user wants to search for images related to a piece of news but does not know
how to organize the query. Our approach, in contrast, combines keyword indexing with
content analysis to filter out images that match only the unimportant keywords in the
query. Feng et al. (2008) also incorporate auxiliary text information to help organize the
semantics. However, they segment the images into squares and make restrictive indepen-
dence assumptions on the relationship between the text and regions. The required format of