A Latent Semantic Model with Convolutional-Pooling
Structure for Information Retrieval
Yelong Shen
Microsoft Research
Redmond, WA, USA
yeshen@microsoft.com
Xiaodong He
Microsoft Research
Redmond, WA, USA
xiaohe@microsoft.com
Jianfeng Gao
Microsoft Research
Redmond, WA, USA
jfgao@microsoft.com
Li Deng
Microsoft Research
Redmond, WA, USA
deng@microsoft.com
Grégoire Mesnil
University of Montréal
Montréal, Canada
gregoire.mesnil@umontreal.ca
ABSTRACT
In this paper, we propose a new latent semantic model that
incorporates a convolutional-pooling structure over word
sequences to learn low-dimensional, semantic vector
representations for search queries and Web documents. In order to
capture the rich contextual structures in a query or a document, we
start with each word within a temporal context window in a word
sequence to directly capture contextual features at the word n-
gram level. Next, the salient word n-gram features in the word
sequence are discovered by the model and are then aggregated to
form a sentence-level feature vector. Finally, a non-linear
transformation is applied to extract high-level semantic
information to generate a continuous vector representation for the
full text string. The proposed convolutional latent semantic model
(CLSM) is trained on clickthrough data and is evaluated on a Web
document ranking task using a large-scale, real-world data set.
Results show that the proposed model effectively captures salient
semantic information in queries and documents for the task while
significantly outperforming previous state-of-the-art semantic
models.
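The convolution-pooling pipeline described above can be illustrated with a minimal numpy sketch. All of the dimensions, the one-hot word inputs, and the random weights below are illustrative stand-ins (the actual model uses letter-trigram word hashing and parameters trained on clickthrough data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the real model's input is letter-trigram based.
VOCAB, CONV_DIM, SEM_DIM, WIN = 50, 32, 16, 3

W_conv = rng.standard_normal((CONV_DIM, WIN * VOCAB)) * 0.1
W_sem = rng.standard_normal((SEM_DIM, CONV_DIM)) * 0.1

def clsm_vector(word_ids):
    # One-hot word vectors stand in for letter-trigram word vectors.
    X = np.eye(VOCAB)[word_ids]                       # (len, VOCAB)
    # Pad so every word sits at the center of a full context window.
    X = np.vstack([np.zeros((1, VOCAB)), X, np.zeros((1, VOCAB))])
    # Convolution: a tanh feature vector for each sliding window of WIN words.
    conv = np.array([np.tanh(W_conv @ X[i:i + WIN].ravel())
                     for i in range(len(word_ids))])  # (len, CONV_DIM)
    # Max pooling across positions keeps the salient n-gram features
    # and yields a fixed-size, sentence-level feature vector.
    pooled = conv.max(axis=0)
    # Final non-linear projection to the semantic vector.
    return np.tanh(W_sem @ pooled)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With trained weights, `cosine(clsm_vector(query), clsm_vector(doc))` would serve as the relevance score between a query and a document.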
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval; I.2.6 [Artificial Intelligence]: Learning
General Terms
Algorithms, Experimentation
Keywords
Convolutional Neural Network, Semantic Representation, Web
Search
1. INTRODUCTION
Most modern search engines resort to semantics-based methods
beyond lexical matching for Web document retrieval. This is
partly because the same concept is often expressed using
different vocabularies and language styles in
documents and queries. For example, latent semantic models such
as latent semantic analysis (LSA) are able to map a query to its
relevant documents at the semantic level where lexical matching
often fails (e.g., [9][10][31]). These models address the problem
of language discrepancy between Web documents and search
queries by grouping different terms that occur in a similar context
into the same semantic cluster. Thus, a query and a document,
represented as two vectors in the low-dimensional semantic space,
can still have a high similarity even if they do not share any term.
Extending from LSA, probabilistic topic models such as
probabilistic LSA (PLSA), Latent Dirichlet Allocation (LDA),
and Bi-Lingual Topic Model (BLTM), have been proposed and
successfully applied to semantic matching [19][4][16][15][39].
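How an LSA-style latent space bridges the lexical gap between a query and a document can be seen in a toy sketch. The term-document counts below are invented for illustration; the query is projected ("folded in") by multiplying with the truncated left singular vectors and dividing by the singular values, the standard LSA query mapping:

```python
import numpy as np

# Toy term-document count matrix (terms x documents); counts are invented.
terms = ["car", "automobile", "engine", "flower", "petal"]
#             d1 d2 d3
X = np.array([[2, 0, 0],    # car
              [0, 2, 0],    # automobile
              [1, 1, 0],    # engine
              [0, 0, 2],    # flower
              [0, 0, 1]],   # petal
             dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                       # rank of the latent semantic space
Uk, sk = U[:, :k], s[:k]

def fold_in(counts):
    # Project a bag-of-words vector into the k-dim latent space.
    return counts @ Uk / sk

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0, 0, 0, 0])  # query: "car"
d2 = X[:, 1]                     # document using "automobile", not "car"
sim = cosine(fold_in(q), fold_in(d2))  # close to 1.0: no shared terms,
                                       # but "car" and "automobile" co-occur
                                       # with "engine" in the corpus
```

Because "car" and "automobile" share the context word "engine", they fall into the same latent dimension, so the query and the document are highly similar in the semantic space despite zero lexical overlap.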
More recently, semantic modeling methods based on neural
networks have also been proposed for information retrieval (IR)
[16][32][20]. Salakhutdinov and Hinton proposed the Semantic
Hashing method based on a deep auto-encoder in [32][16]. A
Deep Structured Semantic Model (DSSM) for Web search was
proposed in [20], which is reported to give very strong IR
performance on a large-scale web search task when clickthrough
data are exploited as weakly-supervised information in training
the model. In both methods, plain feed-forward neural networks
are used to extract the semantic structures embedded in a query or
a document.
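A minimal sketch of such a plain feed-forward semantic model follows. The letter-trigram word hashing is the DSSM-style input representation; the trigram vocabulary, layer sizes, and random weights here are illustrative assumptions, whereas DSSM trains the weights on clickthrough data so that cosine similarity reflects relevance:

```python
import numpy as np

rng = np.random.default_rng(1)

def letter_trigrams(word):
    # DSSM-style word hashing: letter trigrams of the bounded word "#word#".
    w = f"#{word}#"
    return [w[i:i + 3] for i in range(len(w) - 2)]

# A small fixed trigram vocabulary built from example text (assumed).
corpus = "microsoft office excel apartment online body repair estimates"
vocab = sorted({t for w in corpus.split() for t in letter_trigrams(w)})
index = {t: i for i, t in enumerate(vocab)}

def hash_text(text):
    # Bag of letter trigrams -> fixed-size input vector (order is lost).
    v = np.zeros(len(vocab))
    for w in text.split():
        for t in letter_trigrams(w):
            if t in index:
                v[index[t]] += 1
    return v

# Two tanh layers, as in a plain feed-forward semantic model.
W1 = rng.standard_normal((64, len(vocab))) * 0.1
W2 = rng.standard_normal((16, 64)) * 0.1

def semantic_vector(text):
    return np.tanh(W2 @ np.tanh(W1 @ hash_text(text)))

def relevance(query, doc):
    q, d = semantic_vector(query), semantic_vector(doc)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
```

Note that `hash_text` discards word order entirely, which is exactly the bag-of-words limitation discussed next.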
Despite the progress made recently, all the aforementioned
latent semantic models view a query (or a document) as a bag of
words. As a result, they are not effective in modeling contextual
structures of a query (or a document). Table 1 gives several
examples of document titles to illustrate the problem. For
example, the word “office” in the first document refers to the
popular Microsoft product, but in the second document it refers to
a working space. We see that the precise search intent of the word
“office” cannot be identified without context.
microsoft office excel could allow remote code execution
welcome to the apartment office
online body fat percentage calculator
online auto body repair estimates
Table 1: Sample document titles. The text is lower-cased and
punctuation removed. The same word, e.g., “office”, has
different meanings depending on its contexts.
Modeling contextual information in search queries and
documents is a long-standing research topic in IR
[11][25][12][26][2][22][24]. Classical retrieval models, such as
TF-IDF and BM25, use a bag-of-words representation and cannot
effectively capture contextual information of a word. Topic
models learn the topic distribution of a word by considering word
occurrence information within a document or a sentence.
However, the contextual information captured by such models is
CIKM’14, November 03 – 07, 2014, Shanghai, China.
Copyright © 2014 ACM 978-1-4503-2598-1/14/11…$15.00
http://dx.doi.org/10.1145/2661829.2661935