A Latent Semantic Model with Convolutional-Pooling
Structure for Information Retrieval
Yelong Shen
Microsoft Research
Redmond, WA, USA
yeshen@microsoft.com
Xiaodong He
Microsoft Research
Redmond, WA, USA
xiaohe@microsoft.com
Jianfeng Gao
Microsoft Research
Redmond, WA, USA
jfgao@microsoft.com
Li Deng
Microsoft Research
Redmond, WA, USA
deng@microsoft.com
Grégoire Mesnil
University of Montréal
Montréal, Canada
gregoire.mesnil@umontreal.ca
ABSTRACT
In this paper, we propose a new latent semantic model that
incorporates a convolutional-pooling structure over word
sequences to learn low-dimensional, semantic vector
representations for search queries and Web documents. In order to
capture the rich contextual structures in a query or a document, we
start with each word within a temporal context window in a word
sequence to directly capture contextual features at the word n-
gram level. Next, the salient word n-gram features in the word
sequence are discovered by the model and are then aggregated to
form a sentence-level feature vector. Finally, a non-linear
transformation is applied to extract high-level semantic
information to generate a continuous vector representation for the
full text string. The proposed convolutional latent semantic model
(CLSM) is trained on clickthrough data and is evaluated on a Web
document ranking task using a large-scale, real-world data set.
Results show that the proposed model effectively captures salient
semantic information in queries and documents for the task while
significantly outperforming previous state-of-the-art semantic
models.
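The convolution-pooling pipeline described above can be illustrated with a minimal numpy sketch. All of the dimensions, the one-hot word inputs, and the random weights below are illustrative stand-ins (the actual model uses letter-trigram word hashing and parameters trained on clickthrough data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the real model's input is letter-trigram based.
VOCAB, CONV_DIM, SEM_DIM, WIN = 50, 32, 16, 3

W_conv = rng.standard_normal((CONV_DIM, WIN * VOCAB)) * 0.1
W_sem = rng.standard_normal((SEM_DIM, CONV_DIM)) * 0.1

def clsm_vector(word_ids):
    # One-hot word vectors stand in for letter-trigram word vectors.
    X = np.eye(VOCAB)[word_ids]                       # (len, VOCAB)
    # Pad so every word sits at the center of a full context window.
    X = np.vstack([np.zeros((1, VOCAB)), X, np.zeros((1, VOCAB))])
    # Convolution: a tanh feature vector for each sliding window of WIN words.
    conv = np.array([np.tanh(W_conv @ X[i:i + WIN].ravel())
                     for i in range(len(word_ids))])  # (len, CONV_DIM)
    # Max pooling across positions keeps the salient n-gram features
    # and yields a fixed-size, sentence-level feature vector.
    pooled = conv.max(axis=0)
    # Final non-linear projection to the semantic vector.
    return np.tanh(W_sem @ pooled)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With trained weights, `cosine(clsm_vector(query), clsm_vector(doc))` would serve as the relevance score between a query and a document.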
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval; I.2.6 [Artificial Intelligence]: Learning
General Terms
Algorithms, Experimentation
Keywords
Convolutional Neural Network, Semantic Representation, Web
Search
1. INTRODUCTION
Most modern search engines resort to semantics-based methods
beyond lexical matching for Web document retrieval. This is
partly because the same concept is often expressed using
different vocabularies and language styles in
documents and queries. For example, latent semantic models such
as latent semantic analysis (LSA) are able to map a query to its
relevant documents at the semantic level where lexical matching
often fails (e.g., [9][10][31]). These models address the problem
of language discrepancy between Web documents and search
queries by grouping different terms that occur in a similar context
into the same semantic cluster. Thus, a query and a document,
represented as two vectors in the low-dimensional semantic space,
can still have a high similarity even if they do not share any term.
Extending from LSA, probabilistic topic models such as
probabilistic LSA (PLSA), Latent Dirichlet Allocation (LDA),
and Bi-Lingual Topic Model (BLTM), have been proposed and
successfully applied to semantic matching [19][4][16][15][39].
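How an LSA-style latent space bridges the lexical gap between a query and a document can be seen in a toy sketch. The term-document counts below are invented for illustration; the query is projected ("folded in") by multiplying with the truncated left singular vectors and dividing by the singular values, the standard LSA query mapping:

```python
import numpy as np

# Toy term-document count matrix (terms x documents); counts are invented.
terms = ["car", "automobile", "engine", "flower", "petal"]
#             d1 d2 d3
X = np.array([[2, 0, 0],    # car
              [0, 2, 0],    # automobile
              [1, 1, 0],    # engine
              [0, 0, 2],    # flower
              [0, 0, 1]],   # petal
             dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                       # rank of the latent semantic space
Uk, sk = U[:, :k], s[:k]

def fold_in(counts):
    # Project a bag-of-words vector into the k-dim latent space.
    return counts @ Uk / sk

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0, 0, 0, 0])  # query: "car"
d2 = X[:, 1]                     # document using "automobile", not "car"
sim = cosine(fold_in(q), fold_in(d2))  # close to 1.0: no shared terms,
                                       # but "car" and "automobile" co-occur
                                       # with "engine" in the corpus
```

Because "car" and "automobile" share the context word "engine", they fall into the same latent dimension, so the query and the document are highly similar in the semantic space despite zero lexical overlap.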
More recently, semantic modeling methods based on neural
networks have also been proposed for information retrieval (IR)
[16][32][20]. Salakhutdinov and Hinton proposed the Semantic
Hashing method based on a deep auto-encoder in [32][16]. A
Deep Structured Semantic Model (DSSM) for Web search was
proposed in [20], which is reported to give very strong IR
performance on a large-scale web search task when clickthrough
data are exploited as weakly-supervised information in training
the model. In both methods, plain feed-forward neural networks
are used to extract the semantic structures embedded in a query or
a document.
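A minimal sketch of such a plain feed-forward semantic model follows. The letter-trigram word hashing is the DSSM-style input representation; the trigram vocabulary, layer sizes, and random weights here are illustrative assumptions, whereas DSSM trains the weights on clickthrough data so that cosine similarity reflects relevance:

```python
import numpy as np

rng = np.random.default_rng(1)

def letter_trigrams(word):
    # DSSM-style word hashing: letter trigrams of the bounded word "#word#".
    w = f"#{word}#"
    return [w[i:i + 3] for i in range(len(w) - 2)]

# A small fixed trigram vocabulary built from example text (assumed).
corpus = "microsoft office excel apartment online body repair estimates"
vocab = sorted({t for w in corpus.split() for t in letter_trigrams(w)})
index = {t: i for i, t in enumerate(vocab)}

def hash_text(text):
    # Bag of letter trigrams -> fixed-size input vector (order is lost).
    v = np.zeros(len(vocab))
    for w in text.split():
        for t in letter_trigrams(w):
            if t in index:
                v[index[t]] += 1
    return v

# Two tanh layers, as in a plain feed-forward semantic model.
W1 = rng.standard_normal((64, len(vocab))) * 0.1
W2 = rng.standard_normal((16, 64)) * 0.1

def semantic_vector(text):
    return np.tanh(W2 @ np.tanh(W1 @ hash_text(text)))

def relevance(query, doc):
    q, d = semantic_vector(query), semantic_vector(doc)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
```

Note that `hash_text` discards word order entirely, which is exactly the bag-of-words limitation discussed next.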
Despite the progress made recently, all the aforementioned
latent semantic models view a query (or a document) as a bag of
words. As a result, they are not effective in modeling contextual
structures of a query (or a document). Table 1 gives several
examples of document titles to illustrate the problem. For
example, the word “office” in the first document refers to the
popular Microsoft product, but in the second document it refers to
a working space. We see that the precise search intent of the word
“office” cannot be identified without context.
microsoft office excel could allow remote code execution
welcome to the apartment office
online body fat percentage calculator
online auto body repair estimates
Table 1: Sample document titles. The text is lower-cased and
punctuation removed. The same word, e.g., “office”, has
different meanings depending on its contexts.
Modeling contextual information in search queries and
documents is a long-standing research topic in IR
[11][25][12][26][2][22][24]. Classical retrieval models, such as
TF-IDF and BM25, use a bag-of-words representation and cannot
effectively capture contextual information of a word. Topic
models learn the topic distribution of a word by considering word
occurrence information within a document or a sentence.
However, the contextual information captured by such models is
CIKM’14, November 03 – 07, 2014, Shanghai, China.
Copyright © 2014 ACM 978-1-4503-2598-1/14/11…$15.00
http://dx.doi.org/10.1145/2661829.2661935