to the query. Once all query terms have been processed, similarity scores $S_d$ are calculated by dividing each accumulator value by the corresponding value of $W_d$. Finally, the $r$ largest similarities are identified, and the corresponding documents returned to the user.
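To make this process concrete, here is a minimal Python sketch of term-at-a-time evaluation with accumulators. It assumes in-memory postings and an IDF-style term weight; the function and variable names (rank, doc_weight, and so on) are illustrative, not taken from the paper.

    import heapq
    import math
    from collections import defaultdict

    def rank(query_terms, index, doc_weight, num_docs, r):
        """Term-at-a-time ranking with accumulators (a sketch).

        index      : term t -> list of (d, f_dt) postings
        doc_weight : document d -> weight W_d
        num_docs   : total number of documents N
        r          : number of answers to return
        """
        acc = defaultdict(float)                 # one accumulator per document
        for t in query_terms:
            postings = index.get(t, [])
            if not postings:
                continue
            w_t = math.log(1 + num_docs / len(postings))  # assumed IDF-style weight
            for d, f_dt in postings:
                acc[d] += w_t * f_dt             # contribution of t to document d
        # S_d = accumulator value / W_d; return the r largest similarities.
        return heapq.nlargest(r, ((a / doc_weight[d], d) for d, a in acc.items()))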
The cost of ranking via an index is far less than with the exhaustive algorithm
outlined in Figure 2. Given a query of three terms, processing a query against the Web
data involves finding the three terms in the vocabulary; fetching and then processing
three inverted lists of perhaps 100 kB to 1 MB each; and making two linear passes over
an array of 12,000,000 accumulators. The complete sequence requires well under a
second on current desktop machines.
Nonetheless, the costs are still significant. Disk space is required for the index at
20%–60% of the size of the data for an index of the type shown in Figure 3; memory is
required for an accumulator for each document and for some or all of the vocabulary;
CPU time is required for processing inverted lists and accumulators; and disk traffic
is used to fetch inverted lists. Fortunately, compared to the implementation shown in
Figure 4, all of these costs can be dramatically reduced.
Indexing Word Positions.
We have described inverted lists as sequences of index entries, each a $\langle d, f_{d,t}\rangle$ pair. An index of this form is document-level, since it indicates whether a term occurs in a document but does not contain information about precisely where the term appears. Given that the frequency $f_{d,t}$ represents the number of occurrences of $t$ in $d$, it is straightforward to modify each entry to include the $f_{d,t}$ ordinal word positions $p$ at which $t$ occurs in $d$ and create a word-level inverted list containing pointers of the form $\langle d, f_{d,t}, p_1, \ldots, p_{f_{d,t}}\rangle$. Note that in this representation positions are word counts, not byte counts, so that they can be used to determine adjacency.
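A word-level index of this kind can be sketched as follows. The posting format matches the $\langle d, f_{d,t}, p_1, \ldots, p_{f_{d,t}}\rangle$ entries just described, while the tokenization (simple whitespace splitting) is a simplifying assumption.

    from collections import defaultdict

    def build_word_level_index(docs):
        """Build postings of the form (d, f_dt, [p_1, ..., p_fdt]) (a sketch).

        docs: iterable of (d, text) pairs, documents numbered from one.
        Positions are ordinal word counts (1-based), not byte offsets.
        """
        index = defaultdict(list)
        for d, text in docs:
            positions = defaultdict(list)
            for p, term in enumerate(text.split(), start=1):
                positions[term].append(p)
            for term in sorted(positions):
                plist = positions[term]
                index[term].append((d, len(plist), plist))
        return index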
Word positions can be used in a variety of ways during query evaluation. Section 4 discusses one of these, phrase queries, in which the user requests documents containing a sequence of terms rather than a bag of words. Word positions can also be used in bag-of-words queries, for example, to prefer documents in which the query terms are close together or appear near the beginning of the document. Similarity measures that make use of such proximity mechanisms have not been particularly successful in experimental settings but, for simple queries, adjacency and proximity do appear to be of value in Web retrieval.
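For instance, adjacency testing reduces to arithmetic on positions. The sketch below intersects two word-level inverted lists to find documents in which the second term immediately follows the first; it is a two-term special case of phrase evaluation, which Section 4 treats in full.

    def adjacent_pairs(list1, list2):
        """Documents where a term2 occurrence directly follows term1 (a sketch).

        Each argument is a word-level inverted list of (d, f_dt, positions)
        entries. Because positions are word counts, adjacency is simply a
        position difference of one.
        """
        second = {d: set(ps) for d, _, ps in list2}
        result = []
        for d, _, ps in list1:
            if d in second:
                starts = [p for p in ps if p + 1 in second[d]]
                if starts:
                    result.append((d, len(starts), starts))
        return result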
If the source document has a hierarchical structure, that structure can be reflected
by a similar hierarchy in the inverted index. For example, a document with a structure
of chapters, sections, and paragraphs might have word locations stored as (c, s, p, w)
tuples coded as a sequence of nested runs of c-gaps, s-gaps, p-gaps, and w-gaps. Such
an index allows within-same-paragraph queries as well as phrase queries, for example,
and with an appropriate representation, is only slightly more expensive to store than
a nonhierarchical index.
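The nested gap coding can be illustrated as follows. The convention used here, in which the first nonzero gap at a level restarts every lower level from zero, is one plausible realization; the text does not fix the details.

    def encode_gaps(locations):
        """Gap-encode sorted (c, s, p, w) word locations (a sketch).

        Each component is stored as a difference from the previous tuple;
        the first nonzero gap restarts every lower level from zero, so the
        stored values stay small and compress well.
        """
        encoded, prev = [], (0, 0, 0, 0)
        for loc in locations:
            gaps, restarted = [], False
            for cur, pre in zip(loc, prev):
                if restarted:
                    gaps.append(cur)          # lower level restarts from zero
                else:
                    g = cur - pre
                    gaps.append(g)
                    restarted = g != 0
            encoded.append(tuple(gaps))
            prev = loc
        return encoded

    def decode_gaps(encoded):
        """Invert encode_gaps, recovering absolute (c, s, p, w) tuples."""
        locations, prev = [], (0, 0, 0, 0)
        for gaps in encoded:
            loc, restarted = [], False
            for g, pre in zip(gaps, prev):
                if restarted:
                    loc.append(g)
                else:
                    loc.append(pre + g)
                    restarted = g != 0
            prev = tuple(loc)
            locations.append(prev)
        return locations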
Core Ideas. To end this section, we state several key implementation decisions; a toy realization follows the list.
—Documents have ordinal identifiers, numbered from one.
—Inverted lists are stored contiguously.
—The vocabulary consists of every term occurring in the documents and is stored in a
simple extensible structure such as a B-tree.
—An inverted list consists of a sequence of pairs of document numbers and in-document
frequencies, potentially augmented by word positions.
—The vocabulary may be preprocessed by stemming and stopping.
—Ranking involves a set of accumulators and term-by-term processing of inverted lists.
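A toy realization of these decisions might look as follows, with an in-memory dictionary standing in for the B-tree vocabulary; the file layout and fixed field widths are assumptions, and a practical index would compress its lists.

    import struct

    def write_index(index, path):
        """Write inverted lists contiguously; return the vocabulary (a sketch).

        index: term -> list of (d, f_dt) pairs, documents numbered from one.
        The vocabulary maps each term to the (offset, length) of its list
        in the postings file; a real system would keep it in a B-tree.
        """
        vocab = {}
        with open(path, "wb") as f:
            for term in sorted(index):
                offset = f.tell()
                for d, f_dt in index[term]:
                    f.write(struct.pack("<II", d, f_dt))  # fixed-width, uncompressed
                vocab[term] = (offset, f.tell() - offset)
        return vocab

    def read_postings(term, vocab, path):
        """Fetch and decode one term's inverted list."""
        offset, length = vocab[term]
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(length)
        return [struct.unpack_from("<II", data, i) for i in range(0, length, 8)]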