Cluster-Based Inverted File Index Compression


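This paper examines cluster-based mixed coding schemes for inverted file index compression. It was written by Jinlin Chen, Ping Zhong, and Terry Cook of the computer science departments at Queens College and the Graduate Center, City University of New York. The paper argues that the cluster property of a document collection can be exploited to compress inverted files more effectively, because term occurrences are not uniformly distributed: some terms appear more often in particular parts of the collection.

An inverted file is a data structure widely used in search engines and information retrieval systems to find the documents that contain a given keyword. It maps each term to a list of the identifiers of all documents containing that term. These lists can take up a great deal of space, so compressing them is key to storage efficiency.

The "mixed coding scheme" in the paper combines several coding methods, using different codewords for d-gaps (the differences between consecutive document identifiers) of different sizes. How well a code performs depends on how closely its implicit d-gap distribution model matches the actual distribution in the document collection; the closer the match, the better the compression.

The "cluster property" means that term occurrences are clustered: within some regions of the collection (for example, a group of topically related documents) a term appears more often, so the corresponding part of its inverted list consists of small d-gaps. The paper exploits this by separating the d-gaps of an inverted list into clustered and non-clustered ones based on a threshold and encoding each group with a method suited to it.

The underlying technique is taking consecutive differences: rather than storing every document identifier, the list stores the differences between adjacent identifiers, which reduces the average number of bits per pointer. The two codes the paper proposes, mixed gamma/flat binary and mixed delta/flat binary, each pair a variable-length code with a flat binary code, and their parameters can be adjusted to fit the d-gap distribution.

Overall, the work aims to build more efficient inverted file compression by understanding and exploiting the internal structure of document collections, improving the performance and storage efficiency of search engines. It has both theoretical and practical value for the storage and retrieval of large-scale text data.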

30 Journal of Digital Information Management  Volume 6 Number 1  February 2008

Cluster based Mixed Coding Schemes for Inverted File Index Compression

Jinlin Chen
Computer Science Department
Queens College, City University of New York
USA
jchen@cs.qc.edu

Ping Zhong
Computer Science Department
Graduate Center, City University of New York
USA
pzhong@gc.cuny.edu

Terry Cook
Computer Science Department
Graduate Center, City University of New York
USA
terrycookd1@aol.com

taking consecutive differences, $d_{i+1} - d_i$. In this way it is possible to code inverted lists using fewer bits per pointer on average.
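
The d-gap representation can be illustrated with a minimal Python sketch. The toy collection, the DocIDs, and the function names (`build_inverted_lists`, `to_dgaps`, `from_dgaps`) are assumptions made up for this illustration and do not come from the paper.

```python
from collections import defaultdict
from itertools import accumulate

def build_inverted_lists(docs):
    """Map each term to the ascending list of DocIDs that contain it."""
    lists = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].split())):
            lists[term].append(doc_id)
    return dict(lists)

def to_dgaps(doc_ids):
    """Keep the first DocID, then store consecutive differences d[i+1] - d[i]."""
    return doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_dgaps(gaps):
    """Recover the DocIDs with a running sum, as a sequential decoder would."""
    return list(accumulate(gaps))

# Hypothetical toy collection: DocID -> text.
docs = {
    3: "index compression",
    8: "inverted index",
    9: "index",
    12: "inverted file",
    40: "index file",
}
lists = build_inverted_lists(docs)
index_gaps = to_dgaps(lists["index"])
print(lists["index"], index_gaps)
```

Because the list is always read from its beginning, the running sum in `from_dgaps` is enough to restore the original DocIDs, and the gaps are never larger than the DocIDs they replace.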

Many codes have been proposed for compressing inverted lists. These codes use different codewords for different d-gaps. The performance of a code is decided by whether the implicit d-gap distribution model conforms to that of the document collection.
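
As a concrete instance of such codes, the sketch below implements the well-known Elias gamma and delta codes, which assign shorter codewords to smaller d-gaps. It uses one common convention (a run of zeros as the unary length prefix); other texts write the prefix differently. The function names are assumptions for this illustration, and this is not the paper's mixed code. A code that spends $l(x)$ bits on gap $x$ is best suited to gaps distributed roughly as $2^{-l(x)}$, which is the sense in which each code carries an implicit distribution model.

```python
def gamma_encode(x):
    """Elias gamma: (len - 1) zeros, then x in binary. Needs x >= 1."""
    if x < 1:
        raise ValueError("gamma code is defined for positive integers only")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary

def delta_encode(x):
    """Elias delta: gamma code of the bit length, then x without its leading 1."""
    if x < 1:
        raise ValueError("delta code is defined for positive integers only")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary[1:]

def gamma_decode(bits):
    """Decode a concatenation of gamma codewords back into integers."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":      # count the unary length prefix
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return values

gaps = [3, 5, 1, 31]
encoded = "".join(gamma_encode(g) for g in gaps)
print(encoded, gamma_decode(encoded))
```

Gamma spends $2\lfloor \log_2 x \rfloor + 1$ bits on a gap $x$, so it is cheap for small gaps and comparatively expensive for large ones.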

One way to improve inverted file compression is to use the cluster property [1] of document collections, which states that term occurrences are not uniformly distributed. Some terms are used more frequently in some parts of the collection than in others. The corresponding part of the inverted list will consequently consist of clustered small d-gap values. Interpolative code [9] exploits the cluster property of term occurrences and achieves very good performance. Other codes that favor small d-gaps also perform well on document collections with the cluster property.
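
The effect of the cluster property on a code that favors small d-gaps can be seen in a small sketch. It compares the total Elias gamma cost of two hypothetical inverted lists that hold the same number of DocIDs over the same range, one evenly spread and one clustered. The gap values are invented for illustration and are not data from the paper.

```python
def gamma_length(x):
    """Bits used by the Elias gamma code for x >= 1: 2 * floor(log2 x) + 1."""
    if x < 1:
        raise ValueError("gamma code is defined for positive integers only")
    return 2 * (x.bit_length() - 1) + 1

def gamma_bits(gaps):
    """Total gamma-coded size of a d-gap list, in bits."""
    return sum(gamma_length(g) for g in gaps)

# Both lists contain 8 DocIDs and end at DocID 64.
uniform_gaps = [8] * 8                        # term spread evenly
clustered_gaps = [1, 1, 1, 1, 57, 1, 1, 1]    # two dense runs and one jump

uniform_cost = gamma_bits(uniform_gaps)
clustered_cost = gamma_bits(clustered_gaps)
print(uniform_cost, clustered_cost)
```

The clustered list pays heavily for its single large gap but very little for the rest, so its total is far smaller than that of the evenly spread list.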

A major feature of most previous approaches is that they use the same code within a given inverted list, without considering the difference between clustered and non-clustered d-gaps. In fact, knowing which d-gaps are clustered and which are not provides valuable information for improving index compression. By clustering the d-gaps of an inverted list strictly based on a threshold, and then encoding clustered and non-clustered d-gaps using different methods, we can tailor the coding to the specific properties of different d-gaps and achieve a better compression ratio. Based on this idea, in this paper we propose a cluster based approach and present two new mixed codes for inverted file index compression: mixed k-base gamma/k-flat binary code and mixed k-base delta/k-flat binary code. Experimental results show that the two new codes achieve better or equal performance in terms of compression ratio compared with interpolative code, which is considered the most efficient bitwise code at present. In addition, the two new codes have much lower complexity than interpolative code and therefore enable faster encoding and decoding. By adjusting the parameters of the mixed codes, even better results may be achieved.
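
The threshold-based clustering step can be sketched as follows. This only illustrates the idea of separating clustered from non-clustered d-gaps; the threshold value, the `<=` comparison, the run labels, and the function name `partition_dgaps` are assumptions, and the k-base gamma and k-flat binary codes themselves are defined in Section 4 of the paper and are not reproduced here.

```python
from itertools import groupby

def partition_dgaps(gaps, threshold):
    """Group consecutive d-gaps into runs labelled clustered or non-clustered.

    Assumption for this sketch: a gap counts as clustered when it is
    less than or equal to the threshold.
    """
    runs = []
    for is_clustered, run in groupby(gaps, key=lambda g: g <= threshold):
        label = "clustered" if is_clustered else "non-clustered"
        runs.append((label, list(run)))
    return runs

# Hypothetical d-gap list and threshold.
gaps = [3, 1, 2, 57, 120, 1, 1, 4, 300]
runs = partition_dgaps(gaps, threshold=4)
for label, run in runs:
    print(label, run)
```

Once the runs are separated, each kind can be handed to a different encoder, which is the point at which a mixed code departs from using one code for the whole list.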

The rest of this paper is organized as follows. Section 2 describes related work on inverted file indexing. Section 3 presents the motivation of this paper. Section 4 discusses the concept of cluster based mixed code and presents two new mixed codes. Section 5 presents experiment results for performance evaluation. Section 6 concludes the paper.

ABSTRACT: The cluster property of document collections in today's search engines provides valuable information for index compression. By clustering the d-gaps of an inverted list based on a threshold, and then encoding clustered and non-clustered d-gaps using different methods, we can tailor the coding to the specific properties of different d-gaps and achieve a better compression ratio. Based on this idea, in this paper we propose a cluster based approach and present two new codes for inverted file index compression: mixed gamma/flat binary code and mixed delta/flat binary code. Experimental results show that the two new codes achieve better or equal performance in terms of compression ratio compared with interpolative code, which is considered the most efficient bitwise code at present. In addition, the two new codes have much lower complexity than interpolative code and therefore enable faster encoding and decoding. By adjusting the parameters of the mixed codes, even better results may be achieved. Experiments show promising results with our approaches.

Categories and Subject Descriptors

H.3.2 [Information Storage] File Organization; H.3.1 [Content analysis and indexing]; I.7.3 [Index generation]

General Terms

Document processing, Index generation

Keywords: Inverted file, d-gap, Index compression, inverted list

Received 28 Aug. 2006; Revised and accepted 29 August 2007

1. Introduction

Today Web search engines play an important role in how people access Web information. The large amount of information available on the Web requires an efficient indexing mechanism for search engines. Among the many indexing techniques, the inverted file has been the most popular due to its relatively small size and high efficiency for keyword-based queries [10][16][17]. An inverted file index on a document collection maps each unique term to an inverted list of all the documents containing the term. For a term $t$, the inverted list has the structure $\langle f_t; d_1, d_2, \ldots, d_{f_t} \rangle$, where $f_t$ is the number of documents containing $t$, $d_i$ is a DocID that identifies the document associated with the $i$th occurrence of $t$, and $d_i < d_{i+1}$. Since the inverted list is in ascending order of DocIDs, and all processing is sequential from the beginning of the list, the list can be stored as an initial position followed by a list of d-gaps by
