30 Journal of Digital Information Management Volume 6 Number 1 February 2008
Cluster based Mixed Coding Schemes for Inverted File Index Compression
Jinlin Chen
1
, Ping Zhong
2
, Terry Cook
3
1
Computer Science Department
Queen College, City University of New York
USA
jchen@cs.qc.edu
2
Computer Science Department
Graduate Center, City University of New York
USA
pzhong@gc.cuny.edu
3
Computer Science Department
Graduate Center, City University of New York
USA
terrycookd1@aol.com
taking consecutive differences, d
i+1
- d
i
. In this way it is possible
to code inverted lists using fewer bits per pointer on average.
Many codes have been proposed for compressing inverted
lists. These codes use different codewords for different d-
gaps. The performance of a code is decided by whether the
implicit d-gap distribution model conforms to that of the
document collection.
One way to improve inverted file compression is to use the cluster
property [1] of document collection, which states that term
occurrences are not uniformly distributed. Some terms are more
frequently used in some parts of the collection than in others.
The corresponding part of the inverted list will consequently be
small d-gap values clustered. Interpolative code [9] exploits the
cluster property of term occurrences and achieves very good
performance. Other codes that favor small d-gaps also perform
well on document collections with cluster property.
A major feature of most previous approaches is that they use
the same code within a given inverted list, without considering
the difference in between clustered and non-clustered d-gaps.
Actually the knowledge of cluster and non-cluster property of
d-gaps provides valuable information for improving index
compression. By clustering d-gaps of an inverted list strictly
based on a threshold, and then encoding clustered and non-
clustered d-gaps using different methods, we can tailor to the
specific properties of different d-gaps and achieve better
compression ratio. Based on this idea, in this paper we
propose a cluster based approach and present two new mixed
codes for inverted file index compression: mixed k-base
gamma/k-flat binary code and mixed k-base delta/k-flat binary
code. Experiment results show that the two new codes achieve
better or equal performance in terms of compression ratio
comparing to interpolative code which is considered as the
most efficient bitwise code at present. Besides, the two new
codes have much lower complexity comparing to interpolative
code and therefore enable faster encoding and decoding. By
adjusting the parameters for the mixed codes, even better
result may be achieved.
The rest of this paper is organized as follows. Section 2
describes related work on inverted file indexing. Section 3
presents the motivation of this paper. Section 4 discusses
the concept of cluster based mixed code and presents two
new mixed codes. Section 5 presents experiment results for
performance evaluation. Section 6 concludes the paper.
ABSTRACT: The cluster property of document collections in
today’s search engines provides valuable information for index
compression. By clustering d-gaps of an inverted list based
on a threshold, and then encoding clustered and non-clustered
d-gaps using different methods, we can tailor to the specific
properties of different d-gaps and achieve better compression
ratio. Based on this idea, in this paper we propose a cluster
based approach and presents two new codes for inverted file
index compression: mixed gamma/flat binary code and mixed
delta/flat binary code. Experiment results show that the two
new codes achieve better or equal performance in terms of
compression ratio comparing to interpolative code which is
considered as the most efficient bitwise code at present.
Besides, the two new codes have much lower complexity
comparing to interpolative code and therefore enable faster
encoding and decoding. By adjusting the parameters for the
mixed codes, even better results may be achieved.
Experiments show promising results with our approaches.
Categories and Subject Descriptors
H.3.2 [Information Storage] File Organization; H.3.1 [Content
analysis and indexing]; I.7.3 [Index generation]
General Terms
Document processing, Index generation
Keywords: Inverted file, d-gap, Index compression, inverted list
Received 28 Aug. 2006; Revised and accepted 29 August 2007
1. Introduction
Today Web search engines play an important role for people to
access Web information. The large amount of information
available on the Web requires an efficient indexing mechanism
for search engines. Among the many indexing techniques,
inverted file has been the most popular one due to its relative
small size and high efficiency for keyword-based queries
[10][16][17]. An inverted file index on a document collection maps
each unique term to an inverted list of all the documents
containing the term. For a term t, the inverted list has the structure
<f
t
; d
1
, d
2
, d
3
, … , d
ft
>, where f
t
is the number of documents
containing t, d
i
is a DocID that identifies the document
associated with the i
th
occurrence of t, and d
i
< d
i+1
. Since the
inverted list is in ascending order of DocIDs, and all processing
is sequential from the beginning of the list, the list can be
stored as an initial position followed by a list of d-gaps by
Journal of Digital
Information Management