An Inverted Index Method for Mass Spectra K-Nearest Neighbor Queries
Houjun Tang, Xi Liu, Honglong Xu, Kezhong Lu, Gang Liu,
Yuhong Feng, Hong Zhou, Rui Mao
National High Performance Computing Center at Shenzhen
College of Computer Science and Software Engineering
Shenzhen University
3688 Nanhai Road, Shenzhen, 518060, China
houj.tang@gmail.com, xii.liu@hotmail.com, longer597@163.com,
{kzlu, gliu,yfeng, hzhou, mao}@szu.edu.cn
Keywords: K-nearest neighbor search, metric-space indexing, mass spectra, sparse matrix,
inverted index.
Abstract. Finding k-nearest neighbors (k-nn) in metric-space is frequently used in modern biological
applications due to its general applicability. Processing such queries with general purpose methods
usually requires more time and space than domain-specific methods. This paper presents an inverted
index method which exploits the sparsity of mass spectra binary format data and compares it with an
existing metric-space method. This metric-space method acts as a coarse filter and can be followed by
any fine ranking scheme. In experiments, we find that the new method outperforms the metric-space
method in both query speed and index size.
Introduction
Tandem mass spectrometry, also known as MS/MS or MS
2
, is used to produce structural information
about a compound. It has been used as a common technique in proteins and peptide sequences
identification in complicated samples. A mass spectrum obtained by an experiment contains a list of
peaks corresponding to the peptide fragment ions which are pairs of real numbers m/z ratios and their
intensity of occurrence, where m denotes mass and z charge [10]. By labeling each spectrum with its
correct amino-acid sequence, we can identify a peptide’s presence in the protein sample.
The spectrum identification step can be easily modeled as a similarity search problem such as the
k-nearest neighbor (k-nn) search. In k-nn search, k objects which are most similar to the given object
are retrieved from a large database which is measured by a distance function. In our case,
experimentally generated spectra are compared to a database of theoretical spectra.
In the last twenty years, protein databases have been growing exponentially [7], which leads to a
great increase of computation when identifying a biological sequence from theoretical databases.
Performing k-nn search on experimental spectra against theoretical ones costs more time than ever
and linear scan is no longer acceptable. As a result, various protein identification methods supporting
rapid similarity search have been developed, such as TurboSEQUEST [19], MASCOT [11],
ProFound [21], and clustering method [4]. The similarity measures include cosine distances based on
shared peak count [12,15] and the Hausdorff distance [10].
Metric-space indexing, also known as distance-based indexing [2,6,16,20] usually focus on its
general applicability and take little use of data domain information. All information the indexing need
is the metric function to compute the distance between objects. The data set is clustered and a data
structure complied during an off-line process while an on-line search makes use of the triangle
inequality to eliminate clusters of data and return possible results.
In MoBIoS project [8], Ramakrishnan et al. proposed a metric-space technique called MSFound
and use it for similarity search of mass spectra [15]. In their method, a cosine distance-based
semi-metric distance is introduced and the MVP Tree [1] is used as data structure to perform range
The correspondence author