Sequence analysis
Application of learning to rank to protein remote homology detection
Bin Liu^1,2,3,*, Junjie Chen^1 and Xiaolong Wang^1,2
^1 School of Computer Science and Technology, ^2 Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China and ^3 Gordon Life Science Institute, Belmont, MA 02478, USA
*To whom correspondence should be addressed.
Associate Editor: John Hancock
Received on June 4, 2015; revised on July 3, 2015; accepted on July 7, 2015
Abstract
Motivation: Protein remote homology detection is one of the fundamental problems in computational biology. It aims to find protein sequences in a database of known structures that are evolutionarily related to a given query protein. Some computational methods, such as PSI-BLAST, HHblits and ProtEmbed, treat this problem as a ranking problem and achieve state-of-the-art performance. This raises the possibility of combining these methods to improve predictive performance. To this end, we propose a new computational method called ProtDec-LTR for protein remote homology detection, which combines various ranking methods in a supervised manner using the Learning to Rank (LTR) algorithm derived from natural language processing.
Results: Experimental results on a widely used benchmark dataset showed that ProtDec-LTR achieves an ROC1 score of 0.8442 and an ROC50 score of 0.9023, outperforming all the individual predictors and some state-of-the-art methods. These results indicate that it is appropriate to treat protein remote homology detection as a ranking problem, and that predictive performance can be improved by combining different ranking approaches in a supervised manner using LTR.
Availability and implementation: For users’ convenience, the software tools for the three basic ranking predictors and the Learning to Rank algorithm are provided at http://bioinformatics.hitsz.edu.cn/ProtDec-LTR/home/
Contact: bliu@insun.hit.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
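The ROC1 and ROC50 scores quoted in the abstract can be illustrated with a short sketch. It assumes the commonly used definition of the ROC-n score: the area under the ROC curve up to the first n false positives, normalized by n times the number of true homologs. The function name `roc_n`, its arguments, and the handling of lists with fewer than n false positives are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def roc_n(labels_ranked: Sequence[bool], n: int) -> float:
    """ROC-n score for a single query (illustrative sketch).

    labels_ranked: labels of the database hits, best-ranked first
                   (True = true homolog, False = false positive).
    n:             number of false positives at which the curve is cut off.
    """
    total_pos = sum(labels_ranked)
    if total_pos == 0 or n <= 0:
        return 0.0

    tp_seen = 0  # true positives ranked above the current position
    fp_seen = 0  # false positives encountered so far
    area = 0     # sum of true-positive counts at each counted false positive

    for is_pos in labels_ranked:
        if is_pos:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen
            if fp_seen == n:
                break

    # Assumption: if the list holds fewer than n false positives, the missing
    # ones are treated as ranked below every listed hit.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_pos)


if __name__ == "__main__":
    ranking = [True, False, True, False]
    print(roc_n(ranking, 1), roc_n(ranking, 2))
```

Under this definition a score of 1.0 means every true homolog is ranked above the first false positive. A benchmark-level figure would typically be an average of the per-query scores, although the averaging procedure is not described in this part of the paper.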
1 Introduction
Using sequence similarity between protein pairs to detect evolutionary relationships is one of the central tasks in bioinformatics, and it can be applied to protein 3D structure and function prediction (Bork and Koonin, 1998). Unfortunately, remotely homologous protein pairs have similar structures and functions but lack easily detectable sequence similarity, because protein tertiary structure is more conserved than protein sequence. It is therefore often difficult to detect protein remote homology by computational approaches.
Some effective computational methods have been developed to address this challenging problem. They can be mainly divided into two groups: discriminative methods and ranking methods. The first group, discriminative methods, treats protein remote homology detection as a classification problem: both positive and negative samples are used to train classification models, which are then used to predict unseen samples. Among these approaches, the methods based on Support Vector Machines (SVMs) achieve state-of-the-art performance with appropriate kernel functions, which measure the similarity between any
© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Bioinformatics, 31(21), 2015, 3492–3498
doi: 10.1093/bioinformatics/btv413
Advance Access Publication Date: 10 July 2015
Original Paper