GPCR-GIA: a web-server for identifying G-protein
coupled receptors and their families with grey
incidence analysis
Wei-Zhong Lin1, Xuan Xiao1,3 and Kuo-Chen Chou2
1Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen
333001, China and 2Gordon Life Science Institute, 13784 Torrey Del Mar
Drive, San Diego, CA 92130, USA
3To whom correspondence should be addressed.
E-mail: xiaoxuan0326@yahoo.com.cn
G-protein-coupled receptors (GPCRs) play fundamental
roles in regulating various physiological processes as well
as the activity of virtually all cells. Different GPCR
families are responsible for different functions. With the
avalanche of protein sequences generated in the postgenomic
age, it is highly desirable to develop an automated
method to address two problems: given the sequence
of a query protein, can we identify whether it is a
GPCR? If it is, what family class does it belong to? Here,
a two-layer ensemble classifier called GPCR-GIA was
proposed by introducing a novel scale called ‘grey inci-
dent degree’. The overall success rate by GPCR-GIA in
identifying GPCR and non-GPCR was about 95%, and
that in identifying the GPCRs among their nine family
classes was about 80%. These rates were obtained by the
jackknife cross-validation tests on the stringent benchmark
data sets, in which none of the proteins has ≥50%
pairwise sequence identity to any other in the same class.
Moreover, a user-friendly web-server was established at
http://218.65.61.89:8080/bioinfo/GPCR-GIA. For the user’s
convenience, a step-by-step guide on how to use the
GPCR-GIA web server is provided. Generally speaking,
one can get the desired two-level results in around 10 s
for a query protein sequence of 300–400 amino acids; the
longer the sequence, the more time is needed.
Keywords: ensemble classifier/fusion/K nearest neighbor
algorithm/pseudo amino acid composition/web server
Introduction
G-protein-coupled receptors (GPCRs) are seven-helix trans-
membrane proteins that provide a molecular link between
extracellular signals and intracellular reactions ranging from
cell–cell communication processes to physiological
responses (Heuss and Gerber, 2000; Milligan and White,
2001; Hall and Lefkowitz, 2002; Chou, 2005a). They are
among the largest and most diverse protein families in mam-
malian genomes. Owing to their close relevance to a variety
of diseases, such as cancer, diabetes, neurodegenerative,
inflammatory and respiratory disorders, GPCRs are of utmost
interest in drug development: over half of all prescription
drugs currently on the market act by targeting these receptors
directly or indirectly.
Much effort has been invested in studying GPCRs by
both academic institutions and the pharmaceutical industry.
However, as membrane proteins, GPCRs are very difficult to
crystallize and most of them will not dissolve in normal solvents.
Accordingly, so far, very few GPCR crystal structures
have been determined. Although the recently developed
state-of-the-art NMR technique is a very powerful tool in
determining the three-dimensional structures of membrane
proteins (Oxenoid and Chou, 2005; Call et al., 2006; Douglas
et al., 2007; Schnell and Chou, 2008), it is time-consuming
and costly. Although some membrane protein structures can
be derived with homology approaches (Chou, 2004), the
number of templates for transmembrane proteins is very
limited. In contrast, more than a thousand GPCR sequences
are known, and many more are expected to come in the near
future. In view of this, it would be very useful to develop a
computational method which can predict the classification of
the families and subfamilies of GPCRs based on their
primary sequences.
In a pioneering study (Chou and Elrod, 2002), Chou and
Elrod attempted to identify the subfamily classes of the
rhodopsin-like GPCR family by using the covariant-
discriminant algorithm (Chou and Elrod, 1999). With more
data available later, the study was extended to identify the
main family classes of GPCRs (Chou, 2005b) with a similar
approach. Stimulated by the encouraging results, several
follow-up studies were conducted using various
approaches, as reported in Bhasin and Raghava (2005), Gao
and Wang (2006) and Wen et al. (2007).
Although considerable progress has been achieved
during the past 6 years in this area, further studies are
needed for the following reasons. First, the data sets constructed
to train the existing predictors cover only a limited number of
GPCR family classes. With the development of protein databases,
more classes should be included to broaden the coverage
for practical use. Secondly, the reported success
rates were derived from benchmark data sets that had not
been rigorously screened by a clear data-culling operation to
avoid redundancy and homology bias, and hence those
reported success rates might be overestimated. As is
well known, the more family classes are covered, the lower
the odds of getting a correct prediction. Also, the more
stringent the benchmark data set in excluding homologous
sequences, the harder it becomes to get a high success rate
in a cross-validation test (Xiao et al., 2005; Chou and Shen,
2007c; Chou and Shen, 2008). The present study was
devoted to addressing these problems by developing a new
GPCR predictor. Moreover, a user-friendly web server,
called GPCR-GIA, was designed for the new predictor. For
the convenience of most experimental scientists who wish to
utilize the predictor to generate the desired data but find it
difficult to follow the detailed mathematics and processes, a
step-by-step guide on how to use the web server predictor
is provided.
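
To make the jackknife (leave-one-out) cross-validation test concrete, the following is a minimal sketch under stated assumptions. It is not the GPCR-GIA algorithm: it substitutes a plain nearest-neighbor rule on the 20-dimensional amino acid composition for the grey incidence analysis and pseudo amino acid composition used by the predictor. The function names (`composition`, `nearest_label`, `jackknife_success_rate`) and the toy sequences are hypothetical and serve only as illustration.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 native amino acids


def composition(seq):
    """Return the amino acid composition of seq as 20 fractions."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_label(query, training):
    """Label of the training vector closest to query (Euclidean distance).

    Assumption: a stand-in for the paper's classifier, not its method.
    """
    best = min(training, key=lambda item: math.dist(query, item[0]))
    return best[1]


def jackknife_success_rate(samples):
    """samples: list of (sequence, label) pairs.

    Each sample is singled out in turn as the query and predicted
    from all remaining samples; the fraction of correct calls is returned.
    """
    vectors = [(composition(seq), label) for seq, label in samples]
    correct = 0
    for i, (vec, label) in enumerate(vectors):
        rest = vectors[:i] + vectors[i + 1:]  # hold out sample i
        if nearest_label(vec, rest) == label:
            correct += 1
    return correct / len(vectors)


# Toy data (hypothetical sequences, not real GPCRs)
toy = [("AAAAAC", "x"), ("AAAACC", "x"), ("WWWWWY", "y"), ("WWWWYY", "y")]
rate = jackknife_success_rate(toy)
```

Because every sample is held out exactly once, the jackknife test yields a unique result for a given data set, which is why redundancy among the benchmark sequences directly inflates the measured success rate: a held-out protein with a close homolog left in the training set is easy to predict.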
© The Author 2009. Published by Oxford University Press. All rights reserved.
For Permissions, please e-mail: journals.permissions@oxfordjournals.org
Protein Engineering, Design & Selection vol. 22 no. 11 pp. 699–705, 2009
Published online September 22, 2009 doi:10.1093/protein/gzp057