Predicting protein structural classes with pseudo amino acid composition:
An approach using geometric moments of cellular automaton image
Xuan Xiao a,*, Pu Wang a, Kuo-Chen Chou b
a Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 33300, China
b Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA
article info
Article history:
Received 2 April 2008
Received in revised form 18 June 2008
Accepted 18 June 2008
Available online 24 June 2008
Keywords:
Cellular automaton
Space–time evolution
Image texture
Geometric invariant moment
Pseudo amino acid composition
Covariant-discriminant algorithm
Chou’s invariant theorem
abstract
A novel approach was developed for predicting the structural classes of proteins based on their
sequences. It was assumed that proteins belonging to the same structural class must bear some sort of
similar texture on the images generated by the cellular automaton evolving rule [Wolfram, S., 1984.
Cellular automata as models of complexity. Nature 311, 419–424]. Based on this, two geometric
invariant moment factors derived from the image functions were used as the pseudo amino acid
components [Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid
composition. Proteins: Struct., Funct., Genet. (Erratum: ibid., 2001, vol. 44, 60) 43, 246–255] to
formulate the protein samples for statistical prediction. The success rates thus obtained on a previously
constructed benchmark dataset are quite promising, implying that the cellular automaton image can
help to reveal some inherent and subtle features deeply hidden in a pile of long and complicated amino
acid sequences.
© 2008 Elsevier Ltd. All rights reserved.
1. Introduction
Although the details of the three-dimensional (3-D) structures of
proteins are extremely complicated and irregular, their overall
topological folding patterns are surprisingly simple and regular. In
view of this, proteins are generally classified into a limited
number of structural classes, typically the following four:
all-α, all-β, α/β, and α+β (Levitt and Chothia, 1976). With more
data now available, proteins can be further classified into 11
classes (Andreeva et al., 2004), of which at least 7 are highly
populated with low sequence homology within the same class
(Chou and Cai, 2004).
The structural class is an important attribute used to
characterize the overall folding type of a protein. Therefore,
prediction of the structural class has attracted many investigators
(see, e.g., Cao et al., 2006; Chandonia and Karplus, 1995; Chen
et al., 2006a, b, 2008a, b; Chou, 1989, 1995; Chou and Zhang, 1994;
Chou and Maggiora, 1998; Deleage and Roux, 1987; Jahandideh
et al., 2007; Kedarisetti et al., 2006; Klein, 1986; Klein and Delisi,
1986; Kneller et al., 1990; Kurgan and Homaeian, 2006; Kurgan
et al., 2007, 2008; Lin and Li, 2007b; Liu and Chou, 1998; Luo et al.,
2002; Mao et al., 1994; Metfessel et al., 1993; Nakashima et al.,
1986; Shen et al., 2005; Sun and Huang, 2006; Zhang et al., 1995;
Zhang and Ding, 2007; Zhou, 1998; Zhou and Assa-Munt, 2001).
Although these investigators used a variety of algorithms, the
methods fall into two basic groups: one based on the amino acid
(AA) composition, and the other on the pseudo amino acid (PseAA)
composition
(Chou, 2005b). Although the amino acid composition model is
simpler and easier to handle, it fails to incorporate any of the
sequence-order information in a protein. To avoid the complete
loss of the sequence-order information as suffered in the amino
acid composition model (Chou, 1995; Nakashima et al., 1986), the
PseAA composition was introduced.
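The loss of sequence-order information in the amino acid composition model can be illustrated with a minimal Python sketch. The function name aa_composition and the two toy sequences are hypothetical and chosen only for illustration; they are not taken from the benchmark dataset.

```python
from collections import Counter

# One-letter codes of the 20 native amino acids
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """Return the 20 normalized occurrence frequencies of a protein sequence."""
    counts = Counter(sequence)
    # Only the 20 standard residues contribute to the total
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Two toy sequences (hypothetical); seq_b is seq_a read backwards
seq_a = "MKVLAAGIK"
seq_b = "KIGAALVKM"

# True: the residue order is invisible to the AA composition model
same = aa_composition(seq_a) == aa_composition(seq_b)
```

Any two sequences that are permutations of each other are mapped to the same 20-dimensional vector, which is the limitation that the PseAA composition is meant to ease.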
The concept of PseAA composition was originally proposed for
improving the prediction quality of protein subcellular localization
and membrane protein type (Chou, 2001). The essence of
PseAA composition is to keep using a discrete model to represent
a protein sample, yet without completely losing its sequence-order
information. According to its definition, the PseAA composition
for a given protein sample is expressed by a set of 20+λ
discrete numbers, where the first 20 represent the 20 components
of the classical amino acid composition while the additional λ
numbers incorporate some of its sequence-order information via
different kinds of coupling modes.
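As a rough illustration of the 20+λ layout, the following Python sketch appends λ sequence-order correlation terms to the 20 composition terms. It assumes one particular coupling mode (the mean squared difference of a per-residue numeric property between residues k positions apart), which is not the cellular automaton image approach developed in this paper. The names pse_aa_composition, lam, weight, and TOY_PROPERTY are hypothetical, and TOY_PROPERTY is a placeholder scale rather than a real physicochemical index.

```python
# One-letter codes of the 20 native amino acids
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder per-residue property (assumption: NOT a physicochemical scale);
# a real application would substitute a measured property here.
TOY_PROPERTY = {aa: i / 19.0 for i, aa in enumerate(AMINO_ACIDS)}

def pse_aa_composition(sequence, lam=2, weight=0.05, prop=TOY_PROPERTY):
    """Return a (20 + lam)-dimensional vector: 20 composition terms
    followed by lam sequence-order correlation terms."""
    seq = [aa for aa in sequence if aa in prop]  # keep standard residues only
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")

    # The 20 classical amino acid composition components
    freqs = [seq.count(aa) / n for aa in AMINO_ACIDS]

    # k-th tier correlation factor: average coupling between residues
    # separated by k positions along the chain
    thetas = []
    for k in range(1, lam + 1):
        diffs = [(prop[seq[i + k]] - prop[seq[i]]) ** 2 for i in range(n - k)]
        thetas.append(sum(diffs) / (n - k))

    # Joint normalization so that all 20 + lam components sum to 1
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]

vector = pse_aa_composition("MKVLAAGIK", lam=3)  # 23 components
```

The weight parameter balances the correlation terms against the composition terms; its default here is an assumption. Unlike the plain amino acid composition, this vector generally changes when the residues of a sequence are reordered.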
Ever since the concept of PseAA composition was introduced,
various PseAA composition approaches have been proposed to
deal with different problems in proteins and protein-related
systems (see, e.g., Chen et al., 2006a, b; Chen and Li, 2007a, b; Ding
et al., 2007; Du and Li, 2006; Fang et al., 2008; Gonzalez-Diaz
Journal of Theoretical Biology
0022-5193/$ - see front matter © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.jtbi.2008.06.016
* Corresponding author. Tel.: +86 13879809729; fax: +86 798 8499671.
E-mail address: xiaoxuan0326@yahoo.com.cn (X. Xiao).
Journal of Theoretical Biology 254 (2008) 691–696