redundant proteins using 40% sequence identity as the cut-off in order to obtain a
non-redundant negative data set. After filtering, we obtained 154 qualified full-length
yeast proteins as the negative data set.
To discover novel Rsp5 substrates in the yeast proteome, we also collected 3450
proteins as the pool to be screened by our method (Belle et al., 2006).
2.2 Informative features
The interaction between an enzyme and its substrate is determined in part by structure
and sequence, so the amino acid sequence is the basis for investigating Rsp5 substrates
(Jaakkola et al., 2000; Liao and Noble, 2002; Saigo et al., 2004). We therefore analysed
the amino acid composition and distribution of each protein sequence.
We examined a number of features, computed from protein sequences and secondary
structures, that are possibly relevant to the recognition of Rsp5 substrates. Some
features were included because they are known to be relevant to substrates of Rsp5, while
others were included because of their statistical relevance to our classification problem.
First, we calculated the frequencies of monopeptides (single amino acids) and dipeptides,
each normalised by the sequence length. This yielded 420 amino acid composition features
(20 monopeptide and 400 dipeptide frequencies). Because polypeptide sequences vary during
evolution, analysing the composition of grouped amino acids is more appropriate. We
divided the amino acids into six groups according to their physical and chemical
properties: class a (I, V, L, M), class b (F, Y, W), class c (H, K, R), class d (D, E),
class e (Q, N, T, P) and class f (A, C, G, S). We then analysed the composition (C) of
grouped monopeptides, dipeptides and tripeptides.
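The composition features described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, dividing every k-mer count by the full sequence length is a literal reading of "normalised by the sequence length", and computing tripeptides over the six-letter class alphabet (rather than over the 20 amino acids) is an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Six physicochemical classes exactly as listed in the text.
GROUPS = {
    "a": "IVLM",
    "b": "FYW",
    "c": "HKR",
    "d": "DE",
    "e": "QNTP",
    "f": "ACGS",
}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def kmer_composition(seq, alphabet, k):
    """Count every k-mer over `alphabet` and divide by len(seq) (assumed normalisation)."""
    if not seq:
        raise ValueError("empty sequence")
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with non-standard symbols are skipped
            counts[kmer] += 1
    return {kmer: n / len(seq) for kmer, n in counts.items()}


def to_groups(seq):
    """Map each residue to its class letter; unknown residues are dropped."""
    return "".join(RESIDUE_TO_GROUP[aa] for aa in seq if aa in RESIDUE_TO_GROUP)


def composition_features(seq):
    """420 amino acid features plus 6 + 36 + 216 grouped features."""
    features = {}
    for k in (1, 2):
        features.update(kmer_composition(seq, AMINO_ACIDS, k))
    grouped = to_groups(seq)
    for k in (1, 2, 3):
        for kmer, value in kmer_composition(grouped, "abcdef", k).items():
            features["g_" + kmer] = value  # prefix marks grouped features
    return features
```

Calling `composition_features("MKVLA")` returns a dictionary whose upper-case keys are the 420 monopeptide and dipeptide frequencies and whose `g_`-prefixed keys are the grouped frequencies.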
Besides the amino acid composition descriptors, the transition (T) and distribution (D)
of amino acid groups were also used to describe the global composition of the amino acid
groups: T denotes the relative frequency of changes between amino acid groups along
the protein sequence, and D denotes the chain length within which the first 25%, 50%,
75% and 100% of the amino acids of a particular group are located (Dubchak et al.,
1995; Cai et al., 2003; Cui et al., 2007). We also included some general features, such as
protein length, hydrophobicity value, sulphur content, isoelectric point, signal peptide
and N-terminal amino acid.
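The T and D descriptors can be illustrated with the following sketch. It assumes conventions that the text does not spell out: T is computed for each unordered pair of classes and divided by the number of adjacent residue pairs, and D is reported as the position of the relevant residue divided by the sequence length. The function names and feature keys are hypothetical.

```python
from itertools import combinations
from math import ceil

GROUPS = {"a": "IVLM", "b": "FYW", "c": "HKR", "d": "DE", "e": "QNTP", "f": "ACGS"}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def classes_of(seq):
    """Class letter of every recognised residue, in sequence order."""
    return [RESIDUE_TO_GROUP[aa] for aa in seq if aa in RESIDUE_TO_GROUP]


def transition(seq):
    """Frequency of adjacent residues that switch between two classes (15 pairs)."""
    classes = classes_of(seq)
    counts = {x + y: 0 for x, y in combinations("abcdef", 2)}
    for x, y in zip(classes, classes[1:]):
        if x != y:
            counts["".join(sorted((x, y)))] += 1  # a->c and c->a share one key
    steps = max(len(classes) - 1, 1)  # assumed normaliser: adjacent pairs
    return {"T_" + pair: n / steps for pair, n in counts.items()}


def distribution(seq, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Relative chain position by which each fraction of a class has occurred."""
    classes = classes_of(seq)
    features = {}
    for g in "abcdef":
        positions = [i for i, c in enumerate(classes, start=1) if c == g]
        for f in fractions:
            key = f"D_{g}_{int(f * 100)}"
            if positions:
                idx = max(ceil(f * len(positions)), 1) - 1
                features[key] = positions[idx] / len(classes)
            else:
                features[key] = 0.0  # class absent from the sequence
    return features
```

For the toy sequence `"MKVLA"` (classes a, c, a, a, f), two of the four adjacent pairs switch between classes a and c, so `transition("MKVLA")["T_ac"]` is 0.5.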
We also took the Low-Complexity Region (LCR) into account as an important feature.
LCRs in protein sequences are regions with little diversity in their amino acid
composition. Using the program SEG, we determined the number of LCRs, the length of the
longest LCR and the total length of LCRs in every sequence, in order to investigate their
implications for protein stability (Wootton, 1994). These three LCR features were
included in the initial list.
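Once LCRs have been located, the three features reduce to simple interval arithmetic. The sketch below assumes the SEG output has already been parsed into non-overlapping `(start, end)` pairs with 1-based inclusive coordinates; the parsing step is omitted, and the function name and keys are hypothetical.

```python
def lcr_features(segments):
    """Summarise low-complexity segments given as (start, end) pairs.

    Coordinates are assumed to be 1-based and inclusive, and the
    segments are assumed not to overlap.
    """
    lengths = [end - start + 1 for start, end in segments]
    return {
        "lcr_count": len(lengths),
        "lcr_max_length": max(lengths, default=0),  # 0 when no LCR is found
        "lcr_total_length": sum(lengths),
    }
```

A protein with segments at residues 5 to 16 and 40 to 47 has two LCRs of lengths 12 and 8, so the longest is 12 and the total is 20.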
Functional proteins with partly disordered structures are highly abundant in nature, and
disordered proteins are particularly widespread in eukaryotic proteomes. We therefore
used four features to characterise disordered regions: the number of disordered regions,
the total length of disordered regions, the length of the longest disordered region and
the average score of the disordered regions. Disordered regions were predicted with
IUPred (Dosztanyi et al., 2005).
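The four disorder features can be derived from a per-residue score profile such as the one IUPred produces. The following sketch assumes that a residue counts as disordered when its score exceeds 0.5, that a region is a maximal run of such residues, and that the "average score" is the mean over all residues inside disordered regions; the text confirms none of these details, and the function name and the `min_length` parameter are hypothetical.

```python
def disorder_features(scores, threshold=0.5, min_length=1):
    """Summarise disordered regions in a list of per-residue scores."""
    regions = []  # half-open (start, end) index pairs
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i  # a new disordered run begins
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:  # run extends to the end of the sequence
        regions.append((start, len(scores)))

    regions = [(a, b) for a, b in regions if b - a >= min_length]
    lengths = [b - a for a, b in regions]
    in_region = [s for a, b in regions for s in scores[a:b]]
    return {
        "n_regions": len(regions),
        "total_length": sum(lengths),
        "max_length": max(lengths, default=0),
        "mean_score": sum(in_region) / len(in_region) if in_region else 0.0,
    }
```

Raising `min_length` discards very short runs, which may be useful if isolated high-scoring residues should not count as regions.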
We took the presence of a transmembrane region and the total length of transmembrane
regions in the protein as two features in the initial list. Transmembrane regions were
identified with SMART (Letunic et al., 2009).