automated procedure is completed, a careful manual check is
performed to eliminate possible false positives, which can
occur for entries with the commonly used crystallization
additives. This curation gives high confidence that the
ligand-protein interactions collected from PDB are
biologically relevant. Details for constructing BioLip can
be found in [45].
To evaluate the effectiveness of the proposed TargetS, we
therefore constructed training data sets and independent
validation data sets from BioLip [45] rather than directly
from PDB. Twelve different types of ligands, i.e., five types of
metal ions, five types of nucleotides, DNA, and HEME,
were considered in this study. For each of the 12 types of
the considered ligands, we constructed its training data set
and independent validation data set as follows:
Training data sets. From BioLip, we extracted all the protein
sequences that interact with the given ligand and were
released into PDB before 10 March 2010. The maximal
pairwise sequence identity of the extracted protein sequences
was then culled to 40 percent with the PISCES software [49];
the resulting sequences constitute the training data set for
that ligand.
Independent validation data sets. From BioLip, we extracted
all the protein sequences that interact with the ligand and
were deposited into PDB after 10 March 2010. Again, the
maximal pairwise sequence identity of the extracted protein
sequences was reduced to 40 percent, and the resulting
sequences constitute the validation data set. Moreover, if a
given sequence in the validation data set shared >40%
identity with a sequence in the training data set, it was
removed from the validation data set. This ensures that the
sequences in the validation data set are independent of those
in the training data set. Table 1 summarizes the detailed
compositions of the training data sets and the independent
validation data sets for the 12 types of ligands.
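
The construction procedure above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the pipeline used in this study: the names Entry, toy_identity, cull, and build_data_sets are hypothetical; toy_identity is a placeholder that compares residues position by position without alignment; and the greedy culling loop merely stands in for PISCES [49], which may select a different subset. How entries dated exactly 10 March 2010 are handled is not stated in the text, so the sketch leaves them out of both sets.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Tuple

CUTOFF_DATE = date(2010, 3, 10)   # split date given in the text
MAX_IDENTITY = 0.40               # 40 percent identity threshold


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one ligand-binding protein sequence."""
    seq_id: str
    sequence: str
    released: date


def toy_identity(a: str, b: str) -> float:
    """Placeholder identity (assumption): matching positions divided by
    the shorter length, with no alignment. A real pipeline would use an
    alignment-based identity."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def cull(entries: List[Entry],
         identity: Callable[[str, str], float],
         max_identity: float = MAX_IDENTITY) -> List[Entry]:
    """Greedy stand-in for PISCES: keep an entry only if it shares at most
    max_identity with every entry already kept."""
    kept: List[Entry] = []
    for e in entries:
        if all(identity(e.sequence, k.sequence) <= max_identity for k in kept):
            kept.append(e)
    return kept


def build_data_sets(entries: List[Entry],
                    identity: Callable[[str, str], float] = toy_identity
                    ) -> Tuple[List[Entry], List[Entry]]:
    """Return (training, validation) sets for one ligand type."""
    # Temporal split on the PDB release date.
    train = cull([e for e in entries if e.released < CUTOFF_DATE], identity)
    valid = cull([e for e in entries if e.released > CUTOFF_DATE], identity)
    # Remove validation sequences sharing >40% identity with any training one.
    valid = [v for v in valid
             if all(identity(v.sequence, t.sequence) <= MAX_IDENTITY
                    for t in train)]
    return train, valid
```

Calling build_data_sets once per ligand type, on the entries that bind that ligand, would yield one training set and one independent validation set per ligand, mirroring the layout of Table 1.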
To further demonstrate the effectiveness of the proposed
TargetS, the CASP9 data set was used for a blind test. The ninth
community-wide critical assessment of techniques for
protein structure prediction (CASP9) released 129 target
protein sequences for blind tests of protein structure and
function prediction methods. Among the 129 sequences,
31 were used for evaluating ligand binding-site
predictions, where the predictors were asked to identify
ligand binding residues in the sequences. As one sequence
(Target ID: T0533) was canceled on 26 May 2010, the
remaining 30 sequences were taken as the targets for
our consideration.
It has not escaped our notice that the percentages of
binding residues in the training and validation data sets for a
given ligand are different. However, this difference does not
affect the objective evaluation of the proposed method, as
we performed both cross-validation on the training data set
and an independent test on the validation data set. The
purpose of cross-validation is to evaluate the overall
performance of the proposed method on a given data set,
whereas the independent test is used to evaluate its
generalization capability; this practice has been widely
accepted in this field.
2.2 Feature Extraction
2.2.1 Position Specific Scoring Matrix Feature
The position specific scoring matrix (PSSM) encodes the
evolutionary information of a protein sequence well. Many
previous studies have shown its strong discriminative
capability for many prediction problems in bioinformatics,
such as protein function prediction [50], protein-ATP
binding sites prediction [51], transmembrane helices predic-
tion [52], protein secondary structure prediction [53],
subcellular localization prediction [54], [55], [56], and so on.
The position specific scoring matrix for a protein sequence
is built by using PSI-BLAST [57] to search the Swiss-Prot
database through three iterations, with 0.001 as the e-value
cutoff for multiple sequence alignment against the query
996 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 4, JULY/AUGUST 2013
TABLE 1
Composition of the Training Data Sets and the Independent Validation Data Sets for the 12 Types of Ligands