BioMed Research International 3
correlation of RNA-binding residues in the query protein,
respectively. Furthermore, the BP(2) formula represents the
relevance of two RNA-binding residues at different distances
from 1 to N − 1 and takes into account the fact that the
correlation between two residues is smaller when the distance
between them is larger, which supports the rationality of
the definition.
We also defined two nonbinding propensities for nonbinding
proteins. The definitions of NBP(1) and NBP(2) are
similar to the definitions of BP(1) and BP(2). Consider
\[
\mathrm{NBP}(1) = \frac{\sum_{i=1}^{n} \mathrm{RI}(i)}{10N}, \tag{3}
\]
where N and n are the number of amino acids and the number
of nonbinding residues in this protein, respectively. RI(i) is
the reliability index of the prediction result of the ith
nonbinding residue obtained from PRBR. Consider
\[
\mathrm{NBP}(2) = \frac{\sum_{i=1}^{N-1} 2^{-i+1} \sum_{k=1}^{n(i)} \overline{\mathrm{RI}}(k)}{10\,(N-1)}, \tag{4}
\]
where N is the number of amino acids in this protein, n(i)
is the number of pairs of nonbinding residues at a distance i,
and \overline{RI}(k) is the average value of the reliability
index for nonbinding residue k and nonbinding residue k + i.
NBP(1) and NBP(2) describe the appearance and the
correlation of nonbinding residues in the query protein,
respectively, in the same way as BP(1) and BP(2). The
reliability index is used here as well because these formulas
rely on the predicted nonbinding residues.
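As an illustration, the two propensities can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes that the normalizers in Equations (3) and (4) are 10N and 10(N − 1), that residue positions are distinct integers, and that the reliability indices come from an external predictor such as PRBR. The function and parameter names are hypothetical.

```python
from typing import Sequence


def propensity_1(ri: Sequence[float], n_residues: int) -> float:
    """Equation (3) under the stated reading: sum of RI values over 10 * N.

    ri         -- reliability index of each predicted nonbinding residue
    n_residues -- N, the number of amino acids in the protein
    """
    return sum(ri) / (10 * n_residues)


def propensity_2(positions: Sequence[int], ri: Sequence[float],
                 n_residues: int) -> float:
    """Equation (4) under the stated reading.

    Every pair of predicted residues at distance d contributes
    2 ** (-d + 1) times the mean of the two reliability indices.
    Summing over all pairs is the same as summing over distances
    i = 1..N-1 and then over the n(i) pairs at each distance.
    """
    if n_residues < 2:
        return 0.0  # no pair can exist in a protein of length 0 or 1
    total = 0.0
    for a in range(len(positions)):
        for b in range(a + 1, len(positions)):
            distance = abs(positions[b] - positions[a])
            mean_ri = (ri[a] + ri[b]) / 2
            total += 2.0 ** (-distance + 1) * mean_ri
    return total / (10 * (n_residues - 1))


# Hypothetical example: three predicted residues in a 5-residue protein.
example_positions = [1, 2, 4]
example_ri = [8, 6, 4]
nbp1 = propensity_1(example_ri, 5)
nbp2 = propensity_2(example_positions, example_ri, 5)
```

The same two functions apply to BP(1) and BP(2) if the predicted RNA-binding residues and their reliability indices are passed in instead, since the definitions are stated to be similar.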
2.2.2. Evolutionary Information Combined with Physicochemical
Properties (EIPP). Evolutionary information in the form
of a position-specific scoring matrix (PSSM) has been used
successfully to represent proteins in many applications, such
as prediction of DNA-binding residues [16–21] and
RNA-binding residues [15, 22, 23]. Here, PSSM profiles were
generated using the PSI-BLAST program [24] to search the
nonredundant (NR) database through three iterations, with
0.001 as the E-value cutoff for multiple sequence alignment.
The PSSM has 20 × N elements, where N is
the length of the protein. However, different proteins may have
different numbers of amino acids. Therefore, the PSSM could
not be used directly as a feature in the prediction work because
the machine learning methods require the input feature
to have a fixed length. We therefore generated a PSSM-400,
a vector of dimension 400, from the PSSM.
PSSM-400 is the composition of occurrences of each type of
amino acid corresponding to each type of amino acid in the
sequence. We pooled all rows of the PSSM that belonged to the
same amino acid type to form a new matrix, which gives one new
matrix per amino acid type. We then added all the normalized
values in each column of each new matrix.
Each new matrix therefore yields a 20-dimensional vector, and
the 20 vectors together form PSSM-400.
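The pooling step can be sketched as follows. This is a minimal illustration under several assumptions: the PSSM rows are already normalized (the normalization function is not specified in the text), the columns follow the standard PSI-BLAST residue order, and residues outside the 20 standard types are skipped. The names pssm_400 and AMINO_ACIDS are hypothetical.

```python
from typing import List, Sequence

# Column order used in PSI-BLAST PSSM output.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_400(sequence: str, pssm: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pool PSSM rows by residue type and sum each column.

    sequence -- the protein sequence, one letter per residue
    pssm     -- one row of 20 normalized scores per residue
    Returns a 20 x 20 matrix: row r holds the column sums over all
    sequence positions whose residue is AMINO_ACIDS[r].
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    row_of = {aa: r for r, aa in enumerate(AMINO_ACIDS)}
    pooled = [[0.0] * 20 for _ in range(20)]
    for residue, scores in zip(sequence, pssm):
        r = row_of.get(residue)
        if r is None:
            continue  # assumption: ignore non-standard residues such as X
        for c, value in enumerate(scores):
            pooled[r][c] += value
    return pooled


# Hypothetical toy input: a three-residue sequence with constant rows.
toy = pssm_400("AAR", [[0.1] * 20, [0.2] * 20, [0.5] * 20])
flat = [value for row in toy for value in row]  # the 400-dimensional vector
```

Whatever the length of the input protein, the result always has 20 × 20 = 400 entries, which is what makes it usable as a fixed-length feature.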
The physicochemical property feature has been used
effectively in many fields, such as the identification of
DNA/RNA-binding proteins [7, 9, 14, 25, 26] and the
identification and prediction of protein-protein interactions [27].
Thus, an EIPP was generated by merging the 20 amino acid
columns of the PSSM-400 into a single column containing
the information for a certain physicochemical property. Six
physicochemical properties that we used successfully in
previous work [15] were considered for combining with
PSSM-400 to generate the EIPP: the pKa values of the amino group,
the pKa values of the carboxyl group [28], the molecular mass
[6], the lowest free energy [29], the Balaban index [30], and
the Wiener index [31]. The entry E_{ak} of the kth type of amino
acid in a protein sequence for a certain physicochemical property
in EIPP was calculated with
\[
E_{ak} = \sum_{i=1}^{20} P_a(i)\, M_k(i), \tag{5}
\]
where a is the index of a certain physicochemical property,
k is the index of the type of amino acid in the query protein
sequence, i is the index of the type of naïve amino acid,
M_k(i) is the normalized value of the ith type of naïve amino
acid for the kth type of amino acid in the protein sequence in
the PSSM-400, and P_a(i) is the normalized value of
physicochemical property a for the ith type of amino acid.
Therefore, the vector size of the EIPP feature is 6 × 20 = 120.
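Equation (5) is a weighted sum over the 20 columns of each PSSM-400 row, which can be sketched as below. This is a hedged illustration only: the function name eipp is hypothetical, the inputs are assumed to be normalized already, and the property tables themselves (pKa values, molecular mass, and so on) are not reproduced here.

```python
from typing import List, Sequence


def eipp(pssm400: Sequence[Sequence[float]],
         properties: Sequence[Sequence[float]]) -> List[float]:
    """Merge the 20 columns of each PSSM-400 row into one value per property.

    pssm400    -- 20 rows (amino acid type k in the sequence) by
                  20 columns (amino acid type i), normalized values
    properties -- one list of 20 normalized values per physicochemical
                  property a, in the same column order as pssm400
    Returns a flat list with len(properties) * 20 entries, ordered by
    property first and by amino acid type second.
    """
    if len(pssm400) != 20 or any(len(row) != 20 for row in pssm400):
        raise ValueError("pssm400 must be a 20 x 20 matrix")
    if any(len(prop) != 20 for prop in properties):
        raise ValueError("each property needs 20 values")
    return [
        sum(prop[i] * row[i] for i in range(20))
        for prop in properties
        for row in pssm400
    ]


# Hypothetical toy input: an identity PSSM-400 and two made-up properties.
identity = [[1.0 if i == k else 0.0 for i in range(20)] for k in range(20)]
toy_properties = [[float(i) for i in range(20)], [1.0] * 20]
toy_eipp = eipp(identity, toy_properties)
```

With the six properties listed in the text, the returned list has 6 × 20 = 120 entries.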
2.2.3. Conjoint Triad (CT). Electrostatic and hydrophobic
interactions influence protein-nucleic acid interactions and
may be reflected by the dipoles and volumes of the side
chains of amino acids, respectively. Based on the dipoles
and volumes of the side chains, the 20 kinds of amino acids
can be clustered into seven classes [32]. Considering that
disulfide bonds have no special effect on protein-nucleic acid
interactions, cysteine, the only amino acid in the seventh
class, was moved into the third class in this study. Therefore,
the 20 kinds of amino acids were clustered into six classes as
follows. Class a: Ala, Gly, and Val; Class b: Ile, Leu, Phe, and
Pro; Class c: Tyr, Met, Thr, Ser, and Cys; Class d: His, Asn,
Gln, and Trp; Class e: Arg and Lys; and Class f: Asp and Glu.
Following the feature construction method used
in [32], a protein is described by the conjoint triad feature
with 6 × 6 × 6 = 216 dimensions, where each component
of the feature vector is the frequency of the
corresponding triad.
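The conjoint triad encoding with the six classes listed above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "frequency" means the count of a triad divided by the number of valid three-residue windows (raw counts or the min-max scaling used in some conjoint triad implementations would be equally plausible readings), and it skips windows containing non-standard residues. All names are hypothetical.

```python
from itertools import product
from typing import List

# The six classes from the text, in one-letter amino acid codes.
CLASSES = {
    "a": "AGV",
    "b": "ILFP",
    "c": "YMTSC",
    "d": "HNQW",
    "e": "RK",
    "f": "DE",
}
AA_TO_CLASS = {aa: label for label, members in CLASSES.items() for aa in members}
# All 6 * 6 * 6 = 216 class triads in a fixed order, "aaa" through "fff".
TRIADS = ["".join(t) for t in product("abcdef", repeat=3)]


def conjoint_triad(sequence: str) -> List[float]:
    """Return the 216-dimensional triad frequency vector of a sequence."""
    encoded = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    counts = dict.fromkeys(TRIADS, 0)
    total = 0
    for start in range(len(encoded) - 2):
        window = encoded[start:start + 3]
        if None in window:
            continue  # assumption: skip windows with non-standard residues
        counts["".join(window)] += 1
        total += 1
    # Assumption: relative frequency; an empty sequence gives all zeros.
    return [counts[t] / total if total else 0.0 for t in TRIADS]


# Hypothetical toy input: Ala-Arg-Asp maps to the single triad "aef".
toy_vector = conjoint_triad("ARD")
```

Because the classes replace the 20 amino acids, the vector length is fixed at 216 regardless of the protein length.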
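As described above, for each query protein, the total size of
the feature vector is 4 + 120 + 216 = 340, that is, the four
propensities BP(1), BP(2), NBP(1), and NBP(2), the 120 EIPP
values, and the 216 CT values.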
2.3. Algorithms to Classify and Measure a Classifier’s
Performance. The random forest (RF) algorithm [12] is a
classification algorithm that uses an ensemble of tree-structured
classifiers. It has been used successfully in many applications
for data classification and achieves high performance.
The random forest R package [33] was used to implement the
RF algorithm.
To evaluate the performance of the classifier, a 5-fold
cross-validation procedure for the training dataset was used
in this research. During the procedure, we randomly divided
the data instances into five parts. Four of these parts were
input into the RF to establish a model for classification,