Scientific REPORTS | (2018) 8:15264 | DOI:10.1038/s41598-018-33654-x
www.nature.com/scientificreports
Deep-RBPPred: Predicting RNA binding proteins in the proteome scale based on deep learning
Jinfang Zheng, Xiaoli Zhang, Xunyi Zhao, Xiaoxue Tong, Xu Hong, Juan Xie & Shiyong Liu
RNA binding proteins (RBPs) play important roles in cellular processes. Identifying RBPs by both
computation and experiment is essential. Recently, our group proposed an RBP predictor, RBPPred.
However, RBPPred is slow because it needs to generate a position-specific scoring matrix (PSSM) as
a feature. Herein, based on the protein features of RBPPred and a Convolutional Neural Network (CNN),
we develop a deep learning model called Deep-RBPPred. Training on balanced and imbalanced training
sets yields the Deep-RBPPred-balance and Deep-RBPPred-imbalance models, respectively. Deep-RBPPred has
three advantages compared to previous methods. (1) Deep-RBPPred needs only a few physicochemical
properties derived from protein sequences. (2) Deep-RBPPred runs much faster. (3) Deep-RBPPred has
good generalization ability. At the same time, Deep-RBPPred is still as good as the state-of-the-art
method. When tested on the A. thaliana, S. cerevisiae and H. sapiens proteomes, the MCC values are
0.82 (0.82), 0.65 (0.69) and 0.85 (0.80), respectively, for the balance model (imbalance model) when
the score cutoff is set to 0.5. On the same testing datasets, different machine learning algorithms
(CNN and SVM) are also compared. The results show that the CNN-based model can identify more RBPs than
the SVM-based model. Comparing the balance and imbalance models, both the CNN-based and SVM-based
models tend to favor the majority class in the imbalanced set. Deep-RBPPred predicts 280 (balance
model) and 265 (imbalance model) of 299 new RBPs. The sensitivity of the balance model is about 7%
higher than that of the state-of-the-art method. We also apply Deep-RBPPred to 30 eukaryotic and 109
bacterial proteomes downloaded from UniProt to estimate all possible RBPs. The estimates show that the
proportions of RBPs in eukaryotic proteomes are much higher than in bacterial proteomes.
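The two metrics quoted above can be illustrated with a minimal sketch. It assumes the standard definitions of sensitivity and the Matthews correlation coefficient (MCC). The function names are hypothetical and are not taken from the Deep-RBPPred code. The sensitivity calls use the 280 and 265 hits out of 299 new RBPs reported above, whereas the confusion-matrix counts passed to `mcc` are invented purely for illustration and are not results from this work.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true RBPs that are recovered: TP / (TP + FN)."""
    return tp / (tp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Counts from the abstract: 280 (balance) and 265 (imbalance) of 299 new RBPs.
print(round(sensitivity(280, 299 - 280), 3))  # 0.936
print(round(sensitivity(265, 299 - 265), 3))  # 0.886

# Hypothetical counts, for illustration only (not data from the paper).
print(round(mcc(tp=90, tn=90, fp=10, fn=10), 3))  # 0.8
```

Because the 299 new RBPs are all positives, only sensitivity can be computed from those counts. MCC additionally requires the numbers of true and false negatives, which is why it is reported on the proteome-level test sets.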
RNA binding proteins (RBPs) play important roles in many cellular processes, such as post-transcriptional
gene regulation, RNA subcellular localization and alternative splicing. Given their biological significance, many
high-throughput experimental techniques have been developed to identify new RBPs in human, mouse, S. cerevisiae
and C. elegans [1–10]. After RBPs have been identified, CLIP-related experimental technologies [11–14] are applied
to reveal their binding sites in RNAs. In addition, many computational methods have been proposed to predict
protein-RNA interactions [15–18] and RBPs [19–25]. RBP predictors can predict the RBPs, and CLIP-related techniques
can then reveal the RNAs interacting with these RBPs. However, previous computational methods considered
only a subset of features or known RNA binding domains (RBDs), which play a significant role in RBP prediction.
We therefore proposed RBPPred, which integrates as many features as possible to address this problem [22].
Benchmarking on datasets shows that RBPPred is better than other approaches. But RBPPred runs slowly because it
must run BLAST against the huge NR protein database to generate the PSSM. Prediction speed is important because
a large number of RBPs are still unknown in many species. To overcome this shortcoming, we present Deep-RBPPred,
which is based on deep learning.
In recent years, deep learning has been used in many areas of bioinformatics and has proved to be a
powerful tool [26–32]. For predicting protein binding sites in RNA sequences, DeepBind [32] is the first CNN-based
model to predict binding affinity. Deep-rbp [29] and iDeep [30,31] are two deep learning methods that both take RNA
structure into consideration. These methods outperform conventional approaches in terms of prediction accuracy.
However, deep learning has not yet been applied to RBP prediction. In Deep-RBPPred, we apply a deep
convolutional neural network instead of an SVM. Since a CNN-based method requires a fixed-length feature
vector as input, two solutions can meet this requirement. The first solution is to pad all the sequences to a fixed
length and then encode them with one-hot encoding. The second solution is to design
School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China. Correspondence
and requests for materials should be addressed to S.L. (email: liushiyong@gmail.com)
Received: 17 May 2018
Accepted: 28 September 2018
Published: xx xx xxxx