Knowledge-Based Systems 118 (2017) 115–123
Contents lists available at ScienceDirect
Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
Protein secondary structure prediction by using deep learning method
Yangxu Wang, Hua Mao∗, Zhang Yi
Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, People’s Republic of China
Article info
Article history:
Received 9 March 2016
Revised 16 November 2016
Accepted 16 November 2016
Available online 17 November 2016
Keywords:
Deep learning
Secondary structure prediction
Encoder–decoder networks
Recurrent neural networks
Abstract
The prediction of protein structures directly from amino acid sequences is one of the biggest challenges in computational biology. It can be divided into several independent sub-problems, of which protein secondary structure (SS) prediction is fundamental. Many computational methods have been proposed for the SS prediction problem. Few of them model well both the sequence-structure mapping between input protein features and SS and the interactions among residues, although both are important for SS prediction. In this paper, we propose deep recurrent encoder–decoder networks, called Secondary Structure Recurrent Encoder–Decoder Networks (SSREDNs), to solve the SS prediction problem. A deep architecture and recurrent structures are employed in the SSREDNs to model both the complex nonlinear mapping between input protein features and SS and the mutual interaction among consecutive residues of the protein chain. A series of techniques is also used to refine the model's performance. The proposed model is applied to the open datasets CullPDB and CB513. Experimental results demonstrate that our method improves both Q3 and Q8 accuracy compared with some publicly available methods. For the Q8 prediction problem, it achieves 68.20% and 73.1% accuracy on the CB513 and CullPDB datasets, respectively, in fewer epochs, which is better than the previous state-of-the-art method.
©2016 Elsevier B.V. All rights reserved.
1. Introduction
Discovering the structures and biological functions of proteins is very important for understanding their biological processes and for tasks such as the study of protein-protein interactions [1], protein complex identification [2] and protein structure prediction. Protein structure prediction, which elucidates the complex relationship between a protein sequence and its structure, is one of the most important challenges in computational biology [3]. The most elemental task of protein structure prediction is protein secondary structure (SS) prediction, which aims to determine the structural state of each amino acid. SS represents the local conformation of the polypeptide backbone of proteins and provides a bridge that links the primary sequence and the tertiary structure, which is very helpful for many structural and functional analysis tools [4,5].
∗ Corresponding author.
E-mail addresses: mellowxu@gmail.com (Y. Wang), huamao@scu.edu.cn (H. Mao), zhangyi@scu.edu.cn (Z. Yi).
Typically, protein secondary structures are either divided into three states (α-helix (H), β-strand (E) and coil region (C)) or further classified into eight fine-grained states (3₁₀-helix (G), α-helix (H), π-helix (I), β-strand (E), β-bridge (B), β-turn (T), high curvature regions (S) and irregular loop (L)). SS prediction is usually evaluated by Q3 and Q8 accuracy, which measure the percentage of residues for which the 3-state or 8-state SS is correctly predicted. Extensive research effort has been spent on applying computational methods to the Q3 prediction problem, but very little on the more challenging Q8 prediction problem.
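The Q3 and Q8 measures described above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The 8-to-3 state reduction used here (G, H, I to H; E, B to E; T, S, L to C) is an assumption based on a commonly used convention, since this section does not state which reduction the paper adopts. The function names and the example label strings are hypothetical.

```python
from typing import Dict

# Assumed 8-to-3 state reduction (a common convention, not confirmed by
# this section of the paper).
EIGHT_TO_THREE: Dict[str, str] = {
    "G": "H", "H": "H", "I": "H",  # helix-like states -> H
    "E": "E", "B": "E",            # strand and bridge -> E
    "T": "C", "S": "C", "L": "C",  # turn, bend, loop -> C
}


def reduce_to_three(ss8: str) -> str:
    """Map a string of 8-state labels to 3-state labels."""
    return "".join(EIGHT_TO_THREE[state] for state in ss8)


def q_accuracy(true_ss: str, pred_ss: str) -> float:
    """Percentage of residues whose predicted state equals the true state."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("label strings must have the same length")
    if not true_ss:
        raise ValueError("label strings must not be empty")
    correct = sum(t == p for t, p in zip(true_ss, pred_ss))
    return 100.0 * correct / len(true_ss)


# Made-up labels for a 14-residue chain (illustration only).
true8 = "LHHHHGGTEEEBSL"
pred8 = "LHHHHHHTEEEESL"

q8 = q_accuracy(true8, pred8)
q3 = q_accuracy(reduce_to_three(true8), reduce_to_three(pred8))
print(f"Q8 = {q8:.2f}%, Q3 = {q3:.2f}%")
```

In this made-up example, the three errors (G predicted as H twice, B predicted as E) count against Q8 but vanish after the reduction to three states. This is one way to see why Q8 is the harder measure.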
Hidden Markov models (HMMs) have been applied to the 3-state SS prediction problem [6]. Although an HMM can describe the interactions among adjacent residues, it is very challenging for an HMM to model the complex nonlinear relationship between input protein features and SS. Support vector machines (SVMs) [7] can deal with this complex nonlinear mapping, but it is challenging for an SVM to take into consideration the interactions among adjacent residues. To the best of our knowledge, the best Q3 accuracy so far is about 80%, obtained with a 2-stage neural network (NN) method [8]. For the Q8 prediction problem, existing methods [9,10] fail to provide promising results. The problem may be that most of these methods are shallow architectures: it is very difficult for a relatively shallow architecture to model well both the complex sequence-structure relationship between input protein features and SS and the mutual interactions among residues, yet both are important for SS prediction [10,11].
Nowadays, NNs with deep architectures, also called deep neural networks (DNNs), have become the most powerful machine learning techniques for pattern recognition [12,13]. With the ability of mapping unorganized low-level features into high-level latent data representations which are more suitable for a final classification prob-
http://dx.doi.org/10.1016/j.knosys.2016.11.015
0950-7051/© 2016 Elsevier B.V. All rights reserved.