RESEARCH ARTICLE
A hybrid strategy for comprehensive annotation of the protein
coding genes in prokaryotic genome
Jia-Feng Yu • Jing Guo • Qing-Bin Liu • Yue Hou • Ke Xiao • Qing-Li Chen • Ji-Hua Wang • Xiao Sun
Received: 20 July 2014 / Accepted: 23 December 2014 / Published online: 8 January 2015
© The Genetics Society of Korea and Springer-Science and Media 2015
Abstract Protein coding gene annotation errors in prokaryotic
genomes are accumulating continually in bioinformatics
databases, while the update rate of genome annotation
cannot keep up with the explosive increase of genome
sequences in most cases. Hence it is critical to
manually rectify the genome annotation errors. In this
paper, a hybrid strategy combining ab initio gene
prediction programs with re-annotation programs for
over-annotated genes is proposed for re-annotation of the
protein coding genes in prokaryotic genomes. Based on this
strategy, the protein coding genes in Geobacter sulfurreducens
PCA are comprehensively re-annotated. As a consequence,
16 hypothetical genes are re-annotated as non-coding
sequences and 104 missing genes are retrieved as protein
coding genes. Subsequent function analysis and sequence
analysis show that the predictions are reliable
and robust. Further application to other genomes shows that
this work can provide alternative tools for the
post-processing of prokaryotic genome annotations.
Keywords Protein coding genes • Re-annotation • Prokaryotic genome
Introduction
Nowadays, the number of genomic sequences is increasing
explosively, driven by the rapid development of sequencing
technologies, which provides unprecedented opportunities for
disclosing the secrets of life. By November 2013, 7407
genome sequencing projects had been completed, among
which almost 96 % (7096/7407) are from archaeal and
bacterial genomes (Liolios et al. 2010). Accurately
predicting the diverse genomic components has therefore
become one of the most important tasks in the post-genome
era (Kyrpides 2009; Petty 2010; Li et al. 2011;
Liao et al. 2012). Even though gene prediction in
prokaryotic genomes has been studied for more than 20 years,
a growing number of recent studies indicate that protein
coding gene annotation errors are a universal phenomenon in
public databases (Poptsova and Gogarten 2010; Bakke
et al. 2009; Pallejà et al. 2008; Kisand and Lettieri 2013;
Yu et al. 2014), including the problems of translational
start site (TSS) prediction (Gao et al. 2010), protein
coding gene over-annotation (Nagy et al. 2008; Luo et al.
2009; Chen et al. 2008; Yu and Sun 2010; Yu et al. 2012;
Electronic supplementary material The online version of this
article (doi:10.1007/s13258-014-0263-0) contains supplementary
material, which is available to authorized users.
J.-F. Yu • Q.-B. Liu • Q.-L. Chen • J.-H. Wang
Shandong Provincial Key Laboratory of Functional
Macromolecular Biophysics, Institute of Biophysics, Dezhou
University, Dezhou 253023, People’s Republic of China
J.-F. Yu (✉) • Y. Hou • K. Xiao • X. Sun (✉)
State Key Laboratory of Bioelectronics, Southeast University,
Nanjing 210096, People’s Republic of China
e-mail: jfyu1979@126.com
X. Sun
e-mail: xsun@seu.edu.cn
J.-F. Yu • J.-H. Wang
College of Physics and Electronic Information, Dezhou
University, Dezhou 253023, People’s Republic of China
J. Guo
School of Computer Engineering, Nanyang Technological
University, Singapore 639798, Singapore
Q.-B. Liu • Q.-L. Chen
College of Life Science, Shandong Normal University,
Jinan 250014, People’s Republic of China
Genes Genom (2015) 37:347–355
DOI 10.1007/s13258-014-0263-0