PROCEEDINGS Open Access
Protein localization prediction using random
walks on graphs
Xiaohua Xu*†, Lin Lu†, Ping He, Ling Chen
From The 2012 International Conference on Intelligent Computing (ICIC 2012)
Huangshan, China. 25-29 July 2012
Abstract
Background: Understanding the localization of proteins in cells is vital to characterizing their functions and
possible interactions. As a result, identifying the (sub)cellular compartment within which a protein is located
becomes an important problem in protein classification. This classification issue thus involves predicting labels in a
dataset with a limited number of labeled data points available. By utilizing a graph representation of protein data,
random walk techniques have performed well in sequence classification and functional prediction; however, this
method has not yet been applied to protein localization. Accordingly, we propose a novel classifier for predicting
the localization sites of proteins based on random walks on a graph.
Results: We propose a graph theory model for predicting protein localization using data generated in yeast and
gram-negative (Gneg) bacteria. We tested the performance of our classifier on the two datasets, optimizing the
model training parameters by varying the laziness values and the number of steps taken during the random walk.
Using 10-fold cross-validation, we achieved an accuracy above 61% for yeast data and about 93% for
gram-negative bacteria.
Conclusions: This study presents a new classifier derived from the random walk technique and applies this
classifier to investigate the cellular localization of proteins. The prediction accuracy and additional validation
demonstrate an improvement over previous methods, such as support vector machine (SVM)-based classifiers.
Background
Protein localization is a general term that refers to the
study of where proteins are located within the cell. In
many cases, proteins cannot perform their designated
function until they are transported to the proper location
at the appropriate time. Improper localization of proteins
can exert a significant impact on cellular processes or on
the entire organism. Therefore, a central issue for biologists
is to predict the (sub)cellular localization of proteins
[1-3], which has implications for the functions and interactions
[4,5] of proteins.
With the development of new approaches in computer
science, coupled with an improved dataset of proteins
with known localization, computational tools can now
provide fast and accurate localization predictions for
many organisms as an alternative to laboratory-based
methods. Therefore, many studies have begun to address
this issue. Soon after proposing a probabilistic classification
system for predicting the localization sites of 336 E. coli
proteins and 1484 yeast proteins [6], Paul Horton and Kenta
Nakai [7] compared their specifically designed probabilistic model
with three other classifiers on the same datasets: the
k-nearest-neighbor (kNN) classifier, the binary decision
tree classifier, and the naive Bayes classifier. The resulting
accuracy using stratified cross-validation showed that the
kNN classifier performed better than the other methods,
with an accuracy of approximately 60% for 10 yeast
classes and 86% for 8 E. coli classes.
Feng [8] presented an overview of the prediction of
protein subcellular localization, and in 2004, Donnes and
Hoglund [9] introduced past and current work on this
* Correspondence: arterx@gmail.com
† Contributed equally
Department of Computer Science, Yangzhou University, Yangzhou 225009,
China
Xu et al. BMC Bioinformatics 2013, 14(Suppl 8):S4
http://www.biomedcentral.com/1471-2105/14/S8/S4
© 2013 Xu et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.