employed to extract the most discriminative new feature subset. Finally, ELM is chosen as the weak learning machine and the ensemble ELM classifier is constructed using the vectors of the resulting feature subset as input. To evaluate the performance, the proposed method was applied to Saccharomyces cerevisiae PPI data. The experimental results show that our method achieved 87% prediction accuracy with 86.15% sensitivity at a precision of 87.59%. The prediction model was also assessed on the independent dataset of Escherichia coli PPIs and yielded 87.5% prediction accuracy, which further demonstrates the effectiveness of our method.
Results
In this section, we first discuss the biological datasets and evaluation strategies used in the performance comparisons. Next, we present results comparing the PCA-EELM method to a state-of-the-art classifier for predicting protein interaction pairs in yeast.
Generation of the data set
We evaluated the proposed method with the dataset of physical protein interactions from yeast used in the study of Guo et al. [9]. The PPI dataset was collected from the Saccharomyces cerevisiae core subset of the Database of Interacting Proteins (DIP), version DIP 20070219. After removing redundant protein pairs that contain a protein with fewer than 50 residues or have ≥40% sequence identity, the remaining 5594 protein pairs comprise the final positive dataset. The 5594 non-interacting protein pairs were generated from pairs of proteins whose sub-cellular localizations differ. The whole dataset therefore consists of 11188 protein pairs, half from the positive dataset and half from the negative dataset.
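The dataset construction described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' actual pipeline: the containers dip_pairs, sequences and localization, the helper seq_identity, and the reading of the ≥40% identity filter as identity between the two members of a pair are all assumptions made for the example.

import random

def build_datasets(dip_pairs, sequences, localization, seq_identity, seed=0):
    """Sketch of the positive/negative PPI dataset construction (assumed inputs)."""
    # Positive set: drop pairs containing a protein shorter than 50 residues,
    # or whose two members share >= 40% sequence identity (one possible reading).
    positives = [
        (a, b) for a, b in dip_pairs
        if len(sequences[a]) >= 50 and len(sequences[b]) >= 50
        and seq_identity(sequences[a], sequences[b]) < 0.40
    ]
    # Negative set: sample the same number of protein pairs whose sub-cellular
    # localizations differ and that are not known to interact.
    known = {frozenset(p) for p in dip_pairs}
    proteins = list(sequences)
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < len(positives):
        a, b = rng.sample(proteins, 2)
        if localization[a] != localization[b] and frozenset((a, b)) not in known:
            negatives.append((a, b))
    return positives, negatives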
Evaluation measures
To measure the performance of the proposed method, we adopted 5-fold cross-validation and four measures: the overall prediction accuracy (Accu.), sensitivity (Sens.), precision (Prec.) and Matthews correlation coefficient (MCC). They are defined as follows:
\[
\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}
\]

\[
\mathrm{SN} = \frac{TP}{TP + FN} \tag{2}
\]

\[
\mathrm{PE} = \frac{TP}{TP + FP} \tag{3}
\]

\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}} \tag{4}
\]
where true positive (TP) is the number of true PPIs that are predicted correctly; false negative (FN) is the number of true PPIs that are predicted to be non-interacting pairs; false positive (FP) is the number of true non-interacting pairs that are predicted to be PPIs; and true negative (TN) is the number of true non-interacting pairs that are predicted correctly. MCC denotes the Matthews correlation coefficient.
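For concreteness, Eqs. (1)-(4) can be computed directly from the confusion-matrix counts, for example in Python (a minimal sketch, not the original MATLAB code):

from math import sqrt

def evaluation_measures(tp, tn, fp, fn):
    """Accu., Sens., Prec. and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)                    # Eq. (1)
    sens = tp / (tp + fn)                                    # Eq. (2)
    prec = tp / (tp + fp)                                    # Eq. (3)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))       # Eq. (4)
    return acc, sens, prec, mcc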
Experimental setting
The proposed PCA-EELM protein interaction prediction method was implemented on the MATLAB platform. For ELM, the implementation by Zhu and Huang available from http://www.ntu.edu.sg/home/egbhuang was used. For SVM, the LIBSVM implementation, originally developed by Chang and Lin and available from http://www.csie.ntu.edu.tw/~cjlin/libsvm, was utilized. All the simulations were carried out on a computer with a 3.1 GHz 2-core CPU, 6 GB of memory and the Windows operating system.
All ELMs in the ensemble classifier had the same number of hidden-layer neurons but different random hidden-layer weights and output-layer weights. Ensemble ELM models were built via the stratified 5-fold cross-validation procedure by gradually increasing the number of hidden neurons from 20 to 300 in steps of 10. The best number of neurons was adopted to build the training model. The sigmoid activation function was used to compute the hidden-layer output matrix. The final model was an ensemble of 15 extreme learning machines, and the output of the ensemble ELM model was determined by combining the outputs of each individual ELM by majority voting. For SVM, the Radial Basis Function was chosen as the kernel function, and the optimized parameters (C, γ) were obtained with a grid-search approach.
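The ensemble construction can be illustrated with the short sketch below: a minimal single-hidden-layer ELM (random hidden weights, sigmoid activation, output weights obtained from the Moore-Penrose pseudo-inverse) and majority voting over 15 such learners. This is a simplified numpy illustration of the general idea, not the Zhu and Huang implementation used in the experiments; the class and parameter names are ours.

import numpy as np

class ELM:
    """Minimal single-hidden-layer extreme learning machine (sigmoid activation)."""
    def __init__(self, n_hidden, rng):
        self.n_hidden = n_hidden
        self.rng = rng

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):                       # y encoded as -1/+1
        d = X.shape[1]
        # Hidden-layer weights and biases are random and stay fixed after init.
        self.W = self.rng.uniform(-1.0, 1.0, size=(d, self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        # Output weights are the least-squares solution via the pseudo-inverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ y
        return self

    def predict(self, X):
        return np.sign(self._hidden(X) @ self.beta)

def ensemble_elm_predict(X_train, y_train, X_test, n_hidden=30, n_models=15, seed=0):
    """Train n_models ELMs with different random hidden weights; majority vote."""
    rng = np.random.default_rng(seed)
    votes = [ELM(n_hidden, rng).fit(X_train, y_train).predict(X_test)
             for _ in range(n_models)]
    return np.sign(np.sum(votes, axis=0))      # majority vote over -1/+1 outputs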
Prediction performance of PCA-EELM model
We evaluated the performance of the proposed PCA-EELM model using the DIP PPI data investigated in Guo et al. [9]. In order to evaluate the prediction ability of our ELM classifiers, we also implemented a Support Vector Machine (SVM) learning algorithm, which is widely regarded as a state-of-the-art classifier. We compared our ensemble ELM based recognition scheme against methods utilizing SVM with C = 8, g = 0.5 and l = 30. For the ensemble ELM and SVM classifiers, all of the input values were normalized to the range of [-1, 1].
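The [-1, 1] normalization can be realized, for instance, with a min-max transform fitted on the training split only; scikit-learn is used here purely for illustration, and X_train/X_test are placeholders:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data and reuse it for the test data,
# so that no information from the test split leaks into the scaling.
scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)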
To reduce the bias of the training and testing data, a 5-fold cross-validation technique is adopted. More specifically, the dataset is divided into 5 subsets, and the holdout method is repeated 5 times. Each time, four of the five subsets are put together as the training dataset, and the