employed to extract the most discriminative new feature subset. Finally, ELM is chosen as the weak learning machine and the ensemble ELM classifier is constructed using the vectors of the resulting feature subset as input. To evaluate the performance, the proposed method was applied to Saccharomyces cerevisiae PPI data. The experimental results show that our method achieved 87% prediction accuracy with 86.15% sensitivity at a precision of 87.59%. The prediction model was also assessed on the independent dataset of Escherichia coli PPIs and yielded 87.5% prediction accuracy, which further demonstrates the effectiveness of our method.
Results
In this section, we first discuss the biological datasets and evaluation strategies used in the performance comparisons. Next, we present results comparing the PCA-EELM method to a state-of-the-art classifier for predicting protein interaction pairs in yeast.
Generation of the data set
We evaluated the proposed method with the dataset of physical protein interactions from yeast used in the study of Guo et al. [9]. The PPI dataset was collected from the Saccharomyces cerevisiae core subset of the Database of Interacting Proteins (DIP), version DIP 20070219. After removing redundant protein pairs that contain a protein with fewer than 50 residues or have ≥40% sequence identity, the remaining 5594 protein pairs comprise the final positive dataset. The 5594 non-interacting protein pairs were generated from pairs of proteins whose sub-cellular localizations differ. The whole dataset therefore consists of 11188 protein pairs, half from the positive dataset and half from the negative dataset.
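The dataset construction described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' actual pipeline: the containers dip_pairs, sequences and localization, the helper seq_identity, and the reading of the ≥40% identity filter as identity between the two members of a pair are all assumptions made for the example.

import random

def build_datasets(dip_pairs, sequences, localization, seq_identity, seed=0):
    """Sketch of the positive/negative PPI dataset construction (assumed inputs)."""
    # Positive set: drop pairs containing a protein shorter than 50 residues,
    # or whose two members share >= 40% sequence identity (one possible reading).
    positives = [
        (a, b) for a, b in dip_pairs
        if len(sequences[a]) >= 50 and len(sequences[b]) >= 50
        and seq_identity(sequences[a], sequences[b]) < 0.40
    ]
    # Negative set: sample the same number of protein pairs whose sub-cellular
    # localizations differ and that are not known to interact.
    known = {frozenset(p) for p in dip_pairs}
    proteins = list(sequences)
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < len(positives):
        a, b = rng.sample(proteins, 2)
        if localization[a] != localization[b] and frozenset((a, b)) not in known:
            negatives.append((a, b))
    return positives, negatives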
Evaluation measures
To measure the performance of the proposed method, we adopted 5-fold cross-validation and four measures: the overall prediction accuracy (Accu.), sensitivity (Sens.), precision (Prec.) and Matthews correlation coefficient (MCC). They are defined as follows:
\[
\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}
\]

\[
\mathrm{SN} = \frac{TP}{TP + FN} \tag{2}
\]

\[
\mathrm{PE} = \frac{TP}{TP + FP} \tag{3}
\]

\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}} \tag{4}
\]
where true positive (TP) is the number of true PPIs that are predicted correctly; false negative (FN) is the number of true PPIs that are predicted to be non-interacting pairs; false positive (FP) is the number of true non-interacting pairs that are predicted to be PPIs; and true negative (TN) is the number of true non-interacting pairs that are predicted correctly. MCC denotes the Matthews correlation coefficient.
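For concreteness, Eqs. (1)-(4) can be computed directly from the confusion-matrix counts, for example in Python (a minimal sketch, not the original MATLAB code):

from math import sqrt

def evaluation_measures(tp, tn, fp, fn):
    """Accu., Sens., Prec. and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)                    # Eq. (1)
    sens = tp / (tp + fn)                                    # Eq. (2)
    prec = tp / (tp + fp)                                    # Eq. (3)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))       # Eq. (4)
    return acc, sens, prec, mcc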
Experimental setting
The proposed PCA-EELM protein interaction prediction method was implemented on the MATLAB platform. For ELM, the implementation by Zhu and Huang available from http://www.ntu.edu.sg/home/egbhuang was used. For SVM, the LIBSVM implementation, originally developed by Chang and Lin and available from http://www.csie.ntu.edu.tw/~cjlin/libsvm, was utilized. All the simulations were carried out on a computer with a 3.1 GHz 2-core CPU, 6 GB of memory and the Windows operating system.
All ELMs in the ensemble classifier had the same number of hidden-layer neurons but different random hidden-layer weights and output-layer weights. Ensemble ELM models were built via the stratified 5-fold cross-validation procedure by gradually increasing the number of hidden neurons from 20 to 300 in steps of 10. The best number of neurons was adopted to build the training model. The sigmoid activation function was used to compute the hidden-layer output matrix. The final model was an ensemble of 15 extreme learning machines, and the output of the ensemble ELM model was determined by combining the outputs of each individual ELM by majority voting. For SVM, the Radial Basis Function was chosen as the kernel function, and the optimized parameters (C, γ) were obtained with a grid-search approach.
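The ensemble construction can be illustrated with the short sketch below: a minimal single-hidden-layer ELM (random hidden weights, sigmoid activation, output weights obtained from the Moore-Penrose pseudo-inverse) and majority voting over 15 such learners. This is a simplified numpy illustration of the general idea, not the Zhu and Huang implementation used in the experiments; the class and parameter names are ours.

import numpy as np

class ELM:
    """Minimal single-hidden-layer extreme learning machine (sigmoid activation)."""
    def __init__(self, n_hidden, rng):
        self.n_hidden = n_hidden
        self.rng = rng

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):                       # y encoded as -1/+1
        d = X.shape[1]
        # Hidden-layer weights and biases are random and stay fixed after init.
        self.W = self.rng.uniform(-1.0, 1.0, size=(d, self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        # Output weights are the least-squares solution via the pseudo-inverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ y
        return self

    def predict(self, X):
        return np.sign(self._hidden(X) @ self.beta)

def ensemble_elm_predict(X_train, y_train, X_test, n_hidden=30, n_models=15, seed=0):
    """Train n_models ELMs with different random hidden weights; majority vote."""
    rng = np.random.default_rng(seed)
    votes = [ELM(n_hidden, rng).fit(X_train, y_train).predict(X_test)
             for _ in range(n_models)]
    return np.sign(np.sum(votes, axis=0))      # majority vote over -1/+1 outputs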
Prediction performance of PCA-EELM model
We evaluated the performance of the proposed PCA-EELM model using the DIP PPI data investigated in Guo et al. [9]. In order to evaluate the prediction ability of our ELM classifiers, we also implemented a Support Vector Machine (SVM) learning algorithm, which is widely regarded as a state-of-the-art classifier. We compared our ensemble ELM based recognition scheme against methods utilizing SVM with C = 8, g = 0.5 and l = 30. For the ensemble ELM and SVM classifiers, all of the input values were normalized to the range of [-1, 1].
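The [-1, 1] normalization can be realized, for instance, with a min-max transform fitted on the training split only; scikit-learn is used here purely for illustration, and X_train/X_test are placeholders:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data and reuse it for the test data,
# so that no information from the test split leaks into the scaling.
scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)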
To reduce the bias of the training and testing data, a 5-fold cross-validation technique is adopted. More specifically, the dataset is divided into 5 subsets, and the holdout method is repeated 5 times. Each time, four of the five subsets are put together as the training dataset, and the