基于多变量互信息的蛋白质序列预测蛋白-蛋白相互作用

180 浏览量更新于2024-08-27 收藏 1.51MB PDF 举报

本文研究主要关注的是通过蛋白质序列的多元互信息（Multivariate Mutual Information, MMI）来预测蛋白质-蛋白质相互作用（Protein-Protein Interaction, PPI）。这项工作发表在《BMC生物信息学》(BMC Bioinformatics)期刊上，由Ding等人在2016年的一项研究论文中提出，DOI为10.1186/s12859-016-1253-9。PPIs对于生物学过程至关重要，因为它们在细胞功能和信号传导中起着关键作用。传统上，预测PPI的方法依赖于大量同源蛋白质和已知相互作用伙伴的标记，这在实际应用中存在局限性。为了克服这些问题，作者提出了一种新颖的基于序列的预测方法，利用RF（随机森林）算法。首先，他们将20种标准氨基酸划分为七个功能组，对蛋白质序列进行编码处理，从而构建出每个蛋白质对的638维特征向量。这种向量化过程允许研究人员从序列层面提取更为丰富的信息。接着，他们采用一种创新的多变量互信息特征表示策略，这种方法不仅考虑了单个氨基酸之间的关系，还考虑了氨基酸之间更复杂的相互作用模式。通过这种方法，他们能够捕捉到蛋白质序列中的潜在模式，这些模式可能与PPI的存在或强度有关。值得注意的是，该方法在设计上侧重于计算效率，减少了对大量相似蛋白质数据的依赖，这对于那些缺乏完整交互信息的物种来说尤其有用。正常化的步骤确保了特征之间的可比性，使得算法能够在各种规模和复杂性的数据集上稳定运行。通过训练和验证，该模型展示了良好的预测性能，表明多元互信息特征可以有效地揭示蛋白质序列中的功能关联，从而提高PPI预测的准确性。这项研究为生物信息学领域提供了一个新的、有效的工具，可以帮助科学家们在没有足够结构或功能数据的情况下，预测蛋白质间的相互作用，对于理解生物网络和疾病机理具有重要意义。同时，这种方法也为后续的结构生物学和系统生物学研究开辟了新的途径，促进了对蛋白质相互作用机制的深入探索。

Ding et al. BMC Bioinformatics

(2016) 17:398

Page 3 of 13

dataset, our method achieves 82, 82, 62 and 61 % AUROC

on four different test classes (typical Cross-Validated (CV)

and distinct test classes C1, C2 and C3). On the human

dataset, our method achieves 82, 82, 60 and 57 % AUROC

on four different test classes. Finally, we test our method

on three important PPIs networks: the one-core network

(CD9) [21], the multiple-core network (Ras-Raf-Mek-Erk-

Elk-Srf pathway) [22], and the crossover network (Wnt-

related Network) [23]. Compared to the Conjoint Triad

(CT) method [13], accuracies of our method are increased

by 6.25, 2.06 and 18.75 %, respectively.

Methods

In our method for predicting protein-protein inter-

action based on protein sequence information, first

we extract features from protein sequence informa-

tion. The feature vector represents the characteristic

on one pair of proteins. We use k-gram feature repre-

sentation calculated as Multivariate Mutual Information

(MMI) and extract additional feature by normalized

Moreau-Broto Autocorrelation (NMBAC) from protein

sequences. These two approaches are employed to trans-

form the protein sequence into feature vectors. Then,

we feed the feature vectors into a specific classifier

for identifying interaction pairs and non-interaction

pairs.

Multivariate mutual information

Inspired by previous work [13, 24, 25] for extracting

features from protein sequences, we propose a novel

method to fully describe key information of protein-

protein interaction. There exist many technologies using

the k-gram feature representation, which is commonly

used for protein sequence classification [26, 27]. Here

k represents the number of conjoint amino acids. For

example, CT [13] used the 3-gram feature representation.

Shen et al. [13] indicated that methods without con-

sidering local environment are usually not reliable and

robust, so they produced a conjoint triad method to con-

sider properties of amino acids and their proximate amino

acids.

To continue the u sage of k-gram feature representa-

tion and to enhance classification accuracy, we utilize

MMI [28] for deeply extracting conjoint information of

amino acids in protein sequences.

Classifying amino acids

The protein-protein interaction can be dominated by

dipoles and volumes of diverse amino acids, which reflect

electrostatic and hydrophobic properties. All 20 stan-

dard amino acid types are assigned to seven functional

groups [13], as shown in Table 1. For each pair of proteins,

we extract conjoint information based on these amino

acid categories.

Table 1 Division of 20 amino acid types, based on dipoles and

volumes of side chains

No. Group Dipolescale Volumescale

A, G, V Dipole < 1.0 Volume< 50

C 1.0 < Dipole < 2.0 (form disulphide bonds) Volume> 50

D, E Dipole > 3.0 (opposite orientation) Volume> 50

F, I, L, P Dipole < 1.0 Volume> 50

H, N, Q, W 2.0 < dipole < 3.0 Volume> 50

K, R Dipole> 3.0 Volume> 50

M, S, T, Y 1.0 < dipole < 2.0 Volume> 50

Calculating multivariate mutual information

Considering the neighbours of each amino acid, we regard

any three contiguous amino acids as a unit. We use a

sliding window of a length of 3 amino acids to parse the

protein sequence. For each window, categories of three

aminoacidsareusedtolabelthetypeofthisunit.Instead

of considering the order of the three amino acids, we

only consider the basic ingredient of the unit. We define

different types of 3-gram feature representation, such as



, C



, C



, ...,



, C



. Similarly, we also

define different types of 2-gram feature representation,

such as



, C



, C



, ...,



, C



.Wecounteachtype

of 3-gram feature and 2-gram feature on one protein

sequence by a sliding window, as shown in Fig. 1.

At some point in the ensuing discussion of mutual infor-

mation, we state the logarithmic base as e.Incontrastto

the standard mutual information approach, our mutual

information and entropy method refer to single event on

one protein sequence, whereas standard mutual informa-

tion refers to overall possible events. We calculate the

multivariate mutual information for each type of 3-gram

feature, defined as follows:

I(a, b, c) = I(a, b) − I(a, b|c) (1)

where a, b and c are categories of three conjoint amino

acids in one unit.

Fig. 1 3-gram or 2-gram feature representation

剩余12页未读，继续阅读

weixin_38529251

粉丝: 6

基于多变量互信息的蛋白质序列预测蛋白-蛋白相互作用

(完整word版)蛋白质功能-结构-相互作用预测网站工具合集.doc

GaussDCA.jl：用于蛋白质家族中残基接触预测的多元高斯直接耦合分析-Julia模块

【AI知识表示实战手册】：从基础到高级，全面优化多元关系构建

MATLAB大数据处理：生物信息学中的解决方案与实践

【MATLAB生物信息学革命】：掌握数据世界的10大技巧

生物信息学与Matlab：数据分析与可视化的强大组合！

生物信息学的数学语言：数值分析在基因数据解读中的应用

Matlab在生物信息学中的应用：数据分析与模型构建的专业知识

【计算模型与生物信息学的未来】：《A New Kind of Science》在基因组学中的突破应用

cole_02_0507.pdf

最新资源