TargetS, a Template-Free Predictor: Identifying Protein-Ligand Binding Sites with Classifier Ensembles and Spatial Clustering

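"This research paper proposes a new method, TargetS, for predicting protein-ligand binding sites when no 3D structure or template is available. By combining protein evolutionary information, predicted protein secondary structure, and the ligand-specific binding propensities of residues, the method uses classifier ensembles and a spatial clustering strategy to identify binding sites accurately."

In protein science and drug design, accurately identifying protein-ligand binding sites (or pockets) is essential: it helps explain protein function and supports the development of new drugs. Despite considerable progress, the problem remains hard when the 3D structure of the target protein is unavailable or no homologous template can be found, because template-based methods are then difficult to apply. Dong-Jun Yu and colleagues propose a new ligand-specific, template-free predictor, TargetS, to address this case.

TargetS works in two main steps. First, it predicts binding residues along the sequence with a ligand-specific strategy. Second, a recursive spatial clustering algorithm identifies binding sites from the predicted binding residues. The core idea is to combine several sources of information to improve prediction accuracy:

1. Protein evolutionary information: evolutionary information, which the paper encodes as a position specific scoring matrix (PSSM), captures the conservation of similar proteins across species, which is important for identifying the key amino acids that may take part in ligand binding.
2. Predicted protein secondary structure: secondary structure (such as α-helix, β-sheet, and random coil) directly influences protein function and the formation of binding sites. Predicting it helps TargetS locate likely binding regions more accurately.
3. Ligand-specific binding propensity of residues: each amino acid residue has a different propensity to bind a particular type of ligand. Analyzing these propensities lets TargetS distinguish which residues are more likely to bind a specific ligand.

Classifier ensembling is another key concept in TargetS: several classification models are combined to improve prediction performance. In general, such models may be based on different algorithms (such as support vector machines, random forests, or neural networks), and their predictions are merged into a final decision. Ensembling helps reduce the errors of any single model and improves the stability and accuracy of the overall prediction.

The spatial clustering algorithm plays the central role in identifying binding sites. By recursively clustering the predicted binding residues, TargetS identifies spatially contiguous regions that are the most likely to bind the ligand. Because the clustering takes the three-dimensional arrangement of the residues into account, the identified sites are more physically plausible.

In short, TargetS is a template-free prediction method that integrates several information sources and algorithms to predict protein-ligand binding sites when no template information is available, which makes it valuable for protein function research and drug discovery.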
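
The two steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the score averaging, the 0.5 threshold, the 8.0 distance cutoff, the single-linkage grouping, and all function names are assumptions made for illustration, and the per-residue coordinates are assumed to be available from some source that the summary does not specify.

```python
from math import dist

def ensemble_scores(per_model_scores):
    """Average per-residue binding scores over several classifiers (assumed rule)."""
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]

def predicted_binding_residues(scores, threshold=0.5):
    """Indices of residues whose averaged score reaches the (assumed) threshold."""
    return [i for i, score in enumerate(scores) if score >= threshold]

def cluster_residues(residues, coords, cutoff=8.0):
    """Group residues into candidate sites by single linkage.

    Two residues end up in the same site when a chain of residues, each within
    `cutoff` of the next, connects them. `coords` maps residue index -> (x, y, z).
    """
    unvisited = set(residues)
    sites = []
    while unvisited:
        seed = unvisited.pop()
        site, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [r for r in unvisited
                    if dist(coords[current], coords[r]) <= cutoff]
            for r in near:
                unvisited.remove(r)
                site.append(r)
                frontier.append(r)
        sites.append(sorted(site))
    return sorted(sites)

# Toy example: two classifiers, four residues, made-up coordinates.
scores = ensemble_scores([[0.9, 0.2, 0.8, 0.7],
                          [0.7, 0.4, 0.6, 0.9]])
residues = predicted_binding_residues(scores)
coords = {0: (0, 0, 0), 1: (3, 0, 0), 2: (5, 0, 0), 3: (30, 0, 0)}
sites = cluster_residues(residues, coords)
```

In this toy input, residues 0 and 2 lie within the cutoff of each other and form one candidate site, while residue 3 is far away and forms its own.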

automated procedure is completed, a careful manual check is performed to eliminate possible false positives, which can occur for entries that contain commonly used crystallization additives. This gives high confidence that the ligand-protein interactions collected from PDB are biologically relevant. Details on the construction of BioLip can be found in [45].

To evaluate the effectiveness of the proposed TargetS, we therefore constructed training data sets and independent validation data sets from BioLip [45] rather than directly from PDB. Twelve types of ligands were considered in this study: five types of metal ions, five types of nucleotides, DNA, and HEME. For each of the 12 ligand types, we constructed a training data set and an independent validation data set as follows:

Training data sets. We extracted from BioLip all protein sequences that interact with the given ligand and were released into PDB before 10 March 2010. The extracted sequences were then culled with the PISCES software [49] so that the maximal pairwise sequence identity is 40 percent; the resulting sequences constitute the training data set for that ligand.

Independent validation data sets. We extracted from BioLip all protein sequences that interact with the ligand and were deposited into PDB after 10 March 2010. Again, the maximal pairwise sequence identity of the extracted sequences was reduced to 40 percent, and the resulting sequences constitute the validation data set. Moreover, if a sequence in the validation data set shares more than 40 percent identity with a sequence in the training data set, we remove it from the validation data set. This ensures that the sequences in the validation data set are independent of those in the training data set. Table 1 summarizes the detailed compositions of the training data sets and the independent validation data sets for the 12 types of ligands.
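
The split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `(seq_id, release_date)` record layout, and the `identity` callable are hypothetical, and the PISCES culling within each set is not reproduced here. Only the date split and the cross-set identity filter are shown.

```python
from datetime import date

CUTOFF_DATE = date(2010, 3, 10)   # split date stated in the text
MAX_IDENTITY = 0.40               # 40 percent identity threshold stated in the text

def split_by_release_date(entries, cutoff=CUTOFF_DATE):
    """Split (seq_id, release_date) records into training and validation lists.

    The text says "before" and "after" the cutoff, so records dated exactly on
    the cutoff fall into neither list in this sketch.
    """
    training = [e for e in entries if e[1] < cutoff]
    validation = [e for e in entries if e[1] > cutoff]
    return training, validation

def drop_similar_to_training(validation_ids, training_ids, identity,
                             max_identity=MAX_IDENTITY):
    """Keep validation sequences that share at most `max_identity` with every
    training sequence. `identity(a, b)` is a hypothetical callable returning a
    pairwise sequence identity in [0, 1]."""
    return [v for v in validation_ids
            if all(identity(v, t) <= max_identity for t in training_ids)]

# Toy example with made-up identifiers, dates, and identities.
entries = [("a", date(2009, 5, 1)), ("b", date(2010, 3, 10)),
           ("c", date(2011, 1, 1)), ("d", date(2012, 6, 1))]
training, validation = split_by_release_date(entries)
table = {("c", "a"): 0.55, ("d", "a"): 0.30}
independent = drop_similar_to_training(
    [v[0] for v in validation], [t[0] for t in training],
    lambda v, t: table[(v, t)])
```

Here sequence `c` is dropped because it shares 55 percent identity with training sequence `a`, while `d` is kept.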

To further demonstrate the effectiveness of the proposed TargetS, the CASP9 data set was used as a blind test. The ninth community-wide Critical Assessment of techniques for protein Structure Prediction (CASP9) released 129 target protein sequences for blind testing of protein structure and function prediction methods. Of these 129 sequences, 31 were used to evaluate ligand binding-site predictions, for which predictors were asked to identify the ligand-binding residues in the sequences. Because one sequence (Target ID: T0533) was canceled on 26 May 2010, the remaining 30 sequences were taken as our targets.

It has not escaped our notice that the percentages of binding residues in the training and validation data sets for a given ligand differ. However, this difference does not affect the objectivity of the evaluation, because we performed both cross-validation on the training data set and an independent test on the independent validation data set. Cross-validation evaluates the overall performance of the proposed method on a given data set, whereas the independent test, a practice widely accepted in this field, evaluates its generalization capability.
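
As a rough illustration of the cross-validation half of this protocol, the sketch below partitions item indices into folds so that each item is tested exactly once. The helper name `k_fold_splits`, the contiguous fold layout, and the fold count used in the example are assumptions; the excerpt does not state how many folds were used or how they were formed.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n_items)."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size

# Toy example: 7 items (for instance, 7 training proteins) in 3 folds.
splits = list(k_fold_splits(7, 3))
```

An independent test, by contrast, uses no such rotation: the model is fitted once on the whole training data set and scored once on the separately constructed validation data set.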

2.2 Feature Extraction

2.2.1 Position Specific Scoring Matrix Feature

The position specific scoring matrix (PSSM) encodes the evolutionary information of a protein sequence. Numerous previous studies have shown its strong discriminative capability for many prediction problems in bioinformatics, such as protein function prediction [50], protein-ATP binding sites prediction [51], transmembrane helices prediction [52], protein secondary structure prediction [53], and subcellular localization prediction [54], [55], [56].

The position specific scoring matrix for a protein sequence is built by using PSI-BLAST [57] to search the Swiss-Prot database through three iterations, with 0.001 as the e-value cutoff for multiple sequence alignment against the query

996 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 4, JULY/AUGUST 2013

TABLE 1

Composition of the Training Data Sets and the Independent Validation Data Sets for the 12 Types of Ligands
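
The PSI-BLAST settings given in Section 2.2.1 (three iterations, 0.001 as the e-value cutoff) might be expressed as a BLAST+ `psiblast` invocation along the following lines. This is a sketch, not the authors' command: the excerpt does not say which PSI-BLAST release was used, the database name `swissprot` and the file names are placeholders, and mapping the stated cutoff to `-inclusion_ethresh` (the threshold for including hits in the profile) rather than `-evalue` (the reporting threshold) is an assumption.

```python
def psiblast_command(query_fasta, db="swissprot", pssm_out="query.pssm"):
    """Build an argument list for BLAST+ psiblast (names and paths are placeholders)."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db,                       # assumed name of a local Swiss-Prot BLAST database
        "-num_iterations", "3",          # three iterations, as stated in the text
        "-inclusion_ethresh", "0.001",   # assumed mapping of the 0.001 cutoff
        "-out_ascii_pssm", pssm_out,     # write the PSSM as plain text
    ]

command = psiblast_command("query.fasta")
# To run it where BLAST+ and the database are installed:
#   import subprocess; subprocess.run(command, check=True)
```

Building the command as a list keeps each flag next to its value and avoids shell quoting issues if it is later passed to `subprocess.run`.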
