A Sequence-Segmented Method Applied to Protein Similarity Analysis


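This article discusses a sequence-segmented method (SSM) for protein similarity analysis, presented in a bioinformatics research paper. The authors, from the College of Sciences and the College of Life Sciences at Zhejiang Sci-Tech University, propose a new graphical representation of protein sequences based on two classifications of amino acids. The representation is intended to support protein sequence comparison and to encode structural information.

The authors first construct a 2-D graphical model and then divide the protein sequence into k segments according to a fixed rule. This division (SSM) splits the protein graph into k manageable parts, and the geometric center of each part serves as a descriptor of the protein sequence. These descriptors are useful for the comparative study of proteins and also capture innate information about protein structure, such as possible functional regions or conserved features.

The introduction contrasts the new method with traditional alignment-based analysis and argues that a geometric, structure-related view improves the precision and depth of the comparison, helping researchers understand functional associations, evolutionary relationships, and structural similarities and differences between proteins. As a simple example, the authors apply SSM to protein sequences from 12 baculoviruses and show how the new descriptor reveals both protein-specific behavior and shared patterns.

Keywords: similarity analysis, sequence-segmented method, graphical representation, descriptors.

The paper offers a new tool for protein sequence analysis that may help researchers mine large protein datasets and study protein structure and function.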

A Sequence-Segmented Method Applied to the Similarity Analysis of Proteins

Fen Kong, Xu-ying Nan, Ping-an He, Qi Dai, Yu-hua Yao

College of Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China

College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China

zaozuang1989@126.com, nanxuying@163.com, pinganhe@zstu.edu.cn, daiailiu2004@yahoo.com.cn, yaoyuhua2288@163.com

Abstract—A 2-D graphical representation of protein sequences based on two classifications of amino acids is outlined. A method of dividing a long sequence into k segments (SSM) is introduced, so that the protein graph is divided into k segments, and the geometrical centers of the points in each protein curve segment are given as descriptors of the protein sequence. The descriptors are useful not only for the comparative study of proteins but also for encoding innate information about the structure of proteins. Finally, a simple example highlights the behavior of the new descriptor on protein sequences taken from 12 baculoviruses.

Keywords-Similarity; Sequence-Segmented Method; Graphical representation; Descriptors.
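The pipeline summarized in the abstract (2-D curve, k segments, one geometric center per segment) can be sketched as follows. This is only an illustration under stated assumptions: the excerpt does not give the paper's two amino-acid classifications or its plotting rule, so the `HYDROPHOBIC` set, the up/down step rule, the near-equal segment lengths, and all function names here are hypothetical choices, not the authors' definitions.

```python
from typing import List, Tuple

# ASSUMPTION: a hypothetical two-class split of the 20 amino acids.
# The paper's actual classifications are not given in this excerpt.
HYDROPHOBIC = set("AVLIMFWPGC")

Point = Tuple[float, float]


def protein_curve(seq: str) -> List[Point]:
    """Map a sequence to a 2-D zigzag curve (illustrative rule only):
    x advances by 1 per residue, y steps up for one class, down for the other."""
    points: List[Point] = []
    y = 0.0
    for i, residue in enumerate(seq.upper(), start=1):
        y += 1.0 if residue in HYDROPHOBIC else -1.0
        points.append((float(i), y))
    return points


def segment_centers(points: List[Point], k: int) -> List[Point]:
    """Split the curve into k contiguous segments of near-equal length and
    return the geometric center (mean x, mean y) of each segment."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    centers: List[Point] = []
    for s in range(k):
        start, end = s * n // k, (s + 1) * n // k
        chunk = points[start:end]
        cx = sum(p[0] for p in chunk) / len(chunk)
        cy = sum(p[1] for p in chunk) / len(chunk)
        centers.append((cx, cy))
    return centers


def ssm_descriptor(seq: str, k: int) -> List[float]:
    """Flatten the k segment centers into a 2k-dimensional descriptor vector."""
    centers = segment_centers(protein_curve(seq), k)
    return [coord for center in centers for coord in center]
```

For the toy input `ssm_descriptor("AAKK", 2)`, the curve is (1, 1), (2, 2), (3, 1), (4, 0), so the descriptor is `[1.5, 1.5, 3.5, 0.5]`. Because every sequence yields exactly 2k numbers regardless of its length, descriptors of different proteins can be compared directly.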

I. INTRODUCTION

Bio-molecular sequence comparison is the origin of bioinformatics. Today, powerful sequence comparison methods, together with comprehensive biological databases, have changed the practice of molecular biology and genomics. Until recently, almost all such comparisons were based on sequence alignment. These methods use dynamic programming, a regression technique that finds an optimal alignment by assigning scores to the different possible alignments and picking the alignment with the highest score; a score function represents the insertion, deletion, and substitution of nucleotides or amino acids in the compared DNAs or proteins. More recently, biological sequence analysis has quickly incorporated additional concepts and algorithms, such as stochastic modeling of sequences using hidden Markov models and other Bayesian methods for hypothesis testing and parameter estimation [1].
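The alignment scoring described above can be illustrated with a minimal dynamic-programming sketch in the style of Needleman-Wunsch global alignment. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the function name are assumptions chosen for illustration; practical tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Return the best global alignment score of a and b under a simple
    match/mismatch/linear-gap scheme (illustrative parameter values)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            deletion = prev[j] + gap       # a[i-1] aligned to a gap
            insertion = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

The table has (len(a)+1) x (len(b)+1) cells, so time grows with the product of the two sequence lengths, which is one motivation for the alignment-free descriptors discussed next.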

Among all existing alignment-free methods for comparing biological macromolecules, graphical representation techniques provide a simple way to view, sort, and compare sequences or structures. The H-curve, a graphical representation of DNA sequences, was introduced by Hamori in 1983 [2]. Graphical representations of bio-sequences have since expanded from DNA [3-5] and RNA secondary structure [6, 7] to proteins [8, 9], and have grown from qualitative, pictorial representations into tools for the quantitative estimation of sequence similarities/dissimilarities. These graphical representations, both 2-D and 3-D, can be associated with a matrix, such as E, M/M, L/L, L, whose invariants yield various numerical descriptors in place of a visual description of the sequence. The comparison of sequences thus becomes a comparison of descriptors. These matrix methods, which form ratios of graph-theoretic and Euclidean distances between nodes of the graphical plots, were first formulated for DNA sequences by Randic et al. They are used in the study of global homology and conserved patterns, the analysis of similarity and dissimilarity, and the study of fractal and long-range correlations. This technique has become the method of choice for researchers in this field, who have defined different types of matrices to construct various invariants for describing bio-sequences. However, the difficulty of computing various parameters for the very large matrices that arise naturally from long sequences has restricted the numerical characterizations to leading eigenvalues and the like [10].
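To make the matrix-invariant idea concrete, the sketch below builds one distance-ratio matrix for a curve and estimates its leading eigenvalue by power iteration. It assumes a particular form of the ratio, namely the Euclidean distance between two nodes divided by the number of edges between them along the curve; the excerpt does not define the matrices it names, so this choice and both function names are illustrative assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def distance_ratio_matrix(points: List[Point]) -> List[List[float]]:
    """Symmetric matrix whose (i, j) entry is the Euclidean distance between
    nodes i and j divided by their separation along the curve (ASSUMED form)."""
    n = len(points)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = math.dist(points[i], points[j]) / (j - i)
    return m


def leading_eigenvalue(m: List[List[float]], iterations: int = 200) -> float:
    """Estimate the largest eigenvalue of a symmetric non-negative matrix
    with power iteration, reporting the Rayleigh quotient v^T M v."""
    n = len(m)
    if n == 0:
        return 0.0
    v = [1.0 / math.sqrt(n)] * n  # unit-length starting vector
    value = 0.0
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        value = sum(v[i] * w[i] for i in range(n))
        v = [x / norm for x in w]
    return value
```

The matrix has one row and one column per residue, so both storage and each iteration scale quadratically with sequence length. This is the cost that the paragraph above cites as the reason characterizations are usually limited to leading eigenvalues.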

Another approach, using a geometrical descriptor, was proposed by Raychaudhury and Nandy [11]. It has been found useful for several calculations based on the 2-D graphical representation [12] and was recently extended to an abstract 20-D model for protein sequences [13], in which individual sequences are indexed by numerical descriptors. The approach is convenient, fast, and efficient, but it cannot be used to measure similarity/dissimilarity for bio-sequences with length less than 1000.

In this paper, we outline a dynamic 2-D graphical representation based on two physico-chemical properties of amino acids and introduce a novel strategy for sequence comparison based on dividing a long sequence into k segments (SSM). We compare the helicase protein sequences of 12 baculoviruses, comprising 3 group I NPVs: AcMNPV (Autographa californica MNPV), BmNPV (Bombyx mori NPV), RoMNPV (Rachiplusia ou MNPV); 5 group II NPVs: HearNPV (Helicoverpa armigera NPV), HzSNPV (Helicoverpa zea SNPV), MacoNPVA (Mamestra configurata NPVA), MacoNPVB (Mamestra configurata NPVB), SeMNPV (Spodoptera exigua MNPV); 3 GVs: AdorGV (Adoxophyes orana GV), CpGV (Cydia pomonella GV), CrleGV (Cryptophlebia leucotreta GV); and 1 hymenopteran baculovirus: NeseNPV (Neodiprion sertifer NPV). The family Baculoviridae is divided into two genera, Nucleopolyhedrovirus (NPV) and Granulovirus (GV). Lepidopteran NPVs are further divided into group I and group II NPVs. Group I NPVs appear to be much more conserved than those of group II [14]. The lengths and group assignments of these protein sequences are shown in Table 1. The similarities are computed as the Euclidean distances between the end points of the normalized descriptor vectors. Using our approach, one can find that the

2012 IEEE 6th International Conference on Systems Biology (ISB), Xi’an, China, August 18–20, 2012
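The similarity computation mentioned at the end of the Introduction, the Euclidean distance between the end points of normalized descriptor vectors, can be sketched as follows. The sketch assumes that "normalized" means scaling each vector to unit length, which the excerpt does not state explicitly. The function names and the toy labels used below are hypothetical and are not data from the paper.

```python
import math
from typing import Dict, List


def normalize(vec: List[float]) -> List[float]:
    """Scale a descriptor vector to unit Euclidean length (ASSUMED meaning
    of 'normalized' in the text)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def endpoint_distance(u: List[float], v: List[float]) -> float:
    """Euclidean distance between the end points of two normalized vectors."""
    if len(u) != len(v):
        raise ValueError("descriptor vectors must have the same dimension")
    return math.dist(normalize(u), normalize(v))


def distance_matrix(descriptors: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise distances for a set of labeled descriptor vectors."""
    names = list(descriptors)
    return {a: {b: endpoint_distance(descriptors[a], descriptors[b]) for b in names}
            for a in names}
```

With toy inputs, `distance_matrix({"P1": [1.0, 0.0], "P2": [0.0, 2.0]})` gives a distance of sqrt(2) between `P1` and `P2`, since the two unit vectors are perpendicular. Smaller distances indicate more similar descriptors, and after normalization no distance can exceed 2.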
