A Sequence-Segmented Method Applied to the
Similarity Analysis of Proteins
Fen Kong^1, Xu-ying Nan^2, Ping-an He^1, Qi Dai^2, Yu-hua Yao^*2
^1 College of Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
^2 College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
zaozuang1989@126.com, nanxuying@163.com, pinganhe@zstu.edu.cn, daiailiu2004@yahoo.com.cn, yaoyuhua2288@163.com
Abstract—A 2-D graphical representation of protein sequences
based on two classifications of amino acids is outlined. A
sequence-segmented method (SSM), which divides a long sequence
into k segments, is introduced, so that the protein graph is divided
into k segments, and the geometrical centers of the points of each
protein curve segment are used as descriptors of the protein
sequence. The descriptors are useful not only for the comparative
study of proteins but also for encoding innate information about
the structure of proteins. Finally, a simple example illustrates the
behavior of the new descriptor on protein sequences taken from
12 baculoviruses.
Keywords-Similarity; Sequence-Segmented Method; Graphical
representation; Descriptors.
I. INTRODUCTION
Bio-molecular sequence comparison is the origin of
bioinformatics. Today, powerful sequence comparison
methods, together with comprehensive biological databases,
have changed the practice of molecular biology and genomics.
Previously, almost all such comparisons were based on sequence
alignment: these methods use dynamic programming, in which a
score function represents the insertion, deletion, and
substitution of nucleotides or amino acids in the compared
DNAs or proteins, and an optimal alignment is found by
assigning scores to the different possible alignments and
picking the alignment with the highest score. More recently,
biological sequence analysis has quickly incorporated
additional concepts and algorithms, such as stochastic
modeling of sequences using hidden Markov models and other
Bayesian methods for hypothesis testing and parameter
estimation [1].
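
To make the alignment idea concrete, the following is a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). The function name and the scoring values (match, mismatch, and gap) are illustrative assumptions and are not taken from any of the cited works.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Scoring values are illustrative assumptions: +1 for a match,
    -1 for a substitution, -2 for each insertion or deletion.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell keeps the best of: substitution/match, deletion, insertion.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    return score[-1][-1]
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the cost grows with the product of the sequence lengths, which is one motivation for the alignment-free descriptors discussed next.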
Among all existing alignment-free methods for comparing
biological macromolecules, graphical representation
techniques provide a simple way to view, sort, and compare
sequences or structures. The H-curve, a graphical representation
of DNA sequences, was introduced by Hamori in 1983 [2].
Graphical representations of bio-sequences were then extended
from DNA [3-5] and RNA secondary structure [6, 7] to proteins
[8, 9], and grew from qualitative, pictorial representations into
tools for the quantitative estimation of sequence similarities/
dissimilarities. These graphical representations, both 2-D and
3-D, can be associated with a matrix, such as E, M/M, L/L, or
^kL/^kL, whose invariants yield numerical descriptors in place of
a visual description of the sequence. The comparison of
sequences is thereby changed into a comparison of descriptors.
The matrix methods above, which form ratios of graph-theoretic
and Euclidean distances between nodes of the graphical plots,
were first formulated for DNA sequences by Randic et al. They
are used in the study of global homology and conserved
patterns, the analysis of similarity and dissimilarity, and the
study of fractal and long-range correlations.
This technique has become the method of choice for
researchers in this field, who have defined different types of
matrices to construct various invariants that describe
bio-sequences. However, the difficulty of computing various
parameters for the very large matrices that arise naturally for
long sequences has restricted the numerical characterizations
to leading eigenvalues and the like [10].
Another approach, using a geometrical descriptor, was
proposed by Raychaudhury and Nandy [11]. It has been found
useful for several calculations based on the 2-D graphical
representation [12] and was recently extended to an abstract
20-D model of protein sequences [13], in which individual
sequences are indexed by numerical descriptors. The approach
is convenient, fast, and efficient, but it cannot be used to
measure similarity/dissimilarity for bio-sequences with
length less than 1000.
In this paper, we outline a dynamic 2-D graphical
representation based on two physico-chemical properties of
amino acids and introduce a novel strategy for sequence
comparison based on dividing a long sequence into k
segments (SSM). We compare the helicase protein
sequences of 12 baculoviruses, including 3
group I NPVs: AcMNPV (Autographa californica MNPV),
BmNPV (Bombyx mori NPV), RoMNPV (Rachiplusia ou
MNPV); 5 group II NPVs: HearNPV (Helicoverpa armigera
NPV), HzSNPV (Helicoverpa zea SNPV), MacoNPVA
(Mamestra configurata NPVA), MacoNPVB (Mamestra
configurata NPVB), SeMNPV (Spodoptera exigua MNPV); 3
GVs: AdorGV (Adoxophyes orana GV), CpGV (Cydia
pomonella GV), CrleGV (Cryptophlebia leucotreta GV); 1
hymenopteran baculovirus: NeseNPV (Neodiprion sertifer
NPV). The family Baculoviridae is divided into two genera,
Nucleopolyhedrovirus (NPV) and Granulovirus (GV).
Lepidopteran NPVs are further divided into group I and
group II NPVs. Group I NPVs appear to be much more
conserved than those of group II [14]. The length and group
information of these protein sequences are shown in Table 1.
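
As a rough illustration of the segment-based descriptor outlined above, the sketch below takes a protein curve that has already been drawn as a list of 2-D points, divides it into k contiguous segments, and uses the geometrical center of each segment as the descriptor. How the curve is built from the amino-acid properties is not covered here. The function names, the near-equal split of the points, and the unit-length normalization are assumptions made for illustration only.

```python
import math


def segment_centres(points, k):
    """Split a curve (list of (x, y) points) into k contiguous segments
    and return the geometric centre of each segment as a flat list
    [x1, y1, ..., xk, yk]. Near-equal segment sizes are an assumption."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    descriptor = []
    for s in range(k):
        # Integer arithmetic keeps the k segments contiguous and non-empty.
        start, stop = s * n // k, (s + 1) * n // k
        segment = points[start:stop]
        descriptor.append(sum(x for x, _ in segment) / len(segment))
        descriptor.append(sum(y for _, y in segment) / len(segment))
    return descriptor


def normalise(vector):
    """Scale a vector to unit length (assumed normalization scheme)."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector] if norm else list(vector)


def descriptor_distance(u, v):
    """Euclidean distance between the end points of two normalised
    descriptor vectors of equal length."""
    return math.dist(normalise(u), normalise(v))
```

Because every sequence is reduced to a vector of 2k numbers, two sequences of different lengths can be compared directly as long as the same k is used for both.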
The similarities are computed by calculating the Euclidean
distance between the end points of the normalized descriptor
vectors. Using our approach, one can find that the