An algorithm of multiple sequence alignment based
on consensus sequence searched by simulated
annealing and star alignment
Dengfeng Yao/ Beijing Union University
Beijing Key Lab of Information Service Eng.,
Beijing Union University, Beijing, China
e-mail: yaodengfeng@gmail.com
Xu You/ Beijing Institute of Petro-Chemical
Technology
Department of Mathematics and Physics, Beijing
Institute of Petro-Chemical Technology, Beijing,
P.R. China
e-mail: youxu@bipt.edu.cn
Abudoukelimu.abulizi /Tsinghua University
Lab of Computational Linguistics, School of
Humanities, Tsinghua University, Beijing, China
e-mail:keram1106@163.com
Renkui Hou/Tsinghua University
Lab of Computational Linguistics, School of
Humanities, Tsinghua University, Beijing, China
e-mail: hourk0917@163.com
Abstract—Sequence alignment is an important
method of sequence analysis in biological
informatics, by which the characteristics of a gene
family can be conveniently determined through
consensus sequence. Finding a consensus sequence
prior to multiple sequence alignment makes the
sequence alignment easier. Thus, the consensus
sequence should be produced by simulated
annealing and star alignment algorithms.
Furthermore, the star alignment can be used to
compare each sequence in the consensus sequence.
Keywords—multiple sequence alignment,
simulated annealing, consensus sequence, star
alignment
1. INTRODUCTION
Consensus sequence is a method of
presenting multiple sequence alignment results
and merging highly similar sequence fragments.
The overall characteristics of a gene family and
the common elements in a sequence family can
be conveniently determined from the consensus
sequence.
However, a problem closely related to
consensus sequence is the construction of
multiple sequence alignment, which is an
important component of modern biological
information. Multiple sequence alignment
promotes many important applications in DNA
detection, structure function, and RNA and
protein family evolution and relation. Multiple
sequence alignment methods include the
Needleman–Wunsch algorithm [1],
Carrillo–Lipman algorithm [2], precise alignment
algorithm, iterative alignment algorithms such as
CLUSTAW [3], algorithm based on progressive
alignment, and SAGA [4] based on genetic
algorithm. Although these methods have
achieved good results, many of these still require
improvement and are yet to be perfect algorithms
for multiple sequence alignment. Hence, the
biological significance of alignment needs
continuous improvement.
In addition to using simulated annealing
method and star alignment algorithm to construct
the consensus sequence, this research also
conducted multiple sequence alignment
experiments based on the consensus sequence.
Once the consensus sequence was found, the star
alignment became easy to use to achieve multiple
sequence alignment based on the consensus
sequence [6]. Therefore, this research mainly
investigates the consensus sequence.
Day and McMorris (1993) [7] proved that
finding the consensus sequence is a complete NP
problem. Two main methods are used to
determine the consensus sequence. First, the
traditional method extracts elements in the
aligned column to obtain the consensus sequence
after multiple sequence alignment. The second
method searches for the consensus sequence in
the solution space based on simulation annealing
[5]. This method gradually generates a sequence
with a minimum distance of the input sequence
set from an empty sequence, and the consensus
sequence is obtained in the final annealing. These
two methods are performed from different