2018 International Conference on Information, Electronic and Communication Engineering (IECE 2018)
ISBN: 978-1-60595-585-8
Design and Implementation of Parallelization of BLAST
Algorithm Based on Spark
Zhen-yu LIU, Jing GAO
*
, Zhi-jun SHEN and Fang ZHAO
College of Computer and Information Engineering, Inner Mongolia Agricultural University,
Hohhot Inner Mongolia 010018, China
*Corresponding author
Keywords: Spark, Parallel computing, Bioinformatics, Sequence alignment, Big data, Basic Local
Alignment Search Tool (BLAST).
Abstract. BLAST (Basic Local Alignment Search Tool) is a local alignment algorithm, which has
high accuracy and is used widely. It can reduce the running time of program while maintaining high
precision, but it has performance bottleneck and low efficiency when comparing large gene data
sets. Therefore, a distributed parallel method named Spark_BLAST based on Spark was proposed.
The method uses Spark memory computation to identify and divide tasks, and realizes the
distributed parallel computing of the BLAST algorithm. Finally, the method was implemented on
the Spark cluster with 5 nodes. Comparing with single machine shows that the speedup of Spark
cluster can reach about 4 without changing the accuracy of the comparison result. The method
provides an efficient alignment method for bioinformatics.
Introduction
With the development of bioinformatics, gene sequence alignment has become an indispensable
part in this field. BLAST [1, 2] (Basic Local Alignment Search Tool) is one of the most popular
methods for gene sequence alignment. However, with the development of high-throughput
sequencing technology, a large number of gene data produced. BLAST has performance bottlenecks
and low efficiency in comparing large gene data sets.
Recently, many people have improved the blast algorithm. For example, the first way is the
improvement of BLAST algorithm based on GPU: Vouzis at Carnegie Mellon University designed
and implemented the GPU-BLAST [3] algorithm in 2011, which achieves a three-fold acceleration
ratio compared with single-machine BLAST; In 2014, the G-BLASTN [4] was proposed by K
Zhao and X Chu. It supports a pipeline mode that further improves the overall performance by up to
44% when handling a batch of queries as a whole. The second way is the BLAST algorithm based
on distributed design and Implementation: Matsunaga and Tsugawa implemented the CloudBLAST
[5] algorithm in 2009. And the algorithm integrates MapReduce with virtual machine and virtual
network, realizes the parallel computation of BLAST2. In 2014, Ming MENG implemented an
efficient and reliable parallel BLAST algorithm using MapReduce computing framework based on
Hadoop cluster [6]. With the wide application of BLAST, RS Neumann, S Kumar et al. developed
user-friendly programs to control the visualization and analysis of BLAST output results in 2014. It
provides many conveniences for users without bioinformatics background [7]. In 2017, Marcelo
Rodrigo de Castro and Catherine DOS Santos Tostems designed the SparkBLAST [8], and carried
out protein comparison experiments on Google and Amazon clouds, showing that SparkBLAST is
faster than Hadoop. In 2018, Grzegorz M Boratyn, Jean Thierry-Mieg designed and implemented
Magic-BLAST [9], which can better identify introns and is suitable for aligning long-reading gene
sequences. At the same time, the speed of comparison will be improved.
To sum up, it has become a trend to implement BLAST parallelization using big data technology.
However, the MapReduce is a computing framework based on disk, which requires frequently
access disk during multi-step computing, it She caused a lot of time delay. The distributed parallel