Spark_BLAST: Design and Performance Improvement of a Parallelized BLAST Algorithm


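"Design and Implementation of Parallelization of BLAST Algorithm Based on Spark"

In bioinformatics, BLAST (Basic Local Alignment Search Tool) is a key algorithm for sequence alignment, valued for its high accuracy and fast search. As gene data sets keep growing, however, conventional BLAST shows performance bottlenecks and low efficiency on large data. To address this, researchers proposed Spark_BLAST, a distributed parallel solution built on Apache Spark.

Apache Spark is a popular open-source big data processing framework that emphasizes in-memory computation to speed up processing. Spark_BLAST uses Spark's distributed computing capability to decompose the BLAST workload and distribute it across multiple nodes so that it runs in parallel. At its core, the method stores and manipulates gene sequence data in Spark's Resilient Distributed Datasets (RDDs) to process large-scale data efficiently.

In the implementation, Spark_BLAST first splits the large gene data set into smaller blocks and then runs the BLAST alignment in parallel on different nodes of the Spark cluster. Each node processes its assigned data independently, which reduces data transfer and waiting time and markedly improves overall computational efficiency. Spark's fault-tolerance mechanism also lets the computation continue when a node fails, which keeps the system stable.

Experiments on a 5-node Spark cluster show that Spark_BLAST achieves a speedup of about 4 over single-machine BLAST without sacrificing the accuracy of the alignment results. For the analysis of large gene data sets, Spark_BLAST can therefore shorten processing time considerably and gives bioinformaticians a faster, more efficient tool.

Keywords: Spark, parallel computing, bioinformatics, sequence alignment, big data, Basic Local Alignment Search Tool (BLAST)

This approach not only addresses the efficiency problem of BLAST on large data but also points to a new direction for future bioinformatics research. By applying modern big data processing technology to a traditional bioinformatics algorithm, Spark_BLAST offers a successful example for data-intensive computing in other fields and shows how distributed computing can be used to solve complex problems.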
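The split, distribute, and merge flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a local thread pool from Python's `concurrent.futures` stands in for the Spark executors, plain Python lists stand in for RDD partitions, and `toy_align` is a hypothetical k-mer counter that replaces the real BLAST alignment. All function names and the sample sequences are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Split a list of records into n_chunks contiguous, near-equal parts."""
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def toy_align(query, subject, k=4):
    """Hypothetical stand-in for BLAST: count distinct query k-mers found in subject."""
    kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return sum(1 for kmer in kmers if kmer in subject)


def align_chunk(query, chunk):
    """Score every (seq_id, sequence) record in one chunk, independently of other chunks."""
    return [(seq_id, toy_align(query, seq)) for seq_id, seq in chunk]


def parallel_search(query, records, n_workers=5):
    """Split the database, score the chunks in parallel, then merge and rank the hits."""
    chunks = split_into_chunks(records, n_workers)
    # The thread pool is a local stand-in for cluster nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(lambda chunk: align_chunk(query, chunk), chunks))
    hits = [hit for part in partial for hit in part]
    # Highest score first; ties are broken by sequence id so the order is deterministic.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Invented sample data, used only to exercise the sketch.
database = [
    ("seq1", "ACGTACGTGG"),
    ("seq2", "TTTTTTTTTT"),
    ("seq3", "GGACGTACCA"),
]
print(parallel_search("ACGTAC", database, n_workers=2))
```

Because each chunk is scored independently and the merge step only concatenates and sorts, the ranked hits are the same for any number of workers, which mirrors the claim that parallelization does not change the alignment results.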

2018 International Conference on Information, Electronic and Communication Engineering (IECE 2018)

ISBN: 978-1-60595-585-8

Design and Implementation of Parallelization of BLAST Algorithm Based on Spark

Zhen-yu LIU, Jing GAO, Zhi-jun SHEN and Fang ZHAO

College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia 010018, China

*Corresponding author

Keywords: Spark, Parallel computing, Bioinformatics, Sequence alignment, Big data, Basic Local Alignment Search Tool (BLAST).

Abstract. BLAST (Basic Local Alignment Search Tool) is a widely used local alignment algorithm with high accuracy. It reduces program running time while maintaining high precision, but it suffers from performance bottlenecks and low efficiency when comparing large gene data sets. Therefore, a distributed parallel method based on Spark, named Spark_BLAST, is proposed. The method uses Spark's in-memory computation to identify and divide tasks, and realizes distributed parallel computing of the BLAST algorithm. The method was implemented on a Spark cluster with 5 nodes. Comparison with a single machine shows that the speedup of the Spark cluster reaches about 4 without changing the accuracy of the comparison results. The method provides an efficient alignment approach for bioinformatics.
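The reported figure can be put in context with the standard definitions of speedup (single-machine time divided by cluster time) and parallel efficiency (speedup divided by node count). The sketch below is illustrative only: the timings are hypothetical values chosen to reproduce a speedup of 4, not measurements from the paper, and it assumes all 5 nodes take part in the computation, which this excerpt does not state.

```python
def speedup(t_single, t_cluster):
    """Speedup = single-machine running time / cluster running time."""
    if t_single <= 0 or t_cluster <= 0:
        raise ValueError("running times must be positive")
    return t_single / t_cluster


def parallel_efficiency(s, n_nodes):
    """Efficiency = speedup / node count; 1.0 would be ideal linear scaling."""
    if n_nodes <= 0:
        raise ValueError("node count must be positive")
    return s / n_nodes


# Hypothetical timings (seconds), chosen only so that the speedup equals 4.
s = speedup(t_single=400.0, t_cluster=100.0)
e = parallel_efficiency(s, n_nodes=5)
print(f"speedup = {s:.1f}, efficiency = {e:.2f}")
```

Under these assumptions, a speedup of about 4 on 5 nodes corresponds to a parallel efficiency of about 0.8.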

Introduction

With the development of bioinformatics, gene sequence alignment has become an indispensable part of the field. BLAST [1, 2] (Basic Local Alignment Search Tool) is one of the most popular methods for gene sequence alignment. However, with the development of high-throughput sequencing technology, a large amount of gene data has been produced, and BLAST suffers from performance bottlenecks and low efficiency when comparing large gene data sets.

Recently, many researchers have improved the BLAST algorithm. The first approach is to accelerate BLAST on GPUs: Vouzis at Carnegie Mellon University designed and implemented the GPU-BLAST [3] algorithm in 2011, which achieves a three-fold speedup over single-machine BLAST. In 2014, G-BLASTN [4] was proposed by K Zhao and X Chu; it supports a pipeline mode that further improves overall performance by up to 44% when handling a batch of queries as a whole. The second approach is distributed design and implementation: Matsunaga and Tsugawa implemented the CloudBLAST [5] algorithm in 2009, which integrates MapReduce with virtual machines and virtual networks to realize the parallel computation of BLAST2. In 2014, Ming MENG implemented an efficient and reliable parallel BLAST algorithm using the MapReduce computing framework on a Hadoop cluster [6]. As BLAST came into wide use, RS Neumann, S Kumar et al. developed user-friendly programs in 2014 to control the visualization and analysis of BLAST output, which makes the results more accessible to users without a bioinformatics background [7]. In 2017, Marcelo Rodrigo de Castro and Catherine DOS Santos Tostems designed SparkBLAST [8] and carried out protein comparison experiments on the Google and Amazon clouds, showing that SparkBLAST is faster than Hadoop. In 2018, Grzegorz M Boratyn and Jean Thierry-Mieg designed and implemented Magic-BLAST [9], which identifies introns better, is suitable for aligning long-read gene sequences, and also improves comparison speed.

In summary, implementing BLAST parallelization with big data technology has become a trend. However, MapReduce is a disk-based computing framework that requires frequent disk access during multi-step computation, which causes considerable delay. The distributed parallel

