G-FQZip: Lossless Reference-Based Compression of FASTQ Files Using GPUs
Cong Peng¹, Qingjin Deng¹, Zhi-An Huang¹, Yiwen Sun²* and Zexuan Zhu¹
¹ College of Computer Science and Software Engineering,
Shenzhen University, Shenzhen 518060, China
² School of Medicine, Shenzhen University, Shenzhen
518060, China
ywsun@szu.edu.cn
Abstract—The exponential growth of high-throughput
sequencing data calls for efficient, specialized compression
methods to address the challenges posed by the storage and
transmission of such data. In this work, we develop G-FQZip,
a GPU version of a lossless reference-based compression
method, by introducing GPU-based arithmetic coding, a
template matching approach, and a parallel light-weight
mapping model. Comparison experiments demonstrate
that G-FQZip improves (de)compression speed while
maintaining comparable compression ratios. In addition, a
follow-up evaluation demonstrates the efficiency of the
GPU-based arithmetic coding and the template matching approach.
Keywords-GPU acceleration; Reference-based DNA
sequence compression; High-throughput sequencing; Lossless
compression
I. INTRODUCTION
High-throughput sequencing has been extensively used
in genome studies thanks to a significant drop in
sequencing costs [1, 2]. The most widely used file format to
store sequencing data and the associated information,
including metadata and quality values, is FASTQ
[3]. General-purpose compression tools are not efficient in
the (de)compression of FASTQ files as they do not consider
the biological characteristics of such files.
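To make the structure that a specialized compressor can exploit concrete, the following is a minimal sketch that splits FASTQ text into separate metadata, base, and quality streams. It assumes the common unwrapped four-line record layout; the function name `split_fastq_streams` and the read identifiers in the example are hypothetical and are not taken from G-FQZip.

```python
from typing import Iterable, List, Tuple


def split_fastq_streams(
    lines: Iterable[str],
) -> Tuple[List[str], List[str], List[str]]:
    """Split FASTQ text into metadata, base, and quality streams.

    Assumes unwrapped records of exactly four lines each:
    '@' header, bases, '+' separator, quality string.
    """
    headers: List[str] = []
    bases: List[str] = []
    quals: List[str] = []
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines between records
        record.append(line)
        if len(record) == 4:
            header, seq, plus, qual = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {record!r}")
            if len(seq) != len(qual):
                raise ValueError("sequence and quality lengths differ")
            headers.append(header[1:])  # drop the leading '@'
            bases.append(seq)
            quals.append(qual)
            record = []
    if record:
        raise ValueError("truncated FASTQ record")
    return headers, bases, quals


# Hypothetical two-read example.
example = """@SRR001.1 length=8
ACGTACGT
+
IIIIHHHH
@SRR001.2 length=8
ACGTTCGT
+
IIIIGGGG
""".splitlines()

headers, bases, quals = split_fastq_streams(example)
```

Each stream has different statistics (near-identical headers, a four-letter base alphabet, a skewed quality distribution), which is why treating the file as one undifferentiated byte stream tends to compress it poorly.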
In recent years, many specialized compression
algorithms have been developed to handle sequencing data
in raw FASTQ format by introducing a variety of
heuristic strategies and techniques [4, 5]. Generally, they
can be categorized into reference-free methods [6-8] and
reference-based methods [9-12], depending on whether
external reference sequences are needed.
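As an illustration of the reference-based idea, the sketch below stores a read as a position in the reference plus a list of mismatching bases, and then reconstructs it losslessly. It uses a brute-force ungapped search purely for clarity; this is an assumption for illustration and not the mapping model used by G-FQZip or by the cited methods. The names `encode_read` and `decode_read` and the toy sequences are hypothetical.

```python
from typing import List, Tuple

Mismatch = Tuple[int, str]  # (offset within the read, base in the read)


def encode_read(read: str, reference: str) -> Tuple[int, List[Mismatch]]:
    """Return (position, mismatches) for the best ungapped placement.

    Brute-force scan over every reference position; ties keep the
    leftmost position. Illustrative only, not an efficient mapper.
    """
    if len(read) > len(reference):
        raise ValueError("read longer than reference")
    best_pos, best_diff = 0, None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        diff = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
        if best_diff is None or len(diff) < len(best_diff):
            best_pos, best_diff = pos, diff
    return best_pos, best_diff


def decode_read(
    pos: int, length: int, mismatches: List[Mismatch], reference: str
) -> str:
    """Rebuild the read from the reference window and its mismatches."""
    read_bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        read_bases[offset] = base
    return "".join(read_bases)


# Toy example with hypothetical sequences.
reference = "TTGACCGTAAGCTTAGGC"
read = "CCGTTAGC"
pos, mismatches = encode_read(read, reference)
restored = decode_read(pos, len(read), mismatches, reference)
```

When a read matches the reference closely, the position and a short mismatch list are much smaller than the read itself; a reference-free method has no such external sequence and must model the reads directly.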
In this article, we leverage graphics processing unit
(GPU) acceleration to propose G-FQZip, a GPU version of a
lossless reference-based compression tool for FASTQ files.
The workflow of G-FQZip is described in Section
II. The novelty of this work stems from three aspects: a
GPU-based arithmetic coding is designed for compressing
sequencing reads, a template matching approach is
introduced to record metadata, and a parallel light-weight
mapping model is used to capture the differences of the
reads against the reference. Comparison experiments
demonstrate that G-FQZip improves (de)compression
speed with satisfactory compression ratios.
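The template matching approach for metadata is not specified in this section, so the following is only a rough sketch of one way such a scheme could work: each header is tokenized into numeric and non-numeric fields and compared with a template header, and only the fields that differ are recorded. The tokenization rule and the names `tokenize`, `diff_against_template`, and `apply_diff` are assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional

# Alternating runs of digits and non-digits; joining the tokens
# reproduces the original string exactly.
_TOKEN = re.compile(r"\d+|\D+")


def tokenize(header: str) -> List[str]:
    """Split a header into digit and non-digit runs."""
    return _TOKEN.findall(header)


def diff_against_template(header: str, template: str) -> Optional[Dict[int, str]]:
    """Return {token index: new token} for fields that differ.

    Returns None when the token counts differ, in which case a caller
    would fall back to storing the header verbatim.
    """
    h, t = tokenize(header), tokenize(template)
    if len(h) != len(t):
        return None
    return {i: tok for i, (tok, ref) in enumerate(zip(h, t)) if tok != ref}


def apply_diff(template: str, diff: Dict[int, str]) -> str:
    """Rebuild a header from the template and the recorded differences."""
    tokens = tokenize(template)
    for i, tok in diff.items():
        tokens[i] = tok
    return "".join(tokens)


# Hypothetical headers: only the read counter changes.
template = "SRR001.1 length=8"
header = "SRR001.2 length=8"
diff = diff_against_template(header, template)
rebuilt = apply_diff(template, diff)
```

Because consecutive FASTQ headers usually share most of their fields, the recorded difference is often a single small token, which is then cheap to entropy-code.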
Figure 1. The general framework of G-FQZip
2017 13th International Conference on Computational Intelligence and Security
0-7695-6341-4/17/31.00 ©2017 IEEE
DOI 10.1109/CIS.2017.00128