The Burrows-Wheeler Algorithm: A Cornerstone of Data Compression


"这篇文档是Daniel Schiller在2012年8月5日关于Burrows-Wheeler算法的介绍，该算法由Michael Burrows和David Wheeler于1994年提出，基于David Wheeler在1983年的未发表工作。Burrows-Wheeler算法主要用于数据压缩，其过程包含多个连续的阶段，输入和输出都是任意大小的块，但块的大小必须在压缩过程开始时指定。解压缩时，可以恢复出原始数据。算法有多种变体，本文档主要讨论1994年研究报告中的原始算法结构。" Burrows-Wheeler算法是一种高效的数据压缩方法，它的工作原理和结构如下： 1. **旋转排序**： - 首先，算法对输入的文本进行旋转操作，形成一个新的排列。这个过程是通过将文本的每一个字符与它右边的字符进行比较，然后根据比较结果进行排序。例如，对于字符串"ABCBDAB"，经过旋转后可能会得到"BABCDA"或"DABCBA"等不同的排列。 2. **创建转换矩阵**： - 排序后的字符序列被转换成一个矩阵，通常是按行填充。这个矩阵的每一行都是原字符串的一个旋转版本。例如，"BABCDA"会变成矩阵`[BABCDA, AABCD, ABCD, CDBA, DABC]`。 3. **最大频率编码**： - 矩阵的每一列被看作是一个单独的字符，这些字符按照出现的频率进行排序，频率最高的字符放在最前面。然后，算法用较少的位来表示出现频率高的字符，从而实现压缩。例如，如果"A"是最频繁出现的字符，那么它可以用较少的位来编码。 4. **块排序**： - 最后，排序后的频率编码块被编码为输出的压缩数据流。这些块通常用一种称为行程长度编码（Run-Length Encoding, RLE）的技术处理，进一步减少冗余信息。在解压缩过程中，算法逆向执行这些步骤，首先解析出编码的块，然后按照频率编码还原矩阵，再进行逆向旋转排序，最终得到原始输入字符串。这个算法的创新之处在于它的可逆性和对文本模式的敏感性。由于压缩过程中保留了字符顺序信息，因此在解压缩时能精确地恢复原始数据。尽管不是所有数据都能得到很好的压缩效果，但对于包含重复模式的文本，如自然语言，Burrows-Wheeler算法往往表现出色。 Burrows-Wheeler算法是一种在数据压缩领域广泛应用的技术，尤其在生物信息学、文件存储和传输等领域，由于其效率和可恢复性，成为了一种重要的工具。然而，它也有一些缺点，比如对无规律或随机数据的压缩效率较低。此外，由于算法的复杂性，实际应用中往往需要结合其他压缩技术，如Huffman编码或LZ77，以达到更优的压缩比。


Step 2:

Index   F-column             L-column
0       A         M A P A    N
1       A         N A M A    P
2       A         P A N A    M
3       M         A P A N    A
4       N         A M A P    A
5       P         A N A M    A

Output: NPMAAA 5
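The sorted matrix and its output can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the implementation from the original report: the function name `bwt` is our own choice, and building every rotation explicitly is only practical for small blocks.

```python
def bwt(block):
    """Naive Burrows-Wheeler Transform of a string.

    Returns the L-column of the sorted rotation matrix and the index of
    the row that holds the original block. Illustrative only: it needs
    O(n^2) memory because every rotation is materialised.
    """
    if not block:
        return "", 0
    # Every cyclic rotation of the block, sorted lexicographically.
    rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
    # The L-column is the last character of each sorted row.
    l_column = "".join(row[-1] for row in rotations)
    # Index of the row equal to the original input.
    return l_column, rotations.index(block)


print(bwt("PANAMA"))  # ('NPMAAA', 5)
```

For the block "PANAMA" the sketch returns the L-column "NPMAAA" and the index 5, which agrees with the table.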

We see in our example that identical characters are closer together in the output than in the input. The output of this stage is the input for the next stage, the Move-To-Front Transform.
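Grouping the characters loses no information, because the L-column and the index are enough to rebuild the input. As a hedged illustration, the sketch below uses the well-known naive inversion (repeatedly prepend the L-column and sort again). The function name `inverse_bwt` is our own, and this quadratic method is chosen for readability, not speed.

```python
def inverse_bwt(l_column, index):
    """Rebuild the original block from the L-column and the row index.

    Naive method: prepending the L-column to the partially rebuilt rows
    and sorting, repeated n times, reconstructs the sorted rotation matrix.
    """
    n = len(l_column)
    if n == 0:
        return ""
    rows = [""] * n
    for _ in range(n):
        # Prepend the L-column character to each row, then re-sort.
        rows = sorted(l_column[i] + rows[i] for i in range(n))
    # The stored index selects the row that is the original input.
    return rows[index]


print(inverse_bwt("NPMAAA", 5))  # PANAMA
```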

4 Move-To-Front Transform

As mentioned previously, in this stage each character of the input is assigned a global index value. Therefore, in addition to the input, we have a global list Y. Normally, the global list Y contains all characters of the ASCII code in ascending order. We now look at the technique of the Move-To-Front Transform in detail.

Process steps:

1. Save the index value at which the first character of the input is located in the global list Y.

2. Move that character to index position 0 of the global list, and shift every character that was located before its old position one position to the right.

3. Repeat steps 1 and 2 for each remaining character of the input, in order, always using the global list as modified by the previous repetition.
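The three process steps can be sketched as follows. This is an illustrative Python sketch under our own assumptions: the function name `mtf_encode` and the optional `alphabet` parameter are hypothetical, and the default global list Y is taken to be the 128 ASCII characters in ascending order.

```python
def mtf_encode(text, alphabet=None):
    """Move-To-Front encode a string into a list of index values."""
    # Global list Y. Assumed default: the 128 ASCII characters, ascending.
    y = list(alphabet) if alphabet is not None else [chr(c) for c in range(128)]
    indices = []
    for ch in text:
        # Step 1: save the current index of the character in Y
        # (raises ValueError if the character is not in the list).
        pos = y.index(ch)
        indices.append(pos)
        # Step 2: move the character to position 0; everything that was
        # before its old position shifts one place to the right.
        y.insert(0, y.pop(pos))
        # Step 3: the loop continues with the modified list.
    return indices


print(mtf_encode("NPMAAA"))          # [78, 80, 79, 68, 0, 0]
print(mtf_encode("NPMAAA", "AMNP"))  # [2, 3, 3, 3, 0, 0]
```

A shorter global list only changes the numbers, not the procedure. In both runs the three adjacent A characters produce one index for the first A followed by two zeros.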

The output of this stage consists of all saved index positions and the index value of the row of the sorted matrix from the Burrows-Wheeler Transform that contains the original input. This index value is not processed by the Move-To-Front Transform.
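Since the row index passes through this stage unchanged, undoing the stage only has to decode the saved index positions. The sketch below is an assumption-laden illustration: the names `mtf_decode` and `undo_mtf_stage` are hypothetical, and the global list again defaults to the 128 ASCII characters in ascending order.

```python
def mtf_decode(indices, alphabet=None):
    """Turn Move-To-Front index values back into characters."""
    # Must start from the same global list Y that the encoder used.
    y = list(alphabet) if alphabet is not None else [chr(c) for c in range(128)]
    chars = []
    for pos in indices:
        ch = y.pop(pos)   # the character currently stored at this index
        chars.append(ch)
        y.insert(0, ch)   # mirror the encoder's move to the front
    return "".join(chars)


def undo_mtf_stage(indices, row_index):
    """Decode the index positions; hand the row index through untouched."""
    return mtf_decode(indices), row_index


print(undo_mtf_stage([78, 80, 79, 68, 0, 0], 5))  # ('NPMAAA', 5)
```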

We now explain the procedure step by step, using NPMAAA 5 (the output of the example from the Burrows-Wheeler Transform) as the example input. We
