3 Probability Coding
As mentioned in the introduction, coding is the job of taking probabilities for messages and gen-
erating bit strings based on these probabilities. How the probabilities are generated is part of the
model component of the algorithm, which is discussed in Section 4.
In practice we typically use probabilities for parts of a larger message rather than for the com-
plete message, e.g., each character or word in a text. To be consistent with the terminology in the
previous section, we will consider each of these components a message on its own, and we will
use the term message sequence for the larger message made up of these components. In general
each little message can be of a different type and come from its own probability distribution. For
example, when sending an image we might send a message specifying a color followed by mes-
sages specifying a frequency component of that color. Even the messages specifying the color
might come from different probability distributions since the probability of particular colors might
depend on the context.
We distinguish between algorithms that assign a unique code (bit-string) for each message, and
ones that “blend” the codes together from more than one message in a row. In the first class we
will consider Huffman codes, which are a type of prefix code. In the latter category we consider
arithmetic codes. The arithmetic codes can achieve better compression, but can require the encoder
to delay sending messages since the messages need to be combined before they can be sent.
3.1 Prefix Codes
A code C for a message set S is a mapping from each message to a bit string. Each bit string is
called a codeword, and we will denote codes using the syntax
C = {(s_1, w_1), (s_2, w_2), ..., (s_m, w_m)}.
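As a concrete illustration (the representation and names here are ours, not part of the text), a code can be held as a mapping from messages to codewords, and a message sequence is then encoded by concatenating the codewords:

```python
# A code C maps each message to a codeword (a bit string).
C = {"a": "1", "b": "01", "c": "000", "d": "001"}

def encode(code, messages):
    """Encode a message sequence by concatenating the codeword of each message."""
    return "".join(code[m] for m in messages)

print(encode(C, "abad"))  # -> 1011001
```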
Typically in computer science we deal with fixed-length codes, such as the ASCII code which maps
every printable character and some control characters into 7 bits. For compression, however, we
would like codewords that can vary in length based on the probability of the message. Such vari-
able length codes have the potential problem that if we are sending one codeword after the other
it can be hard or impossible to tell where one codeword finishes and the next starts. For
example, given the code {(a, 1), (b, 01), (c, 101), (d, 011)}, the bit-sequence 1011 could be
decoded as aba, ca, or ad. To avoid this ambiguity we could add a special stop symbol to the
end of each codeword (e.g., a 2 in a 3-valued alphabet), or send a length before each symbol.
These solutions, however, require sending extra data. A more efficient solution is to design codes
in which we can always uniquely decipher a bit sequence into its code words. We will call such
codes uniquely decodable codes.
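To see the ambiguity concretely, a small recursive sketch (our illustration, not from the text) can enumerate every way a bit string parses under the code from the example above:

```python
# The variable-length code from the example. It is not uniquely decodable:
# e.g. "1" (a) is a prefix of "101" (c), so parses can diverge.
CODE = {"a": "1", "b": "01", "c": "101", "d": "011"}

def decodings(bits, code=CODE):
    """Return every message sequence whose encoding equals `bits`."""
    if bits == "":
        return [""]
    results = []
    for msg, word in code.items():
        if bits.startswith(word):
            # Consume this codeword and recurse on the remaining bits.
            results += [msg + rest for rest in decodings(bits[len(word):], code)]
    return results

print(sorted(decodings("1011")))  # -> ['aba', 'ad', 'ca']
```

The three parses returned for 1011 are exactly the decodings aba, ca, and ad mentioned in the text.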
A prefix code is a special kind of uniquely decodable code in which no bit-string is a prefix
of another one, for example {(a, 1), (b, 01), (c, 000), (d, 001)}. All prefix codes are uniquely
decodable since once we get a match, there is no longer codeword that can also match.
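A sketch of why this works in practice (the decoder below is our own, under the stated prefix-code assumption): a greedy left-to-right scan can emit each message the moment its codeword matches, with no lookahead, because no longer codeword could also match.

```python
# The prefix code from the example: no codeword is a prefix of another.
PREFIX_CODE = {"a": "1", "b": "01", "c": "000", "d": "001"}

def decode(bits, code=PREFIX_CODE):
    """Greedily decode a bit string: the first codeword matched is the only match."""
    inverse = {w: m for m, w in code.items()}  # codeword -> message
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # unique match: emit the message and reset
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string does not end on a codeword boundary")
    return "".join(out)

print(decode("1011001"))  # -> abad
```

Note that each message is recovered as soon as its last bit arrives, which is the property discussed next: the decoder never needs to see the start of the following message.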
Exercise 3.1.1. Come up with an example of a uniquely decodable code that is not a prefix code.
Prefix codes actually have an advantage over other uniquely decodable codes in that we can
decipher each message without having to see the start of the next message. This is important when
sending messages of different types (e.g., from different probability distributions). In fact in certain