Data Compression Fundamentals: A Concise Introduction to Data Compression

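A Concise Introduction to Data Compression, by David Salomon, is an introductory textbook on data compression aimed at undergraduate computer science students. It covers the basic concepts and methods of data compression together with the details of several specific algorithms. Through explanations and worked examples, the author presents key concepts such as entropy and variable-length codes, and supplies many code fragments to help students understand and practice the material. Every part of the book comes with exercises, and answers are given at the end of the book, which makes it suitable for both classroom teaching and self-study.

The book surveys general compression methods and treats some widely used algorithms in depth, among them Huffman coding and arithmetic coding. These techniques serve mainly to reduce storage requirements and to make data transmission more efficient. Huffman coding, for example, is a prefix code based on symbol frequencies: frequently occurring symbols receive shorter codewords. Arithmetic coding instead encodes the data according to its probability distribution and can achieve more efficient compression.

The book is part of the Undergraduate Topics in Computer Science (UTiCS) series, whose titles present core material in a fresh, concise, and modern way and are suited to one- or two-semester courses. The series is written by experts in each field and reviewed by an international advisory board. For beginners, the book offers both a theoretical foundation in data compression and an emphasis on practical implementation.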


1.1 Variable-Length Codes

Often, a file of data to be compressed consists of data symbols drawn from an alphabet. At the time of writing (mid-2007) most text files consist of individual ASCII characters. The alphabet in this case is the set of 128 ASCII characters. A grayscale image consists of pixels, each coded as one number indicating a shade of gray. If the image is restricted to 256 shades of gray, then each pixel is represented by eight bits and the alphabet is the set of 256 byte values. Given a data file where the symbols are drawn from an alphabet, it can be compressed by replacing each symbol with a variable-length codeword. The obvious guiding principle is to assign short codewords to the common symbols and long codewords to the rare symbols.
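As a rough illustration of this guiding principle, the sketch below compares a fixed-length code with a variable-length one on a four-symbol alphabet. The symbols, probabilities, and codewords are made up for the example and do not come from the text.

```python
# Hypothetical four-symbol alphabet; probabilities and codewords are
# illustrative assumptions, chosen so that common symbols get short codewords.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
fixed_code = {"a": "00", "b": "01", "c": "10", "d": "11"}
variable_code = {"a": "0", "b": "10", "c": "110", "d": "111"}


def average_length(code, probs):
    """Expected number of bits per symbol under the given code."""
    return sum(probs[s] * len(code[s]) for s in probs)


print(average_length(fixed_code, probabilities))     # 2.0
print(average_length(variable_code, probabilities))  # 1.75
```

With these assumed probabilities the variable-length code needs 1.75 bits per symbol on average instead of 2.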

In data compression, the term code is often used for the entire set, while the individual codes are referred to as codewords.

Variable-length codes (VLCs for short) are used in several real-life applications, not just in data compression. The following is a short list of applications where such codes play important roles.

The Morse code for telegraphy, originated in the 1830s by Samuel Morse and Alfred Vail, employs the same idea. It assigns short codes to commonly occurring letters (the code of E is a dot and the code of T is a dash) and long codes to rare letters and punctuation marks (--.- to Q, --.. to Z, and --..-- to the comma).

Processor design. Part of the architecture of any computer is an instruction set and a processor that fetches instructions from memory and executes them. It is easy to handle fixed-length instructions, but modern computers normally have instructions of different sizes. It is possible to reduce the overall size of programs by designing the instruction set such that commonly used instructions are short. This also reduces the processor's power consumption and physical size and is especially important in embedded processors, such as processors designed for digital signal processing (DSP).

Country calling codes. ITU-T recommendation E.164 is an international standard that assigns variable-length calling codes to many countries such that countries with many telephones are assigned short codes and countries with fewer telephones are assigned long codes. These codes also obey the prefix property (page 28), which means that once a calling code C has been assigned, no other calling code will start with C.

The International Standard Book Number (ISBN) is a unique number assigned to a book to simplify inventory tracking by publishers and bookstores. ISBNs are assigned according to an international standard known as ISO 2108 (1970). One component of an ISBN is a country code that can be between one and five digits long. This code also obeys the prefix property. Once C has been assigned as a country code, no other country code will start with C.

VCR Plus+ (also known as G-Code, VideoPlus+, and ShowView) is a prefix, variable-length code for programming video recorders. A unique number, a VCR Plus+, is computed for each television program by a proprietary algorithm from the date, time, and channel of the program. The number is published in television listings in newspapers and on the Internet. To record a program on a VCR, the number is located in a newspaper and is typed into the video recorder. This programs the recorder to record


probabilities of the symbols are computed and are used to determine the set of variable-length codes that will be assigned to the symbols. This set is written on the compressed file and the encoder starts the second pass. In this pass it again reads the entire input file and compresses it by replacing each symbol with its codeword. This method provides very good results because it uses the correct probabilities for each data file. The table of codewords must be included in the output file, but this table is small (typically a few hundred codewords written on the output consecutively, with no separators between codes). The downside of this approach is its low speed. Currently, even the fastest magnetic disks are considerably slower than memory and CPU operations, which is why reading the input file twice normally results in unacceptably slow execution. Notice that the decoder is simple and fast because it does not need two passes. It starts by reading the code table from the compressed file, following which it reads variable-length codes and replaces each with its original symbol.
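The two-pass idea can be sketched as follows. This is a minimal illustration, not the book's implementation: it assumes the input is a non-empty string held in memory, represents bits as a string of "0" and "1" characters, and uses Huffman's algorithm (mentioned elsewhere in the text) to build the code table. The function names `build_code`, `two_pass_encode`, and `decode` are hypothetical.

```python
import heapq
from collections import Counter


def build_code(freqs):
    """Build a prefix code from symbol counts (Huffman's algorithm, assumed here)."""
    # Heap entries: (weight, tie_breaker, {symbol: partial codeword}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                       # a one-symbol file still needs one bit
        return {s: "0" for s in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)      # two least probable subtrees
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + cw for s, cw in c0.items()}
        merged.update({s: "1" + cw for s, cw in c1.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]


def two_pass_encode(data):
    freqs = Counter(data)                    # pass 1: count the symbols
    code = build_code(freqs)
    bits = "".join(code[s] for s in data)    # pass 2: replace symbols by codewords
    return code, bits                        # the table travels with the output


def decode(code, bits):
    """Single pass: read the table, then map codewords back to symbols."""
    inverse = {cw: s for s, cw in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:               # prefix code: first match is a codeword
            out.append(inverse[current])
            current = ""
    return "".join(out)


code, bits = two_pass_encode("abracadabra")
print(decode(code, bits) == "abracadabra")   # True
```

The encoder touches the data twice, while the decoder needs only the table and one pass over the bits, which matches the asymmetry described above.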

Use a set of training documents. The first step in implementing fast software for text compression may be to select texts that are judged "typical" and employ them to "train" the algorithm. Training consists of counting symbol frequencies in the training documents, computing the distribution of symbols, and assigning them variable-length codes. The code table is then built into both encoder and decoder and is later used to compress and decompress various texts. An important example of the use of training documents is facsimile compression (page 86). The success of such software depends on how "typical" the training documents are.

It is unlikely that a set of documents will be typical for all kinds of text, but such a set can perhaps be found for certain types of texts. A case in point is facsimile compression. Documents sent on telephone lines between fax machines have to be compressed in order to cut the transmission times from 10–11 minutes per page to about one minute. The compression method must be an international standard because fax machines are made by many manufacturers, and such a standard has been developed (Section 2.4). It is based on a set of eight training documents that have been selected by the developers and include a typed business letter, a circuit diagram, a French technical article with figures and equations, a dense document in Kanji, and a handwritten memo.

Another application of training documents is found in image compression. Researchers trying to develop methods for image compression have long noticed that pixel differences in images tend to be distributed according to the well-known Laplace distribution (by a pixel difference is meant the difference between a pixel and an average of its nearest neighbors).

An adaptive algorithm. Such an algorithm does not assume anything about the distribution of the symbols in the data file to be compressed. It starts "with a blank slate" and adapts itself to the statistics of the input file as it reads and compresses more and more symbols. The data symbols are replaced by variable-length codewords, but these codewords are modified all the time as more is known about the input data. The algorithm has to be designed such that the decoder would be able to modify the codewords in precisely the same way as the encoder. We say that the decoder has to work in lockstep with the encoder. The best known example of such a method is the adaptive (or dynamic) Huffman algorithm (Section 2.3).
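The lockstep requirement can be illustrated with a toy scheme. The sketch below is not the adaptive Huffman algorithm; it is a much simpler stand-in, assumed here only for illustration, in which each symbol is coded by its current frequency rank written in unary (rank r becomes r ones followed by a zero). The class and function names are hypothetical, and the alphabet is assumed to be known to both sides in advance.

```python
class AdaptiveModel:
    """Symbol counts that encoder and decoder update in exactly the same way."""

    def __init__(self, alphabet):
        self.counts = {s: 0 for s in alphabet}   # blank slate: no statistics yet

    def ranking(self):
        # Most frequent symbol first; ties broken by symbol so both sides agree.
        return sorted(self.counts, key=lambda s: (-self.counts[s], s))

    def update(self, symbol):
        self.counts[symbol] += 1


def encode(data, alphabet):
    model, bits = AdaptiveModel(alphabet), []
    for s in data:
        rank = model.ranking().index(s)
        bits.append("1" * rank + "0")    # unary codeword for the current rank
        model.update(s)                  # adapt only after the symbol is coded
    return "".join(bits)


def decode(bits, alphabet):
    model, out, rank = AdaptiveModel(alphabet), [], 0
    for b in bits:
        if b == "1":
            rank += 1
        else:                            # a zero terminates the unary codeword
            s = model.ranking()[rank]
            out.append(s)
            model.update(s)              # same update at the same moment
            rank = 0
    return "".join(out)


message = "abacabadabba"
print(decode(encode(message, "abcd"), "abcd") == message)   # True
```

The codeword assigned to a symbol changes as the counts change, yet no table is transmitted: the decoder rebuilds the same ranking because it applies the same update after every decoded symbol.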


Exercise 1.3: Compare the three different approaches (two-pass, training, and adaptive compression algorithms) and list some of the pros and cons for each.

Several variable-length codes are listed and described later in this section, and the discussion shows how the average code length can be used to determine the statistical distribution to which the code is best suited.

The second consideration in the design of a variable-length code is unique decodability (UD). We start with a simple example: the code $a_1 = 0$, $a_2 = 10$, $a_3 = 101$, and $a_4 = 111$. Encoding the string $a_1a_3a_4\ldots$ with these codewords results in the bitstring 0101111.... However, decoding is ambiguous. The same bitstring 0101111... can be decoded either as $a_1a_3a_4\ldots$ or $a_1a_2a_4\ldots$. This code is not uniquely decodable. In contrast, the similar code $a_1 = 0$, $a_2 = 10$, $a_3 = 110$, and $a_4 = 111$ (where only the codeword of $a_3$ is different) is UD. The string $a_1a_3a_4\ldots$ is easily encoded to 0110111..., and this bitstring can be decoded unambiguously. The first 0 implies $a_1$, because only the codeword of $a_1$ starts with 0. The next (second) bit, 1, can be the start of $a_2$, $a_3$, or $a_4$. The next (third) bit is also 1, which reduces the choice to $a_3$ or $a_4$. The fourth bit is 0, so the decoder emits $a_3$.

A little thinking clarifies the difference between the two codes. The first code is ambiguous because 10, the code of $a_2$, is also the prefix of the code of $a_3$. When the decoder reads 10..., it often cannot tell whether this is the codeword of $a_2$ or the start of the codeword of $a_3$. The second code is UD because the codeword of $a_2$ is not the prefix of any other codeword. In fact, none of the codewords of this code is the prefix of any other codeword.
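The ambiguity can be checked by brute force. The sketch below enumerates every way of splitting a bitstring into codewords for the two codes of the example. The function name `parses` and the symbol labels "a1" to "a4" are choices made for this illustration. One extra 0 is appended to the first bitstring, an assumption about how the stream continues, so that both readings end on a codeword boundary.

```python
def parses(bits, code):
    """Return every way to split `bits` into codewords of `code`."""
    if not bits:
        return [[]]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            for rest in parses(bits[len(word):], code):
                results.append([symbol] + rest)
    return results


code_1 = {"a1": "0", "a2": "10", "a3": "101", "a4": "111"}   # not UD
code_2 = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}   # prefix code

print(parses("01011110", code_1))
# [['a1', 'a2', 'a4', 'a2'], ['a1', 'a3', 'a4', 'a1']]
print(parses("0110111", code_2))
# [['a1', 'a3', 'a4']]
```

The first code yields two valid readings of the same bits, while the second yields exactly one.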

This observation suggests the following rule. To construct a UD code, the codewords should satisfy the following prefix property. Once a codeword c is assigned to a symbol, no other codeword should start with the bit pattern c. Prefix codes are also referred to as prefix-free codes, prefix condition codes, or instantaneous codes. Observe, however, that a UD code does not have to be a prefix code. It is possible, for example, to designate the string 111 as a separator (a comma) to separate individual codewords of different lengths, provided that no codeword contains the string 111. There are other ways to construct a set of non-prefix, variable-length codes.
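The prefix property is easy to test mechanically. The following sketch checks a list of codewords, given as bit strings; the function name `has_prefix_property` is hypothetical, and the pairwise comparison is chosen for clarity rather than speed.

```python
def has_prefix_property(codewords):
    """True if no codeword is the start of a different codeword."""
    return not any(
        a != b and b.startswith(a)
        for a in codewords
        for b in codewords
    )


print(has_prefix_property(["0", "10", "110", "111"]))  # True
print(has_prefix_property(["0", "10", "101", "111"]))  # False: 10 starts 101
```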

A UD code is said to be instantaneous if it is possible to decode each codeword in a compressed file without knowing the succeeding codewords. Prefix codes are instantaneous.
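A small decoder makes the word "instantaneous" concrete. In the sketch below, which assumes bits arrive one at a time as characters and uses a hypothetical function name `decode_instantly`, a symbol is emitted the moment the bits read so far form a codeword, with no lookahead.

```python
def decode_instantly(bits, code):
    """Yield each symbol as soon as its codeword is complete."""
    inverse = {word: symbol for symbol, word in code.items()}
    buffer = ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:        # safe only because `code` is a prefix code
            yield inverse[buffer]
            buffer = ""


code = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}
print(list(decode_instantly("0110111", code)))   # ['a1', 'a3', 'a4']
```

The early emission is correct only for prefix codes: if some codeword were the start of another, the first match might be the wrong one.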

Constructing a UD code for a given finite set of data symbols should start with the probabilities of the symbols. If the probabilities are known (at least approximately), then the best variable-length code for the symbols is obtained by the Huffman algorithm (Chapter 2). There are, however, applications where the set of data symbols is unbounded; its size is either extremely large or is not known in advance. Here are a few practical examples of both cases:

Text. There are 128 ASCII codes, so the size of this set of symbols is reasonably small. In contrast, the number of Unicode characters is in the tens of thousands, which makes it impractical to use variable-length codes to compress text in Unicode; a different approach is required.

A grayscale image. For 8-bit pixels, the number of shades of gray is 256, so a set of 256 codewords is required: large, but not too large.


Pixel prediction. If a pixel is represented by 16 or 24 bits, it is impractical to compute probabilities and prepare a huge set of codewords. A better approach is to predict a pixel from several of its near neighbors, subtract the prediction from the pixel value, and encode the resulting difference. If the prediction is done properly, most differences will be small (signed) integers, but some differences may be large (positive or negative), and a few may be as large as the pixel value itself (typically 16 or 24 bits). In such a case, a code for the integers is the best choice. Each integer has a codeword assigned that can be computed on the fly. The codewords for the small integers should be small, but the lengths should depend on the distribution of the difference values.
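The prediction step can be sketched on a single row of pixels. The predictor used here, the integer average of up to two preceding pixels in the same row, is an assumption made for the illustration; the text only says that several near neighbors are used. The function names and the sample values are likewise hypothetical, and the row is assumed to be non-empty.

```python
def predict(previous):
    """Integer average of the neighbors already known to the decoder."""
    return sum(previous) // len(previous)


def prediction_residuals(row):
    """Replace each pixel by its difference from the prediction."""
    residuals = [row[0]]                 # first pixel has no neighbors; kept as is
    for i in range(1, len(row)):
        residuals.append(row[i] - predict(row[max(0, i - 2):i]))
    return residuals


def reconstruct(residuals):
    """Invert prediction_residuals by repeating the same predictions."""
    row = [residuals[0]]
    for i in range(1, len(residuals)):
        row.append(residuals[i] + predict(row[max(0, i - 2):i]))
    return row


row = [1000, 1002, 1003, 1003, 1001, 4000]   # 16-bit samples with one sharp edge
print(prediction_residuals(row))             # [1000, 2, 2, 1, -2, 2998]
print(reconstruct(prediction_residuals(row)) == row)   # True
```

Most residuals are small signed integers, while the edge produces one large value, which is the situation a variable-length code for the integers is meant to handle.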

Audio compression. Audio samples are almost always correlated, which is why many audio compression methods predict an audio sample from its predecessors and encode the difference with a variable-length code for the integers.

Any variable-length code for integers should satisfy the following requirements:

1. Given an integer n, its code should be as short as possible and should be constructed from the magnitude, length, and bit pattern of n, without the need for any table lookups or other mappings.

2. Given a bitstream of variable-length codes, it should be easy to decode the next code and obtain an integer n even if n hasn't been seen before.

Quite a few VLCs for integers are known. Many of them include part of the binary representation of the integer, while the rest of the codeword consists of side information indicating the length or precision of the encoded integer.
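One well-known code with this structure is the Elias gamma code, used here purely as an illustration; this excerpt does not name it. Its codeword for a positive integer n is the binary representation of n preceded by as many zeros as that representation has bits after the first. The function names and the string-of-characters bit representation are assumptions of this sketch.

```python
def gamma_encode(n):
    """Elias gamma codeword of a positive integer n, as a bit string."""
    if n < 1:
        raise ValueError("the gamma code is defined for positive integers only")
    binary = bin(n)[2:]                       # standard binary, always starts with 1
    return "0" * (len(binary) - 1) + binary   # side information, then the bits of n


def gamma_decode(bits, pos=0):
    """Decode one codeword starting at bits[pos]; return (n, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":           # count the length prefix
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2), end


print([gamma_encode(n) for n in (1, 2, 5, 9)])
# ['1', '010', '00101', '0001001']

stream = gamma_encode(1) + gamma_encode(2) + gamma_encode(5)
pos, decoded = 0, []
while pos < len(stream):
    n, pos = gamma_decode(stream, pos)
    decoded.append(n)
print(decoded)   # [1, 2, 5]
```

Both requirements listed above are met: the codeword is computed directly from n without a table, and the decoder can recover an integer it has never seen before.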

The following sections describe popular variable-length codes (the Intermezzo on page 253 describes one more), but first, a few words about notation. It is customary to denote the standard binary representation of the integer n by β(n). This representation can be considered a code (the beta code), but this code does not satisfy the prefix property and also has a fixed length. (It is easy to see that the beta code does not satisfy the prefix property because, for example, $2 = 10_2$ is the prefix of $4 = 100_2$.)

Given a set of integers between 0 and n, we can represent each in

$$1 + \lfloor \log_2 n \rfloor = \lceil \log_2 (n+1) \rceil \tag{1.1}$$

bits, a fixed-length representation. When n is represented in any other number base b, its length is given by the same expression, but with the logarithm in base b instead of 2.
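The bit count of Equation (1.1) can be checked numerically. The sketch below reads the two sides as 1 + floor(log2 n) and ceil(log2(n + 1)) and compares both with Python's built-in `int.bit_length()` for n from 1 to 1024. The range is kept small on purpose, since floating-point logarithms are used.

```python
import math


def bits_floor(n):
    """Left-hand side of (1.1): 1 + floor(log2 n)."""
    return 1 + math.floor(math.log2(n))


def bits_ceil(n):
    """Right-hand side of (1.1): ceil(log2(n + 1))."""
    return math.ceil(math.log2(n + 1))


agree = all(bits_floor(n) == bits_ceil(n) == n.bit_length() for n in range(1, 1025))
print(agree)                          # True
print(bits_floor(255), bits_floor(256))   # 8 9
```

For example, every integer up to 255 fits in 8 bits, and 256 is the first value that needs 9.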

A VLC that can code only positive integers can be extended to encode nonnegative integers by incrementing the integer before it is encoded and decrementing the result produced by decoding. A VLC for arbitrary integers can be obtained by a bijection, a mapping of the form

$$
\begin{array}{rrrrrrrrrrrr}
0 & -1 & 1 & -2 & 2 & -3 & 3 & -4 & 4 & -5 & 5 & \cdots \\
1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & \cdots
\end{array}
$$

A function is bijective if it is one-to-one and onto.
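The mapping shown above sends 0, −1, 1, −2, 2, ... to 1, 2, 3, 4, 5, ..., and can be written in closed form. The sketch below gives one such form together with its inverse; the function names `to_positive` and `from_positive` are hypothetical.

```python
def to_positive(n):
    """Map an arbitrary integer to a positive one: 0, -1, 1, -2, 2 -> 1, 2, 3, 4, 5."""
    return 2 * n + 1 if n >= 0 else -2 * n


def from_positive(m):
    """Inverse mapping: odd m came from a nonnegative n, even m from a negative n."""
    return (m - 1) // 2 if m % 2 == 1 else -(m // 2)


values = [0, -1, 1, -2, 2, -3, 3, -4, 4, -5, 5]
print([to_positive(n) for n in values])
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print([from_positive(to_positive(n)) for n in values] == values)   # True
```

A VLC for positive integers applied to `to_positive(n)` then encodes any integer n, and the decoder applies `from_positive` to the decoded value.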
