Digital Media Processing: DSP Algorithms Using C

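"Digital Media Processing: DSP Algorithms Using C" is a book by Hazarathaiah Malepati on implementing digital media processing and digital signal processing (DSP) algorithms in C. It was published in 2010 by Newnes, an imprint of Elsevier. All rights are reserved: no part of the book may be reproduced or transmitted in any form, or used in any information storage and retrieval system, without the publisher's express written permission. Details on permissions are available on the Elsevier website. The book's content may cover the following core topics:

1. **Digital media processing**: the discipline of manipulating and analyzing audio, video, and other multimedia data in digital form. It may cover image and audio compression, encoding and decoding, digital filtering, signal enhancement, and noise removal.
2. **Digital signal processing (DSP) algorithms**: a major area of signal processing that applies mathematical operations to signals in the digital domain. Common DSP algorithms include the fast Fourier transform (FFT), filter design (such as IIR and FIR filters), spectral analysis, signal synchronization, sampling theory, and time-series analysis.
3. **C programming**: C is the basis for implementing these algorithms, chosen for its efficiency, flexibility, and broad support. Readers may learn how to write and optimize DSP code in C, including pointers, memory management, function calls, and how to use C language features for high-performance computing.
4. **Examples and applications**: the book may provide practical cases showing how these algorithms apply to audio processing, video coding, communication systems, medical imaging, and biomedical signal processing.
5. **Hardware interfaces**: in some cases, the book may discuss how to integrate C programs with hardware devices (such as DSP chips or GPUs) for efficient parallel processing.
6. **Performance optimization**: because DSP usually handles large amounts of data, the book may cover code optimization techniques such as loop unrolling, vectorization, and parallelization to improve processing speed and efficiency.
7. **Development tools and environments**: the book may introduce software tools for developing and debugging DSP applications, such as IDEs, compilers, and debuggers, as well as simulators and hardware emulators.
8. **Standards and libraries**: the book may touch on open-source libraries and standards such as OpenCV and FFmpeg, which are frequently used in digital media processing and DSP.
9. **Educational and research value**: the book suits professionals and may also suit university or graduate-level courses, combining theoretical knowledge with practical skills.

"Digital Media Processing: DSP Algorithms Using C" is an accessible tutorial intended to help readers understand and implement complex algorithms in digital media processing and signal processing, while improving their programming skills through practical use of C.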

Introduction 5

Table 1.1: Digital media processing applications

| Digital Home | Telecommunications | Consumer Electronics |
|---|---|---|
| AV receivers | ADSL/VDSL | Digital camera |
| DVD/Blu-Ray players | Cable modems | Portable media players |
| TV/desktop audio/video | Wire/wireless smart phones | Portable DVD players |
| Sound bar | IP phone | Digital video recorder |
| Digital picture frame | Femto base stations | Personal GPS navigation |
| Video telephony | Software defined radio | Mobile TV |
| IP TV, IP phone, IP camera | WLAN, WiFi, WiMAX | Bluetooth |
| Door phone | Mobile TV | HD/ANC headphones |
| Smoke detector | Radar/sonar | Video game players |
| Network video recorder | Power line communication | Digital music instruments |
| CD clock radio | Video conferencing | |
| FM/satellite radio | | |

| Automotives | Industrial | Security | Medical |
|---|---|---|---|
| Advanced driver assistance | Power meter | Surveillance IP networks | Ultrasound |
| Automotive infotainment | Motor control | Fingerprint biometrics | CT, MRI, PET |
| Digital audio/satellite radio | Active noise cancellation | Video doorbell | Digital x-ray |
| Vision control | Barcode scanner | Video analytic server | Pulse oximetry |
| Bluetooth hands-free phone | Flow meter | | Digital stethoscope |
| Electronic stability control | Oscilloscope | | Blood-pressure monitor |
| Safety/airbag control | | | Lab diagnostic equipment |
| Crash detection | | | Heart rate monitor |

discussed. The necessity of software-hardware partitioning of embedded systems to handle complex applications is discussed, as well as possible ways to efficiently partition such a system. Finally, we discuss future embedded processor requirements to handle very complex embedded applications.

Chapter 17 (see companion website) briefly discusses various applications. Different embedded applications use different algorithms. The processing power and memory requirements vary from one application to another. We briefly talk about various modules present in a few embedded application sectors. The applications covered in this chapter include automotive, video surveillance, portable entertainment systems, digital communications, digital camera, and immigration and healthcare sectors.

1.4 Algorithm Implementation on DSP Architectures

In Section 1.2, various algorithms that are playing a critical role in diverse applications were mentioned. Although dozens of semiconductor companies are designing embedded processors with a range of architectural features to support different kinds of applications, no single architecture is efficient for processing all types of digital media processing algorithms. This is because processors designed with many pipeline stages (to execute in parallel multiple operations of numeric-intensive algorithms) do not efficiently handle algorithms that contain full-control operations. Conversely, the architectures developed for executing control code are not efficient at computing numeric-intensive algorithms. The architectural feature set of the reference embedded processor (see Appendix A on the companion website) is in between, and is good at handling both control and numeric-intensive algorithms. In the following subsections, DSP architecture and its performance in executing various algorithms are briefly discussed. We also briefly describe a few algorithm implementation techniques.

6 Chapter 1

1.4.1 DSP Architecture

A simplified block diagram of embedded DSP architecture is shown in Figure 1.1. The main architectural blocks of an embedded processor are the processor core (with register sets, ALU, data address generator [DAG], sequencer, etc.), memory (for holding instructions and data, for stack space, etc.), peripherals (e.g., serial peripheral interface [SPI], parallel peripheral interface [PPI], serial ports [SPORT], general-purpose timers, universal asynchronous receiver transmitter [UART], watchdog timer, and general-purpose I/O), and a few others (e.g., JTAG emulator, event controller, direct memory access [DMA] controller). Embedded processor peripherals and memory architectures are discussed in some detail in Chapter 16.

The peripheral features are important when we talk about the overall application. In this book, we assume that the architecture comes with all necessary peripherals to enable a particular application. Also, we assume that the program code and data required for algorithm processing reside in the faster (level 1, or L1) memory, which can be accessed at the speed of the processor core. If we cannot fit data and program in L1 memory, then we store the extra data or program in L2/L3 memory and use DMA to get the data or program from L2/L3 memory without interrupting the processor core. From an algorithm-implementation point of view, the important things are processor core architecture, availability of L1 memory, and internal bus bandwidth.
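One common way to use DMA as described above is ping-pong (double) buffering: while the core processes one L1 buffer, the next block is fetched into a second one. The following is only a minimal sketch under stated assumptions. `BLOCK_LEN`, `dma_fetch`, and `sum_with_pingpong` are hypothetical names, and `memcpy` stands in for a DMA transfer that real hardware would run concurrently with the computation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_LEN 64  /* hypothetical L1 block size, in samples */

/* Stand-in for a DMA transfer from L2/L3 into L1 (assumption: on real
   hardware this would run in the background while the core computes). */
static void dma_fetch(int16_t *l1_dst, const int16_t *ext_src, size_t count) {
    memcpy(l1_dst, ext_src, count * sizeof *l1_dst);
}

/* Sum a large external array by streaming it through two small L1 buffers. */
static int32_t sum_with_pingpong(const int16_t *ext, size_t total) {
    static int16_t l1_buf[2][BLOCK_LEN];
    int32_t acc = 0;
    size_t done = 0;
    int cur = 0;

    assert(ext != NULL);
    if (total == 0)
        return 0;

    size_t cur_len = total < BLOCK_LEN ? total : BLOCK_LEN;
    dma_fetch(l1_buf[cur], ext, cur_len);            /* prime the first buffer */

    while (cur_len > 0) {
        size_t next_off = done + cur_len;
        size_t remaining = total - next_off;
        size_t next_len = remaining < BLOCK_LEN ? remaining : BLOCK_LEN;

        if (next_len > 0)                            /* start fetching the next block */
            dma_fetch(l1_buf[cur ^ 1], ext + next_off, next_len);

        for (size_t i = 0; i < cur_len; i++)         /* process the current block */
            acc += l1_buf[cur][i];

        done = next_off;
        cur ^= 1;                                    /* swap buffer roles */
        cur_len = next_len;
    }
    return acc;
}
```

Calling `sum_with_pingpong(data, n)` gives the same result as summing `data` directly; only the place the core reads from changes.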

Even more important than getting data into (or sending it out from) the processor is the structure of the memory subsystem that handles the data during processing. It is essential that the processor core access data in memory at rates fast enough to meet application demands. L1 memory is often split between instruction and data segments for efficient utilization of memory bus bandwidth. Most DSP architectures support this Harvard-like architecture (in which data and instruction memories are accessed simultaneously, as shown in Figure 1.1) in combination with a hierarchical memory structure that views memory as a single, unified 4-gigabyte address space using 32-bit addresses. All resources, including internal memory, external memory, and I/O control registers, occupy separate sections of this common address space.

The register file contains different register types (e.g., data registers, accumulators, address registers) to hold the information temporarily for ALU processing or for memory load/store purposes. The processor's computational units perform numeric processing for DSP algorithms and general control algorithms. Data moving in and out of the computational units go through the data register file. The processor's assembly language provides access to the data register file. The syntax lets programs move data to and from these registers and specify a computation's data format at the same time.

The DAGs generate addresses for data moving to and from memory. By generating addresses, the DAGs let programs refer to addresses indirectly using a DAG register instead of an absolute address.

The program sequencer controls the instruction execution flow, including instruction alignment and decoding. The program sequencer determines the next instruction address by examining both the current instruction being executed and the current state of the processor. Generally, the processor executes instructions from memory in sequential order by incrementing the look-ahead address. However, when encountering one of the following structures, the processor will execute an instruction that is not at the next sequential address: jumps, conditional branches, function calls, interrupts, loops, and so on.

[Figure 1.1: Simplified diagram of DSP architecture. Blocks shown: DSP core (registers, ALU unit, DAG unit, sequencer), data memory, instruction memory, and peripherals.]


In the next subsection, we consider three algorithms with different processing flow requirements and discuss to what extent the benchmarks provided by processor manufacturers are useful in deciding which processor (from dozens of processors available today in the market) is suitable for a particular application.

1.4.2 Algorithm Complexity and DSP Performance

In this section, we consider three simple algorithms (dot product, RC4 stream cipher, and the H.264 CABAC encode-symbol-normalization process) and discuss embedded processor performance (with a particular architectural feature set) in executing those three algorithms.

Dot Product

Dot product involves accumulation of sample-by-sample multiplication of elements from two sample arrays. The dot product, z, of two N-length sample arrays x[] and y[] can be computed as

$$ z = \sum_{n=0}^{N-1} x[n]\,y[n] \tag{1.1} $$

A simple "for" loop C code that implements the dot product described by Equation (1.1) is shown in Pcode 1.1.

What is the cost (in terms of cycles and memory) of this dot-product algorithm for implementation on the embedded processor, given its processor core architecture? Clearly, we require two buffers of 2*N bytes each (assuming the elements are of the 16-bit word type) to hold the two input arrays in memory.

In terms of computations, it involves N multiplications and N additions. If the embedded processor consumes one cycle for multiplication and one cycle for addition, then we require a total of 2N cycles (assuming a single ALU) to execute the corresponding dot-product code given in Pcode 1.1. What about the cycle cost of loading the data from memory to the data registers? Typically, many processors come with separate data load/store units; hence, we assume that the data loads happen in parallel with the compute operations and are therefore free.

z = 0;
for (i = 0; i < N; i++)
    z += x[i] * y[i];

Pcode 1.1: Pseudo code for dot product.
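As a runnable counterpart to Pcode 1.1, here is a minimal sketch. It assumes 16-bit input samples, as in the memory estimate above, and a 64-bit accumulator so that long arrays cannot overflow. The function name `dot_product` and the accumulator width are illustrative choices, not taken from the book.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two 16-bit sample arrays of length n.
   Each product is formed in 32 bits and accumulated in 64 bits. */
static int64_t dot_product(const int16_t *x, const int16_t *y, size_t n) {
    int64_t z = 0;

    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        z += (int32_t)x[i] * y[i];   /* one multiply-accumulate per sample */
    return z;
}
```

For example, with `x = {1, 2, 3, 4}` and `y = {5, 6, 7, 8}`, the function returns 70.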

Many embedded processors come with multiply-accumulate (MAC) units, and in this case we require only N cycles, as the dot product contains a total of N MAC operations. For this case, the two memory loads must happen in a single cycle.

Now, you may wonder whether this cycle count can be achieved with the C code ported to the processor assembly using the compiler or with the optimized assembly-level code written manually. Here, when we say that the cycle count is N for executing the dot product, it means that one MAC operation is mapped to a single processor instruction, which consumes exactly one cycle; only then can we describe the cycle count as N cycles for N MAC operations.

Is this the final cycle count for computing the dot product? Not exactly. In the dot-product case, it also depends on the number of MAC units that the processor comes with. For example, if the processor consists of four MAC units, then we require only N/4 cycles to complete the dot product. How is this possible? It is possible because we can execute four MAC operations in parallel on a four-MAC processor, as the dot product has no flow dependencies. However, we will have a problem with the data load unless we load 128 bits (four 16-bit words from array x[] and another four 16-bit words from array y[]) of data to eight 16-bit registers in a single cycle.
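The cycle and bandwidth arithmetic in this discussion can be written down as a small helper. This is only an idealized sketch: the names `dot_cycles` and `load_bits_per_cycle` are hypothetical, data loads are assumed to be free as stated earlier, and rounding up when N is not a multiple of the number of MAC units is an added assumption.

```c
#include <assert.h>

/* Idealized cycle estimate for an n-point dot product.
   mac_units == 0 models a core with separate one-cycle multiply and add;
   otherwise each MAC unit completes one multiply-accumulate per cycle. */
static unsigned long dot_cycles(unsigned long n, unsigned mac_units) {
    if (mac_units == 0)
        return 2UL * n;                          /* N multiplies + N adds */
    return (n + mac_units - 1) / mac_units;      /* ceil(N / mac_units) */
}

/* Bits that must be loaded per cycle to keep every MAC unit fed:
   one sample from each of the two input arrays per MAC unit. */
static unsigned load_bits_per_cycle(unsigned mac_units, unsigned sample_bits) {
    assert(mac_units > 0 && sample_bits > 0);
    return 2u * mac_units * sample_bits;
}
```

For N = 1024 this gives 2048 cycles without a MAC unit, 1024 with one MAC unit, and 256 with four, and `load_bits_per_cycle(4, 16)` reproduces the 128-bit load requirement mentioned in the text.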

For efficient compilation to run on a four-MAC processor, we unroll the dot-product loop in Pcode 1.1 by four times and reduce the loop count by a factor of 4 as shown in Pcode 1.2. Given that the dot product is a simple algorithm, most compilers can efficiently map the C code to the assembly language so that the difference between cycle estimation and actual cycles measured is negligible.


z1 = 0; z2 = 0; z3 = 0; z4 = 0;
for (i = 0; i < N; i += 4) {         // N/4 iterations; N assumed to be a multiple of 4
    z1 += x[i] * y[i];               // MAC unit 1
    z2 += x[i + 1] * y[i + 1];       // MAC unit 2
    z3 += x[i + 2] * y[i + 2];       // MAC unit 3
    z4 += x[i + 3] * y[i + 3];       // MAC unit 4
}
z = z1 + z2 + z3 + z4;

Pcode 1.2: Pseudo code for dot product with loop unrolling four times.
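A runnable C version of the four-way unrolled loop might look like the sketch below. It steps through all N elements four at a time with four independent accumulators, and adds a cleanup loop for lengths that are not a multiple of 4, a case Pcode 1.2 does not cover. The name `dot_product_unrolled4` and the 64-bit accumulators are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four-way unrolled dot product. The four accumulators have no dependencies
   on one another, so a four-MAC core could update them in the same cycle. */
static int64_t dot_product_unrolled4(const int16_t *x, const int16_t *y, size_t n) {
    int64_t z1 = 0, z2 = 0, z3 = 0, z4 = 0;
    size_t i = 0;

    assert(x != NULL && y != NULL);
    for (; i + 4 <= n; i += 4) {
        z1 += (int32_t)x[i]     * y[i];
        z2 += (int32_t)x[i + 1] * y[i + 1];
        z3 += (int32_t)x[i + 2] * y[i + 2];
        z4 += (int32_t)x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)                /* cleanup: at most three leftover samples */
        z1 += (int32_t)x[i] * y[i];

    return z1 + z2 + z3 + z4;
}
```

Because the partial sums are combined only once at the end, the result matches the simple loop of Pcode 1.1 for any N.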

Digital media processing algorithms are not just "dot products." Next, we consider another simple algorithm, the RC4 stream cipher.

RC4 Stream Cipher

The RC4 algorithm (see Section 2.1.6, RC4 Algorithm, for more details) is used as a stream cipher in low-security applications and as a pseudorandom number generator in many standard cipher applications. RC4 is used in many commercial software packages, such as Lotus Notes and Oracle Secure SQL, and in network protocols, such as SSL, IPsec, WEP, and WPA. An RC4 simulation code is given in Pcode 1.3.

j = 0;
for (i = 0; i < N; i++) {        // N: data length in bytes
    k = i & 0xff;                // i mod 256
    r0 = SBox[k];
    r1 = j + r0;
    j = r1 & 0xff;               // (j + SBox[k]) mod 256
    r1 = SBox[j];                // look-up table access with arbitrary offset
    SBox[j] = r0;                // swap look-up table elements
    SBox[k] = r1;
    r1 = r1 + r0;
    r1 = r1 & 0xff;              // (r0 + r1) mod 256
    r1 = SBox[r1];               // look-up table access with arbitrary offset
    in[i] = in[i] ^ r1;          // encrypt input message bytes
}

Pcode 1.3: Simulation code for RC4 stream cipher.
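For readers who want to run the cipher, the sketch below gives a self-contained C version. Note that it follows the commonly published RC4 formulation, including the key-scheduling step that Pcode 1.3 leaves out, and it increments the index before use, which differs slightly from the indexing in Pcode 1.3. The type and function names (`rc4_state`, `rc4_init`, `rc4_crypt`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t s[256];   /* S-box permutation */
    uint8_t i, j;     /* running indices; uint8_t arithmetic wraps mod 256 */
} rc4_state;

/* Key scheduling: initialize the S-box from the key. */
static void rc4_init(rc4_state *st, const uint8_t *key, size_t key_len) {
    uint8_t j = 0;

    assert(st != NULL && key != NULL && key_len > 0 && key_len <= 256);
    for (int n = 0; n < 256; n++)
        st->s[n] = (uint8_t)n;
    for (int n = 0; n < 256; n++) {
        j = (uint8_t)(j + st->s[n] + key[n % key_len]);
        uint8_t t = st->s[n];
        st->s[n] = st->s[j];
        st->s[j] = t;
    }
    st->i = 0;
    st->j = 0;
}

/* Encrypt or decrypt buf in place (the same operation in both directions). */
static void rc4_crypt(rc4_state *st, uint8_t *buf, size_t len) {
    for (size_t n = 0; n < len; n++) {
        st->i = (uint8_t)(st->i + 1);
        st->j = (uint8_t)(st->j + st->s[st->i]);   /* depends on the swapped S-box */
        uint8_t r0 = st->s[st->i];
        uint8_t r1 = st->s[st->j];
        st->s[st->i] = r1;                         /* swap S-box elements */
        st->s[st->j] = r0;
        buf[n] ^= st->s[(uint8_t)(r0 + r1)];       /* XOR with keystream byte */
    }
}
```

Each new value of `j` reads an S-box entry that the previous iteration may have swapped, which is the serial dependency that prevents the loop from being spread across multiple compute units.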

In the iterative procedure of computing RC4 encrypted data using Pcode 1.3, the computation of a new j value requires updated (swapped) S-box values. Thus, computing many j values and swapping them at the same time is not possible due to the dependency of j on updated S-box values. The RC4 algorithm is sequential in nature, although no jumps are present. Even if multiple compute units are available with the processor, we cannot use them in this case for parallel implementation of the algorithm. See Section 2.1.6, RC4 Algorithm, for cycle costs and memory requirements to implement RC4 on the reference embedded processor.

Unlike the dot product, the execution of algorithms such as RC4 on deep-pipeline processors may not be efficient in terms of cycles. RC4 can be computed in fewer cycles on microcontrollers with a two-stage pipeline than on DSPs with 10 or more pipeline stages.

In the case of algorithms with frequently occurring conditional branches (e.g., the H.264 CABAC encode symbol normalization process described in Section 5.5), the performance of deep-pipeline DSPs worsens. As shown in Pcode 1.4, the normalization process has many conditional jumps in a "while loop." This process is costly in terms of cycles, as it performs normalization 1 bit at a time with many jumps. Avoiding jumps is the only solution to reduce cycle cost (see Section 5.5 for details).
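To illustrate the idea of trading jumps for a table look-up, the sketch below computes how many left shifts bring Range back to 256 or above, once with a per-bit loop and once with a precomputed table. It covers only the shift count for Range; handling of Low and the outstanding bits is omitted, and this is not necessarily the method the book presents in Section 5.5. The names `renorm_lut`, `renorm_lut_init`, `renorm_shift_loop`, and `renorm_shift_lut` are hypothetical, and Range is assumed to be a nonzero 9-bit value.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t renorm_lut[512];   /* shift count for every 9-bit Range value */

/* Per-bit loop: one conditional jump for every shifted bit. */
static unsigned renorm_shift_loop(uint32_t range) {
    unsigned shift = 0;

    assert(range > 0 && range < 512);
    while (range < 256) {
        range <<= 1;
        shift++;
    }
    return shift;
}

/* Fill the table once at start-up, using the loop version as reference. */
static void renorm_lut_init(void) {
    renorm_lut[0] = 0;            /* Range == 0 is assumed never to occur */
    for (uint32_t r = 1; r < 512; r++)
        renorm_lut[r] = (uint8_t)renorm_shift_loop(r);
}

/* Table version: a single load, no data-dependent jumps. */
static unsigned renorm_shift_lut(uint32_t range) {
    return renorm_lut[range & 0x1ffu];
}
```

After calling `renorm_lut_init()` once, `Range <<= renorm_shift_lut(Range)` normalizes Range in one step, at the cost of 512 bytes of table memory.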

In summary, DSPs are good at handling FFTs, filters, and matrix operations, and are less effective at handling both control code and sequential algorithms. Simple pipeline processors (e.g., ARM) are good at handling control and sequential algorithms, and less effective at handling signal processing tasks such as transforms, filtering operations, and so on.

In brief, the dot-product benchmark provided by the DSP manufacturer may not provide much useful information because the application at hand rarely contains dot-product kinds of operations. To efficiently run any algorithm on a particular digital signal processor, we need to dedicate some time to understanding the underlying mathematical structure of the algorithm and then tune it to write efficient code for that processor. A few techniques to map algorithms to DSPs are discussed in the next section.

while (pBAC->Range < 256) {   // Low, Range, Outstanding bits (or Obits) are CABAC params
    if (pBAC->Low >= 512) {
        pBAC->Low -= 512;
        write_bits(1, 1);
        if (pBAC->Obits > 0) {
            write_bits(0, pBAC->Obits);   // bit-fifo write
            pBAC->Obits = 0;
        }
    }
    else if (pBAC->Low < 256) {
        write_bits(0, 1);
        if (pBAC->Obits > 0) {
            write_bits(1, pBAC->Obits);   // bit-fifo write
            pBAC->Obits = 0;
        }
    }
    else {
        pBAC->Obits++;
        pBAC->Low -= 256;
    }
    pBAC->Range = pBAC->Range << 1;
    pBAC->Low = pBAC->Low << 1;
}

Pcode 1.4: Simulation code for H.264 CABAC encode symbol normalization.

1.4.3 Algorithm Implementation Techniques

Digital data is efficiently processed with an embedded processor by optimizing the corresponding program at both the algorithm flow level and the instruction level. The algorithms are optimized for throughput, memory usage, I/O bandwidth, and power dissipation. In this subsection, we discuss algorithm-level optimization using various techniques for increasing throughput. In most cases, there is a trade-off between throughput and memory.

Algorithm code is optimized at the instruction level to eliminate pipeline stalls due to data dependencies, to minimize the overhead of control code such as jumps and software loop overheads, and to efficiently handle data movement within the system. Instruction-level optimization techniques vary by processor. Compilers also perform some degree of instruction-level optimization. Typically we see a 10 to 20% gain with instruction-level optimization (measured by a decrease in core clock cycles). When optimizing the code at the instruction level, complete knowledge of the algorithm structure may not be necessary.

On the other hand, program-flow optimization at the algorithm level requires knowledge of the algorithm's mathematical structure and properties. Compilers cannot achieve algorithm-level program optimization. Minimizing the number of computations and balancing the CPU and load/store bandwidth are possible with algorithm-level optimization. We can achieve algorithm-level optimization using multiple approaches. The methods considered in this section include changing the algorithm flow, using look-up tables, using algorithm-flow statistics, using symmetry and periodicity, reusing already-computed data, and approximating mathematical functions. The amount of cycle savings depends on the particular algorithm and its flow. For the algorithms discussed in this book, the amount of cycle savings achieved with algorithm-level optimization ranges from 20 to 80%.
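As one illustration of "using symmetry," consider a filter whose coefficients are symmetric, h[k] = h[L-1-k]. Pairing the two samples that share a coefficient before multiplying roughly halves the number of multiplications. The sketch below is a hypothetical example chosen for this purpose, not code from the book; `fir_direct` and `fir_folded` are illustrative names, and each computes a single output from a window of L samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct form: one multiplication per tap (L multiplications). */
static int64_t fir_direct(const int16_t *h, const int16_t *x, size_t len) {
    int64_t acc = 0;

    assert(h != NULL && x != NULL);
    for (size_t k = 0; k < len; k++)
        acc += (int64_t)h[k] * x[k];
    return acc;
}

/* Folded form for symmetric coefficients (h[k] == h[len-1-k]):
   add the two samples sharing a coefficient first, then multiply once,
   giving about L/2 multiplications. */
static int64_t fir_folded(const int16_t *h, const int16_t *x, size_t len) {
    int64_t acc = 0;
    size_t half = len / 2;

    assert(h != NULL && x != NULL);
    for (size_t k = 0; k < half; k++)
        acc += (int64_t)h[k] * ((int32_t)x[k] + x[len - 1 - k]);
    if (len % 2 != 0)                       /* middle tap has no partner */
        acc += (int64_t)h[half] * x[half];
    return acc;
}
```

With `h = {1, 2, 3, 2, 1}` and `x = {4, -1, 5, 0, 2}`, both functions return 19, but the folded form uses three multiplications instead of five. The saving comes from a property of the coefficients that a compiler has no way of knowing, which is why this kind of restructuring has to be done at the algorithm level.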

Is Optimizing All the Program Code Worthwhile?

Before we proceed, we ask whether optimizing all the program code is worthwhile. The answer is that it depends on processor capabilities and application demands. Usually, we start optimizing the most critical modules in C, and if the MIPS budget is not met, we continue to optimize other critical modules. If we are still not within the MIPS budget, then we start writing assembly language and optimizing it. For example, consider a video decoder (see Chapter 14 for details); it has many layers and modules (see Figure 14.15). In the slice layer, we decode
