2 EURASIP Journal on Embedded Systems
integrated on a card. More recently, a trend has been to inte-
grate multiple processors on a single chip, creating SDR CMP
systems. The SDR Forum [1] defines five tiers of solutions.
Tier-0 is a traditional radio implementation in hardware.
Tier-1, software-controlled radio (SCR), implements the
control features for multiple hardware elements in software.
Tier-2, software-defined radio (SDR), implements modu-
lation and baseband processing in software but allows for
multiple frequency fixed function RF hardware. Tier-3, ideal
software radio (ISR), extends programmability through the
RF with analog conversion at the antenna. Tier-4, ultimate
software radio (USR), provides for fast (millisecond) transi-
tions between communications protocols in addition to dig-
ital processing capability.
The advantages of reconfigurable SDR solutions versus
hardware solutions are significant. First, reconfigurable so-
lutions are more flexible, allowing multiple communication
protocols to dynamically execute on the same transistors
thereby reducing hardware costs. Specific functions such as
filters, modulation schemes, and encoders/decoders can be re-
configured adaptively at run time. Second, several commu-
nication protocols can be efficiently stored in memory and
coexist or execute concurrently. This significantly reduces
the cost of the system for both the end user and the ser-
vice provider. Third, remote reconfiguration provides sim-
ple and inexpensive maintenance and feature upgrades. This
also allows service providers to differentiate products after
the product is deployed. Fourth, the development time of
new and existing communications protocols is significantly
reduced, providing an accelerated time to market. Develop-
ment cycles are not limited by long and laborious hardware
design cycles. With SDR, new protocols are quickly added as
soon as the software is available for deployment. Fifth, SDR
provides an attractive method of dealing with new standards
releases while assuring backward compatibility with existing
standards.
SDR enabling technologies also have significant advan-
tages from the consumer perspective. First, mobile terminal
independence with the ability to “choose” desired feature sets
is provided. As an example, the same terminal may be ca-
pable of supporting a superset of features, but the consumer
only pays for the features they are interested in using. Sec-
ond, global connectivity with the ability to roam across oper-
ators using different communications protocols can be pro-
vided. Third, future scalability and upgradeability provide
for longer handset lifetimes.
1.2. Processor background
In this section we define a number of terms and provide
background information on general purpose processors, dig-
ital signal processors, and some of the workload differences
between general purpose computers and real-time embed-
ded systems.
The architecture of a computer system is the minimal set
of properties that determine what programs will run and
what results they will produce [2]. It is the contract between
the programmer and the hardware. Every computer is an
interpreter of its machine language—that representation of
programs that resides in memory and is interpreted (exe-
cuted) directly by the (host) hardware.
The logical organization of a computer’s dataflow and
controls is called the implementation or microarchitecture.
The physical structure embodying the implementation is
called the realization. The architecture describes what hap-
pens while the implementation describes how it is made
to happen. Programs of the same architecture should run
unchanged on different implementations. An architectural
function is transparent if its implementation does not pro-
duce any architecturally visible side effects. An example of a
nontransparent function is the load delay slot made visible
due to pipeline effects. Generally, it is desirable to have trans-
parent implementations. Most DSP and VLIW implementa-
tions are not transparent and therefore the implementation
affects the architecture [3].
Execution predictability in DSP systems often precludes
the use of many general-purpose design techniques (e.g.,
speculation, branch prediction, and data caches). Instead,
classical DSP architectures have developed a unique set of
performance-enhancing techniques that are optimized for
their intended market. These techniques are characterized by
hardware that supports efficient filtering, such as the ability
to sustain three memory accesses per cycle (one instruction,
one coefficient, and one data access). Sophisticated address-
ing modes such as bit-reversed and modulo addressing may
also be provided. Multiple address units operate in parallel
with the datapath to sustain the execution of the inner kernel.
In classical DSP architectures, the execution pipelines
were visible to the programmer and necessarily shallow to
allow assembly language optimization. This programming
restriction encumbered implementations with tight timing
constraints for both arithmetic execution and memory ac-
cess. The key characteristic that separates modern DSP ar-
chitectures from classical DSP architectures is the focus on
compilability. Once the decision was made to focus the DSP
design on programmer productivity, other constraining de-
cisions could be relaxed. As a result, significantly longer
pipelines with multiple cycles to access memory and multi-
ple cycles to compute arithmetic operations could be utilized.
This has yielded higher clock frequencies and higher perfor-
mance DSPs.
In an attempt to exploit instruction level parallelism in-
herent in DSP applications, modern DSPs tend to use VLIW-
like execution packets. This is partly driven by real-time
constraints, which require the worst-case execution time to
be minimized. This is in contrast with general purpose CPUs
which tend to minimize average execution times. With long
pipelines and multiple instruction issue, the difficulties of
attempting assembly language programming become appar-
ent. Controlling dependencies between upwards of 100 in-
flight instructions is not an easy task for a programmer. This
is exactly the area where a compiler excels.
One challenge of using some VLIW processors is large
program executables (code bloat) that result from inde-
pendently specifying every operation with a single instruc-
tion. As an example, a VLIW processor with a 32-bit basic