JPEG Decoding on a Multi-Core Processor Platform

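This document describes in detail how a JPEG decoder was implemented on a multi-core processor platform. The team started from an already fully functional JPEG decoder and ported it to a Celoxica RC300 FPGA board containing three Silicon Hive VLIW processors. The platform also includes external memory and a frame buffer, and all components are connected through Æthereal, a Network on Chip (NoC) from NXP. The goal was to bring an embedded JPEG decoder to market by the end of Q2 2008 and to port the application code to the VLIW cores.

Using several cores for JPEG decoding can increase processing speed, especially for high-resolution images. JPEG (Joint Photographic Experts Group) is a widely used lossy image compression format that reduces the size of image data through the discrete cosine transform (DCT), quantization and entropy coding. In this project three VLIW (Very Long Instruction Word) processors work in parallel. A VLIW instruction packs several operations that are issued together, so each core also exploits instruction-level parallelism on its own.

Several decoding steps can be distributed over the cores. Handling the 8x8 blocks and the color-space conversion can be assigned to different cores; for example, one core organizes the data into 8x8 blocks while another converts YCbCr to RGB. The inverse DCT (IDCT) is a key and computation-heavy step of decoding. It converts each 8x8 block of frequency-domain coefficients back into pixel values, and because the blocks are independent, the work can be divided among the cores. Dequantization is another good candidate for parallelization: each core can independently multiply the quantized coefficients of its blocks by the quantization table entries to restore approximate amplitudes. Entropy decoding (Huffman or arithmetic decoding) turns the compressed bitstream back into DCT coefficients. Because the codes have variable length, it can be parallelized only by giving each core a segment of the coded stream that can be decoded independently. Finally, the decoded blocks are reassembled and the pixels are written to the frame buffer, a step that can also make use of several cores.

An efficient multi-core JPEG decoder also has to deal with task scheduling, data communication and synchronization. The NoC provides a flexible communication architecture that lets the cores exchange data efficiently, which supports the correctness and real-time behavior of the decoder.

In summary, the project shows how JPEG decoding can be optimized on a multi-core system by distributing tasks sensibly and exploiting the parallelism of the VLIW processors. This is relevant for real-time image processing and multimedia applications such as video streaming and high-resolution image display.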

5kk03 Embedded Systems Laboratory

[Design of an Embedded JPEG Decoder on a multiprocessor platform]

Geert Kwintenberg

Hardware Expert

0614867

geert.k@zonnet.nl

Lennart de Graaf

Software Engineer

0612829

l.deGraaf@fontys.nl

Wei Tong

Embedded Engineer

0641310

w.tong@student.tue.nl

Manickavasagam Shanmugam Annamalai

Group Leader

0641287

s.a.manick@gmail.com

Feiteng Yang

Software Engineer

0638263

yftfly@hotmail.com

1. INTRODUCTION

This document describes the implementation of a JPEG decoder on a multiprocessor platform. To complete the assignment in the required time, an already functional implementation of a JPEG decoder was used as the starting point. The multiprocessor platform is built on a Celoxica RC300 FPGA board. The FPGA on this board contains three Silicon Hive VLIW processors. The board also provides an external memory and a framebuffer. All of these components are connected through an NXP Network on Chip called Æthereal. The assignment consists of the following goals: put an embedded JPEG decoder on the market by the end of Q2 2008, port the application code to the embedded VLIW cores, efficiently map the application to the platform, and optimize the system using performance metrics. From an organizational point of view, each group member was assigned a specific role to smooth the design process.

Section 2 explains how the design process was started. Section 3 gives an overview of the code optimizations that make the application more efficient. Sections 4, 5 and 6 describe the three implementations that were built: the data parallel, the functional parallel and the hybrid version, respectively. Section 7 presents the benchmark results. Finally, conclusions are presented in section 8.
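
To make the term "data parallel" concrete, the following minimal sketch divides the coded blocks (MCUs) of an image over the three cores as evenly as possible. It is only an illustration under stated assumptions: the type `mcu_range` and the function `data_parallel_range` are hypothetical names, and the sketch does not claim to reproduce the partitioning the authors actually used. It also ignores the fact that a core needs to know where its part of the entropy-coded stream begins.

```c
#include <stddef.h>

#define NUM_CORES 3  /* the platform has three VLIW cores */

/* Half-open range [first, last) of MCU indices assigned to one core.
 * Hypothetical type, introduced for this illustration only. */
typedef struct {
    size_t first;
    size_t last;
} mcu_range;

/* Split total_mcus over NUM_CORES cores. The first (total_mcus % NUM_CORES)
 * cores each receive one extra MCU, so the loads differ by at most one. */
mcu_range data_parallel_range(size_t total_mcus, size_t core_id)
{
    size_t base = total_mcus / NUM_CORES;   /* MCUs every core gets */
    size_t extra = total_mcus % NUM_CORES;  /* leftover MCUs to spread */
    mcu_range r;

    r.first = core_id * base + (core_id < extra ? core_id : extra);
    r.last = r.first + base + (core_id < extra ? 1 : 0);
    return r;
}
```

In a functional parallel version the cores would instead each run a different stage of the decoder on the same data, so a split like this would not be needed. The division and modulo here are assumed to be computed once, for example on the host.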

2. GETTING STARTED

To get started with programming the embedded Silicon Hive cores, some small programs were written to do basic calculations. From these it became clear that programming the cores has some limitations. For example, the cores do not support doubles or floating-point operations. The cores also do not include a hardware divider, so divisions are performed in software. The following sections describe the tasks that had to be completed to implement the JPEG decoder on a single core.

Footnote (VLIW): Very Long Instruction Word refers to a CPU architecture designed to take advantage of instruction-level parallelism.
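
The two restrictions can be illustrated with a short sketch. It is written under assumptions and is not taken from the authors' code: `soft_udiv32` shows one common way to divide without a hardware divider (shift-and-subtract), and `ycbcr_to_rgb` shows how a floating-point formula, here the JFIF color conversion, can be replaced by integer arithmetic with 16 fractional bits. Both function names are hypothetical, and the software division routine actually supplied by the Silicon Hive tools may work differently.

```c
#include <stdint.h>

/* Unsigned division by shift-and-subtract (restoring division): one
 * quotient bit per iteration, no divide instruction needed.
 * Convention chosen for this sketch: divisor 0 yields 0xFFFFFFFF. */
uint32_t soft_udiv32(uint32_t dividend, uint32_t divisor, uint32_t *rem)
{
    uint32_t quotient = 0;
    uint32_t remainder = 0;
    int bit;

    if (divisor == 0) {
        if (rem != 0)
            *rem = dividend;
        return 0xFFFFFFFFu;
    }
    for (bit = 31; bit >= 0; bit--) {
        /* Bring down the next dividend bit, most significant first. */
        remainder = (remainder << 1) | ((dividend >> bit) & 1u);
        if (remainder >= divisor) {
            remainder -= divisor;
            quotient |= (uint32_t)1 << bit;
        }
    }
    if (rem != 0)
        *rem = remainder;
    return quotient;
}

/* Clamp an intermediate result to the 8-bit range 0..255. */
static uint8_t clamp_u8(int32_t v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* JFIF YCbCr -> RGB without floating point. The constants are 1.402,
 * 0.344136, 0.714136 and 1.772 scaled by 65536. Assumes that >> on a
 * negative value is an arithmetic shift, which is implementation-defined
 * in C but is what common compilers do. */
void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr, uint8_t rgb[3])
{
    int32_t c_b = (int32_t)cb - 128;
    int32_t c_r = (int32_t)cr - 128;
    int32_t half = 1 << 15;  /* adds 0.5 before the shift, for rounding */

    rgb[0] = clamp_u8(y + ((91881 * c_r + half) >> 16));
    rgb[1] = clamp_u8(y - ((22554 * c_b + 46802 * c_r + half) >> 16));
    rgb[2] = clamp_u8(y + ((116130 * c_b + half) >> 16));
}
```

The division loop always runs 32 iterations, which shows why divisions are expensive on such a core and why replacing a division by a shift is attractive wherever the divisor is a power of two.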

2.1 Single core porting

The starting point of this project is a working JPEG decoder that runs on the host system. To be able to try out the benefits of using multiple cores to decode in parallel, two things need to be done first:

• Split up the code into a part that runs on the host and a part that runs on the core(s);

• Port the source code that needs to run on the core(s).

Splitting up the code is fairly easy, since the original code has a good functional separation between the initial setup and the actual decoding. Basically, the split is already present in the original code: JpgToBmp.c does all initialisations and then calls the decoder() function in the file decoder.c. The most time-consuming part was getting familiar with the Silicon Hive environment and its functions to control the cores (loading, starting and waiting) and to exchange data between host and core. Unfamiliarity with makefiles also cost some time. Porting the code to the core was tedious. The main issue is that library functions that work on the host are not available on the core. Most of the time this is because the actual hardware is different, so functions like printf() and fget() no longer make sense. The main changes are:

• Removing all calls to standard I/O (printf functions). To keep crun working, we used #ifdef constructions to remove all functions that are not supported when running on hyvesCC;

• Changing all calls to file functions. File access is not supported from a core. Therefore the host is responsible for reading the .JPEG file and creating the resulting .bmp file. Information from these files is passed to ...
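
The two changes listed above can be sketched as follows. This is a minimal illustration, not the authors' code: the macro name `ON_CORE` is hypothetical and stands for whatever symbol the build defines when compiling for the core, and `LOG` and `has_soi_marker` are names invented for this example. The function works on a memory buffer that the host is assumed to have filled from the file, so the code itself never touches the file system.

```c
#include <stdint.h>

/* ON_CORE is a hypothetical macro, assumed to be defined by the build
 * only when compiling for the embedded core. */
#ifdef ON_CORE
#define LOG(...) ((void)0)               /* no standard I/O on the core */
#else
#include <stdio.h>
#define LOG(...) printf(__VA_ARGS__)     /* host build keeps its output */
#endif

/* Check for the JPEG start-of-image marker (0xFF 0xD8) at the start of
 * a buffer. The buffer replaces direct file access: the host reads the
 * file and hands over the bytes. Returns 1 if the marker is present. */
int has_soi_marker(const uint8_t *buf, uint32_t len)
{
    if (len < 2 || buf[0] != 0xFF || buf[1] != 0xD8) {
        LOG("no SOI marker found\n");
        return 0;
    }
    LOG("SOI marker found\n");
    return 1;
}
```

Because the logging call compiles to nothing when `ON_CORE` is defined, the same source file can be built for the host simulation and for the core without editing the decoder functions themselves.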
