FPGA Implementation of Low-Power and
High-PSNR DCT/IDCT Architecture based on
Adaptive Recoding CORDIC
Jianfeng Zhang*
State Key Laboratory
of High Performance Computing
College of Computer
National University of Defense Technology
Changsha, China
Email: jianfengzhang@nudt.edu.cn
Paul Chow
Department of Electrical and
Computer Engineering
University of Toronto
Toronto, Canada
Email: pc@eecg.toronto.edu
Hengzhu Liu
State Key Laboratory
of High Performance Computing
College of Computer
National University of Defense Technology
Changsha, China
Email: hengzhuliu@nudt.edu.cn
Abstract—The discrete cosine transform (DCT) and its inverse
(IDCT) are widely used in image and video compression stan-
dards. In this paper, we propose a novel unified architecture
for DCT and IDCT based on adaptive recoding coordinate
rotation digital computer (ARC). The proposed architecture
requires two types of ARC rotators. In addition, an efficient
adder and shifter-based scale factor approximation is used in
the proposed architecture. To verify the function and evaluate
the performance, the proposed architecture is validated on a
Virtex 5 FPGA development platform. In DCT-only mode, a
state-of-the-art DCT architecture uses 12% more hardware
resources, has a 7.12% longer critical path delay, consumes
10.1% more power, and achieves a PSNR 4.8 dB lower than the
proposed architecture. In DCT/IDCT mode, the latest unified
DCT/IDCT architecture has 2.17 times the latency, needs 74.9%
more hardware resources, and dissipates 52.5% more power than
the proposed architecture. In addition, the PSNR of the
proposed architecture is 2 dB higher.
I. INTRODUCTION
Low power consumption is extremely important in embedded
systems, especially in portable devices. Owing to their excellent
energy compaction [1] and very close approximation to the optimal
Karhunen-Loeve transform (KLT) [2], the discrete cosine
transform (DCT) and inverse discrete cosine transform (IDCT)
have been widely applied in image and video compression
standards, such as JPEG [3], MPEG [4], H.264 [5] and
HEVC [6], since they were first introduced [7]. As DCT
and IDCT are computationally intensive transforms, many
fast algorithms have been proposed to accelerate the computation
process, such as multiplier-based algorithms [8, 9], distributed
arithmetic (DA) based algorithms [10, 11] and coordinate
rotation digital computer (CORDIC) based algorithms [12–16].
The multiplier-based algorithms and the DA-based algorithms
have high peak signal-to-noise ratio (PSNR), but they con-
sume too much power. The reason is that they either require
complicated multipliers or use too many hardware resources.
CORDIC [17] can realize transcendental functions in a
parallel way using only adders and shifters, and it is also
highly suited to implementation on FPGAs [18], which means that
adopting CORDIC to implement DCT and IDCT can reduce
architecture complexity and save power. Compared to the other
two methods, CORDIC-based implementation of DCT and IDCT
is therefore attracting increasing attention.
* Jianfeng Zhang is currently a visiting PhD student at the University of
Toronto.
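To illustrate how a rotation can be built from adders and shifters only, the following is a minimal Python sketch of conventional rotation-mode CORDIC. It is not the ARC rotator used in this paper. The 16-bit fraction width, the 16 iterations, and the names cordic_rotate, FRAC_BITS, N_ITER, ANGLES and GAIN are assumptions chosen for illustration. The final gain compensation uses one floating-point multiply for brevity, whereas hardware would approximate it with shifts and adds.

```python
import math

FRAC_BITS = 16  # assumed fixed-point fraction width (illustrative only)
N_ITER = 16     # assumed number of micro-rotations (illustrative only)

# Elementary angles atan(2^-i), stored as fixed-point integers.
ANGLES = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS)) for i in range(N_ITER)]

# Accumulated gain of all micro-rotations; the scale factor is 1 / GAIN.
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_ITER))


def cordic_rotate(x, y, theta):
    """Rotate (x, y) counterclockwise by theta radians (|theta| < about 1.74)."""
    xi = round(x * (1 << FRAC_BITS))
    yi = round(y * (1 << FRAC_BITS))
    zi = round(theta * (1 << FRAC_BITS))
    for i in range(N_ITER):
        # The direction of step i is chosen from the residual angle zi,
        # which is only known once step i - 1 has finished.
        if zi >= 0:
            xi, yi, zi = xi - (yi >> i), yi + (xi >> i), zi - ANGLES[i]
        else:
            xi, yi, zi = xi + (yi >> i), yi - (xi >> i), zi + ANGLES[i]
    # Gain compensation and conversion back to floating point.
    scale = 1.0 / (GAIN * (1 << FRAC_BITS))
    return xi * scale, yi * scale
```

For example, cordic_rotate(1.0, 0.0, math.pi / 4) returns a pair close to (cos 45°, sin 45°). Each loop step uses only additions, subtractions and right shifts, and its direction depends on the residual angle left by the previous step, so the steps cannot be evaluated independently.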
Because a 2-D DCT is commonly calculated by first applying a
1-D DCT to the rows and then another 1-D DCT to the columns
of the row-transformed matrix [16], the 1-D DCT is the kernel
processing element. Moreover, both DCT and IDCT are used
in image and video systems, so designing an efficient unified
architecture for the 1-D DCT and IDCT is very important.
In this paper, we propose a novel unified architecture for DCT
and IDCT based on CORDIC. The proposed architecture uses
two different types of CORDIC rotators. The conventional
CORDIC [17] has several drawbacks, such as excessive
iterations, poor accuracy, and especially the data dependence
between neighbouring iterations, which restricts the speed significantly.
Hence, adaptive recoding CORDIC (ARC) [19] is adopted
to improve the accuracy and accelerate the rotation process.
The proposed architecture has been synthesized on a Xilinx
Virtex-5 LX110T to verify the correctness and performance.
Compared to the state-of-the-art DCT and the latest unified
DCT/IDCT architectures, the proposed architecture demon-
strates significantly improved performance.
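The row-column decomposition mentioned above, in which a 2-D DCT is obtained from repeated 1-D DCTs, can be sketched in a few lines of Python. This is a reference model only, not the proposed architecture: it evaluates the 1-D DCT-II directly from its definition with floating-point cosines, and the orthonormal scaling and the names dct_1d and dct_2d are assumptions made for illustration.

```python
import math


def dct_1d(v):
    """Orthonormal 1-D DCT-II of a sequence, computed from the definition."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for i in range(n))
        c = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(c * s)
    return out


def dct_2d(block):
    """2-D DCT by row-column decomposition: 1-D DCT on rows, then on columns."""
    rows = [dct_1d(r) for r in block]              # first pass: every row
    cols = [dct_1d(list(c)) for c in zip(*rows)]   # second pass: every column
    return [list(r) for r in zip(*cols)]           # transpose back to row-major
```

Under this assumed scaling, a constant 8x8 block of ones transforms to a single DC coefficient of 8 with all other coefficients near zero, which illustrates the energy compaction property. Since both passes reuse the same dct_1d routine, the 1-D transform is the only kernel that needs a dedicated implementation.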
The rest of this paper is organized as follows: Related work
on DCT and IDCT is discussed in Section II.
The background of DCT, IDCT, conventional CORDIC and
ARC is described in Section III. In Section IV, we discuss
the proposed novel unified DCT/IDCT architecture and the
implementations of ARC rotators. Section V analyzes the
simulation and comparison results on an FPGA. Conclusions
are drawn in Section VI.
II. RELATED WORK
Much research has been done to improve DCT and IDCT,
and it falls into three general groups of methods.
The first group is to minimize the number of multipliers
978-1-4673-9091-0/15/$31.00 © 2015 IEEE