Int J Parallel Prog (2010) 38:322–338 325
2.2 FPGAs
For accelerator designs to be more than an academic exercise, the following
considerations are important. The design time should be low and allow for extensive
testing. The design should be modular and scale with available resources. For
integration within an existing system, form-factor limitations should be considered,
as should power and memory. Most HPC systems are based on InfiniBand-like
interconnects between nodes, with nodes using PCI-e for communication with peripherals.
The PCI-e connects via the southbridge to the host memory, sharing bandwidth with
the host processor. DMA is used to transfer data efficiently, and thus accelerators
should be compatible with DMA and burst transfers. The overall system should
also be oblivious to the presence of the accelerator, requiring minimal modifications
to accommodate it. These aspects make FPGA-based designs very attractive.
Their form factors allow multiple FPGAs to fit on existing boards that can
communicate via PCI-e. The cost of being reconfigurable is that FPGAs cannot run at
clocks as high as modern general-purpose processors. However, reconfigurability does
let designs exploit very low power consumption and high parallelisability.
FPGAs have a reconfigurable fabric consisting of flip-flops and look-up tables
(LUTs) grouped into Configurable Logic Blocks (CLBs). FPGAs differ from one
another in the arrangement of elements within the CLB and in the interconnects
between CLBs. Fixed-function logic blocks, such as multipliers, and embedded
block RAM are systematically interspersed among the CLBs. Designs should avoid
complex routing between elements: FPGAs have limited resources for long-distance
routing, and using them poorly can lower the maximum achievable clock
frequency. In this work, care has been taken to minimise communication between
PEs, keeping the routing complexity low.
The targeted device is from the Xilinx Virtex-5 family [11], which is based on a
65 nm process. It provides four 6-input LUTs, four flip-flops, multiplexers and carry
chains within a slice, with two slices making a CLB. We briefly introduce
the key FPGA primitives used in this design: the Block RAM (BRAM), the FIFO and
the DSP48-slice-based multipliers and adders. These hard primitives embedded in the
Virtex-5 fabric can each be clocked at more than 500 MHz while
operating at relatively low power.
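The slice and CLB composition given above can be turned into a quick resource estimate. The following is a minimal sketch in Python; the function and class names are hypothetical, and the bound it computes ignores routing and packing losses, so it should be read as an optimistic lower limit rather than a placement result.

```python
from dataclasses import dataclass

# Per-slice and per-CLB figures as stated in the text for the Virtex-5.
LUTS_PER_SLICE = 4
FLIP_FLOPS_PER_SLICE = 4
SLICES_PER_CLB = 2


@dataclass(frozen=True)
class FabricResources:
    slices: int
    luts: int
    flip_flops: int


def fabric_resources(num_clbs: int) -> FabricResources:
    """Return slice, LUT and flip-flop totals for a given number of CLBs."""
    if num_clbs < 0:
        raise ValueError("num_clbs must be non-negative")
    slices = num_clbs * SLICES_PER_CLB
    return FabricResources(
        slices=slices,
        luts=slices * LUTS_PER_SLICE,
        flip_flops=slices * FLIP_FLOPS_PER_SLICE,
    )


def clbs_needed(luts: int, flip_flops: int) -> int:
    """Lower bound on CLBs for a design (hypothetical helper).

    Ignores routing and packing losses; the scarcer resource sets the bound.
    """
    if luts < 0 or flip_flops < 0:
        raise ValueError("resource counts must be non-negative")
    luts_per_clb = LUTS_PER_SLICE * SLICES_PER_CLB
    ffs_per_clb = FLIP_FLOPS_PER_SLICE * SLICES_PER_CLB
    # -(-x // y) is ceiling division for non-negative integers.
    return max(-(-luts // luts_per_clb), -(-flip_flops // ffs_per_clb))
```

For example, `clbs_needed(17, 3)` is bounded by the LUT count, since 17 LUTs do not fit in two CLBs of eight LUTs each.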
Block RAM
The BRAMs are 36-bit wide, 1 K deep true dual-port SRAMs; true dual-port means
that both ports can read and write independently. They can be used in
a variety of width-depth configurations and cascaded if required. Two adjacent
BRAMs can be treated as a 64-bit wide memory with no additional user logic.
They can also be configured as FIFOs with the relevant flags available for use.
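To illustrate how the width-depth configurations affect BRAM usage, the sketch below estimates how many BRAMs a memory of a given shape would occupy. Only the 36-bit by 1 K shape comes from the text; the other aspect ratios in `ASSUMED_CONFIGS` are an assumption about how one 36 Kb block can be reshaped, and the function name is hypothetical. The estimate simply tiles the requested memory with one chosen shape and does not model FIFO mode or parity usage.

```python
# (width in bits, depth in words) shapes for one BRAM.
# Only (36, 1024) is stated in the text; the rest are assumed aspect ratios.
ASSUMED_CONFIGS = [
    (36, 1024),
    (18, 2048),
    (9, 4096),
    (4, 8192),
    (2, 16384),
    (1, 32768),
]


def ceil_div(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


def brams_needed(width: int, depth: int) -> int:
    """Fewest BRAMs needed to tile a width x depth memory with one shape."""
    if width <= 0 or depth <= 0:
        raise ValueError("width and depth must be positive")
    return min(
        ceil_div(width, cfg_width) * ceil_div(depth, cfg_depth)
        for cfg_width, cfg_depth in ASSUMED_CONFIGS
    )
```

Under these assumptions, a 64-bit wide, 1 K deep memory maps onto two BRAMs side by side, which is consistent with the two-adjacent-BRAM arrangement described above.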
DSP48 Slices
DSP48E blocks consist of a cascadable 25 × 18 bit multiplier and a 48-bit
adder/subtractor/accumulator. They also allow functions such as shifting and
comparison, among others, to be implemented. Their ability to be cascaded allows for
floating-point implementations.
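The role of cascading in floating-point arithmetic can be sketched with the significand product of two single-precision numbers, which needs a 24 × 24 bit unsigned multiply and so does not fit one 25 × 18 bit multiplier. The Python model below splits one operand into chunks and accumulates the shifted partial products. It assumes the multiplier ports are signed, leaving 24 and 17 bits for unsigned operands; the function names are hypothetical, and this is a behavioural illustration rather than the design used in this work.

```python
A_BITS = 24    # unsigned bits usable on the 25-bit port (assumes a signed port)
B_BITS = 17    # unsigned bits usable on the 18-bit port (assumes a signed port)
ACC_BITS = 48  # accumulator width stated in the text


def dsp_multiply(a: int, b: int) -> int:
    """Model a single multiplier; operands must fit the unsigned port widths."""
    if not (0 <= a < (1 << A_BITS) and 0 <= b < (1 << B_BITS)):
        raise ValueError("operand does not fit a single multiplier")
    return a * b


def cascaded_multiply(a: int, b: int, b_width: int = 24) -> int:
    """Multiply a (up to 24 bits) by b (b_width bits) with chained multipliers.

    b is consumed in 17-bit chunks, one chunk per modelled DSP block, and each
    partial product is added into the running sum at its bit offset.
    """
    if A_BITS + b_width > ACC_BITS:
        raise ValueError("result would overflow the 48-bit accumulator")
    if not 0 <= b < (1 << b_width):
        raise ValueError("b does not fit in b_width bits")
    chunk_mask = (1 << B_BITS) - 1
    acc = 0
    for shift in range(0, b_width, B_BITS):
        chunk = (b >> shift) & chunk_mask
        acc += dsp_multiply(a, chunk) << shift
    return acc
```

With the default `b_width` of 24, the loop runs twice, so two modelled DSP blocks cover one single-precision significand product.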