rate of neighbouring engines. In this manner, the overall architecture is tailored to the particular
network. With emphasis placed on BNNs, the computation engines differ from conventional CNN
hardware designs and are optimised for the efficient mapping of binarised layers, including dedicated
hardware for binarised convolutions, max pooling and batch normalisation [40]. Finn expresses
binarised convolutions as matrix-vector operations followed by thresholding. To this end, the
integral block of the architecture is the Matrix-Vector-Threshold Unit (MVTU), which is optimised
to perform the majority of the core binarised operations. In terms of scheduling, Finn's approach lies
closer to fpgaConvNet's synchronous dataflow scheme and farther from DeepBurning's dynamic
dataflow, with static schedules generated at compile time. Finally, in contrast to fpgaConvNet and
DeepBurning, and similarly to Haddoc2, all the binarised weights are required to be stored on-chip,
with external memory transfers restricted to the input and output of the network, which imposes
a hard limit on the size of networks that can be addressed.
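To make this lowering concrete, the sketch below is an illustrative reconstruction, not Finn's actual implementation: it computes one output pixel of a binarised convolution as a matrix-vector product over bipolar {-1, +1} values followed by per-channel thresholding, with the thresholds assumed to fold in batch normalisation. In hardware, the dot products would typically be realised as XNOR-popcount operations rather than the integer arithmetic used here for clarity.

```python
import numpy as np

def binarised_conv_as_matvec(weights, window, thresholds):
    # weights   : (num_output_channels, window_size) in {-1, +1}
    # window    : (window_size,) flattened bipolar input patch
    # thresholds: (num_output_channels,) assumed to absorb batch normalisation
    accum = weights @ window                        # matrix-vector product
    return (accum >= thresholds).astype(np.int8)    # per-channel thresholding

# Toy usage: 4 output channels over a flattened 3x3x2 window (18 values).
rng = np.random.default_rng(0)
W = rng.choice([-1, 1], size=(4, 18))
x = rng.choice([-1, 1], size=18)
t = np.zeros(4)                                     # placeholder thresholds
print(binarised_conv_as_matvec(W, x, t))
```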
Single computation engines:
This design approach favours flexibility over customisation. Such
an architecture comprises a single computation engine, typically in the form of a systolic array of
processing elements or a matrix multiplication unit, that executes the CNN layers sequentially. The
control of the hardware and the scheduling of operations are performed by software (Fig. 2). This
design paradigm consists of a fixed architectural template which can be scaled based on the input
CNN and the available FPGA resources. With this scheme, each CNN corresponds to a different
sequence of microinstructions that are executable by the hardware. By taking this approach to
the extreme, the architecture can be configured and scaled based only on the resources of the
target FPGA without targeting a specific CNN and, as a result, after a single compilation, the same
bitstream can target many CNNs without the overhead of bitstream-level reconfiguration. Despite
the flexibility gains, inefficiencies are introduced due to control mechanisms that resemble those
of a processor [27]. Moreover, the one-size-fits-all approach can lead to high variability in the
achieved performance across CNNs with different workload characteristics.
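As a rough illustration of this paradigm, the following sketch uses a made-up, highly simplified instruction format: the fixed engine is modelled as a matrix-multiply primitive plus simple post-processing, and two different networks are executed as two different instruction streams on the same "hardware", without any reconfiguration.

```python
import numpy as np

def run_on_engine(program, x):
    # Fixed engine modelled as a matrix-multiply primitive; layers are
    # executed sequentially, one instruction at a time, under software control.
    for instr in program:
        if instr["op"] == "matmul":        # CONV/FC lowered to a matrix product
            x = instr["weights"] @ x
        elif instr["op"] == "relu":
            x = np.maximum(x, 0.0)
        elif instr["op"] == "pool":        # 1D max pooling, for illustration only
            x = x.reshape(-1, instr["size"]).max(axis=1)
    return x

# Two different "CNNs" reuse the same engine without reconfiguration.
rng = np.random.default_rng(0)
net_a = [{"op": "matmul", "weights": rng.standard_normal((8, 16))},
         {"op": "relu"},
         {"op": "pool", "size": 2}]
net_b = [{"op": "matmul", "weights": rng.standard_normal((4, 16))},
         {"op": "relu"}]
x = rng.standard_normal(16)
print(run_on_engine(net_a, x).shape)   # (4,)
print(run_on_engine(net_b, x).shape)   # (4,)
```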
1) Angel-Eye: The design principle behind the Angel-Eye framework is based on having a single
flexible computation engine which can be programmed and controlled by software. The main
computational component is an array of Processing Elements (PEs), with each PE containing a bank
of convolvers, an adder tree and an optional pooling path. The input feature maps of a CONV layer
are shared across all PEs and each PE processes its inputs with a different set of kernels in order
to produce independent output feature maps. Within a PE, the inputs are parallelised across the
convolvers, followed by the adder tree that combines partial results to produce the output. Overall,
Angel-Eye's and AutoCodeGen's hardware for CONV layers follow the same strategy, organising
convolvers into groups and tunably unrolling with respect to the input and output feature maps.
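The dataflow of such a PE array can be sketched as follows. The code is an illustrative software model under assumed dimensions, not Angel-Eye's actual design: input feature maps are broadcast to every PE, each PE owns the kernels of one output feature map and runs one convolver per input feature map, and an adder tree reduces the partial results.

```python
import numpy as np

def conv2d_valid(fmap, kernel):
    # Naive 2D convolution (valid padding), standing in for one convolver.
    kh, kw = kernel.shape
    oh, ow = fmap.shape[0] - kh + 1, fmap.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(fmap[i:i+kh, j:j+kw] * kernel)
    return out

def pe_array(in_fmaps, kernels):
    # in_fmaps: (C_in, H, W); kernels: (C_out, C_in, K, K)
    outputs = []
    for pe_kernels in kernels:                        # one PE per output feature map
        partials = [conv2d_valid(f, k)                # convolver bank, unrolled
                    for f, k in zip(in_fmaps, pe_kernels)]  # over input feature maps
        outputs.append(np.sum(partials, axis=0))      # adder tree
    return np.stack(outputs)                          # (C_out, H-K+1, W-K+1)

x = np.random.default_rng(0).standard_normal((3, 8, 8))    # 3 input feature maps
w = np.random.default_rng(1).standard_normal((4, 3, 3, 3)) # 4 output feature maps
print(pe_array(x, w).shape)  # (4, 6, 6)
```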
The framework’s compiler translates the input CNN to a sequence of instructions from Angel-
Eye’s custom instruction set and the computation engine executes the instructions. This process
corresponds to the sequential execution of the layers in a time-sharing manner. With different
CNNs mapped to different instruction sequences, the architecture can be reused to execute various
models without recompilation or reconfiguration. In this respect, the hardware design is configured
and scaled based only on the available resources of the target device and hence is CNN-independent.
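A minimal sketch of this compile step is shown below. The LOAD/CALC/SAVE opcodes and address fields are hypothetical placeholders rather than Angel-Eye's actual instruction set; they only illustrate how each CNN is lowered to a per-layer instruction sequence for a fixed, CNN-independent engine.

```python
def compile_cnn(layers):
    # Translate a layer-by-layer CNN description into a flat instruction
    # sequence. Opcodes and address fields are hypothetical, for illustration.
    program = []
    for i, layer in enumerate(layers):
        program.append(("LOAD", i, layer["weights_addr"]))  # fetch weights from external memory
        program.append(("CALC", i, layer["type"]))          # execute the layer on the engine
        program.append(("SAVE", i, layer["ofmap_addr"]))    # write output feature maps back
    return program

vgg_like = [{"type": "conv3x3", "weights_addr": 0x0000, "ofmap_addr": 0x8000},
            {"type": "conv3x3", "weights_addr": 0x2000, "ofmap_addr": 0x9000},
            {"type": "fc",      "weights_addr": 0x4000, "ofmap_addr": 0xA000}]
for instruction in compile_cnn(vgg_like):
    print(instruction)
```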
2) ALAMO: In contrast to Angel-Eye, ALAMO customises the generated computation engine to
the input CNN. The architecture comprises hardware blocks for POOL, ReLU and NORM layers,
together with a 2D array of compute units which is shared between CONV and FC layers. In CONV
layers, the array exploits the parallelism within one input feature map and across multiple output
feature maps. At each time instant, each row of the array is responsible for one output feature map,
with its columns processing different windows of the same input feature map and combining their