Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions
Yedidya Hilewitz and Ruby B. Lee
Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA
{hilewitz, rblee}@princeton.edu

Abstract
Current microprocessor instruction set
architectures are word-oriented, with some subword
support. Many important applications, however, can
realize substantial performance benefits from
bit-oriented instructions. We propose the parallel extract
(pex) and parallel deposit (pdep) instructions to
accelerate compressing and expanding selections of
bits. We show that these instructions can be
implemented by the fast inverse butterfly and butterfly
network circuits. We evaluate latency and area costs
of alternative functional units for implementing
subsets of advanced bit manipulation instructions. We
show applications exhibiting significant speedup,
3.41× on average over a basic RISC architecture, and
2.48× on average over an instruction set architecture
(ISA) that supports extract and deposit instructions.
1. Introduction
Operations on microprocessors are typically
word-oriented and, more recently, subword-oriented [1]. However,
many important applications benefit from bit-oriented
operations. For example, arbitrary n-bit permutations
take O(n) operations using basic instructions such as
and, shift and or to move individual bits [2]. A
few fixed permutations, such as in ciphers like DES,
have been optimized by table lookup [2], but still take
tens to hundreds of cycles due to cache misses.
Recent research showed that specialized bit-oriented
instructions can permute bits in O(lg n) [2-4] or even
O(1) operations [5,6]. For example, for n = 64, any one
of 64! bit permutations can be achieved in 1 or 2
cycles by butterfly (bfly) and inverse butterfly
(ibfly) permutation instructions [5,6]. Such speedup
can enable previously difficult bit manipulation
computations to be done much more efficiently.
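A single stage of such a butterfly network can be sketched in software with the well-known delta-swap idiom (a software sketch only; the cited instructions realize the full network in hardware, and the function name here is ours):

```c
#include <stdint.h>

/* Conditionally swap bit pairs at distance `shift`: wherever mask has a 1
   (in the lower position of a pair), bits i and i+shift of x are exchanged.
   One such step corresponds to one stage of a butterfly network; a full
   64-bit butterfly is lg(64) = 6 stages with shift = 32, 16, 8, 4, 2, 1. */
uint64_t delta_swap(uint64_t x, uint64_t mask, unsigned shift) {
    uint64_t t = (x ^ (x >> shift)) & mask;
    return x ^ t ^ (t << shift);
}
```

For example, delta_swap(x, 0x00000000FFFFFFFFULL, 32) exchanges the two 32-bit halves of x in one stage; choosing the per-stage masks is what configures the network for a particular permutation.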
This paper discusses another important class of
bit-oriented operations involving selecting and
compressing bits, and distributing bits according to
different bit patterns. We call these parallel extract
(pex) and parallel deposit (pdep) operations,
respectively. pdep and pex can also be viewed as
bit-level scatter and gather instructions. These
operations are important in application domains such
as bioinformatics, image processing, steganography,
cryptanalysis and coding.
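The semantics of the two operations can be illustrated with a bit-serial reference loop in C (purely illustrative — the point of the paper is that hardware performs this in one or two cycles; the function names are ours):

```c
#include <stdint.h>

/* pex (parallel extract, bit gather): select the bits of x at positions
   where mask has a 1 and compress them into the low-order bits of the
   result, preserving their relative order. */
uint64_t pex_ref(uint64_t x, uint64_t mask) {
    uint64_t r = 0;
    unsigned k = 0;
    for (unsigned i = 0; i < 64; i++)
        if ((mask >> i) & 1)
            r |= ((x >> i) & 1) << k++;
    return r;
}

/* pdep (parallel deposit, bit scatter): take the low-order bits of x and
   scatter them, in order, to the positions where mask has a 1. */
uint64_t pdep_ref(uint64_t x, uint64_t mask) {
    uint64_t r = 0;
    unsigned k = 0;
    for (unsigned i = 0; i < 64; i++)
        if ((mask >> i) & 1)
            r |= ((x >> k++) & 1) << i;
    return r;
}
```

For example, pex_ref(0xB4, 0xF0) gathers bits 7..4 of 0xB4 (1011) into 0xB, and pdep_ref(0xB, 0xF0) scatters them back to 0xB0; over the masked positions the two operations are inverses.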
We present the architectural definition of these
two novel bit instructions. We show how pdep can be
implemented using the single-cycle butterfly network
datapath. We evaluate alternative new functional units
that implement useful subsets of these advanced bit
manipulation instructions, and recommend one that is
smaller than an ALU and has shorter latency. Our
performance results indicate that a processor enhanced
with pex and pdep achieves a 5.2× maximum
speedup, 3.41× on average, over a basic RISC
architecture.
Section 2 describes the new pex and pdep
instructions. Section 3 presents the ISA definitions.
Section 4 discusses the implementation and different
options for a new functional unit implementing
advanced bit-oriented instructions. Section 5 describes
applications of these instructions and section 6 their
performance. Section 7 concludes the paper.
2. Parallel extract and parallel deposit
It is often necessary to select non-contiguous bits
from data. For example, in pattern matching, many
pairs of features may be compared. Then a subset of
these comparison result bits is selected, compressed,
and used as an index to look up a table. This selection
and compression of bits is what a pex instruction does
(Figure 1(b)). A pex instruction can also be viewed as
a parallel version of the extract (extr) instruction [7,
8]. The extr instruction extracts a single field of bits
from any position in the source register and right-justifies it in the destination register.
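In software, the effect of extr on an unsigned field can be sketched with a shift and mask (the function and parameter names here are ours):

```c
#include <stdint.h>

/* Extract the len-bit field of x starting at bit position pos and
   right-justify it in the result, zero-extended (1 <= len <= 64). */
uint64_t extr_sketch(uint64_t x, unsigned pos, unsigned len) {
    uint64_t mask = (len >= 64) ? ~0ULL : ((1ULL << len) - 1);
    return (x >> pos) & mask;
}
```

For example, extr_sketch(0xABCD, 4, 8) yields 0xBC. pex generalizes this from one contiguous field to an arbitrary scattered selection of bits.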
This work was supported in part by the DoD and Intel.
Yedidya Hilewitz is a Hertz Foundation Fellow.
Yedidya Hilewitz and Ruby B. Lee, "Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions," Proceedings of the IEEE 17th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.