10 G. Gaubatz, J.-P. Kaps, and B. Sunar
the next partial product and the 2’s complement of the modulus n (reduction).
The control logic determines whether a reduction operation is necessary after an
addition. Since we are using the same adder for both functions, the number of
clock cycles needed for one squaring is data dependent and at most 1024.
The most complex part of the squarer is the Adder. There are two basic adder
designs that are suitable for a low power implementation, namely carry-save adder
and ripple-carry adder. A ripple-carry adder uses fewer gates and hence consumes
the least amount of leakage power, but as the worst case carry chain is the
longest, this adder also has the longest delay. The propagation of carries causes
glitches which in turn cause a very high dynamic power consumption. A carry-
save adder on the other hand propagates carries only by one position, hence there
are no glitches, resulting in insignificant amounts of delay and dynamic power
consumption. Its disadvantage is that the result is kept in redundant carry-save
representation which requires 512 additional flip-flops. This in turn causes a
higher consumption of leakage power. Since partial products and complements
of the modulus can be accumulated in redundant form, the final non-redundant
result needs to be computed only at the very end of the multiplication which
takes 512 additional clock cycles.
Neither of both approaches seems optimal for this implementation, so we
tried to strike a balance between power and speed. For our adder we are using a
ripple-carry adder and insert a carry-save bit on every 8th bit position. Hence the
carries ripple for a maximum of 8 bits causing some glitching but significantly
less than a full ripple-carry adder would. The dynamic power consumption is
therefore much lower than for a full ripple-carry adder. This adder also needs
only 64 additional flip-flops to store the carry bits, which is 448 flip-flops less
than necessary for a full carry-save adder. This approach, however, introduces
a new difficulty. After adding a partial product to the sum, the result has to
be shifted. This would misalign the saved carry bits
2
. Hence, carry bits need
to be re-aligned before shifting the sum. This is done by adding the carry bits
to the sum in the appropriate position and saving the carry bits at the new
position. The cost for this is a 512 bit multiplexer, 512 additional clock cycles
and a slightly more complex control logic.
The Control logic is comprised of two state machines and one counter. The
counter is implemented as a Linear Feedback Shift Register (LFSR) and “counts”
up to 512. LFSRs have reduced switching activity and are faster than regular
counters, hence reducing the effects on the critical path delay. Furthermore it is
clock gated and can be reset. The counter is used to count all the multiplication
steps and also to count the worst case number of steps necessary to ripple all
64 carry-save flip-flops. The main state machine of this control logic keeps track
of the overall operation of the circuit. The second state machine takes care of
arithmetic operations of the circuit. Furthermore it is responsible for the clock
gating of the counter and the left shift register (see Figure 2).
2
This problem does not occur when a full carry-save adder is being used as there is
one carry bit associated with every bit position