1702 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 32, NO. 11, NOVEMBER 1997
Skew-Tolerant Domino Circuits
David Harris, Student Member, IEEE, and Mark A. Horowitz, Senior Member, IEEE
Abstract—Domino circuits are widely used in high-performance
CMOS microprocessors. However, textbook domino pipelines
suffer significant timing overhead from clock skew, latch delay,
and the inability to borrow time. To eliminate this overhead, some
designers provide multiple overlapping clock phases such that
domino gates are always ready for evaluation by the time critical
inputs arrive and do not precharge until the next gate consumes
the result. This paper describes a systematic framework, called
skew-tolerant domino circuits, for understanding and analyzing
domino circuits with overlapping clocks. Simulations confirm that
a speedup of 25% or more can be achieved over textbook domino
circuits in high-speed systems.
Index Terms—Adders, clock skew, clocks, CMOS digital integrated
circuits, dynamic logic, VLSI circuit design.
I. INTRODUCTION
Since microarchitectural improvements have been yielding
diminishing returns, microprocessor designers seeking
high performance have been forced to aggressively reduce
cycle times beyond that which simple process scaling would
permit. We can normalize cycle time improvement due to
faster processes by expressing cycle time in terms of the
delay of a fanout-of-four (FO4) inverter, i.e., an inverter
driving a load that is four times its input capacitance. Today’s
fastest microprocessors are operating at cycle times below 18
fanout-of-four inverter delays [1] (see footnote 1).
Domino circuits [2] are an
important enabler for this cycle time improvement [3]–[5]. At
such short cycle times, however, clocking overhead that was
once negligible becomes a significant fraction of the clock
period.
As we will see in Section II, when domino circuits are
pipelined in the same way that two-phase static circuits have
traditionally been pipelined, they are highly sensitive to clock
skew, include latch delays on the critical path, and are incapable
of borrowing time across clock phases to balance
the pipeline. Some designers have discovered that by overlapping
the clocks controlling domino gates, these sources of
overhead can be hidden, as we illustrate in Section III. We
proceed to analyze domino gates using overlapping clocks in
a systematic framework which we call skew-tolerant domino.
Section IV presents the analysis under a single clock skew
budget. Even more global clock skew can be hidden if we
take advantage of tighter bounds on local clock skew, as
described in Section V. For many reasonable designs, this
global skew tolerance greatly exceeds the actual system skews,
so Section VI explains how to take advantage of the extra
overlap to allow time borrowing across phases. Section VII
then addresses the critical issue of clock generation and
shows how a single global clock and relatively simple local
clock generators can produce the needed clock phases, while
Section VIII looks at the interfaces of skew-tolerant domino
with static and self-timed logic. Section IX presents simulation
results of skew-tolerant domino applied to an adder self-bypass
path. Finally, Section X summarizes the skew-tolerant domino
techniques and the performance benefits which they offer.
Manuscript received April 10, 1997; revised August 5, 1997. This work was
supported in part by a National Science Foundation fellowship, by Stanford’s
Center for Integrated Systems, and by DARPA Contract DABT63-94-C-0054.
The authors are with Stanford University, Stanford, CA 94305 USA.
Publisher Item Identifier S 0018-9200(97)08035-9.
Footnote 1: DEC reports an Alpha 21164 cycle time of 14 “gate delays,” where a “gate
delay” is roughly an average fanout-of-three two-input gate. Simulation found
that the average of a two-input fanout-of-three NAND and NOR delay is about
1.24 fanout-of-four inverter delays.
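The conversion in footnote 1 can be checked with a few lines of arithmetic. The sketch below simply multiplies the two figures quoted there; the function and constant names are illustrative, and because the footnote gives both figures as approximate, the result is an estimate.

```python
# Figures quoted in footnote 1 (both stated there as approximate).
DEC_GATE_DELAYS_PER_CYCLE = 14   # Alpha 21164 cycle time in DEC "gate delays"
FO4_PER_DEC_GATE_DELAY = 1.24    # simulated FO4 delays per DEC "gate delay"


def cycle_time_in_fo4(gate_delays, fo4_per_gate_delay):
    """Convert a cycle time in vendor gate delays to FO4 inverter delays."""
    return gate_delays * fo4_per_gate_delay


alpha_21164_fo4 = cycle_time_in_fo4(DEC_GATE_DELAYS_PER_CYCLE,
                                    FO4_PER_DEC_GATE_DELAY)
print(f"{alpha_21164_fo4:.2f} FO4 delays per cycle")
```

The product falls just under the 18 FO4 delays cited in the Introduction, which is consistent with the claim that the fastest microprocessors operate below that figure.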
II. TEXTBOOK DOMINO CIRCUITS
We begin with a review of a simple form of domino circuits,
covering why domino is beneficial, how
pipelines can be constructed, and why such textbook pipelines
have serious overhead.
Static CMOS gates are slow because an input must drive
both NMOS and PMOS transistors. In any transition, either
the pull-up or pull-down network is activated, meaning the
input capacitance of the inactive network loads down the path.
Moreover, PMOS transistors have poor mobility and must be
sized larger to achieve comparable rising and falling delays,
further increasing input capacitance. Dynamic gates overcome
this weakness by eliminating the PMOS pull-up network and replacing
it with a single clocked precharge transistor. The dynamic
gate is precharged high, then may evaluate low through an
NMOS stack. Unfortunately, if one dynamic inverter directly
drives another, a race can corrupt the result. When clk rises,
both outputs have been precharged high. The high input to
the first gate causes its output to fall, but the second gate’s
output also falls in response to its initial high input. The circuit
therefore produces an incorrect result because the second
output, once discharged, cannot rise again during evaluation. Domino circuits solve
this problem by using inverting static gates between dynamic
gates so that the input to each dynamic gate is initially low. The
falling dynamic output and rising static output ripple through a
chain of gates like a stream of toppling dominos. In summary,
domino logic runs 1.5–2× faster than static CMOS logic [6]
because dynamic gates present a much lower input capacitance
for the same output current and have a lower switching
threshold, and because the inverting static gate can be skewed
to favor the critical monotonically rising evaluation edges.
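The race between cascaded dynamic gates, and the way an inverting static gate removes it, can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not a circuit simulation: it uses unit delays, treats the static inverter as instantaneous, and models each dynamic inverter as a node that starts high and can only fall. The function name and its parameters are hypothetical.

```python
def evaluate_chain(a, use_static_inverters, steps=4):
    """Toy unit-delay model of two cascaded dynamic inverters.

    Returns the two dynamic-node values (x, y) after evaluation.
    Both nodes start precharged high and can only fall during evaluation.
    """
    x, y = 1, 1  # state as clk rises: both dynamic outputs precharged high
    for _ in range(steps):
        # Input to the second dynamic gate: either the first dynamic node
        # directly, or that node passed through an inverting static gate.
        in2 = (1 - x) if use_static_inverters else x
        # A dynamic inverter discharges when its input is high and
        # otherwise holds its value; it cannot pull up until precharge.
        next_x = 0 if a else x
        next_y = 0 if in2 else y
        x, y = next_x, next_y
    return x, y


# Direct connection, a = 1: the intended result is x = 0, y = NOT x = 1,
# but y discharges on its initial high input and cannot recover.
print(evaluate_chain(1, use_static_inverters=False))
# Domino connection, a = 1: y sees a low input until x has fallen, then
# discharges, which is the intended non-inverting result y = x = 0.
print(evaluate_chain(1, use_static_inverters=True))
```

In this model the directly connected chain gives the wrong value on the second node when the first input is high, while the domino arrangement matches the intended logic for both input values.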
After domino gates evaluate, they must be precharged before
they can be used in the next cycle. If all domino gates were to
precharge simultaneously, the circuit would waste time during
which no useful computation occurs. Therefore, domino logic
is conventionally divided into two phases, ping-ponged such