Figure 4: The architecture of the L2 cache. [Block diagram; labeled components: global control logic; NoC1 and NoC3 input buffers; NoC2 output buffer; MSHR; tag, state, directory, and data arrays; decode, way selection, and stall logic for each pipeline; message to send.]
coherence packet formats, and a write-back layer, caching
stores from the write-through L1 data cache. It is an 8KB
4-way set associative write-back cache (the same size as the
L1 data cache by default) with configurable associativity
and size. The line size is 16 bytes, the same as the L1 data
cache.
The L1.5 communicates requests and responses to and
from the core through CCX. The CCX bus is preserved as
the primary interface to the OpenSPARC T1. The L1.5 CCX
interface could relatively easily be replaced with other inter-
faces like AMBA or AXI to accommodate different cores.
When a memory request results in a miss, the L1.5 translates
the request and forwards it to the L2 through the
network-on-chip (NoC) channels. Generally, the L1.5 issues requests
on NoC1, receives data on NoC2, and writes back modified
cache lines on NoC3, as shown in Figure 3.
While the L1.5 was named as such during the devel-
opment of the Piton ASIC prototype, in traditional com-
puter architecture contexts it would be appropriate to call
it the “private L2” and to call the next level cache the
“shared/distributed L3”. The L1.5 is inclusive of the L1 data
cache; each can be independently sized with independent
eviction policies. As a space- and performance-conscious
optimization, the L1.5 does not cache instructions; these
cache lines bypass it and pass directly between the L1
instruction cache and the L2. It is possible to modify the L1.5 to
also cache instructions.
2.3.3 L2 Cache
The L2 cache is a distributed write-back cache shared by all
tiles. The default cache configuration is 64KB per tile and
4-way set associativity, but both the cache size and associa-
tivity are configurable. The cache line size is 64 bytes, larger
than caches lower in the hierarchy. The integrated directory
cache has 64 bits per entry, so it can precisely keep track of
up to 64 sharers by default.
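The default geometry above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: `cache_geometry` and `SharerVector` are hypothetical names rather than OpenPiton identifiers, and the sketch assumes power-of-two sizes and one directory bit per sharer, as the 64-bit entry width suggests.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (num_sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, associativity, and line size are all powers of two.
    """
    num_sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    return num_sets, offset_bits, index_bits


class SharerVector:
    """Hypothetical model of one directory entry: one bit per sharer."""

    WIDTH = 64  # bits per directory entry, per the text

    def __init__(self):
        self.bits = 0

    def add(self, sharer_id):
        if not 0 <= sharer_id < self.WIDTH:
            raise ValueError("sharer id does not fit in the directory entry")
        self.bits |= 1 << sharer_id

    def remove(self, sharer_id):
        self.bits &= ~(1 << sharer_id)

    def sharers(self):
        return [i for i in range(self.WIDTH) if (self.bits >> i) & 1]


# Default configurations quoted in the text.
l2_default = cache_geometry(64 * 1024, ways=4, line_bytes=64)
l15_default = cache_geometry(8 * 1024, ways=4, line_bytes=16)
```

With the defaults, each 64KB L2 slice works out to 256 sets and the 8KB L1.5 to 128 sets. Which address bits actually index each array is not stated here, so the bit counts are only sizes, not positions.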
The L2 cache is inclusive of the private caches (L1 and
L1.5). Cache line way mapping between the L1.5 and the
L2 is independent and is entirely subject to the replacement
policy of each cache. In fact, since the L2 is distributed,
cache lines consecutively mapped in the L1.5 are likely to
be strewn across multiple L2 tiles (L2 tile referring to a
portion of the distributed L2 cache in a single tile). By
default, OpenPiton maps cache lines using constant strides
with the lower address bits across all L2 tiles, but Coherence
Domain Restriction (CDR) [30], an experimental research
feature integrated into OpenPiton, can be used to interleave
cache lines belonging to a single application or page across
a software-specified set of L2 tiles.
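The two mapping schemes can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the OpenPiton implementation: the function names are hypothetical, and the sketch assumes the home tile is chosen from the address bits just above the 64-byte line offset, since the exact bit positions are not given in the text.

```python
LINE_BYTES = 64  # L2 cache line size from the text


def home_tile(addr, num_tiles):
    """Default-style mapping: consecutive lines go to consecutive tiles.

    Assumption: the tile is selected by the line number modulo the tile
    count, i.e. by the address bits just above the line offset.
    """
    return (addr // LINE_BYTES) % num_tiles


def home_tile_restricted(addr, allowed_tiles):
    """CDR-style mapping: interleave lines over a software-chosen tile list."""
    return allowed_tiles[(addr // LINE_BYTES) % len(allowed_tiles)]
```

For example, with four tiles the lines at 0x00, 0x40, 0x80, and 0xC0 land on tiles 0 through 3, and the line at 0x100 wraps back to tile 0. With a restricted list such as `[2, 5, 7]`, the same lines cycle over only those three tiles.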
As shown in Figure 4, the L2 cache is designed with
dual parallel pipelines. The first pipeline (top) receives cache
miss request packets from lower in the cache hierarchy on
NoC1 and sends memory request packets to off-chip DRAM
and cache fill response packets to lower in the cache hierar-
chy on NoC2. The second pipeline (bottom) receives mem-
ory response packets from off-chip DRAM and modified
cache line writeback packets from lower in the cache hier-
archy on NoC3. The first L2 pipeline contains 4 stages and
the second pipeline contains only 3 stages since it does not
transmit output packets. The interaction between the L2 and
the three NoCs is also depicted in Figure 3.
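The division of traffic between the two pipelines can be summarized as a small lookup table. The Python sketch below only restates the paragraph above; the message names are hypothetical labels chosen for illustration, not OpenPiton packet types.

```python
# Which NoC carries each kind of message, as described in the text.
NOC_OF_MESSAGE = {
    "miss_request": "NoC1",     # private caches -> L2
    "memory_request": "NoC2",   # L2 -> off-chip DRAM
    "fill_response": "NoC2",    # L2 -> private caches
    "memory_response": "NoC3",  # off-chip DRAM -> L2
    "writeback": "NoC3",        # private caches -> L2
}

# The two L2 pipelines: input NoC, output NoC, and stage count.
L2_PIPELINES = {
    1: {"input": "NoC1", "output": "NoC2", "stages": 4},
    2: {"input": "NoC3", "output": None, "stages": 3},
}

# Messages the L2 sends rather than receives.
L2_OUTBOUND = {"memory_request", "fill_response"}


def l2_pipeline_for(message):
    """Return the L2 pipeline that receives `message`, or None if the L2 only sends it."""
    if message in L2_OUTBOUND:
        return None
    noc = NOC_OF_MESSAGE[message]
    for number, pipe in L2_PIPELINES.items():
        if pipe["input"] == noc:
            return number
    return None
```

The table makes the asymmetry visible: only the first pipeline has an output NoC, which is why the second pipeline can be one stage shorter.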
2.4 Cache Coherence and Memory Consistency Model
The memory subsystem maintains cache coherence with a
directory-based MESI coherence protocol. It adheres to the
TSO memory consistency model used by the OpenSPARC
T1. Coherence messages between the L1.5 caches and the L2 caches
travel over three NoCs, which are carefully designed to
ensure deadlock-free operation.
The L2 is the point of coherence for all memory requests,
except for non-cacheable loads and stores, which bypass
the L2 cache directly. All other memory operations (including
atomic operations such as compare-and-swap) are ordered,
and the L2 strictly follows this order when servicing
requests.
The L2 also keeps the instruction and data caches coher-
ent. Per the OpenSPARC T1’s original design, coherence be-
tween the two L1 caches is maintained at the L2. When a line
is present in a core’s L1 instruction cache and is loaded as
data, the L2 will send invalidations to the relevant instruction
caches before servicing the load.
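The ordering described above, invalidations first and the data response afterwards, can be sketched as follows. This is a hypothetical Python model, not the L2's actual logic: the function name and message names are invented for illustration, and the sketch models only the order in which messages are sent, not any acknowledgment handling.

```python
def service_data_load(icache_sharers, requester):
    """Return the ordered messages the L2 sends for a data load.

    icache_sharers: set of core ids whose L1 instruction cache holds the
    line; it is cleared in place once the invalidations are queued.
    """
    # Invalidate every instruction cache that holds the line first.
    messages = [("invalidate_icache", core) for core in sorted(icache_sharers)]
    icache_sharers.clear()
    # Only then service the load itself.
    messages.append(("data_response", requester))
    return messages
```

If no instruction cache holds the line, the list contains only the data response.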
High-level features of the coherence protocol include:
• 4-step message communication
• Silent eviction in Exclusive and Shared states
• No acknowledgments for dirty write-backs
• Three 64-bit physical NoCs with point-to-point ordering
• Co-location of L2 cache and coherence directory
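Two of the features listed above, silent eviction and unacknowledged dirty write-backs, determine what a private cache sends when it evicts a line. The Python sketch below is a hypothetical illustration of that behavior. The `State` enum and `evict` function are not OpenPiton names, and the use of NoC3 for the write-back follows the earlier description of the L1.5.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def evict(state):
    """Return (messages_sent, awaits_ack) for evicting a line from a private cache."""
    if state is State.MODIFIED:
        # Dirty data must reach the L2, but no acknowledgment is awaited.
        return [("writeback", "NoC3")], False
    # Exclusive and Shared lines are dropped silently; Invalid has nothing to drop.
    return [], False
```

In general, silent eviction means a directory entry may still list a cache that no longer holds the line, so this sketch covers only the evicting side.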
2.5 Interconnect
There are two major interconnection types used in OpenPiton:
the NoCs and the chip bridge.