it. During this time, the upstream port will continue to
transmit packets. Thus, the ingress port must reserve
buffer space for each priority to absorb packets that
arrive during this “gray period”. This reserved buffer is
called headroom. The size of the headroom is decided by
the MTU size, the PFC reaction time of the egress port,
and most importantly, the propagation delay between
the sender and the receiver.
The propagation delay is determined by the distance
between the sender and the receiver. In our network,
this can be as large as 300 meters. Given that our ToR
and Leaf switches have shallow buffers (9 MB or 12 MB),
we can only reserve enough headroom for two lossless
traffic classes even though the switches support eight
traffic classes. We use one lossless class for real-time
traffic and the other for bulk data transfer.
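To make the dependence on distance concrete, the following is a rough back-of-the-envelope sketch of a headroom estimate, not the sizing formula used in our switches. It assumes a simplified model: the bytes in flight during one cable round trip plus the PFC reaction time, plus two MTUs for packets that cannot be preempted once transmission has started. The function name, the 40 Gb/s link rate, the 5 ns/m propagation speed, the 1 microsecond reaction time, the 1500-byte MTU, and the 32-port count are all illustrative assumptions.

```python
def headroom_bytes(link_gbps, cable_m, mtu_bytes, reaction_s, ns_per_m=5.0):
    """Rough per-port, per-priority headroom estimate (illustrative model only)."""
    bytes_per_s = link_gbps * 1e9 / 8
    # The pause frame must travel upstream while data already on the wire
    # keeps arriving, so a full cable round trip of traffic must be absorbed.
    round_trip_s = 2 * cable_m * ns_per_m * 1e-9
    in_flight = bytes_per_s * (round_trip_s + reaction_s)
    # One MTU-sized packet may delay the pause frame at the receiver, and one
    # may already be in transmission at the sender when the pause arrives.
    return in_flight + 2 * mtu_bytes


# Hypothetical example: 40 Gb/s ports, 300 m cables, 32 ports per switch.
per_class = headroom_bytes(link_gbps=40, cable_m=300, mtu_bytes=1500, reaction_s=1e-6)
for classes in (2, 8):
    total_mb = per_class * 32 * classes / 1e6
    print(f"{classes} lossless classes: about {total_mb:.1f} MB of headroom")
```

Under these assumed numbers, the headroom grows linearly with both cable length and the number of lossless classes, which is why a shallow shared buffer limits how many classes can be made lossless.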
Need for congestion control: PFC works hop by
hop. There may be several hops from the source server
to the destination server. PFC pause frames propagate
from the congestion point back to the source if there is
persistent network congestion. This can cause problems
such as unfairness and victim flows [42].
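The hop-by-hop backpressure described above can be illustrated with a toy model. This is a minimal sketch under strong simplifying assumptions, not a faithful PFC implementation: a single chain of switches, one priority, a single pause threshold (`xoff`) with no separate resume threshold, no headroom, and discrete time steps. The function name and all parameters are hypothetical.

```python
def first_source_pause_tick(num_hops, xoff, line_rate, drain_rate, max_ticks):
    """Return the first tick at which the source is paused, or None.

    queues[0] is the hop next to the source; queues[-1] is the congestion
    point, which can only drain drain_rate packets per tick.
    """
    queues = [0] * num_hops
    for tick in range(1, max_ticks + 1):
        # A hop pauses its upstream neighbor when its queue reaches xoff.
        paused = [q >= xoff for q in queues]
        if paused[0]:
            return tick
        # The congestion point drains at its (possibly reduced) rate.
        queues[-1] -= min(drain_rate, queues[-1])
        # Every other hop forwards at line rate unless its downstream paused it.
        for i in range(num_hops - 2, -1, -1):
            if not paused[i + 1]:
                moved = min(line_rate, queues[i])
                queues[i] -= moved
                queues[i + 1] += moved
        # The source injects at line rate while it is not paused.
        queues[0] += line_rate
    return None


# Persistent congestion (drain slower than arrival) eventually pauses the source.
print(first_source_pause_tick(num_hops=3, xoff=4, line_rate=2, drain_rate=1, max_ticks=100))
# No congestion: the pause never reaches the source within the simulated window.
print(first_source_pause_tick(num_hops=3, xoff=4, line_rate=2, drain_rate=2, max_ticks=100))
```

In this model, any other flow sharing one of the paused links would be stalled as well, even if it is not destined for the congested port; that is the victim-flow effect.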
In order to reduce this collateral damage, flow based
congestion control mechanisms including QCN [13], DC-
QCN [42] and TIMELY [27] have been introduced. We
use DCQCN, which uses ECN for congestion notifica-
tion, in our network. We chose DCQCN because it di-
rectly reacts to the queue lengths at the intermediate
switches and ECN is well supported by all the switches
we use. Small queue lengths reduce the PFC generation
and propagation probability.
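As an illustration of how a switch can turn queue length into ECN marks, the sketch below shows a RED-style marking curve of the kind used with DCQCN [42]: no marking below a lower threshold, a linear ramp up to a maximum probability, and marking of every packet above an upper threshold. The function name and the threshold values in the example are assumptions chosen for illustration, not our production settings.

```python
def ecn_mark_probability(queue_bytes, k_min, k_max, p_max):
    """RED-style ECN marking probability as a function of egress queue length."""
    if not (0 <= k_min < k_max) or not (0.0 <= p_max <= 1.0):
        raise ValueError("require 0 <= k_min < k_max and 0 <= p_max <= 1")
    if queue_bytes <= k_min:
        return 0.0  # short queue: never mark
    if queue_bytes > k_max:
        return 1.0  # long queue: mark every packet
    # In between, the probability rises linearly from 0 to p_max.
    return p_max * (queue_bytes - k_min) / (k_max - k_min)


# Hypothetical thresholds: start marking at 5 KB, ramp to 1% at 200 KB.
for q in (1_000, 50_000, 200_000, 400_000):
    print(q, ecn_mark_probability(q, k_min=5_000, k_max=200_000, p_max=0.01))
```

Placing the lower threshold well below the PFC pause threshold lets senders slow down before any pause frame is generated.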
Though DCQCN helps reduce the number of PFC
pause frames, it is PFC that, as the last line of defense,
protects packets from being dropped. PFC poses several
safety issues, which are the primary focus of this paper and
which we discuss in Section 4. We believe the lessons we
have learned in this paper apply to networks using
TIMELY as well.
Coexistence of RDMA and TCP: In this paper,
RDMA is used for intra-DC communication. TCP
is still needed for inter-DC communication and legacy
applications. We use a separate traffic class (which is
not lossless), with reserved bandwidth, for TCP. Using
different traffic classes isolates TCP and RDMA traffic
from each other.
3. DSCP-BASED PFC
In this section we examine the issues faced by the
original VLAN-based PFC and present our DSCP-based
PFC solution. VLAN-based PFC carries packet prior-
ity in the VLAN tag, which also contains VLAN ID.
The coupling of packet priority and VLAN ID created
two serious problems in our deployment, leading us to
develop a DSCP-based PFC solution.
Figure 3: The packet formats of VLAN-based PFC and
DSCP-based PFC. (a) VLAN-based PFC. (b) DSCP-based
PFC. Note that the PFC pause frame format is the same
in both Figure 3(a) and Figure 3(b).
Figure 3(a) shows the packet formats of the PFC
pause frame and data packets in the original VLAN-based
PFC. The pause frame is a layer-2 frame, and
does not have a VLAN tag. The VLAN tag for the
data packet has four parts: TPID which is fixed to
0x8100, DEI (Drop Eligible Indicator), PCP (Priority
Code Point) which is used to carry packet priority, and
VID (VLAN identifier) which carries the VLAN ID of
the packet.
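The coupling of priority and VLAN ID is visible in the bit layout of the tag. The sketch below packs and parses a 4-byte tag following the standard IEEE 802.1Q layout (16-bit TPID, then a 16-bit field holding 3 bits of PCP, 1 bit of DEI, and 12 bits of VID). The function names are hypothetical helpers written for illustration.

```python
import struct

TPID_8021Q = 0x8100


def build_vlan_tag(pcp, dei, vid):
    """Pack PCP (3 bits), DEI (1 bit) and VID (12 bits) into a 4-byte tag."""
    if not (0 <= pcp <= 7 and dei in (0, 1) and 0 <= vid <= 0x0FFF):
        raise ValueError("field out of range")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)  # network byte order


def parse_vlan_tag(tag):
    """Split a 4-byte 802.1Q tag into its TPID, PCP, DEI and VID fields."""
    if len(tag) != 4:
        raise ValueError("an 802.1Q tag is exactly 4 bytes")
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError(f"unexpected TPID 0x{tpid:04x}")
    return {
        "tpid": tpid,
        "pcp": tci >> 13,          # top 3 bits: the priority PFC acts on
        "dei": (tci >> 12) & 0x1,  # next bit: drop eligible indicator
        "vid": tci & 0x0FFF,       # low 12 bits: VLAN ID
    }


# Example: priority 3 on (hypothetical) VLAN 100.
print(parse_vlan_tag(build_vlan_tag(pcp=3, dei=0, vid=100)))
```

Because PCP and VID share the same 16-bit field, a packet cannot carry a priority without also carrying a VLAN ID.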
For our purpose, although we need only PCP, VID
and PCP cannot be separated. Thus, to support PFC,
we have to configure VLAN at both the server and the
switch side. In order for the switch ports to support
VLAN, we need to put the server-facing switch ports
into trunk mode (which supports VLAN tagged pack-
ets) instead of access mode (which sends and receives
untagged packets). The basic PFC functionality works
with this configuration, but it leads to two problems.
First, the switch trunk mode has an undesirable inter-
action with our OS provisioning service. OS provision-
ing is a fundamental service which needs to run when
the server OS needs to be installed or upgraded, and
when the servers need to be provisioned or repaired.
For data centers at our scale, OS provisioning has to
be done automatically. We use PXE (Preboot eXecution
Environment) boot to install the OS over the network.
When a server goes through PXE boot, its NIC does
not have VLAN configuration and as a result cannot
send or receive packets with VLAN tags. But since
the server-facing switch ports are configured in trunk
mode, these ports can only send packets with VLAN
tags. Hence the PXE boot communication between the