intended changes to the network’s topology in a model [26],
which in turn triggers our deployment systems and opera-
tional staff to make the necessary physical and configuration
changes to the network. As we will describe later, Orion
propagates this top-level intent into network control applica-
tions, such as routing, through configuration and dynamic
state changes. Applications react to top-level intent changes
by mutating their internal state and by generating intermedi-
ate intent, which is in turn consumed by other applications.
The overall system state evolves through a hierarchical prop-
agation of intent ultimately resulting in changes to the pro-
grammed flow state in network switches.
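As a rough illustration of this propagation pattern (class and method names here are our own assumptions, not Orion's actual interfaces), the sketch below models applications that consume intent, mutate local state, and emit intermediate intent for the layer below them, terminating in programmed flow state:

```python
# Minimal sketch of hierarchical intent propagation (illustrative only;
# the names and structure are assumptions, not Orion's actual API).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Intent:
    """A declarative statement of desired state at some layer."""
    kind: str                               # e.g. "topology", "routing", "flow"
    payload: Dict[str, str] = field(default_factory=dict)


class App:
    """A control application: consumes intent, updates internal state,
    and emits intermediate intent for applications below it."""
    def __init__(self, name: str):
        self.name = name
        self.state: Dict[str, str] = {}

    def handle(self, intent: Intent) -> List[Intent]:
        raise NotImplementedError


class RoutingApp(App):
    def handle(self, intent: Intent) -> List[Intent]:
        # React to topology-level intent by updating internal state and
        # emitting per-switch flow intent (route computation elided).
        self.state["last_topology"] = intent.payload.get("version", "?")
        return [Intent("flow", {"switch": sw, "rule": "forward-via-spine"})
                for sw in ("sw1", "sw2")]


class SwitchProgrammer(App):
    def handle(self, intent: Intent) -> List[Intent]:
        # Terminal layer: translate flow intent into programmed flow state.
        self.state[intent.payload["switch"]] = intent.payload["rule"]
        return []


def propagate(top_level: Intent, layers: List[App]) -> None:
    """Push intent down the hierarchy; each layer's output feeds the next."""
    pending = [top_level]
    for app in layers:
        pending = [out for i in pending for out in app.handle(i)]


if __name__ == "__main__":
    routing, programmer = RoutingApp("routing"), SwitchProgrammer("programmer")
    propagate(Intent("topology", {"version": "42"}), [routing, programmer])
    print(programmer.state)   # {'sw1': 'forward-via-spine', 'sw2': 'forward-via-spine'}
```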
Align control plane and physical failure domains. One
potential challenge with decoupling control software from
physical elements is failure domains that are misaligned or too
large. For misalignment, consider the case in which a single
SDN controller manages network hardware across portions of
two buildings. A failure in that controller can cause correlated
failures across two buildings, making it harder to meet higher-
level service SLOs. Similarly, the failure of a single SDN
controller responsible for all network elements in a campus
would constitute too large a vulnerability even if it improved
efficiency due to a centralized view.
We address these challenges by carefully aligning network
control domains with physical, storage, and compute domains.
As one simple example, a single failure in network control
should not impact more than one physical, storage, or com-
pute domain. To limit the “blast radius” of individual con-
troller failures, we leverage hierarchical, partitioned control
with soft state progressing up the hierarchy (§5.1). We explic-
itly design and test the network to continue correct, though
likely degraded, operation in the face of controller failures.
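As a hypothetical illustration of the alignment rule above (the domain model and names are our assumptions, not Orion's configuration schema), one can check that no single control domain spans more than one physical domain:

```python
# Sketch of a failure-domain alignment check (illustrative; the domain
# model and names are assumptions, not Orion's actual configuration).
from collections import defaultdict
from typing import Dict, List, Tuple

# Map each switch to (control_domain, physical_domain).
SWITCH_DOMAINS: Dict[str, Tuple[str, str]] = {
    "sw-a1": ("ctrl-1", "building-A"),
    "sw-a2": ("ctrl-1", "building-A"),
    "sw-b1": ("ctrl-2", "building-B"),   # a separate controller for building B
}


def misaligned_control_domains(switches: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return control domains whose failure would span more than one physical domain."""
    spans: Dict[str, set] = defaultdict(set)
    for _, (ctrl, phys) in switches.items():
        spans[ctrl].add(phys)
    return [ctrl for ctrl, phys_set in spans.items() if len(phys_set) > 1]


assert misaligned_control_domains(SWITCH_DOMAINS) == []  # aligned as designed
```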
3.2 Principles related to an SDN controller
SDN enables novel approaches to handling failures, but
it also introduces new challenges requiring careful design.
The SDN controller is remote from the network switches, which
removes fate sharing but also introduces the possibility of losing
communication with the switches.
Lack of fate sharing can often be used to our advantage.
For example, the network continues forwarding based on
its existing state when the controller fails. Conversely, the
controller can repair paths accurately and in a timely manner
when individual switches fail, by rerouting around them.
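To make the fate-sharing point concrete, the following is a simplified sketch (our own, not the actual switch agent) of switch-side behavior that keeps its last programmed flow table when the controller becomes unreachable, so forwarding continues on existing state:

```python
# Sketch of switch-side behavior without fate sharing (simplified; timeout
# value and structure are assumptions). Flow state persists across outages.
import time
from typing import Dict, Optional


class SwitchAgent:
    def __init__(self, controller_timeout_s: float = 10.0):
        self.flow_table: Dict[str, str] = {}        # match -> action
        self.last_controller_contact: Optional[float] = None
        self.controller_timeout_s = controller_timeout_s

    def program_flows(self, flows: Dict[str, str]) -> None:
        """Called when the controller successfully programs the switch."""
        self.flow_table.update(flows)
        self.last_controller_contact = time.monotonic()

    def controller_reachable(self) -> bool:
        return (self.last_controller_contact is not None and
                time.monotonic() - self.last_controller_contact
                < self.controller_timeout_s)

    def forward(self, match: str) -> Optional[str]:
        # Forwarding consults only local flow state; losing the controller
        # does not erase it, so traffic keeps flowing on last-known rules.
        return self.flow_table.get(match)
```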
React optimistically to correlated unreachability. The
loss of communication between controller and switches poses
a difficult design challenge as the controller must deal with
incomplete information. We handle incomplete information
by first deciding whether we are dealing with a minor failure
or a major one, and then reacting pessimistically to the former
and optimistically to the latter.
Figure 1: Network behavior in three cases. Normal (left): A network with healthy switches. Flows from top to bottom switches use all middle switches. Fail Closed (mid): With a few switches in unknown state (grey), the controller conservatively routes around them. Fail Static (right): With enough switches in unknown state, the controller no longer routes around newly perceived failed switches.

We start by associating a ternary health state with network elements: (i) healthy, with recent control communication (a switch reports healthy link and programming state with no packet loss); (ii) unhealthy, when a switch declares itself to be unhealthy, when neighboring switches report unhealthy conditions, or when indirect signals implicate the switch; and (iii) unknown, with no recent control communication with a switch and no indirect signals to implicate the switch.
A switch in the unknown state could be malfunctioning, or
it could simply be unable to communicate with a controller
(a fairly common occurrence at scale). In comparison, the
unhealthy state is fairly rare, as there are few opportunities to
diagnose unequivocal failure conditions in real time.²
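The following is a minimal sketch of such a ternary classification; the freshness threshold, field names, and signals are illustrative assumptions rather than Orion's actual health model:

```python
# Sketch of the ternary switch-health classification described above
# (field names and the freshness threshold are illustrative assumptions).
import enum
import time
from dataclasses import dataclass
from typing import Optional


class Health(enum.Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"


@dataclass
class SwitchReport:
    last_contact: Optional[float]       # monotonic time of last control message
    self_reported_unhealthy: bool       # switch declared itself unhealthy
    implicated_by_neighbors: bool       # e.g. a neighbor saw its link go down


def classify(report: SwitchReport, now: float, freshness_s: float = 15.0) -> Health:
    if report.self_reported_unhealthy or report.implicated_by_neighbors:
        return Health.UNHEALTHY
    if report.last_contact is not None and now - report.last_contact < freshness_s:
        return Health.HEALTHY
    # No recent control communication and no indirect signals: unknown,
    # which is common at scale and does not necessarily mean failure.
    return Health.UNKNOWN


print(classify(SwitchReport(time.monotonic(), False, False), time.monotonic()))
# Health.HEALTHY
```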
The controller aggregates individual switch states into a
network-wide health state, which it uses to decide between
a pessimistic or an optimistic reaction. We call these Fail Closed
and Fail Static, respectively. In Fail Closed, the con-
troller re-programs flows to route around a (perceived) failed
switch. In Fail Static, the controller decides not to react to a
switch in an unknown, potentially failed, state, keeping traffic
flowing toward it until the switch state changes or the network
operator intervenes. Figure 1 illustrates an example of normal
operation, Fail Closed reaction, and Fail Static condition.
In Fail Static, the controller holds back from reacting to
avoid worsening the overall state of the network, both in
terms of connectivity and congestion. The trade-off between
Fail Closed and Fail Static is governed by the cost/benefit
implication of reacting to the unknown state: if the element
in the unknown state can be avoided without a significant
performance cost, the controller conservatively reacts to this
state and triggers coordinated actions to steer traffic away
from the possible failures. If the reaction would result in a
significant loss in capacity or loss in end-to-end connectivity,
the controller instead enters Fail Static mode for that switch.
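One hedged way to express this trade-off in code, anticipating the "capacity degradation threshold" described next, is sketched below; the capacity model, the 30% threshold, and the network-wide (rather than per-switch) decision are our simplifying assumptions, not Orion's actual values or logic:

```python
# Sketch of the Fail Closed / Fail Static decision (illustrative; the
# capacity model and the 30% threshold are assumptions, not Orion's values).
from enum import Enum
from typing import Dict


class Mode(Enum):
    FAIL_CLOSED = "fail_closed"   # route around switches in unknown state
    FAIL_STATIC = "fail_static"   # keep existing routes through them


def choose_mode(switch_health: Dict[str, str],
                capacity_per_switch: Dict[str, float],
                degradation_threshold: float = 0.30) -> Mode:
    """Pick a reaction to switches in the 'unknown' state.

    If avoiding all unknown switches would shed more than
    `degradation_threshold` of total capacity, hold existing routes
    (Fail Static) rather than risk congestion or lost connectivity.
    """
    total = sum(capacity_per_switch.values())
    lost = sum(cap for sw, cap in capacity_per_switch.items()
               if switch_health.get(sw) == "unknown")
    if total == 0 or lost / total > degradation_threshold:
        return Mode.FAIL_STATIC
    return Mode.FAIL_CLOSED


health = {"sw1": "healthy", "sw2": "unknown", "sw3": "healthy", "sw4": "healthy"}
caps = {sw: 1.0 for sw in health}
print(choose_mode(health, caps))   # Mode.FAIL_CLOSED (only 25% of capacity at risk)
```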
In practice we use a simple “capacity degradation threshold”
to move from Fail Closed to Fail Static. The actual threshold
value is directly related to: (1) the operating parameters of
the network, especially the capacity headroom we typically
reserve, for example, to support planned maintenance; (2) the
level of redundancy we design in the topology and control
² It is not common for a software component to be able to self-diagnose a
failure, without being able to avoid it in the first place, or at least repair it.
Slightly more common is the ability to observe a failure from an external
vantage point, e.g. a neighboring switch detecting a link “going down.”