intended changes to the network’s topology in a model [26],
which in turn triggers our deployment systems and opera-
tional staff to make the necessary physical and configuration
changes to the network. As we will describe later, Orion
propagates this top-level intent into network control applica-
tions, such as routing, through configuration and dynamic
state changes. Applications react to top-level intent changes
by mutating their internal state and by generating intermedi-
ate intent, which is in turn consumed by other applications.
The overall system state evolves through a hierarchical prop-
agation of intent ultimately resulting in changes to the pro-
grammed flow state in network switches.
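As a rough illustration of this propagation pattern (class and method names here are our own assumptions, not Orion's actual interfaces), the sketch below models applications that consume intent, mutate local state, and emit intermediate intent for the layer below them, terminating in programmed flow state:

```python
# Minimal sketch of hierarchical intent propagation (illustrative only;
# the names and structure are assumptions, not Orion's actual API).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Intent:
    """A declarative statement of desired state at some layer."""
    kind: str                               # e.g. "topology", "routing", "flow"
    payload: Dict[str, str] = field(default_factory=dict)


class App:
    """A control application: consumes intent, updates internal state,
    and emits intermediate intent for applications below it."""
    def __init__(self, name: str):
        self.name = name
        self.state: Dict[str, str] = {}

    def handle(self, intent: Intent) -> List[Intent]:
        raise NotImplementedError


class RoutingApp(App):
    def handle(self, intent: Intent) -> List[Intent]:
        # React to topology-level intent by updating internal state and
        # emitting per-switch flow intent (route computation elided).
        self.state["last_topology"] = intent.payload.get("version", "?")
        return [Intent("flow", {"switch": sw, "rule": "forward-via-spine"})
                for sw in ("sw1", "sw2")]


class SwitchProgrammer(App):
    def handle(self, intent: Intent) -> List[Intent]:
        # Terminal layer: translate flow intent into programmed flow state.
        self.state[intent.payload["switch"]] = intent.payload["rule"]
        return []


def propagate(top_level: Intent, layers: List[App]) -> None:
    """Push intent down the hierarchy; each layer's output feeds the next."""
    pending = [top_level]
    for app in layers:
        pending = [out for i in pending for out in app.handle(i)]


if __name__ == "__main__":
    routing, programmer = RoutingApp("routing"), SwitchProgrammer("programmer")
    propagate(Intent("topology", {"version": "42"}), [routing, programmer])
    print(programmer.state)   # {'sw1': 'forward-via-spine', 'sw2': 'forward-via-spine'}
```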
Align control plane and physical failure domains. One
potential challenge with decoupling control software from
physical elements is failure domains that are misaligned or too
large. For misalignment, consider the case in which a single
SDN controller manages network hardware across portions of
two buildings. A failure in that controller can cause correlated
failures across two buildings, making it harder to meet higher-
level service SLOs. Similarly, the failure of a single SDN
controller responsible for all network elements in a campus
would constitute too large a vulnerability even if it improved
efficiency due to a centralized view.
We address these challenges by carefully aligning network
control domains with physical, storage, and compute domains.
As one simple example, a single failure in network control
should not impact more than one physical, storage, or com-
pute domain. To limit the “blast radius” of individual con-
troller failures, we leverage hierarchical, partitioned control
with soft state progressing up the hierarchy (§5.1). We explic-
itly design and test the network to continue correct, though
likely degraded, operation in the face of controller failures.
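As a hypothetical illustration of the alignment rule above (the domain model and names are our assumptions, not Orion's configuration schema), one can check that no single control domain spans more than one physical domain:

```python
# Sketch of a failure-domain alignment check (illustrative; the domain
# model and names are assumptions, not Orion's actual configuration).
from collections import defaultdict
from typing import Dict, List, Tuple

# Map each switch to (control_domain, physical_domain).
SWITCH_DOMAINS: Dict[str, Tuple[str, str]] = {
    "sw-a1": ("ctrl-1", "building-A"),
    "sw-a2": ("ctrl-1", "building-A"),
    "sw-b1": ("ctrl-2", "building-B"),   # a separate controller for building B
}


def misaligned_control_domains(switches: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return control domains whose failure would span more than one physical domain."""
    spans: Dict[str, set] = defaultdict(set)
    for _, (ctrl, phys) in switches.items():
        spans[ctrl].add(phys)
    return [ctrl for ctrl, phys_set in spans.items() if len(phys_set) > 1]


assert misaligned_control_domains(SWITCH_DOMAINS) == []  # aligned as designed
```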
3.2 Principles related to an SDN controller
SDN enables novel approaches to handling failures, but
it also introduces new challenges requiring careful design.
The SDN controller is remote from the network switches, which
removes fate sharing but also introduces the possibility of losing
communication with the switches.
Lack of fate sharing can often be used to our advantage.
For example, the network continues forwarding based on
its existing state when the controller fails. Conversely, the
controller can repair paths accurately and in a timely manner
when individual switches fail, by rerouting around them.
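To make the fate-sharing point concrete, the following is a simplified sketch (our own, not the actual switch agent) of switch-side behavior that keeps its last programmed flow table when the controller becomes unreachable, so forwarding continues on existing state:

```python
# Sketch of switch-side behavior without fate sharing (simplified; timeout
# value and structure are assumptions). Flow state persists across outages.
import time
from typing import Dict, Optional


class SwitchAgent:
    def __init__(self, controller_timeout_s: float = 10.0):
        self.flow_table: Dict[str, str] = {}        # match -> action
        self.last_controller_contact: Optional[float] = None
        self.controller_timeout_s = controller_timeout_s

    def program_flows(self, flows: Dict[str, str]) -> None:
        """Called when the controller successfully programs the switch."""
        self.flow_table.update(flows)
        self.last_controller_contact = time.monotonic()

    def controller_reachable(self) -> bool:
        return (self.last_controller_contact is not None and
                time.monotonic() - self.last_controller_contact
                < self.controller_timeout_s)

    def forward(self, match: str) -> Optional[str]:
        # Forwarding consults only local flow state; losing the controller
        # does not erase it, so traffic keeps flowing on last-known rules.
        return self.flow_table.get(match)
```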
React optimistically to correlated unreachability. The
loss of communication between controller and switches poses
a difficult design challenge as the controller must deal with
incomplete information. We handle incomplete information
by first deciding whether we are dealing with a minor failure
or a major one, and then reacting pessimistically to the former
and optimistically to the latter.
Figure 1: Network behavior in three cases. Normal (left): A network with healthy switches. Flows from top to bottom switches use all middle switches. Fail Closed (mid): With a few switches in unknown state (grey), the controller conservatively routes around them. Fail Static (right): With enough switches in unknown state, the controller no longer routes around newly perceived failed switches.

We start by associating a ternary health state with network elements: (i) healthy, with recent control communication (a switch reports healthy link and programming state with no packet loss); (ii) unhealthy, when a switch declares itself to be unhealthy, when neighboring switches report unhealthy conditions, or when indirect signals implicate the switch; and (iii) unknown, with no recent control communication with a switch and no indirect signals to implicate the switch.
A switch in the unknown state could be malfunctioning, or
it could simply be unable to communicate with a controller
(a fairly common occurrence at scale). In comparison, the
unhealthy state is fairly rare, as there are few opportunities to
diagnose unequivocal failure conditions in real time.²
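The following is a minimal sketch of such a ternary classification; the freshness threshold, field names, and signals are illustrative assumptions rather than Orion's actual health model:

```python
# Sketch of the ternary switch-health classification described above
# (field names and the freshness threshold are illustrative assumptions).
import enum
import time
from dataclasses import dataclass
from typing import Optional


class Health(enum.Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"


@dataclass
class SwitchReport:
    last_contact: Optional[float]       # monotonic time of last control message
    self_reported_unhealthy: bool       # switch declared itself unhealthy
    implicated_by_neighbors: bool       # e.g. a neighbor saw its link go down


def classify(report: SwitchReport, now: float, freshness_s: float = 15.0) -> Health:
    if report.self_reported_unhealthy or report.implicated_by_neighbors:
        return Health.UNHEALTHY
    if report.last_contact is not None and now - report.last_contact < freshness_s:
        return Health.HEALTHY
    # No recent control communication and no indirect signals: unknown,
    # which is common at scale and does not necessarily mean failure.
    return Health.UNKNOWN


print(classify(SwitchReport(time.monotonic(), False, False), time.monotonic()))
# Health.HEALTHY
```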
The controller aggregates individual switch states into a
network-wide health state, which it uses to decide between
a pessimistic or an optimistic reaction. We call these Fail Closed
and Fail Static, respectively. In Fail Closed, the con-
troller re-programs flows to route around a (perceived) failed
switch. In Fail Static, the controller decides not to react to a
switch in an unknown, potentially failed, state, keeping traffic
flowing toward it until the switch state changes or the network
operator intervenes. Figure 1 illustrates an example of normal
operation, Fail Closed reaction, and Fail Static condition.
In Fail Static, the controller holds back from reacting to
avoid worsening the overall state of the network, both in
terms of connectivity and congestion. The trade-off between
Fail Closed and Fail Static is governed by the cost/benefit
implication of reacting to the unknown state: if the element
in the unknown state can be avoided without a significant
performance cost, the controller conservatively reacts to this
state and triggers coordinated actions to steer traffic away
from the possible failures. If the reaction would result in a
significant loss in capacity or loss in end-to-end connectivity,
the controller instead enters Fail Static mode for that switch.
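One hedged way to express this trade-off in code, anticipating the "capacity degradation threshold" described next, is sketched below; the capacity model, the 30% threshold, and the network-wide (rather than per-switch) decision are our simplifying assumptions, not Orion's actual values or logic:

```python
# Sketch of the Fail Closed / Fail Static decision (illustrative; the
# capacity model and the 30% threshold are assumptions, not Orion's values).
from enum import Enum
from typing import Dict


class Mode(Enum):
    FAIL_CLOSED = "fail_closed"   # route around switches in unknown state
    FAIL_STATIC = "fail_static"   # keep existing routes through them


def choose_mode(switch_health: Dict[str, str],
                capacity_per_switch: Dict[str, float],
                degradation_threshold: float = 0.30) -> Mode:
    """Pick a reaction to switches in the 'unknown' state.

    If avoiding all unknown switches would shed more than
    `degradation_threshold` of total capacity, hold existing routes
    (Fail Static) rather than risk congestion or lost connectivity.
    """
    total = sum(capacity_per_switch.values())
    lost = sum(cap for sw, cap in capacity_per_switch.items()
               if switch_health.get(sw) == "unknown")
    if total == 0 or lost / total > degradation_threshold:
        return Mode.FAIL_STATIC
    return Mode.FAIL_CLOSED


health = {"sw1": "healthy", "sw2": "unknown", "sw3": "healthy", "sw4": "healthy"}
caps = {sw: 1.0 for sw in health}
print(choose_mode(health, caps))   # Mode.FAIL_CLOSED (only 25% of capacity at risk)
```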
In practice we use a simple “capacity degradation threshold”
to move from Fail Closed to Fail Static. The actual threshold
value is directly related to: (1) the operating parameters of
the network, especially the capacity headroom we typically
reserve, for example, to support planned maintenance; (2) the
level of redundancy we design in the topology and control
² It is not common for a software component to be able to self-diagnose a
failure, without being able to avoid it in the first place, or at least repair it.
Slightly more common is the ability to observe a failure from an external
vantage point, e.g. a neighboring switch detecting a link “going down.”