delivery of notifications, and develop tools for a single-
threaded implementation. Section 3 discusses the issues
that arise in a distributed implementation.
At any point in an execution, the set of timestamps at
which future messages can occur is constrained by the
current set of unprocessed events (messages and notifi-
cation requests), and by the graph structure. Messages
in a timely dataflow system flow only along edges, and
their timestamps are modified by ingress, egress, and
feedback vertices. Since events cannot send messages
backwards in time, we can use this structure to compute
lower bounds on the timestamps of messages an event
can cause. By applying this computation to the set of
unprocessed events, we can identify the vertex notifica-
tions that may be correctly delivered.
Each event has a timestamp and a location (either a
vertex or edge), and we refer to these as a pointstamp:
Pointstamp : (t ∈ Timestamp, l ∈ Edge ∪ Vertex),
where l is the location of the event.
The SENDBY and NOTIFYAT methods generate new
events: for v.SENDBY(e,m,t) the pointstamp of m is
(t, e) and for v.NOTIFYAT(t) the pointstamp of the no-
tification is (t,v).
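To make these definitions concrete, the following Python
sketch shows one way an implementation might represent
pointstamps; the type names and the encoding of a timestamp
as an (epoch, loop-counter stack) pair are our assumptions,
not Naiad's actual types.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# A timestamp pairs an input epoch with a stack of loop
# counters (one per nesting level, innermost last).
Timestamp = Tuple[int, Tuple[int, ...]]

@dataclass(frozen=True)
class Vertex:
    name: str

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str

Location = Union[Vertex, Edge]

@dataclass(frozen=True)
class Pointstamp:
    t: Timestamp   # timestamp of the event
    l: Location    # edge or vertex where the event occurs

# v.SENDBY(e, m, t) yields pointstamp (t, e);
# v.NOTIFYAT(t) yields pointstamp (t, v).
def sendby_pointstamp(e: Edge, t: Timestamp) -> Pointstamp:
    return Pointstamp(t, e)

def notifyat_pointstamp(v: Vertex, t: Timestamp) -> Pointstamp:
    return Pointstamp(t, v)
```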
The structural constraints on timely dataflow graphs
induce an order on pointstamps. We say a pointstamp
(t₁, l₁) could-result-in (t₂, l₂) if and only if there
exists a path ψ = ⟨l₁, …, l₂⟩ in the dataflow graph such
that the timestamp ψ(t₁) that results from adjusting t₁
according to each ingress, egress, or feedback vertex
occurring on that path satisfies ψ(t₁) ≤ t₂. Each path
can be summarized by the loop coordinates that its
vertices remove, add, and increment; the resulting path
summary between l₁ and l₂ is a function that transforms
a timestamp at l₁ to a timestamp at l₂. The structure of
timely dataflow graphs ensures that, for any locations
l₁ and l₂ connected by two paths with different summaries,
one of the path summaries always yields adjusted
timestamps earlier than the other. For each pair l₁ and
l₂, we find the minimal path summary over all paths from
l₁ to l₂ using a straightforward graph propagation
algorithm, and record it as Ψ[l₁, l₂]. To efficiently
evaluate the could-result-in relation for two pointstamps
(t₁, l₁) and (t₂, l₂), we test whether Ψ[l₁, l₂](t₁) ≤ t₂.
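The sketch below, reusing the Pointstamp type above,
illustrates how the timestamp adjustments, path summaries,
and the could-result-in test could be implemented.
Representing a summary as a composed adjustment function
and precomputing a psi table keyed by location pairs are
assumptions of ours, not the paper's prescribed data
structures.

```python
from functools import reduce

# Elementary timestamp adjustments made by the special
# vertices as a timestamp moves along a path.
def ingress(t):   # entering a loop: add a new counter at 0
    epoch, cs = t
    return (epoch, cs + (0,))

def egress(t):    # leaving a loop: remove innermost counter
    epoch, cs = t
    return (epoch, cs[:-1])

def feedback(t):  # traversing a back-edge: increment it
    epoch, cs = t
    return (epoch, cs[:-1] + (cs[-1] + 1,))

# A path summary composes the adjustments along the path.
def summarize(adjustments):
    return lambda t: reduce(lambda acc, f: f(acc),
                            adjustments, t)

# Timestamps at a common location are compared
# lexicographically: epoch first, then loop counters.
def ts_le(t1, t2):
    return (t1[0],) + t1[1] <= (t2[0],) + t2[1]

def could_result_in(psi, p1, p2):
    # Test Ψ[l1, l2](t1) ≤ t2; psi maps a location pair to
    # its minimal path summary, omitting unreachable pairs.
    summary = psi.get((p1.l, p2.l))
    return summary is not None and ts_le(summary(p1.t), p2.t)
```

For instance, with Ψ[l, l] the identity summary, (t, l)
could-result-in (t′, l) exactly when t ≤ t′.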
We now consider how a single-threaded scheduler de-
livers events in a timely dataflow implementation. The
scheduler maintains a set of active pointstamps, which
are those that correspond to at least one unprocessed
event. For each active pointstamp the scheduler main-
tains two counts: an occurrence count of how many
outstanding events bear the pointstamp, and a precur-
sor count of how many active pointstamps precede it in
the could-result-in order. As vertices generate and retire
events, the occurrence counts are updated as follows:
Operation            Update
v.SENDBY(e, m, t)    OC[(t, e)] ← OC[(t, e)] + 1
v.ONRECV(e, m, t)    OC[(t, e)] ← OC[(t, e)] − 1
v.NOTIFYAT(t)        OC[(t, v)] ← OC[(t, v)] + 1
v.ONNOTIFY(t)        OC[(t, v)] ← OC[(t, v)] − 1
The scheduler applies updates at the start of calls to
SENDBY and NOTIFYAT, and as calls to ONRECV and
ONNOTIFY complete. When a pointstamp p becomes
active, the scheduler initializes its precursor count to the
number of existing active pointstamps that could-result-
in p. At the same time, the scheduler increments the
precursor count of any pointstamp that p could-result-
in. A pointstamp p leaves the active set when its occur-
rence count drops to zero, at which point the scheduler
decrements the precursor count for any pointstamp that
p could-result-in. When an active pointstamp p’s pre-
cursor count is zero, there is no other pointstamp in the
active set that could-result-in p, and we say that p is in
the frontier of active pointstamps. The scheduler may
deliver any notification in the frontier.
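A Python sketch of this bookkeeping, reusing
could_result_in from above; the Scheduler class and its
method names are illustrative rather than Naiad's internal
API.

```python
from collections import defaultdict

class Scheduler:
    # Single-threaded scheduler sketch: maintains occurrence
    # (oc) and precursor (pc) counts for active pointstamps.
    def __init__(self, psi):
        self.psi = psi            # precomputed Ψ summaries
        self.oc = defaultdict(int)
        self.pc = {}              # keys: active pointstamps

    def increment(self, p):   # at the start of SENDBY/NOTIFYAT
        if self.oc[p] == 0:   # p becomes active
            count = sum(1 for q in self.pc
                        if could_result_in(self.psi, q, p))
            for q in self.pc:
                if could_result_in(self.psi, p, q):
                    self.pc[q] += 1
            self.pc[p] = count
        self.oc[p] += 1

    def decrement(self, p):   # as ONRECV/ONNOTIFY complete
        self.oc[p] -= 1
        if self.oc[p] == 0:   # p leaves the active set
            del self.oc[p], self.pc[p]
            for q in self.pc:
                if could_result_in(self.psi, p, q):
                    self.pc[q] -= 1

    def frontier(self):
        # Active pointstamps that no other active pointstamp
        # could-result-in; notifications here are deliverable.
        return [p for p in self.pc if self.pc[p] == 0]
```

Note that increment computes p's precursor count before
adding p to the active set, so p never counts itself.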
When a computation begins the system initializes an
active pointstamp at the location of each input vertex,
timestamped with the first epoch, with an occurrence
count of one and a precursor count of zero. When an
epoch e is marked complete the input vertex adds a new
active pointstamp for epoch e + 1, then removes the
pointstamp for e, permitting downstream notifications
to be delivered for epoch e. When the input vertex is
closed it removes any active pointstamps at its location,
allowing all events downstream of the input to eventu-
ally drain from the computation.
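Continuing the sketch, the input-vertex lifecycle might
drive the scheduler as follows; the function names, the
choice of 0 as the first epoch, and the empty loop-counter
stack at the input's location are our assumptions.

```python
def open_input(sched, input_vertex):
    # One pointstamp at the input, timestamped with the
    # first epoch (occurrence count one, per the text).
    sched.increment(Pointstamp((0, ()), input_vertex))

def complete_epoch(sched, input_vertex, e):
    # Add epoch e+1 before removing epoch e, so downstream
    # notifications for e become deliverable only once e's
    # pointstamp is gone.
    sched.increment(Pointstamp((e + 1, ()), input_vertex))
    sched.decrement(Pointstamp((e, ()), input_vertex))

def close_input(sched, input_vertex, current_epoch):
    # Removing the last pointstamp at the input lets all
    # downstream events eventually drain.
    sched.decrement(Pointstamp((current_epoch, ()),
                               input_vertex))
```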
2.4 Discussion
Although the timestamps in timely dataflow are
more complicated than traditional integer-valued time-
stamps [22, 38], the vertex programming model supports
many advanced use cases that motivate other systems.
The requirement that a vertex explicitly request notifi-
cations (rather than passively receive notifications for all
times) allows a programmer to make performance trade-
offs by choosing when to use coordination. For exam-
ple, the monotonic aggregation operators in Bloom^L [13]
may continually revise their output without coordina-
tion; in Naiad a vertex can achieve this by sending out-
puts from ONRECV. Such an implementation can im-
prove performance inside a loop by allowing fast un-
coordinated iteration, at the possible expense of send-
ing multiple messages before the output reaches its final
value. On the other hand an implementation that sends
only once, in ONNOTIFY, may be more useful at the
boundary of a sub-computation that will be composed
with other processing, since the guarantee that only a
single value will be produced simplifies the downstream