The Raft Consensus Algorithm: Theory and Practice


"Raft一致性算法的博士论文" 这篇论文由Diego Ongaro撰写，旨在弥合一致性理论与实践之间的差距，特别是在分布式系统的环境中。一致性是确保数据在多个节点间同步和准确的关键因素，这对于系统的高可用性和可靠性至关重要。这篇论文以Raft算法为主题，它是Paxos算法的一种更易理解和实现的变体，用于解决分布式系统中的共识问题。 Raft算法的核心在于其将领导者选举、日志复制和安全性这三个关键组件分离的设计。领导者选举确保了只有一个节点（领导者）可以接受并处理新请求，从而简化了状态机的一致性维护。日志复制则通过领导者将更新广播到其他跟随者节点，保证所有节点的数据同步。安全性机制防止了不一致状态的出现，例如避免多节点同时被视为领导者的情况。论文详细探讨了Raft的以下方面： 1. 领导者选举：Raft使用任期（Term）的概念，每个任期只有一个领导者。当节点启动或网络分区恢复时，会触发新的选举。节点通过投票选举出任期内的领导者，大多数节点的支持使得选举成功。 2. 日志复制：领导者维护一个有序的日志，接收到的新命令被添加到日志末尾。领导者然后向跟随者发送这些条目，跟随者接收并复制这些条目，保持其日志与领导者同步。 3. 安全性：通过任期机制和日志匹配属性，Raft确保了没有两个有效的日志条目在相同的索引位置且任期不同。这避免了冲突，并保证了最终一致性。 4. 状态机：节点根据其日志中的条目顺序执行命令，确保了所有节点以相同顺序执行相同操作，达到状态一致性。 5. 拓扑变化与容错：论文还讨论了如何在节点故障或加入新节点时维护一致性。Raft算法能够优雅地处理这些情况，确保即使在部分节点失效的情况下，系统仍然能够正常运作。这篇博士论文对于理解分布式系统中的共识挑战，以及如何通过Raft算法解决这些问题提供了深入的洞察。它不仅对学术研究有重要价值，也为实际系统设计和实现提供了宝贵的指导。通过阅读这篇论文，读者可以全面掌握Raft算法的细节，从而更好地应对分布式环境中的数据一致性问题。

5.4 Log compaction: approaches to log cleaning in Raft
5.5 Log compaction: alternative: snapshot stored in log
6.1 Client interaction: summary of RPCs
6.2 Client interaction: example of incorrect results for duplicated command
6.3 Client interaction: lease mechanism for read-only queries
7.1 Raft user study: example lecture slide with stylus overlay
7.2 Raft user study: quiz score CDF
7.3 Raft user study: quiz score scatter plot (by school)
7.4 Raft user study: quiz score scatter plot (by prior Paxos exposure)
7.5 Raft user study: CDF of participants’ quiz score difference
7.6 Raft user study: ordering effects
7.7 Raft user study: quiz score CDFs by question difficulty and ordering
7.8 Raft user study: quiz score CDFs by question
7.9 Raft user study: prior Paxos experience survey
7.10 Raft user study: fairness survey
7.11 Raft user study: preferences survey
9.1 Leader election evaluation: leader election timeline with no split votes
9.2 Leader election evaluation: earliest timeout example
9.3 Leader election evaluation: earliest timeout CDF
9.4 Leader election evaluation: split vote example with fixed latency
9.5 Leader election evaluation: split vote probability with fixed network latency
9.6 Leader election evaluation: split vote probability with variable network latency
9.7 Leader election evaluation: expected overall election time
9.8 Leader election evaluation: benchmark results on LAN cluster
9.9 Leader election evaluation: election performance on a simulated WAN cluster
9.10 Leader election evaluation: election performance with differing logs
10.1 Implementation and performance: threaded architecture
10.2 Implementation and performance: optimized request processing pipeline
10.3 Implementation and performance: preliminary measurements of LogCabin
11.1 Related work: differences in how new leaders replicate existing entries

CHAPTER 1. INTRODUCTION

dominated the discussion of consensus algorithms over the last two decades: most implementations of consensus were based on Paxos or influenced by it, and Paxos had become the primary vehicle used to teach students about consensus.

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems, and building a complete system based on Paxos requires developing several extensions for which the details have not been published or agreed upon. As a result, both system builders and students struggle with Paxos.

The two other well-known consensus algorithms are Viewstamped Replication [83, 82, 66] and Zab [42], the algorithm used in ZooKeeper. Although we believe both of these algorithms are incidentally better in structure than Paxos for building systems, neither has explicitly made this argument; they were not designed with simplicity or understandability as a primary goal. The burden of understanding and implementing these algorithms is still too high.

Each of these consensus options was difficult to understand and difficult to implement. Unfortunately, when the cost of implementing consensus with proven algorithms was too high, systems builders were left with a tough decision. They could avoid consensus altogether, sacrificing the fault tolerance or consistency of their systems, or they could develop their own ad hoc algorithm, often leading to unsafe behavior. Moreover, when the cost of explaining and understanding consensus was too high, not all instructors attempted to teach it, and not all students succeeded in learning it. Consensus is as fundamental as two-phase commit; ideally, as many students should learn it (even though consensus is fundamentally more difficult).

After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.

This algorithm also had to be complete enough to address all aspects of building a practical system, and it had to perform well enough for practical deployments. The core algorithm not only had to specify the effects of receiving a message but also describe what should happen and when; these are equally important for systems builders. Similarly, it had to guarantee consistency, and it also had to provide availability whenever possible. It also had to address the many aspects of a system that go beyond reaching consensus, such as changing the members of the consensus group. These are necessary in practice, and leaving this burden to systems builders would risk ad hoc, suboptimal, or even incorrect solutions.

The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). We also addressed all of the issues needed to build a complete consensus-based system. We considered each design choice carefully, not just for the benefit of our own implementation but also for the many others we hope to enable.

We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.

The primary contributions of this dissertation are as follows:

• The design, implementation, and evaluation of the Raft consensus algorithm. Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [83, 66]), but it is designed for understandability. This led to several novel features. For example, Raft uses a stronger form of leadership than other consensus algorithms. This simplifies the management of the replicated log and makes Raft easier to understand.

• The evaluation of Raft’s understandability. A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos. We believe this is the first scientific study to evaluate consensus algorithms based on teaching and learning.

• The design, implementation, and evaluation of Raft’s leader election mechanism. While many consensus algorithms do not prescribe a particular leader election algorithm, Raft includes a specific algorithm involving randomized timers. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly. The evaluation of leader election investigates its behavior and performance, concluding that this simple approach is sufficient in a wide variety of practical environments. It typically elects a leader in under 20 times the cluster’s one-way network latency.
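The randomized-timer idea in the last contribution can be sketched in a few lines. This is a minimal illustration rather than the dissertation's implementation: the 150 to 300 ms range, the function names, and the simplified "clear winner" test are all assumptions made for the example.

```python
import random
from typing import List


def pick_election_timeout(rng: random.Random,
                          low_ms: float = 150.0,
                          high_ms: float = 300.0) -> float:
    """Draw one election timeout uniformly from [low_ms, high_ms].
    The default range is an assumed, illustrative choice."""
    return rng.uniform(low_ms, high_ms)


def first_to_time_out(timeouts_ms: List[float]) -> int:
    """Return the index of the server whose timer fires first."""
    return min(range(len(timeouts_ms)), key=lambda i: timeouts_ms[i])


def has_clear_winner(timeouts_ms: List[float],
                     one_way_latency_ms: float) -> bool:
    """Simplified check (an assumption of this sketch): the earliest timer
    leads the runner-up by more than one one-way latency, so its vote
    requests can arrive before any other server becomes a candidate."""
    ordered = sorted(timeouts_ms)
    return len(ordered) < 2 or ordered[1] - ordered[0] > one_way_latency_ms


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    timeouts = [pick_election_timeout(rng) for _ in range(5)]
    print("first candidate:", first_to_time_out(timeouts))
    print("clear winner:", has_clear_winner(timeouts, 15.0))
```

Spreading the timeouts over a range that is wide relative to the network latency makes it likely that one server times out well before the rest, which is how randomization keeps split votes rare.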
