Almost every task run under Borg contains a built-in
HTTP server that publishes information about the health of
the task and thousands of performance metrics (e.g., RPC
latencies). Borg monitors the health-check URL and restarts
tasks that do not respond promptly or return an HTTP error code. Other data is tracked by monitoring tools for dashboards and alerts on service level objective (SLO) violations.
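To make the mechanism concrete, the following is a minimal sketch of a task-side health endpoint, not Borg's actual implementation: the port, the trivial health predicate, and the response format are assumptions for illustration. The task answers HTTP 200 while it believes itself healthy; Borg's monitoring restarts it on a timeout or an error response.

// Minimal sketch (POSIX sockets, C++) of a task-side health endpoint.
// Assumptions: port 8080 and the trivial health predicate are
// illustrative; Borg's real interface is not described in this paper.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>

int main() {
  int srv = socket(AF_INET, SOCK_STREAM, 0);
  int opt = 1;
  setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(8080);
  bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(srv, 16);
  for (;;) {
    int conn = accept(srv, nullptr, nullptr);
    char buf[1024];
    read(conn, buf, sizeof(buf));  // request contents ignored in this sketch
    bool healthy = true;           // a real task would check its own invariants
    std::string body = healthy ? "ok" : "unhealthy";
    std::string resp = std::string("HTTP/1.1 ") +
        (healthy ? "200 OK" : "500 Internal Server Error") +
        "\r\nContent-Length: " + std::to_string(body.size()) + "\r\n\r\n" + body;
    write(conn, resp.data(), resp.size());
    close(conn);
  }
}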
A service called Sigma provides a web-based user interface (UI) through which a user can examine the state of all
their jobs, a particular cell, or drill down to individual jobs
and tasks to examine their resource behavior, detailed logs,
execution history, and eventual fate. Our applications generate voluminous logs; these are automatically rotated to avoid
running out of disk space, and preserved for a while after the
task’s exit to assist with debugging. If a job is not running, Borg provides a “why pending?” annotation, together with
guidance on how to modify the job’s resource requests to
better fit the cell. We publish guidelines for “conforming”
resource shapes that are likely to schedule easily.
Borg records all job submissions and task events, as well
as detailed per-task resource usage information in Infrastore,
a scalable read-only data store with an interactive SQL-like
interface via Dremel [61]. This data is used for usage-based
charging, debugging job and system failures, and long-term
capacity planning. It also provided the data for the Google
cluster workload trace [80].
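To give a flavor of that interface, the sketch below shows the kind of aggregate query usage-based charging might run. The client wrapper, table, and column names are hypothetical stand-ins invented for illustration; the paper does not describe Infrastore's actual schema or client API.

#include <iostream>
#include <string>

// Hypothetical stand-in for an Infrastore/Dremel-style client; the real
// API is not described here and would issue an RPC to the service.
struct InfrastoreClient {
  std::string RunQuery(const std::string& query) {
    return "(results of: " + query + ")";  // echoes instead of querying
  }
};

int main() {
  InfrastoreClient client;
  // Illustrative SQL-like query: per-user CPU consumption for one day,
  // the kind of aggregate that usage-based charging needs.
  std::cout << client.RunQuery(
                   "SELECT user, SUM(cpu_core_seconds) FROM task_usage "
                   "WHERE day = '2015-04-01' GROUP BY user")
            << "\n";
}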
All of these features help users to understand and debug
the behavior of Borg and their jobs, and help our SREs
manage a few tens of thousands of machines per person.
3. Borg architecture
A Borg cell consists of a set of machines, a logically centralized controller called the Borgmaster, and an agent process
called the Borglet that runs on each machine in a cell (see
Figure 1). All components of Borg are written in C++.
3.1 Borgmaster
Each cell’s Borgmaster consists of two processes: the main
Borgmaster process and a separate scheduler (§3.2). The
main Borgmaster process handles client RPCs that either
mutate state (e.g., create job) or provide read-only access
to data (e.g., lookup job). It also manages state machines
for all of the objects in the system (machines, tasks, allocs,
etc.), communicates with the Borglets, and offers a web UI
as a backup to Sigma.
The Borgmaster is logically a single process but is actually replicated five times. Each replica maintains an in-memory copy of most of the state of the cell, and this state is also recorded in a highly-available, distributed, Paxos-based store [55] on the replicas’ local disks. A single elected master per cell serves both as the Paxos leader and the state mutator, handling all operations that change the cell’s state, such as submitting a job or terminating a task on a machine. A master is elected (using Paxos) when the cell is
brought up and whenever the elected master fails; it acquires
a Chubby lock so other systems can find it. Electing a master
and failing over to the new one typically takes about 10 s, but
can take up to a minute in a big cell because some in-memory
state has to be reconstructed. When a replica recovers from
an outage, it dynamically re-synchronizes its state from other
Paxos replicas that are up-to-date.
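A minimal sketch of this hand-off appears below; the lock service is an in-process stand-in for Chubby, and the lock path and addresses are assumptions for illustration, not Borg's actual code. The winner of the race becomes the elected master, and other systems find it by reading the lock contents.

#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Hypothetical in-process stand-in for a Chubby-style lock service; the
// real service is distributed and its API differs. Acquiring the lock
// elects the master and publishes its address for others to find.
class LockService {
 public:
  bool TryAcquire(const std::string& lock, const std::string& contents) {
    std::lock_guard<std::mutex> g(mu_);
    return held_.emplace(lock, contents).second;  // fails if already held
  }
  std::string Read(const std::string& lock) {
    std::lock_guard<std::mutex> g(mu_);
    auto it = held_.find(lock);
    return it == held_.end() ? "" : it->second;
  }
 private:
  std::mutex mu_;
  std::map<std::string, std::string> held_;
};

int main() {
  LockService chubby;
  const std::string kLock = "/ls/cellX/borgmaster";  // illustrative path
  // Replicas race for the lock; the winner is the elected master and
  // would also serve as Paxos leader and sole state mutator.
  for (std::string replica : {"replica-1:1234", "replica-2:1234"}) {
    if (chubby.TryAcquire(kLock, replica)) {
      std::cout << replica << " elected master\n";
    }
  }
  // Other systems locate the master by reading the lock contents.
  std::cout << "master is at " << chubby.Read(kLock) << "\n";
}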
The Borgmaster’s state at a point in time is called a
checkpoint, and takes the form of a periodic snapshot plus a
change log kept in the Paxos store. Checkpoints have many
uses, including restoring a Borgmaster’s state to an arbitrary
point in the past (e.g., just before accepting a request that
triggered a software defect in Borg so it can be debugged);
fixing it by hand in extremis; building a persistent log of
events for future queries; and offline simulations.
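The restore path implied by this design can be sketched as follows; the types and fields are illustrative assumptions, since the real object model (machines, jobs, tasks, allocs, and so on) is far richer. State at a chosen point is rebuilt by loading the latest snapshot at or before that point and replaying the subsequent change-log entries.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-ins for Borgmaster state and its change log.
struct State {
  int64_t seq = 0;  // sequence number of the last change applied
  std::map<std::string, std::string> objects;  // name -> serialized state
};

struct LogEntry {
  int64_t seq;
  std::string object;
  std::string new_state;  // empty means the object was deleted
};

// Rebuild the state as of target_seq from a periodic snapshot plus the
// change log kept in the Paxos store: load the snapshot, then replay
// every later entry up to (and including) the target point.
State Restore(const State& snapshot, const std::vector<LogEntry>& log,
              int64_t target_seq) {
  State s = snapshot;
  for (const LogEntry& e : log) {
    if (e.seq <= s.seq || e.seq > target_seq) continue;  // in snapshot / too new
    if (e.new_state.empty()) {
      s.objects.erase(e.object);
    } else {
      s.objects[e.object] = e.new_state;
    }
    s.seq = e.seq;
  }
  return s;
}

Setting target_seq to a past value reproduces the state just before a problematic request, which is what the debugging use above relies on.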
A high-fidelity Borgmaster simulator called Fauxmaster
can be used to read checkpoint files, and contains a complete
copy of the production Borgmaster code, with stubbed-out
interfaces to the Borglets. It accepts RPCs to make state machine changes and perform operations, such as “schedule all pending tasks”, and we use it to debug failures by interacting with it as if it were a live Borgmaster, with simulated
Borglets replaying real interactions from the checkpoint file.
A user can step through and observe the changes to the system state that actually occurred in the past. Fauxmaster is also useful for capacity planning (“how many new jobs of this type would fit?”), as well as sanity checks before making a change to a cell’s configuration (“will this change evict any important jobs?”).
3.2 Scheduling
When a job is submitted, the Borgmaster records it persistently in the Paxos store and adds the job’s tasks to the pending queue. This is scanned asynchronously by the scheduler, which assigns tasks to machines if there are sufficient available resources that meet the job’s constraints. (The scheduler primarily operates on tasks, not jobs.) The scan proceeds from high to low priority, modulated by a round-robin scheme within a priority to ensure fairness across users and avoid head-of-line blocking behind a large job. The scheduling algorithm has two parts: feasibility checking, to find machines on which the task could run, and scoring, which picks one of the feasible machines.
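The overall structure of a scheduling round can be summarized in the sketch below; the task and machine types and the best-fit score are placeholder assumptions, much simpler than Borg's real feasibility and scoring logic described next.

#include <algorithm>
#include <vector>

// Placeholder types; real tasks and machines carry constraints, packages,
// failure domains, priorities of running work, and more.
struct Task { int priority = 0; double cpu = 0, ram = 0; };
struct Machine { double free_cpu = 0, free_ram = 0; };

// Feasibility: can this machine run the task at all? (Borg also counts
// resources of evictable lower-priority tasks as "available".)
bool Feasible(const Machine& m, const Task& t) {
  return m.free_cpu >= t.cpu && m.free_ram >= t.ram;
}

// Scoring: "goodness" of a feasible machine. This stand-in is a simple
// best-fit heuristic; Borg's real score mixes many built-in criteria.
double Score(const Machine& m, const Task& t) {
  return -(m.free_cpu - t.cpu);  // prefer the machine with least leftover CPU
}

void ScheduleRound(std::vector<Task>& pending, std::vector<Machine>& machines) {
  // Scan from high to low priority. (Borg also round-robins within a
  // priority for fairness; omitted here.)
  std::stable_sort(pending.begin(), pending.end(),
                   [](const Task& a, const Task& b) { return a.priority > b.priority; });
  for (const Task& t : pending) {
    Machine* best = nullptr;
    double best_score = 0;
    for (Machine& m : machines) {
      if (!Feasible(m, t)) continue;  // feasibility checking
      double s = Score(m, t);         // scoring
      if (best == nullptr || s > best_score) { best = &m; best_score = s; }
    }
    if (best != nullptr) {            // assign the task to the chosen machine
      best->free_cpu -= t.cpu;
      best->free_ram -= t.ram;
    }
  }
}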
In feasibility checking, the scheduler finds a set of machines that meet the task’s constraints and also have enough “available” resources – which includes resources assigned to lower-priority tasks that can be evicted. In scoring, the scheduler determines the “goodness” of each feasible machine. The score takes into account user-specified preferences, but is mostly driven by built-in criteria such as minimizing the number and priority of preempted tasks, picking machines that already have a copy of the task’s packages, spreading tasks across power and failure domains, and packing quality including putting a mix of high and low priority