Almost every task run under Borg contains a built-in
HTTP server that publishes information about the health of
the task and thousands of performance metrics (e.g., RPC
latencies). Borg monitors the health-check URL and restarts
tasks that do not respond promptly or return an HTTP error code. Other data is tracked by monitoring tools for dashboards and alerts on service level objective (SLO) violations.
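To make the mechanism concrete, the following is a minimal sketch of a task-side health endpoint, not Borg's actual implementation: the port, the trivial health predicate, and the response format are assumptions for illustration. The task answers HTTP 200 while it believes itself healthy; Borg's monitoring restarts it on a timeout or an error response.

// Minimal sketch (POSIX sockets, C++) of a task-side health endpoint.
// Assumptions: port 8080 and the trivial health predicate are
// illustrative; Borg's real interface is not described in this paper.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>

int main() {
  int srv = socket(AF_INET, SOCK_STREAM, 0);
  int opt = 1;
  setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(8080);
  bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(srv, 16);
  for (;;) {
    int conn = accept(srv, nullptr, nullptr);
    char buf[1024];
    read(conn, buf, sizeof(buf));  // request contents ignored in this sketch
    bool healthy = true;           // a real task would check its own invariants
    std::string body = healthy ? "ok" : "unhealthy";
    std::string resp = std::string("HTTP/1.1 ") +
        (healthy ? "200 OK" : "500 Internal Server Error") +
        "\r\nContent-Length: " + std::to_string(body.size()) + "\r\n\r\n" + body;
    write(conn, resp.data(), resp.size());
    close(conn);
  }
}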
A service called Sigma provides a web-based user interface (UI) through which a user can examine the state of all
their jobs, a particular cell, or drill down to individual jobs
and tasks to examine their resource behavior, detailed logs,
execution history, and eventual fate. Our applications generate voluminous logs; these are automatically rotated to avoid
running out of disk space, and preserved for a while after the
task’s exit to assist with debugging. If a job is not running, Borg provides a “why pending?” annotation, together with
guidance on how to modify the job’s resource requests to
better fit the cell. We publish guidelines for “conforming”
resource shapes that are likely to schedule easily.
Borg records all job submissions and task events, as well
as detailed per-task resource usage information in Infrastore,
a scalable read-only data store with an interactive SQL-like
interface via Dremel [61]. This data is used for usage-based
charging, debugging job and system failures, and long-term
capacity planning. It also provided the data for the Google
cluster workload trace [80].
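To give a flavor of that interface, the sketch below shows the kind of aggregate query usage-based charging might run. The client wrapper, table, and column names are hypothetical stand-ins invented for illustration; the paper does not describe Infrastore's actual schema or client API.

#include <iostream>
#include <string>

// Hypothetical stand-in for an Infrastore/Dremel-style client; the real
// API is not described here and would issue an RPC to the service.
struct InfrastoreClient {
  std::string RunQuery(const std::string& query) {
    return "(results of: " + query + ")";  // echoes instead of querying
  }
};

int main() {
  InfrastoreClient client;
  // Illustrative SQL-like query: per-user CPU consumption for one day,
  // the kind of aggregate that usage-based charging needs.
  std::cout << client.RunQuery(
                   "SELECT user, SUM(cpu_core_seconds) FROM task_usage "
                   "WHERE day = '2015-04-01' GROUP BY user")
            << "\n";
}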
All of these features help users to understand and debug
the behavior of Borg and their jobs, and help our SREs
manage a few tens of thousands of machines per person.
3. Borg architecture
A Borg cell consists of a set of machines, a logically centralized controller called the Borgmaster, and an agent process
called the Borglet that runs on each machine in a cell (see
Figure 1). All components of Borg are written in C++.
3.1 Borgmaster
Each cell’s Borgmaster consists of two processes: the main
Borgmaster process and a separate scheduler (§3.2). The
main Borgmaster process handles client RPCs that either
mutate state (e.g., create job) or provide read-only access
to data (e.g., lookup job). It also manages state machines
for all of the objects in the system (machines, tasks, allocs,
etc.), communicates with the Borglets, and offers a web UI
as a backup to Sigma.
The Borgmaster is logically a single process but is actually replicated five times. Each replica maintains an in-memory copy of most of the state of the cell, and this state is also recorded in a highly-available, distributed, Paxos-based store [55] on the replicas’ local disks. A single elected master per cell serves both as the Paxos leader and the state mutator, handling all operations that change the cell’s state, such as submitting a job or terminating a task on a machine. A master is elected (using Paxos) when the cell is
brought up and whenever the elected master fails; it acquires
a Chubby lock so other systems can find it. Electing a master
and failing over to the new one typically takes about 10 s, but
can take up to a minute in a big cell because some in-memory
state has to be reconstructed. When a replica recovers from
an outage, it dynamically re-synchronizes its state from other
Paxos replicas that are up-to-date.
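A minimal sketch of this hand-off appears below; the lock service is an in-process stand-in for Chubby, and the lock path and addresses are assumptions for illustration, not Borg's actual code. The winner of the race becomes the elected master, and other systems find it by reading the lock contents.

#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Hypothetical in-process stand-in for a Chubby-style lock service; the
// real service is distributed and its API differs. Acquiring the lock
// elects the master and publishes its address for others to find.
class LockService {
 public:
  bool TryAcquire(const std::string& lock, const std::string& contents) {
    std::lock_guard<std::mutex> g(mu_);
    return held_.emplace(lock, contents).second;  // fails if already held
  }
  std::string Read(const std::string& lock) {
    std::lock_guard<std::mutex> g(mu_);
    auto it = held_.find(lock);
    return it == held_.end() ? "" : it->second;
  }
 private:
  std::mutex mu_;
  std::map<std::string, std::string> held_;
};

int main() {
  LockService chubby;
  const std::string kLock = "/ls/cellX/borgmaster";  // illustrative path
  // Replicas race for the lock; the winner is the elected master and
  // would also serve as Paxos leader and sole state mutator.
  for (std::string replica : {"replica-1:1234", "replica-2:1234"}) {
    if (chubby.TryAcquire(kLock, replica)) {
      std::cout << replica << " elected master\n";
    }
  }
  // Other systems locate the master by reading the lock contents.
  std::cout << "master is at " << chubby.Read(kLock) << "\n";
}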
The Borgmaster’s state at a point in time is called a
checkpoint, and takes the form of a periodic snapshot plus a
change log kept in the Paxos store. Checkpoints have many
uses, including restoring a Borgmaster’s state to an arbitrary
point in the past (e.g., just before accepting a request that
triggered a software defect in Borg so it can be debugged);
fixing it by hand in extremis; building a persistent log of
events for future queries; and offline simulations.
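The restore path implied by this design can be sketched as follows; the types and fields are illustrative assumptions, since the real object model (machines, jobs, tasks, allocs, and so on) is far richer. State at a chosen point is rebuilt by loading the latest snapshot at or before that point and replaying the subsequent change-log entries.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-ins for Borgmaster state and its change log.
struct State {
  int64_t seq = 0;  // sequence number of the last change applied
  std::map<std::string, std::string> objects;  // name -> serialized state
};

struct LogEntry {
  int64_t seq;
  std::string object;
  std::string new_state;  // empty means the object was deleted
};

// Rebuild the state as of target_seq from a periodic snapshot plus the
// change log kept in the Paxos store: load the snapshot, then replay
// every later entry up to (and including) the target point.
State Restore(const State& snapshot, const std::vector<LogEntry>& log,
              int64_t target_seq) {
  State s = snapshot;
  for (const LogEntry& e : log) {
    if (e.seq <= s.seq || e.seq > target_seq) continue;  // in snapshot / too new
    if (e.new_state.empty()) {
      s.objects.erase(e.object);
    } else {
      s.objects[e.object] = e.new_state;
    }
    s.seq = e.seq;
  }
  return s;
}

Setting target_seq to a past value reproduces the state just before a problematic request, which is what the debugging use above relies on.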
A high-fidelity Borgmaster simulator called Fauxmaster
can be used to read checkpoint files, and contains a complete
copy of the production Borgmaster code, with stubbed-out
interfaces to the Borglets. It accepts RPCs to make state machine changes and perform operations, such as “schedule all pending tasks”, and we use it to debug failures by interacting with it as if it were a live Borgmaster, with simulated
Borglets replaying real interactions from the checkpoint file.
A user can step through and observe the changes to the system state that actually occurred in the past. Fauxmaster is also useful for capacity planning (“how many new jobs of this type would fit?”), as well as sanity checks before making a change to a cell’s configuration (“will this change evict any important jobs?”).
3.2 Scheduling
When a job is submitted, the Borgmaster records it persistently in the Paxos store and adds the job’s tasks to the pending queue. This is scanned asynchronously by the scheduler, which assigns tasks to machines if there are sufficient available resources that meet the job’s constraints. (The scheduler primarily operates on tasks, not jobs.) The scan proceeds from high to low priority, modulated by a round-robin scheme within a priority to ensure fairness across users and avoid head-of-line blocking behind a large job. The scheduling algorithm has two parts: feasibility checking, to find machines on which the task could run, and scoring, which picks one of the feasible machines.
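The overall structure of a scheduling round can be summarized in the sketch below; the task and machine types and the best-fit score are placeholder assumptions, much simpler than Borg's real feasibility and scoring logic described next.

#include <algorithm>
#include <vector>

// Placeholder types; real tasks and machines carry constraints, packages,
// failure domains, priorities of running work, and more.
struct Task { int priority = 0; double cpu = 0, ram = 0; };
struct Machine { double free_cpu = 0, free_ram = 0; };

// Feasibility: can this machine run the task at all? (Borg also counts
// resources of evictable lower-priority tasks as "available".)
bool Feasible(const Machine& m, const Task& t) {
  return m.free_cpu >= t.cpu && m.free_ram >= t.ram;
}

// Scoring: "goodness" of a feasible machine. This stand-in is a simple
// best-fit heuristic; Borg's real score mixes many built-in criteria.
double Score(const Machine& m, const Task& t) {
  return -(m.free_cpu - t.cpu);  // prefer the machine with least leftover CPU
}

void ScheduleRound(std::vector<Task>& pending, std::vector<Machine>& machines) {
  // Scan from high to low priority. (Borg also round-robins within a
  // priority for fairness; omitted here.)
  std::stable_sort(pending.begin(), pending.end(),
                   [](const Task& a, const Task& b) { return a.priority > b.priority; });
  for (const Task& t : pending) {
    Machine* best = nullptr;
    double best_score = 0;
    for (Machine& m : machines) {
      if (!Feasible(m, t)) continue;  // feasibility checking
      double s = Score(m, t);         // scoring
      if (best == nullptr || s > best_score) { best = &m; best_score = s; }
    }
    if (best != nullptr) {            // assign the task to the chosen machine
      best->free_cpu -= t.cpu;
      best->free_ram -= t.ram;
    }
  }
}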
In feasibility checking, the scheduler finds a set of machines that meet the task’s constraints and also have enough “available” resources – which includes resources assigned to lower-priority tasks that can be evicted. In scoring, the scheduler determines the “goodness” of each feasible machine. The score takes into account user-specified preferences, but is mostly driven by built-in criteria such as minimizing the number and priority of preempted tasks, picking machines that already have a copy of the task’s packages, spreading tasks across power and failure domains, and packing quality including putting a mix of high and low priority