Variables
In most computations a graph is executed multiple times. Most tensors do not survive past a single execution of the graph. However, a Variable is a special kind of operation that returns a handle to a persistent mutable tensor that survives across executions of a graph. Handles to these persistent mutable tensors can be passed to a handful of special operations, such as Assign and AssignAdd (equivalent to +=) that mutate the referenced tensor. For machine learning applications of TensorFlow, the parameters of the model are typically stored in tensors held in variables, and are updated as part of the Run of the training graph for the model.
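As a concrete illustration, a minimal sketch using the Python client (assuming a TF 1.x-style API; the variable name and shapes are illustrative, not taken from the text):

import tensorflow as tf

# A variable holds a persistent, mutable tensor that survives across Run calls.
w = tf.Variable(tf.zeros([10]), name="w")

# Special operations that mutate the tensor through the variable's handle.
increment = tf.assign_add(w, tf.ones([10]))  # roughly w += 1
reset = tf.assign(w, tf.zeros([10]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(increment)   # w becomes all ones
    sess.run(increment)   # w persists between Run calls: now all twos
    print(sess.run(w))

In a typical training graph the model parameters are variables like w above, and the optimizer's update operations play the role of assign_add.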
3 Implementation
The main components in a TensorFlow system are the
client, which uses the Session interface to communicate
with the master, and one or more worker processes, with
each worker process responsible for arbitrating access to
one or more computational devices (such as CPU cores
or GPU cards) and for executing graph nodes on those
devices as instructed by the master. We have both lo-
cal and distributed implementations of the TensorFlow
interface. The local implementation is used when the
client, the master, and the worker all run on a single ma-
chine in the context of a single operating system process
(possibly with multiple devices, if for example, the ma-
chine has many GPU cards installed). The distributed
implementation shares most of the code with the local
implementation, but extends it with support for an en-
vironment where the client, the master, and the workers
can all be in different processes on different machines.
In our distributed environment, these different tasks are
containers in jobs managed by a cluster scheduling sys-
tem [51]. These two different modes are illustrated in
Figure 3. Most of the rest of this section discusses is-
sues that are common to both implementations, while
Section 3.3 discusses some issues that are particular to
the distributed implementation.
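In the Python client, the choice between the two modes is reflected in the target passed when creating a Session. A small sketch (the grpc address below is a hypothetical placeholder, not an address from the text):

import tensorflow as tf

c = tf.constant(42.0)

# Local implementation: client, master, and worker share a single process.
with tf.Session() as sess:
    print(sess.run(c))

# Distributed implementation: the client connects to a master in another
# process, identified by a target string (hypothetical host and port).
with tf.Session("grpc://example.org:2222") as sess:
    print(sess.run(c))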
Devices
Devices are the computational heart of TensorFlow. Each worker is responsible for one or more devices, and each device has a device type and a name. Device names are composed of pieces that identify the device's type, the device's index within the worker, and, in our distributed setting, an identification of the job and task of the worker (or localhost for the case where the devices are local to the process). Example device names are "/job:localhost/device:cpu:0" or "/job:worker/task:17/device:gpu:3". We have implementations of our Device interface for CPUs and GPUs, and new device implementations for other device types can be provided via a registration mechanism. Each device object is responsible for managing allocation and deallocation of device memory, and for arranging for the execution of any kernels that are requested by higher levels in the TensorFlow implementation.
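For illustration, the Python client can pin graph nodes to named devices with the tf.device context manager (a sketch using shorthand forms of the names above; the job and task shown are hypothetical and only constrain placement of the enclosed nodes):

import tensorflow as tf

# Pin nodes to explicitly named devices.
with tf.device("/cpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])

with tf.device("/job:worker/task:17/gpu:3"):
    b = tf.matmul(a, a)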
Tensors
A tensor in our implementation is a typed, multi-dimensional array. We support a variety of tensor element types, including signed and unsigned integers ranging in size from 8 bits to 64 bits, IEEE float and double types, a complex number type, and a string type (an arbitrary byte array). Backing store of the appropriate size is managed by an allocator that is specific to the device on which the tensor resides. Tensor backing store buffers are reference counted and are deallocated when no references remain.
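A few of these element types, as they appear from the Python client (a sketch; dtype names follow the TF 1.x Python API):

import tensorflow as tf

i8  = tf.constant([1, 2, 3], dtype=tf.int8)        # 8-bit signed integer
u8  = tf.constant([1, 2, 3], dtype=tf.uint8)       # 8-bit unsigned integer
f64 = tf.constant([1.0, 2.0], dtype=tf.float64)    # IEEE double precision
c64 = tf.constant([1 + 2j], dtype=tf.complex64)    # complex number type
s   = tf.constant([b"arbitrary bytes"])            # string type (byte array)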
3.1 Single-Device Execution
Let's first consider the simplest execution scenario: a single worker process with a single device. The nodes of the graph are executed in an order that respects the dependencies between nodes. In particular, we keep track of a count per node of the number of dependencies of that node that have not yet been executed. Once this count drops to zero, the node is eligible for execution and is added to a ready queue. The ready queue is processed in some unspecified order, delegating execution of the kernel for a node to the device object. When a node has finished executing, the counts of all nodes that depend on the completed node are decremented.
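This scheme amounts to a standard dependency-count scheduler. A schematic Python version (purely illustrative, not the actual implementation; graph.num_inputs, graph.consumers, and device.run_kernel are hypothetical helpers):

from collections import deque

def execute(graph, device):
    # pending[n] = number of dependencies of n that have not yet executed
    pending = {n: graph.num_inputs(n) for n in graph.nodes}
    ready = deque(n for n in graph.nodes if pending[n] == 0)

    while ready:
        node = ready.popleft()           # queue order is unspecified
        device.run_kernel(node)          # delegate the node's kernel to the device
        for consumer in graph.consumers(node):
            pending[consumer] -= 1       # one fewer unexecuted dependency
            if pending[consumer] == 0:
                ready.append(consumer)   # now eligible for execution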
3.2 Multi-Device Execution
Once a system has multiple devices, there are two main complications: deciding on which device to place the computation for each node in the graph, and then managing the required communication of data across the device boundaries implied by these placement decisions. This subsection discusses these two issues.
3.2.1 Node Placement
Given a computation graph, one of the main responsibilities of the TensorFlow implementation is to map the computation onto the set of available devices. A simplified version of this algorithm is presented here. See Section 4.3 for extensions supported by this algorithm. One input to the placement algorithm is a cost model, which contains estimates of the sizes (in bytes) of the input and output tensors for each graph node, along with estimates of the computation time required for each node when presented with its input tensors.