ble with GPU acceleration [14], Cui et al. have recently
shown that GeePS [19], a parameter server specialized
for use with GPUs, can achieve speedups on modest-sized
clusters.
MXNet [12] is a recent system that uses a parameter
server to scale training, supports GPU acceleration, and
includes a flexible programming model with interfaces
for many languages. While MXNet partially fulfills our
extensibility requirements, the parameter server is “priv-
ileged” code, which makes it difficult for researchers to
customize the handling of large models (§4.2).
The parameter server architecture meets most of our
requirements, and our previous system, DistBelief [21],
uses parameter servers with a Caffe-like model definition
format [36] to great effect. We found this architecture to
be insufficiently extensible: adding a new optimization
algorithm, or experimenting with an unconventional model
architecture, would require our users to modify the parameter
server implementation, which uses C++ for performance.
While some of the practitioners who use that system are
comfortable with making these changes, the majority are
accustomed to writing models in high-level languages,
such as Python and Lua, and the complexity of the high-
performance parameter server implementation is a barrier
to entry. With TensorFlow we therefore sought a high-
level programming model that allows users to customize
the code that runs in all parts of the system (§3).
3 TensorFlow execution model
TensorFlow uses a single dataflow graph to represent
all computation and state in a machine learning algo-
rithm, including the individual mathematical operations,
the parameters and their update rules, and the input pre-
processing (Figure 1). Dataflow makes the communi-
cation between subcomputations explicit, and therefore
makes it easy to execute independent computations in par-
allel, and partition the computation across multiple dis-
tributed devices. TensorFlow differs from batch dataflow
systems (§2.2) in two respects:
• The model supports multiple concurrent executions
on overlapping subgraphs of the overall graph.
• Individual vertices may have mutable state that can
be shared between different executions of the graph
(sketched below).
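As a minimal sketch of the second point, using the TensorFlow 1.x Python API (the variable name and shapes are illustrative, not taken from the paper), a variable vertex holds state that persists across separate executions of subgraphs that touch it:

import tensorflow as tf

# Mutable state: a variable vertex that persists across graph executions.
w = tf.Variable(tf.zeros([10]), name="w")
increment = tf.assign_add(w, tf.ones([10]))  # in-place update to the state

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(increment)   # one execution mutates the shared state
    print(sess.run(w))    # a separate execution observes the update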
The key observation in the parameter server architec-
ture [21, 14, 46] is that mutable state is crucial when
training very large models, because it becomes possible to
make in-place updates to very large parameters, and prop-
agate those updates to parallel training steps as quickly
as possible. Dataflow with mutable state enables Tensor-
Flow to mimic the functionality of a parameter server,
but with additional flexibility, because it becomes pos-
sible to execute arbitrary dataflow subgraphs on the ma-
chines that host the shared model parameters. As a re-
sult, our users have been able to experiment with different
optimization algorithms, consistency schemes, and paral-
lelization strategies.
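As a hedged sketch of this flexibility (TensorFlow 1.x Python API; the job names, shapes, and learning rate are hypothetical), a hand-written update rule is just another subgraph, and it can be placed on the task that hosts the shared parameters:

import tensorflow as tf

# Shared model parameters hosted on a (hypothetical) parameter-server task.
with tf.device("/job:ps/task:0"):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")

# A worker constructs its own update rule as an ordinary subgraph, here a
# hand-written SGD step rather than a built-in optimizer.
with tf.device("/job:worker/task:0"):
    grad = tf.placeholder(tf.float32, shape=[784, 10])
    train_step = tf.assign_sub(weights, 0.1 * grad)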
3.1 Dataflow graph elements
In a TensorFlow graph, each vertex represents an atomic
unit of computation, and each edge represents the out-
put from or input to a vertex. We refer to the compu-
tation at vertices as operations, and the values that flow
along edges as tensors, because TensorFlow is designed
for mathematical computation, and uses tensors (or multi-
dimensional arrays) to represent all data in those compu-
tations.
Tensors In TensorFlow, we model all data as tensors
(dense n-dimensional arrays) with each element having
one of a small number of primitive types, such as int32,
float32, or string. Tensors naturally represent the
inputs to and results of the common mathematical oper-
ations in many machine learning algorithms: for exam-
ple, a matrix multiplication takes two 2-D tensors and
produces a 2-D tensor; and a mini-batch 2-D convolution
takes two 4-D tensors and produces another 4-D tensor.
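For concreteness, a small sketch of these shapes in the TensorFlow 1.x Python API (all dimensions are illustrative):

import tensorflow as tf

a = tf.placeholder(tf.float32, shape=[128, 784])   # 2-D: batch x features
w = tf.placeholder(tf.float32, shape=[784, 10])    # 2-D: features x classes
logits = tf.matmul(a, w)                           # 2-D result: [128, 10]

images = tf.placeholder(tf.float32, shape=[32, 28, 28, 3])  # 4-D NHWC mini-batch
filters = tf.placeholder(tf.float32, shape=[5, 5, 3, 16])   # 4-D HWIO filters
features = tf.nn.conv2d(images, filters,
                        strides=[1, 1, 1, 1], padding="SAME")  # 4-D result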
All tensors in TensorFlow are dense. This decision en-
sures that the lowest levels of the system can have sim-
ple implementations for memory allocation and serializa-
tion, which reduces the overhead imposed by the frame-
work. To represent sparse tensors, TensorFlow offers two
alternatives: either encode the data into variable-length
string elements of a dense tensor, or use a tuple of
dense tensors (e.g., an n-D sparse tensor with m non-zero
elements could be represented as an m × n index matrix and
a length-m value vector). The size of a tensor can vary in
one or more dimensions, making it possible to represent
sparse tensors with differing numbers of elements, at the
cost of more sophisticated shape inference.
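As an illustrative sketch of the second alternative, TensorFlow's tf.SparseTensor is one packaging of this tuple-of-dense-tensors encoding (the indices and values below are made up):

import tensorflow as tf

# A 2-D sparse tensor with m = 2 non-zero elements, encoded as an m x n
# index matrix and a length-m value vector.
indices = tf.constant([[0, 3], [2, 1]], dtype=tf.int64)
values = tf.constant([4.5, 1.2], dtype=tf.float32)
sparse = tf.SparseTensor(indices=indices, values=values, dense_shape=[4, 5])
dense = tf.sparse_tensor_to_dense(sparse)  # materialize as a dense tensor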
Operations An operation takes m ≥ 0 tensors as input,
and produces n ≥ 0 tensors as output. An operation has
a named “type” (such as Const, MatMul, or Assign)
and may have zero or more compile-time attributes that
determine its behavior. An operation can be generic and
variadic at compile-time: its attributes determine both the
expected types and arity of its inputs and outputs.
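As a hedged illustration of attributes (TensorFlow 1.x Python API; the constants are arbitrary), both the transpose flag of a matrix multiplication and the number of inputs to a concatenation are fixed when the graph is constructed:

import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

# transpose_b is an attribute of the MatMul operation, fixed at
# graph-construction time.
c = tf.matmul(a, b, transpose_b=True)

# Concat is variadic: the number of input tensors (here three) becomes an
# attribute of the operation.
d = tf.concat([a, b, c], axis=0)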