
Distributed TensorFlow with MPI

Abhinav Vishnu, Charles Siegel, and Jeff Daily
Pacific Northwest National Laboratory, Richland, WA 99352
ABSTRACT
Machine Learning and Data Mining (MLDM) algorithms are becoming increasingly important in analyzing the large volumes of data generated by simulations, experiments, and mobile devices. With increasing data volume, distributed memory systems (such as tightly connected supercomputers or cloud computing systems) are becoming important in designing in-memory and massively parallel MLDM algorithms. Yet, the majority of open source MLDM software is limited to sequential execution, with only a few packages supporting multi-core/many-core execution.
In this paper, we extend the recently released Google TensorFlow for execution on large scale clusters using the Message Passing Interface (MPI). Our approach requires minimal changes to the TensorFlow runtime, making the proposed implementation generic and readily usable by the growing TensorFlow user base. We evaluate our implementation using an InfiniBand cluster and several well known datasets. Our evaluation demonstrates the efficiency of the proposed implementation.
1. INTRODUCTION
Today, simulations, experiments, and mobile devices are generating increasingly large volumes of data [1, 2]. Machine Learning and Data Mining (MLDM) algorithms, which can build models, classifiers, and anomaly detectors, are being designed and applied in several domains including high energy physics, computational biology, and cyber security [3, 4, 5].
MLDM algorithms are generally classified as supervised (the input dataset is labeled with ground truth) or unsupervised (learning from an unlabeled dataset). Base supervised/unsupervised algorithms can be combined using ensemble methods to remove noise and possibly learn better models/classifiers. Several software packages that support supervised, unsupervised, and ensemble algorithms have been released publicly; a few well known packages are Weka [6], Scikit [7], libsvm [8], and Matlab. However, these packages only support sequential execution, so they are generally used with modest size datasets.
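As a brief illustration of the ensemble methods such packages expose (our own sketch on synthetic data, not an example from this paper), a random forest in scikit-learn combines many decision trees to reduce the noise of any single base learner:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic labeled dataset (supervised setting: ground truth y is given).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# An ensemble of 100 decision trees; predictions are aggregated by voting.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy of the ensemble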
At the same time, Deep Learning algorithms, a class of MLDM algorithms, are becoming increasingly popular. Deep Learning algorithms emulate brain activity by using several layers of neurons (interconnected with synapses) and learn the weights of the synapses using gradient descent methods. There are several classes of Deep Learning algorithms: Deep Neural Networks (DNNs, typically used on tabular datasets), Convolutional Neural Networks (CNNs, typically used on images), and Recurrent Neural Networks (RNNs, typically used on time-dependent datasets). Several researchers and practitioners have applied Deep Learning algorithms to their problems and reported better results in comparison to their previously published models. Naturally, open source efforts such as Theano, CuDNN, and Caffe [9] have gained traction and wide acceptance among researchers and practitioners alike.
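For reference, the gradient descent methods mentioned above apply the standard update to the weight vector $w$, with learning rate $\eta$ and loss function $L$:

$$w_{t+1} = w_t - \eta \, \nabla_w L(w_t)$$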
Recently, Google released TensorFlow, a toolkit for developing MLDM algorithms. It uses a dataflow model in which computations are specified as operations on tensors (user-defined multi-dimensional arrays). It also supports automatic differentiation, which simplifies the design and implementation of gradient descent methods. TensorFlow readily supports DNNs, CNNs, and RNNs on multi-core/many-core systems (GPUs), and supports algorithmic advancements such as AdaGrad and neuron dropout for regularization. However, TensorFlow's restriction to a single compute node is a significant limitation, especially with the increasing size of datasets.
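To make the dataflow model and automatic differentiation concrete, consider a minimal sketch in the graph-style TensorFlow Python API of that era; the linear model is our own illustration, not taken from the paper:

import tensorflow as tf

# Build a dataflow graph for a tiny linear model y = xW + b.
x = tf.placeholder(tf.float32, shape=[None, 2])
y_true = tf.placeholder(tf.float32, shape=[None, 1])
W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, W) + b - y_true))

# Automatic differentiation: gradients of `loss` with respect to W and b
# are derived from the graph itself; no manual derivative code is needed.
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    sess.run(train_step, feed_dict={x: [[1.0, 2.0]], y_true: [[3.0]]})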
In this paper, we propose a design that alleviates these limitations of TensorFlow. Specifically, we extend TensorFlow for scalable execution on very large scale systems. We consider several programming models, especially MapReduce based programming models (Hadoop and Spark), and conclude that neither is geared towards realizing the peak potential of the system, while TensorFlow is geared towards exploiting the architecture effectively using a C++ backend and state of the art linear algebra packages. We use the Message Passing Interface (MPI) [10] as the communication interface for parallelizing TensorFlow on distributed memory systems. We describe the changes required to realize the implementation on distributed memory systems, and conclude that these changes are minimal and require no changes to the TensorFlow runtime. Our evaluation of the proposed extensions with several well known datasets such as MNIST, CIFAR-10, Adult, and Higgs reveals the performance efficiency of the proposed implementation.
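For intuition about the role MPI plays here, the sketch below shows the canonical data-parallel pattern of averaging per-rank gradients with a single allreduce, written with the mpi4py bindings; it is our illustration of the general technique, with a stand-in gradient vector, and not the implementation described later in this paper:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

# Stand-in for the gradients each rank would compute on its own shard
# of the training data (e.g., from a local TensorFlow session).
local_grad = np.random.rand(4)

# Sum gradients across all ranks in one collective, then average:
# the classic synchronization step of data-parallel gradient descent.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size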
2. BACKGROUND
In this section, we provide a brief background on Google TensorFlow (referred to simply as TensorFlow in the rest of the paper) and the Message Passing Interface (MPI) [10, 11].
2.1 TensorFlow
Google’s TensorFlow, released in November 2015, is a
platform for building and developing models in machine
learning, particularly neural networks. It is capable of han-