[8], or pytorch [28], requires a substantial space (more than 3 GB)¹,
so it is not realistic to upload such an image on demand at runtime.
Rather, it is more reasonable for a VM (or a container) image for the
DNN framework to be pre-installed at the edge server in advance,
so the client uploads only the client’s DNN model to the edge server
on demand.
To check the overhead of uploading a DNN model, we measured
the time to transmit the DNN model over a wireless network. It
takes about 24 seconds to upload the AlexNet model, meaning that
the smart glasses must execute the queries locally for 24 seconds
before they can use the edge server, with no improvement in the
meantime. Of course, worse network conditions would further
increase the uploading time.
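As a back-of-the-envelope check, the upload time is roughly the model size divided by the available bandwidth. A minimal sketch (the ~233 MB model size and ~78 Mbps link speed are our illustrative assumptions, not figures reported by the measurement above):

```python
# Back-of-the-envelope upload time: model size / bandwidth.
# The 233 MB model size and ~78 Mbps throughput are assumed
# values for illustration, not measurements from this paper.
model_mb = 233                    # approximate AlexNet caffemodel size
bandwidth_mbps = 78               # assumed wireless throughput
upload_s = model_mb * 8 / bandwidth_mbps
print(f"upload takes ~{upload_s:.0f} s")   # ~24 s
```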
If we used a central cloud server with the same hardware where
the user's DNN model is installed in advance, we would have obtained
the same DNN execution time, yet with a longer network latency.
For example, if we access a cloud server in our local region
(East Asia) [10], the network latency would be about 60 ms due to
multi-hop transmission, compared to 1 ms for our edge server.
Also, it is known that the multi-hop transmission to distant cloud
datacenters causes high jitters, which may hurt the real-time user
experience [32].
Although edge servers are attractive alternatives for running
DNN queries, our experimental result indicates that users must
wait quite a while before they can use an edge server, due to the
time needed to upload a DNN model. In particular, a highly mobile
user, who may leave the service area of an edge server shortly,
suffers heavily from this problem; if the client moves to another
location before it completes uploading its DNN, the client will
have wasted its battery on network transmission without ever using
the edge server. To solve this issue, we propose IONN, which allows
the client to offload partial DNN execution to the server while
the DNN model is being uploaded.
3 BACKGROUND
Before explaining IONN, we briefly review a DNN and its variant,
the Convolutional Neural Network (CNN), typically used for image
processing. We also describe some previous approaches to offloading
DNN computations to remote servers.
3.1 Deep Neural Network
A deep neural network (DNN) can be viewed as a directed graph
whose nodes are layers. Each layer in a DNN performs its operation
on the input matrices and passes the output matrices to the
next layer (in other words, each layer is executed). Some layers
just perform the same operations with fixed parameters, while
others contain trainable parameters. The trainable parameters are
iteratively updated according to learning algorithms using training
data (training). Once trained, the DNN model can be deployed as a
file and used to infer outputs for new input data (inference). DNN
frameworks, such as caffe [19], can load a pre-trained DNN from
the model file and perform inference for new data by executing
the DNN. In this paper, we focus on offloading computations for
inference, because training requires much more resources than
inference and is hence typically performed on powerful cloud
datacenters.

¹We measured the size of a docker image for each DNN framework (GPU version) from
dockerhub, which contains all the libraries needed to run the framework as well as the
framework itself.
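To make the layer-by-layer execution concrete, the following minimal sketch runs inference over a chain of layers in plain Python with numpy; the fc/relu operations and shapes are our illustration, not caffe's API:

```python
import numpy as np

# A minimal sketch of layer-by-layer inference (our illustration,
# not caffe's API): each layer consumes input matrices and passes
# its output matrices to the next layer in the graph. fc carries
# trainable parameters (W, b); relu performs a fixed operation.
def fc(x, params):
    W, b = params
    return x @ W + b

def relu(x, _params):
    return np.maximum(x, 0)

def run_inference(layers, x):
    for op, params in layers:   # layers visited in topological order
        x = op(x, params)       # "executing" each layer in turn
    return x

rng = np.random.default_rng(0)
layers = [(fc, (rng.standard_normal((4, 8)), np.zeros(8))),
          (relu, None),
          (fc, (rng.standard_normal((8, 3)), np.zeros(3)))]
scores = run_inference(layers, rng.standard_normal((1, 4)))  # shape (1, 3)
```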
A CNN is a DNN that includes convolution layers and is widely used
to classify an image into one of pre-determined classes. Image
classification in a CNN commonly proceeds as follows. When
an image is given to the CNN, the CNN extracts features from the
image using convolution (conv) layers and pooling (pool) layers.
The conv/pool layers can be placed in series [22] or in parallel
[36] [15]. Using the features, a fully-connected (fc) layer
calculates the scores of each output class, and a softmax layer
normalizes the scores. The normalized scores are interpreted as the
probabilities that the input image belongs to each output class.
There are many other types of layers (e.g., about 50 types of
layers are currently implemented in caffe [19]), but explaining
all of them is beyond the scope of this paper.
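The conv → pool → fc → softmax pipeline above can be sketched in a few lines of pytorch; the layer sizes below are illustrative only and do not correspond to any model discussed in this paper:

```python
import torch
import torch.nn as nn

# conv/pool layers extract features; an fc layer scores each class;
# softmax normalizes the scores into probabilities. Layer sizes are
# illustrative only.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # conv layer
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pool layer
        )
        self.fc = nn.Linear(16 * 16 * 16, num_classes)   # fully-connected

    def forward(self, x):                    # x: (batch, 3, 32, 32)
        feats = self.features(x).flatten(1)
        return torch.softmax(self.fc(feats), dim=1)  # class probabilities

probs = TinyCNN()(torch.randn(1, 3, 32, 32))   # sums to 1 per image
```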
3.2 Offloading of DNN Computations
Many cloud providers are offering machine learning (ML) services
[26] [2] [10], which perform computation-intensive ML algorithms
(including DNN) on behalf of clients. They often provide an appli-
cation programming interface (API) to app developers so that the
developers can implement ML applications using the API. Typically,
the API allows a user to make a request (query) for DNN compu-
tation by simply sending an input matrix to the service provider’s
clouds where DNN models are pre-installed. The server in the
clouds executes the corresponding DNN model in response to the
query and sends the result back to the client. Unfortunately, this
centralized, cloud-only approach is not appropriate for our scenario
of the generic use of edge servers since pre-installing DNN models
at the edge servers is not straightforward.
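Conceptually, such a query boils down to posting an input matrix to a pre-installed model and reading back the output. A hypothetical client-side sketch (the endpoint URL and JSON schema are made up for illustration; real provider APIs differ in detail):

```python
import numpy as np
import requests

def query_dnn(input_matrix: np.ndarray) -> np.ndarray:
    # Client side of a DNN query: send the input matrix to the
    # provider's cloud, where the model is pre-installed, and
    # receive the inference result. The URL and JSON schema are
    # hypothetical; real provider APIs differ.
    resp = requests.post(
        "https://ml.example.com/v1/models/alexnet:predict",
        json={"inputs": input_matrix.tolist()},
        timeout=10,
    )
    return np.asarray(resp.json()["outputs"])
```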
Recent studies have proposed to execute DNN using both the
client and the server [
20
] [
14
]. NeuroSurgeon is the latest work on
the collaborative DNN execution using a DNN partitioning scheme
[
20
]. NeuroSurgeon creates a prediction model for DNN, which
estimates the execution time and the energy consumption for each
layer, by performing regression analysis using the DNN execution
profiles. Using the prediction model and the runtime information,
NeuroSurgeon dynamically partitions a DNN into the front part
and the rear part. The client executes the front part and sends its
output matrices to the server. The server runs the rear part with the
delivered matrices and sends the new output matrices back to the
client. To decide the partitioning point, NeuroSurgeon estimates the
expected query execution time for every possible partitioning point
and nds the best one. Their experiments show that collaborative
DNN execution between the client and the server improves the
performance, compared to the server-only approach.
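In the spirit of NeuroSurgeon's search, the following simplified sketch picks the partition point that minimizes estimated query latency over a linear chain of layers (latency-only objective, no energy term; the per-layer predictions would come from the regression models described above):

```python
def best_partition(client_ms, server_ms, tx_ms):
    """Pick the partition point k minimizing estimated query latency.

    client_ms[i] / server_ms[i]: predicted runtime of layer i on the
    client / server (from the regression-based prediction models).
    tx_ms[k]: time to transmit the data crossing partition k
              (tx_ms[0] is the raw input, tx_ms[n] the final output).
    The client runs layers [0, k) and the server runs layers [k, n).
    """
    n = len(client_ms)
    def latency(k):
        return sum(client_ms[:k]) + tx_ms[k] + sum(server_ms[k:])
    return min(range(n + 1), key=latency)
```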
Although collaborative DNN execution in NeuroSurgeon was effective,
it is still based on cloud servers where the DNN model is
pre-installed, and is thus not well suited for our edge computing
scenario; it neither uploads the DNN model nor does its partitioning
algorithm consider the uploading overhead. However, collaborative
execution offers a useful insight for DNN edge computing: we can
partition the DNN model and upload each partition incrementally,
so that the client and the server can execute the partitions
collaboratively, even before the whole model is uploaded. Starting
from this insight, we designed the incremental offloading of