and non-linear layers. The purpose of the convolution and
pooling layers can be viewed as that of a feature extractor
before the fully connected layers are engaged. Inference then
proceeds exactly as previously described for DNNs until ul-
timately a classification is reached.
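To make this pipeline concrete, the sketch below traces a single-filter forward pass in NumPy: convolution, a non-linearity, and pooling act as the feature extractor, and the flattened feature map is then passed through fully connected layers to produce a classification. The input size, filter size, layer widths, and random weights are purely illustrative assumptions, not taken from any model evaluated in this paper.

import numpy as np

def conv2d(x, k):
    # Valid 2-D cross-correlation of a single-channel input x with kernel k.
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, p=2):
    # Non-overlapping p x p max pooling.
    H, W = x.shape
    x = x[:H - H % p, :W - W % p]
    H, W = x.shape
    return x.reshape(H // p, p, W // p, p).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
frame = rng.standard_normal((28, 28))     # hypothetical single-channel input
kernel = rng.standard_normal((5, 5))      # one convolutional filter

# Feature extraction: convolution + ReLU + pooling.
features = max_pool(np.maximum(conv2d(frame, kernel), 0.0))

# Hand-off to fully connected layers, exactly as in DNN inference.
h = features.ravel()                                      # 12 x 12 = 144 features
W1, b1 = rng.standard_normal((64, h.size)), np.zeros(64)  # hidden layer (sizes assumed)
W2, b2 = rng.standard_normal((10, 64)), np.zeros(10)      # classification layer
probs = softmax(W2 @ np.maximum(W1 @ h, 0.0) + b2)
print("predicted class:", probs.argmax())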
In contrast to shallow learning-based models, deep learn-
ing models are usually large and often contain more than a
million parameters. This high-dimensional parameter space
increases the capacity of these models, and they often out-
perform prior shallow models in terms of generalization per-
formance. However, the accuracy gains come at the expense
of high energy and memory costs. Although high-end wear-
ables containing a GPU, e.g., the NVIDIA Tegra K1, can
efficiently run deep models [12], the high resource demands
make deep learning models unattractive for low-end wear-
ables. In this paper we explore sparse factorizations and
convolutional kernel separations to optimize the resource
demands of deep models, while maintaining their functional
properties.
3. DESIGN AND OPERATION
Beginning with this section, and spanning the following two,
we detail the design and algorithms of SparseSep.
3.1 Design Goals
SparseSep is shaped by the following objectives.
• No Re-training. The training of a large deep model is
the most time consuming and computationally demand-
ing task. For example, a large model such as GoogleNet
is trained using thousands of CPU cores [13], which is
beyond the current capabilities of a single wearable de-
vice. In this work, we mainly focus on the inference
cycle of a deep model and perform no training on the
resource-constrained devices. The training process also
requires a very large training dataset, often inaccessible
to the developers [14]. Thus new techniques are needed
to compress popular cloud-scale deep learning models so
that they run gracefully on wearable- and IoT-grade hardware.
• No Cloud Offloading. As noted in §1, offloading
the execution of portions of deep models can result in
leaking sensitive sensor data. By keeping inference com-
pletely local, users and applications have greater privacy
protection as the data or any intermediate results never
leave the device.
• Target Low-resource Platforms. Even high-end
mobile processors (such as the Tegra K1 [15]) still require
careful resource use when executing deep learning mod-
els, although in this class of processors the resource gap
is closing. However, for low-energy, highly portable wear-
able processors that lack GPUs or have only a few MBs
of RAM (e.g., ARM Cortex M3 [16]), local execution of
deep models remains impractical. For this reason, Spars-
eSep turns to new ideas, namely weight sparsification
and kernel separation, in search of the leaps in
resource efficiency required to make these low-end pro-
cessors viable.
• Minimize Model Changes. Deep models must un-
dergo some degree of change to enable their operation
on wearable hardware. However, a core tenet of Spars-
eSep is to minimize the extent of such modifications
and remain functionally faithful to the initial model ar-
chitecture. For this reason, we frame the problem as
one of deep model compression (originally formulated by
the machine learning community), where model layer ar-
rangements remain unchanged and only per-layer con-
nections are changed through the insertion of additional
summarizing layers. Thus, the degree of changes made
by SparseSep is a key metric that is minimized during
model processing.
• Adopt Principled Approaches. Ad-hoc methods
to alter a deep model – such as 'specializing' a model to
recognize a smaller set of activities/contexts, or chang-
ing layer/unit parameters to produce a desired resource
consumption profile – are dangerous as they disregard the
domain experience of the modeling experts. Methods like
sparse coding [17] and model compression [18], in contrast,
are supported by theoretical analysis [19]. Judging whether
a model may be altered solely from changes in an accuracy
metric is risky and can, for example, hurt its ability to
generalize.
3.2 Overview
We now briefly outline the core approach of SparseSep to
optimize the architecture of large deep learning models so
that they meet the constraints of target wearable devices.
In §4 we provide the necessary theory and algorithms of this
process, but we begin here with the key ideas.
The inference pipeline of a deep learning model is domi-
nated by a series of matrix computations, especially multi-
plications, and by convolutions. Attempts have been made
to reduce the total number of computations by low-rank
factorization of the weight matrix or by decomposing con-
volutional kernels into separable filters in an ad-hoc manner.
Both weight factorization and kernel separation, however,
require modifying the architecture of the model by inserting
a new layer and updating the weight components (see §4.1
and §4.4). Although counter-intuitive, the insertion of a new
layer only achieves computational efficiency under certain
conditions, which depend on, e.g., the size of the newly
inserted layer, the size of the original weight matrix, and
the size of the convolutional kernels. In §4.1, §4.2 and §4.4
we derive the conditions under which computational and
memory efficiencies can be achieved.
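To make the weight factorization case concrete, the sketch below replaces a fully connected computation y = Wx with y = U(Vx), which is equivalent to inserting a new layer of k units; the break-even condition in the comments mirrors the kind of size condition referred to above. The dimensions and rank are assumptions, and the truncated SVD is used here only as a standard way to obtain the factors, not necessarily the exact construction of §4.1.

import numpy as np

m, n, k = 1024, 4096, 256        # original layer: m x n weights; k = size of inserted layer
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))  # stand-in for a pretrained weight matrix (no re-training)

# Truncated SVD yields the best rank-k approximation W ~ U V.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :k] * s[:k]        # m x k : weights from the inserted layer to the output units
V = Vt[:k, :]                    # k x n : weights from the inputs to the inserted layer

x = rng.standard_normal(n)
y_full = W @ x                   # original layer: m * n multiply-accumulates
y_fact = U @ (V @ x)             # factorized:    k * (m + n) multiply-accumulates

# The inserted layer only pays off when k * (m + n) < m * n, i.e. k < m * n / (m + n);
# the same inequality governs the reduction in stored parameters.
print("break-even rank :", m * n / (m + n))         # ~819 for these sizes
print("ops/param ratio :", k * (m + n) / (m * n))   # ~0.31, roughly a 3x reduction
print("relative error  :", np.linalg.norm(y_full - y_fact) / np.linalg.norm(y_full))
# A random W is nearly full rank, so the error printed here is large;
# trained weight matrices are typically much closer to low rank.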
In this paper, we postulate that the computational and space
efficiency of deep learning models can be further im-
proved by adding sparsity constraints to the factorization
process. Accordingly, we propose a sparse dictionary learn-
ing approach that enforces a sparse factorization of the weight
matrix (see §4.3). In §5.2 we show that under specific spar-
sity conditions the resource scalability of the proposed ap-
proach is significantly better than that of existing approaches.
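As an illustrative sketch of this idea (using scikit-learn's off-the-shelf dictionary learner as a stand-in rather than the specific solver of §4.3; the sizes, sparsity level, and variable names are assumptions), the weight matrix W is approximated as a sparse code matrix B times a small dense dictionary A, so that only the non-zero entries of B and the k x n dictionary need to be stored and multiplied at inference time.

import numpy as np
from sklearn.decomposition import DictionaryLearning

m, n, k, s = 256, 512, 64, 8          # layer size, dictionary atoms, non-zeros per code row
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))       # stand-in for a pretrained weight matrix

learner = DictionaryLearning(n_components=k,
                             transform_algorithm="omp",
                             transform_n_nonzero_coefs=s,
                             max_iter=50,
                             random_state=0)
B = learner.fit_transform(W)          # sparse codes, shape (m, k), at most s non-zeros per row
A = learner.components_               # dense dictionary, shape (k, n)

x = rng.standard_normal(n)
y = B @ (A @ x)                       # inference through the inserted sparse layer

# Storage and compute now scale with nnz(B) + k * n rather than m * n.
print("density of B   :", np.count_nonzero(B) / B.size)   # at most s / k = 0.125 here
print("relative error :", np.linalg.norm(W - B @ A) / np.linalg.norm(W))
# As before, a random W is a worst case; trained weight matrices compress far better.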
The weight factorization approach significantly reduces the
memory footprint of both DNN and CNN models by opti-
mizing the parameter space of the fully connected layers.
The factorization also helps to reduce the overall number of
operations needed and improves the inference time. How-
ever, the inference time improvement due to factorization
is much more pronounced for DNNs than for CNNs, pri-
marily because a major portion of the CNN-based inference
time (often over 95%) is spent on performing
convolution operations [12, 20], where the layer factorization
technique has no influence. To overcome this limitation, we
also propose a runtime convolution kernel separation tech-
nique that optimizes the convolution operations to reduce