FractalNet: Ultra-Deep Neural Networks without Residuals
Gustav Larsson
University of Chicago
larsson@cs.uchicago.edu
Michael Maire
TTI Chicago
mmaire@ttic.edu
Gregory Shakhnarovich
TTI Chicago
greg@ttic.edu
Abstract
We introduce a design strategy for neural network macro-architecture based on self-
similarity. Repeated application of a single expansion rule generates an extremely
deep network whose structural layout is precisely a truncated fractal. Such a
network contains interacting subpaths of different lengths, but does not include
any pass-through connections: every internal signal is transformed by a filter and
nonlinearity before being seen by subsequent layers. This property stands in stark
contrast to the current approach of explicitly structuring very deep networks so that
training is a residual learning problem. Our experiments demonstrate that residual
representation is not fundamental to the success of extremely deep convolutional
neural networks. A fractal design achieves an error rate of 22.85% on CIFAR-100,
matching the state-of-the-art held by residual networks.
Fractal networks exhibit intriguing properties beyond their high performance. They
can be regarded as a computationally efficient implicit union of subnetworks of
every depth. We explore consequences for training, touching upon a connection
with student-teacher behavior and, most importantly, demonstrating the ability to
extract high-performance fixed-depth subnetworks. To facilitate this latter task, we
develop drop-path, a natural extension of dropout, to regularize co-adaptation of
subpaths in fractal architectures. With such regularization, fractal networks exhibit
an anytime property: shallow subnetworks provide a quick answer, while deeper
subnetworks, with higher latency, provide a more accurate answer.
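For illustration only (this sketch is ours, not the implementation used in the experiments), the core idea of drop-path is that, at each join between parallel subpaths, some incoming paths are randomly dropped during training and the survivors are merged, so no subpath can rely on the co-presence of another. A minimal Python sketch, assuming the join is an element-wise mean and using an illustrative drop probability:

import random

def drop_path_join(path_outputs, drop_prob=0.15, training=True):
    # Join outputs of parallel subpaths by element-wise averaging.
    # During training, each incoming path is dropped independently with
    # probability drop_prob, but at least one path is always kept so the
    # join still produces an output; at inference, all paths are averaged.
    # (Sketch only: drop_prob is illustrative, and the paper also describes
    # a global variant that activates a single column at a time.)
    if training:
        kept = [p for p in path_outputs if random.random() > drop_prob]
        if not kept:
            kept = [random.choice(path_outputs)]
    else:
        kept = list(path_outputs)
    return sum(kept) / len(kept)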
1 Introduction
ResNet [8] marks a recent and dramatic increase in both depth and accuracy of convolutional neural
networks, facilitated by constraining the network to learn residuals. ResNet variants [8, 9, 11] and
related architectures [31] employ the common technique of initializing and anchoring, via a pass-
through channel, a network to the identity function. Training now differs in two respects. First, the
objective changes to learning residual outputs, rather than unreferenced absolute mappings. Second,
these networks exhibit a type of deep supervision [18], as near-identity layers effectively reduce
distance to the loss. He et al. [8] speculate that the former, the residual formulation itself, is crucial.
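To make the first of these two respects concrete (the notation here is ours, added for illustration): a block with input x and a pass-through channel only needs to learn the residual F(x),

    y = x + F(x),

whereas a conventional block must learn the full, unreferenced mapping

    y = H(x).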
We show otherwise, by constructing a competitive extremely deep architecture that does not rely on
residuals. Our design principle is pure enough to communicate in a single word, fractal, and a simple
diagram (Figure 1). Yet, fractal networks implicitly recapitulate many properties hard-wired into
previous successful architectures. Deep supervision not only arises automatically, but also drives a
type of student-teacher learning [1, 34] internal to the network. Modular building blocks of other
designs [32, 20] are almost special cases of a fractal network's nested substructure.
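To give a flavor of that nested substructure, the following is a minimal Python sketch of the self-similar expansion rule (a toy rendering of ours: the stand-in conv_block, the averaging join, and the scalar usage example are assumptions for illustration; the precise rule and join operation are specified later in the paper). The base case is a single layer; each expansion places two stacked copies of the previous structure alongside one new layer and merges them with a join:

def fractal(num_columns, conv_block, join):
    # Base case (one column): a single conv_block.
    # Expansion rule: f_{C+1}(z) = join(f_C(f_C(z)), conv_block(z)),
    # i.e. two stacked copies of the previous fractal in parallel with one
    # new layer. In a real network every conv_block call would be a distinct
    # parameterized layer; a single stand-in function keeps the sketch minimal.
    if num_columns == 1:
        return conv_block
    prev = fractal(num_columns - 1, conv_block, join)
    return lambda z: join(prev(prev(z)), conv_block(z))

# Toy usage with scalar "signals": the deepest path through f3 applies the
# stand-in layer 2**(3-1) = 4 times, while the shallowest applies it once.
f3 = fractal(3, conv_block=lambda z: z + 1.0, join=lambda a, b: (a + b) / 2.0)
print(f3(0.0))  # 2.0 for this toy choice of conv_block and join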
For fractal networks, simplicity of training mirrors simplicity of design. A single loss, attached to the
final layer, suffices to drive internal behavior mimicking deep supervision. Parameters are randomly
initialized. As they contain subnetworks of many depths, fractal networks are robust to choice of
overall depth; make them deep enough and training will carve out a useful assembly of subnetworks.