2 Unit Test Construction
Our testing framework is an open-source library containing a collection of unit tests and visualization
tools. Each unit test is defined by a prototype function to be optimized, a prototypical scale, a noise
prototype, and optionally a non-stationarity prototype. A prototype function is the concatenation
of one or more local shape prototypes. A multi-dimensional unit test is a composition of one-
dimensional unit tests, optionally with a rotation prototype or curl prototype.
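To make this structure concrete, the following Python sketch (with illustrative class and field names of our own choosing, not the library's actual interface) shows how a unit-test specification could be assembled from these parts:

from dataclasses import dataclass
from typing import Callable, Optional, Sequence

import numpy as np

# Illustrative containers for a unit-test specification; the field names are
# assumptions for this sketch, not the library's API.
@dataclass
class UnitTest1D:
    prototype_fn: Callable[[float], float]      # concatenation of local shape prototypes (2.1, 2.2)
    scale: float = 1.0                          # prototypical scale (2.2)
    noise: Optional[Callable] = None            # noise prototype (2.3)
    nonstationarity: Optional[Callable] = None  # optional non-stationarity prototype

@dataclass
class UnitTestND:
    components: Sequence[UnitTest1D]            # composition of one-dimensional unit tests
    rotation: Optional[np.ndarray] = None       # optional rotation (or curl) prototype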
2.1 Shape Prototypes
Shape prototypes are functions defined on an interval, and our collection includes linear slopes
(zero curvature), quadratic curves (fixed curvature), convex or concave curves (varying curvature),
and curves with exponentially increasing or decreasing slope. Further, there are a number of non-
differentiable local shape prototypes (absolute value, rectified-linear, cliff). All of these occur in
realistic learning scenarios: in logistic regression the loss surface is part concave and part convex;
an MSE loss yields the prototypical quadratic bowl; and regularization such as L1 introduces
non-differentiable bends (as do rectified-linear or maxout units in deep learning [15, 16]).
Steep cliffs in the loss surface are a common occurrence when training recurrent neural networks,
as discussed in [11]. See the top rows of Figure 1 for some examples of shape prototypes.
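As an illustration, a few such one-dimensional shape prototypes could be written as follows; the parameterizations are ours for illustration and need not match the library's definitions.

import numpy as np

def linear(x, slope=1.0):                  # zero curvature
    return slope * x

def quadratic(x, curvature=1.0):           # fixed curvature (quadratic bowl)
    return 0.5 * curvature * x ** 2

def exp_slope(x, rate=1.0):                # exponentially increasing (or, for rate < 0, decreasing) slope
    return np.exp(rate * x)

def absolute(x):                           # non-differentiable bend at zero (L1-like)
    return np.abs(x)

def rectified_linear(x):                   # non-differentiable, flat on one side
    return np.maximum(x, 0.0)

def cliff(x, height=10.0, steepness=50.0): # steep cliff, as in recurrent-network loss surfaces
    return height / (1.0 + np.exp(-steepness * x))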
2.2 One-dimensional Concatenation
In our framework, we can chain together a number of shape prototypes such that the resulting
function is continuous and differentiable at all junction points. We can thus produce many prototype
functions that closely mimic commonly encountered functions, e.g., the Laplace function, sinusoids,
saddle-points, and step-functions. See the bottom rows of Figure 1 for some examples.
A single scale parameter determines the scaling of a concatenated function across all its shapes using
the junction constraints. Varying the scales is an important aspect of testing robustness because it is
not possible to guarantee well-scaled gradients without substantial overhead. In many learning prob-
lems, effort is put into proper normalization [17], but that is insufficient to guarantee homogeneous
scaling, for example throughout all the layers of a deep neural network.
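The sketch below illustrates one way such a concatenation could be implemented; it assumes each local shape is defined on an interval starting at zero and uses finite differences to enforce value and slope continuity at the junctions, with a single scale parameter stretching the whole function. It is not the library's actual construction.

import numpy as np

def concatenate(shapes, widths, scale=1.0, eps=1e-6):
    """Chain shapes[i] over an interval of length widths[i], matching value and
    slope at every junction; `scale` stretches the whole concatenated function."""
    starts = np.concatenate([[0.0], np.cumsum(widths)[:-1]])

    # Per-piece affine corrections (gain, offset) derived from the junction constraints.
    gains, offsets = [1.0], [0.0]
    for i in range(1, len(shapes)):
        x_j = widths[i - 1]                       # junction, in the previous piece's local coordinates
        prev, cur = shapes[i - 1], shapes[i]
        prev_val = gains[-1] * prev(x_j) + offsets[-1]
        prev_slope = gains[-1] * (prev(x_j) - prev(x_j - eps)) / eps
        cur_slope = (cur(eps) - cur(0.0)) / eps   # right-hand slope of the new piece at 0
        gain = prev_slope / cur_slope if abs(cur_slope) > 1e-12 else 1.0
        gains.append(gain)
        offsets.append(prev_val - gain * cur(0.0))

    def f(x):
        i = int(np.clip(np.searchsorted(starts, x, side="right") - 1, 0, len(shapes) - 1))
        return scale * (gains[i] * shapes[i](x - starts[i]) + offsets[i])

    return f

Concatenating, for instance, a quadratic piece with a linear piece under these constraints yields a Huber-like curve, and the single scale argument rescales all pieces consistently.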
2.3 Noise Prototypes
The distinguishing feature of stochastic gradient optimization (compared to batch methods) is that it
relies on sample gradients (coming from a subset or even a single element of the dataset), which are
inherently noisy. In our unit tests, we model this with four types of stochasticity:
• Scale-independent additive Gaussian noise on the gradients, which is equivalent to random
translations of inputs in a linear model with MSE loss. Note that this type of noise flips the
sign of the gradient near the optimum and makes it difficult to approach precisely.
• Multiplicative (scale-dependent) Gaussian noise on the gradients, which multiplies the gra-
dients by a positive random number (signs are preserved). This corresponds to a learning
scenario where the loss curvature is different for different samples near the current point.
• Additive zero-median Cauchy noise, mimicking the presence of outliers in the dataset.
• Mask-out noise, which zeros the gradient (independently for each dimension) with a certain
probability. This mimics both training with drop-out [18], and scenarios with rectified
linear units where a unit will be inactive for some input samples, but not for others.
For the first three, we can vary the noise scale, while for mask-out we pick a drop-out frequency.
This noise is not necessarily unbiased (as in the Cauchy case), which breaks common assumptions
made in algorithm design (the modifications in section 2.5 violate these assumptions even more
strongly). See Figure 2 for an illustration of the first two noise prototypes. Noise prototypes and
prototype functions can be combined independently into one-dimensional unit tests.
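For illustration, the four noise prototypes could be applied to a gradient vector roughly as follows; this is a sketch with assumed parameter names and defaults, not the library's interface.

import numpy as np

rng = np.random.default_rng(0)

def additive_gaussian(grad, noise_scale=1.0):
    # scale-independent additive noise; can flip the gradient sign near the optimum
    return grad + noise_scale * rng.standard_normal(grad.shape)

def multiplicative_gaussian(grad, noise_scale=0.5):
    # scale-dependent noise: multiply by a positive random factor, so signs are preserved
    return grad * np.abs(1.0 + noise_scale * rng.standard_normal(grad.shape))

def additive_cauchy(grad, noise_scale=1.0):
    # zero-median, heavy-tailed noise, mimicking outliers in the dataset
    return grad + noise_scale * rng.standard_cauchy(grad.shape)

def mask_out(grad, drop_prob=0.5):
    # zero each dimension independently with probability drop_prob
    return grad * (rng.random(grad.shape) >= drop_prob)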