2. Random vs. Grid for Optimizing Neural Networks
In this section we take a second look at several of the experiments of Larochelle et al. (2007) using random search, to compare with the grid searches done in that work. We begin with a look at hyper-parameter optimization in neural networks, and then move on to hyper-parameter optimization in Deep Belief Networks (DBNs). To characterize the efficiency of random search, we present two techniques in preliminary sections: Section 2.1 explains how we estimate the generalization performance of the best model from a set of candidates, taking into account our uncertainty in which model is actually best; Section 2.2 explains the random experiment efficiency curve that we use to characterize the performance of random search experiments. With these preliminaries out of the way, Section 2.3 describes the data sets from Larochelle et al. (2007) that we use in our work. Section 2.4 presents our results optimizing neural networks, and Section 5 presents our results optimizing DBNs.
2.1 Estimating Generalization
Because of finite data sets, test error is not monotone in validation error, and depending on the set of particular hyper-parameter values $\lambda$ evaluated, the test error of the best-validation-error configuration may vary. When reporting the performance of learning algorithms, it can be useful to take into account the uncertainty due to the choice of hyper-parameter values. This section describes our procedure for estimating test set accuracy, which takes into account any uncertainty in the choice of which trial is actually the best-performing one. To explain this procedure, we must distinguish between estimates of performance $\Psi^{(\mathrm{valid})} = \Psi$ and $\Psi^{(\mathrm{test})}$ based on the validation and test sets respectively:
$$
\Psi^{(\mathrm{valid})}(\lambda) = \operatorname*{mean}_{x \in X^{(\mathrm{valid})}} \mathcal{L}\bigl(x;\, \mathcal{A}_{\lambda}(X^{(\mathrm{train})})\bigr),
$$
$$
\Psi^{(\mathrm{test})}(\lambda) = \operatorname*{mean}_{x \in X^{(\mathrm{test})}} \mathcal{L}\bigl(x;\, \mathcal{A}_{\lambda}(X^{(\mathrm{train})})\bigr).
$$
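As a concrete reading of these definitions, the following minimal Python sketch computes such a mean for the zero-one loss on a held-out set; the array names and the use of NumPy are our assumptions for illustration, not part of the original experiments:

```python
import numpy as np

def psi(y_true, y_pred):
    """Mean zero-one loss of one trained model A_lambda(X_train)
    on a held-out set: the fraction of misclassified examples."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

# Hypothetical usage for a single hyper-parameter configuration lambda:
# psi_valid = psi(y_valid, model.predict(X_valid))   # Psi^(valid)(lambda)
# psi_test  = psi(y_test,  model.predict(X_test))    # Psi^(test)(lambda)
```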
Likewise, we must define the estimated variance V about these means on the validation and test sets,
for example, for the zero-one loss (Bernoulli variance):
$$
V^{(\mathrm{valid})}(\lambda) = \frac{\Psi^{(\mathrm{valid})}(\lambda)\bigl(1 - \Psi^{(\mathrm{valid})}(\lambda)\bigr)}{\bigl|X^{(\mathrm{valid})}\bigr| - 1},
\quad \text{and} \quad
V^{(\mathrm{test})}(\lambda) = \frac{\Psi^{(\mathrm{test})}(\lambda)\bigl(1 - \Psi^{(\mathrm{test})}(\lambda)\bigr)}{\bigl|X^{(\mathrm{test})}\bigr| - 1}.
$$
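Continuing the sketch above, the Bernoulli variance estimator for these means is a one-liner; note that it applies only to the zero-one loss, and the function name is ours:

```python
def variance_of_psi(psi_hat, n_examples):
    """Estimated variance of the mean zero-one loss Psi over a set of
    n_examples points: V(lambda) = Psi * (1 - Psi) / (|X| - 1)."""
    return psi_hat * (1.0 - psi_hat) / (n_examples - 1)
```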
With other loss functions the estimator of variance will generally be different.
The standard practice for evaluating a model found by cross-validation is to report $\Psi^{(\mathrm{test})}(\lambda^{(s)})$ for the $\lambda^{(s)}$ that minimizes $\Psi^{(\mathrm{valid})}(\lambda^{(s)})$. However, when different trials have nearly optimal validation means, then it is not clear which test score to report, and a slightly different choice of $\lambda$ could have yielded a different test error. To resolve the difficulty of choosing a winner, we report a weighted average of all the test set scores, in which each one is weighted by the probability that its particular $\lambda^{(s)}$ is in fact the best. In this view, the uncertainty arising from $X^{(\mathrm{valid})}$ being a finite sample of $G_x$ makes the test-set score of the best model among $\lambda^{(1)}, \ldots, \lambda^{(S)}$ a random variable, $z$. This score $z$ is modeled by a Gaussian mixture model whose $S$ components have means $\mu_s = \Psi^{(\mathrm{test})}(\lambda^{(s)})$,
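To make the weighted-average procedure concrete, here is a short sketch of how such a score could be computed. The Monte Carlo estimation of the weights below (sampling a hypothetical validation mean per trial from a Gaussian and counting how often each trial would have looked best) is our own assumption about one reasonable estimator, not necessarily the exact computation used in the paper:

```python
import numpy as np

def weighted_test_score(psi_valid, v_valid, psi_test, n_sim=10000, rng=None):
    """Sketch of the weighted-average test score described above.

    psi_valid, v_valid, psi_test are length-S arrays holding
    Psi^(valid)(lambda^(s)), V^(valid)(lambda^(s)), and
    Psi^(test)(lambda^(s)) for the S trials.  The weight estimate is
    an assumption: draw hypothetical validation means per trial from
    Normal(Psi^(valid), V^(valid)) and count how often each trial wins.
    """
    rng = np.random.default_rng() if rng is None else rng
    psi_valid = np.asarray(psi_valid, dtype=float)
    v_valid = np.asarray(v_valid, dtype=float)
    psi_test = np.asarray(psi_test, dtype=float)

    S = len(psi_valid)
    # One row per simulation: a hypothetical validation mean for each trial.
    draws = rng.normal(loc=psi_valid, scale=np.sqrt(v_valid), size=(n_sim, S))
    winners = draws.argmin(axis=1)                  # lowest validation error wins
    w = np.bincount(winners, minlength=S) / n_sim   # w_s ~ P(trial s is best)
    # Mean of the Gaussian mixture over z: sum_s w_s * Psi^(test)(lambda^(s)).
    return float(np.dot(w, psi_test))
```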