2.4 Learning Curves
We can also extract meta-data about the training process itself, such as how fast model performance improves as more training data is added. If we divide the training into steps $s_t$, usually adding a fixed number of training examples every step, we can measure the performance $P(\theta_i, t_j, s_t) = P_{i,j,t}$ of configuration $\theta_i$ on task $t_j$ after step $s_t$, yielding a learning curve across the time steps $s_t$. Learning curves are used extensively to speed up hyperparameter optimization on a given task (Kohavi and John, 1995; Provost et al., 1999; Swersky et al., 2014; Chandrashekaran and Lane, 2017). In meta-learning, however, learning curve information is transferred across tasks.
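As a minimal sketch of how such a curve could be recorded (the step schedule and the decision-tree configuration are illustrative choices, not prescribed by the text):

```python
# Record a learning curve P_{i,j,t} for one configuration theta_i on one
# task t_j by training on growing subsets of the training data.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

config = DecisionTreeClassifier(max_depth=10, random_state=0)  # one theta_i
steps = np.linspace(100, len(X_train), num=10, dtype=int)      # steps s_t

curve = []  # P_{i,j,t} for t = 1..10
for n in steps:
    config.fit(X_train[:n], y_train[:n])        # train on the first n examples
    curve.append(config.score(X_test, y_test))  # performance after step s_t
```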
While evaluating a configuration on a new task $t_{new}$, we can halt the training after a certain number of iterations $r < t$, and use the partially observed learning curve to predict how well the configuration will perform on the full dataset based on prior experience with other tasks, and decide whether to continue the training or not. This can significantly speed up the search for good configurations.
One approach is to assume that similar tasks yield similar learning curves. First, define a distance between tasks based on how similar the partial learning curves are: $dist(t_a, t_b) = f(P_{i,a,t}, P_{i,b,t})$ with $t = 1, ..., r$. Next, find the $k$ most similar tasks $t_{1..k}$ and use their complete learning curves to predict how well the configuration will perform on the new complete dataset. Task similarity can be measured by comparing the shapes of the partial curves across all configurations tried, and the prediction is made by adapting the ‘nearest’ complete curve(s) to the new partial curve (Leite and Brazdil, 2005, 2007). This approach was also successful in combination with active testing (Leite and Brazdil, 2010), and can be sped up further by using multi-objective evaluation measures that include training time (van Rijn et al., 2015).
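The sketch below illustrates the nearest-curve idea under simplifying assumptions: the distance $f$ is taken to be Euclidean, and the adaptation step is a simple rescaling of the nearest complete curves to match the observed prefix. This is a simplified illustration of the approach of Leite and Brazdil (2005, 2007), not their exact method.

```python
# Predict final performance on a new task from a partial learning curve,
# using the k prior tasks whose curves are closest on the first r steps.
import numpy as np

def predict_final_performance(partial_new, prior_curves, k=3):
    """partial_new: performance at steps 1..r on the new task, shape (r,).
    prior_curves: complete curves on prior tasks, shape (n_tasks, T)."""
    r = len(partial_new)
    # dist(t_a, t_b) = f(P_{i,a,t}, P_{i,b,t}) with t = 1..r: Euclidean here
    dists = np.linalg.norm(prior_curves[:, :r] - partial_new, axis=1)
    nearest = np.argsort(dists)[:k]
    preds = []
    for j in nearest:
        # Adapt the nearest complete curve to the new partial curve by
        # rescaling it so the observed prefixes match on average.
        scale = np.mean(partial_new) / np.mean(prior_curves[j, :r])
        preds.append(scale * prior_curves[j, -1])
    return np.mean(preds)  # predicted performance on the full dataset
```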
Interestingly, while several methods aim to predict learning curves during neural architecture search (Elsken et al., 2018), as of yet none of this work leverages learning curves previously observed on other tasks.
3. Learning from Task Properties
Another rich source of meta-data is the set of characterizations (meta-features) of the task at hand. Each task $t_j \in T$ is described with a vector $m(t_j) = (m_{j,1}, ..., m_{j,K})$ of $K$ meta-features $m_{j,k} \in M$, the set of all known meta-features. This can be used to define a task similarity measure based on, for instance, the Euclidean distance between $m(t_i)$ and $m(t_j)$, so that we can transfer information from the most similar tasks to the new task $t_{new}$. Moreover, together with prior evaluations $\mathbf{P}$, we can train a meta-learner $L$ to predict the performance $P_{i,new}$ of configurations $\theta_i$ on a new task $t_{new}$.
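A minimal sketch of both uses, on synthetic data (the random-forest meta-learner, the data shapes, and the random meta-data are assumptions made for illustration only):

```python
# Use meta-features m(t_j) plus prior evaluations P to (a) find the most
# similar prior task and (b) train a meta-learner L predicting P_{i,new}.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_tasks, K, n_configs, D = 50, 10, 20, 4

M = rng.random((n_tasks, K))        # meta-feature vectors m(t_j)
Theta = rng.random((n_configs, D))  # configurations theta_i
P = rng.random((n_tasks, n_configs))  # P[j, i] = prior evaluation P_{i,j}

# Task similarity: Euclidean distance between meta-feature vectors
m_new = rng.random(K)               # m(t_new)
most_similar = np.argmin(np.linalg.norm(M - m_new, axis=1))

# Meta-learner L: input = (meta-features, configuration), target = performance
X_meta = np.array([np.concatenate([M[j], Theta[i]])
                   for j in range(n_tasks) for i in range(n_configs)])
y_meta = P.ravel()  # same (j outer, i inner) ordering as X_meta
L = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_meta, y_meta)

# Predict P_{i,new} for every configuration on the new task
P_new = L.predict(np.array([np.concatenate([m_new, Theta[i]])
                            for i in range(n_configs)]))
```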
3.1 Meta-Features
Table 1 provides a concise overview of the most commonly used meta-features, together with a short rationale for why they are indicative of model performance. Where possible, we also show the formulas to compute them. More complete surveys can be found in the literature (Rivolli et al., 2018; Vanschoren, 2010; Mantovani, 2018; Reif et al., 2014; Castiello et al., 2005).
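For concreteness, a few widely used simple and information-theoretic meta-features can be computed as follows; this selection is illustrative and does not reproduce Table 1 itself:

```python
# Compute a handful of standard meta-features m(t_j) on a toy dataset.
import numpy as np
from scipy.stats import entropy, skew
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

meta_features = {
    "n_instances": X.shape[0],
    "n_features": X.shape[1],
    "n_classes": len(np.unique(y)),
    # Class entropy H(C): higher values indicate more balanced classes
    "class_entropy": entropy(np.bincount(y) / len(y), base=2),
    # Mean absolute feature skewness: indicates non-normality of features
    "mean_skewness": float(np.mean(np.abs(skew(X, axis=0)))),
}
```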