and a set of meta-features, i.e., characteristics of the dataset that can be computed
efficiently and that help to determine which algorithm to use on a new dataset.
This meta-learning approach is complementary to Bayesian optimization for
optimizing an ML framework. Meta-learning can quickly suggest some instantiations
of the ML framework that are likely to perform quite well, but it is
unable to provide fine-grained information on performance. In contrast, Bayesian
optimization is slow to start for hyperparameter spaces as large as those of
entire ML frameworks, but can fine-tune performance over time. We exploit this
complementarity by selecting k configurations based on meta-learning and using their
results to seed Bayesian optimization. This approach of warmstarting optimization
by meta-learning has been applied successfully before [21, 22, 38], but
never to an optimization problem as complex as that of searching the space of
instantiations of a full-fledged ML framework. Likewise, learning across datasets
has also been applied in collaborative Bayesian optimization methods [4, 45]; while
these approaches are promising, they are so far limited to very few meta-features and
cannot yet cope with the high-dimensional partially discrete configuration spaces
faced in AutoML.
More precisely, our meta-learning approach works as follows. In an offline phase,
for each machine learning dataset in a dataset repository (in our case 140 datasets
from the OpenML [43] repository), we evaluated a set of meta-features (described
below) and used Bayesian optimization to determine and store an instantiation of
the given ML framework with strong empirical performance for that dataset. (In
detail, we ran SMAC [27] for 24 h with 10-fold cross-validation on two-thirds of
the data and stored the resulting ML framework instantiation that exhibited the best
performance on the remaining third.) Then, given a new dataset D, we compute its
meta-features, rank all datasets by their L1 distance to D in meta-feature space, and
select the stored ML framework instantiations for the k = 25 nearest datasets for
evaluation before starting Bayesian optimization with their results.
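At its core, this online selection step is a k-nearest-neighbor lookup in meta-feature space. The sketch below is illustrative only: it assumes the offline phase has already stored a meta-feature vector and a SMAC-optimized configuration per repository dataset, and the function and variable names are ours, not auto-sklearn's actual API.

```python
import numpy as np

def warmstart_configurations(query, meta_features, stored_configs, k=25):
    """Pick the configurations stored for the k datasets nearest to the
    new dataset D (described by `query`) in meta-feature space.

    query         : 1-d array, meta-features of the new dataset D
    meta_features : dict {dataset_name: 1-d array of meta-features}
    stored_configs: dict {dataset_name: best configuration found offline}
    """
    names = list(meta_features)
    # L1 (Manhattan) distance between D and every repository dataset;
    # a real implementation would likely first scale the meta-features
    # to comparable ranges (an assumption, not detailed in the text)
    dists = np.array([np.abs(meta_features[n] - query).sum() for n in names])
    nearest = (names[i] for i in np.argsort(dists)[:k])
    return [stored_configs[n] for n in nearest]
```

The selected configurations are evaluated first, and the resulting (configuration, performance) pairs then seed the Bayesian optimizer as its initial design.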
To characterize datasets, we implemented a total of 38 meta-features from the
literature, including simple, information-theoretic and statistical meta-features [29,
33], such as statistics about the number of data points, features, and classes, as
well as data skewness, and the entropy of the targets. All meta-features are listed in
Table 1 of the original publication’s supplementary material [20]. Notably, we had
to exclude the prominent and effective category of landmarking meta-features [37]
(which measure the performance of simple base learners), because they were
computationally too expensive to be helpful in the online evaluation phase. We note
that this meta-learning approach draws its power from the availability of a repository
of datasets; due to recent initiatives, such as OpenML [43], we expect the number
of available datasets to grow ever larger over time, increasing the importance of
meta-learning.
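To make the flavor of these meta-features concrete, the following sketch computes a handful of the simple, statistical, and information-theoretic measures named above (numbers of data points, features, and classes; data skewness; entropy of the targets). The function name is ours, and the full set of 38 meta-features in [20] is considerably richer.

```python
import numpy as np
from scipy.stats import skew

def basic_meta_features(X, y):
    """Compute a small illustrative subset of dataset meta-features."""
    n_instances, n_features = X.shape
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()  # empirical class distribution
    return {
        "num_instances": n_instances,                      # simple
        "num_features": n_features,                        # simple
        "num_classes": len(counts),                        # simple
        "mean_skewness": float(np.mean(skew(X, axis=0))),  # statistical
        "class_entropy": float(-(p * np.log2(p)).sum()),   # information-theoretic
    }
```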