A Look at the New AutoML Book: Hyperparameter Optimization and Future Challenges

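"AutoML: Methods, Systems, Challenges" is a new book by Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. It introduces the fundamentals of AutoML systems and recent advances in the field, including representative frameworks such as Auto-WEKA, Hyperopt-Sklearn, and Auto-sklearn. The book examines how AutoML is applied in machine learning, in particular how model selection and hyperparameter optimization can be automated to provide baseline performance.

In machine learning, AutoML (automated machine learning) has become a key research direction. It aims to reduce manual intervention by automating tasks such as model selection, feature engineering, and hyperparameter optimization. Hyperparameter optimization is a core component of AutoML because it directly affects model performance.

1.1 Introduction

Every machine learning model has a set of hyperparameters. They are not learned during training but must be set before the model is built, and their values are critical to the model's final performance. In a deep neural network, for example, the learning rate, batch size, and number of layers are all hyperparameters, and their choice has a significant effect on training speed and accuracy.

1.2 Overview of Hyperparameter Optimization

Hyperparameter optimization is usually treated as a blackbox function optimization problem: the unknown function being optimized is the model's performance under different hyperparameter settings. Optimization methods include model-free methods and strategies based on Bayesian optimization. Model-free methods such as random search and grid search are simple to use but can be inefficient. Bayesian optimization, in contrast, uses prior knowledge and a model to guide the search and can explore the hyperparameter space more efficiently.

1.3 Multi-Fidelity Methods

Because modern machine learning models are computationally expensive, pure blackbox optimization becomes very costly. Multi-fidelity methods were introduced for this reason. They use cheap approximate evaluations (for example, small data subsets or simplified models) to predict the performance of the full-scale model, which can greatly reduce the compute needed during optimization.

1.4 Open Problems and Future Research Directions

Despite considerable progress in hyperparameter optimization, many challenges remain, such as how to handle high-dimensional hyperparameter spaces, how to use compute resources effectively, how to find the global optimum within a limited budget, and how to incorporate domain knowledge into the optimization. In addition, as deep learning and other complex models develop, dynamically adjusting the optimization strategy and adaptively choosing model architectures are also priorities for future research.

The book provides a comprehensive overview of hyperparameter optimization, describes the challenges that current AutoML systems face, and points out directions for future work. A solid understanding of these concepts and techniques helps readers grasp the essentials of automated machine learning and improve the efficiency and performance of model development.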

18 CHAPTER 1. HYPERPARAMETER OPTIMIZATION

Multi-task Bayesian optimization (and the methods presented in the previous subsection) requires an upfront specification of a set of fidelities. This can be suboptimal since these can be misspecified [71, 75] and because the number of fidelities that can be handled is low (usually five or less). Therefore, and in order to exploit the typically smooth dependence on the fidelity (such as, e.g., size of the data subset used), it often yields better results to treat the fidelity as continuous (and, e.g., choose a continuous percentage of the full data set to evaluate a configuration on), trading off the information gain and the time required for evaluation [75]. To exploit the domain knowledge that performance typically improves with more data, with diminishing returns, a special kernel can be constructed for the data subsets [75]. This generalization of multi-task Bayesian optimization improves performance and can achieve a 10-100 fold speedup compared to blackbox Bayesian optimization.

Instead of using an information-theoretic acquisition function, Bayesian optimization with the Upper Confidence Bound (UCB) acquisition function can also be extended to multiple fidelities [70, 71]. While the first such approach, MF-GP-UCB [70], required upfront fidelity definitions, the later BOCA algorithm [71] dropped that requirement. BOCA has also been applied to optimization with more than one continuous fidelity, and we expect HPO for more than one continuous fidelity to be of further interest in the future.

Generally speaking, methods that can adaptively choose their fidelity are very appealing and more powerful than the conceptually simpler bandit-based methods discussed in Section 1.4.2, but in practice we caution that strong models are required to make successful choices about the fidelities. When the models are not strong (since they do not have enough training data yet, or due to model mismatch), these methods may spend too much time evaluating higher fidelities, and the more robust fixed budget schedules discussed in Section 1.4.2 might yield better performance given a fixed time limit.

1.5 Applications to AutoML

In this section, we provide a historical overview of the most important hyperparameter optimization systems and applications to automated machine learning. Grid search has been used for hyperparameter optimization since the 1990s [104, 68] and was already supported by early machine learning tools in 2002 [33]. The first adaptive optimization methods applied to HPO were greedy depth-first search [79] and pattern search [106], both improving over default hyperparameter configurations, and pattern search improving over grid search, too. Genetic algorithms were first applied to tuning the two hyperparameters C and γ of an RBF-SVM in 2004 [116] and resulted in improved classification performance in less time than grid search. In the same year, an evolutionary algorithm was used to learn a composition of three different kernels for an SVM, the kernel hyperparameters and to jointly select a feature subset; the learned combination of kernels was able to outperform every single optimized kernel. Similar in spirit, also in 2004, a genetic algorithm was used to select both the features used by and the hyperparameters of either an SVM or a neural network [126].

CMA-ES was first used for hyperparameter optimization in 2005 [35], in that case to optimize an SVM’s hyperparameters C and γ, a kernel lengthscale l for each dimension of the input data, and a complete rotation and scaling matrix. Much more recently, CMA-ES has been demonstrated to be an excellent choice for parallel HPO, outperforming state-of-the-art Bayesian optimization tools when optimizing 19 hyperparameters of a deep neural network on 30 GPUs in parallel [88].

In 2009, Escalante et al. [28] extended the HPO problem to the Full Model Selection problem, which includes selecting a preprocessing algorithm, a feature selection algorithm, a classifier and all their hyperparameters. By being able to construct a machine learning pipeline from multiple off-the-shelf machine learning algorithms using HPO, the authors empirically found that they can apply their method to any data set as no domain knowledge is required, and demonstrated the applicability of their approach to a variety of domains [46, 30]. Their proposed method, particle swarm model selection (PSMS), uses a modified particle swarm optimizer to handle the conditional configuration space. To avoid overfitting, PSMS was extended with a custom ensembling strategy which combined the best solutions from multiple generations [29]. Since particle swarm optimization was originally designed to work on continuous configuration spaces, PSMS was later also extended to use a genetic algorithm to optimize the pipeline structure and only use particle swarm optimization to optimize the hyperparameters of each pipeline [142].

To the best of our knowledge, the first application of Bayesian optimization to HPO dates back to 2005, when Frohlich and Zell [36] used an online Gaussian process together with EI to optimize the hyperparameters of an SVM, achieving speedups of factor 10 (classification, 2 hyperparameters) and 100 (regression, 3 hyperparameters) over grid search. Tuned Data Mining [81] proposed to tune the hyperparameters of a full machine learning pipeline using Bayesian optimization; specifically, this used a single fixed pipeline and tuned the hyperparameters of the classifier as well as the per-class classification threshold and class weights.

In 2011, Bergstra et al. [10] were the first to apply Bayesian optimization to tune the hyperparameters of a deep neural network, outperforming both manual and random search. Furthermore, they demonstrated that TPE resulted in better performance than a Gaussian process-based approach. TPE, as well as Bayesian optimization with random forests, were also successful for joint neural architecture search and hyperparameter optimization [12, 103].

Another important step in applying Bayesian optimization to HPO was made by Snoek et al. in the 2012 paper Practical Bayesian Optimization of Machine Learning Algorithms [137], which describes several tricks of the trade for Gaussian process-based HPO implemented in the Spearmint system and obtained a new state-of-the-art result for hyperparameter optimization of deep neural networks.

Independently of the Full Model Selection paradigm, Auto-WEKA [146] (see also Chapter 4) introduced the Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem, in which the choice of a classification algorithm is modeled as a categorical variable, the algorithm hyperparameters are modeled as conditional hyperparameters, and the random-forest based Bayesian optimization system SMAC [56] is used for joint optimization in the resulting 786-dimensional configuration space.
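
The structure of a CASH configuration space can be illustrated with a small sampling sketch. This is not Auto-WEKA's or SMAC's actual representation: the two algorithms, the hyperparameter names, and their ranges below are hypothetical, and plain uniform sampling stands in for a real optimizer. It only shows how a top-level categorical choice activates one algorithm's hyperparameters and leaves all others inactive.

```python
import random

# Hypothetical two-algorithm space; names and ranges are illustrative only.
SPACE = {
    "svm": {"C": (1e-3, 1e3), "gamma": (1e-4, 1e1)},
    "random_forest": {"max_depth": (1, 20), "min_samples_leaf": (1, 10)},
}


def sample_configuration(rng):
    """Draw one point from the conditional configuration space."""
    # Top-level categorical variable: which algorithm to use.
    algorithm = rng.choice(sorted(SPACE))
    config = {"algorithm": algorithm}
    # Conditional hyperparameters: only those of the chosen algorithm are set.
    for name, (low, high) in SPACE[algorithm].items():
        if isinstance(low, int):
            config[name] = rng.randint(low, high)   # integer-valued range
        else:
            config[name] = rng.uniform(low, high)   # real-valued range
    return config


rng = random.Random(0)
configs = [sample_configuration(rng) for _ in range(5)]
for config in configs:
    print(config)
```

Every sampled dictionary contains the `algorithm` key plus only the hyperparameters of that algorithm, which is the property a CASH optimizer has to respect when it proposes new configurations.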

In recent years, multi-fidelity methods have become very popular, especially in deep learning. First, using low-fidelity approximations based on data subsets, feature subsets and short runs of iterative algorithms, Hyperband [87] was shown to outperform blackbox Bayesian optimization methods that did not take these lower fidelities into account. Most recently, in the 2018 paper BOHB: Robust and Efficient Hyperparameter Optimization at Scale, Falkner et al. [31] introduced a robust, flexible, and parallelizable combination of Bayesian optimization and Hyperband that substantially outperformed both Hyperband and blackbox Bayesian optimization for a wide range of problems, including tuning support vector machines, various types of neural networks, and reinforcement learning algorithms.

At the time of writing, we make the following recommendations for which tools we would use in practical applications of HPO:

• If multiple fidelities are applicable (i.e., if it is possible to define substantially cheaper versions of the objective function of interest, such that the performance for these roughly correlates with the performance for the full objective function of interest), we recommend BOHB [31] as a robust, efficient, versatile, and parallelizable default hyperparameter optimization method.

• If multiple fidelities are not applicable:

  – If all hyperparameters are real-valued and one can only afford a few dozen function evaluations, we recommend the use of a Gaussian process-based Bayesian optimization tool, such as Spearmint [137].

  – For large and conditional configuration spaces we suggest either the random forest-based SMAC [56] or TPE [12], due to their proven strong performance on such tasks [27].

  – For purely real-valued spaces and relatively cheap objective functions, for which we can afford more than hundreds of evaluations, we recommend CMA-ES [48].
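
The recommendations above can be read as a small decision procedure. The sketch below is one possible encoding, not something the chapter provides: the function name, its arguments, and the numeric thresholds standing in for "a few dozen" and "more than hundreds" of evaluations are all assumptions made for illustration.

```python
def recommend_hpo_tool(multi_fidelity, large_or_conditional, real_valued_only,
                       max_evaluations, few_evals=50, many_evals=500):
    """Map a problem description to the tool family recommended in the text.

    few_evals and many_evals are assumed cutoffs; the text only says
    "a few dozen" and "more than hundreds" of function evaluations.
    """
    if multi_fidelity:
        return "BOHB"
    if large_or_conditional:
        return "SMAC or TPE"
    if real_valued_only and max_evaluations <= few_evals:
        return "Gaussian process-based Bayesian optimization (e.g., Spearmint)"
    if real_valued_only and max_evaluations >= many_evals:
        return "CMA-ES"
    # The text makes no explicit recommendation for the remaining cases.
    return None


print(recommend_hpo_tool(True, False, True, 30))
print(recommend_hpo_tool(False, True, False, 200))
print(recommend_hpo_tool(False, False, True, 30))
print(recommend_hpo_tool(False, False, True, 2000))
```

Cases the list does not cover, such as a real-valued space with a medium budget, deliberately return `None` rather than guessing a recommendation.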

1.6 Open Problems and Future Research Directions

We conclude this chapter with a discussion of open problems, current research questions and potential further developments we expect to have an impact on HPO in the future. Notably, despite their relevance, we leave out discussions on hyperparameter importance and configuration space definition as these fall under the umbrella of meta-learning and can be found in Chapter 2.

Benchmarks and Comparability

Given the breadth of existing HPO methods, a natural question is what are the strengths and weaknesses of each of them. In order to allow for a fair comparison between different HPO approaches, the community needs to design and agree upon a common set of benchmarks that expands over time, as new HPO variants, such as multi-fidelity optimization, emerge. As a particular example for what this could look like we would like to mention the COCO platform (short for comparing continuous optimizers), which provides benchmark and analysis tools for continuous optimization and is used as a workbench for the yearly Black-Box Optimization Benchmarking (BBOB) challenge [9]. Efforts along similar lines in HPO have already yielded the hyperparameter optimization library (HPOlib [27]) and a benchmark collection specifically for Bayesian optimization methods [23]. However, neither of these has gained similar traction as the COCO platform.

Additionally, the community needs clearly defined metrics, but currently different works use different metrics. One important dimension in which evaluations differ is whether they report performance on the validation set used for optimization or on a separate test set. The former helps to study the strength of the optimizer in isolation, without the noise that is added in the evaluation when going from validation to test set; on the other hand, some optimizers may lead to more overfitting than others, which can only be diagnosed by using the test set. Another important dimension in which evaluations differ is whether they report performance after a given number of function evaluations or after a given amount of time. The latter accounts for the difference in time between evaluating different hyperparameter configurations and includes optimization overheads, and therefore reflects what is required in practice; however, the former is more convenient and aids reproducibility by yielding the same results irrespective of the hardware used. To aid reproducibility, studies that report performance over time in particular should therefore release an implementation.
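
The difference between the two reporting conventions can be made concrete with a toy log of sequential evaluations. The sketch below is an illustration only: the helper names and the numbers are made up, and each duration is assumed to already include the optimizer's own overhead.

```python
def incumbent_after_evaluations(runs, n_evaluations):
    """Best loss among the first n_evaluations runs."""
    return min(loss for _, loss in runs[:n_evaluations])


def incumbent_after_time(runs, time_budget):
    """Best loss among sequential runs that finish within time_budget seconds."""
    elapsed, best = 0.0, None
    for duration, loss in runs:
        elapsed += duration
        if elapsed > time_budget:
            break  # this run would not have finished in time
        best = loss if best is None else min(best, loss)
    return best


# (duration in seconds, validation loss) of sequential evaluations; made-up numbers.
runs = [(10.0, 0.30), (40.0, 0.22), (5.0, 0.27), (60.0, 0.18)]

print(incumbent_after_evaluations(runs, 3))  # budget counted in evaluations
print(incumbent_after_time(runs, 30.0))      # budget counted in seconds
```

With this log, a budget of three evaluations already includes the 40-second run, while a 30-second budget does not, so the two conventions report different incumbents for the same optimizer trace.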

We note that it is important to compare against strong baselines when using new benchmarks, which is another reason why HPO methods should be published with an accompanying implementation. Unfortunately, there is no common software library as is, for example, available in deep learning research that implements all the basic building blocks [2, 114]. As a simple, yet effective baseline that can be trivially included in empirical studies, Jamieson and Recht [65] suggest comparing against different parallelization levels of random search to demonstrate the speedups over regular random search. When comparing to other optimization techniques it is important to compare against a solid implementation, since, e.g., simpler versions of Bayesian optimization have been shown to yield inferior performance [137, 139, 76].
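
A random search baseline at several parallelization levels can be sketched in a few lines. The code below is a simplified illustration, not the protocol from [65]: the objective is a toy function, and the parallel workers are simulated sequentially, so `workers=k` simply means that k configurations are evaluated per round.

```python
import random


def random_search(objective, sample, n_rounds, workers, seed=0):
    """Return the best loss after n_rounds rounds of `workers` evaluations each."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_rounds):
        # One round stands for one unit of wall-clock time in which
        # `workers` configurations are evaluated side by side.
        for _ in range(workers):
            best = min(best, objective(sample(rng)))
    return best


def objective(x):
    """Toy loss with its minimum at x = 0.3 (hypothetical)."""
    return (x - 0.3) ** 2


def sample(rng):
    """Draw one configuration uniformly from [0, 1]."""
    return rng.uniform(0.0, 1.0)


results = {k: random_search(objective, sample, n_rounds=20, workers=k)
           for k in (1, 2, 4)}
for k, best in results.items():
    print(f"{k}x random search: best loss {best:.6f}")
```

A new HPO method given 20 sequential evaluations can then be compared against the 2x and 4x rows to express its gain as an equivalent speedup over random search. Because all runs share a seed here, each higher level sees a superset of the configurations drawn at the lower levels.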

Gradient-Based Optimization

In some cases (e.g., least-squares support vector machines and neural networks) it is possible to obtain the gradient of the model selection criterion with respect to some of the model hyperparameters. In contrast to blackbox HPO, in this case each evaluation of the target function results in an entire hypergradient vector instead of a single float value, allowing for faster HPO.

Maclaurin et al. [96] described a procedure to compute the exact gradients of validation performance with respect to all continuous hyperparameters of a neural network by backpropagating through the entire training procedure (using a novel, memory-efficient algorithm). Being able to handle many hyperparameters efficiently through gradient-based methods allows for a new paradigm of hyperparametrizing the model to obtain flexibility over model classes, regularization, and training methods. Maclaurin et al. demonstrated the applicability of gradient-based HPO to many high-dimensional HPO problems, such as optimizing the learning rate of a neural network for each iteration and layer separately, optimizing the weight initialization scale hyperparameter for each layer in a neural network, optimizing the l2 penalty for each individual parameter in logistic regression, and learning completely new training datasets. As a small downside, backpropagating through the entire training procedure comes at the price of doubling the time complexity of the training procedure. To overcome the necessity of backpropagating through the complete training procedure, later work allows hyperparameter updates with respect to a separate validation set to be interleaved with the training process [90, 34].
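
To make the idea of differentiating through training concrete, the following sketch computes a hypergradient for the smallest possible case: a one-parameter linear model y = w * x trained by gradient descent with an l2 penalty. It is not the algorithm of Maclaurin et al. [96], which uses memory-efficient reverse-mode differentiation for neural networks; here the derivative of the weight with respect to the penalty is simply propagated forward through each training step. The data, learning rate, and step count are arbitrary assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def train_and_hypergradient(train, valid, l2, lr=0.1, steps=100):
    """Return (validation loss, d validation loss / d l2) for the model y = w * x.

    Training loss: mean(0.5 * (w*x - y)**2) + 0.5 * l2 * w**2.
    """
    a = mean([x * x for x, _ in train])  # curvature of the data term
    b = mean([x * y for x, y in train])
    w, dw = 0.0, 0.0                     # weight and its derivative w.r.t. l2
    for _ in range(steps):
        grad = a * w - b + l2 * w        # gradient of the training loss in w
        dgrad = (a + l2) * dw + w        # derivative of that gradient w.r.t. l2
        # Differentiate the update w <- w - lr * grad with respect to l2.
        w, dw = w - lr * grad, dw - lr * dgrad
    val_loss = mean([0.5 * (w * x - y) ** 2 for x, y in valid])
    dval_dw = mean([(w * x - y) * x for x, y in valid])
    return val_loss, dval_dw * dw        # chain rule through the final weight


# Made-up training and validation data.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
valid = [(1.5, 3.0), (2.5, 5.1)]

loss, hypergrad = train_and_hypergradient(train, valid, l2=0.1)

# Sanity check against a central finite difference over the whole training run.
eps = 1e-5
loss_plus, _ = train_and_hypergradient(train, valid, l2=0.1 + eps)
loss_minus, _ = train_and_hypergradient(train, valid, l2=0.1 - eps)
finite_difference = (loss_plus - loss_minus) / (2 * eps)
print(hypergrad, finite_difference)
```

A single training run thus yields both the validation loss and its slope with respect to the hyperparameter, which a gradient-based optimizer could use to update `l2` directly instead of probing nearby values with additional training runs.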

Recent examples of gradient-based optimization of simple models’ hyperparameters [115] and of neural network structures (see Chapter 3) show promising results, outperforming state-of-the-art Bayesian optimization models. Although gradient-based hyperparameter optimization is highly model-specific, the fact that it allows tuning several hundreds of hyperparameters could allow substantial improvements in HPO.

Scalability

Despite recent successes in multi-fidelity optimization, there are still machine learning problems which have not been directly tackled by HPO due to their scale, and which might require novel approaches. Here, scale can mean both the size of the configuration space and the expense of individual model evaluations. For example, there has not been any work on HPO for deep neural networks on the ImageNet challenge dataset [124] yet, mostly because of the high cost of training even a simple neural network on the dataset. It will be interesting to see whether methods going beyond the blackbox view from Section 1.3, such as the multi-fidelity methods described in Section 1.4, gradient-based methods, or meta-learning methods (described in Chapter 2) allow such problems to be tackled. Chapter 3 describes first successes in learning neural network building blocks on smaller datasets and applying them to ImageNet, but the hyperparameters of the training procedure are still set manually.

Given the necessity of parallel computing, we are looking forward to new methods that fully exploit large-scale compute clusters. While there exists much work on parallel Bayesian optimization [41, 10, 57, 137, 22, 132, 51, 31], except for the neural networks described in Section 1.3.2 [138], so far no method
