Curriculum Learning Strategies: A Way to Improve Model Performance

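Published in IEEE Transactions on Pattern Analysis and Machine Intelligence by Wenwu Zhu's team at Tsinghua University, this is the first survey of curriculum learning (CL), covering its motivation, definitions, theory, and applications.

Curriculum learning is a machine learning training strategy inspired by the curricula of human education: the learner proceeds from easy to hard in order to improve learning efficiency and depth of understanding. In computer vision and natural language processing in particular, it has been shown to improve model generalization and speed up convergence.

The survey first lays out the basic concepts and theoretical foundations of CL, explaining how gradually increasing the complexity of the data guides a model's learning. The core idea is to expose the model to simpler samples early in training and to introduce harder ones as training proceeds, which helps the model build and consolidate a foundation and avoid getting trapped in poor local optima too early.

It then discusses the two main design directions: manually predefined curricula and automatic curricula. A predefined curriculum relies on expert knowledge, with the difficulty levels of the dataset set by hand to build the learning path. Automatic curricula are more dynamic and adjust the difficulty of training samples algorithmically; methods include Self-paced Learning, Transfer Teacher, and RL Teacher. In Self-paced Learning, the model decides which samples to learn from based on its own current learning state. Transfer Teacher borrows from transfer learning and uses previously learned knowledge to guide curriculum design for the new task. RL Teacher uses the feedback mechanism of reinforcement learning to optimize the curriculum sequence. The survey groups automatic CL methods into four categories; each has its own strengths, and the right choice depends on the task at hand.

Together, this systematic review gives readers an overview of recent progress and future research directions in curriculum learning.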

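The self-paced idea described in the summary (the model picks samples according to its current state) can be sketched as follows. This is a minimal illustration, not the survey's formulation: it assumes the common hard-threshold rule in which a sample is used only if its current loss is below an "age" parameter `lam` that grows each round, and it uses a one-parameter linear model on synthetic data. The names `self_paced_fit`, `lam`, `growth`, and the data-generating choices are all hypothetical.

```python
import numpy as np

def self_paced_fit(X, y, lam=1.0, growth=1.5, rounds=8):
    """Alternate between selecting low-loss ("easy") samples and refitting on them."""
    w = np.zeros(X.shape[1])
    for _ in range(rounds):
        losses = (X @ w - y) ** 2
        v = (losses < lam).astype(float)   # hard 0/1 sample weights
        if v.sum() >= X.shape[1]:          # refit only if enough samples were selected
            sw = np.sqrt(v)                # weighted least squares via row scaling
            w, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        lam *= growth                      # admit harder samples next round
    return w

# Synthetic data (assumption): y = 2x with small noise, plus 20 mislabeled points.
rng = np.random.default_rng(0)
x_clean = rng.uniform(-1.0, 1.0, size=200)
y_clean = 2.0 * x_clean + rng.normal(0.0, 0.05, size=200)
x_noisy = rng.uniform(0.5, 1.0, size=20)
y_noisy = 2.0 * x_noisy + 10.0 + rng.normal(0.0, 0.05, size=20)
X = np.concatenate([x_clean, x_noisy])[:, None]
y = np.concatenate([y_clean, y_noisy])

w_spl = self_paced_fit(X, y)                    # curriculum chosen by the model itself
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # baseline: all samples at once
print("self-paced slope:", w_spl[0], "plain least-squares slope:", w_ols[0])
```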
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. , NO. 4

gradually consider less smoothed versions, until the target objective of interest. This strategy also shares the same spirit with simulated annealing. As illustrated in Fig. 3, continuation methods provide a sequence of optimization objectives, starting with a heavily smoothed objective for which it is easy to find a global minimum, and tracking the local minima throughout the training. In this way, continuation methods guide the training towards better regions in parameter space, i.e., as shown in Fig. 3, the local minima learned from easier objectives have better generalization ability and are more likely to approximate global minima. Moreover, from the view of transfer learning, this continuation strategy can also be regarded as a sequence of unsupervised pre-training [6]: training on the preceding objectives could act as a pre-training process which both helps optimization and provides regularization on succeeding objectives.
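The continuation idea can be tried on a one-dimensional toy problem. The sketch below is an illustration under stated assumptions, not the method of [5] or [6]: the objective f(x) = 0.1x² − cos(3x), the choice of Gaussian smoothing, the schedule of `sigma` values, and the step size are all made up for the example. For this f, the Gaussian-smoothed objective has a closed-form gradient, because averaging cos(Kx) over Gaussian noise of standard deviation `sigma` only damps it by exp(−(K·sigma)²/2).

```python
import math

K = 3.0  # frequency of the oscillating term (illustrative choice)

def f(x):
    """Non-convex target objective with global minimum at x = 0."""
    return 0.1 * x**2 - math.cos(K * x)

def grad_smoothed(x, sigma):
    """Gradient of E[f(x + sigma * z)], z ~ N(0, 1); sigma = 0 gives the target gradient."""
    damping = math.exp(-0.5 * (K * sigma) ** 2)
    return 0.2 * x + damping * K * math.sin(K * x)

def descend(x, sigma, steps=500, lr=0.05):
    """Plain gradient descent on the objective smoothed at level sigma."""
    for _ in range(steps):
        x -= lr * grad_smoothed(x, sigma)
    return x

# Direct training on the target objective from a poor starting point.
x_direct = descend(4.0, sigma=0.0, steps=2500)

# Continuation: start heavily smoothed, then track the minimum as sigma shrinks.
x_cont = 4.0
for sigma in (2.0, 1.0, 0.5, 0.25, 0.0):
    x_cont = descend(x_cont, sigma)

print("direct:", x_direct, f(x_direct))
print("continuation:", x_cont, f(x_cont))
```

With the same total number of gradient steps, the direct run settles in a nearby local minimum, while the continuation run follows the minimum of the smoothed objectives to the global one.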

Fig. 3. Illustration of the continuation method from [5], which is the essence of the CL [6]. It starts from optimizing a heavily smoothed version of the objective, and gradually moves to the target objective. Tracking the local minima throughout the training guides the model towards better parameter space and makes it more generalizable.

Additionally, recent studies provide more theoretical evidence for the convergence speedup in CL from the optimization perspective. Weinshall et al. [123] prove a theorem

On the other hand, researchers also analyze the CL mechanism from the perspective of data distribution. In the era of deep learning, large-scale data sources are required for training, which are collected and annotated by company users, the web, and crowd-sourcing systems. This big data collection brings noisy data that is less cognizable or wrongly annotated. In the CL setting, the noisy data correspond to the harder examples in the datasets, while the cleaner data form the easier part. Since the CL strategy encourages training more on the easier data, an intuitive hypothesis is that a CL learner wastes less time on the harder and noisy examples and thus trains faster [6]. This hypothesis reveals the denoising efficacy of CL on noisy data.

To have a closer look at this denoising mechanism, Gong et al. [27] provide a theory based on the assumption that there exists deviation between training and testing distributions caused by noisy/wrongly-annotated training data. Intuitively, training and target/testing distributions share a common high-confidence annotated region with large density, which corresponds to the easier examples in CL. Therefore, starting training from easier examples under the CL strategy actually simulates learning from this high-confidence common region (as an approximation to the target distribution), which guides the learning towards the expected target while reducing the negative impact of low-confidence noisy examples. This data distribution perspective of CL is illustrated in Fig. 4. The common density peak (at the center of the x-axis) of the training and target distributions $P_{\text{train}}(x)$ and $P_{\text{target}}(x)$ in the left part refers to the common high-confidence area, while the heavy tail of $P_{\text{train}}(x)$ demonstrates the relatively more noisy data in the training distribution. The right part illustrates the sequence of weight functions in CL, which initially assigns small values to the noisy tails and much larger values in the common easy area, and gradually moves to equal weights for all examples. Based on the above analysis, the authors formulate $P_{\text{target}}(x)$ as the weighted expression of $P_{\text{train}}(x)$. A follow-up theory clarifies that CL essentially minimizes an upper bound of the expected risk under the target distribution, and this bound shows that we could approach the task of minimizing the expected risk on $P_{\text{target}}(x)$ by taking the core idea of CL: gradually taking relatively easy examples according to the curriculum and minimizing the empirical risk on these examples.
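The sequence of weight functions described above can be mimicked numerically. The sketch below is only an illustration of the qualitative picture (weights peaked on the clean center at first, then flattening to equal weights); it is not the weight function or the bound from [27]. The mixture used for $P_{\text{train}}(x)$, the Gaussian-shaped weight, and the interpolation parameter `lam` are assumptions made for the example.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

x = np.linspace(-6.0, 6.0, 1201)
dx = x[1] - x[0]

# Hypothetical P_train: a clean high-confidence peak plus heavy noisy tails.
p_train = 0.7 * gaussian_pdf(x, 0.0, 0.8) + 0.3 * gaussian_pdf(x, 0.0, 3.0)

def curriculum_weight(x, lam, width=1.0):
    """lam = 0: weight concentrated on the center; lam = 1: equal weights everywhere."""
    peaked = np.exp(-0.5 * (x / width) ** 2)
    return (1.0 - lam) * peaked + lam

def weighted_density(lam):
    """Distribution the learner effectively trains on at curriculum stage lam."""
    unnorm = curriculum_weight(x, lam) * p_train
    return unnorm / (unnorm.sum() * dx)

# Probability mass in the noisy tails (|x| > 2) as the curriculum advances.
tail = np.abs(x) > 2.0
stages = (0.0, 0.25, 0.5, 0.75, 1.0)
tail_mass = [weighted_density(lam)[tail].sum() * dx for lam in stages]
for lam, mass in zip(stages, tail_mass):
    print(f"lam={lam:.2f}  tail mass={mass:.4f}")
```

Early stages put almost no mass on the tails, and the mass grows steadily until the learner sees the full training distribution.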

Fig. 4. Illustration of the CL from the data distribution perspective [27]. The left part demonstrates the data distribution shifts from the easy subset (the solid curve, which is assumed to approximate the testing distribution $P_{\text{target}}(x)$ well) to the full training set $P_{\text{train}}(x)$ (the red dashed curve). The right part shows the corresponding weighting scheme to enable this distribution shift. The center peak of curves refers to the high-confidence clean data, while the tails refer to the noisy data in the distributions. As shown in the left part, $P_{\text{target}}(x)$ is cleaner than $P_{\text{train}}(x)$.

3.2 Suitable Application Scenes of CL

Based on the above analysis on why CL is effective, we can categorize the motivations for applying CL into two groups: to guide, regularizing the training towards better regions in parameter space (with steeper gradients) as from the perspective of the optimization problem, and to denoise, focusing on high-confidence easier area to alleviate the interference of noisy data as from the perspective of data distribution. Not surprisingly, most of the existing application scenes of CL can be classified into these two groups, as demonstrated in Table 1.

The application scenes based on the “to guide” motivation often involve difficult target tasks where direct training on these tasks results in poor performance or slow convergence. CL strategies are adopted to guide the training from easier tasks or smoother versions of objectives to the target tasks. For instance, in sparse-reward RL, direct training on the final tasks rarely gets any positive rewards, which hinders agent learning. Therefore, researchers propose to take the CL strategy and manually [72] or automatically [20] design a sequence of auxiliary (sub)tasks/goals from easy to hard to guide the training. In multi-task learning, learning all the tasks simultaneously or in random order often leads to unsatisfactory performance. To yield performance gains, CL strategies are adopted to automatically choose the easier tasks which are more related to the previous one [83] or can bring more learning progress to the model training [29],
