Pre-training via Language Modelling
BERT [9] (Bidirectional Encoder Representations from Transformers) is an extension of the Transformer architecture and comes with a specific semi-supervised training regimen: BERT heavily relies on pre-training, a form of unsupervised learning, before being fine-tuned on a downstream task in a classical supervised fashion.
During pre-training, BERT is trained on large amounts of unlabeled data via Masked Language Modelling (MLM). MLM is a prediction task in which some of the input tokens are randomly replaced by blanks (“masked”) and the model is trained to predict the tokens behind these blanks, taking into account the textual context on both sides of the blank (see the BERT paper for more details on the pre-training itself [9]). Intuitively, this general task is supposed to initialize the weights to a state in which general concepts and relationships useful for a large number of downstream tasks are already present: BERT learns a representation of the tokens. Unlike word embeddings [59], these are contextual representations: they depend both on the token and on its surrounding tokens.
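As an illustration, the sketch below shows the MLM objective at inference time, using the publicly available HuggingFace transformers library and the roberta-base checkpoint (both chosen here purely for illustration, not as a detail of the original work): the model predicts the token hidden behind the mask from its bidirectional context.

    # Minimal MLM illustration (assumes the HuggingFace transformers package
    # is installed; roberta-base is an arbitrary public checkpoint).
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="roberta-base")
    # The model predicts the masked token from the context on both sides.
    for pred in fill_mask("The method returns the <mask> of the list.")[:3]:
        print(pred["token_str"], round(pred["score"], 3))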
Of note, earlier work also used Language Modelling as a pre-training task with LSTMs (ELMo and ULMFit [7], [8]), and these approaches were applied with varying degrees of success in Software Engineering [13], [14]. BERT’s pre-training is more efficient for two reasons: BERT’s bidirectional architecture uses the context both before and after the token, whereas LSTMs use only the context before the token; and BERT uses Byte-Pair Encoding (BPE) [60] to tokenise text into subwords rather than entire words, leading to better modelling of the vocabulary (see previous work by Karampatsis et al. for an extended discussion of this aspect for source code [42]).
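To illustrate subword tokenisation, the small sketch below (again relying on the transformers library, an assumption of this example) splits a rare identifier into known subwords instead of mapping it to a single out-of-vocabulary token.

    # Subword tokenisation sketch: a rare identifier is split into several
    # known subwords rather than becoming an unknown token.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    print(tokenizer.tokenize("getElementsByTagName"))
    # e.g. ['get', 'Elements', 'By', 'Tag', 'Name'] (the exact split depends
    # on the learned BPE merges)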
RoBERTa [10] is a refinement of BERT, in particular relat-
ing to its pre-training regimen (e.g., RoBERTa uses a larger
pre-training corpus, dynamic masking, and a variation of
the pre-training task) and with only minor architectural
changes (RoBERTa uses Byte-level BPE tokenization, rather
than character-level BPE).
Fine-tuning
Both BERT and RoBERTa are hardly ever trained from scratch. Instead, starting from a pre-trained model with pre-initialized weights, the model weights are further fine-tuned by training on labeled data for a specific downstream task. This involves replacing the last layer of the model (used for the pre-training task) with a task-specific layer and resuming training. The model can leverage the pre-trained representations to learn the downstream task effectively even with a limited amount of data, allowing BERT and RoBERTa to set the state of the art on NLP benchmarks, including tasks with limited data (the GLUE benchmark [11] includes several tasks with fewer than 10,000 examples).
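A minimal sketch of this fine-tuning step, assuming the HuggingFace transformers API and tokenised, labeled datasets train_ds and eval_ds (hypothetical names), is shown below: the pre-training head is dropped and a freshly initialised classification layer is trained together with the rest of the network.

    # Fine-tuning sketch: load pre-trained weights, replace the pre-training
    # head with a task-specific classification layer, and resume training.
    # train_ds and eval_ds are assumed to be tokenised, padded, labeled datasets.
    from transformers import (AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=2)  # new, randomly initialised last layer

    args = TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                             per_device_train_batch_size=16,
                             learning_rate=2e-5)
    Trainer(model=model, args=args,
            train_dataset=train_ds, eval_dataset=eval_ds).train()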
Impact of the Pre-training Corpora
The standard BERT and RoBERTa models have both been
pre-trained on a large English natural language corpus, with
several models available in various sizes. There exist pre-
trained BERT models for many other natural languages and
even programming languages [61].

    EN      Leppie, that’s great news! I look forward to trying IronScheme!
    EN→DE   Leppie, das sind großartige Neuigkeiten! Ich freue mich darauf, IronScheme auszuprobieren!
    DE→EN   leppie, those are great news! I am looking forward to try out IronScheme!
    EN→FR   Leppie, c’est une excellente nouvelle! J’ai hâte d’essayer IronScheme!
    FR→EN   leppie, this is great news! I can’t wait to try IronScheme!

Fig. 1. Example of back-translation. The original English sentence is first translated to German and French, then translated back into English; resulting variation underlined. Google Translate was used for the translation.

Intuitively, one would expect a generic pre-training corpus to be a “jack of all trades, master of none”, and a more specific pre-training corpus to be better suited to a more specific domain (such as software engineering). There is evidence of this for word
embeddings in Software Engineering [62], but how much of
an impact a domain-specific pre-training corpus has for a
BERT or RoBERTa model is still an open question, which we
investigate. Of note, the ULMFit approach [8] continues the pre-training task on the task-specific data (without using labels) before the actual fine-tuning, and finds that this does improve performance.
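A sketch of such continued, domain-specific pre-training (again under the assumption of the HuggingFace transformers API, with domain_ds standing in for an unlabeled, tokenised domain corpus) could look as follows.

    # Continued pre-training sketch: run the MLM objective on unlabeled,
    # task-specific text before any fine-tuning (ULMFit-style adaptation).
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

    Trainer(model=model,
            args=TrainingArguments(output_dir="domain-mlm", num_train_epochs=1),
            train_dataset=domain_ds,   # unlabeled domain text (assumed name)
            data_collator=collator).train()
    model.save_pretrained("domain-mlm")  # starting point for later fine-tuning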
3.2 Additional Techniques
Intermediate-Task Fine-Tuning
Intermediate-task fine-tuning (ITT), also known as two-stage fine-tuning, STILTs [63], or TANDA [64], is a technique whereby the model is fine-tuned twice (with labeled data): first on an intermediate task, a task different from but closely related to the target task, and finally on the actual target task (e.g., training for sentiment analysis on movies before switching to sentiment analysis on books). This is particularly attractive when little data is available for the target task whilst large amounts of data are available for a similar, possibly slightly simpler, but different intermediate task. The idea is that the target task might benefit from “knowledge” that the model acquired during intermediate-task training. Pruksachatkun et al. [65] present a survey on when this method offers good prospects in NLP.
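A sketch of ITT under the same illustrative assumptions as above (HuggingFace transformers; intermediate_ds and target_ds are hypothetical labeled, tokenised datasets):

    # Intermediate-task fine-tuning sketch: fine-tune on a large related task
    # first, then fine-tune the resulting weights on the small target task.
    from transformers import (AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    # Stage 1: intermediate task (e.g., a large generic sentiment corpus).
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=2)
    Trainer(model=model, args=TrainingArguments(output_dir="stage1"),
            train_dataset=intermediate_ds).train()
    model.save_pretrained("stage1")

    # Stage 2: target task, starting from the intermediately fine-tuned model.
    model = AutoModelForSequenceClassification.from_pretrained(
        "stage1", num_labels=2)
    Trainer(model=model, args=TrainingArguments(output_dir="stage2"),
            train_dataset=target_ds).train()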
Self-Training
Self-training (also known as self-labelling or self-learning) [66], [67] is a very simple semi-supervised learning method. It can be explained as follows: a model is first trained on a (possibly too small) labeled dataset. Next, this model is used to make predictions on a number of additional unlabeled samples. The model’s predictions for these unlabeled samples are then simply used as their gold labels. We now have additional, albeit noisier, labeled data; after adding it to the original dataset, we retrain the model. Predictions can be filtered by confidence to reduce the probability of introducing noise into the training set.
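The procedure can be sketched as follows (a generic illustration: model is any classifier exposing a scikit-learn-style predict_proba, and X_l, y_l, X_u are hypothetical names for the labeled set and the unlabeled pool):

    # Self-training sketch with confidence filtering.
    import numpy as np

    model.fit(X_l, y_l)                    # 1. train on the small labeled set
    proba = model.predict_proba(X_u)       # 2. predict the unlabeled pool
    confident = proba.max(axis=1) >= 0.9   # 3. keep confident predictions only
    pseudo_labels = proba.argmax(axis=1)[confident]
    X_aug = np.concatenate([X_l, X_u[confident]])  # 4. add pseudo-labeled data
    y_aug = np.concatenate([y_l, pseudo_labels])
    model.fit(X_aug, y_aug)                # 5. retrain on the augmented set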
Data Augmentation and Back-Translation
Data augmentation is a well-known technique to increase
the amount of labeled data without any human labeling