Word2Vec, as well as other similar models (e.g. GloVe [5]), is considered a context-free approach to
textual representation. This means that these models have limitations in representing
the context of words in a text, which impairs sentence-level tasks and even
fine-tuning at the word level. They are unidirectional, that is, they
consider the context of a word only from left to right, with no mechanism to detect whether a
particular word has already occurred earlier in the corpus. As a consequence, these models provide a
single representation, a dense vector, for each word in a text or set of texts.
In addition, according to their authors [4], these models are considered very shallow, since they
represent each word in a single layer, and there is a limit to the amount of information they can
capture. Moreover, these models do not handle the polysemy of words, that is, the fact that the same word
used in different contexts can have different meanings (e.g. bank in the monetary sense vs. bank
as a bench to sit on). Another characteristic that is not handled is synonymy, that is, when two or
more different words share the same meaning (e.g. create, implement, generate).
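To illustrate this limitation, the following minimal sketch (not part of the original experiments; the toy sentences, the gensim library and its parameters are arbitrary choices for illustration) shows that a static Word2Vec model stores exactly one dense vector for the word "bank", regardless of the sense in which it occurs:

# Minimal sketch of the single-vector limitation of static embeddings.
# The two toy sentences use "bank" in different senses, yet the trained
# model keeps only one vector for the word.
from gensim.models import Word2Vec

sentences = [
    ["she", "deposited", "money", "in", "the", "bank"],           # monetary sense
    ["they", "sat", "on", "the", "bank", "of", "the", "river"],   # bench/river sense
]

# Toy hyperparameters, chosen only for the example.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# A single 50-dimensional vector represents every occurrence of "bank",
# so both senses collapse into the same representation.
print(model.wv["bank"].shape)   # (50,)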
On the other hand, many advances have occurred in the NLP area in recent years, mainly due to
deep learning techniques [23]. Among these advances is the possibility of obtaining contextualized
embeddings. This approach produces different vector representations for the same word in a text,
varying according to its context. These techniques are therefore capable of capturing the contextual
semantics of ambiguous words [14], as well as addressing polysemy issues. Following this new paradigm,
recent studies have turned to contextualized embedding models [24], [14], leaving aside the original
paradigm, in which there was only a single embedding vector for each distinct word in a text or set
of texts. Thus, each occurrence of a word is mapped to a dense vector that specifically takes the
surrounding context into account.
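The following sketch illustrates this behavior with a publicly released BERT checkpoint (bert-base-uncased, accessed here through the HuggingFace transformers library; the sentences and the choice of taking the first sub-token of the target word are illustrative assumptions, not the procedure of this work). The same word "bank" receives a different vector in each sentence:

# Minimal sketch of contextualized embeddings: the vector for "bank"
# depends on the sentence in which it appears.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    # Encode the sentence and return the hidden state of the target word's token.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = vector_for("she deposited money in the bank", "bank")
v2 = vector_for("they sat on the bank of the river", "bank")

# The cosine similarity is well below 1.0: each occurrence of "bank"
# is mapped to its own context-dependent vector.
print(torch.cosine_similarity(v1, v2, dim=0).item())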
This representation approach is readily applicable to many NLP tasks in which the inputs are
usually sentences and, therefore, context information is available, such as textual software
requirements. This new language representation paradigm originated from several ideas and
initiatives that have emerged in NLP in recent years, such as CoVe [25], ELMo [24], ULMFiT [21],
CVT [26], Context2Vec [10], BERT [19] and the OpenAI Transformer (GPT and GPT-2) [27]. The
contextualized pre-trained BERT model [19] has presented greatly improved results in NLP
tasks and has therefore been widely used in several applications, through the pre-trained models
made available by its authors (e.g. BERT_base and BERT_large [19]).
2.2. BERT
BERT is an innovative method, considered the state of the art in pre-trained language
representation [19]. BERT models are regarded as contextualized or dynamic models and have
shown much-improved results in several NLP tasks [22], [24], [27], [21], such as sentiment
classification, semantic textual similarity computation and textual entailment recognition.
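As an illustration of the semantic similarity use case, the sketch below derives a sentence embedding by mean-pooling BERT's token vectors and compares two requirement-like sentences with cosine similarity. The pooling strategy, the checkpoint name and the example sentences are illustrative assumptions for this sketch, not the pipeline adopted in this work:

# Illustrative sketch: semantic textual similarity from mean-pooled BERT vectors.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state     # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)               # mean-pooled sentence vector

a = sentence_embedding("The system shall allow users to reset their password.")
b = sentence_embedding("Users must be able to recover a forgotten password.")

# Higher cosine similarity indicates semantically closer requirements.
print(torch.cosine_similarity(a, b, dim=0).item())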
This model originated from various ideas and initiatives aimed at textual representation that
have emerged in the NLP area in recent years, such as CoVe [25], ELMo [24], ULMFiT [21],
CVT [26], Context2Vec [28], the OpenAI Transformer (GPT and GPT-2) [27] and the
Transformer [29]. BERT is characterized as a dynamic method mainly because it relies on the
attention mechanism of the Transformer [19], which allows the context of each
word in a text to be analyzed individually, including checking whether each word has previously
appeared in a text with the same context. This allows the method to learn contextual relationships
between words (or subwords) in a text. BERT consists of a stack of Transformer encoders [29] whose
parameters are pre-trained on large unlabeled corpora, such as Wikipedia and BooksCorpus [30]. It can
be said that, for a given input sentence, BERT "looks left and right several times" and outputs a dense vector