In the top constituent task (TopConst), sen-
tences must be classified in terms of the sequence
of top constituents immediately below the sen-
tence (S) node. An encoder that successfully ad-
dresses this challenge is not only capturing latent
syntactic structures, but also clustering them by con-
stituent types. TopConst was introduced by Shi
et al. (2016). Following them, we frame it as a
20-way classification problem: 19 classes for the
most frequent top constructions, and one for all
other constructions. As an example, “[Then] [very
dark gray letters on a black screen] [appeared] [.]”
has top constituent sequence: “ADVP NP VP .”.
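To make the labeling concrete, the following sketch (our illustration, not the original preprocessing code) shows how a TopConst label could be read off a bracketed parse with NLTK; the TOP19 set below is a hypothetical stand-in for the 19 most frequent top-constituent sequences.

from nltk import Tree

# Hypothetical stand-in for the 19 most frequent top-constituent sequences.
TOP19 = {"ADVP NP VP .", "NP VP ."}

def top_constituent_label(parse_str):
    """Return the sequence of constituents immediately below the S node."""
    tree = Tree.fromstring(parse_str)
    # Skip a ROOT/TOP wrapper node if the parser adds one.
    s_node = tree[0] if tree.label() in ("ROOT", "TOP") else tree
    sequence = " ".join(child.label() for child in s_node)
    # 20-way classification: 19 frequent sequences plus a catch-all class.
    return sequence if sequence in TOP19 else "OTHER"

print(top_constituent_label(
    "(ROOT (S (ADVP (RB Then)) (NP (JJ dark) (NNS letters)) (VP (VBD appeared)) (. .)))"
))  # -> ADVP NP VP .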
Note that, while we would not expect an un-
trained human subject to be explicitly aware of
tree depth or top constituency, similar information
must be implicitly computed to correctly parse
sentences, and there is suggestive evidence that the
brain tracks something akin to tree depth during
sentence processing (Nelson et al., 2017).
Semantic information. These tasks also rely on
syntactic structure, but they further require some
understanding of what a sentence denotes. The
Tense task asks for the tense of the main-clause
verb (VBP/VBZ forms are labeled as present,
VBD as past). No target form occurs in more than
one of the train/dev/test partitions, so that classifiers cannot rely
on specific words (it is not clear that Shi and col-
leagues, who introduced this task, controlled for
this factor). The subject number (SubjNum) task
focuses on the number of the subject of the main
clause (number in English is more often explic-
itly marked on nouns than verbs). Again, there
is no target overlap across partitions. Similarly,
object number (ObjNum) tests for the number of
the direct object of the main clause (again, avoid-
ing lexical overlap). To solve the previous tasks
correctly, an encoder must not only capture tense
and number, but also extract structural informa-
tion (about the main clause and its arguments).
We grouped Tense, SubjNum and ObjNum with
the semantic tasks, since, at least for models that
treat words as unanalyzed input units (without
access to morphology), they must rely on what
a sentence denotes (e.g., whether the described
event took place in the past), rather than on struc-
tural/syntactic information. We recognize, how-
ever, that the boundary between syntactic and se-
mantic tasks is somewhat arbitrary.
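As an illustration of the overlap constraint (a sketch under our own assumptions, not the released data-construction script), one way to guarantee that no target form is shared across partitions is to assign all occurrences of a form to a single split:

import random
from collections import defaultdict

def split_without_target_overlap(examples, seed=0):
    """examples: (sentence, target_form, label) tuples; the target form is,
    e.g., the main-clause verb for Tense or the subject head noun for SubjNum."""
    by_form = defaultdict(list)
    for sent, form, label in examples:
        by_form[form].append((sent, form, label))
    splits = {"train": [], "dev": [], "test": []}
    rng = random.Random(seed)
    for form, exs in by_form.items():
        # Every occurrence of a target form lands in exactly one partition
        # (80/10/10 is an assumed ratio), so no form is seen at both train
        # and test time.
        part = rng.choices(["train", "dev", "test"], weights=[8, 1, 1])[0]
        splits[part].extend(exs)
    return splits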
In the semantic odd man out (SOMO) task, we
modified sentences by replacing a random noun
or verb $o$ with another noun or verb $r$. To make
the task more challenging, the bigrams formed by
the replacement with the previous and following
words in the sentence have frequencies that are
comparable (on a log-scale) with those of the orig-
inal bigrams. That is, if the original sentence con-
tains bigrams $w_{n-1}\,o$ and $o\,w_{n+1}$, the corresponding
bigrams $w_{n-1}\,r$ and $r\,w_{n+1}$ in the modified
sentence will have comparable corpus frequencies.
No sentence is included in both original and modi-
fied format, and no replacement is repeated across
train/dev/test sets. The task of the classifier is to
tell whether a sentence has been modified or not.
An example modified sentence is: “No one could
see this Hayes and I wanted to know if it was
real or a spoonful (orig.: ploy).” Note that judg-
ing plausibility of a syntactically well-formed sen-
tence of this sort will often require grasping rather
subtle semantic factors, ranging from selectional
preference to topical coherence.
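A rough reconstruction of the replacement criterion (our sketch; bigram_freq and the tolerance tol are assumptions, not values from the original pipeline) is:

import math

def acceptable_replacement(prev_w, orig, next_w, cand, bigram_freq, tol=1.0):
    """Accept candidate `cand` only if both bigrams it forms with the
    neighbouring words have log-frequencies within `tol` of the originals."""
    pairs = [((prev_w, cand), (prev_w, orig)),   # left bigram: new vs. original
             ((cand, next_w), (orig, next_w))]   # right bigram: new vs. original
    for new_bg, old_bg in pairs:
        f_new, f_old = bigram_freq.get(new_bg, 0), bigram_freq.get(old_bg, 0)
        if f_new == 0 or f_old == 0:
            return False
        if abs(math.log(f_new) - math.log(f_old)) > tol:
            return False
    return True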
The coordination inversion (CoordInv) bench-
mark contains sentences made of two coordinate
clauses. In half of the sentences, we inverted the
order of the clauses. The task is to tell whether
a sentence is intact or modified. Sentences
are balanced in terms of clause length, and no
sentence appears in both original and inverted
versions. As an example, original “They might
be only memories, but I can still feel each one”
becomes: “I can still feel each one, but they might
be only memories.” Often, addressing CoordInv
requires an understanding of broad discourse and
pragmatic factors.
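The inversion itself can be pictured with a toy sketch (our illustration, assuming the two clauses are joined by an explicit coordinator such as ", but"):

def invert_coordination(sentence, conj=", but "):
    """Swap the two coordinate clauses around the conjunction."""
    if conj not in sentence:
        return None  # not a usable two-clause sentence
    left, right = sentence.split(conj, 1)
    punct = right[-1] if right and right[-1] in ".!?" else ""
    right = right.rstrip(".!?")
    # Re-capitalise the new first clause and lower-case the old one.
    return right[0].upper() + right[1:] + conj + left[0].lower() + left[1:] + punct

print(invert_coordination(
    "They might be only memories, but I can still feel each one."
))  # -> I can still feel each one, but they might be only memories.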
The Hum. Eval. row of Table 2 reports human-
validated “reasonable” upper bounds for all the
tasks, estimated in different ways, depending on
the tasks. For the surface ones, there is always a
straightforward correct answer that a human an-
notator with enough time and patience could find.
The upper bound is thus estimated at 100%. The
TreeDepth, TopConst, Tense, SubjNum and Ob-
jNum tasks depend on automated PoS and pars-
ing annotation. In these cases, the upper bound
is given by the proportion of sentences correctly
annotated by the automated procedure. To esti-
mate this quantity, one linguistically-trained au-
thor checked the annotation of 200 randomly sam-
pled test sentences from each task. Finally, the
BShift, SOMO and CoordInv manipulations can
accidentally generate acceptable sentences. For