基于图神经网络的派生形态学生成模型

需积分: 12 10 浏览量更新于2024-09-02 收藏 1.2MB PDF 举报

本文主要探讨了"基于图网络的派生形态学模型"（A Graph Auto-encoder Model of Derivational Morphology），这是一个在自然语言处理（NLP）领域内的研究，特别是与计算语言学（ACL）相关的主题。形态学是语言学的一个核心分支，关注词汇形式如何由基本单元（如词根、词缀）组合而成。阿龙夫（Aronoff, 1976）提出了一个著名的目标：理解说话者能够创造什么样的新词，这与形态学的范畴密切相关，即形态学的正确性或有效性（Morphological Well-Formedness, MWF）。一直以来，模型化衍生词的MWF被认为是一个复杂且具有挑战性的任务（Bauer, 2019）。作者们提出了一种创新的图网络自动编码器模型，该模型旨在学习能捕捉到词缀与词干在生成过程中兼容性信息的嵌入。值得注意的是，这个模型能够在英语等语言中展现出令人惊讶的效果，它不仅结合了语法和语义信息，还利用了来自心理词典的关联信息来增强对MWF的建模能力。具体来说，图自动编码器的工作原理可能是这样的：首先，它将单词的构成部分（词缀和词干）构建成一个图结构，其中节点代表这些成分，边则表示它们之间的关系，比如规则的组合方式或预期的兼容性。然后，通过训练，模型能够学习到每个节点的嵌入向量，这些向量蕴含了关于它们在生成过程中的行为和限制的潜在特征。在训练过程中，模型会尝试通过重构输入的词缀和词干组合，以此来判断它们是否符合MWF规则。如果重构结果接近原始输入，那么模型就认为这个组合是有效的；反之，则可能揭示出潜在的错误或不规则现象。这种自动编码器架构允许模型从大量的实际和潜在的衍生词中学习到模式，并用于预测或解释新词的生成可能性。这项研究对于提升我们理解和生成复杂语言结构的能力具有重要意义，尤其是在处理派生词法多样性方面。通过结合多源信息，这个模型为解决MWF问题提供了一个新颖且有效的方法，为未来的语言模型设计和自然语言生成任务开辟了新的路径。

... This is a supernice

superminecraft game. I love

the nicesque afﬁrmation of

minecraftesque. ...

minecraft

affirm

nice

$ation

$esque

super$

TRAINING

super$nice 3

affirm$esque 7

nice$esque 3

minecraft$ation 7

super$affirm 7

minecraft$esque 3

TEST

super$minecraft 0.9

affirm$ation 0.6

nice$ation 0.1

Figure 2: Experimental setup. We extract DGs from Reddit and train link prediction models on them. In the shown

toy example, the derivatives super$minecraft and affirm$ation are held out for the test set.

ation of a derivative corresponds to a new link be-

tween a stem and a derivational pattern in the DG,

which in turn reﬂects the inclusion of a new word

into a lexical cluster with a shared derivational pat-

tern in the mental lexicon.

3 Experimental Data

3.1 Corpus

We base our study on data from the social media

platform Reddit.

Reddit is divided into so-called

subreddits (SRs), smaller communities centered

around shared interests. SRs have been shown

to exhibit community-speciﬁc linguistic properties

(del Tredici and Fern

andez, 2018).

We draw upon the Baumgartner Reddit Cor-

pus, a collection of publicly available comments

posted on Reddit since 2005.

The preprocess-

ing of the data is described in Appendix A.1.

We examine data in the SRs r/cfb (cfb – college

football), r/gaming (gam), r/leagueoﬂegends (lol),

r/movies (mov), r/nba (nba), r/nﬂ (nﬂ), r/politics

(pol), r/science (sci), and r/technology (tec) be-

tween 2007 and 2018. These SRs were chosen

because they are of comparable size and are among

the largest SRs (see Table 1). They reﬂect three

distinct areas of interest, i.e., sports (cfb, nba, nﬂ),

entertainment (gam, lol, mov), and knowledge (pol,

sci, tec), thus allowing for a multifaceted view on

how topical factors impact MWF: seeing MWF as

an emergent property of the mental lexicon entails

that communities with different lexica should differ

in what derivatives are most likely to be created.

3.2 Morphological Segmentation

Many morphologically complex words are not de-

composed into their morphemes during cognitive

processing (Sonnenstuhl and Huth, 2002). Based

on experimental ﬁndings in Hay (2001), we seg-

ment a morphologically complex word only if the

stem has a higher token frequency than the deriva-

reddit.com

files.pushshift.io/reddit/comments

SR n

|S| |A| |E|

cfb 475,870,562 522,675 10,934 2,261 46,110

nba 898,483,442 801,260 13,576 3,023 64,274

nﬂ 911,001,246 791,352 13,982 3,016 64,821

gam 1,119,096,999 1,428,149 19,306 4,519 107,126

lol 1,538,655,464 1,444,976 18,375 4,515 104,731

mov 738,365,964 860,263 15,740 3,614 77,925

pol 2,970,509,554 1,576,998 24,175 6,188 143,880

sci 277,568,720 528,223 11,267 3,323 58,290

tec 505,966,695 632,940 11,986 3,280 63,839

Table 1: SR statistics. n

: number of tokens; n

: num-

ber of types; |S|: number of stem nodes; |A|: number

of afﬁx group nodes; |E|: number of edges.

tive (in a given SR). Segmentation is performed

by means of an iterative afﬁx-stripping algorithm

introduced in Hofmann et al. (2020) that is based

on a representative list of productive preﬁxes and

sufﬁxes in English (Crystal, 1997). The algorithm

is sensitive to most morpho-orthographic rules of

English (Plag, 2003): when

$ness

is removed

from

happi$ness

, e.g., the result is

happy

, not

happi. See Appendix A.2. for details.

The segmented texts are then used to create DGs

as described in Section 2.3. All processing is done

separately for each SR, i.e., we create a total of

nine different DGs. Figure 2 illustrates the general

experimental setup of our study.

4 Models

Let

be a Bernoulli random variable denoting

the property of being morphologically well-formed.

We want to model

P (W |d, C

) = P (W |s, a, C

)

i.e., the probability that a derivative

consisting of

stem

and afﬁx group

is morphologically well-

formed according to SR corpus C

Given the established properties of derivational

morphology (see Section 2), a good model of

P (W |d, C

)

should include both semantics and

formal structure,

P (W |d, C

) = P (W |m

, f

, m

, f

, C

), (1)

where

, are meaning and form (here

剩余11页未读，继续阅读

weixin_39257188

粉丝: 0
资源: 1

基于图神经网络的派生形态学生成模型

Python-GraphAutoEncoders在TensorFlow中实现图形自动编码器

Variational-Graph-Auto-Encoders:这是2016年贝叶斯深度学习NIPS研讨会上论文``变图自动编码器''的实施

graph auto-encoder

variational graph auto-encoders

A Co-Saliency Model of Image Pairs

MATLAB Plot Gallery - Graph Plot 2:Create a graph plot-matlab开发

Graph-Models-of-Infrastructures-and-the-Robustnes_The Power_grap

A Versatile Graph Structure for Edge-Oriented Graph Algorithms - 1987 (Ebert1987AVD)-计算机科学

graphdb-docker-master.zip

orientdb-graphdb-2.2.0-beta2.jar

最新资源