First-order
[wp]_h, [wp]_d, d(h, d)
[wp]_h, d(h, d)
w_d, p_d, d(h, d)
[wp]_d, d(h, d)
w_h, p_h, w_d, p_d, d(h, d)
p_h, w_h, p_d, d(h, d)
w_h, w_d, p_d, d(h, d)
w_h, p_h, [wp]_d, d(h, d)
p_h, p_b, p_d, d(h, d)
p_h, p_{h+1}, p_{d−1}, p_d, d(h, d)
p_{h−1}, p_h, p_{d−1}, p_d, d(h, d)
p_h, p_{h+1}, p_d, p_{d+1}, d(h, d)
p_{h−1}, p_h, p_d, p_{d+1}, d(h, d)

Second-order
p_h, p_d, p_c, d(h, d, c)
w_h, w_d, w_c, d(h, d, c)
p_h, [wp]_c, d(h, d, c)
p_d, [wp]_c, d(h, d, c)
w_h, [wp]_c, d(h, d, c)
w_d, [wp]_c, d(h, d, c)
[wp]_h, [wp]_{h+1}, [wp]_c, d(h, d, c)
[wp]_{h−1}, [wp]_h, [wp]_c, d(h, d, c)
[wp]_h, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_h, [wp]_c, [wp]_{c+1}, d(h, d, c)
[wp]_{h−1}, [wp]_h, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_h, [wp]_{h+1}, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_{h−1}, [wp]_h, [wp]_c, [wp]_{c+1}, d(h, d, c)
[wp]_h, [wp]_{h+1}, [wp]_c, [wp]_{c+1}, d(h, d, c)
[wp]_d, [wp]_{d+1}, [wp]_c, d(h, d, c)
[wp]_{d−1}, [wp]_d, [wp]_c, d(h, d, c)
[wp]_d, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_d, [wp]_c, [wp]_{c+1}, d(h, d, c)
[wp]_d, [wp]_{d+1}, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_d, [wp]_{d+1}, [wp]_c, [wp]_{c+1}, d(h, d, c)
[wp]_{d−1}, [wp]_d, [wp]_{c−1}, [wp]_c, d(h, d, c)
[wp]_{d−1}, [wp]_d, [wp]_c, [wp]_{c+1}, d(h, d, c)

Table 1: Base feature templates.
The base feature templates are listed in Table 1, where h and d refer to the head and the dependent, respectively; c refers to d's sibling or child; b refers to a word between h and d; +1 (−1) refers to the next (previous) word; w and p refer to the surface word and the part-of-speech tag, respectively; [wp] refers to the surface word or the part-of-speech tag; d(h, d) is the direction of the dependency relation between h and d; and d(h, d, c) is the direction of the relations among h, d, and c.
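To make the templates concrete, the following is a minimal sketch of how a few of the first-order templates could be instantiated as feature strings for a head-dependent pair. The `Token` class, the function name, and the string format are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: instantiating some first-order templates of
# Table 1. Names and string formats are hypothetical.
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str  # surface word (w)
    pos: str   # part-of-speech tag (p)

def extract_first_order(sent: List[Token], h: int, d: int) -> List[str]:
    """Generate feature strings for head index h and dependent index d."""
    direction = "L" if d < h else "R"  # d(h, d)
    feats = [
        # template: w_h, w_d, p_d, d(h, d)
        f"w_h={sent[h].word}|w_d={sent[d].word}|p_d={sent[d].pos}|{direction}",
        # template: p_h, w_h, p_d, d(h, d)
        f"p_h={sent[h].pos}|w_h={sent[h].word}|p_d={sent[d].pos}|{direction}",
    ]
    # template: p_h, p_b, p_d, d(h, d) -- one feature per word b
    # between h and d
    lo, hi = sorted((h, d))
    for b in range(lo + 1, hi):
        feats.append(
            f"p_h={sent[h].pos}|p_b={sent[b].pos}|p_d={sent[d].pos}|{direction}")
    return feats
```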
We train a parser with the base features and use it as the Baseline parser. Defining f_b(x, g) as the base features and w_b as the corresponding weights, the scoring function becomes

score(x, g) = f_b(x, g) · w_b    (2)
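Because the base features are sparse and binary, the dot product in Eq. (2) reduces to summing the weights of the features that fire. A minimal sketch, assuming a hypothetical feature-to-weight map:

```python
# Sketch of Eq. (2) for sparse binary features: the score is the sum
# of the weights of the active features. `weights` is hypothetical.
from typing import Dict, List

def score(features: List[str], weights: Dict[str, float]) -> float:
    # Unseen features contribute zero weight.
    return sum(weights.get(f, 0.0) for f in features)
```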
3 Feature Embeddings
Our goal is to reduce the sparseness of rich features by learning a distributed representation of features that is dense and low-dimensional. We call this distributed feature representation feature embeddings. In the representation, each dimension represents a hidden class of the features and is expected to capture a type of similarity or shared property among the features.
The key to learning embeddings is to make use of information from a local context, and to this end various methods have been proposed for learning word embeddings. Lin (1997) and Curran (2005) use the counts of words in a surrounding window to represent the distributed meaning of words. Brown et al. (1992) use bigrams to cluster words hierarchically. These methods have been shown to be effective on words. However, the number of features is much larger than the vocabulary size, which makes it infeasible to apply them to features. Another line of research induces word embeddings using neural language models (Bengio, 2008). However, the training of neural language models is too slow for the high dimensionality of features. Mikolov et al. (2013b) and Mikolov et al. (2013a) introduce efficient methods to learn high-quality word embeddings directly from large amounts of unstructured raw text. Since the methods do not involve dense matrix multiplications, training is extremely fast.
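For concreteness, here is a hedged sketch of a single skip-gram update with negative sampling in the style of Mikolov et al. (2013b); all names, hyperparameters, and simplifications (e.g., negatives drawn uniformly rather than from a unigram distribution) are illustrative assumptions, not the paper's implementation. Note that each step touches only a handful of embedding rows, which is why no dense matrix multiplication is needed.

```python
# Sketch of one skip-gram step with negative sampling (SGNS).
# W_in and W_out are the input/output embedding matrices.
import numpy as np

def sgns_step(center: int, context: int,
              W_in: np.ndarray, W_out: np.ndarray,
              rng: np.random.Generator,
              k: int = 5, lr: float = 0.025) -> None:
    """Update embeddings in place for one (center, context) pair."""
    v = W_in[center]
    # One observed (positive) target plus k random negatives.
    targets = [context] + list(rng.integers(0, W_out.shape[0], size=k))
    labels = [1.0] + [0.0] * k
    grad_v = np.zeros_like(v)
    for t, y in zip(targets, labels):
        u = W_out[t]
        p = 1.0 / (1.0 + np.exp(-(v @ u)))  # sigmoid(v . u)
        g = lr * (y - p)                    # gradient of log-loss
        grad_v += g * u
        W_out[t] += g * v
    W_in[center] += grad_v
```

In the adaptation described next, the role of the word pairs would be played by feature pairs extracted from parsed trees.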
We adapt the models of Mikolov et al. (2013b) and Mikolov et al. (2013a) to learn feature embeddings from large amounts of automatically parsed dependency trees. Since learning feature embeddings has a high computational cost, we also use the negative sampling technique at the learning stage (Mikolov et al., 2013b). Unlike word embeddings, the input of our approach is features rather than words, and the feature representations are generated from tree structures instead of word sequences. Consequently,