cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

Shaosheng Cao¹,² and Wei Lu² and Jun Zhou¹ and Xiaolong Li¹
¹ AI Department, Ant Financial Services Group
² Singapore University of Technology and Design
{shaosheng.css, jun.zhoujun, xl.li}@antfin.com
luwei@sutd.edu.sg
Abstract
We propose cw2vec, a novel method for learning Chinese word embeddings. It is based on our observation that exploiting stroke-level information is crucial for improving the learning of Chinese word embeddings. Specifically, we design a minimalist approach to exploit such features, by using stroke n-grams, which capture semantic and morphological level information of Chinese words. Through qualitative analysis, we demonstrate that our model is able to extract semantic information that cannot be captured by existing methods. Empirical results on the word similarity, word analogy, text classification and named entity recognition tasks show that the proposed approach consistently outperforms state-of-the-art approaches such as word-based word2vec and GloVe, character-based CWE, component-based JWE and pixel-based GWE.
1. INTRODUCTION
Word representation learning has recently received a significant amount of attention in the field of natural language processing (NLP). Unlike traditional one-hot representations for words, low-dimensional distributed word representations, also known as word embeddings, are able to better capture the semantics of natural language words. Such representations were shown to be useful in downstream NLP tasks such as text classification (Conneau et al. 2016; Xu et al. 2016), named entity recognition (Turian, Ratinov, and Bengio 2010; Collobert et al. 2011; Sun et al. 2015), machine translation (Devlin et al. 2014; Meng et al. 2015; Jean et al. 2015) and measuring semantic textual similarity (Shen et al. 2014; Wieting et al. 2015). It is therefore vital to design methods for learning word representations that can capture word semantics well.
Existing approaches concentrated only on learning such representations based on contextual information (Mikolov et al. 2010; 2013b; Ling et al. 2015; Levy and Goldberg 2014; Pennington, Socher, and Manning 2014), where words are regarded as atomic tokens. Recently, researchers have also started looking into incorporating subword-level information to better capture word semantics (Bian, Gao, and Liu 2014; Cotterell and Schütze 2015; Bojanowski et al. 2016; Cao and Lu 2017). While these approaches were shown to be effective, they largely focused on European languages such as English, Spanish and German, which employ the Latin script in their writing systems. The methods developed are therefore not directly applicable to languages such as Chinese, which employs a completely different writing system.

Figure 1: Radicals vs. components vs. stroke n-grams
In Chinese, each word typically consists of fewer characters than an English word¹, and each character conveys rich semantic information (Wieger 1915; Liu et al. 2010). Given the rich internal structures of Chinese words and characters, approaches that exploit character-level information (Chen et al. 2015) have been proposed for learning Chinese word embeddings. However, is such information sufficient for properly capturing the semantic information of words? Does there exist other useful information that can be extracted from words and characters to better model the semantics of words?
For internal structural information of words, we argue that characters alone are not sufficient for capturing the semantic information. For instance, the two words “木材 (timber)” and “森林 (forest)” are semantically closely related. However, “木材 (timber)” is composed of the two characters “木 (wood)” and “材 (material)”, while “森林 (forest)” is made up of “森 (forest)” and “林 (jungle)”. If only character-level information is considered, there is no information that is shared across these two words, as they consist of distinct characters.
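To make this concrete, the following minimal sketch (ours, not the authors' released implementation) contrasts character overlap with stroke n-gram overlap for the two words above. The stroke codes use the common five-way stroke classification (1 horizontal, 2 vertical, 3 left-falling, 4 right-falling/dot, 5 turning); both the stroke dictionary and the n-gram window range here are illustrative assumptions, not values given in this section.

    # Hypothetical stroke dictionary; a real system would cover all characters.
    # All four characters contain the stroke sequence of 木 ("1234").
    STROKES = {
        "木": "1234",          # horizontal, vertical, left-falling, right-falling
        "材": "1234123",       # 木 followed by 才
        "森": "123412341234",  # three 木
        "林": "12341234",      # two 木
    }

    def stroke_ngrams(word, n_min=3, n_max=12):
        """Concatenate the stroke codes of a word's characters, then slide
        windows of size n_min..n_max over the resulting sequence."""
        seq = "".join(STROKES[ch] for ch in word)
        return {seq[i:i + n] for n in range(n_min, n_max + 1)
                for i in range(len(seq) - n + 1)}

    w1, w2 = "木材", "森林"
    print(set(w1) & set(w2))                  # set(): no shared characters
    shared = stroke_ngrams(w1) & stroke_ngrams(w2)
    print(len(shared), sorted(shared)[:5])    # dozens of shared stroke n-grams

Running the sketch prints an empty character intersection but many shared stroke n-grams, which is exactly the shared signal that purely character-level models miss.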
While certain manually defined rules for extracting subword information, such as radicals² (Sun et al. 2014; Li et al. 2015; Yin et al. 2016) and components (Xin and Song 2017), can be exploited, such information might be incomplete.

¹ Most modern Chinese words consist of only one or two characters (Chen, Liang, and Liu 2015).
² A component primarily used for indexing characters.