cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

Shaosheng Cao¹,² and Wei Lu² and Jun Zhou¹ and Xiaolong Li¹
¹ AI Department, Ant Financial Services Group
² Singapore University of Technology and Design
{shaosheng.css, jun.zhoujun, xl.li}@antfin.com
luwei@sutd.edu.sg
Abstract
We propose cw2vec, a novel method for learning Chinese word embeddings. It is based on our observation that exploiting stroke-level information is crucial for improving the learning of Chinese word embeddings. Specifically, we design a minimalist approach to exploit such features, by using stroke n-grams, which capture semantic and morphological level information of Chinese words. Through qualitative analysis, we demonstrate that our model is able to extract semantic information that cannot be captured by existing methods. Empirical results on the word similarity, word analogy, text classification and named entity recognition tasks show that the proposed approach consistently outperforms state-of-the-art approaches such as word-based word2vec and GloVe, character-based CWE, component-based JWE and pixel-based GWE.
1. INTRODUCTION
Word representation learning has recently received a significant amount of attention in the field of natural language processing (NLP). Unlike traditional one-hot representations for words, low-dimensional distributed word representations, also known as word embeddings, are able to better capture the semantics of natural language words. Such representations were shown to be useful in downstream NLP tasks such as text classification (Conneau et al. 2016; Xu et al. 2016), named entity recognition (Turian, Ratinov, and Bengio 2010; Collobert et al. 2011; Sun et al. 2015), machine translation (Devlin et al. 2014; Meng et al. 2015; Jean et al. 2015) and measuring semantic textual similarity (Shen et al. 2014; Wieting et al. 2015). It is therefore vital to design methods for learning word representations that can capture word semantics well.
Existing approaches concentrated only on learning such representations based on contextual information (Mikolov et al. 2010; 2013b; Ling et al. 2015; Levy and Goldberg 2014; Pennington, Socher, and Manning 2014), where words are regarded as atomic tokens. Recently, researchers have also started looking into incorporating subword-level information to better capture word semantics (Bian, Gao, and Liu 2014; Cotterell and Schütze 2015; Bojanowski et al. 2016; Cao and Lu 2017). While these approaches were shown to be effective, they largely focused on European languages such as English, Spanish and German, which employ the Latin script in their writing systems. The methods developed are therefore not directly applicable to languages such as Chinese, which employs a completely different writing system.

Figure 1: Radicals vs. components vs. stroke n-grams
In Chinese, each word typically consists of fewer characters than an English word¹, and each character conveys rich semantic information (Wieger 1915; Liu et al. 2010). Given the rich internal structures of Chinese words and characters, approaches that exploit character-level information (Chen et al. 2015) have been proposed for learning Chinese word embeddings. However, is such information sufficient for properly capturing the semantic information of words? Does there exist other useful information that can be extracted from words and characters to better model the semantics of words?
For internal structural information of words, we argue that characters alone are not sufficient for capturing the semantic information. For instance, the two words “木材 (timber)” and “森林 (forest)” are semantically closely related. However, “木材 (timber)” is composed of the two characters “木 (wood)” and “材 (material)”, while “森林 (forest)” is made up of “森 (forest)” and “林 (jungle)”. If only character-level information is considered, there is no information that is shared across these two words, as they consist of distinct characters.
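To make this concrete, the following minimal sketch (ours, not the authors' released implementation) contrasts character overlap with stroke n-gram overlap for the two words above. The stroke codes use the common five-way stroke classification (1 horizontal, 2 vertical, 3 left-falling, 4 right-falling/dot, 5 turning); both the stroke dictionary and the n-gram window range here are illustrative assumptions, not values given in this section.

    # Hypothetical stroke dictionary; a real system would cover all characters.
    # All four characters contain the stroke sequence of 木 ("1234").
    STROKES = {
        "木": "1234",          # horizontal, vertical, left-falling, right-falling
        "材": "1234123",       # 木 followed by 才
        "森": "123412341234",  # three 木
        "林": "12341234",      # two 木
    }

    def stroke_ngrams(word, n_min=3, n_max=12):
        """Concatenate the stroke codes of a word's characters, then slide
        windows of size n_min..n_max over the resulting sequence."""
        seq = "".join(STROKES[ch] for ch in word)
        return {seq[i:i + n] for n in range(n_min, n_max + 1)
                for i in range(len(seq) - n + 1)}

    w1, w2 = "木材", "森林"
    print(set(w1) & set(w2))                  # set(): no shared characters
    shared = stroke_ngrams(w1) & stroke_ngrams(w2)
    print(len(shared), sorted(shared)[:5])    # dozens of shared stroke n-grams

Running the sketch prints an empty character intersection but many shared stroke n-grams, which is exactly the shared signal that purely character-level models miss.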
While certain manually defined rules for extracting subword information, such as radicals² (Sun et al. 2014; Li et al. 2015; Yin et al. 2016) and components (Xin and Song 2017), can be exploited, such information might be incomplete.

¹ Most modern Chinese words consist of only one or two characters (Chen, Liang, and Liu 2015).
² A component primarily used for indexing characters.