Tweet Normalization with Syllables
Ke Xu
School of Software Eng.
Beijing U. of Posts & Telecom.
Beijing 100876, China
xxukez2@gmail.com
Yunqing Xia
STCA
Microsoft
Beijing 100084, China
yxia@microsoft.com
Chin-Hui Lee
School of Electr. & Comp. Eng.
Georgia Institute of Technology
Atlanta, GA 30332-0250, USA
chl@ece.gatech.edu
Abstract
In this paper, we propose a syllable-based
method for tweet normalization to study
the cognitive process of non-standard
word creation in social media. Assuming
that syllables play a fundamental role in
forming non-standard tweet words,
we choose the syllable as the basic unit and
extend the conventional noisy channel
model by incorporating the syllables to
represent the word-to-word transitions
at both word and syllable levels. The
syllables are used in our method not
only to suggest more candidates, but also
to measure similarity between words.
The novelty of this work is three-fold. First,
to the best of our knowledge, this is an
early attempt to explore syllables in tweet
normalization. Second, our proposed
normalization method relies on unlabeled
samples, making it much easier to adapt
our method to handle non-standard words
in any period of history. Third, we
conduct a series of experiments and show
that the proposed method is advantageous
over state-of-the-art solutions for tweet
normalization.
1 Introduction
Due to the casual nature of social media, a large
number of non-standard words appear in its text,
making it substantially different from formal
written text. It is reported in (Liu et al., 2011)
that more than 4 million distinct out-of-vocabulary
(OOV) tokens are found in the Edinburgh Twitter
corpus (Petrovic et al., 2010). This variation poses
challenges when
performing natural language processing (NLP)
tasks (Sproat et al., 2001) based on such texts.
Tweet normalization, aiming at converting these
OOV non-standard words into their in-vocabulary
(IV) formal forms, is therefore viewed as a very
important pre-processing task.
Researchers have studied tweet normalization
at different levels. A character-level tagging
system is used in (Pennell and Liu, 2010) to
handle deletion-based abbreviations. It was further
extended in (Liu et al., 2012) using more characters
instead of Y or N as labels. The character-level
machine translation (MT) approach (Pennell and
Liu, 2011) was modified in (Li and Liu, 2012a)
to operate on character blocks. A string edit distance
method was introduced in (Contractor et al., 2010)
to represent word-level similarity, and this
orthographic feature has been adopted in (Han and
Baldwin, 2011) and (Yang and Eisenstein, 2013).
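As an illustration of the word-level similarity mentioned above, the following is a minimal sketch of the standard Levenshtein edit distance. It is not necessarily the exact formulation used in the cited work; the function name `edit_distance` and the example words are assumptions made for illustration only.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning source into target."""
    # previous[j] is the distance between the already processed prefix
    # of source and target[:j]; initially that prefix is empty.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]
```

Under this definition, `edit_distance("tmr", "tomorrow")` evaluates to 5, because the three letters of tmr occur in order within tomorrow and the five remaining letters must be inserted. A purely orthographic measure of this kind treats every character alike and carries no information about how the non-standard word was formed.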
Challenges are encountered at each of these
levels of tweet normalization. In the character-level
sequential labeling systems, features are required
for every character and their combinations,
introducing much more noise into the later reverse
table look-up process (Liu et al., 2012). In the
character-block level MT systems, an equal number of
blocks and corresponding phonetic symbols
is required for alignment (Li and Liu, 2012b).
This strict restriction can make training set
construction very difficult and cause a loss of useful
information. Finally, word-level normalization
methods cannot properly model how non-standard
words are formed, and some patterns or consistencies
within words can be omitted or altered.
We observe the following cognitive process: given
a non-standard word like tmr, people tend to first
segment it into syllables like t-m-r. They then
find the corresponding standard word
with syllables like to-mor-row. Inspired by
this cognitive observation, we propose a syllable-based
tweet normalization method, in which
non-standard words are first segmented into syllables.
Since we cannot predict the writer's deterministic
intention in using tmr as a segmentation of tm-r