Syllable-Based Tweet Normalization

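Summary: "Tweet Normalization with Syllables" and its application to social media

Social media platforms such as Twitter have become an important channel for everyday communication. The non-standard words used on these platforms (abbreviations, misspellings, and creative spellings, for example) pose challenges for natural language processing. The research paper "Tweet Normalization with Syllables" proposes a new method that aims to model how such non-standard words are created, and thereby to support the understanding and analysis of social media text. The authors are Ke Xu of the School of Software Engineering at Beijing University of Posts and Telecommunications, Yunqing Xia of Microsoft STCA, and Chin-Hui Lee of the School of Electrical and Computer Engineering at the Georgia Institute of Technology. The paper was published in 2015 in the proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, a venue that reflects the field's interest in social media text processing.

The core of the paper is a syllable-based normalization method. The authors assume that syllables play a fundamental role in forming non-standard tweet words. They therefore take the syllable as the basic unit and extend the conventional noisy channel model with syllables, so that word-to-word transitions are represented at both the word level and the syllable level. Syllables serve two purposes in this method: they suggest more candidates, and they provide a measure of similarity between words, which helps identify and correct non-standard tweet words more accurately.

To apply the method, the researchers first preprocess the Twitter data and identify non-standard words. They then analyze syllable patterns to build a syllable-to-syllable transition model that mimics the mental process users follow when coining new words. This model is used to generate possible standard forms, giving several candidate normalizations for each non-standard word. Finally, the candidates are scored and the most suitable normalization is selected, which improves semantic understanding and the accuracy of downstream analysis.

The paper may also cover experimental design and performance evaluation, including tests on benchmark datasets and comparisons with other normalization techniques. Through these experiments the authors may demonstrate the advantages of their method on non-standard tweet words, for example higher precision, recall, or F1 score.

Overall, the paper offers a novel and promising perspective on understanding and processing non-standard language on social media. Using syllables makes the model more flexible and adaptable, and helps natural language processing systems handle data from Twitter and similar platforms more efficiently and accurately. This matters for future social media analysis tools, sentiment analysis algorithms, and automatic information extraction systems.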

Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 920–928, Beijing, China, July 26-31, 2015.

2015 Association for Computational Linguistics

Tweet Normalization with Syllables

Ke Xu
School of Software Eng.
Beijing U. of Posts & Telecom.
Beijing 100876, China
xxukez2@gmail.com

Yunqing Xia
STCA
Microsoft
Beijing 100084, China
yxia@microsoft.com

Chin-Hui Lee
School of Electr. & Comp. Eng.
Georgia Institute of Technology
Atlanta, GA 30332-0250, USA
chl@ece.gatech.edu

Abstract

In this paper, we propose a syllable-based method for tweet normalization to study the cognitive process of non-standard word creation in social media. Assuming that the syllable plays a fundamental role in forming non-standard tweet words, we choose the syllable as the basic unit and extend the conventional noisy channel model by incorporating syllables to represent word-to-word transitions at both the word and syllable levels. Syllables are used in our method not only to suggest more candidates, but also to measure similarity between words. The novelty of this work is three-fold. First, to the best of our knowledge, this is an early attempt to explore syllables in tweet normalization. Second, our proposed normalization method relies on unlabeled samples, making it much easier to adapt the method to non-standard words from any period of history. Third, we conduct a series of experiments showing that the proposed method is advantageous over state-of-the-art solutions for tweet normalization.
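For orientation, the conventional noisy channel model mentioned in the abstract can be written as below, where s is an observed non-standard word, w ranges over a vocabulary V of standard words, P(w) is a language-model prior, and P(s | w) is the channel model. The second line is only an illustrative assumption about what a syllable-level channel could look like when s and w are split into the same number n of aligned syllables (as in t-m-r and to-mor-row). The symbols σ and n are introduced here for illustration, and this is not the paper's own equation.

```latex
% Conventional noisy channel model (standard formulation)
\hat{w} = \arg\max_{w \in V} P(w \mid s) = \arg\max_{w \in V} P(s \mid w)\, P(w)

% Hypothetical syllable-level factorization (an assumption, not taken from the paper)
P(s \mid w) \approx \prod_{i=1}^{n} P\!\left(\sigma^{s}_{i} \mid \sigma^{w}_{i}\right)
```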

1 Introduction

Due to the casual nature of social media, text on these platforms contains a large number of non-standard words, which makes it substantially different from formal written text. Liu et al. (2011) report that more than 4 million distinct out-of-vocabulary (OOV) tokens are found in the Edinburgh Twitter corpus (Petrovic et al., 2010). This variation poses challenges for natural language processing (NLP) tasks (Sproat et al., 2001) based on such text. Tweet normalization, which aims to convert these OOV non-standard words into their in-vocabulary (IV) formal forms, is therefore viewed as a very important pre-processing task.
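The OOV/IV distinction above can be illustrated with a minimal sketch. The tiny lexicon `IV_WORDS`, the function name `find_oov_tokens`, and the letters-only tokenizer are all assumptions made for this example. A real system would use a full dictionary and a tweet-aware tokenizer.

```python
import re

# Toy in-vocabulary (IV) lexicon, assumed for illustration only.
IV_WORDS = {"see", "you", "tomorrow", "at", "the", "game"}

def find_oov_tokens(tweet, vocabulary=IV_WORDS):
    """Return the lowercase alphabetic tokens of `tweet` that are not in `vocabulary`."""
    # Simplified tokenization: keep runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return [tok for tok in tokens if tok not in vocabulary]

print(find_oov_tokens("See u tmr at the game"))  # ['u', 'tmr']
```

The tokens flagged here are the ones a normalization step would then try to map back to IV forms.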

Researchers have studied tweet normalization at different levels. A character-level tagging system is used in (Pennell and Liu, 2010) to handle deletion-based abbreviations. It was further extended in (Liu et al., 2012) by using more characters, instead of Y or N, as labels. The character-level machine translation (MT) approach (Pennell and Liu, 2011) was modified in (Li and Liu, 2012a) to operate on character blocks. A string edit distance method was introduced in (Contractor et al., 2010) to represent word-level similarity, and this orthographic feature has been adopted in (Han and Baldwin, 2011) and (Yang and Eisenstein, 2013).
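As a reference point for the word-level similarity mentioned above, here is a minimal sketch of the standard Levenshtein edit distance. It is a generic textbook implementation, not the specific formulation used in any of the cited papers, and the candidate words in the usage lines are chosen only for illustration.

```python
def edit_distance(source, target):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning `source` into `target`."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(edit_distance("tmr", "tomorrow"))  # 5
print(edit_distance("tmr", "timer"))     # 2
```

With these two candidates, plain edit distance places timer closer to tmr than tomorrow, which shows why orthographic similarity alone can be a weak signal for heavily abbreviated words.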

Challenges are encountered at each of these levels of tweet normalization. In character-level sequential labeling systems, features are required for every character and their combinations, which introduces much more noise into the later reverse table look-up process (Liu et al., 2012). In character-block level MT systems, an equal number of blocks and corresponding phonetic symbols is required for alignment (Li and Liu, 2012b). This strict restriction can make training set construction very difficult and can lose useful information. Finally, word-level normalization methods cannot properly model how non-standard words are formed, and some patterns or consistencies within words can be omitted or altered.

We observe the following cognitive process: given a non-standard word like tmr, people tend to first segment it into syllables like t-m-r. They then find the corresponding standard word with syllables like to-mor-row. Inspired by this cognitive observation, we propose a syllable-based tweet normalization method, in which non-standard words are first segmented into syllables. Since we cannot predict the writer's deterministic intention in using tmr as a segmentation of tm-r
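The ambiguity between segmentations such as t-m-r and tm-r can be made concrete by listing every way to cut a short word into contiguous pieces. The sketch below is a brute-force enumeration written for illustration. The function name `segmentations` is hypothetical, and this is not presented as the segmentation procedure the paper uses.

```python
from itertools import combinations

def segmentations(word):
    """Yield every way to split `word` into contiguous, non-empty pieces."""
    n = len(word)
    for k in range(n):  # k is the number of cut points, from 0 to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]

for pieces in segmentations("tmr"):
    print("-".join(pieces))
# tmr
# t-mr
# tm-r
# t-m-r
```

A word of length n has 2^(n-1) such splits, so any practical method needs a way to score or prune the candidates rather than treat them all as equally likely.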
