• The categorization must work reliably in spite of textual errors.
• The categorization must be efficient, consuming as little storage and processing time as possible, because of the sheer volume of documents to be handled.
• The categorization must be able to recognize when a given document does not match any category, or when it falls between two categories. This is because category boundaries are almost never clear-cut.
In this paper we will cover the following topics:
• Section 2.0 introduces N-grams and N-gram-based similarity measures.
• Section 3.0 discusses text categorization using N-gram frequency statistics.
• Section 4.0 discusses testing N-gram-based text categorization on a language classification task.
• Section 5.0 discusses testing N-gram-based text categorization on a computer newsgroup classification task.
• Section 6.0 discusses some advantages of N-gram-based text categorization over other possible approaches.
• Section 7.0 gives some conclusions, and indicates directions for further work.
2.0 N-Grams
An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character (“_”) to represent blanks.) Thus, the word “TEXT” would be composed of the following N-grams:
bi-grams: _T, TE, EX, XT, T_
tri-grams: _TE, TEX, EXT, XT_, T__
quad-grams: _TEX, TEXT, EXT_, XT__, T___
In general, a string of length k, padded with blanks, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.
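As an illustration, the following minimal sketch (our own, not the authors' implementation) generates padded N-grams in a way consistent with the “TEXT” example above, using one leading blank and N-1 trailing blanks:

```python
def ngrams(word, n):
    """Return the contiguous N-grams of `word`, padded with blanks.

    Blanks are written as underscores, matching the paper's notation:
    one leading blank and (n - 1) trailing blanks, so a word of
    length k yields k + 1 N-grams for every n.
    """
    padded = "_" + word + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Example: the word "TEXT"
print(ngrams("TEXT", 2))  # ['_T', 'TE', 'EX', 'XT', 'T_']
print(ngrams("TEXT", 3))  # ['_TE', 'TEX', 'EXT', 'XT_', 'T__']
print(ngrams("TEXT", 4))  # ['_TEX', 'TEXT', 'EXT_', 'XT__', 'T___']
```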
N-gram-based matching has had some success in dealing with noisy ASCII input in other problem domains, such as in interpreting postal addresses ([1] and [2]), in text retrieval ([3] and [4]), and in a wide variety of other natural language processing applications [5]. The key benefit that N-gram-based matching provides derives from its very nature: since every string is decomposed into small parts, any errors that are present tend to affect only a limited number of those parts, leaving the remainder intact. If we count the N-grams that are common to two strings, we get a measure of their similarity that is resistant to a wide variety of textual errors.
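To make the error-resistance concrete, here is a small self-contained sketch (again our own illustration, not the profile-based measure developed in Section 3.0) that counts the N-grams shared by two strings:

```python
from collections import Counter

def padded_ngrams(text, n):
    """Contiguous N-grams of `text`, padded with blanks (underscores)."""
    padded = "_" + text + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def shared_ngram_count(a, b, sizes=(2, 3, 4)):
    """Number of N-grams common to both strings, counted as a multiset."""
    profile_a, profile_b = Counter(), Counter()
    for n in sizes:
        profile_a.update(padded_ngrams(a, n))
        profile_b.update(padded_ngrams(b, n))
    return sum((profile_a & profile_b).values())

# A single transposition disturbs only the N-grams that overlap it,
# so the misspelled form still shares most N-grams with the original.
print(shared_ngram_count("CATEGORIZATION", "CATEGORIZATION"))  # perfect match
print(shared_ngram_count("CATEGORIZATION", "CATEGORIZATOIN"))  # still high
```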
3.0 Text Categorization Using N-Gram Frequency Statistics
Human languages invariably have some words which occur more frequently than others. One of the most common ways of expressing this idea has become known as Zipf’s Law [6], which we can re-state as follows:

The nth most common word in a human language text occurs with a frequency inversely proportional to n.
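Expressed as a formula (our own paraphrase of the restatement above, with f(1) denoting the frequency of the most common word):

f(n) \;\propto\; \frac{1}{n}, \qquad \text{i.e.} \qquad f(n) \approx \frac{f(1)}{n}

so the second most common word occurs roughly half as often as the first, the third roughly a third as often, and so on.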
The implication of this law is that there is always a set of words which dominates most of the other words of the language in terms of frequency of use. This is true both of words in general, and of words that are specific to a particular subject. Furthermore, there is a smooth continuum of dominance from most frequent to least. The smooth nature of the frequency curves helps us in some ways, because it implies that we do not have to worry too much about specific frequency thresholds. This same law holds, at least