Error-Tolerant N-Gram Text Categorization: High-Accuracy Handling of Multilingual and Computer-Oriented Newsgroups

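N-gram-based text categorization is a key document-processing technique for the large-scale automated handling of electronic documents. The paper, written by William B. Cavnar and John M. Trenkle of the Environmental Research Institute of Michigan, examines how an N-gram approach copes with the challenges of text categorization, particularly for documents that contain textual errors such as spelling and grammatical mistakes in e-mail and character recognition errors from OCR systems.

The N-gram approach is statistical: it breaks a sequence of words or characters into fixed-length slices, such as unigrams (N=1), bigrams (N=2), or trigrams (N=3). This is useful in text analysis because the slices capture local context patterns, which help distinguish one topic or category from another.

The authors describe an N-gram-based text categorization system designed to be highly tolerant of textual errors. The system is small, fast, and robust. On Usenet newsgroup articles written in different languages it achieved a 99.8% correct classification rate. It was also applied to articles from computer-oriented newsgroups, where it reached an 80% correct classification rate despite the diversity of subjects.

The strength of the approach is that it works not only on clean, error-free text but also tolerates a degree of textual noise. The paper therefore offers a practical categorization strategy that is relevant to text mining, natural language processing, and information retrieval.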

approximately, for other aspects of human languages. In particular, it is true for the frequency of occurrence of N-grams, both as inflection forms and as morpheme-like word components which carry meaning. (See Figure 1 for an example of a Zipfian distribution of N-gram frequencies from a technical document.) Zipf’s Law implies that classifying documents with N-gram frequency statistics will not be very sensitive to cutting off the distributions at a particular rank. It also implies that if we are comparing documents from the same category they should have similar N-gram frequency distributions.
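One way to read the truncation argument is to measure how much of the total N-gram mass the top-ranked entries carry. The sketch below does this for an idealized Zipfian distribution. The function name `coverage_at_rank`, the constant 1000, and the 2000-item vocabulary are illustrative assumptions, not values from the paper.

```python
def coverage_at_rank(frequencies, k):
    """Fraction of all occurrences accounted for by the k most frequent items."""
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Idealized Zipfian frequencies (an assumption for illustration):
# the item at rank r occurs about C / r times, here with C = 1000.
zipf = [1000 / r for r in range(1, 2001)]

# Show how the covered share grows as the cutoff rank increases.
for k in (100, 300, 500):
    print(k, round(coverage_at_rank(zipf, k), 3))
```

Because the head of such a distribution holds most of the occurrences, moving the cutoff rank changes the covered share only gradually.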

We have built an experimental text categorization system that uses this idea. Figure 2 illustrates the overall data flow for the system. In this scheme, we start with a set of pre-existing text categories (such as subject domains) for which we have reasonably sized samples, say, of 10K to 20K bytes each. From these, we would generate a set of N-gram frequency profiles to represent each of the categories. When a new document arrives for classification, the system first computes its N-gram frequency profile. It then compares this profile against the profiles for each of the categories using an easily calculated distance measure. The system classifies the document as belonging to the category having the smallest distance.
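The classification step described above can be sketched as follows. This excerpt does not define the distance measure, so the sketch assumes a simple rank-displacement ("out-of-place") distance between ranked profiles, with a fixed penalty for N-grams missing from the category profile. The function names and the toy profiles are hypothetical.

```python
def out_of_place_distance(doc_profile, category_profile):
    """Sum of rank differences between two ranked N-gram lists.

    Assumption: an N-gram absent from the category profile costs a fixed
    maximum penalty equal to the length of that profile.
    """
    rank_in_category = {gram: r for r, gram in enumerate(category_profile)}
    max_penalty = len(category_profile)
    return sum(
        abs(r - rank_in_category[gram]) if gram in rank_in_category else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def classify(doc_profile, category_profiles):
    """Return the name of the category whose profile is closest to the document."""
    return min(
        category_profiles,
        key=lambda name: out_of_place_distance(doc_profile, category_profiles[name]),
    )


# Toy ranked profiles, most frequent N-gram first (made up for illustration).
categories = {
    "english": ["e", "t", "th", "he"],
    "spanish": ["e", "a", "de", "la"],
}
document = ["t", "e", "th"]

print(classify(document, categories))  # distances: english = 2, spanish = 9
```

Profiles are plain ranked lists here, so any profile generator that outputs N-grams in descending frequency order could feed this step.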

3.1 Generating N-Gram Frequency Profiles

The bubble in Figure 2 labelled “Generate Profile” is very simple. It merely reads incoming text, and counts the occurrences of all N-grams. To do this, the system performs the following steps:

• Split the text into separate tokens consisting only of letters and apostrophes. Digits and punctuation are discarded. Pad the token with sufficient blanks before and after.

• Scan down each token, generating all possible N-grams, for N=1 to 5. Use positions that span the padding blanks, as well.

• Hash into a table to find the counter for the N-gram, and increment it. The hash table uses a conventional collision handling mechanism to ensure that each N-gram gets its own counter.

• When done, output all N-grams and their counts.

• Sort those counts into reverse order by the number of occurrences. Keep just the N-grams themselves, which are now in reverse order of frequency.
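The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The padding rule (one leading blank and N-1 trailing blanks, shown as `_`), the skipping of N-grams made only of padding, the alphabetical tie-break, and the names `tokenize` and `ngram_profile` are all assumptions. A `Counter` stands in for the hash table.

```python
import re
from collections import Counter

PAD = "_"  # visible stand-in for a padding blank


def tokenize(text):
    # Keep runs of letters and apostrophes; digits and punctuation are dropped.
    return re.findall(r"(?:[^\W\d_]|')+", text)


def ngram_profile(text, max_n=5, top_k=None):
    """Return N-grams (N = 1..max_n) ordered from most to least frequent."""
    counts = Counter()  # hash table: one counter per distinct N-gram
    for token in tokenize(text):
        for n in range(1, max_n + 1):
            # Assumed padding: one leading blank, n - 1 trailing blanks.
            padded = PAD + token + PAD * (n - 1)
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip(PAD):  # skip N-grams that are only padding
                    counts[gram] += 1
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    grams = [gram for gram, _ in ranked]
    return grams[:top_k] if top_k else grams


print(ngram_profile("TEXT", max_n=2))
```

The optional `top_k` argument truncates the profile at a chosen rank, which is one way to apply the cutoff that the Zipf's Law discussion says the method tolerates.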

FIGURE 1. N-Gram Frequencies By Rank In A Technical Document
[Plot not reproduced. x-axis: N-Gram Rank (0 to 500); y-axis: N-Gram Frequency (up to 2000).]
