A Novel Topic Model for Automatic Term Extraction
Sujian Li¹, Jiwei Li¹, Tao Song¹, Wenjie Li², Baobao Chang¹
¹ Key Laboratory of Computational Linguistics (Peking University), Ministry of Education,
School of Electronics Engineering and Computer Science, CHINA
² The Innovative Intelligent Computing Center,
The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, CHINA
{lisujian, bdlijiwei, stao, chbb}@pku.edu.cn; cswjli@comp.polyu.edu.hk
ABSTRACT
Automatic term extraction (ATE) aims at extracting domain-
specific terms from a corpus of a certain domain. Termhood is one
essential measure for judging whether a phrase is a term. Previous
research on termhood has mainly depended on word frequency
information. In this paper, we propose to compute termhood
based on semantic representation of words. A novel topic model,
namely i-SWB, is developed to map the domain corpus into a
latent semantic space, which is composed of some general topics,
a background topic and a documents-specific topic. Experiments
on four domains demonstrate that our approach outperforms the
state-of-the-art ATE approaches.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis
and Indexing – linguistic processing, Thesauruses.
General Terms
Algorithms, Experimentation.
Keywords
Term Extraction, Topic Model, Termhood.
1. INTRODUCTION
So far, most research on automatic term extraction has been
guided by two essential measures defined by [6], namely unithood
and termhood. Unithood examines syntactic formation of terms or
the degree (or significance) of the association among the term
constituents. Termhood, on the other hand, aims to capture the
semantic relatedness of a term to a domain concept. However,
there is no uniform definition of semantic relatedness, and
how to compute termhood remains an open problem.
Previous research has attempted to measure termhood by
applying several statistical measures within a domain or across
domains, such as TF-IDF, C-value/NC-value [5], co-occurrence
[4] and inter-domain entropy [2]. These statistical measures often
ignore the informative words with very high frequency or very
low frequency and do not take into account the semantics carried
by terms. Take the term “NRZ electrical input” in the electrical
engineering domain for example: “NRZ” occurs in only a few
documents, while “electrical” occurs frequently in many documents.
When TF-IDF is used to measure termhood, both “NRZ” and
“electrical” receive low scores, which in turn causes the term
“NRZ electrical input” to have low termhood. Clearly,
frequency-based measures shut many real terms out. In fact, a
domain is described semantically from various aspects. Again,
take the electrical engineering domain for example. Words like
“input” emphasize a specific topic in the domain, while words
like “electrical” provide the
background of that domain. There also exists a cluster of words
like “NRZ” which occur in the corpus infrequently, but tend to
occur in a few documents frequently. Such words can reflect
some special characteristics of the domain. Based on these
observations, we argue that three semantic aspects can be used in
the representation of words: Domain background words (e.g.
electrical) describe the domain in general. Domain topic words
(e.g. input) represent a certain topic in a given domain. Domain
documents-specific words (e.g. NRZ) are specific to a small
number of documents and exhibit the characteristics of the
domain. We assume that a term can be recognized by identifying
whether its constituent words belong to some of the three
semantic aspects.
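The high-frequency failure mode described above can be made concrete: a word appearing in every document receives an inverse document frequency of zero, no matter how domain-relevant it is. The following is a minimal sketch on a hypothetical toy corpus (it illustrates only the high-document-frequency side of the problem):

```python
import math
from collections import Counter

# Hypothetical toy corpus: "electrical" is a background word appearing in
# every document; "NRZ" is concentrated in a single document.
docs = [
    "electrical input signal design".split(),
    "electrical circuit design power".split(),
    "electrical power supply circuit".split(),
    "NRZ NRZ NRZ electrical input".split(),
]

N = len(docs)
tf = Counter()   # corpus-level term frequency
df = Counter()   # document frequency
for d in docs:
    tf.update(d)
    df.update(set(d))

def tfidf(w):
    # Corpus-level TF-IDF: total frequency weighted by inverse document frequency.
    return tf[w] * math.log(N / df[w])

# "electrical" occurs in all 4 documents, so idf = log(4/4) = 0 and its
# score is 0, even though it is an informative domain background word.
print(tfidf("electrical"))   # 0.0
print(tfidf("NRZ"))
```

Under this scoring, no amount of raw frequency can rescue a word that is spread across the whole corpus, which is exactly the background-word case the three-aspect view is meant to handle.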
As for semantic representation of words, unsupervised topic
models have shown their advantages [1] [3]. Latent Dirichlet
Allocation (LDA) is a well-known example of such models. It
posits that each document can be seen as a mixture of latent topics
and each topic as a distribution over a given vocabulary. To
trade off generality and specificity of words, Chemudugunta et al.
[3] further defined the special words with background (SWB)
model that allowed words to be modeled as originating from
general topics, or document-specific topics, or a corpus-wide
background topic. This existing work shows that topic models are
well suited to the semantic representation of words. However, to
our knowledge, no prior work has introduced such kind of
semantic representation to term extraction.
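The SWB-style generative story can be sketched as follows. This is an illustration, not the authors' implementation: a per-token switch variable chooses whether a word is drawn from a corpus-wide background distribution, from one of the general topics (via the document's topic mixture), or from a document-specific distribution. All distributions and the vocabulary here are hypothetical.

```python
import random

random.seed(7)

vocab = ["electrical", "input", "signal", "NRZ", "jitter"]
background = [0.6, 0.1, 0.1, 0.1, 0.1]           # corpus-wide background topic
topics = [[0.05, 0.45, 0.45, 0.025, 0.025]]      # general topics (one shown)
doc_specific = [0.05, 0.05, 0.05, 0.45, 0.4]     # this document's special words

def generate_token(theta, switch=(0.3, 0.5, 0.2)):
    # switch: probabilities of (background, general, document-specific)
    x = random.choices(["background", "general", "specific"], weights=switch)[0]
    if x == "background":
        dist = background
    elif x == "general":
        # Pick a general topic z from the document's mixture theta,
        # then draw the word from that topic's distribution.
        z = random.choices(range(len(topics)), weights=theta)[0]
        dist = topics[z]
    else:
        dist = doc_specific
    return random.choices(vocab, weights=dist)[0]

doc = [generate_token(theta=[1.0]) for _ in range(10)]
print(doc)
```

In SWB the switch is a simple multinomial per token; the i-SWB model described next replaces this with a mechanism conditioned on document frequency and topic information.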
Inspired by Chemudugunta’s idea of generality and specificity [3],
in this paper we propose a novel topic model, namely i-SWB, to
model the three suggested semantic aspects. In i-SWB, three
kinds of topics, namely a background topic, general topics, and
a documents-specific topic, are correspondingly constructed to
generate the words in a domain corpus. Compared with
Chemudugunta’s SWB model, there are two main improvements
in i-SWB that tailor it to term extraction. First, specificity in
i-SWB is modeled at the corpus level: a single documents-specific
topic is set to identify a cluster of idiosyncratic words from the
whole corpus. Thus, i-SWB avoids the computational burden of
SWB, where the number of document-specific topics
grows linearly with the number of documents. Second, i-SWB
makes use of both document frequency (DF) and topic
information to control the generation of words, while SWB only
uses a simple multinomial variable to control which topic a word
is generated from. This improvement comes from the following
findings that have been verified in the experiments: the words
occurring in many documents and distributed over many general
SIGIR’13, July 28–August 1, 2013, Dublin, Ireland.
Copyright © 2013 ACM 978-1-4503-2034-4/13/07…$15.00.