BD-ACI-BCA:
$$w_u = \frac{1+\log(f_{u,i})}{(1-s)+s\,W_i/\bar{W}_i}\,\log\!\left(1+\frac{f^m_u}{f^d_u}\right) \tag{3}$$

AB-AFD-BAA (Okapi):
$$w_u = \frac{f_{u,i}}{f_{u,i}+\tau_i/\bar{\tau}_i}\,\log\!\left(1+\frac{n}{f^d_u}\right) \tag{4}$$

BI-ACI-BCA:
$$w_u = \frac{1+\log(f_{u,i})}{(1-s)+s\,W_i/\bar{W}_i}\left(1-\frac{n_u}{\log_2(n)}\right) \tag{5}$$

Lnu.ltu (SMART):
$$w_u = \frac{\left(1+\log(f_{u,i})\right)/\left(1+\log(\bar{f}_{u,i})\right)}{(1-s)+s\,\tau_i/\bar{\tau}_i}\,\log\!\left(\frac{n}{f^d_u}\right) \tag{6}$$
where $f_{u,i}$ is the term frequency of the $u$th word associated with the $i$th document, $f^d_u$ is the document frequency of term $u$, $f^m_u$ is the largest $f^d_u$ over all $u$, $W_i$ is the document vector $\ell_2$ norm, i.e., $W_i = \|x_i\|_2$, $\bar{W}_i$ is the average $W_i$ in the entire dataset, $\tau_i$ and $\bar{\tau}_i$ are the number of unique terms in document $i$ and the average number of unique terms, respectively, $s$ is a slope parameter (set to 0.7 [19], [28]), and $n_u$ is a noise measure of term $u$ [27], [28]. The NORM weighting was recently used in [20], [21], and [23]; the other four schemes, which are well-known weighting methods, were used in [19] and [28].
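To make these schemes concrete, the following minimal sketch (ours, not the authors' code; it assumes the frequency statistics have already been collected, and all function and argument names are illustrative) computes the BD-ACI-BCA weight of (3) and the Okapi weight of (4) for a single term:

```python
import math

def bd_aci_bca_weight(f_ui, f_d_u, f_m_u, W_i, W_bar, s=0.7):
    """BD-ACI-BCA weight of (3) for term u in document i.

    f_ui  : term frequency of term u in document i (assumed >= 1)
    f_d_u : document frequency of term u
    f_m_u : largest document frequency over all terms
    W_i   : l2 norm of the document vector x_i
    W_bar : average l2 norm over the entire dataset
    s     : slope parameter (0.7, as in the paper)
    """
    tf_part = 1.0 + math.log(f_ui)
    norm_part = (1.0 - s) + s * W_i / W_bar
    idf_part = math.log(1.0 + f_m_u / f_d_u)
    return tf_part / norm_part * idf_part

def okapi_weight(f_ui, f_d_u, tau_i, tau_bar, n):
    """AB-AFD-BAA (Okapi) weight of (4) for term u in document i.

    tau_i   : number of unique terms in document i
    tau_bar : average number of unique terms per document
    n       : number of documents in the collection
    """
    tf_part = f_ui / (f_ui + tau_i / tau_bar)
    idf_part = math.log(1.0 + n / f_d_u)
    return tf_part * idf_part
```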
C. Dimensionality Reduction
A document set can be represented by $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, a rectangular matrix of terms and documents. The goal of latent semantic analysis is to produce a set $Y$ that represents $X$ accurately but resides in a lower dimensional space. $Y$ is of dimension $d$, with $d \ll m$, and it is produced by the form
$$Y = V_g^T X \tag{7}$$
where $V_g$ is an $m \times d$ linear transformation matrix. Thus, it is straightforward to replace each document $x_i$ by its projection $y_i = V_g^T x_i$, which makes between- and within-document comparisons easy in the lower dimensional latent semantic space.
There are a number of ways to accomplish this projection. The transformation matrix $V_g$ can be obtained by traditional techniques such as PCA, LSI, or other dimensionality reduction approaches [3]. In this study, we use classical PCA, a well-known dimensionality reduction technique, to determine $V_g$. In PCA, $V_g$ is determined by maximizing the variance of the projected vectors
$$\max_{V_g} \sum_{i=1}^{n} \left\| y_i - \frac{1}{n}\sum_{j=1}^{n} y_j \right\|_2^2. \tag{8}$$
It has been shown that the matrix $V_g$ is the set of eigenvectors of the sample covariance matrix associated with the $d$ largest eigenvalues. Keep this in mind, as we will use this set of global representations $\{y_1, y_2, \ldots, y_n\}$ to formulate a hybrid similarity of two documents (see Section VI).
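As a concrete illustration of (7) and (8), the following minimal NumPy sketch (ours, not the authors' implementation; names are illustrative) computes $V_g$ as the top-$d$ eigenvectors of the sample covariance matrix and projects every document:

```python
import numpy as np

def pca_projection(X, d):
    """Compute V_g (m x d) and the projected documents Y = V_g^T X of (7).

    X : m x n term-document matrix (columns are documents x_i)
    d : target dimensionality, with d << m
    """
    mean = X.mean(axis=1, keepdims=True)        # average document vector
    Xc = X - mean                               # center the documents
    cov = Xc @ Xc.T / (X.shape[1] - 1)          # m x m sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues, ascending order
    V_g = eigvecs[:, ::-1][:, :d]               # eigenvectors of d largest
    Y = V_g.T @ X                               # projection of (7)
    return V_g, Y
```

For a large vocabulary, forming the $m \times m$ covariance matrix explicitly is costly; a truncated SVD of the centered matrix yields the same leading eigenvectors more cheaply.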
IV. WORD AFFINITY GRAPH
This section introduces a scheme to produce an in-depth document representation. First, we segment each document into paragraphs. Second, we build a word affinity graph, which describes the local information of each document.
A. Document Segmentation
As mentioned before, the major drawback of traditional modeling methods such as PCA and LSI is that they lack a description of term associations and spatial distribution information over the reduced space. In this study, we propose a new document representation that contains this description. First, each document is segmented into paragraphs. Since we consider only HTML documents in this paper, a Java platform was developed to implement the segmentation. For documents in HTML format, we can use the HTML tags to identify paragraphs easily. Before document segmentation, we first filter out the formatted text that appears within the HTML tags; this text is not counted toward word counts or document features. The overall document partitioning process can be summarized as follows [20], [23].
1) Partition a document into blocks using the HTML tags: "<p>," "<br/>," "<li>," "</td>," etc.
2) Merge subsequent blocks to form a new paragraph until the total number of words of the merged blocks exceeds a paragraph threshold (set at 50).
3) Merge a new block with the previous paragraph if its total number of words does not exceed the minimum threshold (set at 30).
For HTML documents, it is noted that there is no rule on the minimum/maximum number of words for paragraphs [20]. Setting thresholds on word counts, however, still enables us to control the number of paragraphs in each document flexibly and to remove blocks that contain only a few words (e.g., titles) by attaching them to the real paragraph blocks, as in the sketch below. It is worth pointing out that we could further partition each paragraph into sentences by marking periods (the tag "\.") to form a finer structure so that more semantics can be included.
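A minimal sketch of steps 1)-3) follows (the tags and thresholds come from the list above; the regex-based splitting and helper names are our simplification, not the Java platform used in the study, and a production implementation would use a real HTML parser):

```python
import re

PARAGRAPH_THRESHOLD = 50   # step 2): words needed to close a paragraph
MINIMUM_THRESHOLD = 30     # step 3): smallest acceptable standalone block

def segment_html(html):
    """Split an HTML document into paragraphs following steps 1)-3)."""
    # Step 1): split into blocks on paragraph-like tags.
    blocks = re.split(r'(?i)<p[^>]*>|<br\s*/?>|<li[^>]*>|</td>', html)
    # Strip any remaining markup so tags are not counted as words.
    blocks = [re.sub(r'(?s)<[^>]+>', ' ', b).split() for b in blocks]
    blocks = [b for b in blocks if b]

    paragraphs, current = [], []
    for block in blocks:
        # Step 2): keep merging blocks until the paragraph threshold is met.
        current.extend(block)
        if len(current) >= PARAGRAPH_THRESHOLD:
            paragraphs.append(current)
            current = []
    # Step 3): attach a trailing undersized block to the previous paragraph.
    if current:
        if paragraphs and len(current) < MINIMUM_THRESHOLD:
            paragraphs[-1].extend(current)
        else:
            paragraphs.append(current)
    return [' '.join(p) for p in paragraphs]
```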
B. Word Affinity Graph
Building a word affinity graph for each document represents the frequency of term cooccurrence in a paragraph. Consider a graph denoted by a matrix $G_i \in \mathbb{R}^{m \times m}$, in which each element $g_{i,u,v}$ ($u, v = 1, 2, \ldots, m$) is defined by
$$g_{i,u,v} = \begin{cases} F_{u,v} \cdot \log_2(n/DF_{u,v}) / \|G_i\|_2, & u \neq v \\ f^t_u \cdot \log_2(n/f^d_u) / \|G_i\|_2, & u = v \end{cases} \tag{9}$$
where $\|\cdot\|_2$ is the Frobenius norm, $F_{u,v}$ is the frequency of cooccurrence in a paragraph associated with the terms $u$ and $v$ in the $i$th document, $DF_{u,v}$ is the document frequency with which the terms $u$ and $v$ coappear in a document, and the notations $f^t_u$ and $f^d_u$ are as described in (1).
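As a concrete illustration of (9), here is a minimal sketch (ours; it assumes the paragraph cooccurrence counts and document frequencies have been gathered beforehand, and all array names are illustrative):

```python
import numpy as np

def word_affinity_graph(F, tf, DF, df, n):
    """Build the normalized word affinity matrix G_i of (9) for one document.

    F  : m x m array, F[u, v] = cooccurrence count of terms u and v in a
         paragraph of document i
    tf : length-m array of term frequencies f^t_u in document i
    DF : m x m array, DF[u, v] = number of documents where u and v coappear
    df : length-m array of document frequencies f^d_u
    n  : number of documents in the collection
    """
    G = np.zeros_like(F, dtype=float)
    mask = DF > 0
    # Off-diagonal entries (u != v): cooccurrence weighted by a pairwise IDF.
    G[mask] = F[mask] * np.log2(n / DF[mask])
    # Diagonal entries (u == v): term frequency weighted by the ordinary IDF.
    diag = np.zeros(len(tf))
    dmask = df > 0
    diag[dmask] = tf[dmask] * np.log2(n / df[dmask])
    np.fill_diagonal(G, diag)
    # Normalize by the Frobenius norm, as in (9).
    norm = np.linalg.norm(G, 'fro')
    return G / norm if norm > 0 else G
```

Note that if we do not consider term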