Weighting Tags and Paths in XML Documents According to Topic Generalization

Research paper


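This research paper, published in an Elsevier journal, studies how to assign weights to the tags and paths of XML documents according to how well they generalize the document topic. XML (Extensible Markup Language) is a standard for marking up electronic documents and is widely used for data exchange and storage. The authors evaluate and quantify how strongly a tag or path relates to the topic of a document and assign weights accordingly, so that the data can be better understood and organized. The author team includes researchers from Jiangxi University of Finance and Economics, the Jiangxi Key Laboratory of Data and Knowledge Engineering, the Hong Kong University of Science and Technology, and the University of Montreal.

The authors may have used text mining, information retrieval, or natural language processing techniques to analyze the semantic association between nodes and paths in XML documents, together with statistical methods or machine learning algorithms to determine the weights. In practice, topic-generalization weights of this kind matter for information retrieval, content filtering, document clustering, and knowledge graph construction. In a search engine, for example, highly weighted tags and paths can improve the relevance and precision of results. In data analysis, they help extract key information and build a hierarchical view of a document that is easier to understand and navigate.

The steps involved in the paper may include the following:

1. Data preprocessing: clean and normalize the XML documents, remove noise, and extract tag and path information.
2. Feature extraction: compute the semantic similarity of each tag and path, for example with word vectors or TF-IDF.
3. Topic modeling or clustering: apply LDA (Latent Dirichlet Allocation) or another topic modeling technique to identify latent topics in the documents.
4. Weight computation: assign weights to tags and paths based on their relevance and importance to the topic.
5. Evaluation: verify experimentally how the weight assignment affects document understanding and analysis performance.

This summary does not give the detailed algorithm or implementation, so readers need to consult the original paper for the methodology and experimental design. The paper offers a new perspective and a practical tool for managing and analyzing XML documents, with both theoretical value and practical applications in text processing.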


example, in the IEEECS collection, abbreviations are generally used in tags (e.g., atl) and their meanings (e.g., abstract) can be hard to guess. In addition, the XML document samples analyzed manually may not cover all distinct tags/paths in the collection. As a result, there is always the risk of missing some tags. Fig. 2 shows the number of distinct tags that can be observed with respect to the number of documents in the IEEECS collection. If 200 documents are analyzed manually, only about 60% of the 189 distinct tags are seen. Covering all the tags therefore requires a great amount of manual work, which is tedious and inefficient. As we mentioned earlier, paths could provide more accurate information than tags on the meaning/role of the elements. If distinct paths have to be analyzed by experts, there will be an even larger number of them, as Fig. 3 shows.
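
As an illustration of the coverage problem described above, the following minimal sketch counts how many distinct tags and root-to-node paths have been seen after each document. It is not the procedure used in the paper: the helper names `tags_and_paths` and `coverage_curve` and the toy element names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def tags_and_paths(xml_text):
    """Return the sets of distinct tags and root-to-node paths in one document."""
    root = ET.fromstring(xml_text)
    tags, paths = set(), set()
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        tags.add(node.tag)
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return tags, paths


def coverage_curve(documents):
    """Cumulative number of distinct (tags, paths) seen after each document."""
    seen_tags, seen_paths, curve = set(), set(), []
    for xml_text in documents:
        tags, paths = tags_and_paths(xml_text)
        seen_tags |= tags
        seen_paths |= paths
        curve.append((len(seen_tags), len(seen_paths)))
    return curve


# Toy documents; the element names are hypothetical, not the IEEECS schema.
docs = [
    "<article><fm><atl>A</atl></fm><bdy><p>x</p></bdy></article>",
    "<article><fm><atl>B</atl></fm><bm><bib><atl>C</atl></bib></bm></article>",
]
curve = coverage_curve(docs)
print(curve)
```

Even in this toy case the number of distinct paths grows faster than the number of distinct tags, because the same tag can appear under several parents.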

(2) Subjective interpretation. The manual setting of tag/path weights is strongly influenced by experts’ background knowledge and their understanding of the documents. Depending on the expert, the resulting weights can be very different. This is reflected in the correlation between the weights assigned by different experts. We conducted a simple experiment in which three people working on XML retrieval were asked to assign weights (from 0 to 5.0) to the tags in the IEEECS collection. The Pearson correlation coefficients between the three sets of manual weights are only 0.637, 0.738 and 0.682, with an average of 0.686. For the path weights set on the Wikipedia collection, the correlation coefficients are 0.584, 0.766 and 0.681 respectively, and they decline sharply to 0.556, 0.428 and 0.582 on the IEEECS collection. The results clearly show that it is very difficult to reach consensus about the tag/path weights even among experts, and that this issue becomes more serious when the collection has more complex structures.
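
The agreement analysis above can be reproduced in outline as follows. This is only a sketch: the three weight vectors are invented for illustration (the experts' actual assignments are not given in this excerpt), and `pearson` is a plain textbook implementation of the Pearson correlation coefficient.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical weights (0 to 5.0) given by three experts to the same five tags.
expert_weights = {
    "expert_a": [5.0, 4.0, 1.0, 2.0, 0.5],
    "expert_b": [4.5, 2.0, 2.0, 3.0, 1.0],
    "expert_c": [5.0, 3.5, 0.5, 1.0, 2.0],
}

# One coefficient per pair of experts, then the mean over the three pairs.
pairwise = {
    (a, b): pearson(expert_weights[a], expert_weights[b])
    for a, b in combinations(expert_weights, 2)
}
average = sum(pairwise.values()) / len(pairwise)

# The average reported in the text follows from its three tag coefficients.
reported = [0.637, 0.738, 0.682]
reported_average = round(sum(reported) / len(reported), 3)
print(pairwise, average, reported_average)
```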

(3) Many false alarms. As a collection usually has a large number of XML documents, and in each document the number of tags ranges from tens to hundreds, the experts are likely to misjudge the actual meanings of some tags. For example, the tag atl in the IEEECS collection is found to mean article title. In our experiment, the experts were inclined to assign the greatest weight (5.0) to atl. However, atl is also used to tag the titles of the articles listed in the references. In this case, the experts are misled because they overlook some of the other meanings of the tag. Even if the experts understand all the meanings of a tag, setting a single tag weight remains a challenge because the same tag carries different meanings.
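
The atl example can be made concrete with a short sketch that groups element text once by tag and once by root-to-node path. The document below is a made-up toy (the parent elements `fm`, `bm` and `bib` are hypothetical, not taken from the IEEECS schema), and `group_fields` is a name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_fields(xml_text):
    """Collect each element's own text, keyed by tag and by root-to-node path."""
    root = ET.fromstring(xml_text)
    by_tag, by_path = defaultdict(list), defaultdict(list)

    def visit(node, path):
        text = (node.text or "").strip()
        if text:
            by_tag[node.tag].append(text)
            by_path[path].append(text)
        for child in node:
            visit(child, path + "/" + child.tag)

    visit(root, "/" + root.tag)
    return dict(by_tag), dict(by_path)


doc = """
<article>
  <fm><atl>Weighting tags in XML retrieval</atl></fm>
  <bm><bib><atl>An unrelated cited paper</atl></bib></bm>
</article>
"""
by_tag, by_path = group_fields(doc)
print(by_tag["atl"])
print(sorted(by_path))
```

Keyed by tag, the article title and the cited title fall into the same bucket, so one weight has to serve both roles. Keyed by path, they are separated and could be weighted independently.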

The above observations clearly indicate that a manual setting of tag/path weights is not tractable in large-scale applications. In this paper, we propose an automatic method to determine the weights of tags and paths by taking advantage of the contents of the corresponding fields. The general idea is to assign a large weight to a tag or path if the corresponding field summarizes the contents of the whole document well. This is indeed the criterion implicitly used in human evaluation of tag/path importance. Rather than relying on subjective human judgments, we propose a measure of how well, on average, the content of a field can be generalized to the whole document. This measure, called Average Topic Generalization (ATG), is based on the similarity between the content of the field and that of the whole document.
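
The idea can be sketched as follows. This excerpt does not give the ATG formula, so the code below is an assumption-laden stand-in rather than the authors' model: it uses cosine similarity over raw term counts (the paper proposes its own term weighting), averages over every occurrence of a tag, and the function names and toy documents are invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict
from math import sqrt


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity of two term-count vectors stored as Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def average_topic_generalization(documents):
    """Average field-to-document similarity per tag (assumed stand-in for ATG)."""
    sims = defaultdict(list)
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        doc_vec = Counter(tokenize(" ".join(root.itertext())))
        for node in root.iter():
            if node is root:
                continue  # the root field is the whole document itself
            field_vec = Counter(tokenize(" ".join(node.itertext())))
            if field_vec:
                sims[node.tag].append(cosine(field_vec, doc_vec))
    return {tag: sum(values) / len(values) for tag, values in sims.items()}


# Toy documents with hypothetical tags.
docs = [
    "<article><title>xml retrieval weights</title>"
    "<p>we study xml retrieval and tag weights for xml documents</p>"
    "<ref>a survey of databases</ref></article>",
    "<article><title>tag weighting</title>"
    "<p>tag weighting helps xml search</p>"
    "<ref>old book</ref></article>",
]
atg = average_topic_generalization(docs)
print(atg)
```

On these toy documents the title field scores higher than the reference field, which matches the intuition that titles summarize a document better than bibliography entries do.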

Intuitively, the above approach reflects human assignments well. For example, the title, keywords, abstract, and section title are generally believed to be more important than the paragraph, figure, table, and reference. The former fields are also more closely related to the document topic than the latter, and thus have a higher ATG power. If a query term occurs in the nodes corresponding to title, keywords, abstract, or section title in some documents, these documents or nodes should be considered more relevant to the query than those documents or nodes where the query term occurs in other nodes.
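
One simple way to act on this intuition is to multiply each occurrence of a query term by the weight of the tag it occurs in. The sketch below is an assumed scoring rule for illustration only, not the retrieval model evaluated in the paper; the function name `score`, the weights, and the toy documents are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def score(xml_text, query_term, tag_weights, default_weight=1.0):
    """Sum of tag-weighted occurrences of a lowercase query term.

    Only each element's own leading text is counted; tail text is ignored.
    """
    root = ET.fromstring(xml_text)
    total = 0.0
    for node in root.iter():
        tokens = re.findall(r"[a-z0-9]+", (node.text or "").lower())
        total += tag_weights.get(node.tag, default_weight) * tokens.count(query_term)
    return total


# Hypothetical weights: the title counts five times as much as a paragraph.
weights = {"title": 5.0, "p": 1.0}

doc_title_hit = "<article><title>XML retrieval</title><p>ranking models</p></article>"
doc_body_hit = "<article><title>Ranking models</title><p>an XML case study</p></article>"

score_title = score(doc_title_hit, "xml", weights)
score_body = score(doc_body_hit, "xml", weights)
print(score_title, score_body)
```

Both documents contain the query term exactly once, but the one that has it in the title is ranked higher.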

In the ATG model, an important step is the estimation of topical similarity, which involves term weighting. Instead of using the traditional term weighting methods, we propose a new term weighting strategy that takes into account several features proposed in previous studies.

The approach has been tested on two collections, IEEECS and Wikipedia. On both collections, we observe higher XML retrieval effectiveness than with the traditional keyword-based approach. In addition, the weights determined by the ATG model are found to correlate more strongly with the human assignments than those determined using other criteria.

The main contributions of this paper are summarized as follows:

Fig. 2. The number of observed tags increases with the number of documents in the IEEECS collection.

50 D. Liu et al. / Information Sciences 249 (2013) 48–66
