A Method for Processing Top-N Queries over Text and Numeric Attributes in Relational Databases


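"This article examines how to process relational top-N queries that involve both text and numeric attributes in a relational database. The authors propose a processing method that uses an index built with WordNet to capture the semantic information of text attributes together with the numeric information of numeric attributes; the size of this index grows linearly with the size of the database. Experimental results confirm that the method is effective. Keywords: relational database, top-N query, semantic distance, numeric distance, WordNet."

Relational databases are a core tool for storing and managing data, and the top-N query is a common retrieval strategy that returns the N most relevant results. When a query involves both text attributes (such as a product description) and numeric attributes (such as price or sales volume), traditional ranking methods may fail to capture the complexity and contextual meaning of the data, which makes such queries hard to process.

The solution proposed in the paper is a ranking function that combines semantic distance and numeric distance. Semantic distance assesses relevance by comparing the semantic similarity of the words in text attributes, which typically relies on a lexical network such as WordNet. WordNet is a large lexical database of English that records semantic relations between words, such as synonym sets and hypernym/hyponym relations, so that the semantic similarity of two words can be computed. Numeric distance measures the gap between numeric attribute values, for example the absolute or relative difference between two prices or two sales figures.

To implement the method, the paper proposes building a WordNet-based index. The index holds the original text attributes, the expanded semantic information for those attributes, and the related information for the numeric attributes. A query can therefore be matched on the semantic similarity of the text while also accounting for differences in the numeric attributes, which yields a more precise ranking.

The experimental results show that the method is both efficient and accurate for relational top-N queries. The index grows linearly as the database grows, which indicates that the method scales. A linearly growing index also avoids placing an excessive burden on system performance, which matters for large-scale database applications.

The paper contributes a new processing method that combines semantic understanding of text attributes with numeric comparison of numeric attributes, improving the ability of relational databases to handle complex queries. The method has practical value for search engines, recommender systems, data analysis, and related areas.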

Processing Relational Top-N Queries with Text and Numeric Attributes

Liang Zhu^{1,a}, Bin Liu^{1,b}, Guang Liu^{2,c}, Quanlong Lei^{1,d}

^1 Key Lab of Machine Learning and Computational Intelligence, School of Mathematics and Computer Science, Hebei University, Baoding, Hebei 071002, China

^2 College of Art, Hebei University, Baoding, Hebei 071002, China

^a zhu@hbu.edu.cn; ^b lliubbin163@163.com; ^c lg672@tom.com; ^d lqlfeng@126.com

Keywords: Relational Database, top-N Query, Semantic Distance, Numeric Distance, WordNet.

Abstract. Relational top-N queries over both text attributes and numeric attributes are useful in many applications; they use ranking functions based on semantic distances for text attributes and numeric distances for numeric attributes. In this paper, we propose an approach for processing this type of top-N query in relational databases. The basic idea of the approach is to create an index that uses WordNet to expand the tuple words of text attributes semantically and that stores the related information of numeric attributes, while the size of the index increases linearly with the size of the database. The results of extensive experiments show that our method is efficient and effective.

Introduction

A relational top-N (or ranking) query finds a sorted set of N tuples that are the best, but not necessarily all, answers to the query. Most research involves numeric attributes with numeric ranking functions [1, 2]. However, there are many applications where top-N queries are evaluated by using both text attributes and numeric attributes, as demonstrated in the following example.

Example 1. Consider a database BOOK of used books with schema Books(isbn, title, author, year, publisher, price). A user wants to find a book with a title on “criminal law”, a price of about “$100”, and a year around “2005”, where title is a text attribute with semantics, and price and year are two numeric attributes. Obviously, a book on “symphony” with price = “$100” and year = “2005” is not the desired result for the user. However, another book on “penal code” with price = “$103” and year = “2006” may be what the user needs.
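As a rough illustration of why Example 1 calls for a combined ranking, the sketch below scores the two books with a weighted sum of a semantic distance and two normalised numeric distances. It is not the ranking function of the paper, which is not given in this excerpt: the distance table, the weights, the scales, and the names `SEMANTIC_DISTANCE`, `relative_distance`, `score`, and `top_n` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical semantic distances in [0, 1] between the query phrase and a title topic.
# A real system would derive these from WordNet; the values here are invented.
SEMANTIC_DISTANCE = {
    ("criminal law", "criminal law"): 0.0,
    ("criminal law", "penal code"): 0.1,
    ("criminal law", "symphony"): 1.0,
}


@dataclass
class Book:
    title: str
    price: float
    year: int


def relative_distance(value, target, scale):
    """Numeric distance divided by an assumed scale and capped at 1."""
    return min(abs(value - target) / scale, 1.0)


def score(book, q_title, q_price, q_year, w_text=0.6, w_price=0.2, w_year=0.2):
    """Weighted sum of distances; smaller is better. Weights are assumptions."""
    d_text = SEMANTIC_DISTANCE.get((q_title, book.title), 1.0)
    d_price = relative_distance(book.price, q_price, scale=50.0)
    d_year = relative_distance(book.year, q_year, scale=10.0)
    return w_text * d_text + w_price * d_price + w_year * d_year


def top_n(books, n, q_title, q_price, q_year):
    """Return the n books with the smallest combined distance to the query."""
    return sorted(books, key=lambda b: score(b, q_title, q_price, q_year))[:n]


books = [Book("symphony", 100.0, 2005), Book("penal code", 103.0, 2006)]
ranked = top_n(books, 1, "criminal law", 100.0, 2005)
print(ranked[0].title)
```

With these assumed numbers, the book on "penal code" ranks ahead of the book on "symphony" even though the latter matches the price and year exactly, because the semantic term dominates the score.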

For the type of relational top-N query shown in Example 1, we design a ranking function that combines semantic distances and numeric distances by employing statistics and training methods, and we then create an index to process the queries in terms of semantic and numeric matching in database search. This work is a continuation of the work in [3] and [4]. The work in [3] studied the processing of relational ranking queries with text attributes only, without numeric attributes. The size of the index in [4] does not increase linearly with the size of the database, so it may not be suitable for large databases or for databases with three or more numeric attributes. The method in this paper alleviates these limitations of the approach in [4].

Query Model

Assume that $R(tid, A)$ is a relation/table with identifier $tid$, where $A$ is a text attribute with semantics, and $S(idx, B_1, B_2, \ldots, B_m, \ldots, FKid)$ is another relation with identifier $idx$, where $B_1, B_2, \ldots, B_m$ are $m$ numeric attributes, and $FKid$ is the foreign key referencing $R.tid$. Let $R_{\bowtie} = R \bowtie S$ with $S.FKid = R.tid$.

Let $t$ be a tuple in $R_{\bowtie}$. Then $t[A] = (tw_1, tw_2, \ldots, tw_n)$ is the word-string with $n$ words on the text attribute $A$, and $t[B_j] = b_j$ is the numeric value on the attribute $B_j$ ($1 \le j \le m$). For simplicity, we denote $t = (t_A, b_1, \ldots, b_m)$ where $t_A = (tw_1, tw_2, \ldots, tw_n)$, and call $tw_i$ a tuple word and $b_j$ a tuple value ($1 \le i \le n$, $1 \le j \le m$). As described in [3], for each tuple word $tw$, we can obtain the set $K(tw)$ of all kinship words of $tw$ by WordNet [3, 5], i.e., $K(tw)$ includes the five kinds of words in WordNet: (1) word $tw$

Applied Mechanics and Materials Vols. 490-491 (2014) pp 1326-1329

doi:10.4028/www.scientific.net/AMM.490-491.1326
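The two relations of the Query Model section and their foreign-key join can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the choice of m = 2 numeric attributes (standing in for price and year), the column names `B1` and `B2`, and the sample rows borrowed from Example 1 are not taken from the paper.

```python
import sqlite3

# In-memory database holding R(tid, A) and S(idx, B1, B2, FKid).
# m = 2 numeric attributes is an assumption made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE R (tid INTEGER PRIMARY KEY, A TEXT);
    CREATE TABLE S (idx INTEGER PRIMARY KEY, B1 REAL, B2 REAL,
                    FKid INTEGER REFERENCES R(tid));
    INSERT INTO R VALUES (1, 'penal code'), (2, 'symphony');
    INSERT INTO S VALUES (10, 103, 2006, 1), (11, 100, 2005, 2);
    """
)

# Join R and S on S.FKid = R.tid, as in the query model.
rows = conn.execute(
    "SELECT R.A, S.B1, S.B2 FROM R JOIN S ON S.FKid = R.tid ORDER BY R.tid"
).fetchall()

# Represent each joined tuple as (word-string on A, numeric value 1, numeric value 2),
# where the word-string is the tuple of whitespace-separated tuple words.
tuples = [(tuple(a.split()), b1, b2) for a, b1, b2 in rows]
print(tuples)
```

Splitting the text attribute on whitespace is the simplest possible way to obtain tuple words; a real system would likely also normalise case and remove stop words before looking words up in WordNet.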
