P-index: An Efficient Metadata Retrieval Scheme for Cold Storage Based on Data Lineage


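"P-index: An Efficient Searchable Metadata Indexing Scheme Based on Data Provenance in Cold Storage"

Today's data centers hold large volumes of rarely accessed data, known as cold data. Cloud storage providers usually keep cold data and its metadata on low-cost commodity hardware to store it cost-effectively. Even though this data is accessed infrequently, some storage services must still provide high-performance access to and retrieval of cold data. Conventional metadata is poorly suited to efficient search in this setting, because it may not have been accessed for a long time.

To address this problem, the researchers propose P-index, an efficient searchable metadata index based on data provenance. Its main innovation is to group related files into logical groups according to the provenance relationships between them. This quickly narrows the search scope and greatly improves the retrieval efficiency of cold data. Data provenance is the historical record of how data was generated, processed, and moved; it describes where data came from, how it changed, and how it interacted with other data. Using provenance relationships as the key basis of the index makes the correlations between data items easier to understand and helps locate the required information quickly within a large body of cold data.

The P-index workflow consists of the following key steps:

1. **Provenance analysis**: The system first analyzes the metadata of the cold data, collecting and interpreting the provenance information of each file, including its creation, modification, and access history.
2. **File grouping**: Files with similar or related provenance are clustered into groups. At query time, this grouping avoids unnecessary scans, so only the file sets relevant to the query are examined.
3. **Index construction**: An index structure is built on the provenance relationships. This may involve more complex data structures, such as B-trees, graph indexes, or adaptive index structures, to optimize query performance.
4. **Query optimization**: When a user issues a query, P-index uses provenance information to locate the file groups that may contain the target data and then searches within those groups in detail, reducing the time cost of a full scan.
5. **Dynamic updates**: As new data and operations arrive, P-index must update the index dynamically so that it stays valid and adapts to changes in the data.

P-index not only improves the retrieval efficiency of cold data but also offers a new perspective on data management and search in big-data environments. It is particularly suited to scenarios that require frequent querying and analysis of historical data, such as data analytics, compliance auditing, and data recovery. Because P-index takes the evolution history of data into account, it can also help identify data dependencies and potential anomalies, which improves the understandability and traceability of data.

In short, P-index is an innovative solution to the challenges of cold data storage. By making intelligent use of data provenance, it achieves efficient and precise retrieval of cold data, which matters for optimizing cloud storage services and improving the user experience.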
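
The grouping and query steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a group is a connected component of the provenance graph, that each group is summarized by the minimum and maximum of a single attribute (file size), and that a range query skips any group whose summary cannot overlap the query. The names `ProvenanceGroups`, `build_index`, `range_query`, and the sample data are all hypothetical.

```python
from collections import defaultdict


class ProvenanceGroups:
    """Union-find over file names; files joined by a provenance edge share a group."""

    def __init__(self):
        self.parent = {}

    def find(self, f):
        self.parent.setdefault(f, f)
        while self.parent[f] != f:
            self.parent[f] = self.parent[self.parent[f]]  # path halving
            f = self.parent[f]
        return f

    def link(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def build_index(metadata, provenance_edges):
    """metadata: {file: size}; provenance_edges: [(source, derived)].

    Returns a list of (min_size, max_size, files) tuples, one per group.
    """
    groups = ProvenanceGroups()
    for f in metadata:
        groups.find(f)  # register every file, even isolated ones
    for src, dst in provenance_edges:
        groups.link(src, dst)

    members = defaultdict(list)
    for f in metadata:
        members[groups.find(f)].append(f)

    index = []
    for files in members.values():
        sizes = [metadata[f] for f in files]
        index.append((min(sizes), max(sizes), sorted(files)))
    return index


def range_query(index, metadata, lo, hi):
    """Return (matching files, number of files actually scanned)."""
    hits, scanned = [], 0
    for gmin, gmax, files in index:
        if gmax < lo or gmin > hi:
            continue  # the whole group is outside the range: skip it
        scanned += len(files)
        hits.extend(f for f in files if lo <= metadata[f] <= hi)
    return sorted(hits), scanned


# Hypothetical sample data: file sizes in KB and "derived from" edges.
metadata = {"raw.csv": 900, "clean.csv": 850, "report.pdf": 40,
            "logo.png": 12, "thumb.png": 3}
edges = [("raw.csv", "clean.csv"), ("logo.png", "thumb.png")]
index = build_index(metadata, edges)
```

With this sample data, a query for sizes between 800 and 1000 only needs to scan the group containing `raw.csv` and `clean.csv`; the other groups are pruned by their summaries. A real index would summarize several attributes and would need an update path for new files, which this sketch omits.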

– Evaluation on real-world trace data. We implement a prototype of P-index and evaluate its performance with two kinds of complex queries, range queries and KNN queries. The test results show that P-index improves metadata search performance by 1 to 2 orders of magnitude.
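
For readers unfamiliar with the two query types named in the evaluation, the following sketch shows what a range query and a KNN (k-nearest-neighbor) query over file metadata look like when answered by a plain linear scan. This is only a baseline illustration, not the P-index prototype; the record layout (name, size in KB, age in days), the Euclidean distance, and the function names `range_scan` and `knn_scan` are assumptions made for the example.

```python
import heapq
import math

# Hypothetical metadata records: (file name, size in KB, age in days).
RECORDS = [
    ("a.log", 10, 400),
    ("b.log", 12, 390),
    ("c.mp4", 5000, 30),
    ("d.jpg", 200, 700),
]


def range_scan(records, size_range, age_range):
    """Range query: every file whose attributes all fall inside the given bounds."""
    (s_lo, s_hi), (a_lo, a_hi) = size_range, age_range
    return [name for name, size, age in records
            if s_lo <= size <= s_hi and a_lo <= age <= a_hi]


def knn_scan(records, point, k):
    """KNN query: the k files closest to `point` = (size, age), by Euclidean distance."""
    def dist(rec):
        _, size, age = rec
        return math.hypot(size - point[0], age - point[1])

    return [rec[0] for rec in heapq.nsmallest(k, records, key=dist)]
```

Both functions touch every record, so their cost grows linearly with the number of files. An index such as the one the paper proposes aims to avoid exactly this full scan.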

The rest of the paper is organized as follows. Section 2 presents the research background and our motivations. Section 3 gives an overview of P-index. Section 4 presents the system design and implementation. Section 5 reports extensive experimental results. Section 6 describes related work, and Section 7 concludes the paper.

2 Background and Motivations

In this section, we first present the research background on cold storage systems and then our motivations.

2.1 Cold storage systems

A cold storage system stores cold data at low storage cost and with a correspondingly accepted performance level [1]. Examples include Amazon Glacier [5], Microsoft Pelican [4], and Facebook Cold Data Storage [3]. Cold data is data that must not be lost over the long term and is rarely accessed.

Depending on which requirement they prioritize besides low cost, cold storage systems fall into two categories. The first kind pursues the expected storage life; examples are archive systems and disaster recovery systems. The design of these systems focuses mainly on the reliability of the system and its data. The second kind is more concerned with access speed; examples are online social media systems and several backup systems. These systems need to provide real-time service (a response time of less than three seconds) [1].

With the development of cloud applications, the amount of cold data keeps growing. More than 100 hours of video are uploaded to YouTube every minute, and 2 billion photos are shared across Facebook sites each day [3]. Since most of this data is accessed infrequently, cheap, lower-performance equipment is used for cost-effective storage. Hence, ensuring system performance becomes a great challenge for the second kind of cold storage system mentioned above.

2.2 Motivations

When the data that users need is "cold", metadata search is used to find it. Two methods are used to speed up metadata search. The first improves the efficiency of the index structure, as in Spyglass [6], SmartStore [7], VSFS [8], and so on. With these methods, a system quickly cuts off the branches that do not contain query results. The second collects and stores more information about files, so that users who remember even a little about the files they need can find them easily.

To improve the performance of metadata search, we use not only the current information of files but also their historical information. The provenance data of a file records its historical information. We extract data correlations from
