Knowledge Graph Data Fusion: Practice and Solutions


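"Data fusion for knowledge graphs is the process of integrating information from different sources into one unified knowledge graph, with the aim of resolving inconsistent, duplicated and incomplete data. It relies on key techniques such as entity alignment and entity linking to keep the data accurate and consistent. Entity alignment identifies records in different data sources that represent the same real-world object, while entity linking connects the identified records to form a coherent knowledge network."

The key points of knowledge graph data fusion include:

1. **Data cleaning**: Before fusion, the data from each source is preprocessed (noise removed, errors corrected, formats standardised) to improve data quality.
2. **Entity recognition**: Determine which records in the data represent real-world entities such as people, places or events. This usually involves named entity recognition (NER), which identifies entities through pattern matching, machine learning algorithms and similar methods.
3. **Entity alignment**: Compare and match entities from different data sources to find the correspondences between them. Entity alignment can be implemented with feature-based similarity computation, rule matching or machine learning models.
4. **Attribute alignment**: Besides the entities themselves, their attributes also have to be matched so that attribute meanings and units agree. For example, different databases may record the price of the same product in different currencies, which then need to be converted and standardised.
5. **Entity linking**: Once entity alignment has been established, entities from different sources are connected to form a global entity representation. This step helps eliminate redundant data and establishes semantic relations between entities.
6. **Conflict resolution**: Contradictory or inconsistent information can appear during fusion. Effective conflict detection and resolution strategies are needed, such as majority voting, evidence-based decisions or manual intervention.
7. **Knowledge representation and storage**: The fused data needs to be stored in the knowledge graph in a suitable form (such as RDF or OWL) so that it can be queried and reasoned over. Stores such as Neo4j or Virtuoso can support the storage and retrieval of large-scale knowledge graphs.
8. **Continuous updating and maintenance**: A knowledge graph is not built once and then finished. It has to be maintained as new data arrives and old data changes, so that it stays current and accurate.
9. **Performance optimisation**: Because of the data volume and complexity, knowledge graph data fusion has to take performance into account, for example through parallel processing, indexing techniques and efficient data access strategies.
10. **Privacy and security**: Data fusion must comply with data protection regulations, keep sensitive information secure and prevent unauthorised access and misuse.

Through the steps and techniques above, knowledge graph data fusion can create a rich and consistent knowledge base that supports applications such as intelligent search, recommender systems, question answering systems and data analysis.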
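
As a rough illustration of points 4 and 6 above, the following sketch converts prices to a single currency and then resolves conflicting attribute values by majority voting. The exchange rates, the field names and the `normalize_price`, `majority_vote` and `fuse` helpers are all assumptions made for this example; they do not come from any particular knowledge graph toolkit.

```python
from collections import Counter

# Made-up exchange rates to USD, for illustration only (assumption).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "CNY": 0.125}


def normalize_price(amount, currency):
    """Attribute alignment: express a price in USD, rounded to cents."""
    return round(amount * RATES_TO_USD[currency], 2)


def majority_vote(values):
    """Conflict resolution: most frequent non-missing value (first seen wins ties)."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return Counter(present).most_common(1)[0][0]


def fuse(records):
    """Merge records that are already aligned to one entity into a single record."""
    keys = []
    for record in records:
        for key in record:
            if key not in keys:
                keys.append(key)
    return {key: majority_vote([r.get(key) for r in records]) for key in keys}


# Three hypothetical sources describing the same product in different currencies.
aligned = [
    {"brand": "Canon", "price_usd": normalize_price(300, "USD")},
    {"brand": "Canon", "price_usd": normalize_price(2400, "CNY")},
    {"brand": None, "price_usd": normalize_price(236, "EUR")},
]
print(fuse(aligned))
```

Majority voting is only one of the strategies the list mentions. It treats all sources as equally reliable, which is a simplification; an evidence-based approach would weight sources differently.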

6.13 Practical Considerations and Research Issues
6.14 Further Reading

7 Evaluation of Matching Quality and Complexity
7.1 Overview
7.2 Measuring Matching Quality
7.3 Measuring Matching Complexity
7.4 Clerical Review
7.5 Public Test Data
7.6 Synthetic Test Data
7.7 Practical Considerations and Research Issues
7.8 Further Reading

Part III Further Topics

8 Privacy Aspects of Data Matching
8.1 Privacy and Confidentiality Challenges for Data Matching
8.1.1 Requiring Access to Identifying Information
8.1.2 Sensitive and Confidential Outcomes from Matched Data
8.2 Data Matching Scenarios
8.3 Privacy-Preserving Data Matching Techniques
8.3.1 Exact Privacy-Preserving Matching Techniques
8.3.2 Approximate Privacy-Preserving Matching Techniques
8.3.3 Scalable Privacy-Preserving Matching Techniques
8.4 Practical Considerations and Research Issues
8.5 Further Reading

9 Further Topics and Research Directions
9.1 Geocode Matching
9.2 Matching Unstructured and Complex Data
9.3 Real-time Data Matching
9.4 Matching Dynamic Databases
9.5 Parallel and Distributed Data Matching
9.6 Research Challenges and Directions

10 Data Matching Systems
10.1 Commercial Systems and Checklist
10.2 Research and Open Source Systems
10.2.1 BigMatch
10.2.2 D-Dupe


Chapter 1

Introduction

1.1 Aims and Challenges of Data Matching

Given the ever-increasing amount of data that are being collected, not just by businesses and government organisations but increasingly also by individuals, the past decade has seen strong interest in novel techniques that allow the efficient processing, management and analysis of large data collections. The fields of data warehousing and data mining have gained immense interest in both academia and industry. While data warehousing is concerned with the efficient processing, integration and storage of large amounts of data into clean, consistent and persistent forms that enable basic statistical analysis, data mining is aimed at discovering new and potentially valuable information from such large data collections [135].

As businesses, public bodies and government agencies are drowning in an ever-increasing deluge of data, the ability to analyse their data in a timely fashion can provide a competitive edge to a commercial enterprise, lead to improved productivity for government agencies and be of vital importance to national security. In many large-scale information systems and data mining projects, data from multiple sources need to be integrated and matched in order to improve data quality, enrich existing data sources or facilitate data mining that is not feasible on a single database. The analysis of data integrated from disparate sources, either within an organisation or between different organisations, can lead to much improved benefits compared to analysing databases in isolation. Integrated data can also allow types of data analyses that are not feasible on individual databases, such as the identification of adverse drug reactions in particular patient groups, or the detection of terrorism suspects through the analysis of certain suspicious patterns of activities [44, 58, 103, 143].

P. Christen, Data Matching, Data-Centric Systems and Applications, DOI: 10.1007/978-3-642-31164-2_1, © Springer-Verlag Berlin Heidelberg 2012

Integrating data from different sources consists of three tasks. The first task is schema matching [224], which is concerned with identifying database tables, attributes and conceptual structures (such as ontologies, XML schemas and UML diagrams) from disparate databases that contain data that correspond to the same type of information. The second task, the topic of this book, is data matching, the task of identifying and matching individual records from disparate databases that refer to the same real-world entities or objects. A special case of data matching is duplicate detection, the task of identifying and matching records that refer to the same entities within a single database. The following Sect. 1.2 discusses how data matching fits into the overall data integration process. The third task, known as data fusion [38], is the process of merging pairs or groups of records that have been classified as matches (i.e. that are assumed to refer to the same entity) into a clean and consistent record that represents an entity. When applied on one database, this process is called deduplication.
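
The second and third tasks can be illustrated on a single small table. The sketch below is only one possible reading: it compares all record pairs with `difflib.SequenceMatcher` from the Python standard library, groups matching records with a union-find structure, and fuses each group by keeping the longest value per attribute. The 0.8 threshold, the `name` and `address` fields and the fusion rule are assumptions chosen for this example, not the book's method.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(r1, r2, threshold=0.8):
    """Classify a record pair as a match if the mean attribute similarity is high."""
    score = (similarity(r1["name"], r2["name"])
             + similarity(r1["address"], r2["address"])) / 2
    return score >= threshold


def fuse(group):
    """Data fusion (assumed rule): keep the longest value of each attribute."""
    return {key: max((r[key] for r in group), key=len) for key in group[0]}


def deduplicate(records, threshold=0.8):
    """Duplicate detection plus fusion within a single list of records."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Comparing every pair is quadratic, so this only suits small inputs.
    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j], threshold):
            parent[find(j)] = find(i)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return [fuse(group) for group in clusters.values()]


# Hypothetical records; the first two are assumed to describe the same person.
records = [
    {"name": "Peter Smith", "address": "12 Main St"},
    {"name": "Peter Smyth", "address": "12 Main Street"},
    {"name": "Maria Lopez", "address": "48 Ocean Drive"},
]
for entity in deduplicate(records):
    print(entity)
```

Union-find makes matching transitive: if record A matches B and B matches C, all three end up in one group even when A and C fall below the threshold. That is a design choice of this sketch and may merge too aggressively on real data.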

The records considered in data matching and deduplication generally refer to real-world entities. The attribute values in these records are descriptions of the identifying details of these entities, such as their names, addresses and so on. It is assumed that these records are available already in a certain structured format, for example consisting of a name attribute, an address attribute, a date-of-birth attribute, etc. Data matching does not consider the extraction of entity information from unstructured documents (such as e-mails, news articles, police reports or scientific publications), or the scanning and optical character recognition (OCR) of names and addresses from letters and parcels. It is assumed that these information extraction [230] steps have already been conducted and that the records to be matched are stored in well-defined files or database tables.

Most commonly, the records to be matched across two or more databases, or to be deduplicated in a single database, correspond to people. They can, for example, refer to customers in a business database, employees in a company data warehouse, tax payers or welfare recipients in government databases, patients in hospital or private health insurance databases, known criminals and terrorism suspects in law enforcement and national security databases, or travellers in the databases held by airlines and by government departments of immigration and homeland security.

Besides people, other entities that sometimes have to be matched include records about businesses, publications and bibliographic citations, Web pages and Web search results or consumer products. In applications such as Web search and digital libraries, for example, it is important that duplicate documents (such as Web pages and bibliographic citations) in the results returned by a search engine are removed before the results are presented to a user [131]. For automatic text indexing systems, it is important that duplicates are eliminated before the indexing takes place in order to reduce storage requirements and computational efforts [245].

With the increase in e-Commerce in recent years, another application where data matching has become of importance is comparative online shopping. Because consumer products in different online stores often have slightly different product descriptions (such as ‘Canon PowerShot D10 Digital Camera’ or ‘Canon D10 12.1MP 3 × OPT ZOOMOIS Underwater Camera’), identifying which product description corresponds to which actual product can become difficult [33].
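
One simple way to quantify how far such descriptions diverge is token-based Jaccard similarity. The sketch below applies it to the two descriptions quoted above. The tokenisation rule and the `tokens` and `jaccard` helpers are assumptions made for illustration; this is a minimal example and not the approach of [33].

```python
import re


def tokens(description):
    """Lower-case a description and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9.]+", description.lower()))


def jaccard(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


store_a = "Canon PowerShot D10 Digital Camera"
store_b = "Canon D10 12.1MP 3 × OPT ZOOMOIS Underwater Camera"

print(sorted(tokens(store_a) & tokens(store_b)))
print(jaccard(store_a, store_b))
```

The two descriptions share only three of their ten distinct tokens (`canon`, `d10`, `camera`), so the score is low even though both describe a Canon D10 camera. This is why matching product descriptions usually needs more than plain token overlap.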

The task of identifying and matching records that refer to the same entities within one or across several databases is challenging for several reasons. The following sections highlight some of the major challenges. They will be further discussed in the relevant chapters later in this book.
