4 1 Introduction
the same real-world entities or objects. A special case of data matching is duplicate
detection, the task of identifying and matching records that refer to the same entities
within a single database. The following Sect. 1.2 discusses how data matching fits
into the overall data integration process. The third task, known as data fusion [38],
is the process of merging pairs or groups of records that have been classified as
matches (i.e. that are assumed to refer to the same entity) into a clean and consistent
record that represents an entity. When applied on one database, this process is called
deduplication.
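
As a rough illustration of the data fusion step described above, the following minimal sketch merges a group of records that have already been classified as matches into a single record. The fusion rule used here (take the most frequent non-empty value for each attribute) is only one simple assumption among many possible strategies, and the function name `fuse_records` and the sample attribute values are hypothetical.

```python
from collections import Counter


def fuse_records(records):
    """Merge matched records (dicts of attribute -> string) into one record.

    Assumed rule: for each attribute, keep the most frequent non-empty
    value; ties go to the value that was seen first.
    """
    # Collect attribute names in the order they first appear.
    attributes = []
    for rec in records:
        for attr in rec:
            if attr not in attributes:
                attributes.append(attr)

    fused = {}
    for attr in attributes:
        values = [rec[attr].strip() for rec in records if rec.get(attr, "").strip()]
        # An attribute that is empty in every record stays empty.
        fused[attr] = Counter(values).most_common(1)[0][0] if values else ""
    return fused


# Hypothetical group of records assumed to refer to the same person.
matched = [
    {"name": "Peter Smith", "address": "12 Main St", "dob": ""},
    {"name": "Pete Smith", "address": "12 Main St", "dob": "1970-01-01"},
    {"name": "Peter Smith", "address": "", "dob": "1970-01-01"},
]
fused = fuse_records(matched)
```

A majority vote is easy to follow but ignores which source is more trustworthy or more recent, so it should be read as a starting point rather than a recommended fusion method.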
The records considered in data matching and deduplication generally refer to real-
world entities. The attribute values in these records are descriptions of the identifying
details of these entities, such as their names, addresses and so on. It is assumed
that these records are already available in a certain structured format, for example
consisting of a name attribute, an address attribute, a date-of-birth attribute, etc. Data
matching does not consider the extraction of entity information from unstructured
documents (such as e-mails, news articles, police reports or scientific publications),
or the scanning and optical character recognition (OCR) of names and addresses from
letters and parcels. It is assumed that these information extraction [230] steps have
already been conducted and that the records to be matched are stored in well-defined
files or database tables.
Most commonly, the records to be matched across two or more databases, or to
be deduplicated in a single database, correspond to people. They can, for example,
refer to customers in a business database, employees in a company data warehouse,
tax payers or welfare recipients in government databases, patients in hospital or
private health insurance databases, known criminals and terrorism suspects in law
enforcement and national security databases, or travellers in the databases held by
airlines and government departments of immigration and homeland security.
Besides people, other entities that sometimes have to be matched include records
about businesses, publications and bibliographic citations, Web pages and Web search
results or consumer products. In applications such as Web search and digital libraries,
for example, it is important that duplicate documents (such as Web pages and
bibliographic citations) in the results returned by a search engine are removed before
the results are presented to a user [131]. For automatic text indexing systems, it is
important that duplicates are eliminated before the indexing takes place in order to
reduce storage requirements and computational effort [245].
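
The removal of duplicates from a result list can be sketched as follows. This is a minimal example that only catches duplicates which become identical after a simple normalisation (lowercasing, removing punctuation, collapsing whitespace). The normalisation rule, the function names and the sample result titles are assumptions made for illustration, and real systems generally need approximate matching as well.

```python
import re


def normalise(text):
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def remove_duplicates(results):
    """Return results in their original order, keeping the first of each duplicate group."""
    seen = set()
    unique = []
    for item in results:
        key = normalise(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Hypothetical search result titles.
results = [
    "Canon PowerShot D10 review",
    "canon  powershot d10 review!",
    "Underwater cameras compared",
]
unique = remove_duplicates(results)
```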
With the increase in e-Commerce in recent years, another application where
data matching has become important is comparative online shopping. Because
consumer products in different online stores often have slightly different product
descriptions (such as ‘Canon PowerShot D10 Digital Camera’ or ‘Canon D10
12.1MP 3 × OPT ZOOMOIS Underwater Camera’), identifying which product
description corresponds to which actual product can become difficult [33].
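
To make the difficulty concrete, the sketch below compares the two product descriptions quoted above using Jaccard similarity on whitespace-separated tokens. Jaccard similarity is chosen here purely as a simple illustrative measure and is not a method prescribed by the text. The function names are hypothetical, and the descriptions are copied as they appear above.

```python
def tokens(description):
    # Lowercase and split on whitespace; a deliberately simple tokeniser.
    return set(description.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two strings: shared tokens / all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty descriptions are treated as identical
    return len(ta & tb) / len(ta | tb)


d1 = "Canon PowerShot D10 Digital Camera"
d2 = "Canon D10 12.1MP 3 × OPT ZOOMOIS Underwater Camera"
similarity = jaccard(d1, d2)
```

The two descriptions share only three of their eleven distinct tokens ('canon', 'd10' and 'camera'), so the similarity is 3/11, or about 0.27, even though both describe the same product.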
The task of identifying and matching records that refer to the same entities within
one or across several databases is challenging for several reasons. The following
sections highlight some of the major challenges. They will be further discussed in
the relevant chapters later in this book.