ing a lexicon of synonyms. It is assumed that some
classes, or at least some of their attributes and/or relationships,
are assigned meaningful names in a pre-integration
phase. Therefore, the knowledge about
the terminological relationship between the names can
be used as an indicator of the real world correspon-
dence between the objects. In pre-integration, object
equivalence (or degree of similarity) is calculated by
comparing the aspects of each object and computing
a weighted probability of similarity and dissimilarity.
Sheth and Larson [SL90] noted that comparison of
the schema objects is difficult unless the related information
is represented in a similar form in different
schemas.
2.1 Existing Approaches
In [DKM+93] it is noted that semantics are embod-
ied in four places: The database model, conceptual
schema, application programs and minds of users. An
automatic semantic integration procedure can only
make use of information contained in the first two. We
further break this into three parts: The names of at-
tributes (obtained from the schema); attribute values
and domains (obtained from the data contents); and
field specifications (from the schema, or in some cases
from automated inspection of the data). We detail
these approaches below.
2.1.1 Comparing attribute names
Systems have been developed to automate database
integration. One that has addressed the problem of at-
tribute equivalence is MUVIS (Multi-User View Inte-
gration System) [HR90]. MUVIS is a knowledge based
system for view integration. It assists database design-
ers in representing user views and integrating these
views into a global conceptual view. MUVIS determines
the degree of similarity and dissimilarity of two
objects during a pre-integration phase¹.
The similarity and dissimilarity in MUVIS are primarily
based on comparing the field names of the attributes.
Object equivalence is determined by comparing the as-
pects of each (such as class names, member names,
and attribute names) and computing a weighted value
for similarity and dissimilarity. A recommendation is
then produced as to how the integration should be per-
formed.
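
The weighted comparison described above can be illustrated with a minimal sketch. It is not the MUVIS implementation: the aspect names, the weights, the sample objects, and the use of Python's `difflib.SequenceMatcher` as a string-similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical aspect weights (not taken from MUVIS); they only need to be positive.
ASPECT_WEIGHTS = {"class_name": 0.5, "member_names": 0.2, "attribute_names": 0.3}


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def aspect_similarity(names_a: list, names_b: list) -> float:
    """Average, over the names in A, of the best match found in B (asymmetric)."""
    if not names_a or not names_b:
        return 0.0
    best = [max(name_similarity(a, b) for b in names_b) for a in names_a]
    return sum(best) / len(best)


def object_similarity(obj_a: dict, obj_b: dict, weights: dict = ASPECT_WEIGHTS):
    """Return (similarity, dissimilarity) as a weighted average over aspects."""
    total = sum(weights.values())
    score = sum(w * aspect_similarity(obj_a[k], obj_b[k]) for k, w in weights.items())
    similarity = score / total
    return similarity, 1.0 - similarity


# Two made-up schema objects, each described by lists of names per aspect.
employee = {
    "class_name": ["Employee"],
    "member_names": ["staff"],
    "attribute_names": ["name", "salary", "dept"],
}
worker = {
    "class_name": ["Worker"],
    "member_names": ["staff"],
    "attribute_names": ["name", "wage", "department"],
}

sim, dis = object_similarity(employee, worker)
print(f"similarity={sim:.2f}, dissimilarity={dis:.2f}")
```

Note that a purely string-based measure scores `salary` and `wage` as unrelated, which is the synonym problem raised in the following paragraph.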
Most automated tools developed to assist designers in
establishing object correspondences by comparing at-
tribute names work well for homonyms (same name
for different data), as users are shown the false match.
However, different objects can have different synonyms
that are not easily detected by inspection. This shifts
the problem to building the synonym lexicon. Even
a synonym lexicon has limitations, because it is difficult
for database designers to define a field name
using only words that can be found in a dictionary
or abbreviations carrying unambiguous meanings,
and in some cases it is difficult to use a single word
rather than a phrase to name a field. These reasons
make it expensive to build a system based on this approach.

¹ Since, in the real world, semantics of terms may vary, the
relationship between two attributes is usually fuzzy. Therefore,
a degree of similarity and dissimilarity has a strength in [0,1].

Sheth and Larson [SL90] also pointed out that com-
pletely automatic determination of attribute relation-
ships through searching a synonym lexicon is not pos-
sible because it would require that all of the semantics
of schema be completely specified. Also, current se-
mantic (or other) data models are not able to capture
a real-world state completely and interpretations of
real-world state change over time.
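
A small sketch may help show why a synonym lexicon alone falls short. The lexicon contents, the field names, and the three-valued return convention below are hypothetical and are not drawn from any of the cited systems.

```python
# Hypothetical lexicon mapping each known word to a canonical term.
SYNONYM_LEXICON = {
    "salary": "pay",
    "wage": "pay",
    "pay": "pay",
    "employee": "employee",
    "worker": "employee",
}


def canonical(name: str):
    """Look up the canonical term for a field name, or None if it is unknown."""
    return SYNONYM_LEXICON.get(name.strip().lower())


def are_synonyms(a: str, b: str):
    """True/False when both names are in the lexicon, None when undecidable."""
    ca, cb = canonical(a), canonical(b)
    if ca is None or cb is None:
        return None  # abbreviation or phrase the lexicon does not cover
    return ca == cb


print(are_synonyms("Salary", "wage"))        # True
print(are_synonyms("salary", "worker"))      # False
print(are_synonyms("emp_sal", "salary"))     # None
print(are_synonyms("monthly pay", "wage"))   # None
```

The last two calls cannot be decided: an abbreviation and a multi-word phrase both fall outside the lexicon, so a designer would have to extend it by hand.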
2.1.2 Comparing attribute values and domains using data contents
Another approach to determining attribute equivalence
is comparing attribute domains. Larson et al.
[LNE89, NB86] and Sheth et al. [SLCN88] discussed
how relationships and entity sets can be integrated pri-
marily based on their domain relationships: EQUAL,
CONTAINS, OVERLAP, CONTAINED-IN, and DIS-
JOINT. Determining such relationships can be time
consuming and tedious [SL90]. If each schema has
100 entity types, and an average of five attributes per
entity type, then 250,000 pairs of attributes must be
considered (for each attribute in one schema, a potential
relationship with each attribute in other schemas
should be considered). Another problem with their
approach is poor tolerance of faults. Small amounts of
incorrect data may lead the system to draw a wrong
conclusion on domain relationships.
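
The five domain relationships and the pair count can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure of the cited work: each attribute domain is modeled as a plain Python set of observed values, the comparison is between two schemas of equal size, and the sample values are made up.

```python
def domain_relationship(a: set, b: set) -> str:
    """Classify how the value set of attribute A relates to that of attribute B."""
    if a == b:
        return "EQUAL"
    if a > b:  # proper superset
        return "CONTAINS"
    if a < b:  # proper subset
        return "CONTAINED-IN"
    if a & b:  # some shared values, neither contains the other
        return "OVERLAP"
    return "DISJOINT"


def candidate_pairs(entity_types: int, attrs_per_type: int) -> int:
    """Number of attribute pairs between two schemas of the same size."""
    attrs = entity_types * attrs_per_type
    return attrs * attrs


clean = {"CA", "NY", "TX"}
dirty = clean | {"N.Y."}  # one mistyped value in otherwise identical data

print(domain_relationship(clean, clean))   # EQUAL
print(domain_relationship(dirty, clean))   # CONTAINS
print(candidate_pairs(100, 5))             # 250000
```

A single mistyped value is enough to change the verdict from EQUAL to CONTAINS, which illustrates the poor fault tolerance noted above. With 500 attributes per schema, the exhaustive comparison covers 500 × 500 = 250,000 pairs.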
In the tool developed to perform schema integration
described in [SLCN88], a heuristic algorithm is given
to identify pairs of entity types and relationship types
that are related by EQUAL, CONTAINS, OVERLAP,
and CONTAINED-IN domain relationships. Sheth
and Gala [SG89] also argued that this task cannot
be automated, and hence we may need to depend on
heuristics to identify a small number of attribute pairs
that may be potentially related by a relationship other
than DISJOINT.
2.1.3 Comparing field specifications
In [NB86] the characteristics of attributes discussed
are uniqueness, cardinality, domain, semantic integrity
constraints, security constraints, allowable operations,
and scale. In our prior work [LC93], we presented
a technique which utilizes these field specifications to
determine the similarity and dissimilarity of a pair of