2.1 Extensible Markup Language
Extensible Markup Language (XML) is a hierarchical
data format for information exchange on the World Wide
Web. An XML document consists of nested element
structures, starting with a root element. Element data can
take the form of attributes or sub-elements. Figure 1
shows an XML document that contains information about
a book. In this example, the root is a book element that has
two sub-elements, booktitle and author. The author
element has an id attribute with value “dawkins” and is
further nested to provide name and address information.
Further information on XML can be found in [3,8].
Figure 1
Figure 2
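As a rough illustration of the distinction between attributes and sub-elements, the following Python sketch parses a book document modeled on Figure 1 using the standard library's xml.etree.ElementTree. The document string is retyped here with whitespace trimmed, and the variable names are ours, chosen only for the example.

```python
import xml.etree.ElementTree as ET

# Book document modeled on Figure 1 (retyped here; whitespace trimmed).
BOOK_XML = """
<book>
  <booktitle>The Selfish Gene</booktitle>
  <author id="dawkins">
    <name>
      <firstname>Richard</firstname>
      <lastname>Dawkins</lastname>
    </name>
    <address>
      <city>Timbuktu</city>
      <zip>99999</zip>
    </address>
  </author>
</book>
"""

root = ET.fromstring(BOOK_XML)

# The root element and the tags of its direct sub-elements.
child_tags = [child.tag for child in root]

# Element data held in an attribute versus in nested sub-elements.
author = root.find("author")
author_id = author.get("id")
last_name = author.find("name/lastname").text

print(root.tag, child_tags, author_id, last_name)
```

The id value is read with get() because it is an attribute, whereas the last name is reached by descending through the nested name element.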
2.2 DTDs and other XML Schemas
Document Type Definitions (DTDs) [2] describe the
structure of XML documents and act as a schema for
them. A DTD specifies the structure of an XML element
by specifying the names of its sub-elements and
attributes. Sub-element structure is specified using the
operators * (set with zero or more elements), + (set with
one or more elements), ? (optional), and | (or). All values
are assumed to be string values, unless the type is ANY, in
which case the value can be an arbitrary XML fragment.
There is a special attribute type, ID; an element can have
at most one attribute of this type. An ID attribute uniquely
identifies an element within a document and can be
referenced through an IDREF attribute in another element.
IDREFs are untyped: a DTD cannot restrict the type of
element that an IDREF refers to. Finally, there is no
concept of a root of a DTD; an XML document
conforming to a DTD can be rooted at any element
specified in the DTD. Figure 2 shows a DTD
specification, while Figure 1 gives an XML document that
conforms to this DTD.
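One way to make the operator semantics concrete is to read a sub-element specification as a regular expression over child tag names. The Python sketch below is only an illustration under simplifying assumptions: it handles element-only content models built from names, the sequence comma, |, *, +, ? and parentheses, and it ignores #PCDATA, ANY, EMPTY and attributes. The helper names model_to_regex and conforms are ours and do not belong to any DTD tool.

```python
import re
import xml.etree.ElementTree as ET

# One token is either an element name or a single operator character.
TOKEN = re.compile(r"\s*([A-Za-z_][\w.\-]*|[(),|*+?])")
OPERATORS = {")", "|", "*", "+", "?"}

def model_to_regex(model):
    """Translate a content model such as '(firstname?, lastname)' into a
    regex over a sequence of child tag names, each followed by a space."""
    model = model.strip()
    parts, pos = [], 0
    while pos < len(model):
        match = TOKEN.match(model, pos)
        if match is None:
            raise ValueError(f"unsupported content model: {model!r}")
        token, pos = match.group(1), match.end()
        if token == ",":
            continue                      # sequence is plain concatenation
        if token == "(":
            parts.append("(?:")           # grouping without capturing
        elif token in OPERATORS:
            parts.append(token)           # same meaning in DTDs and regexes
        else:
            parts.append(f"(?:{re.escape(token)} )")
    return re.compile("".join(parts))

def conforms(element, model):
    """Check an element's direct children against a content model."""
    child_sequence = "".join(f"{child.tag} " for child in element)
    return model_to_regex(model).fullmatch(child_sequence) is not None

name_ok = ET.fromstring("<name><lastname>Dawkins</lastname></name>")
name_bad = ET.fromstring("<name><firstname>Richard</firstname></name>")
article = ET.fromstring(
    "<article><title/><author/><author/><contactauthor/></article>"
)

print(conforms(name_ok, "(firstname?, lastname)"))
print(conforms(name_bad, "(firstname?, lastname)"))
print(conforms(article, "(title, author*, contactauthor)"))
```

Terminating every tag name with a space keeps a name such as author from matching inside contactauthor. A real validator would also check attributes, ID uniqueness and IDREF targets, which this sketch leaves out.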
Document Content Descriptions (DCDs) [4] and XML
Schemas [16] are extensions to DTDs. For our purposes,
the main difference between these and DTDs is that they
allow values to be typed and set sizes to be specified. If
DCDs and XML Schemas become standard, the additional
information would aid in our translation process; for
example, we could create tables with integer attributes
where appropriate instead of using just strings. The types
in the current DCD proposal are compatible with types
supported by current relational systems. More complex
types will require object-relational extensions.
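To illustrate how type information could help, the Python sketch below builds a CREATE TABLE statement from field declarations and falls back to a string column when no type is declared, as with a plain DTD. Everything in it is an assumption made for the example: the type names in SQL_TYPES are illustrative and are not taken from the DCD or XML Schema proposals, the address table with city and zip columns is hypothetical, and create_table_sql is not the translation procedure of this paper.

```python
import sqlite3

# Hypothetical mapping from declared value types to SQL column types.
# With a plain DTD every value is a string, so VARCHAR is the fallback.
SQL_TYPES = {"string": "VARCHAR(255)", "int": "INTEGER", "float": "REAL"}

def create_table_sql(table, fields):
    """Build a CREATE TABLE statement from (field name, declared type)
    pairs. A declared type of None models the untyped DTD case."""
    columns = [
        f"{name} {SQL_TYPES.get(declared, 'VARCHAR(255)')}"
        for name, declared in fields
    ]
    return f"CREATE TABLE {table} ({', '.join(columns)})"

untyped = create_table_sql("address", [("city", None), ("zip", None)])
typed = create_table_sql("address", [("city", "string"), ("zip", "int")])

# With an integer column, a numeric comparison needs no string casts.
conn = sqlite3.connect(":memory:")
conn.execute(typed)
conn.execute("INSERT INTO address VALUES (?, ?)", ("Timbuktu", 99999))
row = conn.execute(
    "SELECT city, zip FROM address WHERE zip > 90000"
).fetchone()

print(untyped)
print(typed)
print(row)
```

SQLite is used only because it ships with Python; the same idea applies to any relational system that supports the declared types.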
2.3 XML Query Languages
Figure 3
Figure 4
Many query languages for semi-structured data can be
used to query XML documents, including XML-QL
[9], Lorel [1], UnQL [5] and XQL (from Microsoft). All
of these query languages have a notion of path expressions
for navigating the nested structure of XML. XML-QL
uses a nested XML-like structure to specify both the part
of a document to be selected and the structure of the
result XML document.
Figure 4 shows an XML-QL query to determine the
last name of an author of a book with the title “The Selfish
Gene”, specified over a set of XML documents
conforming to the DTD in Figure 2. The last names thus
selected will be nested within a lastname tag, as specified
in the construct clause of the query. Lorel is more like
SQL, and its representation of the same query is shown in
Figure 3. In this paper, we use a combination of XML-QL
and Lorel (modified appropriately for our purposes).
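To show what a path expression does, independent of XML-QL or Lorel syntax, here is a small Python sketch that uses the limited path support in the standard library's xml.etree.ElementTree. It is not an implementation of either language. The function name last_names_for_title and the inline documents are assumptions made for the example, and the path follows the nesting in the DTD of Figure 2, where lastname sits under author and name.

```python
import xml.etree.ElementTree as ET

def last_names_for_title(documents, title):
    """Return last names of authors of books whose booktitle equals title."""
    result = []
    for text in documents:
        root = ET.fromstring(text)
        # iter("book") also covers documents whose root is not a book.
        for book in root.iter("book"):
            if (book.findtext("booktitle") or "").strip() != title:
                continue
            # Path expression: navigate book -> author -> name -> lastname.
            for node in book.findall("author/name/lastname"):
                result.append((node.text or "").strip())
    return result

# Two hypothetical documents standing in for a.xml and b.xml.
DOCS = [
    "<book><booktitle> The Selfish Gene </booktitle>"
    "<author id='dawkins'><name><firstname> Richard </firstname>"
    "<lastname> Dawkins </lastname></name><address/></author></book>",
    "<book><booktitle> Another Title </booktitle>"
    "<author id='a2'><name><lastname> Someone </lastname></name>"
    "<address/></author></book>",
]

names = last_names_for_title(DOCS, "The Selfish Gene")

# Mirror the construct clause: wrap each selected value in a lastname tag.
constructed = [f"<lastname>{n}</lastname>" for n in names]
print(constructed)
```

The selection step corresponds to the pattern or WHERE part of the query, and the final list comprehension plays the role of the construct clause.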
Figure 4 (XML-QL query):
WHERE <book>
        <booktitle> The Selfish Gene </booktitle>
        <author>
          <lastname> $l </lastname>
        </>
      </> IN a.xml, b.xml
CONSTRUCT <lastname> $l </lastname>
Figure 3 (Lorel query):
SELECT X.author.lastname
FROM book X
WHERE X.booktitle = "The Selfish Gene"
Figure 2 (DTD specification):
<!ELEMENT book (booktitle, author)>
<!ELEMENT article (title, author*, contactauthor)>
<!ELEMENT contactauthor EMPTY>
<!ATTLIST contactauthor authorID IDREF #IMPLIED>
<!ELEMENT monograph (title, author, editor)>
<!ELEMENT editor (monograph*)>
<!ATTLIST editor name CDATA #REQUIRED>
<!ELEMENT author (name, address)>
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address ANY>
Figure 1 (XML document):
<book>
  <booktitle> The Selfish Gene </booktitle>
  <author id = "dawkins">
    <name>
      <firstname> Richard </firstname>
      <lastname> Dawkins </lastname>
    </name>
    <address>
      <city> Timbuktu </city>
      <zip> 99999 </zip>
    </address>
  </author>
</book>