6 · M. Yoshikawa et al.
technique, elements having an in-degree greater than 2 are also inlined if they are reachable
without passing “*”. Incidentally, order information among elements that is discarded in
the first step can be represented by adding positional information in the relational schema.
2.1.2 Storing Structured Documents without Information about DTD. There have been
several studies that used fixed relational schemas to store structured documents. For ex-
ample, [Horowits and Williamson 1986] proposed to store structured documents (ordered
trees) by decomposing them into relational tables. Also, in a study by [Zhang 1995],
a method to manage SGML documents using object-oriented database systems was pro-
posed. In that work, all text nodes were maintained by a class NODE. In addition, [Flo-
rescu and Kossmann 1999b; Florescu and Kossmann 1999a] proposed several relational
schemas, and performed performance analysis on them. The method proposed in this pa-
per differs from these previous methods in that in this method, information about paths
from the root to each node and its position in the document is maintained in relational ta-
bles. In addition, our proposal does not impose any prerequisites on XML documents to be
stored, whereas [Florescu and Kossmann 1999b; Florescu and Kossmann 1999a] assumes
that each element has an ID attribute.
2.2 Other Approaches
Regarding index files for structured documents, several studies such as PAT [Salminen and
Tompa 1994], Burkowski [Burkowski 1992], Clarke et al. [Clarke et al. 1995a; Clarke
et al. 1995b] and Navarro et al. [Navarro and Baeza-Yates 1997] have appeared. [Sacks-
Davis et al. 1998] categorized such indexes into position-based and path-based indexings.
In position-based indexes, queries are processed using word element and position. On the
other hand, paths in tree structure are used in path-based indexes. In this paper, we do
not use special indexes for structured documents. However, our storage method is closely
related to the concept of those indexes.
Finally, the topic of abstract data type is related to both storage and query retrieval.
In [Blake et al. 1995], the authors described an approach in which an XML document is
regarde as just a sequence of characters, then operations on tree structures are replaced
with those on character strings, and an abstract data type is defined in a database having
such operations. Our approach differs from those of previous research in that we simply
use an off-the-shelf database system; that is, we do not need any special full-text search
system or indexing structure, and translate XPath queries into SQL.
3. AN OVERVIEW OF XML DOCUMENTS
An XML document consists of three parts: an XML declaration, a DTD (Document Type
Definition), and an XML instance
1
. An XML declaration and a DTD are not mandatory
for an XML document. An XML declaration specifies the version and the encoding of
XML being used. A DTD is a schema that constrains the structure of XML instances, and
corresponds to an extended context-free grammar. An XML instance is a tagged document.
We omit concrete descriptions of an XML declaration and a DTD.
An XML instance is a hierarchy of elements, the boundaries of which are either delim-
ited by start-tags and end-tags, or, for empty elements, by empty-element tags. Character
1
Although the term ‘XML instance’ does not appear in the XML Recommendation [World Wide Web Consor-
tium 1998], we use this term to represent XML document data excluding an XML declaration and a DTD.