Query Languages for XML

Tue Nov 18 21:40:43 GMT 1997

At 04:38 PM 11/17/97 -0600, W. Eliot Kimber wrote:

>SDQL is simply that part of the larger DSSSL expression language that
>enables the accessing of properties of nodes in groves and the navigation
>of groves.  It uses the same syntax as the rest of DSSSL, that is a Scheme
>variant.  It is based on the basic grove data model (nodes and their
>properties) but has some built-in functions related to SGML (e.g., "gi",
>"att-string", etc.).  All the built-in functions are or can be defined in
>terms of primitives (e.g., node-property).  It includes some basic
>string-matching functions but does not attempt to provide any sort of
>complete full-text facility (which would be outside the stated scope of
>DSSSL in any case).

In the database world, what you describe would not be called a query
language; at least, not if I understand you correctly. Certainly, something
like SDQL is useful, but it doesn't seem to be a query language, nor does it
seem to eliminate the need for a query language. I think we can learn
something from the history of databases - and if we do, we will not be
condemned to repeat this history!

1. Navigational databases (hierarchical and network) allowed complex data
structures, including hierarchical structures, and used navigation to
retrieve data. Indexes on certain fields could allow a kind of random access
to records. Advantages: complex data structures possible, records always
express their relationships to other records, good run-time efficiency.
Disadvantages: programs depend on the physical format of records and on the
exact way that records are threaded together, so minor changes to the
database force significant changes to the algorithms that process it; it is
difficult to write code for general-purpose queries; queries depend on the
programming language used to implement them; query optimization is
virtually impossible.

Your description of SDQL makes me think that it is analogous to the way
navigational databases are queried, and would probably share these
disadvantages: (A) query optimization is very difficult, because the query
is procedural and tells precisely how the data is to be retrieved - even if
a particular repository or database has a faster way of retrieving the data,
it cannot use it, because the query says how to find the data, not what to
find; (B) language dependence - there is no way to formulate a query string
that will work for any implementation of SDQL, regardless of language (for
now, you have to formulate SDQL queries in Scheme); (C) physical dependence -
if the manner in which the data is structured changes, the algorithms no
longer work. I'm not saying that SDQL isn't useful; I'm saying merely that
it doesn't do what query languages do.
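
To make points (A) and (C) concrete, here is a small Python sketch. It is
not SDQL and not any real repository's API: the Node class, the sample
document, and the Repository class with its select method are all
hypothetical names invented for illustration. The first function spells out
how to walk the tree; the second interface only says what is wanted and
leaves the access path (here, an index by element name) to the store.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    gi: str                                   # generic identifier (element name)
    children: list = field(default_factory=list)


# Hypothetical sample document: one SECT1 sits under APPENDIX, not CHAPTER.
doc = Node("BOOK", [
    Node("CHAPTER", [Node("SECT1", [Node("PARA")]), Node("SECT1")]),
    Node("APPENDIX", [Node("SECT1", [Node("PARA")])]),
])


def navigational(root):
    """Procedural: hard-codes the path BOOK -> CHAPTER -> SECT1."""
    found = []
    for chapter in root.children:
        if chapter.gi == "CHAPTER":
            for sect in chapter.children:
                if sect.gi == "SECT1":
                    found.append(sect)
    return found


class Repository:
    """Hypothetical store that keeps an index from element name to nodes."""

    def __init__(self, root):
        self.index = {}
        stack = [root]
        while stack:
            node = stack.pop()
            self.index.setdefault(node.gi, []).append(node)
            stack.extend(node.children)

    def select(self, gi):
        # Declarative: the caller names what it wants; the store is free
        # to answer from its index instead of walking the tree.
        return self.index.get(gi, [])


print(len(navigational(doc)))                 # 2: misses the SECT1 under APPENDIX
print(len(Repository(doc).select("SECT1")))   # 3
```

The navigational version silently misses the SECT1 under APPENDIX because
the path it was written for no longer covers the whole document; the
declarative request is unaffected by where the elements live.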

2. Relational databases introduced the concept of real query languages, and
of logical independence - the operation of a database should not be
dependent on its physical layout. Advantages: significantly easier to change
and maintain databases, queries can be formulated as simple strings, query
language is independent of implementation language, logical independence.
Another, non-technical advantage is that an awful lot of the data we want to
retrieve from databases is currently stored in relational databases.
Disadvantages: logical independence only works as long as you *think* that
everything is a two-dimensional table, complex data structures cannot be
expressed (and SGML documents cannot be managed efficiently using
two-dimensional tables), relationships are not supported directly and must
be reestablished at run-time via primary/foreign key pairs, the results of a
query do not always maintain the original relationships among data.
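
As an illustration of relationships being reestablished at run-time, here
is a minimal sketch using Python's built-in sqlite3 module. The element
table, its columns, and the sample rows are assumptions made up for this
example, not a schema from any real SGML repository. The parent/child
relationship exists only as a foreign key, so the query has to join the
table to itself to recover it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per element, parent stored as a foreign key.
conn.execute("""
    CREATE TABLE element (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES element(id),
        tag_name  TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO element (id, parent_id, tag_name) VALUES (?, ?, ?)",
    [
        (1, None, "CHAPTER"),
        (2, 1, "SECT1"),
        (3, 2, "PARA"),
        (4, 1, "SECT1"),
        (5, 4, "NOTE"),
    ],
)

# SECT1 elements with at least one PARA child: the hierarchy is rebuilt
# by joining each child row to its parent row.
rows = conn.execute("""
    SELECT DISTINCT parent.id
    FROM element AS parent
    JOIN element AS child ON child.parent_id = parent.id
    WHERE parent.tag_name = 'SECT1'
      AND child.tag_name = 'PARA'
    ORDER BY parent.id
""").fetchall()

sect1_ids = [row[0] for row in rows]
print(sect1_ids)   # [2]
```

Note that the result is a list of ids: the rows that come back no longer
carry their position in the document, which is the last disadvantage listed
above.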

Relational databases won't be a useful way to store structured documents,
but they do contain lots of data that we might want to import into our
structured documents. If we ignore relational databases, we are leaving out
a lot of important functionality.

3. Object-relational and object-oriented databases are fairly diverse, so I
have to make some qualifications before I can say anything. The fundamental
difference between object-relational and object-oriented databases has to do
with persistence, a way of automatically storing programming-language
objects; this is something that object-oriented databases do, and
object-relational databases don't. More relevant for us is the underlying
data model, which is very similar for SQL-3, object-relational databases
like Illustra and UniSQL, or object-oriented databases like POET, O2,
Versant, and the ODMG standard for object databases (I am intentionally
omitting ObjectStore, which is largely a navigational database with object
persistence). These databases combine the rich data structures of
navigational databases with the logical independence and query languages of
relational databases. Objects can have complex relationships or complex
structure, and both the structure and relationships can be used as the basis
for queries.

Because hierarchical structures and their relationships are easily used in
queries, this makes a lot of sense for SGML and XML documents. For instance,
here is an OQL query that finds all SECT1 elements that have an ID attribute
and at least one PARA sub-element:

select  e 
from    e in SGMLElement,
        a in e.attributes,
        s in e.subElements
where   e.tagName = "SECT1"
  and   a.tagName = "ID"
  and   s.tagName = "PARA";

This kind of query is very useful - it can be understood fairly easily, the
system that performs the query can make its own decisions about the most
efficient way to perform such a query, and the query can explicitly
reference subelements, reflecting the hierarchical structure of SGML and XML.
And fortunately, the major relational database vendors are also moving
towards object-relational databases; soon, we will be able to do this kind
of query in SQL-3. One SGML repository vendor has also added a fulltext
operator to allow fulltext queries to be formulated as part of a structured
OQL query - this is really cool because structured queries and fulltext
queries can be combined in the conditions of a query.
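
For readers without an object database at hand, the conditions of the OQL
query above can be mimicked in Python with the standard
xml.etree.ElementTree module. This is only a sketch of what the query asks
for, under the assumption of a small made-up sample document; it says
nothing about how a real repository would evaluate it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with three SECT1 elements.
doc = ET.fromstring("""
<BOOK>
  <SECT1 ID="s1"><PARA>one</PARA></SECT1>
  <SECT1 ID="s2"><NOTE>no paragraphs here</NOTE></SECT1>
  <SECT1><PARA>no ID attribute</PARA></SECT1>
</BOOK>
""")

# SECT1 elements that have an ID attribute and at least one PARA child.
# find("PARA") only looks at direct children of the element.
matches = [
    e for e in doc.iter("SECT1")
    if "ID" in e.attrib and e.find("PARA") is not None
]

ids = [e.get("ID") for e in matches]
print(ids)   # ['s1']
```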

Another advantage of object databases is that the results are presented as a
grove - when it is returned as part of a query, each element maintains its
relationships to the other elements of the grove. Cool, eh?

But there are also some problems here:

a. There is no support for hierarchical queries or for transitive closure, a
fancy term for "if you keep going this way, you get there eventually". It is
nice to be able to say that you want SECT1 elements that have at least one
PARA element somewhere below them, or ask for those elements which have ID
attributes and which are somewhere below some particular element. Some
research database systems like semantic network databases have supported
these kinds of operations, but they are not widespread.

b. The form of the query depends on the data structures used to implement
the database. I modified the names for my query to make them friendly - no
real repository would allow you to use exactly those names. On the other
hand, it might not be unreasonable to create standard names to describe the
grove structure, specify how queries can be created using those names, and
have individual vendors map this abstraction onto their own implementations. 

4. Some SGML databases have an SGML-aware query syntax that is
non-procedural. I am thinking particularly of Texcel and LT-XML, which have
similar query languages. For instance, here is a Texcel query that finds
title elements with a parent of section with an ancestor of appendix whose
type attribute is "informational" and that has a descendant of introduction:

title { -- section { -* appendix {
    type = "informational" && +* introduction }}}

This query language, like LT-XML's, directly supports hierarchical queries
and transitive closure, and is designed to support queries on SGML and XML
documents. It is non-procedural, setting no constraints on the system that
will implement the query or the language to be used to carry it out. It
would be interesting to add fulltext operators to a language like this. As I
understand it, DSSSL/SDQL could be used fairly easily to implement queries
designed in a query language like this.
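
To show what the conditions in such a query amount to, here is a Python
sketch that checks them by hand with the standard xml.etree.ElementTree
module. It is not how Texcel or LT-XML evaluate queries, and it follows
only the English description given above (parent, ancestor, attribute
value, descendant). The sample document and the helper names parent_of,
ancestors, and matches are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the first appendix satisfies every condition.
doc = ET.fromstring(
    '<book>'
    '<appendix type="informational">'
    '<introduction/>'
    '<part><section><title>Kept</title></section></part>'
    '</appendix>'
    '<appendix type="normative">'
    '<introduction/>'
    '<section><title>Wrong type</title></section>'
    '</appendix>'
    '<appendix type="informational">'
    '<section><title>No introduction</title></section>'
    '</appendix>'
    '</book>'
)

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in doc.iter() for child in parent}


def ancestors(element):
    """Yield the parent, grandparent, and so on up to the root."""
    while element in parent_of:
        element = parent_of[element]
        yield element


def matches(title):
    """title whose parent is a section that has an ancestor appendix with
    type="informational" and a descendant introduction."""
    section = parent_of.get(title)
    if section is None or section.tag != "section":
        return False
    return any(
        a.tag == "appendix"
        and a.get("type") == "informational"
        and a.find(".//introduction") is not None
        for a in ancestors(section)
    )


titles = [t.text for t in doc.iter("title") if matches(t)]
print(titles)   # ['Kept']
```

The hand-written version needs a parent map, an ancestor walk, and a
descendant search; the two-line query states the same conditions without
prescribing any of them.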

I would think that solutions like this might be useful for queries on
SGML/XML documents, fulltext searches, and queries that combine the two.
This does *not* address the need to use data from non-document databases to
create markup, e.g. to bring data from relational or object-oriented
databases into a dynamic document.

I apologize for the length of this document - I hope it contains enough
useful information to be worth reading.

Jonathan
________________________________

Jonathan Robie
Email: jonathan at texcel.no
Texcel Research, Inc. ("http://www.texcel.no")

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)