XML-Data, "&" and inheritance

Tue Apr 28 10:43:03 BST 1998

From: Charles Frankston <cfranks at microsoft.com>

>I think there are good reasons to not use regular expression syntax:
>
>1. It is not easily read by those who have not been working with it for many
>years.  I think millions of HTML authors can figure out XML-Data's verbose
>syntax much more quickly than they can learn regular expressions.  I think
>this makes the verbosity worthwhile.

Is XML-data targeted at HTML authors? I thought it was targeted at database
people. I would be surprised if database people do not understand regular
expressions: nowadays most professionals have been through some formal
training.

>2. I can use the same XML tools to deal with the XML-Data schema documents
>as I use to deal with my other XML documents.  If regular expressions were
>used for even part of the syntax, this wouldn't be the case, and I'd need a
>regular expression parser.

An XML parser has a regular expression parser already, or at least a
content-model parser. So the difference is not in the parsing but in what
becomes of the parsed information: for example, the API could emit XML-data
events, but the document could be transmitted using XML content models.
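
To make the point concrete, here is a minimal sketch of reading a terse
content model and handing the application XML-data style information. It
handles only a flat, comma-separated sequence; the function name and the
occurrence keywords are assumptions for illustration, not taken from any
parser or from the XML-Data note.

```python
import re

# Occurrence suffixes of a content model mapped to verbose keywords.
# The keyword spellings are assumed for this sketch.
OCCURS = {"": "REQUIRED", "?": "OPTIONAL", "*": "ZEROORMORE", "+": "ONEORMORE"}


def parse_flat_model(model):
    """Parse a flat sequence such as "(p, q*)" into (name, occurs) pairs."""
    inner = model.strip()
    if not (inner.startswith("(") and inner.endswith(")")):
        raise ValueError("expected a parenthesised group: %r" % model)
    events = []
    for token in inner[1:-1].split(","):
        # One particle: an element name followed by an optional ?, * or +.
        match = re.fullmatch(r"\s*([A-Za-z_][\w.-]*)([?*+]?)\s*", token)
        if match is None:
            raise ValueError("unsupported particle: %r" % token)
        events.append((match.group(1), OCCURS[match.group(2)]))
    return events


print(parse_flat_model("(p, q*)"))
```

Nested groups and "|" alternatives are rejected rather than handled; a real
content-model parser would need a small recursive grammar for those.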

>3. A large schema built from scratch in XML-Data should be able to save a
>lot of space by using inheritance to avoid copying large sections of similar
>schema information.

The DTD figures I gave already used a lot of parameter entities, which
the SGML2DTD program retains: the bloat is not an artifact of macro
expansion. So my figures (178K to 600K expansion for a version of
DOCBOOK) already include a lot of compression. The simple fact is that
going from "<!ELEMENT a (p, q*)>" to
"<elementType id='a'><element>p</element><element
occurs='ZEROORMORE'>q</element></elementType>" will bloat out the schema.
Which is why I think "<elementType id='a'>(p, q*)</elementType>" would
be much better.
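
A rough sketch of the size difference for that one declaration, for anyone
who wants to try other content models. The element-per-particle spelling
follows the loose form used in this message rather than the exact XML-Data
syntax, and the function names and occurrence keywords are assumptions
made for the illustration.

```python
# Verbose keyword for each content-model suffix (assumed spellings).
SUFFIX_TO_OCCURS = {"?": "OPTIONAL", "*": "ZEROORMORE", "+": "ONEORMORE"}


def model_text(particles):
    """Join (name, suffix) pairs into a terse model such as "p, q*"."""
    return ", ".join(child + suffix for child, suffix in particles)


def dtd_form(name, particles):
    return "<!ELEMENT %s (%s)>" % (name, model_text(particles))


def terse_form(name, particles):
    return "<elementType id='%s'>(%s)</elementType>" % (name, model_text(particles))


def verbose_form(name, particles):
    parts = []
    for child, suffix in particles:
        if suffix:
            parts.append("<element occurs='%s'>%s</element>"
                         % (SUFFIX_TO_OCCURS[suffix], child))
        else:
            parts.append("<element>%s</element>" % child)
    return "<elementType id='%s'>%s</elementType>" % (name, "".join(parts))


particles = [("p", ""), ("q", "*")]
for label, form in (("dtd", dtd_form), ("terse", terse_form), ("verbose", verbose_form)):
    text = form("a", particles)
    print("%-8s %3d chars  %s" % (label, len(text), text))
```

The verbose form grows with every particle in the model, while the terse
form pays the element wrapper once per declaration.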

But rather than just assertions, I would be interested to see XML-data
used for a realistic, real-life DTD like DOCBOOK. I have evidence that
a similar system bloats out in a way that I think would be unacceptable
for the web: the best way to prove me wrong would be for the XML-data
people to actually mark up the DOCBOOK DTD in XML-data, trying
to use inheritance. I challenge them to do this, in fact.

I predict what would happen is that the result would be large and bloated.
I don't think there is much inheritance to be found in DOCBOOK's
structures.  I would predict that the XML-data proponents would then
say (fairly) that the problem is that DOCBOOK DTD was not based on an
analysis which exposes inheritance.

So to cut directly to that, I think the problem is to think that using
inheritance can be a way of simplifying existing DTDs. Parameter entities
allow a certain level of compression, and are widely used. But they are
certainly far from an inheritance mechanism.

Consequently DTD designers do not tend to make specialized versions of
general structures, even when it would be desirable:  people will have
a single table model, rather than one kind of table which must include a
figure, one kind which must have 3 columns, one kind which can
include footnotes and one kind which cannot contain footnotes.

Adding an inheritance mechanism will not tend to simplify any existing
DTDs, rather it will make specifying richer, more exact DTDs more
tractable and doable. Where DOCBOOK has one (or two) types of tables,
there would be more types if inheritance could be used.

So even if XML-data became as concise as XML in specifying the base
structures, a schema with specialized structures using inheritance must
be bigger. At the moment, DTDs tend only to have these base structures.

Experience shows that for text it is easily possible for a DTD to require
hundreds of element types, and that is when general structures are used.
I think the XML "terseness is not of major importance" goal should be
at the bottom of the list (and possibly off the list) as far as an
XML-schema proposal goes.

If I have a 10K document, I do not want to have to ship out a 600K schema.
The reason that XML allows no markup declarations in the first place
is that even a 200K schema is too much for lots of uses. The better approach
to this problem is to have as terse a schema syntax as we can: regular
expressions provide a great model here: every computer science student
has studied them, everyone in document processing knows them, everyone
who has used wildcards in Web searches knows the idea. Then use
some hypertext convention by which the schema can be held remotely
and only the particular relevant definitions can be requested
as they are needed: a linking system from element type names (etc)
which uses some simple defaulting convention. The document is kept
as small as possible (preferably the same size as the DTD-less document)
and the schema can be made as elaborate and grand as desired.
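
One way this could look, as a sketch only: the definition for an element
type is requested the first time it is needed and remembered afterwards.
The class name, the "base URL plus type name" defaulting convention, and
the example.org addresses are all hypothetical, and a dictionary stands in
for the network so the example runs on its own.

```python
class RemoteSchema:
    """Fetch element type definitions one at a time and remember them."""

    def __init__(self, base_url, fetch):
        self.base_url = base_url.rstrip("/")
        self.fetch = fetch      # callable: URL -> definition text
        self.cache = {}

    def url_for(self, element_type):
        # Hypothetical defaulting convention: base URL plus the type name.
        return "%s/%s" % (self.base_url, element_type)

    def definition(self, element_type):
        # Only go to the remote schema on the first request for a type.
        if element_type not in self.cache:
            self.cache[element_type] = self.fetch(self.url_for(element_type))
        return self.cache[element_type]


# Stand-in for the network: definitions keyed by URL, plus a request log.
STORE = {
    "http://example.org/schema/para": "(#PCDATA | emphasis)*",
    "http://example.org/schema/table": "(title?, row+)",
}
request_log = []


def fake_fetch(url):
    request_log.append(url)
    return STORE[url]


schema = RemoteSchema("http://example.org/schema/", fake_fetch)
schema.definition("para")
schema.definition("para")    # served from the cache, no second request
schema.definition("table")
print(request_log)
```

Only the types a document actually uses are ever transferred, so the size
of the full schema stops mattering to the recipient of a small document.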

The Web is based on going from the idea of just plonking in great blobs
of text wherever they are needed to having smart links to navigate to
the exact resource needed, as it is needed. XML-data as currently
formulated is a step back into this pre-hypertext mentality. This
compounds the problem of its verbosity. It would be best to deal with
this as a hypertext problem, but otherwise at least use regular expression
syntax for content models to reduce the verbosity.

Rick Jelliffe

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)