SGML the next big thing?

Liam R. E. Quin liamquin at
Sat Dec 4 04:23:29 GMT 1999

On Fri, 3 Dec 1999, Lauren Wood wrote:
> On 3 Dec 99, at 12:14, Arnold, Curt wrote:
>> It looks like the XML Schema group is trying to add back the & construct.
>> If you have a compelling justification for continued suppression, please
>> rant long and loud.
> How about every SGML parser author I've talked to says the & 
> construct was the biggest, hardest part (which means probably the 
> buggiest) of the entire parser? I think the XML WG was right in 
> throwing it out of XML in the first place.

If this is about & in content models, I think
(1) Lauren is right, because as SGML specified them, they were very
    hard to get right.

    This & thing is so far outside the way most other computer languages
    work that standard off-the-shelf parser generators roll on their
    backs and wave their paws in the air and admit defeat.

(2) The idea of saying, "this element must contain at least one of each of
    the following elements" is a useful one, and is very different from
    the & construct.

    A simplified, regularised form of & might be possible.

(3) The & connector interacts with #PCDATA to form pernicious content
    models (see below).  The XML WG went to great lengths to make sure
    that no valid XML document suffers from this SGML bogosity.  Similar
    lengths are needed for "&".

    For those who're not familiar with &, it is the content model connector
    in SGML that says that in order to match a & b & c ..., every content
    fragment a, b, etc., must be satisfied, in any order, and nothing must
    be left over.
    Furthermore, there must be exactly one way to satisfy the expression,
    as otherwise it is "ambiguous" and illegal, just as
	(a, b?) | a
    is illegal in SGML, even though it is a perfectly sensible and valid
    regular expression for the rest of the world of computing :-)
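
A minimal sketch of that last point, assuming one-letter element names so
that a child sequence can be written as a string. The names MODEL, matches
and first_tokens are made up for illustration. It shows that (a, b?) | a is
an ordinary regular expression, and that both alternatives can start with
the same token, which is the overlap that SGML's ambiguity rule rejects.

```python
import re

# (a, b?) | a as an ordinary regular expression over one-letter
# element names (an assumption made for this sketch).
MODEL = re.compile(r"(?:ab?|a)")


def matches(children: str) -> bool:
    """True if the whole child sequence matches the model."""
    return MODEL.fullmatch(children) is not None


def first_tokens(alternative):
    """Tokens that can start a sequence given as (name, optional) pairs."""
    firsts = set()
    for name, optional in alternative:
        firsts.add(name)
        if not optional:
            break  # a required token ends the set of possible starters
    return firsts


left = [("a", False), ("b", True)]  # (a, b?)
right = [("a", False)]              # a
overlap = first_tokens(left) & first_tokens(right)

print(matches("a"), matches("ab"), matches("b"))  # True True False
print(overlap)  # {'a'}
```

A regular expression engine is happy with this. A parser that must decide,
on seeing the first a, which a in the model it has matched cannot do so
without looking further ahead.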

    Consider the following SGML declaration (with OMITTAG NO):
	<!ELEMENT boy
	    (noise & (dirt,mud)+ & (mud,shoes,trouble)* & #PCDATA) +smell>
    This is a "pernicious" mixed content model, and can only have
    white space in it between elements once, since that uses up the
    #PCDATA content model fragment.

    The following is (let's say for the sake of argument) a valid boy:

    If you try to match this against the content model I gave, you'll
    see that you can't do it with LL(1) or LALR(1) directly unless
    you build a DFA with a rather large number of states.  I added the
    inclusion +smell, but you could change the content model to be
	(boy-model | smell)*
    to have an even more interesting time of it.
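
To give a rough feel for the state count, here is a sketch of a much
simplified & group. It assumes every operand is a single, distinct, required
element name, with no occurrence indicators, no #PCDATA and no inclusions,
so it is far easier than the boy model above. The function names are
hypothetical. Rewriting the group as a choice of sequences needs one
alternative per ordering, and a DFA that remembers which operands have
already been seen needs one state per subset.

```python
from itertools import permutations


def expand_and_group(operands):
    """Rewrite a & b & c ... as the list of every allowed ordering."""
    return [list(p) for p in permutations(operands)]


def matches_and_group(operands, children):
    """True if children uses each operand exactly once, in any order.

    The set of operands still to be satisfied plays the role of the
    DFA state in this simplified model.
    """
    remaining = set(operands)
    for child in children:
        if child not in remaining:
            return False  # unknown element, or operand already used
        remaining.remove(child)
    return not remaining  # every operand satisfied, nothing left over


def dfa_state_count(n):
    """Number of subsets of n distinct operands."""
    return 2 ** n


group = ["a", "b", "c", "d"]
print(len(expand_and_group(group)))                      # 24
print(dfa_state_count(len(group)))                       # 16
print(matches_and_group(group, ["c", "a", "d", "b"]))    # True
print(matches_and_group(group, ["a", "b", "c"]))         # False
```

Even in this toy form the orderings grow as n! and the states as 2^n. Once
the operands are themselves groups with + and *, and share element names
the way (dirt,mud)+ and (mud,shoes,trouble)* do, tracking by a simple set
no longer works.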

Liam Quin, Barefoot Computing, Toronto;  The barefoot agitator
l i a m    at    h o l o w e b    dot    n e t <-- NEW ADDRESS
Ankh on,
Please remove your shoes and socks before replying in anger.

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at
Archived as: and on CD-ROM/ISBN 981-02-3594-1