SGML the next big thing?
Liam R. E. Quin
liamquin at interlog.com
Sat Dec 4 04:23:29 GMT 1999
On Fri, 3 Dec 1999, Lauren Wood wrote:
> On 3 Dec 99, at 12:14, Arnold, Curt wrote:
>> It looks like the XML Schema group is trying to add back the & construct.
>> If you have a compelling justification for continued suppression, please
>> rant long and loud.
>
> How about every SGML parser author I've talked to says the &
> construct was the biggest, hardest part (which means probably the
> buggiest) of the entire parser? I think the XML WG was right in
> throwing it out of XML in the first place.
If this is about & in content models, I think
(1) Lauren is right, because as SGML specified them, they were very
hard to get right.
This & thing is so far outside the way most other computer languages
work that standard off-the-shelf parser generators roll on their
backs and wave their paws in the air and admit defeat.
(2) The idea of saying, "this element must contain at least one of each of
the following elements" is a useful one, and is very different from
the & construct.
A simplified, regularised form of & might be possible.
(3) The & connector interacts with #PCDATA to form pernicious content
models (see below). The XML WG went to great lengths to make sure
that no valid XML document suffers from this SGML bogosity. Similar
lengths are needed for "&".
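To make point (2) concrete, here is a minimal sketch in Python of the simpler "at least one of each" constraint. It is not SGML's & semantics. The function name has_one_of_each and the rule that only the listed elements may appear are assumptions made for illustration.

```python
def has_one_of_each(children, required):
    """Check the simplified rule: every required element name occurs at
    least once among the children, in any order and any number of times.

    Assumption for this sketch: no element outside `required` is allowed.
    """
    required = set(required)
    seen = set(children)
    return required <= seen and seen <= required


# Hypothetical element names, chosen only for the example.
ok = has_one_of_each(["mud", "dirt", "mud", "noise"], ["noise", "dirt", "mud"])
missing = has_one_of_each(["mud", "dirt"], ["noise", "dirt", "mud"])
```

A check like this needs only a set of names seen so far, which is why it is so much easier to implement than the full & connector.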
Note:
For those who aren't familiar with it, & is the content model connector in
SGML that says that in order to match a & b & c ..., every content
fragment a, b, etc., must be satisfied, in any order, and nothing must be
left over.
Furthermore, there must be exactly one way to satisfy the expression,
as otherwise it is "ambiguous" and illegal, just as
(a, b?) | a
is illegal in SGML, even though it is a perfectly sensible and valid
regular expression for the rest of the world of computing :-)
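A rough Python sketch of why (a, b?) | a trips the rule: both branches of the choice can start with a, so a parser looking at one token cannot tell which branch it is in. The tuple encoding of the model and the helper names (nullable, first, alt_conflicts) are assumptions made for this example, and the check covers only this one kind of ambiguity, not the full rule in the SGML standard.

```python
import re


def nullable(m):
    """True if the model can match an empty sequence."""
    kind = m[0]
    if kind == "elem":
        return False
    if kind == "opt":
        return True
    if kind == "seq":
        return all(nullable(p) for p in m[1])
    if kind == "alt":
        return any(nullable(p) for p in m[1])
    raise ValueError(kind)


def first(m):
    """Set of element names that can begin a match of the model."""
    kind = m[0]
    if kind == "elem":
        return {m[1]}
    if kind == "opt":
        return first(m[1])
    if kind == "alt":
        return set().union(*(first(p) for p in m[1]))
    if kind == "seq":
        out = set()
        for p in m[1]:
            out |= first(p)
            if not nullable(p):
                break
        return out
    raise ValueError(kind)


def alt_conflicts(m):
    """Names that could start more than one branch of an 'alt' node."""
    seen, clash = set(), set()
    for branch in m[1]:
        f = first(branch)
        clash |= seen & f
        seen |= f
    return clash


# (a, b?) | a
model = ("alt", [("seq", [("elem", "a"), ("opt", ("elem", "b"))]),
                 ("elem", "a")])
conflicts = alt_conflicts(model)

# The same expression is an ordinary regular expression elsewhere.
as_regex = re.compile(r"(?:ab?)|a")
accepted = [s for s in ("a", "ab", "b") if as_regex.fullmatch(s)]
```

Here conflicts is non-empty because a starts both branches, while the regular expression engine accepts the pattern without complaint.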
Consider the following SGML declaration (with OMITTAG NO):
<!ELEMENT boy
(noise & (dirt,mud)+ & (mud,shoes,trouble)* & #PCDATA) +smell
>
This is a "pernicious" mixed content model, and can only have
white space in it between elements once, since that uses up the
#PCDATA content model fragment.
The following is (let's say for the sake of argument) a valid boy:
mud,smell,shoes,trouble,dirt,mud,dirt,mud,noise,smell
If you try and match this against the content model I gave, you'll
see that you can't do it with LL(1) or LALR(1) directly unless
you build a DFA with a rather large number of states. I added the
inclusion +smell, but you could change the content model to be
(boy-model | smell)*
to have an even more interesting time of it.
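For anyone who wants to check the example by machine, here is a small backtracking matcher in Python. It is only a sketch and not how an SGML parser works: a real parser has to be deterministic, whereas this one simply tries every order of the & members. The tuple encoding and the function names are made up for the example, character data is not modelled (so #PCDATA matches nothing), and the +smell inclusion is handled by dropping smell before matching.

```python
def ends(model, toks, i):
    """Return every position j such that model matches toks[i:j]."""
    kind = model[0]
    if kind == "elem":
        return {i + 1} if i < len(toks) and toks[i] == model[1] else set()
    if kind == "pcdata":
        return {i}  # text is not modelled in this sketch
    if kind == "seq":
        positions = {i}
        for part in model[1]:
            positions = {j for p in positions for j in ends(part, toks, p)}
        return positions
    if kind in ("star", "plus"):
        seen = set(ends(model[1], toks, i))
        frontier = set(seen)
        while frontier:  # repeat the body until no new end position appears
            nxt = {j for p in frontier
                   for j in ends(model[1], toks, p) if j > p} - seen
            seen |= nxt
            frontier = nxt
        if kind == "star":
            seen.add(i)  # zero repetitions
        return seen
    if kind == "and":
        return and_ends(tuple(model[1]), toks, i)
    raise ValueError(kind)


def and_ends(remaining, toks, i):
    """Match each remaining & member exactly once, trying every order."""
    if not remaining:
        return {i}
    result = set()
    for k, member in enumerate(remaining):
        rest = remaining[:k] + remaining[k + 1:]
        for j in ends(member, toks, i):
            result |= and_ends(rest, toks, j)
    return result


def matches(model, toks, inclusions=()):
    """True if the whole token list matches, ignoring included elements."""
    core = [t for t in toks if t not in inclusions]
    return len(core) in ends(model, core, 0)


# (noise & (dirt,mud)+ & (mud,shoes,trouble)* & #PCDATA) +smell
BOY_MODEL = ("and", [
    ("elem", "noise"),
    ("plus", ("seq", [("elem", "dirt"), ("elem", "mud")])),
    ("star", ("seq", [("elem", "mud"), ("elem", "shoes"),
                      ("elem", "trouble")])),
    ("pcdata",),
])
BOY = "mud,smell,shoes,trouble,dirt,mud,dirt,mud,noise,smell".split(",")
boy_is_valid = matches(BOY_MODEL, BOY, inclusions={"smell"})
```

The matcher accepts the boy above, but it pays for that by exploring the possible orders of the & members. A deterministic automaton instead has to remember which members it has already seen, and the number of such combinations grows quickly with the number of members, which is one way to see where the large DFA comes from.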
--
Liam Quin, Barefoot Computing, Toronto; The barefoot agitator
l i a m at h o l o w e b dot n e t <-- NEW ADDRESS
Ankh on irc.sorcery.net, http://www.valinor.sorcery.net/~liam/
Please remove your shoes and socks before replying in anger.
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)