XML-LINK implementation
Peter Murray-Rust
Peter at ursus.demon.co.uk
Sat Apr 12 19:41:38 BST 1997
In message <2.2.32.19970412033137.00f0f4e0 at jclark.com> James Clark writes:
> At 21:31 11/04/97 GMT, Peter Murray-Rust wrote:
>
[...]
>
> One might think so, but since C has mixed content and no white-space in
> mixed content is automatically ignored, the white-space following the D and
> E elements will be data and hence constitute pseudo-elements. Thus
> ID(F)PREVIOUS(-2) will actually designate E.
>
Oh dear!! I wasn't even thinking of that problem. I was concerned about the
words 'elder siblings' in 5.3.4.5, which seem to make no sense to me although
they are taken verbatim from Chapter 14 of TEI.
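To make James's point concrete, here is a minimal sketch. Python and
xml.dom.minidom are simply a convenient whitespace-preserving parser to
demonstrate with, and the C/D/E/F document below is my own reconstruction
of the example, not text taken from the spec or the thread:

from xml.dom.minidom import parseString

doc = parseString("""<C>
  <D/>
  <E/>
  <F/>
</C>""")

c = doc.documentElement
# In mixed content nothing may be thrown away, so the newlines and
# indentation between D, E and F survive as text ('pseudo-element')
# children of C alongside the three elements.
for node in c.childNodes:
    print(node.nodeType, repr(getattr(node, "data", node.nodeName)))

# Once the whitespace nodes count, the second previous sibling of F is
# E, not the D a naive reading of ID(F)PREVIOUS(-2) would expect:
f = doc.getElementsByTagName("F")[0]
two_back = f.previousSibling.previousSibling   # whitespace node, then E
print(two_back.nodeName)                       # -> 'E'

Note that the pretty-printing which makes the document readable is exactly
what creates the extra nodes.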
As a webhacker I am greatly concerned by the whitespace problem. This isn't
even the 'pernicious mixed content' being discussed on c.t.s. at present.
There are at least three areas where it will bite people.
- authors. On c.t.s. someone (Joe English?) said that everyone gets
  bitten by this when writing a DTD, and then they learn. Admittedly
  this isn't quite the same problem, but assuming DTD creators allow
  mixed content (I don't, except as in HTML 2.0), most *document*
  authors will certainly fail to understand it. (In fact I
  pretty-printed the example simply to make it readable!) The problem
  here is that one bite will put them off XML completely - they won't
  have a clue what's going on.
- parser writers. I am not clear at present whether NXP and Lark, for
  example, give the same output for all possible combinations of
  whitespace. My impression is that there are still some unsolved
  problems here (or at least that the conventions are not completely
  finalised). Of course this is still a 'work in progress'...
  BUT I think it is important to say that, while this problem remains,
  'it should be easy to write programs which process XML documents
  [correctly]' is not true. The mythical CS student has a good chance
  of getting this wrong, and at present I'd say it is impossible for
  anyone who isn't highly competent in SGML to write a correct parser.
  (The sketch after this list shows how easily two parsers can model
  the same whitespace differently.)
- search implementers and authors. If the parser writer has done a
  correct job, then the implementer of the algorithm shouldn't have a
  problem, so long as they all interpret 5.3 in the same way. (My
  problem was that I implicitly parsed the example incorrectly.) So
  the *author of queries* is again the one who will come to grief on
  this unless they understand it, and it will be a mysterious and
  probably undebuggable problem for them. Indeed, if the result of a
  search is a newline character which they didn't realise was part of
  the data, it won't 'show up' on the display, and they'll think there
  is a bug in the software (see the sketch just after this list).
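To illustrate the last two points, here is a small sketch. Python's minidom
and ElementTree merely stand in for two independently written parsers (I am
not claiming NXP or Lark behaves like either of them): they model the same
whitespace quite differently, and a whitespace 'hit' from a search prints as
nothing until you ask for its repr().

from xml.dom.minidom import parseString
import xml.etree.ElementTree as ET

SRC = "<C>\n  <D/>\n  <E/>\n  <F/>\n</C>"

# One parser keeps the whitespace as separate text-node children of C...
dom_children = parseString(SRC).documentElement.childNodes
print(len(dom_children))          # 7: three elements plus four text nodes

# ...the other hangs the same characters off .text and .tail, so C
# appears to have exactly three children.
et_root = ET.fromstring(SRC)
print(len(list(et_root)))         # 3

# A query whose answer happens to be one of the whitespace nodes
# 'shows' nothing on the screen:
result = dom_children[0].data     # the newline + indent before D
print("result:", result)          # looks empty
print("result:", repr(result))    # repr() reveals '\n  '

Two perfectly conscientious implementers can hand the query author different
answers, and the 'wrong' answer is invisible on the display.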
I can't see a simple solution to this for DTDs which allow mixed content.
(It's because of this that CML has no mixed content - all #PCDATA is held in
special containers.) In general terms the options seem to be the following:
- keep document authors and readers as far away from the source as
possible by using authoring tools which manage this problem. This
would be a great pity as it would stifle the creativity that we've
seen in HTML. And you might as well use SGML.
- forbid/discourage mixed-content DTDs. This would make XML very
  clunky for most text-based applications.
- firm up the conventions for DEFAULT|PRESERVE and build them more
  formally into such operations as the search procedure. This seems to
  be the only way forward that I can see: when mixed content is being
  processed, at any stage, the tools must be critically aware of this
  problem, and maybe mandatory syntax is required. (A rough sketch of
  what such a convention might look like follows below.)
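For what it is worth, here is a rough sketch of the sort of convention I
mean: a normalisation pass, run before any search, that discards
whitespace-only text nodes only where the nearest enclosing XML-SPACE setting
is DEFAULT. (Python again; the attribute spelling, the inheritance rule and
the policy itself are my assumptions for illustration, not anything the
draft mandates.)

from xml.dom.minidom import parseString, Node

def strip_ignorable_whitespace(node, preserve=False):
    """Drop whitespace-only text children unless PRESERVE is in force."""
    if node.nodeType == Node.ELEMENT_NODE:
        setting = node.getAttribute("XML-SPACE").upper()
        if setting == "PRESERVE":
            preserve = True
        elif setting == "DEFAULT":
            preserve = False
    for child in list(node.childNodes):
        if (child.nodeType == Node.TEXT_NODE
                and not preserve
                and child.data.strip() == ""):
            node.removeChild(child)
        else:
            strip_ignorable_whitespace(child, preserve)
    return node

doc = parseString(
    '<C>\n  <D/>\n  <E XML-SPACE="PRESERVE">   </E>\n  <F/>\n</C>')
strip_ignorable_whitespace(doc.documentElement)
print(doc.documentElement.toxml())
# roughly: <C><D/><E XML-SPACE="PRESERVE">   </E><F/></C>

Run over a tree normalised like this, sibling counting behaves the way most
query authors expect, while a PRESERVE region keeps its data intact.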
P.
--
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message:
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)