XML-LINK implementation

Peter Murray-Rust Peter at ursus.demon.co.uk
Sat Apr 12 19:41:38 BST 1997

In message < at jclark.com> James Clark writes:
> At 21:31 11/04/97 GMT, Peter Murray-Rust wrote:
> One might think so, but since C has mixed content and no white-space in
> mixed content is automatically ignored, the white-space following the D and
> E elements will be data and hence constitute pseudo-elements.  Thus
> ID(F)PREVIOUS(-2) will actually designate E.
Oh dear!! I wasn't even thinking of that problem.  I was concerned about the
words 'elder siblings', which seem to make no sense to me although they are
taken verbatim from Chapter 14 of TEI.
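James's point can be reproduced with any parser that reports white space in
mixed content faithfully.  A minimal sketch in Python - the C/D/E/F element
names follow the example under discussion, and `xml.dom.minidom` merely stands
in for an XML-LINK-aware processor:

```python
from xml.dom.minidom import parseString

# Pretty-printed mixed content: because C's content model allows
# #PCDATA, the newlines and indentation between D, E and F are
# data, not ignorable white space -- they become text nodes.
doc = parseString("<C>\n  <D/>\n  <E/>\n  <F/>\n</C>")
f = doc.getElementsByTagName("F")[0]

# Counting back two siblings from F (cf. ID(F)PREVIOUS(-2)) steps
# over a white-space pseudo-element and lands on E, not D.
two_back = f.previousSibling.previousSibling
print(two_back.nodeName)
```

Stepping back once from F gives a text node of white space; stepping back
twice gives E - exactly the behaviour James describes.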

As a webhacker the whitespace problem concerns me greatly.  This isn't even
the 'pernicious mixed content' currently being discussed on c.t.s.  There
are at least three areas where it will bite people.
	- authors.  On c.t.s. someone (Joe English?) said that everyone gets
	  bitten by this when writing a DTD, and then they learn.  Admittedly
	  this isn't the same problem, but assuming DTD creators allow mixed
	  content (I don't, except as in HTML 2.0) most *document* authors will
	  certainly fail to understand.  (In fact I pretty-printed the
	  example simply to make it readable!)  The problem here is that one
	  bite will put them off XML completely - they won't have a clue
	  what's going on.
	- parser writers.  I am not clear at present whether NXP and Lark, for
	  example, give the same output for all possible combinations of
	  whitespace.  It's my impression that there are still some unsolved
	  problems here (or at least the conventions are not completely 
	  finalised).  Of course this is still a 'work in progress'...
	  BUT I think it's very important to note that while this problem
	  remains, the claim 'it should be easy to write programs which
	  process XML documents [correctly]' is not true.  The mythical CS
	  student has a good chance of getting this wrong at present.  Indeed,
	  at present I'd say it was impossible for anyone who wasn't highly
	  competent in SGML to write a correct parser.
	- search implementers and authors.  If the parser-writer has done a 
	  correct job, then the implementer of the algorithm shouldn't have a 
	  problem, so long as they all interpret 5.3 in the same way.  (My 
	  problem was that I implicitly parsed the example incorrectly).  So
	  the *author of queries* is again the one who will come to grief on
	  this unless they understand it.  And it will be a mysterious and 
	  probably undebuggable problem for them.  Indeed if the result of 
	  a search is a newline character which they didn't realise was part
	  of the data, then it won't 'show up' on the display.  They'll think
	  there's a bug in the software.
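The last point is easy to demonstrate: a white-space 'hit' prints as nothing
at all, and only something like Python's repr() reveals what was actually
matched.  A sketch, again with `xml.dom.minidom` standing in for the search
tool:

```python
from xml.dom.minidom import parseString

# The first child of C is a pseudo-element of pure white space.
doc = parseString("<C>\n  <D/>\n  <E/>\n</C>")
hit = doc.documentElement.firstChild

print(hit.data)        # appears blank on the display
print(repr(hit.data))  # reveals the '\n  ' that was really matched
```

A user who sees the first line of output will conclude, quite reasonably,
that the search returned nothing - or that the software is broken.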

I can't see a simple solution to this for DTDs which allow mixed content.  (It's
because of this that CML has no mixed content - all #PCDATA is held in special
containers.)  In general terms the options seem to be the following.
	- keep document authors and readers as far away from the source as 
	  possible by using authoring tools which manage this problem.  This
	  would be a great pity as it would stifle the creativity that we've 
	  seen in HTML.  And you might as well use SGML.
	- forbid/discourage mixed-content DTDs.  This would make XML very
	  clunky for most text-based applications.
	- firm up the conventions for DEFAULT|PRESERVE and build them more
	  formally into such operations as the search procedure.  This seems
	  to be the only way forward I can see: whenever mixed content is
	  being processed, the tools must be critically aware of this
	  problem, and perhaps mandatory syntax is required.
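For reference, the draft's existing hook for DEFAULT|PRESERVE is an attribute
declaration on the element; something like the following is the kind of
mandatory syntax I mean (the `para` element and its content model are
illustrative only, not from the spec):

```
<!ELEMENT para (#PCDATA | emph)*>
<!ATTLIST para XML-SPACE (DEFAULT|PRESERVE) "DEFAULT">
```

The open question is whether tools such as the search procedure are required
to honour this declaration, or merely permitted to.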


Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)
