Mixed content considered harmful...
paul at prescod.net
Tue May 11 21:53:31 BST 1999
John Cowan wrote:
> Can you sketch an algorithm that will convert SGML-style (or &-less
> SGML-style) content models involving #PCDATA into content models
> involving #PCDATA and #WS, where #WS is a data type that matches
> only white space, such that random white space around tags will be properly
> accounted for?
Thanks for asking.
I don't think that you would convert the content models. You leave the
content models alone and just change your matching algorithm slightly.
#PCDATA is a token that matches any character data. Given A,#PCDATA,B,
#PCDATA matches the longest stretch of character data between A and B.
#WS matches a stretch of whitespace.
When you are parsing, you always try to match (all) characters against
#PCDATA. If that fails AND the characters are whitespace then you ignore
or suppress them. If it fails but the characters are NOT whitespace then
of course you have a validity error.
Token                 Text     Result
#PCDATA               "abc"    "abc"
#PCDATA               " "      " "
#PCDATA not allowed   "abc"    ERROR
#PCDATA not allowed   " "      "ignorable:[ ]"
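To make the rule concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the names classify_text, ValidityError and XML_WHITESPACE are made up for this example, and whether #PCDATA is allowed at the current point in the content model is assumed to be known already and passed in as a flag.

```python
# Hypothetical sketch of the matching rule described above.
# The four characters XML treats as white space.
XML_WHITESPACE = frozenset(" \t\r\n")


class ValidityError(Exception):
    """Raised when non-whitespace character data appears where it is not allowed."""


def classify_text(text, pcdata_allowed):
    """Classify a run of character data seen by the parser.

    pcdata_allowed is an assumed input: True if the content model
    accepts #PCDATA at this point, False otherwise.
    """
    if pcdata_allowed:
        # #PCDATA matches any character data, white space included.
        return ("data", text)
    if all(ch in XML_WHITESPACE for ch in text):
        # Matching against #PCDATA failed, but the run is only white space: ignore it.
        return ("ignorable", text)
    # Matching failed and the run is not white space: validity error.
    raise ValidityError(f"character data {text!r} is not allowed here")
```

Each row of the table corresponds to one call: classify_text("abc", True), classify_text(" ", True), classify_text("abc", False) and classify_text(" ", False).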
The only danger is if you put datatype nodes beside each other or datatype
nodes beside #PCDATA. Then you could have problems with ambiguity in the
formal-grammar sense of the word (which IS a real problem). We could
handle this by disallowing content models that allow datatypes to be
adjacent or by requiring schema processors to detect and report a possible
ambiguity based on the actual definitions of the datatype.
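The first option (disallowing adjacency) could be checked roughly as follows. This is a simplified sketch, not a full solution: it assumes the content model is a flat sequence of token names, so it ignores choice groups and occurrence indicators, and the datatype names in DATA_TOKENS are hypothetical.

```python
# Hypothetical set of tokens that match character data.
# "#DATE" and "#INTEGER" are invented datatype names for illustration.
DATA_TOKENS = {"#PCDATA", "#DATE", "#INTEGER"}


def adjacent_data_pairs(sequence, data_tokens=DATA_TOKENS):
    """Return every pair of neighbouring tokens that both match character data.

    sequence is assumed to be a flat list of token names such as
    ["A", "#PCDATA", "B"]; nested groups are not handled.
    """
    return [
        (left, right)
        for left, right in zip(sequence, sequence[1:])
        if left in data_tokens and right in data_tokens
    ]
```

A schema processor taking this approach would reject any model for which the returned list is non-empty, for example ["A", "#DATE", "#INTEGER", "B"], while ["A", "#PCDATA", "B"] would pass.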
Paul Prescod - ISOGEN Consulting Engineer speaking for only himself
Diplomatic term: "Emerging Markets"
Translation: Poor countries. The great euphemism of the Asian financial
meltdown. Investors got much more excited when they thought
they could invest in up-and-comers than when they heard they could invest
in the Third World.(Brills Content, Apr. 1999)