XML end-of-line + Entity expansion (Was: Re: Unix/Java design issues)
Nik O
niko at cmsplatform.com
Mon Jul 26 23:16:45 BST 1999
Hunter, David wrote:
>Tim Bray comments on this in the Annotated XML spec (at
> http://www.xml.com/axml/axml.html):
I'm soooo embarrassed! I've cited Tim's annotated spec several times in
other contexts herein. Yet, for some reason this time i went directly to
the W3C spec, and didn't cross-check the former. Mea culpa.
><quote>
>Line End Trade-Offs
>[snip]
>
>But as a programmer using an XML processor, you can count on never seeing
>anything but a single line-feed character separating lines. This means your
>code will run anywhere.
>
>[snip] ..but it's too late for that now.
></quote>
At the risk of hubris, i have to disagree with Tim that reporting all line
delimiters as a single LF means "code will run anywhere". Yes, the
(simple) code will run in isolation -- but the XML is converted into
something that can't be used (by native tools) without further conversion.
I guess i'm saying that i wish that this transform were under application
control and/or could be suppressed. This way, XML processors would not have
to keep converting data back to its original form at every step of the way,
just to preserve editable data.
Using Expat, i've had to resort to some ugly little kludges to preserve the
system-specific end-of-line strings. Or perhaps it would be appropriate to
report the end-of-line as a special event, rather than just another bit of
character data. This way the XML 1.0 end-of-line normalization would be
preserved, but those processors that need to preserve the original data
could do so...
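To make the kludge concrete, here is a minimal sketch using Python's binding to Expat (xml.parsers.expat). It is not the code i actually use; the helper names parse_text, detect_eol and restore_eol are made up for illustration. It assumes an ASCII-compatible encoding and a document that uses one line-ending style throughout, since mixed endings cannot be recovered this way.

```python
import xml.parsers.expat


def parse_text(raw: bytes) -> str:
    """Return the character data Expat reports for a document."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat may deliver text in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(raw, True)
    return "".join(chunks)


def detect_eol(raw: bytes) -> str:
    """Guess the native line ending from the raw bytes (hypothetical helper)."""
    if b"\r\n" in raw:
        return "\r\n"
    if b"\r" in raw:
        return "\r"
    return "\n"


def restore_eol(text: str, eol: str) -> str:
    """Undo the XML 1.0 normalization by re-applying the detected ending."""
    return text.replace("\n", eol)


dos_doc = b"<doc>one\r\ntwo\r\n</doc>"
normalized = parse_text(dos_doc)                         # CRLF arrives as LF
restored = restore_eol(normalized, detect_eol(dos_doc))  # CRLF put back
```

The detection has to happen outside the parser, on the raw bytes, which is exactly the kind of extra step i would rather not need.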
<flame_me_bait>
Is it truly too late? I'm assuming that there may well be an XML 1.1 (or
2.0) necessary to resolve some other issues (e.g. namespaces, XLink,
proprietary XSL implementations, etc.) before everything settles down.
Would that be an opportunity to address this and other such issues (see
below)?
</flame_me_bait>
=======
A related topic is the expansion of general (symbol) entities by
XML-compliant parsers. In my earlier ignorance, i'd thought that
"well-formed" pertained strictly to matters of XML syntax. Yet, parsers
(correctly) choke on "undefined" entities (e.g. HTML's "&bull;" or
"&copy;"). Why is it that a
non-validating XML parser must "validate" such entities, but not element
tags? I realise that i'm presenting the most simplistic use of entities --
parser-based entity expansion is surely useful for more sophisticated (e.g.
non-symbol, nested, parameter) entities. However, IMHO, much use of the XML
entity feature will be similar to HTML's use of same -- symbolic constants
and special characters. Perhaps there should be a sub-class of entities
specifically for this purpose, with a non-validating XML parser checking for
valid syntax, but not expanding the entity string.
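The asymmetry is easy to reproduce. The sketch below again assumes Python's xml.parsers.expat binding; parse_status is a hypothetical helper that returns "ok" or Expat's error string. An element name nobody declared parses fine, while an undeclared entity is a well-formedness error unless the document declares it.

```python
import xml.parsers.expat
from xml.parsers.expat import ExpatError


def parse_status(raw: bytes) -> str:
    """Parse a document; return "ok" or Expat's description of the error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(raw, True)
    except ExpatError as err:
        return xml.parsers.expat.ErrorString(err.code)
    return "ok"


# An element that is declared nowhere is accepted.
unknown_element = parse_status(b"<p><anything/></p>")

# An HTML-style entity with no declaration is rejected.
unknown_entity = parse_status(b"<p>&copy; 1999</p>")

# Declaring the entity in the internal subset makes the same text acceptable.
declared_entity = parse_status(
    b'<!DOCTYPE p [<!ENTITY copy "&#169;">]><p>&copy; 1999</p>'
)
```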
=======
When processing XML data that must remain intact and/or system-specific, it
is necessary to convert normalized end-of-lines and expanded entities back
into their original form. In both these cases, the parser is actually
creating more work for the processor than if the data were simply passed,
unmodified by the parser.
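For entities, the conversion back might look something like the sketch below. It is only an illustration: reentify is a made-up name, and it borrows the HTML entity table from Python's html.entities module, which is an assumption about which names the original document used.

```python
from html.entities import codepoint2name


def reentify(text: str) -> str:
    """Replace non-ASCII characters with HTML-style entity references.

    Characters without a known name, and all ASCII characters, pass through.
    """
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append("&" + name + ";")
        else:
            out.append(ch)
    return "".join(out)


rebuilt = reentify("\u00a9 1999 \u2022 item")
```

Note that this is a guess, not a true round trip: a copyright sign typed literally in the source comes out as "&copy;" too, because once the parser has expanded the entity there is no record of which form the author wrote.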
Regards,
-Nik O, Content Mgmt Solutions, Jackson, Wyo.
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)