CDATA Section
Gordon, Simon
Simon.Gordon at swi.galileo.com
Thu Oct 28 23:07:34 BST 1999
Any ideas? Well, sort of...
We have two or perhaps three requirements when trying to transfer binary data
within XML elements:
1) Transfer should not break XML's character set restrictions
2) Transfer should be economical with bandwidth
3) The encoding/decoding should be cognisant of the character encoding in
use
1 is a given; it's why this discussion got started. 2 and 3 are less so and
more of a desire on my part to see an elegant and economic solution that
lies within XML itself. For instance, BASE64 is _OK_ when encoding binary
data into 8-bit characters but what happens when we have 16-bit Unicode
characters? The (in)efficiency goes from 133% to 266%. If the encoder knew
the character set in use, it could send data more efficiently without
relying on a compression algorithm. In fact, XML declares only 29 of the 256
8-bit characters to be illegal so we could pass the 227 legal ones straight
through and only escape the remaining 29. Under Unicode we have 63457 legal
and 2079 illegal characters. Burying this inside of the XML parser (which
does have full knowledge of the character encoding) would make it invisible
to end-users. Attributes in the XML namespace could control the
encode/decode behavior. In fact, this corresponds with the concept outlined
in the latest XML Schemas specification (Part 2, Section 3.2.9) where they
talk about an encoding facet**.
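
The legal/illegal counts quoted above can be checked with a short Python sketch. It assumes the ranges of the XML 1.0 "Char" production and treats a 16-bit character as a single 16-bit value, so surrogates count as illegal; the function name is purely illustrative.

```python
# Legal character ranges from the XML 1.0 "Char" production.
XML_CHAR_RANGES = [
    (0x9, 0xA),          # TAB, LF
    (0xD, 0xD),          # CR
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def count_legal(code_space_size):
    """Count XML-legal values among the first code_space_size values."""
    total = 0
    for low, high in XML_CHAR_RANGES:
        high = min(high, code_space_size - 1)  # clip to the code space
        if low <= high:
            total += high - low + 1
    return total


for bits in (8, 16):
    size = 1 << bits
    legal = count_legal(size)
    print(f"{bits}-bit: {legal} legal, {size - legal} illegal")
```

Under those assumptions the sketch gives 227 legal and 29 illegal values for 8 bits, and 63457 legal and 2079 illegal values for 16 bits.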
Assuming a totally random input stream, the inefficiencies drop to 111% for
8-bit characters and only 109% for 16-bit characters, ignoring any overhead
in controlling XML attributes and assuming a 1:2 expansion ratio for
illegal characters. With this sort of overhead, it now makes economic sense
to compress large files, then apply this sort of encoding without worrying
too much about the gain of the compression being lost in the encoding. Seen
XMLZIP?
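
As a rough check of these figures, the sketch below computes the expected expansion for uniformly random input under the simplest reading of the scheme: legal values pass through unchanged, illegal values become two output characters, and any cost of escaping the escape character itself is ignored. The function name and the test payload are illustrative only. It also measures base64 for comparison.

```python
import base64


def escape_expansion(legal, illegal, escaped_length=2):
    """Average output characters per input value for uniformly random input."""
    total = legal + illegal
    return (legal + escaped_length * illegal) / total


print(f"8-bit escaping:  {escape_expansion(227, 29):.1%}")
print(f"16-bit escaping: {escape_expansion(63457, 2079):.1%}")

# Baseline: base64 carried in 8-bit and in 16-bit characters.
payload = bytes(range(256)) * 12
encoded = base64.b64encode(payload)
as_utf16 = encoded.decode("ascii").encode("utf-16-le")
print(f"base64, 8-bit chars:  {len(encoded) / len(payload):.1%}")
print(f"base64, 16-bit chars: {len(as_utf16) / len(payload):.1%}")
```

This simple model reproduces the 111% figure for 8-bit characters and the 133%/266% base64 figures. For 16-bit characters it comes out lower than the 109% quoted above, so that estimate presumably rests on assumptions that are not spelled out in the post.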
I'd put in more but I don't want to post too much in one go and besides, I'm
sure this sort of thing must've been proposed before (but then, why are we
still stuck with using base64 or &#xx; to send binary data?)
** Couldn't this be the URL of a translation service? Just send it the
stream to en/decode or request the applet - A Web-centric application as
proposed by Tim O'Reilly recently?
Regards,
Simon Gordon
Systems Engineer,
Systems Integration,
Galileo International, Denver, USA.
-----Original Message-----
From: John Evdemon [mailto:JohnE01 at xmls.com]
Sent: Wednesday, October 27, 1999 11:15
To: Gordon, Simon; xml-dev at ic.ac.uk
Subject: RE: CDATA Section
I've run into a similar issue with CDATA, although we were transferring
mainframe reports within XML, not binary data. The suggested workaround was
base64 -- I would love to see something more elegant. Any ideas?
John Evdemon
Architect
XML Solutions
http://www.xmls.com
-----Original Message-----
From: owner-xml-dev at ic.ac.uk [mailto:owner-xml-dev at ic.ac.uk]On Behalf Of
Gordon, Simon
Sent: Wednesday, October 27, 1999 11:48 AM
To: xml-dev at ic.ac.uk
Subject: RE: CDATA Section
Thanks for the info. I guess I was trying to mix the ATTLIST and ELEMENT
syntax. The !ATTLIST declaration allows CDATA but then doesn't seem to use
it in the same way as when you specify <!ELEMENT x (#PCDATA)> then use
<![CDATA[...]]> in the XML. Most confusing.
Now for the next question; binary data and CDATA sections. According to Tim
Bray's annotated XML spec., I can use a CDATA section to send binary data
yet I can't get it to work. [The annotation link is the last in the first
paragraph of section 2.7 CDATA Sections].
This (IMHO) seems to contradict the XML spec, which defines the content of a
CDATA section as Chars, which are in turn defined as TAB, CR, LF and 0x20
upwards (with some exclusions). That is, not all possible binary values. I've
checked this using the RXP parser (good, very good) and it does reject any
values outside the defined ranges. Big disappointment.
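
The same behaviour can be seen with other parsers. The sketch below uses Python's standard xml.etree.ElementTree module (which is built on expat) rather than RXP, so it only illustrates the rule and does not reproduce the test described above; the helper name is made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if the document is accepted as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Ordinary text inside a CDATA section is accepted.
print(parses("<x><![CDATA[plain text]]></x>"))

# A control character outside the Char production is rejected,
# even though it sits inside a CDATA section.
print(parses("<x><![CDATA[\x01]]></x>"))
```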
Now I'll have to look at using base64 to exchange binary data with our
vendors; we'll all have to implement an encoding/decoding scheme, probably
involving attributes and NOTATIONs and a load more work. Unless anyone has a
better idea?
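
For reference, a minimal sketch of the base64 route in Python. The element name "payload" and the "encoding" attribute are hypothetical, chosen only for illustration; a real exchange format would have to agree these with the other party.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(payload):
    """Serialise bytes as base64 text inside a (hypothetical) payload element."""
    element = ET.Element("payload", {"encoding": "base64"})
    element.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(element, encoding="unicode")


def unwrap_binary(document):
    """Recover the original bytes from a document made by wrap_binary."""
    element = ET.fromstring(document)
    if element.get("encoding") != "base64":
        raise ValueError("unexpected encoding")
    return base64.b64decode(element.text or "")


raw = bytes(range(256))          # every possible byte value
document = wrap_binary(raw)
print(unwrap_binary(document) == raw)
```

The base64 alphabet contains no characters that need escaping in XML, so the text can go in ordinary element content without a CDATA section.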
Regards,
Simon Gordon
-----Original Message-----
From: John Cowan [mailto:cowan at locke.ccil.org]
Sent: Wednesday, October 20, 1999 13:23
To: Simon.Gordon at swi.galileo.com
Cc: xml-dev at ic.ac.uk
Subject: Re: CDATA Section
Gordon, Simon scripsit:
> > <?xml version='1.0'?>
> > <!DOCTYPE TEST [
> > <!ELEMENT TEST (CDATA)>
The problem is with this line, which says that TEST elements must contain
a single *element* whose name is "CDATA". This has nothing to do with
CDATA sections.
> > Warning: CDATA section not allowed here
> > in unnamed entity at line 5 char 16 of file:test.xml
Quite right; your document is invalid because your TEST element
contains character data instead of an element named CDATA.
> > PS. Just tried <!ELEMENT TEST (#PCDATA)> and removing the DOCTYPE section
> > altogether. The former passes the validation test and the latter passes
> > the well-formedness test (rxp -xs test.xml)!
And rightly so. You cannot *compel* the content of an element to be
a CDATA section or not by using DTD-based validation. Elements declared
with #PCDATA content may express that content with or without
CDATA sections.
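
A small Python sketch of that last point, using the standard xml.etree.ElementTree module. It is a non-validating parser, so the DTD is left out here and only the equivalence of the two forms is shown; the sample text is made up.

```python
import xml.etree.ElementTree as ET

# The same character data, once in a CDATA section and once with escapes.
with_cdata = ET.fromstring("<TEST><![CDATA[a < b && c]]></TEST>")
escaped = ET.fromstring("<TEST>a &lt; b &amp;&amp; c</TEST>")

# Both forms yield identical element content to the application.
print(with_cdata.text == escaped.text)
```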
--
John Cowan cowan at ccil.org
I am a member of a civilization. --David Brin
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN
981-02-3594-1
To unsubscribe, mailto:majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following
message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)