Encoded XML Content

Wed Feb 11 08:00:40 GMT 1998

The discussion has covered some good points up to now. I'll try to
build on them and move forward.

Let's be clear about what we're trying to solve here. Unicode has
essentially solved the text problem. This note focuses on non-textual
data, or places where a different character encoding is required
inside your document.

For some applications, base64 will be easy to use. Binary data will
be present in particular locations in the XML tree, and the
applications will simply know to decode it. These don't really need
anything new, but will benefit if there is a common technique for
handling it. 
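
As a rough sketch of this simple case: the application knows in advance
that one particular element carries base64 and just decodes it. The
element names ('record', 'photo') and the helper read_photo are made up
for illustration; nothing here is part of any proposal.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical payload: the first four bytes of a JPEG file, base64-encoded.
PAYLOAD = base64.b64encode(b"\xff\xd8\xff\xe0").decode("ascii")
DOC = "<record><name>sample</name><photo>" + PAYLOAD + "</photo></record>"


def read_photo(xml_text):
    """Decode the <photo> element, which the application knows is base64."""
    root = ET.fromstring(xml_text)
    photo = root.find("photo")
    if photo is None or photo.text is None:
        return b""
    # Remove any whitespace or line breaks before decoding.
    return base64.b64decode("".join(photo.text.split()))


print(read_photo(DOC))
```

No attributes are needed here because the knowledge lives in the
application, which is exactly why this case breaks down for containers
whose content is not known until runtime.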

I think the real target is 'container' elements, where the designer
needs to allow for flexibility in content at runtime. It is possible
to do part of this with elements, but you run into two difficulties.
First, you eventually hit your non-text data, and you have to provide
some indication of the content and format. Second, you may have a real
need to allow for formats that have never been foreseen.

What we don't need to do is provide another mechanism for managing XML
markup and structure. XML parsers will not be asked to do anything
different. This is entirely about how developers will use XML's
features to resolve an often-encountered problem. (That's why this
still belongs on xml-dev.)

That said, the moment you move away from Unicode data content, you
face a number of issues. You will probably have to specify a wrapper
layer used to make the data XML-friendly. If that is removed, then you
will have to note what format or conventions apply to the next layer.
Ultimately you will reach either a text layer or a binary data layer,
which cannot be further unwrapped. That layer may need a descriptor,
to specify what type of data was carried with all this effort.

The question I still haven't completely resolved is whether there is a
need for an arbitrary number of layers, or whether three is sufficient:
the 'content encoding', 'content format', and 'content type'.
I'm not certain it's sufficient, but I can't see a use for much more
at the moment. (I'm not tightly attached to the labels, but I think
they work, and at least they're a start.) The most likely
implementations seem to be with these as attributes. Attributes that
are not present would have a default of a zero-length string.
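
To make the three layers concrete, here is a minimal sketch of how a
reader might peel them off, assuming they are carried as attributes named
content-encoding, content-format and content-type. The function name
unwrap is hypothetical, and passing the content-format label straight to
Python's codec lookup is a shortcut: labels such as ISO-10646-UCS-2 or
Commodore64 would need a mapping table of their own.

```python
import base64
import binascii
import xml.etree.ElementTree as ET


def unwrap(elem):
    """Peel the layers off a container; return (payload, content_type)."""
    # Attributes that are not present default to a zero-length string.
    encoding = elem.get("content-encoding", "")
    content_format = elem.get("content-format", "")
    content_type = elem.get("content-type", "")
    text = elem.text or ""

    # No encoding layer: the content is ordinary XML character data.
    if encoding in ("", "none"):
        return text, content_type

    compact = "".join(text.split())  # drop whitespace and line breaks
    if encoding == "base64":
        data = base64.b64decode(compact)
    elif encoding == "hex":
        data = binascii.unhexlify(compact)
    else:
        raise ValueError("unknown content-encoding: " + encoding)

    # A content-format means the bytes are text in that character encoding.
    # Assumption: the label is one Python's codec registry recognises.
    if content_format:
        return data.decode(content_format), content_type
    return data, content_type


payload = base64.b64encode("caf\u00e9".encode("iso-8859-1")).decode("ascii")
elem = ET.fromstring(
    '<container content-encoding="base64" content-format="ISO-8859-1" '
    'content-type="mime:text/plain">' + payload + "</container>"
)
print(unwrap(elem))
```

The content-type is returned untouched; it only describes what was
carried, and it is up to the application to act on it.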

Below, I've listed a number of items, in the interests of ensuring
that any proposed solution can handle them all. (Ultimately, such a
table would be useful to developers.)

What Is It?      Content   Content          Content
                 Encoding  Format           Type
--------------   --------  ---------------  -----------------
JPEG image       base64                     mime:image/jpeg
ASCII text       base64    ISO-8859-1       mime:text/plain
HTML text        base64    ISO-8859-1       mime:text/html
XML content
XML carried                                 xml:
XML carried      base64    ISO-10646-UCS-2  mime:text/xml
XML data only                               xml:pcdata
private data     hex                        x-private:somedata
private text     base64    Commodore64      x-private:sometext
embedded item    base64    ISO-8859-1       rfc:822
embedded item    base64                     mime:application/x-zip

I thought about separating content-type from the content-domain, but
I can't see that you would specify them separately all that often.
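
If a consumer does want the domain on its own, it can be recovered from
the combined value. The sketch below assumes the 'domain:name' layout
used in the table above and splits on the first colon only; the function
name split_content_type is made up for illustration.

```python
def split_content_type(value):
    """Split 'domain:name' into (domain, name) at the first colon."""
    domain, sep, name = value.partition(":")
    if not sep:
        # No colon at all: treat the whole value as a name with no domain.
        return "", value
    return domain, name


for sample in ("mime:image/jpeg", "xml:", "rfc:822", "x-private:somedata"):
    print(sample, "->", split_content_type(sample))
```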

The above seems to support several required ideas:

1) Standard XML content requires no settings at all. This is the
   degenerate case, and it is good that it works this way.

2) Standard XML content could be structured using a DTD specified 
   using namespace techniques. This appears to be an available option
   without changing any of the infrastructure around encoding.

3) It supports MIME types, but does not require them. Other domains
   can be used besides MIME, including completely private or
   proprietary formats.

4) There is some consistency. Notice that whenever you specify a text
   type, you must provide a content-format. Otherwise, the text is the
   same as the surrounding XML. Whenever you specify any
   content-format that is different from the surrounding XML, you must
   use a content-encoding to restore XML friendliness.

5) So far, just about anything you can throw in there that has any
   current structure looks to be workable.
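
The consistency rules in the list above could be checked mechanically.
This is only a sketch of one reading of them: the function name
check_container is hypothetical, "text type" is taken to mean a
content-type starting with mime:text/, and the first rule is applied only
when an encoding is present, since unencoded text is the same as the
surrounding XML.

```python
def check_container(encoding="", content_format="", content_type=""):
    """Return a list of rule violations for one set of attribute values."""
    problems = []
    enc = "" if encoding == "none" else encoding

    # A content-format differing from the surrounding XML needs an encoding.
    if content_format and not enc:
        problems.append("content-format given without content-encoding")

    # Assumed reading: encoded text must say which character set it uses.
    if content_type.startswith("mime:text/") and enc and not content_format:
        problems.append("encoded text type without content-format")

    return problems


print(check_container("base64", "", "mime:image/jpeg"))
print(check_container("", "ISO-8859-1", "mime:text/plain"))
print(check_container("base64", "", "mime:text/html"))
```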

An example element using these, called 'container', could be defined as
shown below.

<!ELEMENT container ANY>
<!ATTLIST container
   content-encoding (base64|hex|none) "none"
   content-format   CDATA             ""
   content-type     CDATA             ""
>
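
For the producing side, a writer might fill in such an element along
these lines. This is a sketch only: make_container is a hypothetical
helper, it always chooses base64, and it omits content-format when there
is none to state.

```python
import base64
import xml.etree.ElementTree as ET


def make_container(data, content_type, content_format=""):
    """Wrap raw bytes in a 'container' element using base64."""
    elem = ET.Element("container")
    elem.set("content-encoding", "base64")
    if content_format:
        elem.set("content-format", content_format)
    elem.set("content-type", content_type)
    elem.text = base64.b64encode(data).decode("ascii")
    return elem


# The first four bytes of a zip archive stand in for a real file.
zip_elem = make_container(b"PK\x03\x04", "mime:application/x-zip")
print(ET.tostring(zip_elem, encoding="unicode"))
```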

I've limited the strings in content-encoding. Is this a good idea?

There would be some structure applied to the content-format and
content-type, but I don't think it would be effectively captured in
the DTD.

Comments aren't just welcome - they're essential!

---------------------------------------------------------------------------
 Chris Smith                                          <smith at interlog.com>
