Schemas and Other Crucial XML Questions

Mon Aug 10 20:56:30 BST 1998

David Megginson wrote:

> Sam Gentile writes:
>
>  > > Also, we have been hearing rumors of a "short" XML notation. Is
>  > > there one?  We have a need to reduce the size of our buffers.
>
> No, there is no such thing.  XML's parent, SGML, included extensive
> facilities for markup minimisation and has suffered badly for it,
> since SGML tools are far too difficult to write (there is still not a
> single Java-based SGML parser, beside probably more than a dozen
> Java-based XML parsers).
>
> There are, however, alternatives: for example, you could compile the
> XML to a compact binary format for internal storage then decompile it
> back to a verbose format for export -- there's no requirement to store
> it internally as text.

Some very simple compression algorithms, Huffman encoding for instance, do very
well with XML documents, because the Name production, which is used for
identifying tags among other things, gets converted to a short binary symbol
that serves as an index for looking up the actual name.  In fact, you could do
all of this with entities: take all of the Names specified in the DTD, put them
into a list, and then declare an entity for each one.

You could index the list with base 10 digits, or else use something as high as
base 64 to encode the array references.
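
As a rough sketch of the index encoding only: the function names and the
64-symbol alphabet below are hypothetical choices, since nothing above fixes a
particular digit set.  It just shows how an array reference shrinks as the base
grows.

```python
import string

# Hypothetical 64-symbol digit set (an assumption, not part of any spec):
# 0-9, A-Z, a-z, then "-" and "_".
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "-_"


def encode_index(n, base=64):
    """Write the non-negative table index n using digits of the given base."""
    if n < 0:
        raise ValueError("index must be non-negative")
    if not 2 <= base <= len(ALPHABET):
        raise ValueError("base must be between 2 and 64")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_index(text, base=64):
    """Inverse of encode_index for the same base."""
    n = 0
    for ch in text:
        value = ALPHABET.index(ch)  # raises ValueError for unknown symbols
        if value >= base:
            raise ValueError("digit out of range for this base")
        n = n * base + value
    return n
```

With this digit set, index 1000 takes four characters in base 10 and two in
base 64, so the saving only starts to matter once the name list is fairly long.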

<!ENTITY % 0 "Foo">
<!ENTITY % 1 "Bar">

Then for a document which had element types with names "Foo" and "Bar",
occurrences of:

<Foo></Foo>
<Bar></Bar>

would be converted to:

<0></0>
<1></1>
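
The substitution can be sketched outside the parser as a plain text rewrite.
Two caveats apply.  An XML Name cannot begin with a digit, so the compact form
is not itself well-formed XML.  Entity references are also not recognised in
the name position of a tag.  The sketch therefore treats the compact form as a
private encoding that is expanded again before any XML parser sees it.  The
function names and the regular expressions are assumptions made for
illustration; comments, CDATA sections and attribute names are not handled.

```python
import re

# Name immediately after "<" or "</".  Simplified: ASCII names only.
TAG_NAME = re.compile(r"(</?)([A-Za-z_:][\w.:\-]*)")
# Decimal index immediately after "<" or "</" in the compact form.
TAG_INDEX = re.compile(r"(</?)(\d+)")


def compact(doc, table):
    """Replace each element name in doc by its decimal index in table."""
    index = {name: str(i) for i, name in enumerate(table)}
    # A name missing from the table raises KeyError rather than passing through.
    return TAG_NAME.sub(lambda m: m.group(1) + index[m.group(2)], doc)


def expand(doc, table):
    """Inverse of compact: restore the element names from the table."""
    return TAG_INDEX.sub(lambda m: m.group(1) + table[int(m.group(2))], doc)


table = ["Foo", "Bar"]          # the Names taken from the DTD, in list order
original = "<Foo></Foo>\n<Bar></Bar>"
short = compact(original, table)
restored = expand(short, table)
```

Comparing len(short) plus the size of the table against len(original) gives a
quick check of whether the rewrite actually saves anything for a given document.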

For small documents, CDF files for instance, these sorts of techniques may turn
out to be counter-productive, since the name table itself takes up space.

Tyler

BTW, on a side note, I am having a problem understanding whether the external
subset or the internal subset should be parsed first.  I would assume that the
external subset should go first, but that would make INCLUDE and IGNORE
sections pretty useless.  As far as I can tell this is not clarified in the 1.0
spec, so if someone could explain how a parser should handle it, I would
greatly appreciate it.

Thanx in advance...
