Is there anyone working on a binary version of XML?

Tue Mar 30 05:51:16 BST 1999

"Stephen D. Williams" wrote:
> Imagine that you have all the features of XML: structure, flexibility, common format for
> interchange, but that you perform zero processing steps to import or export the 'document'
> from a program.  (Actually, I'm thinking this would be done in chunks, but essentially very
> few reads and writes.)

I had an idea to accomplish something similar to this using notations.  
First use a fixed width encoding, and then provide an index to the 
information contained within the XML document in a notation.  This way
you get many of the advantages above, but your information is still XML, 
so that it can be read by a parser who may not understand the indexing notation.

Anyway, I havn't had time to work on it more, but here was a 
crude, first-pass at explaining the idea I posted to the list
a while back.  I hope it helps.

Clark Evans

-------- Original Message --------
Subject: Fractal XML Index Notation
Date: Wed, 03 Feb 1999 01:32:34 +0000
From: Clark Evans <clark.evans at manhattanproject.com>
To: xml-dev at ic.ac.uk
References: <958E41703996D21197A200A0C9D4C65672B7 at AUS-SERVER4>

Abstract:

        By fixing the content of an XML file, a 
        position based  index mechanism can be added 
        to XML files, allowing fractal parsing.

Introduction:

In a thin-client/server environment, especially those 
implemented in an interpreted language, like Java, 
is important to minimise client-side processing by 
doing server-side pre-processing.

For example, suppose that an on-line shopping web 
site has a thin-client ordering java applet.  It could
quickly download, and start accepting customer
information, and other input.  Simoutanenously,
it could be downloading a 250K+ file(s) containing
the package and product list, authorized shipping 
agents, tax calculation tables, etc.  Advanced
versions of the applet would "cashe" a copy of the
catalog locally, and only download deltas.  

Several pre-processing items could occur, the most
obvious being a translation of the normalized schema:

 PRODUCT_CATEGORY   (CATEGORY_ID, CATEGORY_NAME)
 BUNDLE_OF_PRODUCTS (BUNDLE_ID, BUNDLE_NAME, BUNDLE_PRICE)
 VENDOR             (VENDOR_ID, VENDOR_NAME)
 BUNDLE-PRODUCT     (BUNDLE_ID,PRODUCT_ID)
 PRODUCT            (PRODUCT_ID, PRODUCT_NAME, 
                     CATEGORY_ID, INVIDUAL_SALE_FLAG,
                     PRICE_IF_SOLD_INDIVIDUALLY )
 PRODUCT-VENDOR     (PRODUCT_ID,VENDOR_ID)
 BUNDLE-VENDOR      (BUNDLE_ID,VENDOR_ID)

into a hierarchical drill-down that better meets
the particular needs of the order-entry client:

<catalog>
   <product-category>
      <product-bundle>
         <product>
         <vendor>
      <individual-product>
         <vendor>

In this example, several joins are interwoven into a 
a single hierarchical "snapshot" to support the
the drill down requirements in the order-entry client.

Notice, that product-bundles, products, and vendors
*will* be duplicated with this scheme, this de-normalization
is exactly what is required since it makes the processing
on the client simpler.  Here XML complements the
relational database by providing a de-normalized 
stream of data instead of a normalized repository.

For another example, suppose a roaming-sales person 
receives an update every morning in his e-mail with
new products, discontinued products, changes in pricing, 
packaging, etc.  Then, during the day, the sales peson 
goes "door-to-door" selling the products and taking orders.  
The orders are collected on his/her hard drive untill 
the evening, when they are uploaded to the server for 
approval.

I see XML as a great move forward in a standard transport
layer for this form of communication.  Each order could
be a simple e-mail message, leveraging existing POP3/SMTP
standards.  The messages would be queued during the day,
and send after the sales person is connected to the 
network.  In a similar way, the updates to the product
could be sent as via e-mail (xml-mail anyone?) as well.  

THUS, we have moved the join from the client to the
server, but now, we have *increased* the parsing 
requirements of the client... also, with a _large_
catelog file (3+MB?), it is unreasonable to think 
that a collection of objects in memory would 
be the result of the parsing.

THEREFORE, some form of storage/retrieval is necessary
on the client.  This can be in a local database,
but that just increases the footprint and processing.

Instead of making a client-side database, and 
re-normalizing the information, I suggest that 
indexing the XML file may be a better alternative.
A way to do this, is to "fix" the XML file's binary
representaion, and build a physical index detailing
the "exact" location of an element within the file.

Requirement for such an index:

a) It should be embeddable inside XML, and should follow
XML if possible (perhaps it is a notation?)

b) It should allow indexing on arbitrary element attributes.

c) It should be created so that a change in one part of the
file has minimal impact on the rest of the XML file.  Thus,
although a change to a child may require a re-adjustment
of information about it's parent, it shouldn't require 
re-adjustment of information about each sibling.

d) It should take advantage of the "hierarchy" built
into the XML file, since the thin-client usage will
directly correspond to the "hierachy"

e) It should support typed entities and attributes
"Archetecutres", so that different attribute names
of sub-types can be indexed together.

f) Indexing an element based upon it's child elements
may not be required. If an index like this is needed,
perhaps a re-write adds an attribute with the 
computed value and then this is indexed instead.

g) Working with linking is purely optional, and may
not be important to support. <opinion> If you are 
using linking with transaction-oriented documents, 
you should be using a relational database instead. 
I see XML as bringing back the Hierarchical database 
to *complement* relational technology, not to 
*replace* it.</opinion>

================================================

What I propose is a "fractal" index inter-woven 
into the XML data.  First, here is the file to 
be indexed:

<catalog date="03-FEB-1999" company="Acme Tools" >
   <product-category name="Household" type="Domestic">
      <individual-product name="Hammer" price="13.95"/>
      <individual-product name="Screw-Driver, 1/4 inch" price="6.95"/>
      <individual-product name="Screw-Driver, 1/8 inch" price="7.95"/>
      <individual-product name="Allen-Wrench Set"       price="11.55"/>
      <product-bundle name="Household-Starter" price = "23.99" />
         <bundled-product name="Hammer"/>
         <bundled-product name="Screw-Driver, 1/4 inch"/>
         <bundled-product name="Screw-Driver, 1/8 inch"/>
         ...
      </product-bundle>
      ...
   </product-category>
   <product-category type="Commercial" name="Light-Industry" >
      <individual-product name="Hammer" price="13.95"/>
      <individual-product name="Versa Screw(tm)" price="66.95"/>
      ...
   </product-category>
   ...
</catalog>

Here is the "indexed" example, I use line numbers for 
the demonstration since it is easier to show in e-mail
form, however, I would see it being done by position instead.
I also use

0001 <!-- other-information-before-the-catelog -->
...
0009 <catalog date="03-FEB-1999" company="Acme Tools" >
0010    <product-category name="Household" type="Domestic">
0011       <individual-product name="Hammer" price="13.95"/>
0012       <individual-product name="Screw-Driver, 1/4 inch"
price="6.95"/>
0013       <individual-product name="Screw-Driver, 1/8 inch"
price="7.95"/>
0014       <individual-product name="Allen-Wrench Set" price="1.55"/>
0015       <product-bundle name="Household-Starter" price = "23.99" />
0016          <bundled-product name="Hammer"/>
0017          <bundled-product name="Screw-Driver, 1/4 inch"/>
0018          <bundled-product name="Screw-Driver, 1/8 inch"/>
...
0033       </product-bundle>
...
0533       <index               <!-- an index for "Household"
category     -->
0534          name="Price"      <!-- the listing is asending by
price      -->
0535          index-start=525   <!-- (535-10), relative begining of
index  -->
0536          delimiter="|"     <!-- Hmm, possibly for
readability         -->
0536          position-width=4  <!-- Length for each position,
lpad="0"    -->
0537          length=100        <!-- Length of
index                       -->
0538       >
0539       <index-column name="name" width=30 align="left" rpad=" ">
0540          <index-element element="individual-product"
attribute="price" />
0541          <index-element element="product-bundle" attribute="price"
/>
0542       </index-column>
0543       0004|Allen-Wrench Set    | <!-- First item...                
-->
...
05??       0005|Household-Starter   | <!-- First item...                
-->
...
05??       0008|Allen-Wrench Set    | <!-- First item...                
-->
...
0632       </index>
0633       <index               
0634          name="Price"      <!-- the index is asending by
price        -->
0635          index-start=625   <!-- (635-10), relative begining of
index  -->
0636          delimiter="|"     
0636          position-width=4 
0637          length=100
0638       >
0639       <index-column name="price" width=5 align="right" lpad="0">
0640          <index-element element="individual-product"
attribute="price" />
0641          <index-element element="product-bundle" attribute="price"
/>
0642       </index-column>
0643       0433|01.23         <!-- Cheapest item...                     
-->
...
06??       0002|06.95         <!-- Refers to line 10+2=12               
-->
...
06??       0005|23.99         <!-- Referrs to line 10+5=15              
-->
...
0732       </index>
....
????    </product-category>
????    <product-category type="Commercial" name="Light-Industry" >
????       <individual-product name="Hammer" price="13.95"/>
????       <individual-product name="Versa Screw(tm)" price="66.95"/>
...
????       <index 
              name="Price"
...
????       <index 
              name=""
...
????    </product-category>
...
????    
0000 </catalog>
0000 

==============================

....

<INDEX

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)