Fractal XML Index Notation

Clark Evans clark.evans at manhattanproject.com
Wed Feb 10 08:33:05 GMT 1999


Abstract:

   By fixing the content of an XML file, a 
   position based  index mechanism can be added 
   to XML files, allowing fractal parsing.

Introduction:

In a thin-client/server environment, especially those 
implemented in an interpreted language, like Java, 
is important to minimise client-side processing by 
doing server-side pre-processing.

For example, suppose that an on-line shopping web 
site has a thin-client ordering java applet.  It could
quickly download, and start accepting customer
information, and other input.  Simoutanenously,
it could be downloading a 250K+ file(s) containing
the package and product list, authorized shipping 
agents, tax calculation tables, etc.  Advanced
versions of the applet would "cashe" a copy of the
catalog locally, and only download deltas.  

Several pre-processing items could occur, the most
obvious being a translation of the normalized schema:

 PRODUCT_CATEGORY   (CATEGORY_ID, CATEGORY_NAME)
 BUNDLE_OF_PRODUCTS (BUNDLE_ID, BUNDLE_NAME, BUNDLE_PRICE)
 VENDOR             (VENDOR_ID, VENDOR_NAME)
 BUNDLE-PRODUCT     (BUNDLE_ID,PRODUCT_ID)
 PRODUCT            (PRODUCT_ID, PRODUCT_NAME, 
                     CATEGORY_ID, INVIDUAL_SALE_FLAG,
                     PRICE_IF_SOLD_INDIVIDUALLY )
 PRODUCT-VENDOR     (PRODUCT_ID,VENDOR_ID)
 BUNDLE-VENDOR      (BUNDLE_ID,VENDOR_ID)
 
into a hierarchical drill-down that better meets
the particular needs of the order-entry client:

<catalog>
   <product-category>
      <product-bundle>
         <product>
         <vendor>
      <individual-product>
         <vendor>

In this example, several joins are interwoven into a 
a single hierarchical "snapshot" to support the
the drill down requirements in the order-entry client.

Notice, that product-bundles, products, and vendors
*will* be duplicated with this scheme, this de-normalization
is exactly what is required since it makes the processing
on the client simpler.  Here XML complements the
relational database by providing a de-normalized 
stream of data instead of a normalized repository.

For another example, suppose a roaming-sales person 
receives an update every morning in his e-mail with
new products, discontinued products, changes in pricing, 
packaging, etc.  Then, during the day, the sales peson 
goes "door-to-door" selling the products and taking orders.  
The orders are collected on his/her hard drive untill 
the evening, when they are uploaded to the server for 
approval.

I see XML as a great move forward in a standard transport
layer for this form of communication.  Each order could
be a simple e-mail message, leveraging existing POP3/SMTP
standards.  The messages would be queued during the day,
and send after the sales person is connected to the 
network.  In a similar way, the updates to the product
could be sent as via e-mail (xml-mail anyone?) as well.  

THUS, we have moved the join from the client to the
server, but now, we have *increased* the parsing 
requirements of the client... also, with a _large_
catelog file (3+MB?), it is unreasonable to think 
that a collection of objects in memory would 
be the result of the parsing.

THEREFORE, some form of storage/retrieval is necessary
on the client.  This can be in a local database,
but that just increases the footprint and processing.

Instead of making a client-side database, and 
re-normalizing the information, I suggest that 
indexing the XML file may be a better alternative.
A way to do this, is to "fix" the XML file's binary
representaion, and build a physical index detailing
the "exact" location of an element within the file.

Requirement for such an index:

a) It should be embeddable inside XML, and should follow
XML if possible (perhaps it is a notation?)

b) It should allow indexing on arbitrary element attributes.

c) It should be created so that a change in one part of the
file has minimal impact on the rest of the XML file.  Thus,
although a change to a child may require a re-adjustment
of information about it's parent, it shouldn't require 
re-adjustment of information about each sibling.

d) It should take advantage of the "hierarchy" built
into the XML file, since the thin-client usage will
directly correspond to the "hierachy"

e) It should support typed entities and attributes
"Archetecutres", so that different attribute names
of sub-types can be indexed together.

f) Indexing an element based upon it's child elements
may not be required. If an index like this is needed,
perhaps a re-write adds an attribute with the 
computed value and then this is indexed instead.

g) Working with linking is purely optional, and may
not be important to support. <opinion> If you are 
using linking with transaction-oriented documents, 
you should be using a relational database instead. 
I see XML as bringing back the Hierarchical database 
to *complement* relational technology, not to 
*replace* it.</opinion>


================================================

What I propose is a "fractal" index inter-woven 
into the XML data.  First, here is the file to 
be indexed:

<catalog date="03-FEB-1999" company="Acme Tools" > 
 <product-category name="Household" type="Domestic">
  <individual-product name="Hammer" price="13.95"/>
  <individual-product name="Driver, 1/4 inch" price="6.95"/>
  <individual-product name="Driver, 1/8 inch" price="7.95"/>
  <individual-product name="Allen-Wrench Set" price="11.55"/>
  <product-bundle name="Household-Starter" price = "23.99" />
   <bundled-product name="Hammer"/>
   <bundled-product name="Driver, 1/4 inch"/>
   <bundled-product name="Driver, 1/8 inch"/>
       ...
  </product-bundle>
    ...
 </product-category>
 <product-category type="Commercial" name="Light-Industry" >
  <individual-product name="Hammer" price="13.95"/>
  <individual-product name="Versa Screw(tm)" price="66.95"/>
    ...
 </product-category>
 ...
</catalog> 

Here is the "indexed" example, I use line numbers for 
the demonstration since it is easier to show in e-mail
form, however, I would see it being done by position instead.
I also use <!-- to comment stuff. -->

0001 <!-- other-information-before-the-catelog -->
...
0009 <catalog date="03-FEB-1999" company="Acme Tools" >
0010  <product-category name="Household" type="Domestic">
0011   <individual-product name="Hammer" price="13.95"/>
0012   <individual-product name="Driver, 1/4 inch" price="6.95"/>
0013   <individual-product name="Driver, 1/8 inch" price="7.95"/>
0014   <individual-product name="Allen-Wrench Set" price="1.55"/>
0015   <product-bundle name="Household-Starter" price = "23.99" />
0016    <bundled-product name="Hammer"/>
0017    <bundled-product name="Driver, 1/4 inch"/>
0018    <bundled-product name="Driver, 1/8 inch"/>
...
0033   </product-bundle>
...
0533   <index              <!-- an index for "Household" category    
-->
0534     name="Price"      <!-- the listing is asending by price     
-->
0535     index-start=525   <!-- (535-10), relative begining of index 
-->
0536     delimiter="|"     <!-- Hmm, possibly for readability        
-->
0536     position-width=4  <!-- Length for each position, lpad="0"   
-->
0537     length=100        <!-- Length of index                      
-->
0538   >
0539    <index-column name="name" width=30 align="left" rpad=" ">
0540     <index-element element="individual-product" attribute="price"
/>
0541     <index-element element="product-bundle" attribute="price" />
0542    </index-column>
0543    0004|Allen-Wrench Set    | <!-- First item...                -->
...
05??    0005|Household-Starter   | 
...
05??    0008|Allen-Wrench Set    | 
...
0632   </index>
0633   <index               
0634    name="Price"      <!-- the index is asending by  price       -->
0635    index-start=625   <!-- (635-10), relative begining of index  -->
0636    delimiter="|"     
0636    position-width=4 
0637    length=100
0638   >
0639    <index-column name="price" width=5 align="right" lpad="0">
0640     <index-element element="individual-product" attribute="price"
/>
0641     <index-element element="product-bundle" attribute="price" />
0642    </index-column>
0643    0433|01.23         <!-- Cheapest item...                     -->
...     
06??    0002|06.95         <!-- Refers to line 10+2=12               -->
...
06??    0005|23.99         <!-- Referrs to line 10+5=15              -->
...
0732   </index>
....
????  </product-category>
????  <product-category type="Commercial" name="Light-Industry" >
????   <individual-product name="Hammer" price="13.95"/>
????   <individual-product name="Versa Screw(tm)" price="66.95"/>
...
????   <index name="Price"
....
????   <index name=""
....
????  </product-category>
....
????

I'm getting tired here, but now, you would have
the next level of indexes, here indexing the
product categories, instead of the products
in the product category.
                           
???? </catalog>
 
-------------------------------------------

Anyway, I wrote this up about 2 months ago
and have implemented something *very* similar
to it successfully in a client/server order
entry system about 6 years ago using flat files
and a proprietary file format.  I did this
beacuse the client parsing times were getting
to be large and the frequent updates were
starting to cause problems.  Anyway, the
speedup obtained by a multi-attribute index
was hudge and well worth the added complexity.

Since the index is fractal at the element
level, a change in the file means re-computing
the indexes for all parents transitively.
Re-computation of the sibling indexes is avoided!

As an added advantage, carrage returns
were used so that "upgrades" of the
files could be distribued merged with "diff".

Just thinking outloud here....

Clark Evans

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)




More information about the Xml-dev mailing list