Is there anyone working on a binary version of XML?
Clark Evans
clark.evans at manhattanproject.com
Tue Mar 30 05:51:16 BST 1999
"Stephen D. Williams" wrote:
> Imagine that you have all the features of XML: structure, flexibility, common format for
> interchange, but that you perform zero processing steps to import or export the 'document'
> from a program. (Actually, I'm thinking this would be done in chunks, but essentially very
> few reads and writes.)
I had an idea to accomplish something similar to this using notations.
First use a fixed width encoding, and then provide an index to the
information contained within the XML document in a notation. This way
you get many of the advantages above, but your information is still XML,
so that it can be read by a parser who may not understand the indexing notation.
Anyway, I havn't had time to work on it more, but here was a
crude, first-pass at explaining the idea I posted to the list
a while back. I hope it helps.
Clark Evans
-------- Original Message --------
Subject: Fractal XML Index Notation
Date: Wed, 03 Feb 1999 01:32:34 +0000
From: Clark Evans <clark.evans at manhattanproject.com>
To: xml-dev at ic.ac.uk
References: <958E41703996D21197A200A0C9D4C65672B7 at AUS-SERVER4>
Abstract:
By fixing the content of an XML file, a
position based index mechanism can be added
to XML files, allowing fractal parsing.
Introduction:
In a thin-client/server environment, especially those
implemented in an interpreted language, like Java,
is important to minimise client-side processing by
doing server-side pre-processing.
For example, suppose that an on-line shopping web
site has a thin-client ordering java applet. It could
quickly download, and start accepting customer
information, and other input. Simoutanenously,
it could be downloading a 250K+ file(s) containing
the package and product list, authorized shipping
agents, tax calculation tables, etc. Advanced
versions of the applet would "cashe" a copy of the
catalog locally, and only download deltas.
Several pre-processing items could occur, the most
obvious being a translation of the normalized schema:
PRODUCT_CATEGORY (CATEGORY_ID, CATEGORY_NAME)
BUNDLE_OF_PRODUCTS (BUNDLE_ID, BUNDLE_NAME, BUNDLE_PRICE)
VENDOR (VENDOR_ID, VENDOR_NAME)
BUNDLE-PRODUCT (BUNDLE_ID,PRODUCT_ID)
PRODUCT (PRODUCT_ID, PRODUCT_NAME,
CATEGORY_ID, INVIDUAL_SALE_FLAG,
PRICE_IF_SOLD_INDIVIDUALLY )
PRODUCT-VENDOR (PRODUCT_ID,VENDOR_ID)
BUNDLE-VENDOR (BUNDLE_ID,VENDOR_ID)
into a hierarchical drill-down that better meets
the particular needs of the order-entry client:
<catalog>
<product-category>
<product-bundle>
<product>
<vendor>
<individual-product>
<vendor>
In this example, several joins are interwoven into a
a single hierarchical "snapshot" to support the
the drill down requirements in the order-entry client.
Notice, that product-bundles, products, and vendors
*will* be duplicated with this scheme, this de-normalization
is exactly what is required since it makes the processing
on the client simpler. Here XML complements the
relational database by providing a de-normalized
stream of data instead of a normalized repository.
For another example, suppose a roaming-sales person
receives an update every morning in his e-mail with
new products, discontinued products, changes in pricing,
packaging, etc. Then, during the day, the sales peson
goes "door-to-door" selling the products and taking orders.
The orders are collected on his/her hard drive untill
the evening, when they are uploaded to the server for
approval.
I see XML as a great move forward in a standard transport
layer for this form of communication. Each order could
be a simple e-mail message, leveraging existing POP3/SMTP
standards. The messages would be queued during the day,
and send after the sales person is connected to the
network. In a similar way, the updates to the product
could be sent as via e-mail (xml-mail anyone?) as well.
THUS, we have moved the join from the client to the
server, but now, we have *increased* the parsing
requirements of the client... also, with a _large_
catelog file (3+MB?), it is unreasonable to think
that a collection of objects in memory would
be the result of the parsing.
THEREFORE, some form of storage/retrieval is necessary
on the client. This can be in a local database,
but that just increases the footprint and processing.
Instead of making a client-side database, and
re-normalizing the information, I suggest that
indexing the XML file may be a better alternative.
A way to do this, is to "fix" the XML file's binary
representaion, and build a physical index detailing
the "exact" location of an element within the file.
Requirement for such an index:
a) It should be embeddable inside XML, and should follow
XML if possible (perhaps it is a notation?)
b) It should allow indexing on arbitrary element attributes.
c) It should be created so that a change in one part of the
file has minimal impact on the rest of the XML file. Thus,
although a change to a child may require a re-adjustment
of information about it's parent, it shouldn't require
re-adjustment of information about each sibling.
d) It should take advantage of the "hierarchy" built
into the XML file, since the thin-client usage will
directly correspond to the "hierachy"
e) It should support typed entities and attributes
"Archetecutres", so that different attribute names
of sub-types can be indexed together.
f) Indexing an element based upon it's child elements
may not be required. If an index like this is needed,
perhaps a re-write adds an attribute with the
computed value and then this is indexed instead.
g) Working with linking is purely optional, and may
not be important to support. <opinion> If you are
using linking with transaction-oriented documents,
you should be using a relational database instead.
I see XML as bringing back the Hierarchical database
to *complement* relational technology, not to
*replace* it.</opinion>
================================================
What I propose is a "fractal" index inter-woven
into the XML data. First, here is the file to
be indexed:
<catalog date="03-FEB-1999" company="Acme Tools" >
<product-category name="Household" type="Domestic">
<individual-product name="Hammer" price="13.95"/>
<individual-product name="Screw-Driver, 1/4 inch" price="6.95"/>
<individual-product name="Screw-Driver, 1/8 inch" price="7.95"/>
<individual-product name="Allen-Wrench Set" price="11.55"/>
<product-bundle name="Household-Starter" price = "23.99" />
<bundled-product name="Hammer"/>
<bundled-product name="Screw-Driver, 1/4 inch"/>
<bundled-product name="Screw-Driver, 1/8 inch"/>
...
</product-bundle>
...
</product-category>
<product-category type="Commercial" name="Light-Industry" >
<individual-product name="Hammer" price="13.95"/>
<individual-product name="Versa Screw(tm)" price="66.95"/>
...
</product-category>
...
</catalog>
Here is the "indexed" example, I use line numbers for
the demonstration since it is easier to show in e-mail
form, however, I would see it being done by position instead.
I also use <!-- to comment stuff. -->
0001 <!-- other-information-before-the-catelog -->
...
0009 <catalog date="03-FEB-1999" company="Acme Tools" >
0010 <product-category name="Household" type="Domestic">
0011 <individual-product name="Hammer" price="13.95"/>
0012 <individual-product name="Screw-Driver, 1/4 inch"
price="6.95"/>
0013 <individual-product name="Screw-Driver, 1/8 inch"
price="7.95"/>
0014 <individual-product name="Allen-Wrench Set" price="1.55"/>
0015 <product-bundle name="Household-Starter" price = "23.99" />
0016 <bundled-product name="Hammer"/>
0017 <bundled-product name="Screw-Driver, 1/4 inch"/>
0018 <bundled-product name="Screw-Driver, 1/8 inch"/>
...
0033 </product-bundle>
...
0533 <index <!-- an index for "Household"
category -->
0534 name="Price" <!-- the listing is asending by
price -->
0535 index-start=525 <!-- (535-10), relative begining of
index -->
0536 delimiter="|" <!-- Hmm, possibly for
readability -->
0536 position-width=4 <!-- Length for each position,
lpad="0" -->
0537 length=100 <!-- Length of
index -->
0538 >
0539 <index-column name="name" width=30 align="left" rpad=" ">
0540 <index-element element="individual-product"
attribute="price" />
0541 <index-element element="product-bundle" attribute="price"
/>
0542 </index-column>
0543 0004|Allen-Wrench Set | <!-- First item...
-->
...
05?? 0005|Household-Starter | <!-- First item...
-->
...
05?? 0008|Allen-Wrench Set | <!-- First item...
-->
...
0632 </index>
0633 <index
0634 name="Price" <!-- the index is asending by
price -->
0635 index-start=625 <!-- (635-10), relative begining of
index -->
0636 delimiter="|"
0636 position-width=4
0637 length=100
0638 >
0639 <index-column name="price" width=5 align="right" lpad="0">
0640 <index-element element="individual-product"
attribute="price" />
0641 <index-element element="product-bundle" attribute="price"
/>
0642 </index-column>
0643 0433|01.23 <!-- Cheapest item...
-->
...
06?? 0002|06.95 <!-- Refers to line 10+2=12
-->
...
06?? 0005|23.99 <!-- Referrs to line 10+5=15
-->
...
0732 </index>
....
???? </product-category>
???? <product-category type="Commercial" name="Light-Industry" >
???? <individual-product name="Hammer" price="13.95"/>
???? <individual-product name="Versa Screw(tm)" price="66.95"/>
...
???? <index
name="Price"
...
???? <index
name=""
...
???? </product-category>
...
????
0000 </catalog>
0000
==============================
....
<INDEX
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
More information about the Xml-dev
mailing list