Is there anyone working on a binary version of XML?

Stephen D. Williams sdw at lig.net
Tue Mar 30 06:08:18 BST 1999


Excellent.  I've had similar ideas.  My current plan is to produce something without the
requirement that the result be pure text.  However, I once toyed with the idea of a database where
all indexing information was stored as part of the text in fixed-width fields.  The file could be
edited with any text editor and then 'reindexed' and be ready for fast use.

Your design is pretty handy, but I really want something that can be loaded, have a minor
modification made with minimal data shuffling, and then 'saved' out very quickly.  Having to
rebuild a complete index probably isn't the best way to do this.
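
As a rough illustration of "minimal data shuffling", here is a small Python sketch, assuming
index values live in zero-padded fixed-width text fields.  The field width and the helper names
(write_field, read_field) are hypothetical, chosen only for this example.  Because the field
never changes size, patching it does not move any byte that follows it.

```python
import io

FIELD_WIDTH = 8  # hypothetical width chosen for this sketch


def write_field(buf, offset, value, width=FIELD_WIDTH):
    """Overwrite one fixed-width text field in place, left-padded with zeros."""
    text = str(value).rjust(width, "0")
    if len(text) != width:
        raise ValueError("value does not fit in the field")
    buf.seek(offset)
    buf.write(text.encode("ascii"))


def read_field(buf, offset, width=FIELD_WIDTH):
    """Read a fixed-width field back as an integer."""
    buf.seek(offset)
    return int(buf.read(width).decode("ascii"))


# A tiny "document": a fixed-width field followed by ordinary text.
doc = io.BytesIO(b"00000000|hello world\n")
size_before = len(doc.getvalue())
write_field(doc, 0, 9)  # patch the field; nothing after it moves
assert len(doc.getvalue()) == size_before
```

The same seek-and-overwrite pattern applies to a file opened in "r+b" mode, which is why a small
change could be 'saved' without rewriting the rest of the file.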

sdw

Clark Evans wrote:

> "Stephen D. Williams" wrote:
> > Imagine that you have all the features of XML: structure, flexibility, common format for
> > interchange, but that you perform zero processing steps to import or export the 'document'
> > from a program.  (Actually, I'm thinking this would be done in chunks, but essentially very
> > few reads and writes.)
>
> I had an idea to accomplish something similar to this using notations.
> First use a fixed width encoding, and then provide an index to the
> information contained within the XML document in a notation.  This way
> you get many of the advantages above, but your information is still XML,
> so that it can be read by a parser that may not understand the indexing notation.
>
> Anyway, I haven't had time to work on it more, but here is a
> crude first pass at explaining the idea, which I posted to the list
> a while back.  I hope it helps.
>
> Clark Evans
>
> -------- Original Message --------
> Subject: Fractal XML Index Notation
> Date: Wed, 03 Feb 1999 01:32:34 +0000
> From: Clark Evans <clark.evans at manhattanproject.com>
> To: xml-dev at ic.ac.uk
> References: <958E41703996D21197A200A0C9D4C65672B7 at AUS-SERVER4>
>
> Abstract:
>
>         By fixing the content of an XML file, a
>         position-based index mechanism can be added
>         to XML files, allowing fractal parsing.
>
> Introduction:
>
> In a thin-client/server environment, especially those
> implemented in an interpreted language, like Java,
> it is important to minimise client-side processing by
> doing server-side pre-processing.
>
> For example, suppose that an on-line shopping web
> site has a thin-client ordering Java applet.  It could
> download quickly and start accepting customer
> information and other input.  Simultaneously,
> it could be downloading one or more 250K+ files containing
> the package and product list, authorized shipping
> agents, tax calculation tables, etc.  Advanced
> versions of the applet would "cache" a copy of the
> catalog locally, and only download deltas.
>
> Several pre-processing items could occur, the most
> obvious being a translation of the normalized schema:
>
>  PRODUCT_CATEGORY   (CATEGORY_ID, CATEGORY_NAME)
>  BUNDLE_OF_PRODUCTS (BUNDLE_ID, BUNDLE_NAME, BUNDLE_PRICE)
>  VENDOR             (VENDOR_ID, VENDOR_NAME)
>  BUNDLE-PRODUCT     (BUNDLE_ID,PRODUCT_ID)
>  PRODUCT            (PRODUCT_ID, PRODUCT_NAME,
>                      CATEGORY_ID, INDIVIDUAL_SALE_FLAG,
>                      PRICE_IF_SOLD_INDIVIDUALLY )
>  PRODUCT-VENDOR     (PRODUCT_ID,VENDOR_ID)
>  BUNDLE-VENDOR      (BUNDLE_ID,VENDOR_ID)
>
> into a hierarchical drill-down that better meets
> the particular needs of the order-entry client:
>
> <catalog>
>    <product-category>
>       <product-bundle>
>          <product>
>          <vendor>
>       <individual-product>
>          <vendor>
>
> In this example, several joins are interwoven into
> a single hierarchical "snapshot" to support
> the drill-down requirements of the order-entry client.
>
> Notice that product-bundles, products, and vendors
> *will* be duplicated with this scheme.  This de-normalization
> is exactly what is required, since it makes the processing
> on the client simpler.  Here XML complements the
> relational database by providing a de-normalized
> stream of data instead of a normalized repository.
>
> For another example, suppose a roaming salesperson
> receives an update every morning by e-mail with
> new products, discontinued products, changes in pricing,
> packaging, etc.  Then, during the day, the salesperson
> goes "door-to-door" selling the products and taking orders.
> The orders are collected on his/her hard drive until
> the evening, when they are uploaded to the server for
> approval.
>
> I see XML as a great move forward in a standard transport
> layer for this form of communication.  Each order could
> be a simple e-mail message, leveraging existing POP3/SMTP
> standards.  The messages would be queued during the day,
> and sent after the salesperson is connected to the
> network.  In a similar way, the updates to the product
> could be sent via e-mail (xml-mail anyone?) as well.
>
> THUS, we have moved the join from the client to the
> server, but now we have *increased* the parsing
> requirements of the client.  Also, with a _large_
> catalog file (3+MB?), it is unreasonable to expect
> the parse to produce a collection of objects held
> entirely in memory.
>
> THEREFORE, some form of storage/retrieval is necessary
> on the client.  This can be in a local database,
> but that just increases the footprint and processing.
>
> Instead of making a client-side database and
> re-normalizing the information, I suggest that
> indexing the XML file may be a better alternative.
> A way to do this is to "fix" the XML file's binary
> representation, and build a physical index detailing
> the "exact" location of an element within the file.
>
> Requirements for such an index:
>
> a) It should be embeddable inside XML, and should follow
> XML if possible (perhaps it is a notation?)
>
> b) It should allow indexing on arbitrary element attributes.
>
> c) It should be created so that a change in one part of the
> file has minimal impact on the rest of the XML file.  Thus,
> although a change to a child may require a re-adjustment
> of information about its parent, it shouldn't require
> re-adjustment of information about each sibling.
>
> d) It should take advantage of the "hierarchy" built
> into the XML file, since the thin-client usage will
> directly correspond to the "hierarchy".
>
> e) It should support typed entities and attributes
> "Architectures", so that different attribute names
> of sub-types can be indexed together.
>
> f) Indexing an element based upon its child elements
> may not be required. If an index like this is needed,
> perhaps a re-write adds an attribute with the
> computed value and then this is indexed instead.
>
> g) Working with linking is purely optional, and may
> not be important to support. <opinion> If you are
> using linking with transaction-oriented documents,
> you should be using a relational database instead.
> I see XML as bringing back the Hierarchical database
> to *complement* relational technology, not to
> *replace* it.</opinion>
>
> ================================================
>
> What I propose is a "fractal" index inter-woven
> into the XML data.  First, here is the file to
> be indexed:
>
> <catalog date="03-FEB-1999" company="Acme Tools" >
>    <product-category name="Household" type="Domestic">
>       <individual-product name="Hammer" price="13.95"/>
>       <individual-product name="Screw-Driver, 1/4 inch" price="6.95"/>
>       <individual-product name="Screw-Driver, 1/8 inch" price="7.95"/>
>       <individual-product name="Allen-Wrench Set"       price="11.55"/>
>       <product-bundle name="Household-Starter" price="23.99">
>          <bundled-product name="Hammer"/>
>          <bundled-product name="Screw-Driver, 1/4 inch"/>
>          <bundled-product name="Screw-Driver, 1/8 inch"/>
>          ...
>       </product-bundle>
>       ...
>    </product-category>
>    <product-category type="Commercial" name="Light-Industry" >
>       <individual-product name="Hammer" price="13.95"/>
>       <individual-product name="Versa Screw(tm)" price="66.95"/>
>       ...
>    </product-category>
>    ...
> </catalog>
>
> Here is the "indexed" example.  I use line numbers for
> the demonstration since they are easier to show in e-mail
> form; however, I would see it being done by position instead.
> I also use <!-- to comment stuff. -->
>
> 0001 <!-- other-information-before-the-catalog -->
> ...
> 0009 <catalog date="03-FEB-1999" company="Acme Tools" >
> 0010    <product-category name="Household" type="Domestic">
> 0011       <individual-product name="Hammer" price="13.95"/>
> 0012       <individual-product name="Screw-Driver, 1/4 inch" price="6.95"/>
> 0013       <individual-product name="Screw-Driver, 1/8 inch" price="7.95"/>
> 0014       <individual-product name="Allen-Wrench Set" price="11.55"/>
> 0015       <product-bundle name="Household-Starter" price="23.99">
> 0016          <bundled-product name="Hammer"/>
> 0017          <bundled-product name="Screw-Driver, 1/4 inch"/>
> 0018          <bundled-product name="Screw-Driver, 1/8 inch"/>
> ...
> 0033       </product-bundle>
> ...
> 0533       <index               <!-- an index for "Household" category     -->
> 0534          name="Price"      <!-- the listing is ascending by price     -->
> 0535          index-start=525   <!-- (535-10), relative beginning of index -->
> 0536          delimiter="|"     <!-- Hmm, possibly for readability         -->
> 0536          position-width=4  <!-- Length for each position, lpad="0"    -->
> 0537          length=100        <!-- Length of index                       -->
> 0538       >
> 0539       <index-column name="name" width=30 align="left" rpad=" ">
> 0540          <index-element element="individual-product" attribute="price" />
> 0541          <index-element element="product-bundle" attribute="price" />
> 0542       </index-column>
> 0543       0004|Allen-Wrench Set    | <!-- First item... -->
> ...
> 05??       0005|Household-Starter   | <!-- First item... -->
> ...
> 05??       0008|Allen-Wrench Set    | <!-- First item... -->
> ...
> 0632       </index>
> 0633       <index
> 0634          name="Price"      <!-- the index is ascending by price       -->
> 0635          index-start=625   <!-- (635-10), relative beginning of index -->
> 0636          delimiter="|"
> 0636          position-width=4
> 0637          length=100
> 0638       >
> 0639       <index-column name="price" width=5 align="right" lpad="0">
> 0640          <index-element element="individual-product" attribute="price" />
> 0641          <index-element element="product-bundle" attribute="price" />
> 0642       </index-column>
> 0643       0433|01.23         <!-- Cheapest item...          -->
> ...
> 06??       0002|06.95         <!-- Refers to line 10+2=12    -->
> ...
> 06??       0005|23.99         <!-- Refers to line 10+5=15    -->
> ...
> 0732       </index>
> ....
> ????    </product-category>
> ????    <product-category type="Commercial" name="Light-Industry" >
> ????       <individual-product name="Hammer" price="13.95"/>
> ????       <individual-product name="Versa Screw(tm)" price="66.95"/>
> ...
> ????       <index
>               name="Price"
> ...
> ????       <index
>               name=""
> ...
> ????    </product-category>
> ...
> ????
> 0000 </catalog>
> 0000
>
> ==============================
>
> ....
>
> <INDEX
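
To make the lookup in the quoted example concrete, here is a minimal Python sketch of how a
reader might resolve index rows such as "0002|06.95".  It assumes, as the comments "10+2=12" and
"10+5=15" suggest, that each position is relative to the line on which the enclosing
product-category starts, and it works on line numbers like the e-mail demonstration rather than
on byte positions.  The function names are hypothetical; the defaults mirror the
position-width=4 and delimiter="|" attributes shown in the example.

```python
def parse_index_entry(line, position_width=4, delimiter="|"):
    """Split one index row into (relative_position, key)."""
    position = int(line[:position_width])
    if line[position_width] != delimiter:
        raise ValueError("missing delimiter after position field")
    # The key is everything up to the next delimiter, with padding removed.
    key = line[position_width + 1:].split(delimiter)[0].strip()
    return position, key


def resolve(entry_line, base_line):
    """Return (absolute_line, key) for a row whose position is relative to base_line."""
    position, key = parse_index_entry(entry_line)
    return base_line + position, key


CATEGORY_START = 10  # line of <product-category name="Household"> in the example
rows = ["0002|06.95", "0005|23.99"]
resolved = [resolve(row, CATEGORY_START) for row in rows]
print(resolved)  # [(12, '06.95'), (15, '23.99')]
```

Because positions are relative to the category start, inserting lines before the category
changes only CATEGORY_START here, not the rows of the index themselves.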

--
OptimaLogic - Finding Optimal Solutions     Web/Crypto/OO/Unix/Comm/Video/DBMS
sdw at lig.net   Stephen D. Williams  Senior Consultant/Architect   http://sdw.st
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax 5Jan1999



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



