XML Performance question

Marcelo Cantos marcelo at mds.rmit.edu.au
Tue Apr 6 02:27:06 BST 1999


On Mon, Apr 05, 1999 at 09:25:48AM -0400, Lippmann, Jens wrote:
> Having followed XML for the last couple of months, I am surprised how
> little attention is paid to performance. My optimistic personality leads
> me to the conclusion that performance is not an issue. :)
> 
> However, I would be very interested in an expert's guess at the
> following problem:
> 
> Assume the following XML document:
> 
> <PORTFOLIO>
>    <ACCOUNT MANAGER="Joe Smith" ID="000001">
>       <AUDIT DATE="03/31/1999">
>          <SECURITYDESC>
>             <SECURITY>
>                <CUSIP>0815</CUSIP>
>                <PRICE CURRENCY="US">4289.23</PRICE>
>                <TRADEDSHARES>4289.23</TRADEDSHARES>
>             </SECURITY>
>          </SECURITYDESC>
>       </AUDIT>
>    </ACCOUNT>
> </PORTFOLIO>
>  
> 
> Each document will contain about 10^4 <SECURITY> elements, each holding
> between 10 and 10^2 child tags, and I have to handle about 10^2 documents
> a day, i.e. we're dealing with 10^7 to 10^8 tags. So far, the benchmarks
> I've got are pretty devastating.  I have to visit every sub-element of
> <SECURITY> at least once during the number crunching, and I cannot keep
> everything in memory. I am considering one of the XML repositories to
> help me with the job.

I just ran one million elements through SP with a scripting language
on top of it.  The run took 7m 15s, which extrapolates to roughly 12
hours for 10^8 tags (435 s per 10^6 elements, times 100 such runs, is
43,500 s, i.e. just over 12 hours).  This could easily be sped up by:

  1. Using expat instead of SP (this makes a _big_ difference; see
     the sketch below).
  2. Accessing the data from C++ rather than a scripting language.
  3. Shortening your element names (at present the names outweigh the
     data they wrap; they seem to incur roughly a 12% performance hit,
     and this would get much worse if you were looking for specific
     elements during parsing).

I ran some brief tests, handling 10^6 elements with no processing
(beyond parsing, that is), using expat in C.  It completed in just
under 2 minutes.  Since your largest possible document (10^4
<SECURITY> elements of 10^2 tags each) holds about 10^6 tags, 100
such documents amount to 100 of these runs, or approximately 3h 20m.
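
For concreteness, the second test amounts to something like the
sketch below: a bare element counter reading from stdin.  It is
written against the current expat API (header and symbol names have
drifted a little between expat releases), and it is an illustration,
not the exact harness behind the figures above.

    #include <stdio.h>
    #include <expat.h>

    static long count = 0;

    static void start(void *ud, const char *name, const char **atts)
    {
        (void)ud; (void)name; (void)atts;
        ++count;                        /* one event per start-tag */
    }

    int main(void)
    {
        char buf[8192];
        int done = 0;
        XML_Parser p = XML_ParserCreate(NULL);
        XML_SetElementHandler(p, start, NULL);
        while (!done) {
            size_t n = fread(buf, 1, sizeof buf, stdin);
            done = n < sizeof buf;      /* end of input */
            if (!XML_Parse(p, buf, (int)n, done)) {
                fprintf(stderr, "parse error at line %d\n",
                        (int)XML_GetCurrentLineNumber(p));
                XML_ParserFree(p);
                return 1;
            }
        }
        XML_ParserFree(p);
        printf("%ld elements\n", count);
        return 0;
    }

All the work happens in the callbacks, which is why moving them from
a scripting language into C buys so much.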

This extremely rough analysis suffices to establish some idea of the
lower bound for your problem.  It doesn't address the full
complexity of your situation, since we don't know the specifics of
what you are trying to achieve.

Also note that these figures were acquired using an event model,
rather than a parse tree; this can have a significant impact on
performance.  It may well be that your processing requirements don't
permit an event-based approach, in which case the above figures are
meaningless (though that situation is less likely than is commonly
perceived).
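
To make that concrete: for a document shaped like the one above, the
number crunching only ever needs the children of the current
<SECURITY>, so the handlers can keep a single small struct and flush
it at each end-tag, leaving memory use constant no matter how many
securities stream past.  The following is a sketch only; the struct
layout and the process_security hook are invented for illustration,
and the handlers plug into the same driver loop as the counter above
via XML_SetElementHandler(p, start, end) and
XML_SetCharacterDataHandler(p, chardata).

    #include <stdlib.h>
    #include <string.h>

    struct security {                 /* state for the current element only */
        char   cusip[32];
        double price;
        double traded_shares;
    };

    static struct security cur;
    static char   text[256];          /* character data of the current child */
    static size_t text_len;

    static void process_security(const struct security *s)
    {
        (void)s;                      /* stand-in for the real crunching */
    }

    static void start(void *ud, const char *name, const char **atts)
    {
        (void)ud; (void)atts;
        if (strcmp(name, "SECURITY") == 0)
            memset(&cur, 0, sizeof cur);
        text_len = 0;                 /* new element, fresh text buffer */
    }

    /* expat may deliver one text node in several pieces, so append */
    static void chardata(void *ud, const char *s, int len)
    {
        (void)ud;
        if (text_len + (size_t)len < sizeof text) {
            memcpy(text + text_len, s, (size_t)len);
            text_len += (size_t)len;
        }
    }

    static void end(void *ud, const char *name)
    {
        (void)ud;
        text[text_len] = '\0';
        if (strcmp(name, "CUSIP") == 0)
            strncpy(cur.cusip, text, sizeof cur.cusip - 1);
        else if (strcmp(name, "PRICE") == 0)
            cur.price = atof(text);
        else if (strcmp(name, "TRADEDSHARES") == 0)
            cur.traded_shares = atof(text);
        else if (strcmp(name, "SECURITY") == 0)
            process_security(&cur);   /* whole element seen; crunch it */
    }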

Finally, note that this was all done in one thread (on a 333 MHz
UltraSPARC).  Multiple threads could improve these figures
substantially: spreading the second test across 2 CPUs brought the
time down to 70 seconds (about 2 hours for 100 documents).  Of
course, this depends on your hardware.
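
That kind of coarse parallelism is easy to get at, since each expat
parser is an independent object: just hand whole documents to worker
threads.  A POSIX-threads sketch follows; parse_file stands in for
the driver loop above, and the file list is a placeholder.

    #include <pthread.h>
    #include <stdio.h>

    #define NDOCS    100
    #define NTHREADS 2

    static const char *docs[NDOCS];   /* document file names */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_doc;

    static void parse_file(const char *path)
    {
        /* stand-in: create a parser, run the read loop, free it */
        printf("parsed %s\n", path);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int i = next_doc++;       /* claim the next document */
            pthread_mutex_unlock(&lock);
            if (i >= NDOCS)
                return NULL;
            parse_file(docs[i]);
        }
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        int i;
        for (i = 0; i < NDOCS; i++)
            docs[i] = "portfolio.xml"; /* placeholder name */
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Since the documents are independent, there is no shared parser state
to worry about; the mutex only guards the work queue.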


Cheers,
Marcelo

-- 
http://www.simdb.com/~marcelo/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



