Parsing XML->DOM and XSL querying optimising

Sat Mar 20 09:15:17 GMT 1999

> Hi all,
> 
> I have an application that consists of 140+ XML documents, roughly 100k
> bytes each that I want to be able to query (using XSL pattern matching at
> present) and output to XML/HTML and RTF format. This will happen in real
> time (if at all possible).
> 
> Additionally, I'd like to be able to search/query the entire repository of
> documents and return a composite XML/HTML or RTF document from these.
> 
> At the moment, I'm experimenting with the DOM parser in Python and finding
> that a DOM parse takes about 4 seconds, whilst an XSL query takes about 1.8
> seconds.
> 
> I reckon that a user could wait the 1.8 seconds for a query, but might start
> to get fidgety after almost 6 seconds (how transient we are!).

I'm using sgrep.  I can run a query returning 2000 fragments out of a
25,000-document, 100MB collection in around 2 seconds.  Queries returning
less text are significantly faster.

sgrep pre-indexes the document collection, which takes me around 15 minutes.
There's no way to update an index other than rebuilding it from scratch, so
it has limits for rapidly changing databases.  I get by by indexing sections
of my total database separately and concatenating the results of searching
each index, which keeps each rebuild smaller than it would otherwise be.
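
A minimal sketch of that per-section arrangement, in Python since that is
what the original poster is using.  It does not use sgrep or its index
format; build_index, ShardedIndex and the section names are hypothetical,
and the "index" is just a word-to-document map.  The point is only that a
changed section is rebuilt alone, while a search concatenates the results
from every section.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


class ShardedIndex:
    """One independent index per section (hypothetical helper, not sgrep)."""

    def __init__(self):
        self.shards = {}

    def rebuild(self, section, docs):
        # Rebuilding from scratch, but only for the section that changed.
        self.shards[section] = build_index(docs)

    def search(self, word):
        # Concatenate the hits from each section's index, in section order.
        hits = []
        for section in sorted(self.shards):
            hits.extend(self.shards[section].get(word.lower(), []))
        return hits


idx = ShardedIndex()
idx.rebuild("1998", {"a.xml": "dom parsing notes", "b.xml": "sgrep notes"})
idx.rebuild("1999", {"c.xml": "xsl query notes"})
# A document is added to the 1999 section: only that shard is rebuilt.
idx.rebuild("1999", {"c.xml": "xsl query notes", "d.xml": "dom query timing"})
print(idx.search("dom"))
```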

sgrep's parser doesn't care about well-formedness, but if your documents are 
well formed it behaves correctly.

Its queries are based on containment, and have no facility for things like 
'get me the nth occurrence of X'.  Some workarounds can be managed with 
recursion, but if you expect to do a lot of this stuff it may not be for you. 
I noticed recently that Tim Bray was proposing a query language with these 
same limitations because it provides for more efficient processing.
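
To illustrate what a containment-based query looks like, here is a rough
Python sketch.  It is not sgrep's query syntax or implementation; regions
and contained_in are hypothetical helpers, and the regular expression
assumes that elements of the same name are never nested.  Each element is
treated as a (start, end) span of the text, and the query keeps the spans
that fall inside another set of spans.  Note that nothing here counts
occurrences, which is why "the nth X" has no direct expression.

```python
import re


def regions(text, tag):
    """Return (start, end) spans of <tag>...</tag> elements (assumed non-nested)."""
    name = re.escape(tag)
    pattern = re.compile(r"<%s\b[^>]*>.*?</%s>" % (name, name), re.DOTALL)
    return [m.span() for m in pattern.finditer(text)]


def contained_in(inner, outer):
    """Keep the inner spans that lie wholly inside at least one outer span."""
    return [
        (start, end)
        for (start, end) in inner
        if any(o_start <= start and end <= o_end for (o_start, o_end) in outer)
    ]


text = (
    "<doc><chapter><title>Parsing</title></chapter>"
    "<appendix><title>Index</title></appendix></doc>"
)
titles = regions(text, "title")
chapters = regions(text, "chapter")

# "title in chapter": titles that sit inside a chapter element.
chapter_titles = [text[s:e] for (s, e) in contained_in(titles, chapters)]
print(chapter_titles)
```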

> What strategies have people got for limiting the DOM parsing time?

In Perl there are various tools (e.g. Storable.pm) for dumping Perl data 
hierarchies in a binary form, which could be done with pre-parsed DOM data.  
For my application, though, just the time required to load the DOM module's 
code is a problem.
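
The same idea can be sketched in Python, with the standard pickle module
standing in for Storable.pm (a substitution, not something tested on the
original poster's setup).  The helper names to_plain, dump_parsed and
load_parsed are hypothetical.  The tree is first reduced to nested tuples of
built-in types so that the stored form does not depend on any particular DOM
implementation; whether reloading it is actually faster than reparsing would
need to be measured.

```python
import pickle
import xml.etree.ElementTree as ET


def to_plain(elem):
    """Reduce an element to (tag, attributes, text, children) using built-in types.

    Tail text after an element is dropped in this simplified form.
    """
    return (
        elem.tag,
        dict(elem.attrib),
        elem.text or "",
        [to_plain(child) for child in elem],
    )


def dump_parsed(xml_text):
    """Parse once and return bytes suitable for writing to a cache file."""
    plain = to_plain(ET.fromstring(xml_text))
    return pickle.dumps(plain, protocol=pickle.HIGHEST_PROTOCOL)


def load_parsed(blob):
    """Restore the pre-parsed tree without running the XML parser again.

    Only unpickle data you wrote yourself; pickle is not safe for untrusted input.
    """
    return pickle.loads(blob)


blob = dump_parsed('<doc id="1"><p>hello</p><p>world</p></doc>')
tag, attrs, text, children = load_parsed(blob)
print(tag, attrs, [child[2] for child in children])
```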

> My own thoughts are that I load up all 140 documents at server-startup time,
> parse them into DOM[0]...DOM[139], store them into memory and then query
> each one in turn in the case of a simple query, and query all the DOM
> objects in the case of a full query across all XML documents.
> 
> Is this sensible? practical? stupid?

DOM operations in Perl typically involve inefficient linear searches.  I'm 
not sure whether this is inherent in the DOM or is implementation dependent.  
At least in Perl, the DOM is good for manipulation, but not particularly 
efficient for simple extraction of data.
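
If the documents are parsed once at server startup as proposed, one way to
avoid walking each tree on every query is to build a lookup table at the
same time.  The following Python sketch assumes the queries are mostly
"find elements named X"; load_repository, find_text and the sample document
names are hypothetical, and xml.etree.ElementTree is used in place of a full
DOM for brevity.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def load_repository(sources):
    """Parse every document once and index its elements by tag name."""
    trees = {}
    by_tag = defaultdict(list)
    for name, xml_text in sources.items():
        root = ET.fromstring(xml_text)
        trees[name] = root
        for elem in root.iter():
            by_tag[elem.tag].append((name, elem))
    return trees, by_tag


def find_text(by_tag, tag, doc=None):
    """Return the text of elements with the given tag, using the index.

    With doc set, only the entries for that tag are filtered by document
    name; the trees themselves are never traversed.
    """
    return [
        elem.text
        for name, elem in by_tag.get(tag, [])
        if doc is None or name == doc
    ]


sources = {
    "doc0.xml": "<doc><title>First</title><p>one</p></doc>",
    "doc1.xml": "<doc><title>Second</title><p>two</p></doc>",
}
trees, by_tag = load_repository(sources)

print(find_text(by_tag, "title"))          # query across all documents
print(find_text(by_tag, "p", "doc1.xml"))  # query against a single document
```

The cost is extra memory for the index on top of the parsed trees, which for
140 documents of roughly 100k bytes each should be checked rather than
assumed to be acceptable.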

Andrew McNaughton

-- 
-----------
Andrew McNaughton
andrew at squiz.co.nz
http://www.newsroom.co.nz/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1