SAX, DOM, and Search Engines (was Re: xml parser)

Wed Nov 4 22:33:58 GMT 1998

Tim Bray writes:

 > At 10:55 AM 11/4/98 -0000, Michael Kay wrote:
 > >My immediate answer to this is yes, all the information you need for a
 > >search engine is available via the SAX or DOM interface offered by many
 > >parsers.
 > 
 > I disagree.  Few parsers track byte offsets or other locational info in
 > the file, and I think you need that to do basic things like proximity
 > and phrase search.

I disagree.  While byte offsets might be useful for other purposes,
they would be inappropriate for proximity and phrase searches -- for
those, you need to track the relative positions of words, not their
absolute positions.  Consider the following example:

  <p>WORD1 &x; WORD2</p>

Is WORD1 close to WORD2?  It's only five bytes away (assuming an 8-bit
encoding), but might be separated by 20,000 words, depending on what
&x; expands to.  SAX and the DOM do give you enough information to
determine the relative positions of words.
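
A minimal sketch of that point, assuming Python's standard xml.sax
module and an entity declared in the internal DTD subset.  The handler
name, the helper functions, and the "filler" text are made up for
illustration; treating every element boundary as a word break is also
an assumption, not something SAX requires.

```python
import xml.sax


class WordPositionHandler(xml.sax.ContentHandler):
    """Record the ordinal position of every word in document order."""

    def __init__(self):
        super().__init__()
        self.positions = {}   # word -> list of ordinal positions
        self._count = 0       # number of words seen so far
        self._buffer = []     # text chunks since the last element boundary

    def characters(self, content):
        # A parser may deliver one run of text in several chunks, so
        # buffer the chunks and split into words at element boundaries.
        self._buffer.append(content)

    def _flush(self):
        for word in "".join(self._buffer).split():
            self.positions.setdefault(word, []).append(self._count)
            self._count += 1
        self._buffer = []

    def startElement(self, name, attrs):
        self._flush()

    def endElement(self, name):
        self._flush()


def build_document(filler_words):
    """Build the <p> example with &x; expanding to `filler_words` words."""
    filler = " ".join(["filler"] * filler_words)
    return (
        '<?xml version="1.0"?>\n'
        "<!DOCTYPE p [\n"
        f'  <!ENTITY x "{filler}">\n'
        "]>\n"
        "<p>WORD1 &x; WORD2</p>"
    )


def word_positions(xml_text):
    handler = WordPositionHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.positions


def word_distance(xml_text, first, second):
    """Distance in words between the first occurrences of two words."""
    positions = word_positions(xml_text)
    return positions[second][0] - positions[first][0]


if __name__ == "__main__":
    for n in (3, 500):
        print(n, word_distance(build_document(n), "WORD1", "WORD2"))
```

The byte gap between WORD1 and WORD2 in the source is the same for
every value of n, while the word distance reported through the SAX
events grows with the size of the expansion.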

Byte offsets would be helpful for displaying context around a match,
but there would be no 100% reliable way to format that context without
starting from the top of the document.  In that case an XPointer (also
derivable from SAX or DOM) might be more helpful, unless you want the
search engine to display raw XML markup for the context.
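
As a rough sketch of deriving such a location from SAX events, again
assuming Python's xml.sax: the handler below records, for each hit, the
chain of child-element numbers leading to the element that contains it.
The "/1/1/2" notation is only a simplified child sequence in the spirit
of an XPointer, not the syntax of any particular XPointer draft, and
the class, function, and sample document are hypothetical.

```python
import xml.sax


class ElementPathHandler(xml.sax.ContentHandler):
    """Record a child-number path for each element containing `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.hits = []
        self._path = []            # child number of each open element
        self._child_counts = [0]   # element children seen at each depth
        self._buffer = []          # text chunks since the last boundary

    def characters(self, content):
        self._buffer.append(content)

    def _flush(self):
        # Attribute buffered text to the element that is currently open.
        for word in "".join(self._buffer).split():
            if word == self.target:
                self.hits.append("/" + "/".join(map(str, self._path)))
        self._buffer = []

    def startElement(self, name, attrs):
        self._flush()
        self._child_counts[-1] += 1
        self._path.append(self._child_counts[-1])
        self._child_counts.append(0)

    def endElement(self, name):
        self._flush()
        self._path.pop()
        self._child_counts.pop()


def find_word_paths(xml_text, target):
    handler = ElementPathHandler(target)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.hits


SAMPLE = (
    "<doc><sec><p>alpha</p><p>beta WORD2</p></sec>"
    "<sec><p>WORD2</p></sec></doc>"
)

if __name__ == "__main__":
    print(find_word_paths(SAMPLE, "WORD2"))
```

A search engine could store such a path with each hit and use it later
to re-parse the document and format the surrounding element, rather
than seeking to a byte offset in the raw markup.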

All the best,

David

-- 
David Megginson                 david at megginson.com
           http://www.megginson.com/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/