SAX, DOM, and Search Engines (was Re: xml parser)

Tim Bray tbray at textuality.com
Wed Nov 4 23:08:59 GMT 1998


At 05:32 PM 11/4/98 -0500, david at megginson.com wrote:
>Tim Bray writes:
> > I disagree.  Few parsers track byte offsets or other locational info in
> > the file, and I think you need that to do basic things like proximity
> > and phrase search.
>
>I disagree.  While byte offsets might be useful for other purposes,
>they would be inappropriate for proximity and phrase searches -- for
>those, you need to track the relative positions of words, not their
>absolute positions.  Consider the following example:
>
>  <p>WORD1 &x; WORD2</p>
>Is WORD1 close to WORD2?

Clearly, the proximity tests have to work in terms of proximity in the
cooked, not raw, text.  Lark carefully tracks offsets in terms of the
entity stack so you can do this.  But that's so obvious I don't think 
it's your point. 
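
A rough illustration of what offsets in the cooked text, tied back to
the entity they came from, might look like. This is a minimal Python sketch
and not Lark's actual mechanism: the names `cook` and `locate`, the segment
tuple layout, and the restriction to non-nested internal entities are all
assumptions made for the example, and it counts characters rather than bytes.

```python
import re

ENTITY = re.compile(r"&(\w+);")

def cook(raw, entities):
    """Expand &name; references (non-nested only, an assumption).

    Returns the cooked text plus a list of segments, each a tuple
    (cooked_start, source, source_start, length), where source is
    "doc" for literal text or the entity name for expanded text.
    """
    out, segments, pos, cooked_pos = [], [], 0, 0
    for m in ENTITY.finditer(raw):
        literal = raw[pos:m.start()]
        if literal:
            segments.append((cooked_pos, "doc", pos, len(literal)))
            out.append(literal)
            cooked_pos += len(literal)
        replacement = entities[m.group(1)]
        segments.append((cooked_pos, m.group(1), 0, len(replacement)))
        out.append(replacement)
        cooked_pos += len(replacement)
        pos = m.end()
    tail = raw[pos:]
    if tail:
        segments.append((cooked_pos, "doc", pos, len(tail)))
        out.append(tail)
    return "".join(out), segments

def locate(segments, cooked_offset):
    """Map a cooked-text offset back to (source, offset within source)."""
    for start, source, source_start, length in segments:
        if start <= cooked_offset < start + length:
            return source, source_start + (cooked_offset - start)
    raise IndexError(cooked_offset)

raw = "WORD1 &x; WORD2"
cooked, segments = cook(raw, {"x": "a b c"})
print(cooked)
print(locate(segments, cooked.index("WORD2")))
```

Proximity would be measured on `cooked`, while `locate` answers the
"where did this come from" question for addressing back into the source.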

Secondly, for proximity, you're worried about counting characters, not 
bytes, but for addressing back into the entity, you're worried about byte, 
not character, offsets.  So it's even harder than it looks.  Unless
of course you're using UTF-16 and staying in the BMP - which might be
a REAL good idea in an IR-oriented system anyhow. 
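
To make the byte/character distinction concrete, here is a small Python
sketch. The helper name `offsets` and the sample strings are made up for
illustration. It shows the same word at one character offset but different
byte offsets depending on the encoding, and why UTF-16 only gives a fixed
two-bytes-per-character ratio while you stay inside the BMP.

```python
def offsets(text, word, encoding):
    """Return (character offset, byte offset) of word in text."""
    char_off = text.index(word)
    # The byte offset is the encoded length of everything before the word.
    byte_off = len(text[:char_off].encode(encoding))
    return char_off, byte_off

# "naive cafe WORD" with two accented (non-ASCII, BMP) letters
text = "na\u00efve caf\u00e9 WORD"
print(offsets(text, "WORD", "utf-8"))      # (11, 13)
print(offsets(text, "WORD", "utf-16-be"))  # (11, 22): exactly 2 bytes per character

# One character outside the BMP needs a surrogate pair, so the ratio breaks.
print(offsets("\U00010000 WORD", "WORD", "utf-16-be"))  # (2, 6)
```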

>  It's only five bytes away (assuming an 8-bit
>encoding), but might be separated by 20,000 words, depending on what
>&x; expands to.  SAX and the DOM do give you enough information to
>determine the relative positions of words.

[warning: simple argument with long embedded digression]
I don't think so.  How about languages, such as those spoken by the
majority of the world's inhabitants, that do not separate words with
spaces?  (Identifying word breaks in running Japanese or Chinese
text is essentially a strong-AI problem.  You can get decent results
by running a dictionary and searching at each character break for
a match, with morphological heuristics, but it turns out that in those
languages there is sufficient encoding redundancy that you get pretty
good results (at a cost of some space wastage) just treating most 
characters as words - and lurking in that fact there's a PhD in 
linguistics for someone - but I digress, I spent a long time
in those particular mines).  
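
A minimal Python sketch of the "treat most characters as words" approach,
under loudly stated assumptions: the code-point ranges in `is_cjk` are a
rough stand-in (CJK Unified Ideographs plus Hiragana and Katakana only), the
names `is_cjk` and `tokens` are invented here, and no dictionary or
morphological heuristics are involved.

```python
def is_cjk(ch):
    """Rough test: CJK Unified Ideographs, Hiragana, Katakana (an assumption)."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF

def tokens(text):
    """Yield (token, character offset) pairs.

    Each CJK character becomes its own token; runs of other
    alphanumeric characters are grouped into one token.
    """
    word_start = None
    for i, ch in enumerate(text):
        if is_cjk(ch):
            if word_start is not None:
                yield text[word_start:i], word_start
                word_start = None
            yield ch, i
        elif ch.isalnum():
            if word_start is None:
                word_start = i
        else:
            if word_start is not None:
                yield text[word_start:i], word_start
                word_start = None
    if word_start is not None:
        yield text[word_start:], word_start

# "XML", two kanji, then "engine"
for token, offset in tokens("XML \u691c\u7d22 engine"):
    print(offset, token)
```

The space cost shows up as one index entry per CJK character rather than
one per dictionary word.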

But spotting "words" may not matter.  In fact, I am not aware of 
any research that shows word proximity to be a better information 
retrieval heuristic than character proximity.  And it's much easier
to nail down what you mean by "character" than "word", and thus get
deterministic cross-language behavior.
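
Character proximity is easy to pin down in code. The following Python
sketch is only an illustration: `find_all`, `char_distance`, `near`, and the
window sizes are hypothetical names and values, and it scans the text
directly rather than consulting an index as a real engine would.

```python
def find_all(text, term):
    """Yield the character offset of every occurrence of term."""
    start = text.find(term)
    while start != -1:
        yield start
        start = text.find(term, start + 1)

def char_distance(text, a, b):
    """Smallest number of characters separating a and b, or None if either is absent."""
    best = None
    for i in find_all(text, a):
        for j in find_all(text, b):
            if i <= j:
                gap = j - (i + len(a))
            else:
                gap = i - (j + len(b))
            gap = max(gap, 0)  # overlapping matches count as adjacent
            if best is None or gap < best:
                best = gap
    return best

def near(text, a, b, window):
    """True if a and b occur within window characters of each other."""
    distance = char_distance(text, a, b)
    return distance is not None and distance <= window

cooked = "WORD1 a b c WORD2"
print(char_distance(cooked, "WORD1", "WORD2"))  # 7
print(near(cooked, "WORD1", "WORD2", 10))       # True
```

Nothing here depends on finding word boundaries, so the same test behaves
identically on text in any language.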

>Byte offsets would be helpful for displaying context around a match,
>but there would be no 100% reliable way to format that context without
>starting from the top of the document

unless you used the whizzy new soon-to-arrive W3C fragment packager,
right?  Actually, if you have an index that can understand the
structure well enough to support XPointer-flavor querying, the engine
is going to know all the context info, so this should actually work
pretty well (but only if you know the byte/character offsets).

And the right way to display results in context depends on whether
you're sampling, or visiting each match.

OK, you've been warned... if you get me going on the problems of 
searching in tagged internationalized text, bring a windbreaker -
you'll need it.  -Tim
