XML Search Engine

Thu Nov 5 18:55:07 GMT 1998

Tim Bray writes:

 > What I said was:
 > 1. I have not seen any research which demonstrates that word proximity
 >    achieves better results than character proximity based on any
 >    well-known IR metric.
 > 2. Doing word proximity at all is a *very* hard problem in the languages
 >    used by a large majority of the world's population.

I think that there might be a disconnect here.  What we're talking
about is minimal-semantic-unit proximity -- for some
languages/contexts, the minimal semantic unit will always be a single
grapheme, and for others, it will be a cluster of one or more
graphemes.

This type of clustering is critical for search engines, which often
(usually?) provide inverted indexes only for minimal semantic units,
not for all graphemes.  The argument, then, is that proximity testing
should be done by counting the units that were indexed, which may or
may not be single graphemes.
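
As a rough illustration of that argument (a minimal sketch only, not a
description of any particular engine), the Python below builds a
positional index over whatever units a tokenizer emits and measures
proximity in those units rather than in characters.  The names
tokenize, build_index and within are hypothetical, and the tokenizer is
a deliberate simplification: whitespace splitting stands in for
"cluster of graphemes", and per-character splitting stands in for
"single grapheme".

```python
from collections import defaultdict


def tokenize(text, unit="word"):
    """Split text into (hypothetical) minimal semantic units.

    unit="word": whitespace-delimited clusters of characters.
    unit="char": every non-space character is its own unit.
    """
    if unit == "word":
        return text.split()
    return [ch for ch in text if not ch.isspace()]


def build_index(units):
    """Map each unit to the list of positions at which it occurs."""
    index = defaultdict(list)
    for pos, u in enumerate(units):
        index[u].append(pos)
    return index


def within(index, a, b, max_distance):
    """True if some occurrence of a lies at most max_distance units from b."""
    return any(
        abs(i - j) <= max_distance
        for i in index.get(a, ())
        for j in index.get(b, ())
    )


text = "proximity testing should count the units that were indexed"
index = build_index(tokenize(text))

near = within(index, "proximity", "count", 3)    # unit positions 0 and 3
far = within(index, "proximity", "indexed", 3)   # unit positions 0 and 8
```

Because the distance is taken over index positions, the same within()
test works unchanged whether the tokenizer produced multi-grapheme
clusters or single characters; only tokenize() has to know which applies.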

All the best,

David

-- 
David Megginson                 david at megginson.com
           http://www.megginson.com/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/