XML Search Engine

Thu Nov 5 17:52:41 GMT 1998

At 12:27 PM 11/5/98 -0000, Michael Kay wrote:
>Switching threads, I am a little surprised by Tim's remarks on word
>proximity versus character proximity. Confining our attention to European
>languages (as most search engines do), word proximity searching is a common
>feature of the high-end search engines, whereas character proximity is
>hardly found outside basic desktop tools like grep. 

What I said was:
1. I have not seen any research which demonstrates that word proximity
   achieves better results than character proximity based on any
   well-known IR metric.
2. Doing word proximity at all is a *very* hard problem in the languages
   used by a large majority of the world's population.
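
To make the two notions concrete, here is a minimal sketch of both tests. The function names are hypothetical and not taken from any engine discussed in this thread. The word version assumes that words can be found by splitting on non-word characters, which is exactly the step that is hard for languages written without spaces.

```python
import re

def char_proximity(text, a, b, max_chars):
    """True if some occurrence of a starts within max_chars characters of b."""
    text = text.lower()
    pos_a = [m.start() for m in re.finditer(re.escape(a.lower()), text)]
    pos_b = [m.start() for m in re.finditer(re.escape(b.lower()), text)]
    return any(abs(i - j) <= max_chars for i in pos_a for j in pos_b)

def word_proximity(text, a, b, max_words):
    """True if a and b occur as whole words at most max_words positions apart."""
    # Assumption: a "word" is a maximal run of word characters.
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= max_words for i in pos_a for j in pos_b)

sample = "XML search engines index structured documents"
print(word_proximity(sample, "xml", "documents", 5))
print(char_proximity(sample, "xml", "documents", 20))
```

The character test needs no tokenizer at all, while the word test is only as good as its definition of a word.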

>Apart from anything
>else, once you've done the word normalisation (normalising different
>linguistic forms or spellings of the same word), character proximity is
>meaningless. In the older boolean engines word proximity is used rather
>mechanistically, in the newer engines it is used more subtly as part of a
>statistical or linguistic approach to relevance ranking

If you go poking around either in the SIGIR world (that would be the 
Association for Computing Machinery's Special Interest Group on 
Information Retrieval) or in the actual commercial retrieval engine
world, you find a distressing lack of technology progress.  Yes, with
modern engines, precision & recall are measurably better than they
were in 1978.  But 10 times as good?  Hah!  Twice as good?  Maybe,
for certain restricted application domains.  Given all this, I'm
less than impressed about the subtle techniques of modern engines.
On top of which, most of the techniques used in the "advanced" engines
are basically Anglocentric and fall apart once you get outside the
English-speaking world.

> but either way it
>is an established feature of the scene, and it is not there on whim: the
>search algorithms used are based on extensive research and benchmarking of
>relevance and recall scores.

Yeah, well, it's *not* an established feature of the scene in Asia.  Maybe
it's just an irrational prejudice, but I'm not all that interested in
computing techniques that are not usable by a large majority of the
world's population.  And once again, I challenge the assertion that,
for all these clever heuristics, real-world retrieval software is
really much better than it was 20 years ago.

>An interesting comparison of web search engines is at
>http://www.netstrider.com/search/features.html ; this asserts that all the
>well-known web search engines other than Lycos use word proximity matching.

And we know what wonderful results they produce (that's in English; for
real joy, go try a tricky query in German - even European languages sometimes
leave out the spaces between the words - and see what happens).  -Tim
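
A small sketch of the German problem, assuming a naive tokenizer that splits on non-word characters. The sample sentence and the helper names are illustrative only. A whole-word match misses a term buried inside a compound, while a plain substring match finds it.

```python
import re

def word_match(text, term):
    """True if term appears as a whole token under naive tokenization."""
    tokens = re.findall(r"\w+", text.lower())
    return term.lower() in tokens

def substring_match(text, term):
    """True if term appears anywhere in the text, ignoring word boundaries."""
    return term.lower() in text.lower()

# Illustrative sentence containing a long German compound noun.
sentence = "Die Donaudampfschifffahrtsgesellschaft wurde 1829 gegründet"
print(word_match(sentence, "Schiff"))
print(substring_match(sentence, "Schiff"))
```

Substring matching has its own costs (it also matches unrelated words that happen to contain the term), so this shows the failure mode rather than a fix.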

PS: Given my grouchy tone, I should say that I'm dazzled at the
inventiveness, deep thought, and creativity that have been invested
in the IR field in recent decades.  The fact the results are so
underwhelming is evidence of how hard the problems are... the real
lesson is that we should marvel at the language-processing apparatus
we carry around between our ears. -T

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)