XML Search Engine

Thu Nov 5 18:21:03 GMT 1998

As you say, word/character proximity searching is not that interesting, and
if that is what is desired, XML doesn't have much to add to the current
equation.

On the other hand, grove-based proximity search techniques have also been
used since the 1970s, when this was called a "semantic network". The
advantage is that it is language independent. To date, this hasn't been
terribly useful with HTML, as not many people care about indexing <p> tags,
for example. This is where XML has lots to offer and where efforts ought to
be, and are being, directed (IMHO).
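
To make the idea concrete, here is a minimal sketch of proximity measured
over document structure rather than over words or characters. It is only an
illustration under stated assumptions: the parsed XML tree from Python's
xml.etree.ElementTree stands in for a grove, "proximity" is taken to be the
number of tree edges between the two elements whose text contains the terms,
and the function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def paths_to_term(root, term):
    """Collect the root-to-element path of every element whose own leading
    text contains `term` (case-insensitive). Tail text is ignored to keep
    the sketch short."""
    hits = []

    def walk(node, path):
        path = path + [node]
        if term.lower() in (node.text or "").lower():
            hits.append(path)
        for child in node:
            walk(child, path)

    walk(root, [])
    return hits


def tree_distance(path_a, path_b):
    """Number of edges between two elements, going via their lowest
    common ancestor."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a is not b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


def structural_proximity(xml_text, term_a, term_b):
    """Smallest tree distance between an element containing term_a and one
    containing term_b, or None if either term is absent."""
    root = ET.fromstring(xml_text)
    distances = [
        tree_distance(pa, pb)
        for pa in paths_to_term(root, term_a)
        for pb in paths_to_term(root, term_b)
    ]
    return min(distances) if distances else None


# Hypothetical sample document, invented for this illustration.
DOC = """<record>
  <patient><name>Smith</name><diagnosis>fracture</diagnosis></patient>
  <patient><name>Jones</name><diagnosis>asthma</diagnosis></patient>
</record>"""

same_patient = structural_proximity(DOC, "smith", "fracture")
other_patient = structural_proximity(DOC, "smith", "asthma")
```

Here the pair under the same <patient> element comes out closer than the
pair that spans two <patient> elements, and nothing in the measure depends
on how the language of the text separates words.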

Jonathan Borden
JABR Technology
http://jabr.ne.mediaone.net

Tim Bray wrote:
>
>
> At 12:27 PM 11/5/98 -0000, Michael Kay wrote:
> >Switching threads, I am a little surprised by Tim's remarks on word
> >proximity versus character proximity. Confining our attention to European
> >languages (as most search engines do), word proximity searching is a
> >common feature of the high-end search engines, whereas character
> >proximity is hardly found outside basic desktop tools like grep.
>
> What I said was:
> 1. I have not seen any research which demonstrates that word proximity
>    achieves better results than character proximity based on any
>    well-known IR metric.
> 2. Doing word proximity at all is a *very* hard problem in the languages
>    used by a large majority of the world's population.
>
> >Apart from anything
> >else, once you've done the word normalisation (normalising different
> >linguistic forms or spellings of the same word), character proximity is
> >meaningless. In the older boolean engines word proximity is used rather
> >mechanistically, in the newer engines it is used more subtly as part of a
> >statistical or linguistic approach to relevance ranking
>
> If you go poking around either in the SIGIR world (that would be the
> Association for Computing Machinery's Special Interest Group on
> Information Retrieval) or in the actual commercial retrieval engine
> world, you find a distressing lack of technology progress.  Yes, with
> modern engines, precision & recall are measurably better than they
> were in 1978.  But 10 times as good?  Hah!  Twice as good?  Maybe,
> for certain restricted application domains.  Given all this, I'm
> less than impressed about the subtle techniques of modern engines.
> On top of which, most of the techniques used in the "advanced" engines
> are basically Anglocentric and fall apart once you get outside the
> English-speaking world.
>
> > but either way it
> >is an established feature of the scene, and it is not there on whim: the
> >search algorithms used are based on extensive research and
> >benchmarking of relevance and recall scores.
>
> Yeah, well, it's *not* an established feature of the scene in Asia.  Maybe
> it's just an irrational prejudice, but I'm not all that interested in
> computing techniques that are not usable by a large majority of the
> world's population.  And once again, I challenge the assertion that,
> for all these clever heuristics, real-world retrieval software is
> really much better than it was 20 years ago.
>
> >An interesting comparison of web search engines is at
> >http://www.netstrider.com/search/features.html ; this asserts that all
> >the well-known web search engines other than Lycos use word proximity
> >matching.
>
> And we know what wonderful results they produce (that's in English; for
> real joy, go try a tricky query in German - even European languages sometimes
> leave out the spaces between the words - and see what happens).  -Tim
>
> PS: Given my grouchy tone, I should say that I'm dazzled at the
> inventiveness, deep thought, and creativity that have been invested
> in the IR field in recent decades.  The fact the results are so
> underwhelming is evidence of how hard the problems are... the real
> lesson is that we should marvel at the language-processing apparatus
> we carry around between our ears. -T
>
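
The difference between the two kinds of proximity debated in the quoted
message can be sketched as follows. This is an illustration only, not how
any particular engine works: it assumes words are whatever a simple regular
expression splits out, does no stemming or decompounding, and the function
names and sample sentences are hypothetical.

```python
import re


def char_distance(text, a, b):
    """Smallest gap, in characters, between the start of an occurrence of
    `a` and the start of an occurrence of `b` (substring match)."""
    text = text.lower()
    pos_a = [m.start() for m in re.finditer(re.escape(a.lower()), text)]
    pos_b = [m.start() for m in re.finditer(re.escape(b.lower()), text)]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


def word_distance(text, a, b):
    """Smallest gap, in word positions, between `a` and `b`, where both
    must match a whole token produced by the tokenizer."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


english = "word proximity search engines"
german = "Das Dampfschiff liegt auf der Donau"

# English: both measures find the terms.
english_words = word_distance(english, "word", "search")
english_chars = char_distance(english, "word", "search")

# German compound: "schiff" is never a separate token, so the word-based
# measure finds nothing, while the character-based one still does.
german_words = word_distance(german, "schiff", "donau")
german_chars = char_distance(german, "schiff", "donau")
```

With space-delimited English the two measures agree on whether the terms
co-occur at all; with the German compound the word-based measure only works
if something first splits "Dampfschiff" into its parts, which is the hard
step the quoted message points at.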

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)