Assisted Search of XML document collections

Sat May 22 18:41:18 BST 1999

On Sat, 22 May 1999, Didier PH Martin wrote:

> <comment>
> Although I have announced this project on the perl-xml list, and it will
> concentrate on Perl, with and without XS, there is no reason that Java
> and/or C/C++ viewpoints are not welcome. We are primarily interested in
> exploring issues pertaining to the construction of a file that describes a
> collection of XML documents in a succinct fashion, most likely with a
> moderate to high degree of application specificity - i.e. there may not be
> a lot of defaults that make sense.
> 
> We also wish to supply a useful API that search engine writers can use.
> </comment>
> 
> <reply>
> There is obviously a need for such a tool in the context of IETMs, and also
> in the context of XML sites. I have nothing against Perl, but why not
> consider C/C++ and Perl? (This is not a criticism but a question.) C/C++ for
> speed and Perl for convenience.
> </reply>
> 
Hi, Didier

The use of C is certainly on the table. It may well be that we end up with
a C library, and an API that we describe using a header file, and then run
h2xs or SWIG to produce Perl and Tcl and, yes, (shudder... :-)) Python
interfaces. So the option would be there for someone to use C directly,
also.

Java is definitely the close second implementation option to Perl+C.
I have a personal bias against C++, but I'm trying to solve a problem and
if someone wishes to become involved and presents a solution using C++, it
will be considered.

There are currently 5 people who are interested in doing this thing, and I
think we are at the point where we are not locked into using this language
or that language.

I'm glad you mentioned "speed". That is really the whole point of this
exercise. We have identified a number of cases where one has N documents
and the business case makes it logical that the individual documents be
marked up in XML. Brute-force search is insufficiently speedy, so we want
the ability to construct a derivative document, based on search
parameters, and to let the application call a search engine method or
function with an *optional* parameter, namely the name of the "index".
I use the term "index" with reservations, as sometimes the
derivative document may act exactly like an "index", and perhaps sometimes
it will not. But the desired end result is that you get your list of files
much faster than if you had no special knowledge of the document
collection.

It stands to reason that we want the API to be attractive to search engine
writers. *We* - I mean the people involved - have real applications that
require search, and some of us _are_ search engine writers, so to speak,
but the intent of this project is not to write a universal search engine,
but rather to devise a set of functions that can be used to examine a
group of XML files based on guidance, and distill the knowledge imparted
by the markup into a summary. This summary can then be used by a search
engine.

This is all pretty wordy stuff, and I don't mind admitting that I'm
putting this whole idea out there so people can say what they think about
it. I guess what it all boils down to is something like this:

use XML::Index;

# assemble a list of filenames or filehandles
...

# construct the index. $parameters is a hash of tag descriptions or
# something similar (this is notional)
$index = XML::Index->build_index($parameters, @xml_files);

# conduct a typical search using a typical search engine
@xml_files = XML::Search_Engine->pick($search_params, [ $index ]);

__END__

The [] around the $index is not there to make that an anonymous array ref
but rather to indicate that the use of an index is optional. *If* that
argument is present then the search engine writer uses the index API to
use the index.
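
To make the optional-argument idea concrete, here is a minimal sketch. It
is not a proposed design: the package names Sketch::Index and
Sketch::Search_Engine, the files_with_tag method, and the layout of the
parameter hashes (tags, tag, text, files) are all made up for
illustration, and the regex matching is a crude stand-in for a real XML
parser.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the notional XML::Index.
package Sketch::Index;

# Build a map from tag name to the files that contain that tag.
sub build_index {
    my ($class, $parameters, @xml_files) = @_;
    my %files_for;
    for my $file (@xml_files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        my $text = do { local $/; <$fh> };
        close $fh;
        for my $tag (@{ $parameters->{tags} }) {
            # Crude start-tag match; a real build would use an XML parser.
            push @{ $files_for{$tag} }, $file
                if $text =~ /<\Q$tag\E[\s\/>]/;
        }
    }
    return bless { files_for => \%files_for }, $class;
}

# Index API used by the search engine: files known to contain $tag.
sub files_with_tag {
    my ($self, $tag) = @_;
    return @{ $self->{files_for}{$tag} || [] };
}

# Hypothetical stand-in for the notional XML::Search_Engine.
package Sketch::Search_Engine;

# Return the files in which element $tag contains $text directly.
# $index is optional; without it every file in $search_params is scanned.
sub pick {
    my ($class, $search_params, $index) = @_;
    my ($tag, $text) = @{$search_params}{qw(tag text)};
    my @candidates = defined $index
        ? $index->files_with_tag($tag)
        : @{ $search_params->{files} };
    my @hits;
    for my $file (@candidates) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        my $content = do { local $/; <$fh> };
        close $fh;
        push @hits, $file
            if $content =~ /<\Q$tag\E(?:\s[^>]*)?>[^<]*\Q$text\E/;
    }
    return @hits;
}

package main;
```

With an index built for the tag in question, pick() only opens the files
the index names; without one it falls back to scanning everything. Note
that in this sketch an index built without the searched tag yields no
candidates at all, so the two paths agree only when the index covers the
tag.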

Other thought: an index may be a prebuilt XML document which is nothing
but a list of XML Links. OTOH, it may be nothing of the sort - it may be a
binary structure. In other words, this is what we are trying to figure
out.
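
For the "XML document of links" variant, something like the following
could serve as a starting point. It is only a sketch: the element and
attribute names (index, entry, link, tag, href) are invented, the link
syntax is a placeholder rather than real XML Link markup, and the input
is assumed to be a simple hash mapping a tag name to a list of files.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML attribute values.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Serialize a tag => [files] map as a small XML document of links.
# Element and attribute names are hypothetical.
sub index_to_xml {
    my ($files_for) = @_;
    my $xml = qq{<?xml version="1.0"?>\n<index>\n};
    for my $tag (sort keys %$files_for) {
        $xml .= sprintf qq{  <entry tag="%s">\n}, xml_escape($tag);
        for my $file (@{ $files_for->{$tag} }) {
            $xml .= sprintf qq{    <link href="%s"/>\n}, xml_escape($file);
        }
        $xml .= "  </entry>\n";
    }
    return $xml . "</index>\n";
}

print index_to_xml({ title => [ 'a.xml', 'c.xml' ] });
```

A binary structure would replace index_to_xml with some other
serializer; the point is that the search engine should not need to care
which one was used.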

Thanks for your comments, Didier. I await others. :-)

Arved

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1