Assisted Search of XML document collections

Wed May 26 14:33:59 BST 1999

On Sat, 22 May 1999, Edward C. Zimmermann wrote:

And I (Arved) had written:

> > That is the stage we are at. I have this gut feeling that we need to
> > define what it means to have a search engine operate on let's say 100,000
> > documents marked up using XML, and what are the situations where it might
> > make more sense to search a file which describes that collection.

> 100K documents is not a problem. Even on consumer PC hardware a modestly performant 
> fulltext engine can handle typical queries on such a small collection in fractions
> of a second. The problem is more (beyond quantity) that information resources
> (XML, HTML or whatever) are not always static but dynamic. That's, above all, one
> of the fundamental flaws in the brute-force spider/crawl approaches followed by
> the major "Internet Engines" (beyond the impact on bandwidth, the half-life of
> data, and all the other significant shortcomings).
>
I don't think I'd quite agree that 100K documents is not a problem.
Full-text searches using maybe Boolean expressions, yes, that's fast, but
querying based on knowledge of the XML structure, i.e. something like the
Perl XML::XQL syntax, I'm sorry, I just don't see that kind of query as
shrugging off 100K documents.
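
A minimal sketch of the distinction drawn above, using Python's standard xml.etree.ElementTree rather than Perl XML::XQL. The element names (survey, county, notes) and the two sample records are hypothetical, made up purely for illustration. The point is only that a flat text search and a structure-aware search can give different answers, and that without an index the structural search has to parse every document on every query.

```python
import xml.etree.ElementTree as ET

# Hypothetical records; element names are assumptions for illustration only.
DOCS = {
    "a.xml": "<survey><county>Kings</county>"
             "<notes>boundary near Hants line</notes></survey>",
    "b.xml": "<survey><county>Hants</county>"
             "<notes>river frontage</notes></survey>",
}

def fulltext_match(xml_text, word):
    """Flat text search: ignores which element the word appears in."""
    root = ET.fromstring(xml_text)
    return word.lower() in " ".join(root.itertext()).lower()

def structural_match(xml_text, path, value):
    """Structure-aware search: `value` must be the text of an element at `path`."""
    root = ET.fromstring(xml_text)
    return any((el.text or "").strip() == value for el in root.findall(path))

# Both queries below parse every document; nothing is precomputed.
fulltext_hits = sorted(
    name for name, xml in DOCS.items() if fulltext_match(xml, "Hants")
)
structural_hits = sorted(
    name for name, xml in DOCS.items() if structural_match(xml, "county", "Hants")
)
print(fulltext_hits)    # both records mention "Hants" somewhere
print(structural_hits)  # only the record whose <county> is "Hants"
```

Run over 100K documents, the structural version repeats the parse for each one on each query, which is the cost being questioned here.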

Part of what I'm trying to do is define when an indexing scheme might be
appropriate. I'm leaning towards static or slowly varying. One's
definition of either would depend on factors such as how long it takes to
index, do already-existing documents change also, is the indexing
structure such that new documents can be incrementally added, etc etc.
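
One way to picture an indexing structure that allows incremental additions is an inverted index keyed on element path and text value. The sketch below is an assumption for illustration, not a description of any existing product: the PathIndex class, its methods, and the sample records are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

class PathIndex:
    """Hypothetical index: (element path, text) -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, xml_text):
        """Index one document; entries for existing documents are untouched."""
        root = ET.fromstring(xml_text)
        self._walk(root, root.tag, doc_id)

    def _walk(self, el, path, doc_id):
        # Record the element's own text, then recurse into its children.
        text = (el.text or "").strip()
        if text:
            self.postings[(path, text)].add(doc_id)
        for child in el:
            self._walk(child, path + "/" + child.tag, doc_id)

    def lookup(self, path, text):
        """Answer a structural query without re-parsing any document."""
        return sorted(self.postings.get((path, text), set()))

index = PathIndex()
index.add("a.xml", "<survey><county>Kings</county></survey>")
index.add("b.xml", "<survey><county>Hants</county></survey>")
# A new record arrives later; only its own entries are added.
index.add("c.xml", "<survey><county>Hants</county></survey>")
hits = index.lookup("survey/county", "Hants")
print(hits)
```

Adding a record is cheap in this design. Changing or deleting an existing document is not handled: its old entries would have to be found and removed first, which is why static or slowly varying collections are the comfortable case.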

I'm not so sure that indexing, at least as I envisage it, is going to
handle millions of changing documents on the Web, for example.

> >
> > Your best contribution would be to describe a business problem and tell us
> > how you like to solve it.

> Different problems, different methods, different tools. 
> 
> Let's turn the tables: since I'm the confused soul, can you explain a business
> problem and tell us how you might plan to "solve it"....
>
Sure. We put in a tender to supply document management to the local
provincial natural resources people, specifically the survey and mapping
types. We looked at perhaps 500K to 1M documents, of which (if I recall
aright) perhaps 75% were very amenable to being scanned in, zone-OCR'ed,
and had enough structure to make them very suitable XML candidates. Maybe
5 DTDs could have described that 75%.

You understand that I'm describing a tender a few years old, and that XML
wasn't on *anybody's* mind at the time. I'm thinking of it now, though, as a
situation where XML would be really appropriate for allowing the kinds of
searches these guys wanted to do. Plus they wanted to eventually make much
of this info available via the Web, or run off paper copies; again, XML
markup seems just right, since it can be converted into other formats as
required.

OK, as to searching and indexing. Of the "searchable" documents I
describe, all were static - they were *records*. Probably the number of
similar documents added in a given year would be 3-5% of the existing
archive. So an index would be a very manageable thing, and would rarely
change.

So, you understand, my viewpoint is record-centric. That's why I'm asking
for input.

Arved

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)