Identifying XML Document Types (was XML media types revisited)

Walter Underwood wunder at infoseek.com
Fri Jan 15 22:06:00 GMT 1999


At 03:12 PM 1/15/99 -0500, Simon St.Laurent wrote:
>
>With XML the expectations (for being able to process documents with both
>specific and generic tools) are much higher, yet the tools for identifying
>document types are actually weaker in many ways.

I'm not sure that things are all that bad. An Excel spreadsheet
can be a lot of different things, but it is always parsed the
same way. Word documents or FrameMaker documents may use different
templates, but the file format is the same. MIME types do a fine
job at that level.

More ambitious description schemes become more and more
application-specific.

For example, my application is reading XML so that our search
engine can index it. The document features that are important
to a search engine are not specified in DTDs, style sheets,
schemas, or anything else. We need to know which element is the
title, which is the description, and whether some parts of
the document are more important for search purposes (a bibliography
is less important, a problem description might be more important).
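
To make the point above concrete, here is a minimal sketch of the kind
of per-application mapping an indexer might need. Everything in it is
an assumption made for illustration: the element names, the field
names, the weights, and the function extract_fields are hypothetical
and do not describe how Ultraseek Server actually works.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: element name -> (search field, relative weight).
# None of this information appears in a DTD or style sheet; the indexing
# application has to supply it.
FIELD_MAP = {
    "title": ("title", 3.0),
    "abstract": ("description", 2.0),
    "problem": ("body", 1.5),
    "bibliography": ("body", 0.5),
}
DEFAULT = ("body", 1.0)


def extract_fields(xml_text):
    """Return a list of (field, weight, text) tuples for indexing."""
    root = ET.fromstring(xml_text)
    out = []

    def walk(elem, field, weight):
        # A mapped element overrides the field and weight inherited
        # from its parent; unmapped elements simply inherit.
        field, weight = FIELD_MAP.get(elem.tag, (field, weight))
        if elem.text and elem.text.strip():
            out.append((field, weight, elem.text.strip()))
        for child in elem:
            walk(child, field, weight)
            # Text following a child belongs to the current element.
            if child.tail and child.tail.strip():
                out.append((field, weight, child.tail.strip()))

    walk(root, *DEFAULT)
    return out
```

Note that the parser only needs the document to be well-formed. The
weighting lives entirely in the table, which is why it ends up being
specific to the application rather than to the document type.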

The search engine does not care whether the document is valid
or has a DTD at all, but it does care whether XLink is used
in the document (namespaces do help in this case).
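
A rough sketch of that check, using a non-validating parse and the
namespace to spot XLink attributes. The namespace URI below is an
assumption: it is the one from the later XLink 1.0 Recommendation,
which postdates this message, and the function name uses_xlink is
hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI (XLink 1.0 Recommendation).
XLINK_NS = "http://www.w3.org/1999/xlink"


def uses_xlink(xml_text):
    """Return True if any element carries an attribute in the XLink namespace."""
    # ElementTree does not validate; the document only has to be
    # well-formed, with or without a DTD.
    root = ET.fromstring(xml_text)
    prefix = "{" + XLINK_NS + "}"
    for elem in root.iter():
        # Namespaced attribute names are reported as "{uri}localname".
        if any(name.startswith(prefix) for name in elem.attrib):
            return True
    return False
```

Because the test looks at the namespace URI rather than the prefix, it
works whatever prefix the author chose, and an ordinary href attribute
with no namespace is not mistaken for a link.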

Documents are often put to unexpected uses--indexing for
search, legal discovery, corpus linguistics, whatever.
Committing to a document description too early can actually
make a document harder to use. 

In case you're curious, the search engine is a commercial 
product (Ultraseek Server), and has supported simple XML
searching since last September.

wunder

Walter R. Underwood
wunder at infoseek.com
wunder at best.com (home)
http://www.best.com/~wunder/
1-408-543-6946




