Discovering document types - best practice?

Thu Jun 24 13:43:46 BST 1999

At 12:13 PM 6/24/99 +0100, james at xmlTree.com wrote:
>This may seem like a simple problem, but I can't find any references to
>how best to solve it.

It seems like it should be a simple problem, but there are lots of
complexities along the way.  XML has no reliable 'document type
identification' mechanism because of the approach it takes to validation.
(Internal subsets can change the rules on a per-document basis; namespaces
and validation don't get along very well at present; MIME types aren't yet
commonly used for XML; and application/xml doesn't tell you anything about
what vocabulary is used, just that it's XML.)

>I need to process a series of xml documents, which can be in a number of
>different formats.  I don't know in advance the type of the documents, only
>their URLs.  What is the best way of analysing what type the document is
>(and how to process it)?  Is there a "best practice" for this?

I wish there were... as time goes on, I do expect more XML documents to get
MIME types identifying them specifically, but this hasn't happened yet.
For some of the complexities involved in this process, see the discussion
archives at http://www.imc.org/ietf-xml-mime/.  When MIME types get
straightened out, you could use HEAD requests to the server to get a MIME
type back and base your processing on that rather than downloading entire
documents in order to determine if they fit your requirements.
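
A minimal sketch of that HEAD-based check, written in Python purely for
illustration.  The HANDLERS table and the function names are hypothetical,
and the sketch assumes the server answers HEAD requests with a meaningful
Content-Type, which is exactly the part that isn't dependable yet.

```python
from urllib.request import Request, urlopen

# Hypothetical mapping from media type to a processing route.
# Today most XML arrives as one of these generic types, if it is typed at all.
HANDLERS = {
    "text/xml": "generic-xml",
    "application/xml": "generic-xml",
}


def media_type(content_type_header):
    """Strip parameters such as charset and normalise case."""
    return content_type_header.split(";")[0].strip().lower()


def head_media_type(url, timeout=10):
    """Issue a HEAD request and return the bare media type ('' if absent)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return media_type(response.headers.get("Content-Type", ""))


def pick_handler(url):
    """Return the handler name for a URL, or None if the type is unknown."""
    return HANDLERS.get(head_media_type(url))
```

Only the headers cross the wire, so URLs that turn out to be irrelevant cost
one round trip rather than a full download.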

>For example, should I
>
>1) Try and read the document type declaration?  If so, what
>function/property should I be using?  I'm using MS XMLDOM (from IE 5).

I haven't gotten this close to Microsoft's XML processors in a while,
having been burned a number of times, so I don't know the actual API.  

Even if you have access to the document type declaration, it may not be
easy to process that information.  Simple declarations that just identify a
root element and external subset of the DTD generally provide you with a
reliable identifier of document type.  More complex declarations that
include an internal subset may trip you up by assembling a modular DTD on
the fly or overriding and extending declarations from the external DTD.
So the simple cases aren't bad, but any declaration that includes an
internal subset can be difficult to work with.
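
As a rough illustration of that distinction, here is a sketch using
Python's xml.dom.minidom.  It is not the MS XMLDOM API from the question;
the function name and the sample documents are made up for the example.
It reports the root name and the public and system identifiers, and flags
any internal subset so the caller knows the identifier alone may not be
trustworthy.

```python
from xml.dom import minidom


def describe_doctype(xml_text):
    """Summarise the document type declaration, or return None if absent."""
    document = minidom.parseString(xml_text)
    doctype = document.doctype
    if doctype is None:
        return None
    return {
        "root": doctype.name,
        "public_id": doctype.publicId,
        "system_id": doctype.systemId,
        # A non-empty internal subset means declarations may have been
        # added or overridden for this one document.
        "has_internal_subset": bool(doctype.internalSubset),
    }


# Hypothetical sample documents.
SIMPLE = '<!DOCTYPE memo SYSTEM "memo.dtd"><memo/>'
OVERRIDDEN = '<!DOCTYPE memo SYSTEM "memo.dtd" [<!ELEMENT note (#PCDATA)>]><memo/>'

if __name__ == "__main__":
    print(describe_doctype(SIMPLE))
    print(describe_doctype(OVERRIDDEN))
```

A cautious policy would be to trust the system identifier only when
has_internal_subset is false, and to fall back to inspecting the content
otherwise.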

>2) Try and look for a link to a XML schema?

If only schemas were ready for prime time...  On the plus side, I haven't
seen any proposals for a schema equivalent of the internal subset.

>3) Just start walking the tree looking for particular nodes in a
>particular order?

This is the most accurate, but also the most costly.  If you expect your
list of URLs to include a lot of documents you don't plan to use, you can
churn through a lot of processor cycles only to discard the results.
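
One way to limit that cost, sketched here in Python as an assumption
rather than an established practice, is to look at the root element first
and only walk the full tree for documents whose root you recognise.  The
function name is hypothetical; the parser stops as soon as the first start
tag has been seen, so no tree is built for the rest of the document.

```python
import io
import xml.etree.ElementTree as ET


def root_tag(stream):
    """Return the root element's tag without parsing the whole document.

    Namespaced roots come back in ElementTree's '{uri}local' form.
    """
    for _event, element in ET.iterparse(stream, events=("start",)):
        return element.tag  # first start event is always the root element
    return None


if __name__ == "__main__":
    sample = io.BytesIO(b'<rss version="0.91"><channel/></rss>')
    print(root_tag(sample))
```

The root name is only a hint, since unrelated vocabularies can share it, so
a fuller walk is still needed before committing to a document.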

I took a stab at describing document classes a few weeks ago, creating a
fairly simple spec called XPDL, for XML Processing Description Language.
Details are at http://purl.oclc.org/NET/xpdl.  That might solve a lot of
your problems, but only if people actually used it in their files.

I'd love to hear about other approaches people are taking to this problem...

Simon St.Laurent
XML: A Primer / Building XML Applications
Inside XML DTDs: Scientific and Technical (July)
Sharing Bandwidth / Cookies
http://www.simonstl.com

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1