Identifying XML Document Types (was XML media types revisited)
Simon St.Laurent
simonstl at simonstl.com
Fri Jan 15 20:12:43 GMT 1999
We've had a good deal of give and take over various ways for naming and
otherwise identifying elements and documents over the last few days, and
I'd like to summarize a lot of issues that have arisen (for me at least)
from the discussion.
I'm concerned that XML is a significant break from 'the old way of doing
things', which, crummy as it was, had certain advantages of familiarity.
Proprietary documents came with their own identifiers and their own rules
for doing things, and I don't think anyone expected to open Word documents
in a statistical program and get meaningful results.
With XML the expectations (for being able to process documents with both
specific and generic tools) are much higher, yet the tools for identifying
document types are actually weaker in many ways. I'll list most of the
tools for identifying document types here and their potential strengths and
weaknesses. I'm hoping I'm wrong about some of these, but I'm also hoping
I'm wrong in ways that can make users' lives simpler, not ways that just
have workarounds requiring users to trek 50 miles through mountains while
wearing a straitjacket and ball-and-chain.
1) Filename extensions - The classic for the PC world, used to some extent
in Unix, and typically sneered at by the Macintosh community.
Advantages: Can be created on a whim. Easily connected to other systems,
like MIME identifiers, when used in a supportive (HTTP) environment.
Disadvantages: No central registry, so conflicts abound. Typically limited
to three characters by old DOS rules, though longer extensions are becoming
a bit more common. Makes it difficult to use periods in file names.
Doesn't fit well with 'smarter' file systems that store document type and
application information separately from the name of the document.
Recurring Question: Why isn't using .xml enough to identify XML documents
precisely to applications? (Recurring answer: Because not all applications
should work with every XML document fed to them, finer-grained
identification is a good idea.)
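A minimal sketch of the problem, assuming Python and its standard mimetypes
module (neither comes from the discussion itself). The two file names are
hypothetical. The point is only that the extension yields the same answer for
both files, whatever vocabulary each one actually holds.

```python
import mimetypes

def media_type_for(filename):
    """Return the MIME type guessed from the filename extension alone."""
    guessed_type, _encoding = mimetypes.guess_type(filename)
    return guessed_type

# Two hypothetical files that hold unrelated XML vocabularies.
invoice_type = media_type_for("invoice.xml")
chapter_type = media_type_for("chapter.xml")

# Both names map to the same generic XML type; the exact string can vary
# with the platform's MIME tables, so it is printed rather than assumed.
print(invoice_type, chapter_type)
```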
----------------------------------------------------------------
2) MIME types - The classic Internet standard, used by a variety of
Internet applications and becoming more widespread in other systems.
Advantages: IANA provides central registry, with mechanism (x-) for
unregistered types. Can be made into public identifiers and notations
fairly easily.
Disadvantages: Like the .xml file extension, application/xml and text/xml
provide no information about the _type_ of XML document inside the file
they roughly describe, leaving applications to determine whether or not the
information is actually meaningful.
Recurring Question: Why isn't using application/xml or text/xml enough to
identify XML documents precisely to applications? (Recurring answer:
Because not all applications should work with every XML document fed to
them, finer-grained identification is a good idea.)
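As a rough illustration, here is how a receiver might route on the MIME type
alone. This is a sketch in Python using the standard email.message module to
parse the header. The HANDLERS table, its handler names, and the unregistered
type application/x-purchase-order are all hypothetical.

```python
from email.message import Message

# Hypothetical dispatch table; the handler names are illustrative only.
HANDLERS = {
    "text/xml": "generic XML parser",
    "application/xml": "generic XML parser",
    "application/x-purchase-order": "purchase order processor",
}

def parse_content_type(header_value):
    """Split a Content-Type header value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()

def route(header_value):
    """Pick a handler by media type; the document body is never consulted."""
    media_type, _charset = parse_content_type(header_value)
    return HANDLERS.get(media_type, "no handler")

print(route("text/xml; charset=utf-8"))
print(route("application/x-purchase-order"))
```

Every text/xml or application/xml document lands on the same generic handler,
so only a more specific type lets the receiver choose a specific processor
before opening the document.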
----------------------------------------------------------------
3) DOCTYPE declarations - The de facto SGML standard, about the only thing
that provides a description of the contents of a document.
Advantages: Public Identifier vocabulary suitably rich to avoid most naming
conflicts without required use of central repository.
Disadvantages: Only reliable in validating environments when public
identifiers are actually used, which isn't very often. SYSTEM pointers
seem much more typical. Even when public identifiers are present, many
declarations can be added or overridden in the internal subset, muddying
the waters for applications that need a particular structure. Validation
process doesn't make clear if this has happened. ANY opens black holes.
Recurring Questions: Where do I buy a public identifier? Can I use a
public identifier for documents that are only well-formed? (Recurring
answer: pretty much no on both counts.)
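For what it's worth, an application can at least read what the DOCTYPE
declaration claims. The sketch below assumes Python's standard
xml.parsers.expat module. The memo documents and the public identifier are
made up for illustration. It reports the public and system identifiers and
whether an internal subset is present, though not what that subset changes.

```python
import xml.parsers.expat

def read_doctype(xml_text):
    """Return the DOCTYPE details of a document, or {} if it has none."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.update(
            name=name,
            system_id=system_id,
            public_id=public_id,
            has_internal_subset=bool(has_internal_subset),
        )

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)  # non-validating; external DTD is not fetched
    return found

# Hypothetical documents; the public identifier is invented for this example.
WITH_PUBLIC_ID = (
    '<!DOCTYPE memo PUBLIC "-//Example//DTD Memo 1.0//EN" "memo.dtd" [\n'
    '  <!ELEMENT memo ANY>\n'
    ']>\n'
    '<memo>hello</memo>'
)
SYSTEM_ONLY = '<!DOCTYPE memo SYSTEM "memo.dtd"><memo>hello</memo>'
WELL_FORMED_ONLY = '<memo>hello</memo>'

for doc in (WITH_PUBLIC_ID, SYSTEM_ONLY, WELL_FORMED_ONLY):
    print(read_doctype(doc))
```

The first document carries a public identifier but also redeclares memo as
ANY in its internal subset, which is exactly the muddying described above.
The third document has no DOCTYPE at all, so there is nothing to go on.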
----------------------------------------------------------------
4) Root elements using Namespaces - A new possibility that gained some
prominence with the accession to W3C Recommendation of 'Namespaces in XML'.
Advantages: Namespaces ensure unique element names, making it less likely
that you have someone else's DOCUMENT element.
Disadvantages: Just because the root element is X doesn't mean its contents
are Y. Especially given the problems of validating documents in
namespace-aware environments, namespaces may not always be available. Half
the XML community regards Namespaces as the worst thing since the plague.
Because namespaces aren't supposed to point to anything, you can't sneak a
DTD in at the URL identified by the namespace.
Recurring Question: So how do I make this work reliably in a validating
environment? (Recurring answer: Ask again next year, please.)
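A small sketch of the root-element approach, assuming Python's standard
xml.etree.ElementTree module. The namespace URIs and documents below are
hypothetical. It tells two DOCUMENT elements apart by namespace, and also
shows that a matching root says nothing about the contents underneath it.

```python
import xml.etree.ElementTree as ET

def root_identity(xml_text):
    """Return (namespace URI or None, local name) for the root element."""
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        # ElementTree writes qualified names as "{uri}local".
        uri, local = root.tag[1:].split("}", 1)
        return uri, local
    return None, root.tag

# Hypothetical namespaces for two unrelated DOCUMENT vocabularies.
MINE = '<DOCUMENT xmlns="http://example.com/ns/mine"><title>A</title></DOCUMENT>'
THEIRS = '<DOCUMENT xmlns="http://example.org/ns/theirs"><row/></DOCUMENT>'
# Same root identity as MINE, but with contents a processor may not expect.
SURPRISE = '<DOCUMENT xmlns="http://example.com/ns/mine"><row/></DOCUMENT>'
PLAIN = '<DOCUMENT><title>A</title></DOCUMENT>'

for doc in (MINE, THEIRS, SURPRISE, PLAIN):
    print(root_identity(doc))
```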
Perhaps I'm being a little too hard, but none of these solutions seem
viable. If all we were talking about was generic documents with style
sheets, it might not matter so much, but unfortunately, we're not. Lots of
XML standards are under development where putting the square document in
the round processor is not a good idea. It seems wise to provide a generic
mechanism to keep the square documents created with our generic tools out of
the round processor.
Or maybe that's too much. I guess we'll see.
Simon St.Laurent
XML: A Primer / Cookies
Sharing Bandwidth
Building XML Applications (March)
http://www.simonstl.com
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)