SAX and delayed entity loading

Chris Maden crism at oreilly.com
Thu Dec 3 23:59:52 GMT 1998


I suspect this is going to be long-winded, mostly because I'm replying
to Eliot.  So a summary:

MIME types or notations eventually come down to a magic word of some
sort.  MIME types work now; notations don't yet.

I want a notation to make it *go*, not tell me what to read to make
something to make it go.  Since we're not all using the same
programming language, magic cookies are the only workable solution;
MIME combines this with robust, well-defined fallback behavior.

[W. Eliot Kimber]
> Unless I've misunderstood something, a MIME type is still an
> indirection to the definition of that MIME type.

In the epistemological sense, yes.  In the real world, though, it's a
key into a hash of handlers in your software.  And since the list of
keys is well-known, the success rate of using a given key is pretty
good.
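
As a rough sketch of what "a key into a hash of handlers" means in practice (the handler names and the table contents below are made up for illustration, not taken from any particular product):

```python
import xml.etree.ElementTree as ET


def show_text(data: bytes) -> str:
    """Handler for plain text: decode it and hand it back."""
    return data.decode("utf-8", errors="replace")


def describe_xml(data: bytes) -> str:
    """Handler for XML: parse it and report the root element name."""
    return ET.fromstring(data).tag


# The "hash of handlers": well-known MIME type strings are the keys.
HANDLERS = {
    "text/plain": show_text,
    "text/xml": describe_xml,
}


def dispatch(mime_type: str, data: bytes) -> str:
    """Look the magic word up in the table and run whatever it maps to."""
    handler = HANDLERS.get(mime_type.strip().lower())
    if handler is None:
        raise LookupError(f"no handler registered for {mime_type!r}")
    return handler(data)
```

Nothing here consults the RFC that defines text/xml; the string is only a key, and the lookup works because both ends agreed on the list of keys in advance.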

> I.e., "text/xml" is a pointer to the RFC that establishes that MIME
> type. But then a problem is: where do I go to figure out what RFC a
> given MIME type maps to?

You go to the IANA, the designated authority for MIME.  When you get a
public notation identifier, where do you go?  Oops... there was a
ten-year delay in establishing a registry, and now the one registrar
has about eight owners registered, of which one (the ISO) already had
a formal reference mechanism.

So you can get a catalog that resolves the public identifier to...
what?  A DLL?  A Java class?  That's portable.  The indirection keeps
getting pushed farther down the line.  MIME says, "This is the list of
things.  Here's what they are.  If you implement this thing, this is
the name to look for."  It's a magic cookie system, but it's a robust
one that works very, very well.

> What if the MIME type is an "x-*" MIME type, what do I do then?
> Note that the external ID for a notation could, in theory be a MIME
> type:
> 
> <!NOTATION xml SYSTEM "urn:mime:text/xml" >

Early in XML's development, there was talk up at the formalism level
that system identifiers were HyTime FSIs, but only <url> FSIs were
allowed, and were the default, so the <url> tag was omitted.  This
left room for expansion into other system identifier types.  I argued
that the default for notation system identifiers should be treated
similarly, but the default should be <mimetype>, allowing

<!NOTATION xml SYSTEM "text/xml">
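
For the curious, here is a minimal sketch of how an application might pull such a declaration out of a document and see the system identifier as a bare string. It uses Python's standard expat binding; the sample document and the function name are assumptions made up for the example, and treating the system identifier as a MIME type is the convention argued for above, not anything the parser itself knows about.

```python
import xml.parsers.expat


def collect_notations(document: str) -> dict:
    """Return {notation name: (system identifier, public identifier)}."""
    notations = {}

    def on_notation(name, base, system_id, public_id):
        # expat reports each <!NOTATION ...> declaration through this hook.
        notations[name] = (system_id, public_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.Parse(document, True)
    return notations


# Hypothetical document with the declaration in its internal subset.
SAMPLE = """<!DOCTYPE doc [
<!NOTATION xml SYSTEM "text/xml">
]>
<doc/>"""

system_id, _public_id = collect_notations(SAMPLE)["xml"]
```

The parser hands back the string and nothing more; what the application does with "text/xml" from there is entirely a matter of convention.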

> The short answer is that they are a highly general way to associate
> data objects with the definition of the rules that govern the
> interpretation of that data object.

I like the phrase "highly general way".  HyTime is a highly general
way to associate any piece of information in any format anywhere in
the world with any other piece of information in any format anywhere
in the world.  And about six people actually understand it, and three
of them have been institutionalized from the shock.  (HyTime II is
much better, thank you, Eliot.)

In other words, it's so general that it's useless.  It can be made
useful with certain user conventions, like that the public identifier
is treated as a magic string that is just known, or that the system
identifier is a piece of software usable within a closed system.  But
otherwise, it's like SGML without a stylesheet standard, or public
identifiers without a resolution mechanism.

> I think that the Web and Windows have established an unreasonable
> expectation that software will "just know" how to deal with things.
> Unfortunately, you can't always rely on registered MIME types and
> magic numbers.

I don't think the expectation is at all unreasonable.  The software
*does* "just know" nearly all of the time.  The MIME specification was
developed in the Internet, where what works, wins.  The most important
aspect of MIME is its hierarchicality(?): types, sub-types,
sub-sub-types, ad infinitum.  As with RFC 1766 language
tags, you can make a reasonable guess about an entity even
if you don't recognize the whole MIME type.  An old browser,
confronted with text/xml, will say, "Oh... I don't know about xml, but
I can do text.  Here ya go."  And the user will see markup, but it'll
be sensible.
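
The fallback can be sketched in a few lines. This is an illustration only: the "text/*" wildcard key and the table contents are assumptions invented for the example, not how any particular browser stores its handlers.

```python
def find_handler(mime_type, handlers):
    """Try the exact type first, then fall back to the top-level type."""
    key = mime_type.strip().lower()
    if key in handlers:
        return handlers[key]
    # Unknown subtype: see whether the top-level type has a generic handler.
    top_level = key.split("/", 1)[0]
    return handlers.get(top_level + "/*")


# A hypothetical old browser that has never heard of XML.
OLD_BROWSER = {
    "text/html": "render as HTML",
    "text/*": "show as plain text",
}

choice = find_handler("text/xml", OLD_BROWSER)
```

Here the old browser ends up showing text/xml as plain text. A type with no handler at either level, such as application/planet, comes back as None, at which point the software can ask the user what to do.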

> Perhaps part of the problem is that in the Web world we have tended
> to remove the need for such a generalized mechanism by hard-coding
> knowledge of the semantics of everything?  But you can't do that
> forever, and MIME only seems to make the problem worse by requiring
> that all interchangeable types be registered before they can be
> used.

Unlike notations, which will work by magic without telling anyone what
they are.

> Notations don't require that because the external ID of a
> notation can be anything (including MIME types or their RFC
> documents).

Yay... so all my software has to handle is... anything.  Fun.

I love abstract theory.  But in the end, it comes down to software
*doing something* with what it gets.  A function whose range is
unbounded across the set of the universe is not a useful function.  A
notation for planets whose system identifier is a bibref to
Magrathea's operating manual is not of very much use to me.  At least
with application/planet I know that it's an application, but not one I
handle, and can ask the user for suggestions.

Before anyone misconstrues my position (too late!) I don't have
anything at all against the ISO process.  (I only say this because I
know there are some people who *seem* to hold a grudge against the ISO
or the W3C or both.)  I love the abstraction of SGML and HyTime.  But
notation identifiers have always seemed to me to be a bizarre bit of
Pollyannaism, and the constant use of system identifiers in examples
blew my mind the first time I read the standard.  For all that SGML is
of great utility for open systems, it shows definite signs of having
grown up in a pre-Internet world where openness and portability were
much smaller words.

> But I do mind: if I see "x-whatever/whatever", how do I know where
> to look, as a programmer or document recipient, to understand what
> the rules for that MIME type are?

And after you've looked, then what?  Designing a system that tells
programmers where to go to implement a processor for a new notation is
bizarre.  Most users are not programmers, and the idea that a notation
would point to a formal spec would shatter their heads.  They want a
notation to point to something that will do it for them.  In the
absence of a One World Programming Language (pipe down, Python heads),
a hierarchical magic cookie system works best.

> If someone gives you a document with a useless external ID for a
> notation, that's a problem between you and the author of that
> document and no mechanism can fix that problem.

And the difference between this and MIME is... that with a
well-defined registry of notation types, and a hierarchical fall-back
system, a useless external ID is far less likely.

> But it's not just about *viewing*, it's about processing of all
> sorts.  Pulling down a plug-in for viewing a particular kind of data
> is only one small application of notations.  If you are only
> thinking about the problem in terms of viewing things on the Web,
> then you are missing the point.

Take viewing as one form of possible processing, used here as an
example.  The problem is one of finding the processor, and MIME types
are equally good at finding the processor for analyzing as they are
for viewing.

[Simon St.Laurent]
> I still argue that notations are a waste of time based on the
> misguided notion that information about dependencies (of whatever
> type) actually belongs in the document.

Only "document" in the sense that the document as an informational
unit necessarily includes the description of its type.  Notations
(whether MIME or otherwise) are associated with the type of a
document; like common entities, common notations should be defined in
common files.

> Let the dependent pieces be self-describing (MIME or something
> better),

Now, I think you may be falling into the trap Eliot describes.  MIME
entities aren't self-describing; they're wrapped in headers that
describe them.  And someone still has to understand that description;
it's just that the MIME implementations are more robust and flexible,
with better fall-back behavior, than SGML notations.
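
To make the "wrapped in headers" point concrete, here is a small sketch using Python's standard email package. The sample entity is invented for the example.

```python
from email import message_from_string

# A hypothetical MIME entity: the body says nothing about what it is;
# the Content-Type header wrapped around it does.
RAW = (
    "Content-Type: text/xml; charset=utf-8\n"
    "\n"
    "<doc/>\n"
)


def describe(raw: str):
    """Return (full type, top-level type, body) for a MIME entity."""
    msg = message_from_string(raw)
    return msg.get_content_type(), msg.get_content_maintype(), msg.get_payload()


full_type, top_level, body = describe(RAW)
```

The type comes from the wrapper, not from the payload, and the top-level type is available separately for software that only understands "text".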

[W. Eliot Kimber]
> You're right, the URL for the XML spec is not sufficient (but
> neither is "<?xml?>", since you need to know what thing defines what
> that magic number means).

The URL for the XML spec is *equally* useful to a processor as is
"<?xml?>", "text/xml", and "foobie bletch".  Assuming of course, that
each string is used in a system designed to understand it.

The namespace specification uses URIs, but they are essentially
magic cookies.  When used with stylesheets, the URIs are compared.
When fed to a processor (like XSL's xsl: and fo: namespaces), the
processor is expected to either recognize the URI on sight, or else
isn't a processor for that data type.  This is exactly like MIME
without fallback.
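
A minimal sketch of that behavior, assuming a made-up namespace URI (http://example.org/ns/stylesheet is hypothetical, as is the table of known namespaces):

```python
import xml.etree.ElementTree as ET

# Hypothetical table: the URI is never dereferenced, only compared.
KNOWN_NAMESPACES = {
    "http://example.org/ns/stylesheet": "stylesheet processor",
}


def namespace_of(tag: str):
    """ElementTree writes qualified names as '{uri}local'; pull out the URI."""
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None


def recognize(document: str):
    """Return the processor for the root element's namespace, or None."""
    root = ET.fromstring(document)
    return KNOWN_NAMESPACES.get(namespace_of(root.tag))


hit = recognize('<s:sheet xmlns:s="http://example.org/ns/stylesheet"/>')
miss = recognize('<s:sheet xmlns:s="http://example.org/ns/stylesheet/"/>')
```

The second document differs only by a trailing slash, and the lookup fails outright: an exact string match with nothing to fall back on.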

> In thinking about it, I think the only thing that will be reliable is to
> depend on a non-electronic, human-primary, long-term repository like the
> Library of Congress.
> 
> Thus, the declaration for XML as a notation should be something like:
> 
> <!NOTATION somelocalname
>  PUBLIC "+//IDN loc.us.gov//NOTATION TZ 1234:W3C eXtensible Markup Language (XML) Recommendation 1.0//EN"
> >

For cyberarchæology, that works well.  You can find the spec and
re-implement a processor.  But as a user, I want to use the data I
got, not write a bloody parser for it myself.  And (once again) since
we don't have universally portable software, you can't give me a
pointer to a chunk of code.  It's got to be a well-recognized name,
and MIME provides this better than any other system.

Whew.

-Chris
-- 
<!NOTATION SGML.Geek PUBLIC "-//Anonymous//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//O'Reilly//NONSGML Christopher R. Maden//EN"
"<URL>http://www.oreilly.com/people/staff/crism/ <TEL>+1.617.499.7487
<USMAIL>90 Sherman Street, Cambridge, MA 02140 USA" NDATA SGML.Geek>
