external parsed entities (was: A unique ID question ?)

Wed Nov 10 15:31:13 GMT 1999

Len Bullard wrote:
> 
> Tim Bray wrote:
> >
> > At 06:30 PM 11/9/99 -0600, Len Bullard wrote:
> > >We can't bag the cat once it is in the alley
> > >unless we are faster than the cat, or know where it sleeps.  In either
> > >case, we will have a mad cat when we take it out of the bag.
> >
> > I enjoyed reading that but I have *no* idea what you mean, Len... -T
> 
> External parsed entities are a done deal for XML 1.0.
> But hey, just a version number, right?  We get to
> have this fight again. :-)
> 
> If you and eliot truly think they are bogus and a
> screwup to use, sounds like an issue to me.
> 
> How do subdocs fix this problem, Eliot?  Or
> better, what problems do subdocs fix for markup systems?

By "subdocs" I presume you mean the use of separate documents to define
a single logical "compound" document (the SGML concept of "subdoc" is in
fact a red herring--what's important is the use of separate documents,
not the fact that they are declared as SUBDOC entities--thus the lack of
SUBDOC in XML is absolutely no loss). It solves the problem by forcing
you to use and manage truly reusable objects, rather than just doing
syntactic copy and paste through the inclusion of external parsed
entities. 

It also forces you to recognize that all use-by-reference needs to occur
at the semantic (DOM, grove) level, not the syntactic level. Once you
realize that, all sorts of apparently hard problems or non-sensical
cases become easy and quite sensical, because we are out of the
syntactic domain and into the semantic domain. For example, it is
non-sensical for one element to be replaced by another element in
another document at the syntactic level (parse time), but it is
perfectly sensical for an element node to redirect to another element
node in another DOM and the processing needed to achieve this is
trivial:

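# See if node is a redirection and resolve it.
# (node.atts and xpath.resolve_pointer_to_node are illustrative names,
# not a real library API; node.atts is assumed to be a mapping.)
try:
   pointer = node.atts["redir"].value
except KeyError:
   # No redir attribute, just go on as before
   pass
else:
   node = xpath.resolve_pointer_to_node(pointer)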

It couldn't be easier at this level. All that's required is that your
processing software understand the semantics of the documents that might
be pulled together this way, which might of course use different
document types, but that's no different from needing to understand a
document that uses elements with different namespace prefixes, so it's a
constant part of the XML processing problem and this approach doesn't
change it (except to possibly make it both more obvious that the problem
exists and clearer as to how you define a framework for handling the
case).
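
As a rough illustration of the redirection step, here is a minimal
runnable sketch using Python's standard xml.etree.ElementTree. The
"redir" attribute name comes from the pseudocode above; the
"docname#id" pointer syntax, the docs registry, and the find_by_id and
resolve helpers are assumptions made up for this example, not part of
any standard or library.

```python
import xml.etree.ElementTree as ET

# Two independent documents, each parsed into its own tree.
# The "docname#id" pointer syntax is an assumption for this sketch.
docs = {
    "main": ET.fromstring(
        '<doc><para id="p1">Local text</para>'
        '<para redir="shared#warn"/></doc>'
    ),
    "shared": ET.fromstring(
        '<lib><para id="warn">Do not open the case.</para></lib>'
    ),
}


def find_by_id(root, ident):
    """Return the first element in this tree whose id attribute matches."""
    for elem in root.iter():
        if elem.get("id") == ident:
            return elem
    return None


def resolve(node):
    """Follow redir attributes until a non-redirecting node is reached."""
    seen = set()
    while node is not None and "redir" in node.attrib:
        pointer = node.attrib["redir"]
        if pointer in seen:
            raise ValueError("redirection cycle at " + pointer)
        seen.add(pointer)
        doc_name, _, ident = pointer.partition("#")
        node = find_by_id(docs[doc_name], ident)
    return node


# Process the main document as if the redirected node were in place.
effective = [resolve(p).text for p in docs["main"].findall("para")]
```

Neither source tree is modified: the second para in "main" still
carries its redir attribute, and the target still lives in "shared".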

The problem with external parsed entities is that they are not true
objects in the sense that they have no independent existence outside the
contexts in which they are used--they cannot be parsed or validated in
isolation and must conform *syntactically* to all the contexts in which
they are used. Thus the problems with ID conflict, entity names, etc. When using
multiple documents, each element maintains its original document context
and therefore its fundamental identity, so there is no possibility of ID
or name conflict and you can always examine the element in its original
context as well as in any contexts in which it might be used.
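
To make the "no possibility of ID conflict" point concrete, here is a
small sketch, again assuming Python's xml.etree.ElementTree. The
document names and the id_index and lookup helpers are hypothetical;
the point is only that an address is a (document, id) pair rather than
a bare id.

```python
import xml.etree.ElementTree as ET

# Two separate documents that happen to use the same id value.
chapter1 = ET.fromstring('<chapter><sec id="intro">Scope</sec></chapter>')
chapter2 = ET.fromstring('<chapter><sec id="intro">Background</sec></chapter>')


def id_index(root):
    """Map each id value in one document to its element."""
    return {e.get("id"): e for e in root.iter() if e.get("id") is not None}


# One index per document: the same id can appear in both without conflict.
indexes = {"ch1": id_index(chapter1), "ch2": id_index(chapter2)}


def lookup(doc_name, ident):
    """Address an element by (document, id) rather than by id alone."""
    return indexes[doc_name][ident]


first = lookup("ch1", "intro").text
second = lookup("ch2", "intro").text
```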

Of course, if you want to write a transform that generates a new single
instance as output, you have to disambiguate the names and IDs, but
there's no programming difficulty there, it's just an exercise in
rewriting of pointers and, possibly, applying name-space prefixes to
element type names (if you're so inclined). But this is only one way to
take advantage of semantic use-by-reference--you should never assume
that the processing result of compound document processing is another
XML document. Using GroveMinder we have a grove-aware browser that does
all this resolution dynamically at run-time, generating HTML as output.
There's nothing particularly difficult or inventive about this except
that we did it (and that it happens to implement the part of the HyTime
standard that deals with use-by-reference, the value reference
facility).
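
The pointer rewriting mentioned above for the single-instance case
might look something like the following sketch. It assumes Python's
xml.etree.ElementTree, a hypothetical "ref" attribute that points at an
id within the same source document, and a made-up "docname.id" naming
scheme for the merged output. It is not how GroveMinder does it.

```python
import copy
import xml.etree.ElementTree as ET

# Two source documents with clashing ids. The "ref" attribute as a
# same-document pointer is an assumption for this sketch.
parts = {
    "a": ET.fromstring('<part><sec id="s1"/><xref ref="s1"/></part>'),
    "b": ET.fromstring('<part><sec id="s1"/><xref ref="s1"/></part>'),
}


def merge(parts):
    """Build one tree, prefixing ids and refs with the source document name."""
    book = ET.Element("book")
    for name, root in parts.items():
        clone = copy.deepcopy(root)  # leave the source documents untouched
        for elem in clone.iter():
            for att in ("id", "ref"):
                if att in elem.attrib:
                    elem.set(att, name + "." + elem.attrib[att])
        book.append(clone)
    return book


book = merge(parts)
ids = [e.get("id") for e in book.iter() if e.get("id")]
refs = [e.get("ref") for e in book.iter() if e.get("ref")]
```

Because ids and refs are rewritten with the same prefix, each xref
still points at the sec it pointed at in its original document.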

Note also that if, for example, XLink made a clear distinction between
use-by-reference relationships and hyperlink relationships, it would be
clearer how one can have a highly-generic, standards-based
infrastructure for doing this stuff.  As it is, you can define your own
conventions for using XLinks to mean use by reference. The "show=embed"
part of XLink is almost there, but it is not sufficiently flexible: for
example, it doesn't let you define the value of an attribute by
reference, which is quite useful, if not a hard requirement, for certain
problems.

By using independent documents you get objects (documents) that have
their own independent existence. They can be reliably re-used because
they are combined with other documents *semantically*, not
syntactically, at the processing level. That is, I construct a bunch of
DOM trees (or groves) and then another layer of processing decides how
to use them together. No document directly interferes with any other. Of
course, there will be dependencies between the documents, such as one
document linking to something in another document.

Because the processing of compound documents is a separate layer, there
can be many different ways of processing the same compound document and
therefore different sets of constraints that you might want to enforce.
You can have a policy that says all the members of the document must
have the same DTD or maybe you don't care because your processing isn't
DTD sensitive (e.g., a generic XML structure browser). The system can be
more or less sophisticated depending on your requirements. You don't
have to do twisted things like use name spaces to disambiguate element
type names in the source documents or rationalize all the documents to a
single over-arching document type. Individual documents can be optimized
for their own local purposes and still combined together meaningfully
with other documents, given some rules for playing nice together (e.g.,
Architectures, architypes, etc.).
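
One way to picture constraints living in the compound-document layer
rather than in the documents themselves is the sketch below. It assumes
Python's xml.etree.ElementTree and uses the root element type as a
crude stand-in for "same DTD"; the policy function names are invented
for the example.

```python
import xml.etree.ElementTree as ET

# The members of one compound document, each parsed independently.
members = [
    ET.fromstring("<chapter><title>One</title></chapter>"),
    ET.fromstring("<chapter><title>Two</title></chapter>"),
    ET.fromstring("<glossary><term>grove</term></glossary>"),
]


def same_root_type(trees):
    """Strict policy: every member must have the same root element type."""
    return len({t.tag for t in trees}) <= 1


def anything_goes(trees):
    """Lenient policy for processing that is not document-type sensitive."""
    return True


def check(trees, policy):
    """Apply whichever constraint this particular processor wants."""
    return policy(trees)


strict_ok = check(members, same_root_type)
lenient_ok = check(members, anything_goes)
```

The same set of documents fails the strict policy and passes the
lenient one; which policy applies is a decision of the processing
layer, not of any individual document.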

Once you've built the infrastructure to handle this way of doing things,
the range of problems you can solve and the range of requirements you
support increases dramatically and the incremental cost of the system
drops rapidly.

I note that tools like FrameMaker and WordPerfect (and I think even
Word) have *always* worked this way. In FrameMaker, for example, a
"book" is composed of multiple documents. Each document is completely
syntactically independent of the other documents in the book: it can
have its own templates, customizations, etc. The addressing between
documents for cross references and hyperlinks is document-to-document
addressing because each document establishes its own ID name space in
FrameMaker. There is no syntactic interference between different
documents in the same book.  [NOTE: this is not true for the SGML
version of FrameMaker--for whatever reason, the designers chose not to
carry this model into the SGML version, which they could have done very
easily.]

Cheers,

E.

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)