Documents and Document Fragments (Was RE: XML Information Set Requirements, W3C Note 18-February-1999)

Mon Feb 22 02:22:11 GMT 1999

<ExecutiveSummary>
Mark and I agree on, and are both excited about, the value of the concepts
we are discussing, namely promoting an element in a document to the status
of document element in its very own document.

I'm also interested in the reverse: demoting a document element to the
status of a normal element in a larger document.

Our disagreement seems to stem from the fact that I don't believe you have
an "XML document" until you serialise as well-formed XML text. That's my
understanding of the XML 1.0 REC.

Mark uses the terms "physical" and "logical" XML documents. By the former,
I think he means what I think of as an XML document in the sense of the XML
1.0 REC, i.e. serialised text. By the latter, I think he means a more
abstract representation of the type being developed by the XML Infoset WG.

In fairness to Mark, the terms "physical" and "logical" are used in this way
in the Infoset Requirements. However, I would argue that the term "XML
document", at least as used in the XML 1.0 REC, is only ever "physical".
There is an equivalent logical representation but that is yet to be
standardised by the Infoset WG.

Most people probably think I am just being pedantic.

I am.

I'm also trying to follow the spec :-)
</ExecutiveSummary>

Mark Birbeck:
>I put apostrophes around all
>sorts of key words (like the word contains) to emphasise that I am not
>trying to be literal. My meaning was that an element within a document
>can itself be treated as a document - and still fit with the spec,
>despite what you say next.

Oh I agree with you. But only because you say "can itself be *treated* as a
document".

[...]
>I was saying that an element is equivalent to a well-formed document
>(with an empty prolog) and this gives us certain advantages.

Again I agree, because you say "is *equivalent* to".

My only point was that we (and I don't mean you in particular; sorry if I
appeared to single you out) need to be careful with the term "XML document"
because it means something quite specific in the XML 1.0 REC. You are
absolutely right that any element in a well-formed document can be treated
as a well-formed document (assuming no entity references, etc.), but while
it is an element within a well-formed document it is not itself an XML
document.
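
To make the distinction concrete, here is a minimal Python sketch of
"promoting" an element: it only becomes a document in the XML 1.0 sense at
the point where it is serialised as text. The sample markup, the element
names and the promote() helper are all made up for illustration, and the
sketch assumes no DTD and no entity references, as noted above.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical source document; the names are illustrative only.
SOURCE = (
    "<library>"
    "<book id='b1'><title>XML Basics</title></book> stray text "
    "<book id='b2'><title>Infosets</title></book>"
    "</library>"
)

def promote(element):
    """Serialise one element as the text of a standalone document.

    Text that follows the element in its parent (its "tail") belongs to
    the parent, so it is dropped from the copy before serialising.
    """
    standalone = copy.deepcopy(element)
    standalone.tail = None
    return ET.tostring(standalone, encoding="unicode")

library = ET.fromstring(SOURCE)          # parsed, in-memory representation
first_book = library.find("book")        # still just an element in a tree
document_text = promote(first_book)      # now it is text with markup
reparsed = ET.fromstring(document_text)  # parses on its own as a document
```

The in-memory first_book object is only the abstract representation;
document_text is the serialised form that a parser can accept on its own.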

[example of objects in your database]
>[I spell this out because if I show it with tags, everyone will think I
>am referring to one *physical* XML document, which I am not.]

Well, you aren't referring to an XML document at all in the XML 1.0 REC
sense until you serialise it (or part of it) as an XML document.

>then I can export a 'proper' XML document from this, such as:
[part of database object serialised as XML]
>as well as:
[all of database object serialised as XML]
>or even the 'proper' document:
[a single element serialised from the database]

>All of these are well-formed 'documents' in the logical sense but have
>no relationship to a physical document of any form.

I think my problem may not be so much the use of "document" as the use of
"logical" and "physical". You are using (quite correctly in the general sense
of the terms) "physical" to mean XML "text" and "logical" to mean the abstract
data (perhaps in some database). But the XML spec doesn't talk like this. An
object representation of an XML document is not an XML document according to
the spec. For something to be XML (unparsed entities excepted) it must be
represented as text with markup and character data.

To stress again, I agree with everything you are saying. I am just pointing
out that the spec uses certain terms in a more specific sense.

> Of course, if all of
>your documents (logical) are stored as text files (physical), or to put
>it another way, if there is a one-to-one mapping between your physical
>and logical XML documents, then none of this is of any use to you;

XML documents = text files. What you are calling "logical XML documents"
aren't "XML documents" in the sense of the XML 1.0 REC. I'm not arguing
about the value of what you are talking about doing. I think it's the way to
go. I am just trying to be careful with the terminology.

>On the other hand, if you have no documents,
>but thousands of nodes of data in a database that you can export and
>query, then the difference between a logical document and a physical one
>is key. (Further, you could also generate an inline DTD from your schema
>as the prolog to each document, if you wanted. Or just point to an
>external one.)

Yep. All exciting stuff. Keep a logical document or documents in a database
and export parts of documents or aggregates of documents as XML.
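
A rough sketch of that idea in Python, assuming a deliberately simple
storage model: one table of nodes, each with a parent, an element name and
optional text. The table layout, the sample data and the function names are
hypothetical and are not taken from Mark's database. Any node can be
exported as serialised XML, whether it is the whole tree, a part of it, or
a single element.

```python
import sqlite3
import xml.etree.ElementTree as ET

def make_db():
    """Create an in-memory database holding a small tree of nodes."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE node ("
        "id INTEGER PRIMARY KEY, parent INTEGER, name TEXT, text TEXT)"
    )
    db.executemany(
        "INSERT INTO node VALUES (?, ?, ?, ?)",
        [
            (1, None, "catalogue", None),
            (2, 1, "item", None),
            (3, 2, "name", "Widget"),
            (4, 1, "item", None),
            (5, 4, "name", "Gadget"),
        ],
    )
    return db

def build(db, node_id):
    """Recursively build an element tree rooted at the given node."""
    name, text = db.execute(
        "SELECT name, text FROM node WHERE id = ?", (node_id,)
    ).fetchone()
    elem = ET.Element(name)
    elem.text = text
    children = db.execute(
        "SELECT id FROM node WHERE parent = ? ORDER BY id", (node_id,)
    ).fetchall()
    for (child_id,) in children:
        elem.append(build(db, child_id))
    return elem

def export(db, node_id):
    """Serialise the subtree at node_id as XML text."""
    return ET.tostring(build(db, node_id), encoding="unicode")

db = make_db()
whole = export(db, 1)    # all of the stored tree
part = export(db, 2)     # one branch of it
single = export(db, 3)   # a single element
```

Nothing in the database is an XML document in the XML 1.0 REC sense; each
string returned by export() is.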

>I pointed out that all this fits with the XML 1.0 notion of a logical
>document, in order to stress that we don't need some other terms
>inventing to cope with these concepts.

Does the XML 1.0 REC really have a notion of a logical document? It has a
notion of text (what you are calling a physical document) having a logical
structure. It is the main point of the XML Infoset to introduce the notion
of a logical document.

>The fact that the three examples
>I gave above are all subsets of a greater whole, does not in any way
>affect that they are all still perfectly acceptable XML documents.

They have the potential to be, if serialised as such.

> We
>don't then need to go back to the original data and say that because we
>can get many documents from a bigger document, that document must
>therefore be referred to as an 'uberdocument'

Hang on. I'm not suggesting anyone *has* to use my word! :-)

I coined the term überdocument originally to mean a hierarchy of XML
documents that are treated as if they were a single XML document. To
actually be serialised as a single XML document, one would have to handle
localised declarations and name clashes (namespaces to the rescue!).
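
As a minimal sketch of the reverse operation, the Python below "demotes"
the document elements of two separate documents into one wrapper element
and serialises the result as a single document. The wrapper name, the
namespace URIs and the build_uberdocument() helper are assumptions made up
for illustration. It only shows how namespaces keep two same-named elements
apart; it does not attempt to merge DTD declarations.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents whose document elements share a local name
# but live in different namespaces.
DOCS = [
    '<title xmlns="urn:example:books">Infosets</title>',
    '<title xmlns="urn:example:people">Dr</title>',
]

def build_uberdocument(texts, wrapper="uberdocument"):
    """Parse each document and attach its document element to one wrapper."""
    uber = ET.Element(wrapper)
    for text in texts:
        uber.append(ET.fromstring(text))
    return uber

uber = build_uberdocument(DOCS)

# ElementTree keeps the namespace as part of each tag, so the two "title"
# elements stay distinct inside the combined tree.
tags = [child.tag for child in uber]

# Serialising the wrapper gives one document containing both.
serialised = ET.tostring(uber, encoding="unicode")
reparsed = ET.fromstring(serialised)
```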

I made up a new word because I essentially wanted to say "these things
aren't documents in the sense in which one would normally think of them. An
überdocument is an over-arching document representing an entire collection
of documents".

>> Yep. This is the idea I'm exploring. I'm just using the term
>> "überdocument"
>> for the "one massive document".
>
>But it's still a document (logical), just like the other three.

Yes. It's a logical document in the sense you've been using the term. It has
the potential to be serialised as a single XML document.

> And
>equally, we don't really need to say that because those three documents
>came from a greater document they must be 'document fragments'. (I say
>'don't really', because there are situations such as getting a parser to
>select part of a *physical* document, when the term 'fragment' might be
>useful.)

And likewise I think there are situations where you want to get a parser to
treat an XML document (physical document in the sense you've been using the
term) as part of a larger document: the überdocument.

>To conclude, there's nothing wrong with introducing new terms, but I
>feel that they must clarify something, or point towards something that
>has not been addressed before. But as far as I can see, all of the
>concepts we need to cope with the idea of an 'XML document server',
>etc., *are* present in XML 1.0.

Not in the XML 1.0 REC. That's why the Infoset work is being done.

We actually agree on the concepts and their value. I am just being pedantic
about using words as they are meant in the XML 1.0 REC.

James
--
James Tauber / jtauber at jtauber.com / www.jtauber.com
Associate Researcher, Electronic Commerce Network
Curtin University of Technology, Perth, Western Australia

Full-day XML Tutorial @ WWW8 : http://www8.org/

Maintainer of : www.xmlinfo.com,  www.xmlsoftware.com and www.schema.net
