Documents and Document Fragments (Was RE: XML Information Set Requirements, W3C Note 18-February-1999)

Mark Birbeck Mark.Birbeck at iedigital.net
Sun Feb 21 14:56:18 GMT 1999


James Tauber wrote:
> We must be careful when using the word "document" because it does have a
> specific meaning in the XML spec. It is *NOT* true that a document may be
> inside another document in the logical sense in which the word is used in
> the spec.

That's right ... and I'm not saying that. I put quotation marks around
all sorts of key words (like the word 'contains') to emphasise that I am
not trying to be literal. My meaning was that an element within a
document can itself be treated as a document, and still fit with the
spec, despite what you say next.

> An XML *document* (document in the spec sense) is PROLOG+ELEMENT+optional
> EPILOG.
> You can't have PROLOGs in the content of elements, therefore you cannot have
> document in documents if you mean document in the spec sense. (note that all
> XML document have a prolog even if it is empty).

I was saying that an element is equivalent to a well-formed document
(with an empty prolog) and this gives us certain advantages. For
example, if I have in my database:

    object type issue, with issue number set to 67 and ...
        a container of articles of which ...
            the first is article 1 with a container full of paragraphs ...
                of which the first says "This is para 1"
                and the second says "This is para 2"
                and the third says "this is para 3"
            and the second is article 2 with a container full of paragraphs ...
                of which the first says "This is para 1"
                and the second says "This is para 2"

[I spell this out because if I show it with tags, everyone will think I
am referring to one *physical* XML document, which I am not.]

then I can export a 'proper' XML document from this, such as:

    <article number="2">
        <para>This is para 1</para>
        <para>This is para 2</para>
    </article>

as well as:

    <issue number="67">
        <article number="1">
            <para>This is para 1</para>
            <para>This is para 2</para>
            <para>This is para 3</para>
        </article>
        <article number="2">
            <para>This is para 1</para>
            <para>This is para 2</para>
        </article>
    </issue>

or even the 'proper' document:

    <para>This is para 2</para>

All of these are well-formed 'documents' in the logical sense but have
no relationship to a physical document of any form. Of course, if all of
your documents (logical) are stored as text files (physical), or to put
it another way, if there is a one-to-one mapping between your physical
and logical XML documents, then none of this is of any use to you; you
will have a lot of trouble querying across documents, and no means of
creating dynamic documents. On the other hand, if you have no documents,
but thousands of nodes of data in a database that you can export and
query, then the difference between a logical document and a physical one
is key. (Further, you could also generate an inline DTD from your schema
as the prolog to each document, if you wanted. Or just point to an
external one.)
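
As a rough sketch of the point above, and assuming Python's standard
xml.etree.ElementTree as a stand-in for the database (the helper names
build_issue and export_document are hypothetical, not part of any real
system), each node can be exported as its own well-formed document with
an empty prolog:

```python
import xml.etree.ElementTree as ET


def build_issue():
    """Build the example data as in-memory nodes; no physical file is involved."""
    issue = ET.Element("issue", number="67")
    for number, para_count in (("1", 3), ("2", 2)):
        article = ET.SubElement(issue, "article", number=number)
        for i in range(1, para_count + 1):
            para = ET.SubElement(article, "para")
            para.text = f"This is para {i}"
    return issue


def export_document(node):
    """Serialize any node as a standalone document (empty prolog, one root element)."""
    return ET.tostring(node, encoding="unicode")


issue = build_issue()
article2 = issue.find("article[@number='2']")
para2 = article2.findall("para")[1]

docs = [export_document(node) for node in (issue, article2, para2)]
for doc in docs:
    ET.fromstring(doc)  # raises ParseError if the export is not well-formed
    print(doc)
```

The same export function is used for the issue, the article and the
single paragraph; nothing in it needs to know whether the node came from
a "bigger" document.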

I pointed out that all this fits with the XML 1.0 notion of a logical
document in order to stress that we don't need to invent other terms to
cope with these concepts. The fact that the three examples I gave above
are all subsets of a greater whole does not in any way change the fact
that they are all still perfectly acceptable XML documents. We don't
then need to go back to the original data and say that, because we can
get many documents from a bigger document, that bigger document must
therefore be referred to as an 'uberdocument'; to quote you:

> Yep. This is the idea I'm exploring. I'm just using the term "überdocument"
> for the "one massive document".

But it's still a document (logical), just like the other three. And
equally, we don't really need to say that because those three documents
came from a greater document they must be 'document fragments'. (I say
'don't really', because there are situations such as getting a parser to
select part of a *physical* document, when the term 'fragment' might be
useful.)
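
For the parser case just mentioned, here is a minimal sketch, again
assuming Python's xml.etree.ElementTree; the PHYSICAL text and the
select_fragment helper are made up for illustration. It selects part of
a *physical* document, and what comes out is still a well-formed
document in its own right:

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical physical document, e.g. the contents of a text file.
PHYSICAL = """<issue number="67">
    <article number="1"><para>This is para 1</para></article>
    <article number="2"><para>This is para 1</para><para>This is para 2</para></article>
</issue>"""


def select_fragment(physical_text, path):
    """Parse a physical document and return the selected element as text."""
    root = ET.fromstring(physical_text)
    node = root.find(path)
    if node is None:
        raise LookupError(f"no element matches {path!r}")
    fragment = copy.deepcopy(node)
    # Whitespace following the element belongs to its parent, so drop it.
    fragment.tail = None
    return ET.tostring(fragment, encoding="unicode")


fragment = select_fragment(PHYSICAL, "article[@number='2']")
reparsed = ET.fromstring(fragment)  # succeeds only if the fragment is well-formed
print(fragment)
```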

To conclude, there's nothing wrong with introducing new terms, but I
feel that they must clarify something, or point towards something that
has not been addressed before. As far as I can see, though, all of the
concepts we need to cope with the idea of an 'XML document server',
etc., *are* present in XML 1.0.

Regards,

Mark


Mark Birbeck
Managing Director
Intra Extra Digital Ltd.
39 Whitfield Street
London
W1P 5RE
w: http://www.iedigital.net/
t: 0171 681 4135
e: Mark.Birbeck at iedigital.net





