XML Inclusion Proposal

Sat May 29 04:36:56 BST 1999

Please do not cross-post responses to this proposal because unfortunately
discussions do not span mailing lists very easily. I look forward to
opinions on both of the included lists.

  XML Node Inclusion Mechanism
  ============================

Abstract
========
This note describes the syntax and semantics of a simple node inclusion
mechanism for XML. Inclusion allows documents and parts of documents to
be reused automatically in multiple documents. It should be considered
a structured alternative to XML's text entity mechanism.

The note builds upon Web Characterization [Webchar], the XML 
Information Set [InfoSet] and the XLink and XPointer specifications.

Scope
=====
There are many reasons for wanting to include text. We divide them up
into two categories: reuse for text management versus quotation with a
rhetorical or sociological purpose. For the purposes of this document,
we will refer to the former as inclusion and the latter as transclusion.

For instance you might want to include boilerplate text for a copyright.
In that case, the original context is not useful. The fact that the text
was dynamically assembled is only an implementation detail. There is no
rhetorical or sociological reason for the reuse. It is just an
efficient way of structuring the text. This is inclusion.

In another case, you might want to quote a verse from Hamlet.  In this
case there are various rhetorical and sociological implications of
the reuse. The fact that the verse comes from Shakespeare is relevant
to readers of the document. In this case, you would need the 
rendition to indicate clearly that the text is included. It may
also provide a means for recovering the verse's original context. 
This is transclusion.

This specification is too simple to completely cover rhetorically
motivated quotation (transclusion). We assert that in 
general transclusion requires intelligent rendition decisions
which can only be handled with sophisticated stylesheet support. 
Another complication with transclusion is that it must be possible to 
transclude diverse media types. Inclusion does not have that 
requirement.

Processing Model
================
Overview:

An inclusion process takes an XML document information set as input and 
produces a result information set. This process may involve many
individual inclusions (e.g. copyright boilerplate could be included
from one source and an introductory paragraph from another source).

Inclusion Links:

All XLinks in the input document with a link type of xml:include are
inclusion links. Inclusion links explicitly or implicitly have two
anchors: source and content. 

Source Anchor:

The source anchor may be identified by a locator with the role
"source". It must address a single node in the same document as the
link. If an inline link has no locator with the role "source", then
the local resource serves as the source anchor.

Content Anchor:

The content anchor may be identified by a locator with the role
"content". If the link has no such locator but has exactly one
remote resource, then that resource may be used as the content anchor.

Note: According to these rules, the simplest inclusion reference (without
using defaulted attributes) uses this syntax:

<xml:include xml:link="simple" href="blah blah blah"/>
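
The anchor rules above can be sketched in a few lines of Python. This
is only an illustration, not a proposed API: the Link and Locator
classes, the local_resource handle and the file name copyright.xml are
all hypothetical, and we assume that every locator without the role
"source" counts as a remote resource.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Locator:
    href: str
    role: Optional[str] = None  # e.g. "source" or "content"


@dataclass
class Link:
    link_type: str                         # "xml:include" marks an inclusion link
    inline: bool = True
    local_resource: Optional[str] = None   # hypothetical handle for the linking element
    locators: list = field(default_factory=list)


def source_anchor(link):
    """Locator with role "source", else the local resource of an inline link."""
    for loc in link.locators:
        if loc.role == "source":
            return loc.href
    return link.local_resource if link.inline else None


def content_anchor(link):
    """Locator with role "content", else the only remote resource, if any."""
    for loc in link.locators:
        if loc.role == "content":
            return loc.href
    # Assumption: locators other than the source count as remote resources.
    remote = [loc for loc in link.locators if loc.role != "source"]
    return remote[0].href if len(remote) == 1 else None


# The simple form: one href, no roles, so both anchors are defaulted.
simple = Link("xml:include", inline=True, local_resource="<this element>",
              locators=[Locator("copyright.xml")])
print(source_anchor(simple), content_anchor(simple))
```

A link with several unlabelled remote resources has no content anchor
under these rules, which is why content_anchor returns None then.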

Software called an inclusion processor works from the information sets
for the input document and for the documents containing the included
nodes, and generates an information set called the result.

Process:

The processing is recursive. It starts with the document node and
progresses down to inclusion nodes and their inclusions.

The result of processing the input document node is a result document
node.

The children of the result document node are the nodeset result of 
processing the input document's content (prolog, document element and 
epilog).

The result of processing a node that serves as the source of an
inclusion is a copy of the nodeset addressed by the content anchor of
that inclusion link.

The result of processing a document type declaration, processing
instruction or character is an identical DTD, PI or character, as long
as the node was not the source of an inclusion.

The result of processing any non-source element node is a result
element node with the same generic identifier and attributes. The
content of the result element node is the result of processing the
content of the input element.
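
One iteration of these rules might look like the following Python
sketch. The Node class and the content_for callback are hypothetical
stand-ins for a real information set and a real link resolver, and for
brevity the sketch models runs of text rather than single characters.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "document", "element", "doctype", "pi" or "text"
    name: str = ""                 # generic identifier, PI target or text data
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def process(node, content_for):
    """Return the result nodeset (a list) for one input node.

    content_for(node) is a hypothetical callback: it returns the nodeset
    addressed by the content anchor when node is the source of an
    inclusion, and None otherwise.
    """
    included = content_for(node)
    if included is not None:
        # Source of an inclusion: the result is a copy of the nodeset.
        return copy.deepcopy(included)
    if node.kind in ("document", "element"):
        # Same generic identifier and attributes, processed content.
        result = Node(node.kind, node.name, dict(node.attrs))
        for child in node.children:
            result.children.extend(process(child, content_for))
        return [result]
    # Doctype, PI and text nodes come through as identical copies.
    return [copy.deepcopy(node)]


notice = [Node("text", "Copyright 1999")]
inc = Node("element", "xml:include",
           {"xml:link": "simple", "href": "copyright.xml"})
doc = Node("document", children=[
    Node("element", "book", children=[inc, Node("text", "Body")])])

(result,) = process(doc, lambda n: notice if n is inc else None)
print([child.name for child in result.children[0].children])
```

Note that the copied nodeset is not itself processed here, so any
inclusion links inside it survive into the result of this iteration.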

Iterations:

The process of evaluating each node from the document down to the leaves
(other than the children of source nodes) is called an iteration of the
process. In many contexts it will make sense to process the result tree
and the result of that process and so forth until there are no more 
source elements in a result tree. This is called a deep inclusion process.

Note: a deep inclusion can be implemented in a single pass but the 
specification describes it as multiple passes because in some cases this
may be convenient.
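
The multiple-pass description can be sketched as a loop around a
single iteration. The model below is deliberately tiny and entirely
hypothetical: a document is a list of strings and ("include", key)
tuples, and store maps keys to such lists. The max_iterations guard is
our own assumption; this note does not say how cyclic inclusions
should be handled.

```python
def include_once(items, store):
    """One iteration: replace each ("include", key) by a copy of store[key]."""
    result = []
    for item in items:
        if isinstance(item, tuple) and item[0] == "include":
            # The copied items are not processed again in this pass.
            result.extend(store[item[1]])
        else:
            result.append(item)
    return result


def has_sources(items):
    """True while the tree still contains source items."""
    return any(isinstance(i, tuple) and i[0] == "include" for i in items)


def deep_include(items, store, max_iterations=100):
    """Repeat iterations until no source items remain in the result."""
    iterations = 0
    while has_sources(items):
        if iterations == max_iterations:
            raise RuntimeError("inclusion did not terminate")
        items = include_once(items, store)
        iterations += 1
    return items


store = {
    "intro": ["Intro ", ("include", "legal")],
    "legal": ["(c) 1999"],
}
doc = ["Title ", ("include", "intro"), " End"]
print(include_once(doc, store))
print(deep_include(doc, store))
```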

Addressing:

Although the Web is designed to allow anchors into manifestations of
documents, it does not define a syntax that differentiates between
links into the resource and links into the various client and server
generated manifestations of the document, including transformation
result trees. Typically, any reference is interpreted as being valid
in all manifestations, but this is not always the case and is
specifically not the case with links into including documents.

Until a generalized syntax is defined, we define an extension to
XPointer and the XSL query language that allows us to do inclusions.

include() is a function that takes a single document node as an
argument and returns a nodeset representing the result of the
inclusion process. By default it works upon the current node.

Note: Therefore a reference such as somedoc.xml#include()//TITLE refers to
all titles in the result of an inclusion process applied to somedoc.xml.

deep-include() is a function that takes a single document node as an
argument and returns a nodeset representing the result of the deep
inclusion process. By default it works upon the current node.
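
To show how a processor might recognise these references, here is a
rough Python sketch that splits one into the document, the function
name and the rest of the path. The split_reference helper is
hypothetical, it handles only the argument-less forms shown in this
note, and it makes no attempt to evaluate the remaining path.

```python
import re
from urllib.parse import urldefrag

# Matches a fragment that starts with include() or deep-include().
_FUNC = re.compile(r"(deep-include|include)\(\)(.*)", re.DOTALL)


def split_reference(ref):
    """Split 'doc#include()path' into (document, function, rest of path).

    function is None when the fragment does not start with one of the
    two inclusion functions.
    """
    document, fragment = urldefrag(ref)
    match = _FUNC.fullmatch(fragment)
    if match is None:
        return document, None, fragment
    return document, match.group(1), match.group(2)


print(split_reference("somedoc.xml#include()//TITLE"))
print(split_reference("somedoc.xml#deep-include()//TITLE"))
```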

Limitations
===========
The result of inclusion may not be valid according to the input DTD.
This mechanism does not provide specific support to ensure that it
will be. This responsibility is placed upon the creator of the
including document. This is no more onerous than the same
responsibility in the XML text entity mechanism. There is probably a
market for authoring and validation software that will follow
inclusion references and ensure that the logical result document will
be valid.

In the long term, it would be useful to have an XPointer/XSL QL
extension that changed the document type declaration on a document
node. Then result document types could be different from source document
types.

The mechanism does not preserve authorship information. The underlying
XML data model does not support this concept. In other words, the
technology does not defend against plagiarism. In our opinion, this
strictly mechanical layer is not the correct place to enforce a high
level concept like ownership. People who want to plagiarise can use
many other techniques just as easily as they can use this one.

IDs in the included documents must be chosen so that they do not clash.
Future versions of SGML and XML schemas will probably support ID scopes to
avoid this problem.
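
Until then, a tool could at least report clashes after inclusion. A
minimal Python sketch, assuming the ID values of the result document
have already been collected into a list by some other means (the
clashing_ids helper is hypothetical):

```python
from collections import Counter


def clashing_ids(ids):
    """Return the sorted ID values that occur more than once."""
    counts = Counter(ids)
    return sorted(value for value, n in counts.items() if n > 1)


# IDs gathered from an including document and one included fragment.
print(clashing_ids(["intro", "legal", "intro"]))
```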

Future Work
===========
It is only possible to include parts of other resources that have an
information set that is compatible with XML's. The term "compatible
with" is loosely defined at this point but could be made more explicit
if there were information sets for multiple media types and if those
information sets could build upon each other through subtyping. Right
now, neither HTML nor generic SGML has an information set, so parts of
those documents cannot be included.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for only himself
 http://itrc.uwaterloo.ca/~papresco