Inheritance in XML (was Re: Problems parsing XML)

Fri Apr 17 17:21:41 BST 1998

At 12:12 PM 4/17/98 +0200, Matthew Gertner wrote:

>1) HyTime provides an extremely valuable and rich basis for this work, just
>as it has for XML-Link. However, the relevant aspects need to be extracted
>and presented in a more easily digestible form. Also, HyTime attempts to
>implement inheritance (of element content) without extending the DTD syntax.
>This decision should at least be reevaluated in the context of XML.

I appreciate the vote of confidence for architectures and hesitate to make
the next comment. However, there appears to be a general misconception
about architectures that I feel I must attempt to correct, to wit, that
architectures have ANYTHING to do with inheritance.

Matthew says "HyTime attempts to implement inheritance (of element content)
without extending the DTD syntax".

This is a false statement because HyTime DOES NOT ATTEMPT to define any
form of inheritance as I understand that word.  Therefore, it is not a
failing of the AFDR that it did not extend DTD syntax (which was never a
realistic option at the time it was designed). The decision that was made
was the only possible decision at the time.

This is not to say that I object to the idea of true inheritance in SGML. I
do not. It would almost certainly be a useful facility, making the use of
architectures at least easier, if not more powerful as well. So I
appreciate the depth of thought that is being and will be put into this
issue. I simply object to the suggestion that there is anything wrong with
architectures as they stand because they fail to provide a proper or useful
inheritance mechanism. Architectures cannot fail at something they
explicitly don't try to do. I don't want people to think that they
shouldn't use architectures because they don't do inheritance.
Architectures aren't about inheritance--they are orthogonal but synergistic
concepts.

The *processing effect* of using architectures may *appear* to be
inheritance, but that is a side effect of the type of processing that
architectures enable, not a direct intent of the architectural mechanism.
Or, said another way, architectures were designed to enable object-oriented
*processing* but not object-oriented construction of instance DTDs for
enabling parsing and validation. The latter simply isn't a requirement for
the former and is orders of magnitude harder to invent, specify, and
implement.

Remember: DTDs exist for exactly two reasons:

1. To enable *syntactic* validation of instances
2. To enable the use of markup minimization features

For all other types of processing DTDs are *irrelevant*. Thus, you do not
need to think about DTDs at all in order to enable object-oriented
*processing*, which is one of the things architectures do.  Architectures
also enable the syntactic validation of documents against the architectural
syntax rules (the architectural DTD), but they do not need to provide a
"DTD inheritance" mechanism in order to do that--they simply need to enable
the automatic generation of new instances that conform to the architectural
DTD.  This is a pretty trivial thing to define and implement (modulo the
optional automapping facility, which, like any markup minimization feature,
complicates things a bit).
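
As a rough illustration of how little machinery that generation needs,
here is a minimal Python sketch. It is not the AFDR algorithm: the
attribute name sectarch, the set of form names, and the choice to hoist
the mapped descendants of unmapped elements into the nearest mapped
ancestor are all assumptions made for the example, and data handling is
reduced to copying an element's leading text.

```python
import xml.etree.ElementTree as ET

# Hypothetical form names for an illustrative architecture.
FORMS = {"Section", "Intro", "Title", "Para"}


def derive(elem, arch_attr="sectarch", forms=FORMS):
    """Return the architectural elements derived from elem (a list)."""
    # Explicit mapping: the instance names its form in an attribute.
    form = elem.get(arch_attr)
    # Simplified automapping: an element whose name is a form name maps to it.
    if form is None and elem.tag in forms:
        form = elem.tag
    derived_children = []
    for child in elem:
        derived_children.extend(derive(child, arch_attr, forms))
    if form is None:
        # Assumption of this sketch: an unmapped element disappears and
        # its mapped descendants are hoisted to the enclosing element.
        return derived_children
    out = ET.Element(form)
    out.text = elem.text
    out.extend(derived_children)
    return [out]


doc = ET.fromstring(
    '<Division sectarch="Section">'
    '<Heading sectarch="Title">T</Heading>'
    '<Note><Para>p</Para></Note>'
    '</Division>'
)
result = derive(doc)
print(ET.tostring(result[0], encoding="unicode"))
# <Section><Title>T</Title><Para>p</Para></Section>
```

The derived tree is what would be validated against the architectural
DTD. The instance's own DOCTYPE plays no part in producing it.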

It might help to understand why architectures are designed the way they are.

Architectures are designed to give you a way to define a set of general
rules for processing documents for some specific purpose (e.g.,
hyperlinking, defining metadata, etc.). Document instances use these rules
by reference by asserting derivation from the architecture and conformance
to its rules.  

Because SGML can only talk about syntactic rules and because the
architecture mechanism uses SGML syntax as the base definition of its
rules, these sets of rules provide an ability to define syntactic
constraints in a way that is similar or identical to those provided by a
document's private DOCTYPE declaration.  At the same time, these rules do
not impose any requirements on the names used in instances, because
avoidance of name-space incursion is a basic principle of SGML and its
related standards. Thus, a general set of rules defines a set of types that
instances assert conformance to, rather than defining the instance types
directly.  Note that architectures presume additional definitions beyond
the architectural DTD but cannot, of course, define how these rules might
be specified (because there are an infinite number of useful ways to do so).

Note that the direction of pointing is from instances to types to establish
an is-a or kind-of relationship.  This is merely an *assertion* made by
element *instances* (not types). This means that there is no, I repeat, no
connection between element type declarations and architectural types
("forms"), except that the markup minimization feature of fixed attributes
lets you fix the mapping for instances as part of an element type
declaration.  But it is not meaningful to say that an element type conforms
to an architectural form--only instances can conform. This further suggests
that what architectures do is not inheritance because instances do not
inherit properties from other types, they are simply instances of types.
Architectures do not define any notion of types being derived from types.
[The derivation of one architecture from another is really the derivation
of architectural *instances* from another architecture, not derivation of
the architecture. This truth is obscured by the fact that architectural
instances are normally only transient objects used by processors and not
literally instantiated as SGML documents.]

In addition, the rules defined by an architecture need not cover the
entirety of an instance. The HyTime architecture, for example, only covers
those parts of documents involved with linking and addressing. Therefore,
the mechanism must be flexible enough to allow both elements of different
types in the same document to be derived from different architectures and
a single element to be derived from several architectures at once.

Because each architecture defines a distinct "processing context", there is
no problem in having a single element derived from multiple architectures
because the processing for each architecture is independent of the
processing for any other architecture.  There is no "multiple inheritance"
problem because it's not inheritance.  It's no different from me saying
that I conform to the rules for both male humans and licensed drivers.
These are distinct rule domains and as long as the rules for conformance to
both do not result in a conflict such that I can't satisfy both at once,
there are no problems. [For example, I could also say that I can conform to
the rules for licensed drivers and medical cadavers but I obviously can't
do both at the same time, because being a cadaver includes a requirement
that makes it impossible for me to conform to the rules for licensed drivers.]
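
A minimal Python sketch of what "independent processing contexts" can
look like in a processor. Everything here is hypothetical: the two rule
functions, the rules they enforce, and the use of the attribute names
HyTime and sectarch to select the form in each architecture.

```python
import xml.etree.ElementTree as ET


def hytime_rules(form, elem):
    # Hypothetical rule: a 'bibloc' form must carry some text.
    if form == "bibloc" and not (elem.text or "").strip():
        return ["bibloc must not be empty"]
    return []


def sectarch_rules(form, elem):
    # Hypothetical rule: a 'Para' form must not contain child elements.
    if form == "Para" and len(elem) > 0:
        return ["Para must not contain elements"]
    return []


# One entry per architecture: attribute naming the form -> its rule set.
ARCHITECTURES = {"HyTime": hytime_rules, "sectarch": sectarch_rules}


def check(elem):
    """Check elem once per architecture it claims; results never interact."""
    report = {}
    for attr, rules in ARCHITECTURES.items():
        form = elem.get(attr)
        if form is not None:
            report[attr] = rules(form, elem)
    return report


elem = ET.fromstring(
    '<Address HyTime="bibloc" sectarch="Para">W. Eliot Kimber<br/></Address>'
)
print(check(elem))
# {'HyTime': [], 'sectarch': ['Para must not contain elements']}
```

The element satisfies one rule domain and fails the other, and neither
result affects the other. Nothing has to be merged or resolved between
the two.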

Note that the assertion made by elements that they conform to a given form
is NOT saying "instance element X inherits the *syntactic* properties of
architectural form Y". It is saying "instance element X *conforms to* the
syntax and semantics of architectural form Y".  It is an assertion of
conformance or derivation that does not have any implications about the
content model of the instance except that it must *allow* (but not
necessarily require) instances that conform to the architectural content
rules.  The only constraint architectural content models impose on
instances is the requirement for *potential* conformance. But instances are
free to allow content that would not conform, because not all instances
will be processed or validated with respect to a given architecture.

[There may, however, be a definite processing result that looks or in fact
is inheritance, but that's inheritance of processing, which is different
from inheritance of local syntactic rules. Object-oriented techniques are a
natural way to implement processing because you can reflect the *taxonomic*
hierarchy represented by an architecture with programmatic objects.]

For example, say I define an architecture for sections in technical manuals
with the following architectural content model:

<!-- Section architecture: -->
<!ELEMENT Section
  (Title,
   (Para+ |
    Section+))
>
<!-- Another form that is not allowed within Section -->
<!ELEMENT Intro
  (Para+)
>

<!-- End of architecture -->

In a document, I might have this element type, instances of which can be
derived from the Section form:

<?xml version="1.0"?>
<!DOCTYPE Division [
<?IS10744:arch name="Sectarch" ... ?>
<!ELEMENT Division
  ((Title | Metadata),
   (Para+ |
    (Intro,
     Division+) |
    Division+))
>
<!ATTLIST Division
   sectarch 
      (Section)
      #IMPLIED
>
]>
<Division sectarch="Section">
 <!-- This Division claims conformance but fails to conform 
      because the Section architectural element does not
      allow the Intro architectural element in its content.  
   -->
 <Metadata>...</Metadata>
 <Intro>
  ..
 </Intro>
 <Division>
  ...
 </Division>
</Division>

This document is valid with respect to its own rules. It should be clear
from inspection that it allows instances that conform to the Section
architecture. It also allows instances that do not conform. It should also
be clear that the instance does not conform to the Section architecture
(even though it asserts conformance by asserting derivation from the
Section form).
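
The same check can be sketched mechanically in Python. This is only an
illustration under stated assumptions: the two architectural content
models are hand-translated into regular expressions over form names,
Title and Para are assumed to be forms declared elsewhere in the
architecture, elements are automapped when their name matches a form
name, and unmapped children are simply skipped.

```python
import re
import xml.etree.ElementTree as ET

# Architectural content models as regexes over space-terminated form names.
MODELS = {
    "Section": r"Title ((Para )+|(Section )+)",
    "Intro": r"(Para )+",
}
# Assumption: Title and Para are forms too, with unconstrained content.
FORMS = {"Section", "Intro", "Title", "Para"}


def form_of(elem, arch_attr="sectarch"):
    """Form an element derives from, or None if it is not architectural."""
    form = elem.get(arch_attr)
    if form is None and elem.tag in FORMS:
        form = elem.tag  # simplified automapping by name
    return form


def conforms(elem):
    """True if elem and its descendants satisfy the architectural models."""
    model = MODELS.get(form_of(elem))
    if model is not None:
        child_forms = [form_of(child) for child in elem]
        seq = "".join(f + " " for f in child_forms if f is not None)
        if not re.fullmatch(model, seq):
            return False
    return all(conforms(child) for child in elem)


claims_but_fails = ET.fromstring(
    '<Division sectarch="Section"><Metadata>m</Metadata>'
    '<Intro><Para>p</Para></Intro>'
    '<Division><Title>t</Title><Para>p</Para></Division></Division>'
)
really_conforms = ET.fromstring(
    '<Division sectarch="Section"><Title>t</Title><Para>p</Para></Division>'
)
print(conforms(claims_but_fails), conforms(really_conforms))
# False True
```

Both instances are allowed by the Division declaration. Only the second
one survives validation against the architecture, and that validation
never looks at the Division content model.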

Thus, given an architectural element type, there is no way to predict the
content models of conforming instances except to say "it will probably
*allow* conforming instances".  Note that given an architectural element
type, it is probably easy to *generate* instance content models that will
ensure conformance (e.g., just copy the architectural declarations into the
instance and change the names, if desired), but combining two or more forms
from different architectures into a single element type probably cannot be
done programmatically in any satisfactory way because too many arbitrary
decisions will have to be made, possibly based on variables that can only
be understood or provided by humans (such as when instances are expected to
be validated against a particular architectural derivation).
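
The easy case (copy the architectural declaration and change the names)
can be sketched in a few lines of Python. The function name and the
renaming table are hypothetical, and the sketch handles only a single
element declaration. It does not attempt the hard case of combining
forms from several architectures.

```python
import re

ARCH_DECL = "<!ELEMENT Section (Title, (Para+ | Section+))>"
PREFIX = "<!ELEMENT "


def instantiate(decl, names):
    """Copy an architectural element declaration, renaming per `names`."""
    def swap(match):
        name = match.group(0)
        return names.get(name, name)

    # Leave the '<!ELEMENT ' keyword alone; rename every name after it.
    body = decl[len(PREFIX):]
    return PREFIX + re.sub(r"[A-Za-z][\w.-]*", swap, body)


print(instantiate(ARCH_DECL, {"Section": "Division", "Para": "P"}))
# <!ELEMENT Division (Title, (P+ | Division+))>
```

Because the copy differs from the original only in names, any instance
valid against it maps back to a conforming architectural instance once
each renamed element is mapped to its original form.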

It should be clear that any notion of true inheritance of content models
from architectures to instances is problematic at best, provably impossible
at worst.  

In addition, it would require that the instance parser have access to all
architectural DTDs and be able to synthesize them according to some set of
combinatorial heuristics. To my mind, this is a level of processing
overhead that is unacceptably high if all conforming parsers must support
it. In particular, it seems to be directly at odds with at least one of
XML's basic principles (actually, I can think of at least three: enabling
small parsers, no options, simplicity of specification).

By contrast, you only need to access and use an architectural DTD when you
are *validating* with respect to that architecture, which is always an
option. Validation is not a requirement for doing architecture-aware
processing. A processor for any given architecture presumably has built-in
knowledge of the forms in that architecture. In any case, DTDs only enable
validation and parsing, not processing, so they are largely irrelevant to
the issue of enabling *processing*, which is the primary purpose of
architectures. Thus, the use of architectures imposes *no requirements* on
instance parsers to do anything more than they have to do today. Validating
with respect to an architecture is a choice that users of documents get to
make.

But doing such a combination in some non-SGML schema syntax is perfectly
reasonable to contemplate because at that point you've gone outside the
minimum requirements of SGML parsing and by definition there is no
requirement that any conforming instance parser do any processing with
respect to non-SGML-syntax schemas.

Cheers,

Eliot
--
<Address HyTime=bibloc>
W. Eliot Kimber, Senior Consulting SGML Engineer
Highland Consulting, a division of ISOGEN International Corp.
2200 N. Lamar St., Suite 230, Dallas, TX 95202.  214.953.0004
www.isogen.com
</Address>

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/