Names and schemas

Mon Jun 1 14:01:04 BST 1998

Through discussions on XML-DEV, I believe that I have clarified my own
thinking on names, namespaces and schemas quite a bit. The fog is lifting.
I believe that through a careful agreement upon and application of
definitions, we can get rid of most complaints about the namespaces
proposal and remove all overlap between that proposal and things like
architectural forms and XSchemas (under development).

Of course my definitions are context-based. Semioticians might disagree
with them, but I think that they are sufficient for those of us in the
markup language design business.

Definitions
===========

Name: (for our purposes) A Unicode string that refers to something.
Example: FOO

Real object: (f.o.p.) A resource the computer can process. Example: a Java
class or XML entity.

Conceptual object: (...) A resource the computer cannot process (yet).
Example: the meaning of the English word "ship"

Namespace: A function that maps names to real or conceptual objects.
Examples: The domain name system. A file system. A particular directory in
a file system.

Declaration: An assertion that a real or conceptual object exists (in
reality or as a concept!). Example: Element type declaration.

Definition: An assertion of what an object *is*. Definitions make things
real to the computer. Examples: Java class definition. External entity
definition.

Directory: (or dictionary, vocabulary) A document that declares objects
and/or defines real objects. Examples: A DTD. /usr/dict

Schema: A document that defines a set of data objects (and thus implicitly
defines a truth value: "is object X in the set?"). Note that a schema does
not necessarily (and, in generic markup, usually will not) DEFINE
objects; it defines a set. All it does for individual objects is report
whether they are in the set or not. Alternatively, you could say that it
constrains them.
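
As a rough illustration of how these definitions separate, here is a
minimal Python sketch. Every identifier in it (make_namespace, Foo,
directory, in_set) is hypothetical and invented for this note, and the
membership rule used by the schema is an arbitrary assumption.

```python
from typing import Callable, List, Optional

def make_namespace(bindings: dict) -> Callable[[str], Optional[object]]:
    """Return a namespace: a function mapping a name to an object (or None)."""
    def resolve(name: str) -> Optional[object]:
        return bindings.get(name)
    return resolve

# Namespace: the name FOO is bound to a real object (here, a Python class).
class Foo:
    pass

resolve = make_namespace({"FOO": Foo})

# Directory: declares that names exist, and says nothing more about them.
directory = {"FOO", "BAR"}

# Schema: defines a set of data objects by answering "is X in the set?".
# Assumed rule for illustration: a document (a list of element names)
# is in the set when it is non-empty and begins with FOO.
def in_set(document: List[str]) -> bool:
    return bool(document) and document[0] == "FOO"
```

Note that the directory knows about BAR even though the namespace binds
it to nothing, and that the schema never defines FOO; it only reports
whether a given document is in the set.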

Implications
============

When I teach or write about SGML DTDs, I always say: "The DTD declares
what elements are allowed and what are the constraints on how they can be
used." It is only today that I recognize that these are two different
responsibilities. The first is the role of a *directory* and the second of
a *schema*. There is nothing wrong with the same language performing both
tasks. It can be very convenient. But we must be clear that they are two
tasks.

Note that these definitions have the potential to sweep away the confusion
about "multiple definitions" and "multiple inheritance" in the namespaces
proposal. It only makes sense to *declare* each name once. Once it has
been declared it has been declared. The software knows it exists. It only
makes sense to *define* each name once for the same reason. A name can be
bound to only one object.
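
The "declare once, bind once" rule can be sketched in Python as follows.
The Directory class and its methods are hypothetical names made up for
this illustration; raising an error on redeclaration is one possible
policy, not something the namespaces proposal prescribes.

```python
class Directory:
    """Records declared names and the single object each name is bound to."""

    def __init__(self):
        self._bindings = {}

    def declare(self, name, obj=None):
        # A second declaration of the same name is rejected: the name is
        # already known, and it can be bound to only one object.
        if name in self._bindings:
            raise ValueError(f"{name!r} is already declared")
        self._bindings[name] = obj

    def is_declared(self, name):
        return name in self._bindings

d = Directory()
d.declare("abc")
try:
    d.declare("abc")
    redeclared = True
except ValueError:
    redeclared = False
```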

But it makes perfect sense to have multiple schemas for a particular
object. The same object could be constrained in a hundred different ways.
Its content model could be checked by a DTD. RDF schemata could check that
it fits into a reasonable logical meta-data framework. A linking schema
could check that if it is a link, it is a "correct" one. etc. etc.
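
A small Python sketch of one object being checked by several independent
schemas. The dict representation of an element and the three checks
(content_model_ok, link_ok, has_metadata) are assumptions made for this
illustration; they only stand in for a real DTD, linking schema or RDF
schema.

```python
# One object, modelled as a plain dict (an assumption of this sketch).
element = {
    "name": "link",
    "attributes": {"href": "http://example.org/"},
    "children": ["title"],
}

def content_model_ok(el):
    # Stand-in for a DTD-style check: only "title" children are allowed.
    return all(child == "title" for child in el["children"])

def link_ok(el):
    # Stand-in for a linking schema: a link must carry a non-empty href.
    return el["name"] != "link" or bool(el["attributes"].get("href"))

def has_metadata(el):
    # Stand-in for a metadata check: a "creator" attribute must be present.
    return "creator" in el["attributes"]

# Each schema constrains the same object; none of them defines it.
schemas = {
    "content model": content_model_ok,
    "linking": link_ok,
    "metadata": has_metadata,
}
results = {label: check(element) for label, check in schemas.items()}
```

The element passes two schemas and fails the third, and nothing about
that is contradictory: each schema simply answers its own question.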

This is why you can attach multiple SGML architectures to a document (they
are schemas) but you can only attach one DTD (it is both a directory AND a
schema). This is also why the namespaces proposal and architectures need
have no overlap. The one is about combining directories. The other is
about constraining the named objects.

Here is an example of a directory (dictionary) that would not be a schema:

<!ELEMENT abc>
<!ELEMENT def>
<!ELEMENT ghi>
<!ELEMENT jkl>

Here is another:

abc def ghi jkl

Of course, a directory *could be* a schema. As I said before, combining
them can be quite convenient. But a schema could also exist which did not
constrain names at all! For instance, a Java class that mapped document
instances to truth values would be a schema (albeit a hard to work with
schema!).
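
To make that last case concrete, here is a sketch of a schema that
ignores names entirely. It is written in Python rather than Java, and
the function name small_document and the "at most three elements" rule
are assumptions chosen only for illustration.

```python
import xml.etree.ElementTree as ET

def small_document(xml_text: str, max_elements: int = 3) -> bool:
    """Schema as a predicate: True when the document is well-formed and
    contains at most max_elements elements, whatever they are called."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed, so not in the set
    # root.iter() walks the root element and all of its descendants.
    return sum(1 for _ in root.iter()) <= max_elements
```

This maps document instances to truth values, so it is a schema by the
definition above, yet it declares nothing and could not serve as a
directory.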

Note that declarations for objects will be the norm in XML applications.
Definitions will be quite rare. Very few of the things that must be
expressed in markup will be expressed in terms that the computer can
understand. This is why XML does not have "element type definitions", but
rather declarations. The definition, if it exists, is in the brain(s) of
the author(s). An exception would be where an element is "defined by" a
Java class or RDF schema.

Implications for the namespaces draft
=====================================

The namespaces proposal was always supposed to be about "naming things
accurately" and not about competing with schema languages. Nevertheless,
this separation of church and state is not complete. The namespace
proposal *does* promote the idea that an object should have a single
schema. It should not. 

Luckily, this is easily fixed. All we need to do is take all normative
references to the word "schema" out of the spec. In some cases, they can
be easily eliminated. For instance the SRCDEF could be eliminated
entirely. The role of the namespaces proposal is simply not to point to
schemas. If the SRCDEF is to be retained, then it should point to a
*directory* (or dictionary) which is not necessarily a schema. (But I
think that the FIRST URI should point to the directory.)

Here's another example of what must be changed:

"We envision applications of Extensible Markup Language [XML] where a
document contains markup defined in multiple schemas, which may have 
been authored independently. One motivation for this is that writing 
good schemas is hard, so it is beneficial to re-use parts from existing,
well-designed schemas. Another is the advantage of allowing search 
engines or other tools to operate over a range of documents that vary 
in many respects but use common names for common element types."

The two sentences should be reversed and modified slightly. Verifying that
a document conforms to one or more schemas is simply a special case of
"allowing tools to operate over a range of documents that vary in many 
respects but use common names." It need not (and probably *should not*) 
be privileged in the namespaces proposal. It is this type of language 
that makes people think that namespaces are a competitor to, or 
replacement for, architectural forms and other schema languages.

Similarly, Section 2.5 presumes that every element is constrained by
a single schema. But we know that many will live in multiple schemas.
Rather, it should refer to "directories" (or one of the synonyms). It
makes sense for an element to be declared in, or defined in, a single
directory.

 Paul Prescod  - http://itrc.uwaterloo.ca/~papresco

Three things it is far better that only you should know:
How much you're paid, the schedule pad, and what is just for show
