Syntactic and conceptual schemas

Wed Aug 18 12:08:35 BST 1999

I received the following questions privately. However, I think the answers 
(vague as they are) might be interesting to the list.

Uwe Speck wrote:
> In the W3C Note of the XML-Data Schema is written that there are
> two types of Schema: syntactic and conceptual.
>
> URL: http://www.w3.org/TR/1998/NOTE-XML-data-0105/
>
> "Schemas define the characteristics of classes of objects. This paper
> describes an XML vocabulary for schemas, that is, for defining and
> documenting object classes. It can be used for classes which as
> strictly syntactic (for example, XML) or those which indicate concepts
> and relations among concepts (as used in relational databases, KR graphs
> and RDF).
> *** The former are called "syntactic schemas;" the latter
> "conceptual schemas." ***"
>
> The terms syntactic and conceptual are NOT used in the W3C Schema
> Note Part 1 Structures, but there seems to be the same intention:
>
> URL: http://www.w3.org/1999/05/06-xmlschema-1/
>
> Par. 2.2 On schemas, constraints and contributions
>
> "XML Schema: Structures not only reconstructs the DTD constraints of XML 1.0
> using XML instance syntax,
>
> *** it also adds the ability to define new kinds of constraints. ***
>
> For example, although the author of an XML 1.0 DTD may declare
> an element type as containing character data, elements, or mixed content,
> there is no mechanism with which to constrain the contents of elements
> to only character data of a particular form, such as only integers in a
> specified range." ...
>
> So my questions:

Good (and hard) questions.  I'll do my best to answer them, but there's no 
guarantee my answers are completely correct.

> 1) From my point of view, the paragraphs written above mean
> exactly the same, but use different words. Is that true?

I don't think so, but I'm not really sure. (To give you an idea of my 
confusion, I started out saying "maybe", then said "probably", then went 
back to "maybe", and ended up with "I don't think so".)

Background
----------
In database theory, there are two different schemas -- the physical schema 
and the conceptual schema. The physical schema states how data is actually 
stored on the disk. The conceptual schema states how data appears to be 
organized from the user's point of view. For example, in a relational 
database, the conceptual schema declares how data is organized into tables 
and columns, what the data types of each column are, what the primary key / 
foreign key relationships between the tables are, and so on.

The query engine accepts commands that use the conceptual schema. For 
example, in a relational database, the query engine accepts SQL statements 
to select, insert, update, and delete data and so on. The query engine then 
submits requests to the storage engine, which knows the physical schema. 
For example, the query engine might ask the storage engine for all the rows 
in the table named ABC; the storage engine looks at the physical schema and 
retrieves the data. Notice that the storage engine is translating here 
between the conceptual schema (which uses the concept of table) and the 
physical schema (which describes how data is stored on disk).

Note also that the physical schema can be (and usually is) completely 
different from the conceptual schema. For example, it would be perfectly 
legal to store all data on disk as a sequence of strings:

   [table name][column name][row number][data value]

where row number and data value are stored in string form. To find all data 
for a table, the storage engine could scan all of the stored data, keep 
only the entries for the requested table, and convert each value to the 
data type of its column as declared in the conceptual schema. Of course, 
this would be horribly inefficient for a large database, but it shows you 
how much the conceptual schema and physical schema can differ.
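
As a rough sketch of this string-based layout (in Python; the table names, 
column names, and the fetch_rows function are all invented for the 
illustration and do not come from any real database), the storage engine's 
job might look something like this:

```python
# Hypothetical illustration only: every name below is invented.

# Conceptual schema: table -> column -> data type of that column.
CONCEPTUAL_SCHEMA = {
    "ABC": {"id": int, "name": str},
    "XYZ": {"price": float},
}

# Physical storage: a flat sequence of
# [table name][column name][row number][data value] entries,
# with the row number and data value held in string form.
PHYSICAL_STORAGE = [
    ("ABC", "id", "1", "10"),
    ("XYZ", "price", "1", "9.5"),
    ("ABC", "name", "1", "widget"),
    ("ABC", "id", "2", "20"),
    ("ABC", "name", "2", "gadget"),
]


def fetch_rows(table):
    """Play the storage engine: scan everything, keep one table's entries,
    and convert each value to the data type its column is declared to have."""
    columns = CONCEPTUAL_SCHEMA[table]
    rows = {}
    for table_name, column, row_number, value in PHYSICAL_STORAGE:
        if table_name != table:
            continue  # entry belongs to some other table
        convert = columns[column]
        rows.setdefault(int(row_number), {})[column] = convert(value)
    # Return the rows in row-number order.
    return [rows[number] for number in sorted(rows)]


print(fetch_rows("ABC"))
```

The caller only ever sees tables, rows, and typed columns; the flat list of 
strings is visible to fetch_rows alone.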

XML-Data
--------
I believe the authors of XML-Data think of an XML document as physical 
storage and the XML specification as defining rules for this storage. That 
is, it has rules for where markup, white space, character data, and so on 
can appear. In this context, the DTD language is a language for defining 
physical schemas. In other words, it states what attributes belong to a 
given element type, the content models of element types, and so on.

(To use the terms used by the XML-Data authors, XML defines the syntax for 
a class of languages -- that is, the legal structure of strings in the 
language. The DTD language is used to define the syntax for a particular 
XML language. In other words, a DTD is a "syntactic schema".)

Thus, any schema information that does not directly affect physical storage 
is "conceptual" and needs to be interpreted by a layer above the storage 
engine (XML processor/parser).

A good example of this is data types. All data in an XML document is stored 
as a string. Therefore, stating that the data type of a given element or 
attribute is integer is a "conceptual" operation, since conversion to/from 
strings and type checking is not performed at the storage (XML 
processor/parser) level, but at a higher level. Similarly, such things as 
the <foreignKey> element type in XML-Data are conceptual constraints -- 
that is, constraints that must be enforced at a level higher than the 
storage engine.
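
A small sketch of the data type case, assuming Python's standard 
xml.etree.ElementTree parser and a made-up <order> document (the 
FIELD_TYPES table and the read_order function are hypothetical stand-ins 
for a "conceptual" layer, not part of XML-Data or any other schema 
language):

```python
import xml.etree.ElementTree as ET

# Hypothetical conceptual-level declaration: which data type each field has.
FIELD_TYPES = {"quantity": int, "price": int}


def read_order(xml_text):
    """Return (raw, typed) views of a made-up <order> document."""
    root = ET.fromstring(xml_text)
    # The parser level only ever hands back strings.
    raw = {
        "quantity": root.get("quantity"),
        "price": root.find("price").text,
    }
    # Conversion and type checking happen above the parser.
    typed = {name: FIELD_TYPES[name](value) for name, value in raw.items()}
    return raw, typed


raw, typed = read_order('<order quantity="12"><price>19</price></order>')
print(raw)
print(typed)

# The parser accepts quantity="twelve" without complaint; only the
# higher-level conversion rejects it (int() raises ValueError).
try:
    read_order('<order quantity="twelve"><price>19</price></order>')
except ValueError as error:
    print("rejected above the parser:", error)
```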

W3C XML Schemas
---------------
On the other hand, I don't know if the authors of the W3C's XML Schemas see 
a difference between the constraints imposed by a DTD and other 
constraints. It appears that they view the constraints that can be written 
in the DTD language as a subset of the possible constraints. That is, I 
think that they think there are a large number of possible constraints 
(content models, lists of legal attributes, element type inheritance, data 
types, and so on), and that the DTD language supports some of these 
constraints, while XML Schemas supports more of them.

Discussion
----------
One of the problems with XML is that the boundaries between physical and 
conceptual are not always clear. Technically, the specification only 
defines physical layout -- that is, the syntax of a legal XML document. 
Unfortunately, when people start to think about XML, they immediately start 
to think in conceptual terms:

* Programmers usually think of an object model in which element types 
roughly correspond to classes and attributes correspond to properties of 
these classes.

* Document authors usually think of a document model in which the physical 
layout corresponds to the conceptual model they have in their head (a book 
has a title, one or more authors, and one or more chapters; a chapter has a 
title and one or more sections; and so on).

In both cases, an XML schema language (including the DTD language) can be 
viewed as a conceptual schema language as well as a physical schema 
language. For example, element types define physical structures (tags and 
legal children) but also can be used to define classes (in the programmer's 
case) or document parts such as chapters (in the document author's case). 
Similarly, archetypes in the W3C's XML Schemas can be thought of as a 
convenient shorthand (similar to parameter entities) for defining element 
types, but can also be used to define object inheritance.
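
To make the programmer's view concrete, here is a minimal sketch in Python 
of element types treated as classes and attributes treated as properties. 
The <book> document, the Book and Chapter classes, and the book_from_xml 
function are all assumptions made up for this example:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Chapter:
    # The <chapter> element type, viewed as a class.
    title: str


@dataclass
class Book:
    # The <book> element type, viewed as a class; its isbn attribute
    # becomes a property.
    isbn: str
    title: str
    chapters: list = field(default_factory=list)


def book_from_xml(xml_text):
    """Impose the object model on the physical element structure."""
    root = ET.fromstring(xml_text)
    return Book(
        isbn=root.get("isbn"),
        title=root.findtext("title"),
        chapters=[
            Chapter(title=chapter.findtext("title"))
            for chapter in root.findall("chapter")
        ],
    )


# Made-up sample document; the isbn value is a placeholder.
SAMPLE = (
    '<book isbn="example-123"><title>XML Basics</title>'
    "<chapter><title>Syntax</title></chapter>"
    "<chapter><title>Schemas</title></chapter></book>"
)

book = book_from_xml(SAMPLE)
print(book.title, [chapter.title for chapter in book.chapters])
```

Nothing in the XML itself says that <book> is a class; that reading is 
supplied entirely by the code that consumes the document.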

(The situation is further complicated by things like entities. In DDML, we 
thought of entities as physical constructs and element types, attributes, 
and notations as logical constructs. The reason for this was that 
processors are not required to inform applications of entity usage. Because 
we were only interested in logical (conceptual) constructs, DDML did not 
support entity definition. Thus, DDML was primarily a conceptual language. 
(Note that the other schema languages do support entity definition, 
although there has been strong support for removing these from the W3C's 
XML Schemas.))

I think that one reason for the physical/conceptual duality of schema 
languages (including the DTD language) is that XML only defines physical 
layout and people, who generally think in conceptual terms, want to express 
those concepts. Thus, they impose concepts on DTDs and schema languages, 
even when those languages were designed to express physical schemas.

An Answer (Finally)
-------------------
So, to answer your question, I don't think that these two paragraphs are 
saying the same thing. On the other hand, I don't think they contradict 
each other. Instead, I think they are viewing the same question from two 
different angles.

XML-Data's separation of syntactic schemas and conceptual schemas is useful 
because it makes very clear what XML can do and what it can't do. It also 
makes clear the responsibilities of the processor (processing syntactic 
schemas) and the application (processing conceptual schemas).

On the other hand, the separation is not entirely relevant to application 
writers and document authors. The reason for this is that these people use 
the "syntactic" parts of schema languages to express concepts as well as 
physical layout. This seems to be the view taken by W3C's XML Schemas.

I hope this helps to clarify, if not completely answer, your question.

> 2) Can we say, that the goal of *** every *** XML-Schema
> language is, to support additional constraints compared to DTDs
> - means every XML-Schema supports something like a
> *** conceptual *** schema-principle! Or are the ** conceptual ***
> Schema of XML-Data something extraordinary?

Yes, I think you can say this. The only difference between XML-Data and the 
other schema languages in this regard is that XML-Data explicitly states 
which parts of its language apply at the XML document level and which parts 
apply at a higher level.

-- Ron Bourret

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1