Syntactic and conceptual schemas
Ronald Bourret
rbourret at ito.tu-darmstadt.de
Wed Aug 18 12:08:35 BST 1999
I received the following questions privately. However, I think the answers
(vague as they are) might be interesting to the list.
Uwe Speck wrote:
> In the W3C Note of the XML-Data Schema is written that there are
> two types of Schema: syntactic and conceptual.
>
> URL: http://www.w3.org/TR/1998/NOTE-XML-data-0105/
>
> "Schemas define the characteristics of classes of objects. This paper
> describes an XML vocabulary for schemas, that is, for defining and
> documenting object classes. It can be used for classes which as
> strictly syntactic (for example, XML) or those which indicate concepts
> and relations among concepts (as used in relational databases, KR graphs
> and RDF).
> *** The former are called "syntactic schemas;" the latter
> "conceptual schemas." ***"
>
> The terms syntactic and conceptual are NOT used in the W3C Schema
> Note Part 1 Structures, but there seems to be the same intention:
>
> URL: http://www.w3.org/1999/05/06-xmlschema-1/
>
> Par. 2.2 On schemas, constraints and contributions
>
> "XML Schema: Structures not only reconstructs the DTD constraints of XML
> 1.0 using XML instance syntax,
>
> *** it also adds the ability to define new kinds of constraints. ***
>
> For example, although the author of an XML 1.0 DTD may declare
> an element type as containing character data, elements, or mixed content,
> there is no mechanism with which to constrain the contents of elements
> to only character data of a particular form, such as only integers in a
> specified range." ...
>
> So my questions:
Good (and hard) questions. I'll do my best to answer them, but there's no
guarantee my answers are completely correct.
> 1) From my point of view, the paragraphs written above mean
> exactly the same, but use different words. Is that true?
I don't think so, but I'm not really sure. (To give you an idea of my
confusion, I started out saying "maybe", then said "probably", then went
back to "maybe", and ended up with "I don't think so".)
Background
----------
In database theory, there are two different kinds of schema -- the physical schema
and the conceptual schema. The physical schema states how data is actually
stored on the disk. The conceptual schema states how data appears to be
organized from the user's point of view. For example, in a relational
database, the conceptual schema declares how data is organized into tables
and columns, what the data types of each column are, what the primary key /
foreign key relationships between the tables are, and so on.
The query engine accepts commands that use the conceptual schema. For
example, in a relational database, the query engine accepts SQL statements
to select, insert, update, and delete data and so on. The query engine then
submits requests to the storage engine, which knows the physical schema.
For example, the query engine might ask the storage engine for all the rows
in the table named ABC; the storage engine looks at the physical schema and
retrieves the data. Notice that the storage engine is translating here
between the conceptual schema (which uses the concept of table) and the
physical schema (which describes how data is stored on disk).
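
A minimal sketch of this layering, assuming Python's built-in sqlite3 module as a stand-in for "a relational database". The table name ABC comes from the example above; the columns id and name are made up for illustration. The point is only that the SQL mentions tables and columns and never says how the rows are laid out on disk:

```python
import sqlite3

# In-memory database. ABC is the table from the example above;
# the columns (id, name) are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ABC (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO ABC (id, name) VALUES (?, ?)",
    [(1, "first"), (2, "second")],
)

# The query is phrased purely in conceptual-schema terms (table, column).
# How SQLite stores the rows is hidden behind the engine.
rows = conn.execute("SELECT id, name FROM ABC ORDER BY id").fetchall()
print(rows)
conn.close()
```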
Note also that the physical schema can be (and usually is) completely
different from the conceptual schema. For example, it would be perfectly
legal to store all data on disk as a sequence of strings:
[table name][column name][row number][data value]
where row number and data value are stored in string form. To find all data
for a table, the storage engine could search through all of the data and
return only the data for the requested table, which it would change to the
data type of the column according to the conceptual schema. Of course, this
would be horribly inefficient for a large database, but it shows you how
much the conceptual schema and physical schema can differ.
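
The string-based layout above can be sketched roughly as follows. This is an illustration only: the table ABC, its columns, and the second table XYZ are invented, and a Python list stands in for the disk. The "storage engine" scans every record, keeps those for the requested table, and converts each value to the type the conceptual schema gives for its column:

```python
# Hypothetical conceptual schema: the data type of each column, per table.
CONCEPTUAL_SCHEMA = {"ABC": {"id": int, "name": str, "price": float}}

# "Physical" storage: a flat sequence of
# (table name, column name, row number, data value), all held as strings.
STORAGE = [
    ("ABC", "id", "1", "10"),
    ("ABC", "name", "1", "widget"),
    ("XYZ", "code", "1", "q"),
    ("ABC", "price", "1", "2.5"),
    ("ABC", "id", "2", "11"),
    ("ABC", "name", "2", "gadget"),
    ("ABC", "price", "2", "4.0"),
]


def fetch_table(table):
    """Scan all records, keep those for `table`, and convert each value
    to the type declared for its column in the conceptual schema."""
    types = CONCEPTUAL_SCHEMA[table]
    rows = {}
    for tbl, column, row_number, value in STORAGE:
        if tbl != table:
            continue  # record belongs to some other table
        rows.setdefault(int(row_number), {})[column] = types[column](value)
    return [rows[n] for n in sorted(rows)]


result = fetch_table("ABC")
print(result)
```

The full scan on every request is what makes this layout inefficient for a large database; the sketch makes no attempt to hide that.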
XML-Data
--------
I believe the authors of XML-Data think of an XML document as physical
storage and the XML specification as defining rules for this storage. That
is, it has rules for where markup, white space, character data, and so on
can appear. In this context, the DTD language is a language for defining
physical schemas. In other words, it states what attributes belong to a
given element type, the content models of element types, and so on.
(To use the terms used by the XML-Data authors, XML defines the syntax for
a class of languages -- that is, the legal structure of strings in the
language. The DTD language is used to define the syntax for a particular
XML language. In other words, a DTD is a "syntactic schema".)
Thus, any schema information that does not directly affect physical storage
is "conceptual" and needs to be interpreted by a layer above the storage
engine (XML processor/parser).
A good example of this is data types. All data in an XML document is stored
as a string. Therefore, stating that the data type of a given element or
attribute is integer is a "conceptual" operation, since conversion to/from
strings and type checking are not performed at the storage (XML
processor/parser) level, but at a higher level. Similarly, such things as
the <foreignKey> element type in XML-Data are conceptual constraints --
that is, constraints that must be enforced at a level higher than the
storage engine.
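
A small sketch of the data type case, assuming Python's standard xml.etree.ElementTree parser. The element names and the TYPES table are hypothetical and stand in for whatever a schema language would declare. The parser hands back strings in both cases; the conversion to integer happens in a layer above it:

```python
import xml.etree.ElementTree as ET

# Two elements with identical character data.
DOC = "<order><quantity>12</quantity><note>12</note></order>"

# Hypothetical "conceptual" type information, kept outside the parser.
TYPES = {"quantity": int, "note": str}

root = ET.fromstring(DOC)

# What the parser delivers: every value is a string.
raw = {child.tag: child.text for child in root}

# What the higher layer produces after applying the declared types.
typed = {tag: TYPES[tag](text) for tag, text in raw.items()}

print(raw)
print(typed)
```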
W3C XML Schemas
---------------
On the other hand, I don't know if the authors of the W3C's XML Schemas see
a difference between the constraints imposed by a DTD and other
constraints. It appears that they view the constraints that can be written
in the DTD language as a subset of the possible constraints. That is, I
think they see a large number of possible constraints (content models,
lists of legal attributes, element type inheritance, data types, and so
on), of which the DTD language supports some and XML Schemas supports
more.
Discussion
----------
One of the problems with XML is that the boundaries between physical and
conceptual are not always clear. Technically, the specification only
defines physical layout -- that is, the syntax of a legal XML document.
Unfortunately, when people start to think about XML, they immediately start
to think in conceptual terms:
* Programmers usually think of an object model in which element types
roughly correspond to classes and attributes correspond to properties of
these classes.
* Document authors usually think of a document model in which the physical
layout corresponds to the conceptual model they have in their head (a book
has a title, one or more authors, and one or more chapters; a chapter has a
title and one or more sections; and so on).
In both cases, an XML schema language (including the DTD language) can be
viewed as a conceptual schema language as well as a physical schema
language. For example, element types define physical structures (tags and
legal children) but also can be used to define classes (in the programmer's
case) or document parts such as chapters (in the document author's case).
Similarly, archetypes in the W3C's XML Schemas can be thought of as a
convenient shorthand (similar to parameter entities) for defining element
types, but can also be used to define object inheritance.
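
To make the programmer's view concrete, here is a rough sketch in Python, using the standard xml.etree.ElementTree and dataclasses modules. The book/author/chapter names follow the document author's example above; putting the title in an attribute, and the classes themselves, are assumptions made for illustration. Element types become classes and attributes become properties:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Chapter:
    title: str


@dataclass
class Book:
    title: str
    authors: list = field(default_factory=list)
    chapters: list = field(default_factory=list)


# A hypothetical document; nothing in XML itself says that <book> "is" a class.
DOC = """<book title="XML Notes">
  <author>A. Writer</author>
  <chapter title="Syntax"/>
  <chapter title="Concepts"/>
</book>"""


def load_book(text):
    """Impose the object model on the parsed element structure."""
    root = ET.fromstring(text)
    return Book(
        title=root.get("title"),
        authors=[a.text for a in root.findall("author")],
        chapters=[Chapter(c.get("title")) for c in root.findall("chapter")],
    )


book = load_book(DOC)
print(book)
```

The mapping lives entirely in load_book, that is, in the application; the parser sees only tags, attributes, and character data.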
(The situation is further complicated by things like entities. In DDML, we
thought of entities as physical constructs and element types, attributes,
and notations as logical constructs. The reason for this was that
processors are not required to inform applications of entity usage. Because
we were only interested in logical (conceptual) constructs, DDML did not
support entity definition. Thus, DDML was primarily a conceptual language.
(Note that the other schema languages do support entity definition,
although there has been strong support for removing these from the W3C's
XML Schemas.))
I think that one reason for the physical/conceptual duality of schema
languages (including the DTD language) is that XML only defines physical
layout and people, who generally think in conceptual terms, want to express
those concepts. Thus, they impose concepts on DTDs and schema languages,
even when those languages were designed to express physical schemas.
An Answer (Finally)
-------------------
So, to answer your question, I don't think that these two paragraphs are
saying the same thing. On the other hand, I don't think they contradict
each other. Instead, I think they are viewing the same question from two
different angles.
XML-Data's separation of syntactic schemas and conceptual schemas is useful
because it makes very clear what XML can do and what it can't do. It also
makes clear the responsibilities of the processor (processing syntactic
schemas) and the application (processing conceptual schemas).
On the other hand, the separation is not entirely relevant to application
writers and document authors. The reason for this is that these people use
the "syntactic" parts of schema languages to express concepts as well as
physical layout. This seems to be the view taken by W3C's XML Schemas.
I hope this helps to clarify, if not completely answer, your question.
> 2) Can we say, that the goal of *** every *** XML-Schema
> language is, to support additional constraints compared to DTDs
> - means every XML-Schema supports something like a
> *** conceptual *** schema-principle! Or are the ** conceptual ***
> Schema of XML-Data something extraordinary?
Yes, I think you can say this. The only difference between XML-Data and the
other schema languages in this regard is that XML-Data explicitly states
which parts of its language apply at the XML document level and which parts
apply at a higher level.
-- Ron Bourret
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)