words (RE: extensibility in XSchema?)

James K. Tauber jtauber at jtauber.com
Tue Jun 23 07:58:44 BST 1998


> I think the distinction between syntax and semantics can be made very
> clearly: syntax defines the rules for construction of a character string
> (or binary encoding if you want to include that as being syntactic).  The
> syntactic constructs thus become things that can be *mechanically*
> identified and reported.

Like the formal language view or the "Standard Theory" view in linguistics
where syntax is a set theoretic notion. A given document is grammatical or
not according to some syntactic grammar defining the set of grammatical
documents. A DTD with only element declarations would do this.
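
To make the set-theoretic view concrete, here is a minimal Python sketch. It
is not a DTD processor: the CONTENT_MODELS table, the "memo" vocabulary and
the is_grammatical function are all made up for illustration, with each
content model written as a regular expression over child element names.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical grammar: each element type maps to a regular expression over
# the sequence of its child element names (a stand-in for a DTD content model).
CONTENT_MODELS = {
    "memo": r"to,from,(para,)+",
    "to": r"",
    "from": r"",
    "para": r"",
}

def is_grammatical(elem):
    """Return True if elem and all of its descendants match their content models."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return False  # undeclared element type
    # Spell out the child sequence as "name,name,..." so a regex can match it.
    children = "".join(child.tag + "," for child in elem)
    if re.fullmatch(model, children) is None:
        return False
    return all(is_grammatical(child) for child in elem)

good = ET.fromstring("<memo><to>Eliot</to><from>James</from><para>Hi</para></memo>")
bad = ET.fromstring("<memo><from>James</from><to>Eliot</to><para>Hi</para></memo>")
print(is_grammatical(good), is_grammatical(bad))
```

A document is either in the set or out of it; nothing here says what a memo
means.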

> Semantics is the interpretation (whatever that might mean) of the
> syntactic constructs, whatever they might be.

Yes, and we have to be careful about which language different syntactic
constructs belong to.

XML is a metalanguage but it is also a language (hence the notion of
well-formedness).
The character < has both positional constraints (syntax) and a meaning
(alterable by context eg CDATA sections).

DTDs describe a language which, in as much as they contain content models,
is concerned with syntax but which goes beyond that with things like
notations.

Furthermore, there is the language (and there is often more than one) that
the content is in.

> There are of course, different levels of semantic in something as complex
> as a XML document, with the semantics of the base syntactic constructs
> implicitly or explicitly defined by the definition of the syntax itself
> (e.g., "abstract thing of type 'x' is represented syntactically by the
> character string 'y'").  Because they come directly from the parsing of
> the syntax, these things can be defined as completely as necessary to ensure
> consistent interpretation (e.g., we all pretty much agree on what an XML
> element is).  Thus we can agree on things like the SGML property set, the
> DOM, and other abstract or functional interpretations of the syntax of XML
> documents.

That's why it's important to establish which syntax (or semantics) we are
talking about. The syntax and semantics of XML as a language are defined by
the spec. We can check well-formedness and build groves, etc. That is not
the same as saying the syntax or semantics of XML *documents* are fully
defined by that spec. There have been misunderstandings on this list before
where people have been talking about the syntax and semantics of two
entirely different things.

> Syntactic validation is always easy and uncontroversial (to the degree
> that the syntactic rules are understood, which is another issue entirely):

Indeed. Modelling computer languages is much easier than modelling natural
languages :-)

> either the string matches the rules or it doesn't.  This is the *only*
> type of validation that XML and SGML can provide (by which I mean, validation
> against the base rules and any further rules provided by a set of DTD
> declarations, which are, of course, also syntactic rules).
>
> Architectural validation gets you slightly closer to semantic validation
> only because the syntactic rules defined by an architectural DTD are
> controlled independently of the document that claims to conform to them
> so that the author of a document cannot simply change the declarations to
> make their instance conform.

I'm not clear on what "semantic validation" would really mean. In the sense
you are using it, isn't it just conformance to a second *syntax*? To me,
"semantic validation" is closer to truth value. <giraffe>Eliot</giraffe>
might be syntactically valid, but not semantically valid, for example.
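
As a rough illustration, assuming Python's standard xml.etree.ElementTree
parser (the is_well_formed helper is just a name chosen for this sketch), the
syntactic check is entirely mechanical:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<giraffe>Eliot</giraffe>"))  # the syntax is fine
print(is_well_formed("<giraffe>Eliot</girafe>"))   # mismatched end tag
```

The parser accepts the first string without complaint. Whether Eliot is in
fact a giraffe is not a question it can even be asked.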

<aside>
Art joke that just popped into my head: <pipe>This is not a pipe</pipe>
</aside>

[...]
> That is why at the end of the day the only possibly complete and reliable
> form of semantic definition is prose and the only reliable semantic
> validators are other humans.

Agreed. And this is why I don't like stylesheets being spoken of as
expressing semantics.

> It's probably more useful to define clearer categories of
> semantics, such as:
[...]

Agreed, but I wouldn't use the term semantics for presentation.

If you take the XML and CSS fragments:

<HyTimeGuy>Eliot</HyTimeGuy>

HyTimeGuy {color: blue}

Then there is a huge difference between our conceptual notion of the
"HyTimeGuy" element type and the fact that it is displayed in blue (it's one
of the keys to generic markup, right?)
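
A small Python sketch of the same point, with the stylesheets reduced to
plain dictionaries. SCREEN, PRINT and present are hypothetical names, and the
bold rule is an invented second stylesheet, not anything from a real CSS
engine.

```python
import xml.etree.ElementTree as ET

# Two hypothetical stylesheets for the same element type, as plain dicts.
SCREEN = {"HyTimeGuy": {"color": "blue"}}
PRINT = {"HyTimeGuy": {"font-weight": "bold"}}

def present(elem, stylesheet):
    """Pair the element's text with whatever the stylesheet says about its type."""
    return (elem.text, stylesheet.get(elem.tag, {}))

doc = ET.fromstring("<HyTimeGuy>Eliot</HyTimeGuy>")
print(present(doc, SCREEN))
print(present(doc, PRINT))
```

Swapping the stylesheet changes the presentation and leaves the document, and
whatever we take "HyTimeGuy" to mean, untouched.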

It seems to me that the distinction is very similar to that between the
conceptual (semantic) and articulatory (phonological) in Chomskyan
linguistics. A speaker has a concept and articulates it with words that the
listener hears and reconceptualises.

The same thing happens in documents. A particular phrase is conceived by the
author as a technical term. They write it (or it's done by a stylesheet) in
bold. The reader then interprets the bold as meaning a technical term.
Generic markup involves externalising that document between the conceptual
and presentational stages.

And generic markup documents seem to me to be a lot like the deep structure
of an utterance in Chomsky's earlier theories.

Now, the complication seems to come (and I think this is what caused the
disagreement between Jon and me, even though we actually agreed on much of
what the other was saying) in the fact that XML documents aren't just used
for presentation but can be machine read. This is where the line becomes fuzzy.

If an invoice is expressed in XML and printed out, I think that's
presentation, not semantics. Even though the presentation tells you
something about the semantics, it is not what the document means. But if the
invoice is read by machine and causes a payment to be made automatically, I
can see why one would invoke Wittgenstein and call it semantics.

I think of it this way:

In normal publishing:

AUTHOR CONCEPT           [semantics]
 --presentation-->
  DOCUMENT               [presentation]
   --interpretation-->
    READER CONCEPT       [semantics]

With generic markup:

AUTHOR CONCEPT             [semantics]
 --markup-->
  XML DOCUMENT             [syntax]
   --stylesheet-->
    DOCUMENT               [presentation]
     --interpretation-->
      READER CONCEPT       [semantics]

Now, where a machine (or a human, for that matter) directly reads the XML
documents, we have:

AUTHOR CONCEPT             [semantics]
 --markup-->
  XML DOCUMENT             [syntax]
   --processing-->
    MACHINE ACTION         [?semantics]

[of course, machines can generate the documents too, a case I haven't
considered in the above diagrams]
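
The last two diagrams can be sketched in Python using the invoice case. This
is only an illustration: the invoice, payee and amount element names and the
render and process functions are invented here, and no real payment system is
implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice markup; the element names are made up for this sketch.
INVOICE = "<invoice><payee>Eliot</payee><amount>100.00</amount></invoice>"

def render(doc):
    """The stylesheet path: produce something for a human reader to interpret."""
    return "Pay {} the sum of {}".format(doc.findtext("payee"), doc.findtext("amount"))

def process(doc):
    """The machine path: turn the same markup directly into an action record."""
    return {
        "action": "pay",
        "to": doc.findtext("payee"),
        "amount": float(doc.findtext("amount")),
    }

doc = ET.fromstring(INVOICE)
print(render(doc))
print(process(doc))
```

Both functions start from the same syntax. Only the second skips the
presentation stage, which is where the question mark in the diagram comes
from.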

> The ultimate problem of course is that Humpty Dumpty rules in
> XML land: any element can mean exactly what you want it to mean,
> regardless of what the original author intended it to mean.  That is the
> beauty of generalized markup and the curse of generalized markup.

Put another way, labels (eg element type names) are arbitrary. I can use
<giraffe>..</giraffe> to mark up people if I want to.

> Embrace the chaos.

Invoke Saussure

James

--
James Tauber / jtauber at jtauber.com      http://www.jtauber.com/
Perth, Western Australia                http://www.xmlinfo.com/


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/