Content v. attribute

Mon Oct 26 14:47:04 GMT 1998

Re:
----------
So far, I've come up with this structure:
<METADATA
     TITLE="Title of the data"
     AUTHOR="J.J. Vrijland"
     DATE="26 October 1998">
</METADATA>
----------

I can think of several reasons why it's probably not
a good idea to try to model the "title" of an authored
work as an XML (SGML) attribute, given that the datatype
of XML's "attribute" is just (flat) 'string'.  In particular,
machine processing of "title" information should be
sensitive to the languages present in a work title, which is
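most easily given in a "language" attribute (xml:lang, in XML).
But an attribute value, being a flat string, cannot itself carry
such an attribute or any nested markup.  So, for example, think
about the markup for these titles: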

Comentarios al "Mein kampf"
Eclaircissements sur Mein Kampf: la doctrine d'Adolf Hitler
Hitler's Mein Kampf in Britain and America. A publishing history, 1930-39

These are "real titles" and, depending upon the language of
discourse in the broader work which references these works,
one would need two or three levels of nesting to capture the
fact that 'Mein kampf' is a title, in German, and that the
embedding string is a title, in some other language, and that
the larger discourse unit is XXX, in some third language.  This
is not a rare event - multilingualism "happens" in the
real world millions of times a day.
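
To make the nesting concrete, here is a rough sketch of the first
title modeled as element content rather than as an attribute.  The
element names (bibl, title) are hypothetical, chosen only for
illustration, and the choice of English for the outer discourse is
an assumption; xml:lang is the one name taken from the XML 1.0
Recommendation.

```xml
<!-- Hypothetical element names; only xml:lang is defined by XML 1.0. -->
<!-- Level 1: the surrounding discourse, assumed here to be English.  -->
<bibl xml:lang="en">See also
  <!-- Level 2: the citing work's title, in Spanish.                  -->
  <title xml:lang="es">Comentarios al "<!--
    Level 3: the embedded title, in German.
    --><title xml:lang="de">Mein kampf</title>"</title>.
</bibl>
```

With TITLE="..." as an attribute, the same string can be stored,
but the inner German title has nowhere to hang its own language
tag, short of inventing a private convention inside the string.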

In my personal judgment, it also makes no sense to model
a "title" of a work (a document, a "chapter," a "section"
or whatever) as "metadata."  By whose definition of
"metadata"?  It's pretty difficult to find a definition
of metadata that will hold up, especially in terms of
articulating diagnostic/distinguishing features of
"metadata" vis-a-vis "data" which determine the best
modeling construct in SGML/XML.  Whether you want the
information to be "presented in the view" should not
be relevant, since XML encoding should be free of
assumptions about processing-level semantics.  Stylesheet
controls should dictate whether/how some information is
presented or suppressed in a particular view.  Whether
the (rest of) "content" *could be understood* without
reference to the candidate information in question is
equally unhelpful: a novel is "understandable" without
the volume title and chapter titles, and it would be
understandable with every 12th word removed (if a bit
rough in places) as well.

To the extent that consciously introducing a distinction
between "data" and "metadata" into document encoding
means hard-coding a perspective, it may be a bad idea.
We already have enough problems dealing with the fact
that a particular (privileged) hierarchical modeling
of a problem domain introduces a certain distortion
(a selected analytical perspective) of the problem domain.
Neutral encoding in search of data independence would
try to eliminate these particularized interpretations
from the encoding model.

And finally, to wit (credit: Steve DeRose): "your 'metadata'
is always someone else's 'data.'"  I would add: what you
think is 'metadata' today will be your 'data'
tomorrow - you'll probably be sorry that you modeled the
distinction in markup.

Just my 2 cents... many will disagree, of course.

rcc

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/