When to use attributes vs. elements

Mon Feb 8 12:15:45 GMT 1999

On Fri, 5 Feb 1999, Andrew Layman wrote:

> Thank you.  Dan asks a reasonable question, which is whether a document that
> uses the conventions described in
> http://www.w3.org/TandS/QL/QL98/pp/microsoft-serializing.html needs to
> signal somehow that these conventions are in play.
> 
> In case of the "canonical format" I proposed, however, I don't think special
> signalling is necessary: The proposal does not add any new interpretations
> to the use of elements or attributes beyond what can be described in a DTD
> or a schema such as XML-Data or DCD.  Elements, attributes, ids and idrefs
> are carefully used so that their normal XML interpretation matches the
> scoping and linking rules of object graphs or relational databases.

So, to be clear on what you're claiming... For any chunk of 'normal' 
XML, you have a set of interpretation rules that tell us how all the attributes
and elements map into "graphs of data such as database tables and relations,
nodes and edges from directed labeled graphs, and similar
constructions"[1]. This would be enormously useful, if people could be
persuaded it were true. 

> In a general case, if conventions add rules for interpretation above what is
> in the structure of a document or above what can be expressed in a DTD, then
> this would need to be somehow signalled in order for a reader to process the
> document.  

I'm a little confused in that [1] proposes a canonical framework for
interpreting all XML as graph serialisations, but then goes on to
discuss "Mapping Abbreviated Syntax to Canonical Syntax":

	However, the canonical syntax is not the only syntax that could be used
	to serialize a graph. In many cases, alternative syntaxes may be
	used, either due to historical or political factors, or to take
	advantage of compressions that are available if one has
	domain knowledge. We call all of these "abbreviated syntaxes."[1]

This implies that some unknown subset of XML instance data will have
been serialised according to one or more alternate serialisation
algorithms. Consequently de-serialising such data according to the
'canonical' algorithm will garble your data. In which case we're back in
a situation where we need a mechanism such as
<XYZ:SerializationAccordingToAndrew> to tell us which data can be
interpreted according to the 'canonical' rules versus some alternate
(possibly unknown) serialisation rules.

The example alternate serialisation given is:

	<Class>
	  <name>Western Civilization</name>
	  <taughtBy>Thorsten</taughtBy>
	  <attendedBy>Raphael</attendedBy>
	  <attendedBy>Smith</attendedBy>
	</Class>

Interpreting this according to the "Procedure for XML Instance to Graph
Conversion" rule will give garbage data. We simply don't know from
looking at the XML above what nodes and edges it creates. The fact that
we need to treat such data in a special manner is worrying: how are we
supposed to _know_ when there is something else to know? 
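
To make the worry concrete, here is a rough Python sketch. The two
reading conventions below are hypothetical and invented purely for
illustration; neither is the actual procedure from [1]. The names
read_as_literals, read_as_references and ref_tags are likewise made up.
The point is only that the same instance yields different graphs
depending on knowledge that is not present in the instance itself.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<Class>
  <name>Western Civilization</name>
  <taughtBy>Thorsten</taughtBy>
  <attendedBy>Raphael</attendedBy>
  <attendedBy>Smith</attendedBy>
</Class>
"""


def read_as_literals(root):
    """Hypothetical reading 1: every child element is an edge to a string literal."""
    return {
        (root.tag, child.tag, ("literal", child.text.strip()))
        for child in root
    }


def read_as_references(root, ref_tags=("taughtBy", "attendedBy")):
    """Hypothetical reading 2: children listed in ref_tags point at other nodes.

    ref_tags is domain knowledge supplied from outside the document;
    nothing in the XML instance says which tags belong in it.
    """
    edges = set()
    for child in root:
        kind = "node" if child.tag in ref_tags else "literal"
        edges.add((root.tag, child.tag, (kind, child.text.strip())))
    return edges


root = ET.fromstring(FRAGMENT)
literal_graph = read_as_literals(root)
reference_graph = read_as_references(root)

# Same bytes in, two different edge sets out.
for edge in sorted(literal_graph ^ reference_graph):
    print(edge)
```

Under the first reading "Thorsten" is just a string; under the second it
names a node that other edges could attach to. A processor has no way to
choose between them from the markup alone.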

(repeated from above)
> In a general case, if conventions add rules for interpretation above what is
> in the structure of a document or above what can be expressed in a DTD, then
> this would need to be somehow signalled in order for a reader to process the
> document.  

This suggests that the burden is placed upon content creators to flag up
when the generic 'canonical' rule wouldn't usefully apply to the
interpretation of the XML content. So the default behaviour would be to
assume everyone used the rules outlined in [1] unless associated schema,
stylesheet or enclosing tags told us otherwise?

So...  if I'm a 'canonical-format' aware processor building a graph from
XML data acquired from a variety of sources, what procedure do I follow
to sort XML instance data into the following categories:

a) old XML files which *happen* to have been serialised according to 
   the canonical-format rules

b) old XML files which happen *not* to have been serialised according to
   the canonical-format rules. (for example, the extract above)

c) recent XML files created by following the c-f rules for serialising
   graphs

d) recent XML files created using an alternative or abbreviated graph
   serialisation algorithm as discussed in [1]

In particular, I'm concerned that (a) and (b) are mechanically
indistinguishable.
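
As a sketch of what such a sorting procedure could look like if there
were an explicit signal: the attribute name "graph-serialization" below
is entirely my invention (nothing in [1] defines it), as is the
classify function. Without some marker of this kind, the best a
processor can report for unmarked data is "don't know".

```python
import xml.etree.ElementTree as ET

# Hypothetical signalling attribute on the root element; not defined in [1].
SIGNAL = "graph-serialization"


def classify(xml_text):
    """Sort an XML instance into the categories (a)-(d) as far as possible."""
    root = ET.fromstring(xml_text)
    declared = root.get(SIGNAL)
    if declared == "canonical":
        return "c"       # explicitly produced under the canonical-format rules
    if declared is not None:
        return "d"       # explicitly produced under some other serialisation
    return "a-or-b"      # no signal: canonical by accident, or not at all


old_file = "<Class><name>Western Civilization</name></Class>"
new_canonical = '<Class graph-serialization="canonical"><name>x</name></Class>'
new_abbreviated = '<Class graph-serialization="abbreviated"><name>x</name></Class>'

print(classify(old_file))         # a-or-b
print(classify(new_canonical))    # c
print(classify(new_abbreviated))  # d
```

Note that the function never returns "a" or "b" on their own, which is
exactly the problem: legacy data carries no marker to test.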

Dan

[1] http://www.w3.org/TandS/QL/QL98/pp/microsoft-serializing.html 

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1