Why XML data typing is hard

Joel Bender joel at spooky.emcs.cornell.edu
Fri Dec 4 17:47:35 GMT 1998

James Robertson wrote:

>   | >     <prop name="state" xml:regexp="[A-Z]+">NY</prop>
>   |
>   | It's a neat way of doing it, since checking is optional and
>   | transparent to non-checking applications.
> Wouldn't this be better placed in a DTD?
> By adding a fixed, pre-set attribute with the regexp to
> element definitions in the DTD, you can enforce consistency.

Absolutely.  Now, how would you specify that some pattern should be used
for the contents of a particular element (or attribute)?

> Otherwise, can't the user just choose to use this or not,
> on an individual, ad-hoc basis?

Yeah, not such a good thing.

> All that being said, I am of the belief that all of
> this should be placed in application code.

How about a new layer between XML and the application?  The layer could
(a) filter SAX-like 'messages' (function/procedure calls) on their way from
the parser to the application, applying patterns to data as necessary and
generating new messages, or (b) be applied to a DOM implementation to check
the validity of a document more completely.
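
Option (a) can be sketched roughly as follows, using Python's standard
xml.sax module.  This is only an illustration under some assumptions: the
class name RegexpCheckFilter and the helper check() are made up for this
example, and the rule that an element's own character data must match its
xml:regexp attribute in full is one guess at the intended semantics of the
inline attribute quoted above.

```python
import io
import re
import xml.sax
from xml.sax.saxutils import XMLFilterBase


class RegexpCheckFilter(XMLFilterBase):
    """Sits between the parser and the application's ContentHandler.

    Assumed behaviour: when an element carries an xml:regexp attribute,
    its own character content must match that pattern in full.
    """

    def __init__(self, parent=None):
        super().__init__(parent)
        self._stack = []   # one [pattern_or_None, text_chunks] per open element
        self.errors = []   # (element name, text, pattern) for each failure

    def startElement(self, name, attrs):
        self._stack.append([attrs.get("xml:regexp"), []])
        super().startElement(name, attrs)   # pass the event through unchanged

    def characters(self, content):
        if self._stack:
            self._stack[-1][1].append(content)
        super().characters(content)

    def endElement(self, name):
        pattern, chunks = self._stack.pop()
        text = "".join(chunks)
        if pattern is not None and re.fullmatch(pattern, text) is None:
            self.errors.append((name, text, pattern))
        super().endElement(name)


def check(xml_text, handler=None):
    """Parse xml_text through the filter and return the list of failures."""
    checker = RegexpCheckFilter(xml.sax.make_parser())
    checker.setContentHandler(handler or xml.sax.ContentHandler())
    checker.parse(io.BytesIO(xml_text.encode("utf-8")))
    return checker.errors
```

An application that doesn't care about checking installs its handler on
the parser directly and never sees the filter; one that does care reads
the errors list after parsing (or the filter could raise instead).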

I'm guessing here, but perhaps there's something we could specify (in XML
of course) that provides this validity information.  As a separate document
with its own structure, it could be developed on its own track, so as not
to add more cruft to XML or extra checking for applications that don't
need it.

> XML isn't a solution to any problem, it is a storage and
> interchange format for applications ...

Well, in a sense, yes.  All of the interchange formats that I've dealt with
(we call them protocols :-) don't stop at the structure of the message but
also specify what is acceptable content.

> Why try to cram the entire world of computing
> science into XML?

I wasn't going to try and do that in this thread... :-)

Ketil Z Malde wrote:

> > No, not specific to a language mapping, that belongs in some API or SAX
> > reference not in XML.
> That's what I meant (I think).  It would make SAX a whole lot
> more complex, though, if it has to understand e.g. standardised
> dates, and return some kind of date object (or struct) when it
> encounters one.

OK, so keep it out of SAX.  The work of translating "1.5" into 1.5 has to
get done someplace, is done in a very similar way by lots of applications,
and seems ripe for standardization.  IMHO, it doesn't seem like a huge leap
to go from XML documents that contain just text to ones that contain atomic
types (boolean, integer, float for starters).
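
To make the "1.5" into 1.5 step concrete, here is a minimal sketch in
Python.  The type names, the CONVERTERS table, and the choice of "true"
and "false" as the only boolean literals are all assumptions for the sake
of the example, not anything defined by XML or SAX.

```python
def to_boolean(text):
    """Accept only the literals 'true' and 'false' (an assumed lexical form)."""
    literals = {"true": True, "false": False}
    try:
        return literals[text.strip()]
    except KeyError:
        raise ValueError(f"not a boolean: {text!r}") from None


# Hypothetical type names mapped to converters; int() and float() already
# raise ValueError on text they cannot parse.
CONVERTERS = {"boolean": to_boolean, "integer": int, "float": float}


def convert(type_name, text):
    """Turn element or attribute text into a native value of the named type."""
    return CONVERTERS[type_name](text)
```

Note that int() and float() follow Python's own lexical rules (float()
accepts "inf" and "nan", for instance), which is exactly the sort of
per-language variation a shared definition of the atomic types would pin
down.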

I assume that date formats have even more variations than numbers, at least
until there is agreement on a stardate!  So stick with simple things like
binding a regexp pattern to content.  There will be debates about a date
being an atomic type or a structure (I tend to think of them as integers
with a really bad number base).  There shouldn't be any need for structure
parsing because structures will already be described by the XML document
being parsed.

> I would have thought it would be simple, but then again,
> I'm culturally biased, and hadn't read the Unicode regexp
> document. Oh horror!

:-)  It looks 'hard', but there doesn't seem to be any more real
complexity than what has already been solved by some very talented
folks.  In particular it is written...

> (Regular expression syntax varies widely: the issues discussed
> here would need to be adapted to the syntax of the particular
> implementation.)

This pattern definition/association document (this beast needs a name!) can
make all that hand-wringing and the "levels of support" go away.  No need
for funky escape characters, escaped escape characters, misinterpretation
of parens, brackets, braces, stars...gag!

Here is a start...

	<set id="letter"/>              <!-- members left unspecified -->
	<set id="digit"/>               <!-- members left unspecified -->
	<set id="special">
	    _$                          <!-- $ is a VMS thing -->
	</set>
	<set id="namechar">
	    <set idref="letter"/>
	    <set idref="special"/>
	</set>

	<token id="name">
	    <set idref="namechar"/>
	    <group optional="1" repeatable="1" disjunction="1">
	        <set idref="namechar"/>
	        <set idref="digit"/>
	    </group>
	</token>

	<pattern id="namevalue">
	    <token idref="name"/>
	    <s ignore="1"/>             <!-- 's' is whitespace -->
	    <token idref="AttValue"/>   <!-- from XML spec -->
	</pattern>
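
To show that definitions like these could be turned into an ordinary
regular expression mechanically, here is a rough Python sketch covering
just the sets and the "name" token.  Everything about the semantics is an
assumption: that a set's text is a list of literal characters, that the
children of a token form a sequence, that optional plus repeatable means
"zero or more", and the members given to "letter" and "digit" (the sketch
above leaves them open).  The function name compile_token and the <defs>
wrapper element are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# A well-formed rendering of part of the sketch. The members of "letter" and
# "digit" are assumptions; the original leaves them unspecified.
DEFINITIONS = """
<defs>
  <set id="letter">abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ</set>
  <set id="digit">0123456789</set>
  <set id="special">_$</set>
  <set id="namechar"><set idref="letter"/><set idref="special"/></set>
  <token id="name">
    <set idref="namechar"/>
    <group optional="1" repeatable="1" disjunction="1">
      <set idref="namechar"/>
      <set idref="digit"/>
    </group>
  </token>
</defs>
"""


def compile_token(xml_text, token_id):
    """Build a compiled regular expression for the token with the given id."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}

    def chars(el):
        """Collect the literal characters a <set> stands for."""
        if el.get("idref"):
            return chars(by_id[el.get("idref")])
        own = "".join((el.text or "").split())   # drop layout whitespace
        return own + "".join(chars(child) for child in el)

    def regex(el):
        if el.get("idref"):
            return regex(by_id[el.get("idref")])
        if el.tag == "set":
            return "[" + re.escape(chars(el)) + "]"
        parts = [regex(child) for child in el]
        if el.tag == "group":
            joiner = "|" if el.get("disjunction") == "1" else ""
            optional = el.get("optional") == "1"
            repeatable = el.get("repeatable") == "1"
            suffix = {(True, True): "*", (True, False): "?",
                      (False, True): "+", (False, False): ""}[optional, repeatable]
            return "(?:" + joiner.join(parts) + ")" + suffix
        return "".join(parts)   # <token>: children in sequence

    return re.compile(regex(by_id[token_id]))
```

With these assumptions, compile_token(DEFINITIONS, "name") yields a
pattern that accepts names such as SYS$LOGIN and rejects ones that start
with a digit, and the escaping is done once by the compiler rather than
by whoever writes the definitions.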

BTW, if I'm not using 'id' and 'idref' correctly, please forgive me, I'm
still very new at this!  I'd be happy to take more discussion off-line if
it doesn't belong in xml-dev.  In the meantime I'll draft a DTD of this
for feedback.


