Why XML data typing is hard (was Re: Internal subset equivalent in new schema proposals?)

Ketil Z Malde ketil at ii.uib.no
Mon Nov 30 08:03:12 GMT 1998


"G. Ken Holman" <gkholman at canadamail.com> writes:

>> but not
>>	<value>4,50</value>

> Then your example proposed range of values is inappropriate because "4,50"
> is a valid float from an I18N point of view.

I want to specify, in my DTD, what kind of data my processing system
knows how to deal with.  Apparently, in my example, the processing
system does not know how to deal with commas in "value"s.  This may or 
may not be an inadequacy with respect to I18N.

Who said anything about "float"?

> In Canada

Yes, yes.  Here too we use commas as decimal ..uh.. points.

> And I suppose your regular expression example could be changed to 

> 	<!element value #REGEXP:"-?[0-9]*(\.|,)[0-9][0-9]">

(whoops, forgot to escape the dot, didn't I)
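
For concreteness, a minimal sketch of what the pattern with the escaped
dot accepts, using Python's re module.  The #REGEXP notation is only
the syntax floated in this thread, not part of any XML specification,
and the names VALUE_FORM and matches_form are made up for the
illustration.

```python
import re

# The pattern from the quoted declaration, with the dot escaped.
# VALUE_FORM is a hypothetical name used only in this sketch.
VALUE_FORM = re.compile(r"-?[0-9]*(\.|,)[0-9][0-9]")

def matches_form(text):
    """Return True if the whole string conforms to the pattern."""
    return VALUE_FORM.fullmatch(text) is not None

# With an unescaped dot, "4x50" would also have been accepted.
for sample in ["4.50", "4,50", "-.75", "4.5", "4x50"]:
    print(sample, matches_form(sample))
```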

> I gather from Michael S-McQ in a presentation in Chicago that the regular
> expression for a valid date (taking into account days of the month and leap
> years) is 4801 characters long.

Yes.  Some may want to build all of this into a type system that XML
parsers need to handle, I suppose with mappings to the various
programming languages and machine architectures that may or may not
support that type natively.  As an alternative, I suggested
restricting the *form* instead of the *type* of the content, since 

	a) it's a *lot* simpler to implement, almost trivial
	b) gives the application a clear indication of what data it
	needs to understand
	c) catches errors in data early, avoiding potential run-time
	errors (Y2K?)
	d) avoids a lot of complexity that you probably don't need in
	90% of the cases
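
To make (a) concrete, here is a minimal sketch of a form check, assuming
the patterns have already been read from the DTD into a table.  The
table FORMS, the function check_forms and the element names are all
hypothetical; no existing parser offers this.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical table: element name -> required form of its text content.
FORMS = {
    "value": re.compile(r"-?[0-9]*\.[0-9][0-9]"),
    "date": re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}"),
}

def check_forms(xml_text):
    """Return (element name, text) pairs whose content violates its form."""
    errors = []
    for elem in ET.fromstring(xml_text).iter():
        pattern = FORMS.get(elem.tag)
        if pattern is None:
            continue  # no form declared for this element
        text = (elem.text or "").strip()
        if pattern.fullmatch(text) is None:
            errors.append((elem.tag, text))
    return errors

doc = ("<order><value>4.50</value><value>4,50</value>"
       "<date>01/02/03</date></order>")
print(check_forms(doc))
```

The check only reports content that does not have the declared form;
it makes no attempt to convert anything into a native type.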

Look at the date example.  First you need to embed your 4801 character 
regular expression into parsers that understand xml:type="date".
Then you need the parser to provide something useful, a "struct date", 
a time_t or perhaps a reasonable s-expression, or perhaps some
machine-specific stuff on your embedded system.  And *then* you worry about
what to do when people type "01/02/03". 

Alternatively, you could force people to use "YYYY-MM-DD" by forcing
conformance to a regular expression, and have your applications only
have to deal with that.
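
A minimal sketch of that division of labour, assuming Python on the
application side.  DATE_FORM and parse_date are hypothetical names.
The form is checked with a short pattern, and the calendar rules (days
per month, leap years) are left to whatever date type the application
already has, here datetime.date.

```python
import re
from datetime import date

# The only form the application has to understand: YYYY-MM-DD.
DATE_FORM = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

def parse_date(text):
    """Convert a YYYY-MM-DD string to a date (hypothetical helper)."""
    m = DATE_FORM.fullmatch(text)
    if m is None:
        raise ValueError("not in YYYY-MM-DD form: %r" % text)
    year, month, day = (int(g) for g in m.groups())
    # date() itself raises ValueError for impossible dates
    # such as 1998-02-30, so the pattern can stay short.
    return date(year, month, day)

print(parse_date("1998-11-30"))
```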

And, I think it's pretty obvious that there are a lot of very complex
data types out there.  What's the format for version numbers, for
instance?  Or license plates?  Are you ready to come up with
an xml:type that covers all cases?

(And I bet the 4801-definition doesn't even cover Chinese or Mayan
calendars, or deal correctly with Muslim dates, or seamlessly
integrate Julian and Gregorian.)

>> What would the point of using xml:type be?  

> Perhaps to abstract what is being expressed in markup to allow different
> lexical expressions of the same value to be considered valid.

My point is that you cannot do that without also providing a correct
translation from the lexical expressions of that data type into a
native representation of that data type.  And that translation may not 
make a lot of sense: there are architectures and languages without
concepts such as "float", for instance.
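
As a rough illustration of the translation step, here is a sketch in
Python that maps both lexical forms from the example to one value, and
then to integer cents for a target without any "float".  The helpers
to_decimal and to_cents are hypothetical, and the sketch assumes two
decimal places and no thousands separators.

```python
from decimal import Decimal

def to_decimal(text):
    """Map '4.50' or '4,50' to a single Decimal value (hypothetical helper).

    Assumes the comma is a decimal separator, never a thousands separator.
    """
    return Decimal(text.replace(",", "."))

def to_cents(text):
    """The same lexical forms, for a target that only has integers."""
    return int(to_decimal(text) * 100)

print(to_decimal("4,50") == to_decimal("4.50"))
print(to_cents("4,50"))
```

Each accepted lexical form and each target representation adds another
such mapping that somebody has to specify and get right.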

Having all documents be universally understandable and unambiguous is a
laudable goal, of course.  But I don't see it happening.

Sorry to be so negative, but at least I didn't mention how I think XML 
is going to destroy the WWW.  Whoops. :-)

~kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants




