Strong Typing in SGML and XML

Peter Murray-Rust Peter at
Wed May 7 09:37:18 BST 1997

In message <199705070346.UAA02957 at> "Eric Albright" writes:
> First, I'd like to concur with the need for a formal specification for data
> typing.
> I had hoped that HyTime's lextype feature would be sufficient. I for one
> would like to hear from the HyTime experts about how they would implement
> the parallel data typing. -- No use reinventing any standard. It may only
> need simplifying and explaining.
> Having said that, I ask when is strong data typing necessary? As far as I
> can tell there is only one place where it is useful -- when the document is
> being created or altered. There will always be data validation that cannot

You may all regard this as poor design, but CML requires the documents to
carry the data types.  To avoid increasingly complex content models, CML
has only two elements to carry typed data: XVAR (a scalar) and ARRAY
(which carries large numbers of XVARs).  An ARRAY looks like

<ARRAY TYPE="FLOAT" SIZE="3">1.2 2.3 3.4</ARRAY>

(remember that some arrays can run to several powers of 10 in size).  At
present CML uses 4 types (others are obsolete): STRING, FLOAT, INTEGER and
DATE.  I agree that in principle I can convert to <INTEGER>, <FLOATARRAY>
and so on, but it makes things more complex (and the current processing
software would have to be rewritten).  However, if we are working towards
re-usable components and the whole of the XML community says they like
(say) 4 unique types, then in the interests of interoperability I would be
shouting for that.  If they prefer to type their variables by attribute,
I'll shout for that.  Neither is trivial to process.
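
To make the attribute-typed approach concrete, here is a rough Python sketch
of how an application might read an ARRAY like the one above.  It is only an
illustration, not the current CML software: the function name, the converter
table and the lexical form assumed for DATE (YYYY-MM-DD) are all my own
assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical converters for the four types named above. The DATE format
# (ISO 8601, YYYY-MM-DD) is an assumption, not something CML specifies here.
CONVERTERS = {
    "STRING": str,
    "FLOAT": float,
    "INTEGER": int,
    "DATE": date.fromisoformat,
}

def parse_array(xml_text):
    """Return the typed values carried by one ARRAY element."""
    elem = ET.fromstring(xml_text)
    if elem.tag != "ARRAY":
        raise ValueError("expected ARRAY, got " + elem.tag)
    type_name = elem.get("TYPE")
    if type_name not in CONVERTERS:
        raise ValueError("unknown TYPE: %r" % (type_name,))
    convert = CONVERTERS[type_name]
    # Values are whitespace-separated tokens in the element content.
    values = [convert(token) for token in (elem.text or "").split()]
    # If SIZE is present, check it against the number of values found.
    size = elem.get("SIZE")
    if size is not None and int(size) != len(values):
        raise ValueError("SIZE says %s but found %d values" % (size, len(values)))
    return values

print(parse_array('<ARRAY TYPE="FLOAT" SIZE="3">1.2 2.3 3.4</ARRAY>'))
# prints [1.2, 2.3, 3.4]
```

Typing by element name (<FLOATARRAY> and so on) would move the lookup from
the TYPE attribute to the tag, but the conversion step would be much the same.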
> be handled by data typing and as such must be delegated to a validating
> application or a human. e.g.
> As for comments about the proposal:
> I would like to see a simplified version of the data types. It is very
> important for databases to know the exact size in bytes that a data element
> will occupy. SGML/XML deals with a character string and therefore does not
> care. More important to me are the constraints on the data implicit by a
> given type. I think we need to determine the types of constraints that each
> data type requires and allow for the maximum flexibility without
> sacrificing precision.

I understand the force of your argument.  For both your requirement and mine,
the question is 'should XML support this, or is it up to the "application"?'.
Personally I am in favour of XML steering people towards a common way of
doing things, whether in the spec or through Generally Accepted Conventions.
> As far as I can tell, there are three basic types--character, numeric, and
> temporal. Each type requires its own unique constraints:
> CHARACTER - an alphabet, length constraint, content constraint (regular
> expressions)
> NUMERIC - a maximum value, a minimum value, some type of rounding/precision

Some people will feel that the INTEGER/FLOAT distinction is important.  I think
I can live without it.

> TEMPORAL - a maximum value, minimum value, (the maximum and minimum values
> may be constrained in relation to the current value), some type of
> rounding/precision
> I think that the CHARACTER data type should be able to specify the alphabet
> and length constraint within the content constraint. However some

Again, I keep asking the XML community where these constraints are
applied: editor (obviously), parser (??), application

> modification to the standard regular expression writing would be necessary.
> I for one do not want to have to type
> \([0-9][0-9][0-9]\)[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] for a phone number.
> Perhaps \([0-9](+3)\)[0-9](+3)-[0-9](+4) would be better.
> To allow maximum flexibility and precision for numeric values, we should be
> able to specify the form (roman/arabic) and a base. The rounding allows us
> to constrain the significant digits to some factor of the base. A rounding
> type would be needed for the greatest flexibility (round/ceiling/floor).
> Temporal values can specify either an instant of time or an extent of time.
> They should also be able to be rounded. When an instant is rounded, the
> significant digits are to the left; when an extent is rounded, the
> significant digits are to the right. To signify that an instant is precise
> to the nearest five years, it would be rounded to 0005/00/00 00:00:00. To
> signify that an extent is precise to the nearest tenth of a second, it
> would be rounded by 0000/00/00 00:00:00.1 .

I assume this must be a frequently solved problem and we shouldn't try to
reinvent it.  If someone more knowledgeable than me says 'use the FOO
approach', I'll probably buy it if it's stable and implementable.
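
As a strawman only, the numeric constraints quoted above (a minimum, a
maximum, and a rounding type of round/ceiling/floor) might be checked along
the following lines in Python.  The function name and parameters are
hypothetical, precision is limited to a number of decimal places, and
treating "round" as round-half-even is my assumption.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_EVEN

# Map the three rounding types in the quoted proposal onto decimal modes.
# Using half-even for "round" is an assumption.
ROUNDING = {
    "round": ROUND_HALF_EVEN,
    "ceiling": ROUND_CEILING,
    "floor": ROUND_FLOOR,
}

def constrain(text, minimum, maximum, places, mode="round"):
    """Parse text as a decimal number, round it to `places` decimal
    places using `mode`, then check it lies within [minimum, maximum]."""
    quantum = Decimal(1).scaleb(-places)  # e.g. places=1 gives 0.1
    value = Decimal(text).quantize(quantum, rounding=ROUNDING[mode])
    if not (Decimal(minimum) <= value <= Decimal(maximum)):
        raise ValueError("%s is outside [%s, %s]" % (value, minimum, maximum))
    return value

print(constrain("2.34", "0", "10", 1, "ceiling"))  # prints 2.4
print(constrain("2.34", "0", "10", 1, "floor"))    # prints 2.3
```

This says nothing about bases other than 10, roman numerals, or the temporal
case, all of which would need more thought.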

>                                            -- up to 20 repetitions of

There has been a regular and repeated cry for regular expressions.  If 
someone comes up with one that is available, I'll buy it.  Surely one of the
very many readers of this list is authoritative about this?
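
For what it is worth, the interval notation found in POSIX extended regular
expressions and in most scripting languages already shortens the phone-number
pattern quoted above.  A small Python sketch, assuming the parentheses are
meant literally (the names PHONE and is_phone are mine, purely for
illustration):

```python
import re

# {n} means "exactly n repetitions", replacing the repeated [0-9] classes.
# The backslash-escaped parentheses match literal "(" and ")".
PHONE = re.compile(r"\([0-9]{3}\)[0-9]{3}-[0-9]{4}")

def is_phone(text):
    """Return True if the whole string matches the phone-number pattern."""
    return PHONE.fullmatch(text) is not None

print(is_phone("(555)123-4567"))  # prints True
print(is_phone("(555)123-456"))   # prints False
```

Whether XML should adopt such a syntax, and where it would be enforced, is
the open question.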

This is a very critical discussion for me, and I expect for others, and it
shows some of the new things that XML will be used for.


Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences

xml-dev: A list for W3C XML Developers