Strong Typing in SGML and XML
Eric Albright
eric_albright at sprynet.com
Wed May 7 05:46:25 BST 1997
First, I'd like to concur with the need for a formal specification for data
typing.
I had hoped that HyTime's lextype feature would be sufficient. I, for one,
would like to hear from the HyTime experts about how they would implement
the parallel data typing. There is no use reinventing a standard; it may
only need simplifying and explaining.
Having said that, when is strong data typing actually necessary? As far as
I can tell, there is only one place where it is useful: when the document
is being created or altered. There will always be data validation that
data typing cannot handle and that must be delegated to a validating
application or a human. For example, no type check will notice that the
first and last names are swapped here:
<NAME><FIRST>Albright</FIRST><LAST>Eric</LAST></NAME>
As for comments about the proposal:
I would like to see a simplified version of the data types. It is very
important for databases to know the exact size in bytes that a data element
will occupy, but SGML/XML deals with character strings and therefore does
not care. More important to me are the constraints on the data implied by a
given type. I think we need to determine the kinds of constraints that each
data type requires and allow for maximum flexibility without sacrificing
precision.
As far as I can tell, there are three basic types--character, numeric, and
temporal. Each type requires its own unique constraints:
CHARACTER - an alphabet, length constraint, content constraint (regular
expressions)
NUMERIC - a maximum value, a minimum value, some type of rounding/precision
TEMPORAL - a maximum value, a minimum value (both may be constrained in
relation to the current value), some type of rounding/precision
I think that the CHARACTER data type should be able to specify the alphabet
and length constraint within the content constraint. However, some
modification to standard regular expression notation would be necessary.
I, for one, do not want to have to type
\([0-9][0-9][0-9]\)[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] for a phone number.
Perhaps \([0-9](+3)\)[0-9](+3)-[0-9](+4) would be better.
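To make the shorthand concrete, here is a minimal Python sketch that rewrites it into ordinary regular-expression quantifiers. It assumes that (+n) means "exactly n repetitions" (which is what the phone-number equivalence above implies) and that (*n) means "up to n repetitions" (as used in the airline example later in this message). The function name expand_shorthand is hypothetical and not part of any proposal.

```python
import re

def expand_shorthand(pattern: str) -> str:
    """Rewrite the proposed (+n) and (*n) repetition shorthand as {n} and {0,n}.

    Assumption: (+n) = exactly n repetitions, (*n) = zero to n repetitions.
    """
    pattern = re.sub(r"\(\+(\d+)\)", r"{\1}", pattern)
    pattern = re.sub(r"\(\*(\d+)\)", r"{0,\1}", pattern)
    return pattern

# The phone-number pattern from the message, in the proposed shorthand.
phone = expand_shorthand(r"\([0-9](+3)\)[0-9](+3)-[0-9](+4)")

# fullmatch requires the whole element content to satisfy the constraint.
is_valid = re.fullmatch(phone, "(555)123-4567") is not None
```

A real processor would probably parse the shorthand properly instead of using textual substitution, which could misfire on patterns that contain an escaped parenthesis followed by "+" or "*".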
To allow maximum flexibility and precision for numeric values, we should be
able to specify the form (roman/arabic) and a base. The rounding allows us
to constrain the significant digits to some factor of the base. A rounding
type would be needed for the greatest flexibility (round/ceiling/floor).
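As an illustration of the three rounding types, the following Python sketch rounds a decimal value to a given increment. It assumes arabic, base-10 values, and it assumes that "round" means round-half-up; the message does not say how ties should be handled. The names round_to and METHODS are hypothetical.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP

# Assumption: "round" is round-half-up; the proposal leaves tie-breaking open.
METHODS = {
    "round": ROUND_HALF_UP,
    "ceiling": ROUND_CEILING,
    "floor": ROUND_FLOOR,
}

def round_to(value: str, roundto: str, method: str = "round") -> Decimal:
    """Round value to a whole multiple of roundto using the named method."""
    step = Decimal(roundto)
    # Count how many steps fit, round that count to an integer, scale back.
    units = (Decimal(value) / step).to_integral_value(rounding=METHODS[method])
    return units * step

nearest_cent = round_to("12.345", "0.01")        # Decimal('12.35')
floor_cent = round_to("12.345", "0.01", "floor") # Decimal('12.34')
```

Decimal is used instead of float so that increments such as 0.01 are represented exactly.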
Temporal values can specify either an instant of time or an extent of time.
They should also be able to be rounded. When an instant is rounded, the
significant digits are to the left; when an extent is rounded, the
significant digits are to the right. To signify that an instant is precise
to the nearest five years, it would be rounded to 0005/00/00 00:00:00. To
signify that an extent is precise to the nearest tenth of a second, it
would be rounded to 0000/00/00 00:00:00.1.
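A rough Python sketch of temporal rounding follows. It is only an approximation of the idea: it parses a rounding value in the YYYY/MM/DD hh:mm:ss form used above, but it handles only fixed-length units (days and smaller), because years and months vary in length. The five-year example above is therefore out of its reach. The function names are hypothetical, and measuring instants from datetime.min is an arbitrary choice made for this sketch.

```python
from datetime import datetime, timedelta

def parse_roundto(spec: str) -> timedelta:
    """Parse 'YYYY/MM/DD hh:mm:ss[.f]' into a fixed-length step."""
    date_part, time_part = spec.split()
    years, months, days = (int(p) for p in date_part.split("/"))
    hours, minutes, seconds = time_part.split(":")
    if years or months:
        # Years and months have no fixed length; not handled in this sketch.
        raise ValueError("year/month granularity is not supported here")
    return timedelta(days=days, hours=int(hours),
                     minutes=int(minutes), seconds=float(seconds))

def round_extent(extent: timedelta, step: timedelta) -> timedelta:
    """Round an extent to the nearest whole multiple of step.

    Note: Python's round() sends exact halves to the even multiple.
    """
    return round(extent / step) * step

def round_instant(instant: datetime, step: timedelta) -> datetime:
    """Round an instant by rounding its distance from a fixed origin."""
    origin = datetime.min  # arbitrary origin chosen for this sketch
    return origin + round_extent(instant - origin, step)

quarter_hour = parse_roundto("0000/00/00 00:15:00")
flight_time = round_extent(timedelta(hours=2, minutes=7), quarter_hour)
```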
Given this, the "architectural form" for data typing would become:
<!ATTLIST AnyElement
XML-TYPE (character|numeric|temporal) #IMPLIED
    -- if omitted, default is character with no other constraints applied --
XML-TYPE-CONTENT CDATA #IMPLIED
    -- For CHARACTER types only; default is no constraint --
XML-TYPE-MIN CDATA #IMPLIED
    -- For NUMERIC/TEMPORAL; default is no constraint --
XML-TYPE-MAX CDATA #IMPLIED
    -- For NUMERIC/TEMPORAL; default is no constraint --
XML-TYPE-ROUNDTO CDATA #IMPLIED
    -- For NUMERIC/TEMPORAL; default is no constraint --
XML-TYPE-RNDMETH (round|ceiling|floor) #IMPLIED
    -- Round method; for NUMERIC/TEMPORAL; default is "round" --
XML-TYPE-FORM (roman|arabic) #IMPLIED
    -- For NUMERIC; default is "roman" --
XML-TYPE-BASE CDATA #IMPLIED
    -- For NUMERIC; default is "10" --
XML-TYPE-TYPE (instant|extent) #IMPLIED
    -- required for TEMPORAL --
>
This changes the number of attributes from 4 to 9 but allows the data to be
constrained more precisely.
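To show how an application might act on these attributes, here is a minimal Python sketch of a checker for the CHARACTER and NUMERIC cases. It makes several assumptions that the proposal does not settle: XML-TYPE-CONTENT is treated as an ordinary Python regular expression, numeric content is arabic and base 10, and XML-TYPE-ROUNDTO is read as a constraint (the value must already be a multiple of the increment). TEMPORAL is left out. The function name check is hypothetical.

```python
import re
from decimal import Decimal, InvalidOperation

def check(content: str, attrs: dict) -> bool:
    """Return True if content satisfies the XML-TYPE-* attributes in attrs."""
    # If XML-TYPE is omitted, the default is character with no constraints.
    kind = attrs.get("XML-TYPE", "character").lower()

    if kind == "character":
        pattern = attrs.get("XML-TYPE-CONTENT")
        # Assumption: the pattern is a standard Python regular expression.
        return pattern is None or re.fullmatch(pattern, content) is not None

    if kind == "numeric":
        try:
            value = Decimal(content)
        except InvalidOperation:
            return False
        if not value.is_finite():
            return False
        low, high = attrs.get("XML-TYPE-MIN"), attrs.get("XML-TYPE-MAX")
        if low is not None and value < Decimal(low):
            return False
        if high is not None and value > Decimal(high):
            return False
        step = attrs.get("XML-TYPE-ROUNDTO")
        # Assumption: ROUNDTO requires the value to be a multiple of step.
        if step is not None and value % Decimal(step) != 0:
            return False
        return True

    raise NotImplementedError("temporal checking is not part of this sketch")

seat_row = {"XML-TYPE": "NUMERIC", "XML-TYPE-MIN": "1",
            "XML-TYPE-MAX": "36", "XML-TYPE-ROUNDTO": "1"}
row_ok = check("17", seat_row)
```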
The examples would become:
For a bank loan; balance, interest rate, and maturity date:
<!ELEMENT BALANCE (#PCDATA) >
<!ATTLIST BALANCE XML-TYPE CDATA #FIXED "NUMERIC"
XML-TYPE-ROUNDTO CDATA #FIXED "0.01"
XML-TYPE-MIN CDATA #FIXED "0.00" >
<!ELEMENT INTEREST (#PCDATA)>
<!ATTLIST INTEREST XML-TYPE CDATA #FIXED "NUMERIC"
XML-TYPE-MAX CDATA #FIXED "100"
    -- in practice we may want this to be much lower --
XML-TYPE-MIN CDATA #FIXED "0" >
<!ELEMENT MATURITY (#PCDATA)>
<!ATTLIST MATURITY XML-TYPE CDATA #FIXED "TEMPORAL"
XML-TYPE-TYPE CDATA #FIXED "INSTANT"
XML-TYPE-ROUNDTO CDATA #FIXED "0000/00/01 00:00:00">
For an airline departure: passenger name, seat number, and departure time:
<!ELEMENT LAST-NAME (#PCDATA)>
<!ATTLIST LAST-NAME XML-TYPE CDATA #FIXED "CHARACTER"
XML-TYPE-CONTENT CDATA #FIXED "[A-Z](*20)"
-- up to 20 repetitions of [A-Z] -->
<!ELEMENT FIRST-INITIAL (#PCDATA)>
<!ATTLIST FIRST-INITIAL XML-TYPE CDATA #FIXED "CHARACTER"
XML-TYPE-CONTENT CDATA #FIXED "[A-Z]" >
<!ELEMENT SEAT-ROW (#PCDATA)>
<!ATTLIST SEAT-ROW XML-TYPE CDATA #FIXED "NUMERIC"
XML-TYPE-MIN CDATA #FIXED "1"
XML-TYPE-MAX CDATA #FIXED "36"
XML-TYPE-ROUNDTO CDATA #FIXED "1" >
<!ELEMENT SEAT-LETTER (#PCDATA)>
<!ATTLIST SEAT-LETTER XML-TYPE CDATA #FIXED "CHARACTER"
XML-TYPE-CONTENT CDATA #FIXED "[A-F]" >
<!ELEMENT DEPARTURE (#PCDATA)>
<!ATTLIST DEPARTURE XML-TYPE CDATA #FIXED "TEMPORAL"
XML-TYPE-TYPE CDATA #FIXED "INSTANT"
XML-TYPE-ROUNDTO CDATA #FIXED "0000/00/00 00:01:00"
-- to the nearest minute -->
<!ELEMENT FLIGHT-TIME (#PCDATA)>
<!ATTLIST FLIGHT-TIME XML-TYPE CDATA #FIXED "TEMPORAL"
XML-TYPE-TYPE CDATA #FIXED "EXTENT"
XML-TYPE-ROUNDTO CDATA #FIXED "0000/00/00 00:15:00"
-- to the nearest 15 minutes -->
Well, what do you think?
Eric