Strong Typing in SGML and XML

Eric Albright eric_albright at sprynet.com
Wed May 7 05:46:25 BST 1997


First, I'd like to concur with the need for a formal specification for data
typing.

I had hoped that HyTime's lextype feature would be sufficient. I, for one,
would like to hear from the HyTime experts about how they would implement
the parallel data typing; there is no use reinventing a standard. It may
only need simplifying and explaining.

Having said that, I ask: when is strong data typing necessary? As far as I
can tell, there is only one place where it is useful -- when the document is
being created or altered. There will always be data validation that cannot
be handled by data typing and so must be delegated to a validating
application or a human. For example, typing cannot catch a first and last
name that have been swapped:
<NAME><FIRST>Albright</FIRST><LAST>Eric</LAST></NAME>

As for comments about the proposal:

I would like to see a simplified version of the data types. It is very
important for databases to know the exact size in bytes that a data element
will occupy. SGML/XML deals with character strings and therefore does not
care. More important to me are the constraints on the data implied by a
given type. I think we need to determine the kinds of constraints that each
data type requires and allow for the maximum flexibility without
sacrificing precision.

As far as I can tell, there are three basic types--character, numeric, and
temporal. Each type requires its own unique constraints:

CHARACTER - an alphabet, length constraint, content constraint (regular
expressions)

NUMERIC - a maximum value, a minimum value, some type of rounding/precision

TEMPORAL - a maximum value, minimum value, (the maximum and minimum values
may be constrained in relation to the current value), some type of
rounding/precision

I think that the CHARACTER data type should be able to specify the alphabet
and length constraint within the content constraint. However, some
modification to the standard regular expression notation would be necessary.
I, for one, do not want to have to type
\([0-9][0-9][0-9]\)[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] for a phone number.
Perhaps \([0-9](+3)\)[0-9](+3)-[0-9](+4) would be better.
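
To make the shorthand concrete, here is a minimal Python sketch that
rewrites it into conventional counted repetition. It assumes that (+n)
means "exactly n repetitions" (as the phone number implies) and that (*n)
means "up to n repetitions"; the function name expand_shorthand is
hypothetical and not part of any proposal.

```python
import re

def expand_shorthand(pattern: str) -> str:
    """Rewrite the proposed (+n) and (*n) suffixes as standard {n} and {0,n}."""
    pattern = re.sub(r"\(\+(\d+)\)", r"{\1}", pattern)    # (+3)  -> {3}
    pattern = re.sub(r"\(\*(\d+)\)", r"{0,\1}", pattern)  # (*20) -> {0,20}
    return pattern

# The phone number constraint from the text, in the proposed shorthand.
phone = expand_shorthand(r"\([0-9](+3)\)[0-9](+3)-[0-9](+4)")

print(phone)
print(bool(re.fullmatch(phone, "(555)123-4567")))
```

Counted repetition of this kind already exists in many regular expression
dialects, so the shorthand may only need to be mapped onto it rather than
defined from scratch.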

To allow maximum flexibility and precision for numeric values, we should be
able to specify the form (roman/arabic) and a base. The rounding allows us
to constrain the significant digits to some factor of the base. A rounding
type would be needed for the greatest flexibility (round/ceiling/floor).
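
As a rough illustration of the rounding constraint, the Python sketch below
rounds a base-10 value to a multiple of a "round to" step using each of the
three methods. It assumes that "round" means round-half-up, and it does not
attempt the roman form or bases other than 10; the names round_to and
METHODS are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_CEILING, ROUND_FLOOR

# Assumed mapping from the three proposed method names to decimal modes.
METHODS = {
    "round": ROUND_HALF_UP,
    "ceiling": ROUND_CEILING,
    "floor": ROUND_FLOOR,
}

def round_to(value: str, roundto: str, method: str = "round") -> Decimal:
    """Round value to a multiple of roundto using the named method."""
    step = Decimal(roundto)
    # Count how many whole steps fit, rounding that count as requested.
    units = (Decimal(value) / step).to_integral_value(rounding=METHODS[method])
    return units * step

print(round_to("1234.567", "0.01"))           # nearest cent
print(round_to("1234.567", "0.01", "floor"))  # truncated to the cent
print(round_to("7", "5", "ceiling"))          # next multiple of five
```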

Temporal values can specify either an instant of time or an extent of time.
They should also be able to be rounded. When an instant is rounded, the
significant digits are to the left; when an extent is rounded, the
significant digits are to the right. To signify that an instant is precise
to the nearest five years, it would be rounded to 0005/00/00 00:00:00. To
signify that an extent is precise to the nearest tenth of a second, it
would be rounded to 0000/00/00 00:00:00.1.
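
A small Python sketch of temporal rounding follows. It is only one possible
reading: extents are modelled as timedelta values, instants as datetime
values rounded relative to midnight, and halves round up. Steps of a day or
longer (such as the five-year case) need calendar arithmetic and are not
handled here. The function names are hypothetical.

```python
from datetime import datetime, timedelta

def round_extent(extent: timedelta, step: timedelta) -> timedelta:
    """Round a non-negative extent to the nearest multiple of step."""
    units = (extent + step / 2) // step  # whole number of steps
    return units * step

def round_instant(instant: datetime, step: timedelta) -> datetime:
    """Round an instant to the nearest sub-day step, counted from midnight."""
    midnight = instant.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + round_extent(instant - midnight, step)

# An extent precise to the nearest tenth of a second.
lap = round_extent(timedelta(seconds=12, milliseconds=340),
                   timedelta(milliseconds=100))

# An instant precise to the nearest minute.
departure = round_instant(datetime(1997, 5, 7, 5, 46, 25),
                          timedelta(minutes=1))

print(lap, departure)
```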

Given this, the "architectural form" for data typing would become:

<!ATTLIST AnyElement
    XML-TYPE         (character|numeric|temporal) #IMPLIED
        -- if omitted, default is character with no other constraints
           applied --
    XML-TYPE-CONTENT CDATA                        #IMPLIED
        -- For CHARACTER types only; default is no constraint --
    XML-TYPE-MIN     CDATA                        #IMPLIED
        -- For NUMERIC/TEMPORAL; default is no constraint --
    XML-TYPE-MAX     CDATA                        #IMPLIED
        -- For NUMERIC/TEMPORAL; default is no constraint --
    XML-TYPE-ROUNDTO CDATA                        #IMPLIED
        -- For NUMERIC/TEMPORAL; default is no constraint --
    XML-TYPE-RNDMETH (round|ceiling|floor)        #IMPLIED
        -- Round method; for NUMERIC/TEMPORAL; default is "round" --
    XML-TYPE-FORM    (roman|arabic)               #IMPLIED
        -- For NUMERIC; default is "roman" --
    XML-TYPE-BASE    CDATA                        #IMPLIED
        -- For NUMERIC; default is "10" --
    XML-TYPE-TYPE    (instant|extent)             #IMPLIED
        -- required for TEMPORAL --
>

This changes the number of attributes from 4 to 9 but provides for higher
precision for data constraint.

The examples would become:

For a bank loan: balance, interest rate, and maturity date:

<!ELEMENT BALANCE  (#PCDATA) >
<!ATTLIST BALANCE  XML-TYPE	       CDATA #FIXED "NUMERIC"
                   XML-TYPE-ROUNDTO  CDATA #FIXED "0.01" 
                   XML-TYPE-MIN      CDATA #FIXED "0.00" >
<!ELEMENT INTEREST (#PCDATA)>
<!ATTLIST INTEREST XML-TYPE      CDATA #FIXED "NUMERIC" 
                   XML-TYPE-MAX  CDATA #FIXED "100"
                       -- in practice we may want this to be much lower --
                   XML-TYPE-MIN  CDATA #FIXED "0" >
<!ELEMENT MATURITY (#PCDATA)>
<!ATTLIST MATURITY XML-TYPE          CDATA #FIXED "TEMPORAL"
                   XML-TYPE-TYPE     CDATA #FIXED "INSTANT"
                   XML-TYPE-ROUNDTO  CDATA #FIXED "0000/00/01 00:00:00">

For an airline departure: passenger name, seat number, and departure time: 

<!ELEMENT LAST-NAME (#PCDATA)>
<!ATTLIST LAST-NAME XML-TYPE         CDATA #FIXED "CHARACTER"
                    XML-TYPE-CONTENT CDATA #FIXED "[A-Z](*20)" 
                                           -- up to 20 repetitions of [A-Z] -->
<!ELEMENT FIRST-INITIAL (#PCDATA)>
<!ATTLIST FIRST-INITIAL XML-TYPE          CDATA #FIXED "CHARACTER"
                        XML-TYPE-CONTENT  CDATA #FIXED "[A-Z]" >
<!ELEMENT SEAT-ROW (#PCDATA)>
<!ATTLIST SEAT-ROW XML-TYPE          CDATA #FIXED "NUMERIC"
                   XML-TYPE-MIN      CDATA #FIXED "1"
                   XML-TYPE-MAX      CDATA #FIXED "36"
                   XML-TYPE-ROUNDTO  CDATA #FIXED "1" >
<!ELEMENT SEAT-LETTER (#PCDATA)>
<!ATTLIST SEAT-LETTER XML-TYPE          CDATA #FIXED "CHARACTER"
                      XML-TYPE-CONTENT  CDATA #FIXED "[A-F]" >
<!ELEMENT DEPARTURE (#PCDATA)>
<!ATTLIST DEPARTURE XML-TYPE          CDATA #FIXED "TEMPORAL" 
                    XML-TYPE-TYPE     CDATA #FIXED "INSTANT"
                    XML-TYPE-ROUNDTO  CDATA #FIXED "0000/00/00 00:01:00"
                                       -- to the nearest minute -->
<!ELEMENT FLIGHT-TIME (#PCDATA)>
<!ATTLIST FLIGHT-TIME XML-TYPE          CDATA #FIXED "TEMPORAL" 
                      XML-TYPE-TYPE     CDATA #FIXED "EXTENT"
                      XML-TYPE-ROUNDTO  CDATA #FIXED "0000/00/00 00:15:00"
                                        -- to the nearest 15 minutes -->
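
To show how a validating application might apply a few of these
declarations, here is a minimal Python sketch. The CONSTRAINTS table is a
hand-copied, hypothetical stand-in for what a parser would read from the
DTD; the (*20) shorthand is written as {0,20}, and XML-TYPE-ROUNDTO is
assumed to mean "the value must be a multiple of this step".

```python
import re
from decimal import Decimal, InvalidOperation

# Hypothetical in-memory form of three of the airline declarations.
CONSTRAINTS = {
    "LAST-NAME":   {"type": "character", "content": "[A-Z]{0,20}"},
    "SEAT-ROW":    {"type": "numeric", "min": "1", "max": "36", "roundto": "1"},
    "SEAT-LETTER": {"type": "character", "content": "[A-F]"},
}

def is_valid(element: str, text: str) -> bool:
    """Check element content against its declared type constraints."""
    rule = CONSTRAINTS[element]
    if rule["type"] == "character":
        # The whole content must match the content constraint.
        return re.fullmatch(rule["content"], text) is not None
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False  # not a number at all
    if not value.is_finite():
        return False  # reject NaN and Infinity
    if "min" in rule and value < Decimal(rule["min"]):
        return False
    if "max" in rule and value > Decimal(rule["max"]):
        return False
    if "roundto" in rule and value % Decimal(rule["roundto"]) != 0:
        return False  # not a multiple of the rounding step
    return True

print(is_valid("SEAT-ROW", "12"), is_valid("SEAT-ROW", "37"))
print(is_valid("SEAT-LETTER", "C"), is_valid("LAST-NAME", "Albright"))
```

Note that "Albright" fails only because the declared alphabet is [A-Z];
whether names should be restricted to upper case is a separate question.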


Well, what do you think?

Eric

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



