Strong Typing in SGML and XML

Wed May 7 17:14:40 BST 1997

In message <199705070346.UAA02957 at m9.sprynet.com>, Eric Albright
<eric_albright at sprynet.com> writes
>
>Having said that, I ask when is strong data typing necessary? As far as I
>can tell there is only one place where it is useful -- when the document is
>being created or altered. There will always be data validation that cannot
>be handled by data typing and as such must be delegated to a validating
>application or a human. e.g.
><NAME><FIRST>Albright</FIRST><LAST>Eric</LAST></NAME>

From a museum perspective, we have found the need for two types of data
validation/strong typing, which we call 'syntax control' and 'vocabulary
control'.  

Syntax control deals with things like the form of personal names.  These
are _not_ analysed in our application, but expressed in a consistent way
suitable for alphabetical sorting, e.g.:

        Light, Richard B.
rather than
        Richard B. Light

The syntax check would pick up non-capitalised words (apart from a 'stop
list' of known weak prefixes), inconsistent use of full stops and/or
spaces after initials, etc.  This starts to be hard work for a regular
expression, and might more easily be supported as a 'notation', for
which an external helper applet is called up in the context of editing.
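
To make the kind of check described above concrete, here is a minimal
Python sketch of such a helper.  It is only an illustration, not our
actual application: the function name check_persname, the contents of
the stop list, and the rule that every initial must carry a full stop
are all assumptions made for the example.

```python
import re

# Hypothetical stop list of weak prefixes that may stay in lower case.
WEAK_PREFIXES = {"de", "la", "le", "van", "von", "der"}


def check_persname(name):
    """Return a list of problems found in a 'Surname, Forenames' name."""
    problems = []

    # The sortable form has exactly one comma: surname first, then forenames.
    if name.count(",") != 1:
        problems.append("expected the form 'Surname, Forenames'")
        return problems
    surname, forenames = (part.strip() for part in name.split(","))
    if not surname or not forenames:
        problems.append("surname or forenames missing")
        return problems

    # Every word must be capitalised unless it is on the stop list.
    for word in surname.split() + forenames.split():
        if word in WEAK_PREFIXES:
            continue
        if not word[0].isupper():
            problems.append("word not capitalised: " + word)

    # Assumed house rule: each initial is followed by a full stop.
    for initial in re.findall(r"\b[A-Z]\b(?!\.)", forenames):
        problems.append("initial without full stop: " + initial)

    # Initials should be spaced consistently: 'R.B.' or 'R. B.', not both.
    spaced = re.search(r"[A-Z]\. [A-Z]\.", forenames)
    closed = re.search(r"[A-Z]\.[A-Z]\.", forenames)
    if spaced and closed:
        problems.append("inconsistent spacing after initials")

    return problems


print(check_persname("Light, Richard B."))
print(check_persname("Richard B. Light"))
```

Even this small sketch needs several separate tests, which is roughly
why a helper program seems a better home for the logic than a single
regular expression.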

Vocabulary control involves checking the data content against an
external authority, which could be a simple termlist or a complex
thesaurus.
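
As a rough sketch of the simple end of that range, the following Python
fragment checks a term against a flat termlist, plus a 'use for' table
of the kind a thesaurus would supply.  The terms, the table and the
function name check_term are invented for the illustration, and the
case-insensitive comparison is likewise an assumption.

```python
# Hypothetical authority data: preferred terms, plus a 'use for' table
# mapping non-preferred terms to the preferred term.
PREFERRED_TERMS = {"maker", "donor", "collector"}
USE_FOR = {"manufacturer": "maker", "giver": "donor"}


def check_term(term, preferred=PREFERRED_TERMS, use_for=USE_FOR):
    """Classify a term as ('ok', term), ('use', preferred) or ('unknown', None)."""
    key = term.strip().lower()  # assumed: matching ignores case and padding
    if key in preferred:
        return ("ok", key)
    if key in use_for:
        return ("use", use_for[key])
    return ("unknown", None)


print(check_term("maker"))
print(check_term("Manufacturer"))
print(check_term("owner"))
```

A real thesaurus would also carry broader, narrower and related terms;
none of that is attempted here.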

Another use we make of data syntax is as a short-cut for markup.  (This
was before we knew about SGML, by the way!  The conventions were
originally devised to make optimal use of A5 catalogue cards ...)  We
use colons as a 'field separator', e.g.:

        <person>maker : Light, R.B.</person>
implies:
        <person>
                <role>maker</role>
                <persname>Light, R.B.</persname>
        </person>

and ampersands (definitely pre-SGML!) as keyword separators:

        <place>Burgess Hill & W. Sussex & U.K.</place>
implies:
        <place>
                <placename>Burgess Hill</placename>
                <placename>W. Sussex</placename>
                <placename>U.K.</placename>
        </place>
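
The two expansions above are mechanical enough to sketch in a few lines
of Python using the standard xml.etree.ElementTree module.  This is an
illustration under stated assumptions, not our conversion software: the
function names are hypothetical, the input is taken as a plain string
(a bare '&' would not survive an XML parser), and a person entry with
no colon is assumed to be a name with no role.

```python
import xml.etree.ElementTree as ET


def expand_person(text):
    """Expand 'role : name' into <person><role/><persname/></person>."""
    person = ET.Element("person")
    role, colon, name = text.partition(":")
    if colon:
        ET.SubElement(person, "role").text = role.strip()
        ET.SubElement(person, "persname").text = name.strip()
    else:
        # Assumption: no colon means the whole string is the name.
        ET.SubElement(person, "persname").text = text.strip()
    return person


def expand_place(text):
    """Expand 'a & b & c' into one <placename> per keyword."""
    place = ET.Element("place")
    for keyword in text.split("&"):
        ET.SubElement(place, "placename").text = keyword.strip()
    return place


print(ET.tostring(expand_person("maker : Light, R.B."), encoding="unicode"))
print(ET.tostring(expand_place("Burgess Hill & W. Sussex & U.K."),
                  encoding="unicode"))
```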

These practices tie in with the SGML concept of short references, which
are not available in XML.  So a general conclusion I have come to is
that ':' and '&' need to be mapped to suitable subelements, and our
users need to come to terms with more heavily tagged records than they
are used to.  

This is relevant (really!) in the context of Tim's suggestion that
strong typing should apply only to PCDATA-only elements.  In the more
general case of 'data validation' we might well want to validate
elements with substructure.

Richard Light
SGML and Museum Information Consultancy
richard at light.demon.co.uk
3 Midfields Walk 
Burgess Hill
West Sussex RH15 8JA
U.K.
tel. (44) 1444 232067

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)