A Plea for Schemas

Len Bullard cbullard at hiwaay.net
Tue Nov 2 02:30:25 GMT 1999


Matthew Gertner wrote:
> 
> I have written a short "XML Rant" 

Enjoyable.  It is good to see some reasonable passion from a 
reasonable mind.  Here is some rant for the rant.

o  "the 1980s, Charles Goldfarb invented SGML".  Ok for a 
rant, but ISO created SGML. If any man can be said to 
have led that work, it is Dr. Charles Goldfarb at IBM Almaden.
He was a member of the IBM team (Goldfarb, Mosher, Lorie) that designed
GML.  
To the idea of GenCodes, GML added among other things, 
type-defined namespaces for markup. GML and research were combined to 
propose and ratify ISO 8879.  Invention like that is a community 
process.  Dr. Goldfarb leads that community.

In the late 1960s, publishers needed a means to 
exchange working files.  A solution proposed at that time, 
GenCodes, was supported.  The limited power of sharing 
the same single namespace (the GenCodes) did not evolve.  
The reasons are not complex and are the same as HTML: 
the namespace represents a local application context.  
When shared for all types, it limits the expressiveness 
needed to document multi-context real time events.

o  "..thousands loved it."  Conceded.  SGML was an expensive 
system deployed on then mostly mainframe and mini environments. 
Who had it?  Aerospace tech writers, some artists, and lawyers.
Why?  They had a use for it and the costs were justified 
relative to the cost of the lifecycle of the information 
in its topical context.  Manuals.  Expensive ones.
 
SGML lends itself to interpreted means and interpreted 
means are inefficient.  That is relative to resources.  
As soon as SGML was moved to PC-based systems, 
it became cost-effective.  There are and were examples of 
SGML-based systems working well for hypertext client 
applications in those environments.  Except for 
lowly IADS, mostly expensive ones.  Systems like 
IADS proved SGML, if deFanged a bit, could be 
deployed cheaply.  Free even. 

IADS did not use a DTD.  It used a stylesheet (circa 1990).
It had a DTD, and the tags within it were modifiable and 
extensible via the stylesheet processor.  Its tags (file, frame, hyperlink) 
were the equivalent of the ThenMalignedAndDespised PROCESSING INSTRUCTIONS 
but they looked like tags, so DTDs written for the system 
incorporated them and went on about their business.  Framing worked.

In 1989:

1. Software was expensive
2. Hardware was expensive
3. The dominant application of SGML (1000dpi print) was hard.

SGML emerged into more general use when more power 
was on more desks.  Complexity coupled to complexity 
produces emergence.  TCO.  The critical innovation 
to enable the emergence of SGML came from Intel, et al.  
The unification of a significantly sized software base by a dominant 
operating system company did the rest.  Kick MS as much 
as people want to, without them, the Web today would 
still be something university students surfed and 
researchers occasionally mastered, IMNSHO. 

HTML emerged when:

o  The Internet was opened to commercial use
o  The power of the processor could support the 
   lowest-common-denominator application of SGML
o  Governments paid to implement and give away 
   a means and process to share the namespace in that 
   application
o  A person to lead the effort emerged with a plan 
   that would work:  Tim Berners-Lee, HTTP and HTML.

These convergent events, all in the same five years, gave you the 
WorldWideWeb.

o  HTML is a subset of SGML:  NYET.  Get out the ruler 
and rap the knuckles.   XML is a subset of SGML.  HTML 
is an *application* of SGML.  It is obnoxious, and I 
apologize in advance, but getting others to understand 
**that** critical difference in thinking about markup is 
very hard sometimes.  Where I put "application", some 
say, "vocabulary".  Que bueno, but as Charles said, 
"conserve names" and that is all.  

Systems are invented or specified.  Vocabularies are spoken.

HTML was not hobbled.  It was distilled like other vocabularies 
from agreements made among organizations to share information. 
CERN, Univ of Ill, DARPA agree to make such agreements and 
vocabularies are the result of that agreement.  What the organizations 
share are namespaces and the implementations of processors for 
creating, adding, deleting, or modifying statements in those 
namespaces.  HTML was GenCode: partDeux.   TimBL gets the credit, 
but there were those who helped him and if you ask, I'm sure he 
will tell you names.  Names are what is shared.  

It's all about names.  Read the XML 1.0 spec and, IMHO, that 
is the conceptual breakthrough to understand markup.  In essence, 
SGML has always been principally a lexical standard.  That 
structural integrity is important, and specifying that 
provides the necessary freedom from implementation 
to enable an inexhaustible range of expression.  

It makes the agreement needed to implement a 
system to use it very expensive.  XML locks 
down the SGML Declaration.  Most of the biggest 
changes from SGML start there.  To keep the original 
expressive power, the means for making beyondLex agreements 
are still needed.

A DTD is not about lexical validation only.  It 
is about validating a hierarchical namespace to 
determine conformance.  Whether you use DTDs, 
MS Schemas, XML Schemas (someday), or just use 
the table design window for Access or Oracle, 
validating a vocabulary requires you to declare 
one or derive it.  IMHO, of the two means, declaration 
is usually cheaper, but it is always political. 
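
To make "declare one or derive it" concrete, here is a minimal sketch in Python. It is an illustration only, not how any DTD or schema processor works: the function names, the sample instance, and the dict-of-sets vocabulary are all assumptions made up for the example. It records only which names may appear inside which other names, and ignores order, occurrence, and attributes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def derive_vocabulary(xml_text):
    """Derive a parent -> allowed-children map from one sample instance."""
    vocab = defaultdict(set)
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        vocab[parent.tag]  # touch the key so leaf elements are recorded too
        for child in parent:
            vocab[parent.tag].add(child.tag)
    return dict(vocab)

def conforms(xml_text, vocab):
    """Return the (parent, child) pairings the vocabulary does not allow."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag not in vocab:
        problems.append((None, root.tag))
    for parent in root.iter():
        allowed = vocab.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                problems.append((parent.tag, child.tag))
    return problems

# Hypothetical sample instance and the hand-declared equivalent.
SAMPLE = "<manual><title>IADS</title><frame><para>text</para></frame></manual>"
DECLARED = {
    "manual": {"title", "frame"},
    "title": set(),
    "frame": {"para"},
    "para": set(),
}

derived = derive_vocabulary(SAMPLE)
```

Deriving from one instance gives the same table as the declaration here, but only because the sample happens to exercise every name. That is the catch with derivation: it knows only what it has seen, while the declaration is whatever the parties agreed to.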
  
Politics are human means to declare namespaces. 
BizTalk and OASIS both exist because of the names 
and interest of those named in the shared politics 
of creating their shared namespaces.  That is all.

XML does not care.

Syntax unification is not enough.  Using markup systems 
requires you to accept the idea that the namespace is 
primary.  What does that mean?  Just as SQL systems 
must disambiguate aggregate naming, so must markup systems. 
A name means what you need it to.  It must be unique and persistent 
to be a name and you require a means to discover if it is 
meeting that need.  Trust but verify.

Schemas are just one of the tools for discovering if 
that is the case.  You can do more with schema information 
in the same way the relational system does it.  Names 
are associated to create processable unique names.  
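
A rough sketch of that association, in Python: the same local name ("name") shows up under two parents, and joining each name to its ancestors yields unique, processable names, much as a relational system qualifies a column with its table. The path separator, the function name, and the sample order document are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

def qualified_names(xml_text):
    """Return the sorted set of path-qualified element names in an instance."""
    root = ET.fromstring(xml_text)
    names = set()

    def walk(element, prefix):
        # Qualify the local name with the names of its ancestors.
        path = f"{prefix}/{element.tag}" if prefix else element.tag
        names.add(path)
        for child in element:
            walk(child, path)

    walk(root, "")
    return sorted(names)

# Hypothetical instance in which "name" is ambiguous on its own.
ORDER = (
    "<order>"
    "<customer><name>Acme</name></customer>"
    "<item><name>Widget</name></item>"
    "</order>"
)

names = qualified_names(ORDER)
```

Here "order/customer/name" and "order/item/name" are distinct even though both elements are called "name".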

You can do a lot with the DTDs and schemas, really.  
They are just metainformation by which 
you agree to organize the screen and the objects on it, 
or the messages among objects, or whatever you want 
to talk about.  The reason to use them 
is to validate or as a source for initialization.  In 
effect, they really are just another database of 
names and values.  That is what makes using XML 
Schemas (in deference to DTDs) attractive.  Applications 
outside very specialized ISO 8879-conforming processors 
for DTDs are also useful for managing the namespace 
of that metainformation.

DTDs do not aggregate; so, if instances do, they 
are not validatable.  That does not keep them from 
being useful.  The names in the space are unique. 
Their persistence is questionable, yet if you treat 
them as a relational designer treats a view, they 
are very useful.  Well-formed is what you need for 
any lifecycle of the information.  Valid is what 
you need to ensure correct processes among systems 
that use the information at particular times.  When 
a formal means to persist these better is provided, 
then we have a very good system for maintaining 
namespace communities.
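
The well-formed/valid split can be sketched in a few lines of Python. This is a simplified stand-in, not a DTD validator: the standard library parser checks well-formedness only, so "valid" here means no more than "every element and every parent/child pairing appears in a declared table". The memo vocabulary and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical declared vocabulary: element name -> child names it may contain.
DECLARED = {"memo": {"to", "body"}, "to": set(), "body": set()}

def is_well_formed(xml_text):
    """True if the text parses as XML at all; no vocabulary is consulted."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def is_valid(xml_text, declared):
    """Well-formed AND every element and parent/child pairing is declared."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    for parent in root.iter():
        if parent.tag not in declared:
            return False
        if any(child.tag not in declared[parent.tag] for child in parent):
            return False
    return True
```

A memo with an undeclared `<cc>` element passes the first check and fails the second: it will survive any lifecycle as text, but the systems that agreed on the memo vocabulary have no reason to know what to do with it.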

Schemas organize a namespace; not doing that is 
relaxing a design constraint on the namespace.  Relaxing 
that constraint is efficient, particularly at this 
time when database systems are so cheap and ubiquitous 
that using them for serving strings is optimal.  
Correct-by-construction from a trusted source is faster, 
more compact, and less-restricting on system evolution.

Badly-formed HTML?  It was a trade-off.  It cleans 
up over time.  Better tools, better hunts, better times.

All XML says is, you don't have to use the DTD.  
It doesn't say it isn't useful. Enlightened XMLers 
write them and use them and even throw them away.  
A DTD is a snapshot of the organization of a namespace 
in time.  Time moves on.  Information does too.  
The DTD might not.  Some part of it probably 
will and will influence the next version. The 
reason to use or not use a DTD or any other 
schema is determined by the namespace evolution: 
and evolution of agreements, so cooperation.

Cooperation among large human communities is 
always furthered when agreements about what 
to name the names are simple and easy to verify. 
When the means to communicate among companies 
became the Web, the need to verify these agreements 
by simple means became an ecological imperative. 
So, patience.  But don't quit pleading.  Namespaces 
are gardens.  To grow usefully, they have to be tended. 
It takes tools, lots of them, for particular 
purposes, to do that. Most of us have sheds full of 
tools we only use occasionally next to ones we use 
every day.

That golden 10% of XML is the distilled essence of 
SGML and the years of practice and competing, sometimes 
awkward specifications and standards written there 
by all of the people I met in those years.  Even 
those HyTime guys worked on creating XML.  HyTime, 
DSSSL, TEI, but before them, Dexter, FRESS, Engelbart, 
all feed the single stream that is now XML and as 
with SGML, all the competing, sometimes awkward 
specifications being written by many of the same people.

If you want to plead for schemas, I plead with you.  Schemas are a 
tool for validating agreements among overlapping namespace 
communities.  Ecom-ecologies (keiretsu) emerge because 
the tools they use to make agreements, their namespaces, 
become efficient.  S = k log W (Boltzmann).  To control 
the temperature, control the value of W.  DTDs help 
you control the rate at which entropy consumes referents.

The trick to fix the web is to fix the web's indexes. 
To do that, ensure the agreements by which the indexes 
are made enable validation of the namespaces indexed.
Well-formed, and valid by agreement are the keys to creating 
semantic space, overlapping vocabularies, if that is what 
you want.  

DTDs are a tool to make agreements.  Beyond the agreement are the 
names that agree.  XML Doesn't Care.  You do.  You write:

     Dilution of the basic principles of generic markup, and
     misunderstanding of their purpose, will then give rise to
     inevitable disappointment, and hence rejection: "We switched our
     whole company over to XML and we still can't interchange data
     effortlessly. So this means that XML doesn't work, right?" 

How many 'MLers here want a dollar for every time you've heard that?

Tell 'em, "ahh, XML Works.  We just don't agree on how."

len bullard






