Simple approaches to XML implementation

Sat Mar 1 01:08:02 GMT 1997

The discussion on the API is extremely valuable and exciting, and I'm learning
a lot.  There is no doubt that there are enough experts to do a first-class
job of building an API that will last.  However, for some people who have
joined this list and who really need or want XML, it may not be clear how
some of this relates to more practical problems.  (It really does!)

A few weeks ago I got assurance from the WG that XML was not only for
rocket scientists, so if you aren't one, here is a place to talk about
the simple aspects.  Remember that XML is 'an extremely simple' dialect of
SGML _and can be used as such_.  I started working with an XML-like
dialect about 12 months ago and wrote my own parser and postprocessor with
steam technology, so it's not _essential_ to have groves, IDL, etc., though
they will certainly make it much easier to develop complex applications.  You
may also want to build a prototype to learn what it's about and then
bolt in the more powerful parsing and processing tools later.

The first thing to realise is that XML allows you to create documents that 
are well-formed, but need not be validated.  That may be fine for
many people - especially during a development stage.  If you don't use
EMPTY elements (e.g. <BR> in HTML) so that all your start- and end-tags are
balanced and nested correctly, and if your attributes are quoted, then
that is all you need for a WF document.  Example:

<FOO>
<BAR LANGUAGE="EN">
This is a string
</BAR>
</FOO>
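
A minimal sketch of the check described above, in Python (any language
would do).  It assumes a document with no EMPTY elements, comments,
processing instructions or DOCTYPE; the names TAG and is_well_formed are
made up for this illustration and do not come from any real parser.

```python
import re

# One tag: optional "/", a name, then zero or more NAME="value" attributes.
# An unquoted attribute value makes the whole tag fail to match.
TAG = re.compile(
    r'<(/?)([A-Za-z][\w.-]*)'
    r'((?:\s+[A-Za-z][\w.-]*\s*=\s*(?:"[^"<]*"|\'[^\'<]*\'))*)\s*>'
)


def is_well_formed(text):
    """True if tags are balanced and nested and attributes are quoted."""
    stack = []          # names of the elements that are currently open
    root_seen = False
    pos = 0
    while (lt := text.find("<", pos)) != -1:
        m = TAG.match(text, lt)
        if m is None:
            return False                    # malformed tag
        closing, name, attrs = m.groups()
        if closing:
            # An end-tag must carry no attributes and must close the
            # most recently opened element.
            if attrs or not stack or stack.pop() != name:
                return False
        else:
            if root_seen and not stack:
                return False                # second top-level element
            root_seen = True
            stack.append(name)
        pos = m.end()
    return root_seen and not stack          # everything opened was closed


sample = '<FOO>\n<BAR LANGUAGE="EN">\nThis is a string\n</BAR>\n</FOO>'
print(is_well_formed(sample))
```

The stack is the whole trick: a start-tag pushes its name, and an end-tag
must match the name on top.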

So, are there simple tools for creating well-formed documents?  Can HTML
editors be extended? (Since I create a lot of my XML documents by hand,
I'd be interested to have shortcuts).

------------------------------------------------------------------------

Most documents will then need some sort of processing.  There are two
main strategies:
	- event stream
	- parse tree
The event stream mode is best illustrated by HTML and the font or phrase
tags.  <I> switches on italics and </I> switches it off.  <B> is bold_on
and </B> is bold_off.  If your XML document were laid out like the FOO/BAR
example above, with each tag on its own line, it would be quite easy to
write code which read each line and took appropriate action (Foo_on, Foo_off).

I've been writing something this morning to do exactly that for HTML.  I use
Java, but there's nothing fundamental about what language you use (a year
ago I used tcl/tk with CoST).  So, for example, I take a _stream_ of HTML,
write it to the screen, and every time I encounter a flag (tag) I take
appropriate action.  If the document is well-formed the tags will nest
correctly, so the interpreting/parsing process should throw an error if an
end-tag is encountered unexpectedly.
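
The event stream idea can be sketched as follows.  This is not the Java
code mentioned above; it is an illustration in Python built on the
standard library's html.parser.HTMLParser, which calls a method for each
start-tag, end-tag and run of text.  The class name EventStream, the
function run and the "_on"/"_off" event names are assumptions made for
this example.  Note that HTMLParser reports tag names in lower case.

```python
from html.parser import HTMLParser


class EventStream(HTMLParser):
    """Turn each tag into an on/off event and check that tags nest."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # tags switched on and not yet switched off
        self.events = []      # the actions taken, in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self.events.append(tag + "_on")

    def handle_endtag(self, tag):
        # In a well-formed stream an end-tag always closes the most
        # recently opened tag; anything else is an error.
        if not self.open_tags or self.open_tags[-1] != tag:
            raise ValueError("unexpected end-tag </%s>" % tag)
        self.open_tags.pop()
        self.events.append(tag + "_off")

    def handle_data(self, data):
        if data.strip():
            self.events.append("text: " + data.strip())


def run(markup):
    """Feed markup through the stream and return the list of events."""
    stream = EventStream()
    stream.feed(markup)
    stream.close()
    if stream.open_tags:
        raise ValueError("missing end-tag for <%s>" % stream.open_tags[-1])
    return stream.events


for event in run("<I>italic <B>both</B></I>"):
    print(event)
```

A real application would replace the events list with whatever action is
wanted (changing the font, writing to the screen and so on).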

The tree model is best illustrated by the containers in HTML:
<HTML>
<HEAD>
<TITLE>
This is a title
</TITLE>
</HEAD>
<BODY>
<H1>
That's all folks
</H1>
</BODY>
</HTML>

If you look at what elements contain what others, you'll see that HTML
can be thought of as a root, with two branches to its children
(HEAD and BODY).  HEAD has one child (TITLE) and BODY has one child (H1).
Both TITLE and H1 contain strings (#PCDATA) which can be regarded as children.
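
As a rough sketch of the tree model, the following Python builds such a
tree by hand, with strings stored as children alongside elements.  The
names Node, build_tree and outline are invented for this example, and the
tokenising pattern is deliberately crude: it assumes balanced tags and
ignores comments, entities and other markup.

```python
import re

# Either a tag (optional "/", then a name) or a run of text between tags.
TOKEN = re.compile(r"<(/?)(\w+)[^>]*>|([^<]+)")


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []    # Node objects or plain strings (#PCDATA)


def build_tree(markup):
    """Return the root Node of a document whose tags are balanced."""
    root = None
    stack = []                # path from the root to the current element
    for closing, name, text in TOKEN.findall(markup):
        if text:
            if text.strip() and stack:
                stack[-1].children.append(text.strip())
        elif closing:
            if not stack or stack[-1].name != name:
                raise ValueError("unexpected end-tag </%s>" % name)
            stack.pop()
        else:
            node = Node(name)
            if stack:
                stack[-1].children.append(node)
            elif root is None:
                root = node
            else:
                raise ValueError("more than one root element")
            stack.append(node)
    if stack:
        raise ValueError("missing end-tag for <%s>" % stack[-1].name)
    return root


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    pad = "  " * depth
    if isinstance(node, str):
        return [pad + repr(node)]
    lines = [pad + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


doc = ("<HTML><HEAD><TITLE>This is a title</TITLE></HEAD>"
       "<BODY><H1>That's all folks</H1></BODY></HTML>")
print("\n".join(outline(build_tree(doc))))
```

The stack holds the path from the root down to the element being read, so
each new element or string is attached to whatever is on top of it.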

Looking at structured documents as trees is extremely powerful for searching
and other manipulations.  IMO HTML requires both approaches, and in processing
it you have to switch between them.
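
To give a feel for searching a tree, here is a short sketch using
xml.etree.ElementTree from the Python standard library (a present-day
tool, used here only as an illustration and not one of the parsers
discussed on this list).  It works on the example above because that
example happens to be well-formed; the function name element_names is an
assumption for this sketch.

```python
import xml.etree.ElementTree as ET

doc = """<HTML>
<HEAD>
<TITLE>
This is a title
</TITLE>
</HEAD>
<BODY>
<H1>
That's all folks
</H1>
</BODY>
</HTML>"""

root = ET.fromstring(doc)


def element_names(tree):
    """List every element name in document order (parents first)."""
    return [element.tag for element in tree.iter()]


# Walk the whole tree.
print(element_names(root))

# Search by path: the TITLE child of the HEAD child of the root.
title = root.find("HEAD/TITLE")
print(title.text.strip())

# Search by name anywhere below the root.
for heading in root.iter("H1"):
    print(heading.text.strip())
```

With an event stream, the same questions would mean re-reading the
document and keeping track of position by hand.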

--------------------------------------------------------------------------

In building a generic parser (such as Lark and NXP) the authors have to cover
the whole range of possibilities, both in the input document and in the ways
that it might be processed.  There is, however, no need for any particular
application to use the full power of XML, and this might allow you to develop
a simpler parser and/or editor if you want, especially if you need to
write it for a specific platform, etc.  Also, if you just 'want to
get started' there are enough tools to get a feel for what XML is about.

	P.

XML is committed to making things simple!

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)