Handling unknown elements?

Fri Apr 17 14:42:07 BST 1998

At 18:45 08/04/98 -0400, Tyler Baker wrote:
>One dilemma I have been trying to figure out with XML is the problem of
>handling unknown element types and what to do with their children.
[...]
>
>Anyone here got any better ideas on this?

Well I have some ideas ... :-)

The problem I address (in JUMBO2) is:

"What do I do when someone sends me an XML document without any/enough
accompanying material telling me what to do with it?"

If this is similar to your problem, read on :-)

(1) If the DTD is present it can tell you if the document is valid. There
is no agreed mechanism whereby a DTD can carry additional semantics. So
your DTD could tell you if a B element can contain mixed content including
an I element - it can't tell you what they mean.

(2) There is no universal generic mechanism for adding semantics to an XML
document.

(3) If the main purpose of the document is to be rendered for humans, then
stylesheets should be used. If the author creates their own tagset and
doesn't provide a stylesheet, many XML-aficionados will give up at this
stage. For example, a document:
	This is a <FOO>bold <BAR>italic</BAR> phrase</FOO>
is as valid as one using B and I, but the reader has to do some detective
work to discover what FOO and BAR mean. Most readers would probably give up
on most such documents.

(4) If the main purpose of the document is for a machine to act upon it
(and not everyone realises the enormous potential of XML here), then
another way of communicating semantics has to be provided. The method I use
is to map Java classes onto elements. The mapping can be highly
context-dependent, which makes it very powerful. Example:

<MOL><ATOMS> <ARRAY BUILTIN="X2">... </ARRAY></ATOMS></MOL>
will draw a chemical line drawing.

<MOL><ATOMS> <ARRAY BUILTIN="X3">... </ARRAY></ATOMS></MOL>
will draw a rotatable 3-D molecule.

The JUMBO-MOL software is (obviously) application-specific and uses
XPointers extensively to decide on context.
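
To make point (4) concrete, here is a minimal sketch of mapping Java code onto element types with context-dependence. It is not JUMBO's actual API: the class ElementDispatchSketch, the ElementHandler interface and the "plain array" fallback are all hypothetical names invented for illustration, and a simple ancestor check stands in for the XPointer-based context lookup mentioned above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: none of these names come from JUMBO.
final class ElementDispatchSketch {

    /** Behaviour attached to an element type. */
    interface ElementHandler {
        String describe(Element e);
    }

    // Registry keyed by element name; elements without an entry are "unknown".
    private static final Map<String, ElementHandler> HANDLERS = new HashMap<>();

    static {
        HANDLERS.put("ARRAY", e -> {
            // Context-dependence: the same ARRAY element means different
            // things depending on its ancestors and its BUILTIN attribute.
            boolean inMolAtoms = hasAncestorPath(e, "ATOMS", "MOL");
            String builtin = e.getAttribute("BUILTIN");
            if (inMolAtoms && "X2".equals(builtin)) return "chemical line drawing";
            if (inMolAtoms && "X3".equals(builtin)) return "rotatable 3-D molecule";
            return "plain array"; // assumed fallback, not from the original post
        });
    }

    /** True if the parent, grandparent, ... of e have the given names, in order. */
    static boolean hasAncestorPath(Element e, String... names) {
        Node n = e.getParentNode();
        for (String name : names) {
            if (!(n instanceof Element) || !name.equals(n.getNodeName())) return false;
            n = n.getParentNode();
        }
        return true;
    }

    /** Parses the XML and returns what each known element would do, joined by "; ". */
    static String interpret(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> actions = new ArrayList<>();
            walk(doc.getDocumentElement(), actions);
            return String.join("; ", actions);
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
    }

    // Unknown elements contribute nothing themselves, but their children are still visited.
    private static void walk(Element e, List<String> actions) {
        ElementHandler h = HANDLERS.get(e.getTagName());
        if (h != null) actions.add(h.describe(e));
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) walk((Element) c, actions);
        }
    }
}
```

With this sketch, interpret() on the X2 document above yields "chemical line drawing" and on the X3 document "rotatable 3-D molecule". An ARRAY outside MOL/ATOMS falls through to the assumed default.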

(5) To help with the first three problems JUMBO2 now has the following
*generic* facilities which help with 'unstyled' random XML documents:
	- search the document for all elements, attributes, attribute values, and
PCDATA content and uniquify them
	- display this as a tree showing unique markup components. This is linked
to the original document (tree). Thus, I may find that <bibref> occurs in
rec.xml. What does it mean?  I can use JUMBO2 to find all the occurrences
of <bibref> in the doc and highlight them all (almost instantaneous now :-)
	- find all 'whitespace' elements and delete them. This aids tree
navigation in some cases
	- display the content of any node (whether mixed or element) in several
different styles. These include:
		raw XML
		untagged event stream (e.g. similar to removal of unknown tags)
		prettyprinted XML (indented)
		whitespace specifically highlighted
		'default' styling.
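
The first few facilities in this list can be approximated with the standard Java DOM API. The sketch below is an illustration only, not JUMBO2 code: the class MarkupInventorySketch and its method names are hypothetical, and it assumes that "'whitespace' elements" means text nodes containing nothing but whitespace.

```java
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: none of these names come from JUMBO2.
final class MarkupInventorySketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
    }

    /** Distinct element names used anywhere in the document, sorted. */
    static Set<String> uniqueElementNames(Document doc) {
        Set<String> names = new TreeSet<>();
        NodeList all = doc.getElementsByTagName("*"); // "*" matches every element
        for (int i = 0; i < all.getLength(); i++) {
            names.add(all.item(i).getNodeName());
        }
        return names;
    }

    /** Distinct attribute names used anywhere in the document, sorted. */
    static Set<String> uniqueAttributeNames(Document doc) {
        Set<String> names = new TreeSet<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            NamedNodeMap attrs = all.item(i).getAttributes();
            for (int j = 0; j < attrs.getLength(); j++) {
                names.add(attrs.item(j).getNodeName());
            }
        }
        return names;
    }

    /** How many times an element type occurs, e.g. to highlight every bibref. */
    static int countOccurrences(Document doc, String elementName) {
        return doc.getElementsByTagName(elementName).getLength();
    }

    /** Removes whitespace-only text nodes below node; returns how many were removed. */
    static int stripWhitespaceNodes(Node node) {
        int removed = 0;
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before possibly removing
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
                removed++;
            } else {
                removed += stripWhitespaceNodes(child);
            }
            child = next;
        }
        return removed;
    }
}
```

The sorted sets give the "unique markup components" view. Linking each entry back to its occurrences in the tree, as JUMBO2 does, would need the NodeList itself rather than just a count.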

The default styling applies simple heuristics to display elements. Thus 
<SPEAKER>MACBETH</SPEAKER>
is displayed as:
SPEAKER: MACBETH
where the markup term is in a different font.  This is useful for many
generic XML documents.
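
One possible reading of such a heuristic, sketched in Java. The rules here are assumptions made for illustration (a leaf element becomes "NAME: text", a container becomes its name followed by its indented children), and the class DefaultStylingSketch is hypothetical. The real JUMBO heuristics may well differ, and the font change is not modelled in plain text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch of a default styling heuristic; not JUMBO code.
final class DefaultStylingSketch {

    /** Renders a whole XML document as indented "NAME: text" lines. */
    static String render(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            render(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
    }

    private static void render(Element e, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  "); // two spaces per level
        List<Element> kids = childElements(e);
        if (kids.isEmpty()) {
            // Leaf element: show the markup term, then its text content.
            out.append(e.getTagName()).append(": ")
               .append(e.getTextContent().trim()).append('\n');
        } else {
            // Container: show the name, then each child element one level deeper.
            // (Simplification: text mixed in between child elements is dropped.)
            out.append(e.getTagName()).append('\n');
            for (Element k : kids) render(k, depth + 1, out);
        }
    }

    private static List<Element> childElements(Element e) {
        List<Element> kids = new ArrayList<>();
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) kids.add((Element) c);
        }
        return kids;
    }
}
```

Calling render("<SPEAKER>MACBETH</SPEAKER>") gives the line "SPEAKER: MACBETH", matching the example above.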

	In addition JUMBO will allow you to add your own style to individual
elements. Thus <olist> in rec.xml would appear to be a list, so the user
can interactively add list-formatting to it. In your case you could arrange
that <B> was made bold and <I> was made italic. [I am not prepared to
'guess' the meaning of common tags - e.g. <A> - and the reader has to take
the responsibility for this. I would hope that the world might converge
towards common semantics for common terms, and XML-DEV is here if anyone
wishes. But if you want to use <PARA> for a chemical term rather than a
paragraph, you're perfectly welcome to - XML doesn't care :-)].
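
A rough sketch of reader-assigned styles, again in Java and again hypothetical (UserStyleSketch, its Style enum and the `*` / `_` markers are invented here; JUMBO does this interactively in a GUI). The reader supplies a table from element names to styles. Any element not in the table is rendered with no styling of its own, but its children are still processed, which is one answer to the original question of what to do with the children of unknown elements.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch of user-supplied per-element styles; not JUMBO code.
final class UserStyleSketch {

    /** Styles the reader may assign; the marker is a plain-text stand-in for a font. */
    enum Style {
        BOLD("*"), ITALIC("_"), PLAIN("");

        final String marker;

        Style(String marker) {
            this.marker = marker;
        }
    }

    /** Renders the document using the reader's own element-to-style table. */
    static String render(String xml, Map<String, Style> styles) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            render(doc.getDocumentElement(), styles, out);
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
    }

    private static void render(Node n, Map<String, Style> styles, StringBuilder out) {
        if (n.getNodeType() == Node.TEXT_NODE) {
            out.append(n.getNodeValue());
            return;
        }
        if (n.getNodeType() != Node.ELEMENT_NODE) return; // skip comments, PIs, etc.
        // Unknown element types get PLAIN: no marker, children still rendered.
        Style s = styles.getOrDefault(n.getNodeName(), Style.PLAIN);
        out.append(s.marker);
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            render(c, styles, out);
        }
        out.append(s.marker);
    }
}
```

For instance, a reader who decides that FOO means bold and BAR means italic, and wraps the earlier example in a hypothetical DOC element, would get "This is a *bold _italic_ phrase*". With an empty table the same input comes out as untagged text.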

	P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)