Local Markup

John Cowan cowan at locke.ccil.org
Thu Aug 13 23:19:25 BST 1998


The following is put forth for comments.

1.  Abstract

This document describes a local markup facility for XML.  This serves
some of the purposes of SGML short references, but can be layered over
XML parsing rather than integrated into it.  By declaring appropriate
processing instructions, character data within selected elements
with mixed content can be effectively turned into element content.

Terms not defined here are as in the XML Recommendation.

2.  Motivating Example

A DATE element may appear in an XML document with #PCDATA content
containing an ISO 8601 date, as follows:

	<DATE>1998-08-13T14:15:16</DATE>

A conformant local markup processor, given appropriate PIs, can
process this as if it were:

	<DATE>1998<D/>08<D/>13<T/>14<C/>15<C/>16</DATE>

where the dashes have been turned into empty D elements, the T
separating time and date (a feature of ISO 8601) has been turned
into an empty T element, and the colons into empty C elements.
In this way, the application reading this data can allow the XML
parser to be the only parser, rather than having a separate date
parser behind the XML parser.

One possible set of relevant PIs is:

	<? LM-DEF char="-" group="ISO8601" element="D" ?>
	<? LM-DEF char="T" group="ISO8601" element="T" ?>
	<? LM-DEF char=":" group="ISO8601" element="C" ?>
	<? LM-USE element="DATE" group="ISO8601" ?>

The above interpretation, though much more useful than the raw
character data, does not take into account the existence of
hierarchical information.  This document, therefore, also provides
a way to make the interpretation come out as follows:

	<DATE>
	  <SEGMENT>
	    <PART>1998</PART>
	    <PART>08</PART>
	    <PART>13</PART>
	  </SEGMENT>
	  <SEGMENT>
	    <PART>14</PART>
	    <PART>15</PART>
	    <PART>16</PART>
	  </SEGMENT>
	</DATE>

(The indentation is shown here for clarity only.)

This is still not as nice as actual elements named YEAR, MONTH, DAY,
etc., but preserves the structure of the original.  Here are some
PIs that do the job:

	<? LM-DEF char="-" group="ISO8601" element="PART" ?>
	<? LM-DEF char="T" group="ISO8601" element="SEGMENT" ?>
	<? LM-DEF char=":" group="ISO8601" element="PART" ?>
	<? LM-USE element="DATE" group="ISO8601" model="SEGMENT PART"?>

3.  Processing Instructions

There are two PIs used to support local markup.  The PI target
"LM-DEF" defines which characters are or may be local markup;
the PI target "LM-USE" specifies when local markup is in effect.
Both should appear before the start tag of the document element.
Neither affects the DTD in any way.  Both PIs use pseudo-attributes
(PAs) in the style of the XML declaration; the pseudo-attributes may
appear in any order.
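
As an illustration only, the pseudo-attributes of a PI could be
extracted along the following lines.  This is a minimal Python sketch,
not part of the proposal: the function name parse_pseudo_attributes
and the regular expression are assumptions, and the pattern accepts
only a simplified (ASCII-oriented) subset of Name.

```python
import re

# Matches name="value" or name='value', in the style of the XML declaration.
# The name pattern is a simplification of the XML Name production.
_PA = re.compile(r"""([A-Za-z_:][\w.:-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_pseudo_attributes(pi_data):
    """Return the pseudo-attributes in a PI's data as a dict.

    Order is not significant, so a plain dict is enough.
    """
    result = {}
    for name, double_quoted, single_quoted in _PA.findall(pi_data):
        # Exactly one of the two quoted groups can be non-empty.
        result[name] = double_quoted or single_quoted
    return result
```

Because the result is a dictionary, the same PAs given in a different
order produce an equal result.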

3.1  LM-DEF Pseudo-Attributes

All three of the following PAs are required in an LM-DEF PI.  It is
an error to have two LM-DEF PIs with the same "char" and "group" values.

3.1.1  "char"

The "char" PA value is either a single Char or a numeric character
reference, which the LM processor will expand.
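
One way the expansion might be done is sketched below in Python.  The
function name expand_char is hypothetical, and rejecting values longer
than one character is an assumption about how a processor would treat
malformed input; the text above does not say.

```python
import re

# Decimal (&#33;) or hexadecimal (&#x21;) numeric character reference.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def expand_char(value):
    """Return the single character denoted by a "char" PA value."""
    match = _CHAR_REF.fullmatch(value)
    if match:
        hex_digits, dec_digits = match.groups()
        if hex_digits:
            return chr(int(hex_digits, 16))
        return chr(int(dec_digits))
    if len(value) != 1:
        # Assumption: anything else that is not one character is an error.
        raise ValueError("char must be one character or a character reference")
    return value
```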

3.1.2  "group"

The "group" PA value matches Name.  This Name is matched
with the "group" pseudo-attribute in an LM-USE PI.

3.1.3  "element"

The "element" PA value matches Name.  It specifies the
element name corresponding to the character defined by "char".

3.2  LM-USE Pseudo-Attributes

The "group" and "element" PAs are required in an
LM-USE PI.  The "model" PA is optional.  It is an error to have
two LM-USE PIs with the same "element" value.  [Query: maybe should be
an error only if both "group" and "element" values are the same?]

3.2.1  "group"

The "group" PA value matches Name.  This Name is matched
with the "group" pseudo-attribute in an LM-DEF PI.

3.2.2  "element"

The "element" PA value matches Name.  It specifies the
name of an element type.  The characters in the group specified by
"group" are processed when found in the character data of elements with
this name.

3.2.3  "model"

The "model" PA value matches Names.  It names the sub-elements,
outermost first, which are to be nested as descendants of the
"element" element.


4.  Processing Model

A conformant Local Markup processor reads and records all LM-DEF
and LM-USE PIs.  It then processes the rest of the document,
searching for elements with names matching those declared in
the "element" PA of some LM-USE PI.  All other elements and
markup remain unaltered.
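
The recording step might look roughly like the following Python
sketch.  The function record_pis, its input shape (a list of target
and PA-dictionary pairs), and the two lookup tables are assumptions
made for illustration.  The duplicate checks follow the error rules of
sections 3.1 and 3.2 as currently worded.

```python
def record_pis(pis):
    """Record LM-DEF and LM-USE PIs given as (target, PA dict) pairs.

    Returns two tables:
      defs: (group, char) -> element name
      uses: base element name -> (group, list of model element names)
    """
    defs = {}
    uses = {}
    for target, pa in pis:
        if target == "LM-DEF":
            key = (pa["group"], pa["char"])
            if key in defs:
                raise ValueError("duplicate LM-DEF for %r" % (key,))
            defs[key] = pa["element"]
        elif target == "LM-USE":
            name = pa["element"]
            if name in uses:
                raise ValueError("duplicate LM-USE for element %r" % name)
            # "model" is optional; an absent model is an empty list.
            uses[name] = (pa["group"], pa.get("model", "").split())
        # PIs with any other target are not local markup and are ignored.
    return defs, uses
```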

When an element with an associated LM-USE PI is found, its character
data is searched for characters defined in LM-DEF PIs which share
the same "group" PA as the element.  These characters are the
element's *local markup*, and the element is called the *base element*.
Each local markup character has a corresponding element type.

Local markup can be either *free* or *hierarchical*.

4.1  Free Local Markup

Local markup is free in an element if it does not meet the conditions
for hierarchical local markup.  If the element associated with a
markup character is not mentioned in the "model" PA corresponding to
the base element, the markup is always free.

Free local markup characters are processed as if the corresponding
element were present in empty form.  Thus if "!" is defined with the
element "BANG", then a "!" will be treated like "<BANG/>", as follows:

	<? LM-DEF char="!" group="a" element="BANG" ?>
	<? LM-USE group="a" element="foo" ?>
	<foo>foo!bar!baz</foo>

is treated as if it were:

	<foo>foo<BANG/>bar<BANG/>baz</foo>
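
The substitution for free markup can be sketched in a few lines of
Python.  This is an illustration under stated assumptions: the
function apply_free_markup is hypothetical, it works on the character
data of one base element as a plain string, and it returns serialized
XML rather than parser events.

```python
from xml.sax.saxutils import escape


def apply_free_markup(text, char_to_element):
    """Replace each local markup character with an empty element.

    text: character data of the base element (already unescaped).
    char_to_element: maps each markup character to its element name.
    """
    pieces = []
    for ch in text:
        if ch in char_to_element:
            pieces.append("<%s/>" % char_to_element[ch])
        else:
            # Ordinary character data is re-escaped for serialization.
            pieces.append(escape(ch))
    return "".join(pieces)
```

Called with the text "foo!bar!baz" and the mapping {"!": "BANG"}, it
returns the content shown above.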

4.2  Hierarchical Local Markup

Hierarchical local markup can be present only if the base element
has an associated "model" PA.  If so, then the elements mentioned
in the "model" PA are made the child, grandchild, great-grandchild ...
of the base element.  For example:

	<? LM-USE group="x" element="foo" model="baz zam" ?>
	<foo>This is foo</foo>

is treated as if it were:

	<foo><baz><zam>This is foo</zam></baz></foo>

Local markup characters are processed hierarchically only if two
conditions are met:  the element corresponding to the character
must be part of the model, and the character must appear as a child
of the base element, not as part of some descendant of the base
element.  Thus, in the following:

	<? LM-DEF char="!" group="x" element="bar" ?>
	<? LM-DEF char="#" group="x" element="baz" ?>
	<? LM-USE group="x" element="foo" model="bar" ?>
	<foo>This is ! <zam> Empty !</zam>.  Here is #</foo>

the first "!" is hierarchical, but the second "!" and the "#" are free.

A local markup character that is hierarchical is treated as if it
were a set of end-tags followed by a set of start-tags.  The end-tags
are successively those from the end of the model, backward to and
including the one associated with the character.  The start-tags
are the same but in reverse order.  The previous example is therefore
treated as if it were:

	<foo>
	  <bar>This is </bar>
	  <bar>
	    <zam> Empty <bar/></zam>.  Here is <baz/>
	  </bar>
	</foo>

ignoring the extra whitespace shown here for clarity only.

Using the second set of PIs in section 2, the element

	<DATE>1997T</DATE>

is treated as if it were:

	<DATE><SEGMENT><PART>1997</PART></SEGMENT><SEGMENT><PART/></SEGMENT></DATE>

because the T generates </PART></SEGMENT><SEGMENT><PART>, of which
the final <PART> start-tag pairs with the first end-tag of the closing
</PART></SEGMENT></DATE> to form an empty PART element.
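
The end-tag and start-tag rule can be sketched in Python as follows.
This is a simplified illustration, not a full processor: the function
apply_hierarchical_markup is hypothetical, and it assumes the base
element contains only character data, so every markup character is a
direct child of the base element.  Characters whose element is not in
the model are treated as free.  An empty element is written as a
start-tag and end-tag pair, which is equivalent to the empty-element
form.

```python
from xml.sax.saxutils import escape


def apply_hierarchical_markup(base, text, char_to_element, model):
    """Serialize a base element whose content is plain character data.

    model: element names from the "model" PA, outermost first.
    """
    pieces = ["<%s>" % base]
    # The model elements wrap the whole content of the base element.
    pieces.extend("<%s>" % name for name in model)
    for ch in text:
        name = char_to_element.get(ch)
        if name is None:
            pieces.append(escape(ch))
        elif name in model:
            # Close from the end of the model back to this element,
            # then reopen the same elements in the opposite order.
            affected = model[model.index(name):]
            pieces.extend("</%s>" % n for n in reversed(affected))
            pieces.extend("<%s>" % n for n in affected)
        else:
            # Not in the model: free local markup.
            pieces.append("<%s/>" % name)
    pieces.extend("</%s>" % name for name in reversed(model))
    pieces.append("</%s>" % base)
    return "".join(pieces)
```

With the second set of PIs from section 2, the mapping is
{"-": "PART", "T": "SEGMENT", ":": "PART"} and the model is
["SEGMENT", "PART"].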

-- 
John Cowan	http://www.ccil.org/~cowan		cowan at ccil.org
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)
