XML processing experiments

Fri Nov 7 16:29:00 GMT 1997

James Clark wrote:
>> Given XML's requirements that entity references in the instance are
>> synchronous, I would have thought that the overhead of an entity stack
>> could be avoided for parsing the instance.  The parser passes the
>> application an entity reference event, and the application can then, if
>> it chooses, recursively invoke the parser to parse the referenced
>> entity.

Richard Tobin wrote:
>Entity references are expanded, and a bit may end in a different
>entity from the one it started in (suppose foo is defined as "a<b/>c";
>then the first bit returned from "x&foo;y" is "xa" - as far as I can
>tell this is quite legal XML).

I don't think this is legal. The working draft (sec. 4.1) says:
"The logical and physical structures (elements and entities) in an XML
document must be synchronous. Tags and elements must each begin and end in
the same entity, but may refer to other entities internally; comments,
processing instructions, character references, and entity references must
each be contained entirely within a single entity"

It seems to me that with the current whitespace handling, one could
(nearly?) parse each entity locally, and build a subtree from it if a tree
is wanted. This might make error reporting easier and would probably improve
parsing speed, though it could add some complexity to the implementation.
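
To make the idea concrete, here is a rough sketch in Python of parsing each
entity locally. It is only an illustration under heavy assumptions: it
handles text, attribute-less tags and general entity references, nothing
else, and the names parse_content and SyncError are made up for this
example. Each call keeps its own element stack, so an element that is opened
but not closed inside one entity is reported against that entity.

```python
import re

# Tags without attributes (<b>, </b>, <b/>) and general entity references.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)(/?)>|&([A-Za-z][\w.-]*);")


class SyncError(Exception):
    """Raised when an element does not begin and end in the same entity."""


def parse_content(text, entities, active=()):
    """Parse the text of ONE entity and return its nodes as a list.

    `entities` maps entity names to replacement text (hypothetical input
    format). `active` holds the entities currently being parsed, so that a
    self-referencing entity is detected.
    """
    root = []
    stack = [root]   # child lists of the elements open in this entity
    names = []       # names of the elements open in this entity
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            stack[-1].append(text[pos:m.start()])   # plain text run
        pos = m.end()
        closing, tag, empty, ref = m.groups()
        if ref is not None:
            if ref in active:
                raise SyncError("recursive entity: " + ref)
            # Recursively parse the referenced entity into its own subtree.
            subtree = parse_content(entities[ref], entities, active + (ref,))
            stack[-1].append(("entity", ref, subtree))
        elif closing:
            if not names or names[-1] != tag:
                raise SyncError("end-tag </%s> has no start-tag here" % tag)
            names.pop()
            stack.pop()
        else:
            node = ("element", tag, [])
            stack[-1].append(node)
            if not empty:
                stack.append(node[2])
                names.append(tag)
    if pos < len(text):
        stack[-1].append(text[pos:])
    if names:
        raise SyncError("<%s> is not closed in the same entity" % names[-1])
    return root


# The example from the quoted message: foo is defined as "a<b/>c".
tree = parse_content("x&foo;y", {"foo": "a<b/>c"})
print(tree)
```

In this sketch the text runs "x" and "a" stay separate nodes, because the
reference to foo becomes a subtree of its own. An application that wants
merged text would have to flatten the entity nodes afterwards.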

As Mr. Clark indicates, a parser doesn't need to take much of a performance
hit when no entity references are present: the entity stack has no influence
(it stays constant) while parsing, for instance, a start-tag.
(If entity references are present in the attribute values, these can be
expanded afterwards if wanted. Authoring tools etc. often don't want this
expansion to happen.)

I (currently!) think it is possible to design a 'real' parser that looks
locally much the same as Mr. Clark's "quick and dirty" parser.
(I'm just starting to implement one.)

BTW: Does anyone have an example of where the immediate expansion of
character references within internal entities actually comes in handy?
To me this seems to make the parser use more memory and perhaps run slower,
but more importantly, it ruins the copy-paste semantics of entity expansion.

What will "normal" people think about such things as the example from the
draft:

<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped
numerically (&#38;#38;#38;) or with a general entity
(&amp;amp;).</p>" >

I think most people will regard this as a bug/design flaw.
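
To show what happens to that declaration, here is a small Python sketch of
the two expansion steps. It is an approximation, not a parser: markup is
left as plain text, only references are processed, and the predefined
entities are simply mapped to their final characters. The function names
replacement_text and expand_in_content are invented for this example.

```python
import re

# Decimal or hexadecimal character references, e.g. &#38; or &#x26;.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")
# Any reference: a character reference or a named general entity.
ANY_REF = re.compile(r"&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z_][\w.-]*));")

# Simplification: predefined entities mapped straight to their characters.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}


def replacement_text(literal):
    """Step 1, at declaration time: expand character references only."""
    def expand(m):
        return chr(int(m.group(1), 16) if m.group(1) else int(m.group(2)))
    return CHAR_REF.sub(expand, literal)


def expand_in_content(text):
    """Step 2, in content: expand character and general entity references."""
    def expand(m):
        if m.group(1):
            return chr(int(m.group(1), 16))
        if m.group(2):
            return chr(int(m.group(2)))
        return PREDEFINED[m.group(3)]
    return ANY_REF.sub(expand, text)


# The literal entity value from the draft's example.
literal = (
    "<p>An ampersand (&#38;#38;) may be escaped\n"
    "numerically (&#38;#38;#38;) or with a general entity\n"
    "(&amp;amp;).</p>"
)

stored = replacement_text(literal)      # what the parser keeps for the entity
referenced = expand_in_content(stored)  # what a reference to it yields
pasted = expand_in_content(literal)     # the literal pasted at the same spot

print(stored)
print(referenced)
print(pasted)
```

Under these assumptions, referencing the entity and pasting its literal
value into the document give different text, which is the loss of
copy-paste semantics mentioned above.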

I would feel better if I knew an example where this behaviour actually
comes in handy... :-)

Cheers,
Jarle Stabell

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/