XML processing experiments

Fri Nov 7 15:16:16 GMT 1997

> Given XML's requirements that entity references in the instance are
> synchronous, I would have thought that the overhead of an entity stack
> could be avoided for parsing the instance.  The parser passes the
> application an entity reference event, and the application can then, if
> it chooses, recursively invoke the parser to parse the referenced
> entity.

A pedant might note that the XML standard requires that for internal
entities "the processor must ... retrieve its replacement text ...
passing the result to the application in place of the reference".  No
doubt the same pedant could draw the line between processor and
application such that this was satisfied.

This scheme seems reasonable for a parser that works in terms of events
implemented by callbacks.  Our parser on the other hand returns "bits"
(these are essentially start tags, end tags, and pcdata)
*sequentially*, following the model of reading a plain text file.
Entity references are expanded, and a bit may end in a different
entity from the one it started in (suppose foo is defined as "a<b/>c";
then the first bit returned from "x&foo;y" is "xa" - as far as I can
tell this is quite legal XML). In a language with threads, it's easy
to implement this on top of a callback interface (in a sense the
procedure stack of the parsing thread would replace the entity stack),
but it's much messier in plain C.
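
To make the "x&foo;y" example concrete, here is a minimal sketch of a
sequential reader that keeps an explicit entity stack.  It is not the
LT-NSL code: every name in it (struct reader, next_bit, lookup_entity)
is hypothetical, the entity table holds only foo, and it ignores
attributes, comments, character references and error reporting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 8   /* arbitrary limit on nested entities (assumption) */

enum bit_type { BIT_PCDATA, BIT_TAG, BIT_EOF };

struct reader {
    const char *stack[MAX_DEPTH];   /* read position in each open entity */
    int depth;                      /* number of open entities */
};

/* Hypothetical entity table: only foo is defined, as in the example. */
static const char *lookup_entity(const char *name)
{
    return strcmp(name, "foo") == 0 ? "a<b/>c" : NULL;
}

static void reader_init(struct reader *r, const char *document)
{
    r->stack[0] = document;
    r->depth = 1;
}

/* Look at the next character without consuming it.  Finished entities
   are popped and references are pushed here, so callers never see an
   entity boundary.  Returns -1 at end of input or on a bad reference. */
static int peek_char(struct reader *r)
{
    while (r->depth > 0) {
        const char **top = &r->stack[r->depth - 1];
        if (**top == '\0') {
            r->depth--;                       /* entity ended: pop it */
        } else if (**top == '&') {
            const char *semi = strchr(*top, ';');
            const char *text = NULL;
            char name[32];
            if (semi != NULL && (size_t)(semi - *top) <= sizeof name) {
                size_t n = (size_t)(semi - *top) - 1;
                memcpy(name, *top + 1, n);
                name[n] = '\0';
                text = lookup_entity(name);
            }
            if (text == NULL || r->depth == MAX_DEPTH) {
                r->depth = 0;                 /* give up: treat as end */
                break;
            }
            *top = semi + 1;                  /* resume here after the pop */
            r->stack[r->depth++] = text;      /* push the replacement text */
        } else {
            return (unsigned char)**top;
        }
    }
    return -1;
}

static int get_char(struct reader *r)
{
    int c = peek_char(r);
    if (c >= 0)
        r->stack[r->depth - 1]++;
    return c;
}

/* Copy the next bit into buf (truncating if it does not fit). */
static enum bit_type next_bit(struct reader *r, char *buf, size_t size)
{
    size_t n = 0;
    int c = peek_char(r);
    assert(size > 0);
    buf[0] = '\0';
    if (c < 0)
        return BIT_EOF;
    if (c == '<') {                           /* tag: read up to '>' */
        do {
            c = get_char(r);
            if (c >= 0 && n + 1 < size)
                buf[n++] = (char)c;
        } while (c >= 0 && c != '>');
        buf[n] = '\0';
        return BIT_TAG;
    }
    while ((c = peek_char(r)) >= 0 && c != '<') {   /* pcdata */
        get_char(r);
        if (n + 1 < size)
            buf[n++] = (char)c;
    }
    buf[n] = '\0';
    return BIT_PCDATA;
}
```

After reader_init(&r, "x&foo;y"), successive calls to next_bit return
the pcdata "xa", the tag "<b/>", the pcdata "cy", and then BIT_EOF.
The first bit starts in the document and ends inside foo, which is
the case the entity stack exists to handle.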

Partly the reason for using the sequential model is historical: this
parser is used in the LT-NSL system, which already worked like that.
But it's also for simplicity: I want this parser to be easily usable
with existing C applications (for example, someone here wants to be
able to read XML-marked-up text into his speech synthesizer).

> [...]
> This is particularly the case if you want to get
> correct byte offsets when using a variable width encoding (such as
> UTF-8); it's hard to do this without a method call per character.

Misha Wolf tells me that my earlier comment about the non-invertibility
of UTF-8 is wrong: the Unicode standard requires that the shortest
encoding be used.  So, for example, if you know the byte offset of the
start of the line then you can find the byte offset of a character
in the line by calculating the encoded length of the preceding
characters.
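
As a rough sketch of that calculation, assuming the line is already
held as an array of decoded code points no larger than U+10FFFF (the
function names here are mine, purely for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes in the shortest UTF-8 encoding of one code point.
   Assumes cp <= 0x10FFFF. */
static size_t utf8_encoded_length(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80UL)
        return 1;
    if (cp < 0x800UL)
        return 2;
    if (cp < 0x10000UL)
        return 3;
    return 4;
}

/* Byte offset of chars[index], given the byte offset of the start of
   the line: add up the encoded lengths of the preceding characters. */
static size_t utf8_byte_offset(size_t line_start,
                               const unsigned long *chars, size_t index)
{
    size_t offset = line_start;
    size_t i;
    for (i = 0; i < index; i++)
        offset += utf8_encoded_length(chars[i]);
    return offset;
}
```

This works only because each character has exactly one legal encoded
length; if longer encodings were allowed, the decoded characters would
not determine the byte offset.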

On the other hand I note that low-end current machines can do about 10
million trivial non-leaf procedure calls per second, so maybe the
overhead of a call per character is not unacceptable (in C I would be
doing something like parser->source->get_translated_char(); there
would probably be more overhead in an object-oriented language).
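
For what it's worth, a call per character through a function pointer
might look like the sketch below.  Only the name get_translated_char
comes from the expression above; the struct layouts, source_init and
the UTF-8 decoder are assumptions for illustration.  The decoder
checks continuation bytes but does not reject overlong forms.

```c
#include <stddef.h>

struct source {
    const unsigned char *bytes;
    size_t length;
    size_t byte_offset;     /* offset of the next undecoded byte */
    long (*get_translated_char)(struct source *self);
};

struct parser {
    struct source *source;
};

/* Decode one UTF-8 character and advance byte_offset past it.
   Returns the code point, or -1 at end of input or on a malformed
   sequence (a real parser would distinguish the two). */
static long utf8_get_translated_char(struct source *s)
{
    unsigned char lead;
    size_t extra, i;
    long cp;

    if (s->byte_offset >= s->length)
        return -1;
    lead = s->bytes[s->byte_offset];
    if (lead < 0x80) {
        extra = 0; cp = lead;
    } else if ((lead & 0xE0) == 0xC0) {
        extra = 1; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        extra = 2; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        extra = 3; cp = lead & 0x07;
    } else {
        return -1;                      /* not a valid lead byte */
    }
    if (s->length - s->byte_offset <= extra)
        return -1;                      /* sequence is truncated */
    for (i = 1; i <= extra; i++) {
        unsigned char b = s->bytes[s->byte_offset + i];
        if ((b & 0xC0) != 0x80)
            return -1;                  /* not a continuation byte */
        cp = (cp << 6) | (b & 0x3F);
    }
    s->byte_offset += extra + 1;
    return cp;
}

static void source_init(struct source *s,
                        const unsigned char *bytes, size_t length)
{
    s->bytes = bytes;
    s->length = length;
    s->byte_offset = 0;
    s->get_translated_char = utf8_get_translated_char;
}
```

The parser then calls parser->source->get_translated_char(parser->source)
for each character, and source->byte_offset is correct after every
call without any separate bookkeeping.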

> [...]
> there would be a one stage process that
> converted a stream of bytes into a stream of characters already split up
> into tokens.

Yes - I have been thinking about that too.  Outside the DTD the
tokenisation is relatively trivial, and the speed of DTD processing is
unimportant in many applications so it can just use character-at-a-time
translation.

-- Richard

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/