Confused about & in entity literal

Mon May 10 21:34:16 BST 1999

> >Have a closer look at production 9:
> >
> >    [9] EntityValue ::=
> >         '"' ([^%&"] | PEReference | Reference)* '"' |
> >         "'" ([^%&'] | PEReference | Reference)* "'"
> >
> >Which _does_ say that you can't have a raw '&' in an entity value etc.
> >That's what the exclusion syntax means.
> 
> So it's like you have to parse the entity value and, if you find an ampersand,
> you have to parse it like an entity reference. If it happens to be either a
> numeric reference or the name happens to match one of the intrinsic entity
> names, then you should expand that and escape the character it generates?

See section 4.5 on how the replacement text for an entity value is
defined; there's no special case for general entities that happen to
be built-in.  Also see appendix D for more elaboration of how the
rules in that section interact with the entity expansion rules in
section 4.4 ...

Briefly, entity references will get expanded eventually, just not
at the time the entity's replacement text is constructed.  (It is
not possible to do the expansions then, since the general entities
aren't required to be defined at that time!)
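
As a rough illustration of the section 4.5 rule (a sketch only, not code
from any real parser): character references and parameter-entity
references in the literal are replaced when the replacement text is built,
while general-entity references are copied through untouched.  The names
replacement_text, NAME and TOKEN are made up for this example, and NAME is
an ASCII-only simplification of the spec's Name production.

```python
import re

# ASCII-only stand-in for the XML Name production (an assumption for brevity).
NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"
TOKEN = re.compile(
    r"&#x([0-9A-Fa-f]+);"     # hexadecimal character reference
    r"|&#([0-9]+);"           # decimal character reference
    r"|&(" + NAME + r");"     # general-entity reference
    r"|%(" + NAME + r");"     # parameter-entity reference
)

def replacement_text(literal, parameter_entities=None):
    """Build an internal entity's replacement text from its literal value."""
    pes = parameter_entities or {}
    out, pos = [], 0
    for m in TOKEN.finditer(literal):
        out.append(literal[pos:m.start()])
        hex_ref, dec_ref, general, pe = m.groups()
        if hex_ref is not None:
            out.append(chr(int(hex_ref, 16)))   # expanded now
        elif dec_ref is not None:
            out.append(chr(int(dec_ref)))       # expanded now
        elif general is not None:
            out.append(m.group(0))              # bypassed: left as "&name;"
        else:
            out.append(pes[pe])                 # PE reference: included now
        pos = m.end()
    out.append(literal[pos:])
    return "".join(out)

print(replacement_text("&#38;#38;"))         # &#38;
print(replacement_text("&lt;b&#62; &Foo;"))  # &lt;b> &Foo;
```

The first call mirrors the appendix D example: the output of an expanded
character reference is not rescanned, so "&#38;#38;" becomes "&#38;".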

> Otherwise, if it happens to look like a reasonable entity reference, I guess you
> are supposed to ignore it and just pass it through as is? If it does not look
> like a reasonable entity reference, then you give an error?

Again, see section 4.5 (etc) which is pretty clear about this.

Try running your parser against a conformance test suite -- e.g. James
Clark's XMLTEST, or Sun's (which incorporates some examples from the
XML spec, avoiding any issues about interpretation).

> What about this scenario?
> 
> <!ENTITY Foo "&[insert 128K of what is really the rest of a base64 encoded or
> encrypted piece of text];">
> 
> In this scenario, the entity is an encoded value of some sort, which just
> happens to start with an ampersand and end with a semi-colon. It has no illegal
> name chars in it and no spaces. You are supposed to buffer up all of that text,
> and if it happens to end with a semicolon, assume that it is a legal reference
> and pass it through?

Absolutely.  If that assumption is untrue, it's the fault of whoever
provided that illegal entity declaration.
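
A minimal sketch of that check, assuming a simplified ASCII-only notion of
a name (the helper check_ampersands is hypothetical, and it only looks at
'&', not '%'): every '&' in the literal must begin something matching the
Reference production, however long the name is; anything else is an error.

```python
import re

# ASCII-only stand-in for the XML Name production (an assumption for brevity).
NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"
REFERENCE = re.compile(r"&(?:#[0-9]+|#x[0-9A-Fa-f]+|" + NAME + r");")

def check_ampersands(literal):
    """Return True, or raise ValueError at the first '&' that is not a Reference."""
    pos = literal.find("&")
    while pos != -1:
        m = REFERENCE.match(literal, pos)
        if m is None:
            raise ValueError(f"raw '&' at offset {pos} does not start a reference")
        pos = literal.find("&", m.end())
    return True

# A 128K "name" is still syntactically a reference, so it passes through.
check_ampersands("&" + "A" * (128 * 1024) + ";")

# A raw ampersand, or a "name" containing '+', is rejected.
for bad in ("AT&T", "&abc+def;"):
    try:
        check_ampersands(bad)
    except ValueError as err:
        print(err)
```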

> Yes I know that the ampersands should have been escaped technically, but how
> many parsers would blow up in this situation trying to buffer up that much text?

Curiously, not Sun's parser.  The XML specification places no limit on
the length of XML names, and names weren't the (single) place where I found
it convenient to impose a size limit.  (I forget whether I removed that
limit in the next version of the parser.)

> How would the end user of that text figure out again where the escaped
> ampersands are in the text since it's basically a totally meaningless sequence of
> characters to begin with?

I don't understand the question.  Are you perhaps assuming that
the replacement text is not going to have such entity references
expanded at the proper time?  (Namely, when the entity, "Foo" in
the example above, is referenced.)
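
To illustrate what "the proper time" means, here is a small sketch (the
function expand and the entity table are invented for this example, it
handles character data only, and it uses an ASCII-only simplification of
names).  The table maps entity names to replacement texts; references
inside those texts are expanded only when the enclosing entity is itself
referenced.

```python
import re

# ASCII-only stand-in for the XML Name production (an assumption for brevity).
NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"
REF = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&(" + NAME + r");")

def expand(text, entities, active=()):
    """Expand references in text; `entities` maps names to replacement text."""
    out, pos = [], 0
    for m in REF.finditer(text):
        out.append(text[pos:m.start()])
        hex_ref, dec_ref, name = m.groups()
        if hex_ref is not None:
            out.append(chr(int(hex_ref, 16)))   # result is data, never rescanned
        elif dec_ref is not None:
            out.append(chr(int(dec_ref)))
        elif name in active:
            raise ValueError("entity '" + name + "' refers to itself")
        elif name not in entities:
            raise KeyError(name)                # undefined at reference time
        else:
            # The nested reference is resolved only now, at reference time.
            out.append(expand(entities[name], entities, active + (name,)))
        pos = m.end()
    out.append(text[pos:])
    return "".join(out)

entities = {
    "amp": "&#38;",      # replacement text of the predefined entity
    "Foo": "&Bar;",      # Bar need not exist when Foo is declared
    "Bar": "AT&amp;T",
}
print(expand("x &Foo; y", entities))  # x AT&T y
```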

- Dave
