Entities and Expat

Nik O niko at cmsplatform.com
Wed May 12 18:52:01 BST 1999

Thank you both for your rapid response.

I should have mentioned that I had previously tried to include entity
declarations within my document (there is presently no DTD associated with
the document).  When I included the entity declaration at the beginning of
the document (outside the root element), Expat returned
XML_ERROR_INVALID_TOKEN ("not well-formed").  If the entity declaration was
included within the root element, Expat returned XML_ERROR_SYNTAX ("syntax
error").  Indeed, I double-checked this by copying the string from Joshua's
email.  Same results.  Perhaps the real issue is Expat's handling of
"<!ENTITY..>" declarations in standalone documents?  Or am I still missing
something?

My application is parsing XML documents that contain HTML entity references
("&copy;", etc.), indexing the text, and building a full-text database
composed of HTML "documents".  The app doesn't need to expand, translate,
or index the entity strings -- it just needs a string length to keep the
document's word offsets straight, and to copy the string to the output
stream.  I had hoped to do this in the DefaultHandler callback, but of
course I'm never getting there.

 Joshua E. Smith wrote:
> If you want your application to "just know" about some entities which you
> have failed to define anywhere, I don't think that documents relying on
> that behavior would even be considered well-formed.

 David Brownell wrote:
> Expat doesn't read external parameter entities, including
> "the" external subset, but it does understand that if it
> doesn't come across one of those, all entities must be
> defined through the internal subset...

 John Cowan wrote:
> All you actually have to do is to ensure that the next character
> (if not #, see above) is a NAMESTRT character, and that all characters
> until ; are either NAME or NAMESTRT characters.  There is no need (and
> in fact it is forbidden) to look up the supposed entity name anywhere.

The first two statements seem to contradict the third (the statement that
sparked my first message).  I must admit I remain a little confused about
the boundary between well-formed and valid XML documents when it comes to
general entities.  My inclination is to agree with John Cowan's
statement -- if an EntityRef is syntactically valid (see "[68]" below), why
should the parser, or any other intermediate processor, care whether the
referenced entity 'Name' exists?  However, when I went back to the XML spec,
it seems that Joshua E. Smith is indeed correct that my document is *not*
well-formed, and therefore Expat is processing it correctly.

The following references and excerpts are from Tim Bray's truly excellent
annotated XML 1.0 Specification (http://www.xml.com/axml/testaxml.htm or
http://www.xml.com/axml/target.html to omit the explanation frame).  I've
added the text in curly braces (e.g. "{43}") to describe Tim's hyperlinks.

======= Begin spec excerpt =======
4.3.2 Well-Formed Parsed Entities
An internal general parsed entity is well-formed if its replacement text
matches the production labeled content {43}. All internal parameter entities
are well-formed by definition.
======= End spec excerpt =======

======= Begin spec excerpt =======
[43]  content ::=  (element | CharData | Reference {67} | CDSect | PI |
                   Comment)*
======= End spec excerpt =======

======= Begin spec excerpt =======
[67]  Reference ::=  EntityRef | CharRef
[68]  EntityRef ::=  '&' Name ';'
    [  WFC: Entity Declared ]
    [  VC: Entity Declared ]
    [  WFC: Parsed Entity ]
    [  WFC: No Recursion ]
======= End spec excerpt =======
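
The purely syntactic check that John Cowan describes (production [68] with
the constraints ignored) might be sketched in Python as follows.  This is
only an illustration: the regular expression is an ASCII-only approximation
of the Name production, and the names is_entity_ref and needs_declaration
are hypothetical helpers, not part of Expat or any other library.

```python
import re

# ASCII-only approximation of the XML 1.0 Name production (assumption:
# the real production also admits many non-ASCII letters and digits).
_ENTITY_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9._:-]*);")

# The five entities that well-formed documents need not declare.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}


def is_entity_ref(text):
    """Return True if text is shaped like '&' Name ';' (production [68])."""
    return _ENTITY_REF.fullmatch(text) is not None


def needs_declaration(text):
    """Return True for a well-shaped reference to a non-predefined entity."""
    match = _ENTITY_REF.fullmatch(text)
    return match is not None and match.group(1) not in PREDEFINED


if __name__ == "__main__":
    for candidate in ("&copy;", "&amp;", "&1abc;", "&copy"):
        print(candidate, is_entity_ref(candidate), needs_declaration(candidate))
```

A check like this says nothing about well-formedness; it only tests the
shape of the reference, which is the distinction at issue here.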

======= Begin spec excerpt =======
4.1 Character and Entity References
Well-Formedness Constraint: Entity Declared
In a document without any DTD, a document with only an internal DTD subset
which contains no parameter entity references, or a document with
"standalone='yes'", the Name given in the entity reference must match that
in an entity declaration, except that well-formed documents need not declare
any of the following entities: amp, lt, gt, apos, quot. The declaration of a
parameter entity must precede any reference to it. Similarly, the
declaration of a general entity must precede any reference to it which
appears in a default value in an attribute-list declaration. Note that if
entities are declared in the external subset or in external parameter
entities, a non-validating processor is not obligated to read and process
their declarations; for such documents, the rule that an entity must be
declared is a well-formedness constraint only if standalone='yes'.
======= End spec excerpt =======
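
To see this constraint in action, here is a small sketch using Python's
xml.parsers.expat binding to Expat.  It assumes that a declaration is only
recognized inside a DOCTYPE internal subset, and that a bare "<!ENTITY..>"
in the prolog is rejected; I have not confirmed that this is what happened
with my own document.  The helper parse_text and the three sample documents
are made up for illustration, and the exact error text may differ between
Expat versions.

```python
import xml.parsers.expat as expat


def parse_text(document):
    """Parse document; return its character data, or an 'error: ...' string."""
    parser = expat.ParserCreate()
    chunks = []
    parser.CharacterDataHandler = chunks.append
    try:
        parser.Parse(document, True)
    except expat.ExpatError as err:
        return "error: " + expat.ErrorString(err.code)
    return "".join(chunks)


# Declaration wrapped in a DOCTYPE internal subset: the reference resolves.
declared = '<!DOCTYPE doc [<!ENTITY copy "&#169;">]><doc>&copy; 1999</doc>'
# Bare declaration in the prolog, with no enclosing DOCTYPE.
bare = '<!ENTITY copy "&#169;"><doc>&copy; 1999</doc>'
# No declaration and no DTD at all.
undeclared = '<doc>&copy; 1999</doc>'

if __name__ == "__main__":
    for label, doc in (("declared", declared), ("bare", bare),
                       ("undeclared", undeclared)):
        print(label, repr(parse_text(doc)))
```

Only the first document should parse; the other two should come back as
errors, which matches the "Entity Declared" constraint quoted above.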

According to the table in section "4.4 XML Processor Treatment of Entities
and References", an "Internal General Entity" that is "Reference[d] in
Content" is to be "Included".

======= Begin spec excerpt =======
4.4.3 Included If Validating
When an XML processor recognizes a reference to a parsed entity, in order to
validate the document, the processor must include its replacement text. If
the entity is external, and the processor is not attempting to validate the
XML document, the processor may, but need not, include the entity's
replacement text. If a non-validating parser does not include the
replacement text, it must inform the application that it recognized, but did
not read, the entity.
======= End spec excerpt =======
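
Putting the pieces together, one way the DefaultHandler idea could work is
sketched below, again with Python's xml.parsers.expat binding.  It assumes
the documents may carry a DOCTYPE naming an external subset (the file name
"html-entities.dtd" is hypothetical and is never read), so that "Entity
Declared" is no longer a well-formedness constraint and Expat hands the
unexpanded reference to the default handler instead of failing.  The helper
name collect_entity_refs is also made up.

```python
import xml.parsers.expat as expat


def collect_entity_refs(document):
    """Return (reference, length) pairs for general entity references."""
    parser = expat.ParserCreate()
    refs = []
    text = []

    def on_default(data):
        # With a character data handler installed, ordinary text does not
        # arrive here, so anything shaped like "&...;" is a reference that
        # the parser left unexpanded.
        if data.startswith("&") and data.endswith(";"):
            refs.append((data, len(data)))

    parser.CharacterDataHandler = text.append
    parser.DefaultHandler = on_default
    parser.Parse(document, True)
    return refs


# The external subset is declared but not fetched by the parser.
sample = '<!DOCTYPE doc SYSTEM "html-entities.dtd"><doc>&copy; 1999</doc>'

if __name__ == "__main__":
    print(collect_entity_refs(sample))
```

The length stored with each reference is what I would need to keep the word
offsets straight, and the raw string can be copied to the output unchanged.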

-Nik O, Content Mgmt Solutions, Jackson, Wyo.
