Confusion about conditional sections
roddey at us.ibm.com
Fri Feb 19 20:09:57 GMT 1999
Ok, I'm confused about a conditional section issue that some of my
colleagues were discussing...
Once you see a conditional ignore section, can you effectively just scan
for <![ and ]]> parts of the text inside there without actually doing
regular parsing? Is there a reason that this cannot be done?
The logic is basically this (and assumes that we've already entered the
body of an ignore section):
depth = 1;    // we are already inside the body of one ignore section
while (true)
{
    if (skipped char '<')
    {
        if (skipped char '!') and (skipped char '[')
            depth++;
    }
    else if (skipped char ']')
    {
        if (skipped char ']') and (skipped char '>')
        {
            depth--;
            if (!depth)
                return;
        }
    }
    else if (next char is not valid XML char)
    {
        emit error
    }
    else
    {
        skip next char
    }
}
Here 'skipped char' means that the next character in the content was
consumed if it matched the target character. I can't help but think that
this logic would fail to deal with a number of issues, but I can't think of
any offhand. What is missing from this picture?
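
For anyone who wants to experiment, the loop above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names skip_ignore_section and is_xml_char are made up for this sketch, the whole text is assumed to be in memory as a string, and a three-character lookahead replaces the one-character 'skipped char' steps so that a run such as ']]]>' is still recognized.

```python
def is_xml_char(ch):
    """True if ch falls in the ranges of the XML 1.0 Char production."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def skip_ignore_section(text, pos=0):
    """Scan from pos, assumed to be just inside an ignore section body,
    and return the index just past the ']]>' that closes that section."""
    depth = 1  # already inside one ignore section
    while pos < len(text):
        if text.startswith("<![", pos):
            depth += 1          # nested conditional section opens
            pos += 3
        elif text.startswith("]]>", pos):
            depth -= 1          # innermost open section closes
            pos += 3
            if depth == 0:
                return pos
        elif not is_xml_char(text[pos]):
            raise ValueError(f"invalid XML character at offset {pos}")
        else:
            pos += 1            # ordinary character, just skip it
    raise ValueError("unterminated ignore section")
```

Note that the sketch never looks at quotes or literals, which is exactly the shortcut being asked about.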
Also, does the specification of a conditional section basically imply that
you cannot have a ']]>' sequence anywhere in an ignored section, even if
it's in a literal?
So something like this:
<![IGNORE[
<!ENTITY MyEntity "The ]]> text of my entity">
]]>
would fail according to the spec because the ]]> sequence is not allowed
inside an ignored conditional section, even in a place where it is
otherwise legal, such as in a literal value. Is this correct? The above
logic is kind of dependent upon this being true, I would think, since
otherwise it could be fooled. If this is true, it would seem awfully
weird that changing INCLUDE to IGNORE would cause a correct document to
break in this way.
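
To make the concern concrete, here is a small Python sketch of what a literal-unaware scan does with that example. It assumes there is no nested '<![' in the body, so a plain search for the first ']]>' stands in for the full depth-counting loop; the variable names are invented for the illustration.

```python
OPEN = "<![IGNORE["
CLOSE = "]]>"

doc = OPEN + '\n<!ENTITY MyEntity "The ]]> text of my entity">\n' + CLOSE

# A scan that does not track literals stops at the first ']]>' it finds.
body_start = len(OPEN)
end = doc.find(CLOSE, body_start)

ignored = doc[body_start:end]        # what the scan treats as ignored
leftover = doc[end + len(CLOSE):]    # what is handed back to the DTD parser

print(repr(ignored))
print(repr(leftover))
```

The ignored part ends in the middle of the entity literal, and the remainder of the declaration plus the real ']]>' is left over as ordinary DTD content, where it would presumably be reported as a syntax error.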
The spec says that you must parse even the ignored section, but it doesn't
say to what extent. The logic above does 'parse' the text in that it looks
at every character in there. But it's attempting to do a very low-calorie
and fast parse based on knowledge of what can be in a conditional section.
Since there is no identifying name at the end of a conditional section to
assure that it's correctly aligned, doesn't the above logic correctly
maintain all the required state? It would not, though, seem to catch
something like this:
<![IGNORE[
<![SOMETHINGSTUPID[ ]]>
]]>
Since it actually does not look at what follows the <![ part, which is
theoretically supposed to either be INCLUDE or IGNORE or CDATA, right? That
wouldn't cause the logic not to work, but it would miss catching an error.
I'm not sure that's really a serious issue though. Once it's changed to
INCLUDE the inner error would then be caught. How responsible does a parser
have to be about catching syntax errors inside ignored sections, since the
ignored text is not really part of the document?
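
If a parser did want to flag this, the extra check might look something like the Python sketch below. It is hypothetical: the function name unexpected_keywords is invented here, the allowed set simply mirrors the keywords guessed at above rather than anything confirmed from the spec, and parameter entity references used as the keyword are not handled.

```python
import re

# Matches '<![' plus optional whitespace, then captures the keyword that
# follows, stopping at the next bracket, angle bracket, or whitespace.
SECTION_OPEN = re.compile(r"<!\[\s*([^\[\]<>\s]*)")


def unexpected_keywords(body, allowed=("INCLUDE", "IGNORE", "CDATA")):
    """Return (offset, keyword) pairs for each '<![' in body whose keyword
    is not in the allowed set. The allowed set is an assumption."""
    return [(m.start(), m.group(1))
            for m in SECTION_OPEN.finditer(body)
            if m.group(1) not in allowed]


print(unexpected_keywords("\n<![SOMETHINGSTUPID[ ]]>\n"))
```

Such a check could run alongside the depth counting without changing where the ignored section ends; it would only add diagnostics.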
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)