Escape mechanism using release character

Rick Jelliffe ricko at allette.com.au
Sat Jul 17 09:12:31 BST 1999


From: Richard Tobin <richard at cogsci.ed.ac.uk>

>> Why is it that the well known escape mechanism of using a
>> release character (like '\') for escaping special characters
>> (eg. '<','&') not used in XML?
>
>Because XML is a subset of SGML which does not use such a mechanism.
>
>If XML had been a new system designed from scratch, it might well have
>been much simpler in many respects.  On the other hand, it would
>probably not have succeeded.

Actually, SGML does have such a mechanism: the Markup Suppress
Character. This could have been defined as "\" for XML.  I think I
remember Charles Goldfarb even raised this issue for XML during its
development.
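
For comparison, here is a minimal sketch of the escaping XML already offers without a release character. It uses Python's standard xml.etree.ElementTree purely for illustration; the sample documents are invented for this example and do not come from the thread.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same character data, "a < b & c".
docs = [
    "<p>a &lt; b &amp; c</p>",        # predefined entity references
    "<p>a &#60; b &#38; c</p>",       # numeric character references
    "<p><![CDATA[a < b & c]]></p>",   # CDATA section
]

# The parser reports identical text for all three spellings.
texts = [ET.fromstring(d).text for d in docs]
print(texts)

# On output the serializer escapes the text again by itself;
# the application never handles an escape character.
serialized = ET.tostring(ET.fromstring(docs[2]), encoding="unicode")
print(serialized)
```

A release character would add one more spelling of the same text to this list.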

The reasons against it include these:
    1) it creates three kinds of delimiting: by CDATA sections, by entity
references, and by markup suppression. XML tried to remove duplication
unless there was a good reason;

    2) programmers have a lot of difficulty coping with delimiters (witness
the appalling support for correct delimiters in first generation XML
applications);

    3) HTML and almost all SGML documents do not use this mechanism, so
you would be building in incompatibility;

    4) it creates another character with a special meaning that must be
delimited: as well as & and <, parsers must look for \ and people must
delimit it in text;

    5) the character "\" is problematic for Japanese in that the ASCII
code point for that character is used for the Yen character in ShiftJIS:
if we used that character, then it would rule out the class of dumb
applications that just understand the ASCII code points for delimiter
recognition and pass every other byte through;

    6) the character "\" is problematic in Taiwanese encodings, in that
its code value also occurs as a byte within some Big5 characters: if we
used that character, it would rule out the class of dumb applications
that just understand the ASCII code values of delimiters and pass
everything else through (there is already a potential for this problem
with [ and ] as used in CDATA sections, but "\" would be far worse);

    7) \ is often used in programming languages as an escape. As you
might know from shell languages, double delimiting is really tricky, and
if you need to triple delimit (e.g. use "\\\\" to represent "\\" to
represent \ in output) it gets complicated. So it is common practice for
markup languages to use different delimiter delimiters than the delimiter
delimiters of the embedded language; similarly it is common for XML
processing languages to use different delimiter delimiters: e.g. OmniMark
uses "%", not "\" or entities;

    8) Also, I think there is a good reason in that \ might encourage the
view that XML documents are delimited merely to fit into a pipeline of
processes: Microsoft adopted this approach for handling XML documents with
CSS stylesheets in IE5, which is why &lt;? gets treated like a processing
instruction. But this is wrong behaviour: in XML, data is not tailored to
a process; you declare what you want. So if I say &lt;? I do not want a
processing instruction start in my output: XSL gets this very right in its
approach.
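
Point 6 above can be checked at the byte level. The following is a minimal sketch, assuming Python's built-in "big5" codec; the function name naive_hits and the sample string are hypothetical, chosen only for illustration. It counts how often a "dumb" scanner that knows nothing about the encoding would see each delimiter byte.

```python
# Byte values that an encoding-unaware scanner would treat as delimiters.
BACKSLASH, LESS_THAN, AMPERSAND = 0x5C, 0x3C, 0x26

def naive_hits(data: bytes, delimiter: int) -> int:
    """Count occurrences of a delimiter byte, ignoring the encoding."""
    return data.count(bytes([delimiter]))

# Three Big5 characters whose second byte is 0x5C.
# The text itself contains no backslash, "<" or "&".
sample = "許功蓋"
encoded = sample.encode("big5")

print(encoded.hex())
print("backslash hits:", naive_hits(encoded, BACKSLASH))
print("'<' hits:", naive_hits(encoded, LESS_THAN))
print("'&' hits:", naive_hits(encoded, AMPERSAND))
```

Every backslash hit here is a false positive, while the existing delimiters < and & produce none for this sample, so a byte-level pass-through tool stays safe only as long as "\" carries no special meaning.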

Rick Jelliffe


