Recent XML WG decisions
Jon Bosak
Jon.Bosak at eng.Sun.COM
Sat Sep 13 07:40:16 BST 1997
While it is not our usual policy to post decisions of the XML Working
Group to xml-dev, the last three WG meetings have seen a number of
issues decided that bear directly on current experimental XML
implementations. Following are reports prepared by C. M.
Sperberg-McQueen and Tim Bray detailing recent decisions that will be
incorporated into the next working draft.
Jon
----------------------------------------------------------------------
Jon Bosak, Online Information Technology Architect, Sun Microsystems
901 San Antonio Road, MPK17-101, Palo Alto, California 94303
----------------------------------------------------------------------
ISO/IEC JTC1/SC18/WG8::NCITS V1::Davenport::SGML Open::W3C XML WG
It is earlier than we think. -- Vannevar Bush
----------------------------------------------------------------------
From: "C. M. Sperberg-McQueen" <cmsmcq at hd.uib.no>
Subject: XML WG decisions of 27 August 1997
The XML Work Group discussed the following questions, and made the
decisions indicated, in the meeting of 27 August 1997.
Present: Jon Bosak, James Clark, Steve DeRose, Eliot Kimber,
Eve Maler, Makoto Murata, Peter Sharpe, C. M. Sperberg-McQueen.
1. A decision on case folding was postponed.
Background: The current draft XML spec requires that most names
(i.e. generic identifiers, attribute names, IDs, IDREFs, name tokens
in attribute values, PI targets, notation names, and document type
names) be case-folded, while entity names are case sensitive. It has
been repeatedly urged that this be changed and that all names be
case-sensitive. The arguments are familiar:
For case folding: since the reference concrete syntax requires case
folding, many current users of SGML and HTML are familiar with and
have come to expect this behavior.
For case sensitivity: since SGML parsers are required to fold up,
rather than down, the XML spec is inconsistent with recommended
Unicode practice. (Unicode recommends folding down rather than up
since there are slightly fewer unpleasant surprises and
inconsistencies that way.) There is *no* rule for case folding which
works in the culturally expected manner for all speakers of all
alphabetic languages: a lower-case e with acute accent is (correctly)
uppercased one way in Quebec and a different way in metropolitan
France. Lowercase i (with a dot) is uppercased one way in Turkish and
another way in other languages using the Latin alphabet.
A strong majority of those participating felt that we should make XML
case sensitive and drop case folding, but in view of the sensitive
nature of the decision, it was decided to postpone the decision until
a larger fraction of the work group was present.
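The round-trip problem described in the background above can be seen in a small sketch. This is an illustration only: it uses Python's built-in str.upper() and str.lower(), which apply Unicode's locale-independent default case mappings, not the folding rules of SGML or of the draft XML spec. The helper name round_trips is hypothetical.

```python
# Illustration only: Python's default (locale-independent) Unicode case
# mappings, not the SGML or draft-XML folding rules.

def round_trips(s: str) -> bool:
    """Return True if upper-casing and then lower-casing gives back s."""
    return s.upper().lower() == s

samples = {
    "plain ASCII name": "item",
    "Turkish dotless i": "\u0131",  # LATIN SMALL LETTER DOTLESS I
    "German sharp s": "\u00df",     # LATIN SMALL LETTER SHARP S
}

for label, text in samples.items():
    folded = text.upper()
    print(f"{label}: {text!r} -> {folded!r} -> {folded.lower()!r} "
          f"(round trip ok: {round_trips(text)})")
```

Note that the default mapping also uppercases "i" to "I", which is not what a Turkish reader expects; no single locale-free table satisfies every language.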
2. XML characters range from #x0 to #x10FFFF.
Decision: Legal XML characters are those representable in UTF-16 /
Unicode 2.0, i.e. those in the first seventeen planes of ISO/IEC 10646.
Unanimous.
Rationale: The current spec says that XML characters may include any
character defined by ISO/IEC 10646. Currently, that standard defines
characters only within the Basic Multilingual Plane, each of which can
be represented by a string of 16 bits; in principle, however, ISO/IEC
10646 defines a 31-bit character space, and production 2 accordingly
defines Character as covering the range #x0 to #x7FFFFFFF, with some
gaps for forbidden characters.
XML processors, however, are not required to support the flat 32-bit
character encoding UCS-4, only the 16- and 8-bit encodings of UCS-2
and UTF-8. (The latter can represent all the characters of the 31-bit
character space, but UCS-2 cannot.) In many places, the XML spec
suggests, or at least allows incautious readers to believe, that XML
characters are only 16 bits wide.
Either way, it's important to eliminate the ambiguity in the spec.
In favor of restricting XML characters to 16 bits: it simplifies life
for users of Java and other tools. It seems clear that the full 31-bit
space of 10646 will not be needed, even for extremely specialized
applications, in the foreseeable future.
In favor of defining XML characters to be 31 bits wide: 16 bits is
manifestly too few for anyone working with historical texts in Han
characters. Politically, it would be unwise to give the impression
that only the Basic Multilingual Plane is of importance. The
surrogate method, while clever, is clearly a hack which demonstrates
that the original Unicode claim (16 bits is enough to build an
absolutely flat character space which will last for all time) has
fallen apart under the pressure of fact; the surrogate method
abandons the flat character space which is one of the most important
advantages of Unicode.
The compromise (BMP plus the next 16 planes) appears
- well understood
- compatible with Java and other tools which assume 16-bit characters
- sufficient for realistic expectations (even the most extensive of
known collections of historical Chinese characters is unlikely to take
much more than one of the additional planes; even the user area is
sufficiently large, with 131,072 character positions)
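The arithmetic behind the compromise can be checked with a short sketch. It assumes nothing beyond the range stated in the decision (#x0 to #x10FFFF) and planes of 65,536 positions each; it ignores the gaps for forbidden characters that production 2 carves out of the range. The names plane_of and in_decided_range are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF   # upper bound stated in the decision
PLANE_SIZE = 0x10000        # 65,536 positions per plane

def in_decided_range(code_point: int) -> bool:
    """True if code_point lies in #x0..#x10FFFF (forbidden gaps ignored)."""
    return 0 <= code_point <= MAX_CODE_POINT

def plane_of(code_point: int) -> int:
    """Return the plane number (0 = BMP) of a code point in the range."""
    if not in_decided_range(code_point):
        raise ValueError(f"outside #x0..#x10FFFF: {code_point:#x}")
    return code_point // PLANE_SIZE

planes = (MAX_CODE_POINT + 1) // PLANE_SIZE
print(planes)                 # number of planes covered by the range
print(2 * PLANE_SIZE)         # positions in two whole planes
print(plane_of(0x20000))      # plane of an example non-BMP code point
```

Two whole planes give the 131,072 positions quoted for the user area.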
3. Processors must support UTF-16, not just UCS-2.
Background: the current draft spec says (4.3.3): "All XML processors
must be able to read entities in either UTF-8 or UCS-2." It has been
proposed to change this to require support for UTF-8 and UTF-16 (which
is UCS-2 plus support for the surrogate-character mechanism by which
characters outside the Basic Multilingual Plane may be encoded).
Decision: (i) XML processors must support 16-bit data streams (i.e.
UTF-16) for input. (ii) They must not corrupt surrogate characters.
(iii) If the processor uses a 16-bit buffer or a 16-bit interface to
the downstream application, it must correctly represent numeric
character references to non-BMP characters as pairs of surrogate
characters. Unanimous.
Rationale: since all name characters in XML are in the Basic
Multilingual Plane, characters outside the BMP can only appear in
XML documents as data. Since an XML processor is required to do
nothing more to data than store it and pass it to the downstream
application without corrupting it, no special handling is required for
surrogate characters. The only new requirement is that processors
understand the surrogate-character mechanism for characters outside
the BMP, and use it, when necessary, to handle numeric character
references correctly.
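Part (iii) of the decision amounts to the standard UTF-16 surrogate calculation. The following is a minimal sketch of that calculation, not text from the spec; the function name to_utf16_units is hypothetical, and U+20000 is used only as an example of a code point outside the BMP.

```python
def to_utf16_units(code_point: int) -> list:
    """Return the UTF-16 code units (one or two) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not representable in UTF-16: {code_point:#x}")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if code_point < 0x10000:
        return [code_point]              # BMP: a single 16-bit unit
    offset = code_point - 0x10000        # 20-bit value to split
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]

# A numeric character reference such as &#x20000; would reach a
# 16-bit downstream interface as this pair of units.
print([hex(u) for u in to_utf16_units(0x20000)])
```

A processor with a 16-bit buffer would apply this when expanding a numeric character reference; data already encoded as surrogate pairs only needs to be passed through untouched.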
4. XML will refer to Unicode 2.0 and ISO/IEC 10646 with Am. 1-7.
The current draft spec refers to Unicode 2.0 and ISO/IEC 10646 with
Amendments 1 through 5. It has been suggested (a) that XML should refer
*only* to Unicode, and (b) that the reference should be to "the current
version" of Unicode, so that as Unicode is revised, XML automatically
accepts the revisions.
Decision: refer to 10646 with Amendments 1 through 7, but otherwise
retain the current reference. I.e. do not drop the reference to
ISO/IEC 10646, and do not phrase the reference so as to incorporate
changes to Unicode automatically. Unanimous.
Rationale: the agreement between ISO/IEC JTC1/SC2 and the Unicode
Consortium to keep Unicode and 10646 synchronized is extremely
important to all users. A joint reference to both standards makes
clear to both parties that we, as users, wish them to honor that
agreement. A reference solely to Unicode would imply clearly that XML
would follow Unicode even if Unicode were to diverge from ISO/IEC
10646. The joint reference makes clear our intent: if the Unicode
Consortium and SC2 fail to keep the two standards in synch, then XML
is not guaranteed to follow either of them.
Reference to as yet unpublished standards (which is what reference to
"the most recent version" amounts to) is unwise because there is and
can be no guarantee that revisions in Unicode and 10646 will not
require corresponding revisions to the XML spec.
5. Encoding of external text entities is kept as is.
It has been suggested that by allowing external entities to be in
different character encodings, XML is incompatible with ISO 8879,
which does not allow this.
The WG unanimously reaffirmed its belief that the current draft spec
is in fact compatible with ISO 8879 under what is sometimes called the
'new' character model. SGML documents must have a single document
character set declaration and thus a single document character set,
but this reflects the output from, not the input to, the entity
manager, and is thus independent of the character encoding encountered
in the actual data stream of the external text entity.
6. Ideographic space is not white space.
Decision (unanimous): ideographic space (#x3000) will be removed from
the non-terminals S and PubidCharacter.
Rationale: Ideographic space corresponds more closely to the
no-break space (#xA0, &nbsp;) than to the standard space character
(#x20). #xA0 is not allowed in S, and neither should ideographic
space be. It is unlikely, with current standard input methods for
kanji, that any operator would accidentally insert an
ideographic (#x3000) rather than a Latin (#x20) space within a tag.
7. Binding sources of information for character encodings will be
specified.
The current draft spec says nothing about the priority of various
sources of information regarding character encodings. Some
participants (notably Gavin Nicol and Makoto Murata) have argued
that this should be specified.
Decision: The spec should include wording to the following effect:
If an XML document or entity is in a file, the Byte-Order Mark
and encoding-declaration PI are used (if present) to determine
the character encoding. All other heuristics and sources of
information are solely for error recovery.
If an XML document is delivered via the HTTP protocol with a
MIME type of text/xml, then the HTTP header determines the
character encoding method; all other heuristics and sources of
information are solely for error recovery.
If an XML document is delivered via the HTTP protocol with a
MIME type of application/xml, then the Byte-Order Mark and
encoding-declaration PI are used (if present) to determine the
character encoding. All other heuristics and sources of
information are solely for error recovery.
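The priority rules in the proposed wording can be sketched as a small decision function. This is a rough illustration under several assumptions: the delivery labels ("file", "http:text/xml", "http:application/xml"), the function name binding_encoding, and the returned encoding labels are all hypothetical, and the pattern for the encoding declaration is modelled on the later XML 1.0 form and works only for ASCII-compatible byte streams. Error-recovery heuristics are deliberately left out.

```python
import re
from typing import Optional

# Assumption: declaration syntax modelled on the later XML 1.0 form;
# only usable when the bytes are ASCII-compatible.
_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']",
    re.IGNORECASE,
)

def binding_encoding(delivery: str, data: bytes,
                     http_charset: Optional[str] = None) -> Optional[str]:
    """Return the encoding named by the binding source, or None if absent."""
    if delivery == "http:text/xml":
        # The HTTP header alone is binding for text/xml.
        return http_charset
    if delivery in ("file", "http:application/xml"):
        # Byte-Order Mark first, then the encoding declaration.
        if data.startswith(b"\xfe\xff"):
            return "UTF-16BE"
        if data.startswith(b"\xff\xfe"):
            return "UTF-16LE"
        match = _DECL.match(data)
        return match.group(1).decode("ascii") if match else None
    raise ValueError(f"unknown delivery context: {delivery}")

doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
print(binding_encoding("http:text/xml", doc, http_charset="UTF-8"))
print(binding_encoding("file", doc))
```

The same bytes yield different answers in the two calls, which is the point of naming one binding source per delivery context.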
-C. M. Sperberg-McQueen
From: "C. M. Sperberg-McQueen" <cmsmcq at hd.uib.no>
Subject: XML WG decisions of 3 September 1997
The XML Work Group met today (3 Sept 1997) and made the decisions
described below. Present were Jon Bosak (JB), Tim Bray (TB), James
Clark (JC), Dan Connolly (DC), Steve DeRose (SJD), Paul Grosso (PG),
Dave Hollander (DH), Eliot Kimber (EK), Murray Maloney (MMa), Makoto
Murata (MMu), Joel Nava (JN), Jean Paoli (JP), Peter Sharpe (PS), and
Michael Sperberg-McQueen (MSM).
1. Procedures for determination of character encoding to be
described in an appendix.
Background: last week's report of decisions (31 August, posting
from U35395 at UICVM.UIC.EDU) included as item 7 a decision regarding
"Binding sources of information for character encodings". The WG
revisited the issue, noted that in fact no formal vote on it had
been taken (error in the report), and discussed whether such rules
belong in the XML language spec or not.
Against inclusion: the rules really apply to the delivery of XML in
very specific protocol environments, and should be included in the
specification of the protocol. XML will be delivered by many protocols,
some of them not yet invented; the language spec should not have to be
revised every time a new protocol is deployed or invented.
For inclusion: such conventions are important for encouraging
interoperability of XML software. Conforming processors reading
the same material in the same environment should make the same
decisions about the character encoding.
Decision: The rules for locating binding information about the character
encoding of XML entities (reported last week) will be described
in an appendix. They will be accompanied by a note making clear
that the rules about HTTP service properly belong in the RFCs defining
the MIME types text/xml and application/xml, and that when those
RFCs are available their text will supersede the recommendations
of the appendix.
The wording given in the posting of 31 August will be changed by
replacing the phrases 'XML document or entity' and 'XML document'
with the phrase 'XML entity'. (It has been argued that the term
'entity' is not currently well defined in the XML spec; if the usage
of the term is later revised, this occurrence may be changed.)
In favor: all present.
2. A decision on case-folding was postponed again.
A summary of the issues and a request for discussion by the SIG
will be posted shortly.
3. XML processors to normalize CR, LF, and CRLF to LF.
Background: the current draft XML spec says nothing about whether
or how XML processors or applications should normalize the common
line-break sequences CR, LF, and CRLF.
For normalization: since the three sequences are intended, in practice,
to have the same meaning, they can be normalized without loss of
useful information. If the XML processor does not normalize these
sequences, every single downstream XML application will be forced to
do so; experience shows that relying on them to do so will result in
broken applications and inconsistent behavior.
Against normalization: right now the spec has no concept of line or
line break; there is no need to introduce one, so for the sake of
economy (and clarity) none should be introduced.
For normalizing to LF: thanks to C's standard IO model, it's what
most program libraries provide, and thus what most programs and most
programmers expect.
For normalizing to CRLF: it's more consistent with the specifications
governing the Web. Last time anybody looked at the ASCII spec, CRLF
was the preferred form of this information.
Against CRLF: specifications? On the Web?
Decision: When an XML processor encounters any of the character
sequences CR (UTF-16 x000D), LF (UTF-16 x000A), or CR LF (UTF-16
x000D x000A), the processor must pass a single LF character to the
downstream application.
(Note: this formulation of the decision presupposes that the set of
information which XML processors may or must make visible to downstream
applications will be described more fully than it is in the current
draft spec. If the WG decides against such a description, this
substantive decision will need to be expressed in some other form.
If the processor disappears from the XML language specification, as
has been proposed, this decision may be expressed as a constraint on
whether the differences among line-break sequences in the input
stream are 'visible' or 'significant'.)
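The normalization decided above is simple to sketch for text that is already fully in memory. The function name normalize_line_breaks is hypothetical; a streaming processor would additionally have to cope with a CR LF pair split across two buffer reads, which this sketch does not attempt.

```python
def normalize_line_breaks(text: str) -> str:
    """Map CR LF, lone CR, and lone LF to a single LF each."""
    # Replace the two-character sequence first so that CR LF becomes
    # one LF rather than two.
    return text.replace("\r\n", "\n").replace("\r", "\n")

sample = "one\r\ntwo\rthree\nfour"
print(repr(normalize_line_breaks(sample)))
```

The order of the two replacements matters: handling the lone CR first would turn every CR LF into two line breaks.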
-C. M. Sperberg-McQueen
University of Illinois at Chicago
tei at uic.edu
From: Tim Bray <tbray at textuality.com>
Subject: XML WG decisions of Wed. Sep. 10
The XML WG met on Wed. Sep. 10th. Present: Bosak, Kimber, Murata,
Clark, Sperberg-McQueen, Wood, Nava, Bos, Maler, Bray, Tigue, Maloney,
Paoli, DeRose.
Errors in discussion summaries are, as usual, mine.
1. Discussion of case sensitivity
Few new arguments arose in the discussion of case sensitivity, aside
from Steve DeRose's observation that disallowing case folding will,
by removing the possibility that attribute values are case-folded,
reduce the number of instances where the results of parsing can
be affected by the presence/absence of a DTD. (Note that the
handling of white space can still be affected in the case where
attribute values are known to be tokenized, so the problem hasn't
entirely gone away).
This is a summary of points made in a brief last-chance-to-speak-
your-mind go-around:
For Case Sensitivity:
- XML will rarely be created by hand and when it happens, it'll be by
experts.
- This is a chance to do the right thing early in XML's history and
avoid living with a compromise forever.
- Case sensitivity is very easy to specify and to understand.
- It would be nice to be able to map case-sensitive objects, for example
DSSSL flow objects, to element types.
- Internationalization experts are unanimously against folding.
- Pleasant experiences with case-sensitive programming languages.
- Casefolding problems are truly vile.
- It will be easy to make XML processors recognize typical user errors
and provide helpful error messages.
For Case Folding:
- It would be the right thing to do if we were starting from scratch, but
it's too late now.
- There will be serious difficulties dealing with the XML-in-HTML
scenario.
- It will make it impossible for HTML ever to be specified as an
application of XML as opposed to SGML.
- The XML spec has been out for nine months now; it's late in the game
to be making this change.
The Question: Modify the XML specification to achieve the effect of
NAMECASE GENERAL NO in SGML.
Yes: Bosak Kimber Murata Clark Sperberg-McQueen Nava Bos
Bray Tigue Maloney Paoli DeRose
No: Wood
Abstain: Maler
So XML is now case-sensitive.
1a: Since XML is case sensitive, we must specify the case of
our keywords, i.e. <!ELEMENT or <!element. Names not recorded,
vote was
Upper: 7 Lower: 3 Abstain: 4
(In this vote, some of the abstains should be taken as don't-cares).
2. Chris Maden's suggestion that NOTATION System Identifiers
should be MIME types. The WG liked the idea, but declined to
modify the spec to achieve this effect; among other things,
URLs and MIME types are not syntactically distinguishable. It
was the feeling of the group that it would be desirable that a
new URL scheme be created to allow a URL to locate a MIME type.
3. Discussion of the proposition that the XML spec should say
more about what the processor passes the App. John Tigue has
volunteered to write an XML Grove Plan; while there is little
sentiment that this should be made normative, it might serve
usefully as either a separate application note or an appendix.
The WG agreed that the editors should enrich the language of the
spec sufficiently to make it clear (as it does with PIs and
comments) what a processor may and must make available to an
application.
Cheers, Tim Bray tbray at textuality.com http://www.textuality.com/
PS: For your amusement, I attach the output produced by a
moments-ago-updated Lark when asked to process the XML spec:
Loading
Testing: Lark V0.92 Copyright (c) 1997 Tim Bray.
All rights reserved; the right to use these class files for any purpose
is hereby granted to everyone.
Parsing...
Syntax error at line 127:57: Start/End tags differ only in case: p/P
Syntax error at line 367:23: Start/End tags differ only in case: ITEM/item
Syntax error at line 369:51: Start/End tags differ only in case: ITEM/item
Syntax error at line 370:69: Start/End tags differ only in case: item/ITEM
Syntax error at line 454:4: Start/End tags differ only in case: P/p
Syntax error at line 457:50: Start/End tags differ only in case: p/P
Syntax error at line 750:50: Start/End tags differ only in case: termdef/TERMDEF
Syntax error at line 752:34: Start/End tags differ only in case: lhs/LHS
Syntax error at line 755:71: Start/End tags differ only in case: prod/PROD
Syntax error at line 955:43: Start/End tags differ only in case: P/p
Syntax error at line 956:7: Start/End tags differ only in case: ITEM/item
Syntax error at line 959:19: Start/End tags differ only in case: p/P
Syntax error at line 959:26: Start/End tags differ only in case: item/ITEM
Syntax error at line 991:7: Start/End tags differ only in case: list/LIST
Syntax error at line 1031:22: Start/End tags differ only in case: P/p
Syntax error at line 1039:4: Start/End tags differ only in case: p/P
Syntax error at line 1062:4: Start/End tags differ only in case: P/p
Syntax error at line 1137:31: Start/End tags differ only in case: p/P
Syntax error at line 1140:4: Start/End tags differ only in case: p/P
Syntax error at line 1207:4: Start/End tags differ only in case: P/p
Syntax error at line 1278:4: Start/End tags differ only in case: P/p
Syntax error at line 1289:60: Start/End tags differ only in case: p/P
Syntax error at line 1453:7: Start/End tags differ only in case: DIV2/div2
Syntax error at line 1544:4: Start/End tags differ only in case: P/p
Syntax error at line 1586:4: Start/End tags differ only in case: P/p
Syntax error at line 1652:14: Start/End tags differ only in case: P/p
Syntax error at line 1655:19: Start/End tags differ only in case: p/P
Syntax error at line 1675:4: Start/End tags differ only in case: P/p
Syntax error at line 1706:22: Start/End tags differ only in case: P/p
Syntax error at line 1721:36: Start/End tags differ only in case: p/P
Syntax error at line 1726:45: Start/End tags differ only in case: P/p
Syntax error at line 1935:40: Start/End tags differ only in case: P/p
Syntax error at line 2072:4: Start/End tags differ only in case: P/p
Syntax error at line 2376:8: Start/End tags differ only in case: SCRAP/scrap
Syntax error at line 2377:4: Start/End tags differ only in case: P/p
Syntax error at line 2438:8: Start/End tags differ only in case: SCRAP/scrap
Syntax error at line 2530:7: Start/End tags differ only in case: div3/DIV3
Syntax error at line 2595:8: Start/End tags differ only in case: SCRAP/scrap
Syntax error at line 2665:10: Start/End tags differ only in case: p/P
Syntax error at line 2858:7: Start/End tags differ only in case: DIV2/div2
Syntax error at line 3650:19: Start/End tags differ only in case: p/P
Done.
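A diagnostic of the kind shown in the output above can be approximated in a few lines. This is a guess at the idea, not Lark's actual code: the function name tag_mismatch_message and the second message text are hypothetical, and the comparison by lower() is used only as a heuristic for wording the error, not as a matching rule.

```python
from typing import Optional

def tag_mismatch_message(start: str, end: str) -> Optional[str]:
    """Return an error message for mismatched tag names, or None if they match."""
    if start == end:
        return None  # exact, case-sensitive match: no error
    if start.lower() == end.lower():
        # Likely a habit carried over from case-folding SGML or HTML.
        return f"Start/End tags differ only in case: {start}/{end}"
    return f"Start/End tags do not match: {start}/{end}"

for pair in [("p", "P"), ("ITEM", "item"), ("div2", "div2"), ("p", "item")]:
    print(pair, "->", tag_mismatch_message(*pair))
```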
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)