Recent XML WG decisions

Sat Sep 13 07:40:16 BST 1997

While it is not our usual policy to post decisions of the XML Working
Group to xml-dev, the last three WG meetings have seen a number of
issues decided that bear directly on current experimental XML
implementations.  Following are reports prepared by C. M.
Sperberg-McQueen and Tim Bray detailing recent decisions that will be
incorporated into the next working draft.

Jon

----------------------------------------------------------------------
 Jon Bosak, Online Information Technology Architect, Sun Microsystems
     901 San Antonio Road, MPK17-101, Palo Alto, California 94303
----------------------------------------------------------------------
  ISO/IEC JTC1/SC18/WG8::NCITS V1::Davenport::SGML Open::W3C XML WG    
            It is earlier than we think. -- Vannevar Bush
----------------------------------------------------------------------

 From: "C. M. Sperberg-McQueen" <cmsmcq at hd.uib.no>
 Subject: XML WG decisions of 27 August 1997

 The XML Work Group discussed the following questions, and made the
 decisions indicated, in the meeting of 27 August 1997.

 Present:  Jon Bosak, James Clark, Steve DeRose, Eliot Kimber,
 Eve Maler, Makoto Murata, Peter Sharpe, C. M. Sperberg-McQueen.

 1.  A decision on case folding was postponed.

 Background: The current draft XML spec requires that most names
 (i.e. generic identifiers, attribute names, IDs, IDREFs, name tokens
 in attribute values, PI targets, notation names, and document type
 names) be case-folded, while entity names are case sensitive.  It has
 been repeatedly urged that this be changed and that all names be
 case-sensitive.  The arguments are familiar:

 For case folding: since the reference concrete syntax requires case
 folding, many current users of SGML and HTML are familiar with and
 have come to expect this behavior.

 For case sensitivity: since SGML parsers are required to fold up,
 rather than down, the XML spec is inconsistent with recommended
 Unicode practice.  (Unicode recommends folding down rather than up
 since there are slightly fewer unpleasant surprises and
 inconsistencies that way.)  There is *no* rule for case folding which
 works in the culturally expected manner for all speakers of all
 alphabetic languages: a lower-case e with acute accent is (correctly)
 uppercased one way in Quebec and a different way in metropolitan
 France.  Lowercase i (with a dot) is uppercased one way in Turkish and
 another way in other languages using the Latin alphabet.
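
 The round-trip problems described above are easy to reproduce.  The
 following Python sketch is an illustration only and is not part of the
 WG report; it relies on Python's built-in, locale-independent
 str.upper() and str.lower(), and the helper name survives_round_trip
 is made up for this example.

```python
# Illustration only: default Unicode case mappings are not reversible
# and do not follow language-specific conventions.

def survives_round_trip(name: str) -> bool:
    """Return True if folding up and then down gives back the original."""
    return name.upper().lower() == name

# Turkish dotless i (U+0131) uppercases to plain "I", which then
# lowercases to dotted "i" (U+0069), so the original letter is lost.
dotless_i = "\u0131"
assert dotless_i.upper() == "I"
assert not survives_round_trip(dotless_i)

# German sharp s (U+00DF) uppercases to the two-letter string "SS".
sharp_s = "\u00df"
assert sharp_s.upper() == "SS"
assert not survives_round_trip(sharp_s)

# A plain lower-case ASCII name is unaffected.
assert survives_round_trip("item")
```

 A processor that compares names without folding never has to choose
 among such mappings.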

 A strong majority of those participating felt that we should make XML
 case sensitive and drop case folding, but in view of the sensitive
 nature of the decision, it was decided to postpone the decision until
 a larger fraction of the work group was present.

 2.  XML characters range from #x0 to #x10FFFF.

 Decision: Legal XML characters are those representable in UTF-16 /
 Unicode 2.0, i.e. those in the first seventeen planes of ISO/IEC 10646.
 Unanimous.

 Rationale: The current spec says that XML characters may include any
 character defined by ISO/IEC 10646.  Currently, that standard defines
 characters only within the Basic Multilingual Plane, each of which can
 be represented by a string of 16 bits; in principle, however, ISO/IEC
 10646 defines a 31-bit character space, and production 2 accordingly
 defines Character as covering the range #x0 to #x7FFFFFFF, with some
 gaps for forbidden characters.

 XML processors, however, are not required to support the flat 32-bit
 character encoding UCS-4, only the 16- and 8-bit encodings of UCS-2
 and UTF-8.  (The latter can represent all the characters of the 31-bit
 character space, but UCS-2 cannot.)  In many places, the XML spec
 suggests, or at least allows incautious readers to believe, that XML
 characters are only 16 bits wide.

 Either way, it's important to eliminate the ambiguity in the spec.

 In favor of restricting XML characters to 16 bits: it simplifies life
 for users of Java and other tools.  It seems clear that the full 31-bit
 space of 10646 will not be needed, even for extremely specialized
 applications, in the foreseeable future.

 In favor of defining XML characters to be 31 bits wide: 16 bits is
 manifestly too few for anyone working with historical texts in Han
 characters.  Politically, it would be unwise to give the impression
 that only the Basic Multilingual Plane is of importance.  The
 surrogate method, while clever, is clearly a hack which demonstrates
 that the original Unicode claim (16 bits is enough to build an
 absolutely flat character space which will last for all time) has
 fallen apart under the pressure of fact; the surrogate method
 abandons the flat character space which is one of the most important
 advantages of Unicode.

 The compromise (BMP plus the next 16 planes) appears
   - well understood
   - compatible with Java and other tools which assume 16-bit characters
   - sufficient for realistic expectations (even the most extensive of
     known collections of historical Chinese characters is unlikely to take
     much more than one of the additional planes; even the user area is
     sufficiently large, with 131,072 character positions)
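
 The arithmetic behind these figures can be checked with a short
 Python sketch.  The constant and function names are hypothetical, and
 the sketch assumes that the "user area" means the two private-use
 planes (15 and 16).

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # the BMP (plane 0) plus planes 1 through 16

# Highest code point reachable with seventeen planes.
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1

# Assumption: the "user area" is planes 15 and 16 taken together.
PRIVATE_USE_POSITIONS = 2 * PLANE_SIZE


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that contains code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside the seventeen planes: {code_point:#x}")
    return code_point >> 16


assert MAX_CODE_POINT == 0x10FFFF
assert PRIVATE_USE_POSITIONS == 131072
assert plane_of(0x3000) == 0       # ideographic space is in the BMP
assert plane_of(0x10FFFF) == 16    # last code point, last plane
```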

 3.  Processors must support UTF-16, not just UCS-2.

 Background: the current draft spec says (4.3.3): "All XML processors
 must be able to read entities in either UTF-8 or UCS-2."  It has been
 proposed to change this to require support for UTF-8 and UTF-16 (which
 is UCS-2 plus support for the surrogate-character mechanism by which
 characters outside the Basic Multilingual Plane may be encoded).

 Decision: (i) XML processors must support 16-bit data streams (i.e.
 UTF-16) for input.  (ii) They must not corrupt surrogate characters.
 (iii) If the processor uses a 16-bit buffer or a 16-bit interface to
 the downstream application, it must correctly represent numeric
 character references to non-BMP characters as pairs of surrogate
 characters.  Unanimous.

 Rationale: since all name characters in XML are in the Basic
 Multilingual Plane, characters outside the BMP can only appear in
 XML documents as data.  Since an XML processor is required to do
 nothing more to data than store it and pass it to the downstream
 application without corrupting it, no special handling is required for
 surrogate characters.  The only new requirement is that processors
 understand the surrogate-character mechanism for characters outside
 the BMP, and use it, when necessary, to handle numeric character
 references correctly.
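
 As a rough illustration of point (iii), the sketch below expands a
 numeric character reference into 16-bit code units, using a surrogate
 pair when the character lies outside the BMP.  The function names are
 invented for this example, and validation is deliberately minimal; it
 is a sketch of the standard UTF-16 arithmetic, not of any particular
 processor.

```python
from typing import List, Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a non-BMP code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a non-BMP code point: {code_point:#x}")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def expand_char_ref(ref: str) -> List[int]:
    """Expand '&#NNN;' or '&#xHHH;' into a list of 16-bit code units."""
    if not (ref.startswith("&#") and ref.endswith(";")) or len(ref) < 4:
        raise ValueError(f"not a numeric character reference: {ref!r}")
    body = ref[2:-1]
    code_point = int(body[1:], 16) if body[0] in "xX" else int(body)
    if code_point < 0x10000:
        return [code_point]            # BMP character: one code unit
    return list(to_surrogate_pair(code_point))


assert expand_char_ref("&#65;") == [0x41]
assert expand_char_ref("&#x12345;") == [0xD808, 0xDF45]
```

 The pair can be cross-checked against Python's own encoder:
 chr(0x12345).encode("utf-16-be") yields the same two code units.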

 4.  XML will refer to Unicode 2.0 and ISO/IEC 10646 with Am. 1-7.

 The current draft spec refers to Unicode 2.0 and ISO/IEC 10646 with
 Amendments 1 through 5.  It has been suggested (a) that XML should refer
 *only* to Unicode, and (b) that the reference should be to "the current
 version" of Unicode, so that as Unicode is revised, XML automatically
 accepts the revisions.

 Decision:  refer to 10646 with Amendments 1 through 7, but otherwise
 retain the current reference.  I.e. do not drop the reference to
 ISO/IEC 10646, and do not phrase the reference so as to incorporate
 changes to Unicode automatically.  Unanimous.

 Rationale: the agreement between ISO/IEC JTC1/SC2 and the Unicode
 Consortium to keep Unicode and 10646 synchronized is extremely
 important to all users.  A joint reference to both standards makes
 clear to both parties that we, as users, wish them to honor that
 agreement.  A reference solely to Unicode would imply clearly that XML
 would follow Unicode even if Unicode were to diverge from ISO/IEC
 10646.  The joint reference makes clear our intent: if the Unicode
 Consortium and SC2 fail to keep the two standards in synch, then XML
 is not guaranteed to follow either of them.

 Reference to as yet unpublished standards (which is what reference to
 "the most recent version" amounts to) is unwise because there is and
 can be no guarantee that revisions in Unicode and 10646 will not
 require corresponding revisions to the XML spec.

 5.  Encoding of external text entities is kept as is.

 It has been suggested that by allowing external entities to be in
 different character encodings, XML is incompatible with ISO 8879,
 which does not allow this.

 The WG unanimously reaffirmed its belief that the current draft spec
 is in fact compatible with ISO 8879 under what is sometimes called the
 'new' character model.  SGML documents must have a single document
 character set declaration and thus a single document character set,
 but this reflects the output from, not the input to, the entity
 manager, and is thus independent of the character encoding encountered
 in the actual data stream of the external text entity.

 6.  Ideographic space is not white space.

 Decision (unanimous): ideographic space (#x3000) will be removed from
 the non-terminals S and PubidCharacter.

 Rationale:  Ideographic space corresponds more closely to the
 no-break space (#xA0, &nbsp;) than to the standard space character
 (#x20).  #xA0 is not allowed in S, and neither should ideographic
 space be.  It is unlikely, with current standard input methods for
 kanji, that any operator would unintentionally or accidentally insert an
 ideographic (#x3000) rather than a Latin (#x20) space within a tag.
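
 In implementation terms this means a processor cannot lean on a
 general-purpose "is this whitespace?" test.  The Python sketch below
 assumes that, once #x3000 is removed, S consists of space, tab, CR,
 and LF (as in the later XML 1.0 Recommendation); the names XML_S_CHARS
 and is_xml_space are hypothetical.

```python
# Assumption: S is limited to #x20, #x9, #xD and #xA.
XML_S_CHARS = frozenset("\u0020\u0009\u000d\u000a")


def is_xml_space(ch: str) -> bool:
    """Return True if ch is a single character allowed in S."""
    return ch in XML_S_CHARS


assert is_xml_space(" ")
assert is_xml_space("\t")
assert not is_xml_space("\u3000")   # ideographic space
assert not is_xml_space("\u00a0")   # no-break space

# Python's own str.isspace() is broader and accepts both of the
# characters rejected above, so it is not a substitute for S.
assert "\u3000".isspace()
assert "\u00a0".isspace()
```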

 7.  Binding sources of information for character encodings will be
 specified.

 The current draft spec says nothing about the priority of various
 sources of information regarding character encodings.  Some
 participants (notably Gavin Nicol and Makoto Murata) have argued
 that this should be specified.

 Decision:  The spec should include wording to the following effect:

      If an XML document or entity is in a file, the Byte-Order Mark
    and encoding-declaration PI are used (if present) to determine
    the character encoding.  All other heuristics and sources of
    information are solely for error recovery.

      If an XML document is delivered via the HTTP protocol with a
    MIME type of text/xml, then the HTTP header determines the
    character encoding method; all other heuristics and sources of
    information are solely for error recovery.

      If an XML document is delivered via the HTTP protocol with a
    MIME type of application/xml, then the Byte-Order Mark and
    encoding-declaration PI are used (if present) to determine the
    character encoding.  All other heuristics and sources of
    information are solely for error recovery.
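
 The three rules above might be sketched in Python as follows.  This
 is one possible reading, not wording from the spec: the function
 names are hypothetical, only UTF-16 Byte-Order Marks are recognized,
 and the fallback to UTF-8 when neither a BOM nor an encoding
 declaration is present is an assumption of this sketch (as is falling
 back to the entity itself when a text/xml response carries no
 charset).

```python
import re
from typing import Optional

_UTF16_BOMS = (b"\xfe\xff", b"\xff\xfe")

# Matches an encoding declaration near the start of an ASCII-compatible
# entity, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_ENCODING_DECL = re.compile(
    rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._\-]*)[\"']",
    re.IGNORECASE,
)


def _from_entity(data: bytes) -> str:
    """Use the BOM, then the encoding declaration, if present."""
    if data[:2] in _UTF16_BOMS:
        return "UTF-16"
    match = _ENCODING_DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii")
    return "UTF-8"  # assumption: default when nothing is declared


def determine_encoding(data: bytes,
                       mime_type: Optional[str] = None,
                       http_charset: Optional[str] = None) -> str:
    """Pick the binding source of encoding information.

    mime_type is None for an entity read from a file.
    """
    if mime_type == "text/xml" and http_charset:
        return http_charset        # the HTTP header is binding
    return _from_entity(data)      # file or application/xml


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
assert determine_encoding(doc) == "ISO-8859-1"
assert determine_encoding(doc, "text/xml", "UTF-8") == "UTF-8"
assert determine_encoding(doc, "application/xml", "UTF-8") == "ISO-8859-1"
```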

 -C. M. Sperberg-McQueen

 From: "C. M. Sperberg-McQueen" <cmsmcq at hd.uib.no>
 Subject: XML WG decisions of 3 September 1997

 The XML Work Group met today (3 Sept 1997) and made the decisions 
 described below.  Present were Jon Bosak (JB), Tim Bray (TB), James
 Clark (JC), Dan Connolly (DC), Steve DeRose (SJD), Paul Grosso (PG), 
 Dave Hollander (DH), Eliot Kimber (EK), Murray Maloney (MMa), Makoto 
 Murata (MMu), Joel Nava (JN), Jean Paoli (JP), Peter Sharpe (PS), and 
 Michael Sperberg-McQueen (MSM).

 1.  Procedures for determination of character encoding to be 
 described in an appendix.

 Background:  last week's report of decisions (31 August, posting 
 from U35395 at UICVM.UIC.EDU), included as item 7 a decision regarding 
 "Binding sources of information for character encodings".  The WG
 revisited the issue, noted that in fact no formal vote on it had
 been taken (error in the report), and discussed whether such rules
 belong in the XML language spec or not.  

 Against inclusion:  the rules really apply to the delivery of XML in 
 very specific protocol environments, and should be included in the 
 specification of the protocol.  XML will be delivered by many protocols, 
 some of them not yet invented; the language spec should not have to be 
 revised every time a new protocol is deployed or invented.  

 For inclusion:  such conventions are important for encouraging 
 interoperability of XML software.  Conforming processors reading 
 the same material in the same environment should make the same 
 decisions about the character encoding.

 Decision:  The rules for locating binding information about the character
 encoding of XML entities (reported last week) will be described
 in an appendix.  They will be accompanied by a note making clear
 that the rules about HTTP service properly belong in the RFC defining
 the MIME types text/xml and application/xml, and that when those
 RFCs are available their text will supersede the recommendations
 of the appendix.

 The wording given in the posting of 31 August will be changed by
 replacing the phrases 'XML document or entity' and 'XML document' 
 with the phrase 'XML entity'.  (It has been argued that the term
 'entity' is not currently well defined in the XML spec; if the usage 
 of the term is later revised, this occurrence may be changed.)

 In favor:  all present.

 2.  A decision on case-folding was postponed again.

 A summary of the issues and a request for discussion by the SIG
 will be posted shortly.

 3.  XML processors to normalize CR, LF, and CRLF to LF.

 Background:  the current draft XML spec says nothing about whether 
 or how XML processors or applications should normalize the common
 line-break sequences CR, LF, and CRLF.  

 For normalization:  since the three sequences are intended, in practice,
 to have the same meaning, they can be normalized without loss of
 useful information.  If the XML processor does not normalize these
 sequences, every single downstream XML application will be forced to
 do so; experience shows that relying on them to do so will result in
 broken applications and inconsistent behavior.

 Against normalization:  right now the spec has no concept of line or
 line break; there is no need to introduce one, so for the sake of
 economy (and clarity) none should be introduced.

 For normalizing to LF:  thanks to C's standard IO model, it's what 
 most program libraries provide, and thus what most programs and most 
 programmers expect.

 For normalizing to CRLF:  it's more consistent with the specifications
 governing the Web.  Last time anybody looked at the ASCII spec, CRLF
 was the preferred form of this information.

 Against CRLF:  specifications?  On the Web?

 Decision:  When an XML processor encounters any of the character
 sequences CR (UTF-16 x000D), LF (UTF-16 x000A), or CR LF (UTF-16
 x000D x000A), the processor must pass a single LF character to the
 downstream application.  

 (Note:  this formulation of the decision presupposes that the set of 
 information which XML processors may or must make visible to downstream 
 applications will be described more fully than it is in the current 
 draft spec.  If the WG decides against such a description, this 
 substantive decision will need to be expressed in some other form.
 If the processor disappears from the XML language specification, as
 has been proposed, this decision may be expressed as a constraint on
 whether the differences among line-break sequences in the input
 stream are 'visible' or 'significant'.)
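
 The normalization decided above amounts to very little code.  The
 Python sketch below is a minimal illustration operating on an
 already-decoded string; the function name is hypothetical.

```python
def normalize_line_breaks(text: str) -> str:
    """Map CR LF, lone CR, and lone LF to a single LF each."""
    # Replace CR LF first so that the pair yields one LF, not two.
    return text.replace("\r\n", "\n").replace("\r", "\n")


assert normalize_line_breaks("a\r\nb\rc\nd") == "a\nb\nc\nd"
# A lone CR followed by CR LF is two line breaks, hence two LFs.
assert normalize_line_breaks("\r\r\n") == "\n\n"
```

 The order of the two replacements matters: handling the lone CR first
 would turn every CR LF into two LF characters.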

 -C. M. Sperberg-McQueen
  University of Illinois at Chicago
  tei at uic.edu

 From: Tim Bray <tbray at textuality.com>
 Subject: XML WG decisions of Wed. Sep. 10

 The XML WG met on Wed. Sep. 10th.  Present: Bosak, Kimber, Murata,
 Clark, Sperberg-McQueen, Wood, Nava, Bos, Maler, Bray, Tigue, Maloney,
 Paoli, DeRose.

 Errors in discussion summaries are, as usual, mine.

 1. Discussion of case sensitivity

 Few new arguments arose in the discussion of case sensitivity, aside
 from Steve DeRose's observation that disallowing case folding will,
 by removing the possibility that attribute values are case-folded,
 reduce the number of instances where the results of parsing can
 be affected by the presence/absence of a DTD.  (Note that the 
 handling of white space can still be affected in the case where 
 attribute values are known to be tokenized, so the problem hasn't
 entirely gone away).

 This is a summary of points made in a brief last-chance-to-speak-
 your-mind go-around:

 For Case Sensitivity: 
 - XML will rarely be created by hand and when it happens, it'll be by 
   experts.  
 - This is a chance to do the right thing early in XML's history and
   avoid living with a compromise forever.  
 - Case sensitivity is very easy to specify and to understand.  
 - It would be nice to be able to map case-sensitive objects, for example 
   DSSSL flow objects, to element types.  
 - Internationalization experts are unanimously against folding.  
 - Pleasant experiences with case-sensitive programming languages.  
 - Casefolding problems are truly vile.  
 - It will be easy to make XML processors recognize typical user errors 
   and provide helpful error messages.

 For Case Folding: 
 - It would be the right thing to do if we were starting from scratch, but 
   it's too late now.  
 - There will be serious difficulties dealing with the XML-in-HTML 
   scenario.  
 - It will make it impossible for HTML ever to be specified as an 
   application of XML as opposed to SGML.  
 - The XML spec has been out for nine months now; it's late in the game 
   to be making this change.

 The Question: Modify the XML specification to achieve the effect of
 NAMECASE GENERAL NO in SGML.

 Yes: Bosak Kimber Murata Clark Sperberg-McQueen Nava Bos
      Bray Tigue Maloney Paoli DeRose
 No: Wood
 Abstain: Maler

 So XML is now case-sensitive.
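
 One practical consequence is that start and end tags must now match
 exactly.  The Python sketch below shows how a checker might flag tags
 that differ only in case.  It is a toy written for this illustration:
 the regular expression is a rough approximation of tag syntax, the
 function name check_tags is hypothetical, and it does not describe
 how any existing parser is implemented.

```python
import re
from typing import List

# Rough approximation of start, end, and empty-element tags.  Comments,
# PIs, and declarations are skipped because a name must follow "<".
_TAG = re.compile(r"<(/?)([A-Za-z_:][\w:.\-]*)([^<>]*?)(/?)>")


def check_tags(xml_text: str) -> List[str]:
    """Return error messages for end tags that do not match exactly."""
    errors: List[str] = []
    open_names: List[str] = []
    for match in _TAG.finditer(xml_text):
        is_end, name, _attrs, is_empty = match.groups()
        if is_end:
            if not open_names:
                errors.append(f"End tag without start tag: {name}")
                continue
            expected = open_names.pop()
            if expected == name:
                continue
            # lower() is used only to word the diagnostic, never to
            # decide whether the tags match.
            if expected.lower() == name.lower():
                errors.append(
                    f"Start/End tags differ only in case: {expected}/{name}")
            else:
                errors.append(f"Mismatched tags: {expected}/{name}")
        elif not is_empty:
            open_names.append(name)
    return errors


assert check_tags("<ITEM><p>x</p><br/></ITEM>") == []
assert check_tags("<p>text</P>") == [
    "Start/End tags differ only in case: p/P"]
```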

 1a: Since XML is case sensitive, we must specify the case of
 our keywords, i.e. <!ELEMENT or <!element.  Names not recorded,
 vote was
 Upper:  7  Lower: 3  Abstain: 4
 (In this vote, some of the abstains should be taken as don't-cares).

 2. Chris Maden's suggestion that NOTATION System Identifiers 
 should be MIME types.  The WG liked the idea, but declined to 
 modify the spec to achieve this effect; among other things,
 URLs and MIME types are not syntactically distinguishable.  It
 was the feeling of the group that it would be desirable that a 
 new URL scheme be created to allow a URL to locate a MIME type.

 3. Discussion of the proposition that the XML spec should say
 more about what the processor passes the App.  John Tigue has
 volunteered to write an XML Grove Plan; while there is little 
 sentiment that this should be made normative, it might serve 
 usefully as either a separate application note or an appendix.

 The WG agreed that the editors should enrich the language of the
 spec sufficiently to make it clear (as it does with PIs and
 comments) what a processor may and must make available to an
 application.

 Cheers, Tim Bray tbray at textuality.com http://www.textuality.com/

 PS: For your amusement, I attach the output produced by a 
 moments-ago-updated Lark when asked to process the XML spec:
 Loading
 Testing: Lark V0.92 Copyright (c) 1997 Tim Bray.
  All rights reserved; the right to use these class files for any purpose
  is hereby granted to everyone.
 Parsing...
 Syntax error at line 127:57: Start/End tags differ only in case: p/P
 Syntax error at line 367:23: Start/End tags differ only in case: ITEM/item
 Syntax error at line 369:51: Start/End tags differ only in case: ITEM/item
 Syntax error at line 370:69: Start/End tags differ only in case: item/ITEM
 Syntax error at line 454:4: Start/End tags differ only in case: P/p
 Syntax error at line 457:50: Start/End tags differ only in case: p/P
 Syntax error at line 750:50: Start/End tags differ only in case: termdef/TERMDEF
 Syntax error at line 752:34: Start/End tags differ only in case: lhs/LHS
 Syntax error at line 755:71: Start/End tags differ only in case: prod/PROD
 Syntax error at line 955:43: Start/End tags differ only in case: P/p
 Syntax error at line 956:7: Start/End tags differ only in case: ITEM/item
 Syntax error at line 959:19: Start/End tags differ only in case: p/P
 Syntax error at line 959:26: Start/End tags differ only in case: item/ITEM
 Syntax error at line 991:7: Start/End tags differ only in case: list/LIST
 Syntax error at line 1031:22: Start/End tags differ only in case: P/p
 Syntax error at line 1039:4: Start/End tags differ only in case: p/P
 Syntax error at line 1062:4: Start/End tags differ only in case: P/p
 Syntax error at line 1137:31: Start/End tags differ only in case: p/P
 Syntax error at line 1140:4: Start/End tags differ only in case: p/P
 Syntax error at line 1207:4: Start/End tags differ only in case: P/p
 Syntax error at line 1278:4: Start/End tags differ only in case: P/p
 Syntax error at line 1289:60: Start/End tags differ only in case: p/P
 Syntax error at line 1453:7: Start/End tags differ only in case: DIV2/div2
 Syntax error at line 1544:4: Start/End tags differ only in case: P/p
 Syntax error at line 1586:4: Start/End tags differ only in case: P/p
 Syntax error at line 1652:14: Start/End tags differ only in case: P/p
 Syntax error at line 1655:19: Start/End tags differ only in case: p/P
 Syntax error at line 1675:4: Start/End tags differ only in case: P/p
 Syntax error at line 1706:22: Start/End tags differ only in case: P/p
 Syntax error at line 1721:36: Start/End tags differ only in case: p/P
 Syntax error at line 1726:45: Start/End tags differ only in case: P/p
 Syntax error at line 1935:40: Start/End tags differ only in case: P/p
 Syntax error at line 2072:4: Start/End tags differ only in case: P/p
 Syntax error at line 2376:8: Start/End tags differ only in case: SCRAP/scrap
 Syntax error at line 2377:4: Start/End tags differ only in case: P/p
 Syntax error at line 2438:8: Start/End tags differ only in case: SCRAP/scrap
 Syntax error at line 2530:7: Start/End tags differ only in case: div3/DIV3
 Syntax error at line 2595:8: Start/End tags differ only in case: SCRAP/scrap
 Syntax error at line 2665:10: Start/End tags differ only in case: p/P
 Syntax error at line 2858:7: Start/End tags differ only in case: DIV2/div2
 Syntax error at line 3650:19: Start/End tags differ only in case: p/P
 Done.

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)