NameChar (was: Editing text)

Fri Nov 28 12:30:32 GMT 1997

Peter Murray-Rust writes:

 > I am writing an editor for JUMBO where I expect most of the characters like
 > '"<>& to have been converted into entities (e.g. &apos;, etc.). [I do not
 > expect any raw <![CDATA[ sections in the text - they will have been
 > transformed by the parser.] On the other hand there may be other entities
 > which have not been expanded (e.g. &foo;).
 > 
 > My understanding of the spec [71] is that an entity is a Name and that Names
 > [4], [5] and [6] are constructed from letters, digits and numbers. In
 > determining whether something is an entity, I have to look for a string of
 > the form: '&'(Letter | '_' | ':') (NameChar)* ';'
 > NameChars are Digits, MiscNames and Letters.
 >
 > Appendix B lists six and a half pages of potential NameChars for which
 > JUMBO has to test - is this correct? If so I have code of the form:
 > 
 > public boolean isNameChar(char ch) {
 >     return <six pages of conditionals>;
 > }
 > 
 > I assume there is no short cut...

I have not checked whether Java's character classes line up exactly
with the ones in the XML spec, but there is a good chance that you
could use Java's built-in java.lang.Character.isLetterOrDigit()
predicate to eliminate most of the tests, something like this:

  public boolean isNameChar (char ch) {
    return java.lang.Character.isLetterOrDigit(ch) || isMiscChar(ch);
  }

  public boolean isMiscChar (char ch) {
    switch(ch) {
    case '.':
    case '-':
    case '_':
    case ':':
      return true;
    default:
      return isCombining(ch) || isIgnorable(ch) || isExtender(ch);
    }
  }

  public boolean isIgnorable (char ch) {
    int c = (int)ch;
    return ((c >= 0x200c && c <= 0x200f) ||
            (c >= 0x202a && c <= 0x202e) ||
            (c >= 0x206a && c <= 0x206f));
  }

  public boolean isExtender (char ch) {
    int c = (int)ch;
    switch (c) {
    case 0x00b7:
    case 0x02d0:
    case 0x02d1:
    case 0x0387:
    case 0x0640:
    case 0x0e46:
    case 0x0ec6:
    case 0x3005:
      return true;
    default:
      return ((c >= 0x3031 && c <= 0x3035) ||
              (c >= 0x309b && c <= 0x309e) ||
              (c >= 0x30fc && c <= 0x30fe));
    }
  }

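  public boolean isCombining (char ch) {
    // lots of stuff: the combining-character ranges from the spec
    // still have to be filled in here; fail loudly until they are
    throw new UnsupportedOperationException("isCombining() not filled in");
  }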

The only long one left is isCombining(), which I haven't bothered to
fill in.  Before anyone uses these, please check them against both the
XML spec and the Java Language Spec, to see if isLetterOrDigit()
really aligns properly.
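
As a stopgap for isCombining(), here is a minimal sketch that leans on
java.lang.Character.getType() instead of the spec's ranges.  It assumes
that the Unicode "non-spacing mark" and "combining spacing mark"
categories are close enough to the spec's combining characters; that is
my assumption, not something I have verified against the spec, and the
class name CombiningSketch is made up for the example.

```java
public class CombiningSketch {

    // Approximation only: accepts any character whose Unicode general
    // category is a non-spacing or spacing combining mark.  This has
    // NOT been checked against the combining-character ranges listed
    // in the XML spec, so treat it as a placeholder.
    public static boolean isCombining(char ch) {
        int type = Character.getType(ch);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    public static void main(String[] args) {
        char acute = (char) 0x0301;   // COMBINING ACUTE ACCENT
        System.out.println(isCombining(acute));
        System.out.println(isCombining('a'));
    }
}
```

If exact conformance matters, the range tests from the spec would have
to replace the getType() call; the method signature could stay the same.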

 > I applaud the work of the WG on the Internationalisation and I don't want
 > to detract from it. What I would suggest is that because of the extreme
 > likelihood of error if individuals do try to hack their own isNameChar(),
 > and because if ever this list is revised software will be invalidated, that
 > the WG, or W3C or whoever, maintain an isNameChar() routine in the common
 > languages (C, C++, Java) so that we know we shall all be working with the
 > same one.

Not a bad idea, but it is unlikely that everyone would want to use the
same one.  The fastest solution would be to maintain a static 65,536
(or at least 32,768) entry array, with bit flags for different
character properties.  That would be fine for big programs, but it
would kill Java applets and other size-sensitive applications unless
it were already built into the Java environment.
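
To make the table idea concrete, here is a rough sketch of a
65,536-entry array with one bit flag per character property.  The class
name CharFlags, the flag names, and the way the table is filled are all
invented for the illustration: it populates the flags from Java's own
predicates plus the four punctuation characters, whereas a real version
would presumably be generated from the tables in the spec.

```java
public class CharFlags {

    // Hypothetical property flags, one bit each.
    public static final int NAME_START = 0x01;
    public static final int NAME_CHAR  = 0x02;

    // One byte of flags for every 16-bit character value.
    private static final byte[] FLAGS = new byte[65536];

    static {
        for (int c = 0; c < FLAGS.length; c++) {
            char ch = (char) c;
            // Assumption: Java's predicates stand in for the spec's
            // letter and digit classes; this is not verified.
            if (Character.isLetter(ch) || ch == '_' || ch == ':') {
                FLAGS[c] |= NAME_START;
            }
            if (Character.isLetterOrDigit(ch)
                    || ch == '.' || ch == '-' || ch == '_' || ch == ':') {
                FLAGS[c] |= NAME_CHAR;
            }
        }
    }

    // Each test is a single array lookup and a bit mask.
    public static boolean isNameStart(char ch) {
        return (FLAGS[ch] & NAME_START) != 0;
    }

    public static boolean isNameChar(char ch) {
        return (FLAGS[ch] & NAME_CHAR) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isNameChar('-'));
        System.out.println(isNameStart('9'));
    }
}
```

The lookups are as cheap as they can be, but the 64K table is exactly
the size cost mentioned above, whether it is built at class-load time as
here or shipped as precomputed data.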

All the best,

David

-- 
David Megginson                 ak117 at freenet.carleton.ca
Microstar Software Ltd.         dmeggins at microstar.com
      http://home.sprynet.com/sprynet/dmeggins/
