Character classification

Istvan Cseri istvanc at microsoft.com
Fri Sep 5 18:02:59 BST 1997


You are right, it is a well known technique, Java JDK1.1 in fact uses
very similar code for character classification. I replaced that with the
simple 256 member array lookup (for characters in that range) and it
sped up the parser ~10%.

Istvan

> ----------
> From: 	James Clark[SMTP:jjc at jclark.com]
> Reply To: 	James Clark
> Sent: 	Thursday, September 04, 1997 10:04 PM
> To: 	xml-dev at ic.ac.uk
> Subject: 	Re: Character classification
> 
> Istvan Cseri wrote:
> > 
> > For better speed I would suggest an alternative solution: use a
> quick
> > array lookup for characters below 256 and go to the more expensive
> > method above... It will do wonders with your parser.
> 
> Except of course when you're parsing non-Latin scripts.
> 
> There's another technique which exploits the fact that characters on
> the
> same page often have similar properties, and this is true even more so
> for characters in the same column.
> 
> The idea is to have a three-level table, the first level with 256
> entries, the second and third levels with 16 entries.  The entries for
> the first and second levels are a (possibly null) pointer to a
> sub-table
> plus a value; the entries for the third level are just values. To look
> up the value for a character, you use the high 8 bits to index into
> the
> first-level table; if the pointer part of the entry is null, then
> return
> the value part of entry; otherwise use the sub-table table addressed
> by
> the pointer; use the next 4 bits to index into that in a similar way,
> and, if necessary, the bottom 4 bits to index into the bottom table.
> 
> This is I believe quite a well-known technique; I got it from Glenn
> Adams.
> 
> You can use this to implement case-folding by storing the difference
> between a character and its upper-case equivalent modulo 2^16.
> 
> There's a C++ implementation of this in SP in include/CharMap.h.
> 
> James
> 
> 
> xml-dev: A list for W3C XML Developers
> Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
> To unsubscribe, send to majordomo at ic.ac.uk the following message;
> unsubscribe xml-dev
> List coordinator, Henry Rzepa (rzepa at ic.ac.uk)
> 

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)




More information about the Xml-dev mailing list