Unicode support in Hugs - alpha-patch available

Ross Paterson ross@soi.city.ac.uk
Sat, 23 Aug 2003 16:43:41 +0100


On Sun, Aug 17, 2003 at 11:35:31PM -0400, Dimitry Golubovsky wrote:
> Anyone interested in Unicode support in Hugs (what it lacks so far) 
> please check out this URL:
> 
> http://www.golubovsky.org/software/hugs-patch/article.html
> 
> I have written a patch for the November 2002 release of Hugs that 
> enables internal handling of Unicode characters by Hugs. The URL above 
> points to the article I wrote to explain the details. The article also 
> contains links to download the patch itself and the 
> demonstration/testing program.

As a general comment: your patch converts the Unicode Database into an
internal table in Hugs for use by primitives.  An alternative approach is
used by a recent addition of Unicode support to GHC: use the native wide
character functions iswupper(), towupper(), etc., where these are available.
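
To illustrate the second approach, here is a minimal sketch assuming a
C99 <wctype.h>.  The wrapper names charIsUpper and charToUpper are made
up for this example and are not taken from Hugs or GHC.  Results for
non-ASCII characters depend on the locale selected with setlocale(), and
wchar_t values are only guaranteed to be Unicode code points where the C
library defines __STDC_ISO_10646__.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical wrappers (not actual Hugs or GHC code): classify and
 * convert a character code by deferring to the C library instead of
 * consulting a built-in copy of the Unicode Database. */
static int charIsUpper(unsigned long c) {
    return iswupper((wint_t)c) != 0;
}

static unsigned long charToUpper(unsigned long c) {
    return (unsigned long)towupper((wint_t)c);
}
```

The trade-off is that no table has to be shipped with the interpreter,
but behaviour then varies with the platform and the current locale.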

The current CVS version of Hugs also includes an optimization of the
whatis() code, which may clash with your changes.  However, the speed
gains from that change are modest -- increased functionality may be
more important.

> [The] number of distinct characters defined by the Unicode Database
> (UnicodeData.txt available from www.unicode.org) is 15100 for the most
> recent version (4.0) with Unicode character values ranging from 0x0000
> to 0x10FFFD.  So, position of a character in the Unicode character table
> may be used by Hugs as internal character code.

UnicodeData.txt may contain that many character lines, but it includes
pairs of lines like

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FA5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;

which together describe 20902 characters (every code from 4E00 to 9FA5),
and there are several more pairs like this.  Actually, the character
space is fairly dense, at least up to FFFF, so a character's position
in the file does not identify it.  The compression approach could be
used for character property tables, but not for the internal
representation of character codes.
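
As a rough sketch of how a compressed property table could work while
characters keep their real code values, the following stores one entry
per run of code points and finds a character by binary search.  The
struct, the function names and the three sample rows are invented for
illustration; a real table would be generated from UnicodeData.txt.

```c
#include <stddef.h>

/* One entry per run of consecutive code points sharing a category. */
struct CharRange {
    unsigned long first;
    unsigned long last;
    const char   *category;
};

/* Illustrative excerpt only, sorted by 'first'. */
static const struct CharRange rangeTable[] = {
    { 0x0041, 0x005A, "Lu" },   /* A..Z */
    { 0x0061, 0x007A, "Ll" },   /* a..z */
    { 0x4E00, 0x9FA5, "Lo" },   /* <CJK Ideograph, First>..<Last> */
};

/* Number of characters covered by one entry. */
static unsigned long rangeSize(const struct CharRange *r) {
    return r->last - r->first + 1;
}

/* Binary search by code point; returns NULL if no entry covers c. */
static const char *charCategory(unsigned long c) {
    size_t lo = 0;
    size_t hi = sizeof rangeTable / sizeof rangeTable[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < rangeTable[mid].first)
            hi = mid;
        else if (c > rangeTable[mid].last)
            lo = mid + 1;
        else
            return rangeTable[mid].category;
    }
    return NULL;
}
```

Here the CJK pair costs a single table entry, yet rangeSize() on that
entry still reports the 20902 characters it stands for.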

Also, the consCharArray array, used to implement (c:), is of size
NUM_CHARS -- this could be rather large.  (Perhaps this could be
filled lazily?)
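
One possible shape for lazy filling, purely as a sketch: the Cell
typedef, the NO_CELL marker, the value chosen for NUM_CHARS and the
buildConsChar() helper below are all stand-ins invented for this
example, not the actual Hugs definitions.  An entry is built the first
time it is asked for and reused afterwards.

```c
#include <stdlib.h>

typedef long Cell;                    /* stand-in for the Hugs cell type  */
#define NUM_CHARS 0x110000UL          /* assumed: whole Unicode range     */
#define NO_CELL   0                   /* assumed "not built yet" marker   */

static Cell *consCharCache;           /* hypothetical lazily filled table */
static unsigned long cellsBuilt;      /* how many entries were built      */

/* Placeholder for whatever really builds the partial application (c:). */
static Cell buildConsChar(unsigned long c) {
    cellsBuilt++;
    return (Cell)(c + 1);             /* dummy value, never NO_CELL */
}

/* Return the cached cell for c, building it on first use.
 * The caller must ensure c < NUM_CHARS. */
static Cell consChar(unsigned long c) {
    if (consCharCache == NULL) {
        /* calloc zero-fills, so every slot starts out as NO_CELL */
        consCharCache = (Cell *)calloc(NUM_CHARS, sizeof(Cell));
        if (consCharCache == NULL)
            abort();
    }
    if (consCharCache[c] == NO_CELL)
        consCharCache[c] = buildConsChar(c);
    return consCharCache[c];
}
```

This only avoids building the cells up front; the array itself is still
NUM_CHARS entries long, and a real version would also have to keep the
cached cells visible to the garbage collector.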