Unicode support in Hugs - alpha-patch available
Ross Paterson
ross@soi.city.ac.uk
Mon, 25 Aug 2003 12:18:24 +0100
[returning to the list after some discussion with Dimitry]
Just to be clear, the Unicode support under discussion comprises only:
- making ord(maxBound::Char) a lot bigger, say 0x10FFFD.
- making the character classification and case conversion functions
in the Char module work on the expanded range.
Dimitry's approach is based on mapping Unicode code points to a smaller
set of internal codes, both to conserve cell tags (which currently
include all character codes) and to reduce the size of the mapping tables.
One problem there is that there are an order of magnitude more characters
than he thought: 235963 out of 1114110 possible codes are allocated in
Unicode 4.0.1. Moreover, the Haskell Char type belongs to the Bounded,
Enum and Ix classes, and so should form a contiguous range of values with
no gaps. So I think the Unicode code points should be used internally
as well.
How should these be stored on the heap? We could use a scheme similar
to the representation of Int: values smaller than some threshold map to
cell tags, while larger values are stored in a pair.
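As a rough illustration of that two-level scheme (a toy model only, not
actual Hugs code: the names, the tag base and the discrimination test are
all placeholders for whatever Hugs already does for Int):

    /* Toy model: codes below SMALL_LIMIT are encoded directly in the cell
       word, as character tags are now; larger codes are boxed on the heap,
       as large Ints are. Distinguishing the two forms by value range here
       assumes heap addresses never fall in the tag range. */
    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef intptr_t Cell;

    #define SMALL_LIMIT   0x100     /* codes below this stay immediate */
    #define CHAR_TAG_BASE 0x1000    /* pretend tag space for immediate chars */

    typedef struct { int code; } BoxedChar;   /* stand-in for a heap pair */

    static Cell mkChar(int code) {
        if (code < SMALL_LIMIT)
            return CHAR_TAG_BASE + code;       /* immediate: tag is the code */
        BoxedChar *p = malloc(sizeof *p);      /* large: box it on the heap */
        p->code = code;
        return (Cell)p;
    }

    static int charOf(Cell c) {
        if (c >= CHAR_TAG_BASE && c < CHAR_TAG_BASE + SMALL_LIMIT)
            return (int)(c - CHAR_TAG_BASE);   /* decode the immediate form */
        return ((BoxedChar *)c)->code;         /* unbox the large code point */
    }

    int main(void) {
        assert(charOf(mkChar('a')) == 'a');            /* small: no allocation */
        assert(charOf(mkChar(0x10FFFD)) == 0x10FFFD);  /* large: heap-allocated */
        return 0;
    }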
How do we implement the conversion functions? The approach recently
added to the CVS version of GHC is to use the native libraries, which
requires the user to set the locale appropriately. More generally,
should these functions be locale-dependent at all? I appreciate that
case conversion varies between languages (e.g. the dotted and dotless i
in Turkish), but I think that kind of dependence requires the IO monad.
The Haskell Report is vague on the issue, but I think the functions
should simply perform the default mappings, just as Ord compares
character codes rather than using locale-dependent collation.
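For comparison, the native-library route looks roughly like this in C
(this only sketches the general technique using the standard wide-character
functions, not GHC's actual code); note that the results for anything
beyond ASCII depend entirely on the locale the user has set, which is the
dependence objected to above:

    /* Classification via the C library's wide-character functions. */
    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int main(void) {
        setlocale(LC_CTYPE, "");   /* honour the user's environment */
        wint_t c = 0x0130;         /* LATIN CAPITAL LETTER I WITH DOT ABOVE */
        /* in the "C" locale these typically report nothing useful;
           in a Unicode locale they give the expected answers */
        printf("isUpper: %d\n", iswupper(c) != 0);
        printf("toLower: 0x%04lx\n", (unsigned long)towlower(c));
        return 0;
    }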
The alternative is to include tables in the Hugs binary. These could be
quite compact: fewer than 800 Unicode characters have upper case mappings,
and similarly for lower case mappings. Also, adjacent Unicode characters
often belong to the same category: the full range can be divided into
fewer than 2000 contiguous subranges of characters belonging to the same
category, including gaps of unallocated characters. Applying binary
search to these tables would take a little time, but probably not much
more than the current in-Haskell implementations take. Less than 30KB
of data would be required.
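A minimal sketch of that range table and its lookup (the table below is a
tiny hand-made fragment for illustration, not real Unicode data, and the
names are placeholders):

    /* Each entry gives the first code point of a run of consecutive
       characters sharing one general category; a lookup binary-searches
       for the last entry whose start does not exceed the code. */
    #include <stdio.h>

    enum Category { Cn, Lu, Ll, Nd /* ... other Unicode categories ... */ };

    struct Range { unsigned int start; enum Category cat; };

    static const struct Range ranges[] = {
        { 0x0000, Cn },   /* controls etc. (simplified) */
        { 0x0030, Nd },   /* '0'..'9' */
        { 0x003A, Cn },
        { 0x0041, Lu },   /* 'A'..'Z' */
        { 0x005B, Cn },
        { 0x0061, Ll },   /* 'a'..'z' */
        { 0x007B, Cn },
    };

    static enum Category category(unsigned int c) {
        int lo = 0, hi = (int)(sizeof ranges / sizeof ranges[0]) - 1;
        while (lo < hi) {                     /* find last start <= c */
            int mid = (lo + hi + 1) / 2;
            if (ranges[mid].start <= c) lo = mid; else hi = mid - 1;
        }
        return ranges[lo].cat;
    }

    int main(void) {
        printf("%d %d\n", category('Z') == Lu, category('5') == Nd);
        return 0;
    }

At around 2000 entries of a few bytes each, plus the case mapping tables,
this seems consistent with the under-30KB estimate above.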
Another problem is that Hugs contains an array indexed by character values
(consCharArray) containing (: c) for each character c. With a much larger
range of character codes, one approach would be a hash table holding
a limited number of such values, built lazily and bypassed when full.
This change could be confined to the consChar() function in builtin.c.
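A sketch of what such a cache might look like (a toy only: the sizes,
probing strategy and buildConsCell stand-in are illustrative, not code
from builtin.c):

    /* Lazily built, fixed-size cache of (: c) cells, replacing the dense
       consCharArray. Open addressing with no deletion; cell == NULL marks
       an empty slot, and if no slot is found within MAX_PROBES the cache
       is bypassed and a fresh cell is built. */
    #include <stdlib.h>

    #define CACHE_SIZE 512              /* limited number of cached entries */
    #define MAX_PROBES 8                /* give up (bypass) after this many */

    typedef struct { int code; void *cell; } Entry;
    static Entry cache[CACHE_SIZE];

    /* stand-in for building the Hugs cell representing (: c) */
    static void *buildConsCell(int c) {
        int *p = malloc(sizeof *p);
        *p = c;
        return p;
    }

    void *consChar(int c) {
        for (int i = 0; i < MAX_PROBES; i++) {
            Entry *e = &cache[(unsigned)(c + i) % CACHE_SIZE];
            if (e->cell && e->code == c)
                return e->cell;          /* cache hit */
            if (!e->cell) {              /* empty slot: fill it lazily */
                e->code = c;
                e->cell = buildConsCell(c);
                return e->cell;
            }
        }
        return buildConsCell(c);         /* region of table full: bypass it */
    }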
Finally, most of this could be guarded by #ifdef UNICODE_CHARS, so the
overhead could be turned off if required.
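Roughly along these lines (UNICODE_CHARS is the flag proposed above; the
MAXCHARVAL name is a placeholder, not a claim about the existing source):

    #ifdef UNICODE_CHARS
    #define MAXCHARVAL 0x10FFFD          /* full Unicode range */
    #else
    #define MAXCHARVAL 0xFF              /* existing 8-bit range */
    #endif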