Unicode support

Kent Karlsson kentk@md.chalmers.se
Tue, 9 Oct 2001 12:37:27 +0200

----- Original Message -----
From: "Ashley Yakeley" <ashley@semantic.org>
To: "Kent Karlsson" <kentk@md.chalmers.se>; "Haskell List" <haskell@haskell.org>; "Libraries for Haskell List"
Sent: Tuesday, October 09, 2001 12:27 PM
Subject: Re: Unicode support

> At 2001-10-09 02:58, Kent Karlsson wrote:
> >In summary:
> >
> >    code position (=code point): a value between 0000 and 10FFFF.
> Would this be a reasonable basis for Haskell's 'Char' type?

Yes.  It's essentially UTF-32, but without being tied to 32 bits
(21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited
to 10FFFF instead of 31(!) bits) is the datatype used in some
implementations of C for wchar_t.  As I said in another e-mail,
if one does not have high efficiency concerns, UTF-32 is a rather
straightforward way of representing characters.
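
To make the "21 bits suffice" remark concrete, here is a minimal sketch.
It assumes a 'Char' whose range is the full set of code positions
0000..10FFFF, which is how GHC implements it; the names maxCodePoint and
fitsIn21Bits are made up for this illustration and are not part of any
library.

```haskell
import Data.Bits (shiftR)
import Data.Char (ord)

-- Largest code position a Char can hold.
-- Assumption: GHC-style Char, where maxBound is '\x10FFFF'.
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- True when the code position of the character needs no more than
-- 21 bits, i.e. nothing is left after shifting 21 bits away.
fitsIn21Bits :: Char -> Bool
fitsIn21Bits c = ord c `shiftR` 21 == 0
```

Under that assumption, maxCodePoint == 0x10FFFF and fitsIn21Bits maxBound
both evaluate to True, since 10FFFF is below 2^21.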

> At some point
> perhaps there should be a 'Unicode' standard library for Haskell. For
> instance:
> encodeUTF8 :: String -> [Word8];
> decodeUTF8 :: [Word8] -> Maybe String;
> encodeUTF16 :: String -> [Word16];
> decodeUTF16 :: [Word16] -> Maybe String;
> data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
> getGeneralCategory :: Char -> Maybe GeneralCategory;

There is no real need for a "Maybe" there.  Code positions that are
not yet allocated have general category Cn (so do non-characters):
      Cs Other, Surrogate
      Co Other, Private Use
      Cn Other, Not Assigned (yet)
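
A total function along these lines can be sketched as follows.  This
uses generalCategory and the GeneralCategory type as exported by
Data.Char in the present-day GHC base library, whose constructor names
differ from the ones proposed in the quoted message; getGeneralCategory
and isCn are illustrative names only.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Total: every code position has a general category, so no Maybe.
getGeneralCategory :: Char -> GeneralCategory
getGeneralCategory = generalCategory

-- Cn covers both unassigned code positions and non-characters.
isCn :: Char -> Bool
isCn c = generalCategory c == NotAssigned
```

For example, getGeneralCategory '\xD800' is Surrogate (Cs),
getGeneralCategory '\xE000' is PrivateUse (Co), and isCn '\xFFFF' is
True because FFFF is a non-character.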

> ...sorting & searching...
> ...canonicalisation...
> etc. Lots of work for someone.

Yes.  And it is lots of work (which is why I'm not volunteering
to make a quick fix: there is no quick fix).

        Kind regards
        /kent k