Unicode support

Kent Karlsson kentk@md.chalmers.se
Tue, 9 Oct 2001 12:37:27 +0200


----- Original Message -----
From: "Ashley Yakeley" <ashley@semantic.org>
To: "Kent Karlsson" <kentk@md.chalmers.se>; "Haskell List" <haskell@haskell.org>; "Libraries for Haskell List"
<Libraries@haskell.org>
Sent: Tuesday, October 09, 2001 12:27 PM
Subject: Re: Unicode support


> At 2001-10-09 02:58, Kent Karlsson wrote:
>
> >In summary:
> >
> >    code position (=code point): a value between 0000 and 10FFFF.
>
> Would this be a reasonable basis for Haskell's 'Char' type?

Yes.  It is essentially UTF-32, but without being tied to 32 bits
(21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, which has yet to be
limited to 10FFFF instead of 31(!) bits) is the datatype used in some
implementations of C for wchar_t.  As I said in another e-mail,
if efficiency is not a major concern, UTF-32 is a rather
straightforward way of representing characters.
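
To make the range concrete, here is a minimal sketch in Haskell.  The
names maxCodePoint, isCodePoint, toChar, fitsIn21Bits and
charCoversAllCodePoints are made up for illustration, and the last one
assumes an implementation (such as GHC) whose Char covers exactly
0000 to 10FFFF:

```haskell
import Data.Bits (shiftL)
import Data.Char (chr, ord)

-- | Largest code position, 10FFFF (hypothetical helper name).
maxCodePoint :: Int
maxCodePoint = 0x10FFFF

-- | True if the value lies in the code position range 0000..10FFFF.
isCodePoint :: Int -> Bool
isCodePoint n = n >= 0 && n <= maxCodePoint

-- | Convert an integer to a Char, rejecting values outside the range
-- instead of letting 'chr' fail at run time.
toChar :: Int -> Maybe Char
toChar n
  | isCodePoint n = Just (chr n)
  | otherwise     = Nothing

-- | 21 bits suffice: 10FFFF is below 2^21.
fitsIn21Bits :: Bool
fitsIn21Bits = maxCodePoint < (1 `shiftL` 21)

-- | Assumption: the implementation's Char ends exactly at 10FFFF.
charCoversAllCodePoints :: Bool
charCoversAllCodePoints = ord (maxBound :: Char) == maxCodePoint
```

Note that this only checks the range; it says nothing about whether a
given code position is assigned, a surrogate, or a non-character.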

> At some point
> perhaps there should be a 'Unicode' standard library for Haskell. For
> instance:
>
> encodeUTF8 :: String -> [Word8];
> decodeUTF8 :: [Word8] -> Maybe String;
> encodeUTF16 :: String -> [Word16];
> decodeUTF16 :: [Word16] -> Maybe String;
>
> data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
> getGeneralCategory :: Char -> Maybe GeneralCategory;

There is no real need for "Maybe" there: every code position has a
general category.  As-yet unallocated code positions have general
category Cn (as do non-characters):
      Cs Other, Surrogate
      Co Other, Private Use
      Cn Other, Not Assigned (yet)
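
As an illustration of a total (non-"Maybe") lookup, here is a small
sketch that uses generalCategory from Data.Char in present-day GHC
base, which is not the interface proposed above.  The helper name
otherAbbreviation is made up, and the result for unassigned code
positions depends on the Unicode version the implementation ships:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Two-letter abbreviation for the three categories listed above.
-- The lookup itself is total; 'generalCategory' returns a value for
-- every Char, so no Maybe is needed.
otherAbbreviation :: Char -> String
otherAbbreviation c = case generalCategory c of
  Surrogate   -> "Cs"
  PrivateUse  -> "Co"
  NotAssigned -> "Cn"
  _           -> "(another category)"
```

For example, otherAbbreviation '\xD800' gives "Cs", '\xE000' gives
"Co", and the non-character '\xFFFF' gives "Cn".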


> ...sorting & searching...
>
> ...canonicalisation...
>
> etc. Lots of work for someone.

Yes.  And it is lots of work (which is why I'm not volunteering
to make a quick fix: there is no quick fix).

        Kind regards
        /kent k