Tue, 9 Oct 2001 12:37:27 +0200
----- Original Message -----
From: "Ashley Yakeley" <firstname.lastname@example.org>
To: "Kent Karlsson" <email@example.com>; "Haskell List" <firstname.lastname@example.org>; "Libraries for Haskell List"
Sent: Tuesday, October 09, 2001 12:27 PM
Subject: Re: Unicode support
> At 2001-10-09 02:58, Kent Karlsson wrote:
> >In summary:
> > code position (=code point): a value between 0000 and 10FFFF.
> Would this be a reasonable basis for Haskell's 'Char' type?
Yes. It's essentially UTF-32, but without being tied to 32 bits
(21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, which has yet to be
limited to 10FFFF instead of 31(!) bits) is the datatype used in some
implementations of C for wchar_t. As I said in another e-mail,
if efficiency is not a major concern, UTF-32 is a rather
straightforward way of representing characters.
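
To make the range concrete, here is a minimal sketch, assuming a GHC-style
'Char' whose upper bound is the code position 10FFFF. The names
maxCodePoint, bitsNeeded and toCodePoint are made up for illustration and
are not part of any standard library.

```haskell
import Data.Char (chr, ord)

-- Highest code position a Char can hold (10FFFF under the GHC assumption).
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- Number of bits needed to represent a non-negative integer.
bitsNeeded :: Int -> Int
bitsNeeded 0 = 0
bitsNeeded n = 1 + bitsNeeded (n `div` 2)

-- Checked conversion from an integer to a code position:
-- Nothing for values outside 0000..10FFFF.
toCodePoint :: Int -> Maybe Char
toCodePoint n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing
```

Under that assumption, bitsNeeded maxCodePoint evaluates to 21, which is
the "21 bits suffice" figure.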
> At some point
> perhaps there should be a 'Unicode' standard library for Haskell. For
> example:
> encodeUTF8 :: String -> [Word8];
> decodeUTF8 :: [Word8] -> Maybe String;
> encodeUTF16 :: String -> [Word16];
> decodeUTF16 :: [Word16] -> Maybe String;
> data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
> getGeneralCategory :: Char -> Maybe GeneralCategory;
There is no real need for "Maybe" there. Code positions that are not yet
allocated have general category Cn (as do noncharacters):
Cs Other, Surrogate
Co Other, Private Use
Cn Other, Not Assigned (yet)
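
As an illustration of why no "Maybe" is needed, here is a small sketch
using the generalCategory function from Data.Char in GHC's base library
(a later library, used here only as an example of a total classification
function). The helper name otherCategories is hypothetical.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Every code position gets some category, so the result type is
-- GeneralCategory rather than Maybe GeneralCategory.
otherCategories :: [(Char, GeneralCategory)]
otherCategories =
  [ (c, generalCategory c)
  | c <- [ '\xD800'  -- a surrogate code position (Cs)
         , '\xE000'  -- a private-use code position (Co)
         , '\xFFFF'  -- a noncharacter, reported as not assigned (Cn)
         ]
  ]
```

In that library the three categories above are spelled Surrogate,
PrivateUse and NotAssigned.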
> ...sorting & searching...
> etc. Lots of work for someone.
Yes. And it is lots of work (which is why I'm not volunteering
to make a quick fix: there is no quick fix).
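
For reference, the simplest piece of the proposed interface, encodeUTF8,
can be sketched as follows. This is only an illustration under stated
assumptions: it takes the String -> [Word8] signature from the quoted
proposal, the helper encodeChar is hypothetical, and surrogate code
positions are not rejected or replaced, which a real library would have
to deal with. Decoding, normalization, sorting and searching are where
the real work lies and are not attempted here.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one code position as 1 to 4 UTF-8 bytes.
-- Assumption: surrogates (D800..DFFF) are passed through unchecked.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Lead byte: length tag plus the top bits of the code position.
    lead :: Int -> Int -> Word8
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six bits.
    cont :: Int -> Word8
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap encodeChar
```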