Unicode support
Kent Karlsson
kentk@md.chalmers.se
Tue, 9 Oct 2001 12:37:27 +0200
----- Original Message -----
From: "Ashley Yakeley" <ashley@semantic.org>
To: "Kent Karlsson" <kentk@md.chalmers.se>; "Haskell List" <haskell@haskell.org>; "Libraries for Haskell List"
<Libraries@haskell.org>
Sent: Tuesday, October 09, 2001 12:27 PM
Subject: Re: Unicode support
> At 2001-10-09 02:58, Kent Karlsson wrote:
>
> >In summary:
> >
> > code position (=code point): a value between 0000 and 10FFFF.
>
> Would this be a reasonable basis for Haskell's 'Char' type?
Yes. It's essentially UTF-32, but without the fixation to 32-bit
(21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited
to 10FFFF instead of 31(!) bits) is the datatype used in some
implementations of C for wchar_t. As I said in another e-mail,
if one does not have high efficiency concerns, UTF-32 is a rather
straightforward way of representing characters.
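
To make the range concrete, here is a minimal sketch, assuming an
implementation (such as current GHC) where 'Char' already covers the
code positions 0000 to 10FFFF. The names maxCodePoint, toCodePoint and
fitsIn21Bits are hypothetical helpers for illustration only, not part
of any standard library.

```haskell
import Data.Char (chr, ord)

-- Largest code position, as given in the summary quoted above.
maxCodePoint :: Int
maxCodePoint = 0x10FFFF

-- Hypothetical helper: turn an Int into a Char only when it lies in
-- the code position range. 'chr' itself raises an error when the
-- argument is out of range, so the range is checked first.
toCodePoint :: Int -> Maybe Char
toCodePoint n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing

-- "21 bits suffice": the largest code position is below 2^21.
fitsIn21Bits :: Bool
fitsIn21Bits = maxCodePoint < 2 ^ (21 :: Int)
```

Note that this range still includes the surrogate code positions
(D800 to DFFF), so an in-range value is not necessarily an assigned
character.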
> At some point
> perhaps there should be a 'Unicode' standard library for Haskell. For
> instance:
>
> encodeUTF8 :: String -> [Word8];
> decodeUTF8 :: [Word8] -> Maybe String;
> encodeUTF16 :: String -> [Word16];
> decodeUTF16 :: [Word16] -> Maybe String;
>
> data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
> getGeneralCategory :: Char -> Maybe GeneralCategory;
There is not really any need for "Maybe" there. As-yet-unallocated code
positions have general category Cn (as do non-characters):
Cs Other, Surrogate
Co Other, Private Use
Cn Other, Not Assigned (yet)
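
As an illustration of a total (non-"Maybe") lookup, here is a small
sketch that assumes the 'generalCategory' function and the
'GeneralCategory' type exported by Data.Char in GHC's base library;
that library postdates this message, and its constructor names differ
from the ones proposed in the quoted text. The function
describeOther is a hypothetical name used only for this example.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical helper: name the three "Other" categories listed above.
-- Every Char gets an answer, so no Maybe is needed in the result.
describeOther :: Char -> String
describeOther c = case generalCategory c of
  Surrogate   -> "Cs Other, Surrogate"
  PrivateUse  -> "Co Other, Private Use"
  NotAssigned -> "Cn Other, Not Assigned"
  _           -> "some other category"
```

For instance, the surrogate D800, the private-use code position E000
and the non-character FFFF each fall into one of these three
categories, while 'A' is reported as an uppercase letter.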
> ...sorting & searching...
>
> ...canonicalisation...
>
> etc. Lots of work for someone.
Yes. And it is lots of work (which is why I'm not volunteering
to make a quick fix: there is no quick fix).
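
For illustration only, the encoding direction of the encodeUTF8
signature quoted above can be sketched as follows. This is a minimal
sketch under stated assumptions, not a proposed library: it assumes
'Char' holds code positions up to 10FFFF, encodeChar is a hypothetical
helper, and surrogate code positions are not rejected, which a real
implementation would have to handle. Decoding, sorting and
canonicalisation are where most of the work lies and are not attempted
here.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode one code position as 1 to 4 UTF-8 bytes.
-- Assumption: surrogates (D800 to DFFF) are NOT rejected in this sketch.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Lead byte: length prefix combined with the topmost payload bits.
    lead prefix s = fromIntegral (prefix .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- Same type as the signature in the quoted text.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap encodeChar
```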
Kind regards
/kent k