Unicode support

Ashley Yakeley ashley@semantic.org
Tue, 9 Oct 2001 04:07:43 -0700


At 2001-10-09 03:37, Kent Karlsson wrote:

>> >    code position (=code point): a value between 0000 and 10FFFF.
>>
>> Would this be a reasonable basis for Haskell's 'Char' type?
>
>Yes.  It's essentially UTF-32, but without the fixation to 32-bit
>(21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited
>to 10FFFF instead of 31(!) bits) is the datatype used in some
>implementations of C for wchar_t.  As I said in another e-mail,
>if one does not have high efficiency concerns, UTF-32 is a rather
>straightforward way of representing characters.

Would it be worthwhile restricting Char to the 0-10FFFF range, just as a
Word8 is restricted to 0-FF even though, in GHC at least, it is stored
in 32 bits?
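
A minimal sketch of what such a restriction could look like, written as
a wrapper around Int rather than as a change to Char itself. The names
CodePoint, mkCodePoint, codePointToChar and charToCodePoint are
hypothetical, chosen only for illustration; the point is just that the
stored width and the permitted range are separate matters, as with Word8.

```haskell
import Data.Char (chr, ord)

-- Hypothetical wrapper (not a standard type): an Int that is known to
-- lie in the code point range 0..0x10FFFF, however wide its storage is.
newtype CodePoint = CodePoint Int
  deriving (Eq, Ord, Show)

-- Highest code position, 10FFFF.
maxCodePoint :: Int
maxCodePoint = 0x10FFFF

-- Smart constructor: values outside 0..0x10FFFF are rejected.
mkCodePoint :: Int -> Maybe CodePoint
mkCodePoint n
  | n >= 0 && n <= maxCodePoint = Just (CodePoint n)
  | otherwise                   = Nothing

-- Conversion to Char; assumes the value was built with mkCodePoint,
-- so that it is within range.
codePointToChar :: CodePoint -> Char
codePointToChar (CodePoint n) = chr n

-- Conversion from Char; assumes ord never exceeds 0x10FFFF.
charToCodePoint :: Char -> CodePoint
charToCodePoint = CodePoint . ord
```

With this, mkCodePoint 0x110000 gives Nothing, while mkCodePoint 0x41
gives a value that converts to 'A'.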

...
>> data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
>> getGeneralCategory :: Char -> Maybe GeneralCategory;
>
>There is not really any "Maybe" just there.  Yet unallocated code
>positions have general category Cn (so do non-characters):
>      Cs Other, Surrogate
>      Co Other, Private Use
>      Cn Other, Not Assigned (yet)

OK. It occurred to me to represent 'unassigned' as Nothing, since it might
change -- so in a sense getGeneralCategory doesn't know what the GC is. I
assume that once a code point has a non-Cn GC, it cannot be changed. But
confusingly, some of the GCs are 'normative', whereas others are merely
'informative' -- perhaps the latter are subject to revision.
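
The two views can be sketched side by side. This assumes the
generalCategory function and GeneralCategory type found in the Data.Char
module of present-day GHC's base library, which take the total approach
and report Cn as NotAssigned. The wrapper assignedCategory is
hypothetical: it recovers the Maybe reading by treating Cn as Nothing.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical helper: the "Maybe" view of the general category.
-- Cn (NotAssigned) becomes Nothing; every other category, including
-- Cs (Surrogate) and Co (PrivateUse), is passed through unchanged.
assignedCategory :: Char -> Maybe GeneralCategory
assignedCategory c = case generalCategory c of
  NotAssigned -> Nothing
  gc          -> Just gc
```

For example, assignedCategory 'A' is Just UppercaseLetter, while a
non-character such as '\xFFFF' gives Nothing.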

-- 
Ashley Yakeley, Seattle WA