Unicode support
Ashley Yakeley
ashley@semantic.org
Tue, 9 Oct 2001 04:07:43 -0700
At 2001-10-09 03:37, Kent Karlsson wrote:
>> > code position (=code point): a value between 0000 and 10FFFF.
>>
>> Would this be a reasonable basis for Haskell's 'Char' type?
>
>Yes. It's essentially UTF-32, but without the fixation to 32-bit
>(21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited
>to 10FFFF instead of 31(!) bits) is the datatype used in some
>implementations of C for wchar_t. As I said in another e-mail,
>if one does not have high efficiency concerns, UTF-32 is a rather
>straightforward way of representing characters.
Would it be worthwhile restricting Char to the 0-10FFFF range, just as a
Word8 is restricted to 0-FF even though, in GHC at least, it is stored in
32 bits?
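
To make the question concrete, here is a minimal sketch of a value that is stored in 32 bits but only admits 0-10FFFF. The names CodePoint, maxCodePoint and mkCodePoint are hypothetical, invented for this illustration; this is not how any particular compiler defines Char.

```haskell
import Data.Word (Word32)

-- Hypothetical type: 32 bits of storage, but only 0..0x10FFFF is valid.
newtype CodePoint = CodePoint Word32
  deriving (Eq, Ord, Show)

-- Highest code position, as given in the quoted definition (10FFFF).
maxCodePoint :: Word32
maxCodePoint = 0x10FFFF

-- Smart constructor: rejects anything outside the restricted range.
mkCodePoint :: Integer -> Maybe CodePoint
mkCodePoint n
  | n >= 0 && n <= toInteger maxCodePoint = Just (CodePoint (fromInteger n))
  | otherwise                             = Nothing

-- The bounds advertise the restricted range, not the storage width.
instance Bounded CodePoint where
  minBound = CodePoint 0
  maxBound = CodePoint maxCodePoint
```

The restriction lives in the constructor and the Bounded instance, so the storage width stays an implementation detail, much as with Word8.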
...
>> data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
>> getGeneralCategory :: Char -> Maybe GeneralCategory;
>
>There is not really any "Maybe" just there. Yet unallocated code
>positions have general category Cn (so do non-characters):
> Cs Other, Surrogate
> Co Other, Private Use
> Cn Other, Not Assigned (yet)
OK. It occurred to me to represent 'unassigned' as Nothing, since it might
change -- so in a sense getGeneralCategory doesn't know what the GC is. I
assume that once a codepoint has a non-Cn GC, it cannot be changed. But
confusingly, some of the GCs are 'normative', whereas others are merely
'informative' -- perhaps the latter are subject to revision.
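
The two designs can be compared in a short sketch. It assumes a Data.Char module that exports generalCategory and a GeneralCategory type with a NotAssigned constructor for Cn, as current GHC base does; assignedCategory is a hypothetical name for the Maybe-returning variant.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical wrapper: report Cn (NotAssigned) as Nothing, i.e.
-- "no category is known yet", and pass every other category through.
assignedCategory :: Char -> Maybe GeneralCategory
assignedCategory c = case generalCategory c of
  NotAssigned -> Nothing
  gc          -> Just gc
```

With this wrapper, surrogates (Cs) and private-use characters (Co) still get a definite category; only Cn collapses to Nothing.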
--
Ashley Yakeley, Seattle WA