Unicode support

John Meacham john@repetae.net
Tue, 9 Oct 2001 14:59:09 -0700


On Tue, Oct 09, 2001 at 12:37:27PM +0200, Kent Karlsson wrote:
> > At 2001-10-09 02:58, Kent Karlsson wrote:
> > >In summary:
> > >    code position (=code point): a value between 0000 and 10FFFF.
> > Would this be a reasonable basis for Haskell's 'Char' type?
> 
> Yes.  It's essentially UTF-32, but without the fixation to 32-bit
> (21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited
> to 10FFFF instead of 31(!) bits) is the datatype used in some
> implementations of C for wchar_t.  As I said in another e-mail,
> if one does not have high efficiency concerns, UTF-32 is a rather
> straightforward way of representing characters.

I think space efficiency concerns are probably moot anyway, since Chars
would most likely be represented by possibly evaluated thunks, which I
can't imagine being smaller than a pointer in general. So for Haskell
the simplification of UTF-32 is most likely worth it.

If space efficiency is a concern, then I imagine people would want to use
mutable arrays of bytes or words anyway (perhaps mmap'ed from a file),
and not Haskell lists of Chars.
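
To make the array alternative concrete, here is a minimal sketch. It
assumes the Data.Array.Unboxed module from the array package, and it uses
an immutable unboxed array rather than a mutable or mmap'ed one to stay
short. The names packBytes, unpackBytes and byteCount are made up for
illustration.

```haskell
import Data.Array.Unboxed (UArray, bounds, elems, listArray)
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Pack a string into an unboxed byte array, one byte per character.
-- Assumes every character is below U+0100; higher code points are
-- silently truncated by fromIntegral, so this is only a sketch.
packBytes :: String -> UArray Int Word8
packBytes s = listArray (0, length s - 1) (map (fromIntegral . ord) s)

-- | Recover the list-of-Char view when that is more convenient.
unpackBytes :: UArray Int Word8 -> String
unpackBytes = map (chr . fromIntegral) . elems

-- | Number of bytes stored in the array.
byteCount :: UArray Int Word8 -> Int
byteCount arr = let (lo, hi) = bounds arr in hi - lo + 1
```

The array stores one byte per element with no per-character pointer,
whereas each element of a [Char] needs at least a list cell pointing at
the character.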

> > At some point
> > perhaps there should be a 'Unicode' standard library for Haskell. For
> > instance:
> >
> > encodeUTF8 :: String -> [Word8];
> > decodeUTF8 :: [Word8] -> Maybe String;
> > encodeUTF16 :: String -> [Word16];
> > decodeUTF16 :: [Word16] -> Maybe String;
> >
> > data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
> > getGeneralCategory :: Char -> Maybe GeneralCategory;
> 
> There is not really any "Maybe" just there.  Yet unallocated code
> positions have general category Cn (so do non-characters):
>       Cs Other, Surrogate
>       Co Other, Private Use
>       Cn Other, Not Assigned (yet)
> 
> > ...sorting & searching...
> >
> > ...canonicalisation...
> >
> > etc. Lots of work for someone.
> 
> Yes.  And it is lots of work (which is why I'm not volunteering
> to make a quick fix: there is no quick fix).
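
On the general category point quoted above: a total function of that
shape, with no Maybe, is what the base library shipped with GHC provides
as Data.Char.generalCategory. The sketch below only illustrates the three
"Other" categories from the list; isUnassigned and samples are
hypothetical helper names.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | True for code points in category Cn (not assigned, which also
-- covers noncharacters such as U+FFFF).
isUnassigned :: Char -> Bool
isUnassigned c = generalCategory c == NotAssigned

-- | One sample code point for each of Cs, Co and Cn, paired with the
-- category the library reports for it.
samples :: [(Char, GeneralCategory)]
samples = [ (c, generalCategory c) | c <- ['\xD800', '\xE000', '\xFFFF'] ]
```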

I think a canonical way to get at iconv's functionality ('man 3 iconv'
for info) in one of the standard libraries would be great. Perhaps I
will have a go at it. Even if the underlying platform does not have
iconv, some basic conversions (UTF-8, UTF-16, Latin-1, [Char]) could
easily be provided with the same API and minimal implementation effort.
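
As a rough idea of how small the pure fallback could be, here is a sketch
of the UTF-8 and Latin-1 cases. The encodeUTF8 and decodeUTF8 signatures
follow the ones proposed earlier in the thread; encodeLatin1 and
decodeLatin1 are names assumed here for illustration, and none of this is
an actual library API. The decoder rejects overlong forms, surrogates and
values above 10FFFF; the encoder does not check for surrogates.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Encode each code point as 1 to 4 UTF-8 bytes.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap (map fromIntegral . go . ord)
  where
    go :: Int -> [Int]
    go c
      | c < 0x80    = [c]
      | c < 0x800   = [0xC0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont :: Int -> Int
    cont x = 0x80 .|. (x .&. 0x3F)

-- | Decode UTF-8, returning Nothing on any malformed sequence.
decodeUTF8 :: [Word8] -> Maybe String
decodeUTF8 [] = Just []
decodeUTF8 (b : bs)
  | b < 0x80              = (chr (fromIntegral b) :) <$> decodeUTF8 bs
  | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise             = Nothing  -- stray continuation or invalid lead byte
  where
    -- n continuation bytes, payload bits of the lead byte, and the
    -- smallest code point allowed for a sequence of this length.
    multi :: Int -> Int -> Int -> Maybe String
    multi n lead minCode =
      let (cs, rest) = splitAt n bs
          code  = foldl (\acc x -> (acc `shiftL` 6) .|. (fromIntegral x .&. 0x3F))
                        lead cs
          valid = length cs == n
               && all (\x -> x .&. 0xC0 == 0x80) cs
               && code >= minCode                          -- no overlong forms
               && code <= 0x10FFFF
               && not (code >= 0xD800 && code <= 0xDFFF)   -- no surrogates
      in if valid then (chr code :) <$> decodeUTF8 rest else Nothing

-- | Latin-1 can only represent code points below U+0100.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM enc
  where
    enc c | ord c < 0x100 = Just (fromIntegral (ord c))
          | otherwise     = Nothing

-- | Every byte is a valid Latin-1 character, so decoding cannot fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)
```

An iconv-backed implementation could sit behind the same signatures, with
these pure versions used where iconv is missing.
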
	John

-- 
---------------------------------------------------------------------------
John Meacham - California Institute of Technology, Alum. - john@repetae.net
---------------------------------------------------------------------------