UTF-8 library

George Russell ger@tzi.de
Tue, 06 Aug 2002 19:57:50 +0200


Axel Simon wrote:
> 
> On Tue, Aug 06, 2002 at 06:11:04PM +0200, George Russell wrote:
> [snip]
> > Converting CStrings to [Word8] is probably a bad idea anyway, since there is
> > absolutely no reason to assume a C character will be only 8 bits long, and
> > under some implementations it isn't.
> But the interface should be practical. I do not really want to write
> Haskell programs for architectures where the smallest addressable memory
> entity (i.e. C's char) is something other than 8 bits.
Yes, but if all we are talking about is practicality, I *do* want to convert
between CStrings and ordinary Strings in the conventional way, and I bet lots
of other people do too.
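
As a rough sketch of what "the conventional way" might look like, the
following round-trips a String through a C string using withCString and
peekCString from Foreign.C.String.  The helper name roundTrip is made up for
illustration, and how characters outside ASCII are treated depends on the
implementation (and, in current GHC, on the locale).

```haskell
import Foreign.C.String (withCString, peekCString)

-- Hypothetical helper: marshal a String into a temporary NUL-terminated
-- C string and read it straight back into a Haskell String.
roundTrip :: String -> IO String
roundTrip s = withCString s peekCString
```

For plain ASCII input the result should equal the input; anything beyond
that is exactly the point under discussion.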

> 
> > A better suggestion would be to provide ALTERNATIVE functions which
> > go from CString/CStringLen and friends to [CChar], and make your UTF8
> > converters go between [CChar] and String.  However we should not be forced
> > to do this every time we want to construct a CString from a String (a very
> > common need when calling C functions) so the existing functions should remain
> > with their existing semantics.
> But converting CChar to Char means you are assuming that the C String is
> ISO-8859-1, the lowest 256 characters of Unicode. I guess this should be
> made explicit during conversion.
I suggest that the only requirements on the CChar to Char conversion be that
it is a total function, that the printable ASCII characters map in the obvious
way, and that the results stay consistent for the duration of a Haskell
program.  Beyond that, I see no reason why the conversion should not take
account of the locale in effect when the program starts, and so translate
(say) KOI-8 encoded Cyrillic characters to their Unicode equivalents.
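
A minimal sketch of that split, under stated assumptions: the C string is
first read as raw [CChar] with peekArray0, and a separate total function
turns [CChar] into a String.  The names cStringToCChars, decodeCChars and
example are hypothetical.  The decoder here uses castCCharToChar, which
treats every C character as a code point in the range 0 to 255 (so printable
ASCII maps in the obvious way); a locale-aware decoder such as one for KOI-8
would replace decodeCChars and is not shown.

```haskell
import Foreign.C.String (CString, castCCharToChar, withCString)
import Foreign.C.Types (CChar)
import Foreign.Marshal.Array (peekArray0)

-- Hypothetical: read a NUL-terminated C string as raw C characters,
-- without interpreting them in any encoding.
cStringToCChars :: CString -> IO [CChar]
cStringToCChars = peekArray0 0

-- Hypothetical: a total decoder that is defined for every CChar value.
-- Assumption: each C character is taken as the code point 0..255.
decodeCChars :: [CChar] -> String
decodeCChars = map castCCharToChar

-- Usage: marshal a String to C, then come back via [CChar].
example :: IO String
example = withCString "Hi" (fmap decodeCChars . cStringToCChars)
```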

For Char to CChar it would surely be easiest to raise an IO error if the character
cannot be represented as a CChar.
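
One way this might look, as a sketch only: the function names charToCChar
and safeEncode are hypothetical, and the cut-off of 0xFF assumes an 8-bit C
char, which is the very assumption questioned earlier in this thread.  A
stricter version could stop at 0x7F, or consult the locale.

```haskell
import Control.Exception (IOException, try)
import Data.Char (ord)
import Foreign.C.Types (CChar)

-- Hypothetical: convert one Char, raising an IO error if it does not fit.
-- Assumption: a C char holds 8 bits, so code points above 0xFF are rejected.
charToCChar :: Char -> IO CChar
charToCChar c
  | n <= 0xFF = return (fromIntegral n)
  | otherwise = ioError (userError ("character does not fit in a C char: " ++ show c))
  where
    n = ord c

-- Hypothetical: convert a whole String, returning the IO error as a value
-- instead of letting it propagate.
safeEncode :: String -> IO (Either IOException [CChar])
safeEncode = try . mapM charToCChar
```

With this, safeEncode "AB" yields a Right value, while a string containing,
say, a Greek letter yields a Left carrying the IO error.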