[Haskell-cafe] The Nature of Char and String
Marcin 'Qrczak' Kowalczyk
qrczak at knm.org.pl
Wed Feb 2 08:26:11 EST 2005
Ketil Malde <ketil+haskell at ii.uib.no> writes:
>> The Haskell functions accept or return Strings but interface to OS
>> functions which (at least on Unix) deal with arrays of bytes (char*),
>> and the encoding issues are essentially ignored. If you pass strings
>> containing anything other than ISO-8859-1, you lose.
> I'm not sure it's as bad as all that. You lose the correct Unicode
> code points (i.e. chars will have the wrong values, and strings may be
> the wrong length), but I think you will be able to get the same bytes
> out as you read in. So in that sense, Char-based IO is somewhat
> encoding neutral.
> So one can have Unicode both in IO and internally, it's just that you
> don't get both at the same time :-)
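This "byte-preserving but code-point-wrong" behaviour can be sketched in
Haskell. The sketch below uses modern GHC's System.IO API (hSetEncoding and
the latin1 TextEncoding) to force the old byte-per-Char behaviour explicitly;
the filename "test.bin" is just an illustration:

```haskell
import System.IO
import Data.Char (ord)

-- Sketch: with the latin1 encoding forced, byte 0xNN becomes Char '\xNN'
-- and vice versa, so Char-based IO round-trips arbitrary bytes.
main :: IO ()
main = do
  -- The UTF-8 encoding of 'é' (U+00E9) is the two bytes 0xC3 0xA9.
  -- Read as latin1 they become the two Chars '\xC3' and '\xA9':
  -- wrong code points, wrong length, but the bytes survive.
  h <- openFile "test.bin" WriteMode
  hSetEncoding h latin1
  hPutStr h "\xC3\xA9"
  hClose h
  h' <- openFile "test.bin" ReadMode
  hSetEncoding h' latin1
  s <- hGetContents h'
  print (map ord s)   -- [195,169]: same bytes out as went in
  hClose h'
```

So the bytes round-trip and the IO is encoding-neutral in that narrow sense,
but any code that inspects the Chars sees Latin-1 values, not the characters
the user meant.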
That's the problem. Perl is similar: it uses the same strings for byte
arrays and for Unicode strings whose characters happen to be Latin1.
The interpretation sometimes depends on the function / library used,
and sometimes on other libraries loaded.
When I made an interface between Perl and my language Kogut (which
uses Unicode internally and converts text exchanged with the OS, even
though the conversion may fail, e.g. for filenames not encoded in the
locale encoding - I don't have a better design yet), I had trouble
converting Perl strings which have no characters above 0xFF.
If I treat them as Unicode, then a filename passed between the two
languages is interpreted differently on each side. If I treat them as
the locale encoding, then the treatment is inconsistent and passing
strings in both directions doesn't round-trip.
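The ambiguity can be made concrete. A sketch in Haskell (the function names
asBytes and asUtf8 are mine, and the hand-rolled encoder only covers code
points below 0x800): the same string of Chars, all at or below 0xFF, yields
different byte sequences depending on whether it is read as raw bytes or as
Unicode code points to be encoded in a UTF-8 locale:

```haskell
import Data.Char (ord)

-- Reading 1: each Char is a raw byte (the "byte array" interpretation).
asBytes :: String -> [Int]
asBytes = map ord

-- Reading 2: each Char is a Unicode code point, to be encoded as UTF-8
-- (hypothetical minimal encoder, code points below 0x800 only).
asUtf8 :: String -> [Int]
asUtf8 = concatMap enc
  where
    enc c
      | n < 0x80  = [n]
      | n < 0x800 = [0xC0 + n `div` 64, 0x80 + n `mod` 64]
      | otherwise = error "sketch covers code points < 0x800 only"
      where n = ord c

main :: IO ()
main = do
  let fname = "caf\xE9"      -- "café" with a Latin-1 é (0xE9)
  print (asBytes fname)      -- [99,97,102,233]
  print (asUtf8  fname)      -- [99,97,102,195,169]
```

The two readings name different files on disk, which is exactly why neither
choice is safe for strings crossing the language boundary.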
So I'm currently treating them as Unicode. Perl's handling of Unicode
is inconsistent with itself (e.g. for filenames containing characters
above 0xFF), so I don't think I made it any more broken than it already is...
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl