[Haskell-cafe] UTF-8 in Haskell.

Thu Dec 23 07:15:34 CET 2010

On Thu, Dec 23, 2010 at 2:01 PM, Mark Lentczner <markl at glyphic.com> wrote:
>
> On Dec 22, 2010, at 9:29 PM, Magicloud Magiclouds wrote:
>> Thus in all situations (ASCII, UTF-8, or even
>> UTF-32), my program always sends 4 bytes over the network. Is that
>> OK?
>
> Generally, no.
>
> Haskell strings are sequences of Unicode characters. Each character has an integral code point value, from 0 to 0x10ffff, but the code point itself is just a number, not a pattern of bits to be exchanged. Mapping code points to bytes is the job of an encoding.
>
> In any protocol you need to know the encoding before you exchange characters as bytes or words. In some protocols it is implicit, in others it is stated explicitly in a header or metadata, and in yet others (IRC comes to mind) it is undefined, which causes problems for users.
>
> The UTF-8 encoding uses a variable number of bytes (one to four) to represent each character, depending on the code point, not a fixed Word32 as you suggested.
>
> Converting from Haskell's String to various encodings can be done with either the "text" package or "utf8-string" package.
>
>                - Mark
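
To make the variable-length point concrete, here is a minimal sketch of a UTF-8 encoder that uses only base. The names encodeChar and encodeString are illustrative and do not come from any package, and the sketch does not reject surrogate code points (0xD800 to 0xDFFF), which a strict encoder would.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one character as UTF-8: 1 to 4 bytes, depending on the code point.
-- Assumption: surrogates are not rejected here; a strict encoder would refuse them.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]                           -- 1 byte (ASCII)
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                    -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]           -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]  -- 4 bytes
  where
    n = ord c
    -- Payload bits that go into the lead byte.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode a whole String by concatenating the per-character encodings.
encodeString :: String -> [Word8]
encodeString = concatMap encodeChar

main :: IO ()
main =
  -- Byte counts for U+0061, U+00E9, U+7AF9 and U+1F600.
  print (map (length . encodeChar) ['a', '\xE9', '\x7AF9', '\x1F600'])
  -- prints [1,2,3,4]
```

In real code one of the packages mentioned above is the better choice; the sketch only shows why sending a fixed 4 bytes per character is not UTF-8.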

I see. I just realized that, in this case (ssh), I could use CString to
avoid all the encoding problems.
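
One caveat: a CString is still just bytes, so some encoding is chosen when the String is marshalled. In current versions of base, the Foreign.C.String functions convert using the current locale's encoding by default. A minimal sketch of fixing the encoding explicitly, assuming the "text" and "bytestring" packages are available (toUtf8 is a hypothetical helper name, not part of either package):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encode a String as UTF-8 bytes, independent of the locale.
-- toUtf8 is an illustrative helper, not a library function.
toUtf8 :: String -> B.ByteString
toUtf8 = TE.encodeUtf8 . T.pack

main :: IO ()
main = do
  let bytes = toUtf8 "\x7AF9"   -- one character, U+7AF9
  print (B.unpack bytes)        -- prints [231,171,185]
  -- useAsCStringLen passes the already-encoded bytes to C code unchanged.
  B.useAsCStringLen bytes $ \(_ptr, len) -> print len   -- prints 3
```

With this approach the bytes that reach the C side are exactly the UTF-8 encoding, whatever the locale of the running process.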

-- 
竹密岂妨流水过
山高哪阻野云飞