[Haskell-cafe] Ready for testing: Unicode support for Handle I/O

Wed Feb 4 04:26:06 EST 2009

John Goerzen wrote:
> Duncan Coutts wrote:
>> Sorry, I think we've been talking at cross purposes.
> 
> I think so.
> 
>>> There always has to be *some* conversion from a 32-bit Char to the
>>> system's selection, right?
>> Yes. In text mode there is always some conversion going on. Internally
>> there is a byte buffer and a char buffer (i.e. UTF-32).
>>
>>> What exactly do we have to do to avoid the penalty?
>> The penalty we're talking about here is not the cost of converting bytes
>> to characters, it's in switching which encoding the Handle is using. For
>> example you might read some HTTP headers in ASCII and then switch the
>> Handle encoding to UTF-8 to read some XML.
> 
> Simon referenced a 30% penalty.  Are you saying that if we read from a
> Handle using the same encoding that we used when we first opened it,
> that we won't see any slowdown vs. the system in 6.10?

No, there's a fixed 30% penalty for hGetContents/readFile/hPutStr in the 
new library, regardless of what encoding you're using.  Presumably if 
you're using a complex encoding there will be an extra penalty imposed by 
iconv.

The cost is mostly in decoding (or copying, in the case of latin1) bytes 
from the byte buffer into characters in the character buffer.  Previously 
there was only a byte buffer.
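
As a rough illustration of the text-mode path described above, here is a
minimal sketch. It assumes the hSetEncoding function and the latin1/utf8
values as exported by System.IO in released versions of the new I/O
library; readWith and writeWith are hypothetical helper names chosen for
this example, not library functions.

```haskell
import System.IO

-- Hypothetical helper: read a whole file in text mode under a given
-- encoding. The Handle decodes bytes from its byte buffer into Chars.
readWith :: TextEncoding -> FilePath -> IO String
readWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  s <- hGetContents h
  length s `seq` hClose h   -- force the lazy read before closing
  return s

-- Hypothetical helper: write a String in text mode under a given
-- encoding. Chars are encoded into bytes on the way out.
writeWith :: TextEncoding -> FilePath -> String -> IO ()
writeWith enc path s = withFile path WriteMode $ \h -> do
  hSetEncoding h enc
  hPutStr h s
```

Writing "h\233llo" with utf8 and reading it back with latin1 yields six
Chars rather than five, because each byte becomes one Char. Either way the
data passes through the Char decoding step, which is where the cost
discussed here arises.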

I was surprised at the slowdown too, so I looked into it.  As it turns out, 
hGetContents and hPutStr are actually quite well optimised already: 
virtually all the allocation is accounted for by the [Char], and we have 
good tight inner loops, so the cost of shuffling between the byte buffer 
and the char buffer is quite noticeable.

We could add a special case for the latin1 encoding and eliminate the 
intermediate char buffer, but that would add significant complexity to the 
code, and it's not the right way to go about it.  If you want binary data 
and speed, then hGetBuf/hPutBuf (perhaps via bytestring) should be as fast 
as before.
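
For the binary route, a minimal sketch using the bytestring package might
look like the following. The names copyBinary and copyChunked and the 64K
chunk size are assumptions made for this example; B.readFile, B.writeFile,
B.hGet and B.hPut are the Data.ByteString functions, and withBinaryFile
comes from System.IO.

```haskell
import qualified Data.ByteString as B
import System.IO

-- Copy a file as raw bytes. No decoding into Char takes place.
copyBinary :: FilePath -> FilePath -> IO ()
copyBinary src dst = B.readFile src >>= B.writeFile dst

-- The same idea with explicit binary-mode Handles and fixed-size chunks,
-- so the whole file is never held in memory at once.
copyChunked :: FilePath -> FilePath -> IO ()
copyChunked src dst =
  withBinaryFile src ReadMode $ \hIn ->
  withBinaryFile dst WriteMode $ \hOut ->
    let loop = do
          chunk <- B.hGet hIn 65536   -- an empty chunk signals EOF
          if B.null chunk then return () else B.hPut hOut chunk >> loop
    in loop
```

Both functions stay on the byte level throughout, so they should not be
affected by the byte-to-Char conversion cost that applies to
hGetContents/readFile/hPutStr on text-mode Handles.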

Cheers,
	Simon