[Haskell-cafe] Ready for testing: Unicode support for Handle I/O

Tue Feb 3 20:32:59 EST 2009

On Tue, 2009-02-03 at 17:39 -0600, John Goerzen wrote:
> On Tue, Feb 03, 2009 at 10:56:13PM +0000, Duncan Coutts wrote:
> > > > Thanks to suggestions from Duncan Coutts, it's possible to call
> > > > hSetEncoding even on buffered read Handles, and the right thing
> > > > happens.  So we can read from text streams that include multiple
> > > > encodings, such as an HTTP response or email message, without having
> > > > to turn buffering off (though there is a penalty for switching
> > > > encodings on a buffered Handle, as the IO system has to do some
> > > > re-decoding to figure out where it should start reading from again).
> > > 
> > > Sounds useful, but is this the bit that causes the 30% performance hit?
> > 
> > No. You only pay that penalty if you switch encoding. The standard case
> > has no extra cost.
> 
> I'm confused.  I thought the standard case was conversion to the
> system's local encoding?  How is that different than selecting the
> same encoding manually?

Sorry, I think we've been talking at cross purposes.

> There always has to be *some* conversion from a 32-bit Char to the
> system's selection, right?

Yes. In text mode there is always some conversion going on. Internally
there is a byte buffer and a char buffer (i.e. UTF-32).

> What exactly do we have to do to avoid the penalty?

The penalty we're talking about here is not the cost of converting bytes
to characters, it's in switching which encoding the Handle is using. For
example, you might read some HTTP headers in ASCII and then switch the
Handle encoding to UTF-8 to read some XML.
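
As a rough illustration of that usage, here is a minimal sketch. It
assumes the hSetEncoding, latin1 and utf8 names as exported from
System.IO in released versions of base, and uses latin1 to stand in
for ASCII. The readMixed helper and the temporary file contents are
made up for the example, and System.Directory comes from the directory
package.

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Hypothetical helper: read one header line as Latin-1, skip the blank
-- separator line, then switch the same Handle to UTF-8 for the body.
readMixed :: FilePath -> IO (String, String)
readMixed path = do
  h <- openFile path ReadMode
  hSetEncoding h latin1          -- headers: ASCII is a subset of Latin-1
  header <- hGetLine h
  _ <- hGetLine h                -- blank line between headers and body
  hSetEncoding h utf8            -- switch encoding on the buffered Handle
  body <- hGetLine h
  hClose h
  return (header, body)

main :: IO ()
main = do
  dir <- getTemporaryDirectory
  (path, tmp) <- openBinaryTempFile dir "mixed.txt"
  -- In binary mode each Char below 256 is written as a single byte;
  -- the bytes C3 A9 are the UTF-8 encoding of U+00E9.
  hPutStr tmp "Content-Type: text/xml\n\n\xC3\xA9\n"
  hClose tmp
  result <- readMixed path
  print result
  removeFile path
```

Buffering is left at the default here, which is the case under
discussion: the body bytes have probably already been pre-decoded as
Latin-1 by the time the second hSetEncoding runs.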

Switching the Handle encoding has a penalty. We have to discard the
characters that we pre-decoded and re-decode the byte buffer in the new
encoding. It's actually slightly more complicated because we do not
track exactly how the byte and character buffers relate to each other
(it'd be too expensive in the normal cases) so to work out the
relationship when switching encoding we have to re-decode all the way
from the beginning of the current byte buffer.
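
To make the re-decoding step concrete, here is a small pure model. It
is only a sketch and not the code in the IO library: the Decoder type,
the toy latin1Dec and utf8Dec decoders (the latter handles 1- and
2-byte sequences only) and the bytesUsed and switchDecoder functions
are all hypothetical names. The idea is that we know how many
characters the program has consumed but not how many bytes they came
from, so we re-decode from the start of the byte buffer to find out.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical model: decode one Char, reporting how many bytes it used.
type Decoder = [Word8] -> Maybe (Char, Int)

latin1Dec :: Decoder
latin1Dec (b:_) = Just (chr (fromIntegral b), 1)
latin1Dec []    = Nothing

-- Toy UTF-8 decoder: 1- and 2-byte sequences only, enough for the demo.
utf8Dec :: Decoder
utf8Dec (b:rest)
  | b < 0x80 = Just (chr (fromIntegral b), 1)
  | b .&. 0xE0 == 0xC0, (c:_) <- rest, c .&. 0xC0 == 0x80 =
      Just (chr (((fromIntegral b .&. 0x1F) `shiftL` 6) .|. (fromIntegral c .&. 0x3F)), 2)
utf8Dec _ = Nothing

-- Decode as much of the buffer as the decoder accepts.
decodeAll :: Decoder -> [Word8] -> String
decodeAll dec bs = case dec bs of
  Just (c, n) -> c : decodeAll dec (drop n bs)
  Nothing     -> []

-- Re-decode from the start of the byte buffer to find how many bytes
-- the first nChars characters occupied under the old encoding.
bytesUsed :: Decoder -> Int -> [Word8] -> Int
bytesUsed dec nChars bs
  | nChars <= 0 = 0
  | otherwise = case dec bs of
      Just (_, n) -> n + bytesUsed dec (nChars - 1) (drop n bs)
      Nothing     -> 0

-- Throw away the pre-decoded characters and decode the unread bytes
-- with the new decoder.
switchDecoder :: Decoder -> Decoder -> Int -> [Word8] -> String
switchDecoder old new nChars bs =
  decodeAll new (drop (bytesUsed old nChars bs) bs)

main :: IO ()
main = do
  -- "H:\n" followed by the UTF-8 bytes for U+00E9.
  let buf = [0x48, 0x3A, 0x0A, 0xC3, 0xA9]
  -- The program consumed three Latin-1 characters before switching.
  print (switchDecoder latin1Dec utf8Dec 3 buf)
```

The extra pass in bytesUsed is the penalty: it is only run when the
encoding is switched, so the normal decode path does not have to keep
a byte offset for every character.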

The point is, in terms of performance we get the ability to switch
Handle encoding more or less for free. It has a cost in terms of code
complexity. The simpler alternative design was that you would not be
able to switch encoding on a read Handle that used any buffering at the
character level without losing bytes. The performance penalty when
switching encoding is the price of keeping the ordinary code path fast.

> > No, I think that's 30% for latin1. The cost is not really the character
> > conversion but the copying from a byte buffer via iconv to a char
> > buffer.
> 
> Don't we already have to copy between a byte buffer and a char buffer,
> since read() and write() use a byte buffer?

In the existing Handle mechanism we read() into a byte buffer and then,
when doing say getLine or getContents, we allocate [Char]s in a loop,
reading bytes directly from the byte buffer. There is no separate
character buffer.
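
As a very rough model of that existing path (a sketch only, with a
list of Word8 standing in for the real byte buffer and a made-up
getLineBytes function), each byte is turned straight into a Char
while the line is being built:

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical model of the old path: every byte becomes one Char
-- directly, and the String is built straight from the byte buffer.
-- Returns the line and the bytes left over after the newline.
getLineBytes :: [Word8] -> (String, [Word8])
getLineBytes bs =
  let (line, rest) = break (== 10) bs        -- 10 is '\n'
  in (map (chr . fromIntegral) line, drop 1 rest)

main :: IO ()
main = print (getLineBytes [104, 105, 10, 111, 107])
```

There is no intermediate copy here, which is why adding a separate
decode step into a char buffer shows up as extra cost.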

Duncan