[Haskell-cafe] Ready for testing: Unicode support for Handle I/O

John Goerzen jgoerzen at complete.org
Tue Feb 3 12:03:58 EST 2009


Simon Marlow wrote:
> I've been working on adding proper Unicode support to Handle I/O in GHC, 
> and I finally have something that's ready for testing.  I've put a patchset 
> here:

Yay!

Comments below.

> Comments/discussion please!

Do you expect Hugs will be able to pick up all of this?

> The only change to the existing behaviour is that by default, text IO
> is done in the prevailing encoding of the system.  Handles created by
> openBinaryFile use the Latin-1 encoding, as do Handles placed in
> binary mode using hSetBinaryMode.

Sounds very good and reasonable.

> We provide a way to change the encoding for an existing Handle:
> 
>    hSetEncoding :: Handle -> TextEncoding -> IO ()
> 
> and various encodings:
> 
>    latin1,
>    utf8,
>    utf16, utf16le, utf16be,
>    utf32, utf32le, utf32be,
>    localeEncoding,

Will there also be something to handle the UTF-16 BOM?  I'm not
sure what the best API for that is, since it may or may not be present,
but it should be considered -- and could perhaps help autodetect encoding.
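
To make the autodetection idea concrete, here is a minimal sketch of BOM
sniffing on the first few bytes of a stream.  The helper name sniffBom is
hypothetical and not part of the patchset; the returned strings are just
labels, and mapping them onto a TextEncoding (for example through
hSetEncoding) is left to the caller.

```haskell
import Data.List (isPrefixOf)
import Data.Word (Word8)

-- | Hypothetical helper (not from the patchset): guess an encoding label
-- from the byte-order mark at the start of a byte sequence.
-- The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks,
-- because the UTF-32LE mark begins with the UTF-16LE one.
sniffBom :: [Word8] -> Maybe String
sniffBom bytes
  | [0x00, 0x00, 0xFE, 0xFF] `isPrefixOf` bytes = Just "UTF-32BE"
  | [0xFF, 0xFE, 0x00, 0x00] `isPrefixOf` bytes = Just "UTF-32LE"
  | [0xEF, 0xBB, 0xBF]       `isPrefixOf` bytes = Just "UTF-8"
  | [0xFE, 0xFF]             `isPrefixOf` bytes = Just "UTF-16BE"
  | [0xFF, 0xFE]             `isPrefixOf` bytes = Just "UTF-16LE"
  | otherwise                                   = Nothing  -- no BOM present
```

The result is only a guess: a missing BOM says nothing about the encoding,
and UTF-16LE text whose first character after the BOM is U+0000 is
indistinguishable from UTF-32LE by this test.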

> Thanks to suggestions from Duncan Coutts, it's possible to call
> hSetEncoding even on buffered read Handles, and the right thing
> happens.  So we can read from text streams that include multiple
> encodings, such as an HTTP response or email message, without having
> to turn buffering off (though there is a penalty for switching
> encodings on a buffered Handle, as the IO system has to do some
> re-decoding to figure out where it should start reading from again).

Sounds useful, but is this the bit that causes the 30% performance hit?
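
For anyone wanting to try the mixed-encoding case, here is a minimal
sketch, assuming hSetEncoding, latin1 and utf8 are exported from System.IO
as described in the announcement.  The function name readHeaderThenBody and
the "one Latin-1 header line, then a UTF-8 body" layout are made up for
illustration.

```haskell
import Control.Exception (evaluate)
import System.IO

-- | Illustrative only: read the first line as Latin-1, then switch the
-- same buffered Handle to UTF-8 and read the remainder of the file.
readHeaderThenBody :: FilePath -> IO (String, String)
readHeaderThenBody path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h latin1
    header <- hGetLine h
    -- Switch encodings mid-stream; buffering stays on.
    hSetEncoding h utf8
    body <- hGetContents h
    -- hGetContents is lazy, so force the whole body before withFile
    -- closes the Handle.
    _ <- evaluate (length body)
    return (header, body)
```

Timing this against a variant that never calls hSetEncoding a second time
would be one way to see how much the switch itself costs.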

> Performance is about 30% slower on "hGetContents >>= putStr" than
> before.  I've profiled it, and about 25% of this is in doing the
> actual encoding/decoding, the rest is accounted for by the fact that
> we're shuffling around 32-bit chars rather than bytes in the Handle
> buffer, so there's not much we can do to improve this.

Does this mean that if we set the encoding to latin1, we should see
performance 5% worse than at present?

30% slower is a big deal, especially since we're not all that speedy now.

> IO library restructuring
> ~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The major change here is that the implementation of the Handle
> operations is separated from the underlying IO device, using type
> classes.  File descriptors are just one IO provider; I have also
> implemented memory-mapped files (good for random-access read/write)
> and a Handle that pipes output to a Chan (useful for testing code that
> writes to a Handle).  New kinds of Handle can be implemented outside
> the base package, for instance someone could write bytestringToHandle.
> A Handle is made using mkFileHandle:

Very nice.  That means I can eliminate all the HVIO stuff I have in
MissingH, which does roughly the same thing.

> with making new kinds of Handle.  We could split up the layers further
> later.

Would it now be possible to make the Socket an instance of this
typeclass, so we can work with it directly rather than having to convert
it to a Handle first?


Thanks,

-- John

