[Haskell-cafe] Ready for testing: Unicode support for Handle I/O
jgoerzen at complete.org
Tue Feb 3 12:03:58 EST 2009
Simon Marlow wrote:
> I've been working on adding proper Unicode support to Handle I/O in GHC,
> and I finally have something that's ready for testing. I've put a patchset
> Comments/discussion please!
Do you expect Hugs will be able to pick up all of this?
> The only change to the existing behaviour is that by default, text IO
> is done in the prevailing encoding of the system. Handles created by
> openBinaryFile use the Latin-1 encoding, as do Handles placed in
> binary mode using hSetBinaryMode.
Sounds very good and reasonable.
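
To make the binary-mode rule concrete, here is a minimal sketch, assuming the System.IO interface described above as it appears in current GHC. The helper name rawBytesOf and the temp-file name are made up for illustration. It writes a string as UTF-8 and reads it back through openBinaryFile, so each Char that comes back is one raw byte.

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- | Hypothetical helper: write a string as UTF-8, then read the file back
-- through a binary Handle, where every Char stands for exactly one byte.
rawBytesOf :: String -> IO String
rawBytesOf text = do
  tmpDir <- getTemporaryDirectory
  (path, hOut) <- openTempFile tmpDir "binary-demo.txt"
  hSetEncoding hOut utf8          -- override the locale default explicitly
  hPutStr hOut text
  hClose hOut
  hIn <- openBinaryFile path ReadMode
  bytes <- hGetContents hIn
  length bytes `seq` hClose hIn   -- force the lazy read before closing
  removeFile path
  return bytes

main :: IO ()
main = do
  bytes <- rawBytesOf "\955"      -- U+03BB, Greek small letter lambda
  print (map fromEnum bytes)      -- [206,187], the two UTF-8 bytes
```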
> We provide a way to change the encoding for an existing Handle:
> hSetEncoding :: Handle -> TextEncoding -> IO ()
> and various encodings:
> utf16, utf16le, utf16be,
> utf32, utf32le, utf32be,
Will there also be something to handle the UTF-16 BOM? I'm not
sure what the best API for that is, since it may or may not be present,
but it should be considered -- and could perhaps help autodetect encoding.
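
For reference, a small sketch of how hSetEncoding might be used, assuming the names quoted above (hSetEncoding, utf16, and friends) as exported by System.IO in current GHC. The roundTrip helper and the temp-file name are hypothetical. It writes a string in one encoding and reads it back in another, which is an easy way to experiment with what each encoding does about a BOM.

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- | Hypothetical helper: write text to a fresh temp file using encOut,
-- then read the whole file back using encIn.
roundTrip :: TextEncoding -> TextEncoding -> String -> IO String
roundTrip encOut encIn text = do
  tmpDir <- getTemporaryDirectory
  (path, hOut) <- openTempFile tmpDir "encoding-demo.txt"
  hSetEncoding hOut encOut
  hPutStr hOut text
  hClose hOut
  hIn <- openFile path ReadMode
  hSetEncoding hIn encIn
  contents <- hGetContents hIn
  length contents `seq` hClose hIn  -- force the lazy read before closing
  removeFile path
  return contents

main :: IO ()
main = do
  let sample = "na\239ve \955"      -- "naive" with a diaeresis, then a lambda
  viaUtf8  <- roundTrip utf8 utf8 sample
  viaUtf16 <- roundTrip utf16 utf16 sample
  -- Both should survive unchanged when writer and reader agree.
  print (viaUtf8 == sample, viaUtf16 == sample)
```

Mixing the variants, for instance roundTrip utf16le utf16, shows what a reader does when the BOM it might expect is absent.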
> Thanks to suggestions from Duncan Coutts, it's possible to call
> hSetEncoding even on buffered read Handles, and the right thing
> happens. So we can read from text streams that include multiple
> encodings, such as an HTTP response or email message, without having
> to turn buffering off (though there is a penalty for switching
> encodings on a buffered Handle, as the IO system has to do some
> re-decoding to figure out where it should start reading from again).
Sounds useful, but is this the bit that causes the 30% performance hit?
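
Here is a rough sketch of the mixed-encoding case described above, assuming hSetEncoding behaves on a buffered read Handle as the quoted text says. The file layout (one Latin-1 header line followed by a UTF-8 body line) and the name readMixed are invented for the example.

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- | Hypothetical demo: build a file whose first line is Latin-1 and whose
-- second line is UTF-8, then read both lines through one buffered Handle.
readMixed :: IO (String, String)
readMixed = do
  tmpDir <- getTemporaryDirectory
  (path, hHead) <- openTempFile tmpDir "mixed-demo.txt"
  -- Header: Latin-1, so the e-acute is the single byte 0xE9.
  hSetNewlineMode hHead noNewlineTranslation
  hSetEncoding hHead latin1
  hPutStr hHead "X-Author: Jos\233\n"
  hClose hHead
  -- Body: appended as UTF-8 through a second Handle.
  hBody <- openFile path AppendMode
  hSetNewlineMode hBody noNewlineTranslation
  hSetEncoding hBody utf8
  hPutStr hBody "caf\233 \955\n"
  hClose hBody
  -- Read back, switching encodings between the two lines.
  hIn <- openFile path ReadMode
  hSetNewlineMode hIn noNewlineTranslation
  hSetBuffering hIn (BlockBuffering Nothing)  -- keep buffering on
  hSetEncoding hIn latin1
  header <- hGetLine hIn
  hSetEncoding hIn utf8                       -- switch on the buffered Handle
  body <- hGetLine hIn
  hClose hIn
  removeFile path
  return (header, body)

main :: IO ()
main = do
  (header, body) <- readMixed
  print (header == "X-Author: Jos\233", body == "caf\233 \955")
```

The first hGetLine will usually pull the body bytes into the buffer too, so the second hSetEncoding exercises the re-decoding step mentioned in the quoted text.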
> Performance is about 30% slower on "hGetContents >>= putStr" than
> before. I've profiled it, and about 25% of this is in doing the
> actual encoding/decoding, the rest is accounted for by the fact that
> we're shuffling around 32-bit chars rather than bytes in the Handle
> buffer, so there's not much we can do to improve this.
Does this mean that if we set the encoding to latin1, we should see
performance 5% worse than at present?
30% slower is a big deal, especially since we're not all that speedy now.
> IO library restructuring
> The major change here is that the implementation of the Handle
> operations is separated from the underlying IO device, using type
> classes. File descriptors are just one IO provider; I have also
> implemented memory-mapped files (good for random-access read/write)
> and a Handle that pipes output to a Chan (useful for testing code that
> writes to a Handle). New kinds of Handle can be implemented outside
> the base package, for instance someone could write bytestringToHandle.
> A Handle is made using mkFileHandle:
Very nice. That means I can eliminate all the HVIO stuff I have in
MissingH, which does roughly the same thing.
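
To illustrate the general idea of separating operations from the device with a type class, here is a deliberately simplified sketch. It is not the class from the patchset, whose definition is not shown in this message. ByteDevice, MemDevice, devRead, devWrite and readAll are all hypothetical names. The point is only that code written against the class works with any device that has an instance.

```haskell
import Data.IORef
import Data.Word (Word8)

-- | Hypothetical, simplified device class; the real one is not shown here.
class ByteDevice dev where
  -- | Read up to n bytes; an empty list signals end of input.
  devRead  :: dev -> Int -> IO [Word8]
  -- | Append bytes to the device.
  devWrite :: dev -> [Word8] -> IO ()

-- | An in-memory device backed by a mutable list of pending bytes.
newtype MemDevice = MemDevice (IORef [Word8])

newMemDevice :: [Word8] -> IO MemDevice
newMemDevice bs = MemDevice <$> newIORef bs

instance ByteDevice MemDevice where
  devRead (MemDevice ref) n =
    atomicModifyIORef' ref $ \bs ->
      let (now, later) = splitAt n bs in (later, now)
  devWrite (MemDevice ref) new = modifyIORef' ref (++ new)

-- | Device-independent code: drain any ByteDevice in small chunks.
readAll :: ByteDevice dev => dev -> IO [Word8]
readAll dev = go
  where
    go = do
      chunk <- devRead dev 4
      if null chunk then return [] else (chunk ++) <$> go

main :: IO ()
main = do
  dev <- newMemDevice [72, 105]   -- the bytes of "Hi"
  devWrite dev [33]               -- append "!"
  bytes <- readAll dev
  putStrLn (map (toEnum . fromIntegral) bytes)  -- prints "Hi!"
```

A file descriptor, a memory-mapped file or a socket would each be another instance of such a class, and readAll would not need to change.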
> with making new kinds of Handle. We could split up the layers further
Would it now be possible to make the Socket an instance of this
typeclass, so we can work with it directly rather than having to convert
it to a Handle first?