Ready for testing: Unicode support for Handle I/O
marlowsd at gmail.com
Tue Feb 3 11:42:44 EST 2009
I've been working on adding proper Unicode support to Handle I/O in GHC,
and I finally have something that's ready for testing. I've put a patchset
That is a set of patches against a GHC repo tree: unpack the tarball, and
say 'sh apply /path/to/ghc/repo' to apply all the patches. Then clean your
tree and build it from scratch (or if you're using the new GHC build
system, just say 'make' ;-). It should validate, bar one or two minor
Oh, it doesn't work on Windows yet. That's the major thing left to do. If
anyone else felt like tackling this I'd be delighted: all you have to do is
implement a Win32 equivalent of the module GHC.IO.Encoding.Iconv (see
below), everything else should work unchanged.
Depending on whether any further changes are required, I may amend-record
some of these patches, so treat them as temporary patches for testing only.
Below is what will be the patch description in the patch for libraries/base.
This is a significant restructuring of the Handle implementation with
the primary goal of supporting Unicode character encodings.
The only change to the existing behaviour is that by default, text IO
is done in the prevailing encoding of the system. Handles created by
openBinaryFile use the Latin-1 encoding, as do Handles placed in
binary mode using hSetBinaryMode.
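To make the binary-mode behaviour concrete, here is a minimal sketch (not part of the patch) in which each character below U+0100 written to a binary Handle becomes exactly one byte, and each byte read back becomes the character with the same code point. It uses withBinaryFile from System.IO, which opens the file the same way openBinaryFile does; the helper name and file name are hypothetical.

```haskell
import Data.Char (ord)
import System.IO (IOMode(..), withBinaryFile, hPutStr, hGetContents, hFileSize)

-- | Hypothetical helper: write a string through a binary Handle, then
-- report the file size in bytes and the code point of each character
-- read back through another binary Handle.
binaryRoundTrip :: FilePath -> String -> IO (Integer, [Int])
binaryRoundTrip path text = do
  withBinaryFile path WriteMode $ \h -> hPutStr h text
  withBinaryFile path ReadMode $ \h -> do
    size <- hFileSize h
    s <- hGetContents h
    let codes = map ord s
    -- Force the lazy read to finish before the Handle is closed.
    length codes `seq` return (size, codes)

main :: IO ()
main = do
  -- "latin1-demo.bin" is an arbitrary file name chosen for this sketch.
  result <- binaryRoundTrip "latin1-demo.bin" "A\233\255"
  print result
```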
We provide a way to change the encoding for an existing Handle:
hSetEncoding :: Handle -> TextEncoding -> IO ()
and various encodings:
utf16, utf16le, utf16be,
utf32, utf32le, utf32be,
and a way to look up other encodings:
mkTextEncoding :: String -> IO TextEncoding
(it's system-dependent whether the requested encoding will be available).
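As a rough illustration of how these two functions fit together, the sketch below writes a file through a Handle set to utf8 and reads it back through a Handle whose encoding was looked up by name. The helper name and file name are made up for the example, and it assumes the system's mkTextEncoding recognises "UTF-8".

```haskell
import System.IO (IOMode(..), withFile, hPutStr, hGetContents)
import GHC.IO.Handle (hSetEncoding)
import GHC.IO.Encoding (utf8, mkTextEncoding)

-- | Hypothetical helper: write text as UTF-8, then read it back using
-- an encoding obtained from mkTextEncoding.
roundTrip :: FilePath -> String -> IO String
roundTrip path text = do
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8          -- built-in encoding
    hPutStr h text
  enc <- mkTextEncoding "UTF-8"  -- looked up by name; may fail on some systems
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    -- Force the lazy read to finish before the Handle is closed.
    length s `seq` return s

main :: IO ()
main = do
  -- "unicode-demo.txt" is an arbitrary file name chosen for this sketch.
  s <- roundTrip "unicode-demo.txt" "na\239ve \955x\n"
  print (s == "na\239ve \955x\n")
```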
Currently hSetEncoding is available from GHC.IO.Handle, and the
encodings are available from GHC.IO.Encoding. We may want to export
these from somewhere more permanent; that's something for a library
Thanks to suggestions from Duncan Coutts, it's possible to call
hSetEncoding even on buffered read Handles, and the right thing
happens. So we can read from text streams that include multiple
encodings, such as an HTTP response or email message, without having
to turn buffering off (though there is a penalty for switching
encodings on a buffered Handle, as the IO system has to do some
re-decoding to figure out where it should start reading from again).
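A small sketch of the mixed-encoding case, under the assumption that the stream has a Latin-1 header line followed by a UTF-16LE body. The file layout, helper names and file name are invented for the example; the point is only that hSetEncoding is called between two reads on the same buffered Handle.

```haskell
import System.IO (IOMode(..), withFile, withBinaryFile, hPutStr, hGetLine, hGetContents)
import GHC.IO.Handle (hSetEncoding)
import GHC.IO.Encoding (latin1, utf16le)

-- | Hypothetical helper: UTF-16LE bytes for a string of characters
-- below U+0100, written out as Latin-1 characters (low byte, then 0).
toUtf16le :: String -> String
toUtf16le = concatMap (\c -> [c, '\NUL'])

-- | Write a one-line Latin-1 header followed by a UTF-16LE body, then
-- read both back from a single buffered Handle, switching encodings
-- after the header.
readMixed :: FilePath -> IO (String, String)
readMixed path = do
  withBinaryFile path WriteMode $ \h ->
    hPutStr h ("charset=utf-16le\n" ++ toUtf16le "h\233llo")
  withFile path ReadMode $ \h -> do
    hSetEncoding h latin1
    header <- hGetLine h
    hSetEncoding h utf16le       -- switch while data is already buffered
    body <- hGetContents h
    -- Force the lazy read to finish before the Handle is closed.
    length body `seq` return (header, body)

main :: IO ()
main = do
  -- "mixed-demo.txt" is an arbitrary file name chosen for this sketch.
  (header, body) <- readMixed "mixed-demo.txt"
  putStrLn header
  print (body == "h\233llo")
```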
If there is a decoding error, it is reported when an attempt is made
to read the offending character from the Handle, as you would expect.
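The following sketch shows what that looks like from the caller's side, assuming the error surfaces as an IOException. A file is written with a valid first line and an invalid UTF-8 byte at the start of the second; reading the first line is expected to succeed, and the failure should only appear when the second line is read. The helper name and file name are hypothetical.

```haskell
import Control.Exception (IOException, try)
import System.IO (IOMode(..), withFile, withBinaryFile, hPutStr, hGetLine)
import GHC.IO.Handle (hSetEncoding)
import GHC.IO.Encoding (utf8)

-- | Hypothetical helper: read two lines as UTF-8 and capture any
-- IOException raised while reading the second one.
readTwoLines :: FilePath -> IO (String, Either IOException String)
readTwoLines path = do
  -- Line 1 is valid; line 2 starts with byte 0xFF, which is never
  -- valid in UTF-8.
  withBinaryFile path WriteMode $ \h -> hPutStr h "ok\n\255oops\n"
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    first  <- hGetLine h
    second <- try (hGetLine h)
    return (first, second)

main :: IO ()
main = do
  -- "bad-utf8-demo.txt" is an arbitrary file name chosen for this sketch.
  (first, second) <- readTwoLines "bad-utf8-demo.txt"
  putStrLn first
  putStrLn (either (const "decoding error on line 2") id second)
```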
Performance is about 30% slower on "hGetContents >>= putStr" than
before. I've profiled it, and about 25% of this is in doing the
actual encoding/decoding, the rest is accounted for by the fact that
we're shuffling around 32-bit chars rather than bytes in the Handle
buffer, so there's not much we can do to improve this.
IO library restructuring
The major change here is that the implementation of the Handle
operations is separated from the underlying IO device, using type
classes. File descriptors are just one IO provider; I have also
implemented memory-mapped files (good for random-access read/write)
and a Handle that pipes output to a Chan (useful for testing code that
writes to a Handle). New kinds of Handle can be implemented outside
the base package, for instance someone could write bytestringToHandle.
A Handle is made using mkFileHandle:
-- | makes a new 'Handle'
mkFileHandle :: (IODevice dev, BufferedIO dev, Typeable dev)
             => dev -- ^ the underlying IO device, which must support
                    -- 'IODevice', 'BufferedIO' and 'Typeable'
             -> FilePath
                    -- ^ a string describing the 'Handle', e.g. the file
                    -- path for a file. Used in error messages.
             -> IOMode
                    -- ^ The mode in which the 'Handle' is to be used
             -> Maybe TextEncoding
                    -- ^ text encoding to use, if any
             -> IO Handle
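For a feel of how mkFileHandle is called, here is a hedged sketch that builds a Handle on top of the FD device from GHC.IO.FD. It is written against the base library as released, where mkFileHandle takes one more argument (a NewlineMode) than the signature in this message, so treat the last argument as an assumption about later versions rather than part of this patch. The helper name and file name are hypothetical.

```haskell
import System.IO (Handle, IOMode(..), hPutStrLn, hGetLine, hClose, nativeNewlineMode)
import qualified GHC.IO.FD as FD
import GHC.IO.Handle (mkFileHandle)
import GHC.IO.Encoding (utf8)

-- | Hypothetical helper: open a file as a raw FD device and wrap it in
-- a Handle that encodes and decodes text as UTF-8.
openUtf8 :: FilePath -> IOMode -> IO Handle
openUtf8 path mode = do
  -- The Bool asks for non-blocking mode; False keeps the FD blocking.
  (fd, _devType) <- FD.openFile path mode False
  -- NewlineMode is an extra argument in released versions of base.
  mkFileHandle fd path mode (Just utf8) nativeNewlineMode

main :: IO ()
main = do
  -- "mkfilehandle-demo.txt" is an arbitrary file name chosen for this sketch.
  out <- openUtf8 "mkfilehandle-demo.txt" WriteMode
  hPutStrLn out "\955x"
  hClose out
  inp <- openUtf8 "mkfilehandle-demo.txt" ReadMode
  line <- hGetLine inp
  hClose inp
  print (line == "\955x")
```

Any other type with the required instances could stand in for the FD here, which is the extension point the restructuring is meant to open up.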
This also means that someone can write a completely new IO
implementation on Windows based on native Win32 HANDLEs, and
distribute it as a separate package (I really hope somebody does
This restructuring isn't as radical as previous designs. I haven't
made any attempt to make a separate binary I/O layer, for example
(although hGetBuf/hPutBuf do bypass the text encoding). The main goal
here was to get Unicode support in, and to allow others to experiment
with making new kinds of Handle. We could split up the layers further
API changes and Module structure
NB. GHC.IOBase and GHC.Handle are now DEPRECATED (they are still
present, but are just re-exporting things from other modules now).
For 6.12 we'll want to bump base to version 5 and add a base4-compat.
For now I'm using #if __GLASGOW_HASKELL__ >= 611 to avoid deprecated
I split modules into smaller parts in many places. For example, we
now have GHC.IORef, GHC.MVar and GHC.IOArray containing the
implementations of IORef, MVar and IOArray respectively. This was
necessary for untangling dependencies, but it also makes things easier
The new module structure for the IO-related parts of the base
Implementation of the IO monad; unsafe*; throw/catch
The IOMode type
Buffers and operations on them
The IODevice and RawIO classes.
The BufferedIO class.
The FD type, with instances of IODevice, RawIO and BufferedIO.
The TextEncoding type; built-in TextEncodings; mkTextEncoding
Implementation internals for GHC.IO.Encoding
The main API for GHC's Handle implementation, provides all the Handle
operations + mkFileHandle + hSetEncoding.
Implementation of Handles and operations.
Parts of the Handle API implemented by file-descriptors: openFile,
stdin, stdout, stderr, fdToHandle etc.