Ready for testing: Unicode support for Handle I/O

Tue Feb 3 11:42:44 EST 2009

I've been working on adding proper Unicode support to Handle I/O in GHC, 
and I finally have something that's ready for testing.  I've put a patchset 
here:

   http://www.haskell.org/~simonmar/base-unicode.tar.gz

That is a set of patches against a GHC repo tree: unpack the tarball, and 
say 'sh apply /path/to/ghc/repo' to apply all the patches.  Then clean your 
tree and build it from scratch (or if you're using the new GHC build 
system, just say 'make' ;-).  It should validate, bar one or two minor 
failures.

Oh, it doesn't work on Windows yet.  That's the major thing left to do.  If 
anyone else felt like tackling this I'd be delighted: all you have to do is 
implement a Win32 equivalent of the module GHC.IO.Encoding.Iconv (see 
below), everything else should work unchanged.

Depending on whether any further changes are required, I may amend-record 
some of these patches, so treat them as temporary patches for testing only.

Below is what will be the patch description in the patch for libraries/base.

Comments/discussion please!

Cheers,
	Simon

Unicode-aware Handles
~~~~~~~~~~~~~~~~~~~~~

This is a significant restructuring of the Handle implementation with
the primary goal of supporting Unicode character encodings.

The only change to the existing behaviour is that by default, text IO
is done in the prevailing encoding of the system.  Handles created by
openBinaryFile use the Latin-1 encoding, as do Handles placed in
binary mode using hSetBinaryMode.

We provide a way to change the encoding for an existing Handle:

   hSetEncoding :: Handle -> TextEncoding -> IO ()

and various encodings:

   latin1,
   utf8,
   utf16, utf16le, utf16be,
   utf32, utf32le, utf32be,
   localeEncoding,

and a way to look up other encodings:

   mkTextEncoding :: String -> IO TextEncoding

(it's system-dependent whether the requested encoding will be available).

Currently hSetEncoding is available from GHC.IO.Handle, and the
encodings are available from GHC.IO.Encoding.  We may want to export
these from somewhere more permanent; that's something for a library
proposal.
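
As a rough sketch of how these pieces fit together (the helper names
writeUtf8, readBytes and readWith are made up for illustration, and the
imports assume the module locations given above), one can write a file as
UTF-8, inspect the resulting bytes through a binary Handle, and read it
back using an encoding looked up by name:

```haskell
import Data.Char (ord)
import GHC.IO.Encoding (mkTextEncoding, utf8)
import GHC.IO.Handle (hSetEncoding)
import System.IO

-- Hypothetical helper: write a String as UTF-8, whatever the locale says.
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path str = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h str

-- Hypothetical helper: read through a binary Handle.  Binary Handles use
-- Latin-1, so each Char read corresponds to exactly one byte.
readBytes :: FilePath -> IO [Int]
readBytes path = do
  h <- openBinaryFile path ReadMode
  s <- hGetContents h
  length s `seq` hClose h   -- force the lazy read before closing
  return (map ord s)

-- Hypothetical helper: read a file with an encoding looked up by name.
-- mkTextEncoding fails in IO if the system does not know the name.
readWith :: String -> FilePath -> IO String
readWith name path = do
  enc <- mkTextEncoding name
  h <- openFile path ReadMode
  hSetEncoding h enc
  s <- hGetContents h
  length s `seq` hClose h
  return s
```

Setting the encoding explicitly, rather than relying on localeEncoding,
keeps the on-disk bytes the same from one machine to the next.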

Thanks to suggestions from Duncan Coutts, it's possible to call
hSetEncoding even on buffered read Handles, and the right thing
happens.  So we can read from text streams that include multiple
encodings, such as an HTTP response or email message, without having
to turn buffering off (though there is a penalty for switching
encodings on a buffered Handle, as the IO system has to do some
re-decoding to figure out where it should start reading from again).
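
To illustrate, here is a minimal sketch that reads one header line as
Latin-1 and then switches the same buffered Handle to UTF-8 for the rest
of the stream.  The function name and the "one header line, then a body"
layout are assumptions made up for this example, not part of the patch:

```haskell
import GHC.IO.Encoding (latin1, utf8)
import GHC.IO.Handle (hSetEncoding)
import System.IO

-- Hypothetical helper: the first line is decoded as Latin-1, everything
-- after it as UTF-8.  Buffering is left at its default.
readHeaderThenBody :: FilePath -> IO (String, String)
readHeaderThenBody path = do
  h <- openFile path ReadMode
  hSetEncoding h latin1
  header <- hGetLine h
  -- Switch encodings mid-stream; the IO system works out where in the
  -- byte buffer the next character starts.
  hSetEncoding h utf8
  body <- hGetContents h
  length body `seq` hClose h   -- force the lazy read before closing
  return (header, body)
```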

If there is a decoding error, it is reported when an attempt is made
to read the offending character from the Handle, as you would expect.
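
A small sketch of what that looks like from the caller's side, assuming
the error surfaces as an ordinary IOException (the function name is made
up for this example).  It reads characters one at a time and returns
whatever decoded cleanly, together with the error, if any:

```haskell
import Control.Exception (IOException, try)
import GHC.IO.Encoding (utf8)
import GHC.IO.Handle (hSetEncoding)
import System.IO
import System.IO.Error (isEOFError)

-- Hypothetical helper: decode a file as UTF-8 character by character,
-- stopping at end of file or at the first IO error.
readUntilError :: FilePath -> IO (String, Maybe IOException)
readUntilError path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8
  result <- go h []
  hClose h
  return result
  where
    go h acc = do
      r <- try (hGetChar h) :: IO (Either IOException Char)
      case r of
        Right c -> go h (c : acc)
        Left e
          | isEOFError e -> return (reverse acc, Nothing)  -- clean end
          | otherwise    -> return (reverse acc, Just e)   -- e.g. bad bytes
```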

Performance is about 30% slower on "hGetContents >>= putStr" than
before.  I've profiled it, and about 25% of this is in doing the
actual encoding/decoding; the rest is accounted for by the fact that
we're shuffling around 32-bit chars rather than bytes in the Handle
buffer, so there's not much we can do to improve this.

IO library restructuring
~~~~~~~~~~~~~~~~~~~~~~~~

The major change here is that the implementation of the Handle
operations is separated from the underlying IO device, using type
classes.  File descriptors are just one IO provider; I have also
implemented memory-mapped files (good for random-access read/write)
and a Handle that pipes output to a Chan (useful for testing code that
writes to a Handle).  New kinds of Handle can be implemented outside
the base package; for instance, someone could write bytestringToHandle.
A Handle is made using mkFileHandle:

-- | makes a new 'Handle'
mkFileHandle :: (IODevice dev, BufferedIO dev, Typeable dev)
              => dev -- ^ the underlying IO device, which must support
                     -- 'IODevice', 'BufferedIO' and 'Typeable'
              -> FilePath
                     -- ^ a string describing the 'Handle', e.g. the file
                     -- path for a file.  Used in error messages.
              -> IOMode
                     -- ^ The mode in which the 'Handle' is to be used
              -> Maybe TextEncoding
                     -- ^ text encoding to use, if any
              -> IO Handle

This also means that someone can write a completely new IO
implementation on Windows based on native Win32 HANDLEs, and
distribute it as a separate package (I really hope somebody does
this!).

This restructuring isn't as radical as previous designs.  I haven't
made any attempt to make a separate binary I/O layer, for example
(although hGetBuf/hPutBuf do bypass the text encoding).  The main goal
here was to get Unicode support in, and to allow others to experiment
with making new kinds of Handle.  We could split up the layers further
later.
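
As a sketch of the hGetBuf/hPutBuf point (the helper names are invented
for this example), raw bytes can be pushed through an ordinary text-mode
Handle without the text encoding being involved:

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (allocaArray, peekArray, withArrayLen)
import System.IO

-- Hypothetical helper: write raw bytes via hPutBuf.  The Handle is opened
-- in text mode, but hPutBuf does not go through the text encoding.
writeRaw :: FilePath -> [Word8] -> IO ()
writeRaw path bytes = withFile path WriteMode $ \h ->
  withArrayLen bytes $ \n ptr -> hPutBuf h ptr n

-- Hypothetical helper: read up to maxLen raw bytes via hGetBuf.
readRaw :: FilePath -> Int -> IO [Word8]
readRaw path maxLen = withFile path ReadMode $ \h ->
  allocaArray maxLen $ \ptr -> do
    n <- hGetBuf h ptr maxLen   -- n is the number of bytes actually read
    peekArray n ptr
```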

API changes and Module structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NB. GHC.IOBase and GHC.Handle are now DEPRECATED (they are still
present, but are just re-exporting things from other modules now).
For 6.12 we'll want to bump base to version 5 and add a base4-compat.
For now I'm using #if __GLASGOW_HASKELL__ >= 611 to avoid deprecation
warnings.

I split modules into smaller parts in many places.  For example, we
now have GHC.IORef, GHC.MVar and GHC.IOArray containing the
implementations of IORef, MVar and IOArray respectively.  This was
necessary for untangling dependencies, but it also makes things easier
to follow.

The new module structure for the IO-related parts of the base
package is:

GHC.IO
   Implementation of the IO monad; unsafe*; throw/catch

GHC.IO.IOMode
   The IOMode type

GHC.IO.Buffer
   Buffers and operations on them

GHC.IO.Device
   The IODevice and RawIO classes.

GHC.IO.BufferedIO
   The BufferedIO class.

GHC.IO.FD
   The FD type, with instances of IODevice, RawIO and BufferedIO.

GHC.IO.Exception
   IO-related Exceptions

GHC.IO.Encoding
   The TextEncoding type; built-in TextEncodings; mkTextEncoding

GHC.IO.Encoding.Types
GHC.IO.Encoding.Iconv
   Implementation internals for GHC.IO.Encoding

GHC.IO.Handle
   The main API for GHC's Handle implementation; provides all the Handle
   operations + mkFileHandle + hSetEncoding.

GHC.IO.Handle.Types
GHC.IO.Handle.Internals
GHC.IO.Handle.Text
   Implementation of Handles and operations.

GHC.IO.Handle.FD
   Parts of the Handle API implemented by file-descriptors: openFile,
   stdin, stdout, stderr, fdToHandle etc.