Text in Haskell: a second proposal

Simon Marlow simonmar@microsoft.com
Fri, 9 Aug 2002 11:26:47 +0100


Here's my take on the Unicode issue.  Summary: unless there's a very
good reason, I don't think we should decouple encoding/decoding from
I/O, at least for the standard I/O library.

Firstly, types.  We already have all the necessary types:

  - Char, a Unicode code point
  - Word8, an octet
  - CChar, a type representing the C 'char' type

The latter two are defined by the FFI addendum.

Taking hGetChar as an example:

	hGetChar :: Handle -> IO Char

This combines, IMO, two operations: reading some data from the file, and
decoding enough of it to yield a Char.  Underneath the hood, the Handle
has a particular encoding associated with it.  In GHC, currently we have
two encodings, ISO 8859-1 (aka binary, but we shouldn't use that term
because the I/O library works in terms of Char) and MS-DOS text.  We
could easily extend the set of encodings to include UTF-8 and others.

Seeking only works on Handles with a 1-1 correspondence between handle
positions and characters (i.e. in the ISO encoding).
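
To illustrate the seeking restriction, here is a small sketch of how
character positions and byte positions drift apart once an encoding is
variable-width. UTF-8 is used as the example; utf8Width and byteOffset
are hypothetical helper names, not part of any existing library.

```haskell
import Data.Char (ord)

-- Number of octets UTF-8 uses to encode one code point.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Byte offset at which the character with index i starts, assuming the
-- text is stored as UTF-8.  Finding it means walking every character
-- before it, so the offset depends on the contents of the file.
byteOffset :: Int -> String -> Int
byteOffset i = sum . map utf8Width . take i
```

In a single-octet encoding the character at index 3 always starts at
byte 3. Under UTF-8, byteOffset 3 "na\239ve" is 4, because the third
character takes two octets.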

Why combine I/O and {en,de}coding?  Firstly, efficiency.  Secondly,
because it's convenient: if we were to express encodings as stream
transformers, e.g.:

	decodeUTF8 :: [Word8] -> [Char]

then we would have to do all our I/O using lazy streams.  You can't
write hGetChar in terms of hGetWord8 using this: you need the non-stream
version, which in general looks something like

	decode :: Word8 -> DecodingState
		-> (Maybe [Char], DecodingState)

for UTF-8 you can get away with something simpler, but AFAIK that's not
true in general.  You might want to use compression as an encoding, for
example.  So in general you need to store not only the DecodingState but
also some cached characters between invocations of hGetChar.  It's
highly unlikely that automatic optimisations will be able to do anything
useful with code written using the above interface, but we can write
efficient code if the encoder/decoder can work on the I/O buffer
directly.
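
As a rough sketch of the step-at-a-time interface above, here is one
way a UTF-8 decoder could be written against that signature. The
DecodingState constructors, toChar and decodeAll are assumptions made
for illustration. Malformed input is replaced with U+FFFD, and overlong
encodings are not rejected, so this is not a conforming decoder.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- Hypothetical state: idle, or partway through a multi-byte sequence.
data DecodingState
  = Idle
  | Pending Int Int  -- continuation bytes still expected, bits so far
  deriving (Eq, Show)

-- Feed one octet.  Nothing means no character is complete yet.
decode :: Word8 -> DecodingState -> (Maybe [Char], DecodingState)
decode w Idle
  | w < 0x80              = (Just [chr b], Idle)          -- ASCII
  | w >= 0xC0 && w < 0xE0 = (Nothing, Pending 1 (b .&. 0x1F))
  | w >= 0xE0 && w < 0xF0 = (Nothing, Pending 2 (b .&. 0x0F))
  | w >= 0xF0 && w < 0xF8 = (Nothing, Pending 3 (b .&. 0x07))
  | otherwise             = (Just ['\xFFFD'], Idle)       -- invalid lead
  where b = fromIntegral w
decode w (Pending n acc)
  | w .&. 0xC0 == 0x80 =
      let acc' = (acc `shiftL` 6) .|. (fromIntegral w .&. 0x3F)
      in if n == 1
           then (Just [toChar acc'], Idle)
           else (Nothing, Pending (n - 1) acc')
  | otherwise =
      -- Sequence broken: emit a replacement, then treat this octet
      -- as the start of a new sequence.
      let (out, st) = decode w Idle
      in (Just ('\xFFFD' : fromMaybe [] out), st)

-- Map out-of-range values and surrogates to the replacement character.
toChar :: Int -> Char
toChar n
  | n > 0x10FFFF                = '\xFFFD'
  | n >= 0xD800 && n <= 0xDFFF = '\xFFFD'
  | otherwise                   = chr n

-- The stream version, recovered by threading the state through a list.
decodeAll :: [Word8] -> [Char]
decodeAll = go Idle
  where
    go Idle []     = []
    go _    []     = ['\xFFFD']  -- input ended mid-sequence
    go st   (w:ws) =
      let (out, st') = decode w st
      in fromMaybe [] out ++ go st' ws
```

The broken-sequence case shows why the result is a list rather than a
single Char: one input octet can produce more than one character. Going
from decode to decodeAll is easy; it is the other direction that is the
problem.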

There's no reason why we shouldn't provide encoders/decoders as a
separate library *as well*, and we should definitely also provide
low-level I/O that works with Word8.
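
For the Word8 side, a minimal sketch of what such an operation might
look like, assuming the hGetBuf found in today's System.IO together
with the FFI marshalling utilities. hGetOctets is a made-up name, and
this is not a proposal for the actual interface.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (Ptr)
import System.IO

-- Read up to n octets from the Handle with no decoding applied.
-- Fewer than n are returned only if end of file is reached first.
hGetOctets :: Handle -> Int -> IO [Word8]
hGetOctets h n =
  allocaBytes n $ \buf -> do
    got <- hGetBuf h (buf :: Ptr Word8) n  -- number of octets read
    peekArray got buf
```

A list of Word8 is not an efficient representation; the point is only
that no encoding is involved at any stage.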

Cheers,
	Simon