Text in Haskell: a second proposal

Simon Marlow simonmar@microsoft.com
Tue, 13 Aug 2002 12:13:17 +0100

> At 2002-08-09 03:26, Simon Marlow wrote:
> >Why combine I/O and {en,de}coding?  Firstly, efficiency.
> Hmm... surely the encoding functions can be defined efficiently?
>     decodeISO88591 :: [Word8] -> [Char];
>     encodeISO88591 :: [Char] -> [Word8]; -- uses low octet of codepoint
> You could surely define them as native functions very efficiently, if
> necessary.

That depends on what you mean by efficient: these functions introduce an
extra layer of intermediate list between the handle buffer and the final
[Char], and furthermore they don't work with partial reads - the input
has to be a lazy stream obtained from hGetContents.  I don't want to be
forced to use lazy I/O.
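For concreteness, here is a sketch of how the list-based codec from the
quoted message could be defined (ISO-8859-1 maps each octet directly to
the code point of the same value, so both directions are trivial):

```haskell
import Data.Word (Word8)
import Data.Char (chr, ord)

-- Sketch of the list-based codec named in the quoted proposal.
-- ISO-8859-1: each octet is exactly the code point of the character.
decodeISO88591 :: [Word8] -> [Char]
decodeISO88591 = map (chr . fromIntegral)

-- Uses the low octet of the code point; characters above U+00FF are
-- silently truncated in this sketch.
encodeISO88591 :: [Char] -> [Word8]
encodeISO88591 = map (fromIntegral . ord)
```

Even assuming these are deforested or implemented natively, the
intermediate [Word8] list is still there at the interface.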

> A monadic stream-transformer:
>    decodeStreamUTF8 :: (Monad m) => m Word8 -> m Char;
>    hGetChar h = decodeStreamUTF8 (hGetWord8 h);
> This works provided each Char corresponds to a contiguous block of
> Word8s, with no state between them. I think that includes all the
> standard character encoding schemes.
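A sketch of how such a transformer might look for plain UTF-8 (which is
indeed stateless), ignoring error handling for malformed, truncated or
overlong sequences; the byteSource helper is only there so the sketch
can be tried out in IO:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.IORef (newIORef, readIORef, writeIORef)
import Data.Word (Word8)

-- Pull exactly as many octets as one character needs from the
-- supplied action.  No validation of continuation bytes.
decodeStreamUTF8 :: Monad m => m Word8 -> m Char
decodeStreamUTF8 getW8 = do
  b0 <- getW8
  let n0 = fromIntegral b0 :: Int
  if n0 < 0x80
    then return (chr n0)                    -- single-octet (ASCII) case
    else do
      let (extra, initBits)
            | n0 < 0xE0 = (1, n0 .&. 0x1F)  -- 2-octet sequence
            | n0 < 0xF0 = (2, n0 .&. 0x0F)  -- 3-octet sequence
            | otherwise = (3, n0 .&. 0x07)  -- 4-octet sequence
          go acc 0 = return (chr acc)
          go acc k = do
            b <- getW8
            go ((acc `shiftL` 6) .|. (fromIntegral b .&. 0x3F))
               (k - 1 :: Int)
      go initBits extra

-- Demo helper: an IO Word8 action yielding successive bytes of a list.
byteSource :: [Word8] -> IO (IO Word8)
byteSource bs0 = do
  ref <- newIORef bs0
  return $ do
    bs <- readIORef ref
    case bs of
      (b:rest) -> writeIORef ref rest >> return b
      []       -> ioError (userError "end of input")
```

Note that because the transformer only ever demands octets from the
monadic action, it works with partial reads and never needs a lazy
stream of the whole input.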

This is better: it doesn't force you to use lazy I/O, and when
specialised to the IO monad it might get decent performance.  The
problem is that in general I don't think you can assume the lack of
state.  For example: UTF-7 has a state which needs to be retained
between characters, and UTF-16 and UTF-32 have an endianness state which
can be changed by a special sequence (the byte-order mark) at the
beginning of the file.  Some other encodings have states too.
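One way to see the problem: a UTF-16 decoder that tracks endianness has
to hand its updated state back alongside each character, which the
stateless 'm Word8 -> m Char' shape cannot express.  A hypothetical
interface (surrogate pairs and error handling omitted):

```haskell
import Data.Char (chr)
import Data.Word (Word8)

data Endianness = BE | LE deriving (Eq, Show)

-- Each step consumes octets and returns the state to carry into the
-- next character; the caller must thread the Endianness through.
decodeUTF16Unit :: Monad m => m Word8 -> Endianness -> m (Char, Endianness)
decodeUTF16Unit getW8 end = do
  b1 <- getW8
  b2 <- getW8
  let u = case end of
            BE -> fromIntegral b1 * 256 + fromIntegral b2 :: Int
            LE -> fromIntegral b2 * 256 + fromIntegral b1
  case u of
    0xFEFF -> decodeUTF16Unit getW8 end          -- BOM: confirms order
    0xFFFE -> decodeUTF16Unit getW8 (other end)  -- swapped BOM: switch
    _      -> return (chr u, end)                -- surrogate pairs omitted
  where
    other BE = LE
    other LE = BE
```

So a workable design needs either an explicit state parameter like this,
or the state hidden inside the handle, which is one reason to keep
{en,de}coding close to the I/O layer.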