[Haskell-beginners] hGetContents, unicode and linux

Sun Nov 28 03:35:05 EST 2010

I wrote:
>> In any case, you still need to have the correct encoding
>> set on the handles as before.

Michael Snoyman wrote:
> ...it does *not* address invalid byte sequences (AFAIK),
> which can be dealt with using the bytestring/text decoding
> combination.

Well, using the standard interface, you have three choices
on how to handle invalid byte sequences: drop them,
use a replacement character, or throw an exception, with
the third choice being the default. You specify that choice
when you set the encoding, via a suffix such as //IGNORE or
//TRANSLIT on the name you pass to mkTextEncoding. See the
documentation for System.IO for more details.
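
As a rough sketch of the above, assuming the //IGNORE suffix notation
from the System.IO documentation is available on your platform, the
following reads a file as UTF-8 and drops invalid byte sequences. The
name readUtf8Dropping is made up for illustration only.

```haskell
import System.IO

-- | Read a whole file as UTF-8, dropping invalid byte sequences.
-- "readUtf8Dropping" is a hypothetical helper name, not a library function.
-- Assumption: the "//IGNORE" suffix is honored on this platform.
readUtf8Dropping :: FilePath -> IO String
readUtf8Dropping path =
  withFile path ReadMode $ \h -> do
    -- The suffix after "//" selects how invalid sequences are handled:
    -- "//IGNORE" drops them, "//TRANSLIT" substitutes a replacement
    -- character, and no suffix gives the default (an exception).
    enc <- mkTextEncoding "UTF-8//IGNORE"
    hSetEncoding h enc
    s <- hGetContents h
    -- hGetContents is lazy, so force the whole string before
    -- withFile closes the handle.
    length s `seq` return s
```

Changing the encoding name to "UTF-8//TRANSLIT" should give you the
replacement-character behavior instead, and plain "UTF-8" the exception.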

However, the first two choices are implemented via GNU iconv,
so on Windows you only have the default behavior, the exception.

Also, in certain special situations, such as when you need to
specify the replacement character yourself or you need
in-band exceptions (e.g. a stream of Either error character),
the options do seem limited currently.

You might still need to fall back on the old bytestring hack
in those cases. If you find yourself in that situation, it might
be a good idea to push the maintainers of System.IO and
Data.Text to continue to improve support for encodings in the
standard libraries.
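
For what it's worth, here is a minimal sketch of that fallback, assuming
the bytestring and text packages are installed. It reads raw bytes and
decodes them with Data.Text.Encoding, either with a replacement
character of your choosing or with the error reported in-band. The
helper names decodeReplacing, decodeEither and readFileReplacing are
invented for this example. Note that decodeEither reports one error for
the whole input, which is coarser than a per-character stream of Either.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import           Data.Text.Encoding (decodeUtf8', decodeUtf8With)
import           Data.Text.Encoding.Error (UnicodeException, replace)

-- | Decode UTF-8, substituting the given character for invalid input.
-- (Hypothetical helper name.)
decodeReplacing :: Char -> B.ByteString -> T.Text
decodeReplacing c = decodeUtf8With (replace c)

-- | Decode UTF-8, returning the error in-band instead of throwing.
-- (Hypothetical helper name.)
decodeEither :: B.ByteString -> Either UnicodeException T.Text
decodeEither = decodeUtf8'

-- | Read a file as raw bytes, then decode it with a chosen replacement.
-- (Hypothetical helper name.)
readFileReplacing :: Char -> FilePath -> IO T.Text
readFileReplacing c path = fmap (decodeReplacing c) (B.readFile path)
```

Because the decoding happens in pure code after the bytes are read, the
handle's encoding setting does not come into play here at all.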

Regards,
Yitz