[Haskell-beginners] hGetContents, unicode and linux

Sun Nov 28 08:53:47 EST 2010

On Sun, Nov 28, 2010 at 10:35 AM, Yitzchak Gale <gale at sefer.org> wrote:
> I wrote:
>>> In any case, you still need to have the correct encoding
>>> set on the handles as before.
>
> Michael Snoyman wrote:
>> ...it does *not* address invalid byte sequences (AFAIK),
>> which can be dealt with using the bytestring/text decoding
>> combination.
>
> Well, using the standard interface, you have three choices
> on how to handle invalid byte sequences - drop them,
> use a replacement character, or throw an exception, with
> the third choice being the default. You specify that choice
> when you set the encoding. See the documentation for
> System.IO for more details.
>
> However, those choices are implemented via GNU iconv,
> so on Windows you only have the default behavior.
>
> Also, in certain special situations - like if you need to be able
> to specify the replacement character yourself, or if you need
> in-band exceptions (e.g. a stream of Either error character),
> then the options do seem limited currently.
>
> You might still need to fall back on the old bytestring hack
> in those cases. If you find yourself in that situation, it might
> be a good idea to push the maintainers of System.IO and
> Data.Text to continue to improve support for encodings in the
> standard libraries.
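
For reference, the bytestring/text fallback mentioned above can be sketched
roughly as follows. This assumes the bytestring and text packages, and in
particular decodeUtf8' and decodeUtf8With from Data.Text.Encoding as found in
recent releases of text. The helper names decodeStrict and decodeWithMarker
are made up for illustration. Note that the Either here covers the whole
input, which is coarser than a per-character stream of errors.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)
import Data.Text.Encoding.Error (UnicodeException, replace)

-- Hypothetical helper: report invalid UTF-8 in-band as a Left value
-- instead of throwing an exception.
decodeStrict :: B.ByteString -> Either UnicodeException T.Text
decodeStrict = decodeUtf8'

-- Hypothetical helper: substitute a caller-chosen character wherever
-- the input is not valid UTF-8.
decodeWithMarker :: Char -> B.ByteString -> T.Text
decodeWithMarker c = decodeUtf8With (replace c)

main :: IO ()
main = do
  -- The byte 0xFF can never appear in well-formed UTF-8.
  let bytes = B.pack [0x61, 0xFF, 0x62]
  putStrLn (either (const "invalid input") T.unpack (decodeStrict bytes))
  putStrLn (T.unpack (decodeWithMarker '?' bytes))
```

The file would be read with Data.ByteString.readFile first, so no handle
encoding is involved and all decoding decisions stay in pure code.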

I hadn't realized that the standard libraries offered so much
sophistication in their approach to file encodings. I'll have to look
at it more thoroughly.
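
As a starting point, here is a minimal sketch of the standard-interface
route, assuming the choice is made through the //IGNORE and //TRANSLIT
suffixes on the encoding name, as documented for mkTextEncoding in
System.IO. The helper readFileAs and the file name sample.txt are
placeholders, and whether the suffixes are honoured may depend on the GHC
version and platform, per the iconv caveat quoted above.

```haskell
import System.IO

-- Hypothetical helper: read a file strictly using the named encoding.
readFileAs :: String -> FilePath -> IO String
readFileAs encName path =
  withFile path ReadMode $ \h -> do
    enc <- mkTextEncoding encName
    hSetEncoding h enc
    s <- hGetContents h
    -- Force the whole string before withFile closes the handle.
    length s `seq` return s

main :: IO ()
main = do
  let path = "sample.txt"  -- placeholder file name
  -- Write the bytes 'a', 0xFF, 'b'; 0xFF is never valid in UTF-8.
  withBinaryFile path WriteMode $ \h -> hPutStr h ['a', '\xFF', 'b']
  dropped  <- readFileAs "UTF-8//IGNORE" path    -- drop invalid sequences
  replaced <- readFileAs "UTF-8//TRANSLIT" path  -- use a replacement character
  print dropped
  print replaced
```

Passing plain "UTF-8" instead selects the default behavior, which is to
throw an exception on invalid input.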

Michael