[Haskell-beginners] hGetContents, unicode and linux

Sun Nov 28 02:27:19 EST 2010

On Sun, Nov 28, 2010 at 9:19 AM, Michael Snoyman <michael at snoyman.com> wrote:
> On Sun, Nov 28, 2010 at 8:53 AM, Yitzchak Gale <gale at sefer.org> wrote:
>> Michael Snoyman wrote:
>>> Perhaps a silly question, but are you certain that the input file is
>>> valid UTF-8?
>>
>> That is a very good point.
>>
>>> You could also try using the readFile from utf8-string...
>>> [or] read the contents as a lazy
>>> bytestring and then use the decode functions...
>>
>> Those approaches are now both deprecated. Either do
>> what you are doing, which gives you conceptually simple
>> strings as lists of Char. Or, for better efficiency, use
>> the text package:
>>
>>>    import qualified Data.Text.Lazy.IO as T
>>>    main :: IO ()
>>>    main
>>>     = do   text <- T.readFile "unicode.txt"
>>>            T.putStr text
>>
>> In any case, you still need to have the correct encoding
>> set on the handles as before. (And the input needs to
>> be valid for your selected encoding.)
>
> Which is why I would actually recommend sticking with the
> bytestring/text combination when you know what the file encoding will
> be and it is not system-dependent. It's the approach that I use with
> Hamlet et al for precisely that reason.

Sorry for replying to myself, but I didn't explain that very well.
You're right that setting the encoding on the handle can work well
enough for this, but it does *not* address invalid byte sequences
(AFAIK). Those can be dealt with by reading the file as a bytestring
and decoding it with the text package's decoding functions.
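
A minimal sketch of that bytestring/text combination, assuming the
file is meant to be UTF-8 and reusing the "unicode.txt" name from the
example quoted above. The helper names decodeLenient and decodeStrict
are made up for illustration; the library functions come from the
bytestring and text packages. Treat it as one possible approach, not
the only one:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)
import System.IO (hPutStrLn, hSetEncoding, stderr, stdout, utf8)

-- Hypothetical helper: decode UTF-8, substituting U+FFFD for any
-- invalid input instead of throwing an exception.
decodeLenient :: B.ByteString -> T.Text
decodeLenient = decodeUtf8With lenientDecode

-- Hypothetical helper: decode UTF-8 strictly, reporting invalid
-- input as a Left value rather than an exception.
decodeStrict :: B.ByteString -> Either String T.Text
decodeStrict bytes = either (Left . show) Right (decodeUtf8' bytes)

main :: IO ()
main = do
  -- Output still goes through a handle, so its encoding is set
  -- explicitly here (assumption: we want UTF-8 on stdout).
  hSetEncoding stdout utf8
  -- Read raw bytes; no locale-dependent decoding happens at this point.
  bytes <- B.readFile "unicode.txt"
  case decodeStrict bytes of
    Right text -> TIO.putStr text
    Left err -> do
      hPutStrLn stderr ("invalid UTF-8, decoding leniently: " ++ err)
      TIO.putStr (decodeLenient bytes)
```

With this arrangement the input encoding is fixed in the program
rather than taken from the locale, and the program decides what
happens to invalid bytes (report them, replace them, or both).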

Michael