[Haskell-cafe] Low level problem with Data.Text.IO

Wed Aug 25 17:25:33 EDT 2010

For debugging the error, we'll need to know what your locale's
encoding is. You can see this by echoing the $LANG environment
variable. For example:

$ echo $LANG
en_US.UTF-8

means my encoding is UTF-8.
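One caveat: $LANG is only the fallback. On POSIX systems LC_ALL and
LC_CTYPE take precedence when they are set. Here is a rough sketch (Python 3
assumed; the function names are mine, not from any library) that picks the
effective locale name and pulls out the encoding part:

```python
import os


def effective_locale_name(env):
    """Return the first non-empty value among LC_ALL, LC_CTYPE, LANG.

    That is the POSIX precedence order for the character-encoding category.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None


def codeset_of(locale_name):
    """Extract the codeset from a name like language_TERRITORY.codeset@modifier."""
    if not locale_name or "." not in locale_name:
        return None  # e.g. "C" or "POSIX": no codeset is named
    codeset = locale_name.split(".", 1)[1]
    return codeset.split("@", 1)[0]  # drop a trailing "@modifier", if any


if __name__ == "__main__":
    print(codeset_of(effective_locale_name(os.environ)))
```

If this prints None, the locale name does not specify an encoding at all,
which is worth mentioning in your reply.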

Haskell doesn't currently have any decoding libraries with good error
handling (that I know of), so you might need to use an external
library or program.

My preference is Python, since it has very descriptive errors. I'll
load a file, attempt to decode it with my locale encoding, and then
see what errors pop up:

$ python
>>> content = open("testfile", "rb").read()
>>> text = content.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9d in position 1: unexpected code byte

The exact error will help us generate a test file to reproduce the problem.
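The interactive session only reports the first bad byte. If the file has
several, something like the following sketch lists them all. It assumes
Python 3, and find_decode_errors is a name I made up for illustration; it
relies on the start, end, and reason attributes of UnicodeDecodeError.

```python
import sys


def find_decode_errors(data, encoding="utf-8"):
    """Return a list of (offset, bad_bytes, reason), one per undecodable span."""
    errors = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded.
            start = pos + exc.start
            end = max(pos + exc.end, start + 1)  # always make progress
            errors.append((start, data[start:end], exc.reason))
            pos = end  # resume just past the bad bytes
    return errors


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        for offset, bad, reason in find_decode_errors(handle.read()):
            print(offset, bad, reason)
```

Re-slicing after every error is quadratic in the worst case, so this is
meant for a quick look at a small test file, not for bulk processing.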

If you don't see any error, then the bug will be more difficult to
track down. Compile your program into a binary (ghc --make
ReadFiles.hs) and then run it with gdb, setting a breakpoint in the
malloc_error_break procedure:

$ ghc --make ReadFiles.hs
$ gdb ./ReadFiles
(gdb) break malloc_error_break
(gdb) run testfile

... program runs ...
BREAKPOINT

(gdb) bt

<stack trace here, copy and paste it into an email for us>

The stack trace might help narrow down where the memory corruption is occurring.

-----------------

If you don't care much about debugging, and just want to read the file:

The first step is to figure out which encoding the file is in.
Data.Text.IO is intended for decoding files in the system's locale
encoding (typically UTF-8), not for general-purpose "this file has
letters in it" IO. Web browsers are pretty good at auto-detecting
encodings. For example, if you load the file into Firefox and then look
at the (View -> Character Encoding) menu, which option is selected?
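If you'd rather not go through a browser, a crude alternative is to try a
few candidate encodings and see which ones decode without error. This is
only a sketch (Python 3 assumed; the function name and the candidate list
are my own choices), and a clean decode does not prove the guess is right:
single-byte encodings will accept almost any input.

```python
def encodings_that_decode(data, candidates=("utf-8", "utf-16", "shift_jis",
                                            "gbk", "cp1250")):
    """Return the candidate encodings under which data decodes without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding rejects the bytes; try the next one
        accepted.append(name)
    return accepted


if __name__ == "__main__":
    sample = "\u65e5\u672c\u8a9e".encode("shift_jis")  # Japanese sample text
    print(encodings_that_decode(sample))
```

Treat the result as a shortlist to check by eye, not as an answer.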

Next, you'll need to read the file in as bytes and then decode it. Use
Data.ByteString.hGetContents to read it in.

If it's encoded in one of the common UTF encodings (UTF-8, UTF-16,
UTF-32), then you can use the functions in Data.Text.Encoding to
convert from the file's bytes to text.

If it's an unusual encoding (windows-1250, shift_jis, gbk, etc.), then
you'll need a decoding library like "text-icu". Create the proper
decoder, feed in the bytes, and receive text.

If all else fails, you can use this function to decode the file as
ISO-8859-1. It goes through an intermediate String, so it'll be too
slow to use on any file larger than a few dozen megabytes. Furthermore,
unless the file really is ISO-8859-1, any non-ASCII characters in it
will likely come out corrupted.

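import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B8
import Data.Text (Text)
import qualified Data.Text as T

-- Each byte becomes the Unicode code point with the same value,
-- which is exactly the ISO-8859-1 mapping.
iso8859_1 :: ByteString -> Text
iso8859_1 = T.pack . B8.unpack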

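To show what that kind of corruption looks like, here is the same mapping
in Python 3 (a sketch for illustration only; the sample string is made up).
Decoding UTF-8 bytes as ISO-8859-1 turns each non-ASCII character into two
or more wrong ones, but no bytes are lost, so the damage can be undone.

```python
def iso8859_1(data):
    """Decode bytes as ISO-8859-1: each byte becomes the code point of equal value."""
    return data.decode("iso-8859-1")


original = "caf\u00e9"                  # "cafe" with an acute accent on the e
utf8_bytes = original.encode("utf-8")   # the accented e becomes bytes C3 A9
garbled = iso8859_1(utf8_bytes)         # now 5 characters instead of 4

# The bytes survive the wrong decode, so re-encoding as ISO-8859-1 and
# decoding as UTF-8 recovers the original text.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

if __name__ == "__main__":
    print(repr(original), repr(garbled), repr(recovered))
```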
If any corruption occurs, please reply with *what* characters were
corrupted; this might help us reproduce the error.