[Haskell-cafe] getting crazy with character encoding
Seth Gordon
sethg at ropine.com
Wed Sep 12 11:16:25 EDT 2007
Andrea Rossato wrote:
> Hi,
>
> suppose that, on a Linux system in a UTF-8 locale, you create a file
> with non-ASCII characters. For instance:
> touch abèèè
>
> Now, I would expect that the output of a shell command such as
> "ls ab*"
> would be a string/list of 5 chars. Instead I find it to be a list of 8
> chars...;-)
The file name may have five *characters*, but if it's encoded as UTF-8,
then it has eight *bytes*.
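To make the arithmetic concrete, here is a minimal sketch that encodes the
file name by hand and counts both ways. It uses only the base library; the
names encodeChar, encodeUtf8 and fileName are made up for this illustration,
and a real program would more likely use an existing encoding library.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode a single Unicode code point as UTF-8 (hand-rolled, illustration only).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi s   = fromIntegral (n `shiftR` s)
    -- A continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

-- "abèèè", written with escapes so the source file's own encoding is irrelevant.
fileName :: String
fileName = "ab\232\232\232"

main :: IO ()
main = do
  print (length fileName)              -- 5 characters
  print (length (encodeUtf8 fileName)) -- 8 bytes
```

The two ASCII letters take one byte each and each è (U+00E8) takes two,
hence 2 + 3*2 = 8.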
It appears that in spite of the locale definition, hGetContents is
treating each byte as a separate character without translating the
multi-byte sequences *from* UTF-8, and then putStrLn sends each of those
bytes to standard output without translating the non-ASCII characters
*to* UTF-8. So the second line of your program's output is
correct...but only by accident.
Futzing around a little bit in ghci, I see that I can define a string
"\1488", but if I send that string to putStrLn, I get nothing, when I
should get א (the Hebrew letter aleph).
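The behaviour described above can be modelled in a few lines. This is only a
sketch of what I believe is going on, not the actual library code:
readBytewise and writeBytewise are made-up names standing in for the input
and output sides, and decodeUtf8 is a hand-rolled decoder that assumes
well-formed input and does no validation.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- The eight UTF-8 bytes of the file name "abèèè".
rawName :: [Word8]
rawName = [0x61, 0x62, 0xC3, 0xA8, 0xC3, 0xA8, 0xC3, 0xA8]

-- Model of the input side: every byte becomes one Char.
readBytewise :: [Word8] -> String
readBytewise = map (chr . fromIntegral)

-- Model of the output side: every Char is cut down to its low 8 bits.
writeBytewise :: String -> [Word8]
writeBytewise = map (fromIntegral . ord)

-- What translating *from* UTF-8 would look like (well-formed input assumed).
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b:bs)
  | b < 0x80  = chr (fromIntegral b) : decodeUtf8 bs
  | b < 0xE0  = multi 1 (b .&. 0x1F)
  | b < 0xF0  = multi 2 (b .&. 0x0F)
  | otherwise = multi 3 (b .&. 0x07)
  where
    -- Combine the lead byte's payload with k continuation bytes.
    multi k lead =
      let (conts, rest) = splitAt k bs
          cp = foldl (\acc x -> acc `shiftL` 6 .|. fromIntegral (x .&. 0x3F))
                     (fromIntegral lead) conts
      in chr cp : decodeUtf8 rest

main :: IO ()
main = do
  print (length (readBytewise rawName))                    -- 8
  print (writeBytewise (readBytewise rawName) == rawName)  -- True
  print (length (decodeUtf8 rawName))                      -- 5
  print (writeBytewise "\1488")                            -- [208]
```

The second line is the "accident": the bytes pass through unchanged, so the
terminal still sees valid UTF-8. The last line shows why the aleph vanishes
in this model: U+05D0 is cut down to the single byte 0xD0, which is not valid
UTF-8 on its own, whereas the correct encoding is the two bytes 0xD7 0x90.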
I ♥ Unicode.