[Haskell-cafe] getting crazy with character encoding

Seth Gordon sethg at ropine.com
Wed Sep 12 11:16:25 EDT 2007


Andrea Rossato wrote:
> Hi,
> 
> suppose that, on a Linux system with a UTF-8 locale, you create a file
> with non-ASCII characters. For instance:
> touch abèèè
> 
> Now, I would expect that the output of a shell command such as 
> "ls ab*"
> would be a string/list of 5 chars. Instead I find it to be a list of 8
> chars...;-)

The file name may have five *characters*, but if it's encoded as UTF-8, 
then it has eight *bytes*.
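
To make the character/byte distinction concrete, here is a minimal sketch of a hand-rolled UTF-8 encoder using only base. The names utf8Char, utf8String and fileName are mine, made up for illustration; this is not what any library does internally, and it does not reject surrogates or otherwise validate its input.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one code point as UTF-8 (illustrative helper, not a library function).
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]                                -- ASCII: 1 byte
  | n < 0x800   = [0xC0 .|. top 6, cont 0]                        -- 2 bytes
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]               -- 3 bytes
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]      -- 4 bytes
  where
    n      = ord c
    -- leading bits of the code point, placed in the lead byte
    top s  = fromIntegral (n `shiftR` s)
    -- continuation byte: 10xxxxxx carrying six bits of the code point
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Encode a whole string, one code point at a time.
utf8String :: String -> [Word8]
utf8String = concatMap utf8Char

-- | The file name from the example; \232 is U+00E8 (e with grave accent).
fileName :: String
fileName = "ab\232\232\232"
```

In ghci, length fileName gives 5 and length (utf8String fileName) gives 8: 'a' and 'b' take one byte each, and each U+00E8 becomes the two bytes 0xC3 0xA8.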

It appears that in spite of the locale definition, hGetContents is 
treating each byte as a separate character without translating the 
multi-byte sequences *from* UTF-8, and then putStrLn sends each of those 
bytes to standard output without translating the non-ASCII characters 
*to* UTF-8.  So the second line of your program's output is 
correct...but only by accident.
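
A rough sketch of that accidental round trip, under the assumption that input maps one byte to one Char and output maps one Char back to one byte. The functions bytesAsChars, charsAsBytes and decodeUtf8 are hypothetical names of mine, and the decoder deliberately handles only one- and two-byte sequences, which is enough for this file name.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | The eight bytes "ls" would print for the example file name in UTF-8.
rawBytes :: [Word8]
rawBytes = [0x61, 0x62, 0xC3, 0xA8, 0xC3, 0xA8, 0xC3, 0xA8]

-- | Assumed input behaviour: every byte becomes its own Char.
bytesAsChars :: [Word8] -> String
bytesAsChars = map (chr . fromIntegral)

-- | Assumed output behaviour: every Char (below 256) becomes one byte.
charsAsBytes :: String -> [Word8]
charsAsBytes = map (fromIntegral . ord)

-- | What a proper decode would do; only 1- and 2-byte sequences are
-- handled here, anything else becomes U+FFFD.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b : bs)
  | b < 0x80 = chr (fromIntegral b) : decodeUtf8 bs
decodeUtf8 (b1 : b2 : bs)
  | b1 .&. 0xE0 == 0xC0, b2 .&. 0xC0 == 0x80 =
      chr (((fromIntegral b1 .&. 0x1F) `shiftL` 6) .|. (fromIntegral b2 .&. 0x3F))
        : decodeUtf8 bs
decodeUtf8 (_ : bs) = '\xFFFD' : decodeUtf8 bs
```

Here length (bytesAsChars rawBytes) is 8, while length (decodeUtf8 rawBytes) is 5. Since charsAsBytes (bytesAsChars rawBytes) == rawBytes, the terminal receives the original bytes unchanged and renders them correctly, even though the String in between never held the right characters.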

Futzing around a little bit in ghci, I see that I can define a string 
"\1488", but if I send that string to putStrLn, I get nothing, when I 
should get א (the Hebrew letter aleph).
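
My guess, and it is only a guess, is that putStrLn keeps just the low 8 bits of each Char. The sketch below models that assumption; truncateChar, aleph and alephUtf8 are names I made up for illustration.

```haskell
import Data.Bits ((.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Assumed output behaviour: drop everything above the low 8 bits.
truncateChar :: Char -> Word8
truncateChar c = fromIntegral (ord c .&. 0xFF)

-- | Hebrew letter aleph, U+05D0 (decimal 1488).
aleph :: Char
aleph = '\1488'

-- | The two bytes a UTF-8 terminal needs in order to show U+05D0.
alephUtf8 :: [Word8]
alephUtf8 = [0xD7, 0x90]
```

Under that assumption truncateChar aleph is 0xD0, a lone UTF-8 lead byte with no continuation byte after it. That is not a valid sequence, so a UTF-8 terminal has nothing sensible to draw, which would explain the empty output.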

I ♥ Unicode.


