[Haskell-cafe] Re: getting crazy with character encoding

Wed Sep 12 20:23:33 EDT 2007

Hi.  I believe that everything I've said has been said by another
responder, but not all together in one place.

On 2007-09-12, Andrea Rossato <mailing_list at istitutocolli.org> wrote:
> supposed that, in a Linux system, in an utf-8 locale, you create a file
> with non ascii characters. For instance:
> touch abèèè
> 
>
> Now, I would expect that the output of a shell command such as 
> "ls ab*"
> would be a string/list of 5 chars. Instead I find it to be a list of 8
> chars...;-)
>
> That is to say, each non ascii character is read as 2 characters, as
> if the string were an ISO-8859-1 string - the string is actually
> treated as an ISO-8859-1 string. But when I print it, now it is
> displayed correctly.

The Linux kernel doesn't really have a notion of characters in its
interfaces, only bytes.  (This isn't strictly true: it needs one in some
cases when it's interacting with other systems, but it's 99% true.)  The
UTF-8 representation of these 5 characters is 8 bytes long, as each
non-ASCII character here takes two bytes.
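
To make the arithmetic concrete, here is a small sketch of my own: a
hand-rolled UTF-8 encoder (not a library API; the names encodeChar,
encodeUtf8 and fileName are made up for this example) applied to the
file name from the question.  It does not reject surrogate code points,
so treat it as an illustration only.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Unicode code point as UTF-8 (illustrative, hand-rolled).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]                        -- ASCII: 1 byte
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                 -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]        -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    hi, cont :: Int -> Word8
    -- Leading bits of the code point, placed in the lead byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

-- The file name "abèèè"; è is U+00E8, i.e. decimal 232.
fileName :: String
fileName = "ab\232\232\232"
```

In GHCi, `length fileName` is 5 while `length (encodeUtf8 fileName)` is
8, and `encodeChar '\232'` gives the two bytes 0xC3 0xA8.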

The various C runtimes do have some notion of various character sets,
and locales, and so forth, and build on top of the byte interface to
represent characters.  But not all programs use these.  Your example of
ls just takes the bytes from the kernel, and perhaps does some minimal
sanitizing (munging control codes) before sending them to the tty.  If
the terminal understands UTF-8, everything works great.

On the other hand, GHC's runtime always interprets these bytes as
characters in ISO-8859-1 (that is, each byte is mapped to the Unicode
code point with the same value), and does not pay attention to locale
settings such as LC_CTYPE.  While this has some nice properties (it is
totally invertible, and there is no code to maintain, since the first
256 code points of Unicode are ISO-8859-1), personally, I think this is
a bug.  The Haskell standard talks about characters, not bytes, and the
characters read and written should correspond to the native
environment's notions and encodings.  Under Unix, these are determined
by the locale system.

Unfortunately, at this point it is a well-entrenched bug, and changing
the behaviour will undoubtedly break programs.
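
As an illustration of the difference, the sketch below decodes the same
8 bytes two ways.  It is my own toy code, not what GHC's runtime
actually contains: decodeLatin1 mimics the byte-to-code-point mapping
described above, and decodeUtf8 is a deliberately minimal decoder that
only handles 1- and 2-byte sequences and replaces anything else with
U+FFFD.  All names here are made up for the example.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- The bytes stored for the file name "abèèè" in a UTF-8 locale.
rawName :: [Word8]
rawName = [0x61, 0x62, 0xC3, 0xA8, 0xC3, 0xA8, 0xC3, 0xA8]

-- ISO-8859-1 decoding: each byte becomes the code point of the same value.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Minimal UTF-8 decoder: 1- and 2-byte sequences only (an assumption made
-- to keep the sketch short); any other byte becomes U+FFFD.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b : rest)
  | b < 0x80 = chr (fromIntegral b) : decodeUtf8 rest
decodeUtf8 (b1 : b2 : rest)
  | b1 .&. 0xE0 == 0xC0 && b2 .&. 0xC0 == 0x80 =
      chr (((fromIntegral b1 .&. 0x1F) `shiftL` 6) .|. (fromIntegral b2 .&. 0x3F))
        : decodeUtf8 rest
decodeUtf8 (_ : rest) = '\xFFFD' : decodeUtf8 rest
```

Here `length (decodeLatin1 rawName)` is 8 (the result Andrea saw), while
`length (decodeUtf8 rawName)` is 5 (the result he expected).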

There should be another interface for getting the exact bytes in and
out (as Word8s, say, rather than Chars).  There are in fact external
libraries that do this, using lower-level interfaces rather than things
like putStr, getLine, etc.  An external library works, of course, but
byte-level I/O should be part of the standard, so that implementors
know the character-based routines really are character based, not byte
based.
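
For example, a byte-level interface might look like the sketch below.
I am using Data.ByteString from the bytestring package here as one such
external library; that choice, and the helper names readBytes and
writeBytes, are my own assumptions for the example, not something the
standard defines.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Read a file as raw bytes, with no character decoding applied.
readBytes :: FilePath -> IO [Word8]
readBytes path = fmap B.unpack (B.readFile path)

-- Write raw bytes to a file exactly as given.
writeBytes :: FilePath -> [Word8] -> IO ()
writeBytes path = B.writeFile path . B.pack
```

With these, whatever goes out through writeBytes comes back unchanged
from readBytes, whatever the locale says.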

> After reading about character encoding, the way the linux kernel
> manages file names, I would expect that a file name set in an utf-8
> locale should be read by locale aware application as an utf-8 string,
> and each character a unicode code point which can be represented by a
> Haskell char. What's wrong with that?

That's a reasonable assumption.  The problem is that GHC doesn't support
locales.  But byte-sequences do round-trip, as long as you don't try to
process them, so not as much breaks as one might think.
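
A small sketch of what "round-trip, as long as you don't process them"
means, using toy Latin-1 conversions of my own (the names are made up,
and encodeLatin1 is only meaningful for code points up to 255):

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Byte to code point of the same value, as GHC does on input.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- The inverse mapping; only valid for characters up to U+00FF.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

-- UTF-8 bytes of the file name "abèèè".
rawName :: [Word8]
rawName = [0x61, 0x62, 0xC3, 0xA8, 0xC3, 0xA8, 0xC3, 0xA8]

-- Passing the name through untouched gives back the same bytes.
roundTrips :: Bool
roundTrips = encodeLatin1 (decodeLatin1 rawName) == rawName

-- "Processing" goes wrong: taking the first 3 Chars cuts the first è in
-- half, leaving a dangling UTF-8 lead byte at the end.
truncated :: [Word8]
truncated = encodeLatin1 (take 3 (decodeLatin1 rawName))
```

Here roundTrips is True, but truncated is [0x61, 0x62, 0xC3], which is
no longer valid UTF-8.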

I don't know what NHC and Hugs do, though I assume they also provide
no translation.  I'm also not sure what JHC does, though I do see
mentions of UTF-8, UTF-16 (for Windows), and UTF-32 (for internal usage
of C libraries), and I do know that John is fairly careful about locale
issues.

-- 
Aaron Denney
-><-