[Haskell-cafe] invalid character encoding

Fri Mar 18 14:00:42 EST 2005

Wolfgang Thaller wrote:

> > If you try to pretend that I18N comes down to shoe-horning everything
> > into Unicode, you will turn the language into a joke.
> 
> How common will those problems you are describing be by the time this 
> has been implemented?
> How common are they even now?

Right now, GHC assumes ISO-8859-1 whenever it has to automatically
convert between String and CString. Conversions to and from ISO-8859-1
cannot fail, and encoding and decoding are exact inverses.

OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
the correct encoding, but that doesn't actually matter a lot of the
time; frequently, you're just grabbing a "blob" of data from one
function and passing it to another.
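
A minimal sketch of why the "blob" case works, assuming bytes are
modelled as [Word8]. The names decodeLatin1, encodeLatin1 and utf8Blob
are made up for this illustration; they are not GHC's actual
marshalling functions.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode ISO-8859-1: byte n becomes the character with code point n,
-- so every possible byte sequence decodes successfully.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Encode ISO-8859-1. This assumes every Char is below 256, which always
-- holds for strings produced by decodeLatin1; larger code points would
-- be silently truncated by fromIntegral.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

-- Two bytes that are really UTF-8 (U+00E9), treated as an opaque blob.
utf8Blob :: [Word8]
utf8Blob = [0xC3, 0xA9]
```

Loaded into GHCi, decodeLatin1 utf8Blob is a two-character string that
means nothing to a reader, yet encodeLatin1 (decodeLatin1 utf8Blob)
gives back exactly utf8Blob, which is all a pass-through needs.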

The problems will only appear once you start dealing with fallible or
non-reversible encodings such as UTF-8 or ISO-2022. If and when that
happens, I guess we'll find out how common the problems are. Of
course, it's quite possible that the only test cases will be people
using UTF-8-only (or even ASCII-only) systems, in which case you won't
see any problems.
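
To make "fallible" concrete, here is a rough sketch of a strict UTF-8
decoder over [Word8]. The names decodeUtf8 and multi are hypothetical,
written only for this example; this is not how GHC or any particular
library implements decoding.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Strict UTF-8 decoding: Nothing as soon as any byte is out of place.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b : bs)
  | b < 0x80           = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral (b .&. 0x1F)) 0x80 bs
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral (b .&. 0x0F)) 0x800 bs
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral (b .&. 0x07)) 0x10000 bs
  | otherwise          = Nothing

-- Read n continuation bytes after a lead byte. Rejects truncated input,
-- bad continuation bytes, overlong forms (code < minCode), surrogates,
-- and values above U+10FFFF.
multi :: Int -> Int -> Int -> [Word8] -> Maybe String
multi n lead minCode bs
  | length conts == n && all isCont conts
      && code >= minCode && code <= 0x10FFFF && not surrogate
      = (chr code :) <$> decodeUtf8 rest
  | otherwise = Nothing
  where
    (conts, rest) = splitAt n bs
    isCont c      = c .&. 0xC0 == 0x80
    code          = foldl step lead conts
    step acc c    = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)
    surrogate     = code >= 0xD800 && code <= 0xDFFF
```

With this, decodeUtf8 [0xC3, 0xA9] succeeds, but the ISO-8859-1 bytes
[0x66, 0xE9, 0x2E] give Nothing, because 0xE9 announces a three-byte
sequence that never arrives. A caller that was just passing a blob
through now has an error to deal with.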

> I haven't yet encountered a unix box where the file names were not in 
> the system locale encoding. On all reasonably up-to-date Linux boxes 
> that I've seen recently, they were in UTF-8 (and the system locale 
> agreed).

I've encountered boxes where multiple encodings were used; primarily
web and FTP servers which were shared amongst multiple clients. Each
client used whichever encoding(s) they felt like. IIRC, the most
common non-ASCII encoding was MS-DOS codepage 850 (the clients were
mostly using Windows 3.1 at that time).

I haven't done sysadmin work for a while, so I don't know the current
situation, but I don't think that the world has switched to UTF-8 in
the meantime. [Most of the non-ASCII filenames which I've seen
recently have been either ISO-8859-1 or Win-12XX; I haven't seen much
UTF-8.]

> On both Windows and Mac OS X, filenames are stored in Unicode, so it is 
> always possible to convert them to Unicode.
> So we can't do Unicode-based I18N because there exist a few unix 
> systems with messed-up file systems?

Declaring such systems to be "messed up" won't make the problems go
away. If a design doesn't work in reality, it's the fault of the
design, not of reality.

> > Haskell's Unicode support is a joke because the API designers tried to
> > avoid the issues related to encoding with wishful thinking (i.e. you
> > open a file and you magically get Unicode characters out of it).
> 
> OK, that part is purely wishful thinking, but assuming that filenames 
> are text that can be represented in Unicode is wishful thinking that 
> corresponds to 99% of reality.
> So why can't the remaining 1 percent of reality be fixed instead?

The issue isn't whether the data can be represented as Unicode text,
but whether you can convert it to and from Unicode without problems.
To do this, you need to know the encoding, you need to store the
encoding so that you can convert the wide string back to a byte
string, and the encoding needs to be reversible.
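
As a small illustration of why the encoding has to be stored alongside
the string, the sketch below encodes the same String under two
encodings. The Encoding type and the functions encodeWith and
encodeUtf8 are assumptions invented for this example, not an existing
API, and the UTF-8 encoder does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical tag recording which encoding a byte string came from.
data Encoding = Latin1 | Utf8
  deriving (Eq, Show)

-- Convert a String back to bytes. The result depends on the tag, and
-- Latin1 fails for any character above U+00FF.
encodeWith :: Encoding -> String -> Maybe [Word8]
encodeWith Latin1 s
  | all ((< 0x100) . ord) s = Just (map (fromIntegral . ord) s)
  | otherwise               = Nothing
encodeWith Utf8 s = Just (encodeUtf8 s)

-- Plain UTF-8 encoder for code points up to U+10FFFF.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (enc . ord)
  where
    enc c
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [0xC0 .|. hi 6, cont 0]
      | c < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        hi n   = fromIntegral (c `shiftR` n)
        cont n = 0x80 .|. (fromIntegral (c `shiftR` n) .&. 0x3F)
```

Here encodeWith Latin1 "\233" is Just [0xE9], while
encodeWith Utf8 "\233" is Just [0xC3, 0xA9]. Given only the String,
there is no way to tell which of the two byte strings was the original
filename.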

-- 
Glynn Clements <glynn at gclements.plus.com>