[Haskell-cafe] invalid character encoding

Sat Mar 19 01:10:12 EST 2005

Glynn Clements wrote:

> OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
> the correct encoding, but that doesn't actually matter a lot of the
> time; frequently, you're just grabbing a "blob" of data from one
> function and passing it to another.

Yes. Of course, this also means that Strings representing non-ASCII 
filenames will *always* be nonsense on Mac OS X and other UTF-8-based 
platforms.
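
To make that concrete, here is a minimal sketch of the "treat everything
as ISO-8859-1" approach. It is only an illustration, not any library's
actual API: decodeLatin1, encodeLatin1 and utf8EAcute are names I made up
for this example. The round trip preserves the bytes, but the String in
between is not the text the user sees.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode bytes as ISO-8859-1: every byte becomes the code point
-- with the same numeric value. This can never fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Encode back to bytes. Fails (Nothing) as soon as the String
-- contains a character outside the ISO-8859-1 range.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM toByte
  where
    toByte c
      | ord c < 256 = Just (fromIntegral (ord c))
      | otherwise   = Nothing

-- Hypothetical sample: the UTF-8 encoding of U+00E9 (e with acute),
-- as a UTF-8-based file system would store it.
utf8EAcute :: [Word8]
utf8EAcute = [0xC3, 0xA9]
```

Here decodeLatin1 utf8EAcute yields the two characters U+00C3 and U+00A9
rather than the single character U+00E9, although encodeLatin1 still
recovers the original two bytes from it.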

> The problems will only appear once you start dealing with fallible or
> non-reversible encodings such as UTF-8 or ISO-2022.

In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022 
file name that is converted to Unicode cannot be converted back any 
more (assuming you know for sure that it was ISO-2022 in the first 
place)?

> Of course, it's quite possible that the only test cases will be people
> using UTF-8-only (or even ASCII-only) systems, in which case you won't
> see any problems.

I'm kind of hoping that we can just ignore a problem that is so rare 
that a large and well-known project like GTK2 can get away with 
ignoring it. Also, IIRC, Java strings are supposed to be Unicode, too - 
how does Java deal with the problem?

>> So we can't do Unicode-based I18N because there exist a few unix
>> systems with messed-up file systems?
>
> Declaring such systems to be "messed up" won't make the problems go
> away. If a design doesn't work in reality, it's the fault of the
> design, not of reality.

In general, yes. But we're not talking about all of reality here, we're 
talking about one small part of reality - the question is, can the part 
of reality where the design doesn't work be ignored?

For example, as soon as we use any kind of path names in our APIs, we 
are ignoring reality on good old "Classic" Mac OS (may it rest in 
peace). Path names don't always uniquely denote a file there (although 
they do most of the time). People writing cross-platform software have 
been ignoring this fact for a long time now.

I think that if we wait long enough, the filename encoding problems 
will become irrelevant and we will live in an ideal world where Unicode 
actually works. Maybe next year, maybe only in ten years. And while we 
are arguing about how far we are from that ideal world, we should think 
about alternatives. The current hack is really just a hack, and I don't 
want to see this hack become the new accepted standard.

Do we have other alternatives? Preferably something that provides other 
advantages over a Unicode String than just making things work on 
systems that many users never encounter; otherwise almost no one will 
bother to use it. So maybe we should start looking for _other_ reasons 
to represent file names and paths by an abstract datatype or something?
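
Just to show what I mean by an abstract datatype, here is a rough sketch.
Everything in it is hypothetical (FilePathBytes, fromRawBytes, toRawBytes,
displayName and appendComponent are invented names, not an existing or
proposed library interface), and it assumes a Unix-style '/' separator.
The idea is that the raw bytes are kept untouched for talking to the OS,
and conversion to a String only happens for display.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical abstract path type: the bytes exactly as the OS
-- delivered them. Client code would not see the constructor.
newtype FilePathBytes = FilePathBytes [Word8]
  deriving (Eq, Ord)

-- Wrap bytes received from the OS (assumed interface).
fromRawBytes :: [Word8] -> FilePathBytes
fromRawBytes = FilePathBytes

-- Unwrap the bytes unchanged, to hand them back to the OS.
toRawBytes :: FilePathBytes -> [Word8]
toRawBytes (FilePathBytes bs) = bs

-- Lossy conversion for display only: ASCII bytes are shown as-is,
-- every other byte is replaced by '?'. A real design would decode
-- according to the platform's encoding instead.
displayName :: FilePathBytes -> String
displayName (FilePathBytes bs) = map toChar bs
  where
    toChar b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = '?'

-- Join two path components with '/' (byte 0x2F) without ever
-- interpreting the bytes as text.
appendComponent :: FilePathBytes -> FilePathBytes -> FilePathBytes
appendComponent (FilePathBytes a) (FilePathBytes b) =
  FilePathBytes (a ++ [0x2F] ++ b)
```

With something like this, a program can list a directory and reopen the
files it found even if the names are not valid in any known encoding,
because the bytes passed back to the OS are never touched.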

Cheers,

Wolfgang