[Haskell-cafe] invalid character encoding
wolfgang.thaller at gmx.net
Sat Mar 19 01:10:12 EST 2005
Glynn Clements wrote:
> OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
> the correct encoding, but that doesn't actually matter a lot of the
> time; frequently, you're just grabbing a "blob" of data from one
> function and passing it to another.
Yes. Of course, this also means that Strings representing non-ASCII
filenames will *always* be nonsense on Mac OS X and other UTF-8-based platforms.
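To make that concrete, here is a minimal sketch of the ISO-8859-1 round trip, assuming file names arrive as raw bytes. The names decodeLatin1, encodeLatin1 and utf8EAcute are made up for illustration; this is not how any particular library does it.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode bytes as ISO-8859-1: each byte becomes the code point of the same value.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Encode back to bytes. Only meaningful for characters below U+0100,
-- which is always the case for the output of decodeLatin1.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

-- Illustrative input: the UTF-8 bytes of U+00E9 (e with acute accent).
utf8EAcute :: [Word8]
utf8EAcute = [0xC3, 0xA9]

main :: IO ()
main = do
  let s = decodeLatin1 utf8EAcute
  -- The round trip is lossless, so passing the "blob" along works.
  print (encodeLatin1 s == utf8EAcute)
  -- The intermediate String holds two characters (U+00C3, U+00A9)
  -- instead of the single character the file name actually contains.
  print (map ord s)
```

So the bytes survive, but anything that looks at the String itself (length, display, comparison with a Unicode literal) sees the wrong text.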
> The problems will only appear once you start dealing with fallible or
> non-reversible encodings such as UTF-8 or ISO-2022.
In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022
file name that is converted to Unicode cannot be converted back any
more (assuming you know for sure that it was ISO-2022 in the first place)?
> Of course, it's quite possible that the only test cases will be people
> using UTF-8-only (or even ASCII-only) systems, in which case you won't
> see any problems.
I'm kind of hoping that we can just ignore a problem that is so rare
that a large and well-known project like GTK2 can get away with
ignoring it. Also, IIRC, Java strings are supposed to be Unicode, too -
how do they deal with the problem?
>> So we can't do Unicode-based I18N because there exist a few unix
>> systems with messed-up file systems?
> Declaring such systems to be "messed up" won't make the problems go
> away. If a design doesn't work in reality, it's the fault of the
> design, not of reality.
In general, yes. But we're not talking about all of reality here, we're
talking about one small part of reality - the question is, can the part
of reality where the design doesn't work be ignored?
For example, as soon as we use any kind of path names in our APIs, we
are ignoring reality on good old "Classic" Mac OS (may it rest in
peace). Path names don't always uniquely denote a file there (although
they do most of the time). People writing cross-platform software have
been ignoring this fact for a long time now.
I think that if we wait long enough, the filename encoding problems
will become irrelevant and we will live in an ideal world where Unicode
actually works. Maybe next year, maybe only in ten years. And while we
are arguing about how far we are from that ideal world, we should think
about alternatives. The current hack is really just a hack, and I don't
want to see this hack become the new accepted standard.
Do we have other alternatives? Preferably something that provides other
advantages over a Unicode String than just making things work on
systems that many users never encounter, otherwise almost no one will
bother to use it. So maybe we should start looking for _other_ reasons
to represent file names and paths by an abstract datatype or something?
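As a starting point for discussion, a rough sketch of what such an abstract datatype might look like, assuming it simply wraps the raw bytes. FileName, fromRawBytes, toRawBytes and displayName are hypothetical names, not an existing API.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical abstract file name type. In a real module the constructor
-- would not be exported, so the representation could change later.
newtype FileName = FileName [Word8]
  deriving (Eq, Ord)

-- Wrap the bytes exactly as the operating system delivered them.
fromRawBytes :: [Word8] -> FileName
fromRawBytes = FileName

-- Hand the bytes back unchanged, e.g. for passing to a system call.
toRawBytes :: FileName -> [Word8]
toRawBytes (FileName bs) = bs

-- Lossy rendering for display only: ASCII bytes are shown as they are,
-- every other byte is replaced by '?'. No encoding is assumed.
displayName :: FileName -> String
displayName (FileName bs) = map render bs
  where
    render b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = '?'

main :: IO ()
main = do
  -- Illustrative input: "r", the two UTF-8 bytes of U+00E9, then ".txt".
  let raw  = [0x72, 0xC3, 0xA9, 0x2E, 0x74, 0x78, 0x74]
      name = fromRawBytes raw
  putStrLn (displayName name)
  print (toRawBytes name == raw)
```

The point of the sketch is only that the type keeps the exact bytes for talking to the OS and makes conversion to a String an explicit, visibly lossy step. Whether that is enough of an advantage is exactly the open question.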