[Haskell-cafe] invalid character encoding

Sat Mar 19 12:34:41 EST 2005

Wolfgang Thaller wrote:

> > Of course, it's quite possible that the only test cases will be people
> > using UTF-8-only (or even ASCII-only) systems, in which case you won't
> > see any problems.
> 
> I'm kind of hoping that we can just ignore a problem that is so rare 
> that a large and well-known project like GTK2 can get away with 
> ignoring it.

1. The filename issues in GTK2 are likely to be a major problem in
CJK locales, where filenames which don't match the locale's encoding
(which is seldom UTF-8) are common.

2. GTK's filename handling only really applies to file selector
dialogs. Most other uses of filenames in a GTK-based application don't
involve GTK; they use the OS API functions which just deal with byte
strings.

3. GTK is a GUI library. Most of the text which it deals with is going
to be rendered, so it *has* to be interpreted as characters. Treating
it as blobs of data won't work. IOW, on the question of whether or not
to interpret byte strings as character strings, GTK is at the far end
of the scale.
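
To make the byte-string versus character-string distinction concrete,
here is a minimal sketch in Java. The class and method names
(RoundTripDemo, survivesRoundTrip) are made up for illustration and
are not part of GTK or any library. It takes a filename stored as
ISO-8859-1 bytes and checks whether decoding it and re-encoding it
gives back the original bytes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // Hypothetical helper: decode the raw bytes with the given charset,
    // encode the resulting String again, and report whether the bytes
    // came back unchanged.
    static boolean survivesRoundTrip(byte[] raw, Charset cs) {
        String decoded = new String(raw, cs);
        return Arrays.equals(raw, decoded.getBytes(cs));
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is e-acute in ISO-8859-1 but an
        // incomplete (malformed) sequence in UTF-8.
        byte[] name = { 'c', 'a', 'f', (byte) 0xE9 };

        // Decoding as UTF-8 substitutes U+FFFD for the bad byte, so the
        // original byte is lost.
        System.out.println(survivesRoundTrip(name, StandardCharsets.UTF_8));      // false

        // ISO-8859-1 maps every byte to a character, so nothing is lost.
        System.out.println(survivesRoundTrip(name, StandardCharsets.ISO_8859_1)); // true
    }
}
```

A program which passes the bytes straight to the OS never hits the
first case; a program which decodes them according to the locale does
whenever the filename and the locale disagree.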

> Also, IIRC, Java strings are supposed to be unicode, too - 
> how do they deal with the problem?

Files are represented by instances of the File class:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html

	An abstract representation of file and directory pathnames.

You can construct Files from Strings, and convert Files to Strings. 

The File class includes two sets of directory enumeration methods:
list() returns an array of Strings, while listFiles() returns an array
of Files.

The documentation for the File class doesn't mention encoding issues
at all. However, that interface would make it possible to enumerate
and open files whose names cannot be decoded: listFiles() returns
Files, which need never be converted to Strings.
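
As a rough sketch of the two enumeration paths (the class ListingDemo
and its methods are hypothetical, and Files.createTempDirectory is
from a later Java release than the 1.5 documentation linked above):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class ListingDemo {
    // Path 1: enumerate as Strings, then rebuild a File from each name.
    // Every name has to pass through a String on the way.
    static int countViaStrings(File dir) {
        String[] names = dir.list();
        if (names == null) return 0;   // not a directory, or I/O error
        int found = 0;
        for (String name : names) {
            if (new File(dir, name).exists()) found++;
        }
        return found;
    }

    // Path 2: enumerate as Files and use them directly. The caller
    // never asks for a String.
    static int countViaFiles(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) return 0; // not a directory, or I/O error
        int found = 0;
        for (File entry : entries) {
            if (entry.exists()) found++;
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("listing-demo").toFile();
        new File(dir, "plain.txt").createNewFile();
        System.out.println(countViaStrings(dir) + " " + countViaFiles(dir));
    }
}
```

With ASCII names the two counts agree. Whether a given JVM actually
preserves undecodable names on the second path is an implementation
matter which the documentation doesn't settle; the sketch only shows
that the interface doesn't force the conversion.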

> >> So we can't do Unicode-based I18N because there exist a few unix
> >> systems with messed-up file systems?
> >
> > Declaring such systems to be "messed up" won't make the problems go
> > away. If a design doesn't work in reality, it's the fault of the
> > design, not of reality.
> 
> In general, yes. But we're not talking about all of reality here, we're 
> talking about one small part of reality - the question is, can the part 
> of reality where the design doesn't work be ignored?

Sure, you *can* ignore it; K&R C ignored everything other than ASCII.
If you limit yourself to locales which use the Roman alphabet (i.e.
ISO-8859-N for N=1/2/3/4/9/15), you can get away with a lot.

Most such users avoid encoding issues altogether by dropping the
accents and sticking to ASCII, at least when dealing with files which
might leave their system.

To get a better idea, you would need to consult users whose language
doesn't use the Roman alphabet, e.g. CJK or Cyrillic. Unfortunately,
you don't usually find too many of them on lists such as this.

I'm only familiar with one OSS project which has a sizeable CJK user
base, and that's XEmacs (its I18N revolves around ISO-2022, and most
of the documentation is in Japanese). Even there, there are separate
mailing lists for English and Japanese, and the two seldom
communicate.

> I think that if we wait long enough, the filename encoding problems 
> will become irrelevant and we will live in an ideal world where unicode 
> actually works. Maybe next year, maybe only in ten years.

Maybe not even then. If Unicode really solved encoding problems, you'd
expect the CJK world to be the first adopters, but they're actually
the least eager; you are more likely to find UTF-8 in an
English-language HTML page or email message than a Japanese one.

-- 
Glynn Clements <glynn at gclements.plus.com>