[Haskell-cafe] invalid character encoding

Glynn Clements glynn at gclements.plus.com
Thu Mar 17 20:34:30 EST 2005


Marcin 'Qrczak' Kowalczyk wrote:

> > If you provide "wrapper" functions which take String arguments,
> > either they should have an encoding argument or the encoding should
> > be a mutable per-terminal setting.
> 
> There is already a mutable setting. It's called "locale".

It isn't a per-terminal setting.

> > It is possible for curses to be used with a terminal which doesn't
> > use the locale's encoding.
> 
> No, it will break under the new wide character curses API,

Or expose the fact that the WC API is broken, depending upon your POV.

> and it will confuse programs which use the old narrow character API.

It has no effect on the *byte* API. Characters don't come into it.

> The user (or the administrator) is responsible for matching the locale
> encoding with the terminal encoding.

Which is rather hard to do when different terminals are using
different encodings.

> > Also, it's quite common to use non-standard encodings with terminals
> > (e.g. codepage 437, which has graphic characters beyond the ACS_* set
> > which terminfo understands).
> 
> curses don't support that.

Sure it does. You pass the appropriate bytes to waddstr() etc. and they
get sent to the terminal as-is. Curses doesn't have ACS_* macros for
those characters, but that doesn't mean you can't use them.
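
Something along these lines (untested sketch; it assumes the narrow
ncurses byte API, an 8-bit locale which classifies these bytes as
printable, and a terminal which is actually displaying codepage 437;
0xB0-0xB2 are the CP437 shade blocks, which have no ACS_* names):

    #include <curses.h>
    #include <locale.h>

    int main(void)
    {
        setlocale(LC_ALL, "");  /* assumed to select an 8-bit locale */
        initscr();
        /* The bytes are forwarded to the terminal unchanged; curses
           never interprets them as characters in any encoding. */
        addstr("\xB0\xB1\xB2  CP437 shade blocks");
        refresh();
        getch();
        endwin();
        return 0;
    }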

> >> The locale encoding is the right encoding to use for conversion of the
> >> result of strerror, gai_strerror, msg member of gzip compressor state
> >> etc. When an I/O error occurs and the error code is translated to a
> >> Haskell exception and then shown to the user, why would the application
> >> need to specify the encoding and how?
> >
> > Because the application may be using multiple locales/encodings.
> 
> But strerror always returns messages in the locale encoding.

Sorry, I misread that paragraph. I replied to "why would ..." without
thinking about the context.

When you know that a string is in the locale's encoding, you need to
use that encoding for the conversion. In that case you need to do the
conversion (or at least record the actual encoding) immediately, in
case the locale gets switched later.
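
For instance (rough sketch; save_errno and saved_error are just
illustrative names), copy the message and the current LC_CTYPE codeset
the moment the error is seen:

    #include <langinfo.h>
    #include <string.h>

    struct saved_error {
        char *text;     /* copy of the message bytes */
        char *codeset;  /* encoding they were produced in, e.g. "ISO-8859-1" */
    };

    struct saved_error save_errno(int err)
    {
        struct saved_error e;
        e.text = strdup(strerror(err));           /* copy before it's overwritten */
        e.codeset = strdup(nl_langinfo(CODESET)); /* record the LC_CTYPE codeset now */
        return e;
    }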

> Just like Gtk+2 always accepts texts in UTF-8.

Unfortunately. The text probably originated in an encoding other than
UTF-8, and will probably end up getting displayed using a font which
is indexed using the original encoding (rather than e.g. UCS-2/4). 
Converting to Unicode then back again just introduces the potential
for errors. [Particularly for CJK where, due to Han unification,
Chinese characters may mutate into Japanese characters, or vice-versa. 
Fortunately, that doesn't seem to have started any wars. Yet.]

> For compatibility the default locale is "C", but new programs
> which are prepared for I18N should do setlocale(LC_CTYPE, "")
> and setlocale(LC_MESSAGES, "").

In practice, you end up repeatedly calling setlocale(LC_CTYPE, "")
and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant
to be human-readable (locale-dependent) or a machine-readable format
(locale-independent, i.e. the "C" locale).

> > [The most common example is printf("%f"). You need to use the C
> > locale (decimal point) for machine-readable text but the user's
> > locale (locale-specific decimal separator) for human-readable text.
> 
> This is a different thing, and it is what IMHO C did wrong.

It's a different example of the same problem. I agree that C did it
wrong; I'm objecting to the implication that Haskell should make the
same mistakes.
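
For example (sketch; it assumes a "de_DE" locale is installed, and the
exact locale name varies between systems), the same printf("%f") call
produces both forms:

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        printf("%f\n", 3.14);           /* C locale: "3.140000" */

        if (setlocale(LC_NUMERIC, "de_DE") != NULL)
            printf("%f\n", 3.14);       /* now: "3,140000" */

        setlocale(LC_NUMERIC, "C");     /* machine-readable form again */
        printf("%f\n", 3.14);           /* "3.140000" */
        return 0;
    }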

> > This isn't directly related to encodings per se, but a good example
> > of why parameters are preferable to state.]
> 
> The LC_* environment variables are the parameters for the encoding.

But they are only really "parameters" at the exec() level.

Once the program starts, the locale settings become global mutable
state. I would have thought that, more than anyone else, the
readership of this list would understand what's bad about that
concept.

> There is no other convention to pass the encoding to be used for
> textual output to stdout for example.

That's up to the application. Environment variables are a convenience;
there's no reason why you can't have a command-line switch to select
the encoding. For more complex applications, you often have
user-selectable options and/or encodings specified in the data which
you handle.
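
For example (sketch; the to_utf8 name, and the idea of feeding it the
value of an --encoding switch, are purely illustrative), the encoding
becomes an ordinary parameter rather than global state:

    #include <iconv.h>
    #include <string.h>

    /* Convert 'in' from 'encoding' to UTF-8 into 'out'.
       Returns 0 on success, -1 on failure. */
    int to_utf8(const char *encoding, const char *in,
                char *out, size_t outsize)
    {
        iconv_t cd = iconv_open("UTF-8", encoding);
        if (cd == (iconv_t)-1)
            return -1;

        char *src = (char *)in;
        char *dst = out;
        size_t srcleft = strlen(in);
        size_t dstleft = outsize - 1;

        size_t r = iconv(cd, &src, &srcleft, &dst, &dstleft);
        iconv_close(cd);
        if (r == (size_t)-1)
            return -1;
        *dst = '\0';
        return 0;
    }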

Another problem with having a single locale: if a program isn't
working and you need to communicate with its developers, you will
often have to run the program in an English locale just so that you
get error messages which the developers understand.

> > C libraries which use the locale do so as a last resort.
> 
> No, they do it by default.

By default, libc uses the C locale. setlocale() includes a convenience
option to use the LC_* variables. Other libraries may or may not use
the locale settings, and plenty of code will misbehave if the locale
is wrong (e.g. using fprintf("%f") without explicitly setting the C
locale first will do the wrong thing if you're trying to generate
VRML/DXF/whatever files).

Beyond that, libc uses the locale mechanism because it was the
simplest way to retrofit minimal I18N onto K&R C. It also means that
most code can easily duck the issues (e.g. you don't have to pass a
locale parameter to isupper()).

OTOH, if you don't want to duck the issue, global locale settings are
a nuisance.

> > The only reason that the C locale mechanism isn't a major nuisance
> > is that you can largely ignore it altogether.
> 
> Then how would a Haskell program know what encoding to use for stdout
> messages?

It doesn't necessarily need to. If you are using message catalogues,
you just read bytes from the catalogue and write them to stdout. The
issue then boils down to using the correct encoding for the
catalogues; the code doesn't need to know.
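
With POSIX catalogues that amounts to something like this (sketch; the
catalogue name and the set/message numbers are made up):

    #include <nl_types.h>
    #include <stdio.h>

    int main(void)
    {
        nl_catd cat = catopen("myprog", NL_CAT_LOCALE);
        /* catgets() hands back the raw bytes stored in the catalogue
           (or the fallback string); they go to stdout untranslated. */
        const char *msg = catgets(cat, 1, 1, "file not found\n");
        fputs(msg, stdout);
        if (cat != (nl_catd)-1)
            catclose(cat);
        return 0;
    }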

> How would it know how to interpret filenames for graphical
> display?

An encoding-selection menu on the file selector is one option;
heuristics are another.

Both tend to produce better results in non-trivial cases than either
of Gtk-2's choices: i.e. filenames must either be UTF-8 or must match
the locale (depending upon the G_BROKEN_FILENAMES setting), otherwise
the filename simply doesn't exist. At least Gtk-1 would attempt to
display the filename; you would get the odd question mark, but at
least you could select the file. Ultimately, the returned char* just
gets passed to open(), so the encoding only really matters for
display.
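
Roughly (sketch; the crude ASCII fallback below is only there to show
that the conversion is display-only):

    #include <fcntl.h>
    #include <stdio.h>

    int open_and_report(const char *raw_name)
    {
        int fd = open(raw_name, O_RDONLY);  /* raw bytes, no conversion */

        /* Display-only copy: anything outside printable ASCII becomes
           '?', roughly the Gtk-1 style fallback described above. */
        char shown[256];
        size_t i;
        for (i = 0; raw_name[i] != '\0' && i < sizeof shown - 1; i++) {
            unsigned char c = (unsigned char)raw_name[i];
            shown[i] = (c >= 0x20 && c < 0x7F) ? (char)c : '?';
        }
        shown[i] = '\0';
        fprintf(stderr, "opening %s: fd=%d\n", shown, fd);
        return fd;
    }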

> > Code which requires real I18N can use other mechanisms, and code
> > which doesn't require any I18N can just pass byte strings around and
> > leave encoding issues to code which actually has enough context to
> > handle them correctly.
> 
> Haskell can't just pass byte strings around without turning the
> Unicode support into a joke (which it is now).

If you try to pretend that I18N comes down to shoe-horning everything
into Unicode, you will turn the language into a joke.

Haskell's Unicode support is a joke because the API designers tried to
avoid the issues related to encoding with wishful thinking (i.e. you
open a file and you magically get Unicode characters out of it).

The "current locale" mechanism is just a way of avoiding the issues as
much as possible when you can't get away with avoiding them
altogether.

Unicode has been described (accurately, IMHO) as "Esperanto for
computers". Both use the same approach to try to solve essentially the
same problem. And both will be about as successful in the long run.

-- 
Glynn Clements <glynn at gclements.plus.com>

