[Haskell-cafe] invalid character encoding

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Fri Mar 18 06:16:51 EST 2005


Glynn Clements <glynn at gclements.plus.com> writes:

>> > If you provide "wrapper" functions which take String arguments,
>> > either they should have an encoding argument or the encoding should
>> > be a mutable per-terminal setting.
>> 
>> There is already a mutable setting. It's called "locale".
>
> It isn't a per-terminal setting.

A separate setting would force users to configure an encoding just
for Haskell programs, as if configuration weren't already fragmented
enough. It's unwise to propose a new standard when an existing one
works well enough.

>> > It is possible for curses to be used with a terminal which doesn't
>> > use the locale's encoding.
>> 
>> No, it will break under the new wide character curses API,
>
> Or expose the fact that the WC API is broken, depending upon your POV.

It's the only curses API that allows full-screen programs to be
written in UTF-8 mode.

>> > Also, it's quite common to use non-standard encodings with terminals
>> > (e.g. codepage 437, which has graphic characters beyond the ACS_* set
>> > which terminfo understands).
>> 
>> curses don't support that.
>
> Sure it does. You pass the appropriate bytes to waddstr() etc and they
> get sent to the terminal as-is.

It doesn't support that: it will switch the terminal to the "user"
encoding (which is usually ISO-8859-x) at the first opportunity, e.g.
after an ACS_* macro is used, or maybe even at initialization.

curses supports two families of encodings: the current locale encoding
and ACS. The locale encoding may be UTF-8 (which works only with the
wide character API).

>> For compatibility the default locale is "C", but new programs
>> which are prepared for I18N should do setlocale(LC_CTYPE, "")
>> and setlocale(LC_MESSAGES, "").
>
> In practice, you end up continuously calling setlocale(LC_CTYPE, "")
> and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant
> to be human-readable (locale-dependent) or a machine-readable format
> (locale-independent, i.e. "C" locale).

I wrote LC_CTYPE, not LC_ALL. LC_CTYPE doesn't affect %f formatting;
it only affects the encoding of texts emitted by gettext (including
strerror) and the meaning of isalpha, toupper etc.
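
A minimal C sketch of the category separation, assuming a POSIX system.
The helper name `format_half` is hypothetical and only serves the
illustration: it adopts the user's LC_CTYPE and then formats a number,
whose decimal separator is governed by LC_NUMERIC, a category the code
never changes.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: adopt the user's LC_CTYPE only, then format 0.5.
   LC_NUMERIC is never touched, so it keeps its startup value "C" and
   the decimal separator stays '.'. */
void format_half(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);

    /* Encoding and character classes come from the environment. */
    setlocale(LC_CTYPE, "");

    /* %f consults LC_NUMERIC, which is still the "C" locale here. */
    snprintf(buf, size, "%.1f", 0.5);
}
```

Called from a program started with, say, LC_ALL=pl_PL.UTF-8, this
should still produce "0.5" rather than "0,5", because only the LC_CTYPE
category was taken from the environment.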

>> The LC_* environment variables are the parameters for the encoding.
>
> But they are only really "parameters" at the exec() level.

This is usually the right place to specify them. They are rarely even
set separately for a given program; usually they are per-system or
per-user.

> Once the program starts, the locale settings become global mutable
> state. I would have thought that, more than anyone else, the
> readership of this list would understand what's bad about that
> concept.

You can treat it as immutable. Just don't call setlocale with
different arguments again.
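
One way to follow that discipline, sketched in C under the assumption
of a single-threaded startup phase. Both `init_locale_once` and
`locale_name` are hypothetical names, not part of any library: the
first applies the environment's settings exactly once, the second only
reads.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: apply the environment's locale exactly once.
   Later calls are no-ops, so the rest of the program can treat the
   locale as read-only. Not thread-safe; call it early in main(). */
void init_locale_once(void)
{
    static int done = 0;

    if (done)
        return;
    setlocale(LC_CTYPE, "");
    setlocale(LC_MESSAGES, "");
    done = 1;
}

/* Query a category without modifying it: passing NULL as the locale
   argument makes setlocale report the current setting only. */
const char *locale_name(int category)
{
    const char *name = setlocale(category, NULL);

    assert(name != NULL);
    return name;
}
```

Categories that were never set, such as LC_NUMERIC here, keep
reporting "C" no matter how often `init_locale_once` is called.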

> Another problem with having a single locale: if a program isn't
> working, and you need to communicate with its developers, you will
> often have to run the program in an English locale just so that you
> will get error messages which the developers understand.

You don't need to change LC_CTYPE for that. Just set LC_MESSAGES.

>> Then how would a Haskell program know what encoding to use for
>> stdout messages?
>
> It doesn't necessarily need to. If you are using message catalogues,
> you just read bytes from the catalogue and write them to stdout.

gettext uses the locale to choose the encoding. Messages are
typically stored in the catalogue as UTF-8 but emitted in the locale
encoding.

You are using the semantics I'm advocating without knowing that...
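
The conversion step can be illustrated with iconv. This is only a
sketch of the kind of recoding involved, not gettext's actual
implementation. The helper name `utf8_to_encoding` is hypothetical, and
the code assumes a stateless target encoding and an iconv that takes
`char **` input, as glibc's does.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in_len bytes of UTF-8 to the encoding
   named by `to`. Returns the number of bytes written to `out` (which
   is NUL-terminated), or (size_t)-1 on failure. */
size_t utf8_to_encoding(const char *to, const char *utf8, size_t in_len,
                        char *out, size_t out_size)
{
    assert(to != NULL && utf8 != NULL && out != NULL && out_size > 0);

    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* unknown encoding name */

    char *in_ptr = (char *)utf8;        /* iconv's prototype wants char ** */
    size_t in_left = in_len;
    char *out_ptr = out;
    size_t out_left = out_size - 1;     /* keep room for the terminator */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;              /* inconvertible input or no space */

    *out_ptr = '\0';
    return (size_t)(out_ptr - out);
}
```

In a real program the target name would come from the locale, for
example from nl_langinfo(CODESET), so the same stored message comes out
as different bytes on a UTF-8 terminal and on an ISO-8859-2 one.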

>> How would it know how to interpret filenames for graphical
>> display?
>
> An option menu on the file selector is one option; heuristics are
> another.

Heuristics won't distinguish various ISO-8859-x from each other.

An option menu on the file selector is user-unfriendly because users
don't want to configure it for each program separately. They want to
set it in one place and expect it to work everywhere.

Currently there are two such places: the locale, and
G_FILENAME_ENCODING (or the older G_BROKEN_FILENAMES) for glib. It's
unwise to introduce yet another convention, and it would be a horrible
idea to make it per-program.

> At least Gtk-1 would attempt to display the filename; you would get
> the odd question mark but at least you could select the file;

Gtk+2 also attempts to display the filename. The file can still be
opened even though the inconvertible characters in its name are shown
escaped.

> The "current locale" mechanism is just a way of avoiding the issues
> as much as possible when you can't get away with avoiding them
> altogether.

It's a way to communicate the encoding of the terminal, filenames,
strerror, gettext etc.
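
A program can read that encoding back from the locale. A minimal C
sketch, assuming POSIX `nl_langinfo`; the helper name `locale_encoding`
is hypothetical:

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: adopt the user's LC_CTYPE and return the name of
   the character encoding it implies, e.g. "UTF-8" or "ISO-8859-2".
   The returned string may be overwritten by later nl_langinfo or
   setlocale calls, so copy it if it must be kept. */
const char *locale_encoding(void)
{
    const char *codeset;

    setlocale(LC_CTYPE, "");
    codeset = nl_langinfo(CODESET);
    assert(codeset != NULL);
    return codeset;
}
```

The exact name returned is system-dependent, so code that compares it
against a fixed string should allow for aliases.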

> Unicode has been described (accurately, IMHO) as "Esperanto for
> computers". Both use the same approach to try to solve essentially the
> same problem. And both will be about as successful in the long run.

Unicode has no viable competition.
Esperanto had English.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

