[Haskell-cafe] invalid character encoding

Glynn Clements glynn at gclements.plus.com
Fri Mar 18 14:52:45 EST 2005


Marcin 'Qrczak' Kowalczyk wrote:

> >> > If you provide "wrapper" functions which take String arguments,
> >> > either they should have an encoding argument or the encoding should
> >> > be a mutable per-terminal setting.
> >> 
> >> There is already a mutable setting. It's called "locale".
> >
> > It isn't a per-terminal setting.
> 
> A separate setting would force users to configure an encoding just
> for the purposes of Haskell programs, as if the configuration wasn't
> already too fragmented.

	-- hypothetical wrapper API: query the locale's encoding once,
	-- then pass it to the terminal layer as an ordinary argument
	encoding <- localeEncoding
	Curses.setupTerm encoding handle

Not a big deal.

> It's unwise to propose a new standard when an existing standard
> works well enough.

Existing standard? The standard curses API deals with bytes; encodings
don't come into it. AFAIK, the wide-character curses API isn't yet a
standard.

> >> > It is possible for curses to be used with a terminal which doesn't
> >> > use the locale's encoding.
> >> 
> >> No, it will break under the new wide character curses API,
> >
> > Or expose the fact that the WC API is broken, depending upon your POV.
> 
> It's the only curses API which allows writing full-screen programs in
> UTF-8 mode.

All the more reason to fix it.

And where does UTF-8 come into it? I would have expected it to use
wide characters throughout.

> >> > Also, it's quite common to use non-standard encodings with terminals
> >> > (e.g. codepage 437, which has graphic characters beyond the ACS_* set
> >> > which terminfo understands).
> >> 
> >> curses don't support that.
> >
> > Sure it does. You pass the appropriate bytes to waddstr() etc and they
> > get sent to the terminal as-is.
> 
> It doesn't support that and it will switch the terminal mode to "user"
> encoding (which is usually ISO-8859-x) on a first occasion, e.g. after
> an ACS_* macro was used, or maybe even at initialization.
> 
> curses support two families of encodings: the current locale encoding
> and ACS. The locale encoding may be UTF-8 (works only with wide
> character API).

I'm talking about standard (XSI) curses, which will just pass
printable (non-control) bytes straight to the terminal. If your
terminal uses CP437 (or some other non-standard encoding), you can
just pass the appropriate bytes to waddstr() etc and the corresponding
characters will appear on the terminal.

ACS_* codes are a completely separate issue; they allow you to use
line graphics in addition to a full 8-bit character set (e.g. 
ISO-8859-1). If you only need ASCII text, you can use the other 128
codes for graphics characters and never use the ACS_* macros or the
"acsc" capability.

> >> For compatibility the default locale is "C", but new programs
> >> which are prepared for I18N should do setlocale(LC_CTYPE, "")
> >> and setlocale(LC_MESSAGES, "").
> >
> > In practice, you end up continuously calling setlocale(LC_CTYPE, "")
> > and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant
> > to be human-readable (locale-dependent) or a machine-readable format
> > (locale-independent, i.e. "C" locale).
> 
> I wrote LC_CTYPE, not LC_ALL. LC_CTYPE doesn't affect %f formatting,
> it only affects the encoding of texts emitted by gettext (including
> strerror) and the meaning of isalpha, toupper etc.

Sorry, I'm confusing two cases here. With LC_CTYPE, the main reason
for continuous switching is wcstombs(). printf() uses LC_NUMERIC,
which has to be switched between the "C" locale and the user's locale
depending on the intended audience.
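
The dance looks something like this (a sketch via the FFI; the
constant 1 is glibc's value for LC_NUMERIC, so it isn't portable, and
a robust version would use bracket to restore on exceptions):

	{-# LANGUAGE ForeignFunctionInterface #-}
	import Foreign.C.String (CString, peekCString, withCString)
	import Foreign.C.Types (CInt(..))
	import Foreign.Ptr (nullPtr)

	foreign import ccall unsafe "locale.h setlocale"
	    c_setlocale :: CInt -> CString -> IO CString

	lcNumeric :: CInt
	lcNumeric = 1   -- glibc's LC_NUMERIC; platform-specific

	-- Emit machine-readable numbers under the "C" locale, then
	-- restore whatever the user had. peekCString copies the result
	-- before setlocale's static buffer is overwritten.
	withCNumericLocale :: IO a -> IO a
	withCNumericLocale action = do
	    old <- peekCString =<< c_setlocale lcNumeric nullPtr
	    _   <- withCString "C" (c_setlocale lcNumeric)
	    r   <- action
	    _   <- withCString old (c_setlocale lcNumeric)
	    return r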

> > Once the program starts, the locale settings become global mutable
> > state. I would have thought that, more than anyone else, the
> > readership of this list would understand what's bad about that
> > concept.
> 
> You can treat it as immutable. Just don't call setlocale with
> different arguments again.

Which limits you to a single locale. If you are using the locale's
encoding, that limits you to a single encoding.

> > Another problem with having a single locale: if a program isn't
> > working, and you need to communicate with its developers, you will
> > often have to run the program in an English locale just so that you
> > will get error messages which the developers understand.
> 
> You don't need to change LC_CTYPE for that. Just set LC_MESSAGES.

I'm starting to think that you're misunderstanding on purpose. Again.

The point is that a single program often generates multiple streams of
text, possibly for different "audiences" (e.g. humans and machines).
Different streams may require different conventions (encodings,
numeric formats, collating orders), but may use the same functions.

Those functions need to obtain the conventions from somewhere, and
that means either parameters or state.

Having dealt with state (libc's locale mechanism), I would rather have
parameters.
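
Sketched in Haskell, with all names hypothetical:

	-- Conventions as an explicit parameter, not process-global state.
	data Conventions = Conventions
	    { convEncoding :: String   -- e.g. "UTF-8", "ISO-8859-2"
	    , convDecimal  :: Char     -- '.' for machines, maybe ',' for humans
	    }

	forHumans, forMachines :: Conventions
	forHumans   = Conventions "ISO-8859-2" ','   -- from the user's locale, say
	forMachines = Conventions "ASCII" '.'        -- the "C" conventions, in effect

	formatNumber :: Conventions -> Double -> String
	formatNumber conv =
	    map (\c -> if c == '.' then convDecimal conv else c) . show

Two streams with different conventions can then coexist in one
program, which the setlocale model can't express.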

> >> Then how would a Haskell program know what encoding to use for
> >> stdout messages?
> >
> > It doesn't necessarily need to. If you are using message catalogues,
> > you just read bytes from the catalogue and write them to stdout.
> 
> gettext uses the locale to choose the encoding. Messages are
> internally stored as UTF-8 but emitted in the locale encoding.

It didn't use to be that way, but I can see why they changed it: a
single catalogue can then serve all the encoding variants of a given
locale.

> >> How would it know how to interpret filenames for graphical
> >> display?
> >
> > An option menu on the file selector is one option; heuristics are
> > another.
> 
> Heuristics won't distinguish various ISO-8859-x from each other.

You treat the locale's encoding as a heuristic. If it looks like
ISO-8859-x, and the locale's encoding is ISO-8859-x, you use that. If
it looks like Shift-JIS, you don't complain and give up just because
the locale is UTF-8.
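
A toy version of the idea (hypothetical names; a real detector would
score several candidate encodings rather than test only UTF-8):

	import Data.Word (Word8)

	-- Accept anything that is pure ASCII or well-formed UTF-8 as
	-- such; otherwise fall back to the locale's encoding instead of
	-- giving up. (Simplified: overlongs and surrogates aren't rejected.)
	looksLikeUTF8 :: [Word8] -> Bool
	looksLikeUTF8 [] = True
	looksLikeUTF8 (b:bs)
	    | b < 0x80               = looksLikeUTF8 bs
	    | b >= 0xC2 && b <= 0xDF = cont 1 bs
	    | b >= 0xE0 && b <= 0xEF = cont 2 bs
	    | b >= 0xF0 && b <= 0xF4 = cont 3 bs
	    | otherwise              = False
	  where
	    cont :: Int -> [Word8] -> Bool
	    cont 0 rest = looksLikeUTF8 rest
	    cont n (c:rest) | c >= 0x80 && c <= 0xBF = cont (n - 1) rest
	    cont _ _ = False

	guessEncoding :: String -> [Word8] -> String
	guessEncoding localeEnc bs
	    | all (< 0x80) bs  = "US-ASCII"
	    | looksLikeUTF8 bs = "UTF-8"
	    | otherwise        = localeEnc   -- the locale as fallback, not gospel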

> An option menu on the file selector is user-unfriendly because users
> don't want to configure it for each program separately. They want to
> set it in one place and expect it to work everywhere.

Nothing will work everywhere. An option menu allows the user to force
the encoding for individual cases when whatever other mechanism(s) you
use get it wrong.

I've needed to use Mozilla's "View -> Character Encoding" menu enough
times when the browser's guess turned out to be wrong (and blindly
honouring the charset specified by HTTP's Content-Type: or HTML's META
tags would be a disaster).

> > At least Gtk-1 would attempt to display the filename; you would get
> > the odd question mark but at least you could select the file;
> 
> Gtk+2 also attempts to display the filename. It can be opened
> even though the filename has inconvertible characters escaped.

This isn't my experience; I just get messages like:

Gtk-Message: The filename "\377.ppm" couldn't be converted to UTF-8. (try setting the environment variable G_FILENAME_ENCODING): Invalid byte sequence in conversion input

and the filename is omitted altogether.

> > The "current locale" mechanism is just a way of avoiding the issues
> > as much as possible when you can't get away with avoiding them
> > altogether.
> 
> It's a way to communicate the encoding of the terminal, filenames,
> strerror, gettext etc.

It's *a* way, but it's not a very good way. It sucks when you can't
apply a single convention to everything.

> > Unicode has been described (accurately, IMHO) as "Esperanto for
> > computers". Both use the same approach to try to solve essentially the
> > same problem. And both will be about as successful in the long run.
> 
> Unicode has no viable competition.

There are two viable alternatives: byte strings with associated
encodings, and ISO-2022. In CJK environments, ISO-2022 is still far
more widespread than UTF-8, and will likely remain so for the
foreseeable future. And byte strings with associated encodings are
probably still the most common representation of all.
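
The first of those is easy to sketch as a type (hypothetical names):

	import Data.Word (Word8)

	-- Keep the bytes, remember what they are, and convert only at
	-- boundaries where the target encoding is actually known.
	data Encoding = ASCII | Latin1 | UTF8 | ShiftJIS | Unknown
	    deriving (Eq, Show)

	data EncodedString = EncodedString
	    { esEncoding :: Encoding
	    , esBytes    :: [Word8]
	    } deriving (Eq, Show)

	-- A filename keeps its original bytes; nothing forces a lossy
	-- round-trip through Unicode.
	filenameFromFS :: [Word8] -> EncodedString
	filenameFromFS = EncodedString Unknown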

-- 
Glynn Clements <glynn at gclements.plus.com>

