[Haskell-cafe] invalid character encoding

Sat Mar 19 06:55:54 EST 2005

Glynn Clements <glynn at gclements.plus.com> writes:

>> A separate setting would force users to configure an encoding just
>> for the purposes of Haskell programs, as if the configuration wasn't
>> already too fragmented.
>
> 	encoding <- localeEncoding
> 	Curses.setupTerm encoding handle

In a properly configured system, curses is always supposed to be used
like this. That is, it may as well use the locale encoding directly,
without complicating the API.

I don't want to force bindings to be implemented like this, only to
allow it, because it's a good default.
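
A minimal sketch of "locale encoding as the default, explicit encoding
as an option", using only System.IO from a current GHC base library
(which did not exist in this form when this was written). The names
chooseEncoding and setupHandle are hypothetical and belong to no real
curses binding; they only illustrate the shape of such an API.

```haskell
import System.IO

-- Hypothetical helper: an explicitly supplied encoding wins,
-- otherwise fall back to the encoding of the current locale.
chooseEncoding :: Maybe TextEncoding -> TextEncoding
chooseEncoding (Just enc) = enc
chooseEncoding Nothing    = localeEncoding

-- Hypothetical helper: apply the chosen encoding to a handle and
-- report which one was used.
setupHandle :: Maybe TextEncoding -> Handle -> IO TextEncoding
setupHandle override h = do
  let enc = chooseEncoding override
  hSetEncoding h enc
  return enc

main :: IO ()
main = do
  -- Default case: the caller says nothing about encodings.
  enc <- setupHandle Nothing stdout
  hPutStrLn stderr ("stdout uses the locale encoding: " ++ show enc)
  -- Override case: the caller insists on UTF-8.
  enc' <- setupHandle (Just utf8) stdout
  hPutStrLn stderr ("stdout now uses: " ++ show enc')
```

A caller who is happy with the locale never has to mention an encoding;
only the unusual case pays for the extra argument.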

>> It's unwise to propose a new standard when an existing standard
>> works well enough.
>
> Existing standard? The standard curses API deals with bytes; encodings
> don't come into it. AFAIK, the wide-character curses API isn't yet a
> standard.

It's described in the Single Unix Specification along with the narrow
character version (but in an earlier version; the newest version
doesn't describe curses at all).

But I meant a standard for communicating the encoding of the terminal
to programs. If programs are supposed to check the locale to determine
that, it can be done automatically by bindings to readline & curses.

>> > Or expose the fact that the WC API is broken, depending upon your POV.
>> 
>> It's the only curses API which allows to write full-screen programs in
>> UTF-8 mode.
>
> All the more reason to fix it.
>
> And where does UTF-8 come into it? I would have expected it to use
> wide characters throughout.

The wide character API works with any encoding.

The narrow character API works only with encodings where one byte
corresponds to one character.

(In the wide character API wchar_t doesn't have to correspond to one
character cell; combining characters are attached to base characters,
and some characters are double-wide.)
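
To illustrate why the narrow API needs one byte per character, here is
a small sketch that counts the bytes a string occupies under two
encodings. It assumes a current GHC, whose base library provides
GHC.Foreign.withCStringLen; encodedLength is a name made up for this
example.

```haskell
import qualified GHC.Foreign as GHC
import System.IO (TextEncoding, latin1, utf8)

-- Number of bytes the string occupies once encoded with the given
-- encoding (no terminating NUL is counted).
encodedLength :: TextEncoding -> String -> IO Int
encodedLength enc s = GHC.withCStringLen enc s (\(_, n) -> return n)

main :: IO ()
main = do
  let s = "caf\233"                   -- "café": four characters
  print (length s)                    -- prints 4
  encodedLength latin1 s >>= print    -- prints 4: one byte per character
  encodedLength utf8 s >>= print      -- prints 5: U+00E9 takes two bytes
```

With ISO-8859-1 the byte count and the character count agree, so a
byte-oriented API can position text correctly; with UTF-8 they differ
as soon as a non-ASCII character appears.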

> I'm talking about standard (XSI) curses, which will just pass
> printable (non-control) bytes straight to the terminal. If your
> terminal uses CP437 (or some other non-standard encoding), you can
> just pass the appropriate bytes to waddstr() etc and the corresponding
> characters will appear on the terminal.

Which terminal uses CP437?

The Linux console doesn't, except temporarily after switching the
mapping to the builtin CP437 (but this state is not used by curses) or
after loading CP437 as the user map (nobody does this, and it won't
work properly with all characters from the range 0x80-0x9F anyway).

>> You can treat it as immutable. Just don't call setlocale with
>> different arguments again.
>
> Which limits you to a single locale. If you are using the locale's
> encoding, that limits you to a single encoding.

There is no support for changing the encoding of a terminal on the fly
by programs running inside it.

> The point is that a single program often generates multiple streams of
> text, possibly for different "audiences" (e.g. humans and machines).
> Different streams may require different conventions (encodings,
> numeric formats, collating orders), but may use the same functions.

A single program has a single stdout and a single filesystem. The
contexts which use the locale encoding don't need multiple encodings.

Multiple encodings are needed e.g. for exchanging data with other
machines over the network, for reading the contents of text files
after the user has specified an encoding explicitly, etc. In these
cases an API with an explicitly provided encoding should be used.
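
A sketch of what an API with an explicitly provided encoding can look
like, again assuming the System.IO of a current GHC. readFileWith and
writeFileWith are hypothetical helper names, and the demo writes a
small file named explicit-encoding-demo.txt into the current directory.

```haskell
import System.IO

-- Write a string to a file using the encoding supplied by the caller,
-- independently of the locale.
writeFileWith :: TextEncoding -> FilePath -> String -> IO ()
writeFileWith enc path s = withFile path WriteMode $ \h -> do
  hSetEncoding h enc
  hPutStr h s

-- Read a whole file, decoding it with the encoding supplied by the
-- caller. The contents are forced before the handle is closed.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  s <- hGetContents h
  length s `seq` hClose h
  return s

main :: IO ()
main = do
  let path = "explicit-encoding-demo.txt"
  writeFileWith utf8 path "caf\233\n"
  asUtf8   <- readFileWith utf8 path
  asLatin1 <- readFileWith latin1 path
  -- Decoding with the right encoding gives back the original text.
  print (asUtf8 == "caf\233\n")
  -- Decoding the same bytes as ISO-8859-1 yields one extra character.
  print (length asUtf8, length asLatin1)
```

The locale plays no part here: the encoding travels with the call, as
it should for data whose encoding is known from elsewhere.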

>> Gtk+2 also attempts to display the filename. It can be opened
>> even though the filename has inconvertible characters escaped.
>
> This isn't my experience; I just get messages like:
>
> Gtk-Message: The filename "\377.ppm" couldn't be converted to UTF-8.
> (try setting the environment variable G_FILENAME_ENCODING): Invalid
> byte sequence in conversion input
>
> and the filename is omitted altogether.

Works for me, e.g. in gedit-2.8.2. The filename is displayed with
escapes like \377 and can be opened.

>> > The "current locale" mechanism is just a way of avoiding the issues
>> > as much as possible when you can't get away with avoiding them
>> > altogether.
>> 
>> It's a way to communicate the encoding of the terminal, filenames,
>> strerror, gettext etc.
>
> It's *a* way, but it's not a very good way. It sucks when you can't
> apply a single convention to everything.

It's not so bad as to justify inventing our own conventions and forcing
users to configure the encoding of Haskell programs separately.

>> Unicode has no viable competition.
>
> There are two viable alternatives. Byte strings with associated
> encodings and ISO-2022.

ISO-2022 is an insanely complicated brain-damaged mess. I know it's
being used in some parts of the world, but the sooner it dies,
the better.

Byte strings with associated encodings coexist with Unicode and are
slowly being replaced by it, as UTF-8 is chosen as the encoding more
and more often.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/