[Haskell-cafe] invalid character encoding

Thu Mar 17 14:55:05 EST 2005

John Meacham wrote:

> > > >> It doesn't affect functions added by the hierarchical libraries,
> > > >> i.e. those functions are safe only with the ASCII subset. (There is
> > > >> a vague plan to make Foreign.C.String conform to the FFI spec,
> > > >> which mandates locale-based encoding, and thus would change all
> > > >> those, but it's still up in the air.)
> > > >
> > > > Hmm. I'm not convinced that automatically converting to the current
> > > > locale is the ideal behaviour (it'd certainly break all my programs!).
> > > > Certainly a function for converting into the encoding of the current
> > > > locale would be useful for many users but it's important to be able to
> > > > know the encoding with certainty.
> > > 
> > > It should only be the default, not the only option.
> > 
> > I'm not sure that it should be available at all.
> > 
> > > It should be possible to specify the encoding explicitly.
> > 
> > Conversely, it shouldn't be possible to avoid specifying the encoding
> > explicitly.
> > 
> > Personally, I wouldn't provide an all-in-one "convert String to
> > CString using locale's encoding" function, just in case anyone was
> > tempted to actually use it.
> 
> But this is exactly what is needed for most C library bindings.

I very much doubt that "most" is accurate.

C functions which take a "char*" fall into three main cases:

1. Unspecified encoding, i.e. it's a string of bytes, not characters.

2. Locale's encoding, as determined by nl_langinfo(CODESET);
essentially, whatever was set with setlocale(LC_CTYPE), defaulting to
C/POSIX if setlocale() hasn't been called.

3. Fixed encoding, e.g. UTF-8, ISO-2022, US-ASCII (or EBCDIC on IBM
mainframes).
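
As a rough illustration of case 2, here is a minimal C sketch that asks
libc which codeset the current LC_CTYPE setting implies. The helper
names codeset_now and use_environment_ctype are made up for this
example. The codeset names returned differ between platforms, so no
particular output is shown.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the name of the current locale's codeset
 * into buf and return buf. The string returned by nl_langinfo() may be
 * overwritten by later calls, so it is copied rather than kept. */
static const char *codeset_now(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    snprintf(buf, size, "%s", nl_langinfo(CODESET));
    return buf;
}

/* Hypothetical helper: adopt the LC_CTYPE setting from the environment.
 * Returns 1 on success, 0 if the requested locale is not available, in
 * which case the previous locale stays in effect. */
static int use_environment_ctype(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}
```

A program that never calls setlocale() runs in the C/POSIX locale, so
calling codeset_now() first, then use_environment_ctype(), then
codeset_now() again may report two different codesets for the same
process.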

Historically, library functions have tended to fall into category 1
unless they *need* to know the interpretation of a given byte or
sequence of bytes (e.g. <ctype.h>), in which case they fall into
category 2. Most of libc falls into category 1, with a minority of
functions in category 2.
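
To make the split concrete, here is a small sketch contrasting one
function from each category. The wrapper names byte_length and
char_length are made up for this example, and it assumes the POSIX
behaviour of mbstowcs() when given a null destination.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Category 1: strlen() counts bytes up to the terminating NUL and never
 * consults the locale. */
static size_t byte_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Category 2: mbstowcs() decodes the bytes according to LC_CTYPE. With
 * a null destination it returns the number of characters, or
 * (size_t)-1 if the bytes are not valid in the locale's encoding. */
static size_t char_length(const char *s)
{
    assert(s != NULL);
    return mbstowcs(NULL, s, 0);
}
```

For a string holding the UTF-8 bytes of "caf" followed by e-acute,
byte_length() is 5 in every locale, while the result of char_length()
depends on which locale is in effect when it is called.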

Code which is designed to handle multiple languages simultaneously is
more likely to fall into category 3, using one of the "universal"
encodings (typically ISO-2022 in East Asia and UTF-8 elsewhere).

E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the
use of the locale's encoding for filenames (if you have filenames in
multiple encodings, you lose; filenames using the "wrong" encoding
simply don't appear in file selectors).

-- 
Glynn Clements <glynn at gclements.plus.com>