[Haskell-cafe] invalid character encoding

Ian Lynagh igloo at earth.li
Sat Mar 19 23:34:12 EST 2005

On Sun, Mar 20, 2005 at 01:33:44AM +0000, ross at soi.city.ac.uk wrote:
> On Sat, Mar 19, 2005 at 07:14:25PM +0000, Ian Lynagh wrote:
> > Most importantly, though: is there any way to remove this file without
> > doing something like an FFI import of unlink?
> > Is there anything LC_CTYPE can be set to that will act like C/POSIX but
> > accept 8-bit bytes as chars too?
> en_GB.iso88591 (or indeed any .iso88591 locale) will match the old
> behaviour (and the GHC behaviour).

This works for me with en_GB.iso88591 (or en_GB), but not en_US.iso88591
(or en_US). My /etc/locale.gen contains:

en_GB ISO-8859-1
en_GB.ISO-8859-15 ISO-8859-15
en_GB.UTF-8 UTF-8

So is there anything that /always/ works?

> Indeed it's possible to have filenames (under POSIX, anyway) that H98
> programs can't touch (under Hugs).  That pretty much follows from
> the Haskell definition FilePath = String.  The other thread under this
> subject has touched on the need for an (additional) API using an abstract
> FilePath type.

Hmm. I can't say I'm convinced by all this without having something like
that API.

> Yes, I don't see how to avoid this when using mbtowc() to do the
> conversion: it makes no distinction between a bad byte sequence and an
> incomplete one.

Perhaps you could use mbrtowc instead?

My manpage says

    If the n bytes starting at s do not contain a complete multibyte
    character, mbrtowc returns (size_t)(-2). This can happen even if
    n >= MB_CUR_MAX, if the multibyte string contains redundant shift
    sequences.

    If the multibyte string starting at s contains an invalid multibyte
    sequence before the next complete character, mbrtowc returns
    (size_t)(-1) and sets errno to EILSEQ. In this case, the effects on
    *ps are undefined.

For both functions (mbtowc and mbrtowc) my manpage gives the conformance as

       ISO/ANSI C, UNIX98
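
To make the difference between the two return values concrete, here is a
minimal sketch of how a caller might tell an incomplete sequence from an
invalid one with mbrtowc. It is an illustration only, not the Hugs code:
the names seq_status, classify_sequence and use_utf8_locale are
hypothetical, and the locale names tried in use_utf8_locale are
assumptions that will not exist on every system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result type for this sketch. */
enum seq_status { SEQ_COMPLETE, SEQ_INCOMPLETE, SEQ_INVALID };

/* Try to select a UTF-8 LC_CTYPE. The locale names are assumptions;
 * which ones are installed varies between systems.
 * Returns 1 on success, 0 if neither locale is available. */
int use_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/* Classify the first multibyte character in bytes[0..n-1] under the
 * current LC_CTYPE, starting from the initial conversion state. */
enum seq_status classify_sequence(const char *bytes, size_t n)
{
    mbstate_t state;
    wchar_t wc;
    size_t r;

    assert(bytes != NULL);
    memset(&state, 0, sizeof state);  /* initial conversion state */

    r = mbrtowc(&wc, bytes, n, &state);

    if (r == (size_t)-1)
        return SEQ_INVALID;     /* bad byte sequence, errno is EILSEQ */
    if (r == (size_t)-2)
        return SEQ_INCOMPLETE;  /* valid so far, but more bytes needed */
    return SEQ_COMPLETE;        /* 0 (null character) or a byte count */
}
```

With a UTF-8 locale selected, I would expect a lone lead byte such as
"\xc3" (n = 1) to be reported as incomplete and a byte such as "\xff" as
invalid, which is exactly the distinction mbtowc cannot make. A reader
that gets SEQ_INCOMPLETE can fetch more bytes and retry; one that gets
SEQ_INVALID has to decide how to report or skip the bad byte.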


