Output character encoding for ghc on OpenBSD

Matthias Kilian kili at outback.escape.de
Sun Apr 18 14:22:52 EDT 2010


On Sun, Apr 18, 2010 at 10:53:22AM -0700, Judah Jacobson wrote:
> > Anyway, the short story is that I have to either hard-code the
> > character set to something like utf-8, or ghc will start to behave
> > really strange (for example, ghci would terminate immediately if
> > you just *type* a non-ASCII character).
> That sounds like it might be something to do with the haskeline
> package, which ghci uses for user interaction.  Haskeline makes its
> own FFI calls to translate raw input bytes into Unicode Chars.

Oh, this may indeed be a second problem. However, the encoding
problem itself also manifests in the `openTempFile001' test of the
testsuite.  For example, with an unpatched ghc-6.12, the test fails
with the following output:

=====> openTempFile001(normal) 1048 of 2375 [0, 38, 0]
cd ./lib/IO && '/usr/obj/ports/ghc-6.12.2/ghc-6.12.2/inplace/bin/ghc-stage2' -fforce-recomp -dcore-lint -dcmm-lint -no-user-package-conf  -dno-debug-output -o openTempFile001 openTempFile001.hs    >openTempFile001.comp.stderr 2>&1
cd ./lib/IO && ./openTempFile001    </dev/null >openTempFile001.run.stdout 2>openTempFile001.run.stderr
Wrong exit code (expected 0 , actual 1 )

openTempFile001: ./test22236.txt: hClose: invalid argument (Illegal byte sequence)

*** unexpected failure for openTempFile001(normal)

> Can
> you elaborate further on what exactly the issue is with OpenBSD's
> locale support?  In particular, there's several components used by
> Haskeline:
>  - call set_locale(LC_CTYPE)

Problem number 1: setlocale(LC_CTYPE, ...) fails (i.e. returns NULL)
for any locale except `C' or `POSIX'. Did I mention that OpenBSD
is really bad with locales? ;-)

>  - call nl_langinfo(CODESET)

Always returns `646' (ASCII). Duh.
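
A minimal sketch of how the two calls above can be checked on any
system. The function name report_locale is made up for illustration;
setlocale(), nl_langinfo() and CODESET are the standard interfaces.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/*
 * Print what the C library makes of the environment's locale settings.
 * Returns 1 if setlocale() accepted the environment, 0 if it returned NULL.
 * (report_locale is a hypothetical helper, not part of any library.)
 */
int
report_locale(FILE *fp)
{
	const char *loc;

	assert(fp != NULL);
	loc = setlocale(LC_CTYPE, "");
	fprintf(fp, "setlocale(LC_CTYPE, \"\"): %s\n", loc ? loc : "(NULL)");
	fprintf(fp, "nl_langinfo(CODESET):   %s\n", nl_langinfo(CODESET));
	return loc != NULL;
}
```

Calling report_locale(stdout) with LANG or LC_ALL set to different
values shows directly whether the environment has any effect on the
reported codeset.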

>  - pass the resulting string (which should be, e.g., $LANG) to iconv_open

iconv_open appears to need the *codeset* name, not a complete locale.
Note that OpenBSD uses GNU libiconv-1.13, which AFAIK differs from
the one included in glibc. Even worse, I have to pass something
like "UTF-8", whereas "UTF8" doesn't work.

>  - call iconv on user input (which may be malformed)

I wrote a little C program that does the following (some error
checks omitted here):

	iconv_t ic;
	char *inp, *outp;
	size_t insz, outsz;
	unsigned char in[] = {0xa9, 0, 0, 0};	/* U+00A9 in UTF-32LE */
	char out[512];

	inp = (char *)in;
	outp = out;
	insz = sizeof(in);
	outsz = sizeof(out) - 1;	/* leave room for the terminating NUL */
	setlocale(LC_CTYPE, "");
	/* to: the locale's encoding (""), from: UTF-32LE */
	ic = iconv_open("", "UTF-32LE");
	if (iconv(ic, &inp, &insz, &outp, &outsz) == (size_t)-1) {
		... bail out (perror() etc.) ...
	}
	*outp = 0;

And it just doesn't work, regardless of what I set LC_CTYPE to. The
only way to get it to print the copyright symbol is to explicitly
use "UTF-8" (or "ISO-8859-1" or something else that knows about
that symbol) as the first argument to iconv_open().
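
For reference, a self-contained sketch of the variant that does work,
with the target codeset passed in explicitly instead of taken from
the locale. The function name from_utf32le is invented for this
example, and it assumes a stateless target encoding such as UTF-8 or
ISO-8859-1 (no final flush call is made).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert `insz' bytes of UTF-32LE input to the codeset `tocode'.
 * Returns the number of bytes written to `out', or (size_t)-1 if the
 * converter cannot be opened or the conversion fails.
 * (from_utf32le is a hypothetical helper for illustration.)
 */
size_t
from_utf32le(const char *tocode, const unsigned char *in, size_t insz,
    char *out, size_t outsz)
{
	iconv_t ic;
	char *inp = (char *)in;		/* iconv() wants a non-const pointer */
	char *outp = out;
	size_t outleft = outsz;
	size_t rc;

	assert(tocode != NULL && in != NULL && out != NULL);
	ic = iconv_open(tocode, "UTF-32LE");
	if (ic == (iconv_t)-1)
		return (size_t)-1;
	rc = iconv(ic, &inp, &insz, &outp, &outleft);
	iconv_close(ic);
	if (rc == (size_t)-1)
		return (size_t)-1;
	return outsz - outleft;
}
```

With the four input bytes from the snippet above and "UTF-8" as the
target, this writes the two bytes 0xc2 0xa9; with "ISO-8859-1" it
writes the single byte 0xa9.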

> Is the problem that setting $LC_ALL or $LANG has no effect on the
> string returned by nl_langinfo, so the translation fails?

Yes, see above.

> If so,
> haskeline is supposed to output "?"s in that case, so there might be a
> bug in the package.

It fails (or rather: ghci fails, since I didn't yet do any separate
haskeline tests) with the same error as the test mentioned above,
with the difference that it fails on hPutChar instead of hClose for
obvious reasons.
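
The "?" substitution mentioned in the quoted text could look roughly
like this at the iconv level. This is only a sketch and not
haskeline's actual code: the name convert_lossy is made up, it
assumes the iconv implementation reports an unconvertible character
with EILSEQ, and it assumes the target encoding is ASCII-compatible
so that a single '?' byte is valid output.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert UTF-32LE input to `tocode', writing '?' for every code point
 * that iconv() rejects with EILSEQ. Returns the number of bytes written,
 * or (size_t)-1 on any other error (bad codeset, truncated input, output
 * buffer full). (convert_lossy is a hypothetical helper.)
 */
size_t
convert_lossy(const char *tocode, const unsigned char *in, size_t insz,
    char *out, size_t outsz)
{
	iconv_t ic;
	char *inp = (char *)in;
	char *outp = out;
	size_t outleft = outsz;

	assert(tocode != NULL && in != NULL && out != NULL);
	ic = iconv_open(tocode, "UTF-32LE");
	if (ic == (iconv_t)-1)
		return (size_t)-1;
	while (insz > 0) {
		if (iconv(ic, &inp, &insz, &outp, &outleft) != (size_t)-1)
			continue;	/* success: all input consumed */
		if (errno != EILSEQ || insz < 4 || outleft < 1) {
			iconv_close(ic);
			return (size_t)-1;
		}
		/* Replace the offending code point and skip its 4 bytes. */
		*outp++ = '?';
		outleft--;
		inp += 4;
		insz -= 4;
	}
	iconv_close(ic);
	return outsz - outleft;
}
```

The point of the loop is that a conversion error becomes a visible
placeholder in the output instead of an error that propagates up to
hPutChar or hClose.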

> Finally, when you say you have to "hard-code the character set", are
> you talking about ghc, haskeline, the base library, or somewhere else?

I'm talking about libraries/base/GHC/IO/Encoding/Iconv.hs

See? There just is no non-hackerish way to fix this (except of
course improving locale support on OpenBSD, but that's beyond my
scope currently).
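
To make the kind of hack concrete, here it is sketched in C rather
than as the Haskell change to Iconv.hs. It is a slightly softer
variant than hard-coding: it only overrides the codeset when the
system reports `646' or nothing at all. The name effective_codeset
and the choice of "UTF-8" as the fallback are assumptions made for
this example.

```c
#include <assert.h>
#include <string.h>

/*
 * Map the codeset name reported by nl_langinfo(CODESET) to the name that
 * should be handed to iconv_open(). An empty string or "646" is replaced
 * by "UTF-8"; anything else is passed through unchanged.
 * (effective_codeset is a hypothetical helper; the fallback is a guess.)
 */
const char *
effective_codeset(const char *reported)
{
	assert(reported != NULL);
	if (*reported == '\0' || strcmp(reported, "646") == 0)
		return "UTF-8";
	return reported;
}
```

A caller would use effective_codeset(nl_langinfo(CODESET)) wherever
it currently uses the nl_langinfo() result directly. It is still a
guess about what the terminal really expects, which is why it counts
as a hack.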


More information about the Glasgow-haskell-users mailing list