[Haskell-cafe] invalid character encoding

Mon Mar 21 17:27:35 EST 2005

John Meacham wrote:

> > I'm not suggesting inventing conventions. I'm suggesting leaving such
> > issues to the application programmer who, unlike the library
> > programmer, probably has enough context to be able to reliably
> > determine the correct encoding in any specific instance.
> 
> But the whole point of Foreign.C.String is to interface to existing C
> code. And one of the most common conventions of said interfaces is to
> represent strings in the current locale, Which is why locale honoring
> conversion routines are useful. 

My point is that most C functions which accept or return char*s will
work regardless of whether those char*s can be decoded according to
the current locale. E.g.

	while (d = readdir(dir), d)
	{
		stat(d->d_name, &st);
		...
	}

will stat() every filename in the directory regardless of whether or
not the filenames are valid in the locale's encoding.

The Haskell equivalent using FilePath (i.e. String),
getDirectoryContents etc currently only works because the char* <->
String conversions are hardcoded to ISO-8859-1, which is infallible
and reversible. If it used e.g. UTF-8, it would fail on any filename
which wasn't valid UTF-8 even though it never actually needs to know
the string of characters which the filename represents.

The same applies to reading filenames from argv[] and passing them to
open() etc. This is one of the most common idioms in Unix programming,
and it doesn't care about encodings at all. Again, it would cease to
work reliably in Haskell if the automatic char* <-> String conversions
in getArgs etc started using the locale.

I'm not arguing about *how* char* <-> String conversions should be
performed so much as arguing about *whether* these conversions should
be performed. The conversion issues are only problems because the
conversions are being done at all.

-- 
Glynn Clements <glynn at gclements.plus.com>