[Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

Ketil Malde ketil at malde.org
Wed Jun 17 09:36:34 EDT 2009


Simon Marlow <marlowsd at gmail.com> writes:

>> Why only on Windows?

> Just because it's a lot easier on Windows - all the OS APIs take
> Unicode file paths, so it's obvious what to do.  In contrast on Unix I
> don't have a clear idea of how to proceed.

> On Unix, all file APIs take [Word8] rather than [Char].  By
> convention, the [Word8] is usually assumed to be a string in the
> locale encoding, but that's only a user-space convention.

If we want to incorporate a translation layer, I think it's fair to
only support UTF-8 (ignoring locales), but provide a workaround for
invalid characters.

From http://en.wikipedia.org/wiki/UTF-8:

|  Therefore many modern UTF-8 converters translate errors to
|  something "safe". Only one byte is changed into the error
|  replacement and parsing starts again at the next byte, otherwise
|  concatenating strings could change good characters into
|  errors. Popular replacements for each byte are:
|
|    * nothing (the bytes vanish)
|    * '?' or '¿'
|    * The replacement character (U+FFFD)
|    * The byte from ISO-8859-1 or CP1252
|    * An invalid Unicode code point, usually U+DCxx where xx is the byte's value

How about using the last one? This would allow 'readFile' to work on
FilePaths provided by 'getDirectoryContents', while allowing for
real Unicode string literals.
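
To make the idea concrete, here is a rough sketch of what such a
translation layer could look like. The names decodePath and encodePath
are made up for illustration and are not part of any existing library.
I also assume that only bytes 0x80..0xFF ever need escaping (ASCII
bytes are always valid UTF-8), so only U+DC80..U+DCFF are treated as
escapes on the way back.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: decode UTF-8, turning each undecodable byte b
-- into the code point U+DC00 + b and resuming at the next byte.
decodePath :: [Word8] -> String
decodePath [] = []
decodePath bs@(b : rest) =
  case decodeOne bs of
    Just (c, rest') -> c : decodePath rest'
    Nothing         -> chr (0xDC00 + fromIntegral b) : decodePath rest

-- Decode one well-formed UTF-8 sequence from the front of the input.
-- Overlong forms, surrogates and values above U+10FFFF are rejected.
decodeOne :: [Word8] -> Maybe (Char, [Word8])
decodeOne [] = Nothing
decodeOne (b0 : bs)
  | b0 < 0x80                = Just (chr (fromIntegral b0), bs)
  | b0 >= 0xC2 && b0 <= 0xDF = multi 1 (fromIntegral b0 .&. 0x1F) 0x80
  | b0 >= 0xE0 && b0 <= 0xEF = multi 2 (fromIntegral b0 .&. 0x0F) 0x800
  | b0 >= 0xF0 && b0 <= 0xF4 = multi 3 (fromIntegral b0 .&. 0x07) 0x10000
  | otherwise                = Nothing
  where
    multi n start minCp =
      let (conts, rest) = splitAt n bs
          step acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)
          cp = foldl step start conts
      in if length conts == n
            && all isCont conts
            && cp >= minCp && cp <= 0x10FFFF
            && not (cp >= 0xD800 && cp <= 0xDFFF)
           then Just (chr cp, rest)
           else Nothing
    isCont c = c .&. 0xC0 == 0x80

-- Hypothetical inverse: escaped code points become the raw byte again,
-- everything else is encoded as ordinary UTF-8.
encodePath :: String -> [Word8]
encodePath = concatMap encodeChar

encodeChar :: Char -> [Word8]
encodeChar c
  | cp >= 0xDC80 && cp <= 0xDCFF = [fromIntegral (cp - 0xDC00)]
  | cp < 0x80                    = [fromIntegral cp]
  | cp < 0x800                   = [0xC0 .|. hi 6, cont 0]
  | cp < 0x10000                 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise                    = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    cp     = ord c
    hi s   = fromIntegral (cp `shiftR` s)
    cont s = 0x80 .|. (fromIntegral (cp `shiftR` s) .&. 0x3F)
```

Because an encoded surrogate is itself rejected as invalid UTF-8 when
decoding, encodePath (decodePath bs) should give back bs for any byte
string, which is the property 'readFile' would need. The reverse does
not hold for Strings that contain other lone surrogates; those are
simply encoded as three bytes here, which is a shortcut rather than a
considered design.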

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants

