[Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
Ketil Malde
ketil at malde.org
Wed Jun 17 09:36:34 EDT 2009
Simon Marlow <marlowsd at gmail.com> writes:
>> Why only on Windows?
> Just because it's a lot easier on Windows - all the OS APIs take
> Unicode file paths, so it's obvious what to do. In contrast on Unix I
> don't have a clear idea of how to proceed.
> On Unix, all file APIs take [Word8] rather than [Char]. By
> convention, the [Word8] is usually assumed to be a string in the
> locale encoding, but that's only a user-space convention.
If we want to incorporate a translation layer, I think it's fair to
only support UTF-8 (ignoring locales), but provide a workaround for
invalid characters.
From http://en.wikipedia.org/wiki/UTF-8:
| Therefore many modern UTF-8 converters translate errors to
| something "safe". Only one byte is changed into the error
| replacement and parsing starts again at the next byte, otherwise
| concatenating strings could change good characters into
| errors. Popular replacements for each byte are:
|
| * nothing (the bytes vanish)
| * '?' or '¿'
| * The replacement character (U+FFFD)
| * The byte from ISO-8859-1 or CP1252
| * An invalid Unicode code point, usually U+DCxx where xx is the byte's value
How about using the last one? This would allow 'readFile' to work on
FilePaths provided by 'getDirectoryContents', while allowing for
real Unicode string literals.
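
To make the idea concrete, here is a rough sketch of what such a translation layer could look like. The names decodeLenient and encodeLenient are made up for illustration and are not part of any existing library. The sketch assumes strict UTF-8 (overlong forms and encoded surrogates count as invalid), escapes one byte at a time as the Wikipedia text describes, and relies on Char being able to hold the code points U+DC80..U+DCFF, which I believe is the case in GHC.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: decode bytes as UTF-8. A byte that does not start a
-- valid sequence becomes the code point U+DC00 + byte, and decoding resumes
-- at the very next byte.
decodeLenient :: [Word8] -> String
decodeLenient [] = []
decodeLenient (b:bs)
  | b < 0x80               = chr (fromIntegral b) : decodeLenient bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise              = escape
  where
    -- Only bytes >= 0x80 reach this point, so the result is U+DC80..U+DCFF.
    escape = chr (0xDC00 + fromIntegral b) : decodeLenient bs
    -- n continuation bytes expected; minCp rejects overlong encodings.
    multi n lead minCp =
      let (conts, rest) = splitAt n bs
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F))
                     lead conts
          ok = length conts == n
               && all isCont conts
               && cp >= minCp && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if ok then chr cp : decodeLenient rest else escape
    isCont c = c .&. 0xC0 == 0x80

-- Hypothetical helper: inverse of decodeLenient. Escaped code points
-- U+DC80..U+DCFF are written back as the original raw byte; everything
-- else is encoded as ordinary UTF-8.
encodeLenient :: String -> [Word8]
encodeLenient = concatMap enc
  where
    enc ch
      | cp >= 0xDC80 && cp <= 0xDCFF = [fromIntegral (cp - 0xDC00)]
      | cp < 0x80    = [fromIntegral cp]
      | cp < 0x800   = [0xC0 .|. hi 6, cont 0]
      | cp < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise    = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        cp     = ord ch
        hi s   = fromIntegral (cp `shiftR` s)
        cont s = 0x80 .|. (fromIntegral (cp `shiftR` s) .&. 0x3F)
```

With this, encodeLenient (decodeLenient bs) gives back bs for any byte string, which is the property needed for a FilePath from 'getDirectoryContents' to be usable with 'readFile'. The reverse direction is not an identity for Strings that already contain other lone surrogates, so those would need separate treatment.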
-k
-- 
If I haven't seen further, it is by standing in the footprints of giants