behaviour change in getDirectoryContents in GHC 7.2?

John Millikin jmillikin at gmail.com
Sun Nov 6 17:56:17 CET 2011


2011/11/6 Max Bolingbroke <batterseapower at hotmail.com>:
> On 6 November 2011 04:14, John Millikin <jmillikin at gmail.com> wrote:
>> For what it's worth, on my Ubuntu system, Nautilus ignores the locale
>> and just treats all paths as either UTF8 or invalid.
>> To me, this seems like the most reasonable option; the concept of
>> "locale encoding" is entirely vestigial, and should only be used in
>> certain specialized cases.
>
> Unfortunately non-UTF8 locale encodings are seen in practice quite
> often. I'm not sure about Linux, but certainly lots of Windows systems
> are configured with a locale encoding like GBK or Big5.

This doesn't really matter for file paths, though. The Win32 file API
uses wide-character functions, which ought to work with Unicode text
regardless of what the user set their locale to.

>> Paths as text is what *Windows* programmers expect. Paths as bytes is
>> what's expected by programmers on non-Windows OSes, including Linux
>> and OS X.
>
> IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform
> that uses bytes for paths (that we care about) is Linux.

UTF-8 is bytes. It can be treated as text in some cases, but it's
better to think about it as bytes.

>> I'm not saying one is inherently better than the other, but
>> considering that various UNIX and UNIX-like operating systems have
>> been using byte-based paths for near on forty years now, trying to
>> abolish them by redefining the type is not a useful action.
>
> We have to:
>  1. Provide an API that makes sense on all our supported OSes
>  2. Have getArgs :: IO [String]
>  3. Have it such that if you go to your console and write
> (./MyHaskellProgram 你好) then getArgs tells you ["你好"]
>
> Given these constraints I don't see any alternative to PEP-383 behaviour.

Requirement #1 directly contradicts #2 and #3.

>> If you're going to make all the System.IO stuff use text, at least
>> give us an escape hatch. The "unix" package is ideally suited, as it's
>> already inherently OS-specific. Something like this would be perfect:
>
> You can already do this with the implemented design. We have:
>
> openFile :: FilePath -> IO Handle
>
> The FilePath will be encoded in the fileSystemEncoding. On Unix this
> will have PEP383 roundtripping behaviour. So if you want openFile' ::
> [Byte] -> IO Handle you can write something like this:
>
> escape = map (\b -> if b < 128 then chr b else chr (0xEF00 + b))
> openFile' = openFile . escape
>
> The bytes that reach the API call will be exactly the ones you supply.
> (You can also implement "escape" by just encoding the [Byte] with the
> fileSystemEncoding).
>
> Likewise, if you have a String and want to get the [Byte] we decoded
> it from, you just need to encode the String again with the
> fileSystemEncoding.
>
> If this is not enough for you please let me know, but it seems to me
> that it covers all your use cases, without any need to reimplement the
> FFI bindings.

This is not enough, since these strings are still being passed through
the potentially (and in 7.2.1, actually) broken path encoder.
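For reference, here is a rough sketch of the escaping step from the quoted
snippet, plus its inverse. It assumes the 0xEF00 offset quoted above and
models only the byte-to-Char mapping, not GHC's actual encoder. The names
escapeBytes and unescapeString are hypothetical, not part of "base".

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Map each byte to a Char. ASCII bytes map to themselves; bytes >= 128
-- are shifted by the 0xEF00 offset from the quoted snippet (an assumption
-- taken from this thread, not from the GHC sources).
escapeBytes :: [Word8] -> String
escapeBytes = map toChar
  where
    toChar b
      | b < 128   = chr (fromIntegral b)
      | otherwise = chr (0xEF00 + fromIntegral b)

-- Inverse of escapeBytes. Returns Nothing if the String contains a Char
-- that escapeBytes could never have produced.
unescapeString :: String -> Maybe [Word8]
unescapeString = mapM fromChar
  where
    fromChar c
      | n < 128                    = Just (fromIntegral n)
      | n >= 0xEF80 && n <= 0xEFFF = Just (fromIntegral (n - 0xEF00))
      | otherwise                  = Nothing
      where
        n = ord c
```

Note that in this sketch a String that genuinely contains a character in
the range U+EF80 to U+EFFF cannot be told apart from an escaped byte, so
the mapping is only unambiguous in the bytes-to-String direction.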

If the "unix" package had defined functions which operate on the
correct type (CString / ByteString), then it would not be necessary to
patch "base". I could just call the POSIX functions from system-fileio
and be done with it.

And this solution still assumes that there is such a thing as a
filesystem encoding in POSIX. There isn't. A file path is an arbitrary
sequence of bytes, with no significance except what the application
user interface decides.
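
To make that concrete, here is a small sketch of the only constraint
POSIX places on a single path component: it may contain any byte except
NUL and '/'. The function name is hypothetical, and the check ignores
length limits such as NAME_MAX.

```haskell
import Data.Word (Word8)

-- A path component is acceptable if it is non-empty and contains
-- neither NUL (0x00) nor '/' (0x2F). No text encoding is consulted.
isValidComponent :: [Word8] -> Bool
isValidComponent bs = not (null bs) && all allowed bs
  where
    allowed b = b /= 0x00 && b /= 0x2F
```

For example, isValidComponent [0xFF, 0xFE] is True, even though the byte
0xFF can never appear in well-formed UTF-8.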

It seems to me that there are two ways to provide bindings to operating
system functionality.

One is to give low-level access, using abstractions as close to the
real API as possible. In this model, "unix" would provide functions
like [[ rename :: ByteString -> ByteString -> IO () ]], and I would
know that it's not going to do anything weird to the parameters.
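
As a minimal sketch of what such a binding could look like, the following
calls the C rename() function directly through the FFI. This is not the
"unix" package's API; renameBytes and c_rename are hypothetical names, and
the only assumption is that the "bytestring" package is available.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import qualified Data.ByteString as B
import Foreign.C.Error (throwErrnoIfMinus1_)
import Foreign.C.String (CString)
import Foreign.C.Types (CInt(..))

-- Direct binding to the C library's rename(2) wrapper.
foreign import ccall "rename"
  c_rename :: CString -> CString -> IO CInt

-- Rename a file, passing both paths to the OS as NUL-terminated copies
-- of the given bytes. No decoding or encoding step is involved.
-- Throws an IOError built from errno if the call returns -1.
renameBytes :: B.ByteString -> B.ByteString -> IO ()
renameBytes old new =
  B.useAsCString old $ \cOld ->
    B.useAsCString new $ \cNew ->
      throwErrnoIfMinus1_ "rename" (c_rename cOld cNew)
```

Since useAsCString stops at the first NUL byte as far as the C side is
concerned, a caller would still want to reject paths containing NUL
before calling this.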

Another is to pretend that operating systems are all the same, and can
have their APIs abstracted away to some hypothetical virtual system.
This model just makes it more difficult for programmers to access the
OS, as they have to learn both the standard API, *and* whatever weird
thing has been layered on top of it.



More information about the Glasgow-haskell-users mailing list