[Haskell-cafe] ANNOUNCE: system-filepath 0.4.5 and system-fileio 0.3.4

Mon Feb 6 19:37:44 CET 2012

On Mon, Feb 6, 2012 at 10:05, Joey Hess <joey at kitenet.net> wrote:
> John Millikin wrote:
>> That was my understanding also, then QuickCheck found a
>> counter-example. It turns out that there are cases where a valid path
>> cannot be roundtripped in the GHC 7.2 encoding.
>
>> The issue is that  [238,189,178] decodes to 0xEF72, which is within
>> the 0xEF00-0xEFFF range that GHC uses to represent un-decodable bytes.
>
> How did you deal with this in system-filepath?

I used 0xEF00 as an escape character, to mean the following char
should be interpreted as a literal byte.
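
To make the collision concrete, here is a minimal sketch that decodes a three-byte UTF-8 sequence by hand and checks the result against the 0xEF00-0xEFFF range described above. It is not the system-filepath code; the names decodeThree and inGhc72EscapeRange are made up for illustration, and no validation of the input bytes is attempted.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: decode one well-formed three-byte UTF-8 sequence
-- (1110xxxx 10xxxxxx 10xxxxxx). Only the arithmetic is shown; malformed
-- input is not rejected.
decodeThree :: Word8 -> Word8 -> Word8 -> Char
decodeThree b0 b1 b2 = chr (hi .|. mid .|. lo)
  where
    hi  = (fromIntegral b0 .&. 0x0F) `shiftL` 12
    mid = (fromIntegral b1 .&. 0x3F) `shiftL` 6
    lo  = fromIntegral b2 .&. 0x3F

-- Hypothetical helper: does this character fall inside the range that,
-- according to this thread, GHC 7.2 uses for un-decodable bytes?
inGhc72EscapeRange :: Char -> Bool
inGhc72EscapeRange c = ord c >= 0xEF00 && ord c <= 0xEFFF
```

With these definitions, decodeThree 238 189 178 yields the character 0xEF72, and inGhc72EscapeRange reports that it lies inside the escape range, so a decoder cannot tell this valid character apart from an escaped byte.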

A user pointed out that there is a problem with this solution also --
a path containing actual U+EF00 will be considered "invalid encoding".
I'm going to change things over to use the Python 3 solution -- they
use part of the UTF-16 surrogate range (U+DC80 through U+DCFF, per
PEP 383), so it's impossible for a valid path to contain their
stand-in characters.
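
As a rough sketch of the Python 3 scheme (PEP 383), each un-decodable byte in 0x80..0xFF is mapped to the lone surrogate U+DC00 plus the byte value. The function names below are hypothetical and this is not the system-filepath implementation, only an illustration of the mapping:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: map an un-decodable byte (0x80..0xFF) to a lone
-- surrogate in U+DC80..U+DCFF. Bytes below 0x80 are plain ASCII and are
-- never escaped, so they yield Nothing here.
escapeByte :: Word8 -> Maybe Char
escapeByte b
  | b >= 0x80 = Just (chr (0xDC00 + fromIntegral b))
  | otherwise = Nothing

-- Hypothetical helper: recover the original byte from a stand-in
-- character, or Nothing if the character is not in the escape range.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
  | otherwise = Nothing
  where
    n = ord c
```

Because well-formed UTF-8 never decodes to a surrogate, a stand-in character produced this way cannot be confused with a character that was really present in the path.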

Another user says that GHC 7.4 also changed its escape range to match
Python 3, so it seems to be a pseudo-standard now. That's really good.
I'm going to add a 'posix_ghc704' rule to system-filepath, which
should mean that only users running GHC 7.2 will have to worry about
escape chars.

Unfortunately, the "text" package refuses to store codepoints in that
range (it replaces them with the U+FFFD replacement character), so I
have to switch things over to use [Char].
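
A small sketch of the problem, assuming the "text" package is installed: packing a String that contains a lone surrogate and unpacking it again does not return the original String. The names escaped and roundTripped are made up for this example.

```haskell
import qualified Data.Text as T

-- A String containing a lone surrogate of the kind used as a stand-in
-- for an un-decodable byte.
escaped :: String
escaped = ['a', '\xDC80', 'b']

-- Data.Text cannot represent surrogate code points, so pack substitutes
-- U+FFFD and the original byte value is lost.
roundTripped :: String
roundTripped = T.unpack (T.pack escaped)
```

Here roundTripped comes back with U+FFFD in place of U+DC80, which is why a [Char] representation is needed to keep the escaped byte.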

(Yak sighted! Prepare lather!)

> While no code points in the Supplementary Special-purpose Plane are currently
> assigned (http://www.unicode.org/roadmaps/ssp/), it is worrying that it's used,
> especially if filenames in a non-unicode encoding could be interpreted as
> containing characters really within this plane. I wonder why maxBound :: Char
> was not increased, and the additional space after `\1114111' used for the
> un-decodable bytes?

There's probably a lot of code out there that assumes (maxBound ::
Char) is also the maximum Unicode code point. It would be difficult to
update, particularly when dealing with bindings to foreign libraries
(like the "text-icu" package).

Both Python 3 and GHC 7.4 are using codepoints in the UTF-16 surrogate
range for this, and that seems like a pretty clean solution.

>> > For FFI, anything that deals with a FilePath should use this
>> > withFilePath, which GHC contains but doesn't export(?), rather than the
>> > old withCString or withCAString:
>> >
>> > import GHC.IO.Encoding (getFileSystemEncoding)
>> > import GHC.Foreign as GHC
>> >
>> > withFilePath :: FilePath -> (CString -> IO a) -> IO a
>> > withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f
>>
>> If code uses either withFilePort or withCString, then the filenames
>                      withFilePath?
>> written will depend on the user's locale. This is wrong. Filenames are
>> either non-encoded text strings (Windows), UTF-8 (OSX), or arbitrary
>> bytes (non-OSX POSIX). They must not change depending on the locale.
>
> This is exactly how GHC 7.4 handles them. For example:
>
> openDirStream :: FilePath -> IO DirStream
> openDirStream name =
>  withFilePath name $ \s -> do
>    dirp <- throwErrnoPathIfNullRetry "openDirStream" name $ c_opendir s
>    return (DirStream dirp)
>
> removeLink :: FilePath -> IO ()
> removeLink name =
>  withFilePath name $ \s ->
>  throwErrnoPathIfMinus1_ "removeLink" name (c_unlink s)
>
> I do not see any locale-dependent behavior in the filename bytes read/written.

Perhaps I'm misunderstanding, but the definition of 'withFilePath' you
provided is definitely locale-dependent. Unless getFileSystemEncoding
is constant?
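
One way to check is to ask the runtime directly. The sketch below repeats the quoted withFilePath with the imports it needs and adds two hypothetical helpers (describeFileSystemEncoding and roundTripPath, not part of base). It assumes GHC 7.4 or later, where GHC.IO.Encoding exports getFileSystemEncoding and GHC.Foreign exports encoding-aware withCString and peekCString.

```haskell
import Foreign.C.String (CString)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (getFileSystemEncoding)

-- The definition quoted above, with the imports it needs.
withFilePath :: FilePath -> (CString -> IO a) -> IO a
withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f

-- Hypothetical helper: the name of the encoding used for file paths.
-- It is an IO action rather than a constant, so the answer can differ
-- between runs of the same program.
describeFileSystemEncoding :: IO String
describeFileSystemEncoding = fmap show getFileSystemEncoding

-- Hypothetical helper: marshal a path to a C string and read it back
-- using the same encoding.
roundTripPath :: FilePath -> IO FilePath
roundTripPath fp = do
  enc <- getFileSystemEncoding
  withFilePath fp (GHC.peekCString enc)
```

Running describeFileSystemEncoding under different LANG or LC_ALL settings is a quick way to see whether the encoding, and so the bytes handed to the C function, follows the locale.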

>> > Code that reads or writes a FilePath to a Handle (including even to
>> > stdout!) must take care to set the right encoding too:
>> >
>> > fileEncoding :: Handle -> IO ()
>> > fileEncoding h = hSetEncoding h =<< getFileSystemEncoding
>>
>> This is also wrong. A "file path" cannot be written to a handle with
>> any hope of correct behavior. If it's to be displayed to the user, a
>> path should be converted to text first, then displayed.
>
> Sure it can. See find(1). Its output can be read as FilePaths once the
> Handle is set up as above.
>
> If you prefer your program not crash with an encoding error when an
> arbitrary FilePath is putStr, but instead perhaps output bytes that are
> not valid in the current encoding, that's also a valid choice. You might
> be writing a program, like find, that again needs to output any possible
> FilePath including badly encoded ones.

A program like find(1) has two use cases:

1. Display paths to the user, as text.

2. Provide paths to another program, in the operating system's file path format.

These two goals are in conflict. It is not possible to implement a
find(1) that performs both correctly in all locales.

The best solution is to choose #2, and always write in the OS format,
and hope the user's shell+terminal are capable of rendering it to a
reasonable-looking path.

> Filesystem.Path.CurrentOS.toText is a nice option if you want validly
> encoded output though. Thanks for that!

Ah, that's not what toText is for. toText provides a human-readable
representation of the path. It's used for things like file managers,
where you need to show the user a label which approximates the
underlying path. There's no guarantee that the output of toText can be
converted back to the original path, especially if it returns a Left.