behaviour change in getDirectoryContents in GHC 7.2?

Wed Nov 9 12:02:54 CET 2011

On 09/11/2011 10:39, Max Bolingbroke wrote:
> On 8 November 2011 11:43, Simon Marlow <marlowsd at gmail.com> wrote:
>> Don't you mean 1 is what we have?
>
> Yes, sorry!
>
>> Failing to roundtrip in some cases, and doing so silently, seems highly
>> suboptimal to me.  I'm sorry I didn't pick up on this at the time (Unicode
>> is a swamp :).
>
> I *can* change the implementation back to using lone surrogates. This
> gives us guaranteed roundtripping but it means that the user might see
> lone-surrogate Char values in Strings from the filesystem/command
> line. IIRC this does break some software -- e.g. Brian's "text"
> library explicitly checks for such characters and fails if it detects
> them.
>
> So whatever happens we are going to end up making some group of users unhappy!
>    * No PEP383: Haskellers using non-ASCII get upset when their command
> line argument [String]s aren't in fact sequences of characters, but
> sequences of bytes in some arbitrary encoding
>    * PEP383(surrogates): Unicoders get upset by lone surrogates (which
> can actually occur at the moment, independent of PEP383 -- e.g. as
> character literals or from FFI)
>    * PEP383(private chars): Unixers get upset that we can't roundtrip
> byte sequences that look like the codepoint 0xEFXX encoded in the
> current locale. In practice, 0xEFXX is only decodable from a UTF
> encoding, so we fail to roundtrip byte sequences like the one Ian
> posted.
>
> I'm happy to implement any behaviour, I would just like to know that
> whatever it is is accepted as the correct tradeoff :-)

I would be happy with the surrogate approach, I think.  Arguably, if you
try to treat a string with lone surrogates as Unicode and it fails, then 
that is a feature: the original string wasn't Unicode.  All you can do 
with an invalid Unicode string is use it as a FilePath again, and the 
right thing will happen.
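
For illustration, Python's implementation of PEP 383 shows the behaviour 
described above.  This is only a sketch: the file name is made up, and 
"utf-8" is hard-coded as an assumption where real code would use 
sys.getfilesystemencoding().  The helper is_valid_unicode is hypothetical, 
not a library function.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes; 0xE9 here is not valid UTF-8

# Decoding with the PEP 383 error handler maps the undecodable byte 0xE9
# to the lone surrogate U+DCE9 instead of raising an error.
name = raw.decode("utf-8", "surrogateescape")


def is_valid_unicode(s: str) -> bool:
    """Return True if s encodes strictly, i.e. contains no lone surrogates."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Treating the string as real Unicode fails (is_valid_unicode(name) is False),
# but encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", "surrogateescape")
```

The strict encode failing is the "feature" in question: it signals that 
the original bytes were never Unicode, while the surrogateescape encode 
still hands the exact bytes back to the filesystem.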

Alternatively if we stick with the private char approach, it should be 
possible to have an escaping scheme for 0xEFxx characters in the input 
that would enable us to roundtrip correctly.  That is, escape 0xEFxx 
into a sequence 0xYYEF 0xYYxx for some suitable YY.  But perhaps that 
would be too expensive - an extra translation pass over the buffer after 
iconv (well, we do this for newline translation, so maybe it's not too bad).
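
A rough sketch of what such an extra pass could look like, written in 
Python for brevity.  Note that this is a simplified variant, not the exact 
0xYYEF 0xYYxx encoding suggested above: it prefixes each genuine private-use 
character with a single escape marker and doubles the marker itself.  The 
marker U+F8FF, the range U+EF80..U+EFFF (only bytes 0x80..0xFF can be 
undecodable in an ASCII-compatible encoding), the UTF-8 locale, and the 
names decode_path/encode_path are all assumptions made for the example.

```python
ESC = "\uf8ff"           # hypothetical escape marker (a private-use code point)
LO, HI = 0xEF80, 0xEFFF  # private-use range standing in for raw bytes 0x80..0xFF


def decode_path(raw: bytes) -> str:
    # First pass: ordinary decoding; undecodable bytes become lone surrogates.
    text = raw.decode("utf-8", "surrogateescape")
    out = []
    # Second pass: move escaped bytes into the private-use range, and escape
    # genuine private-use characters (and the marker) so they stay distinct.
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(chr(0xEF00 + (cp - 0xDC00)))
        elif ch == ESC or LO <= cp <= HI:
            out.append(ESC + ch)
        else:
            out.append(ch)
    return "".join(out)


def encode_path(text: str) -> bytes:
    out = []
    chars = iter(text)
    for ch in chars:
        cp = ord(ch)
        if ch == ESC:
            # Escaped character: take the next one literally.
            out.append(next(chars, ESC))
        elif LO <= cp <= HI:
            # Unescaped private-use character: it stands for a raw byte.
            out.append(chr(0xDC00 + (cp - 0xEF00)))
        else:
            out.append(ch)
    return "".join(out).encode("utf-8", "surrogateescape")
```

With this variant, a byte sequence that is itself the valid UTF-8 encoding of 
a U+EFxx character survives decode_path followed by encode_path, at the cost 
of one more linear scan in each direction.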

> RE exposing a ByteString based interface to the IO library from
> base/unix/whatever: AFAIK Python doesn't do this, and just tells
> people to use the (x.encode(sys.getfilesystemencoding(),
> "surrogateescape")) escape hatch, which is what I've been
> recommending. I think this would be more satisfying to John if it were
> actually guaranteed to work on arbitrary byte sequences, not just
> *highly likely* to work :-)

The performance overhead of all this worries me.  withCString has taken 
a huge performance hit, and I think there are people who want to know 
that there aren't several complex encoding/decoding passes between their 
Haskell code and the POSIX API.  We ought to be able to program to POSIX 
directly, and the same goes for Win32.

Cheers,
	Simon