behaviour change in getDirectoryContents in GHC 7.2?

Wed Nov 9 11:39:50 CET 2011

On 8 November 2011 11:43, Simon Marlow <marlowsd at gmail.com> wrote:
> Don't you mean 1 is what we have?

Yes, sorry!

> Failing to roundtrip in some cases, and doing so silently, seems highly
> suboptimal to me.  I'm sorry I didn't pick up on this at the time (Unicode
> is a swamp :).

I *can* change the implementation back to using lone surrogates. This
gives us guaranteed roundtripping but it means that the user might see
lone-surrogate Char values in Strings from the filesystem/command
line. IIRC this does break some software -- e.g. Bryan's "text"
library explicitly checks for such characters and fails if it detects
them.
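
For anyone who wants to see the tradeoff concretely, here is a small
sketch using Python's own PEP 383 error handler rather than GHC. The file
name is made up for illustration. It shows that the lone-surrogate scheme
roundtrips an undecodable byte, and that a strict encoder then refuses the
resulting string, which is roughly the kind of failure I would expect from
code that checks for surrogates:

```python
# Made-up example: Latin-1 bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udce9.txt"

# Encoding with the same handler restores the original bytes exactly.
assert name.encode("utf-8", "surrogateescape") == raw

# A strict encoder rejects the lone surrogate instead of passing it through.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```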

So whatever happens we are going to end up making some group of users unhappy!
  * No PEP383: Haskellers using non-ASCII get upset when their command
line argument [String]s aren't in fact sequences of characters, but
sequences of bytes in some arbitrary encoding
  * PEP383(surrogates): Unicoders get upset by lone surrogates (which
can actually occur at the moment, independent of PEP383 -- e.g. as
character literals or from FFI)
  * PEP383(private chars): Unixers get upset that we can't roundtrip
byte sequences that look like the codepoint 0xEFXX encoded in the
current locale. In practice, 0xEFXX is only decodable from a UTF
encoding, so we fail to roundtrip byte sequences like the one Ian
posted.
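
To illustrate the third bullet, here is a rough Python sketch of a
private-character escaping scheme. It is not the GHC implementation:
`decode_private` and `encode_private` are hypothetical names, the locale is
assumed to be UTF-8, and only U+EF80..U+EFFF is used because bytes below
0x80 always decode in UTF-8. It shows escaped bytes roundtripping while a
genuine U+EF80 in the input does not:

```python
PRIVATE_BASE = 0xEF00  # assumed escape range, following the 0xEFXX description

def decode_private(raw: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte b to chr(0xEF00 + b)."""
    # Reuse surrogateescape to find bad bytes, then move them to the private range.
    text = raw.decode("utf-8", "surrogateescape")
    return "".join(
        chr(PRIVATE_BASE + (ord(c) - 0xDC00)) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )

def encode_private(text: str) -> bytes:
    """Encode to UTF-8, turning U+EF80..U+EFFF back into single raw bytes."""
    out = bytearray()
    for c in text:
        cp = ord(c)
        if 0xEF80 <= cp <= 0xEFFF:
            out.append(cp - PRIVATE_BASE)
        else:
            out += c.encode("utf-8")
    return bytes(out)

# An undecodable byte survives the round trip.
bad = b"\xff"
assert encode_private(decode_private(bad)) == bad

# A genuine U+EF80 in the input is mistaken for an escape on the way out.
genuine = "\uef80".encode("utf-8")
roundtripped = encode_private(decode_private(genuine))
assert roundtripped != genuine
```

The encoder cannot tell an escaped byte from a real private-use
character, which is why the roundtrip is only highly likely rather than
guaranteed.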

I'm happy to implement any of these behaviours; I would just like to know
that whichever one we choose is accepted as the correct tradeoff :-)

Re exposing a ByteString-based interface to the IO library from
base/unix/whatever: AFAIK Python doesn't do this, and just tells
people to use the (x.encode(sys.getfilesystemencoding(),
"surrogateescape")) escape hatch, which is what I've been
recommending. I think this would be more satisfying to John if it were
actually guaranteed to work on arbitrary byte sequences, not just
*highly likely* to work :-)
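
For reference, the Python escape hatch looks roughly like this in use.
`to_os_bytes` is a hypothetical helper name and the byte string is made
up; the sketch assumes a POSIX system whose filesystem encoding is
ASCII-compatible (UTF-8 in the common case):

```python
import sys

def to_os_bytes(name: str) -> bytes:
    """Hypothetical helper: recover the raw bytes behind an OS-provided string."""
    return name.encode(sys.getfilesystemencoding(), "surrogateescape")

# Simulate a file name arriving from the OS as bytes, decoded the PEP 383 way.
raw = b"data\xff.bin"
name = raw.decode(sys.getfilesystemencoding(), "surrogateescape")

# The escape hatch gives back exactly the bytes we started with.
assert to_os_bytes(name) == raw
```

On POSIX the standard library's os.fsencode performs the same encoding
step, so the helper is only there to make the expression explicit.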

Max