behaviour change in getDirectoryContents in GHC 7.2?

Wed Nov 2 20:59:21 CET 2011

On 2 November 2011 19:13, Ian Lynagh <igloo at earth.li> wrote:
> [snip some stuff I didn't understand. I think I made the mistake of
> entering a Unicode discussion]

Sorry, perhaps that was too opaque! The problem is that if we commit
to supporting occurrences of the private-use codepoint 0xEF80 then
what happens if we:

1. Decode the UTF-32LE data [0x80, 0xEF, 0x00, 0x00] to a string "\xEF80"
2. Pass the string "\xEF80" to a function that encodes it using an
encoding which knows about the escaping mechanism.
3. Consequently encode "\xEF80" as [0x80]

This seems a bit sad.
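
To make the three steps concrete, here is a minimal Python sketch. The
function encode_with_escapes is hypothetical: it stands in for any
encoder that knows about the escaping mechanism, and it assumes that
the codepoints U+EF80..U+EFFF are the escapes for the bytes 0x80..0xFF.

```python
def encode_with_escapes(text: str) -> bytes:
    """Hypothetical UTF-8 encoder that understands the private-use escapes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEF80 <= cp <= 0xEFFF:
            # Assumed escape range: U+EFNN stands for the raw byte 0xNN.
            out.append(cp - 0xEF00)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# Step 1: perfectly valid UTF-32LE data decodes to the codepoint U+EF80.
text = bytes([0x80, 0xEF, 0x00, 0x00]).decode("utf-32-le")

# Steps 2 and 3: the escape-aware encoder emits the single byte 0x80,
# not the UTF-8 encoding of U+EF80.
encoded = encode_with_escapes(text)
```

Here encoded is the single byte 0x80, whereas a plain UTF-8 encoder
would have produced the three bytes 0xEE 0xBE 0x80.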

> They are allowed to occur in Linux/ext2 filenames, anyway, and I think
> we ought to be able to handle them correctly if they do.

In Python, if a filename is decoded using UTF-8 and the
"surrogateescape" error handler, occurrences of lone surrogates are a
decoding error because they are not allowed to occur in UTF-8 text. As
a result the bytes of the lone surrogate are put into the string
escaped, so they can be roundtripped back to the same bytes on output.
So Python works OK.
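
A small Python 3 sketch of that behaviour, using the standard
"surrogateescape" error handler. The byte string is just an
illustrative filename: it is what you would get by naively
UTF-8-encoding the lone surrogate U+DC80.

```python
# UTF-8-style encoding of the lone surrogate U+DC80; not valid UTF-8.
raw_name = b"\xed\xb2\x80"

# Each undecodable byte 0xNN is escaped as the lone surrogate U+DCNN.
name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns each U+DCNN back into byte 0xNN.
round_tripped = name.encode("utf-8", errors="surrogateescape")
```

The escapes can never collide with the result of decoding valid input,
because valid UTF-8 never decodes to a surrogate in the first place.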

In GHC >= 7.2, if a filename is decoded using UTF-8 and the "Roundtrip"
error handler, occurrences of 0xEFNN are not a decoding error because
they are perfectly fine Unicode codepoints. As a result they get put
into the string unescaped, and so when we try to roundtrip the string
we get the byte 0xNN in the output rather than the UTF-8 encoding of
0xEFNN. So GHC does not work OK in this situation :-(
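
The failure can be simulated in Python. This is only a sketch of the
scheme as described above, not GHC's actual code: the handler name
"private_use_escape", the functions decode_name and encode_name, and
the escape range U+EF80..U+EFFF are all assumptions for illustration.

```python
import codecs


def _escape_undecodable(err):
    """Hypothetical handler: map each undecodable byte 0xNN to U+EFNN."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(0xEF00 + b) for b in bad), err.end


codecs.register_error("private_use_escape", _escape_undecodable)


def decode_name(raw: bytes) -> str:
    return raw.decode("utf-8", errors="private_use_escape")


def encode_name(name: str) -> bytes:
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xEF80 <= cp <= 0xEFFF:
            out.append(cp - 0xEF00)  # treated as an escaped byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# An undecodable byte roundtrips as intended.
invalid = b"\xff"
restored = encode_name(decode_name(invalid))

# A filename that legitimately contains U+EF80 does not.
valid = "\uef80".encode("utf-8")
mangled = encode_name(decode_name(valid))
```

Here restored equals invalid, but mangled is the single byte 0x80
rather than the original three bytes in valid.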

(The problem I outlined at the start of this email doesn't arise with
the lone surrogate mechanism because surrogates aren't allowed in
UTF-32 text either. So step 1 in the process would have failed with a
decoding error.)
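
For comparison, a quick Python check of that last point, assuming
Python 3.4 or later (where the UTF-32 codecs reject surrogates):

```python
# UTF-32LE code unit holding the surrogate U+DC80.
raw = bytes([0x80, 0xDC, 0x00, 0x00])

try:
    raw.decode("utf-32-le")
    rejected = False
except UnicodeDecodeError:
    # Surrogates are not valid scalar values, so strict decoding fails.
    rejected = True
```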

Hope that helps,
Max