behaviour change in getDirectoryContents in GHC 7.2?
batterseapower at hotmail.com
Wed Nov 9 16:58:47 CET 2011
On 9 November 2011 13:11, Ian Lynagh <igloo at earth.li> wrote:
> If we aren't going to guarantee that the encoded string is unicode, then
> is there any benefit to encoding it in the first place?
(I think you mean decoded here -- my understanding is that decode ::
ByteString -> String and encode :: String -> ByteString.)
> Why not encode into private chars, i.e. encode U+EF00 (which in UTF8 is
> 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc?
> (Max gave some reasons earlier in this thread, but I'd need examples of
> what goes wrong to understand them).
We can do this but it doesn't solve all problems. Here are three such problems:
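For concreteness, the quoted scheme could be sketched like this (the names
escapeByte/unescapeChar are mine for illustration, not GHC's):

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Escape a raw byte 0xXX as the private char U+EFXX.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xEF00 + fromIntegral b)

-- Recover the raw byte from an escape char, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | 0xEF00 <= n && n <= 0xEFFF = Just (fromIntegral (n - 0xEF00))
  | otherwise                  = Nothing
  where n = ord c
```

Under this sketch the bytes 0xEE 0xBC 0x80 escape to U+EFEE U+EFBC U+EF80
and can be mapped back exactly, which is what recovers the roundtrip
property.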
PROBLEM 1 (bleeding from non-escaping to escaping TextEncodings)
So let's say we are reading a filename from stdin. Currently stdin
uses the utf8 TextEncoding -- this TextEncoding knows nothing about
private-char roundtripping, and will throw an exception when decoding
bad bytes or encoding our private chars.
Now the user types a U+EF80 character, which in UTF-8 is the byte
sequence 0xEE 0xBE 0x80 - so those are the bytes we get on stdin.
The utf8 TextEncoding naively decodes this byte sequence to the
character U+EF80.
We have lost at this point: if the user supplies the resulting String
to a function that encodes the String with the fileSystemEncoding, the
String will be encoded into the byte sequence 0x80. This is probably
not what we want to happen! It means that a program like this:
main = do
  fp <- getLine
  readFile fp >>= putStrLn
will fail ("file not found: \x80") when given the name of an
existing file whose name is the byte sequence 0xEE 0xBE 0x80.
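A toy model of this bleed-through (the helpers below are illustrative,
not GHC's actual TextEncoding machinery): a plain UTF-8 decoder happily
produces U+EF80 from the typed bytes, and an escape-aware filesystem
encoder then treats that char as the escape for the single byte 0x80:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical plain UTF-8 decode of one 3-byte sequence
-- (no escape awareness, no validity checks - just the bit-packing).
decodeUtf8_3 :: Word8 -> Word8 -> Word8 -> Char
decodeUtf8_3 b1 b2 b3 =
  chr (  (fromIntegral b1 .&. 0x0F) `shiftL` 12
     .|. (fromIntegral b2 .&. 0x3F) `shiftL` 6
     .|.  fromIntegral b3 .&. 0x3F )

-- Hypothetical escape-aware filesystem encode: chars in U+EF00..U+EFFF
-- are emitted as the single raw byte they escape. (Ordinary chars are
-- omitted from this sketch.)
encodeFS :: Char -> [Word8]
encodeFS c
  | 0xEF00 <= n && n <= 0xEFFF = [fromIntegral (n - 0xEF00)]
  | otherwise = error "toy model: only escape chars handled here"
  where n = ord c
```

Feeding 0xEE 0xBE 0x80 through decodeUtf8_3 and then encodeFS yields the
single byte 0x80, not the three bytes the user typed.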
PROBLEM 2 (bleeding between two different escaping TextEncodings)
So let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence
0xEE 0xBC 0x80) as a command line argument, so it goes through the
fileSystemEncoding. In your scheme the resulting Char sequence is
U+EFEE U+EFBC U+EF80.
What happens when we then *encode* that Char sequence using a UTF-16
TextEncoding (one that knows about the 0xEFxx escape mechanism)? The
resulting byte sequence is 0xEE 0xBC 0x80, NOT the UTF-16 encoded
version of U+EF00! This is certainly contrary to what the user would
expect.
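Sketched with hypothetical helpers (again mine, not GHC's): any
escape-aware encoder emits escape chars as raw bytes, bypassing the
encoding proper, so the escapes short-circuit the intended UTF-16 output:

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical escape-aware encoder wrapper, shared by any TextEncoding
-- honouring the scheme: escape chars U+EFXX become the raw byte 0xXX,
-- bypassing the underlying encoding.
encodeEscaped :: (Char -> [Word8]) -> String -> [Word8]
encodeEscaped enc = concatMap go
  where
    go c | 0xEF00 <= n && n <= 0xEFFF = [fromIntegral (n - 0xEF00)]
         | otherwise                  = enc c
      where n = ord c

-- Plain UTF-16BE for BMP chars (illustrative; ignores surrogate pairs).
utf16be :: Char -> [Word8]
utf16be c = [fromIntegral (n `div` 256), fromIntegral (n `mod` 256)]
  where n = ord c
```

Here encodeEscaped utf16be "\xEFEE\xEFBC\xEF80" yields the bytes
0xEE 0xBC 0x80, whereas the user expected the UTF-16BE bytes 0xEF 0x00
for U+EF00.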
PROBLEM 3 (bleeding from escaping to non-escaping TextEncodings)
Just as above, let's say the user supplies the UTF-8 encoded U+EF00
(byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes
through the fileSystemEncoding. In your scheme the resulting Char
sequence is U+EFEE U+EFBC U+EF80.
If you try to write this String to stdout (which uses the UTF-8
encoding that knows nothing about 0xEFxx escapes) you just get an
exception, NOT the UTF-8 encoded version of U+EF00. Game over, man.
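A toy model of this failure, assuming (as described above) that the plain
utf8 encoder rejects chars in the reserved escape range rather than
encoding them (encodeUtf8Strict is my name for this hypothetical encoder):

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical strict UTF-8 encode for BMP chars: per the behaviour
-- described above, chars in the reserved range U+EF00..U+EFFF are
-- rejected (modelled as Left) instead of being encoded.
encodeUtf8Strict :: Char -> Either String [Word8]
encodeUtf8Strict c
  | 0xEF00 <= n && n <= 0xEFFF = Left ("invalid character " ++ show c)
  | n < 0x80    = Right [fromIntegral n]
  | n < 0x800   = Right [ 0xC0 + fromIntegral (n `div` 64)
                        , 0x80 + fromIntegral (n `mod` 64) ]
  | n < 0x10000 = Right [ 0xE0 + fromIntegral (n `div` 4096)
                        , 0x80 + fromIntegral ((n `div` 64) `mod` 64)
                        , 0x80 + fromIntegral (n `mod` 64) ]
  | otherwise   = Left "astral chars omitted in this sketch"
  where n = ord c
```

So each char of the escaped sequence U+EFEE U+EFBC U+EF80 is rejected
outright; the byte sequence 0xEE 0xBC 0x80 the user supplied is
unrecoverable on this path.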
As far as I can see, the proposed escaping scheme recovers the
roundtrip property but fails to regain a number of the other
properties we would like.
(Note that the above outlined problems are problems in the current
implementation too -- but the current implementation doesn't even
pretend to support U+EFxx characters. Its correctness is entirely
dependent on them never showing up, which is why we chose a part of
the private codepoint region that is reserved specifically for the
purpose of encoding hacks).