behaviour change in getDirectoryContents in GHC 7.2?
batterseapower at hotmail.com
Wed Nov 9 16:58:47 CET 2011
On 9 November 2011 13:11, Ian Lynagh <igloo at earth.li> wrote:
> If we aren't going to guarantee that the encoded string is unicode, then
> is there any benefit to encoding it in the first place?
(I think you mean decoded here -- my understanding is that decode ::
ByteString -> String and encode :: String -> ByteString.)
> Why not encode into private chars, i.e. encode U+EF00 (which in UTF8 is
> 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc?
> (Max gave some reasons earlier in this thread, but I'd need examples of
> what goes wrong to understand them).
We can do this but it doesn't solve all problems. Here are three such problems:
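For concreteness, the quoted scheme could be sketched like this (the names
escapeByte/unescapeChar are mine for illustration, not GHC's):

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Escape a raw byte 0xXX as the private char U+EFXX.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xEF00 + fromIntegral b)

-- Recover the raw byte from an escape char, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | 0xEF00 <= n && n <= 0xEFFF = Just (fromIntegral (n - 0xEF00))
  | otherwise                  = Nothing
  where n = ord c
```

Under this sketch the bytes 0xEE 0xBC 0x80 escape to U+EFEE U+EFBC U+EF80
and can be mapped back exactly, which is what recovers the roundtrip
property.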
PROBLEM 1 (bleeding from non-escaping to escaping TextEncodings)
So let's say we are reading a filename from stdin. Currently stdin
uses the utf8 TextEncoding -- this TextEncoding knows nothing about
private-char roundtripping, and will throw an exception when decoding
bad bytes or encoding our private chars.
Now the user types a U+EF80 character, which in UTF-8 is the byte
sequence 0xEE 0xBE 0x80 - so those are the bytes we get on stdin.
The utf8 TextEncoding naively decodes this byte sequence to the
character U+EF80.
We have lost at this point: if the user supplies the resulting String
to a function that encodes the String with the fileSystemEncoding, the
String will be encoded into the byte sequence 0x80. This is probably
not what we want to happen! It means that a program like this:
main = do
  fp <- getLine
  readFile fp >>= putStrLn
will fail ("file not found: \x80") when given the name of an
existing file whose name is the byte sequence 0xEE 0xBE 0x80.
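A toy model of this bleed-through (the helpers below are illustrative,
not GHC's actual TextEncoding machinery): a plain UTF-8 decoder happily
produces U+EF80 from the typed bytes, and an escape-aware filesystem
encoder then treats that char as the escape for the single byte 0x80:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical plain UTF-8 decode of one 3-byte sequence
-- (no escape awareness, no validity checks - just the bit-packing).
decodeUtf8_3 :: Word8 -> Word8 -> Word8 -> Char
decodeUtf8_3 b1 b2 b3 =
  chr (  (fromIntegral b1 .&. 0x0F) `shiftL` 12
     .|. (fromIntegral b2 .&. 0x3F) `shiftL` 6
     .|.  fromIntegral b3 .&. 0x3F )

-- Hypothetical escape-aware filesystem encode: chars in U+EF00..U+EFFF
-- are emitted as the single raw byte they escape. (Ordinary chars are
-- omitted from this sketch.)
encodeFS :: Char -> [Word8]
encodeFS c
  | 0xEF00 <= n && n <= 0xEFFF = [fromIntegral (n - 0xEF00)]
  | otherwise = error "toy model: only escape chars handled here"
  where n = ord c
```

Feeding 0xEE 0xBE 0x80 through decodeUtf8_3 and then encodeFS yields the
single byte 0x80, not the three bytes the user typed.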
PROBLEM 2 (bleeding between two different escaping TextEncodings)
So let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence
0xEE 0xBC 0x80) as a command line argument, so it goes through the
fileSystemEncoding. In your scheme the resulting Char sequence is
U+EFEE U+EFBC U+EF80.
What happens when we then *encode* that Char sequence using a UTF-16
TextEncoding (one that knows about the 0xEFxx escape mechanism)? The
resulting byte sequence is 0xEE 0xBC 0x80, NOT the UTF-16 encoded
version of U+EF00! This is certainly contrary to what the user would
expect.
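Sketched with hypothetical helpers (again mine, not GHC's): any
escape-aware encoder emits escape chars as raw bytes, bypassing the
encoding proper, so the escapes short-circuit the intended UTF-16 output:

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical escape-aware encoder wrapper, shared by any TextEncoding
-- honouring the scheme: escape chars U+EFXX become the raw byte 0xXX,
-- bypassing the underlying encoding.
encodeEscaped :: (Char -> [Word8]) -> String -> [Word8]
encodeEscaped enc = concatMap go
  where
    go c | 0xEF00 <= n && n <= 0xEFFF = [fromIntegral (n - 0xEF00)]
         | otherwise                  = enc c
      where n = ord c

-- Plain UTF-16BE for BMP chars (illustrative; ignores surrogate pairs).
utf16be :: Char -> [Word8]
utf16be c = [fromIntegral (n `div` 256), fromIntegral (n `mod` 256)]
  where n = ord c
```

Here encodeEscaped utf16be "\xEFEE\xEFBC\xEF80" yields the bytes
0xEE 0xBC 0x80, whereas the user expected the UTF-16BE bytes 0xEF 0x00
for U+EF00.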
PROBLEM 3 (bleeding from escaping to non-escaping TextEncodings)
Just as above, let's say the user supplies the UTF-8 encoded U+EF00
(byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes
through the fileSystemEncoding. In your scheme the resulting Char
sequence is U+EFEE U+EFBC U+EF80.
If you try to write this String to stdout (which uses the UTF-8
encoding that knows nothing about 0xEFxx escapes) you just get an
exception, NOT the UTF-8 encoded version of U+EF00. Game over, man.
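A toy model of this failure, assuming (as described above) that the plain
utf8 encoder rejects chars in the reserved escape range rather than
encoding them (encodeUtf8Strict is my name for this hypothetical encoder):

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical strict UTF-8 encode for BMP chars: per the behaviour
-- described above, chars in the reserved range U+EF00..U+EFFF are
-- rejected (modelled as Left) instead of being encoded.
encodeUtf8Strict :: Char -> Either String [Word8]
encodeUtf8Strict c
  | 0xEF00 <= n && n <= 0xEFFF = Left ("invalid character " ++ show c)
  | n < 0x80    = Right [fromIntegral n]
  | n < 0x800   = Right [ 0xC0 + fromIntegral (n `div` 64)
                        , 0x80 + fromIntegral (n `mod` 64) ]
  | n < 0x10000 = Right [ 0xE0 + fromIntegral (n `div` 4096)
                        , 0x80 + fromIntegral ((n `div` 64) `mod` 64)
                        , 0x80 + fromIntegral (n `mod` 64) ]
  | otherwise   = Left "astral chars omitted in this sketch"
  where n = ord c
```

So each char of the escaped sequence U+EFEE U+EFBC U+EF80 is rejected
outright; the byte sequence 0xEE 0xBC 0x80 the user supplied is
unrecoverable on this path.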
As far as I can see, the proposed escaping scheme recovers the
roundtrip property but fails to regain a number of the other
properties we would like.
(Note that the above outlined problems are problems in the current
implementation too -- but the current implementation doesn't even
pretend to support U+EFxx characters. Its correctness is entirely
dependent on them never showing up, which is why we chose a part of
the private codepoint region that is reserved specifically for the
purpose of encoding hacks).