behaviour change in getDirectoryContents in GHC 7.2?

Wed Nov 2 20:02:09 CET 2011

On 2 November 2011 16:29, Ian Lynagh <igloo at earth.li> wrote:
> If I understand correctly, you use U+EF00-U+EFFF to encode the
> characters 0-255 when they are not a valid part of the UTF8 stream.

Yes.

> So why not encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as
> U+EFEE U+EFBC U+EF80, and so on? Doesn't it then become completely
> reversible?

This was also suggested by Mark Lentczner at the time I wrote the
patch, but I raised a few objections (reproduced below):

"""
This would require us to:
 1. Unconditionally decode these byte sequences using the escape
mechanism, even if using a non-roundtripping encoding. This is because
the chars that result might be fed back into a roundtripping encoding,
where they would otherwise get confused with escapes representing some
other bytes.
 2. Unconditionally decode these particular characters from escapes,
even if using a non-roundtripping decoding -- necessary because of 1.

Which are both a little annoying. Perhaps more seriously, it would
play badly with e.g. reading in UTF-8 and writing out UTF-16, because
your UTF-16 would have bits of UTF-8 representing these private-use
chars embedded within it.
"""

So although this approach is somewhat attractive, I'm not sure the
benefits of complete roundtripping outweigh the costs.
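
To make the trade-off concrete, here is a rough sketch of both schemes.
It is written in Python rather than Haskell purely for brevity, and it
is not the GHC implementation: the names decode, encode, ESC_BASE and
escape_genuine are made up for illustration, and Python's
surrogateescape handler is borrowed only as an internal marker for
undecodable bytes.

```python
ESC_BASE = 0xEF00  # escape for byte b is chr(ESC_BASE + b), as in the thread


def decode(raw: bytes, escape_genuine: bool = False) -> str:
    """Decode UTF-8, mapping undecodable bytes to U+EF00..U+EFFF."""
    out = []
    # surrogateescape is used here only as an internal marker for bad bytes.
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # An undecodable byte: re-express it as a private-use escape.
            out.append(chr(ESC_BASE + (cp - 0xDC00)))
        elif escape_genuine and ESC_BASE <= cp <= ESC_BASE + 0xFF:
            # Proposed variant: a genuine escape-range char is itself escaped,
            # one private-use char per byte of its UTF-8 encoding.
            out.extend(chr(ESC_BASE + b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def encode(text: str) -> bytes:
    """Encode to UTF-8, turning every escape char back into its raw byte."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_BASE <= cp <= ESC_BASE + 0xFF:
            out.append(cp - ESC_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


invalid = b"caf\xe9"        # Latin-1 bytes, not valid UTF-8
genuine = b"\xee\xbc\x80"   # valid UTF-8 for U+EF00

assert encode(decode(invalid)) == invalid      # escapes round-trip
assert encode(decode(genuine)) == b"\x00"      # genuine U+EF00 does not
assert decode(genuine, escape_genuine=True) == "\uefee\uefbc\uef80"
assert encode(decode(genuine, escape_genuine=True)) == genuine
```

With escape_genuine=True the round trip is complete, but the decoded
string no longer contains U+EF00 at all, so writing it out as UTF-16
would emit three private-use chars in its place.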

This is why the unmodified PEP 383 approach is kind of nice - it uses
lone surrogate (rather than private use) codepoints to do the
escaping, and these codepoints are simply not allowed to occur in
valid UTF-encoded text.
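
For comparison, Python ships PEP 383 as the "surrogateescape" error
handler, so the property is easy to check there. A small sketch (the
helper names roundtrip and strictly_decodable are mine):

```python
def roundtrip(raw: bytes) -> bytes:
    """Decode and re-encode with the PEP 383 error handler."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


def strictly_decodable(raw: bytes) -> bool:
    """True if raw is accepted by the strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


invalid = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

# The bad byte 0xE9 becomes the lone surrogate U+DCE9 and comes back intact.
assert invalid.decode("utf-8", errors="surrogateescape") == "caf\udce9"
assert roundtrip(invalid) == invalid

# The bytes a naive encoder would produce for U+DCE9 are not valid UTF-8,
# so valid input can never decode to something that looks like an escape.
lone = "\udce9".encode("utf-8", errors="surrogatepass")
assert not strictly_decodable(lone)
assert roundtrip(lone) == lone
```

Because the escape code points cannot come out of a strict decode, no
second level of escaping is needed to keep the mapping reversible.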

Max