behaviour change in getDirectoryContents in GHC 7.2?

Thu Nov 10 10:13:34 CET 2011

On 10 November 2011 00:17, Ian Lynagh <igloo at earth.li> wrote:
> On Wed, Nov 09, 2011 at 03:58:47PM +0000, Max Bolingbroke wrote:
>>
>> (Note that the above outlined problems are problems in the current
>> implementation too
>
> Then the proposal seems to me to be strictly better than the current
> system. Under both systems the wrong thing happens when U+EFxx is entered
> as Unicode text, but the proposed system works for all filenames read
> from the filesystem.

Your proposal is not *strictly* better than what is implemented in at
least the following ways:
  1. With your proposal, if you read a filename containing U+EF80 into
the variable "fp" and then expect the character U+EF80 to be in fp you
will be surprised to only find its escaped form. In the current
implementation you will in fact find U+EF80.
  2. The performance of iconv-based decoders will suffer because we
will need to do a post-pass in the TextEncoding to do this extra
escaping for U+EFxx characters.
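
To make the U+EFxx ambiguity concrete, here is a toy sketch. It assumes
the escape scheme maps an undecodable byte b (0x80 or above) to the
private-use character U+EF00 + b; the names escapeByte and unescapeChar
are hypothetical and are not the functions in base.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Assumed scheme: an undecodable byte b (>= 0x80) is represented in the
-- decoded String by the private-use character U+EF00 + b.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xEF00 + fromIntegral b)

-- Inverse applied when a String is encoded back to bytes: every character
-- in U+EF80..U+EFFF is treated as an escape for a single raw byte.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xEF80 && n <= 0xEFFF = Just (fromIntegral (n - 0xEF00))
  | otherwise                  = Nothing
  where
    n = ord c
```

In this model a genuine U+EF80 that was entered as Unicode text is
indistinguishable from the escape for byte 0x80, so encoding it yields
the single byte 0x80 rather than the character's own encoding.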

I'm really not keen on implementing a fix that addresses such a
limited subset of the problems, anyway.

> In the longer term, I think we need to fix the underlying problem that
> (for example) both getLine and getArgs produce a String from bytes, but
> do so in different ways. At some point we should change the type of
> getArgs and friends.

I'm not sure about this. hGetLine produces a String from bytes in a
different way depending on the encoding set on the Handle, but we
don't try to differentiate in the type system between Strings decoded
using different TextEncodings. Why should getLine and getArgs be
different?
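
As a small illustration of that point, the sketch below reads the same
two bytes through Handles with different encodings and gets two
different values of the same type, String. The helper names writeBytes
and decodeWith are made up for this example.

```haskell
import System.IO

-- Write the given characters as raw bytes (a binary Handle keeps only
-- the low 8 bits of each Char).
writeBytes :: FilePath -> String -> IO ()
writeBytes path bytes = do
  h <- openBinaryFile path WriteMode
  hPutStr h bytes
  hClose h

-- Read the first line of the file, decoding it with the given encoding.
decodeWith :: TextEncoding -> FilePath -> IO String
decodeWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  s <- hGetLine h
  hClose h
  return s
```

For a file holding the bytes 0xC3 0xA9 followed by a newline,
decodeWith latin1 returns a two-character String while decodeWith utf8
returns the single character U+00E9, and nothing in the types tells the
two results apart.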

If you are really unhappy about getLine and getArgs having different
behaviour in this sense, one option would be to change the default
stdout/stdin TextEncoding to use the fileSystemEncoding that knows
about escapes. (Note that this would mean that your Haskell program
wouldn't immediately die if you were using the UTF8 locale and then
tried to read some non-UTF8 input from stdin, which might or might not
be a good thing, depending on the application.)
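
An individual program can already opt in to this behaviour for itself.
The following is only a sketch: it assumes a version of base that
exports getFileSystemEncoding from GHC.IO.Encoding (7.2 exposes the
encoding differently), and the function name is hypothetical.

```haskell
import GHC.IO.Encoding (getFileSystemEncoding)
import System.IO (hSetEncoding, stdin, stdout)

-- Switch stdin and stdout to the escape-aware file system encoding, so
-- that undecodable input bytes are escaped instead of raising an error.
useFileSystemEncodingOnStdio :: IO ()
useFileSystemEncodingOnStdio = do
  enc <- getFileSystemEncoding
  hSetEncoding stdin enc
  hSetEncoding stdout enc
```

Calling this at the start of main gives getLine and getArgs the same
treatment of undecodable bytes, at the cost of no longer failing early
on malformed input.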

Max