Proposal #3337: expose Unicode and newline translation from System.IO
Simon Marlow
marlowsd at gmail.com
Fri Jul 3 04:23:48 EDT 2009
On 02/07/2009 23:04, Judah Jacobson wrote:
> On Tue, Jun 30, 2009 at 5:03 AM, Simon Marlow<marlowsd at gmail.com> wrote:
>> Ticket:
>>
>> http://hackage.haskell.org/trac/ghc/ticket/3337
>>
>> For the proposed new additions, see:
>>
>> * http://www.haskell.org/~simonmar/base/System-IO.html#23
>> System.IO (Unicode encoding/decoding)
>>
>> * http://www.haskell.org/~simonmar/base/System-IO.html#25
>> System.IO (Newline conversion)
>>
>> Discussion period: 2 weeks (14 July).
>
> Three points:
>
> 1) It would be good to have an hGetEncoding function, so that we can
> temporarily set the encoding of a Handle like stdin without affecting
> the rest of the program.
Sure. This might expose the fact that there's no instance Eq
TextEncoding, though - I can imagine someone wanting to know whether
localeEncoding is UTF-8 or not. Perhaps there should also be

   textEncodingName :: TextEncoding -> String

the idea being that if you pass the String back to mkTextEncoding you
get the same encoding. But what about normalisation issues, e.g.
"UTF-8" vs. "UTF8"?
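To make the hGetEncoding idea concrete, here is a minimal sketch of the save-and-restore pattern being asked for. It assumes hGetEncoding has the shape Handle -> IO (Maybe TextEncoding), with Nothing meaning the Handle is in binary mode, which is how later releases of base expose it. withEncoding, normaliseName and sameEncodingName are hypothetical helpers for illustration, not part of the proposal, and the name comparison is only one possible answer to the normalisation question.

```haskell
import Control.Exception (bracket)
import Data.Char (isAlphaNum, toUpper)
import System.IO

-- Hypothetical helper: run an action with the Handle temporarily switched
-- to another encoding, then restore whatever was set before, even if the
-- action throws.
withEncoding :: Handle -> TextEncoding -> IO a -> IO a
withEncoding h enc action =
  bracket (hGetEncoding h) restore (\_ -> hSetEncoding h enc >> action)
  where
    restore (Just old) = hSetEncoding h old
    restore Nothing    = hSetBinaryMode h True  -- Handle was in binary mode

-- Hypothetical normalisation: compare encoding names ignoring case and
-- punctuation, so that "UTF-8", "utf8" and "UTF_8" are treated as equal.
normaliseName :: String -> String
normaliseName = map toUpper . filter isAlphaNum

sameEncodingName :: String -> String -> Bool
sameEncodingName a b = normaliseName a == normaliseName b
```

For example, withEncoding stdin utf8 getLine would read one line as UTF-8 and leave stdin's previous encoding in place for the rest of the program.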
> 2) It looks like your API always throws an error on invalid input; it
> would be great if there were some way to customize this behavior.
> Nothing complicated, maybe just an enum which specifies one of the
> following behaviors:
>
> - throw an error
> - ignore (i.e., drop) invalid bytes/Chars
> - replace undecodable bytes with u+FFFD and unencodable Chars with '?'
Yes.
> My preference for the API change would be to add a function in
> GHC.IO.Encoding.Iconv; for example,
>
> mkTextEncodingError :: String -> ErrorHandling -> IO TextEncoding
So you're suggesting that we implement this only for iconv? That would
be easy enough, but then it wouldn't be available on Windows. Another
way would be to implement it at the Handle level, by catching
encoding/decoding errors from the codec and applying the appropriate
workaround. This is a lot more work, of course.
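As a rough illustration of the three behaviours listed above, here is a small pure sketch. Only the type name ErrorHandling comes from the suggested signature; its constructors and the decodeAscii function are hypothetical, and the decoder works on a plain list of bytes rather than at the iconv or Handle level, so it shows the policy choice only, not how either implementation would be wired up.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical constructors for the three suggested behaviours.
data ErrorHandling
  = ThrowError   -- report the first invalid byte
  | Ignore       -- silently drop invalid bytes
  | Replace      -- substitute U+FFFD for each invalid byte
  deriving (Eq, Show)

-- Decode 7-bit ASCII, applying the chosen policy to any byte >= 0x80.
-- Left carries an error message that includes the offending offset.
decodeAscii :: ErrorHandling -> [Word8] -> Either String String
decodeAscii policy = go 0
  where
    go :: Int -> [Word8] -> Either String String
    go _ [] = Right []
    go i (b:bs)
      | b < 0x80  = (chr (fromIntegral b) :) <$> go (i + 1) bs
      | otherwise = case policy of
          ThrowError -> Left ("invalid byte at offset " ++ show i)
          Ignore     -> go (i + 1) bs
          Replace    -> ('\xFFFD' :) <$> go (i + 1) bs
```

Note that the ThrowError case needs to know where in the input the failure happened, which is exactly the information a codec has to surface if the policy is applied at the Handle level instead.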
> since this is similar to how GHC.IO.Encoding.Latin1 allows error
> handling by providing latin1 and latin1_checked as separate encoders.
>
> Any more complicated behavior is probably best handled by something
> like the text package.
>
>
> 3) How hard would it be to get Windows code page support working? I'd
> like that a lot since it would further simplify the code in Haskeline.
> I can help out with the implementation if it's just a question of
> time.
Ok, so I did look into this. The problem is that the
MultiByteToWideChar API just isn't good enough.

  1. It only converts to UTF-16. So I can handle this by using UTF-16
     as our internal representation instead of UTF-32, and indeed I
     have made all the changes for this - there is a #define in the
     library. I found it slower than UTF-32, however.

  2. If there's a decoding error, you don't get to find out where
     in the input the error occurred, or do a partial conversion.

  3. If there isn't enough room in the target buffer, you don't get
     to do a partial conversion.

  4. Detecting errors is apparently only supported on Win XP and
     later (MB_ERR_INVALID_CHARS), and for some code pages it
     isn't supported at all.

2 and 3 are the real show-stoppers. Duncan Coutts found this code from
AT&T UWIN that implements iconv in terms of MultiByteToWideChar:
http://www.google.com/codesearch/p?hl=en&sa=N&cd=2&ct=rc#0IKL7zWk-JU/src/lib/libast/comp/iconv.c&l=333
notice how it uses a binary search strategy to solve the 3rd problem
above. Yuck! This would be the common case if we used this code in the
IO library. This is why I wrote our own UTF-{8,16,32} codecs for GHC
(borrowing some code from the text package).
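To show why the binary-search workaround is unattractive, here is a pure model of it. This is not the UWIN code: cost stands in for an all-or-nothing converter that can only say how much output a given input prefix would need, and largestFittingPrefix and utf16Units are hypothetical names. The sketch assumes cost is non-decreasing in the prefix length and that the empty prefix always fits.

```haskell
-- Find the largest n in [0, len] such that converting the first n input
-- units needs at most `capacity` output units. Every probe of `cost`
-- corresponds to a full trial conversion of that prefix.
largestFittingPrefix :: (Int -> Int) -> Int -> Int -> Int
largestFittingPrefix cost capacity len = go 0 len
  where
    go lo hi
      | lo >= hi              = lo
      | cost mid <= capacity  = go mid hi        -- prefix fits, try longer
      | otherwise             = go lo (mid - 1)  -- too big, try shorter
      where
        mid = (lo + hi + 1) `div` 2

-- Illustrative cost function: UTF-16 code units needed for the first n
-- characters of a String (characters above U+FFFF need a surrogate pair).
utf16Units :: String -> Int -> Int
utf16Units s n = sum [ if c > '\xFFFF' then 2 else 1 | c <- take n s ]
```

Each buffer fill that does not fit therefore costs on the order of log n trial conversions, whereas a codec that supports partial conversion just stops when the output buffer is full and reports how far it got.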
BTW, Python uses its own automatically-generated codecs for Windows
codepages. Maybe we should do that too.
Cheers,
Simon