Proposal #3455: Add a setting to change how Unicode encoding
errors are handled
judah.jacobson at gmail.com
Mon Aug 31 19:29:51 EDT 2009
On Tue, Aug 25, 2009 at 5:10 AM, Simon Marlow <marlowsd at gmail.com> wrote:
> On 23/08/2009 17:22, Judah Jacobson wrote:
>> I propose that we augment ghc-6.12.1's support for Unicode Handles
>> by adding the following functions to System.IO:
>> hSetOnEncodingError :: Handle -> OnEncodingError -> IO ()
>> hGetOnEncodingError :: Handle -> IO OnEncodingError
>> as well as the enumeration `OnEncodingError` with three constructors:
>> - `ThrowEncodingError`: Throw an exception at the first encoding or
>> decoding error.
>> - `SkipEncodingError`: Skip all invalid bytes or characters.
>> - `TranslitEncodingError`: Replace undecodable bytes with U+FFFD, and
>> unencodable characters with '?'.
>> I have implemented this functionality in a patch attached to the
>> ticket. Haddock docs
>> are here:
>> The choice of error handler is orthogonal to the choice of encoder.
>> Additionally, the same setting is used for both read and write modes. For
>> portability, the handlers are written in pure Haskell rather than using
>> GNU iconv's //TRANSLIT feature.
>> Note that the text package, for example, provides more sophisticated
>> error-handling options. However, I think the above choices are useful
>> enough without making the API too complicated.
> I replied on the ticket, reproduced here for readers of libraries@:
> It looks like the main question here is whether the IOError should be
> returned explicitly (as in your patch), or whether we should just catch the
> exception. All things being equal, catching the exception would be simpler,
> as it wouldn't require any changes in the codecs. Is there a reason why you
> didn't do it that way? Perhaps because you want to be sure that the
> exception is really an encoding error, and not some other kind of exception?
> If that's the case, then we should introduce a new exception for encoding
> errors (that's probably a good idea anyway).
I agree that we should create a new exception type. Given the errors
currently thrown by the library, I assume that it doesn't need to be
anything more than a newtype wrapping a String message.
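As a rough sketch of what I have in mind (the names EncodingError,
tryEncoding and failingDecode are made up for illustration and are not
part of any existing library), the exception could simply wrap a
String, and a handler could then catch exactly that type while letting
every other exception propagate:

```haskell
import Control.Exception (Exception, throwIO, try)

-- Hypothetical exception type: a newtype around a String message.
-- (On older GHCs an explicit "deriving Typeable" would also be needed.)
newtype EncodingError = EncodingError String
  deriving (Show, Eq)

instance Exception EncodingError

-- Run an action, catching only EncodingError.
-- Any other exception type is not caught and keeps propagating.
tryEncoding :: IO a -> IO (Either EncodingError a)
tryEncoding = try

-- Stand-in for a codec that fails on bad input (illustrative only).
failingDecode :: IO String
failingDecode = throwIO (EncodingError "invalid byte sequence")
```

With a dedicated type like this, the error handlers would not need the
codecs to return the error explicitly in order to tell encoding
failures apart from other IO errors.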
If the text package and ghc's IO library are merged into a new system,
then it would probably be better to explicitly return the error --
that way we can have pure ByteString <-> Text conversion functions.
But for the current state of the library (where the encoding type is
only exposed under GHC.* and makes few stability promises) it probably
doesn't make a big difference.
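To illustrate the "explicitly return the error" style, here is a
minimal pure sketch. It is only an assumption-laden toy: it decodes
plain ASCII, uses [Word8] and String in place of ByteString and Text,
and the names DecodeError and decodeAscii are hypothetical. Only the
three OnEncodingError constructors come from the proposal.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- The three policies from the proposal.
data OnEncodingError
  = ThrowEncodingError
  | SkipEncodingError
  | TranslitEncodingError
  deriving (Show, Eq)

-- Hypothetical error value, returned rather than thrown.
newtype DecodeError = DecodeError String
  deriving (Show, Eq)

-- Toy ASCII decoder: bytes >= 0x80 are treated as undecodable.
decodeAscii :: OnEncodingError -> [Word8] -> Either DecodeError String
decodeAscii policy = go (0 :: Int)
  where
    go _ [] = Right []
    go i (b:bs)
      | b < 0x80  = (chr (fromIntegral b) :) <$> go (i + 1) bs
      | otherwise = case policy of
          -- Report the first bad byte and stop.
          ThrowEncodingError    ->
            Left (DecodeError ("invalid byte at offset " ++ show i))
          -- Drop the bad byte and continue.
          SkipEncodingError     -> go (i + 1) bs
          -- Substitute U+FFFD for the bad byte and continue.
          TranslitEncodingError -> ('\xFFFD' :) <$> go (i + 1) bs
```

Because the failure is an ordinary Left value, a function of this shape
stays pure, and an IO wrapper can still turn the Left into an exception
where that is wanted.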