Proposal #3455: Add a setting to change how Unicode encoding errors are handled

Mon Aug 31 19:29:51 EDT 2009

On Tue, Aug 25, 2009 at 5:10 AM, Simon Marlow <marlowsd at gmail.com> wrote:
> On 23/08/2009 17:22, Judah Jacobson wrote:
>>
>> I propose that we augment ghc-6.12.1's support for Unicode Handles
>> by adding the following functions to System.IO:
>>
>> hSetOnEncodingError :: Handle -> OnEncodingError -> IO ()
>> hGetOnEncodingError :: Handle -> IO OnEncodingError
>>
>> as well as the enumeration `OnEncodingError` with three constructors:
>>
>>  - `ThrowEncodingError`: Throw an exception at the first encoding or
>>    decoding error.
>>  - `SkipEncodingError`: Skip all invalid bytes or characters.
>>  - `TranslitEncodingError`: Replace undecodable bytes with U+FFFD, and
>>    unencodable characters with '?'.
>>
>> I have implemented this functionality in a patch attached to the
>> ticket.  Haddock docs are here:
>> http://code.haskell.org/~judah/new-io-docs/System-IO.html#23
>>
>>
>> The choice of error handler is orthogonal to the choice of encoder.
>> Additionally, the same setting is used for both read and write modes.  For
>> portability, the handlers are written in pure Haskell rather than using
>> GNU iconv's //TRANSLIT feature.
>>
>> Note that the text package, for example, provides more sophisticated
>> error-handling options.  However, I think the above choices are useful
>> enough without making the API too complicated.
>
> I replied on the ticket, reproduced here for readers of libraries@:
>
> It looks like the main question here is whether the IOError should be
> returned explicitly (as in your patch), or whether we should just catch the
> exception. All things being equal, catching the exception would be simpler,
> as it wouldn't require any changes in the codecs. Is there a reason why you
> didn't do it that way? Perhaps because you want to be sure that the
> exception is really an encoding error, and not some other kind of exception?
> If that's the case, then we should introduce a new exception for encoding
> errors (that's probably a good idea anyway).

I agree that we should create a new exception type.  Given the errors
currently thrown by the library, I assume that it doesn't need to be
anything more than a newtype wrapping a String message.
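
As a rough sketch of what that could look like (the name `EncodingError`
and the toy ASCII decoder are assumptions made for illustration, not
what the patch or the library actually defines), a dedicated exception
type lets a caller catch encoding failures without swallowing unrelated
exceptions:

```haskell
import Control.Exception (Exception, throwIO, try)
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical exception type: a newtype around the error message.
-- (Compilers of the GHC 6.12 era would also need a Typeable instance.)
newtype EncodingError = EncodingError String
  deriving (Show, Eq)

instance Exception EncodingError

-- Toy stand-in for a real codec: accepts 7-bit ASCII only and throws
-- EncodingError on the first byte outside that range.
decodeAscii :: [Word8] -> IO String
decodeAscii = mapM step
  where
    step b
      | b < 0x80  = return (chr (fromIntegral b))
      | otherwise = throwIO (EncodingError ("invalid byte " ++ show b))

-- Catches only EncodingError; any other exception still propagates.
tryDecode :: [Word8] -> IO (Either EncodingError String)
tryDecode = try . decodeAscii
```

Because `try` is specialised to `EncodingError` here, an unrelated
IOError raised by the underlying Handle would not be mistaken for an
encoding failure.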

If the text package and ghc's IO library are merged into a new system,
then it would probably be better to explicitly return the error --
that way we can have pure ByteString <-> Text conversion functions.
But for the current state of the library (where the encoding type is
only exposed under GHC.* and makes few stability promises) it probably
doesn't make a big difference.
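
To illustrate the explicit-return style, here is a minimal pure sketch
that applies the three proposed policies in a toy ASCII codec.  Only
the `OnEncodingError` constructors come from the proposal; the
functions `decodeAsciiWith` and `encodeAsciiWith`, the use of lists
instead of ByteString/Text, and the `Either String` error type are
assumptions for the example:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- The three policies named in the proposal.
data OnEncodingError
  = ThrowEncodingError
  | SkipEncodingError
  | TranslitEncodingError
  deriving (Show, Eq)

-- Hypothetical pure decoder: the error is returned as a Left value
-- instead of being thrown, so no IO is involved.
decodeAsciiWith :: OnEncodingError -> [Word8] -> Either String String
decodeAsciiWith policy = go
  where
    go [] = Right []
    go (b:bs)
      | b < 0x80  = (chr (fromIntegral b) :) <$> go bs
      | otherwise = case policy of
          ThrowEncodingError    -> Left ("invalid byte " ++ show b)
          SkipEncodingError     -> go bs
          TranslitEncodingError -> ('\xFFFD' :) <$> go bs

-- Hypothetical pure encoder: unencodable characters become '?' under
-- the transliteration policy.
encodeAsciiWith :: OnEncodingError -> String -> Either String [Word8]
encodeAsciiWith policy = go
  where
    go [] = Right []
    go (c:cs)
      | ord c < 0x80 = (fromIntegral (ord c) :) <$> go cs
      | otherwise    = case policy of
          ThrowEncodingError    -> Left ("unencodable character " ++ show c)
          SkipEncodingError     -> go cs
          TranslitEncodingError -> (fromIntegral (ord '?') :) <$> go cs
```

An IO layer could then turn a `Left` into an exception, which is
roughly the trade-off discussed above.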

-Judah