Proposal #3455: Add a setting to change how Unicode encoding
errors are handled
ekmett at gmail.com
Mon Aug 31 19:59:32 EDT 2009
On Mon, Aug 31, 2009 at 7:29 PM, Judah Jacobson <judah.jacobson at gmail.com> wrote:
> On Tue, Aug 25, 2009 at 5:10 AM, Simon Marlow<marlowsd at gmail.com> wrote:
> > On 23/08/2009 17:22, Judah Jacobson wrote:
> >> I propose that we augment ghc-6.12.1's support for Unicode Handles
> >> by adding the following functions to System.IO:
> >> hSetOnEncodingError :: Handle -> OnEncodingError -> IO ()
> >> hGetOnEncodingError :: Handle -> IO OnEncodingError
> >> as well as the enumeration `OnEncodingError` with three constructors:
> >> - `ThrowEncodingError`: Throw an exception at the first encoding or
> >> decoding error.
> >> - `SkipEncodingError`: Skip all invalid bytes or characters.
> >> - `TranslitEncodingError`: Replace undecodable bytes with U+FFFD, and
> >> unencodable characters with '?'.
As a brief, possibly irrelevant aside:

There is one other option for handling Unicode encoding and decoding
errors that I've used and seen used.

It is the basis of Markus Kuhn's "UTF-8B" encoding: each byte that fails
to parse is read as the code point 0xdc00 + the raw byte, and when you go
to emit such a code point you write the original byte directly into the
stream. Only bytes 0x80-0xff can fail to parse, so the escapes occupy
0xdc80-0xdcff. This permits a perfect round trip from UTF-8 to String to
UTF-8, regardless of encoding errors.

The code points 0xdc80-0xdcff don't conflict with UTF-16, because they lie
in the unmapped d800-dfff (surrogate) range, and ISO 10646-1 section R.4
notes that the mapping of those code positions in UTF-8 is undefined, so
an implementation is free to do with them as it pleases.

The main benefit of this representation is that no information is
discarded. It doesn't hurt that it also sidesteps the other uses of the
d800-dfff range, such as the illegal Oracle-style "CESU-8" encoding of
surrogate pairs.
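
To make the round trip concrete, here is a minimal sketch of the scheme
over plain byte lists, using only base. The names `decodeUtf8b`,
`encodeUtf8b` and `escapeByte` are made up for this illustration and are
not part of System.IO or of the proposal; a real implementation would live
in the Handle's codec rather than work on [Word8].

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: escape an unparseable byte as 0xdc00 + byte.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xDC00 + fromIntegral b)

-- Decode UTF-8, escaping every byte that is not part of a well-formed
-- sequence instead of failing or dropping it.
decodeUtf8b :: [Word8] -> String
decodeUtf8b [] = []
decodeUtf8b (b:bs)
  | b < 0x80               = chr (fromIntegral b) : decodeUtf8b bs
  | b >= 0xC2 && b <= 0xDF = multi 1 0x80    (fromIntegral b .&. 0x1F)
  | b >= 0xE0 && b <= 0xEF = multi 2 0x800   (fromIntegral b .&. 0x0F)
  | b >= 0xF0 && b <= 0xF4 = multi 3 0x10000 (fromIntegral b .&. 0x07)
  | otherwise              = escapeByte b : decodeUtf8b bs
  where
    -- n continuation bytes expected; lo is the smallest code point that
    -- may use this length (rejects overlong forms); hd holds the lead bits.
    multi n lo hd =
      let (conts, rest) = splitAt n bs
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F))
                     hd conts
          ok = length conts == n
               && all isCont conts
               && cp >= lo && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if ok
           then chr cp : decodeUtf8b rest
           else escapeByte b : decodeUtf8b bs  -- escape lead byte only, resync
    isCont c = c .&. 0xC0 == 0x80

-- Encode to UTF-8, writing escaped code points back out as raw bytes.
encodeUtf8b :: String -> [Word8]
encodeUtf8b = concatMap enc
  where
    enc c
      | n >= 0xDC80 && n <= 0xDCFF = [fromIntegral (n - 0xDC00)]
      | n < 0x80                   = [fromIntegral n]
      | n < 0x800                  = [0xC0 .|. hi 6, cont 0]
      | n < 0x10000                = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise                  = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        n      = ord c
        hi s   = fromIntegral (n `shiftR` s)
        cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)
```

Because the decoder refuses to accept UTF-8 encoded surrogates and escapes
their bytes instead, a decoded String never contains 0xdc80-0xdcff except
as an escape, so `encodeUtf8b . decodeUtf8b` should return the input bytes
unchanged.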