Proposal #3337: expose Unicode and newline translation from System.IO
Simon Marlow
marlowsd at gmail.com
Tue Jul 14 09:19:35 EDT 2009
On 02/07/2009 23:04, Judah Jacobson wrote:
> 1) It would be good to have an hGetEncoding function, so that we can
> temporarily set the encoding of a Handle like stdin without affecting
> the rest of the program.
I have added this, but it might not behave exactly as you want.

    hGetEncoding :: Handle -> IO (Maybe TextEncoding)

The issue is saving and restoring of the codec state. A TextEncoding is
a factory that makes new codec instances; it has no state. However, the
codec in use on a Handle does have a state. So if you save and restore
the encoding, you get a fresh codec and lose the state. For example, in
UTF-16 you'll get a new BOM in the output.
You might or might not want to save and restore the state; I can imagine
both possibilities being useful. For now, however, I propose we provide
the non-state-saving version, clearly documented as such.
Providing a state-saving version would need a new type to represent the
codec + state, incidentally.
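
As a rough sketch of how the non-state-saving version might be used to
set an encoding temporarily: the helper name withEncoding below is
hypothetical and not part of System.IO; it only combines hGetEncoding
with the existing hSetEncoding and hSetBinaryMode. It assumes that a
result of Nothing means the Handle is in binary mode.

```haskell
import Control.Exception (bracket)
import System.IO

-- | Hypothetical helper (not provided by System.IO): run an action with a
-- temporary encoding on a Handle, then put the previous encoding back.
withEncoding :: Handle -> TextEncoding -> IO a -> IO a
withEncoding h enc action =
  bracket (hGetEncoding h) restore (\_ -> hSetEncoding h enc >> action)
  where
    -- Restoring creates a fresh codec from the TextEncoding, so any codec
    -- state is lost (e.g. UTF-16 output would get a new BOM).
    restore (Just old) = hSetEncoding h old
    -- Assumption: Nothing means the Handle was in binary mode.
    restore Nothing    = hSetBinaryMode h True
```

Using bracket means the old encoding is put back even if the action
throws, but the codec state caveat above still applies.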
> 2) It looks like your API always throws an error on invalid input; it
> would be great if there were some way to customize this behavior.
> Nothing complicated, maybe just an enum which specifies one of the
> following behaviors:
>
> - throw an error
> - ignore (i.e., drop) invalid bytes/Chars
> - replace undecodable bytes with u+FFFD and unencodable Chars with '?'
>
> My preference for the API change would be to add a function in
> GHC.IO.Encoding.Iconv; for example,
>
> mkTextEncodingError :: String -> ErrorHandling -> IO TextEncoding
>
> since this is similar to how GHC.IO.Encoding.Latin1 allows error
> handling by providing latin1 and latin1_checked as separate encoders.
>
> Any more complicated behavior is probably best handled by something
> like the text package.
Note that if you're using GNU iconv, you can say

    mkTextEncoding "UTF-8//IGNORE"

to get the version that silently drops illegal characters (there's also
"//TRANSLIT", which tries to find an alternative for an illegal
character). This is not portable, so I can't provide it as a general
facility in GHC.IO.Encoding.Iconv.

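
Since the suffix may not be recognised everywhere, a caller could try it
and fall back to the plain encoding. This is only a sketch: the name
lenientEncoding is hypothetical, and it assumes that mkTextEncoding
reports an unsupported encoding name by throwing an IOException.

```haskell
import Control.Exception (IOException, try)
import System.IO

-- | Hypothetical helper: prefer the "//IGNORE" variant of an encoding,
-- falling back to the strict encoding if the suffix is not supported.
lenientEncoding :: String -> IO TextEncoding
lenientEncoding name = do
  r <- try (mkTextEncoding (name ++ "//IGNORE"))
         :: IO (Either IOException TextEncoding)
  case r of
    Right enc -> return enc            -- suffix accepted
    Left _    -> mkTextEncoding name   -- assumption: failure means "unsupported"
```

With the fallback, invalid input still raises an error on platforms
where the suffix is unavailable, so callers cannot rely on the lenient
behaviour.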
Cheers,
Simon