Proposal #3337: expose Unicode and newline translation from System.IO

Simon Marlow marlowsd at gmail.com
Tue Jul 14 09:19:35 EDT 2009


On 02/07/2009 23:04, Judah Jacobson wrote:

> 1) It would be good to have an hGetEncoding function, so that we can
> temporarily set the encoding of a Handle like stdin without affecting
> the rest of the program.

I have added this, but it might not behave exactly as you want.

hGetEncoding :: Handle -> IO (Maybe TextEncoding)

The issue is saving and restoring of the codec state.  A TextEncoding is 
a factory that makes new codec instances; it has no state.  However, the 
codec in use on a Handle does have a state.  So if you save the encoding 
and later restore it, the Handle gets a fresh codec and the old state is 
lost.  For example, in UTF-16 you'll get a new BOM in the output.

You might or might not want to save and restore the state; I can imagine 
both possibilities being useful.  For now, however, I propose we provide 
the non-state-saving version, clearly documented as such.

Providing a state-saving version would need a new type to represent the 
codec + state, incidentally.
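
To make the non-state-saving behaviour concrete, here is a minimal sketch 
of the "temporarily set the encoding" use case.  The helper name 
withEncoding is hypothetical and not part of the proposal; it only uses 
hGetEncoding, hSetEncoding and hSetBinaryMode from System.IO, and it 
assumes that a result of Nothing means the Handle was in binary mode.

```haskell
import Control.Exception (bracket)
import System.IO

-- Hypothetical helper (not part of the proposal): run an action with a
-- different encoding on a Handle, then reinstate the previous encoding.
-- Only the TextEncoding is saved, so the codec state is NOT preserved.
withEncoding :: Handle -> TextEncoding -> IO a -> IO a
withEncoding h enc action =
  bracket (hGetEncoding h) restore (\_ -> hSetEncoding h enc >> action)
  where
    -- Restoring builds a fresh codec from the saved encoding.
    restore (Just old) = hSetEncoding h old
    -- Assumption: Nothing means the Handle was in binary mode.
    restore Nothing    = hSetBinaryMode h True
```

Something like withEncoding stdin utf8 getLine would then read one line 
as UTF-8.  Because the restore step starts from a fresh codec, a stateful 
encoding such as UTF-16 would emit a new BOM afterwards, as described 
above.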

> 2) It looks like your API always throws an error on invalid input; it
> would be great if there were some way to customize this behavior.
> Nothing complicated, maybe just an enum which specifies one of the
> following behaviors:
>
> - throw an error
> - ignore (i.e., drop) invalid bytes/Chars
> - replace undecodable bytes with u+FFFD and unencodable Chars with '?'
>
> My preference for the API change would be to add a function in
> GHC.IO.Encoding.Iconv; for example,
>
> mkTextEncodingError :: String ->  ErrorHandling ->  IO TextEncoding
>
> since this is similar to how GHC.IO.Encoding.Latin1 allows error
> handling by providing latin1 and  latin1_checked as separate encoders.
>
> Any more complicated behavior is probably best handled by something
> like the text package.

Note that if you're using GNU iconv, you can say

   mkTextEncoding "UTF-8//IGNORE"

to get the version that silently drops illegal characters (there's also 
"//TRANSLIT", which tries to find an alternative for an illegal 
character).  This is not portable, so I can't provide it as a general 
facility in GHC.IO.Encoding.Iconv.
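
Since the suffix is not portable, a caller would probably want a 
fallback.  The following is only a sketch: the helper name lenientUtf8 is 
hypothetical, and it assumes that mkTextEncoding reports an unrecognised 
encoding name by throwing an IOException, in which case the strict 
built-in utf8 is used instead.

```haskell
import Control.Exception (IOException, try)
import System.IO

-- Hypothetical helper: ask for a UTF-8 codec that drops illegal
-- characters, falling back to the strict built-in utf8 if the
-- "//IGNORE" name is rejected on this platform.
lenientUtf8 :: IO TextEncoding
lenientUtf8 = do
  r <- try (mkTextEncoding "UTF-8//IGNORE")
         :: IO (Either IOException TextEncoding)
  return (either (const utf8) id r)
```

The result can be passed to hSetEncoding like any other TextEncoding.  
Note that with the fallback, invalid input raises an error instead of 
being dropped, so the behaviour differs between platforms.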

Cheers,
	Simon

