[Haskell-cafe] UTF-8 BOM

Fri Jan 7 13:57:36 CET 2011

On 06/01/2011 05:44, Mark Lentczner wrote:
> On Jan 4, 2011, at 5:41 PM, Antoine Latter wrote:
>
>> Are you thinking that the BOM should be automatically stripped
>> from UTF8 text at some low level, if present?
>
> It should not. Whether or not a U+FEFF can be stripped depends on
> the context in which it is found. There is no way that lower level
> code, even file primitives, can know this context.
>
>> I'm thinking that it would be correct behavior to drop the BOM
>> from the start of a UTF8 stream, even at a pretty low level. The
>> FAQ seems to allow it as a means of identifying the stream as UTF8
>> (although it isn't a reliable means of identifying a stream as
>> UTF8).
>
> §3.9 and §3.10 of the Unicode standard go into more depth on the
> issue and make things clearer. A leading U+FEFF is considered "not
> part of the text", and dropped, only in the case that the encoding is
> UTF-16 or UTF-32. In all other cases (including the -BE and -LE
> variants of UTF-16 and UTF-32) the U+FEFF character is retained.
>
> The FAQ states that a leading byte sequence of EF BB BF in a stream
> indicates that the stream is UTF-8, though it doesn't go so far as to
> say that it can be stripped. Since Unicode doesn't want to encourage
> the use of BOM in UTF-8 (see end of §3.10), I imagine they don't want
> to promulgate it as a useful encoding indicator.
>
> So, it might be reasonable that when opening a file in UTF-16 mode
> (not UTF-16BE or UTF-16LE), the system should read the initial
> bytes, determine the byte order, and remove the BOM if present[1].
> But it isn't safe or correct to do this for UTF-8.

This is exactly what the built-in System.IO.utf16 codec does. There's
also a utf8_bom codec, which behaves like utf8 except that it strips an
optional leading BOM when reading and emits a BOM when writing.
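
As a rough illustration of the difference, here is a minimal sketch that
writes a file with utf8_bom and reads it back with both codecs. The helper
names writeWith and readWith and the scratch file name bom-demo.txt are
made up for this example; only hSetEncoding, utf8 and utf8_bom come from
System.IO.

```haskell
import System.IO

-- Hypothetical helper: write a string to a file using the given encoding.
writeWith :: TextEncoding -> FilePath -> String -> IO ()
writeWith enc path s = withFile path WriteMode $ \h -> do
  hSetEncoding h enc
  hPutStr h s

-- Hypothetical helper: read a whole file using the given encoding.
-- The contents are forced before withFile closes the handle.
readWith :: TextEncoding -> FilePath -> IO String
readWith enc path = withFile path ReadMode $ \h -> do
  hSetEncoding h enc
  s <- hGetContents h
  length s `seq` return s

main :: IO ()
main = do
  let path = "bom-demo.txt"  -- assumed scratch file in the current directory
  -- utf8_bom emits a BOM (EF BB BF) before the text when writing.
  writeWith utf8_bom path "hello\n"
  -- utf8_bom drops a leading BOM when reading, so the text round-trips.
  viaBom <- readWith utf8_bom path
  -- Plain utf8 does no stripping, so the BOM shows up as a leading U+FEFF.
  viaPlain <- readWith utf8 path
  print (viaBom == "hello\n")
  print (take 1 viaPlain == "\xFEFF")
```

With the behaviour described above, both comparisons should come out True:
the caller who asks for utf8_bom never sees the BOM, while the caller who
asks for plain utf8 gets the U+FEFF and has to decide what to do with it.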

Cheers,
	Simon