[Haskell-cafe] Encoding of Haskell source files

Mon Apr 4 17:51:41 CEST 2011

malcolm.wallace wrote:
>> BOM is not part of UTF8, because UTF8 is byte-oriented.  But applications
>> should be prepared to read and discard it, because some applications
>> erroneously generate it.

For maximum portability, the standard should be require compilers
to accept and discard an optional BOM as the first character of a
source code file.

Tako Schotanus wrote:
> That's not what the official unicode site says in its FAQ:
> http://unicode.org/faq/utf_bom.html#bom4 and http://unicode.org/faq/utf_bom.html#bom5

That FAQ clearly states that BOM is part of some "protocols".
It carefully avoids stating whether it is part of the encoding.

It is certainly not erroneous to include the BOM
if it is part of the protocol for the applications being used.
Applications can include whatever characters they'd like, and
they can use whatever handshake mechanism they'd like to
agree upon an encoding. The BOM mechanism is common
on the Windows platform. It has since appeared in other
places as well, but it is certainly not universally adopted.

Python supports a pseudo-encoding called "utf8-bom" that
automatically generates and discards the BOM in support
of that handshake mechanism But it isn't really an encoding,
it's a convenience.

Part of the source of all this confusion is some documentation
that appeared in the past on Microsoft's site which was unclear
about the fact that the BOM handshake is a protocol adopted
by Microsoft, not a part of the encoding itself. Some people
claim that this was intentional, part of the "extend and embrace"
tactic Microsoft allegedly employed in those days in an effort
to expand its monopoly.

The wording of the Unicode FAQ is obviously trying to tip-toe
diplomatically around this issue without arousing the ire of
either pro-Microsoft or anti-Microsoft developers.

Thanks,
Yitz