[Haskell-cafe] Re: UTF-8 BOM, really!?

Mon Jan 31 06:53:20 EST 2005

Graham Klyne <GK at ninebynine.org> writes:

> How can it make sense to have a BOM in UTF-8?  UTF-8 is a sequence of
> octets (bytes);  what ordering is there here that can sensibly be
> varied?

The *name* "BOM" doesn't make sense when applied to UTF-8, but some
software uses the UTF-8 encoding of U+FEFF (the bytes EF BB BF) as a
marker that the file is encoded in UTF-8 rather than some other
encoding. And Unicode seems to support this usage, even if it doesn't
recommend it.

The only software I know of that does this is Microsoft Notepad, and
I suspect other Microsoft tools do too (Notepad assumes UTF-8 with the
marker and the current Windows codepage without). The HTML at
http://www.microsoft.com/ begins with a BOM, but other pages linked
from there do not.
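
The rule in the parenthesis above can be sketched in a few lines of
Haskell. This only mirrors the behaviour as I described it, not
Notepad's actual implementation; the type GuessedEncoding and the
names bomBytes and guessEncoding are made up for this sketch.

```haskell
import qualified Data.ByteString as B

-- Hypothetical result type for this sketch; not from any library.
data GuessedEncoding = Utf8WithMarker | LocalCodepage
  deriving (Eq, Show)

-- The UTF-8 encoding of U+FEFF.
bomBytes :: B.ByteString
bomBytes = B.pack [0xEF, 0xBB, 0xBF]

-- Assumed rule: a leading marker means UTF-8, anything else falls
-- back to the local codepage. No other heuristics are attempted.
guessEncoding :: B.ByteString -> GuessedEncoding
guessEncoding contents
  | bomBytes `B.isPrefixOf` contents = Utf8WithMarker
  | otherwise                        = LocalCodepage
```

Note that under this rule a valid UTF-8 file without the marker is
read in the local codepage, which is exactly the case where Notepad
and the Unix tools disagree.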

I think XML used to be silent about this, but later got amended to
explicitly say that optional U+FEFF at the beginning is allowed and
not treated as a part of document contents.

OTOH various other software, in particular generic Unix tools, doesn't
treat the UTF-8 BOM specially, and de facto implements the
"non-standard" UTF-8 without a BOM.

Technically in UTF-16/32 the BOM is handled in the translation between
the encoding form (a sequence of 16- or 32-bit code units) and the
encoding scheme (these code units serialized into bytes). I think it's
supposed to be the same in UTF-8, i.e. the analogous translation is
*almost* trivial - it translates bytes to the same bytes - except that
an initial BOM must be stripped on decoding, and one must be added on
encoding when the first character of the contents is U+FEFF (and
optionally in other cases). I mean that it is supposed to happen on
decoding UTF-8 on the level of bytes, not after decoding on the level
of code points.
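
To make the byte-level translation concrete, here is a minimal sketch
of how I read it, not a definitive implementation. The names utf8Bom,
stripBom and addBomIfNeeded are made up for illustration, and the
encoder takes only the mandatory case (contents that already begin
with U+FEFF), never the optional one.

```haskell
import qualified Data.ByteString as B

-- The UTF-8 encoding of U+FEFF.
utf8Bom :: B.ByteString
utf8Bom = B.pack [0xEF, 0xBB, 0xBF]

-- Decoding side: drop exactly one leading BOM, if present, before
-- the bytes are handed to the UTF-8 decoder proper.
stripBom :: B.ByteString -> B.ByteString
stripBom bytes
  | utf8Bom `B.isPrefixOf` bytes = B.drop (B.length utf8Bom) bytes
  | otherwise                    = bytes

-- Encoding side: if the already-encoded contents begin with U+FEFF,
-- prepend a BOM so that the decoder strips the added one and leaves
-- the real character alone. Otherwise pass the bytes through.
addBomIfNeeded :: B.ByteString -> B.ByteString
addBomIfNeeded bytes
  | utf8Bom `B.isPrefixOf` bytes = utf8Bom `B.append` bytes
  | otherwise                    = bytes
```

With these two definitions stripBom (addBomIfNeeded x) gives back x
for any x, which is the point of adding the BOM in the U+FEFF case.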

Anyway, on Unix it just doesn't happen at all, except in software
which explicitly handles it. iconv() doesn't handle UTF-8 BOM.

If I could decide about it, I would ban the UTF-8 BOM altogether. But
perhaps the Unicode Consortium can at least be persuaded to recognize
that some software doesn't accept a BOM in UTF-8, and that such
software could be considered conforming to a variant of UTF-8 without
the BOM rather than non-conforming altogether.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/