[Haskell-cafe] UTF-8 BOM
Mark Lentczner
markl at glyphic.com
Thu Jan 6 06:44:00 CET 2011
On Jan 4, 2011, at 5:41 PM, Antoine Latter wrote:
> Are you thinking that the BOM should be automatically stripped from
> UTF8 text at some low level, if present?
It should not. Whether or not a U+FEFF can be stripped depends on the context in which it is found. There is no way that lower-level code, even file primitives, can know this context.
> I'm thinking that it would be correct behavior to drop the BOM from
> the start of a UTF8 stream, even at a pretty low level. The FAQ seems
> to allow it as a means of identifying the stream as UTF8 (although it
> isn't a reliable means of identifying a stream as UTF8).
§3.9 and §3.10 of the Unicode standard go into more depth on the issue and make things clearer. A leading U+FEFF is considered "not part of the text", and dropped, only in the case that the encoding is UTF-16 or UTF-32. In all other cases (including the -BE and -LE variants of UTF-16 and UTF-32) the U+FEFF character is retained.
The FAQ states that a leading byte sequence of EF BB BF in a stream indicates that the stream is UTF-8, though it doesn't go so far as to say that it can be stripped. Since Unicode doesn't want to encourage the use of BOM in UTF-8 (see end of §3.10), I imagine they don't want to promulgate it as a useful encoding indicator.
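As a minimal sketch of that distinction, the check below only reports whether a byte stream begins with EF BB BF and leaves the bytes untouched. The name hasUtf8Signature is made up for illustration and is not from any library.

```haskell
import Data.Word (Word8)

-- | Hypothetical helper: True when the stream starts with EF BB BF,
-- the UTF-8 encoding of U+FEFF. It only inspects the bytes and never
-- removes them; whether to strip is left to the caller.
hasUtf8Signature :: [Word8] -> Bool
hasUtf8Signature (0xEF : 0xBB : 0xBF : _) = True
hasUtf8Signature _                        = False
```

Detection and stripping are kept separate on purpose: the first is a statement about the bytes, the second a decision about the text.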
So it might be reasonable, when opening a file in UTF-16 mode (not UTF-16BE or UTF-16LE), for the system to read the initial bytes, determine the byte order, and remove the BOM if present[1]. But it isn't safe or correct to do this for UTF-8.
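A rough sketch of what that byte-order step could look like for the plain UTF-16 scheme follows. The names Utf16Order and utf16ByteOrder are invented for illustration, and the big-endian fallback when no BOM is present is my reading of §3.10; a higher-level protocol could specify otherwise.

```haskell
import Data.Word (Word8)

-- | Hypothetical type naming the two possible byte orders.
data Utf16Order = BigEndian | LittleEndian
  deriving (Eq, Show)

-- | Inspect the first two bytes of a stream declared as plain UTF-16.
-- FE FF selects big-endian and FF FE selects little-endian; in both
-- cases the BOM is consumed. With no BOM, assume big-endian and keep
-- every byte.
utf16ByteOrder :: [Word8] -> (Utf16Order, [Word8])
utf16ByteOrder (0xFE : 0xFF : rest) = (BigEndian, rest)
utf16ByteOrder (0xFF : 0xFE : rest) = (LittleEndian, rest)
utf16ByteOrder bytes                = (BigEndian, bytes)
```

For UTF-16BE or UTF-16LE no such function would be applied at all: the byte order is already known, so a leading U+FEFF stays in the text.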
On Jan 4, 2011, at 5:08 PM, Tony Morris wrote:
> I am reading files with System.IO.readFile. Some of these files start
> with a UTF-8 Byte Order Marker (0xef 0xbb 0xbf). For some functions that
> process this String, this causes choking so I drop the BOM as shown
> below.
If you mean functions in the standard libs, they shouldn't have any problems with the BOM character. If they do, these are bugs.
On the other hand, if you know the context of the files, and know for certain that the leading BOM is intended only as an encoding indicator, then by all means strip it off. But only you can know if this is true for your application; the system cannot. If so, your code doesn't look hackish to me at all. I'd only perhaps tidy up dropBOM a bit (but this is purely a stylistic choice):
    readBomFile :: FilePath -> IO String
    readBomFile p = dropBom `fmap` readFile p
      where
        dropBom ('\xfeff':s) = s  -- U+FEFF at the start is a BOM
        dropBom s            = s
I'd keep dropBom private to readBomFile to ensure that it isn't used on arbitrary strings, since it is really only valid at the start of an encoded stream.
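If the files are known to be UTF-8 and the BOM is known to be only a signature, another option is to make that choice explicit in the handle's encoding. The sketch below uses utf8_bom from System.IO, which ignores a BOM at the beginning of the stream on input. The name readUtf8SkippingBom is made up for illustration, and this is an alternative to the code above, not what the original poster wrote.

```haskell
import System.IO

-- | Hypothetical helper: read a file as UTF-8, letting the handle's
-- encoding skip a leading BOM if one is present.
readUtf8SkippingBom :: FilePath -> IO String
readUtf8SkippingBom p = do
  h <- openFile p ReadMode
  hSetEncoding h utf8_bom      -- opt in to BOM skipping for this handle only
  s <- hGetContents h
  length s `seq` hClose h      -- force the lazy read before closing
  return s
```

The decision to strip is still made by the application, per file, which is the point: nothing here changes what readFile does by default.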
- Mark
[1] Software reading a single text stream that has been split across files would have a problem here. But this is perhaps an obscure and unlikely case.