[Haskell-cafe] UTF-8 BOM
aslatter at gmail.com
Wed Jan 5 02:41:36 CET 2011
On Tue, Jan 4, 2011 at 7:08 PM, Tony Morris <tonymorris at gmail.com> wrote:
> I am reading files with System.IO.readFile. Some of these files start
> with a UTF-8 Byte Order Marker (0xef 0xbb 0xbf). For some functions that
> process this String, this causes choking so I drop the BOM as shown
> below. This feels particularly hacky, but I am not in control of many of
> these functions (that perhaps could use ByteString with a better solution).
> I'm wondering if there is a better way of achieving this goal. Thanks
> for any tips.
>
> dropBOM ::
>   String
>   -> String
> dropBOM [] =
>   []
> dropBOM s@(x:xs) =
>   let unicodeMarker = '\65279' -- UTF-8 BOM, decoded to U+FEFF
>   in if x == unicodeMarker then xs else s
>
> readBOMFile ::
>   FilePath
>   -> IO String
> readBOMFile p =
>   dropBOM `fmap` readFile p
Are you thinking that the BOM should be automatically stripped from
UTF-8 text at some low level, if present?

I was thinking about it, and I was deeply conflicted about the idea.
Then I read the unicode.org BOM FAQ, and I'm still conflicted.

I'm thinking that it would be correct behavior to drop the BOM from
the start of a UTF-8 stream, even at a pretty low level. The FAQ seems
to allow the BOM as a means of identifying the stream as UTF-8 (although
it isn't a reliable means of identifying a stream as UTF-8).

But I'm no Unicode expert.
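
For what it's worth, here is a rough sketch of what stripping "at a low level" could look like with the existing handle API, as opposed to post-processing the String. It assumes your base library exports utf8_bom from System.IO (I believe recent GHC releases do, but check your version), and readFileSkippingBOM is a made-up name for illustration, not an existing library function. It also forces UTF-8 decoding rather than the locale encoding that readFile uses, which may or may not be what you want.

```haskell
import System.IO

-- Hypothetical helper (not part of any library): read a whole file as
-- UTF-8 and let the handle's decoder skip a leading BOM if one is present.
readFileSkippingBOM :: FilePath -> IO String
readFileSkippingBOM path = do
  h <- openFile path ReadMode
  -- utf8_bom decodes like utf8, but ignores a BOM at the start of the input.
  hSetEncoding h utf8_bom
  -- Lazy read; the handle is closed once the contents are fully consumed.
  hGetContents h
```

With this, files with and without a BOM should come back as the same String, so the downstream functions never see '\65279' at the front.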
> Tony Morris