[Haskell-cafe] Encoding of Haskell source files

Herbert Valerio Riedel hvr at gnu.org
Tue Apr 5 09:22:01 CEST 2011


On Mon, 2011-04-04 at 11:50 +0200, Roel van Dijk wrote:
> I am not aware of any algorithm that can reliably infer the character
> encoding used by just looking at the raw data. Why would people bother
> with stuff like <?xml version="1.0" encoding="UTF-8"?> if
> automatically figuring out the encoding was easy?

It is possible, if the syntax/grammar of the encoded content restricts
the set of allowed code-points in the first few characters.

For instance, valid JSON (see RFC 4627, section 3) requires the first two
characters to be plain ASCII code points, so which of the five BOM-less
UTF encodings (UTF-8, UTF-16BE/LE, UTF-32BE/LE) is in use is uniquely
determined by the pattern of zero bytes in the first 4 bytes of the
encoded stream.
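
As a rough illustration of that check, here is a minimal Haskell sketch
that classifies a byte string by the pattern of zero bytes in its first
four octets, following the table in RFC 4627, section 3. The names
Encoding and detectJsonEncoding are made up for this example, and it
assumes the bytestring package (which ships with GHC). It also assumes
the input carries no BOM and has at least four bytes, so a very short
text such as "[]" in UTF-8 yields Nothing.

```haskell
import qualified Data.ByteString as B

-- | The five BOM-less encodings distinguished in RFC 4627, section 3.
-- (Hypothetical type, introduced only for this sketch.)
data Encoding = UTF8 | UTF16BE | UTF16LE | UTF32BE | UTF32LE
  deriving (Eq, Show)

-- | Guess the encoding of a JSON text from which of its first four
-- bytes are zero. Relies on the first two characters being ASCII.
-- Returns Nothing if fewer than four bytes are available or if the
-- zero pattern matches none of the five cases.
detectJsonEncoding :: B.ByteString -> Maybe Encoding
detectJsonEncoding bs =
  case map (== 0) (B.unpack (B.take 4 bs)) of
    [True,  True,  True,  False] -> Just UTF32BE  -- 00 00 00 xx
    [True,  False, True,  False] -> Just UTF16BE  -- 00 xx 00 xx
    [False, True,  True,  True ] -> Just UTF32LE  -- xx 00 00 00
    [False, True,  False, True ] -> Just UTF16LE  -- xx 00 xx 00
    [False, False, False, False] -> Just UTF8     -- xx xx xx xx
    _                            -> Nothing
```

For example, a UTF-16LE text beginning with {" starts with the bytes
7B 00 22 00, so detectJsonEncoding (B.pack [0x7B,0x00,0x22,0x00])
evaluates to Just UTF16LE.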




