Proposal: Define UTF-8 to be the encoding of Haskell source files

Tue Apr 5 20:29:00 CEST 2011

Roel van Dijk wrote:
> I propose to make UTF-8 the only allowed encoding for Haskell source
> files. Implementations must discard an initial Byte Order Mark (BOM)
> if present.

I am in favor of this proposal.

However, you wrote:

> "GHC assumes that source files are ASCII or UTF-8 only, other
> encodings are not recognised. However, invalid UTF-8 sequences will be
> ignored in comments, so it is possible to use other encodings such as
> Latin-1, as long as the non-comment source code is ASCII only." [4]
>
> From this I deduce that all current code accepted by GHC is compatible
> with UTF-8. No working code will be broken.

Not quite. If GHC is changed to conform to this proposal, source
files that contain invalid UTF-8 in comments, which compile
successfully today, will be rejected.
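
To make the breakage concrete, here is a minimal sketch of a strict
well-formedness check, assuming the source is handled as a plain list
of bytes. The names isValidUtf8, ascii, latin1Comment and utf8Comment
are made up for this illustration and are not part of GHC; the byte
ranges follow the Unicode standard's table of well-formed UTF-8
sequences.

```haskell
import Data.Word (Word8)

-- | True when the byte list is well-formed UTF-8.
-- Hypothetical helper for illustration; not GHC's decoder.
isValidUtf8 :: [Word8] -> Bool
isValidUtf8 [] = True
isValidUtf8 (b : rest)
  | b <= 0x7F              = isValidUtf8 rest          -- ASCII
  | b >= 0xC2 && b <= 0xDF = cont 1 0x80 0xBF rest     -- 2-byte sequence
  | b == 0xE0              = cont 2 0xA0 0xBF rest     -- no overlong forms
  | b == 0xED              = cont 2 0x80 0x9F rest     -- no surrogates
  | b >= 0xE1 && b <= 0xEF = cont 2 0x80 0xBF rest     -- other 3-byte
  | b == 0xF0              = cont 3 0x90 0xBF rest     -- no overlong forms
  | b >= 0xF1 && b <= 0xF3 = cont 3 0x80 0xBF rest
  | b == 0xF4              = cont 3 0x80 0x8F rest     -- at most U+10FFFF
  | otherwise              = False                     -- invalid lead byte
  where
    -- Expect n continuation bytes; the first must lie in [lo, hi],
    -- the remaining ones in [0x80, 0xBF].
    cont :: Int -> Word8 -> Word8 -> [Word8] -> Bool
    cont n lo hi bytes =
      case splitAt n bytes of
        (c : cs, after)
          | length cs == n - 1
          , c >= lo && c <= hi
          , all (\x -> x >= 0x80 && x <= 0xBF) cs -> isValidUtf8 after
        _ -> False

-- | Encode an ASCII-only string as bytes.
ascii :: String -> [Word8]
ascii = map (fromIntegral . fromEnum)

-- The comment "-- café" with é as the single Latin-1 byte 0xE9 ...
latin1Comment :: [Word8]
latin1Comment = ascii "-- caf" ++ [0xE9] ++ ascii "\n"

-- ... and with é as the two-byte UTF-8 sequence 0xC3 0xA9.
utf8Comment :: [Word8]
utf8Comment = ascii "-- caf" ++ [0xC3, 0xA9] ++ ascii "\n"
```

Under this check the UTF-8 version of the comment is accepted and the
Latin-1 version is not, even though the non-comment code in both files
is pure ASCII.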

But anyway I think allowing invalid UTF-8 in comments is a
mistake. It could lead to the end of the comment being detected
in the wrong place, thus changing the meaning of the program in
very unexpected ways. Not likely, but possible.
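
A minimal sketch of how that could happen, assuming a hypothetical
lenient decoder that trusts the lead byte for the sequence length and
does not validate the continuation bytes. This is not a description of
GHC's actual lexer; seqLen, commentEnd, ascii and sample are names
invented for the example, and comment nesting is ignored.

```haskell
import Data.Word (Word8)

-- | Length of a character judged by its lead byte alone, WITHOUT
-- checking the bytes that follow (the hypothetical lenient behaviour).
seqLen :: Word8 -> Int
seqLen b
  | b < 0x80  = 1
  | b >= 0xF0 = 4
  | b >= 0xE0 = 3
  | b >= 0xC0 = 2
  | otherwise = 1  -- stray continuation byte: skip just this byte

-- | Offset just past the first "-}" found while advancing by the given
-- step function, or Nothing if no terminator is seen.
commentEnd :: (Word8 -> Int) -> [Word8] -> Maybe Int
commentEnd step = go 0
  where
    go _ [] = Nothing
    go i (0x2D : 0x7D : _) = Just (i + 2)            -- bytes of "-}"
    go i bs@(b : _) = let n = step b in go (i + n) (drop n bs)

-- | Encode an ASCII-only string as bytes.
ascii :: String -> [Word8]
ascii = map (fromIntegral . fromEnum)

-- "{- café-} x = 1 {- y -}" with é as the single Latin-1 byte 0xE9.
-- 0xE9 looks like the lead byte of a 3-byte UTF-8 sequence.
sample :: [Word8]
sample = ascii "{- caf" ++ [0xE9] ++ ascii "-} x = 1 {- y -}"
```

Scanning byte by byte, commentEnd (const 1) sample yields Just 9, right
after the first "-}". With the lenient step, commentEnd seqLen sample
yields Just 23: the 0xE9 byte swallows the following "-}", so the
definition x = 1 silently becomes part of the comment.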

I doubt that much existing code would be affected. And GHC could
easily provide a degree of backward compatibility with a flag
and/or pragma.

Thanks,
Yitz