Proposal: Define UTF-8 to be the encoding of Haskell source files
Roel van Dijk
vandijk.roel at gmail.com
Tue Apr 5 00:48:25 CEST 2011
Per the Haskell Prime process I would like to make an official
proposal [1].
* Proposal
The Haskell 2010 language specification states that: "Haskell uses the
Unicode character set" [2]. It does not state which encoding should be
used. Strictly speaking, this means that Haskell source files cannot be
reliably exchanged at the byte level.
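To illustrate the ambiguity, here is a minimal sketch (not part of the
proposal) that reads the same two bytes under two different encodings. It
assumes the bytestring and text packages are available; the names
sampleBytes, asUtf8 and asLatin1 are made up for this example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8')

-- The two bytes 0xC3 0xA9, as they might appear inside a string literal.
sampleBytes :: B.ByteString
sampleBytes = B.pack [0xC3, 0xA9]

-- Read as UTF-8 this is a single character, U+00E9.
-- Nothing would indicate that the bytes are not valid UTF-8.
asUtf8 :: Maybe T.Text
asUtf8 = either (const Nothing) Just (decodeUtf8' sampleBytes)

-- Read as ISO-8859-1 it is two characters, U+00C3 followed by U+00A9.
asLatin1 :: T.Text
asLatin1 = decodeLatin1 sampleBytes
```

Without an agreed encoding, a reader of the file cannot tell which of the
two interpretations the author intended.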
I propose to make UTF-8 the only allowed encoding for Haskell source
files. Implementations must discard an initial Byte Order Mark (BOM)
if present [3].
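As a rough sketch of what the rule could mean for an implementation (an
illustration only, not how any existing compiler does it): decode the raw
bytes strictly as UTF-8, reject anything invalid, and drop a single leading
BOM (U+FEFF). The function name decodeSource is hypothetical; the bytestring
and text packages are assumed.

```haskell
import qualified Data.ByteString as B
import Data.Maybe (fromMaybe)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')
import Data.Text.Encoding.Error (UnicodeException)

-- | Decode the raw bytes of a source file (hypothetical helper).
-- Returns Left if the input is not valid UTF-8; otherwise returns the
-- decoded text with one initial BOM (U+FEFF) removed, if present.
decodeSource :: B.ByteString -> Either UnicodeException T.Text
decodeSource bytes = dropBom <$> decodeUtf8' bytes
  where
    bom = T.singleton '\xFEFF'
    -- stripPrefix yields Nothing when there is no BOM; keep the text as is.
    dropBom t = fromMaybe t (T.stripPrefix bom t)
```

A file in a legacy encoding such as Latin-1 that contains non-ASCII bytes
would typically fail to decode here, which is the "disallows implicit
ISO-8859-* encodings" effect listed under Pros.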
* Pros
- Ensures that Haskell source can be reliably exchanged at the byte
level.
- Disallows implicit ISO-8859-* encodings in source code, ensuring
portability.
- Little or no implementation burden for compiler writers.
* Cons
- Existing code that relies on a non-UTF-8, locale- or implementation-specific
encoding will need conversion. (This is only relevant for Hugs-only code.)
* Implementation status
** GHC
"GHC assumes that source files are ASCII or UTF-8 only, other
encodings are not recognised. However, invalid UTF-8 sequences will be
ignored in comments, so it is possible to use other encodings such as
Latin-1, as long as the non-comment source code is ASCII only." [4]