Proposal: Define UTF-8 to be the encoding of Haskell source files

Duncan Coutts duncan.coutts at googlemail.com
Wed Apr 6 15:13:30 CEST 2011


On 4 April 2011 23:48, Roel van Dijk <vandijk.roel at gmail.com> wrote:
> * Proposal
>
> The Haskell 2010 language specification states that: "Haskell uses the
> Unicode character set" [2]. It does not state what encoding should be
> used. This means, strictly speaking, it is not possible to reliably
> exchange Haskell source files on the byte level.
>
> I propose to make UTF-8 the only allowed encoding for Haskell source
> files. Implementations must discard an initial Byte Order Mark (BOM)
> if present [3].

> * Next step
>
> Discussion! There was already some discussion on the haskell-cafe
> mailing list [7].

This is a simple and obviously sensible proposal. I'm certainly in favour.

I think the only area where there might be some issue to discuss is
the language of the report. As far as I can see, the report does not
require that modules exist as files, does not require the ".hs"
extension and does not give the "standard" mapping from module name to
file name.

Since the goal is interoperability of source files, perhaps we
should also have a section somewhere with interoperability guidelines
for implementations that do store Haskell programs as OS files. The
section would describe the one-module-per-file convention, the .hs
extension (this is already obliquely mentioned in the section on
literate Haskell syntax) and the mapping of module names to file names
in common OS file systems. The UTF-8 stipulation could then go there
(and it would be clear that it applies only to conventional
implementations that store Haskell programs as files).

e.g.

Interoperability Guidelines
===========================

This Report does not specify how Haskell programs are represented or
stored. There is, however, a conventional representation using OS files.
Implementations that conform to these guidelines will benefit from the
portability of Haskell program representations.

Haskell modules are stored as files, one module per file. These
Haskell source files are given the file extension ".hs" for usual
Haskell files and ".lhs" for literate Haskell files (see section
10.4).

Source files must be encoded as UTF-8 \cite{utf8}. Implementations
must discard an initial Byte Order Mark (BOM) if present.
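
(An aside, not part of the proposed report text.) To illustrate the
BOM rule, here is a minimal sketch of how an implementation might read
a source file. It assumes GHC's System.IO; the names stripBOM and
readSourceFile are hypothetical and chosen only for this example. The
file is decoded as UTF-8 and a leading U+FEFF, which is what the BOM
bytes EF BB BF decode to, is dropped.

```haskell
import Control.Exception (evaluate)
import System.IO

-- | Drop a single leading byte order mark (U+FEFF) from decoded text.
-- Only an initial BOM is removed; later occurrences are left alone.
stripBOM :: String -> String
stripBOM ('\xFEFF' : rest) = rest
stripBOM text              = text

-- | Read a source file as UTF-8 and discard an initial BOM if present.
-- (Hypothetical helper, not taken from any existing implementation.)
readSourceFile :: FilePath -> IO String
readSourceFile path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    contents <- hGetContents h
    -- Force the whole file before withFile closes the handle.
    _ <- evaluate (length contents)
    return (stripBOM contents)
```

A file with and without the BOM would then yield the same String, so
the rest of the implementation never needs to know the BOM was there.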

To find a source file corresponding to a module name used in an import
declaration, the following mapping from module name to OS file name is
used. The '.' character is mapped to the OS's directory separator
string while all other characters map to themselves. The ".hs" or
".lhs" extension is added. Where both ".hs" and ".lhs" files exist for
the same module, the ".lhs" one should be used. The OS's standard
convention for representing Unicode file names should be used.

For example, on a UNIX-based OS, the module A.B would map to the file
name "A/B.hs" for a normal Haskell file or to "A/B.lhs" for a literate
Haskell file. Note that because it is rare for a Main module to be
imported, there is no restriction on the name of the file containing
the Main module. It is conventional, but not strictly necessary, that
the Main module use the ".hs" or ".lhs" extension.
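
To make the mapping above concrete, here is a rough sketch in Haskell.
The names splitOnDot and candidatePaths are hypothetical, and the
directory separator is passed in as a parameter rather than looked up
from the OS. The candidates are returned in the order of preference
the draft text gives, with ".lhs" before ".hs".

```haskell
import Data.List (intercalate)

-- | Split a module name into its components at each '.'.
splitOnDot :: String -> [String]
splitOnDot s = case break (== '.') s of
  (part, [])       -> [part]
  (part, _ : rest) -> part : splitOnDot rest

-- | Candidate file names for a module name, most preferred first.
-- The first argument is the OS's directory separator string;
-- all characters other than '.' map to themselves.
candidatePaths :: String -> String -> [FilePath]
candidatePaths sep modName = [base ++ ext | ext <- [".lhs", ".hs"]]
  where
    base = intercalate sep (splitOnDot modName)
```

With this sketch, candidatePaths "/" "A.B" gives ["A/B.lhs","A/B.hs"],
and an implementation would use the first candidate that exists.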


Duncan


