H98 Text IO

Tue Feb 26 08:22:07 EST 2008

Duncan Coutts wrote:
> From the H98 report:
> 
>         All I/O functions defined here are character oriented. [...]
>         These functions cannot be used portably for binary I/O.
>         
>         In the following, recall that String is a synonym for [Char]
>         (Section 6.1.2).
> 
> So ordinary text Handles are for text, not binary. Char is of course a
> Unicode code point.
> 
> The crucial question of course is what encoding of text to use. For the
> H98 IO functions we cannot set it as a parameter, we have to pick a
> sensible default. Currently different implementations disagree on that
> default. Hugs has for some time used the current locale on posix systems
> (and I'm guessing the current code page on windows). GHC has always used
> the Latin-1 encoding.
> 
> These days, most operating systems use a locale/codepage encoding that
> covers the full Unicode range. So on Hugs we get the benefit of that but
> on GHC we do not.
> 
> This is endlessly surprising for beginners. They do
> putStrLn "αβγδεζηθικλ"
> and it comes out on their terminal as junk.
> 
> It also causes problems for serious programs, see for example the recent
> hand-wringing on cabal-devel.
> 
> So here is a concrete proposal:
> 
>       * Haskell98 file IO should always use UTF-8.
>       * Haskell98 IO to terminals should use the current locale
>         encoding.

While I support Duncan's proposal (we discussed it on IRC), I thought I 
should point out some of the ramifications of this, and the alternatives.

If everything that is not a terminal uses UTF-8 by default, then shell 
commands may behave in an unexpected way, e.g. for a Haskell program "prog",

   prog | cat

will output in UTF-8, and if your locale encoding is something other than 
UTF-8 you'll see junk.  Similarly,

   prog >file; cat file

will give the same (wrong) result.

So some alternatives that fix this are

   1. all text I/O is in the locale encoding (what C and Hugs do)

   2. stdin/stdout/stderr and terminals are always in the locale
      encoding, everything else is UTF-8

   3. everything is UTF-8

(1) has the advantage of being easy to understand, but causes problems when 
you want to move a file created on one system to another system, or share 
files between users.  The programmer in this case has to anticipate the 
problem and set an encoding (and we are not yet proposing a way to specify 
encodings, so openBinaryFile plus a separate UTF-8 encoding step would be 
required).
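
To make the workaround in (1) concrete, here is a rough sketch of what 
"openBinaryFile and a separate UTF-8 step" might look like.  The names 
encodeChar, encodeUtf8 and writeFileUtf8 are made up for illustration and 
are not part of any library.  The sketch assumes that a binary Handle 
writes each Char below 256 as a single byte, and it does not reject 
surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import System.IO (IOMode (WriteMode), hClose, hPutStr, openBinaryFile)

-- Hypothetical helper: the UTF-8 byte values for one code point.
-- No validation is done (surrogates are encoded like any other value).
encodeChar :: Char -> [Int]
encodeChar c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                  , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont :: Int -> Int
    cont x = 0x80 .|. (x .&. 0x3F)

-- Hypothetical helper: a String whose Chars are all < 256,
-- one Char per UTF-8 byte.
encodeUtf8 :: String -> String
encodeUtf8 = map chr . concatMap encodeChar

-- Hypothetical helper: write text as UTF-8 through a binary Handle,
-- assuming the Handle emits each Char < 256 as one byte.
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path s = do
  h <- openBinaryFile path WriteMode
  hPutStr h (encodeUtf8 s)
  hClose h
```

Under those assumptions, writeFileUtf8 "greek.txt" "αβγ" would write six 
bytes whatever the locale is, which is the point of doing the encoding by 
hand.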

(2) has a sort of "do what I want" feel, and will almost certainly cause
confusion in some cases, simply because it's an arbitrary choice.

(3) is easy to understand, but does the wrong thing for people who have
a locale encoding other than UTF-8.
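
The three alternatives can be written down as a small decision table.  
This is only a model: Policy, Encoding, Sink and encodingFor are 
hypothetical names invented here, and a real implementation would have to 
obtain the two Sink flags from the Handle somehow.

```haskell
-- Hypothetical types, for illustration only.
data Policy
  = AllLocale             -- alternative (1)
  | StdAndTerminalsLocale -- alternative (2)
  | AllUtf8               -- alternative (3)
  deriving (Eq, Show)

data Encoding = Locale | Utf8
  deriving (Eq, Show)

-- What we are writing to: is it one of stdin/stdout/stderr,
-- and is it attached to a terminal?
data Sink = Sink
  { isStdHandle :: Bool
  , isTerminal  :: Bool
  }

-- Which encoding each alternative would pick for a given sink.
encodingFor :: Policy -> Sink -> Encoding
encodingFor AllLocale _ = Locale
encodingFor AllUtf8   _ = Utf8
encodingFor StdAndTerminalsLocale s
  | isStdHandle s || isTerminal s = Locale
  | otherwise                     = Utf8

-- stdout in "prog | cat" or "prog >file": a standard handle, not a terminal.
pipedStdout :: Sink
pipedStdout = Sink { isStdHandle = True, isTerminal = False }

-- A file opened by name from within the program.
regularFile :: Sink
regularFile = Sink { isStdHandle = False, isTerminal = False }
```

In this model, encodingFor StdAndTerminalsLocale pipedStdout is Locale, so 
the two shell examples above come out right under (2), while a file opened 
by name still gets Utf8.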

Duncan's proposal occupies a useful point: text that we know to be 
ephemeral, because it is being sent to a terminal, is definitely sent in 
the user's default encoding.  Text that might be persistent or might be 
crossing a locale boundary is always written in UTF-8, which is good for 
interchange and portability.  The catch is that sometimes we identify a 
Handle as persistent when it is really ephemeral, as in the "prog | cat" 
example above.
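
As a sketch of how the rule in the proposal could be applied to a single 
Handle: this assumes the hSetEncoding, hIsTerminalDevice, localeEncoding 
and utf8 names that more recent versions of GHC's System.IO export (none 
of them are in H98).  proposedEncoding and applyProposal are hypothetical 
names, not an existing API.

```haskell
import System.IO
  ( Handle, TextEncoding, hIsTerminalDevice, hSetEncoding
  , localeEncoding, utf8 )

-- Hypothetical helper: the proposal's rule as a pure function.
-- Terminals get the locale encoding, everything else gets UTF-8.
proposedEncoding :: Bool -> TextEncoding
proposedEncoding toTerminal
  | toTerminal = localeEncoding
  | otherwise  = utf8

-- Hypothetical helper: ask whether the Handle is a terminal and set
-- its encoding accordingly.
applyProposal :: Handle -> IO ()
applyProposal h = do
  tty <- hIsTerminalDevice h
  hSetEncoding h (proposedEncoding tty)
```

With this rule, stdout in "prog | cat" is not a terminal, so it takes the 
UTF-8 branch even though the text ends up on the user's screen.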

Note that sensible people who set their locale to UTF-8 are not affected by 
any of this - and that includes most new installations of Linux these days, 
I believe.

Cheers,
	Simon