H98 Text IO
Duncan Coutts
duncan.coutts at worc.ox.ac.uk
Tue Feb 26 19:06:59 EST 2008
On Tue, 2008-02-26 at 14:18 +0000, Simon Marlow wrote:
> Simon Marlow wrote:
> > Duncan Coutts wrote:
>
> Let's call this one proposal 0:
>
> >> * Haskell98 file IO should always use UTF-8.
> >> * Haskell98 IO to terminals should use the current locale
> >> encoding.
>
> and the others:
>
> > 1. all text I/O is in the locale encoding (what C and Hugs do)
> >
> > 2. stdin/stdout/stderr and terminals are always in the locale
> > encoding, everything else is UTF-8
> >
> > 3. everything is UTF-8
So it's clear that all these solutions have some downsides. We have to
decide what is more important.
Let me try and summarise. Basically we can be consistent with the OS
environment, or consistent with other Haskell systems in other
environments, or try to get some mixture of the two. It is pretty clear,
however, that trying to get a mixture still leads to some inconsistency
with the OS environment.
* "status quo" (what ghc/hugs do now)
This gives consistency with the OS environment for hugs and jhc
but not for ghc, nhc or yhc. It gives consistency between Haskell
programs (using the same Haskell implementation) on different
platforms for ghc and nhc but not for hugs or jhc. There is no
consistency between Haskell implementations.
* "always locale" (solution 1 above)
This gives us consistency with the OS environment. All of the
shell snippets people have posted work with this. The main
disadvantage is that files moved between systems may be
interpreted differently.
* "always utf8" (solution 3 above)
This gives consistency between Haskell programs across
platforms. The main disadvantage is that it is very unhelpful if
the locale is not UTF8. It fails the "putStr" test of printing
string literals to the terminal.
* "mixture A" (solution 0 above)
The input/output format changes depending on the device. prog |
cat prints junk in non-UTF8 locales.
* "mixture B" (solution 2 above)
The output format changes depending on the device. prog in
behaves differently to prog < in.
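To make the differences between these options concrete, here is a rough
sketch of each one expressed as a choice of encoding per handle. It
assumes a GHC whose System.IO exports hSetEncoding, localeEncoding and
utf8; those are not part of Haskell98 IO, so this only illustrates the
policies and is not a proposed implementation. The names Policy, Device,
chooseEncoding, classify and applyPolicy are made up for the example.

```haskell
import System.IO

-- Hypothetical names for the options summarised above.
data Policy = AlwaysLocale | AlwaysUtf8 | MixtureA | MixtureB
  deriving (Eq, Show)

-- What a handle is connected to, as far as the policies care.
data Device = Terminal | StdStream | OtherFile
  deriving (Eq, Show)

-- Pick the text encoding a policy would use for a given device.
chooseEncoding :: Policy -> Device -> TextEncoding
chooseEncoding AlwaysLocale _         = localeEncoding
chooseEncoding AlwaysUtf8   _         = utf8
chooseEncoding MixtureA     Terminal  = localeEncoding  -- terminals only
chooseEncoding MixtureA     _         = utf8
chooseEncoding MixtureB     OtherFile = utf8
chooseEncoding MixtureB     _         = localeEncoding  -- terminals and std streams

-- Work out what kind of device a handle refers to.
classify :: Handle -> IO Device
classify h = do
  isTty <- hIsTerminalDevice h
  return $
    if isTty
      then Terminal
      else if h `elem` [stdin, stdout, stderr]
        then StdStream
        else OtherFile

-- Apply a policy to one handle.
applyPolicy :: Policy -> Handle -> IO ()
applyPolicy policy h = do
  device <- classify h
  hSetEncoding h (chooseEncoding policy device)

main :: IO ()
main = do
  applyPolicy MixtureA stdout
  putStrLn "αβγδεζηθικλ"
```

With MixtureA, running the program directly on a terminal and running it
through a pipe select different encodings for stdout, which is exactly
the inconsistency described for that option.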
And some examples people have noted:
* putStr "αβγδεζηθικλ"
That is just printing a string literal to the console/terminal.
Now that major implementations support Unicode .hs source files
it's kind of nice if this works.
This works with "always locale" and "mixture A" and "mixture B"
above. This fails for "status quo" with ghc (but works for hugs)
and fails for "always utf8" unless the locale happens to be
utf8.
* ./prog vs ./prog | cat
That is, piping the output of a Haskell program through cat and
printing the result to a terminal produces the same output as
displaying the program output directly.
This works with "always locale" and "mixture B" and fails with
"mixture A". With "always utf8" and with "status quo" it has the
property that it consistently produces the same junk on the
terminal which some people see as a bonus (when not in a utf8
or latin1 locale respectively).
* ./prog vs ./prog >file; cat file
This is another variation on the above and it has the same
failures.
* ./prog in vs ./prog < in
That is, reading a file given as a command line arg via readFile
gives the same result as reading stdin that has been redirected
from the same file.
This works with "always locale" and "mixture A" and fails with
"mixture B". This is the dual of the previous two examples. This
fails with "always utf8" and with "status quo" when the file was
produced by another text processing program from the same
environment (eg a generic text editor).
* ./foo vs ./foo | hexdump -C
The output bytes sent to the terminal are exactly the same as
the bytes we see when piping to a program that examines them.
This fails for "mixture A" and works for all the others. Works
in the strict sense that the bytes are the same, not in the
sense that the text output is readable.
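The "junk" in these examples comes from the same characters turning into
different bytes under different encodings. As a small illustration, here
is a hand-written sketch of UTF-8 and Latin-1 encoders; the function
names are made up for the example, and the UTF-8 encoder does not bother
rejecting surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one character as UTF-8 bytes (1 to 4 bytes).
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading bits of the code point, for the first byte.
    hi s   = fromIntegral (n `shiftR` s)
    -- A continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeCharUtf8

-- Latin-1 can only represent code points up to 0xFF.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM enc
  where
    enc c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing

main :: IO ()
main = do
  print (encodeUtf8 "é")      -- two bytes in UTF-8
  print (encodeLatin1 "é")    -- one byte in Latin-1
  print (encodeUtf8 "αβγ")    -- two bytes per letter in UTF-8
  print (encodeLatin1 "αβγ")  -- not representable in Latin-1
```

So "é" is the single byte E9 in Latin-1 but the two bytes C3 A9 in
UTF-8, and a terminal expecting one will show junk when sent the other.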
So the problem with the mixture approaches is that the terminal and
files and pipes are all really interchangeable so we can find surprising
inconsistencies within the same OS environment.
The problem with the "always utf8" is that it's never right unless the
locale is set to utf8.
As a data point, Java and Python use "always locale" as the default if
you don't specify an encoding when opening a text stream.
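For comparison, here is a sketch of what "locale by default, explicit
encoding on request" could look like in Haskell. It again assumes a GHC
whose System.IO provides hSetEncoding and utf8, and that handles opened
without further setup use the locale encoding; readFileWith is a
hypothetical helper, not an existing library function.

```haskell
import Control.Exception (evaluate)
import System.Environment (getArgs)
import System.IO

-- Read a whole text file. With Nothing the handle keeps its default
-- encoding; with Just enc the given encoding is used instead.
readFileWith :: Maybe TextEncoding -> FilePath -> IO String
readFileWith mEnc path =
  withFile path ReadMode $ \h -> do
    maybe (return ()) (hSetEncoding h) mEnc
    contents <- hGetContents h
    -- Force the whole file to be read before withFile closes the handle.
    _ <- evaluate (length contents)
    return contents

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> readFileWith (Just utf8) path >>= putStr
    _      -> hPutStrLn stderr "usage: prog FILE"
```

A program that needs a fixed on-disk format can ask for it explicitly,
while everything else follows the environment.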
I think personally I'm coming round to the "always locale" point of
view. We already have no cross-platform consistency for text files
because of the lf vs cr/lf issue and we have no cross-implementation
consistency.
Duncan