[Haskell-cafe] Re: Writing binary files?

Wed Sep 15 12:57:17 EDT 2004

Gabriel Ebner wrote:

> >> 3. The default encoding is settable from Haskell, defaults to
> >>    ISO-8859-1.
> >
> > Agreed.
> 
> So every haskell program that did more than just passing raw bytes
> from stdin to stdout should decode the appropriate environment
> variables, and set the encoding by itself?

This statement is too restrictive. Passing bytes isn't limited to
stdin->stdout, and there's no reason why setting the encoding needs to
be any more involved than e.g. "setLocaleEncoding". If you change it
to:

> So every haskell program that did more than just passing raw bytes
> ... should ... set the encoding by itself?

then the answer is yes.
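
To illustrate how little is involved, here is a minimal sketch. It
assumes the GHC.IO.Encoding and System.IO modules found in later
versions of GHC's base library (they are not part of Haskell98); the
choice of latin1 and utf8 is purely for illustration.

```haskell
import GHC.IO.Encoding (getLocaleEncoding, setLocaleEncoding)
import System.IO (hSetEncoding, latin1, stdout, utf8)

main :: IO ()
main = do
  -- One call changes the default used for handles opened from here on.
  setLocaleEncoding latin1
  -- Handles that already exist can be overridden individually.
  hSetEncoding stdout utf8
  -- Report what newly opened handles will now default to.
  enc <- getLocaleEncoding
  putStrLn ("default encoding for new handles: " ++ show enc)
```

The point of the sketch is only that the program, which may know the
encoding, states it in one line; nothing has to be guessed on its
behalf.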

>  IMO that's too much of
> redundancy, the RTS should actually do that.

The RTS doesn't know the encoding. Assuming that the data will use the
locale's encoding will be wrong too often.

> > There are limits to the extent to which this can be achieved. E.g.
> > what happens if you set the encoding to UTF-8, then call
> > getDirectoryContents for a directory which contains filenames which
> > aren't valid UTF-8 strings?
> 
> Then you _seriously_ messed up.  Your terminal would produce garbage,
> Nautilus would break, ...

Like so many other people, you're making an argument based upon
fiction (specifically, that you have a closed world where everything
always uses the same encoding) then deeming anyone who is unable to
maintain the fiction to be "wrong".

> >> 5. The default encoding is settable from Haskell, defaults to the
> >>    locale encoding.
> >
> > I feel that the default encoding should be one whose decoder cannot
> > fail, e.g. ISO-8859-1. You should have to explicitly request the use
> > of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "")
> > at the start of a C program; there's a good reason why C doesn't do
> > this without being explicitly told to).
> 
> So that any haskell program that doesn't call setlocale and outputs
> anything else than US-ASCII will produce garbage on an UTF-8 system?

No. If a program just passes bytes around, everything will work so
long as the inputs use the encoding which the outputs are assumed to
use. And if the inputs aren't in the "correct" encoding, then you have
to deal with encodings manually regardless of the default behaviour.
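
A minimal sketch of why an ISO-8859-1 default is safe for programs
which just pass bytes around: the decoder is total, and encoding the
result gives back exactly the bytes you started with. The names
decodeLatin1 and encodeLatin1 are mine, for illustration only; they
are not part of any existing or proposed API.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode bytes as ISO-8859-1: byte n becomes code point n.
-- This is total; there is no byte sequence it can reject.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Encode back to bytes. Chars above 255 have no ISO-8859-1
-- representation, so those yield Nothing.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM toByte
  where
    toByte c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing

main :: IO ()
main = do
  -- The UTF-8 bytes for e-acute, followed by a byte that can never
  -- occur in valid UTF-8.
  let raw = [0xC3, 0xA9, 0xFF] :: [Word8]
  -- The round trip preserves the bytes, whatever they "really" mean.
  print (encodeLatin1 (decodeLatin1 raw) == Just raw)
```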

> > Actually, the more I think about it, the more I think that "simple,
> > stupid programs" probably shouldn't be using Unicode at all.
> 
> Care to give any examples?  Everything that has been mentioned until
> now would break with an UTF-8 locale:
>     - ls (sorting would break),
>     - env (sorting too)

Sorting according to codepoints inevitably involves decoding. However,
getting the order wrong is usually considered less problematic than
failing outright.

> > I.e. Char, String, string literals, and the I/O functions in
> > Prelude, IO etc should all be using bytes,
> 
> I don't want the same mess as in C, where strings and raw data are the
> very same.

Tough. You already have it, and will do for the foreseeable future. 
Many existing APIs (including the core Unix API), protocols and file
formats are defined in terms of byte strings with no encoding
specified or implied.

> Haskell has a nice type system and nicely defined types
> for binary data ([Word8]) and for Strings (String), why don't use it?

I'd like to. But many of the functions which provide or accept binary
data (e.g. FilePath) insist on representing it using Strings.

> > with a distinct wide-character API available for people who want to
> > make the (substantial) effort involved in writing (genuinely)
> > internationalised programs.
> 
> If you introduce an entirely new "i18n-only" API, then it'll surely
> become difficult. :-)

I18N is inherently difficult. Lots of textual data exists in lots of
different encodings, and the encoding is frequently unspecified.

It would be easier if we had a closed world where only one encoding
was ever used. But we don't, and pretending that we do doesn't make it
so.

> > Anything that isn't ISO-8859-1 just doesn't work for the most part,
> > and anyone who wants to provide real I18N first has to work around
> > the pseudo-I18N that's already there (e.g. convert Chars back into
> > Word8s so that they can decode them into real Chars).
> 
> One more reason to fix the I/O functions to handle encodings and have
> a separate/underlying binary I/O API.

The problem is that we also need to fix them to handle *no encoding*.

Also, binary data and text aren't disjoint. Everything is binary; some
of it is *also* text.

> > Oh, and because bytes are being stored in Chars, the type system won't
> > help if you neglect to decode a string, or if you decode it twice.
> 
> Yes, that's the problem with the current approach, i.e. that there's
> no easy way get a list of Word8's out of a handle.

Or out of getDirectoryContents, getArgs, getEnv etc. Or to pass a list
of Word8s to a handle, or to openFile, getEnv etc.
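
As a rough sketch of what the type system could do if bytes and
decoded text were kept apart, consider the following. RawBytes,
Decoded, decodeLatin1 and render are hypothetical names invented for
this example; nothing like them exists in the Haskell98 libraries.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Undecoded data, e.g. what a byte-oriented getDirectoryContents
-- might return.
newtype RawBytes = RawBytes [Word8] deriving (Eq, Show)

-- Text which has been through exactly one decoding step.
newtype Decoded = Decoded String deriving (Eq, Show)

-- ISO-8859-1 is used here only because its decoder cannot fail.
decodeLatin1 :: RawBytes -> Decoded
decodeLatin1 (RawBytes ws) = Decoded (map (chr . fromIntegral) ws)

-- Only decoded text can be rendered.
render :: Decoded -> String
render (Decoded s) = s

main :: IO ()
main = do
  let name = RawBytes [0x66, 0x6F, 0x6F]  -- the bytes of "foo"
  putStrLn (render (decodeLatin1 name))
  -- Neither of the following would type-check:
  --   render name                       -- neglected to decode
  --   decodeLatin1 (decodeLatin1 name)  -- decoded twice
```

With bytes stored in Chars, both of the commented-out mistakes compile
without complaint.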

> >> The current ISO-8859-1 assumption is also wrong. A program written in
> >> Haskell which sorts strings would break for non-ASCII letters even now
> >> that they are ISO-8859-2 unless specified otherwise.
> >
> > 1. In that situation, you can't avoid the encoding issues. It doesn't
> > matter what the default is, because you're going to have to set the
> > encoding anyhow.
> 
> Why do you always want me to set the encoding?  That should be the job
> of the RTS.

Because you might know the encoding, and the RTS doesn't. The locale
is a fallback mechanism, for the situation where you *need* an
encoding but one hasn't been specified by other means.

> > 2. If you assume ISO-8859-1, you can always convert back to Word8
> 
> If I want a list of Word8's, then I should be able to get them without
> extracting them from a string.

The point is that, currently, you can't. Nothing in the core Haskell98
API actually uses Word8; it all uses Char/String.

> > then re-decode as UTF-8. If you assume UTF-8, anything which is neither
> > UTF-8 nor ASCII will fail far more severely than just getting the
> > collation order wrong.
> 
> If I use String's to handle binary data, then I should expect things
> to break.  If I want to get text, and it's not in the expected
> encoding, then the user has messed up.

Or maybe the expectation is incorrect.
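
The "convert back to Word8 then re-decode as UTF-8" route mentioned
above can be sketched as follows. This is an illustration only:
toBytes and decodeUtf8 are names I made up, the decoder is a
simplified one written for this message, and it assumes the String
was read under the ISO-8859-1 assumption (every Char in 0..255).

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Recover the original bytes from a String read as ISO-8859-1.
-- Assumes every Char is in 0..255; larger values would be truncated.
toBytes :: String -> [Word8]
toBytes = map (fromIntegral . ord)

-- Decode UTF-8, returning Nothing on malformed input.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b : bs)
  | b < 0x80           = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) 0x80
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) 0x800
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) 0x10000
  | otherwise          = Nothing
  where
    -- n continuation bytes, payload bits of the lead byte, and the
    -- smallest code point which legitimately needs this length.
    multi :: Int -> Word8 -> Int -> Maybe String
    multi n lead minCp =
      let (cont, rest) = splitAt n bs
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F))
                     (fromIntegral lead) cont
          ok = length cont == n
               && all (\c -> c .&. 0xC0 == 0x80) cont
               && cp >= minCp                         -- no overlong forms
               && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)  -- no surrogates
      in if ok then (chr cp :) <$> decodeUtf8 rest else Nothing

main :: IO ()
main = do
  -- What an ISO-8859-1 read hands back for the UTF-8 bytes of e-acute.
  let asRead = "\xC3\xA9"
  print (decodeUtf8 (toBytes asRead) == Just "\xE9")
  -- A byte which is not valid UTF-8 is reported rather than mangled.
  print (decodeUtf8 [0xFF] == Nothing)
```

Going the other way (assume UTF-8 first, then try to get the bytes
back) isn't possible once the decoder has rejected or replaced the
bytes it didn't like.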

> > Well, my view is essentially that files should be treated as
> > containing bytes unless you explicitly choose to decode them, at
> > which point you have to specify the encoding.
> 
> Why do you always want me to _manually_ specify an encoding?

Because we don't have an "oracle" which will magically determine the
encoding for you.

> If I
> want bytes, I'll use the (currently being discussed, see beginning of
> this thread) binary I/O API, if I want String's (i.e. text), I'll use
> the current I/O API (which is pretty text-orientated anyway, see
> hPutStrLn, hGetLine, ...).

If you want text, well, tough; what comes out of most system calls and
core library functions (not just read()) is bytes. There isn't any
magic wand which will turn them into characters without knowing the
encoding.

> > completely new wide-character API for those who wish to use it.
> 
> Which would make it horrendously difficult to do even basic I18N.

Why?

> > That gets the failed attempt at I18N out of everyone's way with a
> > minimum of effort and with maximum backwards compatibility for
> > existing code.
> 
> If existing code, expects String's to be just a list of bytes, it's
> _broken_.

I know. That's what I'm saying. The problem is that the broken "code"
is the Haskell98 API.

>  String's are a list of unicode characters, [Word8] is a
> list of bytes.

And what comes out of (and goes into) most core library functions is
the latter.

-- 
Glynn Clements <glynn.clements at virgin.net>