[Haskell-cafe] Re: Writing binary files?

Wed Sep 15 19:01:53 EDT 2004

Gabriel Ebner wrote:

> > The RTS doesn't know the encoding. Assuming that the data will use the
> > locale's encoding will be wrong too often.
> 
> If the program wants to get bytes, it should get bytes explicitly, not
> some sort of pseudo-Unicode String.

Er, that's what I've been saying. And most programs should be getting
filenames as bytes.

> > Like so many other people, you're making an argument based upon
> > fiction (specifically, that you have a closed world where everything
> > always uses the same encoding) then deeming anyone who is unable to
> > maintain the fiction to be "wrong".
> 
> Everything's fine here with LANG=de_AT.utf8.  And I can't recall
> having any problems with it.  But well, YMMV.

So either you never encounter Latin1 files or the programs aren't
trying to decode them.

Bear in mind that the standard libraries don't automatically decode
everything according to the locale's encoding. A lot of programs
completely ignore the locale, and many of those which use it for
something don't decode strings into wide strings.

> > No. If a program just passes bytes around, everything will work so
> > long as the inputs use the encoding which the outputs are assumed to
> > use. And if the inputs aren't in the "correct" encoding, then you have
> > to deal with encodings manually regardless of the default behaviour.
> 
> The only programs that just pass bytes around that come to mind are
> basic Unix utilities.  Basically everything else will somehow process
> the data.

I would suggest that most programs which deal with filenames merely
pass them around. I.e. read bytes from argv or the environment or
files, and pass the bytes to open() etc. And when programs do process
filenames, the processing is usually trivial, and not influenced by
the encoding, e.g. appending or removing directories or extensions.
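
To illustrate how little such processing depends on the encoding, here
is a minimal sketch which manipulates filenames purely as bytes. The
names RawPath, joinPath, baseName and dropExtension are hypothetical
(they aren't part of any existing library), and the only assumption is
the Unix convention that '/' is the byte 0x2F and '.' is 0x2E.

```haskell
import Data.Word (Word8)

-- A filename as raw bytes, with no encoding assumed (hypothetical alias).
type RawPath = [Word8]

slash, dot :: Word8
slash = 0x2F  -- '/'
dot   = 0x2E  -- '.'

-- Join a directory and a name, adding '/' unless the directory ends in one.
joinPath :: RawPath -> RawPath -> RawPath
joinPath dir name
  | null dir          = name
  | last dir == slash = dir ++ name
  | otherwise         = dir ++ [slash] ++ name

-- Keep only the bytes after the last '/'.
baseName :: RawPath -> RawPath
baseName = reverse . takeWhile (/= slash) . reverse

-- Remove a trailing ".ext" from the final component, if it has one.
-- A leading dot (as in ".profile") is not treated as an extension.
dropExtension :: RawPath -> RawPath
dropExtension p =
  case break (== dot) (reverse (baseName p)) of
    (ext, _ : stem) | not (null stem) -> take (length p - length ext - 1) p
    _                                 -> p
```

None of these functions ever look at a byte other than 0x2F or 0x2E,
so they behave the same whether the remaining bytes are Latin1, UTF-8
or something else entirely.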

> > Tough. You already have it, and will do for the foreseeable future. 
> > Many existing APIs (including the core Unix API), protocols and file
> > formats are defined in terms of byte strings with no encoding
> > specified or implied.
> 
> Guess why I like Haskell (the language; the implementations are not up
> to that ideal yet).

You're missing the point. Haskell is implemented upon those existing
APIs, and Haskell programs need to understand those protocols and file
formats. Nothing that Haskell (or an implementation thereof) does can
make the issues go away.

> > I'd like to. But many of the functions which provide or accept binary
> > data (e.g. FilePath) insist on representing it using Strings.
> 
> Good point.  Adding functions that accept bytes instead of strings
> would be a major undertaking.

Which is why I'm suggesting changing Char to be a byte, so that we can
have the basic, robust API now and wait for the more advanced API,
rather than having to wait for a usable API while people sort out all
of the issues.

> > I18N is inherently difficult. Lots of textual data exists in lots of
> > different encodings, and the encoding is frequently unspecified.
> 
> That's the problem with the current API.  You can neither easily
> read/write bytes nor strings in a specified encoding.

No, that's the problem with reality. It has nothing to do with Haskell
beyond the issue of whether Haskell is based upon reality or fiction.

> > The problem is that we also need to fix them to handle *no encoding*.
> 
> That's binary data. (assuming you didn't want to say 'unknown')

Yes. Filenames are binary data; environment strings are binary data;
argv[] is binary data. They may *also* be text, but if they are, the
encoding is, in general, unknown.

> > Also, binary data and text aren't disjoint. Everything is binary; some
> > of it is *also* text.
> 
> Simon's new-io proposal does this very nicely.  Stdin is by default a
> binary stream and you can obtain a TextInputStream for it using either
> the locale's encoding or a specified encoding.  That's the way I'd
> like it to be.

Yes. From what I've seen of it, it's basically the right thing, so far
as it goes, which unfortunately isn't that far. The issues go far
beyond reading and writing streams.

The problem is that this *isn't* the Haskell98 API; it isn't even
included in any existing implementation.

> > Or out of getDirectoryContents, getArgs, getEnv etc. Or to pass a list
> > of Word8s to a handle, or to openFile, getEnv etc.
> 
> That's a real issue.  Adding new functions with a bin- prefix is the only
> solution that comes to my mind.

Well, the other obvious solution is changing the existing functions to
use Word8s (obviously, we want a better name, e.g. Byte or Char, or
even just CChar) and making the new functions use wide characters.

It isn't as if the existing functions actually deal with anything
other than bytes. You never get anything other than Latin1 from a
system function which returns Char (or IO Char), and passing anything
which isn't Latin1 to such a function results in it being silently
cast to a byte.
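
As a rough illustration of that cast (a sketch of the arithmetic only,
not the actual library code; truncateChar and byteToChar are
hypothetical names), keeping the low eight bits of the code point
looks like this:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Char to byte by keeping only the low 8 bits of the code point.
-- Anything above U+00FF is silently mangled.
truncateChar :: Char -> Word8
truncateChar = fromIntegral . ord

-- Byte to Char. Every byte maps to a character in the Latin1 range,
-- so this direction cannot fail.
byteToChar :: Word8 -> Char
byteToChar = chr . fromIntegral
```

Under this sketch, truncateChar '\x20AC' (the Euro sign) yields 0xAC,
while any byte survives a round trip through byteToChar and back.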

> > Because we don't have an "oracle" which will magically determine the
> > encoding for you.
> 
> That "oracle" is called locale setting.  If I want to read text and
> can't determine the encoding by other ways (protocol spec, ...), then
> it's what the user set his locale setting to.

No. An "oracle" would always get it right. The locale merely provides
a fallback.

> > If you want text, well, tough; what comes out of most system calls and
> > core library functions (not just read()) are bytes.
> 
> Which need to be interpreted by the program depending on where these
> bytes come from.

They don't necessarily need to be interpreted. A lot of data simply
gets "routed" from one place to another. E.g. a program reads a
filename from argv[i] and passes it to open(). It doesn't matter if
the filename is in Klingon.

> > There isn't any magic wand which will turn them into characters
> > without knowing the encoding.
> 
> If I know the encoding, I should be able to set it.  If I don't, it's
> the locale setting.

If you *need* an encoding, and don't have any better information, then
the locale provides a last resort. Decoding bytes according to the
locale for the sake of it just adds an unnecessary failure mode.
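
The precedence I have in mind could be sketched as follows. The
Encoding type and pickEncoding are hypothetical, purely for
illustration; the point is that the locale is only consulted when
nothing better is known, and that none of this runs unless the program
actually needs characters.

```haskell
import Control.Applicative ((<|>))
import Data.Maybe (fromMaybe)

-- A stand-in for whatever representation of encodings is available.
data Encoding = Latin1 | UTF8 | Other String
  deriving (Eq, Show)

-- Choose an encoding for data which must be decoded:
--   1. one requested explicitly by the caller,
--   2. one specified by the protocol or file format,
--   3. the locale's encoding, as a last resort.
pickEncoding :: Maybe Encoding -> Maybe Encoding -> Encoding -> Encoding
pickEncoding explicit fromProtocol localeDefault =
  fromMaybe localeDefault (explicit <|> fromProtocol)
```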

> >> > completely new wide-character API for those who wish to use it.
> >> 
> >> Which would make it horrendously difficult to do even basic I18N.
> >
> > Why?
> 
> Having different types for single-byte and multi-byte strings together
> with separate functions to handle them (that's what I assume you mean
> by a new wide-character API) with single-byte strings being the
> preferred one (the cause of being a separate API) would make sorting,
> upper/lower case testing etc. not exactly easier.

For case testing, locale-dependent sorting and the like, you need to
convert to characters. [Although possibly only temporarily; you can
sort a list of byte strings based upon their corresponding character
strings using sortBy. This means that a decoding failure only means
that the ordering will be wrong. This is essentially what happens with
"ls" if you have filenames which aren't valid in the current locale.]

Note: there are still situations where sorting bytes makes sense, i.e. 
where you only need *an* ordering rather than a specific ordering,
e.g. uniq.
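
A minimal sketch of both cases, assuming names are kept as [Word8].
RawName, sortByDecoded and decodeAscii are hypothetical; decodeAscii
is a toy stand-in for a real locale-aware decoder, and comparing the
decoded Strings gives code point order rather than true locale
collation.

```haskell
import Data.Char (chr)
import Data.List (sort, sortBy)
import Data.Ord (comparing)
import Data.Word (Word8)

type RawName = [Word8]

-- Sort byte strings by their decoded form. The list still holds bytes;
-- a name which fails to decode is merely ordered by its raw bytes
-- (ahead of the decodable names) instead of causing an error.
sortByDecoded :: (RawName -> Maybe String) -> [RawName] -> [RawName]
sortByDecoded decode = sortBy (comparing key)
  where
    key name = maybe (Left name) Right (decode name)

-- Toy decoder: succeeds only for pure ASCII names.
decodeAscii :: RawName -> Maybe String
decodeAscii bytes
  | all (< 0x80) bytes = Just (map (chr . fromIntegral) bytes)
  | otherwise          = Nothing

-- When any consistent ordering will do (as for uniq), sort the bytes.
sortRaw :: [RawName] -> [RawName]
sortRaw = sort
```

With decodeAscii, a name containing the byte 0xE9 ends up in the
"wrong" place, but it is neither lost nor altered.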

> > I know. That's what I'm saying. The problem is that the broken "code"
> > is the Haskell98 API.
> 
> No, it's not broken.  It just has some missing features (i.e. I/O /
> env functions accepting bytes instead of strings).

It's broken. Being able to represent filenames as byte strings is
fundamental. Being able to convert them to or from character strings
is useful but not essential. The only reason the existing API doesn't
cause serious problems is that the translation is currently hardwired
to an encoding which can't fail (Latin1).

> >>  Strings are a list of unicode characters, [Word8] is a
> >> list of bytes.
> >
> > And what comes out of (and goes into) most core library functions is
> > the latter.
> 
> Strictly speaking, the former comes out with the semantics of the
> latter. :-)

By "core library functions", I was referring primarily to libc, not
the Haskell library functions which were built upon them. The Haskell
developers can change Haskell, they can't change libc.

-- 
Glynn Clements <glynn.clements at virgin.net>