[Haskell-cafe] Writing binary files?

Glynn Clements glynn.clements at virgin.net
Tue Sep 14 19:08:17 EDT 2004


Udo Stenzel wrote:

> > Note that this needs to include all of the core I/O functions, not
> > just reading/writing streams. E.g. FilePath is currently an alias for
> > String, but (on Unix, at least) filenames are strings of bytes, not
> > characters. Ditto for argv, environment variables, possibly other
> > cases which I've overlooked.
> 
> I don't think so.  They all are sequences of CChars, and C isn't
> particularly known for keeping bytes and chars apart.

CChar is a C "char", which is a byte (not necessarily an octet, and
not necessarily a character either).
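
For what it's worth, moving between the two at the Haskell level is
just a numeric conversion; a trivial sketch:

    import Foreign.C.Types (CChar)
    import Data.Word (Word8)

    -- CChar is whatever the platform's "char" is (usually a signed
    -- 8-bit type); Word8 is an unsigned octet.
    byteToCChar :: Word8 -> CChar
    byteToCChar = fromIntegral

    ccharToByte :: CChar -> Word8
    ccharToByte = fromIntegral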

> I believe,
> Windows NT has (alternate) filename handling functions that use unicode
> strings.

Almost all of the Win32 API functions which handle strings exist in
both char and wide-char versions.

> This would strengthen the view that a filename is a sequence
> of characters.

It would be reasonable to make FilePath equivalent to String on
Windows, but not on Unix.

> Ditto for argv, env, whatnot; they are typically entered
> from the shell and therefore are characters in the local encoding.

Both argv and envp are char**, i.e. lists of byte strings. There is no
guarantee that the values can be successfully decoded according to the
locale's encoding.

The environment is typically set at login, and inherited thereafter.
It's usually limited to ASCII, but this isn't guaranteed. Similarly,
a program may need to access files which its user didn't create, and
whose filenames aren't valid strings according to the user's locale.

E.g. a user may choose a locale which uses UTF-8, but the sysadmin has
installed files with ISO-8859-1 filenames. If a Haskell program tries
to coerce everything to String using the user's locale, the program
will be unable to access such files.

> > > 3. The default encoding is settable from Haskell, defaults to
> > >    ISO-8859-1.
> > 
> > Agreed.
> 
> Oh no, please don't do that.  A global, settable encoding is, well,
> dys-functional.  Hidden state makes programs hard to understand and
> Haskell imho shouldn't go that route.

There's already plenty of hidden state in the system libraries upon
which a Haskell program depends.

> And please don't introduce the notion of a "default" encoding.

It isn't an issue of *introducing* it. Many Haskell98 functions (i.e. 
much of IO, System and Directory) accept or return Strings, yet have
to be implemented on top of an OS which accepts or provides "char*"s. 
There *has* to be an encoding between the two, and currently it's
hardwired to ISO-8859-1.
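
Concretely, that hardwired conversion amounts to the following (a
sketch, not the actual library internals):

    import Data.Word (Word8)
    import Data.Char (chr, ord)

    -- ISO-8859-1 (Latin-1): each byte maps to the Unicode code point
    -- with the same value, and each Char is truncated to its low 8
    -- bits on output.
    decodeLatin1 :: [Word8] -> String
    decodeLatin1 = map (chr . fromIntegral)

    encodeLatin1 :: String -> [Word8]
    encodeLatin1 = map (fromIntegral . ord)  -- Chars above '\255' are truncated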

The alternative to a global encoding is for *all* functions which
interface to the OS to always either accept or return [CChar] or, if
they accept or return Strings, accept an additional argument which
specifies the encoding.
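
For illustration, an encoding-parameterised variant could be built on
top of a binary-mode handle along these lines (the name is made up,
not a proposed API; in binary mode each Char carries exactly one
byte):

    import System.IO
    import Data.Word (Word8)
    import Data.Char (ord)

    -- Read raw bytes via a binary-mode handle and let the caller
    -- supply the decoder.
    readFileWith :: ([Word8] -> String) -> FilePath -> IO String
    readFileWith decode path = do
        h  <- openBinaryFile path ReadMode
        cs <- hGetContents h
        return (decode (map (fromIntegral . ord) cs))

A [CChar]- or [Word8]-returning variant would be the same thing with
the decoding step left out.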

Also, bear in mind that the functions under discussion are all I/O
functions which, by their nature, deal with state (e.g. the state of
the filesystem).

> I'd like to see the following:
> 
> - Duplicate the IO library.  The duplicate should work with [Byte]
>   everywhere where the old library uses String.  Byte is some suitable
>   unsigned integer, on most (all?) platforms this will be Word8

Technically it should be CChar. However, it's fairly safe to assume
that a byte will always be 8 bits; almost nobody writes code which
works on systems where it isn't.

However: if we go this route, I suspect that we will also need a
convenient method for specifying literal byte strings in Haskell
source code.
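
Failing proper syntax for that, a helper along these lines would do as
a stop-gap (the name is made up, and characters above '\255' are
silently truncated):

    import Data.Word (Word8)
    import Data.Char (ord)

    -- Reuse String literal syntax and keep only the low 8 bits of
    -- each character.
    bytes :: String -> [Word8]
    bytes = map (fromIntegral . ord)

    greeting :: [Word8]
    greeting = bytes "HTTP/1.1 200 OK\r\n"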

> - Provide an explicit conversion between encodings.  A simple conversion
>   of type [Word8] -> String would suit me, iconv would provide all that
>   is needed.

For the general case, you need to allow for stateful encodings (e.g. 
ISO-2022). Actually, even UTF-8 needs to deal with state if you need
to decode byte streams which are split into chunks and the breaks can
occur in the middle of a character (e.g. if you're using non-blocking
I/O).
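
To illustrate, an incremental UTF-8 decoder has to hand back any
incomplete trailing sequence so that the caller can prepend it to the
next chunk; a rough sketch (it ignores overlong forms and surrogates):

    import Data.Word (Word8)
    import Data.Char (chr)
    import Data.Bits ((.&.), (.|.), shiftL)

    -- Decode as much of a chunk as possible; return the decoded
    -- characters together with any incomplete trailing sequence.
    -- Malformed bytes become U+FFFD.
    decodeUtf8Chunk :: [Word8] -> (String, [Word8])
    decodeUtf8Chunk = go
      where
        go []          = ("", [])
        go bs@(b:rest)
          | b < 0x80   = cons (chr (fromIntegral b)) (go rest)
          | b < 0xC0   = cons '\xFFFD' (go rest)     -- stray continuation byte
          | b < 0xE0   = multi 1 (fromIntegral b .&. 0x1F) bs rest
          | b < 0xF0   = multi 2 (fromIntegral b .&. 0x0F) bs rest
          | otherwise  = multi 3 (fromIntegral b .&. 0x07) bs rest

        multi n acc whole rest
          | length rest < n && all isCont rest = ("", whole)  -- save for next chunk
          | all isCont (take n rest) =
              cons (safeChr (foldl addCont acc (take n rest))) (go (drop n rest))
          | otherwise = cons '\xFFFD' (go rest)      -- malformed sequence

        isCont b      = b .&. 0xC0 == 0x80
        addCont acc b = (acc `shiftL` 6) .|. (fromIntegral b .&. 0x3F)
        safeChr n     = if n > 0x10FFFF then '\xFFFD' else chr n
        cons c (s, leftover) = (c : s, leftover)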

> - iconv takes names of encodings as arguments.  Provide some names as
>   constants: one name for the internal encoding (probably UCS4), one
>   name for the canonical external encoding (probably locale dependent).
> 
> - Then redefine the old IO API in terms of the new API and appropriate
>   conversions.

The old API requires an implicit encoding: the OS accepts or provides
bytes, while the old API functions accept or return Chars and don't
take an encoding argument.

This is why we are (or, at least, I am) suggesting a settable current
encoding: the existing API *needs* a current encoding, and I'm
assuming that there may be some reluctance to just discard it
completely.
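
To make the idea concrete, the settable encoding could be as simple as
the sketch below (an illustration only; the names are made up):

    import Data.IORef (IORef, newIORef, readIORef, writeIORef)
    import System.IO.Unsafe (unsafePerformIO)
    import Data.Word (Word8)
    import Data.Char (chr, ord)

    data Encoding = Encoding
      { encode :: String -> [Word8]
      , decode :: [Word8] -> String
      }

    -- The current hardwired behaviour, as the default.
    latin1 :: Encoding
    latin1 = Encoding (map (fromIntegral . ord)) (map (chr . fromIntegral))

    {-# NOINLINE currentEncoding #-}
    currentEncoding :: IORef Encoding
    currentEncoding = unsafePerformIO (newIORef latin1)

    -- The existing String-based I/O functions would consult this when
    -- converting to and from the bytes the OS provides.
    setDefaultEncoding :: Encoding -> IO ()
    setDefaultEncoding = writeIORef currentEncoding

    getDefaultEncoding :: IO Encoding
    getDefaultEncoding = readIORef currentEncoding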

> While we're at it, do away with the annoying CR/LF problem on Windows,
> this should simply be part of the local encoding.  This way file can
> always be opened as binary, hSetBinary can be dropped.  (This won't work
> on ancient platforms where text files and binary files are genuinely
> different, but these are probably not interesting anyway.)

Apart from OS-specific issues, it would be useful to treat EOL
conventions as part of the encoding. E.g. for network protocols which
use CRLF, it would be useful to be able to set CRLF as the EOL
convention then use e.g. hPutStrLn to write lines.
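
At the moment such code has to spell the terminator out by hand, along
these lines (a trivial sketch, made-up name); with CRLF as part of the
handle's encoding, plain hPutStrLn would suffice:

    import System.IO (Handle, hPutStr)

    -- Write a line with an explicit CRLF terminator, as many network
    -- protocols require.
    hPutLineCRLF :: Handle -> String -> IO ()
    hPutLineCRLF h s = hPutStr h (s ++ "\r\n")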

> The same thoughts apply to filenames.  Make them [Word8] and convert
> explicitly.

Well, it's arguable that they should be [Word8] on Unix and String on
Windows. I suppose that you could handle the Windows case by
automatically converting to/from UTF-8.
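
The String-to-bytes direction of that conversion would look roughly
like this (a sketch only; it doesn't reject surrogate code points):

    import Data.Word (Word8)
    import Data.Char (ord)
    import Data.Bits (shiftR, (.&.), (.|.))

    -- Encode a String as UTF-8 bytes.
    encodeUtf8 :: String -> [Word8]
    encodeUtf8 = concatMap enc
      where
        enc c
          | n < 0x80    = [fromIntegral n]
          | n < 0x800   = [0xC0 .|. lead 6,  cont 0]
          | n < 0x10000 = [0xE0 .|. lead 12, cont 6, cont 0]
          | otherwise   = [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
          where
            n      = ord c
            lead k = fromIntegral (n `shiftR` k)
            cont k = 0x80 .|. (fromIntegral (n `shiftR` k) .&. 0x3F)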

> By the way, I think a path should be a list of names (that
> is of type [[Word8]]) and the library would be concerned with putting in
> the right path separator.  Add functions to read and show pathnames in
> the local conventions and we'll never need to worry about path
> separators again.

There would certainly be some advantages to making FilePath an
abstract type, but there are quite a few corner cases to deal with.
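
As a rough sketch of the idea (Unix rendering only; the
absolute-versus-relative distinction is one of those corner cases, and
is simply dropped here):

    import Data.Word (Word8)
    import Data.Char (ord)
    import Data.List (intercalate)

    -- A path as a list of name components; the separator only appears
    -- when rendering in the local convention.
    type Name = [Word8]
    type Path = [Name]

    -- Render under Unix conventions: '/' between components.
    showPathUnix :: Path -> [Word8]
    showPathUnix = intercalate [fromIntegral (ord '/')]

    -- Parse a Unix byte string back into components, dropping empty
    -- components (so leading or doubled separators are lost; settling
    -- that is one of the corner cases).
    readPathUnix :: [Word8] -> Path
    readPathUnix = filter (not . null) . foldr step [[]]
      where
        sep = fromIntegral (ord '/') :: Word8
        step b acc@(cur:rest)
          | b == sep  = [] : acc
          | otherwise = (b : cur) : rest
        step _ []     = [[]]   -- unreachable; keeps the pattern total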

> > There are limits to the extent to which this can be achieved. E.g. 
> > what happens if you set the encoding to UTF-8, then call
> > getDirectoryContents for a directory which contains filenames which
> > aren't valid UTF-8 strings?
> 
> Well, then you did something stupid, didn't you?  If you don't know the
> encoding you shouldn't decode anything.  That's a strong point against
> any implicit decoding, I think.

Yes. However, I suspect that we will have to live with some of the
mistakes of the past, i.e. using String in the I/O functions.

-- 
Glynn Clements <glynn.clements at virgin.net>

