[Haskell-cafe] Writing binary files?

Sun Sep 12 16:28:09 EDT 2004

Glynn Clements <glynn.clements at virgin.net> writes:

>> But the default encoding should
>> come from the locale instead of being ISO-8859-1.
>
> The problem with that is that, if the locale's encoding is UTF-8, a
> lot of stuff is going to break (i.e. anything in ISO-8859-* which
> isn't limited to the 7-bit ASCII subset).

What about this transition path:

1. An API for manipulating byte sequences in I/O (without representing
   them in the String type).

2. An API for converting between explicitly specified encodings and
   byte sequences, including attaching converters to Handles. There
   is also a way to obtain the locale encoding.

3. The default encoding is settable from Haskell and defaults to
   ISO-8859-1.

4. Libraries are reviewed to ensure that they work with various
   encoding settings.

5. The default encoding is settable from Haskell and defaults to the
   locale encoding.
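
To make points 1 and 2 more concrete, here is a minimal sketch of
what such an API could look like. Every name in it (Bytes, Encoding,
latin1Enc) is made up for illustration; it is not an existing library
interface, and a real design would use a packed byte array rather
than a list.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical: a byte sequence kept apart from String (point 1).
type Bytes = [Word8]

-- Hypothetical: an explicitly named encoding with both directions
-- (point 2). Left carries an error message.
data Encoding = Encoding
  { encName :: String
  , decode  :: Bytes -> Either String String
  , encode  :: String -> Either String Bytes
  }

-- ISO-8859-1: byte n is code point n, so decoding never fails;
-- encoding fails for characters above U+00FF.
latin1Enc :: Encoding
latin1Enc = Encoding "ISO-8859-1" dec enc
  where
    dec = Right . map (chr . fromIntegral)
    enc = mapM toByte
    toByte c
      | ord c <= 0xFF = Right (fromIntegral (ord c))
      | otherwise     = Left ("not representable in ISO-8859-1: " ++ show c)
```

With this shape, a program states its encoding where it converts,
e.g. decode latin1Enc bytes, instead of relying on a global default.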

Points 1-3 don't change the behavior of existing programs, but they
make it possible to start writing libraries and programs which
manipulate something other than text in the default encoding, and
which will keep working in the future.

Once the relevant libraries work with a changed default encoding,
programs which use them may begin their main function by setting
the default encoding to the locale encoding.
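
As a rough sketch of that opt-in step: the code below uses
hSetEncoding and localeEncoding from GHC's System.IO and applies the
encoding per handle, since that is what the module offers. The helper
name setStdEncoding is made up here.

```haskell
import System.IO

-- Hypothetical helper: put the three standard handles into the
-- given encoding.
setStdEncoding :: TextEncoding -> IO ()
setStdEncoding enc = mapM_ (\h -> hSetEncoding h enc) [stdin, stdout, stderr]
```

A program opting in early would then start with something like
main = setStdEncoding localeEncoding >> realMain, where realMain
stands for the rest of the program.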

Finally, once libraries and programs which break in this setting are
considered obsolete, the default is changed.

> The advantage of assuming ISO-8859-* is that the decoder can't fail;
> every possible stream of bytes is valid.

Assuming. But UTF-8 is not ISO-8859-*. When someday I change most of
my files and filenames from ISO-8859-2 to UTF-8, and change the
locale, the assumption will be wrong. I can't change that now, because
too many programs would break.

The current ISO-8859-1 assumption is also wrong. A Haskell program
which sorts strings breaks on non-ASCII letters even now, because my
text is in ISO-8859-2 unless specified otherwise.
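
A small illustration of why the wrong assumption matters for
anything that classifies or case-maps characters, as a
case-insensitive sort would. The decoding table is deliberately
partial and the function names are made up; only the listed
ISO-8859-2 positions are guaranteed right.

```haskell
import Data.Char (chr, isAlpha, toUpper)
import Data.Word (Word8)

-- Decode a byte as ISO-8859-1: byte n is simply code point n.
fromLatin1 :: Word8 -> Char
fromLatin1 = chr . fromIntegral

-- Decode a byte as ISO-8859-2. Only a few Polish letters are listed;
-- a real table would cover all of 0xA0..0xFF.
fromLatin2 :: Word8 -> Char
fromLatin2 b = case b of
  0xA1 -> '\x0104'  -- capital A with ogonek
  0xB1 -> '\x0105'  -- small a with ogonek
  0xA3 -> '\x0141'  -- capital L with stroke
  0xB3 -> '\x0142'  -- small l with stroke
  0xE6 -> '\x0107'  -- small c with acute
  0xEA -> '\x0119'  -- small e with ogonek
  -- Fallback to the ISO-8859-1 reading; this is only correct for
  -- some of the remaining high bytes.
  _    -> chr (fromIntegral b)
```

Byte 0xB1 is a letter under the ISO-8859-2 reading, so
isAlpha (fromLatin2 0xB1) holds and toUpper maps it to the character
for byte 0xA1. Under the ISO-8859-1 reading the same byte is the
plus-minus sign, which is not a letter and has no upper-case form.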

> The key problem with using the locale is that you frequently encounter
> files which aren't in the locale's encoding, and for which the
> encoding can't easily be deduced.

Programs should either explicitly set the encoding for I/O on these
files to ISO-8859-1, or manipulate them as binary data.
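
For the first option, a sketch of what "explicitly set the encoding"
can look like, using withFile, hSetEncoding and latin1 from GHC's
System.IO. The two helper names are made up for this example.

```haskell
import System.IO

-- Read a file of unknown encoding as ISO-8859-1, so that each byte
-- becomes exactly one Char.
readFileLatin1 :: FilePath -> IO String
readFileLatin1 path = withFile path ReadMode $ \h -> do
  hSetEncoding h latin1
  s <- hGetContents h
  -- Force the lazy read to finish before withFile closes the handle.
  length s `seq` return s

-- Write a String back out with the same explicit encoding.
writeFileLatin1 :: FilePath -> String -> IO ()
writeFileLatin1 path s = withFile path WriteMode $ \h -> do
  hSetEncoding h latin1
  hPutStr h s
```

Because the encoding is named at the point of I/O, such code does not
depend on whatever the default happens to be.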

The problem is that the API for that has not even been designed yet,
so programs can't be written in a way that will keep working after
the default encoding changes.

> OTOH, if you assume UTF-8 (e.g. because that happens to be the
> locale's encoding), the decoder is likely to abort shortly after the
> first non-ASCII character it finds (either that, or it will just
> silently drop characters).

Detectable errors should not be silenced automatically, so it would
fail. The change to the default encoding must therefore happen some
time after it becomes possible to write programs which would not
fail.
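
To show what "not silenced" means in practice, here is a sketch of a
UTF-8 decoder that stops at the first malformed sequence and reports
where it is, rather than dropping or replacing anything. The names
DecodeError, InvalidByteAt and decodeUtf8Strict are made up for this
example.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical error type: offset of the first byte of the sequence
-- that could not be decoded.
newtype DecodeError = InvalidByteAt Int
  deriving (Eq, Show)

-- Strict UTF-8 decoding: fail at the first malformed sequence.
decodeUtf8Strict :: [Word8] -> Either DecodeError String
decodeUtf8Strict = go 0
  where
    go _ [] = Right []
    go i (b : bs)
      | b < 0x80 = (chr (fromIntegral b) :) <$> go (i + 1) bs
      | b >= 0xC2 && b <= 0xDF = multi i 1 (fromIntegral b .&. 0x1F) 0x80 bs
      | b >= 0xE0 && b <= 0xEF = multi i 2 (fromIntegral b .&. 0x0F) 0x800 bs
      | b >= 0xF0 && b <= 0xF4 = multi i 3 (fromIntegral b .&. 0x07) 0x10000 bs
      | otherwise = Left (InvalidByteAt i)

    -- Read n continuation bytes, then reject overlong forms,
    -- surrogates, and values above U+10FFFF.
    multi i n acc0 minCp bs =
      case splitAt n bs of
        (cs, rest)
          | length cs == n && all isCont cs ->
              let cp = foldl step acc0 cs
              in if cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)
                   then Left (InvalidByteAt i)
                   else (chr cp :) <$> go (i + n + 1) rest
          | otherwise -> Left (InvalidByteAt i)

    step a c = (a `shiftL` 6) .|. (fromIntegral c .&. 0x3F)

    isCont c = c .&. 0xC0 == 0x80
```

Fed ISO-8859-2 text, this returns a Left at the first non-ASCII
letter in almost every case, which is exactly the failure a program
has to be able to avoid before the default can change.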

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/