[Haskell-cafe] Writing binary files?

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Mon Sep 13 02:33:57 EDT 2004


Glynn Clements <glynn.clements at virgin.net> writes:

>> 1. API for manipulating byte sequences in I/O (without representing
>>    them in String type).
>
> Note that this needs to include all of the core I/O functions, not
> just reading/writing streams. E.g. FilePath is currently an alias for
> String, but (on Unix, at least) filenames are strings of bytes, not
> characters. Ditto for argv, environment variables, possibly other
> cases which I've overlooked.

They don't hold binary data; they hold data intended to be interpreted
as text. If the encoding of the text doesn't agree with the locale,
the environment setup is broken and 'ls' and 'env' misbehave on a
UTF-8 terminal.

A program can explicitly set the default encoding to ISO-8859-1 if it
wishes to do something in a broken environment.

>> 4. Libraries are reviewed to ensure that they work with various
>>    encoding settings.
>
> There are limits to the extent to which this can be achieved. E.g. 
> what happens if you set the encoding to UTF-8, then call
> getDirectoryContents for a directory which contains filenames which
> aren't valid UTF-8 strings?

The library fails. Don't do that. This environment is internally
inconsistent.

> I feel that the default encoding should be one whose decoder cannot
> fail, e.g. ISO-8859-1.

But filenames on my filesystem and most file contents are *not*
encoded in ISO-8859-1. Assuming that they are ISO-8859-1 is plainly
wrong.

> You should have to explicitly request the use of the locale's
> encoding (analogous to calling setlocale(LC_CTYPE, "") at the start
> of a C program; there's a good reason why C doesn't do this without
> being explicitly told to).

C usually follows the paradigm of representing texts in their original
8-bit encodings. This is why getting C programs to work in a UTF-8
locale is such a pain. Only some programs use wchar_t internally.

Java and C# use the paradigm of representing text as Unicode
internally, recoding it at the boundaries with the external world.

The second paradigm has a cost: you must know which encodings are used
in the texts you manipulate. The locale gives a reasonable default for
simple programs which aren't supposed to work with multiple encodings,
and it specifies the encoding of texts which don't have an encoding
specified elsewhere (terminal I/O, filenames, environment variables).

It also has benefits:

1. It's easier to work with multiple encodings, because the internal
   representation can hold text decoded from any of them and is the
   same throughout the program.

2. It's much easier to work in a UTF-8 environment, and to work with
   libraries which use Unicode internally (e.g. Gtk+ or Qt).

3. isAlpha, toUpper etc. are true pure functions. (The Haskell API is
   broken in a different way here: toUpper should be defined on
   strings, not on single characters.)
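
As an illustration of the second paradigm, here is a minimal sketch
using the bytestring and text packages (which postdate this thread;
the file names are placeholders): bytes are decoded once at the
boundary, manipulated as Unicode, and encoded again on the way out.

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import Data.Text.Encoding (decodeUtf8, encodeUtf8)

    main :: IO ()
    main = do
      -- Boundary: read raw bytes and decode them once with a known encoding.
      bytes <- B.readFile "input.txt"         -- placeholder file name
      let text = T.toUpper (decodeUtf8 bytes) -- everything inside is Unicode
      -- Boundary again: encode explicitly before the bytes leave the program.
      B.writeFile "output.txt" (encodeUtf8 text)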

> Actually, the more I think about it, the more I think that "simple,
> stupid programs" probably shouldn't be using Unicode at all.

This attitude causes them to break in a UTF-8 environment, which is
why I can't use it as a default yet.

The ncurses wide-character API is still broken. I reported bugs, the
author acknowledged them, but hasn't fixed them. (Attributes are
ignored by add_wch; get_wch is wrong for non-ASCII keys pressed when
the locale is neither ISO-8859-1 nor UTF-8.) It seems people don't use
that API yet, because C traditionally uses the model of representing
texts as byte sequences. But the narrow-character API of ncurses is
unusable with UTF-8; this is not an implementation limitation but an
inherent limitation of the interface.

> I.e. Char, String, string literals, and the I/O functions in Prelude,
> IO etc should all be using bytes, with a distinct wide-character API
> available for people who want to make the (substantial) effort
> involved in writing (genuinely) internationalised programs.

This would cause excessive duplication of APIs. Look, Java and C#
don't do that. Only file contents handling needs a byte API, because
many files don't contain text.

This would imply isAlpha :: Char -> IO Bool, because the result would
depend on the locale chosen at run time.
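
With Char as a Unicode code point the classification functions stay
pure. A minimal sketch (the behaviour shown assumes GHC's
Unicode-aware Data.Char):

    import Data.Char (isAlpha, toUpper)

    -- Because Char is a Unicode code point, classification and case
    -- mapping do not depend on the run-time locale and need no IO.
    main :: IO ()
    main = do
      print (isAlpha 'ż')             -- True, no IO involved
      putStrLn (map toUpper "łódź")   -- prints ŁÓDŹ on a UTF-8 terminal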

> Right now, the attempt at providing I18N "for free", by defining Char
> to mean Unicode, has essentially backfired, IMHO.

Because it needs to be accompanied by character recoders, both invoked
explicitly (including lazily) and attached to file handles, and by a
way to obtain recoders for various encodings.

The assumption that the same encoding is used everywhere, so that
programs can just copy bytes without interpreting them, no longer
works today. A mail client is expected to respect the encoding set in
headers.

> Oh, and because bytes are being stored in Chars, the type system won't
> help if you neglect to decode a string, or if you decode it twice.

This is why I said "1. API for manipulating byte sequences in I/O
(without representing them in String type)".

> 2. If you assume ISO-8859-1, you can always convert back to Word8 then
> re-decode as UTF-8. If you assume UTF-8, anything which is neither
> UTF-8 nor ASCII will fail far more severely than just getting the
> collation order wrong.

If I know the encoding, I should set the I/O handle to that encoding
in the first place instead of reinterpreting characters which have
been read using the default.
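
Later versions of GHC grew roughly this in System.IO; a hedged sketch
(the file name and encoding choice are illustrative only):

    import System.IO

    main :: IO ()
    main = do
      h <- openFile "latin2.txt" ReadMode   -- placeholder file name
      -- Tell the handle which encoding the file actually uses, instead
      -- of reading with the locale default and re-decoding afterwards.
      enc <- mkTextEncoding "ISO-8859-2"
      hSetEncoding h enc
      contents <- hGetContents h
      putStrLn contents                     -- already proper Unicode Chars
      hClose h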

> Well, my view is essentially that files should be treated as
> containing bytes unless you explicitly choose to decode them, at which
> point you have to specify the encoding.

No problem, you can use the byte I/O API for text files if you wish.
But it does not work the other way around.

> Personally, I would take the C approach: redefine Char to mean a byte
> (i.e. CChar), treat string literals as bytes, keep the existing type
> signatures on all of the existing Haskell98 functions, and provide a
> completely new wide-character API for those who wish to use it.

Well, this is the paradigm which has problems in different areas.
It will often break in a UTF-8 locale, it needs isAlpha :: Char -> IO Bool,
and it makes supporting multiple encodings painful.

Char is *the* new API. What is missing is a byte API in the areas
which work with arbitrary binary data (mostly file contents).

> My main concern is that someone will get sick of waiting and make the
> wrong "fix", i.e. keep the existing API but default to the locale's
> encoding, so that every simple program then has to explicitly set it
> back to ISO-8859-1 to get reasonable worst-case behaviour.

Supporting byte I/O and supporting character recoding need to be done
before this.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
