[Haskell-cafe] Writing binary files?

Fri Sep 17 09:03:28 EDT 2004

Glynn Clements <glynn.clements at virgin.net> writes:

> What I'm suggesting in the above is to sidestep the encoding issue
> by keeping filenames as byte strings wherever possible.

OK, but let it be in addition to, not instead of, treating them as
character strings.

> And program-generated email notifications frequently include text with
> no known encoding (i.e. binary data).

No, programs don't dump binary data among diagnostic messages. If they
output binary data to stdout, it's their only output and it's redirected
to a file or another process.

> Or are you going to demand that anyone who tries to hack into your
> system only sends it UTF-8 data so that the alert messages are
> displayed correctly in your mail program?

The email protocol is text-only. It may mangle newlines, it has
a maximum line length, some texts may be escaped during transport
(e.g. "From " at the beginning of a line). Arbitrary binary data
should be put in base64-or-otherwise-encoded attachments.
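
To make the base64 suggestion concrete, here is a minimal sketch of an
encoder using only the base libraries. It follows the standard base64
alphabet; the names alphabet, base64 and sextets are invented for this
illustration, and a real program would more likely use an existing library.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- The standard base64 alphabet: 64 ASCII characters that survive mail transport.
alphabet :: String
alphabet = ['A'..'Z'] ++ ['a'..'z'] ++ ['0'..'9'] ++ "+/"

-- Encode arbitrary bytes as base64 text, padding the last group with '='.
base64 :: [Word8] -> String
base64 (a:b:c:rest) = sextets 4 [a, b, c] ++ base64 rest
base64 [a, b]       = sextets 3 [a, b, 0] ++ "="
base64 [a]          = sextets 2 [a, 0, 0] ++ "=="
base64 []           = ""

-- Pack three bytes into 24 bits and emit the first n of the four 6-bit digits.
sextets :: Int -> [Word8] -> String
sextets n bytes = take n [alphabet !! ((w `shiftR` s) .&. 63) | s <- [18, 12, 6, 0]]
  where
    w = foldl (\acc x -> (acc `shiftL` 8) .|. fromIntegral x) 0 bytes :: Int

main :: IO ()
main =
  -- Bytes that a text-only transport could mangle: "From ", NUL, 0xFF.
  putStrLn (base64 (map (fromIntegral . ord) "From \0\255"))
```

The output contains only letters, digits, '+', '/' and '=', so neither
newline mangling nor "From " escaping can touch the payload.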

If the cron program embeds the output as email body, the cron job
should not dump arbitrary binary data to stdout. Encoding is not the
only problem.

>> Processing data in their original byte encodings makes supporting
>> multiple languages harder. Filenames which are inexpressible as
>> character strings get in the way of clean APIs. When considering only
>> filenames, using bytes would be sufficient, but in overall it's more
>> convenient to Unicodize them like other strings.
>
> It also harms reliability. Depending upon the encoding, two distinct
> byte strings may have the same Unicode representation.

Such encodings are not suitable for filenames.

http://www.mail-archive.com/linux-utf8@humbolt.nl.linux.org/msg00376.html

| ISO-2022-JP will never be a satisfactory terminal encoding (like
| ISO-8859-*, EUC-*, UTF-8, Shift_JIS) because
|
| 1) It is a stateful encoding. What happens when a program starts some
| terminal output and then is interrupted using Ctrl-C or Ctrl-Z? The
| terminal will remain in the shifted state, while other programs start
| doing output. But these programs expect that when they start, the
| terminal is in the initial state. The net result will be garbage on
| the screen.
|
| 2) ISO-2022-JP is not filesystem safe. Therefore filenames will never
| be able to carry Japanese characters in this encodings.
|
| Robert Brady writes:
| > Does ISO-2022 see much/any use as the locale encoding, or it it just used
| > for interchange?
|
| Just for interchange.
|
| Paul Eggert searched for uses of ISO-2022-JP as locale encodings (in
| order to convince me), and only came up with a handful of questionable
| URLs. He didn't convince me. And there are no plans to support
| ISO-2022-JP as a locale encoding in glibc - because of 1) and 2) above.

For me ISO-2022 is a brain-damaged concept and should die. Almost
nothing supports it anyway.
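
Both objections in the quoted message, and the earlier point about
distinct byte strings decoding to the same text, can be illustrated with
a toy decoder. This is a rough sketch, not a real ISO-2022-JP codec: it
assumes only two escape sequences, ESC ( B (ASCII) and ESC $ B (JIS X 0208),
and leaves two-byte codes unmapped instead of translating them to Unicode.
The names Mode, Token and decode are made up for the illustration.

```haskell
import Data.Word (Word8)

-- Which character set the decoder is currently shifted into.
data Mode = Ascii | Jis0208 deriving (Eq, Show)

-- A decoded unit: an ASCII byte, or a two-byte JIS X 0208 code left unmapped.
data Token = A Word8 | J Word8 Word8 deriving (Eq, Show)

-- Toy decoder: returns the tokens and the mode the stream ends in.
decode :: Mode -> [Word8] -> ([Token], Mode)
decode _ (0x1B : 0x28 : 0x42 : rest) = decode Ascii rest     -- ESC ( B
decode _ (0x1B : 0x24 : 0x42 : rest) = decode Jis0208 rest   -- ESC $ B
decode Jis0208 (a : b : rest) =
  let (ts, m) = decode Jis0208 rest in (J a b : ts, m)
decode mode (a : rest) =
  let (ts, m) = decode mode rest in (A a : ts, m)
decode mode [] = ([], mode)

main :: IO ()
main = do
  let plain     = [0x61, 0x62, 0x63] :: [Word8]          -- "abc"
      redundant = [0x1B, 0x28, 0x42] ++ plain            -- ESC ( B, then "abc"
      -- Hiragana "ku" (JIS X 0208 code 0x24 0x2F) wrapped in escapes.
      ku        = [0x1B, 0x24, 0x42, 0x24, 0x2F, 0x1B, 0x28, 0x42] :: [Word8]
  -- Two different byte strings, one decoded text.
  print (fst (decode Ascii plain) == fst (decode Ascii redundant))
  -- The encoded form contains the byte for '/'.
  print (0x2F `elem` ku)
  -- Output cut off after 5 bytes leaves the stream in the shifted state.
  print (snd (decode Ascii (take 5 ku)))
```

The last line is the Ctrl-C case: whatever ASCII the next program writes
would be read as pairs of JIS bytes until something resets the state.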

>> Such tarballs are not portable across systems using different encodings.
>
> Well, programs which treat filenames as byte strings to be read from
> argv[] and passed directly to open() won't have any problems with this.

The OS itself may have problems with this; only some filesystems
accept arbitrary bytes apart from '\0' and '/' (and give special
meaning to '.'). Exotic characters in filenames are not very
portable.
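
As a sketch of what "arbitrary bytes apart from '\0' and '/'" means in
practice, the check below accepts a single path component under that most
permissive rule, and a second check restricts it to the POSIX portable
filename character set. The function names are hypothetical, and real
filesystems may impose further limits (length, case folding, reserved
names) that are not modelled here.

```haskell
import Data.Word (Word8)

-- True if the bytes are acceptable as one path component on a filesystem
-- that allows every byte except NUL and '/'.
validComponent :: [Word8] -> Bool
validComponent name =
  not (null name)
    && all (`notElem` [0x00, 0x2F]) name
    && name `notElem` [[0x2E], [0x2E, 0x2E]]   -- "." and ".." are reserved

-- Stricter check: only letters, digits, '.', '_' and '-' (the POSIX
-- portable filename character set).
portableComponent :: [Word8] -> Bool
portableComponent name = validComponent name && all portable name
  where
    portable b =
      toEnum (fromIntegral b) `elem` (['A'..'Z'] ++ ['a'..'z'] ++ ['0'..'9'] ++ "._-")

main :: IO ()
main = do
  let exotic = [0x66, 0xC5, 0xBC] :: [Word8]   -- 'f' followed by two high bytes
  print (validComponent exotic)     -- fine for a byte-transparent filesystem
  print (portableComponent exotic)  -- but not portable
```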

>> A Haskell program in my world can do that too. Just set the encoding
>> to Latin1.
>
> But programs should handle this by default, IMHO.

IMHO it's more important to make them compatible with the
representation of strings used in other parts of the program.
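
The reason Latin1 works as a byte-transparent setting is that it maps
byte n to code point n, so decoding never fails and encoding gets the
original bytes back. A minimal sketch, with fromLatin1 and toLatin1 as
illustrative names and not part of any existing API:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode bytes as ISO-8859-1: byte n becomes code point n. Never fails.
fromLatin1 :: [Word8] -> String
fromLatin1 = map (chr . fromIntegral)

-- Encode back to bytes; Nothing if some character is above U+00FF.
toLatin1 :: String -> Maybe [Word8]
toLatin1 = mapM enc
  where
    enc c | ord c <= 0xFF = Just (fromIntegral (ord c))
          | otherwise     = Nothing

main :: IO ()
main = do
  let raw = [0x66, 0x6F, 0xC5, 0xBC, 0xFF] :: [Word8]   -- not valid UTF-8
  -- The round trip preserves every byte, whatever the bytes meant.
  print (toLatin1 (fromLatin1 raw) == Just raw)
```

The price is that the resulting String holds the right characters only
if the data really was ISO-8859-1; otherwise it is a byte string in
disguise, which is the incompatibility with the rest of the program
mentioned above.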

> Filenames are, for the most part, just "tokens" to be passed around.

Filenames are often stored in text files, whose bytes are interpreted
as characters. Applying QP to non-ASCII parts of filenames is suitable
only if humans won't edit these files by hand.

>> > My specific point is that the Haskell98 API has a very big problem due
>> > to the assumption that the encoding is always known. Existing
>> > implementations work around the problem by assuming that the encoding
>> > is always ISO-8859-1.
>> 
>> The API is incomplete and needs to be enhanced. Programs written using
>> the current API will be limited to using the locale encoding.
>
> That just adds unnecessary failure modes.

But otherwise programs would keep having bugs in handling text
which is not ISO-8859-1, especially with multibyte encodings, where
the trick of pretending that ISO-8859-2 is ISO-8859-1 too often
doesn't work.

I can't switch my environment to UTF-8 yet precisely because too many
programs were written with the attitude you are promoting: they don't
care about the encoding, they just pass bytes around.

Bugs range from small annoyances like tabular output which doesn't
line up, through mangled characters on a graphical display, to
full-screen interactive programs being unusable on a UTF-8 terminal.
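
The misaligned-table bug is easy to reproduce. The sketch below pads a
word to a column width twice, once counting characters and once counting
UTF-8 bytes as a byte-passing program would. The small encoder and the
names utf8Char, utf8 and padBy are written just for this example; even
the character count is only an approximation of display width, since
combining and double-width characters are ignored.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one character as UTF-8 (1 to 4 bytes).
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n        = ord c
    lead m s = fromIntegral (m .|. (n `shiftR` s))
    cont s   = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

utf8 :: String -> [Word8]
utf8 = concatMap utf8Char

-- Pad with spaces to a column width, using the given notion of length.
padBy :: (String -> Int) -> Int -> String -> String
padBy measure width s = s ++ replicate (width - measure s) ' '

main :: IO ()
main = do
  -- A Polish word of 4 characters, 3 of them non-ASCII (written as escapes).
  let word = "\380\243\322w"
  print (length word, length (utf8 word))
  -- Columns actually occupied after padding to width 8, by each measure.
  print (length (padBy length 8 word), length (padBy (length . utf8) 8 word))
```

Padding by byte count adds too few spaces, so every row containing
non-ASCII text ends up narrower than its neighbours.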

>> This encoding would be incompatible with most other texts seen by the
>> program. In particular reading a filename from a file would not work
>> without manual recoding.
>
> We already have that problem; you can't read non-Latin1 strings from
> files.

This is going to be fixed. Some time after the API enhancements it
should become the default.

> BTW, that's why Emacs (and XEmacs) support ISO-2022 much better than
> they do UTF-8. Because MuLE was written by Japanese developers.

And that's why I haven't used Emacs for years. The default
installation of XEmacs (at least on PLD Linux Distribution) doesn't
handle *any* non-ASCII characters properly. When I enter some Polish
letters and save a file, it produces some ISO-2022 garbage that
nothing can read, including XEmacs itself. When I open the
existing file, remove the escaped nonsense, enter Polish letters again
and save the file again, all non-ASCII characters are replaced with
tildes.

GNU Emacs is better, but still doesn't respect the locale and must be
explicitly told about the encoding. The locale mechanism was invented
precisely to avoid informing each and every program in its own
configuration about the encoding and other things to be used by
default. Emacs ignores this.

>> > Which is one of the reasons why they are likely to persist for longer
>> > than UTF-8 "true believers" might like.
>> 
>> My I/O design doesn't force UTF-8, it works with ISO-8859-x as well.
>
> But I was specifically addressing Unicode versus multiple encodings
> internally. The size of the Unicode "alphabet" effectively prohibits
> using codepoints as indices.

ISO-2022 is even less suitable. I can't imagine an ISO-2022 regexp.

As long as more than 256 distinct characters are needed, the ISO-8859-x
encodings are not suitable at all, so it doesn't matter that they would
be more convenient if they worked.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/