[Haskell-cafe] Writing binary files?

Glynn Clements glynn.clements at virgin.net
Wed Sep 15 17:56:59 EDT 2004


Marcin 'Qrczak' Kowalczyk wrote:

> >> When I switch my environment to UTF-8, which may happen in a few
> >> years, I will convert filenames to UTF-8 and set up mount options to
> >> translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.
> >
> > But what about files which were created by other people, who
> > don't use UTF-8?
> 
> All people sharing a filesystem should use the same encoding.

Again, this is just "hand waving" the issues away.

> BTW, when ftping files between Windows and Unix, a good ftp client
> should convert filenames to keep the same characters rather than
> bytes, so CP-1250 encoded names don't come as garbage in the encoding
> used on Unix which is definitely different (ISO-8859-2 or UTF-8) or
> vice versa.

Which is fine if the FTP client can figure out which encoding is used
on the remote end. In practice, you have to tell it, i.e. have a list
of which servers (or even which directories on which servers) use
which encoding.

> >> I expect good programs to understand that and display them
> >> correctly no matter what technique they are using for the display.
> >
> > When it comes to display, you have to deal with encoding
> > issues one way or another. But not all programs deal with display.
> 
> So you advocate using multiple encodings internally. This is in
> general more complicated than what I advocate: using only Unicode
> internally, limiting other encodings to I/O boundary.

How do you draw that conclusion from what I wrote here?

There are cases where it's advantageous to use multiple encodings, but I
wasn't suggesting that in the above. What I'm suggesting above is to
sidestep the encoding issue by keeping filenames as byte strings
wherever possible.
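
To make that concrete, here is a minimal sketch of the byte-string
approach. It uses the ByteString-based interface of the "unix" package
(a later addition, not part of Haskell98), so treat it as an
illustration of the idea rather than of any API that existed at the
time: the name is read as raw bytes from the command line and handed
straight back to the OS, and no encoding is ever assumed.

import qualified Data.ByteString.Char8 as B
import System.Posix.Env.ByteString (getArgs)
import System.Posix.Files.ByteString (rename)

-- Rename a file without ever interpreting its name as characters:
-- the bytes from argv go straight back to the kernel.
main :: IO ()
main = do
  args <- getArgs
  case args of
    [from, to] -> rename from to
    _          -> B.putStrLn (B.pack "usage: mv-verbatim FROM TO")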

> > The core OS and network server applications essentially remain
> > encoding-agnostic.
> 
> Which is a problem when they generate an email, e.g. to send a
> non-empty output of a cron job, or report unauthorized use of sudo.
> If the data involved is not pure ASCII, it will often be mangled.

It only gets mangled if you feed it to a program which is making
assumptions about the encoding. Non-MIME messages neither specify nor
imply an encoding. MIME messages can use either
"text/plain; charset=x-unknown" or application/octet-stream if they
don't understand the encoding.

And program-generated email notifications frequently include text with
no known encoding (i.e. binary data). Or are you going to demand that
anyone who tries to hack into your system only sends it UTF-8 data so
that the alert messages are displayed correctly in your mail program?

> It's rarely a problem in practice because filenames, command
> arguments, error messages, user full names etc. are usually pure
> ASCII. But this is slowly changing.

To the extent that non-ASCII filenames are used, I've encountered far
more filenames in both Latin1 and ISO-2022 than in UTF-8. Japanese FTP
sites typically use ISO-2022 for everything; even ASCII names may have
"\e(B" prepended to them.

> > But, as I keep pointing out, filenames are byte strings, not
> > character strings. You shouldn't be converting them to character
> > strings unless you have to.
> 
> Processing data in their original byte encodings makes supporting
> multiple languages harder. Filenames which are inexpressible as
> character strings get in the way of clean APIs. When considering only
> filenames, using bytes would be sufficient, but overall it's more
> convenient to Unicodize them like other strings.

It also harms reliability. Depending upon the encoding, two distinct
byte strings may have the same Unicode representation.

E.g. if you are interfacing to a server which uses ISO-2022 for
filenames, you have to get the escapes correct even when they are
no-ops in terms of the string representation. If you obtain a
directory listing, receive the filename "\e(Bfoo.txt", and convert it
to Unicode, you get "foo.txt". If you then convert it back without the
leading escape, the server is going to say "file not found".
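
Here is a toy illustration of that round trip (not a real ISO-2022
codec, just the two steps that matter here): the decoder drops the
redundant "designate ASCII" escape, a naive encoder has no reason to
put it back, and so the name sent back to the server is no longer the
name it gave us.

import Data.List (isPrefixOf)

-- Decode: "\ESC(B" designates ASCII, a no-op for the character
-- content, so it is simply dropped.
decode :: String -> String
decode s
  | "\ESC(B" `isPrefixOf` s = decode (drop 3 s)
decode (c:cs)               = c : decode cs
decode []                   = []

-- Encode: ASCII is the default state, so nothing is emitted for it.
encode :: String -> String
encode = id

main :: IO ()
main = do
  let original = "\ESC(Bfoo.txt"
  putStrLn (decode original)                    -- foo.txt
  print (encode (decode original) == original)  -- False: escape lost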

> > The term "mismatch" implies that there have to be at least two things.
> > If they don't match, which one is at fault? If I make a tar file
> > available for you to download, and it contains non-UTF-8 filenames, is
> > that my fault or yours?
> 
> Such tarballs are not portable across systems using different encodings.

Well, programs which treat filenames as byte strings to be read from
argv[] and passed directly to open() won't have any problems with
this. It's only a problem if you make it a problem.

> If I tar a subdirectory stored on ext2 partition, and you untar it on
> a vfat partition, whose fault is it that files which differ only in
> case are conflated?

Arguably, it's Microsoft's fault for not considering the problems
caused by multiple encodings when they decided that filenames were
going to be case-folded.

> > In any case, if a program refuses to deal with a file because it
> > cannot convert the filename to characters, even when it doesn't have
> > to, it's the program which is at fault.
> 
> Only if it's a low-level utility, to be used in an unfriendly
> environment.
> 
> A Haskell program in my world can do that too. Just set the encoding
> to Latin1.

But programs should handle this by default, IMHO. Filenames are, for
the most part, just "tokens" to be passed around. You get a value from
argv[], and pass it to open() or whatever. It doesn't need to have any
meaning.

> > My specific point is that the Haskell98 API has a very big problem due
> > to the assumption that the encoding is always known. Existing
> > implementations work around the problem by assuming that the encoding
> > is always ISO-8859-1.
> 
> The API is incomplete and needs to be enhanced. Programs written using
> the current API will be limited to using the locale encoding.

That just adds unnecessary failure modes.
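
For reference, this is roughly what the "assume ISO-8859-1" workaround
amounts to (a sketch of the behaviour, not of any particular
implementation's internals): each byte becomes the Char with the same
code point on input, and the code point is truncated back to eight
bits on output, so anything outside Latin1 fails silently.

import Data.Char (chr, ord)
import Data.Word (Word8)

-- Reading: byte 0xNN becomes the character U+00NN.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Writing: the code point is truncated to 8 bits, so characters above
-- U+00FF are silently corrupted instead of being reported as errors.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)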

> Just as readFile is limited to text files because of line endings.
> What do you prefer: to provide a non-Haskell98 API for binary files,
> or to "fix" the current API by forcing programs to use "\r\n" on
> Windows and "\n" on Unix manually?

That's a harder case. There is a good reason for auto-converting EOL,
as most programs actually process file contents. Most programs don't
"process" filenames; they just pass them around.

> >> If filenames were expressed as bytes in the Haskell program, how would
> >> you map them to WinAPI? If you use the current Windows code page, the
> >> set of valid characters is limited without a good reason.
> >
> > Windows filenames are arguably characters rather than bytes. However,
> > if you want to present a common API, you can just use a fixed encoding
> > on Windows (either UTF-8 or UTF-16).
> 
> This encoding would be incompatible with most other texts seen by the
> program. In particular reading a filename from a file would not work
> without manual recoding.

We already have that problem; you can't read non-Latin1 strings from
files.

In some regards, the problem is worse on Windows, because of the
prevalence of non-ASCII text (the Windows-125x code pages and "smart"
quotes), so using UTF-8 for file contents on Windows is even harder.

> >> Which is a pity. ISO-2022 is brain-damaged because of enormous
> >> complexity,
> >
> > Or, depending upon ones perspective, Unicode is brain-damaged because,
> > for the sake of simplicity, it over-simplifies the situation. The
> > over-simplification is one reason for its lack of adoption in the CJK
> > world.
> 
> It's necessary to simplify things in order to make them usable by
> ordinary programs. People reject overly complicated designs even if
> they are in some respects more general.
> 
> ISO-2022 didn't catch on - about the only program I've seen which tries
> to fully support it is Emacs.

And X. Compound text is ISO-2022. For commercial X software, Motif
(which uses compound text) is still the most widely-used toolkit.

But, then, the fact that you haven't seen many ISO-2022 programs is
probably because you're used to using programs developed by and for
Westerners. In the Far East, ISO-2022 is by far the most popular
encoding; there, you could realistically ignore all other encodings.

BTW, that's why Emacs (and XEmacs) support ISO-2022 much better than
they do UTF-8: MuLE was written by Japanese developers.

> > Multi-lingual text consists of distinct sections written in distinct
> > languages with distinct "alphabets". It isn't actually one big chunk
> > in a single global language with a single massive alphabet.
> 
> Multi-lingual text is almost context-insensitive. You can copy a part
> of it into another text, even written in another language, and it will
> retain its alphabet - this is much harder with stateful ISO-2022.
> 
> ISO-2022 is wrong not by distinguishing alphabets but by being
> stateful.

Sure, the statefulness adds complexity (which is one of the reasons so
many people prefer to work with UTF-8), but it has the benefit of
providing explicit markers for where the character set is switched.
That isn't a compelling advantage, though: you could reconstruct the
markers if you could uniquely determine the character set of each
character.

OTOH, Unicode is wrong by not distinguishing character sets. This is a
significant reason why it hasn't been adopted in the Far East
(specifically, Han unification).

> >> and ISO-8859-x have small repertoires.
> >
> > Which is one of the reasons why they are likely to persist for longer
> > than UTF-8 "true believers" might like.
> 
> My I/O design doesn't force UTF-8, it works with ISO-8859-x as well.

But I was specifically addressing Unicode versus multiple encodings
internally. The size of the Unicode "alphabet" effectively prohibits
using codepoints as indices.
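
To illustrate the point about indices (the table contents here are
made up): a per-character table for an 8-bit encoding is a dense
256-entry array, whereas a dense array over every Unicode code point
would need 0x110000 entries, so in practice you fall back to sparse
structures.

import Data.Array (Array, accumArray, (!))
import Data.Char (ord)
import qualified Data.Map as M

-- Dense per-byte table: trivial for a single-byte encoding.
letterTable :: Array Int Bool
letterTable = accumArray (\_ v -> v) False (0, 255)
                [ (b, True) | b <- [0x41..0x5A] ++ [0x61..0x7A] ]

-- Lookup by code point works only while the alphabet fits the table.
isLetter8 :: Char -> Bool
isLetter8 c = ord c <= 255 && letterTable ! ord c

-- For all of Unicode (0x110000 code points) a dense array of this
-- kind is impractical, so a sparse map takes its place.
isLetterU :: M.Map Char Bool -> Char -> Bool
isLetterU table c = M.findWithDefault False c table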

-- 
Glynn Clements <glynn.clements at virgin.net>

