[Haskell-cafe] Writing binary files?

Glynn Clements glynn.clements at virgin.net
Tue Sep 14 22:35:54 EDT 2004


Marcin 'Qrczak' Kowalczyk wrote:

> > [Actually, regarding on-screen display, this is also an issue for
> > Unicode. How many people actually have all of the Unicode glyphs?
> > I certainly don't.]
> 
> If I don't have a particular character in fonts, I will not create
> files with it in filenames. Actually I only use 9 Polish letters in
> addition to ASCII, and even them rarely. Usually it's only a subset
> of ASCII.

But this seems to be assuming a closed world, i.e. that the only files
which the program will ever see are those which were created by you,
or by others who follow your conventions.

> Some programs use UTF-8 in filenames no matter what the locale is. For
> example the Evolution mail program which stores mail folders as files
> under names the user entered in a GUI.

This is entirely reasonable for a file which a program creates. If a
filename is just a string of bytes, a program can use whatever
encoding it wants.

> I had to rename some of these
> files in order to import them to Gnus, as it choked on filenames with
> strange characters, never mind that it didn't display them correctly
> (maybe because it tried to map them to virtual newsgroup names, or
> maybe because they are control characters in ISO-8859-x).

If it had just treated them as bytes, rather than trying to interpret
them as characters, there wouldn't have been any problems.

> If all programs consistently used the locale encoding for filenames,
> this should have worked.

But again, for this to work in general, you have to assume a closed
world.

> When I switch my environment to UTF-8, which may happen in a few
> years, I will convert filenames to UTF-8 and set up mount options to
> translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.

But what about files which were created by other people, who
don't use UTF-8?

> I expect good programs to understand that and display them correctly
> no matter what technique they are using for the display.

When it comes to display, you have to deal with encoding issues one
way or another. But not all programs deal with display.

> For example
> the Epiphany web browser, when I open the file:/home/users/qrczak URL,
> displays ISO-8859-2-encoded filenames correctly. The virtual HTML file
> it created from the directory listing has &#x105; in its <title> where
> the directory name had 0xB1 in ISO-8859-2. When I run Epiphany with
> the locale set to pl_PL.UTF-8, it displays UTF-8 filenames correctly
> and ISO-8859-2 filenames are not shown at all.

For many (probably most) programs, omitting such files would be an
unacceptable failure.

> > And even to the extent that it can be done, it will take a long time. 
> > Outside of the Free Software ghetto, long-term backward compatibility
> > still means a lot.
> 
> Windows has already switched most of its internals to Unicode, and it
> did it faster than Linux.

Microsoft is actively hostile to both backwards compatibility and
cross-platform compatibility.

I consider the fact that some Unix (primarily Linux) developers seem
equally hostile to be a problem.

Having said that, with Linux developers, the issue is usually due to
not being bothered. Assuming that everything is UTF-8 allows a lot of
potential problems to be ignored.

Fortunately, the problem is mostly consigned to the periphery, i.e. 
the desktop, where most programs have to deal with display issues (so
you *have* to decode bytes into characters), and it isn't too critical
if they have limitations.

The core OS and network server applications essentially remain
encoding-agnostic.

> >> In CLisp it fails silently (undecodable filenames are skipped), which
> >> is bad. It should fail loudly.
> >
> > No, it shouldn't fail at all.
> 
> Since it uses Unicode as string representation, accepting filenames
> not encoded in the locale encoding would imply making garbage from
> filenames correctly encoded in the locale encoding. In a UTF-8
> environment character U+00E1 in the filename means bytes 0xC3 0xA1
> on ext2 filesystem (and 0x00E1 on vfat filesystem), so it can't at
> the same time mean 0xE1 on ext2 filesystem.

But, as I keep pointing out, filenames are byte strings, not character
strings. You shouldn't be converting them to character strings unless
you have to.
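
As a minimal sketch of what I mean (this assumes only the standard
System.Directory functions and an implementation which round-trips the
filename bytes unchanged, as the ISO-8859-1 workaround does), a program
can list a directory and act on the entries without ever knowing what
encoding the names are in:

    import Data.List (isSuffixOf)
    import System.Directory (getDirectoryContents, removeFile)

    -- Delete the "*.tmp" entries in a directory.  The only characters
    -- the program interprets are the ASCII ones in ".tmp"; the rest of
    -- each name is handed back to the OS exactly as it was received.
    cleanTmp :: FilePath -> IO ()
    cleanTmp dir = do
        names <- getDirectoryContents dir
        mapM_ (removeFile . ((dir ++ "/") ++))
              [ name | name <- names, ".tmp" `isSuffixOf` name ]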

> >> And this is why I can't switch my home environment to UTF-8 yet. Too
> >> many programs are broken; almost all terminal programs which use more
> >> than stdin and stdout in default modes, i.e. which use line editing or
> >> work in full screen. How would you display a filename in a full screen
> >> text editor, such that it works in a UTF-8 environment?
> >
> > So, what are you suggesting? That the whole world switches to UTF-8?
> 
> No, each computer system decides for itself, and announces it in the
> locale setting. I'm suggesting that programs should respect that and
> correctly handle all correctly encoded texts, including filenames.

1. Actually, each user decides which locale they wish to use. Nothing
forces two users of a system to use the same locale.

2. Even if the locale were constant for all users on a system, there's
still the (not exactly minor) issue of networking.

> > Or that every program should pass everything through iconv()
> > (and handle the failures)?
> 
> If it uses Unicode as internal string representation, yes (because the
> OS API on Unix generally uses byte encodings rather than Unicode).

The problem with that is that you need to *know* the source and
destination encodings. The program gets to choose one of them, but it
may not even know the other one.
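
To make that concrete, here is the shape any such conversion layer ends
up with (a hypothetical sketch; the names and types are mine, not an
existing library): the caller has to supply both encodings, and failure
has to be reported, because not every byte sequence is valid in every
encoding.

    import Data.Word (Word8)

    -- Hypothetical interface; a real version would sit on top of
    -- something like iconv().  The stub body is only a placeholder.
    data RecodeError = UnknownEncoding String | InvalidSequence
        deriving Show

    recode :: String                       -- source encoding name
           -> String                       -- destination encoding name
           -> [Word8]                      -- bytes in the source encoding
           -> Either RecodeError [Word8]   -- converted bytes, or an error
    recode from _to _bytes = Left (UnknownEncoding from)   -- placeholder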

> This should be done transparently in libraries of respective languages
> instead of in each program independently.

The application still has to tell the library which encoding is to be
used, assuming that it can actually determine it.

> >> A program is not supposed to encounter filenames which are not
> >> representable in the locale's encoding.
> >
> > Huh? What does "supposed to" mean in this context? That everything
> > would be simpler if reality wasn't how it is?
> 
> It means that if it encounters a filename encoded differently, it's
> usually not the fault of the program but of whoever caused the
> mismatch in the first place.

The term "mismatch" implies that there have to be at least two things.
If they don't match, which one is at fault? If I make a tar file
available for you to download, and it contains non-UTF-8 filenames, is
that my fault or yours?

In any case, if a program refuses to deal with a file because it
cannot convert the filename to characters, even when it doesn't have
to, it's the program which is at fault.

Or are you suggesting that it's acceptable for e.g. "rm <filename>" to
refuse to work because the filename cannot be converted to
characters?

BTW, to go back to this point:

> > Huh? What does "supposed to" mean in this context? That everything
> > would be simpler if reality wasn't how it is?

Note that I entirely agree that everything really would be simpler if
reality wasn't how it is, e.g. if the entire world only ever used
UTF-8. But that doesn't change anything.

The reality is that multiple encodings are in widespread use, and that
situation won't change for the foreseeable future. There exists a vast
amount of software that is limited to certain encodings (e.g. just
ASCII, or just single-byte encodings, or just ISO-2022-compatible
encodings), and some of that will still be in use decades hence (for
commercial software, particularly bespoke software, upgrading could
cost vast amounts of money).

> > Sure; but that doesn't automatically mean that the locale's encoding
> > is correct for any given filename. The point is that you often don't
> > need to know the encoding.
> 
> What if I do need to know the encoding? I must assume something.

If you need to know it, then you need to know it. But that wasn't my
point.

My specific point is that the Haskell98 API has a very big problem due
to the assumption that the encoding is always known. Existing
implementations work around the problem by assuming that the encoding
is always ISO-8859-1.

The implementations could be changed to do something else, e.g. to
assume that it's always the locale's encoding, or that it's always
UTF-8. But neither of those is a significant improvement, and both
are, in some regards, even worse: UTF-8 decoding can fail (unlike
ISO-8859-*), while assuming the locale's encoding means the program's
behaviour varies for a reason which may not be readily apparent.
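
For what it's worth, it is easy to see why the ISO-8859-1 workaround at
least round-trips (a minimal sketch; the helper names are mine, not
part of any library):

    import Data.Char (chr, ord)
    import Data.Word (Word8)

    -- Every byte maps to a code point below 256 and back again, so no
    -- information is lost, whatever the filename's real encoding was.
    latin1Decode :: [Word8] -> String
    latin1Decode = map (chr . fromIntegral)

    latin1Encode :: String -> [Word8]
    latin1Encode = map (fromIntegral . ord)   -- only safe below U+0100

A UTF-8 decoder, by contrast, has to reject some byte sequences (a lone
0xC3, say), so under that assumption a filename can become unusable
before the OS is even asked about it.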

The real problem is that the API is broken, and that isn't something
which the implementations can fix.

> > Converting a byte string to a character string when you're just going
> > to be converting it back to the original byte string is pointless.
> 
> It's necessary if the channel through which the filename is
> transferred uses Unicode text, or bytes in some explicitly chosen
> encoding, rather than raw bytes in some unspecified encoding.

I did say "... when you're just going to be converting it back to the
original byte string ...".

I'd really appreciate it if you could address what I'm actually
saying, rather than setting up straw-men.

I'm not going to address all of the other examples of "... but here's
a situation where you *do* need to deal with characters". That isn't
my point.

> > 2. Don't force everyone to deal with all of the complexities
> > involved in character encoding even when they shouldn't have to.
> 
> I don't see how to have this property and at the same time make
> writing programs which do handle various encodings reasonably easy.
> With my choices all Haskell APIs use Unicode, so once libraries which
> interface with the world are written, the program passes strings
> between them without recoding. With your choices the API for filenames
> uses a different encoding than the API for GUI, so the conversion
> logic must be put in each program separately.

The API for filenames doesn't use any encoding, because filenames are
just bytes. And this isn't my choice, it's just how things are.

> >> OTOH newer Windows APIs use Unicode.
> >> 
> >> Haskell aims at being portable. It's easier to emulate the traditional
> >> C paradigm in the Unicode paradigm than vice versa,
> >
> > I'm not entirely sure what you mean by that, but I think that I
> > disagree. The C/Unix approach is more general; it isn't tied to any
> > specific encoding.
> 
> If filenames were expressed as bytes in the Haskell program, how would
> you map them to WinAPI? If you use the current Windows code page, the
> set of valid characters is limited without a good reason.

Windows filenames are arguably characters rather than bytes. However,
if you want to present a common API, you can just use a fixed encoding
on Windows (either UTF-8 or UTF-16).

> > If they tried a decade hence, it would still be too early. The
> > single-byte encodings (ISO-8859-*, koi-8, win-12xx) aren't likely to
> > be disappearing any time soon, nor is ISO-2022 (UTF-8 has quite
> > spectacularly failed to make inroads in CJK-land; there are probably
> > more UTF-8 users in the US than there).
> 
> Which is a pity. ISO-2022 is brain-damaged because of enormous
> complexity,

Or, depending upon one's perspective, Unicode is brain-damaged because,
for the sake of simplicity, it over-simplifies the situation. That
over-simplification is one reason for its lack of adoption in the CJK
world.

Multi-lingual text consists of distinct sections written in distinct
languages with distinct "alphabets". It isn't actually one big chunk
in a single global language with a single massive alphabet.

> and ISO-8859-x have small repertoires.

Which is one of the reasons why they are likely to persist for longer
than UTF-8 "true believers" might like. E.g. languages which don't
primarily use the Roman alphabet (Greek, Russian) can still be
represented as one byte per character. And it's feasible to have
tables which are indexed by codepoint; as a counter-example, calling
XQueryFont for a Unicode font *really* sucks if either the server
doesn't have the BigFont extension or, worse still, it can't use it
because the client is remote.

> I would not *force* UTF-8, but it should work for those who
> voluntarily choose to use it as their locale encoding. Including
> filenames.

Not forcibly decoding filenames isn't the same thing as preventing
them from being decoded.
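
To be clear about what I mean, here is a sketch (RawName and the
decoder argument are assumptions of mine, not an existing API): the
filename API hands back raw bytes, and a program which wants characters
applies whatever decoder it chooses, with a fallback for names which
don't decode.

    import Data.Char (chr)
    import Data.Word (Word8)

    type RawName = [Word8]     -- assumed representation of a filename

    -- The decoder (e.g. one built from the locale's encoding) is
    -- supplied by the program; nothing forces its use.
    displayName :: (RawName -> Maybe String) -> RawName -> String
    displayName decode raw =
        case decode raw of
          Just s  -> s
          Nothing -> map (chr . fromIntegral) raw   -- show the raw bytes

A program in a UTF-8 locale plugs in a UTF-8 decoder; one which never
displays names can pass (const Nothing) and simply see bytes.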

> > Look, C has all of the functionality that we're talking about: wide
> > characters, wide versions of string.h and ctype.h, and conversion
> > between byte-streams and wide characters.
> 
> ctype.h is useless for UTF-8.

Hello? Let's try that again, with emphasis:

> > C has ... WIDE VERSIONS OF string.h and ctype.h

They're called wchar.h and wctype.h.

> There is no capability of attaching automatic recoders of explicitly
> chosen encodings to file handles.

At this point you're starting to engage in diversionary tactics. Again.

> No, the C language doesn't make these issues easy and has lots of
> historic baggage.

The issues aren't easy, and have lots of historic baggage. That's
reality.

Fortunately, C has a history of being geared to reality, rather than
the comfortable fantasy where the issues don't exist. Which is why
everyone uses it.

> > But it did it without getting in the way of writing programs which
> > don't care about encodings,
> 
> It does get in the way of writing programs which do care, because they
> must do whole recoding themselves and remember which API has which
> character set limitations.

No. Not doing something for you isn't the same thing as getting in the
way. Getting in the way is doing for you something which you didn't
want done in the first place. Getting in the way is not letting you do
something yourself.

-- 
Glynn Clements <glynn.clements at virgin.net>

