[Haskell-cafe] Writing binary files?

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Tue Sep 14 19:45:33 EDT 2004


Glynn Clements <glynn.clements at virgin.net> writes:

> [Actually, regarding on-screen display, this is also an issue for
> Unicode. How many people actually have all of the Unicode glyphs?
> I certainly don't.]

If I don't have a particular character in my fonts, I will not create
filenames containing it. In practice I use only the 9 Polish letters
in addition to ASCII, and even those rarely. Usually it's just a
subset of ASCII.

Some programs use UTF-8 in filenames no matter what the locale is. For
example, the Evolution mail program stores mail folders as files named
after whatever the user entered in its GUI. I had to rename some of
these files in order to import them into Gnus, which choked on
filenames with strange characters, never mind that it didn't display
them correctly (maybe because it tried to map them to virtual
newsgroup names, or maybe because some of the bytes are control
characters in ISO-8859-x).

If all programs consistently used the locale encoding for filenames,
this would have worked.

When I switch my environment to UTF-8, which may happen in a few
years, I will convert filenames to UTF-8 and set up mount options to
translate vfat filenames to/from UTF-8 instead of ISO-8859-2.

I expect good programs to understand that and to display filenames
correctly, whatever technique they use for the display. For example,
when I open the file:/home/users/qrczak URL in the Epiphany web
browser, it displays ISO-8859-2-encoded filenames correctly. The
virtual HTML file it creates from the directory listing has &#x105; in
its <title> where the directory name had the byte 0xB1 in ISO-8859-2.
When I run Epiphany with the locale set to pl_PL.UTF-8, it displays
UTF-8 filenames correctly and ISO-8859-2 filenames are not shown at
all.

It's fine with me that it doesn't deal with wrongly encoded filenames,
because that is what allows it to treat well-encoded filenames
correctly. For a web page rendered on the screen it makes no sense to
display raw bytes. Epiphany treats filenames as sequences of
characters encoded according to the locale.

> And even to the extent that it can be done, it will take a long time. 
> Outside of the Free Software ghetto, long-term backward compatibility
> still means a lot.

Windows has already switched most of its internals to Unicode, and it
did it faster than Linux.

>> In CLisp it fails silently (undecodable filenames are skipped), which
>> is bad. It should fail loudly.
>
> No, it shouldn't fail at all.

Since it uses Unicode as its string representation, accepting
filenames not encoded in the locale encoding would mean making garbage
out of filenames which are correctly encoded in the locale encoding.
In a UTF-8 environment the character U+00E1 in a filename means the
bytes 0xC3 0xA1 on an ext2 filesystem (and the code unit 0x00E1 on a
vfat filesystem), so it can't at the same time mean the byte 0xE1 on
an ext2 filesystem.
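
To make the byte values concrete, here is a tiny sketch using today's
text and bytestring packages (an assumption on my part; they are not
part of the toolchain being discussed):

    -- Show the UTF-8 bytes of the single character U+00E1.
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      let name = T.pack "\x00E1"             -- one character, U+00E1
      print (B.unpack (TE.encodeUtf8 name))  -- [195,161] = 0xC3 0xA1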

>> And this is why I can't switch my home environment to UTF-8 yet. Too
>> many programs are broken; almost all terminal programs which use more
>> than stdin and stdout in default modes, i.e. which use line editing or
>> work in full screen. How would you display a filename in a full screen
>> text editor, such that it works in a UTF-8 environment?
>
> So, what are you suggesting? That the whole world switches to UTF-8?

No, each computer system decides for itself, and announces it in the
locale setting. I'm suggesting that programs should respect that and
correctly handle all correctly encoded texts, including filenames.

Better programs may offer an explicit choice of encoding where it
makes sense (e.g. a text editor when opening a file), but if they
don't, they should at least accept the locale encoding.

> Or that every program should pass everything through iconv()
> (and handle the failures)?

If it uses Unicode as its internal string representation, yes (because
the OS API on Unix generally uses byte encodings rather than Unicode).

This should be done transparently in the libraries of the respective
languages rather than in each program independently.
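
As a sketch of what such library support could look like, here is a
decoder for filename bytes using the locale encoding. It assumes the
GHC.IO.Encoding and GHC.Foreign modules of later GHC releases, and the
helper name is hypothetical:

    import qualified Data.ByteString as B
    import GHC.IO.Encoding (getLocaleEncoding)
    import qualified GHC.Foreign as GF

    -- Decode raw filename bytes with the process's locale encoding.
    -- With GHC's default locale encoding, undecodable bytes raise an
    -- exception, i.e. it fails loudly rather than silently.
    decodeFilenameWithLocale :: B.ByteString -> IO String
    decodeFilenameWithLocale bytes = do
      enc <- getLocaleEncoding
      B.useAsCStringLen bytes (GF.peekCStringLen enc)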

>> A program is not supposed to encounter filenames which are not
>> representable in the locale's encoding.
>
> Huh? What does "supposed to" mean in this context? That everything
> would be simpler if reality wasn't how it is?

It means that if it encounters a filename encoded differently, it's
usually not the fault of the program but of whoever caused the
mismatch in the first place.

>> In your setting it's impossible to display a filename in a way
>> other than printing to stdout.
>
> Writing to stdout doesn't amount to "displaying" anything; stdout
> doesn't have to be a terminal.

I know; that's not the point. The point is that display channels other
than a stdout connected to a terminal often work in terms of
characters rather than bytes in some implicit encoding: for example,
various GUI frameworks, and wide-character ncurses.

> Sure; but that doesn't automatically mean that the locale's encoding
> is correct for any given filename. The point is that you often don't
> need to know the encoding.

What if I do need to know the encoding? I must assume something.

> Converting a byte string to a character string when you're just going
> to be converting it back to the original byte string is pointless.

It's necessary if the channel through which the filename is
transferred uses Unicode text, or bytes in some explicitly chosen
encoding, rather than raw bytes in some unspecified encoding.
The channel might be:
- GUI API (e.g. UTF-8 for Gtk+ or UTF-16 for Qt)
- X selection copied & pasted between programs, if it uses UTF-8
- email contents, if encoded differently than the filename
- copy & paste to/from an MS-DOS emulation window, which definitely
  uses a different encoding
- database field which uses e.g. UTF-16 internally
- XML file encoded in UTF-8
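
For instance, handing a locale-encoded filename (say ISO-8859-2) to a
UTF-8 channel such as Gtk+ amounts to one decode and one encode. A
minimal sketch, assuming GHC's mkTextEncoding (backed by iconv, so the
"ISO-8859-2" name is platform-dependent) and today's text/bytestring
packages; the helper name is made up:

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE
    import GHC.IO.Encoding (mkTextEncoding)
    import qualified GHC.Foreign as GF

    -- ISO-8859-2 filename bytes -> UTF-8 bytes for a GUI toolkit.
    filenameToUtf8 :: B.ByteString -> IO B.ByteString
    filenameToUtf8 bytes = do
      enc  <- mkTextEncoding "ISO-8859-2"
      name <- B.useAsCStringLen bytes (GF.peekCStringLen enc)
      return (TE.encodeUtf8 (T.pack name))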

Overall it's better to use Unicode internally, because then only the
places which are inherently incapable of expressing characters outside
some encoding cause those characters to be lost. They should not block
these characters when they are merely moved between sources which can
express them. I would be upset if a web browser refused to show
Cyrillic web pages on a graphical display only because my locale
doesn't include Cyrillic letters. Since the web page and the fonts may
use different encodings, Unicode is a natural mediator. The design of
a web browser is simpler if all its text is kept in a single encoding
and converted only at I/O, rather than every string carrying an
explicit encoding attached to it.

> And it introduces unnecessary errors. If the only difference between
> (decode . encode) and the identity function is that the former
> sometimes fails, what's the point?

The point is not having to remember the encoding of the strings
manipulated by the program. Encodings matter only for input and
output, not for processing.

> It frequently *is* an avoidable issue, because not every interface
> uses *any* encoding. Most of the standard Unix utilities work fine
> without even considering encodings.

Many of them broke because they did not consider encodings.
But today 'sort' works in UTF-8 too.

Those which don't have to consider encodings typically manipulate byte
streams rather than text streams.

> I'm not suggesting that we ignore them. I'm suggesting that we:
>
> 1. Don't provide a broken API which makes it impossible to write
> programs which work reliably in the real world (rather than some
> fantasy world where inconveniences (like filenames which don't match
> the locale's encoding) never happen).

It is possible in my scheme. Just set the program's default encoding
to ISO-8859-1 (it should merely default to the locale encoding and be
overridable from within the program). But then you had better not try
to show such filenames to the user, unless your interface is just
stdin / stdout.
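
Later GHC releases grew roughly this knob; a sketch under that
assumption (setFileSystemEncoding, setLocaleEncoding and latin1 come
from GHC.IO.Encoding and did not exist when this was written):

    import GHC.IO.Encoding (setFileSystemEncoding, setLocaleEncoding,
                            latin1)
    import System.Directory (getDirectoryContents)

    main :: IO ()
    main = do
      -- Treat filename bytes as ISO-8859-1: every byte decodes to a
      -- character, and encoding it back yields the same byte.
      setFileSystemEncoding latin1
      setLocaleEncoding latin1   -- so stdout round-trips the bytes too
      getDirectoryContents "." >>= mapM_ putStrLn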

> 2. Don't force everyone to deal with all of the complexities
> involved in character encoding even when they shouldn't have to.

I don't see how to have this property and at the same time keep it
reasonably easy to write programs which do handle various encodings.
With my choices all Haskell APIs use Unicode, so once the libraries
which interface with the world are written, the program passes strings
between them without recoding. With your choices the API for filenames
uses a different encoding than the API for the GUI, so the conversion
logic must be put into each program separately.

> And, given that Unicode isn't a simple "one code, one character"
> system (what with composing characters), it isn't actually all that
> much simpler than dealing with multi-byte strings.

Composing characters are not relevant to recoding for I/O or to
putting email contents on the wire. And for GUIs they are handled by
already-written libraries rather than by each program (e.g. Pango in
Gnome on Linux).

> The main advantage of Unicode for display is that there's only one
> encoding. Unfortunately, given that most of the existing Unicode fonts
> are a bit short on actual glyphs, you typically just end up converting
> the Unicode back into pseudo-ISO-2022 anyhow.

Again, that is a problem for the GUI libraries. And TrueType fonts
have their character maps expressed in Unicode, AFAIK.

>> So it should push bytes, not characters.
>
> And so should a lot of software. But it helps if languages and
> libraries don't go to great lengths to try and coerce everything
> into characters.

It's just as bad to manipulate everything in terms of bytes. Programs
should generally have a choice.

>> OTOH newer Windows APIs use Unicode.
>> 
>> Haskell aims at being portable. It's easier to emulate the traditional
>> C paradigm in the Unicode paradigm than vice versa,
>
> I'm not entirely sure what you mean by that, but I think that I
> disagree. The C/Unix approach is more general; it isn't tied to any
> specific encoding.

If filenames were expressed as bytes in a Haskell program, how would
you map them to the Windows API? If you use the current Windows code
page, the set of valid characters is limited for no good reason.

>> It's not that hard if you may sacrifice supporting every broken
>> configuration. I did it myself, albeit without serious testing in real
>> world situations and without trying to interface to too many libraries.
>
> I take it that, by "broken", you mean any string of bytes (file,
> string, network stream, etc) which neither explicitly specifies its
> encoding(s) nor uses your locale's encoding?

No - you can treat file contents as a sequence of bytes rather than a
sequence of characters, and not recode them at all.

In fact you have to do that anyway to avoid mangling bytes 13 and 10
(CR and LF). Distinguishing text from binary data is not a new
requirement.
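
A sketch of byte-for-byte copying under that discipline
(Data.ByteString is assumed here and postdates this thread;
openBinaryFile from System.IO is the standard Haskell 98 way to get
the same effect):

    import qualified Data.ByteString as B

    -- Copy a file byte for byte: no decoding, no newline translation.
    copyBytes :: FilePath -> FilePath -> IO ()
    copyBytes from to = B.readFile from >>= B.writeFile to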

> If they tried a decade hence, it would still be too early. The
> single-byte encodings (ISO-8859-*, koi-8, win-12xx) aren't likely to
> be disappearing any time soon, nor is ISO-2022 (UTF-8 has quite
> spectacularly failed to make inroads in CJK-land; there are probably
> more UTF-8 users in the US than there).

Which is a pity. ISO-2022 is brain-damaged because of its enormous
complexity, and the ISO-8859-x encodings have small repertoires.

I would not *force* UTF-8, but it should work for those who
voluntarily choose to use it as their locale encoding. Including
filenames.

> Look, C has all of the functionality that we're talking about: wide
> characters, wide versions of string.h and ctype.h, and conversion
> between byte-streams and wide characters.

ctype.h is useless for UTF-8.

There is no way to attach an automatic recoder for an explicitly
chosen encoding to a file handle.
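
For contrast, GHC's I/O library later gained exactly this ability. A
minimal sketch (hSetEncoding and mkTextEncoding are post-2004 GHC
additions; on Unix the encoding name is resolved by iconv):

    import System.IO

    -- Read an ISO-8859-2 text file through a recoding handle.
    main :: IO ()
    main = do
      h   <- openFile "list.txt" ReadMode
      enc <- mkTextEncoding "ISO-8859-2"
      hSetEncoding h enc           -- bytes are decoded to Char on read
      hGetContents h >>= putStr    -- putStr re-encodes for the locale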

wchar_t is not very portable. On some systems it's UTF-32, on others
it's UTF-16, and the C standard doesn't guarantee that it has anything
to do with Unicode at all (I'm sure it was not Unicode on FreeBSD; I
don't know whether that has changed).

The iconv API is inconvenient for converting whole strings, because
the user has to allocate the output buffer and keep resizing it if it
turns out to be too small. It's also not available everywhere;
sometimes an extra library has to be installed and linked.

Different C libraries use different string encodings: some use
sequences of chars without an explicit encoding (presumably the locale
encoding is to be assumed), some use UTF-8 (Gtk+), some use their own
character type for UTF-16 (Qt, ICU) or for UTF-16 / UTF-32 depending
on how they were built (Python), some use wchar_t (curses), etc.

No, the C language doesn't make these issues easy, and it carries a
lot of historical baggage.

> But it did it without getting in the way of writing programs which
> don't care about encodings,

It does get in the way of writing programs which do care, because
they must do all the recoding themselves and remember which API has
which character set limitations.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

