[Haskell-cafe] Writing binary files?

Mon Sep 13 13:18:24 EDT 2004

Glynn Clements <glynn.clements at virgin.net> writes:

> Unless you are the sole user of a system, you have no control over
> what filenames may occur on it (and even if you are the sole user,
> you may wish to use packages which don't conform to your rules).

For these occasions you may set the encoding to ISO-8859-1. But then
you can't sensibly show them to the user in a GUI, nor in ncurses
using the wide character API, nor can you sensibly store them in a
file which must always be encoded in UTF-8 (e.g. an XML file, where you
can't put raw bytes without knowing their encoding).
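
To make the trade-off concrete, here is a minimal sketch of what "setting
the encoding to ISO-8859-1" amounts to: every byte maps to the code point
with the same number, so decoding never fails and round-trips, but a
filename that was really UTF-8 turns into different characters. The names
decodeLatin1, encodeLatin1 and zDotBytes are made up for this illustration;
they are not an existing library API.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode bytes as ISO-8859-1: byte n becomes code point n, so this never fails.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Inverse mapping; it is only defined for characters below U+0100.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM toByte
  where
    toByte c
      | ord c < 256 = Just (fromIntegral (ord c))
      | otherwise   = Nothing

-- The UTF-8 bytes of U+017C (LATIN SMALL LETTER Z WITH DOT ABOVE).
zDotBytes :: [Word8]
zDotBytes = [0xC5, 0xBC]
```

Here decodeLatin1 zDotBytes yields the two characters U+00C5 and U+00BC
rather than the single letter the filename was meant to contain. The bytes
survive, but what would be shown to the user is wrong.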

There are two paradigms: manipulating bytes without knowing their encoding,
and manipulating characters explicitly encoded in various encodings
(possibly UTF-8). The world is slowly migrating from the first to the
second.

>> > There are limits to the extent to which this can be achieved. E.g. 
>> > what happens if you set the encoding to UTF-8, then call
>> > getDirectoryContents for a directory which contains filenames which
>> > aren't valid UTF-8 strings?
>> 
>> The library fails. Don't do that. This environment is internally
>> inconsistent.
>
> Call it what you like, it's a reality, and one which programs need to
> deal with.

The reality is that filenames are encoded in different encodings
depending on the system. Sometimes it's ISO-8859-1, sometimes
ISO-8859-2, sometimes UTF-8. We should not ignore the possibility
of UTF-8-encoded filenames.

In CLisp it fails silently (undecodable filenames are skipped), which
is bad. It should fail loudly.
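
As an illustration of failing loudly, the following is a sketch of a strict
UTF-8 decoder that reports the offset of the first malformed sequence
instead of skipping it. The function name decodeUtf8 and the error message
format are assumptions of this sketch, not an API of any particular library.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Strict UTF-8 decoder: returns Left with the byte offset of the first
-- malformed sequence instead of skipping or replacing it.
decodeUtf8 :: [Word8] -> Either String String
decodeUtf8 = go 0
  where
    go :: Int -> [Word8] -> Either String String
    go _ [] = Right []
    go i (b : bs)
      | b < 0x80 = (chr (fromIntegral b) :) <$> go (i + 1) bs
      | b >= 0xC2 && b <= 0xDF = multi i 1 0x80 (fromIntegral b .&. 0x1F) bs
      | b >= 0xE0 && b <= 0xEF = multi i 2 0x800 (fromIntegral b .&. 0x0F) bs
      | b >= 0xF0 && b <= 0xF4 = multi i 3 0x10000 (fromIntegral b .&. 0x07) bs
      | otherwise = bad i

    -- Read n continuation bytes, then reject overlong forms, surrogates,
    -- and values above U+10FFFF.
    multi :: Int -> Int -> Int -> Int -> [Word8] -> Either String String
    multi i n lowest lead bs =
      case splitAt n bs of
        (cont, rest)
          | length cont == n, all isCont cont ->
              let cp = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) lead cont
              in if cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)
                   then bad i
                   else (chr cp :) <$> go (i + n + 1) rest
          | otherwise -> bad i

    -- A continuation byte has the bit pattern 10xxxxxx.
    isCont c = c .&. 0xC0 == 0x80
    bad i = Left ("invalid UTF-8 at byte offset " ++ show i)
```

A directory listing built on such a function could turn a Left into an
exception naming the offending entry, rather than silently dropping it.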

> Most programs don't care whether any filenames which they deal with
> are valid in the locale's encoding (or any other encoding). They just
> receive lists (i.e. NUL-terminated arrays) of bytes and pass them
> directly to the OS or to libraries.

And this is why I can't switch my home environment to UTF-8 yet. Too
many programs are broken; almost all terminal programs which use more
than stdin and stdout in default modes, i.e. which use line editing or
work in full screen. How would you display a filename in a full screen
text editor, such that it works in a UTF-8 environment?

> If the assumed encoding is ISO-8859-*, this program will work
> regardless of the filenames which it is passed or the contents of the
> file (modulo the EOL translation on Windows). OTOH, if it were to use
> UTF-8 (e.g. because that was the locale's encoding), it wouldn't work
> correctly if either filename or the file's contents weren't valid
> UTF-8.

A program is not supposed to encounter filenames which are not
representable in the locale's encoding. In your setting it's
impossible to display a filename in a way other than printing
to stdout.

> More accurately, it specifies which encoding to assume when you *need*
> to know the encoding (i.e. ctype.h etc), but you can't obtain that
> information from a more reliable source.

In the case of filenames there is no more reliable source.

> My central point is that the existing API forces the encoding to be
> an issue when it shouldn't be.

It is an unavoidable issue because not every interface in a given
computer system uses the same encoding. Gtk+ uses UTF-8; you must
convert text to UTF-8 in order to display it, and in order to convert
you must know its encoding.

> Well, to an extent it is an implementation issue. Historically, curses
> never cared about encodings. A character is a byte, you draw bytes on
> the screen, curses sends them directly to the terminal.

This is the old API. But the newer ncurses API is prepared even for
combining accents. A character is coded as a sequence of wchar_t
values, such that all except the first one are combining characters.

> Furthermore, the curses model relies upon monospaced fonts, and falls
> down once you encounter CJK text (where a "monospaced" font means one
> whose glyphs are an integer multiple of the cell size, not necessarily
> a single cell).

It doesn't fall down. Characters may span several columns. There is
wcwidth(), and the curses specification in X/Open says how it should
behave for wide CJK characters. I haven't tested it but I believe
ncurses supports them.
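
The idea that a character occupies zero, one or two columns can be sketched
as follows. This is only an approximation for illustration: the list of
wide ranges is a hand-picked, incomplete subset, control characters are not
treated specially, and charWidth and stringWidth are hypothetical names. It
is not the table used by wcwidth() or by ncurses.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, ord)

-- Approximate number of terminal columns occupied by one character.
-- The wide ranges below are an incomplete, hand-picked subset of East Asian
-- wide characters (an assumption of this sketch).
charWidth :: Char -> Int
charWidth c
  | isCombining = 0
  | isWide      = 2
  | otherwise   = 1
  where
    n = ord c
    -- Combining marks are drawn over the preceding character.
    isCombining = generalCategory c `elem` [NonSpacingMark, EnclosingMark]
    isWide = any (\(lo, hi) -> n >= lo && n <= hi)
      [ (0x1100, 0x115F)   -- Hangul Jamo initial consonants
      , (0x3040, 0x30FF)   -- Hiragana and Katakana
      , (0x4E00, 0x9FFF)   -- CJK Unified Ideographs
      , (0xAC00, 0xD7A3)   -- Hangul syllables
      , (0xFF01, 0xFF60)   -- Fullwidth forms
      ]

-- Columns needed for a whole string.
stringWidth :: String -> Int
stringWidth = sum . map charWidth
```

Note that this only works on a String of Unicode characters; on a list of
undecoded bytes there is no way to tell where one character ends.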

> Extending something like curses to handle encoding issues is far
> from trivial; which is probably why it hasn't been finished yet.

It's almost finished. The API specification was ready in 1997.
It works in ncurses modulo unfixed bugs.

But programs can't use it unless they use Unicode internally.

> Although, if you're going to have implicit String -> [Word8]
> converters, there's no reason why you can't do the reverse, and have
> isAlpha :: Word8 -> IO Bool. Although, like ctype.h, this will only
> work for single-byte encodings.

We should not ignore multibyte encodings like UTF-8, which means that
Haskell should have a Unicode character type. And it's already
specified in Haskell 98 that Char is such a type!
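
A small sketch of what this means in practice, using the Data.Char functions
as implemented by GHC. The top-level names maxCodePoint, beyondLatin1 and
upperZDot are only labels for this example.

```haskell
import Data.Char (isAlpha, ord, toUpper)

-- The largest Char is the last Unicode code point, U+10FFFF.
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- Classification is pure and not limited to a single-byte range:
-- U+017C is LATIN SMALL LETTER Z WITH DOT ABOVE,
-- U+0416 is CYRILLIC CAPITAL LETTER ZHE.
beyondLatin1 :: [Bool]
beyondLatin1 = map isAlpha "\x17C\x416"

-- Case mapping also works outside Latin-1: this gives U+017B.
upperZDot :: Char
upperZDot = toUpper '\x17C'
```

With such a Char type, isAlpha needs neither IO nor a locale lookup, which
is not the case for a Word8-based variant.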

What is missing is an API for manipulating binary files, and for
conversion between byte streams and character streams using particular
text encodings.
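
One direction of such a conversion could look like the sketch below, which
turns a String into UTF-8 bytes. The names encodeUtf8 and encodeChar are
hypothetical and stand in for whatever the eventual library would provide.
The sketch does not reject surrogate code points, which a production
encoder would have to handle.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode a Unicode string as a sequence of UTF-8 bytes.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

-- One to four bytes per character, depending on the code point.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. (n `shiftR` 6)), cont n]
  | n < 0x10000 = [byte (0xE0 .|. (n `shiftR` 12)), cont (n `shiftR` 6), cont n]
  | otherwise   = [ byte (0xF0 .|. (n `shiftR` 18)), cont (n `shiftR` 12)
                  , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    byte :: Int -> Word8
    byte = fromIntegral
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont :: Int -> Word8
    cont x = fromIntegral (0x80 .|. (x .&. 0x3F))
```

The resulting [Word8] is what a binary file API would then write, without
any further interpretation.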

>> A mail client is expected to respect the encoding set in headers.
>
> A client typically needs to know the encoding in order to display
> the text.

This is easier to handle when the String type means Unicode.

> As a counter-example, a mail *server* can do its job without paying
> any attention to the encodings used. It can also handle non-MIME email
> (which doesn't specify any encoding) regardless of the encoding.

So it should push bytes, not characters.

>> This is why I said "1. API for manipulating byte sequences in I/O
>> (without representing them in String type)".
>
> Yes. But that API also needs to include functions such as those in the
> Directory and System modules.

If deemed really necessary, I will not fight against them.

> It isn't just about reading and writing streams. Most of the Unix
> API (kernel, libc, and many standard libraries) is byte-oriented
> rather than character-oriented.

Because they are primarily used from C, which uses the older paradigm
of handling text: represent it in an unspecified external encoding
rather than in Unicode.

OTOH newer Windows APIs use Unicode.

Haskell aims at being portable. It's easier to emulate the traditional
C paradigm in the Unicode paradigm than vice versa, and Haskell
already tries to specify that it uses Unicode internally.

>> > 2. If you assume ISO-8859-1, you can always convert back to Word8 then
>> > re-decode as UTF-8. If you assume UTF-8, anything which is neither
>> > UTF-8 nor ASCII will fail far more severely than just getting the
>> > collation order wrong.
>> 
>> If I know the encoding, I should set the I/O handle to that encoding
>> in the first place instead of reinterpreting characters which have
>> been read using the default.
>
> And if you don't know the encoding?

Then it's not possible to recode it to something else.

But when it is possible because the encoding is known, it's easier
to use a single internal encoding everywhere than to determine two
encodings on each transition.
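
For illustration, later versions of GHC's System.IO provide per-handle
encodings (hSetEncoding with values such as utf8 and latin1), so setting
the encoding on the handle in the first place can be sketched as below.
The helper names readFileAs and writeBytes are made up for this example.

```haskell
import System.IO

-- Read a whole file, decoding its bytes with the given encoding.
readFileAs :: TextEncoding -> FilePath -> IO String
readFileAs enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    -- hGetContents is lazy; force the string before the handle is closed.
    length s `seq` return s

-- Write raw bytes: in binary mode each Char below U+0100 is written as one byte.
writeBytes :: FilePath -> String -> IO ()
writeBytes path s = withBinaryFile path WriteMode (\h -> hPutStr h s)
```

The same two bytes 0xC5 0xBC read back as one character with utf8 and as
two characters with latin1; the program works with Unicode strings in both
cases and the encoding is decided once, at the handle.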

> Agreed. But writing programs which support I18N, multi-byte encodings,
> wide character sets (>256 codepoints) and the like on an OS whose core
> API is byte-oriented involves work.

It's not that hard if you are willing to sacrifice support for every
broken configuration. I did it myself, albeit without serious testing in
real-world situations and without trying to interface to too many libraries.

> And it can't all be hidden within a library. Some of the work falls on
> the application programmers, who have to deal with determining the
> correct encoding in each situation, converting between encodings,
> handling encoding and decoding failures (e.g. when you encounter a
> Unicode filename but the terminal only has Latin1), and so on.

Indeed.

> My view is that, right now, we have the worst of both worlds, and
> taking a short step backwards (i.e. narrow the Char type and leave the
> rest alone) is a lot simpler (and more feasible) than the long journey
> towards real I18N.

It would bury any hope of supporting a UTF-8 environment.

I've heard that RedHat tried to impose UTF-8 by default. It was mostly
a failure because it's too early: too many programs are not ready for
it. I guess the RedHat move helped to identify some of them. But UTF-8
will inevitably be usable in the future.

It would be great if Haskell programs were in the group which can
support it, instead of having to be abandoned because of a lack
of Unicode support in the language they are written in.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/