[Haskell-cafe] Writing binary files?

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Sun Sep 12 15:23:34 EDT 2004


Sven Panne <Sven.Panne at aedion.de> writes:

> Hmmm, the Unicode tables start with ISO-Latin-1, so what would exactly break
> when we stipulate that the standard encoding for string I/O in Haskell is
> ISO-Latin-1? Additional encodings could be specified e.g. via a new "open"
> variant.

What would break is that most file contents are not actually encoded
in ISO-Latin-1. The locale mechanism already specifies a default encoding.

It's also a default for other things: filenames (on Unix), program
invocation arguments, environment variables etc. Some other places
have an encoding hardwired (e.g. Gtk+ uses UTF-8 and Qt uses UTF-16),
and yet others have it specified as a part of the protocol (email,
usenet, WWW).

Unfortunately, changing a Haskell implementation to actually convert
between the external encodings and Unicode must be done in all those
places at once; otherwise there will be mismatches, and e.g. printing
program invocation arguments to a file will produce wrong output.

Most Haskell programs currently work only because they misuse Chars to
represent characters in the implicit default encoding. This holds as
long as they don't use isAlpha or toUpper on non-ASCII characters, and
as long as they don't try to support several encodings at once.
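
To illustrate the isAlpha/toUpper problem, here is a minimal sketch. The
names byteAsChar and decodeLatin2 are hypothetical, and decodeLatin2 covers
only two ISO-8859-2 entries, just enough to show the difference between
treating a byte as a Char and decoding it to a Unicode code point first.

```haskell
import Data.Char (chr, isAlpha, toUpper)
import Data.Word (Word8)

-- The misuse: reinterpret an external byte directly as a Char,
-- whatever encoding the byte really came from.
byteAsChar :: Word8 -> Char
byteAsChar = chr . fromIntegral

-- Hypothetical, deliberately tiny ISO-8859-2 decoder (two non-ASCII
-- entries only); a real one would cover the whole table.
decodeLatin2 :: Word8 -> Char
decodeLatin2 0xA1 = '\x0104'   -- capital A with ogonek
decodeLatin2 0xB1 = '\x0105'   -- small a with ogonek
decodeLatin2 b
  | b < 0x80  = chr (fromIntegral b)   -- ASCII range is unchanged
  | otherwise = '\xFFFD'               -- not covered by this sketch
```

With the ISO-8859-2 byte 0xB1, isAlpha (byteAsChar 0xB1) is False, because
the Char is U+00B1 (plus-minus sign), whereas isAlpha (decodeLatin2 0xB1)
is True and toUpper maps it to the code point that decodeLatin2 0xA1 yields.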

These two paradigms:
A. Represent strings using their original encoding.
B. Use Unicode internally, convert it at the boundaries.
should not be mixed in one string type, or confusion will arise.

For at least some of these places, e.g. file contents or socket data,
a program must have a way to specify a different encoding, and also to
manipulate raw bytes without recoding. But the default encoding should
come from the locale instead of being ISO-8859-1. A Char value should
always mean a Unicode code point and not e.g. an ISO-8859-2-coded value.
This is the B paradigm and it must be applied consistently.
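
As a rough sketch of what paradigm B can look like at a file boundary,
the following uses the handle encoding functions from the System.IO module
of present-day GHC (hSetEncoding, withBinaryFile); these postdate this
message and are used here only for illustration. The helper names
writeFileWith, readFileWith and readBytes are hypothetical.

```haskell
import Data.Char (ord)
import Data.Word (Word8)
import System.IO

-- Write text with an explicitly chosen encoding instead of the default.
writeFileWith :: TextEncoding -> FilePath -> String -> IO ()
writeFileWith enc path txt =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h txt

-- Read text, decoding to Unicode code points with the given encoding.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    length s `seq` return s      -- force the read before the handle closes

-- Read raw bytes without any recoding (binary mode: one Char per byte).
readBytes :: FilePath -> IO [Word8]
readBytes path =
  withBinaryFile path ReadMode $ \h -> do
    s <- hGetContents h
    let bs = map (fromIntegral . ord) s
    length bs `seq` return bs
```

Passing localeEncoding to readFileWith gives the locale-derived default,
while utf8 or latin1 override it for a particular file; readBytes is the
escape hatch for data that must not be recoded at all.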

I did this for my language <http://kokogut.sourceforge.net/> and it
works. Only some things are hard, e.g. reading a file whose encoding
is specified inside it (trying to apply the default encoding might
fail, even if the text before the encoding name is all ASCII, because
of buffering); it's possible but needs care.
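
One way to sidestep the buffering problem, sketched here under assumptions,
is to read the file in two passes: first only the header line as raw bytes,
then the whole file again with the declared encoding. This is not
necessarily how Kokogut does it. The "encoding: NAME" first-line convention
and the function names are hypothetical, and the code relies on
mkTextEncoding from present-day GHC's System.IO.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (dropWhileEnd)
import System.IO

-- Hypothetical convention: the first line reads "encoding: NAME".
declaredEncoding :: String -> Maybe String
declaredEncoding firstLine =
  case break (== ':') firstLine of
    (key, ':' : rest)
      | map toLower (trim key) == "encoding" -> Just (trim rest)
    _ -> Nothing
  where
    trim = dropWhileEnd isSpace . dropWhile isSpace

-- Pass 1: fetch only the first line, in binary mode, so no decoder runs.
firstLineRaw :: FilePath -> IO String
firstLineRaw path =
  withBinaryFile path ReadMode $ \h -> do
    eof <- hIsEOF h
    if eof then return "" else hGetLine h

-- Pass 2: reopen the file and decode all of it with the declared
-- encoding, falling back to the locale default if none is declared.
readSelfDescribing :: FilePath -> IO String
readSelfDescribing path = do
  line1 <- firstLineRaw path
  enc <- maybe (return localeEncoding) mkTextEncoding (declaredEncoding line1)
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    length s `seq` return s      -- force the read before the handle closes
```

Reopening costs a second open call and only works for seekable files, not
for pipes or sockets, where the switch would have to happen mid-stream.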

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

