Unicode + Re: Reading/Writing Binary Data in Haskell

George Russell ger@tzi.de
Mon, 14 Jul 2003 18:30:30 +0200


Glynn wrote (about my binary library, snipped):
 > This is similar to UTF-8; however, UTF-8 is a standard format which
 > can be read and written by a variety of other programs.
 >
 > If we want a mechanism for encoding arbitrary Haskell strings as octet
 > lists, and we have a free choice as to the encoding, UTF-8 is
 > definitely the way to go.

No, I don't think so.  UTF8 is a good choice if you want
a way of storing Unicode files on an 8-bit file-system, but it
is not as efficient an encoding for characters in general.
Thus with UTF8 you can represent character codes less than 2^11 in
two bytes; with my system you can represent codes less than 2^14.
In 3 bytes, UTF8 can represent codes less than 2^16; I can do anything
less than 2^21.  This is not an error in UTF8's design; I think
it's because UTF8 includes extra bits which make it much easier to
use UTF8-encoded files with tools, such as "grep", which were only
written with 8-bit characters in mind.  That is not a design aim for
us.
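
To make the byte counts above concrete, here is a small sketch that compares
the two schemes.  The UTF8 thresholds are the standard ones.  The second
function is only an assumption about how a scheme with these capacities could
work (7 payload bits per byte, with the remaining bit marking whether another
byte follows); the names utf8Length and varLength are hypothetical and are not
taken from the library.

```haskell
import Data.Bits (shiftR)

-- | Number of bytes standard UTF-8 needs for a code point
-- (valid for code points up to U+10FFFF).
utf8Length :: Int -> Int
utf8Length c
  | c < 0x80    = 1
  | c < 0x800   = 2   -- codes below 2^11
  | c < 0x10000 = 3   -- codes below 2^16
  | otherwise   = 4

-- | Number of bytes needed by an assumed scheme that stores 7 payload
-- bits per byte and uses the eighth bit as a "more bytes follow" flag.
-- Intended for non-negative codes only.
varLength :: Int -> Int
varLength c
  | c < 0x80  = 1
  | otherwise = 1 + varLength (c `shiftR` 7)
```

Under that assumption, a code such as 0x3FFF (just below 2^14) takes two bytes
in the 7-bit scheme and three in UTF8, and 0x1FFFFF (just below 2^21) takes
three bytes in the 7-bit scheme.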

Furthermore UTF8's encoding is in fact rather more complicated to
program than mine, and the implementor will need an encoding like
mine in any case to encode arbitrary-size integers (something UTF8
encoding can't do, by the way).
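
As an illustration of the kind of encoding meant here, the following is a
minimal sketch of a variable-length encoding for arbitrary-size non-negative
integers.  The details are assumptions made for the example, not a
description of the library's actual format: each byte carries 7 bits, least
significant group first, and the high bit is set on every byte except the
last.  The names encodeNat and decodeNat are hypothetical.

```haskell
import Data.Bits (shiftL, shiftR, testBit, (.&.), (.|.))
import Data.Word (Word8)

-- | Encode a non-negative Integer as a list of octets.
-- Assumed layout: 7 payload bits per byte, least significant group
-- first, high bit set on every byte except the final one.
encodeNat :: Integer -> [Word8]
encodeNat n
  | n < 0     = error "encodeNat: negative input"
  | n < 0x80  = [fromIntegral n]
  | otherwise = (fromIntegral (n .&. 0x7F) .|. 0x80) : encodeNat (n `shiftR` 7)

-- | Decode one number from the front of an octet list, returning the
-- value and the unconsumed octets, or Nothing if the input ends
-- before a terminating byte (one with the high bit clear) is seen.
decodeNat :: [Word8] -> Maybe (Integer, [Word8])
decodeNat = go 0 0
  where
    go :: Int -> Integer -> [Word8] -> Maybe (Integer, [Word8])
    go _ _ [] = Nothing
    go sh acc (b:bs)
      | testBit b 7 = go (sh + 7) acc' bs   -- continuation byte
      | otherwise   = Just (acc', bs)       -- final byte
      where
        acc' = acc .|. (fromIntegral (b .&. 0x7F) `shiftL` sh)
```

Because the length is signalled byte by byte, the same code handles a
character code and a 100-bit Integer alike; a Char could be passed through
fromEnum and toInteger first.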

If we think for a moment what a Haskell system using UTF8 would be
like, I think it's easiest to imagine that in future there will
be a way of specifying that a file contains character data stored
in UTF8 format, either as a flag stored in the filing system, or
as an option given to functions like Haskell's openFile.  Or
perhaps openFile would assume UTF8, and there will be an
openBinaryFile which does not.  However it's done, this is
entirely orthogonal to the question of how to encode character
data as binary *within* Haskell.

 > However, that isn't the issue which was being discussed in this
 > thread. The issue is that we need a standard mechanism for reading and
 > writing *octets*, so that Haskell programs can communicate with the
 > rest of the world.

Yes, we do.  At the moment my binary library does of course have to
use standard character input, plus a couple of internal GHC functions
(for writing blocks of data), and I hope that there will someday
be standard functions I can use instead.