Unicode + Re: Reading/Writing Binary Data in Haskell

Glynn Clements glynn.clements@virgin.net
Mon, 14 Jul 2003 14:21:30 +0100


George Russell wrote:

>  > OTOH, existing implementations (at least GHC and Hugs) currently read
>  > and write "8-bit binary", i.e. characters 0-255 get read and written
>  > "as-is" and anything else breaks, and changing that would probably
>  > break a fair amount of existing code.
> 
> The binary library I posted to the libraries list:
> 
>     http://haskell.org/pipermail/libraries/2003-June/001227.html
> 
> which is for GHC, does this properly.  All characters are encoded
> using a standard encoding for unsigned integers, which uses the
> bottom 7 bits of each character as data, and the top bit to signal
> that the encoding is not yet complete.  Characters 0-127 (which
> include the standard ASCII ones) get encoded as themselves.

This is similar in spirit to UTF-8, though not byte-for-byte
compatible with it; and UTF-8 is a standard format which can be read
and written by a wide variety of other programs.
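For concreteness, the continuation-bit scheme George describes might
look something like the sketch below. This is a hypothetical
illustration, not the actual library code: it assumes the
least-significant 7-bit group is emitted first, and the real library
may order the groups differently.

```haskell
import Data.Bits ((.&.), (.|.), shiftL, shiftR)
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Each octet carries 7 data bits; the high bit signals that the
-- encoding is not yet complete (more octets follow).
-- Assumption: least-significant group first.
encodeChar :: Char -> [Word8]
encodeChar = go . ord
  where
    go n
      | n < 0x80  = [fromIntegral n]                     -- final octet
      | otherwise = fromIntegral ((n .&. 0x7f) .|. 0x80) -- continuation
                    : go (n `shiftR` 7)

-- Inverse of the sketch above.
decodeChar :: [Word8] -> Char
decodeChar = chr . foldr step 0
  where
    step w acc = (acc `shiftL` 7) .|. fromIntegral (w .&. 0x7f)
```

As in George's description, characters 0-127 encode as themselves
(a single octet with the high bit clear).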

If we want a mechanism for encoding arbitrary Haskell strings as octet
lists, and we have a free choice as to the encoding, UTF-8 is
definitely the way to go.
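For comparison, UTF-8 encoding of a single Char can be sketched as
follows (a minimal illustration of the standard algorithm, ignoring
surrogates and other invalid code points):

```haskell
import Data.Bits ((.&.), (.|.), shiftR)
import Data.Char (ord)
import Data.Word (Word8)

-- Standard UTF-8: 1 to 4 octets per code point, with a length-marking
-- lead octet followed by 10xxxxxx continuation octets.
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. fromIntegral (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. fromIntegral (n `shiftR` 12),
                   cont (n `shiftR` 6), cont n]
  | otherwise   = [0xF0 .|. fromIntegral (n `shiftR` 18),
                   cont (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  where
    n      = ord c
    cont m = 0x80 .|. (fromIntegral m .&. 0x3F)
```

Unlike the scheme above, the lead octet here encodes the sequence
length, which is what makes UTF-8 self-synchronising and
interoperable with other software.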

However, that isn't the issue which was being discussed in this
thread. The issue is that we need a standard mechanism for reading and
writing *octets*, so that Haskell programs can communicate with the
rest of the world.

As things stand, if you want to read/write files which were written by
another program, you have to rely either upon extensions, or upon
behaviour which isn't mandated by the report.
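For example, one extension-based route in GHC is hGetBuf, which reads
raw octets into a buffer and so sidesteps character encoding entirely.
The sketch below is illustrative: the helper name and chunk size are
my own, not anything from this thread.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (Ptr)
import System.IO (Handle, IOMode(ReadMode), hGetBuf, withBinaryFile)

-- Read an entire file as a list of octets, using GHC's buffer-based
-- I/O rather than the report's Char-based readFile.
readOctets :: FilePath -> IO [Word8]
readOctets path = withBinaryFile path ReadMode $ \h ->
    allocaBytes chunkSize $ \buf -> loop h buf
  where
    chunkSize = 4096  -- arbitrary buffer size for this sketch
    loop :: Handle -> Ptr Word8 -> IO [Word8]
    loop h buf = do
      n <- hGetBuf h buf chunkSize   -- returns 0 at end of file
      if n == 0
        then return []
        else do
          ws   <- peekArray n buf
          rest <- loop h buf
          return (ws ++ rest)
```

The point stands, though: nothing in the report guarantees this, so a
portable program cannot rely on it.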

-- 
Glynn Clements <glynn.clements@virgin.net>