Unicode + Re: Reading/Writing Binary Data in Haskell
Glynn Clements
glynn.clements@virgin.net
Mon, 14 Jul 2003 14:21:30 +0100
George Russell wrote:
> > OTOH, existing implementations (at least GHC and Hugs) currently read
> > and write "8-bit binary", i.e. characters 0-255 get read and written
> > "as-is" and anything else breaks, and changing that would probably
> > break a fair amount of existing code.
>
> The binary library I posted to the libraries list:
>
> http://haskell.org/pipermail/libraries/2003-June/001227.html
>
> which is for GHC, does this properly. All characters are encoded
> using a standard encoding for unsigned integers, which uses the
> bottom 7 bits of each character as data, and the top bit to signal
> that the encoding is not yet complete. Characters 0-127 (which
> include the standard ASCII ones) get encoded as themselves.
This is similar to UTF-8; however, UTF-8 is a standard format which
can be read and written by a variety of other programs.
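For concreteness, the continuation-bit scheme described in the quoted message can be sketched roughly as follows (the names here are mine, not taken from the posted library):

```haskell
import Data.Bits (shiftL, shiftR, testBit, (.&.), (.|.))
import Data.Word (Word8)

-- Each octet carries 7 data bits; the top bit is set when more
-- octets follow. Values 0-127 encode as a single octet.
encodeUnsigned :: Integer -> [Word8]
encodeUnsigned n
  | n < 0x80  = [fromIntegral n]
  | otherwise = (fromIntegral (n .&. 0x7f) .|. 0x80)
                : encodeUnsigned (n `shiftR` 7)

decodeUnsigned :: [Word8] -> Integer
decodeUnsigned = go 0 0
  where
    go _ acc [] = acc
    go sh acc (b:bs)
      | testBit b 7 = go (sh + 7)
                         (acc .|. (fromIntegral (b .&. 0x7f) `shiftL` sh))
                         bs
      | otherwise   = acc .|. (fromIntegral b `shiftL` sh)
```

For example, encodeUnsigned 300 yields [172, 2], and decoding that list recovers 300.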
If we want a mechanism for encoding arbitrary Haskell strings as octet
lists, and we have a free choice as to the encoding, UTF-8 is
definitely the way to go.
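To illustrate what such an encoding looks like, here is a minimal UTF-8 encoder sketch for characters in the range a Haskell Char can hold (again, illustrative code, not any particular library's API):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one character as 1-4 octets per the UTF-8 format.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | n < 0x80    = [fromIntegral n]                       -- ASCII: one octet
  | n < 0x800   = [0xc0 .|. hi 6,  tl 0]                 -- two octets
  | n < 0x10000 = [0xe0 .|. hi 12, tl 6, tl 0]           -- three octets
  | otherwise   = [0xf0 .|. hi 18, tl 12, tl 6, tl 0]    -- four octets
  where
    n = ord c
    hi s = fromIntegral (n `shiftR` s)
    tl s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3f)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeCharUtf8
```

Note that ASCII strings encode to themselves, which is part of why UTF-8 interoperates so well with existing tools.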
However, that isn't the issue which was being discussed in this
thread. The issue is that we need a standard mechanism for reading and
writing *octets*, so that Haskell programs can communicate with the
rest of the world.
As things stand, if you want to read or write files which were written
by another program, you have to rely either upon extensions or upon
behaviour which isn't mandated by the report.
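As an example of the extension route, GHC provides hGetBuf and hPutBuf,
which move raw bytes between a Handle and a buffer; combined with the
FFI marshalling functions you can read and write lists of Word8. This
is a GHC-specific sketch, not portable Haskell 98:

```haskell
import System.IO
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray, pokeArray)

-- Read up to n octets from a handle (GHC extension: hGetBuf).
readOctets :: Handle -> Int -> IO [Word8]
readOctets h n =
  allocaBytes n $ \buf -> do
    got <- hGetBuf h buf n          -- returns the number of bytes read
    peekArray got buf

-- Write a list of octets to a handle (GHC extension: hPutBuf).
writeOctets :: Handle -> [Word8] -> IO ()
writeOctets h ws =
  allocaBytes (length ws) $ \buf -> do
    pokeArray buf ws
    hPutBuf h buf (length ws)
```

Opening the files with openBinaryFile avoids any newline translation,
so the octets pass through untouched.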
--
Glynn Clements <glynn.clements@virgin.net>