Text in Haskell: A PROPOSAL

Ken Shan ken@digitas.harvard.edu
Wed, 7 Aug 2002 13:53:44 -0400



On 2002-08-07T18:26:49+0100, Axel Simon wrote:
> > Fortunately, standard C requires char to be at least 8 bits.
> Does it?

Yes.  I am looking at section 5.2.4.2 ("Numerical limits") at the URL
http://www.dkuug.dk/JTC1/SC22/WG14/www/docs/n843.htm , which is only a
committee draft, but I don't think this part of the standard has
changed.

> Although other compilers might not, GHC does indeed support Unicode 32 bit
> characters directly.
> <advertisement>
> In gtk2hs ( gtk2hs.sourceforge.org ) I have a small demo displaying
> arabic text in a dialog which looks like this:
> arabic =
>   map chr [0x647,0x644,32,0x62A,0x62C,0x62F,0x646,32,0x647,0x622,
>            0x633,0x643,0x622,0x644,32,0x644,0x63A,0x62A,32,
>            0x645,0x62F,0x647,0x634,0x62A,0x61F]
> </advertisement>
>
> So instead the only thing we have to make sure is that we marshal strings
> from and to the outside correctly.

It seems that GTK deals directly with Unicode text.  That is great.
However, it is often insufficient to model text as a sequence of Unicode
code points.  In other words, it is impossible to marshal strings from
and to the outside correctly with the (function types of the) current
library.  For example, a Haskell program should be able to

  - read and write multiple files and network sockets with different
    encodings;

  - normalize Unicode strings into various normalization forms, for
    example as specified by the W3C working draft "Character Model for
    the World Wide Web" (http://www.w3.org/TR/2002/WD-charmod-20020430/);

  - deal gracefully with unencodable characters, possibly with user
    interaction (e.g., "the encoding you have selected is insufficient
    for this document; please choose one of the following alternatives");

  - maintain state necessary for processing combining characters;

  - distinguish between right-to-left or bidirectional text stored in
    "display order" versus "logical order";

  - etc.
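
To make the first and third points concrete, here is a minimal sketch of
two encoders written in Haskell itself, using only the base library.  The
names encodeLatin1, encodeUtf8 and Unencodable are hypothetical and belong
to no existing library; the point is only that the encoder's type can
report an unencodable character to the caller, who may then ask the user
to pick another encoding.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Hypothetical error type: the first character that the chosen
-- encoding cannot represent.
newtype Unencodable = Unencodable Char
  deriving (Eq, Show)

-- | Encode as ISO-8859-1, stopping at the first code point above 0xFF.
encodeLatin1 :: String -> Either Unencodable [Word8]
encodeLatin1 = mapM enc
  where
    enc c
      | ord c <= 0xFF = Right (fromIntegral (ord c))
      | otherwise     = Left (Unencodable c)

-- | Encode as UTF-8.  Assumption: the input contains no surrogate code
-- points (0xD800-0xDFFF); this sketch does not check for them.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap enc
  where
    enc c
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 + hi 6,  cont 0]
      | n < 0x10000 = [0xE0 + hi 12, cont 6,  cont 0]
      | otherwise   = [0xF0 + hi 18, cont 12, cont 6, cont 0]
      where
        n = ord c
        -- Leading bits of the code point, placed in the first octet.
        hi :: Int -> Word8
        hi s = fromIntegral (n `shiftR` s)
        -- Six payload bits in a continuation octet (10xxxxxx).
        cont :: Int -> Word8
        cont s = 0x80 + fromIntegral ((n `shiftR` s) .&. 0x3F)
```

Applied to the first code point of the Arabic string quoted above (0x647),
encodeLatin1 returns a Left carrying that character, while encodeUtf8
yields the two octets 0xD9 0x87.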

Most of the functionality mentioned above is best handled by the
Haskell code itself.  It is an unrealistic simplification for the
Haskell library to pretend that files store, sockets transmit, and
foreign functions process Unicode characters (rather than say octets).
Abstractions at a higher level than raw octets may be desirable in most
circumstances, but different abstractions are needed by different
applications, so a lower-level interface should be provided to the
Haskell programmer.
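
As a rough illustration of what such a lower-level interface might look
like, the sketch below moves raw octets through a Handle using hGetBuf and
hPutBuf from System.IO.  The names hGetOctets and hPutOctets are
hypothetical, and lists of Word8 are chosen for brevity, not efficiency.
It assumes the handle was opened in binary mode.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import System.IO (Handle, hGetBuf, hPutBuf)

-- | Read up to n octets from a handle; fewer are returned at end of file.
-- Hypothetical helper, assuming n >= 0.
hGetOctets :: Handle -> Int -> IO [Word8]
hGetOctets h n = allocaBytes n $ \buf -> do
  got <- hGetBuf h buf n   -- number of octets actually read
  peekArray got buf

-- | Write a list of octets to a handle, with no encoding applied.
hPutOctets :: Handle -> [Word8] -> IO ()
hPutOctets h ws = withArrayLen ws $ \len buf -> hPutBuf h buf len
```

With functions of these types, the choice of encoding, normalization and
error handling stays in ordinary Haskell code layered on top.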

> I don't think anyone wants to fiddle
> with different representation within Haskell.

As the examples above illustrate, I do.  I would much rather fiddle with
different representations within Haskell than within C.  (:

-- 
Edit this signature at http://www.digitas.harvard.edu/cgi-bin/ken/sig
When there was no meat, we ate fowl. When there was no fowl, we ate
crawdads. When there was no crawdads to be found, we ate sand.
