Text in Haskell: A PROPOSAL

Ken Shan ken@digitas.harvard.edu
Wed, 7 Aug 2002 12:53:33 -0400


On 2002-08-07T11:03:40+0100, Axel Simon wrote:
> Then I hope there is no C implementation where char is less than 8 bits
> long.

Fortunately, standard C requires char to be at least 8 bits.

I have a stake in using Haskell for international text processing: in
particular, I have been writing Haskell code that typesets international
text.  Let me summarize what I think are the basic types of data that
need to be distinguished and processed *somehow* within a Haskell
program:

  (1) chars in C (perhaps distinguishing between unsigned, signed, and
      default)
  (2) 8-bit integers (i.e., signed) and words (i.e., unsigned)
  (3) Unicode code values (16-bit)
  (4) Unicode code points (held in 32 bits; those above U+FFFF are
      often represented as surrogate pairs, i.e., two consecutive
      16-bit code values)

The problem with the current situation is that Char in Haskell is
supposed to mean 4, but in reality (e.g., in the GHC implementation) it
mostly means 1.

The conflict in the present discussion arises from two desires: One, to
use Char as 1 above, for FFI convenience and quick-and-dirty code.  Two,
to use Char as 4 above, for international text processing and conceptual
correctness.

I believe that we need library functions to:

  (a) Convert between 1 and 2, or more generally, convert between 1 and
      Integral types;
  (b) Convert between 2 and 4, under a specified encoding such as
      ISO-8859-1 or UTF-8;
  (c) Convert between 3 and 4, according to the Unicode standard.
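
The three conversions above could look roughly like the following.  This
is only a sketch under stated assumptions: the type aliases and every
function name here (cCharToInt8, decodeLatin1, encodeLatin1,
toCodeValues, fromCodeValues) are made up for illustration and are not
part of any existing library, and only ISO-8859-1 is shown for (b).

```haskell
import Data.Bits (shiftL, shiftR, (.&.))
import Data.Int (Int8)
import Data.Word (Word8, Word16, Word32)
import Foreign.C.Types (CChar)

-- Hypothetical aliases standing in for items (2)-(4) of the first list.
type Byte      = Word8
type CodeValue = Word16   -- one 16-bit code value
type CodePoint = Word32   -- one Unicode code point

-- (a) C char to an Integral type; CChar is itself Integral.
cCharToInt8 :: CChar -> Int8
cCharToInt8 = fromIntegral

-- (b) ISO-8859-1: each byte is the code point with the same number.
decodeLatin1 :: [Byte] -> [CodePoint]
decodeLatin1 = map fromIntegral

-- (b) Other direction: fails on code points above 0xFF.
encodeLatin1 :: [CodePoint] -> Maybe [Byte]
encodeLatin1 = mapM enc
  where
    enc cp
      | cp <= 0xFF = Just (fromIntegral cp)
      | otherwise  = Nothing

isHigh, isLow :: CodeValue -> Bool
isHigh v = v >= 0xD800 && v <= 0xDBFF
isLow  v = v >= 0xDC00 && v <= 0xDFFF

-- (c) Code values to code points: join surrogate pairs.
-- Assumption: unpaired surrogates are passed through unchanged;
-- a real library would have to choose an error policy.
fromCodeValues :: [CodeValue] -> [CodePoint]
fromCodeValues (hi : lo : rest)
  | isHigh hi && isLow lo =
      let h = fromIntegral hi - 0xD800 :: Word32
          l = fromIntegral lo - 0xDC00 :: Word32
      in (0x10000 + (h `shiftL` 10) + l) : fromCodeValues rest
fromCodeValues (v : rest) = fromIntegral v : fromCodeValues rest
fromCodeValues []         = []

-- (c) Code points to code values: split anything above U+FFFF.
-- Assumption: the input contains no values above 0x10FFFF.
toCodeValues :: [CodePoint] -> [CodeValue]
toCodeValues = concatMap enc
  where
    enc cp
      | cp < 0x10000 = [fromIntegral cp]
      | otherwise    =
          let r = cp - 0x10000
          in [ fromIntegral (0xD800 + (r `shiftR` 10))
             , fromIntegral (0xDC00 + (r .&. 0x3FF)) ]
```

For instance, toCodeValues [0x1D11E] gives [0xD834, 0xDD1E], and
fromCodeValues maps that pair back to the single code point.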

My proposal involves the following types:

  (1) Represent char in C as Char, and zero-terminated strings (char*)
      in C as CString.
  (2) Represent 8-bit integers and words as Int8 and Word8.
  (3) Represent Unicode code values as Word16, or a new Haskell type
      CodeValue.
  (4) Represent Unicode code points as Word32, or a new Haskell type
      CodePoint.

String will continue to be a type synonym for [Char].  Some applications
may find it useful to define a type synonym for [CodeValue] or
[CodePoint], but given combining characters and other complexities of
Unicode, I suspect many text processing applications will need a more
sophisticated notion of text than just a sequence of characters.

In C, char is a numeric type like int and long.  In Haskell, we like to
guard against comparing 'A' with 65 directly.  The functions ord
and chr are useful in this regard.  I suggest we extend ord and chr via
a new type class, Character:

    class Eq c => Character c where
        fromChar :: Char -> c
        ord :: (Integral i) => c -> i
        chr :: (Integral i) => i -> c

    instance Character Char      where ...
    instance Character CodePoint where ...
    instance Character CodeValue where ...

Laws that should hold of Character include:

    chr (ord ch :: Integer) == ch    for all ch :: c, where Character c
    fromChar char == char            for all char :: Char
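
To make the class and its laws concrete, here is one way the instances
might be filled in.  This is a sketch, not a settled design: the
newtype representations of CodePoint and CodeValue, the instance
bodies, and the helper lawRoundTrip are all assumptions of mine.  The
existing ord and chr are imported qualified from Data.Char so that the
class methods can reuse their names.

```haskell
import qualified Data.Char as C
import Data.Word (Word16, Word32)

-- Assumed representations for the proposed types.
newtype CodeValue = CodeValue Word16 deriving (Eq, Show)
newtype CodePoint = CodePoint Word32 deriving (Eq, Show)

class Eq c => Character c where
  fromChar :: Char -> c
  ord      :: Integral i => c -> i
  chr      :: Integral i => i -> c

instance Character Char where
  fromChar = id
  ord      = fromIntegral . C.ord
  -- C.chr raises an error for numbers outside the Char range.
  chr      = C.chr . fromIntegral

instance Character CodePoint where
  fromChar          = CodePoint . fromIntegral . C.ord
  ord (CodePoint w) = fromIntegral w
  chr               = CodePoint . fromIntegral

instance Character CodeValue where
  -- Assumption: a Char that does not fit in 16 bits is silently
  -- truncated; a real design would need surrogates or an error here.
  fromChar          = CodeValue . fromIntegral . C.ord
  ord (CodeValue w) = fromIntegral w
  chr               = CodeValue . fromIntegral

-- First law: chr (ord ch :: Integer) == ch
lawRoundTrip :: Character c => c -> Bool
lawRoundTrip ch = chr (ord ch :: Integer) == ch

-- Second law: fromChar char == char, at the Char instance
lawFromChar :: Char -> Bool
lawFromChar char = fromChar char == char
```

With these instances, lawRoundTrip holds for values such as 'A',
CodePoint 0x1D11E, and CodeValue 0xABCD.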

The defaulting mechanism should be extended to character literals and
string literals.  That is,

    'A'

should mean something like

    chr 65

and

    "ABC"

should mean something like

    map chr [65, 66, 67]

Also,

    '\xABCD'

should mean something like

    chr 0xABCD

Pattern matching against character and string literals should make use
of (==) from Eq.
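
Since no compiler does this defaulting today, the effect can only be
imitated by hand.  The sketch below writes out what the literals and a
literal pattern might elaborate to.  It uses a cut-down stand-in for
the Character class (chr only, fixed at Integer), and the names litA,
litABC, and describe are invented for the illustration.

```haskell
import qualified Data.Char as C
import Data.Word (Word32)

-- Assumed representation for the proposed CodePoint type.
newtype CodePoint = CodePoint Word32 deriving (Eq, Show)

-- Cut-down stand-in for the proposed class: only chr is needed here.
class Eq c => Character c where
  chr :: Integer -> c

instance Character Char where
  chr = C.chr . fromIntegral

instance Character CodePoint where
  chr = CodePoint . fromIntegral

-- What the literal 'A' might elaborate to.
litA :: Character c => c
litA = chr 65

-- What the literal "ABC" might elaborate to.
litABC :: Character c => [c]
litABC = map chr [65, 66, 67]

-- A function written with the patterns 'A' and '\xABCD' would become
-- a chain of (==) tests, much as numeric literal patterns do.
describe :: Character c => c -> String
describe c
  | c == litA       = "capital A"
  | c == chr 0xABCD = "U+ABCD"
  | otherwise       = "something else"
```

The same definitions then work at either type: litABC is "ABC" when
used as [Char] and a list of three CodePoint values when used as
[CodePoint].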
