Text in Haskell: A PROPOSAL

Ashley Yakeley ashley@semantic.org
Wed, 7 Aug 2002 15:21:11 -0700


At 2002-08-07 11:05, Ken Shan wrote:

>Let me clarify my understanding of this point a bit further.  On the one
>hand, GHC uses Char to mean a 32-bit value like a Unicode code point.

No, GHC uses Char to mean a Unicode codepoint. Codepoints are not 32-bit 
values: GHC only allows the 17 planes, i.e. values in the range '\x0' to 
'\x10FFFF'. This is the Right Thing as per Unicode 3.1 and later (current 
is 3.2.0).
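
The range can be checked from within Haskell. This is a minimal sketch, assuming a GHC where maxBound :: Char is '\x10FFFF'; the names maxCodePoint and planeCount are made up for illustration and are not part of any library.

```haskell
import Data.Char (ord)

-- Largest value a Char can hold, as an Int.
-- Under the assumption above this is 0x10FFFF.
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- Number of 65536-codepoint planes covered by Char.
planeCount :: Int
planeCount = (maxCodePoint + 1) `div` 0x10000
```

A value this large needs 21 bits, so Char fits in a 32-bit word without itself being a 32-bit type.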

>On the other hand, GHC uses Char to mean what files store and sockets
>transmit and foreign functions process under the C type "char".  

Right, and this is a very bad idea. The file IO functions should be using 
Word8s.
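
To make the separation concrete, here is a rough sketch of how one Char could be turned into the Word8s that a file or socket would actually carry, using UTF-8 as the encoding. The function name encodeUtf8 is hypothetical, and the sketch does not reject surrogate codepoints ('\xD800' to '\xDFFF'), which a real encoder would have to handle.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode a single codepoint as a sequence of UTF-8 octets.
-- Assumption: the input is not a surrogate codepoint.
encodeUtf8 :: Char -> [Word8]
encodeUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading octet: length tag plus the top bits of the codepoint.
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    -- Continuation octet: 10xxxxxx carrying six bits of the codepoint.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))
```

With something like this in a library, the IO layer would only ever see Word8, and the choice of encoding would be explicit in the program.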

>These two uses are inconsistent, and must be separated.

I agree.

At 2002-08-07 09:53, Ken Shan wrote:

>I have a stake in using Haskell for international text processing: In
>particular, I have been writing Haskell code that typesets international
>text.  Let me summarize what I think are the basic types of data that
>need to be distinguished and processed *somehow* within a Haskell
>program:
>
>  (1) chars in C (perhaps distinguishing between unsigned, signed, and
>      default)
>  (2) 8-bit integers (i.e., signed) and words (i.e., unsigned)
>  (3) Unicode code values (16-bit)

I think the whole 16-bit code value thing was dropped as of 3.1. UTF-16 
uses 16-bit values to represent text just as UTF-8 uses 8-bit values.

>The conflict in the present discussion arises from two desires: One, to
>use Char as 1 above, for FFI convenience and quick-and-dirty code.  Two,
>to use Char as 4 above, for international text processing and conceptual
>correctness.
>
>I believe that we need library functions to:
>
>  (a) Convert between 1 and 2, or more generally, convert between 1 and
>      Integral types;
>  (b) Convert between 2 and 4, under a specified encoding such as
>      ISO-8859-1 or UTF-8;
>  (c) Convert between 3 and 4, according to the Unicode standard.

You mean according to UTF-16.

>My proposal involves the following types:
>
>  (1) Represent char in C as Char, and zero-terminated strings (char*)
>      in C as CString.

We already have the CChar type that means that.
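
Since CChar is an instance of Integral, the conversion in (a) can be sketched with fromIntegral alone. The helper names below are hypothetical, and the sketch assumes CChar is 8 bits wide on the platform in question.

```haskell
import Data.Word (Word8)
import Foreign.C.Types (CChar)

-- Reinterpret a C char as an unsigned octet.
-- Assumption: CChar is 8 bits wide, so no information is lost.
ccharToWord8 :: CChar -> Word8
ccharToWord8 = fromIntegral

-- The reverse direction, from an octet back to a C char.
word8ToCChar :: Word8 -> CChar
word8ToCChar = fromIntegral
```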

>  (2) Represent 8-bit integers and words as Int8 and Word8.

Agreed.

>  (3) Represent Unicode code values as Word16, or a new Haskell type
>      CodeValue.

I don't think it's appropriate to have a new type. UTF-16 is a way of 
representing codepoints as 16-bit integers. The UTF-16 functions should 
use Word16.
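
As an illustration of what such a function might look like, here is a minimal sketch that encodes one Char as UTF-16 code units of type Word16. The name encodeUtf16 is hypothetical, and the sketch assumes the input is not itself a surrogate codepoint ('\xD800' to '\xDFFF').

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode a single codepoint as one or two UTF-16 code units.
-- Assumption: the input is not a surrogate codepoint.
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]            -- BMP: one code unit
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))  -- high surrogate
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]  -- low surrogate
  where
    n = ord c
    -- Offset into the supplementary planes (20 bits).
    m = n - 0x10000
```

Codepoints above '\xFFFF' come out as two Word16 values, which is why the 16-bit values are an encoding of Char rather than a character type of their own.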

>  (4) Represent Unicode code points as Word32, or a new Haskell type
>      CodePoint.

We already have the Char type that means that -- in GHC, at least. String 
and character literals in programs should use it.

-- 
Ashley Yakeley, Seattle WA