Text in Haskell: A PROPOSAL
Ashley Yakeley
ashley@semantic.org
Wed, 7 Aug 2002 15:21:11 -0700
At 2002-08-07 11:05, Ken Shan wrote:
>Let me clarify my understanding of this point a bit further. On the one
>hand, GHC uses Char to mean a 32-bit value like a Unicode code point.
No, GHC uses Char to mean a Unicode codepoint. Codepoints are not 32-bit
values: GHC only allows the 17 planes, i.e. values in the range '\x0' to
'\x10FFFF'. This is the Right Thing as per Unicode 3.1 and later (current
is 3.2.0).
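To make the range concrete, here is a minimal sketch that checks it from inside Haskell. The names maxCodePoint, planeOf and toCharMaybe are made up for illustration and are not part of any library; only chr, ord and maxBound come from the standard libraries.

```haskell
import Data.Char (chr, ord)

-- The largest codepoint a Char can hold; in GHC this is 0x10FFFF.
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- Plane number (0 to 16) of a character: each plane holds 0x10000 codepoints.
planeOf :: Char -> Int
planeOf c = ord c `div` 0x10000

-- Hypothetical helper: a total version of chr that rejects values
-- outside the range '\x0' to '\x10FFFF' instead of raising an error.
toCharMaybe :: Int -> Maybe Char
toCharMaybe n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing
```

With these definitions, planeOf maxBound is 16 (the last of the 17 planes), and toCharMaybe 0x110000 is Nothing even though that value fits easily in 32 bits.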
>On the other hand, GHC uses Char to mean what files store and sockets
>transmit and foreign functions process under the C type "char".
Right, and this is a very bad idea. The file IO functions should be using
Word8s.
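As a rough sketch of what octet-based file IO could look like, the following wraps the binary-mode functions of System.IO so that callers only ever see Word8. It assumes a GHC whose System.IO provides withBinaryFile; readBytes and writeBytes are hypothetical names, not existing library functions, and a real library would want something more efficient than a list.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO

-- Hypothetical helper: read a whole file as raw octets.
-- In binary mode each Char delivered by hGetContents is in 0..255,
-- so the conversion to Word8 loses nothing.
readBytes :: FilePath -> IO [Word8]
readBytes path =
  withBinaryFile path ReadMode $ \h -> do
    s <- hGetContents h
    let bytes = map (fromIntegral . ord) s
    -- Force the lazy read to finish before the handle is closed.
    length bytes `seq` return bytes

-- Hypothetical helper: write raw octets to a file, one byte per Word8.
writeBytes :: FilePath -> [Word8] -> IO ()
writeBytes path bytes =
  withBinaryFile path WriteMode $ \h ->
    hPutStr h (map (chr . fromIntegral) bytes)
```

The point of the wrapper is the type: nothing that comes out of readBytes can be mistaken for text until some decoding function has been applied to it.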
>These two uses are inconsistent, and must be separated.
I agree.
At 2002-08-07 09:53, Ken Shan wrote:
>I have a stake in using Haskell for international text processing: In
>particular, I have been writing Haskell code that typesets international
>text. Let me summarize what I think are the basic types of data that
>need to be distinguished and processed *somehow* within a Haskell
>program:
>
> (1) chars in C (perhaps distinguishing between unsigned, signed, and
> default)
> (2) 8-bit integers (i.e., signed) and words (i.e., unsigned)
> (3) Unicode code values (16-bit)
I think the whole idea of treating 16-bit code values as characters was
dropped as of 3.1. UTF-16 uses sequences of 16-bit values to represent
text, just as UTF-8 uses sequences of 8-bit values.
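The parallel between the two encodings can be sketched as a pair of functions from a single Char to its code units. The names utf8Units and utf16Units are invented for this example. The sketch does not reject Chars in the surrogate range '\xD800' to '\xDFFF', which a careful encoder would.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8, Word16)

-- Encode one codepoint as UTF-8 code units (1 to 4 octets).
utf8Units :: Char -> [Word8]
utf8Units c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading octet: length tag plus the topmost payload bits.
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    -- Continuation octet: 10xxxxxx carrying six payload bits.
    cont s     = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- Encode one codepoint as UTF-16 code units (1 or 2 words).
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))  -- high surrogate
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]  -- low surrogate
  where
    n = ord c
    m = n - 0x10000
```

For U+1D11E (outside the first plane), utf8Units gives the four octets F0 9D 84 9E and utf16Units gives the two words D834 DD1E. In both cases the Char is one codepoint and the encoded form is a sequence of smaller units.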
>The conflict in the present discussion arises from two desires: One, to
>use Char as 1 above, for FFI convenience and quick-and-dirty code. Two,
>to use Char as 4 above, for international text processing and conceptual
>correctness.
>
>I believe that we need library functions to:
>
> (a) Convert between 1 and 2, or more generally, convert between 1 and
> Integral types;
> (b) Convert between 2 and 4, under a specified encoding such as
> ISO-8859-1 or UTF-8;
> (c) Convert between 3 and 4, according to the Unicode standard.
You mean according to UTF-16.
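Conversion (c) in the direction from 16-bit values to codepoints is then just UTF-16 decoding. A minimal sketch, with utf16Decode as a made-up name and with the (assumed, not standard-mandated) policy of returning Nothing on any unpaired surrogate:

```haskell
import Data.Char (chr)
import Data.Word (Word16)

-- Decode UTF-16 code units to codepoints.
-- Returns Nothing if a surrogate appears without its partner.
utf16Decode :: [Word16] -> Maybe String
utf16Decode [] = Just []
utf16Decode (w : ws)
  -- Not a surrogate: the code unit is the codepoint.
  | w < 0xD800 || w > 0xDFFF = (chr (fromIntegral w) :) <$> utf16Decode ws
  -- High surrogate followed by a low surrogate: combine the pair.
  | w <= 0xDBFF
  , (lo : rest) <- ws
  , lo >= 0xDC00 && lo <= 0xDFFF
      = (chr (combine w lo) :) <$> utf16Decode rest
  -- Lone high surrogate, or a low surrogate with no high one before it.
  | otherwise = Nothing
  where
    combine hi l =
      0x10000 + (fromIntegral hi - 0xD800) * 0x400 + (fromIntegral l - 0xDC00)
```

For example, utf16Decode [0xD834, 0xDD1E] yields Just "\x1D11E", a String of length one, while utf16Decode [0xD834] yields Nothing.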
>My proposal involves the following types:
>
> (1) Represent char in C as Char, and zero-terminated strings (char*)
> in C as CString.
We already have the CChar type that means that.
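For illustration, here is a small sketch using the FFI library's Foreign.C.String and Foreign.C.Types modules as found in GHC's libraries. The names letterA and roundTrip are just for this example. Note that castCharToCChar is documented as only safe on the first 256 characters, which is exactly why Char and CChar should not be conflated.

```haskell
import Foreign.C.String (castCCharToChar, castCharToCChar, peekCString, withCString)
import Foreign.C.Types (CChar)

-- A C char is its own type, distinct from Char.
letterA :: CChar
letterA = castCharToCChar 'A'

-- Round-trip a Haskell String through a zero-terminated C string (char*).
-- withCString marshals the String into temporary C memory;
-- peekCString reads it back before that memory is released.
roundTrip :: String -> IO String
roundTrip s = withCString s peekCString
```

Here fromIntegral letterA :: Int is 65, and roundTrip "hello" returns "hello". Strings outside ASCII go through whatever conversion the marshalling functions apply, so they are deliberately left out of the example.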
> (2) Represent 8-bit integers and words as Int8 and Word8.
Agreed.
> (3) Represent Unicode code values as Word16, or a new Haskell type
> CodeValue.
I don't think it's appropriate to have a new type. UTF-16 is a way of
representing codepoints as 16-bit integers. The UTF-16 functions should
use Word16.
> (4) Represent Unicode code points as Word32, or a new Haskell type
> CodePoint.
We already have the Char type that means that -- in GHC, at least. String
and character literals in programs should use it.
--
Ashley Yakeley, Seattle WA