Text in Haskell: A PROPOSAL

Axel Simon A.Simon@ukc.ac.uk
Wed, 7 Aug 2002 18:26:49 +0100


On Wed, Aug 07, 2002 at 12:53:33PM -0400, Ken Shan wrote:
> On 2002-08-07T11:03:40+0100, Axel Simon wrote:
> > Then I hope there is no C implementation where char is less than 8 bits 
> > long.
> 
> Fortunately, standard C requires char to be at least 8 bits.
Does it?

> I have a stake in using Haskell for international text processing: In
> particular, I have been writing Haskell code that typesets international
> text.  Let me summarize what I think are the basic types of data that
> need to be distinguished and processed *somehow* within a Haskell
> program:
> 
>   (1) chars in C (perhaps distinguishing between unsigned, signed, and
>       default)
>   (2) 8-bit integers (i.e., signed) and words (i.e., unsigned)
>   (3) Unicode code values (16-bit)
>   (4) Unicode code points (32-bit, including surrogate characters, which
>       are often treated as two consecutive 16-bit code values)
> 
> The problem with the current situation is that Char in Haskell is
> supposed to mean 4, but in reality (e.g., GHC implementation) mostly
> means 1.
Other compilers might not, but GHC does support 32-bit Unicode characters 
directly.
<advertisement>
In gtk2hs ( gtk2hs.sourceforge.org ) I have a small demo that displays 
Arabic text in a dialog. The string is defined like this:
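import Data.Char (chr)
-- Arabic text built from Unicode code points (32 is the ASCII space).
arabic :: String
arabic =
  map chr [0x647,0x644,32,0x62A,0x62C,0x62F,0x646,32,0x647,0x622,
           0x633,0x643,0x622,0x644,32,0x644,0x63A,0x62A,32,
           0x645,0x62F,0x647,0x634,0x62A,0x61F]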
</advertisement>

So the only thing we have to make sure of is that we marshal strings to 
and from the outside world correctly. I don't think anyone wants to fiddle 
with different representations within Haskell.
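
A minimal sketch of what marshalling at the boundary could look like, 
assuming the outside world expects UTF-8 (that is an assumption; nothing 
above settles on an encoding). The names encodeChar and toUTF8 are made up 
for illustration and are not part of gtk2hs or any library. The sketch does 
not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode a single Char (one Unicode code point) as UTF-8 bytes.
-- Hypothetical helper, written only for illustration.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]                           -- 1 byte (ASCII)
  | n < 0x800   = [lead 0xC0 6, cont 0]                      -- 2 bytes
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]             -- 3 bytes
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]    -- 4 bytes
  where
    n = ord c
    -- Leading byte: length tag plus the top bits of the code point.
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- Marshal a Haskell String to the byte sequence handed to the outside.
toUTF8 :: String -> [Word8]
toUTF8 = concatMap encodeChar
```

Inside the program everything stays a plain String of Chars; only the 
result of toUTF8 would cross into C. Applied to the arabic string from the 
demo, each Arabic letter becomes two bytes and each space one.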

Axel.