Text in Haskell: A PROPOSAL
Axel Simon
A.Simon@ukc.ac.uk
Wed, 7 Aug 2002 18:26:49 +0100
On Wed, Aug 07, 2002 at 12:53:33PM -0400, Ken Shan wrote:
> On 2002-08-07T11:03:40+0100, Axel Simon wrote:
> > Then I hope there is no C implementation where char is less than 8 bits
> > long.
>
> Fortunately, standard C requires char to be at least 8 bits.
Does it?
> I have a stake in using Haskell for international text processing: In
> particular, I have been writing Haskell code that typesets international
> text. Let me summarize what I think are the basic types of data that
> need to be distinguished and processed *somehow* within a Haskell
> program:
>
> (1) chars in C (perhaps distinguishing between unsigned, signed, and
> default)
> (2) 8-bit integers (i.e., signed) and words (i.e., unsigned)
> (3) Unicode code values (16-bit)
> (4) Unicode code points (32-bit, including surrogate characters, which
> are often treated as two consecutive 16-bit code values)
>
> The problem with the current situation is that Char in Haskell is
> supposed to mean 4, but in reality (e.g., GHC implementation) mostly
> means 1.
Although other compilers might not, GHC does indeed support 32-bit Unicode
characters directly.
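A quick way to check this for yourself is the sketch below. It assumes GHC, and the names maxCodePoint and roundTrips are made up for illustration only. It just asks how large a Char can get and whether a code point beyond the 16-bit range survives a trip through Char:

```haskell
import Data.Char (chr, ord)

-- Largest code point a Char can hold. Under GHC this is 0x10FFFF,
-- i.e. the top of the Unicode range, not 0xFF or 0xFFFF.
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- True if the code point n comes back unchanged after being stored
-- in a Char (hypothetical helper, for illustration only).
roundTrips :: Int -> Bool
roundTrips n = ord (chr n) == n
```

For instance, roundTrips 0x1D11E (a code point that does not fit in 16 bits) should hold, which an 8-bit or 16-bit Char could not manage.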
<advertisement>
In gtk2hs ( gtk2hs.sourceforge.org ) I have a small demo that displays
Arabic text in a dialog. The string is defined like this:
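import Data.Char (chr)

-- Arabic sample text, built from Unicode code points (32 is a space).
arabic :: String
arabic =
  map chr [0x647,0x644,32,0x62A,0x62C,0x62F,0x646,32,0x647,0x622,
           0x633,0x643,0x622,0x644,32,0x644,0x63A,0x62A,32,
           0x645,0x62F,0x647,0x634,0x62A,0x61F]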
</advertisement>
So the only thing we have to make sure of is that we marshal strings
to and from the outside world correctly. I don't think anyone wants to
fiddle with different representations within Haskell.
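To make "marshal correctly" a bit more concrete, here is a rough sketch of the outgoing direction, assuming the outside world wants UTF-8. That is my assumption for the example (other targets would need other encodings), and encodeChar and encodeUtf8 are hypothetical names, not functions from gtk2hs or any existing library:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Char as its UTF-8 byte sequence (1 to 4 bytes).
-- Note: this sketch does not reject surrogate code points
-- (0xD800-0xDFFF), which a strict encoder would refuse.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading byte: length tag plus the top bits of the code point.
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- Encode a whole Haskell String, one Char at a time.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar
```

Inside Haskell the string stays a plain list of Char; only at the boundary does it turn into bytes, e.g. the first letter of the Arabic demo, chr 0x647, becomes the two bytes 0xD9 0x87.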
Axel.