Unicode in GHC: need more advice

Fri Jan 14 07:57:47 EST 2005

Hi,

Simon Marlow wrote:

> 
> You're doing fine - but a better place for the tables is as part of the
> base package, rather than the RTS.  We already have some C files in the
> base package: see libraries/base/cbits, for example.  I suggest just
> putting your code in there.

I have done that - now GHCi recognizes those symbols and loads fine. The
test program also works when compiled. I still get some warnings about
missing prototypes and implicit declarations for the functions I defined
in place of the libc ones, especially during Stage 1. I need to look
into that, but since all those functions are basically int -> int, it
does not affect the result.
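
A minimal sketch of the prototype issue, assuming hypothetical function
names (my_iswupper, my_towlower) and toy ASCII-only bodies; the names and
contents of the actual draft code are not shown in this message. Under
C89 rules a call to an undeclared function is treated as a call to a
function returning int, which is why int -> int functions still give the
right result. Putting the prototypes in a header included by both the
cbits file and its callers should silence the warnings.

```c
/* Hypothetical header part: prototypes for the replacement functions.
   The names are illustrative only, not the ones used in the draft code. */
#ifndef UNI_SKETCH_H
#define UNI_SKETCH_H
int my_iswupper(int c);
int my_towlower(int c);
#endif

/* Toy definitions that cover ASCII only (assumes an ASCII execution
   character set); the real code would consult the Unicode tables. */
int my_iswupper(int c)
{
    return c >= 'A' && c <= 'Z';
}

int my_towlower(int c)
{
    return my_iswupper(c) ? c + ('a' - 'A') : c;
}
```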

The code I use is draft code, based on what I submitted for Hugs
(basically pure Unicode, without even the extra space characters).

Now I need more advice on which "flavor" of Unicode support to
implement. On Haskell-cafe, three flavors were summarized; I am
reposting the latest version of that table here.

            | Sebastien's | Marcin's           | Hugs
     -------+-------------+--------------------+-----------------------
      alnum | L* N*       | L* N*              | L*, M*, N* <1>
      alpha | L*          | L*                 | L* <1>
      cntrl | Cc          | Cc Zl Zp           | Cc
      digit | N*          | Nd                 | '0'..'9'
      lower | Ll          | Ll                 | Ll <1>
      punct | P*          | P*                 | P*
      upper | Lu          | Lt Lu              | Lu Lt <1>
      blank | Z* \t\n\r   | Z* (except U+00A0, | ' ' \t\n\r\f\v U+00A0
            |             | U+2007, U+202F)    |
            |             | \t\n\v\f\r U+0085  |

<1>: for characters outside Latin1 range. For Latin1 characters
(0 to 255), there is a lookup table defined as
"unsigned char   charTable[NUM_LAT1_CHARS];"
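
A rough sketch of how such a two-tier lookup might be arranged, assuming
hypothetical category bits and helper names (CAT_UPPER, initCharTable,
unicodeCategoryBits, charCategoryBits); only charTable and
NUM_LAT1_CHARS come from the declaration above. The table contents here
are a toy ASCII-only filling, not the real Hugs table, and the fallback
knows only a handful of Greek letters as a stand-in for the generated
Unicode tables.

```c
#define NUM_LAT1_CHARS 256

/* Hypothetical category bits; the encoding Hugs really uses is not
   shown in this message. */
enum { CAT_UPPER = 1, CAT_LOWER = 2, CAT_DIGIT = 4 };

static unsigned char charTable[NUM_LAT1_CHARS];
static int tableReady = 0;

/* Toy initialisation: ASCII letters and digits only (assumes ASCII).
   A real table would assign a category to all 256 entries. */
static void initCharTable(void)
{
    int c;
    for (c = 'A'; c <= 'Z'; c++) charTable[c] |= CAT_UPPER;
    for (c = 'a'; c <= 'z'; c++) charTable[c] |= CAT_LOWER;
    for (c = '0'; c <= '9'; c++) charTable[c] |= CAT_DIGIT;
    tableReady = 1;
}

/* Stand-in for the Unicode lookup used outside Latin1. It covers only
   basic Greek: U+0391..U+03A1 and U+03A3..U+03A9 are Lu,
   U+03B1..U+03C9 are Ll. */
static int unicodeCategoryBits(int c)
{
    if ((c >= 0x0391 && c <= 0x03A1) || (c >= 0x03A3 && c <= 0x03A9))
        return CAT_UPPER;
    if (c >= 0x03B1 && c <= 0x03C9)
        return CAT_LOWER;
    return 0;
}

/* Latin1 code points go through the table, everything else through
   the Unicode lookup. */
int charCategoryBits(int c)
{
    if (c < 0)
        return 0;
    if (c < NUM_LAT1_CHARS) {
        if (!tableReady)
            initCharTable();
        return charTable[c];
    }
    return unicodeCategoryBits(c);
}

int isUpperSketch(int c)
{
    return (charCategoryBits(c) & CAT_UPPER) != 0;
}
```

With this arrangement the Latin1 entries can deviate from what Unicode
defines without touching the tables used for the rest of the range.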

I did not post the contents of the table Hugs uses for the Latin1 part.
However, with that table completely removed, Hugs did not work properly,
so its contents must differ somehow from what Unicode defines for that
character range. If needed, I can decode that table and post its mapping
of character categories (keeping in mind that those are
Haskell-recognized character categories, not Unicode ones).

I am not asking for another discussion on this list. Rather, I would
like a suggestion from the GHC team leads on which flavor (one of those
shown above, or some combination of them) to implement.

One more question that I had when experimenting with Hugs: if a
character (like those extra blank chars) is forced into some category
for the purposes of Haskell language compilation (per the Report), does
this mean that any other Haskell application should also recognize the
Haskell-defined category of that character rather than the
Unicode-defined one?

For Hugs, there was no choice but to say yes, because both the compiler
and the interpreter use the same code to decide on a character's
category. In GHC this may be different.

Since Hugs got there first, does it make sense to just follow what was
done there, or will a different decision be adopted for GHC? Say, for
the parser, extra characters are forced to be blank, but for the rest of
the programs compiled by GHC, Unicode definitions are adhered to.
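
The split described above could look roughly like this: one predicate
for the lexer and a separate one for library code. This is only a sketch
under assumptions: the function names (isUnicodeZ, lexerIsBlank,
libraryIsSpace) are hypothetical, the lexer predicate follows the "Hugs"
column of the table's blank row, the library predicate follows the
"Marcin's" column, and the list of Z* code points reflects recent
Unicode data, which differs slightly between Unicode versions.

```c
/* Code points in the separator categories Zs, Zl and Zp according to
   recent Unicode data (assumption: the exact set varies by version). */
static int isUnicodeZ(int c)
{
    return c == 0x0020 || c == 0x00A0 || c == 0x1680
        || (c >= 0x2000 && c <= 0x200A)
        || c == 0x2028 || c == 0x2029
        || c == 0x202F || c == 0x205F || c == 0x3000;
}

/* Lexer view, after the "Hugs" column:
   ' ' \t \n \r \f \v and U+00A0 are blank. */
int lexerIsBlank(int c)
{
    return c == ' '  || c == '\t' || c == '\n' || c == '\r'
        || c == '\f' || c == '\v' || c == 0x00A0;
}

/* Library view, after the "Marcin's" column: Z* except the no-break
   spaces U+00A0, U+2007, U+202F, plus \t \n \v \f \r and U+0085. */
int libraryIsSpace(int c)
{
    if (c == 0x00A0 || c == 0x2007 || c == 0x202F)
        return 0;
    return isUnicodeZ(c)
        || c == '\t' || c == '\n' || c == '\v'
        || c == '\f' || c == '\r' || c == 0x0085;
}
```

Here U+00A0 is blank for the lexer but not a space for library code,
while U+2003 (EM SPACE) is the other way round, which is exactly the
kind of disagreement the question above is about.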

PS: The latest rebuild I did used a ghc with the new code compiled in as
the Stage 1 compiler.

Dimitry Golubovsky
Middletown, CT