Unicode in GHC: need more advice
Dimitry Golubovsky
dimitry at golubovsky.org
Fri Jan 14 07:57:47 EST 2005
Hi,
Simon Marlow wrote:
>
> You're doing fine - but a better place for the tables is as part of the
> base package, rather than the RTS. We already have some C files in the
> base package: see libraries/base/cbits, for example. I suggest just
> putting your code in there.
I have done that - now GHCi recognizes those symbols and loads fine. The
test program also works when compiled. I still get some warnings about
missing prototypes and implicit declarations for the functions that I
defined to replace the libc ones, especially during Stage 1. I need to
look into that, but since all those functions are basically int -> int,
which is what C assumes for an undeclared function anyway, it does not
affect the result.
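For illustration, warnings of this kind normally go away once a prototype is in scope at every call site. The sketch below uses made-up names (`u_iswupper` and `u_towlower` are hypothetical, not the names in the draft code) and toy ASCII-only bodies in place of the real table lookups:

```c
/* Hypothetical prototypes. In a real tree these would sit in a header
   that every calling C file includes, so no call is implicitly declared. */
int u_iswupper(int c);
int u_towlower(int c);

/* Toy ASCII-only bodies standing in for the real Unicode table lookups. */
int u_iswupper(int c) { return c >= 'A' && c <= 'Z'; }
int u_towlower(int c) { return u_iswupper(c) ? c + ('a' - 'A') : c; }
```

With the declarations visible, the compiler checks argument and return types instead of falling back to the implicit-int assumption.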
The code I use is a draft, based on what I submitted for Hugs
(basically pure Unicode, even without the extra space characters).
Now I need more advice on which "flavor" of Unicode support to
implement. Three flavors were summarized on Haskell-cafe; I am
reposting the latest version of the table here.
       | Sebastien's | Marcin's            | Hugs
-------+-------------+---------------------+-----------------------
 alnum | L* N*       | L* N*               | L*, M*, N* <1>
 alpha | L*          | L*                  | L* <1>
 cntrl | Cc          | Cc Zl Zp            | Cc
 digit | N*          | Nd                  | '0'..'9'
 lower | Ll          | Ll                  | Ll <1>
 punct | P*          | P*                  | P*
 upper | Lu          | Lt Lu               | Lu Lt <1>
 blank | Z* \t\n\r   | Z* (except U+00A0,  | ' ' \t\n\r\f\v U+00A0
       |             | U+2007, U+202F)     |
       |             | \t\n\v\f\r U+0085   |
<1>: for characters outside Latin1 range. For Latin1 characters
(0 to 255), there is a lookup table defined as
"unsigned char charTable[NUM_LAT1_CHARS];"
I did not post the contents of the table Hugs uses for the Latin1 part.
However, with that table completely removed, Hugs did not work properly,
so its contents must differ somehow from what Unicode defines for that
character range. If needed, I can decode that table and post its mapping
of character categories (keeping in mind that those are
Haskell-recognized character categories, not Unicode ones).
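The two-level lookup described in footnote <1> can be sketched roughly as follows. Only the `charTable` declaration comes from Hugs. The flag bits, the table contents, and the names `init_char_table`, `unicode_is_upper` and `hs_is_upper` are assumptions made up for this example, and the real Hugs table is not reproduced here:

```c
#define NUM_LAT1_CHARS 256

/* Hypothetical flag bits; the encoding Hugs really uses is not shown. */
enum { CT_UPPER = 1, CT_LOWER = 2 };

static unsigned char charTable[NUM_LAT1_CHARS];

/* Fill the toy table for ASCII letters only (invented contents). */
void init_char_table(void) {
    int c;
    for (c = 'A'; c <= 'Z'; c++) charTable[c] |= CT_UPPER;
    for (c = 'a'; c <= 'z'; c++) charTable[c] |= CT_LOWER;
}

/* Stand-in for the generated Unicode tables: knows one code point. */
static int unicode_is_upper(int c) {
    return c == 0x0391; /* GREEK CAPITAL LETTER ALPHA, category Lu */
}

/* Latin1 range: the hand-written table decides.
   Everything else: fall through to the Unicode data. */
int hs_is_upper(int c) {
    if (c >= 0 && c < NUM_LAT1_CHARS)
        return (charTable[c] & CT_UPPER) != 0;
    return unicode_is_upper(c);
}
```

Because the first branch never consults the Unicode data, whatever the Latin1 table says wins for code points 0 to 255, which is how its contents can differ from the Unicode categories.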
I am not asking to reopen the discussion on this list. Rather, I would
like a suggestion from the GHC team leads as to which flavor (one of
those shown above, or some combination of them) to implement.
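To make the differences concrete, here is a rough sketch of three rows of the table (upper, digit, cntrl) as predicates over a Unicode general category. All type and function names are invented for this example. The category is passed in by the caller because the real category tables are not reproduced, and the Hugs Latin1 special case from footnote <1> is ignored:

```c
/* Subset of Unicode general categories, enough for three table rows. */
typedef enum {
    GC_Lu, GC_Ll, GC_Lt,   /* letters */
    GC_Nd, GC_Nl, GC_No,   /* numbers (N*) */
    GC_Cc, GC_Zl, GC_Zp,   /* control, line and paragraph separators */
    GC_OTHER
} GenCat;

typedef enum { FLAVOR_SEBASTIEN, FLAVOR_MARCIN, FLAVOR_HUGS } Flavor;

static int is_number(GenCat gc) {
    return gc == GC_Nd || gc == GC_Nl || gc == GC_No;
}

/* "upper" row: Lu everywhere; Lt only for Marcin's flavor and Hugs. */
int flavor_is_upper(Flavor f, GenCat gc) {
    if (gc == GC_Lu) return 1;
    return gc == GC_Lt && f != FLAVOR_SEBASTIEN;
}

/* "digit" row: N*, Nd, or ASCII '0'..'9' only. */
int flavor_is_digit(Flavor f, int c, GenCat gc) {
    switch (f) {
    case FLAVOR_SEBASTIEN: return is_number(gc);
    case FLAVOR_MARCIN:    return gc == GC_Nd;
    case FLAVOR_HUGS:      return c >= '0' && c <= '9';
    }
    return 0;
}

/* "cntrl" row: Cc everywhere; Marcin's flavor adds Zl and Zp. */
int flavor_is_cntrl(Flavor f, GenCat gc) {
    if (gc == GC_Cc) return 1;
    return f == FLAVOR_MARCIN && (gc == GC_Zl || gc == GC_Zp);
}
```

For example, U+2160 ROMAN NUMERAL ONE (category Nl) counts as a digit only in Sebastien's flavor, while U+0661 ARABIC-INDIC DIGIT ONE (category Nd) is a digit in both Sebastien's and Marcin's flavors but not in Hugs.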
One more question that I had when experimenting with Hugs: if a
character (like those extra blank chars) is forced into some category
for the purposes of Haskell language compilation (per the Report), does
this mean that any other Haskell application should recognize the
Haskell-defined category of that character rather than the
Unicode-defined one? For Hugs, there was no choice but to say yes,
because both the compiler and the interpreter use the same code to
decide on a character's category. In GHC this may be different.
Since Hugs got there first, does it make sense to just follow what was
done there, or will a different decision be adopted for GHC? Say, for
the parser, the extra characters are forced to be blank, but for the
rest of the programs compiled by GHC, the Unicode definitions are
adhered to.
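One possible shape for such a split, purely as a sketch: keep two predicates, one for the lexer and one for compiled programs. The names `lib_is_separator` and `lexer_is_white` and the small category enum are assumptions for this example, not existing GHC code. The forced set is the ASCII layout characters, which have Unicode category Cc rather than Z*:

```c
/* Subset of Unicode general categories needed for this example. */
typedef enum { CAT_Zs, CAT_Zl, CAT_Zp, CAT_Cc, CAT_OTHER } Cat;

/* Library view: follow the Unicode general category only (Z*). */
int lib_is_separator(Cat cat) {
    return cat == CAT_Zs || cat == CAT_Zl || cat == CAT_Zp;
}

/* Lexer view: force the ASCII layout characters to be blank,
   whatever their Unicode category, then defer to the library view. */
int lexer_is_white(int c, Cat cat) {
    switch (c) {
    case '\t': case '\n': case '\v': case '\f': case '\r':
        return 1;
    default:
        return lib_is_separator(cat);
    }
}
```

Here a tab is blank for the lexer even though `lib_is_separator(CAT_Cc)` is false, so the parser's forced categories never leak into what compiled programs see.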
PS: The latest rebuild I did used GHC with the new code compiled in as
the Stage 1 compiler.
Dimitry Golubovsky
Middletown, CT