Unicode in GHC: need more advice

Mon Jan 17 07:24:21 EST 2005

On 14 January 2005 12:58, Dimitry Golubovsky wrote:

> Now I need more advice on which "flavor" of Unicode support to
> implement. In Haskell-cafe, there were 3 flavors summarized: I am
> reposting the table here (its latest version).
> 
>             | Sebastien's | Marcin's            | Hugs
>      -------+-------------+---------------------+----------------------
>       alnum | L* N*       | L* N*               | L*, M*, N* <1>
>       alpha | L*          | L*                  | L* <1>
>       cntrl | Cc          | Cc Zl Zp            | Cc
>       digit | N*          | Nd                  | '0'..'9'
>       lower | Ll          | Ll                  | Ll <1>
>       punct | P*          | P*                  | P*
>       upper | Lu          | Lt Lu               | Lu Lt <1>
>       blank | Z* \t\n\r   | Z* (except U+00A0,  | ' ' \t\n\r\f\v U+00A0
>             |             | U+2007, U+202F)     |
>             |             | \t\n\v\f\r U+0085   |
> 
> <1>: for characters outside Latin1 range. For Latin1 characters
> (0 to 255), there is a lookup table defined as
> "unsigned char   charTable[NUM_LAT1_CHARS];"
> 
> I did not post the contents of the table Hugs uses for the Latin1
> part. However, with that table completely removed, Hugs did not work
> properly. So its contents must differ somehow from what Unicode
> defines for that character range. If needed, I may decode that table
> and post its mapping of character categories (keeping in mind that
> those are Haskell-recognized character categories, not Unicode).

I don't know enough to comment on which of the above flavours is best.
However, I'd prefer not to use a separate table for Latin-1 characters
if possible.

We should probably stick to the Report definitions for isDigit and
isSpace, but we could add a separate isUniDigit/isUniSpace for the full
Unicode classes.
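
A minimal sketch of what that split could look like, assuming a
Data.Char that exports generalCategory (as current versions of base do).
The names isReportDigit and isUniSpace's exact coverage are assumptions
made for illustration: isUniDigit follows the Nd choice from the
"Marcin's" column, and isUniSpace adds only the Zs category to the
ASCII whitespace characters. Neither is a settled definition.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Report-style digit test: ASCII '0'..'9' only (the "Hugs" column).
-- Hypothetical name, used here to avoid clashing with Data.Char.isDigit.
isReportDigit :: Char -> Bool
isReportDigit c = c >= '0' && c <= '9'

-- Assumed Unicode-wide digit test: general category Nd.
isUniDigit :: Char -> Bool
isUniDigit c = generalCategory c == DecimalNumber

-- Assumed Unicode-wide space test: ASCII whitespace plus category Zs.
isUniSpace :: Char -> Bool
isUniSpace c = c `elem` " \t\n\r\f\v" || generalCategory c == Space

main :: IO ()
main = do
  -- U+0663 is ARABIC-INDIC DIGIT THREE, U+2003 is EM SPACE.
  let samples = ['7', '\x0663', '\x2003', 'x']
  mapM_ (\c -> print (c, isReportDigit c, isUniDigit c, isUniSpace c)) samples
```

Keeping the two predicates separate means existing code that relies on
the narrow Report behaviour is unaffected, while callers that want the
wider classes opt in by name.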

> One more question that I had when experimenting with Hugs: if a
> character (like those extra blank chars) is forced into some category
> for the purposes of Haskell language compilation (per the Report),
> does this mean that any other Haskell application should recognize
> Haskell-defined category of that character rather than
> Unicode-defined? 
>
> For Hugs, there was no choice but to say Yes, because both compiler
> and interpreter used the same code to decide on character category.
> In GHC this may be different.

To be specific: the Report requires that the Haskell lexical class of
space characters includes Unicode spaces, but that the implementation of
isSpace only recognises Latin-1 spaces.  That means we need two separate
classes of space characters (or we just use the Report definition of
isSpace everywhere).
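
The two classes could be sketched roughly as follows. This is only an
illustration under stated assumptions: isSpaceReport uses the Latin-1
set from the "Hugs" column of the table, and isLexWhite (a hypothetical
name) treats every Z* category as lexical whitespace, which is one
possible reading of "Unicode spaces" and not a decision. It again
assumes a Data.Char that exports generalCategory.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Library class: Latin-1 spaces only (hypothetical name).
isSpaceReport :: Char -> Bool
isSpaceReport c = c `elem` " \t\n\r\f\v\xA0"

-- Lexical class (assumption): the library spaces plus any Z* character.
isLexWhite :: Char -> Bool
isLexWhite c =
  isSpaceReport c
    || generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]

-- Characters on which the two classes disagree.
disagree :: [Char] -> [Char]
disagree = filter (\c -> isLexWhite c /= isSpaceReport c)

main :: IO ()
main =
  -- U+2003 is EM SPACE (Zs), U+2028 is LINE SEPARATOR (Zl).
  print (disagree [' ', '\t', '\xA0', '\x2003', '\x2028', 'x'])
```

Under these assumptions the classes differ only outside Latin-1, so a
lexer using isLexWhite and a library exporting isSpaceReport would agree
on every character in the 0 to 255 range.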

GHC's parser doesn't currently use the Data.Char character class
predicates, but at some point we will want to parse Unicode so we'll
need appropriate class predicates then.

> Since Hugs got there first, does it make sense to just follow what
> was done here, or will a different decision be adopted for GHC: say,
> for the Parser, extra characters are forced to be blank, but for the
> rest of the programs compiled by GHC, Unicode definitions are adhered
> to?

Does what I said above help answer this question?

Cheers,
	Simon