Picky details about Unicode (was RE: Haskell 98 Report possible errors, part one)

Marcin 'Qrczak' Kowalczyk qrczak@knm.org.pl
24 Jul 2001 10:08:48 GMT


Mon, 23 Jul 2001 11:23:30 -0700, Mark P Jones <mpj@cse.ogi.edu> wrote:

> I guess the intention here is that:
> 
>   symbol  -> ascSymbol | uniSymbol_<special | _ | : | " | '>

Right.

> In fact, since all the characters in ascSymbol are either
> punctuation or symbols in Unicode, the inclusion of ascSymbol
> is redundant, and a better specification might be:
> 
>   symbol  -> uniSymbol_<special | _ | : | " | '>

It would still be nice to explicitly list ASCII symbols, so one
doesn't need to look at Unicode specs to use ASCII-only source.
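
A minimal sketch of the quoted production as a predicate, assuming a
Data.Char that exports isSymbol and isPunctuation (as current GHC
does). The character lists for "special" and "ascSymbol" are copied
from the Haskell 98 lexical syntax, not from this message, and the
names isUniSymbol, isSymbolChar and ascSymbolRedundant are
hypothetical:

```haskell
import Data.Char (isPunctuation, isSymbol)

-- Assumption: the "special" characters of the Haskell 98 lexical syntax.
special :: String
special = "(),;[]`{}"

-- Assumption: the ascSymbol characters of the Haskell 98 lexical syntax.
ascSymbols :: String
ascSymbols = "!#$%&*+./<=>?@\\^|-~"

-- uniSymbol: any Unicode symbol or punctuation character.
isUniSymbol :: Char -> Bool
isUniSymbol c = isSymbol c || isPunctuation c

-- symbol -> uniSymbol_<special | _ | : | " | '>
isSymbolChar :: Char -> Bool
isSymbolChar c = isUniSymbol c && c `notElem` (special ++ "_:\"'")

-- True exactly when every ascSymbol character is already accepted by
-- the uniSymbol-only production, i.e. when listing ascSymbol is redundant.
ascSymbolRedundant :: Bool
ascSymbolRedundant = all isSymbolChar ascSymbols
```

With GHC's character tables ascSymbolRedundant evaluates to True,
which matches the quoted claim; listing the ASCII symbols explicitly
is then a matter of readability only.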

There are two places where character predicates are used in Haskell:
in program source and in module Char. I'm sure that we all agree that
they should be consistent with each other.

Some predicates in module Char are "wrong", i.e. I don't agree with
their meaning. For example, isSpace is restricted to ISO-8859-1,
and caseless letters are considered uppercase.

It's not clear what good definitions are, or even what set of
predicates is useful, because there is no single official source
with an unambiguous and complete set of predicates. There are Unicode
character categories, Unicode property lists, and implementations of
C character predicates - all with different data. I guess the Java
specs have something to say here too.

I have implemented a proposal for improved Char predicates in QForeign
<http://sf.net/projects/qforeign/>. The definitions are based on both
Unicode character categories and PropList.txt from Unicode.
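
As a rough illustration of why one might consult both sources (this
is not QForeign's code, and every name in it is hypothetical), the
sketch below compares the White_Space property with the separator
categories Zs, Zl and Zp. The range table is transcribed by hand from
PropList.txt of a recent Unicode version and should be treated as an
assumption to check against the actual file.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Assumption: White_Space ranges, hand-copied from PropList.txt.
whiteSpaceRanges :: [(Char, Char)]
whiteSpaceRanges =
  [ ('\x0009', '\x000D'), ('\x0020', '\x0020'), ('\x0085', '\x0085')
  , ('\x00A0', '\x00A0'), ('\x1680', '\x1680'), ('\x2000', '\x200A')
  , ('\x2028', '\x2029'), ('\x202F', '\x202F'), ('\x205F', '\x205F')
  , ('\x3000', '\x3000')
  ]

-- Property-list view: is the character inside one of the ranges?
hasWhiteSpaceProperty :: Char -> Bool
hasWhiteSpaceProperty c =
  any (\(lo, hi) -> lo <= c && c <= hi) whiteSpaceRanges

-- Category view: Zs, Zl or Zp.
isSeparatorCategory :: Char -> Bool
isSeparatorCategory c =
  generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]

-- Characters up to U+3000 on which the two views disagree.
disagreements :: [Char]
disagreements =
  [ c | c <- ['\x0000' .. '\x3000']
      , hasWhiteSpaceProperty c /= isSeparatorCategory c ]
```

The ASCII control characters such as tab and newline, and U+0085,
have the White_Space property but category Cc, so they show up in
disagreements; a predicate built from categories alone would miss
them.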

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK