Character predicates (was: Re: [Haskell-cafe] Hugs vs GHC (again))

Tue Jan 11 12:58:27 EST 2005

Dimitry Golubovsky <dimitry at golubovsky.org> writes:

>            |Sebastien's| Marcin's | Hugs
>     -------+-----------+----------+------
>      alnum | L* N*     | L* N*    | L*, M*, N* <1>
>      alpha | L*        | L*       | L* <1>
>      cntrl | Cc        | Cc Zl Zp | Cc
>      digit | N*        | Nd       | '0'..'9'
>      lower | Ll        | Ll       | Ll <1>
>      punct | P*        | P*       | P*
>      upper | Lu        | Lt Lu    | Lu Lt <1>
>      blank | Z* \t\n\r | Z*(except| ' ' \t\n\r\f\v U+00A0
>                          U+00A0
>                          U+2007
>                          U+202F)
>                          \t\n\v\f\r U+0085
>
> <1>: for characters outside Latin1 range. For Latin1 characters
> (0 to 255), there is a lookup table defined as
> "unsigned char   charTable[NUM_LAT1_CHARS];"

If the table coincides with the Unicode character categories, then
it's just an implementation detail.

In the Hugs column I replaced
   c < ' ' || c >= '\DEL' && c <= '\x9f'
with "Cc", because the two describe the same set of characters.
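
The equivalence can be checked mechanically. A minimal sketch, assuming
the Data.Char module from GHC's base (generalCategory is not part of
Haskell 98); the names isCntrlByRange, isCntrlByCategory and
agreeEverywhere are made up for this illustration:

```haskell
import Data.Char (GeneralCategory (Control), generalCategory)

-- The range test as written in the Hugs sources quoted above.
-- (&&) binds tighter than (||), so this reads
-- c < ' ' || (c >= '\DEL' && c <= '\x9f').
isCntrlByRange :: Char -> Bool
isCntrlByRange c = c < ' ' || c >= '\DEL' && c <= '\x9f'

-- The same predicate expressed through the Unicode category Cc.
isCntrlByCategory :: Char -> Bool
isCntrlByCategory c = generalCategory c == Control

-- Compare the two definitions on every value of type Char.
agreeEverywhere :: Bool
agreeEverywhere =
  all (\c -> isCntrlByRange c == isCntrlByCategory c) [minBound .. maxBound]
```

If agreeEverywhere evaluates to True, the two definitions select the
same characters for the Unicode data shipped with that compiler.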

> So there might be a bunch of (perhaps autogenerated, from localedef
> files) modules for each locale/encoding, like ISO8859_1 or KOI_8.

I disagree. Char is supposed to mean Unicode only, and data is
converted to Unicode on boundaries with those parts of the world which
use different encodings.

With Unicode in mind it still makes sense to talk about digits as
'0'..'9' only; most programming languages specify numeric literals as
consisting of these digits only. Other contexts may require a wider
set, including today's Arabic digits etc. This is not because of the
encoding but because of the intended set of characters.
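
To make the two candidate sets concrete, here is a small sketch,
again assuming GHC's Data.Char. The names isAsciiDigit and
isDecimalDigit are hypothetical; they stand for the "Hugs" and
"Marcin's" entries in the digit row of the table:

```haskell
import Data.Char (GeneralCategory (DecimalNumber), generalCategory)

-- The narrow set: what numeric literals in most languages accept.
isAsciiDigit :: Char -> Bool
isAsciiDigit c = c >= '0' && c <= '9'

-- The wider set: every character in Unicode category Nd.
isDecimalDigit :: Char -> Bool
isDecimalDigit c = generalCategory c == DecimalNumber
```

For example, '\x0663' (ARABIC-INDIC DIGIT THREE) is in Nd, so only the
second predicate accepts it, while '\x00B2' (SUPERSCRIPT TWO) is in No
and is rejected by both. Which predicate is right depends on the
intended set of characters, not on the encoding.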

One reason why the predicates are not obvious is that when the
features encodable as text become more sophisticated, old algorithms
for handling text become limited. For example if an identifier is
specified as a letter followed by a sequence of letters or numbers,
then combining marks are not allowed in identifiers, even though the
corresponding precomposed characters are allowed. I guess this is why
Hugs includes M* in isAlphaNum. This is a lie which allows old code
to work better. These characters are not alphanumeric; it's the
definition of identifiers which is no longer appropriate. (Unicode
recommends a particular definition of identifiers in programming
languages which want to permit non-ASCII identifiers; it has
various exceptions because it's intended to be somewhat compatible
with older versions of itself.)
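
A sketch of the problem, assuming GHC's Data.Char. All function names
here are invented for the illustration, and neither definition is the
one Unicode recommends; the second merely mimics the effect of
counting M* as alphanumeric:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- L*: letters of any kind.
isLetterCat :: Char -> Bool
isLetterCat c =
  generalCategory c `elem`
    [UppercaseLetter, LowercaseLetter, TitlecaseLetter, ModifierLetter, OtherLetter]

-- N*: numbers of any kind.
isNumberCat :: Char -> Bool
isNumberCat c =
  generalCategory c `elem` [DecimalNumber, LetterNumber, OtherNumber]

-- M*: combining marks.
isMarkCat :: Char -> Bool
isMarkCat c =
  generalCategory c `elem` [NonSpacingMark, SpacingCombiningMark, EnclosingMark]

-- The old rule: a letter followed by letters or numbers.
isClassicIdent :: String -> Bool
isClassicIdent [] = False
isClassicIdent (c : cs) =
  isLetterCat c && all (\x -> isLetterCat x || isNumberCat x) cs

-- The same rule with marks tolerated after the first letter.
isMarkTolerantIdent :: String -> Bool
isMarkTolerantIdent [] = False
isMarkTolerantIdent (c : cs) =
  isLetterCat c && all (\x -> isLetterCat x || isNumberCat x || isMarkCat x) cs
```

With these definitions "caf\xE9" (precomposed e-acute) passes both
tests, but "cafe\x0301" (e followed by COMBINING ACUTE ACCENT) passes
only the second, although the two strings display identically.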

Another case when old interfaces are not sufficient is toUpper &
toLower. They should be defined on strings, not characters. Besides
'ß' there are other characters which uppercase or lowercase to several
code points: ligatures, precomposed characters which lack precomposed
variants in the other case but can be decomposed, Greek iota below
which is specified to uppercase to a separate iota after the letter
(some people believe this is wrong but it's how it's currently
specified in Unicode) and some cases with accents over I and i.
Case mapping is also context-dependent for sigma.
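
A minimal sketch of what a string-level interface could look like,
assuming GHC's Data.Char. The names specialUpper and upperString are
hypothetical, and the table holds only two of the multi-character
mappings; a real implementation would take them all from the Unicode
data files and would also handle the context-dependent cases, which
this sketch ignores:

```haskell
import Data.Char (toUpper)

-- A tiny, deliberately incomplete table of characters whose
-- uppercase form is longer than one code point.
specialUpper :: [(Char, String)]
specialUpper =
  [ ('\xDF', "SS")    -- LATIN SMALL LETTER SHARP S
  , ('\xFB01', "FI")  -- LATIN SMALL LIGATURE FI
  ]

-- Uppercase a whole string: use the table where it has an entry and
-- fall back to the per-character toUpper otherwise.
upperString :: String -> String
upperString = concatMap upperChar
  where
    upperChar c = maybe [toUpper c] id (lookup c specialUpper)
```

A function of type Char -> Char cannot express these mappings at all,
since the result may be longer than the input; with GHC, as far as I
can tell, toUpper simply leaves '\xDF' unchanged.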

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/