Unicode

Mon, 8 Oct 2001 12:04:40 +0200

----- Original Message -----
From: "Dylan Thurston" <dpt@math.harvard.edu>
To: "Andrew J Bromage" <andrew@bromage.org>
Cc: <glasgow-haskell-users@haskell.org>; <haskell-cafe@haskell.org>
Sent: Friday, October 05, 2001 6:00 PM
Subject: Re: UniCode

> On Fri, Oct 05, 2001 at 11:23:50PM +1000, Andrew J Bromage wrote:
> > G'day all.
> >
> > On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:
> >
> > > Why Char is 32 bit. UniCode characters is 16 bit.
> >
> > It's not quite as simple as that.  There is a set of one million
> > (more correctly, 1M) Unicode characters which are only accessible
> > using surrogate pairs (i.e. two UTF-16 codes).  There are currently
> > none of these codes assigned, and when they are, they'll be extremely
> > rare.  So rare, in fact, that the cost of strings taking up twice the
> > space that they currently do simply isn't worth the cost.
>
> This is no longer true, as of Unicode 3.1.  Almost half of all
> characters currently assigned are outside of the BMP (i.e., require
> surrogate pairs in the UTF-16 encoding), including many Chinese
> characters.  In current usage, these characters probably occur mainly
> in names, and are rare, but obviously important for the people
> involved.

In plane 2 (one of the supplementary planes, reachable in UTF-16 only
via surrogate pairs) there are about 41000 Hàn characters, in addition
to the roughly 27000 Hàn characters in the BMP.  And more are expected
to be encoded.  However, IIRC, only about 6000-7000 of them are in
modern use.
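
For concreteness, here is a minimal sketch of how a character outside
the BMP turns into a surrogate pair in UTF-16.  The name toUtf16 is
made up for illustration, not a library function, and the sketch does
no checking for Char values that are themselves surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- | Encode one Char as UTF-16 code units (hypothetical helper).
-- Code points below U+10000 (the BMP) take a single unit; code points
-- in the supplementary planes (U+10000..U+10FFFF) take a surrogate pair.
-- Assumption: the input is not itself a surrogate code point.
toUtf16 :: Char -> [Word16]
toUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [hi, lo]
  where
    n   = ord c
    off = n - 0x10000                               -- 20-bit offset
    hi  = fromIntegral (0xD800 + (off `shiftR` 10)) -- high (lead) surrogate
    lo  = fromIntegral (0xDC00 + (off .&. 0x3FF))   -- low (trail) surrogate
```

With this, the first plane 2 character, U+20000, comes out as the two
code units D840 DC00, while any BMP character stays a single unit.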

I don't really want to push for them (since I think they are a major design
mistake), but some people like them: the mathematical alphanumerical
characters in plane 1.  There are also the more likable (IMHO)
musical characters in plane 1 ("western", though that attribute was
removed, and Byzantine!). (You cannot set a musical score in
Unicode plain text, it just encodes the characters that you can use IN
a musical score.)

...
>   isAscii, isLatin1 - OK
Yes, but why do (or, rather, did) you want them; isLatin1 in particular?
Then what about "isCP1252" (THE most common encoding today),
"isShiftJis", etc., for several hundred encodings? (I'm not proposing to
remove isAscii, but isLatin1 is dubious.)

>   isControl - I don't know about this.
Why do (did) you want it? There are several "kinds" of "control" characters
in Unicode: the traditional C0 and (less used) C1 ones, format control
characters (NO, they do NOT control FORMATTING, though they do control
FORMAT, like cursive connections), ...

>   isPrint - Dubious.  Is a non-spacing accent a printable character?
A combining character is most definitely "printable". (There is a difference
between non-spacing and combining: many combining characters are
non-spacing, but not all of them are.)
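
A small sketch of that distinction, assuming a present-day GHC base
where Data.Char exports generalCategory (the Char library under
discussion does not have it).  The name isCombiningMark is invented
here, and treating "combining" as "general category M*" is a
simplification.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | True for any mark character (hypothetical helper), whether or not
-- it takes up horizontal space of its own.
-- Assumption: "combining" is approximated by the three mark categories.
isCombiningMark :: Char -> Bool
isCombiningMark c =
  generalCategory c `elem`
    [NonSpacingMark, SpacingCombiningMark, EnclosingMark]

-- | True only for the non-spacing subset of the marks.
isNonSpacing :: Char -> Bool
isNonSpacing c = generalCategory c == NonSpacingMark
```

COMBINING ACUTE ACCENT (U+0301) satisfies both predicates, whereas
DEVANAGARI SIGN VISARGA (U+0903) is combining but has a spacing
category, so only the first predicate holds for it.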

>   isSpace - OK, by the comment in the report: "The isSpace function
>             recognizes only white characters in the Latin-1 range".
Sigh. There are several others, most importantly: LINE SEPARATOR,
PARAGRAPH SEPARATOR, and IDEOGRAPHIC SPACE.  And the
NEL in the C1 range.
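
As a rough sketch only, a wider predicate could simply add the
characters named above to whatever the library isSpace accepts.  The
names extraSpaces and isSpaceWide are hypothetical, and the list is
just the four characters mentioned here, not a complete white-space
property.

```haskell
import Data.Char (isSpace)

-- | The additional white-space characters named above (illustrative,
-- not an exhaustive list of Unicode white space).
extraSpaces :: [Char]
extraSpaces =
  [ '\x0085'  -- NEL (NEXT LINE), in the C1 range
  , '\x2028'  -- LINE SEPARATOR
  , '\x2029'  -- PARAGRAPH SEPARATOR
  , '\x3000'  -- IDEOGRAPHIC SPACE
  ]

-- | Hypothetical wider test: the library's isSpace plus the extras.
isSpaceWide :: Char -> Bool
isSpaceWide c = isSpace c || c `elem` extraSpaces
```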

>   isUpper, isLower - Maybe OK.
This is property interrogation. There are many other properties of interest.

>   toUpper, toLower - Not OK.  There are cases where upper casing a
>      character yields two characters.
See my other e-mail.
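
To illustrate the quoted point: a case mapping that can grow the text
has to return a String, not a Char.  Below is a minimal sketch with
made-up names (upperFull, upperString) and a single hard-coded special
case, LATIN SMALL LETTER SHARP S, which upper-cases to "SS".  A real
implementation would be driven by the Unicode special-casing data
rather than by one pattern match.

```haskell
import Data.Char (toUpper)

-- | Hypothetical string-valued upper-casing of a single character.
-- Only one special case is listed; everything else falls back to the
-- library's Char -> Char mapping.
upperFull :: Char -> String
upperFull '\x00DF' = "SS"         -- LATIN SMALL LETTER SHARP S
upperFull c        = [toUpper c]

-- | Upper-case a whole string; the result may be longer than the input.
upperString :: String -> String
upperString = concatMap upperFull
```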

> etc.  Any program using this library is bound to get confused on
> Unicode strings.  Even before Unicode, there is much functionality
> missing; for instance, I don't see any way to compare strings using
> a localized order.
>
> Is anyone working on honest support for Unicode, in the form of a real
> Unicode library with an interface at the correct level?

Well, IBM's ICU, for one, ...  But they only do it for C/C++/Java, not for Haskell...

        Kind regards
        /kent k