UniCode

08 Oct 2001 09:02:11 +0200

Dylan Thurston <dpt@math.harvard.edu> writes:

> Right.  In Unicode, the concept of a "character" is not really so
> useful;

After reading a bit about it, I'm certainly confused.
Unicode/ISO-10646 contains a lot of things that aren't really one
character, e.g. ligatures.

> most functions that traditionally operate on characters (e.g.,
> uppercase or display-width) fundamentally need to operate on strings.
> (This is due to properties of particular languages, not any design
> flaw of Unicode.)

I think an argument could be put forward that Unicode is trying to be
more than just a character set.  At least at first glance, it seems to
try to be both a character set and a glyph map, and incorporate things
like transliteration between character sets (or subsets, now that
Unicode contains them all), directionality of script, and so on.

>   toUpper, toLower - Not OK.  There are cases where upper casing a
>      character yields two characters.

I thought title case was supposed to handle this.  I'm probably
confused, though.
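
To make the quoted case concrete, here is a minimal sketch of a
string-level upper-casing function.  The names upperChar and
upperString are made up for illustration, and the two special cases
(sharp s and the fi ligature) are only a tiny, hand-picked sample of
the real Unicode special-casing data, so treat it as an assumption-laden
toy rather than a proper implementation:

```haskell
import Data.Char (toUpper)

-- Hypothetical helper: upper-case one Char, possibly yielding several.
-- Only two hand-picked special cases are covered here.
upperChar :: Char -> String
upperChar '\x00DF' = "SS"        -- U+00DF, sharp s
upperChar '\xFB01' = "FI"        -- U+FB01, fi ligature
upperChar c        = [toUpper c] -- fall back to the Char -> Char mapping

-- Upper-casing therefore has to be String -> String, not a plain map.
upperString :: String -> String
upperString = concatMap upperChar
```

With this, upperString "Fu\x00DF" gives "FUSS": three Chars in, four
Chars out, which a Char -> Char function like toUpper cannot express.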

> etc.  Any program using this library is bound to get confused on
> Unicode strings.  Even before Unicode, there is much functionality
> missing; for instance, I don't see any way to compare strings using
> a localized order.

And you can't really use list functions like "length" on strings,
since one item can be two characters (Lj, ij, fi) and several items
can compose one character (combining characters).
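
A small sketch of the "length" problem, using only Data.Char from the
standard libraries.  The function name clusters is made up, and gluing
combining marks onto the preceding Char is only a crude approximation
of what Unicode calls grapheme clusters, so this is an illustration
under that assumption and nothing more:

```haskell
import Data.Char (isMark)

-- Hypothetical helper: group each Char with the combining marks that
-- directly follow it.  A rough stand-in for real grapheme segmentation.
clusters :: String -> [String]
clusters []     = []
clusters (c:cs) = (c : marks) : clusters rest
  where
    (marks, rest) = span isMark cs

-- Count the groups instead of the individual Chars.
clusterLength :: String -> Int
clusterLength = length . clusters
```

For "e\x0301" (e followed by a combining acute accent), length gives 2
while clusterLength gives 1.  The opposite problem remains: the single
Char '\xFB01' (the fi ligature) still counts as one item although it
stands for two letters.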

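And an element-wise comparison with "==" can't reliably compare two
Strings either, e.g. in the presence of combining characters, where the
same text can be written as different sequences of code points.  How
are other systems handling this?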
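
As an illustration of the comparison problem, here is a toy sketch.
The names decompose and canonEq are invented, and the two-entry table
stands in for the full Unicode decomposition data; a real solution
would also need canonical reordering of marks, which is ignored here:

```haskell
-- Hypothetical two-entry decomposition table; everything else is
-- passed through unchanged.
decompose :: Char -> String
decompose '\x00E9' = "e\x0301"  -- e with acute -> e + combining acute
decompose '\x00C5' = "A\x030A"  -- A with ring  -> A + combining ring
decompose c        = [c]

-- Compare two Strings after decomposing both sides.
canonEq :: String -> String -> Bool
canonEq a b = concatMap decompose a == concatMap decompose b
```

Here "caf\x00E9" == "cafe\x0301" is False, although both spell the
same word, while canonEq "caf\x00E9" "cafe\x0301" is True.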

It may be that Unicode isn't flawed, but it's certainly extremely
complex.  I guess I'll have to delve a bit deeper into it.

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants