Unicode

Mon, 8 Oct 2001 11:51:07 +0200

----- Original Message -----
From: "Ketil Malde" <ketil@ii.uib.no>
To: "Dylan Thurston" <dpt@math.harvard.edu>
Cc: "Andrew J Bromage" <andrew@bromage.org>; <glasgow-haskell-users@haskell.org>; <haskell-cafe@haskell.org>
Sent: Monday, October 08, 2001 9:02 AM
Subject: Re: UniCode

(The spelling is 'Unicode' (and none other).)

> Dylan Thurston <dpt@math.harvard.edu> writes:
>
> > Right.  In Unicode, the concept of a "character" is not really so
> > useful;
>
> After reading a bit about it, I'm certainly confused.
> Unicode/ISO-10646 contains a lot of things that aren't really one
> character, e.g. ligatures.

The ligatures that are included are there for compatibility with older
character encodings.  Normally, with modern font technology, ligatures
are meant to be formed automatically by the font.  OpenType (OT, from
MS and Adobe) and AAT (Apple) both support this.  There are
often requests to add more ligatures to 10646/Unicode, but they are
rejected since 10646/Unicode encodes characters, not glyphs.  (With
two well-known exceptions: for compatibility, and certain dingbats.)

> > most functions that traditionally operate on characters (e.g.,
> > uppercase or display-width) fundamentally need to operate on strings.
> > (This is due to properties of particular languages, not any design
> > flaw of Unicode.)
>
> I think an argument could be put forward that Unicode is trying to be
> more than just a character set.  At least at first glance, it seems to

Yes, but:

> try to be both a character set and a glyph map, and incorporate things

not that. See above.

> like transliteration between character sets (or subsets, now that
> Unicode contains them all), directionality of script, and so on.

Unicode (but not 10646) does handle bidirectionality
(see UAX 9: http://www.unicode.org/unicode/reports/tr9/), but not transliteration.
(Transliteration is handled in IBM's ICU, though: http://www-124.ibm.com/developerworks/oss/icu4j/index.html)

>
> >   toUpper, toLower - Not OK.  There are cases where upper casing a
> >      character yields two characters.
>
> I thought title case was supposed to handle this.  I'm probably
> confused, though.

The titlecase characters in Unicode are (essentially) only there
for compatibility reasons (originally for transliterating between
certain subsets of Cyrillic and Latin scripts in a 1-1 way).  You're
not supposed to really use them...

The cases where toUpper of a single character gives two characters
are some (classical) Greek letters, where a built-in subscript iota
turns into a capital iota, and other cases where there is no
corresponding uppercase letter.
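To make the one-to-many case concrete, here is a minimal sketch in
Python, used only because its built-in str.upper applies the full
(multi-character) case mappings; it is not the Haskell library under
discussion.  The three sample characters are my own picks, not taken
from the thread.

```python
# Characters whose full uppercase mapping is longer than one code point.
# The samples are illustrative choices, not an exhaustive list.
samples = {
    "\u00df": "Latin small letter sharp s",
    "\u1fb3": "Greek small alpha with iota subscript",
    "\ufb01": "Latin small ligature fi",
}

def upper_code_points(chars):
    """Map each character to the code points of its uppercase form."""
    return {c: [f"U+{ord(u):04X}" for u in c.upper()] for c in chars}

result = upper_code_points(samples)
for char, name in samples.items():
    print(name, "->", " ".join(result[char]))
```

Each of the three inputs is a single code point, and each uppercases
to two, which is why a Char -> Char signature cannot express the
mapping.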

Case mapping is also context sensitive.  E.g. capital sigma maps to
small sigma (mostly) or to ς (small final sigma) at the end of a word;
capital I maps to ı (small dotless i) in Turkish; and in Lithuanian a
combining dot above is inserted or deleted for i and j.  See UTR 21
and http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt.
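The context dependence can be seen in a small sketch, again in Python
purely for illustration.  It assumes a recent Python 3, whose
str.lower applies the final-sigma rule on whole strings but does no
language-specific tailoring; the example word is my own choice.

```python
def lower_whole(word):
    """Lowercase the whole string, so the final-sigma rule can apply."""
    return word.lower()

def lower_per_char(word):
    """Lowercase one character at a time, as a Char -> Char map would."""
    return "".join(c.lower() for c in word)

word = "\u039f\u0394\u039f\u03a3"   # Greek capitals, ending in capital sigma

whole = lower_whole(word)
per_char = lower_per_char(word)

print([f"U+{ord(c):04X}" for c in whole])      # last one is U+03C2 (final sigma)
print([f"U+{ord(c):04X}" for c in per_char])   # last one is U+03C3 (plain sigma)

# No Turkish tailoring here: capital I becomes dotted i, not U+0131.
print("I".lower() == "\u0131")
```

The per-character version cannot see that the sigma ends a word, and
neither version knows the language of the text, so the Turkish and
Lithuanian rules need a locale parameter from somewhere else.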

>
> > etc.  Any program using this library is bound to get confused on
> > Unicode strings.  Even before Unicode, there is much functionality
> > missing; for instance, I don't see any way to compare strings using
> > a localized order.
>
> And you can't really use list functions like "length" on strings,
> since one item can be two characters (Lj, ij, fi) and several items
> can compose one character (combining characters).

Depends on what you mean by "length" and "character"...
You seem to be after what is sometimes referred to as a "grapheme",
and counting those.  There is a proposal for a definition of
"language independent grapheme" (with lexical syntax), but I don't
think it is stable yet.
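The different notions of "length" can be sketched as follows.  The
function count_base_characters is a hypothetical helper and only a
rough approximation: it skips combining marks and is not the grapheme
definition from the proposal mentioned above.

```python
import unicodedata

def count_code_points(s):
    """Length as a list function sees it: one item per code point."""
    return len(s)

def count_base_characters(s):
    """Rough grapheme estimate: code points that are not combining marks."""
    return sum(1 for c in s if unicodedata.combining(c) == 0)

decomposed = "e\u0301"    # e followed by combining acute accent
precomposed = "\u00e9"    # the same letter as a single code point
ligature = "\ufb01"       # fi ligature: one code point, two letters to a reader

for s in (decomposed, precomposed, ligature):
    print(count_code_points(s), count_base_characters(s))
```

The decomposed and precomposed forms look identical on screen but
differ in code point count, while the ligature counts as one item
under both measures even though a reader sees two letters.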

> And "map (==)" can't compare two Strings, e.g. in the presence
> of combining characters.  How are other systems handling this?

I guess it is not very systematic.  Java and XML make the comparisons
directly by equality of the 'raw' characters *when* comparing identifiers and
the like, though for XML there is a proposal for "early normalisation", essentially
to NFC (Normalization Form C).  I would have preferred comparing the normal forms
of the identifiers instead.  For searches, the recommendation (though I doubt
it is followed in practice yet) is to use a collation key based comparison.  (Note
that collation keys are usually language dependent.  More about collation in UTS 10,
http://www.unicode.org/unicode/reports/tr10/, and ISO/IEC 14651.)
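The difference between raw comparison and comparison of normal forms
can be sketched with Python's standard unicodedata module.  The names
equal_raw and equal_nfc are made up for this illustration, and the
sketch covers only equality, not collation.

```python
import unicodedata

def equal_raw(a, b):
    """Compare code point by code point, as Java and XML do for identifiers."""
    return a == b

def equal_nfc(a, b):
    """Compare after normalising both strings to Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "caf\u00e9"     # e with acute as one code point
decomposed = "cafe\u0301"     # e followed by combining acute accent

print(equal_raw(precomposed, decomposed))
print(equal_nfc(precomposed, decomposed))
```

The two strings render identically, yet only the normalised
comparison treats them as equal.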

What does NOT make sense is to expose (to a user) the raw ordering (<)
of Unicode strings, though it may be useful internally.  Orders exposed to
people (or other systems, for that matter) that aren't concerned with the
inner workings of a program should always be collation based.  (But that
holds for any character encoding, it's just more apparent for Unicode.)
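A small sketch of why the raw order is unsuitable for display.  The
function crude_key is a hypothetical stand-in that strips combining
marks and folds case; it is emphatically not a collation as defined
in UTS 10 and ignores language-specific rules.  A real program would
use a collation library instead.

```python
import unicodedata

def crude_key(s):
    """Very rough sort key: drop combining marks, fold case, tie-break on s."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if unicodedata.combining(c) == 0)
    return (stripped.casefold(), s)

words = ["zebra", "Apple", "\u00e4pple", "apple"]

raw_sorted = sorted(words)                  # plain code point order
crude_sorted = sorted(words, key=crude_key)

print(raw_sorted)
print(crude_sorted)
```

In code point order the word starting with a-diaeresis lands after
"zebra", which no reader would expect; even the crude key keeps the
three "apple" variants together.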

> It may be that Unicode isn't flawed, but it's certainly extremely
> complex.  I guess I'll have to delve a bit deeper into it.

It's complex, but that is because the scripts of the world are complex (and add
to that politics, as well as compatibility and implementation issues).

        Kind regards
        /kent k