Unicode support

Kent Karlsson kentk@md.chalmers.se
Tue, 9 Oct 2001 11:58:06 +0200


Just to clear up any misunderstanding:

----- Original Message -----
From: "Ashley Yakeley" <ashley@semantic.org>
To: "Haskell List" <haskell@haskell.org>
Sent: Monday, October 01, 2001 12:36 AM
Subject: Re: Unicode support


> At 2001-09-30 07:29, Marcin 'Qrczak' Kowalczyk wrote:
>
> >Some time ago the Unicode Consortium slowly began switching to the
> >point of view that abstract characters are denoted by numbers in the
> >range U+0000..10FFFF.
>
> It's worth mentioning that these are 'codepoints', not 'characters'.

Yes, but characters are allocated to code points (or rather code positions).

> Sometimes a character will be made up of two codepoints, for instance an
> 'a' with a dot above is a single character that can be made from the
> codepoints LATIN SMALL LETTER A and COMBINING DOT ABOVE.

Well, those ARE characters, which together form a GRAPHEME (which is
what Joe User would consider to be a character). Those two happen to
'combine' in NFC to LATIN SMALL LETTER A WITH DOT ABOVE.
But that holds only for that particular example. LATIN SMALL LETTER R followed
by COMBINING SHORT STROKE OVERLAY (yes, this is used in some places, but will
never get a precomposed character) stays as two characters even in NFC. Either
of these examples, in either normal form, MAY be rendered by a font as one
glyph (a ligature, if you like) or as two (overlaid) glyphs.
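
To make the NFC behaviour above concrete, here is a minimal sketch in Python
using the standard unicodedata module. The helper name describe is my own,
and the result depends on the Unicode database shipped with the interpreter;
take it as an illustration, not a definitive test of normalisation.

```python
import unicodedata

# 'a' + COMBINING DOT ABOVE (U+0307): a precomposed character exists.
a_dot = "a\u0307"
# 'r' + COMBINING SHORT STROKE OVERLAY (U+0335): no precomposed character.
r_stroke = "r\u0335"


def describe(s):
    """Return the Unicode character name of each code point in s."""
    return [unicodedata.name(c) for c in s]


nfc_a = unicodedata.normalize("NFC", a_dot)
nfc_r = unicodedata.normalize("NFC", r_stroke)

# The first pair composes to one character under NFC.
print(describe(nfc_a))  # ['LATIN SMALL LETTER A WITH DOT ABOVE']
# The second pair is left as two characters under NFC.
print(describe(nfc_r))
```

Both strings are still one grapheme each for the user; only the number of
characters differs after normalisation.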

Further, some code points are permanently reserved for UTF-16 "surrogates",
some are permanently reserved as non-characters(!), some are for
private use (which can be used for things not yet formally encoded,
or things that never will be encoded) and quite a lot are reserved for
future standardisation.
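
As a rough sketch of these categories of code points, the following Python
function sorts a code point into one of the groups just listed. The function
name classify and its labels are made up for illustration. The surrogate and
non-character ranges are fixed, but which code points count as "reserved"
depends on the Unicode version of the interpreter's database.

```python
import unicodedata


def classify(cp):
    """Classify a code point (an int in 0..0x10FFFF) into a rough category."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a code point: %r" % (cp,))
    # Permanently reserved for UTF-16 surrogates.
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Permanently reserved non-characters: U+FDD0..U+FDEF and the
    # last two code points of every plane (xxFFFE and xxFFFF).
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return "private use"
    if category == "Cn":
        # Unassigned in this version of the database.
        return "reserved"
    return "assigned character"


for cp in (0x0061, 0xD800, 0xFFFE, 0xE000, 0x50000):
    print("U+%04X: %s" % (cp, classify(cp)))
```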

The 8-, 16-, or 32-bit units in the encoding forms are called 'code units'.
E.g. Java's 'char' type is for UTF-16 code units, not characters!
A single UTF-16 code unit can nonetheless represent a character in the BMP
(if that code position has a character allocated to it). In many cases, but
definitely not all, a single character, in its string context, is a grapheme too.
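
The difference between code points and code units can be sketched in Python
as follows. The names code_units and utf16_units are my own; the second
function assumes its argument is a code point outside the surrogate range.

```python
def code_units(s):
    """Count code points and code units of s in each encoding form."""
    return {
        "code points": len(s),
        "UTF-8": len(s.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(s.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(s.encode("utf-32-le")) // 4,  # 32-bit code units
    }


def utf16_units(cp):
    """Return the UTF-16 code units for one non-surrogate code point."""
    if cp < 0x10000:
        # BMP: a single code unit holds the code point directly.
        return [cp]
    # Beyond the BMP: split the 20-bit offset into a surrogate pair.
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


print(code_units("a"))
print(code_units("\U0001D11E"))  # MUSICAL SYMBOL G CLEF, outside the BMP
print([hex(u) for u in utf16_units(0x1D11E)])  # ['0xd834', '0xdd1e']
```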

In summary:

    code position (=code point): a value in the range 0000..10FFFF (hexadecimal).

    code unit: a fixed bit-width value used in one of the encoding forms
        (often called "char" in programming languages).

    character: hard to give a proper definition (the 10646 one does not
        say anything), but in brief roughly "a thing deemed worthy of being
        added to the repertoire of 10646".

    grapheme: a sequence of one or more characters that naïve users
        think of as a character (may be language dependent).

    glyph: a piece of graphic that may image part of, a whole, or several
        characters in context.  It is highly font dependent how the exact mapping
        from characters to positioned glyphs is done.  (The partitioning into
        subglyphs, if done, need not be tied to Unicode decomposition.)
        For most scripts, including Latin, this mapping is rather complex
        (and is yet to be implemented in full).
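
To illustrate the character/grapheme distinction in the summary, here is a
deliberately crude Python sketch that attaches each combining mark to the
character before it. The function rough_graphemes is hypothetical and only an
approximation: proper grapheme segmentation (as specified in Unicode Standard
Annex #29) has more rules, and, as noted above, may be language dependent.

```python
import unicodedata


def rough_graphemes(s):
    """Group each base character with the combining marks that follow it.

    Crude approximation: any code point whose general category starts
    with 'M' (a mark) is appended to the preceding cluster.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# 'a' + COMBINING DOT ABOVE, 'r' + COMBINING SHORT STROKE OVERLAY, 'x'
text = "a\u0307r\u0335x"
print(len(text))                   # 5 characters (code points)
print(len(rough_graphemes(text)))  # 3 (approximate) graphemes
```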

> Perhaps this
> makes the UTF-16 'surrogate' problem a bit less serious, since there
> never was a one-to-one correspondence between any kind of n-bit unit and
> displayed characters.

With that I agree.

        Kind regards
        /kent k


>
> --
> Ashley Yakeley, Seattle WA