Ketil Malde ketil@ii.uib.no
05 Oct 2001 14:35:17 +0200

"Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl> writes:

> Fri, 5 Oct 2001 02:29:51 -0700 (PDT), Krasimir Angelov <ka2_mail@yahoo.com> pisze:
> > Why Char is 32 bit. UniCode characters is 16 bit.

> No, Unicode characters have 21 bits (range U+0000..10FFFF).

We've been through all this, of course, but here's a quote:

> "Unicode" originally implied that the encoding was UCS-2 and it
> initially didn't make any provisions for characters outside the BMP
> (U+0000 to U+FFFF). When it became clear that more than 64k
> characters would be needed for certain special applications
> (historic alphabets and ideographs, mathematical and musical
> typesetting, etc.), Unicode was turned into a sort of 21-bit
> character set with possible code points in the range U-00000000 to
> U-0010FFFF. The 2048 surrogate characters (U+D800 to U+DFFF) were
> introduced into the BMP to allow 1024×1024 non-BMP characters to be
> represented as a sequence of two 16-bit surrogate characters. This
> way UTF-16 was born, which represents the extended "21-bit" Unicode
> in a way backwards compatible with UCS-2. The term UTF-32 was
> introduced in Unicode to mean a 4-byte encoding of the extended
> "21-bit" Unicode. UTF-32 is the exact same thing as UCS-4, except
> that by definition UTF-32 is never used to represent characters
> above U-0010FFFF, while UCS-4 can cover all 2^31 code positions up to
> U-7FFFFFFF.

from Markus Kuhn's UTF-8 and Unicode FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html
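The surrogate scheme the FAQ describes is easy to sketch in Haskell; the
function name below is mine, purely for illustration:

```haskell
import Data.Char (ord)
import Data.Bits (shiftR, (.&.))

-- Encode a non-BMP code point (U+10000..U+10FFFF) as a UTF-16
-- surrogate pair: subtract 0x10000, then split the remaining 20 bits
-- into a high surrogate (top 10 bits) and a low surrogate (bottom 10).
toSurrogatePair :: Char -> (Int, Int)
toSurrogatePair c = (0xD800 + (v `shiftR` 10), 0xDC00 + (v .&. 0x3FF))
  where v = ord c - 0x10000

-- For example, U+1D11E (musical symbol G clef) encodes as
-- the pair (0xD834, 0xDD1E).
```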

Does Haskell's support of "Unicode" mean UTF-32, or full UCS-4?
Recent messages seem to indicate the former, but I don't see any
reason against the latter.
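One way to probe which of the two a given implementation provides
(assuming a GHC-style Data.Char; the name charCoversOnlyUtf32 is mine):

```haskell
import Data.Char (ord)

-- If Char stops at the UTF-32 range, ord maxBound is 0x10FFFF;
-- full UCS-4 would allow code points all the way up to 0x7FFFFFFF.
charCoversOnlyUtf32 :: Bool
charCoversOnlyUtf32 = ord (maxBound :: Char) == 0x10FFFF
```

In GHC, maxBound :: Char is in fact U+10FFFF, so Char matches UTF-32
rather than full UCS-4.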

If I haven't seen further, it is by standing in the footprints of giants