Unicode support

Marcin 'Qrczak' Kowalczyk qrczak@knm.org.pl
30 Sep 2001 14:29:40 GMT


30 Sep 2001 22:28:52 +0900, Jens Petersen <petersen@redhat.com> pisze:

> 16 bits is enough to describe the Basic Multilingual Plane
> and I think 24 bits all the currently defined extended
> planes.  So I guess the report just refers to the BMP.

In the early days the Unicode Consortium did everything to confuse
people about whether Unicode fits into 16 bits. It used to push the
view that Unicode is based on 16-bit units, and that pairs of units
from the range U+D800..DFFF (called surrogates) can encode about a
million extra characters (none of which had a more specific meaning
defined at that time).

I was told on the Unicode list that this was done because some people
would find it hard to accept an encoding which requires *more* than
twice as much storage as 8-bit charsets. 16 bits is "only" twice as much.

Unfortunately some companies, like Microsoft and Oracle, believed the
"lie of Unicode marketing" and adopted the 16-bit view as the basic
internal and external format, ignoring the issue of surrogates.

Some time ago the Unicode Consortium slowly began switching to the
point of view that abstract characters are denoted by numbers in the
range U+0000..10FFFF. Storing them in 16-bit units, by expressing
characters up to U+FFFF directly and representing the rest as pairs
of surrogates, is just one way to serialize Unicode to streams of
bytes (or 16-bit words), called UTF-16.
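
To make that concrete, here is a rough Haskell sketch of the
serialization (the names are mine, and I'm assuming a Char type that
covers the full U+0000..10FFFF range, which is of course part of
what this thread is about):

    import Data.Word (Word16)
    import Data.Char (ord, chr)
    import Data.Bits (shiftR, shiftL, (.&.), (.|.))

    -- One 16-bit unit for characters up to U+FFFF, a surrogate pair
    -- for the rest.
    utf16Encode :: Char -> [Word16]
    utf16Encode c
      | n < 0x10000 = [fromIntegral n]
      | otherwise   = [ fromIntegral (0xD800 .|. (m `shiftR` 10))  -- high surrogate
                      , fromIntegral (0xDC00 .|. (m .&. 0x3FF)) ]  -- low surrogate
      where
        n = ord c
        m = n - 0x10000  -- 20 bits, split 10/10 between the two units

    -- The inverse: pair up surrogates, pass everything else through.
    utf16Decode :: [Word16] -> String
    utf16Decode (hi : lo : rest)
      | 0xD800 <= hi, hi < 0xDC00, 0xDC00 <= lo, lo < 0xE000 =
          chr (0x10000 + (fromIntegral (hi - 0xD800) `shiftL` 10)
                       + fromIntegral (lo - 0xDC00)) : utf16Decode rest
    utf16Decode (u : rest) = chr (fromIntegral u) : utf16Decode rest
    utf16Decode []         = []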

AFAIK UTF-8 was first present in ISO-10646-1. The ISO standard,
although it shares the actual assignment of characters to numbers
with Unicode, from the beginning viewed character codes as 31-bit
numbers, which can be serialized for transmission using for example
UTF-8 or UTF-16.

Unicode adopted UTF-8 by cutting it off at U+10FFFF. It also
invented UTF-32, which just stores characters in 32-bit words
(endianness issues are analogous to UTF-16) but is explicitly
restricted to characters up to U+10FFFF, to avoid confusion with
the unrestricted 31-bit codes of ISO-10646-1.
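
For comparison, here is a rough sketch of UTF-8 in its original
31-bit form, which runs to six bytes per character (utf8Encode is my
name, not a library function). With Unicode's cut at U+10FFFF only
the first four branches are ever reached, i.e. at most four bytes:

    import Data.Word (Word8)
    import Data.Bits (shiftR, (.&.), (.|.))

    -- UTF-8 over 31-bit ISO-10646 codes: 1 to 6 bytes. The high bits
    -- of the first byte give the length; each continuation byte
    -- carries 6 bits.
    utf8Encode :: Int -> [Word8]
    utf8Encode n
      | n < 0x80      = [fromIntegral n]
      | n < 0x800     = (0xC0 .|. top 6)  : conts [0]
      | n < 0x10000   = (0xE0 .|. top 12) : conts [6, 0]
      | n < 0x200000  = (0xF0 .|. top 18) : conts [12, 6, 0]
      | n < 0x4000000 = (0xF8 .|. top 24) : conts [18, 12, 6, 0]
      | otherwise     = (0xFC .|. top 30) : conts [24, 18, 12, 6, 0]
      where
        top s = fromIntegral (n `shiftR` s)
        conts = map (\s -> 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F))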

So now UTF-8, UTF-16 and UTF-32 are treated in parallel by Unicode.
The ISO standard is going to match this and limit itself to U+10FFFF
too, which in theory should settle the question of the number of
characters in these standards.

Unicode had to do something about this because it finally began
adding characters above U+FFFF, and it would really make no sense
to treat UTF-16 as the fundamental view, saying that some codes
don't represent characters by themselves but must be used in pairs:
character properties are defined in terms of real characters, not
in terms of the components of surrogate pairs individually.
Surrogates are just a hole in the middle of the first 64k of
characters, because UTF-16 can't encode them in isolation.

Unfortunately the 16-bit view is still widespread and there is much
confusion. Companies invested money in 16-bit Unicode and can't
simply replace it with something entirely different, so they are
actually beginning to implement UTF-16. In the past support for
surrogates could be almost non-existent, but now there are actual
characters allocated above U+FFFF, so it must be done, despite the
pain of using a variable-length encoding.

There are cases like Oracle, which ignored surrogates and
misimplemented UTF-8 by treating surrogates like other characters
below U+FFFF, yet called it UTF-8. Now, instead of fixing their
mistake, they have added the real UTF-8 under a strange name,
AL24UTFFSS (I'm not sure if they finally fixed the names), and are
trying to push their old version as an official alternative to
UTF-8. There is very strong opposition, but they are still trying.
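
To show concretely what that misimplementation does, here is a
sketch contrasting real UTF-8 with the surrogates-first variant they
are pushing (the proposal calls it CESU-8, if I remember the name
right). For U+10000 real UTF-8 gives four bytes, the variant six:

    import Data.Word (Word8)
    import Data.Bits (shiftR, (.&.), (.|.))
    import Numeric (showHex)

    -- Real UTF-8, restricted to U+0000..10FFFF (same scheme as the
    -- encoder sketched earlier, repeated so this stands alone).
    utf8 :: Int -> [Word8]
    utf8 n
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = (0xC0 .|. top 6)  : conts [0]
      | n < 0x10000 = (0xE0 .|. top 12) : conts [6, 0]
      | otherwise   = (0xF0 .|. top 18) : conts [12, 6, 0]
      where
        top s = fromIntegral (n `shiftR` s)
        conts = map (\s -> 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F))

    -- The mistake: split a character above U+FFFF into its UTF-16
    -- surrogate pair first, then encode each surrogate as if it were
    -- an ordinary BMP character, giving six bytes instead of four.
    surrogateUtf8 :: Int -> [Word8]
    surrogateUtf8 n
      | n < 0x10000 = utf8 n
      | otherwise   = utf8 (0xD800 .|. (m `shiftR` 10)) ++
                      utf8 (0xDC00 .|. (m .&. 0x3FF))
      where m = n - 0x10000

    main :: IO ()
    main = do
        putStrLn (hex (utf8 0x10000))          -- f0 90 80 80
        putStrLn (hex (surrogateUtf8 0x10000)) -- ed a0 80 ed b0 80
      where
        hex = unwords . map (`showHex` "")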

IMHO it would have been better not to invent UTF-16 at all and to
use UTF-8 in parallel with UTF-32. But Unicode used to promote
UTF-16 as the real Unicode, and now it takes so many threads on the
Unicode list to clear up the confusion about the nature of
characters above U+FFFF.

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SUBSTITUTE SIGNATURE
QRCZAK