UniCode

Marcin 'Qrczak' Kowalczyk qrczak@knm.org.pl
5 Oct 2001 18:17:26 GMT


Fri, 5 Oct 2001 23:23:50 +1000, Andrew J Bromage <andrew@bromage.org> writes:

> There is a set of one million (more correctly, 1M) Unicode characters
> which are only accessible using surrogate pairs (i.e. two UTF-16
> codes).  There are currently none of these codes assigned,

This information is out of date. AFAIR about 40000 of them are assigned,
most of them for Chinese (current, not historic).
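
For reference, here is a minimal sketch of how a code point above U+FFFF
becomes a surrogate pair. The function name encodeUtf16 is made up for
illustration and is not from any library; the arithmetic is the standard
UTF-16 split of a 20-bit offset into two 10-bit halves.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one Char as UTF-16 code units (hypothetical helper).
-- Code points below U+10000 take one unit; all others take a
-- surrogate pair. A lone surrogate Char (U+D800..U+DFFF) is passed
-- through unchanged here, which a real encoder might reject.
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [hi, lo]
  where
    n  = ord c
    n' = n - 0x10000                               -- 20-bit offset
    hi = fromIntegral (0xD800 + (n' `shiftR` 10))  -- top 10 bits
    lo = fromIntegral (0xDC00 + (n' .&. 0x3FF))    -- bottom 10 bits
```

The 20-bit offset is where the "1M" figure comes from: 2^20 = 1048576
code points are reachable only through pairs.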

> So rare, in fact, that the cost of strings taking up twice the
> space that they currently do simply isn't worth the cost.

In Haskell, strings already have high overhead. In GHC a Char# value
(inside a Char object) always takes the same size as a pointer
(32 or 64 bits), no matter how much of it is used.
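
To put a rough number on that overhead, here is a back-of-the-envelope
sketch. The layout figures are my assumptions and are not stated above:
a (:) cell of 3 words (header, head pointer, tail pointer) and a boxed
Char of 2 words (header plus the Char# payload). The name
stringHeapBytes is made up for illustration.

```haskell
-- Rough upper bound on the heap bytes of a fully evaluated String.
-- Assumed, not measured: 3 words per (:) cell, 2 words per boxed Char.
-- The first argument is the word size in bytes (4 or 8).
stringHeapBytes :: Int -> Int -> Int
stringHeapBytes wordBytes len = len * (consWords + charWords) * wordBytes
  where
    consWords = 3  -- header, pointer to the Char, pointer to the tail
    charWords = 2  -- header, word-sized Char# payload
```

Under these assumptions a 10-character String costs up to 200 bytes on
a 32-bit machine and 400 bytes on a 64-bit one. The real figure can be
lower if the runtime shares the boxes of common characters.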

> It just goes to show that strings are not merely arrays of characters
> like some languages would have you believe.

In Haskell String = [Char]. It's true that Char values don't
necessarily correspond to glyphs, but Strings are composed of Chars.
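
A small sketch of the Char-versus-glyph point, using a combining
accent. The names decomposed, precomposed and codePoints are made up
for illustration.

```haskell
import Data.Char (ord)

-- "e" followed by U+0301 COMBINING ACUTE ACCENT: usually drawn as
-- one glyph, but it is two Chars.
decomposed :: String
decomposed = "e\x0301"

-- U+00E9 LATIN SMALL LETTER E WITH ACUTE: the same glyph as one Char.
precomposed :: String
precomposed = "\x00E9"

-- The code points that make up a String, one per Char.
codePoints :: String -> [Int]
codePoints = map ord
```

Here length decomposed is 2 and length precomposed is 1, and the two
Strings compare unequal with (==) even though they would normally be
displayed the same way.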

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SUBSTITUTE SIGNATURE
QRCZAK