[Haskell-cafe] PROPOSAL: New efficient Unicode string library.

ok ok at cs.otago.ac.nz
Thu Sep 27 02:40:06 EDT 2007


On 26 Sep 2007, at 7:05 pm, Johan Tibell wrote:
> If UTF-16 is what's used by everyone else (how about Java? Python?) I
> think that's a strong reason to use it. I don't know Unicode well
> enough to say otherwise.

Java uses 16-bit variables to hold characters.
This is SOLELY for historical reasons, not because it is a good choice.
The history is a bit funny:  the ISO 10646 group were working away
defining a 31-bit character set, and the industry screamed blue murder
about how this was going to ruin the economy, bring back the Dark Ages,
&c, and promptly set up the Unicode consortium to define a 16-bit
character set that could do the same job.  Early versions of Unicode
had only about 30 000 characters, after heroic (and not entirely
appreciated) efforts at unifying Chinese characters as used in China
with those used in Japan and those used in Korea.  They also lumbered
themselves (so that they would have a fighting chance of getting
Unicode adopted) with a "round trip conversion" policy, namely that it
should be possible to take characters using ANY current encoding
standard, convert them to Unicode, and then convert back to the original
encoding with no loss of information.  This led to failures of
unification: there are two versions of Å (one for ordinary use, one
for Angstroms), two versions of mu (one for Greek, one for micron),
three complete copies of ASCII, &c.  However, 16 bits really is not
enough.

Here's a table from http://www.unicode.org/versions/Unicode5.0.0/

Graphic       98,884
Format           140
Control           65
Private Use  137,468
Surrogate      2,048
Noncharacter      66
Reserved     875,441

Excluding Private Use and Reserved, I make that 101,203 currently
defined codes.  That's more than 1.5 times the number that would fit
in 16 bits.
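
If you want to check that arithmetic, a throwaway Haskell snippet does
it (the figures are just the non-Private-Use, non-Reserved rows of the
table above):

definedCodes :: Int
definedCodes = sum [98884, 140, 65, 2048, 66]      -- 101,203

ratioTo16Bits :: Double
ratioTo16Bits = fromIntegral definedCodes / 65536  -- roughly 1.54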

Java has had to deal with this, don't think it hasn't.  For example,
where Java had one set of functions referring to characters in strings
by position, it now has two complete sets:  one indexed by *16-bit
code unit* (which is fast) and one indexed by *actual Unicode
character* (which is slow).  The key point is that the second set is
*always* slow, even when there are no characters outside the Basic
Multilingual Plane.
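
To see why, here is a rough Haskell sketch of the problem (the names
and the representation are invented for illustration, not taken from
any existing library): finding the n-th *code point* in UTF-16 data
has to scan past possible surrogate pairs, even when none are present.

import Data.Array.Unboxed (UArray, (!), bounds)
import Data.Word (Word16)

-- A string stored as UTF-16 code units (hypothetical representation).
type U16String = UArray Int Word16

-- Indexing by code unit is plain array indexing: O(1).
codeUnitAt :: U16String -> Int -> Word16
codeUnitAt s i = s ! i

-- Indexing by code point must walk from the start, because any
-- earlier unit *might* be the first half of a surrogate pair, and the
-- only way to know is to look.  O(n) even for pure-BMP text.
codePointOffset :: U16String -> Int -> Int
codePointOffset s n = go lo n
  where
    (lo, hi) = bounds s
    go i 0 = i
    go i k
      | i > hi            = i                  -- ran off the end
      | isHighSurrogate u = go (i + 2) (k - 1) -- two units, one character
      | otherwise         = go (i + 1) (k - 1)
      where u = s ! i
    isHighSurrogate u = u >= 0xD800 && u <= 0xDBFF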

One Smalltalk system I sometimes use has three complete string
implementations (all characters fit in a byte, all characters fit
in 16 bits, some characters require more) and dynamically switches
from narrow strings to wide strings behind your back.  In a language
with read-only strings, that makes a lot of sense; it's just a pity
Smalltalk isn't one.
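
For what it's worth, the idea is easy to sketch in Haskell terms.
This is illustrative only (lists standing in for packed arrays, and
the names are invented): a smart constructor picks the narrowest
representation that fits, which is safe precisely because the string
never changes afterwards.

import Data.Char (ord)
import Data.Word (Word8, Word16, Word32)

-- Illustrative only: a real implementation would use packed arrays.
data AdaptiveString
  = Narrow [Word8]    -- every character fits in one byte
  | Medium [Word16]   -- every character fits in 16 bits
  | Wide   [Word32]   -- some characters need more

fromString :: String -> AdaptiveString
fromString s
  | widest < 0x100   = Narrow (map (fromIntegral . ord) s)
  | widest < 0x10000 = Medium (map (fromIntegral . ord) s)
  | otherwise        = Wide   (map (fromIntegral . ord) s)
  where
    widest = if null s then 0 else maximum (map ord s)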

If you want to minimize conversion effort when talking to the operating
system, files, and other programs, UTF-8 is probably the way to go.
(That's on Unix.  For Windows it might be different.)
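
As a reminder of what that conversion involves, here is a minimal
UTF-8 encoder for a single code point (a sketch only; it assumes a
valid scalar value, no surrogates, and does no error handling):

import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one code point as 1 to 4 bytes.  ASCII stays a single byte,
-- which is why UTF-8 is cheap when talking to Unix files and programs.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. lead 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. lead 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    lead k = fromIntegral (n `shiftR` k)                      -- leading-byte payload
    cont k = 0x80 .|. (fromIntegral (n `shiftR` k) .&. 0x3F)  -- continuation byte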

If you want to minimize the effort of recognising character boundaries
while processing strings, 32-bit characters are the way to go.  If you
want to be able to index into a string efficiently, they are the *only*
way to go.  Solaris bit the bullet many years ago; Sun C compilers
jumped straight from 8-bit wchar_t to 32-bit without ever stopping
at 16.
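
By way of contrast with the UTF-16 sketch above, with 32-bit
characters the n-th character really is just an array lookup (again a
hypothetical representation; GHC's Char is already a full code point):

import Data.Array.Unboxed (UArray, (!), listArray)

-- A string as a packed array of full code points.
type WideString = UArray Int Char

wideFromString :: String -> WideString
wideFromString s = listArray (0, length s - 1) s

charAt :: WideString -> Int -> Char
charAt s i = s ! i    -- O(1): no decoding, no scanning, no surrogates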

16-bit characters *used* to be a reasonable compromise, but aren't any
longer.  Unicode keeps on growing.  There were 1,349 new characters
from Unicode 4.1 to Unicode 5.0 (IIRC).  There are lots more scripts
in the pipeline.  (What the heck _is_ Tangut, anyway?)

More information about the Haskell-Cafe mailing list