Unicode again

Kent Karlsson kentk@md.chalmers.se
Wed, 16 Jan 2002 17:38:42 +0100


This is getting a bit off-topic for Haskell...

> Isn't it fairly common to use 32bit Unicode character types in C?

Yes, in some implementations, but nobody but a few Linux and SunOS
programmers uses that...  (And those systems are far from committed
to Unicode.)

In some other systems wchar_t is (except for the ASCII part) an unknown
(opaque) encoding, literally!  Only the system knows the mapping between
it and some external encoding.  That renders it completely useless for
writing line-breaking routines, display routines, collation routines,
you name it.

Most commonly wchar_t is a 16-bit datatype that holds UTF-16 code units.
The Windows APIs use that...  Which is, to be nitpicking, against the C
standard: a single wchar_t is supposed to be able to represent any
character of the supported locales, and a surrogate pair does not fit
in one 16-bit unit.
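
For what the resulting multi-unit handling looks like in practice,
here is a minimal sketch (assuming 16-bit code units, as on Windows;
utf16_decode is a made-up helper name, not any system API):

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Decode one code point from a 0-terminated sequence of UTF-16
       code units, advancing *i past the units consumed. */
    static uint32_t utf16_decode(const uint16_t *s, size_t *i)
    {
        uint16_t hi = s[(*i)++];
        if (hi >= 0xD800 && hi <= 0xDBFF) {        /* high surrogate */
            uint16_t lo = s[*i];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {    /* low surrogate  */
                (*i)++;
                return 0x10000u
                     + (((uint32_t)(hi - 0xD800) << 10)
                     |   (uint32_t)(lo - 0xDC00));
            }
        }
        return hi;   /* BMP character (or an unpaired surrogate) */
    }

    int main(void)
    {
        /* U+0041, then U+1D11E MUSICAL SYMBOL G CLEF (two units) */
        const uint16_t s[] = { 0x0041, 0xD834, 0xDD1E, 0 };
        size_t i = 0;
        while (s[i] != 0)
            printf("U+%04X\n", (unsigned)utf16_decode(s, &i));
        return 0;
    }

Every routine that walks such a string has to know about surrogate
pairs.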

> I'm not sure I see the efficiency gain of UTF-16 over UTF-8 or
> UTF-32, as you still need the multi-unit character management as in 
> UTF-8, while most of the time using more memory.
> 
> Correct me if I'm wrong, but my impression is that UTF-16 was chosen
> partly on the assumption that all of Unicode would fit, and I'm not
> sure it's such an obvious choice today.

That is not true, but I've explained that before on this list, so I
won't do it again; at least not just now.


In relation to this: DIN has submitted a request to add (informatively)
a new datatype to C: utf16_t.  The background is that some companies
(SAP in particular) have found UTF-32 too inefficient, even in C
programs, while wchar_t cannot conformantly or portably be used for
UTF-16.  That does not mean that UTF-32 would not be suitable for
Haskell, though.
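
One plausible shape for such a type (a sketch only, not the wording of
the DIN proposal) is a typedef over an exactly-16-bit unsigned integer,
so that its width, unlike that of wchar_t, is actually pinned down:

    #include <stdint.h>

    /* Sketch: one UTF-16 code unit.  With the width fixed at 16
       bits, surrogate pairs are expected by design rather than
       being a conformance problem, as they are for wchar_t. */
    typedef uint16_t utf16_t;

Strings would then be arrays of utf16_t, walked with surrogate-aware
code like the sketch above.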

		/kent k