Unicode support

Ketil Malde ketil@ii.uib.no
09 Oct 2001 13:18:50 +0200


[Posted to haskell-cafe, since it's getting quite off topic]

"Kent Karlsson" <kentk@md.chalmers.se> writes:

>>> for a long time. 16 bit unicode should be gotten rid of, being the worst
>>> of both worlds, not backwards compatible with ASCII, endianness issues
>>> and no constant length encoding.... utf8 externally and utf32 when
>>> working with individual characters is the way to go.

>> I totally agree with you.

> Now, what are your technical arguments for this position?
> (B.t.w., UTF-16 isn't going to go away, it's very firmly established.)

What's wrong with the ones already mentioned?

You have endianness issues, and you need to explicitly type text files
or insert BOMs.
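
A minimal sketch of the byte-order problem, in Python purely for
illustration. The sample string is an arbitrary choice and not taken
from any real file; the codec names are the ones in Python's standard
library.

```python
import codecs

text = "blåbær"  # sample string, chosen only for illustration

# The same text gives different byte streams depending on byte order.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# Guessing the wrong byte order does not fail for this string;
# it silently produces different characters.
misread = little.decode("utf-16-be")

# The generic "utf-16" codec writes a BOM first so a reader can tell.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# A BOM-aware decoder picks the byte order from the BOM.
recovered = (codecs.BOM_UTF16_BE + big).decode("utf-16")

# UTF-8 has a single defined byte order, so none of this arises.
utf8 = text.encode("utf-8")

print(little != big, misread != text, has_bom, recovered == text)
```

Without the BOM or some out-of-band type information, the reader has
to guess, which is the point made above.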

A UTF-8 stream limited to 7-bit ASCII simply is that ASCII stream.
When not limited to ASCII, it still never produces zero bytes or any
other ASCII byte value inside a multi-byte character.  UTF-16 text
will, among other things, be full of NUL bytes.
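
Both claims are easy to check. The following is a small sketch, again
in Python; the helper `zero_bytes` and the two sample strings are made
up for this illustration.

```python
def zero_bytes(data: bytes) -> int:
    """Count the zero bytes in an encoded string."""
    return data.count(b"\x00")

plain = "Hello, world"  # pure 7-bit ASCII
name = "Håkon"          # ASCII plus one extra letter

# For pure ASCII text the UTF-8 bytes are exactly the ASCII bytes.
same_as_ascii = plain.encode("utf-8") == plain.encode("ascii")

# UTF-8 has no zero bytes here; UTF-16 has one per Latin-1 character.
utf8_zeros = zero_bytes(name.encode("utf-8"))
utf16_zeros = zero_bytes(name.encode("utf-16-le"))

print(same_as_ascii, utf8_zeros, utf16_zeros)  # True 0 5
```

Anything that treats a zero byte as end-of-string, such as C string
functions, will cut the UTF-16 version short after the first character.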

I can understand UCS-2 looking attractive when it looked like a
fixed-length encoding, but that no longer applies.
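
To make "no longer applies" concrete: a character outside the BMP
takes two 16-bit code units (a surrogate pair) in UTF-16. A short
sketch in Python, where `utf16_units` is a hypothetical helper and
U+1D11E (MUSICAL SYMBOL G CLEF) is just one example of such a
character:

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-be")) // 2

bmp_char = "å"           # U+00E5, inside the BMP
clef = "\U0001D11E"      # U+1D11E, outside the BMP

print(utf16_units(bmp_char))            # 1
print(utf16_units(clef))                # 2
print(clef.encode("utf-16-be").hex())   # d834dd1e
```

Code that indexes a UTF-16 buffer by code unit therefore cannot assume
that one unit is one character.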

> So it is not surprising that most people involved do not consider
> UTF-16 a bad idea.  The extra complexity is minimal, and further
> surfaces rarely.  

But it needs to be there, and it will mean larger programs, more
bugs, and lower efficiency.

> BMP characters are still (relatively) easy to process, and it saves
> memory space and cache misses when large amounts of text data
> is processed (e.g. databases).

I couldn't find anything about the relative efficiencies of UTF-8 and
UTF-16 for various languages.  Do you have any pointers?  From a
Scandinavian POV (using ASCII plus a handful of extra characters),
UTF-8 should be a big win, but I'm sure there are counterexamples.
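
This is no substitute for real measurements, but the size difference
is simple to compute for any given text. A rough sketch in Python; the
`sizes` helper and the two one-word samples are assumptions made for
this example and are far too small to say anything about real corpora.

```python
def sizes(s: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for s, without BOM."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))

samples = {
    "Norwegian": "blåbærsyltetøy",  # mostly ASCII, three extra letters
    "Japanese": "日本語",            # BMP characters outside Latin-1
}

for label, s in samples.items():
    utf8_bytes, utf16_bytes = sizes(s)
    print(label, utf8_bytes, utf16_bytes)
# Norwegian 17 28
# Japanese 9 6
```

The Norwegian word comes out smaller in UTF-8, while the Japanese one
is the kind of counterexample to expect: three bytes per character in
UTF-8 against two in UTF-16.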

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants