Unicode support

Ketil Malde ketil@ii.uib.no
09 Oct 2001 15:35:06 +0200


"Kent Karlsson" <kentk@md.chalmers.se> writes:

>> You have endianness issues, and you need to explicitly type text files
>> or insert BOMs.

> You have to distinguish between the encoding form (what you use internally)
> and encoding scheme (externally).  

Good point, of course.  Most of the arguments apply to the external
encoding scheme, but I suppose it wasn't clear which of them we were
discussing. 
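
To make the endianness and BOM point concrete, here is a minimal sketch using Python 3's built-in codecs (the sample string is my own choice, not anything from this thread).  The same two characters serialise three different ways as UTF-16, depending on byte order and BOM, but only one way as UTF-8:

```python
# Sample text chosen for illustration: 'h' is U+0068, 'é' is U+00E9.
text = "hé"

utf16_be = text.encode("utf-16-be")   # big-endian, no BOM
utf16_le = text.encode("utf-16-le")   # little-endian, no BOM
utf16_bom = text.encode("utf-16")     # native byte order, BOM prepended
utf8 = text.encode("utf-8")           # a byte stream, so no byte order at all

for name, data in [("UTF-16BE", utf16_be), ("UTF-16LE", utf16_le),
                   ("UTF-16+BOM", utf16_bom), ("UTF-8", utf8)]:
    print(f"{name:10} {data.hex(' ')}")

# A reader that is told only "utf-16" relies on the BOM to pick the byte order.
assert utf16_bom.decode("utf-16") == text
```

Without the BOM or out-of-band typing, a reader cannot tell the first two serialisations apart reliably.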

> But as I said: they will not go away now, they are too firmly established.

Yep.  But it appears that the "right" choice for external encoding
scheme would be UTF-8.

>> When not limited to ASCII, at least it avoids zero bytes and other
>> potential problems.  UTF-16 will, among other things, be full of
>> NULs.

> Yes, and so what?

So I can use UTF-8 for file names, in regular expressions, and in
whatever legacy applications expect textual data.  That may be
worthless to you, but it isn't to me.
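
A small sketch of why the zero bytes matter to legacy code, again in Python 3 and with a made-up file name.  The `split` call is only my stand-in for a C-style consumer that stops at the first NUL byte:

```python
# Hypothetical file name containing non-ASCII (Norwegian) letters.
name = "blåbærsyltetøy.txt"

utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")

print(0 in utf8)    # False: UTF-8 uses a zero byte only for U+0000 itself
print(0 in utf16)   # True: every character below U+0100 has a zero high byte

# What a NUL-terminated string API would see of each encoding.
seen_utf8 = utf8.split(b"\x00")[0]
seen_utf16 = utf16.split(b"\x00")[0]
print(seen_utf8 == utf8)   # True: the whole name survives
print(seen_utf16)          # b'b': truncated after the first character
```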

> So will a file filled with image data, video clips, or plainly a
> list of raw integers dumped to file (not formatted as strings).

But none of these pretend to be text!

> True.  But implementing normalisation, or case mapping for that matter,
> is non-trivial too.  In practice, the additional complexity with
> UTF-16 seems small. 

All right, but if there are no real advantages, why bother?

>> I couldn't find anything about the relative efficiencies of UTF-8 and
>> UTF-16 on various languages.

> So, how big is our personal hard disk now? 3GiB? 10GiB? How many images,
> mp3 files and video clips do you have?  (I'm sorry, but your argument here
> is getting old and stale.)

Don't be sorry.  I'm just looking for a good argument in favor of
UTF-16 instead of UTF-8, and size was the only possibility I could
think of offhand.  (And apparently, the Japanese are unhappy with the
50% size increase of UTF-8's three-byte encoding over UTF-16's two-byte
one.)
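
The 50% figure is easy to check.  A minimal sketch in Python 3, with sample strings of my own choosing; `encoded_sizes` is a helper name I made up for the illustration:

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

japanese = "日本語のテキスト"   # 8 characters, all in the U+0800..U+FFFF range
ascii_text = "plain text"       # 10 ASCII characters

u8, u16 = encoded_sizes(japanese)
print(u8, u16, u8 / u16)        # 24 16 1.5

u8, u16 = encoded_sizes(ascii_text)
print(u8, u16, u8 / u16)        # 10 20 0.5
```

So the penalty runs the other way for ASCII-heavy text, where UTF-16 doubles the size.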

You could run the same argument about UTF-16 vs UTF-32 as internal
encoding form: memory and memory bandwidth are getting cheap these
days too, although memory is still a more expensive resource than
disk.

But (I assume) the internal encoding form shouldn't matter (as)
much, as it would be hidden from everybody but the Unicode library
implementor.  It boils down to performance, which can be measured.
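
As a rough idea of what "measured" could look like, here is a sketch using Python 3's `timeit`.  Note the assumptions: it times codec round trips in the Python runtime, not the internals of any particular Unicode library, and the sample text, repeat count and the `time_roundtrip` name are all my own, so the numbers only indicate how one might set up such a comparison:

```python
import timeit

# Arbitrary mixed sample: ASCII, Latin-1 letters and Japanese.
sample = "Unicode support: blåbær, 日本語 " * 1000

def time_roundtrip(codec, number=50):
    """Seconds taken by `number` decode+encode round trips of the sample."""
    data = sample.encode(codec)
    return timeit.timeit(lambda: data.decode(codec).encode(codec),
                         number=number)

timings = {codec: time_roundtrip(codec)
           for codec in ("utf-8", "utf-16-le", "utf-32-le")}

for codec, seconds in timings.items():
    size = len(sample.encode(codec))
    print(f"{codec:10} {size:8d} bytes  {seconds:.4f} s")
```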

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants